Text miner’s little helper: scalable self-tuning methodologies for knowledge exploration Evelina Di Corso Cycle XXXI Advisor: Prof. Tania Cerquitelli Dipartimento di Automatica e Informatica Politecnico di Torino ITALY


Post on 24-Jun-2020



Outline of the presentation

• Problem statement and the purpose of the study

• Framework architecture

• Results & discussion

• Other research activities

• Conclusion & Future work


The problem statement and the purpose of the study


Data analytics

Large volumes of heterogeneous data are being collected in
• social networks
• scientific computation and digital libraries
• smart environments

Data analysis is challenging
• It is a multi-step process
• A plethora of analytics algorithms is available
• Algorithm-specific parameters need to be set manually
• Data distributions vary


Knowledge Discovery from Data process - KDD

Case study

• Analysis of textual data collections via unsupervised analysis


Cluster analysis on textual data collections

[Figure: analysis pipeline. Document collection → Document processing (textual data processing) → Term relevance (weighting function) → Topic detection (document clustering and topic modelling) → Result assessment (quality metrics)]

• Manifold suitable data weighting functions
• Different topic modelling algorithms
• Several parameters
• Various quality indices

Analysis guided by a domain expert

Research goal

Design and develop a new generation of data analytics engines based on self-tuning techniques

Research issues
• Automating data mining activities
• Parameter-free algorithms
• Self-assessment strategies

Case study
• Analysis of textual data collections via unsupervised analysis

The framework architecture


The ESCAPE System Architecture

Enhanced Self-tuning Characterisation of document collections After Parameter Evaluation

Document processing and characterisation

• Document processing

• Weighting schemas

• Statistics definition and computation


Document processing and characterisation

• Document processing
  1. Document splitting
  2. Tokenisation
  3. Case normalisation
  4. Stopword removal
  5. Stemming
• Bag-of-Words representation
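The five processing steps and the bag-of-words output can be sketched in a few lines of plain Python. This is only an illustration: the stopword list, the suffix-stripping `naive_stem` helper, and the blank-line document separator are assumptions made for the example, not the components used in ESCAPE.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (assumption; real lists are much longer).
STOPWORDS = {"the", "a", "an", "of", "and", "is", "are", "in", "to"}

def naive_stem(token):
    """Strip a few common English suffixes (crude stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def bag_of_words(document):
    tokens = re.findall(r"[A-Za-z]+", document)          # 2. tokenisation
    tokens = [t.lower() for t in tokens]                 # 3. case normalisation
    tokens = [t for t in tokens if t not in STOPWORDS]   # 4. stopword removal
    stems = [naive_stem(t) for t in tokens]              # 5. stemming
    return Counter(stems)                                # bag-of-words counts

def process_collection(text):
    # 1. document splitting (assumes documents are separated by blank lines)
    documents = [d for d in text.split("\n\n") if d.strip()]
    return [bag_of_words(d) for d in documents]
```

Calling `process_collection` on a raw string returns one term-count dictionary per document, which is the input the weighting step expects.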


Document processing and characterisation

• Weighting schemas
  • Document-Term matrix $X$
  • Local weight $l_{ij}$
  • Global weight $g_j$
  • $X_{ij} = l_{ij} \cdot g_j$

Weighting schemas definition

Local
• TF: $tf_{ij}$
• LogTF: $\log_2(tf_{ij} + 1)$
• Boolean: $\{0, 1\}$

Global
• IDF: $\log \frac{|D|}{df_j}$
• Entropy: $1 + \frac{\sum_i p_{ij} \log p_{ij}}{\log n}$
• TFglob: $tf_j$
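A minimal sketch of how the local and global weights combine into $X_{ij} = l_{ij} \cdot g_j$ is given here. Several details are assumptions, since the slide does not state them: natural logarithms for IDF and Entropy, $p_{ij} = tf_{ij} / \sum_i tf_{ij}$, $n$ equal to the number of documents, and $tf_j$ equal to the total frequency of term $j$ in the collection.

```python
import math

def local_weight(tf, scheme):
    """Local weight l_ij for a raw term frequency tf_ij."""
    if scheme == "TF":
        return float(tf)
    if scheme == "LogTF":
        return math.log2(tf + 1)
    if scheme == "Boolean":
        return 1.0 if tf > 0 else 0.0
    raise ValueError(f"unknown local scheme: {scheme}")

def global_weights(counts, scheme):
    """Global weight g_j for every term (column) of the count matrix."""
    n_docs = len(counts)
    weights = []
    for column in zip(*counts):
        total = sum(column)
        if scheme == "IDF":
            df = sum(1 for tf in column if tf > 0)
            weights.append(math.log(n_docs / df) if df else 0.0)
        elif scheme == "Entropy":
            if total == 0 or n_docs < 2:
                weights.append(0.0)
                continue
            # sum_i p_ij log p_ij, with p_ij = tf_ij / total (assumption)
            h = sum((tf / total) * math.log(tf / total) for tf in column if tf > 0)
            weights.append(1.0 + h / math.log(n_docs))
        elif scheme == "TFglob":
            weights.append(float(total))
        else:
            raise ValueError(f"unknown global scheme: {scheme}")
    return weights

def weight_matrix(counts, local, glob):
    """Weighted document-term matrix X with X_ij = l_ij * g_j."""
    g = global_weights(counts, glob)
    return [[local_weight(tf, local) * g[j] for j, tf in enumerate(row)]
            for row in counts]
```

For example, `weight_matrix(counts, "LogTF", "Entropy")` yields the LogTF-Entropy variant of the same count matrix.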

Document processing and characterisation

• Statistics definition and computation

Basic statistics
• # categories
• # documents
• Avg document length
• # terms
• Dictionary
• Avg term frequency
• Max term frequency
• Min term frequency

Lexical richness indices
• Hapax %
• Type-Token Ratio (TTR)
• Guiraud Index
• Hapax removal (Boolean variable)

Cerquitelli, T.; Di Corso, E.; Ventura, F.; Chiusano, S. (2017) Data miners' little helper: data transformation activity cues for cluster analysis on document collections. In: Proceedings of the 7th International Conference on Web Intelligence, Mining and Semantics.

Cerquitelli, T.; Di Corso, E.; Ventura, F.; Chiusano, S. (2017) Prompting the data transformation activities for cluster analysis on collections of documents. In: 25th Italian Symposium on Advanced Database Systems.
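The statistics above can be computed from tokenised documents as sketched below. The formulas are the common textbook definitions and are assumptions here, since the slides do not spell them out: TTR as distinct terms over total tokens, Hapax % as the share of distinct terms occurring exactly once, and the Guiraud index as distinct terms over the square root of total tokens.

```python
import math
from collections import Counter

def lexical_statistics(documents):
    """documents: list of token lists, one list per document."""
    frequencies = Counter(tok for doc in documents for tok in doc)
    n_tokens = sum(frequencies.values())
    n_terms = len(frequencies)
    hapax = sum(1 for c in frequencies.values() if c == 1)
    return {
        "n_documents": len(documents),
        "n_terms": n_terms,
        "avg_document_length": n_tokens / len(documents),
        "avg_term_frequency": n_tokens / n_terms,
        "max_term_frequency": max(frequencies.values()),
        "min_term_frequency": min(frequencies.values()),
        "hapax_pct": 100.0 * hapax / n_terms,      # assumed definition
        "ttr": n_terms / n_tokens,                 # assumed definition
        "guiraud": n_terms / math.sqrt(n_tokens),  # assumed definition
    }
```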


Self-Tuning Exploratory Data Analytics

Two main methodologies

• Joint-Approach
  • Algebraic model
  • Cluster Analysis
• Probabilistic Model
  • Latent Dirichlet Allocation (LDA)


Joint-Approach

The joint-approach includes two steps:
• A data reduction phase through Latent Semantic Analysis (LSA)
• A data clustering phase to find similar documents or relations between them

ESCAPE includes two innovative strategies to automatically configure the joint-approach

Joint-Approach Self-tuning Reduction phase

The correct choice of the number of dimensions to be considered is a research issue

ESCAPE includes
• Self-Tuning Data Reduction algorithm

[Figure: magnitude of the singular values (x-axis: singular value, y-axis: magnitude)]

Di Corso, Evelina; Cerquitelli, Tania; Ventura, Francesco (2017) Self-tuning techniques for large scale cluster analysis on textual data collections. In: ACM SIGAPP Symposium on Applied Computing

Di Corso, Evelina; Cerquitelli, Tania; Ventura, Francesco (2018) All in a twitter: Self-tuning strategies for a deeper understanding of a crisis tweet collection. In: IEEE International Conference on Big Data

Self-Tuning Data Reduction algorithm

Input parameters:
• Weighted Document-Term matrix $X$
• The maximum number of singular values

Steps:
1. SVD decomposition of $X$
2. The trend of the singular values is analysed in terms of significance
3. The algorithm ends when at least three values for $K_{LSA}$ have been identified

Output:
• Three values for the dimensionality reduction: $K_{LSA}$
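The slide does not say how the significance of the singular-value trend is measured, so the sketch below swaps in a simple relative-drop heuristic. It is not the ESCAPE algorithm. Given singular values sorted in decreasing order, the hypothetical `candidate_ranks` helper returns three cut points placed just before the three largest relative drops.

```python
def candidate_ranks(singular_values, how_many=3):
    """Return `how_many` ranks k, each located just before a large relative drop.

    Assumption: 'significance' is approximated by the relative gap
    (s[k-1] - s[k]) / s[k-1]; keeping the first k values cuts before s[k].
    """
    s = list(singular_values)
    drops = []
    for k in range(1, len(s)):
        previous, current = s[k - 1], s[k]
        relative = (previous - current) / previous if previous > 0 else 0.0
        drops.append((relative, k))
    # Largest drops first; ties resolved in favour of the smaller rank.
    drops.sort(key=lambda item: (-item[0], item[1]))
    return sorted(k for _, k in drops[:how_many])
```

The singular values themselves would come from the SVD of the weighted matrix; only the selection step is illustrated here.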


Joint-Approach Self-tuning Clustering

The K-Means partitional algorithm
• kcls: user-defined parameter

ESCAPE includes
• Self-Tuning Clustering algorithm

Input parameters:
• Low-rank matrix $X_k$
• Range of desired clusters [mincls – maxcls]

Steps:
1. Cluster analysis performed for each kcls ∈ [mincls – maxcls]
2. Partition analysis through Silhouette-based indices
3. Selection of the final optimal solution

Output:
• Final optimal solution


Joint-Approach Self-tuning Clustering

Self-Tuning Clustering algorithm

• The Silhouette index is a quality measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation)
• The Silhouette ranges from -1 to +1
  • a high value indicates that the object is well matched to its own cluster and poorly matched to neighbouring clusters
• For each document i, the silhouette is defined as:

  $s_i = \frac{b_i - a_i}{\max(b_i, a_i)}$

  where:
  • $a_i$ is the average distance between i and the other documents in the same cluster;
  • $b_i$ is the lowest average distance between document i and the documents of each other cluster (not containing document i)
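The definition of $s_i$ can be turned into a small reference implementation. The sketch assumes Euclidean distance (the slides do not name the metric) and assigns 0 to documents in singleton clusters, which is a common convention rather than something stated in the source.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def silhouette_values(points, labels, dist=euclidean):
    """Return s_i = (b_i - a_i) / max(b_i, a_i) for every point."""
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)

    values = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own or len(clusters) < 2:
            values.append(0.0)  # convention for singletons / a single cluster
            continue
        # a_i: average distance to the other members of the same cluster
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b_i: lowest average distance to the members of any other cluster
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denominator = max(a, b)
        values.append((b - a) / denominator if denominator > 0 else 0.0)
    return values
```

This version is quadratic in the number of documents, which is fine for an illustration but not for large collections.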


Joint-Approach Self-tuning Clustering

Three indicators are based on the previous definition:
• The Weighted Silhouette Index (WS):
  • Distribution of silhouette values in positive bins
  • Preference for left-skewed distribution
• The Average Silhouette Index (ASI)
• The Global Silhouette Index (GSI)

Where $C_k$ is the set of documents belonging to cluster k = 1,...,K; $|C_k|$ is the cardinality of cluster $C_k$ (the number of documents belonging to it), and N is the total number of documents.

Cerquitelli, Tania et al. (2018) Clustering-Based Assessment of Residential Consumers from Hourly-Metered Data. In: International Conference on Smart Energy Systems and Technologies

• The higher the index values, the better the clustering partition
• ASI gives an overview of the average silhouette of the entire cluster set
• GSI takes into account the imbalance in the number of elements in each cluster
• Clusters with a large number of documents are penalised in the GSI
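The ASI and GSI formulas did not survive in the transcript. The sketch below assumes the usual definitions, which match the description above: $ASI = \frac{1}{N} \sum_i s_i$ and $GSI = \frac{1}{K} \sum_k \frac{1}{|C_k|} \sum_{i \in C_k} s_i$. Because GSI averages per-cluster means, every cluster counts equally regardless of its size. WS is left out because its binning scheme is not specified.

```python
from collections import defaultdict

def average_silhouette_index(silhouettes):
    """ASI (assumed definition): mean silhouette over all N documents."""
    return sum(silhouettes) / len(silhouettes)

def global_silhouette_index(silhouettes, labels):
    """GSI (assumed definition): mean of the per-cluster mean silhouettes."""
    per_cluster = defaultdict(list)
    for value, label in zip(silhouettes, labels):
        per_cluster[label].append(value)
    cluster_means = [sum(v) / len(v) for v in per_cluster.values()]
    return sum(cluster_means) / len(cluster_means)
```

With one large well-separated cluster and one small poor one, ASI stays high while GSI drops, which illustrates why large clusters carry less weight per document in the GSI.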


Joint-Approach Self-tuning Clustering

A rank score is defined:

$Score = \left(1 - \frac{rank_{GSI}}{maxcls}\right) + \left(1 - \frac{rank_{ASI}}{maxcls}\right) + \left(1 - \frac{rank_{WS}}{maxcls}\right)$

The score lies in the range $[0, 3 - 3/maxcls]$

[Figure: Silhouette-based indices for the Wikipedia dataset with 1000 documents (x-axis: Kcls, y-axis: Silhouette-based indices)]
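The rank score can be sketched as follows. Two assumptions are made: rank 1 goes to the best (highest) value of each index, which is consistent with the stated range, and ties are broken in favour of the smaller kcls. The layout of the `indices` dictionary is hypothetical.

```python
def rank_scores(indices, maxcls):
    """indices: {kcls: {"GSI": float, "ASI": float, "WS": float}}.

    Returns {kcls: Score}, where each index contributes (1 - rank / maxcls)
    and rank 1 is the best value of that index (assumption).
    """
    scores = {k: 0.0 for k in indices}
    for name in ("GSI", "ASI", "WS"):
        ordered = sorted(indices, key=lambda k: (-indices[k][name], k))
        for rank, k in enumerate(ordered, start=1):
            scores[k] += 1.0 - rank / maxcls
    return scores

def best_configuration(indices, maxcls):
    """Pick the kcls with the highest score (smaller kcls wins ties)."""
    scores = rank_scores(indices, maxcls)
    return max(scores, key=lambda k: (scores[k], -k))
```

A configuration ranked first on all three indices reaches the upper bound $3 - 3/maxcls$.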


Probabilistic Model

• Generative probabilistic model
  • Parametric Bayesian Probabilistic Graphical Model
  • Documents as random mixtures over latent topics
  • Topics are characterised by a distribution over words
  • LDA estimates both at the same time
• Approach: inferring hidden structure using posterior inference
  • Discovering topics in the collection using Bayesian inference
• Assumptions:
  • Assume topics exist outside of the document collection
  • Each topic is a distribution over a fixed vocabulary
  • Each word is drawn from one of those topics
  • Each document is a random mixture of corpus-wide topics
• Issue: the number of topics must be set a priori

Probabilistic Model

• Generative probabilistic model
  • the maximum number of iterations (max iter = 100)
  • the optimiser (Online Variational Bayes) [2]
  • the document concentration (α = 50/K) [1]
  • the topic concentration (β = 0.1) [1]

[1] Thomas Griffiths and Mark Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences.
[2] David Blei, Andrew Ng, and Michael Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research.
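As an illustration, the settings listed above can be collected in a dictionary whose keys follow the keyword arguments of `pyspark.ml.clustering.LDA`. The `lda_settings` helper is hypothetical, and mapping the slide's parameters onto these keyword names is an assumption about how they would be passed, not a record of the ESCAPE code.

```python
def lda_settings(k):
    """Keyword arguments for an LDA run with K = k topics (illustrative)."""
    return {
        "k": k,
        "maxIter": 100,                       # max iter = 100
        "optimizer": "online",                # Online Variational Bayes
        "docConcentration": [50.0 / k] * k,   # alpha = 50 / K, one entry per topic
        "topicConcentration": 0.1,            # beta = 0.1
    }

# Possible usage, assuming PySpark is installed and `features_df` holds a
# "features" vector column (both are assumptions of this sketch):
#   from pyspark.ml.clustering import LDA
#   model = LDA(**lda_settings(20)).fit(features_df)
```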


Probabilistic Model - Self-tuning LDA

ESCAPE proposes a novel iterative strategy: ToPIC-Similarity

Input parameters:
• LDA topic-term distribution
• Range of desired clusters [mincls – maxcls]

Steps:
1. Topic characterisation
2. Similarity computation
3. K identification

Output:
• Three good configurations

Proto, Stefano; Di Corso, Evelina; Ventura, Francesco; Cerquitelli, Tania (2018) Useful ToPIC: Self-tuning strategies to enhance Latent Dirichlet Allocation. In: IEEE International Big Data Congress

Di Corso, Evelina; Proto, Stefano; Cerquitelli, Tania; Chiusano, Silvia (2018) Towards automated visualisation of scientific literature. In: European Conference on Advances in Databases and Information Systems (In Press)


Probabilistic Model - Self-tuning LDA

• The proposed method is iterative, and comprises three main steps:
  1. Topic characterisation: selects the n most representative words for each topic
     • Based on the indices of the dataset, given

       $Q = \frac{|V| \cdot TTR}{AvgFreq}, \qquad n = \begin{cases} \dfrac{Q}{K}, & \text{if } Q \ge K \cdot AvgFreq \\ AvgFreq, & \text{if } Q < K \cdot AvgFreq \end{cases}$

  2. Similarity computation
  3. K identification
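The piecewise rule for n translates directly into code. Rounding the result to an integer word count, with a floor of 1, is an assumption of this sketch; the slide only gives the real-valued formula.

```python
def representative_word_count(vocab_size, ttr, avg_freq, k):
    """Number n of representative words per topic for K = k topics.

    Q = |V| * TTR / AvgFreq; n = Q / K if Q >= K * AvgFreq, else AvgFreq.
    """
    q = vocab_size * ttr / avg_freq
    n = q / k if q >= k * avg_freq else avg_freq
    return max(1, round(n))  # rounding is an assumption of this sketch
```

For a vocabulary of 10000 terms with TTR = 0.2 and AvgFreq = 5, Q is 400, so n is 40 for K = 10 and falls back to AvgFreq = 5 for K = 100.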

Probabilistic Model - Self-tuning LDA

2. Similarity computation: computes the semantic similarity among the obtained topics
   • Cosine similarity
   • Norm of the similarity matrix
   • Mean with respect to the number of topics
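One possible reading of these three bullets is sketched below: build the pairwise cosine-similarity matrix of the topic vectors, take its norm, and divide by the number of topics. The choice of the Frobenius norm and the representation of each topic as a vector of word weights are assumptions, since the slides do not fix them.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def topic_similarity(topics):
    """topics: list of K equal-length word-weight vectors, one per topic.

    Returns the Frobenius norm of the K x K cosine-similarity matrix
    divided by K (both choices are assumptions of this sketch).
    """
    k = len(topics)
    similarity = [[cosine(a, b) for b in topics] for a in topics]
    frobenius = math.sqrt(sum(x * x for row in similarity for x in row))
    return frobenius / k
```

Under this reading the value is 1 when all topics coincide and decreases as topics become more distinct.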

Probabilistic Model - Self-tuning LDA

3. K identification: identifies optimal values for K, using a trade-off approach
   • The K values identified are the first three values that
     (i) belong to a decreasing segment of the curve and
     (ii) are local minima
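The selection rule can be sketched as a scan over the similarity curve. The hypothetical `pick_k` helper treats a point as a candidate when it is lower than its predecessor (decreasing segment) and not higher than its successor (local minimum). Ignoring the first and last points is an assumption, since they lack one neighbour.

```python
def pick_k(curve, how_many=3):
    """curve: {K: similarity value}. Returns the first `how_many` local minima
    that are reached through a decreasing segment, in increasing order of K."""
    ks = sorted(curve)
    chosen = []
    for previous_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        decreasing = curve[k] < curve[previous_k]
        local_minimum = curve[k] <= curve[next_k]
        if decreasing and local_minimum:
            chosen.append(k)
            if len(chosen) == how_many:
                break
    return chosen
```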

Knowledge validation and visualisation

• Knowledge validation
  • Objective metrics
• Knowledge visualisation
  • Visualisation techniques


Knowledge validation - Objective metrics

For each approach, different quality metrics have been integrated

• Joint-Approach:
  • The Weighted Silhouette Index (WS)
  • The Average Silhouette Index (ASI)
  • The Global Silhouette Index (GSI)
• Probabilistic Modelling:
  • The Weighted Silhouette Index (WS)
  • The Perplexity index (Perplexity)
  • The Entropy index (H)


Knowledge visualisation - Visualisation techniques

ESCAPE provides two forms of human-readable knowledge:

1. document-topic distribution
2. topic-term distribution

Di Corso, Evelina; Proto, Stefano; Cerquitelli, Tania; Chiusano, Silvia (2018) Towards automated visualisation of scientific literature. In: European Conference on Advances in Databases and Information Systems (In Press)


Knowledge visualisation - Visualisation techniques

1. document-topic distribution: characterises the document distribution over the topics
   • topic cohesion/separation in terms of document distribution
     • t-Distributed Stochastic Neighbour Embedding (t-SNE)

[Figure: t-SNE representation]

Knowledge visualisation - Visualisation techniques

1. document-topic distribution
   • coarse-grained versus fine-grained groups through the analysis of the impact of the different weighting schemas
     • correlation matrices

[Figure: Correlation matrix]


Knowledge visualisation - Visualisation techniques

2. topic-term distribution: describes the distribution over words for each topic
   • topic-term distribution through the analysis of the top-k relevant words
     • Word-clouds

[Figure: Word-cloud representation]

For the Joint-Approach:
• Relevant terms are extracted using the FP-Growth algorithm

For the Probabilistic model:
• Relevant terms are extracted by the topic-term probability of the LDA model
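For the probabilistic model, picking the top-k relevant words amounts to sorting one row of the topic-term distribution. The sketch below assumes that row is available as a plain list of probabilities aligned with a vocabulary list; both names are illustrative.

```python
def top_k_terms(topic_term_row, vocabulary, k=10):
    """Return the k (term, probability) pairs with the highest probability."""
    ranked = sorted(zip(vocabulary, topic_term_row),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:k]
```

The resulting pairs can be passed to a word-cloud renderer, with the probability used as the word size.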

Knowledge visualisation - Visualisation techniques

2. topic-term distribution
   • topic cohesion/separation in terms of relevant words
     • graph representation

[Figure: Graph representation]

Experimental results

ESCAPE architecture

The current implementation of ESCAPE runs on Apache Spark:

• It is a project developed in Python
• It exploits the PySpark scalable machine learning libraries ML and MLlib
• Specifically, the algorithms included are:
   • pyspark.mllib.linalg.distributed for SVD
   • pyspark.ml.clustering for K-Means and LDA
• All experiments have been performed on the BigData@PoliTO cluster

Experimental settings – Default configurations

• Joint-Approach:
   • the number of dimensions to be considered during the SVD data reduction phase (krid = 20% of the matrix rank)
   • the number of clusters (topics) into which to divide the collection under analysis (Kmax = average term frequency)
• Probabilistic modelling:
   • the number of clusters (topics) into which to divide the collection under analysis (Kmax = average term frequency)
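The two defaults above can be sketched as a small helper. This is an illustration only: the function name, the integer truncation and the example matrix rank of 500 are assumptions, while the term and dictionary counts are the D1 values reported in the statistics table of this deck.

```python
def default_search_limits(matrix_rank, n_terms, vocabulary_size):
    """Default limits: krid = 20% of the matrix rank, Kmax = average term frequency.

    Both values are truncated to integers (an assumption made for this sketch).
    """
    k_rid = matrix_rank * 20 // 100        # 20% of the rank, integer arithmetic
    k_max = n_terms // vocabulary_size     # integer part of the average term frequency
    return k_rid, k_max


# matrix_rank=500 is a hypothetical value; the other two numbers are D1 (with hapax).
limits = default_search_limits(matrix_rank=500, n_terms=843_967, vocabulary_size=33_635)
```

For D1 the resulting Kmax is 25, which agrees with the average frequency listed for that dataset.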

Experimental datasets

ESCAPE has been tested over different real datasets:

ID   Dataset source   Textual data type
D1   Wikipedia        Documents
D2   Wikipedia        Documents
D3   Wikipedia        Documents
D4   Twitter          Short messages
D5   PubMed           Articles
D6   PubMed           Abstracts
D7   Reuters          Documents

Statistics definition and computation

Features              Wikipedia (D1)       Twitter (D4)         PubMed (D5)
                      WH        WoH        WH        WoH        WH          WoH
# categories          5                    6                    -
# documents           989                  60,005               1,000
Max frequency         5,394                6,936                775
Min frequency         1         2          1         2          1           2
Avg frequency         25        45         19        36         15          18
Avg document length   852       836        5         5          3,600       3,469
# terms               843,967   828,372    312,718   304,666    3,600,153   3,469,305
Dictionary |V|        33,635    18,040     16,345    12,136     227,210     96,362
TTR                   0.04      0.03       0.05      0.03       0.06        0.05
Hapax (%)             46.3      0.0        49.26     0.0        57.02       0.0
Guiraud Index         36.61     19.82      29.23     15.02      119.75      51.73

WH: with hapax. WoH: without hapax.
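A minimal sketch of how such statistics can be computed from a tokenised corpus. It assumes the usual definitions TTR = |V| / #terms and Guiraud index = |V| / sqrt(#terms), which reproduce the D1 values with hapax (33,635 / 843,967 ≈ 0.04 and 33,635 / sqrt(843,967) ≈ 36.61), and it takes the hapax percentage relative to the dictionary size. The function name and the toy corpus are illustrative.

```python
import math
from collections import Counter


def corpus_statistics(documents):
    """Lexical statistics for a list of tokenised documents."""
    frequencies = Counter(token for doc in documents for token in doc)
    n_terms = sum(frequencies.values())      # total number of tokens (# terms)
    vocabulary = len(frequencies)            # dictionary size |V|
    hapax = sum(1 for f in frequencies.values() if f == 1)
    return {
        "n_documents": len(documents),
        "max_frequency": max(frequencies.values()),
        "min_frequency": min(frequencies.values()),
        "avg_frequency": n_terms / vocabulary,
        "avg_document_length": n_terms / len(documents),
        "n_terms": n_terms,
        "dictionary": vocabulary,
        "ttr": vocabulary / n_terms,                  # type-token ratio
        "hapax_pct": 100 * hapax / vocabulary,        # share of terms seen exactly once
        "guiraud": vocabulary / math.sqrt(n_terms),   # Guiraud index
    }


# Toy corpus, made up for the example.
docs = [["topic", "model", "topic"], ["cluster", "model"], ["topic", "svd", "svd"]]
stats = corpus_statistics(docs)
```

Removing the hapax legomena before calling the function would give the "without hapax" variant of each column.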

Term relevance

• In ESCAPE, the following weighting schemas have been integrated:

Weighting schema   Definition

Local
  TF               $tf_{ij}$
  LogTF            $\log_2(tf_{ij} + 1)$
  Boolean          $\{0, 1\}$

Global
  IDF              $\log \frac{|D|}{df_j}$
  Entropy          $1 + \frac{\sum_i p_{ij} \log p_{ij}}{\log n}$
  TFglob           $tf_j$
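The definitions on this slide can be sketched in plain Python. The slide does not define p_ij or n, so the sketch assumes the common log-entropy convention p_ij = tf_ij / tf_j with n the number of documents, and it uses natural logarithms for IDF and Entropy; the function names and the toy corpus are illustrative.

```python
import math
from collections import Counter


def local_weights(tf):
    """Local weights for a raw term frequency tf_ij (tf >= 0)."""
    return {
        "TF": tf,
        "LogTF": math.log2(tf + 1),
        "Boolean": 1 if tf > 0 else 0,
    }


def global_weights(docs):
    """Global weights per term for a list of tokenised documents."""
    n = len(docs)                            # number of documents |D|
    counts = [Counter(d) for d in docs]
    vocab = set().union(*docs)
    weights = {}
    for term in vocab:
        tfs = [c[term] for c in counts if c[term] > 0]
        gf = sum(tfs)                        # total occurrences of the term (tf_j)
        df = len(tfs)                        # documents containing the term (df_j)
        # Assumption: p_ij = tf_ij / tf_j, and n is the number of documents.
        ent = sum((tf / gf) * math.log(tf / gf) for tf in tfs)
        weights[term] = {
            "IDF": math.log(n / df),
            "Entropy": (1 + ent / math.log(n)) if n > 1 else 1.0,
            "TFglob": gf,
        }
    return weights


# Toy corpus, made up for the example.
docs = [["a", "b"], ["a", "c"], ["a", "c", "c"]]
w = global_weights(docs)
```

Under these assumptions a term spread evenly over all documents ("a") gets an Entropy weight of 0, while a term confined to one document ("b") gets 1. A combined schema such as TF-IDF is the product of one local and one global weight.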

Experimental results

• Quality of ESCAPE solutions for each methodology:
   • Top-k configurations
   • Weighting schemas impact on the corpus
   • Comparison with the state-of-the-art approaches
• Comparison among ESCAPE methodologies
   • Adjusted Rand Index (ARI)
   • Quantitative metrics
   • Qualitative visualisations

Preliminary results – Joint Approach

D1: top-3 solutions for each weighting schema (execution time is reported once per weighting schema)

Dataset  Weight         K-LSA  K-Clustering  GSI    ASI    WS     Execution time
D1       TF-IDF         26     7             0.383  0.358  0.408  22m, 20s
                        41     10            0.419  0.339  0.391
                        67     10            0.361  0.297  0.352
         TF-Entropy     29     11            0.334  0.35   0.401  26m, 18s
                        42     10            0.368  0.331  0.382
                        62     8             0.364  0.274  0.326
         LogTF-IDF      19     5             0.437  0.431  0.48   25m, 23s
                        22     5             0.35   0.343  0.393
                        67     4             0.225  0.201  0.251
         LogTF-Entropy  10     6             0.44   0.453  0.5    27m, 12s
                        24     5             0.323  0.318  0.367
                        67     7             0.268  0.218  0.267
         Bool-IDF       8      5             0.445  0.444  0.494  25m, 33s
                        22     6             0.293  0.312  0.365
                        65     6             0.226  0.233  0.286
         Bool-Entropy   9      5             0.447  0.444  0.495  28m, 38s
                        23     5             0.354  0.348  0.4
                        65     4             0.28   0.234  0.285

D4 and D5: best solution for each weighting schema

Dataset  Weight         K-LSA  K-Clustering  GSI    ASI    WS     Execution time
D4       Bool-IDF       6      6             0.465  0.422  0.737  50m, 29s
         Bool-Entropy   13     7             0.342  0.32   0.532  1h, 10m, 33s
D5       TF-IDF         14     5             0.352  0.284  0.333  1h, 37m, 19s
         TF-Entropy     15     10            0.377  0.28   0.332  1h, 39m, 34s
         LogTF-IDF      15     5             0.397  0.312  0.362  1h, 43m, 15s
         LogTF-Entropy  16     5             0.384  0.287  0.336  1h, 47m, 34s
         Bool-IDF       16     4             0.315  0.347  0.395  1h, 46m, 42s
         Bool-Entropy   16     4             0.328  0.336  0.385  1h, 48m, 45s

Top-k solution

• Joint-Approach
   • Data reduction parameter
   • Number of clusters

[Figure: magnitude of the singular values (axis labels: Singular Value, Magnitude)]

Top-k solution

• Joint-Approach
   • Data reduction parameter
   • Number of clusters
      • Rank Function Score

$$\text{Score} = \left(1 - \frac{rank_{GSI}}{max_{cls}}\right) + \left(1 - \frac{rank_{ASI}}{max_{cls}}\right) + \left(1 - \frac{rank_{WS}}{max_{cls}}\right)$$

K   GSI    ASI    WS     rank_GSI  rank_ASI  rank_WS  Score  Rank-Solution
2   0.21   0.239  0.29   19        18        18       0.105  18
3   0.294  0.244  0.296  16        17        17       0.368  17
4   0.255  0.237  0.29   18        19        19       0.053  19
5   0.332  0.315  0.37   9         4         4        2.105  4
6   0.307  0.256  0.309  14        16        16       0.579  16
7   0.383  0.354  0.405  1         2         2        2.737  2
8   0.345  0.315  0.365  4         5         6        2.211  3
9   0.329  0.301  0.352  11        11        11       1.263  11
10  0.383  0.357  0.409  2         1         1        2.789  1
11  0.29   0.295  0.347  17        12        12       0.842  14
12  0.34   0.312  0.365  5         7         5        2.105  4
13  0.336  0.306  0.358  7         10        10       1.579  9
14  0.32   0.322  0.376  13        3         3        2      6
15  0.333  0.314  0.364  8         6         7        1.895  7
16  0.336  0.311  0.363  6         8         9        1.789  8
17  0.322  0.311  0.364  12        9         8        1.474  10
18  0.371  0.281  0.336  3         15        15       1.263  11
19  0.33   0.284  0.337  10        14        14       1      13
20  0.306  0.285  0.338  15        13        13       0.842  15
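A minimal sketch of the rank function score on this slide. The helper names are illustrative. The denominator that reproduces the tabulated scores is 19, the number of candidate values of K (2 to 20). Breaking ties by position is an assumption of the sketch, since the rounded indices in the table do not show how ties were resolved.

```python
def rank_desc(values):
    """Rank values so that the largest gets rank 1 (ties broken by position)."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


def rank_score(rank_gsi, rank_asi, rank_ws, max_cls):
    """Sum of (1 - rank / max_cls) over the three silhouette-based indices."""
    return sum(1 - r / max_cls for r in (rank_gsi, rank_asi, rank_ws))


# Ranks taken from the table rows for K = 2 and K = 10.
score_k2 = round(rank_score(19, 18, 18, max_cls=19), 3)
score_k10 = round(rank_score(2, 1, 1, max_cls=19), 3)
```

The configuration with the highest score (K = 10 in the table) is ranked first.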

Top-k solution

• Joint-Approach
   • Data reduction parameter
   • Number of clusters

[Figure: silhouette-based indices plotted against Kcls]

Weight impact

[Figures: D1 correlation matrices and D1 t-SNE representation for the TF-IDF and LogTF-Entropy weighting schemas, shown by Original Category and by ESCAPE Label]

Comparison with the state-of-the-art

State-of-the-art methodology: the Elbow method
A good trade-off for the number of clusters is found where the slope of the SSE curve changes from steep to shallow (an elbow).

[Figures: D1 SSE trend (axis labels: K, SSE); D1 Silhouette index for each document]
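The elbow is usually read off the SSE plot by eye. One simple way to automate the reading, shown here as an assumption rather than as the procedure used in the comparison, is to pick the K with the largest second difference of the SSE curve. The sketch assumes evenly spaced values of K, and the SSE values are made up.

```python
def elbow_k(sse_by_k):
    """Pick the K where the SSE curve bends most (largest second difference)."""
    ks = sorted(sse_by_k)
    if len(ks) < 3:
        raise ValueError("need at least three values of K")
    best_k, best_bend = None, float("-inf")
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        # Drop in SSE before K minus drop after K: large when steep turns shallow.
        bend = (sse_by_k[prev_k] - sse_by_k[k]) - (sse_by_k[k] - sse_by_k[next_k])
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k


# Hypothetical SSE trend: steep until K = 4, shallow afterwards.
sse = {2: 100.0, 3: 60.0, 4: 30.0, 5: 25.0, 6: 22.0}
chosen_k = elbow_k(sse)
```

On a smooth SSE curve the largest bend can be weak or ambiguous, which is one reason a single elbow may be hard to identify.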

Preliminary results – Probabilistic modelling

D1: top-3 solutions for each weighting schema (execution time is reported once per weighting schema)

Dataset  Weight         K   Perplexity  Silhouette  Entropy  Execution time
D1       TF-IDF         3   8.812       0.772       0.256    40m, 24s
                        6   8.597       0.693       0.363
                        10  8.482       0.682       0.395
         TF-Entropy     5   9.072       0.762       0.282    30m, 32s
                        8   9.248       0.632       0.338
                        9   9.267       0.631       0.339
         LogTF-IDF      8   9.187       0.675       0.320    40m, 17s
                        17  9.126       0.637       0.362
         LogTF-Entropy  5   9.912       0.891       0.100    30m, 54s
                        7   9.884       0.846       0.174
                        11  9.979       0.951       0.108
         Bool-TF        4   6.492       0.697       0.421    44m, 43s
                        5   6.464       0.661       0.483
                        17  6.420       0.381       1.090

D4 and D5: best solution for each weighting schema

Dataset  Weight         K   Perplexity  Silhouette  Entropy  Execution time
D4       Bool-TF        6   2.808       0.546       0.613    1h, 34m, 31s
D5       TF-IDF         14  7.662       0.085       1.902    1h, 50m, 27s
         TF-Entropy     4   8.556       0.081       1.782    1h, 54m, 25s
         LogTF-IDF      14  7.776       0.094       1.754    2h, 14m, 41s
         LogTF-Entropy  4   8.622       0.08        1.743    2h, 17m, 25s
         Bool-TF        10  5.22        0.101       1.318    2h, 20m, 13s

Top-k solution

• Probabilistic Modelling
   • Number of clusters

[Figure: ToPIC index]

Weight impact

[Figures: D1 scatter plot of the document-topic distribution and D1 t-SNE visualisation for the TF-IDF and LogTF-Entropy weighting schemas]

Comparison with the state-of-the-art

State-of-the-art methods to estimate the best value for the number of topics K:

• Rate of Perplexity Change (RPC) [1]
   • Based on the statistical perplexity
• Entropy optimised Latent Dirichlet Allocation (En-LDA) [2]
   • Based on the entropy of the model

[1] Weizhong Zhao et al. A heuristic approach to determine an appropriate number of topics in topic modeling.
[2] Wen Zhang et al. En-lda: An novel approach to automatic bug report assignment with entropy optimised Latent Dirichlet Allocation.

Comparison with the state-of-the-art

Number of topics K selected on D1 by each method (figure panel titles):

• RPC: K = 3
• En-LDA: K = 19
• ESCAPE: K = 10

D1 comparison between methodologies

D1 joint-approach cluster set cardinality

               Cluster ID
Weight         0    1    2    3    4    5    6   7   8   9   Total
TF-IDF         215  176  159  139  99   93   49  25  19  15  989
TF-Entropy     228  167  166  135  106  75   54  27  16  15  989
LogTF-IDF      225  212  191  183  178                       989
LogTF-Entropy  223  191  184  183  105  103                  989
Bool-IDF       236  223  191  181  158                       989
Bool-Entropy   230  223  192  177  167                       989

D1 probabilistic modelling cluster set cardinality

               Cluster ID
Weight         0    1    2    3    4    5    6   7   8   9   Total
TF-IDF         205  193  187  180  144  21   19  14  13  13  989
TF-Entropy     428  236  197  113  15                        989
LogTF-IDF      464  406  91   8    7    5    5   3           989
LogTF-Entropy  827  160  1    1    0                         989
Bool-TF        230  215  194  188  162                       989

D1 comparison between partitions

To compare the different solutions, ESCAPE includes:

• Adjusted Rand Index (ARI) metric
   • The ARI equals 1 for identical partitions and is close to 0 (possibly negative) when the agreement is no better than chance
• In ESCAPE, the ARI index is used to compare:
   1. the impact of the different weighting schemas, given a methodology
   2. the best partitioning between the different methodologies

ARI by weighting schema:

Dataset  TF-IDF  LogTF-IDF  TF-Entropy  LogTF-Entropy  Boolean
D1       0.554   0.321      0.320       0.100          0.790
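A minimal, dependency-free sketch of the Adjusted Rand Index between two partitions of the same documents, using the standard pair-counting formula over the contingency table. The function names and the toy label vectors are illustrative; this is not the ESCAPE code.

```python
from collections import Counter


def pairs(n):
    """Number of unordered pairs among n items."""
    return n * (n - 1) // 2


def adjusted_rand_index(labels_a, labels_b):
    """ARI between two cluster labellings of the same documents."""
    if len(labels_a) != len(labels_b):
        raise ValueError("partitions must label the same documents")
    contingency = Counter(zip(labels_a, labels_b))
    sum_cells = sum(pairs(c) for c in contingency.values())
    sum_a = sum(pairs(c) for c in Counter(labels_a).values())
    sum_b = sum(pairs(c) for c in Counter(labels_b).values())
    total = pairs(len(labels_a))
    if total == 0:
        return 1.0
    expected = sum_a * sum_b / total       # agreement expected by chance
    maximum = (sum_a + sum_b) / 2
    if maximum == expected:
        # Degenerate case, e.g. both partitions consist of a single cluster.
        return 1.0
    return (sum_cells - expected) / (maximum - expected)


same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])      # same partition, relabelled
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])   # partitions that disagree
```

The first call returns 1 because cluster names do not matter. The second returns a value below 0, which shows that the index can become negative when two partitions agree less than expected by chance.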

D1 topic comparison

Cluster ID   Joint-Approach   Probabilistic Modelling
Cluster 0    Literature       Music
Cluster 1    Food             Maths
Cluster 2    Music            Oil
Cluster 3    Maths            Literature
Cluster 4    Sport            Sport
Cluster 5    Sport            Dynamic Sport
Cluster 6    Graph            Music
Cluster 7    Music            Quidditch
Cluster 8    Literature       Literature
Cluster 9    Oil              Musical Instruments

Five original categories: cooking, literature, mathematics, music and sport

D1 topic comparison

[Figure: panels labelled Cluster 5, Cluster 5, Cluster 4 and Cluster 4]

D1 topic comparison

[Figure: panels labelled Cluster 6, Cluster 3 and Cluster 1]

Lessons learnt

Several lessons have been learnt from these analyses:

• No strategy is universally superior
• The joint-approach is able to find homogeneous partitions in terms of documents for each cluster
   • the local weight LogTF tends to find a small number of clusters
   • the global weight IDF is able to create more clusters, also finding sub-topics related to the major categories
• The probabilistic model finds heterogeneous partitions in terms of documents for each cluster
   • for certain weighting schemas the documents are well separated
   • the Entropy-based weighting schemas lead to very poor results

Publication list

1. Di Corso, Evelina; Cerquitelli, Tania; Ventura, Francesco. Self-tuning techniques for large scale cluster analysis on textual data collections, Proceedings of the Symposium on Applied Computing, 2017, ACM

2. Di Corso, Evelina; Ventura, Francesco; Cerquitelli, Tania. All in a twitter: Self-tuning strategies for a deeper understanding of a crisis tweet collection, 2017 IEEE International Conference on Big Data (Big Data), 6, 2017, IEEE

3. Proto, Stefano; Di Corso, Evelina; Ventura, Francesco; Cerquitelli, Tania. Useful ToPIC: Self-Tuning Strategies to Enhance Latent Dirichlet Allocation, 2018 IEEE International Congress on Big Data, 2018, IEEE

4. Cerquitelli, Tania; Di Corso, Evelina; Ventura, Francesco; Chiusano, Silvia. Data miners' little helper: data transformation activity cues for cluster analysis on document collections, Proceedings of the 7th International Conference on Web Intelligence, Mining and Semantics, 2017, ACM

5. Di Corso, Evelina; Proto, Stefano; Cerquitelli, Tania; Chiusano, Silvia. Towards automated visualisation of scientific literature. European Conference on Advances in Databases and Information Systems, 2019, Springer. In press

6. Di Corso, Evelina; Cerquitelli, Tania. Democratising data science on corpora: automated knowledge extraction and visualisation at ease. ACM Celebration of Women in Computing womENcourage 2019. In press

7. Cerquitelli, Tania; Di Corso, Evelina; Ventura, Francesco; Chiusano, Silvia. Discussion Paper: Prompting the data transformation activities for cluster analysis on collections of documents.

8. Di Corso, Evelina. Supporting decision making with self-learning methodologies (PhD poster, December 2018, Politecnico di Torino, Italy)

9. Baralis, Elena Maria; Cerquitelli, Tania; Chiusano, Silvia Anna; Di Corso, Evelina. Towards Self-Learning Data Transformation, 2016.

Other research activities

Other research activities

Participation, as a data scientist, in research projects with several companies in the data science area:

• Research contract with ENEL Foundation and ENDESA Energia
   • Cerquitelli, Tania; Chicco, Gianfranco; Di Corso, Evelina; Ventura, Francesco; Montesano, Giuseppe; Del Pizzo, Anita; González, Alicia Mateo; Sobrino, Eduardo Martin. Discovering electricity consumption over time for residential consumers through cluster analysis, 2018 International Conference on Development and Application Systems
   • Cerquitelli, Tania; Chicco, Gianfranco; Di Corso, Evelina; Ventura, Francesco; Montesano, Giuseppe; Armiento, Mirko; González, Alicia Mateo; Santiago, Andrea Veiga. Clustering-Based Assessment of Residential Consumers from Hourly-Metered Data, 2018 International Conference on Smart Energy Systems and Technologies - Best Paper Award

• Research contract with Edison SPA
   • Cerquitelli, Tania; Di Corso, Evelina; Proto, Stefano; Capozzoli, Alfonso; Bellotti, Fabio; Cassese, Maria G; Baralis, Elena; Mellia, Marco; Casagrande, Silvia; Tamburini, Martina. Exploring energy performance certificates through visualization, 2019.
   • Cerquitelli, Tania; Di Corso, Evelina; Proto, Stefano; Capozzoli, Alfonso; Bellotti, Fabio; Cassese, Maria G; Baralis, Elena; Mellia, Marco; Casagrande, Silvia; Tamburini, Martina. Visualising high-resolution energy maps through the exploratory analysis of energy performance certificates, International Conference on Smart Energy Systems and Technologies (In press)

• Research contract with Zirak Srl - Information Technology

Page 106: Text miner’s little helper: scalable self-tuning ... DI CORSO_presentation.pdf · Data analytics Large volumes of heterogeneous data are being collected in •social networks •scientific

Other research activities

Participating to different research projects in the data science area asdata scientist, collaboration with:

• the Interuniversity Department of Regional and Urban Studies and Planning• Daraio, Elena; Di Corso, Evelina; Cerquitelli, Tania; Chiusano, Silvia. Characterizing Air-Quality Data Through

Unsupervised Analytics Methods,European Conference on Advances in Databases and Information Systems,2018, Springer.

• Di Corso, Evelina; Proto, Stefano; Cerquitelli, Tania; Chiusano, Silvia (2018) Towards automated visualisation ofscientific literature. In: European Conference on Advances in Databases and Information Systems (In Press)

• the Department of Energy

• Di Corso, Evelina; Cerquitelli, Tania; Piscitelli, Marco Savino; Capozzoli, Alfonso. Exploring energy certificates of buildings through unsupervised data mining techniques, 2017 IEEE International Conference on Internet of Things (iThings) and IEEE Green Computing and Communications (GreenCom) and IEEE Cyber, Physical and Social Computing (CPSCom) and IEEE Smart Data (SmartData), 2017, IEEE.


The ESCAPE System Architecture

Enhanced Self-tuning Characterisation of document collections After Parameter Evaluation


The METATECH architecture

METeorological data Analysis for Thermal Energy CHaracterization

Di Corso, Evelina; Cerquitelli, Tania; Apiletti, Daniele (2018) METATECH: METeorological Data Analysis for Thermal Energy CHaracterization by Means of Self-Learning Transparent Models


Knowledge visualization based on dynamic dashboard

[Figure: scatter plot of the SVD visualization (left) and the energy consumption levels (right), with panels for Cluster 1 and Cluster 2]


Knowledge visualization based on dynamic dashboard: Cluster 1

Conclusion & future work


Conclusion

• In these three years, I have been able to design and develop a new framework, named ESCAPE (Enhanced Self-tuning Characterisation of document collections After Parameter Evaluation), able to support the analyst during all the phases of the analysis process tailored to textual data.

• ESCAPE includes three main building blocks to streamline the analytics process and to derive high-quality information in terms of well-separated and cohesive groups of documents characterising the main topics in a given corpus.
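
As a rough illustration of what "well-separated" and "cohesive" groups can mean in practice, the sketch below computes the mean silhouette coefficient of a toy clustering in plain Python. The silhouette is used here only as one common cohesion/separation measure; that it resembles the quality indices evaluated inside ESCAPE is an assumption. The function names and the two-dimensional "document" vectors are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_silhouette(points, labels):
    """Mean silhouette coefficient over all points (value in [-1, 1]).

    Needs at least two clusters; a point alone in its cluster scores 0.
    """
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster (cohesion)
        a = sum(euclidean(points[i], points[j]) for j in own) / len(own)
        # b: mean distance to the nearest other cluster (separation)
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical reduced document vectors: two tight, distant groups.
docs = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
good = mean_silhouette(docs, [0, 0, 1, 1])  # partition matching the groups
bad = mean_silhouette(docs, [0, 1, 0, 1])   # partition mixing the groups
```

A partition that matches the two groups scores close to 1, while one that mixes them scores below 0, which is the kind of signal a self-tuning loop could use to compare candidate configurations.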


Future research directions

Several directions remain to be analysed and explored. Specifically, we are currently working on including:

1. New data analytics algorithms to exploit other interesting models:
• other algebraic data reduction algorithms
• autoencoder-based data reduction algorithms
• non-parametric models (such as Deep Neural Networks (DNNs) and K-NNs)
• more weighting functions and statistical features

2. A semantic component able to support the analyst in two phases:
• pre-processing phase, to eliminate semantically bound words
• post-processing phase, to represent subtopics of the same macro category and to add a hierarchy level for each word of the dictionary to support other analytics tasks

3. A knowledge base: to store all the results of the experiments to efficiently support self-tuning methodologies

4. A self-learning methodology: based on a classification algorithm trained on the knowledge base content to forecast the best methods for future analyses.

5. Integrating in ESCAPE the analysis of other types of data.
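
Directions 3 and 4 can be pictured with a minimal sketch: a small knowledge base of past experiments and a nearest-neighbour classifier that suggests a configuration for a new corpus. Everything in it is an assumption made for illustration: the corpus descriptors (`n_docs`, `avg_len`, `vocab`), the configuration labels, the stored records, and the choice of k-NN as the classification algorithm are hypothetical and not taken from ESCAPE.

```python
import math
from collections import Counter

# Hypothetical descriptors stored for each analysed corpus.
FEATURES = ("n_docs", "avg_len", "vocab")

# Hypothetical knowledge base: (corpus descriptors, best configuration found).
KNOWLEDGE_BASE = [
    ({"n_docs": 1_000, "avg_len": 900.0, "vocab": 40_000}, "algebraic_reduction+kmeans"),
    ({"n_docs": 1_200, "avg_len": 850.0, "vocab": 45_000}, "algebraic_reduction+kmeans"),
    ({"n_docs": 900, "avg_len": 1_000.0, "vocab": 38_000}, "algebraic_reduction+kmeans"),
    ({"n_docs": 50_000, "avg_len": 20.0, "vocab": 15_000}, "probabilistic_topic_model"),
    ({"n_docs": 80_000, "avg_len": 25.0, "vocab": 20_000}, "probabilistic_topic_model"),
    ({"n_docs": 30_000, "avg_len": 15.0, "vocab": 12_000}, "probabilistic_topic_model"),
]


def feature_bounds(records):
    """Per-feature (min, max), used to put features on a comparable scale."""
    return {f: (min(r[f] for r in records), max(r[f] for r in records))
            for f in FEATURES}


def normalise(record, bounds):
    """Min-max scale a record; constant features map to 0."""
    scaled = []
    for f in FEATURES:
        lo, hi = bounds[f]
        scaled.append((record[f] - lo) / (hi - lo) if hi > lo else 0.0)
    return scaled


def recommend(new_corpus, knowledge_base, k=3):
    """Majority vote among the k most similar corpora in the knowledge base."""
    bounds = feature_bounds([desc for desc, _ in knowledge_base])
    target = normalise(new_corpus, bounds)
    ranked = sorted(
        knowledge_base,
        key=lambda entry: math.dist(target, normalise(entry[0], bounds)),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


long_docs = {"n_docs": 1_100, "avg_len": 880.0, "vocab": 42_000}
short_docs = {"n_docs": 60_000, "avg_len": 18.0, "vocab": 12_000}
suggestion_long = recommend(long_docs, KNOWLEDGE_BASE)
suggestion_short = recommend(short_docs, KNOWLEDGE_BASE)
```

With these toy records, a corpus of a few long documents is matched to the first group of experiments and a large corpus of short texts to the second. A real knowledge base would hold many more experiments and richer descriptors.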


Evelina Di Corso

Dipartimento di Automatica e Informatica (DAUIN)

Politecnico di Torino, ITALY

Corso Duca degli Abruzzi, 24 - 10129 Torino

[email protected]


Publication list

1. Cerquitelli, Tania; Di Corso, Evelina. Characterizing Thermal Energy Consumption through Exploratory Data Mining Algorithms. EDBT/ICDT Workshops, 2016.

2. Di Corso, Evelina; Cerquitelli, Tania; Ventura, Francesco. Self-tuning techniques for large scale cluster analysis on textual data collections. Proceedings of the Symposium on Applied Computing, 2017, ACM.

3. Di Corso, Evelina; Cerquitelli, Tania; Piscitelli, Marco Savino; Capozzoli, Alfonso. Exploring energy certificates of buildings through unsupervised data mining techniques. 2017 IEEE International Conference on Internet of Things (iThings) and IEEE Green Computing and Communications (GreenCom) and IEEE Cyber, Physical and Social Computing (CPSCom) and IEEE Smart Data (SmartData), 2017, IEEE.

4. Cerquitelli, Tania; Di Corso, Evelina; Ventura, Francesco; Chiusano, Silvia. Data miners' little helper: data transformation activity cues for cluster analysis on document collections. Proceedings of the 7th International Conference on Web Intelligence, Mining and Semantics, 2017, ACM.

5. Baralis, Elena Maria; Cerquitelli, Tania; Chiusano, Silvia Anna; Di Corso, Evelina. Towards Self-Learning Data Transformation. 2016.

6. Cerquitelli, Tania; Di Corso, Evelina; Ventura, Francesco; Chiusano, Silvia. Discussion Paper Prompting the data transformation activities for cluster analysis on collections of documents.

7. Venturini, Luca; Di Corso, Evelina. Analyzing spatial data from twitter during a disaster. 2017 IEEE International Conference on Big Data (Big Data), 2017, IEEE.

8. Di Corso, Evelina; Ventura, Francesco; Cerquitelli, Tania. All in a twitter: Self-tuning strategies for a deeper understanding of a crisis tweet collection. 2017 IEEE International Conference on Big Data (Big Data), 6, 2017, IEEE.

9. Di Corso, Evelina; Cerquitelli, Tania; Apiletti, Daniele. Metatech: meteorological data analysis for thermal energy characterization by means of self-learning transparent models. Energies, 11, 6, 1336, 2018, Multidisciplinary Digital Publishing Institute.

10. Cerquitelli, Tania; Chicco, Gianfranco; Di Corso, Evelina; Ventura, Francesco; Montesano, Giuseppe; Del Pizzo, Anita; González, Alicia Mateo; Sobrino, Eduardo Martin. Discovering electricity consumption over time for residential consumers through cluster analysis. 2018 International Conference on Development and Application Systems (DAS), 164-169, 2018, IEEE.

11. Daraio, Elena; Di Corso, Evelina; Cerquitelli, Tania; Chiusano, Silvia. Characterizing Air-Quality Data Through Unsupervised Analytics Methods. European Conference on Advances in Databases and Information Systems, 205-217, 2018, Springer.

12. Proto, Stefano; Di Corso, Evelina; Ventura, Francesco; Cerquitelli, Tania. Useful ToPIC: Self-Tuning Strategies to Enhance Latent Dirichlet Allocation. 2018 IEEE International Congress on Big Data, 2018, IEEE.

13. Cerquitelli, Tania; Chicco, Gianfranco; Di Corso, Evelina; Ventura, Francesco; Montesano, Giuseppe; Armiento, Mirko; González, Alicia Mateo; Santiago, Andrea Veiga. Clustering-Based Assessment of Residential Consumers from Hourly-Metered Data. 2018 International Conference on Smart Energy Systems and Technologies, 2018, IEEE.

14. Di Corso, Evelina. Supporting decision making with self-learning methodologies.

15. Cerquitelli, Tania; Di Corso, Evelina; Proto, Stefano; Capozzoli, Alfonso; Bellotti, Fabio; Cassese, Maria G; Baralis, Elena; Mellia, Marco; Casagrande, Silvia; Tamburini, Martina. Exploring energy performance certificates through visualization. 2019.

16. Di Corso, Evelina; Proto, Stefano; Cerquitelli, Tania; Chiusano, Silvia. Towards automated visualisation of scientific literature. European Conference on Advances in Databases and Information Systems, 2019, Springer. In press.

17. Di Corso, Evelina; Cerquitelli, Tania. Democratising data science on corpora: automated knowledge extraction and visualisation at ease. ACM Celebration of Women in Computing womENcourage 2019. In press.

18. Cerquitelli, Tania; Di Corso, Evelina; Proto, Stefano; Capozzoli, Alfonso; Bellotti, Fabio; Cassese, Maria G; Baralis, Elena; Mellia, Marco; Casagrande, Silvia; Tamburini, Martina. Visualising high-resolution energy maps through the exploratory analysis of energy performance certificates. International Conference on Smart Energy Systems and Technologies. In press.