TRANSCRIPT
Text miner's little helper: scalable self-tuning methodologies for knowledge exploration
Evelina Di Corso, Cycle XXXI
Advisor: Prof. Tania Cerquitelli
Dipartimento di Automatica e Informatica, Politecnico di Torino, ITALY
Outline of the presentation
• Problem statement and the purpose of the study
• Framework architecture
• Results & discussion
• Other research activities
• Conclusion & Future work
The problem statement and
the purpose of the study
Data analytics
Large volumes of heterogeneous data are being collected in
• social networks
• scientific computation and digital libraries
• smart environments
Data analysis is challenging
• It is a multi-step process
• A plethora of analytics algorithms is available
• Different algorithm-specific parameters need to be manually set
• Variable data distribution
Knowledge Discovery from Data process - KDD
Case study
• Analysis of textual data collections via unsupervised analysis
Cluster analysis on textual data collections
The analysis pipeline: document collection → document processing (textual data processing) → term relevance (weighting function) → topic detection (document clustering and topic modelling) → result assessment (quality metrics)
Each step raises design choices:
• Manifold suitable data weighting functions
• Different topic modelling algorithms
• Several parameters
• Various quality indices
• Analysis guided by a domain expert
Research goal
Design and develop a new generation of data analytics engines based on self-tuning techniques
Research issues
• Automating data mining activities
• Parameter-free algorithms
• Self-assessment strategies
Case study
• Analysis of textual data collections via unsupervised analysis
The framework architecture
The ESCAPE System Architecture
Enhanced Self-tuning Characterisation of document collections After Parameter Evaluation
Document processing and characterisation
• Document processing
• Weighting schemas
• Statistics definition and computation
Document processing and characterisation
• Document processing
  1. Document splitting
  2. Tokenisation
  3. Case normalisation
  4. Stopword removal
  5. Stemming
• Bag-of-Words representation
Document processing and characterisation
• Weighting schemas
  • Document-term matrix X
  • Local weight l_ij
  • Global weight g_j
  • X_ij = l_ij * g_j

Weighting Schemas Definition
Local:
  TF = tf_ij
  LogTF = log2(tf_ij + 1)
  Boolean = {0, 1}
Global:
  IDF = log(|D| / df_j)
  Entropy = 1 + (Σ_i p_ij log p_ij) / log n
  TFglob = tf_j
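To make the schema definitions concrete, here is a minimal NumPy sketch of how a weighted matrix X_ij = l_ij · g_j can be built from raw term counts; the function name and the toy matrix are illustrative, not part of ESCAPE.

```python
import numpy as np

def weight_matrix(tf, local="logtf", glob="idf"):
    """Build X_ij = l_ij * g_j from a documents-by-terms count matrix tf (ndarray)."""
    n_docs = tf.shape[0]
    # Local weight l_ij
    if local == "tf":
        l = tf.astype(float)
    elif local == "logtf":
        l = np.log2(tf + 1.0)
    else:                                          # "boolean"
        l = (tf > 0).astype(float)
    # Global weight g_j
    if glob == "idf":
        df = np.maximum((tf > 0).sum(axis=0), 1)   # document frequency df_j
        g = np.log(n_docs / df)
    elif glob == "entropy":
        col = np.maximum(tf.sum(axis=0), 1)        # corpus frequency tf_j
        p = tf / col                               # p_ij = tf_ij / tf_j
        g = 1.0 + (p * np.log(np.where(p > 0, p, 1.0))).sum(axis=0) / np.log(n_docs)
    else:                                          # "tfglob"
        g = tf.sum(axis=0).astype(float)
    return l * g                                   # broadcast g_j over the term columns

# e.g. LogTF-Entropy weighting of a toy 2-document, 3-term corpus
X = weight_matrix(np.array([[3, 0, 1], [1, 2, 0]]), local="logtf", glob="entropy")
```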
Document processing and characterisation
• Statistics definition and computation
  • Basic statistics: # categories, # documents, avg document length, # terms, dictionary, avg/max/min term frequency
  • Lexical richness indices: Hapax %, Type-Token Ratio (TTR), Guiraud Index, Hapax-removal Boolean variable
Cerquitelli, T.; Di Corso, E.; Ventura, F.; Chiusano, S. (2017) Data miners' little helper: data transformation activity cues for cluster analysis on document collections. In: Proceedings of the 7th International Conference on Web Intelligence, Mining and Semantics.
Cerquitelli, T.; Di Corso, E.; Ventura, F.; Chiusano, S. (2017) Prompting the data transformation activities for cluster analysis on collections of documents. In: 25th Italian Symposium on Advanced Database Systems.
Self-Tuning Exploratory Data Analytics
Two main methodologies
• Joint-Approach
  • Algebraic model
  • Cluster analysis
• Probabilistic Model
  • Latent Dirichlet Allocation (LDA)
Joint-Approach
The joint-approach includes two steps:
• A data reduction phase through Latent Semantic Analysis (LSA)
• A data clustering phase to find similar documents or relations between them
ESCAPE includes two innovative strategies to automatically configure the joint-approach
Joint-Approach Self-tuning Reduction phase
The correct choice of the number of dimensions to be considered is a research issue
ESCAPE includes a Self-Tuning Data Reduction algorithm
Input parameters:
• Weighted document-term matrix X
• The maximum number of singular values
Steps:
1. SVD decomposition on X
2. The trend of the singular values is analysed in terms of significance
3. The algorithm ends when at least three values for K_LSA have been identified
Output:
• Three values for the dimensionality reduction: K_LSA
[Figure: singular value magnitude vs. number of singular values]
Di Corso, Evelina; Cerquitelli, Tania; Ventura, Francesco (2017) Self-tuning techniques for large scale cluster analysis on textual data collections. In: ACM SIGAPP Symposium on Applied Computing.
Di Corso, Evelina; Cerquitelli, Tania; Ventura, Francesco (2018) All in a twitter: Self-tuning strategies for a deeper understanding of a crisis tweet collection. In: IEEE International Conference on Big Data.
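A hedged sketch of the reduction step described above; the cumulative-energy thresholds used to mark "significant" singular values are illustrative stand-ins for the thesis' actual significance criterion.

```python
import numpy as np

def self_tuning_reduction(X, max_k=100, energy_levels=(0.6, 0.75, 0.9)):
    """Illustrative sketch: derive candidate K_LSA values from the singular-value trend."""
    # 1. SVD decomposition on the weighted document-term matrix X
    s = np.linalg.svd(X, compute_uv=False)[:max_k]
    # 2. Analyse the singular-value trend: cumulative explained "magnitude"
    energy = np.cumsum(s ** 2) / np.sum(s ** 2)
    # 3. Return (at least) three K_LSA candidates, one per energy level
    return [int(np.searchsorted(energy, level) + 1) for level in energy_levels]
```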
Joint-Approach Self-tuning Clustering
The K-Means partitional algorithm
• k_cls: user-defined parameter
ESCAPE includes a Self-Tuning Clustering algorithm
Input parameters:
• Low-rank matrix X_k
• Range of desired clusters [min_cls – max_cls]
Steps:
1. Cluster analysis performed for each k_cls ∈ [min_cls – max_cls]
2. Partition analysis through Silhouette-based indices
3. Selection of the final optimal solution
Output:
• Final optimal solution
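A minimal Python sketch of this clustering loop, assuming scikit-learn K-Means and scoring each partition with the plain average silhouette only (ESCAPE combines the WS, ASI and GSI indices introduced next).

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def self_tuning_clustering(Xk, min_cls=2, max_cls=20):
    """Cluster the low-rank matrix X_k for every candidate k and keep the best partition."""
    results = {}
    for k in range(min_cls, max_cls + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(Xk)
        results[k] = (labels, silhouette_score(Xk, labels))
    best_k = max(results, key=lambda k: results[k][1])   # simplified selection rule
    return best_k, results[best_k][0]
```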
Joint-Approach Self-tuning Clustering
Self-Tuning Clustering algorithm
• The Silhouette index is a quality measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation)
• The Silhouette ranges from -1 to +1
  • a high value indicates that the object is well matched to its own cluster and poorly matched to neighbouring clusters
• For each document i, the silhouette is defined as:
  s_i = (b_i − a_i) / max(a_i, b_i)
  where:
  • a_i is the average distance between i and the other documents in the same cluster;
  • b_i is the lowest average distance between document i and each one of the other clusters (not containing document i)
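The definition above translates directly into code; a small sketch using pairwise distances (equivalent in spirit to scikit-learn's silhouette_samples), with illustrative names.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

def silhouette_per_document(X, labels):
    """Direct transcription of the definition: s_i = (b_i - a_i) / max(a_i, b_i)."""
    labels = np.asarray(labels)
    D = pairwise_distances(X)
    s = np.zeros(len(labels))
    for i, c in enumerate(labels):
        same = (labels == c) & (np.arange(len(labels)) != i)
        a_i = D[i, same].mean() if same.any() else 0.0           # cohesion
        b_i = min(D[i, labels == other].mean()                   # separation
                  for other in set(labels) if other != c)
        s[i] = (b_i - a_i) / max(a_i, b_i) if max(a_i, b_i) > 0 else 0.0
    return s
```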
Joint-Approach Self-tuning Clustering
Three indicators are based on the previous definition:
• The Weighted Silhouette Index (WS)
  • Distribution of silhouette values in positive bins
  • Preference for left-skewed distributions
• The Average Silhouette Index (ASI)
• The Global Silhouette Index (GSI)
where C_k is the set of documents belonging to cluster k = 1,...,K; |C_k| is the cardinality of cluster C_k (documents belonging to cluster C_k), and N is the total number of documents.
• The higher the index values, the better the clustering partition
  • ASI gives an overview of the average silhouette of the entire cluster set
  • GSI takes into account the imbalanced number of elements in each cluster
  • Clusters with a large number of documents are penalised in the GSI
Cerquitelli, Tania et al. (2018) Clustering-Based Assessment of Residential Consumers from Hourly-Metered Data. In: International Conference on Smart Energy Systems and Technologies.
Joint-Approach Self-tuning Clustering
A rank score is defined:
Score = (1 − rank_GSI/max_cls) + (1 − rank_ASI/max_cls) + (1 − rank_WS/max_cls)
The score lies in the range [0, 3 − 3/max_cls]
[Figure: Silhouette-based indices vs. K_cls for the Wikipedia dataset with 1,000 documents]
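A short sketch of this rank score, assuming that for every candidate k the three indices have already been computed and that rank 1 is assigned to the highest index value.

```python
import numpy as np

def rank_score(gsi, asi, ws, max_cls):
    """Combine the three silhouette-based indices via their ranks (rank 1 = best)."""
    def ranks(values):
        order = np.argsort(-np.asarray(values))          # highest value first
        r = np.empty_like(order)
        r[order] = np.arange(1, len(values) + 1)
        return r
    r_gsi, r_asi, r_ws = ranks(gsi), ranks(asi), ranks(ws)
    return (1 - r_gsi / max_cls) + (1 - r_asi / max_cls) + (1 - r_ws / max_cls)

# Best possible score when a k ranks first on all three indices: 3 - 3/max_cls
```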
Probabilistic Model
• Generative probabilistic model
  • Parametric Bayesian probabilistic graphical model
  • Documents as random mixtures over latent topics
  • Topics are characterised by a distribution over words
  • LDA estimates both at the same time
  • Default configuration: the maximum number of iterations (max iter = 100), the optimiser (Online Variational Bayes) [2], the document concentration (α = 50/K) [1], the topic concentration (β = 0.1) [1]
• Approach: inferring hidden structure using posterior inference
  • Discovering topics in the collection using Bayesian inference
• Assumptions:
  • Topics exist outside of the document collection
  • Each topic is a distribution over a fixed vocabulary
  • Each word is drawn from one of those topics
  • Each document is a random mixture of corpus-wide topics
• Issue: the number of topics must be set a priori
[1] Thomas Griffiths and Mark Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences.
[2] David Blei, Andrew Ng, and Michael Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research.
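For reference, the default configuration quoted above maps onto the pyspark.ml LDA estimator roughly as follows; the featuresCol name and the fit call are assumptions about how the weighted document-term matrix is fed in.

```python
from pyspark.ml.clustering import LDA

def build_lda(k):
    """LDA configured with the defaults quoted on the slide (assumed pyspark.ml mapping)."""
    return LDA(k=k,
               maxIter=100,                    # max iter = 100
               optimizer="online",             # Online Variational Bayes
               docConcentration=[50.0 / k],    # alpha = 50/K
               topicConcentration=0.1,         # beta = 0.1
               featuresCol="features")

lda = build_lda(k=10)   # lda.fit(docs_df) on a DataFrame with a 'features' column
```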
Probabilistic Model - Self-tuning LDA
ESCAPE proposes a novel iterative strategy: ToPIC-Similarity
Input parameters:
• LDA topic-term distribution
• Range of desired clusters [min_cls – max_cls]
Steps:
1. Topic characterisation
2. Similarity computation
3. K identification
Output:
• Three good configurations
Proto, Stefano; Di Corso, Evelina; Ventura, Francesco; Cerquitelli, Tania (2018) Useful ToPIC: Self-tuning strategies to enhance Latent Dirichlet Allocation. In: IEEE International Big Data Congress.
Di Corso, Evelina; Proto, Stefano; Cerquitelli, Tania; Chiusano, Silvia (2018) Towards automated visualisation of scientific literature. In: European Conference on Advances in Databases and Information Systems (In Press).
Probabilistic Model - Self-tuning LDA
• The proposed method is iterative, and comprises three main steps:
1. Topic characterisation: selects the n most representative words for each topic
   • Based on the statistical indices of the dataset:
     Q = |V| · TTR / AvgFreq
     n = Q/K, if Q ≥ K · AvgFreq
     n = AvgFreq, if Q < K · AvgFreq
2. Similarity computation
3. K identification
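The rule above translates into a few lines of Python; vocab_size, ttr and avg_freq are the dataset statistics computed earlier, and the rounding to an integer word count is an assumption.

```python
def words_per_topic(vocab_size, ttr, avg_freq, k):
    """Number n of representative words per topic, following the slide's rule."""
    q = vocab_size * ttr / avg_freq
    n = q / k if q >= k * avg_freq else avg_freq
    return max(1, int(round(n)))    # keep at least one word per topic (assumption)
```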
Probabilistic Model - Self-tuning LDA
• The proposed method is iterative, and comprises three main steps:
1. Topic characterisation: selects the n most representative words for each topic
2. Similarity computation: computes the semantic similarity among the obtained topics
   • Cosine similarity
   • Norm of the similarity matrix
   • Mean with respect to the number of topics
3. K identification
Probabilistic Model - Self-tuning LDA
• The proposed method is iterative, and comprises three main steps:
1. Topic characterisation: selects the n most representative words for each topic
2. Similarity computation: computes the semantic similarity among the obtained topics
3. K identification: identifies optimal values for K, using a trade-off approach
   • The K values identified are the first three values that
     (i) belong to a decreasing segment of the curve and
     (ii) are local minima
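A speculative end-to-end sketch assembled from the three steps above; zeroing the diagonal of the similarity matrix and the exact local-minimum test are assumptions about details the slides do not spell out.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def topic_similarity(topic_term, n_words):
    """One iteration: keep the n most probable words per topic, then summarise
    how similar the topics are (lower values suggest better-separated topics)."""
    top = np.argsort(-topic_term, axis=1)[:, :n_words]
    reduced = np.zeros_like(topic_term)
    for t, idx in enumerate(top):
        reduced[t, idx] = topic_term[t, idx]
    sim = cosine_similarity(reduced)
    np.fill_diagonal(sim, 0.0)                       # ignore self-similarity (assumption)
    return np.linalg.norm(sim) / topic_term.shape[0] # matrix norm, averaged over K

def pick_k(scores):
    """Candidate K values: local minima lying on a decreasing segment of the curve."""
    ks = sorted(scores)
    out = [k for i, k in enumerate(ks[1:-1], 1)
           if scores[ks[i]] < scores[ks[i - 1]] and scores[ks[i]] <= scores[ks[i + 1]]]
    return out[:3]
```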
Knowledge validation and visualisation
• Knowledge validation
  • Objective metrics
• Knowledge visualisation
  • Visualisation techniques
Knowledge validation - Objective metrics
For each approach, different quality metrics have been integrated
• Joint-Approach:
  • The Weighted Silhouette index (WS)
  • The Average Silhouette Index (ASI)
  • The Global Silhouette Index (GSI)
• Probabilistic Modelling:
  • The Weighted Silhouette index (WS)
  • The Perplexity index (Perplexity)
  • The Entropy index (H)
Knowledge visualisation - Visualisation techniques
ESCAPE provides two forms of human-readable knowledge:
1. document-topic distribution: characterises the document distribution over the topics
   • topic cohesion/separation in terms of document distribution
   • coarse-grained versus fine-grained groups through the analysis of the impact of the different weighting schemas
2. topic-term distribution
Di Corso, Evelina; Proto, Stefano; Cerquitelli, Tania; Chiusano, Silvia (2018) Towards automated visualisation of scientific literature. In: European Conference on Advances in Databases and Information Systems (In Press).
Knowledge visualisation - Visualisation techniques
1. document-topic distribution: characterises the document distribution over the topics
   • topic cohesion/separation in terms of document distribution, visualised through t-Distributed Stochastic Neighbour Embedding (t-SNE)
   • coarse-grained versus fine-grained groups through the analysis of the impact of the different weighting schemas
[Figure: t-SNE representation]
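A minimal scikit-learn sketch of this t-SNE view, assuming the document-topic matrix and the cluster labels have already been produced by one of the two methodologies.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(doc_topic, labels):
    """Project the document-topic distribution to 2D and colour points by cluster."""
    xy = TSNE(n_components=2, random_state=0).fit_transform(doc_topic)
    plt.scatter(xy[:, 0], xy[:, 1], c=labels, s=10, cmap="tab10")
    plt.title("t-SNE of the document-topic distribution")
    plt.show()
```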
Knowledge visualisation - Visualisation techniques
1. document-topic distribution: characterises the document distribution over the topics
   • coarse-grained versus fine-grained groups through the analysis of the impact of the different weighting schemas, visualised through correlation matrices
[Figure: correlation matrix]
Knowledge visualisation - Visualisation techniques
2. topic-term distribution: describes the distribution over words for each topic
   • topic-term distribution through the analysis of the top-k relevant words, visualised through word clouds
   • topic cohesion/separation in terms of relevant words
[Figure: word-cloud representation]
Knowledge visualisation - Visualisation techniques
2. topic-term distribution: analysis of the top-k relevant words
   • For the Joint-Approach: relevant terms are extracted using the FP-Growth algorithm
   • For the Probabilistic model: relevant terms are extracted from the topic-term probability of the LDA model
[Figure: word-cloud representation]
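A tentative PySpark sketch of the FP-Growth step for one cluster; deduplicating terms per document and the minimum support value are illustrative choices, not ESCAPE's actual settings.

```python
from pyspark.ml.fpm import FPGrowth

def frequent_terms_per_cluster(spark, docs_in_cluster, min_support=0.1):
    """Mine terms that frequently co-occur inside one cluster; the top itemsets
    can then feed the cluster's word cloud."""
    # FP-Growth requires unique items per transaction, hence set(doc)
    df = spark.createDataFrame([(list(set(doc)),) for doc in docs_in_cluster], ["items"])
    model = FPGrowth(itemsCol="items", minSupport=min_support).fit(df)
    return model.freqItemsets.orderBy("freq", ascending=False)
```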
Knowledge visualisation - Visualisation techniques
2. topic-term distribution: topic cohesion/separation in terms of relevant words, visualised through a graph representation
[Figure: graph representation]
Experimental results
The current implementation of ESCAPE runs on Apache Spark
• It is a project developed in Python
• It exploits the PySpark scalable machine learning libraries ML and MLlib
• Specifically, the algorithms included are:
  • pyspark.mllib.linalg.distributed for SVD
  • pyspark.ml.clustering for K-Means and LDA
• All experiments have been performed on the BigData@PoliTO cluster
[Figure: ESCAPE architecture]
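Putting the pieces together, a hedged sketch of how these PySpark components fit; weighted_rows (an RDD of mllib vectors) and docs_df (a DataFrame with a 'features' column) are assumed to come from the preprocessing stage, and the parameter values are placeholders.

```python
from pyspark.mllib.linalg.distributed import RowMatrix
from pyspark.ml.clustering import KMeans, LDA

def reduce_and_cluster(weighted_rows, docs_df, k_lsa=100, k_cls=10):
    """Distributed SVD plus the two modelling back-ends named on the slide."""
    # Data reduction: SVD on the weighted document-term matrix
    svd = RowMatrix(weighted_rows).computeSVD(k_lsa, computeU=True)
    # Joint-approach clustering and probabilistic topic modelling
    kmeans_model = KMeans(k=k_cls, seed=42).fit(docs_df)
    lda_model = LDA(k=k_cls, maxIter=100, optimizer="online").fit(docs_df)
    return svd, kmeans_model, lda_model
```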
Experimental settings – Default configurations
• Joint-Approach:
  • the number of dimensions to be considered during the SVD data reduction phase (k_rid = 20% of the matrix rank)
  • the number of clusters (topics) in which to divide the collection under analysis (K_max = average term frequency)
• Probabilistic modelling:
  • the number of clusters (topics) in which to divide the collection under analysis (K_max = average term frequency)
ESCAPE has been tested over different real datasets
Experimental datasets
ID   Dataset source   Textual data type
D1   Wikipedia        Documents
D2   Wikipedia        Documents
D3   Wikipedia        Documents
D4   Twitter          Short messages
D5   PubMed           Articles
D6   PubMed           Abstracts
D7   Reuters          Documents
Statistics definition and computation

Features                Wikipedia (D1)        Twitter (D4)          PubMed (D5)
                        WH         WoH        WH         WoH        WH           WoH
# categories            5                     6                     -
# documents             989                   60,005                1,000
Max frequency           5,394                 6,936                 775
Min frequency           1          2          1          2          1            2
Avg frequency           25         45         19         36         15           18
Avg document length     852        836        5          5          3,600        3,469
# terms                 843,967    828,372    312,718    304,666    3,600,153    3,469,305
Dictionary |V|          33,635     18,040     16,345     12,136     227,210      96,362
TTR                     0.04       0.03       0.05       0.03       0.06         0.05
Hapax (%)               46.3       0.0        49.26      0.0        57.02        0.0
Guiraud Index           36.61      19.82      29.23      15.02      119.75       51.73

WH: with Hapax; WoH: without Hapax
Term relevance
• In ESCAPE, the following weighting schemas have been integrated:
Weighting Schemas Definition
Local:
  TF = tf_ij
  LogTF = log2(tf_ij + 1)
  Boolean = {0, 1}
Global:
  IDF = log(|D| / df_j)
  Entropy = 1 + (Σ_i p_ij log p_ij) / log n
  TFglob = tf_j
Experimental results
• Quality of ESCAPE solutions for each methodology:
  • Top-k configurations
  • Weighting schema impact on the corpus
  • Comparison with state-of-the-art approaches
• Comparison among ESCAPE methodologies
  • Adjusted Rand Index (ARI)
  • Quantitative metrics
  • Qualitative visualisations
Preliminary results – Joint Approach

D1 top-3 solutions for each weighting schema
Weight          K-LSA  K-Clustering  GSI    ASI    WS     Execution time
TF-IDF          26     7             0.383  0.358  0.408  22m, 20s
                41     10            0.419  0.339  0.391
                67     10            0.361  0.297  0.352
TF-Entropy      29     11            0.334  0.35   0.401  26m, 18s
                42     10            0.368  0.331  0.382
                62     8             0.364  0.274  0.326
LogTF-IDF       19     5             0.437  0.431  0.48   25m, 23s
                22     5             0.35   0.343  0.393
                67     4             0.225  0.201  0.251
LogTF-Entropy   10     6             0.44   0.453  0.5    27m, 12s
                24     5             0.323  0.318  0.367
                67     7             0.268  0.218  0.267
Bool-IDF        8      5             0.445  0.444  0.494  25m, 33s
                22     6             0.293  0.312  0.365
                65     6             0.226  0.233  0.286
Bool-Entropy    9      5             0.447  0.444  0.495  28m, 38s
                23     5             0.354  0.348  0.4
                65     4             0.28   0.234  0.285

D4 and D5 best solution for each weighting schema
Dataset  Weight         K-LSA  K-Clustering  GSI    ASI    WS     Execution time
D4       Bool-IDF       6      6             0.465  0.422  0.737  50m, 29s
D4       Bool-Entropy   13     7             0.342  0.32   0.532  1h, 10m, 33s
D5       TF-IDF         14     5             0.352  0.284  0.333  1h, 37m, 19s
D5       TF-Entropy     15     10            0.377  0.28   0.332  1h, 39m, 34s
D5       LogTF-IDF      15     5             0.397  0.312  0.362  1h, 43m, 15s
D5       LogTF-Entropy  16     5             0.384  0.287  0.336  1h, 47m, 34s
D5       Bool-IDF       16     4             0.315  0.347  0.395  1h, 46m, 42s
D5       Bool-Entropy   16     4             0.328  0.336  0.385  1h, 48m, 45s
Top-k solution
• Joint-Approach
  • Data reduction parameter
  • Number of clusters
[Figure: singular value magnitude vs. number of singular values]
Top-k solution
• Joint-Approach
  • Data reduction parameter
  • Number of clusters
  • Rank Function Score
Score = (1 − rank_GSI/max_cls) + (1 − rank_ASI/max_cls) + (1 − rank_WS/max_cls)

K   GSI    ASI    WS     rank_GSI  rank_ASI  rank_WS  Score  Rank-Solution
2   0.21   0.239  0.29   19        18        18       0.105  18
3   0.294  0.244  0.296  16        17        17       0.368  17
4   0.255  0.237  0.29   18        19        19       0.053  19
5   0.332  0.315  0.37   9         4         4        2.105  4
6   0.307  0.256  0.309  14        16        16       0.579  16
7   0.383  0.354  0.405  1         2         2        2.737  2
8   0.345  0.315  0.365  4         5         6        2.211  3
9   0.329  0.301  0.352  11        11        11       1.263  11
10  0.383  0.357  0.409  2         1         1        2.789  1
11  0.29   0.295  0.347  17        12        12       0.842  14
12  0.34   0.312  0.365  5         7         5        2.105  4
13  0.336  0.306  0.358  7         10        10       1.579  9
14  0.32   0.322  0.376  13        3         3        2      6
15  0.333  0.314  0.364  8         6         7        1.895  7
16  0.336  0.311  0.363  6         8         9        1.789  8
17  0.322  0.311  0.364  12        9         8        1.474  10
18  0.371  0.281  0.336  3         15        15       1.263  11
19  0.33   0.284  0.337  10        14        14       1      13
20  0.306  0.285  0.338  15        13        13       0.842  15

[Figure: Silhouette-based indices vs. K_cls]
Weight impact
[Figure: D1 correlation matrices (original category vs. ESCAPE label) for TF-IDF and LogTF-Entropy]
[Figure: D1 t-SNE representation]
Comparison with the state-of-the-art
State-of-the-art methodology: the Elbow method
The good trade-off for determining the optimal number of clusters corresponds to the change of slope from steep to shallow (an elbow)
[Figure: D1 SSE trend vs. K; D1 Silhouette index for each document]
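For reference, a small sketch of how the SSE curve behind the elbow method can be produced with scikit-learn; inertia_ is the within-cluster sum of squared errors.

```python
from sklearn.cluster import KMeans

def sse_curve(X, max_k=20):
    """SSE (inertia) for each k; the elbow is where the slope turns from steep to shallow."""
    return {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(2, max_k + 1)}
```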
Preliminary results – Probabilistic modelling

D1 top-3 solutions for each weighting schema
Weight          K   Perplexity  Silhouette  Entropy  Execution time
TF-IDF          3   8.812       0.772       0.256    40m, 24s
                6   8.597       0.693       0.363
                10  8.482       0.682       0.395
TF-Entropy      5   9.072       0.762       0.282    30m, 32s
                8   9.248       0.632       0.338
                9   9.267       0.631       0.339
LogTF-IDF       8   9.187       0.675       0.320    40m, 17s
                17  9.126       0.637       0.362
LogTF-Entropy   5   9.912       0.891       0.100    30m, 54s
                7   9.884       0.846       0.174
                11  9.979       0.951       0.108
Boolean-TF      4   6.492       0.697       0.421    44m, 43s
                5   6.464       0.661       0.483
                17  6.420       0.381       1.090

D4 and D5 best solution for each weighting schema
Dataset  Weight         K   Perplexity  Silhouette  Entropy  Execution time
D4       Bool-TF        6   2.808       0.546       0.613    1h, 34m, 31s
D5       TF-IDF         14  7.662       0.085       1.902    1h, 50m, 27s
D5       TF-Entropy     4   8.556       0.081       1.782    1h, 54m, 25s
D5       LogTF-IDF      14  7.776       0.094       1.754    2h, 14m, 41s
D5       LogTF-Entropy  4   8.622       0.08        1.743    2h, 17m, 25s
D5       Bool-TF        10  5.22        0.101       1.318    2h, 20m, 13s
Top-k solution
• Probabilistic Modelling
  • Number of clusters
[Figure: ToPIC index vs. K]
Weight impact
[Figure: D1 scatter plot of the document-topic distribution for TF-IDF and LogTF-Entropy]
[Figure: D1 t-SNE visualisation]
Comparison with the state-of-the-art
State-of-the-art approaches to estimate the best value for the number of topics K:
• Rate of Perplexity Change (RPC) [1]
  • Based on the statistical perplexity
• Entropy-optimised Latent Dirichlet Allocation (En-LDA) [2]
  • Based on the Entropy of the model
Results on D1: RPC – K=3; En-LDA – K=19; ESCAPE – K=10
[1] Weizhong Zhao et al. A heuristic approach to determine an appropriate number of topics in topic modeling.
[2] Wen Zhang et al. En-LDA: An novel approach to automatic bug report assignment with entropy optimised Latent Dirichlet Allocation.
D1 comparison between methodologies

D1 joint-approach cluster set cardinality (Total = 989 documents per row)
Weight          Cl.0  Cl.1  Cl.2  Cl.3  Cl.4  Cl.5  Cl.6  Cl.7  Cl.8  Cl.9
TF-IDF          215   176   159   139   99    93    49    25    19    15
TF-Entropy      228   167   166   135   106   75    54    27    16    15
LogTF-IDF       225   212   191   183   178   -     -     -     -     -
LogTF-Entropy   223   191   184   183   105   103   -     -     -     -
Bool-IDF        236   223   191   181   158   -     -     -     -     -
Bool-Entropy    230   223   192   177   167   -     -     -     -     -

D1 probabilistic modelling cluster set cardinality (Total = 989 documents per row)
Weight          Cl.0  Cl.1  Cl.2  Cl.3  Cl.4  Cl.5  Cl.6  Cl.7  Cl.8  Cl.9
TF-IDF          205   193   187   180   144   21    19    14    13    13
TF-Entropy      428   236   197   113   15    -     -     -     -     -
LogTF-IDF       464   406   91    8     7     5     5     3     -     -
LogTF-Entropy   827   160   1     1     0     -     -     -     -     -
Bool-TF         230   215   194   188   162   -     -     -     -     -
D1 comparison between partitions
To compare the different solutions, ESCAPE includes:
• The Adjusted Rand Index (ARI) metric
  • The ARI lies between 0 and 1
• In ESCAPE, the ARI index is used to compare:
  1. the impact of the different weighting schemas given a methodology
  2. the best partitioning between the different methodologies

Weighting schema
Dataset  TF-IDF  LogTF-IDF  TF-Entropy  LogTF-Entropy  Boolean
D1       0.554   0.321      0.320       0.100          0.790
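A minimal scikit-learn sketch of this ARI comparison; the label arrays are hypothetical placeholders for the partitions produced by two weighting schemas or by the two methodologies.

```python
from sklearn.metrics import adjusted_rand_score

def compare_partitions(labels_a, labels_b):
    """ARI between two cluster assignments of the same documents (1 = identical partitions)."""
    return adjusted_rand_score(labels_a, labels_b)

# e.g. weighting-schema impact within one methodology, or joint vs. probabilistic model
# ari = compare_partitions(labels_tfidf, labels_logtf_entropy)   # hypothetical labels
```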
D1 topic comparison
Cluster ID   Joint-Approach   Probabilistic Modelling
Cluster 0    Literature       Music
Cluster 1    Food             Maths
Cluster 2    Music            Oil
Cluster 3    Maths            Literature
Cluster 4    Sport            Sport
Cluster 5    Sport            Dynamic Sport
Cluster 6    Graph            Music
Cluster 7    Music            Quidditch
Cluster 8    Literature       Literature
Cluster 9    Oil              Musical Instruments
Five original categories: cooking, literature, mathematics, music and sport
Lessons learnt
Several lessons have been learnt from these analyses:
• No strategy is universally superior
• The joint-approach is able to find homogeneous partitions in terms of documents per cluster
  • the local weight LogTF tends to find a small number of clusters
  • the global weight IDF is able to create more clusters, and to find sub-topics related to the major categories
• The probabilistic model finds heterogeneous partitions in terms of documents per cluster
  • for certain weighting schemas the documents are well separated
  • the Entropy-based schemas lead to very poor results
Publication list
1. Di Corso, Evelina; Cerquitelli, Tania; Ventura, Francesco. Self-tuning techniques for large scale cluster analysis on textual data collections. Proceedings of the Symposium on Applied Computing, 2017, ACM.
2. Di Corso, Evelina; Ventura, Francesco; Cerquitelli, Tania. All in a twitter: Self-tuning strategies for a deeper understanding of a crisis tweet collection. 2017 IEEE International Conference on Big Data (Big Data), 2017, IEEE.
3. Proto, Stefano; Di Corso, Evelina; Ventura, Francesco; Cerquitelli, Tania. Useful ToPIC: Self-Tuning Strategies to Enhance Latent Dirichlet Allocation. 2018 IEEE International Congress on Big Data, 2018, IEEE.
4. Cerquitelli, Tania; Di Corso, Evelina; Ventura, Francesco; Chiusano, Silvia. Data miners' little helper: data transformation activity cues for cluster analysis on document collections. Proceedings of the 7th International Conference on Web Intelligence, Mining and Semantics, 2017, ACM.
5. Di Corso, Evelina; Proto, Stefano; Cerquitelli, Tania; Chiusano, Silvia. Towards automated visualisation of scientific literature. European Conference on Advances in Databases and Information Systems, 2019, Springer. In press.
6. Di Corso, Evelina; Cerquitelli, Tania. Democratising data science on corpora: automated knowledge extraction and visualisation at ease. ACM Celebration of Women in Computing womENcourage, 2019. In press.
7. Cerquitelli, Tania; Di Corso, Evelina; Ventura, Francesco; Chiusano, Silvia. Discussion Paper: Prompting the data transformation activities for cluster analysis on collections of documents.
8. Di Corso, Evelina. Supporting decision making with self-learning methodologies (PhD poster, December 2018, Politecnico di Torino, Italy).
9. Baralis, Elena Maria; Cerquitelli, Tania; Chiusano, Silvia Anna; Di Corso, Evelina. Towards Self-Learning Data Transformation, 2016.
Other research activities
Participation in different research projects with several companies in the data science area, as a data scientist:
• Research contract with ENEL Foundation and ENDESA Energia
  • Cerquitelli, Tania; Chicco, Gianfranco; Di Corso, Evelina; Ventura, Francesco; Montesano, Giuseppe; Del Pizzo, Anita; González, Alicia Mateo; Sobrino, Eduardo Martin. Discovering electricity consumption over time for residential consumers through cluster analysis. 2018 International Conference on Development and Application Systems.
  • Cerquitelli, Tania; Chicco, Gianfranco; Di Corso, Evelina; Ventura, Francesco; Montesano, Giuseppe; Armiento, Mirko; González, Alicia Mateo; Santiago, Andrea Veiga. Clustering-Based Assessment of Residential Consumers from Hourly-Metered Data. 2018 International Conference on Smart Energy Systems and Technologies - Best Paper Award.
• Research contract with Edison SpA
  • Cerquitelli, Tania; Di Corso, Evelina; Proto, Stefano; Capozzoli, Alfonso; Bellotti, Fabio; Cassese, Maria G.; Baralis, Elena; Mellia, Marco; Casagrande, Silvia; Tamburini, Martina. Exploring energy performance certificates through visualization, 2019.
  • Cerquitelli, Tania; Di Corso, Evelina; Proto, Stefano; Capozzoli, Alfonso; Bellotti, Fabio; Cassese, Maria G.; Baralis, Elena; Mellia, Marco; Casagrande, Silvia; Tamburini, Martina. Visualising high-resolution energy maps through the exploratory analysis of energy performance certificates. International Conference on Smart Energy Systems and Technologies (In Press).
• Research contract with Zirak Srl - Information Technology
Other research activities
Participation in different research projects in the data science area as a data scientist, in collaboration with:
• the Interuniversity Department of Regional and Urban Studies and Planning
  • Daraio, Elena; Di Corso, Evelina; Cerquitelli, Tania; Chiusano, Silvia. Characterizing Air-Quality Data Through Unsupervised Analytics Methods. European Conference on Advances in Databases and Information Systems, 2018, Springer.
  • Di Corso, Evelina; Proto, Stefano; Cerquitelli, Tania; Chiusano, Silvia (2018) Towards automated visualisation of scientific literature. In: European Conference on Advances in Databases and Information Systems (In Press).
• the Department of Energy
  • Di Corso, Evelina; Cerquitelli, Tania; Piscitelli, Marco Savino; Capozzoli, Alfonso. Exploring energy certificates of buildings through unsupervised data mining techniques. 2017 IEEE International Conference on Internet of Things (iThings) and IEEE Green Computing and Communications (GreenCom) and IEEE Cyber, Physical and Social Computing (CPSCom) and IEEE Smart Data (SmartData), 2017, IEEE.
The ESCAPE System Architecture
Enhanced Self-tuning Characterisation of document collections After Parameter Evaluation

The METATECH architecture
METeorological data Analysis for Thermal Energy CHaracterization
Di Corso, Evelina; Cerquitelli, Tania; Apiletti, Daniele (2018) METATECH: METeorological Data Analysis for Thermal Energy CHaracterization by Means of Self-Learning Transparent Models.
Knowledge visualisation based on a dynamic dashboard
[Figure: scatter plot of the SVD visualisation (left) and the energy consumption levels (right) for Cluster 1 and Cluster 2]
Conclusion & future work
Conclusion
• In these three years, I have been able to design and develop a new framework, named ESCAPE (Enhanced Self-tuning Characterisation of document collections After Parameter Evaluation), able to support the analyst during all the phases of the analysis process tailored to textual data.
• ESCAPE includes three main building blocks to streamline the analytics process and to derive high-quality information in terms of well-separated and well-cohesive groups of documents characterising the main topics in a given corpus.
Future research directions
Different directions have yet to be analysed and explored. Specifically, we are currently including:
1. New data analytics algorithms to exploit other interesting models:
   • other algebraic data reduction algorithms
   • autoencoder-based data reduction algorithms
   • non-parametric models (such as Deep Neural Networks, DNNs, and K-NNs)
   • more weighting functions and statistical features
2. A semantic component able to support the analyst in a double phase:
   • pre-processing phase, to eliminate semantically bound words
   • post-processing phase, to represent subtopics of the same macro category and to add a hierarchy level for each word of the dictionary to support other analytics tasks
3. A Knowledge Base: to store all the results of the experiments to efficiently support self-tuning methodologies
4. A self-learning methodology: based on a classification algorithm trained on the knowledge-base content to forecast the best methods for future analyses
5. Integrating in ESCAPE the analysis of other types of data
Evelina Di Corso
Dipartimento di Automatica e Informatica (DAUIN)
Politecnico di Torino, ITALY
Corso Duca degli Abruzzi, 24 - 10129 Torino
Publication list
1. Cerquitelli, Tania; Di Corso, Evelina. Characterizing Thermal Energy Consumption through Exploratory Data Mining Algorithms. EDBT/ICDT Workshops, 2016.
2. Di Corso, Evelina; Cerquitelli, Tania; Ventura, Francesco. Self-tuning techniques for large scale cluster analysis on textual data collections. Proceedings of the Symposium on Applied Computing, 2017, ACM.
3. Di Corso, Evelina; Cerquitelli, Tania; Piscitelli, Marco Savino; Capozzoli, Alfonso. Exploring energy certificates of buildings through unsupervised data mining techniques. 2017 IEEE International Conference on Internet of Things (iThings) and IEEE Green Computing and Communications (GreenCom) and IEEE Cyber, Physical and Social Computing (CPSCom) and IEEE Smart Data (SmartData), 2017, IEEE.
4. Cerquitelli, Tania; Di Corso, Evelina; Ventura, Francesco; Chiusano, Silvia. Data miners' little helper: data transformation activity cues for cluster analysis on document collections. Proceedings of the 7th International Conference on Web Intelligence, Mining and Semantics, 2017, ACM.
5. Baralis, Elena Maria; Cerquitelli, Tania; Chiusano, Silvia Anna; Di Corso, Evelina. Towards Self-Learning Data Transformation, 2016.
6. Cerquitelli, Tania; Di Corso, Evelina; Ventura, Francesco; Chiusano, Silvia. Discussion Paper: Prompting the data transformation activities for cluster analysis on collections of documents.
7. Venturini, Luca; Di Corso, Evelina. Analyzing spatial data from twitter during a disaster. 2017 IEEE International Conference on Big Data (Big Data), 2017, IEEE.
8. Di Corso, Evelina; Ventura, Francesco; Cerquitelli, Tania. All in a twitter: Self-tuning strategies for a deeper understanding of a crisis tweet collection. 2017 IEEE International Conference on Big Data (Big Data), 2017, IEEE.
9. Di Corso, Evelina; Cerquitelli, Tania; Apiletti, Daniele. METATECH: meteorological data analysis for thermal energy characterization by means of self-learning transparent models. Energies, 11(6), 1336, 2018, Multidisciplinary Digital Publishing Institute.
10. Cerquitelli, Tania; Chicco, Gianfranco; Di Corso, Evelina; Ventura, Francesco; Montesano, Giuseppe; Del Pizzo, Anita; González, Alicia Mateo; Sobrino, Eduardo Martin. Discovering electricity consumption over time for residential consumers through cluster analysis. 2018 International Conference on Development and Application Systems (DAS), 164-169, 2018, IEEE.
11. Daraio, Elena; Di Corso, Evelina; Cerquitelli, Tania; Chiusano, Silvia. Characterizing Air-Quality Data Through Unsupervised Analytics Methods. European Conference on Advances in Databases and Information Systems, 205-217, 2018, Springer.
12. Proto, Stefano; Di Corso, Evelina; Ventura, Francesco; Cerquitelli, Tania. Useful ToPIC: Self-Tuning Strategies to Enhance Latent Dirichlet Allocation. 2018 IEEE International Congress on Big Data, 2018, IEEE.
13. Cerquitelli, Tania; Chicco, Gianfranco; Di Corso, Evelina; Ventura, Francesco; Montesano, Giuseppe; Armiento, Mirko; González, Alicia Mateo; Santiago, Andrea Veiga. Clustering-Based Assessment of Residential Consumers from Hourly-Metered Data. 2018 International Conference on Smart Energy Systems and Technologies, 2018, IEEE.
14. Di Corso, Evelina. Supporting decision making with self-learning methodologies.
15. Cerquitelli, Tania; Di Corso, Evelina; Proto, Stefano; Capozzoli, Alfonso; Bellotti, Fabio; Cassese, Maria G.; Baralis, Elena; Mellia, Marco; Casagrande, Silvia; Tamburini, Martina. Exploring energy performance certificates through visualization, 2019.
16. Di Corso, Evelina; Proto, Stefano; Cerquitelli, Tania; Chiusano, Silvia. Towards automated visualisation of scientific literature. European Conference on Advances in Databases and Information Systems, 2019, Springer. In press.
17. Di Corso, Evelina; Cerquitelli, Tania. Democratising data science on corpora: automated knowledge extraction and visualisation at ease. ACM Celebration of Women in Computing womENcourage, 2019. In press.
18. Cerquitelli, Tania; Di Corso, Evelina; Proto, Stefano; Capozzoli, Alfonso; Bellotti, Fabio; Cassese, Maria G.; Baralis, Elena; Mellia, Marco; Casagrande, Silvia; Tamburini, Martina. Visualising high-resolution energy maps through the exploratory analysis of energy performance certificates. International Conference on Smart Energy Systems and Technologies. In press.