TRANSCRIPT
Text miner's little helper: scalable self-tuning methodologies for knowledge exploration
Evelina Di Corso, Cycle XXXI
Advisor: Prof. Tania Cerquitelli
Dipartimento di Automatica e Informatica, Politecnico di Torino, ITALY
Outline of the presentation
• Problem statement and the purpose of the study
• Framework architecture
• Results & discussion
• Other research activities
• Conclusion & Future work
The problem statement and
the purpose of the study
Data analytics
Large volumes of heterogeneous data are being collected in
• social networks
• scientific computation and digital libraries
• smart environments
Data analysis is challenging
• It is a multi-step process
• A plethora of analytics algorithms is available
• Different algorithm-specific parameters need to be manually set
• Variable data distribution
Knowledge Discovery from Data process - KDD
Case study
• Analysis of textual data collections via unsupervised analysis
Cluster analysis on textual data collections
The analysis pipeline: document collection → document processing (textual data processing) → term relevance (weighting function) → topic detection (document clustering and topic modelling) → result assessment (quality metrics)
Each step raises design choices:
• Manifold suitable data weighting functions
• Different topic modelling algorithms
• Several parameters
• Various quality indices
• Analysis guided by a domain expert
Research goal
Design and develop a new generation of data analytics engines based on self-tuning techniques
Research issues
• Automating data mining activities
• Parameter-free algorithms
• Self-assessment strategies
Case study
• Analysis of textual data collections via unsupervised analysis
The framework architecture
The ESCAPE System Architecture
Enhanced Self-tuning Characterisation of document collections After Parameter Evaluation
Document processing and characterisation
• Document processing
• Weighting schemas
• Statistics definition and computation
Document processing and characterisation
• Document processing
  1. Document splitting
  2. Tokenisation
  3. Case normalisation
  4. Stopword removal
  5. Stemming
• Bag-of-Words representation
Document processing and characterisation
• Weighting schemas
  • Document-term matrix X
  • Local weight l_ij
  • Global weight g_j
  • X_ij = l_ij * g_j

Weighting Schemas Definition
Local:
  TF = tf_ij
  LogTF = log2(tf_ij + 1)
  Boolean = {0, 1}
Global:
  IDF = log(|D| / df_j)
  Entropy = 1 + (Σ_i p_ij log p_ij) / log n
  TFglob = tf_j
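To make the schema definitions concrete, here is a minimal NumPy sketch of how a weighted matrix X_ij = l_ij · g_j can be built from raw term counts; the function name and the toy matrix are illustrative, not part of ESCAPE.

```python
import numpy as np

def weight_matrix(tf, local="logtf", glob="idf"):
    """Build X_ij = l_ij * g_j from a documents-by-terms count matrix tf (ndarray)."""
    n_docs = tf.shape[0]
    # Local weight l_ij
    if local == "tf":
        l = tf.astype(float)
    elif local == "logtf":
        l = np.log2(tf + 1.0)
    else:                                          # "boolean"
        l = (tf > 0).astype(float)
    # Global weight g_j
    if glob == "idf":
        df = np.maximum((tf > 0).sum(axis=0), 1)   # document frequency df_j
        g = np.log(n_docs / df)
    elif glob == "entropy":
        col = np.maximum(tf.sum(axis=0), 1)        # corpus frequency tf_j
        p = tf / col                               # p_ij = tf_ij / tf_j
        g = 1.0 + (p * np.log(np.where(p > 0, p, 1.0))).sum(axis=0) / np.log(n_docs)
    else:                                          # "tfglob"
        g = tf.sum(axis=0).astype(float)
    return l * g                                   # broadcast g_j over the term columns

# e.g. LogTF-Entropy weighting of a toy 2-document, 3-term corpus
X = weight_matrix(np.array([[3, 0, 1], [1, 2, 0]]), local="logtf", glob="entropy")
```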
Document processing and characterisation
• Statistics definition and computation
  • Basic statistics: # categories, # documents, avg document length, # terms, dictionary, avg/max/min term frequency
  • Lexical richness indices: Hapax %, Type-Token Ratio (TTR), Guiraud Index, Hapax-removal Boolean variable
Cerquitelli, T.; Di Corso, E.; Ventura, F.; Chiusano, S. (2017) Data miners' little helper: data transformation activity cues for cluster analysis on document collections. In: Proceedings of the 7th International Conference on Web Intelligence, Mining and Semantics.
Cerquitelli, T.; Di Corso, E.; Ventura, F.; Chiusano, S. (2017) Prompting the data transformation activities for cluster analysis on collections of documents. In: 25th Italian Symposium on Advanced Database Systems.
Self-Tuning Exploratory Data Analytics
Two main methodologies
• Joint-Approach
  • Algebraic model
  • Cluster analysis
• Probabilistic Model
  • Latent Dirichlet Allocation (LDA)
Joint-Approach
The joint-approach includes two steps:
• A data reduction phase through Latent Semantic Analysis (LSA)
• A data clustering phase to find similar documents or relations between them
ESCAPE includes two innovative strategies to automatically configure the joint-approach
Joint-Approach Self-tuning Reduction phase
The correct choice of the number of dimensions to be considered is a research issue
ESCAPE includes a Self-Tuning Data Reduction algorithm
Input parameters:
• Weighted document-term matrix X
• The maximum number of singular values
Steps:
1. SVD decomposition on X
2. The trend of the singular values is analysed in terms of significance
3. The algorithm ends when at least three values for K_LSA have been identified
Output:
• Three values for the dimensionality reduction: K_LSA
[Figure: singular value magnitude vs. number of singular values]
Di Corso, Evelina; Cerquitelli, Tania; Ventura, Francesco (2017) Self-tuning techniques for large scale cluster analysis on textual data collections. In: ACM SIGAPP Symposium on Applied Computing.
Di Corso, Evelina; Cerquitelli, Tania; Ventura, Francesco (2018) All in a twitter: Self-tuning strategies for a deeper understanding of a crisis tweet collection. In: IEEE International Conference on Big Data.
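A hedged sketch of the reduction step described above; the cumulative-energy thresholds used to mark "significant" singular values are illustrative stand-ins for the thesis' actual significance criterion.

```python
import numpy as np

def self_tuning_reduction(X, max_k=100, energy_levels=(0.6, 0.75, 0.9)):
    """Illustrative sketch: derive candidate K_LSA values from the singular-value trend."""
    # 1. SVD decomposition on the weighted document-term matrix X
    s = np.linalg.svd(X, compute_uv=False)[:max_k]
    # 2. Analyse the singular-value trend: cumulative explained "magnitude"
    energy = np.cumsum(s ** 2) / np.sum(s ** 2)
    # 3. Return (at least) three K_LSA candidates, one per energy level
    return [int(np.searchsorted(energy, level) + 1) for level in energy_levels]
```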
Joint-Approach Self-tuning Clustering
The K-Means partitional algorithm
• k_cls: user-defined parameter
ESCAPE includes a Self-Tuning Clustering algorithm
Input parameters:
• Low-rank matrix X_k
• Range of desired clusters [min_cls – max_cls]
Steps:
1. Cluster analysis performed for each k_cls ∈ [min_cls – max_cls]
2. Partition analysis through Silhouette-based indices
3. Selection of the final optimal solution
Output:
• Final optimal solution
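A minimal Python sketch of this clustering loop, assuming scikit-learn K-Means and scoring each partition with the plain average silhouette only (ESCAPE combines the WS, ASI and GSI indices introduced next).

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def self_tuning_clustering(Xk, min_cls=2, max_cls=20):
    """Cluster the low-rank matrix X_k for every candidate k and keep the best partition."""
    results = {}
    for k in range(min_cls, max_cls + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(Xk)
        results[k] = (labels, silhouette_score(Xk, labels))
    best_k = max(results, key=lambda k: results[k][1])   # simplified selection rule
    return best_k, results[best_k][0]
```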
Joint-Approach Self-tuning Clustering
Self-Tuning Clustering algorithm
• The Silhouette index is a quality measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation)
• The Silhouette ranges from -1 to +1
  • a high value indicates that the object is well matched to its own cluster and poorly matched to neighbouring clusters
• For each document i, the silhouette is defined as:
  s_i = (b_i − a_i) / max(a_i, b_i)
  where:
  • a_i is the average distance between i and the other documents in the same cluster;
  • b_i is the lowest average distance between document i and each one of the other clusters (not containing document i)
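The definition above translates directly into code; a small sketch using pairwise distances (equivalent in spirit to scikit-learn's silhouette_samples), with illustrative names.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

def silhouette_per_document(X, labels):
    """Direct transcription of the definition: s_i = (b_i - a_i) / max(a_i, b_i)."""
    labels = np.asarray(labels)
    D = pairwise_distances(X)
    s = np.zeros(len(labels))
    for i, c in enumerate(labels):
        same = (labels == c) & (np.arange(len(labels)) != i)
        a_i = D[i, same].mean() if same.any() else 0.0           # cohesion
        b_i = min(D[i, labels == other].mean()                   # separation
                  for other in set(labels) if other != c)
        s[i] = (b_i - a_i) / max(a_i, b_i) if max(a_i, b_i) > 0 else 0.0
    return s
```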
Joint-Approach Self-tuning Clustering
Three indicators are based on the previous definition:
• The Weighted Silhouette Index (WS)
  • Distribution of silhouette values in positive bins
  • Preference for left-skewed distributions
• The Average Silhouette Index (ASI)
• The Global Silhouette Index (GSI)
where C_k is the set of documents belonging to cluster k = 1,...,K; |C_k| is the cardinality of cluster C_k (documents belonging to cluster C_k), and N is the total number of documents.
• The higher the index values, the better the clustering partition
  • ASI gives an overview of the average silhouette of the entire cluster set
  • GSI takes into account the imbalanced number of elements in each cluster
  • Clusters with a large number of documents are penalised in the GSI
Cerquitelli, Tania et al. (2018) Clustering-Based Assessment of Residential Consumers from Hourly-Metered Data. In: International Conference on Smart Energy Systems and Technologies.
Joint-Approach Self-tuning Clustering
A rank score is defined:
Score = (1 − rank_GSI/max_cls) + (1 − rank_ASI/max_cls) + (1 − rank_WS/max_cls)
The score lies in the range [0, 3 − 3/max_cls]
[Figure: Silhouette-based indices vs. K_cls for the Wikipedia dataset with 1,000 documents]
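A short sketch of this rank score, assuming that for every candidate k the three indices have already been computed and that rank 1 is assigned to the highest index value.

```python
import numpy as np

def rank_score(gsi, asi, ws, max_cls):
    """Combine the three silhouette-based indices via their ranks (rank 1 = best)."""
    def ranks(values):
        order = np.argsort(-np.asarray(values))          # highest value first
        r = np.empty_like(order)
        r[order] = np.arange(1, len(values) + 1)
        return r
    r_gsi, r_asi, r_ws = ranks(gsi), ranks(asi), ranks(ws)
    return (1 - r_gsi / max_cls) + (1 - r_asi / max_cls) + (1 - r_ws / max_cls)

# Best possible score when a k ranks first on all three indices: 3 - 3/max_cls
```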
Probabilistic Model
• Generative probabilistic model
  • Parametric Bayesian probabilistic graphical model
  • Documents as random mixtures over latent topics
  • Topics are characterised by a distribution over words
  • LDA estimates both at the same time
  • Default configuration: the maximum number of iterations (max iter = 100), the optimiser (Online Variational Bayes) [2], the document concentration (α = 50/K) [1], the topic concentration (β = 0.1) [1]
• Approach: inferring hidden structure using posterior inference
  • Discovering topics in the collection using Bayesian inference
• Assumptions:
  • Topics exist outside of the document collection
  • Each topic is a distribution over a fixed vocabulary
  • Each word is drawn from one of those topics
  • Each document is a random mixture of corpus-wide topics
• Issue: the number of topics must be set a priori
[1] Thomas Griffiths and Mark Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences.
[2] David Blei, Andrew Ng, and Michael Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research.
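For reference, the default configuration quoted above maps onto the pyspark.ml LDA estimator roughly as follows; the featuresCol name and the fit call are assumptions about how the weighted document-term matrix is fed in.

```python
from pyspark.ml.clustering import LDA

def build_lda(k):
    """LDA configured with the defaults quoted on the slide (assumed pyspark.ml mapping)."""
    return LDA(k=k,
               maxIter=100,                    # max iter = 100
               optimizer="online",             # Online Variational Bayes
               docConcentration=[50.0 / k],    # alpha = 50/K
               topicConcentration=0.1,         # beta = 0.1
               featuresCol="features")

lda = build_lda(k=10)   # lda.fit(docs_df) on a DataFrame with a 'features' column
```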
Probabilistic Model - Self-tuning LDA
ESCAPE proposes a novel iterative strategy: ToPIC-Similarity
Input parameters:
• LDA topic-term distribution
• Range of desired clusters [min_cls – max_cls]
Steps:
1. Topic characterisation
2. Similarity computation
3. K identification
Output:
• Three good configurations
Proto, Stefano; Di Corso, Evelina; Ventura, Francesco; Cerquitelli, Tania (2018) Useful ToPIC: Self-tuning strategies to enhance Latent Dirichlet Allocation. In: IEEE International Big Data Congress.
Di Corso, Evelina; Proto, Stefano; Cerquitelli, Tania; Chiusano, Silvia (2018) Towards automated visualisation of scientific literature. In: European Conference on Advances in Databases and Information Systems (In Press).
Probabilistic Model - Self-tuning LDA
• The proposed method is iterative, and comprises three main steps:
1. Topic characterisation: selects the n most representative words for each topic
   • Based on the statistical indices of the dataset:
     Q = |V| · TTR / AvgFreq
     n = Q/K, if Q ≥ K · AvgFreq
     n = AvgFreq, if Q < K · AvgFreq
2. Similarity computation
3. K identification
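The rule above translates into a few lines of Python; vocab_size, ttr and avg_freq are the dataset statistics computed earlier, and the rounding to an integer word count is an assumption.

```python
def words_per_topic(vocab_size, ttr, avg_freq, k):
    """Number n of representative words per topic, following the slide's rule."""
    q = vocab_size * ttr / avg_freq
    n = q / k if q >= k * avg_freq else avg_freq
    return max(1, int(round(n)))    # keep at least one word per topic (assumption)
```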
Probabilistic Model - Self-tuning LDA
• The proposed method is iterative, and comprises three main steps:
1. Topic characterisation: selects the n most representative words for each topic
2. Similarity computation: computes the semantic similarity among the obtained topics
   • Cosine similarity
   • Norm of the similarity matrix
   • Mean with respect to the number of topics
3. K identification
Probabilistic Model - Self-tuning LDA
• The proposed method is iterative, and comprises three main steps:
1. Topic characterisation: selects the n most representative words for each topic
2. Similarity computation: computes the semantic similarity among the obtained topics
3. K identification: identifies optimal values for K, using a trade-off approach
   • The K values identified are the first three values that
     (i) belong to a decreasing segment of the curve and
     (ii) are local minima
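A speculative end-to-end sketch assembled from the three steps above; zeroing the diagonal of the similarity matrix and the exact local-minimum test are assumptions about details the slides do not spell out.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def topic_similarity(topic_term, n_words):
    """One iteration: keep the n most probable words per topic, then summarise
    how similar the topics are (lower values suggest better-separated topics)."""
    top = np.argsort(-topic_term, axis=1)[:, :n_words]
    reduced = np.zeros_like(topic_term)
    for t, idx in enumerate(top):
        reduced[t, idx] = topic_term[t, idx]
    sim = cosine_similarity(reduced)
    np.fill_diagonal(sim, 0.0)                       # ignore self-similarity (assumption)
    return np.linalg.norm(sim) / topic_term.shape[0] # matrix norm, averaged over K

def pick_k(scores):
    """Candidate K values: local minima lying on a decreasing segment of the curve."""
    ks = sorted(scores)
    out = [k for i, k in enumerate(ks[1:-1], 1)
           if scores[ks[i]] < scores[ks[i - 1]] and scores[ks[i]] <= scores[ks[i + 1]]]
    return out[:3]
```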
Knowledge validation and visualisation
• Knowledge validation
  • Objective metrics
• Knowledge visualisation
  • Visualisation techniques
Knowledge validation - Objective metrics
For each approach, different quality metrics have been integrated
• Joint-Approach:
  • The Weighted Silhouette index (WS)
  • The Average Silhouette Index (ASI)
  • The Global Silhouette Index (GSI)
• Probabilistic Modelling:
  • The Weighted Silhouette index (WS)
  • The Perplexity index (Perplexity)
  • The Entropy index (H)
Knowledge visualisation - Visualisation techniques
ESCAPE provides two forms of human-readable knowledge:
1. document-topic distribution: characterises the document distribution over the topics
   • topic cohesion/separation in terms of document distribution
   • coarse-grained versus fine-grained groups through the analysis of the impact of the different weighting schemas
2. topic-term distribution
Di Corso, Evelina; Proto, Stefano; Cerquitelli, Tania; Chiusano, Silvia (2018) Towards automated visualisation of scientific literature. In: European Conference on Advances in Databases and Information Systems (In Press).
Knowledge visualisation - Visualisation techniques
1. document-topic distribution: characterises the document distribution over the topics
   • topic cohesion/separation in terms of document distribution, visualised through t-Distributed Stochastic Neighbour Embedding (t-SNE)
   • coarse-grained versus fine-grained groups through the analysis of the impact of the different weighting schemas
[Figure: t-SNE representation]
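A minimal scikit-learn sketch of this t-SNE view, assuming the document-topic matrix and the cluster labels have already been produced by one of the two methodologies.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(doc_topic, labels):
    """Project the document-topic distribution to 2D and colour points by cluster."""
    xy = TSNE(n_components=2, random_state=0).fit_transform(doc_topic)
    plt.scatter(xy[:, 0], xy[:, 1], c=labels, s=10, cmap="tab10")
    plt.title("t-SNE of the document-topic distribution")
    plt.show()
```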
Knowledge visualisation - Visualisation techniques
1. document-topic distribution: characterises the document distribution over the topics
   • coarse-grained versus fine-grained groups through the analysis of the impact of the different weighting schemas, visualised through correlation matrices
[Figure: correlation matrix]
Knowledge visualisation - Visualisation techniques
2. topic-term distribution: describes the distribution over words for each topic
   • topic-term distribution through the analysis of the top-k relevant words, visualised through word clouds
   • topic cohesion/separation in terms of relevant words
[Figure: word-cloud representation]
Knowledge visualisation - Visualisation techniques
2. topic-term distribution: analysis of the top-k relevant words
   • For the Joint-Approach: relevant terms are extracted using the FP-Growth algorithm
   • For the Probabilistic model: relevant terms are extracted from the topic-term probability of the LDA model
[Figure: word-cloud representation]
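A tentative PySpark sketch of the FP-Growth step for one cluster; deduplicating terms per document and the minimum support value are illustrative choices, not ESCAPE's actual settings.

```python
from pyspark.ml.fpm import FPGrowth

def frequent_terms_per_cluster(spark, docs_in_cluster, min_support=0.1):
    """Mine terms that frequently co-occur inside one cluster; the top itemsets
    can then feed the cluster's word cloud."""
    # FP-Growth requires unique items per transaction, hence set(doc)
    df = spark.createDataFrame([(list(set(doc)),) for doc in docs_in_cluster], ["items"])
    model = FPGrowth(itemsCol="items", minSupport=min_support).fit(df)
    return model.freqItemsets.orderBy("freq", ascending=False)
```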
Knowledge visualisation - Visualisation techniques
2. topic-term distribution: topic cohesion/separation in terms of relevant words, visualised through a graph representation
[Figure: graph representation]
Experimental results
The current implementation of ESCAPE runs on Apache Spark
• It is a project developed in Python
• It exploits the PySpark scalable machine learning libraries ML and MLlib
• Specifically, the algorithms included are:
  • pyspark.mllib.linalg.distributed for SVD
  • pyspark.ml.clustering for K-Means and LDA
• All experiments have been performed on the BigData@PoliTO cluster
[Figure: ESCAPE architecture]
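Putting the pieces together, a hedged sketch of how these PySpark components fit; weighted_rows (an RDD of mllib vectors) and docs_df (a DataFrame with a 'features' column) are assumed to come from the preprocessing stage, and the parameter values are placeholders.

```python
from pyspark.mllib.linalg.distributed import RowMatrix
from pyspark.ml.clustering import KMeans, LDA

def reduce_and_cluster(weighted_rows, docs_df, k_lsa=100, k_cls=10):
    """Distributed SVD plus the two modelling back-ends named on the slide."""
    # Data reduction: SVD on the weighted document-term matrix
    svd = RowMatrix(weighted_rows).computeSVD(k_lsa, computeU=True)
    # Joint-approach clustering and probabilistic topic modelling
    kmeans_model = KMeans(k=k_cls, seed=42).fit(docs_df)
    lda_model = LDA(k=k_cls, maxIter=100, optimizer="online").fit(docs_df)
    return svd, kmeans_model, lda_model
```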
Experimental settings – Default configurations
• Joint-Approach:
  • the number of dimensions to be considered during the SVD data reduction phase (k_rid = 20% of the matrix rank)
  • the number of clusters (topics) in which to divide the collection under analysis (K_max = average term frequency)
• Probabilistic modelling:
  • the number of clusters (topics) in which to divide the collection under analysis (K_max = average term frequency)
ESCAPE has been tested over different real datasets
Experimental datasets
ID   Dataset source   Textual data type
D1   Wikipedia        Documents
D2   Wikipedia        Documents
D3   Wikipedia        Documents
D4   Twitter          Short messages
D5   PubMed           Articles
D6   PubMed           Abstracts
D7   Reuters          Documents
Statistics definition and computation

Features                Wikipedia (D1)        Twitter (D4)          PubMed (D5)
                        WH         WoH        WH         WoH        WH           WoH
# categories            5                     6                     -
# documents             989                   60,005                1,000
Max frequency           5,394                 6,936                 775
Min frequency           1          2          1          2          1            2
Avg frequency           25         45         19         36         15           18
Avg document length     852        836        5          5          3,600        3,469
# terms                 843,967    828,372    312,718    304,666    3,600,153    3,469,305
Dictionary |V|          33,635     18,040     16,345     12,136     227,210      96,362
TTR                     0.04       0.03       0.05       0.03       0.06         0.05
Hapax (%)               46.3       0.0        49.26      0.0        57.02        0.0
Guiraud Index           36.61      19.82      29.23      15.02      119.75       51.73

WH: with Hapax; WoH: without Hapax
Term relevance
• In ESCAPE, the following weighting schemas have been integrated:
Weighting Schemas Definition
Local:
  TF = tf_ij
  LogTF = log2(tf_ij + 1)
  Boolean = {0, 1}
Global:
  IDF = log(|D| / df_j)
  Entropy = 1 + (Σ_i p_ij log p_ij) / log n
  TFglob = tf_j
Experimental results
• Quality of ESCAPE solutions for each methodology:
  • Top-k configurations
  • Weighting schema impact on the corpus
  • Comparison with state-of-the-art approaches
• Comparison among ESCAPE methodologies
  • Adjusted Rand Index (ARI)
  • Quantitative metrics
  • Qualitative visualisations
Preliminary results – Joint Approach

D1 top-3 solutions for each weighting schema
Weight          K-LSA  K-Clustering  GSI    ASI    WS     Execution time
TF-IDF          26     7             0.383  0.358  0.408  22m, 20s
                41     10            0.419  0.339  0.391
                67     10            0.361  0.297  0.352
TF-Entropy      29     11            0.334  0.35   0.401  26m, 18s
                42     10            0.368  0.331  0.382
                62     8             0.364  0.274  0.326
LogTF-IDF       19     5             0.437  0.431  0.48   25m, 23s
                22     5             0.35   0.343  0.393
                67     4             0.225  0.201  0.251
LogTF-Entropy   10     6             0.44   0.453  0.5    27m, 12s
                24     5             0.323  0.318  0.367
                67     7             0.268  0.218  0.267
Bool-IDF        8      5             0.445  0.444  0.494  25m, 33s
                22     6             0.293  0.312  0.365
                65     6             0.226  0.233  0.286
Bool-Entropy    9      5             0.447  0.444  0.495  28m, 38s
                23     5             0.354  0.348  0.4
                65     4             0.28   0.234  0.285

D4 and D5 best solution for each weighting schema
Dataset  Weight         K-LSA  K-Clustering  GSI    ASI    WS     Execution time
D4       Bool-IDF       6      6             0.465  0.422  0.737  50m, 29s
D4       Bool-Entropy   13     7             0.342  0.32   0.532  1h, 10m, 33s
D5       TF-IDF         14     5             0.352  0.284  0.333  1h, 37m, 19s
D5       TF-Entropy     15     10            0.377  0.28   0.332  1h, 39m, 34s
D5       LogTF-IDF      15     5             0.397  0.312  0.362  1h, 43m, 15s
D5       LogTF-Entropy  16     5             0.384  0.287  0.336  1h, 47m, 34s
D5       Bool-IDF       16     4             0.315  0.347  0.395  1h, 46m, 42s
D5       Bool-Entropy   16     4             0.328  0.336  0.385  1h, 48m, 45s
Top-k solution
• Joint-Approach
  • Data reduction parameter
  • Number of clusters
[Figure: singular value magnitude vs. number of singular values]
Top-k solution
• Joint-Approach
  • Data reduction parameter
  • Number of clusters
  • Rank Function Score
Score = (1 − rank_GSI/max_cls) + (1 − rank_ASI/max_cls) + (1 − rank_WS/max_cls)

K   GSI    ASI    WS     rank_GSI  rank_ASI  rank_WS  Score  Rank-Solution
2   0.21   0.239  0.29   19        18        18       0.105  18
3   0.294  0.244  0.296  16        17        17       0.368  17
4   0.255  0.237  0.29   18        19        19       0.053  19
5   0.332  0.315  0.37   9         4         4        2.105  4
6   0.307  0.256  0.309  14        16        16       0.579  16
7   0.383  0.354  0.405  1         2         2        2.737  2
8   0.345  0.315  0.365  4         5         6        2.211  3
9   0.329  0.301  0.352  11        11        11       1.263  11
10  0.383  0.357  0.409  2         1         1        2.789  1
11  0.29   0.295  0.347  17        12        12       0.842  14
12  0.34   0.312  0.365  5         7         5        2.105  4
13  0.336  0.306  0.358  7         10        10       1.579  9
14  0.32   0.322  0.376  13        3         3        2      6
15  0.333  0.314  0.364  8         6         7        1.895  7
16  0.336  0.311  0.363  6         8         9        1.789  8
17  0.322  0.311  0.364  12        9         8        1.474  10
18  0.371  0.281  0.336  3         15        15       1.263  11
19  0.33   0.284  0.337  10        14        14       1      13
20  0.306  0.285  0.338  15        13        13       0.842  15

[Figure: Silhouette-based indices vs. K_cls]
Weight impact
[Figure: D1 correlation matrices (original category vs. ESCAPE label) for TF-IDF and LogTF-Entropy]
[Figure: D1 t-SNE representation]
Comparison with the state-of-the-art
State-of-the-art methodology: the Elbow method
The good trade-off for determining the optimal number of clusters corresponds to the change of slope from steep to shallow (an elbow)
[Figure: D1 SSE trend vs. K; D1 Silhouette index for each document]
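For reference, a small sketch of how the SSE curve behind the elbow method can be produced with scikit-learn; inertia_ is the within-cluster sum of squared errors.

```python
from sklearn.cluster import KMeans

def sse_curve(X, max_k=20):
    """SSE (inertia) for each k; the elbow is where the slope turns from steep to shallow."""
    return {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(2, max_k + 1)}
```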
Preliminary results – Probabilistic modelling

D1 top-3 solutions for each weighting schema
Weight          K   Perplexity  Silhouette  Entropy  Execution time
TF-IDF          3   8.812       0.772       0.256    40m, 24s
                6   8.597       0.693       0.363
                10  8.482       0.682       0.395
TF-Entropy      5   9.072       0.762       0.282    30m, 32s
                8   9.248       0.632       0.338
                9   9.267       0.631       0.339
LogTF-IDF       8   9.187       0.675       0.320    40m, 17s
                17  9.126       0.637       0.362
LogTF-Entropy   5   9.912       0.891       0.100    30m, 54s
                7   9.884       0.846       0.174
                11  9.979       0.951       0.108
Boolean-TF      4   6.492       0.697       0.421    44m, 43s
                5   6.464       0.661       0.483
                17  6.420       0.381       1.090

D4 and D5 best solution for each weighting schema
Dataset  Weight         K   Perplexity  Silhouette  Entropy  Execution time
D4       Bool-TF        6   2.808       0.546       0.613    1h, 34m, 31s
D5       TF-IDF         14  7.662       0.085       1.902    1h, 50m, 27s
D5       TF-Entropy     4   8.556       0.081       1.782    1h, 54m, 25s
D5       LogTF-IDF      14  7.776       0.094       1.754    2h, 14m, 41s
D5       LogTF-Entropy  4   8.622       0.08        1.743    2h, 17m, 25s
D5       Bool-TF        10  5.22        0.101       1.318    2h, 20m, 13s
Top-k solution
• Probabilistic Modelling
  • Number of clusters
[Figure: ToPIC index vs. K]
Weight impact
[Figure: D1 scatter plot of the document-topic distribution for TF-IDF and LogTF-Entropy]
[Figure: D1 t-SNE visualisation]
Comparison with the state-of-the-art
State-of-the-art approaches to estimate the best value for the number of topics K:
• Rate of Perplexity Change (RPC) [1]
  • Based on the statistical perplexity
• Entropy-optimised Latent Dirichlet Allocation (En-LDA) [2]
  • Based on the Entropy of the model
Results on D1: RPC – K=3; En-LDA – K=19; ESCAPE – K=10
[1] Weizhong Zhao et al. A heuristic approach to determine an appropriate number of topics in topic modeling.
[2] Wen Zhang et al. En-LDA: An novel approach to automatic bug report assignment with entropy optimised Latent Dirichlet Allocation.
D1 comparison between methodologies

D1 joint-approach cluster set cardinality (Total = 989 documents per row)
Weight          Cl.0  Cl.1  Cl.2  Cl.3  Cl.4  Cl.5  Cl.6  Cl.7  Cl.8  Cl.9
TF-IDF          215   176   159   139   99    93    49    25    19    15
TF-Entropy      228   167   166   135   106   75    54    27    16    15
LogTF-IDF       225   212   191   183   178   -     -     -     -     -
LogTF-Entropy   223   191   184   183   105   103   -     -     -     -
Bool-IDF        236   223   191   181   158   -     -     -     -     -
Bool-Entropy    230   223   192   177   167   -     -     -     -     -

D1 probabilistic modelling cluster set cardinality (Total = 989 documents per row)
Weight          Cl.0  Cl.1  Cl.2  Cl.3  Cl.4  Cl.5  Cl.6  Cl.7  Cl.8  Cl.9
TF-IDF          205   193   187   180   144   21    19    14    13    13
TF-Entropy      428   236   197   113   15    -     -     -     -     -
LogTF-IDF       464   406   91    8     7     5     5     3     -     -
LogTF-Entropy   827   160   1     1     0     -     -     -     -     -
Bool-TF         230   215   194   188   162   -     -     -     -     -
D1 comparison between partitions
To compare the different solutions, ESCAPE includes:
• The Adjusted Rand Index (ARI) metric
  • The ARI lies between 0 and 1
• In ESCAPE, the ARI index is used to compare:
  1. the impact of the different weighting schemas given a methodology
  2. the best partitioning between the different methodologies

Weighting schema
Dataset  TF-IDF  LogTF-IDF  TF-Entropy  LogTF-Entropy  Boolean
D1       0.554   0.321      0.320       0.100          0.790
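A minimal scikit-learn sketch of this ARI comparison; the label arrays are hypothetical placeholders for the partitions produced by two weighting schemas or by the two methodologies.

```python
from sklearn.metrics import adjusted_rand_score

def compare_partitions(labels_a, labels_b):
    """ARI between two cluster assignments of the same documents (1 = identical partitions)."""
    return adjusted_rand_score(labels_a, labels_b)

# e.g. weighting-schema impact within one methodology, or joint vs. probabilistic model
# ari = compare_partitions(labels_tfidf, labels_logtf_entropy)   # hypothetical labels
```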
D1 topic comparison
Cluster ID   Joint-Approach   Probabilistic Modelling
Cluster 0    Literature       Music
Cluster 1    Food             Maths
Cluster 2    Music            Oil
Cluster 3    Maths            Literature
Cluster 4    Sport            Sport
Cluster 5    Sport            Dynamic Sport
Cluster 6    Graph            Music
Cluster 7    Music            Quidditch
Cluster 8    Literature       Literature
Cluster 9    Oil              Musical Instruments
Five original categories: cooking, literature, mathematics, music and sport
Lessons learnt
Several lessons have been learnt from these analyses:
• No strategy is universally superior
• The joint-approach is able to find homogeneous partitions in terms of documents per cluster
  • the local weight LogTF tends to find a small number of clusters
  • the global weight IDF is able to create more clusters, and to find sub-topics related to the major categories
• The probabilistic model finds heterogeneous partitions in terms of documents per cluster
  • for certain weighting schemas the documents are well separated
  • the Entropy-based schemas lead to very poor results
Publication list
1. Di Corso, Evelina; Cerquitelli, Tania; Ventura, Francesco. Self-tuning techniques for large scale cluster analysis on textual data collections. Proceedings of the Symposium on Applied Computing, 2017, ACM.
2. Di Corso, Evelina; Ventura, Francesco; Cerquitelli, Tania. All in a twitter: Self-tuning strategies for a deeper understanding of a crisis tweet collection. 2017 IEEE International Conference on Big Data (Big Data), 2017, IEEE.
3. Proto, Stefano; Di Corso, Evelina; Ventura, Francesco; Cerquitelli, Tania. Useful ToPIC: Self-Tuning Strategies to Enhance Latent Dirichlet Allocation. 2018 IEEE International Congress on Big Data, 2018, IEEE.
4. Cerquitelli, Tania; Di Corso, Evelina; Ventura, Francesco; Chiusano, Silvia. Data miners' little helper: data transformation activity cues for cluster analysis on document collections. Proceedings of the 7th International Conference on Web Intelligence, Mining and Semantics, 2017, ACM.
5. Di Corso, Evelina; Proto, Stefano; Cerquitelli, Tania; Chiusano, Silvia. Towards automated visualisation of scientific literature. European Conference on Advances in Databases and Information Systems, 2019, Springer. In press.
6. Di Corso, Evelina; Cerquitelli, Tania. Democratising data science on corpora: automated knowledge extraction and visualisation at ease. ACM Celebration of Women in Computing womENcourage, 2019. In press.
7. Cerquitelli, Tania; Di Corso, Evelina; Ventura, Francesco; Chiusano, Silvia. Discussion Paper: Prompting the data transformation activities for cluster analysis on collections of documents.
8. Di Corso, Evelina. Supporting decision making with self-learning methodologies (PhD poster, December 2018, Politecnico di Torino, Italy).
9. Baralis, Elena Maria; Cerquitelli, Tania; Chiusano, Silvia Anna; Di Corso, Evelina. Towards Self-Learning Data Transformation, 2016.
Other research activities
Participation in different research projects with several companies in the data science area, as a data scientist:
• Research contract with ENEL Foundation and ENDESA Energia
  • Cerquitelli, Tania; Chicco, Gianfranco; Di Corso, Evelina; Ventura, Francesco; Montesano, Giuseppe; Del Pizzo, Anita; González, Alicia Mateo; Sobrino, Eduardo Martin. Discovering electricity consumption over time for residential consumers through cluster analysis. 2018 International Conference on Development and Application Systems.
  • Cerquitelli, Tania; Chicco, Gianfranco; Di Corso, Evelina; Ventura, Francesco; Montesano, Giuseppe; Armiento, Mirko; González, Alicia Mateo; Santiago, Andrea Veiga. Clustering-Based Assessment of Residential Consumers from Hourly-Metered Data. 2018 International Conference on Smart Energy Systems and Technologies - Best Paper Award.
• Research contract with Edison SpA
  • Cerquitelli, Tania; Di Corso, Evelina; Proto, Stefano; Capozzoli, Alfonso; Bellotti, Fabio; Cassese, Maria G.; Baralis, Elena; Mellia, Marco; Casagrande, Silvia; Tamburini, Martina. Exploring energy performance certificates through visualization, 2019.
  • Cerquitelli, Tania; Di Corso, Evelina; Proto, Stefano; Capozzoli, Alfonso; Bellotti, Fabio; Cassese, Maria G.; Baralis, Elena; Mellia, Marco; Casagrande, Silvia; Tamburini, Martina. Visualising high-resolution energy maps through the exploratory analysis of energy performance certificates. International Conference on Smart Energy Systems and Technologies (In Press).
• Research contract with Zirak Srl - Information Technology
Other research activities
Participation in different research projects in the data science area as a data scientist, in collaboration with:
• the Interuniversity Department of Regional and Urban Studies and Planning
  • Daraio, Elena; Di Corso, Evelina; Cerquitelli, Tania; Chiusano, Silvia. Characterizing Air-Quality Data Through Unsupervised Analytics Methods. European Conference on Advances in Databases and Information Systems, 2018, Springer.
  • Di Corso, Evelina; Proto, Stefano; Cerquitelli, Tania; Chiusano, Silvia (2018) Towards automated visualisation of scientific literature. In: European Conference on Advances in Databases and Information Systems (In Press).
• the Department of Energy
  • Di Corso, Evelina; Cerquitelli, Tania; Piscitelli, Marco Savino; Capozzoli, Alfonso. Exploring energy certificates of buildings through unsupervised data mining techniques. 2017 IEEE International Conference on Internet of Things (iThings) and IEEE Green Computing and Communications (GreenCom) and IEEE Cyber, Physical and Social Computing (CPSCom) and IEEE Smart Data (SmartData), 2017, IEEE.
The ESCAPE System Architecture
Enhanced Self-tuning Characterisation of document collections After Parameter Evaluation

The METATECH architecture
METeorological data Analysis for Thermal Energy CHaracterization
Di Corso, Evelina; Cerquitelli, Tania; Apiletti, Daniele (2018) METATECH: METeorological Data Analysis for Thermal Energy CHaracterization by Means of Self-Learning Transparent Models.
Knowledge visualisation based on a dynamic dashboard
[Figure: scatter plot of the SVD visualisation (left) and the energy consumption levels (right) for Cluster 1 and Cluster 2]
Conclusion & future work
Conclusion
• In these three years, I have been able to design and develop a new framework, named ESCAPE (Enhanced Self-tuning Characterisation of document collections After Parameter Evaluation), able to support the analyst during all the phases of the analysis process tailored to textual data.
• ESCAPE includes three main building blocks to streamline the analytics process and to derive high-quality information in terms of well-separated and well-cohesive groups of documents characterising the main topics in a given corpus.
Future research directions
Different directions have yet to be analysed and explored. Specifically, we are currently including:
1. New data analytics algorithms to exploit other interesting models:
   • other algebraic data reduction algorithms
   • autoencoder-based data reduction algorithms
   • non-parametric models (such as Deep Neural Networks, DNNs, and K-NNs)
   • more weighting functions and statistical features
2. A semantic component able to support the analyst in a double phase:
   • pre-processing phase, to eliminate semantically bound words
   • post-processing phase, to represent subtopics of the same macro category and to add a hierarchy level for each word of the dictionary to support other analytics tasks
3. A Knowledge Base: to store all the results of the experiments to efficiently support self-tuning methodologies
4. A self-learning methodology: based on a classification algorithm trained on the knowledge-base content to forecast the best methods for future analyses
5. Integrating in ESCAPE the analysis of other types of data
Evelina Di Corso
Dipartimento di Automatica e Informatica (DAUIN)
Politecnico di Torino, ITALY
Corso Duca degli Abruzzi, 24 - 10129 Torino
Publication list
1. Cerquitelli, Tania; Di Corso, Evelina. Characterizing Thermal Energy Consumption through Exploratory Data Mining Algorithms. EDBT/ICDT Workshops, 2016.
2. Di Corso, Evelina; Cerquitelli, Tania; Ventura, Francesco. Self-tuning techniques for large scale cluster analysis on textual data collections. Proceedings of the Symposium on Applied Computing, 2017, ACM.
3. Di Corso, Evelina; Cerquitelli, Tania; Piscitelli, Marco Savino; Capozzoli, Alfonso. Exploring energy certificates of buildings through unsupervised data mining techniques. 2017 IEEE International Conference on Internet of Things (iThings) and IEEE Green Computing and Communications (GreenCom) and IEEE Cyber, Physical and Social Computing (CPSCom) and IEEE Smart Data (SmartData), 2017, IEEE.
4. Cerquitelli, Tania; Di Corso, Evelina; Ventura, Francesco; Chiusano, Silvia. Data miners' little helper: data transformation activity cues for cluster analysis on document collections. Proceedings of the 7th International Conference on Web Intelligence, Mining and Semantics, 2017, ACM.
5. Baralis, Elena Maria; Cerquitelli, Tania; Chiusano, Silvia Anna; Di Corso, Evelina. Towards Self-Learning Data Transformation, 2016.
6. Cerquitelli, Tania; Di Corso, Evelina; Ventura, Francesco; Chiusano, Silvia. Discussion Paper: Prompting the data transformation activities for cluster analysis on collections of documents.
7. Venturini, Luca; Di Corso, Evelina. Analyzing spatial data from twitter during a disaster. 2017 IEEE International Conference on Big Data (Big Data), 2017, IEEE.
8. Di Corso, Evelina; Ventura, Francesco; Cerquitelli, Tania. All in a twitter: Self-tuning strategies for a deeper understanding of a crisis tweet collection. 2017 IEEE International Conference on Big Data (Big Data), 2017, IEEE.
9. Di Corso, Evelina; Cerquitelli, Tania; Apiletti, Daniele. METATECH: meteorological data analysis for thermal energy characterization by means of self-learning transparent models. Energies, 11(6), 1336, 2018, Multidisciplinary Digital Publishing Institute.
10. Cerquitelli, Tania; Chicco, Gianfranco; Di Corso, Evelina; Ventura, Francesco; Montesano, Giuseppe; Del Pizzo, Anita; González, Alicia Mateo; Sobrino, Eduardo Martin. Discovering electricity consumption over time for residential consumers through cluster analysis. 2018 International Conference on Development and Application Systems (DAS), 164-169, 2018, IEEE.
11. Daraio, Elena; Di Corso, Evelina; Cerquitelli, Tania; Chiusano, Silvia. Characterizing Air-Quality Data Through Unsupervised Analytics Methods. European Conference on Advances in Databases and Information Systems, 205-217, 2018, Springer.
12. Proto, Stefano; Di Corso, Evelina; Ventura, Francesco; Cerquitelli, Tania. Useful ToPIC: Self-Tuning Strategies to Enhance Latent Dirichlet Allocation. 2018 IEEE International Congress on Big Data, 2018, IEEE.
13. Cerquitelli, Tania; Chicco, Gianfranco; Di Corso, Evelina; Ventura, Francesco; Montesano, Giuseppe; Armiento, Mirko; González, Alicia Mateo; Santiago, Andrea Veiga. Clustering-Based Assessment of Residential Consumers from Hourly-Metered Data. 2018 International Conference on Smart Energy Systems and Technologies, 2018, IEEE.
14. Di Corso, Evelina. Supporting decision making with self-learning methodologies.
15. Cerquitelli, Tania; Di Corso, Evelina; Proto, Stefano; Capozzoli, Alfonso; Bellotti, Fabio; Cassese, Maria G.; Baralis, Elena; Mellia, Marco; Casagrande, Silvia; Tamburini, Martina. Exploring energy performance certificates through visualization, 2019.
16. Di Corso, Evelina; Proto, Stefano; Cerquitelli, Tania; Chiusano, Silvia. Towards automated visualisation of scientific literature. European Conference on Advances in Databases and Information Systems, 2019, Springer. In press.
17. Di Corso, Evelina; Cerquitelli, Tania. Democratising data science on corpora: automated knowledge extraction and visualisation at ease. ACM Celebration of Women in Computing womENcourage, 2019. In press.
18. Cerquitelli, Tania; Di Corso, Evelina; Proto, Stefano; Capozzoli, Alfonso; Bellotti, Fabio; Cassese, Maria G.; Baralis, Elena; Mellia, Marco; Casagrande, Silvia; Tamburini, Martina. Visualising high-resolution energy maps through the exploratory analysis of energy performance certificates. International Conference on Smart Energy Systems and Technologies. In press.