Thinking in (Text) Clustering
(No math, be not afraid)
Yueshen Xu (lecturer)
Data and Knowledge Engineering Research Center
Xidian University
Text Mining & NLP & ML
Software Engineering, 2017/4/13
Outline
Background
What can be clustered?
Problems in K-XXX (Means/Medoid/Center…)
Similarity Measure
Convex and Concave
Problems in Gaussian Mixture Model
Problems in Matrix Factorization
Multinomial and Sparsity
Keywords: Clustering, K-Means/Medoid, Similarity Computation, GMM, MF, Multinomial Distribution
Basics, not state-of-the-art
Background
Information Overload
We need:
Summarization
Visualization
Dimensionality Reduction
Big Data, Cloud Computing, Artificial Intelligence, Deep Learning, …, etc.
Background
Dimensionality Reduction (DR)
Clustering
Text Clustering, Webpage Clustering, Image Clustering…
Summarization
Document Summarization, Image Summarization…
Factorization
Rating Matrix Factorization, Image Non-negative Factorization
Basic requirements of (text) clustering: Automatic, Applicable, Explainable
Related Research Areas
Dimensionality Reduction (DR)
Text Mining
Natural Language Processing
Computational Linguistics
Information Retrieval
Artificial Intelligence
(Text) Clustering
Some Concepts
[Figure: overlapping research areas, labelled Information Retrieval, Computational Linguistics, Natural Language Processing, LSA/Topic Model, Text Mining, DR, Data Mining, Artificial Intelligence, Machine Learning, Machine Translation, (Text) Clustering]
We all know what (text) clustering is, right?
A widely accepted topic, since everyone knows it
What can be clustered?
Data Sample 1: (1.2, 1.4, 2.234, 3.231), (8.2, 6.4, 4.243, 5.41), (5.234, 3.56, 4.454, 6.78)
Data Sample 2: (1), (0), (1), (0), (1), (1), (1), (0), (1), (0)
Data Sample 3: (China, modern, people, gov.), (policy, paper, conference, chair), (report, solution, UN, UK)
Data Sample 4: (aaabbbccc), (dddfffggg), (hhhiiiijjj)
Data Sample 5: (▲▼♦), (♣♠█), (■□●)
Is there anything that cannot be clustered?
Yes, but not related to us
What can be clustered?
Anything over which a similarity measure can be defined
Matrix, topology: all kinds of data can be clustered
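To make the "anything with a similarity measure" point concrete, here is a minimal sketch that clusters plain strings. The Jaccard similarity over character sets, the 0.5 threshold, the `threshold_clusters` helper, and the sample strings are all assumptions chosen for illustration, not something taken from the slides.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity between the character sets of two non-empty strings."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

def threshold_clusters(items, sim, threshold):
    """Single-link grouping: items whose similarity reaches `threshold` are merged."""
    parent = list(range(len(items)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(items)), 2):
        if sim(items[i], items[j]) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i, item in enumerate(items):
        groups.setdefault(find(i), []).append(item)
    return sorted(groups.values())

# Hypothetical strings in the spirit of Data Sample 4.
strings = ["aaabbbccc", "abcabc", "dddfffggg", "dfgdfg", "hhhiiijjj"]
clusters = threshold_clusters(strings, jaccard, threshold=0.5)
print(clusters)
```

Swapping `jaccard` for any other pairwise similarity (edit distance, cosine over term vectors, and so on) leaves the grouping code unchanged, which is the point of the slide.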
K-Means Trap
Defects of K-Means, K-Medoid, K-XXX
How many clusters (K)?
Where are the initial centers?
Do the data really form a sphere?
Do the data really suit Minkowski/Euclidean distance?
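The initial-center question can be illustrated with a small sketch. It assumes plain Lloyd-style K-Means with squared Euclidean distance on four hypothetical 2-D points; the `kmeans` helper and both sets of starting centers are made up for this example.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda k: sq_dist(p, centers[k]))
            for p in points]

def kmeans(points, centers, max_iter=100):
    """Lloyd iterations from the given initial centers; returns centers, labels, SSE."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = []
        for k in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == k]
            if members:
                new_centers.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centers.append(centers[k])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    labels = assign(points, centers)
    sse = sum(sq_dist(p, centers[l]) for p, l in zip(points, labels))
    return centers, labels, sse

# Four corners of a wide rectangle: the natural split is left vs. right.
points = [(0, 0), (0, 1), (4, 0), (4, 1)]
_, labels_good, sse_good = kmeans(points, centers=[(0, 0), (4, 0)])
_, labels_bad, sse_bad = kmeans(points, centers=[(2, 0), (2, 1)])
print(labels_good, sse_good)
print(labels_bad, sse_bad)
```

Both runs converge, but the second start is stuck in a top-versus-bottom split with a much larger within-cluster error, so the result depends on where the centers begin.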
How about these?
What kind of data does K-XXX fit better?
What kind of data do methods relying on distance/similarity computation fit better?
CONVEX data
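A small sketch of why non-convex shapes are hard for centroid-based methods, assuming two hypothetical concentric rings (radii 1 and 5, 12 points each). The two rings share essentially the same centroid, so no pair of centers can represent them, while each point's nearest neighbour still lies on its own ring.

```python
import math

def ring(radius, n=12):
    """n evenly spaced points on a circle of the given radius around the origin."""
    return [(radius * math.cos(2 * math.pi * i / n),
             radius * math.sin(2 * math.pi * i / n)) for i in range(n)]

def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def nearest_is_same_ring(p, own, other):
    """True if the closest other point to p belongs to p's own ring."""
    d_own = min(math.dist(p, q) for q in own if q != p)
    d_other = min(math.dist(p, q) for q in other)
    return d_own < d_other

inner, outer = ring(1.0), ring(5.0)

# Distance between the two ring centroids: numerically zero.
centroid_gap = math.dist(centroid(inner), centroid(outer))

# Local neighbourhood structure still separates the rings.
local_ok = (all(nearest_is_same_ring(p, inner, outer) for p in inner)
            and all(nearest_is_same_ring(p, outer, inner) for p in outer))
print(centroid_gap, local_ok)
```

This is only meant to show the geometry; it does not implement any particular non-convex clustering method.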
Alternative: Gaussian Mixture Model
Why Gaussian? Central limit theorem
Is the central limit theorem always applicable in real-world cases?
1. Parameter Tuning
2. High applicability of Gaussian distribution
How to estimate parameters?
Expectation-Maximization
No closed-form solution
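Since the parameters have no closed-form solution, Expectation-Maximization alternates two steps until the estimates settle. Below is a minimal sketch for a two-component mixture in one dimension; the data, the starting values, the fixed iteration count, and the variance floor are all assumptions made for illustration.

```python
import math

def normal_pdf(x, mu, var):
    """Density of a 1-D Gaussian with mean mu and variance var."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(data, mus, variances, weights, n_iter=50, var_floor=1e-6):
    """EM for a 1-D Gaussian mixture, starting from the given parameters."""
    mus, variances, weights = list(mus), list(variances), list(weights)
    k = len(mus)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            joint = [weights[j] * normal_pdf(x, mus[j], variances[j]) for j in range(k)]
            total = sum(joint)
            resp.append([v / total for v in joint])
        # M-step: re-estimate parameters from the soft assignments.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            mus[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var_floor, var)  # guard against collapse
            weights[j] = nj / len(data)
    return mus, variances, weights

# Hypothetical data: two well-separated groups around 0 and 10.
data = [-0.5, 0.0, 0.5, 9.5, 10.0, 10.5]
mus, variances, weights = em_gmm_1d(data, mus=[1.0, 9.0],
                                    variances=[1.0, 1.0], weights=[0.5, 0.5])
print(mus, variances, weights)
```

On data this clean the means move to roughly 0 and 10; on real data the result depends on the starting values, which is the parameter-tuning cost the slide mentions.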
Alternative: Matrix Factorization
No closed-form solution
'Cause we are not in the department of math
SVD, PMF, NMF, Tensor Factorization…
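As one illustration of factorization used for clustering, here is a sketch of NMF with multiplicative updates for the squared (Frobenius) error, written with plain lists. The document-term matrix, the rank, the seed, and the iteration count are hypothetical; rows of `w` can be read as soft cluster memberships of the documents.

```python
import random

def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def frob_error(v, w, h):
    """Squared Frobenius norm of v - w*h."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))

def nmf(v, rank, n_iter=200, seed=0, eps=1e-9):
    """Multiplicative-update NMF; returns w, h, initial error, final error."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    initial_error = frob_error(v, w, h)
    for _ in range(n_iter):
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(rank)]
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(matmul(w, h), ht)
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(rank)]
             for i in range(n)]
    return w, h, initial_error, frob_error(v, w, h)

# Hypothetical 4 documents x 4 terms with two visible blocks.
v = [[3, 2, 0, 0],
     [2, 3, 0, 0],
     [0, 0, 4, 1],
     [0, 0, 1, 4]]
w, h, err_before, err_after = nmf(v, rank=2)
memberships = [max(range(2), key=lambda a: row[a]) for row in w]
print(err_before, err_after, memberships)
```

The updates are iterative and start from random factors, which matches the slide's "no closed-form solution" remark; different seeds may give different factors.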
Triangle
Is there no perfect method here?
What we probably want
No constraint on the form of the data
No assumption about the data distribution
Closed-form solution
Triangle borrowed from distributed computing
Triangle (Cont.)
I do not know whether such a method exists or not
[Figure: triangle with corners Form, Distribution, and Closed-solution; Hierarchical Clustering (?), GMM/Gaussian Process, K-Means/Medoid, and Matrix Factorization are placed on it, with three positions marked "impossible"]
Multinomial Distribution
Discrete Data (Text)
One document:
(0,0,0,China,0,0,0,0,0,0,0,report,0,0,0,0,0,0,0,0,0,policy,0,0,0,0,0,0,0,meeting,0,0,0,meeting,0,0,0,0,report,0,…)
Multinomial distribution
Clustering Sampling
Markov Chain Monte Carlo
Friendly to sparsity
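One way to see why the multinomial view is friendly to sparsity: the log-likelihood of a document only sums over the terms that actually occur, so the zeros cost nothing. The sketch below scores one sparse document against two hypothetical clusters; the term probabilities, the cluster names, and the `unseen_prob` fallback are invented for illustration, and this is only the scoring step, not an MCMC sampler.

```python
import math
from collections import Counter

def log_likelihood(doc_counts, term_probs, unseen_prob=1e-6):
    """Multinomial log-likelihood up to a constant, summed over non-zero counts only.

    The multinomial coefficient is omitted because it is the same for every cluster.
    """
    return sum(count * math.log(term_probs.get(term, unseen_prob))
               for term, count in doc_counts.items())

# The document from the slide, stored sparsely as term -> count.
doc = Counter(["China", "report", "policy", "meeting", "meeting", "report"])

# Hypothetical per-cluster term distributions.
clusters = {
    "politics": {"China": 0.3, "policy": 0.3, "report": 0.2, "meeting": 0.2},
    "academia": {"paper": 0.4, "conference": 0.3, "chair": 0.2, "report": 0.1},
}

scores = {name: log_likelihood(doc, probs) for name, probs in clusters.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

The loop touches four entries regardless of vocabulary size, whereas a dense distance computation would have to walk the whole vector.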
Sparsity
Sparsity brings a lot of problems
Also in clustering. What can we do?
➢ Ensemble Learning (Ensemble clustering)
➢ Missing values pre-filling
➢ Tuning ☺
➢ …
10000 words 1 term
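For the ensemble-clustering option, a common idea is to combine several base clusterings through a co-association matrix: the share of clusterings in which two items land together. The sketch below assumes three hypothetical labelings of five items and a majority threshold of 0.5; the `consensus` helper is illustrative, not a specific published algorithm.

```python
from itertools import combinations

def co_association(labelings):
    """For every item pair, the fraction of labelings that put them in one cluster."""
    n = len(labelings[0])
    return {(i, j): sum(l[i] == l[j] for l in labelings) / len(labelings)
            for i, j in combinations(range(n), 2)}

def consensus(labelings, threshold=0.5):
    """Merge items that co-occur in more than `threshold` of the labelings."""
    n = len(labelings[0])
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for (i, j), share in co_association(labelings).items():
        if share > threshold:
            parent[find(i)] = find(j)

    # Relabel roots as 0, 1, 2, ... in order of first appearance.
    relabel = {}
    return [relabel.setdefault(find(i), len(relabel)) for i in range(n)]

# Three hypothetical base clusterings that disagree on items 2 and 4.
labelings = [
    [0, 0, 1, 1, 1],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 1, 1],
]
final_labels = consensus(labelings)
print(final_labels)
```

Label names differ between the base clusterings, but only co-membership is used, so the disagreements on single items are outvoted.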
Reference
My previous tutorials/notes (ZJU/UIC/Netease/ITRZJU, as a Ph.D. student)
‘Random Thoughts in Clustering’
‘Non-parametric Bayesian learning in discrete data’
‘The research of topic modeling in text mining’
‘Matrix factorization with user generated content’
…, etc.
Website
You can download all my slides
➢ http://web.xidian.edu.cn/ysxu/teach.html
➢ http://liu.cs.uic.edu/yueshenxu/
➢ http://www.slideshare.net/obamaxys2011
➢ https://www.researchgate.net/profile/Yueshen_Xu