Post on 21-Apr-2017


Thinking in (Text) Clustering

(No math, be not afraid)

Yueshen Xu (lecturer)

[email protected] / [email protected]

Data and Knowledge Engineering Research Center

Xidian University

Text Mining & NLP & ML

Software Engineering, 2017/4/13

Outline

Background

What can be clustered?

Problems in K-XXX (Means/Medoid/Center…)

Similarity Measure

Convex and Concave

Problems in Gaussian Mixture Model

Problems in Matrix Factorization

Multinomial and Sparsity

Keywords: Clustering, K-Means/Medoid, Similarity Computation, GMM, MF, Multinomial Distribution

Basics, not state-of-the-art

Background

Information Overload

We need:

Summarization

Visualization

Dimensionality Reduction

Big Data, Cloud Computing, Artificial Intelligence, Deep Learning, …, etc.


Background

Dimensionality Reduction (DR)

Clustering

Text Clustering, Webpage Clustering, Image Clustering…

Summarization

Document Summarization, Image Summarization…

Factorization

Rating Matrix Factorization, Image Non-negative Factorization

Basic requirements: Automatic, Applicable, Explainable

Clustering (Text)

Related Research Areas

Dimensionality Reduction (DR)

Text Mining

Natural Language Processing

Computational Linguistics

Information Retrieval

Artificial Intelligence

(Text) Clustering

Some Concepts

[Figure: diagram relating the fields Information Retrieval, Computational Linguistics, Natural Language Processing, LSA/Topic Model, Text Mining, DR, Data Mining, Artificial Intelligence, Machine Learning, Machine Translation, and (Text) Clustering]

We all know what (text) clustering is, right?

Widely-accepted topic, since everyone knows it

What can be clustered?

Data Sample 1: (1.2, 1.4, 2.234, 3.231), (8.2, 6.4, 4.243, 5.41), (5.234, 3.56, 4.454, 6.78)

Data Sample 2: (1), (0), (1), (0), (1), (1), (1), (0), (1), (0)

Data Sample 3: (China, modern, people, gov.), (policy, paper, conference, chair), (report, solution, UN, UK)

Data Sample 4: (aaabbbccc), (dddfffggg), (hhhiiiijjj)

Data Sample 5: (▲▼♦), (♣♠█), (■□●)

Is there anything that cannot be clustered?

Yes, but not related to us

What can be clustered?

Anything which a similarity measure can be defined over

Matrix, topology

All kinds of data can be clustered
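
To make "anything which a similarity measure can be defined over" concrete, here is a minimal sketch with one possible similarity per kind of sample shown earlier. The three function names and the specific choices (inverse Euclidean distance, Jaccard overlap, character bigrams) are illustrative assumptions, not measures prescribed by these slides.

```python
import math

def euclidean_similarity(a, b):
    """Numeric vectors: 1 / (1 + Euclidean distance), so identical vectors score 1."""
    return 1.0 / (1.0 + math.dist(a, b))

def jaccard_similarity(a, b):
    """Sets of words: shared items divided by all items."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 1.0

def bigram_similarity(a, b):
    """Strings: Jaccard overlap of their character bigrams."""
    def bigrams(s):
        return {s[i:i + 2] for i in range(len(s) - 1)}
    return jaccard_similarity(bigrams(a), bigrams(b))

# One pair from each of Data Samples 1, 3 and 4.
print(euclidean_similarity((1.2, 1.4, 2.234, 3.231), (8.2, 6.4, 4.243, 5.41)))
print(jaccard_similarity(["China", "modern", "people", "gov."],
                         ["policy", "paper", "conference", "chair"]))
print(bigram_similarity("aaabbbccc", "dddfffggg"))
```

Once such a function exists, any pairwise-similarity-based clustering method can, in principle, be run on the data.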

K-Means Trap

Defects of K-Means, K-Medoid, K-XXX

How many clusters (K)?

Where are the initial centers?

Do the data really form a sphere?

Do the data really suit Minkowski/Euclidean distance?
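
The "initial centers" question can be seen on a toy example. Below is a minimal sketch of Lloyd's algorithm on 1-D data; the function name `kmeans_1d`, the nine data points, and both sets of starting centers are made up for illustration.

```python
def kmeans_1d(points, centers, iterations=100):
    """Lloyd's algorithm on 1-D data, starting from the given centers."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its cluster.
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    # Sum of squared distances from each point to its cluster center.
    sse = sum((p - centers[i]) ** 2 for i, c in enumerate(clusters) for p in c)
    return centers, sse

points = [0, 1, 2, 10, 11, 12, 20, 21, 22]      # three obvious groups
good_centers, good_sse = kmeans_1d(points, [1, 11, 21])
bad_centers, bad_sse = kmeans_1d(points, [0, 1, 2])
print(good_centers, good_sse)
print(bad_centers, bad_sse)
```

Same data, same K, same algorithm: the second start converges to a much worse local optimum because all three initial centers sit inside one group.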

How about these?

What kind of data does K-XXX fit better?

What kind of data do methods relying on distance-based similarity fit better?

CONVEX
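
One way to see why convex shapes suit center-based methods: for a convex blob the centroid lies inside the data, while for a ring it falls into the empty hole. The sketch below is a hand-made illustration; the helper names and the two toy shapes are assumptions, not data from the talk.

```python
import math

def ring(radius, n):
    """n points evenly spaced on a circle: a non-convex, hollow shape."""
    return [(radius * math.cos(2 * math.pi * k / n),
             radius * math.sin(2 * math.pi * k / n)) for k in range(n)]

def centroid(points):
    """Arithmetic mean of 2-D points, i.e. what K-Means uses as a center."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

def distance_to_nearest_point(center, points):
    return min(math.dist(center, p) for p in points)

blob = [(x, y) for x in (-1, 0, 1) for y in (-1, 0, 1)]  # filled 3x3 grid
hoop = ring(radius=5.0, n=60)                            # hollow circle

blob_gap = distance_to_nearest_point(centroid(blob), blob)
hoop_gap = distance_to_nearest_point(centroid(hoop), hoop)
print(blob_gap)  # the centroid coincides with a data point
print(hoop_gap)  # the centroid is about one radius away from every point
```

A center that is far from every member of its own cluster is a poor representative, which is one reason K-XXX struggles with ring-like or concave clusters.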

Alternative: Gaussian Mixture Model

Why Gaussian? Central limit theorem

Is the central limit theorem always applicable in real-world cases?

1. Parameter Tuning

2. High applicability of Gaussian distribution

How to estimate parameters?

Expectation-Maximization

No closed-form solution
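
Because there is no closed-form solution, the parameters are found by iterating. The following is a minimal sketch of Expectation-Maximization for a 1-D mixture of Gaussians; the function names, the ten data points, and the starting values are illustrative assumptions.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(data, means, variances, weights, iterations=50):
    """EM for a 1-D Gaussian mixture, starting from the given parameters."""
    means, variances, weights = list(means), list(variances), list(weights)
    k = len(means)
    for _ in range(iterations):
        # E-step: soft assignment (responsibility) of each point to each component.
        resp = []
        for x in data:
            dens = [weights[j] * normal_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate each component from its weighted points.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # floor to avoid a collapsing component
            weights[j] = nj / len(data)
    return means, variances, weights

data = [0.8, 1.0, 1.2, 0.9, 1.1, 4.9, 5.0, 5.1, 5.2, 4.8]
means, variances, weights = em_gmm_1d(data, means=[0.0, 6.0],
                                      variances=[1.0, 1.0], weights=[0.5, 0.5])
print(means, variances, weights)
```

The number of components, the starting values, and the variance floor are all things that still have to be chosen by hand, which is the "parameter tuning" cost mentioned above.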

Alternative: Matrix Factorization

No closed-form solution

’Cause we are not in the department of math

SVD, PMF, NMF, Tensor Factorization…
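
In practice the factors are fitted iteratively. Here is a minimal sketch of a plain factorization R ≈ P·Qᵀ trained by stochastic gradient descent on squared error; it is a generic illustration rather than PMF or NMF specifically, and the function names, the toy matrix, and the hyperparameters are assumptions.

```python
import random

def factorize(R, rank=2, steps=2000, lr=0.01, seed=0):
    """Approximate R by P times Q-transposed using SGD on squared error."""
    rng = random.Random(seed)
    n, m = len(R), len(R[0])
    P = [[rng.uniform(0.1, 0.5) for _ in range(rank)] for _ in range(n)]
    Q = [[rng.uniform(0.1, 0.5) for _ in range(rank)] for _ in range(m)]
    for _ in range(steps):
        for i in range(n):
            for j in range(m):
                err = R[i][j] - sum(P[i][f] * Q[j][f] for f in range(rank))
                for f in range(rank):
                    p, q = P[i][f], Q[j][f]
                    P[i][f] += lr * err * q  # gradient step on the row factor
                    Q[j][f] += lr * err * p  # gradient step on the column factor
    return P, Q

def squared_error(R, P, Q):
    return sum((R[i][j] - sum(p * q for p, q in zip(P[i], Q[j]))) ** 2
               for i in range(len(R)) for j in range(len(R[0])))

# Hypothetical 4x4 matrix with two obvious blocks (e.g. documents x terms).
R = [[5, 5, 0, 0],
     [5, 5, 0, 0],
     [0, 0, 4, 4],
     [0, 0, 4, 4]]

err_before = squared_error(R, *factorize(R, steps=0))  # random start only
P, Q = factorize(R)
err_after = squared_error(R, P, Q)
print(err_before, err_after)
```

Rows of `P` that end up close to each other are candidates for the same cluster; the rank, learning rate, and number of steps are again tuning knobs.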

Triangle

Is there no perfect method here?

What we probably want

No constraint on the form of data

No assumption about the data distribution

Closed-form solution

Triangle borrowed from distributed computing

Triangle (Cont.)

I do not know whether such a method exists or not

[Figure: triangle whose corners are Form, Distribution, and Closed-form solution; Hierarchical Clustering?, GMM/Gaussian Process, K-Means/Medoid, and Matrix Factorization are placed on it, and three positions are marked "impossible"]


Multinomial Distribution

Discrete Data (Text)

One document:

(0,0,0,China,0,0,0,0,0,0,0,report,0,0,0,0,0,0,0,0,0,policy,0,0,0,0,0,0,0,meeting,0,0,0,meeting,0,0,0,0,report,0,…)

Multinomial distribution

Clustering by sampling

Markov Chain Monte Carlo

Friendly to sparsity
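
A minimal sketch of why a multinomial view is friendly to sparsity: the log-likelihood of a document only sums over the words that actually occur, so the zeros cost nothing. The two clusters, their word probabilities, and the `floor` smoothing value are hypothetical, and the sketch scores a document against fixed clusters rather than running MCMC.

```python
import math
from collections import Counter

def multinomial_log_likelihood(counts, word_probs, floor=1e-6):
    """Log-probability of a bag of words, up to the multinomial coefficient.

    The coefficient depends only on the document, so dropping it does not
    change which cluster scores highest. Unseen words get a small floor
    probability (a crude, assumed form of smoothing).
    """
    return sum(c * math.log(word_probs.get(w, floor)) for w, c in counts.items())

# The document above, stored sparsely: only the non-zero entries are kept.
doc = Counter(["China", "report", "policy", "meeting", "meeting", "report"])

# Hypothetical per-cluster word distributions.
clusters = {
    "politics": {"China": 0.3, "policy": 0.3, "report": 0.2, "meeting": 0.2},
    "sports": {"match": 0.5, "team": 0.3, "report": 0.1, "meeting": 0.1},
}

scores = {name: multinomial_log_likelihood(doc, probs)
          for name, probs in clusters.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

A sampler such as Gibbs sampling would repeatedly reassign documents using scores of this kind while also re-estimating the word distributions.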

Sparsity

Sparsity brings a lot of problems

Also in clustering

What can we do?

➢ Ensemble Learning (Ensemble clustering)

➢ Pre-filling missing values

➢ Tuning ☺

➢ …

10000 words 1 term

Reference

My previous tutorials/notes (ZJU/UIC/Netease/ITRZJU, as a Ph.D. student)

‘Random Thoughts in Clustering’

‘Non-parametric Bayesian learning in discrete data’

‘The research of topic modeling in text mining’

‘Matrix factorization with user generated content’

…, etc.

Website

You can download all slides of mine

➢ http://web.xidian.edu.cn/ysxu/teach.html

➢ http://liu.cs.uic.edu/yueshenxu/

➢ http://www.slideshare.net/obamaxys2011

➢ https://www.researchgate.net/profile/Yueshen_Xu

Q&A