(semi-)nonnegative matrix factorization and k-mean clustering


C. Ding, NMF => Unsupervised Clustering 1

(Semi-)Nonnegative Matrix Factorization and

K-mean Clustering

Chris Ding, Lawrence Berkeley National Laboratory

with: Xiaofeng He (Lawrence Berkeley Nat'l Lab), Horst Simon (Lawrence Berkeley Nat'l Lab), Tao Li (Florida Int'l Univ.), Michael Jordan (UC Berkeley), Haesun Park (Georgia Tech)


Nonnegative Matrix Factorization (NMF)

Data matrix, n points in p-dim: $X = (x_1, x_2, \dots, x_n)$; each $x_i$ is an image, document, webpage, etc.

Decomposition (low-rank approximation): $X \approx F G^T$

Nonnegative matrices: $X_{ij} \ge 0$, $F_{ij} \ge 0$, $G_{ij} \ge 0$

$F = (f_1, f_2, \dots, f_k), \qquad G = (g_1, g_2, \dots, g_k)$
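As a concrete sketch (not part of the slides): such a factorization can be computed with the classic Lee-Seung multiplicative updates mentioned on the next slide. Data, rank, and iteration count below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((30, 20))       # p=30 dims, n=20 points, all entries nonnegative
k = 4                          # number of factors

F = rng.random((30, k))        # nonnegative random initialization
G = rng.random((20, k))
err_init = np.linalg.norm(X - F @ G.T)

for _ in range(500):
    # Lee-Seung multiplicative updates for min ||X - F G^T||_F^2;
    # the small epsilon guards against division by zero
    F *= (X @ G) / (F @ (G.T @ G) + 1e-9)
    G *= (X.T @ F) / (G @ (F.T @ F) + 1e-9)

err = np.linalg.norm(X - F @ G.T)   # far below err_init after the updates
```

The updates multiply by nonnegative ratios, so $F, G$ stay elementwise nonnegative throughout, which is the whole point of NMF.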


Some historical notes

• Earlier work by statistics people (G. Golub)
• P. Paatero (1994), Environmetrics
• Lee and Seung (1999, 2000)
  – Parts of whole (no cancellation)
  – A multiplicative update algorithm

Pixel vector (figure: a grayscale image unrolled into a single nonnegative column vector)


Lee and Seung (1999): Parts-based Perspective

$$X \approx F G^T, \qquad X = (x_1, x_2, \dots, x_n), \quad F = (f_1, \dots, f_k), \quad G = (g_1, \dots, g_k)$$

(figure: original images vs. parts-based factors)


"Parts of Whole" Picture

$$X \approx F G^T, \qquad F = (f_1, f_2, \dots, f_k)$$

Straightforward NMF doesn't give the parts-based picture. Several groups explicitly sparsify F to obtain a parts-based picture (Li et al., 2001; Hoyer, 2003). Donoho & Stodden (2003) study conditions for parts-of-whole.


Meanwhile … a number of studies empirically show the usefulness of NMF for pattern discovery/clustering: Xu et al. (SIGIR'03), Brunet et al. (PNAS'04), and many others.

We claim:

NMF factors give holistic pictures of the data


Our Experiments: NMF gives holistic pictures



Task: Prove NMF is doing "Data Clustering"

NMF => K-means Clustering

NMF-Kmeans Theorem

G-orthogonal NMF,

$$\min_{F \ge 0,\; G \ge 0,\; G^T G = I} \|X - F G^T\|^2,$$

is equivalent to relaxed K-means clustering.

Proof. With G fixed, the optimal $F = XG$, which reduces the problem to

$$\min_{G \ge 0,\; G^T G = I} \mathrm{Tr}\big(X^T X - G^T X^T X G\big)$$

(Ding, He, Simon, SDM 2005)


K-means clustering

• Also called "isodata", "vector quantization"
• Developed in the 1960's (Lloyd, MacQueen, Hartigan, etc.)
• Computationally efficient (order mN)
• Most widely used in practice
  – Benchmark to evaluate other algorithms

Given n points in m-dim: $X = (x_1, x_2, \dots, x_n)^T$

K-means objective:

$$\min J_K = \sum_{k=1}^{K} \sum_{i \in C_k} \|x_i - c_k\|^2$$
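A minimal Lloyd-iteration sketch of this objective (illustrative code, not from the talk; the two-blob demo data and all parameters are made up):

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Lloyd's algorithm for min J_K = sum_k sum_{i in C_k} ||x_i - c_k||^2.
    X: (n, m) array -- n points in m dimensions."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assignment step: each point goes to its nearest centroid
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # update step: each centroid becomes its cluster mean
        # (keep the old centroid if a cluster happens to be empty)
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels, centers

# demo: two well-separated blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (5, 2)), rng.normal(5, 0.1, (5, 2))])
labels, centers = kmeans(X, 2)
```

Each iteration can only decrease $J_K$, which is why Lloyd's algorithm converges (to a local optimum).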


Reformulate K-means Clustering

$$J_K = \sum_i \|x_i\|^2 - \sum_{k=1}^{K} \frac{1}{n_k} \sum_{i,j \in C_k} x_i^T x_j$$

Cluster membership indicators:

$$h_k = (0,\dots,0,\,1,\dots,1,\,0,\dots,0)^T / n_k^{1/2}$$

(with $n_k$ ones marking the members of cluster $C_k$)

$$J_K = \sum_i \|x_i\|^2 - \sum_{k=1}^{K} h_k^T X^T X h_k, \qquad H = (h_1, \dots, h_K)$$

Solving K-means =>

$$\max_{H \ge 0,\; H^T H = I} \mathrm{Tr}\big(H^T X^T X H\big)$$

(Zha, Ding, Gu, He, Simon, NIPS 2001) (Ding & He, ICML 2004)


Reformulate K-means Clustering

Cluster membership indicators, e.g. for 7 points in 3 clusters, $H = (h_1, h_2, h_3)$ with

$$h_1^T = (1\,1\,0\,0\,0\,0\,0), \quad h_2^T = (0\,0\,1\,1\,1\,0\,0), \quad h_3^T = (0\,0\,0\,0\,0\,1\,1)$$

marking clusters $C_1, C_2, C_3$ (up to the $1/\sqrt{n_k}$ normalization).
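The trace identity is easy to check numerically. A small sketch with made-up data, using the indicators above with the $1/\sqrt{n_k}$ normalization:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 8))            # 8 points in 3 dims, stored as columns
labels = np.array([0, 0, 0, 1, 1, 2, 2, 2])
K = 3

# normalized cluster indicators h_k = (0..0,1..1,0..0)^T / sqrt(n_k)
H = np.zeros((8, K))
for k in range(K):
    idx = labels == k
    H[idx, k] = 1.0 / np.sqrt(idx.sum())

# direct K-means objective: sum of within-cluster squared distances to the mean
J = sum(np.sum((X[:, labels == k]
                - X[:, labels == k].mean(axis=1, keepdims=True)) ** 2)
        for k in range(K))

# trace form: J_K = sum_i ||x_i||^2 - Tr(H^T X^T X H)
J_trace = np.sum(X ** 2) - np.trace(H.T @ X.T @ X @ H)
```

The two values agree exactly, since $h_k^T X^T X h_k = n_k \|c_k\|^2$.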

NMF-Kmeans Theorem

G-orthogonal NMF,

$$\min_{F \ge 0,\; G \ge 0,\; G^T G = I} \|X - F G^T\|^2,$$

is equivalent to relaxed K-means clustering.

Proof. With G fixed, the optimal $F = XG$, which reduces the problem to

$$\min_{G \ge 0,\; G^T G = I} \mathrm{Tr}\big(X^T X - G^T X^T X G\big)$$

(Ding, He, Simon, SDM 2005)


Kernel K-means Clustering

Map the feature vectors to a higher-dimensional space: $x_i \to \phi(x_i)$.

Kernel K-means objective:

$$\min J_K^\phi = \sum_{k=1}^{K} \sum_{i \in C_k} \|\phi(x_i) - \phi(c_k)\|^2, \qquad \phi(c_k) \equiv \frac{1}{n_k} \sum_{i \in C_k} \phi(x_i)$$

Kernel K-means optimization:

$$J_K^\phi = \sum_i \|\phi(x_i)\|^2 - \sum_{k=1}^{K} \frac{1}{n_k} \sum_{i,j \in C_k} \phi(x_i)^T \phi(x_j)$$

$$\max \sum_{k=1}^{K} \frac{1}{n_k} \sum_{i,j \in C_k} \langle \phi(x_i), \phi(x_j) \rangle = \mathrm{Tr}\big(H^T W H\big)$$

where $W$ is the matrix of pairwise similarities, $W_{ij} = \langle \phi(x_i), \phi(x_j) \rangle$.


Symmetric NMF: $W \approx H H^T$, where $W$ is a symmetric nonnegative matrix.

$$\max_{H \ge 0,\; H^T H = I} \mathrm{Tr}\big(H^T W H\big)$$

is equivalent to the symmetric NMF problem

$$\min_{H \ge 0,\; H^T H = I} \|W - H H^T\|^2$$

Orthogonal symmetric NMF is equivalent to Kernel K-means clustering.
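The equivalence rests on the expansion $\|W - HH^T\|^2 = \|W\|^2 - 2\,\mathrm{Tr}(H^TWH) + \|H^TH\|^2$: with $H^TH = I$ fixed, minimizing the left side is the same as maximizing the trace. A quick numerical check (made-up similarity matrix and hard-cluster H):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((6, 6))
W = A + A.T                          # symmetric nonnegative similarity matrix

# an H with orthonormal columns: a normalized hard-cluster indicator
labels = np.array([0, 0, 1, 1, 2, 2])
H = np.zeros((6, 3))
for k in range(3):
    H[labels == k, k] = 1 / np.sqrt(2)

lhs = np.linalg.norm(W - H @ H.T) ** 2
# expansion: ||W - HH^T||^2 = ||W||^2 - 2 Tr(H^T W H) + ||H^T H||^2
rhs = (np.linalg.norm(W) ** 2
       - 2 * np.trace(H.T @ W @ H)
       + np.linalg.norm(H.T @ H) ** 2)
```

Since $\|W\|^2$ and $\|H^TH\|^2 = \|I\|^2$ are constants under the constraint, only the trace term varies.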


Orthogonality in NMF

Strictly orthogonal G: hard clustering.

Non-orthogonal G: soft clustering for ambiguous/outlier points.

(figure: data $X = (x_1, x_2, \dots, x_n)$ with two overlapping clusters, $H = (h_1, h_2)$)


K-means Clustering Theorem

$$\min_{G \ge 0,\; G^T G = I} \|X_\pm - F_\pm G_+^T\|^2$$

$F = (f_1, f_2, \dots, f_k)$ => cluster centroids
$G = (g_1, g_2, \dots, g_k)$ => cluster indicators

G-orthogonal NMF is equivalent to relaxed K-means clustering. The proof requires only G-orthogonality and nonnegativity of G.

(Ding, Li, Jordan, 2006)


NMF Generalizations

SVD: $X_\pm = F_\pm G_\pm^T = U \Sigma V^T$

Semi-NMF: $X_\pm = F_\pm G_+^T$ (Ding, Li, Jordan, 2006)

Convex-NMF: $X_\pm = X_\pm W_+ G_+^T$ (Ding, Li, Jordan, 2006)

Kernel-NMF: $\phi(X)_\pm = \phi(X)_\pm W_+ G_+^T$

Tri-NMF: $X_\pm = F_+ S_\pm G_+^T$ (Ding, Li, Peng, Park, KDD 2006)


Semi-NMF: $X_\pm = F_\pm G_+^T$

• For any mixed-sign input data (e.g. centered data)
• Clustering and low-rank approximation

$$\min \|X - F G^T\|^2$$

Update F (least squares): $F = X G (G^T G)^{-1}$

Update G (multiplicative, keeps $G \ge 0$):

$$G_{ik} \leftarrow G_{ik} \sqrt{\frac{(X^T F)^+_{ik} + [G (F^T F)^-]_{ik}}{(X^T F)^-_{ik} + [G (F^T F)^+]_{ik}}}$$

where $A^+ = (|A| + A)/2$ and $A^- = (|A| - A)/2$ are the elementwise positive and negative parts.

(Ding, Li, Jordan, 2006)
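A runnable sketch of this alternating scheme (random mixed-sign data; the epsilon and iteration count are arbitrary choices, not from the slides):

```python
import numpy as np

def pos(A): return (np.abs(A) + A) / 2   # elementwise positive part
def neg(A): return (np.abs(A) - A) / 2   # elementwise negative part (>= 0)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 12))             # mixed-sign data: 12 points in 5 dims
k = 3
G = rng.random((12, k)) + 0.1            # strictly positive initialization

errs = []
for _ in range(200):
    # Update F: unconstrained least squares, F = X G (G^T G)^{-1}
    F = X @ G @ np.linalg.inv(G.T @ G)
    # Update G: multiplicative rule, preserves G >= 0
    XtF = X.T @ F
    FtF = F.T @ F
    G = G * np.sqrt((pos(XtF) + G @ neg(FtF)) /
                    (neg(XtF) + G @ pos(FtF) + 1e-12))
    errs.append(np.linalg.norm(X - F @ G.T))
```

The paper proves this alternation monotonically decreases $\|X - FG^T\|^2$; the ratio inside the square root is nonnegative by construction, so G never goes negative.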


Convex-NMF

In NMF, $X_+ = F_+ G_+^T$; in Semi-NMF, $X_\pm = F_\pm G_+^T$. In Semi-NMF, $F$ ranges over a large space.

For the factor $f_k$ to capture the notion of a cluster centroid, require $f_k$ to be a convex combination of the input data:

$$f_k = w_{1k} x_1 + \cdots + w_{nk} x_n, \qquad F = X W_+$$

Convex-NMF: $X_\pm = X_\pm W_+ G_+^T$

For F interpretability one may also take $F = X W_\pm$ (an affine combination of the data).

(Ding, Li, Jordan, 2006)


Convex-NMF: computing algorithm

$$X_\pm = X_\pm W_+ G_+^T, \qquad \min \|X - X W G^T\|^2$$

Update W:

$$W_{ik} \leftarrow W_{ik} \sqrt{\frac{[(X^T X)^+ G]_{ik} + [(X^T X)^- W G^T G]_{ik}}{[(X^T X)^- G]_{ik} + [(X^T X)^+ W G^T G]_{ik}}}$$

Update G:

$$G_{ik} \leftarrow G_{ik} \sqrt{\frac{[(X^T X)^+ W]_{ik} + [G W^T (X^T X)^- W]_{ik}}{[(X^T X)^- W]_{ik} + [G W^T (X^T X)^+ W]_{ik}}}$$

(Ding, Li, Jordan, 2006)
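A matching sketch of the Convex-NMF updates (illustrative random data; note that all dependence on X enters through $X^TX$, which is what enables the kernel variant on a later slide):

```python
import numpy as np

def pos(A): return (np.abs(A) + A) / 2   # elementwise positive part
def neg(A): return (np.abs(A) - A) / 2   # elementwise negative part (>= 0)

rng = np.random.default_rng(1)
X = rng.normal(size=(4, 10))             # mixed-sign data, 10 points as columns
k = 2
Y = X.T @ X                              # the only X-dependence in the updates
W = rng.random((10, k)) + 0.1            # F = X W: convex-combination weights
G = rng.random((10, k)) + 0.1

errs = []
for _ in range(300):
    GtG = G.T @ G
    W = W * np.sqrt((pos(Y) @ G + neg(Y) @ W @ GtG) /
                    (neg(Y) @ G + pos(Y) @ W @ GtG + 1e-12))
    G = G * np.sqrt((pos(Y) @ W + G @ (W.T @ neg(Y) @ W)) /
                    (neg(Y) @ W + G @ (W.T @ pos(Y) @ W) + 1e-12))
    errs.append(np.linalg.norm(X - X @ W @ G.T))
```

Both factors stay nonnegative, and the residual $\|X - XWG^T\|$ decreases as the alternation proceeds.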

(figures: Semi-NMF factors vs. Convex-NMF factors)

Sparsity of Convex-NMF

- Sparse factorization is a recent trend.
- Sparsity is usually explicitly enforced.
- Convex-NMF factors are naturally sparse.

$$\|X - X W G^T\|_F^2 = \|X (I - W G^T)\|_F^2 = \sum_k \sigma_k^2 \|v_k^T (I - W G^T)\|^2$$

where $X = \sum_k \sigma_k u_k v_k^T$ is the SVD. Consider instead

$$\sum_k \|e_k^T (I - W G^T)\|^2$$

Its solution is $G = W = (e_1, \dots, e_k)$, i.e. columns of the identity. From this we infer that Convex-NMF factors are naturally sparse.


A Simple Example

$$\underbrace{x\,x\,x\,x\,x\,x\,x\,x}_{\text{cluster 1}} \qquad \underbrace{x\,x\,x\,x\,x\,x\,x\,x}_{\text{cluster 2}}$$

$$\|F_{semi} - C_{Kmeans}\| = 0.53, \qquad \|F_{convex} - C_{Kmeans}\| = 0.08$$

$$\|X - F G^T\|: \quad \text{SVD: } 0.27940, \quad \text{Semi: } 0.27944, \quad \text{Convex: } 0.30877$$


Experiments on 7 datasets

NMF variants always perform better than K-means


Kernel NMF -- Generalized Convex NMF

Map the feature vectors to a higher-dimensional space: $x_i \to \phi(x_i)$, with $\phi(X) = [\phi(x_1), \phi(x_2), \dots, \phi(x_n)]$.

NMF/semi-NMF $\phi(X) = F G^T$ depends on the explicit mapping function $\phi(\cdot)$.

Kernel NMF: $\phi(X) = [\phi(X) W]\, G^T$

The minimization objective depends on the kernel only:

$$\|\phi(X) - \phi(X) W G^T\|^2 = \mathrm{Tr}\big[(I - G W^T)\, \phi(X)^T \phi(X)\, (I - W G^T)\big]$$

(Ding & He, ICML 2004)



NMF and PLSI: Equivalence

So far we have used only the Frobenius norm as the NMF objective function. Another objective is the KL divergence.



Kernel-NMF Algorithm

Writing $K = \phi(X)^T \phi(X)$ for the kernel matrix:

Update W:

$$W_{ik} \leftarrow W_{ik} \sqrt{\frac{[K^+ G]_{ik} + [K^- W G^T G]_{ik}}{[K^- G]_{ik} + [K^+ W G^T G]_{ik}}}$$

Update G:

$$G_{ik} \leftarrow G_{ik} \sqrt{\frac{[K^+ W]_{ik} + [G W^T K^- W]_{ik}}{[K^- W]_{ik} + [G W^T K^+ W]_{ik}}}$$

These are the Convex-NMF updates with $X^T X$ replaced by $K$: the computing algorithm depends only on the kernel $\langle \phi(X), \phi(X) \rangle$.

(Ding, Li, Jordan, 2006)
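Since the updates touch the data only through the Gram matrix, swapping in any kernel gives Kernel-NMF with the same code; a sketch using an RBF kernel (bandwidth and data are illustrative assumptions):

```python
import numpy as np

def pos(A): return (np.abs(A) + A) / 2   # elementwise positive part
def neg(A): return (np.abs(A) - A) / 2   # elementwise negative part (>= 0)

rng = np.random.default_rng(2)
X = rng.normal(size=(3, 9))              # 9 points in 3 dims, as columns

# RBF kernel standing in for X^T X: K_ij = exp(-||x_i - x_j||^2 / 2)
sq = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)
K = np.exp(-sq / 2.0)

k = 2
W = rng.random((9, k)) + 0.1
G = rng.random((9, k)) + 0.1

def obj():   # kernelized residual Tr[(I - W G^T)^T K (I - W G^T)]
    M = np.eye(9) - W @ G.T
    return np.trace(M.T @ K @ M)

obj0 = obj()
for _ in range(300):
    GtG = G.T @ G
    W = W * np.sqrt((pos(K) @ G + neg(K) @ W @ GtG) /
                    (neg(K) @ G + pos(K) @ W @ GtG + 1e-12))
    G = G * np.sqrt((pos(K) @ W + G @ (W.T @ neg(K) @ W)) /
                    (neg(K) @ W + G @ (W.T @ pos(K) @ W) + 1e-12))
obj1 = obj()
```

For an RBF kernel $K \ge 0$, so the negative parts vanish and the updates simplify, but the code works for any symmetric kernel matrix.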


Orthogonal Nonnegative Tri-Factorization

$$\min_{F \ge 0,\; F^T F = I,\; G \ge 0,\; G^T G = I} \|X_\pm - F_+ S_\pm G_+^T\|^2$$

3-factor NMF with explicit orthogonality constraints => simultaneous K-means clustering of rows and columns.

$F = (f_1, f_2, \dots, f_k)$ => row cluster indicators
$G = (g_1, g_2, \dots, g_k)$ => column cluster indicators

1. The solution is unique.
2. It can't be reduced to (2-factor) NMF.

(Ding, Li, Peng, Park, KDD 2006)


K-means clustering objective function

$X = (x_1, x_2, \dots, x_n)$ = input data
$F = (f_1, f_2, \dots, f_k)$ = cluster centroids
$G = (g_1, g_2, \dots, g_k)$ = cluster indicators

$$J_K = \sum_{k=1}^{K} \sum_{i \in C_k} \|x_i - f_k\|^2 = \sum_{i=1}^{n} \Big\|x_i - \sum_k g_{ik} f_k\Big\|^2 = \|X - F G^T\|^2$$

With $f_k = X g_k / n_k$, i.e. $F = X G D_n^{-1}$, $D_n = \mathrm{diag}(n_1, \dots, n_k)$:

$$J_K = \|X - X G D_n^{-1} G^T\|^2 = \|X - X \tilde G \tilde G^T\|^2, \qquad \tilde G = G D_n^{-1/2}, \quad \tilde G^T \tilde G = I$$

NMF-like algorithms are different ways to relax F, G!
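The identity $J_K = \|X - FG^T\|^2$ with 0/1 indicator columns and centroid columns can be checked directly (toy data, made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 7))                 # 7 points in 2 dims, as columns
labels = np.array([0, 0, 1, 1, 1, 2, 2])
k = 3

G = np.zeros((7, k))                        # 0/1 cluster indicator matrix
G[np.arange(7), labels] = 1.0
# centroid matrix: column j is the mean of cluster j
F = np.column_stack([X[:, labels == j].mean(axis=1) for j in range(k)])

# direct K-means objective
J_direct = sum(np.sum((X[:, labels == j] - F[:, [j]]) ** 2) for j in range(k))
# matrix form: column i of F G^T is the centroid of the cluster containing x_i
J_matrix = np.linalg.norm(X - F @ G.T) ** 2
```

The agreement is exact because $\sum_k g_{ik} f_k$ picks out exactly the centroid of point $i$'s cluster.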


NMF and PLSI

NMF objective functions:

• Frobenius norm
• KL-divergence:

$$J_{NMF\text{-}KL} = \sum_{i=1}^{m} \sum_{j=1}^{n} \Big[ x_{ij} \log \frac{x_{ij}}{(F G^T)_{ij}} - x_{ij} + (F G^T)_{ij} \Big]$$

Probabilistic LSI (Hofmann, 1999) is a latent variable model for clustering:

$$J_{PLSI} = \sum_{i=1}^{m} \sum_{j=1}^{n} x(w_i, d_j) \log p(w_i, d_j), \qquad p(w_i, d_j) = \sum_k p(w_i | z_k)\, p(z_k)\, p(d_j | z_k)$$

We can show $J_{PLSI} = -J_{NMF\text{-}KL} + \text{constant}$.

(Ding, Li, Peng, AAAI 2006)


Summary

• NMF is doing K-means clustering (or PLSI)
• Interpretability is key to motivating new NMF-like factorizations
  – Semi-NMF, Convex-NMF, Kernel-NMF, Tri-NMF
• NMF-like algorithms always outperform K-means clustering
• Advantage: hard/soft clustering
• Convex-NMF enforces the notion of cluster centroids and is naturally sparse

NMF: A new/rich paradigm for unsupervised learning


References

• On the Equivalence of Nonnegative Matrix Factorization and K-means/Spectral Clustering. Chris Ding, Xiaofeng He, Horst Simon. SDM 2005.

• Convex and Semi-Nonnegative Matrix Factorizations. Chris Ding, Tao Li, Michael Jordan. Submitted.

• Orthogonal Non-negative Matrix Tri-Factorization for Clustering. Chris Ding, Tao Li, Wei Peng, Haesun Park. KDD 2006.

• Nonnegative Matrix Factorization and Probabilistic Latent Semantic Indexing: Equivalence, Chi-square and a Hybrid Algorithm. Chris Ding, Tao Li, Wei Peng. AAAI 2006.


Data Clustering: NMF and PCA

$$\min_{G \ge 0,\; G^T G = I} \|X_\pm - F_\pm G_+^T\|^2$$

$F = (f_1, f_2, \dots, f_k)$ => cluster centroids
$G = (g_1, g_2, \dots, g_k)$ => cluster indicators

NMF is useful due to nonnegativity; the proof uses only G-orthogonality and nonnegativity.

What happens if we ignore nonnegativity?


K-means Clustering and PCA

Ignore nonnegativity => an arbitrary orthogonal transform R is allowed:

$$\min_{G^T G = I} \|X_\pm - (F_\pm R)(G_\pm R)^T\|^2$$

Equivalent to

$$\max_{GR} \mathrm{Tr}\big[(G R)^T X^T X (G R)\big]$$

The solution is given by the SVD: $X = U \Sigma V^T$, $FR = U$, $GR = V$.

Centroid subspace projection: $F F^T = (F R)(F R)^T = U U^T$

Cluster indicator projection: $G G^T = (G R)(G R)^T = V V^T$

PCA/SVD is automatically doing K-means clustering.

(Ding & He, ICML 2004)
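An illustrative demo of this point (synthetic two-cluster data, all parameters made up): after centering, the sign of the leading right singular vector already recovers the two-cluster K-means partition.

```python
import numpy as np

rng = np.random.default_rng(0)
# two well-separated clusters of 10 points each, in 5 dims, as columns
A = rng.normal(0.0, 0.2, size=(5, 10))
B = rng.normal(3.0, 0.2, size=(5, 10))
X = np.hstack([A, B])
X = X - X.mean(axis=1, keepdims=True)       # center the data

# SVD: X = U S V^T; the leading column of V plays the role of the
# relaxed cluster indicator (GR in the slide's notation)
U, S, Vt = np.linalg.svd(X, full_matrices=False)
v1 = Vt[0]

# the sign of v1 splits the points into the two clusters
# (up to a global sign flip)
side = v1 > 0
```

This is the relaxed solution: projecting it back to a 0/1 indicator (here, by sign) gives the hard clustering.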



NMF = Spectral Clustering (Normalized Cut)

Normalized Cut => scaled cluster indicators:

$$y_k = D^{1/2} h_k / \|D^{1/2} h_k\|, \qquad h_k = (0,\dots,0,\,1,\dots,1,\,0,\dots,0)^T$$

$$J_{Ncut}(h_1,\dots,h_k) = \frac{h_1^T (D - W) h_1}{h_1^T D h_1} + \cdots + \frac{h_k^T (D - W) h_k}{h_k^T D h_k}$$

Re-write:

$$J_{Ncut}(y_1,\dots,y_k) = y_1^T (I - \tilde W) y_1 + \cdots + y_k^T (I - \tilde W) y_k = \mathrm{Tr}\big(Y^T (I - \tilde W) Y\big), \qquad \tilde W = D^{-1/2} W D^{-1/2}$$

Optimize: $\max \mathrm{Tr}(Y^T \tilde W Y)$ subject to $Y^T Y = I$, i.e. the symmetric NMF problem

$$\min_{H \ge 0,\; H^T H = I} \|\tilde W - H H^T\|^2$$

(Gu et al., 2001)
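The chain of identities can be verified numerically; a sketch with a random symmetric similarity matrix (made-up data and partition):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((6, 6))
W = A + A.T                                  # symmetric similarity matrix
np.fill_diagonal(W, 0)
D = np.diag(W.sum(axis=1))                   # degree matrix

labels = np.array([0, 0, 0, 1, 1, 1])
K = 2

# raw 0/1 indicators h_k and scaled indicators y_k = D^{1/2} h_k / ||D^{1/2} h_k||
H = np.zeros((6, K))
H[np.arange(6), labels] = 1.0
Dh = np.sqrt(np.diag(D))
Y = (Dh[:, None] * H) / np.linalg.norm(Dh[:, None] * H, axis=0)

# direct Ncut objective: sum_k h_k^T (D - W) h_k / (h_k^T D h_k)
J_ncut = sum(H[:, k] @ (D - W) @ H[:, k] / (H[:, k] @ D @ H[:, k])
             for k in range(K))

# trace form with W~ = D^{-1/2} W D^{-1/2}
Wt = W / np.outer(Dh, Dh)
J_trace = np.trace(Y.T @ (np.eye(6) - Wt) @ Y)
```

The scaled indicators also satisfy $Y^TY = I$, which is exactly the orthogonality constraint in the symmetric NMF formulation.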
