understanding how choices change clusterings: geometric comparison of popular mixture model...

Understanding How Choices Change Clusterings: Geometric Comparison of Popular Mixture Model Distances

Scott A. Mitchell
http://www.cs.sandia.gov/~samitch

Computer Science and Informatics Department (1415)
Sandia National Laboratories

Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.

Preview Summary

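$\chi^2 \ge JS \ge H_s^2$

[Diagram: relations among the cosine (C), Euclidean (E), Hellinger (H), and geodesic (G) distances, scaled (s) and unscaled (u), under the projections $x/\|x\|_2$ and $\sqrt{x}$; the distances are linked by =, ≥, ≤, and arcsin.]

$D_f(x,y) = \left\| \frac{x+y}{2}\, Q\!\left(\frac{x-y}{x+y}\right) \right\|_1$

$D_f(x,y) = \left\| \sum_{n=1}^{\infty} a_n \frac{(x-y)^{2n}}{(x+y)^{2n-1}} \right\|_1$

$D_f(x,y) = \left\| \frac{\max(x,y)}{2}\, Z\!\left(\frac{\min(x,y)}{\max(x,y)}\right) \right\|_1$

[Plot: Z functions, Z(z) vs. z: Hs2 (blue), JS (red), χ2 (green).]

[Plot: Difference in Z functions, (Z1 − Z2) vs. z: χ2 − JS (red), χ2 − Hs2 (blue), JS − Hs2 (black).]

[Plot: Ratio of Z functions, Z2/Z1 vs. z: χ2 / JS (red), JS / Hs2 (black), χ2 / Hs2 (blue).]

[Diagram: region with corners (0,0), (1,0), (1,1) showing Z(z) and Q(q) lines; iso-p lines p = 0.6, 1, 1.5; iso-u lines u = 0.4, 0.7, 1; q = 1 ↔ z = 0, q = 0 ↔ z = 1, q = c1 ↔ z = c2.]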

Outline

• Application context
• Distances
• 3d plots
• Algebraic reformulation as Z, Q
• Ratios
• Worst-case difference construction

Text Document Clustering at Sandia

Source Credit: Danny Dunlavy

Example Software: Prototype-2 from Network Grand Challenge LDRD

Dataflow

Problem: given a pile of documents, categorize them
– so you only have to read one category
– to identify outliers not in any category

Example: How strange is my technical paper, the one I’m giving this talk on?

• Google Scholar search “mixture model geometry”, get 529,000 hits, save them all to local disk
• Throw out “the”, “and”, “however” to de-noise. Stem “geometrical”, “geometry” to “geometry”.
• Identify grammar, to help answer “expository or framing?”
• Identify authors, institutions, to help identify relationships.
• Each document is a bag of “words”, unordered

[Pipeline diagram, stages shown: Document Ingestion, Term Tokenization, Part of Speech Tagging, Entity Extraction, Feature Extraction, Document Clustering, Visualization/Presentation.]

Source Credit: David G. Robinson

Reduce word-space to feature-space, e.g. 20,000 dimensions to 50


Cluster points
• Pick distance function
• Pick distance threshold
• Build a graph, with an edge between documents x,y if distance(x,y) < threshold
• Look at the graph
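The thresholded-graph step above can be sketched as follows. This is a minimal illustration and not the Sandia prototype: the names `threshold_graph` and `euclidean`, the three toy "documents", and the threshold 0.3 are all assumptions made for the example.

```python
from itertools import combinations
import math

def euclidean(x, y):
    """Plain Euclidean distance; any bivariate distance D(x, y) could be swapped in."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def threshold_graph(points, distance, threshold):
    """Return edges (i, j), i < j, for pairs with distance(points[i], points[j]) < threshold."""
    return [(i, j) for i, j in combinations(range(len(points)), 2)
            if distance(points[i], points[j]) < threshold]

# Hypothetical documents as 3-component mixture models (each sums to 1).
docs = [(0.9, 0.1, 0.0), (0.8, 0.2, 0.0), (0.0, 0.1, 0.9)]
edges = threshold_graph(docs, euclidean, 0.3)
print(edges)
```

Because the graph depends on the distance only through comparisons against the threshold, two distances with the same level sets produce the same family of graphs as the threshold varies.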

Need coherent research program. Distances are one puzzle piece.

What happens if I tweak one of the 50+ knobs on this Frankenstein? Why? Predictable changes? What does it all mean?

This or That?


Which Distance?

• Criteria
– Reproduce ground truth?
  • What ground truth? Journal I sent my paper to? Expert opinion?
– Stability of outcome?
  • stability ≠ accuracy
– Use the one this application area always uses?
  • Maybe not such a bad idea: leverage insight, one-knob-at-a-time comparisons
– Information theory?
  • “I think entropy is relevant and this distance measures it”
• Let’s try something else
– Does it even matter? When do two distance functions give a different ordering to points?
– Geometry and algebra

Generality
• “Documents” could be any pile of data
• “Words” could be any discrete categorical features you care about
• “Graphs” could have more structure (filtered simplicial complexes) or less (proximity to cluster center)
• Any dimension K, typically > 50
  e.g. documents in 50-d concept space, or concepts in 20,000-d word space
• Applications
– Cluster cyber-traffic based on header features, content analysis.

Specialization
• Bivariate distances
  • Not convoluting document-in-conceptspace with concept-in-wordspace.
  • Not univariate measure of points.
  • Not univariate measure of partition as Graph Entropy (Berry, Phillips)
• Distances between points which are mixture models

[Figure: mixture models T, unit sphere S, and positive part of the sphere S+; a point x on T is mapped to $x/\|x\|_2$ and to $\sqrt{x}$. Sometimes distances project to the positive part of the sphere S+; contrast to LSA on S.]

Outline

• Application context
• Distances
  – properties
• 3d plots
• Algebraic reformulation as Z, Q
• Ratios
• Worst-case difference construction

Distance Properties

100’s of distances to choose from.

Bivariate: D(x,y). Not D(x). Not D(partition).

Where variants exist, pick the ones with

Most won’t satisfy 3.

Inter-Distance Properties
• What matters is ordering of points, level-set shape
  – same level sets → same clusterings
• Stronger properties don’t hold, but we’ll see how they don’t.
  – Max 1, Bounded Ratio c2 → Bounded Difference c1 = 1 − c2, but we’ll show smaller c1
• Local order preserving: D1, D2 agree which of {y, z} is closer to x
• Global order preserving: D1, D2 agree which of (x,y) and (w,z) is smaller

[Figure: points x, y, z (local) and w, x, y, z (global), with an arrow labeled “stronger” from local to global order preserving.]

Outline

• Application context
• Distances
  – properties
  – the distances
• 3d plots
• Algebraic reformulation as Z, Q
• Ratios
• Worst-case difference construction

Chi-Squared χ2

• Base
$\chi^2(x, y) = \sum_{k=1}^{K} \frac{(x_k - y_k)^2}{y_k} = \left\| \frac{(x-y)^2}{y} \right\|_1$
• Problems
– unbounded as y→0, no Max 1
– unsymmetric
• Fix (a.k.a. Triangular Discrimination)
– χ2(expected, observed), with expected = midpoint if same distribution
• View: f(x,0) = x, const x+y, x, head-on, iso-curves
– fig 1, 2, 3

$p \equiv x + y, \qquad d \equiv x - y$
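As a hedged illustration of the "expected = midpoint" fix: substituting the midpoint m = (x+y)/2 for the denominator of the base formula gives ½ Σ (x_k − y_k)² / (x_k + y_k), which is the symmetric form used later in the talk. The function names and sample vectors below are assumptions for the sketch.

```python
def chi2_base(x, y):
    """Base chi-squared, sum (x_k - y_k)^2 / y_k.
    Raises ZeroDivisionError when some y_k == 0, which is the 'unbounded' problem."""
    return sum((a - b) ** 2 / b for a, b in zip(x, y))

def chi2_sym(x, y):
    """Symmetric fix: 0.5 * sum (x_k - y_k)^2 / (x_k + y_k); terms with x_k = y_k = 0 are skipped."""
    return 0.5 * sum((a - b) ** 2 / (a + b) for a, b in zip(x, y) if a + b > 0)

x = (0.7, 0.3, 0.0)
y = (0.2, 0.3, 0.5)
m = tuple((a + b) / 2 for a, b in zip(x, y))  # midpoint used as the "expected" distribution

print(chi2_base(x, m), chi2_sym(x, y))        # the two agree
print(chi2_sym((1.0, 0.0), (0.0, 1.0)))       # disjoint supports reach the maximum
```

The symmetric form is bounded because each term is at most (x_k + y_k)/2, and the components of x and y each sum to 1.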

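Chi-Squared χ2

• Componentwise Algebraic Properties
$\chi^2 = \frac{1}{2} \left\| \frac{(x-y)^2}{x+y} \right\|_1$
• Q: p = x+y; d = |x−y|; q = d/p; p in [0,2]; d, q in [0,1]
$\chi^2 = \left\| \frac{p}{2}\, Q(q) \right\|_1, \qquad Q(q) = q^2$
• Z: u = max(x,y); v = min(x,y); z = v/u; u, v, z in [0,1]
$\chi^2 = \left\| \frac{u}{2}\, Z(z) \right\|_1, \qquad Z(z) = \frac{(1-z)^2}{1+z} = 1 + z - \frac{4z}{1+z}$
– revisit figures 1, 2, 3
• Further study: f-divergence $D_f(a,b) = \left\| a\, f\!\left(\frac{b}{a}\right) \right\|_1$, Csiszár (1967), Dragomir (1980+) RGMIA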
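A small check that the direct, Q, and Z forms of the symmetric χ2 agree on a single component. This is a sketch: the function names are illustrative, and the closed forms Q(q) = q² and Z(z) = (1−z)²/(1+z) follow algebraically from ½ (x−y)²/(x+y) with p = x+y, q = |x−y|/p, u = max(x,y), z = min(x,y)/u.

```python
def chi2_direct(x, y):
    """One component of the symmetric chi-squared: 0.5 * (x - y)^2 / (x + y)."""
    return 0.5 * (x - y) ** 2 / (x + y)

def chi2_via_q(x, y):
    """Same component written as (p/2) * Q(q) with Q(q) = q^2."""
    p = x + y
    q = abs(x - y) / p
    return 0.5 * p * q ** 2

def chi2_via_z(x, y):
    """Same component written as (u/2) * Z(z) with Z(z) = (1 - z)^2 / (1 + z)."""
    u, v = max(x, y), min(x, y)
    z = v / u
    return 0.5 * u * (1 - z) ** 2 / (1 + z)

for x, y in [(0.7, 0.2), (0.1, 0.35), (0.5, 0.0), (0.3, 0.3)]:
    print(x, y, chi2_direct(x, y), chi2_via_q(x, y), chi2_via_z(x, y))
```

Components with x = y = 0 contribute nothing and would need to be skipped before calling these helpers.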

Chi-Squared χ2

• Componentwise Geometric Interpretation

[Diagram: region with corners (0,0), (1,0), (1,1) showing Z(z) and Q(q) lines; iso-p lines p = 0.6, 1, 1.5; iso-u lines u = 0.4, 0.7, 1; q = 1 ↔ z = 0, q = 0 ↔ z = 1, q = c1 ↔ z = c2. Labels are lengths of segments, except points o, o′, o′′.]

geometrically: D( ) = (r / r′) D(o) = (p/1) D(o) = p Q(q)/2; also works for p > 1

geometrically: D( ) = (s / s′) D(o) = (u/1) D(o) = (u/2) Z(z)

$z = \frac{1-q}{1+q}, \qquad q = \frac{1-z}{1+z}$

Chi-Squared χ2

• Q curves: Chi-Squared iso-p lines, plus y=0, x=1; d ≤ p
• Z same

[Plot: q vs. Q(q).]

[Plot: pq vs. p Q(q), i.e. d = x−y vs. D(x,y), for p = 3/4 and p = 1.]

[Plot: Eus surface, contour lines, and x+y=c lines, over axes x and y.]

Compare to Euclidean: function of d only, no p dependence

Jensen-Shannon (same form as χ2)

• Kullback-Leibler
$KL(x, y) = \sum_{k=1}^{K} x_k \log_2 \frac{x_k}{y_k}$
– measures entropy, information theoretic
– non-symmetric
– unbounded
• Jensen-Shannon Fix
– x=0…
– x=y…
– One of the terms can be negative, but JS ≥ 0, Unique Zero holds
– $JS(ax, ay) = a\,JS(x, y)$
• stronger, factor as Chi-Squared
$JS_s(x,y) = \left\| \frac{u}{2}\, Z_{JS}\!\left(\frac{v}{u}\right) \right\|_1 = \left\| \frac{p}{2}\, Q_{JS}\!\left(\frac{d}{p}\right) \right\|_1$

Figure 4
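A minimal sketch of the Jensen-Shannon properties listed above. It assumes the componentwise form x log2(2x/(x+y)) + y log2(2y/(x+y)) recalled later in the talk, scaled by ½ so that the maximum is 1 (taken here to be what the subscript s denotes), and the usual convention 0·log 0 = 0. Function names and sample vectors are illustrative.

```python
import math

def kl(x, y):
    """Kullback-Leibler in bits; raises ZeroDivisionError if y_k == 0 where x_k > 0 (unbounded)."""
    return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

def js_scaled(x, y):
    """Scaled Jensen-Shannon: 0.5 * sum [x log2(2x/(x+y)) + y log2(2y/(x+y))], with 0 log 0 = 0."""
    total = 0.0
    for a, b in zip(x, y):
        m = (a + b) / 2
        if a > 0:
            total += a * math.log2(a / m)
        if b > 0:
            total += b * math.log2(b / m)
    return 0.5 * total

x = (0.89, 0.10, 0.01)
y = (0.20, 0.30, 0.50)
print(kl(x, y), kl(y, x))                      # non-symmetric
print(js_scaled(x, y), js_scaled(y, x))        # symmetric
print(js_scaled((1.0, 0.0), (0.0, 1.0)))       # disjoint supports
print(0.01 * math.log2(2 * 0.01 / 0.51))       # a single negative term
```

Scaling both arguments by a constant a scales the result by a, which is what lets the distance factor componentwise as (u/2) Z(v/u) or (p/2) Q(d/p).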

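Hellinger

• Hellinger
$H^2 = \sum_{k=1}^{K} \left( \sqrt{x_k} - \sqrt{y_k} \right)^2 = \left\| \left( \sqrt{x} - \sqrt{y} \right)^2 \right\|_1$
– H satisfies Triangle Inequality, H2 doesn’t
– x=0…
– x=y…
– $H^2(ax, ay) = a\,H^2(x, y)$
• Factor as Chi-Squared, JS
$H_s^2(x,y) = \left\| \frac{u}{2}\, Z_H\!\left(\frac{v}{u}\right) \right\|_1 = \left\| \frac{p}{2}\, Q_H\!\left(\frac{d}{p}\right) \right\|_1$
• Performs well for certain applications (Kegelmeyer, Robinson favorites)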
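A short sketch of the triangle-inequality remark: H obeys it, while H² can violate it. The three 2-component points are an assumed example chosen so that y lies between x and z.

```python
import math

def hellinger_sq(x, y):
    """Unscaled squared Hellinger: sum (sqrt(x_k) - sqrt(y_k))^2, maximum 2."""
    return sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(x, y))

def hellinger(x, y):
    """Hellinger distance: Euclidean distance between the square-root vectors."""
    return math.sqrt(hellinger_sq(x, y))

x, y, z = (1.0, 0.0), (0.5, 0.5), (0.0, 1.0)

print(hellinger(x, z), hellinger(x, y) + hellinger(y, z))            # triangle inequality holds
print(hellinger_sq(x, z), hellinger_sq(x, y) + hellinger_sq(y, z))   # violated for H^2
```

Dividing `hellinger_sq` by 2 gives a version with maximum 1, which is how the scaled Hs2 is treated in the other sketches.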

[Plot: Hs (blue) and JSs (red) vs. (x−y) for (x+y) = c, x=1 or y=0; curves for x+y = 1/4, 1/2, 3/4, 1, 5/4, 3/2, 7/4.]

[Plot: Hs surface, contour lines, and x+y=c lines, over axes x and y.]

[Plot: C√ = Hs2 surface, contour lines, and x+y=c lines, over axes x and y.]

H and JS: Figure 5

1D Q plots, χ2, JS, H2

[Plot: Chi2s (green) ≥ JSs (red) ≥ H2s (blue) vs. (x−y) for (x+y) = c, x=1, or y=0; at p = 1, Chi2s = Euclidean2.]

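Hellinger and Cosine; Euclidean and Cosine

• Hellinger projects to sphere using square-root, then takes Euclidean distance

$\cos\theta = x \cdot y, \qquad C = 1 - \cos\theta = 1 - x \cdot y$ (x, y on the unit sphere)

$\cos\theta = 1 - 2\sin^2\frac{\theta}{2} \;\Rightarrow\; C = \frac{H^2}{2} = H_s^2$

$\sum_k x_k = 1 \;\Leftrightarrow\; \sum_k \left(\sqrt{x_k}\right)^2 = 1$

$H = \left\| \sqrt{x} - \sqrt{y} \right\|_2$

[Figure: points x and y on the circle S1 centered at 0, with chord H, arc G, and half-angle θ/2; projections $x/\|x\|_2$ and $\sqrt{x}$.]

Hellinger and Euclidean projections to 3-sphere, and difference between them.

[Add animation fig]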
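The identity C = H²/2 = Hs2 on the square-root projection can be checked numerically. This sketch assumes x and y are mixture models (nonnegative, summing to 1), so that their componentwise square roots are unit vectors; the helper names and sample points are illustrative.

```python
import math

def sqrt_project(x):
    """Map a mixture model (components sum to 1) to the unit sphere by componentwise square root."""
    return tuple(math.sqrt(a) for a in x)

def cosine_distance(a, b):
    """C = 1 - cos(theta) for two vectors a and b."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return 1.0 - dot / (norm_a * norm_b)

def hellinger_sq_scaled(x, y):
    """Hs2 = 0.5 * sum (sqrt(x_k) - sqrt(y_k))^2."""
    return 0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(x, y))

x = (0.2, 0.3, 0.5)
y = (0.89, 0.1, 0.01)
c_root = cosine_distance(sqrt_project(x), sqrt_project(y))
print(c_root, hellinger_sq_scaled(x, y))
```

Expanding the square shows why: ½ Σ (√x − √y)² = ½ (1 + 1 − 2 Σ √(x y)) = 1 − √x · √y.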

Geodesic Distance?
• Suggest Geodesic distance on sphere G over $x/\|x\|_2$ and $\sqrt{x}$
– More convex than H (or E), barely satisfies triangle inequality
– G(x,z) = G(x,y) + G(y,z) for y = λx + (1−λ)z; strict inequality for other y
• C, E, G are global order preserving, over both $x/\|x\|_2$ and $\sqrt{x}$

[Figure: points x, y, z on the circle S1 centered at 0, with chord H and arc G.]

Euclidean and Geodesic (norm)

[Plot: Eus (blue) and Gus (red) vs. (x−y).]

[Plot: Eus surface, contour lines, and x+y=c lines, over axes x and y.]

Hellinger and Geodesic (root)

[Plot: Hs (blue) and G√s (red) vs. (x−y) for (x+y) = c, x=1 or y=0.]

[Plot: G√s surface, contour lines, and x+y=c lines, over axes x and y; the left plot is visually indistinguishable from Hs, so it is skipped.]
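A hedged sketch of the geodesic distance on the square-root projection: G is the angle between √x and √y, and the Hellinger chord satisfies H = 2 sin(G/2). Since that map is increasing on [0, π/2], H and G order pairs of points identically. The function names are assumptions, the distances are left unscaled, and the 2-component example is chosen so that y lies on the arc between x and z.

```python
import math

def geodesic_root(x, y):
    """Arc length on the unit sphere between sqrt(x) and sqrt(y) (unscaled, in radians)."""
    dot = sum(math.sqrt(a * b) for a, b in zip(x, y))
    return math.acos(min(1.0, max(-1.0, dot)))  # clamp guards against rounding just outside [-1, 1]

def hellinger(x, y):
    """Chord length between sqrt(x) and sqrt(y)."""
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(x, y)))

x, y, z = (1.0, 0.0), (0.5, 0.5), (0.0, 1.0)

# Geodesic is additive along the arc, so the triangle inequality is tight.
print(geodesic_root(x, z), geodesic_root(x, y) + geodesic_root(y, z))
# The chord is strictly shorter than the sum of the two sub-chords.
print(hellinger(x, z), hellinger(x, y) + hellinger(y, z))

a, b = (0.2, 0.3, 0.5), (0.89, 0.1, 0.01)
print(hellinger(a, b), 2 * math.sin(geodesic_root(a, b) / 2))
```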

Outline

• Application context
• Distances
• 3d plots
• Algebraic reformulation as Z, Q
• Ratios
• Worst-case difference construction

3d plots for selected x points
• x = [1, 0, 0]
– fig 6: H2, JS, Chi2
– fig 7: H, H2, E shortcomings
  • No ortho-max for E
– fig 8: H, H2, Eu
• x = [1, 1, 0]/2
– fig 9: H2, JS, Chi2
– fig 10: H, H2, Eu
• x = [1, 1, 1]/3
– fig 11: H2, JS, Chi2
– fig 12: H, H2, Eu
• x = [0.2, 0.3, 0.5]
– fig 13: H2, JS, Chi2
– fig 14: H, H2, Eu
• x = [0.89, 0.1, 0.01], sharp upturns
– fig 15: H2, JS, Chi2
– fig 16: H, H2, Eu

[Figure: simplex T with vertices (1,0,0), (0,1,0), (0,0,1) on coordinate axes 1, 2, 3, and the positive part of the sphere S+.]

[Plots: 3-dimensional mixture model distances using Euclideanu (yellow + black contours), Hellingers (blue) and JSs (red), over the simplex with vertices (1,0,0), (0,1,0), (0,0,1). Panels: distances from (1,0,0); distances from (.5,.5,0); distances from (.2,.3,.5); distances from (.89,.1,.01).]

Static backup

Outline

• Application context
• Distances
• 3d plots
• Algebraic reformulation as Z, Q, W
• Ratios
• Worst-case difference construction

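Algebraic Forms Q, Z for χ2, JS, H2

$\chi^2 \ge JS \ge H_s^2$

[Diagram: relations among the cosine (C), Euclidean (E), Hellinger (H), and geodesic (G) distances, scaled (s) and unscaled (u), under the projections $x/\|x\|_2$ and $\sqrt{x}$; the distances are linked by =, ≥, ≤, and arcsin.]

• Q fn
$D_f(x,y) = \left\| \frac{x+y}{2}\, Q\!\left(\frac{x-y}{x+y}\right) \right\|_1$
• Q series
$D_f(x,y) = \left\| \sum_{n=1}^{\infty} a_n \frac{(x-y)^{2n}}{(x+y)^{2n-1}} \right\|_1$
• Z fn
$D_f(x,y) = \left\| \frac{\max(x,y)}{2}\, Z\!\left(\frac{\min(x,y)}{\max(x,y)}\right) \right\|_1$

[Diagrams: geometric constructions for Q (segments r, r′ and points o, o′, with p and q) and for Z (segments r, r′′ and points o, o′′, with u, v and z).]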

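Q fn

• Recall
$\chi^2 = \frac{1}{2} \left\| \frac{(x-y)^2}{x+y} \right\|_1$
$H^2 = \left\| \left( \sqrt{x} - \sqrt{y} \right)^2 \right\|_1$
$JS = \left\| x \log_2\!\left(\frac{2x}{x+y}\right) + y \log_2\!\left(\frac{2y}{x+y}\right) \right\|_1$
• Q fn
$D_f(x, y) = \left\| \frac{p}{2}\, Q_f\!\left(\frac{d}{p}\right) \right\|_1$
• p = x+y, d = |x−y|, q = d/p

[Plot: Q functions, Q(q) vs. q: Hs2 (blue), JS (red), χ2 (green).]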
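The Q functions can be written out explicitly. The closed forms below are derived here from the recalled definitions by substituting x = p(1+q)/2 and y = p(1−q)/2, with JS and H² each halved so that the maximum is 1 (the assumed meaning of the subscript s). They are a sketch and not quoted from the slides, and the function names are illustrative.

```python
import math

def xlog2(t):
    """t * log2(t) with the convention 0 * log2(0) = 0."""
    return t * math.log2(t) if t > 0 else 0.0

def q_chi2(q):
    return q * q

def q_js(q):
    return 0.5 * (xlog2(1 + q) + xlog2(1 - q))

def q_h(q):
    return 1.0 - math.sqrt(1.0 - q * q)

def from_q(Q, x, y):
    """One component of D_f written as (p/2) * Q(d/p)."""
    p = x + y
    return 0.5 * p * Q(abs(x - y) / p)

def js_component(x, y):
    """One component of scaled JS computed directly from its definition."""
    m = (x + y) / 2
    return 0.5 * ((x * math.log2(x / m) if x > 0 else 0.0) +
                  (y * math.log2(y / m) if y > 0 else 0.0))

for q in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(q, q_chi2(q), q_js(q), q_h(q))   # chi2 >= JS >= Hs2 at every q
```

All three functions start at 0 and reach 1 at q = 1, which is the "Max 1" scaling.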

Q fn series

• $\chi^2 = \frac{1}{2}\, p\, q^2$ (componentwise)

[Plot: Q functions, Q(q) vs. q: Hs2 (blue), JS (red), χ2 (green).]

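Z fn

• Recall
$\chi^2 = \frac{1}{2} \left\| \frac{(x-y)^2}{x+y} \right\|_1$
$H^2 = \left\| \left( \sqrt{x} - \sqrt{y} \right)^2 \right\|_1$
$JS = \left\| x \log_2\!\left(\frac{2x}{x+y}\right) + y \log_2\!\left(\frac{2y}{x+y}\right) \right\|_1$
• Z fn
$D_f(x, y) = \left\| \frac{u}{2}\, Z_f\!\left(\frac{v}{u}\right) \right\|_1$
• u = max(x,y), v = min(x,y), z = v/u

[Plot: Z functions, Z(z) vs. z: Hs2 (blue), JS (red), χ2 (green).]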
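The Z functions can be sketched the same way. The closed forms are derived here by substituting v = z·u into the recalled definitions, with JS and H² halved to have maximum 1 (an assumption about the s scaling); the names are illustrative.

```python
import math

def z_chi2(z):
    return (1 - z) ** 2 / (1 + z)

def z_js(z):
    # log2(2/(1+z)) + z*log2(2z/(1+z)); the second term is taken as 0 at z = 0.
    second = z * math.log2(2 * z / (1 + z)) if z > 0 else 0.0
    return math.log2(2 / (1 + z)) + second

def z_h(z):
    return (1 - math.sqrt(z)) ** 2

grid = [i / 10 for i in range(11)]
for z in grid:
    print(z, z_chi2(z), z_js(z), z_h(z))
```

Each function falls from 1 at z = 0 (one coordinate is zero) to 0 at z = 1 (equal coordinates), and on this grid the printed values follow the ordering χ2 ≥ JS ≥ Hs2.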

W fn
• $Z_f = 1 + z + \dots$ for all f
• $u(1+z) = u + v$ and $\|u+v\|_1 = 2$
• Not componentwise equality

[Plot: Z functions, Z(z) vs. z: Hs2 (blue), JS (red), χ2 (green).]

[Plot: W functions, W(z) vs. z: Hs2 (blue), JS (red), χ2 (green).]

Outline


Q plots

[Plot: Q functions, Q(q) vs. q: Hs2 (blue), JS (red), χ2 (green).]

[Plot: Ratio of Q functions, Q2/Q1 vs. q: χ2 / JS (red), JS / Hs2 (black), χ2 / Hs2 (blue).]

[Plot: Difference in Q functions, (Q1 − Q2) vs. q: χ2 − JS (red), χ2 − Hs2 (blue), JS − Hs2 (black).]

[Plot: Relative difference in Q functions, (Q1 − Q2) / (Q1 + Q2) vs. q: χ2 and JS (red), χ2 and Hs2 (blue), JS and Hs2 (black).]

related by q = (1−z)/(1+z), p/2 = (u/2)(1+z)

Z plots

[Plot: Z functions, Z(z) vs. z: Hs2 (blue), JS (red), χ2 (green).]

[Plot: Ratio of Z functions, Z2/Z1 vs. z: χ2 / JS (red), JS / Hs2 (black), χ2 / Hs2 (blue).]

[Plot: Difference in Z functions, (Z1 − Z2) vs. z: χ2 − JS (red), χ2 − Hs2 (blue), JS − Hs2 (black).]

[Plot: Relative difference in Z functions, (Z1 − Z2) / (Z1 + Z2) vs. z: χ2 and JS (red), χ2 and Hs2 (blue), JS and Hs2 (black).]

Selected theorems

• Z-increasing ⇔ Q-decreasing
• Z, Q monotonic
• Z1/Z2, Q1/Q2 ratios monotonic
• Z, Q differences: one unique max, one inflection point

Z-Q Equivalence

[Diagram: geometric construction relating (u, v, z) to (p, d, q) through points o, o′, o′′.]

$z = \frac{1-q}{1+q}, \qquad q = \frac{1-z}{1+z}$

• Zf actually are decreasing (lots of algebra, derivatives, L’Hôpital’s rule)

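Z-Q Ratios

Thm (proof from componentwise forms):
$\frac{D_1}{D_2} = \frac{Z_1}{Z_2}(z) = \frac{Q_1}{Q_2}(q), \qquad q = \frac{1-z}{1+z} \text{ and } z = \frac{1-q}{1+q}$

$\frac{Z_1}{Z_2} \text{ decreasing} \;\Leftrightarrow\; \frac{Q_1}{Q_2} \text{ increasing}$

$\max \frac{Z_1}{Z_2} = \max \frac{Q_1}{Q_2}, \qquad \min \frac{Z_1}{Z_2} = \min \frac{Q_1}{Q_2}$

Proof from leading term of series, or directly from functional forms

[Plot: Ratio of Z functions, Z2/Z1 vs. z: χ2 / JS (red), JS / Hs2 (black), χ2 / Hs2 (blue).]

[Plot: Ratio of Q functions, Q2/Q1 vs. q: χ2 / JS (red), JS / Hs2 (black), χ2 / Hs2 (blue).]

Key feature is large flat section.

This appears new.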
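A numerical sketch of the ratio theorem for the pair (χ2, JS): the Z ratio at z equals the Q ratio at q = (1−z)/(1+z). The closed forms for Z and Q are derived here from the definitions under the max-1 scaling assumption. The limiting value 2 ln 2 ≈ 1.386 as z → 1 is also computed here from the leading series terms and is not quoted from the slides.

```python
import math

def q_chi2(q):
    return q * q

def q_js(q):
    def xlog2(t):
        return t * math.log2(t) if t > 0 else 0.0
    return 0.5 * (xlog2(1 + q) + xlog2(1 - q))

def z_chi2(z):
    return (1 - z) ** 2 / (1 + z)

def z_js(z):
    second = z * math.log2(2 * z / (1 + z)) if z > 0 else 0.0
    return math.log2(2 / (1 + z)) + second

def z_to_q(z):
    return (1 - z) / (1 + z)

grid = [i / 20 for i in range(1, 20)]          # z = 0.05, ..., 0.95
z_ratio = [z_chi2(z) / z_js(z) for z in grid]
q_ratio = [q_chi2(z_to_q(z)) / q_js(z_to_q(z)) for z in grid]

for z, rz, rq in zip(grid, z_ratio, q_ratio):
    print(z, rz, rq)                           # the two ratio columns coincide

print(z_chi2(0.99) / z_js(0.99), 2 * math.log(2))
```

Because q is a decreasing function of z, a ratio that increases in z decreases in q, which is the equivalence stated in the theorem.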

Z-Q Differences

[Plot: Difference in Q functions, (Q1 − Q2) vs. q: χ2 − JS (red), χ2 − Hs2 (blue), JS − Hs2 (black); regions annotated M′′ > 0 and M′′ < 0.]

[Plot: Difference in Z functions, (Z1 − Z2) vs. z: χ2 − JS (red), χ2 − Hs2 (blue), JS − Hs2 (black); regions annotated M′′ > 0 and M′′ < 0.]

No closed form except upper left

Outline


Contours

• The contours looked similar?
• Not local-order preserving
– Exploits different ratios for small z vs. moderate z
– x = [0.89, 0.1, 0.01]
– y = [0.9, 0, 0.1]
– z = [0.65, 0.35, 0]
– w = [0.6, 0.4, 0]

$\chi^2(x, y) < \chi^2(x, z)$ but $JS(x, y) > JS(x, z)$ and $H_s^2(x, y) > H_s^2(x, z)$

$JS(x, y) < JS(x, w)$ but $H_s^2(x, y) > H_s^2(x, w)$

[Figure: points x, y, z, w on the simplex.]
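The order reversals above can be reproduced with a few lines of code. This is a sketch under assumptions: each distance is scaled to have maximum 1, and y is taken as [0.9, 0, 0.1] so that its components sum to 1 and the stated inequalities hold. The function names are illustrative.

```python
import math

def chi2_s(x, y):
    return 0.5 * sum((a - b) ** 2 / (a + b) for a, b in zip(x, y) if a + b > 0)

def js_s(x, y):
    total = 0.0
    for a, b in zip(x, y):
        m = (a + b) / 2
        if a > 0:
            total += a * math.log2(a / m)
        if b > 0:
            total += b * math.log2(b / m)
    return 0.5 * total

def h2_s(x, y):
    return 0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(x, y))

x = (0.89, 0.10, 0.01)
y = (0.90, 0.00, 0.10)   # assumed third component 0.1, so that y sums to 1
z = (0.65, 0.35, 0.00)
w = (0.60, 0.40, 0.00)

for name, dist in (("chi2", chi2_s), ("JS", js_s), ("Hs2", h2_s)):
    print(name, round(dist(x, y), 4), round(dist(x, z), 4), round(dist(x, w), 4))
```

The pair (x, y) has two components with small ratio z (a zero against a nonzero value, and 0.01 against 0.1), whereas (x, z) and (x, w) differ mostly in components of moderate ratio. The three distances weight those two regimes differently, which is what flips the ordering.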

Worst-Case Construction

• Find points x, y, z such that D1(x,y) = D1(x,z) but D2 gives the most different answer
• Relies on
– Large K dimension
– Zero component to get small z ratios
– Moderately similar components to get moderate z ratios
• Implies Distance is small

$x = (a, a, \dots, a, b, b, \dots, b, 0)$
$y = (b, b, \dots, b, a, a, \dots, a, 0)$
$z = (x_1, x_2, \dots, x_j, c, 0, 0, \dots, 0, d)$
where $a = \frac{1}{k-1} + \varepsilon$, $b = \frac{1}{k-1} - \varepsilon$, and $d, c, j$ are chosen such that $D_1(x,y) = D_1(x,z)$

$k \to \infty,\ \varepsilon \to 0$ gives $\frac{D_2}{D_1}(x,y) \to \min$, $\frac{D_2}{D_1}(x,z) \to 1$

Near worst-case

• Doesn’t have to be very extreme
• Still relies on a zero component, but small dim k, large epsilon

$x = (a, a, \dots, a, b, b, \dots, b, 0)$
$y = (b, b, \dots, b, a, a, \dots, a, 0)$
$z = (x_1, x_2, \dots, x_j, c, 0, 0, \dots, 0, d)$
where $a = \frac{1}{k-1} + \varepsilon$, $b = \frac{1}{k-1} - \varepsilon$, and $d, c, j$ are chosen such that $D_1(x,y) = D_1(x,z)$

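Main Observations

$\chi^2 \ge JS \ge H_s^2$

[Diagram: relations among the cosine (C), Euclidean (E), Hellinger (H), and geodesic (G) distances, scaled (s) and unscaled (u), under the projections $x/\|x\|_2$ and $\sqrt{x}$; the distances are linked by =, ≥, ≤, and arcsin.]

$D_f(x,y) = \left\| \frac{x+y}{2}\, Q\!\left(\frac{x-y}{x+y}\right) \right\|_1 = \left\| \sum_{n=1}^{\infty} a_n \frac{(x-y)^{2n}}{(x+y)^{2n-1}} \right\|_1 = \left\| \frac{\max(x,y)}{2}\, Z\!\left(\frac{\min(x,y)}{\max(x,y)}\right) \right\|_1$

[Plot: Z functions, Z(z) vs. z: Hs2 (blue), JS (red), χ2 (green).]

[Plot: Difference in Z functions, (Z1 − Z2) vs. z: χ2 − JS (red), χ2 − Hs2 (blue), JS − Hs2 (black).]

[Plot: Ratio of Z functions, Z2/Z1 vs. z: χ2 / JS (red), JS / Hs2 (black), χ2 / Hs2 (blue).]

[Diagram: region with corners (0,0), (1,0), (1,1) showing Z(z) and Q(q) lines; iso-p lines p = 0.6, 1, 1.5; iso-u lines u = 0.4, 0.7, 1; q = 1 ↔ z = 0, q = 0 ↔ z = 1, q = c1 ↔ z = c2.]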

Conclusion

• Insight into what distances do differently
• What’s new / novel?
– Geometric analysis, pictures!
– Ratio monotonicity, difference analysis
– Algebraic limits probably known, but references hard to chase
• Future
– Clustering case study on real data
– Hellinger square-root projection study
– Other families of distances
  • Compound distances, Earth Mover’s
  • Partition/cluster metrics
– Effect of norms other than 1-norm (triangle inequality)?
– SNL needs program in understanding effects of text-analysis pipeline knobs.

[Plot: Chi2s (green) ≥ JSs (red) vs. (x−y) for (x+y) = c, with c = 1/4, 1/2, 3/4, 1, 5/4, 3/2, 7/4; Chi2s ≥ JSs (blue) vs. (x−y) for x=1 or y=0. Annotations: y=0 (blue) → JSs = Chi2s = Euclidean = x; at x+y = 1, Chi2s = Euclidean2; x=1 (blue): Chi2s ≥ JSs. Vertical axis: distance.]

[Plot: JSs surface, contour lines, and x+y=c lines, over axes x and y; lines x=1, y=0, (x−y)=0, and x+y = 1/4, 1/2, 3/4, 1, 5/4, 3/2, 7/4.]

[Plot: Hs surface, contour lines, and x+y=c lines, over axes x and y.]

[Plot: Hs (blue) and JSs (red) vs. (x−y) for (x+y) = c, x=1 or y=0; curves for x+y = 1/4, 1/2, 3/4, 1, 5/4, 3/2, 7/4.]

[Plot: Chi2s (green) ≥ JSs (red) ≥ H2s (blue) vs. (x−y) for (x+y) = c, x=1, or y=0.]

[Plot: C√ = Hs2 surface, contour lines, and x+y=c lines, over axes x and y.]

for late-start summary quad chart

[Plots: Euclidean, Hellinger, and Jensen-Shannon distances from (.89,.1,.01), over the simplex with vertices (1,0,0), (0,1,0), (0,0,1).]

Sharp upturn on lower right shows H & JS emphasize high ratios of point coordinates.

[Figure: Two-dimensional mixture models T, unit sphere S, and positive part of unit sphere S+. Coordinate axes 1 and 2. Point x on T projected to S+ under normalization ($x/\|x\|_2$) and square-root ($\sqrt{x}$).]

[Figure: three-dimensional mixture models T with vertices (1,0,0), (0,1,0), (0,0,1) on coordinate axes 1, 2, 3, and the positive part of the sphere S+.]

[Plot: JS over x and y; the line u = x = 1 ↔ Z(z); the line x+y = 1 ↔ Q(q); q = 1 ↔ z = 0, q = 1/3 ↔ z = 1/2, q = 0 ↔ z = 1.]

[Diagram: region with corners (0,0), (1,0), (1,1) showing Z(z) and Q(q) lines; iso-p lines p = 0.6, 1, 1.5; iso-u lines u = 0.4, 0.7, 1; q = 1 ↔ z = 0, q = 0 ↔ z = 1, q = c1 ↔ z = c2.]

labels are lengths of segments, except points o, o′
geometrically: D( ) = (r / r′) D(o) = (p/1) D(o) = p Q(q)/2; also works for p > 1

labels are lengths of segments, except points o, o′
geometrically: D( ) = (s / s′) D(o) = (u/1) D(o) = (u/2) Z(z)

$z = \frac{1-q}{1+q}, \qquad q = \frac{1-z}{1+z}$
