understanding how choices change clusterings: geometric comparison of popular mixture model...
TRANSCRIPT
Understanding How Choices Change Clusterings: Geometric Comparison of Popular
Mixture Model Distances
Scott A. Mitchellhttp://www.cs.sandia.gov/~samitch
Computer Science and Informatics Department (1415)Sandia National Laboratories
Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company,for the United States Department of Energy’s National Nuclear Security Administration
under contract DE-AC04-94AL85000.
Preview Summary
€
χ 2 ≥ JS ≥ Hs2
uCC
2sE
sH sE
sG usG
≥
arcsin ≤= =
x2
x
x
≥
arcsin≤
€
(x + y)
2Q
x − y
x + y
⎛
⎝ ⎜
⎞
⎠ ⎟1
€
an
(x − y)2n
(x + y)2n−1n=1
∞
∑1
€
max(x,y)
2Z
min(x, y)
max(x, y)
⎛
⎝ ⎜
⎞
⎠ ⎟1
Z(z
)
χ2 – JS(red)
Difference in Z functions
z
χ2 - Hs2
(blue)
JS - Hs2
(black)(Z1 –
Z2)
Z functions
z
Hs2
(blue)
JS(red)
χ2 (green)
z
χ2 / JS (red)
Ratio of Z functions
Z2 /
Z1
JS / Hs2 (black)
χ2 / Hs2 (blue)
0,0 1,1
1,0
Z(z)
Q(q
)
q=c 1, z=c 2
q=1,
z=0
q=0, z=1
p=1
p=1.
5
p= 0
.6
u=1
u=.7u= 0.4
Outline
• Application context• Distances• 3d plots• Algebraic reformulation as Z, Q• Ratios• Worst-case difference construction
Text Document Clustering at Sandia
Source Credit: Danny Dunlavy
Example Software: Prototype-2 from Network Grand Challenge LDRD
Dataflow
Problem: given a pile of documents, categorize them- so you only have to read one category - to identify outliers not in any category
Example: How strange is my technical paper, the one I’m giving this talk on?
• Google Scholar search “mixture model geometry”, get 529,000 hits, save them all to local disk• Throw out “the”, “and”, “however” to de-noise. Stem “geometrical”, “geometry”, to “geometry”.• Identify grammar, to help answer “expository or framing?”• Identify authors, institutions, to help identify relationships.
• Each document is a bag of “words”, unordered
Feature Extraction
DocumentClustering
Part of Speech Tagging
Term Tokenization
EntityExtraction
Visualization/Presentation
DocumentIngestion
Source Credit: David G. Robinson
Reduce word-space to feature-spacee.g. 20,000 dimensions to 50
Feature Extraction
DocumentClustering
Part of Speech Tagging
Term Tokenization
EntityExtraction
Visualization/Presentation
DocumentIngestion
Cluster points• Pick distance function • Pick distance threshold• Build a graph,
with an edge between documents x,y if distance(x,y) < threshold
• Look at the graph
Need coherent research program. Distances are one puzzle piece.
What happens if I tweak one of the 50+ knobs on this Frankenstein? Why? Predictable changes?What does it all mean?
This or That?
Feature Extraction
DocumentClustering
Part of Speech Tagging
Term Tokenization
EntityExtraction
Visualization/Presentation
DocumentIngestion
Which Distance?• Criteria
– Reproduce ground truth?• What ground truth? Journal I sent my paper to? Expert opinion?
– Stability of outcome?• stability ≠ accuracy
– Use the one this application area always uses?• Maybe not such a bad idea: leverage insight, one knob at a time comparisons
– Information theory?• “I think entropy is relevant and this distance measures it”
• Let’s try something else– Does it even matter? When do two distance functions give a different ordering to points?– Geometry and algebra
Generality• “Documents” could be any pile of data• “Words” could be any discrete categorical features you care about• “Graphs” could have more structure: filtered simplicial complexes;
or less: proximity to cluster center• Any dimension K – typically > 50
e.g. documents in 50-d concept space, or concepts in 20,000-d wordspace• Applications
– Cluster cyber-traffic based on header features, content analysis.
Specialization• Bivariate distances
• Not convoluting document-in-conceptspace with concept-in-wordspace.• Not univariate measure of points. • Not univariate measure of partition as Graph Entropy (Berry, Phillips)
• Distances between points which are mixture models
TS+
S
T:
x2
x
xx
sometimes distances project to positive part of sphere S+
contrast to LSA on S
x
Outline
• Application context• Distances
properties• 3d plots• Algebraic reformulation as Z, Q• Ratios• Worst-case difference construction
Distance Properties
100’s of distances to choose from.
Bivariate: D(x,y). Not D(x). Not D(partition).
Where variants exist, pick the ones with
Most won’t satisfy 3.
Iter-Distance Properties• What matters is ordering of points, level-set shape
same level sets → same clusterings
• Stronger properties don’t hold, but we’ll see how they don’t.– Max 1, Bounded Ratio c2 → Bounded Difference c1 = 1-c2
but we’ll show smaller c1
stro
ng
er
xz
y
Local order preserving:D1 D2 agree which of {y,z} is closer to x
x
z
y
Global order preserving:D1 D2 agree which of (x,y) and (w,z) is smaller
w
Outline
• Application context• Distances
properties
the distances• 3d plots• Algebraic reformulation as Z, Q• Ratios• Worst-case difference construction
• Base
• Problems– unbounded as y→0, no Max 1– unsymmetric
Chi-Squared χ2
€
χ 2(x, y) =(xk − yk )2
ykk=1
K
∑ =(x − y)2
y1
• Fix (a.k.a Trianglar Discrimination)
• View: f(x,0) = x, const x+y, x, head-on, iso-curves– fig 1, 2, 3
)expected,observed(2χ expected = midpoint if same distribution
€
p ≡ x + y
d ≡ x − y
Chi-Squared χ2
• ComponentwiseAlgebraic Properties
• Qp = x+y; d = x-y; q = d/pp in [0,2]; d,q in [0,1]
• Zu = max(x,y); v = min(x,y); z = u/vu,v,z in [0,1]
revisit figures 1,2,3
• Further study: f-divergence, Csiszár(1967), Dragomir(1980+) RGMIA
1
),(
a
bafbaD f
z
zz
u
z
zuzZ
u
1
41
21
1
2)(
2
22χ
yx
yx
2
2
2
1χ
22
2)(
2q
pqQ
pχ
Chi-Squared χ2
• Componentwise Geometric Interpretation
0,0 1,1
1,0
Z(z)
Q(q
)
q=c 1, z=c 2
q=1,
z=0
q=0, z=1
p=1
p=1.
5
p= 0
.6
u=1
u=.7u= 0.4
geometrically: D( ) = (r / r’) D(o) = (p/1) D(o) = p Q(q)/2also works for p > 1
r
r′o′
o
geometrically: D( ) = (s / s’) D(o) = (u / 1) D(o) = u/2 Z(z)
z
0
1
1
0
r
r′′o
o′′
uv
uv
1
o
o′′o′
z
1
0
0
labels are lengths of segments, except points o, o′, o′′
Q
Z2
d2
q
2
2
2
12
p
z
zq
1
1
q
qz
1
1
2
d2
q
2
2
2
12
p
1
Chi-Squared χ2
• Q curvesChi-Squarediso-p lines, plus y=0, x=1d ≤ p
q vs
. Q(q
)
pq v
s. p
Q(q
)
p=3/4
p=1
d=x-y
D(x
,y)
Z same
Eus surface, contour lines, and x+y=c lines
y
x
Compare to Euclidean, function of d only, no p dependence
Jenson-Shannon (same form as χ2)
• Kullback-Leibler
– measures entropy, information theoretic– non-symmetric– unbounded
• Jenson-Shannon Fix
– x=0…– x=y…– One of the terms can be negative, but JS ≥ 0, Unique Zero holds–
• stronger, factor as Chi-Squared
k
kK
kk y
xxyxKL 2
1
log),(
),(),( yxaJSayaxJS
€
JSs(x,y) =u
2ZJS
v
u
⎛
⎝ ⎜
⎞
⎠ ⎟1
=p
2QJS
d
p
⎛
⎝ ⎜
⎞
⎠ ⎟1
Figure 4
Hellinger
• Hellinger
– H satisfies Triangle Inequality, H2 doesn’t– x=0…– x=y…–
• Factor as Chi-Squared, JS
• Performs well for certain applications (Kegelmeyer, Robinson favorites)
€
H 2 = xk − yk( )2
=k=1
K
∑ x − y( )2
1
),(),( 22 yxaHayaxH
€
Hs2(x,y) =
u
2ZH
v
u
⎛
⎝ ⎜
⎞
⎠ ⎟1
=p
2QH
d
p
⎛
⎝ ⎜
⎞
⎠ ⎟1
H
Hs (blue) and JSs(red) vs. (x-y) for (x+y)=c, x=1 or y=0
y=0
x=1
y=0
x=1
x+y=
1/4
x+y=
1/2
x+y=
3/4
x+y=
1
x+y=5/4
x+y=3/2
x+y=7/4
x+y=
1
x+y=3/2x+y=
1/2
(x-y)
Hs surface, contour lines, and x+y=c lines
y
x
C√=Hs2 surface, contour lines, and x+y=c lines
y
x
H2
x
H and JS
Figure 5
1D Q plots, χ2, JS, H2
Chi2s (green) ≥ JSs(red) ≥ H2
s (blue) vs. (x-y) for (x+y)=c, x=1, or y=0
(x-y)p=
1 C
hi2 s =
Euc
lidea
n2 )
Hellinger and CosineEuclidean and Cosine
• Hellinger projects to sphere using square-root, then takes Euclidean distance
yxC cos1x
2
H
y
G
2
0
S1
yxcos
H
C
Hellinger and Euclidean projections to 3-sphere, and difference between them.
Add animation fig
€
cosθ =1− 2sin2 θ
2⇒ C =
H 2
2= Hs
2
€
xk =1 ⇔ xk∑∑2
=1
€
H = x − y2
x2
x
x
€
x
Geodesic Distance?• Suggest Geodesic distance on sphere G over
– More convex than H (or E), barely satisfies triangle inequality– G(x,z)=G(x,y)+G(y,z) for y=λx+(1-λ)z
strict inequality for other y• C, E, G are global order preserving, over both
€
x
x2
and x
x
y
G0
S1
H
€
z
€
x
x2
and x
Eus (blue) and Gus(red) vs. (x-y)
(x-y)
Eus surface, contour lines, and x+y=c lines
y
x
Euclidean and Geodesic (norm)
Hs (blue) and G√s(red) vs. (x-y) for (x+y)=c, x=1 or y=0
y=0
x=1
y=0
x=1
G√s surface, contour lines, and x+y=c lines
left is visually indistinguishable from Hs, so skip it
(x-y)
y
x
Hellinger and Geodesic (root)
Outline
• Application context• Distances• 3d plots• Algebraic reformulation as Z, Q• Ratios• Worst-case difference construction
3d plots for selected x points• x=[1,0,0]
– fig 6: H2, JS, Chi2
– fig 7: H, H2, E shortcomings• No ortho-max for E
– fig 8: H, H2, Eu
• X=[1,1,0]/2– fig 9: H2, JS, Chi2
– fig 10: H, H2, Eu
• X=[1,1,1]/3– fig 11: H2, JS, Chi2
– fig 12: H, H2, Eu
• X=[0.2 0.3 0.5]– fig 13: H2, JS, Chi2
– fig 14: H, H2, Eu
• X=[0.89, 0.1 0.01] - sharp upturns– fig 15: H2, JS, Chi2
– fig 16: H, H2, Eu
1,0,0 0,1,0
0,0,1
1 2
3
T
S+
(1,0,0)
(0,1,0)
(0,0,1)
(0,0,1)
Distances from (1,0,0)
(0,1,0)
(0,0,1)
(1,0,0)
Distances from (.5,.5,0)
Distances from (.2,.3,.5)
(0,0,1)
(0,1,0)
(1,0,0)
(1,0,0) (0,1,0)
(0,1,0)
(0,0,1)
(1,0,0)
(0,0,1) (1,0,0)
3-dimensional mixture model distances using Euclideanu(yellow + black contours), Hellingers(blue) and JSs(red)
(0,0,1)
(0,1,0)
(1,0,0)
Distances from (.89,.1,.01)
Static backup
Outline
• Application context• Distances• 3d plots• Algebraic reformulation as Z, Q, W• Ratios• Worst-case difference construction
Algebraic Forms, Q, Zfor X, JS, H2
• Q fn• Q series• Z fn 22
sHJS χ
uCC
2sE
sH sE
sG usG
≥
arcsin ≤= =
x2
x
x
≥
arcsin≤
yx
yxZ
yx
2
)(
1112
2
)(
)(
nn
n
n yx
yxa
),max(
),min(
2
),max(
yx
yxQ
yx
2
d2
q
2
2
2
12
p
r
r′ 1o′
o
z
0
1
1
0
r
r′′o
o′′
uv
Q
Z
Q fn
• Recall
• Q fn
• p=x+y d=|x-y| q=d/p
€
χ 2 =1
2
x − y( )2
x + y( )1
€
H 2 = x − y( )2
1
€
D f (x, y) =p
2Q f
d
p
⎛
⎝ ⎜
⎞
⎠ ⎟1
€
JS = x log2
2x
x + y
⎛
⎝ ⎜
⎞
⎠ ⎟+ y log2
2y
x + y
⎛
⎝ ⎜
⎞
⎠ ⎟1
Q functions
Q(q
)
Hs2
(blue)
JS(red)
χ2 (green)
Q fn series
• •
•
€
χ 2 =1
2pq2
Q functions
Q(q
)
Hs2
(blue)
JS(red)
χ2 (green)
q
Z fn
• Recall
• Z fn
u=max(x,y) v=min(x,y) z=v/u
€
χ 2 =1
2
x − y( )2
x + y( )1
€
H 2 = x − y( )2
1
€
JS = x log2
2x
x + y
⎛
⎝ ⎜
⎞
⎠ ⎟+ y log2
2y
x + y
⎛
⎝ ⎜
⎞
⎠ ⎟1
€
D f (x, y) =u
2Z f
v
u
⎛
⎝ ⎜
⎞
⎠ ⎟1
Z functions
z
Z(z
)
Hs2
(blue)
JS(red)
χ2 (green)
W fn• Zf =1+z+… for all f
u(1+z) = u+v and ||u+v||1 = 2
Not componentwise equality
Z functions
z
Z(z
)Hs
2
(blue)
JS(red)
χ2 (green)
W functions
z
W(z
)
Hs2
(blue)
JS(red)
χ2 (green)
Outline
• Application context• Distances• 3d plots• Algebraic reformulation as Z, Q, W• Ratios• Worst-case difference construction
q
Q2 /
Q1
Relative difference in Q functions (Q
1 –
Q2)
/ (Q
1 +
Q2)
q
JS and Hs2 (black)
Q functionsQ
(q)
Hs2
(blue)
JS(red)
χ2 (green)
χ2 / JS (red)
Ratio of Q functions
q
JS / Hs2 (black)
χ2 / Hs2 (blue)
χ2 - Hs2
(blue)
Difference in Q functions
q
(Q1 –
Q2)
χ2 - JS(red)
JS - Hs2
(black)
χ2 and Hs2 (blue)
χ2 and JS (red)
Q plots
Relative difference in Z functions
χ2 and JS (red)
related byq = (1-z)/(1+z)p=(u/2)(1+z)
χ2 – JS(red)
Difference in Z functions
z
(Z1 –
Z2)
χ2 - Hs2
(blue)
JS - Hs2
(black)
Z functions
z
Z(z
)
Hs2
(blue)
JS(red)
χ2 (green)
(Z1 –
Z2)
/ (Z
1 +
Z2)
z
χ2 / JS (red)
Ratio of Z functions
Z2 /
Z1
JS / Hs2 (black)
χ2 / Hs2 (blue) JS and Hs
2 (black)
χ2 and Hs2 (blue)
z
Z plots
Selected theorems
• Z-increasing <≈> Q-decreasing• Z, Q monotonic• Z1/Z2, Q1/Q2 ratios monotonic
• Z, Q differences 1 unique max, 1 inflection pt
Z-Q Equivalence
uv
1
o
o′′o′
2
d2
q
2
2
2
12
p
z
1
0
0
z
zq
1
1
q
qz
1
1
• Zf actually are decreasing (lots of algebra, derivates, L’Hopital’s rule)
Z-Q RatiosThm: Proof from componentwise
€
D1
D2
=Z1
Z2
(z) =Q1
Q2
q( )
€
Z1
Z2
decreasing ⇔Q1
Q2
increasing
€
maxZ1
Z2
= maxQ1
Q2
€
minZ1
Z2
= minQ1
Q2
€
q =1− z
1+ zand z =
1− q
1+ q
Proof from leading term of series, or
directly from functional forms
z
χ2 / JS (red)
Ratio of Z functions
Z2 /
Z1 JS / Hs
2 (black)
χ2 / Hs2 (blue)
χ2 / JS (red)
Ratio of Q functions
q
JS / Hs2 (black)
χ2 / Hs2 (blue)Q
2 /
Q1
Key feature is large flat section.
This appears new.
Z-Q Differences
χ2 - Hs2
(blue)
Difference in Q functions
q
(Q1 –
Q2)
χ2 - JS(red)
JS - Hs2
(black)
χ2 – JS(red)
Difference in Z functions
z
(Z1 –
Z2)
χ2 - Hs2
(blue)
JS - Hs2
(black)
€
′ ′ M > 0
€
′ ′ M > 0
€
′ ′ M < 0
€
′ ′ M < 0
No closed form except upper left
Outline
• Application context• Distances• 3d plots• Algebraic reformulation as Z, Q, W• Ratios• Worst-case difference construction
Contours
• The contours looked similar?• Not local-order preserving
– Exploits different ratios for small z vs. moderate z
– x=[0.89, 0.1, 0.01]– y=[0.9, 0, 0.01]– z=[0.65, 0.35, 0]– w=[0.6, 0.4, 0]
€
χ 2 x, y( ) < χ 2 x,z( ) but JS x, y( ) > JS x,z( ) and Hs2 x, y( ) > Hs
2 x,z( )
€
JS x, y( ) < JS x,w( ) but Hs2 x,y( ) > Hs
2 x,w( )
y
zwx
Worst-Case Construction
• Relies on– Large K dimension– Zero component to get small z ratios– Moderately similar components to get moderate z ratios
• Implies Distance is small
€
x = (a,a,...,a,b,b...,b,0)
y = b,b,...,b,a,a,...,a,0( )
z = x1,x2,...x j ,c,0,0,...0,d( )
where a =1
k -1+ ε, b =
1
k -1−ε,
d,c, j : D1(x,y) = D1(x,z)€
k → ∞, ε → 0 givesD2
D1
(x,y) → min,D2
D1
(x,z) →1
• Find points x,y,z, such that D1(x,y)=D1(x,z) but D2 gives the most different answer
Near worst-case
• Doesn’t have to be very extreme
• Still relies on a zero component, but small dim k, large epsilon
€
x = (a,a,...,a,b,b...,b,0)
y = b,b,...,b,a,a,...,a,0( )
z = x1, x2,...x j ,c,0,0,...0,d( )
where a =1
k -1+ ε, b =
1
k -1−ε,
d,c, j : D1(x,y) = D1(x,z)
Main Observations
≥≤
= =≥
≤
Z(z
)
χ2 – JS(red)
Difference in Z functions
z
χ2 - Hs2
(blue)
JS - Hs2
(black)(Z1 –
Z2)
Z functions
z
Hs2
(blue)
JS(red)
χ2 (green)
z
χ2 / JS (red)
Ratio of Z functions
Z2 /
Z1
JS / Hs2 (black)
χ2 / Hs2 (blue)
0,0 1,1
1,0
Z(z)
Q(q
)
q=c 1, z=c 2
q=1,
z=0
q=0, z=1
p=1
p=1.
5
p= 0
.6
u=1
u=.7u= 0.4
22sHJS χ
uCC
2sE
sH sE
sG usG
arcsin
x2
x
x
arcsin
yx
yxZ
yx
2
)(
1112
2
)(
)(
nn
n
n yx
yxa
),max(
),min(
2
),max(
yx
yxQ
yx
Conclusion
• Insight into what distances do differently• What’s new / novel?
– Geometric analysis, pictures!– Ratio monotonicity, difference analysis– Algebraic limits probably known, but references hard to chase
• Future– Clustering case study on real data– Hellinger square-root projection study– Other families of distances
• Compound distances, Earth Mover’s• Partition/cluster metrics
– Affect of norms other than 1-norm (triangle inequality)?– SNL needs program in understanding effects of text-analysis
pipeline knobs.
Chi2s (green) ≥ JSs(red) vs. (x-y) for (x+y)=c
Chi2s (blue) ≥ JSs(blue) vs. (x-y) for x=1 or y=0
y=0
(blue
) →JS
s =
Chi2 s
= Euc
lidea
n =
x
x+y=
1/4
x+y=
1/2
x+y=
3/4
x+y=
1 (h
ere
Chi
2 s =
Euc
lidea
n2 )
x+y=
5/4
x+y=3/2
x+y=7/4
x=1
(blu
e): C
hi2 s
≥JS s
(x-y)
dist
ance
JSs surface, contour lines, and x+y=c lines
x
y
x=1
y=0
(x-y)=0
x+y=1/4
x+y=
1/2
x+y=
3/4
x+y=1
x+y=5/4
x+y=3/2
x+y=7/4
Hs surface, contour lines, and x+y=c lines Hs (blue) and JSs(red) vs. (x-y) for (x+y)=c, x=1 or y=0
y=0
x=1
y=0
x=1
x+y=
1/4
x+y=
1/2
x+y=
3/4
x+y=
1
x+y=5/4
x+y=3/2
x+y=7/4
x+y=
1
x+y=3/2x+y=
1/2
(x-y)
y
x
Chi2s (green) ≥ JSs(red) ≥ H2
s (blue) vs. (x-y) for (x+y)=c, x=1, or y=0
(x-y)
C√=Hs2 surface, contour lines, and x+y=c lines
y
x
for late-start summary quad chart
(0,0,1)
(0,1,0)
(1,0,0)
Sharp upturn on lower right shows H & JS emphasize high ratios of point coordinates.
Euclidean Hellinger Jensen-Shannondistancesfrom (.89,.1,.01)
1
2T
S+
S
Two-dimensional mixture models T, unit sphere S, and positive part of unit sphere S+. Coordinate axis 1 and 2.
Point x on T projected to S+ under normalization and square-root.
x2
x
xx
1,0,0 0,1,0
0,0,1
1 2
3
T
S+
JS
u=x
=1 ↔
Z(z
)
q = 1
↔ z
= 0
q = 1/3 ↔ z = 1/2
q = 0 ↔ z = 1
xy
x+y
= 1
↔ Q
(q)
0,0 1,1
1,0
Z(z)
Q(q
)
q=c 1, z=c 2
q=1,
z=0
q=0, z=1p=
1
p=1.
5
p= 0
.6
u=1
u=.7u= 0.4
labels are lengths of segments, except points o, o’geometrically: D( ) = (r / r’) D(o) = (p/1) D(o) = p Q(q)/2also works for p > 1
2
d2
q
2
2
2
12
p
r
r′ 1o′
o
labels are lengths of segments, except points o, o’ geometrically: D( ) = (s / s’) D(o) = (u / 1) D(o) = u/2 Z(z)
z
0
1
1
0
r
r′′o
o′′
uv
uv
1
o
o′′o′
2
d2
q
2
2
2
12
p
z
1
0
0
z
zq
1
1
q
qz
1
1