algorithms at scale - nus computinggilbert/cs5234/2019/... · 2019. 9. 20. · algorithms at scale...
Algorithms at Scale (Week 6)
Summary
Today: Clustering and Streaming
k-median clustering
• Find k centers to minimize the average distance to a center.
LP approximation algorithm
• Find 2k centers that give a 4-approximation of the optimal clustering.
Streaming
• Find k centers in a stream of points.
• Use a hierarchical scheme to reduce space.
Other clustering problems
Last Week: Graph Streaming
• Connectivity
• Bipartite test
• MST
• Spanners
• Matching
Going forward…
Problem set due Thursday, October 3:
• Experimental problem set.
• Implement a streaming algorithm/sketch.
• See what performance you can get.
• Goal: test it out and see what you can learn about it.
Coming up:
• End-of-semester Mini Project.
• Teams of two.
• Goal: look more deeply into a topic covered in this class.
• I’ll provide options from each of the 4 parts of the class (sublinear time/sampling, streaming, caching, parallel).
• Will send more information.
Task: Find a partner this week.
k-Clustering
Given points: P = p1, p2, …, pn
Assumptions:
⇒ Points are in a metric space: distances satisfy the triangle inequality.
⇒ (Think: Euclidean space)
⇒ The number of clusters k is given.
Goal:
⇒ Choose a set of k points (“centers”) that minimize some metric.
Example: 3 clusters
k-Clustering
Metric space:
1. d(x,y) = 0 iff x = y
2. d(x,y) = d(y,x)
3. d(x,y) ≤ d(x,z) + d(z,y)
k-Clustering
Given points: P = p1, p2, …, pn
Many clustering variants:
⇒ k-Center
⇒ k-Median
⇒ k-Means
⇒ k-Medoids
⇒ Min-Cut Clustering
⇒ Spectral Clustering
⇒ Etc.
k-Center Clustering
Given points: P = p1, p2, …, pn
Assumptions:
⇒ Points are in a metric space: distances satisfy the triangle inequality.
⇒ (Think: Euclidean space)
⇒ The number of clusters k is given.
Goal:
⇒ Choose a set of k points (“centers”) that minimize the maximum distance to a center.
Example: 3 clusters
k-Median Clustering
Given points: P = p1, p2, …, pn
Assumptions:
⇒ Points are in a metric space: distances satisfy the triangle inequality.
⇒ (Think: Euclidean space)
⇒ The number of clusters k is given.
Goal:
⇒ Choose a set of k points (“centers”) that minimize the average distance to a center.
⇒ Equivalent: minimize the sum of the distances to the centers.
Example: 3 clusters
• Avg. dist.: 2
• Total dist.: 22
[Figure: 3 clusters; point-to-center distances 2, 2, 1, 1, 3, 4, 3, 6 (total 22)]
k-Median Clustering
Facts:
• k-Median is NP-hard.
• In the Euclidean metric, there is a nearly linear time (1+𝜀)-approximation algorithm.
• In general:
  o Li-Svensson 2013: (1+√3)-approximation
  o Byrka et al. 2015: 2.675-approximation
  o Improvements since then?
k-Median Clustering
Given points: P = p1, p2, …, pn
Find points C = c1, c2, …, ck in P that minimize:
$$D(P,C) = \sum_{i=1}^{n} \min_{c_j \in C} |p_i - c_j|$$
k-Median Clustering
Given points: P = p1, p2, …, pn
Find points C = c1, c2, …, ck in P and an assignment function c() that maps P → C, minimizing:
$$D(P,C) = \sum_{i=1}^{n} |p_i - c(i)|$$
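As a rough illustration of the objective D(P,C), the sketch below evaluates the k-median cost of a candidate set of centers by assigning every point to its nearest center. The function name `kmedian_cost`, the choice of Euclidean distance, and the four sample points are assumptions made for this example; they do not come from the lecture.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]

def kmedian_cost(points: Sequence[Point], centers: Sequence[Point]) -> float:
    """Sum over all points of the distance to the nearest center."""
    return sum(min(math.dist(p, c) for c in centers) for p in points)

# Hypothetical data: two tight groups, with one center chosen from each group.
points = [(0, 0), (0, 2), (10, 0), (10, 4)]
centers = [(0, 0), (10, 0)]

total = kmedian_cost(points, centers)   # total distance to the centers
average = total / len(points)           # average distance to a center
print(total, average)
```

Minimizing the total and minimizing the average are equivalent because the two differ only by the constant factor n.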
Approximate k-Median Clustering
Given points: P = p1, p2, …, pn
Let C* be the optimal clustering.
Clustering C is a 𝛄-approximation if:
$$D(P,C) \le \gamma \cdot D(P,C^*)$$
Approximate k-Median Clustering
Given points: P = p1, p2, …, pn
Let C* be the optimal clustering with k centers.
Clustering C is an (α,𝛄)-approximation if it has at most αk centers and:
$$D(P,C) \le \gamma \cdot D(P,C^*)$$
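(2,2)-approximation
Example: 6 clusters
• Avg. dist.: 4
• Total dist.: 44
Example: 3 clusters
• Avg. dist.: 2
• Total dist.: 22
If the 3-cluster solution is optimal for k = 3, the 6-cluster solution is a (2,2)-approximation: it uses 6 ≤ 2·3 centers and its total distance is 44 ≤ 2·22.
[Figure: the same points clustered two ways; distances 10, 4, 12, 8, 10 in the 6-cluster example (total 44) and 2, 2, 1, 1, 3, 4, 3, 6 in the 3-cluster example (total 22)]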
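The (α,𝛄) definition is just two inequalities, so it can be checked mechanically. The helper below is a minimal sketch; the name `is_bicriteria_approx` is made up for this example, and the 3-cluster solution from the slide is assumed to be optimal.

```python
def is_bicriteria_approx(num_centers, cost, k, opt_cost, alpha, gamma):
    """True if a solution uses at most alpha*k centers and costs at most gamma*opt_cost."""
    return num_centers <= alpha * k and cost <= gamma * opt_cost

# Numbers from the slide: k = 3 with total distance 22 (assumed optimal),
# compared against a 6-center solution with total distance 44.
as_2_2 = is_bicriteria_approx(6, 44, k=3, opt_cost=22, alpha=2, gamma=2)
as_1_2 = is_bicriteria_approx(6, 44, k=3, opt_cost=22, alpha=1, gamma=2)
print(as_2_2, as_1_2)
```

The second call fails only on the center budget: 6 centers exceed 1·k = 3.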
Approximate k-Median Clustering
Today: (2,4)-approximation (at most 2k centers, cost at most 4 times the optimum)
Approximate k-Median Clustering
Integer Linear Program
Variables (intuition):
• y_j: Is point p_j a cluster head?
• x_{i,j}: Is point p_i assigned to center p_j?
Example (points p1, p2, p3):
y1 = 0    x1,2 = 1
y2 = 1    x2,3 = 0
y3 = 1    x1,3 = 0
Approximate k-Median Clustering
Integer Linear Program
ILP:
$$\min \sum_{i,j} x_{i,j}\, d(p_i, p_j)$$
subject to:
$$\forall i: \sum_j x_{i,j} = 1$$ (every point is assigned to exactly one center)
$$\sum_j y_j \le k$$ (at most k cluster heads)
$$\forall i,j: x_{i,j} \le y_j$$ (points are assigned only to cluster heads)
$$\forall i,j: x_{i,j}, y_j \in \{0,1\}$$ (integrality)
Approximate k-Median Clustering
Integer Linear Program
Claim 1: If x and y satisfy the constraints, then they form a valid solution to the clustering problem.
Claim 2: If you have a valid clustering solution, you can choose x and y to satisfy the constraints.
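Claim 2 can be made concrete with a small sketch: given a set of centers, set y_j = 1 for each center, assign each point to its nearest center, and check the ILP constraints. The helper names and the 3-point distance matrix are assumptions for illustration; the resulting variables happen to match the example y1 = 0, y2 = y3 = 1, x1,2 = 1.

```python
def clustering_to_ilp(dist, centers):
    """Build 0/1 variables (x, y) from a list of center indices."""
    n = len(dist)
    y = [1 if j in centers else 0 for j in range(n)]
    x = [[0] * n for _ in range(n)]
    for i in range(n):
        nearest = min(centers, key=lambda c: dist[i][c])
        x[i][nearest] = 1
    return x, y

def satisfies_ilp(x, y, k):
    """Check the four ILP constraints."""
    n = len(y)
    return (
        all(sum(x[i]) == 1 for i in range(n))                       # one center per point
        and sum(y) <= k                                             # at most k centers
        and all(x[i][j] <= y[j] for i in range(n) for j in range(n))  # assign only to centers
        and all(v in (0, 1) for row in x for v in row)
        and all(v in (0, 1) for v in y)
    )

def ilp_objective(x, dist):
    n = len(dist)
    return sum(x[i][j] * dist[i][j] for i in range(n) for j in range(n))

# Hypothetical distances between p1, p2, p3 (index 0, 1, 2).
dist = [[0, 1, 5],
        [1, 0, 4],
        [5, 4, 0]]
x, y = clustering_to_ilp(dist, centers=[1, 2])   # p2 and p3 are cluster heads
print(x, y, satisfies_ilp(x, y, k=2), ilp_objective(x, dist))
```

Claim 1 is the reverse direction: reading the centers off y and the assignment off x recovers a clustering with the same cost.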
Approximate k-Median Clustering
Integer Linear Program
Bad news: Solving Integer Linear Programs is NP-hard.
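Approximate k-Median Clustering
Relax: Linear Program
Good news: Relax! Replace the integrality constraints with [0,1] constraints.
LP:
$$\min \sum_{i,j} x_{i,j}\, d(p_i, p_j)$$
subject to:
$$\forall i: \sum_j x_{i,j} = 1$$
$$\sum_j y_j \le k$$
$$\forall i,j: x_{i,j} \le y_j$$
$$\forall i,j: 0 \le x_{i,j}, y_j \le 1$$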
Approximate k-Median Clustering
Relax: Linear Program
Good news: Relax!
• Can solve efficiently (in polynomial time) using an LP solver.
• If you have a valid clustering solution, you can choose x and y to satisfy the constraints.
• If C is the optimal (fractional) solution to the LP, and C* is the optimal (integral) solution, then:
$$D(C,P) \le D(C^*,P)$$
Solution is no worse than the optimal solution! Maybe better than optimal!
Approximate k-Median Clustering
Relax: Linear Program
Bad news: the solution is fractional. If x and y satisfy the constraints, they may NOT be a valid solution to the clustering problem.
Example (points p1, p2, p3):
y1 = 0.5    x1,2 = 0.5
y2 = 0.5    x2,3 = 0
y3 = 1      x1,3 = 0.5
Solution: round to integers. If x and y satisfy the constraints, then maybe we can round the variables in a way that does not increase the cost too much.
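The fractional example can be checked against the relaxed constraints. The sketch below assumes k = 2 and fills in the entries of x that the slide does not list (the rows for p2 and p3) in the only way consistent with x2,3 = 0 and the constraints; those filled-in values and the helper names are assumptions.

```python
def satisfies_lp(x, y, k, tol=1e-9):
    """Check the relaxed (fractional) constraints up to a small tolerance."""
    n = len(y)
    rows_ok = all(abs(sum(x[i]) - 1) <= tol for i in range(n))
    budget_ok = sum(y) <= k + tol
    linked = all(x[i][j] <= y[j] + tol for i in range(n) for j in range(n))
    bounded = all(-tol <= v <= 1 + tol for row in x for v in row) and \
              all(-tol <= v <= 1 + tol for v in y)
    return rows_ok and budget_ok and linked and bounded

def is_integral(x, y, tol=1e-9):
    values = list(y) + [v for row in x for v in row]
    return all(min(abs(v), abs(v - 1)) <= tol for v in values)

y = [0.5, 0.5, 1.0]
x = [[0.0, 0.5, 0.5],   # p1: half to p2, half to p3 (from the slide)
     [0.5, 0.5, 0.0],   # p2: assumed completion, forced by x2,3 = 0
     [0.0, 0.0, 1.0]]   # p3: assumed to be assigned to itself
feasible = satisfies_lp(x, y, k=2)
integral = is_integral(x, y)
print(feasible, integral)
```

The solution is LP-feasible but not a clustering: p1 and p2 are each "half" a cluster head, which is exactly what rounding has to repair.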
Rounding the k-Median LP
Step 1: What is the cost?
Define the cost of p_i:
$$C_i = \sum_j x_{i,j}\, d(p_i, p_j)$$
LP minimizes:
$$\min \sum_{i,j} x_{i,j}\, d(p_i, p_j) = \min \sum_i C_i$$
Goal: round in a way that does not increase the cost too much!
Goal after rounding: construct C' such that:
$$C'_j \le 4C_j$$
Rounding the k-Median LP
Step 2: Sort by cost.
Notice: the smallest cost is the hardest to round. (Most risk that it will increase too much.)
Step 3: Add the point p_j with the smallest cost C_j to our set of centers: S = {p_j}.
Rounding the k-Median LP
Step 4: If p_i is within distance 4C_j of p_j, then we can delete it.
⇒ p_i is already close enough to a center in our solution:
$$C'_i \le d(p_i, p_j) \le 4C_j \le 4C_i$$
Recall: C_j was the smallest.
More generally, if p_i is within distance 4C_i of p_j, then we can delete it:
$$C'_i \le d(p_i, p_j) \le 4C_i$$
Rounding the k-Median LP
Step 4: If there is some point q where:
$$d(p_i, q) \le 2C_i \quad \text{and} \quad d(p_j, q) \le 2C_j$$
then we can delete p_i.
⇒ p_i is already close enough to a center in our solution:
$$C'_i \le d(p_i, p_j) \le d(p_i, q) + d(q, p_j) \le 2C_i + 2C_j \le 2C_i + 2C_i = 4C_i$$
Recall: C_j was the smallest.
Define:
$$V(j) = \{\, p_i \mid \exists q : d(p_i, q) \le 2C_i,\ d(p_j, q) \le 2C_j \,\}$$
⇒ All points in V(j) are close enough to p_j that we can delete them.
[Figure: p_i and p_j with balls of radius 2C_i and 2C_j that share a point q]
Step 5: Repeat until all points are deleted.
Rounding the k-Median LP
Rounding Algorithm:
1. S = {}
2. Repeat until all points are deleted:
   • Let p_j be the remaining point with minimum C_j.
   • Add p_j to S.
   • Delete all points in V(j).
3. Return S.
Where did we use the LP solution??
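The rounding algorithm can be sketched directly from the steps above. The LP solution enters through the costs C_i, which are computed from x. In this sketch the fractional solution is written by hand rather than produced by an LP solver, the four points on a line are hypothetical, and the function names are illustrative.

```python
def lp_costs(x, dist):
    """C_i = sum_j x[i][j] * d(p_i, p_j)."""
    n = len(dist)
    return [sum(x[i][j] * dist[i][j] for j in range(n)) for i in range(n)]

def round_lp(x, dist):
    """Greedy rounding: pick the cheapest remaining point, delete V(j), repeat."""
    n = len(dist)
    C = lp_costs(x, dist)
    remaining = set(range(n))
    S = []
    while remaining:
        j = min(remaining, key=lambda i: (C[i], i))   # minimum C_j, ties by index
        S.append(j)
        # V(j): remaining points p_i that share some witness q with p_j.
        V = {i for i in remaining
             if any(dist[i][q] <= 2 * C[i] and dist[j][q] <= 2 * C[j]
                    for q in range(n))}
        remaining -= V                                 # p_j itself is in V (take q = p_j)
    return S, C

# Hypothetical instance: points at 0, 1, 10, 11 on a line, k = 2.
pos = [0, 1, 10, 11]
dist = [[abs(a - b) for b in pos] for a in pos]
# Hand-written feasible fractional solution: y_j = 0.5 for all j,
# and each point splits itself evenly between the two nearby points.
x = [[0.5, 0.5, 0.0, 0.0],
     [0.5, 0.5, 0.0, 0.0],
     [0.0, 0.0, 0.5, 0.5],
     [0.0, 0.0, 0.5, 0.5]]
S, C = round_lp(x, dist)
rounded = [min(dist[i][j] for j in S) for i in range(len(pos))]   # C'_i
print(S, C, rounded)
```

Here every C_i is 0.5, the algorithm keeps one point from each pair, and each rounded cost C'_i stays within 4C_i.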
Rounding the k-Median LP
Claim: For all j: $C'_j \le 4C_j$, where C' is computed using the centers in S.
Proof: Suppose p_i is deleted when p_j is added to S, with witness point q. Recall that C_j was the smallest remaining cost, so:
$$C'_i \le d(p_i, p_j) \le d(p_i, q) + d(q, p_j) \le 2C_i + 2C_j \le 2C_i + 2C_i = 4C_i$$
⇒ Summing over all points:
$$d(C', P) \le 4\, d(C, P) \le 4\, d(C^*, P)$$
Rounding the k-Median LP
Remaining problem: How many centers are added to S?
Claim: At most 2k centers are added to S.
Rounding the k-Median LP
Key lemma: If p_i is added to S, then:
$$\sum_{j \,:\, d(p_i, p_j) \le 2C_i} y_j \ge 1/2$$
⇒ Since the y's sum to at most k and the V(j) are disjoint, we cannot add more than 2k points to S.
Subtle point: symmetry! If adding p_i deletes p_j, then adding p_j deletes p_i.
Rounding the k-Median LP
Observation 1: Because the LP requires $x_{i,j} \le y_j$ for all i, j:
$$\sum_{j \,:\, d(p_i, p_j) \le 2C_i} y_j \ge \sum_{j \,:\, d(p_i, p_j) \le 2C_i} x_{i,j}$$
Rounding the k-Median LP
Observation 2: C_i = “average” distance from p_i to a center:
$$C_i = \sum_j x_{i,j}\, d(p_i, p_j)$$
Let Z be a random variable that equals d(p_i, p_j) with probability x_{i,j}. Then:
$$E[Z] = C_i$$
$$\sum_{j \,:\, d(p_i, p_j) \le 2C_i} x_{i,j} = \Pr(Z \le 2C_i) = 1 - \Pr(Z > 2C_i) = 1 - \Pr(Z > 2E[Z]) \ge 1 - 1/2 = 1/2$$
The last step is Markov's inequality.
Rounding the k-Median LP
Key lemma: If p_i is added to S, then:
$$\sum_{j \,:\, d(p_i, p_j) \le 2C_i} y_j \ge 1/2$$
Conclusion:
$$\sum_{j \,:\, d(p_i, p_j) \le 2C_i} y_j \ge \sum_{j \,:\, d(p_i, p_j) \le 2C_i} x_{i,j} \ge 1/2$$
⇒ Fact: The y_j's sum to ≤ k.
⇒ Fact: The V(i) are disjoint.
⇒ Fact: For each p_i added to S, we delete points with y_j's that sum to at least ½.
⇒ Cannot add more than 2k points to S.
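The Markov step behind the key lemma can be checked numerically for a single point p_i: at least half of its fractional assignment must go to points within distance 2C_i. The row of x values and the distances below are hypothetical, and the function name is illustrative.

```python
def mass_within_twice_cost(x_row, d_row):
    """Return (fractional mass assigned within distance 2*C_i, C_i) for one point."""
    C_i = sum(x * d for x, d in zip(x_row, d_row))
    near_mass = sum(x for x, d in zip(x_row, d_row) if d <= 2 * C_i)
    return near_mass, C_i

# Hypothetical point p_i: assignment fractions and distances to three candidates.
x_row = [0.6, 0.3, 0.1]     # sums to 1, as the LP requires
d_row = [0.0, 2.0, 10.0]
near_mass, C_i = mass_within_twice_cost(x_row, d_row)
print(near_mass, C_i)
```

Only the far candidate at distance 10 falls outside the ball of radius 2C_i, and it can carry at most half of the mass without pushing C_i up past half its distance.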
Approximate k-Median Clustering
Given points: P = p1, p2, …, pn
Today: (2,4)-approximation
• Give Integer Linear Program (ILP).
• Relax to Linear Program (LP).
• Solve LP.
• Round (carefully).
Weighted k-Median Clustering
Given points: P = p1, p2, …, pn
Given weights: w1, w2, …, wn
Find points C = c1, c2, …, ck in P and an assignment function c() that maps P → C, minimizing:
$$D(P,C) = \sum_{i=1}^{n} w_i\, |p_i - c(i)|$$
Exercise: Show how to adapt the approximate k-median algorithm to give a (2,4)-approximate solution for the weighted k-Median clustering problem.
Streaming Data
Data arrives in a stream: S = s1, s2, …, sT
Each sj is a point.
⇒ Each point shows up exactly once.
⇒ Points show up in an arbitrary (worst-case) order.
Example in the Euclidean plane: S = (17,3), (1,7), (15,1), (4,1), (3,19), (1,1), (2,1)
At end of stream: output k cluster centers.
Streaming Data
Data arrives in a stream: S = s1, s2, …, sT
Memory: $O(\sqrt{nk})$
Goal: (2, O(1))-approximation
Warning: Today, the approximation ratio is going to be large.
Streaming k-Median
Core-Set Algorithm
S = ∅
Repeat $\sqrt{n/k}$ times:
1. Let P = next $\sqrt{nk}$ points.
2. Find a (2,4)-approximate clustering on P.
3. Add the 2k new cluster centers to S. Weight each cluster center with the # of points attached to it.
4. Empty P.
Return a (2,4)-approximate (weighted) clustering on S.
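The control flow of the core-set algorithm can be sketched as follows. One loud assumption: the (2,4)-approximation subroutine is replaced here by an exact brute-force weighted k-median (`offline_kmedian`), which is only usable on tiny inputs and returns k rather than 2k centers. It is a stand-in, not the LP-rounding algorithm from the lecture. The stream, the 1-D distance, and all names are hypothetical, and n is assumed to be known in advance.

```python
import math
from itertools import combinations

def weighted_cost(points, weights, centers, dist):
    return sum(w * min(dist(p, c) for c in centers) for p, w in zip(points, weights))

def offline_kmedian(points, weights, k, dist):
    """Stand-in for the (2,4)-approximation: exact brute force, tiny inputs only."""
    k = min(k, len(points))
    centers = list(min(combinations(points, k),
                       key=lambda cs: weighted_cost(points, weights, cs, dist)))
    # Weight each center by the total weight of the points attached to it.
    attached = [0] * k
    for p, w in zip(points, weights):
        nearest = min(range(k), key=lambda a: dist(p, centers[a]))
        attached[nearest] += w
    return centers, attached

def coreset_kmedian(stream, n, k, dist):
    segment = max(1, math.isqrt(n * k))      # about sqrt(nk) points per segment
    core, core_w, block = [], [], []
    for p in stream:
        block.append(p)
        if len(block) == segment:
            c, w = offline_kmedian(block, [1] * len(block), k, dist)
            core += c
            core_w += w
            block = []                        # empty P
    if block:                                 # leftover partial segment
        c, w = offline_kmedian(block, [1] * len(block), k, dist)
        core += c
        core_w += w
    final, _ = offline_kmedian(core, core_w, k, dist)
    return final

dist = lambda a, b: abs(a - b)
stream = [0, 100, 1, 101, 2, 102, 3, 103]
final = coreset_kmedian(stream, n=len(stream), k=2, dist=dist)
print(final)
```

Only one segment and the weighted core set are held in memory at a time, which is where the $O(\sqrt{nk})$ space bound comes from.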
Streaming k-Median
Core-Set Algorithm
[Figure: the data stream of n elements is split into segments S1, …, St of √(nk) elements each. A (2,4)-approximate k-median on each segment yields 2k centers; together these form the core set of 2√(nk) centers at the intermediate level. A (2,4)-approximate weighted k-median on the core set yields the final 2k centers.]
Space: $O(\sqrt{nk})$
Streaming k-Median
Core-Set Algorithm
Claims:
Claim 1: Space is at most $O(\sqrt{nk})$.
Claim 2: The output is at most 2k centers.
Claims 1 and 2 hold by construction.
Claim 3: The output is a (2,80)-approximation for k-Median.
Streaming k-Median
Core-Set Algorithm
Notation:
1: Substream S_i is the i-th segment of the stream.
2: Points T_i are the 2k centers output by the i-th (2,4)-approximation.
3: S_w are the weighted points, and w are the weights, used for the final (2,4)-approximation.
4: Points T are the final output of the algorithm.
Streaming k-Median
Core-Set Algorithm
Lemma:
$$d(S,T) \le \sum_{i=1}^{t} d(S_i, T_i) + d(S_w, T)$$
Interpretation: We can bound the final distances by two parts:
(1) the distance of a point to the intermediate clustering, and
(2) the distance of the intermediate clustering to the final clustering.
Streaming k-Median
Core-Set Algorithm
Proof:
$$d(S,T) = \sum_{i=1}^{t} \sum_{x \in S_i} d(x, T)$$
(Definition of d(S,T). The variable x ranges over all points in the set.)
$$\le \sum_{i=1}^{t} \sum_{x \in S_i} \left[ d(x, t_i(x)) + d(t_i(x), T) \right]$$
(Triangle inequality. Point t_i(x) is the center assigned to x in the intermediate core set, where x is a point in segment S_i of the stream.)
$$= \sum_{i=1}^{t} d(S_i, T_i) + \sum_{i=1}^{t} \sum_{x \in S_i} d(t_i(x), T)$$
(Definition of d(S_i, T_i).)
$$= \sum_{i=1}^{t} d(S_i, T_i) + \sum_{i=1}^{t} \sum_{j=1}^{2k} |S_{i,j}|\, d(t_{ij}, T)$$
(Iterate over all centers in the core set and count how many times each is included in the sum. Point t_{ij} is one of the 2k points in the core set for the i-th segment, and S_{i,j} is the set of points in S_i assigned to it.)
$$= \sum_{i=1}^{t} d(S_i, T_i) + d(S_w, T)$$
(Definition of d(S_w, T): the weight of t_{ij} is |S_{i,j}|.)
Streaming k-Median
Core-Set Algorithm
Goal:
$$d(S,T) \le 80\, d(S, C^*)$$
Streaming k-Median
Core-Set Algorithm
Useful fact:
$$\min_{T' \subseteq S'} d(S', T') \le 2 \min_{T' \subseteq A} d(S', T')$$
where A is some larger set of all possible points in the metric space, and S' is an arbitrary subset of A.
Interpretation: To cluster S', we can focus on points in S' (and only lose a factor of 2). We don’t need centers not in S'.
Proof: Triangle inequality. Let T' be the optimal solution in A. Let t be some point in T' that is not in S', let t' be the closest point in S' to t, and let s be some other point in S'. We can replace t with t' because:
$$d(s, t') \le d(s, t) + d(t, t') \le d(s, t) + d(s, t) = 2\, d(s, t)$$
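The factor-of-2 fact can be observed on a tiny instance by brute force. In the sketch below S' is the four corners of a square and A adds the middle of the square as an extra candidate center; both sets, the choice k = 1, and the function name are assumptions for illustration.

```python
import math
from itertools import combinations

def best_kmedian_cost(points, candidates, k):
    """Exact k-median cost of `points` when centers must come from `candidates`."""
    return min(
        sum(min(math.dist(p, c) for c in centers) for p in points)
        for centers in combinations(candidates, k)
    )

S_prime = [(0, 0), (2, 0), (0, 2), (2, 2)]   # points to cluster
A = S_prime + [(1, 1)]                        # larger set of allowed centers

restricted = best_kmedian_cost(S_prime, S_prime, k=1)   # centers from S' only
unrestricted = best_kmedian_cost(S_prime, A, k=1)       # centers from all of A
print(restricted, unrestricted, restricted / unrestricted)
```

Restricting the center to S' costs more than using the middle point, but the ratio stays below 2, as the fact guarantees.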
Streaming k-Median
Core-Set Algorithm
Lemma:
$$\sum_{i=1}^{t} d(S_i, T_i) \le 8\, d(S, C^*)$$
Interpretation: We can bound the distances to the core set by the optimal clustering.
Proof:
$$\sum_{i=1}^{t} d(S_i, T_i) \le \sum_{i=1}^{t} 4 \min_{T' \subseteq S_i} d(S_i, T')$$
(Because we use a (2,4)-approximation algorithm to compute the core set.)
$$\le \sum_{i=1}^{t} 8 \min_{T' \subseteq P} d(S_i, T')$$
(Because we only lose a factor of two going to a larger set of points.)
$$\le \sum_{i=1}^{t} 8\, d(S_i, C^*)$$
(By definition of the optimal clustering: C* is one candidate for T'.)
$$= 8\, d(S, C^*)$$
(By summing over all points.)
Streaming k-Median
Core-Set Algorithm
Lemma:
$$d(S_w, T) \le 8 \sum_{i=1}^{t} d(S_i, T_i) + 8\, d(S, C^*)$$
Interpretation: We can bound the cost of the second part…
Streaming k-Median
Core-Set Algorithm
Part 1:
$$d(S_w, C^*) \le \sum_{i=1}^{t} d(S_i, T_i) + d(S, C^*)$$
Proof:
$$d(S_w, C^*) = \sum_{i,j} |S_{i,j}|\, d(t_{i,j}, C^*)$$
(Definition of the weighted distance.)
$$\le \sum_{i,j} \sum_{x \in S_{i,j}} \left[ d(t_{i,j}, x) + d(x, t^*(x)) \right]$$
(Sum over S_{i,j} and use the triangle inequality; t*(x) is the center of C* assigned to x.)
$$= \sum_{i} \sum_{x \in S_i} \left[ d(t_i(x), x) + d(x, t^*(x)) \right]$$
(Simplify the enumeration over all points in the core set.)
$$= \sum_{i=1}^{t} d(S_i, T_i) + d(S, C^*)$$
(Definition of d(S_i, T_i) and d(S, C*).)
Streaming k-Median
Core-Set Algorithm
Part 1:
$$d(S_w, C^*) \le \sum_{i=1}^{t} d(S_i, T_i) + d(S, C^*)$$
Part 2:
$$d(S_w, T) \le 8\, d(S_w, C^*)$$
Proof of Part 2:
$$d(S_w, T) \le 4 \min_{T' \subseteq S_w} d(S_w, T')$$
(Because we used a 4-approximation algorithm.)
$$\le 8 \min_{T' \subseteq P} d(S_w, T')$$
(Because using points in S_w only loses a factor of 2.)
$$\le 8\, d(S_w, C^*)$$
(By definition of the minimum: C* is one candidate for T'.)
Conclusion:
$$d(S_w, T) \le 8 \sum_{i=1}^{t} d(S_i, T_i) + 8\, d(S, C^*)$$
Streaming k-Median
Core-Set Algorithm
Add it all up:
$$d(S,T) \le \sum_{i=1}^{t} d(S_i, T_i) + d(S_w, T)$$
$$\le 8\, d(S, C^*) + d(S_w, T)$$
$$\le 8\, d(S, C^*) + 8 \sum_{i=1}^{t} d(S_i, T_i) + 8\, d(S, C^*)$$
$$\le 8\, d(S, C^*) + 8\,(8\, d(S, C^*)) + 8\, d(S, C^*)$$
$$= 80\, d(S, C^*)$$
Streaming k-Median
Core-Set Algorithm
Space: $O(\sqrt{nk})$
Approximation: (2,80)
CoreSet Algorithm
Question: What if you want less space?
• Increase segment size?
• Decrease number of core sets?
Idea: hierarchical construction!
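Hierarchical Construction
[Figure: tree over the data stream of n elements. Each segment S1, …, St at the bottom produces 2k centers; groups of these sets are merged into 2k centers at the next level, and so on up to a single set of 2k centers at the top.]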
CoreSet Algorithm
Algorithm idea:
Define $m = n^{\epsilon}$.
Whenever you see m elements in the stream:
• Run the (2,4)-approximation ⇒ 2k centers.
• Store the 2k new centers in level 1.
Whenever you have m sets of centers in level j:
• Run the (2,4)-approximation ⇒ 2k centers.
• Store the 2k new centers in level j+1.
CoreSet Algorithm
Algorithm idea:
Define $m = n^{\epsilon}$.
Whenever you have m sets of centers in level j:
• Run the (2,4)-approximation ⇒ 2k centers.
• Store the 2k new centers in level j+1.
Tree with fan-out m has how many levels?
$$\log_m n = \frac{\log n}{\log m} = \frac{\log n}{\log n^{\epsilon}} = \frac{1}{\epsilon}$$
Space usage: store at most m sets of centers for each level of the tree.
$$\left(\frac{1}{\epsilon}\right)(m)(2k) = \frac{2k\, n^{\epsilon}}{\epsilon}$$
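The level count and space bound are simple arithmetic, sketched below for concrete numbers. The function name and the sample values n = 10^6, k = 10, ε = 0.5 are assumptions for illustration; the formulas are the ones derived above.

```python
import math

def hierarchy_parameters(n: int, k: int, eps: float):
    """Fan-out, number of levels, and stored centers for the hierarchical scheme."""
    m = n ** eps                          # fan-out: m = n^eps
    levels = math.log(n) / math.log(m)    # log_m n, which equals 1/eps
    space = levels * m * 2 * k            # (1/eps) * m * 2k stored centers
    return m, levels, space

m, levels, space = hierarchy_parameters(n=10**6, k=10, eps=0.5)
print(m, levels, space)
```

With ε = 0.5 the tree has two levels, which recovers the shape of the two-stage core-set scheme; smaller ε trades more levels for less space.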
CoreSet Algorithm
Approximation factor?
Lemma:
$$d(S_w, T) \le 8 \sum_{i=1}^{t} d(S_i, T_i) + 8\, d(S, C^*)$$
Interpretation: We can bound the cost of level 1 by 8 times level 0…
Similarly:
• We can bound the cost of level 2 by 8 times level 1…
• We can bound the cost of level (1/𝜀) by 8 times level (1/𝜀)−1.
Approximation factor: $O(8^{1/\epsilon})$
Hierarchical Construction
Space: $O(k\, n^{\epsilon} / \epsilon)$
Approximation: $(2, O(8^{1/\epsilon}))$
k-Center Clustering
Given points: P = p1, p2, …, pn
Assumptions:
⇒ Points are in a metric space: distances satisfy the triangle inequality.
⇒ (Think: Euclidean space)
⇒ The number of clusters k is given.
Goal:
⇒ Choose a set of k points (“centers”) that minimize the maximum distance to a center.
Example: 3 clusters
k-Center Approximation Algorithm
Show that this is a 2-approximation:
1. T = {x} for any x in P.
2. Repeat until |T| = k:
   • Let z be the point in P that maximizes d(z,T).
   • Add z to T.
3. Return T.
Claim: cost(P,T) ≤ 2 cost(P,C*)
Here cost(P,T) is the maximum distance of any point in P to the set T.
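The greedy farthest-point algorithm above is short enough to write out and compare against a brute-force optimum on a small instance. The 1-D points, the distance function, and the function names are hypothetical choices for this sketch; the brute force restricts centers to points of P.

```python
from itertools import combinations

def kcenter_cost(points, centers, dist):
    """Maximum distance of any point to its nearest center."""
    return max(min(dist(p, c) for c in centers) for p in points)

def greedy_kcenter(points, k, dist):
    T = [points[0]]                                   # any starting point
    while len(T) < min(k, len(points)):
        # z = the point farthest from the current set T
        z = max(points, key=lambda p: min(dist(p, c) for c in T))
        T.append(z)
    return T

dist = lambda a, b: abs(a - b)
points = [0, 1, 2, 10, 11, 12, 20, 21, 50]
T = greedy_kcenter(points, 3, dist)
greedy = kcenter_cost(points, T, dist)
optimal = min(kcenter_cost(points, c, dist) for c in combinations(points, 3))
print(T, greedy, optimal)
```

The brute force is exponential in k and is included only to check the claim cost(P,T) ≤ 2 cost(P,C*) on this example.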
k-Center Clustering
Some useful things to prove:
If x is the farthest point from T at the end (at distance r):
⇒ Every two points in T ∪ {x} are at least r from each other.
⇒ Every other point is at distance ≤ r from T.
If C* is an optimal clustering:
⇒ At least two points in T ∪ {x} are assigned to the same center.
⇒ So the center must be at least distance r/2 from one of them.
Streaming k-Center Clustering
Show that this is an 8-approximation:
Assume the minimum distance between points is 1.
T = first k points in stream.
R = 1
Repeat until end of stream:
1. While |T| ≤ k:
   • Get new point x.
   • If d(x,T) > 2R, then add x to T.
2. T' = ∅.
3. While some z in T has d(z,T') > 2R: add z to T'. (Rebuild T' here.)
4. T = T'
5. R = 2R (Double R.)
Streaming k-Center Clustering
Some useful things to prove:
Before step (2):
⇒ Every point is within 2R of T.
⇒ There are k+1 centers at distance at least R from each other.
Before step (5):
⇒ Every point is within 4R of T.
⇒ All centers are at distance at least 2R from each other.
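One possible reading of the streaming k-center pseudocode is sketched below. It assumes distinct points with minimum pairwise distance 1, treats an empty T' as infinitely far from every point during the rebuild, and stops as soon as the stream ends while |T| ≤ k. The function name, the integer test stream, and these details of the end-of-stream handling are assumptions, not part of the slides.

```python
from itertools import combinations

def streaming_kcenter(stream, k, dist):
    it = iter(stream)
    T = []
    for x in it:                       # T = first k points in the stream
        T.append(x)
        if len(T) == k:
            break
    R = 1.0
    while True:
        while len(T) <= k:             # step 1: absorb points until |T| = k+1
            try:
                x = next(it)
            except StopIteration:
                return T, R            # end of stream with at most k centers
            if min(dist(x, c) for c in T) > 2 * R:
                T.append(x)
        rebuilt = []                   # steps 2-3: keep centers more than 2R apart
        for z in T:
            if not rebuilt or min(dist(z, c) for c in rebuilt) > 2 * R:
                rebuilt.append(z)
        T = rebuilt                    # step 4
        R *= 2                         # step 5: double R

dist = lambda a, b: abs(a - b)
points = [0, 1, 2, 10, 11, 12, 20, 21, 50]
T, R = streaming_kcenter(points, 3, dist)
cost = max(min(dist(p, c) for c in T) for p in points)
optimal = min(max(min(dist(p, c) for c in cs) for p in points)
              for cs in combinations(points, 3))
print(T, R, cost, optimal)
```

If a rebuild still leaves k+1 centers, the inner loop is skipped and R doubles again, so the number of centers eventually drops to at most k.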
Summary
Today: Clustering and Streaming
k-median clustering
• Find k centers to minimize the average distance to a center.
LP approximation algorithm
• Find 2k centers that give a 4-approximation of the optimal clustering.
Streaming
• Find k centers in a stream of points.
• Use a hierarchical scheme to reduce space.
Other clustering problems
Last Week: Graph Streaming
• Connectivity
• Bipartite test
• MST
• Spanners
• Matching