A Space Efficient Streaming Algorithm for Triangle Counting Using the Birthday Paradox

By: Madhav Jha et al. (KDD 2013)

Presented by: Ahmed El-Kishky



Triangles

• A triangle is a set of three vertices A, B, C such that the edges (A,B), (B,C), (A,C) all exist.

• There is a rich body of literature on the analysis of triangles and on counting algorithms.

Motivation

Friends of friends tend to become friends themselves!

[Figure: vertices A, B, C]

(left to right) Paul Erdős, Ronald Graham, Fan Chung Graham

Stanley Wasserman, Katherine Faust, 1994. Social Network Analysis: Methods and Applications. Cambridge: Cambridge University Press

More Motivations

Key Idea: Connected regions of high curvature (i.e., dense in triangles) indicate a common topic!

Eckmann-Moses, Uncovering the Hidden Thematic Structure of the Web (PNAS, 2001)

More Motivations

Key Idea: The triangle distribution among spam hosts is significantly different from that among non-spam hosts!

Triangles used for Web Spam Detection (Becchetti et al. KDD '08)

More Motivations

Key Claim: The number of triangles in the self-centered social network of a user is a good indicator of the role of that user in the community!

Triangles used for assessing Content Quality in Social Networks

Welser, Gleave, Fisher, Smith. Journal of Social Structure, 2007

Even More Motivations

• Numerous other applications, including:

• Motif Detection / Frequent Subgraph Mining (e.g., Protein-Protein Interaction Networks)

• Community Detection (Berry et al. '09)

• Outlier Detection (Tsourakakis '08)

• Link Recommendation

Previous Algorithms

• Counting triangles is important, but graphs can grow prohibitively large.

• Methods such as random sampling, the MapReduce paradigm, external memory, and massive parallelism have been employed.

• Most methods need to store a large fraction of the data.

The Algorithm

• A space-efficient algorithm that approximates transitivity (global clustering coefficient) and total triangle count

• Maintains a "sketch" of the graph to perform estimates

• Single-pass algorithm

• Input is a stream of edges (no edge / vertex deletion)

• O(sqrt(n)) space (n is the number of vertices)
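A single pass with bounded memory is commonly achieved with reservoir sampling. The sketch below is the textbook variant (often called Algorithm R), shown only to illustrate how a fixed-size uniform sample of edges can be kept while the stream goes by. It is an assumption that this matches the spirit of the "sketch" mentioned above; the function name `reservoir_sample` and its parameters are hypothetical.

```python
import random


def reservoir_sample(stream, size, rng):
    """Keep `size` items chosen uniformly at random from a stream, in one pass."""
    reservoir = []
    for t, item in enumerate(stream, start=1):
        if len(reservoir) < size:
            # The first `size` items always enter the reservoir.
            reservoir.append(item)
        else:
            # Item t replaces a random slot with probability size / t.
            j = rng.randrange(t)
            if j < size:
                reservoir[j] = item
    return reservoir


# A toy edge stream: a path on 1001 vertices, read once through an iterator.
edges = [(u, u + 1) for u in range(1000)]
sample = reservoir_sample(iter(edges), 30, random.Random(0))
```

Memory use depends only on `size`, not on the length of the stream.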

The Streaming Setting

• Let G be a simple undirected graph with n vertices and m edges.

• Let T denote the number of triangles in the graph and W the number of wedges (paths of length 2).

• Let k = 3T / W be the transitivity.

• The streaming algorithm considers a sequence of (distinct) edges e_1, e_2, …, e_m.

• Let G_t be the graph at time t, formed by the edge set {e_i : i ≤ t}.
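To make the definitions of T, W and k concrete, here is a small exact (non-streaming) computation on a toy graph. It stores the whole graph, so it is only a reference for checking estimates, not the streaming method. The function name `exact_counts` is hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def exact_counts(edges):
    """Return (T, W, k) for a simple undirected graph given as an edge list."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    # A wedge is a centre vertex plus an unordered pair of its neighbours.
    W = sum(len(nb) * (len(nb) - 1) // 2 for nb in adj.values())
    # A wedge is closed when its two outer vertices are adjacent.
    closed = sum(
        1
        for nb in adj.values()
        for a, b in combinations(nb, 2)
        if b in adj[a]
    )
    T = closed // 3  # every triangle contains exactly three closed wedges
    k = 3 * T / W if W else 0.0
    return T, W, k


# A triangle 1-2-3 with one extra edge hanging off vertex 3.
T, W, k = exact_counts([(1, 2), (2, 3), (1, 3), (3, 4)])
print(T, W, k)  # 1 triangle, 5 wedges, k = 0.6
```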

High Level Intuition

• A wedge is closed if it participates in a triangle and open otherwise.

• k = 3T / W is exactly the probability that a uniform random wedge is closed.

• This formula gives us a randomized method for estimating k, and consequently T: generate a set of (independent) uniform random wedges and find the fraction that are closed.
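The wedge-sampling idea can be sketched offline as follows, assuming the whole graph is available in memory (which the streaming setting does not allow). A centre vertex is drawn with probability proportional to its number of wedges, then two of its neighbours are drawn uniformly, which yields a uniform random wedge. The name `estimate_transitivity` is hypothetical.

```python
import random
from collections import defaultdict


def estimate_transitivity(edges, num_samples, rng):
    """Estimate k and T from uniform random wedges (offline sketch)."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    centres = [v for v in adj if len(adj[v]) >= 2]
    weights = [len(adj[v]) * (len(adj[v]) - 1) // 2 for v in centres]
    W = sum(weights)
    closed = 0
    for centre in rng.choices(centres, weights=weights, k=num_samples):
        a, b = rng.sample(sorted(adj[centre]), 2)
        closed += b in adj[a]  # True counts as 1
    k_hat = closed / num_samples
    return k_hat, k_hat * W / 3  # T = k * W / 3


# In the complete graph on 5 vertices every wedge is closed.
k5 = [(u, v) for u in range(5) for v in range(u + 1, 5)]
k_hat, t_hat = estimate_transitivity(k5, 1000, random.Random(1))
print(k_hat, t_hat)  # 1.0 10.0
```

The accuracy depends only on the number of sampled wedges, not on the size of the graph, which is what makes the approach attractive for large inputs.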

High Level Intuition

• The formula k = 3T / W depends on knowing the number of closed wedges and the total number of wedges.

• How do we sample wedges from a stream of edges?

Streaming Triangles

The Birthday Paradox

• The event that at least two persons (edges) share a birthday (vertex) is the complement of the event that all persons (edges) have different birthdays (no two share a vertex).

• As such, as long as W <= m (pretty much always true for networks), O(sqrt(n)) edges suffice.

The Birthday Paradox Continued

• Because a small number of uniform random edges can give enough wedges to perform wedge sampling, these wedges can be used to estimate k.

• We can reverse the birthday paradox to estimate W.
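"Reversing" the birthday paradox can be illustrated with a small simulation. If s distinct edges are drawn uniformly from m, each of the m(m-1)/2 edge pairs is equally likely to be among the s(s-1)/2 sampled pairs, and exactly W of all pairs form a wedge. Counting wedges inside the sample and rescaling therefore gives an unbiased estimate of W. This is a sketch under those assumptions; the names `wedges_among` and `estimate_wedges` are hypothetical, and the sampling here is done offline with `random.sample` rather than over a stream.

```python
import random
from itertools import combinations


def wedges_among(edge_sample):
    """Count pairs of edges that share exactly one endpoint."""
    return sum(
        1 for e, f in combinations(edge_sample, 2) if len(set(e) & set(f)) == 1
    )


def estimate_wedges(edges, sample_size, trials, rng):
    """Average the rescaled wedge count over several uniform edge samples."""
    m = len(edges)
    scale = m * (m - 1) / (sample_size * (sample_size - 1))
    total = sum(
        wedges_among(rng.sample(edges, sample_size)) for _ in range(trials)
    )
    return scale * total / trials


# A cycle on 200 vertices has m = 200 edges and W = 200 wedges.
cycle = [(i, (i + 1) % 200) for i in range(200)]
w_hat = estimate_wedges(cycle, 30, 1000, random.Random(0))
```

With only 30 of the 200 edges per sample, `w_hat` should land close to the true value of 200; averaging over trials is used here only to reduce the noise of a single sample.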

Birthday Paradox for Wedges

Proof of Birthday Paradox

Estimating T

[Equation lost in extraction: an expression equal to total_wedges]

This directly implies that [equation lost in extraction]
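A hedged reconstruction from the definitions given earlier (k = 3T / W), not necessarily the form shown on the original slide. The symbol s (the number of uniformly sampled distinct edges) and the hatted estimates are assumptions introduced here, and total_wedges is read as the number of wedges formed by the sampled edges.

```latex
% Rearranging the definition of transitivity
k = \frac{3T}{W} \;\Longrightarrow\; T = \frac{k\,W}{3}

% Of the \binom{m}{2} edge pairs, exactly W form a wedge, and a uniform
% sample of s distinct edges contains \binom{s}{2} of those pairs
\mathbb{E}[\text{total\_wedges}] = \binom{s}{2}\,\frac{W}{\binom{m}{2}}
  = \frac{s(s-1)}{m(m-1)}\,W

% Plug-in estimates
\hat{W} = \text{total\_wedges}\cdot\frac{m(m-1)}{s(s-1)},
\qquad
\hat{T} = \frac{\hat{k}\,\hat{W}}{3}
```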

Another Insight

• As the reservoir of wedges is maintained, each new edge in the stream is checked to see whether it closes a stored wedge. But some closed wedges cannot be verified, because their closing edge may have already appeared in the past.

• Insight: In each triangle, there is exactly one wedge whose closing edge appears in the future.

• Each triangle contains three closed wedges, so these future-closed wedges make up one third of all closed wedges.

• Hence the factor of 3 that appears in Streaming-Triangles.
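The one-third relationship can be checked by brute force on a small stream. The sketch below records each edge's arrival time, enumerates every wedge, and counts how many are closed at all versus closed by an edge that arrives after both wedge edges. The function name `closed_wedge_counts` is hypothetical, and the check stores the full stream, so it is a verification aid rather than part of the streaming algorithm.

```python
import random
from itertools import combinations


def closed_wedge_counts(stream):
    """Return (closed wedges, future-closed wedges) for a stream of distinct edges."""
    arrival = {frozenset(e): t for t, e in enumerate(stream)}
    closed = future_closed = 0
    for e, f in combinations(arrival, 2):
        if len(e & f) != 1:
            continue  # edges without a shared endpoint do not form a wedge
        closing = e ^ f  # the edge joining the two outer endpoints
        if closing in arrival:
            closed += 1
            if arrival[closing] > max(arrival[e], arrival[f]):
                future_closed += 1
    return closed, future_closed


# Complete graph on 4 vertices (4 triangles), streamed in a random order.
k4 = [(u, v) for u in range(4) for v in range(u + 1, 4)]
random.Random(0).shuffle(k4)
closed, future_closed = closed_wedge_counts(k4)
print(closed, future_closed)  # 12 4
```

The counts do not depend on the arrival order, because in every triangle exactly one of the three edges arrives last.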

Experimental Results
