Data Mining with MapReduce: Graph and Tensor Algorithms with Applications


Charalampos (Babis) E. Tsourakakis

Modern Data Mining Algorithms 1

Data Analysis Project 20 Apr. 2010

• Introduction
• PART I: Graphs
  ▪ Triangles
  ▪ Diameter
• PART II: Tensors
  ▪ 2 Heads method
  ▪ MACH
• Conclusion/Research Directions


Leonhard Euler (1707-1783)

Seven Bridges of Königsberg, Eulerian paths

Internet Map [lumeta.com]

Food Web [Martinez ’91]

Protein Interactions [genomebiology.com]

Friendship Network [Moody ’01]

Market Basket Analysis: m customers x n products

Documents-Terms: m documents x n words (e.g., freedom, dance, prison)

[Figure: Temperature, Light, Voltage, and Humidity readings (value vs. time in minutes) from the Intel Berkeley lab sensors]

Data modeled as a tensor, i.e., a multidimensional matrix of size T x (#sensors) x (#types of measurements), with modes time, location, and type of measurement.

Multidimensional time series can be modeled in this way.

Functional Magnetic Resonance Imaging (fMRI): voxels x subjects x trials x task conditions x timeticks

PART I: Graphs (Triangles)

• Spam Detection
• Exponential random graphs
• Clustering Coefficients & Transitivity Ratio
• Uncovering the Hidden Thematic Structure of the web
• Link Recommendation

Friends of friends tend to become friends themselves

Contributions:

• Spectral family
• Triangle Sparsifiers
• Randomized SVD

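Theorem 1

$\Delta(G) = \frac{1}{6} \sum_{i=1}^{n} \lambda_i^3$

where $\Delta(G)$ = # triangles in graph G(V,E) and $\lambda_1, \dots, \lambda_n$ = eigenvalues of the adjacency matrix $A_G$.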
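A minimal sketch of Theorem 1 on a toy graph, assuming NumPy and a dense symmetric 0/1 adjacency matrix. The function name `count_triangles_spectral` and the optional `k` parameter (keep only the `k` eigenvalues of largest magnitude) are illustrative choices, not names from the talk.

```python
import numpy as np

def count_triangles_spectral(adj, k=None):
    """Triangle count from eigenvalues: sum of cubes divided by 6."""
    eigvals = np.linalg.eigvalsh(adj)          # all eigenvalues of a symmetric matrix
    if k is not None:
        # Keep only the k eigenvalues with the largest absolute value.
        order = np.argsort(-np.abs(eigvals))
        eigvals = eigvals[order[:k]]
    return float(np.sum(eigvals ** 3) / 6.0)

# Toy graph: the 4-clique, which contains C(4,3) = 4 triangles.
K4 = np.ones((4, 4)) - np.eye(4)
exact = count_triangles_spectral(K4)           # uses all eigenvalues
approx = count_triangles_spectral(K4, k=1)     # uses only the top eigenvalue
```

With all eigenvalues the count is exact; with a truncated spectrum it becomes an approximation whose quality depends on how quickly the cubes decay.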

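Theorem 2

$\Delta(i) = \frac{1}{2} \sum_{j=1}^{n} \lambda_j^3 u_{ij}^2$

where $\Delta(i)$ = # triangles vertex i participates in, $u_j$ = j-th eigenvector, and $u_{ij}$ = i-th entry of $u_j$.

[Figure: example vertex i with Δ(i) = 2]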
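A small sketch of the per-vertex count, assuming NumPy. It relies on the fact that the i-th diagonal entry of A³ equals twice the number of triangles through vertex i; the function name and the bowtie test graph are illustrative.

```python
import numpy as np

def local_triangles_spectral(adj):
    """Triangles per vertex: (1/2) * sum_j lambda_j^3 * u_ij^2."""
    eigvals, eigvecs = np.linalg.eigh(adj)     # columns of eigvecs are eigenvectors
    return (eigvecs ** 2) @ (eigvals ** 3) / 2.0

# Toy graph: two triangles {0,1,2} and {0,3,4} sharing vertex 0.
edges = [(0, 1), (1, 2), (0, 2), (0, 3), (3, 4), (0, 4)]
bowtie = np.zeros((5, 5))
for u, v in edges:
    bowtie[u, v] = bowtie[v, u] = 1.0

per_vertex = local_triangles_spectral(bowtie)  # vertex 0 lies on both triangles
```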

[Figure panels: Airports; Political blogs]

Very important for us because:

• Few eigenvalues contribute a lot!
• Cubes amplify this even more!
• Lanczos converges fast due to large spectral gaps!

Political Blogs:

• The bulk of the spectrum is almost symmetric around 0!
• Its sum of cubes almost cancels out, so omit it!
• Keep only the top 3 eigenvalues!

Nodes     Edges     Description                                           Type
~75K      ~405K     Epinions network                                      Social Networks
~404K     ~2.1M     Flickr                                                Social Networks
~27K      ~341K     Arxiv Hep-Th                                          Co-authorship network
~1K       ~17K      Political blogs                                       Information Networks
~13K      ~148K     Reuters news                                          Information Networks
~3M       35M       Wikipedia 2006-Sep-05                                 Web Graphs
~3.15M    ~37M      Wikipedia 2006-Nov-04                                 Web Graphs
~13.5K    ~37.5K    AS Oregon                                             Internet Graphs
~23.5K    ~47.5K    CAIDA AS 2004 to 2008 (means over 151 timestamps)     Internet Graphs

[Figure: triangles node i participates in vs. triangles node i participates in according to our estimation]

With 2-3 eigenvalues, almost ideal results!

Kronecker graphs are a model for generating graphs that mimic properties of real-world networks. The basic operation is the Kronecker product [Leskovec et al.].

Initiator graph, adjacency matrix A[0]:

0 1 1
1 0 1
1 1 0

Taking the Kronecker product of A[0] with itself gives the adjacency matrix A[1], then A[2]. Repeat k times to get the adjacency matrix A[k].

Theorem [KroneckerTRC]: Let B = A[k] be the k-th Kronecker product and $\Delta(G_A)$, $\Delta(G_B)$ the total number of triangles in $G_A$, $G_B$. Then the following equality holds:

$\Delta(G_B) = 6^{k} \, \Delta(G_A)^{k+1}$
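A numerical check of the KroneckerTRC relation, assuming NumPy and assuming that A[k] means the initiator Kronecker-multiplied with itself k times (so A[0] is the initiator). Combining Theorem 1 with the fact that the eigenvalues of a Kronecker product are products of eigenvalues gives trace(B³) = trace(A³)^(k+1), i.e. Δ(G_B) = 6^k Δ(G_A)^(k+1). The helper names are illustrative.

```python
import numpy as np

def triangles(adj):
    """Exact triangle count via trace(A^3) / 6 for a 0/1 symmetric matrix."""
    return int(round(np.trace(np.linalg.matrix_power(adj, 3)) / 6))

def kronecker_power(adj, k):
    """A[k]: the initiator Kronecker-multiplied with itself k times (assumed convention)."""
    result = adj
    for _ in range(k):
        result = np.kron(result, adj)
    return result

# Initiator from the slide: a single triangle.
A0 = np.array([[0, 1, 1],
               [1, 0, 1],
               [1, 1, 0]])

for k in range(3):
    predicted = 6 ** k * triangles(A0) ** (k + 1)
    observed = triangles(kronecker_power(A0, k))
    print(k, predicted, observed)
```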

Observation 1: Eigendecomposition <-> SVD when the matrix is symmetric, i.e., eigenvectors = left singular vectors and $\lambda_i = \sigma_i \, \mathrm{sgn}(u_i \cdot v_i)$, where $\lambda_i$, $\sigma_i$ are the eigenvalue and singular value respectively, and $u_i$, $v_i$ the left and right singular vectors respectively.

Observation 2: We care about a low-rank approximation of A.
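A sketch of Observation 1, assuming NumPy: the signs of the eigenvalues of a symmetric matrix are read off the inner products of matching left and right singular vectors. One caveat that is an assumption of this sketch rather than a statement from the talk: if two eigenvalues have the same magnitude but opposite signs, the SVD may mix their vectors and the sign recovery is not reliable.

```python
import numpy as np

def eigenvalues_from_svd(A):
    """Recover eigenvalues of a symmetric matrix as sigma_i * sgn(u_i . v_i)."""
    U, s, Vt = np.linalg.svd(A)
    # Column i of U is u_i; row i of Vt is v_i. Take their inner products.
    signs = np.sign(np.sum(U * Vt.T, axis=0))
    return s * signs

K4 = np.ones((4, 4)) - np.eye(4)               # eigenvalues 3, -1, -1, -1
lams = eigenvalues_from_svd(K4)
triangles_from_svd = np.sum(lams ** 3) / 6.0   # feeds straight into Theorem 1
```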

Frieze, Kannan, Vempala

Idea: Sample c columns, obtain $\tilde{A}$, and find $\tilde{A}_k$ instead of the optimal $A_k$. Recover signs from left and right singular vectors. Use EigenTriangle!

Results: c=100, k=6 for Flickr: EigenTriangle 95.6% accuracy, Approximation 95.46%

(1) Pick column i with probability proportional to its squared length.
(2) Use the sampled matrix to obtain a good low-rank approximation to the original one.
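The two steps above can be sketched as follows, assuming NumPy. This is a simplified illustration in the spirit of length-squared column sampling, not the exact procedure used in the experiments; the rescaling by 1/sqrt(c * p_i) and the function names are assumptions of this sketch.

```python
import numpy as np

def sample_columns(A, c, rng):
    """Step (1): sample c columns with probability proportional to squared length."""
    col_norms_sq = np.sum(A ** 2, axis=0)
    probs = col_norms_sq / col_norms_sq.sum()
    idx = rng.choice(A.shape[1], size=c, replace=True, p=probs)
    # Rescale so that C @ C.T matches A @ A.T in expectation.
    return A[:, idx] / np.sqrt(c * probs[idx])

def approx_low_rank(A, c, k, rng):
    """Step (2): project A onto the top-k left singular vectors of the sample."""
    C = sample_columns(A, c, rng)
    U, _, _ = np.linalg.svd(C, full_matrices=False)
    Uk = U[:, :k]
    return Uk @ (Uk.T @ A)

rng = np.random.default_rng(0)
A = np.outer([1.0, 2.0, 3.0], [1.0, 1.0, 2.0, 0.5])   # rank-1 toy matrix
A_hat = approx_low_rank(A, c=3, k=1, rng=rng)
```

On a rank-1 matrix every sampled column spans the same direction, so the approximation is exact; on real graphs the quality depends on c and on how fast the singular values decay.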


Approximate a given graph G with a sparse graph H, such that H is close to G in a certain notion.

Examples:

• Cut-preserving sparsifiers: Benczur-Karger
• Spectral sparsifiers: Spielman-Teng

What about Triangle Sparsifiers?

For every edge of G(V,E), toss a coin:

• HEADS! Edge (i,j) "survives" with probability p.
• TAILS! Edge (k,m) "dies".

Let t = # Δ in G(V,E) and T = # Δ in the sparsified graph G'(V,E').

Now, count triangles in G' and let $T/p^3$ be the estimate of t.

Main Theoretical Results: Under mild conditions on the triangle density (at least a nearly linear number of triangles), our estimate is strongly concentrated around the true number of triangles!
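A minimal sketch of the coin-tossing estimator in plain Python. It assumes each undirected edge appears once in the edge list; the exact counter used on the sparsified graph is a simple neighbour-intersection routine chosen for brevity, and any triangle counting algorithm could take its place.

```python
import random
from itertools import combinations

def count_triangles(edges):
    """Exact count: common neighbours of both endpoints, summed over edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    total = sum(len(adj[u] & adj[v]) for u, v in edges)
    return total // 3                          # every triangle is seen once per edge

def sparsified_triangle_estimate(edges, p, rng):
    """Keep each edge with probability p, count triangles T, return T / p^3."""
    kept = [e for e in edges if rng.random() < p]
    return count_triangles(kept) / p ** 3

rng = random.Random(0)
k5_edges = list(combinations(range(5), 2))     # K5 has C(5,3) = 10 triangles
estimate = sparsified_triangle_estimate(k5_edges, p=0.5, rng=rng)
```

A triangle survives only if all three of its edges survive, which happens with probability p³; dividing by p³ therefore makes the estimate unbiased.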

1 day = 86400 seconds

Expected speedup: $1/p^2$

PART I: Graphs (Diameter)

Milgram 1967

The "small world experiment":

• Pick 300 people at random.
• Ask them to get a letter to a stockbroker in Boston by passing it through friends.

How many steps does it take? Only 6!

Typically the diameter of real-world networks is surprisingly small!

Does the same observation hold on the Yahoo Web Graph (2002), where #nodes=1.4B and #edges=6.83B?

Assume we have a multiset $M = \{x_1, \dots, x_m\}$ and we want to count the number n of distinct elements in M. How can we do this using a small amount of space?

Flajolet & G. Nigel Martin

• Hash function $h: U \to [0, \dots, 2^L - 1]$
• $y = \sum_k \mathrm{bit}(y,k) \, 2^k$
• $\rho(y)$ = minimum k s.t. bit(y,k)=1, o/w L
• Let's keep a BITMASK[0..L], initially all zeros.
• Hash every x in M and find $\rho(h(x))$. If BITMASK[$\rho(h(x))$] is 0, then set it to 1!

How will the bitmask look at the end?

0000000000…. 010110… 1111111111111

• Positions i << log(n): almost surely 1.
• Positions i >> log(n): almost surely 0.
• Positions i ~= log(n): a mix of 0s and 1s. This region will give us the information.

Flajolet-Martin prove that for the random variable R = lowest position holding a 0 in our bitmask (the "leftmost 0" when position 0 is written first):

$E(R) \approx \log_2(0.77351 \, n)$
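A small sketch of the Flajolet-Martin bitmask in plain Python. The choice of SHA-256 truncated to 32 bits as the hash function, the `seed` parameter, and the function names are assumptions made for illustration. A single bitmask gives a high-variance estimate; in practice several hash functions are combined.

```python
import hashlib

L = 32            # number of hash bits
PHI = 0.77351     # Flajolet-Martin correction constant

def rho(y):
    """Position of the least-significant 1 bit of y, or L if y == 0."""
    if y == 0:
        return L
    return (y & -y).bit_length() - 1

def fm_bitmask(items, seed=0):
    """Set bit rho(h(x)) for every item x; duplicates set the same bit again."""
    mask = 0
    for x in items:
        digest = hashlib.sha256(f"{seed}:{x}".encode()).digest()
        y = int.from_bytes(digest[:4], "big")  # keep L = 32 bits of the hash
        mask |= 1 << rho(y)
    return mask

def fm_estimate(mask):
    """Estimate the number of distinct items as 2^R / PHI, R = lowest zero position."""
    r = 0
    while mask & (1 << r):
        r += 1
    return 2 ** r / PHI

stream = [i % 1000 for i in range(10000)]      # 1000 distinct values, each 10 times
approx_distinct = fm_estimate(fm_bitmask(stream))
```

The bitmask depends only on the set of distinct items, which is why the memory footprint stays at L+1 bits no matter how long the stream is.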

• For every h = 1, 2, ...: estimate the cardinality of the set N(h), i.e., the pairs of nodes reachable within h steps.
• When the cardinality stabilizes, output the number of steps needed to reach that cardinality as the diameter.

Properties:

• Scalability: O(diam(G)*m), m = #edges
• Efficient access to the file (very important)
• Parallelizable (also very important)
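The loop above can be sketched in plain Python. To keep the example deterministic, exact Python sets stand in for the Flajolet-Martin bitmasks: in a real run each set would be a fixed-size bitmask, set union would become bitwise OR, and the set sizes would be replaced by the sketch estimates. The function name and the `max_steps` guard are illustrative.

```python
def diameter_by_neighbourhood_growth(num_nodes, edges, max_steps=100):
    """Grow h-hop neighbourhoods until N(h) stops changing; return that h."""
    # reach[v] = nodes within h hops of v (a bitmask sketch in the scalable version)
    reach = [{v} for v in range(num_nodes)]
    prev_pairs = num_nodes                     # N(0): every node reaches itself
    for h in range(1, max_steps + 1):
        new_reach = [set(s) for s in reach]
        for u, v in edges:                     # one pass over the edge file
            new_reach[u] |= reach[v]
            new_reach[v] |= reach[u]
        reach = new_reach
        pairs = sum(len(s) for s in reach)     # N(h)
        if pairs == prev_pairs:                # cardinality stabilized
            return h - 1
        prev_pairs = pairs
    return max_steps

path = [(0, 1), (1, 2), (2, 3), (3, 4)]        # path on 5 nodes, diameter 4
print(diameter_by_neighbourhood_growth(5, path))
```

Each iteration touches every edge once, which is where the O(diam(G)*m) cost comes from when the per-node state has constant size.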

The diameter of the Yahoo Web Graph is surprisingly small (7-8)!

PART II: Tensors (2 Heads method)

Document-to-term matrix = (documents to document HCs) x (strength of each concept) x (terms to term HCs)

[Figure: example with the terms data, graph, java, brain, lung and the document groups CS and MD]

Tucker is an SVD-like decomposition of a tensor: one projection matrix per mode, plus a core tensor giving the correlation among the projection matrices.

In: D
Out: D' = [G; U0, U1, U2]

1. Spatial compression: Tucker decomposition
2. Temporal compression: Wavelet transform
3. Sparsify the core tensor G

Error: $e^2 = 1 - \|G\|^2 / \|D\|^2$

[Figure: In: tensor D with location and modality modes. Tucker-2 gives X with projection matrices U1 and U2^T; the fixed transform matrix U0 turns X into the wavelet coefficients G; sparsifying G gives G'. Out: G', U1, U2^T.]
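A sketch of the error formula, assuming NumPy and using a truncated HOSVD as a stand-in for the Tucker step. The wavelet transform and the core sparsification of the 2 Heads method are not reproduced here; the helper names are illustrative. With orthonormal projection matrices, $1 - \|G\|^2/\|D\|^2$ equals the relative squared reconstruction error, which the code checks numerically.

```python
import numpy as np

def unfold(T, mode):
    """Matricize tensor T along the given mode."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def mode_multiply(T, M, mode):
    """Multiply tensor T by matrix M along the given mode."""
    out = np.tensordot(M, T, axes=(1, mode))   # the new axis comes out first
    return np.moveaxis(out, 0, mode)

def hosvd(D, ranks):
    """Truncated HOSVD: one orthonormal projection matrix per mode plus a core."""
    factors = []
    for mode, r in enumerate(ranks):
        U, _, _ = np.linalg.svd(unfold(D, mode), full_matrices=False)
        factors.append(U[:, :r])
    G = D
    for mode, U in enumerate(factors):
        G = mode_multiply(G, U.T, mode)
    return G, factors

def reconstruct(G, factors):
    """Map the core back to the original space."""
    D_hat = G
    for mode, U in enumerate(factors):
        D_hat = mode_multiply(D_hat, U, mode)
    return D_hat

rng = np.random.default_rng(0)
D = rng.standard_normal((6, 5, 4))
G, factors = hosvd(D, (3, 3, 2))
e2 = 1 - np.linalg.norm(G) ** 2 / np.linalg.norm(D) ** 2
rel_err = np.linalg.norm(D - reconstruct(G, factors)) ** 2 / np.linalg.norm(D) ** 2
```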

In: sensor measurements

Out:

• Projection matrices U1 and U2
• Core G' (wavelet coefficients)

Mining guide:

• U1 and U2 reveal the patterns on location and modality, respectively.
• G' provides the patterns on time.

[Figure: input tensor D (location x modality over time) with the Temperature, Light, Humidity, and Voltage series (value vs. time in minutes), and the output G', U1, U2^T]

• 1st HC (hidden concept): dominant trend, e.g., daily periodicity
• 2nd HC: exceptions

[Figure: 1st Hidden Concept (Daily Periodicity) and 2nd Hidden Concept (Exceptions), each plotted over sensors 1..54]

• 1st HC indicates the main sensor modality correlations
  ▪ Temperature and light are positively correlated, while humidity is anti-correlated with the rest
• 2nd HC indicates an abnormal pattern which is due to battery outage for some sensors

[Figure: 1st and 2nd Hidden Concepts over the four modalities (volt, humid, temp, light)]

• 1st scalogram indicates daily periodicity
• 2nd scalogram gives an abnormal flat trend due to battery outage

PART II: Tensors (MACH)

Most real-world processes result in sparse tensors. However, there exist important processes which result in dense tensors:

Physical Process                                              Percentage of non-zero entries
Sensor network (sensor x measurement type x timeticks)        85%
Computer network (machine x measurement type x timeticks)     81%

Performing a Tucker decomposition on a dense tensor can be either very slow or impossible due to memory constraints.

Can we trade a little bit of accuracy for efficiency?

McSherry Achlioptas

MACH extends the work of Achlioptas-McSherry for fast low rank approximations to the multilinear setting.

Toss a coin for each non-zero entry with probability p:

• If it "survives", reweigh it by 1/p.
• If not, make it zero!

Perform Tucker on the sparsified tensor!

For the theoretical results, see Tsourakakis, SDM 2010.
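A minimal sketch of the sparsification step, assuming NumPy and a dense array as input. The function name `mach_sparsify` is illustrative. The Tucker step itself is left out: the sparsified tensor would be handed to whichever Tucker routine is in use.

```python
import numpy as np

def mach_sparsify(X, p, rng):
    """Keep each entry with probability p and reweigh survivors by 1/p."""
    keep = rng.random(X.shape) < p             # one independent coin per entry
    return np.where(keep, X / p, 0.0)

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 4, 50))           # toy dense tensor
X_sparse = mach_sparsify(X, p=0.1, rng=rng)
kept_fraction = np.count_nonzero(X_sparse) / X.size
```

Reweighing by 1/p keeps every entry unbiased in expectation, so the sparsified tensor equals the original on average while holding roughly a fraction p of the non-zeros.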

Intemon (Carnegie Mellon University Self-Monitoring system)

Tensor X: 100 machines x 12 types of measurement x 10080 timeticks

Jimeng Sun showed in his thesis that Tucker decompositions can be used to monitor the system efficiently.

For p=0.1, Pearson's correlation coefficient is 0.99 (ideal: ρ=1).

[Figure: Exact vs. MACH. Find the differences!]

The qualitative analysis, which is important for our goals, remains the same!

Berkeley Lab

Tensor: 54 sensors x 4 types of measurement x 5385 timeticks

The qualitative analysis, which is important for our goals, remains the same!

The spatial principal mode is also preserved, and Pearson's correlation coefficient is again almost 1!

REMARKS

1) Daily periodicity is apparent.
2) Pearson's correlation coefficient with the exact component is 0.99.

Conclusion/Research Directions

• More applications of probabilistic combinatorics in large-scale graph mining
• Randomized algorithms work very well (e.g., sublinear-time algorithms), but are typically hard to analyze.
• Smallest p* for tensor sparsification for the (messy) HOOI algorithm
• Better sparsification (Edge (1,2) is important, Weighted Graphs!)
• Property Testing: Is a graph triangle-free?
• Does Boolean Matrix Multiplication have a truly subcubic algorithm?

Faloutsos Miller Schwartz Frieze Kolountzakis Koutis

Drineas Kang Leskovec


How to choose p?

Pick p=1/ Keep doubling until concentration

[Figure labels: Mildness, pick p=1; Concentration appears; Concentration becomes stronger; Concentration]

I want to compute the number of triangles!

1. Use Lanczos to compute the first two eigenvalues, please!
2. Is the cube of the second one significantly smaller than the cube of the first?
   ▪ NO: Iterate then!
3. After some iterations (hopefully few!), compute the k-th eigenvalue. Is its cube much smaller than the sum of the cubes computed so far?
   ▪ YES! Algorithm terminates! The estimated # of Δs is the sum of cubes of the $\lambda_i$'s divided by 6!
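The flowchart can be sketched as follows, assuming NumPy. Two simplifications are assumptions of this sketch: a full dense eigendecomposition stands in for the incremental Lanczos iterations, and the stopping rule compares the latest cube against the running sum using a tolerance parameter `tol` whose default value is arbitrary.

```python
import numpy as np

def eigen_triangle(adj, tol=0.05):
    """Add cubes of eigenvalues in order of decreasing magnitude until they stop mattering."""
    eigvals = np.linalg.eigvalsh(adj)                    # stand-in for Lanczos
    eigvals = eigvals[np.argsort(-np.abs(eigvals))]      # largest magnitude first
    total = 0.0
    used = 0
    for used, lam in enumerate(eigvals, start=1):
        total += lam ** 3
        # Stop once the newest cube is tiny compared with the running sum.
        if used >= 2 and total > 0 and abs(lam) ** 3 <= tol * total:
            break
    return total / 6.0, used

K4 = np.ones((4, 4)) - np.eye(4)                         # 4 triangles
estimate, eigenvalues_used = eigen_triangle(K4)
exact, all_used = eigen_triangle(K4, tol=0.0)            # tol=0 never stops early here
```

On the 4-clique the loop stops after two eigenvalues with an estimate of 26/6, close to the exact count of 4 that the full spectrum gives.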

Remark: Even if our theoretical results refer to HOSVD, MACH works for HOOI.