data-intensive information processing applications ― session #5

48
8/14/2019 Data-Intensive Information Processing Applications ― Session #5 http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 1/48

Upload: api-25884893

Post on 30-May-2018

219 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Data-Intensive Information Processing Applications ― Session #5

8/14/2019 Data-Intensive Information Processing Applications ― Session #5

http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 1/48

Page 2: Data-Intensive Information Processing Applications ― Session #5

8/14/2019 Data-Intensive Information Processing Applications ― Session #5

http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 2/48

Page 3: Data-Intensive Information Processing Applications ― Session #5

8/14/2019 Data-Intensive Information Processing Applications ― Session #5

http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 3/48

Today’s Agenda

Graph problems and representations

Parallel breadth-first search

PageRank

Page 4: Data-Intensive Information Processing Applications ― Session #5

8/14/2019 Data-Intensive Information Processing Applications ― Session #5

http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 4/48

What ’s a graph?

G = (V,E), where

V represents the set of vertices (nodes)

E represents the set of edges (links)

Both vertices and edges may contain additional information

 

Directed vs. undirected edges

Presence or absence of cycles

Graphs are everywhere:

Hyperlink structure of the Web

Physical structure of computers on the Internet Interstate highway system

Social networks

Page 5: Data-Intensive Information Processing Applications ― Session #5

8/14/2019 Data-Intensive Information Processing Applications ― Session #5

http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 5/48

Source: Wikipedia (Königsberg)

Page 6: Data-Intensive Information Processing Applications ― Session #5

8/14/2019 Data-Intensive Information Processing Applications ― Session #5

http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 6/48

Som e Graph Problem s

Finding shortest paths

Routing Internet traffic and UPS trucks

Finding minimum spanning trees Telco laying down fiber

Finding Max Flow

Airline scheduling

Identify “special” nodes and communities

Breaking up terrorist cells, spread of avian flu

par e ma c ng Monster.com, Match.com

...

Page 7: Data-Intensive Information Processing Applications ― Session #5

8/14/2019 Data-Intensive Information Processing Applications ― Session #5

http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 7/48

Graphs and MapReduc e

Graph algorithms typically involve:

Performing computations at each node: based on node features,edge features, and local link structure

Propagating computations: “traversing” the graph

 

How do you represent graph data in MapReduce?

How do you traverse a graph in MapReduce?

Page 8: Data-Intensive Information Processing Applications ― Session #5

8/14/2019 Data-Intensive Information Processing Applications ― Session #5

http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 8/48

Represent ing Graphs

G = (V, E)

Two common re resentations

Adjacency matrix Adjacency list

Page 9: Data-Intensive Information Processing Applications ― Session #5

8/14/2019 Data-Intensive Information Processing Applications ― Session #5

http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 9/48

Adjacenc y Mat r i c es

Represent a graph as an n x n square matrix M 

n = |V|

M ij 

= 1 means a link from node i to j 

1 2 3 4

1 0 1 0 1 1

2 1 0 1 13

4 1 0 1 0 4

Page 10: Data-Intensive Information Processing Applications ― Session #5

8/14/2019 Data-Intensive Information Processing Applications ― Session #5

http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 10/48

Adjac enc y Mat r ic es : Cr i t ique

Advantages:

Amenable to mathematical manipulation

Iteration over rows and columns corresponds to computations onoutlinks and inlinks

Lots of zeros for sparse matrices

Lots of wasted space

Page 11: Data-Intensive Information Processing Applications ― Session #5

8/14/2019 Data-Intensive Information Processing Applications ― Session #5

http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 11/48

Adjacenc y List s

Take adjacency matrices… and throw away all the zeros

1: 2 4

1 2 3 4

1 0 1 0 12: 1, 3, 43: 1

2 1 0 1 1

4: 1, 34 1 0 1 0

Page 12: Data-Intensive Information Processing Applications ― Session #5

8/14/2019 Data-Intensive Information Processing Applications ― Session #5

http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 12/48

Adjac enc y L is t s : Cr i t ique

Advantages:

Much more compact representation

Easy to compute over outlinks

Disadvantages:

Much more difficult to compute over inlinks

Page 13: Data-Intensive Information Processing Applications ― Session #5

8/14/2019 Data-Intensive Information Processing Applications ― Session #5

http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 13/48

Single Sourc e Shor t est Pat h

Problem: find shortest path from a source node to one ormore target nodes

Shortest might also mean lowest weight or cost

First, a refresher: Dijkstra’s Algorithm

Page 14: Data-Intensive Information Processing Applications ― Session #5

8/14/2019 Data-Intensive Information Processing Applications ― Session #5

http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 14/48

Di jk s t ra ’s A lgor i t hm Ex am ple

∞ ∞

10

1

0 2 3 9 4 6

5 7

∞ ∞2

Example from CLR

Page 15: Data-Intensive Information Processing Applications ― Session #5

8/14/2019 Data-Intensive Information Processing Applications ― Session #5

http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 15/48

Di jk s t ra ’s A lgor i t hm Ex am ple

10∞

10

1

0 2 3 9 4 6

5 7

5 ∞2

Example from CLR

Page 16: Data-Intensive Information Processing Applications ― Session #5

8/14/2019 Data-Intensive Information Processing Applications ― Session #5

http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 16/48

Di jk s t ra ’s A lgor i t hm Ex am ple

8 14

10

1

0 2 39

4 6

5 7

5 7

2

Example from CLR

Page 17: Data-Intensive Information Processing Applications ― Session #5

8/14/2019 Data-Intensive Information Processing Applications ― Session #5

http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 17/48

Page 18: Data-Intensive Information Processing Applications ― Session #5

8/14/2019 Data-Intensive Information Processing Applications ― Session #5

http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 18/48

Di jk s t ra ’s A lgor i t hm Ex am ple

8 9

1

10

1

0 2 39

4 6

5 7

5 7

2

Example from CLR

Page 19: Data-Intensive Information Processing Applications ― Session #5

8/14/2019 Data-Intensive Information Processing Applications ― Session #5

http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 19/48

Di jk s t ra ’s A lgor i t hm Ex am ple

8 9

10

1

0 2 39

4 6

5 7

5 7

2

Example from CLR

Page 20: Data-Intensive Information Processing Applications ― Session #5

8/14/2019 Data-Intensive Information Processing Applications ― Session #5

http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 20/48

Single Sourc e Shor t est Pat h

Problem: find shortest path from a source node to one ormore target nodes

Shortest might also mean lowest weight or cost

Single processor machine: Dijkstra’s Algorithm

MapReduce: parallel Breadth-First Search (BFS)

Page 21: Data-Intensive Information Processing Applications ― Session #5

8/14/2019 Data-Intensive Information Processing Applications ― Session #5

http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 21/48

Find ing t he Shor t es t Pat h

Consider simple case of equal edge weights

Solution to the roblem can be defined inductivel

Here’s the intuition:

Define: b is reachable from a if b is on ad acenc list of a  

DISTANCETO(s ) = 0

For all nodes p reachable from s ,

For all nodes n reachable from some other set of nodes M ,DISTANCETO(n ) = 1 + min(DISTANCETO(m ), m ∈ M )

s m 2 

m 1

1

d 2 

m 3 

… d 3 

Page 22: Data-Intensive Information Processing Applications ― Session #5

8/14/2019 Data-Intensive Information Processing Applications ― Session #5

http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 22/48

Page 23: Data-Intensive Information Processing Applications ― Session #5

8/14/2019 Data-Intensive Information Processing Applications ― Session #5

http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 23/48

Page 24: Data-Intensive Information Processing Applications ― Session #5

8/14/2019 Data-Intensive Information Processing Applications ― Session #5

http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 24/48

From In t u i t ion t o Algori t hm

Data representation:

Key: node n 

Value: d (distance from start), adjacency list (list of nodes

reachable from n )

Initialization: for all nodes except for start node, = ∞

Mapper:

∀m ∈ adjacency list: emit (m , d + 1)

Sort/Shuffle

Groups distances by reachable nodes

Reducer: Selects minimum distance path for each reachable node

 

Page 25: Data-Intensive Information Processing Applications ― Session #5

8/14/2019 Data-Intensive Information Processing Applications ― Session #5

http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 25/48

Page 26: Data-Intensive Information Processing Applications ― Session #5

8/14/2019 Data-Intensive Information Processing Applications ― Session #5

http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 26/48

BFS Pseudo-Code

Page 27: Data-Intensive Information Processing Applications ― Session #5

8/14/2019 Data-Intensive Information Processing Applications ― Session #5

http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 27/48

St opping Cr i t er ion

How many iterations are needed in parallel BFS (equaledge weight case)?

Convince yourself: when a node is first “discovered”,we’ve found the shortest path

Now answer the question...

Six degrees of separation? Practicalities of implementation in MapReduce

Page 28: Data-Intensive Information Processing Applications ― Session #5

8/14/2019 Data-Intensive Information Processing Applications ― Session #5

http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 28/48

Com par ison t o Di jk s t ra

Dijkstra’s algorithm is more efficient

At any step it only pursues edges from the minimum-cost pathinside the frontier

MapReduce explores all paths in parallel

“ o s o was e

Useful work is only done at the “frontier”

Wh can’t we do better usin Ma Reduce?

Page 29: Data-Intensive Information Processing Applications ― Session #5

8/14/2019 Data-Intensive Information Processing Applications ― Session #5

http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 29/48

Weigh t ed Edges

Now add positive weights to the edges

Why can’t edge weights be negative?

Simple change: adjacency list now includes a weight w foreach edge

In mapper, emit (m , d + w p ) instead of (m , + 1) for each node m 

That’s it?

Page 30: Data-Intensive Information Processing Applications ― Session #5

8/14/2019 Data-Intensive Information Processing Applications ― Session #5

http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 30/48

St opping Cr i t er ion

How many iterations are needed in parallel BFS (positiveedge weight case)?

Convince yourself: when a node is first “discovered”,we’ve found the shortest path

Page 31: Data-Intensive Information Processing Applications ― Session #5

8/14/2019 Data-Intensive Information Processing Applications ― Session #5

http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 31/48

Addi t ional Com plex i t ies

search frontier  1

1

r  10

n 1

n 5 

n 6  n 7 n 8 

n 9 1

p q 

n 2 n 3 

n 4 

1

11

1

Page 32: Data-Intensive Information Processing Applications ― Session #5

8/14/2019 Data-Intensive Information Processing Applications ― Session #5

http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 32/48

St opping Cr i t er ion

How many iterations are needed in parallel BFS (positiveedge weight case)?

Practicalities of implementation in MapReduce

Page 33: Data-Intensive Information Processing Applications ― Session #5

8/14/2019 Data-Intensive Information Processing Applications ― Session #5

http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 33/48

Graphs and MapReduc e

Graph algorithms typically involve:

Performing computations at each node: based on node features,edge features, and local link structure

Propagating computations: “traversing” the graph

 

Represent graphs as adjacency lists

Perform local computations in mapper Pass along partial results via outlinks, keyed by destination node

Perform aggregation in reducer on inlinks to a node

Iterate until conver ence: controlled b external “driver”

Don’t forget to pass the graph structure between iterations

Page 34: Data-Intensive Information Processing Applications ― Session #5

8/14/2019 Data-Intensive Information Processing Applications ― Session #5

http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 34/48

Page 35: Data-Intensive Information Processing Applications ― Session #5

8/14/2019 Data-Intensive Information Processing Applications ― Session #5

http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 35/48

PageRank : Def ined

Given page x with inlinks t 1…t n , where

C(t) is the out-degree of t 

α  is probability of random jump

N is the total number of nodes in the graph

∑=

−+⎟ ⎠

 ⎞⎜⎝ 

⎛ =

n

i i

i

t C 

t PR

 N  xPR

1 )(

)()1(

1)( α α 

X

t 1

t 2 

…n 

Page 36: Data-Intensive Information Processing Applications ― Session #5

8/14/2019 Data-Intensive Information Processing Applications ― Session #5

http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 36/48

Com put ing PageRank

Properties of PageRank

Can be computed iteratively

Effects at each iteration are local

Sketch of algorithm:

Start with seed PR i values

Each page distributes PR i “credit” to all pages it links to

Each target page adds up “credit” from multiple in-bound links tocompute PR i+1

Iterate until values converge

Page 37: Data-Intensive Information Processing Applications ― Session #5

8/14/2019 Data-Intensive Information Processing Applications ― Session #5

http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 37/48

Page 38: Data-Intensive Information Processing Applications ― Session #5

8/14/2019 Data-Intensive Information Processing Applications ― Session #5

http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 38/48

Page 39: Data-Intensive Information Processing Applications ― Session #5

8/14/2019 Data-Intensive Information Processing Applications ― Session #5

http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 39/48

Sam ple PageRank I t e rat ion (2 )

n 2  (0.166) n 2  (0.133)Iteration 2

n 1 (0.066).

0.033

0.083.

0.1 0.10.1

n 1 (0.1)

n 4  (0.3)

n 3  (0.166)n 5  (0.3)

0.3 0.166

n 4  (0.2)

n 3  (0.183)n 5  (0.383)

Page 40: Data-Intensive Information Processing Applications ― Session #5

8/14/2019 Data-Intensive Information Processing Applications ― Session #5

http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 40/48

PageRank in MapReduc e

n 5  [n 1, n 2 , n 3 ]n 1 [n 2 , n 4 ] n 2  [n 3 , n 5 ] n 3  [n 4 ] n 4  [n 5 ]

n 2  n 4  n 3  n 5  n 1 n 2  n 3 n 4  n 5 

n 2  n 4 n 3  n 5 n 1 n 2  n 3  n 4  n 5 

Reduce

n 5  [n 1, n 2 , n 3 ]n 1 [n 2 , n 4 ] n 2  [n 3 , n 5 ] n 3  [n 4 ] n 4  [n 5 ]

Page 41: Data-Intensive Information Processing Applications ― Session #5

8/14/2019 Data-Intensive Information Processing Applications ― Session #5

http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 41/48

PageRank Pseudo-Code

Page 42: Data-Intensive Information Processing Applications ― Session #5

8/14/2019 Data-Intensive Information Processing Applications ― Session #5

http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 42/48

Com plet e PageRank

Two additional complexities

What is the proper treatment of dangling nodes?

How do we factor in the random jump factor?

Solution:

Second pass to redistribute “missing PageRank mass” andaccount for random jumps

 ⎞⎛  ⎞⎛ 

is Pa eRank value from before ' is u dated Pa eRank value

⎟ ⎠

⎜⎝ 

+−+⎟ ⎠

⎜⎝ 

= pGG

 p )1(' α α 

|G| is the number of nodes in the graph m is the missing PageRank mass

Page 43: Data-Intensive Information Processing Applications ― Session #5

8/14/2019 Data-Intensive Information Processing Applications ― Session #5

http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 43/48

PageRank Convergenc e

Alternative convergence criteria

Iterate until PageRank values don’t change

Iterate until PageRank rankings don’t change

Fixed number of iterations

 

Page 44: Data-Intensive Information Processing Applications ― Session #5

8/14/2019 Data-Intensive Information Processing Applications ― Session #5

http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 44/48

Beyond PageRank

Link structure is important for web search

PageRank is one of many link-based features: HITS, SALSA, etc.

One of many thousands of features used in ranking…

Adversarial nature of web search

Link spamming

Spider traps

Keyword stuffing …

Page 45: Data-Intensive Information Processing Applications ― Session #5

8/14/2019 Data-Intensive Information Processing Applications ― Session #5

http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 45/48

Ef f ic ient Graph Algor i t hm s

Sparse vs. dense graphs

Gra h to olo ies

Page 46: Data-Intensive Information Processing Applications ― Session #5

8/14/2019 Data-Intensive Information Processing Applications ― Session #5

http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 46/48

Figure from: Newman, M. E. J. (2005) “Power laws, Pareto

distributions and Zipf's law.” Contemporary Physics 46:323–351.

Page 47: Data-Intensive Information Processing Applications ― Session #5

8/14/2019 Data-Intensive Information Processing Applications ― Session #5

http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 47/48

Page 48: Data-Intensive Information Processing Applications ― Session #5

8/14/2019 Data-Intensive Information Processing Applications ― Session #5

http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 48/48

Questions?

Source: Wikipedia (Japanese rock garden)