data-intensive information processing applications ― session #5
TRANSCRIPT
8/14/2019 Data-Intensive Information Processing Applications ― Session #5
http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 1/48
8/14/2019 Data-Intensive Information Processing Applications ― Session #5
http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 2/48
8/14/2019 Data-Intensive Information Processing Applications ― Session #5
http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 3/48
Today’s Agenda
Graph problems and representations
Parallel breadth-first search
PageRank
8/14/2019 Data-Intensive Information Processing Applications ― Session #5
http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 4/48
What ’s a graph?
G = (V,E), where
V represents the set of vertices (nodes)
E represents the set of edges (links)
Both vertices and edges may contain additional information
Directed vs. undirected edges
Presence or absence of cycles
Graphs are everywhere:
Hyperlink structure of the Web
Physical structure of computers on the Internet Interstate highway system
Social networks
8/14/2019 Data-Intensive Information Processing Applications ― Session #5
http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 5/48
Source: Wikipedia (Königsberg)
8/14/2019 Data-Intensive Information Processing Applications ― Session #5
http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 6/48
Som e Graph Problem s
Finding shortest paths
Routing Internet traffic and UPS trucks
Finding minimum spanning trees Telco laying down fiber
Finding Max Flow
Airline scheduling
Identify “special” nodes and communities
Breaking up terrorist cells, spread of avian flu
par e ma c ng Monster.com, Match.com
...
8/14/2019 Data-Intensive Information Processing Applications ― Session #5
http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 7/48
Graphs and MapReduc e
Graph algorithms typically involve:
Performing computations at each node: based on node features,edge features, and local link structure
Propagating computations: “traversing” the graph
How do you represent graph data in MapReduce?
How do you traverse a graph in MapReduce?
8/14/2019 Data-Intensive Information Processing Applications ― Session #5
http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 8/48
Represent ing Graphs
G = (V, E)
Two common re resentations
Adjacency matrix Adjacency list
8/14/2019 Data-Intensive Information Processing Applications ― Session #5
http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 9/48
Adjacenc y Mat r i c es
Represent a graph as an n x n square matrix M
n = |V|
M ij
= 1 means a link from node i to j
1 2 3 4
1 0 1 0 1 1
2 1 0 1 13
4 1 0 1 0 4
8/14/2019 Data-Intensive Information Processing Applications ― Session #5
http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 10/48
Adjac enc y Mat r ic es : Cr i t ique
Advantages:
Amenable to mathematical manipulation
Iteration over rows and columns corresponds to computations onoutlinks and inlinks
Lots of zeros for sparse matrices
Lots of wasted space
8/14/2019 Data-Intensive Information Processing Applications ― Session #5
http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 11/48
Adjacenc y List s
Take adjacency matrices… and throw away all the zeros
1: 2 4
1 2 3 4
1 0 1 0 12: 1, 3, 43: 1
2 1 0 1 1
4: 1, 34 1 0 1 0
8/14/2019 Data-Intensive Information Processing Applications ― Session #5
http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 12/48
Adjac enc y L is t s : Cr i t ique
Advantages:
Much more compact representation
Easy to compute over outlinks
Disadvantages:
Much more difficult to compute over inlinks
8/14/2019 Data-Intensive Information Processing Applications ― Session #5
http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 13/48
Single Sourc e Shor t est Pat h
Problem: find shortest path from a source node to one ormore target nodes
Shortest might also mean lowest weight or cost
First, a refresher: Dijkstra’s Algorithm
8/14/2019 Data-Intensive Information Processing Applications ― Session #5
http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 14/48
Di jk s t ra ’s A lgor i t hm Ex am ple
∞ ∞
10
1
0 2 3 9 4 6
5 7
∞ ∞2
Example from CLR
8/14/2019 Data-Intensive Information Processing Applications ― Session #5
http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 15/48
Di jk s t ra ’s A lgor i t hm Ex am ple
10∞
10
1
0 2 3 9 4 6
5 7
5 ∞2
Example from CLR
8/14/2019 Data-Intensive Information Processing Applications ― Session #5
http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 16/48
Di jk s t ra ’s A lgor i t hm Ex am ple
8 14
10
1
0 2 39
4 6
5 7
5 7
2
Example from CLR
8/14/2019 Data-Intensive Information Processing Applications ― Session #5
http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 17/48
8/14/2019 Data-Intensive Information Processing Applications ― Session #5
http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 18/48
Di jk s t ra ’s A lgor i t hm Ex am ple
8 9
1
10
1
0 2 39
4 6
5 7
5 7
2
Example from CLR
8/14/2019 Data-Intensive Information Processing Applications ― Session #5
http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 19/48
Di jk s t ra ’s A lgor i t hm Ex am ple
8 9
10
1
0 2 39
4 6
5 7
5 7
2
Example from CLR
8/14/2019 Data-Intensive Information Processing Applications ― Session #5
http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 20/48
Single Sourc e Shor t est Pat h
Problem: find shortest path from a source node to one ormore target nodes
Shortest might also mean lowest weight or cost
Single processor machine: Dijkstra’s Algorithm
MapReduce: parallel Breadth-First Search (BFS)
8/14/2019 Data-Intensive Information Processing Applications ― Session #5
http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 21/48
Find ing t he Shor t es t Pat h
Consider simple case of equal edge weights
Solution to the roblem can be defined inductivel
Here’s the intuition:
Define: b is reachable from a if b is on ad acenc list of a
DISTANCETO(s ) = 0
For all nodes p reachable from s ,
For all nodes n reachable from some other set of nodes M ,DISTANCETO(n ) = 1 + min(DISTANCETO(m ), m ∈ M )
s m 2
m 1
n
…
…
1
d 2
m 3
… d 3
8/14/2019 Data-Intensive Information Processing Applications ― Session #5
http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 22/48
8/14/2019 Data-Intensive Information Processing Applications ― Session #5
http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 23/48
8/14/2019 Data-Intensive Information Processing Applications ― Session #5
http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 24/48
From In t u i t ion t o Algori t hm
Data representation:
Key: node n
Value: d (distance from start), adjacency list (list of nodes
reachable from n )
Initialization: for all nodes except for start node, = ∞
Mapper:
∀m ∈ adjacency list: emit (m , d + 1)
Sort/Shuffle
Groups distances by reachable nodes
Reducer: Selects minimum distance path for each reachable node
8/14/2019 Data-Intensive Information Processing Applications ― Session #5
http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 25/48
8/14/2019 Data-Intensive Information Processing Applications ― Session #5
http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 26/48
BFS Pseudo-Code
8/14/2019 Data-Intensive Information Processing Applications ― Session #5
http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 27/48
St opping Cr i t er ion
How many iterations are needed in parallel BFS (equaledge weight case)?
Convince yourself: when a node is first “discovered”,we’ve found the shortest path
Now answer the question...
Six degrees of separation? Practicalities of implementation in MapReduce
8/14/2019 Data-Intensive Information Processing Applications ― Session #5
http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 28/48
Com par ison t o Di jk s t ra
Dijkstra’s algorithm is more efficient
At any step it only pursues edges from the minimum-cost pathinside the frontier
MapReduce explores all paths in parallel
“ o s o was e
Useful work is only done at the “frontier”
Wh can’t we do better usin Ma Reduce?
8/14/2019 Data-Intensive Information Processing Applications ― Session #5
http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 29/48
Weigh t ed Edges
Now add positive weights to the edges
Why can’t edge weights be negative?
Simple change: adjacency list now includes a weight w foreach edge
In mapper, emit (m , d + w p ) instead of (m , + 1) for each node m
That’s it?
8/14/2019 Data-Intensive Information Processing Applications ― Session #5
http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 30/48
St opping Cr i t er ion
How many iterations are needed in parallel BFS (positiveedge weight case)?
Convince yourself: when a node is first “discovered”,we’ve found the shortest path
8/14/2019 Data-Intensive Information Processing Applications ― Session #5
http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 31/48
Addi t ional Com plex i t ies
search frontier 1
1
r 10
n 1
n 5
n 6 n 7 n 8
n 9 1
s
p q
n 2 n 3
n 4
1
11
1
8/14/2019 Data-Intensive Information Processing Applications ― Session #5
http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 32/48
St opping Cr i t er ion
How many iterations are needed in parallel BFS (positiveedge weight case)?
Practicalities of implementation in MapReduce
8/14/2019 Data-Intensive Information Processing Applications ― Session #5
http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 33/48
Graphs and MapReduc e
Graph algorithms typically involve:
Performing computations at each node: based on node features,edge features, and local link structure
Propagating computations: “traversing” the graph
Represent graphs as adjacency lists
Perform local computations in mapper Pass along partial results via outlinks, keyed by destination node
Perform aggregation in reducer on inlinks to a node
Iterate until conver ence: controlled b external “driver”
Don’t forget to pass the graph structure between iterations
8/14/2019 Data-Intensive Information Processing Applications ― Session #5
http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 34/48
8/14/2019 Data-Intensive Information Processing Applications ― Session #5
http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 35/48
PageRank : Def ined
Given page x with inlinks t 1…t n , where
C(t) is the out-degree of t
α is probability of random jump
N is the total number of nodes in the graph
∑=
−+⎟ ⎠
⎞⎜⎝
⎛ =
n
i i
i
t C
t PR
N xPR
1 )(
)()1(
1)( α α
X
t 1
t 2
…n
8/14/2019 Data-Intensive Information Processing Applications ― Session #5
http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 36/48
Com put ing PageRank
Properties of PageRank
Can be computed iteratively
Effects at each iteration are local
Sketch of algorithm:
Start with seed PR i values
Each page distributes PR i “credit” to all pages it links to
Each target page adds up “credit” from multiple in-bound links tocompute PR i+1
Iterate until values converge
8/14/2019 Data-Intensive Information Processing Applications ― Session #5
http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 37/48
8/14/2019 Data-Intensive Information Processing Applications ― Session #5
http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 38/48
8/14/2019 Data-Intensive Information Processing Applications ― Session #5
http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 39/48
Sam ple PageRank I t e rat ion (2 )
n 2 (0.166) n 2 (0.133)Iteration 2
n 1 (0.066).
0.033
0.083.
0.1 0.10.1
n 1 (0.1)
n 4 (0.3)
n 3 (0.166)n 5 (0.3)
0.3 0.166
n 4 (0.2)
n 3 (0.183)n 5 (0.383)
8/14/2019 Data-Intensive Information Processing Applications ― Session #5
http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 40/48
PageRank in MapReduc e
n 5 [n 1, n 2 , n 3 ]n 1 [n 2 , n 4 ] n 2 [n 3 , n 5 ] n 3 [n 4 ] n 4 [n 5 ]
n 2 n 4 n 3 n 5 n 1 n 2 n 3 n 4 n 5
n 2 n 4 n 3 n 5 n 1 n 2 n 3 n 4 n 5
Reduce
n 5 [n 1, n 2 , n 3 ]n 1 [n 2 , n 4 ] n 2 [n 3 , n 5 ] n 3 [n 4 ] n 4 [n 5 ]
8/14/2019 Data-Intensive Information Processing Applications ― Session #5
http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 41/48
PageRank Pseudo-Code
8/14/2019 Data-Intensive Information Processing Applications ― Session #5
http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 42/48
Com plet e PageRank
Two additional complexities
What is the proper treatment of dangling nodes?
How do we factor in the random jump factor?
Solution:
Second pass to redistribute “missing PageRank mass” andaccount for random jumps
⎞⎛ ⎞⎛
is Pa eRank value from before ' is u dated Pa eRank value
⎟ ⎠
⎜⎝
+−+⎟ ⎠
⎜⎝
= pGG
p )1(' α α
|G| is the number of nodes in the graph m is the missing PageRank mass
8/14/2019 Data-Intensive Information Processing Applications ― Session #5
http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 43/48
PageRank Convergenc e
Alternative convergence criteria
Iterate until PageRank values don’t change
Iterate until PageRank rankings don’t change
Fixed number of iterations
8/14/2019 Data-Intensive Information Processing Applications ― Session #5
http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 44/48
Beyond PageRank
Link structure is important for web search
PageRank is one of many link-based features: HITS, SALSA, etc.
One of many thousands of features used in ranking…
Adversarial nature of web search
Link spamming
Spider traps
Keyword stuffing …
8/14/2019 Data-Intensive Information Processing Applications ― Session #5
http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 45/48
Ef f ic ient Graph Algor i t hm s
Sparse vs. dense graphs
Gra h to olo ies
8/14/2019 Data-Intensive Information Processing Applications ― Session #5
http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 46/48
Figure from: Newman, M. E. J. (2005) “Power laws, Pareto
distributions and Zipf's law.” Contemporary Physics 46:323–351.
8/14/2019 Data-Intensive Information Processing Applications ― Session #5
http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 47/48
8/14/2019 Data-Intensive Information Processing Applications ― Session #5
http://slidepdf.com/reader/full/data-intensive-information-processing-applications-session-5 48/48
Questions?
Source: Wikipedia (Japanese rock garden)