i/o-efficient techniques for computing pagerank
Post on 18-Jul-2015
I/O-Efficient Techniques for Computing Pagerank
Yen-Yu Chen Qingqing Gan Torsten Suel
Polytechnic University, Brooklyn NY
Web Graph
• URL as a node
• Hyperlink as a directed edge
• The graph structure represents the World Wide Web
Page Rank
• Random Surfer model
– A person who surfs the web by randomly clicking links on visited pages.
• PageRank of a page is proportional to the frequency with which a random surfer would visit it.
Example: R1 = 0.286, R2 = 0.286, R3 = 0.143, R4 = 0.143, R5 = 0.143
Practical PageRank
• Two problems:
– Rank leak
– Rank sink
• Pruning
• Add back edges
• Random Jump
Example with d = 0.8: R1 = 0.154, R2 = 0.142, R3 = 0.101, R4 = 0.313, R5 = 0.290
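The random-jump remedy can be sketched as a small in-memory power iteration. This is only an illustration under stated assumptions, not the authors' code: the function name `pagerank`, the four-node graph, and the iteration count are made up for the example, `d` plays the role of the damping factor on the slide, and every node is assumed to have at least one out-link (leaks already pruned or patched).

```python
def pagerank(out_links, d=0.8, iterations=50):
    """Power iteration with random jump on a small in-memory graph.

    out_links maps each node to the list of nodes it links to.
    Assumes every node has at least one out-link (no rank leak).
    """
    nodes = list(out_links)
    n = len(nodes)
    rank = {p: 1.0 / n for p in nodes}
    for _ in range(iterations):
        # Every page first receives its share of the random jump.
        new_rank = {p: (1.0 - d) / n for p in nodes}
        # Each page q then passes d * r(q) / outdegree(q) along every link.
        for q, targets in out_links.items():
            share = d * rank[q] / len(targets)
            for p in targets:
                new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical graph, not the one drawn on the slide.
graph = {1: [2, 3], 2: [3], 3: [1], 4: [1]}
ranks = pagerank(graph)
```

In this sketch node 4 has no in-links, so its rank settles at the pure random-jump share (1 - d) / n, while the total rank stays at 1.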
Topic Sensitive PageRank
• Modified Random Jump
• Only jump to certain pages which are related to a specific topic
• ODP-biasing
Topic T
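The modified random jump might look like the following sketch. The only change from a uniform random jump is that the jump share goes to the pages of the topic set. The name `topic_sensitive_pagerank`, the example graph, and the choice to spread the jump evenly over the topic pages are assumptions made for illustration; the slides do not specify the exact weighting.

```python
def topic_sensitive_pagerank(out_links, topic_pages, d=0.8, iterations=50):
    """Power iteration where the random jump only lands on topic_pages.

    Assumes every node has at least one out-link and that the jump
    share is spread evenly over topic_pages (an illustrative choice).
    """
    nodes = list(out_links)
    n = len(nodes)
    rank = {p: 1.0 / n for p in nodes}
    jump = (1.0 - d) / len(topic_pages)
    for _ in range(iterations):
        # Only pages related to the topic receive random-jump rank.
        new_rank = {p: (jump if p in topic_pages else 0.0) for p in nodes}
        for q, targets in out_links.items():
            share = d * rank[q] / len(targets)
            for p in targets:
                new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical graph and topic set.
graph = {1: [2, 3], 2: [3], 3: [1], 4: [1]}
biased = topic_sensitive_pagerank(graph, topic_pages={2})
```

A page outside the topic set with no in-links (node 4 here) ends up with rank 0, since neither links nor jumps reach it.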
Challenge
• 3.5 billion pages on the web
• 49 billion hyperlinks between them
• Storing 4-byte PageRank values requires 14 GB, which is hard to fit in memory
• Goal: calculate the PageRank values in an I/O-efficient way
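The 14 GB figure follows directly from the page count and the size of one rank value. A quick check, using decimal gigabytes (the variable names are illustrative):

```python
pages = 3_500_000_000        # pages on the web (from the slide)
links = 49_000_000_000       # hyperlinks (from the slide)
bytes_per_rank = 4           # one 32-bit rank value per page

vector_bytes = pages * bytes_per_rank
vector_gb = vector_bytes / 10**9      # decimal gigabytes
avg_out_degree = links / pages        # average links per page
```

With these numbers the rank vector alone takes 14 GB, and the average page has 14 out-links.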
I/O Efficient Algorithms
• Naïve Algorithm
• Haveliwala’s Algorithm
• Our contribution:
– Sort-Merge Algorithm
– Split-Accumulate Algorithm
Related Work
Naïve Algorithm
• Two vectors of 32-bit floating-point numbers.
• Source vector is on disk
• Destination vector is in memory.
$C_{naive} = |V| + |V'| + |L| = 2 \cdot |V| + |L|$
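One pass of the naïve scheme could be sketched as follows. The two input iterables stand in for sequential scans of the on-disk source vector and link file, and the record layout `(source, destinations)` is an assumption made for this example. The random jump is left out to keep the sketch short, and every node is assumed to have at least one out-link.

```python
from array import array


def naive_iteration(source_ranks, link_records, n):
    """One pass of the naive scheme (basic PageRank, no random jump).

    source_ranks: iterable yielding rank values in node order; stands in
                  for a sequential scan of the on-disk source vector V.
    link_records: iterable of (source, destinations) in the same node
                  order; stands in for a sequential scan of link file L.
    n:            number of nodes; the destination vector V' is kept in
                  memory as an array of 32-bit floats.
    """
    dest = array("f", [0.0] * n)
    for rank, (_source, destinations) in zip(source_ranks, link_records):
        share = rank / len(destinations)
        for p in destinations:
            dest[p] += share
    return dest  # would be written to disk as the next source vector


# Hypothetical three-node graph: 0 -> 1, 2; 1 -> 2; 2 -> 0.
links = [(0, [1, 2]), (1, [2]), (2, [0])]
new_ranks = naive_iteration([0.25, 0.25, 0.5], links, 3)
```

The scheme reads V and L once and writes V' once per iteration, which matches the cost expression above, but it only works while V' fits in memory.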
Haveliwala’s Algorithm
• Partition the destination vector into d blocks Vi’ that each fit into main memory.
• Partition the link file into d files Li, each containing only the links pointing to nodes in Vi’.
$C_h = d \cdot |V| + |V'| + \sum_{0 \le i < d} |L_i| = (d+1) \cdot |V| + (1+\varepsilon) \cdot |L|$
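The partitioning can be sketched like this. The record layout `(source, out_degree, destinations)`, the function names, and the equal-sized blocks are assumptions for illustration, not the file format used in the talk. The source vector is a plain list here, whereas the real algorithm scans it from disk once per block. Repeating the source and out-degree in several Li is one plausible source of the ε overhead.

```python
def partition_links(link_records, n, d):
    """Split the link file into d lists L_i by destination block."""
    block_size = -(-n // d)  # ceiling division
    parts = [[] for _ in range(d)]
    for source, destinations in link_records:
        out_degree = len(destinations)
        by_block = {}
        for p in destinations:
            by_block.setdefault(p // block_size, []).append(p)
        for i, dests in by_block.items():
            # Source and out-degree are repeated in every L_i they touch.
            parts[i].append((source, out_degree, dests))
    return parts, block_size


def blocked_iteration(source_ranks, parts, block_size, n):
    """One iteration: process one destination block V_i' at a time."""
    result = []
    for i, part in enumerate(parts):
        lo = i * block_size
        block = [0.0] * max(0, min(block_size, n - lo))  # V_i' in memory
        for source, out_degree, dests in part:
            share = source_ranks[source] / out_degree
            for p in dests:
                block[p - lo] += share
        result.extend(block)  # the finished block would be written to disk
    return result


# Hypothetical three-node graph: 0 -> 1, 2; 1 -> 2; 2 -> 0.
links = [(0, [1, 2]), (1, [2]), (2, [0])]
parts, block_size = partition_links(links, n=3, d=2)
new_ranks = blocked_iteration([0.25, 0.25, 0.5], parts, block_size, 3)
```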
Sort-Merge Algorithm
• The link file is identical to the one in the naïve algorithm.
• For each link, create a packet that contains the line number of the destination and the amount of rank value that has to be transmitted to that destination.
• 8-byte packet: 4-byte id + 4-byte floating-point number
Sort-Merge Algorithm (continued)
• Route packets by sorting them by destination and combining the ranks into the destination node.
• |P| is the total size of the generated packets, which are written out and read back in once.
$C_{sort\text{-}merge} = |V| + |V'| + |L| + 2 \cdot |P|$
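The packet routing can be illustrated with a small in-memory sketch. Python's `sorted` stands in for the external (disk-based) sort that the real algorithm would need, and the function names and the `(source, destinations)` record layout are assumptions for this example. The `struct` format only demonstrates the 8-byte packet layout.

```python
import struct
from itertools import groupby
from operator import itemgetter

# 4-byte destination id + 4-byte float, no padding.
PACKET = struct.Struct("<If")


def make_packets(source_ranks, link_records):
    """Yield one (destination, rank amount) packet per link."""
    for source, destinations in link_records:
        share = source_ranks[source] / len(destinations)
        for p in destinations:
            yield (p, share)


def sort_merge_iteration(source_ranks, link_records, n):
    """Sort packets by destination, then combine ranks per destination."""
    packets = sorted(make_packets(source_ranks, link_records),
                     key=itemgetter(0))
    dest = [0.0] * n
    for p, group in groupby(packets, key=itemgetter(0)):
        dest[p] = sum(amount for _, amount in group)
    return dest


# Hypothetical three-node graph: 0 -> 1, 2; 1 -> 2; 2 -> 0.
links = [(0, [1, 2]), (1, [2]), (2, [0])]
new_ranks = sort_merge_iteration([0.25, 0.25, 0.5], links, 3)
```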
Split-Accumulate Algorithm
• Split the source vector into d blocks Vi such that the 4-byte rank values of all nodes in a block fit into memory.
• Link file Li contains information on all links with source node in block Vi.
• Li is like the reverse of Li in Haveliwala’s algorithm, but the out-degree information is moved to a separate file.
Split-Accumulate Algorithm (continued)
• File Oi is a vector of 2-byte integers storing the out-degree of each element in block Vi of the source vector.
• File Pi contains all packets of rank values with destination in block Vi, in arbitrary order.
Split-Accumulate Algorithm (continued)
• In each iteration, for each block i:
– Initialize block Vi in memory.
– Accumulate phase: scan Pi, whose packets all have destinations in Vi, and add the rank value in each packet to the appropriate entry in Vi.
– Scan Oi and divide each rank value in Vi by its out-degree.
Split-Accumulate Algorithm (continued)
– Split phase: read Li; for each record in Li, consisting of several sources in Vi and a destination in Vj, write one packet with this destination node and the total amount of rank to be transmitted to it from these sources into output file Pj’ (which will become file Pj in the next iteration).
Combining packets is simpler and more efficient. No in-memory sorting of packets is needed.
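The accumulate and split phases can be put together in a short in-memory sketch. Python lists stand in for the files Pi, Li and Oi, equal-sized blocks are assumed, and seeding the first packet files with one packet per node carrying its initial rank is an assumption made for this example. Nodes are assumed to have non-zero out-degree.

```python
def split_accumulate_iteration(P, L, O, block_size):
    """One iteration over all d blocks.

    P[i]: packets (destination, rank amount) with destination in block i.
    L[i]: records (destination, [sources]) with all sources in block i.
    O[i]: out-degrees of the nodes in block i (assumed non-zero).
    Returns the accumulated rank vector and the packet lists P' that
    serve as P in the next iteration.
    """
    d = len(L)
    next_P = [[] for _ in range(d)]
    ranks = []
    for i in range(d):
        lo = i * block_size
        block = [0.0] * len(O[i])              # initialize V_i in memory
        # Accumulate phase: add each packet to its destination entry.
        for dest, amount in P[i]:
            block[dest - lo] += amount
        ranks.extend(block)
        # Divide every rank value by the out-degree of its node.
        shares = [r / deg for r, deg in zip(block, O[i])]
        # Split phase: one combined packet per record of L_i.
        for dest, sources in L[i]:
            total = sum(shares[s - lo] for s in sources)
            next_P[dest // block_size].append((dest, total))
    return ranks, next_P


# Hypothetical three-node graph: 0 -> 1, 2; 1 -> 2; 2 -> 0, with d = 2.
O = [[2, 1], [1]]
L = [[(1, [0]), (2, [0, 1])], [(0, [2])]]
P0 = [[(0, 0.25), (1, 0.25)], [(2, 0.5)]]      # seed packets (assumed)
ranks0, P1 = split_accumulate_iteration(P0, L, O, block_size=2)
ranks1, P2 = split_accumulate_iteration(P1, L, O, block_size=2)
```

The record `(2, [0, 1])` shows the combining step: nodes 0 and 1 both link to node 2, yet only one packet is written for that destination.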
Split-Accumulate Algorithm (continued)
• In a nutshell, it splits packets into different buckets by destination and then directly accumulates rank values using a table.
$C_{split} = \sum_{0 \le i < d} (|O_i| + |L_i| + 2 \cdot |P_i|) = 0.5 \cdot |V| + (1+\varepsilon')|L| + 2 \cdot |P| = (1+\varepsilon)|L| + 2 \cdot |P|$
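The four cost expressions from the slides can be written as small functions to compare the schemes for given file sizes. The function names and the sample sizes below are hypothetical. Sizes are in arbitrary but consistent units, and |V'| is taken to equal |V|.

```python
def cost_naive(V, L):
    return 2 * V + L


def cost_haveliwala(V, L, d, eps):
    return (d + 1) * V + (1 + eps) * L


def cost_sort_merge(V, L, P):
    return 2 * V + L + 2 * P


def cost_split_accumulate(L, P, eps):
    return (1 + eps) * L + 2 * P


# Hypothetical sizes, chosen only to exercise the formulas.
V, L, P, d, eps = 4, 40, 20, 4, 0.25
costs = {
    "naive": cost_naive(V, L),
    "haveliwala": cost_haveliwala(V, L, d, eps),
    "sort-merge": cost_sort_merge(V, L, P),
    "split-accumulate": cost_split_accumulate(L, P, eps),
}
```

The formulas show the trade-off: the cost of Haveliwala's algorithm grows with the number of blocks d, while the split-accumulate cost does not depend on d (beyond ε) but pays 2·|P| for writing and reading the packets.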
Experimental Setup
• Sun Blade 100 (500 MHz Ultra Sparc IIe) running Solaris 8 with 100GB, 7200 RPM hard disk.
• Various physical memory configurations: 128M, 256M, 512M, 1G, 2G
• Simulated 32M and 64M settings on the 128M configuration.
Results for Real Data
• 120 M web pages crawled
• 327 M URLs and 1.33 Billion links parsed out.
• After pruning:
– 44.8 M nodes
– 653 M edges
– 15.3 edges/node
Results for Real Data (continued)
• No pruning.
• Add back edges for nodes that have 0 out-degree.
• 327 M nodes
• 1.96 billion edges
Results for Scaled Data
Results for Topic-Sensitive PR
[Figure: bar chart comparing Naive, Haveliwala's, and Split-Accumulate for 10 Topics (512M), 20 Topics (512M), 10 Topics (256M), and 20 Topics (256M); vertical axis from 0 to 4000.]
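Page Rank
• Basic: $r(p) = \sum_{q \to p} \frac{r(q)}{d(q)}$
• Random Jump: $r^{(i)}(p) = (1-\alpha)\frac{R^{(0)}}{n} + \alpha \cdot \sum_{q \to p} \frac{r^{(i-1)}(q)}{d(q)}$
• Topic-Sensitive: $r^{(i)}(p) = \begin{cases} (1-\alpha)\frac{R^{(0)}}{n} + \alpha \cdot \sum_{q \to p} \frac{r^{(i-1)}(q)}{d(q)} & \text{if } p \text{ is special} \\ \alpha \cdot \sum_{q \to p} \frac{r^{(i-1)}(q)}{d(q)} & \text{otherwise} \end{cases}$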