cs246: mining massive datasets jure ... - stanford university has 23,400 in-links has 1 in-link are...
TRANSCRIPT
![Page 1: CS246: Mining Massive Datasets Jure ... - Stanford University has 23,400 in-links has 1 in-link Are all in-links are equal? Links from important pages count more Recursive question!](https://reader033.vdocument.in/reader033/viewer/2022053016/5f17329d2f724c2e8e6dd710/html5/thumbnails/1.jpg)
CS246: Mining Massive Datasets Jure Leskovec, Stanford University
http://cs246.stanford.edu
![Page 2: CS246: Mining Massive Datasets Jure ... - Stanford University has 23,400 in-links has 1 in-link Are all in-links are equal? Links from important pages count more Recursive question!](https://reader033.vdocument.in/reader033/viewer/2022053016/5f17329d2f724c2e8e6dd710/html5/thumbnails/2.jpg)
High dim. data
Locality sensitive hashing
Clustering
Dimensionality
reduction
Graph data
PageRank, SimRank
Community Detection
Spam Detection
Infinite data
Filtering data
streams
Web advertising
Queries on streams
Machine learning
SVM
Decision Trees
Perceptron, kNN
Apps
Recommender systems
Association Rules
Duplicate document detection
2/5/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 2
![Page 3: CS246: Mining Massive Datasets Jure ... - Stanford University has 23,400 in-links has 1 in-link Are all in-links are equal? Links from important pages count more Recursive question!](https://reader033.vdocument.in/reader033/viewer/2022053016/5f17329d2f724c2e8e6dd710/html5/thumbnails/3.jpg)
2/5/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 3
Facebook social graph 4-degrees of separation [Backstrom-Boldi-Rosa-Ugander-Vigna, 2011]
![Page 4: CS246: Mining Massive Datasets Jure ... - Stanford University has 23,400 in-links has 1 in-link Are all in-links are equal? Links from important pages count more Recursive question!](https://reader033.vdocument.in/reader033/viewer/2022053016/5f17329d2f724c2e8e6dd710/html5/thumbnails/4.jpg)
2/5/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 4
Connections between political blogs Polarization of the network [Adamic-Glance, 2005]
![Page 5: CS246: Mining Massive Datasets Jure ... - Stanford University has 23,400 in-links has 1 in-link Are all in-links are equal? Links from important pages count more Recursive question!](https://reader033.vdocument.in/reader033/viewer/2022053016/5f17329d2f724c2e8e6dd710/html5/thumbnails/5.jpg)
2/5/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 5
Citation networks and Maps of science [Börner et al., 2012]
![Page 6: CS246: Mining Massive Datasets Jure ... - Stanford University has 23,400 in-links has 1 in-link Are all in-links are equal? Links from important pages count more Recursive question!](https://reader033.vdocument.in/reader033/viewer/2022053016/5f17329d2f724c2e8e6dd710/html5/thumbnails/6.jpg)
domain2
domain1
domain3
router
Internet 2/5/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 6
![Page 7: CS246: Mining Massive Datasets Jure ... - Stanford University has 23,400 in-links has 1 in-link Are all in-links are equal? Links from important pages count more Recursive question!](https://reader033.vdocument.in/reader033/viewer/2022053016/5f17329d2f724c2e8e6dd710/html5/thumbnails/7.jpg)
2/5/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 7
Seven Bridges of Königsberg [Euler, 1735]
Return to the starting point by traveling each link of the graph once and only once.
![Page 8: CS246: Mining Massive Datasets Jure ... - Stanford University has 23,400 in-links has 1 in-link Are all in-links are equal? Links from important pages count more Recursive question!](https://reader033.vdocument.in/reader033/viewer/2022053016/5f17329d2f724c2e8e6dd710/html5/thumbnails/8.jpg)
Web as a directed graph:
Nodes: Webpages
Edges: Hyperlinks
2/5/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 8
I teach a class on
Networks. CS224W: Classes are
in the Gates
building Computer Science
Department at Stanford
Stanford University
![Page 9: CS246: Mining Massive Datasets Jure ... - Stanford University has 23,400 in-links has 1 in-link Are all in-links are equal? Links from important pages count more Recursive question!](https://reader033.vdocument.in/reader033/viewer/2022053016/5f17329d2f724c2e8e6dd710/html5/thumbnails/9.jpg)
Web as a directed graph:
Nodes: Webpages
Edges: Hyperlinks
2/5/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 9
I teach a class on
Networks. CS224W: Classes are
in the Gates
building Computer Science
Department at Stanford
Stanford University
![Page 10: CS246: Mining Massive Datasets Jure ... - Stanford University has 23,400 in-links has 1 in-link Are all in-links are equal? Links from important pages count more Recursive question!](https://reader033.vdocument.in/reader033/viewer/2022053016/5f17329d2f724c2e8e6dd710/html5/thumbnails/10.jpg)
2/5/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 10
![Page 11: CS246: Mining Massive Datasets Jure ... - Stanford University has 23,400 in-links has 1 in-link Are all in-links are equal? Links from important pages count more Recursive question!](https://reader033.vdocument.in/reader033/viewer/2022053016/5f17329d2f724c2e8e6dd710/html5/thumbnails/11.jpg)
How to organize the Web? First try: Human curated
Web directories
Yahoo, DMOZ, LookSmart
Second try: Web Search
Information Retrieval investigates: Find relevant docs in a small and trusted set
Newspaper articles, Patents, etc.
But: Web is huge, full of untrusted documents, random things, web spam, etc.
2/5/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 11
![Page 12: CS246: Mining Massive Datasets Jure ... - Stanford University has 23,400 in-links has 1 in-link Are all in-links are equal? Links from important pages count more Recursive question!](https://reader033.vdocument.in/reader033/viewer/2022053016/5f17329d2f724c2e8e6dd710/html5/thumbnails/12.jpg)
2 challenges of web search: (1) Web contains many sources of information
Who to “trust”?
Trick: Trustworthy pages may point to each other!
(2) What is the “best” answer to query “newspaper”?
No single right answer
Trick: Pages that actually know about newspapers might all be pointing to many newspapers
2/5/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 12
![Page 13: CS246: Mining Massive Datasets Jure ... - Stanford University has 23,400 in-links has 1 in-link Are all in-links are equal? Links from important pages count more Recursive question!](https://reader033.vdocument.in/reader033/viewer/2022053016/5f17329d2f724c2e8e6dd710/html5/thumbnails/13.jpg)
All web pages are not equally “important”
www.joe-schmoe.com vs. www.stanford.edu
There is large diversity
in the web-graph node connectivity. Let’s rank the pages by the link structure!
2/5/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 13
![Page 14: CS246: Mining Massive Datasets Jure ... - Stanford University has 23,400 in-links has 1 in-link Are all in-links are equal? Links from important pages count more Recursive question!](https://reader033.vdocument.in/reader033/viewer/2022053016/5f17329d2f724c2e8e6dd710/html5/thumbnails/14.jpg)
We will cover the following Link Analysis approaches for computing importances of nodes in a graph:
Page Rank
Hubs and Authorities (HITS)
Topic-Specific (Personalized) Page Rank
Web Spam Detection Algorithms
2/5/2013 14 Jure Leskovec, Stanford C246: Mining Massive Datasets
![Page 15: CS246: Mining Massive Datasets Jure ... - Stanford University has 23,400 in-links has 1 in-link Are all in-links are equal? Links from important pages count more Recursive question!](https://reader033.vdocument.in/reader033/viewer/2022053016/5f17329d2f724c2e8e6dd710/html5/thumbnails/15.jpg)
Idea: Links as votes
Page is more important if it has more links
In-coming links? Out-going links?
Think of in-links as votes: www.stanford.edu has 23,400 in-links
www.joe-schmoe.com has 1 in-link
Are all in-links are equal?
Links from important pages count more
Recursive question!
2/5/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 15
![Page 16: CS246: Mining Massive Datasets Jure ... - Stanford University has 23,400 in-links has 1 in-link Are all in-links are equal? Links from important pages count more Recursive question!](https://reader033.vdocument.in/reader033/viewer/2022053016/5f17329d2f724c2e8e6dd710/html5/thumbnails/16.jpg)
2/5/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 16
![Page 17: CS246: Mining Massive Datasets Jure ... - Stanford University has 23,400 in-links has 1 in-link Are all in-links are equal? Links from important pages count more Recursive question!](https://reader033.vdocument.in/reader033/viewer/2022053016/5f17329d2f724c2e8e6dd710/html5/thumbnails/17.jpg)
Each link’s vote is proportional to the importance of its source page
If page j with importance rj has n out-links, each link gets rj / n votes
Page j’s own importance is the sum of the votes on its in-links
2/5/2013 17 Jure Leskovec, Stanford C246: Mining Massive Datasets
j
k i
rj/3
rj/3 rj/3 rj = ri/3+rk/4
ri/3 rk/4
![Page 18: CS246: Mining Massive Datasets Jure ... - Stanford University has 23,400 in-links has 1 in-link Are all in-links are equal? Links from important pages count more Recursive question!](https://reader033.vdocument.in/reader033/viewer/2022053016/5f17329d2f724c2e8e6dd710/html5/thumbnails/18.jpg)
A “vote” from an important page is worth more
A page is important if it is pointed to by other important pages
Define a “rank” rj for page j
2/5/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 18
ji
ij
rr
id
y
m a a/2
y/2 a/2
m
y/2
The web in 1839
“Flow” equations:
ry = ry /2 + ra /2
ra = ry /2 + rm
rm = ra /2 𝒅𝒊 … out-degree of node 𝒊
![Page 19: CS246: Mining Massive Datasets Jure ... - Stanford University has 23,400 in-links has 1 in-link Are all in-links are equal? Links from important pages count more Recursive question!](https://reader033.vdocument.in/reader033/viewer/2022053016/5f17329d2f724c2e8e6dd710/html5/thumbnails/19.jpg)
3 equations, 3 unknowns, no constants No unique solution
All solutions equivalent modulo the scale factor Additional constraint forces uniqueness:
𝒓𝒚 + 𝒓𝒂 + 𝒓𝒎 = 𝟏
Solution: 𝒓𝒚 =𝟐
𝟓, 𝒓𝒂 =
𝟐
𝟓, 𝒓𝒎 =
𝟏
𝟓
Gaussian elimination method works for small examples, but we need a better method for large web-size graphs
We need a new formulation! 2/5/2013 19 Jure Leskovec, Stanford C246: Mining Massive Datasets
ry = ry /2 + ra /2
ra = ry /2 + rm
rm = ra /2
Flow equations:
![Page 20: CS246: Mining Massive Datasets Jure ... - Stanford University has 23,400 in-links has 1 in-link Are all in-links are equal? Links from important pages count more Recursive question!](https://reader033.vdocument.in/reader033/viewer/2022053016/5f17329d2f724c2e8e6dd710/html5/thumbnails/20.jpg)
Stochastic adjacency matrix 𝑴 Let page 𝑖 has 𝑑𝑖 out-links
If 𝑖 → 𝑗, then 𝑀𝑗𝑖 =1
𝑑𝑖
else 𝑀𝑗𝑖 = 0 𝑴 is a column stochastic matrix
Columns sum to 1
Rank vector 𝒓: vector with an entry per page 𝑟𝑖 is the importance score of page 𝑖 𝑟𝑖 = 1𝑖
The flow equations can be written
𝒓 = 𝑴 ⋅ 𝒓
2/5/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 20
ji
ij
rr
id
![Page 21: CS246: Mining Massive Datasets Jure ... - Stanford University has 23,400 in-links has 1 in-link Are all in-links are equal? Links from important pages count more Recursive question!](https://reader033.vdocument.in/reader033/viewer/2022053016/5f17329d2f724c2e8e6dd710/html5/thumbnails/21.jpg)
Remember the flow equation: Flow equation in the matrix form
𝑴 ⋅ 𝒓 = 𝒓 Suppose page i links to 3 pages, including j
2/5/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 21
j
i
M r r
= rj
1/3
ji
ij
rr
id
ri
.
. =
![Page 22: CS246: Mining Massive Datasets Jure ... - Stanford University has 23,400 in-links has 1 in-link Are all in-links are equal? Links from important pages count more Recursive question!](https://reader033.vdocument.in/reader033/viewer/2022053016/5f17329d2f724c2e8e6dd710/html5/thumbnails/22.jpg)
The flow equations can be written 𝒓 = 𝑴 ∙ 𝒓
So the rank vector r is an eigenvector of the stochastic web matrix M In fact, its first or principal eigenvector,
with corresponding eigenvalue 1 Largest eigenvalue of M is 1 since M is
column stochastic We know r is unit length and each column of M
sums to one, so 𝑴𝒓 ≤ 𝟏
We can now efficiently solve for r! The method is called Power iteration
2/5/2013 22 Jure Leskovec, Stanford C246: Mining Massive Datasets
NOTE: x is an
eigenvector with
the corresponding
eigenvalue λ if:
𝑨𝒙 = 𝝀𝒙
![Page 23: CS246: Mining Massive Datasets Jure ... - Stanford University has 23,400 in-links has 1 in-link Are all in-links are equal? Links from important pages count more Recursive question!](https://reader033.vdocument.in/reader033/viewer/2022053016/5f17329d2f724c2e8e6dd710/html5/thumbnails/23.jpg)
r = M∙r
y ½ ½ 0 y
a = ½ 0 1 a
m 0 ½ 0 m
2/5/2013 23 Jure Leskovec, Stanford C246: Mining Massive Datasets
y
a m
y a m
y ½ ½ 0
a ½ 0 1
m 0 ½ 0
ry = ry /2 + ra /2
ra = ry /2 + rm
rm = ra /2
![Page 24: CS246: Mining Massive Datasets Jure ... - Stanford University has 23,400 in-links has 1 in-link Are all in-links are equal? Links from important pages count more Recursive question!](https://reader033.vdocument.in/reader033/viewer/2022053016/5f17329d2f724c2e8e6dd710/html5/thumbnails/24.jpg)
Given a web graph with n nodes, where the nodes are pages and edges are hyperlinks
Power iteration: a simple iterative scheme
Suppose there are N web pages
Initialize: r(0) = [1/N,….,1/N]T
Iterate: r(t+1) = M ∙ r(t)
Stop when |r(t+1) – r(t)|1 <
|x|1 = 1≤i≤N|xi| is the L1 norm
2/5/2013 24 Jure Leskovec, Stanford C246: Mining Massive Datasets
ji
t
it
j
rr
i
)()1(
d
di …. out-degree of node i
![Page 25: CS246: Mining Massive Datasets Jure ... - Stanford University has 23,400 in-links has 1 in-link Are all in-links are equal? Links from important pages count more Recursive question!](https://reader033.vdocument.in/reader033/viewer/2022053016/5f17329d2f724c2e8e6dd710/html5/thumbnails/25.jpg)
Power Iteration:
Set 𝑟𝑗 = 1/N
1: 𝑟′𝑗 = 𝑟𝑖
𝑑𝑖𝑖→𝑗
2: 𝑟 = 𝑟′
Goto 1
Example: ry 1/3 1/3 5/12 9/24 6/15
ra = 1/3 3/6 1/3 11/24 … 6/15
rm 1/3 1/6 3/12 1/6 3/15
2/5/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets
y
a m
y a m
y ½ ½ 0
a ½ 0 1
m 0 ½ 0
25
Iteration 0, 1, 2, …
ry = ry /2 + ra /2
ra = ry /2 + rm
rm = ra /2
![Page 26: CS246: Mining Massive Datasets Jure ... - Stanford University has 23,400 in-links has 1 in-link Are all in-links are equal? Links from important pages count more Recursive question!](https://reader033.vdocument.in/reader033/viewer/2022053016/5f17329d2f724c2e8e6dd710/html5/thumbnails/26.jpg)
Power Iteration:
Set 𝑟𝑗 = 1/N
1: 𝑟′𝑗 = 𝑟𝑖
𝑑𝑖𝑖→𝑗
2: 𝑟 = 𝑟′
Goto 1
Example: ry 1/3 1/3 5/12 9/24 6/15
ra = 1/3 3/6 1/3 11/24 … 6/15
rm 1/3 1/6 3/12 1/6 3/15
2/5/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets
y
a m
y a m
y ½ ½ 0
a ½ 0 1
m 0 ½ 0
26
Iteration 0, 1, 2, …
ry = ry /2 + ra /2
ra = ry /2 + rm
rm = ra /2
![Page 27: CS246: Mining Massive Datasets Jure ... - Stanford University has 23,400 in-links has 1 in-link Are all in-links are equal? Links from important pages count more Recursive question!](https://reader033.vdocument.in/reader033/viewer/2022053016/5f17329d2f724c2e8e6dd710/html5/thumbnails/27.jpg)
Power iteration: A method for finding dominant eigenvector (the vector corresponding to the largest eigenvalue)
𝒓(𝟏) = 𝑴 ⋅ 𝒓(𝟎)
𝒓(𝟐) = 𝑴 ⋅ 𝒓 𝟏 = 𝑴 𝑴𝒓 𝟏 = 𝑴𝟐 ⋅ 𝒓 𝟎
𝒓(𝟑) = 𝑴 ⋅ 𝒓 𝟐 = 𝑴 𝑴𝟐𝒓 𝟎 = 𝑴𝟑 ⋅ 𝒓 𝟎
Claim:
Sequence 𝑴 ⋅ 𝒓 𝟎 , 𝑴𝟐 ⋅ 𝒓 𝟎 , …𝑴𝒌 ⋅ 𝒓 𝟎 , … approaches the dominant eigenvector of 𝑴
2/5/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 27
![Page 28: CS246: Mining Massive Datasets Jure ... - Stanford University has 23,400 in-links has 1 in-link Are all in-links are equal? Links from important pages count more Recursive question!](https://reader033.vdocument.in/reader033/viewer/2022053016/5f17329d2f724c2e8e6dd710/html5/thumbnails/28.jpg)
Claim: Sequence 𝑴 ⋅ 𝒓 𝟎 ,𝑴𝟐 ⋅ 𝒓 𝟎 , …𝑴𝒌 ⋅ 𝒓 𝟎 , … approaches the dominant eigenvector of 𝑴
Proof: Assume M has n linearly independent eigenvectors, 𝑥1, 𝑥2, … , 𝑥𝑛 with corresponding eigenvalues 𝜆1, 𝜆2, … , 𝜆𝑛, where 𝜆1 > 𝜆2 > ⋯ > 𝜆𝑛
Vectors 𝑥1, 𝑥2, … , 𝑥𝑛 form a basis and thus we can write: 𝑟(0) = 𝑐1 𝑥1 + 𝑐2 𝑥2 +⋯+ 𝑐𝑛 𝑥𝑛
𝑴𝒓(𝟎) = 𝑴 𝒄𝟏 𝒙𝟏 + 𝒄𝟐 𝒙𝟐 +⋯+ 𝒄𝒏 𝒙𝒏 = 𝑐1(𝑀𝑥1) + 𝑐2(𝑀𝑥2) + ⋯+ 𝑐𝑛(𝑀𝑥𝑛) = 𝑐1(𝜆1𝑥1) + 𝑐2(𝜆2𝑥2) + ⋯+ 𝑐𝑛(𝜆𝑛𝑥𝑛) Repeated multiplication on both sides produces 𝑀𝑘𝑟(0) = 𝑐1(𝜆1
𝑘𝑥1) + 𝑐2(𝜆2𝑘𝑥2) + ⋯+ 𝑐𝑛(𝜆𝑛
𝑘𝑥𝑛)
2/5/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 28
![Page 29: CS246: Mining Massive Datasets Jure ... - Stanford University has 23,400 in-links has 1 in-link Are all in-links are equal? Links from important pages count more Recursive question!](https://reader033.vdocument.in/reader033/viewer/2022053016/5f17329d2f724c2e8e6dd710/html5/thumbnails/29.jpg)
Claim: Sequence 𝑴 ⋅ 𝒓 𝟎 ,𝑴𝟐 ⋅ 𝒓 𝟎 , …𝑴𝒌 ⋅ 𝒓 𝟎 , … approaches the dominant eigenvector of 𝑴
Proof (continued): Repeated multiplication on both sides produces 𝑀𝑘𝑟(0) = 𝑐1(𝜆1
𝑘𝑥1) + 𝑐2(𝜆2𝑘𝑥2) + ⋯+ 𝑐𝑛(𝜆𝑛
𝑘𝑥𝑛)
𝑀𝑘𝑟(0) = 𝜆1𝑘 𝑐1𝑥1 + 𝑐2
𝜆2
𝜆1
𝑘
𝑥2 +⋯+ 𝑐𝑛𝜆2
𝜆1
𝑘
𝑥𝑛
Since 𝜆1 > 𝜆2 then fractions 𝜆2
𝜆1,𝜆3
𝜆1… < 1
and so 𝜆𝑖𝜆1
𝑘
= 0 as 𝑘 → ∞ (for all 𝑖 = 2…𝑛).
Thus: 𝑴𝒌𝒓(𝟎) ≈ 𝒄𝟏 𝝀𝟏𝒌𝒙𝟏
Note if 𝑐1 = 0 then the method won’t converge
2/5/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 29
![Page 30: CS246: Mining Massive Datasets Jure ... - Stanford University has 23,400 in-links has 1 in-link Are all in-links are equal? Links from important pages count more Recursive question!](https://reader033.vdocument.in/reader033/viewer/2022053016/5f17329d2f724c2e8e6dd710/html5/thumbnails/30.jpg)
Imagine a random web surfer:
At any time 𝑡, surfer is on some page 𝑖
At time 𝑡 + 1, the surfer follows an out-link from 𝑖 uniformly at random
Ends up on some page 𝑗 linked from 𝑖
Process repeats indefinitely
Let: 𝒑(𝒕) … vector whose 𝑖th coordinate is the
prob. that the surfer is at page 𝑖 at time 𝑡
So, 𝒑(𝒕) is a probability distribution over pages
2/5/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 30
ji
ij
rr
(i)dout
j
i1 i2 i3
![Page 31: CS246: Mining Massive Datasets Jure ... - Stanford University has 23,400 in-links has 1 in-link Are all in-links are equal? Links from important pages count more Recursive question!](https://reader033.vdocument.in/reader033/viewer/2022053016/5f17329d2f724c2e8e6dd710/html5/thumbnails/31.jpg)
Where is the surfer at time t+1?
Follows a link uniformly at random
𝑝 𝑡 + 1 = 𝑀 ⋅ 𝑝(𝑡)
Suppose the random walk reaches a state 𝑝 𝑡 + 1 = 𝑀 ⋅ 𝑝(𝑡) = 𝑝(𝑡)
then 𝒑(𝑡) is stationary distribution of a random walk
Our original rank vector 𝒓 satisfies 𝒓 = 𝑴 ⋅ 𝒓
So, 𝒓 is a stationary distribution for the random walk
2/5/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 31
)(M)1( tptp
j
i1 i2 i3
![Page 32: CS246: Mining Massive Datasets Jure ... - Stanford University has 23,400 in-links has 1 in-link Are all in-links are equal? Links from important pages count more Recursive question!](https://reader033.vdocument.in/reader033/viewer/2022053016/5f17329d2f724c2e8e6dd710/html5/thumbnails/32.jpg)
2/5/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 32
Does this converge? Does it converge to what we want? Are results reasonable?
ji
t
it
j
rr
i
)()1(
d Mrr or
equivalently
Announcement: We graded HW0 and HW1! - Stanford students: Pick them up from the submission box in Gates - SCPD students: SCPD will send you the HW
![Page 33: CS246: Mining Massive Datasets Jure ... - Stanford University has 23,400 in-links has 1 in-link Are all in-links are equal? Links from important pages count more Recursive question!](https://reader033.vdocument.in/reader033/viewer/2022053016/5f17329d2f724c2e8e6dd710/html5/thumbnails/33.jpg)
Example: ra 1 0 1 0
rb 0 1 0 1
2/5/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 33
=
b a
Iteration 0, 1, 2, …
ji
t
it
j
rr
i
)()1(
d
![Page 34: CS246: Mining Massive Datasets Jure ... - Stanford University has 23,400 in-links has 1 in-link Are all in-links are equal? Links from important pages count more Recursive question!](https://reader033.vdocument.in/reader033/viewer/2022053016/5f17329d2f724c2e8e6dd710/html5/thumbnails/34.jpg)
Example: ra 1 0 0 0
rb 0 1 0 0
2/5/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 34
=
b a
Iteration 0, 1, 2, …
ji
t
it
j
rr
i
)()1(
d
![Page 35: CS246: Mining Massive Datasets Jure ... - Stanford University has 23,400 in-links has 1 in-link Are all in-links are equal? Links from important pages count more Recursive question!](https://reader033.vdocument.in/reader033/viewer/2022053016/5f17329d2f724c2e8e6dd710/html5/thumbnails/35.jpg)
2 problems: (1) Some pages are
dead ends (have no out-links)
Such pages cause importance to “leak out”
(2) Spider traps
(all out-links are within the group)
Eventually spider traps absorb all importance
2/5/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 35
![Page 36: CS246: Mining Massive Datasets Jure ... - Stanford University has 23,400 in-links has 1 in-link Are all in-links are equal? Links from important pages count more Recursive question!](https://reader033.vdocument.in/reader033/viewer/2022053016/5f17329d2f724c2e8e6dd710/html5/thumbnails/36.jpg)
Power Iteration:
Set 𝑟𝑗 = 1
𝑟𝑗 = 𝑟𝑖
𝑑𝑖𝑖→𝑗
And iterate
Example: ry 1/3 2/6 3/12 5/24 0
ra = 1/3 1/6 2/12 3/24 … 0
rm 1/3 3/6 7/12 16/24 1
2/5/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 36
Iteration 0, 1, 2, …
y
a m
y a m
y ½ ½ 0
a ½ 0 0
m 0 ½ 1
ry = ry /2 + ra /2
ra = ry /2
rm = ra /2 + rm
![Page 37: CS246: Mining Massive Datasets Jure ... - Stanford University has 23,400 in-links has 1 in-link Are all in-links are equal? Links from important pages count more Recursive question!](https://reader033.vdocument.in/reader033/viewer/2022053016/5f17329d2f724c2e8e6dd710/html5/thumbnails/37.jpg)
The Google solution for spider traps: At each time step, the random surfer has two options
With prob. , follow a link at random
With prob. 1-, jump to some random page
Common values for are in the range 0.8 to 0.9
Surfer will teleport out of spider trap within a few time steps
2/5/2013 37 Jure Leskovec, Stanford C246: Mining Massive Datasets
y
a m
y
a m
![Page 38: CS246: Mining Massive Datasets Jure ... - Stanford University has 23,400 in-links has 1 in-link Are all in-links are equal? Links from important pages count more Recursive question!](https://reader033.vdocument.in/reader033/viewer/2022053016/5f17329d2f724c2e8e6dd710/html5/thumbnails/38.jpg)
Power Iteration:
Set 𝑟𝑗 = 1
𝑟𝑗 = 𝑟𝑖
𝑑𝑖𝑖→𝑗
And iterate
Example: ry 1/3 2/6 3/12 5/24 0
ra = 1/3 1/6 2/12 3/24 … 0
rm 1/3 1/6 1/12 2/24 0
2/5/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 38
Iteration 0, 1, 2, …
y
a m
y a m
y ½ ½ 0
a ½ 0 0
m 0 ½ 0
ry = ry /2 + ra /2
ra = ry /2
rm = ra /2
![Page 39: CS246: Mining Massive Datasets Jure ... - Stanford University has 23,400 in-links has 1 in-link Are all in-links are equal? Links from important pages count more Recursive question!](https://reader033.vdocument.in/reader033/viewer/2022053016/5f17329d2f724c2e8e6dd710/html5/thumbnails/39.jpg)
Teleports: Follow random teleport links with probability 1.0 from dead-ends
Adjust matrix accordingly
2/5/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 39
y
a m
y a m
y ½ ½ ⅓
a ½ 0 ⅓
m 0 ½ ⅓
y a m
y ½ ½ 0
a ½ 0 0
m 0 ½ 0
y
a m
![Page 40: CS246: Mining Massive Datasets Jure ... - Stanford University has 23,400 in-links has 1 in-link Are all in-links are equal? Links from important pages count more Recursive question!](https://reader033.vdocument.in/reader033/viewer/2022053016/5f17329d2f724c2e8e6dd710/html5/thumbnails/40.jpg)
Markov chains Set of states X Transition matrix P where Pij = P(Xt=i | Xt-1=j) π specifying the stationary probability of
being at each state x X Goal is to find π such that π = P π
2/5/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 40
)()1( tt Mrr
![Page 41: CS246: Mining Massive Datasets Jure ... - Stanford University has 23,400 in-links has 1 in-link Are all in-links are equal? Links from important pages count more Recursive question!](https://reader033.vdocument.in/reader033/viewer/2022053016/5f17329d2f724c2e8e6dd710/html5/thumbnails/41.jpg)
Theory of Markov chains
Fact: For any start vector, the power method
applied to a Markov transition matrix P will converge to a unique positive stationary vector as long as P is stochastic, irreducible and aperiodic.
2/5/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 41
![Page 42: CS246: Mining Massive Datasets Jure ... - Stanford University has 23,400 in-links has 1 in-link Are all in-links are equal? Links from important pages count more Recursive question!](https://reader033.vdocument.in/reader033/viewer/2022053016/5f17329d2f724c2e8e6dd710/html5/thumbnails/42.jpg)
Stochastic: Every column sums to 1 A possible solution: Add green links
2/5/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 42
y
a m
y a m
y ½ ½ 1/3
a ½ 0 1/3
m 0 ½ 1/3
ry = ry /2 + ra /2 + rm /3
ra = ry /2+ rm /3
rm = ra /2 + rm /3
)1
( en
aMA T• ai…=1 if node i has
out deg 0, =0 else
• e…vector of all 1s
![Page 43: CS246: Mining Massive Datasets Jure ... - Stanford University has 23,400 in-links has 1 in-link Are all in-links are equal? Links from important pages count more Recursive question!](https://reader033.vdocument.in/reader033/viewer/2022053016/5f17329d2f724c2e8e6dd710/html5/thumbnails/43.jpg)
A chain is periodic if there exists k > 1 such that the interval between two visits to some state s is always a multiple of k.
A possible solution: Add green links
2/5/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 43
y
a m
![Page 44: CS246: Mining Massive Datasets Jure ... - Stanford University has 23,400 in-links has 1 in-link Are all in-links are equal? Links from important pages count more Recursive question!](https://reader033.vdocument.in/reader033/viewer/2022053016/5f17329d2f724c2e8e6dd710/html5/thumbnails/44.jpg)
From any state, there is a non-zero probability of going from any one state to any another
A possible solution: Add green links
2/5/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 44
y
a m
![Page 45: CS246: Mining Massive Datasets Jure ... - Stanford University has 23,400 in-links has 1 in-link Are all in-links are equal? Links from important pages count more Recursive question!](https://reader033.vdocument.in/reader033/viewer/2022053016/5f17329d2f724c2e8e6dd710/html5/thumbnails/45.jpg)
Google’s solution that does it all:
Makes M stochastic, aperiodic, irreducible
At each step, random surfer has two options:
With probability , follow a link at random
With probability 1-, jump to some random page
PageRank equation [Brin-Page, 98]
𝑟𝑗 = 𝛽 𝑟𝑖𝑑𝑖
𝑖→𝑗
+ (1 − 𝛽)1
𝑛
2/5/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 45
di … out-degree of node i
This formulation assumes that 𝑴 has no dead ends. We can either
preprocess matrix 𝑴 to remove all dead ends or explicitly follow random
teleport links with probability 1.0 from dead-ends.
![Page 46: CS246: Mining Massive Datasets Jure ... - Stanford University has 23,400 in-links has 1 in-link Are all in-links are equal? Links from important pages count more Recursive question!](https://reader033.vdocument.in/reader033/viewer/2022053016/5f17329d2f724c2e8e6dd710/html5/thumbnails/46.jpg)
PageRank equation [Brin-Page, 98]
𝑟𝑗 = 𝛽𝑟𝑖𝑑𝑖
𝑖→𝑗
+ (1 − 𝛽)1
𝑛
The Google Matrix A:
𝐴 = 𝛽 𝑀 + (1 − 𝛽)1
𝑛𝒆 ⋅ 𝒆𝑇
A is stochastic, aperiodic and irreducible, so
𝒓(𝒕+𝟏) = 𝑨 ⋅ 𝒓(𝒕) What is ? In practice =0.8,0.9 (make 5 steps and jump)
2/5/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 46
e…vector of all 1s
![Page 47: CS246: Mining Massive Datasets Jure ... - Stanford University has 23,400 in-links has 1 in-link Are all in-links are equal? Links from important pages count more Recursive question!](https://reader033.vdocument.in/reader033/viewer/2022053016/5f17329d2f724c2e8e6dd710/html5/thumbnails/47.jpg)
y
a =
m
1/3
1/3
1/3
0.33
0.20
0.46
0.24
0.20
0.52
0.26
0.18
0.56
7/33
5/33
21/33
. . .
2/5/2013 47 Jure Leskovec, Stanford C246: Mining Massive Datasets
y
a m
0.8+0.2·⅓
0.8·½+0.2·⅓
1/2 1/2 0
1/2 0 0
0 1/2 1
1/3 1/3 1/3
1/3 1/3 1/3
1/3 1/3 1/3
y 7/15 7/15 1/15
a 7/15 1/15 1/15
m 1/15 7/15 13/15
0.8 + 0.2
M 1/n·1·1T
A
![Page 48: CS246: Mining Massive Datasets Jure ... - Stanford University has 23,400 in-links has 1 in-link Are all in-links are equal? Links from important pages count more Recursive question!](https://reader033.vdocument.in/reader033/viewer/2022053016/5f17329d2f724c2e8e6dd710/html5/thumbnails/48.jpg)
![Page 49: CS246: Mining Massive Datasets Jure ... - Stanford University has 23,400 in-links has 1 in-link Are all in-links are equal? Links from important pages count more Recursive question!](https://reader033.vdocument.in/reader033/viewer/2022053016/5f17329d2f724c2e8e6dd710/html5/thumbnails/49.jpg)
Key step is matrix-vector multiplication rnew = A ∙ rold
Easy if we have enough main memory to hold A, rold, rnew
Say N = 1 billion pages We need 4 bytes for
each entry (say)
2 billion entries for vectors, approx 8GB
Matrix A has N2 entries 1018 is a large number!
2/5/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 49
½ ½ 0
½ 0 0
0 ½ 1
1/3 1/3 1/3
1/3 1/3 1/3
1/3 1/3 1/3
7/15 7/15 1/15
7/15 1/15 1/15
1/15 7/15 13/15
0.8 +0.2
A = ∙M + (1-) [1/N]NxN
=
A =
![Page 50: CS246: Mining Massive Datasets Jure ... - Stanford University has 23,400 in-links has 1 in-link Are all in-links are equal? Links from important pages count more Recursive question!](https://reader033.vdocument.in/reader033/viewer/2022053016/5f17329d2f724c2e8e6dd710/html5/thumbnails/50.jpg)
Suppose there are N pages Consider page j, with dj out-links
We have Mij = 1/|dj| when j→i and Mij = 0 otherwise
The random teleport is equivalent to: Adding a teleport link from j to every other page
and setting transition probability to (1-)/N
Reducing the probability of following each out-link from 1/|dj| to /|dj|
Equivalent: Tax each page a fraction (1-) of its score and redistribute evenly
2/5/2013 50 Jure Leskovec, Stanford C246: Mining Massive Datasets
![Page 51: CS246: Mining Massive Datasets Jure ... - Stanford University has 23,400 in-links has 1 in-link Are all in-links are equal? Links from important pages count more Recursive question!](https://reader033.vdocument.in/reader033/viewer/2022053016/5f17329d2f724c2e8e6dd710/html5/thumbnails/51.jpg)
𝒓 = 𝑨 ⋅ 𝒓, where 𝑨𝒊𝒋 = 𝜷 𝑴𝒊𝒋 +𝟏−𝜷
𝑵
𝑟𝑖 = 𝐴𝑖𝑗 ⋅ 𝑟𝑗𝑁𝑗=1
𝑟𝑖 = 𝛽 𝑀𝑖𝑗 +1−𝛽
𝑁⋅ 𝑟𝑗
𝑁𝑗=1
= 𝛽 𝑀𝑖𝑗 ⋅ 𝑟𝑗 +1−𝛽
𝑁𝑁𝑗=1 𝑟𝑗
𝑁𝑗=1
= 𝛽 𝑀𝑖𝑗 ⋅ 𝑟𝑗 +1−𝛽
𝑁𝑁𝑗=1 since 𝑟𝑗 = 1
So we get: 𝒓 = 𝜷 𝑴 ⋅ 𝒓 +𝟏−𝜷
𝑵 𝑵
2/5/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 52
[x]N … a vector of length N with all entries x Note: Here we assumed M
has no dead-ends.
![Page 52: CS246: Mining Massive Datasets Jure ... - Stanford University has 23,400 in-links has 1 in-link Are all in-links are equal? Links from important pages count more Recursive question!](https://reader033.vdocument.in/reader033/viewer/2022053016/5f17329d2f724c2e8e6dd710/html5/thumbnails/52.jpg)
We just rearranged the PageRank equation
𝒓 = 𝜷𝑴 ⋅ 𝒓 +𝟏 − 𝜷
𝑵𝑵
where [(1-)/N]N is a vector with all N entries (1-)/N
M is a sparse matrix! (with no dead-ends)
10 links per node, approx 10N entries So in each iteration, we need to:
Compute rnew = M ∙ rold
Add a constant value (1-)/N to each entry in rnew
Note if M contains dead-ends then 𝒓𝒊𝒏𝒆𝒘
𝒊 < 𝟏 and we also have to renormalize rnew so that it sums to 1
2/5/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 53
![Page 53: CS246: Mining Massive Datasets Jure ... - Stanford University has 23,400 in-links has 1 in-link Are all in-links are equal? Links from important pages count more Recursive question!](https://reader033.vdocument.in/reader033/viewer/2022053016/5f17329d2f724c2e8e6dd710/html5/thumbnails/53.jpg)
Input: Graph 𝑮 and parameter 𝜷 Directed graph 𝑮 with spider traps and dead ends Parameter 𝛽
Output: PageRank vector 𝒓
Set: 𝑟𝑗0 =1
𝑁, 𝑡 = 1
do:
∀𝑗: 𝒓′𝒋(𝒕)= 𝜷
𝒓𝒊(𝒕−𝟏)
𝒅𝒊𝒊→𝒋
𝒓′𝒋(𝒕)= 𝟎 if in-deg. of 𝒋 is 0
Now re-insert the leaked PageRank:
∀𝒋: 𝒓𝒋𝒕= 𝒓′𝒋
𝒕+𝟏−𝑺
𝑵
𝒕 = 𝒕 + 𝟏
while 𝑟𝑗(𝑡)− 𝑟𝑗(𝑡−1)> 𝜀𝑗
54
where: 𝑆 = 𝑟′𝑗(𝑡)
𝑗
![Page 54: CS246: Mining Massive Datasets Jure ... - Stanford University has 23,400 in-links has 1 in-link Are all in-links are equal? Links from important pages count more Recursive question!](https://reader033.vdocument.in/reader033/viewer/2022053016/5f17329d2f724c2e8e6dd710/html5/thumbnails/54.jpg)
Encode sparse matrix using only nonzero entries
Space proportional roughly to number of links
Say 10N, or 4*10*1 billion = 40GB
Still won’t fit in memory, but will fit on disk
2/5/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 55
0 3 1, 5, 7
1 5 17, 64, 113, 117, 245
2 2 13, 23
source
node degree destination nodes
![Page 55: CS246: Mining Massive Datasets Jure ... - Stanford University has 23,400 in-links has 1 in-link Are all in-links are equal? Links from important pages count more Recursive question!](https://reader033.vdocument.in/reader033/viewer/2022053016/5f17329d2f724c2e8e6dd710/html5/thumbnails/55.jpg)
Assume enough RAM to fit rnew into memory
Store rold and matrix M on disk
Then 1 step of power-iteration is:
2/5/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 56
0 3 1, 5, 6
1 4 17, 64, 113, 117
2 2 13, 23
src degree destination 0 1 2
3 4 5 6
0 1 2
3 4 5 6
rnew rold
Initialize all entries of rnew to (1-)/N
For each page p (of out-degree n):
Read into memory: p, n, dest1,…,destn, rold(p)
for j = 1…n: rnew(destj) += rold(p) / n
![Page 56: CS246: Mining Massive Datasets Jure ... - Stanford University has 23,400 in-links has 1 in-link Are all in-links are equal? Links from important pages count more Recursive question!](https://reader033.vdocument.in/reader033/viewer/2022053016/5f17329d2f724c2e8e6dd710/html5/thumbnails/56.jpg)
Assume enough RAM to fit rnew into memory
Store rold and matrix M on disk
In each iteration, we have to:
Read rold and M
Write rnew back to disk
IO cost = 2|r| + |M|
Question:
What if we could not even fit rnew in memory?
2/5/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 57
![Page 57: CS246: Mining Massive Datasets Jure ... - Stanford University has 23,400 in-links has 1 in-link Are all in-links are equal? Links from important pages count more Recursive question!](https://reader033.vdocument.in/reader033/viewer/2022053016/5f17329d2f724c2e8e6dd710/html5/thumbnails/57.jpg)
2/5/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 58
0 4 0, 1, 3, 5
1 2 0, 5
2 2 3, 4
src degree destination
0 1
2
3
4 5
0 1 2
3 4 5
rnew rold
![Page 58: CS246: Mining Massive Datasets Jure ... - Stanford University has 23,400 in-links has 1 in-link Are all in-links are equal? Links from important pages count more Recursive question!](https://reader033.vdocument.in/reader033/viewer/2022053016/5f17329d2f724c2e8e6dd710/html5/thumbnails/58.jpg)
Similar to nested-loop join in databases
Break rnew into k blocks that fit in memory
Scan M and rold once for each block
k scans of M and rold
k(|M| + |r|) + |r| = k|M| + (k+1)|r|
Can we do better?
Hint: M is much bigger than r (approx 10-20x), so we must avoid reading it k times per iteration
2/5/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 59
![Page 59: CS246: Mining Massive Datasets Jure ... - Stanford University has 23,400 in-links has 1 in-link Are all in-links are equal? Links from important pages count more Recursive question!](https://reader033.vdocument.in/reader033/viewer/2022053016/5f17329d2f724c2e8e6dd710/html5/thumbnails/59.jpg)
2/5/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 60
0 4 0, 1
1 3 0
2 2 1
src degree destination
0 1
2
3
4 5
0 1 2
3 4 5
rnew
rold
0 4 5
1 3 5
2 2 4
0 4 3
2 2 3
![Page 60: CS246: Mining Massive Datasets Jure ... - Stanford University has 23,400 in-links has 1 in-link Are all in-links are equal? Links from important pages count more Recursive question!](https://reader033.vdocument.in/reader033/viewer/2022053016/5f17329d2f724c2e8e6dd710/html5/thumbnails/60.jpg)
Break M into stripes
Each stripe contains only destination nodes in the corresponding block of rnew
Some additional overhead per stripe
But it is usually worth it
Cost per iteration
|M|(1+) + (k+1)|r|
2/5/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 61
![Page 61: CS246: Mining Massive Datasets Jure ... - Stanford University has 23,400 in-links has 1 in-link Are all in-links are equal? Links from important pages count more Recursive question!](https://reader033.vdocument.in/reader033/viewer/2022053016/5f17329d2f724c2e8e6dd710/html5/thumbnails/61.jpg)
Measures generic popularity of a page
Biased against topic-specific authorities
Solution: Topic-Specific PageRank (next)
Uses a single measure of importance
Other models e.g., hubs-and-authorities
Solution: Hubs-and-Authorities (next)
Susceptible to Link spam
Artificial link topographies created in order to boost page rank
Solution: TrustRank (next)
2/5/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 62