the pagerank
Post on 24-Jun-2015
The PageRank Citation Ranking: Bringing Order to the Web
Larry Page et al.
Stanford University
Presented by
Guoqiang Su & Wei Li
Contents
Motivation
Related work
PageRank & Random Surfer Model
Implementation
Application
Conclusion
Motivation
Web: heterogeneous and unstructured
No quality control on the web
Commercial interest to manipulate ranking
Related Work
Academic citation analysis
Link-based analysis
Clustering methods of link structure
Hubs & Authorities Model
Backlink
Link structure of the web
Approximation of importance / quality
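PageRank
Pages with lots of backlinks are important
Backlinks coming from important pages convey more importance to a page
$R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v}$
where $B_u$ is the set of pages linking to $u$ and $N_v$ is the number of outgoing links on page $v$
Problem: Rank Sink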
Rank Sink
A cycle of pages pointed to by some incoming link
Problem: this loop will accumulate rank but never distribute any rank outside
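The rank sink can be reproduced on a toy graph. The sketch below assumes a hypothetical four-page graph (pages "a" and "d" link to each other, and "a" also links into the loop "b" <-> "c") and applies the simplified formula with c = 1. It is only an illustration, not the authors' code.

```python
def simple_rank_step(ranks, out_links):
    """One step of R(u) = sum over backlinks v of R(v) / N_v (with c = 1)."""
    new_ranks = {page: 0.0 for page in ranks}
    for page, targets in out_links.items():
        share = ranks[page] / len(targets)  # rank is split evenly over the outgoing links
        for target in targets:
            new_ranks[target] += share
    return new_ranks

# Hypothetical graph: "a" <-> "d", plus "a" -> "b" into the loop "b" <-> "c".
out_links = {"a": ["b", "d"], "d": ["a"], "b": ["c"], "c": ["b"]}
ranks = {page: 0.25 for page in out_links}
for _ in range(100):
    ranks = simple_rank_step(ranks, out_links)

loop_rank = ranks["b"] + ranks["c"]
outside_rank = ranks["a"] + ranks["d"]
print(f"rank inside loop: {loop_rank:.6f}, outside: {outside_rank:.6f}")
```

The total rank is conserved, but it drains into the loop and never comes back out.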
Escape Term
Solution: Rank Source
$R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v} + c E(u)$
c is maximized and $\|R\|_1 = 1$
E(u) is some vector over the web pages
– uniform, favorite page etc.
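A minimal sketch of the rank source at work, on a hypothetical four-page graph that contains a rank sink ("b" <-> "c"). The uniform E with total weight 0.15 is an assumed value, and the function name `rank_with_source` is made up for this example.

```python
def rank_with_source(out_links, source, iterations=100):
    """Iterate R(u) = c * sum(R(v) / N_v) + c * E(u), rescaling so that ||R||_1 = 1."""
    pages = list(out_links)
    ranks = {page: 1.0 / len(pages) for page in pages}
    c = 1.0
    for _ in range(iterations):
        new_ranks = {page: source[page] for page in pages}       # E(u)
        for page, targets in out_links.items():
            for target in targets:
                new_ranks[target] += ranks[page] / len(targets)  # R(v) / N_v
        c = 1.0 / sum(new_ranks.values())                        # c restores ||R||_1 = 1
        ranks = {page: c * value for page, value in new_ranks.items()}
    return ranks, c

# Hypothetical graph: "a" <-> "d", plus "a" -> "b" into the loop "b" <-> "c".
out_links = {"a": ["b", "d"], "d": ["a"], "b": ["c"], "c": ["b"]}
source = {page: 0.15 / len(out_links) for page in out_links}     # assumed uniform E
ranks, c = rank_with_source(out_links, source)
outside_rank = ranks["a"] + ranks["d"]
print(f"c = {c:.4f}, rank outside the loop = {outside_rank:.4f}")
```

With the escape term in place, the pages outside the loop keep a non-zero share of the rank.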
Matrix Notation
$R = c (A^T + E e^T) R$
where $e$ is the all-ones vector, so $e^T R = \|R\|_1 = 1$ for non-negative ranks
R is the dominant eigenvector of $(A^T + E e^T)$, and $1/c$ is the corresponding dominant eigenvalue
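The eigenvector relation can be checked numerically. The sketch below assumes a hypothetical three-page graph, a uniform E with total weight 0.15, and the convention that A[u][v] = 1/N_u when u links to v. It uses plain power iteration, which is one way to find a dominant eigenvector and not necessarily what the authors ran.

```python
def mat_vec(matrix, vector):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

# Hypothetical graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
out_links = {0: [1, 2], 1: [2], 2: [0]}
n = len(out_links)
E = [0.05] * n  # assumed uniform rank source, ||E||_1 = 0.15

# M = A^T + E e^T: every entry in row v starts at E[v], then A^T adds 1/N_u at [v][u].
M = [[E[row]] * n for row in range(n)]
for u, targets in out_links.items():
    for v in targets:
        M[v][u] += 1.0 / len(targets)

R = [1.0 / n] * n
for _ in range(200):
    R = mat_vec(M, R)
    norm = sum(R)                 # L1 norm, since all entries are non-negative
    R = [x / norm for x in R]

MR = mat_vec(M, R)
eigenvalue = sum(MR)              # valid because ||R||_1 = 1
c = 1.0 / eigenvalue
print(f"eigenvalue = {eigenvalue:.4f}, c = {c:.4f}")
```

With this normalization, c comes out as the reciprocal of the dominant eigenvalue, and c * M * R reproduces R.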
Computing PageRank
$R_0 \leftarrow S$ (initialize vector over web pages)
loop:
  $R_{i+1} \leftarrow A^T R_i$ (new ranks: sum of normalized backlink ranks)
  $d \leftarrow \|R_i\|_1 - \|R_{i+1}\|_1$ (compute normalizing factor)
  $R_{i+1} \leftarrow R_{i+1} + d E$ (add escape term)
  $\delta \leftarrow \|R_{i+1} - R_i\|_1$ (control parameter)
while $\delta > \epsilon$ (stop when converged)
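The loop can be sketched in plain Python. Two assumptions are not in the slides: a `decay` factor of 0.85 that scales A so that some rank is lost on every step (otherwise d is zero on a graph without dangling pages), and an E that sums to 1 so that adding d E restores the lost rank. The graph and the function name `pagerank` are hypothetical.

```python
def pagerank(out_links, source, start, decay=0.85, epsilon=1e-10, max_iter=1000):
    """Iterate until the L1 change between successive rank vectors is at most epsilon."""
    pages = list(start)
    ranks = dict(start)                                    # R_0 <- S
    for _ in range(max_iter):
        new_ranks = {page: 0.0 for page in pages}
        for page, targets in out_links.items():            # R_{i+1} <- A^T R_i
            for target in targets:
                new_ranks[target] += decay * ranks[page] / len(targets)
        d = sum(ranks.values()) - sum(new_ranks.values())  # rank lost in this step
        for page in pages:                                 # R_{i+1} <- R_{i+1} + d E
            new_ranks[page] += d * source[page]
        delta = sum(abs(new_ranks[p] - ranks[p]) for p in pages)
        ranks = new_ranks
        if delta <= epsilon:                               # loop runs while delta > epsilon
            break
    return ranks

# Hypothetical graph; page "d" has no backlinks.
out_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
uniform = {page: 1.0 / len(out_links) for page in out_links}  # used as both S and E
ranks = pagerank(out_links, source=uniform, start=uniform)
for page, rank in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(rank, 4))
```

Page "c", which collects links from every other page, ends up on top, and "d" keeps only what the escape term gives it.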
Random Surfer Model
PageRank corresponds to the probability distribution of a random walk on the web graph
E(u) can be rephrased as follows: the random surfer periodically gets bored and jumps to a different page chosen according to E, so the surfer is never kept in a loop forever
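A small Monte Carlo sketch of the random surfer, assuming a hypothetical four-page graph, a uniform jump distribution, and an assumed jump probability of 0.15. Visit frequencies approximate the rank distribution. The function name `random_surfer` is made up for illustration.

```python
import random

def random_surfer(out_links, steps, jump_probability=0.15, seed=0):
    """Return the fraction of steps the surfer spends on each page."""
    rng = random.Random(seed)
    pages = list(out_links)
    visits = {page: 0 for page in pages}
    current = rng.choice(pages)
    for _ in range(steps):
        visits[current] += 1
        targets = out_links[current]
        if not targets or rng.random() < jump_probability:
            current = rng.choice(pages)    # bored (or stuck): jump to a random page
        else:
            current = rng.choice(targets)  # follow a random outgoing link
    return {page: count / steps for page, count in visits.items()}

# Hypothetical graph; page "d" has no backlinks and is reached only by jumps.
out_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
frequencies = random_surfer(out_links, steps=100_000)
print(frequencies)
```

Pages that many others link to are visited often, while "d" is reached only through the occasional jump.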
Implementation
Computing resources
– 24 million pages
– 75 million URLs
Memory and disk storage
– Weight vector (4-byte float)
– Matrix A (linear access)
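A quick back-of-the-envelope check of the weight vector size, assuming one 4-byte weight per URL (the slide does not say how many copies of the vector are held in memory):

```python
NUM_URLS = 75_000_000    # from the slide
BYTES_PER_WEIGHT = 4     # 4-byte float, from the slide

# Assumption: exactly one weight is stored per URL.
weight_vector_bytes = NUM_URLS * BYTES_PER_WEIGHT
print(f"weight vector: {weight_vector_bytes / 10**6:.0f} MB")
```

Under that assumption a single weight vector takes about 300 MB.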
Implementation (Con't)
Unique integer ID for each URL
Sort and remove dangling links
Rank initial assignment
Iteration until convergence
Add back dangling links and re-compute
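The first two steps might look like the sketch below. It assumes links arrive as (source URL, target URL) pairs and treats a link as dangling when its target has no outgoing links. The URLs and the helper names `assign_ids` and `remove_dangling` are hypothetical.

```python
def assign_ids(links):
    """Map each URL to a unique integer ID, in sorted URL order."""
    urls = sorted({url for pair in links for url in pair})
    return {url: index for index, url in enumerate(urls)}

def remove_dangling(edges):
    """Repeatedly drop links whose target has no outgoing links."""
    edges = set(edges)
    while True:
        sources = {src for src, _ in edges}
        kept = {(src, dst) for src, dst in edges if dst in sources}
        if kept == edges:      # nothing removed in this pass: done
            return kept
        edges = kept           # removing links can expose new dangling links

# Hypothetical link list.
links = [
    ("a.example", "b.example"),
    ("b.example", "a.example"),
    ("b.example", "c.example"),
    ("c.example", "d.example"),
]
ids = assign_ids(links)
edges = [(ids[src], ids[dst]) for src, dst in links]
core = remove_dangling(edges)
print(ids)
print(sorted(core))
```

Removal is repeated because dropping one dangling link can leave its source page without outgoing links as well.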
Convergence Properties
Graph (V, E) is an expander with factor $\alpha$ if for all (not too large) subsets S: $|A_S| \ge \alpha |S|$, where $A_S$ is the set of nodes reachable through outgoing links from S
Eigenvalue separation: the largest eigenvalue is sufficiently larger than the second-largest eigenvalue
Random walk converges fast to a limiting probability distribution on a set of nodes in the graph.
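For a tiny graph the expansion factor can be found by brute force. The sketch below assumes that "not too large" means at most half of the nodes and that the neighborhood of S is the set of nodes reachable by one outgoing link from S. Both graphs are hypothetical, and the approach is only feasible for a handful of nodes.

```python
from itertools import combinations

def expansion_factor(out_links, max_size):
    """Smallest ratio |neighborhood(S)| / |S| over all subsets S with 1 <= |S| <= max_size."""
    pages = list(out_links)
    best = float("inf")
    for size in range(1, max_size + 1):
        for subset in combinations(pages, size):
            neighborhood = set()
            for page in subset:
                neighborhood.update(out_links[page])
            best = min(best, len(neighborhood) / size)
    return best

# Hypothetical graphs on four nodes.
complete = {u: [v for v in "abcd" if v != u] for u in "abcd"}   # every page links to all others
ring = {"a": ["b"], "b": ["c"], "c": ["d"], "d": ["a"]}         # each page links to the next one
alpha_complete = expansion_factor(complete, max_size=2)
alpha_ring = expansion_factor(ring, max_size=2)
print(alpha_complete, alpha_ring)
```

The densely linked graph expands better than the ring, which matches the intuition that a random walk mixes faster on it.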
Convergence Properties (con't)
PageRank computation converges in roughly O(log(|V|)) iterations because the web graph G is rapidly mixing.
Personalized PageRank
Rank Source E can be initialized:
– uniformly over all pages: e.g. copyright warnings, disclaimers, mailing list archives end up with overly high ranking
– total weight on a single page, e.g. Netscape, McCarthy: great variation of ranks under different single pages as rank source
– and everything in between, e.g. server root pages: allows manipulation by commercial interests
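The effect of the rank source can be seen by running the same iteration with two different choices of E. The sketch assumes a hypothetical five-page graph without dangling pages, an assumed decay factor of 0.85, and an E that sums to 1. The page names and the function name `personalized_rank` are made up.

```python
def personalized_rank(out_links, source, decay=0.85, iterations=100):
    """Iterate R <- decay * A^T R + (1 - decay) * E, starting from R = E."""
    ranks = dict(source)
    for _ in range(iterations):
        new_ranks = {page: (1.0 - decay) * source[page] for page in out_links}
        for page, targets in out_links.items():
            for target in targets:
                new_ranks[target] += decay * ranks[page] / len(targets)
        ranks = new_ranks
    return ranks

# Hypothetical graph: a small site around "home" and a fan page pair "fan" <-> "idol".
out_links = {
    "home": ["news", "blog"],
    "news": ["home"],
    "blog": ["home"],
    "fan": ["home", "idol"],
    "idol": ["fan"],
}
uniform = {page: 1.0 / len(out_links) for page in out_links}
single = {page: 1.0 if page == "idol" else 0.0 for page in out_links}

uniform_ranks = personalized_rank(out_links, uniform)
single_ranks = personalized_rank(out_links, single)
print("uniform E:     ", {p: round(r, 3) for p, r in uniform_ranks.items()})
print("all E on idol: ", {p: round(r, 3) for p, r in single_ranks.items()})
```

Putting all of E on one page pulls rank toward that page and its neighbors, which is the personalization effect described above.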
Applications I
Estimate web traffic
– Server/page aliases
– Link/traffic disparity, e.g. porn sites, free web-mail
Backlink predictor
– Citation counts have been used to predict future citations
– It is very difficult to map the citation structure of the web completely
– PageRank avoids the local maxima that citation counts get stuck in and performs better
Applications II - Ranking Proxy
Surfer's Navigation Aid
Annotating links by PageRank (bar graph)
Not query dependent
Issues
Users are not random walkers
– Content based methods
Starting point distribution
– Actual usage data as starting vector
Reinforcing effects/bias towards main pages
How about traffic to ranking pages?
No query specific rank
Linkage spam
– PageRank favors pages that managed to get other pages to link to them
– Linkage is not necessarily a sign of relevancy, only of promotion (advertisement…)
Evaluation I
Evaluation II
Conclusion
PageRank is a global ranking based on the web's graph structure
PageRank uses backlink information to bring order to the web
PageRank can separate out representative pages as cluster centers
A great variety of applications