Estimating the Global PageRank of Web Communities
Paper by Jason V. Davis & Inderjit S. Dhillon
Dept. of Computer Sciences, University of Texas at Austin
Presentation given by Scott J. McCallen
Dept. of Computer Science, Kent State University
December 4th, 2006
Localized Search Engines
What are they?
Focus on a particular community
Examples: www.cs.kent.edu (site specific) or all computer science related websites (topic specific)
Advantages
Searching for particular terms with several meanings
Relatively inexpensive to build and use
Use less bandwidth, space and time
Local domains are orders of magnitude smaller than the global domain
Localized Search Engines (con’t)
Disadvantages
Lack of global information
i.e. only local PageRanks are available
Why is this a problem?
Only pages that are highly regarded within that community will have high PageRanks
There is a need for a global PageRank for pages within a local domain
Traditionally, this can only be obtained by crawling the entire domain
Some Global Facts
2003 study by Lyman on the global domain
8.9 billion static pages on the internet
Approximately 18.7 kilobytes each
167 terabytes needed to download and crawl the entire web
These resources are only available to major corporations
Local Domains
May only contain a couple hundred thousand pages
May already be contained on a local web server (www.cs.kent.edu)
There is much less restriction on accessing the entire dataset
The advantages of localized search engines become clear
Global (N) vs. Local (n)
[Diagram: overlapping local domains (environmental, EDU, political, and other websites) within the global domain]
Some parts overlap, but others don't; overlap represents links to other domains.
Each local domain isn't aware of the rest of the global domain.
Excluding overlap from other domains gives a very poor estimate of global rank.
How is it possible to extract global information when only the local domain is available?
Proposed Solution
Find a good approximation to the global PageRank value without crawling the entire global domain
Find a superdomain of the local domain that will approximate the PageRank well
Find this superdomain by crawling as few as n or 2n additional pages, given a local domain of n pages
Essentially, add as few pages to the local domain as possible until we find a very good approximation of the PageRanks in the local domain
PageRank - Description
Defines the importance of pages based on the hyperlinks from one page to another (the web graph)
Computes the stationary distribution of a Markov chain created from the web graph
Uses the “random surfer” model to create a “random walk” over the chain
PageRank Matrix
Given the m x m adjacency matrix U for the web graph, define the PageRank matrix as
PR(U) = α·U·D_U^(-1) + (1 − α)·v·e^T
D_U is a diagonal matrix such that U·D_U^(-1) is column stochastic
0 ≤ α ≤ 1
e is the vector of all 1's
v is the random surfer vector
PageRank Vector
The PageRank vector r represents the PageRank of every node in the web graph
It is defined as the dominant eigenvector of the PageRank matrix
Computed using the power method with a random starting vector
Computation can take as much as O(m²) time for a dense graph, but in practice is normally O(km), k being the average number of links per page
Algorithm 1
Computing the PageRank vector based on the adjacency matrix U of the given web graph
Algorithm 1 (Explanation)
Input: adjacency matrix U
Output: PageRank vector r
Method
Choose a random initial value for r^(0)
Continue to iterate using the random surfer probability and vector until reaching the convergence threshold
Return the last iteration as the dominant eigenvector of the adjacency matrix U
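The iteration just described can be sketched as a short power-method routine. The column-as-outlinks convention and the uniform surfer vector v are assumptions made for illustration, not fixed by the slides:

```python
import numpy as np

def pagerank(U, alpha=0.85, tol=1e-8):
    """Power-method sketch of Algorithm 1 on the PageRank matrix
    PR(U) = alpha * U * D_U^(-1) + (1 - alpha) * v * e^T.
    Assumes U[i, j] = 1 when page j links to page i, so columns
    sum to outlink counts; v is taken to be uniform."""
    m = U.shape[0]
    out = U.sum(axis=0)
    out[out == 0] = 1.0              # guard dangling pages (no outlinks)
    P = U / out                      # column-stochastic U * D_U^(-1)
    v = np.full(m, 1.0 / m)          # uniform random-surfer vector
    r = np.random.rand(m)
    r /= r.sum()                     # random initial r^(0), normalized
    while True:
        r_new = alpha * (P @ r) + (1 - alpha) * v
        if np.abs(r_new - r).sum() < tol:   # L1 convergence threshold
            return r_new
        r = r_new
```

For a three-page graph where pages 1 and 2 both link to page 0, page 0 ends up with the largest PageRank, as expected.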
Defining the Problem (G vs. L)
For a local domain L, we have G as the entire global domain with an N x N adjacency matrix
Define G to be partitioned into blocks so that L is contained as a submatrix, with L_out holding the links that leave L
Assume that L has already been crawled and L_out is known
Defining the Problem (p* in g)
If we partition G as such, we can denote the actual PageRank vector of L as p*, defined with respect to g (the global PageRank vector)
Note: E_L selects only the nodes that correspond to L from g
Defining the Problem (n << N)
We define p as the PageRank vector computed by crawling only the local domain L
Note that p will be much different from p*
If we continue to crawl more nodes of the global domain, the difference will become smaller; however, crawling the entire global domain is not possible
Find the supergraph F of L that will minimize the difference between p and p*
Defining the Problem (finding F)
We need to find the F that gives us the best approximation of p*
i.e. minimize the difference between the actual global PageRank and the estimated PageRank
F is found with a greedy strategy, using Algorithm 2
Essentially, start with L and add the nodes in F_out that minimize our objective, continuing for a total of T iterations
Algorithm 2
Algorithm 2 (Explanation)
Input: L (local domain), L_out (outlinks from L), T (number of iterations), k (pages to crawl per iteration)
Output: p (an improved estimated PageRank vector)
Method
First set F (supergraph) and F_out equal to L and L_out
Compute the PageRank vector f of F
While T has not been exceeded
Select k new nodes to crawl based on F, F_out, f
Expand F to include those new nodes and modify F_out
Compute the new PageRank vector f for F
Select the elements from f that correspond to L and return p
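The loop above can be sketched end to end. Here a full matrix G stands in for the global web purely so the crawl can be simulated, and the outlink-count baseline stands in for the paper's Algorithm 3 as the selection rule; both substitutions are illustrative assumptions:

```python
import numpy as np

def pagerank(U, alpha=0.85, tol=1e-8):
    # power-method PageRank (Algorithm 1); columns of U hold outlinks
    out = U.sum(axis=0)
    out[out == 0] = 1.0
    P = U / out
    v = np.full(U.shape[0], 1.0 / U.shape[0])
    r = v.copy()
    while True:
        r_new = alpha * (P @ r) + (1 - alpha) * v
        if np.abs(r_new - r).sum() < tol:
            return r_new
        r = r_new

def expand_and_estimate(G, local, T=2, k=1):
    """Sketch of Algorithm 2. G simulates the full global adjacency
    matrix (known here only so the crawl can be simulated); `local`
    lists the indices of the already-crawled local domain L. Selection
    uses the outlink-count baseline in place of Algorithm 3."""
    F = list(local)
    for _ in range(T):
        frontier = [j for j in range(G.shape[0]) if j not in F]
        if not frontier:
            break
        # score each frontier page j by the number of links from F to j
        scores = sorted(((G[j, F].sum(), j) for j in frontier), reverse=True)
        F += [j for _, j in scores[:k]]          # "crawl" the top-k pages
    f = pagerank(G[np.ix_(F, F)])                # PageRank of supergraph F
    p = f[:len(local)]                           # entries corresponding to L
    return p / p.sum()                           # renormalize to a distribution
```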
Global (N) vs. Local (n) (Again)
[Diagram repeated: overlapping local domains (environmental, EDU, political, and other websites) within the global domain]
We know how to create the PageRank vector using the power method.
Using it on only the local domain gives very inaccurate estimates of the PageRank.
How can we select nodes from other domains (i.e. expanding the current domain) to improve accuracy?
How far can selecting more nodes proceed without crawling the entire global domain?
Selecting Nodes
Select nodes to expand L to F
Selected nodes must bring us closer to the actual PageRank vector
Some nodes will greatly influence the current PageRank
We only want to select at most O(n) more pages than those already in L
Finding the Best NodesFinding the Best Nodes
For a page j in the global domain and the For a page j in the global domain and the frontier of F (Ffrontier of F (Foutout), the addition of page j to F ), the addition of page j to F is as followsis as follows
uj is the outlinks from F to juj is the outlinks from F to j s is the estimated inlinks from j into F (j has s is the estimated inlinks from j into F (j has
not yet been crawled)not yet been crawled) s is estimated based on the expectation of inlink s is estimated based on the expectation of inlink
counts of pages already crawled as socounts of pages already crawled as so
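One plausible reading of that estimate is the empirical mean of the inlink patterns of pages already crawled. The paper's exact estimator may weight pages differently, so treat this as a sketch; the name `estimate_inlink_vector` and the matrix layout are assumptions:

```python
import numpy as np

def estimate_inlink_vector(F_adj):
    """Illustrative estimator for s, the unknown inlinks from an
    uncrawled page j into F: the expectation (mean) of the inlink
    patterns of already-crawled pages. F_adj[i, c] = 1 when crawled
    page c links to page i of F."""
    return F_adj.mean(axis=1)
```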
Finding the Best Nodes (con’t)
We defined the PageRank of F to be f
The PageRank of F_j is f_j+ = [f_j; x_j]
x_j is the PageRank of node j (appended to the current PageRank vector)
Directly optimizing requires us to know the global PageRank p*
How can we minimize the objective without knowing p*?
Node Influence
Find the nodes in F_out that will have the greatest influence on the local domain L
Done by attaching an influence score to each node j
The influence is the sum, over all pages in L, of the difference that adding page j makes to the PageRank vector
The influence score correlates strongly with the minimization of the GlobalDiff(f_j) function (as compared to a baseline, for instance, the total outlink count from F to node j)
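Taken literally, that influence score is a single sum of absolute PageRank changes over the local pages; the function and argument names below are illustrative, not the paper's notation:

```python
import numpy as np

def influence_score(f_plus, f, local_idx):
    """Influence of candidate page j on the local domain L: the summed
    change that adding j makes to the PageRanks of pages in L.
    `f_plus` is the PageRank of F_j restricted to the original nodes
    of F, `f` is the PageRank of F, and `local_idx` indexes L."""
    return np.abs(f_plus[local_idx] - f[local_idx]).sum()
```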
Node Influence Results
[Figure: Node Influence vs. Outlink Count on a crawl of conservative web sites]
Finding the Influence
Influence must be calculated for each node j in the frontier of F that is considered
Since we are considering O(n) pages and each calculation is O(n), we are left with an O(n²) computation
To reduce this complexity, approximating the influence of j may be acceptable, but how?
Using the power method for computing the PageRank may lead us to a good approximation
However, using Algorithm 1 requires having a good starting vector
PageRank Vector (again)
The PageRank algorithm will converge at a rate equal to the random surfer probability α
With a starting vector x^(0), the complexity of the algorithm grows with the number of iterations needed to shrink the initial error
That is, the more accuracy required, the more complex the process is
Saving grace: find a very good starting vector x^(0), in which case we only need to perform one iteration of Algorithm 1
Finding the Best x^(0)
Partition the PageRank matrix for F_j
Finding the Best x^(0)
Simple approach
Use the current PageRank vector (extended to node j) as the starting vector
Perform one PageRank iteration
Remove the element that corresponds to the added node
Issues
The estimate of f_j+ will have an error of at least 2αx_j
So if the PageRank of j is very high, the estimate is very bad
Stochastic Complement
In expanded form, the PageRank f_j+ = [f_j; x_j] satisfies the partitioned system
[f_j; x_j] = [[S11, s12]; [s21^T, s22]] [f_j; x_j]
which can be solved for f_j as
f_j = (S11 + s12·(1 − s22)^(-1)·s21^T) f_j
Observation: S = S11 + s12·(1 − s22)^(-1)·s21^T is the stochastic complement of the PageRank matrix of F_j
Stochastic Complement (Observations)
The stochastic complement of an irreducible matrix is unique
The stochastic complement is also irreducible and therefore has a unique stationary distribution
With regard to the matrix S, the subdominant eigenvalue is bounded above by a quantity only slightly larger than α, which means that for large l it is very close to α
The New PageRank Approximation
Estimate the vector f_j of length l by performing one PageRank iteration over S, starting at f
Advantages
Starting and ending with a vector of length l
Creates a lower bound of zero for the error
Example: consider adding a node k to F that has no influence over the PageRank of F; using the stochastic complement yields the exact solution
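The approximation can be sketched directly: form the stochastic complement S from the partitioned PageRank matrix of F_j (blocks S11, s12, s21, s22, with node j last) and take a single iteration starting from f. This is illustrative only; forming S explicitly is not how an efficient implementation would proceed:

```python
import numpy as np

def one_iteration_estimate(S11, s12, s21, s22, f):
    """One PageRank iteration over the stochastic complement
    S = S11 + s12 (1 - s22)^(-1) s21^T, starting from the current
    PageRank f of F. Returns an estimate of f_j of length l,
    never leaving R^l."""
    S = S11 + np.outer(s12, s21) / (1.0 - s22)
    f_new = S @ f
    return f_new / f_new.sum()       # renormalize to a distribution
```

On a small column-stochastic example the result is again a proper distribution of length l.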
The Details
Begin by expanding the difference between the two PageRank vectors
Substitute P_F into the equation
Summarize the terms into the vectors x, y and z
Algorithm 3 (Explanation)
Input: F (the current local subgraph), F_out (outlinks of F), f (current PageRank of F), k (number of pages to return)
Output: k new pages to crawl
Method
Compute the outlink sums for each page in F
Compute a scalar for every known global page j (how many pages link to j)
Compute y and z as formulated
For each of the pages in F_out
Compute x as formulated
Compute the score of each page using x, y and z
Return the k pages with the highest scores
PageRank Leaks and Flows
The change in PageRank caused by adding a node j to F can be described in terms of leaks and flows
A flow is the increase in local PageRanks, represented by a scalar and a vector
The scalar is the total amount j has to distribute
The vector determines how it will be distributed
A leak is the decrease in local PageRanks
Leaks come from the non-positive vectors x and y
x is proportional to the weighted sum of sibling PageRanks
y is an artifact of the random surfer vector
Leaks and Flows
[Diagram: node j sends flows into the local pages, and leaks PageRank to its siblings and to the random surfer]
Experiments
Methodology
Resources are limited, so the global graph is approximated
Baseline algorithms
Random: nodes chosen uniformly at random from the known global nodes
Outlink count: the nodes chosen have the highest number of outlink counts from the current local domain
Results (Data Sets)
Both data sets are restricted to http pages that do not contain the characters ?, *, @, or =
EDU Data Set
Crawl of the top 100 computer science universities
Yielded 4.7 million pages, 22.9 million links
Politics Data Set
Crawl of the pages under politics in the dmoz directory
Yielded 4.4 million pages, 17.2 million links
Results (EDU Data Set)
The normalization measures show difference, the Kendall measure shows similarity
Results (Politics Data Set)
Result Summary
Stochastic complement outperformed the other methods in nearly every trial
The results are significantly better than the random-selection approach, with minimal computation
Conclusion
Accurate estimates of the global PageRank can be obtained by using local results
Expand the local graph based on influence
Crawl at most O(n) more pages
Use the stochastic complement to accurately estimate the new PageRank vector
Not computationally or storage intensive
Estimating the Global PageRank of Web Communities
The End
Thank You