formula for microsoft poprank
TRANSCRIPT
-
7/29/2019 Formula for Microsoft Poprank
1/19
Object-Level Ranking: Bringing Order to Web Objects
Zaiqing Nie, Yuanzhi Zhang, Ji-Rong Wen, Wei-Ying Ma
Microsoft Research Asia, Peking University
WWW 2005
Presented by Jason Yap
-
Problem Definition
Existing Web search engines generally treat a whole Web page as the unit for retrieval and consuming when it comes to ranking pages (e.g., PageRank).
However, there are various kinds of objects embedded in static Web pages or Web databases.
Example objects: products, people, papers, organizations, etc.
These objects have associative links among them, which can be used to generate their own Web graph.
-
How is this different/new?
PageRank and HITS assume that all links carry the same "endorsement" semantics and are equally important.
For link analysis, the most distinctive characteristic of the object graph is the heterogeneity of its links.
For example, the popularity of a paper should not be affected too much by the number of authors, but the number of citations does have a large impact on it.
-
PopRank
A method which considers both the Web popularity of an object and the object relationships to calculate the popularity score of the Web object.
Web popularity is similar to page ranking.
PopRank extends the PageRank model by adding a popularity propagation factor (PPF) to each link pointing to an object.
Idea behind the PPF: weight backlinks differently depending on how much they should contribute to the popularity of an object.
Problem: How do we determine the weighting?
Incorrect weighting will lead to an improper ranking of objects.
It is extremely challenging to assign these weights manually.
They propose a learning-based approach to automatically learn the popularity propagation factors for different types of links, using the partial ranking of the objects given by domain experts.
-
Model
Example Scenario:
A reader wants to get into a new research field and read the related papers.
To get started, he may first use Google or CiteSeer to find several seed objects, which could be related papers, authors, or conferences/journals.
After that, he most likely just follows the object relationship links to locate more papers. He may want to read the papers cited by the papers he has already read, or read papers by his favorite authors, conferences, or journals.
Clearly, a paper cited by a large number of popular papers could be popular, and a recent paper published in a prestigious conference with few citations could also be popular.
Random object finder model
Similar to the random walk in PageRank.
A user starts his random walk on the Web, and he will start following the object relationship links once he finds the first object on the Web.
Eventually, he gets bored and will restart his random walk on the Web again to find another seed object.
-
Formula to compute PopRank
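The formula shown on this slide did not survive in the transcript. Going by the description in the earlier slides (an object's score combines its Web popularity with popularity propagated over typed links, each link type weighted by its PPF), one plausible form is R_X = eps * R_EX + (1 - eps) * sum over Y of gamma_YX * M_YX^T * R_Y. Here R_EX is the Web popularity of the objects of type X, gamma_YX is the PPF for links from type Y to type X, M_YX is the link matrix with each object's outgoing links of that type sharing its score equally, and eps is the damping factor. This notation is an assumption, not a quote from the slide. A minimal Python sketch of iterating it on a toy graph, with hypothetical names and made-up PPF values:

```python
def poprank(web_pop, links, ppf, eps=0.15, iters=200):
    """Iterate R_X = eps * R_EX + (1 - eps) * sum_Y gamma_YX * M_YX^T * R_Y.

    web_pop: {type: list of Web popularity scores R_EX}
    links:   {(src_type, dst_type): [(src_index, dst_index), ...]}
    ppf:     {(src_type, dst_type): gamma}, assumed to sum to 1 per dst_type
    """
    # Count, per link type, how many links leave each source object,
    # so that an object splits its score evenly over those links.
    out_count = {}
    for key, edges in links.items():
        counts = {}
        for src, _ in edges:
            counts[src] = counts.get(src, 0) + 1
        out_count[key] = counts

    ranks = {t: list(v) for t, v in web_pop.items()}
    for _ in range(iters):
        # Start every object from its damped Web popularity.
        new = {t: [eps * w for w in v] for t, v in web_pop.items()}
        for (y, x), edges in links.items():
            gamma = ppf[(y, x)]
            for src, dst in edges:
                share = ranks[y][src] / out_count[(y, x)][src]
                new[x][dst] += (1 - eps) * gamma * share
        ranks = new
    return ranks


# Toy data (illustrative only): 3 papers, 2 authors.
web_pop = {"paper": [1 / 3, 1 / 3, 1 / 3], "author": [0.5, 0.5]}
links = {
    ("paper", "paper"): [(0, 2), (1, 2)],            # citations
    ("author", "paper"): [(0, 0), (0, 2), (1, 1)],   # authored
    ("paper", "author"): [(0, 0), (1, 1), (2, 0)],   # written by
}
ppf = {("paper", "paper"): 0.7, ("author", "paper"): 0.3, ("paper", "author"): 1.0}
scores = poprank(web_pop, links, ppf)
```

In this toy graph the twice-cited paper (index 2) ends up with the highest paper score, because citation links carry the larger PPF.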
-
Estimating the PPF
The Popularity Propagation Factor (PPF) is crucial in calculating the Web Popularity.
It is not practical to manually decide these factors.
However, it's easy for the system designer to collect information from domain experts about the popularity orders of some subsets of objects, which are called partial ranking lists.
Goal: Find a good assignment by exploring only a small portion of the search space.
Two components to this process:
Finding the PPF: SAFA (Simulated Annealing for Factor Assignment)
Evaluating the quality of that PPF: Diameter Estimator
-
SAFA (Simulated Annealing for Factor Assignment)
Simulated Annealing:
SAFA Algorithm:
Goal: Find a good assignment of PPF.
Keep examining the neighbors of the current best (or chosen) combination of PPF factors.
If a neighbor is better, it will be chosen as the best combination.
May deliberately choose a worse combination occasionally to avoid being trapped in a local optimum.
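The loop described above can be sketched as a generic simulated annealing search. This is a minimal sketch, not the paper's SAFA code: the neighbor move, the cooling schedule, the parameter defaults, and the constraint that the factors sum to 1 are all assumptions, and `evaluate` is a hypothetical callback that returns a cost where lower is better (for example, a ranking distance against the expert lists).

```python
import math
import random


def neighbor(factors, step_size, rng):
    """Perturb one factor, keep it positive, and renormalize to sum to 1."""
    cand = factors[:]
    i = rng.randrange(len(cand))
    cand[i] = max(1e-9, cand[i] + rng.uniform(-step_size, step_size))
    total = sum(cand)
    return [f / total for f in cand]


def safa(evaluate, n_factors, steps=500, step_size=0.05,
         t0=0.1, cooling=0.99, seed=0):
    """Simulated annealing over a vector of PPF values (lower cost is better)."""
    rng = random.Random(seed)
    current = [1.0 / n_factors] * n_factors   # start from uniform factors
    current_cost = evaluate(current)
    best, best_cost = current[:], current_cost
    temp = t0
    for _ in range(steps):
        cand = neighbor(current, step_size, rng)
        cost = evaluate(cand)
        delta = cost - current_cost
        # Always accept improvements; accept a worse candidate with a
        # probability that shrinks as the temperature drops.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current, current_cost = cand, cost
            if cost < best_cost:
                best, best_cost = cand[:], cost
        temp *= cooling
    return best, best_cost


# Toy cost (illustrative only): squared distance to a made-up target.
target = [0.6, 0.3, 0.1]


def toy_cost(factors):
    return sum((a - b) ** 2 for a, b in zip(factors, target))


best, best_cost = safa(toy_cost, n_factors=3)
```

In the setting of the slides, `evaluate` would run the ranking computation with the candidate factors and compare the result with the partial ranking lists, which is why a cheap evaluation matters.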
-
Evaluation of PPF
It would take hours (or days) to estimate the quality of a new PPF factor for a large object relationship graph.
Use a sub-graph of the entire object relationship graph:
The sub-graph consists of a set of concentric circles with the training objects in the center as the core.
Objects that are far away from the core don't need to be considered because of the damping factor and the reduced effect when the distance is larger.
The larger the diameter, the more accurate the evaluation but the longer it would take.
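The concentric circles around the core can be collected with a breadth-first search that stops after k hops. The sketch below is an illustration under assumptions: the function name is hypothetical, k is counted as link hops from the nearest training object, and the caller decides which link direction the `neighbors` map represents.

```python
from collections import deque


def k_hop_subgraph(neighbors, core, k):
    """Return {object: hop distance} for objects within k links of the core.

    neighbors: {object: iterable of linked objects}
    core:      iterable of training objects (distance 0)
    """
    dist = {obj: 0 for obj in core}
    queue = deque(dist)
    while queue:
        node = queue.popleft()
        if dist[node] == k:
            continue  # do not expand past the outermost circle
        for nxt in neighbors.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


# Toy chain a - b - c - d with "a" as the only training object.
chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
circles = k_hop_subgraph(chain, ["a"], k=2)
```

With k=2 the object "d" is left out, which mirrors the point that distant objects contribute little once the damping factor is applied.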
-
Diameter Expansion
Diameter Expansion Algorithm:
Goal: Find an optimal diameter.
First use the entire graph and uniform PPF factors to compute the ranking of all the training objects.
Then, compute the ranking of all the training objects using the k-diameter sub-graph.
If the difference between the new ranking and the ranking using the entire graph is smaller than the stopping threshold, the k-diameter sub-graph is considered to be large enough.
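The expansion steps above can be sketched as a small loop. Everything named here is an assumption for illustration: `rank_on_subgraph` is a hypothetical callback that ranks the training objects on the k-diameter sub-graph, `distance` is any ranking distance, and the growth of k by one per step and the `max_k` cap are choices of this sketch.

```python
def expand_diameter(rank_on_subgraph, full_ranking, distance,
                    threshold, max_k=10):
    """Return the smallest k whose sub-graph ranking is close enough
    to the ranking computed on the entire graph."""
    for k in range(1, max_k + 1):
        if distance(rank_on_subgraph(k), full_ranking) < threshold:
            return k
    return max_k  # give up expanding and use the largest allowed diameter


def mismatch_fraction(ranking, reference):
    """Toy distance: share of positions where the two lists disagree."""
    diffs = sum(1 for a, b in zip(ranking, reference) if a != b)
    return diffs / len(reference)


# Toy rankings (illustrative only): the sub-graph ranking converges at k=3.
full = ["a", "b", "c", "d"]
by_k = {1: ["b", "a", "d", "c"], 2: ["a", "b", "d", "c"], 3: ["a", "b", "c", "d"]}
chosen_k = expand_diameter(lambda k: by_k[min(k, 3)], full,
                           mismatch_fraction, threshold=0.25)
```

The full-graph ranking is computed once with uniform factors, so the expensive step is paid only during this calibration and not for every candidate PPF assignment.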
-
Experiments
The PopRank model and the PPF estimation algorithms proposed in the paper are fully implemented and evaluated in the context of Libra.
They collected 14 partial ranking lists:
8 partial ranking lists (3 for papers, 2 for authors, 2 for conferences, and 1 for journals) for 45 objects used as training set.
6 ranking lists (3 for papers, 1 for authors, 1 for conferences, and 1 for journals) for 22 objects used as test set.
Example partial ranking list:
About the ranked conferences in the database community: 1. SIGMOD, 2. VLDB, 3. ICDE, 4. EDBT, 5. ICDT, 6. ER, 7. DEXA, 8. WIDM.
Provided by the researchers within the Microsoft Research lab.
-
Evaluation Metrics
To measure the quality of the ranking results, the distance between two ranking lists of the same set of objects is computed. They are comparing the produced ranking of a test set vs the known ranking.
This evaluation measures both the number of mismatches between the two lists and also considers the position of these mismatches.
Numerator of the formula is used to measure the real distance of these two rankings.
Denominator of the formula is used to normalize the real distance to a number between 0 and 1.
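The formula itself is not in the transcript, so the sketch below is a stand-in with the same two properties, not the paper's metric: the numerator sums position displacements with heavier weights near the top of the list, and the denominator is an upper bound on that sum so the result stays between 0 and 1. The weights and the choice of bound are assumptions.

```python
def ranking_distance(produced, reference):
    """Position-weighted distance between two rankings of the same objects.

    Returns 0.0 for identical lists; never exceeds 1.0.
    """
    n = len(reference)
    if n < 2:
        return 0.0
    pos = {obj: i for i, obj in enumerate(produced)}
    numerator = 0
    denominator = 0
    for i, obj in enumerate(reference):
        weight = n - i  # mismatches near the top count more
        numerator += weight * abs(i - pos[obj])
        # Largest displacement possible for an object at position i.
        denominator += weight * max(i, n - 1 - i)
    return numerator / denominator


known = ["a", "b", "c", "d"]
top_swap = ranking_distance(["b", "a", "c", "d"], known)
bottom_swap = ranking_distance(["a", "b", "d", "c"], known)
```

Swapping the top two objects gives a larger distance than swapping the bottom two, which is the position sensitivity the slide describes.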
-
Understanding the Shape of the Search Space
Note: The lower the ranking distance, the better.
The shape in Figure 5 is quite predictable. Figure 6 is less predictable, but there are few spikes within a small interval.
Their claim: A heuristic-based search algorithm (SAFA) is suitable for exploring this type of search space.
-
Stop Thresholds
The smaller the stopping threshold, the better the quality at the beginning and the longer the learning time.
The quality of the ranking results cannot be improved further for thresholds less than 0.01.
-
Subgraph Diameters
In Figure 9, two different training sets were studied (45 objects vs 8 objects).
The learning time is greatly reduced when the size of the sub-graph is reduced.
-
Simulated Annealing
They found that the algorithm took around 300-400 iterations to find its best PPF assignment.
This may be only a local optimum; however, when they tried another 1100 iterations, they found no further improvement.
-
PopRank versus PageRank
In order to show that their PopRank model works well even with a small number of training objects, they used the previous 22 test objects as training objects here, and 6 subsets of the previous 45 training objects were selected as the test datasets.
Their claim: On average, their ranking accuracy increases about 50%.
-
Number of Training Objects
As the number of training objects is increased, the ranking accuracy continues to improve.
-
My Conclusions
Their contributions:
Attempt to take advantage of heterogeneity in object relationship links.
Automated learning-based approach to parameter tuning.
Issues:
Similar to other techniques that rely on finely tuning multiple parameters, finding the perfect parameter assignment involves a fair amount of guesswork.
The entire search space of parameters is too large, and they end up having to make guesses on the best value for each parameter while making strong assumptions about others.
Why did they switch their training and test sets for their only comparison with PageRank?
The dataset seems too small to draw solid conclusions from (45 training objects and 22 test objects).