formula for microsoft poprank

7/29/2019 Formula for Microsoft Poprank

1/19

Object-Level Ranking: Bringing Order to Web Objects

Zaiqing Nie, Yuanzhi Zhang, Ji-Rong Wen, Wei-Ying Ma

Microsoft Research Asia, Peking University

WWW 2005

Presented by Jason Yap


2/19

Problem Definition

Existing Web search engines generally treat a whole Web page as the unit of retrieval and consumption when ranking pages (e.g., PageRank).

However, there are various kinds of objects embedded in static Web pages or Web databases. Example objects: products, people, papers, organizations, etc.

These objects have associative links among them, which can be used to generate their own Web graph.


3/19

How is this different/new?

PageRank and HITS assume that all links carry the same "endorsement" semantics and are equally important.

For link analysis, the most distinctive characteristic of the object graph is the heterogeneity of its links.

For example, the popularity of a paper should not be affected too much by the number of authors, but the number of citations does have a large impact on it.


4/19

PopRank

A method that considers both the Web popularity of an object and the object relationships to calculate the popularity score of the Web object. Web popularity is similar to PageRank.

PopRank extends the PageRank model by adding a popularity propagation factor (PPF) to each link pointing to an object. Idea behind the PPF: weight backlinks differently depending on how much they should contribute to the popularity of an object.

Problem: How do we determine the weighting? Incorrect weighting will lead to an improper ranking of objects.

Extremely challenging to assign these manually.

They propose a learning-based approach to automatically learn the popularity propagation factors for different types of links, using the partial rankings of the objects given by domain experts.


5/19

Model

Example scenario: a reader wants to get into a new research field and read the related papers.

To get started, he may first use Google or CiteSeer to find several seed objects, which could be related papers, authors, or conferences/journals.

After that, he most likely just follows the object relationship links to locate more papers. He may want to read the papers cited by the papers he has already read, or read papers from his favorite authors, conferences, or journals.

Clearly, a paper cited by a large number of popular papers could be popular, and a recent paper published in a prestigious conference with few citations could also be popular.

Random object finder model

Similar to the random walk in PageRank. A user starts his random walk on the Web, and he will start following the object relationship links once he finds the first object on the Web.

Eventually, he gets bored and restarts his random walk on the Web to find another seed object.


6/19

Formula to compute PopRank
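The slide's formula is not reproduced here. As a rough sketch of the idea described on the surrounding slides (a PageRank-style iteration in which an object's Web popularity plays the role of the restart term and every link is weighted by the PPF of its type), one could write something like the following. The function name `poprank`, the damping value `eps`, the link-type names, and the equal split of a type's factor across a source's links of that type are all assumptions made for illustration, not the paper's exact definition.

```python
from collections import defaultdict

def poprank(links, web_pop, ppf, eps=0.15, iters=100):
    """Hypothetical PPF-weighted popularity iteration.

    links:   list of (src, dst, link_type) tuples
    web_pop: dict mapping every object to its Web popularity
    ppf:     dict mapping link_type to its propagation factor
    """
    # Count the links each source emits per link type, so that each
    # link carries an equal share of that type's propagation factor.
    out_count = defaultdict(int)
    for src, _dst, ltype in links:
        out_count[(src, ltype)] += 1

    scores = dict(web_pop)
    for _ in range(iters):
        # Restart term: the random object finder lands on an object
        # in proportion to its Web popularity.
        nxt = {obj: eps * web_pop[obj] for obj in web_pop}
        # Propagation term: popularity flows along typed links.
        for src, dst, ltype in links:
            share = ppf[ltype] / out_count[(src, ltype)]
            nxt[dst] += (1 - eps) * share * scores[src]
        scores = nxt
    return scores

# Toy graph: two papers cite p2, and one author wrote p1 and p3.
links = [("p1", "p2", "cites"), ("p3", "p2", "cites"),
         ("a1", "p1", "authored"), ("a1", "p3", "authored")]
web_pop = {obj: 0.25 for obj in ("p1", "p2", "p3", "a1")}
scores = poprank(links, web_pop, {"cites": 1.0, "authored": 0.2})
```

With a small factor on the hypothetical "authored" link type, the cited paper p2 ends up more popular than p1, which matches the intuition that citations should matter more than authorship counts. The sketch does no normalization, so scores need not sum to 1.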


7/19

Estimating the PPF

The Popularity Propagation Factor (PPF) is crucial in calculating the popularity score of Web objects.

It is not practical to decide these factors manually. However, it is easy for the system designer to collect information from domain experts about the popularity orders of some subsets of objects, which are called partial ranking lists.

Goal: Find a good assignment by exploring only a small portion of the search space.

Two components to this process:

Finding the PPF: SAFA (Simulated Annealing for Factor Assignment)

Evaluating the quality of that PPF: Diameter Estimator


8/19

SAFA (Simulated Annealing for Factor Assignment)

Simulated Annealing:

SAFA Algorithm:

Goal: Find a good assignment of PPFs.

Keep examining the neighbors of the current best (or chosen) combination of PPFs.

If a neighbor is better, it is chosen as the best combination.

May deliberately choose a worse combination occasionally to avoid being trapped in a locally optimal area.
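The steps above follow the usual simulated annealing pattern, which can be sketched as below. This is a generic illustration, not the SAFA procedure from the paper: the neighbor move (nudging one factor), the cooling schedule, the default parameter values, and the name `anneal_factors` are all assumptions. The `cost` argument stands for whatever measures how badly a factor assignment ranks the training objects (lower is better).

```python
import math
import random

def anneal_factors(initial, cost, steps=400, temp=1.0,
                   cooling=0.99, step_size=0.1, seed=0):
    """Generic simulated annealing over a dict of factors in [0, 1]."""
    rng = random.Random(seed)
    current = dict(initial)
    current_cost = cost(current)
    best, best_cost = dict(current), current_cost

    for _ in range(steps):
        # Neighbor: nudge one randomly chosen factor, clamped to [0, 1].
        neighbor = dict(current)
        key = rng.choice(sorted(neighbor))
        moved = neighbor[key] + rng.uniform(-step_size, step_size)
        neighbor[key] = min(1.0, max(0.0, moved))
        neighbor_cost = cost(neighbor)

        # Always accept an improvement; accept a worse neighbor with
        # probability exp(-delta / temp) to escape local optima.
        delta = neighbor_cost - current_cost
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current, current_cost = neighbor, neighbor_cost
            if current_cost < best_cost:
                best, best_cost = dict(current), current_cost

        # Cool down so that worse moves become rarer over time.
        temp = max(temp * cooling, 1e-9)

    return best, best_cost
```

In a PopRank-like setting, `cost` would run the ranking with the candidate factors and return the distance to the experts' partial ranking lists, which is why a cheap evaluation of each candidate matters.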


9/19

Evaluation of PPF

It would take hours (or days) to estimate the quality of a new PPF assignment for a large object relationship graph.

Use a sub-graph of the entire object relationship graph: the sub-graph consists of a set of concentric circles with the training objects in the center as the core.

Objects that are far away from the core don't need to be considered, because the damping factor reduces their effect as the distance grows.

The larger the diameter, the more accurate the evaluation, but the longer it takes.
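One way to picture the concentric circles is as rings of a breadth-first search that starts from the training objects. The sketch below extracts everything within k hops of the core; the function name `k_diameter_subgraph`, the `(src, dst, link_type)` tuple layout, and the choice to ignore link direction when measuring distance are assumptions for illustration.

```python
from collections import deque

def k_diameter_subgraph(links, core, k):
    """Return (objects, links) lying within k hops of the core objects."""
    # Distance is measured on the undirected version of the graph
    # (an assumption; the slide does not say which direction counts).
    neighbors = {}
    for src, dst, _ltype in links:
        neighbors.setdefault(src, set()).add(dst)
        neighbors.setdefault(dst, set()).add(src)

    dist = {obj: 0 for obj in core}
    queue = deque(dist)
    while queue:
        node = queue.popleft()
        if dist[node] == k:
            continue  # do not expand beyond the k-th ring
        for nb in neighbors.get(node, ()):
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)

    # Keep only links whose two endpoints are both inside the sub-graph.
    kept = [(s, d, t) for s, d, t in links if s in dist and d in dist]
    return set(dist), kept
```

Growing k by one adds one more ring of objects, which is the accuracy-versus-time trade-off the slide describes.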


10/19

Diameter Expansion

Diameter Expansion Algorithm:

Goal: Find an optimal diameter.

First, use the entire graph and uniform PPFs to compute the ranking of all the training objects.

Then, compute the ranking of all the training objects using the k-diameter sub-graph.

If the difference between the new ranking and the ranking using the entire graph is smaller than the stopping threshold, the k-diameter sub-graph is considered to be large enough.
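Read as a loop, the procedure presumably increases k until the sub-graph ranking is close enough to the full-graph ranking. The sketch below makes that reading explicit. The three callables, the default threshold, the `max_k` cap, and the name `expand_diameter` are hypothetical placeholders, since the slides do not give these details.

```python
def expand_diameter(rank_on_full, rank_on_subgraph, distance,
                    threshold=0.01, max_k=10):
    """Return the smallest k whose sub-graph ranking is close enough.

    rank_on_full():      ranking of the training objects on the whole graph
    rank_on_subgraph(k): ranking of the same objects on the k-diameter sub-graph
    distance(a, b):      ranking distance, smaller means more similar
    """
    # The expensive full-graph ranking is computed only once.
    reference = rank_on_full()
    for k in range(1, max_k + 1):
        if distance(reference, rank_on_subgraph(k)) < threshold:
            return k  # the k-diameter sub-graph is large enough
    # Give up at the cap if no k met the threshold (an assumption).
    return max_k
```

The one-off cost of ranking on the entire graph is paid so that every later evaluation of a candidate PPF assignment can run on the much smaller sub-graph.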


11/19

Experiments

The PopRank model and the PPF estimation algorithms proposed in the paper are fully implemented and evaluated in the context of Libra.

They collected 14 partial ranking lists:

8 partial ranking lists (3 for papers, 2 for authors, 2 for conferences, and 1 for journals) covering 45 objects, used as the training set.

6 ranking lists (3 for papers, 1 for authors, 1 for conferences, and 1 for journals) covering 22 objects, used as the test set.

Example partial ranking list, ranking conferences in the database community:

1. SIGMOD, 2. VLDB, 3. ICDE, 4. EDBT, 5. ICDT, 6. ER, 7. DEXA, 8. WIDM.

Provided by the researchers within the Microsoft Research lab.


12/19

Evaluation Metrics

To measure the quality of the ranking results, the distance between two ranking lists of the same set of objects is computed. They compare the produced ranking of a test set against the known ranking.

This evaluation measures both the number of mismatches between the two lists and the positions of these mismatches.

The numerator of the formula measures the real distance between the two rankings.

The denominator of the formula normalizes the real distance to a number between 0 and 1.
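The paper's distance formula is not shown on this slide, so the example below uses a well-known stand-in with the same numerator/denominator structure: Spearman's footrule (the sum of each object's displacement between the two lists) divided by its maximum possible value, floor(n^2 / 2). It is not the metric used in the paper, and the function name `footrule_distance` is made up for illustration.

```python
def footrule_distance(produced, reference):
    """Normalized Spearman footrule between two rankings of the same objects."""
    if len(produced) != len(reference) or set(produced) != set(reference):
        raise ValueError("both lists must rank the same set of objects")
    n = len(reference)
    if n < 2:
        return 0.0  # a single object cannot be misordered

    # Numerator: total displacement of every object between the two lists.
    position = {obj: i for i, obj in enumerate(reference)}
    raw = sum(abs(i - position[obj]) for i, obj in enumerate(produced))

    # Denominator: the largest possible footrule value for n objects,
    # which scales the result into the range [0, 1].
    return raw / (n * n // 2)

expert = ["SIGMOD", "VLDB", "ICDE", "EDBT"]
same = footrule_distance(expert, expert)
swapped = footrule_distance(["VLDB", "SIGMOD", "ICDE", "EDBT"], expert)
```

A distance of 0 means the produced ranking matches the expert list exactly, and a fully reversed list gives 1.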


13/19

Understanding the Shape of the Search Space

Note: The lower the ranking distance, the better.

The shape in Figure 5 is quite predictable. The shape in Figure 6 is less predictable, but there are few spikes within a small interval.

Their claim: A heuristic-based search algorithm (SAFA) is suitable for exploring this type of search space.


14/19

Stop Thresholds

The smaller the stopping threshold, the better the quality at the beginning and the longer the learning time.

The quality of the ranking results cannot be improved further for thresholds less than 0.01.


15/19

Subgraph Diameters

In Figure 9, 2 different training sets were studied (45 objects vs. 8 objects).

The learning time is greatly reduced when the size of the sub-graph is reduced.


16/19

Simulated Annealing

Found that the algorithm needed around 300-400 iterations to find the optimal PPF assignment.

It may be a locally optimal solution; however, when they tried another 1100 iterations, they found no further improvement.


17/19

PopRank versus PageRank

In order to show that their PopRank model works well even with a small number of training objects, they used the previous 22 test objects as training objects here, and 6 subsets of the previous 45 training objects were selected as the test datasets.

Their claim: On average, their ranking accuracy increases by about 50%.


18/19

Number of Training Objects

As the number of training objects is increased, the ranking accuracy continues to improve.


19/19

My Conclusions

Their contributions:

An attempt to take advantage of heterogeneity in object relationship links.

An automated learning-based approach to parameter tuning.

Issues:

Similar to other techniques that rely on finely tuning multiple parameters, finding the perfect parameter assignment involves a fair amount of guesswork.

The entire search space of parameters is too large, and they end up having to guess the best value for each parameter while making strong assumptions about the others.

Why did they switch their training and test sets for their only comparison with PageRank?

The dataset seems too small to draw solid conclusions from (45 training objects and 22 test objects).
