formula for microsoft poprank

  • 7/29/2019 Formula for Microsoft Poprank

    1/19

    Object-Level Ranking: Bringing Order to Web Objects

    Zaiqing Nie, Yuanzhi Zhang, Ji-Rong Wen, Wei-Ying Ma

    Microsoft Research Asia, Peking University

    WWW 2005

    Presented by Jason Yap

    Problem Definition

    Existing Web search engines generally treat a whole Web page as the unit for retrieval and consumption, and they rank at the page level (e.g., PageRank).

    However, various kinds of objects are embedded in static Web pages and Web databases. Example objects: products, people, papers, organizations, etc.

    These objects have associative links among them, which can be used to generate their own Web graph.

    How is this different/new?

    PageRank and HITS assume that all links carry the same "endorsement" semantics and are equally important.

    For link analysis, the most distinctive characteristic of the object graph is the heterogeneity of its links.

    For example, the popularity of a paper should not be affected much by its number of authors, but its number of citations does have a large impact.

    PopRank

    A method that considers both the Web popularity of an object and its object relationships to calculate the popularity score of the Web object. Web popularity is similar to PageRank.

    PopRank extends the PageRank model by adding a popularity propagation factor (PPF) to each link pointing to an object.

    Idea behind the PPF: weight backlinks differently depending on how much they should contribute to the popularity of an object.

    Problem: How do we determine the weighting? Incorrect weighting will lead to an improper ranking of objects.

    Extremely challenging to assign these manually.

    They propose a learning-based approach that automatically learns the popularity propagation factors for different types of links from partial rankings of the objects given by domain experts.

    Model

    Example scenario: a reader wants to get into a new research field and read the related papers.

    To get started, he may first use Google or CiteSeer to find several seed objects, which could be related papers, authors, or conferences/journals.

    After that, he most likely just follows the object relationship links to locate more papers. He may want to read the papers cited by the papers he has already read, or read papers by his favorite authors or from his favorite conferences or journals.

    Clearly, a paper cited by a large number of popular papers could be popular, and a recent paper with few citations that was published in a prestigious conference could also be popular.

    Random object finder model

    Similar to the random walk in PageRank.

    A user starts his random walk on the Web, and he starts following the object relationship links once he finds the first object on the Web.

    Eventually, he gets bored and restarts his random walk on the Web to find another seed object.
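The random object finder model can be approximated by a PageRank-style fixed-point iteration. The sketch below is only an illustration under stated assumptions: the function names (`row_normalize`, `poprank`), the toy paper/author graph, the PPF values, and the damping value `eps=0.15` are all hypothetical and are not taken from the slides. It assumes each object type X is updated as eps * (Web popularity of X) + (1 - eps) * sum over source types Y of PPF(Y, X) * M_YX^T * R_Y.

```python
import numpy as np

def row_normalize(adj):
    """Scale each row to sum to 1 (rows with no links stay all-zero)."""
    adj = np.asarray(adj, dtype=float)
    sums = adj.sum(axis=1, keepdims=True)
    return np.divide(adj, sums, out=np.zeros_like(adj), where=sums > 0)

def poprank(web_pop, links, ppf, eps=0.15, iters=200, tol=1e-10):
    """Iterate the assumed PopRank update until the scores stop changing.

    web_pop: {type: Web popularity vector for that type}
    links:   {(src_type, dst_type): row-normalized link matrix, shape (n_src, n_dst)}
    ppf:     {(src_type, dst_type): popularity propagation factor}
    """
    web_pop = {t: np.asarray(v, dtype=float) for t, v in web_pop.items()}
    rank = {t: v.copy() for t, v in web_pop.items()}
    for _ in range(iters):
        new = {}
        for x, r_ex in web_pop.items():
            incoming = np.zeros_like(r_ex)
            for (y, dst), m in links.items():
                if dst == x:
                    # Popularity flowing from type y to type x, weighted by the PPF.
                    incoming += ppf[(y, dst)] * (m.T @ rank[y])
            new[x] = eps * r_ex + (1 - eps) * incoming
        delta = max(np.abs(new[t] - rank[t]).max() for t in rank)
        rank = new
        if delta < tol:
            break
    return rank

# Toy graph (hypothetical): papers p0..p2 and authors a0..a1.
web_pop = {"paper": np.full(3, 1 / 3), "author": np.full(2, 1 / 2)}
links = {
    # p1 cites p0; p2 cites p0 and p1.
    ("paper", "paper"): row_normalize([[0, 0, 0], [1, 0, 0], [1, 1, 0]]),
    # a0 wrote p0 and p1; a1 wrote p2.
    ("author", "paper"): row_normalize([[1, 1, 0], [0, 0, 1]]),
    ("paper", "author"): row_normalize([[1, 0], [1, 0], [0, 1]]),
}
# Citations are weighted far more heavily than authorship links.
ppf = {("paper", "paper"): 0.8, ("author", "paper"): 0.2, ("paper", "author"): 1.0}
ranks = poprank(web_pop, links, ppf)
print(ranks["paper"])
```

With these made-up weights the most-cited paper ends up with the highest score, which mirrors the intuition that citation links should count for more than authorship links.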

    Formula to compute PopRank
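As a hedged reference, the PopRank update as recalled from the WWW 2005 paper can be written as below. The notation is an assumption as far as this transcript is concerned and should be checked against the paper before being relied on.

```latex
R_X = \varepsilon \, R_{EX} + (1 - \varepsilon) \sum_{\forall Y} \gamma_{YX} \, M_{YX}^{T} \, R_Y
```

Here R_X and R_Y are the popularity vectors of the objects of types X and Y, R_EX is the Web popularity vector of the objects of type X, M_YX is the adjacency matrix of links from objects of type Y to objects of type X (each row normalized by the number of such links leaving the source object), gamma_YX is the popularity propagation factor of the link type from Y to X (with the factors pointing into X summing to 1), and epsilon is the damping factor.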

    Estimating the PPF

    The Popularity Propagation Factor (PPF) is crucial in calculating the Web Popularity.

    It is not practical to decide these factors manually. However, it is easy for the system designer to collect information from domain experts about the popularity order of some subsets of objects; these are called partial ranking lists.

    Goal: Find a good assignment by exploring only a small portion of the search space.

    Two components to this process:

    Finding the PPF: SAFA (Simulated Annealing for Factor Assignment)

    Evaluating the quality of that PPF: Diameter Estimator

    SAFA (Simulated Annealing for Factor Assignment)

    Simulated Annealing:

    SAFA Algorithm:

    Goal: Find a good assignment of PPFs.

    Keep examining the neighbors of the current best (or chosen) combination of PPFs.

    If a neighbor is better, it is chosen as the best combination.

    May deliberately choose a worse combination occasionally to avoid being trapped in a local optimum.
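The search loop described above follows the usual simulated annealing pattern. The sketch below is a generic version, not the paper's exact SAFA procedure: the names `safa`, `perturb`, and `toy_cost`, the cooling schedule, and the step size are assumptions. The toy cost (squared distance to a hidden target assignment) is a stand-in for the real cost, which would be the ranking distance between the PopRank output and the experts' partial ranking lists.

```python
import math
import random

def perturb(factors, rng, step=0.1):
    """Neighbor: nudge one factor, then renormalize so the factors sum to 1."""
    out = list(factors)
    i = rng.randrange(len(out))
    out[i] = max(1e-6, out[i] + rng.uniform(-step, step))
    total = sum(out)
    return [f / total for f in out]

def safa(initial, cost, neighbor, t0=1.0, cooling=0.95, steps=500, seed=0):
    """Generic simulated annealing over PPF assignments (lower cost is better)."""
    rng = random.Random(seed)
    current, current_cost = initial, cost(initial)
    best, best_cost = current, current_cost
    temp = t0
    for _ in range(steps):
        cand = neighbor(current, rng)
        cand_cost = cost(cand)
        delta = cand_cost - current_cost
        # Always accept improvements; accept a worse neighbor with a
        # probability that shrinks as the temperature drops.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current, current_cost = cand, cand_cost
            if current_cost < best_cost:
                best, best_cost = current, current_cost
        temp = max(temp * cooling, 1e-9)
    return best, best_cost

# Hypothetical stand-in cost: distance to a hidden "ideal" assignment.
TARGET = [0.7, 0.2, 0.1]

def toy_cost(factors):
    return sum((f - t) ** 2 for f, t in zip(factors, TARGET))

start = [1 / 3, 1 / 3, 1 / 3]
best, best_cost = safa(start, toy_cost, perturb)
print(best, best_cost)
```

Each call to `cost` corresponds to one full PopRank evaluation in the real setting, which is why making that evaluation cheap matters so much for the overall learning time.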

    Evaluation of PPF

    It would take hours (or days) to estimate the quality of a new PPF assignment on a large object relationship graph.

    Use a sub-graph of the entire object relationship graph:

    The sub-graph consists of a set of concentric circles with the training objects in the center as the core.

    Objects that are far away from the core don't need to be considered, because the damping factor reduces their effect as the distance grows.

    The larger the diameter, the more accurate the evaluation, but the longer it takes.

    Diameter Expansion

    Diameter Expansion Algorithm:

    Goal: Find an optimal diameter.

    First, use the entire graph and uniform PPFs to compute the ranking of all the training objects.

    Then, compute the ranking of all the training objects using the k-diameter sub-graph.

    If the difference between the new ranking and the ranking using the entire graph is smaller than the stopping threshold, the k-diameter sub-graph is considered to be large enough. Otherwise, k is increased and the comparison is repeated.
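The expansion loop can be sketched as a breadth-first search that grows the sub-graph one hop at a time around the training objects. Everything named here is hypothetical: `k_hop_subgraph`, `expand_diameter`, the toy ranking function (which orders training objects by how many nodes they can reach inside the sub-graph), and the mismatch-fraction distance are stand-ins for PopRank with uniform PPFs and for the paper's ranking distance.

```python
from collections import deque

def k_hop_subgraph(adj, core, k):
    """Return the set of nodes within k hops of any core (training) object."""
    seen = {n: 0 for n in core}
    queue = deque(core)
    while queue:
        node = queue.popleft()
        if seen[node] == k:
            continue  # do not expand beyond the k-th circle
        for nb in adj.get(node, ()):
            if nb not in seen:
                seen[nb] = seen[node] + 1
                queue.append(nb)
    return set(seen)

def expand_diameter(adj, core, rank_fn, distance_fn, threshold, max_k=10):
    """Smallest k whose sub-graph ranking is within threshold of the full one."""
    full_ranking = rank_fn(set(adj))
    for k in range(1, max_k + 1):
        nodes = k_hop_subgraph(adj, core, k)
        if distance_fn(rank_fn(nodes), full_ranking) < threshold:
            return k
    return max_k

def toy_rank(adj, core, nodes):
    """Stand-in ranking: order core objects by reachable-set size inside nodes."""
    restricted = {n: [m for m in adj[n] if m in nodes] for n in nodes}
    size = {c: len(k_hop_subgraph(restricted, [c], len(nodes))) for c in core}
    return sorted(core, key=lambda c: (-size[c], c))

def mismatch_fraction(a, b):
    """Stand-in distance: fraction of positions where the two rankings differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

# Toy undirected graph: a-b, and c-d with d also linked to e and f.
adj = {"a": ["b"], "b": ["a"], "c": ["d"], "d": ["c", "e", "f"],
       "e": ["d"], "f": ["d"]}
core = ["a", "c"]
chosen_k = expand_diameter(adj, core, lambda nodes: toy_rank(adj, core, nodes),
                           mismatch_fraction, threshold=0.5)
print(chosen_k)
```

In this toy graph the 1-hop sub-graph misses the nodes that make `c` more prominent than `a`, so the loop has to expand once more before the sub-graph ranking agrees with the full-graph ranking.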

    Experiments

    The PopRank model and the PPF estimation algorithms proposed in the paper are fully implemented and evaluated in the context of Libra.

    They collected 14 partial ranking lists:

    8 partial ranking lists (3 for papers, 2 for authors, 2 for conferences, and 1 for journals) covering 45 objects, used as the training set.

    6 ranking lists (3 for papers, 1 for authors, 1 for conferences, and 1 for journals) covering 22 objects, used as the test set.

    Example partial ranking list, for the ranked conferences in the database community:

    1. SIGMOD, 2. VLDB, 3. ICDE, 4. EDBT, 5. ICDT, 6. ER, 7. DEXA, 8. WIDM.

    Provided by the researchers within the Microsoft Research lab.

    Evaluation Metrics

    To measure the quality of the ranking results, the distance between two ranking lists of the same set of objects is computed. The ranking produced for a test set is compared against the known ranking.

    This measure counts the mismatches between the two lists and also takes the positions of those mismatches into account.

    The numerator of the formula measures the real distance between the two rankings.

    The denominator of the formula normalizes the real distance to a number between 0 and 1.
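To make the numerator/denominator structure concrete, here is a sketch that uses the normalized Spearman footrule as a stand-in. This is not the paper's formula, which is not reproduced on the slide; it only shares the same shape: a raw positional distance divided by its largest possible value, floor(n^2 / 2), so the result lies between 0 and 1. The function name `ranking_distance` is hypothetical.

```python
def ranking_distance(produced, reference):
    """Normalized Spearman footrule between two rankings of the same objects.

    Numerator: total displacement of each object between the two lists.
    Denominator: floor(n^2 / 2), the largest displacement total possible.
    """
    if len(produced) != len(reference) or set(produced) != set(reference):
        raise ValueError("both rankings must contain the same objects")
    n = len(reference)
    if n < 2:
        return 0.0
    ref_pos = {obj: i for i, obj in enumerate(reference)}
    raw = sum(abs(i - ref_pos[obj]) for i, obj in enumerate(produced))
    return raw / (n * n // 2)

expert = ["SIGMOD", "VLDB", "ICDE", "EDBT"]
same = ranking_distance(expert, expert)
swapped = ranking_distance(["VLDB", "SIGMOD", "ICDE", "EDBT"], expert)
reversed_dist = ranking_distance(expert[::-1], expert)
print(same, swapped, reversed_dist)
```

A lower value means the produced ranking is closer to the expert ranking, which matches the convention used in the experiments that follow.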

    Understanding the Shape of the Search Space

    Note: The lower the ranking distance, the better.

    The shape in Figure 5 is quite predictable. The shape in Figure 6 is less predictable, but there are few spikes within a small interval.

    Their claim: A heuristic-based search algorithm (SAFA) is suitable for exploring this type of search space.

    Stop Thresholds

    The smaller the stopping threshold, the better the quality at the beginning and the longer the learning time.

    The quality of the ranking results cannot be improved further for thresholds less than 0.01.

    Subgraph Diameters

    In Figure 9, 2 different training sets were studied (45 objects vs. 8 objects).

    The learning time is greatly reduced when the size of the sub-graph is reduced.

    Simulated Annealing

    Found that the algorithm took around 300-400 iterations to find the optimal PPF assignment.

    It may be a local optimum; however, when they ran another 1100 iterations, they found no further improvement.

    PopRank versus PageRank

    In order to show that their PopRank model works well even with a small number of training objects, they used the previous 22 test objects as training objects here, and 6 subsets of the previous 45 training objects were selected as the test datasets.

    Their claim: On average, their ranking accuracy increases by about 50%.

    Number of Training Objects

    As the number of training objects is increased, the ranking accuracy continues to improve.

    My Conclusions

    Their contributions:

    Attempt to take advantage of heterogeneity in object relationship links.

    Automated learning-based approach to parameter tuning.

    Issues:

    Similar to other techniques that rely on finely tuning multiple parameters, finding the perfect parameter assignment involves a fair amount of guesswork.

    The entire search space of parameters is too large, and they end up having to make guesses on the best value for each parameter while making strong assumptions about the others.

    Why did they switch their training and test sets for their only comparison with PageRank?

    The dataset seems too small to draw solid conclusions from (45 training objects and 22 test objects).