web data mining

29
2013 Oases

Upload: oases-ong

Post on 27-Jun-2015


DESCRIPTION

Page Rank and HITS Algorithm introductions

TRANSCRIPT

Page 1: WEB Data Mining

2013 Oases

Page 2: WEB Data Mining

PageRank 7.3

Introduction

PageRank Algorithm

Strengths and Weaknesses

Timed PageRank & Recency Search

Page 3: WEB Data Mining

PageRank 7.3 Introduction

HITS was presented by Jon Kleinberg in January 1998 at the Ninth Annual ACM-SIAM Symposium on Discrete Algorithms.

PageRank was presented by Sergey Brin and Larry Page at the Seventh International World Wide Web Conference (WWW7) in April 1998. Based on the algorithm, they built the search engine Google.

Page 4: WEB Data Mining

PageRank 7.3.1 PageRank Algorithm

PageRank (PR) is a static ranking of Web pages: a value is computed for each page offline and does not depend on the search query.

PageRank is based on the measure of prestige in social networks. The PageRank value of each page can be regarded as its prestige.

Page 5: WEB Data Mining

PageRank 7.3.1 PageRank Algorithm

Concepts:

In-links of page i: the hyperlinks that point to page i from other pages. Usually, hyperlinks from the same site are not considered.

Out-links of page i: the hyperlinks that point from page i to other pages. Usually, links to pages of the same site are not considered.

[Figure: in-links and out-links of a page]

Page 6: WEB Data Mining

PageRank 7.3.1 PageRank Algorithm

PageRank Score (uses G = (V, E) [G = graph, V = pages, E = links]):

$$P(i) = \sum_{(j,i) \in E} \frac{P(j)}{O_j}$$

※ O_j is the number of out-links of page j.
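As a rough illustration of the score definition, the sketch below applies it once to a small graph. It assumes the usual reading that the score of page i is the sum of P(j)/O_j over all pages j that link to i. The 4-page edge list and the function names are hypothetical and do not come from the slides.

```python
# Hypothetical 4-page graph as (source, target) links; not from the slides.
edges = [(1, 2), (1, 3), (2, 3), (3, 1), (4, 3)]
pages = [1, 2, 3, 4]


def out_degree(edges, pages):
    """Return O_j, the number of out-links of each page j."""
    counts = {p: 0 for p in pages}
    for src, _ in edges:
        counts[src] += 1
    return counts


def pagerank_step(scores, edges, pages):
    """Apply P(i) = sum over links (j, i) of P(j) / O_j exactly once."""
    o = out_degree(edges, pages)
    new_scores = {p: 0.0 for p in pages}
    for src, dst in edges:
        new_scores[dst] += scores[src] / o[src]
    return new_scores


# Start from a uniform score and apply the definition one time.
start = {p: 1.0 / len(pages) for p in pages}
after_one_step = pagerank_step(start, edges, pages)
```

In this toy graph page 3 collects score from pages 1, 2 and 4, while page 4 has no in-links and drops to zero after one step, which hints at why the plain definition needs further adjustment.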

Page 7: WEB Data Mining

PageRank 7.3.1 PageRank Algorithm

The score equation alone does not quite suffice.

Based on the Markov chain:

※ A_ij(1) is the probability of going from i to j in 1 transition (a random event).
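A minimal sketch of the Markov chain view follows. It assumes the common choice A_ij = 1/O_i when page i links to page j and 0 otherwise, which the slide text does not spell out. The graph and helper names are hypothetical, pages are numbered from 0, and duplicate links are assumed absent.

```python
def transition_matrix(n, edges):
    """Build A with A[i][j] = 1/O_i if page i links to page j, else 0."""
    out = [0] * n
    for i, _ in edges:
        out[i] += 1
    a = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        a[i][j] = 1.0 / out[i]
    return a


def is_stochastic(a, tol=1e-12):
    """A Markov transition matrix needs every row to sum to 1."""
    return all(abs(sum(row) - 1.0) <= tol for row in a)


# Hypothetical graph: page 3 has no out-links, so its row is all zeros.
edges = [(0, 1), (0, 2), (1, 2), (2, 0)]
A = transition_matrix(4, edges)
stochastic = is_stochastic(A)
```

Here `stochastic` is False because the row of the page without out-links sums to 0, so the matrix is not yet a valid transition matrix.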

Page 8: WEB Data Mining

PageRank 7.3.1 PageRank Algorithm

※ Adding a link from page 5 to every page.
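The fix named on this slide can be sketched as below, assuming it is applied because page 5 has no out-links of its own (the slide's matrices were not recovered, so this is a guess at the setting). The 5-page matrix and the function name are hypothetical.

```python
def fix_dangling_rows(a):
    """Replace every all-zero row with uniform links to all n pages."""
    n = len(a)
    return [row[:] if any(row) else [1.0 / n] * n for row in a]


# Hypothetical transition matrix; the last row (page 5) has no out-links.
A = [
    [0.0, 0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0, 0.0],
    [0.5, 0.0, 0.0, 0.5, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
]
fixed = fix_dangling_rows(A)
```

After the fix every row sums to 1, and rows that already had out-links are left unchanged.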

Page 9: WEB Data Mining

PageRank 7.3.1 PageRank Algorithm

Ex2:

Page 10: WEB Data Mining

PageRank 7.3.1 PageRank Algorithm

Ex3:

The random surfer has two options:
1. With probability d, he randomly chooses an out-link to follow.
2. With probability 1-d, he jumps to a random page without following a link.
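The two options of the random surfer can be turned into a power iteration, sketched below. This version assumes the normalized form in which the jump contributes (1-d)/n to every page so that the scores sum to 1 (some presentations use 1-d instead), and it assumes pages without out-links are treated as linking to every page. The value d = 0.85, the iteration count, and the example graph are illustrative choices, not values from the slides.

```python
def pagerank(n, edges, d=0.85, iterations=100):
    """Power iteration for the random surfer model on pages 0..n-1."""
    out_links = [[] for _ in range(n)]
    for i, j in edges:
        out_links[i].append(j)

    p = [1.0 / n] * n
    for _ in range(iterations):
        # Option 2: with probability 1-d the surfer jumps to a random page.
        new_p = [(1.0 - d) / n] * n
        for i in range(n):
            # Assumption: a page with no out-links links to every page.
            targets = out_links[i] if out_links[i] else range(n)
            # Option 1: with probability d the surfer follows an out-link.
            share = d * p[i] / len(targets)
            for j in targets:
                new_p[j] += share
        p = new_p
    return p


# Hypothetical graph: pages 0, 1 and 3 all link to page 2; page 2 links to 0.
scores = pagerank(4, [(0, 2), (1, 2), (2, 0), (3, 2)])
best_page = scores.index(max(scores))
```

In this toy graph page 2, which has the most in-links, ends up with the highest score, and page 0 comes second because it is the only page that page 2 links to.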

Page 11: WEB Data Mining

PageRank 7.3.1 PageRank Algorithm

Sol:

Page 12: WEB Data Mining

PageRank 7.3.2 Strengths and Weaknesses

1. The advantage of PageRank is its ability to fight spam. Since it is not easy for a Web page owner to add in-links to his/her page from other important pages, it is not easy to influence PageRank. Nevertheless, there are reported ways to influence PageRank. Recognizing and fighting spam is an important issue in Web search.

Page 13: WEB Data Mining

PageRank 7.3.2 Strengths and Weaknesses

2. Another major advantage of PageRank is that it is a global measure and is query independent.

At query time, only a lookup is needed to find the value, which is then integrated with other strategies to rank the pages. It is thus very efficient at query time.

Page 14: WEB Data Mining

PageRank 7.3.2 Strengths and Weaknesses

1. The main criticism is also the query-independent nature of PageRank. It cannot distinguish between pages that are authoritative in general and pages that are authoritative on the query topic.

Page 15: WEB Data Mining

PageRank 7.3.3 Timed PageRank and Recency Search

The Web is a dynamic environment. It changes constantly. Quality pages in the past may not be quality pages now or in the future.

Many outdated pages and links are not deleted. This causes problems for Web search because such outdated pages may still be ranked high. Thus, search has a temporal dimension.

Page 16: WEB Data Mining

PageRank 7.3.3 Timed PageRank and Recency Search

A time-sensitive ranking algorithm called TS-Rank.

The surfer can take one of two actions:
1. With probability f(ti), he randomly chooses an out-going link to follow.
2. With probability 1-f(ti), he jumps to a random page without following a link.
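The two surfer actions can be sketched as a variant of the usual power iteration in which the follow probability depends on the page. The slides do not define f or ti, so the sketch assumes ti is the age of page i in years and uses a made-up exponential decay for f. Both choices, along with the names `timed_pagerank` and `decay`, are hypothetical and not the actual TS-Rank definition.

```python
def decay(age_years, half_life=2.0):
    """Hypothetical f(t): starts at 0.85 and halves every `half_life` years."""
    return 0.85 * 0.5 ** (age_years / half_life)


def timed_pagerank(edges, ages, f=decay, iterations=100):
    """Surfer follows a link of page i with probability f(ages[i]), else jumps."""
    n = len(ages)
    out_links = [[] for _ in range(n)]
    for i, j in edges:
        out_links[i].append(j)

    p = [1.0 / n] * n
    for _ in range(iterations):
        new_p = [0.0] * n
        for i in range(n):
            # Assumption: a page with no out-links always triggers a jump.
            follow = f(ages[i]) if out_links[i] else 0.0
            jump_share = (1.0 - follow) * p[i] / n
            for j in range(n):
                new_p[j] += jump_share
            for j in out_links[i]:
                new_p[j] += follow * p[i] / len(out_links[i])
        p = new_p
    return p


# Hypothetical graph: pages 0 and 1 are fresh and link to each other;
# pages 2 and 3 are ten years old and link to each other.
scores = timed_pagerank([(0, 1), (1, 0), (2, 3), (3, 2)], ages=[0, 0, 10, 10])
```

With these assumptions the fresh pair ends up with higher scores than the old pair, because links on old pages are rarely followed and most of their score is spread over all pages through random jumps.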

Page 17: WEB Data Mining

PageRank 7.3.3 Timed PageRank and Recency Search

A time-sensitive ranking algorithm called TS-Rank.

Page 18: WEB Data Mining

HITS 7.4

Introduction

HITS Algorithm

Finding Other Eigenvectors

Relationships with Co-Citation and Bibliographic Coupling

Strengths and Weaknesses of HITS

Page 19: WEB Data Mining

HITS 7.4 Introduction

HITS stands for Hypertext Induced Topic Search

Statement: HITS expands the list of relevant pages returned by a search engine and then produces two rankings of the expanded set of pages, an authority ranking and a hub ranking.

Authority: a page with many in-links. A good authority is a page pointed to by many good hubs.

Hub: a page with many out-links. A good hub is a page that points to many good authorities.

Page 20: WEB Data Mining

HITS 7.4 Introduction

Authority: a page with many in-links. A good authority is a page pointed to by many good hubs.

[Figure: hub pages Hub 1, Hub 2, …, Hub N, each listing links (http1, http2, http3, …), pointing to an authority page]

Page 21: WEB Data Mining

HITS 7.4 Introduction

Hub: a page with many out-links. A good hub is a page that points to many good authorities.

[Figure: a hub page listing links (http1, http2, http3, …) that point to authority pages Authority 1, Authority 2, …, Authority N]

Authorities and hubs have a mutual reinforcement relationship.

Page 22: WEB Data Mining

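HITS 7.4.1 HITS Algorithm

Uses G = (V, E) [G = graph, V = pages, E = links].

Compute the authority score a(i) and the hub score h(i) of page i. The mutual reinforcing relationship of the two scores is represented as follows:

$$a(i) = \sum_{(j,i) \in E} h(j), \qquad h(i) = \sum_{(i,j) \in E} a(j)$$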
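The mutual reinforcement can be sketched as one update over an edge list, assuming the standard HITS rules: the authority score of a page is the sum of the hub scores of pages linking to it, and the hub score of a page is the sum of the authority scores of pages it links to. The graph and the function name are hypothetical.

```python
def hits_step(authority, hub, edges):
    """One HITS update: authorities from hubs, then hubs from new authorities."""
    new_authority = {p: 0.0 for p in authority}
    for src, dst in edges:
        new_authority[dst] += hub[src]

    new_hub = {p: 0.0 for p in hub}
    for src, dst in edges:
        new_hub[src] += new_authority[dst]
    return new_authority, new_hub


# Hypothetical graph: pages 1 and 2 link to page 3; page 2 also links to 4.
pages = [1, 2, 3, 4]
edges = [(1, 3), (2, 3), (2, 4)]
ones = {p: 1.0 for p in pages}
authority, hub = hits_step(ones, dict(ones), edges)
```

After one step page 3 has the largest authority score (two hubs point to it) and page 2 has the largest hub score (it points to both authorities). In practice the scores are normalized after each step so that they do not grow without bound.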

Page 23: WEB Data Mining

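HITS 7.4.1 HITS Algorithm

Writing them in matrix form, with L the adjacency matrix of G (L_ij = 1 if page i links to page j, otherwise 0):

$$a = (a(1), a(2), \dots, a(n))^T, \qquad h = (h(1), h(2), \dots, h(n))^T$$

$$a = L^T L a$$

$$h = L L^T h$$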
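The matrix form suggests a simple power iteration, sketched below in plain Python. It assumes L is the adjacency matrix (L[i][j] = 1 if page i links to page j) and alternates a = L^T h and h = L a, which together give a = L^T L a and h = L L^T h. Normalizing the scores to sum to 1 after each step is one common choice and is an assumption here, as are the helper names and the example matrix.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]


def matvec(m, v):
    return [sum(x * y for x, y in zip(row, v)) for row in m]


def normalize(v):
    """Scale scores to sum to 1; leave an all-zero vector unchanged."""
    total = sum(v)
    return [x / total for x in v] if total else v


def hits(L, iterations=50):
    """Alternate a = L^T h and h = L a, normalizing after every step."""
    n = len(L)
    Lt = transpose(L)
    a = [1.0 / n] * n
    h = [1.0 / n] * n
    for _ in range(iterations):
        a = normalize(matvec(Lt, h))
        h = normalize(matvec(L, a))
    return a, h


# Hypothetical graph: page 0 links to pages 2 and 3; page 1 links to page 2.
L = [
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
a, h = hits(L)
```

For this toy graph page 2 gets the highest authority score and page 0 the highest hub score, matching the intuition that good hubs point to good authorities.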

Page 24: WEB Data Mining

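HITS 7.4.1 HITS Algorithm

Ex:

[Figure: directed graph on pages 1, 2, 3, 4]

$$A = \begin{pmatrix} 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 1 \\ 0 & 1 & 0 & 0 \end{pmatrix}$$

Initial scores:

$$a = h = (0.2, 0.2, 0.2, 0.2)^T$$

Sol: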

Page 25: WEB Data Mining

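HITS 7.4.1 HITS Algorithm

$$A = \begin{pmatrix} 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 1 \\ 0 & 1 & 0 & 0 \end{pmatrix}$$

Sol (with L = A):

$$a = L^T L a = \begin{pmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 1 \\ 0 & 1 & 0 & 0 \end{pmatrix} \begin{pmatrix} 0.2 \\ 0.2 \\ 0.2 \\ 0.2 \end{pmatrix} = \begin{pmatrix} 0.2 \\ 0.6 \\ 0.2 \\ 0.4 \end{pmatrix}$$

$$h = L L^T h = \begin{pmatrix} 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 1 \\ 0 & 1 & 0 & 0 \end{pmatrix} \begin{pmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} 0.2 \\ 0.2 \\ 0.2 \\ 0.2 \end{pmatrix} = \begin{pmatrix} 0.2 \\ 0.2 \\ 0.6 \\ 0.4 \end{pmatrix}$$

The top authority is Page 2 (authority score 0.6).

The top hub is Page 3 (hub score 0.6).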
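The worked example can be checked with a few lines of plain Python. The sketch assumes the slide's matrix is the adjacency matrix with rows 0010, 1000, 0101, 0100 (the extracted digits appear mirrored, so this reading is an assumption) and that both score vectors start at 0.2 per page, as on the slide. The helper names are hypothetical.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]


def matvec(m, v):
    return [sum(x * y for x, y in zip(row, v)) for row in m]


# Assumed adjacency matrix: links 1->3, 2->1, 3->2, 3->4, 4->2.
L = [
    [0, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 1],
    [0, 1, 0, 0],
]
start = [0.2, 0.2, 0.2, 0.2]

a = matvec(transpose(L), matvec(L, start))  # a = L^T L a
h = matvec(L, matvec(transpose(L), start))  # h = L L^T h

# Pages are numbered from 1 on the slide, hence the +1.
top_authority = a.index(max(a)) + 1
top_hub = h.index(max(h)) + 1
```

Under this reading the authority scores come out as (0.2, 0.6, 0.2, 0.4) and the hub scores as (0.2, 0.2, 0.6, 0.4), up to floating-point rounding, so the largest authority score falls on page 2 and the largest hub score on page 3. That agrees with the graph: page 2 has two in-links and page 3 has two out-links.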

Page 26: WEB Data Mining

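HITS 7.4.2 Finding Other Eigenvectors

The HITS algorithm finds the principal eigenvectors, which represent the most densely connected authorities and hubs in G. Other eigenvectors can reveal further densely linked collections of hubs and authorities. Each such collection could potentially be relevant to the query topic, but the collections could be well separated from one another in the graph G for a variety of reasons. For example,

1. The query string may represent a topic that arises as a term in multiple communities, e.g. “classification”.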

2. The query string may refer to a highly polarized issue, involving groups that are not likely to link to one another, e.g. “abortion”.

Page 27: WEB Data Mining

HITS 7.4.3 Relationships with Co-Citation and Bibliographic Coupling

An authority page is like an influential research paper (publication) which is cited by many subsequent papers. A hub page is like a survey paper which cites many other papers (including those influential papers).
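The link to bibliometrics can be made concrete with the two matrix products that drive HITS. Assuming L is the adjacency matrix (L[i][j] = 1 if page i links, or paper i cites, j), the product L^T L counts co-citations and L L^T counts bibliographic couplings. The sketch below uses a hypothetical 4-page matrix and hypothetical helper names.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


# Hypothetical graph: pages 0 and 1 each link to pages 2 and 3; page 2 links to 3.
L = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]

# cocitation[i][j]: number of pages that link to both i and j.
cocitation = matmul(transpose(L), L)

# coupling[i][j]: number of pages that both i and j link to.
coupling = matmul(L, transpose(L))
```

Here pages 2 and 3 are co-cited by pages 0 and 1, so `cocitation[2][3]` is 2, and pages 0 and 1 share two outgoing targets, so `coupling[0][1]` is 2. The diagonal entries are the in-degrees and out-degrees respectively.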

Page 28: WEB Data Mining

HITS 7.4.4 Strengths and Weaknesses of HITS

The main strength of HITS is its ability to rank pages according to the query topic, which may be able to provide more relevant authority and hub pages.

However, HITS has several disadvantages :

1. HITS does not have the anti-spam capability of PageRank.

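2. HITS suffers from topic drift. Because people add hyperlinks for all kinds of reasons, including favors and spamming, the expanded set can contain pages that are unrelated to the query topic.

3. The query-time evaluation is also a major drawback. Expanding the set of pages and then performing the eigenvector computation are time-consuming operations.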

Page 29: WEB Data Mining

END