data.mining.c.8(ii).web mining 570802461
TRANSCRIPT
1
Chapter 8.
Mining Complex Types of Data (II)
--Web Mining--
2
Chapter 8. Mining Complex Types of Data (II)
• Introduction to Web mining
• Web Structure Analysis
• PageRank
• HITS Approach
• Summary
3
Mining the World-Wide Web
• The WWW is a huge, widely distributed, global information service centre for
– Information services: news, advertisements, consumer information, financial management, education, government, e-commerce, etc.
– Rich and dynamic hyperlink information
– Web page access and usage information
• The WWW provides rich sources for data/text mining
• Challenges
– Too huge for effective data/text warehousing and mining
– Too complex and heterogeneous: no standards and structure
4
Web Mining: A Challenging Task
• Research areas
– Web access patterns
– Web structures and regularity
– Web contents
• Problems
– The “abundance” problem
– Limited coverage of the Web: hidden Web sources, majority of data in DBMS
– Limited query interface based on keyword-oriented search
– Etc.
5
Web Mining Taxonomy
• Web Mining
– Web Structure Mining
– Web Content Mining: Web Page Content Mining; Search Result Mining
– Web Usage Mining: General Access Pattern Tracking; Customized Usage Tracking
6
Mining the World-Wide Web
Web Content Mining: Web Page Content Mining
• WebLog (Lakshmanan et al. 1996), WebOQL (Mendelzon et al. 1998), …: can identify information within given Web pages
• Ahoy! (Etzioni et al. 1997): uses heuristics to distinguish personal home pages from other Web pages
• ShopBot (Etzioni et al. 1997): looks for product prices within Web pages
7
Mining the World-Wide Web
Web Content Mining: Search Result Mining
• Search engine result summarization
• Clustering search results (Leouski and Croft, 1996; Zamir and Etzioni, 1997): categorizes documents using phrases in titles
8
Mining the World-Wide Web
Web Usage Mining: General Access Pattern Tracking
• Web log mining (Zaïane, Xin and Han, 1998)
– Uses data mining techniques to understand general access patterns and trends.
– Can shed light on better structure and grouping of resource providers.
9
Mining the World-Wide Web
Web Usage Mining: Customized Usage Tracking
• Adaptive sites (Perkowitz and Etzioni, 1997)
– Analyses the access patterns of one user at a time.
– The Web site restructures itself automatically by learning from user access patterns.
10
Mining the World-Wide Web
Web Structure Mining
• Using links: HITS (Kleinberg, 1998) and PageRank (Sergey Brin and Larry Page, 1998)
• The large amount of Web linkage information provides rich information about the relevance, the quality and the structure of the Web’s content.
• Use interconnections between Web pages to give weight to pages.
12
Introduction
• Early search engines mainly compared the similarity of the query and the indexed pages, i.e.,
– they used information retrieval methods such as cosine similarity.
• From 1996, it became clear that similarity alone was no longer sufficient.
– The number of pages grew rapidly in the mid-to-late 1990s.
– For a single query, Google may estimate some 10 million relevant pages.
– How to choose only 30-40 pages and rank them suitably to present to the user?
13
Web Structure Analysis
• Starting around 1996, researchers began to work on the problem. They resorted to hyperlinks.
• Web pages are connected through hyperlinks, which carry important information.
– Some hyperlinks organize information within the same site.
– Other hyperlinks point to pages on other Web sites. Such out-going hyperlinks often indicate an implicit conveyance of authority to the pages being pointed to.
• Pages that are pointed to by many other pages are likely to contain authoritative information.
14
Web Structure Analysis
• During 1997-1998, the two most influential hyperlink-based search algorithms, PageRank and HITS, were reported.
• Both algorithms exploit the hyperlinks of the Web to rank pages.
– PageRank: Sergey Brin and Larry Page, PhD students at Stanford University, at the Seventh International World Wide Web Conference (WWW) in April 1998.
– HITS: Jon Kleinberg (Cornell University), at the Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, January 1998.
16
Background: Social Network Analysis
• Social network analysis: the study of social entities (e.g., people in an organization), called actors, and their interactions/relationships.
• Interactions/relationships: represented by a network or graph,
– each vertex (or node): an actor
– each link: a relationship.
• From the network, we can study
– properties of its structure
– each actor's role, position and prestige
• Communities: various kinds of sub-graphs, formed by groups of actors.
17
Social Network and the Web
• Web: viewed as a virtual social network
– Each page: actor
– each hyperlink: relationship.
• Results from social network analysis can be adapted and extended for use in the Web context.
• Two types of social network analysis, centrality and prestige, are closely related to hyperlink analysis and search on the Web.
18
Centrality
• An actor with extensive contacts (links) or communications with many other actors in the organization is considered more important than an actor with relatively fewer contacts.
• Central actor: one involved in many links.
19
Measure of Centrality
• Network: viewed as a directed graph
• In-links of actor i: links pointing to i
• Out-links of actor i: links pointing out from i
• The simple degree centrality of actor i:
$$C(i) = \frac{d_{out}(i)}{n-1}$$
where $d_{out}(i)$ is the number of out-links of actor i and n is the total number of actors in the network.
• Dividing by n-1 standardizes the centrality value into the range [0, 1].
20
Prestige
• Prestige: a more refined measure of the prominence of an actor than centrality.
• Prestigious actor: one who is the recipient of extensive ties; prestige is computed using only in-links.
• Difference between centrality and prestige:
– centrality focuses on out-links
– prestige focuses on in-links
21
Measure of Prestige
• In-links of actor i: links pointing to i
• The simple degree prestige of actor i:
$$P(i) = \frac{d_{in}(i)}{n-1}$$
where $d_{in}(i)$ is the number of in-links of actor i and n is the total number of actors in the network.
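The two degree-based measures can be sketched in a few lines of Python. This is only an illustration: the function name `degree_measures`, the edge-list representation and the four-actor network are assumptions made for the example, not part of the slides.

```python
def degree_measures(n, edges):
    """Return (centrality, prestige) lists for actors 0..n-1.

    edges holds (i, j) pairs meaning "actor i links to actor j".
    """
    out_deg = [0] * n
    in_deg = [0] * n
    for i, j in set(edges):  # count each distinct link once
        out_deg[i] += 1
        in_deg[j] += 1
    centrality = [d / (n - 1) for d in out_deg]  # C(i) = d_out(i) / (n - 1)
    prestige = [d / (n - 1) for d in in_deg]     # P(i) = d_in(i) / (n - 1)
    return centrality, prestige


# Hypothetical network: actor 0 links to everyone, everyone links to actor 3.
edges = [(0, 1), (0, 2), (0, 3), (1, 3), (2, 3)]
centrality, prestige = degree_measures(4, edges)
print(centrality)
print(prestige)
```

In this toy network actor 0 has the highest centrality (it links out to all others) while actor 3 has the highest prestige (all others link to it), which shows how the two measures can disagree.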
22
Rank Prestige
• Rank prestige forms the basis of most Web page link analysis algorithms, such as PageRank.
• In the real world, a person i chosen by an important person is more prestigious than one chosen by a less important person.
– For example, a vote from a company CEO counts for much more than a vote from a worker.
• If one’s circle of influence is full of prestigious actors, then one’s own prestige is also high.
– Thus one’s prestige is affected by the ranks or statuses of the involved actors.
23
Measure of Rank Prestige
• Rank prestige PRank(i): a linear combination of the rank prestige of the actors that point to i:
$$PRank(i) = A_{1i}\,PRank(1) + A_{2i}\,PRank(2) + \dots + A_{ni}\,PRank(n)$$
where $A_{ji} = 1$ if j points to i, and 0 otherwise.
• We have n equations for n actors. Writing P for the column vector of all rank prestige values, they become
$$P = A^T P$$
• A: the adjacency matrix of the network (graph), where $A_{ij} = 1$ if i points to j, and 0 otherwise.
24
Intuitive Idea for Rank Prestige
• A hyperlink from a page to another page is an implicit conveyance of authority to the target page.
– The more in-links a page i receives, the more prestige the page i has.
• Pages that point to page i also have their own prestige scores.
– A page of higher prestige pointing to i is more important than a page of lower prestige pointing to i.
– In other words, a page is important if it is pointed to by other important pages.
• This is exactly the idea of rank prestige in social network analysis.
25
PageRank Algorithm
• According to rank prestige, the importance of page i (i’s PageRank score) is obtained by summing the PageRank scores of all pages that point to i, each divided by that page's number of out-links.
• The Web is treated as a directed graph G = (V, E). Let the total number of pages be n. The PageRank score of page i, denoted by P(i), is defined by
$$P(i) = \sum_{(j,i)\in E} \frac{P(j)}{O_j}$$
where $O_j$ is the number of out-links of page j.
26
Matrix Notation
• Let P be an n-dimensional column vector of PageRank values, i.e., $P = (P(1), P(2), \dots, P(n))^T$.
• Let A be the adjacency matrix of our graph with
$$A_{ij} = \begin{cases} 1/O_i & \text{if } (i,j) \in E \\ 0 & \text{otherwise} \end{cases}$$
Here $O_i$ denotes the number of out-links of node i.
• Each transition probability is $1/O_i$ if we assume the Web surfer clicks the hyperlinks on page i uniformly at random.
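The definition of A can be turned into code directly. The sketch below is illustrative only: `transition_matrix`, the edge-list input and the three-page graph are assumptions for the example, and every page is assumed to have at least one out-link.

```python
def transition_matrix(n, edges):
    """Build A with A[i][j] = 1/O_i if page i links to page j, else 0."""
    out_links = [set() for _ in range(n)]
    for i, j in edges:
        out_links[i].add(j)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in out_links[i]:
            A[i][j] = 1.0 / len(out_links[i])  # uniform choice among O_i links
    return A


# Hypothetical graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
A = transition_matrix(3, [(0, 1), (0, 2), (1, 2), (2, 0)])
for row in A:
    print(row)
```

Each row of the resulting matrix sums to 1, so row i can be read as the probability distribution over the surfer's next page when leaving page i.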
27
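Transition Probability Matrix
• Let A be the state transition probability matrix
$$A = \begin{pmatrix} A_{11} & A_{12} & \cdots & A_{1n} \\ A_{21} & A_{22} & \cdots & A_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ A_{n1} & A_{n2} & \cdots & A_{nn} \end{pmatrix}$$
• $A_{ij}$: the transition probability that the surfer in state i (page i) will move to state j (page j).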
28
Let us start…
• Given an initial probability distribution vector that a surfer is at each state (or page),
– $p_0 = (p_0(1), p_0(2), \dots, p_0(n))^T$ (a column vector), and
– an n × n transition probability matrix A,
we have
$$\sum_{i=1}^{n} p_0(i) = 1, \qquad \sum_{j=1}^{n} A_{ij} = 1$$
29
Random Surfer
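• State transition: the probability that the surfer is at page j after one step is
$$p_1(j) = \sum_{i=1}^{n} A_{ij}(1)\, p_0(i)$$
where $A_{ij}(1)$ is the probability of going from i to j after 1 transition. In matrix form we can write
$$P_1 = A^T P_0$$
• In general, the probability distribution after k steps/transitions:
$$P_k = A^T P_{k-1}$$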
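A small sketch of the state transition may help. The helper names `step` and `distribution_after` and the three-page transition matrix are assumptions for illustration; the code simply applies $P_k = A^T P_{k-1}$ repeatedly.

```python
def step(A, p):
    """One transition: p_next[j] = sum over i of A[i][j] * p[i]."""
    n = len(p)
    return [sum(A[i][j] * p[i] for i in range(n)) for j in range(n)]


def distribution_after(A, p0, k):
    """Distribution of the random surfer after k transitions."""
    p = list(p0)
    for _ in range(k):
        p = step(A, p)
    return p


# Hypothetical 3-page transition matrix (each row sums to 1).
A = [[0.0, 0.5, 0.5],
     [0.0, 0.0, 1.0],
     [1.0, 0.0, 0.0]]
p0 = [1 / 3, 1 / 3, 1 / 3]
p1 = distribution_after(A, p0, 1)
print(p1)
```

Because every row of A sums to 1, the entries of the distribution still sum to 1 after each step.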
30
An Example Web Hyperlink Graph
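$$A = \begin{pmatrix} 0 & 1/2 & 1/2 & 0 & 0 & 0 \\ 1/2 & 0 & 1/2 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1/3 & 0 & 1/3 & 1/3 \\ 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1/2 & 1/2 & 0 \end{pmatrix}$$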
31
Improved PageRank
• At a page, the random surfer has two options
– With probability d, he randomly chooses an out-link to follow.
– With probability 1-d, he jumps to a random page.
• Improved model:
$$P = \left( (1-d)\frac{E}{n} + dA^T \right) P$$
where $E = ee^T$ (e is a column vector of all 1’s), and thus E is an n × n square matrix of all 1’s.
32
Follow the Above Example
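$$(1-d)\frac{E}{n} + dA^T = \begin{pmatrix} 1/60 & 7/15 & 1/60 & 1/60 & 1/6 & 1/60 \\ 7/15 & 1/60 & 11/12 & 1/60 & 1/6 & 1/60 \\ 7/15 & 7/15 & 1/60 & 19/60 & 1/6 & 1/60 \\ 1/60 & 1/60 & 1/60 & 1/60 & 1/6 & 7/15 \\ 1/60 & 1/60 & 1/60 & 19/60 & 1/6 & 7/15 \\ 1/60 & 1/60 & 1/60 & 19/60 & 1/6 & 1/60 \end{pmatrix}$$
These entries correspond to d = 0.9, with the all-zero row of page 5 in A replaced by 1/6 in every column so that each row of A sums to 1.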
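The combined matrix can be checked with exact fractions. The sketch below assumes the six-page link matrix of the example with page 5 (which has no out-links) given a uniform row of 1/6, and assumes d = 0.9; the function name `google_matrix` is made up for illustration.

```python
from fractions import Fraction as F


def google_matrix(A, d):
    """Return (1 - d) * E / n + d * A^T for a row-stochastic matrix A."""
    n = len(A)
    # Entry (i, j) is (1 - d)/n + d * A[j][i] because of the transpose.
    return [[(1 - d) / n + d * A[j][i] for j in range(n)] for i in range(n)]


h, t, s = F(1, 2), F(1, 3), F(1, 6)
# Assumed example matrix; row 5 (index 4) is the dangling page made uniform.
A = [[0, h, h, 0, 0, 0],
     [h, 0, h, 0, 0, 0],
     [0, 1, 0, 0, 0, 0],
     [0, 0, t, 0, t, t],
     [s, s, s, s, s, s],
     [0, 0, 0, h, h, 0]]
M = google_matrix(A, F(9, 10))
for row in M:
    print([str(x) for x in row])
```

Every column of the result sums to 1, so multiplying a probability vector by this matrix yields another probability vector.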
33
Final PageRank Algorithm
• Scaling the improved model so that $e^T P = n$ gives
$$P = (1-d)e + dA^T P$$
• The PageRank for each page i is
$$P(i) = (1-d) + d\sum_{j=1}^{n} A_{ji} P(j)$$
34
Final PageRank Algorithm
• This is equivalent to the formula given in the PageRank paper:
$$P(i) = (1-d) + d\sum_{(j,i)\in E} \frac{P(j)}{O_j}$$
• The parameter d is called the damping factor and can be set to a value between 0 and 1. d = 0.85 was used in the PageRank algorithm.
35
Compute PageRank
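• Use the power iteration method:
PageRank-Iterate(G)
  $P_0 \leftarrow e/n$
  $k \leftarrow 0$
  repeat
    $P_{k+1} \leftarrow (1-d)e + dA^T P_k$
    $k \leftarrow k+1$
  until $\lVert P_k - P_{k-1} \rVert_1 < \varepsilon$
  return $P_k$
where ε is a pre-specified threshold.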
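The iteration can be sketched in plain Python as follows. This is a minimal illustration rather than a production implementation: the function name `pagerank`, the stopping threshold `eps`, the iteration cap `max_iter` and the three-page graph are assumptions, and A is assumed to be row-stochastic (no pages without out-links).

```python
def pagerank(A, d=0.85, eps=1e-10, max_iter=1000):
    """Iterate P <- (1 - d) e + d A^T P until the L1 change drops below eps."""
    n = len(A)
    p = [1.0 / n] * n  # P_0 = e / n
    for _ in range(max_iter):
        new_p = [(1 - d) + d * sum(A[j][i] * p[j] for j in range(n))
                 for i in range(n)]
        change = sum(abs(x - y) for x, y in zip(new_p, p))
        p = new_p
        if change < eps:
            break
    return p


# Hypothetical 3-page graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
A = [[0.0, 0.5, 0.5],
     [0.0, 0.0, 1.0],
     [1.0, 0.0, 0.0]]
ranks = pagerank(A)
print(ranks)
```

With this form of the equation the scores sum to n rather than to 1. In the example, page 2 receives links from both other pages and ends up with the highest score, while page 1, which only receives half of page 0's vote, ends up lowest.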
36
Advantages of PageRank
• PageRank is a global measure and is query independent.
– The PageRank values of all pages are computed and saved off-line rather than at query time.
• Criticism: query independence. PageRank cannot distinguish between pages that are authoritative in general and pages that are authoritative on the query topic.
• Nie et al., Topical Link Analysis for Web Search, SIGIR 2006
38
Another Aim: Web Structure Analysis
• Hyperlinks are also useful for finding Web communities.
– A Web community is a cluster of densely linked pages representing a group of people with a special interest.
• Beyond explicit hyperlinks on the Web, links in other contexts are useful too, e.g.,
– for discovering communities of named entities (e.g., people and organizations)
– for analyzing social phenomena in emails.
39
Background: Co-citation and Bibliographic Coupling
• A typical area of research concerned with links is citation analysis of scholarly publications.
– A scholarly publication cites related prior work to acknowledge the origins of some ideas and to compare the new proposal with existing work.
• When a paper cites another paper, a relationship is established between the publications.
• We discuss two types of citation analysis, co-citation and bibliographic coupling. The HITS algorithm is related to these two types of analysis.
40
Co-citation
• If papers i and j are both cited by paper k, then they may be related in some sense to one another.
• The more papers they are co-cited by, the stronger their relationship is.
Fig. Paper i and paper j are co-cited by paper k
41
Co-citation
• Let L be the citation matrix. Each cell of the matrix is defined as follows:
– $L_{ij} = 1$ if paper i cites paper j, and 0 otherwise.
• Co-citation (denoted by $C_{ij}$) is a similarity measure defined as the number of papers that co-cite i and j:
$$C_{ij} = \sum_{k=1}^{n} L_{ki} L_{kj}$$
• A square matrix C can be formed with $C_{ij}$; it is called the co-citation matrix.
42
Bibliographic Coupling
• Bibliographic coupling operates on a similar principle.
• Bibliographic coupling links papers that cite the same articles
– if papers i and j both cite paper k, they may be related.
• The more papers they both cite, the stronger their similarity is.
Fig. Both paper i and paper j cite paper k
43
Bibliographic Coupling
• $B_{ij}$ represents the number of papers that are cited by both paper i and paper j:
$$B_{ij} = \sum_{k=1}^{n} L_{ik} L_{jk}$$
• The bibliographic coupling matrix B, formed with $B_{ij}$, is symmetric and is regarded as a similarity measure of two papers in clustering.
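Both citation measures are simple sums over the citation matrix, as the sketch below illustrates. The function names and the four-paper citation matrix are assumptions chosen for the example.

```python
def cocitation(L):
    """C[i][j] = number of papers k that cite both i and j."""
    n = len(L)
    return [[sum(L[k][i] * L[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def bibliographic_coupling(L):
    """B[i][j] = number of papers k that are cited by both i and j."""
    n = len(L)
    return [[sum(L[i][k] * L[j][k] for k in range(n)) for j in range(n)]
            for i in range(n)]


# Hypothetical citations: papers 0 and 1 each cite papers 2 and 3.
L = [[0, 0, 1, 1],
     [0, 0, 1, 1],
     [0, 0, 0, 0],
     [0, 0, 0, 0]]
C = cocitation(L)
B = bibliographic_coupling(L)
print(C[2][3], B[0][1])
```

Here papers 2 and 3 are co-cited twice, and papers 0 and 1 share two references, so both pairs come out as strongly related under the respective measure.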
44
HITS
• HITS --- Hypertext Induced Topic Search.
• HITS is search-query dependent and can be used for finding Web communities.
• When the user issues a search query,
– HITS first expands the list of relevant pages returned by a search engine and
– then produces two rankings of the expanded set of pages, i.e., an authority ranking and a hub ranking.
45
Authorities and Hubs
• Authority: roughly, an authority is a page with many in-links.
– The idea is that the page may have good or authoritative content on some topic and
– thus many people trust it and link to it.
• Hub: a hub is a page with many out-links.
– The page serves as an organizer of the information on a particular topic and
– points to many good authority pages on the topic.
46
Mining the Web's Link Structures
• Finding authoritative Web pages
– Retrieving pages that are not only relevant, but also of high quality, or authoritative on the topic
• Hyperlinks can be used to infer the notion of authority
– A hyperlink pointing to another Web page can be considered as the author’s endorsement of that page
• Hub pages: Web pages that provide collections of links to authorities
47
Mining the Web's Link Structures
• Mutually reinforcing relationship:
– a good hub is a page that points to many good authorities;
– a good authority is a page that is pointed to by many good hubs.
Fig. Hub pages (yellow) pointing to authority pages (red)
48
Define Authority and Hub Weight for Each Page
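• For the page p: authority weight $a_p$; hub weight $h_p$
$$a_p = \sum_{q \to p} h_q$$
a[p] := sum of h[q], for all q with q → p (in the figure, pages q1, q2, q3 point to page p)
$$h_p = \sum_{p \to q} a_q$$
h[p] := sum of a[q], for all q with p → q (in the figure, page p points to pages q1, q2, q3)
• Better authority (hub) pages have larger a (h) values.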
49
The HITS Algorithm
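• HITS works on the pages in S (the Web space), and assigns every page in S an authority score and a hub score.
• Let the number of pages in S be n.
• We again use G = (V, E) to denote the hyperlink graph of S.
• We use L to denote the adjacency matrix of the graph:
$$L_{ij} = \begin{cases} 1 & \text{if } (d_i, d_j) \in E \\ 0 & \text{otherwise} \end{cases}$$
• Example with four pages d1, d2, d3, d4:
$$L = \begin{pmatrix} 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \end{pmatrix}$$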
50
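The HITS Algorithm
• Let the authority score of page i be $a(d_i)$, and the hub score of page i be $h(d_i)$.
• The mutually reinforcing relationship of the two scores is represented as follows:
$$a(d_i) = \sum_{d_j \in IN(d_i)} h(d_j)$$
$$h(d_i) = \sum_{d_j \in OUT(d_i)} a(d_j)$$
where $IN(d_i)$ is the set of pages that point to $d_i$ and $OUT(d_i)$ is the set of pages that $d_i$ points to.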
51
HITS in Matrix Form
• We use a to denote the column vector with all the authority scores, $a = (a(d_1), a(d_2), \dots, a(d_n))^T$, and
• use h to denote the column vector with all the hub scores, $h = (h(d_1), h(d_2), \dots, h(d_n))^T$.
• Then,
$$a = L^T h$$
$$h = L a$$
52
Computation of HITS
• The authority scores and hub scores are computed using power iteration.
• If we use $a_k$ and $h_k$ to denote the authority and hub vectors at the kth iteration, the iterations for generating the final solutions are
$$a_k = L^T L\, a_{k-1}$$
$$h_k = L L^T h_{k-1}$$
starting with $a_0 = h_0 = (1, 1, \dots, 1)$.
53
Relationships with Co-citation and Bibliographic Coupling
• Recall that the co-citation of pages i and j, denoted by $C_{ij}$, is
$$C_{ij} = \sum_{k=1}^{n} L_{ki} L_{kj} = (L^T L)_{ij}$$
– the authority matrix ($L^T L$) of HITS is the co-citation matrix C
• The bibliographic coupling of two pages i and j, denoted by $B_{ij}$, is
$$B_{ij} = \sum_{k=1}^{n} L_{ik} L_{jk} = (L L^T)_{ij}$$
– the hub matrix ($L L^T$) of HITS is the bibliographic coupling matrix B
54
HITS (Hyperlink-Induced Topic Search)
• Explore interactions between hubs and authoritative pages
• Use a term-index search engine to form the root set
– Many of these pages are presumably relevant to the search topic (query)
– Some of them should contain links to most of the prominent authorities
• Expand the root set into a base set
– all of the pages that the root-set pages link to, and
– all of the pages that link to a page in the root set, up to a designated size cutoff
55
Root Set and Base Set
• Properties of the base set (ideally)
– Relatively small
– Rich in relevant pages
– Contains most (or many) of the strongest authorities
Fig. The root set contained within the base set
56
Step 1 of HITS: Create Base Set from Root Set
Subgraph(σ, ℰ, t, d)
  σ: a query string
  ℰ: a text-based search engine
  t, d: natural numbers // t = 200; d = 50
  Let R denote the top t results of ℰ on σ // R: root set
  Set S := R
  For each page p ∈ R // html_content ← get_url(url)
    Let W(p) denote the set of all pages p points to
    Let V(p) denote the set of all pages pointing to p
    Add all pages in W(p) to S
    If |V(p)| ≤ d, then add all pages in V(p) to S
    Else add an arbitrary set of d pages from V(p) to S
  End
  Return S // S: base set, ca. 1000 – 5000 pages
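The expansion step can be sketched in Python if the link structure is already available in memory. Everything here is a simplifying assumption for illustration: `build_base_set` is a made-up name, the root set is passed in directly instead of being fetched from a search engine, and `out_links` / `in_links` are hypothetical dictionaries mapping a page to the pages it points to and the pages pointing to it.

```python
def build_base_set(root, out_links, in_links, d=50):
    """Expand a root set R into a base set S (Step 1 of HITS)."""
    base = set(root)                        # S := R
    for p in root:
        base.update(out_links.get(p, ()))   # add all pages p points to
        parents = sorted(in_links.get(p, ()))
        # Add every page pointing to p, but at most d of them.
        # "Arbitrary" is implemented here as the first d in sorted order.
        base.update(parents[:d])
    return base


# Hypothetical link data for a two-page root set.
root = ["a", "b"]
out_links = {"a": ["c"], "b": ["c", "d"]}
in_links = {"a": ["e", "f", "g"], "b": []}
base = build_base_set(root, out_links, in_links, d=2)
print(sorted(base))
```

The cutoff d keeps a very popular root page, which may have a huge number of in-links, from dominating the base set.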
57
Step 1 of HITS: Create Base Set from Root Set
For instance,
http://search.yahoo.com/bin/search?p=Data+Mining&ei=UTF-8
http://search.yahoo.com/search?p=Data+Mining&ei=UTF-8&b=21
http://search.yahoo.com/search?p=Data+Mining&ei=UTF-8&b=41
… …
• Two types of links in S:
– transverse: between pages with different domain names;
– intrinsic: between pages with the same domain name
(domain name: the first level of the URL string of a page)
• G: the graph obtained by deleting all intrinsic links from S
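Removing intrinsic links can be sketched with the standard `urllib.parse` module. The function name `transverse_links` and the example URLs are assumptions, and the "domain name" is approximated here by the host part of the URL.

```python
from urllib.parse import urlparse


def transverse_links(links):
    """Keep only links whose source and target have different host names."""
    kept = []
    for source, target in links:
        if urlparse(source).hostname != urlparse(target).hostname:
            kept.append((source, target))  # transverse link: keep
        # otherwise the link is intrinsic (same host) and is dropped
    return kept


# Hypothetical links: the first is intrinsic, the second transverse.
links = [
    ("http://www.example.com/a.html", "http://www.example.com/b.html"),
    ("http://www.example.com/a.html", "http://www.example.org/index.html"),
]
kept = transverse_links(links)
print(kept)
```

Intrinsic links mostly serve navigation within one site, so dropping them keeps only links that are more likely to express an endorsement by a different author.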
58
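The HITS Algorithm
• “Adjacency matrix” of the example graph (pages d1, d2, d3, d4):
$$L = \begin{pmatrix} 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \end{pmatrix}$$
• Initial values: a = h = 1
• Iterate:
$$a(d_i) = \sum_{d_j \in IN(d_i)} h(d_j), \quad \text{i.e.} \quad a = L^T h$$
$$h(d_i) = \sum_{d_j \in OUT(d_i)} a(d_j), \quad \text{i.e.} \quad h = L a$$
so that $a = L^T L a$ and $h = L L^T h$.
• Normalize:
$$\sum_i a(d_i)^2 = \sum_i h(d_i)^2 = 1$$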
59
Step 2 of HITS: Calculate Authority and Hub Weight for Each Page
Iterate(G)
  G: a collection of n linked pages
  $a_0 = h_0 = (1, 1, \dots, 1)$
  k = 0
  Repeat
    k = k+1
    $a_k = L^T L\, a_{k-1}$
    $h_k = L L^T h_{k-1}$
    normalize $a_k$, $h_k$
  Until $a_k$ and $h_k$ do not change significantly
  Return ($a_k$, $h_k$)
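The iteration can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions: the names `hits` and `normalize`, the threshold `eps`, the iteration cap and the four-page graph are all made up for the example, and the vectors are normalized so that their squared entries sum to 1.

```python
import math


def normalize(v):
    """Scale v so that the sum of its squared entries is 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else v


def hits(L, eps=1e-10, max_iter=1000):
    """Power iteration: a_k = L^T L a_{k-1}, h_k = L L^T h_{k-1}."""
    n = len(L)
    a = [1.0] * n
    h = [1.0] * n
    for _ in range(max_iter):
        La = [sum(L[i][j] * a[j] for j in range(n)) for i in range(n)]
        new_a = normalize([sum(L[j][i] * La[j] for j in range(n))
                           for i in range(n)])
        Lth = [sum(L[j][i] * h[j] for j in range(n)) for i in range(n)]
        new_h = normalize([sum(L[i][j] * Lth[j] for j in range(n))
                           for i in range(n)])
        change = (sum(abs(x - y) for x, y in zip(new_a, a))
                  + sum(abs(x - y) for x, y in zip(new_h, h)))
        a, h = new_a, new_h
        if change < eps:
            break
    return a, h


# Hypothetical graph: page 0 links to 2 and 3, page 1 links to 2.
L = [[0, 0, 1, 1],
     [0, 0, 1, 0],
     [0, 0, 0, 0],
     [0, 0, 0, 0]]
a, h = hits(L)
print(a)
print(h)
```

In this toy graph page 2 comes out as the best authority (two in-links) and page 0 as the best hub (it points to both authorities).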
60
Step 3 of HITS: Filter out the top authorities and hubs
Filter(G, c)
  G: a collection of n linked pages
  c: a natural number
  ($a_k$, $h_k$) := Iterate(G)
  Report the pages with the c largest coordinates in $a_k$ as authorities.
  Report the pages with the c largest coordinates in $h_k$ as hubs.
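The filtering step amounts to picking the indices of the c largest scores. A minimal sketch, with the function name `top_c` and the score vectors chosen only for illustration:

```python
def top_c(scores, c):
    """Return the indices of the c largest scores, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:c]


# Hypothetical authority and hub scores for four pages.
authority_scores = [0.0, 0.0, 0.85, 0.53]
hub_scores = [0.85, 0.53, 0.0, 0.0]
authorities = top_c(authority_scores, 2)
hubs = top_c(hub_scores, 1)
print(authorities, hubs)
```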
61
Strengths and Weaknesses of HITS
• Strength: its ability to rank pages according to the query topic, which may provide more relevant authority and hub pages.
• Weaknesses:
– It is quite easy to influence HITS, since adding out-links to one’s own page is so easy; pointing to many good authorities raises the page's hub score.
– Inefficiency at query time: query-time evaluation is slow. Collecting the root set, expanding it and performing the eigenvector computation are all expensive operations.
• Reference
Jon M. Kleinberg: Authoritative Sources in a Hyperlinked Environment, Journal of the ACM, Vol. 46(5), 1999, pp. 604-632. http://www.cs.cornell.edu/home/kleinber/kleinber.html
63
Summary
• Web mining includes Web structure mining (mining Web link structures to identify authoritative Web pages), Web content mining and Web usage mining
• We introduced
– PageRank, with social network analysis (centrality and prestige)
– HITS, with co-citation and bibliographic coupling
64
Summary
• Important to note: hyperlink-based ranking is not the only algorithm used in search engines. In fact, it is combined with many content-based factors to produce the final ranking presented to the user.
• Links can also be used to find communities, which are groups of content-creators or people sharing some common interests.
– Web communities
– Email communities
– Named entity communities, etc.