order out of chaos analyzing the link structure of the web for directory compilation and search....

48
Order Out of Chaos Analyzing the Link Structure of the Web for Directory Compilation and Search. Presented 19.11.2000 by Benjy Weinberger

Post on 21-Dec-2015

214 views

Category:

Documents


0 download

TRANSCRIPT

Order Out of Chaos

Analyzing the Link Structure of the Web for Directory Compilation and Search.

Presented 19.11.2000

by Benjy Weinberger

The Search Problem

• The WWW is a vast collection of information: over 1 billion text pages plus a multitude of multimedia files. Over a million new resources are added every day.

• How do we find the information we need in such a large collection?

• Search is the most common activity on the web after email.

Types of Solutions

• There are basically two types of solution to the “search problem”:– Directories are static lists of web pages,

ordered in a topic taxonomy.– Search Engines attempt to match web sites to a

user’s query in real time.

• The distinction between the two types is becoming more and more fuzzy.

Types of Searches

• Two main kinds of searches:– Specific queries. E.g. “does Netscape 4.0

support the Java 1.1 code-signing API?”– Broad-topic queries. E.g. “find out information

about Che Guevara.”

• Other types of searches exist, such as:– Similar-page queries. E.g. “find pages ‘similar’

to www.cnn.com.”

Different Queries - Different Challenges

• Specific Queries: The difficulty is mostly that there are very few pages with the required information - the “needle in a haystack” problem.

• Broad-topic Queries: The difficulty is that there are too many relevant pages for humans to digest.

Broad-topic Queries

• Center around the notion of an authority - a page containing high-quality, authoritative information on the relevant topic.

• Problem: we are asking a computer to make a judgment on abstract and subjective qualities of “relevance” and “authority”.

Manual Solutions

• Manually edited catalogues of pages, ordered in a taxonomy of topics.

• Problems: – Completely unscalable. Only a tiny fraction of

the web can be catalogued.– Subjective. Decisions made by individual

editors.

Manual Solutions

• Nonetheless, solution is highly popular.

• Yahoo is still the #1 site on the web in terms of visits.

• The Open Directory Project is has over 2,000,000 sites in 250,000 categories, edited by tens of thousands of volunteers.

Automated Solutions - 1st Wave

• “Relevance” assumed to be associated with existence of keywords.

• “Authority” assumed to be associated with frequency and placement of keywords. – The more a page mentions the word the higher

that page’s score.– Font size, proximity to other keywords and

place in page may increase score.

Automated Solutions - 1st Wave

• Solution implemented by search engines such as AltaVista, HotBot, Lycos, Northern Light and many others.

• These solutions are inadequate:– The determination of “relevance” is inaccurate.

– The determination of “authority” is inaccurate.

– Don’t work for non-textual resources.

• Result: users still end up wading through thousands of results.

The Web as a Graph

• But ... the web is more than just a set of pages. The hyperlinks between pages induce a digraph structure on the web.

• Hyperlinks are created by human site authors and therefore can imply some human determination of authority.

• But how can we mine this information for search?

Naive Approach

• Determine relevance by keyword matching.

• Determine authority by number of inbound links.

• Heuristic:– Find all pages matching keywords.– Rank them by in-degree.

Naive Approach - Problems

• Good authorities may not contain keywords.– Example: honda.com, toyota.com, ford.com do

not contain the term “car manufacturers”.

• Doesn’t look at who is linking to the page.

• A highly popular page will be an authority on ANY query string it contains.– Example: cnn.com, yahoo.com, amazon.com.

Where Do We Stand

• The hyperlink structure of the web is constructed by the deliberate annotations of millions of web site authors.

• How can we mine this seemingly random link structure for underlying meaning?

Short Mathematical Interlude

• Given a square matrix A, an eigenvalue is a scalar with the property that there exists a vector such that . The vector is called an eigenvector.

• The normalized eigenvector corresponding to the largest eigenvalue is called the principal eigenvector.

x

0x

xxA

Short Mathematical Interlude

• A stochastic matrix is a matrix with non-negative entries such that the entries in each row sum to 1.

• A stochastic matrix always has a principal eigenvector (with eigenvalue 1):

All eigenvalues are <= 1 in absolute value.

1

1

1

1

1A

Short Mathematical Interlude

• For any matrix A, and are symmetric.

• A symmetric real n-by-n matrix has n real eigenvalues (with multiplicity).

• For any vector x not orthogonal to the principal eigenvector y,

TAA AAT

yxA

xAnn

n

Short Mathematical Interlude

• The adjacency matrix of a directed graph G is the matrix A such that:

)G(E)j,i(0

)G(E)j,i(1A j,i

Google

• High-quality, highly popular search engine.

• Developed by Sergey Brin and Larry Page at Stanford. Later became a commercial venture.

• Combines new ranking algorithm with state-of-the-art engineering solutions for scalability and performance.

Google

• Offline processing assigns each site on the web a PageRank - a global ranking based on link structure analysis.

• Relevance determined by keywords.

• Authority determined by PageRank.

PageRank

• Intuitive description: a page has high rank if the sum of the ranks of pages pointing to it is high.

• This is a circular definition, represented mathematically by:

where n is the total number of pages, deg(q) is q’s out degree and d is a damping factor.

n

d1

)qdeg(

)q(PRd)p(PR

pq

PageRank

• PR is a probability vector on the web pages (0 <= PR(p) <= 1 and the sum is 1).

• PR corresponds to a random surfer model: – A surfer clicks at random through the web,

without hitting “back”. When she gets bored, which happens with probability 1-d, she jumps to a random page.

– PR(p) is the probability of hitting page p in this “random surf”.

PageRank

• We define a matrix A for which:

Note that A is a stochastic matrix.• The PageRank vector is an eigenvector of

ji0

ji)ideg(1A j,i

11

11

J,JdABn

d1T

PageRank

• It is not hard to see, using the properties of stochastic matrices, that B has a principal eigenvector.

• We can initialize PR to all ones and apply B iteratively until it converges to this eigenvector.

Google

• Advantages:– Superior results to other engines.– Responds to any query, not just predetermined

topics.

• Disadvantages:– PageRank is global, not contextual.– Discriminates against “orphan” pages.

HITS

• HITS - Developed by Jon Kleinberg et al within the CLEVER project at IBM Almaden.

• Introduced the twin notions of Hubs and Authorities.

HITS - Hubs & Authorities

• An authority is a web site containing high-quality, highly-relevant information (with respect to the given search query).

• A hub is a web site that points to many related authorities.

• Intuitively, hubs help “pull” authorities together into a dense cluster of related sites.

HITS

• Hub (resp. authority) weights determine how good a hub (resp. authority) a site is.– The idea is to first find a focused subgraph of the

web relevant to the specified topic.– Then the link structure of this subgraph is analyzed

to find the hub and authority weights for each site, thus identifying the top authorities for this topic.

– Processing requirement per topic implies that uses are mainly for directory compilation.

HITS

• The focused subgraph is created by first taking the t highest-ranked pages from a text-based search engine as a root set R.

• R is expanded into the base set S by taking all sites pointing to or pointed at by a site in R.

• Note that while R may fail to contain some “important” authorities, S will probably contain them.

HITS

• We have a circular definition: – A good authority is a site pointed to by many

good hubs.– A good hub is a site that points to many good

authorities.

• This is the mutually reinforcing relationship.

• How can we break this circularity?

HITS

• A static mathematical formulation:– For a web site p let h(p) be its hub weight and

a(p) be its authority weight. We take the vectors a, h to be normalized.

– We would like to find vectors a , h such that (up to normalization):

qppq

)q(a)p(h,)q(h)p(a

HITS

• If A is the adjacency matrix of the graph in question (the focused subgraph) then the constraints can be written as:

By substitution we see that a is defined by:hA

hAa,

Aa

Aah

T

T

AAB,Ba

Baa T

HITS

• In other words, a is an eigenvector of B:

• B is the co-citation matrix: B(i,j) is the number of sites that jointly point to both i and j.

• B is symmetric and has n orthogonal unit eigenvectors. Which one do we take?

Ba,aBa

HITS

• Let’s look at the problem dynamically:– We initialize a(p) = h(p) = 1 for all p.– We iterate the following operations:

And renormalize after each iteration.

ATAB,Baa

Aah

)q(a)p(h

hAa

)q(h)p(aqp

T

pq

HITS

• The eigenvectors of B are precisely the stationary points of this process.

• For any stationary point a, and therefore the authority weights are reinforced by the factor .

• Conclusion: The principal eigenvector represents the “densest cluster” within the focused subgraph.

aBa

HITS

• As we have said, by initializing a(p)=h(p)=1, a will converge to the principal eigenvector of B. – Initializing differently may lead to convergence

to a different eigenvector.– In practice convergence is achieved after only

10-20 iterations.

HITS

• The non-principal eigenvectors may represent other, less dense subclusters of the focused subgraph.

• Other subclusters can occur when:– A query string has several meanings– A concept is relevant to several communities.– The topic is highly polarized, involving groups that

are not likely to link to one another (such as “abortion”).

HITS

• Unlike the principal eigenvector, the other eigenvectors have both positive and negative entries.

• Each non-principal eigenvector provides us with two densely connected clusters: The “most positive” and “most negative” entries of the vector.

• The larger the eigenvalue, the more “relevant” the corresponding eigenvector.

Spectral Filtering

• A generalization of HITS.

• Developed by Chakrabarti, Dom, Gibson, Raghavan and others. Some of these worked with Kleinberg on CLEVER at IBM Almaden.

• Used commercially in Quiver to mine community knowledge.

Spectral Filtering

• We have two kinds of entities: hubs and authorities. These can both be of the same type (such as web sites, cited documents) or of different types (people and products, bookmark folders and web sites).

• Note that we distinguish, a priori, between hubs and authorities.

Spectral Filtering

• The link structure consists of directed links from hubs to authorities.

• In mathematical terms we have a bipartite digraph (H;A;E) with all edges directed from H to A.

Spectral Filtering - Affinities

• For each hub i and authority j let a(i,j) >= 0 be the affinity of i for j.

• E.g: The percentage of common terms between documents or the number of visits a user makes to a bookmark.

• A=a(i,j) is the affinity matrix.

Spectral Filtering

• In SF the mutually reinforcing relationship becomes:

Or:

qp

q,ppq

p,q )q(aa)p(h,)q(ha)p(a

Aah,hAa T

Spectral Filtering

• Note that HITS is simply SF with each web site represented as both a hub and an authority, and with the following simple affinities:

ji0

ji1a j,i

Spectral Filtering

• The spectral analysis used in HITS applies here, and the principal eigenvector is used to determine the highest ranked authorities.

Spectral Filtering

• Notes:– The links from hubs to authorities represent the

opinions of many people regarding the authorities.

– SF mines this information for results reflecting the opinions of these people as a group.

– The SF algorithm “pulls” the densely linked cluster together.

– SF is relatively insensitive to “noise”.

References

[1] J.Kleinberg. Authoritative Sources in a Hyperlinked Environment. Proceedings of the Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, 1998.

[2] S.Chakrabarti, B.Dom, D.Gibson, R.Kumar, P.Raghavan, S.Rajagopalan, A.Tomkins. Spectral Filtering for Resource Discovery. ACM SIGIR Workshop on Hypertext Information Retrieval on the Web, 1998.

[3] S.Brin, L.Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine. Proc. 7th International WWW Conference, 1998.

[4] S.Brin, R.Motwani,L.Page,T.Winograd. The PageRank Citation Ranking: Bringing Order to the Web. Manuscript.

End

• Thank you for listening!