Web and Search Engines

The Web: An Overview

Developed by Tim Berners-Lee and colleagues at CERN in 1990.

Currently governed by the World Wide Web Consortium (W3C).

First widely used graphical web browser: Mosaic.

Over 800 million publicly indexable web pages and 180 million publicly indexable images by February 1999.

Over 16 million web servers.

Created numerous millionaires and billionaires!

Search Engine Technology

Two general paradigms for finding information on the Web:

Browsing: From a starting point, navigate through hyperlinks to find desired documents.

Yahoo’s category hierarchy facilitates browsing.

Searching: Submit a query to a search engine to find desired documents.

Many well-known search engines on the Web: AltaVista, Excite, HotBot, Infoseek, Lycos, Google, Northern Light, etc.

Browsing Versus Searching

A category hierarchy is built mostly manually, whereas search engine databases can be created automatically.

Search engines can index many more documents than a category hierarchy.

Browsing is good for finding some desired documents, and searching is better for finding a lot of desired documents.

Browsing is more accurate (less junk will be encountered) than searching.

Search Engine

A search engine is essentially a text retrieval

system for web pages plus a Web

interface.

So what’s new???

Some Characteristics of the Web

• Web pages are widely distributed on many servers.
• Web pages are extremely dynamic/volatile.
• Web pages have more structure (they are extensively tagged).
• Web pages are extensively linked.
• Web pages are very voluminous and diversified.
• Web pages often have other associated metadata.
• Web users are ordinary folks without special training, and they tend to submit short queries.
• There is a very large user community.

Overview of this Topic

Discuss how to take the special characteristics of

the Web into consideration for building good

search engines.

Specific Subtopics:

Robot;

The use of tag information;

The use of link information;

Collaborative Filtering.

Robots

A robot (also known as a spider, crawler, or wanderer) is a program for fetching web pages from the Web.

Main idea:

1. Place some initial URLs into a URL queue.

2. Repeat the steps below until the queue is empty

Take the next URL from the queue and fetch

the web page using HTTP.

Extract new URLs from the downloaded web

page and add them to the queue.
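The two-step loop above can be sketched in Python as follows. This is a minimal illustration, not a production robot: the `fetch` callable, the `TOY_WEB` dictionary and the example.org URLs are assumptions introduced here so the sketch runs without network access, only anchor tags are followed, and the `seen` set is an addition that keeps a URL from being queued twice.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects the href values of anchor tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, fetch):
    """Fetch pages starting from seed_urls until the URL queue is empty."""
    queue = deque(seed_urls)      # step 1: place initial URLs into the queue
    seen = set(seed_urls)         # assumption: never queue the same URL twice
    pages = {}
    while queue:                  # step 2: repeat until the queue is empty
        url = queue.popleft()
        html = fetch(url)         # stand-in for an HTTP GET
        if html is None:          # bad or down link
            continue
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            new_url = urljoin(url, href)
            if new_url not in seen:
                seen.add(new_url)
                queue.append(new_url)
    return pages


# Hypothetical three-page web used instead of real HTTP requests.
TOY_WEB = {
    "http://example.org/": '<a href="/a.html">A</a> <a href="b.html">B</a>',
    "http://example.org/a.html": '<a href="/">home</a>',
    "http://example.org/b.html": '<a href="/missing.html">gone</a>',
}

pages = crawl(["http://example.org/"], TOY_WEB.get)
```

A real robot would replace `TOY_WEB.get` with an HTTP client and add the politeness and exclusion checks discussed on the following slides.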

Robots

What initial URLs to use?

The choice depends on the type of search engine to be built.

For general-purpose search engines, use URLs that are likely to reach a large portion of the Web, such as the Yahoo home page.

For local search engines covering one or several organizations, use the URLs of the home pages of these organizations. In addition, use an appropriate domain constraint.

Robots

Examples:

To create a search engine for PUCPR University, use initial URL www.pucpr.br and domain constraint “pucpr.br”.

Only URLs containing “pucpr.br” will be used.

To create a search engine for FK (Fachhochschule Konstanz), use initial URL and domain constraints...

Robots

How to extract URLs from a web page?

Need to identify all possible tags and attributes that hold URLs.

Anchor tag: <a href="URL" …> … </a>

Option tag: <option value="URL" …> … </option>

Map: <area href="URL" …>

Frame: <frame src="URL" …>

Link to an image: <img src="URL" …>

Relative path vs. absolute path: <base href= …>
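The tag list above can be turned into a small extractor. The sketch below uses Python's standard `html.parser` and `urljoin`; the class name `UrlExtractor`, the `URL_ATTRS` table and the sample HTML are illustrative assumptions. Note that an option value is only sometimes a URL, so a real robot would need to validate it.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Tag -> attribute that may hold a URL (taken from the list above).
URL_ATTRS = {
    "a": "href",
    "area": "href",
    "frame": "src",
    "img": "src",
    "option": "value",
}


class UrlExtractor(HTMLParser):
    """Collects absolute URLs, honoring a <base href=...> tag if present."""

    def __init__(self, page_url):
        super().__init__()
        self.base = page_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "base" and attrs.get("href"):
            # Later relative paths are resolved against this base.
            self.base = urljoin(self.base, attrs["href"])
        elif tag in URL_ATTRS and attrs.get(URL_ATTRS[tag]):
            self.urls.append(urljoin(self.base, attrs[URL_ATTRS[tag]]))


def extract_urls(html, page_url):
    extractor = UrlExtractor(page_url)
    extractor.feed(html)
    return extractor.urls


sample = (
    '<base href="http://example.org/docs/">'
    '<a href="intro.html">Intro</a>'
    '<img src="/logo.png">'
    '<frame src="http://other.example/f.html">'
)
urls = extract_urls(sample, "http://example.org/index.html")
```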

Robots

How fast should we download web pages from the same server?

Downloading web pages from a web server consumes resources on that server.

Be considerate to the web servers being crawled (e.g., one page per minute from the same server).

Other issues:

• Handling bad links and down links;
• Handling duplicate pages;
• Robot exclusion protocol.
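One way to enforce a per-server delay is to remember when each host was last contacted. The sketch below is an assumption-laden illustration: the class name `PolitenessPolicy` and its methods are hypothetical, the 60-second default mirrors the "one page per minute" example, and the current time is passed in explicitly so the logic can be checked without sleeping.

```python
from urllib.parse import urlparse


class PolitenessPolicy:
    """Tracks the last fetch time per host and reports how long to wait."""

    def __init__(self, min_delay_seconds=60.0):
        self.min_delay = min_delay_seconds
        self.last_fetch = {}  # host -> time of the most recent fetch

    def wait_time(self, url, now):
        """Seconds to wait before fetching url, given the current time."""
        host = urlparse(url).netloc
        last = self.last_fetch.get(host)
        if last is None:
            return 0.0
        return max(0.0, last + self.min_delay - now)

    def record_fetch(self, url, now):
        self.last_fetch[urlparse(url).netloc] = now


policy = PolitenessPolicy(min_delay_seconds=60.0)
policy.record_fetch("http://example.org/a.html", now=0.0)
same_host_wait = policy.wait_time("http://example.org/b.html", now=20.0)
other_host_wait = policy.wait_time("http://example.net/", now=20.0)
```

In a running robot, `now` would come from a clock such as `time.monotonic()`, and URLs whose wait time is positive would be put back in the queue.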

Robots Exclusion Protocol

The site administrator puts a “robots.txt” file at the root of the host’s web directory.

http://www.ebay.com/robots.txt

http://www.cnn.com/robots.txt

The file is a list of excluded directories for a given robot (user-agent).

Exclude all robots from the entire site:

    User-agent: *
    Disallow: /

Robot Exclusion Protocol Examples

• Exclude specific directories:

    User-agent: *
    Disallow: /tmp/
    Disallow: /cgi-bin/
    Disallow: /users/paranoid/

• Exclude a specific robot:

    User-agent: GoogleBot
    Disallow: /

• Allow a specific robot:

    User-agent: GoogleBot
    Disallow:

    User-agent: *
    Disallow: /

Robots

Another example:

    User-agent: webcrawler
    Disallow:            # no restriction for webcrawler

    User-agent: lycra
    Disallow: /          # no access for robot lycra

    User-agent: *
    Disallow: /tmp       # all other robots can index
    Disallow: /logs      # docs not under /tmp, /logs
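The rules in this example can be checked with Python's standard `urllib.robotparser` module. In the sketch below the rules are parsed from a string instead of being downloaded; the host name `www.example.org` and the robot name `somebot` are placeholders introduced for illustration.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: webcrawler
Disallow:

User-agent: lycra
Disallow: /

User-agent: *
Disallow: /tmp
Disallow: /logs
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

site = "http://www.example.org"  # hypothetical host

# webcrawler has no restrictions, so even /tmp is allowed.
webcrawler_tmp = parser.can_fetch("webcrawler", site + "/tmp/a.html")
# lycra is excluded from the whole site.
lycra_home = parser.can_fetch("lycra", site + "/index.html")
# Any other robot falls under the "*" rules.
other_tmp = parser.can_fetch("somebot", site + "/tmp/a.html")
other_docs = parser.can_fetch("somebot", site + "/docs/a.html")
```

A robot would call `can_fetch` with its own user-agent name before requesting each URL.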

Robots

Several research issues about robots:

• Fetching more important pages first with limited resources;
• Fetching web pages in a specified subject area, such as movies and sports, for creating domain-specific search engines;
• Efficient re-fetch of web pages to keep the web page index up-to-date.

Robots

Efficient Crawling through URL Ordering [Cho 98]

Default ordering is based on breadth-first search.

Efficient crawling fetches important pages first.

Importance Definition

• Similarity of a page to a driving query;
• Backlink count of a page;
• PageRank of a page;
• Forward link count of a page;
• Domain of a page;
• Combination of the above.

Robots

A method for fetching pages related to a driving query first [Cho 98].

Suppose the query is “computer”.

A page is related (hot) if “computer” appears in the title or appears 10 times in the body of the page.

Some heuristics for finding a hot page:

• The anchor of its URL contains “computer”.
• Its URL contains “computer”.
• Its URL is within 3 links from a hot page.

Call a URL satisfying any of the above a hot URL.

Robots

Crawling Algorithm

    hot_queue = url_queue = empty; /* initialization */
    /* hot_queue stores hot URLs and url_queue stores other URLs */

    enqueue(url_queue, starting_url);

    while (hot_queue or url_queue is not empty)
    {
        url = dequeue2(hot_queue, url_queue);
        /* dequeue hot_queue first if it is not empty */
        page = fetch(url);
        if (page is hot) then hot[url] = true;
        enqueue(crawled_urls, url);
        url_list = extract_urls(page);
        for each u in url_list
            if (u not in url_queue and u not in hot_queue and
                u is not in crawled_urls) /* If u is a new URL */
                if (u is a hot URL) enqueue(hot_queue, u);
                else enqueue(url_queue, u);
    }

Reported experimental results indicate the method is

effective.
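The crawling algorithm above translates fairly directly into Python. The sketch below is a simplified illustration under stated assumptions: the toy web, the example.org URLs and the helper callables are hypothetical, a page counts as hot only if “computer” is in its title, and a URL counts as hot only if it contains “computer” (the anchor-text and link-distance heuristics are omitted).

```python
from collections import deque


def crawl_hot_first(start_url, fetch, extract_urls, is_hot_page, is_hot_url):
    """Crawl with two queues, always draining the hot queue first."""
    hot_queue, url_queue = deque(), deque([start_url])
    seen = {start_url}   # URLs in either queue or already crawled
    crawled_urls = []    # in fetch order
    hot = {}             # url -> whether the fetched page was hot
    while hot_queue or url_queue:
        # dequeue2: take from hot_queue first if it is not empty
        url = hot_queue.popleft() if hot_queue else url_queue.popleft()
        page = fetch(url)
        hot[url] = is_hot_page(page)
        crawled_urls.append(url)
        for u in extract_urls(page):
            if u not in seen:  # u is a new URL
                seen.add(u)
                (hot_queue if is_hot_url(u) else url_queue).append(u)
    return crawled_urls, hot


# Hypothetical toy web: url -> (title, outgoing links).
BASE = "http://example.org/"
TOY_WEB = {
    BASE: ("Home", [BASE + "news", BASE + "computer-lab", BASE + "sports"]),
    BASE + "news": ("News", []),
    BASE + "computer-lab": ("Computer Lab", [BASE + "computer-shop"]),
    BASE + "computer-shop": ("Shop", []),
    BASE + "sports": ("Sports", []),
}

order, hot = crawl_hot_first(
    BASE,
    fetch=TOY_WEB.get,
    extract_urls=lambda page: page[1],
    is_hot_page=lambda page: "computer" in page[0].lower(),
    is_hot_url=lambda u: "computer" in u.lower(),
)
```

In this toy run both URLs containing “computer” are fetched before the news and sports pages, even though they were discovered later than the news page.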

Fish Search and ARACHNID

Fish search (De Bra 94): Search by intelligently and automatically navigating through real online web pages from a starting point.

Some key features:

• Use heuristics to select the next page to navigate.
• Client-based search.
• Favors depth-first search.

ARACHNID (Adaptive Retrieval Agents Choosing Heuristic Neighborhoods for Information Discovery, Menczer 97)

Key features:

• Start from multiple promising starting points.
• Each agent acts like a fish search engine but with more sophisticated navigation techniques.

Use of Tag Information

Web pages are mostly HTML documents (for now).

HTML tags allow the author of a web page to:

• Control the display of page contents on the Web.
• Express their emphases on different parts of the page.

HTML tags provide additional information about the contents of a web page.

Question: Can we make use of the tag information to improve the effectiveness of a search engine?

Two main ideas of using tags:

• Associate different importance to term occurrences in different tags.
• Use anchor text to index referenced documents.

[Figure: Page 1 contains the anchor text “airplane ticket and hotel”, which links to Page 2: http://travelocity.com/]

Use of Tag Information

Many search engines are using tags to improve retrieval effectiveness.

Associating different importance to term occurrences is used in AltaVista, HotBot, Yahoo, Lycos, LASER, SIBRIS.

WWWW and Google use terms in anchor tags to index a referenced page.

Shortcomings: very few tags are considered; relative importance of tags not studied; lacks rigorous performance study.

Use of Tag Information

The Webor Method (Cutler 97, Cutler 99)

Partition HTML tags into six ordered classes: title, header, list, strong, anchor, plain.

Extend the term frequency value of a term in a document into a term frequency vector (TFV).

Suppose term t appears in the ith class $tf_i$ times, i = 1, 2, 3, 4, 5, 6. Then $TFV = (tf_1, tf_2, tf_3, tf_4, tf_5, tf_6)$.

Example: If for page p, term “konstanz” appears 1 time in the title, 2 times in the headers and 8 times in the anchors of hyperlinks pointing to p, then for this term in p:

TFV = (1, 2, 0, 0, 8, 0).

Use of Tag Information

The Webor Method (Continued)

Assign different importance values to term occurrences in different classes. Let $civ_i$ be the importance value assigned to the ith class. We have the vector:

$CIV = (civ_1, civ_2, civ_3, civ_4, civ_5, civ_6)$

Extend the tf term weighting scheme as follows:

Suppose for term t, $TFV = (tf_1, tf_2, tf_3, tf_4, tf_5, tf_6)$. Then

$tfw = TFV \cdot CIV = tf_1 \, civ_1 + \dots + tf_6 \, civ_6$

When CIV = (1, 1, 1, 1, 0, 1), the new tfw becomes the tfw in traditional text retrieval.
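The weighting is a plain dot product, which the following sketch makes concrete using the “konstanz” TFV from the earlier example. The function name `tag_weighted_tf` and the second CIV, (2, 3, 1, 4, 5, 1), are hypothetical choices for illustration and are not values reported for Webor.

```python
CLASSES = ("title", "header", "list", "strong", "anchor", "plain")


def tag_weighted_tf(tfv, civ):
    """Return tfw = TFV . CIV for one term in one page."""
    if len(tfv) != len(CLASSES) or len(civ) != len(CLASSES):
        raise ValueError("TFV and CIV must each have six components")
    return sum(tf * weight for tf, weight in zip(tfv, civ))


tfv = (1, 2, 0, 0, 8, 0)  # "konstanz": 1 title, 2 header, 8 anchor occurrences

# CIV = (1, 1, 1, 1, 0, 1) ignores anchor text: 1*1 + 2*1 = 3.
traditional_tfw = tag_weighted_tf(tfv, (1, 1, 1, 1, 0, 1))

# A hypothetical CIV that rewards anchor text: 1*2 + 2*3 + 8*5 = 48.
weighted_tfw = tag_weighted_tf(tfv, (2, 3, 1, 4, 5, 1))
```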


The Webor Method (Continued)

Challenge: How to find the (optimal) $CIV = (civ_1, civ_2, civ_3, civ_4, civ_5, civ_6)$ such that the retrieval performance can be improved the most?

Our Solution: Find the optimal CIV experimentally.

• Need a test bed for the experiments so that we can measure the performance of a given CIV.
• Need a systematic way to try out different CIVs and to find the optimal (or near optimal) CIV.


The Webor Method (from Weiyi Meng - Binghamton University)

Creating a test bed:

• Web pages: A snapshot of the Binghamton University site in Dec. 1996 (about 4,600 pages; after removing duplicates, about 3,000 pages).
• Queries: 20 queries were created (see next page).
• For each query, (manually) identify the documents relevant to the query.

Use of Tag Information

The Webor Method (Continued): 20 test bed queries:

• web-based retrieval
• concert and music
• neural network
• intramural sports
• master thesis in geology
• cognitive science
• prerequisite of algorithm
• campus dining
• handicap student help
• career development
• promotion guideline
• non-matriculated admissions
• grievance committee
• student associations
• laboratory in electrical engineering
• research centers
• anthropology chairman
• engineering program
• computer workshop
• papers in philosophy and computer and cognitive system


The Webor Method (Continued)

Use a Genetic Algorithm to find the optimal CIV.

The initial population has 30 CIVs:

• 25 are randomly generated (range [1, 15]).
• 5 are “good” CIVs from manual screening.

Each new generation of CIVs is produced by executing crossover, mutation, and reproduction.


The Genetic Algorithm (continued)

Crossover

• Done for each consecutive pair of CIVs, with probability 0.75.
• A single random cut is made for each selected pair.

Example (cut after the second component):

    old pair              new pair
    (1, 4, 2, 1, 2, 1)    (2, 3, 2, 1, 2, 1)
    (2, 3, 1, 2, 5, 1)    (1, 4, 1, 2, 5, 1)
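Single-cut crossover can be sketched as follows. The function names are hypothetical; the fixed-cut helper reproduces the pair shown in the example, and the random wrapper assumes the cut position is drawn uniformly from the interior positions, which the slides do not specify.

```python
import random


def crossover_at(civ_a, civ_b, cut):
    """Swap the first `cut` components of two CIVs."""
    return civ_b[:cut] + civ_a[cut:], civ_a[:cut] + civ_b[cut:]


def maybe_crossover(civ_a, civ_b, rng, probability=0.75):
    """Cross a consecutive pair with the given probability, else keep it."""
    if rng.random() >= probability:
        return civ_a, civ_b
    cut = rng.randint(1, len(civ_a) - 1)  # assumption: uniform interior cut
    return crossover_at(civ_a, civ_b, cut)


old_a, old_b = (1, 4, 2, 1, 2, 1), (2, 3, 1, 2, 5, 1)

# A cut after the second component gives the new pair from the example.
new_a, new_b = crossover_at(old_a, old_b, cut=2)

# Randomized version, seeded only to make the run repeatable.
rand_a, rand_b = maybe_crossover(old_a, old_b, random.Random(1))
```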


The Genetic Algorithm (continued)

Mutation

• Performed on each CIV with probability 0.1.
• When mutation is performed, each CIV component is either decreased or increased by one with equal probability, subject to the range conditions of each component.

Example: If a component is already 15, then it cannot be increased.
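A possible reading of the mutation step is sketched below. The function name `mutate` is hypothetical, and the handling of the range condition is an assumption: a component that would leave [1, 15] is left unchanged, whereas the slides only say that it cannot move past the bound.

```python
import random


def mutate(civ, rng, probability=0.1, low=1, high=15):
    """With the given probability, shift every component by +1 or -1."""
    if rng.random() >= probability:
        return civ  # this CIV is not mutated
    mutated = []
    for value in civ:
        step = rng.choice((-1, 1))  # decrease or increase, equally likely
        # Assumption: clamp to [low, high], so a blocked move leaves the value as is.
        mutated.append(min(high, max(low, value + step)))
    return tuple(mutated)


original = (15, 1, 8, 8, 8, 8)
# probability=1.0 forces a mutation so the effect is visible.
changed = mutate(original, random.Random(0), probability=1.0)
unchanged = mutate(original, random.Random(0), probability=0.0)
```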


The Genetic Algorithm (continued)

The fitness function

A CIV has an initial fitness of:

• 0, when the 11-point average precision is less than 0.22;
• (11-point average precision - 0.22), otherwise.

The final fitness is its initial fitness divided by the sum of the initial fitnesses of all the CIVs in the current generation, so that:

• each fitness is between 0 and 1;
• the sum of all fitnesses is 1.


The Genetic Algorithm (continued)

Reproduction

• A wheel of fortune scheme is used to select the parent population.
• The scheme selects fit CIVs with high probability and unfit CIVs with low probability.
• The same CIV may be selected more than once.

The algorithm terminates after 25 generations and the best CIV obtained is reported as the optimal CIV.

The 11-point average precision by the optimal CIV is reported as the performance of the CIV.
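The fitness normalization and the wheel of fortune selection fit together naturally, because the final fitnesses sum to 1 and can be used directly as selection probabilities. The sketch below assumes that reading; the function names, the four-member population and the precision values are made up for illustration.

```python
import random


def initial_fitness(avg_precision, threshold=0.22):
    """0 below the threshold, otherwise the excess over the threshold."""
    return max(0.0, avg_precision - threshold)


def final_fitnesses(avg_precisions):
    """Normalize initial fitnesses so that they sum to 1."""
    initial = [initial_fitness(p) for p in avg_precisions]
    total = sum(initial)
    if total == 0:
        raise ValueError("no CIV exceeds the precision threshold")
    return [f / total for f in initial]


def select_parents(population, fitnesses, rng):
    """Wheel of fortune: sample with replacement, proportional to fitness."""
    return rng.choices(population, weights=fitnesses, k=len(population))


# Hypothetical population and 11-point average precisions.
population = [(1, 1, 1, 1, 1, 1), (2, 3, 1, 2, 5, 1),
              (2, 8, 1, 8, 8, 1), (1, 4, 2, 1, 2, 1)]
precisions = [0.20, 0.25, 0.30, 0.27]

fitnesses = final_fitnesses(precisions)
parents = select_parents(population, fitnesses, random.Random(42))
```

Here the first CIV has fitness 0 and is never selected, while the third one, with the largest fitness, is the most likely to appear more than once.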


The Webor Method (continued): Experimental Results

Classes: title, header, list, strong, anchor, plain

    Queries   Opt. CIV             Normal   New     Improvement
    1st 10    (2, 8, 1, 8, 8, 1)   0.182    0.254   39.6%
    2nd 10    (2, 7, 1, 8, 8, 1)   0.172    0.255   48.3%
    all       (2, 5, 1, 8, 8, 1)   0.177    0.254   43.5%

Conclusions:

• anchor and strong are most important;
• header is also important;
• title is only slightly more important than list and plain.


The Webor Method (continued): Summary

The Webor method has the potential to substantially improve retrieval effectiveness.

But be cautious about drawing any definitive conclusions, as the results are preliminary. Need to:

• Expand the set of queries in the test bed.
• Use other Web page collections.

Use of Link Information

Hyperlinks among web pages provide new document retrieval opportunities.

Selected Examples:

• Anchor texts can be used to index a referenced page (e.g., Webor, WWWW, Google).
• The ranking score (similarity) of a page with a query can be spread to its neighboring pages.
• Links can be used to compute the importance of web pages based on citation analysis.
• Links can be combined with a regular query to find authoritative pages on a given topic.


Vector spread activation (Yuwono 97)

The final ranking score of a page p is the sum of its regular similarity and a portion of the similarity of each page that points to p.

Rationale: If a page is pointed to by many relevant pages, then the page is also likely to be relevant.

Let $sim(q, d_i)$ be the regular similarity between q and $d_i$;

$rs(q, d_i)$ be the ranking score of $d_i$ with respect to q;

$link(j, i) = 1$ if $d_j$ points to $d_i$, $= 0$ otherwise.

$rs(q, d_i) = sim(q, d_i) + \alpha \sum_{j} link(j, i) \, sim(q, d_j)$

$\alpha = 0.2$ is a constant parameter.
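The spreading step amounts to one pass over the links. The sketch below assumes the constant in the formula is the 0.2 given above; the function name, the document ids, the similarity values and the links are hypothetical.

```python
def spread_scores(sim, links, alpha=0.2):
    """rs(q, d_i) = sim(q, d_i) + alpha * sum of sim(q, d_j) over d_j -> d_i."""
    rs = dict(sim)
    for source, target in links:
        rs[target] += alpha * sim[source]
    return rs


# Hypothetical similarities to some query q and a set of (source, target) links.
sim = {"d1": 0.5, "d2": 0.4, "d3": 0.0}
links = {("d1", "d3"), ("d2", "d3"), ("d1", "d2")}

rs = spread_scores(sim, links)
# d3 does not match the query itself but is pointed to by d1 and d2:
# rs(d3) = 0.0 + 0.2 * (0.5 + 0.4) = 0.18
```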


PageRank citation ranking (Page 98)

The Web can be viewed as a huge directed graph G(V, E), where V is the set of web pages (vertices) and E is the set of hyperlinks (directed edges).

Each page may have a number of outgoing edges (forward links) and a number of incoming links (backlinks).

Each backlink of a page represents a citation to the page.

PageRank is a measure of global web page importance based on the backlinks of web pages.

Computing PageRank

PageRank is based on the following basic ideas:

If a page is linked to by many pages, then the page is likely to be important.

If a page is linked to by important pages, then the page is likely to be important even though there aren’t too many pages linking to it.

The importance of a page is divided evenly and propagated to the pages pointed to by it.


Computing PageRank

PageRank Definition

Let u be a web page,

$F_u$ be the set of pages u points to,

$B_u$ be the set of pages that point to u,

$N_u = |F_u|$ be the number of pages in $F_u$.

The rank (importance) of a page u can be defined by:

$R(u) = \sum_{v \in B_u} R(v) / N_v$

Computing PageRank

PageRank is defined recursively and can be computed iteratively.

Initialize all page ranks to 1/N, where N is the number of vertices in the Web graph.

In the ith iteration, the rank of a page is computed using the ranks of its parent pages in the (i-1)th iteration. Repeat until all ranks converge.

Let $R_i(u)$ be the rank of page u in the ith iteration and $R_0(u)$ be the initial rank of u.

$R_i(u) = \sum_{v \in B_u} R_{i-1}(v) / N_v$

Computing PageRank

Matrix representation

Let M be an $N \times N$ matrix and $m_{uv}$ be the entry at the u-th row and v-th column.

$m_{uv} = 1/N_v$ if page v has a link to page u

$m_{uv} = 0$ if there is no link from v to u

Let $R_i$ be the $N \times 1$ rank vector for the i-th iteration and $R_0$ be the initial rank vector.

Then $R_i = M R_{i-1}$

Computing PageRank

If the ranks converge, i.e., there is a rank vector R such that R = M R, then R is an eigenvector of matrix M with eigenvalue 1.

Convergence is guaranteed only if:

• M is aperiodic (the Web graph is not a big cycle). This is practically guaranteed for the Web.
• M is irreducible (the Web graph is strongly connected). This is usually not true.

Computing PageRank

Rank sink: A page or a group of pages is a rank sink if it can receive rank propagation from its parents but cannot propagate rank to other pages.

A rank sink causes the loss of total rank.

Example:

[Figure: graph with pages A, B, C and D, in which (C, D) is a rank sink]

Computing PageRank

A solution to the non-irreducibility and rank sink problems:

Conceptually add a link from each page v to every page (including itself).

If v has no forward links originally, make all entries in the corresponding column of M equal to 1/N.

If v has forward links originally, replace $1/N_v$ in the corresponding column by $c \cdot 1/N_v$ and then add $(1-c) \cdot 1/N$ to all entries, where 0 < c < 1.

Computing PageRank

Let M* be the new matrix.

• M* is irreducible.
• M* is stochastic: the sum of all entries of each column is 1 and there are no negative entries.

Therefore, if M is replaced by M* as in

$R_i = M^* R_{i-1}$

then the convergence is guaranteed and there will be no loss of the total rank (which is 1).

Computing PageRank

Interpretation of M* based on the random walk model.

If page v has no forward links originally, a web surfer at v jumps to any given page in the Web with probability 1/N.

If page v has forward links originally, a surfer at v either follows a given link to another page with probability $c \cdot 1/N_v$, or jumps to any given page with probability $(1-c) \cdot 1/N$.

Computing PageRank

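Example: Suppose the Web graph is:

[Figure: Web graph with pages A, B, C and D. By the definition of M, the matrix below corresponds to A and B each linking to C, C linking to D, and D linking to A and B.]

With rows and columns ordered A, B, C, D:

$M = \begin{pmatrix} 0 & 0 & 0 & 1/2 \\ 0 & 0 & 0 & 1/2 \\ 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix}$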

Computing PageRank

Example (continued): Suppose c = 0.8. All entries in Z are 0 and all entries in K are 1/4.

$M^* = 0.8 (M + Z) + 0.2 K = \begin{pmatrix} 0.05 & 0.05 & 0.05 & 0.45 \\ 0.05 & 0.05 & 0.05 & 0.45 \\ 0.85 & 0.85 & 0.05 & 0.05 \\ 0.05 & 0.05 & 0.85 & 0.05 \end{pmatrix}$

After 30 iterations: R(A) = R(B) = 0.176, R(C) = 0.332, R(D) = 0.316
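The example can be reproduced with a few lines of plain Python. The sketch below builds M* column by column as described earlier and runs 30 iterations from the uniform start 1/N. The function names and the `links` dictionary are illustrative; the edges are the ones implied by the entries of M (A and B link to C, C links to D, D links to A and B).

```python
def build_m_star(pages, links, c=0.8):
    """Column-stochastic M*: m[u][v] is the chance of moving from v to u."""
    n = len(pages)
    index = {page: i for i, page in enumerate(pages)}
    m = [[0.0] * n for _ in range(n)]
    for v in pages:
        col = index[v]
        targets = links.get(v, [])
        if not targets:
            # No forward links: jump to any page with probability 1/N.
            for row in range(n):
                m[row][col] = 1.0 / n
        else:
            for row in range(n):
                m[row][col] = (1.0 - c) / n        # random jump share
            for u in targets:
                m[index[u]][col] += c / len(targets)  # followed-link share
    return m


def pagerank(m, iterations=30):
    """Iterate R_i = M* R_{i-1}, starting from rank 1/N for every page."""
    n = len(m)
    ranks = [1.0 / n] * n
    for _ in range(iterations):
        ranks = [sum(m[u][v] * ranks[v] for v in range(n)) for u in range(n)]
    return ranks


pages = ["A", "B", "C", "D"]
links = {"A": ["C"], "B": ["C"], "C": ["D"], "D": ["A", "B"]}

m_star = build_m_star(pages, links, c=0.8)
ranks = dict(zip(pages, pagerank(m_star, iterations=30)))
```

Rounded to three decimals, the resulting ranks agree with the values quoted above, and their sum stays at 1.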

Computing PageRank

Incorporate the ranks of pages into the ranking function of a search engine.

The ranking score of a web page can be a weighted sum of its regular similarity with a query and its importance.

$ranking\_score(q, d) = \begin{cases} w \cdot sim(q, d) + (1-w) \cdot R(d), & \text{if } sim(q, d) > 0 \\ 0, & \text{otherwise} \end{cases}$

where 0 < w < 1. Both sim(q, d) and R(d) need to be normalized to [0, 1].


PageRank defines the global importance of web

pages but the importance is domain/topic

independent.

We often need to find important/authoritative

pages which are relevant to a given query.

What are important web browser pages?

Which pages are important game pages?

Kleinberg (Kleinberg 98) proposed to use

authority and hub scores to measure the

importance of a web page with respect to a given

query.

Authority and Hub Pages

The basic idea:

A page is a good authoritative page with respect

to a given query if it is referenced (i.e., pointed

to) by many (good hub) pages that are related to

the query.

A page is a good hub page with respect to a

given query if it points to many good

authoritative pages with respect to the query.

Good authoritative pages (authorities) and good

hub pages (hubs) reinforce each other.


Authorities and hubs related to the same query tend to form a bipartite subgraph of the web graph.

A web page can be a good authority and a good hub.

[Figure: hub pages pointing to authority pages]


Main steps of the algorithm for finding good authorities and hubs related to a query q.

1. Submit q to a regular similarity-based search engine. Let S be the set of top n pages returned by the search engine. (S is called the root set and n is often in the low hundreds).

2. Expand S into a large set T (base set):

• Add pages that are pointed to by any page in S.
• Add pages that point to any page in S. If a page has too many parent pages, only the first k parent pages will be used for some k.

3. Find the subgraph SG of the web graph that is induced by T.

[Figure: root set S contained in the base set T]


Steps 2 and 3 can be made easy by storing the link structure of the Web in advance.

Link structure table:

    parent_url   child_url
    url1         url2
    url1         url3
    …            …


4. Compute the authority score and hub score of each web page in T based on the subgraph SG(V, E).

Given a page p, let

a(p) be the authority score of p

h(p) be the hub score of p

(p, q) be a directed edge in E from p to q.

Two basic operations:

• Operation I: Update each a(p) as the sum of all the hub scores of web pages that point to p.
• Operation O: Update each h(p) as the sum of all the authority scores of web pages pointed to by p.


Operation I: for each page p:

$a(p) = \sum_{q:\,(q, p) \in E} h(q)$

[Figure: pages q1, q2 and q3 pointing to p]

Operation O: for each page p:

$h(p) = \sum_{q:\,(p, q) \in E} a(q)$

[Figure: page p pointing to q1, q2 and q3]


Matrix representation of operations I and O.

Let A be the adjacency matrix of SG: entry (p, q) is 1 if p has a link to q, else the entry is 0.

Let $A^T$ be the transpose of A.

Let $h_i$ be the vector of hub scores after i iterations.

Let $a_i$ be the vector of authority scores after i iterations.

Operation I: $a_i = A^T h_{i-1}$

Operation O: $h_i = A a_i$


After each iteration of applying Operations I and O, normalize all authority and hub scores:

$a(p) = a(p) / \sqrt{\sum_{q \in V} a(q)^2}$

$h(p) = h(p) / \sqrt{\sum_{q \in V} h(q)^2}$

Repeat until the scores for each page converge (the convergence is guaranteed).

5. Sort pages in descending authority scores.

6. Display the top authority pages.


Algorithm (summary)

    submit q to a search engine to obtain the root set S;

    expand S into the base set T;

    obtain the induced subgraph SG(V, E) using T;

    initialize a(p) = h(p) = 1 for all p in V;

    repeat until the scores converge
    {
        apply Operation I to each p in V;
        apply Operation O to each p in V;
        normalize a(p) and h(p) for each p in V;
    }

    return pages with top authority scores;


Example: Initialize all scores to 1.

1st Iteration:

I operation:

a(q1) = 1, a(q2) = a(q3) = 0,

a(p1) = 3, a(p2) = 2

O operation: h(q1) = 5,

h(q2) = 3, h(q3) = 5, h(p1) = 1, h(p2) = 0

Normalization: a(q1) = 0.267, a(q2) = a(q3) = 0,

a(p1) = 0.802, a(p2) = 0.535, h(q1) = 0.645,

h(q2) = 0.387, h(q3) = 0.645, h(p1) = 0.129, h(p2) = 0

[Figure: example graph over pages q1, q2, q3, p1 and p2]


After 2 Iterations:

a(q1) = 0.061, a(q2) = a(q3) = 0, a(p1) = 0.791,

a(p2) = 0.609, h(q1) = 0.656, h(q2) = 0.371,

h(q3) = 0.656, h(p1) = 0.029, h(p2) = 0

After 5 Iterations:

a(q1) = a(q2) = a(q3) = 0,

a(p1) = 0.788, a(p2) = 0.615

h(q1) = 0.657, h(q2) = 0.369,

h(q3) = 0.657, h(p1) = h(p2) = 0
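The iterations in this example can be replayed with a short script. Since the figure is not reproduced here, the edge list below is an assumption inferred from the first-iteration values (q1 and q3 each point to p1 and p2, q2 points to p1, and p1 points to q1); the function name `hits` is also illustrative.

```python
import math


def hits(edges, iterations):
    """Run Operations I and O, normalizing after each iteration."""
    pages = sorted({page for edge in edges for page in edge})
    a = {p: 1.0 for p in pages}
    h = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Operation I: authority = sum of hub scores of pages pointing to p.
        a = {p: sum(h[s] for s, t in edges if t == p) for p in pages}
        # Operation O: hub = sum of (updated) authority scores p points to.
        h = {p: sum(a[t] for s, t in edges if s == p) for p in pages}
        # Normalize so that the squared scores sum to 1.
        norm_a = math.sqrt(sum(v * v for v in a.values())) or 1.0
        norm_h = math.sqrt(sum(v * v for v in h.values())) or 1.0
        a = {p: v / norm_a for p, v in a.items()}
        h = {p: v / norm_h for p, v in h.items()}
    return a, h


# Edges inferred from the first-iteration scores of the example.
EDGES = [
    ("q1", "p1"), ("q1", "p2"),
    ("q2", "p1"),
    ("q3", "p1"), ("q3", "p2"),
    ("p1", "q1"),
]

a1, h1 = hits(EDGES, iterations=1)
a2, h2 = hits(EDGES, iterations=2)
a5, h5 = hits(EDGES, iterations=5)
top_authorities = sorted(a5, key=a5.get, reverse=True)[:2]
```

With these edges the scores after one and two iterations match the values listed above to three decimals, and after five iterations p1 and p2 are the top authorities while the authority of q1 has shrunk to below 0.001.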



Should all links be equally treated?

Two considerations:

• Some links may be more meaningful/important than other links.
• Web site creators may trick the system to make their pages more authoritative by adding dummy pages pointing to their cover pages (spamming).

Domain name: the first level of the URL of a page.

Example: domain name for “ppgia.pucpr.br/~kaestner/iir.html” is “ppgia.pucpr.br”.


Transverse link: a link between pages with different domain names.

Intrinsic link: a link between pages with the same domain name.

Transverse links are more important than intrinsic links.

Two ways to incorporate this:

1. Use only transverse links and discard intrinsic links.

2. Give lower weights to intrinsic links.


How to give lower weights to intrinsic links?

In adjacency matrix A, entry (p, q) should be assigned as follows:

• If p has a transverse link to q, the entry is 1.
• If p has an intrinsic link to q, the entry is c, where 0 < c < 1.
• If p has no link to q, the entry is 0.


For a given link (p, q), let V(p, q) be the vicinity (e.g., 50 characters) of the link.

If V(p, q) contains terms in the user query (topic), then the link should be more useful for identifying authoritative pages.

To incorporate this: In adjacency matrix A, make the weight associated with link (p, q) to be 1+n(p, q), where n(p, q) is the number of terms in V(p, q) that appear in the query.
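Both weighting schemes replace the 0/1 entries of A by computed weights. The sketch below shows one possible reading of each; the function names, the value c = 0.5, the comparison of full host names, and the choice to count each distinct query term once are all assumptions made for illustration.

```python
from urllib.parse import urlparse


def domain_of(url):
    """Domain name of a URL, taken here to be its host part."""
    return urlparse(url).netloc


def domain_weight(p_url, q_url, c=0.5):
    """1 for a transverse link, c (0 < c < 1) for an intrinsic link."""
    return 1.0 if domain_of(p_url) != domain_of(q_url) else c


def vicinity_weight(vicinity_text, query_terms):
    """1 + n(p, q), where n counts query terms found in the link vicinity."""
    words = set(vicinity_text.lower().split())
    n = sum(1 for term in set(query_terms) if term.lower() in words)
    return 1 + n


intrinsic = domain_weight(
    "http://ppgia.pucpr.br/~kaestner/iir.html",
    "http://ppgia.pucpr.br/index.html",
)
transverse = domain_weight(
    "http://ppgia.pucpr.br/~kaestner/iir.html",
    "http://www.example.org/",
)
# Hypothetical text surrounding a link, for the query "free game".
weight = vicinity_weight("play this free game online now", ["free", "game"])
```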


Sample experiments: Rank based on large in-degree (or backlinks)

query: game

    Rank  In-degree  URL
    1     13         http://www.gotm.org
    2     12         http://www.gamezero.com/team-0/
    3     12         http://ngp.ngpc.state.ne.us/gp.html
    4     12         http://www.ben2.ucla.edu/~permadi/gamelink/gamelink.html
    5     11         http://igolfto.net/
    6     11         http://www.eduplace.com/geo/indexhi.html

Only pages 1, 2 and 4 are authoritative game pages.


Sample experiments (continued): Rank based on large authority score.

query: game

    Rank  Authority  URL
    1     0.613      http://www.gotm.org
    2     0.390      http://ad/doubleclick/net/jump/gamefan-network.com/
    3     0.342      http://www.d2realm.com/
    4     0.324      http://www.counter-strike.net
    5     0.324      http://tech-base.com/
    6     0.306      http://www.e3zone.com

All pages are authoritative game pages.


Sample experiments (continued): Rank based on large authority score.

query: free email

    Rank  Authority  URL
    1     0.525      http://mail.chek.com/
    2     0.345      http://www.hotmail/com/
    3     0.309      http://www.naplesnews.net/
    4     0.261      http://www.11mail.com/
    5     0.254      http://www.dwp.net/
    6     0.246      http://www.wptamail.com/

All pages are authoritative free email pages.


For a given query, the induced subgraph may have multiple dense bipartite communities due to:

• multiple meanings of query terms;
• multiple web communities related to the query.

[Figure: communities in the induced subgraph, with an ad page and an obscure web page labeled]


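Multiple Communities (continued)

If a page is not in a community, then it is unlikely to have a high authority score even when it has many backlinks.

Example: Suppose initially all hub and authority scores are 1.

[Figure: G1, in which q pages point to a single page p; G2, in which q pages point to several p pages. The values below correspond to five q pages in G1 and to three q pages each pointing to three p pages in G2.]

1st iteration for G1: a(q) = 0, a(p) = 5, h(q) = 5, h(p) = 0

1st iteration for G2: a(q) = 0, a(p) = 3, h(q) = 9, h(p) = 0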


Example (continued):

1st normalization (suppose the normalization factors are $H_1$ for hubs and $A_1$ for authorities):

for pages in G1: $a(q) = 0$, $a(p) = 5/A_1$, $h(q) = 5/H_1$, $h(p) = 0$

for pages in G2: $a(q) = 0$, $a(p) = 3/A_1$, $h(q) = 9/H_1$, $h(p) = 0$

After the nth iteration (suppose $H_n$ and $A_n$ are the respective normalization factors):

for pages in G1: $a(p) = 5^n / (H_1 \cdots H_{n-1} A_n)$ ---- a

for pages in G2: $a(p) = 3 \cdot 9^{n-1} / (H_1 \cdots H_{n-1} A_n)$ ---- b

Note that $a/b = 5^n / (3 \cdot 9^{n-1}) = 3 \, (5/9)^n$ approaches 0 when n is sufficiently large, that is, a becomes much smaller than b.


Multiple Communities (continued)

If a page is not in the largest community, then it is unlikely to have a high authority score.

The reason is similar to that regarding pages not in a community.

[Figure: a larger community and a smaller community]


Multiple Communities (continued)

How to retrieve pages from smaller communities?

A method for finding pages in the nth largest community:

• Identify the next largest community using the existing algorithm.
• Destroy this community by removing links associated with pages having large authority scores.
• Reset all authority and hub values back to 1 and calculate all authority and hub values again.

Repeat the above n - 1 times and the next largest community will be the nth largest community.


Query: House (first community)


Query: House (second community)

Collaborative Filtering

When a user submits a query to a search engine, the user may have some of the following behaviors or reactions to the returned web pages:

• Click certain pages in a certain order while ignoring most pages.
• Read some clicked pages longer than some other clicked pages.
• Save/print certain clicked pages.
• Follow some links in clicked pages to reach more pages.


The behavior of a user u in response to the result of a query q can be considered as a piece of knowledge associated with the user-query pair (u, q).

The same user may use the search engine many times with many queries. Each time, the user reacts to the retrieved results.

Many users may submit different queries to the search engine.

• Many users may have common information needs.
• The same query or a similar query may be submitted by different users.


The reactions of users to the retrieval results of many past queries can be collected and stored in a knowledge base.

User reaction knowledge can be used in at least three different ways to improve retrieval:

1. Use the knowledge immediately to benefit the current search needs of the user (user feedback).

2. Use the knowledge in the future to benefit the future search needs of the user (user profile).

3. Use the knowledge in the future to benefit the future search needs of all users (collaborative filtering).


Implicit User Feedback:

1. Derive likely relevant documents from the returned documents based on the user behavior.

Saved/printed documents can be considered to be relevant.

Documents that are viewed for a longer time can be considered to be more likely to be relevant.

2. Modify the query to a new query q* and submit q* to the search engine for another round of search.

• Relevance feedback


User Profile:

A profile of a user is a collection of information that documents the user’s information needs and/or access patterns.

Different types of user profiles exist:

• Static profile for describing user information needs.
• Dynamic profile that changes according to the user’s recent access behaviors and patterns.
• Specialized profile (e.g., navigational pattern).
• Server side profile.
• Client side profile.


User Profile: (continued)

User profile is widely used for text filtering:

• Find documents that are similar to a user profile.
• Profile-based filtering is also known as content-based recommendation.

User profile can be used in combination with a query for better information retrieval and filtering.


Collaborative Filtering:

From (Miller 96):

Collaborative filtering systems make use of the reactions and opinions of people who have already seen a piece of information to make predictions about the value of that piece of information for people who have not yet seen it.

Collaborative filtering systems often recommend documents to a user (a query) that are liked (found useful) by similar users (e.g., users who have similar profiles) (for similar queries).


Main components:

• Recommendation gathering: e.g., record user behaviors to retrieved documents.
• Recommendation aggregation: Combine multiple recommendations into a useful measure.
• Recommendation usage: Apply recommendation measures to recommend documents.

Some interesting issues:

• What recommendations are useful?
• How to do recommendation aggregation?
• How to combine recommendation with other usefulness measures?

Collaborative Filtering

Example Systems:

PHOAKS (People Helping One Another Know Stuff)

• For recommending URLs.
• Use each mention of a URL in a news article as a recommendation, with these exclusions:
  • Not counting URLs in headers and quoted sections.
  • Not using articles posted to too many newsgroups.
  • Not counting URLs in announcements or ads.
• Recommendation aggregation: compute the number of distinct recommenders of each URL.
• Recommendation based on the number of distinct recommenders.
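The aggregation step described above, counting distinct recommenders per URL, can be sketched as follows. This is only an illustration of the counting idea and not PHOAKS code: the function name, the author names and the example URLs are made up, and the filtering of headers, quoted sections, cross-posted articles and ads is assumed to have happened before this step.

```python
from collections import defaultdict


def rank_urls(mentions):
    """Rank URLs by the number of distinct people who mentioned them."""
    recommenders = defaultdict(set)
    for author, url in mentions:
        recommenders[url].add(author)  # repeat mentions by one author count once
    counts = {url: len(authors) for url, authors in recommenders.items()}
    # Highest count first; ties broken alphabetically for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical (author, mentioned URL) pairs extracted from articles.
mentions = [
    ("alice", "http://a.example/"),
    ("bob", "http://a.example/"),
    ("alice", "http://a.example/"),
    ("carol", "http://b.example/"),
]

ranked = rank_urls(mentions)
```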


Example Systems:

Fab (http://fab.stanford.edu)

• Combines content-based recommendation and collaborative recommendation.
• Retains the advantages of each approach while avoiding the weaknesses of each approach.
• Users are required to rank each recommended document explicitly based on a 7-point scale.
• The ranking is used to update a user’s profile, and highly ranked documents are also recommended to users with similar profiles.


Example Systems:

DirectHit (http://www.directhit.com)

• Author-controlled search engines versus editor-controlled directories.
• DirectHit aims at achieving the breadth of a regular search engine with the accuracy of editor-controlled directories by adopting a user-controlled method.
• DirectHit uses user viewing time of documents and other behavior information to identify useful hits to documents and uses collaborative filtering to help find documents for new queries.