
Page 1: Web Search for X-Informatics

Web Search for X-Informatics

Spring Semester 2002, MW 6:00 pm – 7:15 pm Indiana Time

Geoffrey Fox and Bryan Carpenter

PTLIU Laboratory for Community Grids

Informatics (Computer Science, Physics)

Indiana University

Bloomington IN 47404

[email protected]

Page 2: Web Search for X-Informatics

References I

• Here is a set of references addressing Web Search as one approach to information retrieval
• http://umiacs.umd.edu/~bonnie/cmsc723-00/CMSC723/CMSC723.ppt
• http://img.cs.man.ac.uk/stevens/workshop/goble.ppt
• http://www.isi.edu/us-uk.gridworkshop/talks/goble_-_grid_ontologies.ppt
• http://www.cs.man.ac.uk/~carole/cs3352.htm has several interesting sub-talks in it:
  – http://www.cs.man.ac.uk/~carole/IRintroduction.ppt
  – http://www.cs.man.ac.uk/~carole/SearchingtheWeb.ppt
  – http://www.cs.man.ac.uk/~carole/IRindexing.ppt
  – http://www.cs.man.ac.uk/~carole/metadata.ppt
  – http://www.cs.man.ac.uk/~carole/TopicandRDF.ppt
• http://www.isi.edu/us-uk.gridworkshop/talks/jeffery.ppt from the excellent 2001 e-Science meeting

Page 3: Web Search for X-Informatics

References II: Discussion of "real systems"

• General review stressing the "hidden web" (content stored in databases): http://www.press.umich.edu/jep/07-01/bergman.html
• IBM "Clever Project", Hypersearching the Web: http://www.sciam.com/1999/0699issue/0699raghavan.html
• Google, Anatomy of a Web Search Engine: http://www.stanford.edu/class/cs240/readings/google.pdf
• Peking University Search Engine Group: http://net.cs.pku.edu.cn/~webg/refpaper/papers/jwang-log.pdf
• A huge set of links can be found at http://net.cs.pku.edu.cn/~webg/refpaper/

Page 4: Web Search for X-Informatics

WebGather: towards quality and scalability of a Web search service

LI Xiaoming, Department of Computer Science and Technology, Peking Univ.

A presentation at Supercomputing 2001 through a constellation site in China

November 15, 2001

This lecture is built around this presentation by Xiaoming Li

We have inserted material from other cited references

Page 5: Web Search for X-Informatics

How many search engines out there?

• Yahoo!
• AltaVista
• Lycos
• Infoseek
• OpenFind
• Baidu
• Google
• WebGather (天网, "Tianwang")
• … there are more than 4,000 in the world! (CompletePlanet White Paper: http://www.press.umich.edu/jep/07-01/bergman.html)

Page 6: Web Search for X-Informatics

http://e.pku.edu.cn

Page 7: Web Search for X-Informatics

WebGather

Page 8: Web Search for X-Informatics

Our System

Page 9: Web Search for X-Informatics

Agenda

• Importance of Web search service

• Three primary measures/goals of a Web search service

• Our approaches to the goals

• Related work

• Future work

Page 10: Web Search for X-Informatics

Importance of Web Search Service

• Rapid growth of web information
  – >40 million Chinese web pages under .cn
• The second most popular application on the web
  – email, then the search engine
• Information access: from address-based to content-based
  – who can remember all those URLs?!
  – the search engine: a first step towards content-based web information access
• There are 4/24 sessions and 15/78 papers at WWW10!

Page 11: Web Search for X-Informatics

How the Web is growing in China

Statistic                     Jun 30, 1998   Dec 31, 1998   Jun 30, 1999   Dec 31, 1999   Jun 30, 2000
Internet users in China       1,175,000      2,100,000      4,000,000      8,900,000      16,900,000
Websites under .CN in China   3,700          5,300          9,906          15,153         27,289

* source: CNNIC

Page 12: Web Search for X-Informatics

Primary Measures/Goals of a Search Engine

• Scale
  – volume of indexed web information, ...
• Performance
  – "real time" constraint
• Quality
  – does the end user like the results returned?

They are at odds with one another!

Page 13: Web Search for X-Informatics

Scale: go for massive!

• the amount of information that is indexed by the system (e.g. number of web pages, number of ftp file entries, etc.)
• the number of websites it covers
• coverage: percentages of the above with respect to the totals out there on the Web
• the number of information forms that are fetched and managed by the system (e.g. html, txt, asp, xml, doc, ppt, pdf, ps, Big5 as well as GB, etc.)

Page 14: Web Search for X-Informatics

Primary measures/goals of a search engine

• Scale
  – volume of indexed information, ...
• Performance
  – "real time" constraint
• Quality
  – does the end user like the results returned?

They are at odds with one another!

Page 15: Web Search for X-Informatics

Performance: "real time" requirement

• fetch the targeted amount of information within a time frame, say 15 days
  – otherwise the information may be obsolete
• deliver the results for a query within a time limit (response time), say 1 second
  – otherwise users may turn away from your service and never come back!

Larger scale may imply degradation of performance.

Page 16: Web Search for X-Informatics

Primary measures/goals of a search engine

• Scale
  – volume of information indexed, ...
• Performance
  – "real time" constraint
• Quality
  – does the end user like the results returned?

They are at odds with one another!

Page 17: Web Search for X-Informatics

Quality: do the users like it?

• recall rate
  – can it return information that should be returned?
  – a high recall rate requires high coverage
• accuracy
  – the percentage of returned results that are relevant to the query (worked example below)
  – high accuracy requires better coverage
• ranking (a special measure of accuracy)
  – are the most relevant results appearing before those less relevant?
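In standard IR terminology, recall rate and accuracy are recall and precision. A tiny worked illustration in Python (the document sets are invented):

```python
def recall(returned: set, relevant: set) -> float:
    """Fraction of the relevant documents that were actually returned."""
    return len(returned & relevant) / len(relevant)

def accuracy(returned: set, relevant: set) -> float:
    """Fraction of the returned documents that are relevant (precision)."""
    return len(returned & relevant) / len(returned)

relevant = {1, 2, 3, 4, 5}           # documents that should be returned
returned = {1, 2, 3, 9, 10}          # documents the engine actually returned
print(recall(returned, relevant))    # 0.6 -- it missed documents 4 and 5
print(accuracy(returned, relevant))  # 0.6 -- documents 9 and 10 are irrelevant
```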

Page 18: Web Search for X-Informatics

Our approach

• Parallel and distributed processing: reach for large scale and scalability
• User behavior analysis: yields mechanisms for performance
• Making use of the content of web pages: hints at innovative algorithms for quality

Page 19: Web Search for X-Informatics

Towards scalability

• WebGather 1.0: a million-page level system, operated since 1998, with a single crawler
• WebGather 2.0: a 30-million-page level system, operated since 2001, with a fully parallel architecture
  – not only boosts the scale
  – but also improves performance
  – and delivers better quality

Page 20: Web Search for X-Informatics

Architecture of typical search engines

[Diagram: a crawler (a scheduler driving multiple robots) fetches pages from the Internet into a raw database; an indexer builds an index database from it; a searcher answers queries from the index through the user interface.]

Page 21: Web Search for X-Informatics

Architecture of WebGather 2.0

[Diagram, in four stages: Crawling – crawlers 1…n fetch from the Internet into n raw databases; Indexing – indexers 1…n build n index databases; Searching – searchers 1…n, backed by a document database and a query cache; UI – a single User Interface. The components are connected by a LAN.]

Page 22: Web Search for X-Informatics

Towards scalability: main technical issues

• how to assign crawling tasks to multiple crawlers for parallel processing
  – granularity of the tasks: URL or IP address?
  – maintenance of a task pool: centralized or distributed?
  – load balance
  – low communication overhead
• dynamic reconfiguration
  – in response to failure of crawlers, …, (remembering that the crawling process usually takes weeks)

Page 23: Web Search for X-Informatics

Parallel Crawling in WebGather

[Diagram: Schedulers 1…N, each driving a set of robots inside a crawler, coordinate through a CR (crawler registry).]

Page 24: Web Search for X-Informatics

Task Generation and Assignment

• granularity of parallelism: URL or domain name

• task pool: distributed, and tasks are dynamically created and assigned

• A hash function is used for task assignment and load balance

H(URL) = F(URL’s domain part) mod N
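A minimal sketch of this assignment scheme (the use of MD5 as the hash F and the function name are assumptions; the slide only specifies H(URL) = F(URL's domain part) mod N):

```python
import hashlib
from urllib.parse import urlparse

def assign_crawler(url: str, n_crawlers: int) -> int:
    """Map a URL to a crawler ID by hashing its domain part.

    Hashing the domain rather than the full URL keeps all pages of a
    site on the same crawler, which keeps communication overhead low
    at the cost of a coarser-grained load balance.
    """
    domain = urlparse(url).netloc.lower()                   # URL's domain part
    digest = hashlib.md5(domain.encode("utf-8")).digest()   # F(domain)
    return int.from_bytes(digest[:8], "big") % n_crawlers   # mod N

# Both URLs from www.a.com land on the same crawler.
print(assign_crawler("http://www.a.com/index.html", 16))
print(assign_crawler("http://www.a.com/news/today.html", 16))
```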

Page 25: Web Search for X-Informatics

Simulation result: load balance

Number of crawlers   2 hours     4 hours     6 hours     8 hours     10 hours
2                    0.001454    0.000309    6.18E-05    1.25E-05    8.24E-06
4                    0.00059     0.000375    0.000465    0.000672    0.000568
8                    7.04E-05    4.98E-05    4.18E-05    7.44E-05    5.79E-05
16                   1.57E-05    1.11E-05    1.42E-05    1.51E-05    1.82E-05

Page 26: Web Search for X-Informatics

Simulation result: scalability

[Plot: speedup (0–12) vs. number of crawlers (2–16), with one curve per configuration of 2, 4, 8, and 16 main-controllers.]

Page 27: Web Search for X-Informatics

Experimental result: scalability

[Plot: speedup vs. number of crawlers.]

Page 28: Web Search for X-Informatics

Our Approach

• Parallel and distributed processing: reach for large scale and scalability
• User behavior analysis: yields mechanisms for performance
• Making use of the content of web pages: hints at innovative algorithms for quality

Page 29: Web Search for X-Informatics

Towards high performance

• "parallel processing", of course, is a plus for performance, and
• more importantly, user behavior analysis suggests critical mechanisms for improved performance
  – a search engine not only maintains web information, but also logs user queries
  – a good understanding of the queries gives rise to cache design and performance tuning approaches

Page 30: Web Search for X-Informatics

What do you keep?

• So you gather data from the web, storing
  – documents and, more importantly, the words extracted from documents
• After removing dull words (stop words), you store the document # for each word, together with additional data
  – position, and meta-information such as font and the enclosing tag (i.e. whether it is in the meta-data section)
• Position is needed to be able to respond to multiword queries with adjacency requirements (see the sketch below)
• There is a lot of important research in the best way to get, store and retrieve information
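A minimal sketch of such a positional inverted index (the stop-word list and helper names are illustrative, not any particular engine's actual design):

```python
from collections import defaultdict

STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}  # assumed list

def build_index(docs: dict[int, str]) -> dict[str, list[tuple[int, int]]]:
    """Positional inverted index: word -> [(doc_id, position), ...]."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            if word not in STOP_WORDS:           # drop the "dull" words
                index[word].append((doc_id, pos))
    return index

def phrase_hits(index, w1: str, w2: str) -> set[int]:
    """Documents where w2 occurs immediately after w1 (adjacency query)."""
    follows = {(d, p + 1) for d, p in index.get(w1, [])}
    return {d for d, p in index.get(w2, []) if (d, p) in follows}

docs = {1: "web search engines index the web", 2: "search the web"}
idx = build_index(docs)
print(phrase_hits(idx, "web", "search"))  # {1}: "web search" is adjacent only there
```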

Page 31: Web Search for X-Informatics

What Pages should one get?

• A Web Search is an Information retrieval engine, not a Knowledge retrieval engine
• It looks at a set of text pages with certain additional characteristics
  – URL, titles, fonts, meta-data
• And matches a query to these pages, returning pages in a certain order
• This order, and the choices made by the user in dealing with this order, can be thought of as "knowledge"
  – E.g. the user tries different queries and decides which of the returned set to explore
• People complain about the "number of pages" returned, but I think this is a GOOD model for knowledge and it is good to combine people with the computer

Page 32: Web Search for X-Informatics

How do you Rank Pages?

• One can find at least 4 criteria:
• Content of the document, i.e. the nature of the occurrence of query terms in the document (author)
• Nature of links to and from this document – this is characteristic of a Web page (other authors)
  – Google and the IBM Clever project emphasized this
• Occurrence of the document in compiled directories (editors)
• Data on what users of the search service have done (users)

Page 33: Web Search for X-Informatics

Document Content Ranking

• Here the TF*IDF method is typical
  – TF is Term (query word) Frequency
  – IDF is Inverse Document Frequency
• This gives a crude ranking which can be refined by other schemes
• If you have multiple terms then you can add their values of TF*IDF
• The next slides come from the earlier courses by Goble (Manchester) and Maryland cited at the start

Page 34: Web Search for X-Informatics

IR (Information Retrieval) as Clustering

• A query is a vague spec of a set of objects, A
• IR is reduced to the problem of determining which documents are in set A and which ones are not
• Intra-clustering similarity:
  – What are the features that better describe the objects in A?
• Inter-clustering dissimilarity:
  – What are the features that better distinguish the objects in A from the remaining objects in C?

[Diagram: the retrieved set A of documents inside the full document collection C.]

Page 35: Web Search for X-Informatics

Index term weighting

Weight(t,d) = tf(t,d) x idf(t)

N           Number of documents in the collection
n(t)        Number of documents in which term t occurs
idf(t)      Inverse document frequency
occ(t,d)    Occurrences of term t in document d
tmax        Term in document d with the highest occurrence count
tf(t,d)     Term frequency of t in document d

Page 36: Web Search for X-Informatics

Index term weighting

• Intra-clustering similarity (term frequency)
  – The raw frequency of a term t inside a document d
  – A measure of how well the term describes the document contents
• Inter-cluster dissimilarity (inverse document frequency)
  – The inverse of the frequency of a term t among the documents in the collection
  – Terms which appear in many documents are not useful for distinguishing a relevant document from a non-relevant one

Normalised frequency of term t in document d:   tf(t,d) = occ(t,d) / occ(tmax, d)
Inverse document frequency:                     idf(t) = log( N / n(t) )
Combined weight:                                Weight(t,d) = tf(t,d) x idf(t)

Page 37: Web Search for X-Informatics

Term weighting schemes

• Best known (term frequency x inverse document frequency):

  weight(t,d) = [ occ(t,d) / occ(tmax, d) ] x log( N / n(t) )

• Variation for query term weights:

  weight(t,q) = [ 0.5 + 0.5 x occ(t,q) / occ(tmax, q) ] x log( N / n(t) )
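A minimal sketch of both formulas in Python (the toy corpus is invented; base-10 logs match the idf values in the example on the next slide):

```python
import math
from collections import Counter

def tf(term, doc_words):
    counts = Counter(doc_words)
    return counts[term] / max(counts.values())      # occ(t,d) / occ(tmax,d)

def idf(term, all_docs):
    n_t = sum(term in doc for doc in all_docs)      # n(t)
    return math.log10(len(all_docs) / n_t) if n_t else 0.0  # log(N / n(t))

def weight(term, doc_words, all_docs):
    return tf(term, doc_words) * idf(term, all_docs)

def query_weight(term, query_words, all_docs):
    counts = Counter(query_words)
    q_tf = 0.5 + 0.5 * counts[term] / max(counts.values())
    return q_tf * idf(term, all_docs)

docs = [["nuclear", "fallout", "siberia"],
        ["information", "retrieval", "retrieval"],
        ["nuclear", "information"]]
print(weight("retrieval", docs[1], docs))  # ~0.48: rare term, frequent in its doc
print(weight("nuclear", docs[0], docs))    # ~0.18: term appears in 2 of 3 docs
```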

Page 38: Web Search for X-Informatics

TF*IDF Example

[Table: raw term frequencies tf(i,j) and the resulting weights w(i,j) for eight terms (nuclear, fallout, siberia, contaminated, interesting, complicated, information, retrieval) across four documents, together with each term's idf(i).]

• Unweighted query: contaminated retrieval → Result: 2, 3, 1, 4
• Weighted query: contaminated(3) retrieval(1) → Result: 1, 3, 2, 4
• IDF-weighted query: contaminated retrieval → Result: 2, 3, 1, 4

Page 39: Web Search for X-Informatics

Document Length Normalization

Let w(i,j) be the unnormalized weight of term i in document j, and w'(i,j) the normalized weight. Then

  w'(i,j) = w(i,j) / sqrt( Σ over i of w(i,j)² )

• Long documents have an unfair advantage
  – They use a lot of terms
    • So they get more matches than short documents
  – And they use the same words repeatedly
    • So they have much higher term frequencies

Page 40: Web Search for X-Informatics

Cosine Normalization Example

[Table: the TF*IDF example again, showing tf(i,j), the weights w(i,j), the idf(i) values, and the cosine-normalized weights w'(i,j); the document lengths sqrt(Σ w²) are 1.70, 0.97, 2.67 and 0.87 for documents 1–4.]

• Unweighted query: contaminated retrieval → Result: 2, 4, 1, 3 (compare to 2, 3, 1, 4 without normalization)
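A minimal sketch of the normalization step (the term weights below are invented, not the slide's exact numbers):

```python
import math

def cosine_normalize(doc_weights: dict[str, float]) -> dict[str, float]:
    """Divide each term weight by the document's Euclidean length, so that
    long documents no longer win simply by containing more terms."""
    length = math.sqrt(sum(w * w for w in doc_weights.values()))
    return {t: w / length for t, w in doc_weights.items()}

doc = {"contaminated": 0.13, "interesting": 0.60, "nuclear": 0.50}
print(cosine_normalize(doc))  # weights now form a unit-length vector
```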

Page 41: Web Search for X-Informatics

Google PageRank

• This exploits the nature of links to a page, which is a measure of "citations" for the page
• Page A has pages T1, T2, T3, … Tn which point to it
• d is a fudge factor (say 0.85)
• PR(A) = (1-d) + d * ( PR(T1)/C(T1) + PR(T2)/C(T2) + … + PR(Tn)/C(Tn) )
• where C(Tk) is the number of links from page Tk
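A minimal sketch of this formula iterated to a fixed point (the three-page web is invented; real implementations also handle pages with no out-links and test for convergence, both omitted here):

```python
def pagerank(links: dict[str, list[str]], d: float = 0.85,
             iters: int = 50) -> dict[str, float]:
    """Iterate PR(A) = (1-d) + d * sum(PR(T)/C(T)) over pages T -> A."""
    pages = list(links)
    pr = {p: 1.0 for p in pages}
    for _ in range(iters):
        pr = {a: (1 - d) + d * sum(pr[t] / len(links[t])
                                   for t in pages if a in links[t])
              for a in pages}
    return pr

# Tiny hypothetical web: A <-> B, and C also points to A.
web = {"A": ["B"], "B": ["A"], "C": ["A"]}
print(pagerank(web))  # A ranks highest: it has two in-links, one from B
```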

Page 42: Web Search for X-Informatics

HITS: Hypertext Induced Topic Search

• The ranking scheme depends on the query
• Considers the set of pages that point to, or are pointed at by, pages in the answer S
• Implemented in IBM's Clever prototype
• Scientific American article: http://www.sciam.com/1999/0699issue/0699raghavan.html

Page 43: Web Search for X-Informatics

HITS (2)

• Authorities:
  – pages that have many links pointing to them in S
• Hubs:
  – pages that have many outgoing links
• Positive two-way feedback:
  – better authority pages come from incoming edges from good hubs
  – better hub pages come from outgoing edges to good authorities

Page 44: Web Search for X-Informatics

Authorities and Hubs

[Diagram: a link graph with authorities (blue) and hubs (red).]

Page 45: Web Search for X-Informatics

HITS two-step iterative process

• assigns initial scores to candidate hubs and authorities on a particular topic in a set of pages S
1. use the current guesses about the authorities to improve the estimates of hubs – locate all the best authorities
2. use the updated hub information to refine the guesses about the authorities – determine where the best hubs point most heavily and call these the good authorities
• Repeat until the scores eventually converge to the principal eigenvector of the link matrix of S, which can then be used to determine the best authorities and hubs.

  H(p) = Σ over { u in S | p → u } of A(u)
  A(p) = Σ over { v in S | v → p } of H(v)
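A minimal sketch of the two update rules (the root set S is invented; scores are length-normalized each round so that they converge instead of growing without bound):

```python
import math

def hits(links: dict[str, list[str]], iters: int = 50):
    """H(p) = sum of A(u) over pages u that p points to;
       A(p) = sum of H(v) over pages v that point to p."""
    pages = list(links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iters):
        auth = {p: sum(hub[v] for v in pages if p in links[v]) for p in pages}
        hub = {p: sum(auth[u] for u in links[p]) for p in pages}
        for scores in (auth, hub):                # normalize each round
            norm = math.sqrt(sum(s * s for s in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

# Hypothetical root set: H1 and H2 are hubs pointing at authorities A1, A2.
S = {"H1": ["A1", "A2"], "H2": ["A1"], "A1": [], "A2": []}
hub, auth = hits(S)
print(max(hub, key=hub.get), max(auth, key=auth.get))  # H1 A1
```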

Page 46: Web Search for X-Informatics

Cybercommunities

[Diagram: HITS is clustering the web into communities.]

Page 47: Web Search for X-Informatics

Google vs Clever

• Google
  1. assigns initial rankings and retains them independently of any queries – this enables faster response
  2. looks only in the forward direction, from link to link
• Clever
  1. assembles a different root set for each search term and then prioritizes those pages in the context of that particular query
  2. also looks backward from an authoritative page to see what locations are pointing there; humans are innately motivated to create hub-like content expressing their expertise on specific topics

Page 48: Web Search for X-Informatics

Peking University: User Behavior Analysis

• taking 3 months' worth of real user queries (about 1 million queries)
• each query consists of <keywords, time, IP address, …>
• keywords distribution: we observe that high-frequency keywords dominate
• grouping the queries in 1000s and examining the difference between consecutive groups: we observe a quite stable process (the difference is quite small)
• doing the above for different group sizes: we observe a strong self-similarity structure

Page 49: Web Search for X-Informatics

Distribution of user queries

• Only 160,000 different keywords in 960,000 queries
• 20% of the keywords (the high-frequency queries) account for 80% of the total visits

  Y(x) = [ Σ from i = 1 to x·m/100 of C(i) ] / [ Σ from j = 1 to m of C(j) ]

where m is the number of distinct keywords and C(i) is the visit count of the i-th most frequent keyword, so Y(x) is the fraction of all searching covered by the top x% of keywords.

[Plot: Y (queries as a fraction of searching) vs. terms (query words as a fraction); roughly the top 0.2 of the terms covers 0.8 of the queries.]
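A small sketch of this coverage computation (the keyword frequencies are made up to be Zipf-like):

```python
def coverage(freqs: list[int], top_percent: float) -> float:
    """Fraction of all query traffic covered by the top x% of keywords."""
    freqs = sorted(freqs, reverse=True)             # C(1) >= C(2) >= ...
    k = max(1, int(len(freqs) * top_percent / 100))
    return sum(freqs[:k]) / sum(freqs)

# A skewed toy distribution over 10 keywords (sums to 1000).
counts = [500, 250, 125, 60, 30, 15, 8, 6, 4, 2]
print(coverage(counts, 20))  # top 20% of keywords -> 0.75 of the traffic
```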

Page 50: Web Search for X-Informatics

Towards high performance

• Query caching improves system performance dramatically (a sketch follows below)
  – more than 70% of user queries can be satisfied in less than 1 millisecond
  – almost all queries are answered in 1 second
• User behavior may also be used for other purposes
  – evaluation of various ranking metrics, e.g., the link popularity and replica popularity of a URL have a positive influence on its importance
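Given the skewed query distribution above, even a modest in-memory cache keyed on the query string captures most of the traffic. A minimal sketch (the capacity and the LRU eviction policy are assumptions; the slides do not describe WebGather's actual cache design):

```python
from collections import OrderedDict

class QueryCache:
    """LRU cache mapping a query string to its result list."""

    def __init__(self, capacity: int = 10_000):
        self.capacity = capacity
        self.entries: OrderedDict[str, list] = OrderedDict()

    def get(self, query: str):
        if query in self.entries:
            self.entries.move_to_end(query)   # mark as recently used
            return self.entries[query]        # the sub-millisecond path
        return None                           # miss: fall through to searcher

    def put(self, query: str, results: list) -> None:
        self.entries[query] = results
        self.entries.move_to_end(query)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
```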

Page 51: Web Search for X-Informatics

Our approach

• Parallel and distributed processing: reach for large scale and scalability
• User behavior analysis: yields mechanisms for performance
• Making use of the content of web pages: hints at innovative algorithms for quality

Page 52: Web Search for X-Informatics

Towards good quality

• Do not miss the important pages: keep the recall rate high
• Clever algorithm for removing near replicas: better accuracy
• New metrics to evaluate pages' relevance: improved ranking
  – anchor-text based, instead of PageRank based

Page 53: Web Search for X-Informatics

Fetch the "important" pages first

• crawling is normally done within a time frame, so not missing important pages is a practical issue for guaranteeing good search quality later on
• besides picking "good" seed URLs, we use a formula to determine the importance of a page

Page 54: Web Search for X-Informatics

Removing near-replicas

• vector based vs. fingerprint based (a sketch of the vector-based comparison follows below)
• Url1 (http://www.a.com/index.html) word frequencies: computer 45, network 33, server 9, …
• Url2 (http://www.b.com/gbindex.html) word frequencies: computer 45, network 30, server 16, …

[Diagram: Url1 and Url2 plotted as vectors in the (computer, network, server) term space; the pages count as near-replicas when the difference between the vectors is a tiny fraction of their magnitudes, with a threshold of the form 3/(a+b) < 0.01 in the figure.]
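A minimal sketch of the vector-based comparison (the distance measure here, the summed per-term difference relative to the total word mass, is an assumption in the spirit of the figure's threshold; the slide does not give the exact formula):

```python
def near_replicas(freq1: dict[str, int], freq2: dict[str, int],
                  threshold: float = 0.01) -> bool:
    """Compare two pages by their word-frequency vectors: treat them as
    near-replicas if the summed per-term difference is a tiny fraction
    of the combined word mass of both pages."""
    terms = set(freq1) | set(freq2)
    diff = sum(abs(freq1.get(t, 0) - freq2.get(t, 0)) for t in terms)
    total = sum(freq1.values()) + sum(freq2.values())
    return diff / total < threshold

url1 = {"computer": 45, "network": 33, "server": 9}
url2 = {"computer": 45, "network": 30, "server": 16}
print(near_replicas(url1, url2))  # False: the vectors differ by ~6% of mass
```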

Page 55: Web Search for X-Informatics

Related work

• Harvest
  – good academic ideas, but a complicated design; not sustained
• Google
  – the most famous search engine in the world at the moment, but little exposure of the technology used after 1998 (Brin, 1998, WWW-7)
  – character-based, instead of word-based, Chinese processing?
  – more hardware than necessary (10,000 PCs were reported)?

Page 56: Web Search for X-Informatics

17 Most Popular Day Time Queries

[Bar chart: the top 17 query words on Tianwang in ordinary (non-holiday) time, out of 845,113 queries over half a year; the y-axis is the number of visits, up to ~140,000. Terms: travel agency (旅行社), hotels (宾馆饭店), search (搜索), police (警察), lover (情人), travel and transportation (旅游交通), download (下载), sex, love at first sight (一见钟情), mp3, pictures (图片), oicq, Beijing (北京), pornography (色情), people (人民), proxy server (代理服务器), movies (电影).]

Page 57: Web Search for X-Informatics

10 Most Popular Day Time Queries, 70%

[Pie chart: the share of total query volume taken by the top ten query words – travel agency (旅行社) 16%, hotels (宾馆饭店) 15%, search (搜索) 9%, police (警察) 7%, lover (情人) 6%, travel and transportation (旅游交通) 6%, download (下载) 4%, sex 4%, love at first sight (一见钟情) 2%, mp3 1%, others (其它) 30%.]

Page 58: Web Search for X-Informatics

11 Most Popular Leisure Time Queries

[Bar chart: the leading holiday-time query words, out of 245,704 queries over half a year; counts up to ~50,000. Terms: sex and romance (色与情), law (法律), download (下载), entertainment (娱乐), search (搜索), pictures (图片), Beijing (北京), oicq, proxy server (代理服务器), entropy (熵), cloning and ethics (克隆与伦理).]

Page 59: Web Search for X-Informatics

edu access vs non-edu access – we may have a lot to say about the curve!

[Line chart: daily query counts from 2000-11-02 to 2001-04-05, up to ~160,000 per day, with three curves: total (总计), education network (教育网), and non-education network (非教育网).]