Practical considerations for a web-scale search engine
Michael Isard
Microsoft Research Silicon Valley
Search and research
• Lots of research motivated by web search
– Explore specific research questions
– Small to moderate scale
• A few large-scale production engines
– Many additional challenges
– Not all purely algorithmic/technical
• What are the extra constraints for a production system?
Production search engines
• Scale up
– Tens of billions of web pages, images, etc.
– Tens of thousands to millions of computers
• Geographic distribution
– For performance and reliability
• Continuous crawling and serving
– No downtime, need fresh results
• Long-term test/maintenance
– Simplicity a core goal
Disclaimer
• Not going to describe any particular web-scale search engine
– No detailed public description of any engine
• But, general principles apply
Outline
• Anatomy of a search engine
• Query serving
• Link-based ranking
• Index generation
Structure of a search engine
[Diagram: components of a search engine — the Web, document crawling, index building, link structure analysis, page feature training, ranker training, auxiliary answers, user behavior analysis, and query serving]
Some index statistics
• Tens of billions of documents
– Each document contains thousands of terms
– Plus metadata
– Plus snippet information
• Billions of unique terms
– Serial numbers, etc.
• Hundreds of billions of nodes in web graph
• Latency a few ms on average
– Well under a second worst-case
Query serving pipeline
[Diagram: queries from the Web pass through replicated front-end web servers, caches, etc. to banks of index servers]
Page relevance
• Query-dependent component
– Query/document match, user metadata, etc.
• Query-independent component
– Document rank, spam score, click rate, etc.
• Ranker needs:
– Term frequencies and positions
– Document metadata
– Near-duplicate information
– …
Single-box query outline
Query: Hello world + {EN-US,…}

term   posting list
a      1.2, 1.10, 1.16, …, 1040.23, …, …
hello  3.76, …, 45.48, …, 1125.3, …, …
world  7.12, …, 45.29, …, 1125.4, …

Matching positions: (45.48, 45.29), (1125.3, 1125.4), …

doc    metadata
1      foo.com/bar, EN-US, …
45     go.com/hw.txt, EN-US, …
1125   bar.com/a.html, EN-US, …

doc    snippet data
1      “once a week …”
…

Ranker output: 1125.3, 45.48, … → Results
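A minimal sketch of the single-box flow above: intersect the query terms' posting lists, combine a query-dependent proximity score with a query-independent placeholder, and return ranked documents with their metadata. The data and names (POSTINGS, DOC_META, rank_doc) are toy stand-ins for the slide's example, not any engine's real structures.

```python
# Toy single-box query evaluation, following the slide's "Hello world" example.
# POSTINGS, DOC_META and rank_doc are illustrative stand-ins only.

from collections import defaultdict

# term -> list of (doc_id, position)
POSTINGS = {
    "hello": [(3, 76), (45, 48), (1125, 3)],
    "world": [(7, 12), (45, 29), (1125, 4)],
}

# doc_id -> (url, language)
DOC_META = {
    45: ("go.com/hw.txt", "EN-US"),
    1125: ("bar.com/a.html", "EN-US"),
}

def candidate_docs(terms):
    """Docs containing every query term: intersect the posting lists."""
    per_term = []
    for t in terms:
        positions = defaultdict(list)
        for doc_id, pos in POSTINGS.get(t, []):
            positions[doc_id].append(pos)
        per_term.append(positions)
    common = set(per_term[0]) if per_term else set()
    for positions in per_term[1:]:
        common &= set(positions)
    return {d: [positions[d] for positions in per_term] for d in common}

def rank_doc(doc_id, positions, user_meta):
    """Query-dependent part (term proximity) plus a query-independent placeholder."""
    firsts = [p[0] for p in positions]
    qd = -(max(firsts) - min(firsts))   # closer terms score higher
    qi = 0.0                            # stand-in for doc rank, spam score, ...
    lang_bonus = 1.0 if DOC_META[doc_id][1] in user_meta else 0.0
    return qd + qi + lang_bonus

def serve_query(query, user_meta):
    terms = query.lower().split()
    cands = candidate_docs(terms)
    ranked = sorted(cands, key=lambda d: rank_doc(d, cands[d], user_meta), reverse=True)
    return [(d, DOC_META[d][0]) for d in ranked]

# Doc 1125 (adjacent terms) ranks above doc 45.
print(serve_query("Hello world", {"EN-US"}))
```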
Query statistics
• Small number of terms (fewer than 10)
• Posting lists length 1 to 100s of millions
– Most terms occur once
• Potentially millions of documents to rank
– Response is needed in a few ms
– Tens of thousands of near duplicates
– Sorting documents by QI rank may help
• Tens or hundreds of snippets
Distributed index structure
• Tens of billions of documents
• Thousands of queries per second
• Index is constantly updated
– Most pages turn over in at most a few weeks
– Some very quickly (news sites)
– Almost every page is never returned
How to distribute?
Distributed index: split by term
• Each computer stores a subset of terms
• Each query goes only to a few computers
• Document metadata stored separately
[Diagram: terms partitioned alphabetically across index servers (A-G, H-M, N-S, T-Z); the query “Hello world + {EN-US,…}” fans out to the servers holding its terms, and the ranker consults separate metadata servers]
Split by term: pros
• Short queries only touch a few computers
– With high probability all are working
• Long posting lists improve compression
– Most words occur many times in corpus
Split by term: cons (1)
• Must ship posting lists across network
– Multi-term queries make things worse
– But maybe pre-computing can help?
• Intersections of lists for common pairs of terms
• Needs to work with constantly updating index
• Extra network roundtrip for doc metadata
– Too expensive to store in every posting list
• Where does the ranker run?
– Hundreds of thousands of ranks to compute
Split by term: cons (2)
• Front-ends must map terms to computers
– Simple hashing may be too unbalanced
– Some terms may need to be split/replicated
• Long posting lists
• “Hot” posting lists
• Sorting by QI rank is a global operation
– Needs to work with index updates
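One possible shape of the term-to-computer mapping, as a sketch: hash most terms to a shard, with explicit overrides that split very long posting lists or replicate hot ones. The shard count and the OVERRIDES table are made up for illustration.

```python
# Sketch of mapping terms to index servers in a split-by-term layout.
# Plain hashing can leave shards unbalanced, so long or "hot" posting lists
# get explicit overrides. Shard count and the OVERRIDES table are made up.

import hashlib

NUM_SHARDS = 4

# term -> shards its posting list is split or replicated over
OVERRIDES = {
    "the": [0, 1, 2, 3],    # very long posting list: split across all shards
    "hello": [1, 2],        # "hot" term: replicated for query throughput
}

def shards_for_term(term):
    if term in OVERRIDES:
        return OVERRIDES[term]
    h = int(hashlib.md5(term.encode()).hexdigest(), 16)
    return [h % NUM_SHARDS]

for t in ("hello", "world", "the"):
    print(t, shards_for_term(t))
```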
Distributed index: split by document
• Each computer stores a subset of docs
• Each query goes to many computers
• Document metadata stored inline
[Diagram: documents partitioned across index servers (docs 1-1000, 1001-2000, 2001-3000, 3001-4000), each running its own ranker; the query “Hello world + {EN-US,…}” fans out to every partition and an aggregator merges the per-partition results]
Split by document: pros
• Ranker on same computer as document
– All data for a given doc in the same place
– Ranker computation is distributed
• Can get low latency
• Sorting by QI rank local to each computer
• Only ranks+scores need to be aggregated
– Hundreds of results, not millions
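A sketch of the scatter/gather pattern this implies, assuming each partition can rank its own documents with a local score() stand-in: every index server returns a small top-k and the aggregator merges only those short lists.

```python
# Scatter/gather over a document-partitioned index: each partition ranks its
# own documents and returns a small top-k; the aggregator merges only scores.
# The partitions and score() are placeholders for a real index and ranker.

import heapq

def score(query, doc_id, text):
    # Stand-in for the real ranker (query-dependent + query-independent parts).
    return sum(text.count(t) for t in query.split())

def rank_partition(query, docs, k=10):
    """Runs on one index server: score local docs, return top-k (score, doc_id)."""
    return heapq.nlargest(k, ((score(query, d, text), d) for d, text in docs.items()))

def aggregate(partials, k=10):
    """Runs on the aggregator: merge hundreds of short lists into the final top-k."""
    return heapq.nlargest(k, (hit for partial in partials for hit in partial))

partitions = [
    {1: "hello world hello", 2: "world news"},
    {1001: "hello there"},
    {2001: "goodbye world", 2002: "hello hello world"},
]
partials = [rank_partition("hello world", p, k=2) for p in partitions]
print(aggregate(partials, k=3))   # [(3, 2002), (3, 1), (1, 2001)]
```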
Split by document: cons
• A query touches hundreds of computers
– One slow computer makes query slow
– Computers per query is linear in corpus size
– But query speeds are not iid
• Shorter posting lists: worse compression
– Each word split into many posting lists
Index replication
• Multiple copies of each partition
– Needed for redundancy, performance
• Makes things more complicated
– Can mitigate latency variability
• Ask two replicas, one will probably return quickly
– Interacts with data layout
• Split by document may be simpler
• Consistency may not be essential
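The “ask two replicas” idea can be sketched as request hedging; here asyncio and a random sleep stand in for real RPCs to real index servers.

```python
# Request hedging sketch: send the same request to two replicas of a
# partition and take whichever answers first, cancelling the other.
# asyncio plus a random sleep stands in for real RPCs to index servers.

import asyncio, random

async def query_replica(partition, replica, query):
    await asyncio.sleep(random.uniform(0.001, 0.050))   # simulated (sometimes slow) server
    return f"results for {query!r} from partition {partition}, replica {replica}"

async def hedged_query(partition, query):
    tasks = [asyncio.create_task(query_replica(partition, r, query)) for r in (0, 1)]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for t in pending:
        t.cancel()
    return done.pop().result()

print(asyncio.run(hedged_query(7, "hello world")))
```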
Splitting: word vs document
• Original Google paper split by word
• All major engines split by document now?
– Tens of microseconds to rank a document
Link-based ranking
• Intuition: “quality” of a page is reflected somehow in the link structure of the web
• Made famous by PageRank
– Can be seen as stationary distribution of a random walk on the web graph
– Google’s original advantage over AltaVista?
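A small power-iteration sketch of that view of PageRank — the stationary distribution of a random walk with teleportation — on a made-up four-node graph; at web scale this runs distributed over hundreds of billions of nodes.

```python
# Power-iteration sketch of PageRank: the stationary distribution of a random
# walk with teleportation. The four-node graph is made up; at web scale this
# runs distributed over hundreds of billions of nodes.

def pagerank(links, damping=0.85, iters=50):
    """links: node -> list of out-neighbors."""
    nodes = list(links)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        new = {u: (1 - damping) / n for u in nodes}
        for u, outs in links.items():
            if outs:
                for v in outs:
                    new[v] += damping * rank[u] / len(outs)
            else:                                  # dangling node: spread rank uniformly
                for v in nodes:
                    new[v] += damping * rank[u] / n
        rank = new
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
print(pagerank(graph))   # "c" ends up with the largest rank
```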
Some hints
• PageRank is (no longer) very important
• Anchor text contains similar information
– BM25F includes a lot of link structure
• Query-dependent link features may be useful
[Chart: NDCG@10 of individual ranking features]
bm25f             .221
salsa-aut-id-3    .158
salsa-aut-ih-8    .156
salsa-aut-all-100 .121
degree-in-id      .106
degree-in-ih      .105
hits-aut-id-25    .104
hits-aut-ih-100   .102
degree-in-all     .095
pagerank          .092
hits-aut-all-100  .090
hits-hub-all-100  .038
hits-hub-ih-100   .036
hits-hub-id-100   .035
salsa-hub-all-100 .034
degree-out-all    .034
salsa-hub-ih-8    .033
salsa-hub-id-3    .033
degree-out-ih     .032
degree-out-id     .032
random            .011
Comparing the Effectiveness of HITS and SALSA, M. Najork, CIKM 2007
Query-dependent link features
[Diagram: neighborhood graph of pages A–N around a result set]
Real-time QD link information
• Lookup of neighborhood graph
• Followed by SALSA
• In a few ms
Seems like a good topic for approximation/learning
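A sketch of that real-time step, assuming the neighborhood graph has already been looked up as a hub-to-authority adjacency list: run the SALSA authority walk (step back along a link, then forward) by power iteration. The tiny graph is illustrative only.

```python
# SALSA authority scores on a small neighborhood graph, by power iteration:
# from an authority, step back to one of its in-linking hubs (uniformly),
# then forward to one of that hub's authorities. Graph is illustrative only.

def salsa_authorities(links, iters=50):
    """links: hub -> list of authorities it points to."""
    in_links = {}
    for hub, outs in links.items():
        for auth in outs:
            in_links.setdefault(auth, []).append(hub)
    auths = list(in_links)
    score = {a: 1.0 / len(auths) for a in auths}
    for _ in range(iters):
        new = dict.fromkeys(auths, 0.0)
        for a in auths:
            for hub in in_links[a]:
                share = score[a] / len(in_links[a]) / len(links[hub])
                for a2 in links[hub]:
                    new[a2] += share
        score = new
    return score

# Tiny neighborhood graph around a result set (letters stand in for pages).
neighborhood = {"A": ["E", "F"], "B": ["F", "G"], "C": ["G"], "D": ["G", "H"]}
print(salsa_authorities(neighborhood))   # "G" comes out as the strongest authority
```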
Index building
• Catch-all term
– Create inverted files
– Compute document features
– Compute global link-based statistics
– Which documents to crawl next?
– Which crawled documents to put in the index?
• Consistency may be needed here
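The “create inverted files” piece of that catch-all, as a toy in-memory sketch: scan documents, collect (term, doc_id, position) triples, and group them into per-term posting lists. A real builder does this out of core and distributed.

```python
# Toy in-memory version of inverted-file construction: collect
# (term, doc_id, position) triples and group them into posting lists.
# A production builder does this out of core and distributed.

from collections import defaultdict

def build_inverted_file(docs):
    """docs: doc_id -> text. Returns term -> sorted list of (doc_id, position)."""
    postings = defaultdict(list)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            postings[term].append((doc_id, pos))
    return {t: sorted(plist) for t, plist in postings.items()}

index = build_inverted_file({1: "once a week", 45: "hello world", 1125: "hello world again"})
print(index["hello"])   # [(45, 0), (1125, 0)]
```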
Index lifecycle
[Diagram: index lifecycle — the Web, page crawling, index selection, query serving, and usage analysis feeding back into the cycle]
Experimentation
• A/B testing is best
– Ranking, UI, etc.
– Immediate feedback on what works
– Can be very fine-grained (millions of queries)
• Some things are very hard
– Index selection, etc.
– Can run parallel build processes
• Long time constants: not easy to do brute force
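A sketch of how fine-grained A/B assignment is commonly done (bucket counts and names here are illustrative): hash a stable id into buckets so a small slice of live traffic sees the experimental ranker.

```python
# Sketch of fine-grained A/B assignment: hash a stable id (user or query) into
# buckets so a small slice of traffic sees the experimental ranker and the rest
# sees the control. Bucket counts and names are illustrative.

import hashlib

NUM_BUCKETS = 1000
EXPERIMENT_BUCKETS = set(range(10))    # 1% of traffic on the experiment

def bucket(stable_id: str) -> int:
    return int(hashlib.sha1(stable_id.encode()).hexdigest(), 16) % NUM_BUCKETS

def choose_ranker(user_id: str) -> str:
    return "experimental" if bucket(user_id) in EXPERIMENT_BUCKETS else "control"

print(choose_ranker("user-42"))
```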
Implementing new features
• Document-specific features much “cheaper”
– Spam probability, duplicate fingerprints, language
• Global features can be done, but with a higher bar
– Distribute anchor text
– PageRank et al.
• Danger of “butterfly effect” on system as a whole
Distributing anchor text
[Diagram: crawlers feed parallel indexers; each indexer holds a partition of documents (docs f0-ff) together with the anchor text for them]
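A sketch of distributing anchor text in that layout, with a made-up routing function: the machine that parses a page emits (target, anchor text) records, routed by a hash of the target URL so they land on the indexer that owns the target document.

```python
# Sketch of shipping anchor text to the right partition: the machine that
# parses a page emits (target, anchor text) records, routed by a hash of the
# target URL to the indexer that owns the target document. The partition
# count and routing function are made up.

import hashlib
from collections import defaultdict

NUM_INDEXERS = 256   # e.g. partitions keyed by a fingerprint prefix such as f0-ff

def owner(url: str) -> int:
    return int(hashlib.md5(url.encode()).hexdigest(), 16) % NUM_INDEXERS

def emit_anchors(page_url, outlinks):
    """Run where the page was parsed: one routed record per outgoing link."""
    return [(owner(target), target, text) for target, text in outlinks]

def collect(records):
    """Run on each indexer: gather anchor text for the documents it owns."""
    by_target = defaultdict(list)
    for _, target, text in records:
        by_target[target].append(text)
    return dict(by_target)

records = emit_anchors("foo.com/bar", [("go.com/hw.txt", "hello world example")])
print(collect(records))   # {'go.com/hw.txt': ['hello world example']}
```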
Distributed infrastructure
• Things are improving
– Large scale partitioned file systems
• Files commonly contain many TB of data
• Accessed in parallel
– Large scale data-mining platforms
– General-purpose data repositories
• Data-centric
– Traditional supercomputing is cycle-centric
Software engineering
• Simple always wins
• Hysteresis
– Prove a change will improve things
• Big improvement needed to justify big change
– Experimental platforms are essential
Summary
• Search engines are big and complicated
• Some things are easier to change than others
• Harder changes need more convincing experiments
• Small datasets are not good predictors for large datasets
• Systems/learning may need to collaborate