Practical considerations for a web-scale search engine Michael Isard Microsoft Research Silicon Valley


Page 1:

Practical considerations for a web-scale search engine

Michael Isard

Microsoft Research Silicon Valley

Page 2:

Search and research

• Lots of research motivated by web search
  – Explore specific research questions
  – Small to moderate scale

• A few large-scale production engines
  – Many additional challenges
  – Not all purely algorithmic/technical

• What are the extra constraints for a production system?

Page 3:

Production search engines

• Scale up
  – Tens of billions of web pages, images, etc.
  – Tens of thousands to millions of computers

• Geographic distribution
  – For performance and reliability

• Continuous crawling and serving
  – No downtime, need fresh results

• Long-term test/maintenance
  – Simplicity a core goal

Page 4:

Disclaimer

• Not going to describe any particular web-scale search engine
  – No detailed public description of any engine

• But, general principles apply

Page 5:

Outline

• Anatomy of a search engine

• Query serving

• Link-based ranking

• Index generation

Page 6:

Structure of a search engine

[Diagram: components of a search engine — the Web, document crawling, index building, link structure analysis, page feature training, ranker training, user behavior analysis, auxiliary answers, and query serving.]

Page 7:

Some index statistics

• Tens of billions of documents
  – Each document contains thousands of terms
  – Plus metadata
  – Plus snippet information

• Billions of unique terms
  – Serial numbers, etc.

• Hundreds of billions of nodes in web graph

• Latency a few ms on average
  – Well under a second worst-case

Page 8:

Query serving pipeline

[Diagram: queries from the Web pass through front-end web servers, caches, etc. to banks of index servers.]

Page 9:

Page relevance

• Query-dependent component
  – Query/document match, user metadata, etc.

• Query-independent component
  – Document rank, spam score, click rate, etc.

• Ranker needs:
  – Term frequencies and positions
  – Document metadata
  – Near-duplicate information
  – …

Page 10:

Single-box query outline

Query: Hello world + {EN-US,…}

term → posting list (doc.position):
  a      1.2, 1.10, 1.16, …, 1040.23, …
  hello  3.76, …, 45.48, …, 1125.3, …
  world  7.12, …, 45.29, …, 1125.4, …

Documents matching both terms: (45.48, 45.29), (1125.3, 1125.4), …

doc → metadata:
  1     foo.com/bar, EN-US, …
  45    go.com/hw.txt, EN-US, …
  1125  bar.com/a.html, EN-US, …

doc → snippet data:
  1  "once a week …"
  …

Ranker orders the matches (1125.3, 45.48, …) and returns the query results.
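A minimal sketch of the lookup above, in Python with the slide's toy data; the dictionary layout, the conjunctive match, and passing positions on to the ranker are assumptions about how such a single-box index might be organized.

```python
# Toy version of the single-box outline above: postings map a term to
# (doc_id, position) pairs; metadata is keyed by doc_id.  Layout is assumed.
postings = {
    "hello": [(3, 76), (45, 48), (1125, 3)],
    "world": [(7, 12), (45, 29), (1125, 4)],
}
metadata = {
    1: ("foo.com/bar", "EN-US"),
    45: ("go.com/hw.txt", "EN-US"),
    1125: ("bar.com/a.html", "EN-US"),
}

def conjunctive_matches(term_a, term_b):
    """Documents containing both terms, with positions kept for the ranker."""
    positions_b = {}
    for doc, pos in postings[term_b]:
        positions_b.setdefault(doc, []).append(pos)
    return [(doc, pos_a, pos_b)
            for doc, pos_a in postings[term_a]
            for pos_b in positions_b.get(doc, [])]

for doc, p_hello, p_world in conjunctive_matches("hello", "world"):
    url, lang = metadata[doc]
    print(doc, url, lang, (p_hello, p_world))   # (45, 48, 29), (1125, 3, 4)
```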

Page 11:

Query statistics

• Small number of terms (fewer than 10)

• Posting lists length 1 to 100s of millions
  – Most terms occur once

• Potentially millions of documents to rank
  – Response is needed in a few ms
  – Tens of thousands of near duplicates
  – Sorting documents by QI rank may help (sketch below)

• Tens or hundreds of snippets
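A minimal sketch (hypothetical lists and cutoff) of why storing posting lists in descending query-independent (QI) rank order helps meet the latency budget: the matcher can stop once it has collected enough high-rank candidates instead of scanning lists that may be hundreds of millions of entries long.

```python
# Sketch: posting lists kept in descending query-independent (QI) rank order
# let the matcher stop early rather than scan to the end of very long lists.
# The lists and cutoff below are illustrative assumptions.

def first_k_matches(postings_sorted_by_qi, other_term_docs, k=100):
    """postings_sorted_by_qi: doc ids in descending QI-rank order.
    other_term_docs: set of doc ids containing the query's other term."""
    results = []
    for doc in postings_sorted_by_qi:
        if doc in other_term_docs:
            results.append(doc)
            if len(results) == k:
                break            # every remaining doc has lower QI rank
    return results

# e.g. first_k_matches(hello_list, set(world_docs), k=100)
```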

Page 12:

Distributed index structure

• Tens of billions of documents

• Thousands of queries per second

• Index is constantly updated
  – Most pages turn over in at most a few weeks
  – Some very quickly (news sites)
  – Almost every page is never returned

How to distribute?

Page 13:

Distributed index: split by term

• Each computer stores a subset of terms

• Each query goes only to a few computers

• Document metadata stored separately

[Diagram: query "Hello world" + {EN-US,…} goes to a ranker, which consults term-partitioned index servers (A-G, H-M, N-S, T-Z) and separate metadata servers.]

Page 14:

Split by term: pros

• Short queries only touch a few computers
  – With high probability all are working

• Long posting lists improve compression
  – Most words occur many times in corpus

Page 15:

Split by term: cons (1)

• Must ship posting lists across network
  – Multi-term queries make things worse
  – But maybe pre-computing can help?
    • Intersections of lists for common pairs of terms
    • Needs to work with constantly updating index

• Extra network roundtrip for doc metadata
  – Too expensive to store in every posting list

• Where does the ranker run?
  – Hundreds of thousands of ranks to compute

Page 16:

Split by term: cons (2)

• Front-ends must map terms to computers (sketch below)
  – Simple hashing may be too unbalanced
  – Some terms may need to be split/replicated
    • Long posting lists
    • “Hot” posting lists

• Sorting by QI rank is a global operation
  – Needs to work with index updates
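A minimal sketch of the term-to-computer mapping problem on this slide: a plain hash assigns each term to one server, while hot or very long posting lists need explicit splitting or replication. The server count and the override table are hypothetical illustrations.

```python
# Sketch of a front-end's term -> index-server mapping in a split-by-term design.
# A plain hash spreads the vocabulary evenly by *count*, but not by posting-list
# length or query frequency, so some terms need explicit splits or replicas.
# The server count and override table below are hypothetical.

import hashlib

NUM_SERVERS = 64

OVERRIDES = {
    "the": [3, 17, 41],        # very long posting list, split across servers
    "facebook": [5, 22, 60],   # "hot" term, replicated for query throughput
}

def servers_for_term(term: str) -> list[int]:
    if term in OVERRIDES:
        return OVERRIDES[term]
    digest = hashlib.md5(term.encode("utf-8")).digest()
    return [int.from_bytes(digest[:4], "big") % NUM_SERVERS]
```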

Page 17:

Distributed index: split by document

• Each computer stores a subset of docs

• Each query goes to many computers

• Document metadata stored inline

[Diagram: query "Hello world" + {EN-US,…} goes to an aggregator, which fans it out to document-partitioned index servers (docs 1-1000, 1001-2000, 2001-3000, 3001-4000), each running its own ranker.]
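A minimal sketch of the scatter/gather path in the diagram above: each document partition ranks only its local documents and returns a small top-k list, and the aggregator merges a few hundred scored results rather than millions of candidates. The partition layout and the scoring function are placeholder assumptions.

```python
# Sketch of the split-by-document query path: per-partition ranking, then a
# cheap merge of the per-partition top-k lists at the aggregator.
# Partition contents and the scoring function are illustrative assumptions.

import heapq

def score(query, doc):
    # Placeholder ranker: count of query terms appearing in the document.
    terms = doc.split()
    return sum(terms.count(t) for t in query.split())

def rank_partition(query, partition_docs, k=10):
    """Run the ranker locally over one partition's documents."""
    scored = ((score(query, doc), doc_id) for doc_id, doc in partition_docs.items())
    return heapq.nlargest(k, scored)

def aggregate(query, partitions, k=10):
    """Fan the query out to every partition, then merge the per-partition top-k."""
    per_partition = [rank_partition(query, docs, k) for docs in partitions]
    return heapq.nlargest(k, (hit for hits in per_partition for hit in hits))
```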

Page 18:

Split by document: pros

• Ranker on same computer as document
  – All data for a given doc in the same place
  – Ranker computation is distributed

• Can get low latency

• Sorting by QI rank local to each computer

• Only ranks+scores need to be aggregated
  – Hundreds of results, not millions

Page 19:

Split by document: cons

• A query touches hundreds of computers
  – One slow computer makes query slow (arithmetic below)
  – Computers per query is linear in corpus size
  – But query speeds are not i.i.d.

• Shorter posting lists: worse compression
  – Each word split into many posting lists
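Back-of-the-envelope arithmetic for the first con above: if each of the N computers a query touches were independently slow with probability p, the query would be slow with probability 1 - (1 - p)^N. The slide's caveat that query speeds are not i.i.d. means this is only an illustration of the trend, not a model.

```python
# Why one slow computer matters when a query fans out to many: the query is
# slow whenever at least one of the N servers it touches is slow.
# (Real latencies are not i.i.d.; treat this as a back-of-the-envelope bound.)

def p_query_slow(p_server_slow: float, n_servers: int) -> float:
    return 1.0 - (1.0 - p_server_slow) ** n_servers

print(p_query_slow(0.01, 1))    # 0.01   - a single server
print(p_query_slow(0.01, 100))  # ~0.63  - query fanned out to 100 servers
print(p_query_slow(0.01, 400))  # ~0.98
```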

Page 20:

Index replication

• Multiple copies of each partition
  – Needed for redundancy, performance

• Makes things more complicated
  – Can mitigate latency variability
    • Ask two replicas, one will probably return quickly (sketch below)
  – Interacts with data layout
    • Split by document may be simpler

• Consistency may not be essential
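A minimal sketch of the "ask two replicas" idea above: issue the same request to two replicas of a partition and take whichever answers first. The helper names and replica list are hypothetical stand-ins, and a production system would also cancel or abandon the slower call rather than let it run on.

```python
# Sketch of a hedged request: send the same query to two replicas of the same
# partition and use whichever answers first.  query_replica is a hypothetical
# callable; here the pool simply waits for the slower call when it shuts down,
# whereas a real system would cancel or abandon it.

from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def hedged_query(query, replicas, query_replica):
    with ThreadPoolExecutor(max_workers=len(replicas)) as pool:
        futures = [pool.submit(query_replica, replica, query) for replica in replicas]
        done, _pending = wait(futures, return_when=FIRST_COMPLETED)
        return next(iter(done)).result()   # first replica to respond wins
```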

Page 21:

Splitting: word vs document

• Original Google paper split by word

• All major engines split by document now?
  – Tens of microseconds to rank a document

Page 22:

Link-based ranking

• Intuition: “quality” of a page is reflected somehow in the link structure of the web

• Made famous by PageRank
  – Can be seen as stationary distribution of a random walk on the web graph
  – Google’s original advantage over AltaVista?
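A minimal power-iteration sketch of PageRank as the stationary distribution of a random walk with restarts, on a made-up three-page graph; the damping factor 0.85 and iteration count are the usual conventions, and production computations run over a graph with hundreds of billions of nodes.

```python
# Power iteration for PageRank on a toy graph (out_links: page -> linked pages).
# Each step redistributes rank along out-links, with a (1 - damping) restart.

def pagerank(out_links, damping=0.85, iterations=50):
    nodes = list(out_links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, targets in out_links.items():
            if not targets:                      # dangling page: spread evenly
                share = damping * rank[node] / n
                for t in nodes:
                    new_rank[t] += share
            else:
                share = damping * rank[node] / len(targets)
                for t in targets:
                    new_rank[t] += share
        rank = new_rank
    return rank

print(pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]}))
```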

Page 23:

Some hints

• PageRank is (no longer) very important

• Anchor text contains similar information
  – BM25F includes a lot of link structure

• Query-dependent link features may be useful

Page 24:

[Bar chart: NDCG@10 of ranking features, from “Comparing the Effectiveness of HITS and SALSA”, M. Najork, CIKM 2007. BM25F scores highest (~0.22), followed by SALSA authority variants (~0.12-0.16), in-degree and HITS authority variants (~0.10-0.11), and PageRank (~0.09); hub and out-degree variants score ~0.03-0.04, and a random ordering ~0.01.]

Page 25:

Query-dependent link features

[Diagram: nodes A-N forming a query-dependent link neighborhood graph around the result set.]

Page 26:

Real-time QD link information

• Lookup of neighborhood graph

• Followed by SALSA

• In a few ms

Seems like a good topic for approximation/learning
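A minimal sketch of SALSA authority scores over a small query-dependent neighborhood graph; the toy graph is a stand-in for the neighborhood looked up at query time, and a real implementation has to finish within a few milliseconds.

```python
# SALSA authority scores by power iteration on a small neighborhood graph.
# out_links: page -> pages it links to (a toy query-dependent neighborhood).

def salsa_authority(out_links, iterations=30):
    in_links = {}
    for src, targets in out_links.items():
        for dst in targets:
            in_links.setdefault(dst, []).append(src)
    nodes = set(out_links) | set(in_links)
    hub = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Alternate the walk: hubs push weight forward along out-links,
        # authorities push it back along in-links.
        auth = {n: sum(hub[src] / len(out_links[src]) for src in in_links.get(n, []))
                for n in nodes}
        hub = {n: sum(auth[dst] / len(in_links[dst]) for dst in out_links.get(n, []))
               for n in nodes}
    total = sum(auth.values()) or 1.0
    return {n: score / total for n, score in auth.items()}

print(salsa_authority({"a": ["c", "d"], "b": ["c"], "c": ["d"], "d": []}))
```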

Page 27:

Index building

• Catch-all term
  – Create inverted files
  – Compute document features
  – Compute global link-based statistics
  – Which documents to crawl next?
  – Which crawled documents to put in the index?

• Consistency may be needed here

Page 28:

Index lifecycle

[Diagram: index lifecycle — page crawling from the Web, index selection, query serving, and usage analysis feeding back into crawling.]

Page 29:

Experimentation

• A/B testing is best (sketch below)
  – Ranking, UI, etc.
  – Immediate feedback on what works
  – Can be very fine-grained (millions of queries)

• Some things are very hard
  – Index selection, etc.
  – Can run parallel build processes

• Long time constants: not easy to do brute force
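A minimal sketch of the fine-grained A/B assignment mentioned above: hash a stable id into a bucket so each user (or query) consistently sees one variant. The experiment name, bucket count, and traffic split are hypothetical.

```python
# Deterministic A/B bucketing: the same id always lands in the same bucket for a
# given experiment, so a user consistently sees one ranking/UI variant.
# Experiment names, bucket count and traffic split are illustrative.

import hashlib

NUM_BUCKETS = 1000

def bucket(user_id: str, experiment: str) -> int:
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_BUCKETS

def variant(user_id: str) -> str:
    # e.g. give 1% of traffic the experimental ranker, keep the rest on control.
    return "new_ranker" if bucket(user_id, "ranker_v2") < 10 else "control"
```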

Page 30:

Implementing new features

• Document-specific features much “cheaper”
  – Spam probability, duplicate fingerprints, language

• Global features can be done, but with a higher bar
  – Distribute anchor text
  – PageRank et al.

• Danger of “butterfly effect” on system as a whole

Page 31:

Distributing anchor text

[Diagram: crawlers ship anchor text across the network to the indexers responsible for the linked-to documents, each indexer owning one doc range (e.g. f0-ff).]
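A minimal sketch of the routing in the diagram above: a crawler parsing a page emits (target URL, anchor text) pairs and ships each to the indexer partition that owns the target document, here by hex prefix of a URL hash like the slide's f0-ff range. The partitioning details are assumptions.

```python
# Route anchor text to the indexer that owns the *target* document, using the
# first byte of a URL hash (so partition 15 owns the f0-ff prefix range).
# Partition count and routing scheme are illustrative assumptions.

import hashlib
from collections import defaultdict

NUM_PARTITIONS = 16

def partition_for(url: str) -> int:
    return hashlib.md5(url.encode()).digest()[0] * NUM_PARTITIONS // 256

def route_anchors(links):
    """links: iterable of (target_url, anchor_text) found on a crawled page."""
    outgoing = defaultdict(list)           # partition -> anchor records to ship
    for target_url, anchor_text in links:
        outgoing[partition_for(target_url)].append((target_url, anchor_text))
    return outgoing
```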

Page 32:

Distributed infrastructure

• Things are improving
  – Large scale partitioned file systems
    • Files commonly contain many TB of data
    • Accessed in parallel
  – Large scale data-mining platforms
  – General-purpose data repositories

• Data-centric
  – Traditional supercomputing is cycle-centric

Page 33:

Software engineering

• Simple always wins

• Hysteresis
  – Prove a change will improve things
    • Big improvement needed to justify big change
  – Experimental platforms are essential

Page 34:

Summary

• Search engines are big and complicated

• Some things are easier to change than others

• Harder changes need more convincing experiments

• Small datasets are not good predictors for large datasets

• Systems/learning may need to collaborate