Practical considerations for a web-scale search engine
Michael Isard
Microsoft Research Silicon Valley
Search and research
• Lots of research motivated by web search
– Explore specific research questions
– Small to moderate scale
• A few large-scale production engines
– Many additional challenges
– Not all purely algorithmic/technical
• What are the extra constraints for a production system?
Production search engines
• Scale up
– Tens of billions of web pages, images, etc.
– Tens of thousands to millions of computers
• Geographic distribution
– For performance and reliability
• Continuous crawling and serving
– No downtime, need fresh results
• Long-term test/maintenance
– Simplicity a core goal
Disclaimer
• Not going to describe any particular web-scale search engine
– No detailed public description of any engine
• But, general principles apply
Outline
• Anatomy of a search engine
• Query serving
• Link-based ranking
• Index generation
Structure of a search engine
[Diagram: components of a search engine — the Web, document crawling, index building, link structure analysis, page feature training, ranker training, auxiliary answers, user behavior analysis, and query serving]
Some index statistics
• Tens of billions of documents
– Each document contains thousands of terms
– Plus metadata
– Plus snippet information
• Billions of unique terms
– Serial numbers, etc.
• Hundreds of billions of nodes in web graph
• Latency a few ms on average
– Well under a second worst-case
Query serving pipeline
[Diagram: queries from the Web pass through replicated front-end web servers, caches, etc. to banks of index servers]
Page relevance
• Query-dependent component
– Query/document match, user metadata, etc.
• Query-independent component
– Document rank, spam score, click rate, etc.
• Ranker needs:
– Term frequencies and positions
– Document metadata
– Near-duplicate information
– …
Single-box query outline
Query: Hello world + {EN-US,…}

term   posting list
a      1.2, 1.10, 1.16, …, 1040.23, …, …
hello  3.76, …, 45.48, …, 1125.3, …, …
world  7.12, …, 45.29, …, 1125.4, …

Matching positions: (45.48, 45.29), (1125.3, 1125.4), …

doc    metadata
1      foo.com/bar, EN-US, …
45     go.com/hw.txt, EN-US, …
1125   bar.com/a.html, EN-US, …

doc    snippet data
1      “once a week …”
…

Ranker output: 1125.3, 45.48, … → Results
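A minimal sketch of the single-box flow above: intersect the query terms' posting lists, combine a query-dependent proximity score with a query-independent placeholder, and return ranked documents with their metadata. The data and names (POSTINGS, DOC_META, rank_doc) are toy stand-ins for the slide's example, not any engine's real structures.

```python
# Toy single-box query evaluation, following the slide's "Hello world" example.
# POSTINGS, DOC_META and rank_doc are illustrative stand-ins only.

from collections import defaultdict

# term -> list of (doc_id, position)
POSTINGS = {
    "hello": [(3, 76), (45, 48), (1125, 3)],
    "world": [(7, 12), (45, 29), (1125, 4)],
}

# doc_id -> (url, language)
DOC_META = {
    45: ("go.com/hw.txt", "EN-US"),
    1125: ("bar.com/a.html", "EN-US"),
}

def candidate_docs(terms):
    """Docs containing every query term: intersect the posting lists."""
    per_term = []
    for t in terms:
        positions = defaultdict(list)
        for doc_id, pos in POSTINGS.get(t, []):
            positions[doc_id].append(pos)
        per_term.append(positions)
    common = set(per_term[0]) if per_term else set()
    for positions in per_term[1:]:
        common &= set(positions)
    return {d: [positions[d] for positions in per_term] for d in common}

def rank_doc(doc_id, positions, user_meta):
    """Query-dependent part (term proximity) plus a query-independent placeholder."""
    firsts = [p[0] for p in positions]
    qd = -(max(firsts) - min(firsts))   # closer terms score higher
    qi = 0.0                            # stand-in for doc rank, spam score, ...
    lang_bonus = 1.0 if DOC_META[doc_id][1] in user_meta else 0.0
    return qd + qi + lang_bonus

def serve_query(query, user_meta):
    terms = query.lower().split()
    cands = candidate_docs(terms)
    ranked = sorted(cands, key=lambda d: rank_doc(d, cands[d], user_meta), reverse=True)
    return [(d, DOC_META[d][0]) for d in ranked]

# Doc 1125 (adjacent terms) ranks above doc 45.
print(serve_query("Hello world", {"EN-US"}))
```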
Query statistics
• Small number of terms (fewer than 10)
• Posting lists length 1 to 100s of millions
– Most terms occur once
• Potentially millions of documents to rank
– Response is needed in a few ms
– Tens of thousands of near duplicates
– Sorting documents by QI rank may help
• Tens or hundreds of snippets
Distributed index structure
• Tens of billions of documents
• Thousands of queries per second
• Index is constantly updated
– Most pages turn over in at most a few weeks
– Some very quickly (news sites)
– Almost every page is never returned
How to distribute?
Distributed index: split by term
• Each computer stores a subset of terms
• Each query goes only to a few computers
• Document metadata stored separately
[Diagram: terms partitioned alphabetically across index servers (A-G, H-M, N-S, T-Z); the query “Hello world + {EN-US,…}” fans out to the servers holding its terms, and the ranker consults separate metadata servers]
Split by term: pros
• Short queries only touch a few computers
– With high probability all are working
• Long posting lists improve compression
– Most words occur many times in corpus
Split by term: cons (1)
• Must ship posting lists across network
– Multi-term queries make things worse
– But maybe pre-computing can help?
• Intersections of lists for common pairs of terms
• Needs to work with constantly updating index
• Extra network roundtrip for doc metadata
– Too expensive to store in every posting list
• Where does the ranker run?
– Hundreds of thousands of ranks to compute
Split by term: cons (2)
• Front-ends must map terms to computers
– Simple hashing may be too unbalanced
– Some terms may need to be split/replicated
• Long posting lists
• “Hot” posting lists
• Sorting by QI rank is a global operation
– Needs to work with index updates
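One possible shape of the term-to-computer mapping, as a sketch: hash most terms to a shard, with explicit overrides that split very long posting lists or replicate hot ones. The shard count and the OVERRIDES table are made up for illustration.

```python
# Sketch of mapping terms to index servers in a split-by-term layout.
# Plain hashing can leave shards unbalanced, so long or "hot" posting lists
# get explicit overrides. Shard count and the OVERRIDES table are made up.

import hashlib

NUM_SHARDS = 4

# term -> shards its posting list is split or replicated over
OVERRIDES = {
    "the": [0, 1, 2, 3],    # very long posting list: split across all shards
    "hello": [1, 2],        # "hot" term: replicated for query throughput
}

def shards_for_term(term):
    if term in OVERRIDES:
        return OVERRIDES[term]
    h = int(hashlib.md5(term.encode()).hexdigest(), 16)
    return [h % NUM_SHARDS]

for t in ("hello", "world", "the"):
    print(t, shards_for_term(t))
```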
Distributed index: split by document
• Each computer stores a subset of docs
• Each query goes to many computers
• Document metadata stored inline
[Diagram: documents partitioned across index servers (docs 1-1000, 1001-2000, 2001-3000, 3001-4000), each running its own ranker; the query “Hello world + {EN-US,…}” fans out to every partition and an aggregator merges the per-partition results]
Split by document: pros
• Ranker on same computer as document
– All data for a given doc in the same place
– Ranker computation is distributed
• Can get low latency
• Sorting by QI rank local to each computer
• Only ranks+scores need to be aggregated
– Hundreds of results, not millions
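A sketch of the scatter/gather pattern this implies, assuming each partition can rank its own documents with a local score() stand-in: every index server returns a small top-k and the aggregator merges only those short lists.

```python
# Scatter/gather over a document-partitioned index: each partition ranks its
# own documents and returns a small top-k; the aggregator merges only scores.
# The partitions and score() are placeholders for a real index and ranker.

import heapq

def score(query, doc_id, text):
    # Stand-in for the real ranker (query-dependent + query-independent parts).
    return sum(text.count(t) for t in query.split())

def rank_partition(query, docs, k=10):
    """Runs on one index server: score local docs, return top-k (score, doc_id)."""
    return heapq.nlargest(k, ((score(query, d, text), d) for d, text in docs.items()))

def aggregate(partials, k=10):
    """Runs on the aggregator: merge hundreds of short lists into the final top-k."""
    return heapq.nlargest(k, (hit for partial in partials for hit in partial))

partitions = [
    {1: "hello world hello", 2: "world news"},
    {1001: "hello there"},
    {2001: "goodbye world", 2002: "hello hello world"},
]
partials = [rank_partition("hello world", p, k=2) for p in partitions]
print(aggregate(partials, k=3))   # [(3, 2002), (3, 1), (1, 2001)]
```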
Split by document: cons
• A query touches hundreds of computers
– One slow computer makes query slow
– Computers per query is linear in corpus size
– But query speeds are not iid
• Shorter posting lists: worse compression
– Each word split into many posting lists
Index replication
• Multiple copies of each partition
– Needed for redundancy, performance
• Makes things more complicated
– Can mitigate latency variability
• Ask two replicas, one will probably return quickly
– Interacts with data layout
• Split by document may be simpler
• Consistency may not be essential
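The “ask two replicas” idea can be sketched as request hedging; here asyncio and a random sleep stand in for real RPCs to real index servers.

```python
# Request hedging sketch: send the same request to two replicas of a
# partition and take whichever answers first, cancelling the other.
# asyncio plus a random sleep stands in for real RPCs to index servers.

import asyncio, random

async def query_replica(partition, replica, query):
    await asyncio.sleep(random.uniform(0.001, 0.050))   # simulated (sometimes slow) server
    return f"results for {query!r} from partition {partition}, replica {replica}"

async def hedged_query(partition, query):
    tasks = [asyncio.create_task(query_replica(partition, r, query)) for r in (0, 1)]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for t in pending:
        t.cancel()
    return done.pop().result()

print(asyncio.run(hedged_query(7, "hello world")))
```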
Splitting: word vs document
• Original Google paper split by word
• All major engines split by document now?
– Tens of microseconds to rank a document
Link-based ranking
• Intuition: “quality” of a page is reflected somehow in the link structure of the web
• Made famous by PageRank
– Can be seen as stationary distribution of a random walk on the web graph
– Google’s original advantage over AltaVista?
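A small power-iteration sketch of that view of PageRank — the stationary distribution of a random walk with teleportation — on a made-up four-node graph; at web scale this runs distributed over hundreds of billions of nodes.

```python
# Power-iteration sketch of PageRank: the stationary distribution of a random
# walk with teleportation. The four-node graph is made up; at web scale this
# runs distributed over hundreds of billions of nodes.

def pagerank(links, damping=0.85, iters=50):
    """links: node -> list of out-neighbors."""
    nodes = list(links)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        new = {u: (1 - damping) / n for u in nodes}
        for u, outs in links.items():
            if outs:
                for v in outs:
                    new[v] += damping * rank[u] / len(outs)
            else:                                  # dangling node: spread rank uniformly
                for v in nodes:
                    new[v] += damping * rank[u] / n
        rank = new
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
print(pagerank(graph))   # "c" ends up with the largest rank
```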
Some hints
• PageRank is (no longer) very important
• Anchor text contains similar information
– BM25F includes a lot of link structure
• Query-dependent link features may be useful
[Chart: NDCG@10 of individual ranking features]
bm25f             .221
salsa-aut-id-3    .158
salsa-aut-ih-8    .156
salsa-aut-all-100 .121
degree-in-id      .106
degree-in-ih      .105
hits-aut-id-25    .104
hits-aut-ih-100   .102
degree-in-all     .095
pagerank          .092
hits-aut-all-100  .090
hits-hub-all-100  .038
hits-hub-ih-100   .036
hits-hub-id-100   .035
salsa-hub-all-100 .034
degree-out-all    .034
salsa-hub-ih-8    .033
salsa-hub-id-3    .033
degree-out-ih     .032
degree-out-id     .032
random            .011
Comparing the Effectiveness of HITS and SALSA, M. Najork, CIKM 2007
Query-dependent link features
[Diagram: neighborhood graph of pages A–N around a result set]
Real-time QD link information
• Lookup of neighborhood graph
• Followed by SALSA
• In a few ms
Seems like a good topic for approximation/learning
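A sketch of that real-time step, assuming the neighborhood graph has already been looked up as a hub-to-authority adjacency list: run the SALSA authority walk (step back along a link, then forward) by power iteration. The tiny graph is illustrative only.

```python
# SALSA authority scores on a small neighborhood graph, by power iteration:
# from an authority, step back to one of its in-linking hubs (uniformly),
# then forward to one of that hub's authorities. Graph is illustrative only.

def salsa_authorities(links, iters=50):
    """links: hub -> list of authorities it points to."""
    in_links = {}
    for hub, outs in links.items():
        for auth in outs:
            in_links.setdefault(auth, []).append(hub)
    auths = list(in_links)
    score = {a: 1.0 / len(auths) for a in auths}
    for _ in range(iters):
        new = dict.fromkeys(auths, 0.0)
        for a in auths:
            for hub in in_links[a]:
                share = score[a] / len(in_links[a]) / len(links[hub])
                for a2 in links[hub]:
                    new[a2] += share
        score = new
    return score

# Tiny neighborhood graph around a result set (letters stand in for pages).
neighborhood = {"A": ["E", "F"], "B": ["F", "G"], "C": ["G"], "D": ["G", "H"]}
print(salsa_authorities(neighborhood))   # "G" comes out as the strongest authority
```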
Index building
• Catch-all term
– Create inverted files
– Compute document features
– Compute global link-based statistics
– Which documents to crawl next?
– Which crawled documents to put in the index?
• Consistency may be needed here
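The “create inverted files” piece of that catch-all, as a toy in-memory sketch: scan documents, collect (term, doc_id, position) triples, and group them into per-term posting lists. A real builder does this out of core and distributed.

```python
# Toy in-memory version of inverted-file construction: collect
# (term, doc_id, position) triples and group them into posting lists.
# A production builder does this out of core and distributed.

from collections import defaultdict

def build_inverted_file(docs):
    """docs: doc_id -> text. Returns term -> sorted list of (doc_id, position)."""
    postings = defaultdict(list)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            postings[term].append((doc_id, pos))
    return {t: sorted(plist) for t, plist in postings.items()}

index = build_inverted_file({1: "once a week", 45: "hello world", 1125: "hello world again"})
print(index["hello"])   # [(45, 0), (1125, 0)]
```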
Index lifecycle
[Diagram: index lifecycle — the Web, page crawling, index selection, query serving, and usage analysis feeding back into the cycle]
Experimentation
• A/B testing is best
– Ranking, UI, etc.
– Immediate feedback on what works
– Can be very fine-grained (millions of queries)
• Some things are very hard
– Index selection, etc.
– Can run parallel build processes
• Long time constants: not easy to do brute force
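A sketch of how fine-grained A/B assignment is commonly done (bucket counts and names here are illustrative): hash a stable id into buckets so a small slice of live traffic sees the experimental ranker.

```python
# Sketch of fine-grained A/B assignment: hash a stable id (user or query) into
# buckets so a small slice of traffic sees the experimental ranker and the rest
# sees the control. Bucket counts and names are illustrative.

import hashlib

NUM_BUCKETS = 1000
EXPERIMENT_BUCKETS = set(range(10))    # 1% of traffic on the experiment

def bucket(stable_id: str) -> int:
    return int(hashlib.sha1(stable_id.encode()).hexdigest(), 16) % NUM_BUCKETS

def choose_ranker(user_id: str) -> str:
    return "experimental" if bucket(user_id) in EXPERIMENT_BUCKETS else "control"

print(choose_ranker("user-42"))
```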
Implementing new features
• Document-specific features much “cheaper”
– Spam probability, duplicate fingerprints, language
• Global features can be done, but with a higher bar
– Distribute anchor text
– PageRank et al.
• Danger of “butterfly effect” on system as a whole
Distributing anchor text
[Diagram: crawlers feed parallel indexers; each indexer holds a partition of documents (docs f0-ff) together with the anchor text for them]
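A sketch of distributing anchor text in that layout, with a made-up routing function: the machine that parses a page emits (target, anchor text) records, routed by a hash of the target URL so they land on the indexer that owns the target document.

```python
# Sketch of shipping anchor text to the right partition: the machine that
# parses a page emits (target, anchor text) records, routed by a hash of the
# target URL to the indexer that owns the target document. The partition
# count and routing function are made up.

import hashlib
from collections import defaultdict

NUM_INDEXERS = 256   # e.g. partitions keyed by a fingerprint prefix such as f0-ff

def owner(url: str) -> int:
    return int(hashlib.md5(url.encode()).hexdigest(), 16) % NUM_INDEXERS

def emit_anchors(page_url, outlinks):
    """Run where the page was parsed: one routed record per outgoing link."""
    return [(owner(target), target, text) for target, text in outlinks]

def collect(records):
    """Run on each indexer: gather anchor text for the documents it owns."""
    by_target = defaultdict(list)
    for _, target, text in records:
        by_target[target].append(text)
    return dict(by_target)

records = emit_anchors("foo.com/bar", [("go.com/hw.txt", "hello world example")])
print(collect(records))   # {'go.com/hw.txt': ['hello world example']}
```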
Distributed infrastructure
• Things are improving
– Large scale partitioned file systems
• Files commonly contain many TB of data
• Accessed in parallel
– Large scale data-mining platforms
– General-purpose data repositories
• Data-centric
– Traditional supercomputing is cycle-centric
Software engineering
• Simple always wins
• Hysteresis
– Prove a change will improve things
• Big improvement needed to justify big change
– Experimental platforms are essential
Summary
• Search engines are big and complicated
• Some things are easier to change than others
• Harder changes need more convincing experiments
• Small datasets are not good predictors for large datasets
• Systems/learning may need to collaborate