AltaVista Search Engine Architecture
Inside the design and implementation of the AltaVista search engine
AltaVista Indexing and Search Engine
- By Mike Burrows
- Recreated by Changshu Liu
[http://csliu.com]
Goals
• General purpose
• Good query performance
• Scale to hundreds of gigabytes
• Compact index/query representation
• Queries possible during updates
• Reasonable update performance
Non-Goals
• Scale beyond a terabyte
• Document parsing
• Query parsing
• Ranking for query results
Structure of Inverted Files
• Chose flat inverted files that map words to lists of locations where those words occur
• Words are null-terminated byte strings
• Locations are 64-bit unsigned integers
• Client picks what locations mean. No predefined notion of document, page or word number
Documents
• A document is contiguous in location space
• Documents do not overlap
• Location space is allocated densely. The first document is at location 1
• The word endDoc sits at the last location of each document
• All document structure is encoded with words
  – For example: begintitle, endtitle
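The layout described above can be sketched in a few lines. This is only an illustration, not the AltaVista code: the function name `build_index`, the use of Python lists, and the choice to give structure words such as begintitle their own locations are all assumptions made for the example.

```python
from collections import defaultdict

END_DOC = "endDoc"  # marker word named in the slides


def build_index(documents):
    """Map each word to an ascending list of integer locations.

    `documents` is a list of word lists. Locations are handed out densely,
    starting at 1, and every document is closed by an END_DOC marker.
    """
    index = defaultdict(list)
    location = 1
    for words in documents:
        for word in words:
            index[word].append(location)
            location += 1
        index[END_DOC].append(location)  # last location of this document
        location += 1
    return dict(index)


index = build_index([
    ["begintitle", "alta", "vista", "endtitle", "search"],
    ["search", "engine"],
])
print(index["search"])  # locations of "search" across both documents
print(index[END_DOC])   # one entry per document boundary
```

Because documents are contiguous and never overlap, the endDoc list alone is enough to tell which document any location belongs to.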
Inverted File Format
• Words ordered lexicographically
• Each word followed by list of locations
• Common word prefixes are compressed
• Locations are encoded as deltas
• Deltas are stored in as few bytes as possible
  – 2 bytes is common
• Full-text index occupies about 30% of the text size. Word-in-document (non-positional) index is about 10%
• Obvious format for deltas: continuation bits indicate delta boundaries
• Key operation: find the first location at least X
• Better format for efficient scanning: deltas packed into an aligned 64-bit word; the first byte contains the continuation bits
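To make the "obvious format" concrete, here is a minimal sketch of delta encoding with continuation bits, plus the key operation of finding the first location at least X. The exact byte layout on the slide is not recoverable, so the layout used here (7 data bits per byte, high bit set when more bytes follow, low-order bits first) is an assumption, as are the names `encode_deltas` and `seek`. The packed 64-bit-word format is not reproduced.

```python
def encode_deltas(locations):
    """Encode ascending locations as variable-length deltas."""
    out = bytearray()
    previous = 0
    for loc in locations:
        delta = loc - previous
        previous = loc
        while delta >= 0x80:
            out.append((delta & 0x7F) | 0x80)  # continuation bit set
            delta >>= 7
        out.append(delta)  # final byte of this delta: continuation bit clear
    return bytes(out)


def seek(encoded, x):
    """Return the first location >= x, or None if there is none."""
    current = 0  # running sum of deltas, i.e. the current location
    delta = 0
    shift = 0
    for byte in encoded:
        delta |= (byte & 0x7F) << shift
        if byte & 0x80:  # more bytes belong to this delta
            shift += 7
            continue
        current += delta
        if current >= x:
            return current
        delta = 0
        shift = 0
    return None


encoded = encode_deltas([5, 7, 300, 70000])
print(len(encoded))       # small deltas take one byte, larger ones more
print(seek(encoded, 8))   # first location at or after 8
```

Every byte has to be inspected to find a delta boundary in this layout, which is the cost the packed format on the slide is meant to avoid.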
Parsing a Delta
• Observation:
Choose instructions to dual-issue well. Fixed word structure allows prefetch. Avoid branch mispredictions.
• 6 instr. to extract+sum+compare a delta
    extql b, tp, x      ; get next delta
    addq  tp, l, tp     ; point to next delta
    mskql x, l, x       ; cut delta to length
    srl   l, 3, l       ; get next delta length
    addq  cur, x, cur   ; add delta to location
    bge   cur, done     ; bail if done
With loop overhead, 35 instr/64-bit word.
10 cycles/64-bit word.
Index Stream Readers (ISRs)
An interface for reading the result of a query as an ascending sequence of locations
Lazily evaluated
ISRs are objects with methods:
  Loc() – Return current location
  Next() – Advance to next location
  Seek(X) – Advance to first location at least X
Subtype ISRP adds:
  Prev() – Return previous location
Used for fielded queries (e.g. in title)
No methods move backwards
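A minimal sketch of this interface, backed by an in-memory list rather than an inverted file. The class name `ListISR`, the `END` sentinel returned after the last location, and the choice to return 0 from `prev()` when nothing precedes the cursor are assumptions for the example, not details from the talk.

```python
class ListISR:
    """Index stream reader over an ascending list of locations.

    Also provides prev(), so it plays the role of an ISRP.
    """

    END = float("inf")  # assumed sentinel: stream is exhausted

    def __init__(self, locations):
        self._locs = list(locations)
        self._i = 0

    def loc(self):
        """Return the current location."""
        return self._locs[self._i] if self._i < len(self._locs) else self.END

    def next(self):
        """Advance to the next location and return it."""
        if self._i < len(self._locs):
            self._i += 1
        return self.loc()

    def seek(self, x):
        """Advance to the first location at least x and return it."""
        while self.loc() < x:
            self._i += 1
        return self.loc()

    def prev(self):
        """Return the location just before the current one (0 if none)."""
        return self._locs[self._i - 1] if self._i > 0 else 0


isr = ListISR([3, 9, 14])
print(isr.seek(5))  # first location at least 5
print(isr.prev())   # the location before it; the cursor does not move
```

Note that `prev()` only reports a location; like every other method, it never moves the cursor backwards.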
ISR Implementations
file: reads inverted files; its seek() method is the delta parsing loop
or: merges two or more ISRs
not: returns locations not in argument ISR
and: constraint solver (AND, NEAR etc)
and other, specialized ISRs
and & not cannot support prev()
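As an illustration of how ISRs compose, the following sketch merges two or more child ISRs into one ascending stream, in the spirit of the "or" implementation listed above. `ListISR`, `OrISR`, and the `END` sentinel are hypothetical names for this example; the real implementation is not shown in the slides.

```python
END = float("inf")  # assumed sentinel: stream is exhausted


class ListISR:
    """Minimal ISR over an ascending in-memory list of locations."""

    def __init__(self, locations):
        self._locs = list(locations)
        self._i = 0

    def loc(self):
        return self._locs[self._i] if self._i < len(self._locs) else END

    def next(self):
        if self._i < len(self._locs):
            self._i += 1
        return self.loc()

    def seek(self, x):
        while self.loc() < x:
            self._i += 1
        return self.loc()


class OrISR:
    """Lazily merges child ISRs; duplicates are reported once."""

    def __init__(self, children):
        self._children = list(children)

    def loc(self):
        return min(child.loc() for child in self._children)

    def next(self):
        current = self.loc()
        for child in self._children:
            if child.loc() == current:  # step every child sitting here
                child.next()
        return self.loc()

    def seek(self, x):
        for child in self._children:
            child.seek(x)
        return self.loc()


merged = OrISR([ListISR([2, 8, 20]), ListISR([5, 8, 30])])
locations = []
while merged.loc() != END:
    locations.append(merged.loc())
    merged.next()
print(locations)
```

Nothing is materialized up front: each call touches the children only as far as needed, which matches the lazy evaluation the interface calls for.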
ISR And: constraint solver
Arguments: list of ISRs, list of constraints
Constraint types (A and B are ISRs):
  1. loc(A) ≤ loc(B) + K
  2. prev(A) ≤ loc(B) + K
  3. loc(A) ≤ prev(B) + K
  4. prev(A) ≤ prev(B) + K
If each word takes a location, constraints for two-word phrase “a b” are:
  loc(A) < loc(B)
  loc(B) ≤ loc(A) + 1
Let E, BT, ET be ISRPs of words: enddoc, begintitle, endtitle
Constraints for conjunction: a and b
  prev(E) < loc(A)
  loc(A) ≤ loc(E)
  prev(E) < loc(B)
  loc(B) ≤ loc(E)
Constraints for field query: title: A
  prev(BT) < loc(A)
  loc(A) ≤ loc(ET)
  prev(BT) < loc(ET)
  loc(ET) ≤ loc(BT)
Solver Algorithm
While (Unsatisfied Constraints)
  Pick Unsatisfied Constraint()
  Satisfy Constraint()
To satisfy:
  loc(A) ≤ loc(B) + K:    seek(B, loc(A) − K)
  prev(A) ≤ loc(B) + K:   seek(B, prev(A) − K)
  loc(A) ≤ prev(B) + K:   seek(B, loc(A) − K), then next(B)
  prev(A) ≤ prev(B) + K:  seek(B, prev(A) − K), then next(B)
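The loop above can be sketched for the simplest constraint type, loc(A) ≤ loc(B) + K, and applied to the two-word phrase example. This is a simplified illustration under stated assumptions: only type-1 constraints are handled, the first unsatisfied constraint is always picked, and `ListISR`, `solve`, `phrase_matches`, and the `END` sentinel are names invented for the example.

```python
END = float("inf")  # assumed sentinel: stream is exhausted


class ListISR:
    """Minimal ISR over an ascending in-memory list of locations."""

    def __init__(self, locations):
        self._locs = list(locations)
        self._i = 0

    def loc(self):
        return self._locs[self._i] if self._i < len(self._locs) else END

    def next(self):
        if self._i < len(self._locs):
            self._i += 1
        return self.loc()

    def seek(self, x):
        while self.loc() < x:
            self._i += 1
        return self.loc()


def solve(constraints):
    """Advance ISRs until every (a, b, k) satisfies loc(a) <= loc(b) + k."""
    while True:
        unsatisfied = [c for c in constraints if c[0].loc() > c[1].loc() + c[2]]
        if not unsatisfied:
            return
        a, b, k = unsatisfied[0]   # pick an unsatisfied constraint
        b.seek(a.loc() - k)        # satisfy it by moving B forward


def phrase_matches(a, b):
    """Find (loc(A), loc(B)) pairs where word B directly follows word A."""
    # loc(A) < loc(B) is loc(A) <= loc(B) - 1; loc(B) <= loc(A) + 1 as given.
    constraints = [(a, b, -1), (b, a, 1)]
    matches = []
    while True:
        solve(constraints)
        if a.loc() == END or b.loc() == END:
            return matches
        matches.append((a.loc(), b.loc()))
        a.next()  # move past this solution and look for the next one


matches = phrase_matches(ListISR([1, 10, 20]), ListISR([4, 11, 30]))
print(matches)
```

Since ISRs only move forward, each seek makes progress and the loop terminates once every stream is either consistent with the others or exhausted.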
Some Metrics
(performance based on AltaVista Web index)
20K lines of code
Indexes around 1.5GByte/Hr/600MHz CPU
Queries take about 100 cycles/query/MByte
Queries are CPU bound: 95% in user space, 5% in kernel
Memory bus is currently under-utilized
Breakdown of user CPU time
  30% inner loop
  15% constraint solver
  15% higher level seek code
  7% ranking code
  0.2% merging results
Miss Ratios:
  2% I-cache
  8% D-cache
  8% level-2 cache
  40% level-3 cache
Postmortem
• Successes
  – ISRs are a good abstraction
  – Flat location space
  – Representing structure as words
• Regrets
  – No ability to run ISRs backwards
  – Wish ISR constraint solver were less complex
AltaVista Site Architecture
- By Mike Burrows
- Recreated by Changshu Liu
[http://csliu.com]
Structure of the Site
Front-Ends: Alpha Workstations
Back-Ends: 4-10 CPU Alpha Servers
  8 GBytes RAM / 150 GBytes Disc
  Organized in groups of 4-10 machines
  Each machine has 1/Nth of the whole index
[Figure: site diagram showing Broad Routers 0 and 1, FDDI routers, and Front Ends 0 through N]
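Since each machine in a group holds 1/Nth of the index, a query has to visit every machine in the group and the partial results have to be combined. The sketch below illustrates that idea only. How AltaVista actually split the index is not stated in the slides, so partitioning by document id modulo N is an assumption, and `partition`, `search_group`, and the `matches` predicate are hypothetical names.

```python
import heapq


def partition(doc_ids, n):
    """Spread document ids over n shards (assumed scheme: id modulo n)."""
    shards = [[] for _ in range(n)]
    for doc_id in doc_ids:
        shards[doc_id % n].append(doc_id)
    return shards


def search_group(shards, matches):
    """Ask every shard in the group, then merge the ascending partial results."""
    partial = [[doc for doc in shard if matches(doc)] for shard in shards]
    return list(heapq.merge(*partial))


shards = partition(range(10), 4)
hits = search_group(shards, lambda doc: doc % 3 == 0)
print(shards)
print(hits)
```

In this arrangement a whole group is the unit of service, which fits the later slide where front-ends switch to other groups when a back-end fails.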
Handling Failures
• Disc: RAID controllers with spare discs
• Back-ends: front-ends use other groups
• Front-ends: hot-spare grabs IP address
• FDDI: manual replacement of cold spare
• Site: failover via manual DNS change
RAID
• Reconstructing a disc takes 30 minutes
  – Disc performance is crippled
• Expect a few discs to fail per month
  – Need daily schedule for checking discs
• GUI annoying when checking 60 controllers
• Once a disc failed with no error reported
  – Corrupted index file
• On first day, the only non-RAID device (root disc) failed during demo for press
File System
• Need a journaling file system
  – Write-ahead log
  – FSCK (consistency checker) takes hours
• Software/memory errors destroy file systems
  – Restoring 300GB from tape doesn't work
    • Tape may be in error
    • Too slow
  – Important to replicate data on spinning disk
Back-Ends
• Back-Ends were Digital 8400's (TurboLaser)
• Huge cards with large connectors
• Pins are on backplane, not card
• RAID setup took hours on separate machine
• Console interrupt is a boon
Front-Ends
• Biggest Problems:
  – Poorly-tested software
  – Operator error
• Automatic restart dealt with former
• A trivial IP failover scheme dealt with latter
HTTP Server
• Original NCSA httpd was abysmal
  – Forked too often
  – Synchronous name resolution
  – Log writes to full disc
  – Prone to denial of service attacks
• Fixed with new first http server
  – Never forks: aggravates software test issues
  – Submit limits: sockets/threads/requests rate
Load Balance
• Front-End
  – DNS round robin
• Back-End
  – Front-ends send similar queries to the same specific back-end so that its cache is reused
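One common way to get this kind of cache affinity is to hash a normalized form of the query and use the hash to choose a back-end group. The slides do not say how AltaVista decided which queries were "similar" or how it chose the back-end, so the normalization rule (lowercase, collapsed whitespace), the use of SHA-256, and the names `normalize` and `pick_group` are all assumptions for illustration.

```python
import hashlib


def normalize(query):
    """Reduce trivially different spellings of a query to one cache key."""
    return " ".join(query.lower().split())


def pick_group(query, num_groups):
    """Map a query to a back-end group index in a stable, repeatable way."""
    digest = hashlib.sha256(normalize(query).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_groups


first = pick_group("Alta   Vista", 4)
second = pick_group("alta vista", 4)
print(first == second)  # both spellings land on the same group
```

A cryptographic hash is used here only because it is stable across processes and machines, unlike Python's built-in `hash()` for strings.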
Overload Handling
• Back-ends take short-cuts when overloaded
  – Ultimately, they can refuse service
• Front-ends have spare capacity to avoid site appearing completely dead
Reference
The AltaVista Indexing and Search Engine
Mike Burrows, Compaq SRC
Production Date: 01/18/2000
Link: http://uwtv.org/programs/displayevent.aspx?rid=2123