altavista search engine architecture

AltaVista Indexing and Search Engine

- By Mike Burrows
- Recreated by Changshu Liu

[http://csliu.com]

Goals

• General purpose

• Good query performance

• Scale to hundreds of gigabytes

• Compact index/query representation

• Queries possible during updates

• Reasonable update performance

Non-Goals

• Scale beyond a terabyte

• Document parsing

• Query parsing

• Ranking for query results

Structure of Inverted Files

• Chose flat inverted files that map words to lists of locations where those words occur

• Words are null-terminated byte strings

• Locations are 64-bit unsigned integers

• Client picks what locations mean. No predefined notion of document, page or word number

Documents

• A document is contiguous in location space

• Documents do not overlap

• Location space is allocated densely. The first document is at location 1

• Word endDoc at last location of document

• All document structure encoded with words
– For example: begintitle, endtitle
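The layout described above can be sketched in a few lines of Python. This is only an illustration, not the original implementation: it assumes each word takes exactly one location (the slides leave that choice to the client), and the names `build_index` and `END_DOC` are hypothetical.

```python
from collections import defaultdict

END_DOC = "enddoc"  # assumed spelling of the document-terminator word


def build_index(documents):
    """Map each word to an ascending list of locations in one flat space.

    Assumption: every word occupies one location. The slides say the
    client is free to pick another meaning for locations.
    """
    index = defaultdict(list)
    loc = 1  # the first document starts at location 1
    for words in documents:
        for word in words:
            index[word].append(loc)
            loc += 1
        # The terminator word sits at the last location of the document.
        index[END_DOC].append(loc)
        loc += 1
    return dict(index)


docs = [
    ["begintitle", "alta", "vista", "endtitle", "search"],
    ["search", "engine"],
]
index = build_index(docs)
print(index["search"])  # locations of "search" across both documents
print(index[END_DOC])   # one entry per document
```

Because documents are contiguous and do not overlap, the `enddoc` list alone is enough to recover document boundaries.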

Inverted File Format

• Words ordered lexicographically

• Each word followed by list of locations

• Common word prefixes are compressed

• Locations are encoded as deltas

• Deltas stored in as few bytes as possible
– 2 bytes is common

• Full-text index occupies about 30% text size. Word-in-document (non-positional) index is about 10%

• Obvious format for deltas:

Continuation Bits Indicate Delta Boundaries

• Key operation: Find first location at least X

• Better format for efficient scanning:
– Deltas packed into aligned 64-bit word
– First byte contains continuation bits
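A minimal sketch of the "obvious" format, where a continuation bit in each byte marks delta boundaries. The slides do not give the exact bit layout, so this assumes 7 payload bits per byte with the high bit as the continuation flag (the common varint convention). The packed 64-bit layout is not reproduced here, and the function names are hypothetical.

```python
def encode_deltas(locations):
    """Encode ascending locations as byte-wise deltas with continuation bits."""
    out = bytearray()
    prev = 0
    for loc in locations:
        delta = loc - prev
        prev = loc
        while True:
            byte = delta & 0x7F
            delta >>= 7
            if delta:
                out.append(byte | 0x80)  # more bytes follow for this delta
            else:
                out.append(byte)         # last byte of this delta
                break
    return bytes(out)


def decode_deltas(data):
    """Yield the locations back by summing the decoded deltas."""
    cur = delta = shift = 0
    for byte in data:
        delta |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            cur += delta
            yield cur
            delta = shift = 0


def first_at_least(data, x):
    """The key operation: first location >= x, or None if there is none."""
    return next((loc for loc in decode_deltas(data) if loc >= x), None)


encoded = encode_deltas([5, 7, 300, 100000])
print(len(encoded), list(decode_deltas(encoded)), first_at_least(encoded, 8))
```

In this byte-at-a-time form every byte must be inspected to find the boundaries, which is the cost the packed format is meant to reduce.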

Parsing a Delta

• Observation:

Choose instructions to dual-issue well. Fixed word structure allows prefetch. Avoid branch mispredictions.

• 6 instr. to extract+sum+compare a delta

    extql b, tp, x      ; get next delta
    addq  tp, l, tp     ; point to next delta
    mskql x, l, x       ; cut delta to length
    srl   l, 3, l       ; get next delta length
    addq  cur, x, cur   ; add delta to location
    bge   cur, done     ; bail if done

With loop overhead, 35 instr/64-bit word.

10 cycles/64-bit word.

Index Stream Reader (ISRs)

An interface for reading the result of a query as an ascending sequence of locations

Lazily evaluated

ISRs are objects with methods:
Loc() – Return current location
Next() – Advance to next location
Seek(X) – Advance to first location at least X

Subtype ISRP adds:
Prev() – Return previous location

Used for fielded queries (e.g. in title)

No methods move backwards
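The interface can be sketched over an in-memory sorted list. This is a hypothetical stand-in for the file-backed reader: the class names, the `END` sentinel for an exhausted stream, and returning 0 from `prev()` when nothing precedes the current location are all assumptions.

```python
import bisect

END = 2**64 - 1  # assumed end-of-stream sentinel (largest 64-bit location)


class ListISR:
    """An ISR over a sorted list of locations."""

    def __init__(self, locations):
        self._locs = sorted(locations)
        self._i = 0

    def loc(self):
        """Return the current location, or END when exhausted."""
        return self._locs[self._i] if self._i < len(self._locs) else END

    def next(self):
        """Advance to the next location."""
        if self._i < len(self._locs):
            self._i += 1
        return self.loc()

    def seek(self, x):
        """Advance to the first location >= x; never moves backwards."""
        self._i = bisect.bisect_left(self._locs, x, self._i)
        return self.loc()


class ListISRP(ListISR):
    """ISRP subtype: also exposes the location just before the current one."""

    def prev(self):
        # 0 is safe as "no previous" because the location space starts at 1.
        return self._locs[self._i - 1] if self._i > 0 else 0


isr = ListISRP([3, 8, 15])
isr.seek(8)
print(isr.prev(), isr.loc())
```

Starting the binary search at the current index is what keeps `seek` from ever moving the stream backwards.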

ISR Implementations

file — reads inverted files; seek() method is the delta parsing loop

or — merges two or more ISRs

not — returns locations not in argument ISR

and — constraint solver (AND, NEAR etc)

and other, specialized ISRs

and & not cannot support prev()
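As an illustration of how one of these composes, here is a sketch of an "or" ISR that lazily merges its children. The small `ListISR` stands in for the file ISR, and all names and the `END` marker are assumptions made for this example.

```python
END = float("inf")  # assumed end-of-stream marker


class ListISR:
    """Minimal in-memory stand-in for the file ISR."""

    def __init__(self, locs):
        self.locs, self.i = sorted(locs), 0

    def loc(self):
        return self.locs[self.i] if self.i < len(self.locs) else END

    def next(self):
        self.i = min(self.i + 1, len(self.locs))
        return self.loc()

    def seek(self, x):
        while self.loc() < x:
            self.next()
        return self.loc()


class OrISR:
    """Merges two or more ISRs into one ascending stream."""

    def __init__(self, children):
        self.children = children

    def loc(self):
        return min(c.loc() for c in self.children)

    def next(self):
        cur = self.loc()
        for c in self.children:
            # Advance every child sitting on the current location so
            # duplicates are reported once.
            if cur != END and c.loc() == cur:
                c.next()
        return self.loc()

    def seek(self, x):
        for c in self.children:
            c.seek(x)
        return self.loc()


def drain(isr):
    """Collect every remaining location of an ISR into a list."""
    out = []
    while isr.loc() != END:
        out.append(isr.loc())
        isr.next()
    return out


merged = drain(OrISR([ListISR([1, 4, 9]), ListISR([4, 6])]))
print(merged)
```

Nothing is computed until a caller asks for `loc()`, `next()` or `seek()`, which matches the lazy evaluation noted for ISRs.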

ISR And—constraint solver

Arguments: list of ISRs, list of Constraints

Constraint types (A and B are ISRs):

1. loc(A) ≤ loc(B) + K
2. prev(A) ≤ loc(B) + K
3. loc(A) ≤ prev(B) + K
4. prev(A) ≤ prev(B) + K

If each word takes a location, constraints for two-word phrase “a b” are:

loc(A) < loc(B)
loc(B) ≤ loc(A) + 1

Let E, BT, ET be ISRPs of words: enddoc, begintitle, endtitle

Constraints for conjunction: a and b
prev(E) < loc(A)
loc(A) ≤ loc(E)
prev(E) < loc(B)
loc(B) ≤ loc(E)

Constraints for field query: title: A
prev(BT) < loc(A)
loc(A) ≤ loc(ET)
prev(BT) < loc(ET)
loc(ET) ≤ loc(BT)
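To make the conjunction constraints concrete: prev(E) < loc(A) ≤ loc(E) pins A inside the document whose terminator is at loc(E), and applying the same bounds to B forces both words into that one document. The sketch below checks this directly against a sorted list of enddoc locations. The helper names are hypothetical, and it assumes every queried location lies inside the indexed space.

```python
import bisect


def doc_bounds(enddoc_locs, x):
    """Return (prev(E), loc(E)) such that prev(E) < x <= loc(E).

    Assumption: x is no greater than the last enddoc location.
    """
    i = bisect.bisect_left(enddoc_locs, x)
    prev_e = enddoc_locs[i - 1] if i > 0 else 0  # 0 means "before first doc"
    loc_e = enddoc_locs[i]
    return prev_e, loc_e


def same_document(enddoc_locs, a, b):
    """True when locations a and b satisfy the conjunction constraints."""
    return doc_bounds(enddoc_locs, a) == doc_bounds(enddoc_locs, b)


enddocs = [6, 9]  # two documents: locations 1..6 and 7..9
print(doc_bounds(enddocs, 5), doc_bounds(enddocs, 7))
print(same_document(enddocs, 1, 5), same_document(enddocs, 5, 7))
```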

Solver Algorithm

While (Unsatisfied Constraints)
    Pick Unsatisfied Constraint()
    Satisfy Constraint()

To Satisfy

loc(A) ≤ loc(B) + K:
    seek(B, loc(A) − K)

prev(A) ≤ loc(B) + K:
    seek(B, prev(A) − K)

loc(A) ≤ prev(B) + K:
    seek(B, loc(A) − K)
    next(B)

prev(A) ≤ prev(B) + K:
    seek(B, prev(A) − K)
    next(B)
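The loop above might look like the following in Python. It is a sketch under stated assumptions rather than the original solver: constraints are encoded as hypothetical tuples `(a_kind, A, b_kind, B, K)` meaning `a_kind(A) ≤ b_kind(B) + K`, a strict `<` is written as `≤` with K reduced by one, the first unsatisfied constraint is always the one picked, and the search gives up once any stream is exhausted.

```python
import bisect

END = float("inf")  # assumed end-of-stream marker


class ListISRP:
    """Minimal in-memory ISRP used to exercise the solver."""

    def __init__(self, locs):
        self.locs, self.i = sorted(locs), 0

    def loc(self):
        return self.locs[self.i] if self.i < len(self.locs) else END

    def prev(self):
        return self.locs[self.i - 1] if self.i > 0 else 0

    def next(self):
        self.i = min(self.i + 1, len(self.locs))

    def seek(self, x):
        self.i = bisect.bisect_left(self.locs, x, self.i)


def value(kind, isr):
    """Evaluate loc(isr) or prev(isr) depending on the constraint side."""
    return isr.loc() if kind == "loc" else isr.prev()


def solve(isrs, constraints):
    """Advance streams until all constraints hold; None if a stream ends."""
    while True:
        if any(s.loc() == END for s in isrs):
            return None
        unsatisfied = [c for c in constraints
                       if value(c[0], c[1]) > value(c[2], c[3]) + c[4]]
        if not unsatisfied:
            return [s.loc() for s in isrs]
        a_kind, a, b_kind, b, k = unsatisfied[0]
        b.seek(value(a_kind, a) - k)  # only B ever moves, and only forwards
        if b_kind == "prev":
            b.next()                  # so that prev(B) becomes the sought spot


# Two-word phrase "a b": loc(A) < loc(B) and loc(B) <= loc(A) + 1.
a = ListISRP([2, 10, 20])
b = ListISRP([5, 11, 30])
phrase = [("loc", a, "loc", b, -1), ("loc", b, "loc", a, 1)]
match = solve([a, b], phrase)
print(match)
```

Each step moves exactly one stream forward, so the loop ends either with a solution or with an exhausted stream.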

Some Metrics

(performance based on AltaVista Web index)

20K lines of code

Indexes around 1.5GByte/Hr/600MHz CPU

Queries take about 100 cycles/query/MByte

Queries are CPU bound: 95% in user space, 5% in kernel

Memory bus is currently under-utilized

Breakdown of user CPU time

30% inner loop
15% constraint solver
15% higher level seek code
7% ranking code
0.2% merging results

Miss Ratios:
2% I-cache
8% D-cache
8% level-2 cache
40% level-3 cache
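For a rough sense of scale, the indexing and query figures above can be turned into per-byte and per-query costs. This is back-of-the-envelope arithmetic only: it assumes 1 GByte = 10**9 bytes, that "per MByte" refers to the amount of data a query runs against, and a hypothetical 100 GByte data size that does not come from the slides.

```python
CPU_HZ = 600_000_000              # 600 MHz CPU, from the slide
INDEX_BYTES_PER_HOUR = 1_500_000_000  # 1.5 GByte/Hr, assuming 10**9 bytes per GByte

# Cycles spent per byte indexed on one CPU.
cycles_per_byte_indexed = CPU_HZ * 3600 / INDEX_BYTES_PER_HOUR

CYCLES_PER_QUERY_PER_MBYTE = 100  # from the slide
HYPOTHETICAL_MBYTES = 100_000     # 100 GBytes, an illustrative size only

# Cost of one query against the hypothetical data size.
query_cycles = CYCLES_PER_QUERY_PER_MBYTE * HYPOTHETICAL_MBYTES
query_seconds = query_cycles / CPU_HZ

print(cycles_per_byte_indexed)        # 1440.0 cycles per byte indexed
print(round(query_seconds * 1000, 1)) # about 16.7 ms on one 600 MHz CPU
```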

Postmortem

• Successes
– ISRs are a good abstraction
– Flat location space
– Representing structure as words

• Regrets
– No ability to run ISRs backwards
– Wish ISR constraint solver were less complex

AltaVista Site Architecture

- By Mike Burrows
- Recreated by Changshu Liu

[http://csliu.com]

Structure of the Site

Front-Ends: Alpha Workstations

Back-Ends: 4-10 CPU Alpha Servers
8 GBytes RAM / 150 GBytes Disc.
Organized in Groups of 4-10 Machines
Each machine has 1/Nth of the whole index

[Diagram of the site: Broad Routers 0 and 1; FDDI routers; Front Ends 0 to N]

Handling Failures

• Disc: RAID controllers with spare discs

• Back-ends: front-ends use other groups

• Front-ends: hot-spare grabs IP address

• FDDI: manual replacement of cold spare

• Site: failover via manual DNS change

RAID

• Reconstructing a disc takes 30 minutes.
– Disc performance is crippled

• Expect a few discs to fail a month
– Need daily schedule for checking discs.

• GUI annoying when checking 60 controllers

• Once a disc failed with no error reported
– Corrupted index file

• On first day, the only non-RAID device (root disc) failed during demo for press

File System

• Need a Journaling File System
– Write Ahead Log
– FSCK (consistency checker) takes hours

• Software/Memory errors destroy file systems
– Restoring 300GB from tape doesn’t work
    • Tape may be in error
    • Too slow
– Important to replicate data on spinning disk

Back-Ends

• Back-Ends were Digital 8400’s (TurboLaser)

• Huge cards with large connectors

• Pins are on backplane, not card

• RAID setup took hours on separate machine

• Console interrupt is a boon

Front-Ends

• Biggest Problems:
– Poorly-Tested software
– Operator Error

• Automatic restart dealt with former

• A trivial IP failover scheme dealt with latter

HTTP Server

• Original NCSA httpd was abysmal
– Forked too often
– Synchronous name resolution
– Log writes to full disc
– Prone to denial of service attacks

• Fixed with new first http server
– Never Forks: aggravates software test issues
– Submit limits: sockets/threads/requests rate

Load Balance

• Front-End
– DNS round robin

• Backend
– Front-Ends will group similar queries to the same specific backend for cache
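One way a front-end could send similar queries to the same backend is to hash a normalized form of the query. The slides do not say how AltaVista defined "similar" or which function it used, so everything here is an assumption for illustration: the normalization (lowercase, sorted terms), the use of SHA-1, and the name `pick_backend`.

```python
import hashlib


def pick_backend(query, num_backends):
    """Choose a backend index so equivalent queries share a cache.

    Assumption: two queries are "similar" when they contain the same
    terms, ignoring case and order.
    """
    key = " ".join(sorted(query.lower().split()))
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    # Use the first 8 bytes as an integer and reduce to a backend index.
    return int.from_bytes(digest[:8], "big") % num_backends


first = pick_backend("Alta Vista", 4)
second = pick_backend("vista alta", 4)
print(first == second)  # both forms of the query land on the same backend
```

A plain modulo reshuffles most queries whenever the number of backends changes, so a real deployment might prefer a scheme that is more stable under failures.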

Overload Handling

• Back-ends take short-cuts when overloaded
– Ultimately, they can refuse service

• Front-ends have spare capacity to avoid site appearing completely dead

Reference

The AltaVista Indexing and Search Engine

Mike Burrows, Compaq SRC
Production Date: 01/18/2000
Link: http://uwtv.org/programs/displayevent.aspx?rid=2123
