Scaling Information Retrieval to the Web, Yinon Bentor, April 13, 2010


Page 1: Scaling Information Retrieval to the Web

Scaling Information Retrieval to the Web

Yinon Bentor, April 13, 2010

Page 2:

Overview

• What is large data? How big is it? How do we handle it?

• What we don’t want to do

• The Google Platform (and the Apache Platform, and the Amazon Platform, ...)

• MapReduce for robust, efficient batch computation

• Distributed File Systems (GFS, HDFS), and why they’re useful

• Distributed Databases: BigTable, CouchDB, HBase

Page 3:

Overview

• And how does this apply to Information Retrieval?

• Distributed implementation of Inverted Indexing

• MapReduce for PageRank

• What else can we do?

• Practical considerations

Page 4:

Large Data

• Google processes 20 PB a day (2008)

• Wayback Machine has 3 PB + 100 TB/month (3/2009)

• Facebook has 2.5 PB of user data + 15 TB/day (4/2009)

• eBay has 6.5 PB of user data + 50 TB/day (5/2009)

• CERN’s LHC will generate 15 PB a year

[Slide from Jimmy Lin: http://www.umiacs.umd.edu/~jimmylin/cloud-2010-Spring/ ]

Page 5:

What Can We Do With All This Data?

• Data Mining

• Question Answering

• Machine Translation

• Recommendation

• Ad Placement


• Train Classifiers (e.g., Spam Filters)

• Analyze Social Graphs

• “Discover the secrets of the universe”

“There’s no data like more data”

Page 6:

Numbers Everyone Should Know*

L1 cache reference                          0.5 ns
Branch mispredict                             5 ns
L2 cache reference                            7 ns
Mutex lock/unlock                            25 ns
Main memory reference                       100 ns
Send 2K bytes over 1 Gbps network        20,000 ns
Read 1 MB sequentially from memory      250,000 ns
Round trip within same datacenter       500,000 ns
Disk seek                            10,000,000 ns
Read 1 MB sequentially from disk     20,000,000 ns
Send packet CA → Netherlands → CA   150,000,000 ns

* According to Jeff Dean (LADIS 2009 keynote) [Slide from Jimmy Lin: http://www.umiacs.umd.edu/~jimmylin/cloud-2010-Spring/ ]

Page 7:

What are the Lessons?

• CPUs are fast, memory is slow, disk is slower: use variable-length encodings, compression, etc.

• Read from memory whenever possible: Memory reads are ~80x faster than disk

• Prefer sequential disk reads to random access

• Prefer large files (64MB block sizes aren’t bad)

• Locality is important: Keep it within the same cache read, memory page, machine, rack, data center, continent, …
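As one concrete instance of the variable-length encodings mentioned above, here is a sketch of base-128 varint encoding, the standard trick for compressing small integers (such as gaps between sorted doc-IDs) in posting lists. The helper names are ours, not from the slides:

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative int, 7 bits per byte, low bits first;
    the high bit is set on every byte except the last."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)

def decode_varint(buf: bytes) -> int:
    """Decode a single varint from the front of a byte string."""
    n, shift = 0, 0
    for b in buf:
        n |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return n
```

Small values (or small gaps) then occupy 1-2 bytes instead of a fixed 4 or 8, which is exactly the "disk is slower, so compress" lesson in action.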

Page 8:

What we don’t want to do

Use expensive machines (they fail too)

➡ Cheap commodity hardware is better

Die on hardware failure

➡ Build reliability in software

Wait on shared resources

➡ Distribute everything

Transfer data unnecessarily

➡ Move code to data instead

Page 9:

Distributed/Cloud Computing Platforms

             | Google                   | Apache/Yahoo              | Amazon
Computation  | MapReduce                | Hadoop                    | EC2 / Elastic MapReduce
File Storage | Google File System (GFS) | HDFS                      | Amazon S3
Database     | BigTable                 | HBase, Cassandra, CouchDB | Amazon SimpleDB

Page 10:

MapReduce


“A simple programming model that applies to many large scale computing problems”

[Slide from Jeff Dean LADIS 2009]

Hide messy details in MapReduce runtime library:
• automatic parallelization
• load balancing
• network and disk transfer optimizations
• handling of machine failures
• robustness

Improvements to core library benefit all users of library.

Page 11:

Programming Model (Lisp)

• map: take a list and a function f of 1 argument, apply f to each element:

map([1, 2, 4, 10],
    function(x) {return x*x;})
> [1, 4, 16, 100]

• fold: take a list, a function g of 2 arguments, and an accumulator value; apply g iteratively to the accumulator and each value:

fold([1, 4, 16, 100], 0,
     function(x, y) {return x+y;})
> 121
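The same two primitives exist in Python, which makes the slide's Lisp-flavored examples concrete (`functools.reduce` plays the role of fold):

```python
from functools import reduce

nums = [1, 2, 4, 10]

# map: apply a one-argument function to each element
squares = list(map(lambda x: x * x, nums))          # [1, 4, 16, 100]

# fold: combine elements into an accumulator with a two-argument function
total = reduce(lambda acc, x: acc + x, squares, 0)  # 121
```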

Page 12:

MapReduce Semantics

map: (k1, v1) → [(k2, v2)]

[sort and group by k2]

reduce: (k2, [v2]) → [(k3, v3)]

Page 13:

MapReduce Operation


[Image from Jimmy Lin, Data Intensive Processing with MapReduce, (forthcoming)]

Page 14:

Word Count Example

[Figure: word-count dataflow. Mapper tasks scan document shards and emit (term, count) pairs (e.g. "Dracula 37", "school 7", "all 1057"); the pairs are grouped by term and sorted; reducer tasks sum the counts for each term, yielding totals such as all 1266, cat 72, Dracula 37, school 11.]

Page 15:

Word Count: Pseudocode


[Jimmy Lin, Data Intensive Processing with MapReduce, (forthcoming)]
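The pseudocode itself lives in the cited figure; in its place, here is one plausible rendering in plain Python, simulating the map, group-by-key, and reduce phases in memory (the function names and toy documents are ours, not Lin's actual pseudocode, and a real framework would distribute each phase):

```python
from collections import defaultdict

def wc_map(doc_id, text):
    # Map: emit a (term, 1) pair for every token in the document.
    for term in text.lower().split():
        yield term, 1

def wc_reduce(term, counts):
    # Reduce: sum all partial counts for one term.
    return term, sum(counts)

def run_wordcount(docs):
    grouped = defaultdict(list)
    for doc_id, text in docs.items():          # map phase
        for term, one in wc_map(doc_id, text):
            grouped[term].append(one)          # "shuffle": group by key
    return dict(wc_reduce(t, c) for t, c in grouped.items())  # reduce phase
```

For example, `run_wordcount({"d1": "the cat sat", "d2": "the dog sat"})` returns `{"the": 2, "cat": 1, "sat": 2, "dog": 1}`.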

Page 16:

Generating Map Tiles

[Slide from Jeff Dean, LADIS 2009]

Page 17:

Inverted Indexing

• Recall that an Inverted Index is a map from a term to its posting list

• A Posting List is a list of each occurrence of the term in the corpus. For each posting, we might store: DocID, Position, and Features (Anchor? Title? Font Size)

• Additionally, we might want to compute Document Frequency (DF)

Page 18:

Inverted Index in MapReduce (Basic Implementation)


[Jimmy Lin, Data Intensive Processing with MapReduce, (forthcoming)]
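The basic implementation in the cited figure is not reproduced here, so as a stand-in, here is a sketch of the same idea in plain Python: the mapper emits (term, posting) pairs with positions, and the reducer sorts the grouped postings by DocID and computes DF as a by-product. Names and structure are ours, under the simplifying assumption that everything fits in memory:

```python
from collections import defaultdict

def index_map(doc_id, text):
    # Emit one (term, (doc_id, positions)) pair per distinct term in the doc.
    positions = defaultdict(list)
    for pos, term in enumerate(text.lower().split()):
        positions[term].append(pos)
    for term, plist in positions.items():
        yield term, (doc_id, plist)

def index_reduce(term, postings):
    # Sort postings by DocID to form the posting list;
    # the number of postings is the document frequency (DF).
    return term, sorted(postings), len(postings)

def build_index(docs):
    grouped = defaultdict(list)
    for doc_id, text in docs.items():            # map phase
        for term, posting in index_map(doc_id, text):
            grouped[term].append(posting)        # group by term
    return {t: index_reduce(t, p) for t, p in grouped.items()}
```

Note that, as on the next slide, the reducer must buffer every posting for a term before it can sort them, which foreshadows the scalability bottleneck discussed below.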

Page 19:

[Jimmy Lin, Data Intensive Processing with MapReduce, (forthcoming)]

Inverted Index in MapReduce (Basic Implementation)

Page 20:

Inverted Index in MapReduce (Extensions)

• Mapper could:

• parse HTML or other data

• extract additional features from each page and emit more detailed postings

• Reducer could:

• implement compression, partitioning, and coding for more efficient retrieval

Page 21:

Inverted Index in MapReduce (Limitations)

• The basic implementation has a big scalability bottleneck. Using your IR knowledge, can you spot it?

• Vocabulary size is governed by Heaps’ Law; posting-list size is governed by Zipf’s Law. For the most frequent terms, we might not be able to fit the posting list in memory!

• Workarounds exist. See [Lin 2010]
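To see why the term dictionary stays manageable while individual posting lists do not, Heaps' Law V ≈ k·n^b can be computed directly. The k and b values below are illustrative defaults, not fitted to any real corpus:

```python
def heaps_vocabulary(n_tokens, k=44.0, b=0.49):
    """Heaps' Law: vocabulary size V ~= k * n^b for a corpus of n tokens.
    k and b are corpus-dependent and must be fit empirically;
    these defaults are merely illustrative."""
    return int(k * n_tokens ** b)

# Vocabulary grows sublinearly: 100x more tokens yields far fewer than
# 100x more distinct terms, while (by Zipf) the posting list of a
# frequent term grows roughly linearly with the corpus.
v_small = heaps_vocabulary(1e6)   # ~1M-token corpus
v_large = heaps_vocabulary(1e8)   # ~100M-token corpus
```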

Page 22:

PageRank in MapReduce

Recall that graphs can be represented as adjacency matrices or adjacency lists:


[Image from Jimmy Lin, Cloud Computing Course:http://www.umiacs.umd.edu/~jimmylin/cloud-2010-Spring/index.html]

Which one is more appropriate for our task?

Page 23:

(Simplified) PageRank in MapReduce


(Assuming α=0 and no dangling edges) [Images from Jimmy Lin, Cloud Computing Course:http://www.umiacs.umd.edu/~jimmylin/cloud-2010-Spring/index.html]

Page 24:

(Simplified) PageRank in MapReduce


[Jimmy Lin, Data Intensive Processing with MapReduce, (forthcoming)]

Each iteration is a MapReduce:
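The per-iteration map and reduce can be sketched in plain Python, under the same simplifications as the slide (α=0, no dangling nodes). The graph and node names below are made up for illustration; a real run would carry the adjacency lists through the MapReduce as part of each node's value:

```python
from collections import defaultdict

def pagerank_iteration(ranks, adjacency):
    """One simplified PageRank iteration (no damping, no dangling nodes):
    map: each node splits its rank evenly among its out-links;
    reduce: each node sums the contributions it receives."""
    contributions = defaultdict(float)
    for node, out_links in adjacency.items():   # map phase
        share = ranks[node] / len(out_links)
        for target in out_links:
            contributions[target] += share      # shuffle + reduce: sum by key
    return dict(contributions)

# Tiny 3-node graph: a -> b, a -> c, b -> c, c -> a
adjacency = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = {"a": 1/3, "b": 1/3, "c": 1/3}
for _ in range(20):                             # iterate until (near) convergence
    ranks = pagerank_iteration(ranks, adjacency)
```

On this toy graph the ranks converge toward roughly a: 0.4, b: 0.2, c: 0.4, and total rank mass is conserved across iterations.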

Page 25:

What about Retrieval?

The indexing problem (great for MapReduce!):

• Scalability is paramount

• Must be relatively fast, but need not be real time

• Fundamentally a batch operation

• Incremental updates may or may not be important

• For the web, crawling is a challenge in itself

The retrieval problem (not so great for MapReduce):

• Must have sub-second response time

• For the web, only need relatively few results

[Slide from Jimmy Lin, Cloud Computing Course: http://www.umiacs.umd.edu/~jimmylin/cloud-2010-Spring/index.html]

Page 26:

MapReduce: Execution

The MapReduce framework:

• schedules mappers and reducers

• allocates workers close to data

• periodically checks for slow or failed processes and re-submits their work

• handles sorting and combining efficiently

Page 27:

MapReduce: Conclusions

• Divide and Conquer on a massive scale

• Can efficiently handle many IR batch tasks:

• Indexing, PageRank, Language Modeling, Sequence Alignment (for Translation), Classification, and more

• A reasonable abstraction, trading off flexibility against ease of implementation


[Dean and Ghemawat, OSDI 2004]

Page 28:

File Storage

• In traditional supercomputers, storage and computation are kept separate. This means data must be transferred through fast interconnects to compute nodes (bad!).

• Google File System (GFS) and the Hadoop Distributed File System (HDFS) keep data replicated across cheap commodity hardware

• Each file is replicated at least 3 times (more for highly-used or critical files)

Page 29:

GFS: Design Considerations

• Large files: 64MB chunks (why?) stored on Chunkservers

• GFS Masters manage metadata

• Clients retrieve file data directly from chunkservers

[Dean, Handling Large Datasets at Google: http://research.yahoo.com/files/6DeanGoogle.pdf]

Page 30:

GFS Usage (2007)

• 200+ GFS clusters

• Largest clusters:

• 5000+ machines

• 5+ PB of disk usage

• 10000+ clients

[Dean, Handling Large Datasets at Google: http://research.yahoo.com/files/6DeanGoogle.pdf]

Page 31:

Semi-Structured Data

• Traditional relational databases fail at this scale: most operations are too expensive.

• Solution: distributed databases

• Google’s BigTable stores data as a “sparse, distributed multi-dimensional sorted map”

• In the open-source world, Cassandra (Digg/Twitter/Facebook), HBase (Yahoo, others) and CouchDB perform similar roles
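The "sparse, distributed, multi-dimensional sorted map" data model can be illustrated with a toy in-memory class. ToyBigTable and its methods are made-up names for illustration, not any real API; the real systems shard this map into tablets, version every cell, and persist it, all of which this sketch ignores:

```python
class ToyBigTable:
    """Toy model of BigTable's data model: a sparse map from
    (row_key, column, timestamp) to an uninterpreted byte string."""

    def __init__(self):
        self._cells = {}

    def put(self, row, column, timestamp, value):
        self._cells[(row, column, timestamp)] = value

    def get(self, row, column):
        # Return the most recent version of a cell, or None if absent.
        versions = [(ts, v) for (r, c, ts), v in self._cells.items()
                    if r == row and c == column]
        return max(versions)[1] if versions else None

    def scan(self, row_prefix):
        # Rows are kept sorted by key, so prefix scans are cheap in the
        # real system; here we simply sort on demand.
        return sorted(k for k in self._cells if k[0].startswith(row_prefix))
```

The classic usage pattern stores web pages under reversed-domain row keys (e.g. "com.example/www"), so that all pages from one site are adjacent in a scan.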

Page 32:

[http://blog.nahurst.com/visual-guide-to-nosql-systems]

Page 33:

BigTable Design Considerations

• Loose schema and data types

• BigTables divided into tablets which are replicated and distributed

• Tablets are ~100-200 MB each and are stored in GFS. Each machine hosts ~100 tablets

• Optimized for reads and appends

• Tablets can be reallocated on failures or increased load

Page 34:

BigTable at Google

• Used widely: Google Earth, Analytics, Crawl, Print, Orkut, Blogger, …

• Largest cluster (2009): 70+ PB of data, 10M ops/sec, 30+ GB/s I/O


[Jeff Dean, LADIS 2009 Keynote]

Page 35:

Fitting It All Together

• Operating at Web Scale requires completely distributed, fault tolerant systems

• Replication and data locality are key

• Good abstractions allow smart programmers to be efficient

• Data is only going to get bigger

Page 36:

Questions?
