Scaling Information Retrieval to the Web, Yinon Bentor, April 13, 2010


Page 1: Scaling Information Retrieval to the Web

Scaling Information Retrieval to the Web

Yinon Bentor, April 13, 2010

Page 2:

Overview

• What is large data? How big is it? How do we handle it?

• What we don’t want to do

• The Google Platform (and the Apache Platform, and the Amazon Platform, ...)

• MapReduce for robust, efficient batch computation

• Distributed File Systems (GFS, HDFS), and why they’re useful

• Distributed Databases: BigTable, CouchDB, HBase

Page 3:

Overview

• And how does this apply to Information Retrieval?

• Distributed implementation of Inverted Indexing

• MapReduce for PageRank

• What else can we do?

• Practical considerations

Page 4:

Large Data

• Google processes 20 PB a day (2008)

• Wayback Machine has 3 PB + 100 TB/month (3/2009)

• Facebook has 2.5 PB of user data + 15 TB/day (4/2009)

• eBay has 6.5 PB of user data + 50 TB/day (5/2009)

• CERN’s LHC will generate 15 PB a year

[Slide from Jimmy Lin: http://www.umiacs.umd.edu/~jimmylin/cloud-2010-Spring/ ]

Page 5:

What Can We Do With All This Data?

• Data Mining

• Question Answering

• Machine Translation

• Recommendation

• Ad Placement


• Train Classifiers (e.g., Spam Filters)

• Analyze Social Graphs

• “Discover the secrets of the universe”

“There’s no data like more data”

Page 6:

Numbers Everyone Should Know*

L1 cache reference                          0.5 ns
Branch mispredict                             5 ns
L2 cache reference                            7 ns
Mutex lock/unlock                            25 ns
Main memory reference                       100 ns
Send 2K bytes over 1 Gbps network        20,000 ns
Read 1 MB sequentially from memory      250,000 ns
Round trip within same datacenter       500,000 ns
Disk seek                            10,000,000 ns
Read 1 MB sequentially from disk     20,000,000 ns
Send packet CA → Netherlands → CA   150,000,000 ns

* According to Jeff Dean (LADIS 2009 keynote) [Slide from Jimmy Lin: http://www.umiacs.umd.edu/~jimmylin/cloud-2010-Spring/ ]

Page 7:

What are the Lessons?

• CPUs are fast, memory is slow, disk is slower: use variable-length encodings, compression, etc.

• Read from memory whenever possible: Memory reads are ~80x faster than disk

• Prefer sequential disk reads to random access

• Prefer large files (64MB block sizes aren’t bad)

• Locality is important: Keep it within the same cache read, memory page, machine, rack, data center, continent, …
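As one concrete instance of the variable-length encodings mentioned above, here is a sketch of base-128 varint encoding, the standard trick for compressing small integers (such as gaps between sorted doc-IDs) in posting lists. The helper names are ours, not from the slides:

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative int, 7 bits per byte, low bits first;
    the high bit is set on every byte except the last."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)

def decode_varint(buf: bytes) -> int:
    """Decode a single varint from the front of a byte string."""
    n, shift = 0, 0
    for b in buf:
        n |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return n
```

Small values (or small gaps) then occupy 1-2 bytes instead of a fixed 4 or 8, which is exactly the "disk is slower, so compress" lesson in action.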

Page 8:

What we don’t want to do

Use expensive machines (they fail too)

➡ Cheap commodity hardware is better

Die on hardware failure

➡ Build reliability in software

Wait on shared resources

➡ Distribute everything

Transfer data unnecessarily

➡ Move code to data instead

Page 9:

Distributed/Cloud Computing Platforms

             | Google                   | Apache/Yahoo              | Amazon
Computation  | MapReduce                | Hadoop                    | EC2 / Elastic MapReduce
File Storage | Google File System (GFS) | HDFS                      | Amazon S3
Database     | BigTable                 | HBase, Cassandra, CouchDB | Amazon SimpleDB

Page 10:

MapReduce


“A simple programming model that applies to many large scale computing problems”

[Slide from Jeff Dean LADIS 2009]

Hide messy details in MapReduce runtime library:
• automatic parallelization
• load balancing
• network and disk transfer optimizations
• handling of machine failures
• robustness

Improvements to core library benefit all users of library.

Page 11:

Programming Model (Lisp)

• map: take a list and a function f of 1 argument, apply f to each element:

map([1, 2, 4, 10],
    function(x) {return x*x;})
> [1, 4, 16, 100]

• fold: take a list, a function g of 2 arguments, and an accumulator value; apply g iteratively to the accumulator and each value:

fold([1, 4, 16, 100], 0,
     function(x, y) {return x+y;})
> 121
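The same two primitives exist in Python, which makes the slide's Lisp-flavored examples concrete (`functools.reduce` plays the role of fold):

```python
from functools import reduce

nums = [1, 2, 4, 10]

# map: apply a one-argument function to each element
squares = list(map(lambda x: x * x, nums))          # [1, 4, 16, 100]

# fold: combine elements into an accumulator with a two-argument function
total = reduce(lambda acc, x: acc + x, squares, 0)  # 121
```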

Page 12:

MapReduce Semantics

map: (k1, v1) → [(k2, v2)]

[sort and group by k2]

reduce: (k2, [v2]) → [(k3, v3)]

Page 13:

MapReduce Operation


[Image from Jimmy Lin, Data Intensive Processing with MapReduce, (forthcoming)]

Page 14:

Word Count Example

[Figure: word-count dataflow. Mapper tasks scan document shards and emit (term, count) pairs (e.g. "Dracula 37", "school 7", "all 1057"); the pairs are grouped by term and sorted; reducer tasks sum the counts for each term, yielding totals such as all 1266, cat 72, Dracula 37, school 11.]

Page 15:

Word Count: Pseudocode


[Jimmy Lin, Data Intensive Processing with MapReduce, (forthcoming)]
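The pseudocode itself lives in the cited figure; in its place, here is one plausible rendering in plain Python, simulating the map, group-by-key, and reduce phases in memory (the function names and toy documents are ours, not Lin's actual pseudocode, and a real framework would distribute each phase):

```python
from collections import defaultdict

def wc_map(doc_id, text):
    # Map: emit a (term, 1) pair for every token in the document.
    for term in text.lower().split():
        yield term, 1

def wc_reduce(term, counts):
    # Reduce: sum all partial counts for one term.
    return term, sum(counts)

def run_wordcount(docs):
    grouped = defaultdict(list)
    for doc_id, text in docs.items():          # map phase
        for term, one in wc_map(doc_id, text):
            grouped[term].append(one)          # "shuffle": group by key
    return dict(wc_reduce(t, c) for t, c in grouped.items())  # reduce phase
```

For example, `run_wordcount({"d1": "the cat sat", "d2": "the dog sat"})` returns `{"the": 2, "cat": 1, "sat": 2, "dog": 1}`.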

Page 16:

Generating Map Tiles

[Slide from Jeff Dean, LADIS 2009]

Page 17:

Inverted Indexing

• Recall that an Inverted Index is a map from a term to its posting list

• A Posting List is a list of each occurrence of the term in the corpus. For each posting, we might store: DocID, Position, and Features (Anchor? Title? Font Size)

• Additionally, we might want to compute Document Frequency (DF)

Page 18:

Inverted Index in MapReduce (Basic Implementation)


[Jimmy Lin, Data Intensive Processing with MapReduce, (forthcoming)]
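The basic implementation in the cited figure is not reproduced here, so as a stand-in, here is a sketch of the same idea in plain Python: the mapper emits (term, posting) pairs with positions, and the reducer sorts the grouped postings by DocID and computes DF as a by-product. Names and structure are ours, under the simplifying assumption that everything fits in memory:

```python
from collections import defaultdict

def index_map(doc_id, text):
    # Emit one (term, (doc_id, positions)) pair per distinct term in the doc.
    positions = defaultdict(list)
    for pos, term in enumerate(text.lower().split()):
        positions[term].append(pos)
    for term, plist in positions.items():
        yield term, (doc_id, plist)

def index_reduce(term, postings):
    # Sort postings by DocID to form the posting list;
    # the number of postings is the document frequency (DF).
    return term, sorted(postings), len(postings)

def build_index(docs):
    grouped = defaultdict(list)
    for doc_id, text in docs.items():            # map phase
        for term, posting in index_map(doc_id, text):
            grouped[term].append(posting)        # group by term
    return {t: index_reduce(t, p) for t, p in grouped.items()}
```

Note that, as on the next slide, the reducer must buffer every posting for a term before it can sort them, which foreshadows the scalability bottleneck discussed below.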

Page 19:

[Jimmy Lin, Data Intensive Processing with MapReduce, (forthcoming)]

Inverted Index in MapReduce (Basic Implementation)

Page 20:

Inverted Index in MapReduce (Extensions)

• Mapper could:

• parse HTML or other data

• extract additional features from each page and emit more detailed postings

• Reducer could:

• implement compression, partitioning, and coding for more efficient retrieval

Page 21:

Inverted Index in MapReduce (Limitations)

• The basic implementation has a big scalability bottleneck. Using your IR knowledge, can you spot it?

• Vocabulary size is governed by Heaps’ Law; posting-list size is governed by Zipf’s Law. For the most frequent terms, we might not be able to fit the posting list in memory!

• Workarounds exist. See [Lin 2010]
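To see why the term dictionary stays manageable while individual posting lists do not, Heaps' Law V ≈ k·n^b can be computed directly. The k and b values below are illustrative defaults, not fitted to any real corpus:

```python
def heaps_vocabulary(n_tokens, k=44.0, b=0.49):
    """Heaps' Law: vocabulary size V ~= k * n^b for a corpus of n tokens.
    k and b are corpus-dependent and must be fit empirically;
    these defaults are merely illustrative."""
    return int(k * n_tokens ** b)

# Vocabulary grows sublinearly: 100x more tokens yields far fewer than
# 100x more distinct terms, while (by Zipf) the posting list of a
# frequent term grows roughly linearly with the corpus.
v_small = heaps_vocabulary(1e6)   # ~1M-token corpus
v_large = heaps_vocabulary(1e8)   # ~100M-token corpus
```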

Page 22:

PageRank in MapReduce

Recall that graphs can be represented as adjacency matrices or adjacency lists:


[Image from Jimmy Lin, Cloud Computing Course:http://www.umiacs.umd.edu/~jimmylin/cloud-2010-Spring/index.html]

Which one is more appropriate for our task?

Page 23:

(Simplified) PageRank in MapReduce


(Assuming α=0 and no dangling edges) [Images from Jimmy Lin, Cloud Computing Course:http://www.umiacs.umd.edu/~jimmylin/cloud-2010-Spring/index.html]

Page 24:

(Simplified) PageRank in MapReduce


[Jimmy Lin, Data Intensive Processing with MapReduce, (forthcoming)]

Each iteration is a MapReduce:
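The per-iteration map and reduce can be sketched in plain Python, under the same simplifications as the slide (α=0, no dangling nodes). The graph and node names below are made up for illustration; a real run would carry the adjacency lists through the MapReduce as part of each node's value:

```python
from collections import defaultdict

def pagerank_iteration(ranks, adjacency):
    """One simplified PageRank iteration (no damping, no dangling nodes):
    map: each node splits its rank evenly among its out-links;
    reduce: each node sums the contributions it receives."""
    contributions = defaultdict(float)
    for node, out_links in adjacency.items():   # map phase
        share = ranks[node] / len(out_links)
        for target in out_links:
            contributions[target] += share      # shuffle + reduce: sum by key
    return dict(contributions)

# Tiny 3-node graph: a -> b, a -> c, b -> c, c -> a
adjacency = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = {"a": 1/3, "b": 1/3, "c": 1/3}
for _ in range(20):                             # iterate until (near) convergence
    ranks = pagerank_iteration(ranks, adjacency)
```

On this toy graph the ranks converge toward roughly a: 0.4, b: 0.2, c: 0.4, and total rank mass is conserved across iterations.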

Page 25:

What about Retrieval?

The indexing problem (great for MapReduce!):

• Scalability is paramount

• Must be relatively fast, but need not be real time

• Fundamentally a batch operation

• Incremental updates may or may not be important

• For the web, crawling is a challenge in itself

The retrieval problem (not so great for MapReduce):

• Must have sub-second response time

• For the web, only need relatively few results

[Slide from Jimmy Lin, Cloud Computing Course: http://www.umiacs.umd.edu/~jimmylin/cloud-2010-Spring/index.html]

Page 26:

MapReduce: Execution

The MapReduce framework:

• schedules mappers and reducers

• allocates workers close to data

• periodically checks for slow or failed processes and re-submits their work

• handles sorting and combining efficiently

Page 27:

MapReduce: Conclusions

• Divide and Conquer on a massive scale

• Can efficiently handle many IR batch tasks:

• Indexing, PageRank, Language Modeling, Sequence Alignment (for Translation), Classification, and more

• A reasonable abstraction, trading off flexibility against ease of implementation


[Dean and Ghemawat, OSDI 2004]

Page 28:

File Storage

• In traditional supercomputers, storage and computation are kept separate. This means data must be transferred through fast interconnects to compute nodes (bad!).

• Google File System (GFS) and the Hadoop Distributed File System (HDFS) keep data replicated across cheap commodity hardware

• Each file is replicated at least 3 times (more for highly-used or critical files)

Page 29:

GFS: Design Considerations

• Large files: 64MB chunks (why?) stored on Chunkservers

• GFS Masters manage metadata

• Clients retrieve file data directly from chunkservers

[Dean, Handling Large Datasets at Google: http://research.yahoo.com/files/6DeanGoogle.pdf]

Page 30:

GFS Usage (2007)

• 200+ GFS clusters

• Largest clusters:

• 5000+ machines

• 5+ PB of disk usage

• 10000+ clients

[Dean, Handling Large Datasets at Google: http://research.yahoo.com/files/6DeanGoogle.pdf]

Page 31:

Semi-Structured Data

• Traditional relational databases fail at this scale: most operations are too expensive.

• Solution: distributed databases

• Google’s BigTable stores data as a “sparse, distributed multi-dimensional sorted map”

• In the open-source world, Cassandra (Digg/Twitter/Facebook), HBase (Yahoo, others) and CouchDB perform similar roles
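The "sparse, distributed, multi-dimensional sorted map" data model can be illustrated with a toy in-memory class. ToyBigTable and its methods are made-up names for illustration, not any real API; the real systems shard this map into tablets, version every cell, and persist it, all of which this sketch ignores:

```python
class ToyBigTable:
    """Toy model of BigTable's data model: a sparse map from
    (row_key, column, timestamp) to an uninterpreted byte string."""

    def __init__(self):
        self._cells = {}

    def put(self, row, column, timestamp, value):
        self._cells[(row, column, timestamp)] = value

    def get(self, row, column):
        # Return the most recent version of a cell, or None if absent.
        versions = [(ts, v) for (r, c, ts), v in self._cells.items()
                    if r == row and c == column]
        return max(versions)[1] if versions else None

    def scan(self, row_prefix):
        # Rows are kept sorted by key, so prefix scans are cheap in the
        # real system; here we simply sort on demand.
        return sorted(k for k in self._cells if k[0].startswith(row_prefix))
```

The classic usage pattern stores web pages under reversed-domain row keys (e.g. "com.example/www"), so that all pages from one site are adjacent in a scan.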

Page 32:

[http://blog.nahurst.com/visual-guide-to-nosql-systems]

Page 33:

BigTable Design Considerations

• Loose schema and data types

• BigTables divided into tablets which are replicated and distributed

• Tablets are ~100-200 MB each and are stored in GFS. Each machine hosts ~100 tablets

• Optimized for reads and appends

• Tablets can be reallocated on failures or increased load

Page 34:

BigTable at Google

• Used widely: Google Earth, Analytics, Crawl, Print, Orkut, Blogger, …

• Largest cluster (2009): 70+ PB of data, 10M ops/sec, 30+ GB/s I/O


[Jeff Dean, LADIS 2009 Keynote]

Page 35:

Fitting It All Together

• Operating at Web Scale requires completely distributed, fault tolerant systems

• Replication and data locality are key

• Good abstractions allow smart programmers to be efficient

• Data is only going to get bigger

Page 36:

Questions?
