Data-Intensive Text Processing with MapReduce
J. Lin & C. Dyer


Page 1: Data-Intensive Text Processing with  MapReduce J. Lin & C. Dyer

Data-Intensive Text Processing with MapReduce
J. Lin & C. Dyer

Page 2: Data-Intensive Text Processing with  MapReduce J. Lin & C. Dyer

MapReduce

• Programming model for distributed computations on massive amounts of data

• Execution framework for large-scale data processing on clusters of commodity servers

• Developed by Google – built on old principles of parallel and distributed processing

• Hadoop – open-source implementation, adopted and developed at Yahoo (now an Apache project)

Page 3: Data-Intensive Text Processing with  MapReduce J. Lin & C. Dyer

Big Data

• Big data – an issue to grapple with
• Web-scale is synonymous with data-intensive processing
• Public and private repositories of vast data
• Behavior data is important for business intelligence (BI)

Page 4: Data-Intensive Text Processing with  MapReduce J. Lin & C. Dyer

4th paradigm

• Manipulating, exploring, and mining massive data is the 4th paradigm of science (after theory, experiments, and simulations)
• In CS, systems must be able to scale
• Increases in capacity have outpaced improvements in bandwidth

Page 5: Data-Intensive Text Processing with  MapReduce J. Lin & C. Dyer

MapReduce (MR)

• MapReduce – a level of abstraction and a beneficial division of labor
– Programming model – a powerful abstraction that separates the what from the how of data-intensive processing

Page 6: Data-Intensive Text Processing with  MapReduce J. Lin & C. Dyer

Big Ideas behind MapReduce

• Scale out, not up
– Purchasing symmetric multi-processing (SMP) machines with large numbers of processor sockets (100s) and large shared memory (GBs) is not cost effective
• Why? A machine with 2x the processors costs more than 2x as much
– Barroso & Hölzle analysis using TPC benchmarks
• SMP communication is an order of magnitude faster
– A cluster of low-end machines is about 4x more cost effective than the high-end approach
– However, even low-end machines see only 10–50% utilization – not energy efficient

Page 7: Data-Intensive Text Processing with  MapReduce J. Lin & C. Dyer

Big Ideas behind MapReduce

• Assume failures are common
– Assume cluster machines have a mean time between failures of 1,000 days
– A 10,000-server cluster then sees about 10 failures a day (10,000 ÷ 1,000)
– MR copes with failure
• Move processing to the data
– MR assumes an architecture where processors and storage are co-located
– Run code on the processor attached to the data

Page 8: Data-Intensive Text Processing with  MapReduce J. Lin & C. Dyer

Big Ideas behind MapReduce

• Process data sequentially, not randomly
– Consider a 1 TB database of 10^10 100-byte records
– Updating 1% of the records with random access takes about a month
– Reading the entire DB sequentially and rewriting all records with the updates takes less than one work day on a single machine
– Solid-state storage won't change this picture
– MR is designed for batch processing: trade latency for throughput
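A rough back-of-envelope check of these numbers, as a Python sketch (the 10 ms seek time and 100 MB/s transfer rate are assumed commodity-disk figures, not from the slides):

RECORDS = 10**10             # 10^10 records x 100 B = 1 TB
SEEK_S = 0.010               # assumption: 10 ms per random access
XFER_BPS = 100e6             # assumption: 100 MB/s sequential transfer

# Random access: one seek per updated record (1% of the DB).
random_days = (RECORDS * 0.01 * SEEK_S) / 86400

# Sequential: read the whole 1 TB, then rewrite it.
sequential_hours = 2 * (RECORDS * 100 / XFER_BPS) / 3600

print(f"random updates:  ~{random_days:.0f} days")        # ~12 days, order of a month
print(f"sequential pass: ~{sequential_hours:.1f} hours")  # ~5.6 hours, < 1 work day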

Page 9: Data-Intensive Text Processing with  MapReduce J. Lin & C. Dyer

Big Ideas behind MapReduce

• Hide system-level details from the application developer
– Writing distributed programs is difficult
• Details across threads, processes, machines
• Code that runs concurrently is unpredictable
– Deadlocks, race conditions, etc.
– MR isolates the developer from system-level details
• No locking, starvation, etc.
• Well-defined interfaces
• Separates the what (programmer) from the how (responsibility of the execution framework)
• Framework designed once and verified for correctness

Page 10: Data-Intensive Text Processing with  MapReduce J. Lin & C. Dyer

Big Ideas behind MapReduce

• Seamless scalability
– Given 2x the data, algorithms should take at most 2x as long to run
– Given a cluster 2x as large, they should take half the time
– The above is unobtainable for most algorithms
• 9 women can't have a baby in 1 month
• E.g., 2x the parallel programs can take longer
• A higher degree of parallelization increases communication
– MR is a small step toward attaining it
• The algorithm stays fixed; the framework executes it at scale
• If 10 machines take 10 hours, 100 machines should take 1 hour

Page 11: Data-Intensive Text Processing with  MapReduce J. Lin & C. Dyer

Motivation for MapReduce

• Still waiting for parallel processing to replace sequential processing
• Progress of Moore's law meant most problems could be solved by a single computer, so parallelism was largely ignored
• Around 2005, this was no longer true
– The semiconductor industry ran out of opportunities to improve: faster clocks, cheaper pipelines, superscalar architecture
– Then came multi-core
• Not matched by advances in software

Page 12: Data-Intensive Text Processing with  MapReduce J. Lin & C. Dyer

Motivation

• Parallel processing is the only way forward
• MapReduce to the rescue
– Anyone can download the open-source Hadoop implementation of MapReduce
– Rent a cluster from a utility cloud
– Process TBs within the week
• Multiple cores in a chip, multiple machines in a cluster

Page 13: Data-Intensive Text Processing with  MapReduce J. Lin & C. Dyer

Motivation

• MapReduce: an effective data analysis tool
– The first widely adopted step away from the von Neumann model
• Can't treat a multi-core processor or a cluster as a conglomeration of many von Neumann machine images communicating over a network
• Wrong abstraction
– MR organizes computations over clusters, not over individual machines
• The datacenter is the computer

Page 14: Data-Intensive Text Processing with  MapReduce J. Lin & C. Dyer

Motivation

• Models of parallel computation
– PRAM (Parallel Random Access Machine)
• An arbitrary number of processors share an unboundedly large memory and operate synchronously on a shared input
– MR is the most successful abstraction for large-scale resources
• Manages complexity, hides details, presents well-defined behavior
• Makes certain tasks easier, others harder
• MapReduce is the first in a new class of programming models

Page 15: Data-Intensive Text Processing with  MapReduce J. Lin & C. Dyer

Basics

• Divide and conquer
– Partition a large problem into smaller subproblems
– Workers work on the subproblems in parallel
• Threads in a core, cores in a multi-core processor, multiple processors in a machine, machines in a cluster
– Combine intermediate results from the workers into the final result

Page 16: Data-Intensive Text Processing with  MapReduce J. Lin & C. Dyer

Basics

• MR – an abstraction that hides system-level details from the programmer
• Move code to the data
– Spread data across disks
– A distributed file system (DFS) manages storage
• Based on functional programming

Page 17: Data-Intensive Text Processing with  MapReduce J. Lin & C. Dyer

Functional Programming Roots

• MapReduce = functional programming plus distributed processing on steroids
– Not a new idea… dates back to the '50s (or even the '30s)
• What is functional programming?
– Computation as the application of functions
– Computation is the evaluation of mathematical functions
– Avoids state and mutable data
– Emphasizes the application of functions instead of changes in state

Page 18: Data-Intensive Text Processing with  MapReduce J. Lin & C. Dyer

Functional Programming Roots

• How is it different?
– Traditional notions of "data" and "instructions" are not applicable
– Data flows are implicit in the program
– Different orders of execution are possible
– Theoretical foundation provided by the lambda calculus
• A formal system for function definition
• Exemplified by Lisp and Scheme

Page 19: Data-Intensive Text Processing with  MapReduce J. Lin & C. Dyer

Overview of Lisp

• Functions written in prefix notation

(+ 1 2) → 3
(* 3 4) → 12
(sqrt (+ (* 3 3) (* 4 4))) → 5
(define x 3) → x
(* x 5) → 15

Page 20: Data-Intensive Text Processing with  MapReduce J. Lin & C. Dyer

Functional Programming Roots

• Two important concepts in functional programming
– Map: do something to everything in a list
– Fold: combine the results of a list in some way

Page 21: Data-Intensive Text Processing with  MapReduce J. Lin & C. Dyer

Functional Programming Map

• Higher-order functions – accept other functions as arguments
– Map
• Takes a function f and its argument, which is a list
• Applies f to all elements in the list
– Lists are primitive data types
» [1 2 3 4 5]
» [[a 1] [b 2] [c 3]]
• Returns a list as the result
• Simple map example:

(map (lambda (x) (* x x)) [1 2 3 4 5]) → [1 4 9 16 25]

Page 22: Data-Intensive Text Processing with  MapReduce J. Lin & C. Dyer

Functional Programming Reduce

– Fold
• Takes a function g, which has 2 arguments: an initial value and a list
• g is applied to the initial value and the 1st item in the list
• The result is stored in an intermediate variable
• The intermediate variable and the next item in the list go into the 2nd application of g, etc.
• Fold returns the final value of the intermediate variable

Page 23: Data-Intensive Text Processing with  MapReduce J. Lin & C. Dyer

Map/Fold in Action

• Simple map example:

(map (lambda (x) (* x x)) [1 2 3 4 5]) → [1 4 9 16 25]

• Fold examples:

(fold + 0 [1 2 3 4 5]) → 15
(fold * 1 [1 2 3 4 5]) → 120

• Sum of squares:

(define (sum-of-squares v)   ; v is a list
  (fold + 0 (map (lambda (x) (* x x)) v)))
(sum-of-squares [1 2 3 4 5]) → 55
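For comparison, the same map/fold pattern in Python – a minimal sketch using the built-in map and functools.reduce (reduce plays the role of fold here):

from functools import reduce  # reduce is Python's fold

squares = list(map(lambda x: x * x, [1, 2, 3, 4, 5]))
print(squares)                                             # [1, 4, 9, 16, 25]

print(reduce(lambda acc, x: acc + x, [1, 2, 3, 4, 5], 0))  # 15
print(reduce(lambda acc, x: acc * x, [1, 2, 3, 4, 5], 1))  # 120

def sum_of_squares(v):
    # Fold + over the squares of a list, as in the Scheme example above.
    return reduce(lambda acc, x: acc + x, map(lambda y: y * y, v), 0)

print(sum_of_squares([1, 2, 3, 4, 5]))                     # 55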

Page 24: Data-Intensive Text Processing with  MapReduce J. Lin & C. Dyer
Page 25: Data-Intensive Text Processing with  MapReduce J. Lin & C. Dyer

Functional Programming Roots

• Use map/fold in combination
• Map – transformation of a dataset
• Fold – aggregation operation
• Map can be applied in parallel
• Fold has more restrictions: elements must be brought together
– Many applications do not require g to be applied to all elements of the list, so fold aggregations can also be done in parallel

Page 26: Data-Intensive Text Processing with  MapReduce J. Lin & C. Dyer

MapReduce

• Map in MapReduce is the same as in functional programming
• Reduce corresponds to fold
• 2 stages:
– A user-specified computation is applied over all the input; it can occur in parallel and returns intermediate output
– The output is aggregated by another user-specified computation

Page 27: Data-Intensive Text Processing with  MapReduce J. Lin & C. Dyer

Mappers/Reducers

• Key-value pair (k, v) – the basic data structure in MR
• Keys and values – ints, strings, etc.; user defined
– e.g., keys – URLs, values – HTML content
– e.g., keys – node ids, values – adjacency lists of nodes

Map: (k1, v1) → [(k2, v2)]
Reduce: (k2, [v2]) → [(k3, v3)]

where […] denotes a list
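As a sketch, the same signatures written as Python type stubs (the names map_fn and reduce_fn and the type variables are illustrative, not from the slides):

from typing import Iterable, Iterator, Tuple, TypeVar

K1, V1 = TypeVar("K1"), TypeVar("V1")  # input key/value types
K2, V2 = TypeVar("K2"), TypeVar("V2")  # intermediate key/value types
K3, V3 = TypeVar("K3"), TypeVar("V3")  # output key/value types

def map_fn(key: K1, value: V1) -> Iterator[Tuple[K2, V2]]:
    ...  # Map: (k1, v1) -> [(k2, v2)]

def reduce_fn(key: K2, values: Iterable[V2]) -> Iterator[Tuple[K3, V3]]:
    ...  # Reduce: (k2, [v2]) -> [(k3, v3)]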

Page 28: Data-Intensive Text Processing with  MapReduce J. Lin & C. Dyer

General Flow

• Apply the mapper to every input key-value pair stored in the DFS
• Generate an arbitrary number of intermediate (k, v) pairs
• Distributed group-by operation (shuffle) on the intermediate keys
• Sort intermediate results by key (within each reducer, not across reducers)
• Aggregate the intermediate results
• Generate final output to the DFS – one file per reducer

[Figure: map tasks emit intermediate (k, v) pairs that are shuffled to reduce tasks]
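To make the flow concrete, here is a minimal single-process simulation of that pipeline in Python (an illustrative sketch, not Hadoop's API; the real framework distributes these phases across a cluster):

from itertools import groupby
from operator import itemgetter

def run_mapreduce(mapper, reducer, inputs):
    # mapper(k1, v1)  -> iterable of (k2, v2)
    # reducer(k2, vs) -> iterable of (k3, v3)
    # inputs          -> iterable of (k1, v1)

    # Map phase: apply the mapper to every input key-value pair.
    intermediate = [kv for k, v in inputs for kv in mapper(k, v)]

    # Shuffle/sort: group the intermediate pairs by key.
    intermediate.sort(key=itemgetter(0))

    # Reduce phase: apply the reducer to each key and its list of values.
    output = []
    for key, group in groupby(intermediate, key=itemgetter(0)):
        output.extend(reducer(key, [v for _, v in group]))
    return output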

Page 29: Data-Intensive Text Processing with  MapReduce J. Lin & C. Dyer

What function is implemented?

Page 30: Data-Intensive Text Processing with  MapReduce J. Lin & C. Dyer
Page 31: Data-Intensive Text Processing with  MapReduce J. Lin & C. Dyer

Example: unigram (word count)

• (docid, doc) on DFS, doc is text• Mapper tokenizes (docid, doc), emits (k,v) for every

word – (word, 1)• Execution framework all same keys brought

together in reducer• Reducer – sums all counts (of 1) for word• Each reduce writes to one file• Words within file sorted, file same # words• Can use output as input to another MR
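Plugged into the run_mapreduce simulation sketched above, word count looks like this (illustrative; a real Hadoop job would implement Mapper and Reducer classes instead):

def wc_mapper(docid, doc):
    # Tokenize the document and emit (word, 1) for every word.
    for word in doc.split():
        yield (word, 1)

def wc_reducer(word, counts):
    # Sum all the 1s for this word.
    yield (word, sum(counts))

docs = [("d1", "the quick brown fox"), ("d2", "the lazy dog")]
print(run_mapreduce(wc_mapper, wc_reducer, docs))
# [('brown', 1), ('dog', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 2)]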

Page 32: Data-Intensive Text Processing with  MapReduce J. Lin & C. Dyer

Execution Framework

• Scheduling
– A job is divided into tasks (each covering a certain block of (k, v) pairs)
– There can be 1000s of tasks needing to be assigned
– This may exceed the number that can run concurrently
– Tasks wait in a task queue
– Coordination among tasks from different jobs

Page 33: Data-Intensive Text Processing with  MapReduce J. Lin & C. Dyer

Execution Framework

• Speculative execution
• The map phase is only as fast as? – the slowest map task
• Problem: stragglers, flaky hardware
• Solution: use speculative execution:
– An exact copy of the same task runs on a different machine
– The result of whichever copy finishes first is used
– Better for map or reduce?
– Can improve running time by 44% (Google)
– Doesn't help if the distribution of values is skewed

Page 34: Data-Intensive Text Processing with  MapReduce J. Lin & C. Dyer

Execution Framework

• Data/code co-location
– Execute near the data
– If not possible, the data must be streamed
• Try to stay within the same rack

Page 35: Data-Intensive Text Processing with  MapReduce J. Lin & C. Dyer

Execution Framework

• Synchronization
– Concurrently running processes must join up
– Intermediate (k, v) pairs are grouped by key, copied over the network, and shuffled/sorted
• Number of copy operations? Worst case:
– M × R copy operations
• Each mapper may send intermediate results to every reducer
– The reduce computation cannot start until all mappers have finished and the (k, v) pairs have been shuffled and sorted
• Differs from functional programming
– Intermediate (k, v) pairs can be copied over the network to a reducer as soon as a mapper finishes

Page 36: Data-Intensive Text Processing with  MapReduce J. Lin & C. Dyer

Hadoop

• Be careful using external resources (e.g., a bottleneck from querying a SQL DB)
• Mappers can emit an arbitrary number of intermediate (k, v) pairs, and they can be of a different type than the input
• Reducers can emit an arbitrary number of final (k, v) pairs, and they can be of a different type than the intermediate (k, v) pairs
• Different from functional programming – can have side effects (internal state changes may cause problems; external effects may write to files)
• A MapReduce job can have no reducer, but must have a mapper
– Can just pass the identity function as the reducer (see the sketch below)
– May not have any input (e.g., computing pi)
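In terms of the run_mapreduce simulation sketched earlier, the identity reducer is just a reducer that re-emits its values unchanged (a hypothetical helper, for illustration):

def identity_reducer(key, values):
    # Pass every (key, value) pair through unchanged.
    for v in values:
        yield (key, v)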

Page 37: Data-Intensive Text Processing with  MapReduce J. Lin & C. Dyer

Other Sources

• Other source can serve as source/destination for data from MapReduce– Google – BigTable– Hbase – BigTable clone– Hadoop – integrated RDB with parallel processing,

can write to DB tables

Page 38: Data-Intensive Text Processing with  MapReduce J. Lin & C. Dyer

Google File System (Hadoop)

• Divides files into chunks of 64 MB
• Master and chunkservers
• Data replicated 3 times
• Shadow master

Page 39: Data-Intensive Text Processing with  MapReduce J. Lin & C. Dyer

CAP Theorem

• Consistency, availability, partition tolerance• Cannot satisfy all 3• Partitioning unavoidable in large data systems,

must trade off availability and consistency– If master fails, system is unavailable so consistent!– If multiple masters, more available, but inconsistent

• Workaround to single namenode– Warm standby namenode– Hadoop community working on it