

Data-Intensive Scalable Computing
Finding the Right Programming Models

Randal E. Bryant
Carnegie Mellon University
http://www.cs.cmu.edu/~bryant


– 2 –

Outline

Data-Intensive Scalable Computing
- Focus on petabytes, not petaflops
- Rethinking machine design

Map/Reduce Programming Model
- Suitability for wide class of computing tasks

Strengths & Limitations
- Scalability
- Performance

Beyond Map/Reduce
- Small variations
- Other programming models


– 3 –

Our Data-Driven World

Science
- Databases from astronomy, genomics, natural languages, seismic modeling, …

Humanities
- Scanned books, historic documents, …

Commerce
- Corporate sales, stock market transactions, census, airline traffic, …

Entertainment
- Internet images, Hollywood movies, MP3 files, …

Medicine
- MRI & CT scans, patient records, …


– 4 –

Why So Much Data?

We Can Get It
- Automation + Internet

We Can Keep It
- Seagate Barracuda: 1.5 TB @ $76
- 5¢ / GB (vs. 40¢ in 2007)

We Can Use It
- Scientific computation
- Business applications
- Harvesting Internet data


– 5 –

Oceans of Data, Skinny Pipes

1 Terabyte
- Easy to store
- Hard to move

Disks                     MB/s       Time to read 1 TB
Seagate Barracuda         115        2.3 hours
Seagate Cheetah           125        2.2 hours

Networks                  MB/s       Time to move 1 TB
Home Internet             < 0.625    > 18.5 days
Gigabit Ethernet          < 125      > 2.2 hours
PSC Teragrid Connection   < 3,750    > 4.4 minutes
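These times follow directly from the link bandwidths. As an illustrative check (not on the original slide), moving 1 TB over a 0.625 MB/s home connection takes at least

$$ t \;\ge\; \frac{10^{12}\ \text{bytes}}{0.625 \times 10^{6}\ \text{bytes/s}} \;=\; 1.6 \times 10^{6}\ \text{s} \;\approx\; 18.5\ \text{days}. $$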


– 6 –

Data-Intensive System Challenge

For Computation That Accesses 1 TB in 5 minutes
- Data distributed over 100+ disks
  - Assuming uniform data partitioning
- Compute using 100+ processors
  - Connected by gigabit Ethernet (or equivalent)

System Requirements
- Lots of disks
- Lots of processors
- Located in close proximity
  - Within reach of fast, local-area network
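The requirements follow from a simple bandwidth argument (shown here for clarity; the 1 TB / 5-minute target and the 100+-disk figure are the slide's):

$$ \frac{10^{12}\ \text{bytes}}{300\ \text{s}} \approx 3.3\ \text{GB/s aggregate}, \qquad \frac{3.3\ \text{GB/s}}{100\ \text{disks}} \approx 33\ \text{MB/s per disk}. $$

No single disk or network link sustains 3.3 GB/s, but ~100 disks at a modest ~33 MB/s each can, provided the processors reading them sit on the same fast local network.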


– 7 –

Google Data Centers

The Dalles, Oregon
- Hydroelectric power @ 2¢ / kWh
- 50 megawatts: enough to power 6,000 homes

Engineered for maximum modularity & power efficiency
- Container: 1,160 servers, 250 kW
- Server: 2 disks, 2 processors


– 8 –

High-Performance Distributed Computing: Two Versions

Grid Computing
- Connect small number of big machines
- Allow resource sharing among supercomputers
- Issues
  - Programs must usually be specialized for machine
  - Hard to get cooperation from multiple organizations

DISC System
- Build big system from many small machines
- Use distributed systems principles to construct large-scale machine
- File system provides distribution, reliability, recovery
- Dynamically scheduled task as basic processing unit


– 9 –

Programming Model Comparison

Bulk Synchronous
- Commonly used for compute-intensive applications

Map/Reduce
- Commonly used for data-intensive applications

Issues
- Raw performance
- Scalability
  - Cost required to increase machine size by K: ≥ K
  - Performance achieved by increasing machine size by K: ≤ K
- Frequency and impact of failures
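One way to state the scalability criteria precisely (my notation, not the slide's): if $T(1)$ is the running time on the base configuration and $T(K)$ the time after scaling the machine up by a factor of $K$, then

$$ S(K) = \frac{T(1)}{T(K)} \le K, \qquad E(K) = \frac{S(K)}{K} \le 1, $$

and a design scales well when the efficiency $E(K)$ stays close to 1 while the cost of the machine grows roughly linearly in $K$.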


– 10 –

Bulk Synchronous Programming

Solving Problem Over Grid
- E.g., finite-element computation

Partition into Regions
- p regions for p processors

Map Region per Processor
- Local computation sequential
- Periodically communicate boundary values with neighbors (toy sketch below)
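To make the superstep structure concrete, here is a toy single-machine Python sketch of the bulk-synchronous pattern (my illustration, not code from the talk): each simulated processor owns one region of a 1-D grid, computes locally, and exchanges boundary values with its neighbors before every step.

    def bsp_relax(grid, p, supersteps):
        """Toy 1-D Jacobi-style relaxation in bulk-synchronous style.
        Assumes len(grid) is divisible by p (one region per 'processor')."""
        size = len(grid) // p
        regions = [grid[i * size:(i + 1) * size] for i in range(p)]
        for _ in range(supersteps):
            # Communication phase: fetch each neighbor's boundary value
            # (the ends of the grid just reuse their own boundary).
            left  = [regions[i - 1][-1] if i > 0     else regions[i][0]  for i in range(p)]
            right = [regions[i + 1][0]  if i < p - 1 else regions[i][-1] for i in range(p)]
            # Local computation phase: average each point with its neighbors.
            new_regions = []
            for i, r in enumerate(regions):
                padded = [left[i]] + r + [right[i]]
                new_regions.append([(padded[j - 1] + padded[j] + padded[j + 1]) / 3.0
                                    for j in range(1, len(padded) - 1)])
            regions = new_regions   # implicit barrier: all regions advance together
        return [x for r in regions for x in r]

    print(bsp_relax([0.0] * 8 + [8.0] * 8, p=4, supersteps=10))

In a real HPC code each region would live in the memory of its own node and the boundary exchange would be message passing (e.g., MPI); the point is the alternation of independent local work with a global synchronization.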


– 11 –

Typical HPC Operation

Characteristics
- Long-lived processes
- Make use of spatial locality
- Hold all program data in memory (no disk access)
- High-bandwidth communication

Strengths
- High utilization of resources
- Effective for many scientific applications

Weaknesses
- Requires careful tuning of application to resources
- Intolerant of any variability

[Figures: processors P1–P5 over a shared memory, and processors P1–P5 communicating by message passing]


– 12 –

Map/Reduce Programming Model

Map computation across many objects
- E.g., 10^10 Internet web pages

Aggregate results in many different ways
- System deals with issues of resource allocation & reliability

[Figure: mappers M applied to inputs x1 … xn emit key-value pairs, which are routed by key to reducers for keys k1 … kr]

Dean & Ghemawat: “MapReduce: Simplified Data Processing on Large Clusters”, OSDI 2004
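To make the programmer-visible contract concrete, here is a minimal single-machine sketch of the execution model in Python (an illustration only, not the Google or Hadoop implementation): a mapper is applied to every input record and emits ⟨key, value⟩ pairs, the system groups the pairs by key, and a reducer is applied to each group.

    from collections import defaultdict

    def map_reduce(inputs, mapper, reducer):
        """Apply mapper to every record, group emitted (key, value) pairs
        by key, then apply reducer to each group."""
        groups = defaultdict(list)
        for record in inputs:
            for key, value in mapper(record):
                groups[key].append(value)
        return {key: reducer(key, values) for key, values in groups.items()}

In the real systems the map and reduce tasks run in parallel across thousands of machines and the grouped intermediate data lives in files, but application code is written against essentially this interface. Later examples in this transcript reuse this map_reduce helper.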


– 13 –

Map/Reduce Example

Create a word index of a set of documents
- Map: generate ⟨word, count⟩ pairs for all words in a document
- Reduce: sum word counts across documents

[Figure: mappers extract ⟨word, 1⟩ pairs from the documents "Come, Dick", "Come and see.", "Come and see Spot.", "Come and see.", "Come, come."; reducers sum the counts for each word, giving ⟨and, 3⟩, ⟨come, 6⟩, ⟨dick, 1⟩, ⟨see, 3⟩, ⟨spot, 1⟩]
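A minimal sketch of this example, reusing the map_reduce helper from the previous slide (plain Python, not Hadoop code):

    docs = ["Come, Dick", "Come and see.", "Come and see Spot.",
            "Come and see.", "Come, come."]

    def word_mapper(doc):
        # Emit <word, 1> for every word in the document.
        for word in doc.lower().replace(",", " ").replace(".", " ").split():
            yield word, 1

    def sum_reducer(word, counts):
        # Sum the per-occurrence counts for each word.
        return sum(counts)

    print(map_reduce(docs, word_mapper, sum_reducer))
    # {'come': 6, 'dick': 1, 'and': 3, 'see': 3, 'spot': 1}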


– 14 –

Map/Reduce Operation

Characteristics
- Computation broken into many, short-lived tasks
  - Mapping, reducing
- Use disk storage to hold intermediate results

Strengths
- Great flexibility in placement, scheduling, and load balancing
- Can access large data sets

Weaknesses
- Higher overhead
- Lower raw performance



– 15 –

System Comparison: Programming Models

Conventional HPC
- Programs described at very low level
  - Specify detailed control of processing & communications
- Rely on small number of software packages
  - Written by specialists
  - Limits classes of problems & solution methods
- Software stack: Hardware → Machine-Dependent Programming Model → Software Packages → Application Programs

DISC
- Application programs written in terms of high-level operations on data
- Runtime system controls scheduling, load balancing, …
- Software stack: Hardware → Machine-Independent Programming Model → Runtime System → Application Programs


– 16 –

HPC Fault Tolerance

Checkpoint
- Periodically store state of all processes
- Significant I/O traffic

Restore
- When failure occurs
- Reset state to that of last checkpoint
- All intervening computation wasted

Performance Scaling
- Very sensitive to number of failing components

[Figure: timeline of processes P1–P5 with periodic checkpoints; after a failure, all processes restore to the last checkpoint and the intervening computation is wasted]


– 17 –

Map/Reduce Fault Tolerance

Data Integrity
- Store multiple copies of each file
  - Including intermediate results of each Map / Reduce
- Continuous checkpointing

Recovering from Failure
- Simply recompute lost result
  - Localized effect
- Dynamic scheduler keeps all processors busy



– 18 –

DISC Scalability Advantages

Distributed system design principles lead to scalable design
- Dynamically scheduled tasks with state held in replicated files

Provisioning Advantages
- Can use consumer-grade components
  - Maximizes cost-performance
- Can have heterogeneous nodes
  - More efficient technology refresh

Operational Advantages
- Minimal staffing
- No downtime


– 19 –

Getting Started

Goal
- Get people involved in DISC

Software
- Hadoop Project
  - Open-source project providing file system and Map/Reduce
  - Supported and used by Yahoo
  - Rapidly expanding user/developer base
  - Prototype on single machine, map onto cluster


– 20 –

Exploring Parallel Computation Models

DISC + Map/Reduce Provides Coarse-Grained Parallelism
- Computation done by independent processes
- File-based communication

Observations
- Relatively "natural" programming model
- Research issue to explore full potential and limits

[Figure: spectrum of models from low-communication, coarse-grained (SETI@home) through Map/Reduce and MPI to high-communication, fine-grained (threads, PRAM)]


– 21 –

Example: Sparse Matrices with Map/Reduce

Task: Compute product C = A·B
- Assume most matrix entries are 0

Motivation
- Core problem in scientific computing
- Challenging for parallel execution
- Demonstrate expressiveness of Map/Reduce

    A = | 10   0  20 |    B = | -1   0 |    C = A·B = |  -10  -80 |
        |  0  30  40 |        | -2  -3 |              |  -60 -250 |
        | 50  60  70 |        |  0  -4 |              | -170 -460 |


– 22 –

Computing Sparse Matrix Product

Represent matrix as list of nonzero entries: ⟨row, col, value, matrixID⟩

Strategy
- Phase 1: Compute all products a(i,k) · b(k,j)
- Phase 2: Sum products for each entry (i,j)
- Each phase involves a Map/Reduce

A: ⟨1,1,10,A⟩ ⟨1,3,20,A⟩ ⟨2,2,30,A⟩ ⟨2,3,40,A⟩ ⟨3,1,50,A⟩ ⟨3,2,60,A⟩ ⟨3,3,70,A⟩
B: ⟨1,1,-1,B⟩ ⟨2,1,-2,B⟩ ⟨2,2,-3,B⟩ ⟨3,2,-4,B⟩
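In Python terms (a sketch of this representation, with values taken from the running example), the two matrices become one flat list of tuples:

    # Nonzero entries of A (3x3) and B (3x2) as (row, col, value, matrixID) tuples.
    entries = [
        (1, 1, 10, 'A'), (1, 3, 20, 'A'),
        (2, 2, 30, 'A'), (2, 3, 40, 'A'),
        (3, 1, 50, 'A'), (3, 2, 60, 'A'), (3, 3, 70, 'A'),
        (1, 1, -1, 'B'), (2, 1, -2, 'B'),
        (2, 2, -3, 'B'), (3, 2, -4, 'B'),
    ]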


– 23 –

Phase 1 Map of Matrix Multiply

Group values a(i,k) and b(k,j) according to key k
- For A entries the key is the column index k; for B entries the key is the row index k

[Figure: the entries regrouped by k —
Key = 1: ⟨1,1,10,A⟩ ⟨3,1,50,A⟩ ⟨1,1,-1,B⟩
Key = 2: ⟨2,2,30,A⟩ ⟨3,2,60,A⟩ ⟨2,1,-2,B⟩ ⟨2,2,-3,B⟩
Key = 3: ⟨1,3,20,A⟩ ⟨2,3,40,A⟩ ⟨3,3,70,A⟩ ⟨3,2,-4,B⟩]


– 24 –

Phase 1 “Reduce” of Matrix Multiply

Generate all products a(i,k) · b(k,j)

[Figure: within each key group, every A entry is multiplied by every B entry —
Key = 1: ⟨1,1,-10,C⟩ ⟨3,1,-50,C⟩
Key = 2: ⟨2,1,-60,C⟩ ⟨2,2,-90,C⟩ ⟨3,1,-120,C⟩ ⟨3,2,-180,C⟩
Key = 3: ⟨1,2,-80,C⟩ ⟨2,2,-160,C⟩ ⟨3,2,-280,C⟩]
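A minimal sketch of Phase 1, reusing the map_reduce helper and entries list from the earlier slides: the mapper keys each A entry by its column and each B entry by its row, and the "reduce" step pairs up every A entry with every B entry that shares the key k.

    def phase1_mapper(entry):
        row, col, value, mat = entry
        k = col if mat == 'A' else row      # a(i,k) keyed by col, b(k,j) keyed by row
        yield k, entry

    def phase1_reducer(k, group):
        a_entries = [e for e in group if e[3] == 'A']
        b_entries = [e for e in group if e[3] == 'B']
        # Every a(i,k) meets every b(k,j): emit the partial product for C[i][j].
        return [(i, j, av * bv, 'C')
                for (i, _, av, _) in a_entries
                for (_, j, bv, _) in b_entries]

    phase1 = map_reduce(entries, phase1_mapper, phase1_reducer)
    products = [p for group in phase1.values() for p in group]
    # products holds the nine <i, j, a*b, C> tuples shown in the figure above.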


– 25 –

Phase 2 Map of Matrix Multiply

Group products a(i,k) · b(k,j) with matching values of i and j
- Key = ⟨row, col⟩

[Figure: the partial products regrouped by (i, j) —
Key = 1,1: ⟨1,1,-10,C⟩
Key = 1,2: ⟨1,2,-80,C⟩
Key = 2,1: ⟨2,1,-60,C⟩
Key = 2,2: ⟨2,2,-90,C⟩ ⟨2,2,-160,C⟩
Key = 3,1: ⟨3,1,-50,C⟩ ⟨3,1,-120,C⟩
Key = 3,2: ⟨3,2,-180,C⟩ ⟨3,2,-280,C⟩]


– 26 –

Phase 2 Reduce of Matrix Multiply

Sum products to get final entries

[Figure: summing each group gives the nonzero entries of C — ⟨1,1,-10,C⟩ ⟨1,2,-80,C⟩ ⟨2,1,-60,C⟩ ⟨2,2,-250,C⟩ ⟨3,1,-170,C⟩ ⟨3,2,-460,C⟩, i.e.

    C = |  -10  -80 |
        |  -60 -250 |
        | -170 -460 | ]
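And a matching sketch of Phase 2 (again reusing the map_reduce helper and the products list from Phase 1): the mapper keys each partial product by its (row, col) position and the reducer sums each group, reproducing the nonzero entries of C.

    def phase2_mapper(product):
        row, col, value, mat = product
        yield (row, col), value            # key on the target position in C

    def phase2_reducer(pos, values):
        return sum(values)                 # sum all partial products for C[pos]

    C = map_reduce(products, phase2_mapper, phase2_reducer)
    print(sorted(C.items()))
    # [((1, 1), -10), ((1, 2), -80), ((2, 1), -60),
    #  ((2, 2), -250), ((3, 1), -170), ((3, 2), -460)]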


– 27 –

Lessons from Sparse Matrix Example

Associative Matching is Powerful Communication Primitive
- Intermediate step in Map/Reduce

Similar Strategy Applies to Other Problems
- Shortest path in graph
- Database join

Many Performance Considerations
- Kiefer, Volk, Lehner, TU Dresden
- Should do systematic comparison to other sparse matrix implementations


– 28 –

Tweaking MapReduce

Map/Reduce/Merge
- Yang et al., Yahoo & UCLA
- Add merge step to do tagged merge of data sets
  - E.g., getting elements from matrices A & B
- Demonstrate expressive power of relational algebra

Local Iterations
- Kambatla et al., Purdue
- Support local and global iterations
- Step toward bulk synchronous model


– 29 –

Local / Global Map/Reduce

[Figure: alternating phases — several local rounds within each partition between each global round]

Iterative Computations
- E.g., PageRank

Graph Partitioning
- Partition into clusters
- Colocate data for each cluster

Computation
- Compute solution for each cluster, holding inter-cluster values constant
- Update inter-cluster values
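As a rough illustration of the iterative pattern (plain PageRank only, in my own simplified form: no dangling-node handling and no local-iteration refinement), each pass of the outer loop below corresponds to one global Map/Reduce job, reusing the map_reduce helper from earlier:

    def pagerank_round(ranks, links, damping=0.85):
        def mapper(item):
            page, out_links = item
            for dest in out_links:                       # spread rank over out-links
                yield dest, damping * ranks[page] / len(out_links)
            yield page, 0.0                              # keep every page in the output
        def reducer(page, contributions):
            return (1.0 - damping) + sum(contributions)
        return map_reduce(links.items(), mapper, reducer)

    links = {'a': ['b', 'c'], 'b': ['c'], 'c': ['a']}    # tiny example graph
    ranks = {page: 1.0 for page in links}
    for _ in range(20):                                  # each pass = one global Map/Reduce job
        ranks = pagerank_round(ranks, links)

The local/global variant sketched on this slide would instead run several such rounds within each cluster, treating the ranks of pages outside the cluster as constants, and only occasionally run a global round to refresh the inter-cluster values.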


– 30 –

Pig Project

Chris Olston, Yahoo!
- Part of Apache/Hadoop

Merge Database & Programming
- SQL-like query language
- Set-oriented function application

Implementation
- Map onto Hadoop
- Automatic selection / optimization of algorithm
- Captures low-level tricks programmers have devised for Map/Reduce


– 31 –

Generalizing Map/Reduce
- E.g., Microsoft Dryad Project

Computational Model
- Acyclic graph of operators
  - But expressed as textual program
- Each takes collection of objects and produces objects
- Purely functional model

Implementation Concepts
- Objects stored in files or memory
- Any object may be lost; any operator may fail
- Replicate & recompute for fault tolerance
- Dynamic scheduling
  - # Operators >> # Processors

[Figure: inputs x1 … xn flowing through layers of operators Op1 … Opk, forming an acyclic dataflow graph]


– 32 –

Implementation Challenges

Hadoop Platform Is a Blessing …
- Large community adding features, improving performance
- Easy to deploy
- Growing body of documentation and materials
- Works well enough for range of applications

… And a Curse
- Map & reduce functionality hardwired
- Not designed as extensible platform

Dryad Not Widely Adopted
- 305 citations on Google Scholar vs. 1,759 for MapReduce
- Built on .NET


– 33 –

Conclusions

Distributed Systems Concepts Lead to Scalable Machines
- Loosely coupled execution model
- Lowers cost of procurement & operation

Map/Reduce Gaining Widespread Use
- Hadoop makes it widely available
- Great for some applications, good enough for many others

Lots of Work to Be Done
- Richer set of programming models and implementations
- Expanding range of applicability
- Problems that are data and compute intensive
- The future of supercomputing?