Data-Intensive Scalable Computing
Finding the Right Programming Models
Randal E. Bryant
Carnegie Mellon University
http://www.cs.cmu.edu/~bryant
– 2 –
Outline
Data-Intensive Scalable Computing
  Focus on petabytes, not petaflops
  Rethinking machine design
Map/Reduce Programming Model
  Suitability for a wide class of computing tasks
Strengths & Limitations
  Scalability
  Performance
Beyond Map/Reduce
  Small variations
  Other programming models
– 3 –
Our Data-Driven World
Science: databases from astronomy, genomics, natural languages, seismic modeling, …
Humanities: scanned books, historic documents, …
Commerce: corporate sales, stock market transactions, census, airline traffic, …
Entertainment: Internet images, Hollywood movies, MP3 files, …
Medicine: MRI & CT scans, patient records, …
– 4 –
Why So Much Data?
We Can Get It
  Automation + Internet
We Can Keep It
  Seagate Barracuda: 1.5 TB @ $76
  5¢ / GB (vs. 40¢ in 2007)
We Can Use It
  Scientific computation
  Business applications
  Harvesting Internet data
– 5 –
Oceans of Data, Skinny Pipes
1 Terabyte: easy to store, hard to move

Disks                      MB/s      Time
Seagate Barracuda           115      2.3 hours
Seagate Cheetah             125      2.2 hours

Networks                   MB/s      Time
Home Internet            < 0.625     > 18.5 days
Gigabit Ethernet           < 125     > 2.2 hours
PSC Teragrid Connection  < 3,750     > 4.4 minutes
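As a sanity check, the times in the table follow directly from size divided by bandwidth. A minimal sketch (the drive and link rates are just the rows above; results match the table to within rounding):

```python
# Back-of-the-envelope check of the table above: hours needed to move
# 1 TB at a given sustained bandwidth (illustrative sketch only).

TB = 1e12  # bytes, decimal units as disk vendors use

def transfer_hours(bandwidth_mb_per_s):
    """Hours needed to move 1 TB at the given MB/s."""
    return TB / (bandwidth_mb_per_s * 1e6) / 3600

for name, mbps in [("Seagate Barracuda", 115),
                   ("Home Internet", 0.625),
                   ("Gigabit Ethernet", 125),
                   ("PSC Teragrid", 3750)]:
    print(f"{name:25s} {transfer_hours(mbps):8.1f} hours")
```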
– 6 –
Data-Intensive System Challenge
For Computation That Accesses 1 TB in 5 Minutes
  Data distributed over 100+ disks
    Assuming uniform data partitioning
  Compute using 100+ processors
    Connected by gigabit Ethernet (or equivalent)
System Requirements
  Lots of disks
  Lots of processors
  Located in close proximity
    Within reach of fast, local-area network
– 7 –
Google Data Centers
The Dalles, Oregon
  Hydroelectric power @ 2¢ / kWh
  50 megawatts: enough to power 6,000 homes
Engineered for maximum modularity & power efficiency
  Container: 1,160 servers, 250 kW
  Server: 2 disks, 2 processors
– 8 –
High-Performance Distributed Computing: Two Versions
Grid Computing
  Connect a small number of big machines
  Allow resource sharing among supercomputers
  Issues
    Programs must usually be specialized for the machine
    Hard to get cooperation from multiple organizations
DISC System
  Build a big system from many small machines
  Use distributed-systems principles to construct a large-scale machine
  File system provides distribution, reliability, recovery
  Dynamically scheduled tasks as the basic processing unit
– 9 –
Programming Model Comparison
Bulk Synchronous: commonly used for compute-intensive applications
Map/Reduce: commonly used for data-intensive applications
Issues
  Raw performance
  Scalability
    Cost required to increase machine size by K: ≥ K
    Performance achieved by increasing machine size by K: ≤ K
  Frequency and impact of failures
– 10 –
Bulk Synchronous Programming
Solving a Problem Over a Grid
  E.g., finite-element computation
Partition into Regions
  p regions for p processors
Map One Region per Processor
  Local computation is sequential
  Periodically communicate boundary values with neighbors
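A toy, single-process sketch of this pattern (not from the talk; the 1-D grid, region size, fixed boundary temperatures, and Jacobi update are illustrative assumptions):

```python
# Bulk-synchronous sketch: a 1-D grid split into p regions. Each simulated
# processor smooths its region locally; at each step, neighbors exchange
# boundary values, then all regions advance together (the "barrier").

p = 4                      # number of simulated processors
region = 8                 # cells per region
grid = [[0.0] * region for _ in range(p)]

for step in range(50):
    new = []
    for r in range(p):
        # Boundary values seen from the neighbors' previous step.
        left = grid[r - 1][-1] if r > 0 else 100.0   # hot left edge
        right = grid[r + 1][0] if r < p - 1 else 0.0  # cold right edge
        padded = [left] + grid[r] + [right]
        # Local computation: Jacobi averaging within the region.
        new.append([(padded[i - 1] + padded[i + 1]) / 2
                    for i in range(1, region + 1)])
    grid = new             # barrier: all regions step in lockstep

print([round(v, 1) for row in grid for v in row])
```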
– 11 –
Typical HPC Operation
Characteristics
  Long-lived processes
  Make use of spatial locality
  Hold all program data in memory (no disk access)
  High-bandwidth communication
Strengths
  High utilization of resources
  Effective for many scientific applications
Weaknesses
  Requires careful tuning of application to resources
  Intolerant of any variability
[Figure: processors P1–P5 communicating through shared memory vs. via message passing]
– 12 –
Map/Reduce Programming Model
Map a computation across many objects
  E.g., 10^10 Internet web pages
Aggregate results in many different ways
  System deals with issues of resource allocation & reliability
[Figure: mapper M applied to each input x1, …, xn emits key-value pairs, which are routed to reducers for keys k1, …, kr]
Dean & Ghemawat: “MapReduce: Simplified Data Processing on Large Clusters”, OSDI 2004
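A minimal, single-machine sketch of this model (illustrative only; a real framework such as Hadoop distributes the map, shuffle, and reduce steps across many machines and spills intermediate results to files):

```python
# Single-machine sketch of the Map/Reduce model in the figure above.

from collections import defaultdict

def map_reduce(inputs, map_fn, reduce_fn):
    """Apply map_fn to every input object, group the emitted
    (key, value) pairs by key, then apply reduce_fn per key."""
    groups = defaultdict(list)
    for x in inputs:
        for key, value in map_fn(x):      # Map phase
            groups[key].append(value)     # Shuffle: group by key
    return {key: reduce_fn(key, values)   # Reduce phase
            for key, values in groups.items()}
```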
– 13 –
Map/Reduce Example
Create a word index of a set of documents
  Map: generate ⟨word, count⟩ pairs for all words in a document
  Reduce: sum word counts across documents
[Figure: word-count example. Documents: "Come, Dick." / "Come and see." / "Come and see Spot." / "Come and see." / "Come, come." Map extracts ⟨word, 1⟩ pairs from each document; Reduce sums them to ⟨come, 6⟩, ⟨and, 3⟩, ⟨see, 3⟩, ⟨dick, 1⟩, ⟨spot, 1⟩]
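The word-count example expressed with the `map_reduce` sketch from the previous slide (the documents are the sentences in the figure):

```python
# Word-count: Map emits (word, 1) pairs, Reduce sums counts per word.

import re

def word_map(document):
    """Map: emit a (word, 1) pair for every word in the document."""
    for word in re.findall(r"[a-z]+", document.lower()):
        yield word, 1

def word_reduce(word, counts):
    """Reduce: sum the counts emitted for this word."""
    return sum(counts)

documents = ["Come, Dick.", "Come and see.", "Come and see Spot.",
             "Come and see.", "Come, come."]
print(map_reduce(documents, word_map, word_reduce))
# {'come': 6, 'dick': 1, 'and': 3, 'see': 3, 'spot': 1}
```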
– 14 –
Map/Reduce Operation
Characteristics
  Computation broken into many short-lived tasks
    Mapping, reducing
  Use disk storage to hold intermediate results
Strengths
  Great flexibility in placement, scheduling, and load balancing
  Can access large data sets
Weaknesses
  Higher overhead
  Lower raw performance
– 15 –
System Comparison: Programming Models
Conventional HPC
  Programs described at a very low level
    Specify detailed control of processing & communications
  Rely on a small number of software packages
    Written by specialists
    Limits classes of problems & solution methods
  Layering: Hardware → Machine-Dependent Programming Model → Software Packages → Application Programs
DISC
  Application programs written in terms of high-level operations on data
  Runtime system controls scheduling, load balancing, …
  Layering: Hardware → Machine-Independent Programming Model → Runtime System → Application Programs
– 16 –
HPC Fault Tolerance
Checkpoint
  Periodically store the state of all processes
  Significant I/O traffic
Restore
  When a failure occurs
  Reset state to that of the last checkpoint
  All intervening computation wasted
Performance Scaling
  Very sensitive to the number of failing components
[Figure: timeline for processes P1–P5 showing periodic checkpoints, a failure-triggered restore, and the wasted computation rolled back]
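A rough sketch of this checkpoint/restore cycle (not from the talk; the failure probability, step function, and interval are placeholder assumptions):

```python
# Save state every `interval` steps; on a failure, roll back to the last
# checkpoint and redo the intervening (wasted) work.

import copy
import random

def run_with_checkpoints(state, step_fn, total_steps, interval):
    """Advance `state` with step_fn for total_steps, checkpointing every
    `interval` steps and restarting from the checkpoint on failure."""
    checkpoint, done = copy.deepcopy(state), 0
    while done < total_steps:
        if random.random() < 0.01:                  # simulated component failure
            state = copy.deepcopy(checkpoint)       # restore
            done -= done % interval                 # intervening work is wasted
            continue
        state = step_fn(state)
        done += 1
        if done % interval == 0:
            checkpoint = copy.deepcopy(state)       # significant I/O in practice
    return state
```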
– 17 –
Map/Reduce Fault Tolerance
Data Integrity
  Store multiple copies of each file
  Including intermediate results of each Map/Reduce
    Continuous checkpointing
Recovering from Failure
  Simply recompute the lost result
    Localized effect
  Dynamic scheduler keeps all processors busy
– 18 –
DISC Scalability Advantages
Distributed-system design principles lead to a scalable design
  Dynamically scheduled tasks with state held in replicated files
Provisioning Advantages
  Can use consumer-grade components
    Maximizes cost-performance
  Can have heterogeneous nodes
    More efficient technology refresh
Operational Advantages
  Minimal staffing
  No downtime
– 19 –
Getting Started
Goal
  Get people involved in DISC
Software
  Hadoop Project
    Open-source project providing file system and Map/Reduce
    Supported and used by Yahoo
    Rapidly expanding user/developer base
    Prototype on a single machine, map onto a cluster
– 20 –
Exploring Parallel Computation Models
DISC + Map/Reduce Provides Coarse-Grained Parallelism
  Computation done by independent processes
  File-based communication
Observations
  Relatively "natural" programming model
  Research issue to explore full potential and limits
[Figure: communication-granularity spectrum, from low-communication / coarse-grained to high-communication / fine-grained: SETI@home, Map/Reduce, MPI, PRAM / Threads]
– 21 –
Example: Sparse Matrices with Map/Reduce
Task: Compute product C = A·B
  Assume most matrix entries are 0
Motivation
  Core problem in scientific computing
  Challenging for parallel execution
  Demonstrate expressiveness of Map/Reduce
A = [10 0 20; 0 30 40; 50 60 70]   (3×3)
B = [-1 0; -2 -3; 0 -4]            (3×2)
C = A·B = [-10 -80; -60 -250; -170 -460]
– 22 –
Computing Sparse Matrix Product
Represent each matrix as a list of its nonzero entries ⟨row, col, value, matrixID⟩
Strategy
  Phase 1: Compute all products a_{i,k} · b_{k,j}
  Phase 2: Sum products for each entry (i,j)
  Each phase involves a Map/Reduce
A entries: ⟨1,1,10,A⟩ ⟨1,3,20,A⟩ ⟨2,2,30,A⟩ ⟨2,3,40,A⟩ ⟨3,1,50,A⟩ ⟨3,2,60,A⟩ ⟨3,3,70,A⟩
B entries: ⟨1,1,-1,B⟩ ⟨2,1,-2,B⟩ ⟨2,2,-3,B⟩ ⟨3,2,-4,B⟩
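In code, the running example under this representation looks like the following hedged sketch (tuples of (row, col, value), with the matrix ID kept implicit by using two lists; the two Map/Reduce phases are sketched after the next slides):

```python
# Sparse matrices from the example, as lists of nonzero entries.

A = [(1, 1, 10), (1, 3, 20),
     (2, 2, 30), (2, 3, 40),
     (3, 1, 50), (3, 2, 60), (3, 3, 70)]

B = [(1, 1, -1),
     (2, 1, -2), (2, 2, -3),
     (3, 2, -4)]
```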
– 23 –
Phase 1 Map of Matrix Multiply
Group values a_{i,k} and b_{k,j} according to the shared index k (key = column index for A entries, row index for B entries)
  Key = 1: ⟨1,1,10,A⟩ ⟨3,1,50,A⟩ ⟨1,1,-1,B⟩
  Key = 2: ⟨2,2,30,A⟩ ⟨3,2,60,A⟩ ⟨2,1,-2,B⟩ ⟨2,2,-3,B⟩
  Key = 3: ⟨1,3,20,A⟩ ⟨2,3,40,A⟩ ⟨3,3,70,A⟩ ⟨3,2,-4,B⟩
– 24 –
Phase 1 “Reduce” of Matrix Multiply
Within each group, generate all products a_{i,k} · b_{k,j}, tagged C and keyed by the original k
  Key = 1: ⟨1,1,-10,C⟩ ⟨3,1,-50,C⟩
  Key = 2: ⟨2,1,-60,C⟩ ⟨2,2,-90,C⟩ ⟨3,1,-120,C⟩ ⟨3,2,-180,C⟩
  Key = 3: ⟨1,2,-80,C⟩ ⟨2,2,-160,C⟩ ⟨3,2,-280,C⟩
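Phase 1 as a Map/Reduce, using the `map_reduce` helper and the `A`, `B` lists sketched earlier (a hedged single-machine sketch, not Hadoop API code):

```python
# Phase 1: key each entry by the shared index k, then form all products.

def phase1_map(entry):
    """Key each entry by k: column index for A entries, row index for B."""
    matrix, (row, col, value) = entry
    k = col if matrix == "A" else row
    yield k, (matrix, row, col, value)

def phase1_reduce(k, entries):
    """Emit every product a_{i,k} * b_{k,j} as a partial C entry (i, j, value)."""
    a_side = [e for e in entries if e[0] == "A"]
    b_side = [e for e in entries if e[0] == "B"]
    return [(a_row, b_col, a_val * b_val)
            for (_, a_row, _, a_val) in a_side
            for (_, _, b_col, b_val) in b_side]

tagged = [("A", e) for e in A] + [("B", e) for e in B]
partials = [p for products in map_reduce(tagged, phase1_map, phase1_reduce).values()
            for p in products]
```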
– 25 –
Phase 2 Map of Matrix Multiply
Group products a_{i,k} · b_{k,j} with matching values of i and j (key = ⟨row, col⟩ of the product)
  Key = (1,1): ⟨1,1,-10,C⟩
  Key = (1,2): ⟨1,2,-80,C⟩
  Key = (2,1): ⟨2,1,-60,C⟩
  Key = (2,2): ⟨2,2,-90,C⟩ ⟨2,2,-160,C⟩
  Key = (3,1): ⟨3,1,-50,C⟩ ⟨3,1,-120,C⟩
  Key = (3,2): ⟨3,2,-180,C⟩ ⟨3,2,-280,C⟩
– 26 –
Phase 2 Reduce of Matrix Multiply
Sum products to get final entries
Result: ⟨1,1,-10,C⟩ ⟨1,2,-80,C⟩ ⟨2,1,-60,C⟩ ⟨2,2,-250,C⟩ ⟨3,1,-170,C⟩ ⟨3,2,-460,C⟩
  i.e., C = [-10 -80; -60 -250; -170 -460]
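Phase 2 plus a tiny driver, in the same hedged single-machine style, consuming the `partials` list from the Phase 1 sketch:

```python
# Phase 2: group partial products by their (row, col) position and sum them.

def phase2_map(partial):
    """Key each partial product by its position in C."""
    row, col, value = partial
    yield (row, col), value

def phase2_reduce(position, values):
    """Sum all partial products for one entry of C."""
    return sum(values)

C = map_reduce(partials, phase2_map, phase2_reduce)
print(sorted(C.items()))
# [((1, 1), -10), ((1, 2), -80), ((2, 1), -60), ((2, 2), -250),
#  ((3, 1), -170), ((3, 2), -460)]
```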
– 27 –
Lessons from Sparse Matrix Example
Associative Matching Is a Powerful Communication Primitive
  Intermediate step in Map/Reduce
Similar Strategy Applies to Other Problems
  Shortest path in a graph
  Database join
Many Performance Considerations
  Kiefer, Volk, Lehner, TU Dresden
  Should do a systematic comparison to other sparse-matrix implementations
– 28 –
Tweaking MapReduce
Map/Reduce/Merge
  Yang et al., Yahoo & UCLA
  Add a merge step to do a tagged merge of data sets
    E.g., getting elements from matrices A & B
  Demonstrate expressive power of relational algebra
Local Iterations
  Kambatla et al., Purdue
  Support local and global iterations
  Step toward the bulk synchronous model
– 29 –
Local / Global Map/Reduce
[Figure: alternating rounds of local Map/Reduce steps within clusters and global steps across clusters]
Iterative Computations
  E.g., PageRank
Graph Partitioning
  Partition into clusters
  Colocate data for each cluster
Computation
  Compute solution for each cluster, holding inter-cluster values constant
  Update inter-cluster values
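A hedged sketch of the iterative pattern: PageRank run as one global Map/Reduce round per iteration, reusing the `map_reduce` helper from slide 12 (the toy graph, damping factor, and iteration count are illustrative assumptions; the local/global clustering optimization described above is not modeled):

```python
# Iterative computation via repeated Map/Reduce rounds: toy PageRank.

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}       # page -> out-links
rank = {page: 1.0 / len(graph) for page in graph}        # initial ranks
DAMPING = 0.85

def rank_map(item):
    """Spread a page's current rank evenly across its out-links."""
    page, links = item
    for target in links:
        yield target, rank[page] / len(links)
    yield page, 0.0            # ensure every page appears in the output

def rank_reduce(page, contributions):
    """Combine incoming rank contributions with the damping term."""
    return (1 - DAMPING) / len(graph) + DAMPING * sum(contributions)

for _ in range(20):            # one global Map/Reduce round per iteration
    rank = map_reduce(graph.items(), rank_map, rank_reduce)
print(rank)
```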
– 30 –
Pig Project
Chris Olston, Yahoo!
  Part of Apache/Hadoop
Merge Database & Programming
  SQL-like query language
  Set-oriented function application
Implementation
  Map onto Hadoop
  Automatic selection / optimization of algorithms
  Captures low-level tricks programmers have devised for Map/Reduce
– 31 –
Generalizing Map/Reduce
E.g., Microsoft Dryad Project
Computational Model
  Acyclic graph of operators
    But expressed as a textual program
  Each operator takes a collection of objects and produces objects
    Purely functional model
Implementation Concepts
  Objects stored in files or memory
  Any object may be lost; any operator may fail
  Replicate & recompute for fault tolerance
  Dynamic scheduling
# Operators >> # Processors
[Figure: inputs x1, …, xn flowing through layers of operators Op1, Op2, …, Opk arranged as an acyclic dataflow graph]
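A hedged, single-machine sketch of the "acyclic graph of operators" idea: named operators consume collections produced upstream, and the graph is evaluated in topological order (illustrative only; real Dryad runs operators as distributed processes with replication and rescheduling):

```python
# Evaluate an acyclic dataflow graph of purely functional operators.

from graphlib import TopologicalSorter

def run_dag(operators, edges, sources):
    """operators: name -> function taking a list of input collections.
    edges: name -> list of upstream node names. sources: name -> collection."""
    results = dict(sources)
    for node in TopologicalSorter(edges).static_order():
        if node not in results:                        # skip source inputs
            inputs = [results[up] for up in edges[node]]
            results[node] = operators[node](inputs)    # purely functional step
    return results

ops = {"square": lambda ins: [x * x for x in ins[0]],
       "total":  lambda ins: [sum(ins[0])]}
edges = {"x1": [], "square": ["x1"], "total": ["square"]}
print(run_dag(ops, edges, {"x1": [1, 2, 3]}))
# {'x1': [1, 2, 3], 'square': [1, 4, 9], 'total': [14]}
```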
– 32 –
Implementation Challenges
Hadoop Platform Is a Blessing …
  Large community adding features, improving performance
  Easy to deploy
  Growing body of documentation and materials
  Works well enough for a range of applications
… And a Curse
  Map & reduce functionality hardwired
  Not designed as an extensible platform
Dryad Not Widely Adopted
  305 citations on Google Scholar vs. 1,759 for Map/Reduce
  Built on .NET
– 33 –
Conclusions
Distributed Systems Concepts Lead to Scalable Machines
  Loosely coupled execution model
  Lowers cost of procurement & operation
Map/Reduce Gaining Widespread Use
  Hadoop makes it widely available
  Great for some applications, good enough for many others
Lots of Work to Be Done
  Richer set of programming models and implementations
  Expanding range of applicability
    Problems that are both data- and compute-intensive
  The future of supercomputing?