Large-Scale Data and Computation
Yifu Huang
School of Computer Science, Fudan [email protected]
COMP620003 Advanced Computer Networks Report, 2013
Yifu Huang (FDU CS) COMP620003 Report 2013/12/5 1 / 27
Outline
1 Motivation
2 GFS
3 MapReduce
4 Bigtable
5 Discussion
Yifu Huang (FDU CS) COMP620003 Report 2013/12/5 2 / 27
Motivation
Motivation
Why we choose these papers?
Big dataGoogle
Why we need GFS, MapReduce, Bigtable?
A scalable distributed file system for large distributed data-intensiveapplicationsA programming model for processing and generating large datasets thatis amenable to a broad variety of real-world tasksA distributed storage system for managing structured data that isdesigned to scale to a very large size
What can we learn from these papers?
The design and implementation ideas behind GFS, MapReduce,BigtableGet ready to enjoy open source equivalents HDFS, YARN, HBase
Yifu Huang (FDU CS) COMP620003 Report 2013/12/5 3 / 27
Motivation
Motivation
Why we choose these papers?
Big dataGoogle
Why we need GFS, MapReduce, Bigtable?
A scalable distributed file system for large distributed data-intensiveapplicationsA programming model for processing and generating large datasets thatis amenable to a broad variety of real-world tasksA distributed storage system for managing structured data that isdesigned to scale to a very large size
What can we learn from these papers?
The design and implementation ideas behind GFS, MapReduce,BigtableGet ready to enjoy open source equivalents HDFS, YARN, HBase
Yifu Huang (FDU CS) COMP620003 Report 2013/12/5 3 / 27
Motivation
Motivation
Why we choose these papers?
Big dataGoogle
Why we need GFS, MapReduce, Bigtable?
A scalable distributed file system for large distributed data-intensiveapplicationsA programming model for processing and generating large datasets thatis amenable to a broad variety of real-world tasksA distributed storage system for managing structured data that isdesigned to scale to a very large size
What can we learn from these papers?
The design and implementation ideas behind GFS, MapReduce,BigtableGet ready to enjoy open source equivalents HDFS, YARN, HBase
Yifu Huang (FDU CS) COMP620003 Report 2013/12/5 3 / 27
Motivation
Ecosystem
Yifu Huang (FDU CS) COMP620003 Report 2013/12/5 4 / 27
GFS
The Google File System
The Google File System
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung
Google, Inc.
Yifu Huang (FDU CS) COMP620003 Report 2013/12/5 5 / 27
GFS
Design Overview
The system is built from many inexpensive commodity componentsthat often failThe system stores a modest number of large filesThe workloads primarily consist of two kinds of reads: large streamingreads and small random readsThe workloads also have many large, sequential writes that appenddata to filesThe system must efficiently implement well-defined semantics ormultiple clients that concurrently append to the same fileHigh sustained bandwidth is more important than low latency
Yifu Huang (FDU CS) COMP620003 Report 2013/12/5 6 / 27
GFS
Design Overview
The system is built from many inexpensive commodity componentsthat often failThe system stores a modest number of large filesThe workloads primarily consist of two kinds of reads: large streamingreads and small random readsThe workloads also have many large, sequential writes that appenddata to filesThe system must efficiently implement well-defined semantics ormultiple clients that concurrently append to the same fileHigh sustained bandwidth is more important than low latency
Yifu Huang (FDU CS) COMP620003 Report 2013/12/5 6 / 27
GFS
Design Overview
The system is built from many inexpensive commodity componentsthat often failThe system stores a modest number of large filesThe workloads primarily consist of two kinds of reads: large streamingreads and small random readsThe workloads also have many large, sequential writes that appenddata to filesThe system must efficiently implement well-defined semantics ormultiple clients that concurrently append to the same fileHigh sustained bandwidth is more important than low latency
Yifu Huang (FDU CS) COMP620003 Report 2013/12/5 6 / 27
GFS
Design Overview
The system is built from many inexpensive commodity componentsthat often failThe system stores a modest number of large filesThe workloads primarily consist of two kinds of reads: large streamingreads and small random readsThe workloads also have many large, sequential writes that appenddata to filesThe system must efficiently implement well-defined semantics ormultiple clients that concurrently append to the same fileHigh sustained bandwidth is more important than low latency
Yifu Huang (FDU CS) COMP620003 Report 2013/12/5 6 / 27
GFS
Design Overview
The system is built from many inexpensive commodity componentsthat often failThe system stores a modest number of large filesThe workloads primarily consist of two kinds of reads: large streamingreads and small random readsThe workloads also have many large, sequential writes that appenddata to filesThe system must efficiently implement well-defined semantics ormultiple clients that concurrently append to the same fileHigh sustained bandwidth is more important than low latency
Yifu Huang (FDU CS) COMP620003 Report 2013/12/5 6 / 27
GFS
Design Overview
The system is built from many inexpensive commodity componentsthat often failThe system stores a modest number of large filesThe workloads primarily consist of two kinds of reads: large streamingreads and small random readsThe workloads also have many large, sequential writes that appenddata to filesThe system must efficiently implement well-defined semantics ormultiple clients that concurrently append to the same fileHigh sustained bandwidth is more important than low latency
Yifu Huang (FDU CS) COMP620003 Report 2013/12/5 6 / 27
GFS
Design Overview (cont.)
Architecture
Yifu Huang (FDU CS) COMP620003 Report 2013/12/5 7 / 27
GFS
Design Overview (cont.)
Chunk Size: 64 MBMetadata
The file and chunk namespacesThe mapping from files to chunksThe locations of each chunk’s replicas
Chunk LocationsOperation LogConsistency Model
Guarantees by GFSImplications for Applications
Yifu Huang (FDU CS) COMP620003 Report 2013/12/5 8 / 27
GFS
Design Overview (cont.)
Chunk Size: 64 MBMetadata
The file and chunk namespacesThe mapping from files to chunksThe locations of each chunk’s replicas
Chunk LocationsOperation LogConsistency Model
Guarantees by GFSImplications for Applications
Yifu Huang (FDU CS) COMP620003 Report 2013/12/5 8 / 27
GFS
Design Overview (cont.)
Chunk Size: 64 MBMetadata
The file and chunk namespacesThe mapping from files to chunksThe locations of each chunk’s replicas
Chunk LocationsOperation LogConsistency Model
Guarantees by GFSImplications for Applications
Yifu Huang (FDU CS) COMP620003 Report 2013/12/5 8 / 27
GFS
Design Overview (cont.)
Chunk Size: 64 MBMetadata
The file and chunk namespacesThe mapping from files to chunksThe locations of each chunk’s replicas
Chunk LocationsOperation LogConsistency Model
Guarantees by GFSImplications for Applications
Yifu Huang (FDU CS) COMP620003 Report 2013/12/5 8 / 27
GFS
Design Overview (cont.)
Chunk Size: 64 MBMetadata
The file and chunk namespacesThe mapping from files to chunksThe locations of each chunk’s replicas
Chunk LocationsOperation LogConsistency Model
Guarantees by GFSImplications for Applications
Yifu Huang (FDU CS) COMP620003 Report 2013/12/5 8 / 27
GFS
System Interactions
Leases and Mutation Order
Yifu Huang (FDU CS) COMP620003 Report 2013/12/5 9 / 27
GFS
System Interactions (cont.)
Data Flow
We decouple the flow of data from the flow of control to use thenetwork efficientlyOur goals are to fully utilize each machine’s network bandwidth, avoidnetwork bottlenecks and high-latency links, and minimize the latency topush through all the data
Atomic Record Appends
GFS provides an atomic append operation called record append
Snapshot
The snapshot operation makes a copy of a file or a directory treealmost instantaneously, while minimizing any interruptions of ongoingmutations
Yifu Huang (FDU CS) COMP620003 Report 2013/12/5 10 / 27
GFS
System Interactions (cont.)
Data Flow
We decouple the flow of data from the flow of control to use thenetwork efficientlyOur goals are to fully utilize each machine’s network bandwidth, avoidnetwork bottlenecks and high-latency links, and minimize the latency topush through all the data
Atomic Record Appends
GFS provides an atomic append operation called record append
Snapshot
The snapshot operation makes a copy of a file or a directory treealmost instantaneously, while minimizing any interruptions of ongoingmutations
Yifu Huang (FDU CS) COMP620003 Report 2013/12/5 10 / 27
GFS
System Interactions (cont.)
Data Flow
We decouple the flow of data from the flow of control to use thenetwork efficientlyOur goals are to fully utilize each machine’s network bandwidth, avoidnetwork bottlenecks and high-latency links, and minimize the latency topush through all the data
Atomic Record Appends
GFS provides an atomic append operation called record append
Snapshot
The snapshot operation makes a copy of a file or a directory treealmost instantaneously, while minimizing any interruptions of ongoingmutations
Yifu Huang (FDU CS) COMP620003 Report 2013/12/5 10 / 27
GFS
Fault Tolerance And Diagnosis
High Availability
Fast RecoveryChunk ReplicationMaster Replication
Data IntegrityDiagnostic Tools
Yifu Huang (FDU CS) COMP620003 Report 2013/12/5 11 / 27
GFS
Fault Tolerance And Diagnosis
High Availability
Fast RecoveryChunk ReplicationMaster Replication
Data IntegrityDiagnostic Tools
Yifu Huang (FDU CS) COMP620003 Report 2013/12/5 11 / 27
GFS
Fault Tolerance And Diagnosis
High Availability
Fast RecoveryChunk ReplicationMaster Replication
Data IntegrityDiagnostic Tools
Yifu Huang (FDU CS) COMP620003 Report 2013/12/5 11 / 27
MapReduce
MapReduce: Simplied Data Processing on Large Clusters
MapReduce: Simplied Data Processing on Large Clusters
Jeffrey Dean and Sanjay Ghemawat
Google, Inc.
Yifu Huang (FDU CS) COMP620003 Report 2013/12/5 12 / 27
MapReduce
Model
Map
Takes an input pair and produces a set of intermediate key/value pairsThe MapReduce library groups together all intermediate valuesassociated with the same intermediate key and passes them to thereduce function
Reduce
Accepts an intermediate key and a set of values for that key andmerges these values together to form a possibly smaller set of values
Yifu Huang (FDU CS) COMP620003 Report 2013/12/5 13 / 27
MapReduce
Model
Map
Takes an input pair and produces a set of intermediate key/value pairsThe MapReduce library groups together all intermediate valuesassociated with the same intermediate key and passes them to thereduce function
Reduce
Accepts an intermediate key and a set of values for that key andmerges these values together to form a possibly smaller set of values
Yifu Huang (FDU CS) COMP620003 Report 2013/12/5 13 / 27
MapReduce
Implementation
User ProgramMasterWorker (Map/Reduce)Fault toleranceLocalityTask granularityBackup tools
Yifu Huang (FDU CS) COMP620003 Report 2013/12/5 14 / 27
MapReduce
Performance
Grep
Yifu Huang (FDU CS) COMP620003 Report 2013/12/5 15 / 27
MapReduce
Performance (cont.)
Sort
Yifu Huang (FDU CS) COMP620003 Report 2013/12/5 16 / 27
Bigtable
Bigtable: A Distributed Storage System for Structured Data
Bigtable: A Distributed Storage System for Structured Data
Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A.Wallach Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber
Google, Inc.
Yifu Huang (FDU CS) COMP620003 Report 2013/12/5 17 / 27
Bigtable
Model
Rows
AtomicTablets
Column Families
Family : qualifierAccess control
Timestamps
Decreasing orderGarbage collect
Cell
(row : string, column : string, time : int64) -> string
Yifu Huang (FDU CS) COMP620003 Report 2013/12/5 18 / 27
Bigtable
Model
Rows
AtomicTablets
Column Families
Family : qualifierAccess control
Timestamps
Decreasing orderGarbage collect
Cell
(row : string, column : string, time : int64) -> string
Yifu Huang (FDU CS) COMP620003 Report 2013/12/5 18 / 27
Bigtable
Model
Rows
AtomicTablets
Column Families
Family : qualifierAccess control
Timestamps
Decreasing orderGarbage collect
Cell
(row : string, column : string, time : int64) -> string
Yifu Huang (FDU CS) COMP620003 Report 2013/12/5 18 / 27
Bigtable
Model
Rows
AtomicTablets
Column Families
Family : qualifierAccess control
Timestamps
Decreasing orderGarbage collect
Cell
(row : string, column : string, time : int64) -> string
Yifu Huang (FDU CS) COMP620003 Report 2013/12/5 18 / 27
Bigtable
Model (cont.)
Yifu Huang (FDU CS) COMP620003 Report 2013/12/5 19 / 27
Bigtable
Storage
Region–Store—memStore—StoreFileFile–BlocksLog–Write ahead log
Yifu Huang (FDU CS) COMP620003 Report 2013/12/5 20 / 27
Bigtable
Implementation
Tablet Location
Yifu Huang (FDU CS) COMP620003 Report 2013/12/5 21 / 27
Bigtable
Implementation (cont.)
Tablet Serving
Yifu Huang (FDU CS) COMP620003 Report 2013/12/5 22 / 27
Bigtable
Evaluation
Benchmarks
Yifu Huang (FDU CS) COMP620003 Report 2013/12/5 23 / 27
Bigtable
Evaluation (cont.)
Scaling
Yifu Huang (FDU CS) COMP620003 Report 2013/12/5 24 / 27
Bigtable
Evaluation (cont.)
Applications
Yifu Huang (FDU CS) COMP620003 Report 2013/12/5 25 / 27
Discussion
Discussion
Contributions of GFS, MapReduce, Bigtable
Supporting large-scale data processing workloads on commodityhardwareEasy to use, hides details of parallelization, fault tolerance, localityoptimization, and load balancingScale well both in terms of data size (from URLs to web pages tosatellite imagery) and latency requirements (from backend bulkprocessing to real-time data serving)
Drawbacks of GFS, MapReduce, Bigtable
Single master may be still a potential bottleneckDisk I/O, time consumingDo not support many database operations (multi-row transaction,secondary index . . . )
Yifu Huang (FDU CS) COMP620003 Report 2013/12/5 26 / 27
Discussion
Discussion
Contributions of GFS, MapReduce, Bigtable
Supporting large-scale data processing workloads on commodityhardwareEasy to use, hides details of parallelization, fault tolerance, localityoptimization, and load balancingScale well both in terms of data size (from URLs to web pages tosatellite imagery) and latency requirements (from backend bulkprocessing to real-time data serving)
Drawbacks of GFS, MapReduce, Bigtable
Single master may be still a potential bottleneckDisk I/O, time consumingDo not support many database operations (multi-row transaction,secondary index . . . )
Yifu Huang (FDU CS) COMP620003 Report 2013/12/5 26 / 27
Appendix
References I
[1] The Google File System. SOSP. 2003.
[2] MapReduce: Simplied Data Processing on Large Clusters. OSDI.2004.
[3] Bigtable: A Distributed Storage System for Structured Data.OSDI. 2006.
[4] The Hadoop Distributed File System. MSST. 2010.
[5] Hadoop: the definitive guide. 2012.
[6] HBase: the definitive guide. 2011.
[7] Data-intensive text processing with MapReduce. 2010.
[8] The Chubby lock service for loosely-coupled distributed systems.OSDI. 2006.
Yifu Huang (FDU CS) COMP620003 Report 2013/12/5 27 / 27