(Big) Data Storage Systems - Academic Year 2016/17
Matteo Nardelli
Master's Degree (Laurea Magistrale) in Computer Engineering - 2nd year
Università degli Studi di Roma "Tor Vergata" - Department of Civil Engineering and Computer Science Engineering
The reference Big Data stack
• Resource Management
• Data Storage
• Data Processing
• High-level Interfaces
• Support / Integration
Data Storage: the classic approach

File: a group of data whose structure is defined by the file system
• File system
  – controls how data are structured, named, organized, stored, and retrieved from disk
  – usually: a single (logical) disk (e.g., HDD/SSD, RAID)

Database: an organized/structured collection of data (e.g., entities, tables)
• Database management system
  – provides a way to organize and access data stored in files
  – enables: data definition, update, retrieval, administration
What about Big Data?
Storage capacities and data transfer rates have increased massively over the years. Let's consider the time needed to transfer data*
Reference devices: HDD (size ~1 TB, speed 250 MB/s), SSD (size ~1 TB, speed 850 MB/s)

  Data Size    HDD           SSD
  10 GB        40 s          12 s
  100 GB       6 m 49 s      2 m
  1 TB         1 h 9 m 54 s  20 m 33 s
  10 TB        ?             ?

* we consider no overhead
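A quick sanity check of these figures: each entry is just size divided by sustained throughput. A minimal Python sketch, assuming binary units (1 TB = 1024 GB = 1,048,576 MB) and truncation to whole seconds:

    # Transfer time = data size / sustained throughput (no overhead assumed).
    def transfer_time(size_mb, speed_mb_s):
        secs = int(size_mb / speed_mb_s)          # truncate to whole seconds
        h, rem = divmod(secs, 3600)
        m, s = divmod(rem, 60)
        return f"{h}h {m}m {s}s" if h else (f"{m}m {s}s" if m else f"{s}s")

    GB, TB = 1024, 1024 * 1024                    # sizes expressed in MB (binary units)
    for size in (10 * GB, 100 * GB, 1 * TB, 10 * TB):
        print(transfer_time(size, 250), transfer_time(size, 850))   # HDD vs SSD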
We need to scale out!
General principles for scalable data storage
• Scalability and high performance:
  – to face the continuous growth of data to store
  – multiple nodes used as storage
• Ability to run on commodity hardware:
  – hardware failures are the norm rather than the exception
• Reliability and fault tolerance:
  – transparent data replication
• Availability:
  – data should be available to serve a request at the moment it is needed
  – CAP theorem: trade-off with consistency
Scalable data storage solutions
Various forms of data storage in the Cloud:
• Distributed file systems
  – manage (large) files on multiple nodes
  – examples: Google File System (GFS), Hadoop Distributed File System (HDFS)
• NoSQL databases (or, more generally, data stores)
  – simple and flexible non-relational data models
  – key-value, column family, document, and graph stores
  – examples: BigTable, Cassandra, MongoDB, HBase, DynamoDB
  – existing time series databases are built on top of NoSQL stores (examples: InfluxDB, KairosDB)
• NewSQL databases
  – add horizontal scalability and fault tolerance to the relational model
  – examples: VoltDB, Google Spanner
Scalable data storage solutions
(Figure: the whole picture of the different solutions)
Distributed File Systems
• Represent the primary support for data management
• Manage the storage of data across a network of machines
• Provide an interface to store information in the form of files and later access it for read and write operations
• Several solutions (different design choices):
  – GFS, Apache HDFS (GFS open-source clone): designed for batch applications with large files
  – FDS (Flat Datacenter Storage): highly distributed solution, no single point of failure
  – Alluxio: in-memory (high-throughput) storage system
  – Lustre, Ceph: designed as high-performance file systems
Case study: Google File System
Assumptions and Motivations
• System is built from inexpensive commodity hardware that often fails
  – 60,000 nodes, each failing once per year on average: about 7 failures per hour!
• System stores a modest number of large files (multi-GB)
• Large streaming/contiguous reads, small random reads
• Many large, sequential writes that append data
  – Multiple clients can concurrently append to the same file
• High sustained bandwidth is more important than low latency
S. Ghemawat, H. Gobioff, S.-T. Leung, "The Google File System", Proc. of ACM SOSP 2003.
Case study: Google File System
• Distributed file system implemented in user space
• Manages (very) large files: usually multi-GB
• A file is split into chunks
• Chunks:
  – have a default size (either 64 MB or 128 MB)
  – are transparent to the user
  – each one is stored as a plain file on chunkservers
• Files follow the write-once, read-many-times pattern
  – Efficient append operation: appends data at the end of a file atomically at least once, even in the presence of concurrent operations (minimal synchronization overhead)
• Fault tolerance and high availability through chunk replication
  – No data caching
GFS: Architecture
• Master
  – single, centralized entity (this simplifies the design)
  – manages file metadata (stored in memory)
    • metadata: access control information, mapping from files to chunks, locations of chunks
  – does not store data (i.e., chunks)
  – manages chunks: creation, replication, load balancing, deletion
GFS: Architecture
• Chunkserver
  – stores chunks as plain files
• Client
  – issues metadata requests to the GFS master
  – issues data requests to GFS chunkservers
  – caches metadata, does not cache data
GFS: Metadata
• The master stores three major types of metadata:
  – the file and chunk namespaces
  – the mapping from files to chunks
  – the locations of each chunk's replicas
• Metadata are stored in memory (64 B per chunk)
  – pro: fast; easy and efficient to scan the entire state
  – limitation: the number of chunks is limited by the amount of memory of the master: "The cost of adding extra memory to the master is a small price to pay for the simplicity, reliability, performance, and flexibility gained"
• The master also keeps an operation log with a historical record of critical metadata changes
  – persistent on local disk
  – replicated
  – checkpoints for fast recovery
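A rough back-of-the-envelope check of that limitation; a sketch where the 64 B/chunk figure comes from the slide and the data volume is an assumption:

    # How much master memory does chunk metadata need for a given data volume?
    BYTES_PER_CHUNK_METADATA = 64          # from the slide: ~64 B of metadata per chunk
    CHUNK_SIZE = 64 * 2**20                # 64 MB chunks

    def metadata_bytes(total_data_bytes):
        chunks = total_data_bytes // CHUNK_SIZE
        return chunks * BYTES_PER_CHUNK_METADATA

    # Example (assumed volume): 1 PB of data -> ~16.8M chunks -> ~1 GB of master memory
    print(metadata_bytes(2**50) / 2**30, "GB")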
GFS: Chunk size
• Chunk size is either 64 MB or 128 MB
  – much larger than typical block sizes
• A large chunk size reduces:
  – the number of interactions between client and master
  – network overhead, by keeping a persistent TCP connection to the chunkserver over an extended period of time
  – the size of metadata stored on the master
• Potential disadvantage: chunks of small files may become hot spots (requires a higher replication factor)
• Each chunk replica is stored as a plain Linux file and is extended as needed
GFS: Fault-tolerance and replication
• The master replicates (and maintains the replication of) each chunk on several chunkservers
  – At least 3 replicas on different chunkservers
  – Replication based on the primary-backup scheme
  – Replication degree > 3 for highly requested chunks
• Multi-level placement of replicas
  – different machines, same rack: + reliability, + availability
  – different machines, different racks: + aggregate bandwidth
• Data integrity
  – each chunk is divided into 64 KB blocks, with a 32-bit checksum for each block
  – checksums are kept in memory
  – the checksum is verified every time an application reads the data
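A minimal sketch of this block-level integrity check, using zlib.crc32 purely as a stand-in 32-bit checksum (the slide does not name the actual checksum function used by GFS):

    import zlib

    BLOCK_SIZE = 64 * 1024                     # a chunk is checked in 64 KB blocks

    def block_checksums(chunk: bytes):
        """One 32-bit checksum per 64 KB block (kept in memory by the chunkserver)."""
        return [zlib.crc32(chunk[i:i + BLOCK_SIZE])
                for i in range(0, len(chunk), BLOCK_SIZE)]

    def verify_read(chunk: bytes, checksums, offset: int, length: int) -> bool:
        """Verify every block that overlaps the requested [offset, offset+length) range."""
        first, last = offset // BLOCK_SIZE, (offset + length - 1) // BLOCK_SIZE
        return all(
            zlib.crc32(chunk[b * BLOCK_SIZE:(b + 1) * BLOCK_SIZE]) == checksums[b]
            for b in range(first, last + 1))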
GFS: The master operations (1)
• Stores metadata
• Namespace management and locking
  – the namespace is represented as a lookup table
  – read locks on internal nodes and a read/write lock on the leaf: a read lock allows concurrent mutations in the same directory and prevents its deletion, renaming, or snapshot
• Periodic communication with each chunkserver
  – sends instructions and collects chunkserver state (heartbeat messages)
• Creation, re-replication, and rebalancing of chunks
  – balances disk space utilization and load across chunkservers
  – distributes replicas among racks to increase fault tolerance
  – re-replicates a chunk as soon as the number of available replicas falls below a user-specified goal
GFS: The master operations (2)
• Garbage collection
  – file deletion logged by the master
  – file renamed to a hidden name with a deletion timestamp: its deletion is postponed
  – deleted files can be easily recovered in a limited timespan
• Stale replica detection
  – chunk replicas may become stale if a chunkserver fails or misses updates to the chunk
  – for each chunk, the master keeps a chunk version number
  – the chunk version number is updated for each chunk mutation
  – the master removes stale replicas in its regular garbage collection
GFS: System interactions
• Files are hierarchically organized in directories
  – there is no data structure that represents a directory
• A file is identified by its pathname
  – GFS does not support aliases
• GFS supports traditional file system operations
  – create, delete, open, close, read, write
• It adds:
  – snapshot: makes a copy of a file or a directory tree almost instantaneously (based on copy-on-write techniques)
  – record append: atomically appends data to a file (supports concurrent operations)
GFS: System interactions
• Read operation
  – data flow is decoupled from the control flow
  – (1)(2) the client interacts with the master to locate the file chunks
  – (3)(4) the client interacts with the chunkserver to retrieve data
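A minimal sketch of that control/data split from the client's point of view, with hypothetical master/chunkserver stubs standing in for the real RPCs:

    CHUNK_SIZE = 64 * 2**20   # 64 MB

    def gfs_read(master, filename, offset, length):
        # (1)(2) control flow: ask the master which chunk holds the offset and where its replicas are
        chunk_index = offset // CHUNK_SIZE
        chunk_handle, replica_locations = master.lookup(filename, chunk_index)
        # the client caches (chunk_handle, replica_locations) for subsequent reads
        # (3)(4) data flow: read directly from a chunkserver, bypassing the master
        chunkserver = replica_locations[0]        # e.g., pick the closest replica
        return chunkserver.read(chunk_handle, offset % CHUNK_SIZE, length)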
GFS: Mutations
• Mutations are write or append operations
  – Mutations are performed at all of a chunk's replicas in the same order
• A lease mechanism:
  – Goal: minimize management overhead at the master
  – The master grants a chunk lease to one of the replicas, called the primary
  – The primary picks a serial order for all the mutations to the chunk
  – All replicas follow this order when applying mutations
• Data flow is decoupled from the control flow
To fully utilize the machines' network bandwidth, the data is pushed linearly along a chain of chunkservers.
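A toy sketch of the primary's role; the class and method names below are hypothetical, just to illustrate the serial-ordering idea:

    class PrimaryReplica:
        """Holds the chunk lease: picks a serial order for mutations and propagates it."""
        def __init__(self, local_chunk, secondaries):
            self.local_chunk = local_chunk     # this replica's copy of the chunk
            self.secondaries = secondaries     # the other replicas of the chunk
            self.next_serial = 0

        def mutate(self, mutation):
            serial = self.next_serial          # the primary decides the order...
            self.next_serial += 1
            self.local_chunk.apply(serial, mutation)
            for replica in self.secondaries:   # ...and every replica applies it in that order
                replica.apply(serial, mutation)
            return serial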
GFS: Atomic Record Appends
• The client specifies only the data (with no offset)
• GFS appends the data to the file at least once atomically (i.e., as one continuous sequence of bytes)
  – the offset is chosen by GFS
  – works with multiple concurrent writers
• Heavily used by Google's distributed applications
  – e.g., files often serve as multiple-producer/single-consumer queues or contain merged results from many clients (see MapReduce)
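Since the append is at-least-once, a retried append can leave duplicate records in the file. A common client-side convention, sketched here with a hypothetical record-append callable, is to tag records with an ID and let readers deduplicate:

    import uuid

    def produce(append_to_file, payload: bytes):
        """Writer side: retry the record append; the file may end up storing it more than once."""
        record = uuid.uuid4().bytes + payload          # 16-byte record ID + payload
        while True:
            try:
                return append_to_file(record)          # hypothetical record-append call
            except IOError:
                continue                               # retry on failure -> possible duplicates

    def consume(records):
        """Reader side: skip records whose ID has already been seen."""
        seen = set()
        for record in records:
            record_id, payload = record[:16], record[16:]
            if record_id not in seen:
                seen.add(record_id)
                yield payload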
GFS: Consistency model
• GFS has a "relaxed" model: eventual consistency
  – simple and efficient
• A file region is:
  – consistent: if all replicas have the same value
  – defined: after a mutation, if it is consistent and clients will see what the mutation writes in its entirety
• Properties:
  – concurrent successful mutations leave the region consistent, but potentially undefined: it may not reflect what any one mutation has written (e.g., mingled fragments from multiple mutations)
  – a failed mutation makes the region inconsistent: chunk version numbers and re-replication are used to detect and recover stale replicas
• Namespace mutations are atomic
  – managed exclusively by the master, with locking guarantees
GFS problems
What's the limitation of this architecture?
The single master!
• Single point of failure
• Scalability bottleneck
GFS problems: Single master
• Solutions adopted to overcome issues related to the presence of a single master
  – overcome the single point of failure: multiple "shadow" masters provide read-only access to the file system even when the primary master is down
  – overcome the scalability bottleneck: by reducing the interaction between the master and the client
    • The master stores only metadata (not data)
    • The client can cache metadata
    • Large chunk size
    • Chunk lease: delegates the authority of coordinating mutations to the primary replica
• A simple solution
GFS problems
• GFS was designed (in 2001) with MapReduce in mind
  – but it found lots of other applications
  – designed for batch applications with large files (web crawling and indexing), but not suitable for applications like Gmail or YouTube, which are meant to serve data in near real-time
• Problems
  – single master node in charge of the chunkservers
  – all metadata about files stored in the master's memory: limits the total number of files
  – problems when storage grew to tens of petabytes
  – automatic failover was added (but still takes about 10 seconds)
  – designed for high throughput, but delivers high latency: not appropriate for latency-sensitive applications
  – delays in recovering from a failed replica chunkserver also delay the client
After GFS: Colossus
• Colossus (GFS2)
  – next-generation cluster-level file system at Google
  – specifically designed for real-time services
  – automatically sharded metadata layer
  – data typically written using Reed-Solomon codes (1.5x storage overhead)
    • a family of error-correcting codes
  – client-driven encoding and replication
  – distributed masters
  – supports smaller files: chunks go from 64 MB down to 1 MB
"GFS: Evolution on Fast-forward". ACM Queue, Volume 7, Issue 7, 2009. http://queue.acm.org/detail.cfm?id=1594206
Other Distributed File Systems: FDS
• Motivations:
  – data locality adds complexity
  – datacenter networks are getting faster
• Flat Datacenter Storage (FDS)
  – assumes a network with full bisection bandwidth: no local vs. remote disk distinction
  – "flat" model: all compute nodes can access all storage with equal throughput
  – FDS multiplexes the workload over the disks in the cluster
  – FDS fully distributes metadata operations
E. B. Nightingale, J. Elson, J. Fan, et al., "Flat datacenter storage", Proc. of OSDI'12
FDS: blobs and tracts
• Data are stored in blobs
  – a blob is identified by a 128-bit GUID
• A blob is divided into constant-sized tracts
  – tracts are sized so that random and sequential accesses have the same throughput (on their infrastructure: 8 MB)

Operations
• Reads and writes are atomic
• Reads and writes are not guaranteed to appear in the order they are issued
• The API is non-blocking
  – helps performance: many requests can be issued in parallel; FDS serializes the updates on each blob
FDS: architecture
• Every disk is managed by a tractserver
  – it stores tracts directly to disk using the raw disk interface (not the local file system)
• A metadata server coordinates the cluster and provides clients with the list of tractservers
  – the list is called the tract locator table (TLT)
  – the TLT does not contain the location of tracts
  – the client uses the blob's GUID (g) and the tract number (t) to select an entry in the TLT, the tract locator:
tract_locator = (hash(g) + t) mod TLT_length
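A minimal sketch of that lookup on the client side; the hash function and TLT layout below are placeholders, not the ones actually used by FDS:

    import hashlib

    def tract_locator(tlt_length: int, blob_guid: bytes, tract_number: int) -> int:
        """Map (blob GUID, tract number) to a row of the tract locator table."""
        g = int.from_bytes(hashlib.md5(blob_guid).digest(), "big")   # placeholder hash of the 128-bit GUID
        return (g + tract_number) % tlt_length

    # Example: pick the tractservers responsible for tract 7 of a blob,
    # assuming tlt is a list whose rows hold the tractservers for each locator:
    # servers = tlt[tract_locator(len(tlt), blob_guid, 7)]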
FDS: architecture
• TLT
  – changes when a disk fails or is added
  – reads and writes do not change it
• The client is responsible for the initial tract replication
  – reads go to one replica at random
  – writes go to all of them
• The metadata server detects and recovers failed tractservers
  – detection is based on heartbeat messages
  – afterwards, it updates the TLT by replacing the failed tractserver
Other Distributed File Systems: Alluxio
• Motivations:
  – write throughput is limited by disk and network bandwidth
  – fault tolerance is achieved by replicating data across the network (synchronous replication further slows down write operations)
  – RAM throughput is increasing fast
• Alluxio (formerly Tachyon)
  – in-memory storage system
  – high-throughput reads and writes
  – recomputation-based (lineage) storage, using memory aggressively
    • one copy of data in memory (fast)
    • upon failure, re-compute data using lineage (fault tolerant)
H. Li, A. Ghodsi, M. Zaharia, S. Shenker, and I. Stoica, "Tachyon: Reliable, Memory Speed Storage for Cluster Computing Frameworks", Proc. of ACM SOCC 2014
Other Distributed File Systems: Alluxio
• Alluxio adds a layer between:
  – application frameworks (e.g., Spark, MapReduce)
  – the persistence layer (e.g., HDFS, Amazon S3)
Other Distributed File Systems: Alluxio
Standard master-slave architecture (like GFS, HDFS)
Master
  – stores metadata
  – tracks lineage information
  – computes checkpoint order
Worker (slave)
  – manages local resources
  – reports the status to the master
  – RAMdisk for storing memory-mapped files
Other Distributed File Systems: Alluxio
Alluxio consists of two (logical) layers:
• Lineage layer: tracks the sequence of jobs that have created a particular data output (sketched below)
  – data are immutable once written: only append operations are supported
  – frameworks using Alluxio track data dependencies and recompute them when a failure occurs
• Persistence layer: persists data onto storage; used to perform asynchronous checkpoints
  – efficient checkpointing algorithm:
    • avoids checkpointing temporary files
    • checkpoints hot files first (i.e., the most read files)
    • bounds the re-computation time
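A toy sketch of lineage-based recovery, assuming a hypothetical in-memory catalogue that maps each output to the job (and inputs) that produced it:

    lineage = {}                     # output name -> (job that recomputes it, names of its inputs)
    store = {}                       # in-memory copies; entries may be lost on worker failure

    def write(name, job, inputs):
        """Run a job, keep its output in memory, and record how it was produced."""
        lineage[name] = (job, inputs)
        store[name] = job(*[read(i) for i in inputs])
        return store[name]

    def read(name):
        """Return the data if it is in memory; otherwise recompute it from its lineage."""
        if name not in store:
            job, inputs = lineage[name]
            store[name] = job(*[read(i) for i in inputs])   # recursive recomputation
        return store[name]

    # Example: file set B is derived from file set A by a task
    store["A"] = [1, 2, 3]                                  # base data (e.g., loaded from the persistence layer)
    write("B", lambda a: [x * 2 for x in a], ["A"])
    del store["B"]                                          # simulate losing the in-memory copy
    print(read("B"))                                        # recomputed from lineage -> [2, 4, 6]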
(Figure: file sets A and B linked by a task dependency — lineage example)
Distributed File Systems: Summing up
Google File System
  – master/slave architecture
  – decouples metadata from data
  – single master (bottleneck): limits interactions and the amount of metadata
  – designed for batch applications: 64/128 MB chunks, no data caching
Flat Datacenter Storage
  – flat model: no local vs. remote disks
  – distributed architecture
  – metadata do not contain data locations, but locators
  – the client is in charge of the (initial) replication
Alluxio (Tachyon)
  – in-memory storage system, leverages a DFS (persistence layer)
  – master/slave architecture
  – no replication: tracks changes (lineage), recovers data using checkpoints and recomputation