
(Big) Data Storage Systems A.A. 2016/17

Matteo Nardelli

Master's Degree in Computer Engineering (Laurea Magistrale in Ingegneria Informatica) - 2nd year

Università degli Studi di Roma “Tor Vergata” Dipartimento di Ingegneria Civile e Ingegneria Informatica

The reference Big Data stack

Matteo Nardelli - SABD 2016/17

1

Resource Management

Data Storage

Data Processing

High-level Interfaces

Support / Integration

Data Storage: the classic approach

File
a group of data, whose structure is defined by the file system

•  File system
–  controls how data are structured, named, organized, stored and retrieved from disk
–  usually on a single (logical) disk (e.g., HDD/SSD, RAID)

Database
an organized/structured collection of data (e.g., entities, tables)

•  Database management system
–  provides a way to organize and access data stored in files
–  enables data definition, update, retrieval, administration

Matteo Nardelli - SABD 2016/17

2

What about Big Data?

Storage capacities and data transfer rates have increased massively over the years. Let's consider the time needed to transfer data*

Matteo Nardelli - SABD 2016/17

3

HDD   Size: ~1 TB   Speed: 250 MB/s
SSD   Size: ~1 TB   Speed: 850 MB/s

Data Size    HDD          SSD
10 GB        40s          12s
100 GB       6m 49s       2m
1 TB         1h 9m 54s    20m 33s
10 TB        ?            ?

* we consider no overhead
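As a quick sanity check (not from the slides), here is a minimal Python sketch that reproduces the table above, assuming sequential transfers, 1 GB = 1024 MB, and no overhead; small rounding differences aside, it also lets you fill in the 10 TB row.

```python
# Back-of-the-envelope sequential transfer times, assuming 1 GB = 1024 MB and no overhead.
def transfer_time(size_gb, speed_mb_per_s):
    seconds = int(round(size_gb * 1024 / speed_mb_per_s))
    h, rem = divmod(seconds, 3600)
    m, s = divmod(rem, 60)
    return f"{h}h {m}m {s}s" if h else (f"{m}m {s}s" if m else f"{s}s")

for label, size_gb in [("10 GB", 10), ("100 GB", 100), ("1 TB", 1024), ("10 TB", 10240)]:
    print(f"{label:7s}  HDD: {transfer_time(size_gb, 250):>10s}   SSD: {transfer_time(size_gb, 850):>9s}")
```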

We need to scale out!

General principles for scalable data storage

•  Scalability and high performance:
–  to face the continuous growth of data to store
–  multiple nodes used as storage

•  Ability to run on commodity hardware:
–  hardware failures are the norm rather than the exception

•  Reliability and fault tolerance:
–  transparent data replication

•  Availability:
–  data should be available to serve a request at the moment it is needed
–  CAP theorem: trade-off with consistency

Matteo Nardelli - SABD 2016/17

4

Scalable data storage solutions

Various forms of data storage in the Cloud:

•  Distributed file systems
–  manage (large) files on multiple nodes
–  examples: Google File System (GFS), Hadoop Distributed File System (HDFS)

•  NoSQL databases (or, more generally, data stores)
–  simple and flexible non-relational data models
–  key-value, column family, document, and graph stores
–  examples: BigTable, Cassandra, MongoDB, HBase, DynamoDB
–  time series databases are often built on top of NoSQL stores (examples: InfluxDB, KairosDB)

•  NewSQL databases
–  add horizontal scalability and fault tolerance to the relational model
–  examples: VoltDB, Google Spanner

Matteo Nardelli - SABD 2016/17

5

Scalable data storage solutions

Matteo Nardelli - SABD 2016/17

6

The whole picture of the different solutions

Distributed File Systems

•  Represent the primary support for data management

•  Manage the storage of data across a network of machines

•  Provide an interface to store information in the form of files and later access it for read and write operations

•  Several solutions (different design choices):
–  GFS, Apache HDFS (GFS open-source clone): designed for batch applications with large files
–  FDS (Flat Datacenter Storage): highly distributed solution, no single point of failure
–  Alluxio: in-memory (high-throughput) storage system
–  Lustre, Ceph: designed as high-performance file systems

Matteo Nardelli - SABD 2016/17

7

Case study: Google File System

Assumptions and Motivations

•  The system is built from inexpensive commodity hardware that often fails
–  with 60,000 nodes and 1 failure per node per year: ~7 failures per hour!

•  The system stores a modest number of large files (multi-GB)

•  Large streaming/contiguous reads, small random reads

•  Many large, sequential writes that append data
–  multiple clients can concurrently append to the same file

•  High sustained bandwidth is more important than low latency

Matteo Nardelli - SABD 2016/17

8

S. Ghemawat, H. Gobioff, S.-T. Leung, "The Google File System", Proc. of ACM SOSP 2003.

Case study: Google File System

•  Distributed file system implemented in user space
•  Manages (very) large files: usually multi-GB
•  Files are split into chunks
•  Chunks:
–  have a default size (either 64 MB or 128 MB)
–  are transparent to the user
–  each one is stored as a plain file on chunk servers

•  Files follow the write-once, read-many-times pattern
–  efficient append operation: appends data at the end of a file atomically, at least once, even in the presence of concurrent operations (minimal synchronization overhead)

•  Fault tolerance and high availability through chunk replication
–  no data caching

Matteo Nardelli - SABD 2016/17

9

GFS: Architecture

•  Master
–  single, centralized entity (this simplifies the design)
–  manages file metadata (stored in memory)
   •  metadata: access control information, mapping from files to chunks, locations of chunks
–  does not store data (i.e., chunks)
–  manages chunks: creation, replication, load balancing, deletion

Matteo Nardelli - SABD 2016/17

10

GFS: Architecture

•  Chunkserver
–  stores chunks as plain files

•  Client
–  issues metadata requests to the GFS master
–  issues data requests to GFS chunkservers
–  caches metadata, does not cache data

Matteo Nardelli - SABD 2016/17

11

GFS: Metadata

•  The master stores three major types of metadata:
–  the file and chunk namespaces
–  the mapping from files to chunks
–  the locations of each chunk's replicas

•  Metadata are stored in memory (~64 B per chunk)
–  pro: fast; easy and efficient to scan the entire state
–  limitation: the number of chunks is limited by the amount of memory of the master: "The cost of adding extra memory to the master is a small price to pay for the simplicity, reliability, performance, and flexibility gained"

•  The master also keeps an operation log with a historical record of critical metadata changes
–  persistent on local disk
–  replicated
–  checkpointed for fast recovery
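A back-of-the-envelope sketch (not from the slides) of why the in-memory metadata can become a limitation: with roughly 64 B of metadata per chunk and 64 MB chunks, the master's memory requirement grows linearly with the amount of stored data.

```python
# Rough estimate of the master's memory for chunk metadata.
# Assumptions (illustrative, from the slide): ~64 B of metadata per chunk, 64 MB chunks.
CHUNK_SIZE = 64 * 1024 ** 2      # 64 MB
METADATA_PER_CHUNK = 64          # ~64 B

def master_metadata_bytes(stored_bytes):
    chunks = -(-stored_bytes // CHUNK_SIZE)   # ceiling division
    return chunks * METADATA_PER_CHUNK

# 10 PB of logical data -> ~168 million chunks -> ~10 GB of metadata in the master's memory
print(master_metadata_bytes(10 * 1024 ** 5) / 1024 ** 3, "GB")
```

This kind of estimate also hints at the problem mentioned later in the slides, when storage grew to tens of PBs.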

Matteo Nardelli - SABD 2016/17

12

GFS: Chunk size

•  Chunk size is either 64 MB or 128 MB
–  much larger than typical file system block sizes

•  A large chunk size reduces:
–  the number of interactions between client and master
–  network overhead, by keeping a persistent TCP connection to the chunkserver over an extended period of time
–  the size of the metadata stored on the master

•  Potential disadvantage: chunks of small files may become hot spots (requiring a higher replication factor)

•  Each chunk replica is stored as a plain Linux file and is extended as needed

Matteo Nardelli - SABD 2016/17

13

GFS: Fault-tolerance and replication

•  The master replicates (and maintains the replication of) each chunk on several chunkservers
–  at least 3 replicas, on different chunkservers
–  replication based on the primary-backup schema
–  replication degree > 3 for highly requested chunks

•  Multi-level placement of replicas
–  different machines, local rack: + reliability, + availability
–  different machines, different racks: also + aggregated bandwidth

•  Data integrity
–  each chunk is divided into 64 KB blocks, with a 32-bit checksum for each block
–  checksums are kept in memory
–  checksums are verified every time an application reads the data
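A minimal sketch of the block-level checksumming idea (not GFS code): it assumes 64 KB blocks and uses CRC32 as the 32-bit checksum, which is an assumption, since the design does not mandate a specific checksum function.

```python
import zlib

BLOCK_SIZE = 64 * 1024   # 64 KB blocks, as in the slide

def block_checksums(chunk_bytes):
    """One 32-bit checksum per 64 KB block of a chunk (kept in memory by the chunkserver)."""
    return [zlib.crc32(chunk_bytes[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk_bytes), BLOCK_SIZE)]

def verify_read(chunk_bytes, checksums, offset, length):
    """Re-check the blocks that cover a read before returning data to the application."""
    first, last = offset // BLOCK_SIZE, (offset + length - 1) // BLOCK_SIZE
    for b in range(first, last + 1):
        block = chunk_bytes[b * BLOCK_SIZE:(b + 1) * BLOCK_SIZE]
        if zlib.crc32(block) != checksums[b]:
            raise IOError(f"corrupted block {b}: read from another replica and report to the master")
    return chunk_bytes[offset:offset + length]
```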

Matteo Nardelli - SABD 2016/17

14

GFS: The master operations (1)

•  Stores metadata

•  Namespace management and locking
–  the namespace is represented as a lookup table
–  read lock on internal nodes and read/write lock on the leaf: the read lock allows concurrent mutations in the same directory and prevents its deletion, renaming, or snapshot

•  Periodic communication with each chunkserver
–  sends instructions and collects chunkserver state (heartbeat messages)

•  Creation, re-replication, rebalancing of chunks
–  balances disk space utilization and load across chunkservers
–  distributes replicas among racks to increase fault tolerance
–  re-replicates a chunk as soon as the number of available replicas falls below a user-specified goal

Matteo Nardelli - SABD 2016/17

15

GFS: The master operations (2)

•  Garbage collection
–  file deletion is logged by the master
–  the file is renamed to a hidden name with a deletion timestamp: its actual deletion is postponed
–  deleted files can be easily recovered within a limited timespan

•  Stale replica detection
–  chunk replicas may become stale if a chunkserver fails or misses updates to the chunk
–  for each chunk, the master keeps a chunk version number
–  the chunk version number is updated for each chunk mutation
–  the master removes stale replicas in its regular garbage collection

Matteo Nardelli - SABD 2016/17

16

GFS: System interactions

•  Files are hierarchically organized in directories
–  there is no data structure that represents a directory

•  A file is identified by its pathname
–  GFS does not support aliases

•  GFS supports traditional file system operations
–  create, delete, open, close, read, write

•  It adds:
–  snapshot: makes a copy of a file or a directory tree almost instantaneously (based on copy-on-write techniques)
–  record append: atomically appends data to a file (supports concurrent operations)

Matteo Nardelli - SABD 2016/17

17

GFS: System interactions

Matteo Nardelli - SABD 2016/17

18

•  Read operation
–  data flow is decoupled from the control flow
–  (1)(2) the client interacts with the master to locate the file's chunks
–  (3)(4) the client interacts with the chunkserver to retrieve the data

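A minimal client-side sketch of this read flow, steps (1)-(4) above, assuming a 64 MB chunk size and hypothetical `master.find_chunk` / `chunkserver.read` RPC stubs (the real GFS client API is not public); a read spanning several chunks would simply repeat these steps per chunk.

```python
CHUNK_SIZE = 64 * 1024 ** 2   # default 64 MB chunks

def gfs_read(master, filename, offset, length):
    """Sketch of a GFS client read: control flow via the master, data flow via a chunkserver."""
    chunk_index = offset // CHUNK_SIZE                            # (1) translate byte offset -> chunk index
    handle, replicas = master.find_chunk(filename, chunk_index)   # (2) chunk handle + replica locations (cacheable)
    chunkserver = replicas[0]                                     # pick one replica (e.g., the closest one)
    return chunkserver.read(handle, offset % CHUNK_SIZE, length)  # (3)(4) data comes from the chunkserver
```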

GFS: Mutations

•  Mutations are writes or appends
–  mutations are performed at all the chunk's replicas in the same order

•  A lease mechanism:
–  goal: minimize management overhead at the master
–  the master grants a chunk lease to one of the replicas, called the primary
–  the primary picks a serial order for all the mutations to the chunk
–  all replicas follow this order when applying mutations

•  Data flow is decoupled from the control flow

Matteo Nardelli - SABD 2016/17

19

To fully utilize each machine's network bandwidth, the data is pushed linearly along a chain of chunkservers

GFS: Atomic Record Appends

•  The client specifies only the data (with no offset)

•  GFS appends the data to the file at least once atomically (i.e., as one continuous sequence of bytes)
–  the offset is chosen by GFS and returned to the client
–  works with multiple concurrent writers

•  Heavily used by Google's distributed applications
–  e.g., files often serve as multiple-producer/single-consumer queues or contain merged results from many clients (see MapReduce)
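A sketch of how a producer might use record append (not an official GFS client API; `client.record_append` is a hypothetical stub): on failure the client simply retries, which is why records are appended at least once and why readers typically deduplicate using record identifiers or checksums.

```python
import uuid

def append_record(client, handle, payload, max_retries=3):
    """At-least-once record append: retries may duplicate the record, so tag it for reader-side dedup."""
    record = uuid.uuid4().bytes + payload          # unique id lets consumers drop duplicates
    for _ in range(max_retries):
        try:
            return client.record_append(handle, record)   # GFS chooses the offset atomically
        except IOError:
            continue                               # retry: the record may end up at more than one offset
    raise IOError("record append failed after retries")
```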

Matteo Nardelli - SABD 2016/17

20

GFS: Consistency model

•  GFS has a "relaxed" model: eventual consistency
–  simple and efficient

•  A file region is:
–  consistent: if all replicas have the same value
–  defined: after a mutation, if it is consistent and clients will see what the mutation writes in its entirety

•  Properties:
–  concurrent successful mutations leave the region in a consistent but potentially undefined state: it may not reflect what any one mutation has written (e.g., mingled fragments from multiple mutations)
–  a failed mutation makes the region inconsistent: chunk version numbers and re-replication are used to detect stale replicas and restore consistency

•  Namespace mutations are atomic
–  managed exclusively by the master with locking guarantees

Matteo Nardelli - SABD 2016/17

21

GFS problems

Matteo Nardelli - SABD 2016/17

22

What's the limitation of this architecture?

The single master!
•  Single point of failure
•  Scalability bottleneck

GFS problems: Single master

•  Solutions adopted to overcome issues related to the presence of a single master
–  overcome the single point of failure: multiple "shadow" masters provide read-only access to the file system even when the primary master is down
–  overcome the scalability bottleneck: by reducing the interaction between the master and the client
   •  the master stores only metadata (not data)
   •  the client can cache metadata
   •  large chunk size
   •  chunk lease: delegates the authority of coordinating mutations to the primary replica

•  A simple solution

Matteo Nardelli - SABD 2016/17

23

GFS problems

•  GFS was designed (in 2001) with MapReduce in mind
–  but it found lots of other applications
–  designed for batch applications with large files (web crawling and indexing), not suitable for applications like Gmail or YouTube, which are meant to serve data in near real-time

•  Problems
–  single master node in charge of the chunkservers
–  all file metadata stored in the master's memory: limits the total number of files
–  problems when storage grew to tens of PBs
–  automatic failover was added (but it still takes ~10 seconds)
–  designed for high throughput, but delivers high latency: not appropriate for latency-sensitive applications
–  recovering from a failed chunkserver replica delays the client

Matteo Nardelli - SABD 2016/17

24

After GFS: Colossus

•  Colossus (GFS2)
–  next-generation cluster-level file system at Google
–  specifically designed for real-time services
–  automatically sharded metadata layer
–  data typically written using Reed-Solomon encoding (1.5x overhead)
   •  Reed-Solomon: a family of error-correcting codes
–  client-driven encoding and replication
–  distributed masters
–  supports smaller files: chunks go from 64 MB to 1 MB

Matteo Nardelli - SABD 2016/17

25

"GFS: Evolution on Fast-forward". ACM Queue, Volume 7, Issue 7, 2009. http://queue.acm.org/detail.cfm?id=1594206

Other Distributed File Systems: FDS

•  Motivations:
–  data locality adds complexity
–  datacenter networks are getting faster

•  Flat Datacenter Storage (FDS)
–  relies on full bisection bandwidth: no local vs. remote disk distinction
–  "flat" model: all compute nodes can access all storage with equal throughput
–  FDS multiplexes each workload over the disks in the cluster
–  FDS fully distributes metadata operations

Matteo Nardelli - SABD 2016/17

26

E. B. Nightingale, J. Elson, J. Fan, et al., "Flat datacenter storage", Proc. of OSDI'12

FDS: blobs and tracts

•  Data are stored in blobs
–  a blob is identified by a 128-bit GUID

•  A blob is divided into constant-sized tracts
–  tracts are sized so that random and sequential accesses have the same throughput (on their infrastructure: 8 MB)

Operations

•  Reads and writes are atomic

•  Reads and writes are not guaranteed to appear in the order they are issued

•  The API is non-blocking
–  helps performance: many requests can be issued in parallel; FDS serializes the updates on each blob

Matteo Nardelli - SABD 2016/17

27

FDS: architecture

•  Every disk is managed by a tractserver
–  it stores tracts directly to disk using the raw disk interface (not the local file system)

•  A metadata server coordinates the cluster and provides clients with the list of tractservers
–  the list is called the tract locator table (TLT)
–  the TLT does not contain the locations of tracts
–  the client uses the blob's GUID (g) and the tract number (t) to select an entry in the TLT, the tract locator:

Matteo Nardelli - SABD 2016/17

28

tract_locator = (hash(g) + t) mod TLT_length
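A minimal sketch of the client-side lookup (not the FDS implementation): it assumes the TLT is simply a list of replica lists, and it uses SHA-1 as the hash of the GUID, which is an assumption, since the formula only requires some fixed hash of g.

```python
import hashlib

def tract_locator(guid: bytes, tract: int, tlt_length: int) -> int:
    """Map (blob GUID g, tract number t) to a TLT row, as in the formula above."""
    g = int.from_bytes(hashlib.sha1(guid).digest(), "big")   # hash choice is an assumption
    return (g + tract) % tlt_length

def tractservers_for(tlt, guid, tract):
    """Return the replica set (a list of tractserver addresses) stored in the selected TLT row."""
    return tlt[tract_locator(guid, tract, len(tlt))]
```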

FDS: architecture

•  TLT
–  changes when a disk fails or is added
–  reads and writes do not change it

•  The client is responsible for the initial replication of tracts
–  reads go to one replica at random
–  writes go to all of them

•  The metadata server detects and recovers failed tractservers
–  detection is based on heartbeat messages
–  afterwards, it updates the TLT by replacing the failed tractserver
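Continuing the previous sketch, client-driven reads and writes could then look as follows; `server.read` / `server.write` are hypothetical RPC stubs.

```python
import random

def read_tract(tlt, guid, tract):
    """Reads go to one replica chosen at random."""
    replicas = tractservers_for(tlt, guid, tract)
    return random.choice(replicas).read(guid, tract)

def write_tract(tlt, guid, tract, data):
    """Writes are client-driven: the client pushes the tract to every replica in the TLT row."""
    for server in tractservers_for(tlt, guid, tract):
        server.write(guid, tract, data)
```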

Matteo Nardelli - SABD 2016/17

29

Other Distributed File Systems: Alluxio

•  Motivations:
–  write throughput is limited by disk and network bandwidths
–  fault tolerance is achieved by replicating data across the network (synchronous replication further slows down write operations)
–  RAM throughput is increasing fast

•  Alluxio (formerly Tachyon)
–  in-memory storage system
–  high-throughput reads and writes
–  recomputation (lineage) based storage that uses memory aggressively
   •  one copy of data in memory (fast)
   •  upon failure, recompute data using lineage (fault tolerant)

Matteo Nardelli - SABD 2016/17

30

H. Li, A. Ghodsi, M. Zaharia, S. Shenker, and I. Stoica, "Tachyon: Reliable, Memory Speed Storage for Cluster Computing Frameworks", Proc. of ACM SOCC 2014

Other Distributed File Systems: Alluxio

•  Alluxio adds a layer between:
–  application frameworks (e.g., Spark, MapReduce)
–  the persistence layer (e.g., HDFS, Amazon S3)

Matteo Nardelli - SABD 2016/17

31

Other Distributed File Systems: Alluxio

Standard master-slave architecture (like GFS, HDFS)

Master
–  stores metadata
–  tracks lineage information
–  computes the checkpoint order

Matteo Nardelli - SABD 2016/17

32

Worker (slave)
–  manages local resources
–  reports its status to the master
–  uses a RAMdisk for storing memory-mapped files

Other Distributed File Systems: Alluxio

Alluxio consists of two (logical) layers:

•  Lineage layer: tracks the sequence of jobs that have created a particular data output
–  data are immutable once written: only append operations are supported
–  frameworks using Alluxio track data dependencies and recompute them when a failure occurs

•  Persistence layer: persists data onto storage, used to perform asynchronous checkpoints
–  efficient checkpointing algorithm:
   •  avoids checkpointing temporary files
   •  checkpoints hot files first (i.e., the most read files)
   •  bounds the re-computation time

Matteo Nardelli - SABD 2016/17

33

(figure: lineage example, with file sets A and B linked by task and dependency edges)
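A conceptual sketch (not Alluxio's actual API) of how lineage-based recovery can replace replication: each output records the job and the inputs that produced it, so a lost in-memory copy can be recomputed on demand.

```python
store = {}     # in-memory storage: name -> data (a single copy)
lineage = {}   # name -> (job function, list of input names)

def write_with_lineage(name, job, inputs):
    """Run `job` on the inputs, keep one copy in memory, and record how it was produced."""
    store[name] = job(*(read(i) for i in inputs))
    lineage[name] = (job, inputs)

def read(name):
    """Return the data if it is still in memory; otherwise recompute it from its lineage."""
    if name not in store:
        job, inputs = lineage[name]
        store[name] = job(*(read(i) for i in inputs))   # recursive recomputation
    return store[name]
```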

Distributed File Systems: Summing up

Google File System
–  master/slave architecture
–  decouples metadata from data
–  single master (bottleneck): limits interactions and the amount of metadata
–  designed for batch applications: 64/128 MB chunks, no data caching

Flat Datacenter Storage
–  flat model: no local vs. remote disks
–  distributed architecture
–  metadata do not contain data locations, but locators
–  the client is in charge of (initial) replication

Alluxio (Tachyon)
–  in-memory storage system, leverages an underlying DFS
–  master/slave architecture
–  no replication: tracks changes (lineage), recovers data using checkpoints and recomputations

Matteo Nardelli - SABD 2016/17

34