bigtable: a distributed storage system for structured data f. chang, j. dean, s. ghemawat, w.c....

39
Bigtable: A Distributed Storage System for Structured Data F. Chang, J. Dean, S. Ghemawat, W.C. Hsieh, D.A. Wallach M. Burrows, T. Chandra, A. Fikes, R.E. Gruber Google, Inc. Tianyang HU

Upload: robert-jordan

Post on 26-Dec-2015

223 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Bigtable: A Distributed Storage System for Structured Data F. Chang, J. Dean, S. Ghemawat, W.C. Hsieh, D.A. Wallach M. Burrows, T. Chandra, A. Fikes, R.E

Bigtable: A Distributed Storage System for Structured Data

F. Chang, J. Dean, S. Ghemawat, W.C. Hsieh, D.A. Wallach

M. Burrows, T. Chandra, A. Fikes, R.E. Gruber

Google, Inc.

Tianyang HU

Page 2: Bigtable: A Distributed Storage System for Structured Data F. Chang, J. Dean, S. Ghemawat, W.C. Hsieh, D.A. Wallach M. Burrows, T. Chandra, A. Fikes, R.E

2

Outline

• Introduction• Data Model• Building Blocks• Implementation• Refinements• Performance Evaluation• Future Work

Page 3: Bigtable: A Distributed Storage System for Structured Data F. Chang, J. Dean, S. Ghemawat, W.C. Hsieh, D.A. Wallach M. Burrows, T. Chandra, A. Fikes, R.E

3

Motivation

• Scalability– worldwide applications & users– huge amount of communication & data

Page 4: Bigtable: A Distributed Storage System for Structured Data F. Chang, J. Dean, S. Ghemawat, W.C. Hsieh, D.A. Wallach M. Burrows, T. Chandra, A. Fikes, R.E

4

Bigtable

• Distributed storage system– petabytes of data, thousands of machines– simple data model with dynamic control– applicability, scalability, performance, availability

• Used by more than 60 applications of Google

Page 5: Bigtable: A Distributed Storage System for Structured Data F. Chang, J. Dean, S. Ghemawat, W.C. Hsieh, D.A. Wallach M. Burrows, T. Chandra, A. Fikes, R.E

5

Outline

• Introduction• Data Model• Building Blocks• Implementation• Refinements• Performance Evaluation• Future Work

Page 6: Bigtable: A Distributed Storage System for Structured Data F. Chang, J. Dean, S. Ghemawat, W.C. Hsieh, D.A. Wallach M. Burrows, T. Chandra, A. Fikes, R.E

6

Data Model – Overview

• Sparse, distributed, persistent multidimensional sorted map.

Page 7: Bigtable: A Distributed Storage System for Structured Data F. Chang, J. Dean, S. Ghemawat, W.C. Hsieh, D.A. Wallach M. Burrows, T. Chandra, A. Fikes, R.E

7

Data Model – Example

• “Webtable” stores copy of web pages & their related information.– row key: URL (reverse hostname)– column key: attribute name– timestamp: time that the page is fetched

Page 8: Bigtable: A Distributed Storage System for Structured Data F. Chang, J. Dean, S. Ghemawat, W.C. Hsieh, D.A. Wallach M. Burrows, T. Chandra, A. Fikes, R.E

8

Data Model – Rows

• Row key: string (usually 10-100KB, max 64KB)• Every R/W of data under a single row key is atomic

Page 9: Bigtable: A Distributed Storage System for Structured Data F. Chang, J. Dean, S. Ghemawat, W.C. Hsieh, D.A. Wallach M. Burrows, T. Chandra, A. Fikes, R.E

9

Data Model – Rows

• Sorted by row key in lexicographic order• Tablet: a certain range of rows

– the unit of distribution & load balancing– good locality for data access

Page 10: Bigtable: A Distributed Storage System for Structured Data F. Chang, J. Dean, S. Ghemawat, W.C. Hsieh, D.A. Wallach M. Burrows, T. Chandra, A. Fikes, R.E

10

Data Model – Columns

• Column families: group of column keys (same type)– the unit of access control

• Column key: family:qualifier

Page 11: Bigtable: A Distributed Storage System for Structured Data F. Chang, J. Dean, S. Ghemawat, W.C. Hsieh, D.A. Wallach M. Burrows, T. Chandra, A. Fikes, R.E

11

Data Model – Timestamps

• Timestamp: index multiple versions of the same data– not necessarily the “real time”– data clean up, garbage collection

Page 12: Bigtable: A Distributed Storage System for Structured Data F. Chang, J. Dean, S. Ghemawat, W.C. Hsieh, D.A. Wallach M. Burrows, T. Chandra, A. Fikes, R.E

12

Outline

• Introduction• Data Model• Building Blocks• Implementation• Refinements• Performance Evaluation• Future Work

Page 13: Bigtable: A Distributed Storage System for Structured Data F. Chang, J. Dean, S. Ghemawat, W.C. Hsieh, D.A. Wallach M. Burrows, T. Chandra, A. Fikes, R.E

13

Building Blocks

• SSTable file format– persistent, ordered, immutable key-value (string-string) pairs– used internally to store Bigtable data

Page 14: Bigtable: A Distributed Storage System for Structured Data F. Chang, J. Dean, S. Ghemawat, W.C. Hsieh, D.A. Wallach M. Burrows, T. Chandra, A. Fikes, R.E

14

Building Blocks

• GFS– store log & data files– scalability, reliability, performance, fault tolerance

• Chubby– a highly-available and persistent distributed lock service

Page 15: Bigtable: A Distributed Storage System for Structured Data F. Chang, J. Dean, S. Ghemawat, W.C. Hsieh, D.A. Wallach M. Burrows, T. Chandra, A. Fikes, R.E

15

Outline

• Introduction• Data Model• Building Blocks• Implementation• Refinements• Performance Evaluation• Future Work

Page 16: Bigtable: A Distributed Storage System for Structured Data F. Chang, J. Dean, S. Ghemawat, W.C. Hsieh, D.A. Wallach M. Burrows, T. Chandra, A. Fikes, R.E

16

Bigtable Components

• A library that is linked into every client• Many tablet servers

– handle R/W to tablets with clients

• One tablet master– assign tablets to tablet servers– detect addition & expiration of tablet servers– balance tablet-server load

Page 17: Bigtable: A Distributed Storage System for Structured Data F. Chang, J. Dean, S. Ghemawat, W.C. Hsieh, D.A. Wallach M. Burrows, T. Chandra, A. Fikes, R.E

17

Architecture

Page 18: Bigtable: A Distributed Storage System for Structured Data F. Chang, J. Dean, S. Ghemawat, W.C. Hsieh, D.A. Wallach M. Burrows, T. Chandra, A. Fikes, R.E

18

Tablet Location

• Three-level hierarchy– root tablet (Only one, stores addresses of METADATA tablets)– METADATA tablets (stores addresses of user tablets)– user tablets

Page 19: Bigtable: A Distributed Storage System for Structured Data F. Chang, J. Dean, S. Ghemawat, W.C. Hsieh, D.A. Wallach M. Burrows, T. Chandra, A. Fikes, R.E

19

Tablet Location

• Client caches (multiple) tablet locations– if the cache is stale, query again

Page 20: Bigtable: A Distributed Storage System for Structured Data F. Chang, J. Dean, S. Ghemawat, W.C. Hsieh, D.A. Wallach M. Burrows, T. Chandra, A. Fikes, R.E

20

Tablet Assignment

• The tablet master uses Chubby to keeps track of– live tablet servers

• each live tablet server acquires an exclusive lock on a corresponding file

– tablet assignment status• compare tablets registered in METADATA tablet with tablets in tablet servers

Page 21: Bigtable: A Distributed Storage System for Structured Data F. Chang, J. Dean, S. Ghemawat, W.C. Hsieh, D.A. Wallach M. Burrows, T. Chandra, A. Fikes, R.E

21

Tablet Assignment

• Case 1: some tablets are unassigned– master assigns them to tablet servers with sufficient room

• Case 2: a tablet server stops its service– master detects it and assigns outstanding tablets to other servers.

• Case 3: too many small tablets– master initiates merge

• Case 4: a tablet grows too large– the corresponding tablet server initiates split and notifies master

Page 22: Bigtable: A Distributed Storage System for Structured Data F. Chang, J. Dean, S. Ghemawat, W.C. Hsieh, D.A. Wallach M. Burrows, T. Chandra, A. Fikes, R.E

22

Tablet Serving

• A tablet is stored as a sequence of SSTables in GFS• Tablet mutations are logged in commit log

– the “commit log” stores redo records– recent tablet versions are stored in memory (memtable)– older tablet versions are stored in GFS

Page 23: Bigtable: A Distributed Storage System for Structured Data F. Chang, J. Dean, S. Ghemawat, W.C. Hsieh, D.A. Wallach M. Burrows, T. Chandra, A. Fikes, R.E

23

Tablet Serving

• Recover a tablet– 1. Tablet server fetches its metadata from METADATA tablet, which

contains a list of SSTables that comprises a tablet and redo points.– 2. The server reads the indices of the SSTables into memory.– 3. The server applies all the mutations after the redo point.

Page 24: Bigtable: A Distributed Storage System for Structured Data F. Chang, J. Dean, S. Ghemawat, W.C. Hsieh, D.A. Wallach M. Burrows, T. Chandra, A. Fikes, R.E

24

Tablet Serving

• Write operation on a tablet– 1. The tablet server checks the validity of the operation.– 2. The operation is logged in the commit log.– 3. Commit the operation.– 4. The content of tablet is inserted into memtable.

Page 25: Bigtable: A Distributed Storage System for Structured Data F. Chang, J. Dean, S. Ghemawat, W.C. Hsieh, D.A. Wallach M. Burrows, T. Chandra, A. Fikes, R.E

25

Tablet Serving

• Read operation on a tablet– 1. The tablet server checks the validity of the operation.– 2. Execute the operation on a merged view of memtable & SSTables.

Page 26: Bigtable: A Distributed Storage System for Structured Data F. Chang, J. Dean, S. Ghemawat, W.C. Hsieh, D.A. Wallach M. Burrows, T. Chandra, A. Fikes, R.E

26

Compactions

• Memtable grows as write operations execute• Two types of compactions

– minor compaction– merging (major) compaction

Page 27: Bigtable: A Distributed Storage System for Structured Data F. Chang, J. Dean, S. Ghemawat, W.C. Hsieh, D.A. Wallach M. Burrows, T. Chandra, A. Fikes, R.E

27

Compactions

• Minor compaction (when memtable size reaches a threshold)– 1. Freeze the memtable– 2. Create a new memtable– 3. Convert the memtable to an SSTable and write to GFS

Page 28: Bigtable: A Distributed Storage System for Structured Data F. Chang, J. Dean, S. Ghemawat, W.C. Hsieh, D.A. Wallach M. Burrows, T. Chandra, A. Fikes, R.E

28

Compactions

• Merging compaction (periodically)– 1. Freeze the memtable– 2. Create a new memtable– 3. Merge a few SSTables & memtable into a new SSTable

Page 29: Bigtable: A Distributed Storage System for Structured Data F. Chang, J. Dean, S. Ghemawat, W.C. Hsieh, D.A. Wallach M. Burrows, T. Chandra, A. Fikes, R.E

29

Compactions

• Major compaction – special case of merging compaction– merges all SSTables & memtable

Page 30: Bigtable: A Distributed Storage System for Structured Data F. Chang, J. Dean, S. Ghemawat, W.C. Hsieh, D.A. Wallach M. Burrows, T. Chandra, A. Fikes, R.E

30

Compactions

• Why freeze & create memtable?– Incoming read and write operations can continue during compactions.

• Advantages of compaction:– release the memory of the tablet server– reduce the amount of data that has to be read from the commit log

during recovery if this tablet server dies

Page 31: Bigtable: A Distributed Storage System for Structured Data F. Chang, J. Dean, S. Ghemawat, W.C. Hsieh, D.A. Wallach M. Burrows, T. Chandra, A. Fikes, R.E

31

Outline

• Introduction• Data Model• Building Blocks• Implementation• Refinements• Performance Evaluation• Future Work

Page 32: Bigtable: A Distributed Storage System for Structured Data F. Chang, J. Dean, S. Ghemawat, W.C. Hsieh, D.A. Wallach M. Burrows, T. Chandra, A. Fikes, R.E

32

Refinements

• Locality groups– group multiple column families together– different locality groups are not typically accessed together– for each tablet, store each locality group in a separate SSTable– more efficient R/W

Page 33: Bigtable: A Distributed Storage System for Structured Data F. Chang, J. Dean, S. Ghemawat, W.C. Hsieh, D.A. Wallach M. Burrows, T. Chandra, A. Fikes, R.E

33

Refinements

• Compression– similar data in same column, neighbouring rows, multiple versions– customized compression on SSTable block level (smallest component)– two-pass compression scheme

• 1. Bentley and McIlroy’s scheme, compress long strings across a large window• 2. fast compression algorithm, look for repetitions in small window• experimental compression ratio: 10% (Gzip: 25-33%)

Page 34: Bigtable: A Distributed Storage System for Structured Data F. Chang, J. Dean, S. Ghemawat, W.C. Hsieh, D.A. Wallach M. Burrows, T. Chandra, A. Fikes, R.E

34

Refinements

• Caching– two-level cache on tablet server– Scan cache (high-level): caches key-value pairs

• Case: read the same data repeatedly

– Block cache (low-level): caches SSTable blocks• Case: sequential read

Page 35: Bigtable: A Distributed Storage System for Structured Data F. Chang, J. Dean, S. Ghemawat, W.C. Hsieh, D.A. Wallach M. Burrows, T. Chandra, A. Fikes, R.E

35

Refinements

• Commit-log implementation– one commit log per tablet incurs a large # of disk seeks– use single commit log for all tablets on a tablet server– benefits significantly during normal operation– complicates recovery

• solution: sort the commit log entries first

Page 36: Bigtable: A Distributed Storage System for Structured Data F. Chang, J. Dean, S. Ghemawat, W.C. Hsieh, D.A. Wallach M. Burrows, T. Chandra, A. Fikes, R.E

36

Outline

• Introduction• Data Model• Building Blocks• Implementation• Refinements• Performance Evaluation• Future Work

Page 37: Bigtable: A Distributed Storage System for Structured Data F. Chang, J. Dean, S. Ghemawat, W.C. Hsieh, D.A. Wallach M. Burrows, T. Chandra, A. Fikes, R.E

37

Performance Evaluation

R/W rate per tablet server

aggregate R/W rate

Page 38: Bigtable: A Distributed Storage System for Structured Data F. Chang, J. Dean, S. Ghemawat, W.C. Hsieh, D.A. Wallach M. Burrows, T. Chandra, A. Fikes, R.E

38

Outline

• Introduction• Data Model• Building Blocks• Implementation• Refinements• Performance Evaluation• Future Work

Page 39: Bigtable: A Distributed Storage System for Structured Data F. Chang, J. Dean, S. Ghemawat, W.C. Hsieh, D.A. Wallach M. Burrows, T. Chandra, A. Fikes, R.E

39

Future Work

• Resource sharing for different applications?• Hybrid with relational database?

– complex query– security