BIGTABLE Nirmesh Malviya [email protected] April 30, 2012


Page 1: BIGTABLE - Massachusetts Institute of Technologydb.lcs.mit.edu/6.830/lectures/bigtable-slides.pdf · BigTable System Architecture Read/write Cluster Scheduling Master handles failover,



What is Bigtable?

● Distributed storage system.

● Manages structured data via a simple data model.

● Scalable.

● Self-managing.

4/30/2012 6.830 - Spring 2012


Why did Google not use a relational database?

● Google has LOTS of data.

● No commercial system big enough.
  ● Too expensive even if there was one.

● Don’t have end-to-end control.
  ● Low-level storage optimizations difficult.


Data model: sparse map

● Sparse multi-dimensional map.
  ● Indexed by <Row, Column, Timestamp> key.
  ● Values are uninterpreted bytes.

● Distributed, persistent and sorted.

● Essentially a column-oriented physical store.
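A minimal in-memory sketch of the map described above, assuming string row and column names and integer timestamps. The class name SparseTable and its methods are hypothetical and are not part of Bigtable's API; the sketch ignores distribution and persistence entirely.

```python
from typing import Dict, Optional, Tuple

# Hypothetical key type: (row, column, timestamp) -> uninterpreted bytes.
Key = Tuple[str, str, int]


class SparseTable:
    """Toy stand-in for the sparse map; only cells that exist are stored."""

    def __init__(self) -> None:
        self.cells: Dict[Key, bytes] = {}

    def put(self, row: str, column: str, timestamp: int, value: bytes) -> None:
        self.cells[(row, column, timestamp)] = value

    def get_latest(self, row: str, column: str) -> Optional[bytes]:
        # Collect every stored version of the cell, then pick the newest.
        versions = [(ts, v) for (r, c, ts), v in self.cells.items()
                    if r == row and c == column]
        return max(versions)[1] if versions else None

    def scan(self):
        # Sorted by row, then column, then newest timestamp first.
        return sorted(self.cells.items(),
                      key=lambda kv: (kv[0][0], kv[0][1], -kv[0][2]))
```

Because absent cells take no space, rows can carry entirely different sets of columns, which is what "sparse" means here.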


Data model: column families

● Arbitrary number of columns on a row-by-row basis.

● Column families:
  ● Columns of same type.
  ● Access control.

● Family:qualifier
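A small sketch of how a family:qualifier column name could be split, and how access control could be checked per family rather than per column. The function names and the writers_by_family mapping are hypothetical, chosen only for illustration.

```python
def split_column(column):
    """Split 'family:qualifier' at the first colon."""
    family, sep, qualifier = column.partition(":")
    if not sep or not family:
        raise ValueError(f"expected family:qualifier, got {column!r}")
    return family, qualifier


def may_write(user, column, writers_by_family):
    """Access is decided by the column's family, not by the full name."""
    family, _ = split_column(column)
    return user in writers_by_family.get(family, set())
```

An empty qualifier is allowed in this sketch, so a name such as "contents:" parses to the family "contents" with no qualifier.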


Data model: example


Data model is not relational

● Writes to a row atomic.
  ● No multirow transactions.

● No table-wide integrity constraints.


API

● Writes (atomic).
  ● Set(): write cells in a row.
  ● DeleteCells(): delete cells in a row.
  ● DeleteRow(): delete all cells in a row.

● Reads.

● Metadata operations.
  ● Create/delete tables, column families, change metadata.


API Example: Write/Modify

atomic row modification
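The slide's own code did not survive in this transcript. As a stand-in, here is a toy Python sketch of an atomic row modification, assuming a per-row lock. The classes Table and RowMutation and their method names are hypothetical and only mirror the Set/DeleteCells/DeleteRow operations listed earlier; they are not the real client API.

```python
import threading
from collections import defaultdict


class RowMutation:
    """Buffers operations against a single row until they are applied."""

    def __init__(self, row):
        self.row = row
        self.ops = []

    def set(self, column, value):
        self.ops.append(("set", column, value))

    def delete_cells(self, column):
        self.ops.append(("delete_cells", column, None))

    def delete_row(self):
        self.ops.append(("delete_row", None, None))


class Table:
    """Toy table: row -> {column -> value}, with one lock per row."""

    def __init__(self):
        self.rows = defaultdict(dict)
        self._locks = defaultdict(threading.Lock)
        self._guard = threading.Lock()  # protects the lock table itself

    def apply(self, mutation):
        with self._guard:
            lock = self._locks[mutation.row]
        # Every buffered operation takes effect under the same row lock,
        # so another apply() on this row never interleaves with it.
        with lock:
            row = self.rows[mutation.row]
            for op, column, value in mutation.ops:
                if op == "set":
                    row[column] = value
                elif op == "delete_cells":
                    row.pop(column, None)
                elif op == "delete_row":
                    row.clear()
```

Nothing in this sketch spans two rows, which matches the earlier point that there are no multirow transactions.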


API Example: Read

Return sets can be filtered using regular expressions:

anchor: com.cnn.*
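A minimal sketch of filtering a row's return set by a regular expression on the column name, using Python's re module. The function read_row and the sample column names are hypothetical; the slide's original read example is not preserved in this transcript.

```python
import re


def read_row(row_cells, column_pattern):
    """Return only the cells whose full column name matches the regex."""
    regex = re.compile(column_pattern)
    return {col: val for col, val in row_cells.items()
            if regex.fullmatch(col)}


cells = {
    "anchor:com.cnn.www": b"CNN",
    "anchor:com.cnn.money": b"CNN Money",
    "anchor:org.c-span.www": b"C-SPAN",
    "contents:": b"<html>...",
}
cnn_anchors = read_row(cells, r"anchor:com\.cnn\..*")
```

Here cnn_anchors keeps only the two anchor columns under com.cnn; the dots are escaped so they match literally.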


GFS (now Colossus)

● Large-scale distributed filesystem.

● Master: responsible for metadata.

● Chunk servers: responsible for reading and writing large chunks of data.
  ● Chunks replicated on 3 machines.


Google File System (GFS)

Data transfers happen directly between clients/chunkservers.

[Figure: GFS architecture. A GFS master, clients, and chunkservers holding replicas of chunks C0, C1, C3, C4, and C5.]


SSTable

● Immutable, sorted file of key-value pairs.

● Chunks of data plus an index.
  ● Index is of block ranges, not values.

[Figure: an SSTable made of 64K blocks plus an index.]
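The block-range index can be sketched as follows: the index holds one key per block, and a lookup binary-searches the index to pick a single block to scan. This is a simplified in-memory illustration; the class below is hypothetical, and block_size counts entries rather than 64K of bytes.

```python
import bisect


class SSTable:
    """Toy immutable sorted table split into fixed-size blocks."""

    def __init__(self, items, block_size=2):
        pairs = sorted(items, key=lambda kv: kv[0])
        self.blocks = [pairs[i:i + block_size]
                       for i in range(0, len(pairs), block_size)]
        # The index stores only the last key of each block, not the values.
        self.index = [block[-1][0] for block in self.blocks]

    def get(self, key):
        # Find the first block whose last key is >= the lookup key.
        i = bisect.bisect_left(self.index, key)
        if i == len(self.blocks):
            return None  # key is beyond the last block
        for k, v in self.blocks[i]:  # scan only the one candidate block
            if k == key:
                return v
        return None
```

Since the index is much smaller than the data, it can stay in memory while each lookup touches at most one block.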


Tablet

● Contains some range of rows of the table.

● Built out of multiple SSTables.

[Figure: a tablet (Start: aardvark, End: apple) built from two SSTables, each made of 64K blocks plus an index.]


Table

● Multiple tablets make up a table.

● SSTables can be shared.

● Tablets do not overlap, SSTables can overlap.

[Figure: two tablets, with row ranges aardvark to apple and apple_two_E to boat, drawn over four SSTables.]


Locality Groups

● Group column families together into an SSTable.

● Can keep some groups all in memory.

● Can compress locality groups.

● Bloom Filters on locality groups – avoid searching SSTable.
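A minimal Bloom filter sketch showing why it lets a read skip an SSTable: a negative answer is definite, while a positive answer only means the key may be present. The sizing (1024 bits, 3 hashes) and the SHA-256-based hashing are illustrative assumptions, not what Bigtable uses.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        # Derive num_hashes bit positions by salting the key with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = 1

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[p] for p in self._positions(key))
```

A reader would consult the filter first and open the SSTable only when might_contain returns True.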


Bigtable: Building blocks

● Scheduler.

● GFS.

● Chubby Lock service.

● MapReduce helpful but not required.


Typical Cluster

[Figure: a typical cluster. Shared services: cluster scheduling master, lock service, GFS master. Machines 1 to 3 each run Linux, a scheduler slave, and a GFS chunk server, alongside user tasks, Bigtable servers, or the Bigtable master.]


Chubby

● {lock/file/name} service.

● Coarse-grained locks, can store small amount of data in a lock.

● 5 replicas, need a majority vote to be active.
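The majority rule reduces to simple arithmetic. The helper below is a hypothetical illustration, not part of Chubby.

```python
def has_quorum(live_replicas, total_replicas=5):
    """True when strictly more than half of the replicas are reachable."""
    return live_replicas > total_replicas // 2
```

With 5 replicas the service stays active with 3 live replicas, so it tolerates 2 failures.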


Finding a tablet

● Tablets move around from server to server.

● Given a row, how do clients find the right machine?
  ● Tablet property: startrowindex and endrowindex.

● Instead: store special tables containing tablet location info in the Bigtable cell itself.
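A sketch of the lookup a location table makes possible: because tablets cover non-overlapping row ranges, a binary search over sorted boundary rows finds the tablet for any row. Keying each entry by the tablet's end row, and the class and server names, are assumptions made for this illustration.

```python
import bisect


class TabletLocations:
    """Toy location table: sorted tablet end rows -> serving machine."""

    def __init__(self, entries):
        # entries: (end_row, server) pairs. Each tablet covers the rows
        # after the previous end_row, up to and including its own end_row.
        entries = sorted(entries)
        self.end_rows = [end for end, _ in entries]
        self.servers = [server for _, server in entries]

    def locate(self, row):
        # First tablet whose end row is >= the requested row.
        i = bisect.bisect_left(self.end_rows, row)
        if i == len(self.end_rows):
            raise KeyError(row)  # row is past the last tablet
        return self.servers[i]
```

A client could cache the result and repeat the lookup only when a tablet has moved to another server.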


Finding a tablet


Tablet Server

● Manages tablets, multiple tablets per server.

● Each tablet is 100-200MB.
  ● Lives on only one server.

● Tablet server splits tablets that get too big.


Tablet Server startup

● On startup, creates and acquires an exclusive lock on a uniquely named file in a Chubby directory.

● Tablet server stops serving its tablets if it loses its exclusive lock.


Bigtable Master

● Responsible for load balancing and fault tolerance.

● Uses Chubby to monitor health of tablet servers, restarts failed servers.
  ● If Chubby session expires, master kills itself.

● Preferably starts a tablet server on the machine where the data already resides.


Master Startup

● Grabs unique master lock in Chubby.
  ● Prevents multi-instantiations.

● Scans directory in Chubby for live servers, communicates with every live tablet server.

● Scans METADATA table to learn the set of tablets.


Bigtable Master

● Master monitors Chubby directory to discover tablet servers.

● Master is responsible for detecting when a tablet server is no longer serving its tablets.
  ● Detects this by periodically checking the status of each tablet server's lock.


Writing to a table

● Mutations are logged, then applied to an in-memory version.

● Logfile stored in GFS.

[Figure: a tablet (apple_two_E to boat) with two SSTables and a memtable receiving a stream of insert and delete mutations.]
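The write path above can be sketched as "log first, then memtable". In this toy version a Python list stands in for the log file in GFS and plain dicts stand in for SSTables. The class, the TOMBSTONE marker for deletes, and the read order (memtable, then SSTables newest to oldest) are assumptions for illustration.

```python
TOMBSTONE = object()  # hypothetical marker recording that a key was deleted


class Tablet:
    def __init__(self):
        self.log = []        # stand-in for the commit log stored in GFS
        self.memtable = {}   # most recent mutations, held in memory
        self.sstables = []   # immutable dicts, oldest first

    def write(self, key, value):
        self.log.append(("insert", key, value))  # 1. log the mutation
        self.memtable[key] = value               # 2. apply it in memory

    def delete(self, key):
        self.log.append(("delete", key, None))
        self.memtable[key] = TOMBSTONE  # hides older values in SSTables

    def read(self, key):
        # Newest data wins: memtable first, then SSTables newest to oldest.
        for source in [self.memtable] + self.sstables[::-1]:
            if key in source:
                value = source[key]
                return None if value is TOMBSTONE else value
        return None
```

Since SSTables are never modified in place, a delete has to be recorded as a new entry that shadows the older value.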


Tablet Compactions

● Minor compaction.
  ● Reduce memory usage.
  ● Reduce log traffic on restart.

● Merging compaction.

● Major compaction.
  ● No deletion records, only live data.
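A simplified sketch of the three compactions, with dicts standing in for the memtable and SSTables. The function names and the TOMBSTONE deletion marker are hypothetical. The merging step here combines only the newest SSTables, which is a simplification of the real system.

```python
TOMBSTONE = object()  # hypothetical deletion record


def minor_compaction(memtable, sstables):
    """Freeze the memtable into a new SSTable; return (empty memtable, tables)."""
    return {}, sstables + [dict(sorted(memtable.items(), key=lambda kv: kv[0]))]


def merging_compaction(sstables, count):
    """Merge the newest `count` SSTables into one, keeping deletion records."""
    split = len(sstables) - count
    merged = {}
    for table in sstables[split:]:  # oldest first, so newer values overwrite
        merged.update(table)
    return sstables[:split] + [merged]


def major_compaction(sstables):
    """Rewrite everything into one SSTable that holds only live data."""
    merged = {}
    for table in sstables:
        merged.update(table)
    return [{k: v for k, v in merged.items() if v is not TOMBSTONE}]
```

Only the major compaction may drop deletion records: after it, no older SSTable remains that could still hold the deleted value.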


BigTable System Architecture

[Figure: a Bigtable cell and its client.]

● Bigtable client, via the Bigtable client library: Open(), read/write, metadata ops.

● Bigtable master: performs metadata ops, load balancing.

● Bigtable tablet servers (three shown): serve data.

● Cluster scheduling master: handles failover, monitoring.

● GFS: holds tablet data, logs.

● Lock service: holds metadata, handles master election.


Bigtable in the real world

● Bigtable is closed-source and owned by Google.

● Apache HBase is an open-source implementation.
  ● Famous user: Facebook messaging platform.
