bigtable: a distributed storage system for structured...

28
BigTable: A Distributed Storage System for Structured Data Fay Chang, Jeffrey Dean, et al. Google Inc. Symposium on Operating Systems Design and Implementation (OSDI) 2006

Upload: others

Post on 21-Jun-2020

29 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: BigTable: A Distributed Storage System for Structured Datacourse/DBMS/class/bigtable/bigtable.pdf · • BigTable is a working system catering the needs for a wide variety of applications

BigTable: A Distributed Storage Systemfor Structured Data

Fay Chang, Jeffrey Dean, et al.

Google Inc.

Symposium on Operating Systems Designand Implementation (OSDI) 2006p ( )

Page 2: BigTable: A Distributed Storage System for Structured Datacourse/DBMS/class/bigtable/bigtable.pdf · • BigTable is a working system catering the needs for a wide variety of applications

Buzz with Big datag

Page 3: BigTable: A Distributed Storage System for Structured Datacourse/DBMS/class/bigtable/bigtable.pdf · • BigTable is a working system catering the needs for a wide variety of applications

Challenges for Googleg g• Scale

– Google's search index is more than 100 million GB insize

• Varied applicability– Throughput oriented batch processing- Google

analytics– Latency serving- Google earth and Google maps

• Availability– Google Adsense, unavailability means loss of money!g , y y– Google Now

• High performanceHigh performance– ≈40 thousand Google searches per second on an

averageaverage– 16-20% of the queries are brand new

Page 4: BigTable: A Distributed Storage System for Structured Datacourse/DBMS/class/bigtable/bigtable.pdf · • BigTable is a working system catering the needs for a wide variety of applications

Why not RDBMS?y

• No support for semi structured datapp– Non uniform fields are not easy to handle

• Scale is too high

• Too costly• Too costly

• No low level storage optimizationg p– Required for high performance

• Simpler queries– No need for join group-by queriesNo need for join, group by queries

Page 5: BigTable: A Distributed Storage System for Structured Datacourse/DBMS/class/bigtable/bigtable.pdf · • BigTable is a working system catering the needs for a wide variety of applications

What is BigTable ?g• Distributed storage system for managing

structured data• Highly scalableHighly scalable

– Hundreds of Terabytes of data– Spread across thousands of machines

• Self managingSelf managing– Load balancing

D i dditi / l f– Dynamic addition/removal of servers

• Supports dynamic control over underlyingpp y y glayout and format

• Does not support full relational data model• Does not support full relational data model

Page 6: BigTable: A Distributed Storage System for Structured Datacourse/DBMS/class/bigtable/bigtable.pdf · • BigTable is a working system catering the needs for a wide variety of applications

Data model

• Sparse distributed persistent multi• Sparse, distributed, persistent multi-dimensional sorted map– Indexed by: row key, column key and a timestamp

Each value in the map is an array of bytes– Each value in the map is an array of bytes

– (row:string, column:string, time:int64) string

Page 7: BigTable: A Distributed Storage System for Structured Datacourse/DBMS/class/bigtable/bigtable.pdf · • BigTable is a working system catering the needs for a wide variety of applications

Data model• Rows:

– Arbitrary strings up to 64KB in size

R d/ it i t i– Read/write is atomic

– Data is lexicographically ordered on row key

orte

dS

Page 8: BigTable: A Distributed Storage System for Structured Datacourse/DBMS/class/bigtable/bigtable.pdf · • BigTable is a working system catering the needs for a wide variety of applications

Data model• Columns families:

– Column keys are grouped to form column families– Same type within a column familySame type within a column family– Basic unit of access control

T l l– Two level name structure• Family : Qualifier

Page 9: BigTable: A Distributed Storage System for Structured Datacourse/DBMS/class/bigtable/bigtable.pdf · • BigTable is a working system catering the needs for a wide variety of applications

Data model• Timestamps:

– To index multiple versions of same data

L k ti– Look up options• Most recent n versions

• All versions

Page 10: BigTable: A Distributed Storage System for Structured Datacourse/DBMS/class/bigtable/bigtable.pdf · • BigTable is a working system catering the needs for a wide variety of applications

Data model• Tablets:

– Row-range is dynamically partitioned into tablets

T bl t it f l d b l i d di t ib ti– Tablets are unit of load balancing and distribution

– Accessing short row ranges involvescommunication with small number of machines

Page 11: BigTable: A Distributed Storage System for Structured Datacourse/DBMS/class/bigtable/bigtable.pdf · • BigTable is a working system catering the needs for a wide variety of applications

API

• Metadata operationsp– Create/delete column families– Create/delete tables/– Change metadata

• Writes– Set(): write cells in a row– DeleteCells(): delete cells in a row

DeleteRow(): delete all cells in a row– DeleteRow(): delete all cells in a row• Reads

– Scanner: read arbitrary cells in a BigtableScanner: read arbitrary cells in a Bigtable• Each row read is atomic• Can restrict returned rows to a particular range• Can ask for just data from 1 row, all rows, etc.• Can ask for all columns, just certain column families, or

specific columnsspecific columns

Page 12: BigTable: A Distributed Storage System for Structured Datacourse/DBMS/class/bigtable/bigtable.pdf · • BigTable is a working system catering the needs for a wide variety of applications

Building blocksg• Uses distributed Google File System(GFS)

– For both log and data files• SSTable is used internally to store datay

– Persistent, immutable map from keys to values– Keys and values are arbitrary byte stringsKeys and values are arbitrary byte strings– Each SSTable is a sequence of blocks– Block index is used to locate blocksBlock index is used to locate blocks– Lookup can be done in single disk-seek

64K 64K 64K SSTable

block block block

Index

Page 13: BigTable: A Distributed Storage System for Structured Datacourse/DBMS/class/bigtable/bigtable.pdf · • BigTable is a working system catering the needs for a wide variety of applications

Building blocksg

• Chubbyy– Distributed lock service

/– Provides namespace consisting of directory/files• Each directory/file can be used as a lock

– 5 active servers with 1 among them is master

E th t t t 1 t i ti t ti– Ensures that at most 1 master is active at any time

– Uses Paxos algorithm to maintain consistencyg yamong replicas

If Chubby is unavailable for extended period then– If Chubby is unavailable for extended period thenBigTable becomes unavailable

Page 14: BigTable: A Distributed Storage System for Structured Datacourse/DBMS/class/bigtable/bigtable.pdf · • BigTable is a working system catering the needs for a wide variety of applications

Building blocksgTypical cluster:

Shared pool of machines that also run other distributed applicationsShared pool of machines that also run other distributed applications

Page 15: BigTable: A Distributed Storage System for Structured Datacourse/DBMS/class/bigtable/bigtable.pdf · • BigTable is a working system catering the needs for a wide variety of applications

Implementationp• Single master distributed system• 3 major components:

– Library linked to every client– Master server

• Master assigns tablets to tablet serversg• Detect the addition and expiration of tablet servers• Load balancing, garbage collection and schema changesg, g g g

– Many tablet servers• Each table contains a set of tablets, and each tablet containsEach table contains a set of tablets, and each tablet contains

all data in the associated range• Tablet servers can be dynamically added or removed• Clients communicate directly with tablet servers for

read/writes• A tablet is assigned to at most one tablet server at an instance

Page 16: BigTable: A Distributed Storage System for Structured Datacourse/DBMS/class/bigtable/bigtable.pdf · • BigTable is a working system catering the needs for a wide variety of applications

Implementation(contd.)p ( )• Tablet location:

Level 1: Chubby file containing location of the root tablet– Level 1: Chubby file containing location of the root tablet

– Level 2: Root tablet contains the location of all tablets in METADATA t bl tMETADATA tablet

– Level 3: Each METADATA tablet contains the location of a set of user tablets

• METADATA: Key: table id + end row, Data:METADATA: Key: table id end row, Data: location

Page 17: BigTable: A Distributed Storage System for Structured Datacourse/DBMS/class/bigtable/bigtable.pdf · • BigTable is a working system catering the needs for a wide variety of applications

Implementation(contd.)p ( )• Assigning tablets:

M– Master startup:• Attains a unique master lock in Chubby

d f d l• Scans servers directory to find live servers• Communicates with every server about their assigned tablets

S METADATA t l b t t bl t• Scans METADATA to learn about tablets

– Tablet server startup:• Tablet server creates and acquires an exclusive lock on a

uniquely named file on Chubby• Master monitors this directory to discover tablet servers• Master monitors this directory to discover tablet servers

– Tablet server stops serving:If t bl t l l i l k• If tablet server loses exclusive lock

• Tries to reacquire lock if the file existsC t i if th fil d t i t• Can not serve again if the file does not exist

Page 18: BigTable: A Distributed Storage System for Structured Datacourse/DBMS/class/bigtable/bigtable.pdf · • BigTable is a working system catering the needs for a wide variety of applications

Implementation(contd.)p ( )• Tablet serving:

– Updates are committed to a commit log that storesredo recordsredo records

– Recently committed updates are stored in memory-t blmemtable

– Older updates are stored in a sequence of SSTable

– To recover a tablet:• Read MEATDATA• Read MEATDATA

– List of SSTables and redo points

R d i di f SST bl i• Reads indices of SSTable into memory

• Reconstructs memtable

Page 19: BigTable: A Distributed Storage System for Structured Datacourse/DBMS/class/bigtable/bigtable.pdf · • BigTable is a working system catering the needs for a wide variety of applications

Implementation(contd.)p ( )• Tablet serving:

– Write operation• Server checks if it is well

– Read operation• Checks if the request is well-

formed

• Checks if the sender isauthorized

formed

• Check authorization inChubby fileauthorized

• Write to commit log

• After commit contents are

Chubby file

• Merge memtable and SSTableto find data• After commit contents are

written in memtable • Return data

Page 20: BigTable: A Distributed Storage System for Structured Datacourse/DBMS/class/bigtable/bigtable.pdf · • BigTable is a working system catering the needs for a wide variety of applications

Implementation(contd.)p ( )• Compaction:

– To control the size of memtable, SSTable andtabletlogtabletlog

– Minor compaction:• Move data from memtable to SSTable

• Shrinks memory usage of tablet servers

• Reduces amount of data to be read during recovery

– Merging compaction:Merging compaction:• Merge multiple SSTables and memtable to single SStable

– Major compaction:• Re-writes all SSTables into exactly one SSTable

• Helps in discarding deleted data in timely fashion

Page 21: BigTable: A Distributed Storage System for Structured Datacourse/DBMS/class/bigtable/bigtable.pdf · • BigTable is a working system catering the needs for a wide variety of applications

Refinements• Locality groups

Cli l i l l f ili h i– Clients can group multiple column families together in alocality groupOne SSTable for each locality group– One SSTable for each locality group

– Column families not accessed together can be kept intodifferent locality groupsdifferent locality groups

• E.g. metadata and language in one locality group and contentsin another

• Compression– Many opportunities for compressionMany opportunities for compression

• Similar values in the cell at different timestamps• Similar values in different columns

Si il l dj• Similar values across adjacent rows– Clients can decide format of compression

C i i li d h SST bl bl k l– Compression is applied to each SSTable block separately

Page 22: BigTable: A Distributed Storage System for Structured Datacourse/DBMS/class/bigtable/bigtable.pdf · • BigTable is a working system catering the needs for a wide variety of applications

Refinements• Caching

S h hi h l l h h h k l i– Scan cache: a high level cache that caches key-value pairreturned by SSTable

• Most useful for repetitive read• Most useful for repetitive read– Block cache: caches SSTable blocks read from file system

• Good for reading data close to already accessed data• Good for reading data close to already accessed data

• Bloom filtersRead operation has to read from all the SSTables making– Read operation has to read from all the SSTables makingup a tabletBloom filter are created for SSTables in a particular– Bloom filter are created for SSTables in a particularlocality group

– With bloom filter ask if an SSTable might contain data forWith bloom filter, ask if an SSTable might contain data forspecified row/column pair

– Reduces the number of disk accessesReduces the number of disk accesses

Page 23: BigTable: A Distributed Storage System for Structured Datacourse/DBMS/class/bigtable/bigtable.pdf · • BigTable is a working system catering the needs for a wide variety of applications

Refinements• Commit log implementation

– One commit log per tablet server

– complicates tablet movementcomplicates tablet movement

– when server fails, tablets divided among multiple servers

• can cause heavy scan load by each such server

• optimization to avoid multiple separate scans: sort log p p p gby (table, rowname, LSN), so logs for a tablet are clustered, then distributeclustered, then distribute

• GFS delay spikes can mess up log write (time critical)– solution: two separate logs, one active at a time

– can have duplicates between these twop

Page 24: BigTable: A Distributed Storage System for Structured Datacourse/DBMS/class/bigtable/bigtable.pdf · • BigTable is a working system catering the needs for a wide variety of applications

Refinements• Immutability

– SSTables are immutable– simplifies caching, sharing across GFS etcsimplifies caching, sharing across GFS etc– no need for concurrency control

SST bl f bl d d i METADATA bl– SSTables of a tablet recorded in METADATA table– Garbage collection of SSTables done by master– On tablet split, split tables can start off quickly on

shared SSTables, splitting them lazilyshared SSTables, splitting them lazily

• Only memtable has reads and updates tconcurrent

– copy on write rows, allow concurrent read/write

Page 25: BigTable: A Distributed Storage System for Structured Datacourse/DBMS/class/bigtable/bigtable.pdf · • BigTable is a working system catering the needs for a wide variety of applications

Performance evaluation

Mi b h kiMicro benchmarking

R l li tiReal applications

Page 26: BigTable: A Distributed Storage System for Structured Datacourse/DBMS/class/bigtable/bigtable.pdf · • BigTable is a working system catering the needs for a wide variety of applications

Performance evaluation

Not Linear!Not Linear!WHY?

Page 27: BigTable: A Distributed Storage System for Structured Datacourse/DBMS/class/bigtable/bigtable.pdf · • BigTable is a working system catering the needs for a wide variety of applications

Performance evaluation• Not linearly ?

– Load imbalance• Competitions with other processes• Competitions with other processes

– Network

CPU– CPU

– Rebalancing algorithm does not work perfectly• Reduce the number of tablet movement

• Load shifted around as the benchmark progressesp g

Page 28: BigTable: A Distributed Storage System for Structured Datacourse/DBMS/class/bigtable/bigtable.pdf · • BigTable is a working system catering the needs for a wide variety of applications

Next…• BigTable is a working system catering the

needs for a wide variety of applications for aninternet giantinternet giant– Used by over 60 applications at Google

• But…N f i d t i li d i t– No use of index, materialized views, etc.

– No support for hash/ordered tablespp /