
Page 1: Lecture 7 – Bigtable

Lecture 7 – Bigtable

CSE 490h – Introduction to Distributed Computing, Winter 2008

Except as otherwise noted, the content of this presentation is licensed under the Creative Commons Attribution 2.5 License.

Page 2: Lecture 7 – Bigtable

Previous Classes

MapReduce: toolkit for processing data

GFS: file system

RPC, etc.

Page 3: Lecture 7 – Bigtable

GFS vs. Bigtable

GFS provides raw data storage. We need:

More sophisticated storage
Flexible enough to be useful
Store semi-structured data
Reliable, scalable, etc.

Page 4: Lecture 7 – Bigtable

Examples

URLs:
Contents, crawl metadata, links, anchors, pagerank, …

Per-user data:
User preference settings, recent queries / search results, …

Geographic locations:
Physical entities (shops, restaurants, etc.), roads, satellite image data, user annotations, …

Page 5: Lecture 7 – Bigtable

Commercial DB

Why not use a commercial database?

Not scalable enough
Too expensive
We need to do low-level optimizations

Page 6: Lecture 7 – Bigtable

BigTable Features

Distributed key-value map
Fault-tolerant, persistent
Scalable:
Thousands of servers
Terabytes of in-memory data
Petabytes of disk-based data
Millions of reads / writes per second, efficient scans
Self-managing:
Servers can be added / removed dynamically
Servers adjust to load imbalance

Page 7: Lecture 7 – Bigtable

Basic Data Model

Distributed multi-dimensional sparse map:
(row, column, timestamp) → cell contents (sketched below)

Example: row “www.cnn.com”, column “contents:”, with versions at timestamps t3, t11, t17 whose cell contents are the page HTML (“<html>…”)

• Good match for most of our applications
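To make the (row, column, timestamp) → cell contents mapping concrete, here is a minimal in-memory sketch in C++. The nested-map layout and the names SparseTable, Put, and Get are illustrative assumptions for these notes, not Bigtable's actual implementation.

#include <cstdint>
#include <iterator>
#include <map>
#include <optional>
#include <string>

// Toy model of the data model: a sparse map from
// (row, column, timestamp) to an uninterpreted string value.
// Names and layout are hypothetical, for illustration only.
class SparseTable {
 public:
  void Put(const std::string& row, const std::string& column,
           int64_t timestamp, const std::string& value) {
    cells_[row][column][timestamp] = value;
  }

  // Returns the newest value with timestamp <= ts, if any.
  std::optional<std::string> Get(const std::string& row,
                                 const std::string& column,
                                 int64_t ts) const {
    auto r = cells_.find(row);
    if (r == cells_.end()) return std::nullopt;
    auto c = r->second.find(column);
    if (c == r->second.end()) return std::nullopt;
    // Versions are keyed by timestamp; upper_bound finds the first
    // entry > ts, so the entry before it is the newest version <= ts.
    const auto& versions = c->second;
    auto it = versions.upper_bound(ts);
    if (it == versions.begin()) return std::nullopt;
    return std::prev(it)->second;
  }

 private:
  // row -> column -> timestamp -> value; absent cells take no space.
  std::map<std::string,
           std::map<std::string, std::map<int64_t, std::string>>> cells_;
};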

Page 8: Lecture 7 – Bigtable

Rows

Name is an arbitrary string
Access to data in a row is atomic
Row creation is implicit upon storing data
Rows ordered lexicographically
Rows close together lexicographically usually reside on one or a small number of machines (see the row-key sketch below)
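Because rows are sorted lexicographically, applications can pick row keys so that related rows end up adjacent; the webtable example later in these slides uses the reversed hostname “com.cnn.www” as its row key, which keeps pages from the same domain close together. A minimal sketch of that key transform, using a hypothetical helper name:

#include <sstream>
#include <string>
#include <vector>

// Reverse the dot-separated components of a hostname, e.g.
// "www.cnn.com" -> "com.cnn.www", so that pages from the same
// domain sort next to each other (and thus tend to share tablets).
// Hypothetical helper for illustration; not part of Bigtable's API.
std::string ReverseHostname(const std::string& host) {
  std::vector<std::string> parts;
  std::stringstream ss(host);
  std::string part;
  while (std::getline(ss, part, '.')) parts.push_back(part);
  std::string out;
  for (auto it = parts.rbegin(); it != parts.rend(); ++it) {
    if (!out.empty()) out += '.';
    out += *it;
  }
  return out;
}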

Page 9: Lecture 7 – Bigtable

Columns

Example row “www.cnn.com”: column “contents:” holds “<html>…”, while anchor columns record incoming links, e.g. “anchor:cnnsi.com” holds the anchor text “CNN” and “anchor:stanford.edu” holds “CNN home page”

Columns have two-level name structure: family:optional_qualifier (see the parsing sketch below)

Column family:
Unit of access control
Has associated type information

Qualifier gives unbounded columns:
Additional level of indexing, if desired
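A small sketch of splitting a column name into its family and optional qualifier; ParseColumn is a hypothetical helper for illustration, not a real Bigtable call:

#include <string>
#include <utility>

// Split a column name of the form "family:optional_qualifier".
// The qualifier may be empty, as in "contents:".
std::pair<std::string, std::string> ParseColumn(const std::string& column) {
  std::string::size_type colon = column.find(':');
  if (colon == std::string::npos) return {column, ""};
  return {column.substr(0, colon), column.substr(colon + 1)};
}

// Examples:
//   ParseColumn("contents:")         -> {"contents", ""}
//   ParseColumn("anchor:cnnsi.com")  -> {"anchor", "cnnsi.com"}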

Page 10: Lecture 7 – Bigtable

Column Families

Must be created before data can be stored
Small number of column families
Unbounded number of columns

Page 11: Lecture 7 – Bigtable

Timestamps

Used to store different versions of data in a cell
New writes default to current time, but timestamps for writes can also be set explicitly by clients

Page 12: Lecture 7 – Bigtable

Timestamps: Garbage Collection

Per-column-family settings to tell Bigtable to GC:
“Only retain most recent K values in a cell”
“Keep values until they are older than K seconds”
(see the GC sketch below)
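A minimal sketch of applying these two per-column-family GC rules to one cell's versions; GcPolicy, ApplyGc, and the version map are illustrative assumptions, not Bigtable internals:

#include <cstdint>
#include <map>
#include <string>

// Per-column-family GC settings, as described on this slide.
struct GcPolicy {
  int max_versions;     // "only retain most recent K values in a cell"
  int64_t max_age_sec;  // "keep values until they are older than K seconds"
};

// Versions of one cell, keyed by timestamp (seconds), oldest first.
using Versions = std::map<int64_t, std::string>;

// Drop versions that violate either rule. Illustrative only.
void ApplyGc(Versions* versions, const GcPolicy& policy, int64_t now_sec) {
  // Age rule: erase everything older than max_age_sec.
  versions->erase(versions->begin(),
                  versions->lower_bound(now_sec - policy.max_age_sec));
  // Count rule: keep only the newest max_versions entries.
  while (static_cast<int>(versions->size()) > policy.max_versions) {
    versions->erase(versions->begin());
  }
}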

Page 13: Lecture 7 – Bigtable

API

Create / delete tables and column families

// Open the table, write a new anchor, and delete an old anchor
Table *T = OpenOrDie("/bigtable/web/webtable");

RowMutation r1(T, "com.cnn.www");
r1.Set("anchor:www.c-span.org", "CNN");
r1.Delete("anchor:www.abc.com");

Operation op;
Apply(&op, &r1);
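The read side has a scanner-based counterpart in the Bigtable paper this lecture draws on; it is not shown on this slide, so take the exact names (Scanner, ScanStream, etc.) as coming from the paper rather than from these notes. Like the write example above, it relies on the Bigtable client library:

// Iterate over all "anchor" columns of one row, returning all versions.
Scanner scanner(T);
ScanStream *stream;
stream = scanner.FetchColumnFamily("anchor");
stream->SetReturnAllVersions();
scanner.Lookup("com.cnn.www");
for (; !stream->Done(); stream->Next()) {
  printf("%s %s %lld %s\n",
         scanner.RowName(), stream->ColumnName(),
         stream->MicroTimestamp(), stream->Value());
}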

Page 14: Lecture 7 – Bigtable

Locality Groups

Column families can be assigned to a locality group
Used to organize the underlying storage representation for performance:
Scans over one locality group are O(bytes_in_locality_group), not O(bytes_in_table)
Data in a locality group can be explicitly memory-mapped

Page 15: Lecture 7 – Bigtable

SSTable

File format used internally to store Bigtable data
Key-value map:
Persistent
Ordered
Immutable
Keys and values are strings

Page 16: Lecture 7 – Bigtable

SSTable Operations

Look up value for key
Iterate over all key/value pairs in a specified range

Layout: sequence of blocks (64 KB), plus a block index used to locate blocks
How do we find the right block? Binary search on the in-memory block index
Or, map the complete SSTable into memory (see the lookup sketch below)
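A rough sketch of the lookup path just described: binary search an in-memory block index (last key per block → block position), then scan the single candidate block. The types and names here are illustrative assumptions, not the real SSTable format.

#include <optional>
#include <map>
#include <string>
#include <vector>

// One 64 KB block's worth of sorted key/value pairs (toy model).
struct Block {
  std::vector<std::pair<std::string, std::string>> entries;  // sorted by key
};

struct SSTable {
  // In-memory block index: last key in each block -> index into blocks.
  // std::map keeps it sorted, so lower_bound acts as the binary search.
  std::map<std::string, size_t> index;
  std::vector<Block> blocks;

  std::optional<std::string> Lookup(const std::string& key) const {
    // First block whose last key is >= key; only that block can hold
    // the key, because blocks are sorted and cover disjoint key ranges.
    auto it = index.lower_bound(key);
    if (it == index.end()) return std::nullopt;
    for (const auto& kv : blocks[it->second].entries) {
      if (kv.first == key) return kv.second;
    }
    return std::nullopt;
  }
};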

Page 17: Lecture 7 – Bigtable

Chubby

Bigtable relies on a lock service called Chubby to:
Ensure there is at most one active master
Store the bootstrap location of Bigtable data
Finalize tablet server death
Store column family information
Store access control lists

Page 18: Lecture 7 – Bigtable

Chubby

Namespace that consists of directories and small files
Each directory or file can be used as a lock
A Chubby client maintains a session with the Chubby service
The session expires if the client is unable to renew its session lease within the expiration time
If the session expires, the client loses any locks and open handles
Atomic reads / writes

Page 19: Lecture 7 – Bigtable

Tablets

As the table grows, split it into tablets, each holding a contiguous row range (e.g. rows A - E, rows F - R, rows S - Z)

Page 20: Lecture 7 – Bigtable

Tablets

Large tables are broken into tablets at row boundaries
Tablet holds a contiguous range of rows
Aim for ~100 MB to 200 MB of data per tablet
Tablet server responsible for ~100 tablets
Fine-grained load balancing:
Migrate tablets away from an overloaded machine
Master makes load-balancing decisions

Page 21: Lecture 7 – Bigtable

Tablet Server

Master assigns tablets to tablet servers
Tablet server:
Handles read / write requests to its tablets
Splits tablets that have grown too large
Clients do not move data through the master

Page 22: Lecture 7 – Bigtable

Master Startup

Grab unique master lock in Chubby
Scan servers directory in Chubby to find live servers
Communicate with every live tablet server to discover which tablets are assigned
Scan METADATA table to learn the set of tablets
Track unassigned tablets

Page 23: Lecture 7 – Bigtable

Tablet Assignment

Master has list of unassigned tablets
When a tablet is unassigned, and a tablet server has room for it, the master sends a tablet load request to that tablet server

Page 24: Lecture 7 – Bigtable

Tablet Serving

Persistent state of a tablet is stored in GFS
Updates are committed to a log that stores redo records
Memtable: sorted in-memory buffer of recent commits
Older updates are stored in SSTables (see the write-path sketch below)
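A minimal sketch of the write path just described: append a redo record to the commit log, then apply the edit to the memtable. The Tablet class, its log format, and the key encoding are illustrative assumptions only.

#include <cstdint>
#include <fstream>
#include <map>
#include <string>

// Toy write path for one tablet: commit to a redo log in stable
// storage (GFS in the real system; a local file here), then apply
// the edit to the sorted in-memory memtable.
class Tablet {
 public:
  explicit Tablet(const std::string& log_path)
      : log_(log_path, std::ios::app) {}

  void Write(const std::string& row, const std::string& column,
             int64_t timestamp, const std::string& value) {
    // 1. Append a redo record so the edit survives a crash.
    log_ << row << '\t' << column << '\t' << timestamp << '\t'
         << value << '\n';
    log_.flush();
    // 2. Apply the edit to the memtable (kept sorted by key).
    memtable_[row + "/" + column + "/" + std::to_string(timestamp)] = value;
  }

 private:
  std::ofstream log_;                            // redo log
  std::map<std::string, std::string> memtable_;  // recent commits, sorted
};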

Page 25: Lecture 7 – Bigtable

What if the memtable gets too big?

Minor compaction:
Create a new memtable
Convert the old memtable to an SSTable and write it to GFS
Note: every minor compaction creates a new SSTable

Page 26: Lecture 7 – Bigtable

Compactions

Merging compaction:
Bound the number of SSTables by executing a merging compaction
Reads the contents of a few SSTables and the memtable, and writes out a new SSTable

Major compaction:
A merging compaction that rewrites all SSTables into exactly one SSTable
Produces an SSTable that contains no deletion info
Allows Bigtable to reclaim resources from deleted data
(see the merge sketch below)
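A toy sketch of a merging compaction: merge several sorted inputs with newer entries winning, and, for a major compaction, drop deletion entries so the output carries no deletion info. ToySSTable and MergeCompact are illustrative assumptions, not the real on-disk format.

#include <map>
#include <optional>
#include <string>
#include <vector>

// A toy "SSTable": a sorted map from key to value, where an empty
// optional marks a deletion entry.
using ToySSTable = std::map<std::string, std::optional<std::string>>;

// Merge several inputs (oldest first, newest last, so later inputs win)
// into one output table; if `major` is true, drop deletion entries.
ToySSTable MergeCompact(const std::vector<ToySSTable>& inputs, bool major) {
  ToySSTable out;
  for (const auto& table : inputs) {      // oldest ... newest
    for (const auto& [key, value] : table) {
      out[key] = value;                   // newer entries overwrite older
    }
  }
  if (major) {
    // The single output SSTable carries no deletion info, so storage
    // for deleted data can be reclaimed.
    for (auto it = out.begin(); it != out.end();) {
      if (!it->second.has_value()) it = out.erase(it);
      else ++it;
    }
  }
  return out;
}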

Page 27: Lecture 7 – Bigtable

System Structure

Diagram of a Bigtable cell:
Bigtable master: performs metadata ops + load balancing
Bigtable tablet servers: serve data
Lock service: holds metadata, handles master election
GFS: holds tablet data, logs
Cluster scheduling system: handles failover, monitoring
Bigtable clients use a client library (Open(), read/write, metadata ops); reads and writes go to the tablet servers, metadata ops go to the master

Page 28: Lecture 7 – Bigtable

Google: The Big Picture

Custom solutions for unique problems!
GFS: stores data reliably, but just raw files
BigTable: gives us a key/value map
Database-like, but doesn’t provide everything we need
Chubby: locking mechanism
SSTable: file format
MapReduce: lets us process data from BigTable (and other sources)

Page 29: Lecture 7 – Bigtable

Common Principles

One master, multiple helpers:
MapReduce: master coordinates work amongst map / reduce workers
Bigtable: master knows about location of tablet servers
GFS: master coordinates data across chunkservers

Issues with a single master:
What about master failure?
How do you avoid bottlenecks?

Page 30: Lecture 7 – Bigtable

Next Class

Chord: A Scalable P2P Lookup Service for Internet Applications
Distributed Hash Table

Guest Speaker: John McDowell, Microsoft