TRANSCRIPT
BigTable and Google File System
Presented by: Ayesha Fawad
10/07/2014
Overview

Google File System: Basics, Design, Chunks, Replicas, Clusters, Client
Google File System: Chunk server, Master server, Shadow master, Read request workflow, Write request workflow, Built-in functions, Limitations
BigTable: Introduction, What is BigTable?, Design, Rows and tablets (example), Columns and column families (example)
BigTable: Timestamps (example), Cells, Data structure, SSTables and logs, Tablet, Table
BigTable: Cluster, Chubby, How to find a row, Mutations, BigTable implementation, BigTable building blocks, Architecture
BigTable: Master server, Tablet server, Client library, In case of failure?, Recovery process, Compactions, Refinement
BigTable: Interactions between GFS and BigTable, API, Why use BigTable?, Why not any other database?, Application design, CAP
BigTable: Google services using BigTable, BigTable derivatives, Colossus, Comparison
Google App Engine: Introduction, GAE Datastore, Unsupported actions, Entities, Models, Queries, Indexes
Google App Engine: GQL, Transactions, Datastore software stack, GUI (Main), Datastore options, Competitors
Google App Engine: Hard limits, Free quotas, Cloud data storage options
Google File System
Presented by: Ayesha Fawad
10/07/2014
Basics
Originated in 2003.
GFS is designed for system-to-system interaction, not user-to-system interaction.
It runs on a network of inexpensive machines with the Linux operating system.
Design
GFS relies on distributed computing to provide users the infrastructure they need to create, access, and alter data.
Distributed computing: networking several computers together and taking advantage of their individual resources in a collective way. Each computer contributes some of its resources, such as memory, processing power, and hard drive space, to the overall network. This turns the entire network into one massive computer, with each individual machine acting as a processor and data storage device.
Design

Autonomic computing: a concept in which computers are able to diagnose problems and solve them in real time without the need for human intervention.
The challenge for the GFS development team was to design an autonomic monitoring system that could work across a huge network of computers.
Simplification: GFS offers basic commands like open, create, read, write, and close, plus some specialized commands like append and snapshot.
Design
Checkpoints can include application-level checksums.
Readers verify and process only the file region up to the last checkpoint, which is known to be in a defined state.
Checkpointing allows writers to restart incrementally and keeps readers from processing successfully written file data that is still incomplete from the application's point of view.
GFS relies on appends rather than overwrites.
Chunks

Files on GFS tend to be very large (in the multi-gigabyte range).
GFS handles this by breaking files up into chunks of 64 MB each, which works well for scans, streams, archives, and shared queues.
Each chunk has a unique 64-bit ID number called a chunk handle.
A uniform chunk size simplifies resource allocation: the master can check which computers are near capacity, check which are underused, and balance the workload by moving a chunk from one resource to another.
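A minimal Python sketch of the chunking idea described above (the 64 MB chunk size is from the slide; the random 64-bit handle generation is an assumption for illustration, not how GFS actually assigns handles):

import os

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, as described above

def split_into_chunks(path):
    # Yield (chunk_handle, data) pairs, one per 64 MB chunk of the file.
    with open(path, "rb") as f:
        while True:
            data = f.read(CHUNK_SIZE)
            if not data:
                break
            handle = int.from_bytes(os.urandom(8), "big")  # random 64-bit ID
            yield handle, data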
Replicas
Two categories:
Primary replica: the chunk that a chunk server sends to a client.
Secondary replica: serves as a backup on other chunk servers.
The master decides which chunks will act as primary or secondary.
Based on client changes to the data in a chunk, the master server informs chunk servers holding secondary replicas that they must copy the new chunk off the primary chunk server to stay current.
Design (architecture diagram)
Clusters
Google has organized GFS into simple networks of computers called clusters.
A cluster contains three kinds of entities:
1. Clients
2. Master server
3. Chunk servers
Client
Clients are any entity making a request.
GFS was developed by Google for its own use.
Clients can be other computers or computer applications.
Chunk server
Chunk servers are the workhorses: they store the 64 MB file chunks and send requested chunks directly to clients. The number of replicas is configurable.
Master server
The master server is the coordinator for a cluster. It:
maintains the operation log
keeps track of metadata describing chunks
handles chunk garbage collection
re-replicates chunks on chunk server failures
migrates chunks to balance load and disk space
It does not store the actual chunks.
Master server

Upon startup, the master server polls all the chunk servers; the chunk servers respond with information about the data they contain, their locations, and available space.
Shadow Master
Shadow master servers stay up to date by contacting the primary master server (operation log) and by polling chunk servers.
If anything goes wrong with the primary master, a shadow server can take over.
GFS ensures shadow master servers are stored on different machines (in case of hardware failure).
Shadow servers lag behind the primary master server by fractions of a second.
They provide limited services in parallel with the master; the services are limited to reads.
Shadow Master (diagram)
Read Request Workflow

1. Client sends a read request for a particular file to the master.
2. Master responds with the location of the primary replica, where the client can find that file.
3. Client contacts the chunk server directly.
4. Chunk server sends the replica data to the client.
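A hedged Python sketch of this read path; find_chunk and read_chunk are hypothetical method names standing in for the real GFS RPC interface:

def gfs_read(master, filename, chunk_index):
    # Steps 1-2: the master returns chunk locations only; it never
    # serves file data itself.
    handle, chunkservers = master.find_chunk(filename, chunk_index)
    # Steps 3-4: fetch the chunk bytes directly from a replica
    # (typically the closest one).
    return chunkservers[0].read_chunk(handle)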
Write Request Workflow

1. Client sends the request to the master server.
2. Master responds with the locations of the primary and secondary replicas.
3. Client sends the write data to all the replicas, closest one first, regardless of primary or secondary (pipelined).
4. Once the data is received by the replicas, the client instructs the primary replica to begin the write; the primary assigns consecutive serial numbers to each of the file changes (mutations).
5. After the primary applies the mutations to its own data, it sends the write request to all the secondary replicas.
6. Secondary replicas complete the write and report back to the primary replica.
7. Primary sends confirmation to the client; if the write fails at a replica, the master will identify the affected replica as garbage.
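The same seven-step flow as a hedged Python sketch; all method names here are hypothetical stand-ins for the RPCs the workflow describes:

def gfs_write(master, filename, chunk_index, data):
    # Steps 1-2: master returns the primary and secondary replica locations.
    primary, secondaries = master.find_lease_holder(filename, chunk_index)
    # Step 3: push the data to all replicas, pipelined, closest first.
    for server in [primary] + secondaries:
        server.push_data(data)
    # Steps 4-6: the primary picks a serial order for the mutation and
    # forwards the write to every secondary.
    ok = primary.commit_write(secondaries)
    # Step 7: the primary confirms; on failure the client retries and the
    # master eventually garbage-collects the inconsistent replica.
    return ok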
Mutations (diagram)
Mutations
Consistent: a file region is consistent if all clients always see the same data, regardless of which replica is read.
Defined: a region is defined after a file data mutation if it is consistent and clients see what the mutation wrote in its entirety.
Built-in Functions
Master and chunk replication
Streamlined recovery process
Rebalancing
Stale replica detection
Garbage removal (configurable)
Checksumming:
each 64 MB chunk is broken into blocks of 64 KB
each block has its own 32-bit checksum
the master monitors and compares checksums
prevents data corruption
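As a rough illustration of per-block checksumming, a small Python sketch using CRC32 as the 32-bit checksum (the exact checksum function is an assumption; the block size is from the slide):

import zlib

BLOCK_SIZE = 64 * 1024  # each 64 MB chunk is checksummed in 64 KB blocks

def block_checksums(chunk_data):
    # One 32-bit checksum per 64 KB block of the chunk.
    return [zlib.crc32(chunk_data[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk_data), BLOCK_SIZE)]

def verify(chunk_data, expected):
    # Recompute and compare to detect corruption before serving data.
    return block_checksums(chunk_data) == expected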
Limitations
Suited for batch-oriented applications that prefer high sustained bandwidth over low latency, e.g. web crawling.
A single point of failure is unacceptable for latency-sensitive applications such as Gmail or YouTube.
The single master is a scanning bottleneck.
Consistency problems.
BigTable
Presented by: Ayesha Fawad
10/07/2014
Introduction
Created by Google in 2005.
Maintained as a proprietary, in-house technology.
Some technical details were disclosed at the USENIX OSDI symposium in 2006.
It has been used by Google services since 2005.
What is BigTable?
It is a distributed storage system:
can be spread across multiple nodes
appears to be one large table
not a database design; it is a storage design model
What is BigTable?
Map
BigTable is a collection of (key, value) pairs, where the key identifies a row and the value is the set of columns.
What is BigTable?
Sparse
Different rows in the table may use different columns, with many of the columns empty for a particular row.
What is BigTable?
Column-oriented
It can operate on a set of attributes (columns) for all tuples.
Each column is stored contiguously on disk, which fits more records into a disk block and reduces disk I/O.
The underlying assumption is that in most cases not all columns are needed for data access.
In an RDBMS implementation, each "row" is usually stored contiguously on disk.
Example: webpages (diagram)
Example
webpages {
  "com.cnn.www" => {
    "contents" => "html….",
    "anchor" => {
      "cnnsi.com" => "CNN",
      "my.look.ca" => "CNN.com"
    }
  }
}
What is BigTable?
It is semi-structured:
a map (key-value pairs)
different rows in the same table can have different columns
the key is a string, so it is not required to be sequential, unlike an array index
What is BigTable?
Lexicographically sorted
Data is sorted by keys.
Structure keys in a way that sorting brings related data together, e.g.:
edu.villanova.cs
edu.villanova.law
edu.villanova.www
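A small Python illustration of why reversed-domain row keys cluster related pages together (the helper name row_key is invented for this example):

def row_key(hostname):
    # Reverse the domain so pages from the same site sort adjacently.
    return ".".join(reversed(hostname.split(".")))

hosts = ["cs.villanova.edu", "law.villanova.edu", "www.villanova.edu"]
print(sorted(row_key(h) for h in hosts))
# ['edu.villanova.cs', 'edu.villanova.law', 'edu.villanova.www']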
What is BigTable?
Persistent
When a certain amount of data is collected in memory, BigTable makes it persistent by storing it in the Google File System.
What is BigTable?
Multi-dimensional
Data is indexed by row key, column name, and timestamp: like a table with many rows (keys) and many columns, each cell carrying timestamps, acting as a map. For example:
URLs: row keys
Metadata of web pages: column names
Contents of web pages: columns
Timestamps: when the page was fetched
Design
(row: string, column: string, time: int64) -> string

Data is indexed by row key, column name, and timestamp.
Each value in the map is an uninterpreted array of bytes.
BigTable offers clients some control over data layout and format: a careful choice of schema can control the locality of data, and the client decides how to serialize the data.
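A toy in-memory model of this three-part map, just to make the indexing concrete (nested Python dicts standing in for BigTable's real storage):

table = {}

def put(row, column, timestamp, value):
    table.setdefault(row, {}).setdefault(column, {})[timestamp] = value

def get_latest(row, column):
    versions = table[row][column]
    return versions[max(versions)]  # most recent timestamp wins

put("com.cnn.www", "anchor:cnnsi.com", 1, "CNN")
put("com.cnn.www", "anchor:cnnsi.com", 2, "CNN Sports Illustrated")
print(get_latest("com.cnn.www", "anchor:cnnsi.com"))  # newest version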
Row
A row key can be up to 64 KB.
Row ranges for a table are dynamically partitioned; each row range is called a tablet, the unit of distribution and load balancing.
Clients can select row keys to get good locality of data accesses: reads of short row ranges are efficient and typically require communication with only a small number of machines.
Row
Every read or write of data under a single row key is atomic; there is no guarantee across rows (regardless of which columns are being read or written in the row).
BigTable supports single-row transactions, which perform atomic read-modify-write sequences on data stored under a single row key.
It does not support general transactions across row keys.
Row with Example (diagram)
Column and Column Families
Column keys are grouped together into sets called column families, named as family:qualifier.
Data stored in the same column family usually has the same data type.
BigTable indexes and compresses data within a column family.
The number of distinct column families is kept small, e.g. the language used on a web page.
Column with Example (diagram)

Column Families with Example (diagram: family:qualifier)
Timestamp
Timestamps are 64-bit integers.
Multiple timestamps exist in each cell, recording various versions of the data (created, modified).
The most recent version is accessible first; clients can choose garbage-collection options or request specific timestamps.
Timestamps are assigned by BigTable (in microseconds) or by the client application.
Timestamp with Example (diagram)

Cells (diagram)
Mutations
First, mutations are logged in a log file, which is stored in GFS.
Then the mutations are applied to an in-memory version called the memtable.
Mutations (diagram)
Data Structure
GFS supports two data structures: logs and sorted string tables (SSTables).
The data structure is defined using protocol buffers, a data description language.
Protocol buffers avoid the inefficiency of converting data from one format to another, e.g. between data formats in Java and .NET.
SSTables and Logs
In memory, BigTable provides mutable key-value storage; once the log or in-memory table reaches a certain limit, the changes are made persistent (and immutable) in GFS.
All transactions in memory are saved in GFS as segments called logs.
After the changes reach a certain size (the amount you want to keep in memory), they are cleaned up: the data is compacted into a series of SSTables, which are then sent out as chunks to GFS.
SSTables and Logs
An SSTable provides a persistent, immutable, ordered map from keys to values.
A sequence of blocks forms an SSTable, and each SSTable stores one block index: when the SSTable is opened, the index is loaded into memory and gives each block's location.
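A toy Python sketch of the block-index idea (the block size and structure are simplified; real SSTables use 64 KB blocks and a richer on-disk format):

import bisect

class SSTable:
    # An immutable, sorted key-value map split into blocks plus an index.
    def __init__(self, sorted_items, block_size=4):
        self.blocks = [sorted_items[i:i + block_size]
                       for i in range(0, len(sorted_items), block_size)]
        # The index holds the first key of every block; it is what gets
        # loaded into memory when the SSTable is opened.
        self.index = [block[0][0] for block in self.blocks]

    def get(self, key):
        i = bisect.bisect_right(self.index, key) - 1  # candidate block
        if i < 0:
            return None
        for k, v in self.blocks[i]:  # read only that one block
            if k == key:
                return v
        return None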
SSTables and Logs (diagram: an SSTable as a series of 64 KB blocks followed by an index of block ranges)
Tablet
Tablets are a range of rows of a table; a tablet contains multiple SSTables.
Tablets are assigned to tablet servers.
(diagram: a tablet spanning rows aardvark to apple, built from SSTables; each SSTable is a sequence of 64 KB blocks plus an index)
(diagram: adjacent tablets, aardvark to apple and apple to boat, sharing SSTables)
Multiple tablets form a table. SSTables can overlap, but tablets do not.
Cluster
A BigTable cluster stores tables, and each table consists of tablets.
Initially a table contains one tablet; as the table grows, multiple tablets are created.
Tablets are assigned to tablet servers: each tablet exists at only one server, a server contains multiple tablets, and each tablet is 100-200 MB.
How to find a Row? (diagram)
How to find a Row?
The client reads the location of the root tablet from a Chubby file.
The root tablet contains the locations of the METADATA tablets; the root tablet never splits.
The METADATA tablets contain the locations of the user tablets.
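A hedged sketch of this three-level lookup; the Chubby path and the lookup methods are hypothetical names for illustration:

def locate_tablet(chubby, table_name, row_key):
    # Level 0: a Chubby file stores the root tablet's location.
    root = chubby.read("/ls/bigtable/root-tablet-location")
    # Level 1: the root tablet (which never splits) maps keys to
    # METADATA tablets.
    meta = root.lookup(table_name, row_key)
    # Level 2: the METADATA tablet maps keys to the user tablet's server.
    return meta.lookup(table_name, row_key)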
BigTable Architecture (diagram)
Logical view: tablets are served by tablet servers, coordinated by the master and the Chubby server.
Physical layout: each tablet's SSTables are stored, with replicas, on GFS chunkservers.
BigTable Implementation
BigTable has three components:
Master server
Tablet servers: dynamically added or removed to handle the workload
Chubby client library: links the master server, the many tablet servers, and all the clients
BigTable Implementation (diagram)
BigTable Building Blocks
Google File System: stores persistent state.
Scheduler: schedules the jobs involved in serving BigTable.
Lock service: master election and location bootstrapping.
MapReduce: used to read and write BigTable data.
Chubby
A distributed lock service.
Its namespace consists of directories and files, which are used as locks; it provides mutual exclusion.
Highly available: one elected master and five active replicas.
Uses the Paxos algorithm to maintain consistency among the replicas.
Provides atomic reads and writes.
Chubby
Responsible for:
ensuring there is only one active master
storing the bootstrap location of BigTable data
discovering tablet servers
storing BigTable schema information
storing access control lists
Chubby Client Library
Responsible for providing consistent caching of Chubby files.
Each Chubby client maintains a session with the Chubby service.
Every client session has a lease expiration time; if the client is unable to renew its session lease within that time, the session expires and all locks and open handles are lost.
Master Server

Startup:
acquires a unique master lock in Chubby
discovers live servers in Chubby
scans the METADATA table to learn the set of tablets and discover tablet assignments

Responsible for:
adding or deleting tablet servers based on demand
assigning tablets to tablet servers
monitoring and balancing tablet server load
garbage collection of files in GFS
checking each tablet server for the status of its lock

In case of failure: if its session with Chubby is lost, the master kills itself, and an election can take place to find a new master.
Tablet Server

Startup: acquires an exclusive lock on a uniquely named file in a specific Chubby directory.

Responsible for:
managing tablets
splitting tablets that grow beyond a certain size
serving reads and writes; clients communicate directly with the tablet server

In case of failure: if the tablet server loses its exclusive lock, it stops serving; if its Chubby file still exists, it attempts to reacquire the lock; if the file no longer exists, the tablet server kills itself, restarts, and joins the pool of unassigned tablet servers.
Tablet Server Failure (diagram: via the Chubby server, the master detects a failed tablet server; its tablets stop being served, while their SSTables and replicas remain on the GFS chunkservers)
Tablet Server Failure, continued (diagram):
other tablet servers are drafted to serve the "abandoned" tablets
a backup copy of a tablet is made primary
the master sends a message to the tablet server
an extra replica of the tablet is created automatically by GFS
Tablet Server Recovery Process
The tablet server reads the tablet's metadata, which contains the list of SSTables that comprise the tablet and a set of redo points.
Redo points are pointers into the commit logs.
The server applies the redo points to reconstruct the memtable from the updates in the commit log.
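A minimal sketch of the replay step, assuming hypothetical metadata and log objects and treating the redo point as a byte offset into the commit log:

def recover_tablet(metadata, gfs):
    # SSTables listed in METADATA are already persistent in GFS; only
    # mutations logged after the last redo point must be re-applied.
    memtable = {}
    log = gfs.open(metadata.commit_log)
    log.seek(metadata.redo_point)
    for mutation in log:
        memtable[mutation.key] = mutation.value
    return memtable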
Tablet Server Recovery Process
Read and write requests arriving at the tablet server are checked to make sure they are well formed.
A permission file in Chubby is checked to ensure authorization.
For a write operation, all mutations are written to the commit log, and finally a group commit is used.
For a read operation, the read is executed on a merged view of the sequence of SSTables and the memtable.
Compactions
When the in-memory table is full:
Minor compaction: converts the memtable into an SSTable, reducing memory usage and log traffic on restart.
Merging compaction: reduces the number of SSTables; a good place to apply a "keep only N versions" policy.
Major compaction: a merging compaction that results in only one SSTable, containing no deletion records, only live data.
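A toy Python sketch of the first two compaction kinds (plain dicts stand in for memtables and SSTables; the real system writes to GFS):

def minor_compaction(memtable, gfs, path):
    # Freeze the memtable and write it out as a new immutable SSTable.
    sstable = sorted(memtable.items())  # SSTables are sorted by key
    gfs.write(path, sstable)
    memtable.clear()  # frees memory and shortens the log to replay on restart

def merging_compaction(sstables, memtable):
    # Fold a few SSTables plus the memtable into one new SSTable.
    merged = {}
    for table in sstables:       # apply oldest first...
        merged.update(dict(table))
    merged.update(memtable)      # ...so newer values win
    return sorted(merged.items())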
Refinement
Locality groups: clients can group multiple column families together into a locality group.
Compression: applied to each SSTable block separately, using Bentley and McIlroy's scheme plus a fast compression algorithm.
Caching for read performance: uses a scan cache and a block cache.
Bloom filters: reduce the number of disk accesses.
Refinement
Commit-log implementation: rather than one log per tablet, BigTable keeps one log per tablet server.
Exploiting SSTable immutability:
no need to synchronize accesses to the file system when reading SSTables
concurrency control over rows is efficient
deletes work like garbage collection, removing obsolete SSTables
enables quick tablet splits: child tablets share the parent's SSTables
Interactions between GFS and BigTable
The persistent state of a collection of rows (a tablet) is stored in GFS.
Writes: incoming writes are recorded in memory in memtables, where they are sorted and buffered; after they reach a certain size, they are stored as a sequence of SSTables (persistent storage, in GFS).
Interactions between GFS and BigTable
Reads: the requested information can be in memtables or SSTables; since all tables are sorted by key, it is easy to find the most recent version and avoid stale data.
Recovery: to recover a tablet, the tablet server reconstructs the memtable by reading the tablet's metadata and redo points.
API
BigTable APIs provide functions for creating and deleting tables and column families, and for changing cluster, table, and column family metadata such as access control rights.
Client applications can:
write or delete values
look up values from individual rows
iterate over a subset of the data
The API also supports single-row transactions, allows cells to be used as integer counters, and can execute client-supplied scripts in the address space of the servers.
Why use BigTable?
The scale is large:
more than 100 TB of satellite image data
millions of users and thousands of queries per second, with latency to manage
billions of URLs, billions and billions of pages, and many versions of each page
Why not any other Database?
An in-house solution is always cheaper.
The scale is too large for most databases, and the cost would be too high.
The same system can be used across different projects, which again lowers the cost.
With relational databases we expect ACID transactions, but it is impossible to guarantee consistency while also providing high availability and network partition tolerance.
CAP (diagrams)
Application Design
Reminders:
The timestamp is an int64, so the application needs to plan for multiple clients updating the same cell at the same time.
At the application level, you need to know which data structures GFS supports, to avoid format conversions.
Google Services using BigTable
Used as a database by:
Google Analytics
Google Earth
Google App Engine Datastore
Google Personalized Search
BigTable Derivatives
Apache HBase, a database built to run on top of the Hadoop Distributed File System (HDFS).
Cassandra, which originated at Facebook Inc.
Hypertable, an open-source alternative to HBase.
Colossus
GFS is best suited for batch operations; Colossus is a revamped file system suited for real-time operations.
Colossus underpins a new search infrastructure called "Caffeine", which enables Google to update its search index in real time.
In Colossus, many masters operate at the same time.
A number of changes have already been made to open-source Hadoop to make it look more like Colossus.
Comparison (a series of comparison slides; the tables were not captured in this transcript)
Google App Engine
Presented by: Ayesha Fawad
10/07/2014
Introduction
Also known as GAE or App Engine.
Preview started in April 2008; came out of preview in September 2011.
It is a PaaS (platform as a service) that allows developing and hosting web applications in Google-managed data centers.
The default choice for storage is a NoSQL solution.
Introduction

Language independent, with plans to support more languages.
Automatic scaling: automatically allocates more resources to handle additional demand.
Free up to a certain level of resources (storage, bandwidth, or instance hours) required by the application.
Does not allow joins.
Introduction
Applications are sandboxed across multiple servers: a security mechanism for running untested code, with restricted resources for the safety of the host system.
Reliable: a Service Level Agreement of 99.5% uptime; can sustain multiple data center failures.
GAE Data store
Built on top of BigTable.
Follows a hierarchical structure.
A schema-less object datastore, designed to scale for high performance.
Queries are served from pre-built indexes.
Does not require entities of the same kind to have the same set of properties.
Does Not Support
Join operations
Inequality filtering on multiple properties
Filtering data based on the results of a subquery
Entities
Also known as objects in the App Engine Datastore.
Each entity is uniquely identified by its own key.
An entity's path begins with a root entity, proceeding from parent to child.
Every entity belongs to an entity group.
Models
Model is the superclass for data model definitions, defined in google.appengine.ext.db.
Entities of a given kind are represented by instances of the corresponding model class.
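For example, with the Python db API, a model definition looks like this (the Greeting kind and its properties are invented for illustration):

from google.appengine.ext import db

class Greeting(db.Model):
    author = db.StringProperty()
    content = db.StringProperty(multiline=True)
    date = db.DateTimeProperty(auto_now_add=True)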
Queries
A Datastore query retrieves entities from the datastore whose entity values or keys meet a specified set of conditions.
The Datastore API provides a Query class for constructing queries and a PreparedQuery class for fetching and returning entities from the datastore.
Filters and sort orders can be applied to queries.
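Continuing the hypothetical Greeting model from above, a typical query with a filter and a sort order:

q = Greeting.all()  # a Query over all entities of kind Greeting
q.filter("author =", "alice").order("-date")
recent = q.fetch(limit=10)  # executes as a PreparedQuery, returns entities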
Indexes
An index is defined on a list of properties of an entity kind.
An index table contains a column for every property specified in the index's definition.
The datastore identifies the index that corresponds to the query's kind, filter properties, filter operators, and sort orders.
App Engine predefines an index on each property of each kind; these indexes are sufficient for simple queries.
GQL
GQL is a SQL-like language for retrieving entities or keys from the App Engine Datastore.
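The same query as above expressed in GQL (still using the hypothetical Greeting kind):

from google.appengine.ext import db

q = db.GqlQuery("SELECT * FROM Greeting "
                "WHERE author = :1 ORDER BY date DESC", "alice")
recent = q.fetch(10)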
Transactions
A transaction is a set of Datastore operations on one or more entities.
Transactions are atomic: they are never partially applied.
They provide isolation and consistency.
Required when users may attempt to create or update an entity with the same string ID concurrently.
It is also possible to queue transactions.
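A sketch of the create-or-update case using run_in_transaction; the Greeting kind is the hypothetical model from earlier:

from google.appengine.ext import db

def create_if_absent(string_id):
    # Lookup and creation happen atomically inside the transaction, so two
    # clients racing on the same string ID cannot both create the entity.
    entity = Greeting.get_by_key_name(string_id)
    if entity is None:
        entity = Greeting(key_name=string_id)
        entity.put()
    return entity

greeting = db.run_in_transaction(create_if_absent, "daily-note")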
Data store Software Stack

App Engine Datastore: schema-less storage, advanced query engine
Megastore: multi-row transactions, simple indexes and queries, strict schema
BigTable: distributed key/value store
Next-generation distributed file system underneath
GUI
https://appengine.google.com
Everything done through the console can also be done through the command line (appcfg).
GUI

Sections: Main, Data, Administration, Billing
GUI (Main)

Dashboard: all metrics related to your application, including versions, resources and usage, and more.
Instances: total number of instances, availability (e.g. dynamic), average latency, average memory, and more.
Logs: detailed information that helps in resolving issues, and more.
Versions: number of versions, the default version, deployment information, deleting a specific version, and more.
Backends: like a worker role, a piece of business logic that does not have a user interface.
Cron Jobs: time-based jobs, which can be defined in an XML or YAML file.
Task Queues: multiple task queues can be created (the first one is the default); defined in an XML or YAML file.
Quota Details: detailed metrics of the resources being used (e.g. storage, memcache, mail), the daily quota, and rate details of what the client is billed for.
Data store Options
High Replication:
uses the Paxos algorithm
multi-master reads and writes
provides the highest level of availability (99.999% SLA)
certain queries are eventually consistent
some latency due to multi-master writing
reads come from the fastest (local) source and are transactional
Data store Options
Master/Slave:
offers strong consistency over availability for all reads and queries
data is written to a single master data center, then replicated asynchronously to the other (slave) data centers
99.9% SLA
reads come from the master only
Competitors
App Engine offers better infrastructure for hosting applications in terms of administration and scalability.
Other hosting services offer more flexibility for applications in terms of languages and configuration.
Hard Limits (table not captured in the transcript)

Free Quotas (table not captured in the transcript)

Cloud Data Storage Options (table not captured in the transcript)
References

BigTable:
http://en.wikipedia.org/wiki/BigTable
https://www.cs.rutgers.edu/~pxk/417/notes/content/bigtable.html
http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/bigtable-osdi06.pdf

Google File System:
http://en.wikipedia.org/wiki/Google_File_System
http://static.googleusercontent.com/media/research.google.com/en/us/archive/gfs-sosp2003.pdf