TRANSCRIPT
BigTable and Google File System
Presented by: Ayesha Fawad
10/07/2014
Overview

Google File System: Basics, Design, Chunks, Replicas, Clusters, Client
Google File System: Chunk server, Master server, Shadow master, Read request workflow, Write request workflow, Built-in functions, Limitations
BigTable: Introduction, What is BigTable?, Design, Rows and tablets (example), Columns and column families (example)
BigTable: Timestamps (example), Cells, Data structure, SSTables and logs, Tablet, Table
BigTable: Cluster, Chubby, How to find a row, Mutations, BigTable implementation, BigTable building blocks, Architecture
BigTable: Master server, Tablet server, Client library, In case of failure?, Recovery process, Compactions, Refinement
BigTable: Interactions between GFS and BigTable, API, Why use BigTable?, Why not any other database?, Application design, CAP
BigTable: Google services using BigTable, BigTable derivatives, Colossus, Comparison
Google App Engine: Introduction, GAE Datastore, Unsupported actions, Entities, Models, Queries, Indexes
Google App Engine: GQL, Transactions, Datastore software stack, GUI (Main), Datastore options, Competitors
Google App Engine: Hard limits, Free quotas, Cloud data storage options
Google File System
Presented by: Ayesha Fawad
10/07/2014
Basics
Originated in 2003.
GFS is designed for system-to-system interaction, not user-to-system interaction.
It runs on a network of inexpensive machines with the Linux operating system.
Design
GFS relies on distributed computing to provide users the infrastructure they need to create, access, and alter data.
Distributed computing: networking several computers together and taking advantage of their individual resources in a collective way. Each computer contributes some of its resources, such as memory, processing power, and hard drive space, to the overall network. This turns the entire network into one massive computer, with each individual machine acting as a processor and data storage device.
Design

Autonomic computing: a concept in which computers are able to diagnose problems and solve them in real time without the need for human intervention.
The challenge for the GFS development team was to design an autonomic monitoring system that could work across a huge network of computers.
Simplification: GFS offers basic commands like open, create, read, write, and close, plus some specialized commands like append and snapshot.
Design
Checkpoints can include application-level checksums.
Readers verify and process only the file region up to the last checkpoint, which is known to be in a defined state.
Checkpointing allows writers to restart incrementally and keeps readers from processing successfully written file data that is still incomplete from the application's point of view.
GFS relies on appends rather than overwrites.
Chunks

Files on GFS tend to be very large (in the multi-gigabyte range).
GFS handles this by breaking files up into chunks of 64 MB each, which works well for scans, streams, archives, and shared queues.
Each chunk has a unique 64-bit ID number called a chunk handle.
A uniform chunk size simplifies resource allocation: the master can check which computers are near capacity, check which are underused, and balance the workload by moving a chunk from one resource to another.
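A minimal Python sketch of the chunking idea described above (the 64 MB chunk size is from the slide; the random 64-bit handle generation is an assumption for illustration, not how GFS actually assigns handles):

import os

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, as described above

def split_into_chunks(path):
    # Yield (chunk_handle, data) pairs, one per 64 MB chunk of the file.
    with open(path, "rb") as f:
        while True:
            data = f.read(CHUNK_SIZE)
            if not data:
                break
            handle = int.from_bytes(os.urandom(8), "big")  # random 64-bit ID
            yield handle, data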
Replicas
Two categories:
Primary replica: the chunk that a chunk server sends to a client.
Secondary replica: serves as a backup on other chunk servers.
The master decides which chunks will act as primary or secondary.
Based on client changes to the data in a chunk, the master server informs chunk servers holding secondary replicas that they must copy the new chunk off the primary chunk server to stay current.
Design (architecture diagram)
Clusters
Google has organized GFS into simple networks of computers called clusters.
A cluster contains three kinds of entities:
1. Clients
2. Master server
3. Chunk servers
Client
Clients are any entity making a request.
GFS was developed by Google for its own use.
Clients can be other computers or computer applications.
Chunk server
Chunk servers are the workhorses: they store the 64 MB file chunks and send requested chunks directly to clients. The number of replicas is configurable.
Master server
The master server is the coordinator for a cluster. It:
maintains the operation log
keeps track of metadata describing chunks
handles chunk garbage collection
re-replicates chunks on chunk server failures
migrates chunks to balance load and disk space
It does not store the actual chunks.
Master server

Upon startup, the master server polls all the chunk servers; the chunk servers respond with information about the data they contain, their locations, and available space.
Shadow Master
Shadow master servers stay up to date by contacting the primary master server (operation log) and by polling chunk servers.
If anything goes wrong with the primary master, a shadow server can take over.
GFS ensures shadow master servers are stored on different machines (in case of hardware failure).
Shadow servers lag behind the primary master server by fractions of a second.
They provide limited services in parallel with the master; the services are limited to reads.
Shadow Master (diagram)
Read Request Workflow

1. Client sends a read request for a particular file to the master.
2. Master responds with the location of the primary replica, where the client can find that file.
3. Client contacts the chunk server directly.
4. Chunk server sends the replica data to the client.
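A hedged Python sketch of this read path; find_chunk and read_chunk are hypothetical method names standing in for the real GFS RPC interface:

def gfs_read(master, filename, chunk_index):
    # Steps 1-2: the master returns chunk locations only; it never
    # serves file data itself.
    handle, chunkservers = master.find_chunk(filename, chunk_index)
    # Steps 3-4: fetch the chunk bytes directly from a replica
    # (typically the closest one).
    return chunkservers[0].read_chunk(handle)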
Write Request Workflow

1. Client sends the request to the master server.
2. Master responds with the locations of the primary and secondary replicas.
3. Client sends the write data to all the replicas, closest one first, regardless of primary or secondary (pipelined).
4. Once the data is received by the replicas, the client instructs the primary replica to begin the write; the primary assigns consecutive serial numbers to each of the file changes (mutations).
5. After the primary applies the mutations to its own data, it sends the write request to all the secondary replicas.
6. Secondary replicas complete the write and report back to the primary replica.
7. Primary sends confirmation to the client; if the write fails at a replica, the master will identify the affected replica as garbage.
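The same seven-step flow as a hedged Python sketch; all method names here are hypothetical stand-ins for the RPCs the workflow describes:

def gfs_write(master, filename, chunk_index, data):
    # Steps 1-2: master returns the primary and secondary replica locations.
    primary, secondaries = master.find_lease_holder(filename, chunk_index)
    # Step 3: push the data to all replicas, pipelined, closest first.
    for server in [primary] + secondaries:
        server.push_data(data)
    # Steps 4-6: the primary picks a serial order for the mutation and
    # forwards the write to every secondary.
    ok = primary.commit_write(secondaries)
    # Step 7: the primary confirms; on failure the client retries and the
    # master eventually garbage-collects the inconsistent replica.
    return ok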
Mutations (diagram)
Mutations
Consistent: a file region is consistent if all clients always see the same data, regardless of which replica is read.
Defined: a region is defined after a file data mutation if it is consistent and clients see what the mutation wrote in its entirety.
Built-in Functions
Master and chunk replication
Streamlined recovery process
Rebalancing
Stale replica detection
Garbage removal (configurable)
Checksumming:
each 64 MB chunk is broken into blocks of 64 KB
each block has its own 32-bit checksum
the master monitors and compares checksums
prevents data corruption
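As a rough illustration of per-block checksumming, a small Python sketch using CRC32 as the 32-bit checksum (the exact checksum function is an assumption; the block size is from the slide):

import zlib

BLOCK_SIZE = 64 * 1024  # each 64 MB chunk is checksummed in 64 KB blocks

def block_checksums(chunk_data):
    # One 32-bit checksum per 64 KB block of the chunk.
    return [zlib.crc32(chunk_data[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk_data), BLOCK_SIZE)]

def verify(chunk_data, expected):
    # Recompute and compare to detect corruption before serving data.
    return block_checksums(chunk_data) == expected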
Limitations
Suited for batch-oriented applications that prefer high sustained bandwidth over low latency, e.g. web crawling.
A single point of failure is unacceptable for latency-sensitive applications such as Gmail or YouTube.
The single master is a scanning bottleneck.
Consistency problems.
BigTable
Presented by: Ayesha Fawad
10/07/2014
Introduction
Created by Google in 2005.
Maintained as a proprietary, in-house technology.
Some technical details were disclosed at the USENIX OSDI symposium in 2006.
It has been used by Google services since 2005.
What is BigTable?
It is a distributed storage system:
can be spread across multiple nodes
appears to be one large table
not a database design; it is a storage design model
What is BigTable?
Map
BigTable is a collection of (key, value) pairs, where the key identifies a row and the value is the set of columns.
What is BigTable?
Sparse
Different rows in the table may use different columns, with many of the columns empty for a particular row.
What is BigTable?
Column-oriented
It can operate on a set of attributes (columns) for all tuples.
Each column is stored contiguously on disk, which fits more records into a disk block and reduces disk I/O.
The underlying assumption is that in most cases not all columns are needed for data access.
In an RDBMS implementation, each "row" is usually stored contiguously on disk.
Example: webpages (diagram)
Example
webpages {
  "com.cnn.www" => {
    "contents" => "html….",
    "anchor" => {
      "cnnsi.com" => "CNN",
      "my.look.ca" => "CNN.com"
    }
  }
}
What is BigTable?
It is semi-structured:
a map (key-value pairs)
different rows in the same table can have different columns
the key is a string, so it is not required to be sequential, unlike an array index
What is BigTable?
Lexicographically sorted
Data is sorted by keys.
Structure keys in a way that sorting brings related data together, e.g.:
edu.villanova.cs
edu.villanova.law
edu.villanova.www
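A small Python illustration of why reversed-domain row keys cluster related pages together (the helper name row_key is invented for this example):

def row_key(hostname):
    # Reverse the domain so pages from the same site sort adjacently.
    return ".".join(reversed(hostname.split(".")))

hosts = ["cs.villanova.edu", "law.villanova.edu", "www.villanova.edu"]
print(sorted(row_key(h) for h in hosts))
# ['edu.villanova.cs', 'edu.villanova.law', 'edu.villanova.www']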
What is BigTable?
Persistent
When a certain amount of data is collected in memory, BigTable makes it persistent by storing it in the Google File System.
What is BigTable?
Multi-dimensional
Data is indexed by row key, column name, and timestamp: like a table with many rows (keys) and many columns, each cell carrying timestamps, acting as a map. For example:
URLs: row keys
Metadata of web pages: column names
Contents of web pages: columns
Timestamps: when the page was fetched
Design
(row: string, column: string, time: int64) -> string

Data is indexed by row key, column name, and timestamp.
Each value in the map is an uninterpreted array of bytes.
BigTable offers clients some control over data layout and format: a careful choice of schema can control the locality of data, and the client decides how to serialize the data.
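A toy in-memory model of this three-part map, just to make the indexing concrete (nested Python dicts standing in for BigTable's real storage):

table = {}

def put(row, column, timestamp, value):
    table.setdefault(row, {}).setdefault(column, {})[timestamp] = value

def get_latest(row, column):
    versions = table[row][column]
    return versions[max(versions)]  # most recent timestamp wins

put("com.cnn.www", "anchor:cnnsi.com", 1, "CNN")
put("com.cnn.www", "anchor:cnnsi.com", 2, "CNN Sports Illustrated")
print(get_latest("com.cnn.www", "anchor:cnnsi.com"))  # newest version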
Row
A row key can be up to 64 KB.
Row ranges for a table are dynamically partitioned; each row range is called a tablet, the unit of distribution and load balancing.
Clients can select row keys to get good locality of data accesses: reads of short row ranges are efficient and typically require communication with only a small number of machines.
Row
Every read or write of data under a single row key is atomic; there is no guarantee across rows (regardless of which columns are being read or written in the row).
BigTable supports single-row transactions, which perform atomic read-modify-write sequences on data stored under a single row key.
It does not support general transactions across row keys.
Row with Example (diagram)
Column and Column Families
Column keys are grouped together into sets called column families, named as family:qualifier.
Data stored in the same column family usually has the same data type.
BigTable indexes and compresses data within a column family.
The number of distinct column families is kept small, e.g. the language used on a web page.
Column with Example (diagram)

Column Families with Example (diagram: family:qualifier)
Timestamp
Timestamps are 64-bit integers.
Multiple timestamps exist in each cell, recording various versions of the data (created, modified).
The most recent version is accessible first; clients can choose garbage-collection options or request specific timestamps.
Timestamps are assigned by BigTable (in microseconds) or by the client application.
Timestamp with Example (diagram)

Cells (diagram)
Mutations
First, mutations are logged in a log file, which is stored in GFS.
Then the mutations are applied to an in-memory version called the memtable.
Mutations (diagram)
Data Structure
GFS supports two data structures: logs and sorted string tables (SSTables).
The data structure is defined using protocol buffers, a data description language.
Protocol buffers avoid the inefficiency of converting data from one format to another, e.g. between data formats in Java and .NET.
SSTables and Logs
In memory, BigTable provides mutable key-value storage; once the log or in-memory table reaches a certain limit, the changes are made persistent (and immutable) in GFS.
All transactions in memory are saved in GFS as segments called logs.
After the changes reach a certain size (the amount you want to keep in memory), they are cleaned up: the data is compacted into a series of SSTables, which are then sent out as chunks to GFS.
SSTables and Logs
An SSTable provides a persistent, immutable, ordered map from keys to values.
A sequence of blocks forms an SSTable, and each SSTable stores one block index: when the SSTable is opened, the index is loaded into memory and gives each block's location.
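A toy Python sketch of the block-index idea (the block size and structure are simplified; real SSTables use 64 KB blocks and a richer on-disk format):

import bisect

class SSTable:
    # An immutable, sorted key-value map split into blocks plus an index.
    def __init__(self, sorted_items, block_size=4):
        self.blocks = [sorted_items[i:i + block_size]
                       for i in range(0, len(sorted_items), block_size)]
        # The index holds the first key of every block; it is what gets
        # loaded into memory when the SSTable is opened.
        self.index = [block[0][0] for block in self.blocks]

    def get(self, key):
        i = bisect.bisect_right(self.index, key) - 1  # candidate block
        if i < 0:
            return None
        for k, v in self.blocks[i]:  # read only that one block
            if k == key:
                return v
        return None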
SSTables and Logs (diagram: an SSTable as a series of 64 KB blocks followed by an index of block ranges)
Tablet
Tablets are a range of rows of a table; a tablet contains multiple SSTables.
Tablets are assigned to tablet servers.
(diagram: a tablet spanning rows aardvark to apple, built from SSTables; each SSTable is a sequence of 64 KB blocks plus an index)
(diagram: adjacent tablets, aardvark to apple and apple to boat, sharing SSTables)
Multiple tablets form a table. SSTables can overlap, but tablets do not.
Cluster
A BigTable cluster stores tables, and each table consists of tablets.
Initially a table contains one tablet; as the table grows, multiple tablets are created.
Tablets are assigned to tablet servers: each tablet exists at only one server, a server contains multiple tablets, and each tablet is 100-200 MB.
How to find a Row? (diagram)
How to find a Row?
The client reads the location of the root tablet from a Chubby file.
The root tablet contains the locations of the METADATA tablets; the root tablet never splits.
The METADATA tablets contain the locations of the user tablets.
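A hedged sketch of this three-level lookup; the Chubby path and the lookup methods are hypothetical names for illustration:

def locate_tablet(chubby, table_name, row_key):
    # Level 0: a Chubby file stores the root tablet's location.
    root = chubby.read("/ls/bigtable/root-tablet-location")
    # Level 1: the root tablet (which never splits) maps keys to
    # METADATA tablets.
    meta = root.lookup(table_name, row_key)
    # Level 2: the METADATA tablet maps keys to the user tablet's server.
    return meta.lookup(table_name, row_key)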
BigTable Architecture (diagram)
Logical view: tablets are served by tablet servers, coordinated by the master and the Chubby server.
Physical layout: each tablet's SSTables are stored, with replicas, on GFS chunkservers.
BigTable Implementation
BigTable has three components:
Master server
Tablet servers: dynamically added or removed to handle the workload
Chubby client library: links the master server, the many tablet servers, and all the clients
BigTable Implementation (diagram)
BigTable Building Blocks
Google File System: stores persistent state.
Scheduler: schedules the jobs involved in serving BigTable.
Lock service: master election and location bootstrapping.
MapReduce: used to read and write BigTable data.
Chubby
A distributed lock service.
Its namespace consists of directories and files, which are used as locks; it provides mutual exclusion.
Highly available: one elected master and five active replicas.
Uses the Paxos algorithm to maintain consistency among the replicas.
Provides atomic reads and writes.
Chubby
Responsible for:
ensuring there is only one active master
storing the bootstrap location of BigTable data
discovering tablet servers
storing BigTable schema information
storing access control lists
Chubby Client Library
Responsible for providing consistent caching of Chubby files.
Each Chubby client maintains a session with the Chubby service.
Every client session has a lease expiration time; if the client is unable to renew its session lease within that time, the session expires and all locks and open handles are lost.
Master Server

Startup:
acquires a unique master lock in Chubby
discovers live servers in Chubby
scans the METADATA table to learn the set of tablets and discover tablet assignments

Responsible for:
adding or deleting tablet servers based on demand
assigning tablets to tablet servers
monitoring and balancing tablet server load
garbage collection of files in GFS
checking each tablet server for the status of its lock

In case of failure: if its session with Chubby is lost, the master kills itself, and an election can take place to find a new master.
Tablet Server

Startup: acquires an exclusive lock on a uniquely named file in a specific Chubby directory.

Responsible for:
managing tablets
splitting tablets that grow beyond a certain size
serving reads and writes; clients communicate directly with the tablet server

In case of failure: if the tablet server loses its exclusive lock, it stops serving; if its Chubby file still exists, it attempts to reacquire the lock; if the file no longer exists, the tablet server kills itself, restarts, and joins the pool of unassigned tablet servers.
Tablet Server Failure (diagram: via the Chubby server, the master detects a failed tablet server; its tablets stop being served, while their SSTables and replicas remain on the GFS chunkservers)
Tablet Server Failure, continued (diagram):
other tablet servers are drafted to serve the "abandoned" tablets
a backup copy of a tablet is made primary
the master sends a message to the tablet server
an extra replica of the tablet is created automatically by GFS
Tablet Server Recovery Process
The tablet server reads the tablet's metadata, which contains the list of SSTables that comprise the tablet and a set of redo points.
Redo points are pointers into the commit logs.
The server applies the redo points to reconstruct the memtable from the updates in the commit log.
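A minimal sketch of the replay step, assuming hypothetical metadata and log objects and treating the redo point as a byte offset into the commit log:

def recover_tablet(metadata, gfs):
    # SSTables listed in METADATA are already persistent in GFS; only
    # mutations logged after the last redo point must be re-applied.
    memtable = {}
    log = gfs.open(metadata.commit_log)
    log.seek(metadata.redo_point)
    for mutation in log:
        memtable[mutation.key] = mutation.value
    return memtable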
Tablet Server Recovery Process
Read and write requests arriving at the tablet server are checked to make sure they are well formed.
A permission file in Chubby is checked to ensure authorization.
For a write operation, all mutations are written to the commit log, and finally a group commit is used.
For a read operation, the read is executed on a merged view of the sequence of SSTables and the memtable.
Compactions
When the in-memory table is full:
Minor compaction: converts the memtable into an SSTable, reducing memory usage and log traffic on restart.
Merging compaction: reduces the number of SSTables; a good place to apply a "keep only N versions" policy.
Major compaction: a merging compaction that results in only one SSTable, containing no deletion records, only live data.
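A toy Python sketch of the first two compaction kinds (plain dicts stand in for memtables and SSTables; the real system writes to GFS):

def minor_compaction(memtable, gfs, path):
    # Freeze the memtable and write it out as a new immutable SSTable.
    sstable = sorted(memtable.items())  # SSTables are sorted by key
    gfs.write(path, sstable)
    memtable.clear()  # frees memory and shortens the log to replay on restart

def merging_compaction(sstables, memtable):
    # Fold a few SSTables plus the memtable into one new SSTable.
    merged = {}
    for table in sstables:       # apply oldest first...
        merged.update(dict(table))
    merged.update(memtable)      # ...so newer values win
    return sorted(merged.items())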
Refinement
Locality groups: clients can group multiple column families together into a locality group.
Compression: applied to each SSTable block separately, using Bentley and McIlroy's scheme plus a fast compression algorithm.
Caching for read performance: uses a scan cache and a block cache.
Bloom filters: reduce the number of disk accesses.
Refinement
Commit-log implementation: rather than one log per tablet, BigTable keeps one log per tablet server.
Exploiting SSTable immutability:
no need to synchronize accesses to the file system when reading SSTables
concurrency control over rows is efficient
deletes work like garbage collection, removing obsolete SSTables
enables quick tablet splits: child tablets share the parent's SSTables
Interactions between GFS and BigTable
The persistent state of a collection of rows (a tablet) is stored in GFS.
Writes: incoming writes are recorded in memory in memtables, where they are sorted and buffered; after they reach a certain size, they are stored as a sequence of SSTables (persistent storage, in GFS).
Interactions between GFS and BigTable
Reads: the requested information can be in memtables or SSTables; since all tables are sorted by key, it is easy to find the most recent version and avoid stale data.
Recovery: to recover a tablet, the tablet server reconstructs the memtable by reading the tablet's metadata and redo points.
API
BigTable APIs provide functions for creating and deleting tables and column families, and for changing cluster, table, and column family metadata such as access control rights.
Client applications can:
write or delete values
look up values from individual rows
iterate over a subset of the data
The API also supports single-row transactions, allows cells to be used as integer counters, and can execute client-supplied scripts in the address space of the servers.
Why use BigTable?
The scale is large:
more than 100 TB of satellite image data
millions of users and thousands of queries per second, with latency to manage
billions of URLs, billions and billions of pages, and many versions of each page
Why not any other Database?
An in-house solution is always cheaper.
The scale is too large for most databases, and the cost would be too high.
The same system can be used across different projects, which again lowers the cost.
With relational databases we expect ACID transactions, but it is impossible to guarantee consistency while also providing high availability and network partition tolerance.
CAP (diagrams)
Application Design
Reminders:
The timestamp is an int64, so the application needs to plan for multiple clients updating the same cell at the same time.
At the application level, you need to know which data structures GFS supports, to avoid format conversions.
Google Services using BigTable
Used as a database by:
Google Analytics
Google Earth
Google App Engine Datastore
Google Personalized Search
BigTable Derivatives
Apache HBase, a database built to run on top of the Hadoop Distributed File System (HDFS).
Cassandra, which originated at Facebook Inc.
Hypertable, an open-source alternative to HBase.
Colossus
GFS is best suited for batch operations; Colossus is a revamped file system suited for real-time operations.
Colossus underpins a new search infrastructure called "Caffeine", which enables Google to update its search index in real time.
In Colossus, many masters operate at the same time.
A number of changes have already been made to open-source Hadoop to make it look more like Colossus.
Comparison (a series of comparison slides; the tables were not captured in this transcript)
Google App Engine
Presented by: Ayesha Fawad
10/07/2014
Introduction
Also known as GAE or App Engine.
Preview started in April 2008; came out of preview in September 2011.
It is a PaaS (platform as a service) that allows developing and hosting web applications in Google-managed data centers.
The default choice for storage is a NoSQL solution.
Introduction

Language independent, with plans to support more languages.
Automatic scaling: automatically allocates more resources to handle additional demand.
Free up to a certain level of resources (storage, bandwidth, or instance hours) required by the application.
Does not allow joins.
Introduction
Applications are sandboxed across multiple servers: a security mechanism for running untested code, with restricted resources for the safety of the host system.
Reliable: a Service Level Agreement of 99.5% uptime; can sustain multiple data center failures.
GAE Data store
Built on top of BigTable.
Follows a hierarchical structure.
A schema-less object datastore, designed to scale for high performance.
Queries are served from pre-built indexes.
Does not require entities of the same kind to have the same set of properties.
Does Not Support
Join operations
Inequality filtering on multiple properties
Filtering data based on the results of a subquery
Entities
Also known as objects in the App Engine Datastore.
Each entity is uniquely identified by its own key.
An entity's path begins with a root entity, proceeding from parent to child.
Every entity belongs to an entity group.
Models
Model is the superclass for data model definitions, defined in google.appengine.ext.db.
Entities of a given kind are represented by instances of the corresponding model class.
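For example, with the Python db API, a model definition looks like this (the Greeting kind and its properties are invented for illustration):

from google.appengine.ext import db

class Greeting(db.Model):
    author = db.StringProperty()
    content = db.StringProperty(multiline=True)
    date = db.DateTimeProperty(auto_now_add=True)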
Queries
A Datastore query retrieves entities from the datastore whose entity values or keys meet a specified set of conditions.
The Datastore API provides a Query class for constructing queries and a PreparedQuery class for fetching and returning entities from the datastore.
Filters and sort orders can be applied to queries.
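Continuing the hypothetical Greeting model from above, a typical query with a filter and a sort order:

q = Greeting.all()  # a Query over all entities of kind Greeting
q.filter("author =", "alice").order("-date")
recent = q.fetch(limit=10)  # executes as a PreparedQuery, returns entities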
Indexes
An index is defined on a list of properties of an entity kind.
An index table contains a column for every property specified in the index's definition.
The datastore identifies the index that corresponds to the query's kind, filter properties, filter operators, and sort orders.
App Engine predefines an index on each property of each kind; these indexes are sufficient for simple queries.
GQL
GQL is a SQL-like language for retrieving entities or keys from the App Engine Datastore.
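The same query as above expressed in GQL (still using the hypothetical Greeting kind):

from google.appengine.ext import db

q = db.GqlQuery("SELECT * FROM Greeting "
                "WHERE author = :1 ORDER BY date DESC", "alice")
recent = q.fetch(10)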
Transactions
A transaction is a set of Datastore operations on one or more entities.
Transactions are atomic: they are never partially applied.
They provide isolation and consistency.
Required when users may attempt to create or update an entity with the same string ID concurrently.
It is also possible to queue transactions.
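A sketch of the create-or-update case using run_in_transaction; the Greeting kind is the hypothetical model from earlier:

from google.appengine.ext import db

def create_if_absent(string_id):
    # Lookup and creation happen atomically inside the transaction, so two
    # clients racing on the same string ID cannot both create the entity.
    entity = Greeting.get_by_key_name(string_id)
    if entity is None:
        entity = Greeting(key_name=string_id)
        entity.put()
    return entity

greeting = db.run_in_transaction(create_if_absent, "daily-note")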
Data store Software Stack

App Engine Datastore: schema-less storage, advanced query engine
Megastore: multi-row transactions, simple indexes and queries, strict schema
BigTable: distributed key/value store
Next-generation distributed file system underneath
GUI
https://appengine.google.com
Everything done through the console can also be done through the command line (appcfg).
GUI

Sections: Main, Data, Administration, Billing
GUI (Main)

Dashboard: all metrics related to your application, including versions, resources and usage, and more.
Instances: total number of instances, availability (e.g. dynamic), average latency, average memory, and more.
Logs: detailed information that helps in resolving issues, and more.
Versions: number of versions, the default version, deployment information, deleting a specific version, and more.
Backends: like a worker role, a piece of business logic that does not have a user interface.
Cron Jobs: time-based jobs, which can be defined in an XML or YAML file.
Task Queues: multiple task queues can be created (the first one is the default); defined in an XML or YAML file.
Quota Details: detailed metrics of the resources being used (e.g. storage, memcache, mail), the daily quota, and rate details of what the client is billed for.
Data store Options
High Replication:
uses the Paxos algorithm
multi-master reads and writes
provides the highest level of availability (99.999% SLA)
certain queries are eventually consistent
some latency due to multi-master writing
reads come from the fastest (local) source and are transactional
Data store Options
Master/Slave:
offers strong consistency over availability for all reads and queries
data is written to a single master data center, then replicated asynchronously to the other (slave) data centers
99.9% SLA
reads come from the master only
Competitors
App Engine offers better infrastructure for hosting applications in terms of administration and scalability.
Other hosting services offer more flexibility for applications in terms of languages and configuration.
Hard Limits (table not captured in the transcript)

Free Quotas (table not captured in the transcript)

Cloud Data Storage Options (table not captured in the transcript)
References

BigTable:
http://en.wikipedia.org/wiki/BigTable
https://www.cs.rutgers.edu/~pxk/417/notes/content/bigtable.html
http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/bigtable-osdi06.pdf

Google File System:
http://en.wikipedia.org/wiki/Google_File_System
http://static.googleusercontent.com/media/research.google.com/en/us/archive/gfs-sosp2003.pdf