megastore by google

MEGASTORE: Providing Scalable, Highly Available Storage for Interactive Services

Guided By- Prof. Kong Li
Presented By- (TEAM 1)
Anumeha Shah (009423973)
Ankita Kapratwar (009413469)
Swapna Kulkarni (009264905)

What is Megastore
● Megastore combines the scalability and availability of a NoSQL datastore with the ACID semantics of an RDBMS, so that it can meet the requirements of interactive online services. It provides both strong consistency and high availability, which neither NoSQL nor an RDBMS provides alone.
● Megastore uses the Paxos consensus algorithm for replication, giving high availability with low latency.
● It partitions the data at a fine granularity and provides ACID semantics within each partition, replicated across a wide area network with low latency.

Why Megastore
Online interactive services require both high availability and strong consistency.
● Online services are growing rapidly as the number of potential users grows.
● More and more desktop services are moving to the cloud.
● Opposing storage requirements are arising, which makes storage challenging.

Reasons for opposing requirements are:
● Applications should be scalable.
● Services should be responsive.
● Users should have a consistent view of the data.
● Services should be highly available, up 24/7.

Approach to Provide High Availability and Consistency
Two approaches have been taken:
1. Use a synchronous, fault-tolerant log replicator to provide availability.
2. To provide scalability, partition the data into many small databases, each with its own log replicator.

Replications for High Availability
Need for replication:
● Replication is needed for high availability.
● Replication within a datacenter overcomes host-specific failures.
● To overcome datacenter-specific failures and regional disasters, the data should be replicated across geographically distributed datacenters.

Common Replication Strategies and Issues
Asynchronous master/slave:
● Write-ahead log entries are replicated by the master node to at least one slave.
● Log appends are acknowledged at the master in parallel with transmission to the slaves.
● However, if the master fails, there can be downtime until a slave becomes master, and data can be lost.
Synchronous master/slave:
● Changes on the master and slaves are made synchronously: the master acknowledges a change only once it has been mirrored to the slaves.
● This approach prevents data loss when failing over from master to slave.
● However, failures need timely detection by an external system; otherwise they can cause high latency and user-visible outages.

Common Replication Strategies and Issues Cont..
Optimistic replication:
● There is no master.
● Any member can accept changes, and the changes propagate through the group asynchronously. This approach provides high availability and excellent latency.
● However, transactions are not possible because the global mutation ordering is not known at commit time.

Use of Paxos for Replication
● Paxos is a fault-tolerant consensus algorithm.
● There is no master, only a group of similar peers.
● A write-ahead log can be replicated across all the peers.
● Any peer can initiate reads or writes.
● A change is added to the log only if a majority of the peers acknowledge it.
● The peers that did not acknowledge the change eventually catch up.
● There is no distinguished failed state.
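The majority rule above can be sketched as follows. This is a minimal illustration under simplifying assumptions, not Paxos itself: it omits proposal numbers and the prepare phase, and the `Replica`, `append`, and `catch_up` names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical peer holding a copy of the write-ahead log."""
    name: str
    up: bool = True
    log: dict = field(default_factory=dict)  # log position -> entry

    def accept(self, position, entry):
        # An unreachable peer cannot acknowledge the change.
        if not self.up:
            return False
        self.log[position] = entry
        return True


def append(replicas, position, entry):
    """Return True only if a majority of the peers acknowledged the entry."""
    acks = sum(1 for r in replicas if r.accept(position, entry))
    return acks > len(replicas) // 2


def catch_up(lagging, peer):
    """Copy the entries a lagging peer missed from an up-to-date peer."""
    for position, entry in peer.log.items():
        lagging.log.setdefault(position, entry)


# Usage: one of three peers is down, yet the write still reaches a majority.
replicas = [Replica("A"), Replica("B"), Replica("C", up=False)]
committed = append(replicas, 0, "put x=1")
replicas[2].up = True
catch_up(replicas[2], replicas[0])
```

Because any majority is enough, no single peer has to be declared failed before writes can continue.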

Use of Paxos for Replication Cont..
Issues with the Paxos replication strategy:
● A single replicated log over a wide area might suffer high latencies, which limits throughput.
● What if none of the replicas is up to date?
● What if a majority of the replicas do not acknowledge a write?

Solution:
● Partition the data.
● Use multiple replicated logs: each partition of the data has its own replicated log.
● Replicate each log synchronously among the datacenters.

Partitioning For Scalable Replication

Cross Entity Groups Operations.

Partitioning For Scalability and Consistency
● Partition the data into entity groups.
● Each partition is replicated synchronously and independently across different datacenters.
● Within each datacenter, the data is stored in a NoSQL datastore.
● Within an entity group, changes are made with single-phase ACID transactions.
● Operations across entity groups can use two-phase commit, but typically use asynchronous messaging.
● Entity groups are logically distant, not physically distant, so operations across different entity groups are local to a datacenter.
● The traffic between datacenters is only for synchronous replication.
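The split between in-group transactions and cross-group messaging can be sketched roughly as below. The `EntityGroup` class and its methods are hypothetical, and replication of each group's log is left out to keep the example short.

```python
from collections import deque


class EntityGroup:
    """Hypothetical entity group with its own log and a message inbox."""

    def __init__(self, name):
        self.name = name
        self.log = []         # per-group transaction log (replication omitted)
        self.data = {}
        self.inbox = deque()  # asynchronous messages sent by other groups

    def commit(self, mutations, outgoing=()):
        """Apply mutations within this group and queue messages to others."""
        self.log.append(dict(mutations))
        self.data.update(mutations)
        for target, message in outgoing:
            target.inbox.append(message)

    def process_inbox(self):
        """Apply each queued message later, in this group's own transaction."""
        while self.inbox:
            self.commit(self.inbox.popleft())


# Usage: the sender commits immediately; the receiver changes only later.
alice = EntityGroup("alice")
bob = EntityGroup("bob")
alice.commit({"sent:1": "hi"}, outgoing=[(bob, {"received:1": "hi"})])
bob.process_inbox()
```

Until `process_inbox` runs, the receiving group is unchanged, which is the price of avoiding a cross-group commit.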

Physical Layout
How to select entity group boundaries:
● Groups should not be too fine-grained, as that requires excessive cross-group operations.
● Groups should also not contain a large number of unrelated entities, as that serializes unrelated writes and degrades throughput.

Physical Layout
● Google's Bigtable is used as the storage system; it is fault tolerant and scalable.
● Applications keep data near the user, or in the region from which it is accessed most, and keep replicas near each other to keep latency low, including during failures.
● Data that is accessed together is kept either in nearby rows or within the same row.
● Caching is implemented for low latency.

Data Model Overview
● The data model lies between the abstract tuples of an RDBMS and the concrete row-column storage of NoSQL.
● Schema => set of tables => each table contains entities => each entity contains properties.
● An entity group consists of a root entity along with all entities in child tables that reference it.

Data Model Cont..
Indexes
● An index can be applied to any property.
● Local index: used to find data within an entity group.
● Global index: used to find entities without knowing in advance the entity groups that contain them.
● STORING clause: applications store additional properties from the primary table in the index for faster access at read time.
● Repeated indexes: indexes on repeated properties.
● Inline indexes: extract slices of information from child entities and store them in the parent for fast access. They can implement many-to-many links.

Mapping to Bigtable
● The Bigtable column name = Megastore table name + property name.
● The Bigtable row of each entity group's root entity stores the transaction and replication metadata for the group, including the transaction log.
● Keeping the metadata in the same row allows it to be updated atomically through a single Bigtable transaction.
● Index entry: represented as a single Bigtable row. Row key = indexed property values + primary key of the indexed entity.
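The two naming rules above can be written out as small helpers. This is only a sketch: the function names, the `.` separator, and the tuple representation of row keys are assumptions made for illustration, and the `Photo` table and its values are hypothetical.

```python
def column_name(table, prop):
    """Bigtable column name built from Megastore table name and property name.

    The "." separator is an assumption made for this example.
    """
    return f"{table}.{prop}"


def index_row_key(indexed_values, primary_key):
    """Index row key: indexed property values followed by the primary key."""
    return tuple(indexed_values) + tuple(primary_key)


# Usage with a hypothetical Photo table keyed by (user_id, photo_id).
col = column_name("Photo", "time")
key = index_row_key(["2009-01-01"], (101, 500))
```

Putting the indexed values first means entries sort by those values, so a range scan over the index visits matching entities in order.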

Transactions and Concurrency control
● Each entity group functions as a mini-database.
● A transaction first writes its mutations to the write-ahead log; the mutations are then applied to the data.
● Multiple values can be stored in the same row/column pair with different timestamps.
● This enables multiversion concurrency control (MVCC).
● Readers and writers don't block each other.

Cont..
Reads:
a. Current: ensure that all committed writes are applied first, then read at the timestamp of the latest committed transaction.
b. Snapshot: read at the timestamp of the last fully applied transaction, even if some committed transactions have not yet been applied.
c. Inconsistent: ignore the state of the log and read the latest values directly.

Writes:
● A write begins with a current read to determine the next available log position.
● The commit operation gathers the mutations into a log entry, assigns it a timestamp higher than any previous one, and appends it to the log using Paxos.
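The difference between the read types can be illustrated with a single multiversioned cell. This is a minimal sketch, assuming one key per entity group and a hypothetical `EntityGroupStore` class; a real store holds many cells and replicates the log.

```python
class EntityGroupStore:
    """Hypothetical single-cell store illustrating the three read types."""

    def __init__(self):
        self.versions = {}  # timestamp -> value (several versions of one cell)
        self.log = []       # committed (timestamp, value) log entries
        self.applied = 0    # how many log entries have been applied so far

    def commit(self, timestamp, value):
        # Committing appends to the log; applying to the data happens later.
        self.log.append((timestamp, value))

    def apply_next(self):
        timestamp, value = self.log[self.applied]
        self.versions[timestamp] = value
        self.applied += 1

    def read_at(self, timestamp):
        usable = [t for t in self.versions if t <= timestamp]
        return self.versions[max(usable)] if usable else None

    def current_read(self):
        # Apply every committed write first, then read the latest one.
        while self.applied < len(self.log):
            self.apply_next()
        return self.read_at(self.log[-1][0]) if self.log else None

    def snapshot_read(self):
        # Read at the last applied transaction; later commits are not seen.
        if self.applied == 0:
            return None
        return self.read_at(self.log[self.applied - 1][0])

    def inconsistent_read(self):
        # Ignore the log entirely. With many cells this could observe a
        # partially applied transaction, which one cell cannot show.
        return self.versions[max(self.versions)] if self.versions else None


store = EntityGroupStore()
store.commit(1, "a")
store.apply_next()
store.commit(2, "b")             # committed but not yet applied
stale = store.snapshot_read()    # does not wait for the second commit
fresh = store.current_read()     # applies the second commit first
```

Keeping every version under its own timestamp is what lets a reader pick a consistent point in time without blocking a writer.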

Transaction Lifecycle

READ- Obtain Timestamp & Log Position of last committed transaction

Application Logic- Read from Bigtable and gather writes into a log entry

Commit - Use Paxos for appending that entry to log

Apply - Write mutations to the entities and indexes

Clean Up - Delete data that is no longer required
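The lifecycle steps above can be sketched as a retry loop around a log-position check. This is a simplified single-process illustration: `GroupLog` and `run_transaction` are hypothetical names, the Paxos round is replaced by a plain position comparison, and the clean-up step is omitted.

```python
class GroupLog:
    """Hypothetical per-entity-group log with a position-checked append."""

    def __init__(self):
        self.entries = []

    def read_position(self):
        # The next available log position follows the last committed entry.
        return len(self.entries)

    def try_commit(self, position, entry):
        # Stand-in for the Paxos round: only one writer wins each position.
        if position != len(self.entries):
            return False
        self.entries.append(entry)
        return True


def run_transaction(log, data, logic, max_attempts=3):
    """Read, run application logic, commit, then apply; retry on conflict."""
    for _ in range(max_attempts):
        position = log.read_position()       # Read
        entry = logic(dict(data))            # Application logic
        if log.try_commit(position, entry):  # Commit
            data.update(entry)               # Apply
            return True
    return False


log = GroupLog()
data = {"n": 0}
ok = run_transaction(log, data, lambda snapshot: {"n": snapshot["n"] + 1})
```

If two writers read the same position, only the first `try_commit` succeeds; the other must re-read and retry.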

Replication
● Reads and writes can be initiated from any replica.
● Replication is done per entity group by synchronously replicating the group's transaction log to a quorum of replicas.
● Read guarantees:
  o A read always observes the last-acknowledged write.
  o After a write has been observed, all future reads observe that write.

Megastore Architecture

Data Structures and Algorithms
Replicated Logs

Reads
Algorithm for a Current Read
● Query Local
● Find Position
● Catch-Up
● Validate
● Query Data

Reads
Timeline for reads for local replica A

Writes
Algorithm for writes
● Accept Leader
● Prepare
● Accept
● Invalidate
● Apply

Writes
Timeline for writes

Coordinator Availability
Failure Detection
● Google's Chubby lock service is used.
● Writers are insulated from coordinator failure by testing whether a coordinator has lost its locks.

Validation Races
● Races between validates for earlier writes and invalidates for later writes are protected against in the coordinator by always sending the log position associated with the action.
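One way to picture how a log position resolves such races is the sketch below. The `Coordinator` class is hypothetical, and the rules it applies (a validate takes effect only at or beyond the highest position seen, an invalidate only beyond it, and a group is out of date by default) are assumptions chosen for illustration.

```python
class Coordinator:
    """Hypothetical coordinator tracking whether each group is up to date."""

    def __init__(self):
        self._state = {}  # group -> (highest log position seen, is_valid)

    def validate(self, group, position):
        last, _ = self._state.get(group, (-1, False))
        # A validate for an earlier write must not undo a later invalidate.
        if position >= last:
            self._state[group] = (position, True)

    def invalidate(self, group, position):
        last, _ = self._state.get(group, (-1, False))
        # Only an invalidate for a later write changes the state.
        if position > last:
            self._state[group] = (position, False)

    def is_up_to_date(self, group):
        # Unknown groups are conservatively treated as out of date.
        return self._state.get(group, (-1, False))[1]


coordinator = Coordinator()
coordinator.invalidate("group-1", 5)  # write 5 missed the local replica
coordinator.validate("group-1", 4)    # delayed validate for write 4 arrives
still_stale = not coordinator.is_up_to_date("group-1")
```

Without the position attached to each message, the delayed validate for write 4 would wrongly mark the group as current.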

Operational Issues
Distribution of Availability

Production Metrics
Distribution of Average Latencies

Conclusion

● Megastore
● Paxos for Synchronization
● Bigtable Datastore

QUESTIONS??????

THANK YOU !
