megastore by google
MEGASTORE: Providing Scalable, Highly Available Storage for Interactive Services
Guided By- Prof. Kong Li
Presented By- (TEAM 1)
Anumeha Shah (009423973), Ankita Kapratwar (009413469), Swapna Kulkarni (009264905)
What is Megastore
● Megastore combines the scalability and availability of a NoSQL datastore with the ACID semantics of an RDBMS so that it can meet the requirements of interactive online services. It provides both strong consistency and high availability, a combination that neither NoSQL nor an RDBMS offers alone.
● Megastore uses Paxos, a replication and consensus algorithm, for high availability with low latency.
● It partitions the data at a fine granularity and provides ACID semantics within each partition, replicated across a wide area network with low latency.
Why Megastore
Online interactive services require high availability as well as strong consistency.
● Online services are growing rapidly as the number of potential users grows.
● More and more desktop services are moving to the cloud.
● Opposing storage requirements are arising and making storage challenging.
Reasons for opposing requirements are:
● Applications should be scalable.
● Services should be responsive.
● Users should have a consistent view of the data.
● Services should be highly available, that is, up 24/7.
Approach to Provide High Availability and Consistency
Two approaches have been taken:
1. For availability, use a synchronous, fault-tolerant log replicator.
2. For scalability, partition the data into many small databases and give each database its own log replicator.
Replications for High Availability
Need for replication:
● Replication is needed for high availability.
● Replication within a data center overcomes host-specific failures.
● To overcome datacenter-specific failures and regional disasters, the data should be replicated across geographically distributed datacenters.
Common Replication Strategies and Issues
Asynchronous master/slave:
● Write-ahead log entries are replicated by the master node to at least one slave.
● Log appends are acknowledged at the master in parallel with transmission to the slaves.
● However, if the master fails, there can be downtime until a slave becomes master, and data can be lost.
Synchronous master/slave:
● Changes on the master and slaves are made synchronously; that is, the master acknowledges a change only once it has been mirrored to the slaves.
● This approach prevents data loss when failing over from master to slave.
● However, failures need timely detection by an external system, because slow detection can cause high latency and user-visible outages.
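The difference between the two master/slave modes can be illustrated with a minimal sketch. The `Master` class and its `flush` method are hypothetical names, slaves are modeled as plain lists, and background shipping is simplified to an explicit call; this is an illustration of the trade-off, not any real system's API.

```python
class Master:
    """Toy master that mirrors its log to slaves (modeled as plain lists)."""

    def __init__(self, slaves, synchronous):
        self.log = []
        self.slaves = slaves
        self.synchronous = synchronous
        self.unsent = []  # async mode: acknowledged but not yet shipped

    def write(self, entry):
        self.log.append(entry)
        if self.synchronous:
            for slave in self.slaves:
                slave.append(entry)      # mirror before acknowledging
        else:
            self.unsent.append(entry)    # ship later, in the background
        return "ack"

    def flush(self):
        """Stand-in for background shipping in asynchronous mode."""
        for entry in self.unsent:
            for slave in self.slaves:
                slave.append(entry)
        self.unsent.clear()


async_slave, sync_slave = [], []
async_master = Master([async_slave], synchronous=False)
sync_master = Master([sync_slave], synchronous=True)

for master in (async_master, sync_master):
    master.write("a")
async_master.flush()
for master in (async_master, sync_master):
    master.write("b")
# If both masters crash now, the asynchronous slave has never seen "b".
```

In this sketch the acknowledged entry "b" exists only on the asynchronous master at the moment of the crash, which is the data-loss window described above; the synchronous slave already holds it.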
Common Replication Strategies and Issues Cont..
Optimistic replication:
● There is no master.
● Any member can accept changes, and the changes propagate through the group asynchronously. This approach provides high availability and excellent latency.
● However, transactions are not possible because the global mutation ordering is not known at commit time.
Use of Paxos for Replication
● Paxos is a fault-tolerant consensus algorithm.
● There is no master, only a group of similar peers.
● A write-ahead log can be replicated across all the peers.
● Any peer can initiate a read or a write.
● A change is added to the log only if a majority of the peers acknowledge it.
● The peers that did not acknowledge the change eventually catch up and acknowledge it.
● There is no distinguished failed state.
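The majority rule in the bullets above can be sketched as follows. This shows only the quorum-acknowledgement step, not full Paxos (no proposal numbers, no prepare phase); `Peer` and `replicate` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Peer:
    name: str
    up: bool = True
    log: dict = field(default_factory=dict)  # log position -> value

    def accept(self, position, value):
        """Store the value and acknowledge, unless this peer is unreachable."""
        if not self.up:
            return False
        self.log[position] = value
        return True


def replicate(peers, position, value):
    """Return True if a majority of peers acknowledged the log entry."""
    acks = sum(peer.accept(position, value) for peer in peers)
    return acks > len(peers) // 2


peers = [Peer("A"), Peer("B"), Peer("C", up=False)]
committed = replicate(peers, position=1, value="put x=1")
```

With one of three peers down the entry still commits, since two acknowledgements form a majority. Peer C holds nothing at position 1 and would have to catch up later.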
Use of Paxos for Replication Cont..
Issues with the Paxos replication strategy:
● A single replicated log over a wide area might suffer high latencies, which limits throughput.
● What if none of the replicas is up to date?
● What if a majority of the replicas do not acknowledge the writes?
Solution:
● Partition the data.
● Use multiple replicated logs.
● Each partition of the data has its own replicated log.
● Replicate the logs synchronously among the data centers.
Partitioning For Scalable Replication
Cross Entity Groups Operations.
Partitioning For Scalability and Consistency
● Partition the data into entity groups.
● Each partition is replicated across different data centers synchronously and independently.
● The data is stored in a NoSQL datastore in each datacenter.
● Within an entity group, changes are made using single-phase ACID transactions.
● Across entity groups, operations use two-phase commit or, more typically, asynchronous messaging.
● These entity groups are logically distant, not physically distant, so operations across different entity groups are local.
● The traffic between the data centers is only for synchronous replication.
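A minimal sketch of the partitioning idea, assuming one in-memory log per entity group and a simple message queue between groups. `EntityGroupStore`, `commit`, and `drain` are hypothetical names, and replication is left out entirely.

```python
from collections import defaultdict


class EntityGroupStore:
    """Each entity group has its own log; groups talk via queued messages."""

    def __init__(self):
        self.logs = defaultdict(list)   # group key -> that group's log
        self.inbox = defaultdict(list)  # group key -> queued messages

    def commit(self, group, mutations, messages=()):
        """Append one entry to a single group's log (the ACID unit here).

        messages: (target_group, payload) pairs delivered asynchronously.
        """
        self.logs[group].append(list(mutations))
        for target, payload in messages:
            self.inbox[target].append(payload)
        return len(self.logs[group])  # log position of this entry

    def drain(self, group):
        """Apply queued messages to `group` as a separate, later transaction."""
        pending, self.inbox[group] = self.inbox[group], []
        if pending:
            self.commit(group, pending)
        return pending


store = EntityGroupStore()
position = store.commit(
    "user:alice",
    ["add photo 1"],
    messages=[("user:bob", "alice shared photo 1")],
)
delivered = store.drain("user:bob")
```

The write to alice's group and the delivery to bob's group are two independent log appends, so neither group's log ever has to coordinate with the other at commit time.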
Physical Layout
How to select entity group boundaries:
● Groups should not be too fine grained, as that may require excessive cross-group operations.
● Groups should also not contain a large number of unrelated entities, as that serializes unrelated writes and reduces throughput.
Physical Layout
● Google's Bigtable is used as the storage system; it is fault tolerant and scalable.
● Applications keep the data near the user, or in the region where it is accessed the most, and keep replicas near each other so that latency stays low, even during failures.
● Data that is accessed together is kept either close together or within the same row.
● Implement caching for low latency.
Data Model Overview
● Lies between the abstract tuples of an RDBMS and the concrete row-column storage of NoSQL.
● Schema => set of tables => each contains entities => each contains properties.
● An entity group consists of a root entity along with all entities in child tables that reference it.
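The schema hierarchy and the root/child relationship can be sketched as below. The `User` and `Photo` tables and their values are hypothetical sample data, and the convention that a child's key begins with its root's key is an assumption made for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    table: str
    key: tuple  # primary key; assumed to start with the root entity's key
    properties: dict = field(default_factory=dict)


def entity_group(entities, root_key):
    """Root entity plus every child entity whose key starts with root_key."""
    return [e for e in entities if e.key[:len(root_key)] == root_key]


entities = [
    Entity("User", (101,), {"name": "John"}),
    Entity("Photo", (101, 500), {"tag": ["dinner"]}),
    Entity("Photo", (101, 502), {"tag": ["lunch"]}),
    Entity("User", (102,), {"name": "Mary"}),
    Entity("Photo", (102, 600), {"tag": ["office"]}),
]
group_101 = entity_group(entities, (101,))
```

Here `group_101` holds the root `User` entity with key `(101,)` and its two `Photo` children, which is the unit that would share one log.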
Data Model Cont..
Indexes
● An index can be applied to any property.
● Local index- Used to find data within an entity group.
● Global index- Used to find entities without knowing in advance the entity groups that contain them.
● Storing clause- Applications store additional properties from the primary table in the index entry for faster access at read time.
● Repeated indexes- For repeated properties.
● Inline indexes- Extract slices of information from child entities and store them in the parent for fast access. They can implement many-to-many links.
Mapping to Bigtable
● The Bigtable column name = Megastore table name + property name.
● The Bigtable row for the root entity stores the transaction metadata, replication metadata, and log for the group.
● Keeping the metadata in the same row allows it to be updated atomically through a single Bigtable transaction.
● Index entry- Represented as a Bigtable row. Row key = indexed property values + primary key of the indexed entity.
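The two naming rules above can be written down as small helpers. The function names, the "." separator, and the use of tuples as row keys are assumptions made for illustration; this does not use any real Bigtable client API.

```python
def column_name(table, prop):
    """Column name = Megastore table name + property name ('.' is assumed)."""
    return f"{table}.{prop}"


def entity_row(table, key, properties):
    """Map one entity to a (row key, columns) pair."""
    return key, {column_name(table, p): v for p, v in properties.items()}


def index_row_key(indexed_values, primary_key):
    """Index row key = indexed property values + primary key of the entity."""
    return tuple(indexed_values) + tuple(primary_key)


row_key, columns = entity_row("Photo", (101, 500), {"time": 12, "tag": "dinner"})
index_key = index_row_key(["dinner"], (101, 500))
```

Because the indexed value leads the index row key, index rows for the same value sort next to each other, so a lookup by value becomes a range scan over adjacent rows.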
Transactions and Concurrency Control
● Each entity group functions as a mini-database.
● A transaction first writes its mutations to the write-ahead log; the mutations are then applied to the data.
● Multiple values can be stored in the same row/column pair with different timestamps.
● Multiversion concurrency control (MVCC).
● Readers and writers don't block each other.
Cont..
Reads-
a. Current- Ensure that all committed writes are applied first, then read as of the latest committed transaction.
b. Snapshot- Read as of the last known fully applied transaction, even if some committed transactions are not yet applied.
c. Inconsistent- Ignore the state of the log and read the latest values directly.
Writes-
Begin with a current read to determine the next available log position.
The commit operation gathers mutations into a log entry, assigns it a timestamp higher than any previous one, and appends it to the log using Paxos.
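The three read types can be illustrated with a single multiversioned cell. `VersionedCell` is a hypothetical name, and the timestamps and the `pending` list standing in for committed-but-unapplied log entries are sample values for this sketch.

```python
import bisect


class VersionedCell:
    """One row/column pair holding several values at different timestamps."""

    def __init__(self):
        self.timestamps = []  # kept sorted
        self.values = []

    def write(self, ts, value):
        i = bisect.bisect_left(self.timestamps, ts)
        self.timestamps.insert(i, ts)
        self.values.insert(i, value)

    def read(self, ts=None):
        """Latest value with timestamp <= ts, or the newest value if ts is None."""
        if ts is None:
            return self.values[-1] if self.values else None
        i = bisect.bisect_right(self.timestamps, ts)
        return self.values[i - 1] if i else None


cell = VersionedCell()
cell.write(10, "v1")
cell.write(20, "v2")
last_applied_ts = 20
pending = [(30, "v3")]  # committed in the log, not yet applied

snapshot = cell.read(last_applied_ts)  # snapshot read: no waiting on the log
for ts, value in pending:              # current read: apply committed writes first
    cell.write(ts, value)
current = cell.read(30)
inconsistent = cell.read()             # inconsistent read: whatever is newest
```

Since each write lands at a new timestamp, a reader at timestamp 20 keeps seeing "v2" while the write at 30 is applied, which is why readers and writers do not block each other.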
Transaction Lifecycle
Read- Obtain the timestamp and log position of the last committed transaction
Application Logic- Read from Bigtable and gather writes into a log entry
Commit- Use Paxos to append that entry to the log
Apply- Write mutations to the entities and indexes
Clean Up- Delete data that is no longer required
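The lifecycle above behaves like optimistic concurrency control: writers that read the same log position race for the next slot, and only one can win. A minimal single-process sketch, assuming a local list in place of the Paxos-replicated log and omitting the clean-up step; `Group` and its methods are hypothetical names.

```python
class Group:
    """Toy entity group: a log, the applied data, and an applied counter."""

    def __init__(self):
        self.log = []     # committed entries; index = log position
        self.data = {}    # applied state
        self.applied = 0  # number of log entries already applied

    def read(self):
        """Read: return the next free log position after the last commit."""
        return len(self.log)

    def commit(self, position, mutations):
        """Commit: win the position only if nobody else has taken it."""
        if position != len(self.log):
            return False  # another writer committed first; caller retries
        self.log.append(dict(mutations))
        return True

    def apply(self):
        """Apply: write committed mutations to the data."""
        while self.applied < len(self.log):
            self.data.update(self.log[self.applied])
            self.applied += 1


group = Group()
pos_a = group.read()
pos_b = group.read()                   # both writers observe the same position
ok_a = group.commit(pos_a, {"x": 1})
ok_b = group.commit(pos_b, {"x": 2})   # loses the race and must retry
group.apply()
```

The losing writer would restart from the read step, observe the new log position, and try again.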
Replication
● Reads and writes can be initiated from any replica.
● Replication is done per entity group by synchronously replicating the group's transaction log to a quorum of replicas.
● Read guarantees:
  o A read always observes the last-acknowledged write.
  o After a write has been observed, all future reads observe that write.
Megastore Architecture
Data Structures and Algorithms
Replicated Logs
Reads
Algorithm for a Current Read
● Query Local
● Find Position
● Catch-Up
● Validate
● Query Data
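The five steps can be sketched as one function. This is a simplified illustration under several assumptions: the coordinator is a plain dict with a hypothetical `up_to_date` flag, the highest log position is found by asking every peer rather than a majority, and `Replica`, `highest`, and `current_read` are names invented here.

```python
class Replica:
    def __init__(self, log=None):
        self.log = dict(log or {})  # log position -> mutations
        self.data = {}
        self.applied = 0            # highest log position applied to data

    def catch_up(self, source, upto):
        """Fetch missing log entries from `source` and apply through `upto`."""
        for pos in range(self.applied + 1, upto + 1):
            if pos not in self.log:
                self.log[pos] = source.log[pos]
            self.data.update(self.log[pos])
            self.applied = pos


def highest(replica):
    return max(replica.log, default=0)


def current_read(key, local, peers, coordinator):
    # 1. Query Local: does the coordinator consider the local replica current?
    local_ok = coordinator.get("up_to_date", False)
    # 2. Find Position: highest committed log position and a replica that has it.
    if local_ok:
        source, position = local, highest(local)
    else:
        source = max(peers, key=highest)
        position = highest(source)
    # 3. Catch-Up: bring the local replica up to that position.
    local.catch_up(source, position)
    # 4. Validate: tell the coordinator the local replica is current again.
    if not local_ok:
        coordinator["up_to_date"] = True
    # 5. Query Data: read from the now-current local replica.
    return local.data.get(key)


local = Replica({1: {"x": 1}})
peer = Replica({1: {"x": 1}, 2: {"x": 2}})
coordinator = {"up_to_date": False}
value = current_read("x", local, [peer], coordinator)
```

In this run the local replica is missing position 2, so it copies that entry from the peer before answering, and the coordinator flag is set so the next read can take the local fast path.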
Reads
Timeline for reads for local replica A
Writes
Algorithm for writes
● Accept Leader
● Prepare
● Accept
● Invalidate
● Apply
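The write steps can be sketched for a single log position as below. This is a simplified, single-process illustration, not a complete Paxos implementation: there is no retry or backoff, the coordinator is reduced to a `coordinator_valid` flag on each replica, and `Replica` and `write` are hypothetical names.

```python
class Replica:
    def __init__(self, name, up=True):
        self.name, self.up = name, up
        self.promised = {}  # position -> highest proposal number seen
        self.accepted = {}  # position -> (proposal number, value)
        self.data = {}
        self.coordinator_valid = True

    def prepare(self, pos, n):
        """Promise to ignore proposals below n; report any accepted value."""
        if not self.up or n <= self.promised.get(pos, 0):
            return None
        self.promised[pos] = n
        return self.accepted.get(pos, (-1, None))

    def accept(self, pos, n, value):
        if not self.up or n < self.promised.get(pos, 0):
            return False
        self.promised[pos] = n
        self.accepted[pos] = (n, value)
        return True


def write(replicas, leader, pos, value):
    majority = len(replicas) // 2 + 1
    n = 0
    # 1. Accept Leader: fast path, the leader accepts proposal number zero.
    if not leader.accept(pos, n, value):
        # 2. Prepare: fall back to a higher proposal number at all replicas.
        n = 1
        promises = [p for p in (r.prepare(pos, n) for r in replicas) if p]
        if len(promises) < majority:
            return None
        _, prior_value = max(promises, key=lambda p: p[0])
        if prior_value is not None:
            value = prior_value  # must re-propose an already accepted value
    # 3. Accept: ask every replica (re-asking the leader is harmless here).
    acked = [r for r in replicas if r.accept(pos, n, value)]
    if len(acked) < majority:
        return None
    # 4. Invalidate: replicas that missed the write may not serve local reads.
    #    (Simplified: a real system cannot just flip a flag on a dead node.)
    for r in replicas:
        if r not in acked:
            r.coordinator_valid = False
    # 5. Apply: write the mutation at the replicas that accepted it.
    for r in acked:
        r.data.update(value)
    return value


a, b, c = Replica("A"), Replica("B"), Replica("C")
result = write([a, b, c], leader=a, pos=1, value={"x": 1})
```

When the leader is reachable the prepare round is skipped entirely, which is the point of the fast path; the prepare round only runs when the leader refuses or cannot be reached.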
Writes
Timeline for writes
Coordinator Availability
Failure Detection
● Google's Chubby lock service is used.
● Writers are insulated from coordinator failure by testing whether a coordinator has lost its locks.
Validation Races
● Races between validates for earlier writes and invalidates for later writes are guarded against in the coordinator by always sending the log position associated with the action.
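One way to see why sending the log position resolves the race: if the coordinator keeps the highest validated and the highest invalidated position per group, the outcome no longer depends on message arrival order. The `Coordinator` class below is a hypothetical sketch of that idea, not Megastore's actual data structure.

```python
class Coordinator:
    """Tracks, per entity group, whether the local replica is up to date."""

    def __init__(self):
        self.caught_up = {}  # group -> highest validated log position
        self.missed = {}     # group -> highest invalidated log position

    def validate(self, group, position):
        self.caught_up[group] = max(position, self.caught_up.get(group, -1))

    def invalidate(self, group, position):
        self.missed[group] = max(position, self.missed.get(group, 0))

    def is_up_to_date(self, group):
        # A later invalidate always outranks an earlier validate.
        # Unknown groups default to "not up to date" (conservative).
        return self.caught_up.get(group, -1) >= self.missed.get(group, 0)


first = Coordinator()
first.invalidate("g", 5)  # the write at position 5 missed this replica
first.validate("g", 4)    # delayed validate for the earlier position 4

second = Coordinator()    # same two messages, opposite arrival order
second.validate("g", 4)
second.invalidate("g", 5)
```

Both coordinators end up reporting the group as stale until a validate for position 5 or later arrives.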
Operational Issues
Distribution of Availability
Production Metrics
Distribution of Average Latencies
Conclusion
● Megastore
● Paxos for Synchronization
● Bigtable Datastore
QUESTIONS??????
THANK YOU !