cassandra - a decentralized storage system

Cassandra - A Decentralized Structured Storage System Presented By Tejaswi Ganne Latha Muddu Khulud Alsultan Rajaramya Janagama Marmik Patel Arunit Gupta Chaitanya Sai Prashant Malik Facebook Avinash Lakshman Facebook

Upload: arunit-gupta

Post on 12-Apr-2017





Page 1: Cassandra - A decentralized storage system

Cassandra - A Decentralized Structured Storage System

Presented By

Tejaswi Ganne
Latha Muddu
Khulud Alsultan
Rajaramya Janagama
Marmik Patel
Arunit Gupta
Chaitanya Sai

Prashant Malik, Facebook
Avinash Lakshman, Facebook

Page 2: Cassandra - A decentralized storage system

Outline
• Data Model.
• Cassandra API.
• System Architecture:
  • Partitioning.
  • Replication.
  • Membership and Failure Detection.
  • Bootstrapping and Scaling the Cluster.
  • Local Persistence.

Page 3: Cassandra - A decentralized storage system

Introduction and Data Model

-Tejaswi Ganne

Page 4: Cassandra - A decentralized storage system

Introduction
• Apache Cassandra is an open-source distributed storage system.
• It manages very large amounts of data spread across many commodity servers, which may be located in multiple data centers.
• It is named after Cassandra, the prophetess of Greek mythology.
• It was initially developed at Facebook to power the Inbox Search feature; Facebook later open-sourced it, and it became an Apache Incubator project.
• Features:
  • High scalability
  • High availability
  • Fault tolerance

Page 5: Cassandra - A decentralized storage system

Features
• High Scalability: Read and write throughput increase linearly as new machines are added, with no downtime or interruption to applications.

• High Availability: The system is durable and expected to operate continuously, without failure, for long periods.

• Fault Tolerant: Data is automatically replicated to multiple nodes, so failed nodes can be replaced with no downtime. Replication across multiple data centers is supported.

Page 6: Cassandra - A decentralized storage system

Data Model

• Uses a simple data model instead of a full relational data model.
• A table in Cassandra is a distributed multidimensional map indexed by a key.
• The value is a structured object.
• Every operation on a single row is atomic per replica.

Page 7: Cassandra - A decentralized storage system

Data Model

*Figure taken from

Page 8: Cassandra - A decentralized storage system

Data Model Contd…
• Each row can have a different number of columns.
• Each column is a < Name, Value, Timestamp > triple.
• Columns can be ordered by name or by timestamp.
• Columns are grouped into Column Families (CF).
• Two types of CFs:
  • Simple Column Family
    – Contains columns.
    – A column is addressed using the convention column_family : column
  • Super Column Family
    – A CF within a CF.
    – Contains super columns, each of which groups ordinary columns.
    – A column is addressed using the convention column_family : super_column : column
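The row / column family / column addressing described on this slide can be pictured as a nested map. The following is a minimal sketch only, assuming a plain in-memory dictionary; the names `table`, `put`, and `get` are hypothetical and are not part of any Cassandra API.

```python
from typing import Dict, Tuple

# A column is stored as (value, timestamp) under its column name.
Column = Tuple[str, int]

# table[row_key][column_family][column_name] -> (value, timestamp)
table: Dict[str, Dict[str, Dict[str, Column]]] = {}


def put(row_key: str, column_family: str, column: str, value: str, timestamp: int) -> None:
    """Store a column, keeping only the write with the newest timestamp."""
    cf = table.setdefault(row_key, {}).setdefault(column_family, {})
    current = cf.get(column)
    if current is None or timestamp >= current[1]:
        cf[column] = (value, timestamp)


def get(row_key: str, column_family: str, column: str) -> str:
    """Address a column as column_family : column within a row."""
    return table[row_key][column_family][column][0]


# Rows may hold different sets of columns.
put("user42", "profile", "firstname", "Ada", timestamp=1)
put("user42", "profile", "firstname", "Grace", timestamp=2)
put("user42", "profile", "firstname", "Stale", timestamp=0)  # older write is ignored
put("user7", "profile", "email_verified", "yes", timestamp=5)
```

In this sketch the timestamp decides which value survives, which mirrors the role of the timestamp in each column triple.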

Page 9: Cassandra - A decentralized storage system

Key-Value Model

• Cassandra is a column-oriented NoSQL system.

• A row is a collection of columns, each labeled with a name.

• The column name acts as the key, and a row must contain at least one column.

Page 10: Cassandra - A decentralized storage system

Related Work

• Amazon Dynamo: Dynamo is a storage system that Amazon uses to store and retrieve user shopping carts. Every write in Dynamo also requires a read in order to manage its vector timestamps.
• Google File System (GFS) and Chubby: GFS uses a simple design with a single master server that hosts the entire metadata; the data is split into chunks and stored on chunk servers. The GFS master is made fault tolerant using the Chubby abstraction, and Chubby achieves fault tolerance through replication.

Page 13: Cassandra - A decentralized storage system

Cassandra API

-Muddu Latha

Page 14: Cassandra - A decentralized storage system

Cassandra Query Language

• The Cassandra Query Language (CQL) is the primary language for communicating with the Cassandra database.

CQL statements:
• Data Definition Statements
• Data Manipulation Statements
• Queries

Page 15: Cassandra - A decentralized storage system

Data Definition Statements
• Create Keyspace
• Use
• Alter Keyspace
• Drop Keyspace
• Create Table
• Alter Table
• Drop Table
• Create Type
• Alter Type
• Drop Type
• Create Trigger
• Drop Trigger
• Create Function
• Drop Function
• Create Aggregate
• Drop Aggregate

Page 16: Cassandra - A decentralized storage system

Data Definition Statements(Cont..)

1. Create Keyspace:
cqlsh> CREATE KEYSPACE sample_demo WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
2. Use Keyspace:
cqlsh> USE sample_demo;
3. Alter Keyspace:
cqlsh> ALTER KEYSPACE sample_demo WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 5};

Page 17: Cassandra - A decentralized storage system

Data Definition Statements(Cont..)

4. Drop Keyspace:
DROP KEYSPACE sample_demo;
5. Create Table:
CREATE TABLE presentors_list (firstname text, lastname text, classid int, email text, PRIMARY KEY (lastname));
6. Alter Table:
ALTER TABLE presentors_list ADD city text;

Page 18: Cassandra - A decentralized storage system

Data Definition Statements(Cont..)

7. Drop Table: removes a table.
DROP TABLE presentors_list;

8. Truncate Table: removes all data from a table.
TRUNCATE presentors_list;

Page 19: Cassandra - A decentralized storage system

CQL Statements Execution Screenshots

Page 20: Cassandra - A decentralized storage system

Data Manipulation Statements

Insert, Update, Delete, Batch
1. Insert:
INSERT INTO presentors_list (firstname, lastname, classid, email) VALUES ('lakshmi', 'upadrasta', 29, '[email protected]');
2. Update:
UPDATE presentors_list USING TTL 400 SET firstname = 'dileep', classid = 28 WHERE lastname = 'upadrasta';

Page 21: Cassandra - A decentralized storage system

Data Manipulation Statements(Cont..)

3. Delete:
DELETE FROM presentors_list WHERE lastname = 'upadrasta';
(An optional USING TIMESTAMP clause, placed before WHERE, takes an explicit timestamp value.)
4. Batch:
BEGIN BATCH



Page 22: Cassandra - A decentralized storage system


Select
SELECT * FROM presentors_list;
SELECT firstname, email FROM presentors_list WHERE lastname = 'upadrasta';

Page 25: Cassandra - A decentralized storage system

System Architecture: Partitioning

- Khulud Alsultan

Page 26: Cassandra - A decentralized storage system

System Architecture

• The core distributed systems techniques:
  • Partitioning
  • Replication
  • Membership and Failure Handling
  • Bootstrapping and Scaling the Cluster
  • Local Persistence

Page 27: Cassandra - A decentralized storage system

System Architecture
• These modules work in synchrony to handle read/write requests.
• A read/write request for a key can be routed to any node in the cluster.
• That node determines the replicas for the key.

Page 28: Cassandra - A decentralized storage system

System Architecture
• For writes: the system routes the request to the replicas and waits for a quorum of replicas to acknowledge completion of the write.
• For reads, depending on the consistency the client requires, the system either:
  • routes the request to the closest replica, or
  • routes the request to all replicas and waits for a quorum of responses.
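The quorum rule for reads and writes can be illustrated with a few lines of arithmetic. This is a sketch under the assumption that a quorum means a strict majority of the N replicas; the function names are hypothetical.

```python
from typing import Iterable


def quorum(replication_factor: int) -> int:
    """Smallest number of replicas that forms a strict majority."""
    return replication_factor // 2 + 1


def write_succeeds(replica_acks: Iterable[bool], replication_factor: int) -> bool:
    """A write completes once a quorum of replicas has acknowledged it."""
    acks = sum(1 for ok in replica_acks if ok)
    return acks >= quorum(replication_factor)


def quorums_overlap(replication_factor: int) -> bool:
    """A quorum read and a quorum write always share at least one replica."""
    return 2 * quorum(replication_factor) > replication_factor
```

With three replicas, two acknowledgements are enough; because any two majorities intersect, a quorum read sees at least one replica that took part in the latest quorum write.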

Page 29: Cassandra - A decentralized storage system


• Cassandra must be able to scale incrementally.
• This requires dynamically partitioning the data over the set of nodes in the cluster.
• Data is partitioned using consistent hashing.
• The hash function is order preserving.
• The output range of the hash function is treated as a ring.
• Each node is assigned a random value that represents its position on the ring.

Page 31: Cassandra - A decentralized storage system

Consistent Hashing • Example:

Cassandra assigns a hash value to each partition key:

if you have the following data:

Page 32: Cassandra - A decentralized storage system

Consistent Hashing

Page 33: Cassandra - A decentralized storage system

Consistent Hashing
• Cassandra places the data on each node according to the value of the partition key and the range that the node is responsible for.
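A ring lookup of this kind can be sketched as follows. Several things here are assumptions made for illustration: the ring size, the use of MD5 (which is not order preserving, unlike the hash function mentioned in the slides), the fixed node positions, and the names `token`, `Ring`, and `owner`.

```python
import bisect
import hashlib

RING_SIZE = 2 ** 32  # assumed size of the hash output range


def token(key: str) -> int:
    """Map a partition key to a position on the ring (MD5 is an assumption)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % RING_SIZE


class Ring:
    def __init__(self, node_tokens):
        # node_tokens: {node_name: position on the ring}
        self._tokens = sorted((t, n) for n, t in node_tokens.items())
        self._positions = [t for t, _ in self._tokens]

    def owner(self, key: str) -> str:
        """Return the first node at or after the key's position, wrapping around."""
        i = bisect.bisect_left(self._positions, token(key))
        return self._tokens[i % len(self._tokens)][1]


ring = Ring({"A": 0, "B": RING_SIZE // 3, "C": 2 * RING_SIZE // 3})
bigger = Ring({"A": 0, "B": RING_SIZE // 3, "C": 2 * RING_SIZE // 3, "D": RING_SIZE // 2})
```

Adding node D between B and C in this sketch changes ownership only for keys that previously belonged to C; every other key keeps its owner.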

Page 34: Cassandra - A decentralized storage system

Consistent Hashing
Advantage:
• The departure or arrival of a node affects only its immediate neighbors; all other nodes remain unaffected.

Some Challenges:
• Random position assignment of each node on the ring leads to non-uniform data and load distribution.
• The basic algorithm is oblivious to the heterogeneity in the performance of nodes.

Page 35: Cassandra - A decentralized storage system


• Two ways to address this issue:
  • Assign each node to multiple positions on the ring.
  • Analyze load information on the ring and have lightly loaded nodes move on the ring to relieve heavily loaded nodes.
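The effect of giving a node several positions on the ring can be checked with a small calculation of how much of the ring each node owns. This is an illustrative sketch, assuming distinct token positions and the convention that a token owns the range from the previous token up to itself; `ownership` is a hypothetical helper.

```python
def ownership(node_tokens, ring_size):
    """Fraction of the ring owned by each node.

    node_tokens maps a node name to a list of distinct ring positions.
    A token at position p owns the range (previous token, p].
    """
    points = sorted((t, n) for n, ts in node_tokens.items() for t in ts)
    shares = {n: 0.0 for n in node_tokens}
    for i, (tok, node) in enumerate(points):
        prev = points[i - 1][0]  # for i == 0 this wraps to the last token
        span = (tok - prev) % ring_size or ring_size  # a lone token owns everything
        shares[node] += span / ring_size
    return shares


# One position each: an unlucky placement gives a very uneven split.
uneven = ownership({"A": [0], "B": [10]}, ring_size=100)

# Two positions each, interleaved: the load evens out.
even = ownership({"A": [0, 50], "B": [25, 75]}, ring_size=100)
```

Moving a lightly loaded node's single token, the second option on the slide, changes the same shares by shifting one boundary instead of adding more of them.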

Page 37: Cassandra - A decentralized storage system


-Rajaramya Janagama

Page 38: Cassandra - A decentralized storage system


• Replication determines how data is duplicated across nodes.

• Why replication?
  • To achieve high availability and durability.
  • To provide fault tolerance by keeping one or more copies of every row in a column family on different nodes in the cluster.

Page 39: Cassandra - A decentralized storage system


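• How to achieve replication?
  • Each data item is replicated at N nodes, where N is the replication factor.
  • Each key is assigned to a coordinator node, which is responsible for replicating the data items that fall within its range.
  • Besides storing each key in its range locally, the coordinator replicates the key at N-1 other nodes.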

Page 40: Cassandra - A decentralized storage system

Replication Policies

– Various options to replicate data

• Rack Unaware • Rack Aware• Datacenter Aware

Page 41: Cassandra - A decentralized storage system

Rack Unaware

• Replicate data at N-1 successive nodes after its coordinator
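The Rack Unaware rule amounts to walking the ring from the coordinator. A minimal sketch, assuming the nodes are given in ring order; `replicas` is a hypothetical helper name.

```python
def replicas(ring_nodes, coordinator_index, n):
    """Return the coordinator followed by its N-1 successors on the ring."""
    count = min(n, len(ring_nodes))  # cannot place more replicas than nodes
    return [
        ring_nodes[(coordinator_index + i) % len(ring_nodes)]
        for i in range(count)
    ]


nodes = ["N1", "N2", "N3", "N4", "N5", "N6"]
placement = replicas(nodes, coordinator_index=4, n=3)  # coordinator is N5
```

The modulo makes the walk wrap around the end of the ring, so a coordinator near the end still gets N-1 successors.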

Page 42: Cassandra - A decentralized storage system

Rack Aware

• No two replicas should lie in the same rack.

[Figure: nodes N1-N6 arranged in racks, with Replica 1 and Replica 2 placed on different racks. N = node.]

Page 43: Cassandra - A decentralized storage system

Data Center Aware

• No two replicas should lie in the same datacenter.

[Figure: nodes N1-N8 split between Datacenter 1 and Datacenter 2.]



Page 44: Cassandra - A decentralized storage system


• Cassandra provides durability guarantees in the presence of node failures and network partitions.

• The storage nodes are spread across multiple datacenters and are connected through high speed data links.

• This scheme of replicating across multiple datacenters allows us to handle entire data center failures without any outage.

Page 46: Cassandra - A decentralized storage system

Membership And Failure Detection

- Marmik Patel

Page 47: Cassandra - A decentralized storage system

What is Membership?
• Can be split into two parts:
  1. Service Discovery
  2. Failure Detection

Service Discovery
• Service discovery comes into the picture when a new node is set up and added to the cluster.
• It is based on Scuttlebutt reconciliation, a very efficient anti-entropy mechanism built on a gossip protocol.
• Scuttlebutt makes very efficient use of the CPU and of the gossip channel.

Page 48: Cassandra - A decentralized storage system

Gossip Protocol and Scuttlebutt Reconciliation

Gossip Protocol
• The protocol Cassandra uses to discover information about other nodes.
• Each node passes information on to the nodes it knows about.
• Used not only for membership but also to disseminate other system-related control state, such as health, tokens, addresses, and data size.

Scuttlebutt Reconciliation
• The two participants in a gossip exchange need not send their full state; each sends only the mappings that are more recent than those held by the peer.
• Inspired by real-life rumor spreading.
• Repairs replicated data by comparing differences.

Robbert van Renesse, Dan Mihai Dumitriu, Valient Gough, and Chris Thomas. Efficient reconciliation and flow control for anti-entropy protocols
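The idea of exchanging only newer mappings can be sketched with versioned key-value state. This is a deliberate simplification and not the Scuttlebutt algorithm itself: it assumes each node's state is a dictionary of key to (value, version), and the helper names are hypothetical.

```python
def newer_entries(mine, theirs):
    """Entries of `mine` that the peer lacks or holds at an older version."""
    return {
        key: (value, version)
        for key, (value, version) in mine.items()
        if key not in theirs or theirs[key][1] < version
    }


def gossip(a, b):
    """One exchange: each side sends only what is newer than the peer's copy."""
    delta_for_b = newer_entries(a, b)
    delta_for_a = newer_entries(b, a)
    b.update(delta_for_b)
    a.update(delta_for_a)
    return delta_for_a, delta_for_b


# State about cluster members as seen by two nodes: key -> (status, version).
node_a = {"n1": ("up", 3), "n2": ("up", 1)}
node_b = {"n2": ("down", 2), "n3": ("up", 1)}
sent_to_a, sent_to_b = gossip(node_a, node_b)
```

After one exchange both nodes hold the same, most recent view, and only the differing entries crossed the wire.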

Page 49: Cassandra - A decentralized storage system

Failure Detection

• Comes into the picture when a node is taken down for maintenance or fails due to an error.

• A mechanism by which a node can locally determine whether any other node in the system is up or down.

• Also used to avoid attempts to communicate with unreachable nodes.

• Uses a modified version of the Φ Accrual Failure Detector.

• Gossip protocol is used for exchanging information

Page 50: Cassandra - A decentralized storage system

Φ Accrual Failure Detector

• Based on a very simple principle.
• Does not emit a Boolean value stating that a node is up or down; instead it emits a value that represents a suspicion level for each monitored node.
• This value is defined as Φ.
• The idea is to express Φ on a scale that is dynamically adjusted to reflect network and load conditions.
• The difference between a traditional failure detector and an accrual failure detector lies in which component of the system performs which part of failure detection.

Page 51: Cassandra - A decentralized storage system

Traditional Failure Detector vs Accrual Failure Detector

Traditional Failure Detector

• Monitoring and interpretation are combined, and the output of this combination is a Boolean.

• The application cannot do any interpretation of its own, because the monitored information has already been interpreted.

Accrual Failure Detector

• Provides a lower-level abstraction that avoids interpreting the monitoring information.

• The value associated with each process represents a suspicion level, which is left for the application to interpret.

Page 52: Cassandra - A decentralized storage system

Properties of Φ

• Φ expresses the likelihood that node A is wrong about node B's state.
• If A decides to suspect B when Φ = 1, the likelihood that A is making a mistake is about 10%; it is about 1% when Φ = 2, 0.1% when Φ = 3, and so on.

• Each node maintains a sliding window of inter-arrival times of gossip messages from other nodes and uses it to calculate Φ.

• Accrual failure detectors are very good in both accuracy and speed.
• They also adjust well to network conditions and server load.
• Cassandra approximates the distribution of inter-arrival times with an exponential distribution.
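Under the exponential approximation, Φ has a closed form. If the mean inter-arrival time in the window is m and t is the time since the last gossip message, the probability that the next message arrives later than t is exp(-t/m), and Φ = -log10 of that probability = t / (m · ln 10). The sketch below assumes exactly this model; the class name, window size, and method names are hypothetical and do not reflect Cassandra's actual implementation.

```python
import math
from collections import deque


class PhiAccrualDetector:
    def __init__(self, window_size=1000):
        # Sliding window of inter-arrival times between gossip messages.
        self.intervals = deque(maxlen=window_size)
        self.last_heartbeat = None

    def heartbeat(self, now):
        """Record the arrival time of a gossip message from the monitored node."""
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now):
        """Suspicion level: -log10 of P(next message arrives later than now)."""
        if not self.intervals:
            return 0.0  # not enough history to suspect anything
        mean = sum(self.intervals) / len(self.intervals)
        elapsed = now - self.last_heartbeat
        # Exponential model: P(later) = exp(-elapsed / mean).
        return elapsed / (mean * math.log(10))


detector = PhiAccrualDetector()
for t in (0.0, 1.0, 2.0, 3.0):  # one message per second
    detector.heartbeat(t)
```

With a mean interval of one second, Φ reaches 1 after about 2.3 seconds of silence (ln 10 seconds), which corresponds to the 10% mistake likelihood quoted above; the application chooses the threshold at which it treats the node as down.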

Page 54: Cassandra - A decentralized storage system



-Arunit Gupta

Page 55: Cassandra - A decentralized storage system

Bootstrapping
What is Bootstrapping?

Adding new nodes to the cluster is called “bootstrapping”.

Adding a New Node
Two things happen when a node is added:

– The new node is assigned a random token, which gives its position on the ring. It gossips its location to the rest of the ring, and the nodes exchange information about one another.

– The new node reads its configuration file to find its initial contact points.

• New nodes are added manually by an administrator through the CLI or the web interface provided by Cassandra.

Page 56: Cassandra - A decentralized storage system

Bootstrapping Contd..

• These initial contact points are known as seeds. A newly added node uses them to learn about the other nodes; the ultimate goal is for all nodes in the cluster to discover one another.

• Seeds can also come from a configuration service such as ZooKeeper, a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.

“Because Coordinating Distributed Systems is a Zoo”

Page 57: Cassandra - A decentralized storage system

Facts!!! • Comparison with Amazon’s Dynamo which is a highly available key-value structured storage system.

“Dynamo’s load is no where close to what we see in practice over here at Facebook.” –Avinash Lakshman


Page 58: Cassandra - A decentralized storage system


In addition to seeds, you'll also need to configure the IP interface to listen on for Gossip and CQL (listen_address and rpc_address respectively). Use a listen_address that will be reachable from the listen_address used on all other nodes, and a rpc_address that will be accessible to clients. Once everything is configured and the nodes are running, use the bin/nodetool status utility to verify a properly connected cluster. For example:

Page 59: Cassandra - A decentralized storage system

Environment
• Node outages are often transient but may last for extended intervals.
• An outage rarely signifies a permanent departure and therefore should not result in re-balancing of the partition assignment or repair of the unreachable replicas.

• Manual errors could result in the unintentional startup of new Cassandra nodes. For this reason, an explicit mechanism is used to initiate the addition and removal of nodes from a Cassandra instance.

• An administrator uses a command line tool or a browser to connect to a Cassandra node and issue a membership change to join or leave the cluster.

Page 60: Cassandra - A decentralized storage system

Scaling the Cluster
• When a new node is added to the system, it is assigned a token such that it can alleviate a heavily loaded node.
• The new node takes over part of the range that another node was previously responsible for.
• The Cassandra bootstrap algorithm is initiated from any other node in the system, using either a command line utility or the web dashboard.
• The node giving up the data streams it to the new node using kernel copy techniques.

[Figure: Cassandra ring showing scalability.]

Page 61: Cassandra - A decentralized storage system

Future
What is the Future?

• Operational experience has shown that data can be transferred at a rate of 40 MB/sec from a single node. Work is under way to have multiple replicas take part in the bootstrap transfer, parallelizing the effort in a way similar to BitTorrent, a peer-to-peer system used to transfer large files to thousands of locations in a short period of time.

• Facebook uses BitTorrent to distribute updates to its servers.

“Bit Torrent is fantastic for this, it’s really great,” Cook said. “It’s ‘super-duper’ fast and it allows us to alleviate a lot of scaling concerns we’ve had in the past, where it took forever to get code to the webservers before you could even boot it up and run it.”

Page 62: Cassandra - A decentralized storage system

Virtual nodes in Cassandra
• Virtual nodes (vnodes) were one of the new features slated for the Cassandra 1.2 release. They changed the paradigm from one token or range per node to many per node. Within a cluster these tokens can be randomly selected and non-contiguous, giving many smaller ranges that belong to each node.

• Use of heterogeneous machines in a cluster.
• Node failures and backing up.

Page 64: Cassandra - A decentralized storage system

Local Persistence

-Chaitanya Sai Manne

Page 65: Cassandra - A decentralized storage system

Local Persistence
– Cassandra depends on the local file system for data persistence.
– The data is represented on disk using a format that lends itself to efficient data retrieval.
– For a data store to be considered persistent, it must write to non-volatile storage.

Page 66: Cassandra - A decentralized storage system

Cassandra – more than one server

• All the nodes participate in a cluster

• They are independent – share nothing

• Add or remove as needed• If you need more capacity?

Add a server

Page 67: Cassandra - A decentralized storage system

Focus on a single server

Page 68: Cassandra - A decentralized storage system

Write operation
• The write is first appended to the commit log.
• It is then applied to the in-memory data structure, the memtable.
• Rows in the memtable are identified by the primary key.
• The write is acknowledged back to the client.
• This process is simple, which is what makes scaling easier.
• As the memtable starts to fill up, a flush process runs.
• The flush process writes the memtable to a file called an SSTable (Sorted String Table).
• The writes here are sequential writes.
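The write path on this slide can be sketched with in-memory stand-ins for the commit log and the SSTable files. Everything here is an assumption made for illustration: the class name `Node`, the tiny flush threshold, and the use of Python lists in place of files on disk.

```python
class Node:
    def __init__(self, memtable_limit=2):
        self.commit_log = []   # append-only; stands in for the on-disk commit log
        self.memtable = {}     # row key -> {column: (value, timestamp)}
        self.sstables = []     # each flush appends one immutable, sorted table
        self.memtable_limit = memtable_limit

    def write(self, key, column, value, timestamp):
        # 1. Append to the commit log first, for durability.
        self.commit_log.append((key, column, value, timestamp))
        # 2. Apply the write to the memtable, indexed by the row key.
        self.memtable.setdefault(key, {})[column] = (value, timestamp)
        # 3. Flush once the memtable holds enough rows.
        if len(self.memtable) >= self.memtable_limit:
            self.flush()
        # 4. Acknowledge back to the client.
        return "ack"

    def flush(self):
        # Rows are written out in key order, i.e. as one sequential write.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}


node = Node(memtable_limit=2)
node.write("cm7cd", "firstname", "Chaitanya", timestamp=1)
node.write("ab1cd", "firstname", "Example", timestamp=2)
```

No existing file is ever modified in this sketch: a write only appends to the log and updates memory, and a flush only creates a new sorted table.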

Page 69: Cassandra - A decentralized storage system


UPDATE users SET firstname = 'Chaitanya' WHERE id = 'cm7cd';

Write: row key, column (id = 'cm7cd', firstname = 'Chaitanya')

Page 76: Cassandra - A decentralized storage system

Compaction
• Compaction is the process that takes the SSTables, reads them back sequentially, merge-sorts them, keeps the entry with the latest timestamp for each key, and writes a brand new file.

• It then deletes the old files.
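The merge step of compaction can be sketched with a standard k-way merge. This is an illustration only, assuming each SSTable is a list of (row_key, column, value, timestamp) tuples sorted by row key and column with at most one entry per (row_key, column); `compact` is a hypothetical name.

```python
import heapq


def compact(sstables):
    """Merge sorted SSTables, keeping the newest entry per (row_key, column)."""
    merged = []
    # heapq.merge reads every input sequentially and yields one sorted stream,
    # so entries for the same (row_key, column) arrive next to each other.
    for key, column, value, ts in heapq.merge(*sstables):
        if merged and merged[-1][:2] == (key, column):
            if ts >= merged[-1][3]:
                merged[-1] = (key, column, value, ts)  # newer timestamp wins
        else:
            merged.append((key, column, value, ts))
    return merged


old = [("a", "name", "Ann", 1), ("b", "name", "Bob", 1)]
new = [("a", "name", "Anna", 5), ("c", "name", "Cy", 2)]
compacted = compact([old, new])
```

The result is a single sorted table that replaces both inputs, which is why the old files can be deleted afterwards.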

Page 79: Cassandra - A decentralized storage system

Read Operation

Page 80: Cassandra - A decentralized storage system

Read Operation
• A read looks in the memtable before going to the files on disk.
• Files are looked up in order from newest to oldest.
• Cassandra checks an in-memory data structure called a Bloom filter.
• The Bloom filter can quickly tell whether a key may exist in a file, so files that cannot contain the key are skipped.
• A key in a column family can have many columns, so to avoid scanning every column Cassandra maintains column indices.
• In a cluster, a client can ask any node to retrieve the data.

Consistency Levels
• Set on every read and write, e.g. ONE, TWO, ALL, or QUORUM (a majority of the replicas).
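The single-node read path described on this slide can be sketched as follows. The Bloom filter below is a toy: its size, the number of hash functions, and the use of SHA-256 are assumptions, as are the names `BloomFilter` and `read`; dictionaries stand in for the files on disk.

```python
import hashlib


class BloomFilter:
    def __init__(self, size=1024, hashes=3):
        self.size = size
        self.hashes = hashes
        self.bits = [False] * size

    def _positions(self, key):
        # Derive several bit positions from one key by salting the hash.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = True

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[p] for p in self._positions(key))


def read(key, memtable, sstables):
    """sstables is a list of (bloom_filter, data) pairs, newest first."""
    if key in memtable:  # the memtable is checked before any file
        return memtable[key]
    for bloom, data in sstables:
        if bloom.might_contain(key) and key in data:
            return data[key]
    return None


older = BloomFilter()
older.add("a")
newer = BloomFilter()
newer.add("a")
newer.add("b")
files = [(newer, {"a": "a-new", "b": "b1"}), (older, {"a": "a-old"})]
```

A Bloom filter can report false positives but never false negatives, so the sketch still confirms the key in the file's data before returning a value.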

Page 81: Cassandra - A decentralized storage system

Read Operation

Page 82: Cassandra - A decentralized storage system

Summary
• Cassandra has demonstrated high scalability, performance, and wide applicability.
• It supports very high update throughput while delivering low latency.
• Future work:
  – Adding compression
  – Support for atomicity across keys
  – Secondary index support

Page 84: Cassandra - A decentralized storage system

For More Information

Lakshman, Avinash, and Prashant Malik. "Cassandra: a decentralized structured storage system." ACM SIGOPS Operating Systems Review 44.2 (2010): 35-40.

Page 85: Cassandra - A decentralized storage system

Thank You