Benchmarking Top NoSQL Databases: Apache Cassandra, Apache HBase and MongoDB
TRANSCRIPT
ReferentEinrichtung Titel des Vortrages 1
WP-Benchmarking Top NoSQL Databases
Apache Cassandra, Apache HBase and MongoDB
Presented By Athiq Ahamed
Supriya
Introduction
Enormous amount of data (Big Data)
Scalability issues in RDBMS
Rise of NoSQL databases
Amazon Dynamo
Bigtable
CAP theorem
BASE system
CAP Theorem
Consistency
Availability
Partition tolerance
The CAP theorem states that a distributed system can guarantee at most two of these three properties at the same time.
RDBMS | NoSQL
Supports powerful query language | Supports very simple query language
Has a fixed schema | No fixed schema
Follows ACID (Atomicity, Consistency, Isolation and Durability) | Is only eventually consistent
Supports transactions | Does not support transactions
RDBMS vs NoSQL
Content:tutorialspoint.com
Basically available: System guarantees availability, in terms of the CAP theorem
Soft state: State of the system may change over time, because of eventual consistency model
Eventual consistency: System will become consistent over time
BASE
Content:www.edureka.in
Fast performance is key. Proof-of-concept (POC) processes should include the right benchmarks:
Configurations
Parameters
Workloads
Making the right choice!
Selection of NoSQL
Yahoo Cloud Serving Benchmark (YCSB)
Top 3 NoSQL databases: Apache Cassandra, Apache HBase and MongoDB
Amazon Web Services EC2 instances for hosting the tests
Tests performed 3 times on 3 different days
Benchmark configuration
The tests ran on large size instances (15GB RAM and 4 CPU cores)
Instances used customized Ubuntu with Oracle Java 1.6 installed as a base.
A customized script was written to drive the benchmark processes
Benchmark configuration
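To make "workload" concrete, here is a toy driver that runs a read/update mix and times it. It is only an illustration under stated assumptions: it is neither YCSB nor the script used in these tests, the store is a plain in-memory dict, and `run_workload` and its parameters are hypothetical names.

```python
import random
import time

def run_workload(store, operation_count, read_proportion, key_space, seed=0):
    """Run a read/update mix against a dict-like store and return simple stats."""
    rng = random.Random(seed)  # fixed seed so the operation mix is repeatable
    reads = updates = 0
    start = time.perf_counter()
    for _ in range(operation_count):
        key = f"user{rng.randrange(key_space)}"
        if rng.random() < read_proportion:
            store.get(key)          # read operation
            reads += 1
        else:
            store[key] = "x" * 100  # update operation
            updates += 1
    elapsed = time.perf_counter() - start
    return {"reads": reads, "updates": updates, "elapsed_s": elapsed}

# Load phase: insert the records, then run a 50/50 read/update mix.
store = {f"user{i}": "x" * 100 for i in range(1000)}
stats = run_workload(store, operation_count=10_000, read_proportion=0.5, key_space=1000)
total_ops = stats["reads"] + stats["updates"]
print(f"{total_ops} ops in {stats['elapsed_s']:.4f}s")
```

Changing `read_proportion` is the toy equivalent of switching between a read-heavy and a write-heavy workload.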
Each NoSQL system performs differently.
Components and internal workings differ.
Apache Cassandra: Columnar (column-family) database model
Apache HBase: Columnar (column-family) database model
MongoDB: Document storage database model
Understanding NoSQL Databases
Apache Cassandra
Cassandra is scalable, fault-tolerant, and consistent. All nodes are equal.
Its distribution design is based on Amazon’s Dynamo and its data model on Google’s Bigtable.
Key components: Node, Cluster, Commit log, Mem-table, SSTable and Bloom filter
Content:http://www.tutorialspoint.com/cassandra/cassandra_architecture.htm
Ring structure, peer to peer architecture
All nodes are equal
This improves general database availability
Scaling up and scaling down is easier
Cassandra is a key-value, column-oriented database
Apache Cassandra
Apache Cassandra
Content:http://demoiselle.sourceforge.net/component/demoiselle-cassandra/1.0.0/images/datamodel1.png
Cassandra has an internal keyspace called system, which stores metadata about the cluster.
Metadata:
The node's token
The cluster name
Keyspace and schema definitions (dynamic loading)
Whether or not the node is bootstrapped
Apache Cassandra
Content:https://www.edureka.co/blog/category/apache-cassandra/
Commit log: Crash recovery mechanism. Every write operation is written to the commit log.
Mem-table: A memory-resident data structure that holds recent writes.
SSTable: A disk file to which the data is flushed from the mem-table.
Apache Cassandra
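The three components above can be sketched as a toy write path. This is a simplified illustration, not Cassandra's implementation: `ToyNode`, the flush threshold, and the list-based "files" are assumptions made for the example.

```python
class ToyNode:
    """Toy model of the write path: commit log -> mem-table -> SSTable."""

    def __init__(self, memtable_limit=3):
        self.commit_log = []   # append-only list standing in for the on-disk commit log
        self.memtable = {}     # in-memory structure holding recent writes
        self.sstables = []     # sorted, immutable "files" flushed from the mem-table
        self.memtable_limit = memtable_limit  # hypothetical flush threshold

    def write(self, key, value):
        self.commit_log.append((key, value))  # logged first, for crash recovery
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Persist the mem-table as a sorted table and start a fresh one.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):  # newest SSTable first
            for k, v in table:
                if k == key:
                    return v
        return None

node = ToyNode()
for i, key in enumerate(["a", "b", "c", "d"], start=1):
    node.write(key, i)
print(node.read("a"), node.memtable, len(node.sstables))
```

After the third write the mem-table reaches the threshold and is flushed, so "a" is served from an SSTable while "d" is still in memory.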
Bloom filters are used as a performance booster
Bloom filters are fast, probabilistic structures for testing whether an element is a member of a set. They can report false positives but never false negatives.
Bloom filters serve as a special kind of cache: lookups are quick because they reside in memory
Apache Cassandra
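A minimal Bloom filter sketch shows why the membership test is fast and why it can only err on the side of "maybe present". The sizes, the SHA-256-based hashing, and the names below are assumptions for illustration; this is not Cassandra's own filter.

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit, for simplicity

    def _positions(self, item):
        # Derive num_hashes positions by salting the item with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
keys = [f"row-{i}" for i in range(50)]
for key in keys:
    bf.add(key)
print(all(bf.might_contain(k) for k in keys))
```

A negative answer lets a read skip an SSTable entirely, which is where the performance gain comes from.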
Gossip protocol: Communication between nodes, coordination and failure detection
Anti-entropy protocol: Replica synchronization mechanism ensuring that data on different nodes is kept up to date (uses Merkle trees)
Snitches determine host proximity
Apache Cassandra
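The idea behind Merkle-tree comparison can be sketched as follows: each replica hashes its rows into a tree, and equal root hashes mean the replicas agree. This is a simplified illustration that compares roots only; `merkle_root` and the sample replicas are hypothetical, and a real system would also compare subtrees to locate the differing ranges.

```python
import hashlib

def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(rows):
    """Return the Merkle root over a list of (key, value) pairs, sorted by key."""
    level = [_h(f"{k}={v}".encode()) for k, v in sorted(rows)]
    if not level:
        return _h(b"")
    while len(level) > 1:
        if len(level) % 2:            # duplicate the last hash on odd-sized levels
            level.append(level[-1])
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

replica_a = [("k1", "v1"), ("k2", "v2"), ("k3", "v3")]
replica_b = [("k3", "v3"), ("k1", "v1"), ("k2", "v2")]   # same data, different order
replica_c = [("k1", "v1"), ("k2", "old"), ("k3", "v3")]  # one stale value

print(merkle_root(replica_a) == merkle_root(replica_b))
print(merkle_root(replica_a) == merkle_root(replica_c))
```

Only the hashes need to be exchanged between nodes, not the data itself, unless a mismatch is found.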
Apache Cassandra: Read/Write operation
A sparse, distributed, consistent, multidimensional sorted map.
HBase is a key/value store
A cell is addressed by row key, column family, column and timestamp.
Apache HBase
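The "multidimensional sorted map" can be pictured as nested maps keyed by row key, column family, column, and timestamp. The sketch below uses plain Python dicts; the helper names `put`, `get_latest`, and `scan` are hypothetical and only mimic the shape of the data model, not the HBase client API.

```python
# table[row_key][column_family][column][timestamp] = value
table = {}

def put(row, family, column, value, ts):
    cell = table.setdefault(row, {}).setdefault(family, {}).setdefault(column, {})
    cell[ts] = value  # each timestamp is a separate version of the cell

def get_latest(row, family, column):
    versions = table.get(row, {}).get(family, {}).get(column, {})
    return versions[max(versions)] if versions else None

def scan(start_row, stop_row):
    # Rows are kept in row-key order, so a range scan is a sorted slice.
    return [r for r in sorted(table) if start_row <= r < stop_row]

put("row1", "info", "name", "Ann", ts=1)
put("row1", "info", "name", "Anna", ts=2)   # newer version of the same cell
put("row2", "info", "name", "Bob", ts=1)
put("row3", "info", "city", "Bonn", ts=1)   # sparse: row3 has no "name" column

print(get_latest("row1", "info", "name"))
print(scan("row1", "row3"))
```

The map is sparse because a row stores only the columns it actually has.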
Apache HBase
Content:http://zhangjunhd.github.io/assets/2013-02-25-apache-hbase/rowkey-columnkey.png
Region: Contiguous rows form a region
Region server (RS): Serves one or more regions
Master server: Daemon responsible for managing the HBase cluster
HDFS: Distributed, open-source file system containing HBase's data
ZooKeeper: Distributed, open-source coordination service for coordinating the master and region servers
Apache HBase Components
Content: https://www.mapr.com/blog/in-depth-look-hbase-architecture
Apache HBase Architecture
Client obtains from ZooKeeper the region server (RS) that hosts the meta table
Client queries the meta table for the RS that holds the requested row key
Client receives the row from the respective region server
Client caches this information along with the location of the meta table server
First Read/Write to HBase
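The lookup steps above can be sketched with in-memory stand-ins. Everything here is hypothetical (the `ZOOKEEPER`, `META_TABLE`, and `REGION_DATA` dicts, the server names, and `ToyClient`); the point is only that ZooKeeper and the meta table are consulted once and later requests hit the client cache.

```python
ZOOKEEPER = {"meta_region_server": "rs1"}
# Meta table hosted on rs1: (region start key, region end key or None, hosting server)
META_TABLE = {"rs1": [("", "m", "rs2"), ("m", None, "rs3")]}
REGION_DATA = {"rs2": {"apple": 1}, "rs3": {"pear": 2}}

def _find(regions, row_key):
    for start, end, server in regions:
        if start <= row_key and (end is None or row_key < end):
            return (start, end, server)
    return None

class ToyClient:
    def __init__(self):
        self.meta_server = None   # cached location of the meta table
        self.cached_regions = []  # cached region locations
        self.zk_lookups = 0
        self.meta_lookups = 0

    def get(self, row_key):
        region = _find(self.cached_regions, row_key)
        if region is None:
            if self.meta_server is None:          # step 1: ask ZooKeeper once
                self.meta_server = ZOOKEEPER["meta_region_server"]
                self.zk_lookups += 1
            region = _find(META_TABLE[self.meta_server], row_key)  # step 2
            self.meta_lookups += 1
            self.cached_regions.append(region)    # step 4: cache the location
        return REGION_DATA[region[2]].get(row_key)  # step 3: read from the RS

client = ToyClient()
print(client.get("apple"), client.get("apple"), client.get("pear"))
print(client.zk_lookups, client.meta_lookups)
```

The second read of "apple" needs neither ZooKeeper nor the meta table, and the read of "pear" reuses the cached meta table location.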
WAL: The Write Ahead Log is a file on the distributed file system. It stores new data that has not yet been persisted to permanent storage
Block Cache: The read cache. It stores frequently read data in memory
MemStore: The write cache. It stores new data that has not yet been written to disk
HFiles store the rows as sorted key values on disk
HBase RS Components
The client's write is first appended to the WAL file stored on disk
The WAL is used to recover data that has not yet been persisted if a server crashes
Once the data is written to the WAL, it is placed in the MemStore
HBase Write steps (1)
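The ordering of these steps matters for recovery, which a small sketch can show. `ToyRegionServer` is a hypothetical stand-in with a Python list playing the role of the WAL file; it is not how HBase itself is implemented.

```python
class ToyRegionServer:
    def __init__(self, wal=None):
        # The WAL outlives the server process; here it is just a shared list.
        self.wal = wal if wal is not None else []
        self.mem_store = {}
        for row, value in self.wal:   # on start-up, replay the log
            self.mem_store[row] = value

    def put(self, row, value):
        self.wal.append((row, value))  # 1. append to the WAL first
        self.mem_store[row] = value    # 2. then place the data in the MemStore

rs = ToyRegionServer()
rs.put("row1", "a")
rs.put("row2", "b")
rs.put("row1", "c")

# Simulated crash: the in-memory MemStore is lost, the WAL survives.
recovered = ToyRegionServer(wal=rs.wal)
print(recovered.mem_store)
```

Because every edit reaches the log before the MemStore, replaying the log in order rebuilds exactly the state that was lost.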
All reads and writes go to/from the primary node.
HDFS replicates the WAL and HFile blocks. Replication happens automatically.
When data is written to HDFS, one copy is written locally, then it is replicated to a secondary node and later to a tertiary node.
HDFS Write steps (2)
Cassandra use case: Availability and partition-tolerance requirements.
Consistency is tunable: a higher consistency level can be selected as an option
HBase use case: Consistency and scalability. However, with a smaller number of nodes/threads, high availability is also achieved
Cassandra and HBase
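One common way to reason about tunable consistency in Dynamo-style systems is the overlap rule: with N replicas, a read that contacts R replicas is guaranteed to see the latest write acknowledged by W replicas when R + W > N. The helper names below are hypothetical; the arithmetic is the point.

```python
def quorum(n: int) -> int:
    """Smallest majority of n replicas."""
    return n // 2 + 1

def read_sees_latest_write(n: int, r: int, w: int) -> bool:
    """True when every read set of size r must overlap every write set of size w."""
    return r + w > n

N = 3
for r, w in [(1, 1), (quorum(N), quorum(N)), (1, N)]:
    print(f"N={N} R={r} W={w} overlap={read_sees_latest_write(N, r, w)}")
```

Lower R and W favor latency and availability; higher values favor consistency, which is the trade-off the slide refers to.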
Document-oriented database
High performance and automatic scaling
High consistency and partition tolerant
Replication and failover for high availability
Low latency
Flexible indexing
MongoDB
A document is the basic unit of data in MongoDB (comparable to a row)
A collection is similar to a table
A single instance can host multiple independent databases
Every document has a special key, "_id"
Powerful JavaScript shell for administration
The config database contains the cluster metadata
MongoDB Concepts
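These concepts map naturally onto nested Python structures. The sketch below is a toy model, not the MongoDB driver: `instance`, `insert_one`, and the UUID used in place of a real ObjectId are assumptions for illustration.

```python
import uuid

# instance: database name -> collection name -> list of documents
instance = {}

def insert_one(db, collection, document):
    doc = dict(document)
    # Every document gets an "_id"; a random hex string stands in for an ObjectId.
    doc.setdefault("_id", uuid.uuid4().hex)
    instance.setdefault(db, {}).setdefault(collection, []).append(doc)
    return doc["_id"]

# Documents in the same collection need not share the same fields (no fixed schema).
id1 = insert_one("shop", "users", {"name": "Ann", "email": "ann@example.com"})
id2 = insert_one("shop", "users", {"name": "Bob", "tags": ["admin", "ops"]})
insert_one("blog", "posts", {"title": "Hello"})  # a second, independent database

print(sorted(instance), len(instance["shop"]["users"]), id1 != id2)
```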
MongoDB Simple Architecture
A mongos instance receives queries from applications
It uses metadata from the config server to locate the data
mongos directs write operations to a particular shard
mongos uses the cluster metadata from the config database
Read/Write MongoDB
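The routing role of mongos can be sketched as a lookup from shard key to shard using range metadata. This is a toy model under stated assumptions: `CONFIG_CHUNKS`, the shard names, and `route` are hypothetical, and only range-based partitioning on a numeric shard key is shown.

```python
import bisect

# Cluster metadata as it might be read from the config database:
# (inclusive lower bound of the shard-key range, shard that owns it), sorted.
CONFIG_CHUNKS = [(0, "shardA"), (100, "shardB"), (200, "shardC")]
SHARDS = {"shardA": {}, "shardB": {}, "shardC": {}}

def route(shard_key):
    """Return the shard whose key range contains shard_key."""
    lowers = [lower for lower, _ in CONFIG_CHUNKS]
    idx = bisect.bisect_right(lowers, shard_key) - 1
    if idx < 0:
        raise ValueError(f"no chunk covers shard key {shard_key}")
    return CONFIG_CHUNKS[idx][1]

def write(shard_key, document):
    # The router sends the write to exactly one shard.
    SHARDS[route(shard_key)][shard_key] = document

def read(shard_key):
    return SHARDS[route(shard_key)].get(shard_key)

write(42, {"name": "Ann"})
write(100, {"name": "Bob"})
write(250, {"name": "Cid"})
print(route(42), route(100), route(250), read(100))
```

The application only ever talks to the router; which shard holds a document is decided from the metadata.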
Scalability
Availability
Partition tolerance
Consistency
Most important: performance
Yahoo Cloud Serving Benchmark (YCSB)
Recap: Importance of Benchmark and Factors
Results: Load Process
Results: Read/Write Mix Workload
Results: Read/Scan Mix Workload
Results: Read Latency across all workloads
Results: Insert Latency across all workloads
Let's migrate from a traditional database!
Live Demo
Identify the data model for the application
The corresponding data sets have to be known
Determine whether the application requires replication
Identify the performance requirements
Prototype the application
Test the performance of the prototype
Discussion
Conclusion
NoSQL databases are replacing traditional relational databases
Performance is the key feature
Importance of benchmarks
The performance of the top three NoSQL databases was tested
Cassandra outperformed the other NoSQL databases tested
Decide based on the application