demystifying datastores
Vishnu Rao
MySQL Enthusiast
Doodle maker
Senior Data Engineer @ DataSpark
Formerly @ flipkart.com
The comma-separated list ...
● Hadoop, HBase, RocksDB
● MySQL, MariaDB, Postgres
● Cassandra, MongoDB
● Druid, Redis, MemSQL
● Elasticsearch, Solr
● CockroachDB, CouchDB
● Vertica, Infobright
● Redshift, DynamoDB
● S3, OpenStack Swift ...
The FUN-damental Qns: Which one should I use ?
Demystifying Datastores
Let's try to look at the problem from the database's point of view
First, let's play some baseball ...
Base 0 : The Data itself
● Row having columns
● Key - Value
  ○ Key - Blob (think: object)
  ○ Key - Document (think: JSON / XML)
● Graph (nodes / edges, somewhat like key-value)
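To make the four shapes concrete, here is a minimal sketch that writes one order in each of them. The field names, the JSON encoding of the blob and the edge labels are assumptions made for illustration only; real datastores each have their own formats.

```python
import json

# Row having columns: a fixed, ordered set of values.
row = ("order-id-123", "customer-1", 5, 1, 3, "Bugis Street")

# Key - Blob: the store only sees opaque bytes behind the key.
key_blob = {
    "order-id-123": json.dumps({"customer": "customer-1", "bill": 5}).encode("utf-8")
}

# Key - Document: the store understands the structure and can look inside it.
key_document = {
    "order-id-123": {"customer": "customer-1", "bill": 5, "tax": 1, "items": 3}
}

# Graph: nodes plus edges; each edge is keyed by (source node, relation).
nodes = {"customer-1", "order-id-123", "Bugis Street"}
edges = {
    ("customer-1", "PLACED"): "order-id-123",
    ("order-id-123", "AT"): "Bugis Street",
}

print(row[2])                                         # bill amount by position
print(json.loads(key_blob["order-id-123"])["bill"])   # caller must decode the blob
print(key_document["order-id-123"]["bill"])           # store-visible field
print(edges[("customer-1", "PLACED")])                # follow an edge
```

Note how the blob has to be decoded by the caller, while the document exposes its fields directly.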
Base 1 : How is the Data Stored ?
Let’s consider a Sample Data Record/Row
order-id-123 | customer-1 | 5$ bill amount | Bugis Street | 1$ tax | 3 items
(Diagram labels: each value is a column / attribute; one column is a possible primary key.)
Approach 1
● Store all columns of the Row side by side (i.e. TOGETHER ) on disk.
● This is generally referred to as a ROW based DataStore.
● Useful for use cases like “showing ENTIRE Order on UI”
● The columns sit next to each other on disk, so the entire row can typically be fetched in one disk access
order-id-123 | customer-1 | 5$ bill amount | Bugis Street | 1$ tax | 3 items
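A minimal sketch of the row-based idea, assuming a hypothetical fixed-width binary layout (the `ROW_FORMAT` below is invented for this example and is not how MySQL or Postgres lay out pages). All columns of a row are adjacent bytes, so one contiguous read returns the whole order.

```python
import struct

# Hypothetical fixed-width layout (an assumption for this sketch):
# order id, customer id, bill amount, tax, item count, location.
ROW_FORMAT = "<16s16sIII16s"
ROW_SIZE = struct.calcsize(ROW_FORMAT)  # bytes occupied by one row


def pack_row(order_id, customer, bill, tax, items, location):
    """Serialise one row so that all of its columns are adjacent bytes."""
    return struct.pack(ROW_FORMAT, order_id.encode(), customer.encode(),
                       bill, tax, items, location.encode())


def read_row(data, row_number):
    """Fetch a whole row with a single contiguous read."""
    start = row_number * ROW_SIZE
    fields = struct.unpack(ROW_FORMAT, data[start:start + ROW_SIZE])
    # Text fields are padded with NUL bytes; strip the padding and decode.
    return tuple(f.rstrip(b"\x00").decode() if isinstance(f, bytes) else f
                 for f in fields)


table = (pack_row("order-id-123", "customer-1", 5, 1, 3, "Bugis Street")
         + pack_row("order-id-121", "customer-1", 2, 2, 1, "Bugis Street"))

print(read_row(table, 0))
```

Showing the entire order on a UI needs exactly one slice of `table`; computing an average over one column would still have to walk every row.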
Approach 2
● Store Columns SEPARATELY, so that they can be accessed independently.
● This is generally referred to as a COLUMN based DataStore.
● Useful for aggregate queries like Avg(billing_amount) or Sum(items)
● Instead of fetching the entire row, fetch only the columns needed for the computation
  ○ i.e. less data fetched from disk = REDUCED IO
order-id-123 | customer-1 | 5$ bill amount | Bugis Street | 1$ tax | 3 items
order-id-121 | customer-1 | 2$ bill amount | Bugis Street | 2$ tax | 1 item
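The same two orders can be sketched in a column-based layout. This is only an in-memory illustration, with one Python list standing in for each separately stored column; the column names are assumptions.

```python
rows = [
    ("order-id-123", "customer-1", 5, 1, 3, "Bugis Street"),
    ("order-id-121", "customer-1", 2, 2, 1, "Bugis Street"),
]
COLUMN_NAMES = ("order_id", "customer", "billing_amount", "tax", "items", "location")

# Column-based layout: every column is kept as its own sequence.
columns = {name: [r[i] for r in rows] for i, name in enumerate(COLUMN_NAMES)}


def avg(column_name):
    """Average over one column; no other column is touched."""
    values = columns[column_name]
    return sum(values) / len(values)


def total(column_name):
    """Sum over one column; no other column is touched."""
    return sum(columns[column_name])


print(avg("billing_amount"))  # 3.5
print(total("items"))         # 4
```

Each aggregate reads 2 values here instead of the 12 values held by the two full rows, which is the reduced IO the slide refers to.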
● What are the other optimisations for a column store?
  ○ Imagine 4 rows with a column, say 'age'
    ■ Row 1 - 28
    ■ Row 2 - 30
    ■ Row 3 - 28
    ■ Row 4 - 28
● While storing on disk, if you SORT and then store, you can also think of compression:
  ○ 28, 28, 28, 30 (sorted -> good for search now)
  ○ 28(3), 30 (now compressed -> 28 stored once)
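The sort-then-compress step is essentially run-length encoding. A minimal sketch on the four 'age' values follows; the function names are made up for this example, and real column stores choose among several encodings.

```python
from itertools import groupby


def run_length_encode(values):
    """Sort the column, then store each distinct value once with its count."""
    return [(value, len(list(group))) for value, group in groupby(sorted(values))]


def run_length_decode(runs):
    """Expand (value, count) pairs back into the sorted column."""
    return [value for value, count in runs for _ in range(count)]


ages = [28, 30, 28, 28]
encoded = run_length_encode(ages)

print(encoded)                     # [(28, 3), (30, 1)]
print(run_length_decode(encoded))  # [28, 28, 28, 30]
```

Sorting changes the row order of the column, so a real store also has to keep a way to map positions back to rows; that bookkeeping is left out here.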
Typically :
● MySQL / Postgres = ROW based
● Vertica / Infobright / Druid = COLUMN based
Approach 2.5
● Store Group of Columns TOGETHER but store each group separately.
● This is generally referred to as a COLUMN-family based DataStore.
Logically group the columns.
Typically: HBase / Cassandra
order-id-123
customer-1
5$ bill amount, Bugis Street
1$ tax, 3 items
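A rough sketch of the column-family idea, using nested dictionaries. The family names (`customer`, `billing`, `summary`) and the way columns are assigned to them are assumptions for illustration; they are not HBase or Cassandra APIs.

```python
# Each column family is stored separately; inside a family, the columns
# of one row are kept together under the row key.
store = {
    "customer": {"order-id-123": {"customer": "customer-1"}},
    "billing": {"order-id-123": {"bill_amount": 5, "location": "Bugis Street"}},
    "summary": {"order-id-123": {"tax": 1, "items": 3}},
}


def read_family(family, row_key):
    """Read one group of columns for a row without touching other groups."""
    return store[family][row_key]


def read_row(row_key):
    """Reassemble the full row by visiting every family."""
    merged = {}
    for family in store.values():
        merged.update(family.get(row_key, {}))
    return merged


print(read_family("billing", "order-id-123"))
print(read_row("order-id-123"))
```

Reading one family behaves like a small row store; reading the whole row costs one lookup per family.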
Base 2 : The Indexing
● What kind of data structure is used?
  ○ B-tree, Inverted Index, Fractal Tree, Clustered Key, BitMap, No Index ?
● Certain types of queries like certain indexes
  ○ Range queries like B-trees, inserts like Fractal Trees.
● What's the index loading mechanism?
  ○ Redis is memory bound.
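Why certain queries like certain indexes can be sketched with two toy structures. A sorted list searched with `bisect` stands in for a B-tree here (it is not a B-tree, but it shares the ordered-keys property that makes range scans cheap), and a dictionary of sets stands in for an inverted index. The order `order-id-125` and "Orchard Road" are made-up sample data.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict

bills = {"order-id-121": 2, "order-id-123": 5, "order-id-125": 9}
locations = {
    "order-id-121": "Bugis Street",
    "order-id-123": "Bugis Street",
    "order-id-125": "Orchard Road",
}

# Ordered index on bill amount: keys are kept sorted, so a range is one slice.
sorted_index = sorted((amount, order_id) for order_id, amount in bills.items())
index_keys = [amount for amount, _ in sorted_index]


def range_query(low, high):
    """Return order ids whose bill amount lies in [low, high]."""
    start = bisect_left(index_keys, low)
    stop = bisect_right(index_keys, high)
    return [order_id for _, order_id in sorted_index[start:stop]]


# Inverted index on location: each word maps to the orders containing it.
inverted_index = defaultdict(set)
for order_id, location in locations.items():
    for word in location.lower().split():
        inverted_index[word].add(order_id)

print(range_query(2, 5))
print(sorted(inverted_index["bugis"]))
```

The ordered index answers "bill between 2 and 5" without scanning everything, while the inverted index answers "which orders mention bugis" with a single lookup; neither is good at the other's job.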
Base 3 : The Theorem
● Most datastores do
  ○ Horizontal scaling
  ○ Sharding
● So here is the catch - in the event of a network partition,
  ○ How is Consistency / Availability handled ?
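A toy sketch of both ideas, under heavy simplification: `shard_for` shows hash-based sharding, and `write` shows the choice a two-replica store faces during a network partition. The `Replica` class and the `"CP"` / `"AP"` mode flag are hypothetical names for this example; real systems use quorums, leases and repair mechanisms that are not modelled here.

```python
import hashlib


def shard_for(key, shard_count):
    """Pick a shard deterministically from a hash of the key."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count


class Replica:
    """One copy of the data, held as a plain dictionary."""

    def __init__(self):
        self.data = {}


def write(key, value, local, remote, partitioned, mode):
    """Write through `local`; `mode` decides what happens under a partition."""
    if partitioned and mode == "CP":
        # Favour consistency: refuse the write rather than let copies diverge.
        return False
    local.data[key] = value
    if not partitioned:
        remote.data[key] = value
    # Under "AP" with a partition, the write is accepted and `remote` is stale.
    return True


a, b = Replica(), Replica()
print(shard_for("order-id-123", 4))
print(write("order-id-123", 5, a, b, partitioned=True, mode="CP"))  # False
print(write("order-id-123", 5, a, b, partitioned=True, mode="AP"))  # True
print(a.data == b.data)                                             # False
```

The consistent mode gives up availability, and the available mode gives up agreement between copies until the partition heals.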
Base 4 : Apart from the CAP theorem
● ACID ?
○ Transaction commit/Rollback support
● BASE ?
○ Basically Available , Soft State, Eventual Consistency ?
● Can I do joins if data is sharded ?
○ What about Distribution awareness ?
● The Query Interface (major concern ?)
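What "transaction commit / rollback support" buys you can be sketched with Python's built-in `sqlite3` module, used here only as a convenient ACID store. The `orders` table and its columns are made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id TEXT PRIMARY KEY, bill_amount INTEGER)")
conn.execute("INSERT INTO orders VALUES ('order-id-123', 5)")
conn.commit()

try:
    # Used as a context manager, the connection commits if the block
    # succeeds and rolls back if it raises.
    with conn:
        conn.execute(
            "UPDATE orders SET bill_amount = 50 WHERE order_id = 'order-id-123'"
        )
        # Duplicate primary key: this statement fails, so the UPDATE above
        # must not survive either.
        conn.execute("INSERT INTO orders VALUES ('order-id-123', 7)")
except sqlite3.IntegrityError:
    pass

bill = conn.execute(
    "SELECT bill_amount FROM orders WHERE order_id = 'order-id-123'"
).fetchone()[0]
print(bill)  # 5: the half-finished transaction was rolled back
```

A store in the BASE camp would typically not offer this all-or-nothing guarantee across multiple statements, which is one of the questions this base asks you to check.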
The bases...
So, try to cover the bases & decide if you need it.
PS: There is no Silver Bullet
Thank you.
Vishnu Rao
jaihind213
sweetweet213
mash213.wordpress.com
linkedin.com/in/213vishnu