Demystifying datastores

Vishnu Rao MySQL Enthusiast Doodle maker Senior Data Engineer @ DataSpark Formerly @ flipkart.com

Post on 11-Apr-2017

Page 1: Demystifying datastores

Vishnu Rao

MySQL Enthusiast, Doodle maker

Senior Data Engineer @ DataSpark

Formerly @ flipkart.com

Page 2: Demystifying datastores

The comma separated list ...

● Hadoop, HBase, RocksDB
● MySQL, MariaDB, Postgres
● Cassandra, MongoDB
● Druid, Redis, MemSQL
● Elasticsearch, Solr
● CockroachDB, CouchDB
● Vertica, Infobright
● Redshift, DynamoDB
● S3, OpenStack Swift ...

Page 4: Demystifying datastores

The FUN-damental question: Which one should I use?

Page 5: Demystifying datastores

Demystifying Datastores

Page 6: Demystifying datastores

Let's try to look at the problem from the database's point of view

Page 7: Demystifying datastores

First, let's play some baseball ...

Page 13: Demystifying datastores

Base 0 : The Data itself

● Row having columns
● Key - Value
  ○ Key - Blob (think object)
  ○ Key - Document (think JSON / XML)
● Graph (nodes / edges, kind of like key-value)
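
As a rough illustration of the shapes listed above, the same order can be written as a row, as a key-blob pair, as a key-document pair and as a tiny graph. The column names, the `PLACED` edge label and the use of `pickle` / `json` are assumptions made for this sketch, not something the slides prescribe.

```python
import json
import pickle

# Row having columns: a fixed set of named columns (hypothetical schema).
columns = ("order_id", "customer", "bill_amount", "street")
row = ("order-id-123", "customer-1", 5, "Bugis Street")

# Key - Blob: the store sees only opaque bytes; the application decodes them.
key_blob = {"order-id-123": pickle.dumps(dict(zip(columns, row)))}

# Key - Document: the value is structured text (JSON here) that a store could look inside.
key_document = {"order-id-123": json.dumps(dict(zip(columns, row)))}

# Graph: nodes and edges, each of which is itself close to a key-value pair.
nodes = {"order-id-123": {"bill_amount": 5}, "customer-1": {}}
edges = [("customer-1", "PLACED", "order-id-123")]

decoded_blob = pickle.loads(key_blob["order-id-123"])
decoded_doc = json.loads(key_document["order-id-123"])
```

Both decoded values carry the same fields; what differs is how much of the structure the datastore itself can see.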

Page 16: Demystifying datastores

Base 1 : How is the Data Stored ?

Let’s consider a Sample Data Record/Row

order-id-123 | customer-1 | 5$ bill amount | Bugis Street | 1$ tax | 3 items

● All values : columns / attributes
● order-id-123 : possible primary key column

Page 18: Demystifying datastores

Base 1 : How is the Data Stored ?

Approach 1

● Store all columns of the row side by side (i.e. TOGETHER) on disk.

● This is generally referred to as a ROW based DataStore.

Page 20: Demystifying datastores

Base 1 : How is the Data Stored ?

Approach 1

● Useful for use cases like “showing the ENTIRE order on the UI”

● The columns sit together on disk, so the entire row can typically be fetched in one disk access

order-id-123 | customer-1 | 5$ bill amount | Bugis Street | 1$ tax | 3 items
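
A minimal sketch of the row-based idea, assuming a made-up fixed-width layout (`ROW_FORMAT`) and an in-memory `bytes` object standing in for the disk file: every column of a row is packed next to the others, so one contiguous slice returns the whole order.

```python
import struct

# Hypothetical fixed-width layout: order id, customer, bill, tax, items, street.
ROW_FORMAT = "<16s16sIII16s"
ROW_SIZE = struct.calcsize(ROW_FORMAT)

def pack_row(order_id, customer, bill, tax, items, street):
    """Serialise all columns of one row side by side."""
    return struct.pack(ROW_FORMAT, order_id.encode(), customer.encode(),
                       bill, tax, items, street.encode())

def read_row(data, row_number):
    """Fetch one whole row with a single contiguous slice (one 'disk access')."""
    start = row_number * ROW_SIZE
    fields = struct.unpack(ROW_FORMAT, data[start:start + ROW_SIZE])
    # Strip the zero padding that struct adds to the fixed-width text columns.
    return tuple(f.rstrip(b"\0").decode() if isinstance(f, bytes) else f
                 for f in fields)

# Two rows stored back to back, as a row store would lay them out.
table = (pack_row("order-id-123", "customer-1", 5, 1, 3, "Bugis Street")
         + pack_row("order-id-121", "customer-1", 2, 2, 1, "Bugis Street"))
```

Calling `read_row(table, 1)` returns every column of the second order from one slice, which is the access pattern the "show the entire order" use case wants.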

Page 22: Demystifying datastores

Base 1 : How is the Data Stored ?

Approach 2

● Store Columns SEPARATELY, so that they can be accessed independently.

● This is generally referred to as a COLUMN based DataStore.

Page 24: Demystifying datastores

Base 1 : How is the Data Stored ?

Approach 2

● Useful for aggregates like Avg(billing_amount) or Sum(items)

● Instead of fetching the entire row, fetch only the columns needed for the computation
  ○ i.e. less data fetched from disk = REDUCED IO

order-id-123 | customer-1 | 5$ bill amount | Bugis Street | 1$ tax | 3 items

order-id-121 | customer-1 | 2$ bill amount | Bugis Street | 2$ tax | 1 item
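
The same two orders in a column-based layout can be sketched as follows. Each list stands in for a separately stored column, and the `columns_read` log is a hypothetical stand-in for disk IO, added only to show which columns an aggregate touches.

```python
# Each column is stored as its own sequence (a stand-in for its own file on disk).
column_store = {
    "order_id": ["order-id-123", "order-id-121"],
    "customer": ["customer-1", "customer-1"],
    "bill_amount": [5, 2],
    "tax": [1, 2],
    "items": [3, 1],
    "street": ["Bugis Street", "Bugis Street"],
}

columns_read = []  # hypothetical IO log: which columns were fetched

def read_column(name):
    """Return one column and record that it was touched."""
    columns_read.append(name)
    return column_store[name]

def avg(name):
    """Average over a single column, reading nothing else."""
    values = read_column(name)
    return sum(values) / len(values)

avg_bill = avg("bill_amount")
total_items = sum(read_column("items"))
```

Only `bill_amount` and `items` end up in `columns_read`; the other four columns are never fetched, which is where the reduced IO comes from.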

Page 25: Demystifying datastores

Base 1 : How is the Data Stored ?

Approach 2

● What are the other optimisations for a column store?
  ○ Imagine 4 rows with a column, say ‘age’
    ■ Row 1 - 28
    ■ Row 2 - 30
    ■ Row 3 - 28
    ■ Row 4 - 28

Page 26: Demystifying datastores

Base 1 : How is the Data Stored ?

Approach 2

● While storing on disk, if you SORT and then store, you can also think of compression:

28,28,28,30 (sorted -> good for search now)

28(3),30 (now compressed -> 28 stored once)
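
The sort-then-compress step above is essentially run-length encoding. Here is a small sketch using the four `age` values from the slide; the function names are made up, and a real column store would also need a way to map sorted values back to their rows, which this sketch ignores.

```python
from itertools import groupby

def run_length_encode(values):
    """Sort the column, then collapse each run of equal values into (value, count)."""
    return [(value, len(list(run))) for value, run in groupby(sorted(values))]

def run_length_decode(pairs):
    """Expand (value, count) pairs back into the sorted column."""
    return [value for value, count in pairs for _ in range(count)]

ages = [28, 30, 28, 28]
encoded = run_length_encode(ages)
```

Here `encoded` holds two pairs instead of four values, matching the slide's `28(3),30`.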

Page 27: Demystifying datastores

Base 1 : How is the Data Stored ?

Typically :

● MySQL / Postgres = ROW based
● Vertica / Infobright / Druid = COLUMN based

Page 29: Demystifying datastores

Base 1 : How is the Data Stored ?

Approach 2.5

● Store Group of Columns TOGETHER but store each group separately.

● This is generally referred to as a COLUMN-family based DataStore.

Page 31: Demystifying datastores

Base 1 : How is the Data Stored ?

Approach 2.5

Logically group the columns.

Typically: HBase / Cassandra

● Group: order-id-123
● Group: customer-1
● Group: 5$ bill amount, Bugis Street
● Group: 1$ tax, 3 items
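
One way to picture a column-family layout is a separate map per family, keyed by the row key. The family names (`customer`, `bill`, `charges`) and the `put` / `get` helpers below are hypothetical; they are not the HBase or Cassandra API, only a sketch of the grouping shown on the slide.

```python
# Hypothetical families, following the grouping on the slide.
FAMILIES = {
    "customer": ("customer",),
    "bill": ("bill_amount", "street"),
    "charges": ("tax", "items"),
}

# One dict per family stands in for one file (or set of files) per family.
family_store = {family: {} for family in FAMILIES}

def put(row_key, row):
    """Split a row by family; each family keeps only its own columns."""
    for family, cols in FAMILIES.items():
        family_store[family][row_key] = {c: row[c] for c in cols}

def get(row_key, family):
    """Read one family of a row without touching the other families."""
    return family_store[family][row_key]

put("order-id-123", {"customer": "customer-1", "bill_amount": 5,
                     "street": "Bugis Street", "tax": 1, "items": 3})
```

Columns that are usually read together share a family, so `get("order-id-123", "charges")` touches tax and items but not the street.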

Page 36: Demystifying datastores

Base 2 : The Indexing

● What kind of data structure is used?
  ○ B-tree, Inverted Index, Fractal Tree, Clustered Key, Bitmap, No Index?

● Certain types of queries like certain indexes
  ○ Range queries like B-trees; inserts like Fractal Trees.

● What's the index loading mechanism?
  ○ Redis is memory bound.
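
To make "certain queries like certain indexes" concrete, here is a sketch of two index shapes. A sorted list with binary search stands in for the ordered keys of a B-tree (it is not a real B-tree), and a dictionary of sets plays the inverted index. The third order (`order-id-130`) and the `Orchard Road` address are invented sample data.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict

# Sorted keys stand in for the ordered leaves of a B-tree: range scans are cheap.
sorted_bills = sorted([(5, "order-id-123"), (2, "order-id-121"), (9, "order-id-130")])
bill_keys = [bill for bill, _ in sorted_bills]

def range_query(low, high):
    """Return order ids with low <= bill <= high using binary search."""
    start = bisect_left(bill_keys, low)
    end = bisect_right(bill_keys, high)
    return [order_id for _, order_id in sorted_bills[start:end]]

# Inverted index: term -> set of ids, the structure behind text search.
documents = {"order-id-123": "Bugis Street", "order-id-130": "Orchard Road"}
inverted = defaultdict(set)
for doc_id, text in documents.items():
    for term in text.lower().split():
        inverted[term].add(doc_id)
```

The ordered structure answers "bills between 2 and 5" with two binary searches, while the inverted index answers "which orders mention bugis" with one lookup; neither is good at the other's job.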

Page 38: Demystifying datastores

Base 3 : The (CAP) Theorem

● Most datastores do
  ○ Horizontal scaling
  ○ Sharding

● So here is the catch: in the event of a network partition,
  ○ How is Consistency / Availability handled?
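
A toy sketch of both points, under heavy assumptions: `shard_for` shows plain hash sharding over three made-up shard names, and `read` shows the choice a replica faces when it cannot reach its peers. The `Replica` class and the `prefer_consistency` flag are hypothetical and do not model any particular datastore.

```python
import hashlib

SHARDS = ["shard-0", "shard-1", "shard-2"]  # hypothetical node names

def shard_for(key):
    """Hash sharding: a stable hash of the key picks the owning shard."""
    digest = hashlib.sha256(key.encode()).digest()
    return SHARDS[int.from_bytes(digest[:8], "big") % len(SHARDS)]

class Replica:
    def __init__(self):
        self.data = {}
        self.can_reach_peers = True  # False simulates a network partition

def read(replica, key, prefer_consistency):
    """During a partition, either refuse the read (consistency)
    or answer from local, possibly stale data (availability)."""
    if not replica.can_reach_peers and prefer_consistency:
        raise RuntimeError("partitioned: refusing a possibly stale read")
    return replica.data.get(key)
```

When evaluating a datastore, the question from the slide becomes: which branch of `read` does it take during a partition, and can that be configured?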

Page 43: Demystifying datastores

Base 4 : Apart from CAP theorem

● ACID?
  ○ Transaction commit / rollback support

● BASE?
  ○ Basically Available, Soft state, Eventual consistency?

● Can I do joins if the data is sharded?
  ○ What about distribution awareness?

● The query interface (a major concern?)
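
For the ACID question, commit / rollback support can be checked with a small experiment. This sketch uses Python's built-in `sqlite3` module with an in-memory database; the `orders` table and the `order-id-999` row are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id TEXT PRIMARY KEY, bill_amount INTEGER)")

def place_orders(rows):
    """Insert all rows or none: commit on success, roll back on any error."""
    try:
        with conn:  # commits on success, rolls back if the block raises
            conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
        return True
    except sqlite3.IntegrityError:
        return False

def count_orders():
    return conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]

ok = place_orders([("order-id-123", 5), ("order-id-121", 2)])
# The duplicate primary key fails, so order-id-999 is rolled back with it.
failed = place_orders([("order-id-999", 7), ("order-id-123", 5)])
```

After both calls the table still holds only the first two orders: the second batch was applied atomically, which here means not at all.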

Page 44: Demystifying datastores

The bases...

Page 45: Demystifying datastores

So, try to cover the bases & decide if you need it.

PS: There is no Silver Bullet

Page 46: Demystifying datastores

Thank you.

Vishnu Rao

jaihind213

sweetweet213mash213.wordpress.com

linkedin.com/in/213vishnu