
Finding the Right Data Solution for Your Application in the Data Storage Haystack

Srinath Perera Ph.D. Senior Software Architect, WSO2 Inc. Visiting Faculty, University of Moratuwa

Research Scientist, Lanka Software Foundation

Data Models

§  Many data models have been proposed (read Stonebraker’s “What Goes Around Comes Around” for more details)
    o  Hierarchical (IMS): late 1960s and 1970s
    o  Directed graph (CODASYL): 1970s
    o  Relational: 1970s and early 1980s
    o  Entity-Relationship: 1970s
    o  Extended Relational: 1980s
    o  Semantic: late 1970s and 1980s

§  For the last 20-30 years, relational database systems (SQL) together with transactions have been the de facto data solution.

Copyright Greg Morss and licensed for reuse under CC License , http://www.geograph.org.uk/photo/990700

For many years, the choice of data storage was an easy one (use an RDBMS).

Copyright by Alan Murray Walsh and licensed for reuse under CC License , http://www.geograph.org.uk/photo/1652880

Scale of Systems

§  However, the scale of systems is changing due to
    o  Increasing user bases
    o  Mobile devices and online presence
    o  Cloud computing and multicore systems

§  Scaling up an RDBMS
    o  Put it on a bigger machine.
    o  Replicate (cluster) the database to 2-3 more nodes. But this approach does not scale much further.
    o  Partition the data across many nodes (distribute, a.k.a. sharding). However, JOIN queries across many nodes are hard, and sometimes too slow. This often needs custom code and configuration. Transactions also do not scale well across partitions.
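The partitioning bullet above can be illustrated with a minimal Python sketch. It assumes simple hash partitioning; the names `shard_for` and `shards_touched_by_join` and the sample users and orders are hypothetical and not from the talk. It shows why a JOIN may need to contact several nodes once related rows are partitioned by different keys.

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard using a stable hash (built-in hash() is salted per process)."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# Hypothetical data: users and orders, each partitioned by its own primary key.
NUM_SHARDS = 4
users = {"alice": "Colombo", "bob": "Kandy"}
orders = {"order-1": "alice", "order-2": "bob", "order-3": "alice"}

user_shards = {u: shard_for(u, NUM_SHARDS) for u in users}
order_shards = {o: shard_for(o, NUM_SHARDS) for o in orders}

def shards_touched_by_join(user: str) -> set:
    """Shards that an 'orders JOIN users' query for one user would have to contact."""
    touched = {user_shards[user]}
    touched.update(order_shards[o] for o, owner in orders.items() if owner == user)
    return touched

print(shards_touched_by_join("alice"))
```

A common mitigation, under the same assumptions, is to partition orders by the user key so that related rows land on the same shard.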

Copyright digitalART2 and licensed for reuse under CC License , http://www.flickr.com/photos/digitalart/2101765353/

CAP Theorem, Transactions, and Storage

§  The RDBMS model provides two things
    o  The relational model with SQL
    o  ACID transactions (Atomicity, Consistency, Isolation, Durability)

§  It was a classic one-size-fits-all solution, and it worked for quite some time.

§  However, the CAP theorem says that you cannot have it all.
    o  Consistency, Availability, and Partition Tolerance: pick two!

§  But many use cases do not need all RDBMS features, and when those features are dropped, systems can scale (e.g. Google Bigtable).

§  However, to use such systems, one has to understand and exploit application-specific behavior.
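To make the ACID bullet concrete, here is a small sketch using Python's standard `sqlite3` module on a single node. The `accounts` table and the `transfer` helper are hypothetical illustrations, not part of the talk. It shows atomicity: both updates of a transfer commit together, or neither does.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move money between two accounts; both updates commit or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)
            ).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
        return True
    except ValueError:
        return False

def balances(conn):
    """Return {account name: balance} for every account."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("a", 100), ("b", 0)])
conn.commit()
```

Providing the same guarantee across many partitioned nodes requires coordination between them, which is the part that becomes expensive at scale.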

Copyright stephcarter and licensed for reuse under CC License , http://www.flickr.com/photos/stephcarter/541464462

NoSQL and other Storage Systems

§  Large internet companies hit the problem first. They built systems specific to their problems, and those systems did scale.
    o  Google Bigtable
    o  Amazon Dynamo

§  Soon many others followed, and most of them are free and open source.

§  Now there are a couple of dozen.

§  Among the advantages of NoSQL are
    o  Scalability
    o  Flexible schema
    o  Designed to scale and support fault tolerance out of the box

Copyright ind{yeah} and licensed for reuse under CC License , http://www.flickr.com/photos/flickcoolpix/3566848458/

However, with NoSQL solutions, choosing a data store is no longer simple.

Copyright Philipp Salzgeber on and licensed for reuse under CC License http://www.salzgeber.at/astro/pics/20081126_heart/index.html

Selecting the Right Data Solution

§  What are the right questions to ask?
§  Categorize the answers for each question.
§  Take different cases based on the different answers and make recommendations!

Copyright by Krzysztof Poltorak, and licensed for reuse under CC License. http://www.fotocommunity.com/pc/pc/display/22077920

What are the right Questions?

o  Types of data
    -  Structured, Semi-Structured, Unstructured

o  Need for Scalability
    -  Number of users
    -  Number of data items
    -  Size of files
    -  Read/Write ratio

o  Types of Queries
    -  Retrieve by Key
    -  WHERE clauses
    -  JOIN queries
    -  Offline Queries

o  Consistency
    -  Loose Consistency
    -  Single Operation Consistency
    -  Transactions

Copyright by romainguy, and licensed for reuse under CC License http://www.flickr.com/photos/romainguy/249370084

Unstructured Data

§  Such data is often kept in storage but consumed by humans at the end of the pipeline (e.g. a document repository).

§  One common use case is building structured data from unstructured data.

§  Metadata is often associated with the data to help searching.

Copyright Martyn Gorman and licensed for reuse under CC License, http://www.geograph.org.uk/photo/294134

§  The data does not have a particular structure and is often retrieved through a key (name).
    o  E.g. file systems.

§  Humans are good at processing unstructured data, but computers are not.

Structured Data

§  Has a structure, often described through a schema.
§  Often a table-like 2D structure is used, but other structures are also possible.
§  The main advantage of the structure is search.

Copyright Marion Doss by and licensed for reuse under CC License , http://www.flickr.com/photos/ooocha/2611398859/

§  The schema can be provided at deployment time or at runtime (dynamic schema).

§  The schema can be used to
    o  Validate data
    o  Support user-friendly search
    o  Optimize storage and queries
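As an illustration of the "validate data" use of a schema, the following Python sketch checks records against a field-to-type mapping. The `SCHEMA` contents and the `validate` function are hypothetical and kept deliberately simple; real systems use richer schema languages.

```python
# Hypothetical schema: field name -> expected Python type.
SCHEMA = {"id": int, "name": str, "email": str}

def validate(record: dict, schema: dict) -> list:
    """Return a list of problems; an empty list means the record matches the schema."""
    problems = []
    for field, expected in schema.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            problems.append(f"{field}: expected {expected.__name__}")
    for field in record:
        if field not in schema:
            problems.append(f"unknown field: {field}")
    return problems

print(validate({"id": 1, "name": "alice", "email": "alice@example.org"}, SCHEMA))
print(validate({"id": "1", "name": "alice"}, SCHEMA))
```

A store with a flexible schema would typically skip or relax the "unknown field" check.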

Semi-structured Data

§  The structure is not fully defined, but there is some inherent structure.

§  For example
    o  XML documents, where data is stored in a tree-like structure
    o  Graph data
    o  Data structures like lists and arrays

§  Supports queries based on structure.

§  But processing the data often needs custom code.
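The last two bullets can be sketched with Python's standard `xml.etree.ElementTree`, which supports a limited subset of XPath. The sample document and its element names are hypothetical. The first query relies on the tree structure alone, while the second needs custom code because a numeric comparison is outside that XPath subset.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document.
DOC = """
<library>
  <book year="1999"><title>First Title</title></book>
  <book year="2011"><title>Second Title</title></book>
</library>
"""

root = ET.fromstring(DOC)

# Structural query: ElementTree's XPath subset supports attribute predicates.
titles_2011 = [b.findtext("title") for b in root.findall("./book[@year='2011']")]

# Anything beyond the path language needs custom code, e.g. a numeric comparison.
titles_after_2000 = [
    b.findtext("title") for b in root.findall("book") if int(b.get("year")) > 2000
]

print(titles_2011, titles_after_2000)
```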

Copyright Walter Baxter http://www.geograph.org.uk/photo/1069339

Search

§  Unstructured Data: no structure to support search.
    o  Search based on an inverted (reverse) index
    o  Search through properties

§  Semi-Structured Data
    o  XML, or any tree-like structure, can be searched with XPath or XQuery.
    o  Tuple spaces can be queried through tuple space templates.
    o  Data registries can be searched for entries that match given metadata descriptions (search by properties).
    o  Graphs can be queried based on connectivity.

§  Structured Data
    o  Retrieve by Key
    o  WHERE clauses
    o  Queries with JOINs
    o  Offline Queries
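For the unstructured case, index-based search can be sketched as follows. This is a minimal Python illustration with whitespace tokenization; `build_index`, `search`, and the sample documents are hypothetical names chosen for this example.

```python
from collections import defaultdict

def build_index(docs: dict) -> dict:
    """Map each lowercase word to the set of document names that contain it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for word in text.lower().split():
            index[word].add(name)
    return index

def search(index: dict, query: str) -> set:
    """Return the documents containing every word of the query (AND semantics)."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical documents keyed by file name.
docs = {
    "a.txt": "relational databases support joins",
    "b.txt": "column families support high write throughput",
}
index = build_index(docs)
print(search(index, "support joins"))
```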

Copyright digitalART2 and licensed for reuse under CC License , http://www.flickr.com/photos/digitalart/2101765353/

Consistency and Scalability

§  Scalability: the ability to handle more users, more data, or larger files by adding more nodes. We will use 3 categories.
    o  Small systems (can be handled with 1-3 nodes)
    o  Scalable systems (can be handled with about 10 nodes)
    o  Highly scalable systems (anything larger, can be 100s or 1000s of nodes)

Copyright NNSANews and licensed for reuse under CC License , http://www.flickr.com/photos/nnsanews/5347287260/

§  Consistency: how copies of the same data on many nodes (replicas) are kept in sync, and how they can be updated without corrupting the data. We will use 3 categories.
    o  Transactional: a series of operations is applied in an ACID manner
    o  Atomic operation: a single operation is applied to all replicas
    o  Eventual consistency: the data will eventually become consistent
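Eventual consistency can be illustrated with a small Python sketch. It assumes a last-write-wins rule with caller-supplied timestamps, which is only one of several possible conflict resolution strategies. The `Replica` class is hypothetical and not the API of any particular store.

```python
class Replica:
    """Stores key -> (timestamp, value); the highest timestamp wins on conflict."""

    def __init__(self):
        self.data = {}

    def write(self, key, value, timestamp):
        current = self.data.get(key)
        if current is None or timestamp > current[0]:
            self.data[key] = (timestamp, value)

    def read(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]

    def sync_from(self, other):
        """Anti-entropy step: pull every entry from another replica."""
        for key, (timestamp, value) in other.data.items():
            self.write(key, value, timestamp)

a, b = Replica(), Replica()
a.write("cart", ["book"], timestamp=1)
b.write("cart", ["book", "pen"], timestamp=2)

stale = a.read("cart")  # replica a has not yet seen the newer write
a.sync_from(b)
b.sync_from(a)
```

Between the write to `b` and the sync, a reader of `a` sees stale data. After the sync both replicas agree, which is the trade-off that lets such systems accept writes without coordinating every replica.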

Data Storage Alternatives

Data Storage Implementations

§  Expectations from data storage
    o  Reliably store the data
    o  Efficient search and retrieval of data whenever needed
    o  Data management: delete and update data

Copyright John Atherton by and licensed for reuse under CC License , http://www.flickr.com/photos/gbaku/2231332836/

Challenges of Data Storage

§  Reliability
    o  Replicating data
    o  Creating backups and recovering from backups

§  Security

§  Scaling and parallel access
    o  Distribution or replication
    o  ACID transactions

§  Availability
    o  Data replication

§  Vendor lock-in
    o  Interoperability, standard query languages

§  Simple user experience
    o  Hide the physical location of data
    o  Provide simple API and security models
    o  Expressive query languages

Data Storage Choices

| Data type | Storage Type | Advantages | Disadvantages | Queries: Key | Queries: Where | Queries: Joins | Transactions | Scale | Flexible schema |
|---|---|---|---|---|---|---|---|---|---|
| Structured | Local memory | Very fast | Not durable | Yes | No | No | No, unless STMs | No | Yes |
| | Relational/SQL | Standardized | Rigid schema, good for read-oriented use cases | Yes | Yes | Yes | Yes | Moderate | No |
| | Column families (NoSQL) | High write performance, replicated | Not transactional, no online joins | Yes | Yes, secondary index | No | No | High | Yes |
| | Document DBs | Replicated | Not transactional, no online joins | Yes | Yes, views | No | No | Yes | Yes |
| | Object Databases | Easy to integrate with programming languages | | Yes | Yes | Yes | Yes | No | No |

| Data type | Storage Type | Advantages | Disadvantages | Queries: Key | Queries: Search | Transactions | Scale | Flexible schema |
|---|---|---|---|---|---|---|---|---|
| Unstructured | Files | Save big files whose format is not understood | No structured search on content | Yes | Indexing | No | Moderate | Yes |
| | Data Registries / Metadata Catalogs | Metadata search | | Yes | Property-based search (Where) | No | Moderate | Yes |
| Semi-structured | Queues | Representation of a flow of messages over time / tasks | | Yes | N/A | No | Yes | Yes |
| | Triple Stores | Used for inference, very fast relationship processing | | Yes | Relationship search | No | No | Yes |
| | XML database | XML native | | | XPath / XQuery | | | |
| | Distributed Cache | Fast, replicated | No search | Yes | No | No | Yes | Yes |
| | Key-value pairs | Replicated | Model is too simple in some cases, not transactional | Yes | No | No | Yes | Yes |
| | Graph DBs | Very fast joins, natural to represent relationships | Not very scalable | Yes | Graph search | Yes | Low | N/A |

Choosing the Right Data Solution

How do We do this?

§  Consider structured, semi-structured, and unstructured data separately.
    o  Then drill down based on the other 3 properties: scale, consistency, and search.
§  The structured case is more complicated; the other two are a bit simpler.
§  Start by giving a de facto choice for each case.

Copyright 8664 and licensed for reuse under CC License , http://www.flickr.com/photos/80464769@N00/186598462/

Handling Structured Data

§  There are three main considerations: scale, consistency, and queries.

| Query | Small (1-3 nodes): Loose Consistency | Small: Operation Consistency | Small: ACID Transactions | Scalable (10 nodes): Loose Consistency | Scalable: Operation Consistency | Scalable: ACID Transactions | Highly Scalable (1000s of nodes): Loose Consistency | Highly Scalable: Operation Consistency | Highly Scalable: ACID Transactions |
|---|---|---|---|---|---|---|---|---|---|
| Primary Key | DB/KV/CF | DB/KV/CF | DB | KV/CF | KV/CF | Partitioned DB? | KV/CF | KV/CF | No |
| Where | DB/CF/Doc | DB/CF/Doc | DB | CF/Doc(?) | CF/Doc(?) | Partitioned DB? | CF/Doc | CF/Doc | No |
| JOIN | DB | DB | DB | ?? | ?? | ?? | No | No | No |
| Offline | DB/CF/Doc | DB/CF/Doc | DB/CF/Doc | CF/Doc | CF/Doc | No | CF/Doc | CF/Doc | No |

*KV: Key-Value Systems, CF: Column Families, Doc: Document-based Systems
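The matrix on this slide can also be read as a lookup: pick a query type, a scale, and a consistency level, and read off the candidate stores. A minimal Python sketch of that lookup follows. The cell values are copied from the slide, while the names `MATRIX`, `SCALES`, `CONSISTENCY`, and `recommend` are hypothetical.

```python
SCALES = ("small", "scalable", "highly_scalable")
CONSISTENCY = ("loose", "operation", "acid")

# One row per query type; cells run left to right across the slide's matrix:
# small (loose, operation, acid), scalable (...), highly scalable (...).
MATRIX = {
    "primary_key": ["DB/KV/CF", "DB/KV/CF", "DB",
                    "KV/CF", "KV/CF", "Partitioned DB?",
                    "KV/CF", "KV/CF", "No"],
    "where": ["DB/CF/Doc", "DB/CF/Doc", "DB",
              "CF/Doc(?)", "CF/Doc(?)", "Partitioned DB?",
              "CF/Doc", "CF/Doc", "No"],
    "join": ["DB", "DB", "DB",
             "??", "??", "??",
             "No", "No", "No"],
    "offline": ["DB/CF/Doc", "DB/CF/Doc", "DB/CF/Doc",
                "CF/Doc", "CF/Doc", "No",
                "CF/Doc", "CF/Doc", "No"],
}

def recommend(query: str, scale: str, consistency: str) -> str:
    """Look up the slide's suggestion for one (query, scale, consistency) case."""
    column = SCALES.index(scale) * len(CONSISTENCY) + CONSISTENCY.index(consistency)
    return MATRIX[query][column]

print(recommend("where", "scalable", "loose"))
```

Cells marked `?` or `??` on the slide are kept verbatim, since the talk treats them as open questions rather than recommendations.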

Handling Small Scale Systems (1-3 nodes)

§  In general, using a DB for every case here might work.
§  Reasons for using options other than a DB
    o  There is a potential need to scale later.
    o  High write throughput is needed.
§  KV is 1-D, whereas the other two are 2-D.

Small (1-3 nodes)

| Query | Loose Consistency | Operation Consistency | ACID Transactions |
|---|---|---|---|
| Primary Key | DB/KV/CF | DB/KV/CF | DB |
| Where | DB/CF/Doc | DB/CF/Doc | DB |
| JOIN | DB | DB | DB |
| Offline | DB/CF/Doc | DB/CF/Doc | DB/CF/Doc |


Handling Scalable Systems

§  KV, CF, and Doc can easily handle this case.
§  If DBs are used with data sharded across many nodes
    o  Transactions might work, provided that a single transaction does not involve too many participants.
    o  JOINs might need to transfer too much data between nodes.
    o  In-memory DBs like VoltDB should also be considered.
§  Offline mode will work.
§  Most systems let users choose the consistency level, and loose consistency can scale more (e.g. Cassandra).

Scalable (10 nodes)

| Query | Loose Consistency | Operation Consistency | ACID Transactions |
|---|---|---|---|
| Primary Key | KV/CF | KV/CF | Partitioned DB? |
| Where | CF/Doc | CF/Doc | Partitioned DB? |
| JOIN | ?? | ?? | Partitioned DB?? |
| Offline | CF/Doc | CF/Doc | No |

*KV: Key-Value Systems, CF: Column Families, Doc: Document-based Systems

Highly Scalable Systems

§  Transactions do not work at this scale (CAP theorem).

§  The same holds for JOINs. The problem is that sometimes too much data needs to be transferred between nodes to perform the JOIN.

§  The offline case is handled through Map-Reduce. Even the JOIN case is OK offline, since there is time.

Highly Scalable (1000s of nodes)

| Query | Loose Consistency | Operation Consistency | ACID Transactions |
|---|---|---|---|
| Primary Key | KV/CF | KV/CF | No |
| Where | CF/Doc | CF/Doc | No |
| JOIN | No | No | No |
| Offline | CF/Doc | CF/Doc | No |
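The offline JOIN through Map-Reduce can be sketched in a single Python process as a reduce-side join. This is an illustration of the idea only, not the API of Hadoop or any other framework. The function names and the sample users and orders are hypothetical.

```python
from collections import defaultdict

def map_phase(users, orders):
    """Emit (join_key, tagged_record) pairs from both inputs."""
    for user_id, name in users:
        yield user_id, ("user", name)
    for order_id, user_id in orders:
        yield user_id, ("order", order_id)

def shuffle(pairs):
    """Group values by key, as the framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """For each key, pair every user record with every order record."""
    joined = []
    for key in sorted(groups):
        names = [v for tag, v in groups[key] if tag == "user"]
        order_ids = [v for tag, v in groups[key] if tag == "order"]
        joined.extend((name, order_id) for name in names for order_id in order_ids)
    return joined

# Hypothetical inputs: (user_id, name) and (order_id, user_id).
users = [(1, "alice"), (2, "bob")]
orders = [("o1", 1), ("o2", 2), ("o3", 1)]
result = reduce_phase(shuffle(map_phase(users, orders)))
print(result)
```

The shuffle step is where data moves between nodes in a real cluster, which is why this approach suits offline jobs better than interactive queries.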


Highly Scalable Systems + Primary Key Retrieval

§  This is (comparatively) the easy one.

§  It can be solved through DHT (Distributed Hash Table) based solutions or architectures like OceanStore.

§  Both Key-Value storage (KV) and Column Families (CF) can be used. But the Key-Value model is preferred as it is more scalable.

| Query | Loose Consistency | Operation Consistency | ACID Transactions |
|---|---|---|---|
| Primary Key | KV/CF | KV/CF | No |
| Where | CF/Doc(?) | CF/Doc(?) | No |
| JOIN | No | No | No |

*KV: Key-Value Systems, CF: Column Families, Doc: Document-based Systems
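Key placement in DHT-style systems is commonly done with consistent hashing. The Python sketch below shows that idea under simplifying assumptions (virtual nodes on a hash ring, no replication or failure handling). It is not the design of OceanStore or of any specific product, and `HashRing` and the node names are hypothetical.

```python
import bisect
import hashlib

def _hash(value: str) -> int:
    """Stable 64-bit hash of a string."""
    return int.from_bytes(hashlib.sha1(value.encode("utf-8")).digest()[:8], "big")

class HashRing:
    """Consistent hashing ring with virtual nodes."""

    def __init__(self, nodes, vnodes=64):
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key: str) -> str:
        """Return the node owning the first ring point clockwise from the key's hash."""
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]

old_ring = HashRing(["n1", "n2", "n3"])
new_ring = HashRing(["n1", "n2", "n3", "n4"])
keys = [f"key-{i}" for i in range(200)]
moved = [k for k in keys if old_ring.node_for(k) != new_ring.node_for(k)]
print(len(moved), "of", len(keys), "keys moved after adding n4")
```

When a node is added, keys either stay where they were or move to the new node, so only a fraction of the data has to be transferred.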

Highly Scalable Systems + WHERE

§  This is generally OK, but tricky.
§  CF works through a secondary index that does scatter-gather (e.g. Cassandra).
§  Doc works through Map-Reduce views (e.g. CouchDB).
§  There is also Bissa, which builds an index for all possible queries (no range queries).
§  If you are doing this, you should do pilot runs and make sure things work.

| Query | Loose Consistency | Operation Consistency | Transactions |
|---|---|---|---|
| Primary Key | KV/CF | KV/CF | No |
| Where | CF/Doc(?) | CF/Doc(?) | No |
| JOIN | No | No | No |

*KV: Key-Value Systems, CF: Column Families, Doc: Document-based Systems
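The scatter-gather pattern behind a partitioned secondary index can be sketched as follows. This is a simplified single-process Python illustration, not how Cassandra implements it. The `Partition` and `Cluster` classes and the sample rows are hypothetical.

```python
from collections import defaultdict

class Partition:
    """One node: primary-key storage plus a local secondary index."""

    def __init__(self):
        self.rows = {}
        self.index = defaultdict(set)  # (column, value) -> set of row keys

    def put(self, key, row):
        old = self.rows.get(key)
        if old is not None:  # drop stale index entries before overwriting
            for column, value in old.items():
                self.index[(column, value)].discard(key)
        self.rows[key] = dict(row)
        for column, value in row.items():
            self.index[(column, value)].add(key)

    def where(self, column, value):
        return {key: self.rows[key] for key in self.index.get((column, value), ())}

class Cluster:
    def __init__(self, num_partitions):
        self.partitions = [Partition() for _ in range(num_partitions)]

    def _partition_for(self, key):
        # Simple stable placement for the sketch; real systems use a hash ring.
        return self.partitions[sum(key.encode("utf-8")) % len(self.partitions)]

    def put(self, key, row):
        self._partition_for(key).put(key, row)

    def get(self, key):
        """Primary-key lookup touches exactly one partition."""
        return self._partition_for(key).rows.get(key)

    def where(self, column, value):
        """WHERE lookup: scatter to every partition, then gather the matches."""
        matches = {}
        for partition in self.partitions:
            matches.update(partition.where(column, value))
        return matches

cluster = Cluster(num_partitions=3)
cluster.put("u1", {"name": "alice", "city": "Colombo"})
cluster.put("u2", {"name": "bob", "city": "Kandy"})
cluster.put("u3", {"name": "carol", "city": "Colombo"})
colombo_users = sorted(cluster.where("city", "Colombo"))
```

A primary-key read touches one node, while every WHERE query touches all of them. That cost grows with cluster size, which is one reason pilot runs are worthwhile.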

Handling Unstructured Data

§  Storage Options
    o  Distributed file systems: generally scalable (e.g. NFS), but HDFS (Hadoop) and Lustre are highly scalable versions.
    o  Metadata registries (e.g. Nirvana, SDSC Storage Resource Broker)

Handling Semi-Structured Data

§  Storage Options
    o  The answer depends on the type of structure. If there is a server optimized for a given type, it is often much more efficient than using a DB (e.g. graph databases can support fast relationship search).

§  Search
    o  Very much custom. E.g. XML or any tree is searched with XPath, and graphs can support very fast relationship search.

| Data | Small Scale (1-3 nodes) | Scalable (10 nodes) | Highly Scalable |
|---|---|---|---|
| XML (queried through XPath) | XML DB or convert to a structured model | XML DB or convert to a structured model | ?? |
| Graphs | Graph DBs | Graph DBs if the graph can be partitioned | ?? |
| Data Structures | Data Structure Servers, Object Databases | | |
| Queues | Distributed Queues | Distributed Queues | Distributed Queues |

Hybrid Approaches

§  Some solutions have many types of data and hence need more than one data solution (hybrid architectures).

§  For example
    o  Using a DB for transactional data and CF for other data.
    o  Keeping metadata and actual data separate for large data archives.
    o  Using a graph DB to store relationship data while other data is in Column Family storage.

§  However, if transactions are needed across the stores, they have to be handled outside the storage (e.g. using Atomikos or ZooKeeper).

Copyright Matthew Oliphant by and licensed for reuse under CC License , http://www.flickr.com/photos/fajalar/3174131216/

Other parameters

§  The above list is not exhaustive, and there are other parameters
    o  Read/Write ratio: when it is high, it is easy to scale
    o  High write throughput
    o  Very large data products: you will need a file system. Maybe keep the metadata in a data registry and store the data in a file system.
    o  Flexible schema
    o  Archival use cases
    o  Analytical use cases
    o  Others …

§  So there is no silver bullet …

Conclusion

§  For the last 20 years or so, DBMSs were the de facto storage solution.
§  However, DBMSs could not scale well, and many NoSQL solutions have been proposed instead.
§  As a result, it is no longer easy to find the best data solution for your problem.
§  We discussed many dimensions (types of data, scalability, queries, and consistency) and provided guidelines on when to use which data solution.
§  Your feedback and thoughts are most welcome. Contact me through [email protected]
