Finding the Right Data Solution for Your Application in the Data Storage Haystack
Srinath Perera, Ph.D.
Senior Software Architect, WSO2 Inc.
Visiting Faculty, University of Moratuwa
Research Scientist, Lanka Software Foundation
Data Models
§ Many data models have been proposed (read Stonebraker’s “What Goes Around Comes Around” for more details)
  o Hierarchical (IMS): late 1960s and 1970s
  o Directed graph (CODASYL): 1970s
  o Relational: 1970s and early 1980s
  o Entity-Relationship: 1970s
  o Extended Relational: 1980s
  o Semantic: late 1970s and 1980s
§ For the last 20-30 years, relational database systems (SQL) together with transactions have been the de facto data solution.
Copyright Greg Morss and licensed for reuse under CC License , http://www.geograph.org.uk/photo/990700
For many years, the choice of data storage was an easy one (use an RDBMS).
Copyright by Alan Murray Walsh and licensed for reuse under CC License , http://www.geograph.org.uk/photo/1652880
Scale of Systems
§ However, the scale of systems is changing due to
  o Increasing user bases of systems
  o Mobile devices, online presence
  o Cloud computing and multicore systems
§ Scaling up an RDBMS
  o Put it on a bigger machine.
  o Replicate (cluster) the database to 2-3 more nodes. But this approach does not scale far.
  o Partition the data across many nodes (distribute, a.k.a. sharding). However, JOIN queries across many nodes are hard, and sometimes too slow. This often needs custom code and configuration. Transactions do not scale well either.
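The partitioning option can be illustrated with a minimal sketch. It assumes hash-based partitioning on the primary key and uses in-memory dicts to stand in for database nodes; `shard_for` and `ShardedTable` are hypothetical names made up for this example, not part of any product.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard number using a stable hash (not Python's per-process hash())."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ShardedTable:
    """A toy table whose rows are partitioned across in-memory 'nodes' (plain dicts)."""

    def __init__(self, num_shards: int):
        self.shards = [{} for _ in range(num_shards)]

    def put(self, key: str, row: dict) -> None:
        self.shards[shard_for(key, len(self.shards))][key] = row

    def get(self, key: str):
        # A primary-key lookup touches exactly one shard.
        return self.shards[shard_for(key, len(self.shards))].get(key)

    def scan(self, predicate):
        # A query that is not on the partition key has to visit every shard.
        return [row for shard in self.shards for row in shard.values() if predicate(row)]
```

A lookup by key goes to one node, while `scan` (and, by extension, a JOIN on a non-key column) has to touch all of them, which is why such queries get harder as nodes are added.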
Copyright digitalART2 and licensed for reuse under CC License , http://www.flickr.com/photos/digitalart/2101765353/
CAP Theorem, Transactions, and Storage
§ The RDBMS model provides two things
  o Relational model with SQL
  o ACID transactions (Atomicity, Consistency, Isolation, Durability)
§ It was a classic one-size-fits-all solution, and it worked for quite some time.
§ However, the CAP theorem says that you cannot have it all.
  o Consistency, Availability, and Partition Tolerance: pick two!
§ But many use cases do not need all RDBMS features, and when those are dropped, systems can scale (e.g. Google Bigtable).
§ However, to use them, one has to understand and utilize the application-specific behavior.
Copyright stephcarter and licensed for reuse under CC License , http://www.flickr.com/photos/stephcarter/541464462
NoSQL and other Storage Systems
§ Large internet companies hit the problem first. They built systems that were specific to their problems, and those systems did scale.
  o Google Bigtable
  o Amazon Dynamo
§ Soon many others followed, and most of them are free and open source.
§ Now there are a couple of dozen.
§ Among the advantages of NoSQL are
  o Scalability
  o Flexible schema
  o Designed to scale and support fault tolerance out of the box
Copyright ind{yeah} and licensed for reuse under CC License , http://www.flickr.com/photos/flickcoolpix/3566848458/
However, with NoSQL solutions, choosing a data store is no longer simple.
Copyright Philipp Salzgeber on and licensed for reuse under CC License http://www.salzgeber.at/astro/pics/20081126_heart/index.html
Selecting the Right Data Solution
§ What are the right questions to ask?
§ Categorize the answers for each question
§ Take different cases based on different answers and make recommendations!
Copyright by Krzysztof Poltorak, and licensed for reuse under CC License. http://www.fotocommunity.com/pc/pc/display/22077920
What are the right Questions?
o Types of data
  - Structured, Semi-Structured, Unstructured
o Need for Scalability
  - Number of users
  - Number of data items
  - Size of files
  - Read/Write ratio
o Types of Queries
  - Retrieve by Key
  - WHERE clauses
  - JOIN queries
  - Offline Queries
o Consistency
  - Loose Consistency
  - Single Operation Consistency
  - Transactions
Copyright by romainguy, and licensed for reuse under CC License http://www.flickr.com/photos/romainguy/249370084
Unstructured Data
§ This data is often kept in storage but consumed by humans at the end of the pipeline (e.g. a document repository).
§ One common use case is building structured data from unstructured data.
§ Metadata is often associated with the data to help searching.
Copyright Martyn Gorman and licensed for reuse under CC License, http://www.geograph.org.uk/photo/294134
§ The data does not have a particular structure and is often retrieved through a key (name).
  o E.g. file systems
§ Humans are good at processing unstructured data, but computers are not.
Structured Data
§ Has a structure, often described through a schema
§ Often a table-like 2D structure is used, but other structures are also possible.
§ The main advantage of the structure is search
Copyright Marion Doss by and licensed for reuse under CC License , http://www.flickr.com/photos/ooocha/2611398859/
§ The schema can be provided at deployment time or at runtime (dynamic schema)
§ The schema can be used to
  o Validate data
  o Support user-friendly search
  o Optimize storage and queries
Semi-structured Data
§ The structure is not fully defined, but there is some inherent structure.
§ For example
  o XML documents, where data is stored in a tree-like structure
  o Graph data
  o Data structures like lists and arrays
§ Supports queries based on structure
§ But processing the data often needs custom code.
Copyright Walter Baxter http://www.geograph.org.uk/photo/1069339
Search
§ Unstructured Data: no structure to support search.
  o Search based on an inverted index
  o Search through properties
§ Semi-Structured Data
  o To search XML (or any tree-like structure), use XPath or XQuery.
  o Tuple spaces can be queried through tuple space templates.
  o Data registries can be searched for entries that match given metadata descriptions (search by properties).
  o Graphs can be queried based on connectivity.
§ Structured Data
  o Retrieve by Key
  o WHERE clauses
  o Queries with JOINs
  o Offline Queries
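For the unstructured case, index-based search can be sketched roughly as follows. This is a minimal illustration under simple assumptions (lowercase word tokens, AND semantics across query terms); the function names are made up for the example, and real search engines add ranking, stemming, and much more.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list:
    """Split text into lowercase alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents: dict) -> dict:
    """Build an inverted index: term -> set of document names containing it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for term in tokenize(text):
            index[term].add(name)
    return index


def search(index: dict, query: str) -> set:
    """Return the documents that contain every term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

The documents themselves stay opaque blobs retrieved by name; only the index gives the store something to search on.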
Copyright bydigitalART2 and licensed for reuse under CC License , http://www.flickr.com/photos/digitalart/2101765353/
Consistency and Scalability
§ Scalability: the ability to handle more users, more data, or larger files by adding more nodes. We will use 3 categories.
  o Small systems (can be handled with 1-3 nodes)
  o Scalable systems (can be handled with about 10 nodes)
  o Highly scalable systems (anything larger, can be 100s or 1000s of nodes)
Copyright NNSANews and licensed for reuse under CC License , http://www.flickr.com/photos/nnsanews/5347287260/
§ Consistency: how replicas of the same data on many nodes are kept in sync, and how they can be updated without data corruption. We will use 3 categories.
  o Transactional: a series of operations is applied in an ACID manner
  o Atomic operation: a single operation is applied to all replicas
  o Eventual consistency: the data will eventually become consistent
Data Storage Alternatives
Data Storage Implementations
§ Expectations from data stores
  o Reliably store the data
  o Efficient search and retrieval of data whenever needed
  o Data management: delete and update data
Copyright John Atherton by and licensed for reuse under CC License , http://www.flickr.com/photos/gbaku/2231332836/
Challenges of Data Storage
§ Reliability
  o Replicating data
  o Creating backups and recovering from backups
§ Security
§ Scaling and parallel access
  o Distribution or replication
  o ACID transactions
§ Availability
  o Data replication
§ Vendor lock-in
  o Interoperability, standard query languages
§ Simple user experience
  o Hide the physical location of data
  o Provide simple APIs and security models
  o Expressive query languages
Data Storage Choices
Data type: Structured
Storage type            | Advantages                                   | Disadvantages                                  | Key | Where                | Joins | Transactions    | Scale    | Flexible schema
Local memory            | Very fast                                    | Not durable                                    | Yes | No                   | No    | No, unless STMs | No       | Yes
Relational/SQL          | Standardized                                 | Rigid schema, good for read-oriented use cases | Yes | Yes                  | Yes   | Yes             | Moderate | No
Column families (NoSQL) | High write performance, replicated           | Not transactional, no online joins             | Yes | Yes, secondary index | No    | No              | High     | Yes
Document DBs            | High write performance, replicated           | Not transactional, no online joins             | Yes | Yes, views           | No    | No              | Yes      | Yes
Object databases        | Easy to integrate with programming languages |                                                | Yes | Yes                  | Yes   | Yes             | No       | No
Storage type                      | Advantages                                            | Disadvantages                                        | Key | Search                        | Transactions | Scale    | Flexible schema
[Unstructured]
Files                             | Save big files whose format is not understood         | No structured search on content                      | Yes | Indexing                      | No           | Moderate | Yes
Data registries/metadata catalogs | Metadata search                                       |                                                      | Yes | Property-based search (Where) | No           | Moderate | Yes
[Semi-structured]
Queues                            | Representation of flow of messages over time / tasks  |                                                      | Yes | N/A                           | No           | Yes      | Yes
Triple stores                     | Used for inference, very fast relationship processing |                                                      | Yes | Relationship search           | No           | No       | Yes
XML database                      | XML native                                            |                                                      |     | XPath/XQuery                  |              |          |
Distributed cache                 | Fast, replicated                                      | No search                                            | Yes | No                            | No           | Yes      | Yes
Key-value pairs                   | High write performance, replicated                    | Model is too simple in some cases, not transactional | Yes | No                            | No           | Yes      | Yes
Graph DBs                         | Very fast joins, natural to represent relationships   | Not very scalable                                    | Yes | Graph search                  | Yes          | Low      | N/A
Choosing the Right Data Solution
How do we do this?
§ Consider structured, semi-structured, and unstructured data separately.
  o Then drill down based on the other 3 properties: scale, consistency, and search.
§ The structured case is more complicated; the other two are a bit simpler.
§ Start by giving a de facto choice for each case
Copyright 8664 and licensed for reuse under CC License , http://www.flickr.com/photos/80464769@N00/186598462/
Handling Structured Data
§ There are three main considerations: scale, consistency, and queries
Consistency columns: Loose = Loose Consistency, Operation = Operation Consistency, ACID = ACID Transactions
            | Small (1-3 nodes)                                   | Scalable (10 nodes)                                 | Highly Scalable (1000s of nodes)
            | Loose           | Operation       | ACID            | Loose           | Operation       | ACID            | Loose           | Operation       | ACID
Primary Key | DB/KV/CF        | DB/KV/CF        | DB              | KV/CF           | KV/CF           | Partitioned DB? | KV/CF           | KV/CF           | No
Where       | DB/CF/Doc       | DB/CF/Doc       | DB              | CF/Doc(?)       | CF/Doc(?)       | Partitioned DB? | CF/Doc          | CF/Doc          | No
JOIN        | DB              | DB              | DB              | ??              | ??              | ??              | No              | No              | No
Offline     | DB/CF/Doc       | DB/CF/Doc       | DB/CF/Doc       | CF/Doc          | CF/Doc          | No              | CF/Doc          | CF/Doc          | No
*KV: Key-Value Systems, CF: Column Families, Doc: document based Systems
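The recommendation table above is essentially a lookup on three answers (query type, scale, consistency). As a sketch only, it could be encoded as data like this; the dictionary keys and the `recommend` function are names invented for this illustration, and the cell values are copied from the table, including its open question marks.

```python
SCALES = ("small", "scalable", "highly_scalable")
CONSISTENCY = ("loose", "operation", "acid")

# One row per query type; nine cells ordered by scale, then by consistency.
RECOMMENDATIONS = {
    "primary_key": ["DB/KV/CF", "DB/KV/CF", "DB",
                    "KV/CF", "KV/CF", "Partitioned DB?",
                    "KV/CF", "KV/CF", "No"],
    "where": ["DB/CF/Doc", "DB/CF/Doc", "DB",
              "CF/Doc(?)", "CF/Doc(?)", "Partitioned DB?",
              "CF/Doc", "CF/Doc", "No"],
    "join": ["DB", "DB", "DB",
             "??", "??", "??",
             "No", "No", "No"],
    "offline": ["DB/CF/Doc", "DB/CF/Doc", "DB/CF/Doc",
                "CF/Doc", "CF/Doc", "No",
                "CF/Doc", "CF/Doc", "No"],
}


def recommend(query: str, scale: str, consistency: str) -> str:
    """Return the table cell for the given query type, scale, and consistency."""
    column = SCALES.index(scale) * 3 + CONSISTENCY.index(consistency)
    return RECOMMENDATIONS[query][column]
```

For example, `recommend("where", "small", "acid")` looks up the cell for WHERE queries on a small system that needs ACID transactions.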
Handling Small Scale Systems (1-3 nodes)
§ In general, using a DB for every case here might work.
§ Reasons for using options other than a DB
  o When there is a potential need to scale later
  o High write throughput
§ KV is 1-D whereas the other two are 2-D
Small (1-3 nodes)
            | Loose Consistency     | Operation Consistency | ACID Transactions
Primary Key | DB/KV/CF              | DB/KV/CF              | DB
Where       | DB/CF/Doc             | DB/CF/Doc             | DB
JOIN        | DB                    | DB                    | DB
Offline     | DB/CF/Doc             | DB/CF/Doc             | DB/CF/Doc
*KV: Key-Value Systems, CF: Column Families, Doc: document based Systems
Handling Scalable Systems
§ KV, CF, and Doc can easily handle this case.
§ If DBs are used with data sharded across many nodes
  o Transactions might work, provided that there are not too many participants in one transaction.
  o JOINs might need to transfer too much data between nodes.
  o In-memory DBs like VoltDB should also be considered.
§ Offline mode will work.
§ Most systems let users choose the consistency level, and loose consistency scales better (e.g. Cassandra).
Scalable (10 nodes)
            | Loose Consistency     | Operation Consistency | ACID Transactions
Primary Key | KV/CF                 | KV/CF                 | Partitioned DB?
Where       | CF/Doc                | CF/Doc                | Partitioned DB?
JOIN        | ??                    | ??                    | Partitioned DB??
Offline     | CF/Doc                | CF/Doc                | No
*KV: Key-Value Systems, CF: Column Families, Doc: document based Systems
Highly Scalable Systems
§ Transactions do not work at this scale (CAP theorem).
§ The same goes for JOINs. The problem is that sometimes too much data needs to be transferred between nodes to perform the JOIN.
§ The offline case is handled through Map-Reduce. Even the JOIN case is OK, since there is time.
Highly Scalable (1000s nodes)
            | Loose Consistency     | Operation Consistency | ACID Transactions
Primary Key | KV/CF                 | KV/CF                 | No
Where       | CF/Doc                | CF/Doc                | No
JOIN        | No                    | No                    | No
Offline     | CF/Doc                | CF/Doc                | No
*KV: Key-Value Systems, CF: Column Families, Doc: document based Systems
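To make the offline JOIN point concrete, here is a rough single-process sketch of a reduce-side join in the Map-Reduce style. It only simulates the map, shuffle, and reduce phases with plain Python; the record layout (`id`, `user_id`, `name`, `item`) and the function names are assumptions made up for the example, not part of Hadoop or any other framework.

```python
from collections import defaultdict


def map_phase(users, orders):
    """Emit (join key, tagged record) pairs from both inputs."""
    for user in users:
        yield user["id"], ("user", user)
    for order in orders:
        yield order["user_id"], ("order", order)


def shuffle(pairs):
    """Group all values by key, as the framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """For each key, combine every user record with every order record."""
    for values in groups.values():
        users = [record for tag, record in values if tag == "user"]
        orders = [record for tag, record in values if tag == "order"]
        for user in users:
            for order in orders:
                yield {"user": user["name"], "item": order["item"]}


def offline_join(users, orders):
    return list(reduce_phase(shuffle(map_phase(users, orders))))
```

The shuffle step is where the data moves between nodes in a real cluster, which is acceptable for a batch job but too slow for an online query.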
Highly Scalable Systems + Primary Key Retrieval
§ This is (comparatively) the easy one.
§ It can be solved through DHT (Distributed Hash Table) based solutions or architectures like OceanStore.
§ Both Key-Value stores (KV) and Column Families (CF) can be used, but the Key-Value model is preferred as it is more scalable.
Highly Scalable (1000s nodes)
            | Loose Consistency     | Operation Consistency | ACID Transactions
Primary Key | KV/CF                 | KV/CF                 | No
Where       | CF/Doc(?)             | CF/Doc(?)             | No
JOIN        | No                    | No                    | No
Offline     | CF/Doc                | CF/Doc                | No
*KV: Key-Value Systems, CF: Column Families, Doc: document based Systems
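One common way DHT-style stores decide which node owns a key is consistent hashing. The sketch below is a simplified illustration of that idea, assuming a hash ring with virtual nodes; `HashRing` and its parameters are hypothetical names, and real systems add replication, failure handling, and routing on top.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Stable 64-bit hash of a string."""
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Consistent hash ring: each node owns the arcs that end at its points."""

    def __init__(self, nodes, vnodes: int = 64):
        # Place several virtual points per node on the ring to even out the load.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key: str) -> str:
        # Walk clockwise to the first point after the key's hash, wrapping around.
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]
```

The useful property is that adding a node only moves the keys that now fall on the new node's arcs; all other keys stay where they were, which keeps rebalancing cheap at large scale.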
Highly Scalable Systems + WHERE
§ This is generally OK, but tricky.
§ CF works through a secondary index that does scatter-gather (e.g. Cassandra).
§ Doc works through Map-Reduce views (e.g. CouchDB).
§ There is also Bissa, which builds an index for all possible queries (no range queries).
§ If you are doing this, you should do pilot runs and make sure things work.
Highly Scalable (1000s nodes)
            | Loose Consistency     | Operation Consistency | ACID Transactions
Primary Key | KV/CF                 | KV/CF                 | No
Where       | CF/Doc(?)             | CF/Doc(?)             | No
JOIN        | No                    | No                    | No
Offline     | CF/Doc                | CF/Doc                | No
*KV: Key-Value Systems, CF: Column Families, Doc: document based Systems
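The scatter-gather idea behind a WHERE query on a partitioned store can be sketched as follows. This is a toy model, not Cassandra's implementation: it assumes each node keeps a local index over only its own rows, and `Node` and `Cluster` are names invented for the example.

```python
import zlib
from collections import defaultdict


class Node:
    """One storage node with a local secondary index over its own rows."""

    def __init__(self):
        self.rows = {}
        self.index = defaultdict(set)  # (column, value) -> keys stored on this node

    def put(self, key, row):
        old = self.rows.get(key)
        if old is not None:
            # Drop stale index entries before overwriting the row.
            for column, value in old.items():
                self.index[(column, value)].discard(key)
        self.rows[key] = row
        for column, value in row.items():
            self.index[(column, value)].add(key)

    def where(self, column, value):
        return [self.rows[k] for k in self.index.get((column, value), ())]


class Cluster:
    def __init__(self, num_nodes: int):
        self.nodes = [Node() for _ in range(num_nodes)]

    def put(self, key: str, row: dict) -> None:
        # Rows are placed by primary key only.
        node = self.nodes[zlib.crc32(key.encode("utf-8")) % len(self.nodes)]
        node.put(key, row)

    def where(self, column, value):
        # Scatter the query to every node, then gather the partial results.
        results = []
        for node in self.nodes:
            results.extend(node.where(column, value))
        return results
```

Because every WHERE query fans out to all nodes, its cost grows with the cluster size, which is one reason pilot runs are worth doing before relying on it.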
Handling Unstructured Data
§ Storage Options
  o Distributed file systems: generally scalable (e.g. NFS), but HDFS (Hadoop) and Lustre are highly scalable versions.
  o Metadata registries (e.g. Nirvana, SDSC Storage Resource Broker)
Handling Semi-Structured Data
§ Storage Options
  o The answer depends on the type of structure. If there is a server optimized for a given type, it is often much more efficient than using a DB (e.g. graph databases can support fast relationship search).
§ Search
  o Very much custom. E.g. XPath for XML or any tree; graphs can support very fast relationship search.
                            | Small Scale (1-3 nodes)                  | Scalable (10 nodes)                     | Highly Scalable
XML (queried through XPath) | XML DB or convert to a structured model  | XML DB or convert to a structured model | ??
Graphs                      | Graph DBs                                | Graph DBs if graph can be partitioned   | ??
Data structures             | Data structure servers, object databases |                                         |
Queues                      | Distributed queues                       | Distributed queues                      | Distributed queues
Hybrid Approaches
§ Some solutions have many types of data and hence need more than one data solution (hybrid architectures).
§ For example
  o Using a DB for transactional data and CF for other data.
  o Keeping metadata and actual data separate for large data archives.
  o Using a graph DB to store relationship data while other data is in column family storage.
§ However, if transactions are needed, they have to be handled outside the storage (e.g. using Atomikos or ZooKeeper).
Copyright Matthew Oliphant by and licensed for reuse under CC License , http://www.flickr.com/photos/fajalar/3174131216/
Other parameters
§ The above list is not exhaustive, and there are other parameters
  o Read/write ratio: when it is high, it is easy to scale
  o High write throughput
  o Very large data products: you will need a file system. Maybe keep the metadata in a data registry and store the data in a file system.
  o Flexible schema
  o Archival use cases
  o Analytical use cases
  o Others …
§ So there is no silver bullet …
Conclusion
§ For the last 20 years or so, DBMSs were the de facto storage solution.
§ However, DBMSs could not scale well, and many NoSQL solutions have been proposed instead.
§ As a result, it is no longer easy to find the best data solution for your problem.
§ We discussed many dimensions (types of data, scalability, queries, and consistency) and provided guidelines on when to use which data solution.
§ Your feedback and thoughts are most welcome. Contact me through [email protected]