big data - architectural concerns for the new age

Post on 26-Jan-2015







A brief introduction to Big Data and why you should care about polyglot storage


Big Data - architectural concerns for the new age

Sunday, 2 December 12

Debasish Ghosh, CTO

(a Nomura Research Institute group company)

@debasishg on Twitter

code @

blog @ Ruminations of a Programmer

some numbers ..

Facebook reaches 1 billion active users

some more numbers ..

• Walmart handles 1M transactions per hour

• Google processes 24PB of data per day

• AT&T transfers 30PB of data per day

• 90 trillion emails are sent every year

• World of Warcraft uses 1.3PB of storage

Big Data - the positive feedback cycle

new technologies make using big data efficient => more adoption of big data => generation of more big data => new technologies ..




new technologies

.. new architectural concerns

new ways to store data

new techniques to retrieve data

new ways to scale reads & writes

transparent to the application

new ways to consume data

new techniques to analyze data

new ways to visualize data

at Web scale

The Database Landscape so far ..

• relational database - the bedrock of enterprise data

• irrespective of application development paradigm

• object-relational-mapping considered to be the panacea for impedance mismatch

“Object Relational Mapping is the Vietnam of Computer Science”

- Ted Neward (2006)

blogger, big geek and architectural consultant

RDBMS & Big Data

• once the data volume crosses the limit of a single server, you shard / partition

• sharding implies a lookup node for the hash code => SPOF

• cross shard joins, transactions don’t scale
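A minimal sketch of hash-based sharding, assuming keys are strings and shards are numbered 0..N-1. The names `shard_for` and `moved_fraction` are hypothetical helpers for illustration, not part of any particular database. It also hints at why a naive `hash % N` scheme is painful to grow: changing the shard count remaps most keys.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash (not Python's salted hash())."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def moved_fraction(keys, old_shards: int, new_shards: int) -> float:
    """Fraction of keys whose shard changes when the shard count changes."""
    moved = sum(
        1 for k in keys if shard_for(k, old_shards) != shard_for(k, new_shards)
    )
    return moved / len(keys)


# Illustrative data: 10,000 made-up user keys.
keys = [f"user:{i}" for i in range(10_000)]
fraction = moved_fraction(keys, 4, 5)
print(f"keys remapped when going from 4 to 5 shards: {fraction:.0%}")
```

Every read or write has to go through this routing step, which is why a single lookup node in front of the shards becomes a SPOF.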

RDBMS & Big Data

• Cost of distributed transactions

• synchronization overhead

• 2 phase commit is a blocking protocol (can block indefinitely)

• as slow as the slowest DB node + network latency
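The cost points above can be made concrete with a toy two-phase commit coordinator. This is a sketch under simplifying assumptions: `Participant` is a hypothetical in-memory stand-in for a database node, there is no network, and no failure handling. It only shows the shape of the protocol.

```python
class Participant:
    """Hypothetical in-memory stand-in for one database node."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self):
        # Phase 1: vote. A "yes" vote means the node must hold its locks
        # until the coordinator tells it what to do.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    """Return True if the transaction committed on every participant."""
    # Phase 1 (voting): the coordinator waits for every vote,
    # so the round is as slow as the slowest node.
    votes = [p.prepare() for p in participants]
    # Phase 2 (completion): commit only on a unanimous "yes".
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.abort()
    return False
```

The blocking behaviour is not modelled here: in a real deployment, a coordinator that crashes between the two phases leaves every node stuck in the `prepared` state, holding locks until the coordinator recovers.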

RDBMS & Big Data

• Master/Slave replication

• synchronous replication => slow

• asynchronous replication => can lose data

• writing to master is a bottleneck and SPOF

Need Distributed Databases

• data is automatically partitioned

• transparent to the application

• add capacity without downtime

• failure tolerant

2 famous papers ..

• Bigtable: A distributed storage system for structured data, 2006

• Dynamo: Amazon’s highly available key-value store, 2007

Addressing 2 Approaches

• Bigtable: “how can we build a distributed database on top of GFS ?”

• Dynamo: “how can we build a distributed hash table appropriate for data center ?”

Big Data recommendations

• reduce accidental complexity in processing data

• be less rigid (no rigid schema)

• store data in a format closer to the domain model

• hence no universal data model ..

Polyglot Storage

• unfortunately came to be known as NoSQL databases

• document oriented (MongoDB, CouchDB)

• key/value (Dynamo, Bigtable, Riak, Cassandra, Voldemort)

• data structure based (redis)

• graph based (Neo4J)

richer modeling capabilities

closer to domain model

reduced impedance mismatch

Asynchronous Replication to RDBMS using Message Oriented Middleware

Hybrid Oracle MongoDB storage over Messaging backbone

A relational database is just another option, not the only option, when the data set is BIG and semantically rich

10 things never to do with a Relational Database

• Search

• Recommendation

• High Frequency Trading

• Product Cataloging

• User group / ACLs

• Log Analysis

• Media Repository

• Email

• Classified ads

• Time Series / Forecasting


Scalability, Availability ..

• ACID => BASE

• CAP Theorem & Eventual Consistency

• Consistent Hashing

• Vector Clocks

• Hinted Hand-off & Read repair

• Anti-entropy

• Gossip Protocol
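Of the techniques listed above, vector clocks are perhaps the easiest to show in a few lines. The sketch below assumes a clock is a plain dict from node name to counter; the function names are illustrative and do not mirror any specific database's API.

```python
def increment(clock: dict, node: str) -> dict:
    """Return a copy of the clock with this node's counter advanced by one."""
    new_clock = dict(clock)
    new_clock[node] = new_clock.get(node, 0) + 1
    return new_clock


def merge(a: dict, b: dict) -> dict:
    """Element-wise maximum: the clock of a value that has seen both histories."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}


def compare(a: dict, b: dict) -> str:
    """Classify a relative to b: 'equal', 'before', 'after' or 'concurrent'."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"  # neither descends from the other: a conflict


v1 = increment({}, "A")   # first write, handled by node A
v2 = increment(v1, "B")   # update of v1 handled by node B
v3 = increment(v1, "C")   # another update of v1, handled by node C
print(compare(v1, v2))    # v1 is an ancestor of v2
print(compare(v2, v3))    # v2 and v3 are siblings
```

When two versions compare as `concurrent`, the store cannot pick a winner on its own; it has to keep both and let the application (or a merge rule) reconcile them.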

CAP Theorem

• Consistency, Availability & Partition Tolerance

• You can have only 2 of these in a distributed system

• Eric Brewer conjectured this in 2000; Gilbert and Lynch proved it formally in 2002


BASE

• Basically Available, Soft state, Eventual consistency

• Rather than requiring consistency after every transaction, it’s enough for the database to eventually be in a consistent state.

• It’s ok to use stale data and it’s ok to give approximate answers

Consistent Hashing
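A minimal sketch of a consistent hash ring, assuming MD5 for placement and 100 virtual nodes per physical node. The class name `HashRing` and both of those choices are illustrative, not taken from any particular system.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Stable position on the ring for a node replica or a key."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes   # virtual nodes per physical node
        self._points = []      # sorted positions on the ring
        self._owner = {}       # position -> physical node
        for node in nodes:
            self.add(node)

    def add(self, node):
        # Each physical node is placed at many positions to even out the load.
        for i in range(self.vnodes):
            point = _hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove(self, node):
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def node_for(self, key):
        if not self._points:
            raise ValueError("ring is empty")
        # Walk clockwise to the first position after the key, wrapping around.
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[idx]]
```

With this layout, adding a node only takes over the keys that now fall just before its positions on the ring; every other key stays where it was, and no central lookup node is needed.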

Big Data in the wild

• Hadoop

• started as a batch processing engine (HDFS & Map/Reduce)

• with bigger and bigger data, you need to make them available to users at near real time

• stream processing, CEP ..
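The Map/Reduce model mentioned above can be illustrated with the classic word count. This is a single-process sketch of the three phases only; it does not use Hadoop's actual API, and the function names are illustrative.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


print(word_count(["big data", "big clusters"]))
```

In a real cluster the map and reduce calls run in parallel on the nodes that hold the data, and the whole job runs as a batch, which is why near real time access needs additional tooling.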

• Hive, a data warehouse system for Hadoop for easy data summarization, ad-hoc queries & analysis of large datasets stored in Hadoop compatible file systems

• Pig, a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs

• Cloudera Impala, real time ad hoc query capability for Hadoop, complementing traditional MapReduce batch processing


in Hadoop

Real time queries in Hadoop

• currently people use Hadoop connectors to massively parallel databases to do real time queries in Hadoop

• expensive and may need lots of data movement between the database & the Hadoop clusters

.. and the Hadoop ecosystem continues to grow, with lots of real time tools being actively developed that are compliant with the current base ..

Shark from UC Berkeley

• a large scale data warehouse system for Spark, compatible with Hive

• supports HiveQL, Hive data formats and user defined functions. In addition, Shark can be used to query data in HDFS, HBase and Amazon S3

BI and Analytics

• making Big Data available to developers

• API / scripting abilities for writing rich analytic applications (Precog, Continuity, Infochimps)

• analyzing user behaviors, network monitoring, log processing, recommenders, AI ..

Machine Learning

• personalization

• social network analysis

• pattern discovery - click patterns, recommendations, ratings

• apps that rely on machine learning - Prismatic, Trifacta, Google, Twitter ..


• Big Data will grow bigger - we need to embrace the changes in architecture

• An RDBMS is NOT the panacea - pick your data model that’s closest to your domain

• It’s economical to limit data movement - process data in place and utilize the multiple cores of your hardware


• Go for decentralized architectures, avoid SPOFs

• With the big volumes of data, streaming is your friend
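As a small illustration of the streaming point, here is a sketch of a one-pass running mean. It assumes the input is any iterable of numbers and keeps only two values in memory, however long the stream is; `running_mean` is a hypothetical helper name.

```python
def running_mean(stream):
    """Yield the mean of all values seen so far, one result per input value."""
    count = 0
    mean = 0.0
    for value in stream:
        count += 1
        # Incremental update: no need to store or re-read earlier values.
        mean += (value - mean) / count
        yield mean


for current in running_mean([2, 4, 6]):
    print(current)
```

The same shape (consume a value, update a small summary, emit a result) applies to counts, sums and many approximate statistics, so the data never has to be loaded in full.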

Thank You!
