Big Data - architectural concerns for the new age

Sunday, 2 December 12

Debasish Ghosh, CTO

(a Nomura Research Institute group company)

@debasishg on Twitter

code @ http://github.com/debasishg

blog @ Ruminations of a Programmer http://debasishg.blogspot.com

some numbers ..

Facebook reaches 1 billion active users

some more numbers ..

• Walmart handles 1M transactions per hour

• Google processes 24PB of data per day

• AT&T transfers 30PB of data per day

• 90 trillion emails are sent every year

• World of Warcraft uses 1.3PB of storage

Big Data - the positive feedback cycle

new technologies make using big data efficient

=> more adoption of big data

=> generation of more big data

=> new technologies

.. new architectural concerns

new ways to store data

new techniques to retrieve data

new ways to scale reads & writes

transparent to the application

new ways to consume data

new techniques to analyze data

new ways to visualize data

at Web scale

The Database Landscape so far ..

• relational database - the bedrock of enterprise data

• irrespective of application development paradigm

• object-relational-mapping considered to be the panacea for impedance mismatch

“Object Relational Mapping is the Vietnam of Computer Science”

- Ted Neward (2006)

blogger, big geek and architectural consultant

RDBMS & Big Data

• once the data volume crosses the limit of a single server, you shard / partition

• sharding implies a lookup node for the hash code => SPOF

• cross shard joins, transactions don’t scale
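A minimal sketch of why manual sharding is painful, assuming the simplest scheme (hash of the key modulo the number of shards). The function names and the `user:N` keys are made up for illustration. It measures how many keys change shard when one shard is added.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash (not Python's salted hash())."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def moved_fraction(keys, old_shards: int, new_shards: int) -> float:
    """Fraction of keys whose shard changes when the shard count changes."""
    moved = sum(
        1 for k in keys if shard_for(k, old_shards) != shard_for(k, new_shards)
    )
    return moved / len(keys)


# Hypothetical key set; any reasonably large set behaves the same way.
keys = [f"user:{i}" for i in range(10_000)]
fraction = moved_fraction(keys, 4, 5)
print(f"{fraction:.0%} of keys move when going from 4 to 5 shards")
```

With modulo placement, most keys move on every resize, so adding capacity means a large data migration.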

RDBMS & Big Data

• Cost of distributed transactions

• synchronization overhead

• 2 phase commit is a blocking protocol (can block indefinitely)

• as slow as the slowest DB node + network latency
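A rough sketch of two-phase commit, assuming an in-memory coordinator and simulated latencies (nothing actually sleeps). `Participant` and `two_phase_commit` are illustrative names, not a real library API. Coordinator failure, which is what makes the protocol block, is not modeled here.

```python
from dataclasses import dataclass


@dataclass
class Participant:
    name: str
    latency_ms: int          # simulated time this node takes to answer
    can_commit: bool = True  # the vote this node casts in phase 1
    state: str = "init"

    def prepare(self) -> bool:
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self) -> None:
        self.state = "committed"

    def rollback(self) -> None:
        self.state = "aborted"


def two_phase_commit(participants):
    """Return (outcome, simulated_latency_ms) for one distributed transaction."""
    # Phase 1: every participant votes; the coordinator waits for the slowest.
    votes = [p.prepare() for p in participants]
    round_ms = max(p.latency_ms for p in participants)

    # Phase 2: commit only on a unanimous yes, otherwise roll everyone back.
    if all(votes):
        for p in participants:
            p.commit()
        outcome = "committed"
    else:
        for p in participants:
            p.rollback()
        outcome = "aborted"

    # Two message rounds, each bounded by the slowest participant.
    return outcome, 2 * round_ms
```

In this model, one slow node (say 40 ms among nodes answering in 5 ms) sets the latency of the whole transaction, and one "no" vote aborts it for everyone.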

RDBMS & Big Data

• Master/Slave replication

• synchronous replication => slow

• asynchronous replication => can lose data

• writing to master is a bottleneck and SPOF

Need Distributed Databases

• data is automatically partitioned

• transparent to the application

• add capacity without downtime

• failure tolerant

2 famous papers ..

• Bigtable: A distributed storage system for structured data, 2006

• Dynamo: Amazon’s highly available key-value store, 2007

Addressing 2 Approaches

• Bigtable: “how can we build a distributed database on top of GFS?”

• Dynamo: “how can we build a distributed hash table appropriate for the data center?”

Big Data recommendations

• reduce accidental complexity in processing data

• be less rigid (no rigid schema)

• store data in a format closer to the domain model

• hence no universal data model ..

Polyglot Storage

• unfortunately came to be known as NoSQL databases

• document oriented (MongoDB, CouchDB)

• key/value (Dynamo, Bigtable, Riak, Cassandra, Voldemort)

• data structure based (redis)

• graph based (Neo4J)

richer modeling capabilities

closer to domain model

reduced impedance mismatch

Asynchronous Replication to RDBMS using Message Oriented Middleware

Hybrid Oracle MongoDB storage over Messaging backbone

A relational database is just another option, not the only option, when the data set is BIG and semantically rich

10 things never to do with a Relational Database

• Search

• Recommendation

• High Frequency Trading

• Product Cataloging

• User group / ACLs

• Log Analysis

• Media Repository

• Email

• Classified ads

• Time Series / Forecasting

Source: http://www.infoworld.com/d/application-development/10-things-never-do-relational-database-206944?page=0,0

Scalability, Availability ..

• ACID => BASE

• CAP Theorem & Eventual Consistency

• Consistent Hashing

• Vector Clocks

• Hinted Hand-off & Read repair

• Anti-entropy

• Gossip Protocol
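Of the techniques listed, vector clocks are the easiest to show in a few lines. This is a minimal sketch that assumes clocks are plain dicts mapping a node name to a counter. The helper names are illustrative.

```python
def increment(clock: dict, node: str) -> dict:
    """Return a copy of the clock with this node's counter advanced by one."""
    new = dict(clock)
    new[node] = new.get(node, 0) + 1
    return new


def merge(a: dict, b: dict) -> dict:
    """Element-wise maximum: the clock that has seen everything a and b saw."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}


def compare(a: dict, b: dict) -> str:
    """Classify a relative to b: 'equal', 'before', 'after' or 'concurrent'."""
    nodes = a.keys() | b.keys()
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"  # neither dominates: a conflict the application must resolve


v1 = increment({}, "A")   # written on node A
v2 = increment(v1, "B")   # updated on node B
v3 = increment(v1, "C")   # updated independently on node C
print(compare(v1, v2), compare(v2, v3))
```

Two versions that compare as "concurrent" are siblings. The store cannot pick a winner by itself, so it either keeps both or applies a merge rule.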

CAP Theorem

• Consistency, Availability & Partition Tolerance

• You can have only 2 of these in a distributed system

• Eric Brewer postulated this in 2000; Gilbert and Lynch formally proved it in 2002

ACID => BASE

• Basically Available, Soft state, Eventual consistency

• Rather than requiring consistency after every transaction, it’s enough for the database to eventually be in a consistent state.

• It’s ok to use stale data and it’s ok to give approximate answers
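A toy illustration of eventual consistency, assuming two in-memory replicas and a last-write-wins rule on a logical timestamp. That rule is a simplification chosen for brevity, and the `Replica` class is hypothetical. A read can return stale data until the replicas exchange state.

```python
class Replica:
    def __init__(self, name: str):
        self.name = name
        self.store = {}  # key -> (timestamp, value)

    def write(self, key, value, timestamp: int) -> None:
        """Accept the write only if it is newer than what we already hold."""
        current = self.store.get(key)
        if current is None or timestamp > current[0]:
            self.store[key] = (timestamp, value)

    def read(self, key):
        entry = self.store.get(key)
        return entry[1] if entry else None


def anti_entropy(a: Replica, b: Replica) -> None:
    """Exchange entries in both directions; the newer timestamp wins."""
    for key, (ts, value) in list(a.store.items()):
        b.write(key, value, ts)
    for key, (ts, value) in list(b.store.items()):
        a.write(key, value, ts)


r1, r2 = Replica("r1"), Replica("r2")
r1.write("cart", "book", timestamp=1)
stale = r2.read("cart")        # r2 has not heard about the write yet
anti_entropy(r1, r2)
converged = r2.read("cart")    # both replicas now agree
print(stale, converged)
```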

Consistent Hashing
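A minimal sketch of a consistent hash ring, assuming MD5 for placement and 100 virtual nodes per physical node. Both are arbitrary choices for illustration, and `HashRing` is not any particular database's API. Each key belongs to the first virtual node found clockwise from the key's position on the ring.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes=(), vnodes: int = 100):
        self.vnodes = vnodes
        self._positions = []  # sorted ring positions of all virtual nodes
        self._owners = {}     # ring position -> physical node name
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        """Place `vnodes` virtual nodes for this physical node on the ring."""
        for i in range(self.vnodes):
            pos = _hash(f"{node}#{i}")
            bisect.insort(self._positions, pos)
            self._owners[pos] = node

    def remove(self, node: str) -> None:
        """Drop all virtual nodes owned by this physical node."""
        kept = [p for p in self._positions if self._owners[p] != node]
        self._owners = {p: self._owners[p] for p in kept}
        self._positions = kept

    def node_for(self, key: str) -> str:
        """First virtual node clockwise from the key, wrapping around the ring."""
        if not self._positions:
            raise ValueError("ring is empty")
        idx = bisect.bisect(self._positions, _hash(key)) % len(self._positions)
        return self._owners[self._positions[idx]]


ring = HashRing(["n1", "n2", "n3", "n4"])
keys = [f"user:{i}" for i in range(10_000)]
before = {k: ring.node_for(k) for k in keys}
ring.add("n5")
after = {k: ring.node_for(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved) / len(keys):.0%} of keys moved, all of them to n5")
```

When a node joins, only the keys it takes over change owner. The rest stay where they were, which is what makes adding capacity without downtime practical.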

Big Data in the wild

• Hadoop

• started as a batch processing engine (HDFS & Map/Reduce)

• as data grows bigger and bigger, you need to make it available to users in near real time

• stream processing, CEP ..
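For readers new to the Map/Reduce model, here is a single-process sketch of its three steps (map, shuffle, reduce) applied to word counting. It only illustrates the programming model. It is not Hadoop's API, and the function names are made up.

```python
from collections import defaultdict


def map_phase(document: str):
    """Emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


print(word_count(["big data", "Big Data big"]))
```

In a real cluster the map and reduce calls run in parallel on the nodes that hold the data, but the whole job still has to finish before results are visible, hence the push toward near real time tools.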

Hive, a data warehouse system for Hadoop for easy data summarization, ad-hoc queries & analysis of large datasets stored in Hadoop compatible file systems

Pig, a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs.

real time ad hoc query capability to Hadoop, complementing traditional MapReduce batch processing

Cloudera Impala: complementing Map/Reduce in Hadoop

Real time queries in Hadoop

• currently people use Hadoop connectors to massively parallel databases to do real time queries in Hadoop

• expensive and may need lots of data movement between the database & the Hadoop clusters

.. and the Hadoop ecosystem continues to grow, with lots of real time tools being actively developed that are compliant with the current base ..

Shark from UC Berkeley

• a large scale data warehouse system for Spark, compatible with Hive

• supports HiveQL, Hive data formats and user defined functions. In addition, Shark can be used to query data in HDFS, HBase and Amazon S3

BI and Analytics

• making Big Data available to developers

• API / scripting abilities for writing rich analytic applications (Precog, Continuity, Infochimps)

• analyzing user behaviors, network monitoring, log processing, recommenders, AI ..

Machine Learning

• personalization

• social network analysis

• pattern discovery - click patterns, recommendations, ratings

• apps that rely on machine learning - Prismatic, Trifacta, Google, Twitter ..

Summary

• Big Data will grow bigger - we need to embrace the changes in architecture

• An RDBMS is NOT the panacea - pick the data model that’s closest to your domain

• It’s economical to limit data movement - process data in place and utilize the multiple cores of your hardware

Summary

• Go for decentralized architectures, avoid SPOFs

• With the big volumes of data, streaming is your friend
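A small sketch of the streaming idea, assuming the data arrives as an iterable of numbers. Each element is processed once and only a constant amount of state is kept. `running_mean` is an illustrative name.

```python
def running_mean(stream):
    """Yield the mean of all values seen so far, one result per input value."""
    count, mean = 0, 0.0
    for x in stream:
        count += 1
        mean += (x - mean) / count  # incremental update, no history kept
        yield mean


print(list(running_mean([2, 4, 6])))  # [2.0, 3.0, 4.0]
```

Because the function is a generator, it works equally on a list, a file being read line by line, or an unbounded feed.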

Thank You!

http://www.greenbookblog.org/2012/03/21/big-data-opportunity-or-threat-for-market-research/

http://thailand.ipm-info.org/pesticides/survey_phitsanulok.htm

http://www.emich.edu/chhs/about-researchMETHODS.html

http://docs.basho.com/riak/latest/references/appendices/concepts/
