big data - architectural concerns for the new age
DESCRIPTION
A brief introduction to Big Data and why you should care about polyglot storage
TRANSCRIPT
![Page 1: Big Data - architectural concerns for the new age](https://reader034.vdocument.in/reader034/viewer/2022051612/54c625ce4a795900608b459b/html5/thumbnails/1.jpg)
Big Data
architectural concerns for the new age
Sunday, 2 December 12
![Page 2: Big Data - architectural concerns for the new age](https://reader034.vdocument.in/reader034/viewer/2022051612/54c625ce4a795900608b459b/html5/thumbnails/2.jpg)
Debasish Ghosh
CTO
(a Nomura Research Institute group company)
![Page 3: Big Data - architectural concerns for the new age](https://reader034.vdocument.in/reader034/viewer/2022051612/54c625ce4a795900608b459b/html5/thumbnails/3.jpg)
@debasishg on Twitter
code @ http://github.com/debasishg
blog @ Ruminations of a Programmer http://debasishg.blogspot.com
![Page 4: Big Data - architectural concerns for the new age](https://reader034.vdocument.in/reader034/viewer/2022051612/54c625ce4a795900608b459b/html5/thumbnails/4.jpg)
some numbers ..
![Page 5: Big Data - architectural concerns for the new age](https://reader034.vdocument.in/reader034/viewer/2022051612/54c625ce4a795900608b459b/html5/thumbnails/5.jpg)
Facebook reaches 1 billion active users
![Page 6: Big Data - architectural concerns for the new age](https://reader034.vdocument.in/reader034/viewer/2022051612/54c625ce4a795900608b459b/html5/thumbnails/6.jpg)
![Page 7: Big Data - architectural concerns for the new age](https://reader034.vdocument.in/reader034/viewer/2022051612/54c625ce4a795900608b459b/html5/thumbnails/7.jpg)
![Page 8: Big Data - architectural concerns for the new age](https://reader034.vdocument.in/reader034/viewer/2022051612/54c625ce4a795900608b459b/html5/thumbnails/8.jpg)
some more numbers ..
![Page 9: Big Data - architectural concerns for the new age](https://reader034.vdocument.in/reader034/viewer/2022051612/54c625ce4a795900608b459b/html5/thumbnails/9.jpg)
• Walmart handles 1M transactions per hour
• Google processes 24PB of data per day
• AT&T transfers 30PB of data per day
• 90 trillion emails are sent every year
• World of Warcraft uses 1.3PB of storage
![Page 10: Big Data - architectural concerns for the new age](https://reader034.vdocument.in/reader034/viewer/2022051612/54c625ce4a795900608b459b/html5/thumbnails/10.jpg)
Big Data - the positive feedback cycle
1. new technologies make using big data efficient
2. more adoption of big data
3. generation of more big data
![Page 11: Big Data - architectural concerns for the new age](https://reader034.vdocument.in/reader034/viewer/2022051612/54c625ce4a795900608b459b/html5/thumbnails/11.jpg)
new technologies
.. new architectural concerns
![Page 12: Big Data - architectural concerns for the new age](https://reader034.vdocument.in/reader034/viewer/2022051612/54c625ce4a795900608b459b/html5/thumbnails/12.jpg)
new ways to store data
![Page 13: Big Data - architectural concerns for the new age](https://reader034.vdocument.in/reader034/viewer/2022051612/54c625ce4a795900608b459b/html5/thumbnails/13.jpg)
new techniques to retrieve data
![Page 14: Big Data - architectural concerns for the new age](https://reader034.vdocument.in/reader034/viewer/2022051612/54c625ce4a795900608b459b/html5/thumbnails/14.jpg)
new ways to scale reads & writes
![Page 15: Big Data - architectural concerns for the new age](https://reader034.vdocument.in/reader034/viewer/2022051612/54c625ce4a795900608b459b/html5/thumbnails/15.jpg)
transparent to the application
![Page 16: Big Data - architectural concerns for the new age](https://reader034.vdocument.in/reader034/viewer/2022051612/54c625ce4a795900608b459b/html5/thumbnails/16.jpg)
new ways to consume data
![Page 17: Big Data - architectural concerns for the new age](https://reader034.vdocument.in/reader034/viewer/2022051612/54c625ce4a795900608b459b/html5/thumbnails/17.jpg)
new techniques to analyze data
![Page 18: Big Data - architectural concerns for the new age](https://reader034.vdocument.in/reader034/viewer/2022051612/54c625ce4a795900608b459b/html5/thumbnails/18.jpg)
new ways to visualize data
![Page 19: Big Data - architectural concerns for the new age](https://reader034.vdocument.in/reader034/viewer/2022051612/54c625ce4a795900608b459b/html5/thumbnails/19.jpg)
at Web scale
![Page 20: Big Data - architectural concerns for the new age](https://reader034.vdocument.in/reader034/viewer/2022051612/54c625ce4a795900608b459b/html5/thumbnails/20.jpg)
The Database Landscape so far ..
• relational database - the bedrock of enterprise data
• irrespective of application development paradigm
• object-relational-mapping considered to be the panacea for impedance mismatch
![Page 21: Big Data - architectural concerns for the new age](https://reader034.vdocument.in/reader034/viewer/2022051612/54c625ce4a795900608b459b/html5/thumbnails/21.jpg)
“Object Relational Mapping is the Vietnam of Computer Science”
- Ted Neward (2006)
blogger, big geek and architectural consultant
![Page 22: Big Data - architectural concerns for the new age](https://reader034.vdocument.in/reader034/viewer/2022051612/54c625ce4a795900608b459b/html5/thumbnails/22.jpg)
RDBMS & Big Data
• once the data volume crosses the limit of a single server, you shard / partition
• sharding implies a lookup node for the hash code => SPOF
• cross shard joins, transactions don’t scale
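A minimal sketch of why naive hash sharding is painful to grow, assuming keys are placed with `hash(key) % number_of_shards`. The names `shard_for` and `moved_fraction` are illustrative and not from the talk.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard using a stable hash modulo the shard count."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def moved_fraction(keys, old_shards: int, new_shards: int) -> float:
    """Fraction of keys that land on a different shard after resizing."""
    moved = sum(
        1 for k in keys if shard_for(k, old_shards) != shard_for(k, new_shards)
    )
    return moved / len(keys)


# Hypothetical key set: growing from 4 to 5 shards relocates most keys.
keys = [f"user:{i}" for i in range(10_000)]
fraction = moved_fraction(keys, 4, 5)
```

With modulo placement, adding one shard changes the target of most keys, so growing the cluster means bulk data movement and every client needs the new shard count at the same time.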
![Page 23: Big Data - architectural concerns for the new age](https://reader034.vdocument.in/reader034/viewer/2022051612/54c625ce4a795900608b459b/html5/thumbnails/23.jpg)
RDBMS & Big Data
• Cost of distributed transactions
• synchronization overhead
• 2 phase commit is a blocking protocol (can block indefinitely)
• as slow as the slowest DB node + network latency
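The cost of two-phase commit is easier to see in code. This is an in-memory sketch only: the `Participant` class and `two_phase_commit` function are hypothetical, and there is no network, timeout or crash recovery.

```python
class Participant:
    """A database node taking part in a distributed transaction."""

    def __init__(self, name: str, can_commit: bool = True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self) -> bool:
        # Phase 1: vote. After voting yes, the node must hold its locks
        # until the coordinator announces the outcome.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self) -> None:
        self.state = "committed"

    def abort(self) -> None:
        self.state = "aborted"


def two_phase_commit(participants) -> bool:
    """Coordinator: commit only if every participant votes yes."""
    # Phase 1: the coordinator waits for every vote, so the round is
    # as slow as the slowest participant.
    votes = [p.prepare() for p in participants]
    # Phase 2: broadcast the decision.
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.abort()
    return False
```

If the coordinator fails between the two phases, participants that voted yes stay in the `prepared` state holding locks, which is where the indefinite blocking comes from.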
![Page 24: Big Data - architectural concerns for the new age](https://reader034.vdocument.in/reader034/viewer/2022051612/54c625ce4a795900608b459b/html5/thumbnails/24.jpg)
RDBMS & Big Data
• Master/Slave replication
• synchronous replication => slow
• asynchronous replication => can lose data
• writing to master is a bottleneck and SPOF
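The replication trade-off above can be sketched with a toy master and a single replica. `Master`, `Replica` and `ship_pending` are hypothetical names, and real databases ship a log rather than individual key/value pairs.

```python
from collections import deque


class Replica:
    def __init__(self):
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value


class Master:
    def __init__(self, replica: Replica, synchronous: bool):
        self.data = {}
        self.replica = replica
        self.synchronous = synchronous
        self.pending = deque()  # writes not yet shipped to the replica

    def write(self, key, value) -> str:
        self.data[key] = value
        if self.synchronous:
            # The caller waits for the replica before getting an ack.
            self.replica.apply(key, value)
        else:
            # Ack immediately and replicate later.
            self.pending.append((key, value))
        return "ack"

    def ship_pending(self) -> None:
        """Background replication step for the asynchronous mode."""
        while self.pending:
            self.replica.apply(*self.pending.popleft())
```

In asynchronous mode, a master that dies after `write` returns but before `ship_pending` runs takes the acknowledged write with it. Synchronous mode closes that window at the price of waiting for the replica on every write.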
![Page 25: Big Data - architectural concerns for the new age](https://reader034.vdocument.in/reader034/viewer/2022051612/54c625ce4a795900608b459b/html5/thumbnails/25.jpg)
Need Distributed Databases
• data is automatically partitioned
• transparent to the application
• add capacity without downtime
• failure tolerant
![Page 26: Big Data - architectural concerns for the new age](https://reader034.vdocument.in/reader034/viewer/2022051612/54c625ce4a795900608b459b/html5/thumbnails/26.jpg)
2 famous papers ..
• Bigtable: A distributed storage system for structured data, 2006
• Dynamo: Amazon’s highly available key-value store, 2007
![Page 27: Big Data - architectural concerns for the new age](https://reader034.vdocument.in/reader034/viewer/2022051612/54c625ce4a795900608b459b/html5/thumbnails/27.jpg)
Addressing 2 Approaches
• Bigtable: “how can we build a distributed database on top of GFS ?”
• Dynamo: “how can we build a distributed hash table appropriate for data center ?”
![Page 28: Big Data - architectural concerns for the new age](https://reader034.vdocument.in/reader034/viewer/2022051612/54c625ce4a795900608b459b/html5/thumbnails/28.jpg)
Big Data recommendations
• reduce accidental complexity in processing data
• be less rigid (no rigid schema)
• store data in a format closer to the domain model
• hence no universal data model ..
![Page 29: Big Data - architectural concerns for the new age](https://reader034.vdocument.in/reader034/viewer/2022051612/54c625ce4a795900608b459b/html5/thumbnails/29.jpg)
Polyglot Storage
• unfortunately came to be known as NoSQL databases
• document oriented (MongoDB, CouchDB)
• key/value (Dynamo, Bigtable, Riak, Cassandra, Voldemort)
• data structure based (redis)
• graph based (Neo4J)
![Page 30: Big Data - architectural concerns for the new age](https://reader034.vdocument.in/reader034/viewer/2022051612/54c625ce4a795900608b459b/html5/thumbnails/30.jpg)
richer modeling capabilities
closer to domain model
reduced impedance mismatch
![Page 31: Big Data - architectural concerns for the new age](https://reader034.vdocument.in/reader034/viewer/2022051612/54c625ce4a795900608b459b/html5/thumbnails/31.jpg)
Asynchronous Replication to RDBMS using Message Oriented Middleware
![Page 32: Big Data - architectural concerns for the new age](https://reader034.vdocument.in/reader034/viewer/2022051612/54c625ce4a795900608b459b/html5/thumbnails/32.jpg)
Hybrid Oracle MongoDB storage over Messaging backbone
![Page 33: Big Data - architectural concerns for the new age](https://reader034.vdocument.in/reader034/viewer/2022051612/54c625ce4a795900608b459b/html5/thumbnails/33.jpg)
Relational Database is just another option, not the only option when data set is BIG and semantically rich
![Page 34: Big Data - architectural concerns for the new age](https://reader034.vdocument.in/reader034/viewer/2022051612/54c625ce4a795900608b459b/html5/thumbnails/34.jpg)
10 things never to do with a Relational Database
• Search
• Recommendation
• High Frequency Trading
• Product Cataloging
• User group / ACLs
• Log Analysis
• Media Repository
• Classified ads
• Time Series / Forecasting
Source: http://www.infoworld.com/d/application-development/10-things-never-do-relational-database-206944?page=0,0
![Page 35: Big Data - architectural concerns for the new age](https://reader034.vdocument.in/reader034/viewer/2022051612/54c625ce4a795900608b459b/html5/thumbnails/35.jpg)
Scalability, Availability ..
• ACID => BASE
• CAP Theorem & Eventual Consistency
• Consistent Hashing
• Vector Clocks
• Hinted Hand-off & Read repair
• Anti-entropy
• Gossip Protocol
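Of the techniques listed above, vector clocks are compact enough to sketch. The version below assumes a clock is a plain dict from node name to counter; the function names are illustrative and do not come from any particular database.

```python
def increment(clock: dict, node: str) -> dict:
    """Return a copy of the clock with this node's counter advanced."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def merge(a: dict, b: dict) -> dict:
    """Element-wise maximum, used after two versions are reconciled."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}


def compare(a: dict, b: dict) -> str:
    """Classify a relative to b: before, after, equal or concurrent."""
    nodes = a.keys() | b.keys()
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"


# Two replicas update the same base version independently.
v1 = increment({}, "A")
v2 = increment(v1, "B")
v3 = increment(v1, "C")
```

Here `v2` and `v3` both descend from `v1` but neither descends from the other, so `compare(v2, v3)` reports a conflict that the application or the store has to reconcile.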
![Page 36: Big Data - architectural concerns for the new age](https://reader034.vdocument.in/reader034/viewer/2022051612/54c625ce4a795900608b459b/html5/thumbnails/36.jpg)
CAP Theorem
• Consistency, Availability & Partition Tolerance
• You can have only 2 of these in a distributed system
• Eric Brewer conjectured this in 2000; Gilbert and Lynch proved it in 2002
![Page 37: Big Data - architectural concerns for the new age](https://reader034.vdocument.in/reader034/viewer/2022051612/54c625ce4a795900608b459b/html5/thumbnails/37.jpg)
ACID => BASE
• Basically Available, Soft state, Eventual consistency
• Rather than requiring consistency after every transaction, it’s enough for the database to eventually be in a consistent state.
• It’s ok to use stale data and it’s ok to give approximate answers
![Page 38: Big Data - architectural concerns for the new age](https://reader034.vdocument.in/reader034/viewer/2022051612/54c625ce4a795900608b459b/html5/thumbnails/38.jpg)
Consistent Hashing
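The slide's diagram does not survive in the transcript, so here is a minimal sketch of the idea: nodes and keys are hashed onto the same ring, and a key belongs to the first node found clockwise from its position. The `HashRing` class, the use of MD5 and the choice of 100 virtual nodes per server are assumptions made for illustration.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Position of a string on the ring."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes=(), vnodes: int = 100):
        self.vnodes = vnodes  # virtual nodes per server, smooths the load
        self._ring = []       # sorted list of (position, node)
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove(self, node: str) -> None:
        self._ring = [(pos, n) for pos, n in self._ring if n != node]

    def node_for(self, key: str) -> str:
        """First node clockwise from the key's position, wrapping around."""
        if not self._ring:
            raise ValueError("ring is empty")
        idx = bisect.bisect_right(self._ring, (_hash(key), ""))
        return self._ring[idx % len(self._ring)][1]
```

When a node joins, it only takes over the keys that fall just before its own positions on the ring; every other key stays where it was. No central lookup node is needed, since any client that knows the node list can compute the owner.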
![Page 39: Big Data - architectural concerns for the new age](https://reader034.vdocument.in/reader034/viewer/2022051612/54c625ce4a795900608b459b/html5/thumbnails/39.jpg)
Big Data in the wild
• Hadoop
• started as a batch processing engine (HDFS & Map/Reduce)
• with bigger and bigger data, you need to make it available to users in near real time
• stream processing, CEP ..
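For readers new to Map/Reduce, the batch model can be sketched in plain Python with the classic word count. This only mimics the map, shuffle and reduce phases in a single process; it is not the Hadoop API, and the function names are illustrative.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document: str):
    """Emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield (word, 1)


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values):
    """Combine all values seen for one key."""
    return key, sum(values)


def word_count(documents):
    pairs = chain.from_iterable(map_phase(d) for d in documents)
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())
```

In Hadoop the map and reduce calls run on many machines next to the data in HDFS, and the job only finishes once the whole input has been read, which is why it suits batch work better than near real time queries.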
![Page 40: Big Data - architectural concerns for the new age](https://reader034.vdocument.in/reader034/viewer/2022051612/54c625ce4a795900608b459b/html5/thumbnails/40.jpg)
complementing Map/Reduce in Hadoop
• Hive: a data warehouse system for Hadoop for easy data summarization, ad-hoc queries & analysis of large datasets stored in Hadoop compatible file systems
• Pig: a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs
• Cloudera Impala: real time ad hoc query capability for Hadoop, complementing traditional MapReduce batch processing
![Page 41: Big Data - architectural concerns for the new age](https://reader034.vdocument.in/reader034/viewer/2022051612/54c625ce4a795900608b459b/html5/thumbnails/41.jpg)
Real time queries in Hadoop
• currently people use Hadoop connectors to massively parallel databases to do real time queries in Hadoop
• expensive and may need lots of data movement between the database & the Hadoop clusters
![Page 42: Big Data - architectural concerns for the new age](https://reader034.vdocument.in/reader034/viewer/2022051612/54c625ce4a795900608b459b/html5/thumbnails/42.jpg)
.. and the Hadoop ecosystem continues to grow with lots of real time tools being developed actively that are compliant with the current base ..
![Page 43: Big Data - architectural concerns for the new age](https://reader034.vdocument.in/reader034/viewer/2022051612/54c625ce4a795900608b459b/html5/thumbnails/43.jpg)
Shark from UC Berkeley
• a large scale data warehouse system for Spark, compatible with Hive
• supports HiveQL, Hive data formats and user defined functions. In addition, Shark can be used to query data in HDFS, HBase and Amazon S3
![Page 44: Big Data - architectural concerns for the new age](https://reader034.vdocument.in/reader034/viewer/2022051612/54c625ce4a795900608b459b/html5/thumbnails/44.jpg)
BI and Analytics
• making Big Data available to developers
• API / scripting abilities for writing rich analytic applications (Precog, Continuity, Infochimps)
• analyzing user behaviors, network monitoring, log processing, recommenders, AI ..
![Page 45: Big Data - architectural concerns for the new age](https://reader034.vdocument.in/reader034/viewer/2022051612/54c625ce4a795900608b459b/html5/thumbnails/45.jpg)
Machine Learning
• personalization
• social network analysis
• pattern discovery - click patterns, recommendations, ratings
• apps that rely on machine learning - Prismatic, Trifacta, Google, Twitter ..
![Page 46: Big Data - architectural concerns for the new age](https://reader034.vdocument.in/reader034/viewer/2022051612/54c625ce4a795900608b459b/html5/thumbnails/46.jpg)
Summary
• Big Data will grow bigger - we need to embrace the changes in architecture
• An RDBMS is NOT the panacea - pick the data model that’s closest to your domain
• It’s economical to limit data movement - process data in place and utilize the multiple cores of your hardware
![Page 47: Big Data - architectural concerns for the new age](https://reader034.vdocument.in/reader034/viewer/2022051612/54c625ce4a795900608b459b/html5/thumbnails/47.jpg)
Summary
• Go for decentralized architectures, avoid SPOFs
• With the big volumes of data, streaming is your friend
![Page 48: Big Data - architectural concerns for the new age](https://reader034.vdocument.in/reader034/viewer/2022051612/54c625ce4a795900608b459b/html5/thumbnails/48.jpg)
Thank You!
![Page 49: Big Data - architectural concerns for the new age](https://reader034.vdocument.in/reader034/viewer/2022051612/54c625ce4a795900608b459b/html5/thumbnails/49.jpg)
http://www.greenbookblog.org/2012/03/21/big-data-opportunity-or-threat-for-market-research/
http://thailand.ipm-info.org/pesticides/survey_phitsanulok.htm
http://www.emich.edu/chhs/about-researchMETHODS.html
http://docs.basho.com/riak/latest/references/appendices/concepts/