big data - architectural concerns for the new age
Post on 26-Jan-2015
Big Data - architectural concerns for the new age
Sunday, 2 December 12
Debasish Ghosh, CTO
(a Nomura Research Institute group company)
@debasishg on Twitter
code @ http://github.com/debasishg
blog @ Ruminations of a Programmer http://debasishg.blogspot.com
some numbers ..
Facebook reaches 1 billion active users
some more numbers ..
• Walmart handles 1M transactions per hour
• Google processes 24PB of data per day
• AT&T transfers 30PB of data per day
• 90 trillion emails are sent every year
• World of Warcraft uses 1.3PB of storage
Big Data - the positive feedback cycle
1. new technologies make using big data efficient
2. more adoption of big data
3. generation of more big data
new technologies
.. new architectural concerns
new ways to store data
new techniques to retrieve data
new ways to scale reads & writes
transparent to the application
new ways to consume data
new techniques to analyze data
new ways to visualize data
at Web scale
The Database Landscape so far ..
• relational database - the bedrock of enterprise data
• irrespective of application development paradigm
• object-relational-mapping considered to be the panacea for impedance mismatch
“Object Relational Mapping is the Vietnam of Computer Science”
- Ted Neward (2006)
blogger, big geek and architectural consultant
RDBMS & Big Data
• once the data volume crosses the limit of a single server, you shard / partition
• sharding implies a lookup node for the hash code => single point of failure (SPOF)
• cross shard joins, transactions don’t scale
RDBMS & Big Data
• Cost of distributed transactions
• synchronization overhead
• 2 phase commit is a blocking protocol (can block indefinitely)
• as slow as the slowest DB node + network latency
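The cost described above can be made concrete with a minimal, single-process sketch of two-phase commit. It is an illustration only: `Participant`, `two_phase_commit` and the `coordinator_crashes` flag are hypothetical names, and real participants would be remote databases holding locks rather than in-memory objects.

```python
from dataclasses import dataclass


@dataclass
class Participant:
    """Hypothetical in-memory stand-in for one database node."""
    name: str
    can_commit: bool = True
    state: str = "init"

    def prepare(self) -> bool:
        # Phase 1: vote. A "yes" vote leaves the node waiting in "prepared".
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self) -> None:
        self.state = "committed"

    def abort(self) -> None:
        self.state = "aborted"


def two_phase_commit(participants, coordinator_crashes=False):
    # Phase 1: every node must vote, so the round is as slow as the slowest node.
    votes = [p.prepare() for p in participants]
    if coordinator_crashes:
        # No decision is ever sent: prepared nodes stay blocked.
        return None
    # Phase 2: commit only if every vote was "yes", otherwise abort everywhere.
    decision = all(votes)
    for p in participants:
        if decision:
            p.commit()
        else:
            p.abort()
    return decision
```

Calling `two_phase_commit(nodes, coordinator_crashes=True)` leaves every node in the `prepared` state, which is the blocking behaviour the slide warns about.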
RDBMS & Big Data
• Master/Slave replication
• synchronous replication => slow
• asynchronous replication => can lose data
• writing to master is a bottleneck and SPOF
Need Distributed Databases
• data is automatically partitioned
• transparent to the application
• add capacity without downtime
• failure tolerant
2 famous papers ..
• Bigtable: A distributed storage system for structured data, 2006
• Dynamo: Amazon’s highly available key-value store, 2007
Addressing 2 Approaches
• Bigtable: “how can we build a distributed database on top of GFS ?”
• Dynamo: “how can we build a distributed hash table appropriate for data center ?”
Big Data recommendations
• reduce accidental complexity in processing data
• be less rigid (no rigid schema)
• store data in a format closer to the domain model
• hence no universal data model ..
Polyglot Storage
• unfortunately came to be known as NoSQL databases
• document oriented (MongoDB, CouchDB)
• key/value (Dynamo, Bigtable, Riak, Cassandra, Voldemort)
• data structure based (Redis)
• graph based (Neo4j)
richer modeling capabilities
closer to domain model
reduced impedance mismatch
Asynchronous Replication to RDBMS using Message Oriented Middleware
Hybrid Oracle MongoDB storage over Messaging backbone
A relational database is just another option, not the only option, when the data set is BIG and semantically rich
10 things never to do with a Relational Database
• Search
• Recommendation
• High Frequency Trading
• Product Cataloging
• User group / ACLs
• Log Analysis
• Media Repository
• Classification ad
• Time Series / Forecasting
Source: http://www.infoworld.com/d/application-development/10-things-never-do-relational-database-206944?page=0,0
Scalability, Availability ..
• ACID => BASE
• CAP Theorem & Eventual Consistency
• Consistent Hashing
• Vector Clocks
• Hinted Hand-off & Read repair
• Anti-entropy
• Gossip Protocol
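Of the techniques listed above, vector clocks are the easiest to show in a few lines. The following is a minimal sketch, assuming a clock is a plain dictionary from node name to counter; the function names `increment`, `merge`, `descends` and `compare` are illustrative and not taken from any particular database.

```python
def increment(clock: dict, node: str) -> dict:
    """Return a copy of the clock with this node's counter advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def merge(a: dict, b: dict) -> dict:
    """Component-wise maximum, used after reconciling two versions."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}


def descends(a: dict, b: dict) -> bool:
    """True if clock a has seen every event that clock b has seen."""
    return all(a.get(n, 0) >= count for n, count in b.items())


def compare(a: dict, b: dict) -> str:
    """Classify a relative to b: equal, after, before or concurrent."""
    a_ge_b, b_ge_a = descends(a, b), descends(b, a)
    if a_ge_b and b_ge_a:
        return "equal"
    if a_ge_b:
        return "after"
    if b_ge_a:
        return "before"
    # Neither version descends from the other: a conflict to be reconciled.
    return "concurrent"
```

Two writes that start from the same version on different nodes compare as `concurrent`; a store that uses vector clocks would keep both versions and let a later read or the application reconcile them.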
CAP Theorem
• Consistency, Availability & Partition Tolerance
• You can have only 2 of these in a distributed system
• Eric Brewer conjectured this in his PODC 2000 keynote; Gilbert and Lynch proved it in 2002
ACID => BASE
• Basically Available, Soft state, Eventual consistency
• Rather than requiring consistency after every transaction, it’s enough for the database to eventually be in a consistent state.
• It’s ok to use stale data and it’s ok to give approximate answers
Consistent Hashing
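A minimal sketch of the idea, assuming MD5 as the hash function and 64 virtual nodes per physical node (both are arbitrary choices for illustration; `HashRing` is a hypothetical class, not the API of any specific store). Nodes and keys are hashed onto the same ring, and a key belongs to the first node position found clockwise from the key's own position.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    # Any well-distributed hash works; MD5 is used here only for illustration.
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes=(), vnodes=64):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (position, node) pairs
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        # Each physical node is placed at several positions to even out load.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove(self, node: str) -> None:
        self._ring = [(pos, n) for pos, n in self._ring if n != node]

    def lookup(self, key: str) -> str:
        if not self._ring:
            raise LookupError("ring is empty")
        # First position at or after the key's hash, wrapping around the ring.
        idx = bisect.bisect_right(self._ring, (_hash(key), "")) % len(self._ring)
        return self._ring[idx][1]
```

When a node is added, the only keys that change owner are those that now fall to the new node; every other key stays where it was. That is the contrast with `hash(key) % number_of_nodes`, where changing the node count remaps most keys.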
Big Data in the wild
• Hadoop
• started as a batch processing engine (HDFS & Map/Reduce)
• as data grows bigger and bigger, it has to be made available to users in near real time
• stream processing, CEP ..
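For readers new to the Map/Reduce model mentioned above, here is a single-process sketch of its three steps (map, shuffle, reduce) using the usual word-count example. It does not use Hadoop's API; the function names are illustrative, and a real job would run the map and reduce steps in parallel across a cluster over data stored in HDFS.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line: str):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values):
    """Collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
```

The whole input is read before any result is available, which is why this style is batch oriented and why the near real time tools on the following slides exist.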
complementing Map/Reduce in Hadoop
• Hive, a data warehouse system for Hadoop for easy data summarization, ad-hoc queries & analysis of large datasets stored in Hadoop compatible file systems
• Pig, a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs
• Cloudera Impala, which brings real time ad hoc query capability to Hadoop, complementing traditional MapReduce batch processing
Real time queries in Hadoop
• currently people use Hadoop connectors to massively parallel databases to do real time queries in Hadoop
• expensive and may need lots of data movement between the database & the Hadoop clusters
.. and the Hadoop ecosystem continues to grow with lots of real time tools being developed actively that are compliant with the current base ..
Shark from UC Berkeley
• a large scale data warehouse system for Spark, compatible with Hive
• supports HiveQL, Hive data formats and user defined functions. In addition, Shark can be used to query data in HDFS, HBase and Amazon S3
BI and Analytics
• making Big Data available to developers
• API / scripting abilities for writing rich analytic applications (Precog, Continuity, Infochimps)
• analyzing user behaviors, network monitoring, log processing, recommenders, AI ..
Machine Learning
• personalization
• social network analysis
• pattern discovery - click patterns, recommendations, ratings
• apps that rely on machine learning - Prismatic, Trifacta, Google, Twitter ..
Summary
• Big Data will grow bigger - we need to embrace the changes in architecture
• An RDBMS is NOT the panacea - pick your data model that’s closest to your domain
• It’s economical to limit data movement - process data in place and utilize the multiple cores of your hardware
Summary
• Go for decentralized architectures, avoid SPOFs
• With the big volumes of data, streaming is your friend
Thank You!
http://www.greenbookblog.org/2012/03/21/big-data-opportunity-or-threat-for-market-research/
http://thailand.ipm-info.org/pesticides/survey_phitsanulok.htm
http://www.emich.edu/chhs/about-researchMETHODS.html
http://docs.basho.com/riak/latest/references/appendices/concepts/