Why every NoSQL deployment should be paired with Hadoop
DESCRIPTION
The terms NoSQL and Big Data are frequently conflated; many view them as synonyms. That is understandable: both technologies eschew the relational data model and spread data across clusters of servers, whereas relational database technology favors centralized computing. But the problems these technologies address are quite different. Hadoop, the Big Data poster child, is focused on data analysis: gleaning insights from large volumes of data. NoSQL databases are transactional systems that deliver high-performance, cost-effective data management for modern real-time web and mobile applications; this is the Big User problem. Of course, if you have a lot of users, you are probably going to generate a lot of data. IDC estimates that more than 1.8 trillion gigabytes of information was created in 2011 and that this number will double every two years. The proliferation of user-generated data from interactive web and mobile applications is a key contributor to this growth.
These slides will address:
- Why NoSQL and Big Data are similar, but different
- The categories of NoSQL systems, and the types of applications for which they are best suited
- How Cloudera’s Distribution Including Apache Hadoop and Couchbase can be used together to build better applications
- Real-world use cases where NoSQL and Hadoop technologies work in concert
To view Couchbase webinars on-demand visit http://www.couchbase.com/webinars
TRANSCRIPT
1
Why every NoSQL deployment should be paired with Hadoop
James Phillips, Co-founder and SVP Products, Couchbase
Amr Awadallah, Co-founder and CTO, Cloudera
2
Agenda
• Big Audience vs. Big Data
• NoSQL for Big Audience
• Hadoop for Big Data
• Big Audiences create and consume Big Data
  – NoSQL and Hadoop are highly synergistic
• Couchbase + Cloudera
3
Aren’t NoSQL, Hadoop, “Big Data” all the same?
No.
4
Two challenges at the data layer
“Big Data.” IDC estimates that more than 1.8 trillion gigabytes of information was created in 2011 and that it will double every two years.
“Big Audience.” Most new interactive software systems are accessed via browser, with 2 billion potential users and a 24x7 uptime requirement.
5
NoSQL for
“Big Audience”
6
Changes in interactive software: NoSQL driver
7
Modern interactive software architecture
Application Scales Out: just add more commodity web servers
Database Scales Up: get a bigger, more complex server
Note: Relational database technology is great for what it is great for, but it is not great for this.
8
Extending the scope of RDBMS technology
• Data partitioning (“sharding”)
  – Disruptive to reshard; impacts application
  – No cross-shard joins
  – Schema management at every shard
• Denormalizing
  – Increases speed
  – At the limit, provides complete flexibility
  – Eliminates relational query benefits
• Distributed caching
  – Accelerate reads
  – Scale out
  – Another tier, no write acceleration, coherency management
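The "disruptive to reshard" point can be illustrated with a small sketch. It assumes the simplest application-level scheme, placing each key on shard `hash(key) % num_shards`; the key names, shard counts, and the choice of MD5 are illustrative assumptions, not anything taken from the slides.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard with a stable hash (MD5 is used only for determinism)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def fraction_moved(keys, old_shards: int, new_shards: int) -> float:
    """Fraction of keys whose shard changes when the shard count changes."""
    moved = sum(
        1 for k in keys if shard_for(k, old_shards) != shard_for(k, new_shards)
    )
    return moved / len(keys)


# Hypothetical workload: 10,000 user keys, growing from 4 shards to 5.
user_keys = [f"user:{i}" for i in range(10_000)]
moved = fraction_moved(user_keys, old_shards=4, new_shards=5)
print(f"{moved:.0%} of keys change shard")
```

With plain modulo placement, most keys land on a different shard after adding a single server (about 80% in expectation for 4 to 5 shards), which is why manual resharding tends to ripple into the application.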
9
Lacking market solutions, users forced to invent
Dynamo October 2007
Cassandra August 2008
Voldemort February 2009
Bigtable November 2006
• No schema required before inserting data
• No schema change required to change data format
• Auto-sharding without application participation
• Distributed queries
• Integrated main memory caching
• Data synchronization (mobile, multi-datacenter)
10
NoSQL database matches application logic tier architecture
Data layer now scales with linear cost and constant performance.
Application Scales Out: just add more commodity web servers
Database Scales Out: just add more commodity data servers
Scaling out flattens the cost and performance curves.
NoSQL Database Servers
11
Survey: Schema inflexibility #1 adoption driver
What is the biggest data management problem driving your use of NoSQL in the coming year?
Lack of flexibility/rigid schemas: 49%
Inability to scale out data: 35%
High latency/low performance: 29%
Costs: 16%
All of these: 12%
Other: 11%
Source: Couchbase NoSQL Survey, December 2011, n=1351
12
Hadoop for
“Big Data”
©2012 Cloudera, Inc. All Rights Reserved. 13
The Problems with Current Data Systems
1. Can’t Explore Original High Fidelity Raw Data
2. Moving Data To Compute Doesn’t Scale
3. Archiving = Premature Data Death
Diagram labels: Instrumentation; Collection; Storage Only Grid (original raw data, mostly append); ETL Compute Grid; RDBMS (aggregated data); BI Reports + Interactive Apps
14
The Solution: A Combined Storage/Compute Layer
1. Data Exploration & Advanced Analytics
2. Scalable Throughput For ETL & Aggregation
3. Keep Data Alive Forever
Diagram labels: Instrumentation; Collection; Hadoop: Storage + Compute Grid (mostly append); RDBMS (aggregated data); BI Reports + Interactive Apps
The Key Benefit: Agility/Flexibility
15
Schema-on-Write (RDBMS):
• Schema must be created before any data can be loaded.
• An explicit load operation has to take place which transforms data to DB internal structure.
• New columns must be added explicitly before new data for such columns can be loaded into the database.
Pros:
• Read is Fast
• Standards/Governance
Schema-on-Read (Hadoop):
• Data is simply copied to the file store, no transformation is needed.
• A SerDe (Serializer/Deserializer) is applied during read time to extract the required columns (late binding).
• New data can start flowing anytime and will appear retroactively once the SerDe is updated to parse it.
Pros:
• Load is Fast
• Flexibility/Agility
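The late-binding idea behind schema-on-read can be sketched in a few lines. This is a minimal illustration only: a plain Python function stands in for a SerDe, a list of JSON strings stands in for the file store, and the field names are hypothetical. It is not the Hive SerDe interface.

```python
import json

# Raw events are stored exactly as they arrive (load is just a copy).
raw_store = [
    '{"user": "alice", "action": "login"}',
    '{"user": "bob", "action": "purchase", "amount": 12.5}',
]


def read_with_schema(lines, columns):
    """Apply a schema at read time: parse each raw line, extract the requested columns.

    Fields missing from a record come back as None rather than failing the read.
    """
    for line in lines:
        record = json.loads(line)
        yield tuple(record.get(col) for col in columns)


# First version of the reader only knows about two columns.
v1_rows = list(read_with_schema(raw_store, ["user", "action"]))

# Later the reader is updated to parse "amount"; data stored earlier
# exposes the new column retroactively, with no reload.
v2_rows = list(read_with_schema(raw_store, ["user", "action", "amount"]))
```

Nothing in `raw_store` changes between the two reads; only the reader does, which is the flexibility the slide attributes to schema-on-read, paid for with parsing work on every read.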
Scalability: Scalable Software Development
16
Grows without requiring developers to re-architect their algorithms/application.
AUTO SCALE
Economics: Return on Byte
• Return on Byte (ROB) = value to be extracted from that byte divided by the cost of storing that byte
• If ROB is < 1, the data will be buried in the tape wasteland; thus we need more economical active storage.
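A short worked example of the ROB ratio may help. The value and cost figures below are hypothetical and only show how lowering storage cost can move the same dataset from below 1 to above 1; the function name is an assumption for illustration.

```python
def return_on_byte(value_extracted: float, storage_cost: float) -> float:
    """ROB = value extracted from the data divided by the cost of storing it."""
    if storage_cost <= 0:
        raise ValueError("storage_cost must be positive")
    return value_extracted / storage_cost


# Hypothetical figures, in the same currency unit, for one dataset.
value = 500.0
rob_expensive = return_on_byte(value, storage_cost=2000.0)
rob_cheap = return_on_byte(value, storage_cost=250.0)

for label, rob in [("expensive storage", rob_expensive), ("cheaper storage", rob_cheap)]:
    decision = "keep active" if rob >= 1 else "archive to tape"
    print(f"{label}: ROB={rob:.2f} -> {decision}")
```

The value of the data is unchanged in both cases; only the denominator moves, which is the slide's argument for more economical active storage.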
17
Low ROB
High ROB
Hadoop in the Enterprise Data Stack
Diagram components: Logs, Files, Web Data, Relational Databases; Flume; Sqoop; IDEs; BI / Analytics; Enterprise Reporting; Enterprise Data Warehouse; Modeling Tools; Meta Data/ETL Tools; Cloudera Manager; Web/Mobile Applications; ODBC, JDBC, NFS
Roles shown: System Operators, Engineers, Analysts, Business Users, Data Scientists, Data Architects, Customers
18
19
Big Audiences create and consume
Big Data.
20
Two peas. One pod.
http://tinyurl.com/6tx42tw
21
Hadoop as a Web application feeder or consumer
Pattern 1: Hadoop feeding a web application (diagram labels: “big data”, insights, Web application, big audience)
Pattern 2: Hadoop consuming web application data (diagram labels: Web application, “big audience”, big data, insights)
22
Pattern 1 Case Study: AOL Ad Targeting
• One of the largest online ad targeting operations
• Ad slot filling optimization
  – Serve the most relevant ad to a given user
  – Meet contracted impression counts
• Relevancy criteria
  – Demographic
  – Psychographic
  – Current behavioral
• 40 milliseconds to fill all slots
23
AOL Advertising: Hadoop as an ad targeting feeder
Diagram labels: affiliates; events; profiles, campaigns; profiles, real-time campaign statistics
40 milliseconds to respond with the decision.
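The feeder pattern can be sketched roughly as follows: an offline job derives profiles from events, the results are loaded into a key-value store, and the request path does only key lookups. Everything here is a hypothetical stand-in (a Python function for the Hadoop job, dicts for the NoSQL store, invented interests and ad names); it is not a description of AOL's actual system.

```python
from collections import Counter

# Step 1 (batch side, stand-in for a Hadoop job): derive a profile per user from events.
events = [
    ("u1", "sports"), ("u1", "sports"), ("u1", "travel"),
    ("u2", "travel"), ("u2", "travel"),
]


def build_profiles(event_log):
    """Return {user_id: most frequent interest}; a toy substitute for offline analysis."""
    per_user = {}
    for user_id, interest in event_log:
        per_user.setdefault(user_id, Counter())[interest] += 1
    return {user: counts.most_common(1)[0][0] for user, counts in per_user.items()}


# Step 2: load the results into a key-value store (dicts stand in for the NoSQL database).
profile_store = build_profiles(events)
campaigns = {"sports": "ad-sneakers", "travel": "ad-flights"}


def choose_ad(user_id, default_ad="ad-generic"):
    """Request-time path: one key lookup per store, no analysis in the hot path."""
    interest = profile_store.get(user_id)
    return campaigns.get(interest, default_ad)


print(choose_ad("u1"), choose_ad("u2"), choose_ad("unknown-user"))
```

Keeping the expensive analysis out of the request path is what makes a tight response budget, such as the 40 milliseconds on the slide, plausible for the serving tier.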
24
Pattern 2 Case Study: Social gaming user analysis
• Tens to hundreds of millions of users
• Game optimization requirements
  – Keep game fresh and retain audience
  – Maximize revenue through offer and experience tuning
• Very different data management tasks
  – Serving game data
    • System of record game data
    • Very low latency data access
    • Non-disruptive elasticity
    • Complex queries
  – Analyzing user behavior
    • Not game data, rather user behavior data
    • High-throughput data analysis
25
Social Game: Game optimization via Hadoop
Diagram labels: User interacting with game; Validation and response; Game and user data system of record; User behavioral data; Insights
26
Couchbase and Cloudera
27
Couchbase Sqoop connector for Cloudera
http://www.couchbase.com/develop/connectors/hadoop
Cloudera-certified connector
Bi-directional data movement
- Hadoop -> Couchbase
- Couchbase -> Hadoop
28
Questions?