1

Why every NoSQL deployment should be paired with Hadoop

James Phillips, Co-founder and SVP Products, Couchbase

Amr Awadallah, Co-founder and CTO, Cloudera

2

Agenda

•  Big Audience vs. Big Data
•  NoSQL for Big Audience
•  Hadoop for Big Data
•  Big Audiences create and consume Big Data
    –  NoSQL and Hadoop are highly synergistic
•  Couchbase + Cloudera

3

Aren’t NoSQL, Hadoop, “Big Data” all the same?

No.

4

Two challenges at the data layer

“Big Audience.” Most new interactive software systems are accessed via browser, with 2 billion potential users and a 24x7 uptime requirement.

“Big Data.” IDC estimates that more than 1.8 trillion gigabytes of information was created in 2011 and that it will double every two years.

5

NoSQL for “Big Audience”

6

Changes in interactive software – NoSQL driver

7

Modern interactive software architecture

Application Scales Out: Just add more commodity web servers

Database Scales Up: Get a bigger, more complex server

Note – Relational database technology is great for what it is great for, but it is not great for this.

8

Extending the scope of RDBMS technology

•  Data partitioning (“sharding”)
    –  Disruptive to reshard – impacts application
    –  No cross-shard joins
    –  Schema management at every shard

•  Denormalizing
    –  Increases speed
    –  At the limit, provides complete flexibility
    –  Eliminates relational query benefits

•  Distributed caching
    –  Accelerate reads
    –  Scale out
    –  Another tier, no write acceleration, coherency management
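Why resharding is disruptive can be seen in a minimal sketch. It assumes the application picks a shard with a stable hash of the key modulo the shard count, which is one common approach and not necessarily what any given deployment does. The function names and the key format are hypothetical.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Application-level sharding: stable hash of the key, modulo shard count."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def moved_fraction(keys, old_shards: int, new_shards: int) -> float:
    """Fraction of keys that land on a different shard after the count changes."""
    moved = sum(
        1 for k in keys
        if shard_for(k, old_shards) != shard_for(k, new_shards)
    )
    return moved / len(keys)


# Hypothetical key space: 10,000 user records.
keys = [f"user:{i}" for i in range(10_000)]
fraction = moved_fraction(keys, old_shards=4, new_shards=5)
print(f"{fraction:.0%} of keys move when going from 4 to 5 shards")
```

With modulo placement, a key stays put only when its hash gives the same remainder for 4 and for 5, which is about one key in five. Adding a single shard therefore relocates roughly 80% of the data, and the application's routing logic has to change at the same time.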

9

Lacking market solutions, users forced to invent

Dynamo October 2007

Cassandra August 2008

Voldemort February 2009

Bigtable November 2006

•  No schema required before inserting data
•  No schema change required to change data format
•  Auto-sharding without application participation
•  Distributed queries
•  Integrated main memory caching
•  Data synchronization (mobile, multi-datacenter)
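The first two properties in the list above can be illustrated with a toy sketch. The `DocumentStore` class is a hypothetical in-memory stand-in, not the API of Couchbase or of any system named on this slide.

```python
import json


class DocumentStore:
    """Toy in-memory key-value document store (hypothetical, illustration only)."""

    def __init__(self):
        self._docs = {}

    def set(self, key, doc):
        # Documents are stored as JSON text; no schema is declared anywhere.
        self._docs[key] = json.dumps(doc)

    def get(self, key):
        return json.loads(self._docs[key])


store = DocumentStore()

# No schema required before inserting data.
store.set("user:1", {"name": "Ada"})

# No schema change required to change the data format: new fields just appear.
store.set("user:2", {"name": "Grace", "email": "grace@example.com", "tags": ["beta"]})

# Readers tolerate documents of different shapes.
emails = [store.get(k).get("email") for k in ("user:1", "user:2")]
print(emails)  # [None, 'grace@example.com']
```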

10

NoSQL database matches application logic tier architecture

Data layer now scales with linear cost and constant performance.

Application Scales Out: Just add more commodity web servers

Database Scales Out: Just add more commodity data servers

Scaling out flattens the cost and performance curves.

NoSQL Database Servers

11

Survey: Schema inflexibility #1 adoption driver

What is the biggest data management problem driving your use of NoSQL in the coming year?

Lack of flexibility/rigid schemas    49%
Inability to scale out data          35%
High latency/low performance         29%
Costs                                16%
All of these                         12%
Other                                11%

Source: Couchbase NoSQL Survey, December 2011, n=1351

12

Hadoop for “Big Data”

©2012 Cloudera, Inc. All Rights Reserved. 13

The Problems with Current Data Systems

[Diagram components: Instrumentation; Collection; Storage Only Grid (original raw data, mostly append); ETL Compute Grid; RDBMS (aggregated data); BI Reports + Interactive Apps]

1. Can’t Explore Original High Fidelity Raw Data
2. Moving Data To Compute Doesn’t Scale
3. Archiving = Premature Data Death


The Solution: A Combined Storage/Compute Layer

[Diagram components: Instrumentation; Collection; Hadoop: Storage + Compute Grid (mostly append); RDBMS (aggregated data); BI Reports + Interactive Apps]

1. Data Exploration & Advanced Analytics
2. Scalable Throughput For ETL & Aggregation
3. Keep Data Alive For Ever

The Key Benefit: Agility/Flexibility


Schema-on-Write (RDBMS):

•  Schema must be created before any data can be loaded.
•  An explicit load operation has to take place which transforms data to DB internal structure.
•  New columns must be added explicitly before new data for such columns can be loaded into the database.

Pros:
•  Read is Fast
•  Standards/Governance

Schema-on-Read (Hadoop):

•  Data is simply copied to the file store, no transformation is needed.
•  A SerDe (Serializer/Deserializer) is applied during read time to extract the required columns (late binding).
•  New data can start flowing anytime and will appear retroactively once the SerDe is updated to parse it.

Pros:
•  Load is Fast
•  Flexibility/Agility
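Schema-on-read and late binding can be sketched in a few lines. This is an illustration under assumptions, not Hive's SerDe interface: the list `raw_store` stands in for the file store, and `serde_v1` / `serde_v2` are hypothetical parsers for a tab-separated log format invented for the example.

```python
raw_store = []  # stands in for files copied as-is into the file store


def load(line: str) -> None:
    """Schema-on-read load: append the raw line, no transformation."""
    raw_store.append(line)


def serde_v1(line: str) -> dict:
    """First reader: only knows the first two tab-separated fields."""
    parts = line.split("\t")
    return {"user": parts[0], "action": parts[1]}


def serde_v2(line: str) -> dict:
    """Updated reader: also extracts an optional third field."""
    parts = line.split("\t")
    row = {"user": parts[0], "action": parts[1]}
    row["device"] = parts[2] if len(parts) > 2 else None
    return row


def read(serde):
    """Apply the parser at read time (late binding)."""
    return [serde(line) for line in raw_store]


load("alice\tlogin")
load("bob\tpurchase\tmobile")  # a new field starts flowing before any reader knows it

rows_v1 = read(serde_v1)
rows_v2 = read(serde_v2)
print(rows_v1[1])  # {'user': 'bob', 'action': 'purchase'}
print(rows_v2[1])  # {'user': 'bob', 'action': 'purchase', 'device': 'mobile'}
```

The raw lines never change. Only the reader does, which is why the third field appears retroactively for data loaded before the parser was updated.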

Scalability: Scalable Software Development


Grows without requiring developers to re-architect their algorithms/application.

AUTO SCALE

Economics: Return on Byte

•  Return on Byte (ROB) = value to be extracted from that byte divided by the cost of storing that byte

•  If ROB is < 1 then it will be buried into tape wasteland, thus we need more economical active storage.
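A small worked example of the ROB ratio follows. All figures are hypothetical and in arbitrary units per gigabyte; they are chosen only to show how cheaper active storage moves a dataset from below 1 to above 1.

```python
def return_on_byte(value: float, storage_cost: float) -> float:
    """ROB = value to be extracted from the data / cost of storing it."""
    if storage_cost <= 0:
        raise ValueError("storage_cost must be positive")
    return value / storage_cost


# Hypothetical numbers (arbitrary units per GB), not taken from the slides.
clickstream_value = 0.5
expensive_storage_cost = 2.0
cheap_storage_cost = 0.25

rob_expensive = return_on_byte(clickstream_value, expensive_storage_cost)
rob_cheap = return_on_byte(clickstream_value, cheap_storage_cost)

print(rob_expensive)  # 0.25 -> below 1, candidate for the tape archive
print(rob_cheap)      # 2.0  -> above 1, worth keeping in active storage
```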


Low ROB

High ROB

Hadoop in the Enterprise Data Stack

[Diagram components:
Data sources: Logs, Files, Web Data, Relational Databases
Connectors and interfaces: Flume, Sqoop, ODBC, JDBC, NFS
Tools and systems: IDEs, BI / Analytics, Enterprise Reporting, Enterprise Data Warehouse, Modeling Tools, Meta Data / ETL Tools, Web/Mobile Applications, Cloudera Manager
Users: System Operators, Engineers, Analysts, Business Users, Data Scientists, Data Architects, Customers]


19

Big Audiences create and consume Big Data.

20

Two peas. One pod.

http://tinyurl.com/6tx42tw

21

Hadoop as a Web application feeder or consumer

Pattern 1: Hadoop feeding a web application
“big data” → insights → Web application → big audience

Pattern 2: Hadoop consuming web application data
“big audience” → Web application → big data → insights

22

Pattern 1 Case Study: AOL Ad Targeting

•  One of the largest online ad targeting operations
•  Ad slot filling optimization
    –  Serve the most relevant ad to a given user
    –  Meet contracted impression counts
•  Relevancy criteria
    –  Demographic
    –  Psychographic
    –  Current behavioral
•  40 milliseconds to fill all slots

23

AOL Advertising: Hadoop as an ad targeting feeder

[Diagram with numbered steps 1 to 3; labels: events; profiles, campaigns; profiles, real time campaign statistics; affiliates]

40 milliseconds to respond with the decision.

24

Pattern 2 Case Study: Social gaming user analysis

•  Tens to hundreds of millions of users
•  Game optimization requirements
    –  Keep game fresh and retain audience
    –  Maximize revenue through offer and experience tuning
•  Very different data management tasks
    –  Serving game data
        •  System of record game data
        •  Very low latency data access
        •  Non-disruptive elasticity
        •  Complex queries
    –  Analyzing user behavior
        •  Not game data, rather user behavior data
        •  High-throughput data analysis

25

Social Game: Game optimization via Hadoop

[Diagram with numbered steps 1 to 5; labels: User interacting with game; Validation and response; Game and user data system of record; User behavioral data; Insights]
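The split between serving game data and analyzing user behavior can be sketched as two separate paths. This is a toy illustration under assumptions: the dict `game_state` stands in for the low-latency system of record, the list `behavior_log` stands in for behavioral data shipped to the analysis cluster, and the event fields and function names are hypothetical.

```python
from collections import Counter

game_state = {}    # stands in for the low-latency system of record
behavior_log = []  # stands in for behavioral data shipped to the analysis cluster


def handle_action(user_id: str, level: int, action: str) -> dict:
    """Serving path: update game state, record the behavior event, respond."""
    state = game_state.setdefault(user_id, {"level": 1})
    if action == "complete":
        state["level"] = level + 1
    behavior_log.append({"user": user_id, "level": level, "action": action})
    return dict(state)


def quits_per_level(events) -> Counter:
    """Analysis path: batch count of 'quit' events per level."""
    return Counter(e["level"] for e in events if e["action"] == "quit")


handle_action("u1", 1, "complete")
handle_action("u1", 2, "quit")
handle_action("u2", 1, "complete")
handle_action("u2", 2, "quit")
handle_action("u3", 1, "quit")

insights = quits_per_level(behavior_log)
hardest_level, quit_count = insights.most_common(1)[0]
print(hardest_level, quit_count)  # 2 2
```

The serving path touches one user's record per request, while the analysis path scans every event. That difference in access pattern is the reason the two tasks suit different systems.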

26

Couchbase and Cloudera

27

Couchbase Sqoop connector for Cloudera

http://www.couchbase.com/develop/connectors/hadoop

•  Cloudera-certified connector
•  Bi-directional data movement
    –  Hadoop -> Couchbase
    –  Couchbase -> Hadoop

28

Questions?
