Why Every NoSQL Deployment Should be Paired With Hadoop


Upload: couchbase

Post on 20-Jun-2015


DESCRIPTION

The terms NoSQL and Big Data are frequently conflated; many view them as synonyms. That is understandable: both technologies eschew the relational data model and spread data across clusters of servers, whereas relational database technology favors centralized computing. But the problems these technologies address are quite different. Hadoop, the Big Data poster child, is focused on data analysis: gleaning insights from large volumes of data. NoSQL databases are transactional systems that deliver high-performance, cost-effective data management for modern real-time web and mobile applications; this is the Big User problem. Of course, if you have a lot of users, you are probably going to generate a lot of data. IDC estimates that more than 1.8 trillion gigabytes of information was created in 2011 and that this number will double every two years. The proliferation of user-generated data from interactive web and mobile applications is a key contributor to this growth.

These slides will address:
- Why NoSQL and Big Data are similar, but different
- The categories of NoSQL systems, and the types of applications for which they are best suited
- How Cloudera’s Distribution Including Apache Hadoop and Couchbase can be used together to build better applications
- Real-world use cases where NoSQL and Hadoop technologies work in concert

To view Couchbase webinars on-demand visit http://www.couchbase.com/webinars

TRANSCRIPT

Page 1: Why Every NoSQL Deployment Should be Paired With Hadoop

Why every NoSQL deployment should be paired with Hadoop

James Phillips, Co-founder and SVP Products, Couchbase

Amr Awadallah, Co-founder and CTO, Cloudera

Page 2: Why Every NoSQL Deployment Should be Paired With Hadoop

Agenda

•  Big Audience vs. Big Data
•  NoSQL for Big Audience
•  Hadoop for Big Data
•  Big Audiences create and consume Big Data
   –  NoSQL and Hadoop are highly synergistic
•  Couchbase + Cloudera

Page 3: Why Every NoSQL Deployment Should be Paired With Hadoop

Aren’t NoSQL, Hadoop, “Big Data” all the same?

No.

Page 4: Why Every NoSQL Deployment Should be Paired With Hadoop

Two challenges at the data layer

“Big Audience.” Most new interactive software systems are accessed via browser, with 2 billion potential users and a 24x7 uptime requirement.

“Big Data.” IDC estimates that more than 1.8 trillion gigabytes of information was created in 2011 and that it will double every two years.
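As a rough worked example of the doubling estimate quoted on this slide, the sketch below projects the 2011 figure forward. It assumes smooth exponential growth with a two-year doubling period; the function name and the years printed are illustrative and do not come from the slides.

```python
def projected_gigabytes(year, base_year=2011, base_gb=1.8e12, doubling_years=2.0):
    """Project data volume, assuming it doubles every `doubling_years` years."""
    return base_gb * 2 ** ((year - base_year) / doubling_years)


# Illustrative projection for a few years after the 2011 baseline.
for year in (2011, 2013, 2015):
    print(year, f"{projected_gigabytes(year):.2e} GB")
```

Under this assumption the volume quadruples in four years, which is why a data layer sized for today's volume is quickly outgrown.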

Page 5: Why Every NoSQL Deployment Should be Paired With Hadoop

NoSQL for “Big Audience”

Page 6: Why Every NoSQL Deployment Should be Paired With Hadoop

Changes in interactive software – NoSQL driver

Page 7: Why Every NoSQL Deployment Should be Paired With Hadoop

Modern interactive software architecture

Application Scales Out: just add more commodity web servers

Database Scales Up: get a bigger, more complex server

Note – Relational database technology is great for what it is great for, but it is not great for this.

Page 8: Why Every NoSQL Deployment Should be Paired With Hadoop

Extending the scope of RDBMS technology

•  Data partitioning (“sharding”)
   –  Disruptive to reshard – impacts application
   –  No cross-shard joins
   –  Schema management at every shard
•  Denormalizing
   –  Increases speed
   –  At the limit, provides complete flexibility
   –  Eliminates relational query benefits
•  Distributed caching
   –  Accelerate reads
   –  Scale out
   –  Another tier, no write acceleration, coherency management
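The resharding cost mentioned for data partitioning can be illustrated with a minimal sketch of application-level sharding. It assumes a simple hash-modulo placement scheme, which is one common approach and not necessarily what any particular deployment uses; the function names and key format are hypothetical.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard using a stable hash (application-level sharding)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def moved_fraction(keys, old_shards: int, new_shards: int) -> float:
    """Fraction of keys whose shard changes when the shard count changes."""
    moved = sum(
        1 for k in keys if shard_for(k, old_shards) != shard_for(k, new_shards)
    )
    return moved / len(keys)


# Hypothetical keys; adding a fifth shard relocates most of them.
keys = [f"user:{i}" for i in range(10_000)]
print(f"{moved_fraction(keys, 4, 5):.0%} of keys move when going from 4 to 5 shards")
```

With modulo placement, most keys change shards whenever the shard count changes, and the application has to know the new layout. That is the disruption the slide refers to.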

Page 9: Why Every NoSQL Deployment Should be Paired With Hadoop

Lacking market solutions, users forced to invent

Bigtable: November 2006
Dynamo: October 2007
Cassandra: August 2008
Voldemort: February 2009

•  No schema required before inserting data
•  No schema change required to change data format
•  Auto-sharding without application participation
•  Distributed queries
•  Integrated main memory caching
•  Data synchronization (mobile, multi-datacenter)
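The first two properties in the list above (no schema before inserting, no schema change to alter the data format) can be sketched with JSON documents in a key-value store. The in-memory dictionary and the `put`/`get` helpers are stand-ins invented for illustration; they are not the API of Couchbase or any other product.

```python
import json

store = {}  # stand-in for a document database; not a real client API


def put(key, document):
    """Store a document as JSON; no table or column definition is needed."""
    store[key] = json.dumps(document)


def get(key):
    """Read a document back as a Python dictionary."""
    return json.loads(store[key])


put("user:1", {"name": "Ada"})
# A later document adds a new field without any schema migration.
put("user:2", {"name": "Lin", "devices": ["phone", "tablet"]})

# Readers handle older documents by supplying a default for the missing field.
print(get("user:2").get("devices", []))
print(get("user:1").get("devices", []))
```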

Page 10: Why Every NoSQL Deployment Should be Paired With Hadoop

NoSQL database matches application logic tier architecture

Data layer now scales with linear cost and constant performance.

Application Scales Out: just add more commodity web servers

Database Scales Out: just add more commodity data servers (NoSQL database servers)

Scaling out flattens the cost and performance curves.

Page 11: Why Every NoSQL Deployment Should be Paired With Hadoop

Survey: Schema inflexibility #1 adoption driver

What is the biggest data management problem driving your use of NoSQL in the coming year?

Lack of flexibility/rigid schemas: 49%
Inability to scale out data: 35%
High latency/low performance: 29%
Costs: 16%
All of these: 12%
Other: 11%

Source: Couchbase NoSQL Survey, December 2011, n=1351

Page 12: Why Every NoSQL Deployment Should be Paired With Hadoop

Hadoop for “Big Data”

Page 13: Why Every NoSQL Deployment Should be Paired With Hadoop

©2012 Cloudera, Inc. All Rights Reserved.

The Problems with Current Data Systems

[Diagram, data flow: Instrumentation → Collection → Storage Only Grid (original raw data; mostly append) → ETL Compute Grid → RDBMS (aggregated data) → BI Reports + Interactive Apps]

1. Can’t Explore Original High Fidelity Raw Data
2. Moving Data To Compute Doesn’t Scale
3. Archiving = Premature Data Death

Page 14: Why Every NoSQL Deployment Should be Paired With Hadoop

The Solution: A Combined Storage/Compute Layer

[Diagram, data flow: Instrumentation → Collection → Hadoop: Storage + Compute Grid (mostly append) → RDBMS (aggregated data) → BI Reports + Interactive Apps]

1. Data Exploration & Advanced Analytics
2. Scalable Throughput For ETL & Aggregation
3. Keep Data Alive Forever

Page 15: Why Every NoSQL Deployment Should be Paired With Hadoop

The Key Benefit: Agility/Flexibility

Schema-on-Write (RDBMS):
•  Schema must be created before any data can be loaded.
•  An explicit load operation has to take place which transforms data to DB internal structure.
•  New columns must be added explicitly before new data for such columns can be loaded into the database.
Pros:
•  Read is Fast
•  Standards/Governance

Schema-on-Read (Hadoop):
•  Data is simply copied to the file store, no transformation is needed.
•  A SerDe (Serializer/Deserializer) is applied during read time to extract the required columns (late binding).
•  New data can start flowing anytime and will appear retroactively once the SerDe is updated to parse it.
Pros:
•  Load is Fast
•  Flexibility/Agility
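The late-binding idea behind schema-on-read can be sketched in a few lines. Raw records are stored untouched, and a reader function extracts columns only when the data is queried. The two reader functions below play the role the slide assigns to a SerDe; they are illustrative stand-ins, not Hive's SerDe interface, and the sample records are made up.

```python
import json

# Raw records are appended as-is; nothing is transformed at load time.
raw_lines = [
    '{"user": "u1", "action": "login"}',
    '{"user": "u2", "action": "purchase", "amount": 4.99}',
]


def read_v1(line):
    """Original reader: extracts only the columns known at the time."""
    rec = json.loads(line)
    return (rec["user"], rec["action"])


def read_v2(line):
    """Updated reader: also extracts 'amount', which old records may lack."""
    rec = json.loads(line)
    return (rec["user"], rec["action"], rec.get("amount"))


print([read_v1(line) for line in raw_lines])
# Re-reading the same stored data with the updated reader surfaces the
# new column retroactively, without reloading anything.
print([read_v2(line) for line in raw_lines])
```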

Page 16: Why Every NoSQL Deployment Should be Paired With Hadoop

Scalability: Scalable Software Development

Grows without requiring developers to re-architect their algorithms/application.

AUTO SCALE

Page 17: Why Every NoSQL Deployment Should be Paired With Hadoop

Economics: Return on Byte

•  Return on Byte (ROB) = value to be extracted from that byte divided by the cost of storing that byte.
•  If ROB is < 1 then it will be buried into tape wasteland, thus we need more economical active storage.

[Chart labels: Low ROB, High ROB]
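The ROB definition above is a simple ratio, and a small sketch shows how cheaper storage changes the outcome for the same data. All numbers are hypothetical and chosen only to land on either side of the ROB = 1 threshold.

```python
def return_on_byte(value_per_byte: float, storage_cost_per_byte: float) -> float:
    """ROB = value extracted from a byte / cost of storing that byte."""
    if storage_cost_per_byte <= 0:
        raise ValueError("storage cost must be positive")
    return value_per_byte / storage_cost_per_byte


# Hypothetical figures: same data value, two storage price points.
value = 2e-9
for tier, cost in (("expensive storage", 5e-9), ("cheaper storage", 1e-9)):
    rob = return_on_byte(value, cost)
    decision = "archive to tape" if rob < 1 else "keep active"
    print(f"{tier}: ROB = {rob:.1f} -> {decision}")
```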

Page 18: Why Every NoSQL Deployment Should be Paired With Hadoop

©2012 Cloudera, Inc. All Rights Reserved.

Hadoop in the Enterprise Data Stack

[Diagram. Data sources: Logs, Files, Web Data, Relational Databases. Connectors shown: Flume, Sqoop. Surrounding systems: Enterprise Data Warehouse, Enterprise Reporting, BI / Analytics, IDEs, Modeling Tools, Meta Data/ETL Tools, Web/Mobile Applications, Cloudera Manager. Interfaces: ODBC, JDBC, NFS. Users: System Operators, Engineers, Analysts, Business Users, Data Scientists, Data Architects, Customers.]

Page 19: Why Every NoSQL Deployment Should be Paired With Hadoop

Big Audiences create and consume Big Data.

Page 20: Why Every NoSQL Deployment Should be Paired With Hadoop

Two peas. One pod.

http://tinyurl.com/6tx42tw

Page 21: Why Every NoSQL Deployment Should be Paired With Hadoop

Hadoop as a Web application feeder or consumer

Pattern 1: Hadoop feeding a web application
“big data” → insights → Web application → big audience

Pattern 2: Hadoop consuming web application data
“big audience” → Web application → big data → insights

Page 22: Why Every NoSQL Deployment Should be Paired With Hadoop

Pattern 1 Case Study: AOL Ad Targeting

•  One of the largest online ad targeting operations
•  Ad slot filling optimization
   –  Serve the most relevant ad to a given user
   –  Meet contracted impression counts
•  Relevancy criteria
   –  Demographic
   –  Psychographic
   –  Current behavioral
•  40 milliseconds to fill all slots

Page 23: Why Every NoSQL Deployment Should be Paired With Hadoop

AOL Advertising: Hadoop as an ad targeting feeder

[Diagram with steps numbered 1 to 3. Labels: affiliates; events; profiles, campaigns; profiles, real time campaign statistics.]

40 milliseconds to respond with the decision.
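The feeder pattern on this slide can be sketched as two steps: profiles computed offline are loaded into a low-latency store, and the ad decision at request time is a key lookup. Everything in the sketch is a stand-in. The dictionaries replace the batch output and the serving database, and the profile fields, campaign names and function names are hypothetical.

```python
import time

# Stand-in for profiles computed offline by a batch job (Hadoop in the slides).
batch_profiles = {
    "user:1": {"segment": "sports"},
    "user:2": {"segment": "travel"},
}
# Hypothetical campaign table keyed by audience segment.
campaigns = {"sports": "ad-running-shoes", "travel": "ad-flights"}

# Stand-in for the low-latency store that the ad server reads from.
serving_store = {}


def load_profiles(profiles):
    """Bulk-load batch output into the serving store."""
    serving_store.update(profiles)


def choose_ad(user_key, default_ad="ad-generic"):
    """Request-time decision: one key lookup, no analysis on the hot path."""
    profile = serving_store.get(user_key)
    if profile is None:
        return default_ad
    return campaigns.get(profile["segment"], default_ad)


load_profiles(batch_profiles)
start = time.perf_counter()
ad = choose_ad("user:1")
elapsed_ms = (time.perf_counter() - start) * 1000
print(ad, f"{elapsed_ms:.3f} ms")
```

The heavy computation happens before the request arrives, so the serving path only needs a fast read to stay inside a fixed response budget such as the 40 milliseconds quoted above.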

Page 24: Why Every NoSQL Deployment Should be Paired With Hadoop

Pattern 2 Case Study: Social gaming user analysis

•  Tens to hundreds of millions of users
•  Game optimization requirements
   –  Keep game fresh and retain audience
   –  Maximize revenue through offer and experience tuning
•  Very different data management tasks
   –  Serving game data
      •  System of record game data
      •  Very low latency data access
      •  Non-disruptive elasticity
      •  Complex queries
   –  Analyzing user behavior
      •  Not game data, rather user behavior data
      •  High-throughput data analysis

Page 25: Why Every NoSQL Deployment Should be Paired With Hadoop

Social Game: Game optimization via Hadoop

[Diagram with steps numbered 1 to 5. Labels: User interacting with game; Validation and response; Game and user data system of record; User behavioral data; Insights.]

Page 26: Why Every NoSQL Deployment Should be Paired With Hadoop


Couchbase and Cloudera

Page 27: Why Every NoSQL Deployment Should be Paired With Hadoop

Couchbase Sqoop connector for Cloudera

http://www.couchbase.com/develop/connectors/hadoop

•  Cloudera-certified connector
•  Bi-directional data movement
   –  Hadoop -> Couchbase
   –  Couchbase -> Hadoop

Page 28: Why Every NoSQL Deployment Should be Paired With Hadoop


Questions?