data lifecycles at massive scale...qubole confidential • co-founder and ceo of qubole! cloud based...

15
Data Lifecycles at Massive Scale Pasadena, January 2016

Upload: others

Post on 15-Jul-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Data Lifecycles at Massive Scale...Qubole Confidential • Co-founder and CEO of Qubole! Cloud Based Big Data as a Service! Processes 250PB+ data every month • Lead Data Infrastructure

Data Lifecycles at Massive ScalePasadena, January 2016

Page 2: Data Lifecycles at Massive Scale...Qubole Confidential • Co-founder and CEO of Qubole! Cloud Based Big Data as a Service! Processes 250PB+ data every month • Lead Data Infrastructure

Background

Qubole Confidential

•  Co-founder and CEO of Qubole§  Cloud Based Big Data as a Service§  Processes 250PB+ data every month

•  Lead Data Infrastructure at Facebook§  Made Big Data Self-service in Facebook§  Nearly an Exabyte of data

•  Co-creator of Apache Hive•  Democratized Big Data through SQL

Operations Analyst

Marketing Ops

Analyst

DataArchitect

BusinessUsers

Product Support

Customer Support

Developer

Sales Ops

Product Managers

Data Infrastructure

Page 3: Data Lifecycles at Massive Scale...Qubole Confidential • Co-founder and CEO of Qubole! Cloud Based Big Data as a Service! Processes 250PB+ data every month • Lead Data Infrastructure

Evolution of Data

Qubole Confidential

Page 4: Data Lifecycles at Massive Scale...Qubole Confidential • Co-founder and CEO of Qubole! Cloud Based Big Data as a Service! Processes 250PB+ data every month • Lead Data Infrastructure

Requirements from Modern Data Infrastructure

Qubole Confidential

   

Scalability on Commodity Hardware

Multi-Persona support for Multiple Use-cases

DATA INFRASTRUCTURE

   

   

   

   SQL

for Analysts

Machine Learning

for Data Scientists

Data Prep for

ETL Users

Streaming for

Developers

Page 5: Data Lifecycles at Massive Scale...Qubole Confidential • Co-founder and CEO of Qubole! Cloud Based Big Data as a Service! Processes 250PB+ data every month • Lead Data Infrastructure

Data Infrastructure for a Modern Enterprise

Qubole Confidential

Data  Crea(on  

Data  Inges(on  

Streaming  Analy(cs  

Batch  Data  Processing  

Adhoc  Analysis  

Machine  Learning  

Visualiza(on/Dashboards  

Applica(ons  

CREATION   COLLECTION   PROCESSING   USAGE  

Page 6: Data Lifecycles at Massive Scale...Qubole Confidential • Co-founder and CEO of Qubole! Cloud Based Big Data as a Service! Processes 250PB+ data every month • Lead Data Infrastructure

Rise of Open Source

Qubole Confidential

Data  Crea(on  

Data  Inges(on  

   

Streaming  Analy(cs  

   

Batch  Data  Processing  

   

Adhoc  Analysis  

   

Machine  Learning  

   

Visualiza(on/Dashboards  

Applica(ons  

CREATION   COLLECTION   PROCESSING   USAGE  

Page 7: Data Lifecycles at Massive Scale...Qubole Confidential • Co-founder and CEO of Qubole! Cloud Based Big Data as a Service! Processes 250PB+ data every month • Lead Data Infrastructure

Qubole – a reflection of Usage of a Modern Data Infrastructure

Qubole Confidential

Engine  Usage  on  Qubole  

Hive  

Spark  

MapReduce  

Presto  

Hbase  

A  Modern  Data  PlaEorm  needs  Mul(ple  Engines    -­‐  Hive  for  Complex  SQL  -­‐  Spark  for  Data  Science  and  

Streaming  -­‐  Presto  for  Interac(ve  Simple  SQL  -­‐  Map  Reduce  for  Batch  ETL  

250PB+  Data  Processed  Every  Month  

Page 8: Data Lifecycles at Massive Scale...Qubole Confidential • Co-founder and CEO of Qubole! Cloud Based Big Data as a Service! Processes 250PB+ data every month • Lead Data Infrastructure

Impediments for an Aspiring Data Driven Enterprise

Why companies struggle with

Self-Service Big Data :

•  6-18 month implementation time

• Only 27% of Big Data initiatives are classified as “Successful” in 2014

Rigid  and  inflexible  

infrastructure  

Non  adap(ve  soNware  services  

Highly  specialized  systems  

Difficult  to  build  and  operate  

• Only 13% of organizations achieve full-scale production

•  57% of organizations cite skills gap as a major inhibitor

Page 9: Data Lifecycles at Massive Scale...Qubole Confidential • Co-founder and CEO of Qubole! Cloud Based Big Data as a Service! Processes 250PB+ data every month • Lead Data Infrastructure

9

The Cloud Provides:

On-demand Infrastructure

Highly Scalable Object Stores

Self Service Infrastructure

Power of the Cloud

Page 10: Data Lifecycles at Massive Scale...Qubole Confidential • Co-founder and CEO of Qubole! Cloud Based Big Data as a Service! Processes 250PB+ data every month • Lead Data Infrastructure

10

Big Data Meets Cloud – Power of Object Stores

The Right Storage for Storing Big Data

•  Elastic Scalability: petabyte scale without capacity constraints

•  High Concurrency: throughput scales linearly

•  High Availability: 99.99% availability

•  High Durability: easily more durable than HDFS 3x replicas

•  Enterprise Grade at a fraction of the cost

Page 11: Data Lifecycles at Massive Scale...Qubole Confidential • Co-founder and CEO of Qubole! Cloud Based Big Data as a Service! Processes 250PB+ data every month • Lead Data Infrastructure

11

Big Data Meets Cloud – Power of Elastic Compute

The Right Compute Paradigm to Fit Usage

•  On-Demand: provision entire clusters in less than two minutes with no lead time for sourcing

•  Elastic: cluster size should match workload; run with thousands of nodes when you need it, de-provision all nodes when idle

•  Flexible: change compute infrastructure to fit workloads

•  Cost Efficiency: Multiple SLA options to fit the right budget to the workloads

Page 12: Data Lifecycles at Massive Scale...Qubole Confidential • Co-founder and CEO of Qubole! Cloud Based Big Data as a Service! Processes 250PB+ data every month • Lead Data Infrastructure

12

Big Data Meets Cloud – Power of Self Service SaaS

Page 13: Data Lifecycles at Massive Scale...Qubole Confidential • Co-founder and CEO of Qubole! Cloud Based Big Data as a Service! Processes 250PB+ data every month • Lead Data Infrastructure

13

Big Data Meets Cloud – Cloud Security and Compliance

Security is a no. 1 citizen: Cloud Built from Outside–In

•  Multiple Encryption options

•  Industry-standard authentication for every REST API request

•  Virtual Private Cloud

•  Auto-logging for auditability

•  Industry Compliance

Page 14: Data Lifecycles at Massive Scale...Qubole Confidential • Co-founder and CEO of Qubole! Cloud Based Big Data as a Service! Processes 250PB+ data every month • Lead Data Infrastructure

Big Data Platform and the Cloud

Enterprise Grade Security, Governance and Reliability

Multi-User and SaaS architecture for Best

Operational Efficiency

Auto-scaling and portability across Clouds

Successful Big Data Adoption at Scale with a Unified Big Data Platform Built for the Cloud

Page 15: Data Lifecycles at Massive Scale...Qubole Confidential • Co-founder and CEO of Qubole! Cloud Based Big Data as a Service! Processes 250PB+ data every month • Lead Data Infrastructure

Thank You