Qubole Presentation for the Cleveland Big Data and Hadoop Meetup

Cleveland Big Data and Hadoop User Group

Great Lakes Science Center

September 14, 2015

Jason Huang

Senior Solutions Architect, Qubole

Company Founding

Qubole founders built the Facebook data platform.

The Facebook model changed the role of data in an enterprise.

• Needed to turn the data assets into a “utility” to make a viable business.

– Collaborative: over 30% of employees use the data directly.

– Accessible: developers, analysts, business analysts and business users all run queries. This has made the company more data-driven and more agile in how it uses data.

– Scalable: exabytes of data moving fast

The founders needed a team of more than 30 people to build this infrastructure, and the team that manages it today has more than 100 people.

Work at Facebook inspired the founding of Qubole

Diagram: roles served by the Data Infrastructure: Operations Analyst, Marketing Ops Analyst, Data Architect, Business Users, Product Support, Customer Support, Developer, Sales Ops, Product Managers

State of the Big Data Industry (n=417)

Bar chart (axis from 0% to 80%): Hadoop, MapReduce, Pig, Spark, Storm, Presto, Cassandra, HBase, Hive

Impediments for an Aspiring Data Driven Enterprise

Where Big Data falls short:

• 6-18 month implementation time

• Only 27% of Big Data initiatives were classified as “Successful” in 2014

• Rigid and inflexible infrastructure

• Non-adaptive software services

• Highly specialized systems

• Difficult to build and operate

• Only 13% of organizations achieve full-scale production

• 57% of organizations cite the skills gap as a major inhibitor

Impediments for an Aspiring Data Driven Enterprise

What you need to work in the cloud:

• Central Governance & Security

• Internet Scale

• Instant Deployment

• Isolated Multitenancy

• Elastic

• Object Store Underpinnings

Qubole Case Study

• 1 out of 3 employees leverages Big Data

• Stores 60PB+ of data

• Logs 20TB+ of new data per day

• Processes 3PB+ per day across 2,000+ jobs

Qubole Case Study

Why Hive?

“Qubole has enabled more users within Pinterest to get to the data and has made the data platform a lot more scalable and stable”

Mohammad Shahangian, Lead, Data Science and Infrastructure

Hive’s metastore serves as the canonical source of truth for all Hadoop jobs

Diagram: Pig, Cascading and Hive, with metadata held in the Hive Metastore and data held in HDFS/S3
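
The idea of one metastore acting as the source of truth for several engines can be sketched in a few lines. This is a toy, in-memory illustration only, not the Hive Metastore API: the names `ToyMetastore`, `TableInfo`, `plan_hive_query`, `plan_pig_load`, the `events` table and the `s3://example-bucket/...` location are all hypothetical and chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableInfo:
    """Metadata kept for one table (illustrative fields only)."""
    location: str   # where the data files live, e.g. an S3 or HDFS prefix
    columns: tuple  # (name, type) pairs describing the schema


class ToyMetastore:
    """In-memory stand-in for a shared metastore; not the real Hive API."""

    def __init__(self):
        self._tables = {}

    def register(self, name, info):
        self._tables[name] = info

    def lookup(self, name):
        return self._tables[name]


def plan_hive_query(metastore, table):
    # A Hive-style job asks the metastore where the table lives.
    info = metastore.lookup(table)
    return f"hive scan {info.location}"


def plan_pig_load(metastore, table):
    # A Pig-style job resolves the same table through the same catalog.
    info = metastore.lookup(table)
    return f"pig load {info.location}"


metastore = ToyMetastore()
metastore.register(
    "events",
    TableInfo(
        location="s3://example-bucket/events/",
        columns=(("user_id", "bigint"), ("ts", "timestamp")),
    ),
)

hive_plan = plan_hive_query(metastore, "events")
pig_plan = plan_pig_load(metastore, "events")
```

Because both planners resolve `events` through the same catalog entry, changing the table's location or schema in one place changes it for every engine, which is the property the slide attributes to the metastore.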

Qubole Case Study

Diagram: user roles: Operations Analyst, Marketing Ops Analyst, Data Architect, Business Users, Product Support, Customer Support, Developer, Sales Ops, Product Managers

Ease of use for analysts

• Dozens of Data Scientist and Analyst users

• Produces double-digit TBs of data per day

• Does not have dedicated staff to set up and manage clusters and Hadoop distributions

Qubole Case Study

Diagram: data pipeline with four stages: Producers, Continuous Processing, Storage, Analytics

Labels in the diagram: CDN, Real Time Bidding, Retargeting Platform, ETL, Kinesis, S3, Redshift, Machine Learning, Streaming, Customer Data

Why Spark?

“Qubole put our cluster management, auto-scaling and ad-hoc queries on autopilot. Its higher performance for Big Data queries translates directly into faster and more actionable marketing intelligence for our customers.”

Yekesa Kosuru, VP, Technology

Qubole Case Study

• Designed for scientists & clinicians

• Leveraging massive datasets from institutes, public sources and more…

• Cloud-based product delivered via web

Qubole Case Study

“Our customers have varying needs: clinical researchers might use GenePool to examine genomic data from a single patient, while a major research institution might use the platform to perform analyses over 10,000 patients at once”

Anish Kejariwal, Senior Director of Engineering

• Unified Metadata

• Auto-Scaling

• Spot Optimized

• Policy Keeper

• Cloud Tuned

• Cluster Lifecycle Management

Diagram: QDS Unified Control Panel (Developer Center, Analyst Workbench UI, Policy, Governance & Security Center) and QDS Data Engines

Why Presto?
