Jason Huang, Solutions Engineer, Qubole at MLconf ATL - 9/18/15

September 18, 2015

Jason Huang, Senior Solutions Architect, Qubole Inc.

Company Founding

Qubole founders built the Facebook data platform.

The Facebook model changed the role of data in an enterprise.

• Needed to turn the data assets into a “utility” to make a viable business.

– Collaborative: over 30% of employees use the data directly.

– Accessible: developers, analysts, business analysts and business users all run queries. This has made the company more data-driven and agile in its use of data.

– Scalable: exabytes of data, moving fast

It took the founders a team of over 30 people to create this infrastructure, and the team currently managing it has more than 100 people.

Work at Facebook inspired the founding of Qubole

[Diagram: Data Infrastructure and the roles that use it: Operations Analyst, Marketing Ops, Analyst, Data Architect, Business Users, Product Support, Customer Support, Developer, Sales Ops, Product Managers]

Impediments for an Aspiring Data Driven Enterprise

Where Big Data falls short:

• 6-18 month implementation time

• Only 27% of Big Data initiatives are classified as “Successful” in 2014

• Only 13% of organizations achieve full-scale production

• 57% of organizations cite skills gap as a major inhibitor

Rigid and inflexible infrastructure

Non adaptive software services

Highly specialized systems

Difficult to build and operate

State of the Big Data Industry (n=417)

[Bar chart, axis from 0% to 80%: Hadoop, MapReduce, Pig, Spark, Storm, Presto, Cassandra, HBase, Hive]

Apache Spark

• Apache Spark is a fast and general engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing.

Analytic Libraries:

• Spark Streaming (Streaming Data)

• Spark SQL (Data Processing)

• MLlib (Machine Learning)

• GraphX (Graph Processing)

Common Spark Use Cases

• Streaming Data

– Process streaming data with Spark built-in functions

– Applications such as fraud detection and log processing

– ETL via data ingestion

• Machine Learning

– Helps users run repeated queries and machine learning algorithms on data sets

– MLlib can work in areas such as clustering, classification, and dimensionality reduction

– Used for very common big data functions: predictive intelligence, customer segmentation, and sentiment analysis
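The log-processing use case above usually comes down to counting events over a sliding window of small batches. The following is a minimal plain-Python sketch of that micro-batch windowing idea only; it does not use the Spark Streaming API, and the function name `windowed_counts` and the sample log levels are illustrative assumptions.

```python
from collections import Counter, deque


def windowed_counts(batches, window=3):
    """Yield per-key counts over the most recent `window` micro-batches."""
    recent = deque(maxlen=window)  # the oldest batch drops out automatically
    for batch in batches:
        recent.append(Counter(batch))
        total = Counter()
        for counts in recent:
            total.update(counts)
        yield dict(total)


# Each inner list stands for one micro-batch of (hypothetical) log levels.
batches = [["ERROR", "INFO"], ["INFO"], ["ERROR", "ERROR"], ["INFO"]]
results = list(windowed_counts(batches, window=2))
```

In a real deployment the batches would arrive continuously from an ingestion source, and the counting would be distributed across the cluster rather than done in one process.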

Common Spark Use Cases

• Interactive Analysis

– MapReduce was built to handle batch processing

– Hadoop query tools such as Hive or Pig can be too slow for interactive analysis

– Spark is fast enough to perform exploratory queries without sampling

– Provides multiple language-specific APIs including R, Python, Scala and Java

• Fog Computing

– The Internet of Things: objects and devices with tiny embedded sensors that communicate with each other and users, creating a fully interconnected world

– Decentralize data processing and storage, and use Spark streaming analytics and interactive real-time queries

Why Spark MLlib?

Use Spark for distributed computation:

- Combine Spark SQL and GraphX with MLlib in the same Spark program

- Ability to use language of choice: Python/Scala/R/Java

- Extensive algorithms (http://spark.apache.org/docs/latest/mllib-guide.html)

Algorithms

• Classification and Regression: logistic regression, linear regression, linear support vector machine (SVM), naive Bayes, decision trees

• Collaborative Filtering: alternating least squares (ALS)

• Clustering: k-means, Gaussian mixture

• Dimensionality Reduction: singular value decomposition (SVD), principal component analysis (PCA)
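To make the clustering entry in the list above concrete, here is a minimal plain-Python sketch of k-means (Lloyd's algorithm). It is not MLlib's implementation, which distributes these same assignment and update steps across a cluster; the function names `closest` and `kmeans` and the sample points are illustrative assumptions.

```python
import math


def closest(point, centers):
    """Return the index of the center nearest to point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))


def kmeans(points, centers, max_iter=20):
    """Lloyd's algorithm: alternate assignment and center-update steps."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        # Assignment step: group points by their nearest center.
        groups = [[] for _ in centers]
        for p in points:
            groups[closest(p, centers)].append(p)
        # Update step: move each center to the mean of its group
        # (an empty group keeps its previous center).
        new_centers = [
            tuple(sum(xs) / len(g) for xs in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
        if new_centers == centers:  # no center moved: converged
            break
        centers = new_centers
    return centers


# Two well-separated groups of hypothetical 2-D points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers = kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0)])
```

The initial centers are passed in explicitly to keep the sketch deterministic; production implementations typically choose them with a randomized seeding step.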

How about R? Use SparkR!

• Spark: Fast, Scalable and Flexible

• R: Statistics, Packages and Plots

SparkR combines both, which makes it very powerful.

Use the SparkR API to take advantage of Spark, then bring the data back into R for machine learning, data visualization, etc.

What about the cloud?

Central Governance & Security

Internet Scale

Instant Deployment

Isolated Multi-tenancy

Elastic

Object Store Underpinnings

Spark in the Cloud

• Zero configuration – Spark, SparkR, MLlib, GraphX, etc. all pre-installed on all cluster nodes

– e.g. submit SparkR programs via a client-side API to an on-demand compute cluster

• ETL (data cleansing, transformations, table joins, etc.) required prior to any ML modeling and analysis

– e.g. use other Big Data tools to prepare data: Hive/Hadoop/Cascading/Pig…

Spark in the Cloud

Cloud object store for data sets, e.g. AWS S3:

• Use AWS S3 object store to decouple compute and storage; scale processing power and storage capacity independently

• S3 is highly available, reliable, scalable and cost effective

• Elastic compute provides unlimited scale on-demand: calculations may require 10, 100 or 1,000+ compute nodes.

• Ability to have multiple clusters to separate teams, workloads, production, and non-production R&D/test

Spark in the Cloud

• Flexible compute resource options

– High memory instances: AWS EC2 r3.* for high memory workloads to cache and manipulate large Spark RDDs

– High CPU: AWS EC2 c3.* for CPU intensive workloads

• Automatic cluster termination when idle

• Periodically check for bad instances and remove them
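The idle-termination and bad-instance checks mentioned above can be sketched as two small policy functions. This is only an illustration under stated assumptions: the `Node` type, the one-hour `idle_timeout` default, and the idea of a boolean health flag are hypothetical and do not describe how any particular service implements these checks.

```python
from dataclasses import dataclass


@dataclass
class Node:
    node_id: str
    healthy: bool  # result of the latest (hypothetical) health check


def should_terminate(running_jobs, idle_seconds, idle_timeout=3600):
    """Terminate only when nothing is running and the idle timeout has passed."""
    return running_jobs == 0 and idle_seconds >= idle_timeout


def split_bad_nodes(nodes):
    """Return (nodes to keep, nodes to remove) based on the health flag."""
    keep = [n for n in nodes if n.healthy]
    remove = [n for n in nodes if not n.healthy]
    return keep, remove


# Hypothetical cluster state for a periodic check.
nodes = [Node("n1", True), Node("n2", False), Node("n3", True)]
keep, remove = split_bad_nodes(nodes)
```

A scheduler would run both checks on a timer, replacing removed nodes if work is still queued and shutting the cluster down once `should_terminate` returns true.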

CONFIDENTIAL. SUBJECT TO NDA PROVISIONS.

DIY - Getting Started on the Cloud

• Install Spark on EC2 (HDFS if required)

• Choose Spark backend cluster mode and configure it

– Standalone

– YARN

– Mesos

• Spin up a cluster of instances
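The choice of cluster mode mostly shows up as the master URL handed to Spark. The helper below is a small sketch of the URL forms for the three modes; the function name `master_url` and the example host names are assumptions, while the `spark://` and `mesos://` schemes and their default ports (7077 and 5050) follow the Spark documentation.

```python
def master_url(mode, host=None, port=None):
    """Build a Spark master URL for the given cluster mode."""
    mode = mode.lower()
    if mode == "yarn":
        # The cluster location comes from the Hadoop configuration, not the URL.
        # Spark 1.x releases used "yarn-client" / "yarn-cluster" instead.
        return "yarn"
    if host is None:
        raise ValueError(f"{mode} mode needs a master host")
    if mode == "standalone":
        return f"spark://{host}:{port or 7077}"
    if mode == "mesos":
        return f"mesos://{host}:{port or 5050}"
    raise ValueError(f"unknown cluster mode: {mode}")


# Hypothetical host names, for illustration only.
standalone = master_url("standalone", "master.example.com")
mesos = master_url("mesos", "mesos.example.com")
yarn = master_url("yarn")
```

The resulting string is what would be passed as the `--master` option of `spark-submit` or set as `spark.master` in the application configuration.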

DIY - Getting Started on the Cloud

EC2 scripts can help:

http://spark.apache.org/docs/latest/ec2-scripts.html

- Helps spin up named clusters

- Creates a security group, comes pre-baked with Spark installed
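As a rough illustration of what launching a named cluster with the EC2 scripts looks like, the sketch below assembles a `spark-ec2 launch` command line. The `-k`, `-i`, `-s` and `--instance-type` options and the `launch` action come from the `spark-ec2` script shipped with Spark 1.x; the helper name `spark_ec2_launch`, the cluster name, key pair name, and key file path are placeholders, and the options should be checked against the script version in use.

```python
import shlex


def spark_ec2_launch(cluster_name, key_pair, identity_file,
                     slaves=1, instance_type=None):
    """Return the argument list for a `spark-ec2 launch` invocation."""
    cmd = ["./spark-ec2", "-k", key_pair, "-i", identity_file, "-s", str(slaves)]
    if instance_type:
        cmd.append(f"--instance-type={instance_type}")
    cmd += ["launch", cluster_name]
    return cmd


# Placeholder names; substitute a real EC2 key pair and cluster name.
cmd = spark_ec2_launch("demo-cluster", "my-keypair", "my-keypair.pem",
                       slaves=4, instance_type="r3.xlarge")
print(shlex.join(cmd))
```

Building the command as a list keeps it safe to pass to `subprocess.run` without shell quoting issues; the same script also accepts `destroy` in place of `launch` to tear the cluster down.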

Another (very short) Demo

Qubole Case Study

[Diagram: user roles: Operations Analyst, Marketing Ops, Analyst, Data Architect, Business Users, Product Support, Customer Support, Developer, Sales Ops, Product Managers]

Ease of use for analysts

• Dozens of Data Scientist and Analyst users

• Produces double-digit TBs of data per day

• Does not have dedicated staff to set up and manage clusters and Hadoop distributions

[Architecture diagram. Stages: Producers, Continuous Processing, Storage, Analytics. Components: CDN, Real Time Bidding, Retargeting Platform, ETL, Kinesis, S3, Redshift, Machine Learning, Streaming, Customer Data]

Why Spark?

“Qubole put our cluster management, auto-scaling and ad-hoc queries on autopilot. Its higher performance for Big Data queries translates directly into faster and more actionable marketing intelligence for our customers.”

Yekesa Kosuru, VP, Technology