qubole @ aws meetup bangalore - july 2015

Big Data as a Service

Joydeep Sen SarmaHariharan IyerShubham Tagra

Introduction

• Introduction to Qubole

• How Qubole integrates with AWS:

– EC2– S3– RedShift– Kinesis

• Hive vs. Presto vs. RedShift vs. Spark

• Qubole vs. EMR

2

Agenda

Introduction

• Founded ~ 10/2011

• Team:– Founding Crew initial authors of Apache Hive, ran Data @Facebook– + Notable Alumni from Greenplum/Vertica/EngineYard/Oracle/AWS etc– + 50 engineers + 20 sales/mkting across Bangalore/Palo-Alto

• Financing:– Completed Series-B 10/2014– LightSpeed, Charles-River, Norwest, Anand/Venky

• Product: Qubole Data Service– Big Data as a Service– AWS/GCE/Azure

3

About Qubole

Introduction

4

Customers

IntroductionQubole Data Service

6

IntroductionQubole Data Service

Introduction

• Self-Serve Big Data Analytics:– Lack of Hadoop trained IT/engineers– Team of Analysts

• Lowest TCO– Cloud Optimized - takes full advantage of AWS

• Unified Platform for all Tools:– Hive/Pig/Spark/SQL/Map-Reduce/Cascading/…– Pick and Choose. Combine and Use

• Awesome Support and Solutions

7

Why Qubole

Self Service Analytics: Direct Access to Big Data

8

Self Service Analytics: Manage Clusters Easily

9

Self Service Analytics: Schedule Jobs

10

Self Service Analytics: Self Serve Dashboards using Notebooks

11

Qubole and EC2





• Qubole vs. EMR

12

Agenda

Qubole and EC2

• Custom AMIs for much faster boot-up

• Auto-termination

• Auto-scaling

• Spot Instances

• EBS

13

EC2 Magic Sauce

• Auto-start and termination– Cluster starts automatically when you need to run a command

• Intelligent - no cluster required for metadata commands– Terminated after couple of hours of Idle time

• Auto-scaling– Min Size <= Cluster <= Max Size

14

Cluster LifeCycle Management

Qubole and EC2

15

Map Slots Reduce Slots

Slave

Slave

Slave

Slave

Slave

SELECT * FROM FOO JOIN BAR ON BAZ = ...

Auto-Scaling

• Upscaling– Engine-specific algorithms– Cannot just look at expected time (parallelism matters)

• Downscaling– Decommissioning takes time– Need to consider hour boundary– Stuck on mapper output

• Output offloading

• AWS Integration– Hour boundary– Eventual consistency

16

Why is it hard?

Qubole and EC2

Min Cluster Size: 400Max Cluster Size: 800Time for which cluster size < max size: 49%

17

But it pays off!

Qubole and EC2

18

But it pays off!

Expected Compute Hours Compute hours saved Savings (%)

2902246.2 2107311.01 72.6

4655815.5 2105486.11 45.2

1698052.65 1658738.375 97.6

1776944.4 1476547.835 83.1

2063127.85 838628.7 40.6

919721.25 613630.955 66.7

Qubole and EC2

• Various configurations– All nodes on-demand– Minimum nodes on-demand, rest combination of on-demand & spot– All nodes spot

• Minimum set has higher bid price => less likely to lose– Up to 90% savings compared to on-demand price

19

Spot instances

Qubole and EC2

• Not always available– Fall back to on-demand

• Increases overall cost of cluster– Periodically replace extra on-demand instances when spot available

• Can go away at any time– Hadoop has built-in resiliency– Place replicas on stable instances

• Auto-scaling– Must maintain requested ratio

20

Why is it hard?

Qubole and EC2

• Useful for newer instance types - c3/m3– Low ephemeral storage

• Better performance/$– Compared to older instance types with more storage

• Writes are changed– Minimize writes to EBS volumes– Use them only when ephemeral is near full

21

Elastic Block Store

Qubole and EC2

• Spot Fleets• EBS-only instances

22

What’s coming next

Qubole and EC2

Resources

• Auto-scaling Hadoop Clusters in Qubole

• Spot Instances in Qubole Clusters

• Rebalancing Hadoop Clusters for Better Spot Utilization

• Improved Performance with Low-Ephemeral-Storage Instances

23

Qubole and EC2

http://www.qubole.com/blog/product/industrys-first-auto-scaling-hadoop-clusters/

http://www.qubole.com/blog/product/riding-the-spotted-elephant/

http://www.qubole.com/blog/product/rebalancing-hadoop-higher-spot-utilization/

http://www.qubole.com/blog/product/high-performance-hadoop-new-generation-aws-instances/

Agenda





• Qubole vs. EMR

24

Agenda

Qubole and S3

• No real Rename (aka Move) operation– renames are copies and expensive!

• S3 connection establishment is expensive– ie. - small calls like getObjectDetails(key) and reads are expensive!

• S3 has bulk prefix listing– listObjectsChunked(startKey, maxListing)

• Puts are atomic– Objects created when object uploaded– Unlike HDFS where files are created on first write

• MultiPart!

25

S3 != HDFS

Qubole and S3

• Naiive

• Smart

• Up to 1000x improvement

26

Prefix Listing

for path in [‘/x/y/a’, ‘/x/y/b’, ‘/x/z/c’, … ]:result << listObject(path)

pathList = listPrefix(‘/x’)while (entry = pathList.next()):

if entry in [‘/x/y/a’, ‘/x/y/b’, ‘/x/z/c’, … ]:result << entry

Qubole and S3

• Split Computation– Divide input files into tasks for Map-Reduce/Spark/Presto

• Recovering Partitions

• List Paths matching regex pattern (‘/x/y/z/*/*’)

• and many more ..

27

Prefix Listing - Use Cases

Qubole and S3

• Normally:– Write data to temporary location - atomically rename to final location

• With S3:– Write data to final location– Atomic puts deal with speculation/retries– Optional: Remove on Failure

• By default in Hive, DirectFileOutputCommitter in MR/Spark

• Tricky: retries/speculation must use same path

28

Direct Writes

Qubole and S3

• Naiive:

29

Pre-Fetching

Client S3

• Smart:

Client S3

S3 Local Disk Cache (Presto)

30

S3 Local Disk Cache (Presto)

31

Qubole and S3

• Populating Cache while performing Query may cause Slowness

• Large Files are split– Cache Files or Splits?

• Should Caching Combine Small Files?

• Should Caching transform data into Columnar?

Watch out for Table Copies!

32

Why S3 Caching is hard

Qubole and S3

• Handle S3 Timeouts and Exceptions (truncated streams)

• Optimize away seek() operations

• Data Sharing across Organizations using Cross-Account Roles

33

Miscellaneous

Resources

● S3 Optimizations in Hive○ http://www.qubole.com/optimizing-hadoop-for-s3-part-1/

● Caching in Presto○ http://www.qubole.com/blog/product/caching-presto/

● Qubole vs. EMR Performance Comparison○ http://www.qubole.com/a-performance-comparison-of-qubole-and-amazons-elastic-mapre

duce-emr/

● Data Sharing via AWS Roles:○ https://qubole-eng.quora.com/Securely-sharing-data-across-Organizations-with-Qubole-2

Qubole and S3

http://www.qubole.com/optimizing-hadoop-for-s3-part-1/

http://www.qubole.com/blog/product/caching-presto/

http://www.qubole.com/a-performance-comparison-of-qubole-and-amazons-elastic-mapreduce-emr/

http://www.qubole.com/a-performance-comparison-of-qubole-and-amazons-elastic-mapreduce-emr/

https://qubole-eng.quora.com/Securely-sharing-data-across-Organizations-with-Qubole-2

Agenda





• Qubole vs. EMR

35

Agenda

• Introduction to Presto

• Brief Introduction to Kinesis

• Qubole’s Value add to Kinesis

• Brief introduction to Redshift

• Qubole’s Value add to Redshift

• When to use which system

36

Presto, Kinesis and RedShift

Presto, Kinesis, RedShift

• Interactive SQL Query Engine for Big Data

• Open source by Facebook in late 2013

• Follows ANSI-SQL

• Own execution engine

37

Presto, What is it?


● Extensibility

○ Pluggable Datasources

● Performance

○ In-memory Execution

○ Aggressive Pipelining

○ Highly efficient Java code

○ Dynamic Query Compilation

○ Vectorization38

Why Presto?


● Smooth learning curve as it adheres to ANSI-SQL

● Active open source community

● Proven worth at scale in companies like Facebook, NetFlix, Airbnb

39

Why not something else?


● Self managed Presto clusters

● Auto-configured

● Autoscaling

● Data Caching

40

Benefits of Presto @Qubole


● Kinesis Connector

● S3 Optimizations

● Insert Support

● UDF support

41

Qubole’s Contribution


● Average times across the performance tests by MediaMath on a 22TB text format table, partitioned on date, queried on partition with ~1.2b rows

http://www.qubole.com/blog/big-data/performance-testing-presto/

42

Comparison with Hive


● High Capacity Pipe for Real-Time Processing

● Key Concepts

○ Record

○ Streams

○ Shards

○ Checkpoints

43

Kinesis


● Streaming usecase

○ Spark

● SQL usecase

○ Via Hive

○ Via Presto

44

Qubole and Kinesis


● Kinesis Connector

45

Presto-Kinesis Integration


● Example○ Step1: Define Schema

46



○ Step2: Run Query

47


● Datawarehouse service

● OLAP

● Storage + Compute

48

Redshift


● ETL Usecase

○ DBImport

○ DBExport

● Adhoc Query Usecase

○ Direct Query

○ Hive

○ Presto

49

Qubole and Redshift


Two ways to access Redshift via Presto● Via Hive JDBC Storage Handler

50

Presto-Redshift integration


Two ways to access Redshift via Presto● Via jdbc connector

51

Presto-Redshift integration


● Single Platform

● Consistent User Interface

● Cross-source Joins without data consolidation

52

New Opportunities


● Hive: ETL, huge Joins, Group By on high cardinality columns

● Redshift: Interactive Queries when data loading is acceptable

● Presto:

○ Interactive Queries

○ Direct Queries without loading

○ Joining data across Redshift, Kinesis, S3, MySql, Cassandra, etc

53

Hive? Presto? RedShift?


● Iterative Machine Learning

● In-Memory Computing

● Spark Streaming

54

Got Spark?


Resources

● Presto vs Hive○ http://www.qubole.com/blog/big-data/performance-testing-presto/

● Presto Kinesis Integration○ http://blogs.aws.amazon.com/bigdata/post/Tx2DDFNHXSAAH2G/Presto-Amazon-Kinesis-Connector-for-I

nteractively-Querying-Streaming-Data

● Presto Kinesis Connector○ https://github.com/qubole/presto-kinesis

● Hive Redshift/Jdbc Connector○ http://www.qubole.com/blog/product/hive-jdbc-storage-handler/

● Hive Redshift/Jdbc Connector○ https://github.com/qubole/Hive-JDBC-Storage-Handler

http://www.qubole.com/blog/big-data/performance-testing-presto/

http://blogs.aws.amazon.com/bigdata/post/Tx2DDFNHXSAAH2G/Presto-Amazon-Kinesis-Connector-for-Interactively-Querying-Streaming-Data

http://blogs.aws.amazon.com/bigdata/post/Tx2DDFNHXSAAH2G/Presto-Amazon-Kinesis-Connector-for-Interactively-Querying-Streaming-Data

https://github.com/qubole/presto-kinesis

http://www.qubole.com/blog/product/hive-jdbc-storage-handler/

https://github.com/qubole/Hive-JDBC-Storage-Handler

Agenda





• Qubole vs. EMR

56

Agenda

Thanks!

[email protected], [email protected], [email protected]

[email protected]

mailto:[email protected]




qubole @ aws meetup bangalore - july 2015

Engineering

qubole data service

introduction introduction

spark qubole

ec2 introduction

self service analytics

service awsgceazure

ec2 s3 redshift kinesis

ec2 magic sauce