qubole @ aws meetup bangalore - july 2015

Big Data as a Service Joydeep Sen Sarma Hariharan Iyer Shubham Tagra

Upload: joydeep-sen-sarma

Post on 16-Aug-2015

Category:

Engineering


TRANSCRIPT

Page 1: Qubole @ AWS Meetup Bangalore - July 2015

Big Data as a Service

Joydeep Sen Sarma, Hariharan Iyer, Shubham Tagra

Page 2: Qubole @ AWS Meetup Bangalore - July 2015

Introduction

• Introduction to Qubole

• How Qubole integrates with AWS:

  – EC2
  – S3
  – RedShift
  – Kinesis

• Hive vs. Presto vs. RedShift vs. Spark

• Qubole vs. EMR

2

Agenda

Page 3: Qubole @ AWS Meetup Bangalore - July 2015

Introduction

• Founded ~ 10/2011

• Team:
  – Founding crew: initial authors of Apache Hive, ran Data @Facebook
  – + Notable alumni from Greenplum/Vertica/EngineYard/Oracle/AWS etc.
  – + 50 engineers + 20 sales/marketing across Bangalore/Palo Alto

• Financing:
  – Completed Series B 10/2014
  – LightSpeed, Charles-River, Norwest, Anand/Venky

• Product: Qubole Data Service
  – Big Data as a Service
  – AWS/GCE/Azure

3

About Qubole

Page 4: Qubole @ AWS Meetup Bangalore - July 2015

Introduction

4

Customers

Page 5: Qubole @ AWS Meetup Bangalore - July 2015

Introduction

Qubole Data Service

Page 6: Qubole @ AWS Meetup Bangalore - July 2015

6

Introduction

Qubole Data Service

Page 7: Qubole @ AWS Meetup Bangalore - July 2015

Introduction

• Self-Serve Big Data Analytics:
  – Lack of Hadoop-trained IT/engineers
  – Team of analysts

• Lowest TCO
  – Cloud optimized: takes full advantage of AWS

• Unified Platform for all Tools:
  – Hive/Pig/Spark/SQL/Map-Reduce/Cascading/…
  – Pick and choose. Combine and use

• Awesome Support and Solutions

7

Why Qubole

Page 8: Qubole @ AWS Meetup Bangalore - July 2015

Self Service Analytics: Direct Access to Big Data

8

Page 9: Qubole @ AWS Meetup Bangalore - July 2015

Self Service Analytics: Manage Clusters Easily

9

Page 10: Qubole @ AWS Meetup Bangalore - July 2015

Self Service Analytics: Schedule Jobs

10

Page 11: Qubole @ AWS Meetup Bangalore - July 2015

Self Service Analytics: Self Serve Dashboards using Notebooks

11

Page 12: Qubole @ AWS Meetup Bangalore - July 2015

Qubole and EC2

• Introduction to Qubole

• How Qubole integrates with AWS:

  – EC2
  – S3
  – RedShift
  – Kinesis

• Hive vs. Presto vs. RedShift vs. Spark

• Qubole vs. EMR

12

Agenda

Page 13: Qubole @ AWS Meetup Bangalore - July 2015

Qubole and EC2

• Custom AMIs for much faster boot-up

• Auto-termination

• Auto-scaling

• Spot Instances

• EBS

13

EC2 Magic Sauce

Page 14: Qubole @ AWS Meetup Bangalore - July 2015

• Auto-start and termination
  – Cluster starts automatically when you need to run a command
  – Intelligent: no cluster required for metadata commands
  – Terminated after a couple of hours of idle time

• Auto-scaling
  – Min Size <= Cluster <= Max Size

14

Cluster LifeCycle Management

Qubole and EC2
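
The Min Size <= Cluster <= Max Size rule on this slide can be illustrated with a minimal sketch. The function name and the idea of a "desired" node count coming from some engine-specific estimator are assumptions made for illustration; the talk does not show Qubole's actual implementation.

```python
def clamp_cluster_size(desired_nodes: int, min_size: int, max_size: int) -> int:
    """Return a node count that satisfies min_size <= result <= max_size.

    desired_nodes is a hypothetical input: whatever an engine-specific
    upscaling or downscaling estimator asks for.
    """
    if min_size > max_size:
        raise ValueError("min_size must not exceed max_size")
    return max(min_size, min(desired_nodes, max_size))


# Illustrative values only.
for desired in (100, 650, 1200):
    print(desired, "->", clamp_cluster_size(desired, min_size=400, max_size=800))
```

Whatever the estimator requests, the cluster never shrinks below the configured minimum or grows past the configured maximum.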

Page 15: Qubole @ AWS Meetup Bangalore - July 2015

15

[Figure: five slave nodes with map slots and reduce slots, running the query SELECT * FROM FOO JOIN BAR ON BAZ = ...]

Auto-Scaling

Page 16: Qubole @ AWS Meetup Bangalore - July 2015

• Upscaling
  – Engine-specific algorithms
  – Cannot just look at expected time (parallelism matters)

• Downscaling
  – Decommissioning takes time
  – Need to consider hour boundary
  – Stuck on mapper output

• Output offloading

• AWS Integration
  – Hour boundary
  – Eventual consistency

16

Why is it hard?

Qubole and EC2
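
The "hour boundary" and "decommissioning takes time" points on this slide interact: with hourly instance billing, releasing a node early in a paid hour saves nothing, while releasing it too late risks starting another hour. A minimal sketch of that decision follows. The function names, the 10-minute window, and the 5-minute decommissioning allowance are all assumptions for illustration, not values from the talk.

```python
def minutes_into_billing_hour(launch_ts: float, now_ts: float) -> float:
    """Minutes elapsed in the node's current billing hour (timestamps in seconds)."""
    return ((now_ts - launch_ts) % 3600) / 60.0


def should_release(launch_ts: float, now_ts: float, is_idle: bool,
                   window_minutes: float = 10.0,
                   decommission_minutes: float = 5.0) -> bool:
    """Release an idle node only near the end of its paid hour.

    window_minutes and decommission_minutes are hypothetical tuning knobs:
    the node must be within window_minutes of the boundary, but still have
    at least decommission_minutes left to drain before the next hour starts.
    """
    if not is_idle:
        return False
    remaining = 60.0 - minutes_into_billing_hour(launch_ts, now_ts)
    return decommission_minutes <= remaining <= window_minutes


# A node launched at t=0, checked at several points in its first hour.
for minute in (20, 52, 58):
    print(minute, should_release(0.0, minute * 60.0, is_idle=True))
```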

Page 17: Qubole @ AWS Meetup Bangalore - July 2015

Min Cluster Size: 400
Max Cluster Size: 800
Time for which cluster size < max size: 49%

17

But it pays off!

Qubole and EC2

Page 18: Qubole @ AWS Meetup Bangalore - July 2015

18

But it pays off!

Expected Compute Hours    Compute Hours Saved    Savings (%)
2902246.2                 2107311.01             72.6
4655815.5                 2105486.11             45.2
1698052.65                1658738.375            97.6
1776944.4                 1476547.835            83.1
2063127.85                838628.7               40.6
919721.25                 613630.955             66.7

Qubole and EC2
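
The Savings (%) column in the table above appears to be compute hours saved divided by expected compute hours. That relationship is inferred from the numbers, not stated on the slide. The sketch below recomputes it for each row; the reported values agree with the recomputed ones to within about 0.1 percentage point.

```python
# (expected compute hours, compute hours saved, savings % as reported on the slide)
rows = [
    (2902246.2, 2107311.01, 72.6),
    (4655815.5, 2105486.11, 45.2),
    (1698052.65, 1658738.375, 97.6),
    (1776944.4, 1476547.835, 83.1),
    (2063127.85, 838628.7, 40.6),
    (919721.25, 613630.955, 66.7),
]


def savings_percent(expected_hours: float, saved_hours: float) -> float:
    """Assumed definition: share of expected compute hours that were not used."""
    return 100.0 * saved_hours / expected_hours


for expected, saved, reported in rows:
    computed = savings_percent(expected, saved)
    print(f"computed {computed:.2f}%  reported {reported}%")
```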

Page 19: Qubole @ AWS Meetup Bangalore - July 2015

• Various configurations
  – All nodes on-demand
  – Minimum nodes on-demand, rest a combination of on-demand and spot
  – All nodes spot

• Minimum set has higher bid price => less likely to lose
  – Up to 90% savings compared to on-demand price

19

Spot instances

Qubole and EC2

Page 20: Qubole @ AWS Meetup Bangalore - July 2015

• Not always available
  – Fall back to on-demand

• Increases overall cost of cluster
  – Periodically replace extra on-demand instances when spot available

• Can go away at any time
  – Hadoop has built-in resiliency
  – Place replicas on stable instances

• Auto-scaling
  – Must maintain requested ratio

20

Why is it hard?

Qubole and EC2
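
Two of the points on this slide, "must maintain requested ratio" and "fall back to on-demand", can be sketched together. When the cluster grows, the new nodes are split so that the spot share moves toward the requested fraction, and any spot capacity that cannot be obtained is filled with on-demand nodes. The function name, parameters, and rounding rule are assumptions for illustration only.

```python
from typing import Optional, Tuple


def split_new_nodes(current_on_demand: int, current_spot: int, nodes_to_add: int,
                    target_spot_fraction: float,
                    spot_available: Optional[int] = None) -> Tuple[int, int]:
    """Return (on_demand_to_add, spot_to_add) for an upscaling step.

    target_spot_fraction is the requested share of spot nodes in the cluster.
    spot_available is a hypothetical cap on obtainable spot nodes; None means
    no cap. Any shortfall falls back to on-demand.
    """
    total_after = current_on_demand + current_spot + nodes_to_add
    desired_spot = round(total_after * target_spot_fraction)
    spot_to_add = min(nodes_to_add, max(0, desired_spot - current_spot))
    if spot_available is not None:
        spot_to_add = min(spot_to_add, spot_available)
    return nodes_to_add - spot_to_add, spot_to_add


print(split_new_nodes(10, 10, 10, 0.5))                    # balanced growth
print(split_new_nodes(10, 10, 10, 0.5, spot_available=2))  # spot shortage
```

The extra on-demand nodes added during a shortage are the ones the slide suggests replacing periodically once spot capacity returns.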

Page 21: Qubole @ AWS Meetup Bangalore - July 2015

• Useful for newer instance types: c3/m3
  – Low ephemeral storage

• Better performance/$
  – Compared to older instance types with more storage

• Writes are changed
  – Minimize writes to EBS volumes
  – Use them only when ephemeral is near full

21

Elastic Block Store

Qubole and EC2
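
The write policy on this slide (prefer ephemeral storage, use EBS only when ephemeral is nearly full) can be sketched as a simple placement rule. The 90% threshold and the function name are assumptions; the talk does not say what "near full" means in practice.

```python
def choose_volume(ephemeral_used_bytes: int, ephemeral_capacity_bytes: int,
                  near_full_threshold: float = 0.9) -> str:
    """Pick where the next write goes: 'ephemeral' unless it is near full.

    near_full_threshold is a hypothetical cutoff (fraction of capacity used).
    """
    if ephemeral_capacity_bytes <= 0:
        raise ValueError("ephemeral capacity must be positive")
    usage = ephemeral_used_bytes / ephemeral_capacity_bytes
    return "ebs" if usage >= near_full_threshold else "ephemeral"


GIB = 1024 ** 3
print(choose_volume(20 * GIB, 80 * GIB))  # plenty of ephemeral space left
print(choose_volume(76 * GIB, 80 * GIB))  # ephemeral nearly full
```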

Page 22: Qubole @ AWS Meetup Bangalore - July 2015

• Spot Fleets

• EBS-only instances

22

What’s coming next

Qubole and EC2

Page 23: Qubole @ AWS Meetup Bangalore - July 2015

Resources

• Auto-scaling Hadoop Clusters in Qubole

• Spot Instances in Qubole Clusters

• Rebalancing Hadoop Clusters for Better Spot Utilization

• Improved Performance with Low-Ephemeral-Storage Instances

23

Qubole and EC2

Page 24: Qubole @ AWS Meetup Bangalore - July 2015

Agenda

• Introduction to Qubole

• How Qubole integrates with AWS:

  – EC2
  – S3
  – RedShift
  – Kinesis

• Hive vs. Presto vs. RedShift vs. Spark

• Qubole vs. EMR

24

Agenda

Page 25: Qubole @ AWS Meetup Bangalore - July 2015

Qubole and S3

• No real Rename (aka Move) operation
  – Renames are copies, and expensive!

• S3 connection establishment is expensive
  – i.e., small calls like getObjectDetails(key) and reads are expensive!

• S3 has bulk prefix listing
  – listObjectsChunked(startKey, maxListing)

• Puts are atomic
  – Object is created only when the upload completes
  – Unlike HDFS, where files are created on first write

• MultiPart!

25

S3 != HDFS

Page 26: Qubole @ AWS Meetup Bangalore - July 2015

Qubole and S3

• Naive

    for path in ['/x/y/a', '/x/y/b', '/x/z/c', … ]:
        result << listObject(path)

• Smart

    pathList = listPrefix('/x')
    while (entry = pathList.next()):
        if entry in ['/x/y/a', '/x/y/b', '/x/z/c', … ]:
            result << entry

• Up to 1000x improvement

26

Prefix Listing
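
The naive and smart strategies on this slide can be compared with a small runnable sketch. It uses an in-memory stand-in for the object store that only counts requests; the class and function names are hypothetical, and the page size of 1000 mirrors the maximum number of keys S3 returns per list request.

```python
class FakeObjectStore:
    """In-memory stand-in for an object store; counts requests, nothing more."""

    def __init__(self, keys):
        self._keys = sorted(keys)
        self._key_set = set(keys)
        self.requests = 0

    def head(self, key):
        """Per-object existence check: one request per key."""
        self.requests += 1
        return key in self._key_set

    def list_prefix(self, prefix, page_size=1000):
        """Bulk listing: one request per page of up to page_size keys."""
        matching = [k for k in self._keys if k.startswith(prefix)]
        for start in range(0, max(len(matching), 1), page_size):
            self.requests += 1
            yield from matching[start:start + page_size]


def naive_lookup(store, wanted):
    """One round trip per wanted path."""
    return [key for key in wanted if store.head(key)]


def smart_lookup(store, wanted, prefix):
    """List the common prefix in bulk, then filter locally."""
    wanted_set = set(wanted)
    return [key for key in store.list_prefix(prefix) if key in wanted_set]


keys = [f"/x/y/{i:04d}" for i in range(2000)] + [f"/x/z/{i:04d}" for i in range(500)]
wanted = keys[:2000]

naive_store, smart_store = FakeObjectStore(keys), FakeObjectStore(keys)
same = naive_lookup(naive_store, wanted) == smart_lookup(smart_store, wanted, "/x")
print("same result:", same)
print("naive requests:", naive_store.requests, "smart requests:", smart_store.requests)
```

In this toy setup the naive approach issues 2000 requests and the bulk listing issues 3. The real gain depends on how many unwanted keys share the prefix, so the figures here say nothing about the 1000x quoted on the slide.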

Page 27: Qubole @ AWS Meetup Bangalore - July 2015

Qubole and S3

• Split Computation
  – Divide input files into tasks for Map-Reduce/Spark/Presto

• Recovering Partitions

• List Paths matching regex pattern (‘/x/y/z/*/*’)

• and many more ..

27

Prefix Listing - Use Cases
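
The pattern-matching use case above ('/x/y/z/*/*') can be sketched as one bulk listing of the literal prefix followed by local filtering. This is an illustrative assumption about how such matching could work, not a description of Qubole's code. Here '*' is taken to match within a single path segment, and a plain list of keys stands in for the listing call.

```python
import re


def glob_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a path glob where '*' matches any characters except '/'."""
    parts = ("[^/]*" if ch == "*" else re.escape(ch) for ch in pattern)
    return re.compile("".join(parts))


def literal_prefix(pattern: str) -> str:
    """Longest leading part of the pattern with no wildcard; used as the listing prefix."""
    return pattern.split("*", 1)[0]


def match_paths(all_keys, pattern: str):
    """Filter keys under the literal prefix against the full pattern."""
    prefix = literal_prefix(pattern)
    regex = glob_to_regex(pattern)
    # Stand-in for a single bulk prefix listing against the object store.
    listed = (key for key in all_keys if key.startswith(prefix))
    return [key for key in listed if regex.fullmatch(key)]


keys = ["/x/y/z/a/1", "/x/y/z/a/b/2", "/x/y/z/c", "/x/y/q/a/1", "/x/y/z/d/3"]
print(match_paths(keys, "/x/y/z/*/*"))
```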

Page 28: Qubole @ AWS Meetup Bangalore - July 2015

Qubole and S3

• Normally:
  – Write data to a temporary location, then atomically rename to the final location

• With S3:
  – Write data to the final location
  – Atomic puts deal with speculation/retries
  – Optional: remove on failure

• Used by default in Hive; available via DirectFileOutputCommitter in MR/Spark

• Tricky: retries/speculation must use the same path

28

Direct Writes
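
The "retries/speculation must use the same path" point can be illustrated with a minimal sketch: if the output key depends only on the task and never on the attempt, repeated or speculative attempts overwrite one object instead of leaving duplicates. The dictionary-backed store, the key layout, and the example path are hypothetical.

```python
class FakeS3:
    """In-memory stand-in: a put replaces the whole object in one step."""

    def __init__(self):
        self.objects = {}

    def put(self, key: str, data: bytes) -> None:
        self.objects[key] = data


def final_key(table_path: str, task_id: int) -> str:
    """The key is derived from the task only, never from the attempt number."""
    return f"{table_path.rstrip('/')}/part-{task_id:05d}"


def run_attempt(store: FakeS3, table_path: str, task_id: int,
                attempt: int, rows) -> str:
    """Write the task output directly to its final key.

    attempt is accepted but deliberately unused when building the key, so a
    retry or speculative attempt targets the same object.
    """
    key = final_key(table_path, task_id)
    store.put(key, "\n".join(rows).encode("utf-8"))
    return key


store = FakeS3()
rows = ["a,1", "b,2"]
run_attempt(store, "s3://example-bucket/table/", task_id=3, attempt=0, rows=rows)
run_attempt(store, "s3://example-bucket/table/", task_id=3, attempt=1, rows=rows)
print(sorted(store.objects))
```

Had the attempt number been part of the key, the retry would have produced a second object and the table would contain the rows twice.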

Page 29: Qubole @ AWS Meetup Bangalore - July 2015

Qubole and S3

• Naive: [Diagram: Client and S3]

• Smart: [Diagram: Client and S3]

29

Pre-Fetching

Page 30: Qubole @ AWS Meetup Bangalore - July 2015

S3 Local Disk Cache (Presto)

30

Page 31: Qubole @ AWS Meetup Bangalore - July 2015

S3 Local Disk Cache (Presto)

31

Page 32: Qubole @ AWS Meetup Bangalore - July 2015

Qubole and S3

• Populating Cache while performing Query may cause Slowness

• Large Files are split
  – Cache Files or Splits?

• Should Caching Combine Small Files?

• Should Caching transform data into Columnar?

Watch out for Table Copies!

32

Why S3 Caching is hard

Page 33: Qubole @ AWS Meetup Bangalore - July 2015

Qubole and S3

• Handle S3 Timeouts and Exceptions (truncated streams)

• Optimize away seek() operations

• Data Sharing across Organizations using Cross-Account Roles

33

Miscellaneous

Page 34: Qubole @ AWS Meetup Bangalore - July 2015

Resources

● S3 Optimizations in Hive
  ○ http://www.qubole.com/optimizing-hadoop-for-s3-part-1/

● Caching in Presto
  ○ http://www.qubole.com/blog/product/caching-presto/

● Qubole vs. EMR Performance Comparison
  ○ http://www.qubole.com/a-performance-comparison-of-qubole-and-amazons-elastic-mapreduce-emr/

● Data Sharing via AWS Roles
  ○ https://qubole-eng.quora.com/Securely-sharing-data-across-Organizations-with-Qubole-2

Qubole and S3

Page 35: Qubole @ AWS Meetup Bangalore - July 2015

Agenda

• Introduction to Qubole

• How Qubole integrates with AWS:

  – EC2
  – S3
  – RedShift
  – Kinesis

• Hive vs. Presto vs. RedShift vs. Spark

• Qubole vs. EMR

35

Agenda

Page 36: Qubole @ AWS Meetup Bangalore - July 2015

• Introduction to Presto

• Brief Introduction to Kinesis

• Qubole’s Value add to Kinesis

• Brief introduction to Redshift

• Qubole’s Value add to Redshift

• When to use which system

36

Presto, Kinesis and RedShift

Presto, Kinesis, RedShift

Page 37: Qubole @ AWS Meetup Bangalore - July 2015

• Interactive SQL Query Engine for Big Data

• Open sourced by Facebook in late 2013

• Follows ANSI-SQL

• Own execution engine

37

Presto, What is it?

Presto, Kinesis, RedShift

Page 38: Qubole @ AWS Meetup Bangalore - July 2015

● Extensibility

○ Pluggable Datasources

● Performance

○ In-memory Execution

○ Aggressive Pipelining

○ Highly efficient Java code

○ Dynamic Query Compilation

○ Vectorization

38

Why Presto?

Presto, Kinesis, RedShift

Page 39: Qubole @ AWS Meetup Bangalore - July 2015

● Smooth learning curve as it adheres to ANSI-SQL

● Active open source community

● Proven worth at scale in companies like Facebook, Netflix, Airbnb

39

Why not something else?

Presto, Kinesis, RedShift

Page 40: Qubole @ AWS Meetup Bangalore - July 2015

● Self managed Presto clusters

● Auto-configured

● Autoscaling

● Data Caching

40

Benefits of Presto @Qubole

Presto, Kinesis, RedShift

Page 41: Qubole @ AWS Meetup Bangalore - July 2015

● Kinesis Connector

● S3 Optimizations

● Insert Support

● UDF support

41

Qubole’s Contribution

Presto, Kinesis, RedShift

Page 42: Qubole @ AWS Meetup Bangalore - July 2015

● Average times across the performance tests by MediaMath on a 22TB text format table, partitioned on date, queried on partition with ~1.2b rows

http://www.qubole.com/blog/big-data/performance-testing-presto/

42

Comparison with Hive

Presto, Kinesis, RedShift

Page 43: Qubole @ AWS Meetup Bangalore - July 2015

● High Capacity Pipe for Real-Time Processing

● Key Concepts

○ Record

○ Streams

○ Shards

○ Checkpoints

43

Kinesis

Presto, Kinesis, RedShift

Page 44: Qubole @ AWS Meetup Bangalore - July 2015

● Streaming use case

○ Spark

● SQL use case

○ Via Hive

○ Via Presto

44

Qubole and Kinesis

Presto, Kinesis, RedShift

Page 45: Qubole @ AWS Meetup Bangalore - July 2015

● Kinesis Connector

45

Presto-Kinesis Integration

Presto, Kinesis, RedShift

Page 46: Qubole @ AWS Meetup Bangalore - July 2015

● Example
  ○ Step 1: Define Schema

46

Presto-Kinesis Integration

Presto, Kinesis, RedShift

Page 47: Qubole @ AWS Meetup Bangalore - July 2015

  ○ Step 2: Run Query

47

Presto-Kinesis Integration

Page 48: Qubole @ AWS Meetup Bangalore - July 2015

● Data warehouse service

● OLAP

● Storage + Compute

48

Redshift

Presto, Kinesis, RedShift

Page 49: Qubole @ AWS Meetup Bangalore - July 2015

● ETL use case

○ DBImport

○ DBExport

● Ad hoc query use case

○ Direct Query

○ Hive

○ Presto

49

Qubole and Redshift

Presto, Kinesis, RedShift

Page 50: Qubole @ AWS Meetup Bangalore - July 2015

Two ways to access Redshift via Presto

● Via Hive JDBC Storage Handler

50

Presto-Redshift integration

Presto, Kinesis, RedShift

Page 51: Qubole @ AWS Meetup Bangalore - July 2015

Two ways to access Redshift via Presto

● Via JDBC connector

51

Presto-Redshift integration

Presto, Kinesis, RedShift

Page 52: Qubole @ AWS Meetup Bangalore - July 2015

● Single Platform

● Consistent User Interface

● Cross-source Joins without data consolidation

52

New Opportunities

Presto, Kinesis, RedShift

Page 53: Qubole @ AWS Meetup Bangalore - July 2015

● Hive: ETL, huge Joins, Group By on high cardinality columns

● Redshift: Interactive Queries when data loading is acceptable

● Presto:

○ Interactive Queries

○ Direct Queries without loading

○ Joining data across Redshift, Kinesis, S3, MySQL, Cassandra, etc.

53

Hive? Presto? RedShift?

Presto, Kinesis, RedShift

Page 54: Qubole @ AWS Meetup Bangalore - July 2015

● Iterative Machine Learning

● In-Memory Computing

● Spark Streaming

54

Got Spark?

Presto, Kinesis, RedShift

Page 55: Qubole @ AWS Meetup Bangalore - July 2015

Resources

● Presto vs Hive
  ○ http://www.qubole.com/blog/big-data/performance-testing-presto/

● Presto Kinesis Integration
  ○ http://blogs.aws.amazon.com/bigdata/post/Tx2DDFNHXSAAH2G/Presto-Amazon-Kinesis-Connector-for-Interactively-Querying-Streaming-Data

● Presto Kinesis Connector
  ○ https://github.com/qubole/presto-kinesis

● Hive Redshift/JDBC Connector
  ○ http://www.qubole.com/blog/product/hive-jdbc-storage-handler/

● Hive Redshift/JDBC Connector
  ○ https://github.com/qubole/Hive-JDBC-Storage-Handler

Page 56: Qubole @ AWS Meetup Bangalore - July 2015

Agenda

• Introduction to Qubole

• How Qubole integrates with AWS:

  – EC2
  – S3
  – RedShift
  – Kinesis

• Hive vs. Presto vs. RedShift vs. Spark

• Qubole vs. EMR

56

Agenda