Qubole Overview at the Fifth Elephant Conference


Upload: joydeep-sen-sarma

Post on 28-Nov-2014


Page 1: Qubole Overview at the Fifth Elephant Conference

The Elephant in the Cloud

Qubole Data Platform

Page 2: Qubole Overview at the Fifth Elephant Conference

Cloud is Awesome

• On-Demand
• Elastic
• Cheap
  – Spot Instances!
• Infinite Storage

Page 3: Qubole Overview at the Fifth Elephant Conference

But it’s Complicated ..

Page 4: Qubole Overview at the Fifth Elephant Conference

But it’s Complicated ..

• Set up my own Hive metastore .. damn.
• Set up my own cluster, hmmm ..
  – How many nodes?
  – What type of nodes?
  – Spot vs. On-Demand? How to bid?
  – What happens if Spot instances disappear?
• Why did my query fail last night?
• How to schedule something to run periodically?

Page 5: Qubole Overview at the Fifth Elephant Conference

Easier:

Page 6: Qubole Overview at the Fifth Elephant Conference

Easier: month old Job

Page 7: Qubole Overview at the Fifth Elephant Conference

Auto-Scaling

select t.county, count(1)
from (select transform(a.zip) using 'geo.py' as county from SMALL_TABLE a) t
group by t.county;

insert overwrite table dest
select a.id, a.zip, count(distinct b.uid)
from ads a join LARGE_TABLE b on (a.id = b.ad_id)
group by a.id, a.zip;

Newco_Hadoop

Page 8: Qubole Overview at the Fifth Elephant Conference

Consolidation=Efficiency

Page 9: Qubole Overview at the Fifth Elephant Conference

Consolidation=Efficiency

Page 10: Qubole Overview at the Fifth Elephant Conference

Engineering Trivia

• When to add/delete nodes?
  – Project future demand using JobTracker stats
• How to safely delete nodes?
  – Don’t if they hold intermediate data
  – Decommission from HDFS
  – Delete cache blocks
• How to place data?
  – One copy on Core Nodes
  – Cached File to Node Affinity
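The scaling questions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Qubole's implementation: the `Node` fields, the slot-based demand projection, and the rule that core nodes are never removed are all hypothetical names and choices introduced here.

```python
import math
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    is_core: bool                # assumption: core nodes keep the primary HDFS copy
    has_intermediate_data: bool  # e.g. map outputs that reducers still need
    hdfs_decommissioned: bool    # True once HDFS no longer relies on this node


def target_node_count(pending_tasks, running_tasks, slots_per_node,
                      min_nodes, max_nodes):
    """Project demand as the nodes needed to hold every known task,
    clamped to the configured cluster limits."""
    needed = math.ceil((pending_tasks + running_tasks) / slots_per_node)
    return max(min_nodes, min(max_nodes, needed))


def safe_to_delete(node):
    """Remove a node only if it is not core, holds no intermediate data,
    and has already been decommissioned from HDFS."""
    return (not node.is_core
            and not node.has_intermediate_data
            and node.hdfs_decommissioned)
```

A real controller would read the task counts from JobTracker statistics and would also delete the node's cache blocks before releasing it; both steps are left out of this sketch.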

Page 11: Qubole Overview at the Fifth Elephant Conference

Issues with Cloud Storage

• Slower compared to Local Drives (4x)
• Very slow on small files (5x)
• Tremendous Variance (avg: 95, stddev: 25)
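To put the variance figure in perspective, the short calculation below uses only the two numbers from the slide (the units are not stated in the source). The helper names are hypothetical, and the two-sigma band is just a rough way to read a standard deviation, not a claim about the actual distribution.

```python
def coefficient_of_variation(mean, stddev):
    """Standard deviation expressed as a fraction of the mean."""
    return stddev / mean


def two_sigma_band(mean, stddev):
    """Range covering mean plus or minus two standard deviations."""
    return (mean - 2 * stddev, mean + 2 * stddev)


cv = coefficient_of_variation(95.0, 25.0)  # roughly a quarter of the mean
band = two_sigma_band(95.0, 25.0)          # (45.0, 145.0)
```

A spread of about 26% of the mean means that two reads of the same object can differ widely in duration, which is what makes job run times hard to predict.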

Page 12: Qubole Overview at the Fifth Elephant Conference

Switch to HDFS?

• S3DistCp for Efficient Copy between S3 and HDFS

We have also made available S3DistCp, an extension of the open source Apache DistCp tool for distributed data copy, that has been optimized to work with Amazon S3. Using S3DistCp, you can efficiently copy large amounts of data between Amazon S3 and HDFS on your Amazon EMR job flow or copy files between Amazon S3 buckets. During data copy you can also optimize your files for Hadoop processing. This includes modifying compression schemes, concatenating small files, and creating partitions.
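As a rough illustration of the copy described above, the helper below assembles an S3DistCp argument list without running it. The function name, bucket, and paths are hypothetical. The `s3-dist-cp` command name and the `--src`, `--dest`, `--groupBy`, `--targetSize`, and `--outputCodec` options follow the Amazon EMR documentation, but how the tool is invoked varies by EMR release, so treat this as a sketch to check against your own environment.

```python
def s3distcp_args(src, dest, group_by=None, target_size_mib=None,
                  output_codec=None):
    """Build an S3DistCp argument list; nothing is executed here."""
    args = ["s3-dist-cp", "--src", src, "--dest", dest]
    if group_by is not None:
        # Regex whose capture group decides which small files are concatenated.
        args += ["--groupBy", group_by]
    if target_size_mib is not None:
        # Approximate size of each concatenated output file, in MiB.
        args += ["--targetSize", str(target_size_mib)]
    if output_codec is not None:
        # Compression applied to the copied files, e.g. "gzip".
        args += ["--outputCodec", output_codec]
    return args


# Hypothetical bucket and paths, for illustration only.
example = s3distcp_args(
    "s3://example-bucket/logs/",
    "hdfs:///data/logs/",
    group_by=".*/(page_views).*\\.json",
    target_size_mib=128,
    output_codec="gzip",
)
```

The resulting list could be handed to `subprocess.run` on a cluster node where the tool is installed.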

Page 13: Qubole Overview at the Fifth Elephant Conference

Switch to HDFS?

Use HDFS as Cache

Page 14: Qubole Overview at the Fifth Elephant Conference

Columnar-Cloud-Cache

[Diagram labels: S3, page_views.json, MR, HDFS, MapTask, Uploader, Cluster-1]

Page 15: Qubole Overview at the Fifth Elephant Conference

Columnar-Cloud-Cache

[Diagram labels: S3, page_views.json, MR, HDFS, MapTask, Cluster-1]

Page 16: Qubole Overview at the Fifth Elephant Conference

Columnar-Cloud-Cache

[Diagram labels: S3, MR, HDFS, Cluster-2]

Page 17: Qubole Overview at the Fifth Elephant Conference

Columnar-Cloud-Cache

[Diagram labels: S3, page_views.json, MR, HDFS, MapTask, Uploader, Cluster-2]
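The diagrams themselves did not survive in this transcript, so the following is one plausible reading of the Columnar-Cloud-Cache flow rather than the actual design: a map task reads a file from S3 on first access, an uploader stores a columnar copy in the cluster's HDFS, later reads are served from that copy, and a fresh cluster starts with an empty cache. The class and function names are hypothetical, and plain dictionaries stand in for S3 and HDFS.

```python
def to_columnar(rows):
    """Turn a list of row dicts into a dict of column lists.
    Assumes every row has the same keys."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


class ColumnarCloudCache:
    def __init__(self, s3, hdfs_cache):
        self.s3 = s3                  # dict standing in for the S3 bucket
        self.hdfs_cache = hdfs_cache  # dict standing in for this cluster's HDFS
        self.s3_reads = 0             # counts cache misses, for illustration

    def read(self, key):
        """Serve from the HDFS cache when possible; otherwise read from S3,
        convert to columnar form, and populate the cache."""
        if key in self.hdfs_cache:
            return self.hdfs_cache[key]
        self.s3_reads += 1
        columns = to_columnar(self.s3[key])
        self.hdfs_cache[key] = columns  # the "uploader" step
        return columns
```

Because S3 stays the source of truth, losing the cache costs only a slower first read on the next cluster.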

Page 18: Qubole Overview at the Fifth Elephant Conference

vs. S3

• Up to 5x faster
• Predictable

[Chart labels: csv, json]

Page 19: Qubole Overview at the Fifth Elephant Conference

HDFS as Cache

• Drop cached files liberally:
  – When nodes are decommissioned
  – When nodes fail
• Make block placement smart:
  – Always maintain copy in core node
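The two rules above can be sketched as follows. This is a simplified illustration with hypothetical names, not HDFS's block placement API: nodes are `(name, is_core)` tuples, the cache index is a plain dictionary, and "drop liberally" is interpreted as forgetting a lost node's copies without re-replicating them.

```python
def choose_replica_nodes(nodes, replicas):
    """Pick nodes for a cached block so that one copy always lands on a
    core node. `nodes` is a list of (name, is_core) tuples."""
    core = [name for name, is_core in nodes if is_core]
    if not core:
        raise ValueError("need at least one core node")
    chosen = [core[0]]
    for name, _ in nodes:
        if len(chosen) == replicas:
            break
        if name not in chosen:
            chosen.append(name)
    return chosen


def drop_cache_for_node(cache_index, lost_node):
    """Forget a failed or decommissioned node's cached copies instead of
    re-replicating them. `cache_index` maps path -> set of node names;
    paths left with no copy are dropped from the index."""
    for path in list(cache_index):
        cache_index[path].discard(lost_node)
        if not cache_index[path]:
            del cache_index[path]
```

Dropping is cheap here because the authoritative copy still lives in cloud storage, so a dropped file is simply fetched again on the next read.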

Page 20: Qubole Overview at the Fifth Elephant Conference

Tip of Iceberg

• Extract data samples to MySQL
  – Quick expression evaluation
• Checkbox for Fast and Dirty Queries
  – Sample data automatically
  – Stop computation after 90%
  – Approximate count distinct
• Periodic Jobs!
• Query Authoring widgets

Page 21: Qubole Overview at the Fifth Elephant Conference

Q&A