High Volume Updates in Apache Hive



DESCRIPTION

Apache Hive provides a convenient SQL-based query language for data stored in HDFS. HDFS provides highly scalable bandwidth to the data, but does not support arbitrary writes. One of Hortonworks' customers needs to store a high volume of customer data (> 1 TB/day), and that data contains a high percentage (15%) of record updates distributed across years. In many high-update use cases, HBase would suffice, but the current lack of push-down filters from Hive into HBase and HBase's single-level keys make it too expensive. Our solution is to use a custom record reader that stores the edit records as separate HDFS files and synthesizes the current set of records dynamically as the table is read. This provides an economical solution to their need that works within the framework provided by Hive. We believe this use case applies to many Hive users and plan to develop and open-source a reusable solution.

TRANSCRIPT

Page 1: High Volume Updates in Apache Hive

© Hortonworks Inc. 2012 · June 2012

Owen O’Malley · [email protected] · @owen_omalley

Page 2: Who Am I?

• Was working in Yahoo Search in Jan 2006, when we decided to work on Hadoop.
• My first patch went in before it was Hadoop.
• Was the first committer added to the project.
• Was the first Apache VP of Hadoop.
• Won the sort benchmark in 2008 & 2009.
• Was the tech lead for:
  • MapReduce
  • Security
• Now I’m working on Hive.

Page 3: A Data Flood

• The customer has:
  • Business-critical data
  • 1 TB/day of customer data
  • Need to keep all historical data
• Data size increasing exponentially
  • Doubles year over year
• Lots of derived reports
• 15% of traffic is updates to old data

Page 4: The Dataflow

[Slide figure: the customer’s dataflow diagram]

Page 5: The Approach

• Picked Hive
  • Wanted a higher-level query language
  • Hive’s HQL lowers the learning curve
  • Batch processing
  • But, they have those edits & deletes…
• Hive doesn’t update things well
  • Reflects the capabilities of HDFS
  • Can add records
  • Can overwrite partitions (see the sketch after this list)
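
For concreteness, here is a minimal sketch of the two write patterns Hive did support, issued over JDBC. The endpoint URL, table, and column layout are illustrative assumptions, and the HiveServer2 driver shown here postdates this talk (the original driver used jdbc:hive://):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class HiveWrites {
      public static void main(String[] args) throws Exception {
        // Register the HiveServer2 JDBC driver (an assumption for this sketch).
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection conn =
            DriverManager.getConnection("jdbc:hive2://localhost:10000/default");
        try (Statement stmt = conn.createStatement()) {
          // Append-only: add new records to an existing table.
          stmt.execute("INSERT INTO TABLE events SELECT * FROM staging_events");
          // Coarse-grained "update": rewrite one whole partition at a time.
          stmt.execute("INSERT OVERWRITE TABLE events PARTITION (dt='2012-06-01')"
              + " SELECT * FROM staging_events WHERE dt='2012-06-01'");
        } finally {
          conn.close();
        }
      }
    }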

Page 6: Why Not HBase?

• Good
  • Handles compaction for us
  • Files are indexed and have Bloom filters
• Bad
  • Push-down filters only recently done
  • Tuned for real-time access
  • One mapper per region
  • HBase has only a single key

Page 7: Limitations of a Single Key

• HBase uses a single key
  • Stored in sorted order
  • Divided into regions
• For date-based tables
  • Recent dates are very hot
  • Hot spot moves
• Can hash to spread load (see the sketch after this list)
  • <10 bit hash><YYYY/MM/DD><id>

Page 8: Hive Table Layout

[Slide figure: Hive table layout diagram]

Page 9: Design

• Hive allows tables to specify an InputFormat and OutputFormat.

• We can store each bucket as a base file and a sequential set of edit files.

• To enable efficient reading of a bucket, the base and edit files must be sorted (see the merge sketch after this list).

• Edits must be compacted eventually.
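
A minimal sketch of the merge-on-read idea, with in-memory TreeMaps standing in for the sorted base and edit files in HDFS, and a null value standing in for a delete marker; the String key/row shape is an assumption for illustration:

    import java.util.Iterator;
    import java.util.Map;
    import java.util.TreeMap;

    public class MergeOnRead {
      // Walks the sorted base and sorted edits together and emits the
      // current rows: an edit overrides the base row with the same key,
      // and a null edit value means the row was deleted.
      static void readBucket(TreeMap<String, String> base,
                             TreeMap<String, String> edits) {
        Iterator<Map.Entry<String, String>> b = base.entrySet().iterator();
        Iterator<Map.Entry<String, String>> e = edits.entrySet().iterator();
        Map.Entry<String, String> br = b.hasNext() ? b.next() : null;
        Map.Entry<String, String> er = e.hasNext() ? e.next() : null;
        while (br != null || er != null) {
          int cmp = br == null ? 1 : er == null ? -1
                  : br.getKey().compareTo(er.getKey());
          if (cmp < 0) {                       // base only: emit as-is
            System.out.println(br.getKey() + " -> " + br.getValue());
            br = b.hasNext() ? b.next() : null;
          } else {                             // edit wins (update or insert)
            if (er.getValue() != null) {
              System.out.println(er.getKey() + " -> " + er.getValue());
            }                                  // null value: row was deleted
            if (cmp == 0) br = b.hasNext() ? b.next() : null;
            er = e.hasNext() ? e.next() : null;
          }
        }
      }
    }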

Page 10: Repeatable Reads

• MapReduce handles failures by re-executing failed tasks.
• Lost machines also cause re-execution.
• Some reduces may get output from a different attempt of a lost map.
• All queries must only include completed updates.
• The update files generated by a single job must contain a consistent timestamp (see the sketch after this list).
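
One way to get such a consistent timestamp is to stamp it once in the submitting client’s job configuration, so every task attempt (including re-executions) names its edit files with the same version. A minimal sketch; the property name is an illustrative assumption, not an actual Hive setting:

    import org.apache.hadoop.conf.Configuration;

    public class EditVersion {
      static final String KEY = "example.edits.version";

      // Called once in the client, before the job is submitted.
      static void stampJob(Configuration conf) {
        conf.setLong(KEY, System.currentTimeMillis());
      }

      // Called inside each task; every attempt sees the same value, so
      // re-executed tasks produce identically versioned edit files.
      static String editFileName(Configuration conf, int bucket) {
        return String.format("edits_%d_%06d", conf.getLong(KEY, 0L), bucket);
      }
    }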

Page 11: Stitching Buckets Together

[Slide figure: stitching a bucket’s base and edit files together]

Page 12: Limitations

• Requires an active component to manage compaction.
  • Compactions should be scheduled for off-peak hours.
  • Minor and major compactions (see the sketch after this list).
• Requires the table to be sorted.
• Must use consistent bucketing.
• Only suitable for batch, not real-time.
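
A minimal sketch of a minor compaction, assuming edit files are represented as sorted maps and applied oldest-first so the newest edit per key wins; a major compaction would apply the same merge to the base file as well:

    import java.util.List;
    import java.util.TreeMap;

    public class MinorCompaction {
      // Merge sorted edit files, oldest first; later (newer) edits
      // overwrite earlier ones. A null value still means "deleted" and
      // survives the merge, so it keeps masking the base row until a
      // major compaction folds it into the base and drops it.
      static TreeMap<String, String> compact(
          List<TreeMap<String, String>> oldestFirst) {
        TreeMap<String, String> merged = new TreeMap<>();
        for (TreeMap<String, String> file : oldestFirst) {
          merged.putAll(file);
        }
        return merged;
      }
    }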

Page 13: Additional Challenges from Hive

• Hive defines its own OutputFormat.
• Hive serializes before the OutputFormat.
  • Makes inspecting fields difficult.
  • Makes custom serialization difficult.
• RCFiles compress well, but have no index or Bloom filters.
  • Support batch access, but not lookups (see the sketch after this list).
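
As an illustration of the kind of side structure RCFile lacks, here is a Bloom filter over a bucket’s keys built with Hadoop’s bundled BloomFilter class, so a reader could skip files that cannot contain a key; the sizing parameters are illustrative assumptions, not tuned values:

    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.util.bloom.BloomFilter;
    import org.apache.hadoop.util.bloom.Key;
    import org.apache.hadoop.util.hash.Hash;

    public class BucketFilter {
      // ~1M bits and 4 hash functions: illustrative, untuned sizing.
      private final BloomFilter filter =
          new BloomFilter(1 << 20, 4, Hash.MURMUR_HASH);

      void add(String rowKey) {
        filter.add(new Key(rowKey.getBytes(StandardCharsets.UTF_8)));
      }

      // False: the bucket definitely lacks the key. True: it might have it.
      boolean mightContain(String rowKey) {
        return filter.membershipTest(
            new Key(rowKey.getBytes(StandardCharsets.UTF_8)));
      }
    }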

Page 14: Hive’s Output Committer

• MapReduce uses OutputCommitters to promote the successful task attempt’s output (see the sketch after this list).
  • Handles failure and speculative execution.
• Hive uses its own OutputCommitter.
  • Runs in the submitting process.
  • Uses a regex to match task attempts.
  • Picks the largest output.
  • Prevents side files, including indexes.
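
For context, a minimal sketch of what “promoting” an attempt’s output means under the MapReduce OutputCommitter contract: the winning attempt’s private directory is renamed into the final output, so only one attempt’s files survive. This mirrors the idea behind FileOutputCommitter, not Hive’s regex-based committer; the path layout is an assumption:

    import java.io.IOException;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.OutputCommitter;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;

    public class SketchCommitter extends OutputCommitter {
      private final Path out;
      SketchCommitter(Path out) { this.out = out; }

      // Each attempt writes into its own private directory.
      private Path attemptPath(TaskAttemptContext ctx) {
        return new Path(out, "_tmp/" + ctx.getTaskAttemptID());
      }

      @Override public void setupJob(JobContext ctx) throws IOException {
        out.getFileSystem(ctx.getConfiguration()).mkdirs(new Path(out, "_tmp"));
      }
      @Override public void setupTask(TaskAttemptContext ctx) {}
      @Override public boolean needsTaskCommit(TaskAttemptContext ctx)
          throws IOException {
        Path p = attemptPath(ctx);
        return p.getFileSystem(ctx.getConfiguration()).exists(p);
      }
      @Override public void commitTask(TaskAttemptContext ctx) throws IOException {
        FileSystem fs = out.getFileSystem(ctx.getConfiguration());
        // Promotion: the winning attempt's directory becomes the task output.
        fs.rename(attemptPath(ctx),
            new Path(out, ctx.getTaskAttemptID().getTaskID().toString()));
      }
      @Override public void abortTask(TaskAttemptContext ctx) throws IOException {
        Path p = attemptPath(ctx);
        p.getFileSystem(ctx.getConfiguration()).delete(p, true);
      }
    }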

Page 15: Dynamic Partitions

• Hive supports updating a lot of partitions dynamically.
• Keeps a RecordWriter per partition.
• RecordWriters keep a reference to their compression codec and its buffers.
• Updating thousands of partitions uses a lot of RAM (see the sketch after this list).
• The Snappy codec seems worse than LZO.
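
One possible mitigation, sketched below: bound the number of simultaneously open per-partition writers with an LRU cache, so evicted writers are closed and release their codec buffers. The Writer interface and the cap of 100 are illustrative assumptions, not Hive’s actual behavior:

    import java.io.IOException;
    import java.util.LinkedHashMap;
    import java.util.Map;

    public class WriterCache {
      interface Writer { void close() throws IOException; }

      private static final int MAX_OPEN = 100;

      // Access-ordered LinkedHashMap: the least-recently-used writer is
      // the eldest entry, and gets closed once the cap is exceeded.
      private final Map<String, Writer> open =
          new LinkedHashMap<String, Writer>(16, 0.75f, true) {
            @Override protected boolean removeEldestEntry(
                Map.Entry<String, Writer> eldest) {
              if (size() > MAX_OPEN) {
                try {
                  eldest.getValue().close();  // flush + free codec buffers
                } catch (IOException e) {
                  throw new RuntimeException(e);
                }
                return true;
              }
              return false;
            }
          };

      void put(String partition, Writer w) { open.put(partition, w); }
      Writer get(String partition) { return open.get(partition); }
    }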

Page 16: Conclusion

• Provides scalable read and write access.
• Less resource-intensive than HBase.
• Still in development:
  • The approach is scalable.
  • Naturally, some bumps are still hidden in the dark.
  • Testing on samples as the data loads.
• Plan to implement an open-source version as a library in Hive.

Page 17: Thank You!

Questions & Answers