orc 2015

© Hortonworks Inc. 2011 – 2015. All Rights Reserved

ORC: 2015

Gopal Vijayaraghavan


ORC – Optimized Row-Columnar File

Columnar Storage

Row-groups & Fixed splits

Protobuf Metadata Storage

Type-safe Vectorization

Hive ACID transactions

Single SerDe for Format


Need for Speed: The Stinger Initiative

Stinger: An Open Roadmap to improve Apache Hive’s performance 100x.

Launched: February 2013; Delivered: April 2014.

Delivered in 100% Apache Open Source.

Vectorized SQL Engine + Columnar Storage (ORC) + Distributed Execution (Apache Tez) = 100X


ORC at Facebook

Saved more than 1,400 servers worth of storage. [1]

Compression ratio increased from 5x to 8x globally. [1]


ORC at Spotify

IO: 16x less HDFS read when using ORC versus Avro. [2]

CPU: 32x less CPU when using ORC versus Avro. [2]


ORC: Today

What is Optimized about ORC?


ORC – Optimized Row-Columnar File

Columnar Storage

Row-groups & Stripe splits

Protobuf Metadata Storage

Type-safe Vectorization

Hive ACID transactions

Single SerDe for Format


Columnar Storage

Storage Performance

● Compress each column differently

● Detect & compress common sub-sequences

● Auto-increment ids

● String Enums

● Large Integers (uid scale)

● Unique strings (UUIDS)

Read Performance

● Column projection

● Columnar deserializers

● Data locality

Write Throughput

● Stats auto-gather
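The storage bullets on this slide (auto-increment ids, string enums) can be illustrated with a small sketch. It is not ORC's actual encoder: the function names `delta_encode`, `dictionary_encode` and `run_length` are hypothetical, and the point is only that each column can pick the encoding that suits its values.

```python
from itertools import groupby

def delta_encode(values):
    """Store the first value, then the difference to each previous value."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def run_length(values):
    """Collapse repeated neighbours into (value, count) pairs."""
    return [(v, len(list(group))) for v, group in groupby(values)]

def dictionary_encode(strings):
    """Replace each string with its index in a sorted dictionary."""
    dictionary = sorted(set(strings))
    index = {s: i for i, s in enumerate(dictionary)}
    return dictionary, [index[s] for s in strings]

# Auto-increment ids: deltas are all 1, so run-length makes them tiny.
ids = [1000, 1001, 1002, 1003, 1004]
encoded_ids = run_length(delta_encode(ids))
print(encoded_ids)  # [(1000, 1), (1, 4)]

# String enum: a few distinct values become small integers.
states = ["CA", "NY", "CA", "CA", "TX"]
dictionary, codes = dictionary_encode(states)
print(dictionary, codes)  # ['CA', 'NY', 'TX'] [0, 1, 0, 0, 2]
```

A row-oriented file would interleave `ids` and `states`, so neither pattern would be visible to the encoder.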


Row-groups & Stripe splits

Split Parallelism

● Effective parallelism

● No seeks to find boundaries

● No splits with zero data

● Decompress fixed chunks

Stripes

● Single unsplittable chunk

● Will reside in 1 HDFS block entirely

● Is self-contained for all read ops
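As a rough sketch of why stripe-aligned splits avoid boundary seeks: if the file footer lists each stripe's offset and length, a split is just a list of whole stripes. The `Stripe` tuple, `stripes_to_splits` and the byte sizes below are illustrative assumptions, not the Hive split generator.

```python
from typing import List, Tuple

Stripe = Tuple[int, int]  # (offset, length) in bytes, as listed in a footer

def stripes_to_splits(stripes: List[Stripe], max_split_bytes: int) -> List[List[Stripe]]:
    """Group whole stripes into splits; a stripe is never cut in two."""
    splits, current, size = [], [], 0
    for stripe in stripes:
        _, length = stripe
        # Close the current split when adding this stripe would exceed the cap.
        if current and size + length > max_split_bytes:
            splits.append(current)
            current, size = [], 0
        current.append(stripe)
        size += length
    if current:
        splits.append(current)
    return splits

stripes = [(3, 64), (67, 64), (131, 64), (195, 32)]
splits = stripes_to_splits(stripes, max_split_bytes=128)
print(splits)  # [[(3, 64), (67, 64)], [(131, 64), (195, 32)]]
```

Every split starts at a known stripe offset and contains at least one stripe, which matches the "no seeks to find boundaries" and "no splits with zero data" bullets.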


A Single SerDe for all ORC Files

A Single Writer

● No mismatch of serialization

● Forward compatibility

Readers

● Multiple reader implementations

● Allows for vector readers

● And row-mode readers

● Similar loop – good JIT hit-rate


Protobuf Metadata Storage

Standardized Metadata

● Readers are easier to write

● Metadata readers are auto-generated

Metadata Forward Compatibility

● Protobuf Optional fields

Statistics Storage in Metadata

● Standard serialization for stats

● Allows for PPD into the IO layer


Type-safe Vectorization

Schema on Write

● Write ORC Structs with types

● SerDe & Inputformat

Read Performance

● Data is read with few copies

● Primitive types are fast

● Primitives are also unboxed

● Predicates are typed too


ORC: ETL Improvements

Always more new data


ORC (Zlib): Compress Differently

Time Taken: ETL for TPC-H LineItem (scale 1 Tb)

● ORC (old zlib): 674

● ORC SNAPPY: 389

● ORC (new zlib): 433

Different Zlib algorithms for encoding

● Z_FILTERED

● Z_DEFAULT

● Z_BEST_SPEED

● Z_DEFAULT_COMPRESSION

In detail

● Compress IS_NULL bitsets lightly

● Compress Integers differently from Doubles

● Compress string dictionaries differently

● Allow for user choice
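A minimal sketch of per-stream zlib tuning, using Python's `zlib` module rather than the Java writer. The mapping of stream kind to level and strategy is an assumption made for illustration; the slide only says that bitsets, integers, doubles and dictionaries are compressed differently.

```python
import struct
import zlib

def deflate(data: bytes, level: int, strategy: int) -> bytes:
    """Compress one stream with an explicit zlib level and strategy."""
    compressor = zlib.compressobj(level, zlib.DEFLATED, zlib.MAX_WBITS,
                                  zlib.DEF_MEM_LEVEL, strategy)
    return compressor.compress(data) + compressor.flush()

# Hypothetical streams belonging to one column group.
is_null_bits = bytes(4096)
integers = struct.pack("<1000q", *range(1000))
doubles = struct.pack("<1000d", *(i * 0.1 for i in range(1000)))

# Assumed settings per stream kind (illustration only).
compressed = {
    "is_null": deflate(is_null_bits, zlib.Z_BEST_SPEED, zlib.Z_DEFAULT_STRATEGY),
    "integers": deflate(integers, zlib.Z_DEFAULT_COMPRESSION, zlib.Z_DEFAULT_STRATEGY),
    "doubles": deflate(doubles, zlib.Z_DEFAULT_COMPRESSION, zlib.Z_FILTERED),
}
for name, blob in compressed.items():
    print(name, len(blob))
```

All three outputs are ordinary zlib streams, so a reader needs no knowledge of which strategy the writer picked.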


ORC (Zlib): Compress Differently

Size on Disk: Data Sizes for TPC-H Lineitem (Scale 1 Tb)

● ORC (old zlib): 178.5

● ORC SNAPPY: 225.1

● ORC (new zlib): 172.2


Using JDK8 SIMD: Integer Writers

Integer encodings

● Base + Delta

● Run-length

● Direct

Trade-off for Size/Speed

● Use fixed bit-width loops

● Snap to nearest bit-width

[Chart: ORC Write Integer Performance (smaller is better). Mean Time (ms), 0 to 2000, against Bit Width (1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64), for hive 0.13 bitpacking and hive 1.0 bitpacking (new).]
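The "snap to nearest bit-width" idea can be sketched as follows, assuming non-negative values and taking the allowed widths from the chart's Bit Width axis. `snap_width`, `pack` and `unpack` are hypothetical helpers, not the Hive bit-packing routines.

```python
ALLOWED_WIDTHS = (1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64)

def snap_width(max_value: int) -> int:
    """Round the bits needed for max_value up to the next allowed width."""
    needed = max(1, max_value.bit_length())
    for width in ALLOWED_WIDTHS:
        if width >= needed:
            return width
    raise ValueError("value does not fit in 64 bits")

def pack(values, width):
    """Pack non-negative ints at a fixed bit width, most significant first."""
    acc = 0
    for v in values:
        acc = (acc << width) | v
    nbytes = (len(values) * width + 7) // 8
    acc <<= nbytes * 8 - len(values) * width  # pad the last byte with zeros
    return acc.to_bytes(nbytes, "big")

def unpack(data, width, count):
    """Inverse of pack for a known width and value count."""
    acc = int.from_bytes(data, "big") >> (len(data) * 8 - count * width)
    mask = (1 << width) - 1
    return [(acc >> (width * (count - 1 - i))) & mask for i in range(count)]

values = [3, 17, 9, 30]
width = snap_width(max(values))  # 5 bits needed, snapped up to 8
packed = pack(values, width)
print(width, packed.hex())  # 8 0311091e
```

Snapping 5 bits up to 8 wastes some space, but the loop for each allowed width is fixed and byte-friendly, which is the size/speed trade-off named on the slide.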


Double Writers

ORC Write Double Performance (smaller is better): Mean Time (ms) by Double Write Mode

● old: 273.331

● buffered + BE: 247.634

● buffered + LE: 231.741

Double Writers

● JVM byte order is big-endian by default

● X86 is little-endian

● Special handling of NaN
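To make the byte-order point concrete, here is a small Python sketch (the writer itself is Java) showing the same doubles serialized big-endian and little-endian. The sample values are arbitrary.

```python
import math
import struct

values = [1.5, -2.25, math.nan]

big_endian = struct.pack(">3d", *values)     # most significant byte first
little_endian = struct.pack("<3d", *values)  # native order on x86

first_be = big_endian[:8]
first_le = little_endian[:8]
print(first_be.hex())  # 3ff8000000000000
print(first_le.hex())  # 000000000000f83f

# Reading back with the matching byte order restores the values.
decoded = struct.unpack("<3d", little_endian)
print(decoded[:2], math.isnan(decoded[2]))  # (1.5, -2.25) True
```

Each 8-byte value is simply byte-reversed between the two layouts, so writing in the CPU's native order avoids a swap per value. NaN needs care because it never compares equal to itself, so checks have to use `isnan` or the raw bits.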


ORC: Scale compression buffers

File Size (MB) by Compression Buffer Size (KB); legend: ZLIB, SNAPPY

Buffer size (KB)    8      16     32     64     128    256

Upper series (MB)   269.4  263.3  258.5  258.4  258.4  258.4

Lower series (MB)   184.8  183.5  182.2  180.1  178.3  177.4

Large Columns vs More Columns

● Adjust when >1000 columns

Trade offs

● Compression

● Low memory use

More additions

● Dynamically partitioned insert


ORC: Streaming Ingest + ACID

Broken pattern: Partitions for Atomicity

Isolation & Consistency on retries

Transactions are pluggable (txn.manager)

Cache/Replication friendly (base + deltas)
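The base + deltas layout can be pictured with a toy merge. This is a sketch under loud assumptions: the in-memory shapes (`base` as a dict, `deltas` as tuples) and the function `merge_base_and_deltas` are hypothetical and are not Hive's on-disk ACID format.

```python
def merge_base_and_deltas(base, deltas, committed):
    """Apply committed delta events to an immutable base, in transaction order.

    base:      dict of row_id -> row
    deltas:    list of (txn_id, op, row_id, row), op in {"insert", "update", "delete"}
    committed: set of transaction ids visible to this reader
    """
    rows = dict(base)  # the base itself is never modified
    for txn_id, op, row_id, row in sorted(deltas, key=lambda d: d[0]):
        if txn_id not in committed:
            continue  # isolation: uncommitted or aborted work stays invisible
        if op == "delete":
            rows.pop(row_id, None)
        else:
            rows[row_id] = row
    return rows

base = {1: "a", 2: "b", 3: "c"}
deltas = [
    (12, "update", 2, "b2"),
    (11, "delete", 3, None),
    (13, "insert", 4, "d"),  # transaction 13 has not committed yet
]
snapshot = merge_base_and_deltas(base, deltas, committed={11, 12})
print(snapshot)  # {1: 'a', 2: 'b2'}
```

Because the base is read-only and deltas are append-only, both can be cached or replicated without invalidation, which is the property the last line of the slide points at.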


ORC: LLAP and Sub-second

ORC – Pushing for Sub-second


ORC: Row Indexes

Min-Max pruning

● Evaluate on statistics

Bloom filters

● Better String filters

● Filter a random distribution

LLAP Future

● Row-level vector SARGs

Rows Read (log scale): from tpch_1000.lineitem where l_orderkey = 1212000001;

● No Indexes: 5,999,989,709

● Min-Max Indexes: 540,000

● Bloomfilter Indexes: 10,000
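Min-max pruning amounts to a range check against per-row-group statistics before any data is read. The sketch below assumes the statistics are available as `(min, max)` pairs; the ranges are made up and `row_groups_to_read` is a hypothetical name.

```python
def row_groups_to_read(stats, key):
    """Return indexes of row groups whose [min, max] range may contain key."""
    return [i for i, (lo, hi) in enumerate(stats) if lo <= key <= hi]

# Made-up l_orderkey ranges for three row groups of a roughly sorted file.
stats = [
    (1, 1_000_000_000),
    (1_000_000_001, 1_500_000_000),
    (1_500_000_001, 6_000_000_000),
]
selected = row_groups_to_read(stats, 1_212_000_001)
print(selected)  # [1]
```

This works well when the column is clustered. When values are spread at random, every range tends to cover the key and nothing is skipped, which is where Bloom filters help.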


ORC: Row Indexes

Time Taken (seconds, smaller is better): * from tpch_1000.lineitem where l_orderkey=1212000001;

● No Indexes: 74

● Min-Max Indexes: 4.5

● Bloomfilter Indexes: 1.34
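A Bloom filter per row group answers "definitely not here" or "maybe here" for an exact key, even when keys are randomly distributed. The class below is a generic textbook sketch using SHA-256, with arbitrary sizes; it is not the filter implementation or hash function used by ORC.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=4096, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit set

    def _positions(self, value):
        data = repr(value).encode("utf-8")
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(bytes([seed]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        # False means certainly absent; True may be a false positive.
        return all((self.bits >> pos) & 1 for pos in self._positions(value))

# Two row groups with unsorted keys, so min-max ranges would overlap.
row_groups = [[7, 1212000001, 42], [99, 5, 3000000000]]
filters = []
for keys in row_groups:
    bloom = BloomFilter()
    for k in keys:
        bloom.add(k)
    filters.append(bloom)

candidates = [i for i, bloom in enumerate(filters) if bloom.might_contain(1212000001)]
print(candidates)
```

The filter never produces a false negative, so the row group that holds the key is always among the candidates; the saving comes from the groups it rules out.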


ORC: JDK8 SIMD Readers

Integer encodings

● Base + Delta

● Run-length

● Direct

Trade-off for Size/Speed

● Use fixed bit-width loops

● Snap to nearest bit-width

[Chart: ORC Read Integer Performance. Mean Time (ms), 0 to 1800, against Bit Width (1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64), for hive 0.13 unpacking and hive-1.0 unpacking (new).]


ORC: Vectorization + SIMD

Advantage of a Single SerDe

● Primitive Types

Allocation free tight inner loops

● JDK8 has auto-vectorization

Vectorized Early Filter

● Vectors can be filtered early in ORC

● StringDictionary can be used to binary-search

Vectorized SIMD Join

● Performance for single key joins

0x00007f13d2e6afb0: vmovdqu 0x10(%rsi,%rax,8),%ymm2
0x00007f13d2e6afb6: vaddpd %ymm1,%ymm2,%ymm2
0x00007f13d2e6afba: movslq %eax,%r10
0x00007f13d2e6afbd: vmovdqu 0x30(%rsi,%r10,8),%ymm3
;*daload vector.expressions.gen.DoubleColAddDoubleColumn::evaluate (line 94)

0x00007f13d2e6afc4: vmovdqu %ymm2,0x10(%rdx,%rax,8)
0x00007f13d2e6afca: vaddpd %ymm1,%ymm3,%ymm2
0x00007f13d2e6afce: vmovdqu %ymm2,0x30(%rdx,%r10,8)
;*dastore vector.expressions.gen.DoubleColAddDoubleColumn::evaluate (line 94)
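The "StringDictionary can be used to binary-search" bullet can be sketched like this, assuming the dictionary for a stripe is sorted and rows store dictionary ids. `filter_by_dictionary` and the sample values are hypothetical.

```python
from bisect import bisect_left

def filter_by_dictionary(dictionary, ids, wanted):
    """Return row positions equal to `wanted` using one string lookup.

    dictionary: sorted list of distinct strings
    ids:        per-row index into the dictionary
    """
    pos = bisect_left(dictionary, wanted)
    if pos == len(dictionary) or dictionary[pos] != wanted:
        return []  # value is absent from the dictionary: skip every row
    # From here on only integers are compared, never strings.
    return [row for row, d in enumerate(ids) if d == pos]

dictionary = ["AIR", "MAIL", "RAIL", "SHIP"]
ids = [0, 3, 1, 3, 2]
print(filter_by_dictionary(dictionary, ids, "SHIP"))   # [1, 3]
print(filter_by_dictionary(dictionary, ids, "TRUCK"))  # []
```

The string comparison cost is paid once per dictionary, in O(log n), and the per-row work becomes an integer equality check that suits a tight vectorized loop.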


ORC: Split Strategies + Tez Grouping

Amdahl’s Law

● As fast as the slowest task

● Slice work thinly, but not too thin

Split-generation vs Execution time

● ETL

● BI

● Hybrid

Split-grouping & estimation

● ColumnarSplit size

● Group by estimate, not file size

● Bucket pruning

Slow split
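"Group by estimate, not file size" can be illustrated with a toy grouping routine. It assumes each split knows the on-disk bytes of each column; the helper names, column sizes and target are invented for the example and do not reflect Tez's grouping code.

```python
def estimate_read_bytes(column_bytes, projected):
    """Bytes a task would actually read: only the projected columns."""
    return sum(column_bytes[c] for c in projected)

def group_by_estimate(splits, projected, target_bytes):
    """Greedily pack split indexes into groups of about target_bytes of real reads."""
    groups, current, size = [], [], 0
    for i, column_bytes in enumerate(splits):
        est = estimate_read_bytes(column_bytes, projected)
        if current and size + est > target_bytes:
            groups.append(current)
            current, size = [], 0
        current.append(i)
        size += est
    if current:
        groups.append(current)
    return groups

# Four splits of 100 bytes each, but the query projects only l_orderkey.
splits = [{"l_orderkey": 10, "l_comment": 90} for _ in range(4)]
groups = group_by_estimate(splits, ["l_orderkey"], target_bytes=20)
print(groups)  # [[0, 1], [2, 3]]
```

Grouping on raw file size would treat each split as 100 bytes and schedule four tasks; grouping on the columnar estimate gives two tasks with the same amount of real work each.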


ORC: LLAP

JIT Performance for short queries

Row-group level caching

Asynchronous IO Elevator

Multi-threaded Column Vector processing


ORC: LLAP (+ SIMD + Split Strategies + Row Indexes)


Questions?

Interested? Stop by the Hortonworks booth to learn more


Endnotes

[1] https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/

[2] http://www.slideshare.net/AdamKawa/a-perfect-hive-query-for-a-perfect-meeting-hadoop-summit-2014
