
© 2017 IBM Corporation

Ingesting Data at Blazing Speed with Apache ORC

Gustavo Arocena

IBM Toronto Lab


IBM Canada Lab

Toronto


Acknowledgements and Disclaimers

Availability. References in this presentation to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates.

The workshops, sessions and materials have been prepared by IBM or the session speakers and reflect their own views. They are provided for informational purposes only, and are neither intended to, nor shall have the effect of being, legal or other guidance or advice to any participant. While efforts were made to verify the completeness and accuracy of the information contained in this presentation, it is provided AS-IS without warranty of any kind, express or implied. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, this presentation or any other materials. Nothing contained in this presentation is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software.

All customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual environmental costs and performance characteristics may vary by customer. Nothing contained in these materials is intended to, nor shall have the effect of, stating or implying that any activities undertaken by you will result in any specific sales, revenue growth or other results.

© Copyright IBM Corporation 2017. All rights reserved.

U.S. Government Users Restricted Rights – Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

IBM, the IBM logo, ibm.com, BigInsights, and Big SQL are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or TM), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at “Copyright and trademark information” at www.ibm.com/legal/copytrade.shtml

▪TPC Benchmark, TPC-DS, and QphDS are trademarks of Transaction Processing Performance Council

▪Cloudera, the Cloudera logo, and Cloudera Impala are trademarks of Cloudera.

▪Hortonworks, the Hortonworks logo and other Hortonworks trademarks are trademarks of Hortonworks Inc. in the United States and other countries.

▪Other company, product, or service names may be trademarks or service marks of others.


Agenda

What is Big SQL?

From Parquet to ORC

Reading ORC Files Fast

Scaling Up

Tuning Big SQL for ORC


Big SQL Background


What is Big SQL?

[Diagram: Big SQL runs on the Hadoop data platform, sharing the platform’s data security, metastore, cluster management, and administration.]

What’s the Big Deal?

Grace Under Pressure


The Big SQL Advantages

Scale

• Only engine to run TPC-DS at 100TB scale

Complex SQL

• Capable of running all 99 TPC-DS queries since 2014

• Complex queries optimized with the IBM cost-based optimizer

Concurrency

• Handles highly concurrent workloads gracefully

• 12-stream TPC-DS runs

Efficient Resource Utilization

• Memory

• CPU

• IO



Metrics for Big SQL 4.2.5 vs. Spark SQL 2.1

▪ Hadoop DS @ 100TB, 4 streams

                       Big SQL   Spark SQL
Elapsed time (hours)     13.7       43.2      ~1/3 the time with Big SQL
CPU utilization (%)      76.4       88.2      ~15% less CPU with Big SQL
Disk reads (MB/sec)       107        388      ~1/3 the reads with Big SQL
Disk writes (MB/sec)       25        237      ~1/9 the writes with Big SQL

https://developer.ibm.com/hadoop/2017/02/07/experiences-comparing-big-sql-and-spark-sql-at-100tb/


From Parquet to ORC


Big SQL Architecture (as of 2016)

[Diagram: a Head Node coordinates multiple Worker Nodes. Each worker reads HDFS data through two IO engines, a native Parquet IO engine and a Hive Compatibility IO engine, with the Hive Metastore and the HDFS NameNode as shared services.]

2017: Big SQL on HDP

The most popular data format on HDP is ORC

ORC performance becomes the top priority

ORC vs. Parquet in Big SQL 4.2

[Chart: Parquet vs. ORC v0, 1TB TPC-DS, elapsed time (sec) at 1 and 4 streams. ORC is 70% slower than Parquet at 1 stream and 315% slower at 4 streams, where the ORC run tops 50,000 seconds.]


Limitations of the Hive Compatibility IO Engine

Slow ingestion (see the contrast sketch below)

Single row at a time

JIT unfriendly

Data values as Java objects

Low scalability

Large memory footprint per scan

Excessive CPU use

Overloaded disks
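To make the contrast concrete, here is an illustrative Java sketch (mine, not from the deck) comparing a row-at-a-time loop against the vectorized style shown later in this deck; the RowReader interface is invented for illustration and is not the actual Hive API:

// Hypothetical row-oriented API, for contrast only (not the real Hive API).
public class RowVsVector {
  interface RowReader {
    boolean hasNext();
    Object[] next();              // one Object[] per row, every value boxed
  }

  static double sumRowAtATime(RowReader r) {
    double sum = 0;
    while (r.hasNext()) {
      Object[] row = r.next();    // per-row allocation and virtual calls
      sum += (Double) row[2];     // cast plus unboxing for every value
    }
    return sum;                   // hard for the JIT to optimize
  }

  static double sumVectorized(double[] amounts, int n) {
    double sum = 0;
    for (int i = 0; i < n; i++) {
      sum += amounts[i];          // tight, allocation-free primitive loop
    }
    return sum;                   // the kind of loop the JIT compiles well
  }
}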


The Roadmap Towards Fast ORC ingestion

1st Phase

• Big SQL 4.2.5, Dec ‘16

• Fast ORC Ingestion

2nd Phase

• Big SQL 5.0.1, Aug ‘17

• ORC at Scale


1st Phase – Fast Ingestion using Apache ORC

[Chart: Parquet vs. ORC v1, 1TB TPC-DS, elapsed time (sec). ORC is 2% faster than Parquet at 1 stream but still 65% slower at 4 streams.]

Apache ORC libs key benefits

▪ Many-row-at-a-time API

▪ Enable JIT-friendly code

▪ Represent data using primitive Java types

▪ Make projection and selection pushdown very easy


2nd Phase – Managing Resources

[Chart: Parquet vs. ORC v2, 1TB TPC-DS, elapsed time (sec). ORC is 15% faster than Parquet at 1 stream and 3.4% faster at 4 streams.]

The Resource Manager has global oversight over

▪ Total number of threads

▪ Overall JVM heap consumption

▪ Degree of parallelism per scan


ORC as a First Class Citizen in 5.0.1

[Diagram: the same architecture as before, but each Worker Node now carries a third, native ORC IO engine alongside the Parquet IO and Hive Compatibility IO engines; the Hive Metastore and HDFS NameNode are unchanged.]


ORC Background


What is Apache ORC?

ORC = efficient storage + fast ingestion

Compression

• Type-specific encodings (RLE for numbers, dictionary for strings, etc.)

• Generic compression (Zlib, Snappy)

Data skipping

• Column skipping based on data layout

• Row skipping based on MIN/MAX stats and bloom filters

JIT friendly

• Vectorized APIs (retrieve data as arrays of primitive values)

▪ Engines leverage all these features

▪ Apache ORC libs allow applications to leverage them too
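Not in the deck: a minimal writer-side sketch of how an application might pick the generic compression codec with the Java library. The path, schema, and values are made up; OrcFile, Writer, TypeDescription, and CompressionKind are classes in org.apache.orc:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.orc.CompressionKind;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

Configuration conf = new Configuration();
TypeDescription schema = TypeDescription.fromString(
    "struct<id:int,quantity:int,amount:double>");

// Choose the generic compression per file: Snappy trades some size for
// speed, Zlib compresses harder. The type-specific encodings (RLE,
// dictionary) are applied automatically based on the column types.
Writer writer = OrcFile.createWriter(new Path("/tmp/sales.orc"),
    OrcFile.writerOptions(conf)
        .setSchema(schema)
        .compress(CompressionKind.SNAPPY));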


ORC Physical Data Layout

CREATE HADOOP TABLE SALES(id INTEGER,
                          quantity INTEGER,
                          amount DOUBLE)

[Diagram: the resulting ORC file layout. The file carries file stats and is divided into stripes, each sized to an HDFS block and carrying its own stripe stats. A stripe is divided into row groups of 10K rows; within a row group, the 10K id values, 10K quantity values, and 10K amount values are stored column by column, alongside row group stats.]
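A hedged sketch (mine, not the deck's) of producing this layout with the Apache ORC writer API: stripeSize and rowIndexStride map directly onto the stripe and row-group sizes above. The file name, row count, and values are invented; conf and imports follow the earlier compression sketch, plus LongColumnVector and DoubleColumnVector:

TypeDescription schema = TypeDescription.fromString(
    "struct<id:int,quantity:int,amount:double>");
Writer writer = OrcFile.createWriter(new Path("/tmp/sales.orc"),
    OrcFile.writerOptions(conf)
        .setSchema(schema)
        .stripeSize(128L * 1024 * 1024)  // one stripe ~ one HDFS block
        .rowIndexStride(10000));         // one row group = 10K rows

VectorizedRowBatch batch = schema.createRowBatch();
LongColumnVector id = (LongColumnVector) batch.cols[0];
LongColumnVector quantity = (LongColumnVector) batch.cols[1];
DoubleColumnVector amount = (DoubleColumnVector) batch.cols[2];

for (int r = 0; r < 1000000; r++) {
  int row = batch.size++;
  id.vector[row] = r;
  quantity.vector[row] = r % 3000;
  amount.vector[row] = (r % 3000) * 9.99;
  if (batch.size == batch.getMaxSize()) {
    writer.addRowBatch(batch);           // stats accumulate per row group/stripe
    batch.reset();
  }
}
if (batch.size > 0) writer.addRowBatch(batch);
writer.close();                          // file stats and footer written here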


Leveraging the Apache ORC Libraries


Dependencies and Classes

▪ Java Dependencies (org.apache.orc group id in Maven)

orc-core-1.4.0-nohive.jar

aircompressor-0.3.jar

▪ Java Classes for “vectorized” processing

import org.apache.orc.OrcFile;

import org.apache.orc.Reader;

import org.apache.orc.RecordReader;

import org.apache.orc.storage.ql.exec.vector.VectorizedRowBatch;

import org.apache.orc.storage.ql.exec.vector.DoubleColumnVector;

import org.apache.orc.storage.ql.io.sarg.SearchArgument;


Using Vectorized ORC APIs

Reader r = OrcFile.createReader(path, OrcFile.readerOptions(conf));

RecordReader rr = r.rows();

VectorizedRowBatch batch = r.getSchema().createRowBatch(1000);

[Diagram: the batch holds the decoded values column by column as primitive arrays of 1000 values each: long[1000] for ID, long[1000] for QUANTITY, double[1000] for AMOUNT.]

▪ A vectorized row batch is a Java object that contains 1000 decoded rows


JIT friendly code

// Compute sum(amount) for rows with quantity < 500
// (also import org.apache.orc.storage.ql.exec.vector.LongColumnVector)
double sum = 0;
while (rr.nextBatch(batch)) {
  long[] qty = ((LongColumnVector) batch.cols[1]).vector;
  double[] amt = ((DoubleColumnVector) batch.cols[2]).vector;
  for (int i = 0; i < batch.size; i++) {
    if (qty[i] < 500) {
      sum += amt[i];
    }
  }
}

▪ Get the total for sales involving fewer than 500 items

SELECT sum(amount)
FROM sales
WHERE quantity < 500

• No objects

• No method calls

• The tight loop compiles to machine code


Column Skipping/Pruning (a.k.a. Projection Pushdown)

▪ If we don’t say otherwise, ORC will read all the columns

▪ But our query is using only two columns

[Diagram: of the three columns ID, QUANTITY, and AMOUNT, only QUANTITY and AMOUNT are needed.]

SELECT sum(amount)

FROM sales

WHERE quantity < 500


Column Skipping/Pruning (a.k.a. Projection Pushdown)

// Projection: skip ID, read QUANTITY and AMOUNT
boolean[] projection = new boolean[] {false, true, true};

// ORC RecordReader with projection pushdown
RecordReader rr = r.rows(
    new Reader.Options()
        .include(projection));

// Compute sum(amount)

double sum = 0;

while (rr.nextBatch(batch)) { … }

SELECT sum(amount)

FROM sales

WHERE quantity < 500


Column Skipping/Pruning (a.k.a. Projection Pushdown)

create external hadoop table web_sales (
    ws_sold_date_sk int,          ws_sold_time_sk int,
    ws_ship_date_sk int,          ws_item_sk int not null,
    ws_bill_customer_sk int,      ws_bill_cdemo_sk int,
    ws_bill_hdemo_sk int,         ws_bill_addr_sk int,
    ws_ship_customer_sk int,      ws_ship_cdemo_sk int,
    ws_ship_hdemo_sk int,         ws_ship_addr_sk int,
    ws_web_page_sk int,           ws_web_site_sk int,
    ws_ship_mode_sk int,          ws_warehouse_sk int,
    ws_promo_sk int,              ws_order_number bigint not null,
    ws_quantity bigint,           ws_wholesale_cost double,
    ws_list_price double,         ws_sales_price double,
    ws_ext_discount_amt double,   ws_ext_sales_price double,
    ws_ext_wholesale_cost double, ws_ext_list_price double,
    ws_ext_tax double,            ws_coupon_amt double,
    ws_ext_ship_cost double,      ws_net_paid double,
    ws_net_paid_inc_tax double,   ws_net_paid_inc_ship double,
    ws_net_paid_inc_ship_tax double,
    ws_net_profit double
)
STORED AS ORC;

[Chart: SELECT max(ws_sold_date_sk) FROM tpcds_1tb.web_sales, elapsed time in seconds: 3.5 with projection pushdown vs. 24.9 without.]


Row Skipping/Pruning (a.k.a. Predicate Pushdown)

▪ Row skipping leverages the MIN/MAX stats

[Diagram: MIN/MAX stats on quantity. Stripes: MIN 274 / MAX 590 and MIN 603 / MAX 3000. Row groups within the first stripe: MIN 510 / MAX 540 and MIN 330 / MAX 420.]

For the predicate quantity < 500, the reader skips the stripe with MIN 603 outright (no row in it can match) and, within the remaining stripe, skips the row group with MIN 510; only the row group with MIN 330 / MAX 420 is actually read.

SELECT sum(amount)
FROM sales
WHERE quantity < 500

• Data must be sorted on the filtering columns for pruning to be effective!


Row Skipping/Pruning (a.k.a. Predicate Pushdown)

// Predicate: quantity < 500
// (imports: SearchArgumentFactory and PredicateLeaf from
// org.apache.orc.storage.ql.io.sarg)
SearchArgument selection = SearchArgumentFactory.newBuilder()
    .lessThan("quantity", PredicateLeaf.Type.LONG, 500L).build();

// ORC RecordReader with projection and selection pushdown
RecordReader rr = r.rows(
    new Reader.Options()
        .include(projection)
        .searchArgument(selection, new String[] {}));

// Compute sum(amount)
double sum = 0;
while (rr.nextBatch(batch)) { ... }

SELECT sum(amount)

FROM sales

WHERE quantity < 500
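Putting projection and selection together, here is a hedged, self-contained sketch of the whole scan (mine, not the deck's). Two cautions: depending on the library version, the SearchArgument builder may require an explicit startAnd()/end() around the leaf (used here), and pushdown only skips whole stripes and row groups, so the predicate is still re-applied per row:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;
import org.apache.orc.RecordReader;
import org.apache.orc.storage.ql.exec.vector.DoubleColumnVector;
import org.apache.orc.storage.ql.exec.vector.LongColumnVector;
import org.apache.orc.storage.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.storage.ql.io.sarg.PredicateLeaf;
import org.apache.orc.storage.ql.io.sarg.SearchArgument;
import org.apache.orc.storage.ql.io.sarg.SearchArgumentFactory;

public class SumAmounts {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Reader r = OrcFile.createReader(new Path("/tmp/sales.orc"),
        OrcFile.readerOptions(conf));

    boolean[] projection = new boolean[] {false, true, true};
    SearchArgument selection = SearchArgumentFactory.newBuilder()
        .startAnd()
        .lessThan("quantity", PredicateLeaf.Type.LONG, 500L)
        .end()
        .build();

    RecordReader rr = r.rows(new Reader.Options()
        .include(projection)
        .searchArgument(selection, new String[] {"quantity"}));

    VectorizedRowBatch batch = r.getSchema().createRowBatch();
    double sum = 0;
    while (rr.nextBatch(batch)) {
      long[] qty = ((LongColumnVector) batch.cols[1]).vector;
      double[] amt = ((DoubleColumnVector) batch.cols[2]).vector;
      for (int i = 0; i < batch.size; i++) {
        if (qty[i] < 500) {              // re-check: pushdown is coarse-grained
          sum += amt[i];
        }
      }
    }
    rr.close();
    System.out.println("sum(amount) = " + sum);
  }
}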


Scaling Up


Scaling Up

▪ Reading too many files at once hurts performance and can cause OOM

▪ How many is too many? It depends on:

Number of disks

Java heap size

▪ Must have multiple disks, AND the files must be evenly distributed across the disks

▪ But the number of threads must be limited by the Java heap size!

Need 30 to 80 MB of Java heap per open ORC file/split (see the gate sketch below)

Biggest consumers

• RecordReader

• VectorizedRowBatch
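One simple way an engine can turn this heap budget into a hard limit is an admission gate on reader opens. A minimal sketch, assuming roughly 50 MB per open reader (mid-range of the 30 to 80 MB estimate above); the class and its use are invented for illustration:

import java.util.concurrent.Semaphore;

// Hypothetical gate: cap concurrently open ORC readers by a heap budget.
public class OrcReaderGate {
  private static final long BYTES_PER_OPEN_READER = 50L * 1024 * 1024; // ~50 MB

  private final Semaphore permits;

  public OrcReaderGate(long heapBudgetBytes) {
    this.permits = new Semaphore(
        (int) Math.max(1, heapBudgetBytes / BYTES_PER_OPEN_READER));
  }

  public void beforeOpen() throws InterruptedException {
    permits.acquire();                   // block (degrade gracefully), don't OOM
  }

  public void afterClose() {
    permits.release();
  }
}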


Scaling Up In an Engine

▪ An engine must handle concurrency (multiple queries/users)

▪ Fixed # of open files per scan leads to OOMs

▪ Must gracefully degrade performance instead of OOM

▪ Need to limit total open files

▪ Multiple issues to deal with

Starvation (all queries must make reasonable progress)

Stragglers (work for a scan must be evenly balanced across nodes)

Adapting parallelism to concurrency (see the scheduler sketch below)

• A single query must take full advantage of the available resources

• As concurrency increases, parallelism decreases for ongoing scans

• When concurrency decreases, parallelism per scan increases
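The adapting-parallelism point above can be sketched as a tiny scheduler that divides a fixed thread budget among the active scans. This illustrates the policy only; it is not Big SQL's actual resource manager:

// Hypothetical policy: split a fixed thread budget evenly among active scans,
// so per-scan parallelism shrinks as concurrency grows and recovers as it drops.
public class ScanScheduler {
  private final int totalThreads;
  private int activeScans = 0;

  public ScanScheduler(int totalThreads) {
    this.totalThreads = totalThreads;
  }

  public synchronized int registerScan() {
    activeScans++;
    return threadsPerScan();             // parallelism granted to the new scan
  }

  public synchronized void unregisterScan() {
    activeScans--;
  }

  // Every scan gets at least one thread, so no query starves.
  public synchronized int threadsPerScan() {
    return Math.max(1, totalThreads / Math.max(1, activeScans));
  }
}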


Big SQL Tuning for ORC

• The “fundamentals” (regardless of the data format)

Ensure Big SQL has enough resources

• % memory

• % CPU

• Temp storage spread across multiple disks (e.g. the same disks as HDFS)

Use partitioning

Run ANALYZE on all your tables (or ensure AUTO ANALYZE is enabled)

▪ ORC-specific tuning (Big SQL 5.0.1 and up)

The property bigsql.java.orc.preftp.size controls the max number of open ORC files

▪ ORC file creation (see the writer sketch below)

Sort the data by the filtering columns

Tune the stripe and row group sizes

Use bloom filters

▪ For more Big SQL tuning tips, see

https://developer.ibm.com/hadoop/2016/11/16/top-6-big-sql-v4-2-performance-tips/
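The file-creation knobs above map onto writer options in the Java library. A hedged sketch (values are illustrative; conf and schema are assumed from the earlier writer sketches, and bloomFilterColumns/bloomFilterFpp are the org.apache.orc option names to the best of my knowledge):

Writer writer = OrcFile.createWriter(new Path("/tmp/sales_sorted.orc"),
    OrcFile.writerOptions(conf)
        .setSchema(schema)
        .stripeSize(256L * 1024 * 1024)  // bigger stripes: fewer, larger reads
        .rowIndexStride(10000)           // row group size for MIN/MAX stats
        .bloomFilterColumns("quantity")  // bloom filter on the filter column
        .bloomFilterFpp(0.05));          // acceptable false-positive rate
// Feed the rows sorted by the filtering column (e.g. quantity) so MIN/MAX
// ranges stay narrow and row-group pruning can actually skip data.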


Summary

▪ ORC = Storage Efficiency + Fast Ingestion

▪ Fast ingestion using Vectorized APIs in Apache ORC

▪ Big SQL performs best with data in ORC format!


Questions?


Thank you!