ORC File – Optimizing Your Big Data
Owen O'Malley, Co-founder, Hortonworks
Apache Hadoop, Hive, ORC, and Incubator
@owen_omalley
Overview
In the Beginning…
Hadoop applications used text or SequenceFile
– Text is slow, and it is not splittable when compressed
– SequenceFile only supports a key and a value, with user-defined serialization
Hive added RCFile
– User controls the columns to read and decompress
– No type information, and user-defined serialization
– Finding splits was expensive
Avro files created
– Type information included!
– Had to read and decompress the entire row
ORC File Basics
Columnar format
– Enables the reader to read and decompress just the bytes it needs
Fast
– See https://www.slideshare.net/HadoopSummit/file-format-benchmark-avro-json-orc-parquet
Indexed
Self-describing
– Includes all of the information about types and encodings
Rich type system
– All of Hive's types, including timestamp, struct, map, list, and union (see the schema sketch below)
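
A quick sketch of that type system using the Java library (assuming the org.apache.orc:orc-core artifact; the field names are illustrative) – nested types compose freely:

  import org.apache.orc.TypeDescription;

  public class SchemaSketch {
    public static void main(String[] args) {
      // struct, list, map, and timestamp composed into one self-describing type
      TypeDescription schema = TypeDescription.fromString(
          "struct<name:string,"
          + "addresses:array<struct<street:string,city:string>>,"
          + "attributes:map<string,string>,"
          + "created:timestamp>");
      System.out.println(schema);                 // prints the struct<...> form
      System.out.println(schema.getFieldNames()); // [name, addresses, attributes, created]
    }
  }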
File Compatibility
Backwards compatibility
– Automatically detects the version of the file and reads it
Forward compatibility
– Most changes are made so that old readers can read the new files
– Maintain the ability to write old files via orc.write.format
– Always write the old version until your last cluster upgrades (see the sketch below)
Current file versions
– 0.11 – original version
– 0.12 – updated run length encoding (RLE)
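
A hedged sketch of pinning the file version from Java (assuming orc-core 1.4; in Hive you would set orc.write.format instead, and the file path is illustrative):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.orc.OrcFile;
  import org.apache.orc.TypeDescription;
  import org.apache.orc.Writer;

  public class WriteOldFormat {
    public static void main(String[] args) throws Exception {
      // Keep emitting 0.11 files until the last cluster is upgraded.
      Writer writer = OrcFile.createWriter(new Path("/tmp/old_format.orc"),
          OrcFile.writerOptions(new Configuration())
              .setSchema(TypeDescription.fromString("struct<name:string>"))
              .version(OrcFile.Version.V_0_11));
      writer.close();
    }
  }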
File Structure
File contains a list of stripes, which are sets of rows
– Default size is 64 MB
– Large stripe size enables efficient reads
Footer
– Contains the list of stripe locations
– Type description
– File and stripe statistics
Postscript
– Compression parameters
– File format version
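
A sketch of walking that structure with the Java reader (assuming orc-core 1.4; the file path is illustrative):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.orc.OrcFile;
  import org.apache.orc.Reader;
  import org.apache.orc.StripeInformation;

  public class InspectFile {
    public static void main(String[] args) throws Exception {
      Reader reader = OrcFile.createReader(new Path("/tmp/my_table.orc"),
          OrcFile.readerOptions(new Configuration()));
      // The footer carries the schema, row count, and stripe locations;
      // the compression codec comes from the postscript.
      System.out.println("schema: " + reader.getSchema());
      System.out.println("rows:   " + reader.getNumberOfRows());
      System.out.println("codec:  " + reader.getCompressionKind());
      for (StripeInformation stripe : reader.getStripes()) {
        System.out.println("stripe @" + stripe.getOffset()
            + " rows=" + stripe.getNumberOfRows());
      }
    }
  }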
Stripe Structure
Indexes
– Offsets to jump to the start of each row group
– Row group size defaults to 10,000 rows
– Minimum, maximum, and count of each column
Data
– Data for the stripe, organized by column
Footer
– List of stream locations
– Column encoding information
File Layout
[Figure: File Layout – the file is a sequence of ~64 MB stripes. Each stripe contains Index Data, Row Data laid out column by column (Column 1 … Column 8), and a Stripe Footer. File Metadata, the File Footer, and the Postscript sit at the end of the file.]
Schema Evolution
ORC now supports schema evolution
– Hive 2.1 – append columns or type conversion
– Upcoming Hive 2.3 – map columns or inner structures by name
– User passes the desired schema to the ORC reader (see the sketch below)
Type conversions
– Most types will convert, although some are ugly
– If the value doesn't fit in the new type, it becomes null
Cautions
– Name mapping requires ORC files written by Hive ≥ 2.0
– Some of the type conversions are slow
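
A sketch of passing the desired schema to the reader (assuming orc-core 1.4; the file and column names are hypothetical):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.orc.OrcFile;
  import org.apache.orc.Reader;
  import org.apache.orc.RecordReader;
  import org.apache.orc.TypeDescription;

  public class EvolvedRead {
    public static void main(String[] args) throws Exception {
      // Suppose the file was written as struct<name:string,age:int>; we read
      // it with an appended column and a widened type. The appended column
      // comes back as null for existing rows.
      TypeDescription desired = TypeDescription.fromString(
          "struct<name:string,age:bigint,address:string>");
      Reader reader = OrcFile.createReader(new Path("/tmp/people.orc"),
          OrcFile.readerOptions(new Configuration()));
      RecordReader rows = reader.rows(reader.options().schema(desired));
      rows.close();
    }
  }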
Using ORC
From Hive or Presto
Modify your table definition:
– create table my_table (
    name string,
    address string
  ) stored as orc;
Import data:
– insert overwrite table my_table select * from my_staging;
Use either configuration or table properties
– tblproperties ("orc.compress"="NONE")
– set hive.exec.orc.default.compress=NONE;
From Java
Use the ORC project rather than Hive's ORC
– Hive's master branch uses it
– Maven group id: org.apache.orc, version: 1.4.0
– The nohive classifier avoids interfering with Hive's packages
Two levels of access (a write sketch using orc-core follows this list)
– orc-core – faster access, but uses Hive's vectorized API
– orc-mapreduce – row-by-row access, simpler OrcStruct API
MapReduce API implements WritableComparable
– Can be shuffled
– Need to specify type information in the configuration for shuffle or output
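
A minimal orc-core write sketch using the vectorized API (assuming org.apache.orc:orc-core:1.4.0, where the vector classes come from the hive-storage-api dependency; the file name and values are illustrative):

  import java.nio.charset.StandardCharsets;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
  import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
  import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
  import org.apache.orc.OrcFile;
  import org.apache.orc.TypeDescription;
  import org.apache.orc.Writer;

  public class VectorizedWrite {
    public static void main(String[] args) throws Exception {
      TypeDescription schema =
          TypeDescription.fromString("struct<id:bigint,name:string>");
      Writer writer = OrcFile.createWriter(new Path("/tmp/demo.orc"),
          OrcFile.writerOptions(new Configuration()).setSchema(schema));
      VectorizedRowBatch batch = schema.createRowBatch();
      LongColumnVector id = (LongColumnVector) batch.cols[0];
      BytesColumnVector name = (BytesColumnVector) batch.cols[1];
      for (int r = 0; r < 10000; ++r) {
        int row = batch.size++;
        id.vector[row] = r;
        name.setVal(row, ("row-" + r).getBytes(StandardCharsets.UTF_8));
        if (batch.size == batch.getMaxSize()) { // flush a full batch
          writer.addRowBatch(batch);
          batch.reset();
        }
      }
      if (batch.size != 0) {
        writer.addRowBatch(batch);
      }
      writer.close();
    }
  }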
From C++
Pure C++ client library
– No JNI or JDK, so the client can estimate and control memory
Combine it with the pure C++ HDFS client from HDFS-8707
– Work is ongoing in a feature branch, but should be committed soon
The reader is stable and in production use
Alibaba has created a writer and is contributing it to Apache ORC
– It should be in the next release, ORC 1.5.0
Command Line
Using hive --orcfiledump from Hive
– -j -p – pretty-prints the metadata as JSON
– -d – prints the data as JSON
Using java -jar orc-tools-1.4.0-uber.jar from ORC
– meta – print the metadata as JSON
– data – print the data as JSON
– convert – convert JSON to ORC
– json-schema – scan a set of JSON documents to find the matching schema
Optimization
Stripe Size
Makes a huge difference in performance (see the sketch below)
– orc.stripe.size or hive.exec.orc.default.stripe.size
– Controls the amount of buffer in the writer; the default is 64 MB
– Trade-off:
  • Large stripes = larger, more efficient reads
  • Small stripes = less memory and more granular processing splits
Multiple files written at the same time will shrink stripes
– Use Hive's hive.optimize.sort.dynamic.partition
– Sorting dynamic partitions means one writer at a time
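
A sketch of setting the stripe size on the Java writer (orc-core 1.4; the 256 MB value is only an example):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.orc.OrcFile;
  import org.apache.orc.TypeDescription;
  import org.apache.orc.Writer;

  public class StripeSizeSketch {
    public static void main(String[] args) throws Exception {
      // Larger stripes buffer more rows in memory before flushing,
      // trading writer memory for more efficient reads.
      Writer writer = OrcFile.createWriter(new Path("/tmp/big_stripes.orc"),
          OrcFile.writerOptions(new Configuration())
              .setSchema(TypeDescription.fromString("struct<id:bigint>"))
              .stripeSize(256L * 1024 * 1024));
      writer.close();
    }
  }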
HDFS Block Padding
The stripes don't align exactly with HDFS blocks
HDFS scatters blocks around the cluster
Often you want to pad to block boundaries (see the sketch after the figure)
– Costs space, but improves performance
– hive.exec.orc.default.block.padding – true
– hive.exec.orc.block.padding.tolerance – 0.05
[Figure: ~64 MB stripes (Index Data, Row Data, Stripe Footer) laid out inside HDFS blocks, with padding inserted so a stripe never straddles a block boundary; File Metadata, the File Footer, and the Postscript follow the last stripe.]
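
The equivalent writer options from Java (a sketch against orc-core 1.4; the values mirror the Hive defaults above):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.orc.OrcFile;
  import org.apache.orc.TypeDescription;
  import org.apache.orc.Writer;

  public class BlockPaddingSketch {
    public static void main(String[] args) throws Exception {
      // Pad stripes so they never straddle an HDFS block boundary,
      // wasting at most 5% of each block on filler.
      Writer writer = OrcFile.createWriter(new Path("/tmp/padded.orc"),
          OrcFile.writerOptions(new Configuration())
              .setSchema(TypeDescription.fromString("struct<id:bigint>"))
              .blockPadding(true)
              .paddingTolerance(0.05));
      writer.close();
    }
  }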
Predicate Push Down
The reader is given a SearchArg (see the sketch below)
– A limited set of predicates over columns and literal values
– The reader will skip over any parts of the file that can't contain valid rows
ORC indexes at three levels:
– File
– Stripe
– Row group (10k rows)
The reader still needs to apply the predicate to filter out single rows
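
A sketch of building and applying a SearchArgument in Java (assuming the standard orc-core 1.4 artifact, where the sarg classes come from hive-storage-api; the column name is illustrative):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.hive.ql.io.sarg.PredicateLeaf;
  import org.apache.hadoop.hive.ql.io.sarg.SearchArgument;
  import org.apache.hadoop.hive.ql.io.sarg.SearchArgumentFactory;
  import org.apache.orc.OrcFile;
  import org.apache.orc.Reader;
  import org.apache.orc.RecordReader;

  public class PushDownSketch {
    public static void main(String[] args) throws Exception {
      // Skip any file, stripe, or row group whose statistics prove that
      // l_orderkey can never equal the literal.
      SearchArgument sarg = SearchArgumentFactory.newBuilder()
          .equals("l_orderkey", PredicateLeaf.Type.LONG, 1212000001L)
          .build();
      Reader reader = OrcFile.createReader(new Path("/tmp/lineitem.orc"),
          OrcFile.readerOptions(new Configuration()));
      RecordReader rows = reader.rows(reader.options()
          .searchArgument(sarg, new String[]{"l_orderkey"}));
      rows.close();
    }
  }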
Row Pruning
Every primitive column has a minimum and maximum at each level
– Sorting your data within a file helps a lot
– Consider sorting instead of making lots of partitions
The writer can optionally include bloom filters (see the sketch below)
– Provides a probabilistic bitmap of hash codes
– Only works with equality predicates, at the row-group level
– Requires significant space in the file
– Manually enabled by using orc.bloom.filter.columns
– Use orc.bloom.filter.fpp to set the false positive rate (default 0.05)
– Set the default charset in the JVM via -Dfile.encoding=UTF-8
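
A sketch of enabling a bloom filter from the Java writer (orc-core 1.4; the column choice is illustrative):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.orc.OrcFile;
  import org.apache.orc.TypeDescription;
  import org.apache.orc.Writer;

  public class BloomFilterSketch {
    public static void main(String[] args) throws Exception {
      // Spend file space on a bloom filter over l_orderkey so equality
      // predicates can prune row groups that min/max alone cannot.
      Writer writer = OrcFile.createWriter(new Path("/tmp/lineitem.orc"),
          OrcFile.writerOptions(new Configuration())
              .setSchema(TypeDescription.fromString("struct<l_orderkey:bigint>"))
              .bloomFilterColumns("l_orderkey")
              .bloomFilterFpp(0.05));
      writer.close();
    }
  }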
Row Pruning Example
TPC-H query
– from tpch1000.lineitem where l_orderkey = 1212000001;
Rows read
– Nothing – 5,999,989,709
– Min/Max – 540,000
– BloomFilter – 10,000
Time taken
– Nothing – 74 sec
– Min/Max – 4.5 sec
– BloomFilter – 1.3 sec
Split Calculation
Hive's OrcInputFormat has three strategies for split calculation (see the sketch after this list)
– BI
  • Small, fast queries
  • Splits based on HDFS blocks
– ETL
  • Large queries
  • Reads the file footer and applies the SearchArg to stripes
  • Can include the footer in the splits (hive.orc.splits.include.file.footer)
– Hybrid
  • If there are small files or lots of files, use BI
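
A hedged sketch of selecting a strategy through configuration (this assumes the Hive property hive.exec.orc.split.strategy, which accepts BI, ETL, or HYBRID):

  import org.apache.hadoop.conf.Configuration;

  public class SplitStrategySketch {
    public static void main(String[] args) {
      Configuration conf = new Configuration();
      // Favor cheap, block-based splits for short interactive queries.
      conf.set("hive.exec.orc.split.strategy", "BI");
      // For big scans, pay the footer read up front and prune stripes:
      // conf.set("hive.exec.orc.split.strategy", "ETL");
    }
  }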
LLAP – Live Long & Process
Provides a persistent service to speed up Hive
– Caches ORC and text data
– Saves the cost of YARN container and JVM spin-up
– JIT finishes after the first few seconds
Cache uses ORC's RLE
– Decompresses zlib or Snappy
– RLE is fast and saves memory
– Automatically caches hot columns and partitions
Allows Spark to use Hive's column and row security
Current Work In Progress
Speed Improvements for ACID
Hive supports ACID transactions on ORC tables
– Uses delta files in HDFS to store changes to each partition
– Delta files store insert/update/delete operations
– Used to support SQL insert commands
Unfortunately, update operations don't allow predicate push down on the deltas
In the upcoming Hive 2.3, we added a new ACID layout
– It changes an update into an insert and a delete
– Allows predicate push down even on the delta files
Also added the SQL merge command in Hive 2.2
Column Encryption (ORC-14)
Allows users to encrypt some of the columns of the file
– Provides column-level security even with access to the raw files
– Uses the Key Management Server from Ranger or Hadoop
– Includes both the data and the index
– Daily key rolling can anonymize data after 90 days
User specifies how the data is masked if the reader doesn't have access
– Nullify
– Redact
– SHA-256
Thank You
@owen_omalley
owen@hortonworks.com