Interactive Query in Hadoop
DESCRIPTION
Hive 13 & Tez providing human-interactive query across petabytes of data.
TRANSCRIPT
Page 1 © Hortonworks Inc. 2014
Interactive Query In Hadoop
Rommel Garcia
Solutions Engineer
May 3, 2014
Hortonworks. We do Hadoop.
Hadoop 2
Multi-Use Data Platform: Batch, Interactive, Online, Streaming, …
HADOOP 2
• Redundant, Reliable Storage (HDFS)
• Efficient Cluster Resource Management & Shared Services (YARN)
• Standard Query Processing: Hive, Pig
• Batch: MapReduce
• Online Data Processing: HBase, Accumulo
• Interactive: Tez
• Real-Time Stream Processing: Storm
• Others: …
The Interactive Query Tech Stack
• Hive: SQL
• Tez: DAG
• YARN: Resource
• HDFS: Storage
Hive
Open source project that
• facilitates querying (SQL compliant) of data residing in distributed storage such as HDFS
• projects structure onto that data
Hive SQL Compliance
Hive Performance
Feature | Description | Benefit
Tez Integration | Tez is a significantly better engine than MapReduce. | Latency
Vectorized Query | Takes advantage of modern hardware by processing thousand-row blocks rather than row-at-a-time. | Throughput
Query Planner | Uses extensive statistics now available in the Metastore to better plan and optimize queries, including predicate pushdown during compilation to eliminate portions of input (beyond partition pruning). | Latency
ORC File | Columnar, type-aware format with indices. | Latency
Cost Based Optimizer (Optiq) | Join re-ordering and other optimizations based on column statistics, including histograms etc. | Latency
Vectorization Using Modern CPU
[Diagram: a block of 10K rows fed to the CPU]
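The idea behind vectorized execution can be sketched in plain Java. This is not Hive's implementation: the class, the column layout, and the batch size of 1,024 rows are assumptions chosen for illustration. The sketch contrasts a row-at-a-time filter-and-sum with one that works on batches of primitive column arrays, which keeps the inner loops tight and avoids per-row object overhead.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: contrasts row-at-a-time and batch-at-a-time evaluation. */
class VectorizationSketch {
    static final int BATCH_SIZE = 1024; // assumed batch size for this sketch

    /** Hypothetical row object, as a row-at-a-time engine might see it. */
    static final class Row {
        final int state;
        final double price;
        Row(int state, double price) { this.state = state; this.price = price; }
    }

    /** Row-at-a-time: one object and one round of per-row overhead per record. */
    static double sumPricesRowAtATime(List<Row> rows, int wantedState) {
        double sum = 0;
        for (Row r : rows) {
            if (r.state == wantedState) sum += r.price;
        }
        return sum;
    }

    /** Batch-at-a-time: columns are primitive arrays processed in tight loops. */
    static double sumPricesVectorized(int[] state, double[] price, int wantedState) {
        double sum = 0;
        int[] selected = new int[BATCH_SIZE]; // indices of rows passing the filter
        for (int start = 0; start < state.length; start += BATCH_SIZE) {
            int end = Math.min(start + BATCH_SIZE, state.length);
            int n = 0;
            for (int i = start; i < end; i++) {   // filter over one column
                if (state[i] == wantedState) selected[n++] = i;
            }
            for (int j = 0; j < n; j++) {         // aggregate over another column
                sum += price[selected[j]];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        int n = 10_000;
        int[] state = new int[n];
        double[] price = new double[n];
        List<Row> rows = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            state[i] = i % 50;
            price[i] = 1.0;
            rows.add(new Row(state[i], price[i]));
        }
        // Both approaches compute the same result; only the access pattern differs.
        System.out.println(sumPricesRowAtATime(rows, 7));
        System.out.println(sumPricesVectorized(state, price, 7));
    }
}
```

Both methods return the same sum. The point of the batch version is the access pattern: each inner loop touches a single primitive array, which is the property a vectorized engine exploits.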
Hive Optimizations
• Pre-warmed Containers (Hive Query Server)
• Low-latency Dispatch (Hive Query Server)
• DAG utilization (Tez)
• Buffer Caching (cache accessed data)
• Predicate Pushdown
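Predicate pushdown, the last item above, can be illustrated with a minimal sketch. It assumes that each block of a column carries min/max statistics, so a block whose value range cannot contain the predicate's target is skipped without being read. The class names and block layout are hypothetical, and this is not Hive's or ORC's actual code.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: skipping data blocks using per-block min/max statistics. */
class PushdownSketch {
    /** Hypothetical block of one integer column plus its min/max statistics. */
    static final class Block {
        final int[] values;
        final int min;
        final int max;
        Block(int[] values) {
            this.values = values;
            int lo = Integer.MAX_VALUE, hi = Integer.MIN_VALUE;
            for (int v : values) { lo = Math.min(lo, v); hi = Math.max(hi, v); }
            this.min = lo;
            this.max = hi;
        }
    }

    static int blocksRead = 0; // counts blocks actually scanned by the last call

    /** Returns values equal to target, reading only blocks whose range can contain it. */
    static List<Integer> selectEquals(List<Block> blocks, int target) {
        blocksRead = 0;
        List<Integer> out = new ArrayList<>();
        for (Block b : blocks) {
            if (target < b.min || target > b.max) continue; // pruned via statistics
            blocksRead++;
            for (int v : b.values) {
                if (v == target) out.add(v);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Block> blocks = new ArrayList<>();
        blocks.add(new Block(new int[] {1, 2, 3}));
        blocks.add(new Block(new int[] {10, 11, 12}));
        blocks.add(new Block(new int[] {20, 21, 22}));
        System.out.println(selectEquals(blocks, 11) + " after reading " + blocksRead + " block(s)");
    }
}
```

In this example only the middle block is scanned. The other two are eliminated from their statistics alone, which is where the latency benefit comes from.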
Hive - ORCFile
Tez
Tez – Introduction
• Distributed execution framework targeted towards data-processing applications.
• Express computation as a dataflow graph.
• Flexible Input-Processor-Output runtime model
• Extensively use caching
• Data type agnostic
• Built on top of YARN
• Apache licensed.
Hive On Tez - Execution
Feature | Description | Benefit
Tez Session | Overcomes Map-Reduce job-launch latency by pre-launching Tez AppMaster. | Latency
Tez Container Pre-Launch | Overcomes Map-Reduce latency by pre-launching hot containers ready to serve queries. | Latency
Tez Container Re-Use | Finished maps and reduces pick up more work rather than exiting. Reduces latency and eliminates difficult split-size tuning. Out of box performance! | Latency
Runtime re-configuration of DAG | Runtime query tuning by picking aggregation parallelism using online query statistics. | Throughput
Tez In-Memory Cache | Hot data kept in RAM for fast access. | Latency
Complex DAGs | Tez Broadcast Edge and Map-Reduce-Reduce pattern improve query scale and throughput. | Throughput
Comparing Tez vs. MR – running queries in Hive
SELECT a.state, COUNT(*), AVG(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state
• To express the above query in MapReduce, Hive needs to compose and execute four separate MR jobs.
• Each MR job incurs job start-up cost and disk I/O, because results are written to disk and re-read between MR jobs. This takes too long!
Comparing Tez vs. MR – running queries in Hive
SELECT a.state, COUNT(*), AVG(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state
• Using the Tez framework, this query can be expressed as a single execution graph.
• No wasted I/O. Each node in the graph streams results to the next node.
• No wasted job start-up. Tez provides “hot containers” so that jobs can be submitted immediately.
Tez – Deep Dive – API
// Simplified slide pseudocode: constructor arguments are abbreviated.
DAG dag = new DAG();
// One vertex per processing stage
Vertex map1 = new Vertex(MapProcessor.class);
Vertex map2 = new Vertex(MapProcessor.class);
Vertex reduce1 = new Vertex(ReduceProcessor.class);
Vertex reduce2 = new Vertex(ReduceProcessor.class);
Vertex join1 = new Vertex(JoinProcessor.class);
…….
// Each edge connects a producer vertex to a consumer vertex
Edge edge1 = new Edge(map1, reduce1, SCATTER_GATHER, PERSISTED, SEQUENTIAL, MOutput.class, RInput.class);
Edge edge2 = new Edge(map2, reduce2, SCATTER_GATHER, PERSISTED, SEQUENTIAL, MOutput.class, RInput.class);
Edge edge3 = new Edge(reduce1, join1, SCATTER_GATHER, PERSISTED, SEQUENTIAL, MOutput.class, RInput.class);
Edge edge4 = new Edge(reduce2, join1, SCATTER_GATHER, PERSISTED, SEQUENTIAL, MOutput.class, RInput.class);
…….
dag.addVertex(map1).addVertex(map2)
   .addVertex(reduce1).addVertex(reduce2)
   .addVertex(join1)
   .addEdge(edge1).addEdge(edge2)
   .addEdge(edge3).addEdge(edge4);
[Diagram: map1 → reduce1 → join1 and map2 → reduce2 → join1; edges labeled Scatter_Gather, Bipartite, Sequential]
Simple DAG definition API
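To make the shape of this DAG concrete without depending on the Tez library, the sketch below models the five vertices and four edges from the slide as plain Java collections and derives one valid execution order with Kahn's topological sort. The class and method names are hypothetical, and this is not how Tez schedules work internally. It only shows that a vertex becomes runnable once all of its upstream vertices have finished.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: a tiny DAG model, not the Tez API. */
class DagSketch {
    private final Map<String, List<String>> downstream = new LinkedHashMap<>();
    private final Map<String, Integer> inDegree = new LinkedHashMap<>();

    DagSketch addVertex(String name) {
        downstream.putIfAbsent(name, new ArrayList<>());
        inDegree.putIfAbsent(name, 0);
        return this;
    }

    DagSketch addEdge(String from, String to) {
        addVertex(from).addVertex(to);
        downstream.get(from).add(to);
        inDegree.merge(to, 1, Integer::sum);
        return this;
    }

    /** Kahn's algorithm: repeatedly emit vertices whose inputs are all done. */
    List<String> executionOrder() {
        Map<String, Integer> remaining = new LinkedHashMap<>(inDegree);
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : remaining.entrySet()) {
            if (e.getValue() == 0) ready.add(e.getKey());
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String v = ready.remove();
            order.add(v);
            for (String next : downstream.get(v)) {
                if (remaining.merge(next, -1, Integer::sum) == 0) ready.add(next);
            }
        }
        if (order.size() != downstream.size()) {
            throw new IllegalStateException("graph contains a cycle");
        }
        return order;
    }

    /** The five-vertex graph from the slide. */
    static DagSketch slideDag() {
        return new DagSketch()
            .addEdge("map1", "reduce1")
            .addEdge("map2", "reduce2")
            .addEdge("reduce1", "join1")
            .addEdge("reduce2", "join1");
    }

    public static void main(String[] args) {
        System.out.println(slideDag().executionOrder());
    }
}
```

Running `main` prints `[map1, map2, reduce1, reduce2, join1]`: both map vertices are ready immediately, and `join1` has to wait for both reduce vertices.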
Demo
Hive 13 + Tez
Multi-Tenancy with HiveServer2
Resource contention may arise when multiple users run very large queries simultaneously, which increases overall query latency. Apply these controls to mitigate it:
• Container re-use timeout
• Tez split wave tuning
• Round-robin queuing setup
Tez - Waves
tez.am.grouping.split-waves=3.0
[Diagram: a queue backed by 5 containers (C.1 to C.5); Tez schedules 15 tasks (T.1 to T.5 shown) onto them]
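The arithmetic behind the wave setting can be sketched as follows. This is a simplification, not Tez's actual split-grouping code: it assumes the target number of tasks is the number of available containers multiplied by the wave factor, and the real grouping logic considers more than this. The class and method names are hypothetical.

```java
/** Illustrative only: how a wave factor relates containers to task counts. */
class WaveSketch {
    /** Assumed relationship: desired tasks = available containers x wave factor. */
    static int targetTasks(int availableContainers, double splitWaves) {
        return (int) Math.round(availableContainers * splitWaves);
    }

    /** Number of rounds needed to run all tasks on the given containers. */
    static int roundsNeeded(int tasks, int containers) {
        return (tasks + containers - 1) / containers; // ceiling division
    }

    public static void main(String[] args) {
        int containers = 5;       // C.1 .. C.5 on the slide
        double splitWaves = 3.0;  // tez.am.grouping.split-waves=3.0
        int tasks = targetTasks(containers, splitWaves);
        System.out.println(tasks + " tasks in " + roundsNeeded(tasks, containers) + " waves");
    }
}
```

With the slide's values this yields 15 tasks, which five containers work through in three rounds.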
Thank You!
Rommel Garcia
Hortonworks
@rommelgarcia