Munich HUG 21.11.2013
© Hortonworks Inc. 2013 - Confidential
Hortonworks: We Do Hadoop.
Our mission is to enable your Modern Data Architecture by delivering One Enterprise Hadoop
November 2013
Agenda
• Hortonworks Overview of Tez
– Quick and painless
• A driver for Tez: The Stinger Initiative
• Tez Deep Dive
• Demo
A Brief History of Apache Hadoop
[Timeline: 2004, 2006, 2008, 2010, 2012, 2013]
• Focus on INNOVATION. 2005: Hadoop created at Yahoo!
• Focus on OPERATIONS. 2008: Yahoo team extends focus to operations to support multiple projects & growing clusters
• STABILITY. 2011: Hortonworks created to focus on “Enterprise Hadoop”. Starts with 24 key Hadoop engineers from Yahoo
Milestones shown: Apache Project Established; Yahoo! begins to operate at scale; Enterprise Hadoop; Hortonworks Data Platform
Our Mission:
Enable your Modern Data Architecture by delivering One Enterprise Hadoop
Our Commitment
Innovate in the Open: We employ the core architects and operators of Hadoop and drive innovation through open source Apache Foundation projects to avoid vendor lock-in
Certify for the Enterprise: We engineer, test and certify the Hortonworks Data Platform for enterprise usage and deliver the highest quality of support
Interoperate with the Ecosystem: We work with partners to deeply integrate Hadoop with key technologies so you can leverage existing skills and investments
Headquarters: Palo Alto, CA
Employees: 240+ and growing
Customers: 120+ and growing
Investors: Benchmark, Index, Yahoo, Dragoneer, Tenaya
Trusted Partners with: [partner logos not preserved]
Goal: Interoperable and Familiar
[Diagram: modern data architecture with three layers: APPLICATIONS, DATA SYSTEMS, SOURCES]
Labels shown: BusinessObjects BI; RDBMS, EDW, MPP, HANA; Existing Sources (CRM, ERP, Clickstream, Logs); Emerging Sources (Sensor, Sentiment, Geo, Unstructured); OPERATIONAL TOOLS; DEV & DATA TOOLS; INFRASTRUCTURE
Betting on Hortonworks…
Teradata Portfolio for Hadoop
• Seamless data access between Teradata and Hadoop (SQL-H)
• Simple management & monitoring with Viewpoint integration
• Flexible deployment options
HDInsight & HDP for Windows
• Only Hadoop Distribution for Windows Azure & Windows Server
• Native integration with SQL Server, Excel, and System Center
• Extends Hadoop to .NET community
Complete Portfolio for Hadoop
Appliances
Instant Access + Infinite Scale
• SAP can assure their customers they are deploying an SAP HANA + Hadoop architecture fully supported by SAP
• Enables analytics apps (BOBJ) to interact with Hadoop
Hortonworks Approach to Enterprise Hadoop
Identify and introduce enterprise requirements into the public domain
Work with the community to advance and incubate open source projects
Apply Enterprise Rigor to provide the most stable and reliable distribution
Community Driven Enterprise Apache Hadoop
Driving Hadoop Innovation
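Total Net Lines Contributed to Apache Hadoop [chart; contributor labels not preserved]: 614,041 lines; 449,768 lines; 147,933 lines. One segment is labeled "End Users".
Total Number of Committers to Apache Hadoop [chart; 63 total]: Yahoo: 10; Cloudera: 7; IBM: 3; Facebook: 5; LinkedIn: 3; 10 Others; one segment of 21 whose label was not preserved.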
Hortonworks engineers focus on making Apache Hadoop an enterprise viable
platform that powers modern data architectures and deeply integrates
with existing data center technologies
HDP: Enterprise Hadoop Platform
Hortonworks Data Platform (HDP)
• The ONLY 100% open source and complete platform
• Integrates full range of enterprise-ready services
• Certified and tested at scale
• Engineered for deep ecosystem interoperability
[Diagram: HORTONWORKS DATA PLATFORM (HDP) stack]
• HADOOP CORE: HDFS, YARN, MAPREDUCE, TEZ
• DATA SERVICES: HIVE & HCATALOG, PIG, HBASE
• LOAD & EXTRACT: SQOOP, FLUME, NFS, WebHDFS
• OPERATIONAL SERVICES: OOZIE, AMBARI
• Also shown: KNOX*, FALCON*
• PLATFORM SERVICES, Enterprise Readiness: High Availability, Disaster Recovery, Rolling Upgrades, Security and Snapshots
• Deployment targets: OS/VM, Cloud, Appliance
Hortonworks: The Value of “Open” for You
Connect With the Hadoop Community: We employ a large number of Apache project committers & innovators so that you are represented in the open source community
Avoid Vendor Lock-in: Hortonworks Data Platform remains as close to the open source trunk as possible and is developed 100% in the open so you are never locked in
The partners you rely on, rely on Hortonworks: We work with partners to deeply integrate Hadoop with data center technologies so you can leverage existing skills and investments
Certified for the Enterprise: We engineer, test and certify the Hortonworks Data Platform at scale to ensure the reliability and stability you require for enterprise use
Support from the experts: We provide the highest quality of support for deploying at scale. You are supported by hundreds of years of Hadoop experience
SQL-in-Hadoop with Apache Hive
• Apache Hive is the standard for SQL interaction with Hadoop
– Enterprise makes final purchasing decision on two key characteristics: 'compatibility' with existing investments (60%) and skills (20%)
– Most applications claim Hive compatibility TODAY*
• Stinger Initiative: Simple Focus
– Performance
– SQL-Compatibility
Claims publicly made by: Teradata, Microsoft, Oracle, Microstrategy, IBM, Information Builders, SAS, QlikTech, SAP, Tableau, Tibco, Actuate, Jaspersoft, Alteryx, Datameer, Pentaho
[Diagram: Business Analytics and Custom Apps send SQL to Hive, which runs on Tez and MapReduce on top of YARN and HDFS in Hadoop]
Improves existing tools & preserves investments
Stinger Initiative Goals
• Enables Hive to support interactive workloads
• Improves existing tools & preserves investments
[Diagram: Query Planner (Hive) + Execution Engine (Tez) + File Format (ORC file) = 100X+]
[Diagram: Data Types + Windowing & Subqueries = SQL Compatible]
Stinger: Hive For All Analytics
Enterprise Reports
Dashboard / Scorecard
Parameterized Reports
Visualization Data Mining
Interactive Batch
100X Faster+
SQL Compatible
Stinger Roadmap
[Roadmap chart; the grouping of items into release phases was not preserved]
DATA TYPES
• Subqueries for IN, NOT IN, HAVING
• Datatypes: CHAR, VARCHAR, DATETIME
• Improvements to DECIMAL datatype
• Integration with Tez and Tez Service
• Vectorization Preview
• Intelligent Optimizer
• Column Statistics
• Authentication and Authorization Enhancements
• Full vector query
• Join optimizations
• ORCFile
• SQL:2003 windowing functions
Stinger: Some early Results
• Query Engine Work ONLY
• Uses TPC “style” benchmark
• Just a few weeks of work
• OTHER work coming
Apache Tez : Accelerating Hadoop Query Processing
Tez – Introduction
• Distributed execution framework targeted towards data-processing applications.
• Based on expressing a computation as a dataflow graph.
• Built on top of YARN – the resource management framework for Hadoop.
• Open source Apache incubator project and Apache licensed.
Old School Hadoop: MapReduce
Fundamentals of YARN
• The fundamental idea of YARN is to split up the two major responsibilities of the JobTracker/TaskTracker into separate entities:
– a global ResourceManager
– a per-application ApplicationMaster
– a per-node slave NodeManager and
– a per-application Container running on a NodeManager
New School Hadoop with YARN
Tez – Design Themes
• Empowering End Users
• Execution Performance
Tez – Empowering End Users
• Expressive dataflow definition APIs
• Flexible Input-Processor-Output runtime model
• Data type agnostic
• Simplifying deployment
Tez – Empowering End Users
• Expressive dataflow definition APIs
– Enable definition of complex data flow pipelines using simple graph connection APIs. Tez expands the logical plan at runtime.
– Targeted towards data processing applications like Hive/Pig but not limited to them. Hive/Pig query plans naturally map to Tez dataflow graphs with no translation impedance.
[Diagram: logical vertices A to E, each expanded at runtime into two tasks: TaskA-1, TaskA-2, TaskB-1, TaskB-2, TaskC-1, TaskC-2, TaskD-1, TaskD-2, TaskE-1, TaskE-2]
Tez – Empowering End Users
• Expressive dataflow definition APIs
[Diagram: Distributed Sort. Stages: Preprocessor Stage, Partition Stage, Aggregate Stage, each with Task-1 and Task-2. A Sampler turns Samples into Ranges.]
Tez – Empowering End Users
• Flexible Input-Processor-Output runtime model
– Construct physical runtime executors dynamically by connecting different inputs, processors and outputs.
– End goal is to have a library of inputs, outputs and processors that can be programmatically composed to generate useful tasks.
[Diagram: example task compositions]
IntermediateReduce: ShuffleInput → ReduceProcessor → FileSortedOutput
FinalReduce: ShuffleInput → ReduceProcessor → HDFSOutput
PairwiseJoin: Input1 + Input2 → JoinProcessor → FileSortedOutput
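The Input-Processor-Output idea can be illustrated with a small, self-contained sketch. It does not use the Tez runtime API: the class IpoSketch and the methods composeTask and sumProcessor are hypothetical names chosen for illustration, and plain java.util.function types stand in for Tez inputs, processors and outputs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Supplier;

/** Hypothetical sketch of an Input-Processor-Output task; not the Tez runtime API. */
final class IpoSketch {

    /** A task is an input, a processor and an output wired together. */
    static <I, O> Runnable composeTask(Supplier<List<I>> input,
                                       Function<List<I>, List<O>> processor,
                                       Consumer<List<O>> output) {
        return () -> output.accept(processor.apply(input.get()));
    }

    /** Stand-in for a reduce-style processor: sums every value it receives. */
    static List<Integer> sumProcessor(List<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return List.of(total);
    }

    public static void main(String[] args) {
        List<Integer> sink = new ArrayList<>();
        // The same processor can be combined with different inputs and outputs,
        // in the spirit of IntermediateReduce vs FinalReduce on the slide.
        Runnable task = IpoSketch.<Integer, Integer>composeTask(
                () -> List.of(1, 2, 3), IpoSketch::sumProcessor, sink::addAll);
        task.run();
        System.out.println(sink);
    }
}
```

Swapping only the output argument (for example a sink that writes to a file instead of a list) changes where results go without touching the processor, which is the composition property the slide describes.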
Tez – Empowering End Users
• Data type agnostic
– Tez is only concerned with the movement of data: files and streams of bytes.
– Does not impose any data format on the user application. An MR application can use Key-Value pairs on top of Tez. Hive and Pig can use tuple-oriented formats that are natural and native to them.
[Diagram: a Tez Task moves Bytes in and out (File, Stream); User Code interprets them as Key Value pairs or Tuples]
Tez – Empowering End Users
• Simplifying deployment
– Tez is a completely client side application.
– No deployments to do. Simply upload to any accessible FileSystem and change local Tez configuration to point to that.
– Enables running different versions concurrently. Easy to test new functionality while keeping stable versions for production.
– Leverages YARN local resources.
[Diagram: two Client Machines, each with a TezClient; Tez Lib 1 and Tez Lib 2 stored on HDFS; TezTasks running on NodeManagers]
Tez – Empowering End Users
• Expressive dataflow definition APIs
• Flexible Input-Processor-Output runtime model
• Data type agnostic
• Simplifying usage
With great power APIs come great responsibilities
Tez is a framework on which end user applications can be built
Tez – Execution Performance
• Performance gains over Map Reduce
• Optimal resource management
• Plan reconfiguration at runtime
• Dynamic physical data flow decisions
Tez – Execution Performance
• Performance gains over Map Reduce
– Eliminate replicated write barrier between successive computations.
– Eliminate job launch overhead of workflow jobs.
– Eliminate extra stage of map reads in every workflow job.
– Eliminate queue and resource contention suffered by workflow jobs that are started after a predecessor job completes.
[Diagram: Pig/Hive - MR compared with Pig/Hive - Tez]
Tez – Execution Performance
• Optimal resource management
– Reuse YARN containers to launch new tasks.
– Reuse YARN containers to enable shared objects across tasks.
[Diagram: the Tez Application Master runs in one YARN Container and exchanges Start Task / Task Done messages with a TezTask Host in another YARN Container; TezTask1 and TezTask2 run in that host and use Shared Objects]
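The benefit of sharing objects across tasks in a reused container can be sketched as follows. This is a toy simulation, not Tez code: ContainerReuseSketch, SharedObjects and runTasksInOneContainer are hypothetical names, and the "expensive object" is just a small array.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical sketch of sharing objects across tasks in a reused container; not Tez code. */
final class ContainerReuseSketch {

    /** Objects cached for the lifetime of one (simulated) container. */
    static final class SharedObjects {
        private final Map<String, Object> cache = new HashMap<>();
        private int loads = 0;

        /** Returns the cached object, running the loader only on the first request. */
        Object getOrLoad(String key, Supplier<Object> loader) {
            if (!cache.containsKey(key)) {
                loads++;
                cache.put(key, loader.get());
            }
            return cache.get(key);
        }

        int loadCount() {
            return loads;
        }
    }

    /** Runs numTasks tasks one after another in the same container; returns how often the object was loaded. */
    static int runTasksInOneContainer(int numTasks) {
        SharedObjects shared = new SharedObjects();
        for (int task = 0; task < numTasks; task++) {
            // Every task asks for the same expensive object, e.g. a small lookup table.
            shared.getOrLoad("lookup-table", () -> new int[] {1, 2, 3});
        }
        return shared.loadCount();
    }

    public static void main(String[] args) {
        System.out.println("loads with reuse: " + runTasksInOneContainer(5));
    }
}
```

Without container reuse each task would start in a fresh process and pay the load cost again; in this sketch the count stays at one regardless of how many tasks run.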
Tez – Execution Performance
• Plan reconfiguration at runtime
– Dynamic runtime concurrency control based on data size, user operator resources, available cluster resources and locality.
– Advanced changes in dataflow graph structure.
– Progressive graph construction in concert with user optimizer.
[Diagram: inputs are HDFS Blocks and YARN Resources. Stage 1: 50 maps, 100 partitions. Stage 2: planned with 100 reducers, changed at runtime to 10 reducers because there is only 10 GB of data]
Tez – Execution Performance
• Dynamic physical data flow decisions
– Decide the type of physical byte movement and storage on the fly.
– Store intermediate data on distributed store, local store or in-memory.
– Transfer bytes via blocking files or streaming and the spectrum in between.
[Diagram: decided At Runtime. Producer (small size) → In-Memory → Consumer; Producer → Local File → Consumer]
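A runtime storage decision of this kind might look like the sketch below. The decision inputs (a memory budget and a flag for output that must be reliably stored) and all names (DataFlowDecisionSketch, Store, chooseStore) are assumptions made for illustration; the slides do not specify how Tez decides.

```java
/** Hypothetical sketch of choosing intermediate storage at runtime; not Tez code. */
final class DataFlowDecisionSketch {

    enum Store { IN_MEMORY, LOCAL_FILE, DISTRIBUTED_STORE }

    /**
     * Picks where a producer's intermediate output should live.
     * The criteria here are assumed for the example.
     */
    static Store chooseStore(long producerOutputBytes, long memoryBudgetBytes,
                             boolean needsReliableOutput) {
        if (needsReliableOutput) {
            return Store.DISTRIBUTED_STORE; // output must outlive the node
        }
        if (producerOutputBytes <= memoryBudgetBytes) {
            return Store.IN_MEMORY;         // small producer: keep bytes in memory
        }
        return Store.LOCAL_FILE;            // otherwise spill to a local file
    }

    public static void main(String[] args) {
        long budget = 64L * 1024 * 1024; // assumed 64 MB memory budget
        System.out.println(chooseStore(1024, budget, false));
        System.out.println(chooseStore(10L * budget, budget, false));
        System.out.println(chooseStore(1024, budget, true));
    }
}
```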
Tez – Deep Dive – API
// Create the DAG and one vertex per processing stage
DAG dag = new DAG();
Vertex map1 = new Vertex(MapProcessor.class);
Vertex map2 = new Vertex(MapProcessor.class);
Vertex reduce1 = new Vertex(ReduceProcessor.class);
Vertex reduce2 = new Vertex(ReduceProcessor.class);
Vertex join1 = new Vertex(JoinProcessor.class);
…….
// Each edge connects a producer vertex to a consumer vertex and carries the edge properties
Edge edge1 = new Edge(map1, reduce1, SCATTER_GATHER, PERSISTED, SEQUENTIAL, MOutput.class, RInput.class);
Edge edge2 = new Edge(map2, reduce2, SCATTER_GATHER, PERSISTED, SEQUENTIAL, MOutput.class, RInput.class);
Edge edge3 = new Edge(reduce1, join1, SCATTER_GATHER, PERSISTED, SEQUENTIAL, MOutput.class, RInput.class);
Edge edge4 = new Edge(reduce2, join1, SCATTER_GATHER, PERSISTED, SEQUENTIAL, MOutput.class, RInput.class);
…….
// Assemble the graph from its vertices and edges
dag.addVertex(map1).addVertex(map2)
.addVertex(reduce1).addVertex(reduce2)
.addVertex(join1)
.addEdge(edge1).addEdge(edge2)
.addEdge(edge3).addEdge(edge4);
[Diagram: map1 → reduce1 → join1 and map2 → reduce2 → join1; edges labeled Scatter_Gather, Bipartite, Sequential]
Simple DAG definition API
Tez – Deep Dive – API
• Data movement – Defines routing of data between tasks
– One-To-One : Data from the ith producer task routes to the ith consumer task.
– Broadcast : Data from a producer task routes to all consumer tasks.
– Scatter-Gather : Producer tasks scatter data into shards and consumer tasks gather the data. The ith shard from all producer tasks routes to the ith consumer task.
• Scheduling – Defines when a consumer task is scheduled
– Sequential : Consumer task may be scheduled after a producer task completes.
– Concurrent : Consumer task must be co-scheduled with a producer task.
• Data source – Defines the lifetime/reliability of a task output
– Persisted : Output will be available after the task exits. Output may be lost later on.
– Persisted-Reliable : Output is reliably stored and will always be available.
– Ephemeral : Output is available only while the producer task is running.
Edge properties define the connection between producer and consumer vertices in the DAG
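The three data-movement types can be expressed as a small routing function. This is an illustrative sketch only: EdgeRoutingSketch, DataMovement and consumersFor are hypothetical names, task and shard indexes are zero-based by assumption, and nothing here is taken from the Tez code base.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of the three data-movement types; not Tez code. */
final class EdgeRoutingSketch {

    enum DataMovement { ONE_TO_ONE, BROADCAST, SCATTER_GATHER }

    /**
     * Returns the consumer task indexes that receive output from one producer task.
     * For SCATTER_GATHER, shardIndex selects which of the producer's shards is routed;
     * it is ignored for the other two types.
     */
    static List<Integer> consumersFor(DataMovement type, int producerIndex,
                                      int shardIndex, int numConsumers) {
        List<Integer> targets = new ArrayList<>();
        switch (type) {
            case ONE_TO_ONE:
                targets.add(producerIndex); // ith producer -> ith consumer
                break;
            case BROADCAST:
                for (int c = 0; c < numConsumers; c++) {
                    targets.add(c);         // every consumer gets the data
                }
                break;
            case SCATTER_GATHER:
                targets.add(shardIndex);    // ith shard of any producer -> ith consumer
                break;
        }
        return targets;
    }

    public static void main(String[] args) {
        System.out.println(consumersFor(DataMovement.ONE_TO_ONE, 2, 0, 4));
        System.out.println(consumersFor(DataMovement.BROADCAST, 2, 0, 4));
        System.out.println(consumersFor(DataMovement.SCATTER_GATHER, 2, 1, 4));
    }
}
```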
Tez – Deep Dive – Scheduling
[Diagram: interactions among the Vertex Scheduler, DAGScheduler and TaskScheduler for vertices map1 and reduce1: Start vertex, Start tasks, Get Priority, Get container]
• Vertex Scheduler: Determines when tasks in a vertex can start
• DAG Scheduler: Determines priority of task
• Task Scheduler: Allocates containers from YARN and assigns them to tasks
Tez – Deep Dive – Task Execution
[Diagram: the Task Attempt (logical in AM) supplies env, cmd line and resources to start a container; the Task Attempt (real on machine) runs in a Task JVM, fetches its task (Get Task) and runs Input → Processor → Output; Data Information and Data Events are exchanged]
• Start task shell with user specified env, resources etc.
• Fetch and instantiate Input, Processor, Output objects
• Receive (incremental) input information and process the input
• Provide output information
Tez - Sessions
• The amount of work programmed into a script/query may not be doable within a single Tez DAG.
Tez - Sessions
• Even better performance gains may be achieved through caching with the session: Within AM or container
Tez – Automatic Reduce Parallelism
[Diagram: inside the App Master, the Map Vertex sends Data Size Statistics to the Vertex Manager of the Reduce Vertex; the Vertex State Machine is told to Set Parallelism, Cancel Task and Re-Route]
Event Model: Map tasks send data statistics events to the Reduce Vertex Manager.
Vertex Manager: Pluggable user logic that understands the data statistics and can formulate the correct parallelism. Advises vertex controller on parallelism.
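One plausible way for pluggable logic to turn data size statistics into a reducer count is sketched below. The target of a fixed number of bytes per reducer is an assumption made for this example, as are the names ReduceParallelismSketch and chooseParallelism; the slides do not state the actual formula.

```java
/** Hypothetical sketch of deriving reduce parallelism from map output statistics; not Tez code. */
final class ReduceParallelismSketch {

    /**
     * Chooses a reducer count so that each reducer handles roughly
     * desiredBytesPerReducer, never exceeding the configured count
     * and never dropping below one.
     */
    static int chooseParallelism(long[] mapOutputBytes, long desiredBytesPerReducer,
                                 int configuredReducers) {
        if (desiredBytesPerReducer <= 0) {
            throw new IllegalArgumentException("desiredBytesPerReducer must be positive");
        }
        long total = 0;
        for (long bytes : mapOutputBytes) {
            total += bytes; // statistics reported by completed map tasks
        }
        // Ceiling division: reducers needed to stay at or under the target size.
        long needed = (total + desiredBytesPerReducer - 1) / desiredBytesPerReducer;
        return (int) Math.max(1, Math.min(configuredReducers, needed));
    }

    public static void main(String[] args) {
        long gb = 1L << 30;
        // 10 GB of map output, assumed 1 GB target per reducer, 100 reducers configured.
        System.out.println(chooseParallelism(new long[] {4 * gb, 6 * gb}, gb, 100));
    }
}
```

With 10 GB of map output and an assumed 1 GB target, 100 configured reducers shrink to 10, which matches the scale of the reduction shown in the plan reconfiguration slide.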
Tez – Reduce Slow Start/Pre-launch
[Diagram: inside the App Master, the Map Vertex sends Task Completed events to the Vertex Manager of the Reduce Vertex; the Vertex State Machine is told to Start Tasks]
Event Model: Map completion events sent to the Reduce Vertex Manager.
Vertex Manager: Pluggable user logic that understands the data size. Advises the vertex controller to launch the reducers before all maps have completed so that shuffle can start.
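A slow-start policy can be sketched as a function of the fraction of completed map tasks. The two thresholds (minFraction, maxFraction) and the linear ramp between them are assumptions for this example, and SlowStartSketch and reducersToLaunch are hypothetical names; the slides only say that reducers launch before all maps have completed.

```java
/** Hypothetical sketch of a reduce slow-start policy; not Tez code. */
final class SlowStartSketch {

    /**
     * Returns how many reducers should have been launched so far.
     * Below minFraction of completed maps nothing starts; at or above
     * maxFraction everything starts; in between the count ramps linearly.
     */
    static int reducersToLaunch(int completedMaps, int totalMaps, int totalReducers,
                                double minFraction, double maxFraction) {
        if (totalMaps == 0) {
            return totalReducers; // nothing to wait for
        }
        double done = (double) completedMaps / totalMaps;
        if (done < minFraction) {
            return 0;
        }
        if (done >= maxFraction) {
            return totalReducers;
        }
        double progress = (done - minFraction) / (maxFraction - minFraction);
        return (int) Math.floor(progress * totalReducers);
    }

    public static void main(String[] args) {
        // 100 maps, 20 reducers, assumed thresholds of 25% and 75%.
        for (int completed : new int[] {10, 50, 75, 100}) {
            System.out.println(completed + " maps done -> "
                    + reducersToLaunch(completed, 100, 20, 0.25, 0.75) + " reducers");
        }
    }
}
```

Launching some reducers early lets the shuffle overlap with the remaining map work, at the cost of holding containers that are partly idle until all map output exists.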
Tez – Current status
• Apache Incubator Project
– Rapid development. Over 330 jiras opened. Over 220 resolved.
– Growing community.
• Focus on stability
– Testing and quality are highest priority.
– Working on Tez+YARN to fix basic performance overheads.
– Code ready and deployed on multi-node environments.
• DAG of MR processing is working
– Already functionally equivalent to Map Reduce. Existing Map Reduce jobs can be executed on Tez with few or no changes.
– Working Hive prototype that can target Tez for execution of queries (HIVE-4660).
– Work started on prototype of Pig that can target Tez.
Tez – Current status
[Diagram: Typical pattern in a TPC-DS query. The Fact Table is joined with Dimension Table 1 to give Result Table 1, which is joined with Dimension Table 2 to give Result Table 2, which is joined with Dimension Table 3 to give Result Table 3]
[Diagram: Optimization for small data sets. The Fact Table is joined directly with three dimension tables (all three labeled "Dimension Table 1" on the slide)]
Both can now run as a single Tez job
Tez – MRR Performance
TPC-DS Query 12 with Hive on Tez: Elapsed Time (seconds)
Configuration | Traditional Map-Reduce | Tez Map-Reduce-Reduce
RC File, Scale 200 | 55 | 35
ORC File, Scale 200 | 54 | 34
RC File, Scale 1000 | 75 | 55
ORC File, Scale 1000 | 65 | 46
Tez – Roadmap
• Full DAG support
– Multi-way input and output.
– Other graph connection patterns.
• Performance optimizations
– Container reuse
– Cross task shared resources
– Using HDFS data caching
• Runtime plan optimizations
– Automatic input (map) parallelism
– Automatic aggregation (reduce) parallelism
• Usability
– Stability and testability
– Recovery and history
Tez – Community
• Early adopters and contributors welcome
– Adopters to drive more scenarios. Contributors to make them happen.
– Hive and Pig communities are on-board and making great progress - HIVE-4660 and PIG-3446
• Stay tuned for Tez meetups with deep dives on Tez architecture and using Tez
– http://www.meetup.com/Apache-Tez-User-Group
• Useful links
– Work tracking: https://issues.apache.org/jira/browse/TEZ
– Code: https://github.com/apache/incubator-tez
– Developer list: [email protected]
– User list: [email protected]
– Issues list: [email protected]
Tez – Takeaways
• Distributed execution framework that works on computations represented as dataflow graphs
• Naturally maps to execution plans produced by query optimizers
• Execution architecture designed to enable dynamic performance optimizations at runtime
• Open source Apache project – your use-cases and code are welcome
• It works and is already being used by Hive
Tez
Tez: https://github.com/apache/tez.git
Demo: https://github.com/t3rmin4t0r/tez-autobuild
Thanks for your time and attention!
Questions?