huhadoop - v1.1
04/10/2023
Prepared for:
Big Data Expedition Roadshow
Presented by: “Big Data Joe” Rossi
Huhadoop?
What Makes Up Hadoop 1.x?
Hadoop 1.0 – HDFS + MapReduce
[Diagram: a Client, a NameNode, a SecondaryNameNode / JobTracker, and four DataNode / TaskTracker nodes. Block labels 1-1 through 4-3 appear at the Client and across the DataNodes; Map and Reduce tasks run on the TaskTrackers.]
MapReduce v1 Limitations
Scalability: Maximum cluster size is 4,000 nodes and maximum concurrent tasks is 40,000
Availability: JobTracker failure kills all queued and running jobs
Resources Partitioned into Map and Reduce: Hard partitioning of Map and Reduce slots led to low resource utilization
No Support for Alternate Paradigms / Services: Only MapReduce batch jobs, nothing else
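The slot-partitioning limitation can be made concrete with a little arithmetic. The sketch below is only an illustration: the slot counts and task mix are made-up numbers, and both functions are hypothetical helpers, not Hadoop APIs.

```python
def slots_used_partitioned(map_slots, reduce_slots, map_tasks, reduce_tasks):
    """MRv1-style: map tasks may only use map slots, reduce tasks only reduce slots."""
    return min(map_slots, map_tasks) + min(reduce_slots, reduce_tasks)


def slots_used_pooled(total_slots, map_tasks, reduce_tasks):
    """Pooled capacity: any pending task may use any free slot."""
    return min(total_slots, map_tasks + reduce_tasks)


# Hypothetical cluster: 60 map slots + 40 reduce slots, during a map-heavy phase.
MAP_SLOTS, REDUCE_SLOTS = 60, 40
pending_maps, pending_reduces = 100, 0

partitioned = slots_used_partitioned(MAP_SLOTS, REDUCE_SLOTS, pending_maps, pending_reduces)
pooled = slots_used_pooled(MAP_SLOTS + REDUCE_SLOTS, pending_maps, pending_reduces)

total = MAP_SLOTS + REDUCE_SLOTS
print(f"partitioned utilization: {partitioned / total:.0%}")
print(f"pooled utilization:      {pooled / total:.0%}")
```

With these assumed numbers, the 40 reduce slots sit idle while map tasks queue, which is the low utilization the slide describes.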
HADOOP 1.0
Single Use System: Batch Apps
Apache Hadoop 1.0: Single Use System
HDFS (redundant, reliable storage)
MapReduce (cluster resource management and data processing)
Pig, Hive
What’s New In Hadoop 2.x?
YARN Replaces MapReduce
Yet Another Resource Negotiator
YARN
YARN will be the de-facto distributed operating system for Big Data
YARN: Taking Hadoop Beyond Batch
Store DATA in one place
Interact with that data in MULTIPLE WAYS with Predictable Performance and Quality of Service
Applications Run Natively IN Hadoop
HDFS2 (redundant, reliable storage)
YARN (cluster resource management)
BATCH (MapReduce)
INTERACTIVE (Tez, Spark)
ONLINE (HBase)
STREAMING (DataTorrent)
GRAPH (Giraph)
YARN: Moving Quickly
[Timeline: 2010, 2011, 2012, 2013, 2014, Today]
Conceived at Yahoo!
Alpha Releases – 2.0
Beta Releases – 2.1
GA Released – 2.2
100,000+ nodes, 400,000+ jobs daily
10 million+ hours of compute daily
Version 2.3
YARN: Dr. Evil Approved
YARN: Applications
MapReduce v2
Real-Time Streaming Analytics
Master-Worker
Online
Graph Processing
Running all on the same Hadoop cluster to give applications access to all the same source data!
YARN: What Has Changed?
YARN                                             MRv1
RM (ResourceManager), AM (ApplicationMaster)     JT (JobTracker)
Scheduler                                        Scheduler
NM (NodeManager)                                 TT (TaskTracker)
Container                                        Map / Reduce
[Diagram: in YARN, the ResourceManager (with its Scheduler) oversees NodeManagers that host an ApplicationMaster and Containers; in MRv1, the JobTracker (with its Scheduler) oversees TaskTrackers that run Map and Reduce tasks.]
7 Benefits of YARN
Scale
New programming models and services
Improved cluster utilization
Agility
Backwards compatible with MapReduce v1
Mixed workloads on the same source of data
Enables running apps in memory within the cluster
The Future of Hadoop: Projects and Roadmap
Stinger: Interactive Query for Hive
Speed: Deliver interactive query through 100x performance increases as compared to Hive 0.10.
SQL: Support the broadest array of SQL semantics for analytic applications running against Hadoop.
Scale: The only SQL interface to Hadoop designed for queries that scale from Terabytes to Petabytes.
Stinger: Speed – Apache Tez
HDFS2 (redundant, reliable storage)
YARN (cluster resource management)
Tez (execution layer)
MR, Pig, Hive
HOYA: HBase on YARN
Dynamic Scaling: On-demand cluster size. Increase and decrease the size with load.
Easier Deployment: APIs to create, start, stop and delete HBase clusters.
Availability: Recover from Region Server loss with a new container.
Microsoft REEF
Retainable Evaluator Execution Framework
Machine Learning: Framework well suited for building machine learning jobs.
Scalable / Fault Tolerant: Makes it easy to implement scalable, fault-tolerant runtime environments for a range of computational models.
Maintain State: Users can build jobs that utilize data from where it’s needed and also maintain state after jobs are done.
Heterogeneous Storages in HDFS
[Diagram: a NameNode over generic Storage, contrasted with a NameNode over SATA, SSD, and Fusion IO storage.]
Hadoop Roadmap
EARLY Q2 2014: Apache Hadoop 2.4
ResourceManager HA / Auto Failover
HDFS Rolling Upgrades
MID Q2 2014: Apache Hadoop 2.5
NodeManager Restart w/o disruption
Dynamic Resource Configuration
Questions?
No such thing as a stupid question.
Huhadoop?
Supporting Slides: Slides with information that may be asked about
YARN: How It Works
[Diagram: a Client submits work to the ResourceManager and its Scheduler; four NodeManagers host one ApplicationMaster and three Containers.]
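The flow in this diagram can be sketched as a toy model: the Client submits an application, the ResourceManager's scheduler places a container for the ApplicationMaster, and the ApplicationMaster then asks for one container per task. This is a minimal sketch in plain Python under stated assumptions. The classes, the first-fit placement rule, and the memory sizes are all hypothetical and are not the YARN API or its real scheduling policy.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    name: str
    free_mb: int                      # memory still available on this node
    containers: list = field(default_factory=list)


@dataclass
class ResourceManager:
    nodes: list

    def allocate(self, app_id, role, mb):
        """Toy scheduler: the first node with enough free memory wins."""
        for node in self.nodes:
            if node.free_mb >= mb:
                node.free_mb -= mb
                node.containers.append((app_id, role, mb))
                return node.name
        return None                   # no capacity: a real request would wait


def submit_application(rm, app_id, am_mb, task_mb, num_tasks):
    """Client -> ResourceManager -> ApplicationMaster -> task containers."""
    placements = []
    # 1. The RM starts one container that hosts the ApplicationMaster.
    am_node = rm.allocate(app_id, "ApplicationMaster", am_mb)
    if am_node is None:
        return placements
    placements.append(("ApplicationMaster", am_node))
    # 2. The AM asks the RM for one container per task.
    for i in range(num_tasks):
        node = rm.allocate(app_id, f"task-{i}", task_mb)
        if node is not None:
            placements.append((f"task-{i}", node))
    return placements


rm = ResourceManager([NodeManager("nm1", 4096), NodeManager("nm2", 4096)])
placements = submit_application(rm, "app-1", am_mb=1024, task_mb=2048, num_tasks=3)
for role, node in placements:
    print(f"{role:>17} -> {node}")
```

Because containers are sized by the request rather than fixed as Map or Reduce slots, the same pool of node memory can serve the ApplicationMaster and any kind of task.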
YARN: Example App Deployment
[Diagram: the HOYA Client talks to the ResourceManager and its Scheduler; four NodeManagers host the HOYA / HBase Master and three Region Servers.]
Storm Vs. DataTorrent
Solution Matrix DataTorrent Apache Storm
Atomic Micro-batch 1 3
Events per Second Billions Thousands
Automated Parallelism 3
Dynamic Runtime Changes 3
Linear Scalability 3
State Checkpointing 3
Apache Spark + Shark
HDFS2 (redundant, reliable storage)
YARN (cluster resource management)
Apache Spark
Shark
Hive (SQL)
Hadoop 2.x – YARN + HDFS
[Diagram: a NameNode, a Standby NameNode / ResourceManager, and four DataNode / NodeManager nodes running Containers.]
YARN: Key Take-Aways
Backwards Compatible: YARN is backwards compatible with your existing MapReduce applications. You can get value from it right away.
Resource Management: YARN enables fine-grained resource management for better cluster utilization.
One Source of Data: YARN allows you to interact with one source of data in multiple ways while maintaining predictable performance and quality of service.
Enabling Smart People: YARN is a flexible framework that enables smart people and companies to do amazing things with data.
YARN will be the de-facto distributed operating system for Big Data
Storm Vs. DataTorrent - Detailed
Solution Matrix DataTorrent Apache Storm
Proprietary / Open Source O O
Support for Hadoop 1.x 1 1
Support for Hadoop 2.x 1 1
Native YARN 1 3
Dashboard 1 3
Extensible via Modules 1 1
Technical Support 1 1
Atomic Micro-batch 1 3
Events per Second Billions Thousands
Automated Parallelism 1 3
Dynamic Runtime Changes 1 3
High Availability 1 2
Prog. Languages Supported Java, Python, etc. Java, Python, etc.
Log Analysis 1 3
Site Operations 1 3
MapReduce Diagnostics 1 3
Open Source Operators Library 1 2
Open Source Application Templates 1 3
Complex Computations (DAG) 1 3
Linear Scalability 1 3
Security 1 3
CLI and Macros 1 3
Configuration Based Specification 1 3
State Checkpointing 1 3
The 1st Generation Of Hadoop
Users forced to create data system silos for managing mixed workloads
Developers forced to abuse the very specific MapReduce model to fit their use cases
Hadoop
HBase
Stinger: HiveQL – SQL Support
Hive SQL Datatypes
Hive SQL Semantics
Apache Spark
HDFS2 (redundant, reliable storage)
YARN (cluster resource management)
Apache Spark
Shark
Hive (SQL)
Spark Streaming
MLlib (machine learning)
Project Mgt Committee Members
[Bar chart by company: Hortonworks, Others, Cloudera, Yahoo!; bar values as extracted: 7, 6, 3, 15, 11]
Project Committers
[Bar chart by company: Hortonworks, Others, Cloudera, Yahoo!; bar values as extracted: 24, 24, 11, 11, 5]
YARN: Why The De-Facto Distributed OS
Technology Adoption: 100,000+ nodes, 400,000 jobs, 10m compute hours daily
Enables Innovation: Smart people and companies can do amazing things with data
Financial Backing: 568m+ invested in Hadoop-contributing companies, nearly 400m in 2013 alone
Apache Storm Topology
[Diagram: two Spouts, each reading a Stream (Data Source), feed Bolts for Filter, Calculation, RDBMS Writes, and HDFS Writes; the write Bolts target an RDBMS and HDFS.]
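The spout-and-bolt structure of this topology can be sketched with Python generators. This is an illustration only and not the Storm API. The wiring (spout to filter to calculation, then fan-out to both writers), the single spout, the event fields, and the list-backed sinks are all assumptions made for the example.

```python
def spout(events):
    """Spout: emits a stream of tuples from a data source (here, a list)."""
    for event in events:
        yield event


def filter_bolt(stream):
    """Bolt (Filter): drops tuples with a missing value."""
    for event in stream:
        if event["value"] is not None:
            yield event


def calculation_bolt(stream):
    """Bolt (Calculation): attaches a running total to each tuple."""
    total = 0
    for event in stream:
        total += event["value"]
        yield {**event, "running_total": total}


def run_topology(events, rdbms_sink, hdfs_sink):
    """Wires spout -> filter -> calculation, then fans out to both writers."""
    for event in calculation_bolt(filter_bolt(spout(events))):
        rdbms_sink.append(event)      # stands in for Bolt (RDBMS Writes)
        hdfs_sink.append(event)       # stands in for Bolt (HDFS Writes)


# Hypothetical sinks and input data; real bolts would write to external systems.
rdbms, hdfs = [], []
source = [{"id": 1, "value": 5}, {"id": 2, "value": None}, {"id": 3, "value": 7}]
run_topology(source, rdbms, hdfs)
print(rdbms)
```

Each tuple flows through the bolts one at a time, which mirrors the per-event streaming model rather than a batch job over stored data.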
Hadoop 1.0 – MR + HDFS
[Diagram: a NameNode, a SecondaryNameNode / JobTracker, and four DataNode / TaskTracker nodes, each running Map and Reduce tasks.]
Hadoop 1.0 – MapReduce
[Diagram: a JobTracker coordinating four TaskTrackers, each running Map and Reduce tasks.]
YARN: Uncharted Territory
[Chart: Technology versus Value, with a "You Are Here" marker.]