hadoop crashcourse v3

Hadoop Crash Course

Summer 2015Version 1.0

Hadoop Interest Group

Jules Damji jdamji@hortonworks.com

@2twitme

Rafael Coss rafael@hortonworks.com

@racoss

Hadoop Crash Course

Why Hadoop?

Hadoop Ecosystem & Distribution

Store Data (HDFS)

Process Data in Hadoop 1 (MapReduce)

Process Data in Hadoop 2 (Yarn + MapReduce/Tez)

Data Access

What disrupted the data center?

New Data Paradigm Opens Up New Opportunity

2.8 zettabytesin 2012

44 zettabytesin 2020

1 zettabyte (ZB) = 1 million petabytes (PB); Sources: IDC, IDG Enterprise, and AMR Research

Clickstream

ERP, CRM, SCM

Web & social

Geolocation

Internet of Things

Server logs

Files, emails

Transform every industry via full fidelity of data and analytics

Opportunity

T R A D I T I O N A L

LAGGARDS

LEADERS

Ability to Consume Data

Enterprise Blind Spot

Hadoop YARN-based Architecture Unlocks Opportunity

Consolidates all data sets

Delivers real-time insights

Integrates with data center

Scalable and affordable

T U R N A L L O F Y O U R D ATA I N T O VA L U E

| Hadoop and the Hadoop elephant logo are trademarks of the Apache Software Foundation

Two Paths in a Customer’s Journey to a Data LakeS

Goal:• Centralized Architecture• Data-driven Business

DATA LAKE

Journey to the Data Lake with Hadoop

Systems of Insight

The journey begins with either:

1. Cost Optimization (Data Architecture Optimization)

2. Advanced Analytic Applications

Leaders are Data Driven

Advanced Analytic Apps

Cost Optimization

Hadoop Ecosystem

runs on

RDBMS Import/Export

Distributed Storage & Processing Framework

Secure NoSQL DB

SQL on HBase

NoSQL DB

Workflow Management

Streaming Data Ingestion

Cluster System Operations

Secure Gateway

Distributed Registry

Search & Indexing

Even Faster Data Processing

Data Management

Machine Learning

Hadoop Architecture

Data Access Engines

Distributed Reliable Storage

Distributed Compute FrameworkResource Mgt, Data Locality

Data Operating System

Batch Interactive Streaming

Governance Security

Hadoop Key Services

Hortonworks Data Platform

Multi-tenant data platform built on a centralized architecture of shared enterprise services

YARN: data operating system

Governance Security

Operations

Resource management

Existing applications

Newanalytics

Partner applications

Data access: batch, interactive, real-time

Storage

Key Services

Resource and workload management

Scalable tiered storage

Consistent operations

Comprehensive security

Trusted data governance

HORTONWORKS DATA PLATFORM

HDP 2.3 is Apache Hadoop; not “based on” Hadoop

DATA MGMT DATA ACCESS GOVERNANCE & INTEGRATION OPERATIONS SECURITY

HDP 2.2Dec 2014

HDP 2.1April 2014

HDP 2.0Oct 2013

HDP 2.2Dec 2014

HDP 2.1April 2014

HDP 2.0Oct 2013

HDP 2.3July 2015

Ongoing Innovation in Apache

HDFSYARNMapReduceHadoop Core

HORTONWORKS DATA PLATFORM

HDP 2.3 is Apache Hadoop; not “based on” Hadoop

4.10.2

DATA MGMT DATA ACCESS GOVERNANCE & INTEGRATION OPERATIONS SECURITY

HDP 2.2Dec 2014

HDP 2.1April 2014

HDP 2.0Oct 2013

HDP 2.2Dec 2014

HDP 2.1April 2014

HDP 2.0Oct 2013

0.12.0 0.12.0

0.12.1 0.13.0 0.4.0

1.4.4 1.4.4 3.3.23.4.5

0.4.00.5.0

0.14.0 0.14.0 3.4.6 0.5.0 0.4.00.9.30.5.2

4.0.04.7.2

1.2.1 0.60.0 0.98.4 4.2.0 1.6.1 0.6.0 1.5.21.4.5 4.1.02.0.0

1.4.0 1.5.1 4.0.0

1.5.1 1.4.4 3.4.5

0.96.1

0.98.0 0.9.1

HDP 2.3July 2015

1.3.12.7.1 1.4.6 1.0.0 0.6.0 0.5.02.1.00.8.2 3.4.61.5.25.2.1 0.80.0 1.1.1 0.5.01.7.04.4.0 0.10.0 0.6.10.7.01.2.10.15.0 4.2.0

Ongoing Innovation in Apache

Hortonworks Development Investment for the Enterprise

Horizontal Integration for Enterprise ServicesEnsure consistent enterprise services are applied across the Hadoop stack

Vertical Integration with YARN and HDFS

Ensure engines can run reliably and respectfully in a YARN based cluster

Provision, Manage & Monitor

AmbariZookeeper

Scheduling

Load data and manage

according to policy

Provide layered approach to

security through Authentication, Authorization,

Accounting, and Data Protection

SECURITYGOVERNANCE

Deploy and effectively

manage the platform

° ° ° ° ° ° ° ° ° ° ° ° ° ° °

Script

JavaScala

Cascading

Stream

Search

HBaseAccumulo

BATCH, INTERACTIVE & REAL-TIME DATA ACCESS

In-Memory

Others

ISV Engines

1 ° ° ° ° ° ° ° ° ° ° ° ° ° °

YARN: Data Operating System(Cluster Resource Management)

HDFS (Hadoop Distributed File System)

Tez Slider SliderTez Tez

OPERATIONS

+/directory/structure/in/memory.txt

Resource management + schedulingDisk, CPU, Memory

CoreNameNode

ResourceManagerYARN

Hadoop daemon

User application

DataNodeHDFS

NodeManagerYARN

Worker Node

Joys of Real Hardware (Jeff Dean)

Typical first year for a new cluster:~0.5 overheating (power down most machines in <5 mins, ~1-2 days to recover)~1 PDU failure (~500-1000 machines suddenly disappear, ~6 hours to come back)~1 rack-move (plenty of warning, ~500-1000 machines powered down, ~6 hours)~1 network rewiring (rolling ~5% of machines down over 2-day span)~20 rack failures (40-80 machines instantly disappear, 1-6 hours to get back)~5 racks go wonky (40-80 machines see 50% packetloss)~8 network maintenances (4 might cause ~30-minute random connectivity losses)~12 router reloads (takes out DNS and external vips for a couple minutes)~3 router failures (have to immediately pull traffic for an hour)~dozens of minor 30-second blips for dns~1000 individual machine failures~thousands of hard drive failuresslow disks, bad memory, misconfigured machines, flaky machines, etc

Hadoop Distributed File System (HDFS)

Fault Tolerant Distributed Storage

• Divide files into big blocks and distribute 3 copies randomly across the cluster

• Processing Data Locality

• Not Just storage but computation

10110100101001001110011111100101001110100101001011001001010100110001010010111010111010111101101101010110100101010010101010101110010011010111

Logical File

Blocks

Cluster

The DataNodes

“I’m still here! This is my latest heartbeat.”

“I’m here too! And here is my latest heartbeat.”

“Hey DataNode1, Replicate block 123 to

DataNode 3.”

NameNode

DataNode 1 DataNode 3 DataNode 4

123 123

DataNode 1

Batch Processing in Hadoop

MapReduceBatch Access to DataOriginal data access mechanism for Hadoop

• FrameworkMade for developing distributed applications to process vast amounts of data in-parallel on large clusters

• ProvenReliable interface to Hadoop which works from GB to PB. But, batch oriented – Speed is not it’s strong point.

• EcosystemPorted to Hadoop 2 to run on YARN. Supports original investments in Hadoop by customers and partner ecosystem.

DataNode1

Mapper

Data is shuffledacross the network

& sorted

Map Phase

Shuffle/Sort Reduce Phase

MapReduce Job Lifecycle

Saying that MapReduce is dead is preposterous- Would limits us to only new workloads

- ALL Hadoop clusters use map reduce

- Why rewrite everything immediately?

DataNode2

Mapper

DataNode3

Mapper

DataNode1

Reducer

DataNode2

Reducer

DataNode3

Reducer

YARN: Data Operating System

Interactive Real-TimeBatch

What is MapReduce?

Break a large problem into sub-solutionsMap

• Iterate over a large # of records

• Extract something of interest from each record

Shuffle

• Sort Intermediate results

Reduce

• Aggregate, summarize, filter or transform intermediate results

• Generate final output

Map Process

DataData

DataData Map

Process

Reduce Process

Read & ETL

Shuffle & SortAggregation

DataData

1st Gen Hadoop: Cost Effective Batch at Scale

HADOOP 1.0Built for Web-Scale Batch Apps

Single App

INTERACTIVE

Single App

Silos created for distinct use casesSingle App

Single App

ONLINE

Hadoop emerged as foundation of new data architecture

Apache Hadoop is an open source data platform for managing large volumes of high velocity and variety of data

• Built by Yahoo! to be the heartbeat of its ad & search business

• Donated to Apache Software Foundation in 2005 with rapid adoption by large web properties & early adopter enterprises

• Incredibly disruptive to current platform economics

Traditional Hadoop Advantages Manages new data paradigm Handles data at scale Cost effective Open source

Traditional Hadoop Had Limitations

Batch-only architecture

Single purpose clusters, specific data sets

Difficult to integrate with existing investments

Not enterprise-grade

Application

StorageHDFS

Batch ProcessingMapReduce

What does iOS 6 and Windows 3.1 have in common?

Hadoop Beyond Batch with YARN

MapReduce

Pig(data flow)

Hive(SQL)

OthersAPI,

Engine, andSystem

Hadoop 1MapReduce as the Base

HDFS(redundant, reliable storage)

YARN(Data Operating System: resource management, etc.)

Tez(modern execution engine)

Data FlowPig

SQLHive

Java AppsCascading

BatchMapReduce

Hadoop 2Apache Yarn as a Base

System

Engine

API’s

Single Use SysztemBatch Apps

Multi Use Data PlatformBatch, Interactive, Online, Streaming, …

A shift from the old to the new…

Apache Tez is a critical innovation of the Stinger Initiative.

• Along with YARN, Tez not only improves Hive, but improves all things batch and interactive for Hadoop; Pig, Cascading…

• More Efficient Processing than MapReduce

• Reduce operations and complexity of back end processing• Allows for Map Reduce Reduce which saves hard disk operations• Implements a “service” which is always on, decreasing start times

of jobs• Allows Caching of Data in Memory

Cascading/Scalding

Why is Tez Important?

°1 ° ° ° ° ° ° ° °

° ° ° ° ° ° ° ° °

° ° ° ° ° ° °

° ° ° ° ° ° N

HDFS (Hadoop Distributed File

System)

Scripting

Tez Tez

Applications

Hive – MapReduce Hive – Tez

SELECT a.state, COUNT(*), AVG(c.price)

FROM a

JOIN b ON (a.id = b.id)

JOIN c ON (a.itemId = c.itemId)

GROUP BY a.state

SELECT a.state

JOIN (a, c)SELECT c.price

SELECT b.id

JOIN(a, b)GROUP BY a.state

COUNT(*)AVG(c.price)

SELECT a.state,c.itemId

JOIN (a, c)

JOIN(a, b)GROUP BY a.state

COUNT(*)AVG(c.price)

SELECT b.id

Tez avoids unneeded writes to HDFS

HDP delivers a Centralized Architecture

Other Pure Play Vendors A siloed “with” YARN architecture

Disjoint, Siloed Clusters• Inefficient use of resources, single tenant, duplicate storage & processing

• Multiple implementations of governance, security and operations

• New applications require new clusters

Hortonworks Data PlatformA centralized architecture built on YARN

Application

Security

Storage

Governance

Operations

Storage

Governance Security

Operations

Resource Management

Existing Applications

NewAnalytics

Partner Applications

(ie. SAS)

Application

Security

Storage

Governance

Operations

Application

Security

Storage

Governance

Operations

Interactive

Dedicated Resource mgt

Real-time

Dedicated Resource mgt

Single cluster, multiple applications• Efficient storage, processing

• Centralized Security, Operations, Governance

• Run a variety of applications simultaneously

Data Access: Batch, Interactive & Real-time

{Processing + Storage}

={MapReduce/YARN + HDFS}

={Core Hadoop}

Modern Data Architecture emerges to unify data & processing

Modern Data Architecture

• Enable applications to have access to all your enterprise data through an efficient centralized platform

• Supported with a centralized approach governance, security and operations

• Versatile to handle any applications and datasets no matter the size or type

Clickstream Web & Social

Geolocation Sensor & Machine

Server Logs

Unstructured

Existing Systems

ERP CRM SCM

Data Marts

Business Analytics

Visualization& Dashboards

ApplicationsBusiness Analytics

Visualization& Dashboards

Interactive Real-TimeBatch Partner ISVBatch BatchMPP

What is Data Access?

Data Access defines ALL the channels through which data can be accessed,

analyzed, cleansed and consumed within Hadoop. Each channel can be

categorized into THREE core patterns; Batch, Interactive and Real-time.

Multiple engines provide optimized access to your

mission critical data.

Access patterns enabled by YARN

BatchNeeds to happen but, no timeframe limitations

Interactive Needs to happen at Human time

Real-Time Needs to happen at Machine Execution time.

1 ° ° ° ° ° ° ° ° °

° ° ° ° ° ° ° ° °

Apache Projects Enable Access Patterns

• Various Open Source projects have incubated in order to meet these access pattern needs

• Today, they can all run on a single cluster on a Single set of data because of YARN!

• ALL powered by a BROAD Open Community

1 ° ° ° ° ° ° ° ° °

° ° ° ° ° ° ° ° °

BatchMapReduce

PigHive

InteractiveSolr

SparkHive

Real-TimeHBase

AccumuloStorm

Scripting Data Flow & ETLApache Pig

• Data flow engine and scripting language (Pig Latin)

• Allows you to transform data and datasets

Advantages over MapReduce• Reduces time to write jobs

• Community support

• Piggybank has a significant number of UDF’s to help adoption

• There are a large number of existing shops using PIG

Pig Latin

• Pig executes in a unique fashion:oDuring execution, each statement is processed by the Pig

interpreter o If a statement is valid, it gets added to a logical plan built by the

interpretero The steps in the logical plan do not actually execute until a

DUMP or STORE command is used

Why use Pig?

• Maybe we want to join two datasets, from different sources, on a common value, and want to filter, and sort, and get top 5 sites

Apache Hive: THE defacto standard for SQL in Hadoop

• What?

• Treat your data in Hadoop as tables

• Provides a standard SQL 92 interface to data in Hadoop

• Why?

• Shipped in every distribution… you already have it (although some do not ship complete versions) Quickly find value in raw data files

• Proven at petabyte scale for both batch and interactive queries

• Compatible with ALL major BI tools such as Tableau, Excel, MicroStrategy, Business Objects, etc…

Hive Architecture

User issues SQL query

Hive parses and plans query

Query converted to MapReduce/Tez and executed on Hadoop

Web UI

JDBC / ODBC

HiveSQL

1HiveServer2 Hive

MR/Tez Compiler

Optimizer

Executor

MetaStore(MySQL, Postgresql,

Oracle)

MapReduce or Tez Job

Data DataData

Hadoop 3Data-local processing

Using Tez for Hive Queries

Set the following property in either hive-site.xml or in your script:

set hive.execution.engine=tez;

SQL ComplianceEvolution of SQL Compliance in HiveSQL Datatypes SQL Semantics

INT/TINYINT/SMALLINT/BIGINT SELECT, INSERT

FLOAT/DOUBLE GROUP BY, ORDER BY, HAVING

BOOLEAN JOIN on explicit join key

ARRAY, MAP, STRUCT, UNION Inner, outer, cross and semi joins

STRING Sub-queries in the FROM clause

BINARY ROLLUP and CUBE

TIMESTAMP UNION

DECIMAL Standard aggregations (sum, avg, etc.)

DATE Custom Java UDFs

VARCHAR Windowing functions (OVER, RANK, etc.)

CHAR Advanced UDFs (ngram, XPath, URL)

Interval Types Sub-queries for IN/NOT IN, HAVING

JOINs in WHERE Clause

INSERT/UPDATE/DELETE

Legend

Hive 10 or earlier

Roadmap

Hive 11

Hive 12

Hive 13

Overview of Stinger

Base OptimizationsGenerate simplified DAGs

In-memory Hash Joins

Vector Query EngineOptimized for modern processor

architectures

TezExpress tasks more simply

Eliminate disk writesPre-warmed Containers

ORCFileColumn Store

High CompressionPredicate / Filter Pushdowns

YARNNext-gen Hadoop data processing

framework

100X+ Faster Time to Insight

Deeper Analytical Capabilities

Performance Optimizations

Query PlannerIntelligent Cost-Based Optimizer

System

Engine

YARN : Data Operating System

°1 ° ° ° ° ° ° ° °

° ° ° ° ° ° ° ° °

° ° ° ° ° ° °

° ° ° ° ° ° N

BatchMapReduce

Real-TimeSlider

Direct

Java.NET

Scripting

Cascading

JavaScala

HBaseAccumulo

Stream

OtherISV

Applications

Others

Spark Other ISV

HDP 2.2 HDP 2.2

HDP 2.2TezTezTez Tez

YARN: Resource Manager for Hadoop 2.0

FlexibleEnables other purpose-built data processing models beyond MapReduce (batch), such as interactive and streaming

EfficientDouble processing IN Hadoop on the same hardware while providing predictable performance & quality of service

SharedProvides a stable, reliable, secure foundation and shared operational services across multiple workloads

Data Processing Engines Run Natively IN Hadoop

Hive & Pig

Hive & Pig work well together and many customers use both

Hive is a good choice:

• if you are familiar with SQL

• when you want to query data

• when you need an answer to specific questions

Pig is a good choice:

• For ETL (Extract, Transform, Load)

• for preparing data for analysis

• when you have a long series of steps to perform

Pig and Hive Sample Scenario

Hadoop DistributedFile System

StructuredData

RawData

1. Put the data into HDFS in its raw format

Answers to questions = $$

2. Use Pig to explore andtransform

3. Data analysts use Hive to query the data

4. Data scientists use MapReduce, R, Mahout and Spark to mine the data

Hidden gems = $$

Big Data ETL Life Cycle

Mobile Apps

Transactions,OLTP, OLAP

Social Media, Web Logs

Machine Device, Scientific

Documents and Emails

9. Govern & enrich with metadata

3. Stream real-time data

8. Explore & validate data

4. Mask sensitive data

2. Replicate changed data & schemas

Visualization& Analytics

11. Subscribe to datasets

Data Mart

1. Load or archive batch data

Data Access & Query

5. Access customer “golden record

10. Correlate real-time events with historical patterns & trends

6. Transform & refine data

7. Move results to EDW

HDP: Any Data, Any Application, Anywhere

Any Application

• Deep integration with ecosystem partners to extend existing investments and skills

• Broadest set of applications through the stable of YARN-Ready applications

Any DataDeploy applications fueled by clickstream, sensor, social, mobile, geo-location, server log, and other new paradigm datasets with existing legacy datasets.

AnywhereImplement HDP naturally across the complete range of deployment options

Clickstream Web & Social

Geolocation Internet of Things

Server Logs

Files, emailsERP CRM SCM

hybrid

commodity appliance cloud

Over 70 Hortonworks Certified YARN Apps

What next? -> developer.hortonworks.com

Thank you!

rafael@hortonworks.com

@racoss

IoT Data Discovery Lab

• A trucking company has over 100 trucks.• The geolocation data collected from the trucks contains events generated

while the truck drivers are driving.• The company’s goal with Hadoop is to Mitigate Risk:

o Understand correlations between miles driven and eventso Compute the risk factor for each driver based on mileage & events

o Lab Envo Sandbox 2.3 TP

o Lab Doco URL: http://ow.ly/Qv1JMo Load Datao Query Datao Process Data

Move Data Into Hadoop

Geolocation.csv

trucks.csv

Geolocation_stage Geolocation

Trucks_stage Trucks

csv ORC

Geolocation

Trucks

PIGRisk Calculation

Truck_mileage

Avg_mileage

DriverMileage

RiskFactor

Events

Trucking Risk Analysis – Hadoop ELT

Calculate Risk

Cautionary Statement Regarding Forward-Looking Statements

This presentation contains forward-looking statements involving risks and uncertainties. Such forward-looking statements in this presentation generally relate to future events, our ability to increase the number of support subscription customers, the growth in usage of the Hadoop framework, our ability to innovate and develop the various open source projects that will enhance the capabilities of the Hortonworks Data Platform, anticipated customer benefits and general business outlook. In some cases, you can identify forward-looking statements because they contain words such as “may,” “will,” “should,” “expects,” “plans,” “anticipates,” “could,” “intends,” “target,” “projects,” “contemplates,” “believes,” “estimates,” “predicts,” “potential” or “continue” or similar terms or expressions that concern our expectations, strategy, plans or intentions. You should not rely upon forward-looking statements as predictions of future events. We have based the forward-looking statements contained in this presentation primarily on our current expectations and projections about future events and trends that we believe may affect our business, financial condition and prospects. We cannot assure you that the results, events and circumstances reflected in the forward-looking statements will be achieved or occur, and actual results, events, or circumstances could differ materially from those described in the forward-looking statements.

The forward-looking statements made in this prospectus relate only to events as of the date on which the statements are made and we undertake no obligation to update any of the information in this presentation.

TrademarksHortonworks is a trademark of Hortonworks, Inc. in the United States and other jurisdictions. Other names used herein may be trademarks of their respective owners.

A Definition of Open Enterprise Hadoop

Provision, Manage & Monitor

AmbariZookeeper

Scheduling

Load data and manage

according to policy

Provide layered approach to

security through Authentication, Authorization,

Accounting, and Data Protection

SECURITYGOVERNANCE

Deploy and effectively

manage the platform

° ° ° ° ° ° ° ° ° ° ° ° ° ° °

BATCH, INTERACTIVE & REAL-TIME DATA ACCESS

YARN: Data Operating System(Cluster Resource Management)

OPERATIONS

Batch Interactive Real-Time

Big Data ETL Life Cycle

Mobile Apps

Transactions,OLTP, OLAP

Social Media, Web Logs

Machine Device, Scientific

Documents and Emails

9. Govern & enrich with metadata

3. Stream real-time data

8. Explore & validate data

4. Mask sensitive data

2. Replicate changed data & schemas

Visualization& Analytics

11. Subscribe to datasets

Data Mart

1. Load or archive batch data

Data Access & Query

5. Access customer “golden record

10. Correlate real-time events with historical patterns & trends

6. Transform & refine data

7. Move results to EDW

DataDataData

DataData

Data DataDataSchemaData

DataData

ETL ETL

DataDataData

DataData

Data DataDataSchemaData

DataData

ETL ETL

Fragile workflows make supporting the analytical models you want expensive and time-consuming.

Options for Data Input

MapReduce

WebHDFShadoop fs -put

Vendor Connectors

Hadoop

nfs gateway

Hue Explorer

Risk Factors Viewed in a Graph

Risk Factors Viewed on a Map

hadoop crashcourse v3

data locality data

fidelity of data

data sets

data center scalable

new data paradigm

mapreduce process data

apache hadoop

data enterprise blind

Software

bigdata hadoop course content · industries using hadoop....

· (page views ? hourly? monthly hadoop node hadoop node...

2. hadoop -...

hadoop 1.0 vs hadoop 2.0

hadoop crash course hadoop summit sj

201305 hadoop jpl-v3

snapshotting in hadoop distributed file system for hadoop...

hue: the hadoop ui - hadoop singapore

the great twitter crashcourse for the sharepoint community

configuración para hadoop configuración de wps para...

why use hadoop?, challenges / learning hadoop & average...

big data hadoop architect v3 -...

hadoop , hadoop , hadoop !!!

unfccc crashcourse introduction to the climate regime...

hadoop installation guide | hadoop configuration

powershell crashcourse

introduction to hadoop and hadoop component

analyzing hadoop with hadoop

corehab crashcourse

hadoop training #4: programming with hadoop