Introduction to Hadoop


Page 1: Introduction to Hadoop

Hadoop – Taming Big Data
Jax ArcSig, June 2012

Ovidiu Dimulescu

Page 2: Introduction to Hadoop

About @odimulescu

• Working on the Web since 1997
• Likes stuff well done
• Into engineering cultures and all-around automation
• Speaker at local user groups
• Organizer for the local Mobile User Group, jaxmug.com

Page 3: Introduction to Hadoop

Agenda

• Introduction
• Use cases
• Architecture
• MapReduce examples
• Q&A

Page 4: Introduction to Hadoop

What is Hadoop?

• Apache Hadoop is an open-source Java software framework for running data-intensive applications on large clusters of commodity hardware

• Created by Doug Cutting (Lucene & Nutch creator)

• Named after Doug's son's toy elephant

Page 5: Introduction to Hadoop

What is it solving, and how?

• Processing diverse, large datasets in practical time at low cost

• Consolidates data in a distributed file system

• Moves computation to data rather than data to computation

• Simpler programming model

Page 6: Introduction to Hadoop

Why does it matter?

• Volume, Velocity, Variety and Value

• Datasets do not fit on local HDDs, let alone in RAM

• Data grows at a tremendous pace

• Data is heterogeneous

• Scaling up is expensive (licensing, CPUs, disks, interconnects, etc.)

• Scaling up has a ceiling (physical, technical, etc.)

Page 7: Introduction to Hadoop

Why does it matter?

[Chart: data types by volume]

• Complex data - 80%: images, video, logs, documents, call records, sensor data, mail archives
• Structured data - 20%: user profiles, CRM, HR records

* Chart source: IDC White Paper

Page 8: Introduction to Hadoop

Why does it matter?

• Need to process a 10 TB dataset

• Assume sustained transfer of 75 MB/s

• On 1 node - scanning the data takes ~ 2 days

• On a 10-node cluster - scanning takes ~ 5 hrs (rough arithmetic below)

• Low $/TB for commodity drives

• Low-end servers are multicore capable
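As a rough sanity check of those numbers (a back-of-the-envelope estimate assuming purely sequential scanning and ignoring seek and coordination overhead):

10 TB ≈ 10,000,000 MB
10,000,000 MB ÷ 75 MB/s ≈ 133,000 s ≈ 37 hrs — call it ~2 days with real-world overhead
÷ 10 nodes ≈ 3.7 hrs — call it ~5 hrs with overhead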

Page 9: Introduction to Hadoop

Use  Cases    

• ETL - Extract, Transform, Load

• Recommendation Engines

• Customer Churn Analysis

• Ad Targeting

• Data "sandbox"

Page 10: Introduction to Hadoop

Use Cases - Typical ETL

[Diagram: Live DB and Logs feed ETL jobs (ETL 1, ETL 2) that load a Reporting DB and a Data Warehouse, which in turn serve BI Applications]

Page 11: Introduction to Hadoop

Use Cases - Hadoop ETL

[Diagram: Live DB and Logs are loaded into Hadoop, which does the transformation work and loads the results into the Reporting DB and Data Warehouse serving BI Applications]

Page 12: Introduction to Hadoop

Use Cases – Analysis methods

• Pattern recognition

• Index building

• Text mining

• Collaborative filtering

• Prediction models

• Sentiment analysis

• Graph creation and traversal

Page 13: Introduction to Hadoop

Who  uses  it?  

Page 14: Introduction to Hadoop

Who  supports  it?  

Page 15: Introduction to Hadoop

Why  use  Hadoop?  

• Practical to do things that were previously not

ü Shorter execution time
ü Costs less
ü Simpler programming model

• Open system with greater flexibility

• Large and growing ecosystem

Page 16: Introduction to Hadoop

Hadoop  –  Silver  bullet?  

• Not a database replacement

• Not a data warehouse (it complements one)

• Not for interactive reporting

• Not a general-purpose storage mechanism

• Not for problems that are not parallelizable in a shared-nothing fashion

Page 17: Introduction to Hadoop

Architecture  –  Design  Axioms  

•  System  Shall  Manage  and  Heal  Itself  

•  Performance  Shall  Scale  Linearly    

• Compute Should Move to Data

• Simple Core, Modular and Extensible

Page 18: Introduction to Hadoop

Architecture  –  Core  Components  

HDFS

Distributed filesystem designed for low-cost storage and high-bandwidth access across the cluster.

MapReduce

Programming model for processing and generating large data sets.

Page 19: Introduction to Hadoop

Architecture  –  Official  Extensions  

[Diagram: the official extension stack, by layer]

• Storage: HDFS, HBase
• Data Processing: MapReduce framework
• Management: ZooKeeper, Chukwa
• Data Access: Pig (Data Flow), Hive (SQL), Avro

Page 20: Introduction to Hadoop

Architecture – CDH Distribution

1. CDH – Cloudera's Distribution of Hadoop
2. Image credit: Cloudera presentation @ Microstrategy World 2011

Page 21: Introduction to Hadoop

HDFS - Design

• Based on Google's GFS

• Files are stored as blocks (64 MB default size)

• Configurable data replication (3x, rack-aware; see the sketch below)

• Fault tolerant; expects HW failures

• Huge files; expects streaming access, not low latency

• Mostly WORM (write once, read many)
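Both the block size and the replication factor are tunable per cluster, and even per client. A minimal sketch using the Hadoop 1.x-era Configuration API (the property names are the 1.x ones; HdfsSettings is an illustrative class name):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class HdfsSettings {
  public static void main(String[] args) throws Exception {
    // Loads core-site.xml / hdfs-site.xml from the classpath if present
    Configuration conf = new Configuration();
    conf.setInt("dfs.replication", 3);                  // replicate each block 3x (the default)
    conf.setLong("dfs.block.size", 64L * 1024 * 1024);  // 64 MB blocks (the 1.x default)
    // Files written through this FileSystem handle pick up the settings above
    FileSystem fs = FileSystem.get(conf);
    System.out.println("Connected to: " + fs.getUri());
  }
}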

Page 22: Introduction to Hadoop

HDFS - Architecture

[Diagram: Namenode (NN) master coordinating Datanodes 1..N]

Namenode - Master
• Filesystem metadata
• Controls read/write to files
• Manages block replication
• Applies the transaction log on startup

Datanode - Slaves
• Read/write blocks to/from clients
• Replicate blocks at the master's request

Read path: the client asks the NN for a file, the NN returns the DNs that host its blocks, and the client reads the data directly from those DNs.
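That read path is hidden behind the FileSystem API. A minimal client sketch (the /data/sample.txt path is hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCat {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();  // cluster settings from core-site.xml
    FileSystem fs = FileSystem.get(conf);
    // Opening a file consults the NameNode for the block locations...
    FSDataInputStream in = fs.open(new Path("/data/sample.txt"));
    byte[] buf = new byte[4096];
    int n;
    while ((n = in.read(buf)) > 0) {
      System.out.write(buf, 0, n);  // ...while the bytes stream directly from the DataNodes
    }
    in.close();
    fs.close();
  }
}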

Page 23: Introduction to Hadoop

HDFS – Fault tolerance

• DataNode

§ Uses CRC checksums to detect corruption
§ Data is replicated on other nodes (3x)

• NameNode

§ Checkpoint NameNode
§ Backup NameNode
§ Failover is manual

Page 24: Introduction to Hadoop

MapReduce - Design

• Based on Google's MR paper
• Borrows from functional programming
• Simpler programming model

§ map(in_key, in_value) -> (out_key, intermediate_value) list

§ reduce(out_key, intermediate_value list) -> out_value list

• No user-level synchronization or coordination is required

Input -> Map -> Reduce -> Output

Page 25: Introduction to Hadoop

MapReduce - Architecture

[Diagram: JobTracker (JT) master coordinating TaskTrackers 1..N]

JobTracker - Master
• Accepts MR jobs submitted by clients
• Assigns Map and Reduce tasks to TaskTrackers, aware of data locality
• Monitors tasks and TaskTracker status, re-executes tasks upon failure
• Speculative execution

TaskTracker - Slaves
• Run Map and Reduce tasks received from the JobTracker
• Manage storage and transmission of intermediate output

Client launches a job, supplying:
- Configuration
- Mapper
- Reducer
- Input
- Output

Page 26: Introduction to Hadoop

Hadoop - Core Architecture

[Diagram: the JobTracker and NameNode masters sit above the worker nodes; each worker runs a TaskTracker and a DataNode side by side, so compute is co-located with storage]

Hadoop acts like a mini OS:
• File system (HDFS)
• Scheduler (MapReduce)

Page 27: Introduction to Hadoop

MapReduce – Head First Style

Credit: http://www.slideshare.net/esaliya/mapreduce-in-simple-terms

Page 28: Introduction to Hadoop

MapReduce – Mapper Types

One-to-One
map(k, v) = emit(k, transform(v))

Exploder
map(k, v) = foreach p in v: emit(k, p)

Filter
map(k, v) = if cond(v) then emit(k, v)
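As a concrete instance of the Filter type, a minimal Java sketch that keeps only non-empty records (FilterMapper and the condition are illustrative):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class FilterMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Filter mapper: emit (k, v) unchanged only when cond(v) holds
    if (value.getLength() > 0) {
      context.write(key, value);
    }
  }
}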

Page 29: Introduction to Hadoop

MapReduce  –  Reducer  Types  

Sum Reducer

reduce(k, vals) =
    sum = 0
    foreach v in vals: sum += v
    emit(k, sum)

Page 30: Introduction to Hadoop

MapReduce  –  High  level  pipeline  

[Diagram: mapper outputs with keys K1 and K2 are partitioned during the shuffle so that all values for a given key arrive at the same reducer]

Page 31: Introduction to Hadoop

MapReduce  –  Detailed  pipeline  

Diagram: http://developer.yahoo.com/hadoop/tutorial/module4.html

Page 32: Introduction to Hadoop

MapReduce  –  Combiner  Phase  

• Optional
• Runs on mapper nodes after the map phase
• "Mini-reduce," only on local map output
• Used to save bandwidth before sending data to the full reducer
• The Reducer can be the Combiner if:

1. Output key/values are the same as input key/values
2. The operation is commutative and associative (SUM, MAX ok, but AVG is not)

Diagram: http://developer.yahoo.com/hadoop/tutorial/module4.html
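A quick check on condition 2: SUM is safe because partial sums can simply be summed again, but the average of partial averages is not the overall average. For example, avg(1, 2, 3) = 2, yet if one mapper holds (1, 2) and another holds (3), combining per-mapper averages gives avg(avg(1, 2), avg(3)) = avg(1.5, 3) = 2.25.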

Page 33: Introduction to Hadoop

Installation

1.  Download  &  configure  single-­‐node  cluster  

hadoop.apache.org/common/releases.html  

2.  Download  a  demo  VM    

Cloudera, Hortonworks

3.  Use  a  hosted  environment  (Amazon’s  EMR,  Azure)  

Page 34: Introduction to Hadoop

Installation – Platform Notes

Production

    Linux – official

Development

    Linux
    OS X
    Windows via Cygwin
    *nix

Page 35: Introduction to Hadoop

MapReduce – Client Languages

Java, any JVM language – native

C++ – Pipes framework – socket IO

Any language – Streaming – stdin/stdout

Pig Latin, Hive HQL, C via JNI

hadoop pipes -input path_in -output path_out -program exec_program

hadoop jar hadoop-streaming.jar -mapper map_prog -reducer reduce_prog -input path_in -output path_out

hadoop jar jar_path main_class input_path output_path

Page 36: Introduction to Hadoop

MapReduce  –  Client  Anatomy  

• Main Program (aka Driver)

    Configures the Job
    Initiates the Job

• Input Location
• Mapper
• Combiner (optional)
• Reducer
• Output Location
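A minimal driver sketch wiring those pieces together in the org.apache.hadoop.mapreduce API (WordCountDriver and the mapper/reducer class names are illustrative; they match the word-count sketches on the following slides):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");          // Hadoop 1.x-era constructor
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCountMapper.class);
    job.setCombinerClass(WordCountReducer.class);   // optional: SUM is commutative and associative
    job.setReducerClass(WordCountReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input location
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output location
    System.exit(job.waitForCompletion(true) ? 0 : 1);        // configure, then initiate
  }
}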

Page 37: Introduction to Hadoop

MapReduce  –  Word  Count  Example  
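In outline, with a made-up two-line input:

Input:    "the quick fox" / "the lazy fox"
Map:      (the,1) (quick,1) (fox,1) (the,1) (lazy,1) (fox,1)
Shuffle:  the -> [1,1], quick -> [1], fox -> [1,1], lazy -> [1]
Reduce:   (the,2) (quick,1) (fox,2) (lazy,1)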

Page 38: Introduction to Hadoop

MapReduce  –  C#  Mapper  

Page 39: Introduction to Hadoop

MapReduce  –  C#  Reducer  

Page 40: Introduction to Hadoop

MapReduce  –  Java  Mapper  
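A minimal word-count mapper sketch in the org.apache.hadoop.mapreduce API (an Exploder-style mapper that emits (word, 1) per token; WordCountMapper is an illustrative name):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Emit (token, 1) for every word in the input line
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, ONE);
    }
  }
}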

Page 41: Introduction to Hadoop

MapReduce  –  Java  Reducer  
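The matching sum reducer, again a minimal sketch (it can also serve as the combiner, since summing is commutative and associative):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    // Sum the per-word counts emitted by the mappers
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    context.write(key, new IntWritable(sum));
  }
}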

Page 42: Introduction to Hadoop

MapReduce  –  JavaScript  Mapper  

Page 43: Introduction to Hadoop

MapReduce  –  JavaScript  Reducer  

Page 44: Introduction to Hadoop

Summary  

Hadoop is an economical, scalable, distributed data processing system which enables data:

ü Consolidation (Structured or Not)
ü Query Flexibility (Any Language)
ü Agility (Evolving Schemas)

Page 45: Introduction to Hadoop

Questions?

Page 46: Introduction to Hadoop

References  

Hadoop at Yahoo!, by Yahoo! Developer Network
MapReduce in Simple Terms, by Saliya Ekanayake
Hadoop Architecture, by Philippe Julio
10 Hadoop-able Problems, by Cloudera
Hadoop, An Industry Perspective, by Amr Awadallah
Anatomy of a MapReduce Job Run, by Tom White
MapReduce Jobs in Hadoop