A (very) short intro to Hadoop

Copyright (c) 2008-2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden. A very short Intro to Hadoop 1 photo by: i_pinz, flickr Scale Unlimited, Inc Monday, December 19, 2011

Upload: ken-krugler

Post on 27-Jan-2015

Category:

Technology


DESCRIPTION

A very short introduction to Hadoop, from the talk I gave at the BigDataCamp held in Washington, DC in November 2011. Some of this content is also covered in the various big data classes we offer via on-site training (see http://www.scaleunlimited.com/training/).

TRANSCRIPT

Page 1: A (very) short intro to Hadoop

Copyright (c) 2008-2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.

A very short Intro to Hadoop

Photo by: i_pinz, flickr

Scale Unlimited, Inc

Monday, December 19, 2011

Page 2: A (very) short intro to Hadoop

Welcome to Intro to Hadoop

Usually a 4 hour online class

Focus on Why, What, How of Hadoop

Page 3: A (very) short intro to Hadoop

Meet Your Instructor

Developer of Bixo web mining toolkit

Committer on Apache Tika

Hadoop developer and trainer

Solr developer and trainer

Page 4: A (very) short intro to Hadoop

Overview

Photo by: exfordy, flickr

Page 5: A (very) short intro to Hadoop

How to Crunch a Petabyte?

Lots of disks, spinning all the time

Redundancy, since disks die

Lots of CPU cores, working all the time

Retry, since network errors happen
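
The need for "lots of disks" can be made concrete with some back-of-envelope arithmetic. The sketch below assumes a sustained read rate of 100 MB/s per disk; that figure is an illustrative assumption, not a number from the talk.

```python
# Back-of-envelope: how long does it take to read one petabyte?
# ASSUMPTION: 100 MB/s sustained sequential read per disk (illustrative only).

PETABYTE = 10**15            # bytes
DISK_RATE = 100 * 10**6      # bytes per second per disk (assumed)


def scan_hours(total_bytes: int, num_disks: int, rate: int = DISK_RATE) -> float:
    """Hours needed to read total_bytes when num_disks read in parallel."""
    seconds = total_bytes / (num_disks * rate)
    return seconds / 3600


if __name__ == "__main__":
    for disks in (1, 100, 10_000):
        print(f"{disks:>6} disks: {scan_hours(PETABYTE, disks):,.1f} hours")
```

Under that assumption a single disk needs roughly 2,778 hours (about 116 days) to read the data once, while 10,000 disks reading in parallel need well under an hour.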

Page 6: A (very) short intro to Hadoop

Hadoop to the Rescue

Scalable - many servers with lots of cores and spindles

Reliable - detect failures, redundant storage

Fault-tolerant - auto-retry, self-healing

Simple - use many servers as one really big computer

Page 7: A (very) short intro to Hadoop

Logical Architecture

Logically, Hadoop is simply a computing cluster that provides:

a Storage layer, and

an Execution layer

[Diagram: a Cluster box containing a Storage layer and an Execution layer]

Page 8: A (very) short intro to Hadoop

Storage Layer

Hadoop Distributed File System (aka HDFS, or Hadoop DFS)

Runs on top of regular OS file system, typically Linux ext3

Fixed-size blocks (64MB by default) that are replicated

Write once, read many; optimized for streaming in and out
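
As a rough illustration of what fixed-size, replicated blocks mean for a single file, the sketch below counts blocks and raw storage. The 64MB block size comes from the slide; the replication factor of 3 is the typical value mentioned later in the talk. The function names are hypothetical.

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024   # 64MB default block size, per the slide
REPLICATION = 3                 # typical replication factor


def block_count(file_bytes: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of blocks needed to hold a file (the last block may be partial)."""
    return math.ceil(file_bytes / block_size)


def raw_bytes(file_bytes: int, replication: int = REPLICATION) -> int:
    """Raw disk space used across the cluster once every block is replicated."""
    # A partial last block only occupies its actual length, so the raw
    # footprint is the file size times the replication factor.
    return file_bytes * replication


if __name__ == "__main__":
    one_gb = 1024 ** 3
    print(block_count(one_gb), "blocks,", raw_bytes(one_gb) // one_gb, "GB raw")
```

A 1 GB file therefore becomes 16 blocks, stored as 48 block replicas that take up 3 GB of raw disk.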

Page 9: A (very) short intro to Hadoop

Execution Layer

Hadoop Map-Reduce

Responsible for running a job in parallel on many servers

Handles re-trying a task that fails, validating complete results

Jobs consist of special “map” and “reduce” operations

Page 10: A (very) short intro to Hadoop

Scalable

Virtual execution and storage layers span many nodes (servers)

Scales linearly (sort of) with cores and disks.

[Diagram: a Cluster made up of Racks, each holding Nodes; every Node contributes cpu to the Execution layer and disk to the Storage layer]

Page 11: A (very) short intro to Hadoop

Reliable

Each block is replicated, typically three times.

Each task must succeed, or the job fails

Page 12: A (very) short intro to Hadoop

Fault-tolerant

Failed tasks are automatically retried.

Failed data transfers are automatically retried.

Servers can join and leave the cluster at any time.
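
The retry behaviour can be pictured as a small loop around each task. This is only a sketch of the idea, not Hadoop's implementation; `run_with_retries`, `make_flaky_task`, and the limit of 4 attempts are hypothetical names and values chosen for illustration.

```python
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def run_with_retries(task: Callable[[], T], max_attempts: int = 4) -> Tuple[T, int]:
    """Run task() until it succeeds; return (result, attempts used).

    If every attempt fails, the error is raised, which corresponds to
    the whole job failing.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return task(), attempt
        except Exception as err:  # a real framework would inspect the error type
            last_error = err
    raise RuntimeError(f"task failed after {max_attempts} attempts") from last_error


def make_flaky_task(failures_before_success: int) -> Callable[[], str]:
    """Build a task that fails a fixed number of times, then succeeds."""
    state = {"calls": 0}

    def task() -> str:
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise IOError("simulated network error")
        return "ok"

    return task


if __name__ == "__main__":
    print(run_with_retries(make_flaky_task(2)))
```

A task that fails twice and then succeeds completes on its third attempt; a task that never succeeds exhausts its attempts and takes the job down with it.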

Page 13: A (very) short intro to Hadoop

Simple

Reduces complexity

Conceptual “operating system” that spans many CPUs & disks

[Diagram: a row of Nodes, each contributing cpu to the Execution layer and disk to the Storage layer]

Page 14: A (very) short intro to Hadoop

Typical Hadoop Cluster

Has one “master” server - high quality, beefy box.

NameNode process - manages file system

JobTracker process - manages tasks

Has multiple “slave” servers - commodity hardware.

DataNode process - manages file system blocks on local drives

TaskTracker process - runs tasks on server

Uses high speed network between all servers

Page 15: A (very) short intro to Hadoop

Architectural Components

[Diagram: on the Master, the NameNode and the JobTracker, which receives jobs from the Client; on the Slaves, DataNodes holding data blocks and TaskTrackers running tasks in mapper and reducer child JVMs]

Solid boxes are unique applications

Dashed boxes are child JVM instances (on same node as parent)

Dotted boxes are blocks of managed files (on same node as parent)

Page 16: A (very) short intro to Hadoop

Distributed File System

Photo by: Graham Racher, flickr

Page 17: A (very) short intro to Hadoop

Virtual File System

Treats many disks on many servers as one huge, logical volume

Data is stored in 1...n blocks

The “DataNode” process manages blocks of data on a slave.

The “NameNode” process keeps track of file metadata on the master.

Page 18: A (very) short intro to Hadoop

Replication

Each block is stored on several different disks (default is 3)

Hadoop tries to copy blocks to different servers and racks.

Protects data against disk, server, rack failures.

Reduces the need to move data to code.
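
A toy placement routine shows how spreading copies across servers and racks protects a block. This is a simplified sketch loosely modeled on rack-aware placement; the exact rules differ between Hadoop versions, and the policy used here (first copy on the writing node, the other two on different nodes of one other rack) is an assumption. All names are hypothetical.

```python
import random
from typing import Dict, List


def rack_of(topology: Dict[str, List[str]], node: str) -> str:
    """Return the rack that contains the given node."""
    for rack, nodes in topology.items():
        if node in nodes:
            return rack
    raise ValueError(f"unknown node: {node}")


def place_replicas(topology: Dict[str, List[str]], writer: str,
                   rng: random.Random) -> List[str]:
    """Choose three distinct nodes for one block (assumed policy, see lead-in)."""
    local_rack = rack_of(topology, writer)
    # Candidate remote racks need at least two nodes to hold two copies.
    remote_racks = [r for r, nodes in topology.items()
                    if r != local_rack and len(nodes) >= 2]
    if not remote_racks:
        raise ValueError("need another rack with at least two nodes")
    remote_rack = rng.choice(remote_racks)
    second, third = rng.sample(topology[remote_rack], 2)
    return [writer, second, third]


if __name__ == "__main__":
    topology = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4", "n5"]}
    print(place_replicas(topology, "n1", random.Random(0)))
```

With copies on three different servers in two racks, losing any single disk, server, or rack still leaves at least one readable replica.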

Page 19: A (very) short intro to Hadoop

Error Recovery

Slaves constantly “check in” with the master.

Data is automatically replicated if a disk or server “goes away”.
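
The check-in and re-replication idea can be sketched in a few lines. This is a conceptual model only: the timeout value, the data structures, and the function names are hypothetical, and the real NameNode logic is considerably more involved.

```python
from typing import Dict, Set


def dead_nodes(last_heartbeat: Dict[str, float], now: float,
               timeout: float) -> Set[str]:
    """Nodes whose most recent check-in is older than the timeout."""
    return {node for node, seen in last_heartbeat.items() if now - seen > timeout}


def under_replicated(block_locations: Dict[str, Set[str]], dead: Set[str],
                     target: int = 3) -> Dict[str, int]:
    """Map block id -> extra copies needed once dead nodes are ignored."""
    needed = {}
    for block, nodes in block_locations.items():
        live = len(nodes - dead)
        if live < target:
            needed[block] = target - live
    return needed


if __name__ == "__main__":
    heartbeats = {"n1": 100.0, "n2": 995.0, "n3": 990.0, "n4": 998.0}
    blocks = {"blk_1": {"n1", "n2", "n3"}, "blk_2": {"n2", "n3", "n4"}}
    gone = dead_nodes(heartbeats, now=1000.0, timeout=600.0)  # timeout is assumed
    print(gone, under_replicated(blocks, gone))
```

In this example node n1 has stopped checking in, so blk_1 is down to two live copies and needs one more, while blk_2 is unaffected.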

Page 20: A (very) short intro to Hadoop

Limitations

Optimized for streaming data in/out

So no random access to data in a file

Data rates ≈ 30% - 50% of max raw disk rate

No append support currently

Write once, read many

Page 21: A (very) short intro to Hadoop

NameNode

Runs on master node

Is a single point of failure

There are no built-in software hot failover mechanisms

Maintains filesystem metadata, the “Namespace”

files and hierarchical directories

Page 22: A (very) short intro to Hadoop

DataNodes

Files stored on HDFS are chunked and stored as blocks on DataNodes

Each DataNode manages storage attached to the node that it runs on

Data never flows through NameNode, only DataNodes

[Diagram: four DataNodes, each holding data blocks]

Page 23: A (very) short intro to Hadoop

Map-Reduce Paradigm

Photo by: Graham Racher, flickr

Page 24: A (very) short intro to Hadoop

Definitions

Key Value Pair -> two units of data, exchanged between Map & Reduce

Map -> The ‘map’ function in the MapReduce algorithm

user defined

converts each input Key Value Pair to 0...n output Key Value Pairs

Reduce -> The ‘reduce’ function in the MapReduce algorithm

user defined

converts each input Key + all Values to 0...n output Key Value Pairs

Group -> A built-in operation that happens between Map and Reduce

ensures each Key passed to Reduce includes all Values

Page 25: A (very) short intro to Hadoop

All Together

[Diagram: five Mappers feeding three Reducers, each Reducer preceded by a Shuffle step]

Page 26: A (very) short intro to Hadoop

How MapReduce Works

Map translates input keys and values to new keys and values

System Groups each unique key with all its values

Reduce translates the values of each unique key to new keys and values
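
The three steps above can be sketched as a tiny in-memory engine. This is a single-process illustration of the data flow, not how Hadoop is implemented; `map_reduce` and the example functions are hypothetical names.

```python
from collections import defaultdict
from typing import Any, Callable, Iterable, List, Tuple

Pair = Tuple[Any, Any]


def map_reduce(records: Iterable[Pair],
               map_fn: Callable[[Any, Any], Iterable[Pair]],
               reduce_fn: Callable[[Any, List[Any]], Iterable[Pair]]) -> List[Pair]:
    """Run Map, then Group, then Reduce over an in-memory list of records."""
    # Map: each input pair yields zero or more output pairs.
    mapped = []
    for key, value in records:
        mapped.extend(map_fn(key, value))

    # Group: collect every value that shares a key.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)

    # Reduce: each key and its full list of values yields output pairs.
    # Keys are sorted here only to make the output order predictable.
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output


if __name__ == "__main__":
    # Hypothetical input: (line number, "year,temperature") records.
    readings = [(0, "2010,31"), (1, "2011,28"), (2, "2010,35")]

    def parse(_, line):
        year, temp = line.split(",")
        return [(year, int(temp))]

    def hottest(year, temps):
        return [(year, max(temps))]

    print(map_reduce(readings, parse, hottest))
```

Only `map_fn` and `reduce_fn` change from job to job; the grouping in the middle is supplied by the framework.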

Map: [K1,V1] -> [K2,V2]

Group: [K2,V2] -> [K2,{V2,V2,....}]

Reduce: [K2,{V2,V2,....}] -> [K3,V3]

Page 27: A (very) short intro to Hadoop

Canonical Example - Word Count

Word Count

Read a document, parse out the words, count the frequency of each word

Specifically, in MapReduce

With a document consisting of lines of text

Translate each line of text into key = word and value = 1

e.g. <“the”,1> <“quick”,1> <“brown”,1> <“fox”,1> <“jumped”,1>

Sum the values (ones) for each unique word
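
Word count as described above fits in a few lines of plain Python. This is a single-process sketch of the logic, not Hadoop API code; the function names are hypothetical, and words are split on whitespace without any case folding or punctuation handling.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def map_line(line_number: int, line: str) -> List[Tuple[str, int]]:
    """Map: turn one line of text into (word, 1) pairs."""
    return [(word, 1) for word in line.split()]


def group(pairs: List[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group: gather all the ones emitted for each unique word."""
    grouped = defaultdict(list)
    for word, one in pairs:
        grouped[word].append(one)
    return dict(grouped)


def reduce_word(word: str, ones: List[int]) -> Tuple[str, int]:
    """Reduce: sum the ones for a single word."""
    return word, sum(ones)


def word_count(lines: List[str]) -> Dict[str, int]:
    pairs = []
    for line_number, line in enumerate(lines, start=1):
        pairs.extend(map_line(line_number, line))
    return dict(reduce_word(word, ones) for word, ones in group(pairs).items())


if __name__ == "__main__":
    print(word_count(["the quick brown fox", "jumped over the lazy dog"]))
```

The line number plays the role of the input key and is ignored by the map step, which is common when the input is plain text.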

Page 28: A (very) short intro to Hadoop

[Diagram: word count data flow; Map and Reduce are user defined, Group is built in]

Input [K1,V1]:

[1, When in the Course of human events, it becomes]

[2, necessary for one people to dissolve the political bands]

[3, which have connected them with another, and to assume]

[n, {k,k,k,k,...}]

Map -> [K2,V2]:

[When,1] [in,1] [the,1] [Course,1] [k,1]

Group -> [K2,{V2,V2,....}]:

[When,{1,1,1,1}] [people,{1,1,1,1,1,1}] [dissolve,{1,1,1}] [connected,{1,1,1,1,1}] [k,{v,v,...}]

Reduce -> [K3,V3]:

[When,4] [people,6] [dissolve,3] [connected,5] [k,sum(v,v,v,..)]

Page 29: A (very) short intro to Hadoop

Divide & Conquer (splitting data)

Because

The Map function only cares about the current key and value, and

The Reduce function only cares about the current key and its values

Then

A Mapper can invoke Map on an arbitrary number of input keys and values

or just some fraction of the input data set

A Reducer can invoke Reduce on an arbitrary number of the unique keys

but all the values for that key
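
One way to guarantee that a Reducer sees all the values for each of its keys is to route every pair by a hash of its key. The sketch below uses CRC32 as a stand-in hash so the result is repeatable across runs; that choice and the function names are assumptions for illustration.

```python
import zlib
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def partition(key: str, num_reducers: int) -> int:
    """Pick a reducer for a key; the same key always maps to the same reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(pairs: Iterable[Tuple[str, int]],
            num_reducers: int) -> List[Dict[str, List[int]]]:
    """Route each (key, value) pair to its reducer and group values by key."""
    reducers = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        reducers[partition(key, num_reducers)][key].append(value)
    return [dict(r) for r in reducers]


if __name__ == "__main__":
    pairs = [("a", 1), ("b", 1), ("a", 1), ("c", 1), ("a", 1)]
    for index, bucket in enumerate(shuffle(pairs, 3)):
        print(f"reducer {index}: {bucket}")
```

Each reducer ends up with some subset of the unique keys, but never with only part of a key's values.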

Page 30: A (very) short intro to Hadoop

[Diagram: an input file is divided into splits (split1, split2, split3, split4, ...); each split feeds a Mapper ([K1,V1] -> [K2,V2]); the Shuffle groups the map output ([K2,V2] -> [K2,{V2,V2,...}]); each Reducer writes [K3,V3] to its own part file (part-00000, part-00001, ... part-000N) in the output directory]

Mappers must complete before Reducers can begin

Page 31: A (very) short intro to Hadoop

Divide & Conquer (parallelizable)

Because

Each Mapper is independent and processes part of the whole, and

Each Reducer is independent and processes part of the whole

Then

Any number of Mappers can run on each node, and

Any number of Reducers can run on each node, and

The cluster can contain any number of nodes
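
Because each Mapper works only on its own split, the amount of parallelism does not change the answer. The sketch below uses a thread pool as a stand-in for cluster nodes, which is purely an illustrative assumption; the function names are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import List


def map_split(split: List[str]) -> Counter:
    """One mapper: count words in its own split, touching no shared state."""
    counts = Counter()
    for line in split:
        counts.update(line.split())
    return counts


def run_job(splits: List[List[str]], workers: int) -> Counter:
    """Process every split with the given number of parallel workers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_split, splits))
    # Merge the independent partial results into one total.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


if __name__ == "__main__":
    splits = [["a b a"], ["b c"], ["a c c"], ["d"]]
    print(run_job(splits, workers=1) == run_job(splits, workers=4))
```

Running the same splits with one worker or four produces identical totals; only the elapsed time would differ on real data.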

Page 32: A (very) short intro to Hadoop

JobTracker

Is a single point of failure

Determines # Mapper Tasks from file splits via InputFormat

Uses predefined value for # Reducer Tasks

Client applications use JobClient to submit jobs and query status

Command line use hadoop job <commands>

Web status console use http://jobtracker-server:50030/
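
How the number of Mapper tasks falls out of the file splits can be approximated with simple arithmetic. The sketch assumes every input file is split independently and that the split size equals the 64MB default block size; real InputFormats can behave differently (for example with minimum or maximum split sizes, or files that cannot be split), so treat this as an estimate. The function name is hypothetical.

```python
import math
from typing import Iterable

MB = 1024 * 1024


def estimate_map_tasks(file_sizes: Iterable[int], split_size: int = 64 * MB) -> int:
    """Estimate the mapper count as one task per split, one or more splits per file."""
    return sum(math.ceil(size / split_size) for size in file_sizes)


if __name__ == "__main__":
    inputs = [200 * MB, 10 * MB, 64 * MB]   # hypothetical input files
    print(estimate_map_tasks(inputs))
```

Under these assumptions a 200MB file contributes four map tasks and each of the smaller files contributes one, which also shows why many small files lead to many short-lived tasks.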

[Diagram: the Client submits jobs to the JobTracker]

Page 33: A (very) short intro to Hadoop

TaskTracker

Spawns each Task as a new child JVM

Max # mapper and reducer tasks set independently

Can pass child JVM opts via mapred.child.java.opts

Can re-use JVM to avoid overhead of task initialization

[Diagram: a TaskTracker running mapper and reducer tasks, each in its own child JVM]

Page 34: A (very) short intro to Hadoop

Hadoop (sort of) Deep Thoughts

Photo by: Graham Racher, flickr

Page 35: A (very) short intro to Hadoop

Avoiding Hadoop

Hadoop is a big hammer - but not every problem is a nail

Small data

Real-time data

Beware the Hadoopaphile

Page 36: A (very) short intro to Hadoop

Avoiding Map-Reduce

Writing Hadoop code is painful and error prone

Hive & Pig are good solutions for People Who Like SQL

Cascading is a good solution for complex, stable workflows

Page 37: A (very) short intro to Hadoop

Leveraging the Eco-System

Many open source projects built on top of Hadoop

HBase - scalable NoSQL data store

Sqoop - getting SQL data in/out of Hadoop

Other projects work well with Hadoop

Kafka/Scribe - getting log data in/out of Hadoop

Avro - data serialization/storage format

Page 38: A (very) short intro to Hadoop

Get Involved

Join the mailing list - http://hadoop.apache.org/mailing_lists.html

Go to user group meetings - e.g. http://hadoop.meetup.com/

Page 39: A (very) short intro to Hadoop

Learn More

Buy the book - Hadoop: The Definitive Guide, 2nd edition

Try the tutorials

http://hadoop.apache.org/common/docs/current/mapred_tutorial.html

Get training (danger, personal plug)

http://www.scaleunlimited.com/training

http://www.cloudera.com/hadoop-training

Page 40: A (very) short intro to Hadoop

Resources

Scale Unlimited Alumni list - [email protected]

Hadoop mailing lists - http://hadoop.apache.org/mailing_lists.html

Users groups - e.g. http://hadoop.meetup.com/

Hadoop API - http://hadoop.apache.org/common/docs/current/api/

Hadoop: The Definitive Guide, 2nd edition by Tom White

Cascading: http://www.cascading.org

Datameer: http://www.datameer.com
