
(Mis)Use of Hadoop

John Langford

NYU Large Scale Learning Class, February 19, 2013

(Post presentation version)

The data problem

Traditional high performance computing is about Floating-point Operations Per Second (FLOPS). http://top500.org/lists/2012/11/

Titan (http://www.olcf.ornl.gov/titan/): 17.6 PFlops, 8.2 MW. Flops first, network second, data is irrelevant. Primary use = simulations.

Data is different.
⇒ the need to store information. With large amounts of data, errors should be assumed. Use replication to zero out the chance of error.
⇒ the need for quick access. No single machine can handle all data requests. Use locality to minimize bandwidth.


Hadoop

1 System for data-centric computing.

2 Distributed File System + Map Reduce + advanced components.

3 Java's first serious use as an OS.

4 Open source clone of the GFS + MapReduce system.

5 Yahoo!'s bulk data processing system.

6 The craze at Strata (= data business conference). Nearly every company is interoperating with, extending, and/or using Hadoop.


The NYU Hadoop cluster

∼92 machines
8 cores @ 2GHz to 2.5GHz / machine
16GB RAM / machine
1Gb/s network card
∼100TB storage (in Hadoop).
A low-end Hadoop cluster, of most use as a shared data-rich environment.

Access directions for NYU students:
ssh <netid>@hpc.nyu.edu
ssh dumbo.es.its.nyu.edu

Take 10 minutes to set up ssh tunneling (a config sketch follows below):
https://wikis.nyu.edu/display/NYUHPC/SCP+through+SSH+Tunneling

For non-NYU students, you can experiment with Hadoop easily using AWS.
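A minimal ~/.ssh/config sketch of the two-hop login (an illustration assuming an OpenSSH client; the wiki page above is the authority on the exact setup):

# first hop: the HPC gateway
Host hpc
    User <netid>
    HostName hpc.nyu.edu

# second hop: the Hadoop cluster, tunneled through the gateway
Host dumbo
    User <netid>
    HostName dumbo.es.its.nyu.edu
    ProxyCommand ssh -W %h:%p hpc

With this in place, ssh dumbo and scp <file> dumbo: work directly from your own machine.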


Hadoop Distributed File System (HDFS)

Half of Hadoop is HDFS. It’s the better half.

All data is stored 3 times ⇒

1 robust to failures. How many random failures are required to lose info? (See the back-of-envelope after this list.)

2 1/3 storage.

3 You don't "backup" a Hadoop cluster.

4 Multiple sources for any one piece of data.
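A back-of-envelope illustration (assuming the 3 replicas of a block land on 3 distinct machines chosen uniformly at random): if k machines fail at once on an n-machine cluster, a given block is lost only when all 3 of its replicas sit on failed machines, i.e. with probability C(k,3)/C(n,3):

awk 'BEGIN {
  n = 92; k = 3;                              # cluster size, simultaneous failures
  p = (k/n) * ((k-1)/(n-1)) * ((k-2)/(n-2));  # = C(k,3)/C(n,3)
  printf "P(a given block is lost) = %g\n", p # roughly 8e-6
}'

So info can be lost with as few as 3 failures, but the chance for any particular block is tiny.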

Files are stored in 64MB chunks ("shards").

1 Sequential reads are fast. A 100MB/s disk requires 0.64s to read a chunk but only 0.01s to start reading it.

2 Not made for random reads.

3 Not for small files.

No support for file modification.
⇒ HDFS is in its own namespace.
⇒ Need new commands.


HDFS ops

Execute:
echo "alias hfs='hadoop fs'" >> ~/.bashrc
source ~/.bashrc

Common commands:
1 hfs : see available commands.
2 hfs -help : more command details.
3 hfs -ls [<path>] : list files.
4 hfs -cp <src> <dst> : copy files.
5 hfs -mkdir <path> : create a path.
6 hfs -rm <path> : remove a file.
7 hfs -chmod <mode> <path> : modify permissions.
8 hfs -chown <owner> <path> : modify owner.

Remote access commands:
1 hfs -cat <src> : cat contents to stdout.
2 hfs -copyFromLocal <localsrc> <dst> : copy from local disk into HDFS.
3 hfs -copyToLocal <src> <localdst> : copy from HDFS to local disk.

The file system is browsable. For NYU: http://babar:50070/
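For instance, a typical first session might look like this (a sketch; the file and directory names are hypothetical):

hfs -mkdir /user/<netid>/data                     # create a directory in HDFS
hfs -copyFromLocal mydata.txt /user/<netid>/data  # push a local file into HDFS
hfs -ls /user/<netid>/data                        # confirm it arrived
hfs -cat /user/<netid>/data/mydata.txt | head     # peek at the first lines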


Hadoop Map-Reduce

An example of "Bulk Synchronous Parallel" data processing.

Map (programming language ideal): a function f : A → B.
Map (Hadoop ideal): a function f : A* → B*.
Map (real implementation): any program consuming A* and outputting B*.

In between Map and Reduce is sort(B*), which partitions elements across multiple reducers.

Reduce (programming language ideal): a function g : B × B → B.
Reduce (Hadoop ideal): a function g : B* → C.
Reduce (real implementation): any program consuming B* and outputting C.

A, B, C are often (but not always) line oriented.
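Since mappers and reducers are just programs reading stdin and writing stdout, you can prototype a streaming job locally; this one-liner is a rough local stand-in (it ignores the partitioning of the sorted stream across multiple reducers; file names are hypothetical):

cat input.txt | ./mapper | sort | ./reducer > output.txt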


[Diagram: data flows from HDFS into four parallel Map tasks, through the sort/partition step, into two Reduce tasks, and back out to HDFS.]

Hadoop Streaming

Hadoop streaming = use any program written in any language for map-reduce operations.

Execute:
echo 'export HAS=/usr/lib/hadoop/contrib/streaming' >> ~/.bashrc
echo 'export HSJ=hadoop-streaming-1.0.3.16.jar' >> ~/.bashrc
echo "alias hjs='hadoop jar \$HAS/\$HSJ'" >> ~/.bashrc
source ~/.bashrc

Get the first example from every mapped chunk in rcv1:
echo 'cat > temp; head -n 1 temp' > header
hjs -input /user/jl5386/rcv1.train.txt -output headresults \
    -mapper header -reducer cat -file header

hfs -cat headresults/part-00000 | wc -l then gives the number of mappers.
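Before submitting a job, a mapper can be sanity-checked against a few input lines pulled straight from HDFS (a quick local test, not a substitute for running the job):

hfs -cat /user/jl5386/rcv1.train.txt | head -n 100 | sh header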


Guess what it does

echo "cut -d ' ' -f 1 | grep 1" > cutter
echo wc -l > counter
hjs -input /user/jl5386/rcv1.train.txt -output countres \
    -mapper cutter -reducer counter -file cutter -file counter

hfs -cat countres/part-00000

Guess what it does

echo 'grep 326:' > grepper
echo wc -l > counter
hjs -input /user/jl5386/rcv1.train.txt -output fcount \
    -mapper grepper -reducer counter -file grepper -file counter

hfs -cat fcount/part-00000

Guess what it does

echo "cut -d ' ' -f 1 | sort -u" > cutsort
echo sort -u > sorter
hjs -input /user/jl5386/rcv1.train.txt -output labels \
    -mapper cutsort -reducer sorter -file cutsort -file sorter

hfs -cat labels/part-00000

Hadoop job control

Watch what is happening with the job tracker URL (given on job launch).

hadoop job -list

hadoop job -kill <id>

Abusing Hadoop Streaming

Hadoop streaming makes Hadoop into a general-purpose job submission system.

hjs -Dmapred.task.timeout=600000000 \
    -Dmapred.job.map.memory.mb=3000 \
    -input <yourdata> -output <finaloutput> \
    -mapper cat -reducer <yourprogram> -file <yourprogram>

Why Hadoop for job control?

1 map can be handy for selecting a subset or different data for features or examples.

2 much better bandwidth limits.


Parallel Learning for Parameter Search

Hadoop streaming makes Hadoop into a general-purpose job submission system:

for a in 0.1 0.3 1 3 10; do
    echo "./vw -l $a" > job_$a
    hjs -Dmapred.task.timeout=600000000 \
        -Dmapred.job.map.memory.mb=3000 \
        -input <yourdata> -output output_$a \
        -mapper cat -reducer job_$a -file job_$a -file vw
done
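Once the jobs finish, the per-parameter results can be collected in one pass (a sketch, assuming each job writes its evaluation to stdout, which streaming captures in part-00000):

for a in 0.1 0.3 1 3 10; do
    echo "learning rate $a:"
    hfs -cat output_$a/part-00000   # the reducer's stdout for this run
done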

Parallel Learning for Speed

The next lecture.

More Hadoop things

PIG (Y!) is an SQL→MapReduce compiler.

Zookeeper (Y!) is a system for sharing small amounts of info.

Hive (Facebook): much faster data query and exploration. ...

References

[Hadoop Tutorial] http://developer.yahoo.com/hadoop/tutorial/

[GFS] Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung, "The Google file system", SOSP '03.

[MapReduce] Jeffrey Dean, Sanjay Ghemawat, "MapReduce: simplified data processing on large clusters", Communications of the ACM, Volume 51, Issue 1, January 2008, Pages 107-113.

In general, MapReduce is an example of bulk synchronous parallel computation; there are many other papers.
