An Introduction to the World of Hadoop

Applications to Scientific Data Mining

Gordon Rios

Constraint Computation Centre (4C)

University College Cork

October 29, 2010




Outline

1 MapReduce / Hadoop for Scientific Data Mining
    Objectives for the Talk
    MapReduce as Simplified Parallel Computing
    Thinking in Terms of MapReduce

2 Hadoop is Open Source MapReduce
    Basics of Hadoop
    Hadoop Examples
    Developing Production Systems with Hadoop

3 Wider World of Hadoop
    Ad Hoc Analysis with Hadoop
    Further Reading












Objectives

At the end of this talk I want you to have ideas for how to apply MapReduce to your domains and confidence that Hadoop is a good way to do it. . .

Introduce thinking in terms of MapReduce and why it’s a good idea

Introduce Hadoop as an open source implementation of MapReduce

Present a detailed example of using the Hadoop streaming API for a scientific data mining task

Discuss higher level notions for performing ad hoc analysis and building systems with Hadoop













MapReduce

MapReduce is distributed computing where we take advantage of data locality to push the computation to the data. . .

Distributed computing: clusters of computers with local memory and disk (network intensive for big data)

Parallel computing: multiple CPUs processing over shared memory and filesystem

If we can decompose the problem into independent map and reduce tasks we can achieve “easy” parallelism with MapReduce. . .

1 Map works independently to convert input data to key value pairs. . .

2 Reduce works independently on all values for a given key and transforms them to a single output set (possibly even just the empty set ∅) per key. . .
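The two steps above can be sketched in a single process. This is only an illustration of the programming model, not of how Hadoop schedules work: the function names (map_words, reduce_counts, run_mapreduce) and the sample documents are made up for the example.

```python
from collections import defaultdict

def map_words(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1

def reduce_counts(key, values):
    """Reduce: collapse all values for one key into a single output pair."""
    return key, sum(values)

def run_mapreduce(records, mapper, reducer):
    """Run map, group-by-key, and reduce sequentially in one process."""
    groups = defaultdict(list)
    for record in records:                 # each record could go to a different mapper
        for key, value in mapper(record):
            groups[key].append(value)      # stands in for the shuffle
    # each key could be handled by a different reducer
    return [reducer(key, groups[key]) for key in sorted(groups)]

docs = ["the quick brown fox", "the lazy dog", "the fox"]  # made-up input
counts = dict(run_mapreduce(docs, map_words, reduce_counts))
```

Because map_words looks at one record at a time and reduce_counts looks at one key at a time, both loops could be split across machines without changing the result.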

Now, let’s expand that a bit. . .





Basic Elements of MapReduce

MapReduce is a distributed sort with specific places to insert application logic. . .

an input reader: read work data W from the file system [1] and produce a set of splits S: W → S

a Map function: (S) → (K, V)

a combiner function: a mapper optimization. . .

a partition function: partition [2] keys k ∈ K to reducers: K → R

a compare function cmp(k_i, k_j): sort keys presented to each reducer

a Reduce function: reduce the output from all mappers for a particular key k to another set of values w_k for that key: (k, V) → (k, w_k)

an output writer: write output to the file system [1]

[1] A distributed file system (DFS) for stability and scale

[2] The default hashes keys modulo the number of reducers
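A rough sketch of the partition and compare steps, assuming a "hash modulo number of reducers" rule like the default described above. The crc32 hash is a stand-in chosen for the example because it is stable between runs; it is not the hash Hadoop itself uses, and the function names are hypothetical.

```python
import zlib

def partition(key, num_reducers):
    """Assign a key to a reducer: hash of the key modulo the number of reducers."""
    # crc32 is an assumed stand-in hash; unlike Python's hash() on str,
    # it gives the same answer in every process
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def shuffle(pairs, num_reducers):
    """Route (key, value) pairs to reducers, then sort each reducer's input by key."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    return [sorted(bucket) for bucket in buckets]

pairs = [("fox", 1), ("the", 1), ("dog", 1), ("the", 1), ("fox", 1)]  # made-up mapper output
reducer_inputs = shuffle(pairs, num_reducers=2)
```

The property that matters is that every occurrence of a key lands on the same reducer, and that each reducer sees its keys in sorted order so equal keys are adjacent.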













Examples of Map and Reduce

Let’s start with a few examples of Map. . .

Word Count: read in a stream of text (e.g. a document or a set of documents) and emit each word as a key with a value of 1

Inverted Index: read in a stream of documents and emit each word as a key and the document ID as the value

Max Temperature: read in formatted data and emit year as a key with temperature as the value

Mean Rain Precipitation: read in daily data and emit (year-month, lat, long) as a key with precipitation as the value

Reduce in these cases simply applies a count, list, max, or average, respectively, to the set of values for each key. [Dean, Ghemawat, 2008; Wikipedia, 2010; White, 2011]
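Two of the examples above, sketched with a tiny sort-then-group driver. The record layouts ("year temperature" strings and (doc_id, text) tuples), the helper names, and the data are all invented for the illustration.

```python
from itertools import groupby
from operator import itemgetter

def map_max_temp(line):
    """Map: 'year temperature' record -> (year, temperature)."""
    year, temp = line.split()
    yield year, float(temp)

def map_inverted_index(doc):
    """Map: (doc_id, text) -> one (word, doc_id) pair per distinct word."""
    doc_id, text = doc
    for word in set(text.lower().split()):
        yield word, doc_id

def run(records, mapper, reduce_values):
    """Map every record, sort by key, and reduce the values of each key."""
    pairs = sorted(pair for record in records for pair in mapper(record))
    return {key: reduce_values([v for _, v in group])
            for key, group in groupby(pairs, key=itemgetter(0))}

readings = ["1949 11.1", "1950 2.2", "1949 7.8", "1950 -1.1"]  # made-up data
max_temp = run(readings, map_max_temp, max)         # reduce = max

docs = [("d1", "hadoop streaming"), ("d2", "hadoop pipes")]    # made-up data
index = run(docs, map_inverted_index, sorted)       # reduce = (sorted) list
```

Only the mapper and the per-key reduce function change between the two jobs; the sort and grouping in the middle are identical.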





Visualizing Word Count

source: Chris Wensel, from http://www.cascading.org











Engineering Intermezzo

This is how easy it is to get Hadoop installed . . . given that you have Java 6 installed already. . .

Get Hadoop: http://hadoop.apache.org/

    % tar xzf hadoop-x.y.z.tar.gz
    % export HADOOP_INSTALL=BUILD_DIR/hadoop-x.y.z
    % export PATH=$PATH:$HADOOP_INSTALL/bin




MapReduce with Hadoop and the streaming library

Now, let’s take a closer look at how Hadoop implements MapReduce from [White, 2011]. . .





Hadoop Streaming Library

We’ll focus on the streaming library as it’s the most natural for scientific or technical computing. . . let’s look at the Definitive Guide’s weather example. . .













Hadoop Book Examples

More examples from Hadoop: The Definitive Guide, 2nd Edition (Hadoop 20.1) http://www.hadoopbook.com/ . . . here’s how to install and try them for yourself. . .

Install Git: http://git-scm.com/

Visit github for book code: http://github.com/tomwhite/hadoop-book/

Check out the code examples from The Definitive Guide:

    % cd BUILD_DIR
    % git clone http://github.com/tomwhite/hadoop-book.git hadoop-book




Example: ECA Mean Precipitation

Let’s compute mean precipitation at over 2,000 weather stations and make some graphics. There are 2,186 files with a median of 21,875 lines each, a minimum of 1,025 and a maximum of 78,090.

ECA Daily Data

The ECA dataset contains series of daily observations at meteorological stations throughout Europe and the Mediterranean. Part of the dataset is freely available for non-commercial research. To download this public data select one of the options below. Note that a gridded version with daily temperature and precipitation fields is also available. source: http://eca.knmi.nl/dailydata/index.php

File Format

FILE FORMAT (MISSING VALUE CODE = -9999):

    01-06 STAID : Station identifier
    08-13 SOUID : Source identifier
    15-22 DATE  : Date YYYYMMDD
    24-28 RR    : Precipitation amount in 0.1 mm
    30-34 Q_RR  : quality code for RR (0='valid'; 1='suspect'; 9='missing')
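A minimal sketch of reading one record in this layout, assuming the five fields are comma-separated as the column ranges above suggest. The function name parse_rr_line and the sample lines are made up; the lines only follow the documented column positions and are not real observations.

```python
MISSING = "-9999"

def parse_rr_line(line):
    """Parse one comma-separated ECA precipitation record.

    Returns (staid, date, precip_mm), or None if the line is not a
    usable observation.
    """
    fields = [f.strip() for f in line.split(",")]
    if len(fields) != 5:
        return None                        # free-text header or malformed line
    staid, souid, date, rr, q_rr = fields
    if not date.isdigit() or len(date) != 8:
        return None                        # e.g. the column-name line
    if rr == MISSING or q_rr != "0":
        return None                        # missing or suspect value
    return staid, date, int(rr) / 10.0     # RR is stored in units of 0.1 mm

sample = [
    "    11,100010,20090101,   12,    0",  # made-up values, documented layout
    "    11,100010,20090102,-9999,    9",
    " STAID, SOUID,    DATE,   RR, Q_RR",
]
parsed = [parse_rr_line(line) for line in sample]
```

Filtering on both the missing value code and the quality code keeps suspect readings out of the mean.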







Scientific Data Mining: use the Hadoop streaming library and manually pipeline MapReduce jobs together as needed. . .

Write Hadoop scripts in Python in two steps

Test with cat data | map.py | sort | reduce.py > output (not shown)

Process data into individual files for each time period (year/month) of interest using the Hadoop streaming library (local mode)

Call R in batch mode to produce image files
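The cat | map | sort | reduce test works because a streaming reducer only needs its input lines sorted by key. A small in-process imitation of that pipeline, with stand-in mapper and reducer functions and made-up "month,value" input (none of this is the talk's actual script):

```python
from itertools import groupby

def mapper(lines):
    """Stand-in for map.py: emit 'key<TAB>value' lines."""
    for line in lines:
        month, value = line.split(",")
        yield "%s\t%s" % (month, value)

def reducer(sorted_lines):
    """Stand-in for reduce.py: average the values of each run of equal keys."""
    keyed = (line.split("\t") for line in sorted_lines)
    for key, group in groupby(keyed, key=lambda kv: kv[0]):
        values = [float(v) for _, v in group]
        yield "%s\t%.2f" % (key, sum(values) / len(values))

data = ["200901,10", "200902,4", "200901,20", "200902,6"]  # made-up input
# sorted() plays the role of the Unix sort between the two scripts
output = list(reducer(sorted(mapper(data))))
```

If the sort step is skipped, equal keys are no longer adjacent and the reducer emits several partial results for the same key, which is a quick way to see why the sort matters.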





ECA Mean Precipitation: Step One

map_one.py

    import sys

    # BEGIN_DATE, END_DATE (YYYYMMDD strings) and latlons (station id ->
    # (lat, lon)) are set up elsewhere in the script and are not shown here.

    def lat_lon_to_coord(dms):
        # dms is a "degrees:minutes:seconds" string
        d, m, s = [float(x) for x in dms.split(":")]
        # take the sign from the text so that "-000:30:00" stays negative
        # (float("-000") is -0.0, which is not < 0)
        sign = -1 if dms.strip().startswith("-") else 1
        x = abs(d) + m / 60.0 + s / 3600.0
        return float(sign * x)

    for line in sys.stdin:
        # flds = (staid, souid, date, rr, q_rr)
        flds = line.strip().split(",")
        if len(flds) != 5:
            continue
        staid = flds[0].strip()  # station id
        date = flds[2].strip()   # YYYYMMDD
        if date < BEGIN_DATE or date > END_DATE:
            continue
        rr = flds[3].strip()     # precipitation in 0.1 mm
        q_rr = flds[4].strip()   # quality code "0" = valid
        lat, lon = latlons.get(staid, (None, None))
        if q_rr == '0' and (lat is not None) and (lon is not None):
            # key = yyyymm,lat,lon   value = precipitation
            print("%s,%.4f,%.4f\t%s" % (date[0:6], lat, lon, rr))





ECA Mean Precipitation: Step One (cont)

reduce_one.py

    import sys

    # running state: current key, sum of values, number of values
    (last_key, x, n) = (None, 0.0, 0)
    for line in sys.stdin:
        (key, val) = line.strip().split("\t")
        if last_key and last_key != key:  # time to emit reduced value
            if n > 0:
                print("%s\t%.2f" % (last_key, x / n))
            x = 0.0
            n = 0
        # accumulate sum and count for this key (the date range was
        # already restricted in the mapper)
        (last_key, x, n) = (key, x + float(val), n + 1)

    # emit the mean for the final key
    if last_key:
        if n > 0:
            print("%s\t%.2f" % (last_key, x / n))





ECA Mean Precipitation: Step Two

Map ((yyyymm,lat,lon),mean) -> (yyyymm, (lat,lon,mean))

map_two.py

    import sys

    for line in sys.stdin:
        yyyymm_lat_lon, mean_precip = line.strip().split("\t")
        yyyymm, lat, lon = yyyymm_lat_lon.strip().split(",")
        # new key = yyyymm   new value = "lat lon mean"
        print("%s\t%s %s %s" % (yyyymm, lat, lon, mean_precip))

The reduce does no aggregation: it just writes the values for each key to a local file (a hack that only works because we’re running locally)

reduce_two.py

    import sys

    # write_file(key, values) is defined elsewhere in the script (not shown)

    last_key = None
    values = []
    for line in sys.stdin:
        (key, val) = line.strip().split("\t")
        if last_key and last_key != key:  # time to emit reduced value
            write_file(last_key, values)
            values = []
        last_key = key
        values.append(val)  # val is one string holding three values: lat lon mean

    # write out the final key
    if last_key:
        write_file(last_key, values)
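The write_file helper is not shown in the talk. One hypothetical version is sketched below, assuming each key (yyyymm) becomes a file named yyyymm.dat with one "lat lon mean" string per line; the out_dir parameter, the naming scheme, and the sample values are all assumptions made for the illustration.

```python
import os
import tempfile

def write_file(key, values, out_dir="."):
    """Write one 'lat lon mean' string per line to <out_dir>/<key>.dat.

    Hypothetical helper: the naming scheme is an assumption, chosen so
    that a key such as '200901' produces '200901.dat'.
    """
    path = os.path.join(out_dir, key + ".dat")
    with open(path, "w") as handle:
        for value in values:
            handle.write(value + "\n")
    return path

out_dir = tempfile.mkdtemp()
rows = ["41.1187 1.2453 31.50", "48.8900 2.3508 12.25"]   # made-up values
path = write_file("200901", rows, out_dir)
with open(path) as handle:
    written = handle.read().splitlines()
```

Writing directly to the local disk from a reducer only makes sense in local mode; on a real cluster the reducers run on different machines and the output would normally go back to the distributed file system.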






Step One: input -> (yyyymm,lat,lon), mean precip

    % hadoop jar /Users/gordon/build/hadoop/contrib/streaming/hadoop-0.20.2-streaming.jar \
        -input /Downloads/tarragona/ECA_blend_all_data/RR_STAID00* \
        -output output \
        -mapper /Desktop/tmp/tarragona/python/map_one.py \
        -reducer /Desktop/tmp/tarragona/python/reduce_one.py
    10/10/25 19:34:31 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
    . . .
    10/10/25 20:07:36 INFO streaming.StreamJob: Job complete: job_local_0001
    10/10/25 20:07:36 INFO streaming.StreamJob: Output: output

Step Two: (yyyymm,lat,lon), mean precip -> files(yyyymm)

    % hadoop jar /Users/gordon/build/hadoop/contrib/streaming/hadoop-0.20.2-streaming.jar \
        -input output/part-00000 \
        -output output_two \
        -mapper /Desktop/tmp/tarragona/python/map_two.py \
        -reducer /Desktop/tmp/tarragona/python/reduce_two.py
    10/10/25 20:41:43 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
    . . .
    10/10/25 20:41:47 INFO streaming.StreamJob: Job complete: job_local_0001
    10/10/25 20:41:47 INFO streaming.StreamJob: Output: output_two





Batch Processing in R

And, after a little batch processing with R. . .

batch-graphics.R

    library(fields)
    files <- c("200901.dat", "200902.dat", "200903.dat",
               "200904.dat", "200905.dat", "200906.dat",
               "200907.dat", "200908.dat", "200909.dat",
               "200910.dat", "200911.dat", "200912.dat")
    i <- 1
    for (f in files) {
        mat <- read.table(f)
        names(mat) <- c("lat", "long", "precip")
        png(filename=paste("precip-", i, ".png", sep=""), height=480, width=480)
        quilt.plot(mat$long, mat$lat, mat$precip, ncol=100, nrow=100,
                   ylim=c(22.0, 79.0), xlim=c(-52.0, 72.0),
                   col=two.colors(256, start="wheat", end="darkblue", middle="blue"),
                   zlim=c(0, 410), add.legend=T, cex.lab=0.6)
        points(1.2453, 41.1187, pch=1)
        text(1.2453, 41.1187, "tarragona", cex=0.8, pos=1)
        points(2.35083, 48.89, pch=1)
        text(2.3508, 48.89, "paris", cex=0.8, pos=4)
        points(12.4823, 41.8955, pch=1)
        text(12.4823, 41.8955, "rome", cex=0.8, pos=4)
        dev.off()
        i <- i + 1
    }





ECA Precipitation 2009 Month: 1





Summary of What We Did

We worked through a complete example, but that’s not all: with very little additional work we can. . .

Test the scripts in pseudo-distributed mode locally on our own machine

Run the job on a compute cluster remotely

Run the job in the cloud with EC2, treating their system as just another remote cluster

Run the job with Amazon’s Elastic MapReduce (http://aws.amazon.com/elasticmapreduce/), which allows you to pay for exactly as much computing as you use.

See [White, 2011] for complete details on how to run in these different modes. . .












Systems Development APIs

And, you can build production systems with Hadoop in either Java or C++. . .

Full featured Java API for Hadoop

Pipes is the C++ API for Hadoop MapReduce

Cascading is an API for developing general data processing systems that incorporate MapReduce (http://www.cascading.org)




Cascading

Cascading allows developers to model very complex data flows at a higher level than Map & Reduce and then automatically generate and visualize the dozens or perhaps even hundreds of necessary Hadoop jobs as a graph












Ad Hoc Analysis

What’s missing? Sometimes you need to do fast ad hoc queries. . . can we do that in a scalable way?

Pig: “Pig is a scripting language for exploring large datasets” [White, 2011] (Yahoo!)

Hive: provides an SQL interface for running ad hoc queries and other data processing tasks for SQL analysts (Facebook)

HBase: column oriented database along the lines of Google’s Bigtable database (Powerset)

Hypertable: GPL clone of Google’s Bigtable database written in C++ (Zvents)

Google’s Bigtable database is described in [Chang, Dean, Ghemawat, et al., 2008]





Interesting Application Frameworks with Hadoop

Here are a few examples of frameworks in development or already available that use Hadoop as a platform. . .

Apache Mahout: ambitious project to implement popular machine learning algorithms and recommenders with Hadoop [3]

Graph: Jake Hofman from Yahoo Research has released some of his work on large scale network analysis with Hadoop, with prototype code [4]. Also see [Vassilvitskii, 2010] for related graph analysis research.

Application to GIS: Nathan Kerr’s M.S. thesis, with lots of details on how to do GIS with Hadoop [5]

[3] http://mahout.apache.org/

[4] http://github.com/jhofman/icwsm2010_tutorial

[5] http://www.nathankerr.com/projects/parallel-gis-processing/alternative_approaches_to_parallel_gis_processing.html












Further Reading

White, T. Hadoop: The Definitive Guide, 2nd Edition. O’Reilly Media, Inc., Sebastopol, CA, 2011

Sanderson, D. Programming Google App Engine. O’Reilly Media, Inc., Sebastopol, CA, 2009

Murty, J. Programming Amazon Web Services. O’Reilly Media, Inc., Sebastopol, CA, 2008

Dean, J. and Ghemawat, S. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008

Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Burrows, M., Chandra, T., Fikes, A. and Gruber, R. E. Bigtable: a distributed storage system for structured data. OSDI ’06: Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation, USENIX Assoc., Berkeley, CA, 2006

MapReduce on Wikipedia. http://en.wikipedia.org/wiki/MapReduce

Vassilvitskii, S. XXL Graph Algorithms. Hadoop Summit 2010. http://developer.yahoo.com/events/hadoopsummit2010/
