An Introduction to the World of Hadoop
DESCRIPTION
A presentation on Hadoop for scientific researchers given at Universitat Rovira i Virgili in Catalonia, Spain, in October 2010. http://etseq.urv.cat/seminaris/seminars/3/
TRANSCRIPT
MapReduce / Hadoop for Scientific Data Mining
Hadoop = Open Source MapReduce
Wider World of Hadoop
An Introduction to the World of Hadoop
Applications to Scientific Data Mining
Gordon Rios
Cork Constraint Computation Centre (4C)
University College Cork
October 29, 2010
Gordon Rios Introduction to Hadoop
Outline
1 MapReduce / Hadoop for Scientific Data Mining
    Objectives for the Talk
    MapReduce as Simplified Parallel Computing
    Thinking in Terms of MapReduce
2 Hadoop is Open Source MapReduce
    Basics of Hadoop
    Hadoop Examples
    Developing Production Systems with Hadoop
3 Wider World of Hadoop
    Ad Hoc Analysis with Hadoop
    Further Reading
Objectives
At the end of this talk I want you to have ideas for how to apply MapReduce to your domains, and confidence that Hadoop is a good way to do it...
Introduce thinking in terms of MapReduce and why it’s a good idea
Introduce Hadoop as an open source implementation of MapReduce
Present a detailed example of using the Hadoop streaming API for a scientific data mining task
Discuss higher level notions for performing ad hoc analysis and building systems with Hadoop
MapReduce
MapReduce is distributed computing where we take advantage of data locality to push the computation to the data...
Distributed computing: clusters of computers with local memory and disk (network intensive for big data)
Parallel computing: multiple CPUs processing over shared memory and filesystem
If we can decompose the problem into independent map and reduce tasks we can achieve “easy” parallelism with MapReduce...
1 Map works independently to convert input data to key-value pairs...
2 Reduce works independently on all values for a given key and transforms them to a single output set (possibly even just the empty set ∅) per key...
Now, let’s expand that a bit...
Basic Elements of MapReduce
MapReduce is distributed sort with specific places to insert application logic...
an input reader: read work data W from file system [1] and produce a set of splits S: W → S
a Map function: (S) → (K, V)
a combiner function: a mapper optimization...
a partition function: partition [2] keys k ∈ K to reducers: K → R
a compare function cmp(k_i, k_j): sort keys presented to each reducer
a Reduce function: reduce the output from all mappers for a particular key k to another set of values w_k for that key: (k, V) → (k, w_k)
an output writer: write output to file system
[1] A distributed file system (DFS) for stability and scale
[2] The default is the hash of the key modulo the number of reducers
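The elements listed above can be sketched in a single process to show where the application logic plugs in. This is only an illustration under stated assumptions, not how Hadoop itself is implemented: every function name here (read_splits, map_words, combine, partition, reduce_counts, run_job) is made up for the example, the job is word count, and CRC32 stands in for the key hash so that the partitioning is repeatable between runs.

```python
import zlib
from collections import defaultdict
from itertools import groupby


def read_splits(lines, split_size):
    """Input reader: cut the work data W into a list of splits S."""
    return [lines[i:i + split_size] for i in range(0, len(lines), split_size)]


def map_words(split):
    """Map: S -> (K, V); emit (word, 1) for every word in the split."""
    return [(word, 1) for line in split for word in line.split()]


def combine(pairs):
    """Combiner: pre-sum the counts inside one mapper's output."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())


def partition(key, num_reducers):
    """Partition: K -> R, a hash of the key modulo the number of reducers."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def reduce_counts(key, values):
    """Reduce: (k, V) -> (k, w_k); here w_k is the sum of the values."""
    return key, sum(values)


def run_job(lines, split_size=2, num_reducers=2):
    """Run map, combine, partition, sort and reduce in one process."""
    buckets = defaultdict(list)
    for split in read_splits(lines, split_size):
        for key, value in combine(map_words(split)):
            buckets[partition(key, num_reducers)].append((key, value))
    output = []
    for reducer_id in sorted(buckets):
        # compare function: plain ordering of the keys within each reducer
        pairs = sorted(buckets[reducer_id])
        for key, group in groupby(pairs, key=lambda kv: kv[0]):
            output.append(reduce_counts(key, [v for _, v in group]))
    return output


if __name__ == "__main__":
    text = ["the quick brown fox", "the lazy dog", "the fox"]
    for word, count in sorted(run_job(text)):
        print("%s\t%d" % (word, count))
```

Because all pairs for a given key land in the same partition and arrive sorted, each reducer only ever needs to look at one run of identical keys at a time.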
Examples of Map and Reduce
Let’s start with a few examples of Map. . .
Word Count: read in a stream of text (e.g. a document or a set of documents) and emit each word as a key with a value of 1
Inverted Index: read in a stream of documents and emit each word as a key and the document ID as the value
Max Temperature: read in formatted data and emit year as a key with temperature as the value
Mean Rain Precipitation: read in daily data and emit (year-month, lat, long) as a key with precipitation as the value
Reduce in these cases simply applies a count, list, max, or average, respectively, to the set of values for each key. [Dean, Ghemawat, 2008; Wikipedia, 2010; White, 2011]
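To make the Map examples concrete, here is a small sketch of three of them as Python generators, together with a tiny in-memory runner that groups values by key before calling Reduce. The names (map_word_count, map_inverted_index, map_max_temperature, run) and the "year temperature" record layout are assumptions made for illustration only.

```python
from collections import defaultdict


def map_word_count(doc_id, text):
    """Emit each word as a key with a value of 1."""
    for word in text.split():
        yield word, 1


def map_inverted_index(doc_id, text):
    """Emit each word as a key and the document ID as the value."""
    for word in text.split():
        yield word, doc_id


def map_max_temperature(_, record):
    """Emit year as the key and temperature as the value.

    The 'year temperature' layout is a made-up format for this sketch.
    """
    year, temperature = record.split()
    yield year, float(temperature)


def run(mapper, reducer, records):
    """Apply mapper to every (key, value) record, group by key, then reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    return {key: reducer(values) for key, values in groups.items()}


if __name__ == "__main__":
    docs = [("d1", "rain in spain"), ("d2", "rain again")]
    print(run(map_word_count, sum, docs))
    print(run(map_inverted_index, lambda ids: sorted(set(ids)), docs))
    readings = [(None, "1949 11.1"), (None, "1949 7.8"), (None, "1950 2.2")]
    print(run(map_max_temperature, max, readings))
```

Only the mapper and the reducer change between the examples; the grouping step in the middle is always the same, which is the part Hadoop provides.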
Visualizing Word Count
source: Chris Wensel, from http://www.cascading.org
Engineering Intermezzo
This is how easy it is to get Hadoop installed... given that you have Java 6 installed already...
Get Hadoop: http://hadoop.apache.org/
    % tar xzf hadoop-x.y.z.tar.gz
    % export HADOOP_INSTALL=BUILD_DIR/hadoop-x.y.z
    % export PATH=$PATH:$HADOOP_INSTALL/bin
MapReduce with Hadoop and the streaming library
Now, let’s take a closer look at how Hadoop implements MapReduce, from [White, 2011]...
Hadoop Streaming Library
We’ll focus on the streaming library as it’s the most natural for scientific or technical computing... let’s look at the Definitive Guide’s weather example...
Hadoop Book Examples
More examples from Hadoop: The Definitive Guide, 2nd Edition (Hadoop 20.1), http://www.hadoopbook.com/... here’s how to install and try them for yourself...
Install Git: http://git-scm.com/
Visit GitHub for the book code: http://github.com/tomwhite/hadoop-book/
Check out the code examples from The Definitive Guide:
    % cd BUILD_DIR
    % git clone http://github.com/tomwhite/hadoop-book.git hadoop-book
Example: ECA Mean Precipitation
Let’s compute mean precipitation at over 2,000 weather stations and make some graphics. There are 2,186 files with a median of 21,875 lines each, a minimum of 1,025 and a maximum of 78,090.
ECA Daily Data
The ECA dataset contains series of daily observations at meteorological stations throughout Europe and the Mediterranean. Part of the dataset is freely available for non-commercial research. To download this public data select one of the options below. Note that a gridded version with daily temperature and precipitation fields is also available. source: http://eca.knmi.nl/dailydata/index.php
File Format
    FILE FORMAT (MISSING VALUE CODE = -9999):
    01-06 STAID: Station identifier
    08-13 SOUID: Source identifier
    15-22 DATE : Date YYYYMMDD
    24-28 RR   : Precipitation amount in 0.1 mm
    30-34 Q_RR : quality code for RR (0='valid'; 1='suspect'; 9='missing')
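As a rough sketch of how one record in this format might be read, the function below splits a comma-separated data line into the five fields, converts RR from units of 0.1 mm to mm, and honours the missing value code. The name parse_rr_record, the returned dictionary keys and the sample lines are assumptions for illustration; the sample values are made up and not taken from the ECA dataset.

```python
MISSING = -9999  # missing value code from the file format description


def parse_rr_record(line):
    """Return a dict for one data line, or None if it is not a data line."""
    fields = [field.strip() for field in line.split(",")]
    if len(fields) != 5 or not fields[0].isdigit():
        return None  # header text, blank lines, anything unexpected
    staid, souid, date, rr, q_rr = fields
    rr_value = int(rr)
    return {
        "staid": staid,
        "souid": souid,
        "date": date,  # YYYYMMDD, kept as a string
        # RR is stored in 0.1 mm; convert to mm unless it is missing
        "precip_mm": None if rr_value == MISSING else rr_value / 10.0,
        "valid": q_rr == "0" and rr_value != MISSING,
    }


if __name__ == "__main__":
    # made-up sample lines that follow the column layout described above
    print(parse_rr_record("     1, 35382,20090115,   23,    0"))
    print(parse_rr_record("     1, 35382,20090116,-9999,    9"))
```

Keeping the date as a string is deliberate: YYYYMMDD strings compare correctly with ordinary string comparison, which is convenient for filtering by date range.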
Example: ECA Mean Precipitation
Scientific Data Mining: use the Hadoop streaming library and manually pipeline MapReduce jobs together as needed...
Write Hadoop scripts in Python in two steps
Test: cat data | map.py | sort | reduce.py > output (not shown)
Process data into individual files for each time period (Year/Month) of interest using the Hadoop streaming library (local mode)
Call R in batch mode to produce image files
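The shell test pipeline can also be imitated inside Python, which is handy for unit testing a mapper and reducer before involving Hadoop at all. The sketch below is an assumption-laden stand-in, not part of the talk's scripts: run_streaming_locally, toy_mapper and toy_reducer are hypothetical names, and the "yyyymmdd,value" input layout is invented for the example.

```python
def run_streaming_locally(lines, mapper, reducer):
    """Imitate `cat data | map.py | sort | reduce.py` in one process."""
    mapped = [out for line in lines for out in mapper(line)]
    mapped.sort()  # stands in for the Unix `sort` step
    return list(reducer(mapped))


def toy_mapper(line):
    """Emit 'yyyymm<TAB>value' for a 'yyyymmdd,value' line (made-up layout)."""
    date, value = line.strip().split(",")
    yield "%s\t%s" % (date[:6], value)


def toy_reducer(sorted_lines):
    """Emit the maximum value per key from key-sorted 'key<TAB>value' lines."""
    last_key, best = None, None
    for line in sorted_lines:
        key, value = line.split("\t")
        value = float(value)
        if key != last_key:
            if last_key is not None:
                yield "%s\t%.1f" % (last_key, best)
            last_key, best = key, value
        else:
            best = max(best, value)
    if last_key is not None:
        yield "%s\t%.1f" % (last_key, best)


if __name__ == "__main__":
    data = ["20090101,3", "20090215,7", "20090102,5"]
    for out in run_streaming_locally(data, toy_mapper, toy_reducer):
        print(out)
```

The reducer relies only on its input being sorted by key, which is the same contract the streaming library gives to a real reduce script.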
ECA Mean Precipitation: Step One
map_one.py
    import sys

    # BEGIN_DATE, END_DATE (YYYYMMDD strings) and latlons (station id -> (lat, lon))
    # are defined elsewhere in the full script; they are not shown on the slide.

    def lat_lon_to_coord(s):
        # convert "DD:MM:SS" to signed decimal degrees; the sign is read from
        # the text so that values such as "-0:30:00" stay negative
        sign = -1 if s.strip().startswith("-") else 1
        d, m, sec = map(float, s.split(":"))
        x = abs(d) + m / 60.0 + sec / 3600.0
        return float(sign * x)

    for line in sys.stdin:
        # flds = (staid, souid, date, rr, q_rr)
        flds = line.strip().split(",")
        if len(flds) != 5:
            continue
        staid = flds[0].strip()  # station id
        date = flds[2].strip()   # YYYYMMDD
        if date < BEGIN_DATE or date > END_DATE:
            continue
        rr = flds[3].strip()     # precipitation in 0.1 mm
        q_rr = flds[4].strip()   # quality code "0" = valid
        lat, lon = latlons.get(staid, (None, None))
        if q_rr == '0' and (lat is not None) and (lon is not None):
            # key: yyyymm,lat,lon   value: precipitation
            print("%s,%.4f,%.4f\t%s" % (date[0:6], lat, lon, rr))
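The mapper looks up each station in a latlons table that the slide does not show. One way such a table could be built is sketched below, assuming a comma-separated station list with the columns STAID, NAME, CN, LAT, LON, HGHT and coordinates written as signed DD:MM:SS. That layout, the load_latlons name and the sample station line are all assumptions for illustration; the coordinate conversion mirrors the lat_lon_to_coord helper on the slide.

```python
def lat_lon_to_coord(text):
    """Convert '+DD:MM:SS' or '-DD:MM:SS' to signed decimal degrees."""
    text = text.strip()
    # take the sign from the text so that '-0:30:00' stays negative
    sign = -1.0 if text.startswith("-") else 1.0
    degrees, minutes, seconds = (float(part) for part in text.split(":"))
    return sign * (abs(degrees) + minutes / 60.0 + seconds / 3600.0)


def load_latlons(lines):
    """Build {station id: (lat, lon)} from station lines (assumed layout)."""
    latlons = {}
    for line in lines:
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != 6 or not fields[0].isdigit():
            continue  # skip headers, blank lines, anything unexpected
        latlons[fields[0]] = (lat_lon_to_coord(fields[3]),
                              lat_lon_to_coord(fields[4]))
    return latlons


if __name__ == "__main__":
    # made-up station line that follows the assumed layout
    stations = ["STAID,STANAME,CN,LAT,LON,HGHT",
                "    1,EXAMPLE STATION,XX,+56:52:00,+014:48:00,  166"]
    print(load_latlons(stations))
```

Station ids are stored stripped of padding so that they match the stripped STAID field the mapper reads from each data line.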
ECA Mean Precipitation: Step One (cont)
reduce_one.py
    import sys

    (last_key, x, n) = (None, 0.0, 0)
    for line in sys.stdin:
        (key, val) = line.strip().split("\t")
        if last_key and last_key != key:  # time to emit reduced value
            if n > 0:
                print("%s\t%.2f" % (last_key, x / n))
            x = 0.0
            n = 0
        # accumulate the running sum and count for the current key
        (last_key, x, n) = (key, x + float(val), n + 1)
    # emit the mean for the final key
    if last_key:
        if n > 0:
            print("%s\t%.2f" % (last_key, x / n))
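The same reduce step can be written with itertools.groupby, which makes the "one run of lines per key" structure explicit. This is an alternative sketch rather than the script used in the talk; mean_by_key is a hypothetical name, and like the original it assumes the input lines arrive sorted by key, as the streaming library guarantees for a reducer.

```python
from itertools import groupby


def mean_by_key(lines):
    """Yield 'key<TAB>mean' for key-sorted 'key<TAB>value' lines."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines if line.strip())
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        values = [float(value) for _, value in group]
        yield "%s\t%.2f" % (key, sum(values) / len(values))


if __name__ == "__main__":
    import sys
    for out in mean_by_key(sys.stdin):
        print(out)
```

Since the input values are precipitation amounts in 0.1 mm, the emitted means are in 0.1 mm as well.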
ECA Mean Precipitation: Step Two
Map ((yyyymm,lat,lon),mean) -> (yyyymm, (lat,lon,mean))
map_two.py
    import sys

    for line in sys.stdin:
        yyyymm_lat_lon, mean_precip = line.strip().split("\t")
        yyyymm, lat, lon = yyyymm_lat_lon.strip().split(",")
        print("%s\t%s %s %s" % (yyyymm, lat, lon, mean_precip))
Empty reduce: just write to a local file (a hack, since we’re running locally)
reduce_two.py
    import sys

    # write_file(key, values) is defined elsewhere in the full script
    # (not shown on the slide)
    last_key = None
    values = []
    for line in sys.stdin:
        (key, val) = line.strip().split("\t")
        if last_key and last_key != key:  # time to emit reduced value
            write_file(last_key, values)
            values = []
        last_key = key
        values.append(val)  # val is a string with three values: lat lon mean
    if last_key:
        write_file(last_key, values)
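The reducer calls write_file, which is not shown on the slide. A minimal sketch of what such a helper might look like follows, assuming one output file per key named after the yyyymm key with a .dat extension, which matches the file names (200901.dat and so on) that the batch-graphics.R script reads. The out_dir parameter and the returned path are additions made for this illustration.

```python
import os


def write_file(key, values, out_dir="."):
    """Write one 'lat lon mean' string per line to <out_dir>/<key>.dat."""
    path = os.path.join(out_dir, "%s.dat" % key)
    with open(path, "w") as handle:
        for value in values:
            handle.write(value + "\n")
    return path


if __name__ == "__main__":
    # made-up values in the 'lat lon mean' layout produced by map_two.py
    print(write_file("200901", ["41.1187 1.2453 12.50", "48.8900 2.3508 20.25"]))
```

Writing from inside a reducer like this only works because the job runs in local mode; on a real cluster the reducers run on different machines and the files would be scattered across them.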
Example: ECA Mean Precipitation
Step One: input -> (yyyymm,lat,lon), mean precip
    % hadoop jar /Users/gordon/build/hadoop/contrib/streaming/hadoop-0.20.2-streaming.jar -input /Downloads/tarragona/ECA_blend_all_data/RR_STAID00* -output output -mapper /Desktop/tmp/tarragona/python/map_one.py -reducer /Desktop/tmp/tarragona/python/reduce_one.py
    % 10/10/25 19:34:31 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
    ...
    % 10/10/25 20:07:36 INFO streaming.StreamJob: Job complete: job_local_0001
    % 10/10/25 20:07:36 INFO streaming.StreamJob: Output: output
Step Two: (yyyymm,lat,lon), mean precip -> files(yyyymm)
    % hadoop jar /Users/gordon/build/hadoop/contrib/streaming/hadoop-0.20.2-streaming.jar -input output/part-00000 -output output_two -mapper /Desktop/tmp/tarragona/python/map_two.py -reducer /Desktop/tmp/tarragona/python/reduce_two.py
    % 10/10/25 20:41:43 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
    ...
    % 10/10/25 20:41:47 INFO streaming.StreamJob: Job complete: job_local_0001
    % 10/10/25 20:41:47 INFO streaming.StreamJob: Output: output_two
Batch Processing in R
And, after a little batch processing with R...
batch-graphics.R
    library(fields)
    files <- c("200901.dat", "200902.dat", "200903.dat",
               "200904.dat", "200905.dat", "200906.dat",
               "200907.dat", "200908.dat", "200909.dat",
               "200910.dat", "200911.dat", "200912.dat")
    i <- 1
    for (f in files) {
      mat <- read.table(f)
      names(mat) <- c("lat", "long", "precip")
      png(filename=paste("precip-", i, ".png", sep=""), height=480, width=480)
      quilt.plot(mat$long, mat$lat, mat$precip, ncol=100, nrow=100,
                 ylim=c(22.0, 79.0), xlim=c(-52.0, 72.0),
                 col=two.colors(256, start="wheat", end="darkblue", middle="blue"),
                 zlim=c(0, 410), add.legend=T, cex.lab=0.6)
      points(1.2453, 41.1187, pch=1)
      text(1.2453, 41.1187, "tarragona", cex=0.8, pos=1)
      points(2.35083, 48.89, pch=1)
      text(2.3508, 48.89, "paris", cex=0.8, pos=4)
      points(12.4823, 41.8955, pch=1)
      text(12.4823, 41.8955, "rome", cex=0.8, pos=4)
      dev.off()
      i <- i + 1
    }
ECA Precipitation 2009, Month: 1 through Month: 12 (one figure per month)
Summary of What We Did
We worked through a complete example, but that’s not all, since with very little additional work we can...
Test the scripts in pseudo-distributed mode locally on our own machine
Run the job on a compute cluster remotely
Run the job in the cloud with EC2, treating the system as just another remote cluster
Run the job with Amazon’s Elastic MapReduce (http://aws.amazon.com/elasticmapreduce/), which allows you to pay for exactly as much computing as you use.
See [White, 2011] for complete details on how to run in these different modes...
Systems Development APIs
And, you can build production systems with Hadoop in either Java or C++...
Full-featured Java API for Hadoop
Pipes is the C++ API for Hadoop MapReduce
Cascading is an API for developing general data processing systems that incorporate MapReduce (http://www.cascading.org)
Cascading
Cascading allows developers to model very complex data flows at a higher level than Map & Reduce and then automatically generate and visualize the dozens or perhaps even hundreds of necessary Hadoop jobs as a graph
Ad Hoc Analysis
What’s missing? Sometimes you need to do fast ad hoc queries... can we do that in a scalable way?
Pig: “Pig is a scripting language for exploring large datasets” [White, 2011] (Yahoo!)
Hive: provides an SQL interface for running ad hoc queries and other data processing tasks for SQL analysts (Facebook)
HBase: column-oriented database along the lines of Google’s Bigtable database (Powerset)
Hypertable: GPL clone of Google’s Bigtable database written in C++ (Zvents)
Google’s Bigtable database is described in [Chang, Dean, Ghemawat, et al., 2008]
Interesting Application Frameworks with Hadoop
Here are a few examples of frameworks in development or already available that use Hadoop as a platform...
Apache Mahout: ambitious project to implement popular machine learning algorithms and recommenders with Hadoop [3]
Graph: Jake Hofman from Yahoo! Research has released some of his work on large scale network analysis with Hadoop, with prototype code [4]. Also see [Vassilvitskii, 2010] for related graph analysis research.
Application to GIS: Nathan Kerr’s M.S. thesis, with lots of details on how to do GIS with Hadoop [5]
[3] http://mahout.apache.org/
[4] http://github.com/jhofman/icwsm2010_tutorial
[5] http://www.nathankerr.com/projects/parallel-gis-processing/alternative_approaches_to_parallel_gis_processing.html
Further Reading
White, T. Hadoop: The Definitive Guide, 2nd Edition. O’Reilly Media, Inc., Sebastopol, CA, 2011
Sanderson, D. Programming Google App Engine. O’Reilly Media, Inc., Sebastopol, CA, 2009
Murty, J. Programming Amazon Web Services. O’Reilly Media, Inc., Sebastopol, CA, 2008
Dean, J. and Ghemawat, S. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008
Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Burrows, M., Chandra, T., Fikes, A., and Gruber, R. E. Bigtable: a distributed storage system for structured data. OSDI ’06: Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation, USENIX Assoc., Berkeley, CA, 2006
MapReduce on Wikipedia. http://en.wikipedia.org/wiki/MapReduce
Vassilvitskii, S. XXL Graph Algorithms, Hadoop Summit 2010. http://developer.yahoo.com/events/hadoopsummit2010/