TRANSCRIPT
Perl on Amazon Elastic MapReduce
Friday, 30 December 11
A Gentle Introduction to MapReduce
• Distributed computing model
• Mappers process the input and forward intermediate results to reducers.
• Reducers aggregate these intermediate results, and emit the final results.
A sort/shuffle phase runs between the two steps, guaranteeing that all mapper results for a given key go to the same reducer, and that the workload is distributed evenly.
$ map | sort | reduce
MapReduce
• Input data is sent to mappers as (k, v) pairs.
• After processing, mappers emit (k_out, v_out).
• These pairs are sorted and sent to reducers.
• All (k_out, v_out) pairs for a given k_out are sent to a single reducer.
The sorting guarantees that all values for a given key are sent to a single reducer.
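The whole model can be mimicked locally in one process; here is a minimal Perl sketch of a word count, with made-up input, just to make the three phases concrete:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Toy word count, modelling map | sort | reduce in a single process.
my @input = ( 'a b a', 'b a' );

# Map: emit ( word, 1 ) pairs.
my @pairs;
for my $line ( @input ) {
    push @pairs, [ $_, 1 ] for split ' ', $line;
}

# Sort/shuffle: order pairs by key, so equal keys become adjacent.
@pairs = sort { $a->[0] cmp $b->[0] } @pairs;

# Reduce: sum the values for each key.
my %count;
$count{ $_->[0] } += $_->[1] for @pairs;

print "$_\t$count{$_}\n" for sort keys %count;   # a 3, b 2
```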
MapReduce
• Reducers get (k, [v1, v2, …, vn]).
• After processing, the reducer emits one (k_f, v_f) pair per result.
MapReduce
We wanted to have a world map showing where people were starting our games (like Mozilla Glow).
Mozilla Glow tracked Firefox 4 downloads on a world map, in near real-time.
Glowfish
MapReduce
• Input: ( epoch, IP address )
• Mappers group these into 5-minute blocks, and emit ( blockId, IP address )
• Reducers get ( blockId, [ip1, ip2, …, ipn] )
• Do a geo lookup and emit
( epoch, [ ( lat1, lon1 ), ( lat2, lon2), … ] )
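The 5-minute bucketing in the mapper can be sketched as follows, assuming the event timestamp arrives in milliseconds (the value below is made up for illustration):

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical event timestamp, in milliseconds since the epoch.
my $epoch_ms = 1_322_697_600_000;

my $block_id    = int( $epoch_ms / 1000 / 300 );  # 5-minute block index
my $block_start = $block_id * 300;                # block start, epoch seconds

print "$block_id\t$block_start\n";
```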
On a 50-node cluster, processing ~3BN events takes 11 minutes, including data transfers. 2 hours' worth takes 3 minutes, so we can easily have data from 5 minutes ago. It took 1 day to modify the Glow protocol and 1 day to build. Everything is stored on S3.
$ map | sort | reduce
Apache Hadoop
• Distributed programming framework
• Implements MapReduce
• Does all the usual distributed programming heavy-lifting for you
• Highly fault-tolerant, with automatic task re-assignment in case of failure
• You focus on mappers and reducers
Serialisation, heartbeat, node management, directory, etc. Speculative task execution: the first copy to finish wins. Your own code can be very simple and contained.
Apache Hadoop
• Native Java API
• Streaming API which can use mappers and reducers written in any programming language.
• Distributed file system (HDFS)
• Distributed Cache
You supply the mapper, reducer, and driver code
Amazon Elastic MapReduce
• On-demand Hadoop clusters running on EC2 instances.
• Improved S3 support for storage of input and output data.
• Build workflows by sending jobs to a cluster.
S3 gives you virtually unlimited storage with very high redundancy. S3 performance: ~750MB of uncompressed data per second (110-byte rows -> ~7M rows/sec). All of this is controlled using a REST API. Jobs are called ‘steps’ in EMR lingo.
EMR Downsides
• No control over the machine images.
• Perl 5.8.8
• Ephemeral: when your cluster is shut down (or dies), HDFS is gone.
• HDFS not available at cluster-creation time.
• Debian
No way to customise the image and, e.g., install your own Perl. So it’s a good idea to store the final results of a workflow in S3. No way to store dependencies in HDFS when the cluster is created.
Streaming vs. Native
$ cat | map | sort | reduce
Streaming vs. Native
Instead of
( k, [ v1, v2, …, vn ] )
reducers get
( ( k1, v1 ), …, ( k1, vn ), ( k2, v1 ), …, ( k2, vm ) )
Composite Keys
• Reducers receive both keys and values sorted
• Merge 3 tables:
userid, 0, …          # customer info
userid, 1, …          # payments history
userid, recordid1, …  # clickstream
userid, recordid2, …  # clickstream
If you set a value to 0, you’ll know that it’s going to be the first (k, v) the reducer will see, 1 will be the second, etc. When the userid changes, it’s a new user.
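Plain lexical sorting is enough to see why the 0/1 convention works; a small sketch with made-up records (plain Perl standing in for Hadoop's shuffle):

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Made-up records from the three tables, keyed "userid<TAB>sortfield".
my @records = (
    "u2\t1\tpayment",
    "u1\trec42\tclick",
    "u1\t0\tcustomer",
    "u2\t0\tcustomer",
    "u1\t1\tpayment",
);

# Lexical sort: for each user, 0 (customer info) sorts before 1
# (payments), which sorts before the clickstream record ids.
my @sorted = sort @records;
print "$_\n" for @sorted;
```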
Streaming vs. Native
• Limited API
• About a 7-10% increase in run time
• About a 1000% decrease in development time (as reported by a non-representative sample of developers)
E.g., no control over output file names, many of the API settings can’t be configured programmatically (only via cmd-line switches), no separate mappers per input, etc. Because reducer input is also sorted on keys, when the key changes you know you won’t be seeing any more of those. You might need to keep track of the current key, to use as the previous one.
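The key-change bookkeeping described above can be factored into a small, generic skeleton; the function name and the summing are illustrative choices, not from the talk:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Sum values per key from "key\tvalue" lines that are already sorted by
# key, emitting each group when the key changes -- the streaming idiom.
sub reduce_sorted_pairs {
    my @lines = @_;
    my ( $prev_key, $sum, @out ) = ( undef, 0 );
    for my $line ( @lines ) {
        my ( $key, $value ) = split /\t/, $line, 2;
        if ( defined $prev_key && $key ne $prev_key ) {
            push @out, "$prev_key\t$sum";   # finished with this key
            $sum = 0;
        }
        $sum += $value;
        $prev_key = $key;
    }
    push @out, "$prev_key\t$sum" if defined $prev_key;  # last group
    return @out;
}

print "$_\n" for reduce_sorted_pairs( "a\t1", "a\t2", "b\t3" );  # a 3, b 3
```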
Where’s My Towel?
• Tasks run chrooted in a non-deterministic location.
• It’s easy to store files in HDFS when submitting a job, but impossible to store directory trees.
• For native Java jobs, your dependencies get packaged in the JAR alongside your code.
So how do you get all the CPAN goodness you know and love in there? HDFS operations are limited to copy, move, and delete, and the host OS doesn’t see it - no untar’ing!
Streaming’s Little Helpers
Define your inputs and outputs:
--input s3://events/2011-30-10
--output s3://glowfish/output/2011-30-10
Can have multiple inputs
Streaming’s Little Helpers
You can use any class in Hadoop’s classpath (codecs, comparators, partitioners), and several come bundled:
-D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
That -D is a Hadoop define, not a JVM system property definition
Streaming’s Little Helpers
• Use S3 to store…
• input data
• output data
• supporting data (e.g., Geo-IP)
• your code
On a streaming job you specify the programs to use as mapper and reducer
Mapper and Reducer
To specify the mapper and reducer to be used in your streaming job, you can point Hadoop to S3:
--mapper s3://glowfish/bin/mapper.pl
--reducer s3://glowfish/bin/reducer.pl
Support Files
When specifying a file to store in the Distributed Cache, a URI fragment will be used as a symlink in the local filesystem:
-cacheFile s3://glowfish/data/GeoLiteCity.dat#GeoLiteCity.dat
The link is created in the unknown directory where the task is running, making the file accessible to it.
Dependencies
But if you store an archive (Zip, TGZ, or JAR) in the Distributed Cache, …
-cacheArchive s3://glowfish/lib/perllib.tgz#locallib
Dependencies
Hadoop will uncompress it and create a link to whatever directory it created, in the task’s working directory.
Dependencies
Which is where it stores your mapper and reducer.
Dependencies
use lib qw/ locallib /;
Mapper
#!/usr/bin/env perl
use strict;
use warnings;
use lib qw/ locallib /;

use JSON::PP;

my $decoder    = JSON::PP->new->utf8;
my $missing_ip = 0;

while ( <> ) {
    chomp;
    next unless /load_complete/;
    my @line = split /\t/;
    my ( $epoch, $payload ) = ( int( $line[1] / 1000 / 300 ), $line[5] );
    my $json = $decoder->decode( $payload );
    if ( ! exists $json->{'ip'} ) {
        $missing_ip++;
        next;
    }
    print "$epoch\t$json->{'ip'}\n";
}

print STDERR "reporter:counter:Job Counters,MISSING_IP,$missing_ip\n";
At the end of the job, Hadoop aggregates counters from all tasks.
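The counter lines follow a fixed protocol on STDERR (“reporter:counter:<group>,<counter>,<amount>”); a tiny helper (the name is mine, not from the talk) keeps the format in one place:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Emit a Hadoop streaming counter increment on STDERR.
sub increment_counter {
    my ( $group, $counter, $amount ) = @_;
    print STDERR "reporter:counter:$group,$counter,$amount\n";
}

increment_counter( 'Job Counters', 'MISSING_IP', 3 );
```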
Reducer
#!/usr/bin/env perl
use strict;
use warnings;
use lib qw/ locallib /;

use Geo::IP;
use Regexp::Common qw/ net /;
use Readonly;

Readonly::Scalar my $TAB => "\t";

my $geo = Geo::IP->open( 'GeoLiteCity.dat', GEOIP_MEMORY_CACHE )
    or die "Could not open GeoIP database: $!\n";

my $format_errors      = 0;
my $invalid_ip_address = 0;
my $geo_lookup_errors  = 0;

my $time_slot;
my $previous_time_slot = -1;
Reducer
while ( <> ) {
    chomp;
    my @cols = split $TAB;
    if ( scalar @cols != 2 ) {
        $format_errors++;
        next;
    }
    my $ip_addr;
    ( $time_slot, $ip_addr ) = @cols;
    if ( $previous_time_slot != -1 && $time_slot != $previous_time_slot ) {
        # we've entered a new time slot, write the previous one out
        emit( $time_slot, $previous_time_slot );
    }
    if ( $ip_addr !~ /$RE{net}{IPv4}/ ) {
        $invalid_ip_address++;
        $previous_time_slot = $time_slot;
        next;
    }
Reducer
    my $geo_record = $geo->record_by_addr( $ip_addr );
    if ( ! defined $geo_record ) {
        $geo_lookup_errors++;
        $previous_time_slot = $time_slot;
        next;
    }

    # update entry for time slot with lat and lon
    $previous_time_slot = $time_slot;
} # while ( <> )

emit( $time_slot + 1, $time_slot );

print STDERR "reporter:counter:Job Counters,FORMAT_ERRORS,$format_errors\n";
print STDERR "reporter:counter:Job Counters,INVALID_IPS,$invalid_ip_address\n";
print STDERR "reporter:counter:Job Counters,GEO_LOOKUP_ERRORS,$geo_lookup_errors\n";
Recap
• EMR clusters are volatile.
• Values for a given key will all go to a single reducer, sorted.
• Use S3 for everything, and plan your dataflow ahead.
( On data )
• Store it wisely, e.g., using a directory structure like the following to get free partitioning in Hive and other tools:
s3://bucket/path/data/run_date=2011-11-12
• Don’t worry about getting the data out of S3; you can always write a simple job that does that and run it at the end of your workflow.
Hive partitioning
Recap
• EMR clusters are volatile.
• Values for a given key will all go to a single reducer, sorted. Watch for the key changing.
• Use S3 for everything, and plan your dataflow ahead.
• Make carton a part of your life, and especially of your build tool’s.
( carton )
• Shipwright for humans
• Reads dependencies from Makefile.PL
• Installs them locally to your app
• Deploy your stuff, including carton.lock
• Run carton install --deployment
• Tar result and upload to S3
URLs
• The MapReduce Paper: http://labs.google.com/papers/mapreduce.html
• Apache Hadoop: http://hadoop.apache.org/
• Amazon Elastic MapReduce: http://aws.amazon.com/elasticmapreduce/
URLs
• Hadoop Streaming Tutorial (Apache): http://hadoop.apache.org/common/docs/r0.20.2/streaming.html
• Hadoop Streaming How-To (Amazon): http://docs.amazonwebservices.com/ElasticMapReduce/latest/GettingStartedGuide/CreateJobFlowStreaming.html
URLs
• Amazon EMR Perl Client Library: http://aws.amazon.com/code/Elastic-MapReduce/2309
• Amazon EMR Command-Line Tool: http://aws.amazon.com/code/Elastic-MapReduce/2264
That’s All, Folks!
Slides available at http://slideshare.net/pfig/perl-on-amazon-elastic-mapreduce