TRANSCRIPT
Perl on Amazon Elastic MapReduce
Friday, 30 December 11
A Gentle Introduction to MapReduce
• Distributed computing model
• Mappers process the input and forward intermediate results to reducers.
• Reducers aggregate these intermediate results, and emit the final results.
A sort/shuffle phase runs between the two steps, guaranteeing that all mapper results for a given key go to the same reducer, and that the workload is distributed evenly.
$ map | sort | reduce
MapReduce
• Input data is sent to mappers as (k, v) pairs.
• After processing, mappers emit (k_out, v_out).
• These pairs are sorted and sent to reducers.
• All (k_out, v_out) pairs for a given k_out are sent to a single reducer.
The sorting guarantees that all values for a given key are sent to a single reducer.
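The whole model can be mimicked locally in one process; here is a minimal Perl sketch of a word count, with made-up input, just to make the three phases concrete:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Toy word count, modelling map | sort | reduce in a single process.
my @input = ( 'a b a', 'b a' );

# Map: emit ( word, 1 ) pairs.
my @pairs;
for my $line ( @input ) {
    push @pairs, [ $_, 1 ] for split ' ', $line;
}

# Sort/shuffle: order pairs by key, so equal keys become adjacent.
@pairs = sort { $a->[0] cmp $b->[0] } @pairs;

# Reduce: sum the values for each key.
my %count;
$count{ $_->[0] } += $_->[1] for @pairs;

print "$_\t$count{$_}\n" for sort keys %count;   # a 3, b 2
```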
MapReduce
• Reducers get (k, [v1, v2, …, vn]).
• After processing, the reducer emits one (k_f, v_f) pair per result.
MapReduce
We wanted to have a world map showing where people were starting our games (like Mozilla Glow).
Mozilla Glow tracked Firefox 4 downloads on a world map, in near real-time.
Glowfish
MapReduce
• Input: ( epoch, IP address )
• Mappers group these into 5-minute blocks, and emit ( blockId, IP address )
• Reducers get ( blockId, [ip1, ip2, …, ipn] )
• Do a geo lookup and emit
( epoch, [ ( lat1, lon1 ), ( lat2, lon2), … ] )
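The 5-minute bucketing in the mapper can be sketched as follows, assuming the event timestamp arrives in milliseconds (the value below is made up for illustration):

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical event timestamp, in milliseconds since the epoch.
my $epoch_ms = 1_322_697_600_000;

my $block_id    = int( $epoch_ms / 1000 / 300 );  # 5-minute block index
my $block_start = $block_id * 300;                # block start, epoch seconds

print "$block_id\t$block_start\n";
```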
On a 50-node cluster, processing ~3BN events takes 11 minutes, including data transfers. 2 hours' worth takes 3 minutes, so we can easily have data from 5 minutes ago. It took 1 day to modify the Glow protocol and 1 day to build. Everything is stored on S3.
$ map | sort | reduce
Apache Hadoop
• Distributed programming framework
• Implements MapReduce
• Does all the usual distributed programming heavy-lifting for you
• Highly fault-tolerant, with automatic task re-assignment in case of failure
• You focus on mappers and reducers
Serialisation, heartbeat, node management, directory, etc. Speculative task execution: the first copy to finish wins. Your own code can be very simple and contained.
Apache Hadoop
• Native Java API
• Streaming API which can use mappers and reducers written in any programming language.
• Distributed file system (HDFS)
• Distributed Cache
You supply the mapper, reducer, and driver code
Amazon Elastic MapReduce
• On-demand Hadoop clusters running on EC2 instances.
• Improved S3 support for storage of input and output data.
• Build workflows by sending jobs to a cluster.
S3 gives you virtually unlimited storage with very high redundancy. S3 performance: ~750MB of uncompressed data per second (110-byte rows -> ~7M rows/sec). All of this is controlled using a REST API. Jobs are called ‘steps’ in EMR lingo.
EMR Downsides
• No control over the machine images.
• Perl 5.8.8
• Ephemeral: when your cluster is shut down (or dies), HDFS is gone.
• HDFS not available at cluster-creation time.
• Debian
No way to customise the image and, e.g., install your own Perl. So it’s a good idea to store the final results of a workflow in S3. No way to store dependencies in HDFS when the cluster is created.
Streaming vs. Native
$ cat | map | sort | reduce
Streaming vs. Native
Instead of
( k, [ v1, v2, …, vn ] )
reducers get
( ( k1, v1 ), …, ( k1, vn ), ( k2, v1 ), …, ( k2, vm ) )
Composite Keys
• Reducers receive both keys and values sorted
• Merge 3 tables:
userid, 0, …          # customer info
userid, 1, …          # payments history
userid, recordid1, …  # clickstream
userid, recordid2, …  # clickstream
If you set a value to 0, you’ll know that it’s going to be the first (k, v) the reducer will see, 1 will be the second, etc. When the userid changes, it’s a new user.
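Plain lexical sorting is enough to see why the 0/1 convention works; a small sketch with made-up records (plain Perl standing in for Hadoop's shuffle):

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Made-up records from the three tables, keyed "userid<TAB>sortfield".
my @records = (
    "u2\t1\tpayment",
    "u1\trec42\tclick",
    "u1\t0\tcustomer",
    "u2\t0\tcustomer",
    "u1\t1\tpayment",
);

# Lexical sort: for each user, 0 (customer info) sorts before 1
# (payments), which sorts before the clickstream record ids.
my @sorted = sort @records;
print "$_\n" for @sorted;
```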
Streaming vs. Native
• Limited API
• About a 7-10% increase in run time
• About a 1000% decrease in development time (as reported by a non-representative sample of developers)
E.g., no control over output file names, many of the API settings can’t be configured programmatically (only via cmd-line switches), no separate mappers per input, etc. Because reducer input is also sorted on keys, when the key changes you know you won’t be seeing any more of those. You might need to keep track of the current key, to use as the previous one.
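The key-change bookkeeping described above can be factored into a small, generic skeleton; the function name and the summing are illustrative choices, not from the talk:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Sum values per key from "key\tvalue" lines that are already sorted by
# key, emitting each group when the key changes -- the streaming idiom.
sub reduce_sorted_pairs {
    my @lines = @_;
    my ( $prev_key, $sum, @out ) = ( undef, 0 );
    for my $line ( @lines ) {
        my ( $key, $value ) = split /\t/, $line, 2;
        if ( defined $prev_key && $key ne $prev_key ) {
            push @out, "$prev_key\t$sum";   # finished with this key
            $sum = 0;
        }
        $sum += $value;
        $prev_key = $key;
    }
    push @out, "$prev_key\t$sum" if defined $prev_key;  # last group
    return @out;
}

print "$_\n" for reduce_sorted_pairs( "a\t1", "a\t2", "b\t3" );  # a 3, b 3
```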
Where’s My Towel?
• Tasks run chrooted in a non-deterministic location.
• It’s easy to store files in HDFS when submitting a job, but impossible to store directory trees.
• For native Java jobs, your dependencies get packaged in the JAR alongside your code.
So how do you get all the CPAN goodness you know and love in there? HDFS operations are limited to copy, move, and delete, and the host OS doesn’t see it - no untar’ing!
Streaming’s Little Helpers
Define your inputs and outputs:
--input s3://events/2011-30-10
--output s3://glowfish/output/2011-30-10
Can have multiple inputs
Streaming’s Little Helpers
You can use any class in Hadoop’s classpath (codecs, comparators, partitioners), and several come bundled:
-D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
That -D is a Hadoop define, not a JVM system property definition
Streaming’s Little Helpers
• Use S3 to store…
• input data
• output data
• supporting data (e.g., Geo-IP)
• your code
On a streaming job you specify the programs to use as mapper and reducer
Mapper and Reducer
To specify the mapper and reducer to be used in your streaming job, you can point Hadoop to S3:
--mapper s3://glowfish/bin/mapper.pl
--reducer s3://glowfish/bin/reducer.pl
Support Files
When specifying a file to store in the Distributed Cache, a URI fragment will be used as a symlink in the local filesystem:
-cacheFile s3://glowfish/data/GeoLiteCity.dat#GeoLiteCity.dat
The link is created in the unknown directory where the task is running, making the file accessible to it.
Dependencies
But if you store an archive (Zip, TGZ, or JAR) in the Distributed Cache, …
-cacheArchive s3://glowfish/lib/perllib.tgz#locallib
Dependencies
Hadoop will uncompress it and create a link to whatever directory it created, in the task’s working directory.
Dependencies
Which is where it stores your mapper and reducer.
Dependencies
use lib qw/ locallib /;
Mapper
#!/usr/bin/env perl
use strict;
use warnings;
use lib qw/ locallib /;

use JSON::PP;

my $decoder    = JSON::PP->new->utf8;
my $missing_ip = 0;

while ( <> ) {
    chomp;
    next unless /load_complete/;
    my @line = split /\t/;
    my ( $epoch, $payload ) = ( int( $line[1] / 1000 / 300 ), $line[5] );
    my $json = $decoder->decode( $payload );
    if ( ! exists $json->{'ip'} ) {
        $missing_ip++;
        next;
    }
    print "$epoch\t$json->{'ip'}\n";
}

print STDERR "reporter:counter:Job Counters,MISSING_IP,$missing_ip\n";
At the end of the job, Hadoop aggregates counters from all tasks.
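The counter lines follow a fixed protocol on STDERR (“reporter:counter:<group>,<counter>,<amount>”); a tiny helper (the name is mine, not from the talk) keeps the format in one place:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Emit a Hadoop streaming counter increment on STDERR.
sub increment_counter {
    my ( $group, $counter, $amount ) = @_;
    print STDERR "reporter:counter:$group,$counter,$amount\n";
}

increment_counter( 'Job Counters', 'MISSING_IP', 3 );
```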
Reducer
#!/usr/bin/env perl
use strict;
use warnings;
use lib qw/ locallib /;

use Geo::IP;
use Regexp::Common qw/ net /;
use Readonly;

Readonly::Scalar my $TAB => "\t";

my $geo = Geo::IP->open( 'GeoLiteCity.dat', GEOIP_MEMORY_CACHE )
    or die "Could not open GeoIP database: $!\n";

my $format_errors      = 0;
my $invalid_ip_address = 0;
my $geo_lookup_errors  = 0;

my $time_slot;
my $previous_time_slot = -1;
Reducer
while ( <> ) {
    chomp;
    my @cols = split $TAB;
    if ( scalar @cols != 2 ) {
        $format_errors++;
        next;
    }
    my $ip_addr;
    ( $time_slot, $ip_addr ) = @cols;
    if ( $previous_time_slot != -1 && $time_slot != $previous_time_slot ) {
        # we've entered a new time slot, write the previous one out
        emit( $time_slot, $previous_time_slot );
    }
    if ( $ip_addr !~ /$RE{net}{IPv4}/ ) {
        $invalid_ip_address++;
        $previous_time_slot = $time_slot;
        next;
    }
Reducer
    my $geo_record = $geo->record_by_addr( $ip_addr );
    if ( ! defined $geo_record ) {
        $geo_lookup_errors++;
        $previous_time_slot = $time_slot;
        next;
    }

    # update entry for time slot with lat and lon
    $previous_time_slot = $time_slot;
} # while ( <> )

emit( $time_slot + 1, $time_slot );

print STDERR "reporter:counter:Job Counters,FORMAT_ERRORS,$format_errors\n";
print STDERR "reporter:counter:Job Counters,INVALID_IPS,$invalid_ip_address\n";
print STDERR "reporter:counter:Job Counters,GEO_LOOKUP_ERRORS,$geo_lookup_errors\n";
Recap
• EMR clusters are volatile.
• Values for a given key will all go to a single reducer, sorted.
• Use S3 for everything, and plan your dataflow ahead.
( On data )
• Store it wisely, e.g., using a directory structure like the following to get free partitioning in Hive and other tools:
s3://bucket/path/data/run_date=2011-11-12
• Don’t worry about getting the data out of S3; you can always write a simple job that does that and run it at the end of your workflow.
Hive partitioning
Recap
• EMR clusters are volatile.
• Values for a given key will all go to a single reducer, sorted. Watch for the key changing.
• Use S3 for everything, and plan your dataflow ahead.
• Make carton a part of your life, and especially of your build tool’s.
( carton )
• Shipwright for humans
• Reads dependencies from Makefile.PL
• Installs them locally to your app
• Deploy your stuff, including carton.lock
• Run carton install --deployment
• Tar result and upload to S3
URLs
• The MapReduce Paper: http://labs.google.com/papers/mapreduce.html
• Apache Hadoop: http://hadoop.apache.org/
• Amazon Elastic MapReduce: http://aws.amazon.com/elasticmapreduce/
URLs
• Hadoop Streaming Tutorial (Apache): http://hadoop.apache.org/common/docs/r0.20.2/streaming.html
• Hadoop Streaming How-To (Amazon): http://docs.amazonwebservices.com/ElasticMapReduce/latest/GettingStartedGuide/CreateJobFlowStreaming.html
URLs
• Amazon EMR Perl Client Library: http://aws.amazon.com/code/Elastic-MapReduce/2309
• Amazon EMR Command-Line Tool: http://aws.amazon.com/code/Elastic-MapReduce/2264
That’s All, Folks!
Slides available at http://slideshare.net/pfig/perl-on-amazon-elastic-mapreduce