Apache Hadoop Java API


DESCRIPTION

Short introduction to MapReduce Java API for Apache Hadoop

TRANSCRIPT

Page 1: Apache Hadoop Java API

Short Apache Hadoop API Overview

Adam Kawa, Data Engineer @ Spotify

Page 2: Apache Hadoop Java API

Image source: http://developer.yahoo.com/hadoop/tutorial/module4.html

Page 3: Apache Hadoop Java API

InputFormat Responsibilities

Divide input data into logical input splits

Data in HDFS is divided into blocks, but processed as input splits

An InputSplit may contain any number of blocks (usually 1)

Each Mapper processes one input split

Creates RecordReaders to extract <key, value> pairs

Page 4: Apache Hadoop Java API

InputFormat Class

public abstract class InputFormat<K, V> {

public abstract

List<InputSplit> getSplits(JobContext context) throws ...;

public abstract

RecordReader<K,V> createRecordReader(InputSplit split,

TaskAttemptContext context) throws ...;

}

Page 5: Apache Hadoop Java API

Most Common InputFormats

TextInputFormat

Each \n-terminated line is a value

The byte offset of that line is a key

Why not a line number?

KeyValueTextInputFormat

Key and value are separated by a separator (tab by default)
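A minimal sketch of wiring KeyValueTextInputFormat into a job (new MapReduce API, Hadoop 2); the separator property name is an assumption based on KeyValueLineRecordReader and may differ on older releases:

Job job = new Job(conf, "kv example");
job.setInputFormatClass(KeyValueTextInputFormat.class);
// assumed Hadoop 2 property name; older releases use key.value.separator.in.input.line
job.getConfiguration().set(
    "mapreduce.input.keyvaluelinerecordreader.key.value.separator", "\t");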

Page 6: Apache Hadoop Java API

Binary InputFormats

SequenceFileInputFormat

SequenceFiles are flat files consisting of binary <key, value> pairs

AvroInputFormat

Avro supports rich data structures (not necessarily <key, value> pairs) serialized to files or messages

Compact, fast, language-independent, self-describing, dynamic

Page 7: Apache Hadoop Java API

Some Other InputFormats

NLineInputFormat

Each mapper receives a fixed number (N) of input lines

The input should not be too big, since splits are calculated in a single thread (NLineInputFormat#getSplitsForFile)

CombineFileInputFormat

An abstract class, but not so difficult to extend

SeparatorInputFormat

How-to here: http://blog.rguha.net/?p=293

Page 8: Apache Hadoop Java API

Some Other InputFormats

MultipleInputs

Supports multiple input paths with a different InputFormat and Mapper for each path

MultipleInputs.addInputPath(job,

firstPath, FirstInputFormat.class, FirstMapper.class);

MultipleInputs.addInputPath(job,

secondPath, SecondInputFormat.class, SecondMapper.class);

Page 9: Apache Hadoop Java API

InputFormat Class (Partial) Hierarchy

Page 10: Apache Hadoop Java API

InputFormat Interesting Facts

Ideally, the InputSplit size is equal to the HDFS block size

Or an InputSplit contains multiple collocated HDFS blocks

InputFormat may prevent splitting a file

A whole file is processed by a single mapper (e.g. gzip)

boolean FileInputFormat#isSplitable(JobContext context, Path filename);
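For example, a minimal (hypothetical) input format that never splits its files, so each file is handled by a single mapper:

public class WholeFileTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;  // never split: one mapper per file
    }
}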

Page 11: Apache Hadoop Java API

InputFormat Interesting Facts

A Mapper knows the file/offset/size of the split that it processes

MapContext#getInputSplit()

Useful for later debugging on a local machine
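A minimal sketch: with a file-based input format the split can be cast to FileSplit to log exactly which part of which file this mapper processes:

protected void setup(Context context) throws IOException, InterruptedException {
    FileSplit split = (FileSplit) context.getInputSplit();
    System.err.println("Processing " + split.getPath() + " at offset "
        + split.getStart() + ", length " + split.getLength());
}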

Page 12: Apache Hadoop Java API

InputFormat Interesting Facts

PathFilter (used by FileInputFormat) specifies which files to include in, or exclude from, the input data

PathFilter hiddenFileFilter = new PathFilter(){

public boolean accept(Path p){

String name = p.getName();

return !name.startsWith("_") && !name.startsWith(".");

}

}; 
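To plug a custom filter into a job, FileInputFormat exposes a setter (the filter class name below is hypothetical); hidden files starting with "_" or "." are filtered out by default:

FileInputFormat.setInputPathFilter(job, MyPathFilter.class);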

Page 13: Apache Hadoop Java API

RecordReader

Extracts <key, value> pairs from the corresponding InputSplit

Examples:

LineRecordReader

KeyValueRecordReader

SequenceFileRecordReader

Page 14: Apache Hadoop Java API

RecordReader Logic

Must handle the common situation where InputSplit and HDFS block boundaries do not match

Image source: Hadoop: The Definitive Guide by Tom White

Page 15: Apache Hadoop Java API

RecordReader Logic

Example solution, based on LineRecordReader:

Skips* everything from its block until the first '\n'

Reads into the second block until it sees '\n'

*except the very first block (offset equal to 0)

Image source: Hadoop: The Definitive Guide by Tom White

Page 16: Apache Hadoop Java API

Keys And Values

Keys must implement the WritableComparable interface

Since they are sorted before being passed to the Reducers

Values must implement "at least" the Writable interface

Page 17: Apache Hadoop Java API

WritableComparables Hierarchy

Image source: Hadoop: The Definitive Guide by Tom White

Page 18: Apache Hadoop Java API

Writable And WritableComparable

public interface Writable {

void write(DataOutput out) throws IOException;

void readFields(DataInput in) throws IOException;

}

public interface WritableComparable<T> extends Writable, Comparable<T> {

}

public interface Comparable<T> {

public int compareTo(T o);

}

Page 19: Apache Hadoop Java API

Example: SongWritable

class SongWritable implements Writable {

String title;

int year;

byte[] content;

public void write(DataOutput out) throws ... {

out.writeUTF(title);

out.writeInt(year);

out.writeInt(content.length);

out.write(content);

}
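// The matching readFields() is not shown on the slide; a minimal sketch,
// reading the fields back in exactly the order they were written:
public void readFields(DataInput in) throws ... {
    title = in.readUTF();
    year = in.readInt();
    int length = in.readInt();
    content = new byte[length];
    in.readFully(content);
}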

}

Page 20: Apache Hadoop Java API

Mapper

Takes input in the form of a <key, value> pair

Emits a set of intermediate <key, value> pairs

Stores them locally and later passes them to the Reducers

But earlier: partition + sort + spill + merge

Page 21: Apache Hadoop Java API

Mapper Methods

protected void setup(Context context) throws ... {}

protected void cleanup(Context context) throws ... {}

void map(KEYIN key, VALUEIN value, Context context) ... {

context.write((KEYOUT) key, (VALUEOUT) value);

}

public void run(Context context) throws ... {

setup(context);

while (context.nextKeyValue()) {

map(context.getCurrentKey(), context.getCurrentValue(), context);

}

cleanup(context);

}

Page 22: Apache Hadoop Java API

MapContext Object

Allows the user's map code to communicate with the MapReduce system

public InputSplit getInputSplit();

public TaskAttemptID getTaskAttemptID();

public void setStatus(String msg);

public boolean nextKeyValue() throws ...;

public KEYIN getCurrentKey() throws ...;

public VALUEIN getCurrentValue() throws ...;

public void write(KEYOUT key, VALUEOUT value) throws ...;

public Counter getCounter(String groupName, String counterName);

Page 23: Apache Hadoop Java API

Examples Of Mappers

Implement highly specialized Mappers and reuse/chain them when possible

IdentityMapper

InverseMapper

RegexMapper

TokenCounterMapper

Page 24: Apache Hadoop Java API

TokenCounterMapper

public class TokenCounterMapper extends Mapper<Object, Text, Text, IntWritable> {

private final static IntWritable one = new IntWritable(1);

private Text word = new Text();

@Override

public void map(Object key, Text value, Context context

) throws IOException, InterruptedException {

StringTokenizer itr = new StringTokenizer(value.toString());

while (itr.hasMoreTokens()) {

word.set(itr.nextToken());

context.write(word, one);

}

}

}

Page 25: Apache Hadoop Java API

General Advice

Reuse Writable objects instead of creating a new one each time

The Apache Commons StringUtils class seems to be the most efficient choice for String tokenization
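For example, the map loop from TokenCounterMapper could be rewritten with Commons Lang (a sketch, assuming whitespace-separated tokens and org.apache.commons.lang.StringUtils on the classpath):

// StringUtils.split skips empty tokens and avoids the regex machinery of String#split
String[] tokens = StringUtils.split(value.toString());
for (String token : tokens) {
    word.set(token);            // reuse the same Text instance
    context.write(word, one);   // reuse the same IntWritable instance
}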

Page 26: Apache Hadoop Java API

Chain Of Mappers

Use multiple Mapper classes within a single Map task

The output of the first Mapper becomes the input of the second, and so on until the last Mapper

The output of the last Mapper will be written to the task's output

Encourages implementation of reusable and highly specialized Mappers

Page 27: Apache Hadoop Java API

Example Chain Of Mappers

JobConf mapAConf = new JobConf(false);

 ...

 ChainMapper.addMapper(conf, AMap.class, LongWritable.class, Text.class,

   Text.class, Text.class, true, mapAConf);

 

 JobConf mapBConf = new JobConf(false);

 ...

 ChainMapper.addMapper(conf, BMap.class, Text.class, Text.class,

   LongWritable.class, Text.class, false, mapBConf);

 FileInputFormat.setInputPaths(conf, inDir);

 FileOutputFormat.setOutputPath(conf, outDir);

 JobClient jc = new JobClient(conf);

 RunningJob job = jc.submitJob(conf);

Page 28: Apache Hadoop Java API

Partitioner

Specifies which Reducer a given <key, value> pair is sent to

Aim for an even distribution of the intermediate data

Skewed data may overload a single reducer and make the whole job run longer

public abstract class Partitioner<KEY, VALUE> {

public abstract

int getPartition(KEY key, VALUE value, int numPartitions);

}

Page 29: Apache Hadoop Java API

HashPartitioner

The default choice for general-purpose use cases

public int getPartition(K key, V value, int numReduceTasks) {

return

(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;

}

Page 30: Apache Hadoop Java API

TotalOrderPartitioner

A partitioner that aims for a total order of the output

Page 31: Apache Hadoop Java API

TotalOrderPartitioner

Before the job runs, it samples the input data to provide a fairly even distribution over keys

Page 32: Apache Hadoop Java API

TotalOrderPartitioner

Three samplers:

InputSampler.RandomSampler<K,V>

Sample from random points in the input

InputSampler.IntervalSampler<K,V>

Sample from s splits at regular intervals

InputSampler.SplitSampler<K,V>

Samples the first n records from s splits
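A minimal sketch of wiring the partitioner and a sampler together (classes from org.apache.hadoop.mapreduce.lib.partition; the sampling parameters and the partition file path are arbitrary):

job.setPartitionerClass(TotalOrderPartitioner.class);
// where to write the cut points that define each reducer's key range
TotalOrderPartitioner.setPartitionFile(job.getConfiguration(), new Path("_partitions"));
InputSampler.Sampler<Text, Text> sampler =
    new InputSampler.RandomSampler<Text, Text>(0.1, 10000, 10);
InputSampler.writePartitionFile(job, sampler);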

Page 33: Apache Hadoop Java API

Reducer

Gets list(<key, list(value)>)

Keys are sorted, but values for a given key are not sorted

Emits a set of output <key, value> pairs
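For reference, a minimal sum-style reducer (a sketch, not from the slides):

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();   // add up all values emitted for this key
        }
        result.set(sum);
        context.write(key, result);
    }
}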

Page 34: Apache Hadoop Java API

Reducer Run Method

public void run(Context context) throws ... {

setup(context);

while (context.nextKey()) {

reduce(context.getCurrentKey(),

context.getValues(), context);

}

cleanup(context);

}

Page 35: Apache Hadoop Java API

Chain Of Mappers After A Reducer

The ChainReducer class allows chaining multiple Mapper classes after a Reducer within the Reducer task

Combined with ChainMapper, one could get [MAP+ / REDUCE MAP*]

ChainReducer.setReducer(conf, XReduce.class, LongWritable.class, Text.class,

Text.class, Text.class, true, reduceConf);

ChainReducer.addMapper(conf, CMap.class, Text.class, Text.class,

LongWritable.class, Text.class, false, null);

ChainReducer.addMapper(conf, DMap.class, LongWritable.class, Text.class,

LongWritable.class, LongWritable.class, true, null);

Page 36: Apache Hadoop Java API

OutputFormat Class Hierarchy

Image source: Hadoop: The Definitive Guide by Tom White

Page 37: Apache Hadoop Java API

MultipleOutputs

MultipleOutputs.addNamedOutput(job, "text", TextOutputFormat.class, LongWritable.class, Text.class);

MultipleOutputs.addNamedOutput(job, "seq", SequenceFileOutputFormat.class, LongWritable.class, Text.class);

public void reduce(WritableComparable key, Iterable<Writable> values, Context context) throws ... {

...

mos.write("text", , key, new Text("Hello"));

mos.write("seq", LongWritable(1), new Text("Bye"), "seq_a");

mos.write("seq", LongWritable(2), key, new Text("Chau"), "seq_b");

mos.write(key, new Text("value"), generateFileName(key, new Text("value")));

}
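The snippet assumes a mos field; a sketch of the surrounding boilerplate (not shown on the slide):

private MultipleOutputs<LongWritable, Text> mos;

protected void setup(Context context) {
    mos = new MultipleOutputs<LongWritable, Text>(context);
}

protected void cleanup(Context context) throws IOException, InterruptedException {
    mos.close();   // flush and close all named outputs
}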

Page 38: Apache Hadoop Java API

Other Useful Features

Combiner

Skipping bad records

Compression

Profiling

Isolation Runner

Page 39: Apache Hadoop Java API

Job Class Methods

public void setInputFormatClass(..);

public void setOutputFormatClass(..);

public void setMapperClass(..);

public void setCombinerClass(..);

public void setReducerClass(...);

public void setPartitionerClass(..);

public void setMapOutputKeyClass(..);

public void setMapOutputValueClass(..);

public void setOutputKeyClass(..);

public void setOutputValueClass(..);

public void setSortComparatorClass(..);

public void setGroupingComparatorClass(..);

public void setNumReduceTasks(int tasks);

public void setJobName(String name);

public float mapProgress();

public float reduceProgress();

public boolean isComplete();

public boolean isSuccessful();

public void killJob();

public void submit();

public boolean waitForCompletion(..);
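A minimal sketch showing how a few of these setters typically fit together (the driver class name is illustrative):

Job job = new Job(conf, "token count");
job.setJarByClass(MyDriver.class);               // hypothetical driver class
job.setMapperClass(TokenCounterMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setNumReduceTasks(10);
System.exit(job.waitForCompletion(true) ? 0 : 1);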

Page 40: Apache Hadoop Java API

ToolRunner

Supports parsing of generic options, allowing the user to specify configuration options on the command line:

hadoop jar examples.jar SongCount

-D mapreduce.job.reduces=10

-D artist.gender=FEMALE

-files dictionary.dat

-libjars math.jar,spotify.jar

songs counts

Page 41: Apache Hadoop Java API

Side Data Distribution

public class MyMapper<K, V> extends Mapper<K, V, V, K> {

String gender = null;

File dictionary = null;

protected void setup(Context context) throws ... {

Configuration conf = context.getConfiguration();

gender = conf.get("artist.gender", "MALE");

dictionary = new File("dictionary.dat");

}

Page 42: Apache Hadoop Java API

public class WordCount extends Configured implements Tool {

public int run(String[] otherArgs) throws Exception {

if (otherArgs.length != 2) {

System.err.printf("Usage: %s [options] <input> <output>%n", getClass().getSimpleName());

return -1;

}

Job job = new Job(getConf());

FileInputFormat.setInputPaths(job, new Path(otherArgs[0]));

FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));

...

return job.waitForCompletion(true) ? 0 : 1;

}

public static void main(String[] allArgs) throws Exception {

int exitCode = ToolRunner.run(new Configuration(), new WordCount(), allArgs);

System.exit(exitCode);

}

}

Page 43: Apache Hadoop Java API

MRUnit

Built on top of JUnit

Provides mock InputSplit, Context, and other classes

Can test

The Mapper class,

The Reducer class,

The full MapReduce job

The pipeline of MapReduce jobs

Page 44: Apache Hadoop Java API

MRUnit Example

public class IdentityMapTest extends TestCase {

private MapDriver<Text, Text, Text, Text> driver;

@Before

public void setUp() {

driver = new MapDriver<Text, Text, Text, Text>(new MyMapper<Text, Text, Text, Text>());

}

@Test

public void testMyMapper() {

driver

.withInput(new Text("foo"), new Text("bar"))

.withOutput(new Text("oof"), new Text("rab"))

.runTest();

}

}

Page 45: Apache Hadoop Java API

Example: Secondary Sort

The reduce(key, Iterator<value>) method gets an iterator over values

These values are not sorted for a given key

Sometimes we want to get them sorted

Useful to find the minimum or maximum value quickly

Page 46: Apache Hadoop Java API

Secondary Sort Is Tricky

A couple of custom classes are needed:

WritableComparable

Partitioner

SortComparator (optional, but recommended)

GroupingComparator

Page 47: Apache Hadoop Java API

Composite Key

Leverages the "traditional" sorting mechanism of intermediate keys

The intermediate key becomes a composite of the "natural" key and the value

(Disturbia, 1) → (Disturbia#1, 1)

(SOS, 4) → (SOS#4, 4)

(Disturbia, 7) → (Disturbia#7, 7)

(Fast car, 2) → (Fast car#2, 2)

(Fast car, 6) → (Fast car#6, 6)

(Disturbia, 4) → (Disturbia#4, 4)

(Fast car, 2) → (Fast car#2, 2)

Page 48: Apache Hadoop Java API

Custom Partitioner

HashPartitioner uses a hash of the whole key

The same title may go to different reducers (because titles are combined with timestamps in the key)

Use a custom partitioner that partitions only on the first part of the key

int getPartition(TitleWithTs key, LongWritable value, int num) {

return hashPartitioner.getPartition(key.title, value, num);

}

Page 49: Apache Hadoop Java API

Ordering Of Keys

Keys need to be sorted before being passed to the reducer

Order by the natural key and, for the same natural key, by the value portion of the key

Implement sorting in the WritableComparable itself or use a Comparator class

job.setSortComparatorClass(SongWithTsComparator.class);

Page 50: Apache Hadoop Java API

Data Passed To The Reducer

By default, each unique key triggers a separate invocation of the reduce() method:

(Disturbia#1, 1) → reduce method is invoked

(Disturbia#4, 4) → reduce method is invoked

(Disturbia#7, 7) → reduce method is invoked

(Fast car#2, 2) → reduce method is invoked

(Fast car#2, 2)

(Fast car#6, 6) → reduce method is invoked

(SOS#4, 4) → reduce method is invoked

Page 51: Apache Hadoop Java API

Data Passed To The Reducer

The GroupingComparator class determines which keys and values are passed in a single call to the reduce method

Just look at the natural key when grouping:

(Disturbia#1, 1) → reduce method is invoked

(Disturbia#4, 4)

(Disturbia#7, 7)

(Fast car#2, 2) → reduce method is invoked

(Fast car#2, 2)

(Fast car#6, 6)

(SOS#4, 4) → reduce method is invoked
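A hedged sketch of such a grouping comparator, assuming the composite key class TitleWithTs from the previous slide exposes its natural-key part as a Text field named title:

public class TitleGroupingComparator extends WritableComparator {
    protected TitleGroupingComparator() {
        super(TitleWithTs.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        // group on the natural key only; ignore the appended timestamp
        return ((TitleWithTs) a).title.compareTo(((TitleWithTs) b).title);
    }
}

// job.setGroupingComparatorClass(TitleGroupingComparator.class);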

Page 52: Apache Hadoop Java API

Question

How to calculate a median from a set of numbers using Java MapReduce?

Page 53: Apache Hadoop Java API

Question – A Possible Answer

Implement TotalSort, but:

Each Reducer produces an additional file containing a pair

<minimum_value, number_of_values>

After the job ends, a single-threaded application:

Reads these files to build an index

Calculates which value in which file is the median

Finds this value in this file

Page 54: Apache Hadoop Java API

Thanks!

Would you like to use the Hadoop API at Spotify?

Apply via [email protected]