Data Munging with Hadoop
August 2013
Giannis Neokleous
www.giann.is
@yiannis_n
Hadoop at Knewton
● AWS - EMR
○ Use S3 as the filesystem, then Redshift.
● Some Cascading
● Processing
○ Student Events (At different processing stages)
○ Recommendations
○ Course graphs
○ Compute metrics
● Use internal tools for launching EMR clusters
Analytics Team at Knewton?
● Currently working on building the platform, tools, interfaces, and data for other teams
● Source of new model/metrics development
○ Compute and data processing platform, data warehouse
○ All your data are belong to us
● Making access to bulk data for Data Scientists easier
● Nature of data we’re publishing calls for solid validation methods
○ Analytics enables that
● Multiple sources, lots of data models
○ Recommendations/Student events/Course graphs/Rec Context/etc.
Dealing with multiple data source types
Why?
● Teams using multiple data stores
○ Want to bootstrap the data stores
○ Want to bulk analyze the data stores
● Decide to build a data warehouse to store all your data
● Not everything sits in human readable log files
● Simulate traffic through your system and want to push to services
○ Load tests
Stuck!
How do you bring in all the data from all the data sources?
Solutions
● Ask the service owners to publish their payload every day in an easy-to-process format - Log files?
○ Consistency is hard (source =?= payload)
● In the case of services, use the API
● Rely on packaged tools to transform data
○ Slow transformation processes
○ One extra transformation step
Can you do better?
● Understand your input source or output destinations
○ Write custom:
■ InputFormats / OutputFormats
■ RecordReader/RecordWriter
● Cleaner code in your mappers and reducers
○ Good separation of data model initialization and actual logic
○ Meaningful objects in mappers, reducers - see the sketch below
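For illustration, a minimal sketch of what that buys you (RecommendationWritable is the deck's example type; getStudentId() is a hypothetical accessor): the mapper receives an already-deserialized object instead of raw text it must parse itself.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class RecommendationMapper extends Mapper<LongWritable, RecommendationWritable, Text, LongWritable> {
    @Override
    protected void map(LongWritable key, RecommendationWritable value, Context context) throws IOException, InterruptedException {
        // No JSON parsing here - the custom RecordReader already built the object.
        context.write(new Text(value.getStudentId()), key); // getStudentId() is hypothetical
    }
}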
Really? Where?
Examples
Used it for:
● Bootstrapping data stores like Cassandra, SenseiDB, Redshift
● Writing lucene indices
● Bulk reading data stores for analytics
● Load tests
○ Simulating real traffic
○ Publishing to queues
SenseiDB
kk, show me more
A little about MapReduce
● InputFormat
○ Figures out where the data is, what to read, and how to read it
○ Divides the data among record readers
● RecordReader
○ Instantiated by InputFormats
○ Does the actual reading
● Mapper
○ Key/Value pairs get passed in by the record readers
● Reducer
○ Key/Value pairs get passed in from the mappers
○ All the same keys end up in the same reducer
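To make the grouping concrete, here is a minimal count-style reducer (a generic sketch, not from the deck): every value emitted for a given key arrives in a single reduce() call.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class CountReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
        long sum = 0;
        // All values that share this key were routed to this one reducer call.
        for (LongWritable v : values) {
            sum += v.get();
        }
        context.write(key, new LongWritable(sum));
    }
}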
A little about MapReduce
● OutputFormat
○ Figures out where and how to write the data
○ Divides the data among record writers
○ What to do after the data has been written
● RecordWriter
○ Instantiated by OutputFormats
○ Does the actual writing
Data Flow
[Diagram: DB tables feed a Hadoop cluster. On the input side, an InputFormat pulls from the source and hands the data to RecordReaders; records flow through Map and Reduce tasks; on the output side, an OutputFormat pushes results out through RecordWriters.]
What do I need?
You need to:
● Write a custom record reader extending org.apache.hadoop.mapreduce.RecordReader
● Write a custom input format extending org.apache.hadoop.mapreduce.InputFormat
● Define them in your job configuration:
Job job = new Job(new Configuration());
job.setInputFormatClass(MyAwesomeInputFormat.class);
RecordReader IFace
public abstract class ExampleRecordReader extends RecordReader<LongWritable, RecommendationWritable> {
    public abstract void initialize(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException;
    public abstract boolean nextKeyValue() throws IOException, InterruptedException;
    public abstract LongWritable getCurrentKey() throws IOException, InterruptedException;
    public abstract RecommendationWritable getCurrentValue() throws IOException, InterruptedException;
    public abstract float getProgress() throws IOException, InterruptedException;
    public abstract void close() throws IOException;
}
InputFormat IFace
public abstract class ExampleInputFormat extends InputFormat<LongWritable, RecommendationWritable> {
    public abstract List<InputSplit> getSplits(JobContext context) throws IOException, InterruptedException;
    public abstract RecordReader<LongWritable, RecommendationWritable> createRecordReader(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException;
}
● A lot of times you subclass from an InputFormat further down the
class hierarchy and override a few methods
● Good way to filter early - see the sketch below.
○ Skip files you really don’t need to read
■ Works great with time series data
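A hedged sketch of that early filtering (the file-naming scheme and the hard-coded month are assumptions for illustration): override listStatus in a FileInputFormat subclass so unwanted files are never even opened.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class DateFilteringInputFormat extends TextInputFormat {
    @Override
    protected List<FileStatus> listStatus(JobContext job) throws IOException {
        List<FileStatus> filtered = new ArrayList<FileStatus>();
        for (FileStatus status : super.listStatus(job)) {
            // Assumes time series files named like events-2013-08-21.log;
            // skip whole files instead of reading and discarding their records.
            if (status.getPath().getName().contains("2013-08")) {
                filtered.add(status);
            }
        }
        return filtered;
    }
}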
JsonRecordReader
public abstract class JsonRecordReader<V extends Writable> extends RecordReader<LongWritable, V> {

    public boolean nextKeyValue() throws IOException, InterruptedException {
        key.set(pos);
        Text jsonText = new Text();
        int newSize = 0;
        if (pos < end) {
            newSize = in.readLine(jsonText);
            if (newSize > 0 && !jsonText.toString().isEmpty()) {
                for (ObjectDecorator<String> decorator : decorators) {
                    jsonText = new Text(decorator.decorateObject(jsonText.toString()));
                }
                V tempValue = (V) gson.fromJson(jsonText.toString(), getDataClass(jsonText.toString()));
                value = tempValue;
            }
            pos += newSize;
        }
        if (newSize == 0 || jsonText.toString().isEmpty()) {
            key = null;
            value = null;
            return false;
        } else {
            return true;
        }
    }

    public void initialize(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        FileSplit fileSplit = (FileSplit) split;
        start = fileSplit.getStart();
        end = start + fileSplit.getLength();
        compressionCodecs = new CompressionCodecFactory(conf);
        in = initLineReader(fileSplit, conf);
        pos = start;
        gson = gsonBuilder.create();
    }

    public LongWritable getCurrentKey() throws IOException, InterruptedException {
        return key;
    }

    public V getCurrentValue() throws IOException, InterruptedException {
        return value;
    }

    public float getProgress() throws IOException, InterruptedException {
        if (start == end) {
            return 0.0f;
        } else {
            return Math.max(0.0f, Math.min(1.0f, (pos - start) / (float) (end - start)));
        }
    }

    public void registerDeserializer(Type type, JsonDeserializer<?> jsonDes) {
        gsonBuilder.registerTypeAdapter(type, jsonDes);
    }

    // Other omitted methods
}
RecommendationsInputFormat
public class RecommendationsInputFormat extends FileInputFormat<LongWritable, RecommendationWritable> {

    public RecordReader<LongWritable, RecommendationWritable> createRecordReader(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
        JsonRecordReader<RecommendationWritable> rr = new JsonRecordReader<RecommendationWritable>() {
            @Override
            protected Class<?> getDataClass(String jsonStr) {
                return RecommendationWritable.class;
            }
        };
        FileSplit fileSplit = (FileSplit) split;
        rr.addDecorator(new JsonFilenameDecorator(fileSplit.getPath().getName()));
        return rr;
    }
}
● Inherits from FileInputFormat so everything else is taken care of
● Define createRecordReader and instantiate a JsonRecordReader
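Wiring it into a job then mirrors the earlier snippet (the input path here is hypothetical):

Job job = new Job(new Configuration());
job.setInputFormatClass(RecommendationsInputFormat.class);
FileInputFormat.addInputPath(job, new Path("/data/recommendations"));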
Output
OutputFormat IFace
public abstract class ExampleOutputFormat extends OutputFormat<LongWritable, RecommendationWritable> {
    public abstract RecordWriter<LongWritable, RecommendationWritable> getRecordWriter(TaskAttemptContext context) throws IOException, InterruptedException;
    public abstract void checkOutputSpecs(JobContext context) throws IOException, InterruptedException;
    public abstract OutputCommitter getOutputCommitter(TaskAttemptContext context) throws IOException, InterruptedException;
}
● getRecordWriter should initialize the RecordWriter and pass it an
output stream (doesn’t have to be HDFS specific).
○ example from TextOutputFormat:
public RecordWriter<K, V> getRecordWriter(TaskAttemptContext job) throws IOException, InterruptedException {
    ...
    FileSystem fs = file.getFileSystem(conf);
    if (!isCompressed) {
        FSDataOutputStream fileOut = fs.create(file, false);
        return new LineRecordWriter<K, V>(fileOut, keyValueSeparator);
    } else {
        FSDataOutputStream fileOut = fs.create(file, false);
        return new LineRecordWriter<K, V>(new DataOutputStream(codec.createOutputStream(fileOut)), keyValueSeparator);
    }
    ...
}
RecordWriter IFace
public abstract class ExampleRecordWriter extends RecordWriter<LongWritable, RecommendationWritable> {
    public abstract void write(LongWritable key, RecommendationWritable value) throws IOException, InterruptedException;
    public abstract void close(TaskAttemptContext context) throws IOException, InterruptedException;
}
● Can use an existing output stream (non-HDFS) from a library.
● If it’s a file, you can write to the temp work dir and on close move the file to HDFS - see the sketch below.
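A minimal sketch of that temp-then-move pattern (the class name and the tab-separated output format are assumptions, not Knewton's implementation):

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class LocalThenHdfsRecordWriter extends RecordWriter<LongWritable, RecommendationWritable> {
    private final File localFile;
    private final Path hdfsTarget;
    private final Configuration conf;
    private final BufferedWriter out;

    public LocalThenHdfsRecordWriter(File localFile, Path hdfsTarget, Configuration conf) throws IOException {
        this.localFile = localFile;
        this.hdfsTarget = hdfsTarget;
        this.conf = conf;
        this.out = new BufferedWriter(new FileWriter(localFile));
    }

    @Override
    public void write(LongWritable key, RecommendationWritable value) throws IOException, InterruptedException {
        // Write records to a local temp file while the task runs.
        out.write(key.get() + "\t" + value.toString() + "\n");
    }

    @Override
    public void close(TaskAttemptContext context) throws IOException, InterruptedException {
        out.close();
        // Move the finished file into HDFS only once everything has been written.
        FileSystem fs = hdfsTarget.getFileSystem(conf);
        fs.copyFromLocalFile(new Path(localFile.getAbsolutePath()), hdfsTarget);
    }
}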
Does it matter what I write?
Yes!
● You have to worry about splits
● Avro, Sequence files, Thrift (with a little work)
○ Automatically serialize and deserialize objects on read/write
*Table comparing serialization formats borrowed from Hadoop: The Definitive Guide (not reproduced in this transcript).
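For instance, SequenceFile output handles splits and serialization for you (standard Hadoop API; RecommendationWritable is the deck's example type):

job.setOutputFormatClass(SequenceFileOutputFormat.class);
job.setOutputKeyClass(LongWritable.class);
job.setOutputValueClass(RecommendationWritable.class);
// Block compression keeps the files compact while staying splittable.
SequenceFileOutputFormat.setOutputCompressionType(job, SequenceFile.CompressionType.BLOCK);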
What else?
Bulk read Cassandra starting today
● Apply the same ideas to bulk reading Cassandra (or writing, though not today)
● No need to make a single call to Cassandra
A little about SSTables
● Sorted
○ Both row keys and columns
● Key Value pairs
○ Rows:
■ Row value: Key
■ Columns: Value
○ Columns:
■ Column name: Key
■ Column value: Value
● Immutable
● Consist of 4 parts, e.g. ColumnFamily-hd-3549-Data.db
○ Column family name: ColumnFamily
○ Version: hd
○ Table number: 3549
○ Type: Data - one table consists of 4 or 5 types, one of: Data, Index, Filter, Statistics, [CompressionInfo]
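A quick hypothetical helper (not part of any Cassandra library) showing how the four parts fall out of the file name:

// "ColumnFamily-hd-3549-Data.db" -> the four parts described above
String[] parts = "ColumnFamily-hd-3549-Data.db".split("-");
String columnFamily = parts[0];               // column family name
String version = parts[1];                    // version, e.g. "hd"
int tableNumber = Integer.parseInt(parts[2]); // table number
String type = parts[3].replace(".db", "");    // Data, Index, Filter, Statistics or CompressionInfo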
SSTableInputFormat
● An input format specifically for SSTables.
○ Extends from FileInputFormat
● Includes a DataPathFilter for filtering input down to *-Data.db files
● Expands all subdirectories of input - Filters for ColumnFamily
● Configures Comparator, Subcomparator and Partitioner classes used in
ColumnFamily.
● Two types, both extending SSTableInputFormat (itself a FileInputFormat):
○ SSTableColumnInputFormat
○ SSTableRowInputFormat
SSTableRecordReader
● A record reader specifically for SSTables.
● On init:
○ Copies the table locally. (Decompresses it, if using Priam)
○ Opens the table for reading. (Only needs Data, Index and
CompressionInfo tables)
○ Creates a TableScanner from the reader
● Two types, both extending SSTableRecordReader (itself a RecordReader):
○ SSTableColumnRecordReader
○ SSTableRowRecordReader
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Load additional properties from a conf file, before the Job copies the Configuration.
    ClassLoader loader = SSTableMRExample.class.getClassLoader();
    conf.addResource(loader.getResource("knewton-site.xml"));
    Job job = new Job(conf);
    SSTableInputFormat.setPartitionerClass(RandomPartitioner.class.getName(), job);
    SSTableInputFormat.setComparatorClass(LongType.class.getName(), job);
    SSTableInputFormat.setColumnFamilyName("StudentEvents", job);
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(StudentEventWritable.class);
    // Define mappers/reducers - the only thing you have to write.
    job.setMapperClass(StudentEventMapper.class);
    job.setReducerClass(StudentEventReducer.class);
    // Read each column as a separate record.
    job.setInputFormatClass(SSTableColumnInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    SSTableInputFormat.addInputPaths(job, args[0]);
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
}
public class StudentEventMapper extends SSTableColumnMapper<Long, StudentEvent, LongWritable, StudentEventWritable> {
    // Sees row key/column pairs. Remember to skip deleted columns (tombstones).
    @Override
    public void performMapTask(Long key, StudentEvent value, Context context) {
        // do stuff here
    }
    // Some other omitted trivial methods
}
KnewtonMRTools
● Open sourcing today!
○ http://github.com/knewton/KnewtonMRTools
● Plan is to have reusable record readers and writers
● Currently includes the JsonRecordReader
● Includes example jobs with sample input - Self contained
KassandraMRHelper
● Find out more:
○ Source: http://github.com/knewton/KassandraMRHelper
○ Slides: http://giann.is/media/cassandra_meetup_july_2013.pdf
○ Presentation video: http://www.livestream.com/knerd/video?clipId=pla_5aa2c893-2803-47f4-b674-15b1af52a226&utm_source=lslibrary&utm_medium=ui-thumb
● Has all you need to get started on bulk reading SSTables with Hadoop.
● Includes an example job that reads "student events"
● Handles compressed tables
● Use Priam? Even better: it can snappy-decompress Priam backups.
● Don't have a cluster up or a table handy?
○ Use com.knewton.mapreduce.cassandra.WriteSampleSSTable in the test source directory to generate one.
Thank you
Questions?
Giannis Neokleous
www.giann.is
@yiannis_n