Data Munging with Hadoop
August 2013
Giannis Neokleous
www.giann.is
@yiannis_n
Hadoop at Knewton
● AWS - EMR
○ Use S3 as the filesystem, then Redshift.
● Some Cascading
● Processing
○ Student Events (At different processing stages)
○ Recommendations
○ Course graphs
○ Compute metrics
● Use internal tools for launching EMR clusters
Analytics Team at Knewton?
● Currently working on building the platform, tools, interfaces, and data for other teams
● Source of new model/metrics development
○ Compute and data processing platform, data warehouse
○ All your data are belong to us
● Making access to bulk data for Data Scientists easier
● Nature of data we’re publishing calls for solid validation methods
○ Analytics enables that
● Multiple sources, lots of data models
○ Recommendations/Student events/Course graphs/Rec Context/etc.
Dealing with multiple data source types
Why?
● Teams using multiple data stores
○ Want to bootstrap the data stores
○ Want to bulk analyze the data stores
● Decide to build a data warehouse to store all your data
● Not everything sits in human readable log files
● Simulate traffic through your system and want to push to services
○ Load tests
Stuck!
How do you bring in all the data from all the data sources?
Solutions
● Ask the service owners to publish their payload every day in an easy-to-process format - Log files?
○ Consistency is hard (source =?= payload)
● In the case of services, use the API
● Rely on packaged tools to transform data
○ Slow transformation processes
○ One extra transformation step
Can you do better?
● Understand your input source or output destinations
○ Write custom:
■ InputFormats / OutputFormats
■ RecordReader/RecordWriter
● Cleaner code in your mappers and reducers
○ Good separation of data model initialization and actual logic
○ Meaningful objects in mappers, reducers - see the sketch below
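For illustration, a minimal sketch of what that buys you (RecommendationWritable is the deck's example type; getStudentId() is a hypothetical accessor): the mapper receives an already-deserialized object instead of raw text it must parse itself.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class RecommendationMapper extends Mapper<LongWritable, RecommendationWritable, Text, LongWritable> {
    @Override
    protected void map(LongWritable key, RecommendationWritable value, Context context) throws IOException, InterruptedException {
        // No JSON parsing here - the custom RecordReader already built the object.
        context.write(new Text(value.getStudentId()), key); // getStudentId() is hypothetical
    }
}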
Really? Where?
Examples
Used it for:
● Bootstrapping data stores like Cassandra, SenseiDB, Redshift
● Writing lucene indices
● Bulk reading data stores for analytics
● Load tests
○ Simulating real traffic
○ Publishing to queues
SenseiDB
kk, show me more
A little about MapReduce
● InputFormat
○ Figures out where the data is, what to read, and how to read it
○ Divides the data among record readers
● RecordReader
○ Instantiated by InputFormats
○ Does the actual reading
● Mapper
○ Key/Value pairs get passed in by the record readers
● Reducer
○ Key/Value pairs get passed in from the mappers
○ All the same keys end up in the same reducer
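To make the grouping concrete, here is a minimal count-style reducer (a generic sketch, not from the deck): every value emitted for a given key arrives in a single reduce() call.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class CountReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
        long sum = 0;
        // All values that share this key were routed to this one reducer call.
        for (LongWritable v : values) {
            sum += v.get();
        }
        context.write(key, new LongWritable(sum));
    }
}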
A little about MapReduce
● OutputFormat
○ Figures out where and how to write the data
○ Divides the data among record writers
○ What to do after the data has been written
● RecordWriter
○ Instantiated by OutputFormats
○ Does the actual writing
Data Flow
[Diagram: DB tables feed a Hadoop cluster. On the input side, an InputFormat pulls from the source and hands the data to RecordReaders; records flow through Map and Reduce tasks; on the output side, an OutputFormat pushes results out through RecordWriters.]
What do I need?
You need to:
● Write a custom record reader extending org.apache.hadoop.mapreduce.RecordReader
● Write a custom input format extending org.apache.hadoop.mapreduce.InputFormat
● Define them in your job configuration:
Job job = new Job(new Configuration());
job.setInputFormatClass(MyAwesomeInputFormat.class);
RecordReader IFace
public abstract class ExampleRecordReader extends RecordReader<LongWritable, RecommendationWritable> {
    public abstract void initialize(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException;
    public abstract boolean nextKeyValue() throws IOException, InterruptedException;
    public abstract LongWritable getCurrentKey() throws IOException, InterruptedException;
    public abstract RecommendationWritable getCurrentValue() throws IOException, InterruptedException;
    public abstract float getProgress() throws IOException, InterruptedException;
    public abstract void close() throws IOException;
}
InputFormat IFace
public abstract class ExampleInputFormat extends InputFormat<LongWritable, RecommendationWritable> {
    public abstract List<InputSplit> getSplits(JobContext context) throws IOException, InterruptedException;
    public abstract RecordReader<LongWritable, RecommendationWritable> createRecordReader(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException;
}
● A lot of times you subclass from an InputFormat further down the
class hierarchy and override a few methods
● Good way to filter early - see the sketch below.
○ Skip files you really don’t need to read
■ Works great with time series data
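A hedged sketch of that early filtering (the file-naming scheme and the hard-coded month are assumptions for illustration): override listStatus in a FileInputFormat subclass so unwanted files are never even opened.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class DateFilteringInputFormat extends TextInputFormat {
    @Override
    protected List<FileStatus> listStatus(JobContext job) throws IOException {
        List<FileStatus> filtered = new ArrayList<FileStatus>();
        for (FileStatus status : super.listStatus(job)) {
            // Assumes time series files named like events-2013-08-21.log;
            // skip whole files instead of reading and discarding their records.
            if (status.getPath().getName().contains("2013-08")) {
                filtered.add(status);
            }
        }
        return filtered;
    }
}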
JsonRecordReader
public abstract class JsonRecordReader<V extends Writable> extends RecordReader<LongWritable, V> {

    public boolean nextKeyValue() throws IOException, InterruptedException {
        key.set(pos);
        Text jsonText = new Text();
        int newSize = 0;
        if (pos < end) {
            newSize = in.readLine(jsonText);
            if (newSize > 0 && !jsonText.toString().isEmpty()) {
                for (ObjectDecorator<String> decorator : decorators) {
                    jsonText = new Text(decorator.decorateObject(jsonText.toString()));
                }
                V tempValue = (V) gson.fromJson(jsonText.toString(), getDataClass(jsonText.toString()));
                value = tempValue;
            }
            pos += newSize;
        }
        if (newSize == 0 || jsonText.toString().isEmpty()) {
            key = null;
            value = null;
            return false;
        } else {
            return true;
        }
    }

    public void initialize(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        FileSplit fileSplit = (FileSplit) split;
        start = fileSplit.getStart();
        end = start + fileSplit.getLength();
        compressionCodecs = new CompressionCodecFactory(conf);
        in = initLineReader(fileSplit, conf);
        pos = start;
        gson = gsonBuilder.create();
    }

    public LongWritable getCurrentKey() throws IOException, InterruptedException {
        return key;
    }

    public V getCurrentValue() throws IOException, InterruptedException {
        return value;
    }

    public float getProgress() throws IOException, InterruptedException {
        if (start == end) {
            return 0.0f;
        } else {
            return Math.max(0.0f, Math.min(1.0f, (pos - start) / (float) (end - start)));
        }
    }

    public void registerDeserializer(Type type, JsonDeserializer<?> jsonDes) {
        gsonBuilder.registerTypeAdapter(type, jsonDes);
    }

    // Other omitted methods
}
RecommendationsInputFormat
public class RecommendationsInputFormat extends FileInputFormat<LongWritable, RecommendationWritable> {

    public RecordReader<LongWritable, RecommendationWritable> createRecordReader(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
        JsonRecordReader<RecommendationWritable> rr = new JsonRecordReader<RecommendationWritable>() {
            @Override
            protected Class<?> getDataClass(String jsonStr) {
                return RecommendationWritable.class;
            }
        };
        FileSplit fileSplit = (FileSplit) split;
        rr.addDecorator(new JsonFilenameDecorator(fileSplit.getPath().getName()));
        return rr;
    }
}
● Inherits from FileInputFormat so everything else is taken care of
● Define createRecordReader and instantiate a JsonRecordReader
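Wiring it into a job then mirrors the earlier snippet (the input path here is hypothetical):

Job job = new Job(new Configuration());
job.setInputFormatClass(RecommendationsInputFormat.class);
FileInputFormat.addInputPath(job, new Path("/data/recommendations"));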
Output
OutputFormat IFace
public abstract class ExampleOutputFormat extends OutputFormat<LongWritable, RecommendationWritable> {
    public abstract RecordWriter<LongWritable, RecommendationWritable> getRecordWriter(TaskAttemptContext context) throws IOException, InterruptedException;
    public abstract void checkOutputSpecs(JobContext context) throws IOException, InterruptedException;
    public abstract OutputCommitter getOutputCommitter(TaskAttemptContext context) throws IOException, InterruptedException;
}
● getRecordWriter should initialize the RecordWriter and pass it an
output stream (doesn’t have to be HDFS specific).
○ example from TextOutputFormat:
public RecordWriter<K, V> getRecordWriter(TaskAttemptContext job) throws IOException, InterruptedException {
    ...
    FileSystem fs = file.getFileSystem(conf);
    if (!isCompressed) {
        FSDataOutputStream fileOut = fs.create(file, false);
        return new LineRecordWriter<K, V>(fileOut, keyValueSeparator);
    } else {
        FSDataOutputStream fileOut = fs.create(file, false);
        return new LineRecordWriter<K, V>(new DataOutputStream(codec.createOutputStream(fileOut)), keyValueSeparator);
    }
    ...
}
RecordWriter IFace
public abstract class ExampleRecordWriter extends RecordWriter<LongWritable, RecommendationWritable> {
    public abstract void write(LongWritable key, RecommendationWritable value) throws IOException, InterruptedException;
    public abstract void close(TaskAttemptContext context) throws IOException, InterruptedException;
}
● Can use an existing output stream (non-HDFS) from a library.
● If it’s a file, you can write to the temp work dir and on close move the file to HDFS - see the sketch below.
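A minimal sketch of that temp-then-move pattern (the class name and the tab-separated output format are assumptions, not Knewton's implementation):

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class LocalThenHdfsRecordWriter extends RecordWriter<LongWritable, RecommendationWritable> {
    private final File localFile;
    private final Path hdfsTarget;
    private final Configuration conf;
    private final BufferedWriter out;

    public LocalThenHdfsRecordWriter(File localFile, Path hdfsTarget, Configuration conf) throws IOException {
        this.localFile = localFile;
        this.hdfsTarget = hdfsTarget;
        this.conf = conf;
        this.out = new BufferedWriter(new FileWriter(localFile));
    }

    @Override
    public void write(LongWritable key, RecommendationWritable value) throws IOException, InterruptedException {
        // Write records to a local temp file while the task runs.
        out.write(key.get() + "\t" + value.toString() + "\n");
    }

    @Override
    public void close(TaskAttemptContext context) throws IOException, InterruptedException {
        out.close();
        // Move the finished file into HDFS only once everything has been written.
        FileSystem fs = hdfsTarget.getFileSystem(conf);
        fs.copyFromLocalFile(new Path(localFile.getAbsolutePath()), hdfsTarget);
    }
}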
Does it matter what I write?
Yes!
● You have to worry about splits
● Avro, Sequence files, Thrift (with a little work)
○ Automatically serialize and deserialize objects on read/write
*Table comparing serialization formats borrowed from Hadoop: The Definitive Guide (not reproduced in this transcript).
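For instance, SequenceFile output handles splits and serialization for you (standard Hadoop API; RecommendationWritable is the deck's example type):

job.setOutputFormatClass(SequenceFileOutputFormat.class);
job.setOutputKeyClass(LongWritable.class);
job.setOutputValueClass(RecommendationWritable.class);
// Block compression keeps the files compact while staying splittable.
SequenceFileOutputFormat.setOutputCompressionType(job, SequenceFile.CompressionType.BLOCK);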
What else?
Bulk read Cassandra starting today
● Apply the same ideas to bulk reading Cassandra (or writing, though not today)
● No need to make a single call to Cassandra
A little about SSTables
● Sorted
○ Both row keys and columns
● Key Value pairs
○ Rows:
■ Row value: Key
■ Columns: Value
○ Columns:
■ Column name: Key
■ Column value: Value
● Immutable
● Consist of 4 parts, e.g. ColumnFamily-hd-3549-Data.db
○ Column family name: ColumnFamily
○ Version: hd
○ Table number: 3549
○ Type: Data - one table consists of 4 or 5 types, one of: Data, Index, Filter, Statistics, [CompressionInfo]
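A quick hypothetical helper (not part of any Cassandra library) showing how the four parts fall out of the file name:

// "ColumnFamily-hd-3549-Data.db" -> the four parts described above
String[] parts = "ColumnFamily-hd-3549-Data.db".split("-");
String columnFamily = parts[0];               // column family name
String version = parts[1];                    // version, e.g. "hd"
int tableNumber = Integer.parseInt(parts[2]); // table number
String type = parts[3].replace(".db", "");    // Data, Index, Filter, Statistics or CompressionInfo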
SSTableInputFormat
● An input format specifically for SSTables.
○ Extends from FileInputFormat
● Includes a DataPathFilter for filtering input down to *-Data.db files
● Expands all subdirectories of input - Filters for ColumnFamily
● Configures Comparator, Subcomparator and Partitioner classes used in
ColumnFamily.
● Two types, both extending SSTableInputFormat (itself a FileInputFormat):
○ SSTableColumnInputFormat
○ SSTableRowInputFormat
SSTableRecordReader
● A record reader specifically for SSTables.
● On init:
○ Copies the table locally. (Decompresses it, if using Priam)
○ Opens the table for reading. (Only needs Data, Index and
CompressionInfo tables)
○ Creates a TableScanner from the reader
● Two types, both extending SSTableRecordReader (itself a RecordReader):
○ SSTableColumnRecordReader
○ SSTableRowRecordReader
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Load additional properties from a conf file, before the Job copies the Configuration.
    ClassLoader loader = SSTableMRExample.class.getClassLoader();
    conf.addResource(loader.getResource("knewton-site.xml"));
    Job job = new Job(conf);
    SSTableInputFormat.setPartitionerClass(RandomPartitioner.class.getName(), job);
    SSTableInputFormat.setComparatorClass(LongType.class.getName(), job);
    SSTableInputFormat.setColumnFamilyName("StudentEvents", job);
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(StudentEventWritable.class);
    // Define mappers/reducers - the only thing you have to write.
    job.setMapperClass(StudentEventMapper.class);
    job.setReducerClass(StudentEventReducer.class);
    // Read each column as a separate record.
    job.setInputFormatClass(SSTableColumnInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    SSTableInputFormat.addInputPaths(job, args[0]);
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
}
public class StudentEventMapper extends SSTableColumnMapper<Long, StudentEvent, LongWritable, StudentEventWritable> {
    // Sees row key/column pairs. Remember to skip deleted columns (tombstones).
    @Override
    public void performMapTask(Long key, StudentEvent value, Context context) {
        // do stuff here
    }
    // Some other omitted trivial methods
}
KnewtonMRTools
● Open sourcing today!
○ http://github.com/knewton/KnewtonMRTools
● Plan is to have reusable record readers and writers
● Currently includes the JsonRecordReader
● Includes example jobs with sample input - Self contained
KassandraMRHelper
● Find out more:
○ Source: http://github.com/knewton/KassandraMRHelper
○ Slides: http://giann.is/media/cassandra_meetup_july_2013.pdf
○ Presentation video: http://www.livestream.com/knerd/video?clipId=pla_5aa2c893-2803-47f4-b674-15b1af52a226&utm_source=lslibrary&utm_medium=ui-thumb
● Has all you need to get started on bulk reading SSTables with Hadoop.
● Includes an example job that reads "student events"
● Handles compressed tables
● Use Priam? Even better: it can snappy-decompress Priam backups.
● Don't have a cluster up or a table handy?
○ Use com.knewton.mapreduce.cassandra.WriteSampleSSTable in the test source directory to generate one.
Thank you
Questions?
Giannis Neokleous
www.giann.is
@yiannis_n