Rapid Development of Big Data Applications Using Spring for Apache Hadoop

Spring for Apache Hadoop

By Zenyk Matchyshyn

Agenda
• Goals of the project
• Hadoop Introduction
• High level support
• Workflows
• Scripting & Migration
• Alternatives
• Testing & Related

Big Data – Why?
Because of Terabytes and Petabytes:
• Smart meter analysis
• Genome processing
• Sentiment & social media analysis
• Network capacity trending & management
• Ad targeting
• Fraud detection

Goals
• Provide a programmatic model for working with the Hadoop ecosystem
• Simplify the use of client libraries
• Provide Spring-friendly wrappers
• Enable real-world usage as a part of Spring Batch & Spring Integration
• Leverage Spring features

Supported distros

• Apache Hadoop 1.2.1/2.0.6/2.2.0
• Cloudera CDH4
• Hortonworks HDP 1.3
• Pivotal HD 1.0/1.1

HADOOP INTRODUCTION

[Diagram: the Hadoop stack, showing HDFS, Hadoop Map/Reduce, HBase, Pig and Hive]

Hadoop basics

Split:   "Dog ate the bone" | "Cat ate the fish"
Map:     Dog, 1 | Ate, 1 | The, 1 | Bone, 1 | Cat, 1 | Ate, 1 | The, 1 | Fish, 1
Shuffle: Dog, 1 | Ate, {1, 1} | The, {1, 1} | Bone, 1 | Cat, 1 | Fish, 1
Reduce:  Dog, 1 | Ate, 2 | The, 2 | Bone, 1 | Cat, 1 | Fish, 1
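The four phases above can be sketched in plain Java. This is only an in-memory illustration of the idea: it uses no Hadoop classes, and the class and method names (WordCountSketch, map, shuffle, reduce) are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of split -> map -> shuffle -> reduce; no Hadoop classes involved.
class WordCountSketch {

    // Map phase: emit a (word, 1) pair for every word in one input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word.toLowerCase(), 1));
            }
        }
        return pairs;
    }

    // Shuffle phase: group all emitted values by key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the grouped values for each key.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        grouped.forEach((word, ones) ->
                counts.put(word, ones.stream().mapToInt(Integer::intValue).sum()));
        return counts;
    }

    // Runs all phases over the input splits (here, one line per split).
    static Map<String, Integer> wordCount(List<String> splits) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String split : splits) {
            pairs.addAll(map(split));
        }
        return reduce(shuffle(pairs));
    }

    public static void main(String[] args) {
        // prints {ate=2, bone=1, cat=1, dog=1, fish=1, the=2}
        System.out.println(wordCount(List.of("Dog ate the bone", "Cat ate the fish")));
    }
}
```

On a real cluster each phase runs distributed across nodes; here everything happens in one JVM so the data flow is easy to follow.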

Configuration
<… XML …>

<context:property-placeholder location="hadoop.properties"/>

<hdp:configuration>
    fs.default.name=${hd.fs}
    mapred.job.tracker=${hd.jt}
</hdp:configuration>

<… XML …>
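To show what the ${…} placeholders in the configuration amount to, here is a simplified sketch of placeholder substitution in plain Java. It is not Spring's implementation; the class name PlaceholderSketch and the property values are assumptions chosen for illustration.

```java
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Simplified illustration of ${key} substitution; not Spring's actual resolver.
class PlaceholderSketch {
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    // Replaces every ${key} in the text with the matching property value.
    static String resolve(String text, Properties props) {
        Matcher matcher = PLACEHOLDER.matcher(text);
        StringBuilder out = new StringBuilder();
        while (matcher.find()) {
            String key = matcher.group(1);
            String value = props.getProperty(key);
            if (value == null) {
                throw new IllegalArgumentException("Unresolved placeholder: " + key);
            }
            matcher.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("hd.fs", "hdfs://localhost:9000"); // hypothetical value
        props.setProperty("hd.jt", "localhost:9001");        // hypothetical value
        System.out.println(resolve("fs.default.name=${hd.fs}", props));
        System.out.println(resolve("mapred.job.tracker=${hd.jt}", props));
    }
}
```

Keeping cluster addresses in a properties file means the same XML can be reused across environments by swapping only hadoop.properties.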

Job definition
<hdp:job id="hadoopJob"
    input-path="${wordcount.input.path}"
    output-path="${wordcount.output.path}"
    libs="file:${app.repo}/supporting-lib-*.jar"
    mapper="org.company.Mapper"
    reducer="org.company.Reducer"/>

// Plain Hadoop API, for comparison
Configuration conf = new Configuration();
Job job = new Job(conf, "hadoopJob");
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setMapperClass(Mapper.class);
job.setReducerClass(Reducer.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);

Job Execution
• Basic:
<hdp:job-runner id="runner" run-at-startup="true"
    pre-action="someScript"
    post-action="someOtherScript"
    job-ref="hadoopJob"/>
• Scheduled
  – TaskScheduler
  – Quartz
• Custom
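The pre-action / job / post-action ordering, and the difference between basic and scheduled execution, can be sketched as follows. JobRunnerSketch is a hypothetical class written for this example, not the Spring for Apache Hadoop runner, and skipping the post-action after a failed job is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Hypothetical runner: illustrates ordering only, not the framework's implementation.
class JobRunnerSketch {
    private final Runnable preAction;
    private final BooleanSupplier job;   // returns true when the job succeeds
    private final Runnable postAction;

    JobRunnerSketch(Runnable preAction, BooleanSupplier job, Runnable postAction) {
        this.preAction = preAction;
        this.job = job;
        this.postAction = postAction;
    }

    // Runs the pre-action, then the job, then the post-action.
    // Assumption of this sketch: the post-action is skipped when the job fails.
    boolean run() {
        preAction.run();
        boolean succeeded = job.getAsBoolean();
        if (succeeded) {
            postAction.run();
        }
        return succeeded;
    }

    public static void main(String[] args) throws Exception {
        List<String> log = new ArrayList<>();
        JobRunnerSketch runner = new JobRunnerSketch(
                () -> log.add("someScript"),
                () -> { log.add("hadoopJob"); return true; },
                () -> log.add("someOtherScript"));

        // "Scheduled" variant: hand the runner to a scheduler instead of calling run() directly.
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        try {
            Callable<Boolean> task = runner::run;
            boolean ok = scheduler.schedule(task, 100, TimeUnit.MILLISECONDS).get();
            // prints true [someScript, hadoopJob, someOtherScript]
            System.out.println(ok + " " + log);
        } finally {
            scheduler.shutdown();
        }
    }
}
```

Calling runner.run() directly corresponds to the basic case; the scheduler stands in for a TaskScheduler or Quartz trigger.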

HIGH LEVEL TOOLS

Solutions

• HBase
• Hive
• Pig
• Cascading

Simplifies
• Thread safety
• DAO friendliness, wrappers and basic mappers
• Simple connection interfaces
• Runners, Template and callback methods
• Common scenarios simplifications
• Scripting support

Example - Template

template.execute("MyTable", new TableCallback<Object>() {
    @Override
    public Object doInTable(HTable table) throws Throwable {
        Put p = new Put(Bytes.toBytes("SomeRow"));
        p.add(Bytes.toBytes("SomeColumn"), Bytes.toBytes("SomeQualifier"), Bytes.toBytes("AValue"));
        table.put(p);
        return null;
    }
});

<hdp:hbase-configuration/>

<bean id="hbaseTemplate" class="org.springframework.data.hadoop.hbase.HbaseTemplate" p:configuration-ref="hbaseConfiguration"/>
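The template-and-callback idea behind the example above can be sketched without HBase. The sketch below is an assumption-laden illustration of the pattern only: FakeTable, FakeTemplate and this TableCallback interface are stand-ins invented here, not the HbaseTemplate API.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration of the template/callback pattern; all types here are invented stand-ins.
class TemplateCallbackSketch {

    // Stand-in for a table handle that must be released after use.
    static class FakeTable implements AutoCloseable {
        final String name;
        final List<String> rows = new ArrayList<>();
        boolean closed;

        FakeTable(String name) { this.name = name; }

        void put(String row) { rows.add(row); }

        @Override
        public void close() { closed = true; }
    }

    // The caller supplies only the work to do against an open table.
    interface TableCallback<T> {
        T doInTable(FakeTable table) throws Exception;
    }

    // The template owns resource handling: open, invoke callback, close, translate errors.
    static class FakeTemplate {
        FakeTable lastTable; // kept only so the example can inspect the result

        <T> T execute(String tableName, TableCallback<T> action) {
            try (FakeTable table = new FakeTable(tableName)) {
                lastTable = table;
                return action.doInTable(table);
            } catch (Exception e) {
                throw new IllegalStateException("Table operation failed", e);
            }
        }
    }

    public static void main(String[] args) {
        FakeTemplate template = new FakeTemplate();
        Integer rowCount = template.execute("MyTable", table -> {
            table.put("SomeRow");
            return table.rows.size();
        });
        // prints 1 row(s) written, closed=true
        System.out.println(rowCount + " row(s) written, closed=" + template.lastTable.closed);
    }
}
```

The point of the pattern is that callers never open or close the table themselves, so cleanup and exception translation live in one place.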

Example – Script Runner
<hdp:hive-server host="hivehost" port="10001"/>

<hdp:hive-template/>

<hdp:hive-client-factory host="some-host" port="some-port">
    <hdp:script location="classpath:org/company/hive/script.q">
        <arguments>ignore-case=true</arguments>
    </hdp:script>
</hdp:hive-client-factory>

<hdp:hive-runner id="hiveRunner" run-at-startup="true">
    <hdp:script>
        DROP TABLE IF EXISTS testHiveBatchTable;
        CREATE TABLE testHiveBatchTable (key int, value string);
    </hdp:script>
    <hdp:script location="hive-scripts/script.q"/>
</hdp:hive-runner>

WORKFLOWS

Typical Big Data Processing Flow

Capture → Pre-Process → Insert → Process → Extract → Present

Spring Batch & Spring Integration

• Big Data Flows are based on Spring Integration & Spring Batch

• Spring for Hadoop provides:
  – Spring Batch tasklets
  – Spring Integration support

Tasklets

• Job runners
• Script runners
• Hive
• Pig
• Cascading

Example

<hdp:job-tasklet id="hadoop-tasklet" job-ref="mr-job" wait-for-completion="true"/>

<batch:job id="job1">
    <batch:step id="import" next="ht">
        <batch:tasklet ref="script-tasklet"/>
    </batch:step>
    <batch:step id="ht">
        <batch:tasklet ref="hadoop-tasklet"/>
    </batch:step>
</batch:job>
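How the steps in the job above chain through their next attribute can be sketched in a few lines of plain Java. This is a toy model for illustration, assuming a simple linear flow; StepFlowSketch and Step are hypothetical names and none of this is Spring Batch code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of steps linked by "next"; not Spring Batch.
class StepFlowSketch {

    // One step: the work to run and the id of the following step (null ends the job).
    static final class Step {
        final Runnable tasklet;
        final String next;

        Step(Runnable tasklet, String next) {
            this.tasklet = tasklet;
            this.next = next;
        }
    }

    // Runs steps starting at firstId, following each "next" link; returns the ids in run order.
    // A cyclic "next" chain would loop forever; the sketch does not guard against that.
    static List<String> run(Map<String, Step> steps, String firstId) {
        List<String> executed = new ArrayList<>();
        String current = firstId;
        while (current != null) {
            Step step = steps.get(current);
            if (step == null) {
                throw new IllegalArgumentException("Unknown step: " + current);
            }
            step.tasklet.run();
            executed.add(current);
            current = step.next;
        }
        return executed;
    }

    public static void main(String[] args) {
        Map<String, Step> steps = new LinkedHashMap<>();
        steps.put("import", new Step(() -> System.out.println("run script-tasklet"), "ht"));
        steps.put("ht", new Step(() -> System.out.println("run hadoop-tasklet"), null));
        // final line printed: [import, ht]
        System.out.println(run(steps, "import"));
    }
}
```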

SCRIPTING & MIGRATION

Details

• Supports JVM languages from JSR-223 (Groovy, JRuby, Jython, Rhino)

• Exposes SimplerFileSystem
• Provides implicit variables
• Exposes FsShell to mimic HDFS shell
• Exposes DistCp to mimic distcp from Hadoop

Example
<hdp:script-tasklet id="script-tasklet">
    <hdp:script language="groovy">
        inputPath = "/user/gutenberg/input/word/"
        outputPath = "/user/gutenberg/output/word/"

        if (fsh.test(inputPath)) { fsh.rmr(inputPath) }
        if (fsh.test(outputPath)) { fsh.rmr(outputPath) }

        inputFile = "src/main/resources/data/nietzsche-chapter-1.txt"
        fsh.put(inputFile, inputPath)
    </hdp:script>
</hdp:script-tasklet>
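The script's sequence (remove stale input and output directories, then upload a fresh input file) can be tried out locally with java.nio. The sketch below is a local-filesystem stand-in and does not talk to HDFS; PrepareInputSketch, its method names and the temporary paths are assumptions made for this example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Local-filesystem stand-in for the script's steps; it does not talk to HDFS.
class PrepareInputSketch {

    // Like "if (fsh.test(path)) fsh.rmr(path)": delete the tree only if it exists.
    static void removeIfExists(Path dir) {
        if (!Files.exists(dir)) {
            return;
        }
        try (Stream<Path> walk = Files.walk(dir)) {
            // Reverse order so children are deleted before their parent directories.
            List<Path> paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
            for (Path path : paths) {
                Files.delete(path);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Like "fsh.put(file, dir)": copy one file into the target directory.
    static Path put(Path source, Path targetDir) {
        try {
            Files.createDirectories(targetDir);
            return Files.copy(source, targetDir.resolve(source.getFileName()));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Runs the same sequence as the script against a temporary directory.
    static boolean demo() {
        try {
            Path work = Files.createTempDirectory("script-sketch");
            Path inputFile = Files.writeString(work.resolve("chapter.txt"), "sample text");
            Path inputPath = work.resolve("input").resolve("word");
            Path outputPath = work.resolve("output").resolve("word");
            Files.createDirectories(outputPath); // simulate output left by an earlier run

            removeIfExists(inputPath);
            removeIfExists(outputPath);
            Path copied = put(inputFile, inputPath);

            return Files.exists(copied) && !Files.exists(outputPath);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // prints prepared: true
        System.out.println("prepared: " + demo());
    }
}
```

Clearing the output directory first matters because a Map/Reduce job refuses to start when its output path already exists.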

Migration
Hadoop Streaming:
<hdp:streaming id="streaming" input-path="/input/" output-path="/output/"
    mapper="${path.cat}" reducer="${path.wc}"/>

Hadoop Tool Executor:
<hdp:tool-runner id="someTool" tool-class="org.foo.SomeTool" run-at-startup="true">
    <hdp:arg value="data/in.txt"/>
    <hdp:arg value="data/out.txt"/>
    property=value
</hdp:tool-runner>

Alternatives

• Apache Flume – distributed data collection
• Apache Oozie – workflow scheduler
• Apache Sqoop – SQL bulk import/export

TESTING & RELATED TOOLS

Testing

• JUnit/Mocks + MRUnit
• Mini-HDFS and Mini-MapReduce cluster
• LocalJobRunner

Spring YARN

[Diagram: Hadoop 1.x vs Hadoop 2.x]
Hadoop 1.x: HDFS (storage); Map/Reduce (cluster / data process)
Hadoop 2.x: HDFS (storage); YARN (cluster); Map/Reduce (data process); Other, like Spark (data)

Spring eXtreme Data (XD)

• Ultimate data processing solution
• Implements the most common approach; business logic is up to you
• Built on top of Spring Batch and Spring Integration
• Has a DSL
• Scalable

More speedups
• Use provider quick start VM for initial development
• Use cloud based images for production (start/stop)
• Don’t use Map/Reduce without real need. Start with higher abstraction.
• Don’t migrate without real need!
• Invest in DevOps (Chef / Puppet / Vagrant…)

Q/A

?
