Rapid Development of Big Data Applications Using Spring for Apache Hadoop

Spring for Apache Hadoop By Zenyk Matchyshyn

Upload: zenyk

Post on 26-Jan-2015


Category:

Technology



DESCRIPTION

Spring for Apache Hadoop provides a collection of extensions and wrappers that make real-life project development with Hadoop technologies much easier, just as Spring has done for J2EE. Moreover, it doesn’t force you to dramatically change the way you develop applications: you still work with the Java stack. In my presentation I’ll describe how we used Spring for Apache Hadoop to speed up and simplify development of business features using Big Data technologies. Main focus:
• High-level APIs for different aspects
• Modularity & Testability
• Integration with Spring Batch and other Spring projects
• Scripting
• Extensibility & Caveats

TRANSCRIPT

Page 1: Rapid Development of Big Data applications using Spring for Apache Hadoop

Spring for Apache Hadoop

By Zenyk Matchyshyn

Page 2: Rapid Development of Big Data applications using Spring for Apache Hadoop

Agenda
• Goals of the project
• Hadoop Introduction
• High level support
• Workflows
• Scripting & Migration
• Alternatives
• Testing & Related

Page 3: Rapid Development of Big Data applications using Spring for Apache Hadoop

Big Data – Why?
Because of Terabytes and Petabytes:

• Smart meter analysis
• Genome processing
• Sentiment & social media analysis
• Network capacity trending & management
• Ad targeting
• Fraud detection

Page 4: Rapid Development of Big Data applications using Spring for Apache Hadoop

Goals
• Provide programmatic model to work with Hadoop ecosystem
• Simplify client libraries usage
• Provide Spring friendly wrappers
• Enable real-world usage as a part of Spring Batch & Spring Integration
• Leverage Spring features

Page 5: Rapid Development of Big Data applications using Spring for Apache Hadoop

Supported distros

• Apache Hadoop 1.2.1/2.0.6/2.2.0
• Cloudera CDH4
• Hortonworks HDP 1.3
• Pivotal HD 1.0/1.1

Page 6: Rapid Development of Big Data applications using Spring for Apache Hadoop

HADOOP INTRODUCTION

Page 7: Rapid Development of Big Data applications using Spring for Apache Hadoop

Hadoop

Hadoop Map/Reduce

HDFS

HBase

Pig Hive

Page 8: Rapid Development of Big Data applications using Spring for Apache Hadoop

Hadoop basics

Split → Map → Shuffle → Reduce

Split: Dog ate the bone | Cat ate the fish

Map: Dog, 1; Ate, 1; The, 1; Bone, 1 | Cat, 1; Ate, 1; The, 1; Fish, 1

Shuffle: Dog, 1; Ate, {1, 1}; The, {1, 1}; Bone, 1; Cat, 1; Fish, 1

Reduce: Dog, 1; Ate, 2; The, 2; Bone, 1; Cat, 1; Fish, 1
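The four phases on this slide can be traced in plain Java without a cluster. This is only a minimal in-memory sketch, not Hadoop code: the class name `WordCountSketch` and its helper methods are made up for illustration, and lower-casing the words is an assumption added so that the keys match regardless of case.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCountSketch {

    // Map phase: emit a (word, 1) pair for every word in one input split.
    static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : split.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                // Lower-casing is an assumption of this sketch, not part of the slide.
                pairs.add(new AbstractMap.SimpleEntry<>(word.toLowerCase(), 1));
            }
        }
        return pairs;
    }

    // Shuffle phase: group all emitted values by key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the grouped values of each key.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        grouped.forEach((word, ones) ->
                counts.put(word, ones.stream().mapToInt(Integer::intValue).sum()));
        return counts;
    }

    // Runs map over every split, then shuffle, then reduce.
    static Map<String, Integer> wordCount(List<String> splits) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String split : splits) {
            pairs.addAll(map(split));
        }
        return reduce(shuffle(pairs));
    }

    public static void main(String[] args) {
        // prints {ate=2, bone=1, cat=1, dog=1, fish=1, the=2}
        System.out.println(wordCount(Arrays.asList("Dog ate the bone", "Cat ate the fish")));
    }
}
```

In a real job the map and reduce steps run on different nodes and the framework performs the shuffle; here all three are ordinary method calls, so the data flow is easy to follow.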

Page 9: Rapid Development of Big Data applications using Spring for Apache Hadoop

Configuration
<… XML …>

<context:property-placeholder location="hadoop.properties"/>

<hdp:configuration>
    fs.default.name=${hd.fs}
    mapred.job.tracker=${hd.jt}
</hdp:configuration>

<… XML …>
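The `${hd.fs}` and `${hd.jt}` placeholders are filled in from `hadoop.properties` before the configuration is used. The substitution idea can be sketched in plain Java as follows. This is not Spring's implementation: the class `PlaceholderSketch` is hypothetical, and the host names and ports in the sample properties are invented for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PlaceholderSketch {

    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    // Parse properties text of the form key=value, one entry per line.
    static Properties load(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Replace each ${key} with its property value; unknown keys are left as they are.
    static String resolve(String text, Properties props) {
        Matcher matcher = PLACEHOLDER.matcher(text);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String value = props.getProperty(matcher.group(1), matcher.group(0));
            matcher.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical contents of hadoop.properties; the hosts and ports are made up.
        Properties props = load("hd.fs=hdfs://namenode.example:8020\nhd.jt=jobtracker.example:8021\n");
        System.out.println(resolve("fs.default.name=${hd.fs}", props));
        System.out.println(resolve("mapred.job.tracker=${hd.jt}", props));
    }
}
```

Keeping cluster addresses in a properties file means the same XML can be pointed at a local VM or a production cluster by swapping one file.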

Page 10: Rapid Development of Big Data applications using Spring for Apache Hadoop

Job definition
<hdp:job id="hadoopJob"
    input-path="${wordcount.input.path}"
    output-path="${wordcount.output.path}"
    libs="file:${app.repo}/supporting-lib-*.jar"
    mapper="org.company.Mapper"
    reducer="org.company.Reducer"/>

// The same job set up through the plain Hadoop API
Configuration conf = new Configuration();

Job job = new Job(conf, "hadoopJob");
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setMapperClass(Mapper.class);      // org.company.Mapper
job.setReducerClass(Reducer.class);    // org.company.Reducer
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));

job.waitForCompletion(true);

Page 11: Rapid Development of Big Data applications using Spring for Apache Hadoop

Job Execution

• Basic:

<hdp:job-runner id="runner" run-at-startup="true"
    pre-action="someScript"
    post-action="someOtherScript"
    job-ref="hadoopJob" />

• Scheduled
    – TaskScheduler
    – Quartz
• Custom

Page 12: Rapid Development of Big Data applications using Spring for Apache Hadoop

HIGH LEVEL TOOLS

Page 13: Rapid Development of Big Data applications using Spring for Apache Hadoop

Solutions

• HBase• Hive• Pig• Cascading

Page 14: Rapid Development of Big Data applications using Spring for Apache Hadoop

Simplifies
• Thread safety
• DAO friendliness, wrappers and basic mappers
• Simple connection interfaces
• Runners, Template and callback methods
• Common scenarios simplifications
• Scripting support

Page 15: Rapid Development of Big Data applications using Spring for Apache Hadoop

Example - Template

template.execute("MyTable", new TableCallback<Object>() {
    @Override
    public Object doInTable(HTable table) throws Throwable {
        Put p = new Put(Bytes.toBytes("SomeRow"));
        p.add(Bytes.toBytes("SomeColumn"), Bytes.toBytes("SomeQualifier"), Bytes.toBytes("AValue"));
        table.put(p);
        return null;
    }
});

<hdp:hbase-configuration/>

<bean id="hbaseTemplate" class="org.springframework.data.hadoop.hbase.HbaseTemplate" p:configuration-ref="hbaseConfiguration"/>
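The template-and-callback style used on this slide keeps resource handling in one place: the template obtains the table, hands it to the callback, and cleans up afterwards, even when the callback fails. The following is a minimal sketch of that pattern in plain Java. It is not the `HbaseTemplate` implementation; `FakeTable`, `FakeTemplate`, and the local `TableCallback` interface are stand-ins invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

public class TemplateCallbackSketch {

    /** Stand-in for a table handle; NOT the HBase HTable class. */
    static class FakeTable implements AutoCloseable {
        final String name;
        final List<String> rows = new ArrayList<>();
        boolean closed = false;

        FakeTable(String name) {
            this.name = name;
        }

        void put(String row) {
            rows.add(row);
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    /** Callback that receives an open table and returns a result. */
    interface TableCallback<T> {
        T doInTable(FakeTable table) throws Exception;
    }

    /** Hypothetical template: opens the table, runs the callback, always closes. */
    static class FakeTemplate {
        FakeTable lastTable; // kept only so the sketch can show the table was closed

        <T> T execute(String tableName, TableCallback<T> action) {
            FakeTable table = new FakeTable(tableName);
            lastTable = table;
            try {
                return action.doInTable(table);
            } catch (Exception e) {
                // Translate checked exceptions into one unchecked type for callers.
                throw new IllegalStateException("Table operation failed: " + tableName, e);
            } finally {
                table.close();
            }
        }
    }

    public static void main(String[] args) {
        FakeTemplate template = new FakeTemplate();
        Integer rowCount = template.execute("MyTable", table -> {
            table.put("SomeRow");
            return table.rows.size();
        });
        System.out.println("rows written: " + rowCount + ", closed: " + template.lastTable.closed);
    }
}
```

The caller writes only the body of the callback; opening, closing, and exception translation are never repeated in business code, which is what makes this style DAO friendly.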

Page 16: Rapid Development of Big Data applications using Spring for Apache Hadoop

Example – Script Runner
<hdp:hive-server host="hivehost" port="10001" />

<hdp:hive-template />

<hdp:hive-client-factory host="some-host" port="some-port">
    <hdp:script location="classpath:org/company/hive/script.q">
        <arguments>ignore-case=true</arguments>
    </hdp:script>
</hdp:hive-client-factory>

<hdp:hive-runner id="hiveRunner" run-at-startup="true">
    <hdp:script>
        DROP TABLE IF EXISTS testHiveBatchTable;
        CREATE TABLE testHiveBatchTable (key int, value string);
    </hdp:script>
    <hdp:script location="hive-scripts/script.q"/>
</hdp:hive-runner>

Page 17: Rapid Development of Big Data applications using Spring for Apache Hadoop

WORKFLOWS

Page 18: Rapid Development of Big Data applications using Spring for Apache Hadoop

Typical Big Data Processing Flow

Capture → Pre-Process → Insert → Process → Extract → Present

Page 19: Rapid Development of Big Data applications using Spring for Apache Hadoop

Spring Batch & Spring Integration

• Big Data Flows are based on Spring Integration & Spring Batch

• Spring for Hadoop provides:
    – Spring Batch tasklets
    – Spring Integration support

Page 20: Rapid Development of Big Data applications using Spring for Apache Hadoop

Tasklets

• Job runners
• Script runners
• Hive
• Pig
• Cascading

Page 21: Rapid Development of Big Data applications using Spring for Apache Hadoop

Example

<hdp:job-tasklet id="hadoop-tasklet" job-ref="mr-job" wait-for-completion="true" />

<batch:job id="job1">
    <batch:step id="import" next="ht">
        <batch:tasklet ref="script-tasklet"/>
    </batch:step>
    <batch:step id="ht">
        <batch:tasklet ref="hadoop-tasklet" />
    </batch:step>
</batch:job>

Page 22: Rapid Development of Big Data applications using Spring for Apache Hadoop

SCRIPTING & MIGRATION

Page 23: Rapid Development of Big Data applications using Spring for Apache Hadoop

Details

• Supports JVM languages from JSR-223 (Groovy, JRuby, Jython, Rhino)
• Exposes SimplerFileSystem
• Provides implicit variables
• Exposes FsShell to mimic HDFS shell
• Exposes DistCp to mimic distcp from Hadoop
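JSR-223 is the standard `javax.script` API, so the idea of running a script with implicit variables can be sketched with the JDK alone. The class `Jsr223Sketch`, the variable name `inputPath`, and the script text below are hypothetical; which engines are available depends on what is on the classpath, so the sketch returns `null` when no engine is registered for the requested language.

```java
import java.util.Collections;
import java.util.Map;
import javax.script.Bindings;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;
import javax.script.ScriptException;

public class Jsr223Sketch {

    // Evaluate a script with the given variables pre-bound; null if no engine is found.
    static Object run(String language, String script, Map<String, Object> implicitVariables) {
        ScriptEngine engine = new ScriptEngineManager().getEngineByName(language);
        if (engine == null) {
            return null; // no JSR-223 engine registered under that name
        }
        Bindings bindings = engine.createBindings();
        bindings.putAll(implicitVariables); // variables the script can use without declaring them
        try {
            return engine.eval(script, bindings);
        } catch (ScriptException e) {
            throw new IllegalStateException("Script failed", e);
        }
    }

    public static void main(String[] args) {
        Map<String, Object> variables = Collections.<String, Object>singletonMap("inputPath", "/user/demo");
        Object result = run("groovy", "inputPath + '/part-0'", variables);
        System.out.println(result == null ? "no groovy engine on the classpath" : result);
    }
}
```

Adding a JSR-223 engine for Groovy, JRuby, or Jython to the classpath should be enough for `getEngineByName` to find it; no code change is needed in the caller.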

Page 24: Rapid Development of Big Data applications using Spring for Apache Hadoop

Example
<hdp:script-tasklet id="script-tasklet">
    <hdp:script language="groovy">
        inputPath = "/user/gutenberg/input/word/"
        outputPath = "/user/gutenberg/output/word/"

        if (fsh.test(inputPath)) { fsh.rmr(inputPath) }

        if (fsh.test(outputPath)) { fsh.rmr(outputPath) }

        inputFile = "src/main/resources/data/nietzsche-chapter-1.txt"

        fsh.put(inputFile, inputPath)
    </hdp:script>
</hdp:script-tasklet>
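The script follows a common preparation pattern: test whether a path exists, remove it recursively, then upload fresh input. A local-filesystem analogue in plain Java is sketched below. It uses `java.nio.file` rather than HDFS, and the class `LocalFsShellSketch` with its `test`, `rmr`, and `put` methods is a hypothetical stand-in that only borrows the method names from the script.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Comparator;
import java.util.stream.Stream;

public class LocalFsShellSketch {

    // Analogue of fsh.test(path): does the path exist?
    static boolean test(Path path) {
        return Files.exists(path);
    }

    // Analogue of fsh.rmr(path): delete recursively, children before parents.
    static void rmr(Path path) {
        try (Stream<Path> walk = Files.walk(path)) {
            walk.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.delete(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Analogue of fsh.put(source, targetDir): copy one file into the target directory.
    static Path put(Path source, Path targetDir) {
        try {
            Files.createDirectories(targetDir);
            return Files.copy(source, targetDir.resolve(source.getFileName()),
                    StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Clean the target, upload one file, report whether it arrived, then tidy up.
    static boolean roundTrip() {
        try {
            Path work = Files.createTempDirectory("fsh-sketch");
            Path inputPath = work.resolve("input").resolve("word");
            Path inputFile = Files.write(work.resolve("sample.txt"),
                    "Dog ate the bone".getBytes(StandardCharsets.UTF_8));
            if (test(inputPath)) {
                rmr(inputPath);
            }
            Path copied = put(inputFile, inputPath);
            boolean arrived = test(copied);
            rmr(work);
            return arrived && !test(work);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("upload succeeded: " + roundTrip());
    }
}
```

Running the cleanup as the first step of a batch job keeps reruns idempotent: a leftover output directory from a previous attempt would otherwise make the MapReduce step fail.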

Page 25: Rapid Development of Big Data applications using Spring for Apache Hadoop

Migration
Hadoop Streaming:

<hdp:streaming id="streaming" input-path="/input/" output-path="/output/" mapper="${path.cat}" reducer="${path.wc}"/>

Hadoop Tool Executor:

<hdp:tool-runner id="someTool" tool-class="org.foo.SomeTool" run-at-startup="true">
    <hdp:arg value="data/in.txt"/>
    <hdp:arg value="data/out.txt"/>
    property=value
</hdp:tool-runner>

Page 26: Rapid Development of Big Data applications using Spring for Apache Hadoop

Alternatives

• Apache Flume – distributed data collection
• Apache Oozie – workflow scheduler
• Apache Sqoop – SQL bulk import/export

Page 27: Rapid Development of Big Data applications using Spring for Apache Hadoop

TESTING & RELATED TOOLS

Page 28: Rapid Development of Big Data applications using Spring for Apache Hadoop

Testing

• JUnit/Mocks + MRUnit
• Mini-HDFS and Mini-MapReduce cluster
• LocalJobRunner

Page 29: Rapid Development of Big Data applications using Spring for Apache Hadoop

Spring YARN

Hadoop 1.x:
• HDFS – storage
• Map/Reduce – cluster / data process

Hadoop 2.x:
• HDFS – storage
• YARN – cluster
• Map/Reduce – data process
• Other (like Spark) – data

Page 30: Rapid Development of Big Data applications using Spring for Apache Hadoop

Spring eXtreme Data (XD)

• Ultimate data processing solution
• Implements most common approach, business logic up to you
• On top of Spring Batch and Spring Integration
• Has DSL
• Scalable

Page 31: Rapid Development of Big Data applications using Spring for Apache Hadoop

More speedups
• Use provider quick start VM for initial development
• Use cloud based images for production (start/stop)
• Don’t use Map/Reduce without real need. Start with higher abstraction.
• Don’t migrate without real need!
• Invest in DevOps (Chef / Puppet / Vagrant…)

Page 32: Rapid Development of Big Data applications using Spring for Apache Hadoop

Q/A

?