DRIVING INNOVATION THROUGH DATA CASCADING 3 AND BEYOND André Kelpe | Apache Big Data Europe | Budapest, September 28th 2015

SPEAKER

André Kelpe
Senior Software Engineer at Concurrent, company behind Cascading, Lingual and Driven
http://concurrentinc.com / @concurrent
@fs111

INTRODUCTION

http://cascading.org

Apache licensed Java framework for writing data oriented applications

production ready, stable and battle proven

PHILOSOPHY

developer productivity

users focus on business problems, not distributed systems knowledge

predictable runtime behaviour

fail fast

stable user APIs

safe defaults with knobs for experts

batch workloads

testability & robustness

production quality applications rather than a collection of scripts

abstractions over interchangeable platforms

TERMINOLOGY

A SERIES OF PIPES

https://www.flickr.com/photos/theilr/4283377543/sizes/l

CASCADING TERMINOLOGY

• Taps are sources and sinks for data
• Schemes represent the format of the data
• Pipes connect Taps
• Tuples flow through Pipes
• Fields describe the Tuples
• Operations are executed on Tuples in TupleStreams
• Pipes can be merged, spliced, joined etc.
• Pipe-assemblies are reusable components

A FlowConnector uses a QueryPlanner to translate a FlowDef into a Flow that runs on a computational platform

Flows can be orchestrated via Cascade

Applications are Directed Acyclic Graphs (DAG)
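
To make the DAG idea concrete, here is a minimal plain-Java sketch, not Cascading code. It orders the nodes of a small directed graph so that each node comes after everything upstream of it, and it rejects graphs that contain a cycle. The class name `DagSketch`, the method `executionOrder` and the node names are hypothetical and chosen for illustration only; how the QueryPlanner actually schedules work is not shown on the slides.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: none of these names are Cascading classes.
final class DagSketch {

    /** Returns the nodes in an order where every node comes after all of its upstream nodes. */
    static List<String> executionOrder(Map<String, List<String>> downstream) {
        // Count the incoming edges of every node.
        Map<String, Integer> incoming = new LinkedHashMap<>();
        for (String node : downstream.keySet()) {
            incoming.putIfAbsent(node, 0);
        }
        for (List<String> targets : downstream.values()) {
            for (String target : targets) {
                incoming.merge(target, 1, Integer::sum);
            }
        }

        // Nodes without upstream dependencies (the sources) are ready immediately.
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> entry : incoming.entrySet()) {
            if (entry.getValue() == 0) {
                ready.add(entry.getKey());
            }
        }

        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String node = ready.remove();
            order.add(node);
            // Finishing a node may unblock the nodes directly downstream of it.
            for (String target : downstream.getOrDefault(node, List.of())) {
                if (incoming.merge(target, -1, Integer::sum) == 0) {
                    ready.add(target);
                }
            }
        }

        if (order.size() != incoming.size()) {
            throw new IllegalStateException("graph contains a cycle, so it is not a DAG");
        }
        return order;
    }

    public static void main(String[] args) {
        Map<String, List<String>> graph = new LinkedHashMap<>();
        graph.put("source", List.of("split"));
        graph.put("split", List.of("group"));
        graph.put("group", List.of("count"));
        graph.put("count", List.of("sink"));
        graph.put("sink", List.of());
        System.out.println(executionOrder(graph)); // [source, split, group, count, sink]
    }
}
```

The acyclic property is what makes such an ordering possible at all: with a cycle, no node in the loop could ever become ready.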

DAG

PLATFORMS

CASCADING PLATFORMS

local

change 1 line of code, recompile, done.

COMPILER ANALOGY

[Diagram comparing a compiler to Cascading. Compiler side: User Code, Translation, Optimisation, Assembly, CPU Architecture. Cascading side: FlowDef, QueryPlanner/RuleEngine, and the target platforms MR, Tez, Flink, others…]

DAG

A DAG RUNNING ON A PLATFORM

REAL WORLD DAG

https://github.com/cchepelov/wcplus

https://driven.cascading.io/index.html#/apps/A7544E2B8E7C410397B4AE88F53326D1

CODE EXAMPLE

APIS

● Fluid - A Fluent API for Cascading
  − Targeted at application writers
  − https://github.com/Cascading/fluid

● "Raw" Cascading API
  − Targeted for library writers, code generators, integration layers
  − https://github.com/Cascading/cascading

COUNTING WORDS

String docPath = args[ 0 ];
String wcPath = args[ 1 ];

Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
FlowConnector flowConnector = new Hadoop2MR1FlowConnector( properties );

// create source and sink taps
Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

...

COUNTING WORDS (CONT.)

// specify a regex operation to split the "document" text lines into a token stream
Fields token = new Fields( "token" );
Fields text = new Fields( "text" );
RegexSplitGenerator splitter =
  new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );

// only returns "token"
Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

// determine the word counts
Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new GroupBy( wcPipe, token );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

...

COUNTING WORDS (CONT.)

// connect the taps, pipes, etc., into a flow
FlowDef flowDef = FlowDef.flowDef()
  .setName( "word count" )
  .addSource( docPipe, docTap )
  .addTailSink( wcPipe, wcTap );

Flow wcFlow = flowConnector.connect( flowDef );
wcFlow.complete(); // ← runs the flow and blocks until it completes
}
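
For readers who want to see what this flow computes without a Hadoop cluster, here is a plain-Java sketch of the same split, group and count steps. It is not Cascading code and makes no claim about how Cascading executes the flow; the class `WordCountSketch`, the method `countTokens` and the sample lines are hypothetical. It reuses the delimiter regex from the slides, and it assumes empty tokens should be dropped, which the slide code does not specify.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

// Plain-Java illustration of the computation; it does not use the Cascading API.
final class WordCountSketch {

    // Same delimiter pattern as the RegexSplitGenerator in the slide code.
    private static final Pattern DELIMITERS = Pattern.compile("[ \\[\\]\\(\\),.]");

    /** Splits every line into tokens and counts how often each token occurs. */
    static Map<String, Long> countTokens(Iterable<String> lines) {
        Map<String, Long> counts = new TreeMap<>(); // sorted by token for stable output
        for (String line : lines) {
            for (String token : DELIMITERS.split(line)) {
                if (token.isEmpty()) {
                    continue; // assumption of this sketch: empty tokens are skipped
                }
                counts.merge(token, 1L, Long::sum); // the "group by token, then count" step
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("rain shadow (rain)", "shadow, rain.");
        System.out.println(countTokens(lines)); // {rain=3, shadow=2}
    }
}
```

The sketch keeps everything in one in-memory map, which only works for small inputs; grouping is the step a distributed platform has to spread across machines.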

A FULL TOOLBOX

● Operations
  − Function
  − Filter
  − Regex/Scripts
  − Boolean operators
  − Count/Limit/Last/First
  − Scripts
  − Unique
  − Asserts
  − Min/Max

● Splices
  − GroupBy
  − CoGroup
  − HashJoin
  − Merge

● Joins
  − Left, right, outer, inner, mixed, custom
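
As a reminder of how two of the listed join types differ, here is a small plain-Java sketch of inner and left join semantics. It does not use the Cascading API: `JoinSketch`, `innerJoin`, `leftJoin` and the sample data are hypothetical, and the sketch assumes a single value per key on each side, whereas a real join has to handle many tuples per key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of join semantics; it does not use the Cascading API.
final class JoinSketch {

    /** Inner join: keep only keys that are present on both sides. */
    static List<String> innerJoin(Map<String, String> lhs, Map<String, String> rhs) {
        List<String> rows = new ArrayList<>();
        // Iterate over a sorted copy so the output order is deterministic.
        for (Map.Entry<String, String> left : new TreeMap<>(lhs).entrySet()) {
            if (rhs.containsKey(left.getKey())) {
                rows.add(left.getKey() + "," + left.getValue() + "," + rhs.get(left.getKey()));
            }
        }
        return rows;
    }

    /** Left join: keep every left-hand key, with null where the right side has no match. */
    static List<String> leftJoin(Map<String, String> lhs, Map<String, String> rhs) {
        List<String> rows = new ArrayList<>();
        for (Map.Entry<String, String> left : new TreeMap<>(lhs).entrySet()) {
            rows.add(left.getKey() + "," + left.getValue() + "," + rhs.get(left.getKey()));
        }
        return rows;
    }

    public static void main(String[] args) {
        Map<String, String> users = Map.of("1", "ann", "2", "bob");
        Map<String, String> cities = Map.of("2", "Budapest", "3", "Berlin");
        System.out.println(innerJoin(users, cities)); // [2,bob,Budapest]
        System.out.println(leftJoin(users, cities));  // [1,ann,null, 2,bob,Budapest]
    }
}
```

A right join is the mirror image of the left join, and an outer join keeps the unmatched keys of both sides.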

A FULL TOOLBOX

• data access: JDBC, HBase, Elasticsearch, Redshift, HDFS, S3, Cassandra, Kinesis, Accumulo …

• data formats: Avro, Parquet, ORC (+ACID), Thrift, Protobuf, CSV, TSV…

• integration points: Cascading Lingual (SQL), Apache Hive, M/R apps, custom

OUTLOOK TO CASCADING 3.1+

• improved serialization through strong typing

• Cascading on Apache Flink

• Cascading on Hazelcast

DON’T LIKE JAVA?

Clojure/logic programming

https://github.com/nathanmarz/cascalog

Clojure

https://github.com/Netflix/PigPen

Scala

https://github.com/twitter/scalding

QUESTIONS?

LINK COLLECTION

• http://www.cascading.org/
• https://github.com/Cascading/
• http://driven.io/
• http://concurrentinc.com
• https://groups.google.com/forum/#!forum/cascading-user
• http://docs.cascading.org/tutorials/etl-log/
• http://docs.cascading.org/cascading/3.0/userguide/html/
