cascading meetup #4 @ bluekai

Copyright @2013, Concurrent, Inc. BlueKai Cupertino, CA 2013-03-05 Cascading Meetup #4 1 Tuesday, 05 March 13

Upload: paco-nathan

Post on 23-Jun-2015


DESCRIPTION

Slides from Cascading meetup #4, held at BlueKai in Cupertino, CA on 2013-03-05

TRANSCRIPT

Page 1: Cascading meetup #4 @ BlueKai

Copyright @2013, Concurrent, Inc.

BlueKai, Cupertino, CA, 2013-03-05

Cascading Meetup #4

Tuesday, 05 March 13

Page 2: Cascading meetup #4 @ BlueKai

Cascading Meetup

[Flow diagram: word count app. Labels: Document Collection, Tokenize, Scrub token, Regex token, Stop Word List, HashJoin (Left, RHS), GroupBy token, Count, Word Count, M, R]

1. Enterprise Data Workflows
2. ANSI SQL Support
3. Test-Driven Development

Page 3: Cascading meetup #4 @ BlueKai

[Diagram: a data workflow on a Hadoop cluster. Labels: source tap, sink tap, trap tap, customer profile DBs, Customer Prefs, logs, Cache, Customers, Support, Web App, Reporting, Analytics Cubes, Modeling (PMML)]

Enterprise Data Workflows

Let’s consider an example app… at the front end

LOB use cases drive demand for apps

LOB use cases drive the demand for Big Data apps

Page 4: Cascading meetup #4 @ BlueKai

[Diagram: Hadoop cluster and data workflow, repeated from page 3]

Enterprise Data Workflows

An example… in the back office

Organizations have substantial investments in people, infrastructure, process

Enterprise organizations have seriously ginormous investments in existing back office practices: people, infrastructure, processes

Page 5: Cascading meetup #4 @ BlueKai

[Diagram: Hadoop cluster and data workflow, repeated from page 3]

Enterprise Data Workflows

An example… for the heavy lifting!

“Main Street” firms are migrating workflows to Hadoop, for cost savings and scale-out

“Main Street” firms have invested in Hadoop to address Big Data needs, offsetting their rising costs for Enterprise licenses from SAS, Teradata, etc.

Page 6: Cascading meetup #4 @ BlueKai

Two Avenues…

scale ➞ complexity

Enterprise: must contend with complexity at scale every day…

incumbents extend current practices and infrastructure investments – using J2EE, ANSI SQL, SAS, etc. – to migrate workflows onto Apache Hadoop while leveraging existing staff

Start-ups: crave complexity and scale to become viable…

new ventures move into Enterprise space to compete using relatively lean staff, while leveraging sophisticated engineering practices, e.g., Cascalog and Scalding

Enterprise data workflows are observed in two modes: start-ups approaching complexity and incumbent firms grappling with complexity

Page 7: Cascading meetup #4 @ BlueKai

Two Avenues…


Hadoop almost never gets used in isolation; data workflows define the “glue” required for system integration of Enterprise apps

Hadoop is almost never used in isolation. Enterprise data workflows are about system integration. There are a couple different ways to arrive at the party.

Page 8: Cascading meetup #4 @ BlueKai

Cascading Meetup

[Flow diagram: word count app, repeated from page 2]

1. Enterprise Data Workflows
2. ANSI SQL Support
3. Test-Driven Development

Page 9: Cascading meetup #4 @ BlueKai

[Diagram: Hadoop cluster and data workflow, repeated from page 3]

Cascading workflows – ANSI SQL

• collab with Optiq – industry-proven code base

• ANSI SQL parser/optimizer atop Cascading flow planner

• JDBC driver to integrate into existing tools and app servers

• relational catalog over a collection of unstructured data

• SQL shell prompt to run queries

ANSI SQL as “machine code” -- the lingua franca of Enterprise system integration.

Cascading partnered with Optiq, the team behind Mondrian, etc., with an Enterprise-proven code base for an ANSI SQL parser/optimizer.

Page 10: Cascading meetup #4 @ BlueKai

[Diagram: Hadoop cluster and data workflow, repeated from page 3]

Cascading workflows – ANSI SQL


Premise: most SQL in the world gets written by machines…

This isn’t a database; this is about making machine-to-machine communications simpler and more robust at scale.


Page 11: Cascading meetup #4 @ BlueKai

[Diagram: Hadoop cluster and data workflow, repeated from page 3]

Cascading workflows – ANSI SQL

• enable analysts without retraining on Hadoop, etc.

• transparency for Support, Ops, Finance, et al.

a language for queries – not a database, but ANSI SQL as a DSL for workflows


Page 12: Cascading meetup #4 @ BlueKai

ANSI SQL – reviews

Open Source 'Lingual' Helps SQL Devs Unlock Hadoop
Thor Olavsrud, 2013-02-22
cio.com/article/729283/Open_Source_Lingual_Helps_SQL_Devs_Unlock_Hadoop

Hadoop Apps Without MapReduce Mindsets
Adrian Bridgwater, 2013-02-28
drdobbs.com/open-source/hadoop-apps-without-mapreduce-mindsets/240149708

Concurrent gives old SQL users new Hadoop tricks
Jack Clark, 2013-02-20
theregister.co.uk/2013/02/20/hadoop_sql_translator_lingual_launches/

Concurrent Open Source Project Ties SQL to Hadoop
Michael Vizard, 2013-02-21
itbusinessedge.com/blogs/it-unmasked/concurrent-open-source-project-ties-sql-to-hadoop.html

Concurrent Releases Lingual, a SQL DSL for Hadoop
Boris Lublinsky, 2013-02-28
infoq.com/news/2013/02/Lingual

Page 13: Cascading meetup #4 @ BlueKai

ANSI SQL – CSV data in local file system

cascading.org/lingual

The test database for MySQL is available for download from https://launchpad.net/test-db/

Here we have a bunch o’ CSV flat files in a directory in the local file system.

Use the “lingual” command line interface to overlay DDL to describe the expected table schema.

Page 14: Cascading meetup #4 @ BlueKai

ANSI SQL – shell prompt, catalog

cascading.org/lingual

Use the “lingual” SQL shell prompt to run SQL queries interactively, show catalog, etc.

Page 15: Cascading meetup #4 @ BlueKai

ANSI SQL – queries

cascading.org/lingual

Here’s an example SQL query on that “employee” test database from MySQL.

Page 16: Cascading meetup #4 @ BlueKai

ANSI SQL – layers

abstraction   | RDBMS                                  | JVM Cluster
parser        | ANSI SQL compliant parser              | ANSI SQL compliant parser
optimizer     | logical plan, optimized based on stats | logical plan, optimized based on stats
planner       | physical plan                          | API “plumbing”
machine data  | query history, table stats             | app history, tuple stats
topology      | b-trees, etc.                          | heterogeneous, distributed: Hadoop, IMDG, etc.
visualization | ERD                                    | flow diagram
schema        | table schema                           | tuple schema
catalog       | relational catalog                     | tap usage DB
provenance    | (manual audit)                         | data set producers/consumers

When you peel back the onion skin on a SQL query, each of the abstraction layers used in an RDBMS has an analogue (or better) in the context of Enterprise Data Workflows running on JVM clusters.

Page 17: Cascading meetup #4 @ BlueKai

ANSI SQL – JDBC driver

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public void run() throws ClassNotFoundException, SQLException {
  // register the Lingual JDBC driver
  Class.forName( "cascading.lingual.jdbc.Driver" );
  // the connection string points at a directory of flat files
  Connection connection = DriverManager.getConnection(
    "jdbc:lingual:local;schemas=src/main/resources/data/example" );
  Statement statement = connection.createStatement();

  ResultSet resultSet = statement.executeQuery(
    "select *\n"
      + "from \"EXAMPLE\".\"SALES_FACT_1997\" as s\n"
      + "join \"EXAMPLE\".\"EMPLOYEE\" as e\n"
      + "on e.\"EMPID\" = s.\"CUST_ID\"" );

  // print each row as "LABEL=value; LABEL=value; ..."
  while( resultSet.next() ) {
    int n = resultSet.getMetaData().getColumnCount();
    StringBuilder builder = new StringBuilder();

    for( int i = 1; i <= n; i++ ) {
      builder.append( ( i > 1 ? "; " : "" )
        + resultSet.getMetaData().getColumnLabel( i )
        + "=" + resultSet.getObject( i ) );
    }

    System.out.println( builder );
  }

  resultSet.close();
  statement.close();
  connection.close();
}

Note that in this example the schema for the DDL has been derived directly from the CSV files.

In other words, point the JDBC connection at a directory of flat files and query as if they were already loaded into SQL.
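To make the idea of a schema "derived directly from the CSV files" concrete, here is a minimal plain-Java sketch. It is an assumption-laden illustration, not how Lingual does it: the class name CsvSchemaSketch, the method infer, and the two-type rule (INTEGER or VARCHAR) are all hypothetical, and the column names are borrowed from the example output on the next slide.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: derive column names and rough types from CSV text.
// This is NOT Lingual's implementation; it only illustrates the idea.
class CsvSchemaSketch {
    // Pair each header name with a type guessed from the first data row.
    static List<String> infer(String headerLine, String firstDataLine) {
        String[] names = headerLine.split(",");
        String[] values = firstDataLine.split(",");
        List<String> columns = new ArrayList<>();
        for (int i = 0; i < names.length; i++) {
            String value = i < values.length ? values[i].trim() : "";
            // Assumed rule: all-digit values are integers, anything else is text.
            String type = value.matches("-?\\d+") ? "INTEGER" : "VARCHAR";
            columns.add(names[i].trim() + " " + type);
        }
        return columns;
    }

    public static void main(String[] args) {
        // Header and row shaped like the EMPLOYEE data in the slides.
        System.out.println(infer("EMPID,NAME", "100,Bill"));
    }
}
```

A real catalog would need to look at more than one row and handle quoting, dates, and nulls; the sketch only shows why a header line plus sample data is enough to propose a table schema.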

Page 18: Cascading meetup #4 @ BlueKai

ANSI SQL – JDBC driver

$ gradle clean jar
$ hadoop jar build/libs/lingual-examples-1.0.0-wip-dev.jar

CUST_ID=100; PROD_ID=10; EMPID=100; NAME=Bill
CUST_ID=150; PROD_ID=20; EMPID=150; NAME=Sebastian

Caveat: if you absolutely positively must have sub-second SQL query response for PB-scale data on a 1000+ node cluster… Good luck with that! (call the MPP vendors)

This ANSI SQL library is primarily intended for batch workflows – high throughput, not low latency – for many under-represented use cases in Enterprise IT.

It’s essentially ANSI SQL as a DSL.

success

Page 19: Cascading meetup #4 @ BlueKai

Cascading Meetup

[Flow diagram: word count app, repeated from page 2]

1. Enterprise Data Workflows
2. ANSI SQL Support
3. Test-Driven Development

Page 20: Cascading meetup #4 @ BlueKai

Test-Driven Development (TDD)

source: Wikipedia

A general view of TDD process

Page 21: Cascading meetup #4 @ BlueKai

Test-Driven Development (TDD)

In terms of Big Data apps, TDD is not generally part of the conversation

TDD is not usually high on the list when people start discussing Big Data apps.

Page 22: Cascading meetup #4 @ BlueKai

[Diagram: Hadoop cluster and data workflow, repeated from page 3]

Traps – Cascading “exceptional data”

• assert patterns (regex) on the tuple streams

• adjust assert levels, like log4j levels

• define traps on branches

• tuples which fail asserts get trapped

An innovation in Cascading was to introduce the notion of a “data exception”, based on setting stream assertion levels as part of the business logic of an app.

Page 23: Cascading meetup #4 @ BlueKai

Traps – example code

// set up...
Pipe etlPipe = new Pipe( "etlPipe" );

// some processing...
AssertMatches assertMatches = new AssertMatches( ".*true" );
etlPipe = new Each( etlPipe, AssertionLevel.STRICT, assertMatches );

// some processing...
FlowDef flowDef = FlowDef.flowDef().setName( "etl" )
  .addSource( etlPipe, jsonTap )
  .addTrap( etlPipe, trapTap )
  .addTailSink( etlPipe, cacheTap );

if( options.has( "assert" ) )
  flowDef.setAssertionLevel( AssertionLevel.STRICT );
else
  flowDef.setAssertionLevel( AssertionLevel.NONE );

Example use in Cascading code
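As a library-free illustration of what the snippet above asks for, the following sketch routes tuples that fail a regex assertion into a separate "trap" list, and skips the check when the flow's assertion level is lower than the level the assertion was declared at. It is plain Java, not the Cascading API: TrapSketch, Level, Result, and run are hypothetical names, and the ordering NONE < VALID < STRICT is an assumption made for the sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Plain-Java illustration of the trap idea; this is NOT the Cascading API.
// All names here (TrapSketch, Level, Result, run) are hypothetical.
class TrapSketch {
    // Adjustable assertion levels, in the spirit of log4j levels.
    enum Level { NONE, VALID, STRICT }

    // Holds the tuples that passed and the ones diverted to the trap.
    static final class Result {
        final List<String> passed = new ArrayList<>();
        final List<String> trapped = new ArrayList<>();
    }

    // Apply a regex assertion declared at `assertAt`; it is enforced only
    // when the level the flow runs at is at least as strict.
    static Result run(List<String> tuples, Pattern expected, Level assertAt, Level flowLevel) {
        Result result = new Result();
        boolean enforce = assertAt != Level.NONE && flowLevel.compareTo(assertAt) >= 0;
        for (String tuple : tuples) {
            if (enforce && !expected.matcher(tuple).matches()) {
                result.trapped.add(tuple);   // exceptional data goes to the trap
            } else {
                result.passed.add(tuple);    // everything else continues downstream
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> tuples = Arrays.asList("id=1 ok=true", "id=2 ok=false", "id=3 ok=true");
        Pattern expected = Pattern.compile(".*true");

        // Assertions on: the non-matching tuple is shunted to the trap.
        Result strict = run(tuples, expected, Level.STRICT, Level.STRICT);
        System.out.println("passed:  " + strict.passed);
        System.out.println("trapped: " + strict.trapped);

        // Assertions off: nothing is checked, so nothing is trapped.
        Result off = run(tuples, expected, Level.STRICT, Level.NONE);
        System.out.println("trapped with assertions off: " + off.trapped);
    }
}
```

The point of the design is that the job keeps running: bad records are set aside for someone to review, while the assertion level decides whether the check is paid for at all in a given run.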

Page 24: Cascading meetup #4 @ BlueKai

[Diagram: Hadoop cluster and data workflow, repeated from page 3]

Traps – redirect exceptions in production

shunt the trapped exceptional data to other parts of the organization:

• Ops: notifications

• QA: investigate data anomalies

• Support: review customer records

• Finance: audit


Page 25: Cascading meetup #4 @ BlueKai

TDD – practice at scale

1. assert expected patterns in raw input

2. run just that, to find edge cases

3. handle the edge cases for input data

4. assert expected patterns after first chunk of processing

5. run just that, to verify failure

6. code until test passes

7. repeat #4 for each chunk
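The loop in the steps above can be sketched in plain Java on a small sample: each chunk of processing carries the pattern its output is expected to match, and the run stops at the first chunk whose assertion fails so the edge cases can be handled before moving on. This is only a sketch under stated assumptions: StagedAssertSketch, Stage, and firstFailure are hypothetical names, the coordinate-like sample data is made up, and nothing here is Cascading code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.regex.Pattern;

// Hypothetical sketch of "assert, run just that, fix, repeat" on a sample.
// Plain Java only; none of these names come from Cascading.
class StagedAssertSketch {
    // One chunk of processing plus the pattern its output should match.
    static final class Stage {
        final String name;
        final Function<String, String> step;
        final Pattern expected;

        Stage(String name, Function<String, String> step, String regex) {
            this.name = name;
            this.step = step;
            this.expected = Pattern.compile(regex);
        }
    }

    // Run stages in order; stop at the first stage whose output violates
    // its assertion and report the offending records.
    static String firstFailure(List<String> sample, List<Stage> stages) {
        List<String> current = sample;
        for (Stage stage : stages) {
            List<String> next = new ArrayList<>();
            List<String> edgeCases = new ArrayList<>();
            for (String record : current) {
                String out = stage.step.apply(record);
                if (stage.expected.matcher(out).matches()) {
                    next.add(out);
                } else {
                    edgeCases.add(out);
                }
            }
            if (!edgeCases.isEmpty()) {
                return stage.name + " failed on " + edgeCases;
            }
            current = next;
        }
        return "all stages passed";
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(" 37.77,-122.41 ", "37.33,-121.89", "n/a");
        List<Stage> stages = Arrays.asList(
            // Steps 1-3: assert the shape of the raw input after trimming.
            new Stage("raw-input", String::trim, "-?\\d+\\.\\d+,-?\\d+\\.\\d+"),
            // Steps 4-6: assert the shape after the first chunk of processing.
            new Stage("truncate", r -> r.replaceAll("(\\.\\d)\\d*", "$1"), "-?\\d+\\.\\d,-?\\d+\\.\\d"));
        System.out.println(firstFailure(sample, stages));
    }
}
```

Running it on the sample surfaces the "n/a" record at the first stage; once that edge case is handled (or trapped), the same run moves on to the next stage's assertion.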

[Flow diagram: a multi-branch Cascading app. Labels: GIS export, src, tree, road, gps, gps logs, shade, reco, Tree Metadata, Road Metadata, Road Segments, Failure Traps, Regex parse-gis, Regex parse-tree, Regex parse-road, Scrub species, Geohash, Estimate height, Estimate Albedo, Estimate traffic, Calculate distance, Filter height, Filter distance, Filter sum_moment, Sum moment, Count gps_count, Max recent_visit, Join, M, R]

Page 26: Cascading meetup #4 @ BlueKai

TDD – Cascalog features

consider that TDD is about asserting and negating logical predicates…

• Cascalog is based on logical predicates

• function definitions as composable subqueries

• functions are not particularly far from being unit tests

• Midje: facts, mocks

sritchie.github.com/2011/09/30/testing-cascalog-with-midje.html

sritchie.github.com/2012/01/22/cascalog-testing-20.html

Moreover, the Cascalog language by Nathan Marz, Sam Ritchie, et al., nearly uses TDD as its methodology -- in the transition from ad-hoc queries as logic predicates, then composing those predicates into large-scale apps.

Page 27: Cascading meetup #4 @ BlueKai

Cascading Meetup

[Flow diagram: word count app, repeated from page 2]

1. Enterprise Data Workflows
2. ANSI SQL Support
3. Test-Driven Development

…plus, a proposal

Page 28: Cascading meetup #4 @ BlueKai

ANSI SQL – multiple flows

[Flow diagram: multi-branch Cascading app, repeated from page 25]

Suppose your organization is responsible for a large-scale app…

Multiple teams develop reusable libraries…

Suppose you have an app with a complex flow diagram like this, with contributions to the business logic from different departments…

Page 29: Cascading meetup #4 @ BlueKai

ANSI SQL – multiple flows

[Flow diagram: multi-branch Cascading app, repeated from page 25]

Data Analysts: ANSI SQL queries for data prep

(displaces Hive, etc.)

Analysts are generally working with ANSI SQL queries in a DW, e.g., for ETL, data prep, pulling data cubes. These can migrate into a Cascading app to run on Hadoop.

Page 30: Cascading meetup #4 @ BlueKai

ANSI SQL – multiple flows

[Flow diagram: multi-branch Cascading app, repeated from page 25]

Server-side Engineering: HBase tap for customer profiles

(integrating other components)

Engineering provides integration with customer profiles, e.g., transactional data objects in HBase. These can migrate into a Cascading app to run on Hadoop.

Page 31: Cascading meetup #4 @ BlueKai

ANSI SQL – multiple flows

[Flow diagram: multi-branch Cascading app, repeated from page 25]

Ops + Support: Traps get routed to customer review

(ties into notifications, etc.)

Support needs to review exceptional data, via reports/notifications. These can migrate into a Cascading app to run on Hadoop.

Page 32: Cascading meetup #4 @ BlueKai

ANSI SQL – multiple flows

[Flow diagram: multi-branch Cascading app, repeated from page 25]

Data Scientists: R => PMML for predictive models

(displaces SAS, etc.)

Scientists perform their model creation work in R, Weka, SAS, Microstrategy, etc., which can export as PMML. These can migrate into a Cascading app to run on Hadoop.

Page 33: Cascading meetup #4 @ BlueKai

ANSI SQL – multiple flows

[Flow diagram: multi-branch Cascading app, repeated from page 25]

App Engineering: Java/Scala/Clojure for business logic in data pipelines

(displaces Pig, etc.)

Generally the revenue apps require some custom business logic -- representing business process for LOB. These can migrate into a Cascading app to run on Hadoop.

Page 34: Cascading meetup #4 @ BlueKai

ANSI SQL – multiple flows

[Flow diagram: multi-branch Cascading app, repeated from page 25]

Front-end Engineering: Memcached tap for pushing updates to API

(integrating other components)

Engineering provides integration with caching layer, for API updates. These can migrate into a Cascading app to run on Hadoop.

Page 35: Cascading meetup #4 @ BlueKai

[Diagram: Hadoop cluster and data workflow, repeated from page 3]

Cascading workflows – API principles

• specify what is required, not how it must be achieved

• plan far ahead, before consuming cluster resources – fail fast prior to submit

• fail the same way twice – deterministic flow planners help reduce engineering costs for debugging at scale

• same JAR, any scale – app does not require a recompile to change data taps or cluster topologies

• no surprises

Some of the design principles for the pattern language
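Two of these principles, "fail fast prior to submit" and "same JAR, any scale", can be pictured with a small plain-Java sketch. It checks that every pipe has a source and a sink bound before any work would be scheduled, and it takes the tap locations from the command line so that changing them needs no recompile. This is not Cascading's flow planner: PlanFirstSketch and its methods are hypothetical, and the default paths are placeholders.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of "plan far ahead, fail fast prior to submit".
// This is plain Java, not Cascading's flow planner.
class PlanFirstSketch {
    // Insertion-ordered collections keep the plan (and its errors) deterministic.
    private final Map<String, String> sources = new LinkedHashMap<>();
    private final Map<String, String> sinks = new LinkedHashMap<>();
    private final List<String> pipes = new ArrayList<>();

    PlanFirstSketch pipe(String name) { pipes.add(name); return this; }
    PlanFirstSketch source(String pipe, String path) { sources.put(pipe, path); return this; }
    PlanFirstSketch sink(String pipe, String path) { sinks.put(pipe, path); return this; }

    // Validate the whole wiring before any work would be scheduled.
    List<String> plan() {
        List<String> problems = new ArrayList<>();
        for (String pipe : pipes) {
            if (!sources.containsKey(pipe)) problems.add(pipe + ": no source bound");
            if (!sinks.containsKey(pipe)) problems.add(pipe + ": no sink bound");
        }
        if (!problems.isEmpty()) {
            throw new IllegalStateException("plan rejected: " + problems);
        }
        List<String> steps = new ArrayList<>();
        for (String pipe : pipes) {
            steps.add(sources.get(pipe) + " -> " + pipe + " -> " + sinks.get(pipe));
        }
        return steps;
    }

    public static void main(String[] args) {
        // Tap locations come from the command line, so the same build can
        // point at local files or a cluster path without a recompile.
        String in = args.length > 0 ? args[0] : "data/in.csv";
        String out = args.length > 1 ? args[1] : "data/out";
        System.out.println(new PlanFirstSketch().pipe("etl").source("etl", in).sink("etl", out).plan());

        try {
            // Missing sink: rejected at planning time, before any resources are used.
            new PlanFirstSketch().pipe("etl").source("etl", in).plan();
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Because the checks run over ordered collections, the same mis-wired flow is rejected with the same message every time, which is the "fail the same way twice" property in miniature.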

Page 36: Cascading meetup #4 @ BlueKai

by Paco Nathan

Enterprise Data Workflows with Cascading

O’Reilly, 2013
amazon.com/dp/1449358721

book…

Our upcoming O’Reilly book: “Enterprise Data Workflows with Cascading”. Should be in Rough Cuts soon -- scheduled to be out in print this June.

Page 37: Cascading meetup #4 @ BlueKai

blog, dev community, code/wiki/gists, maven repo, commercial products, career opportunities:

cascading.org

zest.to/group11

github.com/Cascading

conjars.org

goo.gl/KQtUL

concurrentinc.com

join us for very interesting work!

drill-down…


Links to our open source projects, developer community, etc…