Cascading Meetup #4 @ BlueKai
Copyright @2013, Concurrent, Inc.
BlueKai, Cupertino, CA, 2013-03-05
Cascading Meetup #4
Tuesday, 05 March 13
Cascading Meetup
[Flow diagram: word-count example. Labels: Document Collection, Tokenize, Scrub token, Regex token, HashJoin Left (RHS: Stop Word List), GroupBy token, Count, Word Count; with M and R markers]
1. Enterprise Data Workflows
2. ANSI SQL Support
3. Test-Driven Development
[Diagram: Data Workflow on a Hadoop Cluster, with source taps (Customer Prefs from customer profile DBs; Logs), a trap tap, and sink taps. Surrounding labels: Cache, Customers, Support, Web App, Reporting, Analytics Cubes, Modeling (PMML)]
Enterprise Data Workflows
Let’s consider an example app… at the front end
LOB use cases drive demand for apps
LOB use cases drive the demand for Big Data apps
Enterprise Data Workflows
An example… in the back office
Organizations have substantial investments in people, infrastructure, process
Enterprise organizations have seriously ginormous investments in existing back office practices: people, infrastructure, processes
Enterprise Data Workflows
An example… for the heavy lifting!
“Main Street” firms are migrating workflows to Hadoop, for cost savings and scale-out
“Main Street” firms have invested in Hadoop to address Big Data needs, offsetting their rising costs for Enterprise licenses from SAS, Teradata, etc.
Two Avenues…
[Chart axes: scale ➞ and complexity ➞]
Enterprise: must contend with complexity at scale every day…
incumbents extend current practices and infrastructure investments – using J2EE, ANSI SQL, SAS, etc. – to migrate workflows onto Apache Hadoop while leveraging existing staff
Start-ups: crave complexity and scale to become viable…
new ventures move into Enterprise space to compete using relatively lean staff, while leveraging sophisticated engineering practices, e.g., Cascalog and Scalding
Enterprise data workflows are observed in two modes: start-ups approaching complexity and incumbent firms grappling with complexity
Two Avenues…
Hadoop almost never gets used in isolation; data workflows define the “glue” required for system integration of Enterprise apps
Hadoop is almost never used in isolation. Enterprise data workflows are about system integration. There are a couple of different ways to arrive at the party.
Cascading Meetup
1. Enterprise Data Workflows
2. ANSI SQL Support
3. Test-Driven Development
Cascading workflows – ANSI SQL
• collab with Optiq – industry-proven code base
• ANSI SQL parser/optimizer atop Cascading flow planner
• JDBC driver to integrate into existing tools and app servers
• relational catalog over a collection of unstructured data
• SQL shell prompt to run queries
ANSI SQL as “machine code” -- the lingua franca of Enterprise system integration.
Cascading partnered with Optiq, the team behind Mondrian, etc., with an Enterprise-proven code base for an ANSI SQL parser/optimizer.
Cascading workflows – ANSI SQL
Premise: most SQL in the world gets written by machines…
This isn’t a database; this is about making machine-to-machine communications simpler and more robust at scale.
Cascading workflows – ANSI SQL
• enable analysts without retraining on Hadoop, etc.
• transparency for Support, Ops, Finance, et al.
a language for queries – not a database, but ANSI SQL as a DSL for workflows
ANSI SQL – reviews
Open Source 'Lingual' Helps SQL Devs Unlock Hadoop
Thor Olavsrud, 2013-02-22
cio.com/article/729283/Open_Source_Lingual_Helps_SQL_Devs_Unlock_Hadoop

Hadoop Apps Without MapReduce Mindsets
Adrian Bridgwater, 2013-02-28
drdobbs.com/open-source/hadoop-apps-without-mapreduce-mindsets/240149708

Concurrent gives old SQL users new Hadoop tricks
Jack Clark, 2013-02-20
theregister.co.uk/2013/02/20/hadoop_sql_translator_lingual_launches/

Concurrent Open Source Project Ties SQL to Hadoop
Michael Vizard, 2013-02-21
itbusinessedge.com/blogs/it-unmasked/concurrent-open-source-project-ties-sql-to-hadoop.html

Concurrent Releases Lingual, a SQL DSL for Hadoop
Boris Lublinsky, 2013-02-28
infoq.com/news/2013/02/Lingual
ANSI SQL – CSV data in local file system
cascading.org/lingual
The test database for MySQL is available for download from https://launchpad.net/test-db/
Here we have a bunch o’ CSV flat files in a directory in the local file system.
Use the “lingual” command line interface to overlay DDL to describe the expected table schema.
ANSI SQL – shell prompt, catalog
cascading.org/lingual
Use the “lingual” SQL shell prompt to run SQL queries interactively, show catalog, etc.
ANSI SQL – queries
cascading.org/lingual
Here’s an example SQL query on that “employee” test database from MySQL.
ANSI SQL – layers
abstraction | RDBMS | JVM Cluster
parser | ANSI SQL compliant parser | ANSI SQL compliant parser
optimizer | logical plan, optimized based on stats | logical plan, optimized based on stats
planner | physical plan | API “plumbing”
machine data | query history, table stats | app history, tuple stats
topology | b-trees, etc. | heterogenous, distributed: Hadoop, IMDG, etc.
visualization | ERD | flow diagram
schema | table schema | tuple schema
catalog | relational catalog | tap usage DB
provenance | (manual audit) | data set producers/consumers
When you peel back the onion skin on a SQL query, each of the abstraction layers used in an RDBMS has an analogue (or better) in the context of Enterprise Data Workflows running on JVM clusters.
ANSI SQL – JDBC driver
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public void run() throws ClassNotFoundException, SQLException
  {
  // load the Lingual JDBC driver
  Class.forName( "cascading.lingual.jdbc.Driver" );

  // point the connection at a directory of flat files
  Connection connection = DriverManager.getConnection(
    "jdbc:lingual:local;schemas=src/main/resources/data/example" );
  Statement statement = connection.createStatement();

  ResultSet resultSet = statement.executeQuery(
    "select *\n"
      + "from \"EXAMPLE\".\"SALES_FACT_1997\" as s\n"
      + "join \"EXAMPLE\".\"EMPLOYEE\" as e\n"
      + "on e.\"EMPID\" = s.\"CUST_ID\"" );

  // print each row as "LABEL=value; LABEL=value; ..."
  while( resultSet.next() )
    {
    int n = resultSet.getMetaData().getColumnCount();
    StringBuilder builder = new StringBuilder();

    for( int i = 1; i <= n; i++ )
      {
      builder.append( ( i > 1 ? "; " : "" )
        + resultSet.getMetaData().getColumnLabel( i )
        + "=" + resultSet.getObject( i ) );
      }

    System.out.println( builder );
    }

  resultSet.close();
  statement.close();
  connection.close();
  }
Note that in this example the schema for the DDL has been derived directly from the CSV files.
In other words, point the JDBC connection at a directory of flat files and query as if they were already loaded into SQL.
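As a rough illustration of deriving a table schema from flat files, the plain-Java sketch below reads column names and types from a CSV header row. The `name:type` header convention, the default type of `string`, and the `CsvSchemaSketch` class are all assumptions made for this example; it is not a description of how the Lingual driver works internally.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: derives a column-name -> type mapping from a CSV header row. */
final class CsvSchemaSketch {

  /**
   * Parses a header such as "EMPID:int,NAME:string".
   * A column without a ":type" suffix defaults to "string" (an assumption of this sketch).
   */
  static Map<String, String> schemaFromHeader(String headerLine) {
    Map<String, String> schema = new LinkedHashMap<>(); // preserves column order
    for (String cell : headerLine.split(",")) {
      String[] parts = cell.trim().split(":", 2);
      schema.put(parts[0], parts.length > 1 ? parts[1] : "string");
    }
    return schema;
  }

  public static void main(String[] args) {
    // prints {EMPID=int, NAME=string}
    System.out.println(schemaFromHeader("EMPID:int,NAME:string"));
  }
}
```

A real catalog would also need to handle quoting, delimiters other than commas, and type validation, which this sketch leaves out.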
ANSI SQL – JDBC driver
$ gradle clean jar
$ hadoop jar build/libs/lingual-examples-1.0.0-wip-dev.jar
CUST_ID=100; PROD_ID=10; EMPID=100; NAME=Bill
CUST_ID=150; PROD_ID=20; EMPID=150; NAME=Sebastian
Caveat: if you absolutely positively must have sub-second SQL query response for PB-scale data on a 1000+ node cluster… Good luck with that! (call the MPP vendors)
This ANSI SQL library is primarily intended for batch workflows – high throughput, not low latency – for many under-represented use cases in Enterprise IT.
It’s essentially ANSI SQL as a DSL.
success
Cascading Meetup
1. Enterprise Data Workflows
2. ANSI SQL Support
3. Test-Driven Development
Test-Driven Development (TDD)
source: Wikipedia
A general view of the TDD process
Test-Driven Development (TDD)
In terms of Big Data apps, TDD is not generally part of the conversation
TDD is not usually high on the list when people start discussing Big Data apps.
Traps – Cascading “exceptional data”
• assert patterns (regex) on the tuple streams
• adjust assert levels, like log4j levels
• define traps on branches
• tuples which fail asserts get trapped
An innovation in Cascading was to introduce the notion of a “data exception”, based on setting stream assertion levels as part of the business logic of an app.
Traps – example code
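import cascading.flow.FlowDef;
import cascading.operation.AssertionLevel;
import cascading.operation.assertion.AssertMatches;
import cascading.pipe.Each;
import cascading.pipe.Pipe;

// set up... (jsonTap, trapTap, cacheTap and options are defined here)
Pipe etlPipe = new Pipe( "etlPipe" );

// some processing...

// assert that each tuple matches the regex
AssertMatches assertMatches = new AssertMatches( ".*true" );
etlPipe = new Each( etlPipe, AssertionLevel.STRICT, assertMatches );

// some processing...

// connect the source, the trap for failed tuples, and the sink
FlowDef flowDef = FlowDef.flowDef().setName( "etl" )
  .addSource( etlPipe, jsonTap )
  .addTrap( etlPipe, trapTap )
  .addTailSink( etlPipe, cacheTap );

// enable the assertions only when requested on the command line
if( options.has( "assert" ) )
  flowDef.setAssertionLevel( AssertionLevel.STRICT );
else
  flowDef.setAssertionLevel( AssertionLevel.NONE );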
Example use in Cascading code
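To make the trap idea concrete without a cluster, here is a minimal plain-Java sketch of the same semantics: tuples that fail a regex assertion are diverted to a trap list instead of stopping the run. `TrapSketch`, its `Level` enum, and the string-per-tuple representation are hypothetical stand-ins for illustration; this is not the Cascading API, and its whole-string matching may differ from how Cascading applies the pattern.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Plain-Java illustration of trap semantics; not the Cascading API. */
final class TrapSketch {

  /** Hypothetical stand-in for assertion levels: NONE skips checks, STRICT applies them. */
  enum Level { NONE, STRICT }

  /** Tuples that continue downstream, and tuples diverted to the trap. */
  static final class Result {
    final List<String> passed = new ArrayList<>();
    final List<String> trapped = new ArrayList<>();
  }

  /** Checks each tuple against the regex; failures go to the trap instead of raising an error. */
  static Result run(List<String> tuples, String regex, Level level) {
    Pattern pattern = Pattern.compile(regex);
    Result result = new Result();
    for (String tuple : tuples) {
      boolean ok = level == Level.NONE || pattern.matcher(tuple).matches();
      (ok ? result.passed : result.trapped).add(tuple);
    }
    return result;
  }

  public static void main(String[] args) {
    List<String> tuples = Arrays.asList("a,true", "b,false", "c,true");
    Result strict = run(tuples, ".*true", Level.STRICT);
    // prints passed=[a,true, c,true] trapped=[b,false]
    System.out.println("passed=" + strict.passed + " trapped=" + strict.trapped);
  }
}
```

With `Level.NONE` every tuple passes, which mirrors turning assertions off for a production run.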
Traps – redirect exceptions in production
shunt the trapped exceptional data to other parts of the organization:
• Ops: notifications
• QA: investigate data anomalies
• Support: review customer records
• Finance: audit
TDD – practice at scale
1. assert expected patterns in raw input
2. run just that, to find edge cases
3. handle the edge cases for input data
4. assert expected patterns after first chunk of processing
5. run just that, to verify failure
6. code until test passes
7. repeat #4 for each chunk
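The steps above can be sketched in plain Java as a chain of stages, each with an assertion on its output, where the runner reports the first stage whose assertion fails. `StagedTddSketch`, the stage names, and the regexes are hypothetical; a real app would express the assertions in the flow itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;
import java.util.regex.Pattern;

/** Illustrative only: run processing stage by stage, asserting a pattern after each one. */
final class StagedTddSketch {

  /** One chunk of processing plus the pattern its output is expected to match. */
  static final class Stage {
    final String name;
    final Function<String, String> transform;
    final Pattern expected;

    Stage(String name, Function<String, String> transform, String expectedRegex) {
      this.name = name;
      this.transform = transform;
      this.expected = Pattern.compile(expectedRegex);
    }
  }

  /** Returns the name of the first stage whose output fails its assertion, if any. */
  static Optional<String> firstFailingStage(List<String> input, List<Stage> stages) {
    List<String> current = input;
    for (Stage stage : stages) {
      List<String> next = new ArrayList<>();
      for (String record : current) {
        String out = stage.transform.apply(record);
        if (!stage.expected.matcher(out).matches()) {
          return Optional.of(stage.name);
        }
        next.add(out);
      }
      current = next;
    }
    return Optional.empty();
  }

  public static void main(String[] args) {
    List<String> input = Arrays.asList(" Hello World ");
    Stage scrub = new Stage("scrub", s -> s.trim().toLowerCase(), "[a-z ]+");
    // step 5: the assertion exists but the stage is still a stub, so the run fails
    Stage stub = new Stage("squash", s -> s, "[a-z]+");
    // step 6: the stage is implemented, so the run passes
    Stage done = new Stage("squash", s -> s.replace(" ", ""), "[a-z]+");

    System.out.println(firstFailingStage(input, Arrays.asList(scrub, stub))); // Optional[squash]
    System.out.println(firstFailingStage(input, Arrays.asList(scrub, done))); // Optional.empty
  }
}
```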
[Flow diagram: example app with GPS, road, and tree branches. Labels: gps logs, Geohash, Count gps_count, Max recent_visit; Road Segments, Regex parse-road, Road Metadata, Estimate Albedo; GIS export, Regex parse-gis, Regex parse-tree, Scrub species, Tree Metadata, Estimate height, Failure Traps; Join, Calculate distance, Filter height, Sum moment, Estimate traffic, Filter distance, Filter sum_moment; shade, reco; with M and R markers]
TDD – Cascalog features
consider that TDD is about asserting and negating logical predicates…
• Cascalog is based on logical predicates
• function definitions as composable subqueries
• functions are not particularly far from being unit tests
• Midje: facts, mocks
sritchie.github.com/2011/09/30/testing-cascalog-with-midje.html
sritchie.github.com/2012/01/22/cascalog-testing-20.html
Moreover, the Cascalog language by Nathan Marz, Sam Ritchie, et al., nearly uses TDD as its methodology -- in the transition from ad-hoc queries as logic predicates, then composing those predicates into large-scale apps.
Cascading Meetup
1. Enterprise Data Workflows
2. ANSI SQL Support
3. Test-Driven Development
…plus, a proposal
ANSI SQL – multiple flows
Suppose your organization is responsible for a large-scale app…
Multiple teams develop reusable libraries…
Suppose you have an app with a complex flow diagram like this, with contributions to the business logic from different departments…
ANSI SQL – multiple flows
Data Analysts: ANSI SQL queries for data prep
(displaces Hive, etc.)
Analysts are generally working with ANSI SQL queries in a DW, e.g., for ETL, data prep, pulling data cubes. These can migrate into a Cascading app to run on Hadoop.
ANSI SQL – multiple flows
Server-side Engineering: HBase tap for customer profiles
(integrating other components)
Engineering provides integration with customer profiles, e.g., transactional data objects in HBase. These can migrate into a Cascading app to run on Hadoop.
ANSI SQL – multiple flows
Ops + Support: Traps get routed to customer review
(ties into notifications, etc.)
Support needs to review exceptional data, via reports/notifications. These can migrate into a Cascading app to run on Hadoop.
ANSI SQL – multiple flows
Data Scientists: R => PMML for predictive models
(displaces SAS, etc.)
Scientists perform their model creation work in R, Weka, SAS, MicroStrategy, etc., which can export as PMML. These can migrate into a Cascading app to run on Hadoop.
ANSI SQL – multiple flows
App Engineering: Java/Scala/Clojure for business logic in data pipelines
(displaces Pig, etc.)
Generally the revenue apps require some custom business logic -- representing business process for LOB. These can migrate into a Cascading app to run on Hadoop.
ANSI SQL – multiple flows
Front-end Engineering: Memcached tap for pushing updates to API
(integrating other components)
Engineering provides integration with caching layer, for API updates. These can migrate into a Cascading app to run on Hadoop.
Cascading workflows – API principles
• specify what is required, not how it must be achieved
• plan far ahead, before consuming cluster resources – fail fast prior to submit
• fail the same way twice – deterministic flow planners help reduce engineering costs for debugging at scale
• same JAR, any scale – app does not require a recompile to change data taps or cluster topologies
• no surprises
Some of the design principles for the pattern language
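The “fail fast prior to submit” principle can be illustrated with a small plain-Java sketch: each step declares the fields it requires and produces, and a planning pass checks that every dependency resolves before anything runs. `FailFastPlanSketch`, `Step`, and the field names are hypothetical and only loosely modeled on the word-count example; this is not the Cascading flow planner.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative only: validate a whole plan up front, before any work is submitted. */
final class FailFastPlanSketch {

  /** A processing step with the fields it needs and the fields it adds. */
  static final class Step {
    final String name;
    final List<String> requires;
    final List<String> produces;

    Step(String name, List<String> requires, List<String> produces) {
      this.name = name;
      this.requires = requires;
      this.produces = produces;
    }
  }

  /** Returns all planning errors; an empty list means the plan can be submitted. */
  static List<String> plan(List<String> sourceFields, List<Step> steps) {
    Set<String> available = new HashSet<>(sourceFields);
    List<String> errors = new ArrayList<>();
    for (Step step : steps) {
      for (String field : step.requires) {
        if (!available.contains(field)) {
          errors.add(step.name + ": missing field '" + field + "'");
        }
      }
      available.addAll(step.produces);
    }
    return errors;
  }

  public static void main(String[] args) {
    List<Step> steps = Arrays.asList(
        new Step("tokenize", Arrays.asList("doc"), Arrays.asList("token")),
        new Step("count", Arrays.asList("tokn"), Arrays.asList("count"))); // typo caught at plan time
    List<String> errors = plan(Arrays.asList("doc"), steps);
    if (!errors.isEmpty()) {
      System.out.println("plan rejected before submit: " + errors);
    }
  }
}
```

Because the check is deterministic and touches no data, the same mistake is reported the same way on a laptop and on a large cluster.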
by Paco Nathan
Enterprise Data Workflows with Cascading
O’Reilly, 2013
amazon.com/dp/1449358721
book…
Our upcoming O’Reilly book: “Enterprise Data Workflows with Cascading”. Should be in Rough Cuts soon -- scheduled to be out in print this June.
blog, dev community, code/wiki/gists, maven repo, commercial products, career opportunities:
cascading.org
zest.to/group11
github.com/Cascading
conjars.org
goo.gl/KQtUL
concurrentinc.com
join us for very interesting work!
drill-down…
Links to our open source projects, developer community, etc…