What is a Distributed Data Science Pipeline? How, with Apache Spark and Friends


What is a Distributed Data Science Pipeline? How, with Apache Spark and Friends.

by @DataFellas / @Noootsab, 23rd Nov. ‘15 @ YaJUG

Outline

● (Legacy) Data Science Pipeline/Product
● What changed since then
● Distributed Data Science (today)
● Challenges
● Going beyond (productivity)

Data Fellas: a 6-month-old Belgian startup

Andy Petrella (@noootsab)
Maths, Geospatial, Distributed Computing
@SparkNotebook, Spark/Scala trainer, Machine Learning

Xavier Tordoir (@xtordoir)
Physics, Bioinformatics, Distributed Computing
Scala (& Perl), Spark trainer, Machine Learning

(Legacy) Data Science Pipeline, or the so-called Data Product

Static results

A lot of information lost in translation

Sounds like Waterfall

ETL look and feel

Sampling → Modelling → Tuning → Report → Interpretation

(Legacy) Data Science Pipeline, or the so-called Data Product

Single machine!

CPU-bound

Memory-bound

Or resampling because the data is small-ish

Sampling → Modelling → Tuning → Report → Interpretation
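To make those single-machine bounds concrete, here is a minimal, self-contained Scala sketch; the dataset size, sampling rate, and the trivial statistic standing in for a real model are all hypothetical:

```scala
import scala.util.Random

object LegacyPipeline extends App {
  // The whole "population" must fit in one JVM heap: memory-bound.
  val population = Vector.fill(2000000)(Random.nextGaussian())

  // Sampling: cope with CPU/memory limits by modelling a ~1% subset.
  val sample = population.filter(_ => Random.nextDouble() < 0.01)

  // "Modelling" and "Tuning": a trivial statistic stands in for a real model.
  val mean = sample.sum / sample.size

  // "Report": a static result, interpreted once and then stale.
  println(f"Estimated mean from ${sample.size} points: $mean%.4f")
}
```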

Facts: data gets bigger; or, more precisely, the number of available sources explodes

Data gets faster (and faster); just consider watching Netflix on 4G.

Our World Today: no, it wasn’t better before

Consequences:

● Sampling becomes HARD (or the sample will be too big...) and yields only a restricted view
● Reports become ephemeral

Our World Today: no, it wasn’t better before

Interpretation ⇒ Too SLOW to get real ROI out of the overall system

How to work around that?


Our World Today: no, it wasn’t better before

Needs are:

● Alerting systems over descriptive charts
● More accurate results:
  ○ more or harder models (e.g. Deep Learning)
  ○ more data
● Constant data flow
● Online interactions under control (e.g. direct feedback)
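As a concrete illustration of “alerting over descriptive charts” on a constant data flow, here is a minimal sketch using Spark Streaming 1.x (the era of this talk); the socket source, port, and threshold are hypothetical:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object AlertingStream extends App {
  val conf = new SparkConf().setAppName("alerts").setMaster("local[2]")
  val ssc = new StreamingContext(conf, Seconds(5))

  // Constant data flow: one numeric metric per line over a socket.
  val metrics = ssc.socketTextStream("localhost", 9999).map(_.toDouble)

  // Alerting over descriptive charts: act on the data, don't just plot it.
  metrics.filter(_ > 0.9).foreachRDD { rdd =>
    rdd.foreach(v => println(s"ALERT: value $v above threshold"))
  }

  ssc.start()
  ssc.awaitTermination()
}
```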

Our World Today: no, it wasn’t better before

So, we need... Distributed Systems!

Distributed Data Science: System/Platform/SDK/Pipeline/Product/… whatever you call it

● “Create” Cluster
● Find available sources (context, content, quality, semantics, …)
● Connect to sources (structure, schema/types, …)
● Create distributed data pipeline/model
● Tune accuracy
● Tune performance
● Write results to sinks
● Access layer
● User access
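A minimal Scala/Spark sketch of the middle steps (connect to a source, build a distributed model, write results to a sink); the paths, input format, and model choice are assumptions for illustration, not prescriptions from the talk:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object DistributedPipeline extends App {
  // "Create" cluster: local master here; standalone/YARN/Mesos in production.
  val conf = new SparkConf().setAppName("dds-pipeline").setMaster("local[*]")
  val sc = new SparkContext(conf)

  // Connect to a source (structure: CSV lines of numeric features).
  val points = sc.textFile("hdfs:///data/features.csv")
    .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
    .cache()

  // Create the distributed model; tune accuracy via k and iterations.
  val model = KMeans.train(points, 10, 20) // k = 10, maxIterations = 20

  // Write results to a sink: each point with its cluster assignment.
  points.map(p => s"${model.predict(p)},$p")
    .saveAsTextFile("hdfs:///results/assignments")
}
```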


YO!Aren’t we talking about “Big” Data ? Fast Data ?

So could really (all) results being neither big nor fast?

Actually, Results are becoming themselves “Big” Data ! Fast Data !


How have we accessed data since the ‘90s? Remember SOA? → SERVICES!

Nowadays, we’re talking about microservices.

Here we are: one service for one result.
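A minimal sketch of “one service for one result”; the HTTP library (Akka HTTP 10.2+), route, port, and payload are assumptions, not what Shar3 necessarily generates:

```scala
import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.http.scaladsl.server.Directives._

object ResultService extends App {
  implicit val system: ActorSystem = ActorSystem("result-service")

  // One result, one service: expose a single pipeline output over HTTP.
  val route = path("result" / Segment) { id =>
    get {
      // A real service would read the sink the pipeline wrote to.
      complete(s"""{"id":"$id","score":0.42}""")
    }
  }

  Http().newServerAt("localhost", 8080).bind(route)
}
```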


C’mon, charts/tables cannot be the only views offered to customers/clients, right?

We need to open the capabilities to UIs (dashboards), connectors (third parties), other services (“SOA”) … … and OTHER pipelines!!!

What about Productivity? Streamlining the development lifecycle is most welcome

The pipeline steps, annotated with the roles involved at each one:

● “Create” Cluster → ops
● Find available sources → data
● Connect to sources → ops, data
● Create distributed data pipeline/model → sci
● Tune accuracy → sci, ops
● Tune performance → sci
● Write results to sinks → ops, data
● Access layer → web, ops, data
● User access → web, ops, data, sci


What about Productivity? Streamlining the development lifecycle is most welcome

➔ Longer production line
➔ More constraints (resource sharing, time, …)
➔ More people
➔ More skills

Overlook these points and, soon or sooner, you’ll be kicked.

So, how do we get:

● results coming fast enough while keeping the accuracy level high?
● responsiveness to external/unpredictable events?

Warning! Team fight: as seen by members

Warning! Team fight: as seen by managers

Warning! Team fight: as seen by employers

Warning! Team fight: as seen by customers

What about Productivity? Streamlining the development lifecycle is most welcome

At Data Fellas, we think we need Interactivity and Reactivity to tighten the frontiers (within the team and in time).

Hence, Data Fellas

● extends the Spark Notebook (Interactivity)
● builds the Shar3 product around it (Integrated Reactivity)

Concepts of Data Fellas’ Shar3: Shareable and Streamlined Data Science

[Diagram: the Shar3 cycle: Analysis → Production → Distribution/Rendering → Discovery; backed by a Catalog/Project, a Generator, a Micro Service / Binary format, a Schema for the output, and Metadata]

Using Shar3, yeah \o/

Let’s take this example, built with some buddies from:

● Datastax: Joel Jacobson (@joeljacobson), Simon Ambridge (@stratman1958)
● Mesosphere: Michael Hausenblas (@mhausenblas)
● Typesafe: Iulian Dragos (@jaguarul)
● Data Fellas: Xavier Tordoir (@xtordoir), and me



What do we need to do now?

● deploy
● connect the dots
● track
● scale

Both the jobs and the services.

Using Shar3, yeah \o/

From notebook
→ to SBT project
→ to Docker
→ to Marathon

[Pipeline: SNB → SBT/JAR → Docker → Marathon]
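The notebook-to-deployable path can be pictured with a build definition like this sketch; it assumes sbt-native-packager for the Docker step, and the project name, versions, and port are hypothetical:

```scala
// build.sbt: minimal sketch, assuming sbt-native-packager on the plugin classpath
enablePlugins(JavaAppPackaging, DockerPlugin)

name := "variant-analysis-service"
scalaVersion := "2.11.7"

libraryDependencies ++= Seq(
  // Spark is provided by the cluster at runtime.
  "org.apache.spark" %% "spark-core" % "1.5.2" % "provided"
)

// Expose the micro-service port in the generated Docker image;
// the image can then be referenced from a Marathon app definition.
dockerExposedPorts := Seq(8080)
```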

Using Shar3, yeah \o/

From notebook
● to output
● to Avro
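To show what “to Avro” means for the output, here is a sketch using the Avro generic API; the record and field names are hypothetical, loosely inspired by the variant-analysis demo:

```scala
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericRecord}

object OutputSchema extends App {
  // Schema for the pipeline output: the contract that services and
  // downstream pipelines can discover and reuse.
  val schemaJson =
    """{
      |  "type": "record", "name": "Variant",
      |  "fields": [
      |    {"name": "chromosome", "type": "string"},
      |    {"name": "position",   "type": "long"},
      |    {"name": "genotype",   "type": "string"}
      |  ]
      |}""".stripMargin

  val schema = new Schema.Parser().parse(schemaJson)

  // A record conforming to the schema, ready for binary serialization.
  val record: GenericRecord = new GenericData.Record(schema)
  record.put("chromosome", "chr1")
  record.put("position", 123456L)
  record.put("genotype", "A/T")
  println(record)
}
```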

Using Shar3, yeah \o/

From notebook
● to Avro
● to service
● to SBT
● to Docker
● to Marathon

Using Shar3, yeah \o/

From notebook
● to Avro
● to Tableau
● or QlikView
● or D3.js
● or …

Using Shar3, yeah \o/

So we have all this information available:

● the notebook’s markdown text
● the notebook’s code/model
● data sources
● outputs/sinks
● output services
● the Avro schema

Shouldn’t it all be reused???

Using Shar3, yeah \o/

Variant Analysis: there is a service!

Using Shar3, yeah \o/

Let’s use it…

Using Shar3, yeah \o/

What was the process?

Using Shar3, yeah \o/

Fine, and the output is in C* (Cassandra).

Using Shar3, yeah \o/

Let’s check what’s in there.
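Checking the sink could look like this sketch with the spark-cassandra-connector; the keyspace and table names are hypothetical:

```scala
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

object CheckResults extends App {
  val conf = new SparkConf()
    .setAppName("check-results")
    .setMaster("local[*]")
    .set("spark.cassandra.connection.host", "127.0.0.1")
  val sc = new SparkContext(conf)

  // Read back the pipeline output that the service also exposes.
  val rows = sc.cassandraTable("pipeline", "variants")
  rows.take(10).foreach(println)
}
```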

Using Shar3, yeah \o/

Not what I need; let’s ADAPT.

Using Shar3, yeah \o/

Poke us on

@DataFellas

@Shar3_Fellas

@SparkNotebook

@Xtordoir & @Noootsab

Now @TypeSafe: http://t.co/o1Bt6dQtgH

If you want to learn more about the different tools… join us @ O’Reilly

Follow-up soon on http://NoETL.org (high five to @ChiefScientist for that)

That’s all folks! Thanks for listening/staying.
