What is a Distributed Data Science Pipeline

How with Apache Spark and Friends.

by @DataFellas, @Noootsab, 23rd Nov. ’15 @YaJUG

Outline

● (Legacy) Data Science Pipeline/Product
● What changed since then
● Distributed Data Science (today)
● Challenges
● Going beyond (productivity)

Data Fellas

6-month-old Belgian startup

Andy Petrella (@noootsab)

Maths, Geospatial, Distributed Computing

@SparkNotebook, Spark/Scala trainer, Machine Learning

Xavier Tordoir (@xtordoir)

Physics, Bioinformatics, Distributed Computing

Scala (& Perl), Spark trainer, Machine Learning

(Legacy) Data Science Pipeline

Or, so-called, Data Product

Static Results

Lot of information lost in translation

Sounds like Waterfall

ETL look and feel

Sampling → Modelling → Tuning → Report → Interpret


Mono machine!

CPU bounds

Memory bounds

Or resampling because small-ish data


Our world Today

No, it wasn’t better before

Facts

Data gets bigger or, more precisely, the number of available sources explodes

Data gets faster (and faster): just consider watching Netflix on 4G

Consequences

HARD (or will be too big...)

Ephemeral

Restricted View

Sampling

Report


Interpretation ⇒ Too SLOW to get real ROI out of the overall system

How to work around that?


Consequences


Needs are:

Alerting system over descriptive charts

More accurate results

More or harder models (e.g. Deep Learning)

More data

Constant data flow

Online interactions under control (e.g. direct feedback)


So, we need...

Distributed Systems

Distributed Data Science

System/Platform/SDK/Pipeline/Product/… whatever you call it

“Create” Cluster

Find available sources (context, content, quality, semantic, …)

Connect to sources (structure, schema/types, …)

Create distributed data pipeline/Model

Tune accuracy

Tune performances

Write results to Sinks

Access Layer

User Access
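
The steps above can be read as a chain of stages running from a source to a sink. Below is a minimal single-process sketch in Python, assuming an in-memory source and a toy per-user mean standing in for the "model". It is not Spark code, and every name in it (`connect_to_source`, `clean`, `model`, `run_pipeline`) is hypothetical; in a real deployment each stage would run distributed, for instance as a Spark job.

```python
from functools import reduce


def connect_to_source():
    # Hypothetical in-memory source standing in for a real connector.
    return [
        {"user": "a", "value": 1.0},
        {"user": "b", "value": 3.0},
        {"user": "a", "value": 5.0},
        {"user": "b", "value": None},
    ]


def clean(records):
    # Drop records with a missing value.
    return [r for r in records if r["value"] is not None]


def model(records):
    # Toy "model": mean value per user, sorted by user for a stable output.
    totals = {}
    for r in records:
        total, count = totals.get(r["user"], (0.0, 0))
        totals[r["user"]] = (total + r["value"], count + 1)
    return [{"user": u, "mean": t / c} for u, (t, c) in sorted(totals.items())]


def run_pipeline(source, stages, sink):
    # Apply each stage to the output of the previous one, then write to the sink.
    result = reduce(lambda data, stage: stage(data), stages, source)
    sink.extend(result)
    return sink


sink = run_pipeline(connect_to_source(), [clean, model], [])
```

Keeping each stage a plain function from records to records is what makes it possible to tune, swap or redeploy one stage without touching the others.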







YO! Aren’t we talking about “Big” Data? Fast Data?

So could (all) results really be neither big nor fast?

Actually, results are themselves becoming “Big” Data! Fast Data!







How have we accessed data since the 90’s? Remember SOA? → SERVICES!

Nowadays, we’re talking about micro services.

Here we are, one service for one result.
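
To make "one service for one result" concrete, here is a minimal sketch using only Python's standard library: a handler that serves a single, already computed result as JSON. This is an illustration under stated assumptions, not the services discussed in the talk; the names `encode_result`, `make_handler` and `serve`, and the port, are hypothetical.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def encode_result(result):
    # Serialise the result once, as UTF-8 encoded JSON.
    return json.dumps(result, sort_keys=True).encode("utf-8")


def make_handler(result):
    # Build a handler class bound to exactly one result.
    body = encode_result(result)

    class ResultHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # Every GET returns the same precomputed result.
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return ResultHandler


def serve(result, port=8080):
    # Blocks forever; call it from a dedicated process or container.
    HTTPServer(("127.0.0.1", port), make_handler(result)).serve_forever()
```

Calling `serve([{"user": "a", "mean": 3.0}])` would expose that one result on the local port; a second result would get its own service.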







C’mon, charts/tables cannot be the only views offered to customers/clients, right?

We need to open the capabilities to UI (dashboard), connectors (third parties), other services (“SOA”) … OTHER Pipelines!!!

What about Productivity?

Streamlining development lifecycle most welcome





[Figure: the pipeline steps above, each annotated with the roles it involves: ops, data, sci, web]








So, how to have:

● results coming fast enough whilst keeping the accuracy level high?
● responsiveness to external/unpredictable events?

WHEN...

➔ Longer production line
➔ More constraints (resource sharing, time, …)
➔ More people
➔ More skills

Overlook these points and you’ll be, soon or sooner, kicked

Warning: Team Fight (seen by members)

Warning: Team Fight (seen by managers)

Warning: Team Fight (seen by employers)

Warning: Team Fight (seen by customers)


At Data Fellas, we think that we need Interactivity and Reactivity to tighten the frontiers (within team and in time).

Hence, Data Fellas

● extends the Spark Notebook (interactivity)
● builds the Shar3 product around it (Integrated Reactivity)

Concepts of Data Fellas’ Shar3

Shareable and Streamlined Data Science

Analysis

Production

Distribution

Rendering

Discovery

Catalog

Project Generator

Micro Service / Binary format

Schema for output

Metadata

Using Shar3

yeah \o/

Let’s take this example where some buddies from

Datastax: Joel Jacobson (@joeljacobson), Simon Ambridge (@stratman1958)

Mesosphere: Michael Hausenblas (@mhausenblas)

Typesafe: Iulian Dragos (@jaguarul)

Data Fellas: Xavier Tordoir (@xtordoir)

(and me)


What do we need to do now?

● deploy
● connect the dots
● track
● scale

BOTH the jobs and the services

From notebook

to SBT project

to Docker

to Marathon

SNB → SBT/JAR → Docker → Marathon
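
As an illustration of the last hop in the notebook to SBT to Docker to Marathon path, the sketch below builds the kind of JSON app definition that Marathon accepts for a Docker container. It is a minimal sketch, not what Shar3 generates: the app id and image name are hypothetical, and the field names follow Marathon's app definition format as commonly documented, so check them against the Marathon version you run.

```python
import json


def marathon_app(app_id, image, cpus=1.0, mem=1024, instances=1):
    # Build a Marathon app definition for a Docker image.
    # app_id and image are supplied by the caller; defaults are illustrative.
    return {
        "id": app_id,
        "cpus": cpus,
        "mem": mem,
        "instances": instances,
        "container": {
            "type": "DOCKER",
            "docker": {"image": image, "network": "BRIDGE"},
        },
    }


# Hypothetical job packaged from a notebook.
app = marathon_app("/variant-analysis", "example/variant-analysis:0.1", instances=2)
payload = json.dumps(app, indent=2)
```

The resulting `payload` is what would typically be posted to Marathon's `/v2/apps` endpoint; scaling then amounts to changing `instances`.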

From Notebook

● to output
● to Avro

SNB
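
To show what going from an output to Avro involves, here is a minimal sketch that derives an Avro record schema from one sample output row. It is an assumption-laden illustration, not Shar3's mechanism: the function name `infer_avro_schema`, the record name and the sample fields are all hypothetical, and only flat records with primitive values are handled.

```python
import json

# Mapping from Python types to Avro primitive type names.
AVRO_TYPES = {
    bool: "boolean",
    int: "long",
    float: "double",
    str: "string",
    bytes: "bytes",
    type(None): "null",
}


def infer_avro_schema(name, record):
    # Derive a flat Avro record schema from a single sample row.
    fields = []
    for key, value in record.items():
        # Exact type lookup, so that True maps to "boolean" and not "long".
        avro_type = AVRO_TYPES.get(type(value))
        if avro_type is None:
            raise TypeError("unsupported type for field %r" % key)
        fields.append({"name": key, "type": avro_type})
    return {"type": "record", "name": name, "fields": fields}


# Hypothetical output row of a notebook.
row = {"chromosome": "1", "position": 12345, "frequency": 0.25, "rare": True}
schema = infer_avro_schema("Variant", row)
schema_json = json.dumps(schema)
```

Publishing the schema next to the data is what lets a downstream service or tool read the output without knowing the notebook that produced it.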

From Notebook

● to Avro
● to service
● to SBT
● to Docker
● to Marathon

SNB → SBT/JAR → Docker → Marathon

From Notebook

● to Avro
● to Tableau
● or QlikView
● or D3.JS
● or …

SNB

So we have this information available:

● notebook’s markdown text
● notebook’s code/model
● data sources
● output/sinks
● output/services
● Avro schema

Shouldn’t they all be reused???

Variant Analysis

There is a service!

Let’s use it…

What was the process?

Fine, and the output is in C*

Let’s check what’s in there

Not what I need, let’s ADAPT

Poke us on

@DataFellas

@Shar3_Fellas

@SparkNotebook

@Xtordoir & @Noootsab

Now @TypeSafe: http://t.co/o1Bt6dQtgH

If you wanna learn more about the different tools… Join us @ O’Reilly

Follow up Soon on http://NoETL.org (HI5 to @ChiefScientist for that)

That’s all folks

Thanks for listening/staying

https://docs.google.com/forms/d/151AKOlZTrmu3JZSPfbHpUSzk7mAaeE-SJk9hLsDUpYo/viewform