what is a distributed data science pipeline. how with apache spark and friends

48
What is a Distributed Data Science Pipeline How with Apache Spark and Friends. by @DataFellas @Noootsab, 23th Nov. ‘15 @YaJUG

Upload: andy-petrella

Post on 13-Apr-2017

1.478 views

Category:

Science


0 download

TRANSCRIPT

Page 1: What is a distributed data science pipeline. how with apache spark and friends

What is a Distributed Data Science PipelineHow with Apache Spark and Friends.

by @DataFellas@Noootsab, 23th Nov. ‘15 @YaJUG

Page 2: What is a distributed data science pipeline. how with apache spark and friends

● (Legacy) Data Science Pipeline/Product● What changed since then● Distributed Data Science (today)● Challenges● Going beyond (productivity)

Outline

Page 3: What is a distributed data science pipeline. how with apache spark and friends

Data Fellas6 months old Belgian Startup

Andy Petrella@noootsab

MathsGeospatialDistributed Computing

@SparkNotebookSpark/Scala trainerMachine Learning

Xavier Tordoir@xtordoir

PhysicsBioinformaticsDistributed Computing

Scala (& Perl)Spark trainerMachine Learning

Page 4: What is a distributed data science pipeline. how with apache spark and friends

(Legacy) Data Science PipelineOr, so called, Data Product

Static Results

Lot of information lost in translation

Sounds like Waterfall

ETL look and feel

Sampling Modelling Tuning Report Interprete

Page 5: What is a distributed data science pipeline. how with apache spark and friends

(Legacy) Data Science PipelineOr, so called, Data Product

Mono machine!

CPU bounds

Memory bounds

Or resampling because small-ish data

Sampling Modelling Tuning Report Interprete

Page 6: What is a distributed data science pipeline. how with apache spark and friends

FactsData gets bigger or, precisely, the amount of available sources explodes

Data gets faster (and faster) - - only even consider: watching netflix on 4G ôÖ

Our world TodayNo, it wasn’t better before

Page 7: What is a distributed data science pipeline. how with apache spark and friends

Consequences

HARD (or will be too big...)

Ephemeral

Restricted View

Sampling

Report

Our world TodayNo, it wasn’t better before

Page 8: What is a distributed data science pipeline. how with apache spark and friends

Interpretation ⇒ Too SLOW to get real ROI out of the overall system

How to work around that?

Our world TodayNo, it wasn’t better before

Consequences

Page 9: What is a distributed data science pipeline. how with apache spark and friends

Our world TodayNo, it wasn’t better before

Alerting system over descriptive charts

More accurate results

more or harder models (e.g. Deep Learning)

More data

Constant data flow

Online interactions under control (e.g. direct feedback)

Needs are

Page 10: What is a distributed data science pipeline. how with apache spark and friends

Our world TodayNo, it wasn’t better before

Distributed Systems

So, we need...

Page 11: What is a distributed data science pipeline. how with apache spark and friends

Distributed Data ScienceSystem/Platform/SDK/Pipeline/Product/… whatever you call it

“Create” Cluster

Find available sources (context, content, quality, semantic, …)

Connect to sources (structure, schema/types, …)

Create distributed data pipeline/Model

Tune accuracy

Tune performances

Write results to Sinks

Access Layer

User Access

Page 12: What is a distributed data science pipeline. how with apache spark and friends

Distributed Data ScienceSystem/Platform/SDK/Pipeline/Product/… whatever you call it

“Create” Cluster

Find available sources (context, content, quality, semantic, …)

Connect to sources (structure, schema/types, …)

Create distributed data pipeline/Model

Tune accuracy

Tune performances

Write results to Sinks

Access Layer

User Access

Page 13: What is a distributed data science pipeline. how with apache spark and friends

Distributed Data ScienceSystem/Platform/SDK/Pipeline/Product/… whatever you call it

“Create” Cluster

Find available sources (context, content, quality, semantic, …)

Connect to sources (structure, schema/types, …)

Create distributed data pipeline/Model

Tune accuracy

Tune performances

Write results to Sinks

Access Layer

User Access

Page 14: What is a distributed data science pipeline. how with apache spark and friends

Distributed Data ScienceSystem/Platform/SDK/Pipeline/Product/… whatever you call it

“Create” Cluster

Find available sources (context, content, quality, semantic, …)

Connect to sources (structure, schema/types, …)

Create distributed data pipeline/Model

Tune accuracy

Tune performances

Write results to Sinks

Access Layer

User Access

Page 15: What is a distributed data science pipeline. how with apache spark and friends

Distributed Data ScienceSystem/Platform/SDK/Pipeline/Product/… whatever you call it

“Create” Cluster

Find available sources (context, content, quality, semantic, …)

Connect to sources (structure, schema/types, …)

Create distributed data pipeline/Model

Tune accuracy

Tune performances

Write results to Sinks

Access Layer

User Access

Page 16: What is a distributed data science pipeline. how with apache spark and friends

Distributed Data ScienceSystem/Platform/SDK/Pipeline/Product/… whatever you call it

“Create” Cluster

Find available sources (context, content, quality, semantic, …)

Connect to sources (structure, schema/types, …)

Create distributed data pipeline/Model

Tune accuracy

Tune performances

Write results to Sinks

Access Layer

User Access

YO!Aren’t we talking about “Big” Data ? Fast Data ?

So could really (all) results being neither big nor fast?

Actually, Results are becoming themselves “Big” Data ! Fast Data !

Page 17: What is a distributed data science pipeline. how with apache spark and friends

Distributed Data ScienceSystem/Platform/SDK/Pipeline/Product/… whatever you call it

“Create” Cluster

Find available sources (context, content, quality, semantic, …)

Connect to sources (structure, schema/types, …)

Create distributed data pipeline/Model

Tune accuracy

Tune performances

Write results to Sinks

Access Layer

User Access

how do we access data since 90’s? remember SOA? → SERVICES!

Nowadays, we’re talking about micro services.

Here we are, one service for one result.

Page 18: What is a distributed data science pipeline. how with apache spark and friends

Distributed Data ScienceSystem/Platform/SDK/Pipeline/Product/… whatever you call it

“Create” Cluster

Find available sources (context, content, quality, semantic, …)

Connect to sources (structure, schema/types, …)

Create distributed data pipeline/Model

Tune accuracy

Tune performances

Write results to Sinks

Access Layer

User Access

C’mon, charts/Tables Cannot only be the only views offered to customers/clients right?

We need to open the capabilities to UI (dashboard), connectors (third parties), other services (“SOA”) … … OTHER Pipelines !!!

Page 19: What is a distributed data science pipeline. how with apache spark and friends

What about Productivity?Streamlining development lifecycle most welcome

“Create” Cluster

Find available sources (context, content, quality, semantic, …)

Connect to sources (structure, schema/types, …)

Create distributed data pipeline/Model

Tune accuracy

Tune performances

Write results to Sinks

Access Layer

User Access

ops

data

ops data

sci

sci ops

sci

ops data

web ops data

web ops data sci

Page 20: What is a distributed data science pipeline. how with apache spark and friends

What about Productivity?Streamlining development lifecycle most welcome

“Create” Cluster

Find available sources (context, content, quality, semantic, …)

Connect to sources (structure, schema/types, …)

Create distributed data pipeline/Model

Tune accuracy

Tune performances

Write results to Sinks

Access Layer

User Access

ops

data

ops

sci

sci ops

sci

ops data

web ops data

web ops data

data

sci

Page 21: What is a distributed data science pipeline. how with apache spark and friends

What about Productivity?Streamlining development lifecycle most welcome

“Create” Cluster

Find available sources (context, content, quality, semantic, …)

Connect to sources (structure, schema/types, …)

Create distributed data pipeline/Model

Tune accuracy

Tune performances

Write results to Sinks

Access Layer

User Access

ops

data

ops data

sci

sci ops

sci

ops data

web ops data

web ops data sci

Page 22: What is a distributed data science pipeline. how with apache spark and friends

What about Productivity?Streamlining development lifecycle most welcome

➔ Longer production line➔ More constraints (resources sharing, time, …)➔ More people➔ More skills

Overlooking these points and you’ll be soon or sooner

So, how to have:

● results coming fast enough whilst keeping accuracy level high?● Responsivity to external/unpredictable events?

WHEN...

kicked

Page 23: What is a distributed data science pipeline. how with apache spark and friends

WarningTeam Fight: seen by members

Page 24: What is a distributed data science pipeline. how with apache spark and friends

WarningTeam Fight: seen by managers

Page 25: What is a distributed data science pipeline. how with apache spark and friends

WarningTeam Fight: seen by employers

Page 26: What is a distributed data science pipeline. how with apache spark and friends

WarningTeam Fight: seen by customers

Page 27: What is a distributed data science pipeline. how with apache spark and friends

What about Productivity?Streamlining development lifecycle most welcome

At Data Fellas, we think that we need Interactivity and Reactivity to tighten the frontiers (within team and in time).

Hence, Data Fellas

● extends the Spark Notebook (interactivity)● builds the Shar3 product arounds it (Integrated Reactivity)

Page 28: What is a distributed data science pipeline. how with apache spark and friends

Concepts of Data Fellas’ Shar3Shareable and Streamlined Data Science

Analysis

Production

DistributionRendering

Discovery

CatalogProject

Generator

Micro Service / Binary format

Schema for output

Metadata

Page 29: What is a distributed data science pipeline. how with apache spark and friends

Using Shar3yeah \o/

Let’s take this example where some buddies from

Datastax Joel Jacobson @joeljacobson

Simon Ambridge @stratman1958

Mesosphere Michael Hausenblas @mhausenblas

Typesafe Iulian Dragos @jaguarul

Data Fellas Xavier Tordoir @xtordoir

(and me)

Page 30: What is a distributed data science pipeline. how with apache spark and friends

Using Shar3yeah \o/

Let’s take this example where some buddies from

Datastax Joel Jacobson @joeljacobson

Simon Ambridge @stratman1958

Mesosphere Michael Hausenblas @mhausenblas

Typesafe Iulian Dragos @jaguarul

Data Fellas Xavier Tordoir @xtordoir

(and me)

Page 31: What is a distributed data science pipeline. how with apache spark and friends

Using Shar3yeah \o/

Page 32: What is a distributed data science pipeline. how with apache spark and friends

Using Shar3yeah \o/

Page 33: What is a distributed data science pipeline. how with apache spark and friends

Using Shar3yeah \o/

Page 34: What is a distributed data science pipeline. how with apache spark and friends

Using Shar3yeah \o/

Page 35: What is a distributed data science pipeline. how with apache spark and friends

Using Shar3yeah \o/

What do we need to do now?

● Deploy● connect the dots● track● scale

BoTh the Jobs and the services

Page 36: What is a distributed data science pipeline. how with apache spark and friends

Using Shar3yeah \o/

From notebook

to SBT project

to Docker

to Marathon

SNBSBT/JAR

Docker

marathon

Page 37: What is a distributed data science pipeline. how with apache spark and friends

Using Shar3yeah \o/

From Notebook

● to output● to Avro

SNB

Page 38: What is a distributed data science pipeline. how with apache spark and friends

Using Shar3yeah \o/

From Notebook

● to Avro● to service● to SBT● to Docker● to Marathon

SNBSBT/JAR Docker

marathon

Page 39: What is a distributed data science pipeline. how with apache spark and friends

Using Shar3yeah \o/

From Notebook

● to Avro● to Tableau● or QlikView● or D3.JS● or …

SNB

Page 40: What is a distributed data science pipeline. how with apache spark and friends

Using Shar3yeah \o/

So we have these information available:

● notebook’s markdown text● notebook’s code/model● data sources● Output/sinks● Output/services● Avro schema

Shouldn’t them all be

reused???

Page 41: What is a distributed data science pipeline. how with apache spark and friends

Using Shar3yeah \o/

Variant Analysis

Page 42: What is a distributed data science pipeline. how with apache spark and friends

There is a service!

Using Shar3yeah \o/

Page 43: What is a distributed data science pipeline. how with apache spark and friends

Let’s use it…

Using Shar3yeah \o/

Page 44: What is a distributed data science pipeline. how with apache spark and friends

What was the process?

Using Shar3yeah \o/

Page 45: What is a distributed data science pipeline. how with apache spark and friends

Fine and the output is in C*

Using Shar3yeah \o/

Page 46: What is a distributed data science pipeline. how with apache spark and friends

Let’s check what’s in-there

Using Shar3yeah \o/

Page 47: What is a distributed data science pipeline. how with apache spark and friends

not what I need, let’s ADAPT

Using Shar3yeah \o/

Page 48: What is a distributed data science pipeline. how with apache spark and friends

Poke us on

@DataFellas

@Shar3_Fellas

@SparkNotebook

@Xtordoir & @Noootsab

Now @TypeSafe: http://t.co/o1Bt6dQtgH

If you wanna learn more about the different tools… Join us @ O’Reilly

Follow up Soon on http://NoETL.org (HI5 to @ChiefScientist for that)

That’s all folksThanks for listening/staying