
Imagine How Predictive Applications Will Be Put in Production 5 Years from Now

Our Goal Today

How are we doing today? What is difficult?

What should be simpler?

What is a predictive application?

Churn Prevention

Fraud Detection

Demand Forecast



Match Making

Ad Bidding

Drug Studies



This discussion is not relevant to all



Drug Studies Multi-Years


Multi-Years Weekly



Bidding Two Weeks Sub-Second

Data Span / Retrain every … / Score






Production = Dev

Online Learning

Not just a “model”

Data Collection -> Data Prep -> Domain Specific Feature Eng. -> Feature Eng. -> Model(s) -> Scoring

Let’s call this a Predictive Service Specification

How much effort?

Data Collection -> Data Prep -> Domain Specific Feature Eng. -> Feature Eng. -> Model(s) -> Scoring

20% 30% 25% 5% 5% 15%

Who Does What?

Data Prep -> Domain Specific Feature Eng. -> Feature Eng. -> Model(s) -> Scoring

Data Domain Engineers

Data Analysts, Data Scientists, Business Intelligence Engineers

Huge Variety of Tech

Data Collection -> Data Prep -> Domain Specific Feature Eng. -> Feature Eng. -> Model(s) -> Scoring

ETL? Ad-Hoc?

ETL? Ad-Hoc?

ETL? SQL? R? Python? Matlab?

R? Python?

R? Python? SAS?

Java / Python

Business Rules Management System

Data Prep -> Domain Specific Feature Eng. -> Feature Eng. -> Model(s) -> Scoring


From Build to Run

Data Prep -> Domain Specific Feature Eng. -> Feature Eng. -> Model(s) -> Scoring

Input Data -> ? -> Decision

Build Time

Run Time

How Do People Do That Today?

Data Collection -> Data Prep -> Domain Specific Feature Eng. -> Feature Eng. -> Model(s) -> Scoring

PMML, ETL, WebService, Script/SQL

A Predictive Service = up to 4 different “applications” that can run out of sync

Some Integrated Per-Platform Approaches

in Database

in SAS

in Hadoop/Spark

SQL Commercial Warehouse + Scoring UDF

End-to-end integration script

Ad-hoc development

Top Companies Have Invested a Lot

Each has probably put more than $5M into its ML production platform

Reason 1: Prohibitive Costs Kill Projects

Data Prep -> Domain Specific Feature Eng. -> Feature Eng. -> Model(s) -> Scoring

R, SQL, Python, R

$300K $50K $200K $100K



Reason 2: Distribution Drift

New behaviour

New product

New competitor

Model stops working as planned

You need to be able to do a same-week update
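One hedged way to notice that a model "stops working as planned" is to compare the distribution of a score or feature at training time with the one observed recently. The sketch below uses the Population Stability Index, a technique picked here for illustration and not named in the slides. The function name `psi`, the bin count, and the `DRIFT_THRESHOLD` of 0.25 (a commonly quoted rule of thumb) are all assumptions.

```python
import math
from typing import List, Sequence

# Assumed alert level; 0.25 is a commonly quoted rule of thumb, not a value from the slides.
DRIFT_THRESHOLD = 0.25


def psi(expected: Sequence[float], actual: Sequence[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index between a training sample and a recent sample."""
    lo, hi = min(expected), max(expected)
    # Equal-width bins over the training range; fall back to 1.0 if all values are equal.
    width = (hi - lo) / bins or 1.0

    def shares(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp values outside the training range
            counts[idx] += 1
        # Floor each share at eps so the logarithm below is always defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Illustrative data only: scores seen at training time vs. shifted recent scores.
train_scores = [i / 100 for i in range(100)]
recent_scores = [s + 0.5 for s in train_scores]
needs_retrain = psi(train_scores, recent_scores) > DRIFT_THRESHOLD
```

A check like this could run on every scoring batch, so that a retrain is triggered within the week and not at the next scheduled release.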

Reason 3: Mitigating Data Hazards

You need to be able to do a same-week update

Most interesting “Big Data” Sources are fragile

Reason 4: Deciding Goes Beyond Predicting

The most interesting problems require combining models + heuristics + non-local optimization

Reason 5: “Suits ready” for scalability

Data Prep -> Domain Specific Feature Eng. -> Feature Eng. -> Model(s) -> Scoring

Your CTO could certainly keep it up and running all by himself

Imagine the Dream Platform That Would Solve All This


Let’s call it Blue Box

New Data -> [Blue Box] -> Decision

Feature: Cleanse, Enrich and Merge

Blue Box must be the perfect Data Blending runtime

Feature: Aggregating Data

Raw Events Stream (1TB-100TB+) -> Aggregate State (100MB-10GB)

Consolidating history must be part of Blue Box
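As a minimal sketch of what consolidating a raw event stream into a much smaller aggregate state could look like, the example below folds events into one record per entity. The `EntityState` fields, the event tuple layout `(entity_id, timestamp, amount)`, and the function name `update_state` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import DefaultDict, Iterable, Tuple

Event = Tuple[str, int, float]  # (entity_id, timestamp, amount): assumed layout


@dataclass
class EntityState:
    """Compact per-entity summary kept in place of the full event history."""
    event_count: int = 0
    total_amount: float = 0.0
    last_seen: int = 0  # timestamp of the most recent event


def update_state(state: DefaultDict[str, EntityState],
                 events: Iterable[Event]) -> DefaultDict[str, EntityState]:
    """Fold a batch of raw events into the running aggregate state."""
    for entity_id, timestamp, amount in events:
        s = state[entity_id]
        s.event_count += 1
        s.total_amount += amount
        s.last_seen = max(s.last_seen, timestamp)
    return state


# Illustrative events only.
state: DefaultDict[str, EntityState] = defaultdict(EntityState)
update_state(state, [("u1", 100, 9.5), ("u2", 105, 3.0), ("u1", 110, 0.5)])
```

Because the update is incremental, the same function can be applied to each new batch of events without re-reading the full history.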

Feature: External Data Compliant

main data

enriched main data

additional data

e.g. census, maps, etc.

Third-Party Data Must Be “In” the Blue Box
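The enrichment step pictured here (main data plus additional data giving enriched main data) can be sketched as a left join on a shared key. This is only an illustration: the function name `enrich`, the `zip` key, the `median_income` field, and the sample values are all made up for the example.

```python
from typing import Any, Dict, Iterable, List

Row = Dict[str, Any]


def enrich(main_rows: Iterable[Row], reference: Dict[str, Row], key: str) -> List[Row]:
    """Left-join main rows with external reference data looked up by `key`.

    Rows without a match are kept unchanged, so no main data is lost.
    """
    enriched = []
    for row in main_rows:
        extra = reference.get(row[key], {})
        enriched.append({**row, **extra})
    return enriched


# Illustrative, made-up data: customers enriched with census figures by zip code.
customers = [{"id": 1, "zip": "75001"}, {"id": 2, "zip": "99999"}]
census_by_zip = {"75001": {"median_income": 32000}}
enriched = enrich(customers, census_by_zip, key="zip")
```

Keeping the reference table inside the service means the same lookup is used at build time and at run time.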

Feature: Update Data Service

Smart Lazy Human

A/B Test Support in Blue Box

Decision Ver. A

Decision Ver. B

P D F M S

New
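One possible way to route traffic between Decision Ver. A and Decision Ver. B is a deterministic hash of the entity identifier, so the same entity always sees the same version. This is a sketch under assumptions: the function name `assign_variant`, the `experiment` label, and the `share_b` parameter are hypothetical.

```python
import hashlib


def assign_variant(entity_id: str, experiment: str, share_b: float = 0.5) -> str:
    """Return "A" or "B" for an entity, stable across calls and processes."""
    digest = hashlib.sha256(f"{experiment}:{entity_id}".encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "B" if bucket < share_b else "A"


# Illustrative usage: send roughly 10% of entities to version B.
variant = assign_variant("customer-42", "new-model-rollout", share_b=0.1)
```

Including the experiment name in the hash keeps assignments independent from one experiment to the next.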


Feature: Programmatic Decision

Need for Business Compliant “Real-Time” Rules in Blue Box

model 1

model 2

model 3

combine with

if proba > 0.63 decision A else decision B

if proba > 0.79 decision A else decision B
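The rule on this slide, several models combined and then a threshold on the probability, could be sketched as follows. The slides do not say how the models are combined, so averaging is an assumption, as are the function name `decide` and the stub models used in the usage lines.

```python
from statistics import mean
from typing import Callable, Dict, Sequence

Features = Dict[str, float]
Model = Callable[[Features], float]  # returns a probability in [0, 1]


def decide(features: Features, models: Sequence[Model],
           threshold: float = 0.63,
           combine: Callable[[Sequence[float]], float] = mean) -> str:
    """Combine model probabilities, then apply the business threshold."""
    proba = combine([m(features) for m in models])
    return "decision A" if proba > threshold else "decision B"


# Stub models standing in for model 1, model 2 and model 3 (illustrative only).
stub_models = [lambda f: 0.9, lambda f: 0.7, lambda f: 0.5]
loose = decide({"x": 1.0}, stub_models, threshold=0.63)
strict = decide({"x": 1.0}, stub_models, threshold=0.79)
```

Keeping the threshold as a parameter lets a business user change 0.63 to 0.79 without retraining any model.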

Feature: Audit and Logs

Smart Lazy Human

Blue Box needs to keep track of its decisions and why

Decision Cause Log
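A decision-and-cause log could be as simple as one structured record per decision. The sketch below is an assumption about what such a record might hold: the function name `record_decision` and the field names `model_version`, `cause`, and so on are hypothetical and not taken from the slides.

```python
import json
from typing import Any, Dict, List


def record_decision(log: List[str], entity_id: str, decision: str,
                    cause: Dict[str, Any], model_version: str,
                    timestamp: float) -> None:
    """Append one JSON line describing a decision and why it was taken."""
    entry = {
        "timestamp": timestamp,
        "entity_id": entity_id,
        "decision": decision,
        "model_version": model_version,
        "cause": cause,  # e.g. the probability and the threshold that was applied
    }
    log.append(json.dumps(entry, sort_keys=True))


# Illustrative usage with made-up values.
audit_log: List[str] = []
record_decision(
    audit_log,
    entity_id="customer-42",
    decision="decision A",
    cause={"proba": 0.7, "threshold": 0.63},
    model_version="v3",
    timestamp=1700000000.0,
)
```

Storing the model version with each entry is what makes a later audit or rollback traceable to a specific model.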

What does Blue Box look like?

External Data

Advanced Join / Matching

Ad-Hoc Transformation

Python / R / Spark DataFrame transformations

SQL-Like Transformations

Scoring Causes / Audit

A/B Test Support

Model Rollback / Versioning

Prediction Log. Stats / Audit

Ad-hoc scoring/decision code/scoring

Open Source


Interesting / Potential Open Source Projects

Real-Time Entity Update, Management, Scoring

Open Source PMML Scoring in Java

Oryx: Lambda Architecture built on Spark and Kafka, with specialisation on real-time machine learning

How will we create the “blue box”?


Specification? PMML Extension?

Open Source Framework?

Hadoop / Spark Specific?

Thank you!

is blue

Convince decision makers to make data their competitive advantage

[email protected]

Wanna work on this topic?

Wanna share your dream features?
