turning data into value
TRANSCRIPT
Ph.D. in Computer Science at ENS Paris/INRIA
Postdoctoral Fellow at Carnegie Mellon University
>500 citations; Best Paper Award at the 2009 CVPR conference
NEC Labs (Bell Labs) in Cupertino (Silicon Valley)
Senior Researcher at Intel (3 pending patents)
- Developed ML algorithms for face recognition
Invited speaker to CMU, Samsung, Tokyo Univ, SNU, etc.
Co-Founder of Solidware
Olivier Duchenne
Co-founder | Chief Machine Learning Scientist
8 years of experience in Machine Learning, Computer Vision, and Big Data
Guidelines for using Machine Learning on real data
Avoid Common Mistakes
Understand the Data Better
1. Big Enough Data?
2. Changing Data
Machine Learning and Data Science
From Computer Vision Experience
To Solving Companies' Issues:
Ex: car accident prediction (insurance),
default prediction (bank),
stock value prediction
Machine Learning and Data Science
[Diagram: Machine-Learning based Predictive Modeling. ML algorithms analyze historical data (the training data set) to detect patterns: internal data (ex: age, gender), external data (ex: web crawl), and the known target value. The learned prediction function is then applied to newly incoming internal and external data, whose target value is unknown, to output a predicted target value.]
1. Prediction Function. Ex: a linear function, a neural net, …
2. The prediction function is parametrized. Ex: $f_{\alpha}(X) = \sum_i \alpha_i X_i$
3. The goal is to find the best prediction function, i.e. the best parameters.
4. We build an objective function that represents how good a prediction function is.
5. The objective function always has a data term. Ex: $\mathrm{obj}(\alpha) = \sum_s \left( f_{\alpha}(X_s) - Y_s \right)^2$
6. The algorithm tries to find the best parameters, i.e. those that optimize this objective function. Ex: closed-form solution, stochastic gradient descent, …
Basic Explanation of Machine Learning
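To make these six steps concrete, here is a minimal sketch (not from the talk; Python/NumPy with synthetic data, all parameter choices illustrative) of a linear prediction function trained by stochastic gradient descent on a squared-error objective:

```python
# Minimal sketch: a parametrized linear prediction function f_alpha(X) = sum_i alpha_i * X_i,
# a squared-error objective, and stochastic gradient descent on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 200, 3
X = rng.normal(size=(n_samples, n_features))
true_alpha = np.array([2.0, -1.0, 0.5])
Y = X @ true_alpha + rng.normal(scale=0.1, size=n_samples)  # historical target values

alpha = np.zeros(n_features)   # parameters to learn
lr = 0.01                      # learning rate

for epoch in range(100):
    for s in rng.permutation(n_samples):     # stochastic: one sample at a time
        error = X[s] @ alpha - Y[s]          # f_alpha(X_s) - Y_s
        alpha -= lr * 2 * error * X[s]       # gradient of (f_alpha(X_s) - Y_s)^2

print("learned parameters:", alpha)          # close to [2.0, -1.0, 0.5]
```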
History of Machine Learning for Computer Vision
Model-Driven → Mixed → Data-Driven
1970s: Hand-designed Model
1980s: Alignment Method
1990s: Grid Model
2000s: Deformable Model
2010s: Conv. Network
Why didn’t people use ML since the beginning?
Commonly assumed reasons:
1. "Better Computers" are available now
2. "Better Algorithms"
3. "Amount of Data": "We create so much data that 90% of the data in the world today has been created in the last two years alone."
- Petter Bae Brandtzæg, SINTEF ICT
How much data did CV researchers use?
2004: Caltech 101, 10K images (image source: http://www.vision.caltech.edu/)
2005-2010: Pascal VOC, 2K → 30K objects (image source: http://doi.ieeecomputersociety.org/)
2010-2015: ImageNet, 10M → 15M images (http://www.image-net.org/)
The answer is… “Amount of Data”
Image source - Smartdatacollective.com
• The most advanced machine learning cannot be applied if there is not enough data
• A critical mass of data is necessary to use, for example, deep learning
• When the amount of data increases, the machine learning model, and therefore the prediction function, can become more complex and more accurate
With enough data, is ANY algorithm okay?
Support vector machines, Bayesian networks, Regression forest, Sparse dictionary learning, Artificial neural networks, K-Nearest neighbors, Deep learning, Boosting
[Diagram: three example models of different complexity, A: Deep Learning, B: Neural Networks, C: Log. Regression]
No, it depends on the company and the problem you are trying to solve
Synonym: over-generalizing.
That is like visiting a new place for one day, seeing a mountain fire, and believing that there are fires there every day.
Why do we need lots of data?
Overfitting
In real life, we do not have many chances of having
clean & BIG data
[Bar chart: Prob. to default by city (Seoul, Busan, Daejeon, Gwangju, and many more cities), y-axis 0 to 0.14]
An example: overfitting due to lack of data.
As there are many categories, some categories with little data show outlier results.
[The same bar chart of Prob. to default by city (Seoul, Busan, Daejeon, Gwangju, and many more cities), now drawn with error bars]
So, always use error bars
You want to detect an event which occurs, on average, with probability p = 5%.
Let's say you have many cities with ~50 samples each.
On average, 1 in 13 cities will show this event 0 times, since $0.95^{50} \approx 7.7\% \approx 1/13$.
Without proper handling, these extreme cases will all be wrong.
This kind of error can happen often.
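A quick way to check the 1-in-13 figure is to simulate it; a minimal sketch (NumPy assumed, all numbers illustrative):

```python
# Simulate many cities with ~50 samples each and a 5% event rate, and count
# how often a city shows the event 0 times. Expect about 1 in 13, since
# 0.95**50 ≈ 0.077.
import numpy as np

rng = np.random.default_rng(0)
p, n_samples, n_cities = 0.05, 50, 100_000
events_per_city = rng.binomial(n_samples, p, size=n_cities)
zero_frac = np.mean(events_per_city == 0)
print(f"analytic: {0.95**50:.3f}, simulated: {zero_frac:.3f}")  # both ≈ 0.077
```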
How to fight against overfitting
Data: More Samples, Fewer Variables, Artificial Data Extension
Algorithm: Simpler Objective Function, Regularization, Bagging
Modeling: Feature Engineering, Data Normalization
Data: In Computer Vision, it is possible to extend the data.
Ex: hiring annotators, Amazon Mechanical Turk, Google reCAPTCHA.
Companies often have a limited number of samples and cannot extend it.
Ex: a Korean bank that gives ~100K loans per year.
1. Count only positives (detecting rare events requires more data).
Ex: image detection, where it is easy to find an infinite number of negatives.
Companies often want to detect rare events (few positives).
Ex: predicting car accidents / ad clicks / defaults / online purchases.
How to count your data?
2. Difficulty of the task
How to count your data?
• Learning addition ($y = 1 \cdot X_1 + 1 \cdot X_2$) requires ~100 samples.
• Learning object recognition requires ~10M samples.
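As a sanity check of the first bullet, a tiny sketch (not from the talk; NumPy with synthetic data) showing that ~100 samples really do suffice to learn addition:

```python
# Learning addition (y = 1*X1 + 1*X2) from ~100 samples by least squares.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-10, 10, size=(100, 2))
y = X[:, 0] + X[:, 1]                            # the "addition" target

alpha, *_ = np.linalg.lstsq(X, y, rcond=None)    # least-squares fit
print(alpha)                                     # ≈ [1.0, 1.0]
```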
3. Probabilistic event detection is harder.
"What is in this image?" has one definite answer, but "Will this user click on a car advertisement?" does not:
Client #1: male, 27 y.o., lives in Seoul, salaryman in the construction sector, previously clicked on a car advertisement → Yes
Client #2: identical profile → No
How to count your data?
Algorithm
1. Many algorithms exist: GLM, Boosting, Lasso, Regression Forest, SVM,
Gaussian Process, Bayesian Networks, Deep Learning, …
2. The complexity of their prediction functions differs.
3. The more complex the prediction function is, the more it fits the data.
[Three plots of purchase probability vs. age: underfitting (left), good fit (center), overfitting (right)]
Algorithm
1. Fewer parameters → less overfitting
2. More parameters → less underfitting
3. Ex: best of both worlds: Deep Conv Nets
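An illustration of points 1-2 (not from the talk; synthetic data, degrees chosen arbitrarily): fitting the same noisy curve with polynomials of increasing complexity. Training error keeps falling as parameters are added, which is exactly where overfitting hides:

```python
# Fit the same data with prediction functions of increasing complexity.
# A degree-1 polynomial underfits the curved trend; degree-12 chases the noise.
import numpy as np

rng = np.random.default_rng(0)
age = np.sort(rng.uniform(20, 70, size=30))
prob = 1 / (1 + np.exp(-(age - 45) / 5)) + rng.normal(scale=0.05, size=30)

x = (age - 45) / 25                        # rescale for numerical stability
for degree in (1, 3, 12):
    coeffs = np.polyfit(x, prob, degree)
    train_err = np.mean((np.polyval(coeffs, x) - prob) ** 2)
    print(f"degree {degree:2d}: training error {train_err:.5f}")
# Training error always shrinks with more parameters; held-out error would not.
```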
Algorithm
Avoiding “Too Many Categories” problem
Busan
Seoul
Dae-
jeon
Dae
-gou
Po-
hang
In-
cheon
Soo-
won
Ul-
San
Avoiding "Too Many Categories" problem
[The same diagram, with the city categories grouped and merged into fewer buckets]
Avoiding "Too Many Categories" problem
[Chart: Prob. to default plotted against log10(population) from 1 to 6, replacing the per-city categories with a single numeric variable]
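Two common remedies, sketched in pandas (not from the talk; the column names 'city' and 'population' and all values are hypothetical): merge rare categories into an "other" bucket, or replace the category with a numeric proxy such as log10(population):

```python
# Two remedies for the "too many categories" problem on a toy table.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "city": ["Seoul", "Busan", "Seoul", "Pohang", "Suwon", "Seoul", "Busan"],
    "population": [9_700_000, 3_400_000, 9_700_000, 500_000, 1_200_000,
                   9_700_000, 3_400_000],
})

# Remedy 1 (grouping/merging): bucket rare cities into "other".
counts = df["city"].value_counts()
rare = counts[counts < 2].index
df["city_grouped"] = df["city"].where(~df["city"].isin(rare), "other")

# Remedy 2: replace the category by a numeric proxy, log10(population).
df["log10_pop"] = np.log10(df["population"])
print(df)
```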
Data Normalization
Removing variance that has no impact on the target value helps the ML system focus on meaningful variance.
Ex: DeepFace (Facebook, 2014), which aligns faces to a frontal pose before recognition; DB size: 120M images.
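The DeepFace example normalizes images by aligning faces; for tabular data, one simple analogue (not from the talk) is z-score standardization. A minimal sketch, assuming NumPy and made-up feature values:

```python
# Z-score normalization: each column gets mean 0 and standard deviation 1,
# removing scale/offset variance that says nothing about the target.
import numpy as np

# Hypothetical features on very different scales, e.g. age (years) and income.
X = np.array([[27.0, 70_000.0],
              [45.0, 55_000.0],
              [33.0, 90_000.0]])

X_norm = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_norm)
```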
Bagging
1. Randomly modify the training set slightly (e.g., by bootstrap resampling).
2. Do the training.
3. Repeat.
4. Average all prediction functions.
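A minimal sketch of these four steps (not from the talk; scikit-learn decision trees as base learners, synthetic data, all sizes illustrative):

```python
# Bagging: bootstrap-resample the training set, train, repeat, average.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

models = []
for _ in range(25):                                  # 3. repeat
    idx = rng.integers(0, len(X), size=len(X))       # 1. randomly modify the training set
    model = DecisionTreeRegressor(max_depth=5)
    model.fit(X[idx], y[idx])                        # 2. do the training
    models.append(model)

X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
pred = np.mean([m.predict(X_test) for m in models], axis=0)  # 4. average all predictions
print(pred)
```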
Changing Data
Data change through time:
• Market changes
• Law/regulation changes
• Collected data changes
• Client filtering / marketing changes
Representation of data change:
• Variable names change
• Category names change
• Cyclic data changes: seasonality
• Trends have to be handled separately
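A minimal sketch of handling the trend separately (not from the talk; NumPy with synthetic monthly data): fit and remove the trend first, then estimate the seasonal cycle from the residual:

```python
# Separate trend from seasonality in monthly data.
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(48)                                   # 4 years of monthly data
y = 0.05 * t + np.sin(2 * np.pi * t / 12) + rng.normal(scale=0.1, size=48)

trend = np.polyfit(t, y, 1)                         # handle the trend separately
residual = y - np.polyval(trend, t)
seasonal = residual.reshape(4, 12).mean(axis=0)     # average each month across years
print("slope:", round(trend[0], 3), "seasonal cycle:", seasonal.round(2))
```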
Interpolation – Extrapolation
Why is time so different from other variables ?
[Two plots: Prob. to buy a smartphone vs. age, where a query point falls between observed points (interpolation), and Prob. to buy a smartphone vs. time, where the query point for the future falls outside all observed points (extrapolation, marked "?")]
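A sketch of why extrapolation is riskier (not from the talk; NumPy, synthetic data): a model fit on ages 20-60 answers sensibly at 40 (interpolation) but can return nonsense at 80 (extrapolation), and every forecast along the time axis is an extrapolation of this kind:

```python
# Interpolation vs. extrapolation with a polynomial fit.
import numpy as np

rng = np.random.default_rng(0)
age = rng.uniform(20, 60, size=200)
prob = 0.9 * np.exp(-((age - 40) / 15) ** 2) + rng.normal(scale=0.02, size=200)

x = (age - 40) / 20                     # rescale for numerical stability
coeffs = np.polyfit(x, prob, 4)

for query in (40, 80):
    print(query, np.polyval(coeffs, (query - 40) / 20))
# At 40 (inside the training range) the prediction is sensible;
# at 80 (outside) the polynomial can return arbitrary values.
```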
Time is correlated with hidden variables
[Plot: cost for car insurance (one type of insurance) vs. time, showing a sharp jump when a new law takes effect]
Changing Data Representation
• Collected Data changes
• Category splitting, merging
• Variable names change
• Category names change