turning data into value
TRANSCRIPT
Ph.D. in Computer Science at ENS Paris/INRIA
Postdoctoral Fellow at Carnegie Mellon University
>500 citations; Best Paper Award at the 2009 CVPR conference
NEC Labs (Bell Labs) in Cupertino (Silicon Valley)
Senior Researcher at Intel (3 pending patents)
- Developed ML algorithms for face recognition
Invited speaker to CMU, Samsung, Tokyo Univ, SNU, etc.
Co-Founder of Solidware
Olivier Duchenne
Co-founder | Chief Machine Learning Scientist
8 years of experience in Machine Learning, Computer Vision, and Big Data
Guidelines for using Machine Learning on real data
Avoid Common Mistakes
Understand the Data Better
1. Big Enough Data?
2. Changing Data
Machine Learning and Data Science
From Computer Vision Experience
To Solving Companies' Issues:
Ex: car accident prediction (insurance),
default prediction (bank),
stock value prediction
Machine Learning and Data Science
[Diagram: Machine-Learning based Predictive Modeling. ML algorithms analyze historical data (the training data set) to detect patterns: internal data (ex: age, gender), external data (ex: web crawl), and the known target value. The learned prediction function is then applied to newly incoming internal and external data, whose target value is unknown, to output a predicted target value.]
1. Prediction Function. Ex: a linear function, a neural net, …
2. The prediction function is parametrized. Ex: $f_{\alpha}(X) = \sum_i \alpha_i X_i$
3. The goal is to find the best prediction function, i.e. the best parameters.
4. We build an objective function that represents how good a prediction function is.
5. The objective function always has a data term. Ex: $\mathrm{obj}(\alpha) = \sum_s \left( f_{\alpha}(X_s) - Y_s \right)^2$
6. The algorithm tries to find the best parameters, i.e. those that optimize this objective function. Ex: closed-form solution, stochastic gradient descent, …
Basic Explanation of Machine Learning
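To make these six steps concrete, here is a minimal sketch (not from the talk; Python/NumPy with synthetic data, all parameter choices illustrative) of a linear prediction function trained by stochastic gradient descent on a squared-error objective:

```python
# Minimal sketch: a parametrized linear prediction function f_alpha(X) = sum_i alpha_i * X_i,
# a squared-error objective, and stochastic gradient descent on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 200, 3
X = rng.normal(size=(n_samples, n_features))
true_alpha = np.array([2.0, -1.0, 0.5])
Y = X @ true_alpha + rng.normal(scale=0.1, size=n_samples)  # historical target values

alpha = np.zeros(n_features)   # parameters to learn
lr = 0.01                      # learning rate

for epoch in range(100):
    for s in rng.permutation(n_samples):     # stochastic: one sample at a time
        error = X[s] @ alpha - Y[s]          # f_alpha(X_s) - Y_s
        alpha -= lr * 2 * error * X[s]       # gradient of (f_alpha(X_s) - Y_s)^2

print("learned parameters:", alpha)          # close to [2.0, -1.0, 0.5]
```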
History of Machine Learning for Computer Vision
Model-Driven → Mixed → Data-Driven
1970s: Hand-designed Model
1980s: Alignment Method
1990s: Grid Model
2000s: Deformable Model
2010s: Conv. Network
Why didn’t people use ML since the beginning?
Commonly assumed reasons:
1. "Better Computers" are available now
2. "Better Algorithms"
3. "Amount of Data": "We create so much data that 90% of the data in the world today has been created in the last two years alone."
- Petter Bae Brandtzæg, SINTEF ICT
How much data did CV researchers use?
2004: Caltech 101, 10K images (image source: http://www.vision.caltech.edu/)
2005-2010: Pascal VOC, 2K → 30K objects (image source: http://doi.ieeecomputersociety.org/)
2010-2015: ImageNet, 10M → 15M images (http://www.image-net.org/)
The answer is… “Amount of Data”
Image source - Smartdatacollective.com
• The most advanced machine learning cannot be applied if there is not enough data
• A critical mass of data is necessary to use, for example, deep learning
• When the amount of data increases, the machine learning model, and therefore the prediction function, can become more complex and more accurate
With enough data, is ANY algorithm okay?
Support vector machines, Bayesian networks, Regression forest, Sparse dictionary learning, Artificial neural networks, K-Nearest neighbors, Deep learning, Boosting
[Diagram: three example models of different complexity, A: Deep Learning, B: Neural Networks, C: Log. Regression]
No, it depends on the company and the problem you are trying to solve
Synonym: over-generalizing.
That is like visiting a new place for one day, seeing a mountain fire, and believing that there are fires there every day.
Why do we need lots of data?
Overfitting
In real life, we do not have many chances of having
clean & BIG data
[Bar chart: Prob. to default by city (Seoul, Busan, Daejeon, Gwangju, and many more cities), y-axis 0 to 0.14]
An example: overfitting due to lack of data.
As there are many categories, some categories with little data show outlier results.
[The same bar chart of Prob. to default by city (Seoul, Busan, Daejeon, Gwangju, and many more cities), now drawn with error bars]
So, always use error bars
You want to detect an event which occurs, on average, with probability p = 5%.
Let's say you have many cities with ~50 samples each.
On average, 1 in 13 cities will show this event 0 times, since $0.95^{50} \approx 7.7\% \approx 1/13$.
Without proper handling, these extreme cases will all be wrong.
This kind of error can happen often.
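A quick way to check the 1-in-13 figure is to simulate it; a minimal sketch (NumPy assumed, all numbers illustrative):

```python
# Simulate many cities with ~50 samples each and a 5% event rate, and count
# how often a city shows the event 0 times. Expect about 1 in 13, since
# 0.95**50 ≈ 0.077.
import numpy as np

rng = np.random.default_rng(0)
p, n_samples, n_cities = 0.05, 50, 100_000
events_per_city = rng.binomial(n_samples, p, size=n_cities)
zero_frac = np.mean(events_per_city == 0)
print(f"analytic: {0.95**50:.3f}, simulated: {zero_frac:.3f}")  # both ≈ 0.077
```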
How to fight against overfitting
Data: More Samples, Fewer Variables, Artificial Data Extension
Algorithm: Simpler Objective Function, Regularization, Bagging
Modeling: Feature Engineering, Data Normalization
Data: In Computer Vision, it is possible to extend the data.
Ex: hiring annotators, Amazon Mechanical Turk, Google reCAPTCHA.
Companies often have a limited number of samples and cannot extend it.
Ex: a Korean bank that gives ~100K loans per year.
1. Count only positives (detecting rare events requires more data).
Ex: image detection, where it is easy to find an infinite number of negatives.
Companies often want to detect rare events (few positives).
Ex: predicting car accidents / ad clicks / defaults / online purchases.
How to count your data?
2. Difficulty of the task
How to count your data?
• Learning addition ($y = 1 \cdot X_1 + 1 \cdot X_2$) requires ~100 samples.
• Learning object recognition requires ~10M samples.
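As a sanity check of the first bullet, a tiny sketch (not from the talk; NumPy with synthetic data) showing that ~100 samples really do suffice to learn addition:

```python
# Learning addition (y = 1*X1 + 1*X2) from ~100 samples by least squares.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-10, 10, size=(100, 2))
y = X[:, 0] + X[:, 1]                            # the "addition" target

alpha, *_ = np.linalg.lstsq(X, y, rcond=None)    # least-squares fit
print(alpha)                                     # ≈ [1.0, 1.0]
```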
3. Probabilistic event detection is harder.
"What is in this image?" has one definite answer, but "Will this user click on a car advertisement?" does not:
Client #1: male, 27 y.o., lives in Seoul, salaryman in the construction sector, previously clicked on a car advertisement → Yes
Client #2: identical profile → No
How to count your data?
Algorithm
1. Many algorithms exist: GLM, Boosting, Lasso, Regression Forest, SVM,
Gaussian Process, Bayesian Networks, Deep Learning, …
2. The complexity of their prediction functions differs.
3. The more complex the prediction function is, the more it fits the data.
[Three plots of purchase probability vs. age: underfitting (left), good fit (center), overfitting (right)]
Algorithm
1. Fewer parameters → less overfitting
2. More parameters → less underfitting
3. Ex: best of both worlds: Deep Conv Nets
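An illustration of points 1-2 (not from the talk; synthetic data, degrees chosen arbitrarily): fitting the same noisy curve with polynomials of increasing complexity. Training error keeps falling as parameters are added, which is exactly where overfitting hides:

```python
# Fit the same data with prediction functions of increasing complexity.
# A degree-1 polynomial underfits the curved trend; degree-12 chases the noise.
import numpy as np

rng = np.random.default_rng(0)
age = np.sort(rng.uniform(20, 70, size=30))
prob = 1 / (1 + np.exp(-(age - 45) / 5)) + rng.normal(scale=0.05, size=30)

x = (age - 45) / 25                        # rescale for numerical stability
for degree in (1, 3, 12):
    coeffs = np.polyfit(x, prob, degree)
    train_err = np.mean((np.polyval(coeffs, x) - prob) ** 2)
    print(f"degree {degree:2d}: training error {train_err:.5f}")
# Training error always shrinks with more parameters; held-out error would not.
```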
Algorithm
Avoiding “Too Many Categories” problem
Busan
Seoul
Dae-
jeon
Dae
-gou
Po-
hang
In-
cheon
Soo-
won
Ul-
San
Avoiding "Too Many Categories" problem
[The same diagram, with the city categories grouped and merged into fewer buckets]
Avoiding "Too Many Categories" problem
[Chart: Prob. to default plotted against log10(population) from 1 to 6, replacing the per-city categories with a single numeric variable]
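Two common remedies, sketched in pandas (not from the talk; the column names 'city' and 'population' and all values are hypothetical): merge rare categories into an "other" bucket, or replace the category with a numeric proxy such as log10(population):

```python
# Two remedies for the "too many categories" problem on a toy table.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "city": ["Seoul", "Busan", "Seoul", "Pohang", "Suwon", "Seoul", "Busan"],
    "population": [9_700_000, 3_400_000, 9_700_000, 500_000, 1_200_000,
                   9_700_000, 3_400_000],
})

# Remedy 1 (grouping/merging): bucket rare cities into "other".
counts = df["city"].value_counts()
rare = counts[counts < 2].index
df["city_grouped"] = df["city"].where(~df["city"].isin(rare), "other")

# Remedy 2: replace the category by a numeric proxy, log10(population).
df["log10_pop"] = np.log10(df["population"])
print(df)
```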
Data Normalization
Removing variance that has no impact on the target value helps the ML system focus on meaningful variance.
Ex: DeepFace (Facebook, 2014), which aligns faces to a frontal pose before recognition; DB size: 120M images.
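The DeepFace example normalizes images by aligning faces; for tabular data, one simple analogue (not from the talk) is z-score standardization. A minimal sketch, assuming NumPy and made-up feature values:

```python
# Z-score normalization: each column gets mean 0 and standard deviation 1,
# removing scale/offset variance that says nothing about the target.
import numpy as np

# Hypothetical features on very different scales, e.g. age (years) and income.
X = np.array([[27.0, 70_000.0],
              [45.0, 55_000.0],
              [33.0, 90_000.0]])

X_norm = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_norm)
```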
Bagging
1. Randomly modify the training set slightly (e.g., by bootstrap resampling).
2. Do the training.
3. Repeat.
4. Average all prediction functions.
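A minimal sketch of these four steps (not from the talk; scikit-learn decision trees as base learners, synthetic data, all sizes illustrative):

```python
# Bagging: bootstrap-resample the training set, train, repeat, average.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

models = []
for _ in range(25):                                  # 3. repeat
    idx = rng.integers(0, len(X), size=len(X))       # 1. randomly modify the training set
    model = DecisionTreeRegressor(max_depth=5)
    model.fit(X[idx], y[idx])                        # 2. do the training
    models.append(model)

X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
pred = np.mean([m.predict(X_test) for m in models], axis=0)  # 4. average all predictions
print(pred)
```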
Changing Data
Data change through time:
• Market changes
• Law/regulation changes
• Collected data changes
• Client filtering / marketing changes
Representation of data change:
• Variable names change
• Category names change
• Cyclic data changes: seasonality
• Trends have to be handled separately
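A minimal sketch of handling the trend separately (not from the talk; NumPy with synthetic monthly data): fit and remove the trend first, then estimate the seasonal cycle from the residual:

```python
# Separate trend from seasonality in monthly data.
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(48)                                   # 4 years of monthly data
y = 0.05 * t + np.sin(2 * np.pi * t / 12) + rng.normal(scale=0.1, size=48)

trend = np.polyfit(t, y, 1)                         # handle the trend separately
residual = y - np.polyval(trend, t)
seasonal = residual.reshape(4, 12).mean(axis=0)     # average each month across years
print("slope:", round(trend[0], 3), "seasonal cycle:", seasonal.round(2))
```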
Interpolation – Extrapolation
Why is time so different from other variables ?
[Two plots: Prob. to buy a smartphone vs. age, where a query point falls between observed points (interpolation), and Prob. to buy a smartphone vs. time, where the query point for the future falls outside all observed points (extrapolation, marked "?")]
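A sketch of why extrapolation is riskier (not from the talk; NumPy, synthetic data): a model fit on ages 20-60 answers sensibly at 40 (interpolation) but can return nonsense at 80 (extrapolation), and every forecast along the time axis is an extrapolation of this kind:

```python
# Interpolation vs. extrapolation with a polynomial fit.
import numpy as np

rng = np.random.default_rng(0)
age = rng.uniform(20, 60, size=200)
prob = 0.9 * np.exp(-((age - 40) / 15) ** 2) + rng.normal(scale=0.02, size=200)

x = (age - 40) / 20                     # rescale for numerical stability
coeffs = np.polyfit(x, prob, 4)

for query in (40, 80):
    print(query, np.polyval(coeffs, (query - 40) / 20))
# At 40 (inside the training range) the prediction is sensible;
# at 80 (outside) the polynomial can return arbitrary values.
```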
Time is correlated with hidden variables
[Plot: cost for car insurance (one type of insurance) vs. time, showing a sharp jump when a new law takes effect]
Changing Data Representation
• Collected Data changes
• Category splitting, merging
• Variable names change
• Category names change