l9. real world machine learning - cooking predictions
TRANSCRIPT
Cooking Predictions: A real case in the hotel sector
Andrés González, Big Data Prediction Manager
[email protected] Twitter: @data_lytics
CleverTask Solutions SL - Big Data Business Unit 3
Agenda
1. Business Need
2. “Cooking” Predictions
3. Gathering ingredients
4. Cleaning and Transforming
5. The recipe (the model)
6. Tasting the dish
Hotel Sector
• % room occupation. • Cancellation risk. • Income.
Business Need
Predict the client’s NATIONALITY BEFORE client check-in
Staff Arrangement
Languages
Prepare Activities
Kitchen Arrangement
Customize Stay
In short, because… Details Make the Difference
Machine Learning basics
Can you find patterns in this data?
Historical Data -> Training -> Prediction
New Data -> Re-Training
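The train, predict and re-train cycle named on this slide can be sketched in a few lines. This is only a toy under stated assumptions: the "model" is a frequency table that predicts the most common nationality seen for a booking channel, and the field values and data are hypothetical, not taken from the talk.

```python
from collections import Counter, defaultdict


class FrequencyModel:
    """Toy classifier: predicts the most frequent label seen for a key."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def train(self, rows):
        # rows: iterable of (feature_value, label) pairs
        for feature, label in rows:
            self.counts[feature][label] += 1
        return self

    def predict(self, feature):
        seen = self.counts.get(feature)
        if not seen:
            return None  # unseen feature value: no prediction
        return seen.most_common(1)[0][0]


# Hypothetical historical data: (booking channel, nationality)
historical = [("web", "ES"), ("web", "ES"), ("web", "UK"), ("agency", "DE")]
model = FrequencyModel().train(historical)
before = model.predict("web")   # "ES": the majority so far

# New data arrives: re-train by adding it to what the model has already seen
new_data = [("web", "UK"), ("web", "UK")]
model.train(new_data)
after = model.predict("web")    # "UK": the majority has shifted
```

The point of the sketch is only the loop: predictions come from historical data, and they change once new data is folded back in.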
“Cooking” Predictions
• Go to the market to buy ingredients -> Gathering RAW data
• Cleaning -> Cleaning Data
• Transforming -> Transforming and Feature Engineering
• Cooking -> Training the Model
• Tasting the Dish -> Evaluating Prediction Quality
Where does Data come from?
Own Website
Partners Websites
RAW Data
RAW Data
One year of historical reservation data (.xlsx file)
Characteristics
• 260,000 reservations
• 80 fields: 57 categorical, 9 numeric, 10 date, 3 text, 1 incorrect field
• Size: 150 MB
The Process
1. Gathering Data: “Dirty” RAW Data
2. Cleaning: “Clean” Data
3. Transformation and Feature Engineering: New Fields, Calculated Fields
4. Model
Data Cleaning
Row Deletion
• Reservations without check-in
• Cancelled reservations
• Rows with errors
Column Deletion
• IDs vs names
• Columns with little data
Other Actions
• Give dates a consistent format
• Remove accents
• Convert .xlsx -> .csv
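The cleaning actions above could look roughly like this in code. It is a minimal sketch, not the project's actual pipeline: the column names (`CLIENT_ID`, `CITY`, `STATUS`, `CHECKIN_DATE`), the `CANCELLED` status value and the dd/mm/yyyy input format are all assumptions. Reading the .xlsx file itself needs a third-party library, so the sketch starts from rows already loaded as dictionaries.

```python
import csv
import io
import unicodedata
from datetime import datetime


def strip_accents(text):
    """Remove accents: 'Málaga' -> 'Malaga'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def clean_rows(rows, drop_columns=("CLIENT_ID",)):
    """Yield cleaned copies of reservation dicts, skipping unusable rows."""
    for row in rows:
        if not row.get("CHECKIN_DATE"):        # reservation without check-in
            continue
        if row.get("STATUS") == "CANCELLED":   # cancelled reservation
            continue
        try:
            # Assumed input format dd/mm/yyyy; output is ISO yyyy-mm-dd
            checkin = datetime.strptime(row["CHECKIN_DATE"], "%d/%m/%Y")
        except ValueError:                     # row with errors
            continue
        cleaned = {k: v for k, v in row.items() if k not in drop_columns}
        cleaned["CHECKIN_DATE"] = checkin.date().isoformat()
        cleaned["CITY"] = strip_accents(cleaned["CITY"])
        yield cleaned


# Hypothetical raw rows covering each deletion rule
raw = [
    {"CLIENT_ID": "1", "CITY": "Málaga", "STATUS": "OK", "CHECKIN_DATE": "03/07/2015"},
    {"CLIENT_ID": "2", "CITY": "Cádiz", "STATUS": "CANCELLED", "CHECKIN_DATE": "04/07/2015"},
    {"CLIENT_ID": "3", "CITY": "Sevilla", "STATUS": "OK", "CHECKIN_DATE": ""},
    {"CLIENT_ID": "4", "CITY": "Córdoba", "STATUS": "OK", "CHECKIN_DATE": "31/02/2015"},
]
clean = list(clean_rows(raw))

# Write the result as CSV (to an in-memory buffer here instead of a file)
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(clean[0].keys()))
writer.writeheader()
writer.writerows(clean)
csv_text = buffer.getvalue()
```

Only the first row survives: the second is cancelled, the third has no check-in, and the fourth carries an impossible date.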
Clean Dataset
Dirty
• 260,000 reservations
• 80 fields: 57 categorical, 9 numeric, 10 date, 3 text, 1 incorrect field
• Size: 150 MB
Clean
• 150,000 reservations
• 46 fields: 26 categorical, 9 numeric, 10 date, 1 text
• Size: 75 MB
Transformations
Country Grouping
• Many countries to predict (210)
• Some countries have very few instances
• Grouping objective: every group holds at least 1% of total instances
• Does not affect the business objective
• Total number of groups: 20
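One simple way to reach a minimum share per class is to fold every rare country into a single catch-all bucket. The sketch below does that and is an assumption: the talk does not say how the 20 groups were built (they may well be regional), and the `OTHER` label and sample counts are made up.

```python
from collections import Counter


def group_rare_countries(countries, min_share=0.01, other_label="OTHER"):
    """Map each country to itself, or to other_label if below min_share."""
    counts = Counter(countries)
    total = sum(counts.values())
    return {
        country: country if count / total >= min_share else other_label
        for country, count in counts.items()
    }


# Hypothetical sample: 1,000 reservations from five countries
sample = ["ES"] * 600 + ["UK"] * 250 + ["DE"] * 140 + ["IS"] * 5 + ["NZ"] * 5
mapping = group_rare_countries(sample)
grouped = [mapping[c] for c in sample]
group_sizes = Counter(grouped)   # ES 600, UK 250, DE 140, OTHER 10
```

Note that a catch-all bucket can itself end up below the threshold when rare countries are few, which is one reason a real project might group by region instead.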
New Fields
• RESERV_ANTICIPATION (calculated): reservation date - check-in date
• COUNTRY_HOTEL (name of the country)
• HOTEL_STARS (1-5)
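The calculated field is a plain date difference. A minimal sketch, assuming both values are already parsed as dates and that the unit is days (the slide does not state the unit):

```python
from datetime import date


def reserv_anticipation(reservation_date, checkin_date):
    """Days between the two dates, using the slide's order of operands."""
    return (reservation_date - checkin_date).days


# Hypothetical reservation made 30 days before check-in
value = reserv_anticipation(date(2015, 6, 1), date(2015, 7, 1))  # -30
```

With the operand order written on the slide, a booking made ahead of check-in gives a negative number; swapping the operands gives a positive lead time. Which sign the original project used is not stated.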
Clean Dataset
Dirty • 260,000 reservations • 80 fields • Size: 150 MB
Clean • 150,000 reservations • 46 fields • Size: 75 MB
Transformed • 150,000 reservations • 49 fields • Size: 80 MB
What is Feature Engineering?
Extract signal from noise
Feature Engineering Techniques
• Detect fields (features) that are predictors (signal) and bypass those that are not (noise)
• Dependent fields (pax, days, pax*days)
• Needless fields (reservation number)
• Fields with very little data
• Random fields (minute and second of reservation)
• Domain knowledge
• Experience
• Recursive cycle
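Some of the noise categories listed above can be flagged mechanically before domain knowledge takes over. The following is a rough sketch with assumed heuristics and hypothetical column names; the thresholds are arbitrary and the talk does not describe any automated check.

```python
def find_noise_fields(rows, min_filled=0.5):
    """Return {field: reason} for fields that look like noise."""
    n = len(rows)
    flagged = {}
    for field in rows[0].keys():
        values = [row[field] for row in rows]
        filled = [v for v in values if v not in ("", None)]
        if len(filled) / n < min_filled:
            flagged[field] = "very little data"
        elif len(set(filled)) == n:
            flagged[field] = "unique per row (ID-like)"
    return flagged


def is_product_of(rows, target, a, b):
    """True if target == a * b on every row (a dependent field)."""
    return all(row[target] == row[a] * row[b] for row in rows)


# Hypothetical rows; PAX_DAYS is pax * days, COMMENTS is mostly empty
rows = [
    {"RESERVATION_NUMBER": 1001, "PAX": 2, "DAYS": 3, "PAX_DAYS": 6, "COMMENTS": ""},
    {"RESERVATION_NUMBER": 1002, "PAX": 2, "DAYS": 7, "PAX_DAYS": 14, "COMMENTS": ""},
    {"RESERVATION_NUMBER": 1003, "PAX": 4, "DAYS": 3, "PAX_DAYS": 12, "COMMENTS": "late arrival"},
    {"RESERVATION_NUMBER": 1004, "PAX": 2, "DAYS": 3, "PAX_DAYS": 6, "COMMENTS": ""},
]
flagged = find_noise_fields(rows)
dependent = is_product_of(rows, "PAX_DAYS", "PAX", "DAYS")
```

The uniqueness test is deliberately crude: on a small sample it would also flag a legitimate numeric field whose values happen to be all different, so its output is a list of candidates to review, not a verdict.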
Recursive Feature Engineering
Field Selection -> Algorithm Adjustment -> Prediction -> Quality Evaluation -> back to Field Selection
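The recursive cycle can be automated in part. The talk does not say how fields were chosen on each pass, so the sketch below swaps in greedy backward elimination as one common option: on every pass it drops the field whose removal raises the score most. `toy_evaluate` and its weights are made-up stand-ins for "train a model and measure its quality", and `RESERVATION_SECOND` is a hypothetical column name.

```python
def backward_selection(fields, evaluate):
    """Repeatedly drop the field whose removal improves the score most."""
    selected = list(fields)
    best_score = evaluate(selected)
    improved = True
    while improved and len(selected) > 1:
        improved = False
        for field in list(selected):
            candidate = [f for f in selected if f != field]
            score = evaluate(candidate)   # train + predict + evaluate
            if score > best_score:
                best_score, best_drop = score, field
                improved = True
        if improved:
            selected.remove(best_drop)
    return selected, best_score


# Stand-in for a real evaluation; the weights are invented for illustration
WEIGHTS = {
    "COUNTRY_HOTEL": 0.20,
    "HOTEL_STARS": 0.10,
    "RESERVATION_NUMBER": -0.05,
    "RESERVATION_SECOND": -0.02,
}


def toy_evaluate(fields):
    return round(0.30 + sum(WEIGHTS[f] for f in fields), 2)


kept, score = backward_selection(list(WEIGHTS), toy_evaluate)
```

In a real project each call to `evaluate` means a full training run, so the number of passes is usually kept small and guided by domain knowledge.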
Clean Dataset
Dirty • 260,000 reservations • 80 fields • Size: 150 MB
Clean • 150,000 reservations • 46 fields • Size: 75 MB
Transformed • 150,000 reservations • 49 fields • Size: 80 MB
Final Dataset • 150,000 reservations • 10 fields • Size: 55 MB
Modeling
Training
Learning
Quality Evaluation
Dataset (100%)
• 80% Training -> Model
• 20% Test -> Evaluation
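The 80/20 split shown here can be done with the standard library alone. A minimal sketch, assuming a random split with a fixed seed; the talk does not say whether the split was random or chronological.

```python
import random


def train_test_split(rows, test_share=0.2, seed=42):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * (1 - test_share)))
    return shuffled[:cut], shuffled[cut:]


reservations = list(range(1000))   # stand-in for 1,000 reservation rows
train, test = train_test_split(reservations)
```

The model is trained on `train` only; `test` stays untouched until evaluation, so the measured quality reflects reservations the model has never seen.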
• Accuracy
• Confusion Matrix
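Both measures are easy to compute by hand, which helps when reading them. A small sketch with hypothetical real and predicted nationalities; the numbers are illustrative and unrelated to the results in the talk.

```python
from collections import Counter


def accuracy(real, predicted):
    """Share of predictions that match the real label."""
    hits = sum(r == p for r, p in zip(real, predicted))
    return hits / len(real)


def confusion_matrix(real, predicted):
    """Counter keyed by (real, predicted) pairs."""
    return Counter(zip(real, predicted))


real = ["ES", "ES", "UK", "DE", "UK", "ES"]
predicted = ["ES", "UK", "UK", "ES", "UK", "ES"]
acc = accuracy(real, predicted)          # 4 of 6 correct
matrix = confusion_matrix(real, predicted)
```

Accuracy is a single number, while the matrix shows where the errors go: here the one German client is predicted as Spanish, something the accuracy figure alone would hide.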
54% 75%
Predicted vs Real Distribution
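Comparing the predicted and real class distributions is a quick check for a model that over-predicts the majority nationality. A sketch with made-up labels:

```python
from collections import Counter


def distribution(labels):
    """Share of each label in the list."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


def compare_distributions(real, predicted):
    """Map each label to a (real share, predicted share) pair."""
    real_d, pred_d = distribution(real), distribution(predicted)
    labels = sorted(set(real_d) | set(pred_d))
    return {label: (real_d.get(label, 0.0), pred_d.get(label, 0.0))
            for label in labels}


# Hypothetical test set of 10 clients
real = ["ES"] * 5 + ["UK"] * 3 + ["DE"] * 2
predicted = ["ES"] * 7 + ["UK"] * 3
shares = compare_distributions(real, predicted)
```

In this example the model never predicts `DE` and inflates `ES` from 50% to 70%, a pattern that would matter for the staffing and language planning described earlier.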
Cooking Predictions
80% / 20%
• Go to the market to buy ingredients -> Gathering RAW data
• Cleaning -> Cleaning Data
• Transforming -> Transforming and Feature Engineering
• Cooking -> Training the Model
• Tasting the Dish -> Evaluating Prediction Quality
Other Techniques
• Ensembles
• Clusters
• Weight Analysis
• Anomaly Detection
END
email: [email protected]
Twitter: @data_lytics
www.clevertask.com