SCALING FEATURE GENERATION FROM PROTOTYPING TO PRODUCTION AT REWE
REWE Systems GmbH | March 2019 | Benjamin Greve
AGENDA
March 2019 Scaling Feature Generation 2
1 / Introduction
2 / Example Project: Predicting Brand Market Fit
3 / Feature Generation in Prototyping – The Problems
4 / Moving From Prototyping to Production – Lessons Learned
INTRODUCTION / ME
Benjamin Greve, Data Scientist at REWE Systems
• Mathematician
• Working in data science projects for 5+ years (2 years at REWE)
• Likes hiking, climbing, dancing, cooking (especially the eating part)
INTRODUCTION / REWE GROUP
Source: https://www.rewe-group-geschaeftsbericht.de, 14.03.2019
INTRODUCTION / REWE GROUP
REWE Group facts (2017)
• 15,300 markets
• 57.8 billion euro total revenue
• 345,000 employees
INTRODUCTION / REWE SYSTEMS
• >1 billion data sets every day
• 30 locations
• 200,000 users
• 30,000 cash registers
• 1,200 IT specialists
• 7,500 markets
EXAMPLE PROJECT: PREDICTING BRAND MARKET FIT
We want to predict how well a brand will be received in a particular market to assist category managers in selecting the right brands for each market.
BRAND / CATEGORY: A set of products that form a logical group. Can contain between one and a few thousand products.
Technical definition
TARGET VARIABLE y: Brand popularity in current market compared to average across all REWE markets
[Figure: scale for y with marks at 0 and 1, labelled "less popular" and "more popular"; example values y = 1.13, y = 0.92, y = 1.27]
FEATURES x1, x2, …
• Popularity of wider category in market
• Number of competing articles
• Location of market, incl. demographic information
• …
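The target variable y can be illustrated with a small sketch. The talk does not say how popularity is measured, so the example assumes it is the brand's share of a market's revenue, and it takes the average as the plain mean of those shares. The function name `relative_popularity` and all revenue figures are hypothetical.

```python
def relative_popularity(brand_revenue, total_revenue):
    """Return y per market: the brand's revenue share in that market
    divided by the brand's mean share across all markets.

    Assumption (not from the talk): popularity = share of market revenue.
    """
    shares = {m: brand_revenue[m] / total_revenue[m] for m in brand_revenue}
    average_share = sum(shares.values()) / len(shares)
    return {m: share / average_share for m, share in shares.items()}


# Hypothetical revenue figures for one brand in three markets.
brand_revenue = {"10000001": 120.0, "10000002": 90.0, "10000003": 90.0}
total_revenue = {"10000001": 1000.0, "10000002": 1000.0, "10000003": 1000.0}

y = relative_popularity(brand_revenue, total_revenue)
```

Under this assumption, a value above 1 means the brand is more popular in that market than on average, and a value below 1 means it is less popular.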
FEATURE GENERATION IS A CENTRAL PART OF THE DATA SCIENCE WORKFLOW
CRISP-DM: Cross-industry standard process for data mining
Data Warehouse → Feature Generation → Predictive model → Predictions
Feature Generation: “Everything you do with the data before you apply a predictive model”
https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining#/media/File:CRISP-DM_Process_Diagram.png
A TYPICAL RESULT OF PROTOTYPING IS A VERY LONG SQL SCRIPT
OUTPUT #1: Very long SQL script
• 2000+ lines of SQL
• 40+ intermediate tables
• interdependencies
• inconsistent naming, chaotic code style
• not optimized for performance
• not robust
→ not production-ready or scalable
OUTPUT #2: Very long R or Python script
• Different topic for another talk
Prototyping performed by Data Scientists
AVOID LONG, MONOLITHIC SQL SCRIPTS
Weaknesses:
• Duplicate/redundant SQL for every feature
• Parameters hidden within the scripts
• Adding new features is hard due to interdependencies
[Figure: one very long SQL script → training data table; two very long, redundant SQL scripts → training data table and scoring data table]
USE A MODULAR FEATURE GENERATION INSTEAD
Strengths:
✓ Highly modular and scalable
✓ Parameters are centrally defined
✓ Feature scripts are mostly independent, making it easy to add new features
Input script: creates the input table, i.e. the table of all market-brand combinations for which to calculate features
Derived helper tables:
• Distinct list of brands
• Mapping brands to articles
• …
Feature script A, Feature script B, Feature script C
Merge features into feature store
Feature Store:
Market    Brand  Feature A1  Feature A2  Feature B1  …
10000001  20001
10000002  20001
10000003  20001
[Figure: the scripts are shown as separate .sql files]
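The modular layout above can be sketched in a few lines of Python with an in-memory SQLite database. This is only an illustration under stated assumptions, not the setup used at REWE: the table names (`input_table`, `feature_a`, `feature_b`, `feature_store`), the column names, and the constant feature values are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Input script: one row per market-brand combination to compute features for.
conn.executescript("""
    CREATE TABLE input_table (market TEXT, brand TEXT);
    INSERT INTO input_table VALUES ('10000001', '20001'), ('10000002', '20001');
""")

# Each feature script reads the input table and writes its own feature table.
# The constant values stand in for real feature logic.
feature_scripts = {
    "feature_a": "CREATE TABLE feature_a AS "
                 "SELECT market, brand, 1 AS feature_a1, 2 AS feature_a2 "
                 "FROM input_table",
    "feature_b": "CREATE TABLE feature_b AS "
                 "SELECT market, brand, 3 AS feature_b1 "
                 "FROM input_table WHERE market = '10000001'",
}
for sql in feature_scripts.values():
    conn.execute(sql)

# Merge step: left-join every feature table onto the input table, so each
# market-brand combination keeps its row even when a feature is missing.
joins = " ".join(
    f"LEFT JOIN {name} USING (market, brand)" for name in feature_scripts
)
conn.execute(f"CREATE TABLE feature_store AS SELECT * FROM input_table {joins}")

rows = conn.execute(
    "SELECT market, brand, feature_a1, feature_a2, feature_b1 "
    "FROM feature_store ORDER BY market"
).fetchall()
```

Adding a feature then means adding one entry to `feature_scripts`; the merge step does not change.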
DECOMPOSING COMPLEX SQL WITHIN KNIME LEADS TO COMPLEX WORKFLOWS
SQL logic decomposed into KNIME nodes:
✓ Visual, KNIME native
• Gets too complex as SQL logic grows
• Hard to translate into standard data warehouse ETLs
SQL scripts kept as code:
✓ Code
✓ Powerful version control (releases)
✓ Analysts fluent in SQL
• Harder to explain/show
[Figure: example workflow from KNIME Examples server]
A DEPLOYMENT WORKFLOW PULLS SQL SCRIPTS FROM VERSION CONTROL TO KNIME SERVER
A fully automated deployment workflow copies snapshot or release versions from the version control system to the KNIME Server
Download SQL files from version control system
IT’S VERY SIMPLE TO EXECUTE THE DEPLOYED SQL FILES
• To perform Feature Generation, execute a set of SQL scripts in a given order
• Just write down the names of the files to be executed in the correct order and run the workflow
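The talk runs the deployed scripts from a KNIME workflow. As a rough stand-in for the same idea, the sketch below executes a list of SQL files in the order given, using Python and SQLite. The helper `run_scripts` and the file names are hypothetical.

```python
import sqlite3
import tempfile
from pathlib import Path


def run_scripts(conn, script_dir, script_names):
    """Execute the named SQL files against conn, in the order given."""
    for name in script_names:
        sql = (Path(script_dir) / name).read_text(encoding="utf-8")
        conn.executescript(sql)


# Demo with two throwaway files standing in for deployed scripts.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "01_input.sql").write_text(
        "CREATE TABLE input_table (market TEXT, brand TEXT);"
        "INSERT INTO input_table VALUES ('10000001', '20001');",
        encoding="utf-8",
    )
    Path(tmp, "02_feature_a.sql").write_text(
        "CREATE TABLE feature_a AS "
        "SELECT market, brand, 1 AS feature_a1 FROM input_table;",
        encoding="utf-8",
    )
    conn = sqlite3.connect(":memory:")
    # The order of this list is the execution order.
    run_scripts(conn, tmp, ["01_input.sql", "02_feature_a.sql"])
    result = conn.execute("SELECT * FROM feature_a").fetchall()
```

Because the second script reads the table created by the first, swapping the two names in the list would fail, which is why the order has to be written down correctly.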
HELPFUL METANODES EXAMPLE: GET DATABASE CONNECTION
Useful metanodes help to implement a staging concept, e.g.:
Select staging environment → obtain connection to the corresponding database
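Outside KNIME, the staging idea (select an environment, get the matching connection) might look like the sketch below. The environment names `dev`, `test`, and `prod`, the `DATABASES` mapping, and the function `get_connection` are assumptions for illustration; the in-memory SQLite targets stand in for real database servers.

```python
import sqlite3

# Hypothetical mapping from staging environment to database location.
# A real setup would hold host names and credentials here instead.
DATABASES = {
    "dev": ":memory:",
    "test": ":memory:",
    "prod": ":memory:",
}


def get_connection(environment):
    """Return a connection to the database of the chosen staging environment."""
    try:
        target = DATABASES[environment]
    except KeyError:
        raise ValueError(f"unknown staging environment: {environment!r}") from None
    return sqlite3.connect(target)


conn = get_connection("dev")
```

Keeping the mapping in one place means the feature scripts themselves never name a database, so the same scripts can run in every environment.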
NEXT STEPS
Currently:
• specify scripts manually
• order manually to take dependencies into account
Next level: Feature dependency graph
• Automatic dependency resolution
• Already built into KNIME!
• specify features in arbitrary order
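Automatic dependency resolution amounts to a topological sort of the feature dependency graph. The talk relies on KNIME for this; the sketch below only illustrates the principle with Python's standard `graphlib` module (Python 3.9+). The script names and their dependencies are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical dependencies: each script maps to the scripts it needs first.
# The entries can be listed in any order.
dependencies = {
    "feature_b.sql": {"input.sql", "helper_brands.sql"},
    "feature_a.sql": {"input.sql"},
    "helper_brands.sql": {"input.sql"},
    "input.sql": set(),
}

# static_order() yields every script after all of its prerequisites.
execution_order = list(TopologicalSorter(dependencies).static_order())
```

With the order derived from the graph, scripts no longer have to be sorted by hand, and a circular dependency surfaces as a `graphlib.CycleError` instead of a failed run.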
THANKS
Thanks for your attention