scaling feature generation from prototyping to production ... · a typical result of prototyping is...

19
SCALING FEATURE GENERATION FROM PROTOTYPING TO PRODUCTION AT REWE REWE Systems GmbH | March 2019 | Benjamin Greve

Upload: others

Post on 23-May-2020

5 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: SCALING FEATURE GENERATION FROM PROTOTYPING TO PRODUCTION ... · A TYPICAL RESULT OF PROTOTYPING IS A VERY LONG SQL SCRIPT March 2019 Scaling Feature Generation 10 OUTPUT #1: Very

SCALING FEATURE GENERATIONFROM PROTOTYPING TO PRODUCTIONAT REWE

REWE Systems GmbH | March 2019 | Benjamin Greve

Page 2: SCALING FEATURE GENERATION FROM PROTOTYPING TO PRODUCTION ... · A TYPICAL RESULT OF PROTOTYPING IS A VERY LONG SQL SCRIPT March 2019 Scaling Feature Generation 10 OUTPUT #1: Very

AGENDA

March 2019 Scaling Feature Generation 2

1 / Introduction

2 / Example Project: Predicting Brand Market Fit

3 / Feature Generation in Prototyping – The Problems

4 / Moving From Prototyping to Production – Lessons Learned

Page 3: SCALING FEATURE GENERATION FROM PROTOTYPING TO PRODUCTION ... · A TYPICAL RESULT OF PROTOTYPING IS A VERY LONG SQL SCRIPT March 2019 Scaling Feature Generation 10 OUTPUT #1: Very

INTRODUCTION / ME

March 2019 Scaling Feature Generation 3

Benjamin GreveData Scientist at REWE Systems

• Mathematician

• Working in data science projects for 5+ years (2 years at REWE)

• Likes hiking, climbing, dancing, cooking (especially the eating part)

Page 4: SCALING FEATURE GENERATION FROM PROTOTYPING TO PRODUCTION ... · A TYPICAL RESULT OF PROTOTYPING IS A VERY LONG SQL SCRIPT March 2019 Scaling Feature Generation 10 OUTPUT #1: Very

INTRODUCTION / REWE GROUP

March 2019 Scaling Feature Generation 4

Source: https://www.rewe-group-geschaeftsbericht.de, 14.03.2019

Page 5: SCALING FEATURE GENERATION FROM PROTOTYPING TO PRODUCTION ... · A TYPICAL RESULT OF PROTOTYPING IS A VERY LONG SQL SCRIPT March 2019 Scaling Feature Generation 10 OUTPUT #1: Very

INTRODUCTION / REWE GROUP

March 2019 Scaling Feature Generation 5

15,300MARKETS 2017

57.8Billion Euro

TOTAL REVENUE 2017

345,000EMPLOYEES 2017

REWE Group facts (2017)

Page 6: SCALING FEATURE GENERATION FROM PROTOTYPING TO PRODUCTION ... · A TYPICAL RESULT OF PROTOTYPING IS A VERY LONG SQL SCRIPT March 2019 Scaling Feature Generation 10 OUTPUT #1: Very

INTRODUCTION / REWE SYSTEMS

March 2019 Scaling Feature Generation 6

Page 7: SCALING FEATURE GENERATION FROM PROTOTYPING TO PRODUCTION ... · A TYPICAL RESULT OF PROTOTYPING IS A VERY LONG SQL SCRIPT March 2019 Scaling Feature Generation 10 OUTPUT #1: Very

>1 Bil.data setsevery day

30locations

200,000users

30,000cash registers

1,200IT specialists

7,500markets

INTRODUCTION / REWE SYSTEMS

Scaling Feature Generation 7March 2019

Page 8: SCALING FEATURE GENERATION FROM PROTOTYPING TO PRODUCTION ... · A TYPICAL RESULT OF PROTOTYPING IS A VERY LONG SQL SCRIPT March 2019 Scaling Feature Generation 10 OUTPUT #1: Very

EXAMPLE PROJECT: PREDICTING BRAND MARKET FIT

March 2019 Scaling Feature Generation 8

We want to predict how well a brand will be received in a particular market to assist category managers in selecting the right brands for each market.

BRAND / CATEGORY:A set of products that form a logical group. Can contain between one and a few thousand products.

y = 1.13

y = 0.92

y = 1.27

FEATURES x1, x2, …• Popularity of wider category in market• Number of competing articles• Location of market, incl. demographical information• …

Technical definition

0

1

more popular

less popular

TARGET VARIABLE y:Brand popularityin current marketcompared to averageacross all REWEmarkets

Page 9: SCALING FEATURE GENERATION FROM PROTOTYPING TO PRODUCTION ... · A TYPICAL RESULT OF PROTOTYPING IS A VERY LONG SQL SCRIPT March 2019 Scaling Feature Generation 10 OUTPUT #1: Very

FEATURE GENERATION IS CENTRAL PART IN THE DATA SCIENCE WORKFLOW

March 2019 Scaling Feature Generation 9

CRISP-DM: Cross-industry standard process for data mining

DataWarehouse

? FeatureGeneration

Predictive

model

Predictions

„Everything you do with the data before you apply a predictive

model“

https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining#/media/File:CRISP-DM_Process_Diagram.png

Page 10: SCALING FEATURE GENERATION FROM PROTOTYPING TO PRODUCTION ... · A TYPICAL RESULT OF PROTOTYPING IS A VERY LONG SQL SCRIPT March 2019 Scaling Feature Generation 10 OUTPUT #1: Very

A TYPICAL RESULT OF PROTOTYPING IS A VERY LONG SQL SCRIPT

March 2019 Scaling Feature Generation 10

OUTPUT #1: Very long SQL script

• 2000+ lines of SQL• 40+ intermediate tables• interdependencies• inconsistent naming, chaotic code style• not optimized for performance• not robust→ not production-ready or scalable

OUTPUT#2: Very long R or Python script

• Different topic for another talk

.sql

Prototyping performed by Data Scientists

Page 11: SCALING FEATURE GENERATION FROM PROTOTYPING TO PRODUCTION ... · A TYPICAL RESULT OF PROTOTYPING IS A VERY LONG SQL SCRIPT March 2019 Scaling Feature Generation 10 OUTPUT #1: Very

A TYPICAL RESULT OF PROTOTYPING IS A VERY LONG SQL SCRIPT

March 2019 Scaling Feature Generation 11

OUTPUT #1: Very long SQL scriptOUTPUT #1: Very long SQL script

• 2000+ lines of SQL• 40+ intermediate tables• interdependencies• inconsistent naming, chaotic code style• not optimized for performance• not robust→ not production-ready or scalable

Page 12: SCALING FEATURE GENERATION FROM PROTOTYPING TO PRODUCTION ... · A TYPICAL RESULT OF PROTOTYPING IS A VERY LONG SQL SCRIPT March 2019 Scaling Feature Generation 10 OUTPUT #1: Very

AVOID LONG, MONOLITHIC SQL SCRIPTS

March 2019 Scaling Feature Generation 12

.sql

Weaknesses:

• Duplicate/redundant SQL for every feature

• Parameters hidden within the scripts

• Adding new features is hard due to interdependencies

one very longSQL script

training datatable

.sql .sql

two very long, redundant SQL scripts

training datatable

scoring datatable

Page 13: SCALING FEATURE GENERATION FROM PROTOTYPING TO PRODUCTION ... · A TYPICAL RESULT OF PROTOTYPING IS A VERY LONG SQL SCRIPT March 2019 Scaling Feature Generation 10 OUTPUT #1: Very

Strengths:

✓ Highly modular and scalable

✓ Parameters are centrally defined

✓ Feature scripts are mostly independent, making iteasy to add new features

USE A MODULAR FEATURE GENERATION INSTEAD

March 2019 Scaling Feature Generation 13

Input script

Creates input table

i.e. table of all market-brand-combinations for

which to calculate features

Featurescript A

Featurescript B

Featurescript C

Market Brand Feature A1 Feature A2 Feature B1 …

10000001 20001

10000002 20001

10000003 20001

merge features into feature store

Derived helper tables

• Distinct list of brands• Mapping brands to

articles• …

.sql .sql .sql .sql .sql

Feature Store

Page 14: SCALING FEATURE GENERATION FROM PROTOTYPING TO PRODUCTION ... · A TYPICAL RESULT OF PROTOTYPING IS A VERY LONG SQL SCRIPT March 2019 Scaling Feature Generation 10 OUTPUT #1: Very

DECOMPOSING COMPLEX SQL WITHIN KNIME LEADS TO COMPLEX WORKFLOWS

March 2019 Scaling Feature Generation 14

✓ Visual, KNIME native

• Gets too complex as SQL logic grows

• Hard to translate into standard data warehouse ETLs

✓ Code✓ Powerful version control

(releases)✓ Analysts fluent in SQL

• Harder to explain/show

Example workflow from KNIME Examples server

Page 15: SCALING FEATURE GENERATION FROM PROTOTYPING TO PRODUCTION ... · A TYPICAL RESULT OF PROTOTYPING IS A VERY LONG SQL SCRIPT March 2019 Scaling Feature Generation 10 OUTPUT #1: Very

A DEPLOYMENT WORKFLOW PULLS SQL SCRIPTS FROM VERSION CONTROL TO KNIME SERVER

March 2019 Scaling Feature Generation 15

A fully automated deployment workflow copies snapshot or release versions from the version control system to the KNIME Server

Download SQL files from version control system

Page 16: SCALING FEATURE GENERATION FROM PROTOTYPING TO PRODUCTION ... · A TYPICAL RESULT OF PROTOTYPING IS A VERY LONG SQL SCRIPT March 2019 Scaling Feature Generation 10 OUTPUT #1: Very

• To perform Feature Generation, execute a set of SQL scripts in a given order

• Just write down the name of the files to be executed in the correct order and run the workflow

IT’S VERY SIMPLE TO EXECUTE THE DEPLOYED SQL FILES

March 2019 Scaling Feature Generation 16

Page 17: SCALING FEATURE GENERATION FROM PROTOTYPING TO PRODUCTION ... · A TYPICAL RESULT OF PROTOTYPING IS A VERY LONG SQL SCRIPT March 2019 Scaling Feature Generation 10 OUTPUT #1: Very

Useful metanodes help to implement a staging concept, e.g.:

Select staging environment → obtain connection to the corresponding database

HELPFUL METANODES EXAMPLE: GET DATABASE CONNECTION

March 2019 Scaling Feature Generation 17

Page 18: SCALING FEATURE GENERATION FROM PROTOTYPING TO PRODUCTION ... · A TYPICAL RESULT OF PROTOTYPING IS A VERY LONG SQL SCRIPT March 2019 Scaling Feature Generation 10 OUTPUT #1: Very

NEXT STEPS

March 2019 Scaling Feature Generation 18

Currently: Next level:

Feature dependency graph• Automatic dependency resolution• Already built into KNIME!

• specify features in arbitrary order• specify scripts manually• order manually to take

dependencies into account

Page 19: SCALING FEATURE GENERATION FROM PROTOTYPING TO PRODUCTION ... · A TYPICAL RESULT OF PROTOTYPING IS A VERY LONG SQL SCRIPT March 2019 Scaling Feature Generation 10 OUTPUT #1: Very

THANKS

March 2019 Scaling Feature Generation 19

Thanks for your attention