Find signal in noise

Michał Bryś, Data Scientist @ Allegro. Find signal in noise: 6 steps to find value from messy data. Presented at Complexity Garage @ Kraków, 05.02.2016, and Measure Camp @ London, 10.09.2016.

Upload: michal-brys

Post on 15-Apr-2017

TRANSCRIPT

Page 1: Find signal in noise

Michał Bryś, Data Scientist @ Allegro, Complexity Garage @ Kraków, 05.02.2016
Michał Bryś, Data Scientist @ Allegro, Measure Camp @ London, 10.09.2016

Michal Brys
Data Scientist @ Allegro
Measure Camp | London, 10th September 2016

Find signal in noise.
6 steps to find value from messy data.

Page 2: Find signal in noise

Michal Brys
Data Scientist @ Allegro

Specialized also in:

+ Google Analytics
+ Google Tag Manager

michalbrys.com
about.me/michal.brys

Page 3: Find signal in noise

Framework for data analysis: CRISP-DM

- Cross Industry Standard Process for Data Mining

- Set up in 1996 (SPSS, Teradata, Daimler AG, NCR, OHRA)

- Still works!

Read more: https://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining

Page 4: Find signal in noise

1: Business Understanding

- Define the analysis goal
- What do you want to achieve with the analysis?
- Check the business context
- Don't be afraid to ask questions

Page 5: Find signal in noise

1: Business Understanding

I want to select the customer group with the highest probability of response (...) so that the marketing campaign can target this group.
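Once a model produces a response probability per customer, the goal above comes down to ranking and cutting. The sketch below is only an illustration: the customer ids, the scores, and the `fraction` parameter are all hypothetical and do not come from the talk.

```python
from typing import Dict, List


def select_target_group(scores: Dict[str, float], fraction: float = 0.2) -> List[str]:
    """Return the ids of the customers with the highest response probability.

    `scores` maps a customer id to a predicted probability of response.
    `fraction` is the share of customers to target (an assumed parameter).
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    # Rank customers from the most to the least likely to respond.
    ranked = sorted(scores, key=lambda cid: scores[cid], reverse=True)
    size = max(1, round(len(ranked) * fraction))
    return ranked[:size]


# Hypothetical scores, e.g. produced later in the modeling step.
example_scores = {"c1": 0.91, "c2": 0.15, "c3": 0.64, "c4": 0.08, "c5": 0.77}
target = select_target_group(example_scores, fraction=0.4)
print(target)
```

Stating the goal this concretely early on makes it easier to check, at evaluation time, whether the result answers the initial question.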

Page 6: Find signal in noise

2: Data Understanding

- Collect data

Check:

- What does each variable in the dataset mean?
- What about missing values?
- Exploratory data analysis (EDA)
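The checks above can be sketched with the Python standard library alone. The sample data and column names (`client_id`, `sessions`, `device`, `responded`) are invented for illustration; the talk does not list the actual variables.

```python
import csv
import io
from collections import Counter

# Hypothetical sample; the real dataset and its column names are not shown in the talk.
RAW = """client_id,sessions,device,responded
1001,12,desktop,1
1002,,mobile,0
1003,3,,0
1004,7,mobile,
"""


def load_rows(text):
    """Parse CSV text into a list of dicts, one per record."""
    return list(csv.DictReader(io.StringIO(text)))


def missing_counts(rows):
    """Count empty or absent values per variable."""
    counts = Counter()
    for row in rows:
        for name, value in row.items():
            if value is None or value.strip() == "":
                counts[name] += 1
    return dict(counts)


def value_frequencies(rows, column):
    """Frequency table of the non-empty values in one column (a basic EDA step)."""
    return Counter(row[column] for row in rows if (row[column] or "").strip())


rows = load_rows(RAW)
print(len(rows), "records with", len(rows[0]), "variables")
print("missing:", missing_counts(rows))
print("device:", value_frequencies(rows, "device"))
```

A per-variable missing count and a few frequency tables are usually enough to decide which variables need attention in the preparation step.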

Page 7: Find signal in noise

2: Data Understanding

Google Analytics with client id as custom dimension

- Source: Cookies + JavaScript tracker
- Processed by Google Analytics
- No access to raw data

Page 8: Find signal in noise

2: Data Understanding

10 000 records with 11 variables

Page 9: Find signal in noise

3: Data Preparation

- Data cleaning
- Prepare new variables, transform data
- Remove missing and outlying values
- Check distributions
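As a minimal sketch of the cleaning steps above, the snippet below drops missing values and then filters outliers with the common 1.5 × IQR rule. The IQR rule and the sample values are assumptions made for illustration; the talk does not say which outlier criterion was used.

```python
import statistics


def drop_missing(values):
    """Remove records where the value is missing (represented here as None)."""
    return [v for v in values if v is not None]


def iqr_bounds(values, k=1.5):
    """Lower and upper fences based on the interquartile range (assumed rule)."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def remove_outliers(values, k=1.5):
    """Keep only the values that fall inside the IQR fences."""
    low, high = iqr_bounds(values, k)
    return [v for v in values if low <= v <= high]


# Hypothetical "sessions per client" variable with two gaps and one extreme value.
sessions = [3, 5, None, 4, 6, 5, 250, 4, None, 7]
clean = remove_outliers(drop_missing(sessions))
print(clean)
print("mean before:", statistics.mean(drop_missing(sessions)))
print("mean after:", statistics.mean(clean))
```

Comparing the mean before and after is a quick way to check how much a single extreme value distorts the distribution.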

Page 10: Find signal in noise

3: Data Preparation

Example: fix variable types.
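Exported data often arrives with every field as text, so a typical fix is to convert each variable to the type it really represents. The schema below is hypothetical (the slide's actual example is not preserved in the transcript); note that an identifier such as `client_id` stays a string even though it looks numeric.

```python
from datetime import datetime


def to_bool(text):
    """Interpret common textual flags as a boolean."""
    return text.strip().lower() in {"1", "true", "yes"}


def to_date(text):
    """Parse an ISO-style date such as 2016-01-15."""
    return datetime.strptime(text, "%Y-%m-%d").date()


# Hypothetical schema: variable name -> converter.
SCHEMA = {
    "client_id": str,      # identifier, not a quantity
    "sessions": int,
    "revenue": float,
    "responded": to_bool,
    "first_visit": to_date,
}


def fix_types(row, schema=SCHEMA):
    """Convert one record of strings to typed values; empty strings become None."""
    typed = {}
    for name, raw in row.items():
        convert = schema.get(name, str)
        typed[name] = convert(raw) if raw != "" else None
    return typed


raw_row = {
    "client_id": "1001",
    "sessions": "12",
    "revenue": "49.90",
    "responded": "1",
    "first_visit": "2016-01-15",
}
typed_row = fix_types(raw_row)
print(typed_row)
```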

Page 11: Find signal in noise

4: Modeling

- Classification problem
- Prepare models using different methods
- Split data into training and test subsets

- CART
- C5.0
- Logit Regression
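The training and test subsets can be produced with a seeded shuffle, as sketched below. The 70/30 ratio and the seed are assumptions, since the talk does not state the split that was used. Fitting CART, C5.0, or logit models is not shown here because it needs a modeling library.

```python
import random


def train_test_split(records, test_fraction=0.3, seed=42):
    """Shuffle the records reproducibly and split them into train and test subsets."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(records)
    # A dedicated Random instance keeps the split reproducible.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Stand-in for the 10 000 records mentioned in the Data Understanding step.
records = list(range(10_000))
train_set, test_set = train_test_split(records)
print(len(train_set), "training records,", len(test_set), "test records")
```

Each candidate model is then fitted on the training subset only and compared on the test subset, so the comparison reflects performance on data the model has not seen.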

Page 12: Find signal in noise

5: Evaluation

Model            | True Negative | True Positive | False Negative | False Positive | Total Error Rate
CART             | 5081          | 3150          | 1080           | 689            | 17.69%
C5.0             | 4089          | 2701          | 1606           | 1604           | 32.10%
Logit Regression | 5871          | 2107          | 1307           | 715            | 20.22%
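The total error rate in the table is the share of misclassified records, (False Negative + False Positive) / all records. The sketch below recomputes it from the counts reported on the slide. The class and variable names are illustrative.

```python
from typing import NamedTuple


class ConfusionCounts(NamedTuple):
    true_negative: int
    true_positive: int
    false_negative: int
    false_positive: int

    @property
    def total(self) -> int:
        """Number of classified records."""
        return sum(self)

    @property
    def error_rate(self) -> float:
        """Share of records that were classified incorrectly."""
        return (self.false_negative + self.false_positive) / self.total


# Counts taken from the evaluation table in the talk.
results = {
    "CART": ConfusionCounts(5081, 3150, 1080, 689),
    "C5.0": ConfusionCounts(4089, 2701, 1606, 1604),
    "Logit Regression": ConfusionCounts(5871, 2107, 1307, 715),
}

for name, counts in results.items():
    print(f"{name}: {counts.error_rate:.2%} of {counts.total} records")

best = min(results, key=lambda name: results[name].error_rate)
print("Lowest total error rate:", best)
```

Each row sums to 10 000 records, and by this single criterion CART has the lowest total error rate of the three models.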

Page 13: Find signal in noise

6: Deployment

- Prepare a report
- Implement in a system
- Build a product
- ...

Page 14: Find signal in noise

Summary

CRISP-DM

+ Keeps the business goal in mind
+ The result answers the initial question
+ Reproducible and documented process

Image: https://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining#/media/File:CRISP-DM_Process_Diagram.png

Page 15: Find signal in noise

More inspiration

“Data Mining Methods and Models”, Daniel T. Larose

“The Signal and the Noise”, Nate Silver

Page 16: Find signal in noise

One more thing...

michalbrys.gitbooks.io/r-google-analytics/

Page 17: Find signal in noise

Q&A

Michal Brys
about.me/michal.brys
github.com/michalbrys