guerrilla analytics - introduction and case study


Upload: enda-ridge

Post on 06-Aug-2015


#GuerrillaAnalytics http://guerrilla-analytics.net 1

Guerrilla Analytics
Introduction and Case Study
Enda Ridge, PhD

Copyright Enda Ridge 2015


What we are told about Data Science

“the sexy job in the next 10 years will be statisticians”

“Data Scientist: The Sexiest Job of the 21st Century”

“Information is the oil of the 21st century, and analytics is the combustion engine.”

http://www.gapminder.org/
http://www.statistics.com/data-science-quotes/
https://github.com/mbostock/d3/wiki/Gallery


Hi, we need an update on the insurance policy classification work. It’s going to the Head of Underwriting this afternoon.

Um. Which work? I think Jo did that, but Jo’s on holidays.

I’ll check my mailbox and send you my spreadsheet from last week. Err... the population changed with the extra system extract on Tuesday. And we added a bunch of business rules to accommodate that... so we can’t go back to the earlier numbers.

The Reality


Those were the droids I was looking for ...


My Journey to Guerrilla Analytics

Mechanical Engineer

PhD Computer Science

Boutique Consultancy

Forensic Data Analytics

Senior Manager

Constraints
Constraints + Dynamic, Reproducible
Constraints + Dynamic, Reproducible + Tested
Constraints + Dynamic, Reproducible + Tested + Audit


Common format

Data Analytics Insight


Misconception

Shearer C., The CRISP-DM model: the new blueprint for data mining, J Data Warehousing (2000); 5:13-22


Reality is Guerrilla Analytics

Data
• Extraction
• Receipt
• Loading

Analytics
• Transform
• Algorithms
• Consolidate

Insight
• Reporting
• Work Products

Disruptions


Maintain Data Provenance


Maintaining Data Provenance mitigates disruptions

7 Principles of Guerrilla Analytics

1. Space is cheap, confusion is expensive
2. Prefer simple, visual project structures
3. Prefer automation with program code
4. Link data on the file system, analytics environment, and work products
5. Version control data and code
6. Consolidate team knowledge in builds
7. Prefer code that runs end to end

~100 practice tips


Guerrilla Analytics Case Study


Client: Retail Bank

Situation
• Error in credit card customer mailing process
• Failure to comply with regulations, potential fines

Mission
• Understand system landscape & get the right data
• Rebuild full customer history
• Identify system errors and start of non-compliance
• Quantify affected customers and cost to bank

Timeline: 6-8 weeks


System Landscape

• Customer Contact
• Card System 1
• Card System 2
• Collections
• Manual Intervention


Data Receipt


Guerrilla Analytics Environment

• Lost Data
• Multiple Copies of data
• Limited supporting information
• Local copies of data
• Renamed data
• ä~ delimited data


Data Receipt


Guerrilla Analytics Approach

• Have 1 Data location
• Data Unique Identifiers
• Data log
• Supporting material near data
• Never modify the data
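
The data receipt practices above could be sketched in Python roughly as follows. This is an illustration only, not the author's tooling: the function `register_delivery`, the log file name `data_log.csv`, and the log columns are all hypothetical, and the identifier format simply mirrors the D-numbers shown on the data load slides.

```python
import csv
import hashlib
import shutil
import stat
from datetime import date
from pathlib import Path


def register_delivery(source: Path, data_root: Path, data_id: str, note: str) -> Path:
    """Copy a received file, unchanged, into data_root/<data_id>/ and log it."""
    target_dir = data_root / data_id
    # Refuse to reuse an identifier: every delivery gets its own folder.
    target_dir.mkdir(parents=True, exist_ok=False)
    target = target_dir / source.name  # keep the original file name
    shutil.copy2(source, target)
    # Mark the stored copy read-only so it is never modified in place.
    target.chmod(stat.S_IREAD)

    checksum = hashlib.sha256(target.read_bytes()).hexdigest()
    log_path = data_root / "data_log.csv"  # hypothetical log location
    write_header = not log_path.exists()
    with log_path.open("a", newline="") as handle:
        writer = csv.writer(handle)
        if write_header:
            writer.writerow(["data_id", "file_name", "received", "sha256", "note"])
        writer.writerow([data_id, source.name, date.today().isoformat(), checksum, note])
    return target
```

A call such as `register_delivery(Path("incoming/FNU810A"), Path("Data"), "D040", "card system extract")` would place the file under `Data/D040/` and append one row to the log. The checksum is there so a later replacement of the file can be detected.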


Data Load

File System
• Crazy-name spreadsheet 1
• Crazy-name spreadsheet 2
• Crazy-name spreadsheet 3
• FNU810A
• long_named_file_v0.2.1.pdf

Analytics Environment
• Credit_Card_Samples
• DBO.Accounts
• Customer_Letters

Guerrilla Environment
• Renamed files
• Scattered inconsistent locations
• Multiple versions of files
• Replacements of files


Data Load

Data (folders D010, D026, D040, D051)
• Crazy-name spreadsheet 1
• Crazy-name spreadsheet 2
• FNU810A
• long_named_file_v0.2.1.pdf

Analytics Environment
• D010.Crazy-name spreadsheet 1
• D026.Crazy-name spreadsheet 2
• D040.FNU810A
• D051.long_named_file_v0.2.1.pdf

Guerrilla Analytics Approach
• One-to-one mapping from files to datasets
  – Keep crazy names
• Minimize prep work
• Put the Data Identifier in the path
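
One way to picture the one-to-one mapping from files to datasets is a small naming function. The sketch below is an assumption-laden illustration: the function name `dataset_name`, the `Data/<identifier>/<file>` folder layout, and the three-digit `D` identifier pattern are inferred from the slide labels, not taken from the author's code.

```python
import re
from pathlib import PurePosixPath

DATA_ID = re.compile(r"^D\d{3}$")  # assumed identifier format, e.g. D010


def dataset_name(file_path: str) -> str:
    """Derive '<data id>.<original file name>' from a path under the data folder."""
    path = PurePosixPath(file_path)
    # The data identifier is taken from the folder that holds the file.
    data_id = path.parent.name
    if not DATA_ID.match(data_id):
        raise ValueError(f"no data identifier in path: {file_path}")
    # Keep the original ("crazy") name untouched so file and dataset map one-to-one.
    return f"{data_id}.{path.name}"


print(dataset_name("Data/D040/FNU810A"))
```

Because the dataset name is computed from the path rather than typed by hand, anyone can trace a dataset in the analytics environment back to the exact file it was loaded from.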


Analytics: Guerrilla Analytics Environment


My Documents/Transactions
• Accounts_Formatted.SQL
• TransProf_FINAL.R
• Trans_DO_NOT_USE.R
• TransProf_v2.R
• Sample_accounts.SQL

• Many code files/languages
• Variety of output types
• Data manipulation
  – on file system
  – in analytics environment
• Combinations of tools
• Many users
• Many iterations


Analytics: Guerrilla Analytics Approach


• One folder for all team work products
• Give every work product an identifier
• Keep a work product log
• Clear running order of files
• No dead/orphaned files

Work_Products
• Work_products.xls
• WP_024
  – 010_Accounts_Cleaned.SQL
  – 030_Transaction_Profiles.R
  – 050_Sample_accounts.SQL
• WP_96
• WP_97
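
The "clear running order" and "no dead/orphaned files" tips lend themselves to an automated check. The following is a minimal sketch, assuming that every code file in a work product folder carries a three-digit numeric prefix as in the WP_024 example; the function name `running_order` is hypothetical.

```python
import re
from pathlib import Path

NUMBERED = re.compile(r"^(\d{3})_")  # assumed three-digit running-order prefix


def running_order(work_product_dir: Path) -> list[str]:
    """Return code file names in execution order; reject unnumbered files."""
    ordered = []
    for path in work_product_dir.iterdir():
        if path.is_dir():
            continue  # subfolders are not part of the running order
        match = NUMBERED.match(path.name)
        if match is None:
            # A file without a prefix has no place in the sequence: treat as orphaned.
            raise ValueError(f"file without running-order prefix: {path.name}")
        ordered.append((int(match.group(1)), path.name))
    return [name for _, name in sorted(ordered)]
```

Leaving gaps in the numbering (010, 030, 050) lets a new step be slotted in later without renaming the existing files.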


Analytics: Guerrilla Analytics Approach


• Keep older versions in a subfolder
• Keep related information in a subfolder

WP_024
• 010_Accounts_Cleaned.SQL
• 030_Transaction_Profiles.R
• 050_Sample_accounts.SQL
• supporting
• archive


Analytics: Guerrilla Analytics Approach

File System

WP_024
• 010_Accounts_Cleaned.SQL
• 030_Transaction_Profiles.R
• 050_Sample_accounts.SQL

Analytics Environment

WP_024.ACCOUNTS_CLEANED

WP_024.TRANSACTION_PROFILES

WP_024.SAMPLE_ACCOUNTS


Data Manipulation: Guerrilla Environment

Account_ID  Statement_ID  Min_Payment  Transaction_ID  Amount  Type
A           15            30.00        1               50.00   Expense
A           15            30.00        2               25.00   Expense
A           15            30.00        3               -75.00  Payment
A           15            30.00        4               20.00   Expense

ID   Stmnt_ID  Min_Payment  Balance  Min_Paym_Made
A    15        30.00        20.00    No
...  ...       ...


Data Manipulation: Guerrilla Analytics Approach


Account_ID  Statement_ID  Min_Payment  Transaction_ID  Amount  Type     RunningPayments  Min_Paym_Made
A           15            30.00        1               50.00   Expense  0.00             No
A           15            30.00        2               25.00   Expense  0.00             No
A           15            30.00        3               -75.00  Payment  75.00            Yes
A           15            30.00        4               20.00   Expense  75.00            Yes

Account_ID  Statement_ID  Min_Payment  Transaction_ID  Amount  Type
A           15            30.00        1               50.00   Expense
A           15            30.00        2               25.00   Expense
A           15            30.00        3               -75.00  Payment
A           15            30.00        4               20.00   Expense
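
The tables above add derived columns to every original row instead of collapsing the transactions into one summary row. A minimal Python sketch of that calculation follows. The function name `add_running_payments` and the tuple layout are assumptions made for illustration, and the rule that payments are the rows of type `Payment` with negative amounts is read off the example data.

```python
from collections import defaultdict
from decimal import Decimal

transactions = [
    # (account_id, statement_id, min_payment, transaction_id, amount, type)
    ("A", 15, Decimal("30.00"), 1, Decimal("50.00"), "Expense"),
    ("A", 15, Decimal("30.00"), 2, Decimal("25.00"), "Expense"),
    ("A", 15, Decimal("30.00"), 3, Decimal("-75.00"), "Payment"),
    ("A", 15, Decimal("30.00"), 4, Decimal("20.00"), "Expense"),
]


def add_running_payments(rows):
    """Append RunningPayments and Min_Paym_Made to every row, dropping none."""
    paid = defaultdict(Decimal)  # running total per (account, statement)
    result = []
    for row in sorted(rows, key=lambda r: (r[0], r[1], r[3])):
        account_id, statement_id, min_payment, _, amount, kind = row
        if kind == "Payment":
            paid[(account_id, statement_id)] += -amount  # payments are negative
        running = paid[(account_id, statement_id)]
        made = "Yes" if running >= min_payment else "No"
        result.append(row + (running, made))
    return result


for enriched in add_running_payments(transactions):
    print(enriched)
```

Every input row survives in the output, so a reviewer can see which transaction first satisfied the minimum payment rather than only a final flag.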


Reporting – what is a report?


Reporting – Guerrilla Analytics Environment


Reporting – Guerrilla Analytics Environment


Select min/max of transaction_time

WP_030

• 010_Late payments.SQL
• 030_Late payments.py

WP_042


Why consolidate?

[Diagram: Raw, Duplicates, Customers, Clean_Cust, Deduped, New_dupes, Work Product]


Why consolidate?

[Diagram: Raw, Duplicates, Customers, Clean_Cust, Deduped, New_dupes, Duplicates_02, Customers_02, Duplicates, Deduped, Clean_cust, New_dupes, Work Product]


Guerrilla Analytics Approach: Builds

[Diagram: Deduped, Clean_cust, New_dupes, Duplicates_02, Duplicates, Customers_02, Dupes_latest, Cust_Latest]

Raw, Latest, Clean, Rules, Interface

Version Controlled Code and Data


WP_030
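
A build that runs end to end might look something like the sketch below. The layer names Latest, Clean, Rules, and Interface come from the slide, but what each layer does here is purely an assumption for illustration: the deduplication rule, the record layout, and every function name are hypothetical.

```python
def latest(raw):
    """Keep only the most recent delivery (assumed meaning of the Latest layer)."""
    newest = max(version for version, _ in raw)
    return [record for version, record in raw if version == newest]


def clean(records):
    """Normalize names so later rules compare like with like."""
    return [{**r, "name": r["name"].strip().upper()} for r in records]


def rules(records):
    """Example business rule: flag repeated customer names as duplicates."""
    seen = set()
    flagged = []
    for r in records:
        flagged.append({**r, "is_duplicate": r["name"] in seen})
        seen.add(r["name"])
    return flagged


def interface(records):
    """Expose only deduplicated customers to work products."""
    return [r for r in records if not r["is_duplicate"]]


BUILD_STEPS = [latest, clean, rules, interface]  # fixed running order


def run_build(raw):
    """Run every layer in order so the result never depends on manual steps."""
    data = raw
    for step in BUILD_STEPS:
        data = step(data)
    return data
```

With the steps held in one ordered list under version control, a new data delivery means rerunning `run_build` rather than patching copies such as `Customers_02` by hand.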


Summary


A Realistic Workflow
• Guerrilla Analytics Principles
• Guerrilla Analytics Practice Tips

Case Study
• Data receipt and load
• Analytics
• Reporting and work products
• Consolidation with Builds

Why Data Science is Difficult
• Disruptions, Constraints
• These break Data Provenance

Those were the droids I was looking for ...


Keep in Touch!

@Enda_Ridge

http://guerrilla-analytics.net


Or contact me for 50% discount