Demystifying Data Science What does it mean in practice? Jonathan Sedar Principal Data Scientist Applied AI Ltd www.applied.ai @applied_ai @jonsedar

Upload: jonathan-sedar

Post on 16-Apr-2017


Page 1: Demystifying Data Science

Demystifying Data Science
What does it mean in practice?

Jonathan Sedar
Principal Data Scientist
Applied AI Ltd

www.applied.ai @applied_ai @jonsedar

Page 2: Demystifying Data Science

Applied AI is a Data Science Consultancy

We create a competitive advantage for financial services companies through applied artificial intelligence


Page 3: Demystifying Data Science

• Know Your Customers
• Develop Your Market
• Manage Risk & Regulation
• Innovate & Experiment
• Streamline Operations
• Embed Data Science

Page 4: Demystifying Data Science

Demystifying Data Science

• Motivations
• A Maturity Model
• An Ecosystem Model
• Practical Examples & Advice

Page 5: Demystifying Data Science

Data Science

Page 6: Demystifying Data Science

$> DATA.SCIENCE()

Page 7: Demystifying Data Science
Page 8: Demystifying Data Science

Intelligently Learning From Data

Page 9: Demystifying Data Science

Extracting information from all that Big Data you're collecting

.. and the small stuff too

Page 10: Demystifying Data Science

Discovering correlations, inferring patterns of behaviour ... and training models to predict outcomes

Page 11: Demystifying Data Science

Running the business more effectively ... and systematising insights and products

Page 12: Demystifying Data Science

How wonderful for you

Page 13: Demystifying Data Science

Learning from data is nothing new

Page 14: Demystifying Data Science

Most of our business is doing it already

Page 15: Demystifying Data Science

Trading & Quant Finance Increase Revenue

Page 16: Demystifying Data Science

Process Optimisation Reduce Costs

Page 17: Demystifying Data Science

Portfolio Risk Modelling Manage Risk

Page 18: Demystifying Data Science

Reserves & Stress Testing Meet Compliance

Page 19: Demystifying Data Science

Learning from data benefits the whole business

Increase Revenue
• innovate product-market fit
• increase customer base

Reduce Cost
• optimise business processes
• improve customer retention

Manage Risk
• tune risk profile
• understand the competition

Meet Compliance
• inform & adapt to regulatory change
• demonstrate leadership

Page 20: Demystifying Data Science
Page 21: Demystifying Data Science

Data Science Maturity Model

Page 22: Demystifying Data Science
Page 23: Demystifying Data Science

Getting Started
• Identify new opportunities and useful data sources
• Basic modelling
• Senior leaders help to define & develop the business case

Sophisticated Analyses
• Hypothesis testing & data discovery
• Advanced statistics & predictive modelling
• Deliver immediate value, guide strategy

Business Operations
• Create ‘data products’, reports, new systems to embed change
• Replace legacy systems
• Build internal knowledge and skills

Full Capability Data Science
• Advanced data science is supported throughout the organisation and embedded in:
  • Products & Services
  • Senior Decision Making
  • Business Administration

Page 24: Demystifying Data Science

Getting Started

• Auto Insurer: “Help me price correctly”

• Extracted, cleaned, parsed data from messy internal & external sources

• Lightweight multidimensional analysis of customer base, including interactive dashboards

• Reports and strategic recommendations to board level, proving the need for further analysis

Page 25: Demystifying Data Science

Sophisticated Analyses

• Life & Pensions: “Help me model my customer churn (a credit risk situation)”

• Sourced, cleaned, prepared internal & external data

• Created advanced time-to-event models using Bayesian statistics

• Churn modelling output identified key risk groups & potentially large new revenues and cost savings

Page 26: Demystifying Data Science

Business Operations

• Asset Management Co: “Help me price real estate at the optimal market price”

• Sourced, cleaned, prepared data, undertook initial investigations and statistical modelling

• Created a price prediction “engine” within a microservice API, now used within daily operations

• Accurate estimates and reduced manual effort

Page 27: Demystifying Data Science

Full Capability Data Science

• The holy grail!

• A centre of excellence guiding:

• Products

• Decision Making

• Business Administration

Page 28: Demystifying Data Science
Page 29: Demystifying Data Science

Data Science Ecosystem

Page 30: Demystifying Data Science
Page 31: Demystifying Data Science

Data Curation

• Making the right data available for modelling and maintaining it well.

• Garbage-in-garbage-out

• Getting to ‘good data’ is subtle

• 80% of the process

Page 32: Demystifying Data Science

Machine Learning

• Learning from data

• The empirical practice at the heart of statistics.

• A machine (aka computer or model) is trained on a dataset to predict values

• Predict or infer real-world behaviours.

Page 33: Demystifying Data Science

Business Integration

• Conventional business analysis lives and dies within spreadsheets & presentations

• Expensive dashboards require unstable data pipelines.

• Huge data warehouses and "lakes" are so complicated they're barely utilised.

• Business integration is hard, but critical

Page 34: Demystifying Data Science
Page 35: Demystifying Data Science

Three Stories of Data Science in Practice

Page 36: Demystifying Data Science

Data Curation

Page 37: Demystifying Data Science

Curating external datasets to better understand customers

Clustering Introspection Visualisation

Page 38: Demystifying Data Science

We work mainly with insurance companies

They don’t have a reputation for being exciting

But from a data science point of view…

Page 39: Demystifying Data Science

It’s quite interesting!

Page 40: Demystifying Data Science

“Our term insurance policies are lapsing before they become profitable”

Page 41: Demystifying Data Science

We modelled lapse using survival analysis (more of which later)

Along the way we noticed something…

Page 42: Demystifying Data Science

The churn rate was sky-high in new estates

Page 43: Demystifying Data Science

Geographic Effects

Page 44: Demystifying Data Science

And Socioeconomic Effects

Page 45: Demystifying Data Science

We could use these effects to:

• Identify lapse-prone customers
• More accurately price credit risk
• Identify new markets

Page 46: Demystifying Data Science

… we’re not the first people to think of this

Page 47: Demystifying Data Science

We can do it better and cheaper ourselves

Page 48: Demystifying Data Science

First: geocode the customer base

Get lat/long based on address

Used Nominatim (FOSS, based on PostGIS) rather than Google, because …

Page 49: Demystifying Data Science

Irish addresses are pathological!
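A minimal sketch of what the geocoding step could look like. The talk names Nominatim but does not say how it was called, so this assumes the public `/search` endpoint with `format=json`. The base URL, the user-agent string and the helper names `build_search_url`, `parse_first_hit` and `geocode` are illustrative assumptions, and a self-hosted instance would use its own base URL.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the public Nominatim endpoint; swap in your own instance's URL if self-hosting.
NOMINATIM_URL = "https://nominatim.openstreetmap.org/search"


def build_search_url(address, country_code="ie"):
    """Build a Nominatim search URL for a free-text address."""
    params = {"q": address, "format": "json", "limit": 1, "countrycodes": country_code}
    return NOMINATIM_URL + "?" + urllib.parse.urlencode(params)


def parse_first_hit(payload):
    """Return (lat, lon) as floats from a Nominatim JSON response, or None if no match."""
    hits = json.loads(payload)
    if not hits:
        return None
    return float(hits[0]["lat"]), float(hits[0]["lon"])


def geocode(address, user_agent="example-geocoder/0.1"):
    """Look up one address over HTTP (hypothetical user agent; identify your own app)."""
    request = urllib.request.Request(
        build_search_url(address), headers={"User-Agent": user_agent}
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return parse_first_hit(response.read().decode("utf-8"))
```

Separating URL building and response parsing from the network call keeps the messy part (address cleaning and retries for unmatched addresses) testable without hitting the service.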

Page 50: Demystifying Data Science

Second: go shopping for socioeconomic data

Irish census produced every 5 years
15 themes, 500+ features

Captures almost everything about daily life
Aggregated to ‘small areas’ approx 200 households

Page 51: Demystifying Data Science

Census themes

Theme  Subject
1      Sex, Age & Migration
2      Ethnicity & Language
3      Irish Language
4      Families
5      Private Households
6      Housing
7      Hospitals & Prisons
8      Principal Status
9      Social Class
10     Education
11     Commuting
12     Health
13     Occupation
14     Industries
15     PC & Internet

Page 52: Demystifying Data Science

We could do what Experian does, and also:

• We would own the code
• We could integrate with any internal project
• We could tune it to fit our needs

Page 53: Demystifying Data Science

Let's take a look at the data

Not a trivial task…
What we have is a really big matrix

18,488 rows x 767 columns

Page 54: Demystifying Data Science

• Data Compression
• Visualisation
• Clustering (unsupervised learning)

Page 55: Demystifying Data Science

Data Compression

Singular Value Decomposition

Rotate and scale data into new frame of reference

Compress into fewer features while maintaining information

Compressed 500+ columns into 100
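The compression step can be sketched with NumPy's SVD. This is an illustration under stated assumptions rather than the code used in the project: the data is a small synthetic matrix standing in for the census table, and `svd_compress` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the census matrix: 200 rows x 50 columns driven by 5 hidden factors.
latent = rng.normal(size=(200, 5))
loadings = rng.normal(size=(5, 50))
X = latent @ loadings + 0.01 * rng.normal(size=(200, 50))


def svd_compress(X, k):
    """Project mean-centred data onto its top-k right singular vectors."""
    X_centred = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X_centred, full_matrices=False)
    # Share of total variance retained by the first k components.
    explained = (s[:k] ** 2).sum() / (s ** 2).sum()
    return X_centred @ Vt[:k].T, explained


X_small, explained = svd_compress(X, k=5)
print(X_small.shape, round(float(explained), 4))
```

On real data the retained-variance figure is the thing to watch when choosing `k`: it says how much information survives the reduction to fewer columns.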

Page 56: Demystifying Data Science

Data Visualisation

t-Distributed Stochastic Neighbor Embedding (t-SNE)

Visualise 100D in 2D space

View natural clustering in the data

Page 57: Demystifying Data Science

Clustering

Hierarchical Agglomerative Clustering (Ward Clustering)

Progressively group nearby datapoints into larger clusters

Cut nested hierarchy of clusters to fit
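A small sketch of Ward clustering using SciPy's `linkage` and `fcluster`. The three synthetic blobs and the choice of three clusters are assumptions made for illustration; the talk does not say which library or how many clusters were used.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(1)

# Synthetic data: three well-separated blobs of 30 points each in 2D.
centres = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centres])

# Build the full merge hierarchy with Ward's minimum-variance criterion.
tree = linkage(points, method="ward")

# Cut the hierarchy so that at most three flat clusters remain.
labels = fcluster(tree, t=3, criterion="maxclust")
```

Because the whole hierarchy is kept in `tree`, the cut can be repeated at a different level without refitting, which suits the "cut to fit" step.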

Page 58: Demystifying Data Science

Interpreting the Clusters

Page 59: Demystifying Data Science

…carefully

Page 60: Demystifying Data Science

Now we can place each small area on a map

Using shapefiles and PostGIS

Page 61: Demystifying Data Science

Dublin, Ireland 2011

Page 62: Demystifying Data Science

Interactive dashboard showing each Small Area (200 people), plotted by location and cluster id

Page 63: Demystifying Data Science

Data Curation

• A centralised, up-to-date, traceable, documented repository for structured text, tabular & image datasets

• Augment with public data to keep up with competitors and gain an edge

• Update, maintain and optimise your primary data sources to allow for high risk/reward POC projects

Page 64: Demystifying Data Science

Machine Learning

Page 65: Demystifying Data Science

Learning from data to predict outcomes and infer behaviours

• Supervised (classification, regression)
• Unsupervised (clustering, pattern matching)
• Reinforcement (behavioural rewards)

Page 66: Demystifying Data Science

Hot new area, thus word soup

artificial intelligence machine intelligence statistical modelling

robotic process automation cognitive computing

deep learning …

Page 67: Demystifying Data Science

Statistics <3 Machine Learning

Page 68: Demystifying Data Science

Example 1: time to event modelling

“What’s our projected customer churn (and thus projected credit risk)?”

Supervised Regression

Page 69: Demystifying Data Science

Basic idea: estimate this curve

Page 70: Demystifying Data Science

Counts: Kaplan Meier
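The count-based estimate can be written in a few lines of plain Python. This is a sketch of the standard Kaplan-Meier product-limit estimator on made-up data; the function name and the toy durations are illustrative only.

```python
from collections import Counter


def kaplan_meier(durations, observed):
    """Return [(time, survival probability)] at each time where an event occurred.

    durations: time each customer was followed; observed: True if they lapsed,
    False if they were still active when observation ended (right-censored).
    """
    events = Counter(t for t, e in zip(durations, observed) if e)
    exits = Counter(durations)  # everyone leaving the risk set at t, lapsed or censored
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = events.get(t, 0)
        if d:
            # Multiply by the fraction of the at-risk group that survived time t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve


# Toy data: months until lapse for six policies, two of them censored.
durations = [2, 3, 3, 5, 8, 8]
observed = [True, True, False, True, True, False]
curve = kaplan_meier(durations, observed)
```

Censored customers still count in the at-risk denominator up to the time they leave, which is what distinguishes this from a naive "fraction lapsed" calculation.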

Page 71: Demystifying Data Science

Parametric (or semi-parametric) models

Exponential, Weibull, Cox PH Regression etc
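As a sketch of the simplest parametric case, the exponential model assumes a constant hazard rate, and with right-censored data its maximum-likelihood estimate is the number of events divided by the total time observed. The helper names and toy data below are assumptions for illustration; Weibull and Cox models need an optimiser or a dedicated library.

```python
import math


def fit_exponential(durations, observed):
    """Maximum-likelihood hazard rate for an exponential model with right-censoring."""
    n_events = sum(1 for e in observed if e)
    total_time = sum(durations)
    return n_events / total_time


def survival(t, rate):
    """Probability of surviving beyond time t under a constant hazard."""
    return math.exp(-rate * t)


# Toy data: months observed for six policies, two of them censored.
durations = [2, 3, 3, 5, 8, 8]
observed = [True, True, False, True, True, False]
rate = fit_exponential(durations, observed)
```

Here `survival(12, rate)` would give the modelled probability that a policy is still in force after twelve months.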

Page 72: Demystifying Data Science

Time-varying coefficients

Piecewise, Aalen-Additive Regression etc

Page 73: Demystifying Data Science

Sidenote: Bayesian Inference is perfect for time-based regression

Page 74: Demystifying Data Science

Treat observed values as a realisation of a probability distribution

Page 75: Demystifying Data Science

Big wins: capture prior knowledge, preserve uncertainty, model introspection and inference

Page 76: Demystifying Data Science

Create predictions with qualified uncertainty: “credible regions”
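To make "credible regions" concrete, here is a minimal sketch using a conjugate Gamma prior on the hazard rate of an exponential lapse model. The prior parameters, the toy data and the 95% level are all assumptions for illustration; the models described in the talk are more elaborate and would typically need a probabilistic programming library.

```python
import math
import random

random.seed(42)

# Toy data as (months observed, lapsed?) pairs; False means right-censored.
data = [(2, True), (3, True), (3, False), (5, True), (8, True), (8, False)]
n_events = sum(1 for _, lapsed in data if lapsed)
total_time = sum(months for months, _ in data)

# Assumed weakly informative prior on the hazard rate: Gamma(shape=1, rate=1).
prior_shape, prior_rate = 1.0, 1.0

# Conjugate update for an exponential likelihood with right-censoring.
post_shape = prior_shape + n_events
post_rate = prior_rate + total_time

# random.gammavariate takes (shape, scale), so scale = 1 / rate.
draws = sorted(random.gammavariate(post_shape, 1.0 / post_rate) for _ in range(20000))
lower, upper = draws[int(0.025 * len(draws))], draws[int(0.975 * len(draws))]

# Survival decreases as the hazard rises, so the bounds swap when transformed.
surv_12_interval = (math.exp(-12 * upper), math.exp(-12 * lower))
```

The result is a range for the 12-month survival probability rather than a single number, so the uncertainty from having only six observations stays visible in the prediction.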

Page 77: Demystifying Data Science

Straightforward to extend models e.g. time-varying effects

Page 78: Demystifying Data Science

Straightforward to make models robust e.g. outlier detection, mixture models

Page 79: Demystifying Data Science

Example 2: topic modelling

“Can we learn the topics of conversation in broker communications?”

Unsupervised Clustering

Page 80: Demystifying Data Science

NLP upon business data sources

Page 81: Demystifying Data Science

After careful cleaning, anonymisation, preprocessing

Page 82: Demystifying Data Science

Find the ‘topics’ of conversation

Words that seem to co-occur
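The intuition of "words that seem to co-occur" can be shown with a plain co-occurrence count. This is only a building block and not a topic model: the talk does not name the algorithm used, and methods such as LDA go further by inferring a mixture of topics per document. The sample messages and the function name are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical, already-cleaned messages; real input would be anonymised broker communications.
documents = [
    "renewal quote motor policy",
    "motor policy claim accident",
    "pension transfer fund",
    "fund pension contribution",
    "claim accident motor",
]


def cooccurrence_counts(docs):
    """Count how many documents each unordered pair of distinct words shares."""
    counts = Counter()
    for doc in docs:
        words = sorted(set(doc.split()))
        counts.update(combinations(words, 2))
    return counts


pairs = cooccurrence_counts(documents)
top_pairs = pairs.most_common(3)
```

Even on this toy input, the motor-insurance words and the pension words pair up among themselves and never with each other, which is the signal a topic model formalises.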

Page 83: Demystifying Data Science

Use topics as a shortcut to categorise and correlate documents to activity

Page 84: Demystifying Data Science

Create the communications graph

Learn social & organisational structure

Page 85: Demystifying Data Science

Design for interactive investigation

Page 86: Demystifying Data Science

Example 3: anomaly detection

“Can we spot fraudulent activity in claims?”

Un / Supervised Learning
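As a sketch of the unsupervised end of anomaly detection, the snippet below flags values that sit far from the median using the median absolute deviation (MAD). The technique, the 3.5 threshold and the claim amounts are assumptions chosen for illustration; the talk does not specify the method used, and real fraud detection would work on many engineered features rather than one amount.

```python
import statistics


def mad_outliers(values, threshold=3.5):
    """Flag values whose modified z-score exceeds the threshold.

    Uses the median and the median absolute deviation (MAD), which are less
    distorted by extreme values than the mean and standard deviation.
    """
    median = statistics.median(values)
    mad = statistics.median(abs(v - median) for v in values)
    if mad == 0:
        return [False] * len(values)
    # 0.6745 rescales MAD so the score is comparable to a standard z-score.
    return [abs(0.6745 * (v - median) / mad) > threshold for v in values]


# Hypothetical claim amounts; the last one is deliberately extreme.
claims = [1200, 950, 1100, 1300, 1050, 990, 1150, 25000]
flags = mad_outliers(claims)
```

A flagged claim is a candidate for review, not proof of fraud; where labelled fraud cases exist, a supervised classifier can be trained on top of signals like this one.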

Page 87: Demystifying Data Science

Supervised Learning: function estimation

Classification: Log. Reg, Neural / Deep Nets, Trees, Random Forests

Regression: Linear, Non-Linear, Time-Series

Page 88: Demystifying Data Science

Unsupervised Learning: pattern finding

Clustering, distance measures, topologies

Page 89: Demystifying Data Science

Feature engineering is critical

Understand the data shape, size, behaviours and the processes that generated it

Page 90: Demystifying Data Science

Machine Learning

• Sophisticated statistical techniques, good software dev practices and research-grade, open-source software

• Document and share knowledge to become technical centre of excellence

• Validate, test, review & maintain your data pipelines, software and models to mitigate risk and allow for audit

Page 91: Demystifying Data Science

Business Integration

Page 92: Demystifying Data Science

Learning from data benefits the whole business

Increase Revenue
• innovate product-market fit
• increase customer base

Reduce Cost
• optimise business processes
• improve customer retention

Manage Risk
• tune risk profile
• understand the competition

Meet Compliance
• inform & adapt to regulatory change
• demonstrate leadership

Page 93: Demystifying Data Science

How to integrate data science into business activities?

Page 94: Demystifying Data Science

Tooling

Page 95: Demystifying Data Science

Open Source

Page 96: Demystifying Data Science

Reproducibility and Documentation

Page 97: Demystifying Data Science

Wider Communication

Page 98: Demystifying Data Science

APIs and Integration

Page 99: Demystifying Data Science

The Team

Page 100: Demystifying Data Science

Data scientist skill set

Drew Conway’s (in)famous Venn Diagram

Page 101: Demystifying Data Science

Not so different from a software development team

Page 102: Demystifying Data Science

Communicate

Page 103: Demystifying Data Science

Iterate

Page 104: Demystifying Data Science

and another thing…

Page 105: Demystifying Data Science

The practice of data science can offer powerful insight and prediction…

Page 106: Demystifying Data Science

… it’s only a model

Page 107: Demystifying Data Science

Business Integration

• Clear path from model inference and predictions to the extrapolation of business actions and impacts

• Communicate results with non-technical stakeholders via engaging dashboards and visualisations

• Integrate an automated, live, on-demand prediction service with business systems
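An on-demand prediction service can be sketched with the Python standard library alone. Everything here is a hypothetical illustration: the feature names, the hand-set formula in `predict_price` (a stand-in for a trained model) and the port are assumptions, and a production service would add authentication, input validation and logging.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict_price(features):
    """Placeholder model: a hand-set linear formula standing in for a trained one."""
    return 50000.0 + 2500.0 * features["floor_area_m2"] + 10000.0 * features["bedrooms"]


def handle_payload(body):
    """Turn a JSON request body into (HTTP status, JSON response body)."""
    try:
        features = json.loads(body)
        prediction = predict_price(features)
    except (ValueError, KeyError, TypeError):
        return 400, json.dumps({"error": "expected JSON with floor_area_m2 and bedrooms"})
    return 200, json.dumps({"predicted_price": prediction})


class PredictionHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read exactly the number of bytes the client said it sent.
        length = int(self.headers.get("Content-Length", 0))
        status, payload = handle_payload(self.rfile.read(length))
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload.encode("utf-8"))


def serve(port=8000):
    """Start the service; call this explicitly, it blocks until interrupted."""
    HTTPServer(("127.0.0.1", port), PredictionHandler).serve_forever()
```

Keeping the model call in `handle_payload`, separate from the HTTP plumbing, means the same function can be tested directly or reused behind a different web framework.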

Page 108: Demystifying Data Science

Using a “Data Science” approach:

• Motivations
• A Maturity Model
• An Ecosystem Model
• Practical Examples & Advice

Page 109: Demystifying Data Science

Learning from data benefits the whole business

Increase Revenue
• innovate product-market fit
• increase customer base

Reduce Cost
• optimise business processes
• improve customer retention

Manage Risk
• tune risk profile
• understand the competition

Meet Compliance
• inform & adapt to regulatory change
• demonstrate leadership

Page 110: Demystifying Data Science
Page 111: Demystifying Data Science
Page 112: Demystifying Data Science

Further reading

• Blogs with good technical articles, insights etc
  • http://blog.applied.ai
  • http://www.magesblog.com
  • https://planet.scipy.org
  • http://andrewgelman.com
  • http://blog.kaggle.com

• Books / technical articles
  • https://www.oreilly.com/ideas/what-is-hardcore-data-science-in-practice
  • http://www.oreilly.com/data/free/ten-signs-of-data-science-maturity.csp
  • Machine Learning for Hackers http://shop.oreilly.com/product/0636920018483.do

Page 113: Demystifying Data Science

Thank you

www.applied.ai @applied_ai @jonsedar