demystifying data science

Post on 16-Apr-2017

Category:

Data & Analytics

Demystifying Data Science

What does it mean in practice?

Jonathan Sedar, Principal Data Scientist, Applied AI Ltd

www.applied.ai @applied_ai @jonsedar

Applied AI is a Data Science Consultancy

We create a competitive advantage for financial services companies through applied artificial intelligence

• Know Your Customers

• Develop Your Market

• Manage Risk & Regulation

• Innovate & Experiment

• Streamline Operations

• Embed Data Science

Demystifying Data Science

• Motivations

• A Maturity Model

• An Ecosystem Model

• Practical Examples & Advice

Data Science

$> DATA.SCIENCE()

Intelligently Learning From Data

Extracting information from all that Big Data you're collecting

.. and the small stuff too

Discovering correlations, inferring patterns of behaviour ... and training models to predict outcomes

Running the business more effectively ... and systematising insights and products

How wonderful for you

Learning from data is nothing new

Most of our business is doing it already

• Trading & Quant Finance: Increase Revenue

• Process Optimisation: Reduce Costs

• Portfolio Risk Modelling: Manage Risk

• Reserves & Stress Testing: Meet Compliance

Learning from data benefits the whole business

Increase Revenue, Reduce Cost, Manage Risk, Meet Compliance

• tune risk profile

• understand the competition

• optimise business processes

• improve customer retention

• inform & adapt to regulatory change

• demonstrate leadership

• innovate product-market fit

• increase customer base

Data Science Maturity Model

Getting Started

• Identify new opportunities and useful data sources

• Basic modelling

• Senior leaders help to define & develop the business case

Sophisticated Analyses

• Hypothesis testing & data discovery

• Advanced statistics & predictive modelling

• Deliver immediate value, guide strategy

Business Operations

• Create ‘data products’, reports, new systems to embed change

• Replace legacy systems

• Build internal knowledge and skills

Full Capability Data Science

• Advanced data science is supported throughout the organisation and embedded in:

• Products & Services

• Senior Decision Making

• Business Administration

Getting Started

• Auto Insurer: “Help me price correctly”

• Extracted, cleaned, parsed data from messy internal & external sources

• Lightweight multidimensional analysis of customer base, including interactive dashboards

• Reports and strategic recommendations to board level, proving the need for further analysis

Sophisticated Analyses

• Life & Pensions: “Help me model my customer churn (a credit risk situation)”

• Sourced, cleaned, prepared internal & external data

• Created advanced time-to-event models using Bayesian statistics

• Churn modelling output identified key risk groups & potentially large new revenues and cost savings

Business Operations

• Asset Management Co: “Help me price real estate at the optimal market price”

• Sourced, cleaned, prepared data, undertook initial investigations and statistical modelling

• Created a price prediction “engine” within a microservice API, now used within daily operations

• Accurate estimates and reduced manual effort

Full Capability Data Science

• The holy grail!

• A centre of excellence guiding:

• Products

• Decision Making

• Business Administration

Data Science Ecosystem

Data Curation

• Making the right data available for modelling and maintaining it well.

• Garbage-in-garbage-out

• Getting to ‘good data’ is subtle

• 80% of the process

Machine Learning

• Learning from data

• The empirical practice at the heart of statistics.

• A machine (aka computer or model) is trained on a dataset to predict values

• Predict or infer real-world behaviours.

Business Integration

• Conventional business analysis lives and dies within spreadsheets & presentations

• Expensive dashboards require unstable data pipelines.

• Huge data warehouses and "lakes" are so complicated they're barely utilised.

• Business integration is hard, but critical

Three Stories of Data Science in Practice

Data Curation

Curating external datasets to better understand customers

Clustering Introspection Visualisation

We work mainly with insurance companies. They don’t have a reputation for being exciting

But from a data science point of view…

It’s quite interesting!

“Our term insurance policies are lapsing before they become profitable”

We modelled lapse using survival analysis (more on that later)

Along the way we noticed something…

The churn rate was sky-high in new estates

Geographic Effects

And Socioeconomic Effects

We could use these effects to:

• Identify lapse-prone customers

• More accurately price credit risk

• Identify new markets

… we’re not the first people to think of this

We can do it better and cheaper ourselves

First: geocode the customer base

Get lat/long based on address

Used Nominatim (FOSS, based on PostGIS) rather than Google, because …

Irish addresses are pathological!

Second: go shopping for socioeconomic data

• Irish census produced every 5 years

• 15 themes, 500+ features

• Captures almost everything about daily life

• Aggregated to ‘small areas’ of approx 200 households

Census themes

Theme  Subject
1      Sex, Age & Migration
2      Ethnicity & Language
3      Irish Language
4      Families
5      Private Households
6      Housing
7      Hospitals & Prisons
8      Principal Status
9      Social Class
10     Education
11     Commuting
12     Health
13     Occupation
14     Industries
15     PC & Internet

We could do what Experian does, and also:

• We would own the code

• We could integrate with any internal project

• We could tune it to fit our needs

Let’s take a look at the data

Not a trivial task…

What we have is a really big matrix

18,488 rows x 767 columns

• Data Compression

• Visualisation

• Clustering (unsupervised learning)

Data Compression

Singular Value Decomposition

Rotate and scale data into new frame of reference

Compress into fewer features while maintaining information

Compressed 500+ columns into 100
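The compression step can be sketched with a plain SVD. This is a minimal illustration on synthetic data, not the code used in the project: the function name `svd_compress`, the matrix sizes and the random data are all assumptions made for the example.

```python
import numpy as np

def svd_compress(X, n_components):
    """Project the rows of X onto the top n_components singular directions."""
    # Centre each column so the decomposition describes variation around the mean
    X_centred = X - X.mean(axis=0)
    # Thin SVD: X_centred = U @ diag(s) @ Vt, singular values sorted descending
    U, s, Vt = np.linalg.svd(X_centred, full_matrices=False)
    # Coordinates of every row in the new, lower-dimensional frame of reference
    Z = U[:, :n_components] * s[:n_components]
    # Share of the total variance kept by the retained components
    retained = (s[:n_components] ** 2).sum() / (s ** 2).sum()
    return Z, retained

# Synthetic stand-in for the census matrix (hypothetical sizes, not the real data)
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 5))    # 5 hidden factors
mixing = rng.normal(size=(5, 40))     # spread across 40 observed columns
X = latent @ mixing + 0.01 * rng.normal(size=(200, 40))

Z, retained = svd_compress(X, n_components=5)
print(Z.shape, round(float(retained), 4))
```

The retained-variance figure is a quick check on how much information survives the compression; the number of components to keep is a judgment call on real data.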

Data Visualisation

t-Distributed Stochastic Neighbor Embedding (t-SNE)

Visualise 100D in 2D space

View natural clustering in the data

Clustering

Hierarchical Agglomerative Clustering (Ward Clustering)

Progressively group nearby datapoints into larger clusters

Cut nested hierarchy of clusters to fit
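As a rough sketch of Ward clustering, the SciPy hierarchy routines can build the nested tree and then cut it. The three synthetic blobs below are an assumption standing in for the compressed small-area features, and the choice of three clusters is purely illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
# Three well-separated synthetic blobs (hypothetical data, 30 points each)
centres = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centres])

# Build the nested hierarchy: each step merges the pair of clusters that
# gives the smallest increase in within-cluster variance (Ward's criterion)
Z = linkage(X, method="ward")

# Cut the tree so that at most 3 flat clusters remain; labels start at 1
labels = fcluster(Z, t=3, criterion="maxclust")
print(np.bincount(labels)[1:])
```

Because the full hierarchy is kept in `Z`, the tree can be cut again at a different number of clusters without re-running the merge step.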

Interpreting the Clusters

…carefully

Now we can place each small area on a map

Using shapefiles and PostGIS

Dublin, Ireland 2011

Interactive dashboard showing each Small Area (200 people), plotted by location and cluster id

Data Curation

• A centralised, up-to-date, traceable, documented repository for structured text, tabular & image datasets

• Augment with public data to keep up with competitors and gain an edge

• Update, maintain and optimise your primary data sources to allow for high risk/reward POC projects

Machine Learning

Learning from data to predict outcomes and infer behaviours

• Supervised (classification, regression)

• Unsupervised (clustering, pattern matching)

• Reinforcement (behavioural rewards)

Hot new area, thus word soup

artificial intelligence machine intelligence statistical modelling

robotic process automation cognitive computing

deep learning …

Statistics <3 Machine Learning

Example 1: time to event modelling

“What’s our projected customer churn (and thus projected credit risk)?”

Supervised Regression

Basic idea: estimate this curve

• Counts: Kaplan-Meier

• Parametric (or semi-parametric) models: Exponential, Weibull, Cox PH Regression etc

• Time-varying coefficients: Piecewise, Aalen-Additive Regression etc
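The count-based Kaplan-Meier estimate of the survival curve is short enough to write out by hand. The sketch below assumes right-censored data; the function name and the policy durations are hypothetical, chosen only to show the mechanics.

```python
import numpy as np

def kaplan_meier(durations, observed):
    """Return (event_times, survival) for right-censored durations."""
    durations = np.asarray(durations, dtype=float)
    observed = np.asarray(observed, dtype=bool)
    event_times = np.unique(durations[observed])
    survival = []
    s = 1.0
    for t in event_times:
        at_risk = np.sum(durations >= t)               # still under observation just before t
        events = np.sum((durations == t) & observed)   # lapses recorded exactly at t
        s *= 1.0 - events / at_risk                    # conditional survival past t
        survival.append(s)
    return event_times, np.array(survival)

# Hypothetical policy durations in months; observed=0 means still in force (censored)
durations = [2, 3, 3, 5, 8, 8, 12]
observed = [1, 1, 0, 1, 1, 0, 0]

times, surv = kaplan_meier(durations, observed)
for t, s in zip(times, surv):
    print(f"month {t:.0f}: estimated survival {s:.3f}")
```

Censored policies never count as events, but they do stay in the at-risk set until their last observed month, which is what separates this from a naive lapse percentage.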

Sidenote: Bayesian Inference is perfect for time-based regression

Treat observed values as a realisation of a probability distribution

Big wins: capture prior knowledge, preserve uncertainty, model introspection and inference

Create predictions with qualified uncertainty: “credible regions”

Straightforward to extend models e.g. time-varying effects

Straightforward to make models robust e.g. outlier detection, mixture models
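To make the "credible region" idea concrete, here is a deliberately simple Bayesian sketch. It assumes a constant lapse hazard (an exponential time-to-event model) with a Gamma prior, which gives a closed-form posterior. The prior parameters and portfolio figures are hypothetical, and the models described in the talk are more advanced than this.

```python
import numpy as np

def posterior_hazard(n_events, exposure, prior_shape=1.0, prior_rate=1.0):
    """Gamma(shape, rate) posterior for a constant lapse hazard."""
    # With an exponential likelihood and right-censoring, only the event count
    # and the total time at risk enter the conjugate update
    return prior_shape + n_events, prior_rate + exposure

# Hypothetical portfolio: 30 lapses over 600 policy-years of exposure
shape, rate = posterior_hazard(n_events=30, exposure=600.0)

rng = np.random.default_rng(42)
samples = rng.gamma(shape, 1.0 / rate, size=100_000)  # numpy takes scale = 1 / rate

post_mean = shape / rate
lower, upper = np.percentile(samples, [2.5, 97.5])
print(f"posterior mean hazard: {post_mean:.4f} per year")
print(f"95% credible interval: ({lower:.4f}, {upper:.4f})")
```

The interval is a direct statement about the hazard given the data and the prior, which is the sense in which the uncertainty is preserved rather than collapsed to a point estimate.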

Example 2: topic modelling

“Can we learn the topics of conversation in broker communications?”

Unsupervised Clustering

NLP upon business data sources

After careful cleaning, anonymisation, preprocessing

Find the ‘topics’ of conversation: words that seem to co-occur

Use topics as a shortcut to categorise and correlate documents to activity

Create the communications graph: learn social & organisational structure

Design for interactive investigation
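The talk does not say which topic model was used. As a stand-in, the sketch below uses latent semantic analysis (a truncated SVD of the document-term counts), a simpler linear relative of topic models such as LDA, to show how co-occurring words group into "topics". The vocabulary and counts are hypothetical; with real text the counts would come from the cleaned, tokenised messages.

```python
import numpy as np

# Hypothetical vocabulary and document-term counts (rows = messages, columns = words)
vocab = ["renewal", "premium", "quote", "claim", "flood", "damage"]
counts = np.array([
    [2, 1, 1, 0, 0, 0],
    [1, 2, 1, 0, 0, 0],
    [1, 1, 2, 0, 0, 0],
    [0, 0, 0, 3, 2, 2],
    [0, 0, 0, 2, 3, 2],
    [0, 0, 0, 2, 2, 3],
], dtype=float)

n_topics = 2
U, s, Vt = np.linalg.svd(counts, full_matrices=False)

# Each row of Vt is a direction in word space; the words with the largest
# absolute weights are the ones that tend to co-occur
top_words = []
for k in range(n_topics):
    order = np.argsort(-np.abs(Vt[k]))[:3]
    top_words.append(sorted(vocab[i] for i in order))

# Assign each message to the topic on which it loads most heavily
doc_topic = np.argmax(np.abs(U[:, :n_topics] * s[:n_topics]), axis=1)

print(top_words)
print(doc_topic.tolist())
```

The per-message topic assignment is the "shortcut" for categorising documents; real corpora are far noisier and usually need a probabilistic model and careful preprocessing.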

Example 3: anomaly detection

“Can we spot fraudulent activity in claims?”

Un / Supervised Learning

Supervised Learning: function estimation

Classification: Logistic Regression, Neural / Deep Nets, Trees, Random Forests

Regression: Linear, Non-Linear, Time-Series

Unsupervised Learning: pattern finding

Clustering, distance measures, topologies

Feature engineering is critical

Understand the data shape, size, behaviours and the processes that generated it
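For the fraud question, one simple unsupervised starting point is a distance-from-typical score. The sketch below uses a robust z-score based on the median and the median absolute deviation (MAD). The claim amounts and the 3.5 cut-off are assumptions for illustration, not values from the talk.

```python
import numpy as np

def robust_z_scores(values):
    """Distance from the median in units of a robust spread estimate."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))
    # 1.4826 rescales the MAD to match the standard deviation for normal data;
    # this sketch assumes mad > 0 (not all values identical)
    return (values - median) / (1.4826 * mad)

# Hypothetical claim amounts, one of them far from the rest
claims = [1200, 950, 1100, 1300, 1050, 9800, 1150, 1000]
scores = robust_z_scores(claims)

threshold = 3.5  # illustrative cut-off, to be tuned on real data
flagged = [c for c, z in zip(claims, scores) if abs(z) > threshold]
print(flagged)
```

Using the median and MAD rather than the mean and standard deviation keeps the extreme claim from inflating the very yardstick used to judge it. A flagged claim is a candidate for review, not proof of fraud.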

Machine Learning

• Sophisticated statistical techniques, good software dev practices and research-grade, open-source software

• Document and share knowledge to become technical centre of excellence

• Validate, test, review & maintain your data pipelines, software and models to mitigate risk and allow for audit

Business Integration

Learning from data benefits the whole business

Increase Revenue, Reduce Cost, Manage Risk, Meet Compliance

• tune risk profile

• understand the competition

• optimise business processes

• improve customer retention

• inform & adapt to regulatory change

• demonstrate leadership

• innovate product-market fit

• increase customer base

How to integrate data science into business activities?

Tooling

Open Source

Reproducibility and Documentation

Wider Communication

APIs and Integration

The Team

Data scientist skill set

Drew Conway’s (in)famous Venn Diagram

Not so different from a software development team

Communicate

Iterate

and another thing…

The practice of data science can offer powerful insight and prediction…

… it’s only a model

Business Integration

• Clear path from model inference and predictions to the extrapolation of business actions and impacts

• Communicate results with non-technical stakeholders via engaging dashboards and visualisations

• Integrate an automated, live, on-demand prediction service with business systems

Using a “Data Science” approach:

• Motivations

• A Maturity Model

• An Ecosystem Model

• Practical Examples & Advice

Learning from data benefits the whole business

Increase Revenue, Reduce Cost, Manage Risk, Meet Compliance

• tune risk profile

• understand the competition

• optimise business processes

• improve customer retention

• inform & adapt to regulatory change

• demonstrate leadership

• innovate product-market fit

• increase customer base

Further reading

Blogs with good technical articles, insights etc:

• http://blog.applied.ai

• http://www.magesblog.com

• https://planet.scipy.org

• http://andrewgelman.com

• http://blog.kaggle.com

Books / technical articles:

• https://www.oreilly.com/ideas/what-is-hardcore-data-science-in-practice

• http://www.oreilly.com/data/free/ten-signs-of-data-science-maturity.csp

• Machine Learning for Hackers: http://shop.oreilly.com/product/0636920018483.do

Thank you

www.applied.ai @applied_ai @jonsedar
