Demystifying Data Science
What does it mean in practice?
Jonathan Sedar, Principal Data Scientist, Applied AI Ltd
www.applied.ai @applied_ai @jonsedar
Applied AI is a Data Science Consultancy
We create a competitive advantage for financial services companies through applied artificial intelligence
Know Your Customers
Develop Your Market
Manage Risk & Regulation
Innovate & Experiment
Streamline Operations
Embed Data Science
Demystifying Data Science
- Motivations
- A Maturity Model
- An Ecosystem Model
- Practical Examples & Advice
Data Science
$> DATA.SCIENCE()
Intelligently Learning From Data
Extracting information from all that Big Data you're collecting
.. and the small stuff too
Discovering correlations, inferring patterns of behaviour ... and training
models to predict outcomes
Running the business more effectively ... and systematising
insights and products
How wonderful for you
Learning from data is nothing new
Most of our business is doing it already
Trading & Quant Finance: Increase Revenue
Process Optimisation: Reduce Costs
Portfolio Risk Modelling: Manage Risk
Reserves & Stress Testing: Meet Compliance
Learning from data benefits the whole business
Increase Revenue
tune risk profile
understand the competition
optimise business processes
improve customer retention
inform & adapt to regulatory change
demonstrate leadership
innovate product-market fit
increase customer base
Reduce Cost
Manage Risk Meet Compliance
Data Science Maturity Model
Getting Started
• Identify new opportunities and useful data sources
• Basic modelling
• Senior leaders help to define & develop the business case
Sophisticated Analyses
• Hypothesis testing & data discovery
• Advanced statistics & predictive modelling
• Deliver immediate value, guide strategy
Business Operations
• Create ‘data products’, reports, new systems to embed change
• Replace legacy systems
• Build internal knowledge and skills
Full Capability Data Science
• Advanced data science is supported throughout the organisation and embedded in:
• Products & Services
• Senior Decision Making
• Business Administration
Getting Started
• Auto Insurer: “Help me price correctly”
• Extracted, cleaned, parsed data from messy internal & external sources
• Lightweight multidimensional analysis of customer base inc interactive dashboards
• Reports and strategic recommendations to board level, proving the need for further analysis
Sophisticated Analyses
• Life & Pensions: “Help me model my customer churn (a credit risk situation)”
• Sourced, cleaned, prepared internal & external data
• Created advanced time-to-event models using Bayesian statistics
• Churn modelling output identified key risk groups & potentially large new revenues and cost savings
Business Operations
• Asset Management Co: “Help me price real estate at the optimal market price”
• Sourced, cleaned, prepared data, undertook initial investigations and statistical modelling
• Created a price prediction “engine” within a microservice API, now used within daily operations
• Accurate estimates and reduced manual effort
Full Capability Data Science
• The holy grail!
• A centre of excellence guiding:
• Products
• Decision Making
• Business Administration
Data Science Ecosystem
Data Curation
• Making the right data available for modelling and maintaining it well.
• Garbage-in-garbage-out
• Getting to ‘good data’ is subtle
• 80% of the process
Machine Learning
• Learning from data
• The empirical practice at the heart of statistics.
• A machine (aka computer or model) is trained on a dataset to predict values
• Predict or infer real-world behaviours.
Business Integration
• Conventional business analysis lives and dies within spreadsheets & presentations
• Expensive dashboards require unstable data pipelines.
• Huge data warehouses and "lakes" are so complicated they're barely utilised.
• Business integration is hard, but critical
Three Stories of Data Science in Practice
Data Curation
Curating external datasets to better understand customers
Clustering Introspection Visualisation
We work mainly with insurance companies They don’t have a reputation for being exciting
But from a data science point of view…
It’s quite interesting!
“Our term insurance policies are lapsing before they become profitable”
We modelled lapse using survival analysis (more of which later)
Along the way we noticed something…
The churn rate was sky-high in new estates
Geographic Effects
And Socioeconomic Effects
We could use these effects to:
Identify lapse-prone customers
More accurately price credit risk
Identify new markets
… we’re not the first people to think of this
We can do it better and cheaper ourselves
First: geocode the customer base
Get lat/long based on address
Used Nominatim (FOSS, based on PostGIS) rather than Google, because …
Irish addresses are pathological!
Second: go shopping for socioeconomic data
Irish census produced every 5 years
15 themes, 500+ features
Captures almost everything about daily life
Aggregated to ‘small areas’ approx 200 households
Census themes
Theme  Subject
1      Sex, Age & Migration
2      Ethnicity & Language
3      Irish Language
4      Families
5      Private Households
6      Housing
7      Hospitals & Prisons
8      Principal Status
9      Social Class
10     Education
11     Commuting
12     Health
13     Occupation
14     Industries
15     PC & Internet
We could do what Experian does, and also:
We would own the code
We could integrate with any internal project
We could tune it to fit our needs
Let’s take a look at the data
Not a trivial task…
What we have is a really big matrix
18,488 rows x 767 columns
Data Compression Visualisation
Clustering (unsupervised learning)
Data Compression
Singular Value Decomposition
Rotate and scale data into new frame of reference
Compress into fewer features while maintaining information
Compressed 500+ columns into 100
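The compression step can be sketched roughly as below. This is a minimal illustration using NumPy's SVD, not the code used in the project: the function name `compress_with_svd`, the column centring and the synthetic matrix standing in for the census data are all assumptions made for the example.

```python
import numpy as np

def compress_with_svd(X, n_components):
    """Project the rows of X onto the top n_components singular directions."""
    # Centre each column so the decomposition describes variation around the mean.
    X_centred = X - X.mean(axis=0)
    # Thin SVD: X_centred = U @ diag(s) @ Vt, singular values sorted descending.
    U, s, Vt = np.linalg.svd(X_centred, full_matrices=False)
    # Coordinates of each row in the new, rotated and scaled frame of reference.
    X_compressed = U[:, :n_components] * s[:n_components]
    # Share of the total variance retained by the components we kept.
    retained = float((s[:n_components] ** 2).sum() / (s ** 2).sum())
    return X_compressed, retained

# Hypothetical stand-in for the census matrix: 200 rows, 50 correlated columns
# generated from 5 hidden factors plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 5))
X = latent @ rng.normal(size=(5, 50)) + 0.01 * rng.normal(size=(200, 50))

X_small, retained = compress_with_svd(X, n_components=5)
print(X_small.shape, round(retained, 3))
```

The retained-variance figure is one way to judge how many components to keep; the right cut-off for real data is a judgment call.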
Data Visualisation
t-Distributed Stochastic Neighbor Embedding (t-SNE)
Visualise 100D in 2D space
View natural clustering in the data
Clustering
Hierarchical Agglomerative Clustering (Ward Clustering)
Progressively group nearby datapoints into larger clusters
Cut nested hierarchy of clusters to fit
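To make the merging idea concrete, here is a deliberately naive sketch of Ward clustering in NumPy. It is an illustration only: the function name `ward_clusters` and the toy data are assumptions, and it stops merging once the requested number of clusters is reached, which has the same effect as cutting the hierarchy at that level. For real data a library routine such as SciPy's `linkage` with `method="ward"` would be the more sensible choice.

```python
import numpy as np

def ward_clusters(X, n_clusters):
    """Naive Ward agglomerative clustering; returns one label per row of X."""
    # Start with every point in its own cluster.
    members = {i: [i] for i in range(len(X))}
    centroids = {i: X[i].astype(float) for i in range(len(X))}
    while len(members) > n_clusters:
        best_pair, best_cost = None, np.inf
        ids = list(members)
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                na, nb = len(members[a]), len(members[b])
                gap = centroids[a] - centroids[b]
                # Ward criterion: increase in within-cluster sum of squares
                # that merging clusters a and b would cause.
                cost = (na * nb) / (na + nb) * float(gap @ gap)
                if cost < best_cost:
                    best_pair, best_cost = (a, b), cost
        a, b = best_pair
        na, nb = len(members[a]), len(members[b])
        # The merged centroid is the size-weighted mean of the two centroids.
        centroids[a] = (na * centroids[a] + nb * centroids[b]) / (na + nb)
        members[a].extend(members.pop(b))
        del centroids[b]
    labels = np.empty(len(X), dtype=int)
    for label, idx in enumerate(members.values()):
        labels[idx] = label
    return labels

# Hypothetical data: three well-separated blobs of ten points each.
rng = np.random.default_rng(1)
centres = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(10, 2)) for c in centres])
labels = ward_clusters(X, n_clusters=3)
print(labels)
```

The pairwise search makes this cubic in the number of points, so it only suits small examples.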
Interpreting the Clusters
…carefully
Now we can place each small area on a map
Using shapefiles and PostGIS
Dublin, Ireland 2011
Interactive dashboard showing each Small Area (200 people),
plotted by location and cluster id
Data Curation
• A centralised, up-to-date, traceable, documented repository for structured text, tabular & image datasets
• Augment with public data to keep up with competitors and gain an edge
• Update, maintain and optimise your primary data sources to allow for high risk/reward POC projects
Machine Learning
Learning from data to predict outcomes and infer behaviours
Supervised (classification, regression)
Unsupervised (clustering, pattern matching)
Reinforcement (behavioural rewards)
Hot new area, thus word soup
artificial intelligence machine intelligence statistical modelling
robotic process automation cognitive computing
deep learning …
Statistics <3 Machine Learning
Example 1: time to event modelling
“What’s our projected customer churn (and thus projected credit risk)?”
Supervised Regression
Basic idea: estimate this curve
Counts: Kaplan Meier
Parametric (or semi-parametric) models: Exponential, Weibull, Cox PH Regression etc
Time-varying coefficients: Piecewise, Aalen-Additive Regression etc
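The count-based starting point, the Kaplan-Meier estimate of the survival curve, is simple enough to sketch in plain Python. The function name `kaplan_meier` and the six example policies are hypothetical; this is a minimal sketch rather than the model used in the project.

```python
def kaplan_meier(durations, observed):
    """Return [(time, survival_probability)] at each distinct event time."""
    # Pair each duration with its event flag and walk through time in order.
    records = sorted(zip(durations, observed))
    at_risk = len(records)
    survival = 1.0
    curve = []
    i = 0
    while i < len(records):
        t = records[i][0]
        events = leaving = 0
        # Gather every record that ends at time t.
        while i < len(records) and records[i][0] == t:
            events += int(records[i][1])
            leaving += 1
            i += 1
        if events:
            # Product-limit step: multiply by the fraction surviving past t.
            survival *= 1.0 - events / at_risk
            curve.append((t, survival))
        # Lapsed and censored policies both drop out of the risk set.
        at_risk -= leaving
    return curve

# Hypothetical policies: months observed, and whether the policy lapsed (True)
# or was still in force when observation ended (False, i.e. censored).
months = [3, 5, 5, 8, 12, 12]
lapsed = [True, True, False, True, False, False]
curve = kaplan_meier(months, lapsed)
for t, s in curve:
    print(t, round(s, 3))
```

Censored policies never lower the curve themselves, but they shrink the risk set, which is what separates this estimate from a simple proportion of lapses.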
Sidenote: Bayesian Inference is perfect for time-based regression
Treat observed values as a realisation of a probability distribution
Big wins: capture prior knowledge, preserve uncertainty, model introspection and inference
Create predictions with qualified uncertainty: “credible regions”
Straightforward to extend models e.g. time-varying effects
Straightforward to make models robust e.g. outlier detection, mixture models
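As a small illustration of "predictions with qualified uncertainty", the sketch below assumes the simplest possible time-to-event model: a constant lapse rate (exponential durations) with a Gamma prior. For that pairing the posterior is again a Gamma distribution, so it can be sampled directly. The prior values, the function name `posterior_lapse_rate` and the data are assumptions for the example; the models described in the talk are richer than this.

```python
import numpy as np

def posterior_lapse_rate(durations, observed, prior_shape=1.0, prior_rate=1.0,
                         n_samples=20000, seed=0):
    """Sample the posterior of a constant lapse rate under a Gamma prior."""
    events = sum(observed)       # number of lapses actually seen
    exposure = sum(durations)    # total time at risk, censored policies included
    # Gamma prior + exponential likelihood gives a Gamma posterior (conjugacy).
    shape = prior_shape + events
    rate = prior_rate + exposure
    rng = np.random.default_rng(seed)
    # NumPy parameterises the Gamma by shape and scale, where scale = 1 / rate.
    return rng.gamma(shape, 1.0 / rate, size=n_samples)

# Hypothetical policies: months observed and whether each one lapsed.
months = [3, 5, 5, 8, 12, 12]
lapsed = [True, True, False, True, False, False]

samples = posterior_lapse_rate(months, lapsed)
low, high = np.percentile(samples, [2.5, 97.5])
print(f"posterior mean {samples.mean():.3f}, 95% credible region [{low:.3f}, {high:.3f}]")
```

With only six policies the credible region is wide, which is the point: the uncertainty is carried through to the prediction rather than hidden behind a single number.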
Example 2: topic modelling
“Can we learn the topics of conversation in broker communications?”
Unsupervised Clustering
NLP upon business data sources
After careful cleaning, anonymisation, preprocessing
Find the ‘topics’ of conversation
Words that seem to co-occur
Use topics as a shortcut to categorise and correlate documents to activity
Create the communications graph
Learn social & organisational structure
Design for interactive investigation
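The intuition that topics are "words that seem to co-occur" can be shown with a simple document-level co-occurrence count. This is only the counting step that motivates topic models, not a topic model itself, and the function name `cooccurrence_counts`, the stopword list and the messages are all hypothetical.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(documents, stopwords=frozenset()):
    """Count how many documents each unordered word pair appears in together."""
    counts = Counter()
    for text in documents:
        # Lower-case, split on whitespace, drop stopwords, de-duplicate per document.
        words = sorted({w for w in text.lower().split() if w not in stopwords})
        # Sorting first means each pair is always stored in alphabetical order.
        counts.update(combinations(words, 2))
    return counts

# Hypothetical, already anonymised broker messages.
docs = [
    "renewal quote for motor policy",
    "motor policy renewal due",
    "claim form for water damage",
    "water damage claim approved",
]
pairs = cooccurrence_counts(docs, stopwords={"for"})
print(pairs[("motor", "policy")], pairs[("claim", "water")], pairs[("motor", "water")])
```

Pairs that recur across documents hint at a shared theme (here, renewals versus claims); a full topic model learns such groupings across the whole vocabulary at once.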
Example 3: anomaly detection
“Can we spot fraudulent activity in claims?”
Un / Supervised Learning
Supervised Learning: function estimation
Classification: Log. Reg, Neural / Deep Nets, Trees, Random Forests
Regression: Linear, Non-Linear, Time-Series
Unsupervised Learning: pattern finding
Clustering, distance measures, topologies
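On the unsupervised side, one of the simplest distance-based checks is to flag values that sit far from the median, measured in units of the median absolute deviation. The sketch below is a toy version of that idea; the function name `robust_outliers`, the threshold of 3.5 (a commonly used cutoff) and the claim amounts are assumptions, and real fraud detection would work on many engineered features rather than a single amount.

```python
from statistics import median

def robust_outliers(values, threshold=3.5):
    """Return the indices whose modified z-score exceeds the threshold."""
    centre = median(values)
    # Median absolute deviation: a spread measure that outliers barely move.
    mad = median(abs(v - centre) for v in values)
    if mad == 0:
        # No spread to measure against, so nothing can be flagged.
        return []
    # 0.6745 rescales the MAD so the score is comparable to a standard
    # z-score when the data are roughly normal.
    return [i for i, v in enumerate(values)
            if 0.6745 * abs(v - centre) / mad > threshold]

# Hypothetical claim amounts with one suspiciously large value.
claims = [1200, 950, 1100, 1300, 1050, 990, 1150, 25000]
flagged = robust_outliers(claims)
print(flagged)
```

Using the median rather than the mean keeps the one extreme claim from distorting the baseline it is being compared against.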
Feature engineering is critical
Understand the data shape, size, behaviours and the processes that generated it
Machine Learning
• Sophisticated statistical techniques, good software dev practices and research-grade, open-source software
• Document and share knowledge to become technical centre of excellence
• Validate, test, review & maintain your data pipelines, software and models to mitigate risk and allow for audit
Business Integration
How to integrate data science into business activities?
Tooling
Open Source
Reproducibility and Documentation
Wider Communication
APIs and Integration
The Team
Data scientist skill set
Drew Conway’s (in)famous Venn Diagram
Not so different from a software development team
Communicate
Iterate
and another thing…
The practice of data science can offer powerful insight and prediction…
… it’s only a model
Business Integration
• Clear path from model inference and predictions to the extrapolation of business actions and impacts
• Communicate results with non-technical stakeholders via engaging dashboards and visualisations
• Integrate an automated, live, on-demand prediction service with business systems
Using a “Data Science” approach:
- Motivations
- A Maturity Model
- An Ecosystem Model
- Practical Examples & Advice
Further reading
• Blogs with good technical articles, insights etc
  • http://blog.applied.ai
  • http://www.magesblog.com
  • https://planet.scipy.org
  • http://andrewgelman.com
  • http://blog.kaggle.com
• Books / technical articles
  • https://www.oreilly.com/ideas/what-is-hardcore-data-science-in-practice
  • http://www.oreilly.com/data/free/ten-signs-of-data-science-maturity.csp
  • Machine Learning for Hackers http://shop.oreilly.com/product/0636920018483.do
Thank you
www.applied.ai @applied_ai @jonsedar