pay no attention to the man behind the curtain - the unseen work behind data science

Pay no attention to the man behind the curtain… The unseen work behind data science and analytics Accelerate Data Science conference October 18, 2017 Mark Madsen www.ThirdNature.net @markmadsen

Upload: mark-madsen

Post on 24-Jan-2018

Category:

Data & Analytics


TRANSCRIPT

Page 1: Pay no attention to the man behind the curtain - the unseen work behind data science

Pay no attention to the man behind the curtain… The unseen work behind data science and analytics

Accelerate Data Science conference October 18, 2017 Mark Madsen www.ThirdNature.net @markmadsen

Page 2: Pay no attention to the man behind the curtain - the unseen work behind data science

Copyright Third Nature, Inc.

INTRO The problem we’re (really) trying to solve, current state

Page 3: Pay no attention to the man behind the curtain - the unseen work behind data science

The focus is largely on machine learning today

You are here

Page 4: Pay no attention to the man behind the curtain - the unseen work behind data science

The craft model of information delivery does not scale

Page 5: Pay no attention to the man behind the curtain - the unseen work behind data science

So we shifted to data publishing

Industrialized data delivery for self-service access.

Page 6: Pay no attention to the man behind the curtain - the unseen work behind data science

Increased data capture and BI maturity leads to more data-intensive practices, rising complexity

Pareto analysis of the share of buyers who make up 80% of sales volume for products, in this case Coke.

Data source: CMO council
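The Pareto figure on this slide can be reproduced from per-buyer purchase totals. The sketch below is illustrative only: the function name `share_of_buyers_for_volume` and the sample volumes are assumptions invented for the example, not data from the CMO Council source.

```python
def share_of_buyers_for_volume(purchases, target=0.80):
    """Return the fraction of buyers that account for `target` of total volume.

    `purchases` maps a buyer id to that buyer's total purchase volume.
    """
    volumes = sorted(purchases.values(), reverse=True)  # biggest buyers first
    total = sum(volumes)
    if total <= 0:
        raise ValueError("total volume must be positive")
    running = 0.0
    for count, volume in enumerate(volumes, start=1):
        running += volume
        if running >= target * total:
            return count / len(volumes)
    return 1.0


# Hypothetical volumes for ten buyers (invented for illustration only).
example = {"b1": 500, "b2": 350, "b3": 50, "b4": 40, "b5": 20,
           "b6": 15, "b7": 10, "b8": 5, "b9": 5, "b10": 5}
share = share_of_buyers_for_volume(example)
print(f"{share:.0%} of buyers account for 80% of volume")
```

A query can produce this number. Explaining why those buyers differ is the part that needs analysis.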

Page 7: Pay no attention to the man behind the curtain - the unseen work behind data science

What makes these customers different? How does this affect a new product launch, or line extensions?

These are not the type of questions you can answer with only queries and reporting.

Data source: CMO council

Page 8: Pay no attention to the man behind the curtain - the unseen work behind data science

Compounding the problem: observations, not transactions

Event data doesn’t fit well with current methods of collection and storage, or with the technology to process and analyze it.

Page 9: Pay no attention to the man behind the curtain - the unseen work behind data science

The old problem was access, the new one is analysis

Page 10: Pay no attention to the man behind the curtain - the unseen work behind data science

The applied view of data science

Five basic things you can do:

▪ Prediction – what is most likely to happen?

▪ Estimation – what’s the future value of a variable?

▪ Description – what relationships exist in the data?

▪ Simulation – what could happen?

▪ Prescription – what should you do?

Page 11: Pay no attention to the man behind the curtain - the unseen work behind data science

Applying analytics isn’t just putting them on a screen

There are different models of use at machine and human speed:

Decision-Action

▪ Human decision support

▪ Humans moderating machine decisions

▪ Machine decisions

Monitor-Alert

▪ Human monitoring

▪ Machine monitoring

Page 12: Pay no attention to the man behind the curtain - the unseen work behind data science

THE NATURE OF THE PROBLEM FOR ORGANIZATIONS

Implementing data science is a problem of multiple perspectives

Page 13: Pay no attention to the man behind the curtain - the unseen work behind data science

We don’t have an analytics problem, just like we didn’t have a BI problem

The origin of analytics as “business intelligence” was stated well in 1958:

…the ability to apprehend the interrelationships of presented facts in such a way as to guide action towards a desired goal. ~ H. P. Luhn

“A Business Intelligence System”, http://altaplana.com/ibmrd0204H.pdf

Our goal is analytics as a capability, not a technology

Page 14: Pay no attention to the man behind the curtain - the unseen work behind data science

Three constituencies

Stakeholder, aka the recipient

Analyst, aka the data scientist

Builder, aka the engineer

Page 15: Pay no attention to the man behind the curtain - the unseen work behind data science

Starting points

Many organizations choose to start with the analysts. Create a data science team. Turn them loose to find a problem.

Many more start with builders: technology solutions looking for problems, e.g. 55% of the IT-driven Hadoop and Spark projects over the last five years.

The right place to start? Stakeholders. The goal to achieve, the problem to solve.

Page 16: Pay no attention to the man behind the curtain - the unseen work behind data science

NATURE OF THE PROBLEM FROM THE STAKEHOLDER’S PERSPECTIVE

Each constituency has their own set of problems to deal with

Page 17: Pay no attention to the man behind the curtain - the unseen work behind data science

The myth that still drives analytics – analytic gold

All we need is a fat pipe and pans working in parallel…

Page 18: Pay no attention to the man behind the curtain - the unseen work behind data science

Analytic insights that result in no action are expensive trivia.

It’s not the insight, but what you do with it, that matters.

As a manager: what would you do in this situation?

Page 19: Pay no attention to the man behind the curtain - the unseen work behind data science

Perennially difficult: What question do you address?

What’s possible?

How do you know what’s feasible and what isn’t? (both technically and financially)

You don’t, unless you know the data science and the business (and even then maybe not; ML makes no guarantees).

It takes domain expertise and analytic expertise and intuition - that’s why you need analysts.

Page 20: Pay no attention to the man behind the curtain - the unseen work behind data science

Important questions for managers

1. What is the goal?

2. Is the goal worth achieving?

3. Do you have a clearly stated, measurable goal?

4. Do you have the data required?

If they don’t realize this is important, they complain about analysts asking them a bunch of (obvious*) questions.

There are processes you can put in place to find problems to address, prioritize them and determine how to deploy the solutions for them.

*Not really

Page 21: Pay no attention to the man behind the curtain - the unseen work behind data science

Applying analytics is not an analytics problem

Applying analytics is not in the analyst’s control.

It’s not in the engineer’s control.

It’s in the control of the people involved in the process.

Failures are often in execution, not in analytics development.

For example, we saw unexpectedly poor performance in a number of geographies. Was it the new analytics we tried? Was it a data problem? No, it was a simple compliance problem.

Page 22: Pay no attention to the man behind the curtain - the unseen work behind data science

NATURE OF THE PROBLEM FROM THE ANALYST’S PERSPECTIVE

Page 23: Pay no attention to the man behind the curtain - the unseen work behind data science

The analytics process at a high level

Diagram: Kate Matsudaira

Page 24: Pay no attention to the man behind the curtain - the unseen work behind data science

The nature of analytics problems is researching the unknown rather than accessing the known.

Repeat for each new problem

Diagram: Kate Matsudaira

Page 25: Pay no attention to the man behind the curtain - the unseen work behind data science

Important: no two analytics projects are entirely alike

Different goals = different data, preparation, algorithm

Different algorithms have different resource consumption profiles and scaling ability.

Each requires its own custom-engineered data features.

Page 26: Pay no attention to the man behind the curtain - the unseen work behind data science

Starting at the start: Do you have a clearly stated, measurable goal?

Page 27: Pay no attention to the man behind the curtain - the unseen work behind data science

The main hurdle: just getting the data

Do you know where to find it? Because it’s unlikely to be in the data warehouse.

Do you have access to it?

Is access fast enough? Because DWs are for QRD, not for moving huge piles of data. And ERP systems and SaaS apps are right out.

Page 28: Pay no attention to the man behind the curtain - the unseen work behind data science

Do you have the right data?

Many machine learning techniques require labeled (known good) training data:

Supervised learning: a person has to define the correct output for some portion of the data. Data is divided into training sets used for model building and test sets for validating the results.

• What is spam and what isn’t?

• What does a fraudulent transaction look like?
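The split between training and test sets described above can be sketched in a few lines. This is a minimal illustration using only the standard library. The helper `train_test_split` here is a hypothetical function written for this example (not the scikit-learn function of the same name), and the labeled messages are invented.

```python
import random


def train_test_split(labeled_rows, test_fraction=0.25, seed=0):
    """Shuffle labeled rows and split them into (train, test) lists."""
    rows = list(labeled_rows)
    random.Random(seed).shuffle(rows)  # fixed seed keeps the split repeatable
    n_test = max(1, int(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]


# Hypothetical labeled data: a person decided which messages are spam.
labeled = [
    ("win a free prize now", True),
    ("cheap pills online", True),
    ("claim your reward today", True),
    ("limited offer, act fast", True),
    ("meeting moved to 3pm", False),
    ("quarterly report attached", False),
    ("lunch tomorrow?", False),
    ("notes from the review", False),
]

train, test = train_test_split(labeled)
print(len(train), "rows for model building,", len(test), "rows held back for validation")
```

The labels are the expensive part: every row in `labeled` required someone to define the correct output.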

Page 29: Pay no attention to the man behind the curtain - the unseen work behind data science

Do you have enough of the right data?

ML needs a lot; you may be disappointed in your own efforts.

Page 30: Pay no attention to the man behind the curtain - the unseen work behind data science

Define the business problem

Translate the problem into an analytic context

Select appropriate data

Learn the data

Create a model set

Fix problems with data

Transform data

Build models

Assess models

Deploy models

Assess results

Source: Michael Berry, Data Miners Inc.

What does an expert analyst really do?

Page 31: Pay no attention to the man behind the curtain - the unseen work behind data science

What does an expert analyst do?

You can’t model data for this in advance.

Page 32: Pay no attention to the man behind the curtain - the unseen work behind data science

Where do analysts spend their time? Mostly data work

Define the business problem

Translate the problem into an analytic context

Select appropriate data

Learn the data

Create a model set

Fix problems with data

Transform data

Build models

Assess models

Deploy models

Assess results

% of time spent

70% 30%

Source: Michael Berry, Data Miners Inc.

Page 33: Pay no attention to the man behind the curtain - the unseen work behind data science

Feature engineering is the core of the process

Lots of data (as attributes) makes things harder

Lots of data (instances) makes things slow

Often, the raw data is not in a form that is amenable to learning, but you can construct features from it that are.

Cleaning up data, choosing attributes, deriving features is not a technical problem as much as a creative one.

The best way to enable data scientists is to remove data management obstacles.
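As an illustration of constructing features from raw data, the sketch below turns a list of raw purchase events into per-customer attributes a model could learn from. The event layout, the feature names, and the `build_features` helper are all assumptions made up for this example. Which features are worth deriving is the creative part and depends on the problem.

```python
from collections import defaultdict
from datetime import date


def build_features(events, as_of):
    """Derive per-customer features from raw (customer_id, day, amount) events."""
    by_customer = defaultdict(list)
    for customer_id, day, amount in events:
        by_customer[customer_id].append((day, amount))

    features = {}
    for customer_id, rows in by_customer.items():
        amounts = [amount for _, amount in rows]
        last_day = max(day for day, _ in rows)
        features[customer_id] = {
            "order_count": len(rows),
            "total_spend": sum(amounts),
            "avg_order": sum(amounts) / len(rows),
            "days_since_last": (as_of - last_day).days,  # recency
        }
    return features


# Hypothetical raw events; real event data is rarely this tidy.
raw_events = [
    ("c1", date(2017, 9, 1), 20.0),
    ("c1", date(2017, 10, 1), 40.0),
    ("c2", date(2017, 8, 15), 15.0),
]
features = build_features(raw_events, as_of=date(2017, 10, 18))
print(features["c1"])
```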

Page 34: Pay no attention to the man behind the curtain - the unseen work behind data science

Where do most of the analytics tools focus?

Define the business problem

Translate the problem into an analytic context

Select appropriate data

Learn the data

Create a model set

Fix problems with data

Transform data

Build models

Assess models

Deploy models

Assess results

Source: Michael Berry, Data Miners Inc.

Page 35: Pay no attention to the man behind the curtain - the unseen work behind data science

Where do most of the analytics-as-a-service (aaS) offerings focus?

Define the business problem

Translate the problem into an analytic context

Select appropriate data

Learn the data

Create a model set

Fix problems with data

Transform data

Build models

Assess models

Deploy models

Assess results

Source: Michael Berry, Data Miners Inc.

Page 36: Pay no attention to the man behind the curtain - the unseen work behind data science

The analyst’s workspace in BI is relatively spare

Page 37: Pay no attention to the man behind the curtain - the unseen work behind data science

The analyst’s workspace needs to be more like a kitchen than like BI vending machines

Page 38: Pay no attention to the man behind the curtain - the unseen work behind data science

NATURE OF THE PROBLEM FROM THE BUILDER’S PERSPECTIVE

Page 39: Pay no attention to the man behind the curtain - the unseen work behind data science

IT and Ops people want to know “what to build?”

Giant data platform? Self service tools?

Page 40: Pay no attention to the man behind the curtain - the unseen work behind data science

Analytics requires different processes and workloads

None of this analytics work is the same as what IT considered “analysis” to be, which is usually equated with BI or ad-hoc query.

Ad-hoc analysis ≠

Exploratory data analysis ≠

Batch analytics ≠

Real-time analytics

A real analytics production workflow

Hatch, CIKM ‘11

Page 41: Pay no attention to the man behind the curtain - the unseen work behind data science

Embedding analytics: less voodoo, more engineering

Page 42: Pay no attention to the man behind the curtain - the unseen work behind data science

Things engineering and operations worry about

Engineering time and effort

▪ Introduction of new technology, complexity

▪ Integration – deployment of models requires linking different types of environments, creating supportable workflows for the analysts

▪ Ability to develop and deploy at the required speed

Supportability

▪ Automation

▪ The environment requires additional monitoring, other technology and processes, particularly for customer-facing work

▪ Support costs (time and money)

SLAs

▪ Availability – if analytics are tied to production operations, particularly customer-facing ones, this becomes important and difficult because it’s not standard application work

▪ Performance and scalability – have to manage unpredictable workloads, resource conflicts between model development and model execution

Page 43: Pay no attention to the man behind the curtain - the unseen work behind data science

The world changes, do the models?

In BI you maintain ETL and schemas; in ML you maintain models.

“Model decay” happens as the assumptions around which a model is built change, e.g. spam techniques change.

When you adjust the model, you need to know that it is better again:

▪ Better save the data used to build the model

▪ Better save the model

▪ Baseline and measurements
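Knowing that an adjusted model "is better again" means scoring it against the saved baseline on the same held-out data. The sketch below shows one way to do that. The two spam rules, the holdout rows, and the `is_improvement` helper are hypothetical stand-ins for a real saved model, saved data, and evaluation harness.

```python
def accuracy(model, labeled_rows):
    """Fraction of rows where the model's output matches the known label."""
    correct = sum(1 for features, label in labeled_rows if model(features) == label)
    return correct / len(labeled_rows)


def is_improvement(baseline_model, candidate_model, holdout, min_gain=0.0):
    """Compare a candidate against the baseline on the same saved holdout data."""
    baseline_score = accuracy(baseline_model, holdout)
    candidate_score = accuracy(candidate_model, holdout)
    return candidate_score - baseline_score > min_gain, baseline_score, candidate_score


# Hypothetical "models": simple rules standing in for trained classifiers.
def old_rule(text):
    return "free" in text


def new_rule(text):
    return "free" in text or "prize" in text


# Hypothetical saved holdout set with known labels (True = spam).
holdout = [
    ("free money", True),
    ("claim your prize", True),
    ("meeting at noon", False),
    ("lunch is free today", False),
]

better, base, cand = is_improvement(old_rule, new_rule, holdout)
print(f"baseline={base:.2f} candidate={cand:.2f} deploy={better}")
```

The comparison is only meaningful if the baseline model and the data it was measured on were saved, which is the point of the list above.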

Page 44: Pay no attention to the man behind the curtain - the unseen work behind data science

You need a system of record for analytics
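One way to picture a system of record for analytics is a log entry per model version that ties the model to the exact data and metrics it was built with. The sketch below is an assumption about what such a record might hold, not a description of any particular product. The field names and the `make_record` and `fingerprint` helpers are invented for illustration.

```python
import hashlib
import json


def fingerprint(rows):
    """Return a SHA-256 digest of the training rows, so data changes are detectable."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def make_record(model_name, version, training_rows, metrics):
    """Build one system-of-record entry for a model version (hypothetical layout)."""
    return {
        "model": model_name,
        "version": version,
        "training_data_sha256": fingerprint(training_rows),
        "row_count": len(training_rows),
        "metrics": dict(metrics),  # baseline measurements for later comparison
    }


# Hypothetical training rows and metrics.
rows_v1 = [["free money", True], ["meeting at noon", False]]
record = make_record("spam-filter", 1, rows_v1, {"accuracy": 0.75})
print(json.dumps(record, indent=2))
```

With the fingerprint stored, a later audit can tell whether a model was retrained on the same data or on something different.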

Page 45: Pay no attention to the man behind the curtain - the unseen work behind data science

THREE PERSPECTIVES, ONE SOLUTION?

There are requirements from all constituents. You need to put them together to have a complete picture of what’s needed.

Page 46: Pay no attention to the man behind the curtain - the unseen work behind data science

The missing stakeholder

There is another stakeholder: analytics management - the CAO, CDO, VP of analytics, aka “your boss” if you’re a data scientist.

The person responsible for oversight of the team and its efforts has a perspective, and problems, that span the organization and multiple projects.

Page 47: Pay no attention to the man behind the curtain - the unseen work behind data science

Repeatability

Page 48: Pay no attention to the man behind the curtain - the unseen work behind data science

Operational predictability

Page 49: Pay no attention to the man behind the curtain - the unseen work behind data science

Reproducibility

Page 50: Pay no attention to the man behind the curtain - the unseen work behind data science

Analytics solutions are interdisciplinary

Team composition is best when the skills and backgrounds are mixed.

Domain knowledge is still valuable – ignore the AI and ML hype saying that it’s all math and engineering.

Data management and engineering are a necessary part of much of this work.

Page 51: Pay no attention to the man behind the curtain - the unseen work behind data science

Data scientists and engineers work from opposing directions

exploration: help people ask the right questions, frame them, define measurable goals

modeling: define models that run to determine answers or carry out actions

integration: deliver the results / product in production, at scale

applications: build data science models into applications and delivery systems

infrastructure: provide the systems and practices to build and run the desired models

Diagram concept: Paco Nathan

Page 52: Pay no attention to the man behind the curtain - the unseen work behind data science

Using a matrix to plan the project team

Image: Paco Nathan

Page 53: Pay no attention to the man behind the curtain - the unseen work behind data science

This is a team sport, not a solo act

Image: Paco Nathan

Page 54: Pay no attention to the man behind the curtain - the unseen work behind data science

We already know the craft model doesn’t scale. How do we industrialize like we did for BI?

Page 55: Pay no attention to the man behind the curtain - the unseen work behind data science

There is an extensive list of requirements to support

Primary requirements needed by constituents S D E

Data catalog and ability to search it for datasets X X

Self-service access to curated data X

Self-service access to uncurated (unknown, new) data X X

Temporary storage for working with data X

Data integration, cleaning, transformation, preparation tools and environment X X

Persistent storage for source data used by production models X X

Persistent storage for training, testing, production data used by models X X

Storage and management of models X X

Deployment, monitoring, decommissioning models X

Lineage, traceability of changes made for data used by models X X

Lineage, traceability for model changes X X X

Managing baseline data / metrics for comparing model performance X X X

Managing ongoing data / metrics for tracking ongoing model performance X X X

S = stakeholder, user, D = data scientist, analyst, E = engineer, developer

Page 56: Pay no attention to the man behind the curtain - the unseen work behind data science

Non-answer #1: “Innovation as Procurement”

Software vendors want to sell you one thing: high margin software.

Most assume the data is there and ready to use by their application – just load it.

Most of the work lies in data integration, cleaning and data management.

Embedding analytics in a process adds infrastructure that most organizations don’t have and can’t support. It takes new infrastructure.

Page 57: Pay no attention to the man behind the curtain - the unseen work behind data science

Non-answer #2: Best Practices

“78% of high performing companies have a centralized data science team in place in their organization” – follow their lead!

This is called survival bias. Flipping a coin is often as effective as “Do what they did.”

The problem: you have directions to cross a minefield but no map of where to start.

Page 58: Pay no attention to the man behind the curtain - the unseen work behind data science

The enterprise focus needs to be on repeatability - where it can be supported

Page 59: Pay no attention to the man behind the curtain - the unseen work behind data science

Key focus for the organization: Infrastructure vs Application

Infrastructure enables value, applications deliver value.

Enable applications by pushing the reusable elements down into the platform.

The infrastructure is a hidden combination of technology, process and methods.

Page 60: Pay no attention to the man behind the curtain - the unseen work behind data science

Data management is a key element of infrastructure

Multiple contexts of use, differing quality levels

You need to keep the original because just like baking, you can’t unmake dough once it’s mixed.

Page 61: Pay no attention to the man behind the curtain - the unseen work behind data science

Manage your data (or it will manage you)

Data management is where both analysts and developers are weakest.

Modern engineering practices are where data management is weakest.

You need to bridge the groups and practices in the organization if you want to make this work repeatable.

Page 62: Pay no attention to the man behind the curtain - the unseen work behind data science

Conclusion: new stuff eventually becomes old stuff

Page 63: Pay no attention to the man behind the curtain - the unseen work behind data science

About the Presenter

Mark Madsen is president of Third Nature, an advisory firm focused on analytics, data and technology strategy.

Mark is an award-winning author, architect and CTO who has received awards for his work from the American Productivity & Quality Center, Smithsonian Institution and industry associations.

He is an international speaker, a contributor to Forbes, and member of the O’Reilly Artificial Intelligence and Strata program committees. For more information or to contact Mark, follow @markmadsen on Twitter or visit http://ThirdNature.net

Page 64: Pay no attention to the man behind the curtain - the unseen work behind data science

About Third Nature

Third Nature is an advisory firm focused on practices and technology in analytics, information strategy, business intelligence and data management.

Our goal is to help organizations solve problems using data. We offer education, advisory and research services to support business and IT organizations. We also provide product-related consulting to software vendors in the data industry.

We specialize in strategy and architecture, so we look at emerging technologies and markets, evaluating how technologies are applied to solve problems rather than simply comparing product features. We fill the gap between what industry analyst firms cover and what organizations need.