Staying Ahead of the Data Avalanche Challenges and Opportunities in Analytics Prof. Dr. Seppe vanden Broucke SAS Analytics Experience Rome – 8 November 2016

Page 1: Staying Ahead of the Data Avalanche -  · PDF fileStaying Ahead of the Data Avalanche ... business data mining and analytics, ... Selection Cleaning Transformation Discovery

Staying Ahead of the Data Avalanche: Challenges and Opportunities in Analytics

Prof. Dr. Seppe vanden Broucke
SAS Analytics Experience Rome – 8 November 2016

Page 2

Presenter: Seppe vanden Broucke

• Assistant professor in Data and Process Science at department of Decision Sciences and Information Management at KU Leuven (Belgium)

• PhD in Applied Economics at KU Leuven, Belgium in 2014

• Title: Advances in Process Mining: Artificial Negative Events and Other Techniques

• Research: business data mining and analytics, machine learning, process management, process mining

• Contact: www.dataminingapps.com

Page 3

BIG DATA

Page 4

“We live in a data flooded world”

Page 5

“Making sense of mountains of data” aka

“Scale your data mountain”

Page 6

“The data avalanche”

“Data is the new oil”

“The data tsunami”

Page 7

BIG DATA

“It all sounds kind of dangerous”

Page 8

BIG DATA + DATA SCIENCE & ANALYTICS =

But so many success stories…

Page 9

“We live in magical times”

Page 10

Uber

Page 11

Page 12

Page 13

Contextual RNN-GANs for Abstract Reasoning Diagram Generation
Arnab Ghosh*, Viveka Kulharia*, Amitabha Mukerjee, Vinay Namboodiri, Mohit Bansal

Measuring an Artificial Intelligence System's Performance on a Verbal IQ Test For Young Children
Stellan Ohlsson, Robert H. Sloan, György Turán, Aaron Urasky

Page 14

BIG DATA + DATA ANALYTICS

“Let the good times roll”

Page 15

So why do so many projects fail?

Page 16

“During 2015, only 15% of Fortune 500 organizations were able to exploit big data for competitive advantage” – Gartner

“Data maturity of companies is very disparate, and the most advanced of them start doubting.” – Christophe Bourguignat

“75% have invested in Big Data, but only 10% have projects in production.”

Companies face disillusionment. They start asking questions: I know how much it costs, but how much do I earn? What is my return on investment?

Page 17

Machine learning and data science have (just) reached “peak hype”

Page 18

The challenges ahead

• TALENT

• PROCESS

• TOOLS, FILES, FEEDS

• COMMUNICATION

• MEASURING

• PRIVACY, COMPLIANCE

• ETHICS

• QUALITY

Page 19

TALENT

“A data scientist is like a gold-coloured unicorn: mythical powers, but impossible to find”

Page 20

TALENT

“A data scientist is like a gold-coloured unicorn: mythical powers, but impossible to find”

Programmer

Page 21

TALENT

Or a spider with 25 legs?

Page 22

PROCESS

Data science as a straight-through process?

Adhering to a data science workflow is A-OK:

• CRISP-DM

• The KDD process

• SEMMA

• BinaryEdge

Page 23

PROCESS

Data science as a straight-through process?

Data → Selection → Selected Data → Cleaning → Cleaned/Processed Data → Transformation → Transformed Data → Discovery → Mined Model/Patterns → Interpretation/Evaluation → Knowledge/Insights
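
The stages in this diagram (selection, cleaning, transformation, discovery, interpretation/evaluation) can be read as a chain of functions, each consuming the previous stage's output. A minimal Python sketch of that straight-through view follows; the toy records and the stage functions `select`, `clean`, `transform`, `discover` and `interpret` are illustrative assumptions, not something taken from the talk.

```python
from statistics import mean

# Toy raw data; every field and value here is made up for illustration.
RAW = [
    {"customer": "a", "age": 34, "spend": 120.0, "note": "x"},
    {"customer": "b", "age": None, "spend": 80.0, "note": "y"},
    {"customer": "c", "age": 51, "spend": 300.0, "note": "z"},
]

def select(rows):
    """Selection: keep only the columns relevant to the question."""
    return [{k: r[k] for k in ("customer", "age", "spend")} for r in rows]

def clean(rows):
    """Cleaning: drop records with missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]

def transform(rows):
    """Transformation: derive a feature (spend relative to the mean)."""
    avg = mean(r["spend"] for r in rows)
    return [{**r, "relative_spend": r["spend"] / avg} for r in rows]

def discover(rows):
    """Discovery: a trivial 'pattern', customers spending above average."""
    return [r["customer"] for r in rows if r["relative_spend"] > 1.0]

def interpret(pattern):
    """Interpretation/evaluation: turn the pattern into a statement."""
    return f"{len(pattern)} customer(s) spend above average: {pattern}"

def run_pipeline(rows):
    """Run the stages strictly in order, feeding each output to the next."""
    data = rows
    for stage in (select, clean, transform, discover, interpret):
        data = stage(data)
    return data

insight = run_pipeline(RAW)
print(insight)
```

In practice a result at any stage tends to send the analyst back to an earlier one, which a fixed `for` loop over stages does not capture.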

Page 24

PROCESS

Not really...

Data → Selection → Selected Data → Cleaning → Cleaned/Processed Data → Transformation → Transformed Data → Discovery → Mined Model/Patterns → Interpretation/Evaluation → Knowledge/Insights

Page 25

PROCESS

More like a loop

Page 26

PROCESS

Experiments can take a while…

Page 27

PROCESS

These things are hard

• How to create a sense of urgency?

• What does it mean to be finished?

• You can’t predict the future.

Page 28

COMMUNICATION

Throw-it-over-the-wall projects

Page 29

COMMUNICATION

Throw-it-over-the-wall projects

I want to put this GBM into production, though some steps are done using R and SAS

Anyone know what this XGBoost thing is?

Why aren’t we deployed yet?

We have all this data, why can’t we find interesting customers?

Page 30

COMMUNICATION

Talking helps

• Learn each other’s language

• Think with your business hat

• Teach semantics (why a shorter lead list is not easier to produce)

• Convert hard problems into simpler ones

• Use examples, metaphors, analogies

• Show them and show them often

• IT and data science can live together
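
The remark about lead lists can be made concrete: to return the best k leads, every customer still has to be scored, so asking for a shorter list does not remove the expensive part. A minimal sketch, assuming a hypothetical `score` function that stands in for a trained model (the formula and the customer fields are invented):

```python
import heapq

calls = 0  # counts how often the (hypothetical) model is invoked

def score(customer):
    """Stand-in for a trained model; the formula is made up."""
    global calls
    calls += 1
    return (customer["visits"] * 3 + customer["basket"]) % 17

def top_leads(customers, k):
    """Return the ids of the k highest-scoring customers.

    Every customer is scored once, whatever k is: only the final
    selection step depends on the length of the list.
    """
    scored = [(score(c), c["id"]) for c in customers]
    return [cid for _, cid in heapq.nlargest(k, scored)]

customers = [{"id": i, "visits": i % 7, "basket": i * 2} for i in range(1000)]

calls = 0
short_list = top_leads(customers, 10)
calls_short = calls

calls = 0
long_list = top_leads(customers, 500)
calls_long = calls

print(calls_short, calls_long)  # both equal the number of customers
```

Counting model calls rather than timing them keeps the comparison independent of the machine it runs on.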

Page 31

MEASURING

“Not everything that counts can be counted… and not everything that can be counted counts”

• Show before and after

• “When are you happy?”

• Accept failures

• Manual measuring can be a good thing

• Hard to automate subjective feelings…

Page 32

TOOLS, FILES, FEEDS

“No one ever got fired for installing Hadoop on a cluster… right?”

Page 33

TOOLS, FILES, FEEDS

A fool can ask more questions in an hour than a wise man can answer in a hundred years

• Focus on the files

• What are we going to use it for?

A data scientist can find, love, and ditch more tools/libraries/… in an hour than a procurement officer can vet in a hundred years

Page 34

TOOLS, FILES, FEEDS

Focus on feeds, files, data

• Let them (us) own the data

• Ship fast, ship often

• Focus on format and storage standards, not on technology:

“Can I get information on X for months A and B with only those columns that changed?”

... “Can I get it myself?”

• Where’s your golden data set?

• Trust your experts
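
A request such as “information on X for months A and B with only those columns that changed” becomes easy to serve, even self-service, when monthly snapshots share a stable key and column layout. A minimal sketch in plain Python, assuming two hypothetical snapshots keyed by a customer id; the function name `changed_columns` and all values are invented for illustration:

```python
def changed_columns(snapshot_a, snapshot_b):
    """For each key present in both snapshots, return only the columns
    whose value differs, as {key: {column: (old, new)}}."""
    diff = {}
    for key in snapshot_a.keys() & snapshot_b.keys():
        old, new = snapshot_a[key], snapshot_b[key]
        delta = {
            col: (old.get(col), new.get(col))
            for col in old.keys() | new.keys()
            if old.get(col) != new.get(col)
        }
        if delta:
            diff[key] = delta
    return diff

# Hypothetical monthly extracts, keyed by customer id.
month_a = {
    "c1": {"segment": "retail", "balance": 100, "city": "Rome"},
    "c2": {"segment": "retail", "balance": 250, "city": "Leuven"},
}
month_b = {
    "c1": {"segment": "retail", "balance": 140, "city": "Rome"},
    "c2": {"segment": "retail", "balance": 250, "city": "Leuven"},
    "c3": {"segment": "sme", "balance": 900, "city": "Milan"},
}

result = changed_columns(month_a, month_b)
print(result)
```

Keys that appear in only one month (here `c3`) are ignored by this sketch; whether new or vanished records should be reported is a design choice the data owners would have to make.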

Page 35

TOOLS, FILES, FEEDS

Technology moves too fast anyway…

• HDFS?

• What about HDF5, or Kudu?

• Do we even have unstructured data?

• Do we know what to do with it?

• V’s of Big Data – yeah right!

• BigSQL, or Hive, or Slurp?

• Cloudera, Hortonworks, Teradata, Oracle, I want Hadoop!?

• What do you mean we need H2O on top of Spark on top of Hadoop? We just installed X

• We did these things before… they weren’t hard then

• True, but…

Page 36

TOOLS, FILES, FEEDS

It’s a difficult balance

Page 37

TOOLS, FILES, FEEDS

The wall of deployment

• Versioning

• Collaboration

• Scalable execution

• Multiple language support

• Multiple kernel support

• Monitoring

• Scheduling

• Acyclic dependency graphs

• Quite different from playing in a notebook

• Vendors are starting to help out

• SAS, SPSS, Domino Data Lab, sense.io, ScienceOps <-> Jupyter, Rodeo, your 3GB pip packages

• Familiar to neither most data scientists (too messy) nor IT shops (too unfamiliar)
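
Of the items on this slide, “scheduling” and “acyclic dependency graphs” are the ones a notebook hides most completely. A minimal sketch of the idea, using the standard-library `graphlib` module (Python 3.9+); the task names are hypothetical and the sketch only computes a run order, it does not execute anything:

```python
from graphlib import TopologicalSorter, CycleError

# Hypothetical pipeline: each task maps to the tasks it depends on.
dependencies = {
    "extract": set(),
    "clean": {"extract"},
    "features": {"clean"},
    "train": {"features"},
    "report": {"train", "clean"},
}

def execution_order(deps):
    """Return one valid run order, or raise CycleError if none exists."""
    return list(TopologicalSorter(deps).static_order())

def has_cycle(deps):
    """True if the dependency graph cannot be scheduled at all."""
    try:
        execution_order(deps)
    except CycleError:
        return True
    return False

order = execution_order(dependencies)
print(order)
print(has_cycle({"a": {"b"}, "b": {"a"}}))
```

A production scheduler adds retries, monitoring and parallel execution on top of this ordering step, which is where the vendor tools mentioned above come in.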

Page 38

TOOLS, FILES, FEEDS

The “Joel Test” for Data Science

• Can new hires get set up in the environment to run analyses on their first day?

• Can data scientists utilize the latest tools/packages without help from IT?

• Can data scientists use on-demand and scalable compute resources without help from IT/dev ops?

• Can data scientists find and reproduce past experiments and results, using the original code, data, parameters, and software versions?

• Does collaboration happen through a system other than email or copying files?

• Can predictive models be deployed to production without custom engineering or infrastructure work?

• Is there a single place to search for past research and reusable data sets, code, etc?

• Do your data scientists use the best tools money can buy?

Source: https://blog.dominodatalab.com/joel-test-data-science/
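
The question about reproducing past experiments “using the original code, data, parameters, and software versions” comes down to recording those four things at run time. A minimal sketch of such a record, using only the standard library; the function names, the manifest fields and the example commit id are assumptions made for illustration, not part of the Joel Test itself:

```python
import hashlib
import json
import platform

def fingerprint(data_bytes):
    """Content hash of the input data, so a later run can verify it."""
    return hashlib.sha256(data_bytes).hexdigest()

def make_manifest(data_bytes, params, code_version):
    """Collect what is needed to rerun an experiment later."""
    return {
        "data_sha256": fingerprint(data_bytes),
        "params": params,
        "code_version": code_version,  # e.g. a commit id (assumed)
        "python": platform.python_version(),
    }

def same_experiment(m1, m2):
    """Two runs are comparable if data, parameters and code all match."""
    keys = ("data_sha256", "params", "code_version")
    return all(m1[k] == m2[k] for k in keys)

data = b"id,amount\n1,10\n2,20\n"
params = {"learning_rate": 0.1, "trees": 200}

run1 = make_manifest(data, params, "abc123")
run2 = make_manifest(data, params, "abc123")
run3 = make_manifest(data + b"3,30\n", params, "abc123")

print(json.dumps(run1, indent=2, sort_keys=True))
print(same_experiment(run1, run2), same_experiment(run1, run3))
```

Storing the manifest next to the results is enough to answer “which data produced this number?” months later, even without a dedicated platform.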

Page 39

QUALITY

Garbage in…

“This model is gonna be great!”

Page 40

QUALITY

Sometimes they are…

• Really: everyone has bad data

• But: more “bad” means more time

• Do make sure to get a continuous source to the “bad” data

• Survey: 50+ banks participating world-wide

• Most banks indicated that between 10 and 20 percent of their data suffer from data quality problems

• Manual data entry is one of the key problems

• Diversity of data sources and consistent corporate-wide data representation are the main challenges for data quality

• Regulatory compliance is the key motive to improve data quality
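
A figure such as “10 to 20 percent of the data suffer from quality problems” presupposes explicit rules for what counts as a problem. A minimal sketch of how such a share could be measured; the records, the field names and the two rules are invented for illustration and are not taken from the survey:

```python
from datetime import date

TODAY = date(2016, 11, 8)  # fixed reference date so the check is repeatable

# Hypothetical customer records, as they might come out of manual entry.
records = [
    {"id": 1, "birth": date(1980, 5, 1), "country": "BE"},
    {"id": 2, "birth": None, "country": "BE"},
    {"id": 3, "birth": date(1968, 9, 30), "country": "IT"},
    {"id": 4, "birth": date(1975, 3, 9), "country": "it"},
    {"id": 5, "birth": date(2090, 1, 1), "country": "IT"},
    {"id": 6, "birth": date(1991, 7, 2), "country": "NL"},
]

def problems(rec, today=TODAY):
    """Return the list of rule violations for one record."""
    found = []
    if rec["birth"] is None:
        found.append("missing birth date")
    elif rec["birth"] > today:
        found.append("birth date in the future")
    country = rec["country"]
    if not (len(country) == 2 and country.isupper()):
        found.append("non-standard country code")
    return found

report = {rec["id"]: problems(rec) for rec in records}
bad = {rid: issues for rid, issues in report.items() if issues}
share = len(bad) / len(records)

print(f"{share:.0%} of records have at least one problem")
for rid, issues in bad.items():
    print(rid, issues)
```

Running checks like these on every new delivery, rather than once, is what turns a one-off cleaning effort into a continuous view of the “bad” data.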

Page 41

PRIVACY, COMPLIANCE

Oh boy…

• Datensparsamkeit

• Cookie law

• Basel II / III

• Who knows where the cloud is anyways?

• EU directives outdated

• “It’s all on Facebook anyway”

Page 42

PRIVACY, COMPLIANCE

Academics are just getting started…

Page 43

PRIVACY, COMPLIANCE

In more ways than one...

Page 44

PRIVACY, COMPLIANCE

“If only we didn’t have to worry about this”

Page 45

PRIVACY, COMPLIANCE

Use it as a competitive advantage?

https://backchannel.com/an-exclusive-look-at-how-ai-and-machine-learning-work-at-apple-8dbfb131932b#.crky6nt6k

Page 46

ETHICS

Data science for good?

• Can an algorithm be racist? Sexist?

• “Will Predictive Models Outliers Be The New Socially Excluded?”

• Companies like DataKind, or Bayes Impact

• Concept of open models
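
Whether an algorithm treats groups differently can at least be measured. A minimal sketch of one common check, comparing approval rates between groups (often called demographic parity); this metric is chosen here for illustration and is only one of several possible definitions, and the groups and decisions are made up:

```python
from collections import defaultdict

def selection_rates(decisions):
    """decisions: iterable of (group, approved) pairs.
    Returns the approval rate per group."""
    totals = defaultdict(int)
    approved = defaultdict(int)
    for group, ok in decisions:
        totals[group] += 1
        approved[group] += int(ok)
    return {g: approved[g] / totals[g] for g in totals}

def parity_ratio(rates):
    """Lowest rate divided by highest rate; 1.0 means equal rates."""
    return min(rates.values()) / max(rates.values())

# Made-up model decisions for two hypothetical groups.
decisions = (
    [("A", True)] * 60 + [("A", False)] * 40
    + [("B", True)] * 30 + [("B", False)] * 70
)

rates = selection_rates(decisions)
ratio = parity_ratio(rates)
print(rates, ratio)
```

A low ratio does not by itself prove discrimination, since the groups may differ in relevant ways, but it is the kind of figure an open model would let outsiders compute.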

Page 47

The challenges today

• TALENT

• PROCESS

• TOOLS, FILES, FEEDS

• COMMUNICATION

• MEASURING

• PRIVACY, COMPLIANCE

• ETHICS

• QUALITY

Page 48

Thank you