big data analytics - best of the worst : anti-patterns & antidotes

Big Data Analytics - The Best of the Worst Krishna Sankar @ksankar https://www.linkedin.com/in/ksankar

Upload: krishna-sankar

Post on 12-Aug-2015


Category:

Data & Analytics



TRANSCRIPT

Big Data Analytics - The Best of the Worst

Krishna Sankar

@ksankar https://www.linkedin.com/in/ksankar

About Me

o Data Scientist
  • Decision Data Science & Product Data Science [Data Science Folk Knowledge http://goo.gl/O4svPx]
  • Insights = Intelligence + Inference + Interface [https://goo.gl/s2KB6L]
  • Predicting NFL with Elo like Nate Silver & 538 [NFL : http://goo.gl/Q2OgeJ, NBA’15 : https://goo.gl/aUhdo3]
o Have been speaking at OSCON [http://goo.gl/1MJLu], PyCon, Pydata [http://vimeo.com/63270513, http://www.slideshare.net/ksankar/pydata-19] …
o Have done lots of things:
  • Big Data (Retail, Bioinformatics, Financial, AdTech), Starting MS-CFRM, University of WA
  • Written Books (Web 2.0, Wireless, Java, …)
  • Standards, some work in AI
  • Guest Lecturer at Naval PG School, …
o Studying MS-CFRM (Computational Finance/Risk management) UWA
o Full-day Spark workshop “Advanced Data Science w/ Spark” / Spark Summit-E’15 [https://goo.gl/7SBKTC]
o Co-author : “Fast Data Processing with Spark”, Packt Publishing [http://goo.gl/eNtXpT]
o Reviewer : “Machine Learning with Spark” Packt Publishing
o Volunteer as Robotics Judge at First Lego league World Competitions
o @ksankar, doubleclix.wordpress.com

Background – Top 5

http://tcapp2.publishpath.com/rabbithole
http://conservationmagazine.org/wordpress/wp-content/uploads/2013/05/context-matters.jpg

1) Data Science

The art of building a model with known knowns, which, when let loose, works with unknown unknowns

Donald Rumsfeld is an armchair Data Scientist !

http://smartorg.com/2013/07/valuepoint19/

[Figure: knowns vs. unknowns quadrant. Axes: The World (Knowns, Unknowns) and You (Unknown, Known)]

o Others know, you don’t
o Model Evolution/DevOps to capture this
o Capture in Models
o Facts, outcomes or scenarios we have not encountered, nor considered
o “Black swans”, outliers, long tails of probability distributions
o Lack of experience, imagination
o Potential facts, outcomes we are aware of, but not with certainty
o Stochastic processes, Probabilities

o Known Knowns
  • There are things we know that we know
o Known Unknowns
  • That is to say, there are things that we now know we don't know
o But there are also Unknown Unknowns
  • There are things we do not know we don't know

Goal of Big Data is Analytics

2) The pipeline is the context

[Figure: analytics pipeline with notes attached to the stages]

o Scalable Model Deployment
o Big Data automation & purpose built appliances (soft/hard)
o Manage SLAs & response times

o Volume
o Velocity
o Streaming Data

o Canonical form
o Data catalog
o Data Fabric across the organization
o Access to multiple sources of data
o Think Hybrid – Big Data Apps, Appliances & Infrastructure

Collect | Store | Transform

o Metadata
o Monitor counters & Metrics
o Structured vs. Multi-structured

o Flexible & Selectable
  § Data Subsets
  § Attribute sets

o Refine model with
  § Extended Data subsets
  § Engineered Attribute sets
o Validation run across a larger data set

Reason | Model | Deploy

Data Management | Data Science

o Dynamic Data Sets
o 2 way key-value tagging of datasets
o Extended attribute sets
o Advanced Analytics

Explore | Visualize | Recommend | Predict

o Performance
o Scalability
o Refresh Latency
o In-memory Analytics

o Advanced Visualization
o Interactive Dashboards
o Map Overlay
o Infographics

¤ Bytes to Business a.k.a. Build the full stack

¤ Find Relevant Data For Business

¤ Connect the Dots

3) Mind Your “I”s, “C”s & “V”s

Volume
Velocity
Variety

Context
Connectedness

Intelligence
Interface
Inference

o Three Amigos
o Interface = Cognition
o Intelligence = Compute (CPU) & Computational (GPU)
o Infer Significance & Causality

CURATED SIGNALS > APPLIED INTELLIGENCE > STRATIFIED INFERENCE

4) Model Evolution & Concept Drift

[Figure: steps plotted on axes Complexity vs. Value]

o Dynamic dashboards
  • Multi-dimensional pivots w/ customization
o Selectable algorithms on data subsets
  • “Cluster Customer for 5 thanksgiving seasons”
o Learning Models
  • Automatic Feature Selection & hyper parameter optimizations as it gets more data
o Dynamic Models
  • Model Selection based on context
o Automated Analytics
  • Let Data tell story
  • Feature Learning, AI, Deep Learning

Concept Drift
o Validate Model assumptions + hyper parameters + features in the current context – after they are in production

Ref: Prof. Josh Bloom, Keynote: A Systems View of Machine Learning, #pydata Seattle’15
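Validating a model's features "in the current context" after deployment can start with a simple distribution check. The sketch below is one possible approach and is not taken from the talk: it computes a Population Stability Index (PSI) between a feature's training-time values and its production values. The function name, the bin count, the sample data and the 0.2 alert threshold are illustrative assumptions (0.2 is a commonly quoted rule of thumb, not a standard).

```python
import math

def population_stability_index(expected, actual, bins=10):
    """PSI between a baseline sample (expected) and a live sample (actual)."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant baseline

    def bucket_shares(values):
        counts = [0] * bins
        for v in values:
            # Values outside the baseline range are clamped into the edge buckets.
            idx = int((v - lo) / width)
            counts[max(0, min(bins - 1, idx))] += 1
        # A small floor avoids log(0) for empty buckets.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = bucket_shares(expected), bucket_shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

# Hypothetical data: one production sample matches training, one has shifted upward.
training = [i / 100 for i in range(1000)]            # 0.00 .. 9.99
production_same = [i / 100 for i in range(1000)]
production_shifted = [3 + i / 100 for i in range(1000)]

DRIFT_THRESHOLD = 0.2  # assumed rule-of-thumb cutoff
stable_psi = population_stability_index(training, production_same)
shifted_psi = population_stability_index(training, production_shifted)
drift_detected = shifted_psi > DRIFT_THRESHOLD
```

A check like this only flags that the inputs moved; whether the model's assumptions and hyper parameters still hold needs a re-validation run on recent labelled data.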

5) The Sense & Sensibility of a DataScientist DevOps

o Analytics in the lab = Investigative
  • Interactive, Iterative, Explorative
  • Output is usually decision data science
o Analytics in the factory = Operational
  • Automated, systemic, transparent & explainable
  • Output is embedded intelligence
  • Embedded in customer facing decision systems

Josh Wills - From the labs to the factory, https://doubleclix.wordpress.com/2013/11/17/of-building-data-products/
http://doubleclix.wordpress.com/2014/05/11/the-sense-sensibility-of-a-data-scientist-devops/

There is a chasm between Model/Reason and Deploy

6) Data is your product, regardless of what you sell

o Data is the lens through which you see the business and feel the pulse
o Collect the right data through “Thoughtful Data Design”
o Give Data Back in a Powerful Way
o But don’t confuse or overwhelm the users
  • The users have to feel safe
  • The users have to feel they are in control
o Never try to launch a complicated data product on a fixed schedule
o Offer progressively sophisticated products, leveraging the data & insights, across the different user population segments
  • Customer segmentation & stratification is not just for retail !

Josh Wills - From the labs to the factory, https://doubleclix.wordpress.com/2013/11/17/of-building-data-products/
http://doubleclix.wordpress.com/2014/05/11/the-sense-sensibility-of-a-data-scientist-devops/

“There are no routine statistical questions, only questionable statistical routines” -- David Cox

Ref: Gabriele Corno, Natural History Museum in #London ..by George Thalassinos

Big Data Analytics - The Best of the Worst

Data Swamp

Blue Pill
o Typical case of “ungoverned data stores addressing a limited data science audience”
o The company proudly has crossed the chasm to the big data world with a new shiny Hadoop infrastructure.
o Now everyone starts putting their data into this “lake”.
o After a few months, the disks are full; Hadoop is replicating 3 copies; even some bytes are falling off the floor from the wires – but no one has any clue what data is in there, or about its consistency and semantic coherence

Red Pill - Data Curation
o Data Curation
  • A consistent published schema
o Data Quality & Data Lineage, “descriptive metadata and an underlying mechanism to maintain it”, all are part of the data curation layer …
o Semantic consistency across diverse multi-structured multi-temporal transactions requires a level of data curation & discipline
o Design for the right “Data Gravity” & “Data Mass” as Van Lindberg mentioned, yesterday, in his keynote
  • Not Data Molasses !
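One way to read “a consistent published schema” together with “descriptive metadata and an underlying mechanism to maintain it” is as a gate in front of the lake: records are checked against the schema on the way in, and every load leaves a lineage entry behind. The sketch below is hypothetical throughout; `PUBLISHED_SCHEMA`, `validate_record`, `ingest`, the catalog fields and the file name are illustrative names, not part of Hadoop or any specific catalog tool.

```python
from datetime import datetime, timezone

# Hypothetical published schema: field name -> required Python type.
PUBLISHED_SCHEMA = {"customer_id": int, "amount": float, "ts": str}

def validate_record(record, schema=PUBLISHED_SCHEMA):
    """Return a list of problems; an empty list means the record conforms."""
    problems = [f"missing field: {name}" for name in schema if name not in record]
    problems += [
        f"bad type for {name}: {type(record[name]).__name__}"
        for name, expected in schema.items()
        if name in record and not isinstance(record[name], expected)
    ]
    return problems

def ingest(records, source, catalog):
    """Keep conforming records and append lineage metadata to the catalog."""
    accepted = [r for r in records if not validate_record(r)]
    catalog.append({
        "source": source,
        "loaded_at": datetime.now(timezone.utc).isoformat(),
        "accepted": len(accepted),
        "rejected": len(records) - len(accepted),
    })
    return accepted

catalog = []
batch = [
    {"customer_id": 1, "amount": 9.5, "ts": "2015-08-12T23:10:00"},
    {"customer_id": "2", "amount": 4.0, "ts": "2015-08-12T23:15:00"},  # wrong type
    {"customer_id": 3, "ts": "2015-08-12T23:20:00"},                    # missing amount
]
clean = ingest(batch, source="pos_transactions.csv", catalog=catalog)
```

In practice the rejected records would be quarantined and reported rather than dropped; the point of the sketch is only that someone can later answer "what is in there, and where did it come from".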


https://www.linkedin.com/pulse/data-lakes-udls-vs-analytics-platforms-gargi-adhav

Big Data To Nowhere

Blue Pill
o IT sees an opportunity and starts building the infrastructure, sometimes massive, and puts petabytes of data in the Big Data Hub or lake or pool or … But no relevant business facing apps.
o A conversation goes like this …
  • Business : I heard that we have a big data infrastructure, cool. When can I show a demo to our customers ?
  • IT : We have petabytes of data and I can show the Hadoop admin console. We even have the Spark UI !
  • Business : … (unprintable)

Red Pill - Full Stack MVP (see next slide)
o Build the full stack, i.e. bits to business …
o Build incremental Decision Data Science & Product Data Science layers, as appropriate …
o The following conversation is a lot better …
  • Business : I heard that we have a big data infrastructure, cool. When can I show a demo to our customers ?
  • IT : Actually we don’t have all the data. But from the transaction logs and customer data, we can infer that Males between 34-36 buy a lot of stuff from us between 11:00 PM & 2:00 AM !
  • Business : That is interesting … Show me a graph. BTW, do you know what the revenue and the profit margin from these buys are ?
  • IT : Graph is no problem. We have a shiny app with the dynamic model over the web logs.
  • IT : With the data we have, we only know that they comprise ~30% of our volume by transaction. But we do not have the order data in our Hadoop yet. We can … let me send out a budget request …

[Figure: end-to-end analytics stack]

o Pipeline stages
  • Collect, Store, Transform
  • Report, Visualize
  • Recommend, Predict
  • Reason, Model
  • Model, Explore
o ML Engine : NumPy, SciPy, Pandas, Spark, Azure ML, MPP/Impala
o R/Python
o Compositional Analysis
o Data Hub : Curated Data
  • Storage : HDFS, Parquet
  • Compute : Hadoop MR, Spark
o Landing Zone
o Dashboards, APIs
o Reporting Hub
o Analytics Hub
o ETL
o In-Memory Hub, Real-Time, Kafka …

[Table: workloads checked against the columns Reporting Hub, Analytics Hub, Hadoop MR]

o Long-Running Complex Jobs - Yearly pivots, Multi-dimensional Exact Uniques : ✔ ✔
o Real-time ad-hoc pivots, Approx Uniques (HLL) : ✔
o Fast Response with Aggregated data Subsets : ✔
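The split between "Exact Uniques" for long-running jobs and "Approx Uniques (HLL)" for real-time pivots is easier to see with a toy version of the approximate side. The sketch below is a minimal HyperLogLog in plain Python, written under stated assumptions: SHA-1 truncated to 64 bits as the hash, 2^12 registers, the usual small-range linear-counting correction and no large-range correction. It is a teaching sketch, not the implementation used by any hub in the diagram, and the class and function names are illustrative.

```python
import hashlib
import math

class HyperLogLog:
    """Approximate distinct counter using 2**p small registers."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")          # 64-bit hash value
        idx = h >> (64 - self.p)                       # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)          # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)          # bias constant for m >= 128
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw

def approx_uniques(items, p=12):
    hll = HyperLogLog(p)
    for item in items:
        hll.add(item)
    return hll.count()

# Hypothetical clickstream: 30,000 visits from 10,000 distinct users.
visits = [f"user-{i % 10_000}" for i in range(30_000)]
exact = len(set(visits))          # needs memory proportional to the distinct count
approx = approx_uniques(visits)   # needs a fixed 4,096 registers
```

Registers from separate partitions can be merged by taking the element-wise maximum, which is why this kind of sketch suits ad-hoc pivots over pre-aggregated slices, while exact multi-dimensional uniques need a full pass over the raw data.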


https://www.linkedin.com/pulse/why-how-make-mvp-analytics-ruoyu-bao

Build The E2E Analytics MVP Stack

A Data Too Far

Blue Pill
o You might get a few .gz files, a few .csv files and of course, parquet files, in multiple systems
o Some will have IDs, some names, some aggregated by week, some aggregated by day and others pure transactional.
o The challenge is that we have the data, but there is no easy way to combine them for interesting inferences …

Red Pill - Data Curation
o “..The most creative things that happen with data are less about sophisticated algorithms and vast computation (though those are nice) than it is about putting together different pieces of data that were previously locked up in different silos.”
o Data Pipelines (e.g. Kafka) with in-line processing to ensure correctness, semantic and temporal congruence & integrity

Ref: Jay Kreps, Announcing Confluent
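The combining problem described above (some sources keyed by ID, some by name, some daily, some weekly, some transactional) usually comes down to agreeing on one key and one grain before joining. The sketch below is a small illustration under assumptions: three hypothetical feeds from different channels, a curated `name_to_id` mapping, and ISO week chosen as the common grain. None of these names come from the talk.

```python
from collections import defaultdict
from datetime import date

# Hypothetical feeds from three channels, each with its own key and grain.
web_transactions = [   # pure transactional, keyed by customer ID
    {"customer_id": 7, "day": date(2015, 8, 10), "amount": 20.0},
    {"customer_id": 7, "day": date(2015, 8, 12), "amount": 5.0},
]
store_daily = [        # aggregated by day, keyed by customer name
    {"name": "Acme", "day": date(2015, 8, 11), "amount": 30.0},
]
partner_weekly = [     # aggregated by ISO week, keyed by customer ID
    {"customer_id": 7, "iso_week": (2015, 33), "amount": 100.0},
]
name_to_id = {"Acme": 7}  # assumed reference mapping maintained by data curation

def iso_week(d):
    """Map a date to its (ISO year, ISO week) pair."""
    year, week, _ = d.isocalendar()
    return (year, week)

def to_weekly(transactions, daily_by_name, weekly_by_id, name_to_id):
    """Bring every feed to the grain (customer_id, ISO week) and sum the amounts."""
    totals = defaultdict(float)
    for t in transactions:
        totals[(t["customer_id"], iso_week(t["day"]))] += t["amount"]
    for d in daily_by_name:
        totals[(name_to_id[d["name"]], iso_week(d["day"]))] += d["amount"]
    for w in weekly_by_id:
        totals[(w["customer_id"], w["iso_week"])] += w["amount"]
    return dict(totals)

weekly = to_weekly(web_transactions, store_daily, partner_weekly, name_to_id)
```

The coarsest feed dictates the grain: weekly aggregates cannot be split back into days, so finer data is rolled up to match. Doing this in the pipeline, rather than in each analysis, is the "temporal congruence" part of the antidote.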

Where is the Tofu ?

Blue Pill
o It is very simple to produce “reasonable” recommendations
o But extremely difficult to improve them to become “great”
o And, there is a huge difference in business value between reasonable Data Set & great …

Red Pill - Data Curation
o The Antidote : The insights and the algorithms should be relevant and scalable …
o There is a huge gap between Model-Reason and Deploy …
o Statistical Significance need not mean business significance
o Don't confuse the statistical significance of an experiment with the magnitude of the result, even though the word "significance" is often used for both – Peter Norvig

Ref: Xavier Amatriain when he talked about the Netflix Prize
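The gap between statistical and business significance shows up quickly at big-data sample sizes. The sketch below runs a standard two-proportion z-test on a hypothetical experiment; the traffic numbers and the `MIN_LIFT_THAT_PAYS` threshold are made-up assumptions for illustration only.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (absolute lift of B over A, two-sided p-value)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return p_b - p_a, p_value

# Hypothetical experiment: 5M users per arm, 10.00% vs 10.10% conversion.
lift, p_value = two_proportion_z_test(500_000, 5_000_000, 505_000, 5_000_000)

MIN_LIFT_THAT_PAYS = 0.005  # assumed business threshold: half a percentage point
statistically_significant = p_value < 0.05
business_significant = lift >= MIN_LIFT_THAT_PAYS
```

With millions of users per arm, a tenth of a percentage point is detected with a very small p-value, yet it falls well short of the assumed threshold for paying back the change. The p-value answers "is the effect real?", while the lift answers "is it worth it?".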

"Knowledge is a process of piling up facts; wisdom lies in their simplification."

- Martin Fischer

Analytics - miscues

o Don’t Torture the Data

Down the rabbit hole art by frostyshadows
http://frostyshadows.deviantart.com/art/Down-the-Rabbit-Hole-358090601

Design Principles

1. Start with needs*
2. Do less
3. Design with data
4. Do the hard work to make it simple
5. Iterate. Then iterate again.
6. Build for inclusion
7. Understand context
8. Build digital services, not websites
9. Be consistent, not uniform
10. Make things open: it makes things better

https://www.gov.uk/design-principles

Data Alone is not enough

o Data alone is not enough
  • Induction not deduction - Every learner should embody some knowledge or assumptions beyond the data it is given in order to generalize beyond it
o Machine Learning is not magic – one cannot get something from nothing
  • In order to infer, one needs the knobs & the dials
  • One also needs a rich expressive dataset
o Data Scientists are not Data Alchemists
  • Don’t expect Analytic Gold from a pack of data lead

A few useful things to know about machine learning - by Pedro Domingos
http://dl.acm.org/citation.cfm?id=2347755
https://www.flickr.com/photos/bionerd/3123155390

More Data Beats a Cleverer Algorithm

o More Data Beats a Cleverer Algorithm
  • Or conversely select algorithms that improve with data
  • Don’t optimize prematurely without getting more data
o Learn many models, not Just One
  • Ensembles ! – Change the hypothesis space
  • Netflix prize
  • E.g. Bagging, Boosting, Stacking
o Simplicity Does not necessarily imply Accuracy
o Representable Does not imply Learnable
  • Just because a function can be represented does not mean it can be learned
o Correlation Does not imply Causation
o http://doubleclix.wordpress.com/2014/03/07/a-glimpse-of-google-nasa-peter-norvig/
o A few useful things to know about machine learning - by Pedro Domingos
  § http://dl.acm.org/citation.cfm?id=2347755
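Of the ensemble techniques named above, bagging is the simplest to show end to end. The sketch below is a toy illustration under assumptions: a one-dimensional threshold "stump" as the base model, synthetic data with 10% flipped labels, 25 bootstrap resamples and a fixed random seed. The function names are hypothetical and nothing here reproduces a Netflix Prize method.

```python
import random

def fit_stump(points):
    """Pick the threshold t with the fewest errors for the rule 'predict 1 if x > t'."""
    candidates = sorted({x for x, _ in points})
    return min(candidates, key=lambda t: sum((x > t) != bool(y) for x, y in points))

def bagged_predict(thresholds, x):
    """Majority vote over all stumps in the ensemble."""
    votes = sum(x > t for t in thresholds)
    return int(votes * 2 > len(thresholds))

rng = random.Random(0)

# Hypothetical noisy data: the true label is 1 when x > 0.5; 10% of labels are flipped.
train = []
for _ in range(200):
    x = rng.random()
    y = int(x > 0.5)
    if rng.random() < 0.1:
        y = 1 - y
    train.append((x, y))

# Bagging: fit one stump per bootstrap resample (sampling with replacement).
thresholds = [fit_stump([rng.choice(train) for _ in train]) for _ in range(25)]

# Evaluate the vote against the noise-free rule on an evenly spaced grid.
test_xs = [i / 100 for i in range(101)]
accuracy = sum(bagged_predict(thresholds, x) == int(x > 0.5) for x in test_xs) / len(test_xs)
```

Each stump sees a slightly different resample and so lands on a slightly different threshold; the vote averages those differences out. Boosting and stacking change how the models are trained and combined, but share the idea of learning many models rather than one.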

In short …

o Build Full stack, iteratively building capabilities
o Identify the ‘Right’ Business Problems
o Create Valuable Data Perspectives
o Frame problems & bring analytics together with non-quantitative information to build compelling stories
o Embed Inference & Intelligence in products

https://www.linkedin.com/pulse/article/20141108013125-1290064-winning-at-analytics-takes-more-than-technology
http://www.kdnuggets.com/2014/09/hiring-data-scientist-what-to-look-for.html

Ogilvy & Mather Advertising : Morning view from the Ogilvy & Mather NY office, nicknamed the Chocolate Factory #TravelTuesday

Thank You