The Analytics Continuum

The Analytics Continuum, Rob Marano, 7 May 2014. © 2014 The Hackerati, Inc. This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Upload: rob-marano

Post on 06-May-2015

DESCRIPTION

Quick tour of data analytics and machine learning for the discerning business analyst and investment banker.

TRANSCRIPT

Page 1: The Analytics Continuum

The Analytics Continuum

Rob Marano

7 May 2014

Page 2: The Analytics Continuum

“What’s measured improves.”

Peter F. Drucker

Page 3: The Analytics Continuum

“Knowledge has to be improved, challenged, and increased constantly, or it vanishes.”

Peter F. Drucker

Page 4: The Analytics Continuum

“When you develop your opinions on the basis of weak evidence, you will have difficulty interpreting subsequent information that contradicts these opinions, even if this new information is obviously more accurate.”

Nassim Nicholas Taleb, The Black Swan: The Impact of the Highly Improbable

Page 5: The Analytics Continuum

Agenda

• Execution vs. search
• Balancing the “knowns” & “unknowns”
• Data here, there, everywhere …
• Machine learning as foundation to analytics
• Visualization as action to analytics
• Imminent opportunities

Page 6: The Analytics Continuum

History of Analytics

Source: Economic Times of India

What drives the progression?

Page 7: The Analytics Continuum

Why Consider Such an Investment?

• Like any innovation, right?
• Enable the business to gain
  – Competitive advantage
  – Cost cutting via productivity or automation
  – Compliance
• But what about all that tech we already have?

Is change good for the bottom line?

Page 8: The Analytics Continuum

Why Consider Such an Investment?

• Machine learning is used in
  – Web search
  – Spam filters
  – Recommender systems
  – Ad placement
  – Credit scoring
  – Fraud detection
  – Stock trading
  – Drug design
  – and much more

Page 9: The Analytics Continuum

Impact of “Startup Culture”

• The most successful businesses have perfected execution
• They run operations with the highest level of efficiency and effectiveness for the business
• Like any auto-assist or fully automated system, the operations are modeled perfectly

Change is not considered a constant or an asset

Page 10: The Analytics Continuum

Impact of “Startup Culture”

• The most successful startups have perfected change as their advantage in the search for a niche
• Startups build solutions that anticipate change, especially in how they use data to pivot
• Data & analytics form the core of managing change

Startups value change inherently

Page 11: The Analytics Continuum

Impact of “Startup Culture”

• The startup community continues to be the vendor of choice behind all modern analytics
• Google, Yahoo, Facebook, Twitter, etc. … the list goes on
• Google started this “analytics age” – open source now dominates it

Any business has access to modern analytics

Page 12: The Analytics Continuum

Knowns & Unknowns

• Knowledge & business strategy
  – “Known knowns”
  – “Known unknowns”
  – “Unknown unknowns”
• Operations & strategy depend upon evidence
• Get the right info to the right person in a timely manner

Page 13: The Analytics Continuum

(Big) Data Here, There, Everywhere

• Data operates in every process but often goes uncollected
• The more that is online, the greater the potential
• Advantages
  – Competitive
  – Productivity/efficiency
  – Compliance

Pyramid, top to bottom: Wisdom, Knowledge, Info, Data

Page 14: The Analytics Continuum

How Big is “Big Data?”

What’s big for your department? Company?

Source: InfoChimps, “[Infographic] Taming Big Data from Wikibon”

Page 15: The Analytics Continuum

Foundation of Analytics

• Historically, rigid data dictionaries provided advantages via SQL and RDBMS
• As compute and storage fell in cost and deployment complexity, more data was processed
• Cost of infrastructure kept rising; the state of the art did not keep pace

Big Data enables commodity analytics

Page 16: The Analytics Continuum

Analytics Core

• Big Data
  – Commodity computation & storage
  – Modern computation framework
  – Open, loose-coupling of components
• Machine learning
  – Commodity knowledge discovery
• Delivered as a cost-effective service

Page 17: The Analytics Continuum

IT Transition to Big Data Analytics

• Startup advantages lead to cost-effective analysis of large quantities of data
• Traditional data warehouse solutions do not scale effectively in cost or productivity
• Growth of open source delivers both

New “open” vendors leading the way

Page 18: The Analytics Continuum

Big Data as Enabler

Source: VMware Blog, “4 Key Architecture Considerations for Big Data Analytics”

Page 19: The Analytics Continuum

Page 20: The Analytics Continuum

Apache Hadoop as Epicenter

• Data Integration (Flume, Chukwa, Sqoop)
• Scripting (Pig)
• Distributed Storage (HDFS)
• Systems Management & Monitoring (Ambari, Zookeeper)
• Workflow & Scheduling (Oozie)
• Database (HBase, Cassandra)
• Distributed Compute (MapReduce)
• Meta Data Services (HCatalog)
• Query (Hive)
• Machine Learning (Mahout)

Source: Hortonworks, “About Hortonworks Data Platform”

Page 21: The Analytics Continuum

The Hadoop Ecosystem

• Ambari: Deployment, configuration and monitoring
• Flume: Collection and import of log and event data
• HBase: Column-oriented database scaling to billions of rows
• HCatalog: Schema and data type sharing over Pig, Hive and MapReduce
• HDFS: Distributed redundant file system for Hadoop
• Hive: Data warehouse with SQL-like access
• Mahout: Library of machine learning and data mining algorithms
• MapReduce: Parallel computation on server clusters
• Pig: High-level programming language for Hadoop computations
• Oozie: Orchestration and workflow management
• Sqoop: Imports data from relational databases
• Whirr: Cloud-agnostic deployment of clusters
• Zookeeper: Configuration management and coordination

Source: Edd Dumbill, “What is Apache Hadoop?”
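To make the MapReduce entry more concrete, here is a minimal single-process sketch of its map, shuffle, and reduce phases, written as a word count. It does not use Hadoop at all; the function names (map_phase, shuffle, reduce_phase, word_count) and the sample lines are illustrative assumptions, and a real cluster would spread each phase across many machines.

```python
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)

def word_count(lines):
    """Run map, shuffle, and reduce in sequence on a list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

# Hypothetical input: three short "documents".
counts = word_count(["big data here", "data there", "Data everywhere"])
```

Because each map call sees only one line and each reduce call sees only one key, both phases can in principle run in parallel, which is the property the slide's one-line description relies on.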

Page 22: The Analytics Continuum

So, what is Machine Learning?

• Non-trivial process of finding and communicating “valid, novel, potentially useful and understandable patterns in data.” [1]
• Delivers the engineering behind the science of automated classification, categorization, and recommendation without being explicitly programmed
• Allows data to be transformed with relative ease into actionable knowledge

ML powers today’s internet economies

[1] Ciro Donalek, “Supervised & Unsupervised Learning”

Page 23: The Analytics Continuum

Machine Learning as Enabler

• Open source, cloud computing, & startup culture powered the rise of analytics
• Delivers powerful processing & results
• Figures out how to perform a particularly manual task by generalizing from examples

Tactics & strategy require evidence that learns

Page 24: The Analytics Continuum

Learning – Human or Machine

• Learning is an iterative process that converges
• The ML “space” is huge and growing, but get a handle on the intended mission objectives
  – Representation
    • Which group of classifiers will “it” learn; which features
  – Evaluation
    • Distinguish good from bad classifiers
  – Optimization
    • Which is the highest scoring classifier

Source: Pedro Domingos, “A Few Useful Things to Know about Machine Learning”
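The three components on this slide can be illustrated with a deliberately tiny sketch. Everything here is an assumption made for illustration: the toy data, the one-parameter family of threshold classifiers (representation), plain accuracy (evaluation), and exhaustive search over a handful of candidates (optimization). Real systems use far richer choices for each part.

```python
# Toy data: (feature value, label); the label is 1 when the value is "large".
data = [(1.0, 0), (2.0, 0), (3.0, 0), (6.0, 1), (7.0, 1), (8.0, 1)]

def make_classifier(threshold):
    """Representation: the family of classifiers 'x >= threshold'."""
    return lambda x: 1 if x >= threshold else 0

def accuracy(classifier, examples):
    """Evaluation: fraction of examples the classifier labels correctly."""
    correct = sum(1 for x, y in examples if classifier(x) == y)
    return correct / len(examples)

def best_threshold(examples, candidates):
    """Optimization: exhaustive search for the highest-scoring classifier."""
    return max(candidates,
               key=lambda t: accuracy(make_classifier(t), examples))

candidates = [0.0, 2.5, 4.5, 7.5]  # hypothetical search space
threshold = best_threshold(data, candidates)
score = accuracy(make_classifier(threshold), data)
```

Swapping any one of the three functions (a different classifier family, a different score, a smarter search) changes the learner while leaving the other two untouched, which is the point of separating them.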

Page 25: The Analytics Continuum

Analytics Starts With Data

Diagram labels: Ingestion; Conversion; Upload; websites + web svcs

Image Source: Research Live, “Order from Chaos”

Page 26: The Analytics Continuum

and It Ends with Knowledge

Diagram labels: Aggregation; Analysis; Visualization

Pyramid, top to bottom: Wisdom, Knowledge, Info, Data

Image Source: Visualize This by Nathan Yau

Page 27: The Analytics Continuum

Taxonomy of ML

• ML converts data trends into logic to automate data processing
• Based upon pattern recognition
• Basic goal is generalization
• Built upon two key techniques
  – Supervised learning
  – Unsupervised learning

Page 28: The Analytics Continuum

Supervised Learning

• ML technique that takes a training data set with specific features and produces a model
• The model is used to assess whether an input is of a pre-defined class
• Key to supervised learning remains feature set extraction
• Popular examples include
  – Regression
  – Classification
  – Outlier detection
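As a hedged illustration of supervised classification, the sketch below uses k-nearest neighbors, chosen only because it fits in a few lines. The training set, the labels "low" and "high", and the function name knn_predict are all hypothetical. Here the "model" is simply the stored labeled examples.

```python
from collections import Counter
import math

def knn_predict(training, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    by_distance = sorted(training, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical labeled training set: (features, pre-defined class).
training = [
    ((1.0, 1.0), "low"), ((1.5, 2.0), "low"), ((2.0, 1.5), "low"),
    ((8.0, 8.0), "high"), ((8.5, 9.0), "high"), ((9.0, 8.5), "high"),
]

prediction_a = knn_predict(training, (1.2, 1.4))
prediction_b = knn_predict(training, (8.8, 8.7))
```

The two numbers in each tuple stand in for extracted features; how well any supervised method works depends heavily on choosing those features, as the slide notes.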

Page 29: The Analytics Continuum

Unsupervised Learning

• ML technique to group data according to similar features, or characteristics
• Such techniques do not require a model to be generated; rather, similarity is calculated
• Popular examples include
  – Clustering
  – Density estimation
  – Visualization by projection
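Clustering by similarity can be sketched with a bare-bones k-means loop. This is a minimal illustration under stated assumptions: the points and starting centroids are made up, Euclidean distance stands in for "similarity", and the loop runs a fixed number of iterations rather than testing for convergence.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its assigned points."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroid  # keep an empty cluster's centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical unlabeled data with two visible groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
          (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
```

No labels are supplied anywhere; the grouping emerges only from distances between points, which is what separates this from the supervised case.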

Page 30: The Analytics Continuum

Most Important Step in ML

• “Know thine data like thyself”
  – Know features about your data in order to narrow the algorithm selection process
  – Are the features nominal or continuous?
  – Are there missing values in the features?
  – If missing values, where are they missing?
  – Are there outliers in the data?
  – Are you looking for something that occurs very infrequently?
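Several of these questions can be answered mechanically before any algorithm is chosen. The sketch below profiles a single feature column; the helper name profile_feature, the use of None for missing values, and Tukey's 1.5 × IQR rule for outliers are assumptions made for illustration, not the only reasonable choices.

```python
import statistics

def profile_feature(values):
    """Summarize one feature column; None marks a missing value."""
    present = [v for v in values if v is not None]
    summary = {
        # Where are values missing?
        "missing_at": [i for i, v in enumerate(values) if v is None],
        # Nominal or continuous?
        "kind": "continuous"
                if all(isinstance(v, (int, float)) for v in present)
                else "nominal",
    }
    if summary["kind"] == "continuous" and len(present) >= 4:
        # Tukey's rule (an assumed choice): flag points beyond 1.5 * IQR.
        q1, _, q3 = statistics.quantiles(present, n=4)
        spread = 1.5 * (q3 - q1)
        summary["outliers"] = [v for v in present
                               if v < q1 - spread or v > q3 + spread]
    return summary

# Hypothetical columns: one numeric with a suspicious entry, one categorical.
ages = profile_feature([34, 29, None, 41, 38, 36, 250])
regions = profile_feature(["east", "west", None, "east"])
```

A profile like this helps narrow the algorithm shortlist: for instance, many missing values or heavy outliers would argue for cleaning the data or preferring methods that tolerate them.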

Page 31: The Analytics Continuum

Choosing the ML Algorithm

• Know your data inside out & back again
• Consider the goal
• Use unsupervised unless you need to predict certain target values; then use supervised
• Choose a set of algos matched to goal/data
• Try each algorithm, assess and compare
• Adjust and combine optimization techniques
• Choose, operate, and continually measure
• Repeat

Page 32: The Analytics Continuum

Generalized ML Application Steps

• Collect data
• Prepare the input data
• Analyze input data & features
• Train the algorithm (if supervised)
• Test the algorithm with fresh data
• Operate ML
• Detect subtle changes to data (cycles, seasons)
• Measure for performance
• Repeat as frequently as needed

Portions sourced: Machine Learning in Action by Peter Harrington, Manning Publications
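The train and test steps, together with the earlier advice to try several algorithms and compare them, can be sketched as follows. The data, the 75/25 split, and both candidate learners (a majority-class baseline and a midpoint threshold) are hypothetical stand-ins chosen to keep the example short.

```python
import random

def split(rows, test_fraction=0.25, seed=0):
    """Shuffle reproducibly, then hold out a fraction as fresh test data."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]

def train_majority(train):
    """Baseline: always predict the most common training label."""
    labels = [y for _, y in train]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority

def train_threshold(train):
    """Predict 1 above the midpoint between the two class means."""
    def class_mean(cls):
        xs = [x for x, y in train if y == cls]
        return sum(xs) / len(xs)
    midpoint = (class_mean(0) + class_mean(1)) / 2
    return lambda x: 1 if x > midpoint else 0

def accuracy(model, rows):
    """Fraction of held-out rows the model labels correctly."""
    return sum(model(x) == y for x, y in rows) / len(rows)

# Hypothetical prepared data: one numeric feature, binary label.
rows = ([(float(i), 0) for i in range(20)]
        + [(float(i), 1) for i in range(30, 50)])
train, test = split(rows)
scores = {name: accuracy(trainer(train), test)
          for name, trainer in [("majority", train_majority),
                                ("threshold", train_threshold)]}
```

In operation the same accuracy check would be rerun on newly collected data at intervals, so that drift from cycles or seasons shows up as a falling score and triggers retraining.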

Page 33: The Analytics Continuum

Highlights of Supervised Algos

• Generalized Linear Models
  – Bayesian Regression
  – Ordinary least squares (regression)
• Support Vector Machines
• K Nearest Neighbors
• Naïve Bayes
• Decision Trees
• Neural Networks
• Ensemble Methods

Portions sourced: “Supervised Learning” from scikit-learn.org

Page 34: The Analytics Continuum

Highlights of Unsupervised Algos

• Clustering
• K-means
• DBSCAN
• Hidden Markov Models
• Density Estimation
• Neural Networks (restricted Boltzmann)

Portions sourced: “Supervised Learning” from scikit-learn.org

Page 35: The Analytics Continuum

Learning -> Evaluation

• The Classifier Evaluation Framework

Figure: a flow diagram linking these steps: Choice of Learning Algorithm(s); Datasets Selection; Error-Estimation/Sampling Method; Performance Measure of Interest; Statistical Test; Perform Evaluation. One arrow type from 1 to 2 means knowledge of 1 is necessary for 2; the other means feedback from 1 should be used to adjust 2.

Source: “Performance Evaluation of Machine Learning Algorithms” by Mohak Shah & Nathalie Japkowicz

Page 36: The Analytics Continuum

Overview of Performance Measures

Figure: a tree of “All Measures”, branching by the information used: Confusion Matrix; Additional Info (Classifier Uncertainty, Cost Ratio, Skew); Alternate Information. Classifier types shown: Deterministic Classifiers; Scoring Classifiers; Continuous and Prob. Classifiers (Reliability metrics).

• Multi-class focus, no chance correction: Accuracy, Error Rate
• Multi-class focus, chance correction: Cohen’s Kappa, Fleiss’ Kappa
• Single-class focus: TP/FP Rate, Precision/Recall, Sens./Spec., F-measure, Geom. Mean, Dice
• Graphical measures: ROC Curves, PR Curves, DET Curves, Lift Charts, Cost Curves
• Summary statistics: AUC, H Measure, Area under ROC-cost curve
• Distance/error measures: RMSE
• Information-theoretic measures: KL divergence, K&B IR, BIR
• Alternate information: Interestingness, Comprehensibility, Multi-criteria

Source: “Performance Evaluation of Machine Learning Algorithms” by Mohak Shah & Nathalie Japkowicz
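To show what “chance correction” means in practice, here is a small sketch of Cohen's kappa for a two-class problem, using the standard definition kappa = (observed agreement - expected agreement) / (1 - expected agreement). The counts passed in are hypothetical.

```python
def cohens_kappa(tp, fp, fn, tn):
    """Cohen's kappa for a binary confusion matrix."""
    total = tp + fp + fn + tn
    # Observed agreement is plain accuracy.
    observed = (tp + tn) / total
    # Agreement expected by chance, from the row and column totals.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical counts: accuracy is 0.85, yet chance alone would give 0.60.
kappa = cohens_kappa(tp=40, fp=10, fn=20, tn=130)
```

With skewed classes a classifier can reach high accuracy by favoring the majority class, so a chance-corrected score gives a more sober reading than accuracy alone.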

Page 37: The Analytics Continuum

Confusion Matrix-Based Performance Measures

• Multi-Class Focus:
  – Accuracy = (TP+TN)/(P+N)
• Single-Class Focus:
  – Precision = TP/(TP+FP)
  – Recall = TP/P
  – Fallout = FP/N
  – Sensitivity = TP/(TP+FN)
  – Specificity = TN/(FP+TN)

Confusion Matrix

Hypothesized class   True class: Pos   True class: Neg
Yes                  TP                FP
No                   FN                TN
Total                P=TP+FN           N=FP+TN

Source: “Performance Evaluation of Machine Learning Algorithms” by Mohak Shah & Nathalie Japkowicz
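The formulas on this slide translate directly into code. The sketch below computes them from the four cells of the matrix; the function name confusion_metrics and the example counts are assumptions made for illustration.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Compute the slide's measures from the four confusion-matrix cells."""
    p = tp + fn  # all actual positives
    n = fp + tn  # all actual negatives
    return {
        "accuracy": (tp + tn) / (p + n),
        "precision": tp / (tp + fp),
        "recall": tp / p,           # same quantity as sensitivity
        "fallout": fp / n,          # false-positive rate
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (fp + tn),
    }

# Hypothetical counts, e.g. from a fraud-detection test set.
m = confusion_metrics(tp=40, fp=10, fn=20, tn=130)
```

Two relationships follow from the definitions: recall and sensitivity are the same number, and fallout equals one minus specificity, so reporting one of each pair is enough.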

Page 38: The Analytics Continuum

Tying It All Together

Page 39: The Analytics Continuum

Visualization as Action

Page 40: The Analytics Continuum

Imminent Opportunities

• Any business with high volume of data
  – Look at processes, human-machine interfaces
  – Sentiment; Customer Experience; Campaigns
  – Infosec; Network Services; Customer Churn
• Sectors becoming analytics-ready
  – Healthcare; Government; Retail
  – Manufacturing; Utilities
• Imagine a world of Internet-of-Things?

Can you imagine keeping all the data? Analyzing it?

Page 41: The Analytics Continuum

Analytics

• Big Data
  – Commodity compute & storage
• Analytics
  – Commodity intelligence
• Big Data Analytics
  – Store everything
  – Analyze everything
  – Do it every day

Cost-effectively manage “unknown unknowns”

Page 42: The Analytics Continuum

“Know the enemy and know yourself; in a hundred battles you will never be in peril.”

Sun Tzu

Page 43: The Analytics Continuum

“It’s no longer hard to find the answer to a given question; the hard part is finding the right question, and as questions evolve, we gain better insight into our own ecosystem and our business.”

Kevin Weil
Director of Product for Revenue, Twitter

Page 44: The Analytics Continuum

The Analytics Continuum

Rob Marano

7 May 2014