Mastering the 80% of Analytics
What Data Scientists Really Do
Mik, PhD @AvrioAnalytics
Boss Parents
Me Reality
Caveat
•Data Science is very broad
•This is a particular perspective
•Mathematician
•Predictive algorithm developer
•Very brief
A Day in the Life
[Chart: Wrangling, Modeling, Features, Results]
What is “Wrangling”?
•Data:
•Getting
•Formatting
•Cleaning
Data Janitorial Work
Getting the Data
•Myriad of sources
•Varying collection, storage and maintenance
•Most people just don’t care
•At least not soon enough
Got it. Now what?
•Structured: in a consistent and defined format
•Unstructured: no consistent format
•Text data
Movie Rating
Star Wars 5 Stars
I loved the new Star Wars, definitely 5/5 stars!
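The gap between the structured row and the free-text review can be sketched in a few lines of Python. This is only an illustration: the pattern and the `extract_rating` helper are hypothetical, and real review text would need far more robust parsing.

```python
import re

# Hypothetical pattern: matches ratings written as "<n>/5", e.g. "5/5 stars".
RATING_PATTERN = re.compile(r"(\d)\s*/\s*5")

def extract_rating(review_text):
    """Return the star rating found in free text, or None if there is none."""
    match = RATING_PATTERN.search(review_text)
    return int(match.group(1)) if match else None

# Structured: the rating sits in a named field.
structured = {"movie": "Star Wars", "rating": 5}

# Unstructured: the same fact is buried in prose.
unstructured = "I loved the new Star Wars, definitely 5/5 stars!"

extracted = extract_rating(unstructured)
print(extracted == structured["rating"])
```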
Formatting
•Alignment
•Unions, intersections, grouping
•Transformations
Formatting
Time Username Views
12:30 jsmith 32
12:45 mik 27
1:00 dmartin 8
1:15 jsmith 46
Time Username Views
12:20 gwarren 12
12:30 lpeabody 53
12:40 dmartin 20
12:50 hjohnson 5
Formatting
Username Views
jsmith 32, 46
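The union-then-group step shown on these slides can be sketched in plain Python. The rows are the ones from the two tables above; the `views_by_user` helper is a hypothetical name, and a data-frame library would normally do this in one or two calls.

```python
from collections import defaultdict

# The two view logs from the slides, as (time, username, views) rows.
log_a = [("12:30", "jsmith", 32), ("12:45", "mik", 27),
         ("1:00", "dmartin", 8), ("1:15", "jsmith", 46)]
log_b = [("12:20", "gwarren", 12), ("12:30", "lpeabody", 53),
         ("12:40", "dmartin", 20), ("12:50", "hjohnson", 5)]

def views_by_user(*logs):
    """Union the logs, then group the view counts by username."""
    grouped = defaultdict(list)
    for log in logs:
        for _time, username, views in log:
            grouped[username].append(views)
    return dict(grouped)

combined = views_by_user(log_a, log_b)
print(combined["jsmith"])  # [32, 46]
```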
Data is Dirty Business
•Duplicates
•Missing values
• Ill-formed values
•Wrong values
Similar in effect
Types of Missing-ness
•MCAR: Missing Completely at Random
•MAR: Missing at Random
•MNAR: Missing Not at Random
Bad
Worse
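A small simulation can make the three mechanisms concrete. Everything here is invented for illustration (the data, the thresholds, the drop probabilities): `x` is always observed, `y` is the value that may go missing, and the sketch compares the naive mean of the surviving `y` values with the true mean.

```python
import random
from statistics import fmean

rng = random.Random(0)  # fixed seed so the sketch is repeatable
n = 10_000

# Hypothetical data: x is always observed, y may go missing.
xs = [rng.gauss(0, 1) for _ in range(n)]
ys = [50 + 8 * x + rng.gauss(0, 6) for x in xs]

def observed_mean(keep_flags):
    """Mean of y over the rows where y survived."""
    return fmean(y for y, keep in zip(ys, keep_flags) if keep)

# MCAR: every y has the same 30% chance of being lost.
mcar = [rng.random() > 0.3 for _ in range(n)]
# MAR: the chance of losing y depends only on the observed x.
mar = [not (x > 1 and rng.random() < 0.8) for x in xs]
# MNAR: the chance of losing y depends on the unseen y itself.
mnar = [not (y > 60 and rng.random() < 0.8) for y in ys]

true_mean = fmean(ys)
for label, flags in (("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)):
    print(label, round(observed_mean(flags) - true_mean, 2))
```

In this setup the MCAR mean stays close to the truth, while the MAR and MNAR means are pulled down. The MAR bias could in principle be corrected using `x`; the MNAR bias cannot be corrected from the observed data alone.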
Dealing with Missing Data
•Deletion
•Pairwise
•Listwise
X Y Z
129 1 40
110 3
210 32
989 65
Pairwise
X Z
129 40
210 32
Listwise
X Y Z
129 40
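The two deletion strategies can be sketched as follows. The values loosely follow the slide, but which cell is empty in each row is an assumption, and the `listwise` and `pairwise` helpers are hypothetical names.

```python
# Illustrative rows; None marks a missing value (cell positions are assumed).
rows = [
    {"X": 129, "Y": 1, "Z": 40},
    {"X": 110, "Y": 3, "Z": None},
    {"X": 210, "Y": None, "Z": 32},
    {"X": None, "Y": 989, "Z": 65},
]

def listwise(rows):
    """Keep only rows with no missing value in any column."""
    return [r for r in rows if all(v is not None for v in r.values())]

def pairwise(rows, a, b):
    """Keep every row where the two columns being analysed are both present."""
    return [(r[a], r[b]) for r in rows if r[a] is not None and r[b] is not None]

print(listwise(rows))            # only the fully observed row survives
print(pairwise(rows, "X", "Z"))  # more rows survive for this one pair
```

Listwise deletion is simpler but throws away more data. Pairwise deletion keeps more, at the cost of each analysis running on a different subset of rows.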
Dealing with Missing Data
• Imputation
•Mean substitution
•Regression
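Both single-imputation options can be sketched with the standard library. The data and function names are made up for illustration; the regression variant fits an ordinary least-squares line on the complete pairs and predicts the missing value from it.

```python
from statistics import fmean

def mean_substitution(values):
    """Replace each missing value with the mean of the observed ones."""
    fill = fmean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]

def regression_imputation(xs, ys):
    """Fill missing ys from a least-squares line fitted on complete pairs."""
    pairs = [(x, y) for x, y in zip(xs, ys) if y is not None]
    mean_x = fmean(x for x, _ in pairs)
    mean_y = fmean(y for _, y in pairs)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in pairs)
             / sum((x - mean_x) ** 2 for x, _ in pairs))
    intercept = mean_y - slope * mean_x
    return [intercept + slope * x if y is None else y for x, y in zip(xs, ys)]

# Hypothetical data: y grows with x, and the last y is missing.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 6.0, 8.0, None]

print(mean_substitution(ys)[-1])          # ignores x entirely
print(regression_imputation(xs, ys)[-1])  # follows the trend in x
```

Mean substitution pulls the filled value toward the centre and shrinks the variance. Regression imputation respects the relationship with `x`, but makes that relationship look cleaner than it really is.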
Dealing with Missing Data
•Multiple Imputation
•Stochastic simulation
•Must know distribution
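A heavily simplified sketch of the idea: assume a distribution for the missing variable (here a normal fitted to the observed values, which is exactly the "must know distribution" assumption), draw several completed data sets, compute the estimate on each, and pool. A full multiple-imputation procedure would also draw the distribution's parameters rather than fixing them; the function name and data are hypothetical.

```python
import random
from statistics import fmean, stdev

def multiple_imputation_mean(values, m=20, seed=0):
    """Pool the mean over m stochastically completed copies of the data."""
    rng = random.Random(seed)
    observed = [v for v in values if v is not None]
    mu, sigma = fmean(observed), stdev(observed)  # assumed normal model
    estimates = []
    for _ in range(m):
        # Each pass fills the gaps with fresh random draws.
        completed = [rng.gauss(mu, sigma) if v is None else v for v in values]
        estimates.append(fmean(completed))
    # Pooled estimate, plus the spread between imputations.
    return fmean(estimates), stdev(estimates)

values = [4.0, 5.0, None, 6.0, None, 5.0]
pooled, spread = multiple_imputation_mean(values)
print(round(pooled, 2), round(spread, 2))
```

The spread between imputations is the point of the exercise: it carries the uncertainty about the missing values into the final estimate, which single imputation hides.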
Gotchas
•Sampling Error
•Statistical Power
•Population Parameters
•Propagation
So what do I do?
•Approaches vary quite a lot
•MCAR, MAR hard to prove
•Principle of Least Harm
60% - 80% of Work
Cleaning Done! Now the fun!
•Almost…
•Clean data is still “raw”
•Features: pre-processed for modeling
Feature Engineering
•A lot of data is useless
•Filter, slice, transform
•Singular idea: What’s the main driver?
Feature Engineering
•Considerations
•Relevance
•Redundancy
•Curse of Dimensionality
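One simple way to act on relevance and redundancy is a correlation filter: keep features that correlate with the target, and skip any feature that nearly duplicates one already kept. This is just one of many possible approaches; the thresholds, feature names, and data below are invented for illustration.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

def select_features(features, target, min_relevance=0.5, max_redundancy=0.9):
    """Greedy filter: most relevant first, skipping near-duplicates."""
    ranked = sorted(features,
                    key=lambda name: abs(pearson(features[name], target)),
                    reverse=True)
    kept = []
    for name in ranked:
        if abs(pearson(features[name], target)) < min_relevance:
            continue  # not relevant enough
        if any(abs(pearson(features[name], features[k])) > max_redundancy
               for k in kept):
            continue  # redundant with a feature already kept
        kept.append(name)
    return kept

# Hypothetical data: two versions of the same measurement plus a noise column.
target = [1, 2, 3, 4, 5, 6]
area_sqft = [1100, 1900, 3200, 3900, 5100, 6000]
features = {
    "area_sqft": area_sqft,
    "area_sqm": [v * 0.0929 for v in area_sqft],
    "noise": [3, 1, 4, 1, 5, 2],
}
print(select_features(features, target))
```

Dropping the duplicate and the noise column also shrinks the number of dimensions the model has to cope with.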
Feature Engineering Methods
• PCA
• Edge Detection
• Blob Detection
• Auto encoding
• Kernel PCA
• Partial Least Squares
• Generalized Least Squares
• Direct Modeling
• Isomapping
• Mutual Information Theory
• Information Entropy Theory
• ICA
• MDR
• Latent Factors
• MPCA
• LSA
• Statistical Moments
• Random Projections
•De-Noising
•Weighting
•Patch Extraction
•Functional Mapping
•Discretization
•Filtering
•FFT
•Smoothing
•Density Mapping
Feature Engineering
• It’s hard
•Analysis + Domain knowledge
•…Deserves a presentation on its own
•Features are input to machine learning
Now the fun stuff (finally)
•ML: computer acts without explicit program
•Utilizes empirical data to “teach” a process
•Pattern Rec. -> ML -> Deep Learning
•Buzzwords abound
•Fairly simple, lots of libraries
ML Approaches
•Classes of problems
•Continuous (regression)
•Discrete (classification)
•Classes of solutions
•Supervised
•Unsupervised
ML Algorithms
•Neural Networks
•Genetic Algorithms
•Bayesian Classification
•Support Vector Machines
•Many used as type of feature extraction
Neural Networks
•Motivated by brain function
•Neurons fire, activate paths
•Non-linear
•Simplest: Perceptron
[Diagram: inputs X1 and X2, with weights w1 and w2, feeding a Logic Layer]
Neural Networks
• Inputs feed neuron with weight
•Logic Layer: activation function
•Fires (or not) based on inputs
•Weights from minimizing cost function
•Backpropagation
Sigmoid Logic Layer
[Plot: sigmoid output from 0 to 1 for inputs from -10 to 10, with curves for w = 1 and w = 2]
$$\frac{1}{1 + e^{-w^{T}x}}$$
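A single sigmoid neuron is enough to sketch the whole loop: weighted inputs, an activation function, and weights found by stepping down the gradient of a cost function. Assumptions in this sketch: a bias term is added (the slide formula has none), the cost is cross-entropy, and the task (logical AND) is chosen only because it is tiny. With one neuron there are no hidden layers, so this is the degenerate case of backpropagation.

```python
from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def neuron(weights, bias, inputs):
    """Weighted sum of the inputs passed through the sigmoid activation."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)

def train(samples, epochs=2000, learning_rate=0.5):
    """Fit weights by gradient descent on the cross-entropy cost."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            # For sigmoid + cross-entropy, the gradient with respect to the
            # weighted sum is simply (output - target).
            error = neuron(weights, bias, inputs) - target
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, inputs)]
            bias -= learning_rate * error
    return weights, bias

# Logical AND: the neuron should fire only when both inputs are 1.
AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train(AND)
predictions = [round(neuron(weights, bias, x)) for x, _ in AND]
print(predictions)  # [0, 0, 0, 1]
```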
Neural Networks
•Most networks are bigger
[Diagram: network with inputs X1, X2, hidden units A1 to AM, and outputs Y1 to YK]
Machine Learning
•Got data, features and algorithm
•Just plug in and profit!
•Not quite
•Tuning and training
Tuning
•What about N, M and K?
•Hyper-parameters
•Size of layers, thresholds, etc.
•Static specifics of the algorithm
[Diagram: network with inputs X1, X2, hidden units A1 to AM, and outputs Y1 to YK]
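The distinction between hyper-parameters and learned weights can be sketched with a small, hypothetical `NetworkShape` class. It assumes a fully connected network with one hidden layer and a bias per unit: N, M and K are fixed before training, and they determine how many weights training then has to find.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NetworkShape:
    """Hyper-parameters: chosen up front, never changed by training."""
    n_inputs: int   # N
    n_hidden: int   # M
    n_outputs: int  # K

    def parameter_count(self):
        # Weights plus one bias per unit, for each fully connected layer.
        hidden = (self.n_inputs + 1) * self.n_hidden
        output = (self.n_hidden + 1) * self.n_outputs
        return hidden + output

# Tuning means trying several shapes and comparing the trained results.
candidates = [NetworkShape(2, m, 1) for m in (2, 4, 8)]
for shape in candidates:
    print(shape, shape.parameter_count())
```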
Training
• It’s all about the teaching
•Representative data set
•Large, clean
Training
•Don’t teach to the test
•Causes overfitting
•Training (80%) and Testing (20%) data
•Cross-validation
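The 80/20 split and cross-validation can be sketched as below. The helper names are hypothetical and the "data" is just a list of integers; the point is that the test rows are set aside once, and cross-validation then rotates a validation fold through the training rows only.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle, then hold back a fraction of the rows for final testing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) - round(len(shuffled) * test_fraction)
    return shuffled[:cut], shuffled[cut:]

def k_fold(rows, k=5):
    """Yield (train, validation) pairs; each row is held out exactly once."""
    for i in range(k):
        validation = rows[i::k]
        train = [r for j, r in enumerate(rows) if j % k != i]
        yield train, validation

rows = list(range(100))
train_rows, test_rows = train_test_split(rows)
print(len(train_rows), len(test_rows))  # 80 20

for fold_train, fold_validation in k_fold(train_rows):
    # A model would be fitted on fold_train and scored on fold_validation.
    print(len(fold_train), len(fold_validation))
```

The test rows never take part in tuning, which is what keeps the final score honest.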
With all the open source libraries, isn’t machine learning easy now?
I got results!
•Why doesn’t anyone care?
•Kaggle vs. Real Life Syndrome
• It’s all in the presentation
It’s all in the presentation
•Complex topic
•Non-technical audience
•Several stakeholders
•Many likely skeptics
It’s all in the presentation
•Avoid buzzwords
•Focus on a business problem
•Show value
•Keep in mind cost
Is it actually science?
•Sometimes
•…but often not
•Data Sciences vs. Data Engineering
• It should be — focus on why
Is it actually science?
Applied Math
Computer Science
Domain Expertise
Applied Math
Computer Science
Physics
Physicist
Why Data Science?
•Big problems, fun challenges
•Both science and business
•Consistently awesome
2012: Sexiest Job of the Century
2016: Best Job of the Year
2016: Hottest Job of the Year
2016: Best Career Opportunity
Why Data Science?
Salary
So want to get started?
•Theano
•TensorFlow
•Torch
•Pandas
Tomorrow is here
www.avrioanalytics.com