Introduction
Data Science and Data Engineering
Instructor – Raja Iqbal
• Founder, CEO & Chief Data Scientist.
• Worked in Bing data mining, Bing Ads (2006-2013)
  • ETL, bot detection, online experimentation and A/B testing, relevance of online ads, click prediction, etc.
• Ph.D. in CS with a focus on computer vision, machine learning and data mining.
Copyright (c) 2018. Data Science Dojo 2
Instructor – Rebecca Merrett
• Technical Writer and Content Developer.
• Worked in game engine technology, writing technical content on new features.
• Graduate diploma in mathematics and statistics, with a bachelor's degree in information and media.
Instructor – Victoria Louise Clayton
• Instructor and Mentor.
• Worked for a small research consultancy in London and worked on projects for governments, international organizations and companies such as the UN and Siemens.
• BA in Human Sciences from Oxford University and an MSc in Decision Science.
Instructor – Margaux Penwarden
• Instructor and Mentor.
• Data scientist at McKinsey & Company, Sydney
• Bachelor’s in Computer Science and Mathematics from Télécom Paristech (“Grande Ecole”), and a Master’s in Statistics from Imperial College, London.
About Data Science Dojo
•Started in August 2014
•100+ bootcamps, workshops and corporate trainings
•~3500 attendees
•600+ companies
•10 countries
Learning Objectives
• Learn the theory and practice of data science for improved health systems and healthcare.
• Explore and visualize a health-related dataset.
• Build and evaluate predictive models for classification and regression (for instance, predicting whether a tumour is malignant or not), as an example of the use of machine learning in health.
Learning Objectives
• Understand the fundamentals of unsupervised learning and clustering, and its potential applications in health systems
• Learn fundamentals of text analytics and perform text analytics on a health-related dataset
• Get an introduction to big data and data engineering
CURRICULUM
Maximizing ROI From This Week
• Map the techniques to real problems at all times:
  • Problem and business impact
  • Data you have (and do not have)
  • Measurement metrics
  • Business metrics
Logistics
• 8:30 am – 5:30 pm daily*
• Course material and resources:
  • Handbooks
  • Learning portal
• Request:
  • Make sure your computers are ready
  • Keep the session interactive
  • Social media, email, etc.
*We will end at 4:00 pm on Friday
Agenda for Today
Session I: Understanding the AI and data science landscape
Session II: Data exploration and visualization
Session III: Introduction to predictive modeling
Session IV: Decision tree learning and building your first predictive model
Session V: Evaluating classification models
Understanding the AI and Data Science Landscape
Objectives
• Review the current data science landscape
• Discuss what other organizations are (or may be) doing
• Common data mining tasks
• Identify some data science problems in health
Drug Discoveries
• Insilico Medicine
  • Finding new drugs and treatments, including immunotherapies.
• MIT Clinical Machine Learning Group
  • Focussed on disease processes and design for effective treatment of diseases such as Type 2 diabetes.
• Knight Cancer Institute
  • Current focus on developing an approach to personalize drug combinations for Acute Myeloid Leukemia (AML).
Medical Imaging & Diagnostics
• VunoMed
  • Identifies different types of lung tissue damage by color to help physicians make more accurate diagnoses.
• IBM Watson Genomics
  • Provides precision medicine to cancer patients.
Virtual Assistants
• Scanadu’s doc.ai
  • NLP program that lets patients have their lab results explained to them by an app, saving both patient and doctor time and money.
• Somatix
  • Recognizes hand-to-mouth gestures to help people better understand their behavior and make life-affirming changes.
Research
• Google DeepMind
  • Develops technology to address macular degeneration in aging eyes.
• Desktop Genetics
  • AI-designed tech for more effective and affordable guides. Recognized as a leader in genome editing technology.
• iCarbonX
  • Monitors and models human biological data to help people find the lifestyle and treatments that can improve their health, life quality and joy.
Brainstorming
What are some other applications?
Connecting the Dots
•The underlying magic behind what we saw is ‘big data’ and ‘predictive analytics’
Big Data Pipeline
Stage: Data influx
• Output: Data stream
Stage: Collection
• Output: Target data
Stage: Preprocessing
• Output: Preprocessed data
Stage: Transformation
• Output: Transformed data
Stage: Data Mining
• Output: Patterns
Stage: Interpretation and Evaluation
• Output: Knowledge discovery and actionable insights
[Figure: technology stack spanning Data Management and Data Science. Activities: Collect, Store, Transform, Reason, Model, Visualize, Recommend, Predict, Explore. Technologies: ETL/Log, SQL, NoSQL, MapReduce, Real Time Analytics]
Big Data – Technology, Platforms & Products
Data Mining Tasks
• Descriptive Methods:
  • Find human-interpretable patterns that describe the data
  • Techniques: Clustering, Association Analysis, X-point summaries
• Predictive Methods:
  • Use available data to build models that can predict the outcome of future data
  • Techniques: Classification, Regression, Anomaly and Deviation Detection
• Prescriptive Methods:
  • Predict future outcomes and suggest actions that may prevent or mitigate the impact of the predicted outcomes
  • Techniques: Various optimization techniques
Traffic Management
Descriptive [Informing Role]:
• Traffic jam has happened already
• [Implicit: Do something about it]
Traffic Management
Predictive [Informing and Warning Role]:
• Traffic jam is about to happen in the next 30 minutes
• [Implicit: Do something before it happens]
Traffic Management
Prescriptive [Informing, Warning, and Advisory Role]:
Take action so traffic jam does not happen OR
Traffic jam is about to happen in the next 30 minutes and you could possibly take the following courses of action:
• Route traffic to service road near I-5
• Block more traffic from entering the WA-520 bridge
Data Mining and Predictive Analytics
In the next few slides, we will take a look at some of the most common data mining tasks.
Classification: A Simple Example
Training Set (used to learn the classifier):
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test Set (the learned model predicts Cheat):
Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

[Diagram: Training Set → Learn Classifier → Model → applied to Test Set]
Classification: More Examples
• What is the likelihood that a patient will develop diabetes?
• What is the likelihood that a COPD patient will be readmitted within 90 days of discharge?
• What is the likelihood that a person will not show up to their appointment?
Clustering: An Illustration
• Clustering in 3-D space using Euclidean distance
• Intra-cluster distances are minimized
• Inter-cluster distances are maximized
Clustering
• Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that:
  • Data points within a cluster are more similar to one another
  • Data points in different clusters are less similar to one another
Clustering: Similarity Measures
• Similarity Measures:
  • Euclidean distance if attributes are continuous
  • Other problem-specific measures
  • Example: whether or not a particular word occurs in both of two documents
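The two kinds of similarity measure above can be sketched in a few lines of base R. The vectors and the two word lists are made-up illustrations, not data from the course.

```r
# Euclidean distance between two records with continuous attributes
# (hypothetical values)
a <- c(1, 2, 3)
b <- c(4, 6, 3)
euclid <- sqrt(sum((a - b)^2))

# Problem-specific measure: does a given word occur in both documents?
# (hypothetical word lists)
doc1 <- c("fever", "cough", "fatigue")
doc2 <- c("cough", "rash")
shares_cough <- ("cough" %in% doc1) && ("cough" %in% doc2)
shares_fever <- ("fever" %in% doc1) && ("fever" %in% doc2)

euclid
shares_cough
shares_fever
```

A smaller Euclidean distance means the two records are more alike; the word test is a yes/no notion of similarity that suits text.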
Clustering: Examples
To find groups of documents that are similar to each other based on the most important terms that appear in them (e.g. medical records)
Association Analysis
Your behavior is being predicted, not by studying you, but by studying others.
Association Rule Discovery
• Given a set of records, each of which contains some number of items from a given collection:
  • Produce dependency rules that predict the occurrence of an item based on the occurrences of other items
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
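The two rules can be checked by simple counting. This is a minimal sketch in base R, assuming the usual definitions of support (share of transactions containing all the items) and confidence (share of transactions containing the left-hand side that also contain the right-hand side). The helper names `support` and `confidence` are illustrative, not from the slides.

```r
# The five transactions from the table
transactions <- list(
  c("Bread", "Coke", "Milk"),
  c("Beer", "Bread"),
  c("Beer", "Coke", "Diaper", "Milk"),
  c("Beer", "Bread", "Diaper", "Milk"),
  c("Coke", "Diaper", "Milk")
)

# Fraction of transactions that contain every item in `items`
support <- function(items) {
  mean(sapply(transactions, function(t) all(items %in% t)))
}

# Of the transactions containing `lhs`, the fraction that also contain `rhs`
confidence <- function(lhs, rhs) {
  support(c(lhs, rhs)) / support(lhs)
}

conf_milk_coke <- confidence("Milk", "Coke")
conf_dm_beer   <- confidence(c("Diaper", "Milk"), "Beer")
conf_milk_coke
conf_dm_beer
```

Under these definitions, {Milk} --> {Coke} holds in 3 of the 4 baskets that contain Milk, and {Diaper, Milk} --> {Beer} holds in 2 of the 3 baskets that contain both Diaper and Milk.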
Association Analysis: Pharmacy Shelf Management
Regression
Predicts the value of a continuous-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency
Regression Example
Predicting vaccine demand to better plan supply
Anomaly Detection
• Detect significant deviations from normal behavior
• Applications:
• Unusual patient behavior
• Insurance fraud detection
• Treatment outlier detection
Challenges in Data Mining
• Scalability
• Dimensionality
• Complex and heterogeneous data
• Data quality
• Data ownership and distribution
• Privacy
• Reaction time
• Many other domain-specific issues
AI in Healthcare Landscape
Overview of Datasets
Wisconsin Breast Cancer Data
• Features obtained from a digital image of a fine needle aspirate (FNA) of a breast mass.
• Describes characteristics of the cell nuclei present in the image.
• Attribute information:
  • ID number
  • Diagnosis (M = malignant, B = benign)
  • 10 real-valued features
• Total of 569 records
Source: https://www.kaggle.com/uciml/breast-cancer-wisconsin-data
Features: Wisconsin Breast Cancer Data
Id: ID number
diagnosis: The diagnosis of breast tissues (M = malignant, B = benign)
radius_mean: mean of distances from center to points on the perimeter
texture_mean: standard deviation of gray-scale values
perimeter_mean: mean size of the core tumor
compactness_se: standard error for perimeter^2 / area - 1.0
smoothness_mean: mean of local variation in radius lengths
compactness_mean: mean of perimeter^2 / area - 1.0
concavity_mean: mean of severity of concave portions of the contour
concave points_mean: mean for number of concave portions of the contour
fractal_dimension_mean: mean for "coastline approximation" - 1
radius_se: standard error for the mean of distances from center to points on the perimeter
texture_se: standard error for standard deviation of gray-scale values
Source: https://www.kaggle.com/uciml/breast-cancer-wisconsin-data
Data Exploration and Visualization
Agenda
•Why data exploration and visualization?
•Exploration and visualization of data:
  •Core R functionality
  •lattice package
  •ggplot2 package
WHY DATA EXPLORATION AND VISUALIZATION?
Data Beats Algorithm But…
•More data usually yields good generalization performance, even with a simple algorithm
•But there are caveats:
  •The amount of data may have diminishing returns
  •Data quality and variety matter
  •A decent-performing learning algorithm is still needed
  •Most importantly, extracting useful features from the data is essential
Is Date-Time Stamp a Good Feature?
23:05:33 UTC-5, April 3, 2014
Derived features: hour of day, day of week, AM/PM
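A raw timestamp is rarely useful to a model, but features derived from it can be. This is a minimal base R sketch that derives the three features named on the slide; it treats the timestamp as a wall-clock time and ignores the UTC offset for simplicity (an assumption of this sketch).

```r
# Parse the slide's timestamp as a wall-clock time (offset ignored here)
ts <- as.POSIXlt("2014-04-03 23:05:33", tz = "UTC")

hour_of_day <- ts$hour                       # 0-23
day_of_week <- ts$wday                       # 0 = Sunday, ..., 6 = Saturday
am_pm <- if (hour_of_day < 12) "AM" else "PM"

hour_of_day
day_of_week
am_pm
```

Each derived value is a candidate feature; whether it helps depends on the problem, for example on whether behavior differs by time of day or by weekday.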
Dispelling a Common Myth
•There is NO single ML algorithm that will take raw data and give you the best model
•You do NOT need to know a lot of machine learning algorithms to build robust predictive models
Janitorial Work is Important
•Not spending time on understanding your data is a source of many problems!
•Remember the 80/20 rule:
  • 80%: Data cleaning, data exploration, feature engineering, pre-processing, etc.
  • 20%: Model building
EXPLORATION AND VISUALIZATION USING R
Objectives
•Develop an understanding of the high-level thinking process of data exploration
•Make sense of data using visualization techniques
•Learn to perform feature engineering
•Become a good storyteller
Anscombe’s Quartet
Consider the following four datasets:
      I              II             III             IV
   x      y       x      y       x      y       x      y
 10.0   8.04    10.0   9.14    10.0   7.46     8.0   6.58
  8.0   6.95     8.0   8.14     8.0   6.77     8.0   5.76
 13.0   7.58    13.0   8.74    13.0  12.74     8.0   7.71
  9.0   8.81     9.0   8.77     9.0   7.11     8.0   8.84
 11.0   8.33    11.0   9.26    11.0   7.81     8.0   8.47
 14.0   9.96    14.0   8.10    14.0   8.84     8.0   7.04
  6.0   7.24     6.0   6.13     6.0   6.08     8.0   5.25
  4.0   4.26     4.0   3.10     4.0   5.39    19.0  12.50
 12.0  10.84    12.0   9.13    12.0   8.15     8.0   5.56
  7.0   4.82     7.0   7.26     7.0   6.42     8.0   7.91
  5.0   5.68     5.0   4.74     5.0   5.73     8.0   6.89

Summary statistics (shared by all four datasets):
Mean of X                    9
Variance of X                11
Mean of Y                    7.5
Variance of Y                4.125
Correlation between X & Y    0.816
[Figure: plots of the four datasets]
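The shared statistics can be reproduced with the `anscombe` data frame that ships with base R (columns `x1`..`x4` and `y1`..`y4`). A minimal sketch:

```r
# `anscombe` is part of base R's datasets package
data(anscombe)

# One column of statistics per dataset I-IV
stats <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(mean_x = mean(x), var_x = var(x),
    mean_y = mean(y), var_y = var(y),
    cor_xy = cor(x, y))
})
colnames(stats) <- c("I", "II", "III", "IV")
print(round(stats, 3))
```

The four columns agree to about two decimal places, yet the datasets look completely different when plotted, which is the argument for visualizing data before trusting summaries.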
Awareness Test
Common Graphical Parameters
• Title of graph using the main argument, main = "title"
• Label x axis using the xlab argument, xlab = "label x axis"
• Label y axis using the ylab argument, ylab = "label y axis"
• Colors controlled by col
• Get legends of layered plots with auto.key = TRUE (lattice)
Exploring Data Commands
Commands                    Description
read.csv(), read.table()    Load data/file into a dataframe
data()                      Loads or resets a dataset
names()                     List names of variables in a dataframe
head()                      First 6 rows of data
tail()                      Last 6 rows of data
str()                       Display internal structure of an R object
View()                      View dataset in spreadsheet format in RStudio
dim()                       Dimensions (rows and columns) of dataframe
summary()                   Display 5-number summary and mean
colnames()                  Provide column names
CORE R GRAPHICS
Breast Cancer Dataset
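# Load the CSV into a data frame. read.csv() already creates the object;
# data() is only for datasets bundled with packages, so it is not needed here.
breast_cancer <- read.csv("data.csv")
head(breast_cancer)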
Boxplots
• Summarizes quantitative/numeric data
# Core Graphics
boxplot(
  radius_mean~diagnosis,
  data=breast_cancer,
  main="Radius Mean for various diagnoses",
  xlab="Diagnosis",
  ylab="Radius Mean"
)
B: Benign M: Malignant
Pie Chart
▪ Summarizes qualitative/categorical variables
# Core Graphics
pie(table(breast_cancer$diagnosis))
B: Benign M: Malignant
Scatter Plot
▪ Visual depiction of correlation between numeric variables
# Core Graphics
plot(breast_cancer$concave.points_worst,
     breast_cancer$perimeter_worst,
     xlab="Concave Points Worst",
     ylab="Perimeter Worst")
Scatter Plot
# Core Graphics
plot(perimeter_worst~area_worst,
data=breast_cancer)
▪ Plot of perimeter_worst against area_worst
Scatter Plot
# In a formula y ~ x, the left-hand side goes on the y axis,
# so perimeter_worst is on the left to match the axis labels
plot(perimeter_worst~concave.points_worst,
     data=breast_cancer,
     main="Concave Points Worst vs Perimeter Worst",
     xlab="Concave Points Worst",
     ylab="Perimeter Worst")
abline(lm(perimeter_worst~concave.points_worst,
          data=breast_cancer),col="red",lwd=2)
cor(breast_cancer$concave.points_worst,breast_cancer$perimeter_worst)
>0.816322101687544
• Plots Perimeter Worst against Concave Points Worst, then adds a regression line
• Find correlation between variables (values close to 1 or -1 depict strong linear relationship)
GGPLOT2 GRAPHICS
ggplot Fundamentals
•ggplot() provides a blank canvas for plotting
•geom_*() creates actual graphical layers
  • geom_point()
  • geom_boxplot()
•aes() defines an "aesthetic" either globally or by layer
Layering
ggplot(breast_cancer, aes()) + geom_point()
Layer 1: ggplot(breast_cancer, aes())
Layer 2: geom_point()
Histogram
A histogram of counts of Concave Points Worst
ggplot(breast_cancer, aes(x=concave.points_worst)) +
  geom_histogram()
Density
Smooths over the counts of concave points worst
▪ Note the location of aes()
ggplot(breast_cancer) +
  geom_density(aes(x=concave.points_worst), fill="gray50") +
  labs(x="Concave Points Worst")
Scatter Plot
ggplot(breast_cancer,
aes(x=concave.points_worst,
y=perimeter_worst)) +
geom_point() +
labs(x="Concave Points Worst",
y="Perimeter Worst")
Saving a ggplot Object
# ggplot object
# Store the plot for future modifications
g <- ggplot(breast_cancer,
            aes(x=concave.points_worst,
                y=perimeter_worst))
# Second aesthetic adds settings specific to geom_point layer
# (a continued line must end with +, not start with it)
g + geom_point(aes(color=diagnosis)) +
  labs(x="Concave Points Worst",
       y="Perimeter Worst")
Segmenting a Plot
# Segment by factor
g +
  geom_point(aes(color=diagnosis)) +
  facet_wrap(~diagnosis) +
  labs(x="Concave Points Worst",
       y="Perimeter Worst")
Summary
✓Basics of R
✓Graphing in R – core and ggplot2
✓Look at multiple types of graphs
✓Visualize and segment data to gain more insights
✓Identify key features
✓Summarize findings
QUESTIONS
Building Classification Models Using Decision Trees
Agenda
• Introduction to predictive analytics
• Introduction to classification
•Decision Tree Classifier
•Hands-on Lab: Building a decision tree classifier using R
INTRODUCTION TO PREDICTIVE ANALYTICS
Emergency & Surgery Rooms
• Gauss Surgical
  • Develops real-time blood monitoring solutions to provide an accurate and objective estimate of blood loss.
• MedaSense
  • Assesses patients’ physiological response to pain.
Patient Data & Risk Assessment
▪ Watson for Oncology
  • Analyzes patients’ medical records and identifies treatment options for doctors and patients.
▪ SkinVision
  • Assesses skin cancer risk using image recognition and user-provided information.
▪ Berg
  • Includes dosage trials for intravenous tumor treatment, and detection and management of prostate cancer.
Mental Health
▪ MedyMatch
  • Helps treat stroke and head trauma more effectively by detecting intracranial brain bleeds.
▪ P1vital
  • Predicting Response to Depression Treatment (PReDicT test) uses machine learning to provide anti-depressant treatment.
INTRODUCTION TO CLASSIFICATION
Supervised Learning
[Diagram: Training Set → Learning Algorithm → Train Model (Learning) → Model; Test Set → Apply Model (Prediction)]
Decision Tree Learning
[Figure: decision tree with splitting attributes Perimeter (<114.6 / ≥114.6), Concavity (<0.1358 / ≥0.1358) and Texture (<26.29 / ≥26.29); leaves are labelled Benign or Malignant]
A Different Decision Tree
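[Figure: a second decision tree over the same attributes, with Texture (<26.29 / ≥26.29) at the root, followed by Perimeter (<114.6 / ≥114.6) and Concavity (<0.1358 / ≥0.1358); leaves are labelled Benign or Malignant]
There could be more than one tree that fits the same data!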
Decision Tree Application
[Diagram: Training Set → Learning Algorithm → Train Model (Induction) → Model; Test Set → Apply Model (Deduction)]
Apply Model to Test Data
Start from the root of the tree.
[Figure, repeated over several slides: a test record is passed down the decision tree (Perimeter, Concavity and Texture tests), one test per slide, until it reaches a leaf]
Diagnosis = “Benign”
How Do We Get A Tree?
• Exponentially many decision trees are possible
• Finding the optimal tree is infeasible
• Greedy methods that find near-optimal solutions do exist
Tree Induction
• Greedy strategy
  • Split the records based on an attribute test that optimizes a criterion
• Issues
  • How to split the records
    • What attribute test condition?
    • How to determine the best split?
  • When do we stop?
Splitting a Node
Binary Split: Texture > 26.29? (Yes / No)
Multi-way Split: Texture divided into ranges: <16.5, [16.5, 22.2), [22.2, 32.5), [35.8, 39.7), ≥ 30.2
What is The Best Split?
Before Splitting: 10 records of class 1 (C1: Benign), 10 records of class 2 (C2: Malignant)
Which test condition is the best?
Test condition      Children                   C1/C2 counts per child
Texture < 26.29?    2 (Yes / No)               6/4, 4/6
Concavity?          3 (1, 2, 3)                1/3, 8/0, 1/7
ID?                 20 (s1, s2, s3, …, s20)    0/1, 1/0, 0/1, 1/0, …
What is The Best Split?
• Greedy approach
  • Homogeneous class distribution preferred
• Need a measure of node impurity
C1: 5, C2: 5    Non-homogeneous, high degree of impurity
C1: 9, C2: 1    Homogeneous, low degree of impurity
C1: Benign
C2: Malignant
Measures of Node Impurity
•Gini Index
•Entropy
•Misclassification error
Impurity Measure: GINI
$$GINI(t) = 1 - \sum_j [p(j \mid t)]^2$$
• p(j | t) is the relative frequency of class j at node t
• Maximum (1 - 1/n_c) when records are equally distributed among all classes, implying least interesting information
  • n_c = number of classes
• Minimum (0.0) when all records belong to one class, implying most interesting information
C1   C2   Gini
0    6    0.000
1    5    0.278
2    4    0.444
3    3    0.500
C1: Benign
C2: Malignant
Impurity Measure: GINI
$$GINI(t) = 1 - \sum_j [p(j \mid t)]^2$$
Node (C1 = 0, C2 = 6): P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Gini = 1 - P(C1)^2 - P(C2)^2 = 1 - 0 - 1 = 0
Node (C1 = 1, C2 = 5): P(C1) = 1/6, P(C2) = 5/6
Gini = 1 - (1/6)^2 - (5/6)^2 = 0.278
Node (C1 = 2, C2 = 4): P(C1) = 2/6, P(C2) = 4/6
Gini = 1 - (2/6)^2 - (4/6)^2 = 0.444
C1: Benign
C2: Malignant
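The node-level Gini values can be reproduced with a short base R function. The function name `gini` is an illustrative choice; it takes the class counts at a node.

```r
# Gini impurity of a node, given its class counts
gini <- function(counts) {
  p <- counts / sum(counts)   # relative class frequencies p(j | t)
  1 - sum(p^2)
}

gini_pure  <- gini(c(0, 6))   # all records in one class
gini_1_5   <- gini(c(1, 5))
gini_2_4   <- gini(c(2, 4))
gini_mixed <- gini(c(3, 3))   # evenly mixed, the two-class maximum

round(c(gini_pure, gini_1_5, gini_2_4, gini_mixed), 3)
```

With two classes the maximum is 1 - 1/2 = 0.5, reached by the evenly mixed node.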
Impurity Measure: GINI
• When a node p is split into k partitions (children), the quality of the split is computed as:
$$GINI(split, p) = \sum_{i=1}^{k} \frac{n_i}{n} \, GINI(i)$$
where
n_i = number of records at child i
n = number of records at node p
Impurity Measure: GINI
• Split data into two partitions
• Partition measurements are weighted
• Larger and purer partitions are sought after
B?
Malignant Benign
Node N1 Node N2
Parent: C1 = 6, C2 = 6, Gini = 0.500

        N1   N2
C1      5    1
C2      2    4

Gini(N1) = 1 - (5/7)^2 - (2/7)^2 = 0.408
Gini(N2) = 1 - (1/5)^2 - (4/5)^2 = 0.320
Gini(B?, Parent) = 7/12 * 0.408 + 5/12 * 0.320 = 0.371
C1: Benign
C2: Malignant
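The weighted Gini of a split can be computed the same way. This is a minimal sketch; `gini` and `gini_split` are illustrative helper names. It reproduces the B? split above and, for comparison, scores the Texture and Concavity candidates from the "What is The Best Split?" slide.

```r
# Gini impurity of one node from its class counts
gini <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)
}

# Weighted Gini of a split: each child is weighted by its share of records
gini_split <- function(children) {
  n <- sum(unlist(children))
  sum(sapply(children, function(ch) sum(ch) / n * gini(ch)))
}

# Split B?: N1 = (C1 5, C2 2), N2 = (C1 1, C2 4)
split_b <- gini_split(list(c(5, 2), c(1, 4)))

# Candidate splits of the 10/10 parent from the earlier slide
split_texture   <- gini_split(list(c(6, 4), c(4, 6)))
split_concavity <- gini_split(list(c(1, 3), c(8, 0), c(1, 7)))

round(c(B = split_b, Texture = split_texture, Concavity = split_concavity), 3)
```

A lower weighted Gini means purer children, so among these candidates Concavity would be preferred over Texture.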
Impurity Measure: Entropy
$$Entropy(t) = -\sum_j p(j \mid t) \log_2 p(j \mid t)$$
• p(j | t) is the relative frequency of class j at node t
• Maximum: records equally distributed
• Minimum: all records belong to one class
Impurity Measure: Entropy
$$Entropy(t) = -\sum_j p(j \mid t) \log_2 p(j \mid t)$$
Node (C1 = 0, C2 = 6): P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Entropy = - 0 log2 0 - 1 log2 1 = - 0 - 0 = 0 (taking 0 log2 0 = 0)
Node (C1 = 1, C2 = 5): P(C1) = 1/6, P(C2) = 5/6
Entropy = - (1/6) log2 (1/6) - (5/6) log2 (5/6) = 0.65
Node (C1 = 2, C2 = 4): P(C1) = 2/6, P(C2) = 4/6
Entropy = - (2/6) log2 (2/6) - (4/6) log2 (4/6) = 0.92
C1: Benign
C2: Malignant
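The entropy values can be checked with a small base R function. The name `entropy` is illustrative; empty classes are dropped so that 0 log2 0 is treated as 0.

```r
# Entropy of a node (in bits), given its class counts
entropy <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]               # treat 0 * log2(0) as 0
  -sum(p * log2(p))
}

ent_pure  <- entropy(c(0, 6))
ent_1_5   <- entropy(c(1, 5))
ent_2_4   <- entropy(c(2, 4))
ent_mixed <- entropy(c(3, 3))  # evenly mixed node

round(c(ent_pure, ent_1_5, ent_2_4, ent_mixed), 2)
```

With two classes, entropy ranges from 0 (pure node) to 1 bit (evenly mixed node).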
Impurity Measure: Information Gain
$$GAIN_{split} = Entropy(p) - \sum_{i=1}^{k} \frac{n_i}{n} \, Entropy(i)$$
• Node p is split into k partitions
• n_i is the number of records in partition i; n is the number of records at node p
• Measures reduction in entropy
• Choose the split that maximizes GAIN
• Tends to prefer splits with a large number of partitions
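The bias toward many partitions shows up when information gain is computed for the three candidate splits from the "What is The Best Split?" slide (parent node with 10 records of each class). This is a sketch with illustrative helper names `entropy` and `info_gain`.

```r
# Entropy of a node (in bits), given its class counts
entropy <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]               # treat 0 * log2(0) as 0
  -sum(p * log2(p))
}

# Information gain: parent entropy minus the weighted entropy of the children
info_gain <- function(parent, children) {
  n <- sum(parent)
  entropy(parent) - sum(sapply(children, function(ch) sum(ch) / n * entropy(ch)))
}

parent <- c(10, 10)

gain_texture   <- info_gain(parent, list(c(6, 4), c(4, 6)))
gain_concavity <- info_gain(parent, list(c(1, 3), c(8, 0), c(1, 7)))
# ID split: 20 children holding one record each, 10 of each class
id_children <- c(rep(list(c(1, 0)), 10), rep(list(c(0, 1)), 10))
gain_id <- info_gain(parent, id_children)

round(c(Texture = gain_texture, Concavity = gain_concavity, ID = gain_id), 3)
```

Splitting on ID gives the largest possible gain because every child is pure, even though an ID has no predictive value for new records.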
Impurity Measure: Classification Error
$$Error(t) = 1 - \max_i P(i \mid t)$$
• Maximum: records are equally distributed
• Minimum: all records belong to one class
• Similar to information gain
  • Less sensitive for > 2 or 3 splits
  • Less prone to overfitting
Impurity Measure: Classification Error
$$Error(t) = 1 - \max_i P(i \mid t)$$
Node (C1 = 0, C2 = 6): P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Error = 1 - max (0, 1) = 1 - 1 = 0
Node (C1 = 1, C2 = 5): P(C1) = 1/6, P(C2) = 5/6
Error = 1 - max (1/6, 5/6) = 1 - 5/6 = 1/6
Node (C1 = 2, C2 = 4): P(C1) = 2/6, P(C2) = 4/6
Error = 1 - max (2/6, 4/6) = 1 - 4/6 = 1/3
C1: Benign
C2: Malignant
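The classification error of a node is one minus the share of its majority class. A minimal base R sketch, with `class_error` as an illustrative name:

```r
# Misclassification error of a node, given its class counts
class_error <- function(counts) {
  1 - max(counts / sum(counts))
}

err_pure <- class_error(c(0, 6))
err_1_5  <- class_error(c(1, 5))
err_2_4  <- class_error(c(2, 4))

c(err_pure, err_1_5, err_2_4)
```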
Sample Stopping Criteria
• All the records belong to the same class
• All the records have similar attribute values
• Fixed termination or pruning
  • Number of levels
  • Number in leaf node
  • Minimum samples per leaf node
Decision Trees - PROS
• Intuitive
  • Easy interpretation for small trees
• Non parametric
  • Incorporate both numeric and categorical attributes
• Fast
  • Once rules are developed, prediction is rapid
• Robust to outliers
[Figure: the example decision tree (Perimeter, Concavity, Texture)]
Decision Trees - CONS
• Overfitting
  • Must be trained with great care
• Rectangular Classification
  • Recursive partitioning of data may not capture complex relationships
QUESTIONS
Evaluating Classification Models
Agenda
• Evaluation of classification models:
  • Confusion Matrix
  • Accuracy, Precision, Recall, F1 measure
• Building robust machine learning models:
  • Bias/variance tradeoff
• Methods of evaluation:
  • Cross validation
  • ROC curve
The Limitations of Accuracy
• Consider a 2-class problem:
  • Number of Class 0 examples = 9990
  • Number of Class 1 examples = 10
• If the model predicts everything to be class 0, accuracy is 9990/10000 = 99.9 %
• Accuracy is misleading!
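The example can be reproduced directly. This sketch builds the 9990/10 labels, predicts class 0 for everything, and compares accuracy with recall on the rare class.

```r
# 9990 examples of class 0 and 10 examples of class 1
actual <- c(rep(0, 9990), rep(1, 10))

# A "model" that predicts class 0 for every record
predicted <- rep(0, length(actual))

accuracy <- mean(predicted == actual)
# Recall on class 1: share of the actual class 1 records that were found
recall_class1 <- sum(predicted == 1 & actual == 1) / sum(actual == 1)

accuracy
recall_class1
```

The accuracy is 99.9% even though the model never finds a single class 1 record, which is why the later slides add precision, recall and F1.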
METRICS FOR EVALUATION
Confusion Matrix
                          PREDICTED CLASS
                          Class=Yes    Class=No
ACTUAL CLASS  Class=Yes   a            b
              Class=No    c            d
a: TP (true positive)
b: FN (false negative)
c: FP (false positive)
d: TN (true negative)
Confusion Matrix
                          PREDICTED CLASS
                          Class=Yes    Class=No
ACTUAL CLASS  Class=Yes   a (TP)       b (FN)
              Class=No    c (FP)       d (TN)
$$Accuracy = \frac{a + d}{a + b + c + d} = \frac{TP + TN}{TP + TN + FP + FN}$$
Precision
$$p = \frac{TP}{TP + FP} = \frac{a}{a + c}$$
(a = TP, b = FN, c = FP, d = TN, as in the confusion matrix)
Recall/Sensitivity
$$r = \frac{TP}{TP + FN} = \frac{a}{a + b}$$
(a = TP, b = FN, c = FP, d = TN, as in the confusion matrix)
F1-Score
$$F_1 = \frac{2rp}{r + p} = \frac{2a}{2a + b + c}$$
(a = TP, b = FN, c = FP, d = TN, as in the confusion matrix)
Harmonic mean of precision and recall
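The four metrics can be computed from the confusion-matrix cells. The counts below are hypothetical, chosen only to make the arithmetic easy to follow. The sketch also checks that the two forms of the F1 formula agree.

```r
# Hypothetical confusion-matrix counts (not from the slides)
tp <- 40  # a: true positives
fn <- 10  # b: false negatives
fp <- 20  # c: false positives
tn <- 30  # d: true negatives

accuracy  <- (tp + tn) / (tp + tn + fp + fn)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)

# F1 as the harmonic mean of precision and recall ...
f1 <- 2 * recall * precision / (recall + precision)
# ... and the equivalent form written directly in counts
f1_counts <- 2 * tp / (2 * tp + fn + fp)

round(c(accuracy = accuracy, precision = precision,
        recall = recall, f1 = f1), 3)
```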
WILL MY MODEL BETRAY ME?
Is My Model Really Good?
• My model shows an accuracy of 90% in the training environment
• Would the model be 90% accurate in production environment?
Generalization
• A machine learning model should be able to handle any data set coming from the same distribution as the training set.
• Generalization refers to a model's ability to handle any random variations of training data
Overfitting (lack of generalization)
• The gravest and most common sin of machine learning
• Overfitting: learning so much from your data that you memorize it.
  • You do well on training data
  • But don’t do well (or even fail miserably) on test data
Train/Test Partition is Not Enough
[Diagram: Labelled Data split into Training Data (70%) and Blind Holdout Data (30%)]
Blind Holdout Dataset
• The person building the model has no access to the blind holdout data set
  • Why do we need to lock it away?
• Even in presence of a 70/30 split, you may end up with a model that is not generalized
Perils of Overfitting
Bias/Variance Tradeoff
You can beat your data to confession.
The generation of random numbers is too important to be left to chance.
Bias/Variance Trade-off
Bullseye is the theoretical best performance (accuracy, precision, recall or something else)
Each dartboard represents a model
Bias/Variance Trade-off
• Test your model on several variations of the dataset
• Each dot represents a random variation of the test data set
Bias/Variance Trade-off
METHODS OF EVALUATION
Cross Validation
•Split data into k disjoint partitions
•Train on k-1 partitions and test on 1
•Repeat k times
Cross Validation (k=10)
[Figure: the data divided into 10 partitions (1 to 10), shown once per round for 10 rounds; in each round a different partition is the Test Set and the remaining nine form the Training Set]
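The partitioning step can be sketched in base R. The record count `n` and the seed are arbitrary choices for this illustration, and the model training step is left as a comment.

```r
set.seed(42)   # arbitrary seed so the sketch is repeatable
n <- 100       # hypothetical number of records
k <- 10

# Assign each record to one of k equal-sized folds, in random order
fold <- sample(rep(1:k, length.out = n))

test_sizes <- integer(k)
for (i in 1:k) {
  test_idx  <- which(fold == i)   # partition i is the test set this round
  train_idx <- which(fold != i)   # the other k-1 partitions are the training set
  # a model would be trained on train_idx and evaluated on test_idx here
  test_sizes[i] <- length(test_idx)
}

test_sizes
```

Every record lands in the test set exactly once across the k rounds, so each contributes to the averaged performance estimate.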
Adjusting Learning Parameters
max depth = 10 max depth = 7 max depth = 2
A1 100% 80% 55%
A2 60% 78% 55%
A3 90% 79% 55%
A4 70% 77% 55%
A5 80% 81% 55%
average 80% 79% 55%
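One way to read the table, offered here as an interpretation rather than something the slide states, is to look at the spread across A1 to A5 as well as the average. A minimal sketch using the table's values:

```r
# Accuracy (%) per run A1-A5 for each max depth setting, from the table
acc <- data.frame(
  depth10 = c(100, 60, 90, 70, 80),
  depth7  = c(80, 78, 79, 77, 81),
  depth2  = c(55, 55, 55, 55, 55),
  row.names = paste0("A", 1:5)
)

avg    <- colMeans(acc)    # the "average" row of the table
spread <- sapply(acc, sd)  # standard deviation across the five runs

round(rbind(average = avg, sd = spread), 2)
```

Depth 10 and depth 7 have nearly the same average, but depth 10 varies far more from run to run, while depth 2 is stable but consistently poor.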
Holdout Set
•70% for training, 30% for testing
•60/40 or 50/50 also possible
•Repeated holdout: Apply 70/30, 60/40 or 50/50 many times.
Stratified Sampling
•Use when class distribution is skewed
•Ensures that all partitions have a fixed ratio of classes
  •Same ratio as the training set
  •If the training set is 5% class 1 and 95% class 2, so is each partition
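Stratification can be sketched in base R by assigning folds within each class separately. The labels are synthetic, built to match the 5% / 95% example, and the seed and number of partitions are arbitrary.

```r
set.seed(1)   # arbitrary seed so the sketch is repeatable

# Synthetic labels: 5% class 1, 95% class 2
labels <- c(rep("class1", 50), rep("class2", 950))
k <- 5

# Assign folds separately within each class so every fold keeps the class ratio
fold <- integer(length(labels))
for (cl in unique(labels)) {
  idx <- which(labels == cl)
  fold[idx] <- sample(rep(1:k, length.out = length(idx)))
}

# Share of each class within each fold (rows sum to 1)
prop <- prop.table(table(fold, labels), margin = 1)
prop
```

Each of the five partitions ends up with 10 class 1 records and 190 class 2 records, the same 5% / 95% mix as the full set.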
Using ROC for Model Comparison
• No model consistently outperforms the other
  • Purple is better at low thresholds
  • Red is better at high thresholds
• Area Under ROC Curve (AUC)
  • Compares models directly
[Figure: ROC curves of the two models, labelled AUC=0.865 and AUC=0.859]
QUESTIONS