
Introduction

Data Science and Data Engineering

Instructor – Raja Iqbal

• Founder, CEO & Chief Data Scientist.

• Worked in Bing data mining, Bing Ads (2006-2013)
  • ETL, bot detection, online experimentation and A/B testing, relevance of online ads, click prediction, etc.

• Ph.D. in CS with a focus on computer vision, machine learning and data mining.

Copyright (c) 2018. Data Science Dojo

Instructor – Rebecca Merrett

• Technical Writer and Content Developer.

• Worked in game engine technology, writing technical content on new features.

• Graduate diploma in mathematics and statistics, with a bachelor's degree in information and media.


Instructor – Victoria Louise Clayton

• Instructor and Mentor.

• Worked for a small research consultancy in London and worked on projects for governments, international organizations and companies such as the UN and Siemens.

• BA in Human Sciences from Oxford University and an MSc in Decision Science.


Instructor – Margaux Penwarden

• Instructor and Mentor.

• Data scientist at McKinsey & Company, Sydney

• Bachelor’s in Computer Science and Mathematics from Télécom Paristech (“Grande Ecole”), and a Master’s in Statistics from Imperial College, London.


About Data Science Dojo

•Started in August 2014

•100+ bootcamps, workshops and corporate trainings

•~3500 attendees

•600+ companies

•10 countries


http://datasciencedojo.com/bootcamp/alumni/

http://datasciencedojo.com/bootcamp/companies/

Learning Objectives

• Learn the theory and practice of data science for improved health systems and healthcare.

• Explore and visualize a health-related dataset.

• Build and evaluate predictive models for classification and regression (for instance, predicting whether a tumour is malignant or not), as an example of the use of machine learning in health.


Learning Objectives

• Understand the fundamentals of unsupervised learning and clustering, and their potential applications in health systems

• Learn fundamentals of text analytics and perform text analytics on a health-related dataset

• Get an introduction to big data and data engineering


CURRICULUM


Maximizing ROI From This Week

• Map the techniques to real problems at all times:
  • Problem and business impact
  • Data you have (and do not have)
  • Measurement metrics
  • Business metrics


Logistics

• 8:30 am – 5:30 pm daily*

• Course material and resources:
  • Handbooks
  • Learning portal

• Request:
  • Make sure your computers are ready
  • Keep the session interactive
  • Social media, email, etc.


*We will end at 4:00 pm on Friday

Agenda for Today

Session I: Understanding the AI and data science landscape

Session II: Data exploration and visualization

Session III: Introduction to predictive modeling

Session IV: Decision tree learning and building your first predictive model

Session V: Evaluating classification models


Understanding the AI and Data Science Landscape

Objectives

• Review the current data science landscape

• Discuss what other organizations are (or may be) doing

• Common data mining tasks

• Identify some data science problems in health


Drug Discoveries

• Insilico Medicine
  • Finding new drugs and treatments including immunotherapies.

• MIT Clinical Machine Learning Group
  • Focussed on disease processes and design for effective treatment of diseases such as Type 2 diabetes.

• Knight Cancer Institute
  • With a current focus on developing an approach to personalize drug combinations for Acute Myeloid Leukemia (AML).


Medical Imaging & Diagnostics

• VunoMed
  • Identifies different types of lung tissue damage by color to help physicians make more accurate diagnoses.

• IBM Watson Genomics
  • Provides precision medicine to cancer patients.


Virtual Assistants

• Scanadu’s doc.ai
  • NLP program that allows patients to get their lab results explained to them by an app, saving both patient and doctor time and money.

• Somatix
  • Recognizes hand-to-mouth gestures in order to help people better understand their behavior and make life-affirming changes.


Research

• Google DeepMind
  • Develops technology to address macular degeneration in aging eyes.

• Desktop Genetics
  • AI-designed tech for more effective and affordable guides. Recognized as a leader in genome editing technology.

• iCarbonX
  • Monitors and models human biological data to enable people to find the proper lifestyle and treatments that can improve their health, life quality and joy.


Brainstorming

What are some other applications?


Connecting the Dots

•The underlying magic behind what we saw is ‘big data’ and ‘predictive analytics’


Big Data Pipeline

Stage: Data influx

• Output: Data stream

Stage: Collection

• Output: Target data

Stage: Preprocessing

• Output: Preprocessed data

Stage: Transformation

• Output: Transformed data

Stage: Data Mining

• Output: Patterns

Stage: Interpretation and Evaluation

• Output: Knowledge discovery and actionable insights


Big Data – Technology, Platforms & Products

[Figure: two layers, Data Management and Data Science. Activities shown: Collect, Store, Transform, Reason, Model, Visualize, Recommend, Predict, Explore. Technologies shown: ETL/Log, SQL, NoSQL, MapReduce, Real Time Analytics.]


Data Mining Tasks

• Descriptive Methods:
  • Find human-interpretable patterns that describe the data
  • Techniques: Clustering, Association Analysis, X-point summaries

• Predictive Methods:
  • Use available data to build models that can predict the outcome of future data
  • Techniques: Classification, Regression, Anomaly and Deviation Detection

• Prescriptive Methods:
  • Predict future outcomes and suggest actions that may prevent or mitigate the impact of the predicted outcomes
  • Techniques: Various optimization techniques


Traffic Management

Descriptive [Informing Role]:

• Traffic jam has happened already

• [Implicit: Do something about it]


Traffic Management

Predictive [Informing and Warning Role]:

• Traffic jam is about to happen in the next 30 minutes

• [Implicit: Do something before it happens]


Traffic Management

Prescriptive [Informing, Warning, and Advisory Role]:

Take action so traffic jam does not happen OR

Traffic jam is about to happen in the next 30 minutes and you could possibly take the following courses of action:

• Route traffic to service road near I-5

• Block more traffic from entering the WA-520 bridge


Data Mining and Predictive Analytics

In the next few slides, we will take a look at some of the most common data mining tasks.


Classification: A Simple Example

Training Set:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test Set:

Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

[Diagram: the Training Set is used to learn a classifier (the Model), which is then applied to the Test Set.]


Classification: More Examples

• What is the likelihood that a patient will develop diabetes?

• What is the likelihood that a COPD patient will be readmitted within 90 days of discharge?

• What is the likelihood that a person will not show up to their appointment?


Clustering: An Illustration

Clustering in 3-D space using Euclidean distance:

• Intra-cluster distances are minimized

• Inter-cluster distances are maximized


Clustering

• Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that:
  • Data points within a cluster are more similar to one another
  • Data points in different clusters are less similar to one another


Clustering: Similarity Measures

• Similarity Measures:
  • Euclidean Distance if attributes are continuous
  • Other problem-specific measures
    • Example: Whether or not a particular word occurs in two documents
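The two kinds of similarity measure named above can be sketched in base R. The records and word-occurrence vectors below are hypothetical, and the Jaccard ratio is only one possible problem-specific measure chosen for illustration; the slide does not name it.

```r
# Three hypothetical records with two continuous attributes each.
records <- rbind(a = c(1, 2), b = c(4, 6), c = c(1, 3))

# Pairwise Euclidean distances (smaller distance = more similar).
euclid <- as.matrix(dist(records, method = "euclidean"))

# Hypothetical word-occurrence vectors for two documents (1 = word occurs).
doc1 <- c(fever = 1, cough = 1, rash = 0, fatigue = 1)
doc2 <- c(fever = 1, cough = 0, rash = 0, fatigue = 1)

# Jaccard similarity: words shared by both / words occurring in either.
jaccard <- sum(doc1 & doc2) / sum(doc1 | doc2)

print(euclid)
print(jaccard)
```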


Clustering: Examples

To find groups of documents that are similar to each other based on the most important terms that appear in them (e.g. medical records)


Association Analysis

Your behavior is being predicted, not by studying you, but by studying others.


Association Rule Discovery

• Given a set of records, each of which contains some number of items from a given collection:
  • Produce dependency rules which will predict the occurrence of an item based on the occurrences of other items

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:

{Milk} --> {Coke}

{Diaper, Milk} --> {Beer}
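One way to check how strongly the five transactions support each discovered rule is to compute its support and confidence. These two measures are standard in association analysis but are not named on the slide, so treat the sketch below as an illustrative assumption; `contains` and `rule_stats` are hypothetical helper names.

```r
transactions <- list(
  c("Bread", "Coke", "Milk"),
  c("Beer", "Bread"),
  c("Beer", "Coke", "Diaper", "Milk"),
  c("Beer", "Bread", "Diaper", "Milk"),
  c("Coke", "Diaper", "Milk")
)

# TRUE for each transaction that contains every item in `items`.
contains <- function(items) {
  vapply(transactions, function(t) all(items %in% t), logical(1))
}

# Support: share of all transactions containing both sides of the rule.
# Confidence: share of left-hand-side transactions that also contain the right-hand side.
rule_stats <- function(lhs, rhs) {
  both <- sum(contains(c(lhs, rhs)))
  c(support = both / length(transactions),
    confidence = both / sum(contains(lhs)))
}

milk_coke        <- rule_stats("Milk", "Coke")
diaper_milk_beer <- rule_stats(c("Diaper", "Milk"), "Beer")
print(milk_coke)
print(diaper_milk_beer)
```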


Association Analysis: Pharmacy Shelf Management


Regression

Predicts a value of a given continuous-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency


Regression Example


Predicting vaccine demand to better plan supply

Anomaly Detection

• Detect significant deviations from normal behavior

• Applications:

• Unusual patient behavior

• Insurance fraud detection

• Treatment outlier detection


Challenges in Data Mining

• Scalability

• Dimensionality

• Complex and heterogeneous data

• Data quality

• Data ownership and distribution

• Privacy

• Reaction time

• Many other domain-specific issues


AI in Healthcare Landscape


Overview of Datasets

Wisconsin Breast Cancer Data


Wisconsin Breast Cancer Data


• Features obtained from a digital image of a fine needle aspirate (FNA) of a breast mass.

• Describes characteristics of the cell nuclei present in the image.

• Attribute information:
  • ID number
  • Diagnosis (M = malignant, B = benign)
  • 10 real-valued features computed for each cell nucleus, each reported as a mean, a standard error and a "worst" value

• Total of 569 records



Source: https://www.kaggle.com/uciml/breast-cancer-wisconsin-data

Features: Wisconsin Breast Cancer Data


Id: ID number

diagnosis: The diagnosis of breast tissues (M = malignant, B = benign)

radius_mean: mean of distances from center to points on the perimeter

texture_mean: standard deviation of gray-scale values

perimeter_mean: mean size of the core tumor

compactness_se: standard error for perimeter^2 / area - 1.0

smoothness_mean: mean of local variation in radius lengths

compactness_mean: mean of perimeter^2 / area - 1.0

concavity_mean: mean of severity of concave portions of the contour

concave points_mean: mean for number of concave portions of the contour

fractal_dimension_mean: mean for "coastline approximation" - 1

radius_se: standard error for the mean of distances from center to points on the perimeter

texture_se: standard error for standard deviation of gray-scale values

Source: https://www.kaggle.com/uciml/breast-cancer-wisconsin-data

Data Exploration and Visualization

Agenda

•Why data exploration and visualization?

•Exploration and visualization of data:
  • Core R functionality
  • lattice package
  • ggplot2 package


WHY DATA EXPLORATION AND VISUALIZATION?


Data Beats Algorithm But…

•More data usually yields good generalization performance, even with a simple algorithm

•But there are caveats:
  • Amount of data may have diminishing returns
  • Data quality and variety matter
  • A decent performing learning algorithm is still needed
  • Most importantly, extracting useful features out of data is important



Is Date-Time Stamp a Good Feature?

23:05:33 –5 UTC, April 3, 2014

Derived features: Hour of day, Day of week, AM/PM
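As a rough sketch of how a raw date-time stamp can be turned into the three derived features, the base R snippet below parses the slide's timestamp as wall-clock time. Ignoring the "–5 UTC" offset and parsing with `tz = "UTC"` is a simplifying assumption made here for illustration.

```r
# The slide's timestamp, parsed as wall-clock time (offset ignored: assumption).
stamp <- as.POSIXlt("2014-04-03 23:05:33", tz = "UTC")

hour_of_day <- stamp$hour                          # 0-23
day_of_week <- stamp$wday                          # 0 = Sunday, ..., 6 = Saturday
am_pm       <- if (stamp$hour < 12) "AM" else "PM"

print(c(hour = hour_of_day, weekday = day_of_week))
print(am_pm)
```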

Dispelling a Common Myth

•There is NO single ML algorithm that will take raw data and give you the best model

•You do NOT need to know a lot of machine learning algorithms to build robust predictive models


Janitorial Work is Important

•Not spending time on understanding your data is a source of many problems!

•Remember the 80/20 rule:
  • 80% : Data cleaning, data exploration, feature engineering, pre-processing, etc.
  • 20% : Model building


EXPLORATION AND VISUALIZATION USING R


Objectives

•Develop an understanding of the high-level thinking process of data exploration

•Make sense of data using visualization techniques

•Learn to perform feature engineering

•Become a good storyteller


Anscombe’s Quartet

Consider the following four different datasets:

      I             II            III            IV
  x      y      x      y      x      y      x      y
10.0   8.04   10.0   9.14   10.0   7.46    8.0   6.58
 8.0   6.95    8.0   8.14    8.0   6.77    8.0   5.76
13.0   7.58   13.0   8.74   13.0  12.74    8.0   7.71
 9.0   8.81    9.0   8.77    9.0   7.11    8.0   8.84
11.0   8.33   11.0   9.26   11.0   7.81    8.0   8.47
14.0   9.96   14.0   8.10   14.0   8.84    8.0   7.04
 6.0   7.24    6.0   6.13    6.0   6.08    8.0   5.25
 4.0   4.26    4.0   3.10    4.0   5.39   19.0  12.50
12.0  10.84   12.0   9.13   12.0   8.15    8.0   5.56
 7.0   4.82    7.0   7.26    7.0   6.42    8.0   7.91
 5.0   5.68    5.0   4.74    5.0   5.73    8.0   6.89

Anscombe’s Quartet

Property                     Value
Mean of X                    9
Variance of X                11
Mean of Y                    7.5
Variance of Y                4.125
Correlation between X & Y    0.816
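These summary statistics can be reproduced with the `anscombe` data frame that ships with base R (columns `x1`..`x4` and `y1`..`y4`). The sketch below computes them per dataset; the Roman-numeral row names are added here only to match the table headings.

```r
# `anscombe` is part of the datasets package loaded by default in R.
xs <- anscombe[, paste0("x", 1:4)]
ys <- anscombe[, paste0("y", 1:4)]

stats <- data.frame(
  mean_x = unname(sapply(xs, mean)),
  var_x  = unname(sapply(xs, var)),
  mean_y = unname(sapply(ys, mean)),
  var_y  = unname(sapply(ys, var)),
  cor_xy = unname(mapply(cor, xs, ys)),
  row.names = c("I", "II", "III", "IV")
)

# Near-identical rows, although the four scatter plots look very different.
print(round(stats, 3))
```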


Awareness Test


Common Graphical Parameters

• Title of graph using the main argument, main = "title"

• Label x axis using the xlab argument, xlab = "label x axis"

• Label y axis using the ylab argument, ylab = "label y axis"

• Colors controlled by col

• Get legends of layered plots with auto.key=TRUE (lattice package)


Exploring Data Commands


Command                      Description
read.csv(), read.table()     Load data/file into a dataframe
data()                       Loads or resets a dataset
names()                      List names of variables in a dataframe
head()                       First 6 rows of data
tail()                       Last 6 rows of data
str()                        Display internal structure of an R object
View()                       View dataset in spreadsheet format in RStudio
dim()                        Dimensions (rows and columns) of dataframe
summary()                    Display 5-number summary and mean
colnames()                   Provide column names

CORE R GRAPHICS



Breast Cancer Dataset

# Load the CSV file into a data frame
breast_cancer <- read.csv("data.csv")

# Inspect the first 6 rows
head(breast_cancer)

Boxplots

• Summarizes quantitative/numeric data

# Core Graphics
boxplot(
  radius_mean~diagnosis,
  data=breast_cancer,
  main="Radius Mean for various diagnoses",
  xlab="Diagnosis",
  ylab="Radius Mean"
)

B: Benign M: Malignant


Pie Chart

▪ Summarizes qualitative/categorical variables

# Core Graphics

pie(table(breast_cancer$diagnosis))


B: Benign M: Malignant

Scatter Plot

▪ Visual depiction of correlation between numeric variables

# Core Graphics

plot(breast_cancer$concave.points_worst,
     breast_cancer$perimeter_worst,
     xlab="Concave Points Worst",
     ylab="Perimeter Worst")


Scatter Plot

# Core Graphics

plot(perimeter_worst~area_worst,

data=breast_cancer)

▪ Plot of perimeter_worst against area_worst


Scatter Plot

# Formula is y ~ x, so Perimeter Worst goes on the y axis to match the labels
plot(perimeter_worst~concave.points_worst,
     data=breast_cancer,
     main="Concave Points Worst vs Perimeter Worst",
     xlab="Concave Points Worst",
     ylab="Perimeter Worst")

# Add the fitted regression line
abline(lm(perimeter_worst~concave.points_worst,
          data=breast_cancer),col="red",lwd=2)

cor(breast_cancer$concave.points_worst,breast_cancer$perimeter_worst)
# 0.816322101687544

• Plots Perimeter Worst against Concave Points Worst, then adds a regression line

• Find correlation between variables (values close to 1 or -1 depict a strong linear relationship)


GGPLOT2 GRAPHICS


ggplot Fundamentals

•ggplot() provides a blank canvas for plotting

•geom_*() creates actual graphical layers
  • geom_point()
  • geom_boxplot()

•aes() defines an "aesthetic" either globally or by layer



Layering

ggplot(breast_cancer, aes()) + geom_point()

Layer 1: ggplot(), Layer 2: geom_point()

Histogram

A histogram of counts of Concave Points Worst

ggplot(breast_cancer,aes(x=concave.points_worst)) +
  geom_histogram()


Density

Smooths over the counts of Concave Points Worst

▪ Note the location of aes()

ggplot(breast_cancer) +
  geom_density(aes(x=concave.points_worst),fill="gray50") +
  labs(x="Concave Points Worst")


Scatter Plot

ggplot(breast_cancer,

aes(x=concave.points_worst,

y=perimeter_worst)) +

geom_point() +

labs(x="Concave Points Worst",

y="Perimeter Worst")


Saving a ggplot Object

# ggplot object
# Store the plot for future modifications
g <- ggplot(breast_cancer,
            aes(x=concave.points_worst,
                y=perimeter_worst))

# Second aesthetic adds settings specific to the geom_point layer
# (each line ends with + so that R continues the expression)
g + geom_point(aes(color=diagnosis)) +
  labs(x="Concave Points Worst",
       y="Perimeter Worst")


Segmenting a Plot

# Segment by factor
g +
  geom_point(aes(color=diagnosis)) +
  facet_wrap(~diagnosis) +
  labs(x="Concave Points Worst",
       y="Perimeter Worst")


Summary

✓Basics of R

✓Graphing in R – core and ggplot2

✓Look at multiple types of graphs

✓Visualize and segment data to gain more insights

✓Identify key features

✓Summarize findings


QUESTIONS


Building Classification Models Using Decision Trees


Agenda

• Introduction to predictive analytics

• Introduction to classification

•Decision Tree Classifier

•Hands-on Lab: Building a decision tree classifier using R


INTRODUCTION TO PREDICTIVE ANALYTICS




Emergency & Surgery Rooms

• Gauss Surgical
  • Develops real-time blood monitoring solutions to provide an accurate and objective estimate of blood loss.

• MedaSense
  • Assesses patients’ physiological response to pain.


Patient Data & Risk Assessment

• Watson for Oncology
  • Analyzes patients' medical records and identifies treatment options for doctors and patients.

• SkinVision
  • Assesses skin cancer risk using image recognition and user-provided information.

• Berg
  • Includes dosage trials for intravenous tumor treatment, detection and management of prostate cancer.


Mental Health

• MedyMatch
  • Helps treat stroke and head trauma more effectively by detecting intracranial brain bleeds.

• P1vital
  • Predicting Response to Depression Treatment (PReDicT test) uses machine learning to provide anti-depressant treatment.


INTRODUCTION TO CLASSIFICATION




Supervised Learning

[Diagram: Training Set → Learning Algorithm → train model (learning) → Model; Test Set → apply model → Prediction]


Decision Tree Learning

[Figure: decision tree. Splitting attributes: Perimeter (<114.6 / ≥114.6), Concavity (<0.1358 / ≥0.1358), Texture (<26.29 / ≥26.29). Leaves are labelled Benign or Malignant.]


A Different Decision Tree

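[Figure: a second decision tree using the same attributes in a different order: Texture (<26.29 / ≥26.29), Perimeter (<114.6 / ≥114.6), Concavity (<0.1358 / ≥0.1358). Leaves are labelled Benign or Malignant.]

There could be more than one tree that fits the same data!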


Decision Tree Application

[Diagram: Training Set → Learning Algorithm → train model (induction) → Model; Test Set → apply model (deduction)]


Apply Model to Test Data

Start from the root of the tree.

[Figure sequence: a test record is traced step by step through the Perimeter (<114.6 / ≥114.6), Concavity (<0.1358 / ≥0.1358) and Texture (<26.29 / ≥26.29) nodes until it reaches a leaf.]

Diagnosis = "Benign"


How Do We Get A Tree?

• Exponentially many decision trees are possible

• Finding the optimal tree is infeasible

• Greedy methods that find near-optimal solutions do exist


Tree Induction

• Greedy strategy
  • Split on the attribute test that optimizes a criterion

• Issues
  • How to split the records
    • What attribute test condition?
    • How to determine the best split?
  • When do we stop?




Splitting a Node

Binary Split: Texture > 26.29? (Yes / No)

Multi-way Split: Texture, with branches <16.5, [16.5, 22.2), [22.2, 32.5), [35.8, 39.7), ≥ 30.2


Tree Induction

• Greedy strategy
  • Split on the attribute test that optimizes a criterion

• Issues
  • How to split the records
  • How to determine the best split?
  • When do we stop?


What is The Best Split?

Before Splitting: 10 records of class 1 (C1: Benign), 10 records of class 2 (C2: Malignant)

Which test condition is the best?

• Texture < 26.29? (Yes / No), two children:
  • C1: 6, C2: 4
  • C1: 4, C2: 6

• Concavity? (branches 1, 2, 3), three children:
  • C1: 1, C2: 3
  • C1: 8, C2: 0
  • C1: 1, C2: 7

• ID? (branches s1, s2, s3, …, s20), twenty children with one record each:
  • C1: 0, C2: 1
  • C1: 1, C2: 0
  • …


What is The Best Split?

• Greedy approach
  • Homogeneous class distribution preferred

• Need a measure of node impurity

Node            Class distribution    Degree of impurity
C1: 5, C2: 5    Non-homogeneous       High
C1: 9, C2: 1    Homogeneous           Low

C1: Benign, C2: Malignant


Measures of Node Impurity

•Gini Index

•Entropy

•Misclassification error


Impurity Measure: GINI

$$GINI(t) = 1 - \sum_j [p(j \mid t)]^2$$

• p(j | t) is the relative frequency of class j at node t

• Maximum (1 - 1/nc) when records are equally distributed among all classes, implying least interesting information
  • nc = number of classes

• Minimum (0.0) when all records belong to one class, implying most interesting information

C1 (Benign)   C2 (Malignant)   Gini
0             6                0.000
1             5                0.278
2             4                0.444
3             3                0.500



$$GINI(t) = 1 - \sum_j [p(j \mid t)]^2$$

C1: Benign, C2: Malignant

Node with C1 = 0, C2 = 6:
P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Gini = 1 – P(C1)^2 – P(C2)^2 = 1 – 0 – 1 = 0

Node with C1 = 1, C2 = 5:
P(C1) = 1/6, P(C2) = 5/6
Gini = 1 – (1/6)^2 – (5/6)^2 = 0.278

Node with C1 = 2, C2 = 4:
P(C1) = 2/6, P(C2) = 4/6
Gini = 1 – (2/6)^2 – (4/6)^2 = 0.444
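The Gini calculations for these nodes can be checked with a few lines of base R. This is a minimal sketch; `gini` is a hypothetical helper name that takes the class counts at a node.

```r
# Gini index of a node, given the number of records in each class.
gini <- function(counts) {
  p <- counts / sum(counts)   # relative frequency of each class
  1 - sum(p^2)
}

# Class counts (C1, C2) for the example nodes.
nodes <- list(c(0, 6), c(1, 5), c(2, 4), c(3, 3))
gini_values <- sapply(nodes, gini)
print(round(gini_values, 3))
```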



• When a node p is split into k partitions (children), the quality of the split is computed as:

$$GINI_{split} = \sum_{i=1}^{k} \frac{n_i}{n} \, GINI(i)$$

where
  n_i = number of records at child i
  n = number of records at node p



• Split data into two partitions

• Partition measurements are weighted

• Larger and purer partitions are sought after

Test condition B? splits the parent into Node N1 and Node N2 (C1: Benign, C2: Malignant).

        Parent   N1   N2
C1      6        5    1
C2      6        2    4

Gini(Parent) = 0.500

Gini(N1) = 1 – (5/7)^2 – (2/7)^2 = 0.408

Gini(N2) = 1 – (1/5)^2 – (4/5)^2 = 0.320

Gini(B?, Parent) = 7/12 * 0.408 + 5/12 * 0.320 = 0.371
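The weighted Gini of this split can be reproduced as follows. A minimal sketch in base R; `gini` and `gini_split` are hypothetical helper names, and each child is described by its (C1, C2) record counts.

```r
# Gini index of a single node from its class counts.
gini <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)
}

# Weighted Gini of a split: each child's Gini weighted by its share of records.
gini_split <- function(children) {
  n <- sum(unlist(children))
  sum(sapply(children, function(counts) sum(counts) / n * gini(counts)))
}

parent  <- gini(c(6, 6))
split_b <- gini_split(list(N1 = c(5, 2), N2 = c(1, 4)))
print(round(c(parent = parent, split = split_b), 3))
```

A lower weighted Gini than the parent's value indicates that the split produced purer children.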


Impurity Measure: Entropy

$$Entropy(t) = -\sum_j p(j \mid t) \log_2 p(j \mid t)$$

• p(j | t) is the relative frequency of class j at node t

• Maximum: records equally distributed

• Minimum: all records belong to one class


Impurity Measure: Entropy

$$Entropy(t) = -\sum_j p(j \mid t) \log_2 p(j \mid t)$$

C1: Benign, C2: Malignant

Node with C1 = 0, C2 = 6:
P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Entropy = – 0 log2 0 – 1 log2 1 = – 0 – 0 = 0

Node with C1 = 1, C2 = 5:
P(C1) = 1/6, P(C2) = 5/6
Entropy = – (1/6) log2 (1/6) – (5/6) log2 (5/6) = 0.65

Node with C1 = 2, C2 = 4:
P(C1) = 2/6, P(C2) = 4/6
Entropy = – (2/6) log2 (2/6) – (4/6) log2 (4/6) = 0.92
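The entropy values for these nodes can be verified with a short base R function. This is a sketch; `entropy` is a hypothetical helper name, and empty classes are dropped so that 0 log2 0 is treated as 0, as in the first example.

```r
# Entropy (in bits) of a node, given the number of records in each class.
entropy <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]               # treat 0 * log2(0) as 0
  -sum(p * log2(p))
}

nodes <- list(c(0, 6), c(1, 5), c(2, 4))
entropy_values <- sapply(nodes, entropy)
print(round(entropy_values, 2))
```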


Impurity Measure: Information Gain

$$GAIN_{split} = Entropy(p) - \sum_{i=1}^{k} \frac{n_i}{n} \, Entropy(i)$$

• Node p is split into k partitions

• n_i is the number of records in partition i; n is the number of records at node p

• Measures reduction in entropy

• Choose the split that maximizes GAIN

• Tends to prefer splits with a large number of partitions
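To make the gain formula concrete, the sketch below applies it to a hypothetical split of a 12-record parent (6 Benign, 6 Malignant) into children with (5, 2) and (1, 4) records. The helper names `entropy` and `information_gain` are assumptions introduced for illustration.

```r
# Entropy (in bits) of a node from its class counts.
entropy <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]               # treat 0 * log2(0) as 0
  -sum(p * log2(p))
}

# Information gain: parent entropy minus the weighted entropy of the children.
information_gain <- function(parent, children) {
  n <- sum(parent)
  weighted <- sum(sapply(children, function(counts) sum(counts) / n * entropy(counts)))
  entropy(parent) - weighted
}

gain <- information_gain(parent = c(6, 6),
                         children = list(c(5, 2), c(1, 4)))
print(gain)
```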


Impurity Measure: Classification Error

$$Error(t) = 1 - \max_i P(i \mid t)$$

• Maximum: records are equally distributed

• Minimum: all records belong to one class

• Similar to information gain
  • Less sensitive for > 2 or 3 splits
  • Less prone to overfitting


Impurity Measure: Classification Error

$$Error(t) = 1 - \max_i P(i \mid t)$$

C1: Benign, C2: Malignant

Node with C1 = 0, C2 = 6:
P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Error = 1 – max (0, 1) = 1 – 1 = 0

Node with C1 = 1, C2 = 5:
P(C1) = 1/6, P(C2) = 5/6
Error = 1 – max (1/6, 5/6) = 1 – 5/6 = 1/6

Node with C1 = 2, C2 = 4:
P(C1) = 2/6, P(C2) = 4/6
Error = 1 – max (2/6, 4/6) = 1 – 4/6 = 1/3
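The classification error for the same three nodes takes one line of base R. A minimal sketch; `class_error` is a hypothetical helper name.

```r
# Misclassification error of a node: 1 minus the share of the majority class.
class_error <- function(counts) 1 - max(counts / sum(counts))

nodes <- list(c(0, 6), c(1, 5), c(2, 4))
error_values <- sapply(nodes, class_error)
print(error_values)
```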


Tree Induction

• Greedy strategy
  • Split on the attribute test that optimizes a criterion

• Issues
  • How to split the records
  • How to determine the best split?
  • When do we stop?


Sample Stopping Criteria

• All the records belong to the same class

• All the records have similar attribute values

• Fixed termination or pruning
  • Number of levels
  • Number in leaf node
  • Minimum samples per leaf node
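In R, stopping criteria like these are commonly set through `rpart.control()` in the rpart package. The sketch below is an illustration only: it uses the built-in `iris` data as a stand-in (an assumption, since the course lab works with the breast cancer data), and the parameter values are arbitrary.

```r
library(rpart)  # recommended package bundled with standard R distributions

# iris is a stand-in dataset (assumption); values below are arbitrary.
fit <- rpart(
  Species ~ .,
  data = iris,
  method = "class",
  control = rpart.control(
    maxdepth  = 3,   # maximum number of levels below the root
    minbucket = 5    # minimum number of records in any leaf node
  )
)

print(fit)
```

Tightening either setting yields a smaller tree, which is one way to limit overfitting.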


Decision Trees - PROS

• Intuitive
  • Easy interpretation for small trees

• Non parametric
  • Incorporate both numeric and categorical attributes

• Fast
  • Once rules are developed, prediction is rapid

• Robust to outliers

[Figure: the example decision tree with Perimeter, Concavity and Texture splits]


Decision Trees - CONS

• Overfitting
  • Must be trained with great care

• Rectangular Classification
  • Recursive partitioning of data may not capture complex relationships


QUESTIONS



Evaluating Classification Models


Agenda

• Evaluation of classification models:
  • Confusion Matrix
  • Accuracy, Precision, Recall, F1 measure

• Building robust machine learning models:
  • Bias/variance tradeoff

• Methods of evaluation:
  • Cross validation
  • ROC curve


The Limitations of Accuracy

• Consider a 2-class problem:
  • Number of Class 0 examples = 9990
  • Number of Class 1 examples = 10

• If the model predicts everything to be class 0, accuracy is 9990/10000 = 99.9 %

• Accuracy is misleading!
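The arithmetic behind this example can be reproduced in base R. The sketch below builds the 9990/10 labels and a "model" that always predicts class 0, then compares its accuracy with the share of class 1 examples it actually detects.

```r
actual    <- c(rep(0, 9990), rep(1, 10))
predicted <- rep(0, length(actual))   # always predicts class 0

# Share of all predictions that are correct.
accuracy <- mean(predicted == actual)

# Share of class 1 examples that the model detects.
detected_class1 <- sum(predicted == 1 & actual == 1) / sum(actual == 1)

print(c(accuracy = accuracy, detected_class1 = detected_class1))
```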


METRICS FOR EVALUATION


Confusion Matrix

                          PREDICTED CLASS
                          Class=Yes    Class=No
ACTUAL CLASS  Class=Yes   a            b
              Class=No    c            d

a: TP (true positive)

b: FN (false negative)

c: FP (false positive)

d: TN (true negative)


Confusion Matrix

                          PREDICTED CLASS
                          Class=Yes    Class=No
ACTUAL CLASS  Class=Yes   a (TP)       b (FN)
              Class=No    c (FP)       d (TN)

$$Accuracy = \frac{a + d}{a + b + c + d} = \frac{TP + TN}{TP + TN + FP + FN}$$


Precision

$$p = \frac{TP}{TP + FP} = \frac{a}{a + c}$$

                          PREDICTED CLASS
                          Class=Yes    Class=No
ACTUAL CLASS  Class=Yes   a (TP)       b (FN)
              Class=No    c (FP)       d (TN)


Recall/Sensitivity

$$r = \frac{TP}{TP + FN} = \frac{a}{a + b}$$

                          PREDICTED CLASS
                          Class=Yes    Class=No
ACTUAL CLASS  Class=Yes   a (TP)       b (FN)
              Class=No    c (FP)       d (TN)


F1-Score

$$F_1 = \frac{2rp}{r + p} = \frac{2a}{2a + b + c}$$

Harmonic mean of precision and recall

                          PREDICTED CLASS
                          Class=Yes    Class=No
ACTUAL CLASS  Class=Yes   a (TP)       b (FN)
              Class=No    c (FP)       d (TN)
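The four metrics can be computed directly from the confusion-matrix cells. The sketch below is in base R; `classification_metrics` is a hypothetical helper name and the counts passed to it are made up for illustration.

```r
# Metrics from confusion-matrix cells: a = TP, b = FN, c = FP, d = TN.
classification_metrics <- function(tp, fn, fp, tn) {
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  c(accuracy  = (tp + tn) / (tp + fn + fp + tn),
    precision = precision,
    recall    = recall,
    f1        = 2 * recall * precision / (recall + precision))
}

# Hypothetical counts.
m <- classification_metrics(tp = 40, fn = 10, fp = 20, tn = 30)
print(round(m, 3))
```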


WILL MY MODEL BETRAY ME?


Is My Model Really Good?

• My model shows an accuracy of 90% in the training environment

• Would the model be 90% accurate in a production environment?


Generalization

• A machine learning model should be able to handle any data set coming from the same distribution as the training set.

• Generalization refers to a model's ability to handle any random variations of training data


Overfitting (lack of generalization)

• The gravest and most common sin of machine learning

• Overfitting: learning so much from your data that you memorize it.
  • You do well on training data
  • But don’t do well (or even fail miserably) on test data


Train/Test Partition is Not Enough

[Diagram: Labelled Data is split into Training Data (70%) and Blind Holdout Data (30%)]


Blind Holdout Dataset

• The person building the model has no access to the blind holdout data set
  • Why do we need to lock it away?

• Even in presence of a 70/30 split, you may end up with a model that is not generalized
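A 70/30 partition of labelled data can be sketched in base R as follows. The built-in `iris` data stands in for the labelled data and the seed is arbitrary; both are assumptions made for illustration.

```r
set.seed(42)                                   # arbitrary seed, for reproducibility
n <- nrow(iris)                                # iris stands in for the labelled data

train_idx <- sample(n, size = round(0.7 * n))  # 70% of row indices, without replacement
training  <- iris[train_idx, ]
holdout   <- iris[-train_idx, ]                # remaining 30%, set aside untouched

print(c(training = nrow(training), holdout = nrow(holdout)))
```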


Perils of Overfitting


Bias/Variance Tradeoff

You can beat your data to confession.


The generation of random numbers is too important to be left to chance.


Bias/Variance Trade-off

Bullseye is the theoretical best performance (accuracy, precision, recall or something else)

Each dartboard represents a model



• Test your model on several variations of the dataset

• Each dot represents a random variation of the test data set


METHODS OF EVALUATION


Cross Validation

•Split data into k disjoint partitions

•Train on k-1 partitions and test on 1

•Repeat k times


Cross Validation (k=10)

[Figure: ten rounds, each showing partitions 1 to 10. In every round a different partition serves as the Test Set and the remaining nine form the Training Set.]
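The k-fold procedure can be sketched in R with a decision tree from the rpart package. This is one possible illustration, not the course lab: the built-in `iris` data, the seed and the default tree settings are all assumptions.

```r
library(rpart)  # recommended package bundled with standard R distributions

set.seed(1)                                          # arbitrary seed
k <- 10
folds <- sample(rep(1:k, length.out = nrow(iris)))   # random fold label per row

# Train on k-1 folds, test on the held-out fold, repeat k times.
accuracy <- sapply(1:k, function(i) {
  train <- iris[folds != i, ]
  test  <- iris[folds == i, ]
  fit   <- rpart(Species ~ ., data = train, method = "class")
  mean(predict(fit, test, type = "class") == test$Species)
})

print(accuracy)        # one accuracy value per fold
print(mean(accuracy))  # cross-validated estimate
```

The spread of the per-fold values gives a rough sense of how much the model's performance varies with the data it is tested on.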


Adjusting Learning Parameters

           max depth = 10   max depth = 7   max depth = 2
A1         100%             80%             55%
A2         60%              78%             55%
A3         90%              79%             55%
A4         70%              77%             55%
A5         80%              81%             55%
average    80%              79%             55%


Holdout Set

•70% for training, 30% for testing

•60/40 or 50/50 also possible

•Repeated holdout: Apply 70/30, 60/40 or 50/50 many times.


Stratified Sampling

•Use when class distribution is skewed

•Ensures that all partitions have a fixed ratio of classes
  • Same ratio as training set
  • If training set is 5% class 1 and 95% class 2, so is each partition
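Stratified sampling can be sketched in base R by sampling within each class separately. The labels below are hypothetical and mirror the 5% / 95% example; the 70% training share and the seed are arbitrary assumptions.

```r
set.seed(7)  # arbitrary seed

# Hypothetical skewed labels: 5% class 1, 95% class 2.
labels <- factor(c(rep("class1", 50), rep("class2", 950)))

# Sample 70% of the row indices within each class separately.
train_idx <- unlist(lapply(split(seq_along(labels), labels), function(idx) {
  sample(idx, size = round(0.7 * length(idx)))
}))

train_ratio <- prop.table(table(labels[train_idx]))
test_ratio  <- prop.table(table(labels[-train_idx]))
print(train_ratio)
print(test_ratio)
```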


Using ROC for Model Comparison

• No model consistently outperforms the other
  • Purple is better at low thresholds
  • Red is better at high thresholds

• Area Under ROC Curve (AUC)
  • Compares models directly
  • The two curves shown have AUC=0.865 and AUC=0.859
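AUC can be computed without plotting the curve by using its rank-based form (the probability that a randomly chosen positive is scored above a randomly chosen negative). The sketch below is in base R; `auc` is a hypothetical helper name, and the labels and scores are made up for illustration.

```r
# AUC from predicted scores and 0/1 labels, using the rank-sum formulation.
auc <- function(scores, labels) {
  r     <- rank(scores)                 # average ranks handle tied scores
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Hypothetical labels and scores from two models.
labels  <- c(0, 0, 1, 0, 1, 1)
model_a <- c(0.1, 0.4, 0.35, 0.8, 0.7, 0.9)
model_b <- c(0.2, 0.3, 0.6, 0.4, 0.7, 0.9)

auc_a <- auc(model_a, labels)
auc_b <- auc(model_b, labels)
print(c(model_a = auc_a, model_b = auc_b))
```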


QUESTIONS

