Introduction Data Science and Data Engineering


Post on 22-May-2020


Page 1: Introduction - World Bankpubdocs.worldbank.org/en/244111541088455535/Data... · Introduction Data Science and Data Engineering. Instructor –Raja Iqbal •Founder, CEO & Chief Data

Introduction

Data Science and Data Engineering


Instructor – Raja Iqbal

• Founder, CEO & Chief Data Scientist.

• Worked in Bing data mining, Bing Ads (2006-2013)
  • ETL, bot detection, online experimentation and A/B testing, relevance of online ads, click prediction, etc.

• Ph.D. in CS with a focus on computer vision, machine learning and data mining.

Copyright (c) 2018. Data Science Dojo 2


Instructor – Rebecca Merrett

• Technical Writer and Content Developer.

• Worked in game engine technology, writing technical content on new features.

• Graduate diploma in mathematics and statistics, with a bachelor's degree in information and media.


Instructor – Victoria Louise Clayton

• Instructor and Mentor.

• Worked for a small research consultancy in London on projects for governments, international organizations and companies such as the UN and Siemens.

• BA in Human Sciences from Oxford University and an MSc in Decision Science.


Instructor – Margaux Penwarden

• Instructor and Mentor.

• Data scientist at McKinsey & Company, Sydney

• Bachelor’s in Computer Science and Mathematics from Télécom Paristech (“Grande Ecole”), and a Master’s in Statistics from Imperial College, London.


About Data Science Dojo

•Started in August 2014

•100+ bootcamps, workshops and corporate trainings

•~3500 attendees

•600+ companies

•10 countries


Learning Objectives

• Learn the theory and practice of data science for improved health systems and healthcare.

• Explore and visualize a health-related dataset.

• Build and evaluate predictive models for classification and regression (for instance, predicting whether a tumour is malignant or not), as an example of the use of machine learning in health.


Learning Objectives

• Understand the fundamentals of unsupervised learning and clustering, and their potential applications in health systems

• Learn fundamentals of text analytics and perform text analytics on a health-related dataset

• Get an introduction to big data and data engineering


CURRICULUM


Maximizing ROI From This Week

• Map the techniques to real problems at all times:
  • Problem and business impact
  • Data you have (and do not have)
  • Measurement metrics
  • Business metrics


Logistics

• 8:30 am – 5:30 pm daily*

• Course material and resources:
  • Handbooks
  • Learning portal

• Request:
  • Make sure your computers are ready
  • Keep the session interactive
  • Social media, email, etc.


*We will end at 4:00 pm on Friday


Agenda for Today

Session I: Understanding the AI and data science landscape

Session II: Data exploration and visualization

Session III: Introduction to predictive modeling

Session IV: Decision tree learning and building your first predictive model

Session V: Evaluating classification models


Understanding the AI and Data Science Landscape


Objectives

• Review the current data science landscape

• Discuss what other organizations are (or may be) doing

• Common data mining tasks

• Identify some data science problems in health


Drug Discoveries

• Insilico Medicine
  • Finding new drugs and treatments including immunotherapies.

• MIT Clinical Machine Learning Group
  • Focussed on disease processes and design for effective treatment of diseases such as Type 2 diabetes.

• Knight Cancer Institute
  • With a current focus on developing an approach to personalize drug combinations for Acute Myeloid Leukemia (AML).


Medical Imaging & Diagnostics

• VunoMed
  • Identifies different types of lung tissue damage by color to help physicians make more accurate diagnoses.

• IBM Watson Genomics
  • Provides precision medicine to cancer patients.


Virtual Assistants

• Scanadu’s doc.ai
  • NLP program that allows patients to get their lab results explained to them by an app, saving both patient and doctor time and money.

• Somatix
  • Recognizes hand-to-mouth gestures in order to help people better understand their behavior and make life-affirming changes.


Research

• Google DeepMind
  • Develops technology to address macular degeneration in aging eyes.

• Desktop Genetics
  • AI-designed tech for more effective and affordable guides. Recognized as a leader in genome editing technology.

• iCarbonX
  • Monitors and models human biological data to enable people to find the proper lifestyle and treatments that can improve their health, life quality and joy.


Brainstorming

What are some other applications?


Connecting the Dots

•The underlying magic behind what we saw is ‘big data’ and ‘predictive analytics’


Big Data Pipeline

Stage: Data influx

• Output: Data stream

Stage: Collection

• Output: Target data

Stage: Preprocessing

• Output: Preprocessed data

Stage: Transformation

• Output: Transformed data

Stage: Data Mining

• Output: Patterns

Stage: Interpretation and Evaluation

• Output: Knowledge discovery and actionable insights


Big Data – Technology, Platforms & Products

[Diagram: Data Management and Data Science layers covering Collect, Store, Transform, Reason, Model, Visualize, Recommend, Predict and Explore, on top of ETL/Log, SQL, NoSQL, MapReduce and Real Time Analytics]


Data Mining Tasks

• Descriptive Methods:
  • Find human-interpretable patterns that describe the data
  • Techniques: Clustering, Association Analysis, X-point summaries

• Predictive Methods:
  • Use available data to build models that can predict the outcome of future data
  • Techniques: Classification, Regression, Anomaly and Deviation Detection

• Prescriptive Methods:
  • Predict future outcomes and suggest actions that may prevent or mitigate the impact of the predicted outcomes
  • Techniques: Various optimization techniques


Traffic Management

Descriptive [Informing Role]:

• Traffic jam has happened already

• [Implicit: Do something about it]


Traffic Management

Predictive [Informing and Warning Role]:

• Traffic jam is about to happen in the next 30 minutes

• [Implicit: Do something before it happens]


Traffic Management

Prescriptive [Informing, Warning, and Advisory Role]:

Take action so traffic jam does not happen OR

Traffic jam is about to happen in the next 30 minutes and you could possibly take the following courses of action:

• Route traffic to service road near I-5

• Block more traffic from entering the WA-520 bridge


Data Mining and Predictive Analytics

In the next few slides, we will take a look at some of the most common data mining tasks.


Classification: A Simple Example

Training Set

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test Set

Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

Flow: Training Set → Learn Classifier → Model; the Model is then applied to the Test Set.
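The train-then-predict flow on this slide can be sketched in R. This is only an illustration, not the course's own code: it assumes the rpart package is available (it ships with most R installations), the column names refund, marital, income and cheat are hypothetical, and the minsplit and cp settings are loosened only because the table has ten rows.

```r
library(rpart)  # assumption: rpart is installed (a recommended package bundled with most R builds)

# Training set transcribed from the slide (income in thousands)
train <- data.frame(
  refund  = factor(c("Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No")),
  marital = factor(c("Single", "Married", "Single", "Married", "Divorced",
                     "Married", "Divorced", "Single", "Married", "Single")),
  income  = c(125, 100, 70, 120, 95, 60, 220, 85, 75, 90),
  cheat   = factor(c("No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"))
)

# Test set: same attributes, class label unknown
test <- data.frame(
  refund  = factor(c("No", "Yes", "No", "Yes", "No", "No"),
                   levels = levels(train$refund)),
  marital = factor(c("Single", "Married", "Married", "Divorced", "Single", "Married"),
                   levels = levels(train$marital)),
  income  = c(75, 50, 150, 90, 40, 80)
)

# Learn the classifier (a decision tree) from the labelled records
model <- rpart(cheat ~ refund + marital + income, data = train, method = "class",
               control = rpart.control(minsplit = 2, cp = 0, xval = 0))

# Apply the model to the unlabelled records
test$predicted <- predict(model, newdata = test, type = "class")

# Share of training records the tree reproduces correctly
train_accuracy <- mean(predict(model, type = "class") == train$cheat)

print(test)
print(train_accuracy)
```

Accuracy on ten training rows says very little about how the tree would behave on new records, which is why evaluation on held-out data gets its own session.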


Classification: More Examples

• What is the likelihood that a patient will develop diabetes?

• What is the likelihood that a COPD patient will be readmitted within 90 days of discharge?

• What is the likelihood that a person will not show up to their appointment?


Clustering: An Illustration

Clustering in 3-D space using Euclidean distance

• Intra-cluster distances are minimized
• Inter-cluster distances are maximized


Clustering

• Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that:
  • Data points within a cluster have more similarities with one another
  • Data points in different clusters have fewer similarities with one another


Clustering: Similarity Measures

• Similarity Measures:
  • Euclidean Distance if attributes are continuous
  • Other problem-specific measures
    • Example: If a particular word occurs in two documents or not
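As a rough illustration of clustering continuous attributes with Euclidean distance, the sketch below uses base R's dist() and kmeans() on simulated 3-D points. The data are made up for the example and the choice of k-means is an assumption; the slides do not name a specific clustering algorithm at this point.

```r
set.seed(42)  # make the simulated data reproducible

# Two simulated groups of 20 points in 3-D (hypothetical data, not from the slides)
group_a <- matrix(rnorm(60, mean = 0,  sd = 1), ncol = 3)
group_b <- matrix(rnorm(60, mean = 10, sd = 1), ncol = 3)
pts <- rbind(group_a, group_b)
colnames(pts) <- c("x", "y", "z")

# Pairwise Euclidean distances (Euclidean is the default method of dist())
d <- as.matrix(dist(pts))

# k-means looks for centers that minimize the within-cluster sum of squared distances
fit <- kmeans(pts, centers = 2, nstart = 10)

print(table(fit$cluster))   # cluster sizes
print(fit$tot.withinss)     # intra-cluster scatter (smaller is tighter)
print(fit$betweenss)        # inter-cluster separation (larger is further apart)
```

The two printed sums of squares correspond to the "intra-cluster" and "inter-cluster" distances in the illustration: a good clustering keeps the first small relative to the second.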


Clustering: Examples

To find groups of documents that are similar to each other based on the most important terms that appear in them (e.g. medical records)


Association Analysis

Your behavior is being predicted, not by studying you, but by studying others.


Association Rule Discovery

• Given a set of records, each of which contains some number of items from a given collection:
  • Produce dependency rules which will predict the occurrence of an item based on the occurrences of other items

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
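How strong are the two discovered rules? A common way to quantify a rule is by its support and confidence. The slide does not name these measures, so treat the base R sketch below as an assumed illustration; the helper names support and confidence are hypothetical.

```r
# Transactions transcribed from the table on the slide
transactions <- list(
  c("Bread", "Coke", "Milk"),
  c("Beer", "Bread"),
  c("Beer", "Coke", "Diaper", "Milk"),
  c("Beer", "Bread", "Diaper", "Milk"),
  c("Coke", "Diaper", "Milk")
)

# Support: fraction of transactions that contain every item in `items`
support <- function(items) {
  mean(sapply(transactions, function(basket) all(items %in% basket)))
}

# Confidence of lhs --> rhs: support(lhs and rhs together) / support(lhs)
confidence <- function(lhs, rhs) {
  support(c(lhs, rhs)) / support(lhs)
}

milk_coke <- c(support    = support(c("Milk", "Coke")),
               confidence = confidence("Milk", "Coke"))
diaper_milk_beer <- c(support    = support(c("Diaper", "Milk", "Beer")),
                      confidence = confidence(c("Diaper", "Milk"), "Beer"))

print(milk_coke)         # support 0.6, confidence 0.75
print(diaper_milk_beer)  # support 0.4, confidence 2/3
```

Milk appears in four of the five baskets and Coke in three of those four, which is where the 0.75 confidence for {Milk} --> {Coke} comes from.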


Association Analysis: Pharmacy Shelf Management


Regression

Predicts a value of a given continuous-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency


Regression Example


Predicting vaccine demand to better plan supply
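A minimal sketch of such a regression in R is shown below. The data are simulated purely for illustration (no real vaccine figures are used), the variable names population_k and doses_k are hypothetical, and a straight-line model is assumed even though the real relationship may well be nonlinear.

```r
set.seed(1)

# Simulated, purely illustrative data for 30 districts
population_k <- seq(10, 100, length.out = 30)               # population, in thousands
doses_k <- 5 + 0.8 * population_k + rnorm(30, sd = 2)       # doses needed, in thousands
districts <- data.frame(population_k, doses_k)

# Fit a linear model: doses as a function of population
fit <- lm(doses_k ~ population_k, data = districts)
print(coef(fit))

# Predict demand for a new district of 55 thousand people
new_district <- data.frame(population_k = 55)
predicted <- predict(fit, newdata = new_district)
print(predicted)
```

Because the data were generated with a slope of 0.8, the fitted coefficient should land close to that value; with real data the interesting part is how far the predictions stray from what is later observed.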


Anomaly Detection

• Detect significant deviations from normal behavior

• Applications:

• Unusual patient behavior

• Insurance fraud detection

• Treatment outlier detection
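One simple way to flag "significant deviations" is the boxplot rule, sketched here in base R. The length-of-stay values are made up for the example, and the 1.5 × IQR threshold is the default of boxplot.stats(), not something prescribed by the slides.

```r
# Hypothetical lengths of hospital stay in days (made-up values)
length_of_stay <- c(10, 12, 11, 13, 12, 11, 95, 10, 12)

# boxplot.stats() returns, in $out, the values lying more than
# 1.5 times the interquartile range beyond the hinges
outliers <- boxplot.stats(length_of_stay)$out
outlier_positions <- which(length_of_stay %in% outliers)

print(outliers)           # 95
print(outlier_positions)  # 7
```

A rule like this only flags candidates; whether a 95-day stay is an error, fraud or a genuinely complex case still needs domain knowledge.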


Challenges in Data Mining

• Scalability
• Dimensionality
• Complex and heterogeneous data
• Data quality
• Data ownership and distribution
• Privacy
• Reaction time
• Many other domain-specific issues


AI in Healthcare Landscape


Overview of Datasets


Wisconsin Breast Cancer Data


Wisconsin Breast Cancer Data


• Features obtained from a digital image of a fine needle aspirate (FNA) of a breast mass.

• Describes characteristics of the cell nuclei present in the image.

• Attribute information:
  • ID number
  • Diagnosis (M = malignant, B = benign)
  • 10 real-valued features

• Total of 569 records


Wisconsin Breast Cancer Data

Source: https://www.kaggle.com/uciml/breast-cancer-wisconsin-data


Features: Wisconsin Breast Cancer Data


Id: ID number

diagnosis: The diagnosis of breast tissues (M = malignant, B = benign)

radius_mean: mean of distances from center to points on the perimeter

texture_mean: standard deviation of gray-scale values

perimeter_mean: mean size of the core tumor

compactness_se: standard error for perimeter^2 / area - 1.0

smoothness_mean: mean of local variation in radius lengths

compactness_mean: mean of perimeter^2 / area - 1.0

concavity_mean: mean of severity of concave portions of the contour

concave points_mean: mean for number of concave portions of the contour

fractal_dimension_mean: mean for "coastline approximation" - 1

radius_se: standard error for the mean of distances from center to points on the perimeter

texture_se: standard error for standard deviation of gray-scale values

Source: https://www.kaggle.com/uciml/breast-cancer-wisconsin-data


Data Exploration and Visualization


Agenda

•Why data exploration and visualization?

•Exploration and visualization of data:
  •Core R functionality
  •lattice package
  •ggplot2 package


WHY DATA EXPLORATION AND VISUALIZATION?


Data Beats Algorithm But…

•More data usually yields good generalization performance, even with a simple algorithm

•But there are caveats:
  •Amount of data may have diminishing returns
  •Data quality and variety matter
  •A decent performing learning algorithm is still needed
  •Most importantly, extracting useful features out of data is important


Is Date-Time Stamp a Good Feature?

23:05:33 –5 UTC, April 3, 2014

Derived features: hour of day, day of week, AM/PM
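A raw date-time stamp is rarely useful to a model as-is, but features derived from it often are. The base R sketch below pulls the hour, day of week and AM/PM out of the example stamp. It treats the clock time as written and ignores the "–5" offset (parsing with tz = "UTC" is just a convenience), and the column names are hypothetical.

```r
# Parse the clock time shown on the slide; the UTC offset is deliberately ignored
stamp <- as.POSIXlt("2014-04-03 23:05:33", tz = "UTC")

day_names <- c("Sunday", "Monday", "Tuesday", "Wednesday",
               "Thursday", "Friday", "Saturday")

features <- data.frame(
  hour        = stamp$hour,                        # 0 to 23
  day_of_week = day_names[stamp$wday + 1],         # wday counts from 0 = Sunday
  am_pm       = ifelse(stamp$hour < 12, "AM", "PM")
)

print(features)  # hour 23, Thursday, PM
```

Whether these derived columns help depends on the problem: hour of day may matter for emergency-room load, for example, while the full stamp itself is almost unique per row and so carries little predictive signal.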


Dispelling a Common Myth

•There is NO single ML algorithm that will take raw data and give you the best model

•You do NOT need to know a lot of machine learning algorithms to build robust predictive models


Janitorial Work is Important

•Not spending time on understanding your data is a source of many problems!

•Remember the 80/20 rule:
  • 80%: Data cleaning, data exploration, feature engineering, pre-processing, etc.
  • 20%: Model building


EXPLORATION AND VISUALIZATION USING R


Objectives

•Develop an understanding of the high-level thinking process of data exploration

•Make sense of data using visualization techniques

•Learn to perform feature engineering

•Become a good storyteller


Anscombe’s Quartet

      I              II             III             IV
   x      y       x      y       x      y       x      y
 10.0   8.04    10.0   9.14    10.0   7.46     8.0   6.58
  8.0   6.95     8.0   8.14     8.0   6.77     8.0   5.76
 13.0   7.58    13.0   8.74    13.0  12.74     8.0   7.71
  9.0   8.81     9.0   8.77     9.0   7.11     8.0   8.84
 11.0   8.33    11.0   9.26    11.0   7.81     8.0   8.47
 14.0   9.96    14.0   8.10    14.0   8.84     8.0   7.04
  6.0   7.24     6.0   6.13     6.0   6.08     8.0   5.25
  4.0   4.26     4.0   3.10     4.0   5.39    19.0  12.50
 12.0  10.84    12.0   9.13    12.0   8.15     8.0   5.56
  7.0   4.82     7.0   7.26     7.0   6.42     8.0   7.91
  5.0   5.68     5.0   4.74     5.0   5.73     8.0   6.89

Anscombe’s Quartet

The four datasets above share (nearly) identical summary statistics:

Statistic                   Value
Mean of X                   9
Variance of X               11
Mean of Y                   7.5
Variance of Y               4.125
Correlation between X & Y   0.816
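Anscombe's quartet ships with R as the built-in anscombe data frame (columns x1 to x4 and y1 to y4), so the summary statistics quoted for it can be checked directly. The sketch below computes them per dataset and then plots all four, which is where the differences become obvious; the layout of the stats table is just one possible choice.

```r
# Helper: apply a function to the x and y columns of dataset i
per_set <- function(f) {
  sapply(1:4, function(i) f(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]]))
}

stats <- data.frame(
  dataset = c("I", "II", "III", "IV"),
  mean_x  = per_set(function(x, y) mean(x)),
  var_x   = per_set(function(x, y) var(x)),
  mean_y  = per_set(function(x, y) mean(y)),
  var_y   = per_set(function(x, y) var(y)),
  cor_xy  = per_set(function(x, y) cor(x, y))
)
print(round(stats[-1], 3))

# Nearly identical numbers, very different pictures
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       main = paste("Dataset", c("I", "II", "III", "IV")[i]),
       xlab = "x", ylab = "y")
}
par(op)
```

The statistics agree to two or three significant figures across the four datasets, yet the plots show a linear trend, a curve, a line with one outlier, and a vertical stack with one leverage point.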

Awareness Test


Common Graphical Parameters

• Title of graph using the main argument, main = "title"

• Label x axis by using the xlab argument, xlab = "label x axis"

• Label y axis by using the ylab argument, ylab = "label y axis"

• Colors controlled by col

• Get legends of layered plots with auto.key=TRUE (lattice)
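The sketch below shows these parameters on a base R scatter plot. The data are simulated for the example and the variable names are hypothetical. Since auto.key belongs to lattice, the base graphics version adds its legend with legend() instead.

```r
set.seed(7)

# Simulated measurements for two groups (illustrative values only)
radius  <- c(rnorm(20, mean = 12, sd = 1.5), rnorm(20, mean = 17, sd = 2))
texture <- radius * 1.1 + rnorm(40, sd = 2)
group   <- rep(c("B", "M"), each = 20)

plot(radius, texture,
     main = "Texture vs Radius (simulated)",                    # plot title
     xlab = "Radius",                                           # x-axis label
     ylab = "Texture",                                          # y-axis label
     col  = ifelse(group == "B", "steelblue", "firebrick"))     # point colors by group

# Base graphics has no auto.key; add the legend explicitly
legend("topleft", legend = c("B", "M"),
       col = c("steelblue", "firebrick"), pch = 1)
```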


Exploring Data Commands

Commands                    Description
read.csv(), read.table()    Load data/file into a dataframe
data()                      Loads or resets a dataset
names()                     List names of variables in a dataframe
head()                      First 6 rows of data
tail()                      Last 6 rows of data
str()                       Display internal structure of an R object
View()                      View dataset in spreadsheet format in RStudio
dim()                       Dimensions (rows and columns) of dataframe
summary()                   Display 5-number summary and mean
colnames()                  Provide column names
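To see what these commands return, here is a short sketch on a tiny stand-in data frame. The six rows are hypothetical values invented for the example (they only mimic the shape of the breast cancer data), so the snippet runs without any file on disk.

```r
# Hypothetical six-row stand-in for a real dataset
toy <- data.frame(
  id          = 1:6,
  diagnosis   = c("M", "B", "B", "M", "B", "M"),
  radius_mean = c(17.99, 13.54, 13.08, 20.57, 9.50, 19.69)
)

print(names(toy))                 # variable names
print(dim(toy))                   # rows and columns
print(head(toy, 3))               # first 3 rows (default is 6)
print(tail(toy, 2))               # last 2 rows (default is 6)
str(toy)                          # internal structure: types and sample values
print(summary(toy$radius_mean))   # min, quartiles, median, mean, max
```

View() is left out because it opens an interactive viewer, which only makes sense inside RStudio or another interactive session.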


CORE R GRAPHICS


Breast Cancer Dataset

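# Load the CSV file into a data frame (assumes data.csv is in the working directory)
breast_cancer <- read.csv("data.csv")

# data() only loads datasets bundled with R packages, so it is not needed after read.csv()

# Inspect the first 6 rows
head(breast_cancer)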


Boxplots

• Summarizes quantitative/numeric data

# Core Graphics
boxplot(
  radius_mean ~ diagnosis,
  data = breast_cancer,
  main = "Radius Mean for various diagnoses",
  xlab = "Diagnosis",
  ylab = "Radius Mean"
)

B: Benign M: Malignant


Pie Chart

▪ Summarizes qualitative/categorical variables

# Core Graphics

pie(table(breast_cancer$diagnosis))


B: Benign M: Malignant


Scatter Plot

▪ Visual depiction of correlation between numeric variables

# Core Graphics

plot(breast_cancer$concave.points_worst,
     breast_cancer$perimeter_worst,
     xlab = "Concave Points Worst",
     ylab = "Perimeter Worst")


Scatter Plot

# Core Graphics

plot(perimeter_worst~area_worst,

data=breast_cancer)

▪ Plot of perimeter_worst against area_worst


Scatter Plot

# The formula is y ~ x, so perimeter_worst goes on the x axis
plot(concave.points_worst ~ perimeter_worst,
     data = breast_cancer,
     main = "Concave Points Worst vs Perimeter Worst",
     xlab = "Perimeter Worst",
     ylab = "Concave Points Worst")

# Add the fitted least-squares regression line
abline(lm(concave.points_worst ~ perimeter_worst, data = breast_cancer),
       col = "red", lwd = 2)

cor(breast_cancer$concave.points_worst, breast_cancer$perimeter_worst)
# Output: 0.816322101687544

• Plots Concave Points Worst versus Perimeter Worst, then adds a regression line

• Finds the correlation between the variables (values close to 1 or -1 depict a strong linear relationship)


GGPLOT2 GRAPHICS


ggplot Fundamentals

•ggplot() provides a blank canvas for plotting

•geom_*() creates actual graphical layers
  • geom_point()
  • geom_boxplot()

•aes() defines an "aesthetic" either globally or by layer
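The difference between a global aesthetic and a layer-specific one can be seen in a small sketch. It assumes the ggplot2 package is installed, uses simulated data with hypothetical column names, and adds geom_smooth() as a second layer purely to show inheritance; that layer is not part of the slides.

```r
library(ggplot2)  # assumption: ggplot2 is installed

set.seed(3)

# Simulated stand-in data (illustrative values only)
toy <- data.frame(
  perimeter = c(rnorm(25, mean = 85, sd = 10), rnorm(25, mean = 140, sd = 20)),
  diagnosis = rep(c("B", "M"), each = 25)
)
toy$concave_points <- 0.001 * toy$perimeter + rnorm(50, sd = 0.02)

# Global aesthetics in ggplot(): inherited by every layer
p <- ggplot(toy, aes(x = perimeter, y = concave_points)) +
  # Layer-specific aesthetic: only the points are colored by diagnosis
  geom_point(aes(color = diagnosis)) +
  # This layer inherits x and y but not color, so it fits one line overall
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

print(p)
```

Moving color = diagnosis into the global aes() would make geom_smooth() inherit it too and draw one fitted line per diagnosis, which is a quick way to see what "globally or by layer" means in practice.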


Layering

ggplot(breast_cancer, aes()) + geom_point()

Layer 1: ggplot(breast_cancer, aes())
Layer 2: geom_point()


Histogram

A histogram of counts of Concave Points Worst

ggplot(breast_cancer, aes(x = concave.points_worst)) +
  geom_histogram()


Density

Smooths over the counts of Concave Points Worst

▪ Note the location of aes()

ggplot(breast_cancer) +
  geom_density(aes(x=concave.points_worst), fill="gray50") +
  labs(x="Concave Points Worst")

Scatter Plot

ggplot(breast_cancer,
       aes(x=concave.points_worst,
           y=perimeter_worst)) +
  geom_point() +
  labs(x="Concave Points Worst",
       y="Perimeter Worst")

Saving a ggplot Object

# ggplot object
# Store the plot for future modifications
g <- ggplot(breast_cancer,
            aes(x=concave.points_worst,
                y=perimeter_worst))

# Second aesthetic adds settings specific to the geom_point layer
# (the + must end a line, not start one, or R treats the next line as a new statement)
g + geom_point(aes(color=diagnosis)) +
  labs(x="Concave Points Worst",
       y="Perimeter Worst")

Segmenting a Plot

# Segment by factor
g +
  geom_point(aes(color=diagnosis)) +
  facet_wrap(~diagnosis) +
  labs(x="Concave Points Worst",
       y="Perimeter Worst")

Summary

✓Basics of R

✓Graphing in R – core and ggplot2

✓Look at multiple types of graphs

✓Visualize and segment data to gain more insights

✓Identify key features

✓Summarize findings


QUESTIONS


Building Classification Models Using Decision Trees


Agenda

• Introduction to predictive analytics

• Introduction to classification

•Decision Tree Classifier

•Hands-on Lab: Building a decision tree classifier using R


INTRODUCTION TO PREDICTIVE ANALYTICS


Emergency & Surgery Rooms

• Gauss Surgical
  • Develops real-time blood monitoring solutions to provide an accurate and objective estimate of blood loss.

• MedaSense
  • Assesses patients’ physiological response to pain.

Patient Data & Risk Assessment

▪ Watson for Oncology
  • Analyzes patients’ medical records and identifies treatment options for doctors and patients.

▪ SkinVision
  • Assesses skin cancer risk using image recognition and user-provided information.

▪ Berg
  • Includes dosage trials for intravenous tumor treatment, and detection and management of prostate cancer.

Mental Health

▪ MedyMatch
  • Helps treat stroke and head trauma more effectively by detecting intracranial brain bleeds.

▪ P1vital
  • The Predicting Response to Depression Treatment (PReDicT) test uses machine learning to predict response to antidepressant treatment.

INTRODUCTION TO CLASSIFICATION


Supervised Learning

[Figure: Learning: Training Set → Learning Algorithm → Train Model → Model. Prediction: Test Set → Apply Model → Model]

Decision Tree Learning

[Figure: decision tree. Splitting attributes: Perimeter (< 114.6 / ≥ 114.6), Concavity (< 0.1358 / ≥ 0.1358), Texture (< 26.29 / ≥ 26.29). Leaf nodes are labeled Benign or Malignant.]

A Different Decision Tree

[Figure: a second decision tree that tests the same attributes in a different order: Texture (< 26.29 / ≥ 26.29), Perimeter (< 114.6 / ≥ 114.6), Concavity (< 0.1358 / ≥ 0.1358). Leaf nodes are labeled Benign or Malignant.]

There could be more than one tree that fits the same data!

Decision Tree Application

[Figure: Induction: Training Set → Learning Algorithm → Train Model → Model. Deduction: Test Set → Apply Model → Model]

Apply Model to Test Data

Start from the root of the tree.

[Figure, repeated over six slides: the test record is traced step by step through the Perimeter / Concavity / Texture decision tree, one attribute test per slide, until it reaches a leaf.]

Diagnosis = “Benign”
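The traversal described above can be sketched as a chain of attribute tests. The thresholds (114.6, 0.1358, 26.29) are taken from the slides, but the branch layout of the figure did not survive extraction, so the tree shape, the function name `classify_record`, and the sample record below are assumptions made purely for illustration.

```r
# Hypothetical tree: thresholds come from the slides, the branch layout is assumed.
classify_record <- function(perimeter, concavity, texture) {
  if (perimeter >= 114.6) {
    return("Malignant")          # assumed leaf for large perimeters
  }
  if (concavity < 0.1358) {
    return("Benign")             # assumed leaf for low concavity
  }
  # assumed final test on texture
  if (texture < 26.29) "Benign" else "Malignant"
}

# Made-up test record, traced from the root down to a leaf
classify_record(perimeter = 98.2, concavity = 0.09, texture = 21.5)
```

Each call performs at most three comparisons, which is why prediction with a trained tree is fast.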

How Do We Get A Tree?

• Exponentially many decision trees are possible

• Finding the optimal tree is infeasible

• Greedy methods that find near-optimal solutions do exist


Tree Induction

• Greedy strategy
  • Split based on an attribute test that optimizes a criterion

• Issues
  • How to split the records
  • What attribute test condition?
  • How to determine the best split?
  • When do we stop?

Splitting a Node

Binary Split: Texture > 26.29? (Yes / No)

Multi-way Split: Texture, with branches < 16.5, [16.5, 22.2), [22.2, 32.5), [35.8, 39.7), ≥ 30.2

What is The Best Split?

Before splitting: 10 records of class C1 (Benign), 10 records of class C2 (Malignant)

Which test condition is the best?

Test condition     Branches      Class counts per child (C1, C2)
Texture < 26.29?   Yes / No      (6, 4), (4, 6)
Concavity?         1, 2, 3       (1, 3), (8, 0), (1, 7)
ID?                s1 ... s20    (0, 1), (1, 0), (0, 1), ..., (1, 0)

What is The Best Split?

• Greedy approach
  • Homogeneous class distribution preferred
  • Need a measure of node impurity

C1: 5, C2: 5    Non-homogeneous, high degree of impurity
C1: 9, C2: 1    Homogeneous, low degree of impurity

C1: Benign
C2: Malignant

Measures of Node Impurity

•Gini Index

•Entropy

•Misclassification error


Impurity Measure: GINI

$$GINI(t) = 1 - \sum_j [p(j \mid t)]^2$$

• p(j | t) is the relative frequency of class j at node t

• Maximum (1 - 1/n_c) when records are equally distributed among all classes, implying least interesting information
  • n_c = number of classes

• Minimum (0.0) when all records belong to one class, implying most interesting information

C1      0       1       2       3
C2      6       5       4       3
Gini    0.000   0.278   0.444   0.500

C1: Benign
C2: Malignant

Impurity Measure: GINI

$$GINI(t) = 1 - \sum_j [p(j \mid t)]^2$$

C1 = 0, C2 = 6:  P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
                 Gini = 1 - P(C1)^2 - P(C2)^2 = 1 - 0 - 1 = 0

C1 = 1, C2 = 5:  P(C1) = 1/6, P(C2) = 5/6
                 Gini = 1 - (1/6)^2 - (5/6)^2 = 0.278

C1 = 2, C2 = 4:  P(C1) = 2/6, P(C2) = 4/6
                 Gini = 1 - (2/6)^2 - (4/6)^2 = 0.444

C1: Benign
C2: Malignant
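The worked Gini values can be checked with a few lines of base R. This is a minimal sketch; the helper name `gini` is an assumption, not something defined in the slides.

```r
# Gini index of a node, given a vector of class counts (helper name is illustrative)
gini <- function(counts) {
  p <- counts / sum(counts)   # relative class frequencies p(j | t)
  1 - sum(p^2)
}

round(gini(c(0, 6)), 3)   # pure node
round(gini(c(1, 5)), 3)
round(gini(c(2, 4)), 3)
round(gini(c(3, 3)), 3)   # evenly mixed node: maximum for two classes
```

The four calls reproduce the node values 0.000, 0.278, 0.444 and 0.500 from the slides.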

Impurity Measure: GINI

• When a node p is split into k partitions (children), the quality of the split is computed as:

$$GINI(split, p) = \sum_{i=1}^{k} \frac{n_i}{n} GINI(i)$$

where
n_i = number of records at child i
n = number of records at node p

Impurity Measure: GINI

• Split data into two partitions

• Partition measurements are weighted

• Larger and purer partitions are sought after

[Figure: test B? splits the parent into Node N1 and Node N2; branches labeled Malignant and Benign]

Parent: C1 = 6, C2 = 6, Gini = 0.500

        N1   N2
C1      5    1
C2      2    4
Gini(B?, Parent) = 0.371

Gini(N1) = 1 - (5/7)^2 - (2/7)^2 = 0.408

Gini(N2) = 1 - (1/5)^2 - (4/5)^2 = 0.320

Gini(B?, Parent) = 7/12 * 0.408 + 5/12 * 0.320 = 0.371

C1: Benign
C2: Malignant
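The weighted split calculation above can be sketched in base R as follows. The helper names `gini` and `gini_split` are assumptions for illustration; the class counts are the ones from the slide.

```r
# Gini index of a single node from its class counts
gini <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)
}

# Weighted Gini of a split; `children` is a list of class-count vectors
gini_split <- function(children) {
  n_i <- sapply(children, sum)                    # records in each child
  sum(n_i / sum(n_i) * sapply(children, gini))    # weight each child by n_i / n
}

parent <- c(C1 = 6, C2 = 6)
n1 <- c(C1 = 5, C2 = 2)
n2 <- c(C1 = 1, C2 = 4)

round(gini(parent), 3)               # 0.5
round(gini_split(list(n1, n2)), 3)   # 0.371
```

A split is worthwhile when the weighted value is lower than the parent's Gini, as it is here.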

Impurity Measure: Entropy

$$Entropy(t) = -\sum_j p(j \mid t) \log_2 p(j \mid t)$$

• p(j | t) is the relative frequency of class j at node t

• Maximum: records equally distributed

• Minimum: all records belong to one class

Impurity Measure: Entropy

$$Entropy(t) = -\sum_j p(j \mid t) \log_2 p(j \mid t)$$

C1 = 0, C2 = 6:  P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
                 Entropy = - 0 log2 0 - 1 log2 1 = - 0 - 0 = 0 (taking 0 log2 0 as 0)

C1 = 1, C2 = 5:  P(C1) = 1/6, P(C2) = 5/6
                 Entropy = - (1/6) log2 (1/6) - (5/6) log2 (5/6) = 0.65

C1 = 2, C2 = 4:  P(C1) = 2/6, P(C2) = 4/6
                 Entropy = - (2/6) log2 (2/6) - (4/6) log2 (4/6) = 0.92

C1: Benign
C2: Malignant
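The entropy examples can be reproduced with a short base R helper. This is a sketch; the name `entropy` is an assumption, and empty classes are dropped so that 0 log2 0 is treated as 0.

```r
# Entropy of a node from its class counts (helper name is illustrative)
entropy <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]             # drop empty classes: 0 * log2(0) is taken as 0
  -sum(p * log2(p))
}

round(entropy(c(0, 6)), 2)   # 0
round(entropy(c(1, 5)), 2)   # 0.65
round(entropy(c(2, 4)), 2)   # 0.92
```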

Impurity Measure: Information Gain

$$GAIN_{split} = Entropy(p) - \sum_{i=1}^{k} \frac{n_i}{n} Entropy(i)$$

• Node p is split into k partitions

• n_i is the number of records in partition i (n is the number of records at node p)

• Measures reduction in entropy

• Choose split that maximizes GAIN

• Tends to prefer splits with large number of partitions
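To see why information gain tends to favor splits with many partitions, the sketch below applies it to the three candidate tests from the "What is The Best Split?" slide (Texture, Concavity, ID), using the class counts shown there. The helper names `entropy` and `info_gain` are assumptions for illustration.

```r
entropy <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]                       # 0 * log2(0) is taken as 0
  -sum(p * log2(p))
}

# Information gain of a split; `children` is a list of class-count vectors
info_gain <- function(parent, children) {
  n_i <- sapply(children, sum)
  entropy(parent) - sum(n_i / sum(n_i) * sapply(children, entropy))
}

parent    <- c(10, 10)                                  # 10 Benign, 10 Malignant
texture   <- list(c(6, 4), c(4, 6))
concavity <- list(c(1, 3), c(8, 0), c(1, 7))
id        <- c(rep(list(c(1, 0)), 10), rep(list(c(0, 1)), 10))  # 20 single-record children

gain_texture   <- info_gain(parent, texture)
gain_concavity <- info_gain(parent, concavity)
gain_id        <- info_gain(parent, id)      # every child is pure, so the gain is 1

c(texture = gain_texture, concavity = gain_concavity, id = gain_id)
```

The ID split gets the highest gain even though a record identifier cannot generalize to new data, which illustrates the bias toward many-way splits noted in the last bullet.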

Impurity Measure: Classification Error

$$Error(t) = 1 - \max_i P(i \mid t)$$

• Maximum: records are equally distributed

• Minimum: all records belong to one class

• Similar to information gain
  • Less sensitive for > 2 or 3 splits
  • Less prone to overfitting

Impurity Measure: Classification Error

$$Error(t) = 1 - \max_i P(i \mid t)$$

C1 = 0, C2 = 6:  P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
                 Error = 1 - max(0, 1) = 1 - 1 = 0

C1 = 1, C2 = 5:  P(C1) = 1/6, P(C2) = 5/6
                 Error = 1 - max(1/6, 5/6) = 1 - 5/6 = 1/6

C1 = 2, C2 = 4:  P(C1) = 2/6, P(C2) = 4/6
                 Error = 1 - max(2/6, 4/6) = 1 - 4/6 = 1/3

C1: Benign
C2: Malignant
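The classification error examples reduce to a one-line helper in base R. The name `class_error` is an assumption used only for this sketch.

```r
# Misclassification error of a node: 1 minus the share of the majority class
class_error <- function(counts) {
  1 - max(counts / sum(counts))
}

class_error(c(0, 6))   # 0
class_error(c(1, 5))   # 1/6
class_error(c(2, 4))   # 1/3
```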

Sample Stopping Criteria

• All the records belong to the same class

• All the records have similar attribute values

• Fixed termination or pruning
  • Number of levels
  • Number in leaf node
  • Minimum samples per leaf node

Decision Trees - PROS

• Intuitive
  • Easy interpretation for small trees

• Non-parametric
  • Incorporate both numeric and categorical attributes

• Fast
  • Once rules are developed, prediction is rapid

• Robust to outliers

[Figure: the Perimeter / Concavity / Texture decision tree shown earlier]

Decision Trees - CONS

• Overfitting
  • Must be trained with great care

• Rectangular Classification
  • Recursive partitioning of data may not capture complex relationships

QUESTIONS


Evaluating Classification Models


Agenda

• Evaluation of classification models:
  • Confusion Matrix
  • Accuracy, Precision, Recall, F1 measure

• Building robust machine learning models:
  • Bias/variance tradeoff

• Methods of evaluation:
  • Cross validation
  • ROC curve

The Limitations of Accuracy

• Consider a 2-class problem:
  • Number of Class 0 examples = 9990
  • Number of Class 1 examples = 10

• If the model predicts everything to be class 0, accuracy is 9990/10000 = 99.9%

• Accuracy is misleading!
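The imbalance example can be reproduced directly. In this sketch the labels are constructed to match the counts on the slide, and the "model" is simply a vector of all-zero predictions; the variable names are assumptions.

```r
actual    <- c(rep(0, 9990), rep(1, 10))   # 9990 class 0 examples, 10 class 1 examples
predicted <- rep(0, length(actual))        # trivial model: always predict class 0

accuracy <- mean(predicted == actual)      # 0.999

# Share of actual class 1 records the model finds (recall for class 1)
recall_class1 <- sum(predicted == 1 & actual == 1) / sum(actual == 1)   # 0

c(accuracy = accuracy, recall_class1 = recall_class1)
```

The model scores 99.9% accuracy while never detecting a single class 1 record, which is why accuracy alone is misleading on skewed data.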

METRICS FOR EVALUATION


Confusion Matrix

                       PREDICTED CLASS
                       Class=Yes    Class=No
ACTUAL    Class=Yes    a (TP)       b (FN)
CLASS     Class=No     c (FP)       d (TN)

a: TP (true positive)
b: FN (false negative)
c: FP (false positive)
d: TN (true negative)

$$Accuracy = \frac{a + d}{a + b + c + d} = \frac{TP + TN}{TP + TN + FP + FN}$$

Precision

$$p = \frac{TP}{TP + FP} = \frac{a}{a + c}$$

where a = TP and c = FP are cells of the confusion matrix.

Recall/Sensitivity

$$r = \frac{TP}{TP + FN} = \frac{a}{a + b}$$

where a = TP and b = FN are cells of the confusion matrix.

F1-Score

$$F_1 = \frac{2rp}{r + p} = \frac{2a}{2a + b + c}$$

Harmonic mean of precision (p) and recall (r), where a = TP, b = FN and c = FP are cells of the confusion matrix.
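The four metrics can be computed from predicted and actual labels in base R. The label vectors below are made up for illustration, and the variable names are assumptions; "Yes" is treated as the positive class.

```r
# Made-up labels for ten records
actual    <- c("Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No", "No", "No")
predicted <- c("Yes", "Yes", "Yes", "No", "Yes", "No", "No", "No", "No", "No")

tp <- sum(actual == "Yes" & predicted == "Yes")   # a
fn <- sum(actual == "Yes" & predicted == "No")    # b
fp <- sum(actual == "No"  & predicted == "Yes")   # c
tn <- sum(actual == "No"  & predicted == "No")    # d

accuracy  <- (tp + tn) / (tp + tn + fp + fn)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * recall * precision / (recall + precision)

c(accuracy = accuracy, precision = precision, recall = recall, f1 = f1)
```

With these labels the counts are a = 3, b = 1, c = 1, d = 5, so accuracy is 0.8 and precision, recall and F1 are all 0.75. The same F1 value follows from 2a / (2a + b + c) = 6/8.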

WILL MY MODEL BETRAY ME?


Is My Model Really Good?

• My model shows an accuracy of 90% in the training environment

• Would the model be 90% accurate in a production environment?

Generalization

• A machine learning model should be able to handle any data set coming from the same distribution as the training set.

• Generalization refers to a model's ability to handle any random variations of training data


Overfitting (lack of generalization)

• The gravest and most common sin of machine learning

• Overfitting: learning so much from your data that you memorize it.
  • You do well on training data
  • But don’t do well (or even fail miserably) on test data

Train/Test Partition is Not Enough

[Figure: Labelled Data split into Training Data (70%) and Blind Holdout Data (30%)]

Blind Holdout Dataset

• The person building the model has no access to the blind holdout data set
  • Why do we need to lock it away?

• Even in the presence of a 70/30 split, you may end up with a model that is not generalized

Perils of Overfitting


Bias/Variance Tradeoff

You can beat your data to confession.


The generation of random numbers is too important to be left to chance.


Bias/Variance Trade-off

Bullseye is the theoretical best performance (accuracy, precision, recall or something else)

Each dartboard represents a model


Bias/Variance Trade-off

• Test your model on several variations of the dataset

• Each dot represents a random variation of the test data set


Bias/Variance Trade-off


METHODS OF EVALUATION


Cross Validation

•Split data into k disjoint partitions

•Train on k-1 partitions and test on 1

•Repeat k times


Cross Validation (k=10)

[Figure: ten rows, each showing partitions 1 to 10. In every row a different partition serves as the Test Set and the remaining nine form the Training Set.]
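The k-fold procedure can be sketched in base R. Everything below is illustrative: the data are synthetic, the seed is arbitrary, and a simple threshold rule stands in for a real learning algorithm such as a decision tree.

```r
set.seed(42)                                   # arbitrary seed for reproducibility
n <- 100
x <- c(rnorm(n / 2, mean = 0), rnorm(n / 2, mean = 2))   # synthetic feature
y <- rep(c(0, 1), each = n / 2)                           # synthetic class labels

k <- 10
fold <- sample(rep(1:k, length.out = n))       # assign each record to one of k disjoint folds

accuracy <- numeric(k)
for (i in 1:k) {
  train <- fold != i                           # k - 1 partitions for training
  test  <- fold == i                           # 1 partition for testing
  # Stand-in "model": cutoff halfway between the class means of the training data
  cutoff <- (mean(x[train & y == 0]) + mean(x[train & y == 1])) / 2
  pred <- as.numeric(x[test] > cutoff)
  accuracy[i] <- mean(pred == y[test])
}

accuracy          # one accuracy estimate per fold
mean(accuracy)    # cross-validated estimate
```

The spread of the per-fold accuracies is as informative as their mean: a large spread suggests the model is sensitive to which records it was trained on.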

Adjusting Learning Parameters

           max depth = 10   max depth = 7   max depth = 2
A1         100%             80%             55%
A2         60%              78%             55%
A3         90%              79%             55%
A4         70%              77%             55%
A5         80%              81%             55%
average    80%              79%             55%

Holdout Set

•70% for training, 30% for testing

•60/40 or 50/50 also possible

•Repeated holdout: Apply 70/30, 60/40 or 50/50 many times.


Stratified Sampling

• Use when class distribution is skewed

• Ensures that all partitions have a fixed ratio of classes
  • Same ratio as training set
  • If training set is 5% class 1 and 95% class 2, so is each partition
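One way to build stratified partitions in base R is to assign fold numbers within each class separately. The sketch below uses made-up labels with the 5% / 95% ratio from the slide; the number of folds and the variable names are assumptions.

```r
set.seed(1)                                            # arbitrary seed
labels <- c(rep("class1", 50), rep("class2", 950))     # 5% class 1, 95% class 2

k <- 5
fold <- integer(length(labels))
for (cls in unique(labels)) {
  idx <- which(labels == cls)
  # Shuffle fold numbers within this class so each fold gets an equal share of it
  fold[idx] <- sample(rep(1:k, length.out = length(idx)))
}

counts <- table(fold, labels)
prop.table(counts, margin = 1)    # class ratio inside each fold
```

Every fold ends up with 10 class 1 records and 190 class 2 records, so each partition keeps the 5% / 95% ratio of the full data set.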

Using ROC for Model Comparison

• No model consistently outperforms the other
  • Purple is better at low thresholds
  • Red is better at high thresholds

• Area Under ROC Curve (AUC)
  • Compares models directly

[Figure: ROC curves for the two models, labeled AUC=0.865 and AUC=0.859]
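AUC can be computed without drawing the curve. The sketch below uses the rank-based (Mann-Whitney) formulation, which equals the area under the ROC curve, rather than integrating the curve itself. The scores and labels are made up, and the names `auc`, `model_a` and `model_b` are assumptions.

```r
# AUC as the probability that a random positive is scored above a random negative
auc <- function(scores, labels) {
  r <- rank(scores)                  # average ranks, so ties count as half
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

labels  <- c(1, 1, 1, 0, 1, 0, 0, 0)                       # made-up true classes
model_a <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2)      # made-up scores, model A
model_b <- c(0.9, 0.5, 0.8, 0.7, 0.3, 0.6, 0.2, 0.1)       # made-up scores, model B

auc(model_a, labels)   # 0.9375
auc(model_b, labels)   # 0.75
```

A single AUC value makes two models directly comparable, but it hides where along the threshold range each model is stronger, which is what the curves themselves show.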

QUESTIONS
