Scikit Learn: How to Deal with Missing Values

Upload: damian-r-mingle-mba

Post on 22-Jan-2018

Category:

Data & Analytics


TRANSCRIPT

Page 1: Scikit Learn: How to Deal with Missing Values

Missing Data: A Machine Learning Approach

Page 2: Scikit Learn: How to Deal with Missing Values

DAMIAN MINGLE

CHIEF DATA SCIENTIST, WPC Healthcare

@DamianMingle

Page 3: Scikit Learn: How to Deal with Missing Values

GET THE FULL STORY

bit.ly/UseSciKitNow

Page 4: Scikit Learn: How to Deal with Missing Values

What’s Imputation Anyway?

Some models do not handle missing values well, so filling the gaps with substitute values can be useful.

Missing values can be replaced by the mean, the median, or the most frequent value of the feature.
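As a rough illustration of the three replacement strategies, the sketch below fills the gaps in a small made-up array. It assumes scikit-learn 0.20 or later, where the imputer class is `SimpleImputer` in `sklearn.impute`; the array values and variable names are invented for this example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy feature matrix (invented values); np.nan marks the missing entries.
X_toy = np.array([
    [1.0, 10.0],
    [2.0, np.nan],
    [np.nan, 10.0],
    [6.0, 40.0],
])

filled = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(missing_values=np.nan, strategy=strategy)
    filled[strategy] = imputer.fit_transform(X_toy)
    # statistics_ holds the fill value chosen for each column
    print(strategy, imputer.statistics_)
```

Each strategy works column by column, so a gap in one feature is never filled from another feature.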

Page 5: Scikit Learn: How to Deal with Missing Values

Why Imputation Matters

Imputing the missing values can give better results than discarding every sample that contains a missing value.

Imputing does not always improve predictions, so compare the options with cross-validation.

In some cases, dropping rows or using marker values is more effective.
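One way to run that comparison is sketched below. It is only an illustration under stated assumptions: the data is synthetic (generated with `make_classification`, not the deck's dataset), about 10% of the entries are blanked out at random, and logistic regression is an arbitrary choice of model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic data with roughly 10% of the entries replaced by NaN.
X_full, y = make_classification(n_samples=300, n_features=8, random_state=0)
rng = np.random.RandomState(0)
X_missing = X_full.copy()
X_missing[rng.rand(*X_missing.shape) < 0.10] = np.nan

# Option 1: discard every row that contains a missing value.
complete_rows = ~np.isnan(X_missing).any(axis=1)
drop_scores = cross_val_score(
    LogisticRegression(max_iter=1000),
    X_missing[complete_rows], y[complete_rows], cv=5,
)

# Option 2: impute with the column mean inside a pipeline, so the means
# are learned from the training folds only.
impute_scores = cross_val_score(
    make_pipeline(SimpleImputer(strategy="mean"),
                  LogisticRegression(max_iter=1000)),
    X_missing, y, cv=5,
)

print("rows kept after dropping:", complete_rows.sum(), "of", len(y))
print("drop rows  :", drop_scores.mean())
print("mean impute:", impute_scores.mean())
```

Which option scores higher depends on the data and the model, which is why the comparison is worth running rather than assuming.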

Page 6: Scikit Learn: How to Deal with Missing Values

Preprocessing

Clustering

Regression

Classification

Dimensionality Reduction

Model Selection

Page 7: Scikit Learn: How to Deal with Missing Values

Let’s Look at an ML Recipe: Imputation

Page 8: Scikit Learn: How to Deal with Missing Values

The Imports

import numpy as np

import urllib.request  # Python 3; Python 2 used plain urllib

from sklearn.impute import SimpleImputer  # replaces Imputer, removed in scikit-learn 0.22

Page 9: Scikit Learn: How to Deal with Missing Values

Load Dataset with Missing Values

url = "https://goo.gl/3jvZXE"

raw_data = urllib.request.urlopen(url)  # Python 3; urllib.urlopen(url) in Python 2

dataset = np.loadtxt(raw_data, delimiter=",")

print(dataset.shape)
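Because the short link may not always be reachable, here is a small offline sketch of the same loading call. The three CSV rows are invented, and `io.StringIO` stands in for the object returned by `urlopen`.

```python
import io
import numpy as np

# Invented comma-separated rows standing in for the downloaded file.
csv_text = "6,148,72,1\n1,85,66,0\n8,183,0,1\n"
raw_data = io.StringIO(csv_text)

# np.loadtxt accepts a file-like object and returns a float array by default.
dataset = np.loadtxt(raw_data, delimiter=",")
print(dataset.shape)
print(dataset.dtype)
```

The float dtype matters later in the recipe, since only a float array can hold NaN.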

Page 10: Scikit Learn: How to Deal with Missing Values

Separate Features from Target

X = dataset[:,0:7]  # columns 0-6 (the slice end is exclusive)

y = dataset[:,8]  # column 8 is the target

Page 11: Scikit Learn: How to Deal with Missing Values

Mark Zero Values as Missing

X[X == 0] = np.nan
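A minimal sketch of what this line does, using an invented array in which 0 is the placeholder for "not recorded". Note that the mask treats every zero as missing, so it is only appropriate for features where 0 cannot be a real measurement.

```python
import numpy as np

# Invented rows; 0 is used here as the placeholder for "not recorded".
X_toy = np.array([
    [148.0, 72.0, 35.0],
    [85.0, 0.0, 29.0],
    [183.0, 64.0, 0.0],
    [0.0, 0.0, 23.0],
])

# Boolean-mask assignment; X_toy must have a float dtype because an
# integer array cannot hold NaN.
X_toy[X_toy == 0] = np.nan

# Count how many entries are now missing in each column.
missing_per_column = np.isnan(X_toy).sum(axis=0)
print(missing_per_column)
```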

Page 12: Scikit Learn: How to Deal with Missing Values

Impute Missing Values with Mean

imp = SimpleImputer(missing_values=np.nan, strategy='mean')  # SimpleImputer (sklearn.impute) replaces Imputer

imputed_X = imp.fit_transform(X)
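To make the mean strategy less of a black box, the sketch below reproduces it by hand with NumPy and checks the result against `SimpleImputer` (the class name in scikit-learn 0.20 and later). The array is invented for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_toy = np.array([
    [2.0, np.nan],
    [4.0, 30.0],
    [np.nan, 50.0],
])

# By hand: per-column mean over the observed values only, then fill the gaps.
col_means = np.nanmean(X_toy, axis=0)
manual_X = np.where(np.isnan(X_toy), col_means, X_toy)

# With scikit-learn: the same means end up in imp.statistics_.
imp = SimpleImputer(missing_values=np.nan, strategy="mean")
imputed_X = imp.fit_transform(X_toy)

print(imp.statistics_)
print(np.allclose(manual_X, imputed_X))
```

The missing entries are ignored when the mean is computed, so they do not pull the fill value toward zero.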

Page 13: Scikit Learn: How to Deal with Missing Values

Imputation Recipe

# Impute missing values with the mean

import numpy as np

import urllib.request  # Python 3; Python 2 used plain urllib

from sklearn.impute import SimpleImputer  # replaces Imputer, removed in scikit-learn 0.22

# Load dataset from UCI Machine Learning Repo

url = "https://goo.gl/3jvZXE"

raw_data = urllib.request.urlopen(url)

dataset = np.loadtxt(raw_data, delimiter=",")

print(dataset.shape)

# Separate the data into features and target

X = dataset[:,0:7]  # columns 0-6 (the slice end is exclusive)

y = dataset[:,8]

# Mark every 0 as missing, i.e. "not a number" (NaN)

X[X == 0] = np.nan

# Replace each missing value with the mean of its column

imp = SimpleImputer(missing_values=np.nan, strategy='mean')

imputed_X = imp.fit_transform(X)

Page 17: Scikit Learn: How to Deal with Missing Values

Resources

Society of Data Scientists

SciKit Learn

Also, the imputer's main methods:

fit(X[, y]): fit the imputer on X

fit_transform(X[, y]): fit to data, then transform it

transform(X): impute all missing values in X
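The difference between these methods shows up when new data arrives. In the sketch below, the fill value is learned from a training array with `fit` and then reused on a separate array with `transform`. The arrays and their names are invented, and `SimpleImputer` is assumed (scikit-learn 0.20 or later).

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[1.0], [3.0], [np.nan]])
X_test = np.array([[np.nan], [10.0]])

imp = SimpleImputer(strategy="mean")

# fit learns the training mean (2.0) without changing any data.
imp.fit(X_train)

# transform fills gaps in new data with the value learned during fit.
X_test_filled = imp.transform(X_test)

# fit_transform does both steps on the same array in one call.
X_train_filled = SimpleImputer(strategy="mean").fit_transform(X_train)

print(X_test_filled.ravel())
print(X_train_filled.ravel())
```

Fitting on the training data only keeps information from the test rows out of the fill values.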