scikit learn: how to deal with missing values

Missing Data: A Machine Learning Approach

DAMIAN MINGLE

CHIEF DATA SCIENTIST, WPC Healthcare

@DamianMingle

GET THE FULL STORY

bit.ly/UseSciKitNow

What’s Imputation Anyway?

Some models don’t handle missing values well, so filling the gaps with substitute values can prove useful.

Missing values can be replaced by the mean, the median, or the most frequent value of the feature.
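The three replacement strategies can be sketched on a toy array. This is a minimal illustration, assuming a recent scikit-learn in which the imputer is `SimpleImputer` from `sklearn.impute`; the array `X_toy` is made up for the example and is not from the deck.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy feature matrix (made up for illustration); np.nan marks a missing entry.
X_toy = np.array([
    [1.0, 10.0],
    [2.0, np.nan],
    [np.nan, 10.0],
    [6.0, 40.0],
])

# Fit one imputer per strategy and collect the filled-in matrices.
filled = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(missing_values=np.nan, strategy=strategy)
    filled[strategy] = imputer.fit_transform(X_toy)
    # statistics_ holds the per-column fill value learned during fit.
    print(strategy, imputer.statistics_)
```

The mean is pulled toward outliers such as the 40.0 in the second column, while the median and the most frequent value are not, which is one reason to try more than one strategy.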

Why Imputation Matters

Imputing the missing values can give better results than discarding the samples that contain any missing value.

Imputing does not always improve the predictions, so check the effect with cross-validation.

In some cases, dropping rows or using marker values is more effective.
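One way to run that check is to score both options with cross-validation. The sketch below is only an illustration under stated assumptions: the data is synthetic (`make_classification`), the 10% missing rate and the logistic regression model are arbitrary choices, and `SimpleImputer` assumes a recent scikit-learn.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic data (an assumption for illustration, not the deck's dataset).
X_full, y_full = make_classification(n_samples=300, n_features=8, random_state=0)

# Knock out roughly 10% of the entries at random.
rng = np.random.RandomState(0)
X_missing = X_full.copy()
X_missing[rng.rand(*X_missing.shape) < 0.1] = np.nan

# Option 1: impute inside a pipeline so the means are learned per training fold.
impute_model = make_pipeline(SimpleImputer(strategy="mean"),
                             LogisticRegression(max_iter=1000))
impute_scores = cross_val_score(impute_model, X_missing, y_full, cv=5)

# Option 2: discard every row that contains a missing value.
complete_rows = ~np.isnan(X_missing).any(axis=1)
drop_scores = cross_val_score(LogisticRegression(max_iter=1000),
                              X_missing[complete_rows], y_full[complete_rows], cv=5)

print("impute: %.3f  drop rows: %.3f" % (impute_scores.mean(), drop_scores.mean()))
```

The two scores are computed on different sets of rows, so treat the comparison as a rough guide rather than a strict test. Which option wins depends on the data and the model.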

Preprocessing

Clustering

Regression

Classification

Dimensionality Reduction

Model Selection

Let’s Look

ML Recipe: Imputation

The Imports

import numpy as np
from urllib.request import urlopen
from sklearn.impute import SimpleImputer

Load Dataset with Missing Values

url = "https://goo.gl/3jvZXE"
raw_data = urlopen(url)
dataset = np.loadtxt(raw_data, delimiter=",")
print(dataset.shape)

Separate Features from Target

# Columns 0-7 are the features (the slice end is exclusive); column 8 is the target
X = dataset[:,0:8]
y = dataset[:,8]

Mark Zero Values as Missing

X[X==0] = np.nan

Impute Missing Values with Mean

imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imputed_X = imp.fit_transform(X)
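To see what the mean strategy does without downloading anything, the same two steps can be run on a small stand-in array. This is a sketch under assumptions: `X_demo` is made up, and the imputer is `SimpleImputer` from `sklearn.impute` as found in recent scikit-learn releases.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Small stand-in for X (made up; the deck loads its data from a URL instead).
X_demo = np.array([
    [0.0, 85.0, 66.0],
    [1.0, 0.0, 64.0],
    [8.0, 183.0, 0.0],
    [5.0, 116.0, 74.0],
])

# Same steps as above: mark zeros as missing, then fill with the column mean.
X_demo[X_demo == 0] = np.nan
col_means = np.nanmean(X_demo, axis=0)  # column means computed while ignoring NaN
imp_demo = SimpleImputer(missing_values=np.nan, strategy="mean")
imputed_demo = imp_demo.fit_transform(X_demo)

# The imputer's learned statistics should match the NaN-ignoring column means.
print(np.allclose(imp_demo.statistics_, col_means))
# No missing entries should remain after imputation.
print(np.isnan(imputed_demo).any())
```

Note that marking every zero as missing also catches zeros that are real measurements, so it is worth checking which columns can legitimately be zero before applying it to a whole matrix.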

Imputation Recipe

# Impute missing values with the mean
import numpy as np
from urllib.request import urlopen
from sklearn.impute import SimpleImputer

# Load dataset from UCI Machine Learning Repo
url = "https://goo.gl/3jvZXE"
raw_data = urlopen(url)
dataset = np.loadtxt(raw_data, delimiter=",")
print(dataset.shape)

# Segregate the data by features (columns 0-7) and target (column 8)
X = dataset[:,0:8]
y = dataset[:,8]

# Mark every 0 as missing, i.e. "not a number" (NaN)
X[X==0] = np.nan

# Fill each missing value with the mean of its attribute
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imputed_X = imp.fit_transform(X)


Resources

Society of Data Scientists

SciKit Learn

fit(X[, y]): fit the imputer on X.

fit_transform(X[, y]): fit to data, then transform it.

transform(X): impute all missing values in X.
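The difference between the three methods shows up when the data is split. In this minimal sketch the tiny `X_train` and `X_test` arrays are made up, and `SimpleImputer` from `sklearn.impute` is assumed (recent scikit-learn).

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up train and test splits with one feature column.
X_train = np.array([[1.0], [3.0], [np.nan], [5.0]])
X_test = np.array([[np.nan], [10.0]])

imp = SimpleImputer(strategy="mean")

# fit learns the column mean from the training data only.
imp.fit(X_train)

# transform fills missing entries using the statistics learned in fit.
X_train_filled = imp.transform(X_train)
X_test_filled = imp.transform(X_test)

# fit_transform is shorthand for fit followed by transform on the same data.
X_train_again = SimpleImputer(strategy="mean").fit_transform(X_train)

print(X_test_filled.ravel())
```

Fitting on the training split and only transforming the test split keeps test-set information out of the fill values.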
