scikit-learn: How to Deal with Missing Values
Post on 22-Jan-2018
Missing Data: A Machine Learning Approach
DAMIAN MINGLE
CHIEF DATA SCIENTIST, WPC Healthcare
@DamianMingle
What’s Imputation Anyway?
Some models cannot handle missing values, or perform poorly with them, so
filling the gaps with substitute values can be useful.
Missing values can be replaced by the mean, median, or most frequent value of
the feature (column).
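To make the three strategies concrete, here is a minimal NumPy-only sketch that computes each fill value by hand. The column values and the helper name fill_missing are made up for illustration; this is not the library's implementation.

```python
import numpy as np

# Toy feature column with two missing entries (made-up values).
column = np.array([2.0, 4.0, 4.0, np.nan, 10.0, np.nan])
observed = column[~np.isnan(column)]  # the four recorded values: 2, 4, 4, 10

# Candidate fill values, each computed from the observed entries only.
mean_fill = observed.mean()        # (2 + 4 + 4 + 10) / 4 = 5.0
median_fill = np.median(observed)  # middle of 2, 4, 4, 10 = 4.0
values, counts = np.unique(observed, return_counts=True)
most_frequent_fill = values[counts.argmax()]  # 4.0 appears twice

def fill_missing(col, fill_value):
    """Hypothetical helper: copy of col with every NaN replaced by fill_value."""
    return np.where(np.isnan(col), fill_value, col)

mean_imputed = fill_missing(column, mean_fill)
print(mean_imputed)
```

The mean is pulled upward by the outlying 10.0, while the median and the most frequent value both stay at 4.0, which is one reason to compare strategies rather than default to the mean.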
Why Imputation Matters
Imputing the missing values can give better results than discarding the
samples containing any missing value.
Imputing does not always improve the predictions, so check its effect with
cross-validation.
In some cases, dropping rows or using marker values is more effective.
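One way to run that check is to score both options with the same cross-validation setup. The sketch below assumes a recent scikit-learn (SimpleImputer from sklearn.impute) and uses synthetic data and a logistic regression purely as stand-ins; the scores say nothing about any real dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.RandomState(0)

# Synthetic data (an assumption for illustration): 200 samples, 4 features.
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Knock out roughly 10% of the entries at random.
X_missing = X.copy()
X_missing[rng.uniform(size=X.shape) < 0.1] = np.nan

# Option 1: impute inside a pipeline so each CV fold learns its own means.
imputing_model = make_pipeline(SimpleImputer(strategy="mean"), LogisticRegression())
imputed_scores = cross_val_score(imputing_model, X_missing, y, cv=5)

# Option 2: drop every row that contains at least one NaN.
complete_rows = ~np.isnan(X_missing).any(axis=1)
dropped_scores = cross_val_score(
    LogisticRegression(), X_missing[complete_rows], y[complete_rows], cv=5
)

print("rows kept after dropping:", complete_rows.sum(), "of", len(y))
print("impute mean accuracy: %.3f" % imputed_scores.mean())
print("drop   mean accuracy: %.3f" % dropped_scores.mean())
```

Putting the imputer inside the pipeline keeps the held-out fold from influencing the fill values, so the comparison stays fair.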
Preprocessing
Clustering
Regression
Classification
Dimensionality Reduction
Model Selection
Let’s Look at an ML Recipe
Imputation
The Imports
import numpy as np
import urllib
from sklearn.preprocessing import Imputer
Load Dataset with Missing Values
url = "https://goo.gl/3jvZXE"
raw_data = urllib.urlopen(url)
dataset = np.loadtxt(raw_data, delimiter=",")
print(dataset.shape)
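The urllib.urlopen call is the Python 2 spelling. Under Python 3 the same step might look like the sketch below, where load_csv is a hypothetical helper and the in-memory sample is made-up data standing in for the remote file so the snippet runs offline.

```python
import io
import urllib.request

import numpy as np

def load_csv(source):
    """Hypothetical helper: load a comma-separated numeric table
    from a URL string or from a file-like object."""
    if isinstance(source, str):
        # urllib.request.urlopen is the Python 3 counterpart of urllib.urlopen.
        with urllib.request.urlopen(source) as response:
            text = response.read().decode("utf-8")
        source = io.StringIO(text)
    return np.loadtxt(source, delimiter=",")

# Offline stand-in for the remote file (made-up rows, not the real dataset).
sample = io.StringIO("6,148,72,0\n1,85,66,29\n8,183,64,0\n")
dataset = load_csv(sample)
print(dataset.shape)  # (3, 4)
```

Passing the URL string instead of the sample, as in load_csv(url), would fetch the remote file.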
Separate Features from Target
X = dataset[:,0:7]
y = dataset[:,8]
Mark Zero Values as Missing
X[X==0]=np.nan
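A small sketch of what this step does, using made-up numbers. Two assumptions are worth stating: the array must have a float dtype, because NaN cannot be stored in an integer array, and a 0 is treated as "not recorded" in every column, which is a modelling choice since some features can legitimately be zero.

```python
import numpy as np

# Made-up measurements; a 0 here stands for "not recorded".
X = np.array([[6, 148, 72],
              [1,   0,  0],
              [0, 183,  0]], dtype=float)  # float dtype, since NaN has no integer form

# Boolean mask selects every zero entry and overwrites it with NaN.
X[X == 0] = np.nan

# Count the missing entries per column to see how much will be imputed.
missing_per_column = np.isnan(X).sum(axis=0)
print(missing_per_column)  # [1 1 2]
```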
Impute Missing Values with Mean
imp = Imputer(missing_values='NaN', strategy='mean')
imputed_X = imp.fit_transform(X)
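Imputer was deprecated in scikit-learn 0.20 and removed in 0.22. On newer releases, the same step can be sketched with SimpleImputer from sklearn.impute, shown here on a made-up matrix rather than the dataset above.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up feature matrix with NaN marking the missing entries.
X = np.array([[1.0, 10.0],
              [np.nan, 20.0],
              [3.0, np.nan]])

# missing_values takes np.nan here rather than the string 'NaN'.
imp = SimpleImputer(missing_values=np.nan, strategy="mean")
imputed_X = imp.fit_transform(X)

print(imp.statistics_)  # per-column means learned during fit: [ 2. 15.]
print(imputed_X)
```

Other accepted strategy values include "median" and "most_frequent", matching the options listed earlier.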
Imputation Recipe
# Impute missing values with the mean
import numpy as np
import urllib
from sklearn.preprocessing import Imputer
# Load dataset from UCI Machine Learning Repo
url = "https://goo.gl/3jvZXE"
raw_data = urllib.urlopen(url)
dataset = np.loadtxt(raw_data, delimiter=",")
print(dataset.shape)
# Segregate the data by features and target
X = dataset[:,0:7]
y = dataset[:,8]
# All values of 0 become "not a number" (NaN)
X[X==0]=np.nan
# Replace each NaN with the mean of its attribute (column)
imp = Imputer(missing_values='NaN', strategy='mean')
imputed_X = imp.fit_transform(X)
Resources
Society of Data Scientists
SciKit Learn
Also, the imputer's methods:
fit(X[, y]): Fit the imputer on X
fit_transform(X[, y]): Fit to data, then transform it
transform(X): Impute all missing values in X
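The split between fit and transform matters when there is held-out data: the fill values are learned once and then reused. A minimal sketch, assuming a recent scikit-learn with SimpleImputer and made-up train/test arrays:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up single-feature train and test sets.
X_train = np.array([[1.0], [3.0], [np.nan]])
X_test = np.array([[np.nan], [10.0]])

imp = SimpleImputer(strategy="mean")
imp.fit(X_train)                         # learns the training mean: 2.0
X_train_filled = imp.transform(X_train)  # training NaN becomes 2.0
X_test_filled = imp.transform(X_test)    # test NaN also gets 2.0, not 10.0

# fit_transform is shorthand for fit followed by transform on the same data.
same_result = SimpleImputer(strategy="mean").fit_transform(X_train)

print(X_test_filled)
```

Calling fit_transform on the test set instead would recompute the mean from test data, which leaks information from the data the model is evaluated on.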