Applied Machine Learning
Lecture 3: Pre-processing steps and hyperparameter tuning
Selpi ([email protected])
These slides are a further development of Richard Johansson's slides
January 28, 2020
Overview
Converting features to numerical vectors
Dealing with missing values
Feature Selection
Hyperparameter Tuning
Imbalanced Classes
Review and Closing
the first step: mapping features to numerical vectors
- scikit-learn's learning methods work with features as numbers, not strings
- they can't directly use the feature dicts we have stored in X
- converting from strings to numbers is the purpose of these lines:

vec = DictVectorizer()
Xe = vec.fit_transform(X)
types of vectorizers
- a DictVectorizer converts from attribute-value dicts
- a CountVectorizer converts from texts (after splitting into words) or lists
- a TfidfVectorizer is like a CountVectorizer, but also uses TF*IDF (downweighting common words)
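A minimal sketch of the difference between the two text vectorizers (the toy corpus below is made up for illustration):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

texts = ["the cat sat on the mat", "the dog sat"]   # toy corpus

cv = CountVectorizer()
print(cv.fit_transform(texts).toarray())   # raw word counts per document
print(cv.get_feature_names())              # get_feature_names_out() in scikit-learn >= 1.0

tv = TfidfVectorizer()
print(tv.fit_transform(texts).toarray())   # common words like "the" are downweighted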
what goes on in a DictVectorizer?
- each feature corresponds to one or more columns in the output matrix
- easy case: boolean and numerical features
- for string features, we reserve one column for each possible value: one-hot encoding
- that is, we convert to booleans
code example (DictVectorizer)
from sklearn.feature_extraction import DictVectorizer

# two string features, one boolean, one numerical
X = [{'f1':'B', 'f2':'F', 'f3':False, 'f4':7},
     {'f1':'B', 'f2':'M', 'f3':True, 'f4':2},
     {'f1':'O', 'f2':'F', 'f3':False, 'f4':9}]

vec = DictVectorizer()
Xe = vec.fit_transform(X)          # returns a sparse matrix
print(Xe.toarray())
print(vec.get_feature_names())     # get_feature_names_out() in scikit-learn >= 1.0
the result:
[[ 1. 0. 1. 0. 0. 7.]
[ 1. 0. 0. 1. 1. 2.]
[ 0. 1. 1. 0. 0. 9.]]
['f1=B', 'f1=O', 'f2=F', 'f2=M', 'f3', 'f4']
Overview
Converting features to numerical vectors
Dealing with missing values
Feature Selection
Hyperparameter Tuning
Imbalanced Classes
Review and Closing
Dealing with missing values
- Remove rows/columns that contain missing values (the easiest, but ...)
- Replace the missing values with some values; this process is called imputation (more strategic)
Imputation of missing values
- Derive values from the same feature (univariate feature imputation), e.g., mean, median, most frequent
- Derive values from several features (multivariate feature imputation)
- See the scikit-learn user guide on imputation of missing values
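A minimal sketch of univariate imputation with scikit-learn's SimpleImputer (the toy matrix is made up):

import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])     # toy data with missing entries

imp = SimpleImputer(strategy='mean')   # replace each NaN with its column mean
print(imp.fit_transform(X))
# [[1.  2. ]
#  [4.  3. ]
#  [7.  2.5]]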
Overview
Converting features to numerical vectors
Dealing with missing values
Feature Selection
Hyperparameter Tuning
Imbalanced Classes
Review and Closing
Challenges with a large number of features
- Imagine working with two data sets, one with 500 features, the other with 10 features. Assume the number of samples is moderate and the same in both data sets.
- What are the potential challenges with more features?
- Finding key information/feature(s) becomes more difficult
- Higher complexity ⇒ bad for generalization
- Training takes longer
- To solve these issues: try a feature selection method
Selecting Features: Brute Force
for every possible set of features S:
    train and evaluate the model based on S
return the set S_max that gave the best result

With n features there are 2^n possible subsets, so brute force quickly becomes infeasible; hence the methods below.
Feature Selection: Overall Ideas
- filter methods: rank features by a score (e.g., association with the output), independently of the learning algorithm
- wrapper methods: search over feature subsets, training and evaluating the model on each candidate subset
- embedded methods: the feature selection is built into the training of the model itself
[Source: Wikipedia]
Association between a feature and the output (idea)
Feature Ranking Example (document categories)
[Source: Manning & Schütze, Introduction to Information Retrieval]
Filter-based Feature Selection in scikit-learn
http://scikit-learn.org/stable/modules/feature_selection.html
- selectors:
  - SelectKBest
  - SelectPercentile
- feature scoring functions:
  - f_classif
  - mutual_info_classif
  - chi2
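A minimal sketch of filter-based selection (toy data from load_iris; any scorer from the list above would work):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)
selector = SelectKBest(chi2, k=2)      # keep the 2 features with the highest chi2 score
X_new = selector.fit_transform(X, y)
print(X.shape, '->', X_new.shape)      # (150, 4) -> (150, 2)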
Example of a wrapper method: greedy forward selection
S = empty set
repeat:
    find the feature F that gives the largest improvement when added to S
    if there was an improvement, add F to S
until there was no improvement
- the MLXtend library has an implementation; see SequentialFeatureSelector (a sketch follows below)
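A minimal sketch of how MLXtend's SequentialFeatureSelector is typically used (toy data and parameter values chosen for illustration; check the MLXtend docs for details):

from mlxtend.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
sfs = SequentialFeatureSelector(KNeighborsClassifier(),
                                k_features=2,       # stop once 2 features are selected
                                forward=True,       # greedy forward selection
                                scoring='accuracy',
                                cv=5)
sfs.fit(X, y)
print(sfs.k_feature_idx_)   # indices of the selected features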
Example of a wrapper method: backward selection
- See notebook
- Recursive Feature Elimination with Cross-Validation (RFECV)
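A minimal sketch of RFECV (toy data; the estimator must expose coef_ or feature_importances_):

from sklearn.datasets import load_iris
from sklearn.feature_selection import RFECV
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)
selector = RFECV(LinearSVC(max_iter=10000), step=1, cv=5)  # drop 1 feature per round, pick the best set by CV
selector.fit(X, y)
print(selector.support_)    # boolean mask of the selected features
print(selector.ranking_)    # rank 1 = selected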
Overview
Converting features to numerical vectors
Dealing with missing values
Feature Selection
Hyperparameter Tuning
Imbalanced Classes
Review and Closing
Hyperparameters vs parameters
- Hyperparameters
  - often used to help estimate the model parameters
  - specified by the user
  - tuned for the problem at hand
  - in Random Forests: number of trees, number of features considered
  - in Neural Networks: learning rate, number of hidden layers
- Parameters
  - are needed by the model for making predictions
  - the values are estimated from data
  - usually not set by the user
  - in Random Forests: the learned trees (their splits and leaf values)
  - in Neural Networks: weights
How to tune the hyperparameters?
- Trial and error
In scikit-learn: http://scikit-learn.org/stable/modules/grid_search.html
- GridSearchCV
- RandomizedSearchCV
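A minimal sketch of grid search (toy estimator and parameter grid chosen for illustration):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}
search = GridSearchCV(SVC(), param_grid, cv=5)   # tries all 6 combinations with 5-fold CV
search.fit(X, y)
print(search.best_params_, search.best_score_)

RandomizedSearchCV instead samples a fixed number of settings from (possibly continuous) distributions, which scales better when the grid is large.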
Overview
Converting features to numerical vectors
Dealing with missing values
Feature Selection
Hyperparameter Tuning
Imbalanced Classes
Review and Closing
"Finding a needle in a haystack" problems
- Finding rare diseases
- Finding anomalies in travel patterns
- Differentiating crash vs non-crash situations
- ...
Evaluation of imbalanced cases

def has_disease(patient):
    return False

- a DummyClassifier will have a high accuracy
- better to use precision and recall (more later)
- or sensitivity/specificity, or FPR/FNR, ...
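A minimal sketch of the accuracy trap (made-up data with 1% positives):

import numpy as np
from sklearn.dummy import DummyClassifier

y = np.array([1]*10 + [0]*990)   # toy labels: 1% positives
X = np.zeros((1000, 1))          # the features don't matter to the dummy

clf = DummyClassifier(strategy='most_frequent')   # always predicts the majority class
clf.fit(X, y)
print(clf.score(X, y))           # accuracy 0.99, yet it never detects a positive case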
Dealing with class imbalance
- Change parameter(s) of the learning algorithm
- Change the data
- Use appropriate performance metric(s)
Change the parameter(s) of the learning algorithm
- Example in scikit-learn: LinearSVC
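The slide does not show which parameter is meant; presumably the class_weight option, which most scikit-learn classifiers (including LinearSVC) accept. A minimal sketch on made-up data:

import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
X = rng.randn(1000, 5)
y = (rng.rand(1000) < 0.05).astype(int)   # toy labels: ~5% positives

# class_weight='balanced' makes errors on the rare class cost proportionally more
clf = LinearSVC(class_weight='balanced', max_iter=10000)
clf.fit(X, y)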
Changing the data
- Resample the existing data, e.g., undersample the majority class or oversample the minority class [figure omitted]
- Or generate synthetic data using the existing data
The imbalanced-learn library
- https://imbalanced-learn.readthedocs.io/
- See for example: BalancedRandomForestClassifier
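A minimal sketch using imbalanced-learn (toy imbalanced data generated with make_classification):

from imblearn.ensemble import BalancedRandomForestClassifier
from sklearn.datasets import make_classification

# toy data: roughly 95% negatives, 5% positives
X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)

clf = BalancedRandomForestClassifier(random_state=0)  # undersamples the majority class in each bootstrap
clf.fit(X, y)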
Overview
Converting features to numerical vectors
Dealing with missing values
Feature Selection
Hyperparameter Tuning
Imbalanced Classes
Review and Closing
Review of pre-processing and hyperparameter tuning
- Explain different ways of selecting features
- Explain pros/cons of GridSearchCV and RandomizedSearchCV
- Explain different ways of dealing with imbalanced classes
Next lecture
- Linear classifiers and regression models
- (Methods for) data collection and bias
- Annotating data