Applied Machine Learning
Lecture 3: Pre-processing steps and hyperparameter tuning
Selpi ([email protected])
These slides are a further development of Richard Johansson's slides
January 28, 2020
Overview
Converting features to numerical vectors
Dealing with missing values
Feature Selection
Hyperparameter Tuning
Imbalanced Classes
Review and Closing
the first step: mapping features to numerical vectors
- scikit-learn's learning methods work with features as numbers, not strings
- they can't directly use the feature dicts we have stored in X
- converting from strings to numbers is the purpose of these lines:

vec = DictVectorizer()
Xe = vec.fit_transform(X)
types of vectorizers
- a DictVectorizer converts from attribute-value dicts
- a CountVectorizer converts from texts (after splitting into words) or lists
- a TfidfVectorizer is like a CountVectorizer, but also uses TF*IDF (downweighting common words)
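A minimal sketch of the difference between the two text vectorizers (the toy corpus below is made up for illustration):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

texts = ["the cat sat on the mat", "the dog sat"]   # toy corpus

cv = CountVectorizer()
print(cv.fit_transform(texts).toarray())   # raw word counts per document
print(cv.get_feature_names())              # get_feature_names_out() in scikit-learn >= 1.0

tv = TfidfVectorizer()
print(tv.fit_transform(texts).toarray())   # common words like "the" are downweighted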
what goes on in a DictVectorizer?
- each feature corresponds to one or more columns in the output matrix
- easy case: boolean and numerical features
- for string features, we reserve one column for each possible value: one-hot encoding
- that is, we convert to booleans
code example (DictVectorizer)
from sklearn.feature_extraction import DictVectorizer

# two string features, one boolean, one numerical
X = [{'f1':'B', 'f2':'F', 'f3':False, 'f4':7},
     {'f1':'B', 'f2':'M', 'f3':True, 'f4':2},
     {'f1':'O', 'f2':'F', 'f3':False, 'f4':9}]

vec = DictVectorizer()
Xe = vec.fit_transform(X)          # returns a sparse matrix
print(Xe.toarray())
print(vec.get_feature_names())     # get_feature_names_out() in scikit-learn >= 1.0
the result:
[[ 1. 0. 1. 0. 0. 7.]
[ 1. 0. 0. 1. 1. 2.]
[ 0. 1. 1. 0. 0. 9.]]
['f1=B', 'f1=O', 'f2=F', 'f2=M', 'f3', 'f4']
Overview
Converting features to numerical vectors
Dealing with missing values
Feature Selection
Hyperparameter Tuning
Imbalanced Classes
Review and Closing
Dealing with missing values
- Remove rows/columns that contain missing values (the easiest, but ...)
- Replace the missing values with some values; this process is called imputation (more strategic)
Imputation of missing values
- Derive values from the same feature (univariate feature imputation), e.g., mean, median, most frequent
- Derive values from several features (multivariate feature imputation)
- See the scikit-learn user guide on imputation of missing values
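A minimal sketch of univariate imputation with scikit-learn's SimpleImputer (the toy matrix is made up):

import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])     # toy data with missing entries

imp = SimpleImputer(strategy='mean')   # replace each NaN with its column mean
print(imp.fit_transform(X))
# [[1.  2. ]
#  [4.  3. ]
#  [7.  2.5]]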
Overview
Converting features to numerical vectors
Dealing with missing values
Feature Selection
Hyperparameter Tuning
Imbalanced Classes
Review and Closing
Challenges with a large number of features
- Imagine working with two data sets, one with 500 features, the other with 10 features. Assume the number of samples is moderate and the same in both data sets.
- What are the potential challenges with more features?
- Finding key information/feature(s) becomes more difficult
- Higher complexity ⇒ bad for generalization
- Training takes longer
- To solve these issues: try a feature selection method
Selecting Features: Brute Force
for every possible set of features S:
    train and evaluate the model based on S
return the set S_max that gave the best result

With n features there are 2^n possible subsets, so brute force quickly becomes infeasible; hence the methods below.
Feature Selection: Overall Ideas
- filter methods: rank features by a score (e.g., association with the output), independently of the learning algorithm
- wrapper methods: search over feature subsets, training and evaluating the model on each candidate subset
- embedded methods: the feature selection is built into the training of the model itself
[Source: Wikipedia]
Association between a feature and the output (idea)
Feature Ranking Example (document categories)
[Source: Manning & Schütze, Introduction to Information Retrieval]
Filter-based Feature Selection in scikit-learn
http://scikit-learn.org/stable/modules/feature_selection.html
- selectors:
  - SelectKBest
  - SelectPercentile
- feature scoring functions:
  - f_classif
  - mutual_info_classif
  - chi2
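A minimal sketch of filter-based selection (toy data from load_iris; any scorer from the list above would work):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)
selector = SelectKBest(chi2, k=2)      # keep the 2 features with the highest chi2 score
X_new = selector.fit_transform(X, y)
print(X.shape, '->', X_new.shape)      # (150, 4) -> (150, 2)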
Example of a wrapper method: greedy forward selection
S = empty set
repeat:
    find the feature F that gives the largest improvement when added to S
    if there was an improvement, add F to S
until there was no improvement
- the MLXtend library has an implementation; see SequentialFeatureSelector (a sketch follows below)
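A minimal sketch of how MLXtend's SequentialFeatureSelector is typically used (toy data and parameter values chosen for illustration; check the MLXtend docs for details):

from mlxtend.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
sfs = SequentialFeatureSelector(KNeighborsClassifier(),
                                k_features=2,       # stop once 2 features are selected
                                forward=True,       # greedy forward selection
                                scoring='accuracy',
                                cv=5)
sfs.fit(X, y)
print(sfs.k_feature_idx_)   # indices of the selected features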
Example of a wrapper method: backward selection
- See notebook
- Recursive Feature Elimination with Cross-Validation (RFECV)
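A minimal sketch of RFECV (toy data; the estimator must expose coef_ or feature_importances_):

from sklearn.datasets import load_iris
from sklearn.feature_selection import RFECV
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)
selector = RFECV(LinearSVC(max_iter=10000), step=1, cv=5)  # drop 1 feature per round, pick the best set by CV
selector.fit(X, y)
print(selector.support_)    # boolean mask of the selected features
print(selector.ranking_)    # rank 1 = selected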
Overview
Converting features to numerical vectors
Dealing with missing values
Feature Selection
Hyperparameter Tuning
Imbalanced Classes
Review and Closing
Hyperparameters vs parameters
- Hyperparameters
  - often used to help estimate the model parameters
  - specified by the user
  - tuned for the problem at hand
  - in Random Forests: number of trees, number of features considered
  - in Neural Networks: learning rate, number of hidden layers
- Parameters
  - are needed by the model for making predictions
  - the values are estimated from data
  - usually not set by the user
  - in Random Forests: the learned trees (their splits and leaf values)
  - in Neural Networks: weights
How to tune the hyperparameters?
- Trial and error
In scikit-learn: http://scikit-learn.org/stable/modules/grid_search.html
- GridSearchCV
- RandomizedSearchCV
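A minimal sketch of grid search (toy estimator and parameter grid chosen for illustration):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}
search = GridSearchCV(SVC(), param_grid, cv=5)   # tries all 6 combinations with 5-fold CV
search.fit(X, y)
print(search.best_params_, search.best_score_)

RandomizedSearchCV instead samples a fixed number of settings from (possibly continuous) distributions, which scales better when the grid is large.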
Overview
Converting features to numerical vectors
Dealing with missing values
Feature Selection
Hyperparameter Tuning
Imbalanced Classes
Review and Closing
"Finding a needle in a haystack" problems
- Finding rare diseases
- Finding anomalies in travel patterns
- Differentiating crash vs non-crash situations
- ...
Evaluation of imbalanced cases

def has_disease(patient):
    return False

- a DummyClassifier will have a high accuracy
- better to use precision and recall (more later)
- or sensitivity/specificity, or FPR/FNR, ...
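A minimal sketch of the accuracy trap (made-up data with 1% positives):

import numpy as np
from sklearn.dummy import DummyClassifier

y = np.array([1]*10 + [0]*990)   # toy labels: 1% positives
X = np.zeros((1000, 1))          # the features don't matter to the dummy

clf = DummyClassifier(strategy='most_frequent')   # always predicts the majority class
clf.fit(X, y)
print(clf.score(X, y))           # accuracy 0.99, yet it never detects a positive case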
Dealing with class imbalance
- Change parameter(s) of the learning algorithm
- Change the data
- Use appropriate performance metric(s)
Change the parameter(s) of the learning algorithm
- Example in scikit-learn: LinearSVC
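The slide does not show which parameter is meant; presumably the class_weight option, which most scikit-learn classifiers (including LinearSVC) accept. A minimal sketch on made-up data:

import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
X = rng.randn(1000, 5)
y = (rng.rand(1000) < 0.05).astype(int)   # toy labels: ~5% positives

# class_weight='balanced' makes errors on the rare class cost proportionally more
clf = LinearSVC(class_weight='balanced', max_iter=10000)
clf.fit(X, y)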
Changing the data
- Resample the existing data, e.g., undersample the majority class or oversample the minority class [figure omitted]
- Or generate synthetic data using the existing data
The imbalanced-learn library
- https://imbalanced-learn.readthedocs.io/
- See for example: BalancedRandomForestClassifier
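A minimal sketch using imbalanced-learn (toy imbalanced data generated with make_classification):

from imblearn.ensemble import BalancedRandomForestClassifier
from sklearn.datasets import make_classification

# toy data: roughly 95% negatives, 5% positives
X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)

clf = BalancedRandomForestClassifier(random_state=0)  # undersamples the majority class in each bootstrap
clf.fit(X, y)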
Overview
Converting features to numerical vectors
Dealing with missing values
Feature Selection
Hyperparameter Tuning
Imbalanced Classes
Review and Closing
Review of pre-processing and hyperparameter tuning
- Explain different ways of selecting features
- Explain pros/cons of GridSearchCV and RandomizedSearchCV
- Explain different ways of dealing with imbalanced classes
Next lecture
- Linear classifiers and regression models
- (Methods for) data collection and bias
- Annotating data