MLHEP Lectures - day 3, basic track
Machine Learning in High Energy Physics
Lectures 5 & 6
Alex Rogozhnikov
Lund, MLHEP 2016

Linear models: linear regression

$$d(x) = \langle w, x \rangle + w_0$$

Minimizing MSE:

$$\mathcal{L} = \frac{1}{N} \sum_i L_{\text{mse}}(x_i, y_i) \to \min$$

$$L_{\text{mse}}(x_i, y_i) = (d(x_i) - y_i)^2$$

Linear models: logistic regression

$$d(x) = \langle w, x \rangle + w_0$$

Minimizing logistic loss:

$$\mathcal{L} = \sum_i L_{\text{logistic}}(x_i, y_i) \to \min$$

Penalty for a single observation with $y_i = \pm 1$:

$$L_{\text{logistic}}(x_i, y_i) = \ln\left(1 + e^{-y_i d(x_i)}\right)$$

Linear models: support vector machine (SVM)

$$L_{\text{hinge}}(x_i, y_i) = \max(0,\ 1 - y_i d(x_i))$$

Margin $y_i d(x_i) > 1$ → no penalty.

Kernel trick

We can project data into a higher-dimensional space, e.g. by adding new features.

Hopefully, in the new space the distributions are separable.

Kernel trick

$P$ is a projection operator:

$$w = \sum_i \alpha_i P(x_i)$$

$$d(x) = \langle w, P(x) \rangle_{\text{new}}$$

We need only the kernel:

$$d(x) = \sum_i \alpha_i K(x_i, x)$$

$$K(x, \tilde{x}) = \langle P(x), P(\tilde{x}) \rangle_{\text{new}}$$

Popular choices: polynomial kernel and RBF kernel.
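
As a small illustration of why only the kernel is needed, the sketch below compares an explicit projection with a kernel evaluation. The specific choice of projection (all pairwise products of features) and the names `poly_features` and `poly_kernel` are assumptions made for this example, not taken from the slides.

```python
import numpy as np

def poly_features(x):
    """Explicit projection P(x): all pairwise products x_a * x_b."""
    return np.outer(x, x).ravel()

def poly_kernel(x, x_tilde):
    """Kernel K(x, x~) = <x, x~>^2, computed without building P(x)."""
    return np.dot(x, x_tilde) ** 2

rng = np.random.default_rng(0)
x, x_tilde = rng.normal(size=3), rng.normal(size=3)

explicit = np.dot(poly_features(x), poly_features(x_tilde))  # <P(x), P(x~)>_new
implicit = poly_kernel(x, x_tilde)                            # K(x, x~)
print(explicit, implicit)  # the two numbers agree up to round-off
```

The explicit route works in a 9-dimensional space here, while the kernel only needs a dot product in the original 3 dimensions.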

Regularizations

$$\mathcal{L} = \frac{1}{N} \sum_i L(x_i, y_i) + \mathcal{L}_{\text{reg}} \to \min$$

$L_2$ regularization:

$$\mathcal{L}_{\text{reg}} = \alpha \sum_j |w_j|^2$$

$L_1$ regularization:

$$\mathcal{L}_{\text{reg}} = \beta \sum_j |w_j|$$

$L_1 + L_2$:

$$\mathcal{L}_{\text{reg}} = \alpha \sum_j |w_j|^2 + \beta \sum_j |w_j|$$

Stochastic optimization methods

Stochastic gradient descent: take $i$, a random event from the training data, and update

$$w \leftarrow w - \eta \frac{\partial L(x_i, y_i)}{\partial w}$$

(can be applied to additive loss functions)
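
The update rule can be sketched for the logistic loss from the earlier slide. This is a minimal illustration under assumed settings: the toy data, the step size `eta`, the number of steps, and the function name `sgd_logistic` are all choices made for this example.

```python
import numpy as np

def sgd_logistic(X, y, eta=0.1, n_steps=5000, seed=0):
    """SGD for the logistic loss ln(1 + exp(-y * d(x))) with labels y = +-1."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    w0 = 0.0
    for _ in range(n_steps):
        i = rng.integers(len(X))               # random event from the training data
        margin = y[i] * (X[i] @ w + w0)
        coef = -y[i] / (1.0 + np.exp(margin))  # dL/dd for the selected event
        w -= eta * coef * X[i]                 # w <- w - eta * dL/dw
        w0 -= eta * coef
    return w, w0

# Toy data (illustrative only): two Gaussian blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, size=(200, 2)), rng.normal(2, 1, size=(200, 2))])
y = np.array([-1] * 200 + [1] * 200)

w, w0 = sgd_logistic(X, y)
accuracy = np.mean(np.sign(X @ w + w0) == y)
print(accuracy)
```

Each step touches a single event, which is why the method only applies when the loss is a sum over events.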

Decision trees

- building an optimal tree is NP-hard
- heuristic: use greedy optimization
- optimization criteria (impurities): misclassification, Gini, entropy
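
The three impurity criteria can be written down for a vector of class proportions `p` in a node. The sketch below uses the standard textbook definitions; the function names are illustrative, and the base-2 logarithm for entropy is an assumption (any base gives the same ordering of splits).

```python
import numpy as np

def misclassification(p):
    """Share of events that do not belong to the majority class."""
    return 1.0 - np.max(p)

def gini(p):
    """Gini impurity: 1 - sum_c p_c^2."""
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    """Shannon entropy in bits; terms with p_c = 0 contribute nothing."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

pure = np.array([1.0, 0.0])    # node with a single class
mixed = np.array([0.5, 0.5])   # maximally mixed two-class node

for name, f in [("misclassification", misclassification), ("gini", gini), ("entropy", entropy)]:
    print(name, f(pure), f(mixed))
```

All three vanish for a pure node and reach their maximum for an evenly mixed one; a greedy split picks the cut that lowers the weighted impurity of the children most.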

Decision trees for regression

Optimizing MSE; the prediction inside a leaf is constant.

Overfitting in decision trees

- pre-stopping
- post-pruning
- unstable to changes in the training dataset

Random Forest

Many trees built independently:

- bagging of samples
- subsampling of features

Simple voting is used to get the prediction of the ensemble.

Random Forest

- overfitted (in the sense that predictions for train and test are different)
- doesn't overfit: increasing complexity (adding more trees) doesn't spoil the classifier

Random Forest

- simple and parallelizable
- doesn't require much tuning
- hardly interpretable, but feature importances can be computed
- doesn't fix samples poorly classified at previous stages

Ensembles

Averaging decision functions:

$$D(x) = \frac{1}{J} \sum_{j=1}^{J} d_j(x)$$

Weighted decision:

$$D(x) = \sum_j \alpha_j d_j(x)$$

Sample weights in ML

Can be used with many estimators. We now have triples

$$x_i,\ y_i,\ w_i, \qquad i \text{ is the index of an event}$$

- weight corresponds to frequency of observation
- expected behavior: $w_i = n$ is the same as having $n$ copies of the $i$-th event
- global normalization of weights doesn't matter

Example for logistic regression:

$$\mathcal{L} = \sum_i w_i L(x_i, y_i) \to \min$$
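
The "weight $n$ equals $n$ copies" behavior can be checked directly on the weighted logistic loss. This is a sketch with made-up toy numbers; `weighted_logistic_loss` is a hypothetical helper written for this example, and the parameter vector `w` is arbitrary.

```python
import numpy as np

def weighted_logistic_loss(w, X, y, sample_weight):
    """sum_i w_i * ln(1 + exp(-y_i * <w, x_i>)) for labels y_i = +-1."""
    margins = y * (X @ w)
    return np.sum(sample_weight * np.log1p(np.exp(-margins)))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
y = np.array([1, -1, 1, 1, -1])
w = np.array([0.3, -0.7])              # parameters of the classifier, not sample weights

counts = np.array([1, 3, 2, 1, 4])     # integer sample weights
loss_weighted = weighted_logistic_loss(w, X, y, counts)

# Same data with the i-th event physically repeated counts[i] times
X_rep, y_rep = np.repeat(X, counts, axis=0), np.repeat(y, counts)
loss_repeated = weighted_logistic_loss(w, X_rep, y_rep, np.ones(len(y_rep)))

# Rescaling all weights rescales the loss, so the minimizer is unchanged
loss_scaled = weighted_logistic_loss(w, X, y, 10 * counts)
print(loss_weighted, loss_repeated, loss_scaled)
```

The first two numbers coincide, and the third is exactly ten times larger, which is why global normalization of weights does not move the optimum.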

Weights (parameters) of a classifier ≠ sample weights

In code:

    from sklearn.tree import DecisionTreeClassifier

    tree = DecisionTreeClassifier(max_depth=4)
    tree.fit(X, y, sample_weight=weights)

Sample weights are a convenient way to regulate the importance of training events.

When talking about AdaBoost, "weights" means sample weights only.

AdaBoost [Freund, Schapire, 1995]

Bagging: information from previous trees is not taken into account.

Adaptive Boosting is a weighted composition of weak learners:

$$D(x) = \sum_j \alpha_j d_j(x)$$

We assume $d_j(x) = \pm 1$, labels $y_i = \pm 1$.

The $j$-th weak learner misclassified the $i$-th event iff $y_i d_j(x_i) = -1$.

AdaBoost

$$D(x) = \sum_j \alpha_j d_j(x)$$

Weak learners are built in sequence; each classifier is trained using different weights.

- initially $w_i = 1$ for each training sample
- after building the $j$-th base classifier:

1. compute the total weight of correctly and wrongly classified events

$$\alpha_j = \frac{1}{2} \ln\left(\frac{w_{\text{correct}}}{w_{\text{wrong}}}\right)$$

2. increase the weight of misclassified samples

$$w_i \leftarrow w_i \times e^{-\alpha_j y_i d_j(x_i)}$$
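
The two steps above can be sketched with decision stumps from scikit-learn as weak learners. This is an illustrative implementation, not the library's `AdaBoostClassifier`; the helper names, the toy dataset, the number of estimators, and the handling of a perfect learner (stop and give it weight 1) are assumptions made for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_adaboost(X, y, n_estimators=100):
    """Discrete AdaBoost over decision stumps; labels y must be +-1."""
    weights = np.ones(len(X))                         # initially w_i = 1
    learners, alphas = [], []
    for _ in range(n_estimators):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=weights)
        pred = stump.predict(X)                       # d_j(x_i) = +-1
        w_correct = weights[pred == y].sum()
        w_wrong = weights[pred != y].sum()
        if w_wrong == 0:                              # assumption: stop on a perfect learner
            learners.append(stump)
            alphas.append(1.0)
            break
        alpha = 0.5 * np.log(w_correct / w_wrong)
        weights = weights * np.exp(-alpha * y * pred) # misclassified events gain weight
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas

def decision_function(learners, alphas, X):
    """D(x) = sum_j alpha_j * d_j(x)."""
    return sum(a * l.predict(X) for a, l in zip(alphas, learners))

# Toy data (illustrative): diagonal boundary that a single stump cannot describe
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(300, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

learners, alphas = fit_adaboost(X, y)
stump_accuracy = np.mean(learners[0].predict(X) == y)
ensemble_accuracy = np.mean(np.sign(decision_function(learners, alphas, X)) == y)
print(stump_accuracy, ensemble_accuracy)
```

On this toy problem the weighted sum of many axis-aligned stumps approximates the diagonal boundary much better than any single stump.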

AdaBoost example

Decision trees of depth 1 will be used.

[Figures: AdaBoost example (1, 2, 3, 100 trees)]

AdaBoost secret

$$D(x) = \sum_j \alpha_j d_j(x)$$

$$\mathcal{L} = \sum_i L(x_i, y_i) = \sum_i \exp(-y_i D(x_i)) \to \min$$

- sample weight is equal to the penalty for the event

$$w_i = L(x_i, y_i) = \exp(-y_i D(x_i))$$

- $\alpha_j$ is obtained as a result of analytical optimization

Exercise: prove the formula for $\alpha_j$.

[Figure: loss function of AdaBoost]

AdaBoost summary

- is able to combine many weak learners
- takes mistakes into account
- simple, overhead for boosting is negligible
- too sensitive to outliers

In scikit-learn, one can run AdaBoost over other algorithms.

Gradient Boosting

[Figure: decision trees for regression]

Gradient boosting to minimize MSE

Say we're trying to build an ensemble to minimize MSE:

$$\sum_i (D(x_i) - y_i)^2 \to \min$$

The ensemble's prediction is obtained as a sum of estimators:

$$D(x) = \sum_j d_j(x)$$

$$D_j(x) = \sum_{j'=1}^{j} d_{j'}(x) = D_{j-1}(x) + d_j(x)$$

Assuming that we have already built $j - 1$ estimators, how do we train the next one?

The natural solution is to greedily minimize MSE:

$$\sum_i (D_j(x_i) - y_i)^2 = \sum_i (D_{j-1}(x_i) + d_j(x_i) - y_i)^2 \to \min$$

Introduce the residual $R_j(x_i) = y_i - D_{j-1}(x_i)$; now we simply need to minimize MSE:

$$\sum_i (d_j(x_i) - R_j(x_i))^2 \to \min$$

So the $j$-th estimator (tree) is trained using the following data: $x_i,\ R_j(x_i)$.
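
The residual-fitting loop can be sketched with scikit-learn regression trees. This is a minimal illustration: the toy target, the tree depth, and the function name `fit_gb_mse` are assumptions made for the example, and there is no learning rate or subsampling yet.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gb_mse(X, y, n_estimators=100, max_depth=2):
    """Greedy boosting for MSE: tree d_j is trained on residuals R_j = y - D_{j-1}."""
    trees, mse_history = [], []
    prediction = np.zeros(len(y))                 # D_0(x) = 0
    for _ in range(n_estimators):
        residual = y - prediction                 # R_j(x_i) = y_i - D_{j-1}(x_i)
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        prediction += tree.predict(X)             # D_j = D_{j-1} + d_j
        trees.append(tree)
        mse_history.append(np.mean((prediction - y) ** 2))
    return trees, mse_history

# Toy regression problem (illustrative only): noisy sine
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

trees, mse_history = fit_gb_mse(X, y)
print(mse_history[0], mse_history[-1])
```

Training MSE cannot increase from one iteration to the next here, because each leaf predicts the mean residual of the events it contains.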

Example: regression with GB

[Figures: regression trees of depth = 2; number of trees = 1, 2, 3, 100]

Gradient Boosting visualization

Gradient Boosting [Friedman, 1999]

Composition of weak regressors:

$$D(x) = \sum_j \alpha_j d_j(x)$$

Borrow an approach to encode probabilities from logistic regression:

$$p_{+1}(x) = \sigma(D(x)), \qquad p_{-1}(x) = \sigma(-D(x))$$

Optimization of log-likelihood ($y_i = \pm 1$):

$$\mathcal{L} = \sum_i L(x_i, y_i) = \sum_i \ln\left(1 + e^{-y_i D(x_i)}\right) \to \min$$

Gradient Boosting

$$D(x) = \sum_j \alpha_j d_j(x)$$

$$\mathcal{L} = \sum_i \ln\left(1 + e^{-y_i D(x_i)}\right) \to \min$$

- Optimization problem: find all $\alpha_j$ and weak learners $d_j$
- Mission impossible
- Main point: greedy optimization of the loss function by training one more weak learner $d_j$
- Each new estimator follows the gradient of the loss function

Gradient Boosting

Gradient boosting ~ steepest gradient descent.

$$D_j(x) = \sum_{j'=1}^{j} \alpha_{j'} d_{j'}(x)$$

$$D_j(x) = D_{j-1}(x) + \alpha_j d_j(x)$$

At the $j$-th iteration:

- compute the pseudo-residual

$$R(x_i) = -\left.\frac{\partial \mathcal{L}}{\partial D(x_i)}\right|_{D(x) = D_{j-1}(x)}$$

- train regressor $d_j$ to minimize MSE:

$$\sum_i (d_j(x_i) - R(x_i))^2 \to \min$$

- find the optimal $\alpha_j$

Important exercise: compute pseudo-residuals for MSE and logistic losses.
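
One possible way to check an answer to the exercise is to compare a hand-derived pseudo-residual with a finite-difference estimate of the negative gradient. The sketch below does this for the per-event MSE and logistic losses used in these slides; the function names and the test points are made up for the example.

```python
import numpy as np

def pseudo_residual_mse(D, y):
    """L = (D - y)^2  ->  R = -dL/dD = 2 * (y - D)."""
    return 2.0 * (y - D)

def pseudo_residual_logistic(D, y):
    """L = ln(1 + exp(-y * D)), y = +-1  ->  R = -dL/dD = y / (1 + exp(y * D))."""
    return y / (1.0 + np.exp(y * D))

def numeric_negative_gradient(loss, D, y, eps=1e-6):
    """Central finite difference of -dL/dD, used only as a sanity check."""
    return -(loss(D + eps, y) - loss(D - eps, y)) / (2 * eps)

def mse(D, y):
    return (D - y) ** 2

def logistic(D, y):
    return np.log1p(np.exp(-y * D))

D = np.array([-1.5, 0.2, 0.7, 3.0])    # current predictions D_{j-1}(x_i)
y = np.array([-1.0, 1.0, -1.0, 1.0])

print(pseudo_residual_mse(D, y), numeric_negative_gradient(mse, D, y))
print(pseudo_residual_logistic(D, y), numeric_negative_gradient(logistic, D, y))
```

For MSE the pseudo-residual is proportional to the ordinary residual, which recovers the residual-fitting procedure as a special case.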

Additional GB tricks

- to make training more stable, add a learning rate $\eta$:

$$D_j(x) = \eta \sum_{j'=1}^{j} \alpha_{j'} d_{j'}(x)$$

- randomization to fight noise and build different trees:
  - subsampling of features
  - subsampling of training samples

AdaBoost is a particular case of gradient boosting with a different target loss function*:

$$\mathcal{L}_{\text{ada}} = \sum_i e^{-y_i D(x_i)} \to \min$$

This loss function is called ExpLoss or AdaLoss.

*(also, AdaBoost expects that $d_j(x_i) = \pm 1$)

Loss functions

Gradient boosting can optimize different smooth loss functions.

- regression, $y \in \mathbb{R}$
  - Mean Squared Error: $\sum_i (d(x_i) - y_i)^2$
  - Mean Absolute Error: $\sum_i |d(x_i) - y_i|$
- binary classification, $y_i = \pm 1$
  - ExpLoss (aka AdaLoss): $\sum_i e^{-y_i d(x_i)}$
  - LogLoss: $\sum_i \log\left(1 + e^{-y_i d(x_i)}\right)$

Usage of second-order information

For an additive loss function, apart from the gradient $g_i$, we can make use of the second derivatives $h_i$.

E.g. select the leaf value using a second-order step:

$$\mathcal{L} = \sum_i L(D_j(x_i), y_i) = \sum_i L(D_{j-1}(x_i) + d_j(x_i), y_i)$$

$$\approx \sum_i \left[ L(D_{j-1}(x_i), y_i) + g_i d_j(x_i) + \frac{h_i}{2} d_j^2(x_i) \right]$$

$$= \mathcal{L}_{j-1} + \sum_{\text{leaf}} \left( g_{\text{leaf}} w_{\text{leaf}} + \frac{h_{\text{leaf}}}{2} w_{\text{leaf}}^2 \right) \to \min$$

Using second-order information

$$\mathcal{L} \approx \mathcal{L}_{j-1} + \sum_{\text{leaf}} \left( g_{\text{leaf}} w_{\text{leaf}} + \frac{h_{\text{leaf}}}{2} w_{\text{leaf}}^2 \right) \to \min$$

where $g_{\text{leaf}} = \sum_{i \in \text{leaf}} g_i$, $h_{\text{leaf}} = \sum_{i \in \text{leaf}} h_i$.

Independent optimization. Explicit solution for the optimal values in the leaves:

$$w_{\text{leaf}} = -\frac{g_{\text{leaf}}}{h_{\text{leaf}}}$$
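
As a worked example of the leaf formula, the sketch below computes $g_i$ and $h_i$ for the logistic loss and the resulting leaf value for four events that share one leaf. The logistic loss, the starting predictions, and the helper names are assumptions chosen to keep the arithmetic easy to follow.

```python
import numpy as np

def logistic_grad_hess(D, y):
    """First and second derivatives of ln(1 + exp(-y * D)) w.r.t. D, for y = +-1."""
    p = 1.0 / (1.0 + np.exp(y * D))     # sigma(-y * D)
    g = -y * p
    h = p * (1.0 - p)
    return g, h

def newton_leaf_value(g, h):
    """Minimizer of g_leaf * w + h_leaf / 2 * w^2 over w."""
    return -np.sum(g) / np.sum(h)

def quadratic_objective(w, g, h):
    return np.sum(g) * w + 0.5 * np.sum(h) * w ** 2

D = np.zeros(4)                          # D_{j-1}(x_i) for the events in one leaf
y = np.array([1.0, 1.0, 1.0, -1.0])      # three signal-like events, one background-like
g, h = logistic_grad_hess(D, y)
w_leaf = newton_leaf_value(g, h)
print(w_leaf)  # g_leaf = -1.0, h_leaf = 1.0, so w_leaf = 1.0
```

With zero starting predictions every event has $g_i = \mp 0.5$ and $h_i = 0.25$, so the leaf value is simply the net label count divided by one.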

Using second-order information: recipe

On each iteration of gradient boosting:

1. train a tree to follow the gradient (minimize MSE with the gradient)
2. change the values assigned in the leaves to:

$$w_{\text{leaf}} \leftarrow -\frac{g_{\text{leaf}}}{h_{\text{leaf}}}$$

3. update predictions (no weight for the estimator: $\alpha_j = 1$)

$$D_j(x) = D_{j-1}(x) + \eta\, d_j(x)$$

This improvement is quite cheap and allows smaller GBs to be more effective. We can also use information about hessians at the tree building step (step 1).

Multiclass classification: ensembling

- One-vs-one: $\dfrac{n_{\text{classes}} \times (n_{\text{classes}} - 1)}{2}$ classifiers
- One-vs-rest: $n_{\text{classes}}$ classifiers

scikit-learn implements those as meta-algorithms.

Multiclass classification: modifying an algorithm

Most classifiers have natural generalizations to multiclass classification.

Example for logistic regression: introduce for each class $c \in \{1, 2, \dots, C\}$ a vector $w_c$.

$$d_c(x) = \langle w_c, x \rangle$$

Converting to probabilities using the softmax function:

$$p_c(x) = \frac{e^{d_c(x)}}{\sum_{\tilde{c}} e^{d_{\tilde{c}}(x)}}$$

And minimize LogLoss:

$$\mathcal{L} = -\sum_i \log p_{y_i}(x_i)$$

Softmax function

Typical way to convert $n$ numbers to $n$ probabilities.

Mapping is surjective, but not injective ($n$ dimensions to $n - 1$ dimensions). Invariant to a global shift:

$$d_c(x) \to d_c(x) + \text{const}$$

For the case of two classes:

$$p_1(x) = \frac{e^{d_1(x)}}{e^{d_1(x)} + e^{d_2(x)}} = \frac{1}{1 + e^{d_2(x) - d_1(x)}}$$

Coincides with the logistic function for $d(x) = d_1(x) - d_2(x)$.
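
Both properties of softmax, shift invariance and the reduction to the logistic function for two classes, can be verified numerically. The sketch below uses arbitrary example values; subtracting the maximum inside `softmax` is a common numerical-stability choice that relies on exactly this shift invariance.

```python
import numpy as np

def softmax(d):
    """Convert n numbers d_c into n probabilities p_c."""
    e = np.exp(d - np.max(d))     # a global shift changes nothing but avoids overflow
    return e / e.sum()

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

d = np.array([2.0, -1.0, 0.5])
p = softmax(d)
p_shifted = softmax(d + 100.0)    # invariance to a global shift

d1, d2 = 1.3, -0.4
p1 = softmax(np.array([d1, d2]))[0]
print(p, p_shifted, p1, sigmoid(d1 - d2))
```

The two-class probability `p1` matches `sigmoid(d1 - d2)`, so binary logistic regression is the special case with a single free decision function.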

Loss function: ranking example

In ranking we need to order items by $y_i$:

$$y_i < y_j \Rightarrow d(x_i) < d(x_j)$$

We can penalize for misordering:

$$\mathcal{L} = \sum_{i, \tilde{i}} L(x_i, x_{\tilde{i}}, y_i, y_{\tilde{i}})$$

$$L(x, \tilde{x}, y, \tilde{y}) = \begin{cases} \sigma(d(x) - d(\tilde{x})), & y < \tilde{y} \\ 0, & \text{otherwise} \end{cases}$$

Adapting boosting

By modifying boosting or changing the loss function we can solve different problems:

- classification
- regression
- ranking

HEP-specific examples in Tatiana's lecture tomorrow.

Gradient Boosting classification playground

Recapitulation: AdaBoost

Minimizes

$$\mathcal{L}_{\text{ada}} = \sum_i e^{-y_i D(x_i)} \to \min$$

by increasing the weights of misclassified samples:

$$w_i \leftarrow w_i \times e^{-\alpha_j y_i d_j(x_i)}$$

Gradient Boosting overview

A powerful ensembling technique (typically used over trees, GBDT):

- a general way to optimize differentiable losses
  - can be adapted to other problems
- 'following' the gradient of the loss at each step
- making steps in the space of functions
- gradient of poorly-classified events is higher
- increasing the number of trees can lead to overfitting (= getting worse quality on new data)
- requires tuning, works better when trees are not complex
- widely used in practice

Feature engineering

Feature engineering = creating features to get the best result with ML

- important step
- mostly relying on domain knowledge
- requires some understanding
- most of practitioners' time is spent at this step

Feature engineering

- Analyze available features
  - scale and shape of features
- Analyze which information is lacking
  - challenge example: maybe subleading jets matter?
- Validate your guesses

Machine learning is a proper tool for checking your understanding of data.

Linear models example

A single event with a sufficiently large value of a feature can break almost all linear models.

Heavy-tailed distributions are harmful, pretransforming is required:

- logarithm
- power transform
- and throwing out outliers

The same tricks actually help more advanced methods too.

Which transformation is the best for Random Forest?

One-hot encoding

Categorical features (= 'not orderable'), being one-hot encoded, are easier for ML to operate with.
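
One-hot encoding can be sketched in a few lines of NumPy. The feature values (particle types) and the helper name `one_hot` are hypothetical and only serve as an illustration; libraries such as scikit-learn and pandas offer ready-made encoders.

```python
import numpy as np

def one_hot(values):
    """Return the sorted categories and a 0/1 matrix with one column per category."""
    categories, codes = np.unique(values, return_inverse=True)
    encoded = np.zeros((len(values), len(categories)), dtype=int)
    encoded[np.arange(len(values)), codes] = 1
    return categories, encoded

# Hypothetical categorical feature
particle = np.array(["muon", "electron", "pion", "muon"])
categories, encoded = one_hot(particle)
print(categories)   # ['electron' 'muon' 'pion']
print(encoded)
```

Each row contains exactly one 1, so a linear model or a tree can treat every category independently instead of relying on an arbitrary ordering.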

Decision tree example

$\eta_{\text{lepton}}$ is hard for a tree to use, since it provides no good splitting.

Don't forget that a tree can't reconstruct linear combinations; take care of this.

Example of feature: invariant mass

Using HEP coordinates, the invariant mass of two products is:

$$m_{\text{inv}}^2 \approx 2\, p_{T1}\, p_{T2} \left( \cosh(\eta_1 - \eta_2) - \cos(\phi_1 - \phi_2) \right)$$

Can't be recovered with ensembles of trees of depth < 4 when using only canonical features.

What about the invariant mass of 3 particles? (see Vicens' talk today).

Good features are ones that are explainable by physics. Start from the simplest and most natural.

Mind the cost of computing the features.
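
The invariant-mass feature is cheap to compute from the canonical features. The sketch below implements the formula as written (it is exact for massless products); the function name, the clipping of tiny negative values, and the example kinematics are assumptions made for illustration.

```python
import numpy as np

def invariant_mass(pt1, eta1, phi1, pt2, eta2, phi2):
    """Invariant mass of two (approximately massless) products from pT, eta, phi."""
    m2 = 2.0 * pt1 * pt2 * (np.cosh(eta1 - eta2) - np.cos(phi1 - phi2))
    return np.sqrt(np.maximum(m2, 0.0))   # guard against tiny negative round-off

# Back-to-back products with equal pT and eta: m = 2 * pT
m_back_to_back = invariant_mass(30.0, 0.5, 0.0, 30.0, 0.5, np.pi)
# Collinear products: m = 0
m_collinear = invariant_mass(30.0, 0.5, 1.0, 45.0, 0.5, 1.0)
print(m_back_to_back, m_collinear)
```

The function also works element-wise on NumPy arrays, so it can be applied to whole columns of a dataset at once.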

Output engineering

Typically not discussed, but the target of learning plays an important role.

Example: predicting the number of upvotes for a comment. Assuming the error is MSE:

$$\mathcal{L} = \sum_i (d(x_i) - y_i)^2 \to \min$$

Consider two cases:

- 100 comments when predicted 0
- 1200 comments when predicted 1000

Output engineering

Ridiculously large impact of highly-commented articles. We need to predict the order, not the exact number of comments.

Possible solutions:

- alternate loss function, e.g. use MAPE
- apply a logarithm to the target, predict $\log(\#\,\text{comments})$

Evaluation score should be changed accordingly.
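
The two cases from the example can be put into numbers. The sketch below compares the squared error on the raw target with the squared error after a logarithmic transform; using `log1p` (that is, $\log(1 + \#\,\text{comments})$) instead of a plain logarithm is an assumption made here so that a count of zero stays finite.

```python
import numpy as np

y_true = np.array([100.0, 1200.0])   # actual number of comments
y_pred = np.array([0.0, 1000.0])     # predictions from the two cases above

squared_error = (y_pred - y_true) ** 2
squared_log_error = (np.log1p(y_pred) - np.log1p(y_true)) ** 2

print(squared_error)      # [10000. 40000.]
print(squared_log_error)
```

On the raw scale the second case is penalized four times more, although missing 100 comments entirely is arguably the worse prediction; on the log scale the ordering of the penalties is reversed.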

Sample weights

Typically used to estimate the contribution of an event (how often we expect this to happen).

Sample weights also matter in some other situations:

- highly imbalanced datasets (e.g. 99 % of events in class 0) tend to have problems during optimization
- changing sample weights to balance the dataset frequently helps

Feature selection

Why?

- speed up training / prediction
- reduce time of data preparation in the pipeline
- help the algorithm to 'focus' on finding reliable dependencies
  - useful when the amount of training data is limited

Problem:

- find a subset of features which provides the best quality

Feature selection

Exhaustive search: $2^d$ cross-validations.

- incredible amount of resources
- too many cross-validation cycles lead to overly optimistic quality on the test data
- basic nice solution: estimate importance with RF / GBDT

Filtering methods

Eliminate variables which seem not to carry statistical information about the target, e.g. by measuring Pearson correlation or mutual information.

Example: all angles $\phi$ will be thrown out.

Feature selection: embedded methods

Feature selection is a part of training. Example: $L_1$-regularized linear models.

Forward selection

Start from an empty set of features. For each feature in the dataset, check if adding this feature improves the quality.

Backward elimination

Almost the same, but this time we iteratively eliminate features.

A bidirectional combination of the above is possible; some algorithms can use a previously trained model as the new starting point for optimization.

Unsupervised dimensionality reduction

Principal component analysis [Pearson, 1901]

PCA finds the axes along which variance is maximal.

PCA description

PCA is based on the principal axis theorem:

$$Q = U \Lambda U^T$$

$Q$ is the covariance matrix of the dataset, $U$ is an orthogonal matrix, $\Lambda$ is a diagonal matrix.

$$\Lambda = \operatorname{diag}(\lambda_1, \lambda_2, \dots, \lambda_n), \qquad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n$$
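
The decomposition can be reproduced with NumPy's symmetric eigensolver. This is a minimal sketch on synthetic data; the helper name `pca`, the convention that eigenvectors are the columns of `U`, and the toy mixing matrix are assumptions made for the example.

```python
import numpy as np

def pca(X):
    """Eigendecomposition of the covariance matrix, Q = U diag(lambda) U^T."""
    Q = np.cov(X, rowvar=False)
    eigenvalues, U = np.linalg.eigh(Q)        # eigh returns eigenvalues in ascending order
    order = np.argsort(eigenvalues)[::-1]     # reorder so lambda_1 >= ... >= lambda_n
    return Q, eigenvalues[order], U[:, order]

rng = np.random.default_rng(0)
# Toy correlated data: most of the variance lies in a two-dimensional subspace
mixing = np.array([[3.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 0.2]])
X = rng.normal(size=(500, 3)) @ mixing

Q, lam, U = pca(X)
X_projected = (X - X.mean(axis=0)) @ U[:, :2]   # keep the two leading components
print(lam)
```

The variance of each projected coordinate equals the corresponding eigenvalue, so dropping the trailing components discards the directions with the least variance.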

[Figure: PCA optimization visualized]

PCA: eigenfaces

$$\text{Emotion} = \alpha\,[\text{scared}] + \beta\,[\text{laughs}] + \gamma\,[\text{angry}] + \dots$$

Locally linear embedding

Handles the case of non-linear dimensionality reduction.

Express each sample as a convex combination of its neighbours:

$$\sum_i \Big\| x_i - \sum_{\tilde{i}} w_{i\tilde{i}}\, x_{\tilde{i}} \Big\| \to \min_w$$

subject to constraints: $\sum_{\tilde{i}} w_{i\tilde{i}} = 1$, $w_{i\tilde{i}} \ge 0$, and $w_{i\tilde{i}} = 0$ if $i, \tilde{i}$ are not neighbors.

Finding an optimal mapping for all points simultaneously ($y_i$ are images, i.e. positions in the new space):

$$\sum_i \Big\| y_i - \sum_{\tilde{i}} w_{i\tilde{i}}\, y_{\tilde{i}} \Big\| \to \min_y$$

[Figure: PCA and LLE]

Isomap

Isomap aims to preserve the geodesic distance on the manifold between two points.

Supervised dimensionality reduction

Fisher's LDA (Linear Discriminant Analysis) [1936]

Original idea: find a projection that discriminates the classes best.

Fisher's LDA

Mean and variance within a single class $c$ ($c \in \{1, 2, \dots, C\}$):

$$\mu_c = \langle x \rangle_{\text{events of class } c}$$

$$\sigma_c = \langle \|x - \mu_c\|^2 \rangle_{\text{events of class } c}$$

Total within-class variance:

$$\sigma_{\text{within}} = \sum_c p_c \sigma_c$$

Total between-class variance:

$$\sigma_{\text{between}} = \sum_c p_c \|\mu_c - \mu\|^2$$

Goal: find a projection to maximize the ratio

$$\frac{\sigma_{\text{between}}}{\sigma_{\text{within}}}$$

LDA: solving optimization problem

We are interested in finding a 1-dimensional projection $w$:

$$\frac{w^T \Sigma_{\text{between}}\, w}{w^T \Sigma_{\text{within}}\, w} \to \max_w$$

- naturally connected to the generalized eigenvalue problem
- the projection vector corresponds to the highest generalized eigenvalue
- finds a subspace of $C - 1$ components when applied to a classification problem with $C$ classes

Fisher's LDA is a basic popular binary classification technique.

Common spatial patterns

When we expect that each class is close to some linear subspace, we can

- (naively) find this subspace for each class by PCA
- (better idea) take into account the variation of other data and optimize

$$\operatorname{tr} W^T \Sigma_{\text{class}} W \to \max \quad \text{subject to } W^T \Sigma_{\text{total}} W = I$$

The natural generalization is to take several components: $W \in \mathbb{R}^{n \times n_1}$ is a projection matrix; $n$ and $n_1$ are the numbers of dimensions in the original and new spaces.

Frequently used in neural sciences, in particular in BCI based on EEG / MEG.

[Figure: the patterns found describe the projection into a 6-dimensional space]

Dimensionality reduction summary

- is capable of extracting sensible features from highly-dimensional data
- frequently used to 'visualize' the data
- nonlinear methods rely on the distance in the space
- works well with highly-dimensional spaces with features of the same nature
![Page 89: MLHEP Lectures - day 3, basic track](https://reader031.vdocument.in/reader031/viewer/2022020301/58aba9f51a28abdf3c8b5f13/html5/thumbnails/89.jpg)
Finding optimal hyperparameterssome algorithms have many parameters (regularizations, depth, learning rate,...)not all the parameters are guessedchecking all combinations takes too long
88 / 101
![Page 90: MLHEP Lectures - day 3, basic track](https://reader031.vdocument.in/reader031/viewer/2022020301/58aba9f51a28abdf3c8b5f13/html5/thumbnails/90.jpg)
Finding optimal hyperparameters
- some algorithms have many parameters (regularizations, depth, learning rate, ...)
- not all the parameters can be guessed
- checking all combinations takes too long

We need automated hyperparameter optimization!
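To get a feeling for why checking all combinations takes too long, the following sketch counts the points of a small grid and draws a random subset instead. The parameter names and values are hypothetical and do not come from the lecture; they only stand in for a typical boosting-like model.

```python
import itertools
import math
import random

# Hypothetical grid: parameter names and values are illustrative only.
grid = {
    "learning_rate": [0.3, 0.1, 0.03, 0.01],
    "max_depth": [3, 4, 5, 6, 8, 10],
    "n_estimators": [100, 200, 500, 1000],
    "subsample": [0.5, 0.7, 1.0],
    "l2_regularization": [0.0, 0.1, 1.0, 10.0],
}

# The number of combinations is the product of the axis sizes.
n_combinations = math.prod(len(values) for values in grid.values())

# Full enumeration is possible but lazy: each point still costs one training.
all_points = itertools.product(*grid.values())

def sample_configurations(grid, n, seed=0):
    """Draw n random configurations (with replacement) from the grid."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in grid.items()}
        for _ in range(n)
    ]

configurations = sample_configurations(grid, n=20)
print(n_combinations, "grid points,", len(configurations), "sampled")
```

Even with five parameters and a handful of values each, the grid already has more than a thousand points, and every point means training a model at least once.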
![Page 94: MLHEP Lectures - day 3, basic track](https://reader031.vdocument.in/reader031/viewer/2022020301/58aba9f51a28abdf3c8b5f13/html5/thumbnails/94.jpg)
Finding optimal parameters
- randomly picking parameters is a partial solution
- given a target metric, we can optimize it
- no gradient with respect to parameters
- noisy results
- function reconstruction is a problem

Before running grid optimization, make sure your metric is stable (e.g. by training/testing on different subsets).
Overfitting (= getting a too optimistic estimate of quality on a holdout) by using many attempts is a real issue.
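The effect of many attempts can be illustrated with a small simulation. This is a sketch under strong assumptions that are not from the lecture: every configuration has exactly the same true quality, and a holdout estimate is the true quality plus Gaussian noise. The function name and the numbers are illustrative.

```python
import random

def simulate_selection_bias(n_configs=200, true_quality=0.80, noise=0.02, seed=0):
    """Pick the best of many equally good configurations on a noisy holdout."""
    rng = random.Random(seed)
    # All configurations are equally good; estimates differ only by noise.
    holdout_scores = [true_quality + rng.gauss(0, noise) for _ in range(n_configs)]
    best = max(range(n_configs), key=holdout_scores.__getitem__)
    # Re-evaluating the winner on a fresh holdout gives an independent estimate.
    fresh_score = true_quality + rng.gauss(0, noise)
    return holdout_scores[best], fresh_score

# Average over many repetitions to separate the bias from the noise.
results = [simulate_selection_bias(seed=s) for s in range(200)]
mean_selected = sum(r[0] for r in results) / len(results)
mean_fresh = sum(r[1] for r in results) / len(results)
print(f"winner on the holdout used for selection: {mean_selected:.3f}")
print(f"same winner on a fresh holdout:           {mean_fresh:.3f}")
```

The score of the winner on the holdout used for selection is systematically above the true quality, while the fresh holdout is not, even though no configuration is actually better than another.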
![Page 95: MLHEP Lectures - day 3, basic track](https://reader031.vdocument.in/reader031/viewer/2022020301/58aba9f51a28abdf3c8b5f13/html5/thumbnails/95.jpg)
Optimal grid search
- stochastic optimization (Metropolis-Hastings, annealing)
  - requires too many evaluations, using only the last checked combination
- regression techniques, reusing all known information (ML to optimize ML!)
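A minimal sketch of the stochastic option, assuming simulated annealing with a Metropolis-style acceptance rule on a discrete grid. The cooling schedule, the neighbour proposal and the toy quality function are arbitrary choices made for illustration, not the lecture's.

```python
import math
import random

def anneal(quality, grid, n_steps=200, t_start=1.0, t_end=0.01, seed=0):
    """Maximise quality(point) over a grid given as one list of values per parameter."""
    rng = random.Random(seed)

    def values(indices):
        return [axis[i] for axis, i in zip(grid, indices)]

    current = [rng.randrange(len(axis)) for axis in grid]
    current_q = quality(values(current))
    best, best_q = list(current), current_q
    for step in range(n_steps):
        # Geometric cooling from t_start down to t_end.
        t = t_start * (t_end / t_start) ** (step / max(n_steps - 1, 1))
        # Propose a move of one parameter to a neighbouring grid value.
        candidate = list(current)
        k = rng.randrange(len(grid))
        shifted = candidate[k] + rng.choice([-1, 1])
        candidate[k] = min(len(grid[k]) - 1, max(0, shifted))
        cand_q = quality(values(candidate))
        # Accept improvements always, deteriorations with a temperature-dependent
        # probability. Only the current point is remembered; all earlier
        # evaluations are thrown away.
        if cand_q >= current_q or rng.random() < math.exp((cand_q - current_q) / t):
            current, current_q = candidate, cand_q
            if current_q > best_q:
                best, best_q = list(current), current_q
    return values(best), best_q

# Toy quality (hypothetical): peaks at learning rate 0.01 and depth 6.
def toy_quality(point):
    learning_rate, depth = point
    return -(math.log10(learning_rate) + 2) ** 2 - (depth - 6) ** 2 / 10

learning_rates = [1.0, 0.1, 0.01, 0.001, 0.0001]
depths = list(range(2, 11))
best_point, best_q = anneal(toy_quality, [learning_rates, depths])
```

Every step costs one full evaluation of the quality, which in practice means training and testing a model, so a few hundred steps can already be expensive.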
![Page 96: MLHEP Lectures - day 3, basic track](https://reader031.vdocument.in/reader031/viewer/2022020301/58aba9f51a28abdf3c8b5f13/html5/thumbnails/96.jpg)
Optimal grid search using regression
General algorithm (point of grid = set of parameters):
1. evaluations at random points
2. build regression model based on known results
3. select the point with best expected quality according to trained model
4. evaluate quality at this point
5. go to 2 if not enough evaluations

Why not use linear regression?
Exploration vs. exploitation trade-off: should we explore poorly covered regions, or try to improve on what currently appears to be optimal?
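The five steps above can be sketched on a one-dimensional grid. The surrogate here is a deliberately simple stand-in, not the lecture's model: a kernel-weighted average of known qualities serves as the expected quality, and the distance to the nearest evaluated point serves as a crude exploration bonus. All names, the `width` and `explore` parameters and the toy quality function are assumptions made for illustration.

```python
import math
import random

def surrogate(known, x, width):
    """Kernel-weighted average of known qualities at x, and distance to the nearest known point."""
    weights = [math.exp(-((x - xi) / width) ** 2) for xi, _ in known]
    total = sum(weights)
    if total > 0:
        mean = sum(w * q for w, (_, q) in zip(weights, known)) / total
    else:
        mean = sum(q for _, q in known) / len(known)
    nearest = min(abs(x - xi) for xi, _ in known)
    return mean, nearest

def regression_search(quality, grid, n_random=3, n_total=15, width=1.0, explore=0.1, seed=0):
    rng = random.Random(seed)
    # Step 1: evaluations at random points.
    known = [(x, quality(x)) for x in rng.sample(grid, n_random)]
    # Step 5: repeat until the evaluation budget is used up.
    while len(known) < n_total:
        tried = {x for x, _ in known}
        candidates = [x for x in grid if x not in tried]
        if not candidates:
            break

        # Steps 2 and 3: the surrogate is rebuilt from all known results, and
        # the candidate with the best expected quality plus bonus is selected.
        def score(x):
            mean, nearest = surrogate(known, x, width)
            return mean + explore * nearest

        x_next = max(candidates, key=score)
        # Step 4: evaluate the true quality at the selected point.
        known.append((x_next, quality(x_next)))
    return max(known, key=lambda pair: pair[1]), known

# Toy quality (hypothetical): a smooth bump centred at x = 7.
def toy_quality(x):
    return math.exp(-((x - 7.0) / 2.0) ** 2)

grid = [i * 0.5 for i in range(21)]
(best_x, best_q), history = regression_search(toy_quality, grid)
```

Setting `explore=0` gives the purely greedy rule of step 3, and larger values push the search towards poorly covered regions. A linear model would be a poor surrogate in this loop: a linear function on a bounded grid always attains its maximum on the boundary, and it gives no indication of where the estimate is uncertain.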
![Page 97: MLHEP Lectures - day 3, basic track](https://reader031.vdocument.in/reader031/viewer/2022020301/58aba9f51a28abdf3c8b5f13/html5/thumbnails/97.jpg)
Gaussian processes for regression
Some definitions: $Y \sim GP(m, K)$, where $m$ and $K$ are functions of mean and covariance: $m(x)$, $K(x, \tilde{x})$
- $m(x) = \mathbb{E}\, Y(x)$ represents our prior expectation of quality (may be taken constant)
- $K(x, \tilde{x}) = \mathbb{E}\, Y(x) Y(\tilde{x})$ represents the influence of known results on the expectation of values in new points (written for a centred process, $m = 0$; in general $Y$ is replaced by $Y - m$)
- RBF kernel is used here too: $K(x, \tilde{x}) = \exp(-c \|x - \tilde{x}\|^2)$
- Another popular choice: $K(x, \tilde{x}) = \exp(-c \|x - \tilde{x}\|)$

We can model the posterior distribution of results in each point.
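How the posterior at new points follows from these definitions can be sketched with the standard Gaussian conditioning formulas. The sketch assumes a constant prior mean, the RBF kernel $K(x, \tilde{x}) = \exp(-c \|x - \tilde{x}\|^2)$ and a small noise term added to the diagonal for numerical stability; the function names and the toy points are illustrative.

```python
import numpy as np

def rbf_kernel(A, B, c=1.0):
    """K(x, x~) = exp(-c ||x - x~||^2) for all pairs of rows of A and B."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-c * np.maximum(sq, 0.0))

def gp_posterior(X_known, y_known, X_new, c=1.0, noise=1e-6, prior_mean=0.0):
    """Posterior mean and variance of Y at X_new given observations at X_known."""
    K = rbf_kernel(X_known, X_known, c) + noise * np.eye(len(X_known))
    K_s = rbf_kernel(X_known, X_new, c)
    L = np.linalg.cholesky(K)
    # alpha = K^{-1} (y - m), computed through the Cholesky factor.
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_known - prior_mean))
    mean = prior_mean + K_s.T @ alpha
    # Posterior variance: K(x, x) - k_s^T K^{-1} k_s, with K(x, x) = 1 for RBF.
    v = np.linalg.solve(L, K_s)
    var = 1.0 - np.sum(v**2, axis=0)
    return mean, np.maximum(var, 0.0)

# Toy example (hypothetical): three evaluated points on a line.
X_known = np.array([[0.0], [1.0], [2.0]])
y_known = np.sin(X_known[:, 0])
X_new = np.array([[0.0], [1.0], [2.0], [100.0]])
mean, var = gp_posterior(X_known, y_known, X_new)
```

At the evaluated points the posterior variance is close to zero, and far away from them it returns to the prior variance of 1, which is what makes poorly explored regions visible.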
![Page 99: MLHEP Lectures - day 3, basic track](https://reader031.vdocument.in/reader031/viewer/2022020301/58aba9f51a28abdf3c8b5f13/html5/thumbnails/99.jpg)
Gaussian processes
Gaussian processes model the posterior distribution at each point of the grid.
- we know at which points the quality is already well estimated
- and we are able to find regions which need exploration.

See also this demo.
![Page 101: MLHEP Lectures - day 3, basic track](https://reader031.vdocument.in/reader031/viewer/2022020301/58aba9f51a28abdf3c8b5f13/html5/thumbnails/101.jpg)
Summary about hyperoptimization
- parameters can be tuned automatically
- be sure that the metric being optimized is stable
- mind the optimistic quality estimation (resolved by one more holdout)
- but this is not what you should spend your time on
- the gain from properly cooking features / reconsidering the problem is much higher
![Page 102: MLHEP Lectures - day 3, basic track](https://reader031.vdocument.in/reader031/viewer/2022020301/58aba9f51a28abdf3c8b5f13/html5/thumbnails/102.jpg)