mlhep lectures - day 3, basic track


Page 1: MLHEP Lectures - day 3, basic track

Machine Learning in High Energy Physics

Lectures 5 & 6

Alex Rogozhnikov

Lund, MLHEP 2016


Page 2: MLHEP Lectures - day 3, basic track

Linear models: linear regression

Minimizing MSE:

d(x) = <w, x> + w_0

ℒ = (1/N) Σ_i L_mse(x_i, y_i) → min

L_mse(x_i, y_i) = (d(x_i) − y_i)²

Page 3: MLHEP Lectures - day 3, basic track

Linear models: logistic regression

Minimizing logistic loss (labels y_i = ±1):

d(x) = <w, x> + w_0

ℒ = Σ_i L_logistic(x_i, y_i) → min

Penalty for a single observation:

L_logistic(x_i, y_i) = ln(1 + e^(−y_i d(x_i)))
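A minimal NumPy sketch of the two losses above, assuming X is an (N, d) feature matrix, y a vector of labels (real-valued for MSE, ±1 for the logistic loss) and w, w0 the model parameters (all names are illustrative):

import numpy as np

def decision(X, w, w0):
    # d(x) = <w, x> + w_0 for every row of X
    return X.dot(w) + w0

def mse_loss(X, y, w, w0):
    # (1/N) * sum of (d(x_i) - y_i)^2
    return np.mean((decision(X, w, w0) - y) ** 2)

def logistic_loss(X, y, w, w0):
    # sum of ln(1 + exp(-y_i d(x_i))), y_i = ±1
    margins = y * decision(X, w, w0)
    return np.sum(np.logaddexp(0.0, -margins))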

Page 4: MLHEP Lectures - day 3, basic track

Linear models: support vector machine (SVM)

Hinge loss:

L_hinge(x_i, y_i) = max(0, 1 − y_i d(x_i))

Margin: y_i d(x_i) > 1 → no penalty

Page 5: MLHEP Lectures - day 3, basic track

Kernel trick

We can project data into a higher-dimensional space, e.g. by adding new features.

Hopefully, in the new space the distributions are separable.

Page 6: MLHEP Lectures - day 3, basic track

Kernel trick

P is a projection operator:

w = Σ_i α_i P(x_i)

d(x) = <w, P(x)>_new

We need only the kernel:

K(x, x̃) = <P(x), P(x̃)>_new

d(x) = Σ_i α_i K(x_i, x)

Popular choices: polynomial kernel and RBF kernel.
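In scikit-learn the kernel trick comes built into SVC; a sketch on toy data that is not linearly separable in the original space (parameter values are only illustrative):

from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_circles(n_samples=500, noise=0.1, factor=0.4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# RBF kernel: K(x, x~) = exp(-gamma ||x - x~||^2)
clf_rbf = SVC(kernel='rbf', gamma=1.0).fit(X_train, y_train)
# polynomial kernel: K(x, x~) = (gamma <x, x~> + coef0)^degree
clf_poly = SVC(kernel='poly', degree=3).fit(X_train, y_train)

print(clf_rbf.score(X_test, y_test), clf_poly.score(X_test, y_test))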

Page 7: MLHEP Lectures - day 3, basic track

Regularizations

ℒ = (1/N) Σ_i L(x_i, y_i) + ℒ_reg → min

L2 regularization: ℒ_reg = α Σ_j |w_j|²

L1 regularization: ℒ_reg = β Σ_j |w_j|

L1 + L2 regularization: ℒ_reg = α Σ_j |w_j|² + β Σ_j |w_j|
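For linear regression these three penalties correspond to Ridge (L2), Lasso (L1) and ElasticNet (L1 + L2) in scikit-learn; a small sketch on synthetic data (regularization strengths are arbitrary):

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso, ElasticNet

X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)                    # L2 penalty
lasso = Lasso(alpha=1.0).fit(X, y)                    # L1 penalty, drives some w_j exactly to 0
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)  # mix of L1 and L2

print((lasso.coef_ == 0).sum(), "coefficients zeroed by the L1 penalty")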

Page 8: MLHEP Lectures - day 3, basic track

Stochastic optimization methods

Stochastic gradient descent (can be applied to additive loss functions):

take i — a random event from the training data

w ← w − η ∂L(x_i, y_i) / ∂w
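A bare-bones sketch of this update for the MSE loss (data, learning rate and number of steps are arbitrary illustration values):

import numpy as np

rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 5))
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X.dot(true_w) + rng.normal(scale=0.1, size=1000)

w, w0, eta = np.zeros(5), 0.0, 0.01
for step in range(10000):
    i = rng.randint(len(X))                  # take a random event
    grad = 2 * (X[i].dot(w) + w0 - y[i])     # dL_mse / dd for this event
    w -= eta * grad * X[i]                   # w <- w - eta * dL/dw
    w0 -= eta * grad
print(w)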

Page 9: MLHEP Lectures - day 3, basic track

Decision trees

- building an optimal tree is NP-complete; heuristic: use greedy optimization
- optimization criteria (impurities): misclassification, Gini, entropy

Page 10: MLHEP Lectures - day 3, basic track

Decision trees for regression

Optimizing MSE, the prediction inside a leaf is constant.

Page 11: MLHEP Lectures - day 3, basic track

Overfitting in decision trees

- pre-stopping
- post-pruning
- unstable to changes in the training dataset

Page 12: MLHEP Lectures - day 3, basic track

Random Forest

Many trees built independently:

- bagging of samples
- subsampling of features

Simple voting is used to get the prediction of the ensemble.

Page 13: MLHEP Lectures - day 3, basic track

Random Forest

- overfitted (in the sense that predictions for train and test are different)
- doesn't overfit: increasing complexity (adding more trees) doesn't spoil the classifier

Page 14: MLHEP Lectures - day 3, basic track

Random Forest

- simple and parallelizable
- doesn't require much tuning
- hardly interpretable, but feature importances can be computed
- doesn't fix samples poorly classified at previous stages
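A sketch of the typical scikit-learn workflow, including the feature importances mentioned above (toy data, illustrative settings):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

# many trees built independently on bootstrapped samples and random feature subsets
forest = RandomForestClassifier(n_estimators=300, max_features='sqrt', n_jobs=-1, random_state=0)
forest.fit(X, y)

print(forest.feature_importances_)   # the model is hardly interpretable, but importances are available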

Page 15: MLHEP Lectures - day 3, basic track

Ensembles

Averaging decision functions:

D(x) = (1/J) Σ_{j=1..J} d_j(x)

Weighted decision:

D(x) = Σ_j α_j d_j(x)

Page 16: MLHEP Lectures - day 3, basic track

Sample weights in ML

Can be used with many estimators. We now have triples x_i, y_i, w_i (i — index of an event).

- weight corresponds to the frequency of observation
- expected behavior: w_i = n is the same as having n copies of the i-th event
- global normalization of weights doesn't matter

Page 17: MLHEP Lectures - day 3, basic track

Sample weights in ML

Can be used with many estimators. We now have triples x_i, y_i, w_i (i — index of an event).

- weight corresponds to the frequency of observation
- expected behavior: w_i = n is the same as having n copies of the i-th event
- global normalization of weights doesn't matter

Example for logistic regression:

ℒ = Σ_i w_i L(x_i, y_i) → min

Page 18: MLHEP Lectures - day 3, basic track

Weights (parameters) of a classifier ≠ sample weights.

In code:

from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(max_depth=4)
tree.fit(X, y, sample_weight=weights)

Sample weights are a convenient way to regulate the importance of training events.

Only sample weights are meant when talking about AdaBoost.

Page 19: MLHEP Lectures - day 3, basic track

AdaBoost [Freund, Schapire, 1995]

Bagging: information from previous trees is not taken into account.

Adaptive Boosting is a weighted composition of weak learners:

D(x) = Σ_j α_j d_j(x)

We assume d_j(x) = ±1 and labels y_i = ±1.

The j-th weak learner misclassified the i-th event iff y_i d_j(x_i) = −1.

Page 20: MLHEP Lectures - day 3, basic track

AdaBoost

D(x) = Σ_j α_j d_j(x)

Weak learners are built in sequence; each classifier is trained using different weights.

Initially w_i = 1 for each training sample.

After building the j-th base classifier:

1. compute the total weights of correctly and wrongly classified events and set

α_j = (1/2) ln(w_correct / w_wrong)

2. increase the weights of misclassified samples:

w_i ← w_i × e^(−α_j y_i d_j(x_i))
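A compact NumPy / scikit-learn sketch of exactly this loop (discrete AdaBoost with depth-1 trees; an illustration of the update rules, not the author's code):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
y = 2 * y - 1                                    # labels to ±1

w = np.ones(len(X))                              # initially w_i = 1
alphas, stumps = [], []
for j in range(100):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    w_correct = w[pred == y].sum()
    w_wrong = w[pred != y].sum()
    alpha = 0.5 * np.log(w_correct / w_wrong)    # alpha_j = 1/2 ln(w_correct / w_wrong)
    w *= np.exp(-alpha * y * pred)               # increase weights of misclassified samples
    alphas.append(alpha)
    stumps.append(stump)

D = sum(a * s.predict(X) for a, s in zip(alphas, stumps))   # D(x) = sum_j alpha_j d_j(x)
print((np.sign(D) == y).mean())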

Page 21: MLHEP Lectures - day 3, basic track

AdaBoost example

Decision trees of depth 1 will be used.

Page 22: MLHEP Lectures - day 3, basic track


Page 23: MLHEP Lectures - day 3, basic track


Page 24: MLHEP Lectures - day 3, basic track

(1, 2, 3, 100 trees)

Page 25: MLHEP Lectures - day 3, basic track

AdaBoost secret

D(x) = Σ_j α_j d_j(x)

ℒ = Σ_i L(x_i, y_i) = Σ_i exp(−y_i D(x_i)) → min

- the sample weight is equal to the penalty for the event: w_i = L(x_i, y_i) = exp(−y_i D(x_i))
- α_j is obtained as a result of analytical optimization

Exercise: prove the formula for α_j.

Page 26: MLHEP Lectures - day 3, basic track

Loss function of AdaBoost


Page 27: MLHEP Lectures - day 3, basic track

AdaBoost summary

- is able to combine many weak learners
- takes mistakes into account
- simple; the overhead for boosting is negligible
- too sensitive to outliers

In scikit-learn, one can run AdaBoost over other algorithms.
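For instance, with a linear model as the base learner (a sketch; any classifier that supports sample weights can be plugged in):

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, random_state=0)

# boosting over logistic regression instead of the default shallow trees
ada = AdaBoostClassifier(LogisticRegression(), n_estimators=50)
ada.fit(X, y)
print(ada.score(X, y))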

Page 28: MLHEP Lectures - day 3, basic track

Gradient Boosting


Page 29: MLHEP Lectures - day 3, basic track

Decision trees for regression


Page 30: MLHEP Lectures - day 3, basic track

Gradient boosting to minimize MSE

Say we're trying to build an ensemble to minimize MSE:

Σ_i (D(x_i) − y_i)² → min

The ensemble's prediction is obtained by taking the sum of the individual predictions:

D(x) = Σ_j d_j(x),    D_j(x) = Σ_{j′=1..j} d_{j′}(x) = D_{j−1}(x) + d_j(x)

Assuming that we have already built j − 1 estimators, how do we train the next one?

Page 31: MLHEP Lectures - day 3, basic track

The natural solution is to greedily minimize MSE:

Σ_i (D_j(x_i) − y_i)² = Σ_i (D_{j−1}(x_i) + d_j(x_i) − y_i)² → min

Introduce the residual R_j(x_i) = y_i − D_{j−1}(x_i); now we simply need to minimize MSE:

Σ_i (d_j(x_i) − R_j(x_i))² → min

So the j-th estimator (tree) is trained using the following data: x_i, R_j(x_i).
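A sketch of this residual-fitting loop with shallow regression trees (toy data, illustrative settings):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=500)

trees, D = [], np.zeros(len(X))
for j in range(100):
    residual = y - D                                   # R_j(x_i) = y_i - D_{j-1}(x_i)
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    trees.append(tree)
    D += tree.predict(X)                               # D_j = D_{j-1} + d_j

print(np.mean((D - y) ** 2))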

Page 32: MLHEP Lectures - day 3, basic track

Example: regression with GB

using regression trees of depth = 2

Page 33: MLHEP Lectures - day 3, basic track

number of trees = 1, 2, 3, 100

Page 35: MLHEP Lectures - day 3, basic track

Gradient Boosting [Friedman, 1999]

Composition of weak regressors:

D(x) = Σ_j α_j d_j(x)

Borrow the approach for encoding probabilities from logistic regression:

p_{+1}(x) = σ(D(x)),    p_{−1}(x) = σ(−D(x))

Optimization of the log-likelihood (y_i = ±1):

ℒ = Σ_i L(x_i, y_i) = Σ_i ln(1 + e^(−y_i D(x_i))) → min

Page 36: MLHEP Lectures - day 3, basic track

Gradient Boosting

D(x) = Σ_j α_j d_j(x)

ℒ = Σ_i ln(1 + e^(−y_i D(x_i))) → min

Optimization problem: find all α_j and weak learners d_j. Mission impossible.

Page 37: MLHEP Lectures - day 3, basic track

Gradient Boosting

D(x) = Σ_j α_j d_j(x)

ℒ = Σ_i ln(1 + e^(−y_i D(x_i))) → min

Optimization problem: find all α_j and weak learners d_j. Mission impossible.

Main point: greedy optimization of the loss function by training one more weak learner d_j. Each new estimator follows the gradient of the loss function.

Page 38: MLHEP Lectures - day 3, basic track

Gradient Boosting

Gradient boosting ~ steepest gradient descent.

D_j(x) = Σ_{j′=1..j} α_{j′} d_{j′}(x),    D_j(x) = D_{j−1}(x) + α_j d_j(x)

At the j-th iteration:

- compute the pseudo-residual R(x_i) = − ∂ℒ/∂D(x_i), evaluated at D(x) = D_{j−1}(x)
- train a regressor d_j to minimize MSE: Σ_i (d_j(x_i) − R(x_i))² → min
- find the optimal α_j

Important exercise: compute the pseudo-residuals for the MSE and logistic losses.

Page 39: MLHEP Lectures - day 3, basic track

Additional GB tricks

To make training more stable, add a learning rate η:

D(x) = η Σ_j α_j d_j(x)

Randomization to fight noise and build different trees:

- subsampling of features
- subsampling of training samples
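These knobs map directly onto scikit-learn's GradientBoostingClassifier (the values below are only illustrative, not recommendations):

from sklearn.ensemble import GradientBoostingClassifier

gb = GradientBoostingClassifier(
    n_estimators=500,
    learning_rate=0.05,   # eta: shrinks the contribution of each tree
    subsample=0.8,        # random subsampling of training samples for each tree
    max_features=0.5,     # random subsampling of features for each split
    max_depth=3,
)
# gb.fit(X, y) as usual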

Page 40: MLHEP Lectures - day 3, basic track

AdaBoost is a particular case of gradient boosting with a different target loss function*:

ℒ_ada = Σ_i e^(−y_i D(x_i)) → min

This loss function is called ExpLoss or AdaLoss.

*(also, AdaBoost expects that d_j(x_i) = ±1)

Page 41: MLHEP Lectures - day 3, basic track

Loss functions

Gradient boosting can optimize different smooth loss functions.

Regression, y ∈ ℝ:

- Mean Squared Error: Σ_i (d(x_i) − y_i)²
- Mean Absolute Error: Σ_i |d(x_i) − y_i|

Binary classification, y_i = ±1:

- ExpLoss (aka AdaLoss): Σ_i e^(−y_i d(x_i))
- LogLoss: Σ_i log(1 + e^(−y_i d(x_i)))

Page 42: MLHEP Lectures - day 3, basic track


Page 43: MLHEP Lectures - day 3, basic track


Page 44: MLHEP Lectures - day 3, basic track

Usage of second-order information

For an additive loss function, apart from the gradients g_i we can also make use of the second derivatives h_i.

E.g. select leaf values using a second-order step:

ℒ = Σ_i L(D_j(x_i), y_i) = Σ_i L(D_{j−1}(x_i) + d_j(x_i), y_i) ≃ Σ_i [ L(D_{j−1}(x_i), y_i) + g_i d_j(x_i) + (h_i / 2) d_j²(x_i) ]

Since the tree d_j predicts a constant value w_leaf inside each leaf, up to a constant

ℒ ≃ Σ_leaf ( g_leaf w_leaf + (h_leaf / 2) w²_leaf ) → min

Page 45: MLHEP Lectures - day 3, basic track

Using second-order information

Independent optimization. Explicit solution for optimal values in the leaves:

a + ( + ) → minj+1 ∑leaf

gleafwleafhleaf

2w2

leaf

where = , = .gleaf ∑i!leaf

gi hleaf ∑i!leaf

hi

= +wleafgleaf

hleaf

44 / 101

Page 46: MLHEP Lectures - day 3, basic track

Using second-order information: recipe

On each iteration of gradient boosting:

1. train a tree to follow the gradient (minimize MSE with the gradient)
2. change the values assigned in the leaves to: w_leaf ← − g_leaf / h_leaf
3. update the predictions (no weight for the estimator: α_j = 1): D_j(x) = D_{j−1}(x) + η d_j(x)

This improvement is quite cheap and allows smaller GBs to be more effective. We can also use information about the hessians at the tree-building step (step 1).
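A sketch of step 2 for a single tree, assuming per-event gradients g and hessians h have already been computed (array names are illustrative; leaf indices of a fitted scikit-learn tree can be obtained with tree.apply(X)):

import numpy as np

def second_order_leaf_values(leaf_index, g, h):
    # leaf_index[i] is the leaf that event i falls into (e.g. tree.apply(X));
    # returns w_leaf = -g_leaf / h_leaf for every leaf
    values = {}
    for leaf in np.unique(leaf_index):
        mask = leaf_index == leaf
        values[leaf] = -g[mask].sum() / h[mask].sum()
    return values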

Page 47: MLHEP Lectures - day 3, basic track

Multiclass classification: ensembling

One-vs-one: n_classes (n_classes − 1) / 2 classifiers; one-vs-rest: n_classes classifiers.

scikit-learn implements those as meta-algorithms.
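A sketch of both meta-algorithms wrapped around a binary classifier (dataset and base estimator are arbitrary choices):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

X, y = load_iris(return_X_y=True)

ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)  # n_classes models
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)   # n_classes (n_classes - 1) / 2 models
print(len(ovr.estimators_), len(ovo.estimators_))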

Page 48: MLHEP Lectures - day 3, basic track

Multiclass classification: modifying an algorithm

Most classifiers have natural generalizations to multiclass classification.

Example for logistic regression: introduce for each class c ∈ {1, 2, …, C} a vector w_c:

d_c(x) = <w_c, x>

Convert to probabilities using the softmax function:

p_c(x) = e^(d_c(x)) / Σ_c̃ e^(d_c̃(x))

and minimize LogLoss:

ℒ = − Σ_i log p_{y_i}(x_i)

Page 49: MLHEP Lectures - day 3, basic track

Softmax function

A typical way to convert n numbers to n probabilities.

The mapping is surjective, but not injective (n dimensions to n − 1 dimensions).

Invariant to a global shift: d_c(x) → d_c(x) + const.

For the case of two classes:

p_1(x) = e^(d_1(x)) / (e^(d_1(x)) + e^(d_2(x))) = 1 / (1 + e^(−(d_1(x) − d_2(x))))

Coincides with the logistic function for d(x) = d_1(x) − d_2(x).
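A numerically stable NumPy version, using the shift invariance noted above:

import numpy as np

def softmax(scores):
    # scores: array of shape (n_samples, n_classes) with the values d_c(x)
    shifted = scores - scores.max(axis=1, keepdims=True)   # invariance to a global shift
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

print(softmax(np.array([[1.0, 2.0, 3.0]])))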

Page 50: MLHEP Lectures - day 3, basic track

Loss function: ranking example

In ranking we need to order items by y_i:

y_i < y_j ⇒ d(x_i) < d(x_j)

We can penalize for misordering:

ℒ = Σ_{i, ĩ} L(x_i, x_ĩ, y_i, y_ĩ)

L(x, x̃, y, ỹ) = σ(d(x) − d(x̃)) if y < ỹ, and 0 otherwise

Page 51: MLHEP Lectures - day 3, basic track

Adapting boosting

By modifying boosting or changing the loss function we can solve different problems:

- classification
- regression
- ranking

HEP-specific examples in Tatiana's lecture tomorrow.

Page 52: MLHEP Lectures - day 3, basic track

Gradient Boosting classification playground


Page 53: MLHEP Lectures - day 3, basic track

n³-minutes break

Page 54: MLHEP Lectures - day 3, basic track

Recapitulation: AdaBoost

Minimizes

ℒ_ada = Σ_i e^(−y_i D(x_i)) → min

by increasing the weights of misclassified samples:

w_i ← w_i × e^(−α_j y_i d_j(x_i))

Page 55: MLHEP Lectures - day 3, basic track

Gradient Boosting overview

A powerful ensembling technique (typically used over trees, GBDT):

- a general way to optimize differentiable losses
- can be adapted to other problems
- 'follows' the gradient of the loss at each step
- makes steps in the space of functions
- the gradient for poorly-classified events is higher
- increasing the number of trees can drive to overfitting (= getting worse quality on new data)
- requires tuning; better when trees are not complex
- widely used in practice

Page 56: MLHEP Lectures - day 3, basic track

Feature engineering

Feature engineering = creating features to get the best result with ML.

- an important step
- mostly relies on domain knowledge
- requires some understanding
- most of practitioners' time is spent at this step

Page 57: MLHEP Lectures - day 3, basic track

Feature engineering

- Analyze the available features: scale and shape of features.
- Analyze which information is lacking; challenge example: maybe subleading jets matter?
- Validate your guesses.

Page 58: MLHEP Lectures - day 3, basic track

Feature engineering

- Analyze the available features: scale and shape of features.
- Analyze which information is lacking; challenge example: maybe subleading jets matter?
- Validate your guesses.

Machine learning is a proper tool for checking your understanding of data.

Page 59: MLHEP Lectures - day 3, basic track

Linear models example

A single event with a sufficiently large value of a feature can break almost all linear models.

Heavy-tailed distributions are harmful; pre-transforming is required:

- logarithm
- power transform
- throwing out outliers

The same tricks actually help more advanced methods too.

Which transformation is the best for Random Forest?

Page 60: MLHEP Lectures - day 3, basic track

One-hot encoding

Categorical features (= 'not orderable'), being one-hot encoded, are easier for ML models to operate with.
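A sketch with scikit-learn (the categorical values are made up; pandas.get_dummies is an equally common choice):

import numpy as np
from sklearn.preprocessing import OneHotEncoder

particle_type = np.array([['muon'], ['electron'], ['muon'], ['pion']])

encoder = OneHotEncoder()
print(encoder.fit_transform(particle_type).toarray())
# one binary column per category instead of an arbitrary integer ordering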

Page 61: MLHEP Lectures - day 3, basic track

Decision tree example

η_lepton is hard for a tree to use, since it provides no good splitting.

Don't forget that a tree can't reconstruct linear combinations — take care of this.

Page 62: MLHEP Lectures - day 3, basic track

Example of a feature: invariant mass

Using HEP coordinates, the invariant mass of two products is:

m²_inv ≈ 2 p_T1 p_T2 (cosh(η_1 − η_2) − cos(φ_1 − φ_2))

It can't be recovered with ensembles of trees of depth < 4 when using only the canonical features.

What about the invariant mass of 3 particles?

Page 63: MLHEP Lectures - day 3, basic track

Example of a feature: invariant mass

Using HEP coordinates, the invariant mass of two products is:

m²_inv ≈ 2 p_T1 p_T2 (cosh(η_1 − η_2) − cos(φ_1 − φ_2))

It can't be recovered with ensembles of trees of depth < 4 when using only the canonical features.

What about the invariant mass of 3 particles? (see Vicens' talk today)

Good features are ones that are explainable by physics. Start from the simplest and most natural ones.

Mind the cost of computing the features.
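A sketch of adding this feature to a table of events (toy values; the column names pt1, eta1, phi1, ... are hypothetical):

import numpy as np
import pandas as pd

events = pd.DataFrame({
    'pt1': [30.0, 45.0], 'eta1': [0.5, -1.2], 'phi1': [0.1, 2.0],
    'pt2': [25.0, 20.0], 'eta2': [1.1, -0.3], 'phi2': [-2.8, 1.1],
})

# m_inv^2 ~ 2 pT1 pT2 (cosh(eta1 - eta2) - cos(phi1 - phi2)), massless approximation
m2 = 2 * events.pt1 * events.pt2 * (
    np.cosh(events.eta1 - events.eta2) - np.cos(events.phi1 - events.phi2))
events['m_inv'] = np.sqrt(m2)
print(events[['m_inv']])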

Page 64: MLHEP Lectures - day 3, basic track

Output engineering

Typically not discussed, but the target of learning plays an important role.

Example: predicting the number of upvotes for a comment, assuming the error is MSE:

ℒ = Σ_i (d(x_i) − y_i)² → min

Consider two cases:

- 100 comments when 0 were predicted
- 1200 comments when 1000 were predicted

Page 65: MLHEP Lectures - day 3, basic track

Output engineering

Ridiculously large impact of highly-commented articles. We need to predict the order, not the exact number of comments.

Possible solutions:

- alternate loss function, e.g. use MAPE
- apply a logarithm to the target, predict log(# comments)

The evaluation score should be changed accordingly.
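The second option in code (a sketch; the heavy-tailed toy target and the choice of regressor are only for illustration):

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 5))
n_comments = np.round(np.exp(2 + X[:, 0] + rng.normal(size=1000)))   # heavy-tailed toy target

# train on log(1 + #comments): errors become relative rather than absolute
model = GradientBoostingRegressor().fit(X, np.log1p(n_comments))
predicted = np.expm1(model.predict(X))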

Page 66: MLHEP Lectures - day 3, basic track

Sample weights

Typically used to estimate the contribution of an event (how often we expect this to happen).

Sample weights also matter in some other situations:

- highly imbalanced datasets (e.g. 99% of events in class 0) tend to have problems during optimization
- changing sample weights to balance the dataset frequently helps
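One way to re-balance classes via sample weights in scikit-learn (a sketch on a toy imbalanced dataset):

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.utils.class_weight import compute_sample_weight

X, y = make_classification(n_samples=5000, weights=[0.99], random_state=0)

# 'balanced' weights make the total weight of both classes equal
weights = compute_sample_weight('balanced', y)
clf = GradientBoostingClassifier().fit(X, y, sample_weight=weights)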

Page 67: MLHEP Lectures - day 3, basic track

Feature selection

Why?

- speed up training / prediction
- reduce the time of data preparation in the pipeline
- help the algorithm to 'focus' on finding reliable dependencies (useful when the amount of training data is limited)

Problem: find a subset of features which provides the best quality.

Page 68: MLHEP Lectures - day 3, basic track

Feature selection

Exhaustive search: 2^d cross-validations.

- an incredible amount of resources
- too many cross-validation cycles drive to an overly-optimistic quality on the test data

Page 69: MLHEP Lectures - day 3, basic track

Feature selection

Exhaustive search: 2^d cross-validations.

- an incredible amount of resources
- too many cross-validation cycles drive to an overly-optimistic quality on the test data
- a basic nice solution: estimate importances with RF / GBDT

Page 70: MLHEP Lectures - day 3, basic track

Feature selection

Exhaustive search: 2^d cross-validations.

- an incredible amount of resources
- too many cross-validation cycles drive to an overly-optimistic quality on the test data
- a basic nice solution: estimate importances with RF / GBDT

Filtering methods

Eliminate variables which seem not to carry statistical information about the target, e.g. by measuring Pearson correlation or mutual information.

Example: all angles φ will be thrown out.

Page 71: MLHEP Lectures - day 3, basic track

Feature selection: embedded methods

Feature selection is a part of training. Example: L1-regularized linear models.

Forward selection

Start from an empty set of features. For each feature in the dataset, check whether adding this feature improves the quality.

Backward elimination

Almost the same, but this time we iteratively eliminate features.

A bidirectional combination of the above is possible; some algorithms can use a previously trained model as the new starting point for optimization.

Page 72: MLHEP Lectures - day 3, basic track

Unsupervised dimensionality reduction


Page 73: MLHEP Lectures - day 3, basic track

Principal component analysis [Pearson, 1901]

PCA is finding the axes along which the variance is maximal.

Page 74: MLHEP Lectures - day 3, basic track

PCA description

PCA is based on the principal axis theorem:

Q = U Λ Uᵀ

where Q is the covariance matrix of the dataset, U is an orthogonal matrix and Λ is a diagonal matrix,

Λ = diag(λ_1, λ_2, …, λ_n),    λ_1 ≥ λ_2 ≥ ⋯ ≥ λ_n
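Standard usage in scikit-learn (a sketch; features of different nature should be scaled first):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print(pca.explained_variance_ratio_)   # fraction of the variance along each principal axis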

Page 75: MLHEP Lectures - day 3, basic track

PCA optimization visualized


Page 76: MLHEP Lectures - day 3, basic track

PCA: eigenfaces

Emotion = α[scared] + β[laughs] + γ[angry] + …

Page 77: MLHEP Lectures - day 3, basic track

Locally linear embedding

Handles the case of non-linear dimensionality reduction.

Express each sample as a convex combination of its neighbours:

Σ_i ||x_i − Σ_ĩ w_{iĩ} x_ĩ||² → min_w

Page 78: MLHEP Lectures - day 3, basic track

Locally linear embedding

Σ_i ||x_i − Σ_ĩ w_{iĩ} x_ĩ||² → min_w

subject to the constraints: Σ_ĩ w_{iĩ} = 1, w_{iĩ} ≥ 0, and w_{iĩ} = 0 if i, ĩ are not neighbors.

Finding an optimal mapping for all points simultaneously (y_i are the images — positions in the new space):

Σ_i ||y_i − Σ_ĩ w_{iĩ} y_ĩ||² → min_y

Page 79: MLHEP Lectures - day 3, basic track

PCA and LLE


Page 80: MLHEP Lectures - day 3, basic track

Isomap

Isomap is targeted to preserve the geodesic distance on the manifold between two points.

Page 81: MLHEP Lectures - day 3, basic track

Supervised dimensionality reduction


Page 82: MLHEP Lectures - day 3, basic track

Fisher's LDA (Linear Discriminant Analysis) [1936]

Original idea: find a projection that discriminates the classes best.

Page 83: MLHEP Lectures - day 3, basic track

Fisher's LDA

Mean and variance within a single class c (c ∈ {1, 2, …, C}):

μ_c = <x>_{events of class c}
σ_c = <||x − μ_c||²>_{events of class c}

Total within-class variance: σ_within = Σ_c p_c σ_c

Total between-class variance: σ_between = Σ_c p_c ||μ_c − μ||²

Goal: find a projection that maximizes the ratio σ_between / σ_within.

Page 84: MLHEP Lectures - day 3, basic track

Fisher's LDA


Page 85: MLHEP Lectures - day 3, basic track

LDA: solving the optimization problem

We are interested in finding a 1-dimensional projection w:

max_w (wᵀ Σ_between w) / (wᵀ Σ_within w)

- naturally connected to the generalized eigenvalue problem
- the projection vector corresponds to the highest generalized eigenvalue
- finds a subspace of C − 1 components when applied to a classification problem with C classes

Fisher's LDA is a basic popular binary classification technique.
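In scikit-learn, which also exposes the reduced representation (a sketch on a standard 3-class dataset):

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

lda = LinearDiscriminantAnalysis(n_components=2)   # at most C - 1 components, here C = 3
X_proj = lda.fit_transform(X, y)                   # supervised projection
print(lda.score(X, y))                             # LDA can also be used directly as a classifier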

Page 86: MLHEP Lectures - day 3, basic track

Common spatial patterns

When we expect that each class is close to some linear subspace, we can

- (naively) find this subspace for each class by PCA
- (better idea) take into account the variation of the other data and optimize

tr(Wᵀ Σ_class W) → max_W,    subject to Wᵀ Σ_total W = I

The natural generalization is to take several components: W ∈ ℝ^(n×n_1) is a projection matrix; n and n_1 are the numbers of dimensions in the original and new spaces.

Frequently used in the neurosciences, in particular in BCI based on EEG / MEG.

Page 87: MLHEP Lectures - day 3, basic track

Common spatial patterns

The patterns found describe the projection into a 6-dimensional space.

Page 88: MLHEP Lectures - day 3, basic track

Dimensionality reduction summary

- is capable of extracting sensible features from highly-dimensional data
- frequently used to 'visualize' the data
- nonlinear methods rely on the distances in the space
- works well with highly-dimensional spaces whose features are of the same nature

Page 89: MLHEP Lectures - day 3, basic track

Finding optimal hyperparameters

- some algorithms have many parameters (regularizations, depth, learning rate, ...)
- not all the parameters can be guessed
- checking all combinations takes too long

Page 90: MLHEP Lectures - day 3, basic track

Finding optimal hyperparameters

- some algorithms have many parameters (regularizations, depth, learning rate, ...)
- not all the parameters can be guessed
- checking all combinations takes too long

We need automated hyperparameter optimization!

Page 91: MLHEP Lectures - day 3, basic track

Finding optimal parameters

- randomly picking parameters is a partial solution
- given a target optimal value, we can optimize it
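The random-search baseline is readily available in scikit-learn (a sketch; the parameter ranges are illustrative):

from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=2000, random_state=0)

search = RandomizedSearchCV(
    GradientBoostingClassifier(),
    param_distributions={
        'max_depth': randint(2, 6),
        'learning_rate': uniform(0.01, 0.2),
        'n_estimators': randint(50, 500),
    },
    n_iter=20, cv=3, random_state=0)
search.fit(X, y)
print(search.best_params_)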

Page 92: MLHEP Lectures - day 3, basic track

Finding optimal parameters

- randomly picking parameters is a partial solution
- given a target optimal value, we can optimize it
- no gradient with respect to the parameters
- noisy results

Page 93: MLHEP Lectures - day 3, basic track

Finding optimal parameters

- randomly picking parameters is a partial solution
- given a target optimal value, we can optimize it
- no gradient with respect to the parameters
- noisy results
- function reconstruction is a problem

Page 94: MLHEP Lectures - day 3, basic track

Finding optimal parameters

- randomly picking parameters is a partial solution
- given a target optimal value, we can optimize it
- no gradient with respect to the parameters
- noisy results
- function reconstruction is a problem

Before running grid optimization, make sure your metric is stable (e.g. by training / testing on different subsets).

Overfitting (= getting a too optimistic estimate of the quality on a holdout) by using many attempts is a real issue.

Page 95: MLHEP Lectures - day 3, basic track

Optimal grid search

- stochastic optimization (Metropolis-Hastings, annealing): requires too many evaluations and uses only the last checked combination
- regression techniques, reusing all known information (ML to optimize ML!)

Page 96: MLHEP Lectures - day 3, basic track

Optimal grid search using regression

General algorithm (a point of the grid = a set of parameters):

1. evaluate the quality at several random points
2. build a regression model based on the known results
3. select the point with the best expected quality according to the trained model
4. evaluate the quality at this point
5. go to 2 if not enough evaluations

Why not use linear regression?

Exploration vs. exploitation trade-off: should we explore poorly-covered regions or try to improve around what is currently seen to be optimal?

Page 97: MLHEP Lectures - day 3, basic track

Gaussian processes for regression

Some definitions: Y ∼ GP(m, K), where m and K are the mean and covariance functions: m(x), K(x, x̃).

m(x) = 𝔼 Y(x) represents our prior expectation of the quality (may be taken constant).

K(x, x̃) = 𝔼 Y(x) Y(x̃) represents the influence of known results on the expectation of values at new points.

The RBF kernel is used here too: K(x, x̃) = exp(−c ||x − x̃||²). Another popular choice: K(x, x̃) = exp(−c ||x − x̃||).

We can model the posterior distribution of the results at each point.
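The same ingredients in scikit-learn (a sketch; the (hyperparameter, quality) pairs are made up):

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

params = np.array([[0.01], [0.1], [0.5], [1.0]])   # tried hyperparameter values
quality = np.array([0.71, 0.78, 0.75, 0.69])       # measured qualities (illustrative)

gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.3), alpha=1e-3).fit(params, quality)
mean, std = gp.predict(np.linspace(0, 1, 50).reshape(-1, 1), return_std=True)
# mean is the expected quality; std shows which regions still need exploration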

Page 98: MLHEP Lectures - day 3, basic track

Gaussian Process Demo on Mathematica


Page 99: MLHEP Lectures - day 3, basic track

Gaussian processes

- Gaussian processes model the posterior distribution at each point of the grid
- we know at which points the quality is already well-estimated
- and we are able to find regions which need exploration

See also this demo.

Page 100: MLHEP Lectures - day 3, basic track

Summary about hyperoptimization

- parameters can be tuned automatically
- be sure that the metric being optimized is stable
- mind the optimistic quality estimation (resolved by one more holdout)

Page 101: MLHEP Lectures - day 3, basic track

Summary about hyperoptimization

- parameters can be tuned automatically
- be sure that the metric being optimized is stable
- mind the optimistic quality estimation (resolved by one more holdout)
- but this is not what you should spend your time on: the gain from properly cooking features / reconsidering the problem is much higher

Page 102: MLHEP Lectures - day 3, basic track
