mlhep lectures - day 3, basic track


Page 1: MLHEP Lectures - day 3, basic track

Machine Learning in High Energy Physics

Lectures 5 & 6

Alex Rogozhnikov

Lund, MLHEP 2016


Page 2: MLHEP Lectures - day 3, basic track

Linear models: linear regression

Minimizing MSE:

d(x) = <w, x> + w_0

ℒ = (1/N) Σ_i L_mse(x_i, y_i) → min

L_mse(x_i, y_i) = (d(x_i) − y_i)²

Page 3: MLHEP Lectures - day 3, basic track

Linear models: logistic regression

Minimizing logistic loss (labels y_i = ±1):

d(x) = <w, x> + w_0

ℒ = Σ_i L_logistic(x_i, y_i) → min

Penalty for a single observation:

L_logistic(x_i, y_i) = ln(1 + e^(−y_i d(x_i)))
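A minimal NumPy sketch of the two losses above, assuming X is an (N, d) feature matrix, y a vector of labels (real-valued for MSE, ±1 for the logistic loss) and w, w0 the model parameters (all names are illustrative):

import numpy as np

def decision(X, w, w0):
    # d(x) = <w, x> + w_0 for every row of X
    return X.dot(w) + w0

def mse_loss(X, y, w, w0):
    # (1/N) * sum of (d(x_i) - y_i)^2
    return np.mean((decision(X, w, w0) - y) ** 2)

def logistic_loss(X, y, w, w0):
    # sum of ln(1 + exp(-y_i d(x_i))), y_i = ±1
    margins = y * decision(X, w, w0)
    return np.sum(np.logaddexp(0.0, -margins))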

Page 4: MLHEP Lectures - day 3, basic track

Linear models: support vector machine (SVM)

Hinge loss:

L_hinge(x_i, y_i) = max(0, 1 − y_i d(x_i))

Margin: y_i d(x_i) > 1 → no penalty

Page 5: MLHEP Lectures - day 3, basic track

Kernel trick

We can project data into a higher-dimensional space, e.g. by adding new features.

Hopefully, in the new space the distributions are separable.

Page 6: MLHEP Lectures - day 3, basic track

Kernel trick

P is a projection operator:

w = Σ_i α_i P(x_i)

d(x) = <w, P(x)>_new

We need only the kernel:

K(x, x̃) = <P(x), P(x̃)>_new

d(x) = Σ_i α_i K(x_i, x)

Popular choices: polynomial kernel and RBF kernel.
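In scikit-learn the kernel trick comes built into SVC; a sketch on toy data that is not linearly separable in the original space (parameter values are only illustrative):

from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_circles(n_samples=500, noise=0.1, factor=0.4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# RBF kernel: K(x, x~) = exp(-gamma ||x - x~||^2)
clf_rbf = SVC(kernel='rbf', gamma=1.0).fit(X_train, y_train)
# polynomial kernel: K(x, x~) = (gamma <x, x~> + coef0)^degree
clf_poly = SVC(kernel='poly', degree=3).fit(X_train, y_train)

print(clf_rbf.score(X_test, y_test), clf_poly.score(X_test, y_test))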

Page 7: MLHEP Lectures - day 3, basic track

Regularizations

ℒ = (1/N) Σ_i L(x_i, y_i) + ℒ_reg → min

L2 regularization: ℒ_reg = α Σ_j |w_j|²

L1 regularization: ℒ_reg = β Σ_j |w_j|

L1 + L2 regularization: ℒ_reg = α Σ_j |w_j|² + β Σ_j |w_j|
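For linear regression these three penalties correspond to Ridge (L2), Lasso (L1) and ElasticNet (L1 + L2) in scikit-learn; a small sketch on synthetic data (regularization strengths are arbitrary):

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso, ElasticNet

X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)                    # L2 penalty
lasso = Lasso(alpha=1.0).fit(X, y)                    # L1 penalty, drives some w_j exactly to 0
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)  # mix of L1 and L2

print((lasso.coef_ == 0).sum(), "coefficients zeroed by the L1 penalty")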

Page 8: MLHEP Lectures - day 3, basic track

Stochastic optimization methods

Stochastic gradient descent (can be applied to additive loss functions):

take i — a random event from the training data

w ← w − η ∂L(x_i, y_i) / ∂w
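A bare-bones sketch of this update for the MSE loss (data, learning rate and number of steps are arbitrary illustration values):

import numpy as np

rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 5))
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X.dot(true_w) + rng.normal(scale=0.1, size=1000)

w, w0, eta = np.zeros(5), 0.0, 0.01
for step in range(10000):
    i = rng.randint(len(X))                  # take a random event
    grad = 2 * (X[i].dot(w) + w0 - y[i])     # dL_mse / dd for this event
    w -= eta * grad * X[i]                   # w <- w - eta * dL/dw
    w0 -= eta * grad
print(w)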

Page 9: MLHEP Lectures - day 3, basic track

Decision trees

- building an optimal tree is NP-complete; heuristic: use greedy optimization
- optimization criteria (impurities): misclassification, Gini, entropy

Page 10: MLHEP Lectures - day 3, basic track

Decision trees for regression

Optimizing MSE, the prediction inside a leaf is constant.

Page 11: MLHEP Lectures - day 3, basic track

Overfitting in decision trees

- pre-stopping
- post-pruning
- unstable to changes in the training dataset

Page 12: MLHEP Lectures - day 3, basic track

Random Forest

Many trees built independently:

- bagging of samples
- subsampling of features

Simple voting is used to get the prediction of the ensemble.

Page 13: MLHEP Lectures - day 3, basic track

Random Forest

- overfitted (in the sense that predictions for train and test are different)
- doesn't overfit: increasing complexity (adding more trees) doesn't spoil the classifier

Page 14: MLHEP Lectures - day 3, basic track

Random Forest

- simple and parallelizable
- doesn't require much tuning
- hardly interpretable, but feature importances can be computed
- doesn't fix samples poorly classified at previous stages
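A sketch of the typical scikit-learn workflow, including the feature importances mentioned above (toy data, illustrative settings):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

# many trees built independently on bootstrapped samples and random feature subsets
forest = RandomForestClassifier(n_estimators=300, max_features='sqrt', n_jobs=-1, random_state=0)
forest.fit(X, y)

print(forest.feature_importances_)   # the model is hardly interpretable, but importances are available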

Page 15: MLHEP Lectures - day 3, basic track

Ensembles

Averaging decision functions:

D(x) = (1/J) Σ_{j=1..J} d_j(x)

Weighted decision:

D(x) = Σ_j α_j d_j(x)

Page 16: MLHEP Lectures - day 3, basic track

Sample weights in ML

Can be used with many estimators. We now have triples x_i, y_i, w_i (i — index of an event).

- weight corresponds to the frequency of observation
- expected behavior: w_i = n is the same as having n copies of the i-th event
- global normalization of weights doesn't matter

Page 17: MLHEP Lectures - day 3, basic track

Sample weights in ML

Can be used with many estimators. We now have triples x_i, y_i, w_i (i — index of an event).

- weight corresponds to the frequency of observation
- expected behavior: w_i = n is the same as having n copies of the i-th event
- global normalization of weights doesn't matter

Example for logistic regression:

ℒ = Σ_i w_i L(x_i, y_i) → min

Page 18: MLHEP Lectures - day 3, basic track

Weights (parameters) of a classifier ≠ sample weights.

In code:

from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(max_depth=4)
tree.fit(X, y, sample_weight=weights)

Sample weights are a convenient way to regulate the importance of training events.

Only sample weights are meant when talking about AdaBoost.

Page 19: MLHEP Lectures - day 3, basic track

AdaBoost [Freund, Schapire, 1995]

Bagging: information from previous trees is not taken into account.

Adaptive Boosting is a weighted composition of weak learners:

D(x) = Σ_j α_j d_j(x)

We assume d_j(x) = ±1 and labels y_i = ±1.

The j-th weak learner misclassified the i-th event iff y_i d_j(x_i) = −1.

Page 20: MLHEP Lectures - day 3, basic track

AdaBoost

D(x) = Σ_j α_j d_j(x)

Weak learners are built in sequence; each classifier is trained using different weights.

Initially w_i = 1 for each training sample.

After building the j-th base classifier:

1. compute the total weights of correctly and wrongly classified events and set

α_j = (1/2) ln(w_correct / w_wrong)

2. increase the weights of misclassified samples:

w_i ← w_i × e^(−α_j y_i d_j(x_i))
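A compact NumPy / scikit-learn sketch of exactly this loop (discrete AdaBoost with depth-1 trees; an illustration of the update rules, not the author's code):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
y = 2 * y - 1                                    # labels to ±1

w = np.ones(len(X))                              # initially w_i = 1
alphas, stumps = [], []
for j in range(100):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    w_correct = w[pred == y].sum()
    w_wrong = w[pred != y].sum()
    alpha = 0.5 * np.log(w_correct / w_wrong)    # alpha_j = 1/2 ln(w_correct / w_wrong)
    w *= np.exp(-alpha * y * pred)               # increase weights of misclassified samples
    alphas.append(alpha)
    stumps.append(stump)

D = sum(a * s.predict(X) for a, s in zip(alphas, stumps))   # D(x) = sum_j alpha_j d_j(x)
print((np.sign(D) == y).mean())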

Page 21: MLHEP Lectures - day 3, basic track

AdaBoost example

Decision trees of depth 1 will be used.

Page 22: MLHEP Lectures - day 3, basic track


Page 23: MLHEP Lectures - day 3, basic track


Page 24: MLHEP Lectures - day 3, basic track

(1, 2, 3, 100 trees)

Page 25: MLHEP Lectures - day 3, basic track

AdaBoost secret

D(x) = Σ_j α_j d_j(x)

ℒ = Σ_i L(x_i, y_i) = Σ_i exp(−y_i D(x_i)) → min

- the sample weight is equal to the penalty for the event: w_i = L(x_i, y_i) = exp(−y_i D(x_i))
- α_j is obtained as a result of analytical optimization

Exercise: prove the formula for α_j.

Page 26: MLHEP Lectures - day 3, basic track

Loss function of AdaBoost


Page 27: MLHEP Lectures - day 3, basic track

AdaBoost summary

- is able to combine many weak learners
- takes mistakes into account
- simple; the overhead for boosting is negligible
- too sensitive to outliers

In scikit-learn, one can run AdaBoost over other algorithms.
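For instance, with a linear model as the base learner (a sketch; any classifier that supports sample weights can be plugged in):

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, random_state=0)

# boosting over logistic regression instead of the default shallow trees
ada = AdaBoostClassifier(LogisticRegression(), n_estimators=50)
ada.fit(X, y)
print(ada.score(X, y))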

Page 28: MLHEP Lectures - day 3, basic track

Gradient Boosting


Page 29: MLHEP Lectures - day 3, basic track

Decision trees for regression


Page 30: MLHEP Lectures - day 3, basic track

Gradient boosting to minimize MSE

Say we're trying to build an ensemble to minimize MSE:

Σ_i (D(x_i) − y_i)² → min

The ensemble's prediction is obtained by taking the sum of the individual predictions:

D(x) = Σ_j d_j(x),    D_j(x) = Σ_{j′=1..j} d_{j′}(x) = D_{j−1}(x) + d_j(x)

Assuming that we have already built j − 1 estimators, how do we train the next one?

Page 31: MLHEP Lectures - day 3, basic track

The natural solution is to greedily minimize MSE:

Σ_i (D_j(x_i) − y_i)² = Σ_i (D_{j−1}(x_i) + d_j(x_i) − y_i)² → min

Introduce the residual R_j(x_i) = y_i − D_{j−1}(x_i); now we simply need to minimize MSE:

Σ_i (d_j(x_i) − R_j(x_i))² → min

So the j-th estimator (tree) is trained using the following data: x_i, R_j(x_i).
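A sketch of this residual-fitting loop with shallow regression trees (toy data, illustrative settings):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=500)

trees, D = [], np.zeros(len(X))
for j in range(100):
    residual = y - D                                   # R_j(x_i) = y_i - D_{j-1}(x_i)
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    trees.append(tree)
    D += tree.predict(X)                               # D_j = D_{j-1} + d_j

print(np.mean((D - y) ** 2))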

Page 32: MLHEP Lectures - day 3, basic track

Example: regression with GB

using regression trees of depth = 2

Page 33: MLHEP Lectures - day 3, basic track

number of trees = 1, 2, 3, 100

Page 35: MLHEP Lectures - day 3, basic track

Gradient Boosting [Friedman, 1999]

Composition of weak regressors:

D(x) = Σ_j α_j d_j(x)

Borrow the approach for encoding probabilities from logistic regression:

p_{+1}(x) = σ(D(x)),    p_{−1}(x) = σ(−D(x))

Optimization of the log-likelihood (y_i = ±1):

ℒ = Σ_i L(x_i, y_i) = Σ_i ln(1 + e^(−y_i D(x_i))) → min

Page 36: MLHEP Lectures - day 3, basic track

Gradient Boosting

D(x) = Σ_j α_j d_j(x)

ℒ = Σ_i ln(1 + e^(−y_i D(x_i))) → min

Optimization problem: find all α_j and weak learners d_j. Mission impossible.

Page 37: MLHEP Lectures - day 3, basic track

Gradient Boosting

D(x) = Σ_j α_j d_j(x)

ℒ = Σ_i ln(1 + e^(−y_i D(x_i))) → min

Optimization problem: find all α_j and weak learners d_j. Mission impossible.

Main point: greedy optimization of the loss function by training one more weak learner d_j. Each new estimator follows the gradient of the loss function.

Page 38: MLHEP Lectures - day 3, basic track

Gradient Boosting

Gradient boosting ~ steepest gradient descent.

D_j(x) = Σ_{j′=1..j} α_{j′} d_{j′}(x),    D_j(x) = D_{j−1}(x) + α_j d_j(x)

At the j-th iteration:

- compute the pseudo-residual R(x_i) = − ∂ℒ/∂D(x_i), evaluated at D(x) = D_{j−1}(x)
- train a regressor d_j to minimize MSE: Σ_i (d_j(x_i) − R(x_i))² → min
- find the optimal α_j

Important exercise: compute the pseudo-residuals for the MSE and logistic losses.

Page 39: MLHEP Lectures - day 3, basic track

Additional GB tricks

To make training more stable, add a learning rate η:

D(x) = η Σ_j α_j d_j(x)

Randomization to fight noise and build different trees:

- subsampling of features
- subsampling of training samples
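These knobs map directly onto scikit-learn's GradientBoostingClassifier (the values below are only illustrative, not recommendations):

from sklearn.ensemble import GradientBoostingClassifier

gb = GradientBoostingClassifier(
    n_estimators=500,
    learning_rate=0.05,   # eta: shrinks the contribution of each tree
    subsample=0.8,        # random subsampling of training samples for each tree
    max_features=0.5,     # random subsampling of features for each split
    max_depth=3,
)
# gb.fit(X, y) as usual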

Page 40: MLHEP Lectures - day 3, basic track

AdaBoost is a particular case of gradient boosting with a different target loss function*:

ℒ_ada = Σ_i e^(−y_i D(x_i)) → min

This loss function is called ExpLoss or AdaLoss.

*(also, AdaBoost expects that d_j(x_i) = ±1)

Page 41: MLHEP Lectures - day 3, basic track

Loss functions

Gradient boosting can optimize different smooth loss functions.

Regression, y ∈ ℝ:

- Mean Squared Error: Σ_i (d(x_i) − y_i)²
- Mean Absolute Error: Σ_i |d(x_i) − y_i|

Binary classification, y_i = ±1:

- ExpLoss (aka AdaLoss): Σ_i e^(−y_i d(x_i))
- LogLoss: Σ_i log(1 + e^(−y_i d(x_i)))

Page 42: MLHEP Lectures - day 3, basic track


Page 43: MLHEP Lectures - day 3, basic track


Page 44: MLHEP Lectures - day 3, basic track

Usage of second-order information

For an additive loss function, apart from the gradients g_i we can also make use of the second derivatives h_i.

E.g. select leaf values using a second-order step:

ℒ = Σ_i L(D_j(x_i), y_i) = Σ_i L(D_{j−1}(x_i) + d_j(x_i), y_i) ≃ Σ_i [ L(D_{j−1}(x_i), y_i) + g_i d_j(x_i) + (h_i / 2) d_j²(x_i) ]

Since the tree d_j predicts a constant value w_leaf inside each leaf, up to a constant

ℒ ≃ Σ_leaf ( g_leaf w_leaf + (h_leaf / 2) w²_leaf ) → min

Page 45: MLHEP Lectures - day 3, basic track

Using second-order information

Independent optimization. Explicit solution for optimal values in the leaves:

a + ( + ) → minj+1 ∑leaf

gleafwleafhleaf

2w2

leaf

where = , = .gleaf ∑i!leaf

gi hleaf ∑i!leaf

hi

= +wleafgleaf

hleaf

44 / 101

Page 46: MLHEP Lectures - day 3, basic track

Using second-order information: recipe

On each iteration of gradient boosting:

1. train a tree to follow the gradient (minimize MSE with the gradient)
2. change the values assigned in the leaves to: w_leaf ← − g_leaf / h_leaf
3. update the predictions (no weight for the estimator: α_j = 1): D_j(x) = D_{j−1}(x) + η d_j(x)

This improvement is quite cheap and allows smaller GBs to be more effective. We can also use information about the hessians at the tree-building step (step 1).
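A sketch of step 2 for a single tree, assuming per-event gradients g and hessians h have already been computed (array names are illustrative; leaf indices of a fitted scikit-learn tree can be obtained with tree.apply(X)):

import numpy as np

def second_order_leaf_values(leaf_index, g, h):
    # leaf_index[i] is the leaf that event i falls into (e.g. tree.apply(X));
    # returns w_leaf = -g_leaf / h_leaf for every leaf
    values = {}
    for leaf in np.unique(leaf_index):
        mask = leaf_index == leaf
        values[leaf] = -g[mask].sum() / h[mask].sum()
    return values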

Page 47: MLHEP Lectures - day 3, basic track

Multiclass classification: ensembling

One-vs-one: n_classes (n_classes − 1) / 2 classifiers; one-vs-rest: n_classes classifiers.

scikit-learn implements those as meta-algorithms.
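A sketch of both meta-algorithms wrapped around a binary classifier (dataset and base estimator are arbitrary choices):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

X, y = load_iris(return_X_y=True)

ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)  # n_classes models
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)   # n_classes (n_classes - 1) / 2 models
print(len(ovr.estimators_), len(ovo.estimators_))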

Page 48: MLHEP Lectures - day 3, basic track

Multiclass classification: modifying an algorithm

Most classifiers have natural generalizations to multiclass classification.

Example for logistic regression: introduce for each class c ∈ {1, 2, …, C} a vector w_c:

d_c(x) = <w_c, x>

Convert to probabilities using the softmax function:

p_c(x) = e^(d_c(x)) / Σ_c̃ e^(d_c̃(x))

and minimize LogLoss:

ℒ = − Σ_i log p_{y_i}(x_i)

Page 49: MLHEP Lectures - day 3, basic track

Softmax function

A typical way to convert n numbers to n probabilities.

The mapping is surjective, but not injective (n dimensions to n − 1 dimensions).

Invariant to a global shift: d_c(x) → d_c(x) + const.

For the case of two classes:

p_1(x) = e^(d_1(x)) / (e^(d_1(x)) + e^(d_2(x))) = 1 / (1 + e^(−(d_1(x) − d_2(x))))

Coincides with the logistic function for d(x) = d_1(x) − d_2(x).
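A numerically stable NumPy version, using the shift invariance noted above:

import numpy as np

def softmax(scores):
    # scores: array of shape (n_samples, n_classes) with the values d_c(x)
    shifted = scores - scores.max(axis=1, keepdims=True)   # invariance to a global shift
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

print(softmax(np.array([[1.0, 2.0, 3.0]])))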

Page 50: MLHEP Lectures - day 3, basic track

Loss function: ranking example

In ranking we need to order items by y_i:

y_i < y_j ⇒ d(x_i) < d(x_j)

We can penalize for misordering:

ℒ = Σ_{i, ĩ} L(x_i, x_ĩ, y_i, y_ĩ)

L(x, x̃, y, ỹ) = σ(d(x) − d(x̃)) if y < ỹ, and 0 otherwise

Page 51: MLHEP Lectures - day 3, basic track

Adapting boosting

By modifying boosting or changing the loss function we can solve different problems:

- classification
- regression
- ranking

HEP-specific examples in Tatiana's lecture tomorrow.

Page 52: MLHEP Lectures - day 3, basic track

Gradient Boosting classification playground


Page 53: MLHEP Lectures - day 3, basic track

n³-minutes break

Page 54: MLHEP Lectures - day 3, basic track

Recapitulation: AdaBoost

Minimizes

ℒ_ada = Σ_i e^(−y_i D(x_i)) → min

by increasing the weights of misclassified samples:

w_i ← w_i × e^(−α_j y_i d_j(x_i))

Page 55: MLHEP Lectures - day 3, basic track

Gradient Boosting overview

A powerful ensembling technique (typically used over trees, GBDT):

- a general way to optimize differentiable losses
- can be adapted to other problems
- 'follows' the gradient of the loss at each step
- makes steps in the space of functions
- the gradient for poorly-classified events is higher
- increasing the number of trees can drive to overfitting (= getting worse quality on new data)
- requires tuning; better when trees are not complex
- widely used in practice

Page 56: MLHEP Lectures - day 3, basic track

Feature engineering

Feature engineering = creating features to get the best result with ML.

- an important step
- mostly relies on domain knowledge
- requires some understanding
- most of practitioners' time is spent at this step

Page 57: MLHEP Lectures - day 3, basic track

Feature engineering

- Analyze the available features: scale and shape of features.
- Analyze which information is lacking; challenge example: maybe subleading jets matter?
- Validate your guesses.

Page 58: MLHEP Lectures - day 3, basic track

Feature engineering

- Analyze the available features: scale and shape of features.
- Analyze which information is lacking; challenge example: maybe subleading jets matter?
- Validate your guesses.

Machine learning is a proper tool for checking your understanding of data.

Page 59: MLHEP Lectures - day 3, basic track

Linear models example

A single event with a sufficiently large value of a feature can break almost all linear models.

Heavy-tailed distributions are harmful; pre-transforming is required:

- logarithm
- power transform
- throwing out outliers

The same tricks actually help more advanced methods too.

Which transformation is the best for Random Forest?

Page 60: MLHEP Lectures - day 3, basic track

One-hot encoding

Categorical features (= 'not orderable'), being one-hot encoded, are easier for ML models to operate with.
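A sketch with scikit-learn (the categorical values are made up; pandas.get_dummies is an equally common choice):

import numpy as np
from sklearn.preprocessing import OneHotEncoder

particle_type = np.array([['muon'], ['electron'], ['muon'], ['pion']])

encoder = OneHotEncoder()
print(encoder.fit_transform(particle_type).toarray())
# one binary column per category instead of an arbitrary integer ordering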

Page 61: MLHEP Lectures - day 3, basic track

Decision tree example

η_lepton is hard for a tree to use, since it provides no good splitting.

Don't forget that a tree can't reconstruct linear combinations — take care of this.

Page 62: MLHEP Lectures - day 3, basic track

Example of a feature: invariant mass

Using HEP coordinates, the invariant mass of two products is:

m²_inv ≈ 2 p_T1 p_T2 (cosh(η_1 − η_2) − cos(φ_1 − φ_2))

It can't be recovered with ensembles of trees of depth < 4 when using only the canonical features.

What about the invariant mass of 3 particles?

Page 63: MLHEP Lectures - day 3, basic track

Example of a feature: invariant mass

Using HEP coordinates, the invariant mass of two products is:

m²_inv ≈ 2 p_T1 p_T2 (cosh(η_1 − η_2) − cos(φ_1 − φ_2))

It can't be recovered with ensembles of trees of depth < 4 when using only the canonical features.

What about the invariant mass of 3 particles? (see Vicens' talk today)

Good features are ones that are explainable by physics. Start from the simplest and most natural ones.

Mind the cost of computing the features.
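A sketch of adding this feature to a table of events (toy values; the column names pt1, eta1, phi1, ... are hypothetical):

import numpy as np
import pandas as pd

events = pd.DataFrame({
    'pt1': [30.0, 45.0], 'eta1': [0.5, -1.2], 'phi1': [0.1, 2.0],
    'pt2': [25.0, 20.0], 'eta2': [1.1, -0.3], 'phi2': [-2.8, 1.1],
})

# m_inv^2 ~ 2 pT1 pT2 (cosh(eta1 - eta2) - cos(phi1 - phi2)), massless approximation
m2 = 2 * events.pt1 * events.pt2 * (
    np.cosh(events.eta1 - events.eta2) - np.cos(events.phi1 - events.phi2))
events['m_inv'] = np.sqrt(m2)
print(events[['m_inv']])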

Page 64: MLHEP Lectures - day 3, basic track

Output engineering

Typically not discussed, but the target of learning plays an important role.

Example: predicting the number of upvotes for a comment, assuming the error is MSE:

ℒ = Σ_i (d(x_i) − y_i)² → min

Consider two cases:

- 100 comments when 0 were predicted
- 1200 comments when 1000 were predicted

Page 65: MLHEP Lectures - day 3, basic track

Output engineering

Ridiculously large impact of highly-commented articles. We need to predict the order, not the exact number of comments.

Possible solutions:

- alternate loss function, e.g. use MAPE
- apply a logarithm to the target, predict log(# comments)

The evaluation score should be changed accordingly.
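The second option in code (a sketch; the heavy-tailed toy target and the choice of regressor are only for illustration):

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 5))
n_comments = np.round(np.exp(2 + X[:, 0] + rng.normal(size=1000)))   # heavy-tailed toy target

# train on log(1 + #comments): errors become relative rather than absolute
model = GradientBoostingRegressor().fit(X, np.log1p(n_comments))
predicted = np.expm1(model.predict(X))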

Page 66: MLHEP Lectures - day 3, basic track

Sample weights

Typically used to estimate the contribution of an event (how often we expect this to happen).

Sample weights also matter in some other situations:

- highly imbalanced datasets (e.g. 99% of events in class 0) tend to have problems during optimization
- changing sample weights to balance the dataset frequently helps
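One way to re-balance classes via sample weights in scikit-learn (a sketch on a toy imbalanced dataset):

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.utils.class_weight import compute_sample_weight

X, y = make_classification(n_samples=5000, weights=[0.99], random_state=0)

# 'balanced' weights make the total weight of both classes equal
weights = compute_sample_weight('balanced', y)
clf = GradientBoostingClassifier().fit(X, y, sample_weight=weights)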

Page 67: MLHEP Lectures - day 3, basic track

Feature selection

Why?

- speed up training / prediction
- reduce the time of data preparation in the pipeline
- help the algorithm to 'focus' on finding reliable dependencies (useful when the amount of training data is limited)

Problem: find a subset of features which provides the best quality.

Page 68: MLHEP Lectures - day 3, basic track

Feature selection

Exhaustive search: 2^d cross-validations.

- an incredible amount of resources
- too many cross-validation cycles drive to an overly-optimistic quality on the test data

Page 69: MLHEP Lectures - day 3, basic track

Feature selection

Exhaustive search: 2^d cross-validations.

- an incredible amount of resources
- too many cross-validation cycles drive to an overly-optimistic quality on the test data
- a basic nice solution: estimate importances with RF / GBDT

Page 70: MLHEP Lectures - day 3, basic track

Feature selection

Exhaustive search: 2^d cross-validations.

- an incredible amount of resources
- too many cross-validation cycles drive to an overly-optimistic quality on the test data
- a basic nice solution: estimate importances with RF / GBDT

Filtering methods

Eliminate variables which seem not to carry statistical information about the target, e.g. by measuring Pearson correlation or mutual information.

Example: all angles φ will be thrown out.

Page 71: MLHEP Lectures - day 3, basic track

Feature selection: embedded methods

Feature selection is a part of training. Example: L1-regularized linear models.

Forward selection

Start from an empty set of features. For each feature in the dataset, check whether adding this feature improves the quality.

Backward elimination

Almost the same, but this time we iteratively eliminate features.

A bidirectional combination of the above is possible; some algorithms can use a previously trained model as the new starting point for optimization.

Page 72: MLHEP Lectures - day 3, basic track

Unsupervised dimensionality reduction


Page 73: MLHEP Lectures - day 3, basic track

Principal component analysis [Pearson, 1901]

PCA is finding the axes along which the variance is maximal.

Page 74: MLHEP Lectures - day 3, basic track

PCA description

PCA is based on the principal axis theorem:

Q = U Λ Uᵀ

where Q is the covariance matrix of the dataset, U is an orthogonal matrix and Λ is a diagonal matrix,

Λ = diag(λ_1, λ_2, …, λ_n),    λ_1 ≥ λ_2 ≥ ⋯ ≥ λ_n
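Standard usage in scikit-learn (a sketch; features of different nature should be scaled first):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print(pca.explained_variance_ratio_)   # fraction of the variance along each principal axis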

Page 75: MLHEP Lectures - day 3, basic track

PCA optimization visualized


Page 76: MLHEP Lectures - day 3, basic track

PCA: eigenfaces

Emotion = α[scared] + β[laughs] + γ[angry] + …

Page 77: MLHEP Lectures - day 3, basic track

Locally linear embedding

Handles the case of non-linear dimensionality reduction.

Express each sample as a convex combination of its neighbours:

Σ_i ||x_i − Σ_ĩ w_{iĩ} x_ĩ||² → min_w

Page 78: MLHEP Lectures - day 3, basic track

Locally linear embedding

Σ_i ||x_i − Σ_ĩ w_{iĩ} x_ĩ||² → min_w

subject to the constraints: Σ_ĩ w_{iĩ} = 1, w_{iĩ} ≥ 0, and w_{iĩ} = 0 if i, ĩ are not neighbors.

Finding an optimal mapping for all points simultaneously (y_i are the images — positions in the new space):

Σ_i ||y_i − Σ_ĩ w_{iĩ} y_ĩ||² → min_y

Page 79: MLHEP Lectures - day 3, basic track

PCA and LLE


Page 80: MLHEP Lectures - day 3, basic track

Isomap

Isomap is targeted to preserve the geodesic distance on the manifold between two points.

Page 81: MLHEP Lectures - day 3, basic track

Supervised dimensionality reduction


Page 82: MLHEP Lectures - day 3, basic track

Fisher's LDA (Linear Discriminant Analysis) [1936]

Original idea: find a projection that discriminates the classes best.

Page 83: MLHEP Lectures - day 3, basic track

Fisher's LDA

Mean and variance within a single class c (c ∈ {1, 2, …, C}):

μ_c = <x>_{events of class c}
σ_c = <||x − μ_c||²>_{events of class c}

Total within-class variance: σ_within = Σ_c p_c σ_c

Total between-class variance: σ_between = Σ_c p_c ||μ_c − μ||²

Goal: find a projection that maximizes the ratio σ_between / σ_within.

Page 84: MLHEP Lectures - day 3, basic track

Fisher's LDA


Page 85: MLHEP Lectures - day 3, basic track

LDA: solving the optimization problem

We are interested in finding a 1-dimensional projection w:

max_w (wᵀ Σ_between w) / (wᵀ Σ_within w)

- naturally connected to the generalized eigenvalue problem
- the projection vector corresponds to the highest generalized eigenvalue
- finds a subspace of C − 1 components when applied to a classification problem with C classes

Fisher's LDA is a basic popular binary classification technique.
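In scikit-learn, which also exposes the reduced representation (a sketch on a standard 3-class dataset):

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

lda = LinearDiscriminantAnalysis(n_components=2)   # at most C - 1 components, here C = 3
X_proj = lda.fit_transform(X, y)                   # supervised projection
print(lda.score(X, y))                             # LDA can also be used directly as a classifier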

Page 86: MLHEP Lectures - day 3, basic track

Common spatial patterns

When we expect that each class is close to some linear subspace, we can

- (naively) find this subspace for each class by PCA
- (better idea) take into account the variation of the other data and optimize

tr(Wᵀ Σ_class W) → max_W,    subject to Wᵀ Σ_total W = I

The natural generalization is to take several components: W ∈ ℝ^(n×n_1) is a projection matrix; n and n_1 are the numbers of dimensions in the original and new spaces.

Frequently used in the neurosciences, in particular in BCI based on EEG / MEG.

Page 87: MLHEP Lectures - day 3, basic track

Common spatial patterns

The patterns found describe the projection into a 6-dimensional space.

Page 88: MLHEP Lectures - day 3, basic track

Dimensionality reduction summary

- is capable of extracting sensible features from highly-dimensional data
- frequently used to 'visualize' the data
- nonlinear methods rely on the distances in the space
- works well with highly-dimensional spaces whose features are of the same nature

Page 89: MLHEP Lectures - day 3, basic track

Finding optimal hyperparameters

- some algorithms have many parameters (regularizations, depth, learning rate, ...)
- not all the parameters can be guessed
- checking all combinations takes too long

Page 90: MLHEP Lectures - day 3, basic track

Finding optimal hyperparameters

- some algorithms have many parameters (regularizations, depth, learning rate, ...)
- not all the parameters can be guessed
- checking all combinations takes too long

We need automated hyperparameter optimization!

Page 91: MLHEP Lectures - day 3, basic track

Finding optimal parameters

- randomly picking parameters is a partial solution
- given a target optimal value, we can optimize it
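The random-search baseline is readily available in scikit-learn (a sketch; the parameter ranges are illustrative):

from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=2000, random_state=0)

search = RandomizedSearchCV(
    GradientBoostingClassifier(),
    param_distributions={
        'max_depth': randint(2, 6),
        'learning_rate': uniform(0.01, 0.2),
        'n_estimators': randint(50, 500),
    },
    n_iter=20, cv=3, random_state=0)
search.fit(X, y)
print(search.best_params_)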

Page 92: MLHEP Lectures - day 3, basic track

Finding optimal parameters

- randomly picking parameters is a partial solution
- given a target optimal value, we can optimize it
- no gradient with respect to the parameters
- noisy results

Page 93: MLHEP Lectures - day 3, basic track

Finding optimal parameters

- randomly picking parameters is a partial solution
- given a target optimal value, we can optimize it
- no gradient with respect to the parameters
- noisy results
- function reconstruction is a problem

Page 94: MLHEP Lectures - day 3, basic track

Finding optimal parameters

- randomly picking parameters is a partial solution
- given a target optimal value, we can optimize it
- no gradient with respect to the parameters
- noisy results
- function reconstruction is a problem

Before running grid optimization, make sure your metric is stable (e.g. by training / testing on different subsets).

Overfitting (= getting a too optimistic estimate of the quality on a holdout) by using many attempts is a real issue.

Page 95: MLHEP Lectures - day 3, basic track

Optimal grid search

- stochastic optimization (Metropolis-Hastings, annealing): requires too many evaluations and uses only the last checked combination
- regression techniques, reusing all known information (ML to optimize ML!)

Page 96: MLHEP Lectures - day 3, basic track

Optimal grid search using regression

General algorithm (a point of the grid = a set of parameters):

1. evaluate the quality at several random points
2. build a regression model based on the known results
3. select the point with the best expected quality according to the trained model
4. evaluate the quality at this point
5. go to 2 if not enough evaluations

Why not use linear regression?

Exploration vs. exploitation trade-off: should we explore poorly-covered regions or try to improve around what is currently seen to be optimal?

Page 97: MLHEP Lectures - day 3, basic track

Gaussian processes for regression

Some definitions: Y ∼ GP(m, K), where m and K are the mean and covariance functions: m(x), K(x, x̃).

m(x) = 𝔼 Y(x) represents our prior expectation of the quality (may be taken constant).

K(x, x̃) = 𝔼 Y(x) Y(x̃) represents the influence of known results on the expectation of values at new points.

The RBF kernel is used here too: K(x, x̃) = exp(−c ||x − x̃||²). Another popular choice: K(x, x̃) = exp(−c ||x − x̃||).

We can model the posterior distribution of the results at each point.
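The same ingredients in scikit-learn (a sketch; the (hyperparameter, quality) pairs are made up):

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

params = np.array([[0.01], [0.1], [0.5], [1.0]])   # tried hyperparameter values
quality = np.array([0.71, 0.78, 0.75, 0.69])       # measured qualities (illustrative)

gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.3), alpha=1e-3).fit(params, quality)
mean, std = gp.predict(np.linspace(0, 1, 50).reshape(-1, 1), return_std=True)
# mean is the expected quality; std shows which regions still need exploration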

Page 98: MLHEP Lectures - day 3, basic track

Gaussian Process Demo on Mathematica


Page 99: MLHEP Lectures - day 3, basic track

Gaussian processes

- Gaussian processes model the posterior distribution at each point of the grid
- we know at which points the quality is already well-estimated
- and we are able to find regions which need exploration

See also this demo.

Page 100: MLHEP Lectures - day 3, basic track

Summary about hyperoptimization

- parameters can be tuned automatically
- be sure that the metric being optimized is stable
- mind the optimistic quality estimation (resolved by one more holdout)

Page 101: MLHEP Lectures - day 3, basic track

Summary about hyperoptimization

- parameters can be tuned automatically
- be sure that the metric being optimized is stable
- mind the optimistic quality estimation (resolved by one more holdout)
- but this is not what you should spend your time on: the gain from properly cooking features / reconsidering the problem is much higher

Page 102: MLHEP Lectures - day 3, basic track
