
Page 1

Data Analysis Exercises

Wouter Verkerke (NIKHEF)

Page 2

Getting started

• The input files (all very small) for the various exercises are located on the web at:
– http://www.nikhef.nl/~verkerke/statcourse/

– NB: If you run Linux, the easiest way to download a file to the local directory is 'curl -O <url>' (or 'wget <url>')

Page 3

Module 1

Page 4

Exercise 3 – Multi-Variate Analysis

• ROOT is distributed with the 'Toolkit for Multivariate Analysis' (TMVA), a toolkit for training and applying many of the multivariate analysis techniques shown in Module 2.
– Here we will run it on a number of sample events

– Copy input file ex3_makesample.C. This is a macro that can generate several ‘toy’ input samples.

– Copy input file ex3_driver.C. This is the driver macro to run the TMVA toolkit. You also need to copy utility files TMVAGui.C and tmvaglob.C

• To make the first sample, execute root -l -b -q 'ex3_makesample.C(0)' from the OS command line (the quotes around the macro call are needed because of the parentheses)
– This makes a file sample0.root which contains a sample of 'toy' background events and 'toy' signal events

– Signal: Gaussian distribution in x (mean=-3, sigma=3)

– Background: Gaussian distribution in x (mean=+3, sigma=3)

Page 5

Exercise 3 – Multi-Variate Analysis

• To analyze the first sample, start a ROOT session
– Execute '.L ex3_driver.C+'; this compiles the driver application (it may take a few minutes). NB: If you don't have a compiler installed on your laptop, this will fail. In that case resort to the interpreted C++ mode by executing '.L ex3_driver.C'

– Now analyze the first sample by issuing the following command at the ROOT prompt: ex3_driver(0,"x","Fisher,BDT,MLP") ;

– This will train a Fisher discriminant, a Boosted Decision Tree and a Multi-Layer Perceptron on this sample (a sketch of roughly what the driver does internally follows at the end of this page)

– Observe how e.g. a BDT trains a lot quicker than an MLP, but takes longer to evaluate on the data.

– While the training is running (~5 minutes), take a moment to review the techniques being trained in the slides of Module 2

– When the training is finished, a window will pop up with various choices.
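
For orientation, the following is a minimal sketch of what a TMVA classification driver like ex3_driver.C typically does. This is not the actual ex3_driver.C (which is not reproduced here): the file and tree names are assumptions, and on recent ROOT versions (6.04 and later) the variable and tree registration goes through a TMVA::DataLoader instead of the Factory.

// Minimal TMVA classification sketch (run inside a ROOT session).
// Assumes sample0.root contains signal/background TTrees named "sig" and "bkg".
TFile* in = TFile::Open("sample0.root");
TTree* sigTree = (TTree*) in->Get("sig");
TTree* bkgTree = (TTree*) in->Get("bkg");

TFile* out = TFile::Open("tmva_ex3.root", "RECREATE");
TMVA::Factory factory("ex3", out, "AnalysisType=Classification");
factory.AddVariable("x", 'F');              // the single observable
factory.AddSignalTree(sigTree);
factory.AddBackgroundTree(bkgTree);
factory.PrepareTrainingAndTestTree("", ""); // default train/test split
factory.BookMethod(TMVA::Types::kFisher, "Fisher", "");
factory.BookMethod(TMVA::Types::kBDT,    "BDT",    "");
factory.BookMethod(TMVA::Types::kMLP,    "MLP",    "");
factory.TrainAllMethods();
factory.TestAllMethods();
factory.EvaluateAllMethods();
out->Close();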

Page 6

Exercise 3 – Multi-Variate Analysis

• Explore the following menu items in this order
– 1a) Input distributions (just one for this example)

– 4a) Classifier output distributions (observe the characteristic spikiness of the BDT output)

– 4b) Same, but with an overlay of both test and training samples. Differences between these are indicative of overtraining (not present in this sample)

– 5b) ROC curve (signal vs. background efficiency)

– 5a) Efficiency curves (these show signal and background efficiency vs. the cut on the discriminant, as well as S/sqrt(S+B), which helps to find the optimal cut for a given amount of signal and background)

– Change the amounts of signal and background in the dialog box and see how the S/sqrt(S+B) changes shape

– 9) – 11) Control plots for individual algorithms (these show e.g. network architecture, BDT structure etc...)

Page 7

Exercise 3 – Multi-Variate Analysis

• Now create sample 1 by running root -l -b -q 'ex3_makesample.C(1)'
– Sample 1 has three observables: x, y, z

– Signal = Gaussian(x,0,3)*Gaussian(y,0,3)*Gaussian(z,0,3)

– Background = flat in (x,y,z)

• Now analyze sample 1
– Add the likelihood discriminant:

– ex3_driver(1,"x,y,z","Fisher,BDT,MLP,Likelihood")

– And look at the plots as for sample 0, but also add the plots that show the correlations (2a) and the plots that show the performance of decorrelation (2b,2c,2d,1b,1c,1d)

• You will see that the performance of Fisher is very good for sample 0, but much worse for sample 1
– Try to understand why that is (see the slides on the Fisher discriminant and the MLP)

Page 8

Exercise 3 – Multi-Variate Analysis

• Sample 2
– Signal and background differ only in their correlation information

– Signal = Gaussian({x,y,z},0,3) with 80% correlation between (x,y) and 50% correlation between (y,z)

– Background = Gaussian({x,y,z},0,3) with 80% anti-correlation between (x,y) and 50% anti-correlation between (y,z)

• Sample 3
– Multi-dimensional variant of sample 0

– Signal = Gaussian(x,-3,3)*Gaussian(y,-3,3)*Gaussian(z,-3,3)

– Background = Gaussian(x,+3,5)*Gaussian(y,+3,5)*Gaussian(z,+3,5)

• Sample 4
– Intertwined donuts. Two observables (x,y)

– Signal = donut in (x,y) centered at (-2,-2), radius 5, width 1

– Background = donut in (x,y) centered at (+2,+2), radius 5, width 1

Page 9

Exercise 1 – The central limit theorem

• In Module 1 we saw that the Central Limit Theorem predicts that the sum of N measurements has a Gaussian distribution in the limit N→∞, independent of the distribution of each individual measurement
– In this exercise we will investigate how quickly this convergence happens as a function of N.

– We start with a 'fake' measurement resulting in a value x with a uniform distribution in [0,1] (i.e. very non-Gaussian)

– Then we will look at the distributions of x1+x2, x1+x2+x3, etc. and compare these with the properties of a Gaussian distribution

• Start with file ex1.C
– This macro books a ROOT histogram, runs 10000 experiments, fills the 'measured' value of x into the histogram, and plots the histogram at the end of the run.

– Look at the macro and run it ('root -l ex1.C' from the OS command line, or '.x ex1.C' from the ROOT command line)

Page 10

Exercise 1 – The central limit theorem

• Modify the loop so that instead of filling the result of a single measurement into the histogram, you store the result of Nsum measurements (see the sketch below)
– Allocate a variable xsum that is initialized to zero

– The variable Nsum is already defined in the macro as the first argument of ex1(). Its default value when unspecified is 1.

– Make a loop over j=1,...,Nsum and in the loop add the 'measurement' returned by the 'gRandom...' line to the value of xsum. The histogram defined by the macro already has its range defined as [0,Nsum], so the summed measurement values always fit in the range of the histogram

– Run the macro again, now passing the value 2 as argument for Nsum: '.x ex1.C(2)' (or root -l 'ex1.C(2)' from the OS command line; note that in this case the quotation marks are essential). Look at the distribution

– Repeat for Nsum=3,5,10,20 and 100.
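
A minimal sketch of the modified experiment loop. This is hedged: ex1.C is not reproduced here, so the histogram name (hist) and the use of gRandom->Uniform() are assumptions.

// Hypothetical sketch of the modified loop in ex1.C, assuming the
// macro fills a TH1* named hist and uses gRandom for the measurement.
for (int iexp = 0; iexp < 10000; iexp++) {
  double xsum = 0;                    // sum of Nsum 'measurements'
  for (int j = 1; j <= Nsum; j++) {
    xsum += gRandom->Uniform(0, 1);   // one uniform measurement in [0,1]
  }
  hist->Fill(xsum);                   // histogram range is [0,Nsum]
}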

Page 11

Exercise 1 – The central limit theorem

• You will see that around Nsum=10 the distribution already looks quite Gaussian.
– This is however mostly true for the 'core' of the distribution. The convergence of the tails of the distribution is much slower, as we will see next in this exercise

• To compare the distribution to a Gaussian, we compare the number of events in the 1,2,3,4,5 sigma ranges to that expected for a true Gaussian distribution
– I.e. for a true Gaussian we expect that 68% of the events are in the ±1 sigma range. We then count which fraction of the xsum distribution is in that range

– And we repeat for 2,3,4,5 sigma

• To do so we need to calculate the expected sigma of the Gaussian by taking the square root of the variance of the distribution
– First calculate (on a piece of paper) the variance of a uniform distribution in the range [0,1].

– To do so, use the formula V(x) = ⟨x²⟩ − ⟨x⟩², where the averages are calculated as ⟨g(x)⟩ = ∫ g(x) F(x) dx and F(x) is the distribution you are averaging over (here F(x) is a uniform distribution in the range [0,1])

Page 12

Exercise 1 – The central limit theorem

• Then, once you have the variance for a single measurement of x, determine what the variance is for the sum of N identical measurements (a worked version of both calculations follows below; try them on paper first)
– If you need help, look at the slides on the Central Limit Theorem of Module 1
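
For checking your paper calculation afterwards, a worked version of both steps, written here in LaTeX notation (this is not from the slides):

% Variance of a uniform distribution on [0,1], where F(x)=1:
\langle x \rangle   = \int_0^1 x\,dx   = \tfrac{1}{2}, \qquad
\langle x^2 \rangle = \int_0^1 x^2\,dx = \tfrac{1}{3}, \qquad
V(x) = \langle x^2 \rangle - \langle x \rangle^2 = \tfrac{1}{3} - \tfrac{1}{4} = \tfrac{1}{12}
% Variances of independent measurements add, so for the sum of N:
V\Big(\sum_{i=1}^{N} x_i\Big) = \tfrac{N}{12}
\quad\Longrightarrow\quad
\sigma = \sqrt{N/12}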

• At the beginning, allocate a variable Nsigma1 and initialize it to zero. This will hold the number of events outside the 'one-sigma' range (see the sketch below).
– In the 'experiment loop', once you have calculated xsum, determine whether the answer is inside or outside the 'one-sigma range', i.e. whether it is outside the range [mean-1*sigma, mean+1*sigma] (the mean of the summed distribution is Nsum/2), and increment Nsigma1 if it is outside.

– At the end of the loop, print the fraction of events Nsigma1/Ntot, which is the fraction of events outside the one-sigma range of the distribution.

– Compare it to the fraction expected for a Gaussian distribution. Tip: you can get the exact fraction of events outside an n-sigma range of a Gaussian distribution from the following ROOT expression: double gaussfrac = TMath::Erfc(n/sqrt(2))

where 'n' is the number of sigmas (i.e. n=1 gives 100%-68%≈32%)
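
A minimal sketch of this counting step (hedged: variable names are assumptions, and Nsum is taken to be the macro argument as before):

// Hypothetical sketch: count events outside the one-sigma range of
// the xsum distribution and compare to the Gaussian expectation.
double mean  = Nsum / 2.0;                  // mean of a sum of uniforms
double sigma = sqrt(Nsum / 12.0);           // sqrt of the variance N/12
int Nsigma1 = 0, Ntot = 10000;
for (int iexp = 0; iexp < Ntot; iexp++) {
  double xsum = 0;
  for (int j = 0; j < Nsum; j++) xsum += gRandom->Uniform(0, 1);
  if (fabs(xsum - mean) > sigma) Nsigma1++; // outside +/- 1 sigma
}
double frac  = double(Nsigma1) / Ntot;
double err   = sqrt(double(Nsigma1)) / Ntot;
double gauss = TMath::Erfc(1 / sqrt(2.0));  // Gaussian fraction outside 1 sigma
printf("frac = %f +/- %f  Gauss = %f\n", frac, err, gauss);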

Page 13

Exercise 1 – The central limit theorem

• Note that there is a statistical error on the measurement of Nsigma1/Ntot, which is (to good approximation) sqrt(Nsigma1)/Ntot
– Compare Nsigma1, error(Nsigma1), and the 'true value' of Nsigma1 for a Gaussian distribution on one line

– Do this for Nsum=2,5,10,20,100

– You will see that 10000 experiments provide plenty of precision to see that the one-sigma range of the distribution of xsum converges rapidly to that expected for a Gaussian distribution.

• Now repeat the exercise for the 2 and 3 sigma ranges.
– To do so, add variables Nsigma2 and Nsigma3, fill them in the event loop with the corresponding ranges, and compare them (with their errors) to the matching fractions for a true Gaussian distribution.

– Do this for Nsum=2,5,10,20,100

– Do you have enough statistics to measure the convergence for 2 and 3 sigma? If not, increase the number of experiments by e.g. a factor of 10

• Finally add the 4 and 5 sigma ranges
– How many experiments do you need to verify 5-sigma convergence? (Feel free to stop this exercise if runs start to take too long)

Page 14

Exercise 1 – The central limit theorem

• What does it mean?
– If you have done the exercise correctly, you'll see the following results for the Nsum=20 run with Nexp=10,000,000 for 1,2,3,4,5 sigma

• n = 3198780 frac = 0.319879 +/- 0.00017 Gauss = 0.317311 rel. = 0.008

• n = 450384 frac = 0.0450384 +/- 6.7e-05 Gauss = 0.0455003 rel. = -0.010

• n = 22954 frac = 0.0022954 +/- 1.5e-05 Gauss = 0.0026998 rel. = -0.149

• n = 329 frac = 3.29e-05 +/- 1.8e-06 Gauss = 6.33425e-05 rel. = -0.480

• n = 0 frac = 0 +/- 0 Gauss = 5.73303e-07

– While the 2 and 3 sigma fractions are fairly close to Gaussian (rel = (frac-Gauss)/Gauss), the 4-sigma number is ~50% off

– I.e. your estimate of how often a result occurs 4 times sqrt(variance) away from the central value is 50% off w.r.t. the Gaussian distribution

– Verifying 5-sigma results is a very time-consuming business (even when a simulation of your measurement is as trivial as throwing a single random number)

Page 15

Module 2

Page 16

Exercise 6 – Analytical vs numeric MLE

• For certain simple cases it is possible to calculate the ML estimate of a parameter analytically, rather than relying on a numeric estimate. A well-known case is the fit of the lifetime of an exponential distribution

• Copy ex6.C and run it. This example performs an unbinned MLE fit to an exponential distribution of 100 events to estimate the lifetime.

• Now we aim to construct the analytical estimator of the lifetime (do this part on paper, not by computer)
– Write down the probability density function for the lifetime distribution.

– It is essential that you formulate a normalized expression, i.e. ∫F(t) dt = 1 when integrated from 0 to ∞. The easiest way to accomplish that is to divide whatever expression you have by the integral of that expression over dt, and to calculate that integral by hand

Page 17

Exercise 6 – Analytical vs numeric MLE

– Next, write down the analytical form of the negative log-likelihood −log(L) for that probability density function, given a dataset of 100 events labeled x0–x99 (label these xi in your expression). Be sure to also include the pdf normalization term in the expression

– The analytical ML estimate of tau then follows from the requirement that d(−logL)/dtau = 0. Calculate the derivative of −log(L) w.r.t. tau and solve for the value of tau that satisfies d(−logL)/dtau = 0 (a worked version follows below; derive it yourself first)
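
For reference, a worked version of these steps in LaTeX notation (assuming the normalized exponential pdf from the previous page):

% Normalized pdf and its negative log-likelihood for 100 events:
F(x;\tau) = \frac{1}{\tau}\,e^{-x/\tau}, \qquad
-\log L(\tau) = \sum_{i=0}^{99}\Big(\frac{x_i}{\tau} + \log\tau\Big)
% Setting the derivative w.r.t. tau to zero gives the analytical MLE:
\frac{d(-\log L)}{d\tau}
  = \sum_{i=0}^{99}\Big(-\frac{x_i}{\tau^2} + \frac{1}{\tau}\Big) = 0
\quad\Longrightarrow\quad
\hat\tau = \frac{1}{100}\sum_{i=0}^{99} x_i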

• Finally, implement the analytical calculation of the ML estimator of tau in the code of ex6.C (a sketch follows at the end of this page)
– Uncomment block one, which implements a loop over the dataset, retrieving the values of the decay time one by one, and compare your calculation of the analytical estimate of tau with the numeric estimate from fitTo()

– Explain why you might see minor discrepancies between the analytical and numeric calculations.

– Increase the event count from 100 to 10000 and run again
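
A minimal sketch of what the analytical calculation could look like (hedged: ex6.C is not reproduced here, so the dataset pointer name 'data' and observable name 't' are assumptions):

// Hypothetical sketch: analytical MLE of tau = sample mean of t,
// assuming a RooDataSet* data with an observable named "t".
double sum = 0;
int N = data->numEntries();
for (int i = 0; i < N; i++) {
  const RooArgSet* row = data->get(i);   // retrieve event i
  sum += row->getRealValue("t");         // decay time of this event
}
double tauHat = sum / N;                 // analytical ML estimate
std::cout << "analytical tau = " << tauHat << std::endl;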

Page 18

Exercise 10 – Toy event generation

• This exercise demonstrates the principle of toy event generation through sampling.
– Copy input file ex4.C and look at it. The input file defines a nearly empty main function, and a function 'double func(double x)' that returns a Gaussian distribution in x.

– The first step of this exercise is to sample func() to make a toy dataset. To do toy MC sampling we first need to know the maximum of the function. For now we assume that we know that func(x) is a Gaussian, so we can determine the maximum by evaluating the function at x=0. Store the function value at x=0 in a double named fmax.

– Now write a loop that runs 1000 times. In the loop, generate two random numbers: a double x in the range [-10,10] and a double y in the range [0,fmax]. The value of x is a potential element of the toy dataset you are generating. Whether you accept it depends on the value of y. Think about what the acceptance criterion should be (consult the slides of Module 3 if necessary; a sketch follows below) and, if it passes, store the value of x in the histogram.
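
A minimal sketch of the accept/reject loop (hedged: the histogram name is an assumption, since ex4.C is not reproduced here):

// Hypothetical sketch of the accept/reject generation loop.
double fmax = func(0);                   // maximum of the Gaussian at x=0
for (int i = 0; i < 1000; i++) {
  double x = gRandom->Uniform(-10, 10);  // trial x
  double y = gRandom->Uniform(0, fmax);  // trial y in [0,fmax]
  if (y < func(x)) {                     // accept if y falls below the curve
    hist->Fill(x);                       // store the accepted x
  }
}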

Page 19

Exercise 10 – Toy event generation

– Allocate an integer counter to keep track of the number of accepted events. At the end of the macro, draw the histogram and print the efficiency of the generation cycle, calculated as the number of accepted events divided by the number of trial events

– Now change the code such that instead of doing 10000 trials, the loop only stops after 10000 accepted events. Modify the code such that you can still calculate the efficiency after this change.

– Change the width of the Gaussian from 3.0 to 1.0. Run again and look at the generation efficiency. Now change it to 0.1 and observe the generation efficiency.

• Now we modify the toy generation macro so that it is usable for any function.
– This means we can no longer rely on the assumption that the maximum of the function is at x=0.

– The most common way to estimate the maximum of an unknown function is through random sampling. To that effect, add some code before the generation loop that samples the function at 100 random positions in x and saves the highest value found as fmax (see the sketch below).
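
A short sketch of this empirical maximum finding (same assumed names as before):

// Hypothetical sketch: estimate the function maximum by random sampling.
double fmax = 0;
for (int i = 0; i < 100; i++) {
  double xtrial = gRandom->Uniform(-10, 10); // random position in x
  double fval = func(xtrial);
  if (fval > fmax) fmax = fval;              // keep the highest value seen
}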

Page 20

Exercise 10 – Toy event generation

– Change the width of the Gaussian back to 3.0 and run the modified macro. Compare the fmax that was found through random sampling with the fmax obtained using the knowledge that the function is Gaussian (i.e. fmax=func(0)).

– Now change func(x) from the Gaussian function to the following expression:

(1+0.9*sin(sqrt(x*x)))/(fabs(x)+0.1)

and verify that the generation still works fine.

• Finally, we explore the limitations of sampling algorithms.
– One generally runs into trouble if the empirical maximum-finding algorithm does not find the true maximum. This is most likely to happen if you don't take enough samples or if the function is strongly peaked.

– Choose the following func(x)

TMath::Gaus(x,0,0.1,kTRUE)+0.1 ;

i.e. a narrow Gaussian plus a flat background and rerun the exercise

– Now lower the number of trial samples for maximum finding to 10 and see what happens

Page 21

Module 4

Page 22

Exercise 13 – A Bayesian interval

• In this exercise we will construct a Bayesian interval for an example measurement.
– Copy input file ex13.C, have a look at the first block of code, and run it.

– You will see a plot of a Voigtian distribution, which is the convolution of a Breit-Wigner shape (∝ 1/(x²+Γ²)), typical for mass resonances, and a Gaussian, which typically describes the detector resolution. The convolution of these two pdfs describes what a mass resonance looks like after detection, and is calculated analytically in the class RooVoigtian. The plot shows three cases: the pure Breit-Wigner distribution in red, the Voigtian distribution with Γ=3 and σ=1 (blue), and with Γ=3 and σ=3 (dashed).

– Now uncomment BLOCK 1 and rerun. We generate a toy dataset at Γ=3 and σ=1 and fit it with our model (a sketch of this setup follows below). Note the strong correlation between Γ and σ in the fit (which is typical when σ and Γ are of similar magnitude)
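
A minimal sketch of roughly what this setup could look like in RooFit (hedged: variable names and ranges are assumptions; ex13.C itself is not reproduced here):

// Hypothetical sketch: Voigtian model and toy fit in RooFit.
RooRealVar x("x", "mass", -10, 10);
RooRealVar mean("mean", "mean", 0);
RooRealVar width("width", "#Gamma", 3, 0.1, 10);  // Breit-Wigner width
RooRealVar sigma("sigma", "#sigma", 1, 0.1, 10);  // Gaussian resolution
RooVoigtian model("model", "BW (x) Gauss", x, mean, width, sigma);

RooDataSet* data = model.generate(RooArgSet(x), 1000); // toy dataset
model.fitTo(*data);   // observe the Gamma-sigma correlation in the fit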

• A simple Bayesian interval
– In this exercise Γ (the width) is the parameter of interest and σ is a nuisance parameter.

– We first run a scenario in which we assume that σ is known exactly (i.e. there are no nuisance parameters). Now uncomment BLOCK 2 and rerun. Here the −log(L) function is constructed and plotted versus Γ

Page 23

Exercise 13 – A Bayesian interval

– Now uncomment BLOCK 3 and rerun. This code constructs the likelihood function from the −log(L) function (through exponentiation) and constructs a Bayesian posterior from the likelihood by multiplying it with a flat prior in Γ. The posterior is then plotted. A 68% central Bayesian interval can now be defined as the range that integrates 68% of the posterior and leaves 16% on each side.

– Now uncomment BLOCK 4. The additional code will create the cumulative distribution function (cdf) of the Bayesian posterior, i.e. ∫ P(Γ′) dΓ′ integrated from −∞ up to Γ, which simplifies the calculation of the interval: the interval is now delimited by the values of Γ where the cdf crosses the values 0.16 and 0.84. Determine what the Bayesian interval is (you can select cross-hairs on the canvas to simplify this task: click the right mouse button in the area above the plot frame and select the item SetCrosshair). Write down the interval for future comparison.

• A Bayesian interval with a nuisance parameter
– We now consider the case where we don't know σ a priori and need to constrain it from the data. Thus σ is now a nuisance parameter.

Page 24

Exercise 13 – A Bayesian interval

– Uncomment BLOCK 5 and rerun. The added code will show the −log(L) and the L distributions versus (Γ,σ). (You can again see from the L distribution that there is a significant correlation between Γ and σ.)

– Uncomment BLOCK 6 and rerun. We now follow the Bayesian procedure for the elimination of nuisance parameters: we integrate the posterior distribution (taken as the likelihood times a flat prior in both σ and Γ) over the nuisance parameter σ, and plot the resulting posterior in Γ as well as the corresponding cumulative distribution (the required integrations may take a few minutes; a sketch of the integration step follows below). Compare the distribution of this posterior to that of the previous scenario (with fixed σ). Calculate the 68% central interval from the cdf and write down the values for future reference
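
A hedged sketch of what this marginalization step could look like in RooFit, assuming a likelihood object built from the model of the earlier sketch (the real BLOCK 6 may be structured differently):

// Hypothetical sketch: integrate the posterior over the nuisance
// parameter sigma, leaving a marginal posterior in width (Gamma).
// 'likelihood' is assumed to be a RooAbsReal*, e.g. exp(-nll) times a flat prior.
RooAbsReal* marginal = likelihood->createIntegral(RooArgSet(sigma));
RooPlot* frame = width.frame();
marginal->plotOn(frame);   // posterior in Gamma after marginalization
frame->Draw();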

• A Bayesian interval with a nuisance parameter with a non-uniform prior
– For illustration we can also choose a non-uniform prior for the σ parameter, e.g. a Gaussian with mean 1 and width 0.1. This represents knowledge of the value of σ that is about three times as precise as what can be inferred from the data. Uncomment BLOCK 7, rerun (this will again take a few minutes) and calculate the 68% central interval for this case.

Page 25

Exercise 14 – Profile Likelihood interval

• This exercise aims to calculate a interval on Г for the same problem as Ex13, but now using a Profile Likelihood method.– Copy file ex14.C, look at the code and run it. The code in this

exercise sets up the same model and data as Ex13, constructs the likelihood and plots the likelihood as function of Г, assuming a known fixed value of σ=1.

– Calculate the 68% likelihood interval by finding the values of Гthat correspond to LR=+0.5. Write down the value and compare it to the corresponding 68% Bayesian interval

– Uncomment Block1 and rerun. Now we move to the scenario where must constrain σ from the data and will consider it a nuisance parameter.

– The procedure to eliminate nuisance parameters in the profile likelihood method is to make a scan of the likelihood in Г where we plot for each point in Г the best (=lowest) value of the LR for any value σ (instead of the value of the LR at σ=1). [ This will invariable widen the distribution. Try to understand why that is the case]
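
A minimal sketch of a profile likelihood construction in RooFit, reusing the assumed names from the Ex13 sketch (ex14.C itself may construct this differently):

// Hypothetical sketch: profile likelihood vs Gamma in RooFit.
RooAbsReal* nll = model.createNLL(*data);
RooAbsReal* pll = nll->createProfile(RooArgSet(width)); // minimizes over sigma
RooPlot* frame = width.frame();
pll->plotOn(frame);   // profile -log(L) as a function of Gamma
frame->Draw();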

Page 26

Exercise 14 – Profile Likelihood interval

– The newly added code creates the profile likelihood function, which is represented by a function object in RooFit as well (it will internally call MINUIT to perform the minimization of the likelihood w.r.t. the nuisance parameters every time it is evaluated)

– Calculate the 68% profile likelihood interval by finding the values of Γ that correspond to PLR=+0.5. Write down the value and compare it to the corresponding 68% Bayesian interval

– Finally, uncomment BLOCK 3 and run again. The newly added code will provide a visualization of how the model shape changes in the profile likelihood as a function of the parameter of interest Γ. A canvas with 9 pads is created, corresponding to the situations at Γ=0.5,1,...,4.5. Each pad shows the data (which is always the same), in red the model with σ=1, and in blue the model with the value of σ that gives the best fit (the likelihood of this best fit is used in the profile likelihood)

Page 27

Exercise 15 – Testing Wilks theorem

• This exercise aims to test the validity of Wilks' theorem for the example analysis used in Ex13 and Ex14.
– Wilks' theorem states that the distribution of 2 times the log-likelihood ratio will be that of a chi-squared distribution in the asymptotic limit (N→∞).

– In this exercise we will generate a series of toy Monte Carlo samples, calculate the likelihood ratio for each of them, plot the distribution, and compare that distribution to the asymptotic chi-squared distribution (a sketch of the toy loop follows below).

– Copy file ex15.C, look at the code and run it (this takes 1-2 minutes). Does the data look like the asymptotic distribution? You will have to put the y-axis on a log scale; to do so, click the right mouse button just above the plot frame and select the option SetLogy.

– Up to what value of the LLR do you have enough statistics to claim agreement? What is the corresponding level of significance? (Remember LLR=0.5·Z².)

– The provided code generates the LLR distribution for L(width=3)/L(width=best). Change the code so that it generates the distribution for L(width=1)/L(width=best) and rerun. Does the distribution look different?
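
A hedged sketch of such a toy loop, again reusing the assumed model and variable names from the Ex13 sketch (ex15.C is not reproduced here; on older ROOT versions RooMinuit plays the role of RooMinimizer):

// Hypothetical sketch: distribution of the log-likelihood ratio
// LLR = -log L(width=3) + log L(width=best) over toy datasets.
TH1F hLLR("hLLR", "LLR distribution", 100, 0, 10);
for (int itoy = 0; itoy < 1000; itoy++) {
  width.setVal(3);                       // generate at the true width
  RooDataSet* toy = model.generate(RooArgSet(x), 100);
  RooAbsReal* nll = model.createNLL(*toy);

  double nllFixed = nll->getVal();       // -log L at the tested width

  RooMinimizer m(*nll);                  // minimize -log L with MINUIT
  m.migrad();
  double nllBest = nll->getVal();        // -log L at the best-fit width

  hLLR.Fill(nllFixed - nllBest);         // LLR >= 0 by construction
  delete nll; delete toy;
}
hLLR.Draw();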

Page 28

Exercise 15 – Testing Wilks theorem

– So far we have checked the behavior of the likelihood ratio with σ fixed to 1. Next we check whether the profile likelihood also follows the prediction of Wilks' theorem. To do so we need to make σ a floating parameter of the model. Change 'sigma[1]' into 'sigma[1,0.1,3]' in the factory string and rerun.

– How many toy data samples do you need to generate and fit to validate Wilks' theorem up to Z=5? (I.e. first calculate the corresponding probability, then take 100/prob as a rough estimate of the number of toys you need.)

– How long would it take you to do that check? (Based on the time it took you to run 1000 toys.) What if you had 1000 CPUs available?
