Post on 21-Dec-2015
Wouter Verkerke, NIKHEF
RooFit: A toolkit for data modeling in ROOT
Wouter Verkerke (NIKHEF) David Kirkby (UC Irvine)

Focus: coding a probability density function
• Focus on one practical aspect of many data analyses in HEP: how do you formulate your p.d.f. in ROOT?
– For ‘simple’ problems (Gaussian, polynomial), ROOT's built-in models are quite sufficient
– But if you want to do unbinned ML fits, use non-trivial functions, or work with multidimensional functions, you quickly run into trouble

A recent example
• BaBar experiment at SLAC: extract sin(2β) from time-dependent CP violation in B decays: e+e- → Υ(4S) → B B̄
– Reconstruct both Bs, measure the decay time difference
– Physics of interest is in the decay-time-dependent oscillation
• Many issues arise
– Standard ROOT function framework clearly insufficient to handle such complicated functions → must develop a new framework
– Normalization of p.d.f. not always trivial to calculate → may need numeric integration techniques
– Unbinned fit, >2 dimensions, many events → computation performance important → must optimize code for acceptable performance
– Simultaneous fit to control samples to account for detector performance
f_{sig} \cdot SigSel(m; p_{sig}) \cdot [ SigDecay(t; q_{sig}, \sin 2\beta) \otimes SigResol(t | dt; r_{sig}) ]
+ (1 - f_{sig}) \cdot BkgSel(m; p_{bkg}) \cdot [ BkgDecay(t; q_{bkg}) \otimes BkgResol(t | dt; r_{bkg}) ]

A recent example
• Initial approach at BaBar: write it from scratch (in FORTRAN!)
– Does its designated job quite well, but took a long time to develop
– Possible because the sin(2β) effort was supported by O(50) people
– Optimization of ML calculations hand-coded → error prone and not easy to maintain
– Difficult to transfer knowledge/code from one analysis to another
• A better solution: a modeling language in C++ that integrates seamlessly into ROOT
– Recycle code and knowledge
• Development of the RooFit package
– Started 5 years ago
– Very successful: virtually everybody in BaBar uses it → now in the standard ROOT distribution

RooFit – a data modeling language for ROOT
[Diagram: ROOT services (C++ command line interface & macros, Data management & histogramming, Graphics interface, I/O support, MINUIT); RooFit additions (Data modeling, Toy MC data generation, Data/Model fitting, Model visualization)]
Extension to ROOT – (Almost) no overlap with existing functionality

Data modeling - Desired functionality
[Side label: Analysis work cycle]
Building/Adjusting Models
• Easy to write basic PDFs (normalization)
• Easy to compose complex models (modular design)
• Reuse of existing functions
• Flexibility – No arbitrary implementation-related restrictions
Using Models
• Fitting: Binned/Unbinned (extended) MLL fits, Chi2 fits
• Toy MC generation: Generate MC datasets from any model
• Visualization: Slice/project model & data in any possible way
• Speed – Should be as fast as or faster than a hand-coded model

Data modeling – OO representation
• Idea: represent math symbols as C++ objects
– Result: 1 line of code per symbol in a function (the C++ constructor) rather than 1 line of code per function

Mathematical concept                                        RooFit class
variable               x, p                                 RooRealVar
function               f(x)                                 RooAbsReal
PDF                    F(x; p, q)                           RooAbsPdf
space point            x                                    RooArgSet
integral               \int_{x_{min}}^{x_{max}} f(x) dx     RooRealIntegral
list of space points   x_k                                  RooAbsData

Data modeling – Constructing composite objects
• Straightforward correlation between mathematical representation of formula and RooFit code
Math: gauss(x, m, \sqrt{s})
RooFit diagram: RooGaussian g takes RooRealVar x, RooRealVar m and RooFormulaVar sqrts as inputs; RooFormulaVar sqrts takes RooRealVar s
RooFit code:
    RooRealVar x("x","x",-10,10) ;
    RooRealVar m("m","mean",0) ;
    RooRealVar s("s","sigma",2,0,10) ;
    RooFormulaVar sqrts("sqrts","sqrt(s)",s) ;
    RooGaussian g("g","gauss",x,m,sqrts) ;

Model building – (Re)using standard components
• RooFit provides a collection of compiled standard PDF classes
– Basic: Gaussian, Exponential, Polynomial, Chebychev polynomial, …
– Physics inspired: ARGUS, Crystal Ball, Breit-Wigner, Voigtian, B/D-Decay, …
– Non-parametric: Histogram, KEYS
[Figure: example shapes of RooArgusBG, RooPolynomial, RooBMixDecay, RooHistPdf and RooGaussian]
• Easy to extend the library: each p.d.f. is a separate C++ class

Importance of a good and extendable model library
• If ‘new’ p.d.f.s are made available in an easy-to-use form, more people will use them
• Example 1: non-parametric kernel estimation technique
– See article by Kyle Cranmer
– People like it, but are put off by the prospect of thoroughly reading the entire article, coding the concept from scratch, and then using it
– In RooFit, switching from a histogram-based shape to a KEYS shape is a one-line code change (RooHistPdf → RooKeysPdf)
• Example 2: Chebychev polynomials
– It's easy to convince people to try them if all they need to do is a one-line code change (RooPolynomial → RooChebychev)

Model building – (Re)using standard components
• Library p.d.f.s can be adjusted on the fly
– Just plug in any function expression you like as input variable
– Works universally, even for classes you write yourself
• Maximum flexibility of library shapes keeps library small
g(x; m, s) with m(y; a0, a1) → g(x, y; a0, a1, s)
    RooPolyVar m("m","m",y,RooArgList(a0,a1)) ;
    RooGaussian g("g","gauss",x,m,s) ;

Handling of p.d.f. normalization
• Normalization of (component) p.d.f.s to unity is often a good part of the work of writing a p.d.f.
• RooFit handles most normalization issues transparently to the user
– A p.d.f. can advertise (multiple) analytical expressions for integrals
– When no analytical expression is provided, RooFit will automatically perform numeric integration to obtain the normalization
– More complicated than it seems: even if gauss(x,m,s) can be integrated analytically over x, gauss(f(x),m,s) cannot. Such use cases are automatically recognized
– Multi-dimensional integrals can be a combination of numeric and p.d.f.-provided analytical partial integrals
• Variety of numeric integration techniques is interfaced
– Adaptive trapezoid, Gauss-Kronrod, VEGAS MC, …
– User can override parameters globally or per p.d.f. as necessary
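The numeric fallback for normalization can be illustrated without ROOT. The following is a minimal sketch in plain C++, not RooFit's integrator: the names gaussShape, integrateTrapezoid and normalized are hypothetical, and a fixed-step trapezoid rule stands in for the adaptive methods listed above. It normalizes a shape such as gauss(f(x),m,s) with f(x) = x*x, for which no analytical integral over x is advertised.

```cpp
#include <cassert>
#include <cmath>
#include <functional>

// Unnormalized Gaussian shape exp(-(u-m)^2 / (2 s^2)) in its argument u.
double gaussShape(double u, double m, double s) {
    const double z = (u - m) / s;
    return std::exp(-0.5 * z * z);
}

// Composite trapezoid rule on [lo, hi] with n equal steps.
double integrateTrapezoid(const std::function<double(double)>& f,
                          double lo, double hi, int n) {
    assert(n > 0 && hi > lo);
    const double h = (hi - lo) / n;
    double sum = 0.5 * (f(lo) + f(hi));
    for (int i = 1; i < n; ++i) sum += f(lo + i * h);
    return sum * h;
}

// Wrap any positive shape into a density that integrates to 1 on [lo, hi].
std::function<double(double)> normalized(std::function<double(double)> shape,
                                         double lo, double hi, int n = 10000) {
    const double norm = integrateTrapezoid(shape, lo, hi, n);
    assert(norm > 0.0);
    return [shape, norm](double x) { return shape(x) / norm; };
}
```

For example, normalized([](double x) { return gaussShape(x * x, 1.0, 0.5); }, -3.0, 3.0) returns a callable density on [-3, 3]; the user never writes the normalization integral by hand.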

Model building – (Re)using standard components
• Most physics models can be composed from ‘basic’ shapes
[Figure: RooBMixDecay, RooPolynomial, RooHistPdf, RooArgusBG and RooGaussian shapes combined by addition (+) with RooAddPdf]

Model building – (Re)using standard components
• Most physics models can be composed from ‘basic’ shapes
[Figure: RooBMixDecay, RooPolynomial, RooHistPdf, RooArgusBG and RooGaussian shapes combined by multiplication (*) with RooProdPdf]
k(x,y) = f(x|y) \cdot g(y)
    RooProdPdf k("k","k",g, Conditional(f,x)) ;
h(x,y) = f(x) \cdot g(y)
    RooProdPdf h("h","h", RooArgSet(f,g)) ;
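The idea behind the sum and product operators can be sketched in a few lines of plain C++. This is an illustration only and not how RooAddPdf or RooProdPdf are implemented: the names Pdf1D, Pdf2D, addPdf and prodPdf are hypothetical, the inputs are assumed to be already normalized, and the conditional product is not covered.

```cpp
#include <cassert>
#include <functional>

using Pdf1D = std::function<double(double)>;
using Pdf2D = std::function<double(double, double)>;

// Sum of two normalized densities: fracA * a(x) + (1 - fracA) * b(x).
// The result stays normalized when both inputs are.
Pdf1D addPdf(Pdf1D a, Pdf1D b, double fracA) {
    assert(fracA >= 0.0 && fracA <= 1.0);
    return [a, b, fracA](double x) { return fracA * a(x) + (1.0 - fracA) * b(x); };
}

// Product of two uncorrelated densities: h(x, y) = f(x) * g(y).
Pdf2D prodPdf(Pdf1D f, Pdf1D g) {
    return [f, g](double x, double y) { return f(x) * g(y); };
}
```

Because the composite is again a callable density, it can be passed to another addPdf or prodPdf call, which is the modular composition the slides describe.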

Using models - Overview
• All RooFit models provide universal and complete fitting and Toy Monte Carlo generating functionality
– Model complexity only limited by available memory and CPU power
– Fitting/plotting a 5-D model as easy as using a 1-D model
– Most operations are one-liners
[Diagram: RooAbsPdf and RooDataSet (a RooAbsData) connected by fitting and generating]
Fitting:    gauss.fitTo(data)
Generating: data = gauss.generate(x,1000)

Fitting functionality
• Binned or unbinned ML fit?
– For RooFit this is essentially the same!
– Nice example of C++ abstraction through inheritance
• Fitting interface takes abstract RooAbsData object
– Supply unbinned data object (created from TTree) → unbinned fit
– Supply histogram object (created from THx) → binned fit
[Diagram: RooAbsData is the base class of RooDataSet (unbinned: a table of x, y, z values, one row per event) and RooDataHist (binned)]
    pdf.fitTo(data)
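The inheritance idea can be sketched in plain C++. The classes below (AbsData, UnbinnedData, BinnedData) are hypothetical stand-ins, not the RooFit classes, and the binned case evaluates the p.d.f. at the bin center, which is an approximation chosen for brevity. The point is that a single likelihood function serves both kinds of data.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Abstract dataset: a list of points, each carrying a weight.
class AbsData {
public:
    virtual ~AbsData() = default;
    virtual std::size_t size() const = 0;
    virtual double x(std::size_t i) const = 0;
    virtual double weight(std::size_t i) const = 0;
};

// Unbinned data: one entry per event, each with weight 1.
class UnbinnedData : public AbsData {
public:
    explicit UnbinnedData(std::vector<double> events) : events_(std::move(events)) {}
    std::size_t size() const override { return events_.size(); }
    double x(std::size_t i) const override { return events_[i]; }
    double weight(std::size_t) const override { return 1.0; }
private:
    std::vector<double> events_;
};

// Binned data: one entry per bin, located at the bin center, weight = bin content.
class BinnedData : public AbsData {
public:
    BinnedData(double lo, double hi, std::vector<double> contents)
        : lo_(lo), contents_(std::move(contents)) {
        assert(hi > lo && !contents_.empty());
        width_ = (hi - lo) / static_cast<double>(contents_.size());
    }
    std::size_t size() const override { return contents_.size(); }
    double x(std::size_t i) const override { return lo_ + (i + 0.5) * width_; }
    double weight(std::size_t i) const override { return contents_[i]; }
private:
    double lo_;
    double width_ = 0.0;
    std::vector<double> contents_;
};

// Negative log-likelihood, written once against the abstract interface.
double negLogLikelihood(const std::function<double(double)>& pdf, const AbsData& data) {
    double nll = 0.0;
    for (std::size_t i = 0; i < data.size(); ++i)
        nll -= data.weight(i) * std::log(pdf(data.x(i)));
    return nll;
}
```

Calling negLogLikelihood with an UnbinnedData or a BinnedData object selects the unbinned or binned calculation through virtual dispatch, with no change to the fitting code.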

Automated vs. hand-coded optimization of p.d.f.
• Automatic pre-fit PDF optimization
– Prior to each fit, the PDF is analyzed for possible optimizations
– Optimization algorithms:
  • Detection and precalculation of constant terms in any PDF expression
  • Function caching and lazy evaluation
  • Factorization of multi-dimensional problems wherever possible
– Optimizations are always tailored to the specific use in each fit
– Possible because OO structure of p.d.f. allows automated analysis of structure
• No need for users to hard-code optimizations
– Keeps your code understandable, maintainable and flexible without sacrificing performance
– Optimization concepts implemented by RooFit are applied consistently and completely to all PDFs
– Speedup of factor 3-10 reported in realistic complex fits
• Fit parallelization on multi-CPU hosts
– Option for automatic parallelization of fit function on multi-CPU hosts (no explicit or implicit support from user PDFs needed)
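Function caching with lazy evaluation can be sketched as follows. This is a minimal plain C++ illustration under stated assumptions, not RooFit's mechanism: Param and CachedFunction are hypothetical names, and a simple version counter stands in for whatever change tracking the real framework uses. A term whose input never changes during a fit is then computed only once, which is the effect of precalculating constant terms.

```cpp
#include <cassert>
#include <cmath>
#include <functional>
#include <utility>

// A parameter that bumps a version counter whenever its value changes.
class Param {
public:
    explicit Param(double v) : value_(v) {}
    double value() const { return value_; }
    unsigned version() const { return version_; }
    void set(double v) {
        if (v != value_) { value_ = v; ++version_; }
    }
private:
    double value_;
    unsigned version_ = 0;
};

// A function of one parameter that is re-evaluated only when its input changed.
class CachedFunction {
public:
    CachedFunction(const Param& input, std::function<double(double)> f)
        : input_(&input), f_(std::move(f)) {
        assert(static_cast<bool>(f_));
    }
    double value() {
        if (!valid_ || seenVersion_ != input_->version()) {
            cache_ = f_(input_->value());
            seenVersion_ = input_->version();
            valid_ = true;
            ++evaluations_;
        }
        return cache_;
    }
    // Number of times the wrapped function was actually called.
    int evaluations() const { return evaluations_; }
private:
    const Param* input_;
    std::function<double(double)> f_;
    double cache_ = 0.0;
    unsigned seenVersion_ = 0;
    bool valid_ = false;
    int evaluations_ = 0;
};
```

Repeated calls to value() between parameter changes return the cached number, so the cost of an expensive term is paid only when the minimizer actually moves the parameter it depends on.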

MC Event generation
• Sample “toy Monte Carlo” samples from any PDF
– Accept/reject sampling method used by default
• MC generation method automatically optimized
– PDF can advertise an internal generator if more efficient methods exist (e.g. Gaussian)
– Each generator request will use the most efficient accept/reject / internal generator combination available
– Operator PDFs (sum, product, …) distribute generation over components whenever possible
– Generation order for products with conditional PDFs is sorted out automatically
– Possible because OO structure of p.d.f. allows automated analysis of structure
    pdf.generate(vars,nevt)
• Can also specify template data as input to generator
    pdf.generate(vars,data)
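The default accept/reject method can be sketched in plain C++. This is a textbook version under stated assumptions, not RooFit's generator: generateAcceptReject is a hypothetical name, the caller is assumed to supply an upper bound shapeMax on the shape, and the shape does not need to be normalized.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <random>
#include <vector>

// Draw nEvents values of x in [lo, hi] distributed according to shape(x).
// shapeMax must satisfy shape(x) <= shapeMax everywhere on [lo, hi].
std::vector<double> generateAcceptReject(const std::function<double(double)>& shape,
                                         double lo, double hi, double shapeMax,
                                         std::size_t nEvents, unsigned seed) {
    assert(hi > lo && shapeMax > 0.0);
    std::mt19937 rng(seed);
    std::uniform_real_distribution<double> pickX(lo, hi);
    std::uniform_real_distribution<double> pickY(0.0, shapeMax);
    std::vector<double> events;
    events.reserve(nEvents);
    while (events.size() < nEvents) {
        const double x = pickX(rng);
        // Keep x with probability shape(x) / shapeMax, otherwise try again.
        if (pickY(rng) < shape(x)) events.push_back(x);
    }
    return events;
}
```

The efficiency is the ratio of the area under the shape to the area of the bounding box, which is why a dedicated internal generator is preferable for sharply peaked shapes.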

Using models – Plotting
• Model visualization geared towards ‘publication plots’, not interactive browsing → emphasis on 1-dimensional plots
• Simplest case: plotting a 1-D model over data
– Modular structure of composite p.d.f.s allows easy access to components for plotting
– Can show Poisson confidence intervals instead of sqrt(N) errors
    RooPlot* frame = mes.frame() ;
    data->plotOn(frame) ;
    pdf->plotOn(frame) ;
    pdf->plotOn(frame,Components("bkg")) ;
    frame->Draw() ;
– Can store plot with data and all curves as single object
– Adaptive spacing of curve points to achieve 1‰ precision

Visualizing multi-dimensional models
• IMHO: Visualization of multi-dim. models is much more complicated than visualization of multi-dim. data
• Simplest scenario: projection plots
– Simple example: M(m,t) = f·[S(m)·T(t)] + (1-f)·[B(m)·U(t)]
– Provide intuitive behavior: projection of p.d.f. automatically follows projection of data: 1-D plot frame remembers ‘hidden’ variables in dataset. Subsequent p.d.f.s in same frame are automatically projected
– Works identically for non-factorizable p.d.f.s! Necessary integrals are calculated automatically, either analytically or numerically
    RooPlot* frame = dt.frame() ;
    data.plotOn(frame) ;
    model.plotOn(frame) ;
    model.plotOn(frame,Components("bkg")) ;
[Figure: mB distribution with curve c(m) = \int M(m,t) dt; t distribution with curve c(t) = \int M(m,t) dm]
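The projection integral c(t) = \int M(m,t) dm can be sketched numerically in plain C++. This is an illustration only: projectOverM and toyModel are hypothetical names, the shapes S(m) = 1, B(m) = 2m, T(t) = 2t, U(t) = 1 on the unit interval are made up for the example, and a fixed-step midpoint rule stands in for RooFit's integrators.

```cpp
#include <cassert>
#include <functional>

// Midpoint-rule integral of model(m, t) over m in [mLo, mHi] at fixed t.
double projectOverM(const std::function<double(double, double)>& model,
                    double t, double mLo, double mHi, int n = 1000) {
    assert(n > 0 && mHi > mLo);
    const double h = (mHi - mLo) / n;
    double sum = 0.0;
    for (int i = 0; i < n; ++i) sum += model(mLo + (i + 0.5) * h, t);
    return sum * h;
}

// Toy M(m,t) = f S(m) T(t) + (1-f) B(m) U(t) on m, t in [0,1]
// with hypothetical shapes S = 1, T = 2t, B = 2m, U = 1 and f = 0.3.
double toyModel(double m, double t) {
    const double f = 0.3;
    return f * 1.0 * (2.0 * t) + (1.0 - f) * (2.0 * m) * 1.0;
}
```

Integrating over the full range of m gives the projection curve; passing a narrower [mLo, mHi] gives the unnormalized curve for a slice in m.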

Using models – plotting multi-dimensional models
• A projection plot usually doesn’t capture the full information content of your p.d.f. → equally easy to plot a ‘slice’
Projection:
    RooPlot* frame = dt.frame() ;
    data.plotOn(frame) ;
    model.plotOn(frame) ;
    model.plotOn(frame,Components("bkg")) ;
Slice:
    RooPlot* frame = dt.frame() ;
    // the range is defined on the hidden mass observable (labelled mB in the figure)
    mB.setRange("sel",5.27,5.30) ;
    data.plotOn(frame,CutRange("sel")) ;
    model.plotOn(frame,ProjectionRange("sel")) ;
    model.plotOn(frame,ProjectionRange("sel"),Components("bkg")) ;
[Figure: mB distribution; t distribution in the slice with curve c(t) = \int_{5.27}^{5.30} M(m,t) dm]
• More complicated ‘slice’ views such as a likelihood ratio plot only take O(3) lines of extra code

Using models – plotting
• Can also use template data to show ‘average’ curve
– Convenient for visualization of conditional p.d.f.s
– Example: plot the t distribution of the following model over data, averaging over the dt values of the data
M(m, t | dt)
Curve(t) = \frac{1}{N_D} \sum_{i=1}^{N_D} \int M(m, t | dt_i) \, dm
    model.plotOn(frame,ProjWData(dtData)) ;
    model.plotOn(frame,Components("bkg"),ProjWData(dtData),LineStyle(kDashed)) ;

Advanced features – Task automation
• Support for routine task automation, e.g. useful in finding confidence intervals
[Diagram: Input model → Generate toy MC → Fit model, repeated N times → Accumulate fit statistics → Distribution of parameter values, parameter errors and parameter pulls]
    // Instantiate MC study manager
    RooMCStudy mgr(inputModel) ;
    // Generate and fit 100 samples of 1000 events
    mgr.generateAndFit(100,1000) ;
    // Plot distribution of sigma parameter
    mgr.plotParam(sigma)->Draw() ;
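The generate-fit-accumulate loop can be sketched in plain C++ for the simplest possible model. This is not RooMCStudy: generateAndFitToys and ToyResult are hypothetical names, the model is assumed to be a Gaussian with known width, and the "fit" is the analytic maximum likelihood estimate of the mean (the sample mean, with error sigma / sqrt(N)) rather than a MINUIT minimization.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Result of one toy experiment.
struct ToyResult {
    double fittedMean;  // ML estimate of the mean
    double error;       // its uncertainty
    double pull;        // (fitted - true) / error
};

// Generate nToys samples of nEvents Gaussian events and fit the mean of each.
std::vector<ToyResult> generateAndFitToys(std::size_t nToys, std::size_t nEvents,
                                          double trueMean, double sigma, unsigned seed) {
    assert(nToys > 0 && nEvents > 0 && sigma > 0.0);
    std::mt19937 rng(seed);
    std::normal_distribution<double> gauss(trueMean, sigma);
    std::vector<ToyResult> results;
    results.reserve(nToys);
    for (std::size_t toy = 0; toy < nToys; ++toy) {
        double sum = 0.0;
        for (std::size_t i = 0; i < nEvents; ++i) sum += gauss(rng);
        const double fitted = sum / nEvents;
        const double error = sigma / std::sqrt(static_cast<double>(nEvents));
        results.push_back({fitted, error, (fitted - trueMean) / error});
    }
    return results;
}
```

For an unbiased fit with correctly estimated errors, the pulls should be distributed with mean close to 0 and width close to 1, which is what the accumulated pull distribution is used to check.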

RooFit at SourceForge - roofit.sourceforge.net
• RooFit moved to SourceForge to facilitate access and communication with non-BaBar users
• Code access
– CVS repository via pserver
– File distribution sets for production versions

RooFit at SourceForge - Documentation
• Comprehensive set of tutorials (PPT slide show + example macros)
– Five separate tutorials
– More than 250 slides and 20 macros in total
• Class reference in THtml style
• Also available by end of 2005: Pedagogical User Guide + Reference Guide (~100 page document)
Selection of BaBar publications in 2004 using RooFit
• Improved Measurement of the CKM angle alpha using B0+-
• Measurement of branching fractions and charge asymmetries in B+ decays to +, K+, + and '+, and search for B0 decays to K0 and
• Branching Fraction and CP Asymmetries in B0 → KS KS KS
• Measurement of CP Asymmetries in B0K0 and B0K+K-K0S
• Measurement of Branching Fractions and Time-Dependent CP-Violating Asymmetries in B' K Decays
• Improved Measurement of Time-Dependent CP Violation in B0 to (ccbar)K0 Decays (‘sin2β’)
• Measurements of the Branching Fraction and CP-Violating Asymmetries in B0 → f0(980) KS Decays
• Measurement of Time-dependent CP-Violating Asymmetries in B0K*, K*KS0 Decays
• Study of the decay B0+- and constraints on the CKM angle alpha.
• Measurement of CP-violating Asymmetries in B0K0S0 Decays
• Measurement of Time-Dependent CP Asymmetries in B0K0
• Search for B+/- [K-/+ +/-]D K+/- and upper limit on the bu amplitude in B+/- DK+/-
• Limits on the Decay-Rate Difference of Neutral B Mesons and on CP, T, and CPT Violation in B0B0bar Oscillations

Summary
• RooFit adds a powerful data modeling language to ROOT
– Mathematical objects (variables, functions, …) represented as C++ objects
– Complex models are easily composed from library of standard components
  • New fundamental components can be written easily
– Powerful tools for fitting, Toy MC generation and visualization are easy to use
  • Compact syntax
  • Universal functionality (almost no arbitrary / implementation-related restrictions)
– Automated function optimization analysis and implementation delivers industrial-strength performance
• RooFit makes the road from a ‘simple’ to a really elaborate/complex p.d.f. less steep
• RooFit is distributed with ROOT starting with ROOT version v5.02-00