luca lista, siena 2004 advanced software methods for physics analysis luca lista infn napoli

Luca Lista, Siena 2004

Advanced software methods Advanced software methods for physics analysisfor physics analysis

Luca Lista

INFN Napoli


IntroductionIntroduction

• Unprecedented data analysis complexity are experienced in BaBar today, and LHC in the future Huge data samples Large number of analyses High level of analysis complexity

• Analysis software and data model must have a high level of reliability Provide the users community with a set powerful, generic,

simple to use, tools to analyze data “Core” software technicalities must be hidden to the end user

behind well defined interfaces Code reuse must be pursued to permit users to share what has

been developed A good Object Oriented architecture is a solution

• N.B.: “Good” is more important than “OO” here…


OutlineOutline

• Analysis Data Model BaBar experience with the new Analysis Model

• Analysis toolkits An example of fitting and “toy” Monte Carlo

• Conclusions


BaBar experienceBaBar experience

• BaBar has collected up to now 200×106 BB pairs

• Physics events are organized according to a well established model with different components at increasing levels of detail: Tag ( boolean bits, floats and int. ) for fast event filtering Aod ( “Micro-DST” ) containing physics-level objects Esd ( “Mini-DST” ) intermediate level reconstruction objects

• Reconstructed tracks and neutral objects are accessible to the users by means of the information provided by “BtaCandidates”, which represent particle candidates

• Event store technology originally based on Objectivity Showed to be unfit for analysis purposes

• Almost every physics analysis is based on Tag and Aod levels. ROOT-IO supported early for those components


The original Analysis ModelThe original Analysis Model

• The “physics” code in the BaBar Framework Performs particle identification and build intermediate

“composite” particles (0, KS, D, resonances) • Composite particles are built by means of a standard (i.e.

validated) and configurable set of tools • This task is intrinsically very expensive in term of CPU time

because of the high combinatorics involved.• The output of the process (containers of composite particles

candidates) was available only at the transient level Very often physics analyses take advantage of an intermediate

step of data reduction (skimming) Performs further data reduction on individual analysis basis Eventually manage the production of large PAW/ROOT

n-tuples which contain specific analysis level physics variables, composite candidates kinematics and vertexing


Old model achievementsOld model achievements

• This model has been so far very successful in producing high quality physics results Submitted more than 90 physics papers, which

include discoveries!

But experts believe that it wouldn’t scale well to 5x current data set

Moreover, analysis at the n-tuple level has the disadvantage that existing (and validated) code is lost

This leads to replication of code Sharing analysis tools and strategies is challenging


The new Analysis ModelThe new Analysis Model

• Makes the Event Store content configurable Support for the storage of composite candidates More flexibility: support addition of custom user data

• Expand the role of Skimming (collections of pre-selected events) Rewrite just customized components

• Borrow components from parent collections

Provide both pointer-copy full deep-copy for light skims • Better performance in reading, off-site easy portability

The possible scenario: • Analysis working groups may instrument their skims with high level added

data and specific composite candidates• Final users may produce their own reduced skims (“re-skimming”) with

their options of user added data and composite candidates depending on their actual needs

Frequent centralized skimming to reduce the delay for availability of improvements, new skims, etc.

Interactively access supported to skim data (ROOT) for analysis!

• Production of large n-tuples becomes less appealing!


Composite particles candidatesComposite particles candidates

• PAW n-tuples was used as an additional storage format to cope with lack of needed functionalities Analysts use to store composite candidates and high level physics

quantities in the n-tuples Often replicating most of the Event Store content. Every analysis working group has its own large heterogeneous

PAW/ROOT files (disk space inefficient) and formats

• With the new model, candidate’s particles genealogy, the fit algorithm and the constrains are written to the DST This design choice rewards disk space saving with some cost in terms

of reading performances• When reading back, all the persistent candidates are translated in

transient objects The performance issues of re-fitting all the transient candidates may

be mitigated supporting dynamic data access• Also known as “Reconstruction on demand”


User added data - requirements User added data - requirements

• Support for storing user-defined variables Widely used physics analysis variables in BaBar:

• mES, E, mD*,D, cosB,D*l, …• Guarantee flexibility and uniformity in the interface

Support for native types • float, int,…

As well as for more complex objects• ThreeVector, LorentzVector, …

• User data should refer to analysis objects or be global to the entire event

• Non-intrusive design To extend functionalities (old good Open/Closed OO principle)

• Simplicity from the point view of the user Technological details are hidden as much as possible


User Added Data Interface User Added Data Interface

• The requirements are fulfilled by systematic use of Generic Programming (C++ templates) User variables of generic type

• Global event data• Data associable to generic objects

– Specializations for BtaCandidate case (the BaBar particle candidate object)

All native and most common structured variable types and for BtaCandidate objects can be stored in the DST

• User has to deal with few concepts and classes Variables by means of UsrVariable<T> template class Blocks aggregating variables values and associations:

• UsrEventBlock class to aggregate quantities relevant to the whole event • UsrCandBlock class to group variables associated to candidates

static “put”/”get” functions to interact with the event store


User defined candidate variables User defined candidate variables

UsrVariable<float> mES( “mES” );UsrVariable<float> deltaE( “deltaE” );UsrCandBlock B0Data;

for each BtaCandidate cand{ bool ok = true; mES = ...; deltaE = ...; // compute the values ok &= B0Data.put( cand, mES ); ok &= B0Data.put( cand, deltaE );}

for each BtaCandidate cand{ bool found = B0Data.get( cand, mES ); // get candidate mES found = B0Data.get( cand, deltaE ); // get candidate deltaE}

begin of the job:Declare variables

Analysis groupswriting data

Analystreading data


Status of the new Analysis ModelStatus of the new Analysis Model

• The model is completed and in production• Physics results for the next conferences are being

produced using the new Analysis Model


A toolkit for Fit and “toy” Monte CarloA toolkit for Fit and “toy” Monte Carlo

• The toolkit provides: a language to describe and model complex

parametric fit problems in C++ utilities to study the fit frequentistic properties

• Not intended to provide new mathematical algorithms The underlying minimization engine is Minuit

• Interest on this tool from: BaBar, GEANT4, LCG-PI (to be released)


Main functionalitiesMain functionalities

• Description of Probability Distribution Functions (PDF) most common PDFs provided (Gaussian, Poisson, etc.) random number generators for each provided PDF utilities to combine and manipulate PDFs

• Manipulation of symbolic expression simplifies the definition of PDF models and fit functions

• Fitter tools different Unbinned Maximum Likelihood (UML) fitters and

Chi-square fitter supported• Toy Monte Carlo

utility to generate random data samples to validate the fit results (pull distribution, fit bias estimate, etc.)

• User-defined components can be easily plugged-in


Design choicesDesign choices

• The code is optimized for speed Toy Monte Carlo of complex fits are very CPU intensive

• It can be achieved without loosing good OO design avoid virtual functions where not necessary using template generic programming the Boost C++ library provides powerful tools

• Metaprogramming permits type manipulations at compile time

• User don’t “see” these technical detail in the interface

• External package dependencies are well isolated Random number generator engines (ROOT, CLHEP, …) Minuit wrapper (ROOT, …)

• Other minimizers may be adopted (NAG, …)


A PDF implements the “()” operator: P = f( x, y, … )

Users can define new PDFs respecting the above interface

PDF interfacePDF interface

struct Flat : { PdfFlat( double a, double b ) : min( a ), max( b ) { } double operator()( double x ) const { return ( x < min || x > max ? 0 : 1 / ( max - min ) ); }

double min, max;};

struct Poissonian { PdfPoissonian( double m ) : mean( m ) { } double operator()( int n ) const { return ( exp( - mean ) * pow( mean, n ) / factorial( n ) ); }

double mean;};

Variable set;a sequence of any variable type is supported

Returns dP(x)/dx Returns P(n)


Implements the “generate” method: r.generate( x, y, … )

Random number generatorsRandom number generators

template< typename Generator = DefaultGen>struct RandomGenerator< Flat, Generator >{ RandomGenerator( const Flat& pdf ) : _min( pdf.min ), _max( pdf.max ) { } void generate( double & x ) const{ x = Generator::shootFlat( _min, _max ); } private: const double& _min, &_max;};

RANDOM_GENERATOR_SAMPLE(MyPdf, Bins, Min, Max)

RANDOM_GENERATOR_HITORMISS(MyPdf, Min, Max, fMax)

• Users can define new generators with the preferred method

• Numerical implementations are provided • trapezoidal PDF sampling• “hit or miss” technique

Random engine:CLHEP, ROOT, …

Partial specialization


Combining PDFsCombining PDFs

Argus shoulder ( 5.20, 5.28, -0.1 );Gaussian peak( 5.28, 0.05 );typedef Mixture<Gaussian, Argus> Mix;Mix pdf( peak, shoulder, 0.1 );

RandomGenerator<Mix> rnd;double x;rnd.generate( x );

Gaussian sigX( 5.28, 0.05 );Gaussian sigY ( 0, 0.015 );typedef Independent<Gaussian, Gaussian> SigXY;

RandomGenerator<SigXY> rndXY;double x, y;rndXY.generate( x, y );

10% peaking component

Argus + Gaussian peaking

• Transformation of variables is also supportedRandom variables are be generated in the original coordinate

system, then transformed

2D Gaussian peaking

Random generatorsdefined automatically

Define the relevant “language” of the problem in C++!


Fit PDF parameters and run Toy MCFit PDF parameters and run Toy MC

const int sig = 100; double mean = 0, sigma = 1;

Gaussian pdf( mean, sigma ); Likelihood<Gaussian> like( pdf ); UMLParameterFitter<Likelihood<Gaussian> > fitter( like );

fitter.addParameter( "mean", & pdf.mean ); fitter.addParameter( "sigma", & pdf.sigma );

Poissonian num( sig ); // alternative: Constant Gaussian pdfExp( mean, sigma ); Experiment<Poissonian, Gaussian> experiment( num, pdfExp ); for ( int i = 0; i < 50000; i++ ) { Sample<Likelihood::types> sample; experiment.generate( sample ); double par[ 2 ] = { mean, sigma }, err[ 2 ] = { 1, 1 }, logLike; logLike = fitter.fit( par, err, sample );

double pullm = ( par[ 0 ] - mean ) / err[ 0 ]; double pulls = ( par[ 1 ] - sigma ) / err[ 1 ]; }

Poisson PDF for MC generation

Parameters “linked”to the fitter

Definition of fit model and fitter

Type list deduced fromLikelihood type


Parameter fit Results (Pulls)Parameter fit Results (Pulls)

There is a bias (as expected):

2 = 1/ni(xi-)2 1/n-1i(xi-)2


Combined Yield and parameter fitCombined Yield and parameter fit

const int sig = 10, bkg = 5; typedef Poissonian Fluctuation;

Fluctuation fluctuationSig( sig ), fluctuationBkg( bkg );

typedef Independent< Gaussian, Gaussian > PdfSig; typedef Independent< Flat, Flat > PdfBkg; Gaussian g1( 0, 1 ), g2( 0, 0.5 ); Flat f1( -5, 5 ), f2( -5, 5 ); Sig pdfSig( g1, g2 ); Bkg pdfBkg( f1, f2 ); typedef Experiment<Fluctuation, Sig> ToySig; typedef Experiment<Fluctuation, Bkg> ToyBkg; ToySig toySig( fluctuationSig, pdfSig ); ToyBkg toyBkg( fluctuationBkg, pdfBkg );

Experiment2<ToySig, ToyBkg> toy( toySig, toyBkg ); typedef ExtendedLikelihood2<PdfSig, PdfBkg> Likelihood;

Gaussian G1( 0, 1 ); Sig pdfSig1( G1, g2 ); Likelihood like( pdfSig1, pdfBkg ); UMLYieldAndParameterFitter<Likelihood>

fitter( like ); fitter.addParameter( "mean", & G1.mean );

double pull1, pull2, pull3; for ( int i = 0; i < 50000; i++ ) { Sample< Likelihood::types > sample; toy.generate( sample ); double s[] = { sig, bkg, 0 }; double err[] = { 1, 1, 1 }; double logLike = fitter.fit( s, err, sample ); pull1 = ( s[ 0 ] - sig ) / err[ 0 ]; pull2 = ( s[ 1 ] - bkg ) / err[ 1 ]; pull3 = ( s[ 2 ] - 0 ) / err[ 2 ]; }

2D Gaussian signal over a 2D flat background:

Simultaneous fit of yields and Gaussian mean


ConclusionsConclusions

• Object Oriented analysis and design has been adopted in the last decade as software technology for physics applications … but many OO solutions exist for complex problems like

physics analysis Special care needed to evaluate code flexibility, extensibility,

interdependency, …

• The potential advantages of OO are fully exploited with a careful evaluation of the required use cases Not always obvious, experience is important! … but more than 10 years of development have brought

significant experience and solutions

OO is a mature technology in the HEP field.


Backup slidesBackup slides


How to improve the Analysis model ?How to improve the Analysis model ?

• Protect existing physics level code against changes New functionalities are added by means of a non-intrusive

design • Same interface and same modus operandi as before

• Backward compatibility with existing physics code

• Replace Objectivity and large ntuples production Support a complete ROOT-IO persistency

• Except detector conditions that still depend on Objectivity– Migration to ROOT in the future is foreseen

Support interactive access ( ROOT/CINT ) from the same ROOT files

• Usable as a data inspection tool


The new Event Store structureThe new Event Store structure

•Track hits•Single crystal calo hits•Cherenkov rings and hits• …

Esd•Reco objects•Tracks and fits•Calorimeter clusters•PID•MC matching

Aod

•Tag bits•Tag floats and ints

Tag

•Geant tracks•Geant vertices

Tru

•Tracks and neutral candidates•Composite candidates

Cnd

•Generic event data•Generic candidate data

Usr

MicroMicro

MiniMini

Structure has been rationalized leaving the same transient interface

ROOT persistency only


User defined event variables User defined event variables

• Interface for global event variables the same mutatis mutandis

UsrEventBlock AWG_Tag;UsrVariable<int> multiplicity;AWG_Tag.addVariable( multiplicity );

UsrEventBlock AWG_Tag;UsrVariable<int> multiplicity;AWG_Tag.addVariable( multiplicity );

multiplicity = some_algo(…);AWG_Tag.put( multiplicity );UsrIfd<UsrEventBlock>::put( event, & AWG_Tag,“AWGTAG”);

multiplicity = some_algo(…);AWG_Tag.put( multiplicity );UsrIfd<UsrEventBlock>::put( event, & AWG_Tag,“AWGTAG”);

begin of the job:Declare variables

event()code

Provide to the analysis groups the functionality to make their own TAGs easily


UML Yield fitUML Yield fit

const int sig = 10, bkg = 5;

typedef Independent< Gaussian, Gaussian > PdfSig; typedef Independent< Flat, Flat > PdfBkg; PdfSig pdfSig( Gaussian( 0, 1 ), Gaussian( 0, 0.5 ) ); PdfBkg pdfBkg( Flat( -5, 5 ), Flat( -5, 5 ) );

typedef ExtendedLikelihood2< PdfSig, PdfBkg > Likelihood; Likelihood like( pdfSig, pdfBkg ); UMLYieldFitter< Likelihood > fitter( like );

typedef Poissonian Fluctuation; // alternative: Constant Fluctuation fluctuationSig( sig ), fluctuationBkg( bkg ); typedef Experiment< Fluctuation, PdfSig > ToySig; typedef Experiment< Fluctuation, PdfBkg > ToyBkg; ToySig toySig( fluctuationSig, pdfSig ); ToyBkg toyBkg( fluctuationBkg, pdfBkg ); Experiment2< ToySig, ToyBkg > toy( toySig, toyBkg );

for ( int i = 0; i < 50000; i++ ) { Sample< Likelihood::types > sample; toy.generate( sample ); double s[] = { sig, bkg }, err[] = { 1, 1 }; double logLike = fitter.fit( s, err, sample ); double pull1 = ( s[0] - sig ) / err[0] ), pull2 = ( ( s[1] - bkg ) / err[1] ); }

Ext. Likelihood with two samples

Yield fitter extractsthe yield of the two components

In 2 dimensions:

Flat background in a signal box

Gaussian signal


Yield fit Results (Pulls)Yield fit Results (Pulls)

Discrete structure because of low statistics Poisson fluctuation

<s> = 10 <b> = 5


Symbolic function packageSymbolic function package

• Symbolic expressions makes the definition of PDFs easier

{

X x; // declare the variable x

// normalize using the symbolic integration at c-tor PdfNonParametric<X> f1( sqr( sin(x) + cos(x) ) ,

0, 4 * M_PI ); // recompute the normalization every time, since // the parameter tau may change from call to call Parameter tau( 0.123 ); PdfParametric<X> f2( x * exp( - tau * x ) ,

0, 10 );

}

User can specifydifferent way of performingnormalization and integration

Normalization:Analytic integral performedby the compiler


Example of Example of 22 fit fit

{ X x; Parameter a( 0 ), b( 1 ), c( 0 ); Function<X> parabola( c + x*( b + x*a ) ); UniformPartition partition( 100, -1.0, 1.0 ); Chi2<Function<X> > chi2( parabola, partition ); Chi2Fitter<Chi2<Function<X> > > fitter( chi ); fitter.addParameter( "a", a.ptr() ); fitter.addParameter( "b", b.ptr() ); fitter.addParameter( "c", c.ptr() );

SampleErr<double> sample( partition.bins() ); // fill the sample... double par[] = { a, b, c }, err[] = { 1, 1, 1 }; fitter.fit( par, err, sample );}


Possible future improvementPossible future improvement

• Upper limit extraction based on Toy Monte Carlo• Support for 2 fit with correlated errors and covariance

matrix• Provide more “standard” PDFs

Crystal ball, Tchebichev polynomials,…

• Managing singular PDF Delta-Dirac components

• Managing (un)folding• …


ConclusionsConclusions

• The new analysis model addresses the limitations of the current analysis model

• ROOT I/O based Event Store persistency• Backward compatibility with existing physics code • Flexibility and extensibility of the Event Store contents

and central role of skims production Iterative skimming of the events with progressive data

reduction and enrichment in specific information provides• easier off-site portability of data

• reduced resource needs ( CPU and disk space )

• Moreover, it favors common tools developments and common analysis strategies among groups


ConclusionConclusion

• We designed a new tool to model fit problems• Using template generic programming we obtained:

Generality:• User can plug-in new components (PDF, transformations, random

generators, etc.)• Easy to incorporate in the tool external contributions

Light-weight• Most of the code is contained in header (#include) files• Mild external dependencies

Easy to use• Very “synthetic” and “expressive” code

CPU Speed• Virtual function calls are extremely limited• Most of the methods are inlined

• Interest has been expressed from: Geant4 Statistical testing toolkit LCG/PI (LHC Computing Grid - Physics Interfaces)

• Will focus on a release version shortly

luca lista, siena 2004 advanced software methods for physics analysis luca lista infn napoli

Documents

analysis purposesalmost

new analysis modelmakes

physics papers

individual analysis

vertexingluca lista

challengingluca lista

componentsluca lista

oo hereluca lista