luca lista, siena 2004 advanced software methods for physics analysis luca lista infn napoli
TRANSCRIPT
Luca Lista, Siena 2004
Advanced software methods Advanced software methods for physics analysisfor physics analysis
Luca Lista
INFN Napoli
Luca Lista, Siena 2004
IntroductionIntroduction
• Unprecedented data analysis complexity are experienced in BaBar today, and LHC in the future Huge data samples Large number of analyses High level of analysis complexity
• Analysis software and data model must have a high level of reliability Provide the users community with a set powerful, generic,
simple to use, tools to analyze data “Core” software technicalities must be hidden to the end user
behind well defined interfaces Code reuse must be pursued to permit users to share what has
been developed A good Object Oriented architecture is a solution
• N.B.: “Good” is more important than “OO” here…
Luca Lista, Siena 2004
OutlineOutline
• Analysis Data Model BaBar experience with the new Analysis Model
• Analysis toolkits An example of fitting and “toy” Monte Carlo
• Conclusions
Luca Lista, Siena 2004
BaBar experienceBaBar experience
• BaBar has collected up to now 200×106 BB pairs
• Physics events are organized according to a well established model with different components at increasing levels of detail: Tag ( boolean bits, floats and int. ) for fast event filtering Aod ( “Micro-DST” ) containing physics-level objects Esd ( “Mini-DST” ) intermediate level reconstruction objects
• Reconstructed tracks and neutral objects are accessible to the users by means of the information provided by “BtaCandidates”, which represent particle candidates
• Event store technology originally based on Objectivity Showed to be unfit for analysis purposes
• Almost every physics analysis is based on Tag and Aod levels. ROOT-IO supported early for those components
Luca Lista, Siena 2004
The original Analysis ModelThe original Analysis Model
• The “physics” code in the BaBar Framework Performs particle identification and build intermediate
“composite” particles (0, KS, D, resonances) • Composite particles are built by means of a standard (i.e.
validated) and configurable set of tools • This task is intrinsically very expensive in term of CPU time
because of the high combinatorics involved.• The output of the process (containers of composite particles
candidates) was available only at the transient level Very often physics analyses take advantage of an intermediate
step of data reduction (skimming) Performs further data reduction on individual analysis basis Eventually manage the production of large PAW/ROOT
n-tuples which contain specific analysis level physics variables, composite candidates kinematics and vertexing
Luca Lista, Siena 2004
Old model achievementsOld model achievements
• This model has been so far very successful in producing high quality physics results Submitted more than 90 physics papers, which
include discoveries!
But experts believe that it wouldn’t scale well to 5x current data set
Moreover, analysis at the n-tuple level has the disadvantage that existing (and validated) code is lost
This leads to replication of code Sharing analysis tools and strategies is challenging
Luca Lista, Siena 2004
The new Analysis ModelThe new Analysis Model
• Makes the Event Store content configurable Support for the storage of composite candidates More flexibility: support addition of custom user data
• Expand the role of Skimming (collections of pre-selected events) Rewrite just customized components
• Borrow components from parent collections
Provide both pointer-copy full deep-copy for light skims • Better performance in reading, off-site easy portability
The possible scenario: • Analysis working groups may instrument their skims with high level added
data and specific composite candidates• Final users may produce their own reduced skims (“re-skimming”) with
their options of user added data and composite candidates depending on their actual needs
Frequent centralized skimming to reduce the delay for availability of improvements, new skims, etc.
Interactively access supported to skim data (ROOT) for analysis!
• Production of large n-tuples becomes less appealing!
Luca Lista, Siena 2004
Composite particles candidatesComposite particles candidates
• PAW n-tuples was used as an additional storage format to cope with lack of needed functionalities Analysts use to store composite candidates and high level physics
quantities in the n-tuples Often replicating most of the Event Store content. Every analysis working group has its own large heterogeneous
PAW/ROOT files (disk space inefficient) and formats
• With the new model, candidate’s particles genealogy, the fit algorithm and the constrains are written to the DST This design choice rewards disk space saving with some cost in terms
of reading performances• When reading back, all the persistent candidates are translated in
transient objects The performance issues of re-fitting all the transient candidates may
be mitigated supporting dynamic data access• Also known as “Reconstruction on demand”
Luca Lista, Siena 2004
User added data - requirements User added data - requirements
• Support for storing user-defined variables Widely used physics analysis variables in BaBar:
• mES, E, mD*,D, cosB,D*l, …• Guarantee flexibility and uniformity in the interface
Support for native types • float, int,…
As well as for more complex objects• ThreeVector, LorentzVector, …
• User data should refer to analysis objects or be global to the entire event
• Non-intrusive design To extend functionalities (old good Open/Closed OO principle)
• Simplicity from the point view of the user Technological details are hidden as much as possible
Luca Lista, Siena 2004
User Added Data Interface User Added Data Interface
• The requirements are fulfilled by systematic use of Generic Programming (C++ templates) User variables of generic type
• Global event data• Data associable to generic objects
– Specializations for BtaCandidate case (the BaBar particle candidate object)
All native and most common structured variable types and for BtaCandidate objects can be stored in the DST
• User has to deal with few concepts and classes Variables by means of UsrVariable<T> template class Blocks aggregating variables values and associations:
• UsrEventBlock class to aggregate quantities relevant to the whole event • UsrCandBlock class to group variables associated to candidates
static “put”/”get” functions to interact with the event store
Luca Lista, Siena 2004
User defined candidate variables User defined candidate variables
UsrVariable<float> mES( “mES” );UsrVariable<float> deltaE( “deltaE” );UsrCandBlock B0Data;
for each BtaCandidate cand{ bool ok = true; mES = ...; deltaE = ...; // compute the values ok &= B0Data.put( cand, mES ); ok &= B0Data.put( cand, deltaE );}
for each BtaCandidate cand{ bool found = B0Data.get( cand, mES ); // get candidate mES found = B0Data.get( cand, deltaE ); // get candidate deltaE}
begin of the job:Declare variables
Analysis groupswriting data
Analystreading data
Luca Lista, Siena 2004
Status of the new Analysis ModelStatus of the new Analysis Model
• The model is completed and in production• Physics results for the next conferences are being
produced using the new Analysis Model
Luca Lista, Siena 2004
A toolkit for Fit and “toy” Monte CarloA toolkit for Fit and “toy” Monte Carlo
• The toolkit provides: a language to describe and model complex
parametric fit problems in C++ utilities to study the fit frequentistic properties
• Not intended to provide new mathematical algorithms The underlying minimization engine is Minuit
• Interest on this tool from: BaBar, GEANT4, LCG-PI (to be released)
Luca Lista, Siena 2004
Main functionalitiesMain functionalities
• Description of Probability Distribution Functions (PDF) most common PDFs provided (Gaussian, Poisson, etc.) random number generators for each provided PDF utilities to combine and manipulate PDFs
• Manipulation of symbolic expression simplifies the definition of PDF models and fit functions
• Fitter tools different Unbinned Maximum Likelihood (UML) fitters and
Chi-square fitter supported• Toy Monte Carlo
utility to generate random data samples to validate the fit results (pull distribution, fit bias estimate, etc.)
• User-defined components can be easily plugged-in
Luca Lista, Siena 2004
Design choicesDesign choices
• The code is optimized for speed Toy Monte Carlo of complex fits are very CPU intensive
• It can be achieved without loosing good OO design avoid virtual functions where not necessary using template generic programming the Boost C++ library provides powerful tools
• Metaprogramming permits type manipulations at compile time
• User don’t “see” these technical detail in the interface
• External package dependencies are well isolated Random number generator engines (ROOT, CLHEP, …) Minuit wrapper (ROOT, …)
• Other minimizers may be adopted (NAG, …)
Luca Lista, Siena 2004
A PDF implements the “()” operator: P = f( x, y, … )
Users can define new PDFs respecting the above interface
PDF interfacePDF interface
struct Flat : { PdfFlat( double a, double b ) : min( a ), max( b ) { } double operator()( double x ) const { return ( x < min || x > max ? 0 : 1 / ( max - min ) ); }
double min, max;};
struct Poissonian { PdfPoissonian( double m ) : mean( m ) { } double operator()( int n ) const { return ( exp( - mean ) * pow( mean, n ) / factorial( n ) ); }
double mean;};
Variable set;a sequence of any variable type is supported
Returns dP(x)/dx Returns P(n)
Luca Lista, Siena 2004
Implements the “generate” method: r.generate( x, y, … )
Random number generatorsRandom number generators
template< typename Generator = DefaultGen>struct RandomGenerator< Flat, Generator >{ RandomGenerator( const Flat& pdf ) : _min( pdf.min ), _max( pdf.max ) { } void generate( double & x ) const{ x = Generator::shootFlat( _min, _max ); } private: const double& _min, &_max;};
RANDOM_GENERATOR_SAMPLE(MyPdf, Bins, Min, Max)
RANDOM_GENERATOR_HITORMISS(MyPdf, Min, Max, fMax)
• Users can define new generators with the preferred method
• Numerical implementations are provided • trapezoidal PDF sampling• “hit or miss” technique
Random engine:CLHEP, ROOT, …
Partial specialization
Luca Lista, Siena 2004
Combining PDFsCombining PDFs
Argus shoulder ( 5.20, 5.28, -0.1 );Gaussian peak( 5.28, 0.05 );typedef Mixture<Gaussian, Argus> Mix;Mix pdf( peak, shoulder, 0.1 );
RandomGenerator<Mix> rnd;double x;rnd.generate( x );
Gaussian sigX( 5.28, 0.05 );Gaussian sigY ( 0, 0.015 );typedef Independent<Gaussian, Gaussian> SigXY;
RandomGenerator<SigXY> rndXY;double x, y;rndXY.generate( x, y );
10% peaking component
Argus + Gaussian peaking
• Transformation of variables is also supportedRandom variables are be generated in the original coordinate
system, then transformed
2D Gaussian peaking
Random generatorsdefined automatically
Define the relevant “language” of the problem in C++!
Luca Lista, Siena 2004
Fit PDF parameters and run Toy MCFit PDF parameters and run Toy MC
const int sig = 100; double mean = 0, sigma = 1;
Gaussian pdf( mean, sigma ); Likelihood<Gaussian> like( pdf ); UMLParameterFitter<Likelihood<Gaussian> > fitter( like );
fitter.addParameter( "mean", & pdf.mean ); fitter.addParameter( "sigma", & pdf.sigma );
Poissonian num( sig ); // alternative: Constant Gaussian pdfExp( mean, sigma ); Experiment<Poissonian, Gaussian> experiment( num, pdfExp ); for ( int i = 0; i < 50000; i++ ) { Sample<Likelihood::types> sample; experiment.generate( sample ); double par[ 2 ] = { mean, sigma }, err[ 2 ] = { 1, 1 }, logLike; logLike = fitter.fit( par, err, sample );
double pullm = ( par[ 0 ] - mean ) / err[ 0 ]; double pulls = ( par[ 1 ] - sigma ) / err[ 1 ]; }
Poisson PDF for MC generation
Parameters “linked”to the fitter
Definition of fit model and fitter
Type list deduced fromLikelihood type
Luca Lista, Siena 2004
Parameter fit Results (Pulls)Parameter fit Results (Pulls)
There is a bias (as expected):
2 = 1/ni(xi-)2 1/n-1i(xi-)2
Luca Lista, Siena 2004
Combined Yield and parameter fitCombined Yield and parameter fit
const int sig = 10, bkg = 5; typedef Poissonian Fluctuation;
Fluctuation fluctuationSig( sig ), fluctuationBkg( bkg );
typedef Independent< Gaussian, Gaussian > PdfSig; typedef Independent< Flat, Flat > PdfBkg; Gaussian g1( 0, 1 ), g2( 0, 0.5 ); Flat f1( -5, 5 ), f2( -5, 5 ); Sig pdfSig( g1, g2 ); Bkg pdfBkg( f1, f2 ); typedef Experiment<Fluctuation, Sig> ToySig; typedef Experiment<Fluctuation, Bkg> ToyBkg; ToySig toySig( fluctuationSig, pdfSig ); ToyBkg toyBkg( fluctuationBkg, pdfBkg );
Experiment2<ToySig, ToyBkg> toy( toySig, toyBkg ); typedef ExtendedLikelihood2<PdfSig, PdfBkg> Likelihood;
Gaussian G1( 0, 1 ); Sig pdfSig1( G1, g2 ); Likelihood like( pdfSig1, pdfBkg ); UMLYieldAndParameterFitter<Likelihood>
fitter( like ); fitter.addParameter( "mean", & G1.mean );
double pull1, pull2, pull3; for ( int i = 0; i < 50000; i++ ) { Sample< Likelihood::types > sample; toy.generate( sample ); double s[] = { sig, bkg, 0 }; double err[] = { 1, 1, 1 }; double logLike = fitter.fit( s, err, sample ); pull1 = ( s[ 0 ] - sig ) / err[ 0 ]; pull2 = ( s[ 1 ] - bkg ) / err[ 1 ]; pull3 = ( s[ 2 ] - 0 ) / err[ 2 ]; }
2D Gaussian signal over a 2D flat background:
Simultaneous fit of yields and Gaussian mean
Luca Lista, Siena 2004
ConclusionsConclusions
• Object Oriented analysis and design has been adopted in the last decade as software technology for physics applications … but many OO solutions exist for complex problems like
physics analysis Special care needed to evaluate code flexibility, extensibility,
interdependency, …
• The potential advantages of OO are fully exploited with a careful evaluation of the required use cases Not always obvious, experience is important! … but more than 10 years of development have brought
significant experience and solutions
OO is a mature technology in the HEP field.
Luca Lista, Siena 2004
Backup slidesBackup slides
Luca Lista, Siena 2004
How to improve the Analysis model ?How to improve the Analysis model ?
• Protect existing physics level code against changes New functionalities are added by means of a non-intrusive
design • Same interface and same modus operandi as before
• Backward compatibility with existing physics code
• Replace Objectivity and large ntuples production Support a complete ROOT-IO persistency
• Except detector conditions that still depend on Objectivity– Migration to ROOT in the future is foreseen
Support interactive access ( ROOT/CINT ) from the same ROOT files
• Usable as a data inspection tool
Luca Lista, Siena 2004
The new Event Store structureThe new Event Store structure
•Track hits•Single crystal calo hits•Cherenkov rings and hits• …
Esd•Reco objects•Tracks and fits•Calorimeter clusters•PID•MC matching
Aod
•Tag bits•Tag floats and ints
Tag
•Geant tracks•Geant vertices
Tru
•Tracks and neutral candidates•Composite candidates
Cnd
•Generic event data•Generic candidate data
Usr
MicroMicro
MiniMini
Structure has been rationalized leaving the same transient interface
ROOT persistency only
Luca Lista, Siena 2004
User defined event variables User defined event variables
• Interface for global event variables the same mutatis mutandis
UsrEventBlock AWG_Tag;UsrVariable<int> multiplicity;AWG_Tag.addVariable( multiplicity );
UsrEventBlock AWG_Tag;UsrVariable<int> multiplicity;AWG_Tag.addVariable( multiplicity );
multiplicity = some_algo(…);AWG_Tag.put( multiplicity );UsrIfd<UsrEventBlock>::put( event, & AWG_Tag,“AWGTAG”);
multiplicity = some_algo(…);AWG_Tag.put( multiplicity );UsrIfd<UsrEventBlock>::put( event, & AWG_Tag,“AWGTAG”);
begin of the job:Declare variables
event()code
Provide to the analysis groups the functionality to make their own TAGs easily
Luca Lista, Siena 2004
UML Yield fitUML Yield fit
const int sig = 10, bkg = 5;
typedef Independent< Gaussian, Gaussian > PdfSig; typedef Independent< Flat, Flat > PdfBkg; PdfSig pdfSig( Gaussian( 0, 1 ), Gaussian( 0, 0.5 ) ); PdfBkg pdfBkg( Flat( -5, 5 ), Flat( -5, 5 ) );
typedef ExtendedLikelihood2< PdfSig, PdfBkg > Likelihood; Likelihood like( pdfSig, pdfBkg ); UMLYieldFitter< Likelihood > fitter( like );
typedef Poissonian Fluctuation; // alternative: Constant Fluctuation fluctuationSig( sig ), fluctuationBkg( bkg ); typedef Experiment< Fluctuation, PdfSig > ToySig; typedef Experiment< Fluctuation, PdfBkg > ToyBkg; ToySig toySig( fluctuationSig, pdfSig ); ToyBkg toyBkg( fluctuationBkg, pdfBkg ); Experiment2< ToySig, ToyBkg > toy( toySig, toyBkg );
for ( int i = 0; i < 50000; i++ ) { Sample< Likelihood::types > sample; toy.generate( sample ); double s[] = { sig, bkg }, err[] = { 1, 1 }; double logLike = fitter.fit( s, err, sample ); double pull1 = ( s[0] - sig ) / err[0] ), pull2 = ( ( s[1] - bkg ) / err[1] ); }
Ext. Likelihood with two samples
Yield fitter extractsthe yield of the two components
In 2 dimensions:
Flat background in a signal box
Gaussian signal
Luca Lista, Siena 2004
Yield fit Results (Pulls)Yield fit Results (Pulls)
Discrete structure because of low statistics Poisson fluctuation
<s> = 10 <b> = 5
Luca Lista, Siena 2004
Symbolic function packageSymbolic function package
• Symbolic expressions makes the definition of PDFs easier
{
X x; // declare the variable x
// normalize using the symbolic integration at c-tor PdfNonParametric<X> f1( sqr( sin(x) + cos(x) ) ,
0, 4 * M_PI ); // recompute the normalization every time, since // the parameter tau may change from call to call Parameter tau( 0.123 ); PdfParametric<X> f2( x * exp( - tau * x ) ,
0, 10 );
}
User can specifydifferent way of performingnormalization and integration
Normalization:Analytic integral performedby the compiler
Luca Lista, Siena 2004
Example of Example of 22 fit fit
{ X x; Parameter a( 0 ), b( 1 ), c( 0 ); Function<X> parabola( c + x*( b + x*a ) ); UniformPartition partition( 100, -1.0, 1.0 ); Chi2<Function<X> > chi2( parabola, partition ); Chi2Fitter<Chi2<Function<X> > > fitter( chi ); fitter.addParameter( "a", a.ptr() ); fitter.addParameter( "b", b.ptr() ); fitter.addParameter( "c", c.ptr() );
SampleErr<double> sample( partition.bins() ); // fill the sample... double par[] = { a, b, c }, err[] = { 1, 1, 1 }; fitter.fit( par, err, sample );}
Luca Lista, Siena 2004
Possible future improvementPossible future improvement
• Upper limit extraction based on Toy Monte Carlo• Support for 2 fit with correlated errors and covariance
matrix• Provide more “standard” PDFs
Crystal ball, Tchebichev polynomials,…
• Managing singular PDF Delta-Dirac components
• Managing (un)folding• …
Luca Lista, Siena 2004
ConclusionsConclusions
• The new analysis model addresses the limitations of the current analysis model
• ROOT I/O based Event Store persistency• Backward compatibility with existing physics code • Flexibility and extensibility of the Event Store contents
and central role of skims production Iterative skimming of the events with progressive data
reduction and enrichment in specific information provides• easier off-site portability of data
• reduced resource needs ( CPU and disk space )
• Moreover, it favors common tools developments and common analysis strategies among groups
Luca Lista, Siena 2004
ConclusionConclusion
• We designed a new tool to model fit problems• Using template generic programming we obtained:
Generality:• User can plug-in new components (PDF, transformations, random
generators, etc.)• Easy to incorporate in the tool external contributions
Light-weight• Most of the code is contained in header (#include) files• Mild external dependencies
Easy to use• Very “synthetic” and “expressive” code
CPU Speed• Virtual function calls are extremely limited• Most of the methods are inlined
• Interest has been expressed from: Geant4 Statistical testing toolkit LCG/PI (LHC Computing Grid - Physics Interfaces)
• Will focus on a release version shortly