RooFit – Open issues W. Verkerke

Post on 29-Dec-2015

Datasets

• Current class structure

• Data representation
  – RooAbsData (abstract base class)
  – RooDataSet (unbinned [weighted] data)
  – RooDataHist (binned data)

• Data storage
  – RooAbsDataStore (abstract base class)
  – RooTreeDataStore (TTree-based storage)
    • Used by both RooDataSet and RooDataHist
  – RooCompositeDataStore
    • Used by RooDataSet when combining external datasets with Link() rather than Import()
  – Since there are 2 concrete implementations, most RooFit code is already adapted to the concept that the storage type is not necessarily tree-based (e.g. virtual copy construction through clone functions etc.)

Open issues in datasets - storage

• Project: new STL vector-based storage implementation
  – May be (much) faster than the TTree-based datastore

• Work needed
  – Develop new class RooVectorDataStore
    • Inherits from RooAbsDataStore and implements the full functionality of RooTreeDataStore (including support for append/merge/rename operations and storing of ‘cache’ columns). Must be persistable and must support string and category data types as well
    • Workload: 3 days of work
  – Once done, need to add a cmdline option to RooDataSet/RooDataHist to use this alternate storage technique [easy]
    • Workload: 0.5 days of work
  – Add a new stressRooFit test module that exercises this type of storage
    • Workload: 0.5 days of work
  – Need to validate that RooCompositeDataStore works fine with RooVectorDataStores (should be OK)
    • Workload: 0.5 days of work
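As a rough illustration of the vector-based storage idea, here is a minimal standalone sketch. It does not use ROOT or RooFit: `AbsDataStoreSketch` and `VectorDataStoreSketch` are hypothetical names invented for this example, only double-valued columns are handled, and persistence, category/string columns and cache columns are left out.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical abstract storage interface (NOT the real RooAbsDataStore API).
class AbsDataStoreSketch {
public:
    virtual ~AbsDataStoreSketch() = default;
    virtual std::size_t numEntries() const = 0;
    virtual void fill(const std::map<std::string, double>& row, double weight) = 0;
    virtual double value(const std::string& column, std::size_t entry) const = 0;
    virtual double weight(std::size_t entry) const = 0;
    // Virtual copy construction, so client code never needs the concrete type.
    virtual std::unique_ptr<AbsDataStoreSketch> clone() const = 0;
};

// Column-wise storage: one contiguous std::vector<double> per variable.
class VectorDataStoreSketch : public AbsDataStoreSketch {
public:
    explicit VectorDataStoreSketch(const std::vector<std::string>& names) {
        for (const auto& name : names) columns_[name];  // creates an empty column
    }
    std::size_t numEntries() const override { return weights_.size(); }

    void fill(const std::map<std::string, double>& row, double weight) override {
        // Check the whole row first so a bad row cannot leave columns unequal.
        for (const auto& col : columns_)
            if (row.count(col.first) == 0)
                throw std::invalid_argument("missing column " + col.first);
        for (auto& col : columns_) col.second.push_back(row.at(col.first));
        weights_.push_back(weight);
    }
    double value(const std::string& column, std::size_t entry) const override {
        return columns_.at(column).at(entry);
    }
    double weight(std::size_t entry) const override { return weights_.at(entry); }
    std::unique_ptr<AbsDataStoreSketch> clone() const override {
        return std::unique_ptr<AbsDataStoreSketch>(new VectorDataStoreSketch(*this));
    }

    // Append all entries of another store that has the same column names.
    void append(const VectorDataStoreSketch& other) {
        if (&other == this) throw std::invalid_argument("self-append not supported");
        if (other.columns_.size() != columns_.size())
            throw std::invalid_argument("column mismatch");
        for (const auto& col : columns_)
            if (other.columns_.count(col.first) == 0)
                throw std::invalid_argument("column mismatch");
        for (auto& col : columns_) {
            const auto& src = other.columns_.at(col.first);
            col.second.insert(col.second.end(), src.begin(), src.end());
        }
        weights_.insert(weights_.end(), other.weights_.begin(), other.weights_.end());
    }

    // Rename a column without copying its contents.
    void renameColumn(const std::string& from, const std::string& to) {
        auto it = columns_.find(from);
        if (it == columns_.end() || columns_.count(to) != 0)
            throw std::invalid_argument("cannot rename column");
        columns_[to] = std::move(it->second);
        columns_.erase(it);
    }

private:
    std::map<std::string, std::vector<double>> columns_;
    std::vector<double> weights_;
};
```

Keeping each column contiguous is what would make looping over one observable cheap; whether that translates into the hoped-for gain over the TTree-based store would still have to be measured.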

Open issues in datasets - representation

• Request for new mixed binned-unbinned data representation type

• Work needed
  – Fixed feature is a ‘master category’ variable that indexes the various data subsets.
  – Write class RooMixedData to represent this.
  – Need to work out the precise functionality and interface of such a class
    • Several concepts of binned data are not available for unbinned data and vice versa (see next slide)
  – Could make a class that only implements the common aspects (as defined in RooAbsData), but in practice it would only be usable as a read-only class. OK?
  – Is (typed) access to the component representation needed, i.e. do you need to be able to see subset [i] as a RooDataHist or RooDataSet? This is not handled by the composite storage scheme, but could be added as a separate layer: RooMixedData owns multiple RooDataHist and RooDataSet objects that each own their own storage, then links their storage objects to a RooCompositeDataStore for a unified view.
  – Workload: ~1 week (depending on what design/interface issues will appear…)
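The "separate layer" option can be pictured with a small standalone sketch. All names here (`SubsetSketch`, `MixedDataSketch`, and so on) are hypothetical and unrelated to actual RooFit classes; the sketch assumes a single observable and a string-valued master category.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Common read-only interface shared by binned and unbinned subsets.
struct SubsetSketch {
    virtual ~SubsetSketch() = default;
    virtual double sumEntries() const = 0;
};

// Unbinned subset: one (value, weight) pair per event.
struct UnbinnedSubsetSketch : SubsetSketch {
    std::vector<double> points;
    std::vector<double> weights;
    void add(double x, double w) { points.push_back(x); weights.push_back(w); }
    double sumEntries() const override {
        double sum = 0.0;
        for (double w : weights) sum += w;
        return sum;
    }
};

// Binned subset: add() increases the weight of the corresponding bin.
struct BinnedSubsetSketch : SubsetSketch {
    double lo, hi;
    std::vector<double> binWeights;
    BinnedSubsetSketch(double low, double high, std::size_t nbins)
        : lo(low), hi(high), binWeights(nbins, 0.0) {
        if (nbins == 0 || !(high > low)) throw std::invalid_argument("bad binning");
    }
    void add(double x, double w) {
        if (x < lo || x >= hi) throw std::out_of_range("x outside histogram range");
        std::size_t bin = static_cast<std::size_t>((x - lo) / (hi - lo) * binWeights.size());
        bin = std::min(bin, binWeights.size() - 1);  // guard against rounding at the edge
        binWeights[bin] += w;
    }
    double binVolume() const { return (hi - lo) / binWeights.size(); }
    double sumEntries() const override {
        double sum = 0.0;
        for (double w : binWeights) sum += w;
        return sum;
    }
};

// The master category label indexes subsets that each own their own storage.
class MixedDataSketch {
public:
    void addSubset(const std::string& state, std::unique_ptr<SubsetSketch> subset) {
        subsets_[state] = std::move(subset);
    }
    double sumEntries() const {
        double sum = 0.0;
        for (const auto& s : subsets_) sum += s.second->sumEntries();
        return sum;
    }
    // Typed access: returns nullptr if the subset is not of the requested type.
    template <class T>
    const T* as(const std::string& state) const {
        return dynamic_cast<const T*>(subsets_.at(state).get());
    }
private:
    std::map<std::string, std::unique_ptr<SubsetSketch>> subsets_;
};
```

Only `sumEntries()` is available through the common interface, which mirrors the concern that a class restricted to the shared aspects is effectively read-only; anything type-specific goes through `as<T>()`.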

Functionality of RooDataSet/RooDataHist

Operation              | RooDataHist                                          | RooDataSet
-----------------------|------------------------------------------------------|-------------------------------------------
add(RooArgSet)         | Increase weight of corresponding bin                 | Add data point
append(RooAbsData)     | Add all points                                       | Add all points
merge(RooDataSet)      | UNDEFINED                                            | Add columns from imported dataset
addColumn(RooAbsArg)   | UNDEFINED                                            | Add columns with values of given function
set(RooArgSet&,dbl)    | Set weight of given point to given value             | UNDEFINED
binVolume(RooArgSet&)  | Return volume of bin in given (subset of) dimensions | UNDEFINED
weightError()          | Return error on given weight()                       | UNDEFINED

Open issues in datasets - representation

• Representation of number-counting data

• Now
  – Regular PDF: Gauss(x), with data RooDataSet(x) with N entries
  – Extended PDF: Gauss(x)*Poisson(N), with data RooDataSet(x) with N entries
  – Number-counting PDF: should be (in analogy) Poisson(N), with data RooCountingData(<no_obs>) with N entries, but we don’t have that.
  – Can do: Poisson(N), with data RooDataSet(N) with 1 entry, but that doesn’t (automatically) behave in the right way.
  – Also requires some thinking on the PDF side…
  – Two ways to go

Open issues in datasets - representation

• Path #1 (Kyle proposal)
  – Need to label (any) pdf explicitly as a ‘number counting’ pdf
  – Effect is that generate() fills a dataset with 1 entry representing the event count, rather than N entries of a dummy observable where the dataset size represents the event count
  – Possible issue: the special meaning of counting data is only clear in the context of the (labeled) pdf that generated it, unless the data is also labeled itself in some way. [ E.g. when calculating the total event count of a composite dataset, need to know if a RooDataSet with 1 entry counts as 1 or as N; similar issue when asking for the event count of a component dataset ]

• Path #2 (My original proposal)
  – Make a wrapper class that represents any pdf as a number-counting pdf, e.g. class RooCountingPdf:
    ws.factory("CountingPdf::Nexp(Poisson(Nobs,mu))") ;
  – Net effect of class is
    • to redirect output of RooAbsPdf::getVal() to RooAbsPdf::expectedEvents()
    • to return an object of type RooCountingData when generate() is called
  – Requires writing a class RooCountingData, which can be extremely lightweight & fast (just contains 1 double)
  – Adapt class RooMixedData to be able to also contain RooCountingData
  – Data and pdf are both self-labeling in terms of interpretation. Should be straightforward to use this in existing RooFit code [ but need to check if there is code that assumes at least one ‘observable’ ]

• Workload: either way 2-3 days
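To make Path #2 more concrete, the following standalone sketch shows a dataset that holds only a count and a model that generates and evaluates such data. `CountingDataSketch` and `CountingPdfSketch` are hypothetical names; unlike the proposed wrapper, this toy does not wrap an arbitrary pdf but takes the expected yield directly, and it assumes a plain Poisson model.

```cpp
#include <cmath>
#include <random>
#include <stdexcept>

// Hypothetical lightweight dataset: no observables, just one double.
struct CountingDataSketch {
    double count;
};

// Hypothetical number-counting model with a fixed expected yield.
class CountingPdfSketch {
public:
    explicit CountingPdfSketch(double expected) : mu_(expected) {
        if (!(expected > 0.0)) throw std::invalid_argument("expected yield must be > 0");
    }

    double expectedEvents() const { return mu_; }

    // Poisson negative log-likelihood: mu - N*log(mu) + log(N!).
    // lgamma(N + 1) stands in for log(N!) so non-integer counts are accepted.
    double nll(const CountingDataSketch& data) const {
        return mu_ - data.count * std::log(mu_) + std::lgamma(data.count + 1.0);
    }

    // generate() returns a single count rather than N dummy entries.
    template <class Rng>
    CountingDataSketch generate(Rng& rng) const {
        std::poisson_distribution<long> pois(mu_);
        return CountingDataSketch{static_cast<double>(pois(rng))};
    }

private:
    double mu_;
};
```

Because the data type itself says "this is a count", there is no ambiguity about whether one stored value means 1 event or N events, which is the self-labeling property noted above.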

Conceptual issues with simultaneous pdf / data

• Need more flexibility in mixing/matching different pdfs

• E.g. sim[ F(x), G(y) | i ]
  – Will work technically, but the fundamental issue is that the meaningful observables depend on index i
  – Unwanted side effects of the present construction:
    • generate() will make a random y variable for generation of F(x), and a random x variable for generation of G(y).
    • Datasets will always allocate entries for x and y for both dataset subsets (results in a waste of space, especially if x,y are binned)

• Need several items to resolve this
  – Composite datasets, where each subset only stores selected observables [ need: a mechanism to specify this ]
  – A mechanism in RooSimultaneous::generate() to only generate the “relevant” observables for each state [ need: same mechanism to specify this ]
  – Will need to change RooSimultaneous in any case to store output in a composite datastore [ not done now ] to gain the needed flexibility

Conceptual issues with simultaneous pdf / data

• Composite datasets are most likely used only in conjunction with RooSimultaneous, so that p.d.f. is likely the most sensible point to make this interface, e.g.

ws.factory("SIMUL::model[idx,a=pdfA(x),b=pdfB(y)]")

then internally modify RooSimultaneous::generate() to follow the instructions accordingly.

• Also need new syntax to construct RooDataSets in this way

RooDataSet ds("ds","ds",RooArgSet(x,y,i),Index(i), Import(dataA,"a",x), Import(dataB,"b",y)) ;
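The effect of index-dependent observables on storage can be shown with a small standalone sketch. `IndexedDataSketch` and `RowSketch` are hypothetical names made up for this illustration; the sketch only mimics the idea that each index state keeps just the observables declared for it.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

using RowSketch = std::map<std::string, double>;

// Hypothetical composite dataset in which each index state stores
// only the observables that are meaningful for that state.
class IndexedDataSketch {
public:
    // Declare the relevant observables for one index state.
    void defineState(const std::string& state, const std::set<std::string>& observables) {
        observables_[state] = observables;
        rows_[state];  // make the state visible even while it is empty
    }

    // Import rows for a state; observables not declared for it are dropped.
    void importRows(const std::string& state, const std::vector<RowSketch>& rows) {
        const auto& keep = observables_.at(state);  // throws for an undeclared state
        for (const auto& row : rows) {
            RowSketch reduced;
            for (const auto& name : keep) reduced[name] = row.at(name);
            rows_[state].push_back(reduced);
        }
    }

    bool stores(const std::string& state, const std::string& observable) const {
        return observables_.at(state).count(observable) > 0;
    }

    std::size_t numEntries() const {
        std::size_t n = 0;
        for (const auto& s : rows_) n += s.second.size();
        return n;
    }

    // Number of doubles actually held, summed over all states.
    std::size_t storedValues() const {
        std::size_t n = 0;
        for (const auto& s : rows_)
            for (const auto& row : s.second) n += row.size();
        return n;
    }

private:
    std::map<std::string, std::set<std::string>> observables_;
    std::map<std::string, std::vector<RowSketch>> rows_;
};
```

With state "a" declared on x and state "b" on y, three imported events occupy three stored values instead of the six that a layout allocating both x and y for every entry would need.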

Conceptual issues with simultaneous pdf / data

• Once the concept of RooMixedData is implemented, can also think about the interface for binned-vs-unbinned datasets
  – Construction ‘by hand’ follows trivially from the ctor

    RooMixedData ds("ds","ds",RooArgSet(x,y,i),Index(i), Import(dataA,"a",x), Import(dataB,"b",y)) ;

  – When generating, binned-vs-unbinned is a ‘preference’ (you can always do it either way)
  – Either specify at generation time (requires non-trivial interface), or encode the ‘preference’ inside a RooSimultaneous
    • Still requires some creativity to be able to insert this preference spec in the factory
    • Otherwise through the class interface

      sim.setGenerateBinned("a",kTRUE) ;

Recap of data and simultaneous issues

• Project 1
  – Make RooVectorDataStore ~ 1 week. Easily factorized/delegated

• Project 2
  – Adjust RooDataSet/RooDataHist to accept index-dependent observables [ ~2-3 days ]
  – Adjust RooSimultaneous to specify ‘relevant’ observables for each index [ 1 day ]

• Project 3
  – Make RooCountingData ~ 2-3 days
  – Make RooMixedData ~ 2-4 days [ depending on difficulties ]
  – Adjust RooSimultaneous to use these

Other issues

• Workspaces
  – Ability to rename named sets stored in datasets [ 1 hour ]
  – Make EDIT() capable of removing terms in PROD terms [ 1 day ]
  – Bug in RooHistPdf persistence [ 1-2 days ]
    • Time consuming as it requires intervention in the RooAbsArg streamer
  – Kyle reported 32/64 issues in persistence [ need example ] [ ?? ]

• Pdf interface issues
  – Port generateSimGlobal() to generate() interface [ 1 day ]
  – Make extendedTerm() return Double_t instead of Int_t to support Asimov datasets [ 0.5 day ]
  – Common abstract interface for morphing operator PDF [ ??? ]

• Likelihood interface issues
  – What normalization set is applied to constraint terms?
  – Need a data/pdf combination scheme that allows detaching a dataset that has already died from an NLL. Simplifies use of setData() in RooStats [ 1-2 days ]

Addressing RooStats performance issues from RooFit side

• Avoid need to (re)create likelihoods
  – Modified data/pdf attachment scheme in RooNLLVar that allows datasets to be detached after they have been deleted. Allows straightforward use of setData() in RooStats [ 2 days ]

• Speeding up dataset looping, creation, and deletion
  – Vector-based datasets [ ~1 week ]

• Copy overhead of complex objects
  – Complex defined as having >>100 nodes
  – Several optimizations already applied on the RooFit side (hash tables etc. for reconnection lookup). The biggest speed gain is most likely in the form of new classes that reduce the number of objects: collapse the construct of a pdf for N channels into a single one. Needs some details on use cases, but good progress is likely possible in O[2-3 days]

• Profiling of RooStats TLimit macro essential
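One way to picture the detach-and-reattach scheme from the first bullet is the standalone sketch below. It is not how RooNLLVar works: `NllSketch` and `DataSketch` are hypothetical names, the model is a unit-width Gaussian chosen only to keep the toy short, and the use of `std::weak_ptr` to notice a deleted dataset is an assumption of this example, not something RooFit does.

```cpp
#include <memory>
#include <stdexcept>
#include <vector>

using DataSketch = std::vector<double>;

// Hypothetical likelihood object that survives deletion of its dataset.
class NllSketch {
public:
    explicit NllSketch(const std::shared_ptr<const DataSketch>& data) : data_(data) {}

    // False once the attached dataset has been deleted elsewhere.
    bool hasData() const { return !data_.expired(); }

    // Attach a new dataset without rebuilding the likelihood object.
    void setData(const std::shared_ptr<const DataSketch>& data) { data_ = data; }

    // Negative log-likelihood of a unit-width Gaussian with mean mu
    // (constant terms dropped).
    double evaluate(double mu) const {
        const std::shared_ptr<const DataSketch> data = data_.lock();
        if (!data) throw std::runtime_error("dataset was deleted; call setData() first");
        double sum = 0.0;
        for (double x : *data) sum += 0.5 * (x - mu) * (x - mu);
        return sum;
    }

private:
    std::weak_ptr<const DataSketch> data_;  // observes the data, does not own it
};
```

The point of the design is that the likelihood object is built once and then fed a sequence of datasets, which is the usage pattern that would avoid repeated likelihood creation.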