impact of parallelism on hep software april 29 th 2013 ecole polytechnique/llr rene brun

Impact of parallelism on HEP software

April 29th 2013

Ecole Polytechnique/LLR

Rene Brun

Software Upgrades

All LHC experiments and groups like CERN/SFT are looking at all possible performance improvements or rethinking their software stack for the post LS2 years.

This effort is driven by the new hardware and also the analysis of the hot spots.

Work is going on in ROOT to support thread safety, parallel buffer merges and parallel Tree I/O.

In the GEANT world, several projects (e.g. G4MT) are investigating multi-core, GPU or similar solutions. In this talk I will review the progress with one of these projects.


Hardware

R.Brun : Parallelism and HEP software

From a recent talk by Intel

If you trust Intel


If you trust Intel 2


Vendors race



Parallelism: many failures


[Pictures: INMOS, Cray, CM-2]

We failed in vectorizing codes like GEANT3 in 1985-1987 on CRAY, Cyber205, ETA10, IBM3090 because our approach was wrong

Some successful attempts in online systems in 1983

We failed too on MPP systems like the Thinking Machines, Elxsi in 1991-1993 because our approach was wrong

Are we going to take a wrong approach again?

Parallelism: key points


Minimize the sequential/synchronization parts (Amdahl's law): Very difficult

Run the same code (processes) on all cores to optimize the memory use (code and read-only data sharing)

Job-level is better than event-level parallelism for offline systems.

Use the good-old principle of data locality to minimize the cache misses.

Exploit the vector capabilities but be careful with the new/delete/gather/scatter problem

Reorganize your code to reduce tails
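
The first key point can be made quantitative with Amdahl's law: if a fraction p of the run time scales perfectly over n cores and the rest stays sequential, the speedup is at most 1/((1-p) + p/n). A minimal sketch follows; the helper name `amdahl_speedup` is ours and does not come from the talk.

```cpp
#include <cassert>

// Amdahl's law: upper bound on the speedup when a fraction
// `parallel_fraction` of the run time scales perfectly over `cores`
// cores and the rest stays sequential. Illustrative helper only.
double amdahl_speedup(double parallel_fraction, int cores)
{
    assert(parallel_fraction >= 0.0 && parallel_fraction <= 1.0);
    assert(cores >= 1);
    const double sequential = 1.0 - parallel_fraction;
    return 1.0 / (sequential + parallel_fraction / cores);
}
```

With only 5 per cent of sequential code, `amdahl_speedup(0.95, 12)` is about 7.7 on 12 cores, and no number of cores can push the speedup beyond 20.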

Data Structures & parallelism


event

vertices

tracks

C++ pointers specific to a process

Copying the structure implies a relocation of all pointers

I/O is a nightmare

Update of the structure from a different thread implies a lock/mutex
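
One common way around the relocation problem, given here as a hedged sketch and not as the design chosen in the talk, is to link objects by index into contiguous arrays instead of by raw pointer. The types `Event`, `Vertex` and `Track` and the function `make_example_event` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical event model: tracks refer to their vertex by index,
// not by pointer, so a copy of the event needs no pointer relocation.
struct Vertex { double x, y, z; };
struct Track  { double px, py, pz; std::size_t vertex; };

struct Event {
    std::vector<Vertex> vertices;
    std::vector<Track>  tracks;

    // Resolve a track's vertex inside *this* event.
    const Vertex& vertex_of(const Track& t) const
    {
        assert(t.vertex < vertices.size());
        return vertices[t.vertex];
    }
};

// Build a tiny event with one vertex and two tracks attached to it.
Event make_example_event()
{
    Event ev;
    ev.vertices.push_back({0.0, 0.0, 1.5});
    ev.tracks.push_back({1.0, 0.0, 0.0, 0});
    ev.tracks.push_back({0.0, 2.0, 0.0, 0});
    return ev;
}
```

A plain copy (`Event b = a;`) gives a structure whose links are still valid, and the same indices can be written to a file unchanged. Concurrent updates still need synchronization.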

Data Structures & Locality


Sparse data structures defeat the system memory caches

Group object elements/collections such that the storage matches the traversal processes

For example: group the cross-sections for all processes per material instead of all materials per process
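
The cross-section example can be sketched as a flat table stored material-major. This is an illustration under simplifying assumptions: the class `XsecTable` is hypothetical and the energy dependence of real cross-sections is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical flat cross-section table, stored material-major:
// all processes of one material sit next to each other in memory,
// matching a traversal that stays in one material for many steps.
class XsecTable {
public:
    XsecTable(std::size_t n_materials, std::size_t n_processes)
        : n_processes_(n_processes),
          data_(n_materials * n_processes, 0.0) {}

    double& at(std::size_t material, std::size_t process)
    {
        assert(process < n_processes_);
        assert(material * n_processes_ + process < data_.size());
        return data_[material * n_processes_ + process];
    }

    // Sum over processes for one material: a contiguous scan.
    double total(std::size_t material) const
    {
        const double* row = data_.data() + material * n_processes_;
        double sum = 0.0;
        for (std::size_t p = 0; p < n_processes_; ++p) sum += row[p];
        return sum;
    }

private:
    std::size_t n_processes_;
    std::vector<double> data_;
};
```

With the opposite layout (all materials per process), the loop in `total` would jump through memory with a stride of one full process table per iteration.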

Tools & Libs


hbook, zebra, paw, zbook, hydra, geant1, geant2, geant3, geant4, root, minuit, bos, geant5

Detector Simulation tools


All based on the same principle: sequential particle transport

The GEANT versions

[Figure: functionality of the GEANT versions G1, G2, G3 and G4 plotted against time; time axis: 1975, 1980, 1990, 1995, 2010]

Conventional Transport


[Figure: conventional transport; the steps (o) of tracks T1, T2, T3 and T4 are followed one track at a time]

Each particle is tracked step by step through hundreds of volumes

When all hits for all tracks are in memory, summable digits are computed

Analogy with car traffic



Starting Assumptions

The LHC experiments use G4 extensively as their main simulation engine. They have invested in validation procedures. Any new project must be coherent with their framework.

One of the reasons why the experiments develop their own fast MC solutions is that a full simulation is too slow for several physics analyses. These fast MCs are not in the G4 framework (different control, different geometries, etc.), but they are becoming coherent with the experiments' frameworks.

Given the amount of good work on the G4 physics, it is unthinkable not to capitalize on this work.


Goals

Design a new detector simulation tool derived from the Geant4 physics, but with a radically new transport engine supporting:
• Full and fast simulation (not exclusive)
• Designed to exploit parallel hardware (this talk)


Definitions


Detector   Physical volumes   Logical volumes
ALICE             4,354,735             4,764
ATLAS            29,046,966             7,143
CMS               1,166,318             1,537
LHCb             18,491,756               709

A logical volume has a given shape and material

Steps/lvolume in Atlas


Huge dynamic range

7100 lvolume types, 29 million instances

Simple observation: HEP transport is mostly local!

• Locality not exploited by the classical transportation approach

• Existing code very inefficient (0.6-0.8 IPC)

• Cache misses due to fragmented code

50 per cent of the time spent in 50/7100 lvolumes


Neighbors/lvolume in Atlas


Volumes with too many neighbors

Neighbors/lvolume in CMS


Same problem with CMS

LHCB geometry statistics


Better situation with neighbors because of a non-cylindrical geometry

90 per cent of steps in 50/700 volumes

New Transport Scheme


[Figure: new transport scheme; the steps (o) of tracks T1, T2, T3 and T4 are grouped by volume]

All particles in the same volume type are transported in parallel.

Particles entering new volumes or generated are accumulated in the volume basket.

Events for which all hits are available are digitized in parallel
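
The basket idea can be illustrated with a toy sketch. Everything here is an assumption made for the example: the `Particle` record, the functions `dispatch` and `transport_basket`, and the placeholder "physics" (a fixed energy loss) and "geometry" (a single next volume).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical particle record: only what the sketch needs.
struct Particle {
    int event;       // event the particle belongs to
    std::size_t lv;  // index of the logical volume it is currently in
    double energy;
};

// One basket per logical volume; a basket mixes particles of many events.
using Baskets = std::vector<std::vector<Particle>>;

// Dispatch particles into the basket of the volume they sit in.
void dispatch(const std::vector<Particle>& particles, Baskets& baskets)
{
    for (const Particle& p : particles) {
        assert(p.lv < baskets.size());
        baskets[p.lv].push_back(p);
    }
}

// Toy transport of one basket: every particle loses `loss` energy and,
// if it survives, moves on to `next_lv`. Real physics and geometry are
// replaced by these two placeholders. Returns the output particle list.
std::vector<Particle> transport_basket(std::vector<Particle>& basket,
                                       double loss, std::size_t next_lv)
{
    std::vector<Particle> out;
    for (Particle p : basket) {   // same code, same volume: one tight loop
        p.energy -= loss;
        if (p.energy > 0.0) {
            p.lv = next_lv;
            out.push_back(p);
        }
    }
    basket.clear();               // the basket can be refilled
    return out;
}
```

The output list of one basket is fed back through `dispatch`, so particles keep circulating between baskets until they stop.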

Tails again


A killer if one has to wait for the end of col(i) before processing col(i+1)

Average number of objects in memory

A better solution


Pipeline of objects

Checkpoint synchronization.

Only 1 « gap » every N events

This type of solution is required anyhow for pile-up studies

A better better solution


Checkpoints

At each checkpoint we have to keep the unfinished objects/events.

We can now digitize with parallelism on events, clear and reuse the slots.


Benchmarks/lessons from a prototype

HT mode

Excellent CPU usage

Benchmarking 10+1 threads on a 12 core Xeon

Locks and waits: some overhead due to transitions coming from exchanging baskets via concurrent queues

Event re-injection will improve the speed-up


Vectorizing the geometry (ex1)


    Double_t TGeoPara::Safety(Double_t *point, Bool_t in) const
    {
    // computes the closest distance from given point to this shape.
       Double_t saf[3];
       // distance from point to higher Z face
       saf[0] = fZ-TMath::Abs(point[2]); // Z

       Double_t yt = point[1]-fTyz*point[2];
       saf[1] = fY-TMath::Abs(yt);       // Y
       // cos of angle YZ
       Double_t cty = 1.0/TMath::Sqrt(1.0+fTyz*fTyz);

       Double_t xt = point[0]-fTxz*point[2]-fTxy*yt;
       saf[2] = fX-TMath::Abs(xt);       // X
       // cos of angle XZ
       Double_t ctx = 1.0/TMath::Sqrt(1.0+fTxy*fTxy+fTxz*fTxz);
       saf[2] *= ctx;
       saf[1] *= cty;
       if (in) return saf[TMath::LocMin(3,saf)];
       for (Int_t i=0; i<3; i++) saf[i]=-saf[i];
       return saf[TMath::LocMax(3,saf)];
    }

Huge performance gain expected in this type of code where shape constants can be computed outside the loop
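
As a hedged illustration of that remark, the Safety routine above could be applied to a whole basket of points along the following lines. The struct `Para` is a stand-alone stand-in whose field names mirror the TGeoPara members, not the ROOT class; the function `safety_inside` and the structure-of-arrays point layout are assumptions, and only the `in == true` branch is covered.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Stand-alone stand-in for the parallelepiped shape; field names mirror
// the TGeoPara members used above, but this is not the ROOT class.
struct Para { double fX, fY, fZ, fTxy, fTxz, fTyz; };

// Safety for n points stored as separate x/y/z arrays (structure of arrays).
// Assumes all points are inside the shape (the `in == true` branch).
void safety_inside(const Para& s,
                   const std::vector<double>& x,
                   const std::vector<double>& y,
                   const std::vector<double>& z,
                   std::vector<double>& safety)
{
    assert(x.size() == y.size() && y.size() == z.size());
    safety.resize(x.size());

    // Shape constants: independent of the particle, computed once per basket.
    const double cty = 1.0 / std::sqrt(1.0 + s.fTyz * s.fTyz);
    const double ctx = 1.0 / std::sqrt(1.0 + s.fTxy * s.fTxy + s.fTxz * s.fTxz);

    // Branch-free loop body over contiguous data, a candidate for
    // compiler auto-vectorization.
    for (std::size_t i = 0; i < x.size(); ++i) {
        const double sz = s.fZ - std::abs(z[i]);
        const double yt = y[i] - s.fTyz * z[i];
        const double sy = (s.fY - std::abs(yt)) * cty;
        const double xt = x[i] - s.fTxz * z[i] - s.fTxy * yt;
        const double sx = (s.fX - std::abs(xt)) * ctx;
        safety[i] = std::min(sz, std::min(sy, sx));
    }
}
```

The two square roots are paid once per basket instead of once per particle. Whether the loop is actually vectorized depends on the compiler and its flags.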

Vectorizing the geometry (ex2)


    G4double G4Cons::DistanceToIn( const G4ThreeVector& p,
                                   const G4ThreeVector& v ) const
    {
      G4double snxt = kInfinity ;      // snxt = default return value
      const G4double dRmax = 100*std::min(fRmax1,fRmax2);
      static const G4double halfCarTolerance=kCarTolerance*0.5;
      static const G4double halfRadTolerance=kRadTolerance*0.5;

      G4double tanRMax,secRMax,rMaxAv,rMaxOAv ;  // Data for cones
      G4double tanRMin,secRMin,rMinAv,rMinOAv ;
      G4double rout,rin ;

      G4double tolORMin,tolORMin2,tolIRMin,tolIRMin2 ; // `generous' radii squared
      G4double tolORMax2,tolIRMax,tolIRMax2 ;
      G4double tolODz,tolIDz ;

      G4double Dist,s,xi,yi,zi,ri=0.,risec,rhoi2,cosPsi ; // Intersection point vars

      G4double t1,t2,t3,b,c,d ;    // Quadratic solver variables
      G4double nt1,nt2,nt3 ;
      G4double Comp ;

      G4ThreeVector Normal;

      // Cone Precalcs

      tanRMin = (fRmin2 - fRmin1)*0.5/fDz ;
      secRMin = std::sqrt(1.0 + tanRMin*tanRMin) ;
      rMinAv  = (fRmin1 + fRmin2)*0.5 ;

      if (rMinAv > halfRadTolerance)
      {
        rMinOAv = rMinAv - halfRadTolerance ;
      }
      else
      {
        rMinOAv = 0.0 ;
      }
      tanRMax = (fRmax2 - fRmax1)*0.5/fDz ;
      secRMax = std::sqrt(1.0 + tanRMax*tanRMax) ;
      rMaxAv  = (fRmax1 + fRmax2)*0.5 ;
      rMaxOAv = rMaxAv + halfRadTolerance ;

      // Intersection with z-surfaces

      tolIDz = fDz - halfCarTolerance ;
      tolODz = fDz + halfCarTolerance ;

      …… //here starts the real algorithm

Huge performance gain expected in this type of code where shape constants can be computed outside the loop

All these statements are independent of the particle!!!


Vectorizing the Physics

• This is going to be more difficult when extracting the physics classes from G4. However, important gains are expected in the functions computing the distance to the next interaction point for each process.

• There is a diversity of interfaces and we have now sub-branches per particle type.


Where are we now?

Present status
• Several investigations of possible alternatives for “extremely parallel - no lock” transport
• Not much code written, several blackboards full
• Some investigation on a simplified but fully vectorized model to prove vectorization gain
• New design in preparation


Major points under discussion

How to minimize locks and maximize local handling of particles

How to handle hit and digit structures

How to preserve the history of the particles

This point seems more difficult at the moment and it requires more design

What is the possible speedup obtained by micro-parallelisation

What are the bottlenecks and opportunities with parallel I/O


Current design

[Diagram of the current design. Labels: list of logical volumes; for each logical volume lv, a list of baskets with BF, the basket status (one char per basket); per basket, an input particle list and an output particle list (p arrays, reused after each transport task) and hits and history (flushed at the end of event); sensitive volumes; list of active events for lv; digits for lv and event ev; active event list; events. Threads: transport thread, digitizer thread, event build thread.]

Features

Pros
• Excellent potential locality
• Easy to introduce hits and digits

Cons
• One more copy (but it is done in parallel)
• More difficult to preserve particle history (it is non-local!) and introduce particle pruning


Processing flow I

The transport thread takes particles from the input buffer and transports them till they stop, interact or exit from the volume
• At this point they are inserted in the output particle buffer for further processing
• If the LV is a sensitive detector, hits are generated and stored per LV basket
• An LV basket history record is kept (under investigation)

Input and output particle buffers are fixed-size structures, which can however evolve (be optimised) during simulation


Design under study

[Diagram of the design under study. Labels: input particle list; p arrays; hits; history; list of logical volumes; logical volume lv; active event list; sensitive volumes; digits for lv and event ev; list of active events for lv; event ev; BF: basket status (one char per B), with baskets marked ✔ full or ✗ empty.]

Note

Containers are “slow growing” contiguous containers
• Every time a container has to grow, it is reallocated contiguously to the new size
• This is a blocking operation

We expect container sizes to converge
• If not, there is a design problem


Processing flow II

When an input particle buffer is exhausted
• It is marked as such by the transport thread in the LV#BF (Logical Volume # Basket Flag)
• Then the transport thread scans the LV#BF data structure to find the next basket to be transported

Used buffers are scanned by the dispatcher thread that updates a global track counter per event
• And then they are declared available to be filled (a) to be reused


Important!!

a (available): basket being filled by the dispatcher
f (full): basket ready to be dispatched
r (ready): basket ready to be transported
t (transporting): basket being transported


Current design

[Diagram: logical volumes (LV) with numbered baskets (BN), each with an input particle list and p arrays; BF basket status flags; a transport thread and a dispatch thread working asynchronously; a fast LV#BN queue (the fast selection list).]

Transport thread
tt0) Initial status: the transport thread is transporting a basket, marked as “transporting” (t)
tt1) The transport thread has finished working on the input array
tt2) The transport thread marks the lv#bn from transporting (t) to “to be dispatched” (f)
tt3) The transport thread gets a basket to be transported (r) from the fast selection list and marks it “transporting” (t)

Dispatch thread
dt1) The dispatcher takes the first f basket and dispatches the output particles into the input particle lists of the baskets available to be filled (a)
dt2) When dispatching is finished the basket is moved from f to a
dt3) When an input list is full, the basket is moved from a to r, ready to be transported, and it is pushed into the fast selection list
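
Read together, the tt and dt steps describe a four-state cycle per basket. The sketch below encodes that reading; the names `BasketStatus`, `is_allowed` and `advance` are hypothetical, and the transition table is our interpretation of the steps, not code from the prototype.

```cpp
#include <cassert>

// Basket status flags, one char per basket as on the slide.
enum class BasketStatus : char {
    Available    = 'a',  // being filled by the dispatcher
    Full         = 'f',  // output ready to be dispatched
    Ready        = 'r',  // ready to be transported
    Transporting = 't'   // being transported
};

// Allowed transitions, read off the tt/dt steps of the design:
// t -> f (transport done), f -> a (dispatch done),
// a -> r (input list full), r -> t (picked by a transport thread).
bool is_allowed(BasketStatus from, BasketStatus to)
{
    switch (from) {
        case BasketStatus::Transporting: return to == BasketStatus::Full;
        case BasketStatus::Full:         return to == BasketStatus::Available;
        case BasketStatus::Available:    return to == BasketStatus::Ready;
        case BasketStatus::Ready:        return to == BasketStatus::Transporting;
    }
    return false;
}

// Apply a transition, refusing anything outside the cycle.
void advance(BasketStatus& status, BasketStatus to)
{
    assert(is_allowed(status, to));
    status = to;
}
```

Because each transition has a single owner (the transport thread for r -> t and t -> f, the dispatcher for f -> a and a -> r), a basket in a given state is only ever written by one kind of thread.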

Locks…
• The only lock is the push and pop from the fast selection queue
• The dispatcher continuously watches the done byte-vector and dispatches every new basket that is ready
  - Or it can sleep for a while and then process a number of done baskets in succession
• The transport thread marks the done basket (no lock!)
  - No one touches a (t) basket apart from the transport thread that deals with it
• The transport thread gets a new basket (lock!)
  - This prevents two threads from getting the same basket, and a thread from reading the fast selection queue while the dispatcher thread is updating it
• The dispatcher thread does not need to lock the whole byte-array while dispatching
  - A basket in f or in a will not be touched by the transport threads
• The only doubtful situation is when there are no baskets to be transported…
  - In this case the “global” threshold for transporting should be lowered by a hungry transport thread (lock, but just to update an integer!)
  - And the dispatcher will mark baskets as ready to be transported (r)
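
A minimal sketch of this locking discipline, assuming standard C++11 primitives: the status byte-vector is a vector of atomics written without a lock, and only the fast selection queue is protected by a mutex. The class `BasketScheduler` and its methods are hypothetical, and the threshold logic for hungry threads is not modelled.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <deque>
#include <mutex>
#include <vector>

// Hypothetical scheduler state: one status byte per basket plus the
// fast selection queue of baskets ready to be transported.
class BasketScheduler {
public:
    explicit BasketScheduler(std::size_t n_baskets) : status_(n_baskets)
    {
        for (auto& s : status_) s.store('a');
    }

    // Dispatcher side: the input list is full, publish the basket
    // (lock on the queue only).
    void mark_ready(std::size_t basket)
    {
        status_[basket].store('r');
        std::lock_guard<std::mutex> guard(queue_mutex_);
        ready_.push_back(basket);
    }

    // Transport side: pop the next ready basket (lock), so that two
    // threads never get the same basket. Returns false if none is ready.
    bool take(std::size_t& basket)
    {
        std::lock_guard<std::mutex> guard(queue_mutex_);
        if (ready_.empty()) return false;
        basket = ready_.front();
        ready_.pop_front();
        status_[basket].store('t');
        return true;
    }

    // Transport side: mark the basket done. No lock: only the owning
    // transport thread touches a 't' basket.
    void mark_done(std::size_t basket)
    {
        assert(status_[basket].load() == 't');
        status_[basket].store('f');
    }

    char status(std::size_t basket) const { return status_[basket].load(); }

private:
    std::vector<std::atomic<char>> status_;
    std::mutex queue_mutex_;
    std::deque<std::size_t> ready_;
};
```

A dispatcher would poll `status()` for 'f' baskets, and a transport thread whose `take()` returns false is the "hungry" case discussed above.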


Memory

We hope to have a self-adjusting system that will stabilise with time

In case of an “accident” (an event much larger than any other), we need a way to “quench inflation”

We have identified two methods
• Event flushing: do NOT transport particles from a given set of events and move them directly to the output buffer
• Energy flushing: transport low energy particles and move the high energy ones to the output buffer

“Untransported” particles are just reinjected into the system, but they do not shower


Processing flow III

Note an important point

The LV basket structure has input and output particle buffers and hits and history buffers

Input and output particle buffers are
• Multi-event
• Volatile, they get emptied and filled during transport of a single event

Hits and history buffers are
• Per event
• Permanent during the transport of a single event

A basket of an LV can be handled by different threads successively, each one with new input and output buffers
• …but all these threads will add to the hits and history data structure till the event is flushed


Processing flow IV

When an event is finished, the digitizer thread kicks in, scans all the hits in all the baskets of all the LVs and digitises them, inserting them in the LV event->digit structure

When this is over, the event is built into the event structure (to be designed!) by the event builder thread

After that, the history for this event is assembled by the same thread
• If…

Then the event is output
• By an output thread or in parallel?


Questions?

How many dispatcher, digitizer and event-builder threads?
• Difficult to say, we need some more quantitative design work
• Measurements with G4 simulations could help

Transport thread numbers will have to adapt to the size of the simulation and of the detector
• In ATLAS for instance 50% of the time is spent in 0.75% of the volumes
• Threads could be distributed proportionally to the time spent in the different LVs
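
One possible reading of "distributed proportionally", sketched with largest-remainder rounding so that the thread counts are integers and add up to the total. The function `allocate_threads` and the rounding rule are assumptions, not something stated in the talk.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Split `n_threads` among logical volumes in proportion to the time
// spent in each one (largest-remainder rounding). Hypothetical helper.
std::vector<int> allocate_threads(const std::vector<double>& time_per_lv,
                                  int n_threads)
{
    assert(n_threads >= 0);
    const double total =
        std::accumulate(time_per_lv.begin(), time_per_lv.end(), 0.0);
    assert(total > 0.0);

    std::vector<int> threads(time_per_lv.size());
    std::vector<double> remainder(time_per_lv.size());
    int assigned = 0;
    for (std::size_t i = 0; i < time_per_lv.size(); ++i) {
        const double share = n_threads * time_per_lv[i] / total;
        threads[i] = static_cast<int>(share);   // integer part first
        remainder[i] = share - threads[i];
        assigned += threads[i];
    }
    // Hand out the leftover threads to the largest fractional parts.
    for (; assigned < n_threads; ++assigned) {
        std::size_t best = 0;
        for (std::size_t i = 1; i < remainder.size(); ++i)
            if (remainder[i] > remainder[best]) best = i;
        ++threads[best];
        remainder[best] = -1.0;                 // do not pick it twice
    }
    return threads;
}
```

In practice the time per LV would have to be measured, for instance with the G4 simulations mentioned above, and the allocation recomputed as the load changes.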


Short term tasks

• Continue the design work, essential before any more substantial implementation
  - This is the most important task at the moment
  - We have to evaluate the potential bottlenecks before starting the implementation
• Implement the new design and evaluate it against the first
• Demonstrate speedup of some chosen geometry routines
  - Both on x86 CPUs and GPUs
• Demonstrate speedup of some chosen physics methods
  - Particularly in the EM domain
