5 May 2003 V.Rybnikov, DESY 1
Hera-B DAQ System and its self-healing abilities
1. HERA-B experiment
2. DAQ architecture
   • Read-out
     ➢ Self-healing tools
   • Switch
     ➢ SLT nodes isolation
3. Run control system
4. Self-healing tools (software)
   • Releasing resources
   • Process recovery
V.Rybnikov, DESY, Hamburg
5 May 2003 V.Rybnikov, DESY 2
HERA-B experiment (sub-detectors)
5 May 2003 V.Rybnikov, DESY 3
DAQ architecture
Event rate: 10 MHz
Event rate, an unlabelled figure column, and data volume:
  50 kHz    < 30    13 Gb/s
  500 Hz    200     165 MB/s
  50 Hz     150     22 MB/s
Critical points (marked in the figure)
~1100 SHARC nodes
240 SLT nodes
100 x 2 4LT CPUs
Logging nodes (3)
~2000 processes on ~1500 nodes
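The rates on this slide imply the reduction factors below. This is a minimal sketch, assuming the four quoted rates are the event rates after successive stages of the chain; the transcript does not name the stages, so none are named here.

```python
# Event rates quoted on the slide, in Hz, in the order they appear.
# Assumption: each value is the rate that remains after one more stage.
rates_hz = [10e6, 50e3, 500.0, 50.0]


def reduction_factors(rates):
    """Return the ratio between each rate and the one that follows it."""
    return [before / after for before, after in zip(rates, rates[1:])]


factors = reduction_factors(rates_hz)
overall = rates_hz[0] / rates_hz[-1]

for before, after, factor in zip(rates_hz, rates_hz[1:], factors):
    print(f"{before:>10.0f} Hz -> {after:>7.0f} Hz : factor {factor:.0f}")
print(f"overall reduction: {overall:.0f}")
```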
5 May 2003 V.Rybnikov, DESY 4
DAQ architecture (SHARC board)
• 6U VME card (MSC, Stutensee, Germany)
• 6 ADSP-21060 (Analog Devices), 40 MHz
• ADSP chip holds 512 KB on-chip memory
• global memory bus (240 MB/s in 48-bit words)
• external memory 256K x 32
• 10 DMA controllers / chip
   • 6 for 4-bit parallel links (40 MB/s)
   • 4 for global memory communication
• VME interface to write/read ADSP and global memory

 44  SWITCH
  1  FCS interface
  2  Event Controller
140  SLBs
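A quick cross-check of the numbers on this slide. This sketch assumes the four figures listed next to SWITCH, FCS interface, Event Controller and SLBs are board counts; the transcript does not say so explicitly. Under that assumption the total is consistent with the "~1100 SHARC nodes" quoted on the DAQ architecture slide.

```python
# Numbers listed on the slide next to each use of the SHARC board.
# Assumption: each number is a count of boards used for that purpose.
boards_by_use = {
    "SWITCH": 44,
    "FCS interface": 1,
    "Event Controller": 2,
    "SLBs": 140,
}
DSPS_PER_BOARD = 6  # "6 ADSP-21060" per board, from the slide

total_boards = sum(boards_by_use.values())
total_dsps = total_boards * DSPS_PER_BOARD

print(f"boards: {total_boards}, ADSP chips: {total_dsps}")
```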
5 May 2003 V.Rybnikov, DESY 5
DAQ architecture (read-out)
Figure: read-out chain, drawn for several front-end drivers (FEDs). Each FED contains ADC / TDC, DIGITAL PIPELINE, OUTPUT FIFO, SHARC INTERFACE and CONTROL LOGIC blocks. A PIGGYBACK on the FED is linked to a PIGGYBACK on the SHARC board (link length 0.5 – 60 m, 27-40 MHz).
Total: ~2070 FEDs (32-1024 channels)
• Push-down system
• No missing clock allowed
• No hardware recovery
5 May 2003 V.Rybnikov, DESY 6
Self-healing tools (read-out recovery)
Figure: monitor SVD, monitor ITR and common monitor on the DATA STREAM; FED expert; ACTION

Monitors
• check event header information for every FED w.r.t. errors

FED expert
• FED error threshold
• min period between consecutive recoveries
• max number of consecutive recoveries
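The three FED expert parameters on this slide can be read as a simple rate-limited decision. The sketch below is illustrative only: the class names, the return values, and the rules that a clean report resets the streak and that the expert gives up after the maximum is reached are assumptions, not the HERA-B implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RecoveryPolicy:
    error_threshold: int   # FED errors needed before a recovery is requested
    min_period_s: float    # min period between consecutive recoveries
    max_consecutive: int   # max number of consecutive recoveries


@dataclass
class FedExpert:
    policy: RecoveryPolicy
    last_recovery_s: Optional[float] = None
    consecutive: int = 0

    def decide(self, error_count: int, now_s: float) -> str:
        """Return 'none', 'recover' or 'give_up' for one monitor report."""
        if error_count < self.policy.error_threshold:
            self.consecutive = 0   # assumption: a clean report ends the streak
            return "none"
        if self.consecutive >= self.policy.max_consecutive:
            return "give_up"       # assumption: hand over to a human
        if (self.last_recovery_s is not None
                and now_s - self.last_recovery_s < self.policy.min_period_s):
            return "none"          # too soon after the previous recovery
        self.last_recovery_s = now_s
        self.consecutive += 1
        return "recover"
```

With a threshold of 10 errors, a 30 s minimum period and at most 2 consecutive recoveries, a report of 12 errors triggers a recovery, a second one 4 s later is ignored, and a third attempt in a row is refused.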
5 May 2003 V.Rybnikov, DESY 7
Self-healing tools (read-out recovery)
ACTION
• stop triggers
• reset FEDs
• re-chain (initialize) event buffers and Event Controller
• start triggers

Action takes < 5 sec
Run re-initialization ~ 2 min
Run re-start ~ 8 - 10 min
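The three durations on this slide give a feel for the dead time saved per failure. This is a small worked example using only the quoted numbers; the number of failures passed to `saved_seconds` is hypothetical. Because the action takes less than 5 s, the ratios are lower bounds.

```python
# Durations quoted on the slide, in seconds.
ACTION_S = 5                     # "Action takes < 5 sec" (upper bound)
REINIT_S = 2 * 60                # "Run re-initialization ~ 2 min"
RESTART_S = (8 * 60, 10 * 60)    # "Run re-start ~ 8 - 10 min"


def saved_seconds(n_failures: int, alternative_s: int) -> int:
    """Dead time saved when n failures are handled by the action instead."""
    return n_failures * (alternative_s - ACTION_S)


gain_vs_reinit = REINIT_S / ACTION_S
gain_vs_restart = (RESTART_S[0] / ACTION_S, RESTART_S[1] / ACTION_S)

print(f"vs re-initialization: at least {gain_vs_reinit:.0f}x faster")
print(f"vs re-start: at least {gain_vs_restart[0]:.0f}x to {gain_vs_restart[1]:.0f}x faster")
```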
5 May 2003 V.Rybnikov, DESY 8
DAQ architecture (switch)
Routing tables server
• reads the switch connection data base
• creates routing tables in memory
• pushes the tables down into every SHARC node after the boot-up

SHARC to PCI interface boards are used to connect Second Level PCs to the SWITCH

Figure labels: from FEDs; 10; 12
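The three steps of the routing tables server can be sketched as follows. The real connection data base and table format are not described on the slide, so this assumes a hypothetical list of bidirectional links and next-hop tables computed by breadth-first search; the node names and the `send` callback are also made up for illustration.

```python
from collections import deque


def build_routing_tables(links):
    """Return {node: {destination: next_hop}} with fewest-hop routes."""
    neighbours = {}
    for a, b in links:
        neighbours.setdefault(a, []).append(b)
        neighbours.setdefault(b, []).append(a)

    tables = {}
    for source in neighbours:
        table = {}
        queue = deque()
        # Direct neighbours are reached through themselves.
        for first in neighbours[source]:
            if first not in table:
                table[first] = first
                queue.append(first)
        # Every node found later inherits the first hop of its parent.
        while queue:
            node = queue.popleft()
            for nxt in neighbours[node]:
                if nxt != source and nxt not in table:
                    table[nxt] = table[node]
                    queue.append(nxt)
        tables[source] = table
    return tables


def push_down(tables, send):
    """Hand each node its own table through the caller-supplied send()."""
    for node, table in tables.items():
        send(node, table)


# Hypothetical connection data: one input, two switch nodes, two SLT nodes.
links = [("fed_in", "sw1"), ("sw1", "sw2"), ("sw2", "slt_a"), ("sw2", "slt_b")]
tables = build_routing_tables(links)
```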
5 May 2003 V.Rybnikov, DESY 9
Self-healing tools (SLT nodes isolation)
DISTRIBUTOR

Distributor tasks:
• to send calibration constants to all Second Level Trigger (SLT) nodes
• to check status of the SLT nodes (processes) via ping-pong messages

Problem: accumulating messages addressed to a dead node (process) block the switch

Figure labels: from FEDs; 10; 12
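A ping-pong status check of this kind can be sketched as a timeout on the last reply. The class below is an assumption-laden illustration, not the HERA-B distributor: the timeout value, the method names and the idea of skipping silent nodes when sending are all hypothetical.

```python
class Distributor:
    """Tracks the last pong from each SLT node and flags silent ones."""

    def __init__(self, nodes, timeout_s):
        self.timeout_s = timeout_s
        # Assumption: every node counts as last seen at time 0 until it replies.
        self.last_pong_s = {node: 0.0 for node in nodes}

    def on_pong(self, node, now_s):
        """Record a pong reply from a node."""
        self.last_pong_s[node] = now_s

    def dead_nodes(self, now_s):
        """Nodes whose last pong is older than the timeout, sorted by name."""
        return sorted(node for node, seen in self.last_pong_s.items()
                      if now_s - seen > self.timeout_s)

    def broadcast_targets(self, now_s):
        """Nodes that may still receive messages, so none pile up for dead ones."""
        dead = set(self.dead_nodes(now_s))
        return [node for node in self.last_pong_s if node not in dead]
```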
5 May 2003 V.Rybnikov, DESY 10
Self-healing tools (SLT nodes isolation)
Figure: SLT node isolation.
Components: SLT process, distributor, SLT process expert, routing table server, process server.
Labels on the connections: ping-pong; "SLT process died" (twice); change routing; terminate SLT process; interconnections; routing information.
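The isolation flow in the figure can be sketched with the SLT process expert as coordinator. This is a guess at the control flow from the figure labels only: the order (change routing first, then terminate) and the handling of duplicate "SLT process died" reports are assumptions, and both callbacks are hypothetical stand-ins for the routing table server and the process server.

```python
class SltProcessExpert:
    """Isolates a dead SLT process: re-route first, then terminate it."""

    def __init__(self, change_routing, terminate_process):
        self._change_routing = change_routing        # stand-in: routing table server
        self._terminate_process = terminate_process  # stand-in: process server
        self.isolated = set()

    def on_process_died(self, node):
        """Handle one 'SLT process died' report; return True if acted upon."""
        if node in self.isolated:
            return False  # assumption: duplicate reports are ignored
        self._change_routing(node)
        self._terminate_process(node)
        self.isolated.add(node)
        return True
```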
5 May 2003 V.Rybnikov, DESY 11
Run control system
BASICS
➢ the process information for all runs is stored in the DAQ data base
   ✓ list of processes
   ✓ how to start them (args, env, etc.)
   ✓ where to start them
   ✓ etc.
➢ all the processes are started remotely by means of process servers and managers
➢ clean-up of shared resources (shared mem, semaphores, etc.) is carried out during the start-up and stop procedures
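The kind of per-process record listed above (what to start, how, and where) can be sketched as a small data structure. The field names, host names and executables are hypothetical; the actual DAQ data base schema is not given in the slides.

```python
from dataclasses import dataclass, field


@dataclass
class ProcessEntry:
    name: str                                  # entry in the list of processes
    host: str                                  # where to start it
    executable: str                            # how to start it ...
    args: tuple = ()                           # ... with these arguments
    env: dict = field(default_factory=dict)    # ... and this environment


def start_plan(entries):
    """Group the command lines by host, preserving the order of the entries."""
    plan = {}
    for entry in entries:
        plan.setdefault(entry.host, []).append([entry.executable, *entry.args])
    return plan


# Hypothetical entries for one run.
entries = [
    ProcessEntry("run_watch", "ctrl01", "/opt/daq/run_watch", ("--run", "1")),
    ProcessEntry("logger", "log01", "/opt/daq/logger"),
    ProcessEntry("monitor", "ctrl01", "/opt/daq/monitor", env={"DEBUG": "0"}),
]
plan = start_plan(entries)
```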
5 May 2003 V.Rybnikov, DESY 12
Run control system (process service)
Features
✓ Process creation and termination on any 'ONLINE' machine
✓ Process status monitoring and notification about its change
✓ Monitoring the node resources utilization (CPU, memory, etc.)

Implementation
Figure labels: start interface; proserv commands; process server; inetd; proserv interface; start, stop, kill
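The start, stop and kill commands and the status monitoring can be illustrated on a single machine with the Python standard library. This is not the proserv protocol, which the slides do not describe; the class, its method names and the grace period before a kill are assumptions.

```python
import subprocess
import sys


class ProcessService:
    """Local stand-in for a process server: start, stop, kill, status."""

    def __init__(self):
        self._procs = {}

    def start(self, name, command):
        """Create the process and remember it under the given name."""
        self._procs[name] = subprocess.Popen(command)

    def status(self, name):
        """Return 'running' or 'exited(<code>)' for a started process."""
        code = self._procs[name].poll()
        return "running" if code is None else f"exited({code})"

    def stop(self, name, timeout_s=5.0):
        """Ask the process to terminate; kill it if it does not comply in time."""
        proc = self._procs[name]
        proc.terminate()
        try:
            proc.wait(timeout=timeout_s)
        except subprocess.TimeoutExpired:
            self.kill(name)

    def kill(self, name):
        """Terminate the process unconditionally and reap it."""
        proc = self._procs[name]
        proc.kill()
        proc.wait()


def sleeper_command(seconds):
    """Command line of a dummy process that just sleeps (for demonstration)."""
    return [sys.executable, "-c", f"import time; time.sleep({seconds})"]
```

A service like this would be reached remotely in the real system; here `start("sleeper", sleeper_command(30))` followed by `stop("sleeper")` is enough to see the status change from running to exited.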
5 May 2003 V.Rybnikov, DESY 13
Run control system (process management)
Run Watch is the very first process for every run

Figure labels:
• Data Base
• Data Taking, Slow Control, Standalone, Test, Reprocessing, MC
• Boot up procedure supporters
• "SYSTEM" Process Managers
• global processes
• Component FARM processes
5 May 2003 V.Rybnikov, DESY 14
Run control system (DAQ data base)
process configuration
process template
5 May 2003 V.Rybnikov, DESY 15
Run control system (run boot-up)
Run Watch
✓ checking process servers on all machines
   ➢ restarting them if required
✓ freeing resources by launching 'fini' scripts

Figure labels:
• Run Control GUI
• "SYSTEM" Process Manager, global processes
• Process manager COMP 1 ... Process manager COMP N
• comp 1 processes ... comp N processes
5 May 2003 V.Rybnikov, DESY 16
Self-healing tools (process recovery)
Process server reports on process termination
Process Manager checks processes
• Can be restarted?
   • Yes: restart
   • No: Critical?
      • No: forget
      • Yes: (outcome not shown)
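The decision taken by the Process Manager can be written as a small function. The order of the two questions is read off the flowchart labels; what happens to a critical process that cannot be restarted is not stated in the transcript, so the `ESCALATE` outcome below is an assumption.

```python
from enum import Enum


class Outcome(Enum):
    RESTART = "restart"
    FORGET = "forget"
    ESCALATE = "escalate"  # assumption: the slide does not name this outcome


def on_process_terminated(restartable: bool, critical: bool) -> Outcome:
    """Decide what to do after a process server reports a termination."""
    if restartable:
        return Outcome.RESTART
    if critical:
        return Outcome.ESCALATE
    return Outcome.FORGET
```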
5 May 2003 V.Rybnikov, DESY 17
Conclusions
HERA-B is a big, complex experiment developed and built by hundreds of scientists, engineers and technicians. The major developments are complete. Problems affecting data-taking efficiency are being fixed by introducing self-healing tools.
5 May 2003 V.Rybnikov, DESY 18
Appendix (ONLINE expert tools)
5 May 2003 V.Rybnikov, DESY 19
Switch performance
5 May 2003 V.Rybnikov, DESY 20
Switch performance
5 May 2003 V.Rybnikov, DESY 21
Switch routing