handling/surviving with radiation effects in large scale complex systems (hep experiment) jorgen...
TRANSCRIPT
Handling/surviving with Handling/surviving with radiation effects in large radiation effects in large scale complex systemsscale complex systems
(HEP experiment)(HEP experiment)
Jorgen ChristiansenJorgen ChristiansenPH-ESEPH-ESE
OutlineOutline Large scale systemLarge scale system RisksRisks Radiation “culture” and reviewsRadiation “culture” and reviews ZonesZones Failure types and recoveryFailure types and recovery System Architecture, Simplify, redundancySystem Architecture, Simplify, redundancy Classification of systems and functionsClassification of systems and functions COTS, CPU’s, FPGA’s, ASIC’sCOTS, CPU’s, FPGA’s, ASIC’s Accumulating effects: TID, NIELAccumulating effects: TID, NIEL SEESEE Some examplesSome examples The wallThe wall What to avoidWhat to avoid SummarySummary Source of informationSource of information
22
Large scale HEP system: Large scale HEP system: ATLASATLAS
~100M detector channels~100M detector channels ~400k front-end ASIC’s ~400k front-end ASIC’s
~40 different ASIC types~40 different ASIC types(large majority in same technology)(large majority in same technology)
~50k front-end hybrids~50k front-end hybrids ~20k optical links~20k optical links ~300 KW of power for on-detector ~300 KW of power for on-detector
electronicselectronics ~220 readout/trigger crates~220 readout/trigger crates ~2000 computers~2000 computers Radiation (10 years): Pixel layerRadiation (10 years): Pixel layer
TID: ~1MGyTID: ~1MGyNIEL: ~1x10NIEL: ~1x1015 15 1Mev n cm1Mev n cm-2-2
SEE: ~1x10SEE: ~1x1015 15 >20Mev h cm>20Mev h cm-2-2
~Thousand man-years invested in ~Thousand man-years invested in design, test, radiation qualification, design, test, radiation qualification, production, installation, and final production, installation, and final commissioning of electronics !commissioning of electronics !
Other experimentsOther experiments CMS: Similar to ATLASCMS: Similar to ATLAS LHCb: ¼ size, forward, “open”, similar levelsLHCb: ¼ size, forward, “open”, similar levels ALICE: ¼ size, Much lower levels, SEE initially ALICE: ¼ size, Much lower levels, SEE initially
underestimatedunderestimated
Typically the same within factor 2 - 4 inside experiment.Order of magnitude differences in cavern !
33
What risks to take ?What risks to take ? Risk analysisRisk analysis
What local failures provokes what system What local failures provokes what system failures ?failures ?
This is often more complicated than initially thoughtThis is often more complicated than initially thought Can a given failure destroy other equipmentCan a given failure destroy other equipment Hard and soft failuresHard and soft failures How long does it take to recover from this ?How long does it take to recover from this ?
Fold in radiation induced failure types and Fold in radiation induced failure types and ratesrates Zero risk do not existZero risk do not exist Very low risk will be very expensiveVery low risk will be very expensive Anything else than low risk will be Anything else than low risk will be
unacceptableunacceptable44
Radiation “culture”Radiation “culture” In large projects all project managers and electronics designers must be In large projects all project managers and electronics designers must be
educated in this so they are capable of taking appropriate measures and educated in this so they are capable of taking appropriate measures and feels responsible for this.feels responsible for this.
When ever making major/small design decisions all designers must continuously When ever making major/small design decisions all designers must continuously ask the question: what about radiation effects ?.ask the question: what about radiation effects ?.
Speak common language on this “new” field.Speak common language on this “new” field. Expertise, advice and design “tricks” for radiation hard design must be Expertise, advice and design “tricks” for radiation hard design must be
exchanged between people in project (and from other projects and exchanged between people in project (and from other projects and communities: Space, Mill, etc.).communities: Space, Mill, etc.).
Common/standardized solutions can/must be identified.Common/standardized solutions can/must be identified. Use common rad hard technology and librariesUse common rad hard technology and libraries Use common ASIC components (e.g. particular link interface)Use common ASIC components (e.g. particular link interface) Use common COTS components (e.g. particular FPGA)Use common COTS components (e.g. particular FPGA) Use common design tricks (ECC, triple redundancy)Use common design tricks (ECC, triple redundancy) Common and standardized test and qualification methods.Common and standardized test and qualification methods. Etc.Etc.
A radiation working group can be a means to obtain this.A radiation working group can be a means to obtain this.(can be termed electronics/radiation/? coordination group/board/committee , ,)(can be termed electronics/radiation/? coordination group/board/committee , ,)
Make everybody aware that designs will be reviewed and must be testedMake everybody aware that designs will be reviewed and must be tested Make everybody aware of importance of radiation hardness policy and Make everybody aware of importance of radiation hardness policy and
develop(enforce) consensus that this is for the best for us all.develop(enforce) consensus that this is for the best for us all. Clear guidelines: radiation levels, safety factors,testing methods, design Clear guidelines: radiation levels, safety factors,testing methods, design
methods, etc.methods, etc.55
ReviewsReviews Design reviews can be an excellent way to Design reviews can be an excellent way to
verify/enforce/encourage that radiation effects have been properly verify/enforce/encourage that radiation effects have been properly taken into account and tested for in each component/sub-system.taken into account and tested for in each component/sub-system.
By exchanging reviewers across sub-system boundaries a very By exchanging reviewers across sub-system boundaries a very useful exchange of ideas/approaches is accomplished.useful exchange of ideas/approaches is accomplished.
A lot of this useful exchange actually then happens outside the review contextA lot of this useful exchange actually then happens outside the review context We are all here to make a full system that works well.We are all here to make a full system that works well.
Reviews should mainly be seen as an additional help to the Reviews should mainly be seen as an additional help to the designers to make radiation tolerant and reliable designs.designers to make radiation tolerant and reliable designs. If used as a “police force” then it can be very difficult to get real If used as a “police force” then it can be very difficult to get real
potential problems to the surface.potential problems to the surface. Radiation effects and consequences must be a permanent part of Radiation effects and consequences must be a permanent part of
the review “menu”.the review “menu”. ReviewsReviews
Global system architectureGlobal system architecture Sub-system architecturesSub-system architectures Designs (ASIC’s, boards)Designs (ASIC’s, boards) Production readiness reviews (in principal too late to resolve problem)Production readiness reviews (in principal too late to resolve problem)
66
Radiation “zones”Radiation “zones” High level zones: 1MGy – 1KGy, 10High level zones: 1MGy – 1KGy, 101515 – 10 – 101111 >2oMev h cm >2oMev h cm-2-2 , 10 years , 10 years
(trackers)(trackers) Everybody knows (must be made to know) that radiation has to be taken Everybody knows (must be made to know) that radiation has to be taken
into account and special systems based on special components needs to into account and special systems based on special components needs to be designed/tested/qualified/etc.be designed/tested/qualified/etc.
TID, NIEL, SEE - Estimate of soft failure rates -> TMRTID, NIEL, SEE - Estimate of soft failure rates -> TMR ASIC’sASIC’s
Intermediate zones: 100Gy – 1kGy, 10Intermediate zones: 100Gy – 1kGy, 101111 – 10 – 101010 >2oMev h cm >2oMev h cm-2-2 (calorimeters and muon)(calorimeters and muon)
ASIC’sASIC’s Potential use of COTSPotential use of COTS
Radiation tests for TID, (NIEL), SEE (Use available tests if appropriate)Radiation tests for TID, (NIEL), SEE (Use available tests if appropriate) Special design principles for SEE (e.g. Triple Modular Redundant, TMR in FPGA’s)Special design principles for SEE (e.g. Triple Modular Redundant, TMR in FPGA’s)
Low level zones: < 100Gy, <10Low level zones: < 100Gy, <101010 >2oMev h cm >2oMev h cm-2-2 (cavern), (cavern),
Extensive use of COTS Extensive use of COTS TID and NIEL not a major problemTID and NIEL not a major problem SEE effects can be severely underestimatedSEE effects can be severely underestimated
Safe zones: < ? (LHC experiments: counting house with full access)Safe zones: < ? (LHC experiments: counting house with full access) One has to be very careful of how such zones are defined. Do not confuse One has to be very careful of how such zones are defined. Do not confuse
with low rate zones !!!with low rate zones !!!77
Failure types to deal withFailure types to deal with Hard failuresHard failures
Requires repair/sparesRequires repair/spares TID, NIEL, SEB, SEGR, (SEL)TID, NIEL, SEB, SEGR, (SEL)
Soft/transient failures:Soft/transient failures: Requires local or system recoveryRequires local or system recovery Can possibly provoke secondary hard failures Can possibly provoke secondary hard failures
in system!.in system!. SEUSEU (SEL if recovery by power cycling)(SEL if recovery by power cycling)
Can/will be destructive because of high currents and Can/will be destructive because of high currents and overheating overheating
Over current protection to prevent “self destruction”.Over current protection to prevent “self destruction”.
88
Recover from problemRecover from problem Hard failures (TID, NIEL & SEE: SEB, SEGR,SEL)Hard failures (TID, NIEL & SEE: SEB, SEGR,SEL)
Identify failing component/moduleIdentify failing component/module Can be difficult in large complex systemsCan be difficult in large complex systems
Exchange failing component (access, spares, etc.)Exchange failing component (access, spares, etc.) Effects of this on the running of the global system ?Effects of this on the running of the global system ?
Soft failuresSoft failures Identify something is not working as intended -> Identify something is not working as intended ->
MonitoringMonitoring Identify causeIdentify cause Re-initialize module or sub-system or whole systemRe-initialize module or sub-system or whole system
Simple local reset (remotely !)Simple local reset (remotely !) Reset/resync of full system (if errors have propagated)Reset/resync of full system (if errors have propagated) Reload parameters and resetReload parameters and reset Reload configuration (e.g. FPGA’s)Reload configuration (e.g. FPGA’s) Power cycling to get system out of a deadlock (a la your PC)Power cycling to get system out of a deadlock (a la your PC)
How much “beam” time has been lost by this ?How much “beam” time has been lost by this ?99
System ArchitectureSystem Architecture Detailed knowledge (with clear Detailed knowledge (with clear
overview) of system architecture is overview) of system architecture is required to estimate effect of hard required to estimate effect of hard and soft failures on global systemand soft failures on global system
Which parts are in what radiation Which parts are in what radiation environment.environment.
Which parts can provoke hard Which parts can provoke hard failuresfailures
Which parts can make full system Which parts can make full system malfunction for significant time malfunction for significant time ( seconds, minutes, hours, days, ( seconds, minutes, hours, days, months)months)
Keep in mind total quantity in full systemKeep in mind total quantity in full system
Which parts can make an important Which parts can make an important functional failure for significant time functional failure for significant time but global system still runsbut global system still runs
Which parts can make an important Which parts can make an important function fail for short timefunction fail for short time
Short malfunction in non critical Short malfunction in non critical functionfunction
Central functions in safe area’sCentral functions in safe area’s1010
Simplify, but not too muchSimplify, but not too much At system level:At system level:
Clear and well defined system architectureClear and well defined system architecture Identify particular critical parts of systemIdentify particular critical parts of system
It is much easier to predict behavior of simple It is much easier to predict behavior of simple systems for what can happen when something goes systems for what can happen when something goes wrong.wrong.
1.1. Simple pipelined and triggered readoutSimple pipelined and triggered readout Require lots of memory and readout bandwidth (IC area and power)Require lots of memory and readout bandwidth (IC area and power)
2.2. Complicated DSP functions with local data Complicated DSP functions with local data reduction/compression.reduction/compression.
May require much less memory and readout May require much less memory and readout bandwidth but is much more complicated.bandwidth but is much more complicated.
Keep as much as possible out of radiation zones Keep as much as possible out of radiation zones (within physical and cost limitations)(within physical and cost limitations)
Do not simplify so much that key vital functions are Do not simplify so much that key vital functions are forgotten or left out. (e.g. protection systems, power forgotten or left out. (e.g. protection systems, power supply for crate, vital monitoring functions, built in supply for crate, vital monitoring functions, built in test functions, etc.)test functions, etc.)
At sub-system/board/chip level:At sub-system/board/chip level: Nice fancy functions can become a nightmare if they Nice fancy functions can become a nightmare if they
malfunction (SEU’s)malfunction (SEU’s)1111
RedundancyRedundancy Redundancy is the only way to make things really Redundancy is the only way to make things really
fail safefail safe At system level (for all kinds of failures)At system level (for all kinds of failures) ………….... For memories: ECC For memories: ECC (beware of possible multi bit error)(beware of possible multi bit error)
At gate level (for SEU)At gate level (for SEU) Triple redundancy implementation: TMRTriple redundancy implementation: TMR
Significant hardware and power overheadSignificant hardware and power overhead Certain functions can in some cases not be triplicated Certain functions can in some cases not be triplicated
(Central functions in FPGA’s)(Central functions in FPGA’s) How to verify/test that TMR actually works as intendedHow to verify/test that TMR actually works as intended
Comes at a high cost (complexity, money, , ,) and Comes at a high cost (complexity, money, , ,) and can potentially make things “worse” if not can potentially make things “worse” if not carefully done.carefully done.
Use where really needed.Use where really needed.
1212
Classification of equipmentClassification of equipment Personnel safety: Personnel safety:
Must never fail to prevent dangerous situationsMust never fail to prevent dangerous situations Redundant system(s)Redundant system(s)
Acceptable that it reacts too often ?Acceptable that it reacts too often ? Equipment safety:Equipment safety:
Much of our equipment is unique so large scale damage must Much of our equipment is unique so large scale damage must be prevented at “all cost”. Can it actually be reproduced ?be prevented at “all cost”. Can it actually be reproduced ?
Critical for running of whole systemCritical for running of whole system Whole system will be un-functional if failureWhole system will be un-functional if failure Central system (Power, DAQ, etc.) or critical sub-system (e.g. Central system (Power, DAQ, etc.) or critical sub-system (e.g.
Ecal, Tracker, , )Ecal, Tracker, , ) Critical for running of Local systemCritical for running of Local system
Local system will be affected for some time but whole system Local system will be affected for some time but whole system can continue to runcan continue to run
Limited part of sub-detector (e.g. tracker with “redundant” Limited part of sub-detector (e.g. tracker with “redundant” layers)layers)
UncriticalUncritical Loose data from a few channels for short time.Loose data from a few channels for short time.
1313
Classification of functionsClassification of functions Protection systemProtection system
Have independent system for thisHave independent system for this
PoweringPowering Can affect significant parts of systemCan affect significant parts of system Can possibly provoke other serious failuresCan possibly provoke other serious failures Must be detected fastMust be detected fast
ConfigurationConfiguration Part of system runs without correct function. Part of system runs without correct function.
Can possibly occur for significant time if not detectedCan possibly occur for significant time if not detected Continuously verify local configuration data.Continuously verify local configuration data.
Send configuration checksum with data, etc.Send configuration checksum with data, etc. Can possibly provoke other serious failuresCan possibly provoke other serious failures
Control (e.g. resets)Control (e.g. resets) Part of system can get desynchronized and supply corrupted data until detected Part of system can get desynchronized and supply corrupted data until detected
and system re-synchronized.and system re-synchronized. CountersCounters BuffersBuffers State machinesState machines
Send synchronization verification data (Bunch-ID, Event-ID, etc.)Send synchronization verification data (Bunch-ID, Event-ID, etc.) Monitoring (Monitoring (Separate from protection system)Separate from protection system)
Problematic situation is not detected, and protection system may at some point react.Problematic situation is not detected, and protection system may at some point react. Wrong monitoring information could provoke system to perform strong intervention (double check Wrong monitoring information could provoke system to perform strong intervention (double check
monitoring data)monitoring data)
Data pathData path A single data glitch in a channel out of 100M will normally not be importantA single data glitch in a channel out of 100M will normally not be important
1414
Using COTS Using COTS (Commercial Of The Shelf)(Commercial Of The Shelf)
For use in “low” level radiation environments, But what is “low” ?.For use in “low” level radiation environments, But what is “low” ?. Internal architecture/implementation normally unknownInternal architecture/implementation normally unknown
Often you think you know but the real implementation can actually be quite Often you think you know but the real implementation can actually be quite different from what is “hinted” in the manual/datasheetdifferent from what is “hinted” in the manual/datasheet
E.g. internal test features normally undocumented, but may still get activated by a SEU.E.g. internal test features normally undocumented, but may still get activated by a SEU. Unmapped Illegal states, , , , ,Unmapped Illegal states, , , , , Small difference in implemenation can make major difference in radiation tolerance.Small difference in implemenation can make major difference in radiation tolerance.
Assuring that all units have same radiation behaviorAssuring that all units have same radiation behavior Problem of assuring that different chips come from same fab. and process/design has not been Problem of assuring that different chips come from same fab. and process/design has not been
modified (yield enhancement, bug fix, second source, etc)modified (yield enhancement, bug fix, second source, etc) Chips: Buying all from same batch in practice very difficultChips: Buying all from same batch in practice very difficult Boards/boxes: “Impossible”Boards/boxes: “Impossible”
TID/NIEL: Particular sensitive components may be present but this TID/NIEL: Particular sensitive components may be present but this can relatively ease be verified.can relatively ease be verified.
Major problem for use in “low” radiation environments will be SEE Major problem for use in “low” radiation environments will be SEE effects:effects:
Rate ?Rate ? Effect ?Effect ? Recovery ?Recovery ?
PS: Radiation qualified space/mill components extremely PS: Radiation qualified space/mill components extremely expensive and does often not solve the SEU problem.expensive and does often not solve the SEU problem.
1515
Use of CPU’sUse of CPU’s Containing CPU’s: PC, PLC, , ,Containing CPU’s: PC, PLC, , ,
Basically any of piece of modern commercial electronics equipment Basically any of piece of modern commercial electronics equipment contains micro controllers !.contains micro controllers !.
Contains: Program memory, Data memory, Cache, Buffers, Register Contains: Program memory, Data memory, Cache, Buffers, Register file, registers, pipeline registers, state registers, ,file, registers, pipeline registers, state registers, ,
Different cross-sections for different types of memoryDifferent cross-sections for different types of memory How large a fraction of this is actually “functionally sensitive” to SEU ?How large a fraction of this is actually “functionally sensitive” to SEU ?
Complicated piece of electronics where radiation effects (SEU) are Complicated piece of electronics where radiation effects (SEU) are very hard to predict !.very hard to predict !.
Operating systemOperating system User codeUser code
Schemes to possibly improve on this:Schemes to possibly improve on this: Special fault tolerant processors (expensive and old fashioned)Special fault tolerant processors (expensive and old fashioned) Error Correcting Code (ECC) memory (and Cache). What about multi bit Error Correcting Code (ECC) memory (and Cache). What about multi bit
flips ?.flips ?. Watch dog timersWatch dog timers Self checking user softwareSelf checking user software
Does not resolve problem of operation systemDoes not resolve problem of operation system Does not resolve problem of SEU in key control functions of CPU (e.g. program counter, Does not resolve problem of SEU in key control functions of CPU (e.g. program counter,
memory)memory)
Do NOT use in systems where “frequent” failures can not be Do NOT use in systems where “frequent” failures can not be tolerated.tolerated. 1616
ASIC’s versus COTSASIC’s versus COTS ASIC’sASIC’s
High time, manpower and money costsHigh time, manpower and money costs Cost effective if many parts needed: ~100kCost effective if many parts needed: ~100k Needed anyway for very specific functions Needed anyway for very specific functions
(e.g. FE chips for experiments)(e.g. FE chips for experiments) Can be made very rad hard (if done correctly)Can be made very rad hard (if done correctly)
Use of appropriate technology, libraries, tools and design approach (TMR)Use of appropriate technology, libraries, tools and design approach (TMR)
Option: Flexible ASIC for use in similar applications (link chip, interface, Option: Flexible ASIC for use in similar applications (link chip, interface, etc.)etc.)
Requires a lot of communication across Requires a lot of communication across projects/groups/departments/experiments/ / /projects/groups/departments/experiments/ / /
COTSCOTS Lots of “nice and very attractive” devices out there.Lots of “nice and very attractive” devices out there. ““Cheap” but VERY problematic Cheap” but VERY problematic
(and may therefore finally become very expensive).(and may therefore finally become very expensive). Use with extreme care.Use with extreme care.
FPGA’s with tricks (mitigation)FPGA’s with tricks (mitigation) Chose appropriate FPGA technology (common platform across projects)Chose appropriate FPGA technology (common platform across projects)
Special space/mill versions, Antifuse, FLASH, RAM with scrubbing , , + TMRSpecial space/mill versions, Antifuse, FLASH, RAM with scrubbing , , + TMR
Be careful with hidden functions: SEFI (test, programming, ,)Be careful with hidden functions: SEFI (test, programming, ,) Be careful with intelligent CAE tools that may try to remove TMR redundancy Be careful with intelligent CAE tools that may try to remove TMR redundancy
(synthesis)(synthesis) 1717
TID and NIELTID and NIEL In principle this could be seen as early In principle this could be seen as early
wear-out of electronics.wear-out of electronics. But much earlier than expectedBut much earlier than expected Starts in small hot regions or in Starts in small hot regions or in
particular batches of components.particular batches of components. Use available spares to repairUse available spares to repair Functions may possibly “recover” after Functions may possibly “recover” after
some time without radiation because of some time without radiation because of annealingannealing
But will then quickly expand like a virusBut will then quickly expand like a virus Spares run out and component Spares run out and component
obsolescence problems then gets a obsolescence problems then gets a serious problem.serious problem.
Before systems clearly “hard fail” they Before systems clearly “hard fail” they may be seen as unreliable/unstable may be seen as unreliable/unstable (and no evident cause of this can be (and no evident cause of this can be found)found)
Decreased noise margins on signalsDecreased noise margins on signals Timing racesTiming races Increased leakage currents -> power Increased leakage currents -> power
supplies marginal supplies marginal Etc.Etc.
Failure rate
Time
Infant mortalities
1818
SEESEE Systems have sporadic failures that happens randomly and Systems have sporadic failures that happens randomly and
at no well identified locations.at no well identified locations. One may possibly have a few particular sensitive parts.One may possibly have a few particular sensitive parts.
““Normal” (non rad induced) sporadic failures are identified Normal” (non rad induced) sporadic failures are identified and corrected during system and sub-system tests and final and corrected during system and sub-system tests and final commissioning of full system.commissioning of full system. Already indentifying cause of sporadic failures at this level and Already indentifying cause of sporadic failures at this level and
resolve them is very difficult.resolve them is very difficult. SEE failures will only be seen when radiation present (after SEE failures will only be seen when radiation present (after
system commissioning) and then it is often too late to react system commissioning) and then it is often too late to react as resolving this may require complete system re-design.as resolving this may require complete system re-design. May possibly be bearable during first initial low luminosity but May possibly be bearable during first initial low luminosity but
then …..then ….. We have not yet had any of our large scale systems running We have not yet had any of our large scale systems running
under such radiation conditions.under such radiation conditions. Surprises may(will) be ahead of us even if we have done our best Surprises may(will) be ahead of us even if we have done our best
to be immune to these.to be immune to these. They hit you at exactly the time you do NOT want system They hit you at exactly the time you do NOT want system
failures to occur.failures to occur.1919
CMS tracker front-endCMS tracker front-end Uncritical (SEU)Uncritical (SEU)
Data pathData path (Monitoring)(Monitoring)
Single corrupted read can be redoneSingle corrupted read can be redone
Critical (SEU)Critical (SEU) Configuration/controlConfiguration/control Data path controlData path control
Buffer controllerBuffer controller Readout controllerReadout controller
SynchronizationSynchronization PLLPLL ResetsResets Event tag identifiersEvent tag identifiers
Vital (but not seen)Vital (but not seen) PowerPower Protection (thermal)Protection (thermal)
digital optical link
Optical transmitter
ADC DSP
RAMTTCrx
TTCrx µPFront End Driver
T1
Front End Controller
I2C
Front End ModuleDetector
Control module
PLL
CLK
PLL
CCU
analogue optical link
DCU
Tx/Rx
Tx/Rx
APV
APVMUX
Data path
Config./control
Safe zone2020
Control/monitoringControl/monitoring CMS tracker: CCU, DCUCMS tracker: CCU, DCU
Inside detectorInside detector Radiation hard: ~1MGyRadiation hard: ~1MGy SEU immuneSEU immune Redundant token-ring routingRedundant token-ring routing ASIC’s basedASIC’s based
LHCb: SPECSLHCb: SPECS At periphery of detectorAt periphery of detector Radiation tolerant: ~1kGyRadiation tolerant: ~1kGy SEU “immune”SEU “immune” FPGA implementation with well FPGA implementation with well
chosen FPGA family (radiation chosen FPGA family (radiation tested) and TMRtested) and TMR
ATLAS general monitoring: ELMB:ATLAS general monitoring: ELMB: In cavern and muon systemsIn cavern and muon systems Radiation tolerant: ~100GyRadiation tolerant: ~100Gy SEU sensitive, but self checking SEU sensitive, but self checking
functionsfunctions Processor with watchdog and special Processor with watchdog and special
software.software.
CCUM
LVDS/CMOS
CC
U
Primary
Secondary
CC
U
CC
U
CC
U
LVDS/CMOS LVDS/CMOS LVDS/CMOS
100m
On detector
Front end electronic
SPECS slave
To the veryFront end chip
Service box
SPECS slave
Service box
SPECS slave
Top of detector
Barracks Cavern
Master Board
100m
On detector
Front end electronic
SPECS slave
To the veryFront end chip
Service box
SPECS slave
Service box
SPECS slave
Service box
SPECS slave
Service box
SPECS slave
Top of detector
Barracks Cavern
Master Board
SPECS SLAVE MEZZANINE BOARD
SPECSSLAVE
PRO-ASICPlusAPA150
JTAG Bus
I2C Bus
Serial prom65Kbs
Address switches
ChannelB[7:0]
Parallel Bus
ds90cv010
RJ45
Channel Bdecoded
resonator oscillator
Cde & ctrl I/O
JA C
onn
ecto
r
JB C
onn
ecto
r
DCU
I2C/JTAG_DE /12
I2C /3
JTAG /4
I2C_2.5V 2
I2C_3.3V 2
I2C/JTAG_RE /12
I/O /32
Prog. Connector.
65Lvds32 Ana_input /6
APA_program. /7
Loc.Addr/6
SPECS SLAVE MEZZANINE BOARD
SPECSSLAVE
PRO-ASICPlusAPA150
JTAG Bus
I2C Bus
Serial prom65Kbs
Serial prom65Kbs
Address switchesAddress switches
ChannelB[7:0]
Parallel Bus
ds90cv010ds90cv010
RJ45
Channel Bdecoded
resonator oscillator
Cde & ctrl I/O
JA C
onn
ecto
r
JB C
onn
ecto
r
DCUDCU
I2C/JTAG_DE /12
I2C /3
JTAG /4
I2C_2.5V 2
I2C_3.3V 2I2C_3.3V 2
I2C/JTAG_RE /12
I/O /32
Prog. Connector.
65Lvds3265Lvds32 Ana_input /6
APA_program. /7
Loc.Addr/6
APVs
CLK - T1
CCU
DCU
De
t
PLL-Delay
A/DMemory
TTCrx
I2C
FED
FEC
IV
TTCrx
PCIIntfc
CLK
& T
1
A/D
Temp
Front-endModule
Control Module
FEC ctrl
DataPath
ControlPath
toDAQ
2121
Power suppliesPower supplies Worries:Worries:
Destroy our electronics if over voltage/currentDestroy our electronics if over voltage/current Destruction of power supply itselfDestruction of power supply itself Loose control/monitoring requiring power cyclingLoose control/monitoring requiring power cycling
Wiener: Keep intelligence out of radiation zoneWiener: Keep intelligence out of radiation zone Radiation zone Radiation zone
DC/DC converter: 400VDC – LV (De-rated HV power transistors DC/DC converter: 400VDC – LV (De-rated HV power transistors tested)tested)
Outside radiation:Outside radiation: Control and monitoring (CPU)Control and monitoring (CPU) PFC (Power Factor Correction): CPU and high power devicesPFC (Power Factor Correction): CPU and high power devices
CostCost Cabling -> relatively few “high power” channelsCabling -> relatively few “high power” channels Limited control and monitoringLimited control and monitoring
CAEN: Keep HV out of radiation zone CAEN: Keep HV out of radiation zone Radiation zone:Radiation zone:
48v DC – LV, possibility of many low/med power channels.48v DC – LV, possibility of many low/med power channels. Local processor with dedicated software (SEU !)Local processor with dedicated software (SEU !)
Outside radiationOutside radiation AC/DC (48V)AC/DC (48V) PFC compensating filterPFC compensating filter (rad tolerant version required for high power systems as power loss of 48v (rad tolerant version required for high power systems as power loss of 48v
on ~100m to large)on ~100m to large)
Extensive use of rad tolerant low voltage drop linear regulator Extensive use of rad tolerant low voltage drop linear regulator developed by STM in collaboration with CERNdeveloped by STM in collaboration with CERN
Custom: Custom: Atlas calorimeterAtlas calorimeter
2222
The Wall (not by Pink Floyd)The Wall (not by Pink Floyd) LHCb experience:LHCb experience:
Physical (thin) walls does not Physical (thin) walls does not make problem disappear. make problem disappear.
What you do not see, you do not What you do not see, you do not worry about.worry about.
When you see it, it is too late.When you see it, it is too late. Lots of concrete needed to give Lots of concrete needed to give
effective radiation shielding.effective radiation shielding. Political/organisatorial walls Political/organisatorial walls
does not make things betterdoes not make things better
All participants in global project All participants in global project (experiments + machine + ?) (experiments + machine + ?) must be aware of the potential must be aware of the potential problems.problems. Extensive exchange of Extensive exchange of
information/experienceinformation/experience Key part of project Key part of project
requirementsrequirements ReviewsReviews
Shielding wall
COTS Electronics + CPU farm
Custom rad hardOn-detectorelectronics
LHC machine equipmentnext to experiment and in “hidden” service cavern !
2323
What to avoidWhat to avoid Underestimate the problemUnderestimate the problem Safety systems in radiationSafety systems in radiation Forget that local errors can propagate to the system level Forget that local errors can propagate to the system level
and make the whole system fall over (very hard to verify in and make the whole system fall over (very hard to verify in advance for complex systems)advance for complex systems)
Assume that somebody else will magically solve this.Assume that somebody else will magically solve this.
Complicated not well known electronics (black box) in Complicated not well known electronics (black box) in radiation environmentradiation environment Computers, PLC, Complicated Communication interfaces , , Computers, PLC, Complicated Communication interfaces , ,
High power devices in radiation zonesHigh power devices in radiation zones SEE effects can become “catastrophic”SEE effects can become “catastrophic”
Particular known weak componentsParticular known weak components Some types of opto couplers, etc.Some types of opto couplers, etc.
Uncritical use of complicated devices (e.g. FPGA’s)Uncritical use of complicated devices (e.g. FPGA’s)
2424
SummarySummary Expensive Expensive
(If forgotten then much more expensive)(If forgotten then much more expensive) Takes significant efforts across all Takes significant efforts across all
systems, levels and hierarchies.systems, levels and hierarchies. Radiation culture, policy and reviewsRadiation culture, policy and reviews
No easy solutions in this domainNo easy solutions in this domain ““The Devil is in the details”The Devil is in the details”
Still waiting for unexpected problems Still waiting for unexpected problems when final beam, despite huge efforts when final beam, despite huge efforts done in the experiments in this domain.done in the experiments in this domain.
2525
Source of informationSource of information Experiment web pages:Experiment web pages:
LHCb:LHCb: Electronics coordination: Electronics coordination:
http://lhcb-elec.web.cern.ch/lhcb-elec/ http://lhcb-elec.web.cern.ch/lhcb-elec/ Radiation policy: Radiation policy:
http://lhcb-elec.web.cern.ch/lhcb-elec/html/radiation_hardness.htm http://lhcb-elec.web.cern.ch/lhcb-elec/html/radiation_hardness.htm Reviews (many): Reviews (many):
http://lhcb-elec.web.cern.ch/lhcb-elec/html/design_review.htm http://lhcb-elec.web.cern.ch/lhcb-elec/html/design_review.htm ATLAS Radiation policy: ATLAS Radiation policy:
http://atlas.web.cern.ch/Atlas/GROUPS/FRONTEND/radhard.htm http://atlas.web.cern.ch/Atlas/GROUPS/FRONTEND/radhard.htm Elec 2005 lectures:Elec 2005 lectures:
https://webh12.cern.ch/hr-training/tech/elec2005___electronics_in_high.hthttps://webh12.cern.ch/hr-training/tech/elec2005___electronics_in_high.htm m
FPGA’s: FPGA’s: http://klabs.org/fpgas.htm http://klabs.org/fpgas.htm
Summer student lecture:Summer student lecture:http://indico.cern.ch/conferenceDisplay.py?confId=34799 http://indico.cern.ch/conferenceDisplay.py?confId=34799
Google it on internet.Google it on internet.
2626