handling/surviving with radiation effects in large scale complex systems (hep experiment) jorgen...

26
Handling/surviving Handling/surviving with radiation effects with radiation effects in large scale complex in large scale complex systems systems (HEP experiment) (HEP experiment) Jorgen Christiansen Jorgen Christiansen PH-ESE PH-ESE

Upload: james-jones

Post on 01-Jan-2016

213 views

Category:

Documents


0 download

TRANSCRIPT

Handling/surviving with Handling/surviving with radiation effects in large radiation effects in large scale complex systemsscale complex systems

(HEP experiment)(HEP experiment)

Jorgen ChristiansenJorgen ChristiansenPH-ESEPH-ESE

OutlineOutline Large scale systemLarge scale system RisksRisks Radiation “culture” and reviewsRadiation “culture” and reviews ZonesZones Failure types and recoveryFailure types and recovery System Architecture, Simplify, redundancySystem Architecture, Simplify, redundancy Classification of systems and functionsClassification of systems and functions COTS, CPU’s, FPGA’s, ASIC’sCOTS, CPU’s, FPGA’s, ASIC’s Accumulating effects: TID, NIELAccumulating effects: TID, NIEL SEESEE Some examplesSome examples The wallThe wall What to avoidWhat to avoid SummarySummary Source of informationSource of information

22

Large scale HEP system: Large scale HEP system: ATLASATLAS

~100M detector channels~100M detector channels ~400k front-end ASIC’s ~400k front-end ASIC’s

~40 different ASIC types~40 different ASIC types(large majority in same technology)(large majority in same technology)

~50k front-end hybrids~50k front-end hybrids ~20k optical links~20k optical links ~300 KW of power for on-detector ~300 KW of power for on-detector

electronicselectronics ~220 readout/trigger crates~220 readout/trigger crates ~2000 computers~2000 computers Radiation (10 years): Pixel layerRadiation (10 years): Pixel layer

TID: ~1MGyTID: ~1MGyNIEL: ~1x10NIEL: ~1x1015 15 1Mev n cm1Mev n cm-2-2

SEE: ~1x10SEE: ~1x1015 15 >20Mev h cm>20Mev h cm-2-2

~Thousand man-years invested in ~Thousand man-years invested in design, test, radiation qualification, design, test, radiation qualification, production, installation, and final production, installation, and final commissioning of electronics !commissioning of electronics !

Other experimentsOther experiments CMS: Similar to ATLASCMS: Similar to ATLAS LHCb: ¼ size, forward, “open”, similar levelsLHCb: ¼ size, forward, “open”, similar levels ALICE: ¼ size, Much lower levels, SEE initially ALICE: ¼ size, Much lower levels, SEE initially

underestimatedunderestimated

Typically the same within factor 2 - 4 inside experiment.Order of magnitude differences in cavern !

33

What risks to take ?What risks to take ? Risk analysisRisk analysis

What local failures provokes what system What local failures provokes what system failures ?failures ?

This is often more complicated than initially thoughtThis is often more complicated than initially thought Can a given failure destroy other equipmentCan a given failure destroy other equipment Hard and soft failuresHard and soft failures How long does it take to recover from this ?How long does it take to recover from this ?

Fold in radiation induced failure types and Fold in radiation induced failure types and ratesrates Zero risk do not existZero risk do not exist Very low risk will be very expensiveVery low risk will be very expensive Anything else than low risk will be Anything else than low risk will be

unacceptableunacceptable44

Radiation “culture”Radiation “culture” In large projects all project managers and electronics designers must be In large projects all project managers and electronics designers must be

educated in this so they are capable of taking appropriate measures and educated in this so they are capable of taking appropriate measures and feels responsible for this.feels responsible for this.

When ever making major/small design decisions all designers must continuously When ever making major/small design decisions all designers must continuously ask the question: what about radiation effects ?.ask the question: what about radiation effects ?.

Speak common language on this “new” field.Speak common language on this “new” field. Expertise, advice and design “tricks” for radiation hard design must be Expertise, advice and design “tricks” for radiation hard design must be

exchanged between people in project (and from other projects and exchanged between people in project (and from other projects and communities: Space, Mill, etc.).communities: Space, Mill, etc.).

Common/standardized solutions can/must be identified.Common/standardized solutions can/must be identified. Use common rad hard technology and librariesUse common rad hard technology and libraries Use common ASIC components (e.g. particular link interface)Use common ASIC components (e.g. particular link interface) Use common COTS components (e.g. particular FPGA)Use common COTS components (e.g. particular FPGA) Use common design tricks (ECC, triple redundancy)Use common design tricks (ECC, triple redundancy) Common and standardized test and qualification methods.Common and standardized test and qualification methods. Etc.Etc.

A radiation working group can be a means to obtain this.A radiation working group can be a means to obtain this.(can be termed electronics/radiation/? coordination group/board/committee , ,)(can be termed electronics/radiation/? coordination group/board/committee , ,)

Make everybody aware that designs will be reviewed and must be testedMake everybody aware that designs will be reviewed and must be tested Make everybody aware of importance of radiation hardness policy and Make everybody aware of importance of radiation hardness policy and

develop(enforce) consensus that this is for the best for us all.develop(enforce) consensus that this is for the best for us all. Clear guidelines: radiation levels, safety factors,testing methods, design Clear guidelines: radiation levels, safety factors,testing methods, design

methods, etc.methods, etc.55

ReviewsReviews Design reviews can be an excellent way to Design reviews can be an excellent way to

verify/enforce/encourage that radiation effects have been properly verify/enforce/encourage that radiation effects have been properly taken into account and tested for in each component/sub-system.taken into account and tested for in each component/sub-system.

By exchanging reviewers across sub-system boundaries a very By exchanging reviewers across sub-system boundaries a very useful exchange of ideas/approaches is accomplished.useful exchange of ideas/approaches is accomplished.

A lot of this useful exchange actually then happens outside the review contextA lot of this useful exchange actually then happens outside the review context We are all here to make a full system that works well.We are all here to make a full system that works well.

Reviews should mainly be seen as an additional help to the Reviews should mainly be seen as an additional help to the designers to make radiation tolerant and reliable designs.designers to make radiation tolerant and reliable designs. If used as a “police force” then it can be very difficult to get real If used as a “police force” then it can be very difficult to get real

potential problems to the surface.potential problems to the surface. Radiation effects and consequences must be a permanent part of Radiation effects and consequences must be a permanent part of

the review “menu”.the review “menu”. ReviewsReviews

Global system architectureGlobal system architecture Sub-system architecturesSub-system architectures Designs (ASIC’s, boards)Designs (ASIC’s, boards) Production readiness reviews (in principal too late to resolve problem)Production readiness reviews (in principal too late to resolve problem)

66

Radiation “zones”Radiation “zones” High level zones: 1MGy – 1KGy, 10High level zones: 1MGy – 1KGy, 101515 – 10 – 101111 >2oMev h cm >2oMev h cm-2-2 , 10 years , 10 years

(trackers)(trackers) Everybody knows (must be made to know) that radiation has to be taken Everybody knows (must be made to know) that radiation has to be taken

into account and special systems based on special components needs to into account and special systems based on special components needs to be designed/tested/qualified/etc.be designed/tested/qualified/etc.

TID, NIEL, SEE - Estimate of soft failure rates -> TMRTID, NIEL, SEE - Estimate of soft failure rates -> TMR ASIC’sASIC’s

Intermediate zones: 100Gy – 1kGy, 10Intermediate zones: 100Gy – 1kGy, 101111 – 10 – 101010 >2oMev h cm >2oMev h cm-2-2 (calorimeters and muon)(calorimeters and muon)

ASIC’sASIC’s Potential use of COTSPotential use of COTS

Radiation tests for TID, (NIEL), SEE (Use available tests if appropriate)Radiation tests for TID, (NIEL), SEE (Use available tests if appropriate) Special design principles for SEE (e.g. Triple Modular Redundant, TMR in FPGA’s)Special design principles for SEE (e.g. Triple Modular Redundant, TMR in FPGA’s)

Low level zones: < 100Gy, <10Low level zones: < 100Gy, <101010 >2oMev h cm >2oMev h cm-2-2 (cavern), (cavern),

Extensive use of COTS Extensive use of COTS TID and NIEL not a major problemTID and NIEL not a major problem SEE effects can be severely underestimatedSEE effects can be severely underestimated

Safe zones: < ? (LHC experiments: counting house with full access)Safe zones: < ? (LHC experiments: counting house with full access) One has to be very careful of how such zones are defined. Do not confuse One has to be very careful of how such zones are defined. Do not confuse

with low rate zones !!!with low rate zones !!!77

Failure types to deal withFailure types to deal with Hard failuresHard failures

Requires repair/sparesRequires repair/spares TID, NIEL, SEB, SEGR, (SEL)TID, NIEL, SEB, SEGR, (SEL)

Soft/transient failures:Soft/transient failures: Requires local or system recoveryRequires local or system recovery Can possibly provoke secondary hard failures Can possibly provoke secondary hard failures

in system!.in system!. SEUSEU (SEL if recovery by power cycling)(SEL if recovery by power cycling)

Can/will be destructive because of high currents and Can/will be destructive because of high currents and overheating overheating

Over current protection to prevent “self destruction”.Over current protection to prevent “self destruction”.

88

Recover from problemRecover from problem Hard failures (TID, NIEL & SEE: SEB, SEGR,SEL)Hard failures (TID, NIEL & SEE: SEB, SEGR,SEL)

Identify failing component/moduleIdentify failing component/module Can be difficult in large complex systemsCan be difficult in large complex systems

Exchange failing component (access, spares, etc.)Exchange failing component (access, spares, etc.) Effects of this on the running of the global system ?Effects of this on the running of the global system ?

Soft failuresSoft failures Identify something is not working as intended -> Identify something is not working as intended ->

MonitoringMonitoring Identify causeIdentify cause Re-initialize module or sub-system or whole systemRe-initialize module or sub-system or whole system

Simple local reset (remotely !)Simple local reset (remotely !) Reset/resync of full system (if errors have propagated)Reset/resync of full system (if errors have propagated) Reload parameters and resetReload parameters and reset Reload configuration (e.g. FPGA’s)Reload configuration (e.g. FPGA’s) Power cycling to get system out of a deadlock (a la your PC)Power cycling to get system out of a deadlock (a la your PC)

How much “beam” time has been lost by this ?How much “beam” time has been lost by this ?99

System ArchitectureSystem Architecture Detailed knowledge (with clear Detailed knowledge (with clear

overview) of system architecture is overview) of system architecture is required to estimate effect of hard required to estimate effect of hard and soft failures on global systemand soft failures on global system

Which parts are in what radiation Which parts are in what radiation environment.environment.

Which parts can provoke hard Which parts can provoke hard failuresfailures

Which parts can make full system Which parts can make full system malfunction for significant time malfunction for significant time ( seconds, minutes, hours, days, ( seconds, minutes, hours, days, months)months)

Keep in mind total quantity in full systemKeep in mind total quantity in full system

Which parts can make an important Which parts can make an important functional failure for significant time functional failure for significant time but global system still runsbut global system still runs

Which parts can make an important Which parts can make an important function fail for short timefunction fail for short time

Short malfunction in non critical Short malfunction in non critical functionfunction

Central functions in safe area’sCentral functions in safe area’s1010

Simplify, but not too muchSimplify, but not too much At system level:At system level:

Clear and well defined system architectureClear and well defined system architecture Identify particular critical parts of systemIdentify particular critical parts of system

It is much easier to predict behavior of simple It is much easier to predict behavior of simple systems for what can happen when something goes systems for what can happen when something goes wrong.wrong.

1.1. Simple pipelined and triggered readoutSimple pipelined and triggered readout Require lots of memory and readout bandwidth (IC area and power)Require lots of memory and readout bandwidth (IC area and power)

2.2. Complicated DSP functions with local data Complicated DSP functions with local data reduction/compression.reduction/compression.

May require much less memory and readout May require much less memory and readout bandwidth but is much more complicated.bandwidth but is much more complicated.

Keep as much as possible out of radiation zones Keep as much as possible out of radiation zones (within physical and cost limitations)(within physical and cost limitations)

Do not simplify so much that key vital functions are Do not simplify so much that key vital functions are forgotten or left out. (e.g. protection systems, power forgotten or left out. (e.g. protection systems, power supply for crate, vital monitoring functions, built in supply for crate, vital monitoring functions, built in test functions, etc.)test functions, etc.)

At sub-system/board/chip level:At sub-system/board/chip level: Nice fancy functions can become a nightmare if they Nice fancy functions can become a nightmare if they

malfunction (SEU’s)malfunction (SEU’s)1111

RedundancyRedundancy Redundancy is the only way to make things really Redundancy is the only way to make things really

fail safefail safe At system level (for all kinds of failures)At system level (for all kinds of failures) ………….... For memories: ECC For memories: ECC (beware of possible multi bit error)(beware of possible multi bit error)

At gate level (for SEU)At gate level (for SEU) Triple redundancy implementation: TMRTriple redundancy implementation: TMR

Significant hardware and power overheadSignificant hardware and power overhead Certain functions can in some cases not be triplicated Certain functions can in some cases not be triplicated

(Central functions in FPGA’s)(Central functions in FPGA’s) How to verify/test that TMR actually works as intendedHow to verify/test that TMR actually works as intended

Comes at a high cost (complexity, money, , ,) and Comes at a high cost (complexity, money, , ,) and can potentially make things “worse” if not can potentially make things “worse” if not carefully done.carefully done.

Use where really needed.Use where really needed.

1212

Classification of equipmentClassification of equipment Personnel safety: Personnel safety:

Must never fail to prevent dangerous situationsMust never fail to prevent dangerous situations Redundant system(s)Redundant system(s)

Acceptable that it reacts too often ?Acceptable that it reacts too often ? Equipment safety:Equipment safety:

Much of our equipment is unique so large scale damage must Much of our equipment is unique so large scale damage must be prevented at “all cost”. Can it actually be reproduced ?be prevented at “all cost”. Can it actually be reproduced ?

Critical for running of whole systemCritical for running of whole system Whole system will be un-functional if failureWhole system will be un-functional if failure Central system (Power, DAQ, etc.) or critical sub-system (e.g. Central system (Power, DAQ, etc.) or critical sub-system (e.g.

Ecal, Tracker, , )Ecal, Tracker, , ) Critical for running of Local systemCritical for running of Local system

Local system will be affected for some time but whole system Local system will be affected for some time but whole system can continue to runcan continue to run

Limited part of sub-detector (e.g. tracker with “redundant” Limited part of sub-detector (e.g. tracker with “redundant” layers)layers)

UncriticalUncritical Loose data from a few channels for short time.Loose data from a few channels for short time.

1313

Classification of functionsClassification of functions Protection systemProtection system

Have independent system for thisHave independent system for this

PoweringPowering Can affect significant parts of systemCan affect significant parts of system Can possibly provoke other serious failuresCan possibly provoke other serious failures Must be detected fastMust be detected fast

ConfigurationConfiguration Part of system runs without correct function. Part of system runs without correct function.

Can possibly occur for significant time if not detectedCan possibly occur for significant time if not detected Continuously verify local configuration data.Continuously verify local configuration data.

Send configuration checksum with data, etc.Send configuration checksum with data, etc. Can possibly provoke other serious failuresCan possibly provoke other serious failures

Control (e.g. resets)Control (e.g. resets) Part of system can get desynchronized and supply corrupted data until detected Part of system can get desynchronized and supply corrupted data until detected

and system re-synchronized.and system re-synchronized. CountersCounters BuffersBuffers State machinesState machines

Send synchronization verification data (Bunch-ID, Event-ID, etc.)Send synchronization verification data (Bunch-ID, Event-ID, etc.) Monitoring (Monitoring (Separate from protection system)Separate from protection system)

Problematic situation is not detected, and protection system may at some point react.Problematic situation is not detected, and protection system may at some point react. Wrong monitoring information could provoke system to perform strong intervention (double check Wrong monitoring information could provoke system to perform strong intervention (double check

monitoring data)monitoring data)

Data pathData path A single data glitch in a channel out of 100M will normally not be importantA single data glitch in a channel out of 100M will normally not be important

1414

Using COTS Using COTS (Commercial Of The Shelf)(Commercial Of The Shelf)

For use in “low” level radiation environments, But what is “low” ?.For use in “low” level radiation environments, But what is “low” ?. Internal architecture/implementation normally unknownInternal architecture/implementation normally unknown

Often you think you know but the real implementation can actually be quite Often you think you know but the real implementation can actually be quite different from what is “hinted” in the manual/datasheetdifferent from what is “hinted” in the manual/datasheet

E.g. internal test features normally undocumented, but may still get activated by a SEU.E.g. internal test features normally undocumented, but may still get activated by a SEU. Unmapped Illegal states, , , , ,Unmapped Illegal states, , , , , Small difference in implemenation can make major difference in radiation tolerance.Small difference in implemenation can make major difference in radiation tolerance.

Assuring that all units have same radiation behaviorAssuring that all units have same radiation behavior Problem of assuring that different chips come from same fab. and process/design has not been Problem of assuring that different chips come from same fab. and process/design has not been

modified (yield enhancement, bug fix, second source, etc)modified (yield enhancement, bug fix, second source, etc) Chips: Buying all from same batch in practice very difficultChips: Buying all from same batch in practice very difficult Boards/boxes: “Impossible”Boards/boxes: “Impossible”

TID/NIEL: Particular sensitive components may be present but this TID/NIEL: Particular sensitive components may be present but this can relatively ease be verified.can relatively ease be verified.

Major problem for use in “low” radiation environments will be SEE Major problem for use in “low” radiation environments will be SEE effects:effects:

Rate ?Rate ? Effect ?Effect ? Recovery ?Recovery ?

PS: Radiation qualified space/mill components extremely PS: Radiation qualified space/mill components extremely expensive and does often not solve the SEU problem.expensive and does often not solve the SEU problem.

1515

Use of CPU’sUse of CPU’s Containing CPU’s: PC, PLC, , ,Containing CPU’s: PC, PLC, , ,

Basically any of piece of modern commercial electronics equipment Basically any of piece of modern commercial electronics equipment contains micro controllers !.contains micro controllers !.

Contains: Program memory, Data memory, Cache, Buffers, Register Contains: Program memory, Data memory, Cache, Buffers, Register file, registers, pipeline registers, state registers, ,file, registers, pipeline registers, state registers, ,

Different cross-sections for different types of memoryDifferent cross-sections for different types of memory How large a fraction of this is actually “functionally sensitive” to SEU ?How large a fraction of this is actually “functionally sensitive” to SEU ?

Complicated piece of electronics where radiation effects (SEU) are Complicated piece of electronics where radiation effects (SEU) are very hard to predict !.very hard to predict !.

Operating systemOperating system User codeUser code

Schemes to possibly improve on this:Schemes to possibly improve on this: Special fault tolerant processors (expensive and old fashioned)Special fault tolerant processors (expensive and old fashioned) Error Correcting Code (ECC) memory (and Cache). What about multi bit Error Correcting Code (ECC) memory (and Cache). What about multi bit

flips ?.flips ?. Watch dog timersWatch dog timers Self checking user softwareSelf checking user software

Does not resolve problem of operation systemDoes not resolve problem of operation system Does not resolve problem of SEU in key control functions of CPU (e.g. program counter, Does not resolve problem of SEU in key control functions of CPU (e.g. program counter,

memory)memory)

Do NOT use in systems where “frequent” failures can not be Do NOT use in systems where “frequent” failures can not be tolerated.tolerated. 1616

ASIC’s versus COTSASIC’s versus COTS ASIC’sASIC’s

High time, manpower and money costsHigh time, manpower and money costs Cost effective if many parts needed: ~100kCost effective if many parts needed: ~100k Needed anyway for very specific functions Needed anyway for very specific functions

(e.g. FE chips for experiments)(e.g. FE chips for experiments) Can be made very rad hard (if done correctly)Can be made very rad hard (if done correctly)

Use of appropriate technology, libraries, tools and design approach (TMR)Use of appropriate technology, libraries, tools and design approach (TMR)

Option: Flexible ASIC for use in similar applications (link chip, interface, Option: Flexible ASIC for use in similar applications (link chip, interface, etc.)etc.)

Requires a lot of communication across Requires a lot of communication across projects/groups/departments/experiments/ / /projects/groups/departments/experiments/ / /

COTSCOTS Lots of “nice and very attractive” devices out there.Lots of “nice and very attractive” devices out there. ““Cheap” but VERY problematic Cheap” but VERY problematic

(and may therefore finally become very expensive).(and may therefore finally become very expensive). Use with extreme care.Use with extreme care.

FPGA’s with tricks (mitigation)FPGA’s with tricks (mitigation) Chose appropriate FPGA technology (common platform across projects)Chose appropriate FPGA technology (common platform across projects)

Special space/mill versions, Antifuse, FLASH, RAM with scrubbing , , + TMRSpecial space/mill versions, Antifuse, FLASH, RAM with scrubbing , , + TMR

Be careful with hidden functions: SEFI (test, programming, ,)Be careful with hidden functions: SEFI (test, programming, ,) Be careful with intelligent CAE tools that may try to remove TMR redundancy Be careful with intelligent CAE tools that may try to remove TMR redundancy

(synthesis)(synthesis) 1717

TID and NIELTID and NIEL In principle this could be seen as early In principle this could be seen as early

wear-out of electronics.wear-out of electronics. But much earlier than expectedBut much earlier than expected Starts in small hot regions or in Starts in small hot regions or in

particular batches of components.particular batches of components. Use available spares to repairUse available spares to repair Functions may possibly “recover” after Functions may possibly “recover” after

some time without radiation because of some time without radiation because of annealingannealing

But will then quickly expand like a virusBut will then quickly expand like a virus Spares run out and component Spares run out and component

obsolescence problems then gets a obsolescence problems then gets a serious problem.serious problem.

Before systems clearly “hard fail” they Before systems clearly “hard fail” they may be seen as unreliable/unstable may be seen as unreliable/unstable (and no evident cause of this can be (and no evident cause of this can be found)found)

Decreased noise margins on signalsDecreased noise margins on signals Timing racesTiming races Increased leakage currents -> power Increased leakage currents -> power

supplies marginal supplies marginal Etc.Etc.

Failure rate

Time

Infant mortalities

1818

SEESEE Systems have sporadic failures that happens randomly and Systems have sporadic failures that happens randomly and

at no well identified locations.at no well identified locations. One may possibly have a few particular sensitive parts.One may possibly have a few particular sensitive parts.

““Normal” (non rad induced) sporadic failures are identified Normal” (non rad induced) sporadic failures are identified and corrected during system and sub-system tests and final and corrected during system and sub-system tests and final commissioning of full system.commissioning of full system. Already indentifying cause of sporadic failures at this level and Already indentifying cause of sporadic failures at this level and

resolve them is very difficult.resolve them is very difficult. SEE failures will only be seen when radiation present (after SEE failures will only be seen when radiation present (after

system commissioning) and then it is often too late to react system commissioning) and then it is often too late to react as resolving this may require complete system re-design.as resolving this may require complete system re-design. May possibly be bearable during first initial low luminosity but May possibly be bearable during first initial low luminosity but

then …..then ….. We have not yet had any of our large scale systems running We have not yet had any of our large scale systems running

under such radiation conditions.under such radiation conditions. Surprises may(will) be ahead of us even if we have done our best Surprises may(will) be ahead of us even if we have done our best

to be immune to these.to be immune to these. They hit you at exactly the time you do NOT want system They hit you at exactly the time you do NOT want system

failures to occur.failures to occur.1919

CMS tracker front-endCMS tracker front-end Uncritical (SEU)Uncritical (SEU)

Data pathData path (Monitoring)(Monitoring)

Single corrupted read can be redoneSingle corrupted read can be redone

Critical (SEU)Critical (SEU) Configuration/controlConfiguration/control Data path controlData path control

Buffer controllerBuffer controller Readout controllerReadout controller

SynchronizationSynchronization PLLPLL ResetsResets Event tag identifiersEvent tag identifiers

Vital (but not seen)Vital (but not seen) PowerPower Protection (thermal)Protection (thermal)

digital optical link

Optical transmitter

ADC DSP

RAMTTCrx

TTCrx µPFront End Driver

T1

Front End Controller

I2C

Front End ModuleDetector

Control module

PLL

CLK

PLL

CCU

analogue optical link

DCU

Tx/Rx

Tx/Rx

APV

APVMUX

Data path

Config./control

Safe zone2020

Control/monitoringControl/monitoring CMS tracker: CCU, DCUCMS tracker: CCU, DCU

Inside detectorInside detector Radiation hard: ~1MGyRadiation hard: ~1MGy SEU immuneSEU immune Redundant token-ring routingRedundant token-ring routing ASIC’s basedASIC’s based

LHCb: SPECSLHCb: SPECS At periphery of detectorAt periphery of detector Radiation tolerant: ~1kGyRadiation tolerant: ~1kGy SEU “immune”SEU “immune” FPGA implementation with well FPGA implementation with well

chosen FPGA family (radiation chosen FPGA family (radiation tested) and TMRtested) and TMR

ATLAS general monitoring: ELMB:ATLAS general monitoring: ELMB: In cavern and muon systemsIn cavern and muon systems Radiation tolerant: ~100GyRadiation tolerant: ~100Gy SEU sensitive, but self checking SEU sensitive, but self checking

functionsfunctions Processor with watchdog and special Processor with watchdog and special

software.software.

CCUM

LVDS/CMOS

CC

U

Primary

Secondary

CC

U

CC

U

CC

U

LVDS/CMOS LVDS/CMOS LVDS/CMOS

100m

On detector

Front end electronic

SPECS slave

To the veryFront end chip

Service box

SPECS slave

Service box

SPECS slave

Top of detector

Barracks Cavern

Master Board

100m

On detector

Front end electronic

SPECS slave

To the veryFront end chip

Service box

SPECS slave

Service box

SPECS slave

Service box

SPECS slave

Service box

SPECS slave

Top of detector

Barracks Cavern

Master Board

SPECS SLAVE MEZZANINE BOARD

SPECSSLAVE

PRO-ASICPlusAPA150

JTAG Bus

I2C Bus

Serial prom65Kbs

Address switches

ChannelB[7:0]

Parallel Bus

ds90cv010

RJ45

Channel Bdecoded

resonator oscillator

Cde & ctrl I/O

JA C

onn

ecto

r

JB C

onn

ecto

r

DCU

I2C/JTAG_DE /12

I2C /3

JTAG /4

I2C_2.5V 2

I2C_3.3V 2

I2C/JTAG_RE /12

I/O /32

Prog. Connector.

65Lvds32 Ana_input /6

APA_program. /7

Loc.Addr/6

SPECS SLAVE MEZZANINE BOARD

SPECSSLAVE

PRO-ASICPlusAPA150

JTAG Bus

I2C Bus

Serial prom65Kbs

Serial prom65Kbs

Address switchesAddress switches

ChannelB[7:0]

Parallel Bus

ds90cv010ds90cv010

RJ45

Channel Bdecoded

resonator oscillator

Cde & ctrl I/O

JA C

onn

ecto

r

JB C

onn

ecto

r

DCUDCU

I2C/JTAG_DE /12

I2C /3

JTAG /4

I2C_2.5V 2

I2C_3.3V 2I2C_3.3V 2

I2C/JTAG_RE /12

I/O /32

Prog. Connector.

65Lvds3265Lvds32 Ana_input /6

APA_program. /7

Loc.Addr/6

APVs

CLK - T1

CCU

DCU

De

t

PLL-Delay

A/DMemory

TTCrx

I2C

FED

FEC

IV

TTCrx

PCIIntfc

CLK

& T

1

A/D

Temp

Front-endModule

Control Module

FEC ctrl

DataPath

ControlPath

toDAQ

2121

Power suppliesPower supplies Worries:Worries:

Destroy our electronics if over voltage/currentDestroy our electronics if over voltage/current Destruction of power supply itselfDestruction of power supply itself Loose control/monitoring requiring power cyclingLoose control/monitoring requiring power cycling

Wiener: Keep intelligence out of radiation zoneWiener: Keep intelligence out of radiation zone Radiation zone Radiation zone

DC/DC converter: 400VDC – LV (De-rated HV power transistors DC/DC converter: 400VDC – LV (De-rated HV power transistors tested)tested)

Outside radiation:Outside radiation: Control and monitoring (CPU)Control and monitoring (CPU) PFC (Power Factor Correction): CPU and high power devicesPFC (Power Factor Correction): CPU and high power devices

CostCost Cabling -> relatively few “high power” channelsCabling -> relatively few “high power” channels Limited control and monitoringLimited control and monitoring

CAEN: Keep HV out of radiation zone CAEN: Keep HV out of radiation zone Radiation zone:Radiation zone:

48v DC – LV, possibility of many low/med power channels.48v DC – LV, possibility of many low/med power channels. Local processor with dedicated software (SEU !)Local processor with dedicated software (SEU !)

Outside radiationOutside radiation AC/DC (48V)AC/DC (48V) PFC compensating filterPFC compensating filter (rad tolerant version required for high power systems as power loss of 48v (rad tolerant version required for high power systems as power loss of 48v

on ~100m to large)on ~100m to large)

Extensive use of rad tolerant low voltage drop linear regulator Extensive use of rad tolerant low voltage drop linear regulator developed by STM in collaboration with CERNdeveloped by STM in collaboration with CERN

Custom: Custom: Atlas calorimeterAtlas calorimeter

2222

The Wall (not by Pink Floyd)The Wall (not by Pink Floyd) LHCb experience:LHCb experience:

Physical (thin) walls does not Physical (thin) walls does not make problem disappear. make problem disappear.

What you do not see, you do not What you do not see, you do not worry about.worry about.

When you see it, it is too late.When you see it, it is too late. Lots of concrete needed to give Lots of concrete needed to give

effective radiation shielding.effective radiation shielding. Political/organisatorial walls Political/organisatorial walls

does not make things betterdoes not make things better

All participants in global project All participants in global project (experiments + machine + ?) (experiments + machine + ?) must be aware of the potential must be aware of the potential problems.problems. Extensive exchange of Extensive exchange of

information/experienceinformation/experience Key part of project Key part of project

requirementsrequirements ReviewsReviews

Shielding wall

COTS Electronics + CPU farm

Custom rad hardOn-detectorelectronics

LHC machine equipmentnext to experiment and in “hidden” service cavern !

2323

What to avoidWhat to avoid Underestimate the problemUnderestimate the problem Safety systems in radiationSafety systems in radiation Forget that local errors can propagate to the system level Forget that local errors can propagate to the system level

and make the whole system fall over (very hard to verify in and make the whole system fall over (very hard to verify in advance for complex systems)advance for complex systems)

Assume that somebody else will magically solve this.Assume that somebody else will magically solve this.

Complicated not well known electronics (black box) in Complicated not well known electronics (black box) in radiation environmentradiation environment Computers, PLC, Complicated Communication interfaces , , Computers, PLC, Complicated Communication interfaces , ,

High power devices in radiation zonesHigh power devices in radiation zones SEE effects can become “catastrophic”SEE effects can become “catastrophic”

Particular known weak componentsParticular known weak components Some types of opto couplers, etc.Some types of opto couplers, etc.

Uncritical use of complicated devices (e.g. FPGA’s)Uncritical use of complicated devices (e.g. FPGA’s)

2424

SummarySummary Expensive Expensive

(If forgotten then much more expensive)(If forgotten then much more expensive) Takes significant efforts across all Takes significant efforts across all

systems, levels and hierarchies.systems, levels and hierarchies. Radiation culture, policy and reviewsRadiation culture, policy and reviews

No easy solutions in this domainNo easy solutions in this domain ““The Devil is in the details”The Devil is in the details”

Still waiting for unexpected problems Still waiting for unexpected problems when final beam, despite huge efforts when final beam, despite huge efforts done in the experiments in this domain.done in the experiments in this domain.

2525

Source of informationSource of information Experiment web pages:Experiment web pages:

LHCb:LHCb: Electronics coordination: Electronics coordination:

http://lhcb-elec.web.cern.ch/lhcb-elec/ http://lhcb-elec.web.cern.ch/lhcb-elec/ Radiation policy: Radiation policy:

http://lhcb-elec.web.cern.ch/lhcb-elec/html/radiation_hardness.htm http://lhcb-elec.web.cern.ch/lhcb-elec/html/radiation_hardness.htm Reviews (many): Reviews (many):

http://lhcb-elec.web.cern.ch/lhcb-elec/html/design_review.htm http://lhcb-elec.web.cern.ch/lhcb-elec/html/design_review.htm ATLAS Radiation policy: ATLAS Radiation policy:

http://atlas.web.cern.ch/Atlas/GROUPS/FRONTEND/radhard.htm http://atlas.web.cern.ch/Atlas/GROUPS/FRONTEND/radhard.htm Elec 2005 lectures:Elec 2005 lectures:

https://webh12.cern.ch/hr-training/tech/elec2005___electronics_in_high.hthttps://webh12.cern.ch/hr-training/tech/elec2005___electronics_in_high.htm m

FPGA’s: FPGA’s: http://klabs.org/fpgas.htm http://klabs.org/fpgas.htm

Summer student lecture:Summer student lecture:http://indico.cern.ch/conferenceDisplay.py?confId=34799 http://indico.cern.ch/conferenceDisplay.py?confId=34799

Google it on internet.Google it on internet.

2626