IBM Systems and Technology Group
September 20, 2005
© 2005 IBM Corporation
Steve [email protected]
Avoiding the Pitfalls of Redundant Power Systems Design
N+1 Power System Availability
Figure: PS #1 and PS #2 feeding a common load
Power supply MTBF: 200K hr
Outages per 10K servers over a 5-year life:
Nonredundant power system: 1812
Redundant power system with quarterly scheduled service: 19.5
Hot-swap redundant power system with 48 hr replacement: 0.4
If only it were true…
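The table's figures can be approximated with a simple constant-failure-rate model. The sketch below is not the author's calculation: it assumes exponentially distributed failures and about 8,000 powered-on hours per year (40,000 h over 5 years), and models quarterly service as a 2,000 h replacement interval. Under those assumptions the nonredundant case reproduces 1812; the two redundant cases come out near 19.8 and 1.0, the same order as the slide's 19.5 and 0.4, which suggests the slide used somewhat different service details.

```python
import math

# Values from the slide
MTBF_H = 200_000.0   # power supply MTBF, hours
FLEET = 10_000       # servers

# ASSUMPTIONS (not stated on the slide): exponential failures and about
# 8,000 powered-on hours per year, so a 5-year life is 40,000 h.
LIFE_H = 40_000.0
LAM = 1.0 / MTBF_H   # constant failure rate per supply


def nonredundant() -> float:
    """Expected outages: any single supply failure takes the server down."""
    return FLEET * (1.0 - math.exp(-LAM * LIFE_H))


def scheduled_service(interval_h: float = 2_000.0) -> float:
    """1+1 supplies; failed units are replaced only at each service visit.

    An outage occurs if both supplies fail within the same interval.
    """
    p_one = 1.0 - math.exp(-LAM * interval_h)
    p_outage = p_one ** 2
    n_intervals = LIFE_H / interval_h
    return FLEET * (1.0 - (1.0 - p_outage) ** n_intervals)


def hot_swap(repair_h: float = 48.0) -> float:
    """1+1 supplies; a failed unit is replaced within repair_h hours.

    Rare-event approximation: outage rate = (rate of a first failure)
    x (probability the survivor fails before the repair).
    """
    outage_rate = 2.0 * LAM * (1.0 - math.exp(-LAM * repair_h))
    return FLEET * (1.0 - math.exp(-outage_rate * LIFE_H))


for name, fn in [("nonredundant", nonredundant),
                 ("quarterly service", scheduled_service),
                 ("hot swap, 48 h", hot_swap)]:
    print(f"{name:18s} {fn():8.1f}")
```

None of these cases includes single points of failure, which is presumably what “If only it were true…” refers to.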
Single Points of Failure
System outage probability = probability of 2 failed power supplies in the system + probability of all single failures that can cause a system outage
The design of a high-availability power system needs to focus on reducing critical failures more than on increasing power supply reliability.
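To see why critical failures dominate, the sketch below adds the two terms of the outage equation for a hot-swap 1+1 system. The fraction of supply failures that cause an outage on their own is a hypothetical parameter, and the life and repair values (40,000 powered-on hours, 48 h repair, exponential failures) are modeling assumptions, not values taken from the slide.

```python
import math

LAM = 1.0 / 200_000.0  # per-supply failure rate from the 200K hr MTBF
LIFE_H = 40_000.0      # ASSUMPTION: ~5 years of powered-on hours
FLEET = 10_000         # servers
N_SUPPLIES = 2         # 1+1 redundant


def outages(critical_fraction: float, repair_h: float = 48.0) -> float:
    """Fleet outages for a hot-swap N+1 system over LIFE_H.

    critical_fraction is HYPOTHETICAL: the share of supply failures that
    take the whole system down on their own (e.g. a shorted output with
    no ORing device).
    """
    # Term 1: two failed power supplies (a first failure, then a survivor
    # fails before the repair is made).
    p_survivor_fails = 1.0 - math.exp(-(N_SUPPLIES - 1) * LAM * repair_h)
    double = N_SUPPLIES * LAM * p_survivor_fails
    # Term 2: single failures that cause an outage by themselves.
    single = N_SUPPLIES * LAM * critical_fraction
    return FLEET * (1.0 - math.exp(-(double + single) * LIFE_H))


for frac in (0.0, 0.001, 0.01):
    print(f"critical fraction {frac:5.3f}: {outages(frac):6.2f} outages")
```

With a critical fraction of 0 the result is about 1 outage per 10K servers; a critical fraction of only 1% raises it to about 41, so the single-failure term swamps the double-failure term.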
Where Single Points of Failure Can Occur
Figure: dual power cords feed two power supplies; redundant DC buses feed redundant bus converters and redundant POL converters; a service system monitors the group
Power supplies: shorted input, shorted output
Bus converters: shorted input, shorted output
Group converter parameters
Service system: faulty sensors, faulty service action
Shorted Output Sections
Figure: Regulator #1 and Regulator #2, each with its own regulation controls, feed a common load; an ORing FET with its own controls is added at each regulator output
A shorted output FET or capacitor will bring down the output bus
Fix: add an ORing device (rectifier or FET) at each regulator output
Rectifier: simple but lossy
FET: needs a control circuit, but is more efficient
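The rectifier-versus-FET trade-off is mostly conduction loss. A minimal sketch with hypothetical component values (a Schottky rectifier with Vf = 0.45 V, an ORing FET with Rds(on) = 2 mΩ, and a 5 V, 50 A regulator), none of which come from this slide:

```python
def rectifier_loss_w(current_a: float, v_forward: float) -> float:
    """Conduction loss of a series rectifier (diode): P = Vf * I."""
    return v_forward * current_a


def oring_fet_loss_w(current_a: float, rds_on_ohm: float) -> float:
    """Conduction loss of a fully enhanced ORing FET: P = I^2 * Rds(on)."""
    return current_a ** 2 * rds_on_ohm


# HYPOTHETICAL example values: 5 V, 50 A output; Vf = 0.45 V; Rds(on) = 2 mOhm
I_OUT, V_OUT = 50.0, 5.0
p_out = V_OUT * I_OUT
for label, loss in [("rectifier", rectifier_loss_w(I_OUT, 0.45)),
                    ("ORing FET", oring_fet_loss_w(I_OUT, 0.002))]:
    print(f"{label:9s}: {loss:5.1f} W ({100 * loss / p_out:.1f}% of output)")
```

With these values the rectifier dissipates 22.5 W (9% of the 250 W output) and the FET 5 W (2%). At lower output voltages such as 1.1 V, the rectifier's fixed forward drop becomes a much larger fraction of the output.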
Latent Failures
Undetected failures can compromise redundancy
When the ORing FET fails, we need to know it…
…and report it to the service system
The addition of the “+1” supply is simple, but the implications are not
Figure: regulator with its regulation controls, followed by an ORing FET with its own controls
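One way to make a shorted ORing FET visible is a periodic test that briefly turns the FET gate off: a healthy FET then conducts through its body diode and shows a forward drop, while a shorted one shows almost none. This is an assumed technique, not necessarily the one the slide intends; the thresholds, the `OringSample` record, and the `reports` list standing in for the service system are all hypothetical.

```python
from dataclasses import dataclass

# HYPOTHETICAL thresholds; real values depend on the FET's body diode and
# on the accuracy of the voltage measurement.
SHORT_THRESHOLD_V = 0.05   # drop below this with the gate off => suspect short
MIN_TEST_CURRENT_A = 1.0   # body-diode drop is not meaningful near 0 A


@dataclass
class OringSample:
    gate_on: bool      # ORing FET gate drive state during the sample
    v_drop: float      # volts, regulator side minus bus side
    current_a: float   # current through the ORing device


def check_oring(sample: OringSample) -> str:
    """Classify one sample taken during a periodic gate-off test."""
    if sample.gate_on or sample.current_a < MIN_TEST_CURRENT_A:
        return "untestable"
    # With the gate off, a healthy FET conducts only through its body
    # diode, so a substantial forward drop should appear.
    if sample.v_drop < SHORT_THRESHOLD_V:
        return "suspect_short"
    return "ok"


def run_test(sample: OringSample, reports: list) -> str:
    """Classify a sample and report anything abnormal to the service system."""
    result = check_oring(sample)
    if result == "suspect_short":
        # Stand-in for a real service-system notification
        reports.append(("ORing FET", result, sample))
    return result
```

A latent short found this way can be repaired while the redundant supply still carries the load, instead of being discovered at the next failure.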
Separation of Regulation and Protection Circuits
Figure: two versions of a UC3844-based PWM controller stage (COMP, VFB, ISNS, RT/CT, OUT, VCC, VREF pins; current sense and gate drive connections) with a remote sense amplifier and an overvoltage indicator
Original design:
Regulation and protection share one remote sense amplifier
The regulation reference is also shared
Improved design:
A second remote sense amplifier
A second reference
A second pair of sense leads (to remote sense point #2)
Open-sense-lead latent failure omitted
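The weakness of the shared design can be shown with a toy model: if the regulation sense lead opens and the sense amplifier reads 0 V, the loop drives the output to its maximum, and a protection circuit sharing that same sense signal never sees the overvoltage. All voltages and the “open lead reads 0 V” behavior are hypothetical modeling assumptions.

```python
# HYPOTHETICAL values for a 1.1 V rail
V_REF = 1.1      # regulation setpoint, volts
V_MAX = 2.0      # output reached when the loop runs to maximum duty cycle
OV_LIMIT = 1.3   # overvoltage threshold, volts


def sense(v_actual: float, lead_open: bool) -> float:
    """Remote sense amplifier output; assume an open lead reads as 0 V."""
    return 0.0 if lead_open else v_actual


def output_voltage(reg_lead_open: bool) -> float:
    """The loop raises the output until the regulation sense reads V_REF.

    With an open regulation sense lead it never gets there, so the output
    runs up to V_MAX.
    """
    return V_MAX if reg_lead_open else V_REF


def ov_detected(separate_protection: bool, reg_lead_open: bool) -> bool:
    v_out = output_voltage(reg_lead_open)
    if separate_protection:
        # Second sense amplifier and lead pair: the regulation lead fault
        # does not affect the protection measurement.
        v_seen = sense(v_out, lead_open=False)
    else:
        # Shared sense amplifier: protection sees the same broken signal.
        v_seen = sense(v_out, lead_open=reg_lead_open)
    return v_seen > OV_LIMIT
```

The shared reference has the same weakness: if the OV threshold is derived from the regulation reference, a drifting reference moves the setpoint and the threshold together.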
Regulator to Load Impedance (Penalty of Hot Plug)
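Large-signal approximation:
$\Delta V_o \approx \dfrac{L\,\Delta I^2}{2\,C_L\,(V_{bulk} - V_o)}$
where L is the total inductance from bulk to load, $L = L_p/N + L_d$ (Lp = phase inductance, Ld = distribution inductance, N = number of phases)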
Numerical example:
I = 140 A; Lp = 200 nH; Ld = 150 nH; N = 15; Vbulk = 12 V; Vo = 2 V
For ΔV = 20 mV: CL = 4000 µF
If this were a VRM on the board (Ld ≈ 3 nH): CL = 400 µF
The decision to add hot plug should not be taken lightly: it may have a significant impact on cost and packaging.
Figure: N phases (each with inductance Lp) from Vbulk, then distribution inductance Ld to the load capacitance CL and the load at Vo
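The capacitance estimate can be checked numerically. The sketch below assumes the standard large-signal charge-balance form, in which the current in L = Lp/N + Ld ramps at (Vbulk - Vo)/L after a load step and CL supplies the charge deficit meanwhile, giving CL = L·ΔI² / (2·ΔV·(Vbulk - Vo)). The function and parameter names are illustrative.

```python
def required_cl(i_step, lp, ld, n_phases, v_bulk, v_o, dv):
    """Output capacitance (F) that holds a load step i_step (A) within dv (V).

    Large-signal approximation: the current in the total inductance
    L = Lp/N + Ld ramps at (Vbulk - Vo)/L, and CL supplies the charge
    deficit Q = L * I^2 / (2 * (Vbulk - Vo)) meanwhile, so CL = Q / dv.
    """
    l_total = lp / n_phases + ld
    q = l_total * i_step ** 2 / (2.0 * (v_bulk - v_o))
    return q / dv


# Numbers from the slide's example
hot_plug = required_cl(140, 200e-9, 150e-9, 15, 12.0, 2.0, 0.020)
on_board = required_cl(140, 200e-9, 3e-9, 15, 12.0, 2.0, 0.020)
print(f"hot plug, Ld = 150 nH: {hot_plug * 1e6:.0f} uF")
print(f"on-board VRM, Ld = 3 nH: {on_board * 1e6:.0f} uF")
```

With the slide's numbers this form gives about 8,000 µF for the hot-plug case and 800 µF for the on-board VRM, twice the slide's 4,000 µF and 400 µF; the slide may use a different constant or load-step definition, so treat the absolute values loosely. The tenfold reduction follows from the inductance ratio alone (16.3 nH versus 163.3 nH) and matches the slide.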
Overvoltage Discrimination
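Figure: PS #1 and PS #2 on a common bus; in each, an error amplifier (Vref, voltage sense) drives the PWM comparator (ramp) and gate drive, with a comparator referenced to Vref2; the Disable OV signal can be taken from two points (Option 1, Option 2)
Bus OV does not indicate which regulator caused the OV
Simple solution: Disable OV (Option 1)
Better yet: Disable OV (Option 2)
During a bus OV, PS #1 (the OV culprit) still delivers current (I > 0), while PS #2 delivers none (I = 0)
The duty cycle must go to 0 to use this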
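The two culprit tests can be sketched as follows, assuming each supply can report its output current and PWM duty cycle during the OV event. The `SupplyState` record and the 0.5 A idle threshold are hypothetical, and which check corresponds to which option on the slide is an interpretation.

```python
from dataclasses import dataclass

I_IDLE_A = 0.5  # HYPOTHETICAL: below this a supply counts as not delivering


@dataclass
class SupplyState:
    name: str
    current_a: float   # output current measured during the bus OV
    duty_cycle: float  # PWM duty cycle, 0.0 to 1.0


def culprit_by_current(s: SupplyState) -> bool:
    # A healthy supply sees the bus above its setpoint and backs off; behind
    # its ORing device it then stops delivering current.
    return s.current_a > I_IDLE_A


def culprit_by_duty(s: SupplyState) -> bool:
    # A healthy error amplifier drives the PWM duty cycle to 0 during an OV.
    return s.duty_cycle > 0.0


def ov_culprits(states: list[SupplyState], by_duty: bool = False) -> list[str]:
    """Names of the supplies that should shut down on a bus OV."""
    test = culprit_by_duty if by_duty else culprit_by_current
    return [s.name for s in states if test(s)]


states = [SupplyState("PS #1", current_a=48.0, duty_cycle=0.9),
          SupplyState("PS #2", current_a=0.0, duty_cycle=0.0)]
print(ov_culprits(states), ov_culprits(states, by_duty=True))
```

Only the culprit shuts down; the healthy supply keeps its OV shutdown disabled and continues to carry the load.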
Featurable Loads
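Figure: PS #1 (5V @ 50A) and PS #2 (5V @ 50A) feed ten PCI slots, one populated with a 5V @ 2A PCI card and nine empty; each supply delivers I = 1 A
To assure N+1 redundancy, accurate current sharing and sensing are required
1% current sharing with 1% current measurement accuracy is marginal here: 1% of 50 A is 0.5 A, so the two errors together equal the 1 A each supply delivers
Periodic Vout adjustment up and down by the service system will help
Most I/O (PCI cards, disks) and memory tend to be highly featurable, so the installed load can be a small fraction of the supply rating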
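The margin problem can be made concrete. The sketch assumes both 1% errors are relative to the 50 A rating and can add in the worst-case direction, and asks whether a working supply's lowest possible reading still exceeds the highest reading a dead supply could show.

```python
# Values from the slide: two 5 V, 50 A supplies sharing a small load.
RATING_A = 50.0
N_SUPPLIES = 2
SHARE_ERR = 0.01   # 1% current sharing
MEAS_ERR = 0.01    # 1% current measurement accuracy

# ASSUMPTION: both errors are fractions of the 50 A rating and can add
# in the worst-case direction.


def distinguishable_from_dead(load_a: float) -> bool:
    """True if a working supply can never read as low as a dead one."""
    nominal = load_a / N_SUPPLIES
    working_low = nominal - (SHARE_ERR + MEAS_ERR) * RATING_A
    dead_high = MEAS_ERR * RATING_A  # a dead supply may still read +0.5 A
    return working_low > dead_high


for load in (2.0, 10.0, 100.0):
    print(f"{load:6.1f} A load: distinguishable = {distinguishable_from_dead(load)}")
```

For the 2 A single-card load the answer is no (0 A versus 0.5 A); under these assumptions a total load above 3 A is needed before the two cases separate.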
Faulty Sensor (Group Overcurrent)
Figure: PS #1, PS #2 and PS #3 (each 1.1V @ 50A) share a 100 A load and report their output currents to the service system
Reported currents: PS #1 I = 33 A, PS #2 I = 33 A, PS #3 I = 60 A (126 A in total, for a 100 A load)
Total output current must be monitored to prevent (or minimize) smoke during load faults
A defective current sensor is likely here: a real group overcurrent would still show balanced regulator output currents
This group does not need to be shut down; only PS #3 does
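The discrimination rule on the slide (a real group overcurrent still shows balanced currents) can be sketched as below. The 110 A group limit and the 20% balance tolerance are hypothetical, and the median-based outlier test assumes at least three supplies in the group.

```python
from statistics import median


def group_overcurrent_action(currents, group_limit_a, balance_tol=0.2):
    """Decide how to respond when reported supply currents look too high.

    currents: reported output current of each supply in the group (A).
    Returns ("ok", []), ("isolate", [indices]) or ("shut_down_group", []).
    """
    total = sum(currents)
    if total <= group_limit_a:
        return ("ok", [])
    mid = median(currents)
    # A real load fault raises every supply's current together, so the
    # readings stay balanced; a single far-off reading points to one supply
    # or its current sensor instead.
    outliers = [i for i, c in enumerate(currents)
                if abs(c - mid) > balance_tol * mid]
    if outliers:
        return ("isolate", outliers)
    return ("shut_down_group", [])


# Reported currents from the slide; the 110 A limit is HYPOTHETICAL.
print(group_overcurrent_action([33, 33, 60], group_limit_a=110))
```

With the slide's reported currents the 126 A sum exceeds the hypothetical limit, but PS #3 is the only outlier, so only it is flagged.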
Input Faults
Figure: PS #1, PS #2 and PS #3 fed through a shared branch circuit protector; VRM #1 and VRM #2 fed from both DC buses (Bus Converter #1 and Bus Converter #2)
The shared branch circuit protector is in the critical path
Selective coordination can be nontrivial
Circuit protection of similar technology in the PS and in the branch protector will generally not coordinate
This applies to distributed DC buses also
A short on a VRM input can bring down both buses
Solid-state protection circuits are practical
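The coordination problem can be illustrated with simplified time-current curves. The curve shape, ratings and constants below are hypothetical: each device has an inverse-time region and an instantaneous region starting at 10× its rating. Two devices of similar technology both reach their instantaneous region at high fault current and trip together; a downstream device with a much faster (solid-state) clearing time restores selectivity. The solid-state device is modeled only by its clearing time, not by any current limiting.

```python
import math
from dataclasses import dataclass


@dataclass
class Protector:
    rating_a: float
    k_s: float = 100.0           # HYPOTHETICAL inverse-time constant, seconds
    inst_multiple: float = 10.0  # instantaneous trip at this multiple of rating
    inst_time_s: float = 0.01    # clearing time in the instantaneous region


def trip_time_s(p: Protector, i_a: float) -> float:
    """Simplified time-current curve (HYPOTHETICAL shape)."""
    if i_a <= p.rating_a:
        return math.inf                      # never trips
    if i_a >= p.inst_multiple * p.rating_a:
        return p.inst_time_s                 # instantaneous region
    return p.k_s * (p.rating_a / i_a) ** 2   # inverse-time region


def selective(down: Protector, up: Protector, fault_a: float) -> bool:
    """Selective if the downstream device clears strictly first."""
    return trip_time_s(down, fault_a) < trip_time_s(up, fault_a)


ps_breaker = Protector(20.0)                     # protection inside the PS
branch = Protector(30.0)                         # shared branch protector
solid_state = Protector(20.0, inst_time_s=1e-5)  # fast solid-state PS protector

for fault in (100.0, 1000.0):
    print(fault, selective(ps_breaker, branch, fault),
          selective(solid_state, branch, fault))
```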
Design Practices That Can Help
FMEA analysis at system level
Testing with real hardware
Service modeling by service personnel
Design reviews of microcode
“Failure of Imagination”
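A system-level FMEA is often ranked by risk priority number (RPN = severity × occurrence × detection). The sketch lists failure modes named in this deck with hypothetical placeholder scores; with these scores the latent, hard-to-detect faults rise to the top. An FMEA only ranks the modes someone thought to list, which is the “failure of imagination” risk.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    item: str
    mode: str
    severity: int    # 1-10
    occurrence: int  # 1-10
    detection: int   # 1-10, 10 = hardest to detect

    @property
    def rpn(self) -> int:
        """Risk priority number: severity x occurrence x detection."""
        return self.severity * self.occurrence * self.detection


# Failure modes named in this deck; the SCORES are HYPOTHETICAL placeholders.
modes = [
    FailureMode("Power supply", "shorted output, no ORing device", 10, 3, 2),
    FailureMode("ORing FET", "shorted (latent)", 8, 3, 9),
    FailureMode("Remote sense", "open lead with shared OV sense", 9, 2, 8),
    FailureMode("Current sensor", "reads high", 6, 3, 4),
    FailureMode("VRM", "shorted input on shared buses", 10, 2, 3),
]

for m in sorted(modes, key=lambda m: m.rpn, reverse=True):
    print(f"{m.rpn:4d}  {m.item}: {m.mode}")
```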