data center downtime feb2011
The Truth and Consequences of Data Center Downtime
© 2011 Emerson Network Power
© 2010 Emerson Network Power
Emerson Network Power: The global leader in enabling Business-Critical Continuity
Automatic Transfer Switch
Paralleling Switchgear
Uninterruptible Power Supplies & Batteries
Fire Pump Controller
Surge Protection
Extreme-Density Precision Cooling
Perimeter Precision Cooling
Power Distribution Units
Data Center Infrastructure Management
Integrated Racks
Cooling
Rack
Rack Power Distribution Unit
KVM Switch
UPS
Monitoring
Cold Aisle Containment
Row Based Precision Cooling
Emerson Network Power: An organization with established customers
• Emerson Network Power overview
• National Survey on Data Center Downtime: Frequency, Duration and Cost, Dr. Larry Ponemon, Founder and President, Ponemon Institute
• Preventing the Most Common Causes of Downtime: Root Cause Analysis, Best Practice Prevention and Technology, Peter Panfil, Vice President and General Manager, Liebert North America AC Power, Emerson Network Power
• Question and Answer session
Presentation topics
National Survey on Data Center Downtime: Frequency, Duration and Cost
Dr. Larry Ponemon, Founder and President, Ponemon Institute
• The Institute is dedicated to advancing responsible information management practices that positively affect privacy, data protection and information security in business and government
• The Institute conducts independent research, educates leaders from the private and public sectors and verifies the privacy and data protection practices of organizations
• The Institute is a member of the Council of American Survey Research Organizations (CASRO), and Dr. Ponemon serves as chairman of the Government and Public Affairs Committee of the CASRO Board
• The Institute has assembled more than 50 leading multinational corporations into the RIM Council, which focuses on the development and execution of ethical principles for the collection and use of personal data about people and households
About the Ponemon Institute
• Purpose: Determine the frequency and cost of unplanned data center outages
• Study 1: 453 individuals in U.S. organizations who have responsibility for data center operations
– Perceptions about data center criticality, availability and outages
– Perception differences between executives and associates
• Study 2: Develop an activity-based costing model, derived from actual meetings or site visits, for 41 data centers that experienced a complete or partial unplanned data center outage, to capture both direct and indirect costs related to:
– Damage to mission-critical data
– Impact of downtime on organizational productivity
– Damages to equipment and other assets
– Cost to detect and remediate systems and core business processes
– Legal and regulatory impact, including litigation defense cost
– Lost confidence and trust among key stakeholders
About the studies
Perceptions about data center availability
Agree: Combines strongly agree and agree responses
Disagree: Combines strongly disagree, disagree and unsure responses
Perception differences between senior management and operators
Supervisor and below
Director and above
Experience with unplanned data center outages
Experienced one or more unplanned data center outages over the past 24 months
Frequency of unplanned data center outages over the past 24 months
Total data center outage: Entire facility is down
Partial outage: Limited to individual rows and racks
Device-level outage: Individual servers and IT units
Extrapolated duration of data center outages in minutes
Total data center outage: Entire facility is down
Partial outage: Limited to individual rows and racks
Device-level outage: Individual servers and IT units
Extrapolated frequency of complete data center outages by square footage
[Chart axes: Frequency, Duration]
Extrapolated frequency of complete data center outages by industry
Extrapolated frequency of unplanned outages over two years
Study 2: Activity-based cost framework for the cost of data center outages
Interviewed and audited 41 data center managers who experienced an unplanned outage
Cost loadings from ABC Framework
Cost activity centers        Direct cost  Indirect cost  Opportunity cost  Total
Detection                    52%          48%            0%                100%
Equipment cost               60%          40%            0%                100%
IT productivity loss         23%          77%            0%                100%
End-user productivity loss   22%          78%            0%                100%
Third parties                35%          41%            24%               100%
Recovery                     22%          78%            0%                100%
Ex-post response             53%          47%            0%                100%
Lost revenue                 33%          26%            41%               100%
Business disruption          24%          30%            45%               100%
Average contribution         36%          52%            12%
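The "Average contribution" row appears to be the simple, unweighted mean of each column across the nine cost activity centers. A minimal Python sketch that recomputes it from the table values follows; the variable and function names are illustrative and not part of the study.

```python
# Cost loadings (percent) copied from the ABC framework table:
# activity center -> (direct, indirect, opportunity)
COST_LOADINGS = {
    "Detection": (52, 48, 0),
    "Equipment cost": (60, 40, 0),
    "IT productivity loss": (23, 77, 0),
    "End-user productivity loss": (22, 78, 0),
    "Third parties": (35, 41, 24),
    "Recovery": (22, 78, 0),
    "Ex-post response": (53, 47, 0),
    "Lost revenue": (33, 26, 41),
    "Business disruption": (24, 30, 45),
}


def average_contribution(loadings):
    """Unweighted mean of each cost type across all activity centers,
    rounded to a whole percent."""
    n = len(loadings)
    columns = zip(*loadings.values())  # regroup rows into three columns
    return tuple(round(sum(col) / n) for col in columns)


direct, indirect, opportunity = average_contribution(COST_LOADINGS)
print(f"Direct {direct}%, Indirect {indirect}%, Opportunity {opportunity}%")
# prints: Direct 36%, Indirect 52%, Opportunity 12%
```

The result matches the table's last row, which suggests each activity center carries equal weight in the average rather than being weighted by its dollar cost.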
Interviewed and audited 41 data center managers who experienced an unplanned outage
Average cost by category
Results shown are derived from the analysis of 41 data centers located in the United States
Total cost by industry sector
The average duration of the outage for the 41 data centers was 102 minutes
Total cost for partial and total shutdown
Results shown are derived from the analysis of 41 data centers located in the United States
Preventing the Most Common Causes of Downtime: Root Cause Analysis, Best Practice Prevention and Technology
Peter Panfil, Vice President and General Manager, Liebert North America AC Power, Emerson Network Power
Were the unplanned outages during the 24 months preventable?
Total cost by industry sector
Data centers experienced multiple outages during the 24-month period surveyed
• 65% of outages caused by battery failure
• Service life of a battery varies, dependent on:
– Frequency of usage
– Ambient temperatures
– Quality of connections and terminals
• The weakest link in critical power
#1: Battery failure
How?
A single bad cell among thousands can take down a facility
Batteries have a limited life expectancy
False confidence; no indication of problems until needed
#1: Battery failure
Best Practice: Preventive Maintenance
• Service contracts for inspections and testing
– Monthly, quarterly and annual actions need to be taken
#1: Battery failure
Best Practice: Real-Time Monitoring
• Measure the internal DC resistance of all battery cells
• Combination of hardware and software
– Alarm management via email and SMS
– Measures the reliability of the entire battery
• Strap
• Inter-tier connections
• Plates
• Battery connection posts/terminals
• Proactively identify and replace bad batteries
White Paper: Implementing Proactive Battery Management Strategies to Protect Your Critical Power System
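The monitoring idea on this slide can be sketched as a comparison of each cell's measured internal resistance against a baseline, raising an alarm when the rise exceeds a threshold. The sketch below is a minimal illustration only. The 25% threshold, the baseline value, the readings and all names are hypothetical assumptions, not figures from the presentation or from any monitoring product.

```python
def flag_suspect_cells(readings_micro_ohm, baseline_micro_ohm, max_rise=0.25):
    """Return (cell_id, relative_rise) for cells whose internal resistance
    exceeds the baseline by more than max_rise.

    max_rise=0.25 (25%) is an assumed threshold for illustration; a real
    limit would come from the battery or monitoring vendor.
    """
    suspects = []
    for cell_id, resistance in sorted(readings_micro_ohm.items()):
        rise = (resistance - baseline_micro_ohm) / baseline_micro_ohm
        if rise > max_rise:
            suspects.append((cell_id, round(rise, 2)))
    return suspects


# Hypothetical readings for a four-cell string, in micro-ohms.
readings = {"cell-01": 3100, "cell-02": 3050, "cell-03": 4200, "cell-04": 3000}

for cell_id, rise in flag_suspect_cells(readings, baseline_micro_ohm=3000):
    # A real system would send this alarm by email or SMS instead of printing.
    print(f"ALARM {cell_id}: internal resistance {rise:.0%} above baseline")
```

Checking every cell rather than the string as a whole reflects the point made earlier in the deck that a single bad cell can take down a facility.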
IT usage is variable, not static
IT gets added without knowledge of infrastructure impact
Redundant UPS loaded over 50%
Should a UPS or battery failure occur, the remaining UPS cannot support 101% of the load
• 53% of outages caused by lack of UPS capacity
• IT growth outpaces AC Power infrastructure growth
• Disconnect between Facilities and IT
– The owner of the UPS might not be IT
• Battery runtime is also dependent on how much load is being supported
#2: UPS capacity exceeded
How?
#2: UPS capacity exceeded
Best Practice: Additional UPS Cores for capacity and redundancy
• Keep redundant UPS capacities at 30%-40%
– IT load must not exceed the total capacity of a single UPS
– Efficiency of the Liebert NXL optimized at partial loads
• Size the new UPS system on best-case growth
• Real-time capacity monitoring to manage load balancing
• UPS configured in a parallel redundant configuration
Some data centers are willing to trade redundancy for capacity: analyze the costs, risks and benefits
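The capacity rule behind this slide is that, in a redundant system, the load must still fit on the remaining UPS modules after one fails. A minimal sketch of that check follows, assuming a hypothetical pair of 500 kW modules; the ratings, loads and function names are illustrative only.

```python
def survives_single_failure(module_kw, load_kw):
    """True if the load still fits after losing the largest UPS module."""
    remaining_kw = sum(module_kw) - max(module_kw)
    return load_kw <= remaining_kw


def utilization(module_kw, load_kw):
    """Fraction of total installed UPS capacity currently in use."""
    return load_kw / sum(module_kw)


# Hypothetical system: two 500 kW modules operated as a redundant pair.
modules = [500, 500]
for load in (350, 600):
    status = "redundant" if survives_single_failure(modules, load) else "NOT redundant"
    print(f"{load} kW load: {utilization(modules, load):.0%} of capacity, {status}")
```

With two equal modules, redundancy is lost as soon as total utilization passes 50%, which is the failure scenario described for this root cause. Running the check continuously against measured load is one way to apply the real-time capacity monitoring bullet.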
#2: UPS capacity exceeded
• Options for parallel redundant UPS
White Paper: High-Availability Power Systems, Part II: Redundancy Options
[Diagram labels: UPS Core, STS, SS, System Control Cabinet, Paralleling Cabinet, IT Load]
N+1: Centralized static transfer switch
– System-level control, fault tolerant
– Size of STS determines total capacity
1+N: Distributed static switches
– Individual cores manage load transfers
– Cannot parallel different sized UPS
Pushing the EPO thinking it’s a light switch
Improper equipment operation could drop the entire facility
Careless installation of servers damages infrastructure
• 51% of outages caused by user error
• Many people involved in data center operation
– Too many cooks…
– Alarms and control panels everywhere
• 100% preventable
• Most cost-effective root cause to solve
#3: Accidental EPO / Human error
How?
#3: Accidental EPO / Human error
Best Practice: Documentation, Standard Procedures, Training and Remote Monitoring
Shield EPO
Documented Maintenance Procedures
Labeling
One-Lines
Follow Processes; No Short Cuts
Personnel Training
Keep it Clean
No Food or Drink
Escort Visitors
Infrastructure Monitoring
#3: Accidental EPO / Human error
• Best practices for EPO
– A / B EPO in A / B data centers
– Separate EPO from the fire alarm
– Remove local EPO from UPS and PDUs
– Provide physical protection
– Provide maintenance and test features
– Document and label
– Training
• 2011 code changes
– NFPA 70 – 645-10, Disconnecting Means
UPS has components with a finite life; some need to be replaced
UPS repaired with non-OEM parts
Blame the UPS when it’s really the batteries
• 49% of outages caused by UPS failure
• A UPS is only as reliable as its shortest-lived component
– Liebert design philosophy addresses this issue by reducing the number of parts, thus decreasing the chance of a failure
• UPS designed to prevent outages, not cause them
#4: UPS equipment failure
How?
#4: UPS equipment failure
Best Practice: Preventive Maintenance by an experienced technician
• At least two PM visits per year
• OEM technician using OEM parts and calibration
• MTBF for units that received two PM visits per year is 23 times higher than for units with no PM service events
White Paper: The Effect of Regular, Skilled Preventive Maintenance on Critical Power System Reliability
Cooling leaks and chilled water distributed in-row
Repairs to in-row cooling cause chilled water leaks
Server densities are rising, and so is the heat
• 35% of outages caused by water incursion
• 33% of outages are heat-related
• As densities increase, cooling is brought closer to the IT load
– For some in-row cooling products, water is on top of, next to and below critical electrical equipment
– Solving the heat problem, but causing a water problem
#5: Heat- and water-related
How?
#5: Heat- and water-related
Best Practice: Utilize refrigerants, easier maintenance and leak detection monitoring
• R410A and Glycol for row-based units
– Eliminate the need for water in the row
• Monitor for leaks under the floor
• Importance of easy maintenance for row CW units
– Do you need to remove the in-row unit for repair?
Refrigerant-based high density cooling
Front and rear parts access
Point or zone detection
#5: Heat- and water-related
Best Practice: Optimized airflow
• Containment
– Increases cooling capacity and energy efficiency
• Temperature sensors
– Supply and return
– Rack-level
• Utilize temperature data to control and optimize cooling output
– Variable Speed Drives
– Digital Scroll Compressors
White Paper: Combining Cold Aisle Containment with Intelligent Control to Optimize Data Center Cooling Efficiency
#5: Heat- and water-related
• Optimized airflow not only prevents heat-related outages, it improves cooling efficiency
Requires less fan power per kW of cooling
Leverages variable fan speed control
Operates with digital scroll technology for variable capacity control
Up to 33% efficiency gain
Digital Compressor
Variable Speed Fan
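One reason variable fan speed control lowers fan power is the fan affinity law, under which fan power scales roughly with the cube of fan speed. This is a general engineering approximation, not a figure from the presentation, and the speeds below are illustrative. It is also separate from the "up to 33% efficiency gain" quoted on the slide.

```python
def relative_fan_power(speed_fraction):
    """Approximate fan power as a fraction of full-speed power,
    using the affinity law (power proportional to speed cubed)."""
    if not 0.0 <= speed_fraction <= 1.0:
        raise ValueError("speed_fraction must be between 0 and 1")
    return speed_fraction ** 3


for speed in (1.0, 0.9, 0.8, 0.7):
    power = relative_fan_power(speed)
    print(f"{speed:.0%} speed -> about {power:.0%} of full fan power")
```

Under this approximation, slowing a fan to 80% of full speed cuts its power draw to roughly half, which is why matching fan speed to measured temperature can pay off quickly.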
What could be done to prevent unplanned outages in the future?
How to make the case for more resources and budget?
What can be done short-term?
1. Educate your senior leaders on frequency and impact of downtime on your business
– 56% of senior leaders think downtime doesn’t happen often
2. Utilize Cost of Downtime data to justify infrastructure improvements
– Develop a business case or your own ABC model
3. Grab the “low-hanging fruit”
– No cost to ensure IT staff doesn’t bring a Big Gulp onto the server floor
4. Conduct assessments and audits
– Assess batteries, capacity, airflow
– Vendors can help
5. Talk to your infrastructure vendors
– Service contracts, new technology, more best practices
Next steps
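Step 2 suggests developing your own activity-based costing (ABC) model. A minimal sketch of how such a model might total one outage's cost follows, using three of the nine cost activity centers named in the study. Every dollar figure is a hypothetical placeholder, not a study result, and the class and function names are illustrative; the 102-minute duration reuses the average reported for the 41 data centers.

```python
from dataclasses import dataclass


@dataclass
class ActivityCost:
    """Cost of one activity center for a single outage, in dollars."""
    name: str
    direct: float = 0.0
    indirect: float = 0.0
    opportunity: float = 0.0

    @property
    def total(self) -> float:
        return self.direct + self.indirect + self.opportunity


def outage_cost(activities):
    """Total cost of the outage across all activity centers."""
    return sum(a.total for a in activities)


def cost_per_minute(activities, duration_minutes):
    """Average cost per minute of downtime for this outage."""
    if duration_minutes <= 0:
        raise ValueError("duration_minutes must be positive")
    return outage_cost(activities) / duration_minutes


# Hypothetical placeholder figures for one outage.
outage = [
    ActivityCost("Detection", direct=5_000, indirect=4_000),
    ActivityCost("Recovery", direct=3_000, indirect=10_000),
    ActivityCost("Lost revenue", direct=20_000, indirect=15_000, opportunity=25_000),
]

print(f"Total cost: ${outage_cost(outage):,.0f}")
print(f"Cost per minute: ${cost_per_minute(outage, duration_minutes=102):,.2f}")
```

Keeping direct, indirect and opportunity costs separate for each activity center mirrors the study's framework and makes it easier to show senior leaders where the money goes.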
Q & A, further reading
Dr. Larry Ponemon, Founder and President, Ponemon Institute
• National Survey on Data Center Outages
• Coming Soon: Cost of Data Center Outages
Peter Panfil, Vice President and General Manager, Liebert North America AC Power, Emerson Network Power
• Addressing the Leading Root Causes of Downtime