Is your data center on the verge of a crisis?
DESCRIPTION
What are the symptoms of a poorly managed data center facility? How close are you to an operating failure or catastrophic downtime event? Learn how to spot the warning signs and start improving your facility management program immediately to minimize the risk of downtime, reduce costs, and upgrade your operations.
TRANSCRIPT
© 2014 Uptime Institute
Is your data center on the verge of a crisis?
Julian Kudritzki, Chief Operating Officer
Uptime Institute
What Defines a Crisis?
2
Tour of Operational Computer Room
3
Looking for Clues
4
Tour of ‘Live’ Critical Spaces
5
Daily Practices Compromise Uptime, Safety, and Security
6
• Overtime hours exceeding 10%
• Voice mail boxes full
• Emails not responded to
• Email inbox size limit exceeded
• Meetings missed or routinely cancelled
• No time for training
• Shortage of qualified staff
• Personnel performing work outside their competency
• Everything is an emergency
• Personnel turnover
What Else Is Going On?
7
• Break/fix budget exceeded
• Maintenance budget exceeded
• Energy cost estimate exceeded or unknown
• Last-minute deployment requirements
• No organization chart
• No responsibilities matrix
• No records of maintenance activities
• No written policies & procedures
• No preventive maintenance schedule
• Back of the server looks like a spaghetti pot exploded
The Issues Add Up
8
• Cabling is not labeled or, worse, incorrectly labeled
• Equipment is not uniquely labeled
• Loads are consistently out of balance
• Capacities are not managed or tracked
• Deferred maintenance exceeds 10%
• Housekeeping: if it looks like a mess, it is a mess
Maybe you don’t have a crisis, but how do you know how well your data center operation compares to the rest of the industry?
The Issues Add Up
9
Are you confident in your Facilities team’s capability to manage a technologically advanced and highly efficient design to your 24 x 7 uptime requirements?
• Can you easily replace any member of that team?
• Are you protected against poor operations practices migrating from older sites to higher criticality data centers?
• Do you have sites that operate in isolation, ignoring global corporate standards?
• Do you even have corporate global standards?
• If you outsource any aspect of your data center operations, how do you avoid losing responsibility and accountability?
• Do you manage an outsourcing contract... or direct an expert team?
Ask the Tough Questions
10
• Initial review
• Gap analysis against industry best practices
  § Staffing and Organization
  § Maintenance
  § Training
  § Planning, Coordination & Management
  § Operating Conditions
• Roadmap to operational excellence
• Plan changes
• Implement changes
• Monitor & refine
• Annual review
Path to Data Center Operations Success
11
Key Elements of Facilities Management
Staffing and Organization
• Staffing
• Qualifications
• Organization
Maintenance
• Preventive Maintenance (PM) Program
• Housekeeping Policies
• Maintenance Management System (MMS)
• Vendor Support
• Deferred Maintenance Program
• Predictive Maintenance
• Life-Cycle Planning
• Failure Analysis Program
12
Key Elements of Facilities Management
Training
• Data Center Staff
• Vendors
Planning, Coordination, and Management
• Site Policies
• Financial Management
• Reference Library
• Computer Room Management
Operating Conditions
• Load Management
• Operating Set Points
• Alternating Use of Infrastructure Equipment
13
Over the years, Uptime Institute has observed that management issues pose the largest risk to the uptime of physical infrastructure:
• Inadequate staffing
• Ineffective or nonexistent maintenance and training programs
• Lacking processes and procedures
• Resulting in the majority of outages being caused by ‘human error’
No standard existed to help Owners/Operators determine:
• Common language/vocabulary of data center operations
• Focus of data center management
• Resource allocation
• Resource requirements
Genesis of Industry Best Practices
14
Data Center Owners / Operators / End Users
• Increased availability and cost savings
• Multi-site consistency
• Benchmark for continuous monitoring and refinement
Colocation / Managed Services Sites
• All of the above plus…
• Customer assurance of consistency
• Competitive differentiator (attain & retain certification)
Industry Benchmark
• No need to rely on opinions and anecdotes
Value of Industry Best Practices
15
Uptime Institute has been conducting Operational Sustainability Reviews for approximately 3 years, based upon decades of site operations knowledge and experience:
• Operational Sustainability Certifications: Tier + Gold, Silver, or Bronze
• Management & Operations (M&O) Stamps of Approval
See http://uptimeinstitute.com/publications for Tier Standard: Operational Sustainability
Best Practices Reviews
16
Staffing
• Inadequate staffing
• Excessive overtime (over 10%)
• No escalation process
Qualification
• No list of required qualifications
• No experience with data center specific equipment
Organization
• Roles and Responsibilities not documented
• Data center organization not integrated
Staffing and Organization Significant Findings
17
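The 10% overtime threshold in the staffing findings can be checked with simple arithmetic. The sketch below is illustrative only: the slides do not say what the percentage is measured against, so it assumes overtime is compared with regular scheduled hours, and the function names are hypothetical.

```python
def overtime_percent(regular_hours: float, overtime_hours: float) -> float:
    """Overtime as a percentage of regular scheduled hours (assumed basis)."""
    if regular_hours <= 0:
        raise ValueError("regular_hours must be positive")
    return 100.0 * overtime_hours / regular_hours


def exceeds_overtime_limit(regular_hours: float,
                           overtime_hours: float,
                           limit_percent: float = 10.0) -> bool:
    """True when overtime is above the limit (10% by default, per the slide)."""
    return overtime_percent(regular_hours, overtime_hours) > limit_percent


if __name__ == "__main__":
    # Example: 160 regular hours and 20 overtime hours in a month.
    print(overtime_percent(160, 20))
    print(exceeds_overtime_limit(160, 20))
```

Tracking this figure per person as well as per team may help, since a team average can hide one or two individuals carrying most of the overtime.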
Preventive Maintenance (PM)
• No list of required PM activities
• PM activities not fully scripted
• No quality control process
Housekeeping
• Combustibles in the data center
• No documented housekeeping policy
Maintenance Management System (MMS)
• No list of equipment
• Missing critical data: warranty info, maintenance history, performance data, etc.
Maintenance Significant Findings
18
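The MMS findings above (no equipment list, missing warranty info, maintenance history, and performance data) suggest a simple completeness check on each equipment record. The following is a minimal sketch under stated assumptions: the `EquipmentRecord` fields and the `missing_critical_data` helper are hypothetical and do not represent any particular MMS product.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EquipmentRecord:
    """Hypothetical MMS entry for one piece of infrastructure equipment."""
    equipment_id: str
    description: str
    warranty_expires: Optional[str] = None  # ISO date string, e.g. "2016-01-31"
    maintenance_history: List[str] = field(default_factory=list)
    performance_data: List[float] = field(default_factory=list)


def missing_critical_data(record: EquipmentRecord) -> List[str]:
    """Return the names of the critical fields that are empty on this record."""
    missing = []
    if not record.warranty_expires:
        missing.append("warranty_expires")
    if not record.maintenance_history:
        missing.append("maintenance_history")
    if not record.performance_data:
        missing.append("performance_data")
    return missing


if __name__ == "__main__":
    ups = EquipmentRecord("UPS-A", "UPS module", warranty_expires="2016-01-31")
    print(missing_critical_data(ups))
```

Running a check like this across the whole equipment list gives a quick view of how much of the MMS is usable for life-cycle planning.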
Vendor Support
• Contracts missing response times, call-in process, detailed SOW, or technician qualifications
Deferred Maintenance
• Unable to produce deferred maintenance report from MMS
Predictive Maintenance
• No predictive maintenance program
• Not comparing current results with previous results
Maintenance Significant Findings
19
Life-Cycle Planning
• No life-cycle plan
• Not using MMS data to develop plan
Failure Analysis
• No record of outages or near misses
Maintenance Significant Findings
20
Data Center Staff
• Undocumented On-the-Job Training (OJT) programs
• No formal qualification program
• No list of training required by position
• No formal training program with lesson plans, etc.
Vendors
• No briefing for escorted vendors
Training Significant Findings
21
Load Management
• Alarm settings not documented
• Alarms not set on PDUs to ensure maximum loads are not exceeded
Operating Set Points
• Cooling set points are not documented or part of the Change Management Process
• Changing of set points is not controlled
Operating Conditions Significant Findings
22
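The load management finding about PDU alarms can be illustrated with a threshold check. This is a sketch only: the 80% warning level is an assumed example value, not a figure from the slides, and `pdu_alarm_state` is a hypothetical helper rather than part of any monitoring product.

```python
def pdu_alarm_state(load_amps: float,
                    rated_amps: float,
                    warning_fraction: float = 0.8) -> str:
    """Classify a PDU load reading against its rated capacity.

    warning_fraction is an assumed site-specific setting; whatever value is
    chosen should be documented, which is the point of the finding above.
    """
    if rated_amps <= 0:
        raise ValueError("rated_amps must be positive")
    if not 0 < warning_fraction < 1:
        raise ValueError("warning_fraction must be between 0 and 1")
    if load_amps >= rated_amps:
        return "critical"  # at or above the maximum load
    if load_amps >= warning_fraction * rated_amps:
        return "warning"   # approaching the maximum load
    return "normal"


if __name__ == "__main__":
    for amps in (10, 25, 30):
        print(amps, pdu_alarm_state(amps, rated_amps=30))
```

Keeping the threshold as an explicit, named parameter makes it easy to record in site documentation and to change only through a controlled process.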
Site Policies
• Missing Site Policies
• Especially Site Configuration Policy
Reference Library
• No process for keeping documents up-to-date
Capacity Management
• No process for forecasting future space, power, and cooling requirements
• No active tracking of cooling capacity
• Ineffective management of Cold Aisles/Hot Aisles
• Electrical power monitoring (balancing phases)
Planning, Coordination, and Management Significant Findings
23
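Phase balancing, listed under the capacity management findings, can be monitored with a simple imbalance figure. The slides do not define how imbalance is measured, so this sketch assumes one common convention: the largest deviation of any phase current from the three-phase average, expressed as a percentage of that average. The function name is hypothetical.

```python
def phase_imbalance_percent(phase_a: float, phase_b: float, phase_c: float) -> float:
    """Max deviation from the average phase current, as a percent of the average.

    This is an assumed convention for illustration; a site may define
    imbalance differently in its own policies.
    """
    currents = (phase_a, phase_b, phase_c)
    average = sum(currents) / 3
    if average <= 0:
        raise ValueError("average phase current must be positive")
    max_deviation = max(abs(current - average) for current in currents)
    return 100.0 * max_deviation / average


if __name__ == "__main__":
    # Example readings in amps for phases A, B, and C.
    print(phase_imbalance_percent(90, 100, 110))
```

Recording this figure over time, rather than checking it once, is what reveals loads that are consistently out of balance.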
Facilities
• Operate and maintain the critical facility infrastructure
• Support the installation of IT equipment (space, power, & cooling)
IT Management
• Operate and maintain IT hardware, software, applications, and network connectivity
• Manage the installation/de-installation of IT equipment
Security
• Access Control
• Physical Security
Typical Data Center Disciplines
24
Functionally Separate Organization
• Corporate Real Estate (Facilities)
• IT
• Security
Communication between organizations was typically poor
• Data center activities conducted without coordination
• Poor future space, power, and cooling planning
No individual responsible for all aspects of operating a data center
Past Organizational Structures
25
Factors driving changes to organizational structure
• Rapid changes in technology and speed at which capacity must be brought online
• Increased costs associated with IT and Facilities
• Business objectives of continuous computing availability
Legacy organizations could not accommodate quickly evolving business requirements
• Slow to respond
• Not integrated
Evolving Organizational Structure
26
The value of industry best practices is in the process of continuous improvement
• Discovery leads to learning
• Learning leads to change
• Change leads to improvement
• Regular reviews lead to discovery
• Crises can be avoided
Summary
27
For more information contact: Julian Kudritzki
[email protected] 206.706.4143
Questions?
28