What To Do When It All Goes So Wrong
DESCRIPTION
As IT professionals, we will inevitably face situations where everything goes wrong. Sometimes we are lucky and it only means diminished functionality or a slow system. Other times our organization is temporarily out of business. Regardless of the scope of the issue, how we react has a direct impact on how quickly things return to normal. This session covers how to communicate issues: what to say, who to say it to, and when to say it. Part of managing communication is getting everyone into a room and forcing them to talk, so time will be spent on designing an effective war room. The session also covers how setting out to prove that an issue is ours lets us get to a root cause more quickly.
TRANSCRIPT
More than 11 years in IT
SQL Server DBA for over 3 years
Previous Life as Developer
Blogger
◦ http://adventuresinsql.com
◦ Syndicated on SQLServerCentral.com
◦ Syndicated on SQLServerPedia.com
@dave_levy on Twitter
Peak Time of Peak Sales Day
Typical Hourly Sales $100K/HR
Order Entry Screen is Locked Up
Users report Slowness Initially
Now the “Sales Center” Application is Just “Clocking”
Let Everyone Know There is a Problem
◦ Prevent Duplicated Efforts
◦ Allows Others to Speak Up
Recent Changes
Related Issues
http://www.freedigitalphotos.net/images/view_photog.php?photogid=1983
Send Up a Flare
◦ Send to an IT Only Distribution Group
◦ Keep the Subject Line General
◦ Provide Broad Overview Including:
Systems Impacted
Major Symptoms Including Error Messages
Number of People Impacted
Any Location Specific Information
To: IT Emergencies
Subject: Sales Center Issues
Sales Center Users are reporting that the Order Entry screen has quit responding. We are currently investigating the issue with the Sales Center Development Team. We will provide updates as we know more.
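The flare checklist above can be turned into a small template so nothing is forgotten under pressure. The following is a minimal sketch, assuming Python's standard `email` package; the `Flare` class, its field names, and the `it-emergencies@example.com` address are hypothetical and not part of the talk. It only builds the message and does not send it.

```python
from dataclasses import dataclass
from email.message import EmailMessage
from typing import List


@dataclass
class Flare:
    """Details for the first alert. Field names are illustrative assumptions."""
    system: str            # e.g. "Sales Center"
    symptoms: List[str]    # major symptoms, including error messages
    people_impacted: str   # rough count or description
    location_info: str = "No location-specific pattern identified yet"
    status: str = "We will provide updates as we know more."


def build_flare(flare: Flare, to_addr: str = "it-emergencies@example.com") -> EmailMessage:
    """Build (but do not send) the alert for an IT-only distribution group."""
    msg = EmailMessage()
    msg["To"] = to_addr  # hypothetical address for the IT-only group
    msg["Subject"] = f"{flare.system} Issues"  # keep the subject line general
    lines = [
        f"Systems impacted: {flare.system}",
        "Major symptoms:",
        *[f"  - {symptom}" for symptom in flare.symptoms],
        f"People impacted: {flare.people_impacted}",
        f"Location: {flare.location_info}",
        "",
        flare.status,
    ]
    msg.set_content("\n".join(lines))
    return msg


if __name__ == "__main__":
    print(build_flare(Flare(
        system="Sales Center",
        symptoms=["Order Entry screen has quit responding"],
        people_impacted="All Sales Center users",  # made-up value for the demo
    )))
```

Keeping the builder separate from any sending code means the same message can be reviewed before it goes out.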
What Systems are Involved?
◦ SQL Server
◦ AS400
◦ Mainframe
◦ Web Farm
◦ Major Network Components like Load Balancers
Analyze Collected Information
◦ Are There Any Obvious Signs of Trouble?
◦ Can the Problem be Linked to a Change?
◦ Can Any Patterns be Identified?
Prove It Is Your Issue
◦ Shows Humility
◦ Shows Respect for Everyone Else’s Time
◦ Avoid Appearing Arrogant
Prove It Is Your Issue
◦ Construct Tests to Prove Theories in Order of Likelihood Until Problem Proven or Theories Exhausted
Faster than arguing about what it is not
How can you know it is not your issue?
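Testing theories in order of likelihood until one is proven or all are exhausted can be sketched as a simple loop. This is a minimal illustration, not a prescribed tool: the `Theory` class, the likelihood numbers, and the lambda tests are hypothetical stand-ins for real diagnostic checks.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Theory:
    description: str
    likelihood: float         # rough 0..1 guess, an assumption made by the team
    test: Callable[[], bool]  # returns True when the test proves the theory


def find_proven_theory(theories: List[Theory]) -> Tuple[Optional[Theory], List[str]]:
    """Run tests from most to least likely; stop at the first proven theory.

    Returns the proven theory (or None if every theory was exhausted) and the
    descriptions of the theories that were ruled out along the way.
    """
    ruled_out: List[str] = []
    for theory in sorted(theories, key=lambda t: t.likelihood, reverse=True):
        if theory.test():
            return theory, ruled_out
        ruled_out.append(theory.description)
    return None, ruled_out


if __name__ == "__main__":
    # The lambdas stand in for real checks and return fixed, made-up results.
    theories = [
        Theory("Load balancer dropped a node", 0.1, lambda: False),
        Theory("Blocking in the order database", 0.6, lambda: False),
        Theory("Recent change broke a driver", 0.3, lambda: True),
    ]
    proven, ruled_out = find_proven_theory(theories)
    print("Proven:", proven.description if proven else "nothing yet")
    print("Ruled out:", ruled_out)
```

The ruled-out list is worth keeping: it is the evidence that answers "How can you know it is not your issue?"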
List Potential Actions
◦ Rank by effort, confidence, level of risk
◦ Develop action plans for best options and re-rank
◦ Each potential action should have a rollback plan
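One way to make the ranking explicit is to score each candidate and sort. The sketch below is only an illustration: the 1 to 5 scales and the sort order (highest confidence first, then lowest risk, then lowest effort) are assumptions, since the talk names the three criteria but not how to weigh them. It also refuses to rank any action that lacks a rollback plan.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Action:
    name: str
    effort: int      # 1 (low) to 5 (high); the scale is an assumption
    confidence: int  # 1 (low) to 5 (high)
    risk: int        # 1 (low) to 5 (high)
    rollback_plan: str = ""


def rank_actions(actions: List[Action]) -> List[Action]:
    """Return actions ordered best-first; every action must have a rollback plan."""
    missing = [a.name for a in actions if not a.rollback_plan.strip()]
    if missing:
        raise ValueError(f"No rollback plan for: {', '.join(missing)}")
    # Assumed ordering: most confident first, then least risky, then least effort.
    return sorted(actions, key=lambda a: (-a.confidence, a.risk, a.effort))


if __name__ == "__main__":
    # All names and scores below are made up for the demo.
    candidates = [
        Action("Add index", effort=2, confidence=4, risk=2,
               rollback_plan="Drop the index"),
        Action("Back out weekend patch", effort=3, confidence=4, risk=1,
               rollback_plan="Reapply the patch"),
        Action("Restart app servers", effort=1, confidence=2, risk=3,
               rollback_plan="Nothing to undo; record the restart time"),
    ]
    for position, action in enumerate(rank_actions(candidates), start=1):
        print(position, action.name)
```

Re-ranking after action plans are written is just a matter of updating the scores and calling `rank_actions` again.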
Define Measures
◦ What will indicate things have gotten better?
Adding this index will reduce Disk IO by 10 million reads per second
The execution time of query x will drop from 6 minutes to 50 milliseconds
Define Measures
◦ What will indicate things have gotten worse?
Disk IO may go up
The execution time of query x may go up
Adding this index may slow inserts from the order upload process
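Writing the measures down before the change makes the later "better or worse" call mechanical. Here is a minimal sketch, assuming a lower-is-better metric such as execution time; the `Measure` class and the three-way verdict are hypothetical, and the numbers reuse the "query x" example above (6 minutes is 360,000 ms).

```python
from dataclasses import dataclass


@dataclass
class Measure:
    """A lower-is-better metric recorded before the change is made."""
    name: str
    baseline: float  # value observed before the change
    target: float    # value that would indicate things have gotten better


def judge(measure: Measure, observed: float) -> str:
    """Classify an observation taken after the change."""
    if observed <= measure.target:
        return "better"
    if observed > measure.baseline:
        return "worse"  # a signal to use the rollback plan
    return "inconclusive"


if __name__ == "__main__":
    query_x = Measure("query x execution time (ms)", baseline=360_000, target=50)
    for observed_ms in (45, 10_000, 400_000):  # made-up observations
        print(observed_ms, "->", judge(query_x, observed_ms))
```

A metric where higher is better (for example, orders processed per minute) would need the comparisons reversed.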
Communicate Your Intentions
Make the Change
◦ Follow a written plan
◦ Make a single change
◦ A single person should make the change
◦ Document any additional steps taken
Start Over by Collecting More Data
Signs You Need to Convene A War Room
◦ Having Trouble Finding Anything Wrong
◦ 30 Minutes Without Progress
◦ An Issue Appears to Span Multiple Systems
◦ Having Difficulty Getting People Engaged
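The four signs above amount to a short checklist that can be evaluated at any point during an incident. The sketch below is an illustration only; the function name and parameters are hypothetical, and the 30-minute limit is the one figure taken from the list.

```python
from datetime import timedelta
from typing import List

NO_PROGRESS_LIMIT = timedelta(minutes=30)  # from the checklist above


def war_room_reasons(
    found_anything_wrong: bool,
    time_without_progress: timedelta,
    systems_involved: int,
    people_engaged: bool,
) -> List[str]:
    """Return the signs that currently apply; any non-empty result suggests a war room."""
    reasons: List[str] = []
    if not found_anything_wrong:
        reasons.append("Having trouble finding anything wrong")
    if time_without_progress >= NO_PROGRESS_LIMIT:
        reasons.append("30 minutes without progress")
    if systems_involved > 1:
        reasons.append("Issue appears to span multiple systems")
    if not people_engaged:
        reasons.append("Having difficulty getting people engaged")
    return reasons


if __name__ == "__main__":
    # Made-up incident state for the demo.
    for reason in war_room_reasons(
        found_anything_wrong=True,
        time_without_progress=timedelta(minutes=40),
        systems_involved=3,
        people_engaged=True,
    ):
        print(reason)
```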
Get Everyone in a Room
No Changes Made Outside the Room
No Heroes
◦ Watch out for people doing a lot of typing
◦ Avoid changes that take more than a few minutes
Have a Call-In Number for Remote Coworkers
Monitor Your Guest List
◦ 1-2 Representatives From Each Team
◦ Try to Keep Management Out
◦ Watch for Disruptive People
To: IT Emergencies
Subject: Sales Center Issues
We are convening a war room for the Sales Center issue. Everyone working on the issue, please meet in the North Conference Room. Remote/WFH coworkers should dial into the conference bridge at 888-888-1234, participant code: 1234.
White Board the Issue
◦ Every System Gets Own Column
◦ Write All Facts on White Board
◦ Closed Items Get Crossed Out Not Erased
◦ Include a Resolution for Each Closed Item
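For remote participants, or for a record after the fact, the whiteboard rules can be mirrored in a small data structure. This is a minimal sketch under assumptions: the `Whiteboard` and `Item` classes and the text rendering are hypothetical. It keeps one column per system, never deletes an item, and requires a resolution to close one.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Item:
    text: str
    resolution: Optional[str] = None  # set when the item is closed


class Whiteboard:
    def __init__(self, systems: List[str]) -> None:
        # Every system gets its own column.
        self.columns: Dict[str, List[Item]] = {name: [] for name in systems}

    def add_fact(self, system: str, text: str) -> None:
        self.columns[system].append(Item(text))

    def close(self, system: str, text: str, resolution: str) -> None:
        """Cross an item out (it stays on the board) and record its resolution."""
        if not resolution.strip():
            raise ValueError("a closed item needs a resolution")
        for item in self.columns[system]:
            if item.text == text and item.resolution is None:
                item.resolution = resolution
                return
        raise KeyError(f"no open item {text!r} under {system}")

    def render(self) -> str:
        lines: List[str] = []
        for system, items in self.columns.items():
            lines.append(system)
            for item in items:
                if item.resolution is None:
                    lines.append(f"  [ ] {item.text}")
                else:
                    lines.append(f"  [x] {item.text} (resolved: {item.resolution})")
        return "\n".join(lines)


if __name__ == "__main__":
    board = Whiteboard(["SQL Server", "Web Farm"])
    # The facts below are made up for the demo.
    board.add_fact("SQL Server", "No blocking seen")
    board.add_fact("Web Farm", "One node slow to respond")
    board.close("Web Farm", "One node slow to respond", "Node restarted, no change")
    print(board.render())
```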
Share the Floor
◦ Likely Issue Owner Has the Lead
◦ Make Sure Everyone is Heard
◦ Contributing Often Involves Staying Out of the Way
◦ Don’t Be Afraid to Fade Back and Run The Whiteboard
Keep an Eye On Time
◦ Provide Regular Updates to Management
◦ Bring in Food Around Meal Times
Raises Spirits
Brings in More People to Help
To: IT Emergencies
Subject: Sales Center Issues Update
The Sales Center war room is still going. We are currently looking into a driver issue with IBM. All necessary resources have been engaged.
Keep People in Reserve
◦ Each Team Should Divide up the Day
◦ Rotate People In and Out
◦ Send Someone Home Early to Come in Early
To: IT Emergencies
Subject: Sales Center Issues Resolved
The Sales Center issue has been resolved. The issue was caused by a patch that was applied over the weekend. Now that it has been backed out everything has returned to normal.