Lea Dit 2010 Td Presentation Au Email
Copyright 2010 © KEPNERandFOURIE™ All rights reserved www.thinkingdimensionsglobal.com
MANAGING INFO IN THE INFORMATION AGE – A CLIENT CASE
MATT FOURIE
THINKING DIMENSIONS
Some of our recent clients:
Barclays IT
Macquarie ITG
Unisys
Woolworths IT
Capita UK
SITA Global
BT Financial
McDonalds IT
• Thinking Dimensions International has been operating KEPNERandFOURIE RCA company initiatives for the last 23 years
• Specialises in RCA for IT, Telecoms & Manufacturing
AGENDA
“Most incident investigators ask the wrong questions, so don’t change your people; change the questions they are asking.”
• Introduction
• Intro Client Case
– Stakeholder commitment
– Managing Information
– Quality of Information
– Investigation support
• Process demonstration
• Client outcomes
• Questions & answers
Investigation Info
“It takes a company without a formal and effective Root Cause Analysis culture up to 3 days to restore service after an incident, but up to 25 days to find the root cause.” (KEPNERandFOURIE 2010)
Client Case situation
International Australian Investment Bank’s IT Division, 2007-2010
• Lack of stakeholder commitment
• Poor management of information
• Working with poor-quality information
• Poor incident investigation support
Client situation: results
• Reduced downtime of critical systems by at least 60%
• Virtually eliminated recurring incidents
• Level of escalations dropped by more than 50%
• Visible improvement in productivity
“The key to success is to be insistent about specificity: the more specific you are, the better your chances of solving the incident.” (KEPNERandFOURIE)
How did they do it?
They decided to follow four strategies to improve the management and quality of incident investigation information:
1. Improve stakeholder involvement & commitment
2. Improve management of information
3. Improve quality of information, thus decreasing incident investigation cycles
4. Improve support for incident investigations
Strategy 1: Improve stakeholder commitment
Specific challenges
• Lack of cross-silo collaboration
• Poor stakeholder buy-in
• Reluctant contributions from subject matter experts (SMEs)
Client actions: Introduced a formal, division-wide Root Cause Analysis (RCA) system
I. Provided common processes for troubleshooting and solution finding
II. Introduced stakeholder/info source analysis
III. Provided an easy way for SMEs to contribute meaningfully
Best in class Stakeholder Commitment
• Resolution time to repair a critical outage (3 hrs vs 45 hrs)
• 71% improvement in mean-time-to-repair of critical business applications vs an 11% decline
• 98% availability of critical business applications vs 82% availability
[Chart: mean-time-to-repair, improvement in m-t-t-r and availability for Best in Class, Average and Laggard companies; m-t-t-r labels 3 hrs, 8 hrs and 45 hrs]
Source: Aberdeen Group, Boston, Feb 2010 (J. DeBarros & G. Patil)
Best in class with RCA: Stakeholder Commitment
• 69% of Best in Class companies implemented RCA over the last 2 years, with a 50% improvement in productivity and a 19% improvement in profitability. 28% indicated they will implement RCA in the next year.
• 19% of Average-rated companies implemented RCA, with a 12% improvement in productivity. Only 19% are planning to implement RCA in the next 12 months.
• The Laggards did not implement any RCA and saw a 9% drop in productivity. Nearly 30% plan to implement RCA.
[Chart: existing RCA, RCA in next 12 months and improved productivity for Best in Class, Average and Laggard companies]
Source: Aberdeen Group, Boston, Feb 2010 (J. DeBarros & G. Patil)
Client case situation
[Chart: existing RCA, RCA in next 12 months and improved productivity for Best in Class, Average, Laggard and Client]
Common process
• Everybody uses the same process for finding causes and solutions
• The process determines which questions to ask at each step for each type of incident investigation approach
• Designed to work with minimal information combined with a good focus, to provide quick answers
Step 1: Identify Problem Situation
Step 2: Gather Incident Information
Step 3: Analyse Incident Information
Step 4: Determine Conclusion
Stakeholder analysis
• What do you know?
• What don’t you know?
• Who has the information?
• How will you obtain the missing information?
Stakeholder groups: Decision makers, Implementers, Influencers
Strategy 1: Improve stakeholder commitment
SPECIFIC RESULTS ACHIEVED
• An incident is first attempted in natural teams, but if it is not resolved, Management gives permission to ask for the appropriate SMEs
• Management sanctions incident investigation meetings, because they know these will provide results
• Staff achieve more in less time and are not averse to attending incident investigation meetings
• Management promotes the use of the formal RCA processes
“If a team could not solve a problem, the person with the information was not invited!” (Chuck Kepner)
Strategy 2: Improve management of information
Specific challenges
• Inappropriate use of information sources
• Either too much or too little information
• High level of escalations
• Duplication of effort
Client actions
1. Introduced “rules of engagement”
2. Introduced a framework of “levels of troubleshooting” aligned with PM’s severity levels
3. Taught staff to trust the processes to deliver the correct answers (templates with questions)
4. Introduced the “minimalistic” principle
Rules of engagement
TOP – Commitment to training key staff and facilitators. Publicise the rules of engagement.
MIDDLE – Commitment to declare a situation an unresolved incident, and to instruct direct reports to carry out an RCA exercise to resolve it.
WORKFORCE – Allow IT professionals 2-8 hours to resolve a problem. If they do not resolve it, they are allowed to escalate the incident and apply the RCA process.
Levels of troubleshooting
1. SEV 3: Thinking on Your Feet – “Checklist” problem solving using appropriate checklists. Leadership allows the IT professional up to 8 hours to resolve an incident. If this does not happen, the incident is escalated.
2. SEV 2: Intuitive Analysis – Leadership instructs and allows the natural team to perform an intuitive RCA on the incident. If it is not resolved, the team escalates the incident.
3. SEV 1: Investigative Analysis – In-house trained RCA facilitators have Leadership’s permission to assemble a cross-silo team to formally investigate the incident with the appropriate RCA tools, systematically arriving at the TRUE & ROOT causes of the problem situation.
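The escalation ladder above can be sketched as a small state function. This is a minimal illustration, not the client’s tooling: `Level`, `LADDER` and `next_level` are hypothetical names, the 8-hour SEV 3 time box comes from the slide, and it is assumed that an unresolved SEV 2 attempt is escalated as soon as the natural team reports it is stuck.

```python
from enum import Enum
from typing import Optional


class Level(Enum):
    """Hypothetical encoding of the three troubleshooting levels (SEV 3 is the entry level)."""
    SEV3 = "Thinking on Your Feet (checklists)"
    SEV2 = "Intuitive Analysis (natural team)"
    SEV1 = "Investigative Analysis (facilitated cross-silo team)"


# Escalation order: SEV 3 -> SEV 2 -> SEV 1.
LADDER = [Level.SEV3, Level.SEV2, Level.SEV1]

# Time box for checklist troubleshooting, taken from the SEV 3 description.
SEV3_TIME_BOX_HOURS = 8.0


def next_level(level: Level, resolved: bool, hours_spent: float) -> Optional[Level]:
    """Return the level to work the incident at next, or None once it is resolved.

    Assumption: for SEV 2 and SEV 1 the caller only asks after the attempt at the
    current level has ended without a fix.
    """
    if resolved:
        return None
    if level is Level.SEV3 and hours_spent < SEV3_TIME_BOX_HOURS:
        return Level.SEV3  # still inside the checklist time box
    position = LADDER.index(level)
    # SEV 1 is the top of the ladder, so an unresolved SEV 1 stays there.
    return LADDER[min(position + 1, len(LADDER) - 1)]


print(next_level(Level.SEV3, resolved=False, hours_spent=3))  # Level.SEV3
print(next_level(Level.SEV3, resolved=False, hours_spent=8))  # Level.SEV2
print(next_level(Level.SEV2, resolved=False, hours_spent=1))  # Level.SEV1
```

Since the rules of engagement allow 2-8 hours, a team could lower `SEV3_TIME_BOX_HOURS` without changing the rest of the ladder.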
“Minimalistic principle”
• Only analyse the information that is relevant to the incident
• Work through questions within a customised “factor analysis” framework
• Get a quick factual “snapshot” of the characteristics of the incident, then use SME experience and gut feel to explain the snapshot
• Test SME inputs against the logic of the snapshot
“Too much information can cause confusion. The key is to get all the relevant information onto one page, and that is normally substantially less than gathering ‘all’ the information.” (Innovation – the FreeZone thinking experience, by Kepner & Fourie)
Example of templates with questions
Strategy 2: Improve management of information
SPECIFIC RESULTS ACHIEVED
• Staff knew exactly when to apply a formal RCA process, when to involve a facilitator and when to call on a cross-functional SME
• Gave IT professionals the confidence that they were working through a problem situation systematically and comprehensively
• Developed a “no-nonsense” incident investigation culture: you ask a question, and you either have the answer or you need to go and get it
“Every incident has multiple entry points. To be successful in solving the incident, you need to find the correct entry point.” (Matt Fourie)
Strategy 3: Improve quality of information
Specific challenges
• Wasted time and effort from having to do too many replications
• Mostly dealing with raw data instead of information
• Long investigation cycle times
• High levels of recurring incidents
Client actions
1. Introduced a set of interrogative questions to convert raw data into meaningful information
2. Created a “deductive” reasoning culture to arrive at answers quickly and effectively
3. Tested possible causes on paper to eliminate 90% of replication time, effort and money
Incident statement: sample (OBJECT / FAULT)
Q1: What is the most specific object/thing you are having a problem with?
A1: Software freezing
Q2: Can you be more specific?
A2: Object: feed to Wallboard software. Fault: slow
Q3: Can you be more specific? What do you mean by “slow”?
A3: Wallboard price feed update not responding
Q4: Do you know the cause of this situation?
A4: No
Incident statement: Wallboard price feed update (OBJECT) / Not responding (FAULT)
Snapshot info for causes
Columns: IS | BUT NOT | WHY NOT
Rows: OBJECT, FAULT, USERS, WHERE, TIMING, PATTERN, CYCLE
OBJECT – What object, and which other object(s) not?
FAULT – What fault, and which other typical faults not?
USERS – Who has the problem, and who does not?
WHERE – Where are these users, and where could they have been but are not?
TIMING – When did it happen for the first time, and when not?
PATTERN – What is the pattern of faults, and what could it have been but is not?
CYCLE – In which cycle does the problem occur, and in which cycle does it not occur?
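One way to hold the snapshot is as a fixed set of dimensions, each with an IS, a BUT NOT and a WHY NOT entry. The sketch below is a hypothetical structure (`SnapshotRow`, `Snapshot` and `gaps` are illustrative names, not part of any KEPNERandFOURIE tool); listing the dimensions that still lack an IS or BUT NOT answer mirrors the rule that you either have the answer or you go and get it. The sample entries are borrowed from the server-slowness example in this deck, assuming its first two rows are OBJECT and FAULT.

```python
from dataclasses import dataclass, field
from typing import Dict, List

DIMENSIONS = ("OBJECT", "FAULT", "USERS", "WHERE", "TIMING", "PATTERN", "CYCLE")


@dataclass
class SnapshotRow:
    is_: str = ""      # what is observed (e.g. the object that has the fault)
    but_not: str = ""  # the closest comparable case where it could occur but does not
    why_not: str = ""  # what distinguishes the IS from the BUT NOT


@dataclass
class Snapshot:
    rows: Dict[str, SnapshotRow] = field(
        default_factory=lambda: {d: SnapshotRow() for d in DIMENSIONS}
    )

    def gaps(self) -> List[str]:
        """Dimensions still missing an IS or a BUT NOT answer: information to go and get."""
        return [d for d, row in self.rows.items() if not (row.is_ and row.but_not)]


snap = Snapshot()
# Hypothetical entries: only OBJECT and FAULT have been answered so far.
snap.rows["OBJECT"] = SnapshotRow("ABC server", "XYZ server", "Software")
snap.rows["FAULT"] = SnapshotRow("Slow performance", "Comms issues", "Volume issue")
print(snap.gaps())  # ['USERS', 'WHERE', 'TIMING', 'PATTERN', 'CYCLE']
```

Keeping the dimension set fixed is what holds the snapshot to one page: anything that does not answer one of the seven questions stays out.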
CauseWise sample
DIMENSION | IS | BUT NOT | WHY NOT
Object | Fireburst V2.0 connection | E-Express, Mango connections | F/B upgrade from V1 to V2, poor testing issue
Fault | Dropping | Freezing, slow | Time-out settings, configuration of drivers
Location of Object | ANZ, USA, UK | Asia | LAN, proxy server issues, F/Wall rules
Timing | Monday, Sept 2nd with SOB | Any time earlier than Sept 2nd | Java upgrade, Netscape upgrade
Pattern | Continuous | Sporadic, periodic | Don’t know
Life Cycle | When doing a transaction | “x” time into transaction | Operator error, code error on a specific page
Phase of Work | Just after logging in | Logging in or out | OS configuration issue, DNS issue
Possible Causes & Testing
1. Proxy server tampered with during the Java upgrade on the LAN (test marks: X)
2. Java upgrade caused driver incompatibility with Fireburst website V2.0 (test marks: √ √ X)
3. Netscape upgrade caused driver incompatibility with Fireburst website V2.0 (test marks: √ √ A1 √ √ √ √)
A1: Only if the staff in Asia did not upgrade to Netscape
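Testing possible causes on paper can be sketched as a filter over the test marks. This assumes the common IS/BUT NOT convention that a possible cause must explain every IS and every BUT NOT, so any X eliminates it, and that among the survivors the cause needing the fewest assumptions (such as A1) is the most probable. `assumptions_needed` and `rank_causes` are hypothetical names; the marks are transcribed from the sample as shown.

```python
from typing import Dict, List, Optional, Tuple

PASS, FAIL = "√", "X"


def assumptions_needed(marks: List[str]) -> Optional[int]:
    """Return how many assumptions a cause relies on, or None if any test failed."""
    if FAIL in marks:
        return None
    # Any mark other than a plain pass (e.g. "A1") counts as a conditional pass.
    return sum(1 for mark in marks if mark != PASS)


def rank_causes(tests: Dict[str, List[str]]) -> List[Tuple[str, int]]:
    """Keep causes that survive every test, fewest assumptions first."""
    survivors = []
    for cause, marks in tests.items():
        needed = assumptions_needed(marks)
        if needed is not None:
            survivors.append((cause, needed))
    return sorted(survivors, key=lambda item: item[1])


# Marks as shown on the CauseWise sample slide.
tests = {
    "1. Proxy server tampered with during the Java upgrade": ["X"],
    "2. Java upgrade caused driver incompatibility": ["√", "√", "X"],
    "3. Netscape upgrade caused driver incompatibility": ["√", "√", "A1", "√", "√", "√", "√"],
}
print(rank_causes(tests))
# [('3. Netscape upgrade caused driver incompatibility', 1)]
```

The surviving cause still carries assumption A1, so the next step would be to confirm whether the staff in Asia upgraded to Netscape before spending time on a replication.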
Snapshot info for Solutions
Key Requirements (matrix columns 1-8)
1 Best data transfer rate
2 No loss of data
3 Improve system up-time
4 Improve trickle & purge
5 Reduce DR time
6 Capex < $2m
7 Implement < 3 mos
Four Question Drill
• What are the results you want to achieve with this solution?
• What are the existing problems you would like to remove with this solution?
• What are the potential risks you would like to avoid with this solution?
• What money and time do you have, or do you need to preserve? What are the restrictions out of your control?
SolutionWise Demo
Step 1: Purpose Statement – Increase market share
Requirements to Fulfill
•Maximum increase in market share
•Attract as many new clients as possible
•Maximum employee buy-in
•Long-term growth
•Maximum impact on competitors
•Do not lose any clients
Problems/Symptoms to Remove
•Missing deadlines for implementation
•Excessive costs
•Dissatisfied staff
•Admin mistakes
•Long turnaround times
•Security issues
Risks to Avoid
•Employees feel it increases their workload
•Not making the anticipated market share increase
•Security breaches
Resources and Restrictions
•5 months to implement
•80K for the implementation
•Least possible cost
Step 1: Purpose Statement – Find a way to improve “trickle & purge” for RMC Application
Requirements to Fulfill
•Most effective move of data
•Fast & efficient
•Low overhead
•Improve error checking
•Should delete data
•Automated
•Producing a report
Problems/Symptoms to Remove
•Slowness
•Inefficient
•Not copying all data
•No error checks
•Does not always work
•Low reliability
•Requires resources to monitor
Risks to Avoid
•Negative impact on system performance
•Negative impact on customers
•Loss of data
•Difficult to maintain
Resources and Restrictions
•Least cost
•Easy to implement
•Hardware specs at site
•Ops hours at site
Key Requirements
1. Fastest possible transfer rate
2. No loss of data
3. Should not impact system performance
4. Do not increase any resource overhead
5. Easy to repair and to maintain
6. Improve reliability
7. Ease of implementation
Statement: Find a way to improve “trickle & purge” for RMC application
Various actions to meet key requirements
1. Rewrite and improve the existing code
2. Improve hardware specifications
3. Optimize disk layout to accommodate all tasks
4. Replace “trickle and purge” with a “constant feed” system
5. Design good validation code
6. Provide automatic back-ups
7. Develop proper and comprehensive documentation for the process
8. Improve staff awareness through training
9. Automated alerts if a task is not correct
10. QC test for every release
Key solution requirements
1. Fastest possible transfer rate
2. No loss of data
3. Should not impact system performance
4. Do not increase resource overheads
5. Easy to repair & maintain
6. Improve reliability
7. Ease of implementation
Statement: Find a way to improve “trickle & purge” for RMC application
Key solution requirements scored against the various actions to meet them
Requirement | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
1. Fastest possible transfer rate | 3 | 2 | 2 | 3 | 0 | 0 | 0 | 0 | 0 | 0
2. No loss of data | 3 | 2 | 0 | 3 | 3 | 3 | 0 | 0 | 1 | 0
3. Should not impact system performance | 3 | 3 | 3 | 3 | 0 | 0 | 0 | 0 | 1 | 0
4. Do not increase resource overheads | 3 | 2 | 3 | 3 | 0 | 0 | 0 | 0 | 0 | 0
5. Easy to repair & maintain | 2 | 0 | 0 | 2 | 2 | 1 | 3 | 3 | 2 | 0
6. Improve reliability | 3 | 2 | 2 | 3 | 3 | 0 | 1 | 2 | 2 | 3
7. Ease of implementation | 2 | 0 | 0 | 2 | 0 | 0 | 3 | 3 | 0 | 3
Possible Actions: 1. Rewrite & improve code; 2. Improve H/W specs; 3. Optimize disk layout for task; 4. Replace trickle & purge with “constant feed”; 5. Design validation code; 6. Provide automatic back-ups; 7. Develop proper & comprehensive docs for process; 8. Improve staff awareness through training; 9. Automated alerts if not correct; 10. QC test for every release.
New solution: 1. Determine the existing GAPS in the code and find the best person to rewrite it to the required specs. 2. Get one person to update all procedures and documentation for the new code and design. 3. Once we have a stable system, provide appropriate awareness training to maximise the effectiveness of the new design.
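The scoring matrix can be read in two ways, sketched below: total each action across all requirements, or check how well a combination of actions covers every requirement. Assumptions: scores run from 0 (no contribution) to 3 (strong contribution), which the slide implies but does not define, and a combination is judged by its best score on each requirement; `action_totals` and `coverage` are hypothetical helper names.

```python
from typing import List, Sequence

# Scores transcribed from the matrix: one row per key requirement (1-7),
# one column per possible action (1-10).
SCORES: List[List[int]] = [
    [3, 2, 2, 3, 0, 0, 0, 0, 0, 0],  # 1. Fastest possible transfer rate
    [3, 2, 0, 3, 3, 3, 0, 0, 1, 0],  # 2. No loss of data
    [3, 3, 3, 3, 0, 0, 0, 0, 1, 0],  # 3. Should not impact system performance
    [3, 2, 3, 3, 0, 0, 0, 0, 0, 0],  # 4. Do not increase resource overheads
    [2, 0, 0, 2, 2, 1, 3, 3, 2, 0],  # 5. Easy to repair & maintain
    [3, 2, 2, 3, 3, 0, 1, 2, 2, 3],  # 6. Improve reliability
    [2, 0, 0, 2, 0, 0, 3, 3, 0, 3],  # 7. Ease of implementation
]


def action_totals(scores: List[List[int]]) -> List[int]:
    """Column sums: how much each action contributes across all requirements."""
    return [sum(column) for column in zip(*scores)]


def coverage(scores: List[List[int]], actions: Sequence[int]) -> List[int]:
    """Best score any of the chosen actions (numbered from 1) achieves on each requirement."""
    return [max(row[a - 1] for a in actions) for row in scores]


print(action_totals(SCORES))        # [19, 11, 10, 19, 8, 4, 7, 8, 6, 6]
print(coverage(SCORES, [1]))        # [3, 3, 3, 3, 2, 3, 2]
print(coverage(SCORES, [1, 7, 8]))  # [3, 3, 3, 3, 3, 3, 3]
```

No single action scores 3 on every requirement: actions 1 and 4 lead on totals (19 each) but score only 2 on requirements 5 and 7. Actions 1, 7 and 8, which correspond to the three steps of the new solution, together reach 3 on all seven.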
Reducing cycle times
Statement: Server slow
IS | BUT NOT | WHY NOT
ABC server | XYZ server | Software
Slow performance | Comms issues | Volume issue
Claims division | Other divisions | W/end upgrade
Globally | Isolated areas | LAN firewall
June 2nd | Before | New proxy
Loading data | Retrieving data | Volume of data
Sales reporting | Other reports | Excessive data
Test marks as shown: XX X X X
Strategy 3: Improve quality of information
SPECIFIC RESULTS ACHIEVED
• Incident root cause found first time, every time
• Meetings became more productive
• The RCA method always created a better, common understanding of the problem situation among all stakeholders
• Recurring incidents were virtually eliminated
• Cycle times for incident investigations were reduced drastically
I keep six honest serving-men
(They taught me all I knew);
Their names are What and Why and When
And How and Where and Who.
I send them over land and sea,
I send them east and west;
But after they have worked for me,
I give them all a rest.
Rudyard Kipling
Strategy 4: Improve support for incident investigations
Specific challenges
• Did not know “Who, What, How and When”
• No “go-to” person to help with effective investigations
Client actions
1. Trained in-house professional RCA investigators
2. Established “rules of engagement” for facilitators
3. Publicised successes
4. Recognition by Management
Training in-house facilitators
• Advising the incident owner on whom to invite to the RCA meeting to improve the chances of a quick success (stakeholders & info sources)
• How to prepare a team for an effective RCA meeting
• Exceptional investigation facilitation skills (the art of asking the right questions and verifying the answers for authenticity)
• RCA process skills that enable the facilitator to lead any team at any level in investigations
“One of the main reasons for incident investigation failure is ‘analysis paralysis’: having to work with too much information.” (Infrastructure Manager, Airline Software Platforms)
Strategy 4: Improve support for incident investigations
SPECIFIC RESULTS ACHIEVED
• Facilitators established a forum for themselves, meeting once a month to discuss lessons learned and share successes
• Facilitators are now also used to help solve vendor issues affecting application performance
• Facilitators started to feed results into an agreed knowledge database, and also encouraged the recording of incidents solved with informal RCA
• Increased divisional awareness of how well they are doing with application performance issues
“It is always a good strategy to stand a few steps back and look at the incident from a different angle.” (Unknown)
Application Performance results
[Chart: mean-time-to-repair, improvement in m-t-t-r and availability for 2006, 2008 and 2010; m-t-t-r scale in weeks, days and hours]
1. M-T-T-R went from weeks to a couple of hours
2. Improvement in M-T-T-R practices by nearly 50%
3. Availability of critical systems went from 77% to 94%
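Converting the availability figures into downtime makes the improvement concrete. A minimal sketch, assuming availability is measured around the clock over a 365-day year (the slides do not state the measurement window); `downtime_hours` and `downtime_reduction` are hypothetical helper names. On that assumption, a rise from 77% to 94% cuts yearly downtime from roughly 2,015 to 526 hours, a reduction of about 74%.

```python
HOURS_PER_YEAR = 24 * 365  # assumption: availability measured over a full calendar year


def downtime_hours(availability: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Hours of unavailability implied by an availability fraction over a period."""
    return (1.0 - availability) * period_hours


def downtime_reduction(before: float, after: float) -> float:
    """Fractional reduction in downtime when availability rises from `before` to `after`."""
    return 1.0 - (1.0 - after) / (1.0 - before)


print(round(downtime_hours(0.77)))               # 2015
print(round(downtime_hours(0.94)))               # 526
print(round(downtime_reduction(0.77, 0.94), 3))  # 0.739
```

The relative reduction does not depend on the measurement window, so the roughly 74% figure holds even if availability is measured only during business hours; only the absolute hours change.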
Improvement in escalations
[Chart: Sev 3 to Sev 2, Sev 2 to Sev 1, recurring problems and vendor interventions, 2006 vs 2010]
1. Escalation of severity 3 to severity 2 reduced by nearly 24%
2. Escalation of severity 2 to severity 1 reduced by 76%
3. Recurring incidents reduced by 35%
Lessons learned
• Most recurring incidents and problems are caused by “out-of-date procedures” and a lack of proper documentation
• RCA is a “mental orientation” that people have to be trained in; it “does not come with experience”
• IT professionals need a “thinking approach” that can be applied in most situations
• Rules of Engagement should become a standing order
• Encourage use in all incident investigation meetings: ask for the paperwork/evidence
• Sponsor continuous RCA training
• Send regular email communications to publicise successes
Thank you for your time!
If you have any further questions regarding minor or major investigations, and how to acquire the in-house skills to improve your metrics this drastically, please do not hesitate to speak to us after this session or to Andrew on;
ITIL Centric Processes