complex problem determination cases in real world, hiroki
© 2006 IBM Corporation
Complex problem cases in real world
Hiroki Nakamura
DRO, IBM Japan
DRO IBM Japan
Complex problem cases in real world Unclassified © 2006 IBM Corporation
Contents
1. Summary
2. Customer Requirements and Impacts (TAT, workload, cost, etc.)
3. Case 1 : Resolved complex case: DB2 calculation error
4. Case 2 : Resolved complex case: MQ connection error
5. Case 3 : Resolved complex case: MQ connection timeout
6. Case 4 : Un-resolved case: WPS/LDAP CPU 100% utilization
7. Case 5 : Un-resolved case: Web application timeout
8. Cause code analysis
9. Cost and product lifetime
Summary
Need to reduce resolution TAT and workload of the account team to improve customer satisfaction
More effective PD schemes are required
• Built-in trace/diagnostic code without a large performance degradation
• Detailed descriptions in the problem fix database so that it can be searched more easily
• Performance monitoring and a problem detection function for it
• Easy-to-use core dump inspection for crash/hang cases
• PD enablement (on/off) without restarting a product
Customer Impacts and Requirements
Impacts
• A large amount of time is wasted on PD and recreation tests
• Even if a fix is provided, the account team needs a very long time for regression testing
Requirements
• Root cause analysis, even for a problem that occurred only once in some years
• Source code investigation, even when few materials are available (such as no trace)
• A logical scenario of the problem (root cause) so that the fix can be trusted
• Detailed information for customer management
– In mission critical cases, Japanese companies tend to be very sensitive about quality
– Especially financial companies, because of strict guidance from the Financial Services Agency
– Frequent progress reports on the problem resolution are required (every three hours, daily, etc.)
– A fix might not be applied if low occurrence and easy recovery can be guaranteed
• Recurrence test to confirm that the problem is fixed by the provided solution
• A special build is preferable to a Fixpac because it’s a single fix and no long term regression test is needed
• Direct communication channel to the laboratory change team, who makes the solution
Resolved Case 1 : DB2 calculation error
DB2 Problem ?
What is a Problem?
• Unexpected results were produced by SQL calculation
– No hardware error detected
– The problems are 100% reproducible, but symptoms are different each time
– The SQL is very large and the data volume is huge
– It takes about 30 hours for the calculation
– A run with application debug code takes about 4 days
– Not reproducible in IBM
– Not reproducible with small data
Double bit Parity Error of H/W can lead to a problem
• Frequent occurrence of parity errors on an FC adapter can lead to a two-bit error
– A parity error (single bit) can be recovered by retrying the access
– A two-bit error is neither detectable nor recoverable and can cause inconsistent behavior
– Data corruption or wrong calculation
• Some customers suffered from this problem
– A communication company
– An insurance company
• The temporary error was not recognized as a severe H/W error, because it’s recoverable
– This caused long term problem determination of DB2 instead of H/W
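Why a single-bit parity error is caught while a two-bit error passes unnoticed can be shown with a minimal sketch. It assumes plain even parity over one byte; the actual protection scheme of the FC adapter is not described in the slides, and the function names and bit patterns below are illustrative only.

```python
def even_parity_bit(byte: int) -> int:
    """Return the parity bit that makes the total number of 1 bits even."""
    return bin(byte & 0xFF).count("1") % 2


def parity_check_passes(byte: int, parity_bit: int) -> bool:
    """True when the stored parity bit still matches the data byte."""
    return even_parity_bit(byte) == parity_bit


# Illustrative data byte (assumption, not taken from the case).
original = 0b10110010
parity = even_parity_bit(original)

single_flip = original ^ 0b00000100  # one bit corrupted
double_flip = original ^ 0b00010100  # two bits corrupted

single_detected = not parity_check_passes(single_flip, parity)
double_detected = not parity_check_passes(double_flip, parity)

print("single-bit error detected:", single_detected)
print("double-bit error detected:", double_detected)
```

The double flip leaves the number of 1 bits even, so the check passes although the data differs from the original. That matches the slide's point that the corrupted value is handed on to the calculation without any hardware error being reported.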
Resolved Case 1 : DB2 calculation error
Situation Chronology
[Timeline figure along a timeframe axis. Events marked: Incident; Replace FC Adapter; Add CPUs and memory; No FC Error Found; Service-in; Application Errors Found; HW Temporary Error; Problem Support Request; Critical Situation Process; No problem found in application; FA (76 days); Report to the Customer (twice); Situation Close; Confirmed no H/W error]
TAT/WL: 93 days / 381 person days
Product: Fiber Channel adapter in pSeries
Problems: Inconsistent results were found every time the same SQL was executed with the same data. There are two systems with the same configuration. One system always produces proper results, but the other system produces wrong results.
Frequency: SQL application error happened. Reproducible.
Reasons for long term: Reproduction takes about 30 hours, and 4 days with application debug code, because of the considerably large quantity of SQL and data.
Resolved Case 2 : MQ connection error
Who has ownership ?
[Figure: network path between the Host and the MQ Server through routers, switches and broad band Ethernet. Each segment is labeled IBM or Non-IBM. Callout: MQ get connection error !!]
• Initial problem is an MQ get connection error.
• No similar problem was found.
• The network path of the MQ connection is long.
Resolved Case 2 : MQ connection error
Packet Capture to analyze network
[Figure: network diagram from the Host (Intranet, Firewall x 2, Router, L3 F/W, router) over Broad Band Ethernet to MQ Server #1 and MQ Server #2. Duplicated components on the path: Broad Band EtherRouter #1/#2, L2SW#5/#6, Back F/W#1/#2, L2SW#3/#4, External F/W#1/#2, L3SW#1/#2, Media Converter 1A/1B and 2A/2B (Fiber between converters, UTP elsewhere), L2SW#1/#2. Legend: red line is the connection path; markers show data capture points. Annotations: "Sent reset packet" and "Bug in SYN Defender function"]
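Capturing at several points along a long path narrows down which device injected the reset packet. A minimal sketch of that reasoning follows, assuming each capture point has already been reduced to a yes/no answer on whether the reset was seen there. The function name, the capture point names and the sample observations are hypothetical and are not the actual capture results of this case.

```python
from typing import Dict, List, Optional, Tuple


def locate_reset_origin(
    path: List[str], rst_seen: Dict[str, bool]
) -> Optional[Tuple[Optional[str], str]]:
    """Find the segment where a reset packet first appears.

    path lists capture points in the direction the reset travels.
    Returns (last point without the reset, first point with it),
    or None when no capture point saw a reset at all.
    """
    previous = None
    for point in path:
        if rst_seen[point]:
            return (previous, point)
        previous = point
    return None


# Hypothetical capture points and observations (illustration only).
capture_path = ["cap1", "cap2", "cap3", "cap4"]
observations = {"cap1": False, "cap2": False, "cap3": True, "cap4": True}

segment = locate_reset_origin(capture_path, observations)
print("reset first appears between:", segment)
```

A device located between the two returned capture points is then the candidate for ownership of the problem. A result whose first element is None means the reset was already present at the first capture point.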
Resolved Case 3 : Connection Timeout
Symptom
[Figure: Web Servers and Gateway Servers connected to the APL Server, where the MQ Manager runs Client Channel Threads and pairs of Send Channel Process / Receive Channel Process. Notes on the figure: an MQGet request every 3 sec; a client channel thread is created according to a request from the Gateway Server; only channels from the Application Server were disconnected]
1. Connection timeout happened between Web server and Application server.
2. Only MQ channels from the application server were disconnected.
3. Investigated MQ log and trace: no error found in MQ.
4. Investigated AIX trace: MQ threads are waiting for a lock.
5. A lock owner thread is not dispatched for a long time.
6. And then the connection timeout occurred.
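Steps 4 and 5 amount to finding a long interval in which the lock owner thread never gets a CPU. A minimal sketch of that check follows, assuming the trace has been reduced to (timestamp in milliseconds, thread name) dispatch records. This record format, the function name and the sample values are hypothetical and do not reflect the real AIX trace layout.

```python
from typing import List, Optional, Tuple


def longest_dispatch_gap(
    events: List[Tuple[int, str]], thread_id: str
) -> Optional[Tuple[int, int, int]]:
    """Return (gap_ms, from_ms, to_ms) for the longest interval between
    two consecutive dispatches of thread_id, or None if the thread was
    dispatched fewer than two times. events must be sorted by time."""
    times = [t for t, tid in events if tid == thread_id]
    if len(times) < 2:
        return None
    gaps = [(later - earlier, earlier, later)
            for earlier, later in zip(times, times[1:])]
    return max(gaps)


# Hypothetical dispatch records: (time in ms, thread name).
dispatch_events = [
    (0, "P1_T1"),
    (100, "P1_T99"),   # lock owner runs
    (200, "P1_T2"),
    (300, "P1_T1"),
    (5400, "P1_T99"),  # lock owner runs again much later
    (5500, "P1_T1"),
]

gap = longest_dispatch_gap(dispatch_events, "P1_T99")
print("longest gap for lock owner (ms, from, to):", gap)
```

A gap for the lock owner that approaches the connection timeout, while the waiting threads keep being dispatched, points at scheduling rather than at MQ itself.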
Resolved Case 3 : Connection Timeout
How to determine configuration ?
[Figure: two thread-to-CPU assignment diagrams on CPU 1 and CPU 2, labeled "Processor wide CPU assignment" and "System wide CPU assignment". Both show client channel threads P1_T1, P1_T2, P1_T3, P1_T4 ... P1_T99, where P1_T99 is the lock owner in Sleep state and the other threads are in "Lock Wait : Check process". Annotations: Bottleneck; Wait for CPU dispatch for a long time; Change Configuration]
K_T : Kernel Thread, VP : Virtual Processor
P#_T# : Process ID and Thread ID (Client Channel Thread)
Un-resolved Case 4 : CPU utilization 100%
Problem sequence
[Figure: login flow. Web Browser (UID/PW); Integrated Authentication (TAM/WebSEAL on AIX); Portal Server (WAS, WPS, Interceptor on AIX); LDAP (Directory Server, UDB, User Registry on AIX, HACMP Active/Standby). Multiplicity labels: x 3, x 2. Flow labels: Login Request, Authentication, Create Cookie, Request Transfer, Re-Authentication, Retrieve Group Info, Portal Screen]
[Figure: CPU utilization chart for Portal #1, Portal #2, LDAP Master and LDAP Backup, reaching 100%. Time marks: 9:00, 9:08, 10:07-9, 10:37-54, 11:20-13:05]
Un-resolved Case 4 : CPU utilization 100%
Supposed Scenario
[Figure: request/response sequence between the Portal Server (AIX, WAS, WPS, Interceptor: Re-Authentication, Retrieve Group Info, Create Portal) and LDAP (AIX, Directory Server, UDB) on 2006/1/17. Time marks: 9:00, 9:08, 10:07-9, 10:37-54, 11:20-13:05]
Stages shown in the figure:
• Normal process: smooth communication (Request/Response).
• Delayed process in Portal Server: it takes longer to receive a response, followed by more degradation.
• Receive response before preparation (logic flaw?): the response from LDAP is discarded because the Portal Server is not prepared for it. Inconsistent requests were left in the Portal Server.
• Re-request with inconsistency (logic flaw?): the Portal Server re-sent inconsistent requests many times while LDAP replied promptly, so the LDAP server was overloaded (100%).
• Recovered by reboot.
•Based on Solution assurance review
•No occurrence in a reproduction test on a customer test machine.
Un-resolved Case 5 : Application timeout
Configuration
•Problem is an application timeout.
•The application is legacy.
•There are some components by other companies.
•It’s very difficult to gather data to analyze.
[Figure (2006.02.23): PC terminal (IE); Web Server on Other Unix (Servlet Engine, Servlet, Application, CLI Driver, TCP/IP); Hub Server; Appl Server (Application, CLI Driver, TCP/IP); DB Server on AIX (TCP/IP, db2tcpcm, Agents, UDB Engine). Call flow labels: Call, Connect, SQL, Terminate, Return, Timeout. Timeout threshold: 120 sec. Note: connection and termination happen on each SQL execution because of the host base legacy application]
Un-resolved Case 5 : Application timeout
Rapid increase of connections
[Figure: chart of DB Server connections (vertical axis: Connections, horizontal axis: Time)]
DB Server Connections
Symptom
1) Connections increased (5 to 32), then a 16 sec wait, and connections became 46
2) Connections became 0, then a 28 sec wait, and connections became 45
3) Connection timeout happened in some cases even without a rapid connection increase

Time      Connection                 Time      Connection
10:21:40  10                         12:58:23  5
10:21:41  12                         12:58:24  3
10:21:42  12                         12:58:25  3
10:21:44  11                         12:58:26  3
10:21:45  5                          12:58:27  3
10:21:46  8                          12:58:29  3
10:21:47  5                          12:58:30  3
10:21:48  2                          12:58:31  2
10:21:49  5                          12:58:32  1
10:21:50  5                          12:58:33  0
10:21:51  32                         12:59:01  45  (28sec wait)
10:22:07  46  (16sec wait)           12:59:02  32
10:22:09  38                         12:59:03  26
10:22:10  38                         12:59:04  20
10:22:11  38                         12:59:06  16
10:22:12  34                         12:59:07  11
10:22:13  33                         12:59:08  3
10:22:14  30                         12:59:09  4
10:22:15  32                         12:59:10  2
10:22:16  33                         12:59:11  3

•Investigation starts from DB2, because the other components are owned by another company.
•Nobody can explain the long connection wait time and the rapid increase of connections.
•Taking a DB2 trace caused very heavy CPU utilization, which caused many timeouts.
DB2 trace and AIX trace were taken, but the problem is not resolved
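The two waits called out in the symptom list can be picked out of the connection samples mechanically. The following is a minimal sketch that uses excerpts of the sample table from this slide. The 10 second threshold and the function name are assumptions made for illustration.

```python
from datetime import datetime
from typing import List, Tuple

Sample = Tuple[str, int]  # (HH:MM:SS, number of connections)


def find_stalls(samples: List[Sample], min_gap_sec: int = 10):
    """Return (from, to, gap_sec, conns_before, conns_after) for every
    pair of consecutive samples at least min_gap_sec apart."""
    fmt = "%H:%M:%S"
    stalls = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        delta = datetime.strptime(t1, fmt) - datetime.strptime(t0, fmt)
        gap = int(delta.total_seconds())
        if gap >= min_gap_sec:
            stalls.append((t0, t1, gap, c0, c1))
    return stalls


# Excerpts of the two sample series shown on the slide.
series_a = [("10:21:49", 5), ("10:21:50", 5), ("10:21:51", 32),
            ("10:22:07", 46), ("10:22:09", 38)]
series_b = [("12:58:32", 1), ("12:58:33", 0), ("12:59:01", 45),
            ("12:59:02", 32)]

stalls_a = find_stalls(series_a)
stalls_b = find_stalls(series_b)
print(stalls_a)
print(stalls_b)
```

Each reported stall pairs the silent interval with the connection counts on both sides, which is the information symptoms 1) and 2) summarize.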
Severe Problems Causal Analysis (long aged)
N=15, Code=46
[Pie chart values]
Cause                                  Count  Share
Others                                 12     27%
Long PD time by Lab                    8      17%
Difficult reproduction                 7      15%
Side effect by a fix                   3      7%
High error rate                        3      7%
No guidance of important info          3      7%
Low quality                            2      4%
Improper communication with lab        2      4%
Improper communication with customer   2      4%
Need to accelerate resolution          2      4%
Improper problem management            2      4%
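The shares on the chart can be recomputed from the cause code counts. A minimal sketch follows, assuming Code=46 is the total number of cause codes assigned across the N=15 cases, which agrees with the sum of the counts.

```python
# Cause code counts as shown on the pie chart.
CAUSE_COUNTS = {
    "Others": 12,
    "Long PD time by Lab": 8,
    "Difficult reproduction": 7,
    "Side effect by a fix": 3,
    "High error rate": 3,
    "No guidance of important info": 3,
    "Low quality": 2,
    "Improper communication with lab": 2,
    "Improper communication with customer": 2,
    "Need to accelerate resolution": 2,
    "Improper problem management": 2,
}

total_codes = sum(CAUSE_COUNTS.values())

# Share of each cause in percent, rounded to one decimal place.
shares = {cause: round(100 * count / total_codes, 1)
          for cause, count in CAUSE_COUNTS.items()}

for cause, share in shares.items():
    print(f"{cause}: {CAUSE_COUNTS[cause]} of {total_codes} = {share}%")
```

The recomputed values round to the labels on the chart except for "Others", which comes out at about 26.1% where the chart shows 27%.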
Cost of Poor Quality (may be common understanding)
[Figure: percentage of bugs by phase, showing % defects introduced in each phase (label: 85%), % defects found in each phase, and $ cost to repair a defect in each phase]
Phase          $ Cost to repair defect in this phase
Coding         $25
Unit Test      $130
Funct Test     $250
Field Test     $1,000
Post Release   $14,000
Source: Applied Software Measurement, Capers Jones, 1996
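How steeply the repair cost grows can be expressed as a multiple of the coding-phase cost. This is a minimal sketch, assuming the five dollar figures on the chart belong to the five phases in ascending order. The extracted slide text does not preserve that pairing, so the mapping is an assumption.

```python
# Assumed pairing of phases and repair costs (USD per defect).
REPAIR_COST_USD = {
    "Coding": 25,
    "Unit Test": 130,
    "Funct Test": 250,
    "Field Test": 1000,
    "Post Release": 14000,
}

base_cost = REPAIR_COST_USD["Coding"]

# Cost of each phase relative to fixing the defect during coding.
multipliers = {phase: cost / base_cost
               for phase, cost in REPAIR_COST_USD.items()}

for phase, factor in multipliers.items():
    print(f"{phase}: ${REPAIR_COST_USD[phase]} ({factor:g}x coding)")
```

Under this pairing a defect that escapes to the field costs 560 times as much to repair as one fixed during coding.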