TRANSCRIPT
Avoiding the Pitfalls of Speech Application Rollouts
Through Testing and Production Management
Rob Edmondson, Senior Field Engineer, Empirix, Inc.
Overview
You are about to deploy a new call center, with a new PBX, a speech-enabled IVR deployed on a VoiceXML architecture, post-routing CTI, and 200 agent stations with IP phones …
• What is the customer perceived latency for your IVR to respond to callers’ speech inputs?
• What is the average host connection latency for the IVR?
• What percentage of callers’ utterances are recognized the first time?
• What percentage of calls fail to be completed in the IVR because of application errors?
• How many calls fail to be routed to the correct agent or skill group?
• What is the average time it takes for screen pop to occur?
• What percentage of screen pops have missing or incorrect information?
• What percentage of screen pops never happen?
• What is the voice quality for the agent and caller?
• What is the impact on other users of your CRM system?
At 5 Calls/Minute? At 30 Calls/Minute? At Maximum Call Load?
Why Should We Care? Call Queue Analysis
[Chart: call queue analysis across half-hour intervals through the day, plotting calls and capacity (0-2000, left axis) against average speed of answer in seconds (0-1000, right axis). Legend: Calls, Capacity, ASA.]
Business Goals Driving Self-Service…
[Chart: percentage of calls handled by agents versus handled by self-service (0-100%) across industries: stock/mutual funds, retail banking, credit card, mortgage, health insurance, telecom, and utilities. Source: Enterprise Integration Group, 2004.]
...While Quality Strategies Focused on Agents.
Speech Application Quality – Design and Delivery
Quality Evaluation Matrix
[Matrix: quality of customer experience plotted against Design (vertical axis) and Delivery (horizontal axis). The four quadrants: difficult to use / unpredictable behavior; easy to use / unpredictable behavior; easy to use / behaves as designed; difficult to use / behaves as designed.]
Common Questions When Deploying Speech
Design:
• Can I just ‘speechify’ my DTMF apps?
• Should I allow DTMF input?
• What voice should we use?
• How personal should the application be?
• Should I allow barge-in?
• Which utterances should I allow for a recognition state?
• How do I handle error conditions?
• When do I transfer to an agent?
Delivery:
• How do I test speech?
• Do I have enough speech/TTS resources?
• Do I need to test with different accents?
• How do I do usability testing?
• Will VoIP impact my speech recognition accuracy?
• How do I verify TTS quality?
• How do I make sure it’s working after we go into production?
Application Testing:
Dialog Traversal: creates and executes a series of test cases to cover all possible paths through the dialog, verifying that:
• the right prompts are played
• each state in the call flow is reached correctly
• the universal, error, and help behaviors are operational
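The traversal step can be sketched as a breadth-first enumeration of caller input sequences over a dialog graph. Everything below (the state names, utterances, and depth cap) is invented for illustration; it is not taken from any real application:

```python
# Illustrative sketch: enumerating test cases for dialog traversal.
# The dialog graph, state names, and inputs are hypothetical.
from collections import deque

# Each state maps a caller input to the next state; "end" terminates the call.
DIALOG = {
    "Welcome":      {"order": "GetPizzaSize", "help": "Help"},
    "GetPizzaSize": {"large": "Confirm", "medium": "Confirm", "help": "Help"},
    "Help":         {"continue": "GetPizzaSize"},
    "Confirm":      {"yes": "end", "no": "GetPizzaSize"},
}

def all_paths(start="Welcome", max_depth=6):
    """Enumerate every input sequence that reaches 'end' within max_depth steps."""
    paths, queue = [], deque([(start, [])])
    while queue:
        state, inputs = queue.popleft()
        if state == "end":
            paths.append(inputs)
            continue
        if len(inputs) >= max_depth:
            continue  # guard against loops (e.g. Help -> GetPizzaSize -> Help ...)
        for utterance, nxt in DIALOG[state].items():
            queue.append((nxt, inputs + [utterance]))
    return paths

for p in all_paths():
    print(" -> ".join(p))
```

Each enumerated path becomes one test call: play the scripted inputs, then check the prompt heard at every step.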
System Load: simulates a high inbound call volume to ensure that:
• the expected caller capacity can be handled
• proper load balancing occurs across the system
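A load step like the one described could be sketched with a thread pool standing in for concurrent callers; launch_call() here is a hypothetical placeholder for a real call generator (e.g. a bulk-call tool driving SIP or T1 channels), and the ramp schedule is invented:

```python
# Illustrative sketch: ramping up simulated call load and recording results.
import concurrent.futures
import random
import time

def launch_call(call_id):
    """Stand-in for one simulated caller; returns (call_id, completed, seconds)."""
    start = time.perf_counter()
    time.sleep(random.uniform(0.01, 0.05))  # placeholder for the real call
    return call_id, True, time.perf_counter() - start

def run_load_step(concurrent_calls):
    """Drive one load step and report the call completion rate."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrent_calls) as pool:
        results = list(pool.map(launch_call, range(concurrent_calls)))
    completed = sum(1 for _, ok, _ in results if ok)
    return completed / concurrent_calls

for step in (5, 10, 20):  # concurrent callers per step, ramping upward
    print(f"{step} concurrent calls: {run_load_step(step):.0%} completed")
```

A real harness would also record per-call latencies at each step, so degradation shows up before outright failures do.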
Speech Testing*
Recognition Testing: evaluates recognizer performance. Callers generate utterances by talking to the application, using test scripts, with:
• male and female speakers
• different dialects
• different noise conditions
Accuracy is measured by comparing the recognition results to a transcription of the utterances. Barge-in, speaker verification, subscriber profiles, and dynamic grammars should also be tested for accuracy with a variety of speakers and calling conditions.
*Introduction to the Nuance System, v8.5, pg. 72
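A minimal scoring harness for that comparison might look like the sketch below. The utterances and the matching rule (exact, case-insensitive string match) are illustrative assumptions; real evaluations typically use word-level error rates as well:

```python
# Illustrative sketch: scoring recognizer output against human transcriptions.
# A real harness would read these from recognizer call logs and a
# transcription file rather than inline lists.

def sentence_accuracy(results, transcriptions):
    """Fraction of utterances the recognizer got exactly right."""
    assert len(results) == len(transcriptions)
    correct = sum(
        r.strip().lower() == t.strip().lower()
        for r, t in zip(results, transcriptions)
    )
    return correct / len(results)

recognized  = ["large", "medium", "I said large", "small"]
transcribed = ["large", "medium", "large", "small"]
print(f"sentence accuracy: {sentence_accuracy(recognized, transcribed):.0%}")
```

Running the same scoring over subsets (male vs. female speakers, quiet vs. noisy lines) gives the per-condition breakdowns the testing plan calls for.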
Usability Testing*:
Conducted early in the design process, and also helpful at this stage to validate the performance of the application against the metrics laid out in the requirements phase.
*Nuance Project Method
Tuning and Monitoring:
Ongoing analysis of real caller interactions. This occurs during:
• pilot deployment (beta)
• post-deployment
• ongoing monitoring
Testing During the Lifecycle
[Diagram: testing activities (Usability, Recognition, Application, Performance, Tuning) mapped across the lifecycle phases Requirements, Design, Implementation, Testing, and Deployment.]
Usability Testing – A Key to Success
“Usability testing is sometimes confused with quality assurance (QA), but the two are very different. QA usually measures a product’s performance against its specifications. For example, QA on an automobile would ensure that the components function as specified, that the gaps between the doors and the body are within tolerances, and so forth. QA testing would not determine whether a vehicle is easy for people to operate, but usability testing would. In a speech application, QA ensures that the appropriate prompts do in fact play at the right times in the right order. This kind of testing is important, because designers generally shouldn’t assume that an application will work to ‘spec’. QA testing can tell us a great deal about a system’s functionality. But it can’t tell us if the target population for the application can use it, or will like to use it.”
- Blade Kotelly, The Art and Business of Speech Recognition, pg. 122

“Usability testing is just as important for simple DTMF applications as it is for complex NL (natural language) applications. In general, the more control the user has over the application, the more testing will be required and the more valuable this testing will be…. The subject is a complex one, and both designers and developers are encouraged to develop formal, documented test plans early in the product life cycle.”
- Bruce Balentine and David P. Morgan, How to Build a Speech Recognition Application, 2nd Edition, pg. 294
Recognition Testing - Useful Metrics
• First-time recognition rate:
o For a known good input prompt, what percentage of the time is the expected prompt heard back?
• Timeout and rejection rates:
o For timeout and invalid-input tests, how often is the correct behavior observed?
• Barge-in detection rate:
o When barging in at an acceptable time, what percentage of the time is the speech detected?
• Menu response latency:
o How long after the end of the input utterance does it take for the next prompt to begin?
Dialog State Testing ‘Dashboard’
Dialog State: GetPizzaSize

First Time Recognition Rate
FileName      Comments    RawData  Pct
Large.vce     Male        100/100  100%
Medium.vce    Female      98/100   98%
Personal.vce  Cell phone  96/100   96%

Response Time Data
Min      Avg       Max
0.3 sec  0.45 sec  1.2 sec

Barge In Success Data
Pause  RawData  Pct
0.5    0/50     0%
2.0    50/50    100%
4.0    50/50    100%

Error Handling
Error     RawData  Pct
Timeout1  50/50    100%
Timeout2  50/50    100%
Reject1   46/50    92%
Reject2   40/46    87%

Tester Comments
- Dialog state performs very well
- Still need to test universal behaviors (Help, Main Menu)
- Used clip ‘nothing.wav’ for Reject tests; around 10% of calls came up with Medium instead of the correct rejection
Application Testing
[Diagram: end-to-end system under test. Callers reach the IVR/speech platform through the telephony infrastructure (PSTN, PBX, ACD; T1, E1, PRI, SIP, H.323, …). The IVR/speech platform comprises telephony server(s), ASR/TTS server(s) (MRCP), and application server(s) (VoiceXML). CTI links the platform to agents, whose desktop app(s) sit in the application infrastructure alongside web servers and backend applications (CRM, etc.).]
Performance Testing - System Overview
Example configuration and vendors
[Diagram: example configuration showing a VoiceXML platform, call control/media server (gateway), ASR/TTS servers (MRCP), web server (VoiceXML 2.0, CCXML), CTI server (JTAPI, …), ACD, and network/PBX (T1, E1, PRI, SIP, H.323, …), with representative vendors for each component: Nuance, ScanSoft, IBM, Microsoft, Loquendo (ASR/TTS); Nortel, Avaya, Genesys, IVB, Edify, IBM, Aspect, Syntellect, Nuance, VoiceGenie (VoiceXML platform); Excel, AudioCodes, Voxeo, IVB, Cisco, Genesys, Avaya (gateway); BEA, IBM, Sun, Oracle, Microsoft, open source (web/application server); Avaya, Nortel, Intertel, NEC, Cisco, Siemens (PBX); Cisco, Avaya, Nortel, VegaStream (network); Genesys, Avaya, Nortel, Cisco, Apropos, II, Siemens (ACD/CTI).]
Performance Testing
Load Test Objectives
• Application can handle expected load
• Find system bottlenecks
• Find pre-failure indicators
• Understand recovery procedures

Compare recognition rates at increasing load levels:

Call Rate (CPH)  Correct Classification Rate
1000             98%
2000             98%
3000             98%
5100             97%
6300             65%
6600             45%
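One way to turn a table like this into a capacity figure is to take the highest tested load whose classification rate stays above an acceptance floor. The 95% threshold below is an invented criterion, not from the source:

```python
# Illustrative sketch: locating the capacity knee in load-test results.
# The (calls-per-hour, classification-rate) pairs mirror the table above.

RESULTS = [
    (1000, 0.98), (2000, 0.98), (3000, 0.98),
    (5100, 0.97), (6300, 0.65), (6600, 0.45),
]

def max_sustainable_load(results, floor=0.95):
    """Highest tested call rate whose classification rate stays at or above floor."""
    passing = [cph for cph, rate in results if rate >= floor]
    return max(passing) if passing else None

print(max_sustainable_load(RESULTS))  # 5100 CPH under the assumed 95% floor
```

The sharp drop between 5100 and 6300 CPH is the pre-failure indicator the objectives call out; iterating between those two points narrows down where the bottleneck sits.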
Considerations
• ‘Component’ load tests to isolate specific pieces
• Test lab or production?
• Emulate real-world call patterns
• Iterative testing allows ‘find and fix’
• Go beyond what you expect in production
Performance Testing - Key Metrics
• Customer perceived latency at each step: the time from the end of caller input to the beginning of the next response, which is ‘dead air’ to the caller
• Time to complete the call (call length)
• Transactional completion rate
• First-time recognition rate
• All of these metrics, relative to call load

Why are these important?
• Direct measures of the caller’s quality of experience
• Cost implications to the enterprise: the cost of variability, and self-service versus assisted help
• They quantify an otherwise subjective idea
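Latency metrics like these are usually reported as distributions rather than single averages. A minimal sketch, assuming nearest-rank percentiles and invented sample latencies:

```python
# Illustrative sketch: summarizing customer-perceived latency for one
# dialog state. The sample gaps (end of utterance to next prompt start,
# in milliseconds) are invented.
import math
import statistics

def percentile(samples, p):
    """Nearest-rank percentile of a list of samples."""
    ordered = sorted(samples)
    k = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[k - 1]

latencies_ms = [420, 450, 480, 510, 530, 560, 610, 700, 950, 2400]
summary = {
    "min": min(latencies_ms),
    "p5": percentile(latencies_ms, 5),
    "p95": percentile(latencies_ms, 95),
    "max": max(latencies_ms),
    "mean": round(statistics.mean(latencies_ms)),
}
print(summary)
```

Note how the single 2400 ms outlier drags the mean well above the typical call; reporting min/max alongside the 5th-95th percentile band exposes exactly that cost of variability.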
Performance Test Case Study
Production Management
• Tuning/monitoring with vendor tools
• Application monitoring
o 3rd-party tools for device/application monitoring
o Proactive call transactions
• Key metrics for customer experience
o Latencies
o Transactional errors
o Speech recognition success rates
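A proactive call transaction check might be run as a periodic script. In this sketch, place_test_call() is a hypothetical stand-in for whatever harness actually drives a scripted call, and the alert threshold is invented:

```python
# Illustrative sketch: a proactive test-call check for production monitoring.
import random
import time

def place_test_call():
    """Stand-in for a scripted test call; returns (succeeded, latency_sec)."""
    latency = random.uniform(0.3, 1.0)
    time.sleep(0)  # a real harness would block on telephony here
    return True, latency

def check_once(max_latency=2.0):
    """Run one synthetic transaction and classify the outcome."""
    ok, latency = place_test_call()
    if not ok:
        return "ALERT: transaction failed"
    if latency > max_latency:
        return f"ALERT: latency {latency:.1f}s exceeds {max_latency}s"
    return f"OK: {latency:.2f}s"

print(check_once())
```

Scheduled every few minutes, checks like this catch host and TTS failures between real caller complaints, feeding the latency and error charts below.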
Customer Perceived Latencies
[Chart: customer-perceived latency (0-30,000 ms) per dialog state: Time To Connect, Subscriber or Provider, Claim Inquiry, Inquiry Type, Enter Subscriber Number, JR or T, Enter DOB 1, Eligibility Info 1, Enter DOB 2, Eligibility Info 2. Bars show min/max and 5th-95th percentile ranges.]
Transaction Failures By Time of Day
[Chart: transaction failure counts (0-50) by hour of day (0-23, PDT), broken out into host failures and TTS failures; excludes retry calls.]
Review: Common Questions
Design:
• Can I just ‘speechify’ my DTMF apps?
• Should I allow DTMF input?
• What voice should we use?
• How personal should the application be?
• Should I allow barge-in?
• Which utterances should I allow for a recognition state?
• How do I handle error conditions?
• When do I transfer to an agent?
Delivery:
• How do I test speech?
• Do I have enough speech/TTS resources?
• Do I need to test with different accents?
• How do I do usability testing?
• Will VoIP impact my speech recognition accuracy?
• How do I verify TTS quality?
• How do I make sure it’s working after we go into production?
Review: You are about to deploy a new call center, with a new PBX, a speech-enabled IVR deployed on a VoiceXML architecture, post-routing CTI, and 200 agent stations with IP phones …
• What is the customer perceived latency for your IVR to respond to callers’ speech inputs?
• What is the average host connection latency for the IVR?
• What percentage of callers’ utterances are recognized the first time?
• What percentage of calls fail to be completed in the IVR because of application errors?
• How many calls fail to be routed to the correct agent or skill group?
• What is the average time it takes for screen pop to occur?
• What percentage of screen pops have missing or incorrect information?
• What percentage of screen pops never happen?
• What is the voice quality for the agent and caller?
• What is the impact on other users of your CRM system?
At 5 Calls/Minute? At 30 Calls/Minute? At Maximum Call Load?
Rob Edmondson, Empirix, Inc.
[email protected]