TRANSCRIPT
Avoiding the Pitfalls of Speech Application Rollouts
Through Testing and Production Management
Rob Edmondson, Senior Field Engineer, Empirix, Inc.
Overview
You are about to deploy a new call center, with a new PBX, a speech-enabled IVR deployed on a VoiceXML architecture, post-routing CTI, and 200 agent stations with IP phones …
• What is the customer perceived latency for your IVR to respond to callers’ speech inputs?
• What is the average host connection latency for the IVR?
• What percentage of callers’ utterances are recognized the first time?
• What percentage of calls fail to be completed in the IVR because of application errors?
• How many calls fail to be routed to the correct agent or skill group?
• What is the average time it takes for screen pop to occur?
• What percentage of screen pops have missing or incorrect information?
• What percentage of screen pops never happen?
• What is the voice quality for the agent and caller?
• What is the impact on other users of your CRM system?
At 5 Calls/Minute? At 30 Calls/Minute? At Maximum Call Load?
Why Should We Care? Call Queue Analysis
[Chart: call queue analysis across half-hour intervals through the day, plotting calls and capacity (0-2000, left axis) against average speed of answer in seconds (0-1000, right axis). Legend: Calls, Capacity, ASA.]
Business Goals Driving Self-Service…
[Chart: percentage of calls handled by agents versus handled by self-service (0-100%) across industries: stock/mutual funds, retail banking, credit card, mortgage, health insurance, telecom, and utilities. Source: Enterprise Integration Group, 2004.]
...While Quality Strategies Focused on Agents.
Speech Application Quality – Design and Delivery
Quality Evaluation Matrix
[Matrix: quality of customer experience plotted against Design (vertical axis) and Delivery (horizontal axis). The four quadrants: difficult to use / unpredictable behavior; easy to use / unpredictable behavior; easy to use / behaves as designed; difficult to use / behaves as designed.]
Common Questions When Deploying Speech
Design:
• Can I just ‘speechify’ my DTMF apps?
• Should I allow DTMF input?
• What voice should we use?
• How personal should the application be?
• Should I allow barge-in?
• Which utterances should I allow for a recognition state?
• How do I handle error conditions?
• When do I transfer to an agent?
Delivery:
• How do I test speech?
• Do I have enough speech/TTS resources?
• Do I need to test with different accents?
• How do I do usability testing?
• Will VoIP impact my speech recognition accuracy?
• How do I verify TTS quality?
• How do I make sure it’s working after we go into production?
Application Testing:
Dialog Traversal: creates and executes a series of test cases to cover all possible paths through the dialog, verifying that:
• the right prompts are played
• each state in the call flow is reached correctly
• the universal, error, and help behaviors are operational
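The traversal step can be sketched as a breadth-first enumeration of caller input sequences over a dialog graph. Everything below (the state names, utterances, and depth cap) is invented for illustration; it is not taken from any real application:

```python
# Illustrative sketch: enumerating test cases for dialog traversal.
# The dialog graph, state names, and inputs are hypothetical.
from collections import deque

# Each state maps a caller input to the next state; "end" terminates the call.
DIALOG = {
    "Welcome":      {"order": "GetPizzaSize", "help": "Help"},
    "GetPizzaSize": {"large": "Confirm", "medium": "Confirm", "help": "Help"},
    "Help":         {"continue": "GetPizzaSize"},
    "Confirm":      {"yes": "end", "no": "GetPizzaSize"},
}

def all_paths(start="Welcome", max_depth=6):
    """Enumerate every input sequence that reaches 'end' within max_depth steps."""
    paths, queue = [], deque([(start, [])])
    while queue:
        state, inputs = queue.popleft()
        if state == "end":
            paths.append(inputs)
            continue
        if len(inputs) >= max_depth:
            continue  # guard against loops (e.g. Help -> GetPizzaSize -> Help ...)
        for utterance, nxt in DIALOG[state].items():
            queue.append((nxt, inputs + [utterance]))
    return paths

for p in all_paths():
    print(" -> ".join(p))
```

Each enumerated path becomes one test call: play the scripted inputs, then check the prompt heard at every step.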
System Load: simulates a high inbound call volume to ensure that:
• the expected caller capacity can be handled
• proper load balancing occurs across the system
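A load step like the one described could be sketched with a thread pool standing in for concurrent callers; launch_call() here is a hypothetical placeholder for a real call generator (e.g. a bulk-call tool driving SIP or T1 channels), and the ramp schedule is invented:

```python
# Illustrative sketch: ramping up simulated call load and recording results.
import concurrent.futures
import random
import time

def launch_call(call_id):
    """Stand-in for one simulated caller; returns (call_id, completed, seconds)."""
    start = time.perf_counter()
    time.sleep(random.uniform(0.01, 0.05))  # placeholder for the real call
    return call_id, True, time.perf_counter() - start

def run_load_step(concurrent_calls):
    """Drive one load step and report the call completion rate."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrent_calls) as pool:
        results = list(pool.map(launch_call, range(concurrent_calls)))
    completed = sum(1 for _, ok, _ in results if ok)
    return completed / concurrent_calls

for step in (5, 10, 20):  # concurrent callers per step, ramping upward
    print(f"{step} concurrent calls: {run_load_step(step):.0%} completed")
```

A real harness would also record per-call latencies at each step, so degradation shows up before outright failures do.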
Speech Testing*
Recognition Testing: evaluates recognizer performance. Callers generate utterances by talking to the application, using test scripts, with:
• male and female speakers
• different dialects
• different noise conditions
Accuracy is measured by comparing the recognition results to a transcription of the utterances. Barge-in, speaker verification, subscriber profiles, and dynamic grammars should also be tested for accuracy with a variety of speakers and calling conditions.
*Introduction to the Nuance System, v8.5, pg. 72
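A minimal scoring harness for that comparison might look like the sketch below. The utterances and the matching rule (exact, case-insensitive string match) are illustrative assumptions; real evaluations typically use word-level error rates as well:

```python
# Illustrative sketch: scoring recognizer output against human transcriptions.
# A real harness would read these from recognizer call logs and a
# transcription file rather than inline lists.

def sentence_accuracy(results, transcriptions):
    """Fraction of utterances the recognizer got exactly right."""
    assert len(results) == len(transcriptions)
    correct = sum(
        r.strip().lower() == t.strip().lower()
        for r, t in zip(results, transcriptions)
    )
    return correct / len(results)

recognized  = ["large", "medium", "I said large", "small"]
transcribed = ["large", "medium", "large", "small"]
print(f"sentence accuracy: {sentence_accuracy(recognized, transcribed):.0%}")
```

Running the same scoring over subsets (male vs. female speakers, quiet vs. noisy lines) gives the per-condition breakdowns the testing plan calls for.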
Usability Testing*:
Conducted early in the design process, and also helpful at this stage to validate the performance of the application against the metrics laid out in the requirements phase.
*Nuance Project Method
Tuning and Monitoring:
Ongoing analysis of real caller interactions. This occurs during:
• pilot deployment (beta)
• post-deployment
• ongoing monitoring
Testing During the Lifecycle
[Diagram: testing activities (Usability, Recognition, Application, Performance, Tuning) mapped across the lifecycle phases Requirements, Design, Implementation, Testing, and Deployment.]
Usability Testing – A Key to Success
“Usability testing is sometimes confused with quality assurance (QA), but the two are very different. QA usually measures a product’s performance against its specifications. For example, QA on an automobile would ensure that the components function as specified, that the gaps between the doors and the body are within tolerances, and so forth. QA testing would not determine whether a vehicle is easy for people to operate, but usability testing would. In a speech application, QA ensures that the appropriate prompts do in fact play at the right times in the right order. This kind of testing is important, because designers generally shouldn’t assume that an application will work to ‘spec’. QA testing can tell us a great deal about a system’s functionality. But it can’t tell us if the target population for the application can use it, or will like to use it.”
- Blade Kotelly, The Art and Business of Speech Recognition, pg. 122

“Usability testing is just as important for simple DTMF applications as it is for complex NL (natural language) applications. In general, the more control the user has over the application, the more testing will be required and the more valuable this testing will be…. The subject is a complex one, and both designers and developers are encouraged to develop formal, documented test plans early in the product life cycle.”
- Bruce Balentine and David P. Morgan, How to Build a Speech Recognition Application, 2nd Edition, pg. 294
Recognition Testing - Useful Metrics
• First-time recognition rate:
o For a known good input prompt, what percentage of the time is the expected prompt heard back?
• Timeout and rejection rates:
o For timeout and invalid-input tests, how often is the correct behavior observed?
• Barge-in detection rate:
o When barging in at an acceptable time, what percentage of the time is the speech detected?
• Menu response latency:
o How long after the end of the input utterance does it take for the next prompt to begin?
Dialog State Testing ‘Dashboard’
Dialog State: GetPizzaSize

First Time Recognition Rate
FileName      Comments    RawData  Pct
Large.vce     Male        100/100  100%
Medium.vce    Female      98/100   98%
Personal.vce  Cell phone  96/100   96%

Response Time Data
Min      Avg       Max
0.3 sec  0.45 sec  1.2 sec

Barge In Success Data
Pause  RawData  Pct
0.5    0/50     0%
2.0    50/50    100%
4.0    50/50    100%

Error Handling
Error     RawData  Pct
Timeout1  50/50    100%
Timeout2  50/50    100%
Reject1   46/50    92%
Reject2   40/46    87%

Tester Comments
- Dialog state performs very well
- Still need to test universal behaviors (Help, Main Menu)
- Used clip ‘nothing.wav’ for Reject tests; around 10% of calls came up with Medium instead of the correct rejection
Application Testing
[Diagram: end-to-end system under test. Callers reach the IVR/speech platform through the telephony infrastructure (PSTN, PBX, ACD; T1, E1, PRI, SIP, H.323, …). The IVR/speech platform comprises telephony server(s), ASR/TTS server(s) (MRCP), and application server(s) (VoiceXML). CTI links the platform to agents, whose desktop app(s) sit in the application infrastructure alongside web servers and backend applications (CRM, etc.).]
Performance Testing - System Overview
Example configuration and vendors
[Diagram: example configuration showing a VoiceXML platform, call control/media server (gateway), ASR/TTS servers (MRCP), web server (VoiceXML 2.0, CCXML), CTI server (JTAPI, …), ACD, and network/PBX (T1, E1, PRI, SIP, H.323, …), with representative vendors for each component: Nuance, ScanSoft, IBM, Microsoft, Loquendo (ASR/TTS); Nortel, Avaya, Genesys, IVB, Edify, IBM, Aspect, Syntellect, Nuance, VoiceGenie (VoiceXML platform); Excel, AudioCodes, Voxeo, IVB, Cisco, Genesys, Avaya (gateway); BEA, IBM, Sun, Oracle, Microsoft, open source (web/application server); Avaya, Nortel, Intertel, NEC, Cisco, Siemens (PBX); Cisco, Avaya, Nortel, VegaStream (network); Genesys, Avaya, Nortel, Cisco, Apropos, II, Siemens (ACD/CTI).]
Performance Testing
Load Test Objectives
• Application can handle expected load
• Find system bottlenecks
• Find pre-failure indicators
• Understand recovery procedures

Compare recognition rates at increasing load levels:

Call Rate (CPH)  Correct Classification Rate
1000             98%
2000             98%
3000             98%
5100             97%
6300             65%
6600             45%
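One way to turn a table like this into a capacity figure is to take the highest tested load whose classification rate stays above an acceptance floor. The 95% threshold below is an invented criterion, not from the source:

```python
# Illustrative sketch: locating the capacity knee in load-test results.
# The (calls-per-hour, classification-rate) pairs mirror the table above.

RESULTS = [
    (1000, 0.98), (2000, 0.98), (3000, 0.98),
    (5100, 0.97), (6300, 0.65), (6600, 0.45),
]

def max_sustainable_load(results, floor=0.95):
    """Highest tested call rate whose classification rate stays at or above floor."""
    passing = [cph for cph, rate in results if rate >= floor]
    return max(passing) if passing else None

print(max_sustainable_load(RESULTS))  # 5100 CPH under the assumed 95% floor
```

The sharp drop between 5100 and 6300 CPH is the pre-failure indicator the objectives call out; iterating between those two points narrows down where the bottleneck sits.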
Considerations
• ‘Component’ load tests to isolate specific pieces
• Test lab or production?
• Emulate real-world call patterns
• Iterative testing allows ‘find and fix’
• Go beyond what you expect in production
Performance Testing - Key Metrics
• Customer perceived latency at each step: the time from the end of caller input to the beginning of the next response, which is ‘dead air’ to the caller
• Time to complete the call (call length)
• Transactional completion rate
• First-time recognition rate
• All of these metrics, relative to call load

Why are these important?
• Direct measures of the caller’s quality of experience
• Cost implications to the enterprise: the cost of variability, and self-service versus assisted help
• They quantify an otherwise subjective idea
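Latency metrics like these are usually reported as distributions rather than single averages. A minimal sketch, assuming nearest-rank percentiles and invented sample latencies:

```python
# Illustrative sketch: summarizing customer-perceived latency for one
# dialog state. The sample gaps (end of utterance to next prompt start,
# in milliseconds) are invented.
import math
import statistics

def percentile(samples, p):
    """Nearest-rank percentile of a list of samples."""
    ordered = sorted(samples)
    k = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[k - 1]

latencies_ms = [420, 450, 480, 510, 530, 560, 610, 700, 950, 2400]
summary = {
    "min": min(latencies_ms),
    "p5": percentile(latencies_ms, 5),
    "p95": percentile(latencies_ms, 95),
    "max": max(latencies_ms),
    "mean": round(statistics.mean(latencies_ms)),
}
print(summary)
```

Note how the single 2400 ms outlier drags the mean well above the typical call; reporting min/max alongside the 5th-95th percentile band exposes exactly that cost of variability.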
Performance Test Case Study
Production Management
• Tuning/monitoring with vendor tools
• Application monitoring
o 3rd-party tools for device/application monitoring
o Proactive call transactions
• Key metrics for customer experience
o Latencies
o Transactional errors
o Speech recognition success rates
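A proactive call transaction check might be run as a periodic script. In this sketch, place_test_call() is a hypothetical stand-in for whatever harness actually drives a scripted call, and the alert threshold is invented:

```python
# Illustrative sketch: a proactive test-call check for production monitoring.
import random
import time

def place_test_call():
    """Stand-in for a scripted test call; returns (succeeded, latency_sec)."""
    latency = random.uniform(0.3, 1.0)
    time.sleep(0)  # a real harness would block on telephony here
    return True, latency

def check_once(max_latency=2.0):
    """Run one synthetic transaction and classify the outcome."""
    ok, latency = place_test_call()
    if not ok:
        return "ALERT: transaction failed"
    if latency > max_latency:
        return f"ALERT: latency {latency:.1f}s exceeds {max_latency}s"
    return f"OK: {latency:.2f}s"

print(check_once())
```

Scheduled every few minutes, checks like this catch host and TTS failures between real caller complaints, feeding the latency and error charts below.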
Customer Perceived Latencies
[Chart: customer-perceived latency (0-30,000 ms) per dialog state: Time To Connect, Subscriber or Provider, Claim Inquiry, Inquiry Type, Enter Subscriber Number, JR or T, Enter DOB 1, Eligibility Info 1, Enter DOB 2, Eligibility Info 2. Bars show min/max and 5th-95th percentile ranges.]
Transaction Failures By Time of Day
[Chart: transaction failure counts (0-50) by hour of day (0-23, PDT), broken out into host failures and TTS failures; excludes retry calls.]
Review: Common Questions
Design:
• Can I just ‘speechify’ my DTMF apps?
• Should I allow DTMF input?
• What voice should we use?
• How personal should the application be?
• Should I allow barge-in?
• Which utterances should I allow for a recognition state?
• How do I handle error conditions?
• When do I transfer to an agent?
Delivery:
• How do I test speech?
• Do I have enough speech/TTS resources?
• Do I need to test with different accents?
• How do I do usability testing?
• Will VoIP impact my speech recognition accuracy?
• How do I verify TTS quality?
• How do I make sure it’s working after we go into production?
Review: You are about to deploy a new call center, with a new PBX, a speech-enabled IVR deployed on a VoiceXML architecture, post-routing CTI, and 200 agent stations with IP phones …
• What is the customer perceived latency for your IVR to respond to callers’ speech inputs?
• What is the average host connection latency for the IVR?
• What percentage of callers’ utterances are recognized the first time?
• What percentage of calls fail to be completed in the IVR because of application errors?
• How many calls fail to be routed to the correct agent or skill group?
• What is the average time it takes for screen pop to occur?
• What percentage of screen pops have missing or incorrect information?
• What percentage of screen pops never happen?
• What is the voice quality for the agent and caller?
• What is the impact on other users of your CRM system?
At 5 Calls/Minute? At 30 Calls/Minute? At Maximum Call Load?
Rob Edmondson, Empirix, Inc.
[email protected]