
FLIGHT TRANSPORTATION LABORATORY REPORT R87-2

AUTOMATED SPEECH RECOGNITION IN AIR TRAFFIC CONTROL

Thanassis Trikas

January 1987

AUTOMATED SPEECH RECOGNITION IN AIR TRAFFIC CONTROL

by

THANASSIS TRIKAS

Abstract

Over the past few years, the technology and performance of Automated Speech Recognition (ASR) systems have been improving steadily. This has resulted in their successful use in a number of industrial applications. Motivated by this success, a look was taken at the application of ASR to Air Traffic Control, a task whose primary means of communication is verbal.

In particular, ASR and audio playback were incorporated into an Air Traffic Control simulation task in order to replace blip-drivers, the people responsible for manually keying in verbal commands and simulating pilot responses. This was done through the use of a VOTAN VPC2000 continuous speech recognition system, which also possessed a digital recording capability.

Parsing systems were designed that utilized the syntax of ATC commands, as defined in the controller's handbook, in order to detect and correct recognition errors. Techniques whereby the user could correct any recognition errors himself were also included.

Finally, some desirable features of ASR systems to be used in this environment were formulated based on the experience gained in the ATC simulation task and parser design. These predominantly include continuous speech recognition, a simple training procedure, and an open architecture to allow the speech recognition to be customized to the particular task at hand, as required by the parser.

Acknowledgements

I would like to express my sincere gratitude and appreciation to the following people: Prof. R. W. Simpson, my thesis advisor, for his suggestions and guidance; Dr. John Pararas, for all of his help and encouragement with every stage of this work; my fellow graduate students, Dave Weissbein, Mark Kolb, Ron LaJoie and Jim Butler, for their help and friendship, both in and out of the classroom; and finally, my parents and brothers for their encouragement.

I would also like to thank the NASA/FAA Tri-University Program for funding this research.


Contents

Abstract

Acknowledgements

1 Introduction
  1.1 Motivation
  1.2 Application Areas
    1.2.1 ATC Command Recognition: Operational Environment
    1.2.2 ATC Command Recognition: Simulation Environment
  1.3 Outline

2 Automatic Speech Recognition
  2.1 Introduction
  2.2 How ASR Systems Work
  2.3 Recognition Errors
    2.3.1 Categorization
    2.3.2 Factors Affecting Recognition

3 ASR Systems Selected for Experimentation
  3.1 LIS'NER 1000
    3.1.1 Description
    3.1.2 Evaluation and Testing
    3.1.3 Recommendations
  3.2 VOTAN VPC2000 System
    3.2.1 Description
    3.2.2 Evaluation and Testing

4 ATC Simulation Environment: Command Recognition System Design
  4.1 ATC Simulation and Display
  4.2 Speech Input Interface
    4.2.1 ASR System
    4.2.2 User Feedback and Prompting
    4.2.3 Speech Input Parser
    4.2.4 Discussion
  4.3 Pseudo-Pilot Responses
  4.4 Discussion

5 Air Traffic Control Command Recognition: Operational Applications
  5.1 General Difficulties
    5.1.1 Recognition of Aircraft Names
    5.1.2 Issuance of Non-Standard Commands
  5.2 Application Specific Difficulties
    5.2.1 Digitized Command Transmission - Voice Channel Offloading
    5.2.2 Command Prestoring

6 Conclusions and Recommendations
  6.1 Summary
  6.2 Recommendations
  6.3 Future Work


List of Figures

2.1 Block Diagram of Generic ASR system

3.1 LIS'NER 1000 Functional Block Diagram
3.2 Commands used in preliminary ASR system evaluation
3.3 VAX display for command entry feedback
3.4 Example of word boundary mis-alignment due to misrecognition errors

4.1 Configuration of ATC Simulation Hardware
4.2 Icon used for display of fixes in the simulation display
4.3 Icon used for display of airports in the simulation display
4.4 Icon used for display of aircraft in the simulation display
4.5 Sample of the ATC Simulation Display on the TI Explorer
4.6 Display format including feedback for spoken commands
4.7 Example of the Finite State Machine logic for the specification of a heading
4.8 Superblock structure of the FSM implemented
4.9 Internal structure of the Aircraft Name Superblock
4.10 Internal structure of the Heading Command Superblock
4.11 Internal structure of the Altitude Command Superblock
4.12 Internal structure of the Airspeed Command Superblock
4.13 Airspeed Command Superblock maintaining original ATC syntax
4.14 Table of ATC Commands used in Pattern Matcher database
4.15 Flowchart of sequencing of VPC2000 functions

5.1 Duplex voice channel


List of Tables

1.1 Table of ICAO Phonetics

3.1 Table of typical words used for ASR evaluation

4.1 Table of discrete messages recorded for Pseudo-pilot response formulation



Chapter 1

Introduction

1.1 Motivation

Since airline deregulation, the amount of commercial air traffic has been steadily increasing. This increase has had two major repercussions.

First, as the amount of air traffic increases, the Air Traffic Control (ATC) system is rapidly approaching its saturation capacity. Thus, an ever increasing number of aircraft are being delayed, either on the ground at their originating airport, or in the air at their destination, until they can be accommodated by the ATC system. These delays, apart from being annoying from a traveler's point of view, are also the cause of increased fuel consumption and operating costs for aircraft waiting for take-off or landing clearance. Since current air traffic growth trends are expected to continue, a great deal of study is being devoted to techniques for increasing the capacity of the ATC system as well as utilizing the existing capacity more efficiently. These techniques, although they often only involve procedural changes, almost always introduce a heavy reliance on computers and automation. Thus the Air Traffic Controller will more and more be forced to interface with computers in the execution of his everyday tasks in an increasingly automated system[1].

Second, the amount of air traffic for which controllers are responsible is also increasing. This, in conjunction with the loss of skilled personnel arising from the PATCO strike of 1981¹, means that air traffic controllers are working harder now than ever before. It is an issue of great concern, since this increase in workload could possibly translate directly into a decrease in safety. In order to help alleviate this increase in workload, some airports increase the number of active controllers on duty during busy periods and give them each smaller sectors to control. Still, there are practical limits to this subdivision of sectors and to the number of controllers on duty, and for this reason a greater and greater emphasis is being placed on automation in the ATC environment in order to reduce workload.

¹Many people feel that only now is the ATC system beginning to return to the level of expertise and staffing that was prevalent before the PATCO strike.

While the number and scope of automation strategies are fairly broad, all of these have one common factor: the dissemination of information from a human operator, typically the controller, to a computer. It is here that the increase in automation places the greatest strain. Speech recognition is a means of alleviating this by providing a simpler controller-computer interface as well as performance improvements not possible with more conventional interfaces.

Current input modalities such as the keyboard, special function keys, or a mouse (with pull-down menus), although sufficient for a great number of tasks, can become somewhat awkward or clumsy in an ATC environment. This is because the primary means of information transfer in ATC is verbal. Thus, it is conceivable that in some situations information would have to be conveyed twice, once through speech for humans, and another time through keyboards for computers. For example, in today's semiautomated system, changes in flight plans or cruising altitudes have to be transmitted by the controller through the voice channel to the pilots as well as entered through the keyboard into the computer in order to maintain the veracity and integrity of the flight plan database. This type of redundancy will become even more acute as more automation is introduced into the ATC system, with obvious adverse effects on controller workload. The problem is more pronounced if the information must be entered in real time in order to, for example, reflect the current state of an aircraft or number of aircraft in the ATC system.

Even if we ignore these real-time strategies and the requirement of redundant information transfer, speech still has a large number of benefits over more conventional input modalities [2,3]. It is easier, simpler, less demanding and more natural than other, more conventional input modalities. Furthermore, it requires almost no training on the part of the user in its use². It also allows the controller to use his eyes or hands simultaneously for other tasks, thus potentially allowing for multi-modal communication strategies (i.e., simultaneous use of more than one input modality, such as keyboard and speech, or trackball and speech). The consequences of these factors are that the task may be performed faster or more accurately, or that an extra operator may no longer be required.

These, however, are not the only possible benefits. Some studies indicate that under certain circumstances, memory retention in tasks performed using speech is often better than in tasks using other input modalities. As well, speech is the highest capacity output channel of humans[4,5,6], yielding roughly a threefold (or more) improvement in data entry rates over a keyboard in problem solving tasks that require thinking and typing[2]. Thus, there are significant benefits to utilizing this channel in terms of operator workload reductions.

The goal of the work reported here is not to design an ASR system but instead to use an off-the-shelf system, applying it in the context of an ATC environment, in order to explore the potential benefits and problems in applying ASR to this environment. A secondary goal is to determine desirable features and requirements for an ASR system designed specifically for ATC. It is often lamented that one of the problems facing designers of ASR systems is that they do not have any specific criteria for their design (other than the obvious ones of low recognition error rates and delays)[2]. Granted, the required or desirable features may be dependent upon the exact ATC application, but it appears that there are some generalities that can still be made which could lead to an ASR system well suited for ATC applications as a whole.

1.2 Application Areas

Technically termed Automated Speech Recognition³, or ASR, the recognition of human speech by computers is a technology that is widely acclaimed as being "here". Although a lot of work is being done and still remains to be done, ASR has already moved out of the realm of pure research and is being used successfully in industry, where significant operational benefits have been accrued [7,8,9]. Thus, although each particular application should be analyzed in its own right in order to determine its specific benefits and pitfalls, the significant amount of real-world practical experience and success with this type of technology indicates that it is feasible. It is these successes and the rapidly advancing state-of-the-art technology that have motivated interest in ASR systems and how they can be used in an ATC environment⁴.

²In practice, however, some restrictions to the natural flexibility of speech must still be applied, as shall be shown later.

³The term speech recognition should not be confused with the term voice recognition. The latter deals with the recognition of a particular speaker based on his or her vocal patterns, while the former deals with the actual recognition of what the speaker is saying.

In general, tasks for which ASR should be considered are those which either cannot be accomplished using conventional methods such as the keyboard or trackball, or which are in some way being performed inadequately at present.

Initial applications of this type of technology in ATC would be in existing data-entry tasks. These entail replacing or complementing more conventional input modalities, principally keyboards, with ASR in areas where the sole function is the straightforward entry of data into the computer[3,14].

There have, for example, already been studies into the use of speech input to replace keyboards in the flight-strip entry and updating functions [15]. It was found that under certain traffic conditions, at even moderate traffic densities, it was possible for the controller responsible for maintaining this information to become overloaded. Thus, it would seem possible that data entry rates could be improved by using speech recognition. Although these studies demonstrated no significant difference in data entry rates over keyboards, ASR technology has advanced significantly since 1977 when the study took place, and as such it is likely that improvements are now possible. Of primary significance is the fact that the recognition system used was a discrete speech system, which is inherently slower than a continuous speech recognition system. Therefore, it seems plausible that a continuous speech recognition system would provide improvements in data entry rates. Regardless of this, it was found that the error rate for entering flight data using ASR was lower than that using a keyboard, indicating that there are indeed possible payoffs. In addition, even if no significant performance improvements can be realized, there is still the issue of which modality is preferred by controllers.

⁴Although some mention will be given to ASR in the aircraft cockpit, this work deals primarily with applications of ASR to the controller's task. The reader wishing further information on ASR in the cockpit is urged to consult, amongst others, the related articles [10,11,12,13].

What will be covered in this work is a broad range of applications that involve the recognition, by computer, of verbal controller commands currently directed towards aircraft pilots. This is for two reasons. First, this information, as shall be shown later, can be very useful when made available to automation systems; and second, ATC commands, by design, contain features that are similar to those that yield optimum ASR performance. These features are as follows.

First, since there is only one user of the ASR system at each ATC sector, a speaker dependent recognition system (to be explained in Chapter 2) can be used. This is advantageous because speaker dependent systems are inherently more accurate than speaker independent systems, which must recognize speech from a number of different speakers. Different controllers for any given sector can still be readily accommodated with this system simply by storing their speech data on a floppy or cassette and calling it up when they report onto the sector.

Second, the procedures used for communication between pilots and controllers are designed to reduce recognition errors made by communication over a possibly noisy radio channel. Thus, similar sounding words that are easily confused by humans (and thus even more likely to be confused by an ASR system) have been eliminated. This is exemplified by the use of the word "niner" instead of "nine" in order to reduce the likelihood of confusion with the word "five". As well, short words such as the letters of the alphabet, which are also very difficult to recognize correctly, have been replaced by the "zulu" or phonetic alphabet[16] (see Table 1.1).

Finally, and most importantly, the overall structure of ATC commands, in terms of their distinctness from one another and the rigid syntax that is used[16], coupled with the fact that the task is to recognize entire commands as opposed to individual words, implies that there is a lot of additional information that can be brought to bear to aid in the recognition process.


Table 1.1: Table of ICAO Phonetics

A  Alpha      N  November
B  Bravo      O  Oscar
C  Charlie    P  Papa
D  Delta      Q  Quebec
E  Echo       R  Romeo
F  Foxtrot    S  Sierra
G  Golf       T  Tango
H  Hotel      U  Uniform
I  India      V  Victor
J  Juliett    W  Whiskey
K  Kilo       X  X-ray
L  Lima       Y  Yankee
M  Mike       Z  Zulu

Thus, for example, the ATC command syntax can be used to constrain the input to only those words that are syntactically valid and thereby reduce errors. This, however, is not of help in detecting errors between two syntactically valid words, such as two different numbers. Although it might seem over-restrictive to rigidly enforce this command syntax, this is not so. In fact, during training, controllers are forced to adhere fairly closely to it, and most continue to do so throughout their careers.
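As an illustration of how such a syntax constraint could be applied in software, the sketch below restricts the recognizer's active vocabulary to the words that are valid at the current point in a command. It is only a minimal sketch under assumed names: the toy grammar covers a single "turn left/right heading" fragment and is not the parser or command syntax actually implemented in Chapter 4.

    # Minimal sketch (not the thesis parser): a toy next-word grammar for one
    # command fragment. All names and the grammar itself are assumptions.
    DIGITS = {"zero", "one", "two", "three", "four", "five",
              "six", "seven", "eight", "niner"}

    # each state maps to (words allowed next, state reached after one of them)
    GRAMMAR = {
        "start":     ({"turn"}, "turn"),
        "turn":      ({"left", "right"}, "direction"),
        "direction": ({"heading"}, "heading"),
        "heading":   (DIGITS, "digit1"),
        "digit1":    (DIGITS, "digit2"),
        "digit2":    (DIGITS, "done"),
    }

    def valid_next_words(words):
        """Return the set of words the recognizer should accept after `words`."""
        state = "start"
        for w in words:
            allowed, nxt = GRAMMAR[state]
            if w not in allowed:
                return set()          # input has already left the grammar
            state = nxt
        return GRAMMAR[state][0] if state in GRAMMAR else set()

    print(valid_next_words(["turn", "left"]))   # {'heading'}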

There is, however, additional information contained in the rest of a command that provides further capability for error detection and correction. For example, consider the vectoring command "TWA turn left heading zero one zero". If the recognized command is "TWA turn left heading zero five zero" and the aircraft's heading is 040, then the "turn left" would signal, to the pilot for example, that a mistake has been made somewhere and that clarification should be requested. This same information that is used by the pilot is also potentially usable by an ASR system.

Another example of this occurs if the word "descend" is not recognized in a "descend and maintain five thousand" command. Here, it is still clear, based on the rest of the input, what the desired action is, and this can potentially be used to infer what the unrecognized word was.
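A minimal sketch of this kind of cross-check is given below, assuming hypothetical aircraft-state values and function names; the actual error detection and correction logic used in this work is the parser described in Chapter 4.

    # Illustrative only: flag a contradiction between the spoken turn direction and
    # the recognized target heading, and infer a missing climb/descend keyword.
    def turn_direction(current_hdg, target_hdg):
        """Shorter turn direction ('left' or 'right') from current to target heading."""
        diff = (target_hdg - current_hdg) % 360
        return "right" if diff <= 180 else "left"

    def heading_command_plausible(current_hdg, spoken_direction, recognized_hdg):
        """False when, e.g., 'turn left heading zero five zero' is recognized while
        the aircraft is already heading 040 (a left turn would be the long way round)."""
        return turn_direction(current_hdg, recognized_hdg) == spoken_direction

    def infer_altitude_verb(current_alt_ft, target_alt_ft):
        """If 'climb' or 'descend' was not recognized, infer it from the altitudes."""
        return "descend" if target_alt_ft < current_alt_ft else "climb"

    print(heading_command_plausible(40, "left", 50))   # False -> request clarification
    print(infer_altitude_verb(8000, 5000))             # 'descend'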



In fact, the human listener uses this same type of information to aid in the recognition process. This was demonstrated by Pollack and Pickett [17], who showed that roughly 20 to 30 percent of the words from tape-recorded conversations cannot be understood by a given listener when they are played back individually in random order, even though every word was perfectly understood in the original conversation.

The actual ways in which the information made available through the recognition of Air Traffic Controller commands can be used are numerous and will be discussed in the following sections. They are loosely grouped into two classes, called "Operational Applications" and "Simulation Applications". These involve applications in the current or projected operational environment and in the simulation environment, respectively.

1.2.1 ATC Command Recognition: Operational Environment

By far the simplest application of Air Traffic Control Command Recognition, or ATCCR, would be to use it to provide a memory aid to controllers. Here, ATCCR could be used to recognize controller commands issued to pilots directly (without the need to type them in) and display them on a scrolling history of issued commands. This could be used by the controller to determine which commands have already been issued since, during high workload situations, it is possible to forget these. Because the ASR system would not be a direct part of ATC operations in this application, recognition errors would not have any significant effects on the controller's performance of his duties. Thus, this type of system could be used to generate data on recognition accuracy and error rates in a real-world ATC environment, as a precursor to the implementation of ATCCR for other tasks. As well, it would create a database of controller commands issued in a format readily readable by a computer and would thus allow for computer analysis of different aspects of controller operating procedure.

Once some practical experience has been gained with ATCCR, a far more ambitious application can be undertaken. This would be to use speech recognition to allow the computer system to listen in to the commands issued by the controller and the responses issued by pilots. This information, in conjunction with the other information available to the computer (such as radar tracks, minimum safe altitudes, restricted zones, etc.) could be used to provide a backup controller to catch any potentially dangerous situations that might be missed by the controller. The system would in effect provide for conformance monitoring and conflict alert.

This application, however, is extremely difficult since it requires not only very good speech recognizer performance, but also the integration of a large number of different technologies, including such fields as AI and Natural Language Understanding. This task is also greatly complicated by the difficulties involved in the recognition of pilot transmissions, arising not only from the noise and low bandwidth of the radio channel, but also from the great variability possible in pilot speech.

As mentioned previously, one of the primary motivating factors for the use of ASR relates to the increase in automation in the ATC environment. This increase in automation is not only occurring on the ground, but in the air as well. As more and more new aircraft progress to "digital" cockpits, it is becoming increasingly obvious that there would be significant benefits to linking these two systems together digitally, in a format that would allow direct communication between ground controller, ground computer, airborne computer and pilot, as opposed to verbally from controller to pilot as is the case now. This could be accomplished by using ATCCR to first recognize the controller's commands and then transmit a representation of these to the specified aircraft⁵. Once received by the aircraft, they could then be reproduced, for presentation to the pilot, either aurally, using for example a speech synthesizer, or visually, using a standard display.

With this configuration, one can easily envision a future system where the flight director in an aircraft would receive commands directly from the controller and his computer and then, pending acknowledgment and verification from the pilot, execute them[10].

The benefits that can be accrued with this digital link between air and ground are numerous. Most importantly, message intelligibility could be enhanced significantly. Currently, commands transmitted over a noisy and often over-used radio channel are somewhat difficult to make out and often result in errors made by pilots. Messages transmitted in a digital format, however, are less likely to be corrupted by noise. Even if they are, checks can easily be made as to their integrity. Furthermore, since these commands are now in a format where they can be readily manipulated by computer, they can be made available for recall by the pilot in order to avoid the need for the controller to repeat or re-issue commands should the pilot forget them.

⁵Although the exact method by which this transmission would take place remains to be seen, it could be accomplished using the digital communication capability made possible by Mode S.

The increased use of this digital link would also greatly off-load the voice channel. This would improve the intelligibility of any verbal communications made using it, as well as increase the effective bandwidth of the controller-pilot communication channel as a whole.

Although more conventional technology in the form of vocoders⁶ could also be used for command digitization, again without requiring the controller to key in his command, ASR possesses significant advantages over these. First, current vocoders operate at a minimum of about 1200 baud. ASR systems, however, can reduce this to something on the order of 200 baud if straight recognized text is transmitted, and even lower if this text is further compressed. Thus, much less of a strain on the bandwidth of the digital link is incurred using an ASR system.
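As a rough, back-of-the-envelope illustration of the bandwidth figures quoted above (the sample command and timing below are invented for the example, not measurements from this work):

    # A typical clearance sent as plain recognized text, 8 bits per character.
    command = "united four sixty one turn left heading two seven zero"
    duration_s = 4.0                            # assumed time to speak the command
    bits = len(command) * 8                     # a few hundred bits, uncompressed
    print(round(bits / duration_s), "bits/s")   # on the order of 100 bits/s,
                                                # comfortably under the ~200 baud cited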

Second, since vocoders simply sample the speech waveform, the only way of recreating and presenting the digitized information transmitted is to reproduce the original audio signal. This, however, results in a format that is unusable by a computer and as such, the issued command could not be displayed visually to the pilot, or used in any future automation functions, unless it was also keyed into the computer.

In a command digitization and transmission application scenario such as the one described earlier, another problem, arising from the mix of "digital" and conventional aircraft, is created. In particular, it would be necessary for controllers to keep track of which aircraft must be "spoken" to and which can have keyed commands/information sent to them. If ASR is used, possibly in conjunction with keyboards and/or other input modalities, then it would be possible, if he so desires, for the controller to issue commands verbally to all of the aircraft. It would then be up to the computer to determine the capabilities of the aircraft being referred to. If it possesses "digital" capability, then the verbal message would be sent digitally. If not, the message could be sent verbally over the radio link, either through reproduction by a speech synthesizer of some sort (with advantages in terms of a distinct voice over a possibly cluttered radio channel, but disadvantages in terms of intelligibility) or by replaying a recording of the message made as it was previously said by the controller.

⁶Systems that sample a speech waveform and compress it for more efficient transmission.

A further application of ATCCR, although one for which ATCCR is not essential, is to allow for the prestoring and automatic (or semi-automatic) issuance of clearances to aircraft. In practice, the controller can often anticipate what clearances should be issued to aircraft, often minutes in advance. Thus, with an ATCCR system, he could pre-store these clearances for issue later by the computer and divert some of his attention to other tasks. These clearances could be transmitted, as described earlier for the digital cockpit scenario, either digitally or verbally depending on aircraft capabilities.

The actual issuance of these clearances could be accomplished either by simply recalling the clearance from the computer when it is desired, or by having the computer automatically monitor the specified aircraft to determine when it should be issued. Granted, the controller might want to validate or acknowledge every command sent to the aircraft by the computer, but this could be done by simply prompting the user whenever it is time to transmit a command and asking if the command should still be issued. In this way, the ultimate responsibility still lies with the ATC controller.

1.2.2 ATC Command Recognition: Simulation Environment

Before any of the aforementioned real-world applications of Air Traffic Controller command recognition can be studied and evaluated, a facility to demonstrate them in an environment typical of Air Traffic Control should be available. For this reason, one of the initial goals of this work was the incorporation of ASR into an existing real-time ATC simulation.

In investigating the configuration of this simulation, it was readily apparent that it was, in itself, an ideal application of ATC command recognition. In particular, current real-time ATC simulations use what are called pseudo-pilots or blip drivers. These are people whose task it is to translate verbal commands from the subject controller involved in the simulation into typed commands that are keyed into a computer. The use of these people adds to the cost and complexity of the experiments using the simulation. ASR would allow these people to be replaced by a direct data path from the subject controller to the simulation computer. Granted, this is not likely to have the flexibility that is available with a human blip driver, but the advantages in terms of cost and manpower requirements could possibly outweigh this.

The other function often performed by blip-drivers is the simulation of pilot responses to the controller, in order to add to the fidelity of the simulation for certain ATC research. This function can also be replaced by the computer by using computer-generated verbal responses. This technology is much more mature and does not pose the same types of technological problems posed by ASR. Computer-generated responses can be produced in basically two ways. The first is through a rule-based text-to-speech synthesizer. This takes written text and, through a series of often empirically derived algorithms or rules, specifies the output of an electronic sound synthesis circuit in order to generate an imitation of human speech. The major drawbacks of this system are perception or intelligibility problems that arise due to the flat monotone and "robot" like quality of the speech output. Most systems available, however, provide at least some ability for the user to specify stress and intonation patterns, in order to make the output more intelligible (see related articles [18,19,20,21,22,23]). Such a system's advantages lie in its flexibility, in that there is no requirement to know beforehand exactly what words or phrases the system will be required to say.

The second method for simulating pilot responses utilizes pre-recorded messages and plays them back in a specified order as required by the user. An example of this kind of system is the response when requesting a telephone number from Information. This technique results in messages that are more intelligible than text-to-speech because they are, in fact, simply tape or digital recordings of human speech. It is, however, far less flexible in that all possible messages must be recorded beforehand. As well, it requires a lot of memory to store these pre-recorded messages in the computer, although there are a number of techniques to reduce this [24,25]. Furthermore, there is the difficulty of introducing intonation and emphasis into the speech output, since this requires that all possible occurrences of these be predicted beforehand and suitable messages recorded.


Recognition errors can be handled much more easily in the simulation environment. If one occurs, the computer can simply respond with a verbal "Say again." message, thus adding even more to the realism of the simulation. Note that recognition error rates must be kept fairly low, since a system that responds with "Say again." too often is not very practical.

1.3 Outline

As has been shown, there is quite a variety of possible applications arising from the use of ASR to recognize Air Traffic Controller verbal commands. Some of these, although still potentially useful, might turn out to be impractical, especially in light of current technology. Thus, this work will explore these applications in more detail.

Before commencing on a description of the work performed and the results obtained, it is important to first define some terms and concepts dealing with ASR systems. These are covered in Chapter 2. The reader wishing more detailed information is urged to consult the references.

In Chapter 3, the ASR systems that were purchased and used are described, with particular reference to the features that were found to be desirable.

The work performed was concerned predominantly with the development of both speech input and output capabilities for an ATC terminal area simulation. In Chapter 4, the implementation of this task is detailed. In particular, the design of a system for ATC command recognition (ATCCR) will be presented. Integral to this are such things as the methods used to incorporate ATC syntax requirements and constraints, as well as the handling of recognition errors, both their detection and correction.

In Chapter 5, some of the operational applications of ATCCR alluded to earlier are revisited and analyzed in greater detail as to their feasibility and possible shortcomings, based on the experience gained from the simulation application.

Finally, Chapter 6 outlines the conclusions of this work, along with suitable recommendations for further work, both with the existing hardware as well as with the new-technology ASR systems that are currently appearing on the market.


Chapter 2

Automatic Speech Recognition

2.1 Introduction

The basic purpose of a speech recognition system is to recognize verbal input from some pre-defined vocabulary and to translate it into a string of alphanumeric characters. Before beginning a description of how these systems work, however, it is first important to present some general categorizations of the systems that are available. In general, ASR systems can be categorized based on three different features and capabilities. These are:

1. Speaker Dependence/Independence

2. Discrete/Connected/Continuous Speech Recognition

3. Vocabulary Size

The first of these deals with whether or not the system is designed to be used by only one speaker at a time. If it is speaker dependent, then it must be trained to a particular user's vocal patterns, typically by having him repeat to the system all of the words that it is desired to recognize. With speaker independent systems, there is no need for this extensive training procedure, because some basic information about how the words in the vocabulary are spoken is usually incorporated directly into the system. In general, however, the speaker dependence distinction is one that is closely related to accuracy. A speaker dependent system can be made somewhat speaker independent simply by having multiple users train the system to their voices. Thus, it would possess data from a spectrum of speakers and should in theory be able to recognize speech from any speakers with roughly similar vocal patterns. If it possessed sufficient accuracy, then it could be termed speaker independent. Its accuracy, however, would tend not to be as good as that of a system that was explicitly designed for speaker independence.

The second categorization deals with the type of speech input that is allowable. For discrete speech recognition systems, it is assumed that the words or utterances contained in the vocabulary will be spoken with a brief period of silence in between. This period of silence is typically on the order of 150 to 200 msec long and is used to delineate utterances, thereby allowing a simpler and more accurate recognition algorithm to be implemented.
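A minimal sketch of the kind of silence-based end-pointing this implies is shown below; the frame size, energy threshold, and silence window are assumptions for the example, not parameters of any system used in this work.

    # Illustrative energy-based utterance delineation for a discrete-word recognizer.
    import numpy as np

    def find_utterances(samples, rate=8000, frame_ms=10,
                        energy_thresh=1e-3, silence_ms=200):
        """Split a signal into (start, end) sample ranges separated by long silences."""
        frame_len = rate * frame_ms // 1000
        silence_frames = silence_ms // frame_ms
        utterances, start, quiet = [], None, 0
        for i in range(0, len(samples) - frame_len, frame_len):
            frame = np.asarray(samples[i:i + frame_len], dtype=float)
            loud = np.mean(frame ** 2) > energy_thresh
            if loud:
                if start is None:
                    start = i                   # utterance begins
                quiet = 0
            elif start is not None:
                quiet += 1
                if quiet >= silence_frames:     # enough silence: utterance ends
                    utterances.append((start, i))
                    start = None
        if start is not None:
            utterances.append((start, len(samples)))
        return utterances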

These utterances are in general not restricted to being single words. In fact, they can be entire phrases. Individual words contained in these phrases, however, cannot be recognized unless they have been trained as such.

Connected word recognition systems, however, impose fewer restrictions on the user, in that these periods of silence need not occur after every word. Thus, the user can run words together during his speech. Every so often, however, a pause must still be included (the actual recognition of the speech does not commence until this is detected). This is not much of a problem, since normal speech tends to be liberally sprinkled with these pauses.

Continuous speech recognition systems provide the most flexibility in how the user speaks. With them, there is no requirement for the user to pause anywhere during speech input. Unlike connected systems, recognition is performed as the words are spoken. Thus, it is possible for words at the beginning of a stream of continuous speech to be recognized before the user has finished talking or has paused.

The third categorization deals with how many words can be recognized by the ASR system. This varies a great deal from system to system. In general, the limiting factor in vocabulary size is the inherent accuracy of the recognition system. The higher the accuracy, the larger the vocabulary possible. This is why speaker independent systems, which as a rule possess lower recognition accuracies than comparable speaker dependent systems, have smaller vocabularies.

With some systems, every word to be recognized must be explicitly trained by the user. This can be very cumbersome and time consuming for large vocabularies, and thus limits the practical size of the vocabulary to roughly 100 words or so. Other systems, however, use training procedures that do not require each word to be explicitly trained. With these systems, the vocabulary size is often in the hundreds or thousands of words.

In defining vocabulary size, there is another factor to consider. This involves the capability of some systems to activate only certain sections of the entire vocabulary. Hence, a better indicator of performance is the size of the active or instantaneous vocabulary. Clearly, the larger the active vocabulary, the greater the likelihood of recognition errors and the larger the recognition delays, since more comparisons must be performed.

2.2 How ASR Systems Work

Although the actual details of how ASR systems work vary a great deal from system to system, their basic internal structure is very similar. It consists of a Feature Extractor, a Recognizer and a Vocabulary Database, as indicated in Figure 2.1.

The feature extractor is basically responsible for analyzing the incoming speech input signal and extracting data from it in a format that can be used by the recognition algorithm. The recognition algorithm then takes this data and compares it to data in the vocabulary database in order to determine which, if any, word was said.

The vocabulary database contains all of the words that can be recognized by the ASR system. It is created by having the user train or enroll onto the system, or through purely theoretical means. In the simplest training procedure, the user simply repeats, a given number of times, all of the words contained in the vocabulary, so as to provide the ASR system with information as to how these words "look" when spoken. This information is then used to create a set of reference patterns or templates, each one describing a particular word. Other training procedures, however, are much simpler and only require the user to read a few paragraphs of text aloud, from which the ASR system extracts information about how the user articulates his words and, in this way, generates the required templates.

Figure 2.1: Block Diagram of Generic ASR system. (Microphone signal → Feature Extractor → Recognizer, which consults the Vocabulary Database to produce the recognized output.)

The extraction from the input signal of the data used to generate these templates is the primary responsibility of the Feature Extractor. The simplest form of feature extraction is to sample the incoming speech signal. Since the bandwidth of human speech is roughly 4 kHz, this implies a sampling rate of at least 8 kHz. At this rate, 8 kbytes of data (assuming 8-bit quantization of the data) are produced for every second of speech. This creates serious problems both in terms of memory requirements and recognition delays (it takes a long time to process this much data). For this reason, alternative techniques are used in order to reduce the data rates required.

One of the simplest and most successful of these takes advantage of the fact that the frequency spectrum of the speech signal, although it varies in time, does not vary quickly. Thus, if the signal is passed through a bank of bandpass filters in order to determine its spectrum, the filter outputs can be sampled at rates much slower than 8 kHz (typically at rates near 100 Hz).
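A minimal sketch of such a filter-bank feature extractor is given below; the band edges, filter order, and 100 Hz frame rate are assumptions chosen for illustration.

    # Illustrative only: band energies sampled at a low frame rate.
    import numpy as np
    from scipy.signal import butter, lfilter

    def filterbank_features(samples, rate=8000, frame_rate=100,
                            bands=((100, 500), (500, 1000),
                                   (1000, 2000), (2000, 3500))):
        """Return one feature vector (energy per band) every 1/frame_rate seconds."""
        samples = np.asarray(samples, dtype=float)
        frame_len = rate // frame_rate
        n_frames = len(samples) // frame_len
        features = np.zeros((n_frames, len(bands)))
        for b, (lo, hi) in enumerate(bands):
            coeffs = butter(4, [lo / (rate / 2), hi / (rate / 2)], btype="band")
            filtered = lfilter(*coeffs, samples)
            for t in range(n_frames):
                frame = filtered[t * frame_len:(t + 1) * frame_len]
                features[t, b] = np.mean(frame ** 2)   # smoothed band energy
        return features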

Another successful technique to reduce data rates is to use Linear Predictive Coding, or LPC [26]. Here, an estimate is made of the present value of the input signal based on a linear combination of the last n values, in conjunction with an all-pole model of the vocal tract. The output of this system is then related to the coefficients that minimize the estimation or prediction error, and again produces data at a rate of roughly 100 Hz.
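The sketch below shows the core of this idea, with per-frame LPC coefficients obtained by solving the autocorrelation (normal) equations; the frame length and model order are assumptions for the example.

    # Illustrative only: predict x[t] from the previous `order` samples and keep the
    # coefficients that minimize the prediction error, one vector per 10 ms frame.
    import numpy as np

    def lpc_frame(frame, order=10):
        frame = np.asarray(frame, dtype=float)
        frame = frame - frame.mean()
        r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]   # autocorrelation
        R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
        R = R + 1e-9 * np.eye(order)          # tiny regularization for silent frames
        return np.linalg.solve(R, r[1:order + 1])

    def lpc_features(samples, rate=8000, frame_rate=100, order=10):
        """Roughly frame_rate coefficient vectors per second of speech."""
        frame_len = rate // frame_rate
        return np.array([lpc_frame(samples[i:i + frame_len], order)
                         for i in range(0, len(samples) - frame_len, frame_len)])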

In some systems, this data is further processed to produce a more compact representation. For example, in vector quantization, each data sample (usually a vector of data) is compared to a set of standard reference frames and replaced by a symbol associated with the frame that best matches it. Thus, the output of the feature extractor can be transformed into a sequence of symbols.
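A minimal sketch of this quantization step is shown below; the codebook here is random for the sake of the example, whereas a real system would train it from speech data.

    # Illustrative only: replace each feature vector by the index of the nearest
    # codebook entry, turning the feature stream into a stream of symbols.
    import numpy as np

    rng = np.random.default_rng(0)
    codebook = rng.normal(size=(64, 12))       # 64 reference frames, 12-dim features

    def quantize(features, codebook):
        dists = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
        return dists.argmin(axis=1)            # one symbol per frame

    symbols = quantize(rng.normal(size=(200, 12)), codebook)
    print(symbols[:10])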

In a slightly more complex system, the incoming speech signal undergoes more extensive processing in order to recognize the actual phonemes that it contains. Phonemes are the different sounds that are made during speech (e.g., "oo", "ah", and so on) and form a set of basic building blocks for speech. Both the number and the actual phonemes themselves differ somewhat from language to language, but they are relatively constant for a given language. There are roughly forty different phonemes contained in the English language [24]. Thus, the speech signal can be characterized by roughly forty different symbols, and the data rates generated by the feature extractor are very low, on the order of 50-100 Hz. Feature extractors of this sort are potentially more accurate in that they try to extract the same features from the speech signal that the user consciously tries to reproduce when he says a word.

In any case, no matter what the data output by the feature extractor actually represents, there is a lot of similarity in how it is subsequently processed.

In a discrete speech recognition system, for example, the data output by the feature extractor is saved in a buffer until the end of a word, as indicated by a short period of silence, is detected. This yields a matrix of data, assuming that the feature extractor outputs data in the form of frames or vectors, with one axis corresponding to time. This matrix, or pattern, is then compared to patterns contained in the Vocabulary Database that were generated in a similar manner while the user was training the system. Here, a problem is readily evident. Because different words, and even different vocalizations of the same word, have different lengths, a common reference for comparison must be found. A simple way to accomplish this is to time-normalize the patterns, so that they are of uniform length, by merging adjacent frames together or by interpolating between them as required. If this is done uniformly along the length, or time axis, of the matrix or pattern, then it is termed linear time-normalization.

Now that the two patterns are the same length, they can be compared to determine whether or not they match. The actual method of doing this again varies from system to system. The simplest involves measuring the norm of the difference between the two matrices in order to compute the distance between them. The most common norm used is the Euclidean or 2-norm, which is simply the square root of the sum of the squares of each entry in the difference matrix. There are, in addition, other more complicated and computationally intensive distance measures, but these are described elsewhere [26,24,27]. If the distance between these two patterns is lower than a prespecified threshold, then a match is declared. This threshold test prevents random noises such as doors slamming and phones ringing from creating false recognitions.
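A minimal sketch of this matching step, combining linear time-normalization, the Euclidean distance, and a rejection threshold, is given below; the template length and threshold value are arbitrary assumptions.

    # Illustrative only: score an input pattern against stored word templates.
    import numpy as np

    def time_normalize(pattern, target_len=40):
        """Linearly stretch or compress a (frames x features) pattern to target_len frames."""
        pattern = np.asarray(pattern, dtype=float)
        src = np.linspace(0, len(pattern) - 1, target_len)
        idx = np.arange(len(pattern))
        return np.stack([np.interp(src, idx, pattern[:, f])
                         for f in range(pattern.shape[1])], axis=1)

    def recognize(input_pattern, templates, threshold=50.0):
        """Best-matching word, or None if nothing passes the threshold test."""
        x = time_normalize(input_pattern)
        best_word, best_dist = None, np.inf
        for word, template in templates.items():
            d = np.linalg.norm(x - time_normalize(template))   # Euclidean (2-norm)
            if d < best_dist:
                best_word, best_dist = word, d
        return best_word if best_dist < threshold else None    # reject noise-like input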

The performance of systems utilizing linear time-normalization, however, decreases dramatically as the size of the vocabulary increases (they are typically confined to vocabulary sizes on the order of 20 to 40 words). This is primarily because the time-normalization procedure obscures a lot of the features of the spoken word. Furthermore, if one examines an utterance closely, it can be seen that when its length changes, it does not do so uniformly (linearly) along the length of the word. Consider the word "five". If the duration of this word is increased as it is spoken, it can be seen that most of the stretching occurs in the "i" sound and not the "f" and "v". In order to more readily account for this phenomenon, a non-linear time-normalization technique is used. With this technique, time normalization is accomplished by aligning features found in the reference and input patterns in such a way as to obtain the best match. Since the number of possible non-linear time alignments can be quite numerous, dynamic programming techniques are used to eliminate some of these and thereby reduce the computational complexity of the algorithm. For this reason, this technique is often termed Dynamic Time Warping (DTW).
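The dynamic-programming recurrence at the heart of DTW can be sketched as below; this is a textbook form of the algorithm, not the specific implementation used by any of the systems described in this work.

    # Illustrative only: minimum total frame-to-frame distance over all monotonic
    # alignments of two (frames x features) patterns.
    import numpy as np

    def dtw_distance(a, b):
        a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
        n, m = len(a), len(b)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = np.linalg.norm(a[i - 1] - b[j - 1])
                # extend the cheapest of: match both, stretch a, stretch b
                D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
        return D[n, m]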

DTW not only yields greatly improved results for discrete speech, but it can readily be extended to both connected and continuous speech recognition. The process of extending this procedure to connected speech is fairly straightforward and consists of finding a similar non-linear alignment, but this time relating the entire spoken input phrase to a super-pattern consisting of the "best" sequence of reference patterns. This "best" sequence of words is then the recognized phrase.

This procedure is modified in continuous speech recognition systems so that non-linear time alignments between reference and input templates are calculated as the speech comes in. In this way, when the score of the comparisons using this alignment dips below a certain threshold, a match is declared and the normalization procedure is begun anew from the point where this current matched pattern ended¹. Thus, the principal difference between this and connected speech recognition lies in where recognition events occur.

¹A good explanation of this procedure is given in [28].

With these systems, however, some problems arise due to the differences between words when spoken in discrete as opposed to continuous or connected speech. First of all, words tend to be shorter when spoken as part of a continuous stream than when spoken individually. The resulting differences in length are sometimes too excessive for the recognition algorithm to handle, and thus errors result. Furthermore, the actual articulation of adjacent words is sometimes changed significantly due to the slurring together of words. This phenomenon, termed co-articulation, often results in sounds that were not part of either individual word. A good example of this occurs when the words "did you" are spoken quickly. The result sounds more like "dija" than anything else and, unless allowances are made for this, it is certain to cause recognition errors.

In order to attempt to take this into account, some systems use what is termed embedded or in-phrase training. With this, vocabulary words are trained as part of a stream of continuous speech in order to include co-articulation effects at word boundaries. This, however, is not very general, since these effects are to a great degree dependent on exactly what the surrounding words are, and it is unrealistic to train for all possible word combinations.

It is in order to account for some of these variations in how words are said that other procedures are being used as well². These include such techniques as statistically based DTW [29] as well as a process known as Hidden Markov Modeling (HMM) [30,31].

In a standard Markov Model, the various vocalizations of a word are used in order to construct a finite state machine type structure, where each state is associated with a particular data frame or feature and each branch with the probability of receiving that feature. With HMM, however, this one-to-one correspondence between states and features is eliminated. Instead, each state is probabilistically associated with a number of features. Since assumptions are no longer made about exactly which features are required in the input, and the actual feature (or state) sequence is hidden, the number of states that are required to represent an utterance can be reduced without a large degradation in performance. This also allows for some errors to be made during the feature extraction process. The training procedure for a system using HMM, however, is quite time consuming, since quite a few repetitions of each word must be used in order to evaluate the probabilities associated with each of the branches of the finite state machine.

²Only the gross variabilities have been presented so far. There are other, somewhat smaller, but potentially equally significant sources of variability, and these will be discussed in Section 2.3.2.
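As a concrete illustration of how such a word model would score an input, the sketch below runs the standard forward algorithm over a sequence of quantized feature symbols; the tiny three-state model and its probabilities are made-up numbers, not trained values.

    # Illustrative only: the word model giving the highest forward probability over
    # the utterance would be reported as the recognized word.
    import numpy as np

    A = np.array([[0.6, 0.4, 0.0],     # state transition probabilities
                  [0.0, 0.7, 0.3],
                  [0.0, 0.0, 1.0]])
    B = np.array([[0.7, 0.2, 0.1],     # per-state probability of emitting symbol 0, 1, 2
                  [0.1, 0.8, 0.1],
                  [0.2, 0.1, 0.7]])
    pi = np.array([1.0, 0.0, 0.0])     # always start in the first state

    def forward_score(symbols, A, B, pi):
        """Probability that this word model produced the observed symbol sequence."""
        alpha = pi * B[:, symbols[0]]
        for s in symbols[1:]:
            alpha = (alpha @ A) * B[:, s]
        return alpha.sum()

    print(forward_score([0, 0, 1, 1, 2, 2], A, B, pi))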


Thus, it can be seen that there are quite a few different techniques for recognizing speech. This section has presented a brief introduction to some of these in order to provide the reader with some necessary background. If more detailed information is desired, then additional references should be consulted.

2.3 Recognition Errors

2.3.1 Categorization

Any speech recognition system, even a human, is certain to make at least some recognition errors. The difference between systems lies, however, in the rate at which these errors occur and in the inherent capabilities for recovering from them and correcting them. In general, recognition errors fall into three major categories.

1. Mis-recognition errors are those in which one word is mistaken for another. This also implies that the comparison successfully passed any recognition threshold tests.

2. Non-recognition errors are those in which spoken words that are members of the current active vocabulary are not recognized at all. This is usually because the utterance, when spoken, is sufficiently different from the trained version that the recognition threshold is not passed. Some reasons why these differences arise will be discussed later.

3. Spurious-recognition errors are those in which the ASR system indicates that a word was spoken when in fact none was. These typically arise when extraneous noises are mistaken for speech.

In general, the occurrence of these types of errors is very dependent on both the particular recognition system that is being used as well as its operating parameters, such as recognition thresholds.

2.3.2 Factors Affecting Recognition

More important than the types of recognition errors are the questions of how and why they occur and what affects their frequency. It is by understanding these issues that error rates can be reduced. In general, since ASR systems are simply pattern matchers, it is obvious that anything creating differences between these patterns as they are trained and as they are produced during speech will increase the error rates.

These factors can range from stress, high workload, and nervousness on the part of the user when speaking to the system, to day-to-day variations in his speech, possibly arising from such things as fatigue and colds. Some systems try to counter some of these day-to-day variations by putting the speaker through a short enrollment session each time he begins to use the system. In this, the user simply reads a short phrase prior to using the system, in order to allow the recognition system to adapt to how he is speaking that particular day or session.

Other factors affecting recognition accuracy include environmental or background noise. This can affect recognition accuracy in three basic ways. First, it can lead to spurious recognitions through the ASR system's mistaking these sounds for valid speech input. Second, users might be forced to change their articulation in order to compensate for, and be heard over, background noises. Third, the noise might actually corrupt the speech signal itself and mask a lot of information. While some systems simply use a noise-canceling microphone to counteract these, this is sometimes not sufficient, and techniques that account for background noise more directly must be incorporated into the recognition algorithm itself.

The type of microphone used can have even further effects on recognition accuracy [37,38]. In particular, microphones with a poor frequency response, or a highly non-linear or time-varying response, will greatly affect the quality and constancy of the signal made available to the ASR system. The end result might be that not enough "clean" signal is available to the ASR system for it to accurately discern between the utterances spoken.

To a great extent, it is the training procedure itself that produces a lot of the differences

between templates and the words as they are usually spoken. This is because users often train

the vocabulary words in a way that is significantly different from the way that they actually

say them. This is termed the training effect and is caused by nervousness or hesitance on

the part of the user. Furthermore, they are often simply reading words off a list and this


results in different pronunciations of words. Granted, one would desire a system that is not

sensitive to variations this small in the way that words are spoken however, most systems,

especially speaker dependent continuous speech ones are, unfortunately.

As mentioned earlier, speaker dependency is closely tied to recognition accuracy. In

general, speaker independent systems are much less sensitive to the types of variations

mentioned earlier, but are consequently also much less accurate than speaker dependent

systems since they must accommodate a much broader range in how words are said. These

variations come not only from pitch and inflection changes from user to user, but also from

dialect and accent. In order to keep error rates reasonable, these systems tend to confine

themselves to small vocabularies. Conversely, speaker dependent systems are trained by the

eventual user and hence know with much greater accuracy, how each of the words said by

the user would appear. This is analogous to a human's ability to recognize more easily the

speech of someone with whom they are familiar.


Chapter 3

ASR Systems Selected for

Experimentation

In selecting an ASR system for this research, a two step approach was taken. First,

an inexpensive, low performance, system was purchased. This was done in order to give

a better insight into ASR technology so that the requirements and desirable features of a

higher performance system to be used in subsequent research and development could be more

accurately defined.

The goal of this work, as stated previously, was to obtain some practical information

about the incorporation of ASR in ATC. It was not to test and document the performance

of a number of ASR systems currently on the market. As such, the evaluation details and

results are presented in an exploratory, qualitative, rather than a quantitative manner. It

was felt that this would give the reader a better idea of the problems typically encountered

using this type of technology, without creating a false sense of confidence in performance

figures which are, after all, highly subject to a number of factors and difficult to duplicate

from test to test.

3.1 LIS'NER 1000

3.1.1 Description

The ASR system purchased for initial evaluation was the LIS'NER 1000 system produced

by Micro Mint of Cedarhurst NY. This system consists of a plug-in card, with appropriate


[Block diagram: Microphone Input -> Amplifier -> Bandpass Filter -> A/D -> SP1000, on the LIS'NER 1000 card hosted in an APPLE II+.]

Figure 3.1: LIS'NER 1000 Functional Block Diagram.

software, for the APPLE II family of home computers and cost $250 at the time of purchase

in May 1985.

The LIS'NER 1000 ASR system is a speaker dependent, discrete word recognition sys-

tem[27,32,33]. Total vocabulary is 64 words or utterances.

A block diagram of the system hardware is shown in Figure 3.1. Its basic operation is as

follows. First, the signal from a headset mounted electret microphone is filtered to prevent

aliasing and remove low frequency biases. It is then digitized by an A-to-D and sampled by

the SP1000 chip at a rate of 6.5 kHz. The SP1000 uses this incoming signal to generate LPC

data. This LPC data is organized in frames, each frame consisting of 8 LPC and one energy

parameter, and is made available to the APPLE at a rate of 50 Hz. It is this data that is

used by the system in the recognition process.

During normal operation, a value for the background noise is constantly being monitored

by looking at the energy level of the incoming signal. A significant increase in the energy of

the incoming signal (6 db) signals the start of an utterance. All subsequent LPC data from


the SP1000 is then saved in a buffer until the end of the word is detected. This is specified

by a period of silence (roughly 200 msec) determined, again, by looking at the energy of the

incoming signal. The resulting data is compressed or time normalized into a block of data

12 frames long to allow for a more uniform means of comparison as well as to minimize the

amount of data that must be stored. This compression is accomplished by merging together

any adjacent frames that are very similar. Thus, any "interesting" features of the waveform

are preserved.
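The capture-and-compress step just described can be sketched briefly. The following is an illustrative reconstruction in Common Lisp (the language used elsewhere in this work), not the actual LIS'NER 1000 software: the list-of-numbers frame representation, the ten-frame silence count (roughly 200 msec at 50 frames per second), and the rule of repeatedly merging the most similar adjacent pair are assumptions made for the example.

    ;;; Illustrative sketch of endpoint detection and frame compression.
    ;;; A frame is a list of 9 numbers: 8 LPC coefficients plus one energy value.

    (defparameter *start-margin-db* 6.0)   ; energy rise that marks a word start
    (defparameter *silence-frames*  10)    ; ~200 msec of silence ends the word
    (defparameter *target-length*   12)    ; frames kept after compression

    (defun frame-energy (frame)
      (first (last frame)))                ; energy is the last element

    (defun db-above (energy reference)
      (* 10 (log (/ (max energy 1e-6) (max reference 1e-6)) 10)))

    (defun capture-utterance (frames background-energy)
      "Collect frames from the start trigger until enough quiet frames follow."
      (let ((buffer '()) (started nil) (quiet 0))
        (dolist (frame frames (nreverse buffer))
          (let ((loud (> (db-above (frame-energy frame) background-energy)
                         *start-margin-db*)))
            (cond ((and (not started) loud) (setf started t) (push frame buffer))
                  (started
                   (push frame buffer)
                   (setf quiet (if loud 0 (1+ quiet)))
                   (when (>= quiet *silence-frames*)
                     (return (nreverse buffer)))))))))

    (defun frame-distance (a b)
      (sqrt (reduce #'+ (mapcar (lambda (x y) (expt (- x y) 2)) a b))))

    (defun compress-frames (frames)
      "Merge the most similar adjacent pair until *TARGET-LENGTH* frames remain."
      (let ((fs (copy-list frames)))
        (loop while (> (length fs) *target-length*)
              do (let ((best-i 0) (best-d most-positive-double-float))
                   (loop for i from 0 below (1- (length fs))
                         for d = (frame-distance (nth i fs) (nth (1+ i) fs))
                         when (< d best-d) do (setf best-d d best-i i))
                   ;; replace the chosen pair by its element-wise average
                   (setf (nth best-i fs)
                         (mapcar (lambda (x y) (/ (+ x y) 2))
                                 (nth best-i fs) (nth (1+ best-i) fs)))
                   (setf fs (append (subseq fs 0 (1+ best-i))
                                    (subseq fs (+ best-i 2))))))
        fs))

The important point is that only the energy value is used for the endpoint decisions, while the full LPC frames are carried along for the later template comparison.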

What is then done differs depending upon whether the user is in recognition or training

mode. In training mode, this data is averaged with data from previous vocalizations of the

same utterance to form a template. Currently, the software requires that each utterance

be repeated twice during training. When all of the utterances in the vocabulary have been

repeated twice, the training phase is finished.

In recognition mode however, the resulting data is compared to the vocabulary templates

to find the best match. This comparison is performed using dynamic time warping and a

Euclidean norm as a distance measure. The template that possesses the shortest distance is

the one that is selected as the best match. This distance, however, must be less than a maximum threshold (termed the acceptance threshold). This threshold provides a trade-off between unrecognized words and false recognitions. If it is too low, then valid utterances will not be successfully recognized. If it is too high, then utterances not in the vocabulary or

spurious noises will be misrecognized as valid words.

Since the time to recognize a word depends directly on the number of words or templates

in the vocabulary, a small trick is used to reduce the search time. This involves examining

the distance measure as it is computed for each template. If the distance is less than a certain

prespecified threshold (termed the lower threshold), then that template is treated as the best

match and no further computations are performed. As well, if the distance is greater than a

third threshold value (termed the upper threshold), all further comparisons to the utterance

are stopped. This hopefully allows the system to quickly disregard spurious noises such as

doors slamming, phones ringing, and so on since these will likely result in distances that

are greater than the upper threshold in almost all cases. Those cases of spurious noise that


do pass this test however, will still not likely cause recognition errors due to the acceptance

threshold test.
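One way the search loop described above could be organized is sketched below, again in illustrative Common Lisp rather than the vendor's code. The threshold values are arbitrary, and a plain frame-by-frame Euclidean sum stands in for the dynamic time warping comparison actually used.

    ;;; Illustrative sketch of the three-threshold search.

    (defstruct template name frames)

    (defparameter *acceptance-threshold* 100.0)
    (defparameter *lower-threshold*        0.0)  ; zero disables the early accept
    (defparameter *upper-threshold*     1000.0)

    (defun pattern-distance (a b)
      "Summed Euclidean distance over two equal-length lists of frames."
      (loop for fa in a for fb in b
            sum (sqrt (reduce #'+ (mapcar (lambda (x y) (expt (- x y) 2)) fa fb)))))

    (defun recognize (input-frames templates)
      "Return the name of the best matching template, or NIL."
      (let ((best nil) (best-distance most-positive-double-float))
        (dolist (tmpl templates)
          (let ((d (pattern-distance input-frames (template-frames tmpl))))
            (when (< d *lower-threshold*)          ; early accept: stop searching
              (return-from recognize (template-name tmpl)))
            (when (> d *upper-threshold*)          ; probable noise: give up
              (return-from recognize nil))
            (when (< d best-distance)
              (setf best-distance d best tmpl))))
        (when (and best (< best-distance *acceptance-threshold*))
          (template-name best))))

With the lower threshold set to zero, as in the word-entry tests described later, the early accept can never fire and every template in the active vocabulary is examined.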

A useful feature incorporated into this system is the ability to divide the entire trained

vocabulary into what are termed groups, by assigning every trained utterance to a particular

group. Using these, an active vocabulary, that is to say, the vocabulary of words or utterances

which is searched through during the recognition algorithm, can be reduced to a subset of the

total trained vocabulary. Once a word is recognized, a search byte for the group containing

the recognized word is used to specify which groups comprise the new active vocabulary.

Since reduction of the size of the active vocabulary reduces the number of comparisons that

must be made by the recognition algorithm it has the potential for decreasing recognition

delays as well as reducing the probability of mis-recognition errors. This grouping structure

however, is not entirely arbitrary, as each trained word can occur in only one group and a search byte can be specified only on a group-by-group, rather than a word-by-word, basis.
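As an illustration of how such grouping might be represented, the sketch below assumes eight groups addressed by a bitmask "search byte"; the group numbers, contents, and byte values are invented for the example and do not reproduce the LIS'NER software.

    (defstruct vgroup number words search-byte)

    (defun active-vocabulary (groups search-byte)
      "Collect the words of every group whose bit is set in SEARCH-BYTE."
      (loop for g in groups
            when (logbitp (vgroup-number g) search-byte)
              append (vgroup-words g)))

    (defparameter *groups*
      (list (make-vgroup :number 0 :search-byte #b010
                         :words '("united-airlines" "twa" "air-canada"))
            (make-vgroup :number 1 :search-byte #b100
                         :words '("climb" "descend" "turn-left" "turn-right"))
            (make-vgroup :number 2 :search-byte #b100
                         :words '("one" "two" "three" "niner" "zero"))))

    (defun next-active-vocabulary (groups recognized-word)
      "The group holding RECOGNIZED-WORD names, via its search byte, the new
    active vocabulary.  Unknown words leave every group active."
      (let ((home (find-if (lambda (g)
                             (member recognized-word (vgroup-words g)
                                     :test #'string-equal))
                           groups)))
        (active-vocabulary groups (if home (vgroup-search-byte home) #b11111111))))

Here, recognizing a call sign from group 0 would leave only the command-verb group active for the next word, which is the kind of syntax-driven vocabulary reduction discussed above.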

3.1.2 Evaluation and Testing

There were basically two modes of testing that were performed on the Lis'ner 1000. The

first involved the straightforward training of a particular vocabulary and subsequent testing

of the recognition accuracy and other parameters on a word by word basis while the second

involved the recognition of entire sequences of words, in sentences typical of ATC commands.

Word Entry

The primary goal of testing on a word by word basis was to study basic parameters

of interest such as recognition speeds, delays, and accuracy. This was accomplished by

having the user talk to the LIS'NER 1000 and having the APPLE display on its screen a

representation of the recognized utterance.

The testing was performed with a value of zero for the lower threshold. This meant that

the distance between any template and an utterance had to be less than zero, clearly an

impossibility, for the template to be declared a match and all further comparisons stopped.

Thus it forced the recognition algorithm to search through the entire active vocabulary. As


United-Airlines one descend Alpha Mike Yankee

TWA two climb Bravo November Zulu

Air-Canada three and-maintain Charlie Oscar

Piedmont four turn Delta Papa

Lufthansa five left Echo Quebec

Republic six right Foxtrot Romeo no

US-Air seven heading Golf Sierra over

eight fly Hotel Tango check

Manjo niner airspeed India Uniform cancel

Celts zero cleared-for-final Juliett Victor delete

Boston hundred feet Kilo Whiskey enter

Revere thousand degrees Lima X-Ray execute

Table 3.1: Table of typical words used for ASR evaluation

well, the "grouping" feature of the LISNER 1000 which allows for the reduction of the size

of the active vocabulary was disabled. This was done in order to get a better idea of what

the actual recognition rates were with a well known and constant size vocabulary.

The testing itself consisted simply of speaking words that were trained beforehand. These

words were selected as being typical of the ATC vocabulary. A list of some of the words used

can be seen in Table 3.1. This list is by no means an exhaustive list of all the words that were

trained and tested but it is indicative of the types of utterances that recognition information

was desired for. Note that the convention used throughout this report is to treat hyphenated

words as one utterance. That is to say, they are trained as one word and the ASR system

will not recognize them individually unless they are also trained individually. For example,

the utterance "United-Airlines" is trained as one word. Thus, the words "United" and

"Airlines" cannot be recognized as separate words unless they are also trained as such.

The recognition accuracy of this system was found to vary not only with size of the

vocabulary, but also with its content. It was very common for recognition errors to be

made between two words that sounded similar, such as "fly" and "five", but what was not

expected was the large number of errors (mis-recognitions) that occurred between words that

did not sound similar or were not even the same length. This was primarily due to the data

compression algorithms used which tended to obliterate word features.


The recognition accuracies themselves were on the order of 70 % to 80 %. These values

varied however depending on the actual words contained in the vocabulary and which speakers

were using the system, often dropping to as low as 60 % for some users. Best results were

typically obtained with "loud", confident speakers who articulated clearly as opposed to

"quiet", timid speakers. It is also interesting to note that better results were obtained for

multi-syllabic or long words. This is probably a direct result of the time normalization and

compression algorithms and their effect on the data quality. In particular, longer words

possess many more features than do shorter ones and it is less likely for these to be "lost"

during data processing and compression.

These figures do not take into account the significant number of recognitions triggered

by background noise (conversations, telephones ringing, etc.) however. In fact, background

noise alone was in some cases responsible for the degradation of recognition accuracy to the

neighborhood of 40%. The principal cause of this was that other than placing the microphone

close to the user's mouth where the signal magnitude was likely to be much larger than

the noise magnitude, there was no attempt made by the ASR system to compensate for

external noise. In general, best results were obtained by training the system in very quiet

surroundings and then moving the system to the somewhat noisier ones for testing and

evaluation. Background noise however, need not be a problem if noise canceling mikes are

used, as shall be shown in the testing described in later sections.

The system was also found to be very sensitive to differences in the pronunciation of

trained words. This arose in two different contexts. First, the user often spoke differently

when training the system than when testing its recognition performance. Thus, the templates

generated during training were significantly different from those generated during actual

speech input tests. This effect however was reduced, although not eliminated, as the users

became more confident and familiar with the system.

Second, changes in emotional state, health or intent colored words and this resulted in

different vocalizations thus making the utterances unmatchable. In this respect, the greatest

problem occurred when a recognition error was made and the operator had to repeat the word.

The natural tendency was to repeat it in a much slower manner, articulating each syllable


clearly, as would be done when repeating something to a person. Although this may make it

easier for another person to recognize what was said, it changes the acoustical pattern of the

utterance greatly and actually degrades the performance of the speech recognition system.

Thus, the user had to consciously force himself to maintain a consistent enunciation of the

vocabulary both during training and while using the system. This was especially difficult

with this system due to the frustration factor. This is a positive feedback effect, common

with the lower performance recognition systems, which arises from the user becoming more

and more frustrated with the fact that the system will not recognize a particular word and

as such, changing his pronunciation of that word more and more.

As would be expected, recognition speeds were found to be a function of the vocabulary size.

For vocabularies of 32 words, the recognition delay was determined to be about 2.5 seconds.

For 64 words, it was found to be about 5 seconds. The primary reason for this delay was

the fact that the LIS'NER 1000 did not possess its own dedicated processor. Instead, it

relied on the MOS Technology 6502 processor in the host APPLE, a fairly old and slow processor,

to perform the calculations required by the recognition algorithm. These large delays made

it very difficult to use the system.

ATC Command Entry

In order to more properly assess the requirements of an ASR system, some testing in an

environment typical of what the applications environment would eventually be had to be

performed. For this reason, the LIS'NER 1000 was connected so as to serve as a speech

input front end for a VAX 11/750, which, at the time of the work, was the computer on

which most of the ATC simulation research in the Flight Transportation Laboratory was

being performed. The inter-connection was implemented using an RS-232 serial link and

the LIS'NER 1000 simply sent an ASCII representation of a word or utterance, as it was

recognized, to the VAX through this link. It was then the responsibility of software in the

VAX to perform any error checking and parsing of the input as required.

Since the primary application to be studied, ATC simulation, involves the entry of entire

commands or phrases as opposed to single words or utterances, a few commands typical of


1. (aircraft) CLIMB/DESCEND AND-MAINTAIN (altitude)

2. (aircraft) TURN-LEFT/TURN-RIGHT HEADING (degrees)

where (aircraft) is the aircraft call sign (eg; air-canada, united-airlines) followed by the digits of the flight number, eg; "United-Airlines six five zero", "TWA three five";

(altitude) is either the word "FLIGHT-LEVEL" followed by the three separate digits of the flight level, or the separate digits of the thousands plus the hundreds terminated by the word feet, eg; "One seven thousand niner hundred feet", "Flight level one eight zero";

and (degrees) is the three separate digits of the heading, omitting the word degrees, with 360 indicating a north heading, eg; "zero zero five" for a heading of 005, "three two zero" for a heading of 320.

Figure 3.2: Commands used in preliminary ASR system evaluation.

ATC were formulated so that some experience could be gained in the operational problems

of command entry. These commands were very simple and were drawn almost directly from

the ATC Handbook. They consisted of the vectoring and altitude change commands and can

be found in Figure 3.2. In general, the procedure was to first identify the particular aircraft

being referred to and then to issue the desired command. The termination of the command

was indicated by the receipt of the syntactically complete command. Thus, for example, once

the third digit in the heading specification was received, the system would be reset, awaiting

the input of another command.

Feedback to the user was provided through the use of a CRT display. The display consisted

of three lines (see Figure 3.3). The first was used to display system messages such as "Invalid

Command!" if an invalid command was entered, or "Please Repeat!" if the utterance could

not be matched to any of the words in the vocabulary. The second line was used to display


EXECUTING COMMAND

United-Airlines turn left heading zero niner zero F1

(send (fetch-nth '(:aircraft 0) '(:name 'ua)) :fly-heading 90)

Figure 3.3: VAX Display for Command entry feedback.

the current state of the command as recognized so far, allowing the user to detect mistakes

and keep track of where in the command he was. The third line was used to display the final

command in a format (Lisp code) executable by the ATC simulation.

It was obvious that a mechanism for correcting errors was also required. Thus, the

keywords "NO"and "CANCEL" were included in the vocabulary. Upon the receipt of the

"NO" keyword, the command line parser would back up past the last utterance, in effect,

acting like a delete key for an entire word. When the "CANCEL" keyword was received, the

entire command was canceled and the display cleared and reset to await a new command.
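A minimal sketch of this correction logic, in illustrative Common Lisp rather than the VAX parser code, is given below; the command under construction is modeled simply as a list of words with the most recent word first.

    (defun apply-word (word command)
      "Return the command updated by one recognized WORD."
      (cond ((string-equal word "no")     (rest command)) ; erase the last word
            ((string-equal word "cancel") '())            ; erase the whole command
            (t (cons word command))))

    ;; Building up "united-airlines turn right no left" yields
    ;; ("left" "turn" "united-airlines"), i.e. "united-airlines turn left":
    ;; (reduce (lambda (cmd w) (apply-word w cmd))
    ;;         '("united-airlines" "turn" "right" "no" "left")
    ;;         :initial-value '())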

On the whole, the system performed fairly well considering the accuracy and speed of the

LIS'NER 1000 speech recognition system. A lot of the problems evidenced during word entry

became even more critical during ATC command entry. In particular, the large recognition

delays (and errors) made it extremely difficult to enter entire command strings. This was


compounded by the fact that the next word could not be spoken until the previous word had

been recognized. This forced the user to be unrealistically and impractically slow in inputting

an entire command, often requiring six or more seconds, if even a single recognition error

was made, to input a single word (2 seconds delay for the initial mis-recognition of the word

plus 2 seconds delay to recognize the keyword "NO", plus 2 seconds delay to recognize the

word, hopefully correctly, the second time).

Another problem arose from the way command termination was implemented. This was because there was no allowance made for the user to correct an error in the last utterance, i.e., the one that signaled the end of the command sequence. For example, there would be no

way to correct an error in recognizing the "five" in a "... heading zero four five" command

if another digit were substituted in its place. Thus, command termination was modified

so that the use of the keywords "Enter" or "Execute" indicated to the processor that the

command was both finished and correct. In addition to this, some sort of timeout on the

microphone input could be implemented to indicate that the user was finished speaking and

hence finished entering the command. These and other command termination strategies will

be discussed in more detail in later chapters.

3.1.3 Recommendations

Performance of the LIS'NER 1000 ASR system was found to be lacking in two major

respects. First, its recognition accuracy was very poor, especially in comparison to other,

more expensive, recognizers on the market. Second, the recognition delays were large and

this made command entry very impractical. These delays arose not only from the serial

architecture of the system', but also due to the slow operating speed of the 6502 processor.

It did however demonstrate some potentially very useful features. In particular, the

"grouping" feature was found to potentially be of great use both in reducing recognition

delays and improving accuracy, especially in conjunction with the fairly rigid command syntax

found in ATC commands by reducing the size of the active vocabulary.

1. The processor must complete the execution of the recognition algorithm before performing other tasks, including the reading of data from the SP1000. Thus the user must not only pause for a sufficient time between words or utterances to delineate them, he must also wait until the previous word was recognized.


The limited size of the vocabulary, 64 words, although it likely would not encompass

the entire vocabulary required for all projected applications, was not found to be overly

restrictive. This is especially true when different groups of 64 words can be switched into

and out of memory, thereby increasing the effective size of the total vocabulary.

In conclusion, some of the features that were found desirable, at this stage of the work, in a

more capable ASR system are listed below. The first three requirements are all very closely

related. Tradeoffs must routinely be made between these during the ASR system design

process based on what the designer feels is most important. Thus, it is difficult to determine

exactly which system will meet user needs without some research and experimentation.

* Continuous Speech Recognition

One of the underlying principles in this work is to impose as few additional constraints

on the user as possible. The requirement to pause between words when entering a com-

mand, thus creating a very halting form of speech, was felt to be too restrictive, and created a strong preference for continuous or connected speech recognition systems.

These would allow the user to concentrate on his work instead of on his speech. With

either of these systems however, the user still retains the option to revert to discrete

speech if, for example, better recognition accuracy were desired. Furthermore, contin-

uous was preferred over connected speech because it was felt that connected speech

would result in excessive delays since the speech data is not analyzed until a pause is

detected in the stream of incoming speech. Thus in a continuous stream of words, the

first might not be recognized until after the last had been spoken. Note however that

this preference on speech input mode is also affected significantly by the recognition

delays and error rates of the system. In particular, continuous speech recognition sys-

tems with high error rates and large recognition delays would likely not be desirable

over connected or even discrete systems with better performance.

* Short Recognition Delays

This requirement is closely related to the preference for continuous speech ASR and is

very difficult to quantify. Obviously a recognition delay as short as possible is desired


but at what point do the delays become too large? Furthermore, there are some trade-

offs to be made. For example, how much additional delay can be tolerated in order

to gain the benefit of continuous speech? In particular, is the user more willing to

tolerate larger recognition delays as long as he can speak continuously, or reduced

recognition delays with the imposition of discrete speech? This would also depend on

the recognition accuracy of the respective systems and on the exact nature of the task

being performed.

* High Recognition Accuracy

Clearly we want the ASR system to be as accurate as possible. The actual recognition accuracy is difficult to quantify exactly but should realistically be a minimum of roughly 95 %. This would imply a recognition error for every three commands issued if we assume an average command length of eight to nine words (a worked estimate follows this list). The actual error rate however is also affected by things such as size and content of the vocabulary. Thus,

some experimentation is required to determine the accuracies possible with an ATC

vocabulary for each specific ASR system. Also related is the desire for continuous

speech recognition since discrete ASR systems are more accurate than continuous ASR

systems. Thus, another question is "How much, if any, degradation in accuracy can

be tolerated for the acquisition of continuous speech capabilities?". This can only be

answered by experimentation.

* Vocabulary Size of roughly 60 words minimum

Although a total vocabulary size of 64 words would not likely be sufficient for all of

the applications envisioned, it would at least allow their demonstration. When coupled

with the ability to switch in different groups of 64 words however, it was thought that

this would be sufficient to meet the needs of all the tasks envisioned.

* Speaker Dependence

Clearly since we are concerned with speech input from only one controller, there is no

need for speaker independent systems. Different controllers could still be accommo-

dated by storing their templates on floppy disc and recalling these when required.

* Good Noise Immunity

Since the controller operates in an environment where there is a lot of background

noise, good noise immunity is essential. This however, was easily taken care of (as shall

be seen in a later section) through the use of noise canceling microphones.

* Vocabulary Grouping or Set Switching Capability

Initial work indicated that this would have potentially large benefits in terms of both

decreasing recognition delays and increasing recognition accuracy when used to incor-

porate syntactical and grammatical constraints into the recognition process. Thus, it

was very desirable in an ASR system used for ATC command recognition.
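As the worked estimate promised under the accuracy requirement above: assuming a per-word accuracy of 95 % and an average command length of about 8.5 words (the midpoint of eight to nine words), the probability that a command contains at least one recognition error is

    1 - 0.95^8.5 ≈ 0.35,

that is, roughly one command in every three.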

3.2 VOTAN VPC2000 System

3.2.1 Description

After some analysis of the existing technology, the ASR system selected for further research and developmental work was the VOTAN VPC2000 system. This is a speaker depen-

dent, continuous speech ASR system produced by VOTAN of Fremont, CA[34]. It consists

of a plug in card for an IBM PC or compatible computer and associated driving software.

List price, at the time of purchase, was $1500 ($1200 for the hardware and $300 for the

software) 2. As a further benefit, it also had the capability for the digital recording and play-

back of spoken messages. This feature was used in conjunction with the ATC simulation as

will be described in the next chapter.

The internal operation of this system is not documented but it operates in much the

same way as other continuous speech ASR systems. That is to say, it utilizes dynamic time

warping techniques on the data made available by some sort of feature extraction process to

compare templates from a trained vocabulary to those obtained from speech input. If the

comparison meets a prespecified threshold test, then a recognition event is declared.

It contains its own processor, a Motorola 6809 as well as custom digital signal processing

chips and as such, does not require the host computer's processor to execute the recognition

2. The cost when the decision was made to purchase this system was $2100. Over the month that it took to order, the price dropped to the above amount. This is indicative of the cost trends of ASR technology.


algorithm. This configuration allows for fairly short recognition delays as well as for software

to be run concurrently on the host computer.

Vocabulary is limited to a maximum of 64 words at a time. This, as is the case for

nearly all ASR systems, is due to memory limitations as the VPC board uses its own on-

board memory (only 22K) for template storage. The actual vocabulary size is influenced

greatly by the number of training passes made for each word as well as the length of the

words trained. The figure of 64 words is the nominal vocabulary size for average length

words and two training passes per word. More training passes per word would obviously

reduce this figure. There is also the ability to swap in different vocabularies from main PC

memory, assuming the user can tolerate a brief delay. Thus the effective vocabulary size can

be increased dramatically.

The system operates with two basic software packages. These are VOICEKEY and its

associated utilities, and Voice Programming Language (VPL) and its associated utilities. Al-

though the actual recognition algorithm is identical irrespective of which particular software

package is being used, each incorporates a different user interface and thus provides features

and capabilities not found in the other.

In particular, VOICEKEY operates by making the speech input seem like keyboard input.

Thus, the operation of the VPC is for the most part, transparent to the user and this results

in a reduced capability for control of the VPC functions.

VPL however is an actual programming language. It allows for much more flexibility and

control of the VPC functions. It also has the additional feature that it provides information

on not only the best guess as to which word was spoken, but also the second best guess.

This information can be used to great advantage, as shall be shown in Chapter 4, in order to

correct recognition errors. It however does not provide the same "set switching" or vocabulary

"grouping" capabilities as are found in VOICEKEY.

Training of the vocabulary proceeds much the same as in other speaker dependent sys-

tems, requiring the user to repeat the words contained in the vocabulary a specified number

of times. With this system however, the number of repetitions is left up to the user and can

even be changed from word to word. Thus, more templates could be generated for difficult to


recognize words in order to hopefully improve recognition performance. Furthermore, data

from different training passes are not averaged together to create vocabulary templates as

was the case with the LIS'NER system. Instead, a template is created and saved each time

a word is trained. As well, the templates are not normalized to a constant length. This

hopefully eliminates some of the feature masking that occurred with the LIS'NER system.

Since this system is a continuous speech system, it must be able to accommodate some

of the co-articulation effects common in continuous speech. This is accomplished by allowing

the user to train words or utterances "in phrase". In this type of training, a stream of

continuous speech containing the word for which "in phrase" training is desired is spoken to

the system. This allows the system to generate a template based on what the word would

actually "sound" like in a stream of continuous speech. Granted, this is not entirely general

since the co-articulation effects are highly dependent on the neighboring words and it is

unrealistic to train for all word combinations, but it does at least address the problem.

3.2.2 Evaluation and Testing

Although the majority of testing and evaluation of this system was performed in conjunc-

tion with the development of the ATC simulation and is described in Chapter 4, there was

significant testing of this system in a standalone environment. This testing basically involved

the implementation of a command entry task such as that outlined in Section 3.1.2. This

was performed using both continuous and discrete speech as input in order to get an idea of

the baseline recognition performance and how it would be affected by the use of continuous

speech.

Discrete Speech

To a certain extent, a comparison between continuous and discrete speech input is difficult

to make. Since the VPC2000 is a continuous ASR system, it does not require a pause between

words at all, and as such, it is up to the user to introduce what he feels is a pause of "sufficient

duration". Thus, in instances where the pause is not of sufficient duration and discrete word

recognition systems would fail, the VPC2000 would succeed. As well, discrete speech ASR


systems are tailored to the much simpler task of recognizing discrete speech. Thus, their

performance in this task can, in general, be expected to be much better.

However, the performance of the VPC, even for discrete speech, was much better than the

LIS'NER. This was because the VPC represented a significant technological leap in its architecture and design over that used in the LIS'NER. Improvements in performance were evidenced both

in delay reduction and increased recognition accuracy.

The decrease in delays took on two forms. First, due to the parallel construction of the

system, the user was allowed to speak the next word even before the current one had been

recognized. This reduced inter-word delays to only that required to delineate the words for

discrete speech. This however, is actually quite common with the higher performance discrete

speech recognition systems as well.

Second, the actual time taken to recognize a particular word dropped dramatically. For

a vocabulary of 64 words, this turned out to be on the order of 0.8 seconds or less. Even

though still significant, this did not affect the user as much as might be expected since it

only delayed feedback and did not in any way affect how quickly he could say words. In

general, he would typically enter an entire string of words sequentially, without waiting for

each word to be recognized. This would create a delay between the time he finished speaking

to the time the last word was recognized, but this was still within acceptable levels. For

example, for a string of ten words, this delay was only on the order of 2 to 3 seconds.

For discrete speech input, the recognition accuracy of this system, roughly 97%, was much

higher than that of the LIS'NER 1000. As well, some of the sensitivity to the "vocal coloring"

of words was reduced, although not eliminated. In this respect, it still did not match the

performance of other ASR systems examined which were far less sensitive to these variations

in how a word was said (although these were discrete speech systems). A tendency for

recognition accuracy to degrade slightly if it had been a long time (approximately 3 weeks)

since the vocabulary was trained was also noted with this system. This was not evidenced

with the LIS'NER 1000 system primarily due to its poorer performance and the masking of

this phenomenon by other factors.


Continuous Speech

The greatest benefit of this system over other systems examined was the ability to use

continuous speech. In fact, performance was such that it allowed for true continuous speech.

Thus, not only was there no requirement for pauses between words, there was no need to limit

the duration of a stream of continuous speech as is sometimes the case for connected speech

systems. To illustrate this, numerous, but unsuccessful attempts were made to "out-talk"

the system. "Out-talking" a system occurs when a stream of speech, of sufficient duration

that the ASR system cannot recognize it fast enough and words that would otherwise be

recognized are lost, is spoken. For discrete speech, this can occur, depending on the exact

definition of "out-talking", even with two words if there does not exist a pause between them

of sufficient duration to delineate them3 . With this system, this is accomplished not through

large amounts of memory but with software that processes the speech data as it comes in.

Thus, in a long stream of words, as the first words are recognized, the corresponding speech

data can be ignored, thus freeing up memory, even before the user is finished talking. This is

not the case with connected speech where the speech data is saved in a buffer until a pause

is detected.

The command entry task of Section 3.1.2 was again repeated, this time allowing the user

to enter commands in either continuous or discrete speech as desired. This served to indicate

some problems that were not initially evident while using discrete speech.

The major problem that arose was a decrease in recognition accuracy. This resulted

primarily from co-articulation effects that created a significant difference between the words

as trained and the words as spoken. The "in phrase" training procedure, although it did

help, did not completely solve the problem. The problem was especially acute for short

words such as the digits. These were often not recognized when they were part of a stream of

continuous speech. This was because the co-articulation or slurring effects were so great that there

was relatively little "data" associated with these words to allow for confidence (recognition

3. This is not really a fair criticism of discrete speech recognition systems, however. A more realistic test would be to impose the requirement for discrete speech on the user. In this case, the higher performance discrete ASR systems (parallel construction) are impossible to out-talk as well.


threshold test) in the comparison. Granted, the recognition threshold could be lowered but

this would create other problems (spurious recognitions). At first glance, it was thought that

this could be countered with more "in phrase" training using samples of highly co-articulated

speech but this actually created more errors than it eliminated. In particular, since templates

generated in this way became very short, there was a significantly increased likelihood that

they could be matched to spurious noises, such as taking a breath between words, or speech

data "left over" between words (this "left over" data arose principally from the alignment of

words during the recognition algorithm and the fact that different vocalizations of the same

word were different lengths). Thus the rate of spurious recognitions increased dramatically.

This problem was especially acute with the word "eight" since the "t" sound is often omitted

during speech and the resulting sound was very easily confused with "left over" data or the

sound made while taking a breath between words. 4 For this reason, significant care had to

be taken during the training procedure.

When recognition errors were made during a stream of continuous speech, this system

demonstrated very robust performance in its ability to get back "on track" and recognize

succeeding words. That is to say, even though a recognition error would often cause an error

in recognizing the following word, by the second, or at most third word, the system would

be recognizing correctly again. The reason for this error in adjacent words involves the fact

that for continuous speech ASR systems, the recognition algorithm begins re-analyzing the

data at the point where the previously recognized word finished. Thus, if an error is made in

this word, data associated with the next word could be masked. Consider for example, the

sequence of words "fly present heading". Since "five" is longer than "fly", if a recognition

error is made, the recognition algorithm will commence recognition at a point after the word

"present" actually begins (see Figure 3.4). Thus, it is not likely that the word "present"

would be recognized and it is even possible for the remaining "stub" or "left-over" data to be

misrecognized as some other word. This ability to get back on track is somewhat similar to

the word spotting feature of some systems. With this, words contained in the vocabulary can

4. Take for example the sequence of digits "8 2 2". If these are said rapidly, the "t" in the eight is omitted and the pronunciation changes from eight two two to eigh two two.


[Figure: the input waveform shown against the correct recognition ("fly ... present ... heading") and an incorrect recognition in which the shifted word boundary causes "present" to be missed.]

Figure 3.4: Example of word boundary mis-alignment due to misrecognition errors.


be recognized in a stream of speech containing both trained and untrained words. The same

can also be accomplished with the VPC system through judicious selection of the recognition

threshold.

The performance results and figures given so far assumed ideal conditions (quiet environ-

ment). These however could not be expected in typical operating conditions and were even

difficult to obtain in a Laboratory environment. Anytime the conditions were less than ideal,

there were significant reductions in recognition accuracy. In fact, even the simple operation of

a fan and the resulting breeze blowing across the microphone were enough to trigger spurious

recognitions at the rate of roughly one every two seconds. Clearly this could not be allowed.

In order to solve this problem, noise canceling microphones were used. These provided

almost ideal noise immunity even allowing people nearby to carry on normal conversations

while the system was being used without significantly affecting recognition performance. Two

noise canceling microphones were tested. These were a Communications Applied Technol-

ogy CAT 1 electret mike and a Telex Airman 750. Both of these were headset mounted

microphones but whereas the CAT mike was specifically designed for ASR use, the Telex

was designed for use in the aircraft cockpit. Thus, understandably, performance of the CAT

mike was superior to that of the Telex. Both of these however performed much better than

the gooseneck mike that was standard issue with the VOTAN system. The headset mounted

mikes also had the advantage that the distance between the mouth and the microphone was

kept constant. This greatly reduced some of the variability in the input signal and thus

further improved recognition accuracy over that using the gooseneck microphone.

A CAT throat microphone was also tested but, although it still performed better than

the gooseneck mike, it did not perform nearly as well as either of the headsets. Its advantage

was that it offered noise immunity superior to that of the noise canceling microphones. Noise

levels encountered or expected however were not sufficient to justify its use, especially in

light of its poorer performance.


Chapter 4

ATC Simulation Environment: Command Recognition System Design

In this chapter, the major portion of the work performed will be presented. This, as

mentioned in Chapter 1, involved the development of speech input and output capabilities

for a terminal area ATC simulation. It was in conjunction with this simulation that a system

for recognizing ATC commands was designed and implemented.

The ATC Simulation itself deals with operations in the terminal area airspace of an

airport. Here, it is the controller's responsibility to issue appropriate commands to any

aircraft so as to avoid conflicts and minimize any delays to all aircraft arriving and departing

from the airport. This is accomplished through verbal communications between the controller

and the pilots over a radio link. In order to aid him in determining the position of aircraft,

the controller also possesses a display presenting information, made available by surveillance

radar, about names, positions and velocities of the aircraft being tracked in his airspace.

Functionally, the simulation can be split into three basic components. These are:

1. ATC Simulation and Display: The basic simulation task. This involves simulation

of the terminal area environment including aircraft, surveillance radar, winds, navaids,

and so on, as well as the duplication of the Air Traffic Controller's display used to

present aircraft radar tracks and other information to the controller.


2. Speech Input Interface: A suitable user interface that will allow the controller to

input commands verbally directly into the computer.

3. Pseudo Pilot Response: A system to simulate pilot responses and queries to con-

troller commands in order to create a more realistic environment.

The overall configuration of hardware that was used to complete this task can be seen in

Figure 4.1. It consists principally of a host computer, a Texas Instruments Explorer, with an

inter-connection, via an RS-232 serial link, to the speech recognition (and audio playback)

system, the VPC2000. Since the ASR system required hosting on an IBM PC, this was also

included in the hardware.

In incorporating ASR into this simulation, two conflicting philosophies were apparent. On

the one hand, there was the desire to design a system with which ASR could be incorporated

into existing ATC simulation environments, without the modification of the interface to the

controller, in order to present an environment as similar as possible to that found in the

real world. This implies that there be no operational differences arising from whether the

controller was talking to a pilot, a blip-driver/pseudo-pilot, or an ASR system. Thus, the only

feedback available to him to indicate command transmission/recognition errors had occurred

would be (simulated) verbal pilot responses and actions in response to his commands. On

the other hand however, the incorporation of ASR should include a good user interface for the

controller. This almost certainly implies that he be presented with some sort of visual display

in order to provide additional feedback required for the detection and possible correction of

recognition errors. These conflicting criteria were resolved by designing both capabilities into

the simulation. In this manner, if one or the other was no longer desired, it could easily be

removed.

4.1 ATC Simulation and Display

Here, the basic simulation functions that must be performed regardless of whether

speech recognition and playback is used or not are implemented. These entail the modeling

and simulation of the airspace and the aircraft flying in it as well as any additional factors


Figure 4.1: Configuration of ATC Simulation Hardware.

desired for fidelity.

In the current configuration, this task was the responsibility of the simulation computer, a

Texas Instruments Explorer. This is a Lisp Machine using the Common Lisp implementation

of the Lisp programming language. The actual simulation itself was written in this language

using object oriented programming techniques in order to allow for ease of development and

modification.

The details of the actual airspace being simulated are specified by a user defined database.

This database contains items such as airports, VOR's, and fixes. Aircraft flying in this

airspace are separate entities and can be generated/introduced in a number of ways. These

include allowing the user to specify exact scenarios in which aircraft entering the airspace

are defined deterministically or, creating aircraft randomly at prespecified entry-fixes and

prespecified rates and distributions.

The simulated controller's display was presented on the Lisp Machine's screen, a black

and white raster scan display with a resolution of 1024 by 750 pixels. It consisted of a


rectangular window, created using the Lisp Machine's window interface, and gave the user the ability to perform standard window operations such as resizing and reconfiguration as well as allowing him to use the mouse. These capabilities were very useful later when format changes were desired.

Figure 4.2: Icon used for display of fixes in the simulation display

Figure 4.3: Icon used for display of airports in the simulation display

The actual display of items on the simulated controller's display was accomplished using

icons. There were basically three different icons representing navaids, airports, and aircraft.

The icons used to represent navaids consisted of an "X" symbol with the name of the fix

beside it. Airports were represented by circles with a dot at their center and had the name

of the airport beside them. Finally, aircraft were represented by a circle with a cross at

its center. A lot of additional information was also displayed with the aircraft icon. This

consisted of the aircraft name and flight-number, its estimated altitude in hundreds of feet,

if available, and its estimated groundspeed in knots. Examples of these icons can be seen in

Figures 4.2 through 4.4.

There were also additional windows included in the display that could be used for

other purposes. One of these was used to display the elapsed time during the simulation.

The uses of the rest will be described shortly.

070: Altitude in hundreds of feet; 250: Estimated speed in knots

Figure 4.4: Icon used for display of aircraft in the simulation display

For the current simulation task, the airspace within roughly 50 miles of Boston's Logan airport was used. A picture of how the display would actually appear can be seen in Figure 4.5. Note the location of Boston's airport (labeled BOS) in the center of the display as well as

the three aircraft (CP123, UA66, and AA151) flying in the simulation. The positions of these

aircraft were updated, as would be the case in the real world where positional information is

made available by surveillance radar, at roughly five second intervals.

4.2 Speech Input Interface

With the simulation configuration as described in the last section, commands issued to

a particular aircraft by a subject controller are relayed to a blip-driver and keyed into the

Lisp Machine manually. It is the replacement of this by the use of speech directly that is the

primary function of the Speech Input Interface (SII).

The SII can be split up into two basic divisions. The first, the ASR system, is responsible

for the actual recognition of the controller's spoken input and the second, the Speech Input

Parser is responsible for performing any error detection and correction as well as translating

the input into executable code. These two systems are described in further detail in the

following sections.

4.2.1 ASR System

The ASR system used for the simulation, as mentioned previously, was the VOTAN

VPC2000. This system was selected primarily because it provided the capability for contin-

uous speech thereby freeing the user from any artificial constraints in how he spoke. This

added to the fidelity of the simulation and allowed for the emphasis on what the controller

was doing as opposed to how he was speaking.

The capabilities and use of this system have, for the most part, already been described in


Figure 4.5: Sample of the ATC Simulation Display on the TI Explorer.



Chapter 3. Its operation and performance in conjunction with the simulation task however

will be described in later sections.

Of significant import to the operation of the VPC system for speech recognition was the

fact that it was also used to generate simulated pseudo-pilot responses. Thus, since it could

not do both at the same time, there were limitations as to when the controller could talk to the

system. Furthermore, a reliable method for switching between these functions was required.

Since primary simulation control was the responsibility of the Explorer, it would ordinarily

have been this that controlled the switching of functions in the VPC. This was not possible

however since no control could be exerted on the VPC when it was in recognition mode by

input made through the serial port. Therefore, the switching of modes was accomplished

through the selection of a keyword, "Over" which, when recognized internally by the VPC,

would switch the function from speech recognition to speech playback. This required that

every command issued be terminated by the word "Over". Once in speech playback mode,

the VPC could be commanded to switch back to speech recognition mode directly by the

Explorer through commands issued through the serial port.
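The resulting interaction can be sketched as follows, for the Explorer side only and in illustrative Common Lisp. The serial command strings "PLAY" and "RECOGNIZE" and the helper PILOT-ACKNOWLEDGMENT are invented placeholders; the actual VPC2000 serial protocol is not reproduced here.

    (defvar *current-command* '())          ; words of the command being assembled

    (defun pilot-acknowledgment (command)
      "Choose the name of a recorded pseudo-pilot reply (stub)."
      (declare (ignore command))
      "roger")

    (defun handle-recognized-word (word vpc-stream)
      (if (string-equal word "over")
          ;; The VPC has already switched itself to playback on recognizing
          ;; "Over"; play a reply, then put the board back into recognition mode.
          (progn
            (format vpc-stream "PLAY ~a~%" (pilot-acknowledgment *current-command*))
            (format vpc-stream "RECOGNIZE~%")
            (setf *current-command* '()))
          (push word *current-command*)))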

4.2.2 User Feedback and Prompting

Before beginning an explanation of the actual procedure used in the parsing of speech

input, it is important to discuss the methods that were used to provide feedback to the

controller of the recognized commands as mentioned previously.

In current simulations, the only feedback available to the controller indicating an error in

the recognition and execution of his commands arise from the pseudo-pilot acknowledgments

and actions in response to his commands. Although it would be desirable to utilize only

these for the feedback of errors in the recognition of spoken commands in the simulation

task, these acknowledgments, as shall be discussed in Section 4.3, are somewhat limited in

their error handling capability. Thus, the capability for additional feedback was desired.

The additional feedback was accomplished by displaying the recognized words on the

screen, using one of the auxiliary windows of the Simulation Display mentioned in the previous

section. The recognized words were displayed in two ways. In the first, they were simply


echoed onto the screen as they were received by the Explorer and in the second, they were

displayed as part of the current command string, assuming that they passed the parsing tests

on their validity. When the current command was completed, the display advanced to the

next line and all subsequent input was treated as part of the new command. In this way, a

scrolling list of the commands issued was generated.

If a parsing error was detected however, then a suitable message was displayed informing

the user of the problem. This message was presented in its own window in order to avoid

cluttering the speech recognition feedback display.

An example of what this would look like in operation can be seen in Figure 4.6. The

window on the bottom right hand side is used to echo the recognized words as they are

received from the VPC. Note however that there are two words and two numbers in each

line of this display. This is because the top two matches from the recognition system, along

with their respective scores, are being displayed. Thus, for the example shown, the last word

recognized was either "turn", with a score of 44 or "three", also with a score of 44. Recall

that the lower the score, the better the match.

Above this window can be seen the display used to provide the primary feedback of the

recognized commands. In this example, there are nine completed commands and the user is

currently midway through entering the tenth.

Finally, just below the window displaying the simulation time, there is the error feedback

window. Here, an error message corresponding to an invalid input (neither "turn" nor

"three" are valid inputs at this stage of the command) is being displayed. Thus, the user is

alerted to this and can correct it.

Once a command was terminated, a command terminator character was printed at the

end of the current line. If the command was a valid one, then this command terminator

was a period. If there was any type of error however, then the terminator was a question

mark. The use of this enabled the user to determine when the previous command had been

terminated and the next command could be input as well as whether or not the command

was one that could be executed properly.


CL En

0 *0 C C tn CD

C-) 0 P O

Page 60: Thanassis Trikas - DSpace@MIT Home

4.2.3 Speech Input Parser

The basic function of the speech input parser is to translate the spoken commands of

the controller, once they are recognized by the ASR system, into a format suitable for en-

try into the computer. In general, this task, although similar to Natural Language (NL)

understanding systems in Artificial Intelligence, possesses some significant differences.

In NL systems, the basic problem lies in understanding what the meaning of a statement

made in everyday conversational speech is. There tend to be very few restrictions as to the

scope of what can be said and even fewer on the syntax that must be used. Although there

can be some ambiguities, with a NL system, it is assumed that the user input is, for the most

part, correct. It is here that the two tasks principally differ.

In the current task, the introduction of a speech recognition system into the data input

process can result in errors in the words arriving at the computer even if the user's input

was correct. Thus the problem now becomes one of determining what was meant by the user

when there exists the possibility of errors having been made in the transcription of his speech.

This is a much more difficult problem. Fortunately, in ATC command recognition a rigid

and well defined syntax can be imposed on the commands input. This syntax is specified in

the Air Traffic Controller's handbook and although it might not be strictly adhered to in the

real world, it is not unreasonable to expect users in the simulation and training environment

to follow it. It is through the restriction in the scope of the conversation to these commands

only and the utilization of the ATC syntax that the problem of understanding what was

said can be greatly simplified. In this way, the real problem becomes one of eliminating, or

otherwise accounting for, errors made during the speech recognition process.

There are basically two ways that this can be done. The first is to improve the accu-

racy of the recognition algorithms in the ASR systems themselves. For most of the higher

performance systems, however, this would be very difficult as they are operating near peak

performance now. Furthermore, this is not the goal of this work.

The second is to use the additional information made available by syntax and grammar

in conjunction with existing ASR systems in order to improve their recognition accuracy. It


is here that the greatest potential for improvement lies. In fact, for any realistic or non-

trivial application of ASR, especially continuous or connected speech ASR, this information

is essential for resolving ambiguities in the recognition of speech. For this reason, many of

the more successful ASR systems do indeed use this information directly in the recognition

process [2]. The exact methods for utilizing this information, however, are many and diverse

and require, to a great extent, tailoring to the capabilities and performance of the actual

ASR system being used.

Finite State Machine

Since the construction of a parser that would be able to correct recognition errors without

external aid was a very difficult undertaking, at least initially, the approach first attempted

was to construct a parser that would utilize this syntactical information in order to merely

detect recognition errors. It would then be up to the user to correct these himself.

In order to do this, a Finite State Machine (FSM) was used to specify valid word se-

quences. In the FSM approach, a parser transitions through a finite number of states, as

a function of the recognized input words. The valid word sequences, and hence the valid

commands, are thus defined by the possible paths through the FSM as it transitions from

state to state. Errors are indicated by the receipt of a word that is not defined with reference

to the current state of the parser.

For example, consider the FSM shown in Figure 4.7. This FSM is used to specify the

syntax required for the input of a heading azimuth. Here the states are represented by circles

and transitions or branches from one state to another by arrows. The quoted words contained

in the arrows represent the verbal inputs required to transition between the states. Here,

the symbol "<i-j>" is used to represent a branch using only one of the digits i through jinclusive. Thus, for example, "<0-5>" implies that a branch exists for any one of the digits

"zero", "one", "two", "three", "four" or "five".

If we assume that the parser is initially at state S1, then the only valid inputs are "zero",

"one", "two" and "three". Any other input received would indicate an error. If the words

"zero", "one" or "two" are received, then the parser would transition to state S2. From

here, the valid inputs become all of the digits "zero" through "niner". If, on the other hand,

the word "three" is received, then the parser would transition to state S6 where the valid

inputs now become the digits "zero" through "six", and so on.

Figure 4.7: Example of the Finite State Machine logic for the specification of a heading

With this system, detection of errors, whether they arise from recognition errors made

by the ASR system or input errors made by the controller1, is fairly straightforward. This is

because the valid words are defined as a function of the current state of the parser by the

branches to other states. Thus, if a word is received for which a transition branch does not

exist, an error is signaled. For example, if the word "five" were received while the parser

were at state S1, then an error would be signaled since there is no branch defined for this

word.
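For illustration, a minimal sketch of this transition logic is given below. It is written in Python purely for readability (the actual parser ran on the Explorer and constructed Lisp s-expressions, as described later in this section); the state names S1, S2 and S6 follow Figure 4.7, while the remaining states and the unrestricted third digit are assumptions made only to round out the example.

```python
# Illustrative sketch of the FSM parser for the heading syntax of Figure 4.7.
# States S1, S2 and S6 follow the figure; the third-digit state and the
# final "DONE" state are assumptions added to complete the example.

DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "niner"]

# transition table: state -> {valid word: next state}
TRANSITIONS = {
    "S1": {**{d: "S2" for d in DIGITS[0:3]}, "three": "S6"},
    "S2": {d: "S3" for d in DIGITS},          # second digit, 0-9
    "S6": {d: "S3" for d in DIGITS[0:7]},     # second digit, 0-6
    "S3": {d: "DONE" for d in DIGITS},        # third digit (assumed unrestricted here)
}

def step(state, word):
    """Advance the parser by one recognized word, or signal an error."""
    branches = TRANSITIONS.get(state, {})
    if word not in branches:
        return state, f"error: '{word}' is not a valid input at state {state}"
    return branches[word], None

state = "S1"
for word in ["three", "five", "zero"]:        # heading 350
    state, err = step(state, word)
    print(word, "->", state, err or "")
```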

With the FSM structure, it is very easy to add other words or otherwise change the

syntax of commands. This can be done simply by adding, removing or otherwise modifying

any desired branches or states of the FSM. For this reason, as well as the added simplicity, a

reduced subset of the ATC commands was first implemented. These consisted of the vectoring

commands, altitude change commands, and the airspeed control commands, as taken directly

from the ATC handbook.

The resulting FSM was too large to be drawn on a single page and as such was split

into conceptual blocks called Superblocks (see Figure 4.8). Each superblock contains its own

internal FSM for implementing its required subdivision of the syntax (see Figures 4.9 through

4.11) and connects to other Superblocks through branches defined by the to-other-superblocks

and from-other-superblocks labels. With this Superblock representation, it was much easier

to picture the overall structure and flow of the FSM. For the current example, it consisted

of getting an aircraft name, followed by a command (one of altitude, airspeed, or heading),

followed by a terminator (the keyword "Over") which reset the FSM to await the input of

the next command.

In general, the operation of such a system would be as follows. As the user speaks a

command, the words composing it are recognized by the ASR system and transmitted to the

SIP. Upon receipt by the SIP, they are immediately displayed on the initial feedback window

1These two are indistinguishable by the parser since only the controller knows exactly what was said.

Figure 4.8: Superblock structure of the FSM implemented. (Blocks shown: Initial State; Aircraft Name, Heading, Altitude and Airspeed Superblocks; "Over" terminator.)

(bottom right hand corner of Figure 4.6) to provide "raw" user feedback to the recognition

process. If they parse correctly, then the parser transitions to the next state and these words

are also displayed in the command feedback window (middle right hand window in Figure

4.6) as part of the current command being input. If an error is detected however, then a

suitable message informing the user is displayed in the error feedback window.

The capability to correct this error was implemented, as was the case in the preliminary

evaluation performed in Chapter 3, through the use of the "Delete" and "Cancel" keywords.

Upon the receipt of the "Delete" keyword, the parser would "back-up" one state to the

previous state, thus, in effect, deleting the last recognized utterance from the input stream.

The command feedback display would then be updated to reflect this change. With the

"Cancel" command however, the parser would be reset to its initial state thus deleting the

entire command received so far. In this case, the display of the current command would also

be cleared in order to reflect its cancellation.
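A sketch of how the "Delete" and "Cancel" keywords can be layered onto such a parser follows; the stack of previously visited states is an assumption about how the one-state back-up might be realized, not a description of the actual SIP code.

```python
# Sketch of the "Delete" / "Cancel" keywords layered on an FSM parser.
# A simple stack of previous states is assumed for the one-state "back-up".

class CommandParser:
    def __init__(self, transitions, initial_state="S1"):
        self.transitions = transitions
        self.initial = initial_state
        self.state = initial_state
        self.history = []          # states visited so far
        self.words = []            # words accepted so far

    def receive(self, word):
        if word == "Cancel":       # reset to the initial state, clear the command
            self.state, self.history, self.words = self.initial, [], []
        elif word == "Delete":     # back up one state, dropping the last utterance
            if self.history:
                self.state = self.history.pop()
                self.words.pop()
        else:
            branches = self.transitions.get(self.state, {})
            if word not in branches:
                return f"error: '{word}' invalid at state {self.state}"
            self.history.append(self.state)
            self.words.append(word)
            self.state = branches[word]
        return None

parser = CommandParser({"S1": {"three": "S6"}, "S6": {"five": "S3"}})
parser.receive("three"); parser.receive("five")
parser.receive("Delete")              # backs up one state
print(parser.state, parser.words)     # S6 ['three']
```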

In order to allow the user to make any changes or corrections to the recognized command,

the input command was not executed by the simulation until the user had validated it by

saying "Over". In this way, the keyword "Over" served two functions; that of switching

VPC modes and that of command validation or termination.

The compilation of the recognized input into syntactically correct commands however was

only one part of the problem. The other dealt with how these were used in order to implement

the desired action. With this configuration of the SIP, this was accomplished by constructing

an executable Lisp s-expression or statement during the state-to-state transitions of the FSM

parser. Once the "Over" keyword was recognized, this statement, if valid, was evaluated and

an appropriate pseudo-pilot response generated.

Although with the FSM structure the possible word sequences, and hence the command

syntax was rigidly defined, there was still the capability of introducing some flexibility into

how the commands were entered. This was accomplished by explicitly incorporating into the

FSM any "short-cuts" or alternative possibilities in how a command might be issued. Take

for example the FSM defining the format required for specifying an aircraft name (Figure

4.9). Here, although the aircraft call sign was deemed essential, the flight number was not

Figure 4.9: Internal structure of the Aircraft Name Superblock

Figure 4.10: Internal structure of the Heading Command Superblock

Figure 4.11: Internal structure of the Altitude Command Superblock

Figure 4.12: Internal structure of the Airspeed Command Superblock

Figure 4.13: Airspeed Command Superblock maintaining original ATC syntax

and could occasionally be omitted. This was a reasonable thing to expect since if only one

flight of a particular air carrier was under the supervision of the controller, it might be likely

that he would address it by the carrier name only. Thus, the section of the FSM dealing

with the receipt of the aircraft name could be exited at any of the states n_1, n_2, n_3 or

n_4. If this resulted in his not uniquely identifying the aircraft to which he was referring,

then contextual error checks on the controller's input, mentioned later in this Chapter, would

come into play to handle this in an appropriate manner.

This same type of flexibility was also incorporated into the vectoring command where

the use of the word "heading" was treated as optional as well as in the altitude command

where "and-maintain" was also not essential. If examination of operating procedures were

to reveal other modifications of the command syntax that were used in practice, then these

could also be incorporated in a similar manner.

An interesting problem was encountered in the definition of the airspeed control commands.

This arose from the fact that the command utilized to issue a change in airspeed, by a

specified increment or decrement, was almost identical to that used to signal a new absolute

airspeed to be flown. In particular, a relative speed change was indicated by the command

"increase/decrease speed (number of knots)" whereas an absolute speed command was of

the form "increase/decrease speed to (speed)" [16]. For example, "increase speed to two

two zero knots" implied that the pilot should now fly at 220 knots whereas "increase speed

two zero knots" implied that he should increase his current speed by 20 knots. Addition of

another word, "to", and hence another branch to the FSM would create problems however

since it would be impossible for the ASR system to distinguish the word "to" from the

number "two"2. In fact, there is no way for even a pilot to know which one was said by the

controller until the entire command is finished.

At first glance, it would seem possible to solve this problem by adding two different

utterances "speed" and "speed-to" to the vocabulary. Thus, when the user was entering a

relative speed change command, the word "speed" would be recognized and the appropriate

2Note that since it is very likely for a controller to use the word "to" in the issuance of this type of command, it cannot realistically be eliminated from the vocabulary.


branch taken in the FSM. Similarly, the word "speed-to" could be used to define a different

path. This method however would not work because it would be very difficult for the ASR

system to distinguish between the utterances "speed two" and "speed-to". Thus, in some

cases, an invalid change in speed would be recognized and in others, an invalid new speed

would occur. As well, it could not realistically be expected that the controller would strictly

adhere to the format of using "speed-to" only for absolute speed changes and "speed" only

for relative speed changes.

There are however, two other ways that this problem can be handled. In the first, the

standard syntax can be modified slightly so that absolute speed changes are specified by

the "increase/decrease speed-to ... " command and relative speed changes by the "in-

crease/decrease speed-by ... " command3 . This would then be incorporated into the FSM

by using the two utterances "speed-to" and "speed-by" to yield a FSM such as the one

in Figure 4.12.

If, however, the requirement that the command syntax not be modified is imposed, this

problem could still be handled by using the same techniques that a human listener would use.

In particular, the FSM would be modified so that all combinations of input were allowable

(see Figure 4.13). Here, no attempt would be made by the FSM to distinguish between "to"

and "two" and the decision on which was actually entered would be handled internally by

determining what the argument of the command (i.e., the "(speed)") was. If it was less

than say 90 knots, then it would be assumed that the desired command was a relative speed

change command. If not, then it would be assumed to be an absolute airspeed command.

Note that the actual speeds entered can range all the way up to the two thousands. This is

because the initial "to" would be recognized as a "two". If this occurs, then the leading

"two" would be eliminated from the speed and it would again be assumed that an absolute

airspeed command was issued.

3Since the word "speed" is no longer contained in the vocabulary, there is no longer the possibility of errors being made between "speed-to" and "speed two".
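The syntax-preserving approach just described reduces the "to"/"two" decision to a simple test on the numeric argument of the command. The sketch below illustrates that heuristic, with the 90-knot cut-off taken from the text; the treatment of the recognized digits as an already-assembled number is an assumption made for the example.

```python
# Sketch of the "to" vs. "two" disambiguation for airspeed commands.
# "increase speed to two two zero" arrives as the digit string 2-2-2-0.
# Values below 90 knots are treated as relative changes; larger values are
# absolute speeds, with the leading "two" of a four-digit value discarded
# as a misrecognized "to".

def interpret_speed(digits):
    """digits: list of ints as recognized, e.g. [2, 2, 2, 0]."""
    value = int("".join(str(d) for d in digits))
    if value >= 1000:                      # e.g. 2220: the leading "two" was really "to"
        value = int("".join(str(d) for d in digits[1:]))
        return ("absolute", value)
    if value < 90:                         # small values: change speed *by* this amount
        return ("relative", value)
    return ("absolute", value)

print(interpret_speed([2, 0]))             # ('relative', 20)   "increase speed two zero"
print(interpret_speed([2, 2, 0]))          # ('absolute', 220)  "increase speed (to) two two zero"
print(interpret_speed([2, 2, 2, 0]))       # ('absolute', 220)  "to" recognized as "two"
```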


Error Detection Using Contextual Information

Oftentimes a recognition or input error occurs that does not create any parsing problems

and can in no way be detected by the parser4. These types of errors create special difficulties

and fall into two general classes. The first of these involves errors that arise from the non-

recognition of words which are optional and need not be included in the command. Examples

of these are the words "and-maintain" or "heading". In most cases, these words are not

critical to the understanding or execution of the command issued and thus their omission

does not create any problems. However in other cases, such as in specification of an aircraft

flight number, non-recognized words can cause problems.

The second class of these errors involves mis-recognition errors made amongst words

that lie on the same state-to-state branch. In this case, the parser transitions to the correct

state but does so based on the wrong input (with respect to what was actually said). Since

the parser still transitions to the proper state, there is no way for these errors to be detected,

at least with the standard FSM mechanism. A good example of this is in the transitions

between states h_4 and h_5 in Figure 4.10. Here, if a mis-recognition is made amongst any of

the digits, for example, mistaking "four" for "five", there is no way for the parser to detect

it. It would however result in an incorrect heading being issued if the user did not detect it

and explicitly correct it.

Since these errors could not be detected by syntax alone, an alternative technique had

to be employed. This technique utilizes information about the airspace and the aircraft

flying in it, termed contextual information, in order to detect some of the discrepancies and

ambiguities in the issued commands.

A good example of its use can be found in the determination of which aircraft was being

referred to by the controller. Here, after the command has been terminated and before the

commanded action has actually been executed, comparisons are made in order to determine

which of the aircraft in the simulation possess names similar to the one recognized. If the

result is unique, then the command can be executed. If however there is more than one

4They can however be detected by the user if he monitors the visual feedback.


possibility, either because not enough of the aircraft's name was issued to make it unique,

or because of a recognition error made in the aircraft name, then a suitable error message is

displayed. This action is readily modifiable to, for example, determine the likeliest candidate

aircraft, based on a user defined measure of merit, and assume that the controller was referring

to this aircraft (if he really wasn't, then the pseudo-pilot response would indicate this error

to him).
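A sketch of this contextual check on the aircraft name is given below; the data structures and the notion of "similar" used here (carrier must match, recognized flight-number digits must be a prefix) are assumptions standing in for the actual comparison performed on the Explorer.

```python
# Sketch of the contextual check used to resolve which aircraft is being
# addressed.  The aircraft list and the similarity rule are assumptions
# made purely for the example.

def find_candidates(recognized_carrier, recognized_digits, aircraft_in_sim):
    """Return every simulated aircraft consistent with the recognized name."""
    return [ac for ac in aircraft_in_sim
            if ac["carrier"] == recognized_carrier
            and ac["flight"].startswith(recognized_digits)]

aircraft = [{"carrier": "United-Airlines", "flight": "650"},
            {"carrier": "United-Airlines", "flight": "213"},
            {"carrier": "TWA",             "flight": "802"}]

matches = find_candidates("United-Airlines", "", aircraft)   # no flight number given
if len(matches) == 1:
    print("execute command for", matches[0])
elif len(matches) > 1:
    print("ambiguous aircraft specification -- prompt the controller")
else:
    print("no such aircraft -- possible recognition error in the name")
```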

This type of contextual error checking could, in a similar manner, be extended to include

other possible error occurrences. For example, before executing a vectoring change command,

the direction to turn onto ("left" or "right") could be used in conjunction with the new

heading and the aircraft's current heading to check for possible errors. The same could be

done with "climb" and "descend" in an altitude change command. These however were not

implemented in this initial version of the SIP.

Evaluation

Although the FSM approach provided a straightforward and simple method of using ATC

command syntax for error detection, its performance when a recognition error was made was

lacking. Since it possessed no innate ability to correct for and/or otherwise compensate for a

recognition error, when an error did occur, the user was forced to stop and correct it before

he could proceed with any verbal input.

The reason for this was the very rigid structure of the FSM. With this, it was very critical

that the parser transition correctly from state to state since the valid vocabulary words are

defined based solely on what the current state is. Thus, if an error caused a transition to the

wrong state, then the valid vocabulary became something other than what it should actually

have been. This implied that subsequent controller input would not be parsed correctly even

if it was validly recognized.

This problem was ironically made even more acute by the capability of the VPC for

continuous speech (ironic because the continuous speech capability was one of the primary

reasons for selecting the VPC system). Using this, users spoke entire commands in a stream

of continuous speech, without stopping after every word to make sure that it was recognized


properly. In this manner, if an error resulting in an incorrect state transition were made, the

rest of the words, having already been spoken, would be, when recognized, either discarded

by the parser as invalid input or parsed incorrectly.

This is perhaps best illustrated by an example. Consider the input of the command

"descend three thousand five hundred over". If "descend" were misrecognized as "turn

left", then the parser would be expecting a heading. The first word arriving, the "three" is

a valid heading digit and would thus parse properly. The next, "thousand", however, would

cause an input error to be indicated. The "five" would again be valid as a heading digit

but the next word, "hundred", would be invalid. Thus, the parser would assume that the

command issued so far was "turn left three five" and would be awaiting the final digit of

the heading specification.

Although this type of problem5 would simply require the user to either correct or repeat

his command, if the stream of speech were terminated by the keyword "Over" the problem

would be complicated further. Under these circumstances, since the VPC was still recognizing

the user's speech (it was the SIP that was treating it as invalid), the "Over" keyword could

be recognized and cause the VPC to switch functions from speech recognition to speech

playback. Thus, it would be impossible for the user to correct the transcribed command

since the VPC was no longer recognizing his verbal input and the error correction keywords

could not be entered. (This same problem would also occur if a mis-recognition of the word

"Over" for some other word, typically the word "four", occurred in the middle of a command

input.)

There were basically three methods of handling this problem. In the first, the receipt of

the word "Over" at any state in the FSM would cause the parser to be reset to the initial state

and an appropriate pseudo-pilot message indicating the error, if possible, to be generated

(see Section 4.3 for a description of how this would take place). The VPC would also be

switched back into speech recognition mode. This procedure however, made no allowances for

the user to correct the errors in the current command and instead forced him to repeat it in

5Although illustrated with an example using a mis-recognition error, this type of problem could occur with any error that resulted in an incorrect state transition, including spurious and non-recognition errors.


its entirety6. The underlying philosophy here was that only pseudo-pilot responses would be

used for error feedback (no feedback display) and that these would indicate to the controller

that an error was made in interpreting his command.

Although this approach solved the VPC mode switching problem, it forced the user to

repeat the entire command just issued. In order to remedy this, a second approach was taken in which

the branches based on the word "Over", at states that were not syntactically valid states for

command termination, were modified so that they branched back to the same state. In this

way, the command recognized so far was not lost. Furthermore, since the VPC was again

reset to speech recognition mode, the user could correct any errors and continue on with the

command input from this point.

A third, though not as elegant, technique for handling this type of error was to simply

refrain from "Over" until the display could be examined in order to determine that no errors

were made. If any were detected, then these could be corrected before the command was

terminated. This approach slowed down the command input process significantly, even in

those cases where there were no errors made, but was successful.

Finite State Machine with Set Switching for Vocabulary Size Reduction

From the testing done with the previous configuration, it was readily apparent that a

means of reducing recognition errors had to be found. One way that it was thought this

could be accomplished was through the use of the set switching capability of the VPC as

outlined in Chapter 3. This could be used in conjunction with the FSM implementation of

the parser in order to specify the active vocabulary of the VPC as a function of the state of

the parser. In this way, the ASR system would only recognize words from a list of valid input

words. Thus, when these recognized words were received by the FSM, they were guaranteed

to be syntactically valid. By reducing vocabulary size in this way, recognition delays could

also be significantly reduced.
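Conceptually, the set switching amounts to deriving the active vocabulary from the branches leaving the parser's current state and downloading it to the recognizer at each transition. The sketch below illustrates the idea only; the download_active_set call is a placeholder rather than the actual VPC serial-port command, and the set of always-active control keywords is an assumption.

```python
# Sketch of vocabulary set switching driven by the parser state: the active
# set is just the words labelling the branches out of the current state.
# download_active_set stands in for the real VPC serial-port command.

TRANSITIONS = {"S1": {"zero": "S2", "one": "S2", "two": "S2", "three": "S6"}}
CONTROL_WORDS = {"Delete", "Cancel", "Over"}     # assumed to stay active at all times

def active_vocabulary(transitions, state):
    return set(transitions.get(state, {})) | CONTROL_WORDS

def download_active_set(words):
    print("VPC active set:", sorted(words))      # placeholder for the VPC command

state = "S1"
download_active_set(active_vocabulary(TRANSITIONS, state))
```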

In order to do this, the required vocabulary set switching logic had to be contained on

6Its action was similar to that taken when a "Cancel" command was issued except the feedback display was not cleared. Instead, a "?" was displayed as the command terminator in order to indicate that an error had occurred.


the IBM PC (the delay required to transmit this information from the Explorer would make

implementation intractable). Since this significantly complicated the code operating on the

PC (a FSM complete with all of the error detection and correction logic, would have had

to be constructed), some of the simpler "built-in" set switching features of the VPC system

were used. These however resulted in an active vocabulary that was not as rigorously defined

as it could have been. That is to say, the active vocabulary at any given state contained

words that would not be valid given that state. This however did not alter the validity of

the results obtained using this configuration since recognition errors resulting in syntactically

invalid words were rare.

As would be expected, the recognition accuracy was improved by the reduction of the

active vocabulary size at each state. This configuration however still exhibited the same

sensitivity to errors that occurred with the previous configuration. With the previous config-

uration, an error causing an improper transition would cause subsequent controller input to

be discarded as invalid. This would be determined at the Explorer end, after the words had

been recognized. With this configuration however, the same error would cause the controller's

verbal input to be discarded at the VPC end, before recognition had even occurred (actually,

during the recognition stage itself). This is because the VPC would now be trying to recognize

the controller's input based on the wrong active vocabulary. Thus, it would seem from the

perspective of the Explorer and the SIP, that the controller had stopped talking. Although

this made almost no difference as to what was seen by the user with either configuration (he

still had to stop and correct the error that caused this before he could go on), a lot of

potentially useful information contained in the rest of the command, which could have been of

use in correcting this error, was lost.

FSM with Inferior Choice Words

A technique that yielded similar results to set switching again incorporated the baseline

FSM structure but this time made available to it information about how well all of the words

in the vocabulary matched the current verbal input. Thus, if the first choice word (as specified

by the ASR system) would not parse correctly, then the second could be examined and so


on until one did parse properly. In this way, all of the syntactically invalid words would be

disregarded, thus, in effect, generating the same results as with set switching. Although this

parser did not stifle the ASR system's ability to get back on track after a recognition error was

made, the increase in size of the active vocabulary (it was now the entire vocabulary) created

other problems. In particular, recognition delays were increased since more comparisons had

to be made in the recognition process. The recognition error rate, however, did not increase

because the parser would detect any potential errors arising from words that did not parse

correctly and examine inferior choices to find one that did parse correctly. Errors involving

words that did parse correctly were not detectable even with the previous parsers.

Operationally, there were other factors to consider. First, a threshold test had to be

implemented on inferior choice words in order to prevent words that had very poor scores,

and were thus not likely said by the user, from being parsed. This prevented random

"garbage" from being parsed into a valid command. Further modifications to this threshold

test could also be implemented using the relative score differences between the best guess

and the one that would parse correctly (if these were different). If this was small, then it was

very likely that the two words could easily be confused. If it was large, then the second choice

word was not likely what had been said and an error had probably been made at a previous

stage in the parsing process. This however did not add significantly to the performance of

the parser and hence was not included.
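A sketch of this fallback logic follows; the threshold value is invented for illustration, and the two-candidate limit reflects the fact that only the top two choices were available from the VPC interface.

```python
# Sketch of the "inferior choice" fallback: take the best-scoring candidate
# that parses, subject to an absolute score threshold (lower scores are
# better matches).  The threshold value itself is illustrative only.

SCORE_THRESHOLD = 80        # candidates scoring worse than this are ignored

def select_word(candidates, valid_words):
    """candidates: [(word, score), ...] ordered best first (at most two
    were available from the VPC).  Returns the accepted word or None."""
    for word, score in candidates:
        if score > SCORE_THRESHOLD:        # too poor a match to trust
            continue
        if word in valid_words:            # parses at the current state
            return word
    return None                            # signal an input error

print(select_word([("turn", 44), ("three", 44)], {"three"}))   # -> 'three'
print(select_word([("four", 30), ("over", 95)], {"over"}))     # -> None
```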

Second, care had to be taken to accept the word "Over" only when it was the first choice

since its receipt would indicate to the Explorer that the VPC had switched modes. Thus, a

command to switch it back to recognition mode (or a pseudo-pilot message) would be sent

to the VPC. The VPC however only switched to playback mode when the first choice word

was "Over". Thus, an extra reset command would be contained in the VPC's input queue

and would result in speech playback synchronization problems.

Finally, the VPC system only made information about the top two candidate words

available. Although in most cases this was sufficient, there were instances where neither of

the top two choices were correct and the third was not available. Thus, an input error was

indicated when there was the potential to recover from it. This was purely a limitation of

the VOTAN software since, in order to determine the best match, all of the words in the

vocabulary had to be scored. The interface provided, however, only allowed access to the top

two.

Even if the second choice word was the correct selection, it was not possible to realign

the recognition algorithm to commence at the end of this word. Thus, there were time mis-

alignment problems, as mentioned in Section 2.2 and in Figure 3.4, which often resulted in

recognition errors in the succeeding words as well.

For these reasons, the performance of this system, although comparable, was not as good

as that for the basic FSM with set switching. This however, could likely be remedied to a

great extent if more than the top two choice words were made available by the VPC.

Other Variations

There are further variations possible on this FSM approach. However most of these

become intractable, either because the interrelationships between recognized words in con-

tinuous speech cannot be readily included or because only the top two choices as to the

recognized word are available with the current VPC software.

One of these variations that possesses a great deal of elegance and simplicity and deserves

mention involves the assignment of a confidence (or score) to each branch of the FSM based

on the likelihood that the word contained in that branch was the one that was actually

spoken. The command actually entered would then be defined by the path that possessed

the best score. This system however breaks down due to the extreme difficulty in assigning

scores to all of the branches in this manner. In particular, since in continuous speech the

currently recognized word affects the recognition of the words following it (word boundaries),

the recognition algorithm would have to be run quite a number of times on the same block

of speech data in order to obtain scores for all of the word sequence combinations required

to score all the branches of the FSM. Even if this could be done, which it cannot with

the VPC, the delays would go up dramatically. This could be modified slightly however, so

that only the top two or three choices would be used to determine the possible branchings.

Or, more correctly, only the top choices within a specified range of recognition scores, thus,


only performing these evaluations when confusion between words is likely. However, this still

becomes complex very rapidly and again, could not be implemented with the VPC.

Pattern Matcher

While the FSM approach used so far was successful in incorporating ATC command

syntax for the detection of errors, it was unable to correct or in any way recover from

them without external aid. Thus, it was left up to the user to correct these before he could

commence inputing data. This requirement for user intervention every time an error occurred

resulted in a very difficult command entry procedure.

In order to remedy this, a different approach was taken in the design of a parser. This

approach, termed the Pattern Matcher, or PM, attempted to make use of the fact that even

though a command might contain one or two errors, its general intent could, at least in some

cases, still be readily inferred. In this way, the user would not be required to correct all of

the recognition errors that occurred.

The general procedure here involved comparing the entire input command to a database

of allowable commands on a word by word basis in order to determine which was the best

match. In this way (in a manner somewhat analogous to the speech recognition process itself),

the input command could be "recognized" and the required command action determined.

Since the explicit enumeration of all possible word sequences constituting the "recog-

nizable" commands was unrealistic, a more compact notation was used to construct the

aforementioned database of allowable commands. In particular, groups of words that are

logical entities are represented by keywords beginning with the symbol "+", in order to

avoid having to list all of the possibilities. For example, "+aircraft" is used to represent

combinations of words that can denote an aircraft. In a similar manner, "+altitude" is used

to represent word groups that specify an altitude, and so on.

An example of how this database of controller commands appears can be seen in Figure

4.14. This database contains the same commands that were incorporated into the FSM in

the previous sections.

Determining which command was spoken when no recognition errors were made was fairly

 1. +aircraft turn right heading +heading
 2. +aircraft turn left heading +heading
 3. +aircraft turn right +heading
 4. +aircraft turn left +heading
 5. +aircraft climb and-maintain +altitude
 6. +aircraft descend and-maintain +altitude
 7. +aircraft climb +altitude
 8. +aircraft descend +altitude
 9. +aircraft increase speed-to +speed
10. +aircraft decrease speed-to +speed

Figure 4.14: Table of ATC Commands used in Pattern Matcher database

straightforward and consisted simply of stepping through each word in the input command

and comparing it to the corresponding position in the command template.

When recognition errors were made however, the process became much more difficult.

This is because mis-recognition errors could obscure words and spurious or non-recognition

errors resulting in word omissions or insertions into the input command could create problems

aligning the input and template. In order to allow for these, a simple procedure whereby

adjacent words were examined to determine if and what types of errors had occurred was

used. This procedure is perhaps best explained by example.

Consider for example the input command I1 I2 I3 I4 and the template command T1

T2 T3 T4. The procedure for comparing these would be as follows.

1. If I1 matches T1, then a match is declared and the comparison proceeds with the next

two elements I2 and T2

2. If I1 doesn't match T1, then I1 is compared to T2

(a) if I1 matches T2, then it is assumed that word T1 has been omitted from the input


stream and the matching process proceeds comparing I2 and T3

(b) if I1 does not match T2 then I2 is compared to T2

i. if I2 matches T2, then I1 is treated as a mis-match of T1

ii. if I2 does not match T2, but I2 matches T1, then I1 is treated as a spurious

input

iii. if I2 does not match T2 and I2 does not match T1, then I1 is again treated as

a mis-match of T1 and the comparison process is repeated commencing with

I2 and T2

Thus, it can be seen that the result of this comparison is a sequence of either match, mis-

match, spurious, or omitted. These are then used to determine a score for the comparison.

The actual scoring used is, to a certain extent, somewhat arbitrary. In the baseline scoring

used, a match scored 0, a mis-match 0.75, an omission 1.0, and a spurious input 1.0. The scores

that were thus generated for each comparison of template and input were tallied and the

"recognized" command was selected as that with the lowest score.

On the whole, this parser was successful in demonstrating some of the things that it set

out to accomplish. In particular, with this approach, it was possible for errors to be made

and still allow the command to be recognized.

There were quite a few instances however where the errors were such that more than

one command template possessed the minimum score. Thus, it was unclear what the issued

command actually was.

Also, the fact that recognition errors tended to occur in successive words (due to word

boundary misalignment) created significant difficulties for the current algorithm since it only

examined adjacent words in performing the matching process.

Even if errors were such that the command could be recognized, there was still the

possibility that they would make the commanded action unclear. This would occur, for

example, if an error were made in the recognition of a digit in a heading, or altitude, or even

aircraft name.


This developmental system suffered from the fact that no mechanism for interactive error

correction by the user was included. As such, if an unrecoverable error was made, the user

was required to input the entire command again. This was done primarily to simplify the

code.

Still, the pattern matching approach provided benefits over the FSM approach. In partic-

ular, the improved error robustness provided by the innate capability to allow for recognition

errors was very desirable. Furthermore, with this structure, the parser could take advantage

of the VPC system's capability to get back on track after a recognition error was made, and

thus allow the user to keep speaking even if an error was made. This created a much more

realistic simulation environment since the user did not have to pause after every few words

to monitor the feedback display and correct errors as they occurred before he could continue

with his verbal input. It also more accurately duplicated the procedure used by humans to

recognize speech.

Its operation was however very slow. This is because none of the actual pattern match-

ing was attempted until the entire command was received (i.e., terminated by the keyword

"Over"). This was done for two reasons. First of all, it greatly simplified the development

of the Pattern Matcher and second, information as to what the next words in the command

input were was required by the matching algorithm.

Possible modifications and improvements to this system are numerous. Clearly the scoring

could be modified to, for example, reduce the significance of mis-matches involving unim-

portant inputs such as "and", or, to increase the importance of others, more critical to the

message content, such as "descend" or "turn". As well, since with a continuous speech ASR

system a mis-recognition error in one word increases the likelihood of a recognition error

(mis-recognition or non-recognition) in the next7, adjacent errors could be scored slightly less

or the matching algorithm could examine words further down the input in order to make a

decision about errors at the current position in the input sequence.

Further modifications could include the incorporation of inferior choices for each recognized input

word into the comparison process. The score could then be suitably modified to reflect this,

possibly by using the recognition score of the particular word selected.

7Due to the word boundary misalignment problem mentioned earlier.

Similarly, a confusability matrix, a matrix whose elements specify the likelihood of the

ASR system confusing one word for another, could be determined (empirically) and used to

score hypothesized misrecognitions involving any word pairs.
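A sketch of how such a confusability matrix might be folded into the mis-match cost is shown below; the matrix entries are invented for illustration and would in practice have to be measured empirically.

```python
# Sketch of a confusability-weighted mis-match cost.  Each entry gives the
# (illustrative, invented) probability that word A is returned by the
# recognizer when word B was actually spoken.

CONFUSABILITY = {("four", "over"): 0.30,
                 ("to",   "two"):  0.45}

def mismatch_cost(recognized, hypothesized, base=0.75):
    """Cheaper mis-matches for word pairs the recognizer often confuses."""
    p = CONFUSABILITY.get((recognized, hypothesized), 0.0)
    return base * (1.0 - p)

print(mismatch_cost("four", "over"))    # 0.525 -- a plausible confusion
print(mismatch_cost("four", "climb"))   # 0.75  -- an unlikely confusion
```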

It is beyond the scope of this work however to examine all of these possibilities but they

are mentioned here for the sake of completeness.

4.2.4 Discussion

From the material covered in the previous sections, it can be seen that the primary

difficulty in the design of the Speech Input Interface (or ATCCR system) is the handling of

recognition errors. Although means were presented whereby these could be corrected by the

user, it was desirable that the SIP system itself be able to correct some of these.

In designing this capability for internal error correction, it became readily apparent that

what was being attempted by the PM, and to some extent, the FSM through some of the

variations on it, was directly analogous to what was being done in connected speech recog-

nition systems. That is to say, attempts were being made to find the best sequence of

words, subject to syntactical constraints that matched a given "segment" of speech input.

This matching process was such that comparisons went in both directions. Thus, not only

could words recognized early on affect the recognition of words spoken later, but the con-

verse could also occur. The difference however is that the SIP operates on entire commands

whereas connected ASR systems work on short phrases delineated by pauses.

This differed significantly from what goes on in continuous speech recognition. In con-

tinuous speech recognition, a top-down, fore-aft process in which only the past history of

recognized words can affect the recognition of any given word is performed. This is directly

analogous to that taken by the Speech Input Parsers based on a FSM approach.

This is not, in general, what is desired in recognizing entire commands since in a great

number of cases, it is not clear, even if no recognition errors are made, exactly what was said

until most or all of the command has been recognized. Although systems using this fore-aft

approach can still be quite successful in recognizing commands, there is a lot of information


contained in the rest of the input that can be used to great advantage in order to reduce the

error rate. Furthermore, there are instances where these are guaranteed to fail in recognizing

what was said.8

Thus, the question was raised that perhaps a connected speech recognition system was

more amenable to the task at hand (since this is what was in effect, being done by the PM).

On closer reflection however, it was decided that what was really desired was a connected

speech recognition capability in a continuous speech recognition system9. In this way, the

speech input data could be "rewound" to various locations to re-examine it, perhaps under

different operating parameters. Thus, the immediate feedback benefits of continuous speech

recognition are maintained until an error is detected. Then, the system could re-examine the

input using connected speech recognition techniques.

In this light, the capability of the VPC system to get itself back on track, after a recog-

nition error was made, assuming, of course, that set switching was not being used in the

recognition algorithm, was thought to be desirable. It could be used in order to continue

generating valid recognized input, even after an error was made and this input could be used

in order to allow the parsers to hypothesize where and if any errors had been made. The

ASR system could then be rewound to this point in order to attempt to verify the error

hypothesis.

Although the FSM could also generate these error hypotheses, the Pattern Matcher was

potentially much better. This is because the pattern matching process determined the exact

location of the differences between the input command and the command template. Thus,

the locations of candidate recognition errors could be hypothesized to allow the ASR system

to be rewound to these points.

For example, if a vectoring command was spoken and only two digits of the three digit

heading were recognized, then the ASR system could be rewound to the section of speech

"As exemplified by the "speed to" command where it is unclear until the actual speed change is recognizedwhether the controller said "to" or "two".

9An interesting variation might be to have two ASR systems, one continuous and the other connected, recognizing verbal input and making it available to the SIP in parallel. In this way, the disadvantages of each ASR system on its own could potentially be masked by the operation of the other.


waveform data where this heading was issued and it could be reprocessed with a relaxed

recognition threshold in order to extract the missing digit.

Furthermore, this could be used to provide a "what if" type of feature in which any word

in the recognized stream of input could be replaced by another (say the second or third choice

word replacing the first choice word) and the resulting changes in the recognized output and

scores could be observed. This could be used to great advantage to resolve errors resulting

from word boundary synchronization problems in following words when a recognition error

is made.

This capability to rewind speech data to an arbitrary point and restart the recogni-

tion algorithm however was not possible with the current configuration of the VPC ASR

system10. Furthermore, the additional computations would probably increase delays dra-

matically. Thus, with the current system, although hypotheses can be made as to what

phrase was actually said, there is no means of going back and verifying this hypothesis

through re-analysis of the speech data.

This however was not the only instance where ASR system limitations affected the design

of the SIP and its error correction schemes. For example, recall that the FSM with Inferior

Choice Words SIP was limited since only the top two choices were available. Still yet another

example relates to the fact that the only output from the ASR system is recognized words.

Thus, if a non-recognition error occurred, there would be no way for the SIP to know that a

word was actually spoken but not recognized.

In general, these limitations were not related to the actual recognition procedure being

used but are instead directly attributable to the black box design of the ASR system. This is

because ASR system designers do not want to unnecessarily complicate the user interfaces

for their systems. Thus, a lot of the internal operation is hidden from the user. The end

result however is that complex techniques that attempt to improve or correct recognition

errors are powerless due to a lack of information from and control over the ASR system.

"'This has to a certain extent, been remedied through the introduction of a Library of C routines that providea lot more flexibility in what can be done with this system.


4.3 Pseudo-Pilot Responses

One of the useful features of the VOTAN VPC system was the digital recording capability.

This allowed the user to record a number of spoken messages of varying length, subject to

memory limitations, and then play them back in any particular sequence desired. This feature

was used to incorporate pseudo-pilot responses into the Air Traffic Control simulation.

As mentioned in Section 4.2.2 one of the principal uses of the pseudo-pilot in the simu-

lation task is to provide feedback to the user about possible recognition errors. This form of

feedback can be used to indicate four general statuses of command recognition.

In the first, the controller's command was received correctly and there were no apparent

problems in understanding it. In this case, the pseudo-pilot is used to generate a standard

acknowledgment to the controller indicating the command was received.

In the second, an error was made in recognizing the specific command; however, the air-

craft being referred to was known. Here the p-pilot responds with a "Say Again" message

indicating that the message was not received correctly. In the third, it was unclear which

aircraft was being referred to and, as such, there is no p-pilot response. Finally, in the

fourth, an error was made in determining which aircraft was being referred to, and therefore a

response is generated from a random aircraft. The exact handling of these last two cases can

be varied quite a bit but basically consists of either no pseudo-pilot response or response from

the wrong pseudo-pilot. The other possibility, responses from multiple aircraft, was just not

possible with the current system and might not be tractable even if the capability existed.

In order to obtain the desired flexibility in response message format as well as to limit

memory requirements, the response messages were constructed by connecting together

shorter, more general messages. These shorter messages consisted of a specified number

of air carrier names (the simulation was required to operate with only these particular air

carriers), followed by the digits and a number of keywords. A table outlining exactly what

the pseudo-pilot's "vocabulary" was has been included in Table 4.1.

Furthermore, in order to simulate different aircraft pilots (the controller does use this

information), a number of speakers were recorded creating a database of pseudo-pilot voices.

Table 4.1: Table of discrete messages recorded for Pseudo-pilot response formulation

    Message Recorded        Message Recorded
    United Airlines         zero
    TWA                     one
    CP Air                  two
    Air Canada              three
    Say Again               four
    Roger                   five
    heading                 six
    altitude                seven
    hundred                 eight
    thousand                niner

Thus, as each aircraft entered the controller's airspace, it was assigned a particular pseudo-

pilot voice. All responses from this aircraft were then made using this particular voice.

The format of pseudo-pilot responses was kept fairly basic. They consisted of the identi-

fication of which aircraft was responding followed by a "Roger" message implying message

received and acknowledged, or a "Say Again" message implying that the message was not

understood. With the inherent flexibility however, these could later be expanded to in-

clude a much broader range of message formats. For example, a typical response message

acknowledging a command would be the message "United Airlines six five zero Roger.".
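A sketch of this concatenation scheme follows; the play function is a stand-in for the VPC playback commands, and the digit-by-digit expansion of the flight number is an assumption made for the example.

```python
# Sketch of pseudo-pilot response construction by concatenating the short
# recorded messages of Table 4.1.  play() stands in for the VPC playback
# command; a voice dataset is selected per aircraft.

DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "niner"]

def acknowledgment(carrier, flight_number):
    """e.g. 'United Airlines', '650' -> United Airlines six five zero Roger"""
    return [carrier] + [DIGIT_WORDS[int(d)] for d in flight_number] + ["Roger"]

def say_again(carrier, flight_number):
    return [carrier] + [DIGIT_WORDS[int(d)] for d in flight_number] + ["Say Again"]

def play(voice_dataset, messages):
    # placeholder for: select the pseudo-pilot voice dataset on the VPC,
    # queue each recorded message, then send the end-of-message flag
    print(f"[voice {voice_dataset}]", " ".join(messages))

play(3, acknowledgment("United Airlines", "650"))
# [voice 3] United Airlines six five zero Roger
```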

The scheduling of the p-pilot functions was performed primarily at the VPC end. The

actual generation of which pseudo-pilot messages were to be played however was performed on

the simulation computer (Explorer). A flowchart indicating the action performed by the VPC

board can be seen in Figure 4.15. Basically, it remained in speech recognition mode until the

keyword "Over" was recognized. It then switched into speech playback mode."1 In this mode,

the first input from the Explorer (via the RS-232 Serial Port) specified which pseudo-pilot

voice dataset was to be used and the subsequent input specified which particular message

was to be played. Upon receipt of the ''end-of-message'' flag, the VPC was switched back

into recognition mode to repeat the process over again.

"In order to indicate these mode changes to the user so that he would know when the system was listeningto speech, a beep was sounded before entering and after exiting recognition mode.


The critical and determining factor in this sequencing strategy is the use of the word

"Over" in the speech input stream as the command terminator that would cause the VPC to

switch functions. Although it was desirable that pauses in the controller's speech be also used

to indicate command termination in conjunction with this, the only way that this detection

and timing of pauses in the user's speech could be implemented caused the VPC to switch

into and out of recognition mode on a regular basis if there was no verbal input. Thus, there

was the possibility that when the controller did speak, the VPC would not be in recognition

mode and would miss part of his spoken input. For this reason, this was not implemented.

Ideally of course, there should be two different systems, one for speech recognition and one

for speech output. In this way, there would not be the same problems with sequencing.

Pseudo-pilot responses were generated by the simulation computer upon receipt of the

keyword "Over" from the speech recognizer. If the verbal command parsed correctly, then an

appropriate acknowledgment message would be generated. If the command was ambiguous

but the aircraft being referred to was not, then a "Say Again" message would be generated.

If, however, the aircraft specification was ambiguous, then any action could be taken. In the

current configuration, the user was prompted on the display that the aircraft specification was

ambiguous and a null message was played. Note that in any case, the ''end-of-message''

flag must be sent to the VPC in order to switch it back into speech recognition mode.

Furthermore, since the VPC was constantly being switched into and out of recognition mode,

it was necessary for the user to know when the recognition system was "listening" to his verbal

input. In order to do this, a beep was sounded whenever the VPC switched modes. This

could easily be modified to include a visual signal on the controller's display since his attention

is fixed there anyway. Experimentation revealed however that it was much simpler to use

the aural cue.

Evaluation

The performance of the pseudo-pilot, although it did add a degree of realism and satis-

faction to the ATC simulation, was lacking. The primary problem was directly attributable

to the use of short messages that were concatenated together to form a suitable response

Figure 4.15: Flowchart of sequencing of VPC2000 functions

message. This created two problems.

The first problem was a result of the apparently random changes in pitch and intonation

of the recorded words. These arise from the fact that when the messages were recorded, it

was very difficult to avoid the introduction of inflections and emphasis. Thus, when they

were connected together, these inflections did not mesh together very well and produced very

strange sounding, although still intelligible, pseudo-pilot responses. This could be remedied

through a more careful and iterative recording procedure where such messages are erased and

re-recorded, or, different messages could be recorded with specific intonations for playback

at different positions in the response sequence.

The second problem concerned the discernible pause between the messages as they were

played back. This was directly attributable to the concatenation procedure used to construct

pseudo-pilot response messages. This made for responses that were very slow and often

created a significant delay to the controller since he was not able to issue the next command

until the message was finished playing.

Although the internal operation of the VPC in speech playback mode could not be mod-

ified, there were still a number of ways to reduce the effects of this problem. The first was

to reduce the call-out for identification of the aircraft by omitting the flight number and

using the carrier name only. This would reduce the number of concatenations necessary to

construct the response message and decrease the delays. This however, created the potential

for confusion between different flights of the same carrier but the distinct voices used for the

different pseudo-pilots alleviated this to some extent.

The second method used was to record some of the more common response messages as

entire messages and add these to the existing pseudo-pilot vocabulary. By doing this, not only

were delays eliminated, but the responses were much more realistic sounding. In particular,

the "Roger" and "Say Again" messages were recorded in this way. Since it was not possible

to know all of the flight numbers of the aircraft in the simulation beforehand, these messages

were simply recorded with only the carrier name for identification. By using these messages

only when fast responses were desired and the standard message format in other cases,

pseudo-pilot flexibility was still maintained. This scheme however, although addressing the


intonation and concatenation delay problems of the pseudo-pilot responses, still possesses the

problem of potentially ambiguous aircraft specification. This could be remedied by further

customizing these messages to include the aircraft flight number as well but this would entail

knowing beforehand the exact names of the aircraft that would be operating in the simulation

and would greatly limit flexibility. Furthermore, responses for every aircraft would have to

be recorded and this would greatly increase memory requirements.

In light of these findings, the final message format selected was to use the format initially

described for its flexibility in addition to some messages recorded in their entirety for their

realistic qualities.

The different pseudo-pilot voices however still made it possible to determine which aircraft

was responding if more than one flight from that particular carrier was in the air.

In all of these cases however, only a fixed number of recordings are being used so there is

not much variability in the sound of the responses as there would be in the real world. Thus,

although the basic task is accomplished, there are some drawbacks in terms of realism.

4.4 Discussion

In general, the simulation worked fairly well and was a good vehicle for the demonstration

of ATC command recognition. The use of the VPC was successful in eliminating the require-

ment for blip drivers both for speech input and speech output functions while maintaining a

high degree of realism. Extensive testing of the simulation however, served to indicate some

limitations and problems with both ATC command entry in relation to the simulation task

as well as in general. These, in addition to suggestions as to how they can be alleviated are

discussed in the following section.

Recognition Errors

As would be expected and was amply demonstrated, by far the greatest problem in

incorporating ATC command recognition into the simulation application arose from errors

in the recognition of the controller's verbal input.


These errors resulted primarily from the sensitivity of the VPC to co-articulation effects

and variations in the way that the user spoke. These variations were, in general, thought to be

insignificant to the user (he was in no way trying to "fool" the system or push it to its limits).

However, they did significantly affect the system. If care was taken to maintain consistent

pronunciation between training and use of the system as well as limiting co-articulation

effects through careful articulation of verbal input, then the VPC performance was found to

be more than adequate to accomplish the tasks attempted. If however this was not the case,

then the error rate increased enough to make its use difficult.

The recognition errors that occurred were, for the most part, very similar to those encoun-

tered and described during the initial system evaluation of Chapter 3. There were however

some additional errors incurred when the user paused and said "ummm" or "aahhh" while

entering a command. This often led to word insertions or spurious recognitions since these

sounds were often mis-recognized as valid input. One method of correcting for these was to

train these sounds as they were made by the user and add them to the vocabulary. Thus,

they could hopefully be recognized and eliminated appropriately by the parser. This how-

ever, created more errors than it eliminated since there was a great deal of variability in how

these sounds were made by the user and thus they were rarely recognized. As well, the ad-

dition of these templates into the recognition vocabulary created a lot more mis-recognition

errors. As such, the operator was instead required to try and avoid these expressions and

state commands clearly.

Furthermore, there were instances where the subject controller desired to talk to other

people and could not since the ASR system was listening in. In order to allow for this, a

microphone cut-off switch was included in the set-up. This also alleviated the problem with

"umm" and "aahhh" by allowing the user to switch the mike off until he had decided which

command he wanted to enter.

Background noise, which it was found could greatly affect the error rate, was successfully

compensated for through the use of a noise canceling, headset mounted, microphone. The use

of this further increased recognition accuracy by reducing the variability in the positioning

of the microphone with respect to the user's mouth and thus the variations in the signal seen


by the ASR system.

Since the vocabulary was already fixed for the ATC environment, words that were eas-

ily confused with others or were otherwise likely to cause recognition errors could not be

eliminated. Instead, problem words were merged together with other words to form longer

utterances whose recognizability, at least with the VPC system, was enhanced. This proce-

dure tended to be empirical in nature, requiring different word combinations to be tested in

order to determine which would solve the problem. It was this that motivated the concatena-

tion of such words as "and" and "maintain" into the single utterance "and-maintain" (the

"and" was a constant source of recognition error difficulties).

Some desirable modifications to the system to reduce error rates would be the addition

of the capability to train words "on the fly" while the simulation was running and then

add them to the vocabulary. In this way problem words could be retrained so as to reduce

recognition errors. Furthermore, commonly used sequences of words, such as aircraft names,

could be trained and added as single utterances. For example, the utterance "Air-Canada-

one-two-three" could be used in addition to the four words "Air-Canada one two three".

This would improve accuracy since longer words tend to be easier for the VPC to recognize.

Even if this longer utterance were not recognized, then the fallback would be to recognize

each individual word and proceed from there as was done originally. Care would have to

be taken to avoid adding similar sounding utterances to the vocabulary however, since these

could be easily confused by the VPC.

Another technique with the potential to improve recognition accuracy is the use of adap-

tive template modification techniques. Using this, templates could be modified, or even

removed, if their recognition performance was not good. This would be indicated through a

large number of corrections made involving these particular words by the user. The problem

here however lies in determining that these were indeed corrections and not simple input

changes. Furthermore, it is unclear exactly when a template should be modified and when

it should remain unchanged. For example, if a recognition error occurred due to an unrea-

sonable variation in how a particular word was said (e.g., yawn, background noise, excessive

co-articulation or mumbling) then adding this template to the vocabulary would probably


only degrade the recognition accuracy (as was evidenced with the addition of the highly

co-articulated template for "eight" discussed on page 41). Thus, improper procedures could

very readily lead to even poorer performance than was originally evident.

By far the biggest difficulties with recognition errors occurred when they involved the

keywords "Over", "Delete", or "Cancel". Since these words performed special functions,

recognition errors involving them would significantly alter the state of the parser. Fortunately,

these words were sufficiently distinct from other words contained in the vocabulary so that

recognition errors involving them were rare. With a larger vocabulary however, this might

not be the case. Therefore, this could be a problem.

One way in which this problem could be addressed would be to duplicate some existing

ASR systems which require the activation of an additional switch on the mike to indicate

that the word being spoken is a keyword. In this way, the user can make certain that these

keywords are not confused with standard input.

Error Correction

Of the two error correction strategies implemented, it was, in general, found that if the

error rate was low (this depended to a great extent on the user and on the particular day as

was indicated in Chapter 3), the "Cancel" was preferable to the "Delete". This because the

user would often use the full capabilities of the ASR system and enter the entire command as

part of a stream of continuous speech since this was much easier than pausing after each word

or group of words in order to check for errors.2 In this manner, it was just as straightforward,

and much less demanding, to cancel the current command and repeat it in its entirety rather

than delete back to the error and commence from this point on. If the error rate was high

however, then since the user could almost be certain that an error would be made as the

command was repeated, it was preferable to simply correct the current command rather than

begin again.
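Schematically, the two keywords act on the buffer of recognized words as in the following sketch (illustrative C only, with hypothetical names): "Delete" removes the last recognized word, while "Cancel" discards the command entered so far.

    /* Illustrative sketch of the two user-driven correction keywords acting
     * on the buffer of recognized words: "Delete" drops the last word,
     * "Cancel" clears the whole command entered so far. */
    #include <stdio.h>
    #include <string.h>

    #define MAX_WORDS 32

    static const char *command[MAX_WORDS];
    static int length = 0;

    static void handle_word(const char *word)
    {
        if (strcmp(word, "Delete") == 0) {
            if (length > 0)
                length--;                 /* remove the last recognized word */
        } else if (strcmp(word, "Cancel") == 0) {
            length = 0;                   /* start the command over          */
        } else if (length < MAX_WORDS) {
            command[length++] = word;     /* ordinary vocabulary word        */
        }
    }

    int main(void)
    {
        /* "niner" misrecognized as "five"; the user deletes it and re-speaks. */
        const char *input[] = { "TWA", "turn", "left", "heading",
                                "zero", "five", "Delete", "niner", "zero",
                                "Over" };
        int i;

        for (i = 0; i < (int)(sizeof(input) / sizeof(input[0])); i++) {
            if (strcmp(input[i], "Over") == 0)
                break;                    /* command terminated */
            handle_word(input[i]);
        }
        for (i = 0; i < length; i++)
            printf("%s ", command[i]);
        printf("\n");
        return 0;
    }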

Since it was often difficult and time consuming to "Delete" back to the error and correct

it, the need for additional error correction schemes was indicated.

2This pausing would also tend to increase the recognition accuracy of the ASR system by reducing co-articulation effects.

One such improvement to the error correction procedure would be to provide the capa-

bility to repeat only certain portions of a command, preceded by a word such as "Check"

which would indicate that an error had occurred in the prior words. For example, consider

the input sequence "TWA turn left heading 090 Check 050 Over". Here, it is clear what is

implied. This type of capability however would be very difficult to incorporate for a number

of reasons. First, if there are errors present, both in the original command and possibly the

modification coming after the "Check" then it would be very easy for the meaning to become

muddled. Furthermore, because the correction could be a correction of any part of the com-

mand, the benefits of syntax for error reduction would be eliminated. Thus, this technique

was not implemented in the current configuration although it does deserve mention.

A much better technique for the correction of input errors would be the incorporation

of multi-modal techniques (mouse and keyboard as well as speech) into the command entry

and correction process. In this way for example, the mouse could be used to select any

recognized word presented in the feedback display. The user would then have the option

of changing/correcting this word, deleting it, or inserting other words at this position. This

could be done by typing, speech, or even through the use of pull down menus containing mouse

sensitive options as to what the recognized word was. Furthermore, with this capability,

commands could be entered by keyboard alone if desired. Thus, hard to recognize or problem

words could simply be typed in, thereby eliminating any frustration on the part of the user

attempting to enter these verbally.

Scope

One of the major drawbacks of the ATC simulation itself was its limited scope. This

arose primarily because of the limited number of ATC commands that could be understood

by the system (only three different commands were implemented for this particular stage of

the work).

This, however, was not much of a problem for two reasons. First, the structure of the

Speech Input Parsers was such that additional commands could readily be added. Second, in


a simulation environment, it is very simple to restrict the scope of the task to one requiring

only those commands that have been defined. This, as will be discussed in the next chapter,

is not the case in an operational environment.

Note that if the number of commands were to be increased, then the vocabulary would also

likely increase. This increase would however have to be limited so that the total vocabulary

was at most 64 words. The reason for this is that even though the VPC allows different

groups of 64 words (actually, 22K worth of template data) to be switched in from main

memory thereby increasing the effective total vocabulary size, the system is not monitoring

the microphone while this is taking place and as such, all speech made during this period is

lost. Thus, all of the words that can possibly occur in any given command must be part of

the same group of 64 words comprising the current vocabulary.
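Under this constraint, one useful bookkeeping step when commands are added is to verify that every word that can occur in any defined command fits within a single 64-word group. A rough sketch of such a check is given below; the word lists shown are illustrative only.

    /* Illustrative sketch: checking that the words required by the defined
     * commands all fit in the single 64-word active vocabulary group imposed
     * by the VPC (switching groups mid-command would leave gaps during which
     * speech is lost). */
    #include <stdio.h>
    #include <string.h>

    #define GROUP_LIMIT 64
    #define MAX_WORDS   256

    static const char *vocabulary[MAX_WORDS];
    static int vocab_size = 0;

    /* Add a word to the required vocabulary if it is not already present. */
    static void require_word(const char *word)
    {
        int i;
        for (i = 0; i < vocab_size; i++)
            if (strcmp(vocabulary[i], word) == 0)
                return;
        vocabulary[vocab_size++] = word;
    }

    int main(void)
    {
        const char *heading_cmd[]  = { "turn", "left", "right", "heading" };
        const char *altitude_cmd[] = { "climb", "descend", "and-maintain",
                                       "altitude", "hundred", "thousand" };
        const char *digits[] = { "zero", "one", "two", "three", "four",
                                 "five", "six", "seven", "eight", "niner" };
        int i;

        for (i = 0; i < 4; i++)  require_word(heading_cmd[i]);
        for (i = 0; i < 6; i++)  require_word(altitude_cmd[i]);
        for (i = 0; i < 10; i++) require_word(digits[i]);
        require_word("Over"); require_word("Delete"); require_word("Cancel");

        printf("%d of %d word slots used\n", vocab_size, GROUP_LIMIT);
        if (vocab_size > GROUP_LIMIT)
            printf("warning: commands require more than one vocabulary group\n");
        return 0;
    }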

The scope of the simulation was also limited in that the aircraft names, or at least their

call sign roots, had to be known prior to the operation of the simulation in order to train

these on the ASR system and to define them in the SIP. Again, this did not pose much of

a problem in the simulation environment since the names of any aircraft appearing in the

simulation can be readily controlled. Furthermore, the capability of assigning different flight

numbers to aircraft with the same call sign root or carrier name made it seem to the user

that there was more variety in aircraft names than there actually was.

Speech Input and Output Sequencing

The actual sequencing of the command entry and pseudo-pilot response functions also

left something to be desired. In particular, since the hardware and software limitations of the

VPC forced the use of a keyword ("Over" was selected) to switch between speech recognition

and playback functions, this keyword had to be used to terminate each and every command

issued in order to allow any pseudo-pilot messages generated by the simulation to be played.

This greatly limited the rate at which commands could be entered since the user was forced

to pause after every command was terminated in order to wait for any pseudo-pilot messages

(recall that whether or not a pseudo-pilot message was generated, the VPC still switched

modes and would thus, not be monitoring the speech of the user). Furthermore, it resulted in


the inability to issue a number of commands, possibly to different aircraft, in rapid succession

without any intervening pauses to wait for acknowledgments from the pilots as is often done

in the real world, especially during high workload situations.

These last two criticisms are easily addressed, at least with the FSM parsing approach, by

adding a branch from the "bottom" of the FSM to the "top" so that the receipt of another

aircraft name after a syntactically complete message had been input would indicate that

another command was being issued. The chain of commands would still have to be terminated

by "Over" however in order to indicate that they are error free and can be executed. The

PM would require more significant modifications to implement this task however since the

current version requires each command to be a separate entity for matching purposes

and there is no provision for splitting this command chain into its separate commands.
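In outline, the FSM modification amounts to one extra transition, as in the following sketch (illustrative C; the syntax check is reduced to a trivial state flag and the carrier list is hypothetical).

    /* Illustrative, greatly simplified sketch of the FSM loop-back: after a
     * syntactically complete command, receiving another aircraft name starts
     * the next command in the chain; "Over" still terminates the chain. */
    #include <stdio.h>
    #include <string.h>

    enum state { WANT_AIRCRAFT, WANT_BODY, COMPLETE };

    /* Crude stand-in for the aircraft-name recognition step. */
    static int is_carrier(const char *w)
    {
        return strcmp(w, "TWA") == 0 || strcmp(w, "United") == 0 ||
               strcmp(w, "Air-Canada") == 0;
    }

    int main(void)
    {
        /* Two commands issued back to back, terminated by a single "Over". */
        const char *words[] = { "TWA", "turn", "left", "heading", "zero",
                                "niner", "zero",
                                "United", "climb", "and-maintain", "one",
                                "zero", "thousand", "Over" };
        enum state s = WANT_AIRCRAFT;
        int i, ncommands = 0;

        for (i = 0; i < (int)(sizeof(words) / sizeof(words[0])); i++) {
            const char *w = words[i];

            if (strcmp(w, "Over") == 0)
                break;                       /* chain is error free: execute all */

            if (is_carrier(w) && (s == WANT_AIRCRAFT || s == COMPLETE)) {
                /* The added "bottom to top" branch: a new aircraft name after
                 * a complete command begins the next command in the chain. */
                ncommands++;
                s = WANT_BODY;
            } else if (s == WANT_BODY) {
                s = COMPLETE;                /* the real FSM checks full syntax */
            }
        }
        printf("commands in the chain: %d\n", ncommands);
        return 0;
    }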

However, the problem of delays incurred through the recognition, or mis-recognition of

the word "Over" and the subsequent switching into speech output mode still exits.

There are two possible means for correcting this. First, the command termination strategy

(required in order to know when the user is finished inputting a command and correcting

any errors made in it as well as to switch operating modes) could be modified to make it

more general by including information about command syntax and periods of silence on the

part of the controller. In this way, for example, a period of silence of sufficient duration (in

conjunction with a syntactically complete message) could be used to indicate the termination

of a command. A good way to do this would be to incorporate a push-to-talk switch on the

controller's mike. This would be monitored by the Speech Input Interface and when released,

it could be used to indicate the termination of the command. Furthermore, it could be used

to disengage the ASR system so that the controller could talk to other people without having

it attempt to recognize what he said.
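A sketch of how such a switch might be monitored is given below; the sampled switch states and the actions taken are purely illustrative.

    /* Illustrative sketch of command termination driven by a push-to-talk
     * switch: while the switch is pressed the recognizer is engaged; its
     * release marks the end of the command (and lets the controller talk to
     * other people without the ASR system listening in). */
    #include <stdio.h>

    /* Sampled state of the (hypothetical) mike switch: 1 = pressed. */
    static const int ptt_samples[] = { 1, 1, 1, 1, 0, 0, 1, 1, 0 };

    int main(void)
    {
        int engaged = 0, i;

        for (i = 0; i < (int)(sizeof(ptt_samples) / sizeof(ptt_samples[0])); i++) {
            int pressed = ptt_samples[i];

            if (pressed && !engaged) {
                printf("engage recognizer\n");
                engaged = 1;
            } else if (!pressed && engaged) {
                printf("terminate command, release channel for pseudo-pilot\n");
                engaged = 0;
            }
        }
        return 0;
    }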

Second, the speech input and output functions could be performed using different systems

so that they would not be mutually exclusive. In this way, the user would be able to talk

to the system at any time. Furthermore, if he wanted to issue commands to two distinct

aircraft sequentially without pausing in between and waiting for an acknowledgment, he

could. Granted, there would be some scheduling involved so that pseudo-pilot responses


would not be played while he was speaking but this could be easily accomplished by the

use of the aforementioned microphone switch. This could be monitored to determine if the

controller was finished talking and the playback of pseudo-pilot messages suppressed until he

was.


Chapter 5

Air Traffic Control Command Recognition: Operational Applications

Now that some initial experience has been gained in the design of a system for ATCCR, it

is time to re-examine some of the Operational Applications mentioned in Chapter 1 in order

to determine what the practical difficulties in their implementation would be and propose

some solutions. Here, the difficulties to be discussed are those relating to system design and

not to recognition errors which were discussed in the last chapter.

These difficulties can be grouped into two major classes; those that are specific to a

particular application, and those that are generic to any application. Both of these will be

discussed in the following sections.

5.1 General Difficulties

In general, no matter what the particular use to which ATCCR (or ASR for that matter)

is put, there are a number of problems in incorporating it into an everyday operational

environment. These problems basically result from the finite vocabulary of ASR systems,

and the finite number of recognizable commands that can be designed into the ATCCR

system and have, to some extent, already been evidenced in the simulation task of the last

chapter. There, however, the scope and nature of the task could be artificially constrained

and modified in order to minimize these difficulties. This however cannot be done in an


operational environment where the controller does not possess the same control over the

environment.

5.1.1 Recognition of Aircraft Names

For example, in a simulation environment, explicit control can be exercised over the name

of any aircraft appearing in the controller's airspace. In this way, only those aircraft whose

names have been included into the ASR system's vocabulary would appear. In an operational

environment however, this control over which aircraft appear is not possible. Hence, it is

conceivable for an aircraft whose name cannot be recognized by the ASR system to enter a

controller's sector.

One way that this could be remedied is to determine all of the different names of aircraft

that could be expected to enter the sector during some future period of time (actually, only

the carrier names might be required) and explicitly train the ASR system to recognize these

before operations begin. Thus, when a given aircraft entered the sector, the template for its

name could be called up from a database into an "active" list of aircraft names so that it

could be recognized by the ASR system.
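The bookkeeping for such an "active" list might look roughly like the sketch below (illustrative C with hypothetical names); the point is simply that name templates are enabled as flights enter the sector and disabled as they leave.

    /* Illustrative sketch of maintaining an "active" list of aircraft-name
     * templates: names are moved from a pre-trained database into the active
     * vocabulary when a flight enters the sector and removed when it leaves. */
    #include <stdio.h>
    #include <string.h>

    #define MAX_ACTIVE 16

    static const char *active[MAX_ACTIVE];
    static int nactive = 0;

    static void aircraft_entered(const char *carrier)
    {
        if (nactive < MAX_ACTIVE) {
            active[nactive++] = carrier;         /* load template into active list */
            printf("activated template: %s\n", carrier);
        }
    }

    static void aircraft_left(const char *carrier)
    {
        int i;
        for (i = 0; i < nactive; i++) {
            if (strcmp(active[i], carrier) == 0) {
                active[i] = active[--nactive];   /* drop it from the active list   */
                printf("deactivated template: %s\n", carrier);
                return;
            }
        }
    }

    int main(void)
    {
        aircraft_entered("Air Canada");
        aircraft_entered("CP Air");
        aircraft_left("Air Canada");
        printf("active names: %d\n", nactive);
        return 0;
    }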

This approach however suffers from a number of disadvantages. First, the number of

possible aircraft names is quite large and as such, not only would recognition delays be

increased significantly, but a sizable amount of memory would be required on board the ASR

system in order to hold all of these. This, however, is remedied to a certain extent by the

maintaining of a list of "active" aircraft names (i.e., those which are currently in the ATC

sector).

Second, since there would exist a number of names that would be used only rarely,

recognition performance could be expected to be degraded significantly for these due to

changes in the user's voice and variations in pronunciation, between the time that they were

trained and the time that they were used, if such a long term database were constructed.

Third, the actual training of all of these different names would be quite time consuming

and tedious (at least with a system such as the VPC where each word must be explicitly

repeated a number of times in order to train it). Therefore, frequent re-training of the


vocabulary, for reasons such as the one mentioned above, could not realistically be expected.

Most importantly however, it is unrealistic to expect to be able to foresee the names of

all aircraft that will be encountered. Thus, there will always exist the possibility of aircraft

whose names have not been trained, such as military aircraft, entering the sector. With the

solution mentioned previously, there is no way that these aircraft can be accommodated.

A better solution would be to incorporate the capability to train new words on-the-fly,

during actual operations, into the ATCCR system. In this way, as new aircraft entered the

controller's sector, their names could be trained and added to the vocabulary. The aircraft

to associate with these new names could be indicated to the computer by simply pointing

to the desired aircraft with a mouse while training its name. When these aircraft leave the

sector, their names could then be deleted automatically, or retained in a database for future

recall if necessary.

This on-the-fly training capability could also be used, as was mentioned in the last Chap-

ter, to retrain difficult to recognize words or to train entire aircraft names (i.e., callsign plus

flight number) as single utterances in the hopes of improving recognition accuracy.

5.1.2 Issuance of Non-Standard Commands

During actual operations, there is also a problem arising from the controller's use of

commands or phrases that have not specifically been included into the standard ATCCR

system (or the ASR system's vocabulary). Input of these would tend to produce "digital

garbage" as the ASR system recognized random words or structures, or, even worse, would

potentially result in the generation of an unintended command1.

It is unrealistic to expect to be able to include all of the commands that can be issued

by the controller to pilots into the ATCCR system and even if this could be done, it would

not allow for the flexibility required for communication during emergencies, or other non-

standard situations, or for idle chatter between the controller and pilot. For this reason, some

1One of the fundamental assumptions used in ATCCR parser design was that the controller's input was valid and that it was just a case of trying to recognize it. Thus, error correction techniques could make some changes or assumptions as to what the recognized words were that could lead to valid but incorrect and unintended commands.


mechanism whereby the ATCCR system can be easily disengaged from the controller's verbal

input is required. The best method for doing this would be to utilize a push-to-talk switch on

the microphone in much the same way as was indicated in the last chapter. Using this, the

controller could then engage only the radio for non-standard verbal communications, or the

radio and the ATCCR system for standard command issuance. Operationally however, it

remains to be seen exactly how effective this procedure would be since it would now require

the controller to constantly determine which commands (and in what format) are standard

and which aren't, in order to activate the mike switch accordingly.

Although standard commands transmitted at the mike switch setting for non-standard

commands would not create any difficulties for the ATCCR system itself, commands issued

in this manner would not be available to any computer monitoring the controller's input.

This loss of data could have serious ramifications to applications that utilize this information

such as an automated ATC decision support system.

5.2 Application Specific Difficulties

5.2.1 Digitized Command Transmission - Voice Channel Offloading

One of the primary motivating factors for this research, at least initially, was the expected

emergence of a digital communication link between the controller and aircraft based on Mode

S technology. It was thought that this could be used not only to reduce the error rate of

command reception by the pilot, but also to offload the ATC sector's voice channel, and

speed up command transmission, acknowledgment and response.

A typical scenario for realizing this would be as follows. A controller would issue a

verbal command to a particular aircraft in his normal manner. The ASR system listening

in would recognize it and translate it into a message format suitable for transmission. Any

errors made in recognizing the command could be handled by having the controller monitor

the feedback display and correct them before verifying or terminating a command. If the

aircraft possessed digital capability, this command would then be transmitted digitally. Once

received by the aircraft, it could be displayed either visually on a cockpit CRT or aurally using


speech output technology, and be made available for recall if desired. The pilot would then

acknowledge receipt of this command either digitally, by pushing a button on his display

(or perhaps through a similar on-board ASR system), or verbally over the radio link. If

however the aircraft did not possess digital capability, then the controller's command could

be transmitted over the radio channel.
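In outline, the routing step in this scenario reduces to a single branch on the aircraft's equipage, as in the following illustrative sketch (the aircraft records and function names are hypothetical).

    /* Illustrative sketch of the routing step in the digitized-transmission
     * scenario: once a command has been recognized and verified, it is sent
     * on the data link if the aircraft is equipped, otherwise it is broadcast
     * on the voice channel. */
    #include <stdio.h>

    struct aircraft {
        const char *callsign;
        int digitally_equipped;    /* 1 if the aircraft has a digital uplink */
    };

    static void transmit(const struct aircraft *ac, const char *command)
    {
        if (ac->digitally_equipped)
            printf("uplink to %s: %s\n", ac->callsign, command);
        else
            printf("broadcast on voice channel to %s: %s\n", ac->callsign, command);
    }

    int main(void)
    {
        struct aircraft twa   = { "TWA 354",        0 };
        struct aircraft ac123 = { "Air Canada 123", 1 };

        transmit(&ac123, "turn left heading 090");
        transmit(&twa,   "climb and maintain 10000");
        return 0;
    }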

The primary difficulty in implementing such a system lies in the management of the voice

channel. In the current environment, the radio link is a simplex channel. Anyone desiring

to transmit a message, be it pilot or controller, must first detect whether or not someone

else is currently using the radio before transmitting. Although this is fairly straightforward

to do with the current system, the addition of an ATCCR system to the channel 2 alters the

protocol of this channel slightly and can lead to message collisions.

These message collisions can take on two forms. In the first, the pilot transmits to the

controller through what he perceives to be an open channel when in reality the controller is

currently in the process of voicing a command. This occurs because the controller's voiced

commands are intercepted by the ATCCR system before they are broadcast over the radio so

that they can be transmitted digitally (if the aircraft is so equipped). Thus, if the controller

were issuing a command to a digital aircraft, there would be no indication to any pilot

monitoring the radio channel that the controller was busy talking and he would feel free to

broadcast.

This type of message collision can also arise if the aircraft being addressed by the controller

is not digitally equipped. This because the controller's command is not broadcast over the

radio link until after it has been determined that the aircraft being referred to does not

possess digital uplink capability. Thus, there is a brief period of time during which any

pilot wishing to use the radio channel would not detect the controller talking. Although

this could be as short as the time required to speak and recognize the aircraft's name, it

could be increased significantly if any recognition errors had to be corrected or if a slow ASR

system were being used. Furthermore, the ATCCR schemes proposed in this work take no

2Recall that there are cases when controller commands would still have to be transmitted verbally. These could

arise from non-standard commands as mentioned in the last section, or from operations involving aircraft

that are not digitally equipped.


action until the entire command has been recognized in order to facilitate error detection

and correction. Thus, this delay in transmission could be even larger.

This leads directly to the second type of message conflict where the computer transmits

a voiced controller command over the radio channel while another pilot is talking. In the

previous scenario, if a pilot did seize the voice channel during the time it took to recognize

the aircraft's name and determine that it was not digitally equipped, then when the ATCCR

system did broadcast the command over the radio, there would already be someone talking

on it even if this were not so when the controller began voicing his command.

There are two basic methods in which these message conflict problems can be handled.

The first requires that the voice channel be re-designed so that it becomes a duplex channel.

A diagram indicating how this would appear can be seen in Figure 5.1. In this, there are

two loops, an "air voice" loop containing the pilots and a "ground voice" loop containing

the controller and his ATCCR system. The interface between these two loops is handled by

computer. It is the responsibility of this computer to detect and buffer any incoming pilot

messages that occur while the controller is talking and replay these when he has finished. It

also has to buffer any outgoing controller messages so that they are transmitted only when

the radio channel is free (no pilots are talking). In this way, no messages are lost due to

channel conflicts.
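The role of the interface computer can be pictured as two message queues, one per loop, each released only when the receiving loop is idle. The sketch below is schematic only (illustrative C with hypothetical names); detecting whether a loop is busy is precisely the difficulty discussed next.

    /* Illustrative sketch of the interface computer in Figure 5.1: incoming
     * pilot messages and outgoing controller messages are buffered in simple
     * queues and released only when the receiving loop is free. */
    #include <stdio.h>

    #define QSIZE 8

    struct queue { const char *msgs[QSIZE]; int head, tail; };

    static void put(struct queue *q, const char *m) { q->msgs[q->tail++ % QSIZE] = m; }
    static int  empty(const struct queue *q)        { return q->head == q->tail; }
    static const char *get(struct queue *q)         { return q->msgs[q->head++ % QSIZE]; }

    int main(void)
    {
        struct queue to_controller = { {0}, 0, 0 };   /* buffered pilot messages      */
        struct queue to_air        = { {0}, 0, 0 };   /* buffered controller messages */
        int controller_talking = 1, air_channel_busy = 0;

        /* A pilot call arrives while the controller is voicing a command. */
        put(&to_controller, "TWA 354 requesting lower");
        /* The controller's command is held until the air channel is clear. */
        put(&to_air, "United 81 climb and maintain 10000");

        controller_talking = 0;                       /* controller finishes */
        while (!empty(&to_controller) && !controller_talking)
            printf("replay to controller: %s\n", get(&to_controller));
        while (!empty(&to_air) && !air_channel_busy)
            printf("transmit on air loop: %s\n", get(&to_air));
        return 0;
    }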

This type of approach however, has some inherent difficulties. First, an automated tech-

nique must be developed in order to detect when the radio channel or "air voice" loop is busy.

This in order to detect any incoming pilot transmissions so that they can be recorded and

buffered as well as to determine when an outgoing controller command can be transmitted.

This task is complicated by the presence of background noise on the radio link but there

are cues, such as the reception of a carrier when someone is transmitting, or the periods of

relative silence (relative to the nominal background noise that is) that occur when someone

has his mike switched on and is not talking, that can be of aid in this process.

Second, the same must be done for the "ground voice" loop. This however is fairly

straightforward since a push to talk switch on the controller's microphone can be readily

monitored to determine if he is talking.


"Air Voice"Loop

"Ground Voice"Loop

Figure 5.1: Duplex voice channel.


The biggest difficulty however lies in the actual sequencing of outgoing and incoming

buffered messages. Should outgoing controller commands take precedence over incoming pilot

messages? What are the effects on ATC operations arising from lags and delays associated

with the buffering of communications? What are the results when pilots monitoring the

radio hear another pilot's messages before the controller does (due to message buffering at the

ground)? How can the interleaving of conversations arising from this buffering of messages be

avoided? How can emergency communications be distinguished from other communications

in order to allow them to take preference?

Thus, although such a system could probably be designed, some extensive testing and

simulation is required to determine whether it would alleviate controller workload or simply

add to it by unnecessarily complicating his task.

If the benefits of command recognition and digital command transmission without voice

channel offloading are alone sufficient to justify its use, then a second solution to the voice

channel conflict problem is possible. This requires the transmission of the controller's com-

mands in parallel on both the voice channel and the digital channel. In this way, the basic

operations on the verbal channel remain unchanged, except for the potential use of digital

instead of verbal acknowledgments by aircraft. Thus, the difficulties with radio conflicts

mentioned earlier would not occur. (Note however, that with such a system, there would be

a lag between the reception of the verbal and digital commands since the verbal command

would still have to be recognized before it could be digitized and transmitted.)

Furthermore, it would address the deficiency in a pilot's awareness of other air traffic

brought about with the use of digital command transmission since all of the commands

issued to aircraft would be available to anyone listening in on the radio link. These, as will

be attested to by any pilot, are used extensively, especially in crowded airspace, in order to

determine where other aircraft are and what they are doing. Thus, their elimination could

have serious ramifications in terms of safety.

5.2.2 Command Prestoring

The other major application that is envisioned for ATCCR is its use in prestoring con-


troller commands and clearances for later issue. These prestored commands can be entered

for storage in the computer verbally by the controller, in the anticipation of some future

event (such as an aircraft reaching a waypoint), or they can be contained in a database of

commonly used clearances, such as those used for standard approach or departure patterns.

In either case, when it is determined by the controller (or even by a computer monitoring

the ATC sector) that these clearances should be transmitted, they have already been entered

and are thus available for immediate transmission. Thus, the controller anticipating a period

of high workload can prestore a number of these commands in order to simplify his task.

In order to simplify the use of such a system, the controller would be given a display

containing all of these prestored commands and information about their status (i.e., pending,

transmitted, acknowledged, etc.). Using this, he could examine previously issued commands,

modify existing ones, or add new ones. This display would also allow the computer to request

validation of each specific prestored command before it was actually transmitted or signal to

the controller that an already issued command had not yet been acknowledged.

The actual transmission of these commands or clearances would take place using the same

procedure as that described in the last section. Thus, if the aircraft being referred to was

digitally equipped, the clearance would be digitally transmitted. If it were not, then it would

be transmitted over the radio link, perhaps using a recording of the controller's own voice.

Pilot acknowledgments to these commands could also be transmitted either verbally or

digitally. If they were verbal, it would be the responsibility of the controller to recognize the

acknowledgment and update the display of the corresponding command to indicate this. If

they were digital, then this could be done by the computer directly.

In general, this system suffers from the same basic problems arising from message conflicts

that were mentioned in the last section. This is because both digital and non-digital

aircraft are being accommodated, so operations on the voice channel are the same as in the last

section. With this application however, the frequency of these message conflicts is increased

because both the controller and computer are now generating outgoing commands.

Although the duplex voice channel modification described in Figure 5.1 addresses this

problem, the difficulties inherent in scheduling the playback of incoming and outgoing mes-


sages that have been buffered are almost certain to result in interleaved communications,

especially during situations of high loading on the voice channel.

This is because outgoing prestored commands transmitted by the computer and incoming

responses and acknowledgments to these are intermingled with controller originated com-

munications on the radio link (if prestored commands are directed towards non-digitally

equipped aircraft). The result is that the controller is likely to hear a seemingly random

sequence of messages and acknowledgments on the radio channel thereby greatly increasing

his workload by forcing him to mentally sift through these to determine what each refers to.

A solution to this is to require that prestored commands and clearances are transmitted

and acknowledged digitally only. In this way, the computer itself can handle the management

and scheduling of prestored command transmission and acknowledgment detection, thereby

freeing the controller to simply perform supervisory functions and concentrate on his own

task at hand. This type of operation emphasizes the need for the prestored command display

mentioned earlier in order to allow the controller to interface with the computer in the

execution of this supervisory function.


Chapter 6

Conclusions and Recommendations

6.1 Summary

The basic goal of this work has been to apply existing ASR technology in an ATC en-

vironment in order to explore not only some of the potential benefits and problems arising

from the practical application of ASR, but also the features and capabilities desirable in an

ASR system to be used in ATC.

This was accomplished by integrating a VOTAN VPC2000 continuous speech recognition

system into an existing ATC simulation so as to provide a means whereby verbal commands

issued by controllers and directed towards aircraft could be entered into the computer directly

thereby eliminating the need for blip drivers or pseudo-pilots.

In general, the potential benefits accrued through the use of ASR in an ATC environment

involve the simplification of the controller-computer interface in an environment where the

primary means of communication is verbal and the use of and reliance on computers is

increasing significantly, both in the air and on the ground.

The major difficulties however lie predominantly in the handling of errors. In order to

address the problem of recognition errors, the syntax for ATC commands was incorporated

into a Speech Input Parser. This was done in two basic ways. The first utilized a Finite

State Machine approach for syntax specification and required active intervention on the part

of the user in order to correct any errors once they were detected. The second however used

a pattern matching approach to compare the input command to a list of allowable commands


in order to determine the best match and could hypothesize possible corrections if any errors

were detected as long as these did not critically affect the intelligibility of the commanded

action.

The user based techniques developed for correction of recognition errors consisted of

utilizing the verbal channel in order to enter specific keywords that would either delete the

last recognized word, or delete the entire recognized command so far. These were found to

be lacking both in terms of speed, flexibility and ease of use, and from the fact that errors

could even be made in recognizing these keywords.

The automated techniques developed to correct for recognition errors internally were

limited by the capabilities of, and information made available by the VPC system. In many

cases, even though they were successful in hypothesizing the location of these errors, there

was no capability to re-analyze the data and validate these hypotheses. As such, these

automated techniques were more proof of concept vehicles than implementable strategies (at

least with the current configuration of the VPC).

The major drawbacks of the VPC system were its sensitivity to variations in articulation

(co-articulation, intonation) and its inability to rewind data in order to re-examine sections of

speech data. The former is for the most part inherent in the particular recognition algorithm

and technique being used and could not readily be changed. The latter however is a result of

the actual packaging of the software. This problem has been addressed with a new software

package (a library of user callable C language subroutines to control the recognition functions

[35]) recently made available. There are however still some limitations in the capability of

the VPC that have not been addressed. In particular, the inability to obtain a ranking,

including scores, of how well each of the words in the active vocabulary matched the current

input as well as a pointer to the location in the speech data where each of these words ends

and the next word would therefore begin.

6.2 Recommendations

As a result of the work performed, the requirements and capabilities of an improved


operational ASR system for use in ATC can be more accurately specified. Although these

have, to some extent, already been discussed in the body of the text, they will be summarized

again. The current system was never intended for operational use. It was just for proof of

concept demonstration and system development in order to more accurately define not only

the significant areas of research, but also features and capabilities that would be desirable in

a more advanced, higher performance system that would be used in practical operations.

* Speaker Dependence

For the ATCCR application this was never really an issue since there is only one user

at a time, the controller, and thus a speaker dependent system is adequate.

* Continuous Speech Recognition

As mentioned earlier, the restrictions posed on the user with discrete speech recognition

systems and the delays associated with connected speech recognition systems created

a strong preference for continuous speech systems. Although in retrospect, some of the

error correction strategies hypothesized bear more similarity to connected than to con-

tinuous speech recognition techniques, it was felt that a continuous speech recognition

system with the capability to buffer speech input and go back and re-analyze it (i.e.,

connected speech recognition capability) would result in much better performance. In

this way, the delays associated with connected speech recognition techniques would

only be incurred when ambiguities or errors required their use.

* High Baseline Recognition Accuracy

Here, what is implied is the inherent accuracy of the recognition algorithms themselves,

without the explicit use of syntax or set-switching as an aid to the recognition process.

Although these can be used later to improve the overall performance, the ASR system

must at least possess an adequate performance for word recognition to allow for a

reduction in the processing required by any error correction schemes, be they user

aided or internal. For continuous speech recognition systems, this almost certainly

implies the use of phoneme based approaches. This because co-articulation effects (one


of the major causes of recognition errors in continuous speech recognition) can be more

readily and accurately modeled.

The actual recognition accuracy required is difficult to quantify exactly since there

are a large number of variables. These include vocabulary size, vocabulary content,

speaker characteristics, and training procedure. In general however, the recognition rate

should be at least 95% for a vocabulary consisting of all of the words that are required

to implement the required task. This would result in a success rate for command

recognition of about 60% (assuming a command consists of roughly 10 words). Syntax

and other techniques could then be used to improve this.
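To see where the 60% figure comes from: if each of the roughly ten words in a command is recognized independently (a simplifying assumption) with probability 0.95, then the chance that the whole command is recognized without any word error is approximately

    0.95^10 ≈ 0.60,

that is, about 60%, which is the baseline that syntax and the other techniques must then improve upon.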

* Simplified Training Procedure

In general, the type of training procedure such as that used in the VPC where each

word is trained by repeating it to the system both in discrete and embedded modes has

serious drawbacks. First, this procedure is highly subject to training effects. Second,

it does not accurately allow for co-articulation effects. Third, the actual training can

become very time consuming for large vocabularies. What would be desired to remedy

some of these difficulties is a procedure where the user would simply talk to the system,

perhaps reading a section of text, in order to train the system to his voice. The handling

of co-articulation effects however is more related to the actual recognition algorithm

being used and as such, cannot be addressed solely through modifications to the training

procedure.

In addition, the capability to add new words to the vocabulary on-the-fly during ac-

tual operations would be desirable. With most systems, this is more related to the

"packaging" of the software than the actual training procedure. However, there are

those systems whose training procedure is so complex and time consuming that this

capability cannot realistically be added.

* Reduced Sensitivity to Variations in Speech

These, as mentioned earlier, can arise from anything from co-articulation effects to a

cold or stress on the part of the user and tend to decrease the recognition accuracy of a


system. These can be accounted for either through the use of more robust recognition

algorithms (by accurately modeling co-articulation effects for example) or through the

use of an enrollment procedure prior to the use of the system. With this, the user

would simply read aloud a brief paragraph in order to allow the recognition algorithm

to adapt to how he sounds that particular day. Additionally, if the actual procedure

for training the vocabulary were short enough, he could even retrain all or parts of it

prior to use.

* Vocabulary Size

The actual vocabulary size required depends greatly on the task being implemented. In

general, since the entire vocabulary of words used in ATC (excluding names of specific

places) is only about two to three hundred words, a vocabulary roughly this size should

be sufficient. Granted, this might be increased depending on the application in order

to allow for a large number of aircraft names or waypoints and fixes.

As mentioned earlier, a more accurate indication of performance is the size of the active

vocabulary. If the system is one in which the only user control of its internal operation

is through the specification of the active vocabulary, then the active vocabulary should

be as large as possible (with a realistic minimum of about 60 words) in order to reduce

the requirement for vocabulary set switching while a command is being input and thus

the type of errors evidenced in Section 4.2.3 with a parser that utilized set switching.

If however more control is available over the internal operation, in particular, if the

capability to rewind the speech data is available, then a smaller active vocabulary

would be acceptable since the added control would allow any errors to be handled.

Note that as a general rule of thumb, the size of the vocabulary of an ASR system is

limited by its recognition accuracy. Therefore, the more accurate the system, the larger

the vocabulary.


* Short Recognition Delays

This is very difficult to quantify exactly since a number of different factors enter into it.

Clearly, the recognition delays should be as short as possible, in order to decrease the

lag between command and action as well as to more readily allow error correction by the

user without forcing him to wait excessively. Furthermore, the shorter the recognition

delay, the more time available for any post-processing required by the Speech Input

Parser. However, ASR systems with larger delays can offset this by reducing the need

for error correction by the user, or post-processing by the SIP, with higher recognition

accuracy. Thus, this must be analyzed in conjunction with the recognition accuracy of

a system in order to determine if it is excessive.

In general, recognition delays should be at most one second for an individual word or

four seconds for a long stream of speech, so that the user is not forced to wait too long

before action is taken in relation to his input. Systems with poorer accuracy should

naturally be at the lower end of this scale.

* Open Architecture

This is perhaps the most important requirement in an ASR system due to the ex-

ploratory nature of the work that was performed here. With an open architecture, the

user could exert more control over the recognition functions and would not be restricted

in what can be done by the "packaging" of the ASR system. In this way, some of the

parsing strategies and error correction mentioned earlier could be implemented. The

most desirable feature in an open architecture system would be the ability for the user

to call the recognition routines directly on any specified section of the incoming speech

data with any parameters and vocabulary desired. This, in general, is not possible with

a black box approach to the design of the interface to an ASR system where the only

input is speech (and possibly syntax for set switching purposes) and the output is a

recognized word.

For these reasons, what is really desired is a development system in order to allow


for the type of flexibility in configuration and execution that is required for research

purposes.

* Hosting

Although this is not critical, an ASR system that could be hosted on the ATC Simu-

lation Computer itself would possess distinct advantages. This because a lot of infor-

mation about the environment (airspace, what the controller is currently doing,...) is

available here and transferring it to another computer often results in difficulties.

Candidate ASR Systems

In general, although the VPC was more than adequate for demonstrating the proof of

concept of ATCCR and for performing initial ATCCR developmental work, a more capa-

ble ASR system was desired for testing and development of what might eventually be an

operational ATCCR system.

Based on the experience gleaned through the use of the VPC, the emerging ASR tech-

nology of phoneme based speech recognition was felt to be what was desired. With these

systems, the phonemes contained in the speech input are first recognized and then used to

consult a dictionary of phonemic spellings in order to determine the word spoken. These

types of systems offer a great number of performance improvements over more conventional

technology, such as the VPC, and are currently being used in order to tackle the much

more complex problem (in terms of vocabulary sizes and syntactical flexibility) of recogniz-

ing natural language. Thus, they should be quite successful in the reduced scope of the ATC

environment. Some of the advantages of this approach are listed below;

* There is already a great body of knowledge dealing with phonemes, their characteri-

zation, how they are used to construct speech, and most importantly, rules for their

co-articulation. Thus, degradations in recognition accuracy arising from co-articulation

effects can be reduced to a greater extent than possible simply with embedded train-

ing of the vocabulary words as in the VPC. It is this that is the major advantage of

phoneme based systems.


* The training procedure is much simpler for the user since it typically consists of having

him read, out loud, a paragraph of phonetically rich text in order to determine how he

enunciates phonemes. Thus, training effects are reduced since the training task is not

as artificial as that in other systems. Furthermore, all of the words contained in the

vocabulary need not be explicitly trained. Instead, their phonetic spelling must simply

be contained in the phonemic dictionary. Thus, the addition of words to the vocabulary,

even on-the-fly, is fairly straightforward and consists of simply adding another entry to

the dictionary without the need to specifically train them.

* Since data rates for phoneme recognition systems are only about 100 Hz, less memory

is required to buffer incoming speech data. Thus, it is more reasonable to save large

blocks of data for later re-processing if any errors are detected by the parser.

6.3 Future Work

Now that the basic tools and procedures for using ASR have been demonstrated and are

in place, there is the potential for a great deal of modification and additional study to be

performed not only in order to improve the current facility, but also to investigate different

application configurations. Areas thought to be of great potential have been summarized

below.

* Incorporate keyboard and mouse in addition to ASR as input modalities. These can

then be used for;

1. command entry

- Mix of input modalities can be used for entering commands. In this way, the

controller is free to use what he is most comfortable with.

- Aircraft or fixes and waypoints being referred to can be selected directly on

the display with the mouse.

- Difficult to recognize words can be entered with keyboard


2. error correction

- Add the capability to mouse recognized words on the feedback display. The

user can then change these, insert words in front of them, or delete them. This

can be done using either keyboard input, speech input, or pull down menus

with options.

* Investigate the addition of the capability to correct errors by only repeating part of the

issued command (i.e., "heading zero niner zero CHECK zero one zero")

* Simulate the mixed digital/non-digital cockpit environment

* Develop command prestoring functions and evaluate them in a simulated environment

* Eliminate the pseudo-pilot responses (unless these can be separated from the VPC) in

order to allow more flexibility in how command termination is done and to decrease

the operational problems resulting from misrecognitions of the terminating keyword

* Investigate the possibility of using the Explorer for speech generation functions directly

in order to free up the VPC so that it performs only speech recognition.

* Change scope and format of pseudo-pilot messages to include more detail about com-

mands issued or errors detected. For example, evaluate the usefulness of error specific

pseudo-pilot responses such as "Please repeat heading for AA156".

* Investigate the use of other technologies for the generation of pseudo-pilot responses.

* Add a push-to-talk switch whose state can be monitored by the Explorer in order to allow for improved sequencing of speech playback functions and command termination detection.

* Develop the capability to issue sequential commands without intervening pauses. This would entail modifying the parsers and the command terminators.


" Modify the Pattern Matcher so that matching is done as each word is recognized as

opposed to waiting until the entire command has been input.

* Investigate the use of alternate scoring strategies for the Pattern Matcher.

* With the added flexibility and user control now available through the VOTAN Library of C routines:

  - Investigate the possibility of recoding the C programs so that the scores of all the words in the vocabulary can be obtained, as opposed to just the top two, and presented in the form of a word-score vector. This would allow some of the refinements to the SIPs mentioned in Chapter 4 to be implemented.

  - Investigate the use of the rewind capability now made available to implement some of the error-correction strategies alluded to in the refinements to the SIPs.

  - Re-design the user interface to present a standardized training procedure in which the new user is explicitly paced through the training process.

  - Add the capability to add or re-train vocabulary words on the fly, while the simulation is running, and demonstrate how this could be used, for example, to handle aircraft entering the airspace whose names have not been trained.

" Investigate the use of ASR for functions other than ATC command entry (eg; commands

directed towards the computer).

" Determine, by actual monitoring of controller-pilot communications, how strictly the

ATC command syntax is adhered to in practice.

" Investigate the possibility of using two ASR systems in parallel in the ATCCR process

in order to capitalize on differences in performance available with different systems.

" Collect and evaluate detailed statistics on the frequency and type of recognition errors

made during actual operation of the ATC simulation by comparing the recognized input

to that obtained through transcription by a human.
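
To make the incremental Pattern Matcher and word-score-vector suggestions above more concrete, the following sketch outlines one possible shape for a matcher that consumes a full vector of per-word scores as each word is recognized, instead of waiting for the entire command. It is only an illustrative sketch: the toy vocabulary, the template table, and the names matcher_reset and matcher_step are assumptions, not part of the VOTAN library or of the programs described in this report.

    /*
     * Hypothetical sketch: incremental pattern matching over per-word score
     * vectors.  After each recognized word the matcher receives a score for
     * every vocabulary word (not just the best two) and updates the running
     * score of each candidate command template, so a poor match can be
     * flagged before the whole command has been spoken.
     */
    #include <stdio.h>

    #define VOCAB_SIZE   8
    #define MAX_TEMPLATE 4
    #define N_TEMPLATES  2

    /* Toy vocabulary indices. */
    enum { W_TURN, W_LEFT, W_RIGHT, W_HEADING, W_ZERO, W_ONE, W_NINER, W_CHECK };

    /* Candidate command templates expressed as word-index sequences. */
    static const int templates[N_TEMPLATES][MAX_TEMPLATE] = {
        { W_TURN, W_LEFT,  W_HEADING, W_NINER },
        { W_TURN, W_RIGHT, W_HEADING, W_ZERO  },
    };

    static double template_score[N_TEMPLATES];   /* running match scores    */
    static int    position;                      /* words consumed so far   */

    void matcher_reset(void)
    {
        int t;
        for (t = 0; t < N_TEMPLATES; t++)
            template_score[t] = 0.0;
        position = 0;
    }

    /* Called once per recognized word with a score for every vocabulary word. */
    void matcher_step(const double scores[VOCAB_SIZE])
    {
        int t;
        if (position >= MAX_TEMPLATE)
            return;
        for (t = 0; t < N_TEMPLATES; t++)
            template_score[t] += scores[templates[t][position]];
        position++;
    }

    int main(void)
    {
        /* Fake score vectors for "TURN LEFT": higher means a better match. */
        double s1[VOCAB_SIZE] = { 0.9, 0.1, 0.2, 0.0, 0.0, 0.0, 0.0, 0.0 };
        double s2[VOCAB_SIZE] = { 0.1, 0.8, 0.3, 0.0, 0.0, 0.0, 0.0, 0.0 };

        matcher_reset();
        matcher_step(s1);
        matcher_step(s2);
        printf("after 2 words: template 0 = %.2f, template 1 = %.2f\n",
               template_score[0], template_score[1]);
        return 0;
    }

Because a score for every vocabulary word is available at each step, a poorly matching command could in principle be flagged before the terminating keyword is spoken, which is the motivation for doing the matching incrementally rather than after command termination.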


Bibliography

[1] Toong, H. D. and Gupta, A. "Automating Air Traffic Control", Technology Review, Vol

85, No 3, pp. 40-54, April 1982

[2] Lea, Wayne A., "The Value of Speech Recognition Systems", Printed in Lea, Wayne A., Trends in Speech Recognition, Prentice Hall, Englewood Cliffs, NJ, 1980

[3] Poock, Gary K., "Voice Recognition Boosts Command Terminal Throughput",

Speech Technology, April, 1982

[4] Shannon, C.E. and Weaver, W., The Mathematical Theory of Communication,

University of Illinois Press, Urbana, IL, 1949.

[5] Turn, R., "The Use of Speech for Man-Computer Communication", RAND Report R-1386-ARPA, RAND Corp., Santa Monica, CA.

[6] Lea, W. A., "Establishing the Value of Voice Communications with Computers", IEEE Trans. Audio and Electroacoustics, Vol AU-16.

[7] Rasmusson, Paul R., "Summary of Several Industrial Voice Data Collection Applications", Appearing in The Official Proceedings of SPEECH TECH '85, Vol 1, No 2, Media Dimensions Inc., New York, NY, 1985.

[8] Ashton, Robert, "Voice Input of Warehouse Inventory",

Appearing in The Official Proceedings of SPEECH TECH '85, Vol 1, No 2, Media Di-

mensions Inc., New York, NY, 1985.


[9] Nelson, Donald L., "Use of Voice Recognition to Support the Collection of Product

Quality Data", Appearing in The Official Proceedings of SPEECH TECH '85, Vol 1,

No 2, Media Dimensions Inc., New York, NY, 1985.

[10] Newbery, R. R., "Integration of Advanced Displays, FMS, Speech Recognition and Data

Link", The Journal of Navigation, Vol 38, No 1, January, 1985.

[11] Reed, L. "Military Applications of Voice Technology", Speech Technology,

Feb/Mar, 1985.

[12] Lerner, Eric J., "Talking to Your Aircraft", Aerospace America, January, 1986.

[13] Merrifield, John T., "Boeing Explores Voice Recognition for Future Transport Flight Deck", Aviation Week and Space Technology, April 21, 1986.

[14] Leggett, John, and Williams, Glen, "An Empirical Investigation of Voice as an Input Modality for Computer Programming", International Journal of Man-Machine Studies, pp. 493-520, January 1984.

[15] Connolly, Donald W., "Voice Data Entry in Air Traffic Control", FAA-NA-79-20,

August, 1979.

[16] Air Traffic Control 7110.65C, Air Traffic Service, Federal Aviation Administration, U.S. Department of Transportation, Jan 21, 1982

[17] Pollack, I., and Pickett, J.M., "The Intelligibility of Excerpts from Conventional Speech",

Language and Speech, pp. 165-171, Volume 6, 1963

[18] Pisoni, D.B., Nusbaum, H.C., and Greene, B.G., "Perception of Synthetic Speech Generated by Rule", Proceedings of the IEEE, Vol. 73, No. 11, November 1985.

[19] Pisoni, D.B. et al, "Some Human Factors Issues in the Perception of Synthetic Speech"

Appearing in The Official Proceedings of SPEECH TECH '85, Vol 1, No 2, Media Di-

mensions Inc., New York, NY, 1985.


[20] McPeters, D.L., and Tharp, A.L., "The Influence of Rule-Generated Stress on Computer-Synthesized Speech", International Journal of Man-Machine Studies, Vol 20, pp. 215-226, 1984.

[21] Schwab, E.C., Nusbaum, H.C., and Pisoni, D.B., "Some Effects of Training on the

Perception of Synthetic Speech", Human Factors, pp. 395-408, August 1985.

[22] Simpson, C.A., and Marchionda-Frost, K., "Synthesized Speech Rate and Pitch Effects on Intelligibility of Warning Messages for Pilots", Human Factors, pp. 509-517, October 1984.

[23] DECtalk A Guide To Voice, Digital Equipment Corporation, July 1985.

[24] Rabiner, L. R., and Schafer, R.W., Digital Processing of Speech Signals, Prentice-Hall Inc., Englewood Cliffs, NJ, 1978.

[25] Harrison, John A., "Should Speech Input/Output Technology be Applied to ATC Simulators and Operational Systems", ICAO Bulletin, May 1984.

[26] Schafer, R.W., and Markel, J.D., editors Speech Analysis, IEEE Press, John Wiley and

Sons, New York, NY, 1979.

[27] SP-1000 Manual, Internal Publication, General Instruments Corp., Hicksville, NY, 1986.

[28] White, G.M., "Speech Recognition: An Idea Whose Time Is Coming", Byte Magazine,

January, 1984.

[29] Russell, M.J., et al, "Some Techniques for Incorporating Local Timescale Variability In-

formation into a Dynamic Time Warping Algorithm for Automatic Speech Recognition",

Proc. IEEE Conference on Acoustics, Speech and Signal Processing, pp 1037-1040, 1983

[30] Jelinek, F. et al, "Continuous Speech Recognition: Statistical Methods", Handbook of Statistics, Vol. 2, Krishnaiah and Kanal, eds., North-Holland, 1982.

[31] Mari, J.F., and Roucos, S., "Speaker Independent Connected Digit Recognition using

Hidden Markov Models",


Appearing in The Official Proceedings of SPEECH TECH '85, Vol 1, No 2, Media Di-

mensions Inc., New York, NY, 1985.

[32] Ciarcia, S., "The Lis'ner 1000", Byte, pp. 111-124, November, 1984.

[33] Lis'ner 1000 Voice Recognition User and Assembly Manual, Rev 2.0, The Micromint Inc.,

Cedarhurst, NY, Oct 1984.

[34] VOTAN VPC2000 Users Guide, Votan, Fremont, CA, November 1985.

[35] Voice Library Reference Manual, Ver C-07, Votan, Fremont, CA, March 1986.

[36] Smyth, Christopher C., "Automated Voice and Touch Data Entry for the U.S. Army's

Forward Area Alerting Radar (FAAR)", Speech Technology, Feb/Mar, 1985.

[37] Waller, Harry F., "Choosing the Right Microphone for Speech Applications", Appearing

in The Official Proceedings of SPEECH TECH '85, Vol 1, No 2, Media Dimensions Inc.,

New York, NY 1985.

[38] Heline, Ture, "Apply Electret Microphones to Voice-Input Designs", Electronic Design News, September 2, 1981.
