
A Symptom-Driven Expert System for Isolating and Correcting Network Faults

Todd E. Marques

March 1988, Vol. 26, No. 3, IEEE Communications Magazine

Introduction

Today a single data network can consist of dozens of geographically distributed switching nodes and concentrators, and can support numerous niche-specific communications protocols, terminal devices, and network access facilities. The task of isolating faults within large and complex networks can be extremely demanding. Competency in this area requires a general command of data communications principles and sectionalization strategies, as well as specific knowledge about the operations and diagnostic capabilities of a wide variety of network components. The demands of this task are likely to grow as new and varied product offerings are deployed, and as the sheer size and topological complexity of networks increase.

This article discusses recent work geared toward the simplification of fault isolation within data networks. Traditional approaches to task simplification have focused on properties of the user interface. The use of color, graphics, pan and zoom, and many other coding and display techniques can provide users with the critical information required to make sound and timely decisions. Yet the judgment processes, which clearly are among the most difficult aspects of many tasks, remain the purview of the user. Thus, it is not clear that advanced interface techniques actually reduce the skill level required to perform complex problem solving tasks such as fault isolation. The approach reported in this paper focuses less on the presentation of key information and more on the interpretation of key information.

StarKeeper® Network Troubleshooter is a real-time interactive expert system that was developed to simplify the task of fault isolation by mechanizing many of the key judgment processes. Based on a symptom description and situation-specific background data supplied by the user, Troubleshooter identifies all suspicious components, formulates a plan designed to minimize the amount of time taken to isolate the fault, executes the plan, and terminates after the fault has been isolated and corrected. The system is designed to be used by individuals with widely varying skill and experience levels.

As shown in Fig. 1, Troubleshooter is positioned between a clerk, technician, or administrator staffing a help desk and a conventional operations system (OS). Troubleshooter interacts with the OS much like a human user. It has access to the entire OS command set as well as all reports and displays available to any other authorized user. Troubleshooter determines which OS commands should be executed in a given situation, executes the commands, and interprets OS responses.

System input consists mainly of (i) symptom descriptions that either are reported to the help desk by network end-users or are derived from network-generated messages, and (ii) specific inquiries about the operational state of a designated component. System output consists of (i) a diagnosis that identifies the physical address of the faulty network component, describes the fault, and provides evidence upon which the diagnosis is based, (ii) results of any system-initiated procedures intended to clear the fault, and, in the event that human intervention is deemed necessary to clear the fault, (iii) advice on the appropriate corrective procedures.

0163-6804/88/0003-0006 $01.00 © 1988 IEEE

Fig. 1. Troubleshooter: An Intelligent Front-End to an Existing Operations System.

Initial Task Domain

Troubleshooter is currently operational in the Datakit® Virtual Circuit Switch (VCS) networking environment. Datakit VCS provides data communication between host computers and terminals of various types. Datakit VCS networks may range from a simple star configuration, where hosts and terminals are interconnected via a single node, to a wide area network consisting of many nodes linked by digital data service or high speed fiber trunks. Datakit VCS networks have been deployed within a variety of government, educational, and business settings.

The Datakit VCS architecture is, as its name suggests, highly modular or kit-like. A Datakit VCS node consists of one or two cabinets. Each cabinet may contain one to four slotted shelves housing plug-in circuit packs that communicate across a printed circuit backplane. Each circuit pack, or module, performs a specialized function such as timing, packet switching, inter-nodal trunking, or interfacing with end-user terminals and host computers. The node cabinetry also contains power and cooling subsystems.

Datakit VCS end-users establish communication with a host computer within the network much as they would place a telephone call. Rather than lifting a handset, the end-user powers on the terminal, and rather than issuing dial tone, the network signifies readiness by issuing a destination prompt. The end-user responds to the destination prompt by keying in the name of a host computer. A virtual circuit is then set up between the terminal and the designated host computer. From this point on, the network facilities are transparent to the end-user. When the end-user terminates a session, call processing software takes down the virtual circuit, performs various administrative activities, and re-issues a destination prompt.

As with any complex system, things occasionally go wrong. For example, an end-user may be unable to establish a connection with a particular host computer on the network, a connection may drop unexpectedly, response time may be perceived as excessive, or extraneous characters may appear on the terminal screen. The fault associated with these symptoms may lie within the network facilities, or within the host computer, building wiring, telephone equipment, or any number of external sources.

Most network installations maintain a hotline or help desk where network end-users can report problems. Typically, after a complaint reaches the help desk, a trouble report or trouble ticket is opened that includes the symptom description, the name and location of the end-user reporting the trouble, and other details. Depending on the skill level of the help desk staff, the trouble ticket is worked directly or forwarded to a specially trained technician.

The StarKeeper® Network Management System (NMS) is an OS that supports network operations, administration, and maintenance activities. It provides centralized remote access to all Datakit VCS control consoles. Thus, from a single workstation, users are able to issue operations commands to all points within the network, and to access a wide array of tabular and graphical network status and performance displays. The OS also maintains a relational database that contains detailed network topology information, and records of component alarms and other performance indicators.

Even with the consolidation of controls and displays provided by the OS, the task of isolating faults remains imposing. Consider the vast array of controls and data available to the technician. The combined OS and Datakit VCS operations command language consists of over 100 verb-object pairs, most with several arguments and options. There are several hundred OS- and component-generated reports, displays, and alarm messages, and well over 5000 pages of written documentation!

The Knowledge Base

The knowledge base is the repository of facts and procedures that enables the system to exhibit expert behavior. A process often referred to as knowledge engineering is used to formulate the knowledge base. This is a multistage process that usually involves definition of subject matter or domain boundaries, identification of key subject matter experts and other sources of expertise, and extraction and formal representation of information from the available sources.

The network components identified in Table I constitute the initial domain handled by Troubleshooter. This collection of network components was selected because it is most representative of the Datakit VCS networks currently in operation.

The components are classified as (i) common equipment, that is, equipment common to all Datakit VCS nodes, and (ii) component interface modules that can be added to the common equipment to serve particular communications needs. As indicated in Table I, several of the components are available in multiple versions.

The subject matter expertise was obtained from several sources. Chief among these were in-house data communications experts, and the Datakit VCS hardware and software developers. Members of the AT&T Trouble Resolution Center (TRC) served as the in-house data communications experts. The TRC is responsible for providing 24-hour nationwide support for Datakit VCS customers. Drawing on substantial field experience, the staff was helpful in focusing the knowledge engineering process by identifying components that had proven to be most problematic. Further, their general knowledge of data communications principles and the product line proved useful in formulating high-level sectionalization strategies.


TABLE I. SCOPE OF THE INITIAL KNOWLEDGE BASE

Component Type                  Versions
------------------------------  ------------------------------------------------
cooling subsystem               model 500, model 2000
controller                      5MB-drive: software generics 2.3.4 through 3.1.2;
                                10MB-drive: software generic 3.2.1
packet switch                   no variants
power subsystem                 model 500, model 2000 hardwired,
                                model 2000 pluggable
timing                          clock, repeater

interface modules:
multiplexed host interface      cpm-422, cpm-dr, cpm-hs
terminal interface - async      6 port, 12 port
terminal interface - sync       no variants
trunk                           dds, T1, 8MB fiber
voice/data multiplexer          no variants

The global perspective imparted by the TRC experts was complemented by the highly specialized knowledge possessed by the component hardware and software developers. These individuals were knowledgeable about the structure and functioning of the individual network components for which they had responsibility. Having designed the components, they were also excellent sources of information on the precise meaning of potentially obscure or cryptic component-generated alarm, status, and performance data.

The domain knowledge was obtained from a series of interviews conducted with each subject matter expert. At the beginning of the initial interview session, each expert was presented with examples of all network-generated reports and displays available for a given component. The subject matter expert was then asked to indicate how the available data could be used to determine whether the component was operating normally.

The experts were encouraged to identify simple procedures for passing a component, that is, confirming that the component is operating normally. This involved careful analysis of the various reports and displays, and identification of the most critical fields or variables contained therein. For example, it was concluded that one type of Datakit VCS power supply could be passed if a status report contained nominal readings for the 12V and 5V fuses at each shelf in the cabinet, 5V tolerance, and a few other status indicators for each of three redundant power units.
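The idea of passing a component can be sketched in a few lines of Python. This is only an illustration: the indicator names, the nominal values other than "good" for a fuse, and the function names are assumptions made for the example and do not come from the actual Datakit VCS status report.

```python
from typing import Dict, List

# Hypothetical critical indicators and their nominal values for one power
# supply type. The real report fields are not listed in the article.
NOMINAL_POWER = {
    "12v_fuse": "good",
    "5v_fuse": "good",
    "5v_tolerance": "in range",
}


def failed_indicators(readings: Dict[str, str], nominal: Dict[str, str]) -> List[str]:
    """Return the critical indicators whose reading is missing or not nominal."""
    return [name for name, expected in nominal.items()
            if readings.get(name) != expected]


def can_pass(readings: Dict[str, str], nominal: Dict[str, str]) -> bool:
    """A component is passed only if every critical indicator is nominal."""
    return not failed_indicators(readings, nominal)


healthy = {"12v_fuse": "good", "5v_fuse": "good", "5v_tolerance": "in range"}
suspect = {"12v_fuse": "good", "5v_fuse": "good", "5v_tolerance": "out of range"}

print(can_pass(healthy, NOMINAL_POWER))           # component is passed
print(failed_indicators(suspect, NOMINAL_POWER))  # indicators needing follow-up
```

Keeping the pass check to a comparison over a short list of critical fields reflects the emphasis on simplicity: a component that is not faulty should be dismissed quickly.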

Simplicity was emphasized to minimize the amount of time taken to isolate a fault. There are circumstances where it is necessary to assess more than 50 network components during the course of a single investigation. There must be a method for quickly passing components that are not faulty, so that time can be invested scrutinizing faulty components.

Subsequent knowledge engineering sessions with the domain expert focused on procedures for dealing with faulty components. A faulty component is defined here as any component displaying one or more unacceptable readings on critical status indicators. The power supply would be considered faulty if, for example, readings indicated that the 5V tolerance was out of range.

During the sessions, each critical indicator for a given component was manipulated to show a fault state. For example, the 5V fuse reading on the power supply was changed from the nominal value “good” to the fault state “bad.” After each manipulation the expert was asked to describe a set of procedures for determining the cause of the fault. The procedural description included (i) the operations commands required to execute any prescribed diagnostic tests, (ii) detailed guidelines for interpreting test results, and, where necessary, (iii) step-by-step instructions for performing manual operations such as measuring resistance at some point in the cabinetry wiring, or checking hardware switch settings.

Once the diagnostic procedures were established for a particular fault condition, the experts were asked to identify appropriate corrective procedures. Depending on the diagnosis, the procedures could involve execution of additional operations commands, or manual adjustment or replacement of network components. The sequence of identifying diagnostic and corrective procedures continued until all critical readings for a given component had been studied.

After the knowledge acquisition process was completed for a given component, the resultant knowledge was coded in the form of complex condition-action pairs referred to as production rules, or simply rules.

The English language representation of a rule for evaluating the service state of an asynchronous terminal interface module would state:

IF the service state reading associated with end-user port 6 on the terminal interface board in slot 22 and node LC1 is: out of service,

AND the terminal interface board hardware is in an enabled state,

AND permission to restore port 6 on the board has been granted by the network administrator,

THEN issue the operations command to node LC1 to restore port 6 in slot 22 to an operational state,

AND confirm that port 6 has been restored to service and is fully operational by re-evaluating the critical variables.
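To make the condition-action structure concrete, the rule above can be approximated in Python. This is a minimal sketch and not the production-system rule language used by Troubleshooter. The fact keys, the `Rule` class, and the command text are hypothetical, since the article does not give the actual rule syntax or the operations command syntax.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Working memory: (class, attribute) -> value. All keys are hypothetical.
Facts = Dict[Tuple[str, str], str]


@dataclass
class Rule:
    """A condition-action pair: every condition must hold before any action runs."""
    name: str
    conditions: List[Callable[[Facts], bool]]
    actions: List[Callable[[Facts, List[str]], None]]

    def fire_if_matched(self, facts: Facts, commands: List[str]) -> bool:
        if all(condition(facts) for condition in self.conditions):
            for action in self.actions:
                action(facts, commands)
            return True
        return False


def issue_restore(facts: Facts, commands: List[str]) -> None:
    # Placeholder command text; the real OS command is not given in the article.
    commands.append("restore port 6 slot 22 node LC1")


def schedule_recheck(facts: Facts, commands: List[str]) -> None:
    # Mark the port so that its critical variables are re-evaluated afterwards.
    facts[("port", "recheck")] = "pending"


restore_port_rule = Rule(
    name="restore-out-of-service-port",
    conditions=[
        lambda f: f.get(("port", "service_state")) == "out of service",
        lambda f: f.get(("board", "hardware_state")) == "enabled",
        lambda f: f.get(("admin", "restore_permission")) == "granted",
    ],
    actions=[issue_restore, schedule_recheck],
)

facts: Facts = {
    ("port", "service_state"): "out of service",
    ("board", "hardware_state"): "enabled",
    ("admin", "restore_permission"): "granted",
}
commands: List[str] = []
fired = restore_port_rule.fire_if_matched(facts, commands)
print(fired, commands)
```

If any one of the three conditions fails, for instance when permission has not been granted, the rule does not fire and no command is queued.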

The rules developed for each network component were reviewed for accuracy by the corresponding subject matter experts.

The methodology described above was repeated for each network component and component variant identified in Table I. In most cases, different subject matter experts were consulted on each component.

[Figure: block diagram. Troubleshooter host: user interface; production system (inference engine, rule base). OS host: configuration database; remote handler.]

Fig. 2. Troubleshooter Architecture.

System Architecture

The Troubleshooter architecture, which is shown in Fig. 2, is distributed across two machines. The local software constitutes approximately 90 percent of the system source code and runs under the UNIX® operating system on a dedicated AT&T 3B2 computer. The remote software runs on the OS (StarKeeper NMS) host, which may be an AT&T 3B2, 3B5, or 3B15 minicomputer. The remote software also runs under the UNIX operating system. Descriptions of the major components comprising the local software follow:

Production system. This is the heart of the system and consists of an inference engine, or rule interpreter, and a rule base. The production system is implemented in OPS4 [1].

As is characteristic of production systems in general, the rule base is organized such that it can be readily extended. This flexibility is essential, as new specialist modules must be added frequently to track the rapid evolution of the task domain. The rule base currently consists of approximately 700 rules.

The rule base is partitioned into groups specializing in the diagnosis of different network components. There is a separate specialist module for each interface module and piece of common equipment presented in Table I. There are also specialist modules for performing highly specialized and complex tasks such as circuit tracing.

Run-time library. These functions perform miscellaneous low-level tasks such as OS command line generation and transmission, agenda formulation, user input parsing and validation, and various housekeeping activities. The library functions are written in Franz LISP [2].

User interface. These C language routines handle user interface initialization functions, perform display scrolling and paging functions, manage screen input/output, process internally generated status messages and signals, and drive the status annunciator system.

Loggers and utilities. These C and Shell language routines log all interaction between the user and system, and all interaction between the system and the OS. The logs provide a comprehensive audit trail of fault isolation activities, data for gauging the effectiveness of the system, and a source of knowledge for novice technicians who are interested in sharpening their fault isolation skills. Utilities exist to summarize the logs and to obtain system usage profiles over user-specified time intervals.

Network communications handler (local). This collection of C language routines establishes and maintains connectivity between the local and remote portions of the system. Connectivity is ordinarily achieved by way of the Datakit VCS network, although direct host-to-host communication is also supported. The network communications handler has an error detection and retransmission protocol that guarantees error-free data transport. Also, if the communications link is dropped, or various synchronization problems occur, the handler will re-establish or resynchronize connectivity automatically without loss of continuity to any ongoing investigation.

Network communications handler (remote). These processes envelop all OS output bound for the Troubleshooter host with a message header and message trailer. The header and trailer are processed and then stripped by the local communications handler.

Configuration database access routines. These C language routines enable the system to access specified records from the OS-resident relational database containing detailed network topology data and alarm history data.

Data filters. As would be expected, the OS produces reports and displays in human-readable form. The data filtering scripts are required to transform each OS output into a series of class-attribute-value expressions that can be interpreted by the Troubleshooter production system.
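A data filter of this kind might look roughly like the following Python sketch. The report layout (one `name: value` pair per line under a banner) and the function name are assumptions made for illustration. The article does not show an actual OS report or the filtering scripts themselves.

```python
from typing import List, Tuple

# One class-attribute-value expression, e.g. ("power", "5v_fuse", "bad").
Triple = Tuple[str, str, str]


def filter_report(component_class: str, report: str) -> List[Triple]:
    """Turn a human-readable report into class-attribute-value triples.

    Assumes each data line has the form 'name: value'. Lines without a
    colon (banners, blank lines) are ignored.
    """
    triples: List[Triple] = []
    for line in report.splitlines():
        if ":" not in line:
            continue
        attribute, value = line.split(":", 1)
        attribute = attribute.strip().lower().replace(" ", "_")
        value = value.strip()
        if attribute and value:
            triples.append((component_class, attribute, value))
    return triples


# Invented sample report text, used only to exercise the filter.
SAMPLE_REPORT = """\
POWER STATUS REPORT
12V fuse: good
5V fuse: bad
5V tolerance: in range
"""

triples = filter_report("power", SAMPLE_REPORT)
for triple in triples:
    print(triple)
```

Normalizing every report into the same three-part form lets the rules test readings without knowing how each individual report is laid out.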

System Operation

The system has several modes of operation that are intended to accommodate differing user skill levels and task demands. Figure 3 depicts the operations associated with the investigation of troubles reported by end-users. This mode will be discussed in detail because it involves most of the system’s major capabilities. Alternative modes of operation will be summarized after this discussion. As indicated by the operational flow given in Fig. 3, the system’s approach to end-user problems is basically in keeping with that of an expert diagnostician.

It begins by attending to the symptom description and supporting data, narrows the search to a relatively few suspicious components, draws upon its prior experiences in similar circumstances to determine the relative likelihood that each of the suspicious components is indeed faulty, and then executes an investigation that focuses initially on the most likely fault locations. It periodically seeks feedback on its performance to guide its ongoing investigations and to influence the planning phase of subsequent investigations.

Fig. 3. Troubleshooter Fault Isolation Methodology.

Starting an Investigation

After receiving a trouble ticket or other fault notification, the user logs into the Troubleshooter host and invokes the system via a simple command line. The top level screen presented in Fig. 4 appears at the user terminal.

As shown, the screen is divided into four regions. The top region displays control key definitions for exiting the system, obtaining context-sensitive on-line help, and scrolling or paging the interactive text window. Status and error messages are temporarily displayed in the second screen region directly below the system control key definitions. Accompanying each message is a numerical code that can be used to locate additional information within the system reference manual. The third region of the screen is the text window where all interactive dialogue takes place. Most user input is in the form of menu selections or simple responses to questions posed by the system. The fourth region is occupied by a six stage status annunciator system that articulates to show the current operational state of the system.

To address an end-user problem, the user would select option 0, diagnose end-user problems. The other options will be briefly discussed later.

Encoding the Problem Context

Troubleshooter begins an investigation by gathering the same sorts of information that would be included in a trouble ticket. If the system is asked to diagnose an end-user problem, it begins by prompting for the physical or logical address constituting the end-user’s network entry point. The user is presented with a menu of predefined symptom descriptions from which to select. The user is expected to select the alternative that most adequately describes the symptom that is currently under investigation. If none of the alternatives is satisfactory, the user is free to enter a new description. A new symptom description is automatically incorporated into the menu once the system associates it with a fault condition.

Identifying the Plausible Subset

A large network may consist of thousands of components, and a blind search would be costly and extremely time consuming. As an initial step in the fault isolation process, the system acts to restrict the search to the plausible subset.

The plausible subset is made up of all network components constituting the communications path traversed by the end-user at the time the symptom was observed. Consider the hypothetical network given in Fig. 5.

    exit: DEL   help: ?   scroll: up, down   page: fwd, back, end

     0 - (diagnose end-user problems)

     1 - (diagnose clock)        4 - (diagnose cpm)        7 - (diagnose switch)
     2 - (diagnose controller)   5 - (diagnose power)      8 - (diagnose trunk)
     3 - (diagnose cooling)      6 - (diagnose repeater)   9 - (diagnose tsm)
    10 - (diagnose ty)          11 - (diagnose vdm)

    12 - (resume prior analysis)
    13 - (display network configuration data)
    14 - (execute programmed analysis)

    Enter 0-14 ..................................................

    setup   lproc   rcmd   rproc   error

Fig. 4. Troubleshooter User Interface: Top Level Screen.

[Fig. 5: hypothetical network diagram; labels include NODE1, NODE3, NODE4, HOST2, HOST3, HOST4, and HOST5.]

Assume that an end-user at terminal T2 has established a session with HOST4. The plausible subset consists of the path endpoints, T2 and HOST4, and all intervening components. Following the configuration given in Fig. 5, the intervening components in this example include local and remote voice/data multiplexers, a terminal interface module in NODE1, all common equipment associated with NODE1, local and remote trunk interface modules in NODE1 and NODE4 respectively, all common equipment in NODE4, and a multiplexed host interface module connecting HOST4 to the network.

In order to trace the communications path, the system must be supplied with a starting point, which is the physical address where the end-user terminal enters the network. The system then issues various operations commands to the designated address to determine whether the end-user is currently engaged in a conversation. If there is an active connection, the system accesses switch memory of the node to identify the next leg of the communications path. Whenever a trunk is encountered, the system queries a centralized configuration database residing on the OS host to determine the location of the remote end. The system then continues the trace at the designated remote node. Thus, by interleaving node operations commands and database queries, the system can navigate freely throughout the network, and trace an active connection end-to-end.
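The interleaving of node queries and database lookups can be sketched as a simple loop. In the Python sketch below, two dictionaries stand in for the node's switch memory and the configuration database. These data structures, the address format, and the module names are hypothetical simplifications, since the real system obtains this information through operations commands and database queries.

```python
from typing import Dict, List, Tuple

Address = Tuple[str, str]  # (node name, module/port identifier), hypothetical format


def trace_connection(start: Address,
                     switch_memory: Dict[Address, Address],
                     trunk_db: Dict[Address, Address],
                     max_hops: int = 32) -> List[Address]:
    """Follow an active connection from its network entry point to its far end.

    switch_memory maps an address to the address it is switched to within the
    same node (stand-in for a node query). trunk_db maps a trunk address to
    the trunk's remote end (stand-in for a configuration database query).
    Returns an empty list if no active connection is found.
    """
    path = [start]
    current = start
    for _ in range(max_hops):
        peer = switch_memory.get(current)      # node query: next leg of the path
        if peer is None:
            return []                          # no active connection to trace
        path.append(peer)
        remote = trunk_db.get(peer)            # database query: is this a trunk?
        if remote is None:
            return path                        # not a trunk, so this is the far end
        path.append(remote)
        current = remote                       # continue the trace at the remote node
    raise RuntimeError("trace exceeded max_hops; possible routing loop")


switch_memory = {
    ("NODE1", "ty-6"): ("NODE1", "trunk-1"),
    ("NODE4", "trunk-3"): ("NODE4", "cpm-24"),
}
trunk_db = {("NODE1", "trunk-1"): ("NODE4", "trunk-3")}

path = trace_connection(("NODE1", "ty-6"), switch_memory, trunk_db)
print(path)
```

The hop limit is a defensive choice made for this sketch so that inconsistent data cannot cause an endless trace.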

Some symptoms perceived by end-users are caused by broken network connections. If a connection is broken, there is no path to trace, and this presents a problem for identifying the plausible subset. Here, a slightly less exact algorithm is used that relies on knowledge of the Datakit VCS call routing strategy, the OS-resident configuration database, and node-resident routing data. Given a starting point and the name of any host computer on the network, the system can project the path that would have been traversed had the connection been active.


Formulating the Agenda

Troubleshooter sets an agenda that specifies the order in which each component within the plausible subset will be investigated. The agenda is arranged such that components deemed most likely to be at fault are investigated first. This is important in minimizing the amount of time taken to isolate the fault, and any degradation of network performance due to testing.

Much like any expert diagnostician, the system draws upon its prior experiences to determine the relative likelihood that each component is at fault. Prior experience is encoded in the form of a covariance matrix that relates symptom descriptions and fault locations. The fault locations refer to generic component types (for example, multiplexed host interface) rather than individual components (for example, the multiplexed host interface in slot 24 at NODE4).

Figure 6 illustrates how prior experience is applied in formulating the agenda. Continuing the earlier example, the network components identified in the path between T2 and HOST4 are defined as the plausible subset. The individual components are classified by type. There may be several individual components of a given type within the plausible subset. In the current example, there are two clocking subsystems, switches, power subsystems, and other pieces of common equipment because there were two nodes, NODE1 and NODE4, in the path. Similarly, there are two fiber trunk interface modules involved.

As indicated in Fig. 6, a weight is applied to each component type represented in the plausible subset.

[Fig. 6: panels labeled Plausible Subset, Weights, and Agenda.]

The weight represents the conditional probability that a component C of type n is at fault given that symptom S has just been observed, $P(C_n \mid S)$. Observational data stored in the trouble history are used to compute the conditional probabilities. The component types are then arranged in descending order of probability to form the agenda.
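One plausible way to derive such weights is to estimate each conditional probability as a relative frequency in the trouble history. The Python sketch below does this under that assumption. The article does not describe the actual computation (it refers to a covariance matrix), so the history format, the function names, and the sample data are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Trouble history entries: (symptom description, component type found at fault).
History = List[Tuple[str, str]]
# Plausible subset entries: (component type, individual address).
Component = Tuple[str, str]


def conditional_probabilities(history: History, symptom: str) -> Dict[str, float]:
    """Estimate P(type at fault | symptom) as a relative frequency."""
    counts = Counter(ctype for s, ctype in history if s == symptom)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {ctype: n / total for ctype, n in counts.items()}


def formulate_agenda(subset: List[Component], history: History,
                     symptom: str) -> List[Component]:
    """Order components by descending weight; ties keep their original order."""
    weights = conditional_probabilities(history, symptom)
    return sorted(subset, key=lambda comp: -weights.get(comp[0], 0.0))


history = [
    ("cannot connect", "terminal interface"),
    ("cannot connect", "terminal interface"),
    ("cannot connect", "terminal interface"),
    ("cannot connect", "trunk"),
    ("garbled characters", "trunk"),
]
subset = [
    ("clock", "NODE1"),
    ("trunk", "NODE1"),
    ("terminal interface", "NODE1"),
    ("trunk", "NODE4"),
    ("clock", "NODE4"),
]

weights = conditional_probabilities(history, "cannot connect")
agenda = formulate_agenda(subset, history, "cannot connect")
print(weights)
print(agenda)
```

Component types never observed with the symptom receive a weight of zero here and fall to the end of the agenda. A real system might prefer a small nonzero prior so that rare faults are not starved of attention.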

Testing

Once a component is targeted for investigation, the rules derived from the earlier knowledge engineering procedures are applied. The system first takes a high level view of the component to determine whether its behavior is nominal. It issues the OS commands required to obtain necessary reports and displays, parses the information, and then examines the critical fields.

If the component is found to be nominal, the component is passed and the next component specified on the agenda is targeted. Recall that, according to the prescribed inference procedures, any evidence of fault is followed up with additional tests devised to refine the diagnosis to the point where a specific intervention can be recommended, or in some cases, executed automatically.

Many of the tests conducted by the system are service affecting. For example, certain loopback tests cannot be conducted while an end-user has an active connection. The system seeks permission from the user before performing any service affecting operation in the network. If permission is denied, the system will discontinue testing the current component and move on to the next component on the agenda.

Intervention

The system will perform corrective procedures that do not require human involvement. For example, the terminal port associated with a given end-user may have been inadvertently taken out of service due to some sort of administrative or resource provisioning mixup. Having recognized this as the source of the problem reported by the end-user, the system can issue the appropriate OS commands to restore the port to an operational state. If this service affecting action is approved by the user, Troubleshooter will restore the port to service.

Many corrective procedures require human involvement. For example, the system may have evidence suggesting the circuit board that handles a particular end-user port is not seated securely in the slotted shelf at the Datakit VCS node. In such cases, the system issues detailed instructions to the user, which, depending on user skill and experience levels, may need to be forwarded to a qualified technician.

Feedback Processing

Feedback is required for the system to determine when it is done with a given problem, and for the system to learn from its experiences.

Each time the system performs or recommends an intervention, it seeks feedback from the network to verify that the intervention restored the given component to a nominal state. If the intervention is not successful, it will propose an alternative strategy based on a revised set of assumptions about the nature of the fault.

If the intervention is successful, the user will be asked to indicate whether the initial symptom has disappeared. Often this will require contact with the person who originally reported the symptom.

If the initial symptom persists, the system will infer that there are multiple faults in the path, and return to the agenda to continue testing at the next most likely fault location. The investigation terminates only when the fault underlying the reported symptom is isolated and corrected.

There are instances when immediate feedback cannot be obtained, either because the person who reported the problem is unavailable or the problem is intermittent and requires a lengthy observation period. In this case, or any other case of missing data, the user may suspend the investigation until the required information is available. In the meantime, the user is free to start or continue any number of other investigations.

After confirmation is received, the system will infer that it has isolated the fault underlying the symptom. It will then terminate the fault isolation process and update its trouble history to indicate the symptom reported, and component type found to be at fault.
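The feedback decisions described in this section can be summarized as a small decision function. This is only an illustrative sketch: the enumeration and function names are hypothetical, and missing feedback is modeled here as `None`.

```python
from enum import Enum


class NextStep(Enum):
    TRY_ALTERNATIVE = "propose an alternative intervention"
    SUSPEND = "suspend until feedback is available"
    CONTINUE_AGENDA = "resume testing at the next agenda item"
    TERMINATE = "terminate and update the trouble history"


def after_intervention(component_nominal, symptom_cleared):
    """Decide what to do once an intervention has been performed or recommended.

    component_nominal: True if network feedback shows the component restored.
    symptom_cleared:   True, False, or None when the user cannot yet say.
    """
    if not component_nominal:
        # The intervention failed: revise assumptions about the fault.
        return NextStep.TRY_ALTERNATIVE
    if symptom_cleared is None:
        # Reporter unavailable or problem intermittent.
        return NextStep.SUSPEND
    if not symptom_cleared:
        # Component fixed but symptom persists: infer multiple faults.
        return NextStep.CONTINUE_AGENDA
    return NextStep.TERMINATE


print(after_intervention(True, False).value)
# resume testing at the next agenda item
```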

Updating the trouble history has the effect of revising the weights used to formulate the agenda during the planning phase of subsequent investigations. Thus as the system gains experience within a network, it can become more efficient in isolating faults. It is interesting to note that, in time, no two systems will behave quite alike since every system modifies its behavior in accordance with the observed trouble history of the network in which it operates.
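To illustrate how two installations can drift apart, the sketch below keeps a per-symptom tally of faulty component types and orders an agenda from it. The class, its methods, and the site data are hypothetical; raw counts are used in place of probabilities because, for a single symptom, they yield the same ordering.

```python
from collections import Counter


class TroubleHistory:
    """Per-symptom tally of the component types found to be at fault."""

    def __init__(self):
        self._counts = {}  # symptom -> Counter of component types

    def record(self, symptom, component_type):
        # Called once an investigation terminates with a confirmed fault.
        self._counts.setdefault(symptom, Counter())[component_type] += 1

    def agenda(self, symptom, component_types):
        # Most frequently faulty type first; ties broken alphabetically.
        counts = self._counts.get(symptom, Counter())
        return sorted(component_types, key=lambda t: (-counts[t], t))


types = ["fiber_trunk_interface", "terminal_port"]

site_a = TroubleHistory()
site_b = TroubleHistory()

# Site A mostly sees port problems; site B mostly sees trunk problems.
site_a.record("no_response", "terminal_port")
site_b.record("no_response", "terminal_port")
site_b.record("no_response", "fiber_trunk_interface")
site_b.record("no_response", "fiber_trunk_interface")

print(site_a.agenda("no_response", types))
# ['terminal_port', 'fiber_trunk_interface']
print(site_b.agenda("no_response", types))
# ['fiber_trunk_interface', 'terminal_port']
```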

Alternate Modes of Operation

The operational mode discussed thus far is particularly well suited for persons with relatively little data communications expertise. An experienced technician will often be able to formulate a hypothesis about the location of a fault once a symptom is presented. For example, if after listening to a symptom description, the user is certain that the fault lies in the terminal interface component, then the user should select option 10, diagnose ty, from the main menu shown in Fig. 4.

The system allows expert users to capitalize on their expertise by planning their own investigations without assistance. Alternatively, the user may request that the system identify the plausible subset in a particular situation, and then handle the remainder of the planning operations without further assistance. This can reduce the time taken to complete an investigation and reduce the number of operations commands transmitted.

While the system forgoes its usual planning activities, it still prompts for a symptom description and the physical address of the component to be diagnosed. Also, testing, intervention, feedback, and history updating operations are the same, regardless of whether a component is targeted for investigation by the system or by the technician.

Another mode of operation is referred to as programmed analysis. This mode enables users to diagnose predefined collections of network entities on demand. Each collection is defined using a prescribed syntax and is stored on disk as a standard text file. Programmed analysis is designed to support preventive or routine maintenance procedures. For example, a user may wish to diagnose the entire dial-in modem pool for the network every Friday evening. The user would define the physical addresses of all modem pool ports in a text file, and then instruct the system to load that file and diagnose the addresses contained therein. Should additional dial-in ports be added, the user would edit the script file accordingly.

The programmed analysis mode is not symptom-driven. Rather than searching for the cause of a given symptom, it executes a comprehensive study of each component to identify all existing problems and many impending problems. This mode produces a listing of all problems found, the supporting evidence, and proposed corrective responses.
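A programmed analysis run could be sketched as below. The article does not give the prescribed file syntax, so this example assumes a hypothetical format of one physical address per line with `#` comments. The address strings and the `diagnose` callback are likewise invented for illustration.

```python
from pathlib import Path


def parse_collection(text):
    """Extract addresses from collection text (assumed: one per line, '#' comments)."""
    addresses = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if line:
            addresses.append(line)
    return addresses


def load_collection(path):
    """Read a collection definition from a standard text file on disk."""
    return parse_collection(Path(path).read_text())


def programmed_analysis(addresses, diagnose):
    """Study every listed component and collect all problems found.

    diagnose(address) is a hypothetical callback returning a list of
    (problem, evidence, proposed_response) tuples for one component.
    """
    listing = []
    for address in addresses:
        for problem, evidence, response in diagnose(address):
            listing.append((address, problem, evidence, response))
    return listing


modem_pool = """\
# Dial-in modem pool ports (hypothetical addresses)
node1/12/1
node1/12/2   # added after expansion

node4/07/1
"""


def fake_diagnose(address):
    # Pretend one port is out of service and the others are healthy.
    if address == "node1/12/2":
        return [("port out of service", "status report", "restore port")]
    return []


listing = programmed_analysis(parse_collection(modem_pool), fake_diagnose)
print(listing)
# [('node1/12/2', 'port out of service', 'status report', 'restore port')]
```

Unlike the symptom-driven mode, nothing here stops at the first fault: every address in the collection is examined and all findings are reported together.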

Discussion

Troubleshooter is a collection of many programs and is implemented in several different programming languages. As is clear from its diverse makeup, no attempt has been made to demonstrate the inherent superiority of any one expert system architecture, language, control structure, or knowledge representation technique. Simplicity and maintainability were the driving forces behind most architectural and tool selection decisions. Since the inception of the Troubleshooter project, the emphasis has been on system behavior rather than implementation tools or techniques.

Extensive laboratory testing has confirmed its ability to detect faults. But detecting a fault condition is one matter, and providing useful, credible solutions to those faults is another matter. The far more difficult question relates to how users will respond to the advice it imparts. Will its advice be taken seriously? Will it be thought to embody that highly subjective quality of expertise? Field testing is ongoing to assess its behavior in various real-world environments and to answer questions pertaining to user acceptance. Sites with small (<10 nodes), medium (11-30 nodes), and large (>30 nodes) Datakit VCS networks volunteered to install the system, and to use it during the course of their fault isolation activities.

In addition to providing data on system performance, the testing will help in understanding how varying organizational structures and procedures influence system effectiveness, and the extent to which skill level of the user affects individual perceptions of system effectiveness.

Preliminary findings suggest the system will be perceived as more valuable by relatively inexperienced and part-time users. Expert technicians indicated a strong preference for Troubleshooter features that eliminated tedious or error-prone aspects of fault isolation such as circuit tracing. Not surprisingly, features dealing more with the problem solving aspects of fault isolation were viewed as less important by experts. However, expert technicians reported that in most cases the system's overall approach to problem solving was consistent with their own. Only minor procedural differences were reported, and these differences generally related to the order in which diagnostic tests were conducted.

Current work on Troubleshooter is continuing in two areas. First, the knowledge base is being expanded to handle several new interface modules and additional cabinetry in the Datakit VCS networking environment. The size of the knowledge base is expected to double in the next year alone. Second, exploratory studies are underway to determine the feasibility of migrating Troubleshooter to different network domains. While, as with most expert systems, the bulk of the current knowledge base is domain-dependent, its control strategy, fault isolation methodology, and adaptive capabilities appear suitable to other data and voice networks.

The course of future work will depend largely on findings from ongoing field testing. Major design changes are likely to be influenced not so much by technological advances as by what is learned about how people perceive and react to expert systems.


Todd E. Marques is Member of Technical Staff in the Network Management and Host Interfaces Department at AT&T Bell Laboratories. He holds a B.S. degree in psychology from the University of California at Davis, and the M.A. and Ph.D. degrees in psychology from Rice University. Currently, he is involved in applied research and development of expert systems for network operations, administration, and maintenance.
