Recognizing a Program's Design: A Graph-Parsing Approach


Charles Rich and Linda M. Wills, Massachusetts Institute of Technology

Programmers tend to use the same structures over and over. By recognizing these cliches, this prototype can reconstruct a program's design and generate documentation automatically.


An experienced programmer can often reconstruct much of the hierarchy of a program's design by recognizing commonly used data structures and algorithms and knowing how they typically implement higher level abstractions. We call these commonly used programming structures cliches. Examples of algorithmic cliches are list enumerations, binary searches, and successive-approximation loops. Examples of data-structure cliches are sorted lists, balanced binary trees, and hash tables.

Psychological experiments [1] have shown that programmers use cliches heavily in many programming tasks. Instead of reasoning from first principles, programmers - like other problem solvers - tend to rely on their experience as much as possible.

In general, a cliche contains both fixed and varying parts. For example, every binary search must include computations to apply the search predicate and divide the remaining search space in half, but the specific search predicate will vary. A cliche may also include constraints that restrict the varying parts. For example, the operation that computes the next approximation in a successive-approximation loop must reduce the error term.
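The split between fixed parts, varying parts, and constraints can be illustrated with a minimal Python sketch of the binary-search cliche. This is only an illustration, not code from the Recognizer; the names `binary_search` and `goes_left` are hypothetical.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def binary_search(items: Sequence[T], goes_left: Callable[[T], bool]) -> int:
    """Return the index of the first item for which goes_left(item) is true.

    Fixed parts of the cliche: keep a search interval and halve it each step.
    Varying part: the search predicate goes_left, supplied by the caller.
    Constraint on the varying part (assumed, not checked): goes_left must be
    false for a prefix of items and true for the rest, otherwise discarding
    half of the interval is not justified.
    """
    low, high = 0, len(items)
    while low < high:
        mid = (low + high) // 2          # divide the remaining search space
        if goes_left(items[mid]):        # apply the search predicate
            high = mid
        else:
            low = mid + 1
    return low
```

For example, `binary_search([1, 3, 5, 7, 9], lambda x: x >= 5)` evaluates to `2`. Swapping in a different predicate changes only the varying part; the halving loop stays the same.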

We have built a prototype, the Recognizer, that automatically finds all occurrences of a given set of cliches in a program and builds a hierarchical description of the program in terms of the cliches it finds. So far we have demonstrated the Recognizer only on small Common Lisp programs, but the underlying technology is language-independent.

There are practical and theoretical reasons to automate cliche recognition. From a practical standpoint, automated cliche recognition will ease many software-engineering tasks, including maintenance, documentation, enhancement, optimization, and debugging. From a theoretical standpoint, automated cliche recognition is an ideal problem for studying how to represent and use programming knowledge and experience.

IEEE Software, January 1990


Difficulties

Our goal is not to mimic the human process of cliche recognition. Instead, we want to use human experiential knowledge in the form of cliches to achieve a similar result. Five characteristics of cliches make this difficult:

Syntactic variation: A programmer can achieve the same net flow of data and control many ways.

Noncontiguousness: A cliche's parts can be scattered through the program text; they do not necessarily appear in adjacent lines or expressions.

Implementation variation: An abstraction can be implemented many ways. For example, a hash table's buckets may or may not be sorted.

Overlapping implementations: Program optimization often merges the implementations of two or more distinct abstractions. Therefore, portions of a program may be part of more than one cliche.

Unrecognizable code: Not every program is constructed completely of cliches. The recognition system must be able to ignore idiosyncratic code.

Example

How the Recognizer works can be demonstrated by its performance on a simple program that retrieves entries from a hash table, shown in Figure 1a. This program uses three cliches:

hash table, a data-structure cliche typically used to implement associative retrieval,

cdr enumeration, the pattern of Car, Cdr, and Null typically used in Lisp to visit each element of a list, and

linear search, an algorithmic cliche that applies a predicate to a sequence of elements until an element is found that satisfies the predicate, or until there are no more elements.
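For readers less familiar with Lisp, the three cliches can be pointed out in a rough Python transliteration of the table-lookup program. This is an illustrative sketch only: the parameter names `hash_fn` and `key_fn` are assumptions standing in for the HASH and KEY functions of Figure 1a, and Python's `for` loop hides the explicit Car, Cdr, and Null steps.

```python
def table_lookup(table, key, hash_fn, key_fn):
    """Retrieve the entry with the given key from a hash table, or None."""
    # Hash table cliche: compute an index from the key, select that bucket.
    bucket = table[hash_fn(key, table)]
    # Cdr enumeration cliche: visit each element of the bucket list in turn
    # (in Lisp: test with NULL, take CAR, advance with CDR).
    for entry in bucket:
        # Linear search cliche: apply a predicate to each element and stop
        # at the first element that satisfies it.
        if key_fn(entry) == key:
            return entry
    # No more elements: the search fails.
    return None
```

A small usage example, with a deliberately simple hypothetical hash function:

```python
def length_hash(key, table):
    return len(key) % len(table)

table = [[] for _ in range(4)]
for entry in [("apple", 1), ("kiwi", 2), ("fig", 3)]:
    table[length_hash(entry[0], table)].append(entry)

found = table_lookup(table, "kiwi", length_hash, lambda e: e[0])  # ("kiwi", 2)
```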

One way the Recognizer demonstrates its recognition of the design of the program in Figure 1a is by producing the documentation shown in Figure 1b. The resulting documentation, although stilted, does describe the important design decisions in the program and can help a programmer locate relevant objects in the code (via the identifiers).

One potential benefit of automated cliche recognition is to use such automatically produced documentation to maintain poorly documented or undocumented programs. Automatically produced documentation can be updated whenever the source code changes, solving the pernicious problem of misleading, out-of-date documentation.

(a)

    (DEFUN TABLE-LOOKUP (TABLE KEY)
      (LET ((BUCKET (AREF TABLE (HASH KEY TABLE))))
        (LOOP
          (IF (NULL BUCKET) (RETURN NIL))
          (LET ((ENTRY (CAR BUCKET)))
            (IF (EQUAL (KEY ENTRY) KEY) (RETURN ENTRY)))
          (SETQ BUCKET (CDR BUCKET)))))

(b)

    TABLE-LOOKUP is an associative retrieval operation.
    If there is an element of the set TABLE with key KEY, then that element is returned; otherwise NIL.
    The key function is KEY.
    The hashing function is HASH.
    The set TABLE is implemented as a hash table.
    A bucket BUCKET of the hash table TABLE is implemented as a list.
    The elements of the list BUCKET are enumerated.
    Linear search is used to find the first element of the list BUCKET whose key is equal to KEY.

Figure 1. Given the Common Lisp program in (a) and an appropriate cliche library, the Recognizer automatically produces the documentation in (b). Uppercased words in the documentation indicate relevant identifiers in the program.

Representation shift

The key to the Recognizer's approach is a representation shift: Instead of looking for cliches directly in the source code, the Recognizer first translates the program into a language-independent, graphical representation called the Plan Calculus. The Plan Calculus is a program representation shared by all components of the Programmer's Apprentice [2,3], an intelligent programming system being developed at the Massachusetts Institute of Technology.

The diagram of the Recognizer architecture in Figure 2 shows the path of the input program in Figure 1a to the documentation in Figure 1b. The program is first translated into the Plan Calculus and then encoded as a flow graph, as shown in the top right. The flow graph is then parsed with the grammar extracted from the cliche library to produce a design tree. Figure 3 shows the design tree produced by the Recognizer for the program in Figure 1a.

[Figure 2 diagram. Recovered labels: Cliche library (plans and overlays); Translate; Plan; Paraphrase; To other tools. The diagram repeats the program of Figure 1a as its input and part of the documentation of Figure 1b as its output.]

Figure 2. Recognizer architecture. Only the module that translates source code into the Plan Calculus is language-dependent. As part of the Programmer's Apprentice, translators have been written for subsets of Lisp, Fortran, Cobol, and Ada.

The system uses this design tree to generate documentation by combining textual templates associated with the recognized cliches, filling in slots with identifiers taken from the program.

[Figure 3 diagram. Recovered node labels, from root to fringe: set-retrieve; hash-table-retrieve; compute-hash, select-term, set-retrieve; cdr-retrieve; cdr-enumeration, linear-search; select-car, select-cdr, null-test, apply-predicate. Identifiers at the fringe: HASH, AREF, CAR, CDR, NULL, EQUAL, KEY.]

Figure 3. Design tree produced by the Recognizer for the program in Figure 1a. Each nonterminal in the tree is the name of a cliche that has been recognized in the program. The dashed lines at the tree's fringe are links to identifiers in the source code to facilitate documentation generation.

Design trees are the key output of the Recognizer because they can be input to many other tools. For example, the Programmer's Apprentice will include editors, debuggers, optimizers, and other tools that take advantage of design trees to provide more intelligent assistance than purely code-based tools. The Recognizer can be used to produce design trees for existing programs so these tools can be applied.
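The template-filling step that turns a design tree into documentation can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the Recognizer's paraphraser: the `DesignNode` class, the `TEMPLATES` table, and the slot names are hypothetical, the template wording is borrowed from Figure 1b, and the tree is simplified relative to Figure 3.

```python
from dataclasses import dataclass, field


@dataclass
class DesignNode:
    cliche: str                                   # name of the recognized cliche
    slots: dict = field(default_factory=dict)     # identifiers from the program
    children: list = field(default_factory=list)  # lower-level cliches


# Hypothetical templates, one per cliche; wording follows Figure 1b.
TEMPLATES = {
    "set-retrieve": "{function} is an associative retrieval operation.",
    "hash-table-retrieve": "The set {set} is implemented as a hash table.",
    "cdr-enumeration": "The elements of the list {list} are enumerated.",
    "linear-search": ("Linear search is used to find the first element of "
                      "the list {list} whose key is equal to {key}."),
}


def paraphrase(node):
    """Walk the design tree top-down, filling each template's slots."""
    lines = []
    template = TEMPLATES.get(node.cliche)
    if template is not None:
        lines.append(template.format(**node.slots))
    for child in node.children:
        lines.extend(paraphrase(child))
    return lines


# A simplified design tree for the table-lookup example.
tree = DesignNode("set-retrieve", {"function": "TABLE-LOOKUP"}, [
    DesignNode("hash-table-retrieve", {"set": "TABLE"}, [
        DesignNode("cdr-enumeration", {"list": "BUCKET"}),
        DesignNode("linear-search", {"list": "BUCKET", "key": "KEY"}),
    ]),
])
documentation = paraphrase(tree)
```

Because the text is generated from the tree rather than written by hand, rerunning the same walk after the source changes would yield updated documentation.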

Plan Calculus. Translating programs into the Plan Calculus helps the Recognizer overcome the difficulties of syntactic variation and noncontiguousness by abstracting away from the details of algorithms that depend only on their expression in code. The Plan Calculus combines the representation properties of flowcharts, dataflow schemas, and abstract data types. Essentially, a plan is a hierarchical graph structure composed of boxes, which denote operations and tests, and arrows, which denote control flow and dataflow.

Figure 4a shows two plans in the graphical notation we use for the Plan Calculus. On the left is the plan for an implementation, called hash-table-retrieve, which is a cliched combination of operations to retrieve an entry from a hash table. It includes three operations: hash, which computes the index of the table that corresponds to the input key, select, which selects the indexed bucket of the table, and retrieve, which applies associative retrieval to the set of entries in the bucket. The solid arrows indicate dataflow constraints between these operations. Because this plan has no conditional structure, there is no control flow, which would be indicated by crosshatch arrows.

On the right side of Figure 4a is the specification for an operation called set-retrieve. When it succeeds, its output is an element of the set whose key is equal to the input key; if there is no such element, the operation fails. We specify preconditions and postconditions such as these in a separate logical language.
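As a rough illustration of a plan as boxes and dataflow arrows, the hash-table-retrieve plan might be written down in Python as below. The `Box` and `Dataflow` classes and the exact set of arrows are assumptions made for this sketch (the arrows are one reading of the prose description, not a transcription of Figure 4a), and control flow, preconditions, and postconditions are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    name: str        # role within the plan, e.g. "hash"
    operation: str   # operation type, e.g. "compute-hash"


@dataclass(frozen=True)
class Dataflow:
    source: str      # producing box, or "input:<name>" for a plan input
    target: str      # consuming box, or "output:<name>" for a plan output


HASH_TABLE_RETRIEVE = {
    "boxes": [
        Box("hash", "compute-hash"),
        Box("select", "select-term"),
        Box("retrieve", "set-retrieve"),
    ],
    "dataflow": [
        Dataflow("input:key", "hash"),        # hash computes an index from the key
        Dataflow("input:table", "select"),    # select picks a bucket of the table
        Dataflow("hash", "select"),           # ... at the computed index
        Dataflow("select", "retrieve"),       # retrieve searches that bucket
        Dataflow("input:key", "retrieve"),    # ... for the input key
        Dataflow("retrieve", "output:entry"),
    ],
}


def producers(plan, box_name):
    """Return the sources of all dataflow arrows entering the named box."""
    return {flow.source for flow in plan["dataflow"] if flow.target == box_name}
```

For instance, `producers(HASH_TABLE_RETRIEVE, "select")` gives `{"input:table", "hash"}`. Nothing in this structure depends on the order in which the operations appear in the program text, which is the point of the representation.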

Cliche library. In the Programmer's Apprentice, both cliches and individual programs are represented as plans. The relationship between a specification cliche and an implementation cliche is represented as an overlay. An overlay is composed of two plans and a set of correspondences between their parts, as Figure 4a shows. (Formally, an overlay defines a mapping from instances of one plan to instances of another.)

For example, the overlay in Figure 4a represents the relationship between the set-retrieve specification and the hash-table-retrieve implementation.

The hooked lines crossing the dividing line denote correspondences: The set is implemented as a hash table. The input to the hash computation is the input key. The key function of the bucket retrieval is the key function of the overall retrieval. The output of the bucket retrieval is the output of the overall retrieval.

A cliche library may contain different overlays involving the same plans, each one representing a different way to abstract the same implementation or a different way to implement the same specification.

In the Programmer's Apprentice, overlays are used both for program analysis (cliche recognition) and synthesis (design). In program analysis, occurrences of the implementation on the left side of an overlay are replaced by occurrences of the specification on the right; in program synthesis, the opposite happens.
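The two directions in which an overlay can be read can be sketched in Python as below. The `Overlay` class, the `LIBRARY` list, and the part names in the correspondences are hypothetical; the second overlay (cdr-retrieve implementing set-retrieve) is an assumption suggested by the design tree in Figure 3.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Overlay:
    implementation: str     # plan on the left side of the overlay
    specification: str      # plan on the right side of the overlay
    correspondences: tuple  # pairs (implementation part, specification part)


LIBRARY = [
    Overlay("hash-table-retrieve", "set-retrieve", (
        ("table", "set"),                          # set implemented as hash table
        ("hash.input", "key"),                     # input to hash is the input key
        ("retrieve.key-function", "key-function"),
        ("retrieve.output", "output"),
    )),
    # Assumed second overlay: another way to implement the same specification.
    Overlay("cdr-retrieve", "set-retrieve", ()),
]


def analyze(plan_name, library):
    """Program analysis: which specifications can this implementation be abstracted to?"""
    return [o.specification for o in library if o.implementation == plan_name]


def synthesize(spec_name, library):
    """Program synthesis: which implementations can realize this specification?"""
    return [o.implementation for o in library if o.specification == spec_name]
```

Here `analyze("hash-table-retrieve", LIBRARY)` yields `["set-retrieve"]`, while `synthesize("set-retrieve", LIBRARY)` yields both implementations, reflecting that one specification can have several overlays.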

We have compiled an initial library of several hundred plans and overlays in the area of basic programming techniques, such as manipulating arrays, vectors, lists, and sets. About a dozen of these are used to produce the documentation in Figure 1b from the program in Figure 1a.

Graph parsing. Essentially, a plan is a directed graph, and cliche recognition identifies subgraphs and replaces them with more abstract operations. It is therefore natural to view cliche recognition as a graph-parsing problem. Indeed, the heart of the Recognizer is a flow-graph parser developed by Daniel Brotsky. The Recognizer's design trees are derivation trees produced by this parser, as described in the accompanying box.

To encode a plan as a flow graph, the Recognizer turns the plan's boxes into flow-graph nodes and the plan's dataflow arrows into flow-graph edges. All other information in a plan, such as control flow, preconditions, and postconditions, is encoded in flow-graph attributes. (The resulting graphs are acyclic because the Plan Calculus models iteration by tail-recursion.)

[Figure 4 diagram. Recovered labels: hash: compute-hash; retrieve: set-retrieve; hash-table-retrieve; set-retrieve.]

Figure 4. (a) Representation in the Plan Calculus of how associative retrieval can be implemented using a hash table. This overlay is part of the cliche library used to recognize the program in Figure 1a. (b) Part of the encoding of the overlay into graph grammar rules (the constraints are not shown).

The overlays in the cliche library are similarly encoded as an attribute graph grammar, a straightforward generalization of attribute-string grammars.

Each plan definition in the library gives rise to a grammar rule. The top rule in Figure 4b encodes the hash-table-retrieve plan in Figure 4a. The left side of this rule is a node with the plan name as its type; the right side of the rule encodes the body of the plan as a flow graph. When the Recognizer uses this rule to parse the flow graph of an input program, it recognizes the hash-table-retrieve cliche.

Each overlay in the cliche library gives rise to a simple grammar rule with one node on each side. The bottom rule in Figure 4b encodes the overlay in Figure 4a. The left side of the grammar rule encodes the right side of the overlay; the right side of the grammar rule encodes the left side of the overlay. The mapping between input and output ports on either side of the grammar rule is computed from the overlay's correspondences. When the Recognizer uses this rule to parse the flow graph of an input program, it reconstructs a programmer's decision to implement set-retrieve by hash-table-retrieve.

We encode the plan for each side of an overlay as a separate grammar rule (instead of a single rule for each overlay) for two reasons. First, there may be plans in the library that are not used in any overlay. Second, some overlays have plans on both sides, while a grammar rule must have a single node on the left side. (The Recognizer handles such overlays by interleaving expansion steps with reduction steps during the parsing process, as detailed by Wills [4].)
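The two kinds of grammar rule, and what it means to use one during parsing, can be sketched in Python with a deliberately naive reduction step. This is not Brotsky's algorithm: it ignores ports, attributes, and constraints, matches by brute force, and assumes the bucket retrieval has already been reduced to a set-retrieve node. All names (`find_match`, `reduce_once`, the rule dictionaries, the `caller` node) are hypothetical.

```python
from itertools import permutations

# Rule derived from a plan: left side is the plan name, right side its body.
PLAN_RULE = {
    "lhs": "hash-table-retrieve",
    "nodes": {"hash": "compute-hash", "select": "select-term",
              "retrieve": "set-retrieve"},
    "edges": [("hash", "select"), ("select", "retrieve")],
}
# Rule derived from an overlay: one node on each side.
OVERLAY_RULE = {
    "lhs": "set-retrieve",
    "nodes": {"impl": "hash-table-retrieve"},
    "edges": [],
}


def find_match(graph_nodes, graph_edges, rule):
    """Bind each right-side node to a distinct graph node of the same type
    so that every right-side edge is present in the graph."""
    names = list(rule["nodes"])
    for candidate in permutations(graph_nodes, len(names)):
        binding = dict(zip(names, candidate))
        types_ok = all(graph_nodes[binding[n]] == rule["nodes"][n] for n in names)
        edges_ok = all((binding[a], binding[b]) in graph_edges
                       for a, b in rule["edges"])
        if types_ok and edges_ok:
            return binding
    return None


def reduce_once(graph_nodes, graph_edges, rule, new_id):
    """Replace one occurrence of the rule's right side by a single node of
    the rule's left-side type; return None if there is no occurrence."""
    binding = find_match(graph_nodes, graph_edges, rule)
    if binding is None:
        return None
    matched = set(binding.values())
    nodes = {n: t for n, t in graph_nodes.items() if n not in matched}
    nodes[new_id] = rule["lhs"]
    edges = set()
    for a, b in graph_edges:
        a2 = new_id if a in matched else a
        b2 = new_id if b in matched else b
        if a2 != b2:                 # drop edges internal to the match
            edges.add((a2, b2))
    return nodes, edges


# A tiny input flow graph: three operations feeding some unrelated caller.
nodes = {1: "compute-hash", 2: "select-term", 3: "set-retrieve", 4: "caller"}
edges = {(1, 2), (2, 3), (3, 4)}
step1 = reduce_once(nodes, edges, PLAN_RULE, "n1")        # recognize the plan
step2 = reduce_once(step1[0], step1[1], OVERLAY_RULE, "n2")  # undo the design decision
```

After the first step the three operation nodes are replaced by one hash-table-retrieve node; after the second, that node is replaced by set-retrieve. A real flow-graph parser would use the port mappings to decide how the new node connects to its surroundings, which this sketch approximates by simply redirecting external edges.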


(a)

    (DEFUN R (L X &AUX B)
      (SETQ B (AREF L (H X L)))
      (PROG (E)
       LP (WHEN (NULL B) (RETURN NIL))
          (SETQ E (CAR B))
          (COND ((EQUAL (K E) X) (RETURN E))
                (T (SETQ B (CDR B))
                   (GO LP)))))

(b)

    R is an associative retrieval operation.
    If there is an element of the set L with key X, then that element is returned; otherwise NIL.
    The key function is K.
    The hashing function is H.
    The set L is implemented as a hash table.
    A bucket B of the hash table L is implemented as a list.
    The elements of the list B are enumerated.
    Linear search is used to find the first element of the list B whose key is equal to X.

Figure 5. (a) A syntactic variation of the program in Figure 1a and (b) the corresponding documentation produced by the Recognizer. The documentation is identical to Figure 1b, except for the names of the identifiers.

(a)

    (DEFUN TABLE-LOOKUP (TABLE KEY)
      (LET ((BUCKET (AREF TABLE (HASH KEY TABLE))))
        (LOOP
          (IF (NULL BUCKET) (RETURN NIL))
          (LET* ((ENTRY (CAR BUCKET))
                 (Y (KEY ENTRY)))
            (COND ((STRING> Y KEY) (RETURN NIL))
                  ((EQUAL Y KEY) (RETURN ENTRY))))
          (SETQ BUCKET (CDR BUCKET)))))

(b)

    TABLE-LOOKUP is an associative retrieval operation.
    If there is an element of the set TABLE with key KEY, then that element is returned; otherwise NIL.
    The key function is KEY.
    The hashing function is HASH.
    The set TABLE is implemented as a hash table.
    A bucket BUCKET of the hash table TABLE is implemented as a sorted list.
    The elements of the sorted list BUCKET are enumerated.
    The iteration is terminated when an element of the sorted list BUCKET is found whose key Y is greater than KEY.
    Linear search is used to find the first element of the sorted list BUCKET whose key Y is equal to KEY.
    The sorting relation on keys is STRING>.

Figure 6. (a) An implementation variation of the program in Figure 1a, in which the buckets of the hash table are sorted lists. (b) The first seven lines of the documentation are the same as in Figure 1b.

Difficulties revisited

The Recognizer overcomes the five difficulties in automating cliche recognition.

Syntactic variation. Figure 5 shows a table-lookup program that is very different from the one in Figure 1a. Among the differences are the variable names (Table and L), control primitives (Loop and Prog), and the syntactic nesting. However, both programs translate into the same plan, so the Recognizer produces an identical design tree (the one in Figure 3) and identical documentation (except for the identifiers).

This example demonstrates that the Recognizer identifies cliches using only structural information such as control flow, dataflow, and primitive operation types. It does not use any information in the names of variables or procedures. This is both a strength and a limitation.

Noncontiguousness. The Plan Calculus addresses the noncontiguousness problem by explicitly representing dataflow and control flow in programs. For example, the Car, Cdr, and Null steps of the cdr-enumeration cliche are adjacent in the dataflow graph for the program in Figure 1a, even though they are separated by unrelated expressions in the code.

Implementation variation. Figure 6 illustrates the Recognizer's ability to deal with implementation variation. The program in Figure 6a is similar to the program in Figure 1a, except that the buckets of the hash table have been implemented as sorted lists. Given the same grammar used for Figure 1, but with the addition of rules that describe the implementation of set-retrieve on sorted lists, the Recognizer produces a design tree (not shown here) that has the same top three levels as the tree in Figure 3, but differs in the layers below.

Overlapping implementations. Figures 7 and 8 illustrate the Recognizer's ability to deal with overlapping implementations. The grammar for this example includes rules to find the minimum element of a list (list-min) by enumerating the elements and accumulating the minimum element thus far; it also includes similar rules to find the maximum element of a list (list-max).

The program in Figure 7 - a simple, inefficient program to compute a list's maximum and minimum elements - could be synthesized straightforwardly from the cliche library by implementing the list-max and list-min specifications separately. (It is inefficient because it enumerates the list twice.)

The optimized version of this program in Figure 8 enumerates the list only once, creating an overlap between the list-max and list-min implementations. The design tree generated for this program shows the overlap in the cdr-enumeration nonterminal, which is shared by the two subtrees of list-max and list-min. In effect, the Recognizer has undone the optimization.

Unrecognizable code. The processing time for Brotsky's parsing algorithm is polynomial in the size of the input and the grammar. This means that if we could guarantee that the input graph can always be derived from the grammar, cliche recognition could be performed very efficiently. Of course, most programs are only partially constructed of cliches, so we extended Brotsky's algorithm in two ways to facilitate partial recognition. Because our extensions amount to computing subgraph isomorphism, the resulting algorithm has exponential worst-case performance.

Our first extension lets the parser ignore indeterminate amounts of unparsable leading and trailing input, so it can recognize cliches in the midst of unrecognizable code. The parser ignores unparsable leading input by starting its read head not only at the leftmost edge of the input flow graph but at every possible intermediate position in the flow graph. It ignores unparsable trailing input simply by allowing parses to complete before the input flow graph is totally scanned.

    (DEFUN MAX-MIN (L)
      (VALUES (LIST-MAX L) (LIST-MIN L)))

    (DEFUN LIST-MAX (L)
      (LET ((MAX MOST-NEGATIVE-FIXNUM))
        (LOOP
          (IF (NULL L) (RETURN MAX))
          (LET ((N (CAR L)))
            (IF (> N MAX) (SETQ MAX N)))
          (SETQ L (CDR L)))))

    (DEFUN LIST-MIN (L)
      (LET ((MIN MOST-POSITIVE-FIXNUM))
        (LOOP
          (IF (NULL L) (RETURN MIN))
          (LET ((N (CAR L)))
            (IF (< N MIN) (SETQ MIN N)))
          (SETQ L (CDR L)))))

Figure 7. An unoptimized Common Lisp program that computes the maximum and minimum elements of a nonempty list of integers.

Our second extension lets the parser recognize low-level cliches, even if it can't reconstruct the higher level design that puts them together. It does this by considering every nonterminal node in the grammar as a possible starting type for a derivation. The design tree in Figure 8 illustrates partial recognition: Because the cliche library contains no single specification for computing both list-max and list-min, the design tree does not have a single root. In general, a design tree with multiple roots indicates either that the program's top level is idiosyncratic or that the relevant cliche is not in the library. In this example, it seems reasonable that the library does not include such trivial specification combinations.
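The effect of the two extensions can be illustrated with a string-grammar analog in Python: a recognizer that tries every start position, every end position, and every nonterminal as a starting symbol. This is only an analogy under stated assumptions. It works on a flat token sequence rather than a flow graph, it is not Brotsky's algorithm, and it does not exhibit the subgraph-isomorphism cost of the real problem. The grammar and the token names are hypothetical, and the grammar is assumed to have no empty right sides and no cycles of single-symbol rules.

```python
from functools import lru_cache


def recognize_everywhere(tokens, grammar):
    """Return (nonterminal, start, end) for every span tokens[start:end]
    that can be derived from any nonterminal of the grammar."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def derives(symbol, start, end):
        if symbol not in grammar:        # terminal: must match one token
            return end == start + 1 and tokens[start] == symbol
        return any(matches(rhs, start, end) for rhs in grammar[symbol])

    @lru_cache(maxsize=None)
    def matches(rhs, start, end):
        if len(rhs) == 1:
            return derives(rhs[0], start, end)
        # Split so that every remaining symbol still covers at least one token.
        return any(derives(rhs[0], start, mid) and matches(rhs[1:], mid, end)
                   for mid in range(start + 1, end - len(rhs) + 2))

    found = []
    for start in range(len(tokens)):               # skip unparsable leading input
        for end in range(start + 1, len(tokens) + 1):  # allow trailing input
            for nonterminal in grammar:            # any nonterminal may be a root
                if derives(nonterminal, start, end):
                    found.append((nonterminal, start, end))
    return found


grammar = {
    "cdr-enumeration": [("NULL", "CAR", "CDR")],
    "linear-search": [("cdr-enumeration", "EQUAL")],
}
tokens = ["PRINT", "NULL", "CAR", "CDR", "EQUAL", "FORMAT"]
spans = recognize_everywhere(tokens, grammar)
```

Here `spans` contains `("cdr-enumeration", 1, 4)` and `("linear-search", 1, 5)`: both cliches are found even though the surrounding `PRINT` and `FORMAT` tokens are not covered by any rule, and the lower-level cliche is reported on its own as well as inside the higher-level one.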

(a)

    (DEFUN MAX-MIN (L)
      (LET ((MAX MOST-NEGATIVE-FIXNUM)
            (MIN MOST-POSITIVE-FIXNUM))
        (LOOP
          (IF (NULL L) (RETURN (VALUES MAX MIN)))
          (LET ((N (CAR L)))
            (IF (> N MAX) (SETQ MAX N))
            (IF (< N MIN) (SETQ MIN N)))
          (SETQ L (CDR L)))))

(b)

[Figure 8b diagram. A design tree with two roots, list-max and list-min. Recovered node labels: cdr-max, cdr-min; max-accumulation, min-accumulation, and a shared cdr-enumeration; apply-function2 (for > and <); select-car, select-cdr, null-test (NULL).]

Figure 8. (a) An optimized version of the program in Figure 7 and (b) the design tree produced by the Recognizer. Note the shared occurrence of the cdr-enumeration cliche.

[Box]

A flow graph is a labeled, directed, acyclic graph in which edges connect a node's input and output ports. Node labels identify node types; each node type has a fixed number of input and output ports. Fan-in and fan-out are allowed.

A flow graph is derived from a context-free flow-graph grammar in much the same way a string is derived from a context-free string grammar. Figure A shows a context-free flow-graph grammar, which is a set of rewrite rules, each specifying how a node in a graph may be replaced by a subgraph. The left side of each rule is a single nonterminal node; the right side is a graph that may contain both terminal and nonterminal nodes.

Unlike string grammars, each rule in a flow-graph grammar specifies a mapping (shown by numbers) between the unconnected input and output ports on the left side and unconnected input and output ports on the right side. This mapping determines how the subgraph is connected to the surrounding nodes when it replaces a nonterminal node in a derivation. (For more information on graph grammars in general, see the accounts edited by Hartmut Ehrig [1].) As with string grammars, it is convenient to abstract a graph derivation sequence as a tree that shows how each nonterminal is expanded, as Figure B shows.

Flow-graph grammars are a natural formalism for encoding various kinds of engineering diagrams, such as electrical circuits or dataflow. Furthermore, polynomial-time algorithms have been implemented for parsing flow graphs based on generalizations of existing string parsing algorithms [2,3].

Figure A. A flow-graph grammar.

Figure B. (1) A derivation sequence; (2) a derivation tree.

References

1. "Graph-Grammars and Their Application to Computer Science," Lecture Notes in Computer Science Series, Vol. 291, H. Ehrig et al., eds., Springer-Verlag, New York, Dec. 1986.

2. D.C. Brotsky, "An Algorithm for Parsing Flow Graphs," Tech. Report 704, Artificial Intelligence Lab., Massachusetts Inst. of Technology, Cambridge, Mass., 1984.

3. R. Lutz, "Chart Parsing of Flow Graphs," Proc. 11th Int'l Joint Conf. Artificial Intelligence, Morgan Kaufmann, Los Altos, Calif., 1989, pp. 116-121.

[End of box]

For the Recognizer to be a practical maintenance aid, it must be improved and extended several ways.

Extensions. The Recognizer was developed in parallel with - actually, slightly behind - the Plan Calculus. The Recognizer does not now handle data plans or data overlays (facilities in the Plan Calculus for language-independent modeling of data structures and data abstraction). Also, the Recognizer's handling of destructive operations (such as modifying an array) is inadequate.

These deficiencies should be relatively easy to correct. We also plan to connect the Recognizer to a new logical reasoning system, developed as part of the Programmer's Apprentice, so it can reason more powerfully about grammar attributes.

Limitations. It is already clear from our experiments with small programs that the exhaustive, purely structural approach used by the Recognizer will not directly scale up to programs of commercial size and complexity: It is too expensive to search for all possible derivations of the input graph from the cliche library.

Other systems use heuristics to prune multiple analyses. For example, Lewis Johnson's Proust [5] pursues the cliche with the greatest number of currently recognized parts. In general, the use of heuristics to prune the search for cliches can lead to missing useful ways of understanding the input program.

Hybrid approach. Another possible way to control the search for cliches is to use the existing documentation, such as comments and mnemonic variable and procedure identifiers. The Recognizer does not now use this information because it is often incomplete and inaccurate. However, this kind of documentation could provide an important independent source of expectations about a program's purpose and design. The Recognizer could then confirm, amend, and complete these expectations by checking them against the code.

Other recognition systems are given a third input, in addition to a program and a cliche library, in the form of a specification (F.J. Lukey's Pudsy [6]), a set of goals (Proust), or a model program that performs the same task (William Murray's Talus [7]). We envisage a hybrid approach to cliche recognition that combines two complementary processes: documentation- and/or specification-driven (top-down) and code-driven (bottom-up). The heuristic top-down process will use documentation or user input to guide the code-driven process by generating expectations. The algorithmic bottom-up process will fill in the gaps in the documentation and verify or reject the expectations.

To design a hybrid recognition system, we first need to conduct additional theoretical and empirical analyses of the complexity of the Recognizer. We are trying to apply the Recognizer to programs at least 10 times larger than the examples in this article.

Learning cliches. One of our long-range goals is to explore how the Recognizer can help automate the task of knowledge acquisition in the Programmer's Apprentice. In particular, we are interested in how the system might automatically learn new cliches.

One idea is to look at programs that use some unfamiliar parts to implement familiar specifications. First, the Recognizer would identify what specifications are implemented by familiar parts of the program. Then, given a top-level specification, it should be possible to identify some lower level specifications that are not accounted for by the recognized program parts. A learning procedure could then reasonably hypothesize that the unrecognizable part of the program is a new cliche for implementing the remaining specifications. Robert Hall has demonstrated a similar learning scheme in the domains of digital circuits and mechanical gears [8].

Acknowledgments

We thank Elliot Chikofsky, Richard Waters, Dilip Soni, Howard Reubenstein, Yang Meng Tan, and Scott Wills for their suggestions to improve this article.

Support for this work has been provided in part by the National Science Foundation under grant 11214616644, the Defense Advanced Research Projects Agency under Naval Research contract N00014-88-K-0487, IBM, Nynex, and Siemens. The views and conclusions in this article are those of the authors and should not be interpreted as representing the policies, expressed or implied, of these organizations.

References

1. E. Soloway and K. Ehrlich, "Empirical Studies of Programming Knowledge," IEEE Trans. Software Eng., Sept. 1984, pp. 595-609.

2. C. Rich and R.C. Waters, "The Programmer's Apprentice: A Research Overview," Computer, Nov. 1988, pp. 10-25.

3. C. Rich and R.C. Waters, The Programmer's Apprentice, Addison-Wesley, Reading, Mass., to appear, 1990.

4. L.M. Wills, "Automated Program Recognition: A Feasibility Demonstration," Artificial Intelligence, 1990, to appear.

5. W.L. Johnson, Intention-Based Diagnosis of Novice Programming Errors, Morgan Kaufmann, Los Altos, Calif., 1986.

6. F.J. Lukey, "Understanding and Debugging Programs," Int'l J. Man-Machine Studies, 1980, pp. 189-202.

7. W.R. Murray, "Heuristic and Formal Methods in Automatic Program Debugging," Proc. Ninth Int'l Joint Conf. Artificial Intelligence, Morgan Kaufmann, Los Altos, Calif., 1985, pp. 15-19.

8. R.J. Hall, "Learning by Failing to Explain: Using Partial Explanations to Learn in Incomplete or Intractable Domains," Machine Learning, Jan. 1988, pp. 45-77.

Charles Rich is a principal research scientist at the Artificial Intelligence Laboratory of the Massachusetts Institute of Technology. He codirects the Programmer's Apprentice project, which includes the work described in this article. His research interests are knowledge representation and the application of artificial intelligence to engineering problem solving, especially in software engineering.

Rich received a BS in engineering science from the University of Toronto and an MS and PhD in artificial intelligence from MIT. He is a member of ACM, AAAI, and the IEEE Computer Society.

Linda Wills is a PhD candidate at the Artificial Intelligence Laboratory of MIT and a member of the Programmer's Apprentice project. Wills's primary research interest is program understanding. Other interests include intelligent tutoring systems, machine vision, and concurrent computing.

Wills received a BS and MS in computer science from MIT. She is a member of ACM and AAAI.

Address questions about this article to the authors at Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 545 Technology Sq., Cambridge, MA 02139.
