Page 1: The Automatic Generation of Literature Abstracts: An ...students.lti.cs.cmu.edu/11899/files/p172-paice-cue-phrases.pdf · sentence to another (Halliday and Hasan, 1976). (As already

11

The automatic generation of literature abstracts: an approach based on the identification of self-indicating phrases C. D. Paice

11.1 Background

Considering the important part played by abstracts in the traditional information services, the possibility of producing abstracts by computer has not received very much attention. There are perhaps two main reasons for this. First, it appears that the production of well-constructed abstracts is an artificial intelligence problem, and therefore unlikely to be either feasible or worthwhile until well into the future: the alternative of picking sentences from here and there in a document is a rather unattractive proposition. Second, the cost of key-punching complete texts for input to an abstracting program can hardly be justified -- especially since the program will then in effect discard most of the text which has been so laboriously prepared. It now appears that the first of these objections is exaggerated -- reasonable-looking abstracts can often be produced by quite 'unintelligent' programs -- while with advances in technology the second problem should soon disappear. We should be ready to take advantage of this when it happens.

Early work in this field concentrated on the extracting problem: that is to say, on finding sentences which could be extracted from a text to convey a good idea of its subject matter. Luhn (1958) wrote a program which looked for sentences containing clusters of 'key words': that is, the most frequent non-commonplace words in the text. The clusters were weighted according to their size and density, and those sentences containing the most highly weighted clusters were selected. At about the same time Baxendale (1958) drew attention to the fact that the position of a sentence within a text has a bearing on its importance: for instance, she showed that in 85 per cent of a sample of 200 paragraphs the 'topic' sentence was the first, while in another 7 per cent it was the last. Extending this idea, we can understand that the first few and the last few paragraphs of a document are likely to give a strong indication of its overall subject: the pages in between usually contain a lot of detail, which is not of much value taken out of its context.

During the 1960s the most important work was carried out by Edmundson (1969), who studied four extracting methods, both individually and in all possible combinations. All four methods involved the assignment of weights to sentences, and the subsequent selection of sentences with the highest weights. The location method weighted sentences if they occurred in preferred positions in the text, as described in the last paragraph, or if they occurred below a significant heading such as 'Introduction' or 'Conclusions'. The cue method made use of a dictionary of words which had been found to have either a positive association with extract-worthy sentences (783 instances) or a negative association (73 instances): these words were weighted positively or negatively, as appropriate. The key method gave weight to the most frequent non-commonplace words in a document which were not included in the dictionary of cue words. Finally, the title method weighted words in a text if they also appeared in its title (high weight) or in a heading within the text (lower weight). All four methods were considered to be helpful in identifying extract-worthy sentences, and the best results of all were obtained by combining the location, cue and title methods: the key method was definitely the poorest, but this is not surprising in view of the way in which the key words were defined.
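The idea of combining several weighting methods and then selecting the highest-weighted sentences can be sketched as follows. This is a minimal illustration and not Edmundson's program: the function names, the coefficients and the toy scores are all assumptions made for the example. The key method is given a coefficient of zero only to mirror the best-performing combination reported above.

```python
from typing import Dict, List


def combine_scores(scores: Dict[str, List[float]],
                   coeffs: Dict[str, float]) -> List[float]:
    """Return one total weight per sentence as a weighted sum over methods."""
    n = len(next(iter(scores.values())))
    totals = [0.0] * n
    for method, values in scores.items():
        c = coeffs.get(method, 0.0)
        for i, v in enumerate(values):
            totals[i] += c * v
    return totals


def select_top(totals: List[float], k: int) -> List[int]:
    """Indices of the k highest-weighted sentences, returned in text order."""
    ranked = sorted(range(len(totals)), key=lambda i: totals[i], reverse=True)
    return sorted(ranked[:k])


# Hypothetical per-method scores for a four-sentence text.
scores = {
    "location": [2.0, 0.0, 0.0, 1.0],
    "cue":      [1.0, -1.0, 0.0, 2.0],
    "title":    [1.0, 0.0, 1.0, 0.0],
    "key":      [0.0, 1.0, 1.0, 0.0],
}
# Assumed coefficients: location, cue and title combined, key switched off.
coeffs = {"location": 1.0, "cue": 1.0, "title": 1.0, "key": 0.0}

totals = combine_scores(scores, coeffs)
extract = select_top(totals, 2)
```

Returning the selected indices in text order keeps the extract in the sequence of the original document, whatever the ranking by weight.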

Earl (1970) made use of two syntax analysis programs (the first to assign parts of speech to the individual words in a document and the second to assign phrase classes where possible), in an attempt to find whether there was any correlation between the syntax of a sentence and its suitability for extracting. In both cases the great majority of sentences were represented by unique syntactic patterns, and, hence, the experiment was inconclusive. She also tested a method reminiscent of Luhn's, in which sentences were selected if they contained enough key words, and obtained results which were 'mildly encouraging'.

Skorokhod'ko (1972) started from the standpoint that the optimal method of extraction is likely to vary, depending on the structure of the text: the arrangement of sections and sub-sections, and the logical flow of ideas, vary a good deal from one text to another. Hence, he described an adaptive method, in which relationships between sentences depending on the semantic relatedness of the words within them are used in setting up a graphical representation of the text. Sentences which are semantically related to large numbers of other sentences, and whose deletion would cause serious disruption of the sense of that part of the text, receive high weights and are therefore prime candidates for extraction.

Skorokhod'ko's method was designed for producing extracts for Russian. Taylor (1977) has used a rather similar approach in which a text is converted to a semantic network using case grammar relationships. 'Maximally connected sub-graphs' are identified, and from them a single sub-graph is constructed. This is finally reconverted to textual form to serve as the abstract.

The rather successful ADAM system developed by Rush, Salvador and Zamora (1971) and Pollock and Zamora (1975) is said to rely on the detection and rejection of irrelevant sentences rather than on the selection of extract-worthy ones: however, it is not clear whether this distinction has any real substance. Their program makes use of a word control list (WCL) containing information about nearly 800 words and phrases, which are labelled according to their semantic or syntactic status, or both. Some of the entries are flagged as 'very positive', which means that sentences containing them are likely to be extract-worthy, while some are 'very negative'. The WCL thus contains information similar to Edmundson's cue dictionary, except that the information is used not for assigning global weights, but for deciding whether to select a sentence during actual scanning of that part of the text. Positive items have their weights in the WCL decreased if that term is found to occur more than four times per thousand words in the text being processed (Pollock and Zamora, 1975); this takes account of the idea that the information content of a word is less if the word occurs rather commonly (Edmundson and Wyllys, 1961), and it also tends to reduce the size of the abstract. The weights of negative terms, similarly, are made less negative if the terms occur more than seven times per thousand words.
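The frequency adjustment just described can be sketched in a few lines. The two thresholds (four and seven occurrences per thousand words) come from the text; the function name, the integer weights and the size of the adjustment (`step`) are assumptions, since the amount by which ADAM changes a weight is not stated here.

```python
def adjusted_weight(weight: int, occurrences: int, text_length: int,
                    step: int = 1) -> int:
    """Damp a WCL weight when its term is common in the text being processed.

    `step` is a hypothetical adjustment size, not a value taken from ADAM.
    """
    if text_length <= 0:
        raise ValueError("text_length must be positive")
    rate = 1000.0 * occurrences / text_length  # occurrences per thousand words
    if weight > 0 and rate > 4:
        return weight - step   # positive term: weight decreased
    if weight < 0 and rate > 7:
        return weight + step   # negative term: made less negative
    return weight
```

For instance, a positive term occurring 10 times in a 2000-word text has a rate of 5 per thousand and is damped, while a negative term at the same rate is left alone because it has not passed the higher threshold.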

The WCL also contains various other kinds of semantic label, used in the fairly sophisticated selection algorithm; in particular, the algorithm takes account of expressions which require antecedents, such as 'this' and 'these'. Sentences containing such expressions are only selected if it is also possible to select the preceding sentence or sentences. In this way they largely avoid the selection of sentences containing dangling references, such as would be seen in the present sentence if it were taken out of its context. Another innovative feature of their method is the provision for deleting certain words, expressions or clauses which might detract from the neatness or conciseness of a selected sentence.
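The rule for expressions that need antecedents can be illustrated with a small sketch. It is a simplification under stated assumptions: only 'this' and 'these' are treated as requiring an antecedent, "possible to select the preceding sentence" is reduced to "the immediately preceding sentence has been selected", and the names `needs_antecedent` and `select_sentences` are hypothetical.

```python
import re
from typing import List

# Only the two examples given in the text; the real WCL labels are not reproduced.
ANAPHORS = {"this", "these"}


def needs_antecedent(sentence: str) -> bool:
    """True if the sentence contains an expression that points backwards."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return any(w in ANAPHORS for w in words)


def select_sentences(sentences: List[str], candidate: List[bool]) -> List[int]:
    """Keep candidate sentences, dropping any that would dangle.

    candidate[i] says whether sentence i passed the other selection tests.
    """
    selected: List[int] = []
    for i, sentence in enumerate(sentences):
        ok = candidate[i]
        if ok and needs_antecedent(sentence):
            # Simplified rule: the preceding sentence must itself be selected.
            ok = i > 0 and (i - 1) in selected
        if ok:
            selected.append(i)
    return selected


sentences = [
    "A new index was built.",
    "This index was tested on 200 papers.",
    "These results are preliminary.",
    "Costs were not measured.",
]
```

With `candidate = [False, True, True, True]` only the last sentence survives, because the second and third would each be left without their antecedent.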

A drawback with selecting (or rejecting) sentences during the actual scanning of the text is that there is a certain lack of control over the length of the extract. The ADAM system produces extracts nearly 20 per cent of the length of the original document, and this is several times greater than the length of a typical indicative abstract (Pollock and Zamora, 1975).

By taking account of inter-sentence references, the ADAM system largely avoids the disjointed nature of the more primitive forms of abstract. However, where widely separated sentences are juxtaposed, the style of an extract can be impaired by repetitious sentence structures. Mathis, Rush and Young (1973) have identified a number of rules enabling such pairs of sentences to be merged. For example, their rules would combine the sentences

This investigation is concerned with the weighting of terms in search requests.
This investigation shows that the use of weights is generally advantageous.

into

This investigation is concerned with the weighting of terms in search requests, and shows that the use of weights is generally advantageous.

It is evident that only a small number of adjacent sentences will, in fact, require or be amenable to this kind of transformation.

Karasev (1978) has implemented a system for producing abstracts in Russian, which appears to be similar in some respects to the ADAM system. A dictionary of phrases, words and word-stems is used to classify sentences as essential, eligible or unsuitable for extracting. During scanning of the text, the essential and eligible sentences are gathered into a file. Afterwards a synthesis procedure constructs from this file a string of sentences linked together by syntactic relationships into a coherent whole.

11.2 Introduction to the present work

The abstracting method described in this chapter makes use of what are surely the most obvious of all the indicators of sentence significance but which nonetheless seem hardly to have been considered up to now. These are commonly occurring structures which explicitly state that the sentences containing them have something important to say about the subject matter or the 'message' of the document. Examples are 'The principal aim of this paper is to investigate ...' and 'In the present paper, a method is described for ...'. These structures, which we will refer to as indicators (Paice, 1977), appear to correspond in some degree with Janos's 'thematic metatext', except that he is mainly concerned with the linguistic principles underlying the thematic structure of a text: our approach here is essentially concrete and pragmatic. Pollock and Zamora (1975) list a number of our indicator phrases, but see them merely as 'non-substantive introductory phrases' which are to be removed from their extracts during the final editing.

Technical papers and reports may contain a number of indicator phrases varying from none up to a dozen or more. It is, therefore, obvious that no practical abstracting system will be able to rely on indicators alone. We should regard this method as just one out of an armoury of devices, although our investigations so far suggest that this may be a particularly powerful and convenient device. For present purposes, therefore, we have drawn up a set of rules which depend exclusively on identifying indicator phrases: incorporating other methods of determining sentence significance must come later.

The aim of our project is primarily to produce indicative abstracts -- that is, abstracts which help clarify the topic of a document. If the abstracts also contain informative content -- information about the substance and findings of the work described -- then this is regarded as a bonus. The generation of critical or comparative abstracts (Paice, 1977) we regard as completely out of reach: indeed, it is to abstracts of this kind that the 'artificial intelligence problem', mentioned at the start of this chapter, really applies. We also expect our abstracts to be quite short -- say about 100 words, since this seems to be what is often required in practice -- although the desired length is, in fact, an adjustable parameter in our method.

The disjointed appearance of primitive extracts is greatly reduced if sequences of adjacent sentences are selected rather than isolated sentences. The selection of coherent passages is assisted by taking account of exophoric references -- that is, words or word-groups which signal a link from one sentence to another (Halliday and Hasan, 1976). (As already mentioned, such references play an important part in the ADAM system: Rush, Salvador and Zamora, 1971; Pollock and Zamora, 1975.) Thus, the indicated sentences found in our method are not used on their own but serve as kernels around which an abstract can be constructed.

If a text contains several indicators, as many do, a decision has to be made about how to 'compose' the abstract. Either we must select one longish passage to represent the document or we must put together two or more shorter passages. In either case, we need a method for assigning appropriate weights to the various indicators found, since we would expect that some are more helpful and reliable than others. The method used for the present study always aims to produce passages of about the full desired length: the problem of composing abstracts from several separate short passages has not yet been seriously considered. It can well be seen, however, that a two-part abstract, with one part summarising the field of study and the other the outcome of the investigation, could in many cases be very useful.

Our method of abstracting consists of four main stages. First, it is necessary to identify any indicators which occur within a text and to assign appropriate weights to them. Second, it is necessary to aggregate adjacent or earlier sentences to each indicated sentence in order to produce a coherent, self-contained passage. Detection of exophoric references plays an important part here. In addition, the algorithm must test for occurrences of certain non-exophoric features which detract from the extract-worthiness of a sentence: in some cases these 'disindicators' cause the unconditional deletion of a sentence or passage, but in others they merely cause a decrease of its weight. Third, the most highly weighted passage or passages must be selected to form the abstract. And, finally, certain cosmetic adjustments must be made to produce the finished product; these include the deletion or replacement of certain words or structures and transformation of the text to a standard tense and style. Only the first two of these stages have so far been studied in detail; at stage three our present procedure is simply to select the most highly weighted passage to serve as the abstract.

It must be made plain that we do not yet have a working system for doing all this. We have written programs which can successfully identify the indicators and assign weights to them*, but these are not yet in as flexible and effective a form as we would wish. For the aggregation stage, we have drawn up a pencil-and-paper algorithm, together with a set of associated rules for detecting and resolving a fairly wide range of exophoric features. A good deal of work will be needed in order to implement the aggregation process, but there appear to be no serious problems in the way of achieving this.

* I should like to acknowledge the valuable programming work which has been done by S. Y. Gan and F. M. S. Ibrahim.

Our indicator definitions and aggregation rules were mainly drawn up during an examination of all of the articles in Volume 12 of Information Processing and Management, referred to below as 'the sample'. After drafting, the rules were used to produce a number of abstracts, some from the above volume and some from various other journals published in the same year (1976). During this process a number of extensions and modifications were made to the rules, but this was only done in cases where the modifications were simple and obvious ones. Some of the abstracts are shown in Figure 11.6.

11.3 Identification and weighting of indicators

At first glance it might be supposed that the identification of the indicators is a trivial string-matching process, but a little thought reveals that things are not this simple. In the first place, although quite a small number of basic indicator structures can be defined, the words used within the structures show a lot of variation. Thus, it can be seen that the strings 'This article is concerned with ...', 'Our paper deals with ...', 'The present report concerns ...' and 'The following discussion is about ...' are all manifestations of the same basic structure. If every possible indicator is to be represented in full, many thousands of strings will be needed, occupying a lot of space and making the matching process hopelessly slow. Hence, we use a small number of basic structures, known as templates, in which every 'word' is in effect a paradigm -- that is, a representative of a group of words and phrases any of which may appear at this point in the structure. Hence, the templates themselves are used in conjunction with a set of word-groups, the paradigms being the 'key' members of the groups*.

A second problem is that in practice indicators often contain sequences of intervening words which are not part of the indicators themselves. For instance, in 'Our investigations into the indexing of technical documents by computer have shown that . . . ' only the first two and last three words belong to the indicator. Moreover, whereas at some points in an indicator long sequences of intervening words may be allowed, at other points only one or two, or even none at all, may be tolerable. Hence, in our template each paradigm is accompanied by a skip limit, specifying the maximum number of words which may be tested and skipped in searching for a member of that word-group.

A further feature of many indicators is the presence of optional words or word-groups which, nonetheless, form part of the indicators themselves. For instance, if we compare 'The purpose is to ...' and 'The purpose here is to ...', we see that it would be unfortunate to skip over 'here' in the way described above, because whereas the first of these strings is a very weak and uncertain indicator, the second is much more 'positive': the word 'here' is an important part of the indicator, and increases its weight. Such features can be allowed for either by flagging the paradigm concerned as optional or, in more complex cases, by introducing non-linear features (points of branching and convergence), so that a template may contain a number of distinct paths, each representing a particular 'family' of indicators.

It often happens that a particular word attached to a template can occur in several distinct guises. The variations mostly occur in the last few letters of a word, being due mainly to differences of number or tense, and sometimes to spelling variations: an example is the group 'analyse', 'analyze', 'analyses', 'analyzes', etc. To prevent unnecessary multiplication of items, the word-groups may contain word-stems terminated by hyphens (for example, 'result-', 'discuss-', 'analy-') which will match with any text word which starts with the letters up to the hyphen.
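The interplay of paradigms, word-groups, hyphen-terminated stems and skip limits can be sketched as follows. This is a minimal illustration and not the program described in this chapter: the word-groups are much reduced and single-word only, the path is linear (no branching or optional paradigms), and the names `WORD_GROUPS`, `in_group` and `match_path` are assumptions.

```python
from typing import List, Optional, Sequence, Tuple

# Hypothetical, much-reduced word-groups keyed by paradigm.
# An item ending in '-' is a stem and matches any word starting with it.
WORD_GROUPS = {
    "this": ["this", "our"],
    "paper": ["paper", "article", "report", "stud-"],
    "discuss": ["discuss-", "describe", "examine", "analy-"],
}


def in_group(word: str, paradigm: str) -> bool:
    """True if the text word belongs to the paradigm's word-group."""
    w = word.lower()
    for item in WORD_GROUPS[paradigm]:
        if item.endswith("-"):
            if w.startswith(item[:-1]):
                return True
        elif w == item:
            return True
    return False


def match_path(words: Sequence[str], path: List[Tuple[str, int]],
               start: int = 0) -> Optional[int]:
    """Match a linear path of (paradigm, skip_limit) pairs against the text.

    Returns the index just after the last matched word, or None on failure.
    """
    pos = start
    for paradigm, skip_limit in path:
        found = None
        # Up to skip_limit intervening words may be tested and skipped.
        for offset in range(skip_limit + 1):
            i = pos + offset
            if i < len(words) and in_group(words[i], paradigm):
                found = i
                break
        if found is None:
            return None
        pos = found + 1
    return pos


path = [("this", 0), ("paper", 2), ("discuss", 3)]
```

Here `match_path("this paper briefly analyses the problem".split(), path)` skips 'briefly' and matches 'analyses' through the stem 'analy-', whereas a text with three words between 'this' and 'paper' fails because the skip limit of 2 is exceeded.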

For our study seven groups of indicators have been defined, each represented by a separate template. There is nothing magic about the number seven. For one thing, the templates are all branched structures, and some of them could easily be split up into two or more smaller templates. For another, several indicators are known which our scheme does not allow for: some of these might require new separate templates. Some actual representatives of the seven groups are shown in Figure 11.1.

In scanning a text to find possible indicators, the algorithm looks for occurrences of any of a small set of entry words, namely 'the', 'a', 'an', 'in', 'for', 'here', 'this', 'these', 'our', 'my', 'we', 'I' and 'it': if one of these is found, it may be the start of an indicator. While the majority of our indicators start at the beginning of a sentence, 13 out of the 80 instances found in the sample start later on in the sentence**: our program therefore tests every text word to see whether it is an entry word. In practice this may be unnecessary, since in 8 of the 13 special cases the entry word is preceded by a comma, while in another 2 the preceding word is 'for', which could well be included in the actual indicator if the relevant template were modified slightly.

* It should be understood that the paradigms are really just a notational convenience: in a computer implementation the templates are likely to contain direct references or pointers to word-groups held in tables or lists elsewhere in store.

** The total of 13 could be too low, since indicators of this kind can easily be missed when searching is by eye.

Group A (28 instances):
    The procedures introduced in this study [must be] ...
    The results of this study confirm that ...
    The subject field considered in this project was ...

Group B (14 instances):
    In this paper [three related influence] measures are developed ... (7)
    In the following [more detailed] analysis, [documents on ...]
    In this paper we [would like to] examine ...

Group C (13 instances):
    Our studies have indicated that ...
    The above discussion [has] provided [an overview of] ...
    This paper reports the results of an experiment ...

Group D (1 instance):
    This is a report on [our assessment of] ... (6)

Group E (10 instances):
    We [have] described our [series of] studies [to illustrate] ...
    We [have] found that ...
    I [have] tried to show that ...

Group F (6 instances):
    Our objective [in writing these programs] was to assist ... (6)
    Our purpose here is to discuss ... (8)

Group G (8 instances):
    It has been shown that ... (4)
    It is our view that ... (4)

Total number of indicators found: 80
Total number of papers examined: 32
Average incidence of indicators: 2.5 per paper
Number of papers with no indicators: 4

Figure 11.1 Examples of indicators taken from Information Processing and Management, 12 (1976). Words in brackets are not part of the indicator itself; numbers in parentheses are total weights, including minor weights

When an entry word has been found, further words from the text are scanned to see whether they can be matched against a path in one of the templates. A complicating factor here is the possibility of words which ought to be skipped causing a false start along a particular path. For instance, the words 'The survey . . . ' would be matched as the start of a path such as 'The survey presented here shows ...'. However, our sample contains the example 'The survey results of this study were ... ', in which 'survey' merely serves to qualify the noun 'results'. The program must, therefore, be able to backtrack and examine alternatives if the path which it is following proves to be a dead end.
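The need to backtrack can be made concrete with a small depth-first matcher over a branching template. It is a sketch under stated assumptions, not the program described here: the template, the word-groups and the skip limits are invented for the 'survey' example, the entry word 'the' is assumed to have been consumed already, and a path must be matched to its end to count.

```python
from typing import List, Optional, Sequence, Tuple

# Hypothetical word-groups; items ending in '-' are stems.
GROUPS = {
    "paper": ["paper", "article", "survey", "stud-"],
    "result": ["result-", "finding-", "outcome-"],
    "show": ["show-", "reveal-", "confirm-"],
    "this": ["this", "our"],
}

# A node is (paradigm, skip_limit, children); a node without children ends a path.
Node = Tuple[str, int, list]

# Two invented paths that both follow the entry word 'the':
#   the -> paper -> show            ('The survey presented here shows ...')
#   the -> result -> this -> paper  ('The survey results of this study ...')
TEMPLATE: List[Node] = [
    ("paper", 1, [("show", 3, [])]),
    ("result", 1, [("this", 1, [("paper", 0, [])])]),
]


def member(word: str, paradigm: str) -> bool:
    w = word.lower()
    return any(w.startswith(it[:-1]) if it.endswith("-") else w == it
               for it in GROUPS[paradigm])


def match(words: Sequence[str], pos: int, nodes: List[Node]) -> Optional[List[str]]:
    """Depth-first search; returns the matched text words, or None."""
    for paradigm, skip_limit, children in nodes:
        for offset in range(skip_limit + 1):
            i = pos + offset
            if i >= len(words):
                break
            if not member(words[i], paradigm):
                continue
            if not children:
                return [words[i]]
            rest = match(words, i + 1, children)
            if rest is not None:
                return [words[i]] + rest
            # Dead end: fall through to the next offset or the next
            # alternative node, which is the backtracking step.
    return None
```

On 'survey results of this study were' the first path makes a false start at 'survey', finds no member of the 'show' group within its skip limit, and is abandoned; the second path then skips 'survey' and matches 'results', 'this' and 'study'.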

[Figure 11.2: branching template diagram. The layout of the branches is not recoverable from this transcript. Entry word shown: For. Nodes, in the order in which they appear: [2] method +3; [0] this; [2] investigation +3; [3] method +2; [0] the; [3] result +2; [30] is; [1] discussed +2; [2] how +1; [3] show +2; [0] we; [2] how +1; [3] discuss +2; [5] method +2; [2] the?; [3] result +2.]

Figure 11.2 Template for indicators of group B (slightly simplified). Skip limits are shown thus: [3]. Weight increments are shown thus: +2. Underlined words (or word-stems) are paradigms. A query (?) denotes an optional paradigm

We have mentioned the need to assign weights to indicators. Indicators from certain groups (such as group A) seem to be inherently more reliable than others (for example, group E), and therefore deserve higher weight. Moreover, it is not sensible to insist that every path in a template be matched to its end: thus, 'in this paper ...' is properly regarded as an indicator, but should receive a much smaller weight than 'in this paper we outline a method for ...'. The templates, therefore, contain a number of weights placed at strategic points. When an entry word is found in scanning a text, the sentence weight is set to zero. Whenever the matching successfully reaches a weighted point in a template, that weight is added to the sentence weight: hence, by the time a long indicator has been fully matched, several individual weights will have been added together to give a total weight for the indicator.

Figure 11.2 shows the template for indicators of group B (chosen because it is not too complicated), the weights being shown as small integers preceded by a plus sign. The assignment of weights is at present largely a matter of guesswork, but the aim is that a good indicator should receive a weight of about +10. Any piece of text which matches a path in a template far enough to finish up with a weight greater than zero is deemed to contain a useful indicator.
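The accumulation of weights along a partially matched path can be sketched as follows. The path, its skip limits and its increments are illustrative assumptions loosely modelled on a group B indicator; they are not the values of Figure 11.2. The entry word 'in' is assumed to have been consumed before scoring starts.

```python
from typing import List, Sequence, Tuple

# (word-group items, skip limit, weight increment); all values hypothetical.
PATH: List[Tuple[List[str], int, int]] = [
    (["this", "our"], 0, 0),
    (["paper", "article", "report"], 2, 3),
    (["we", "i"], 0, 0),
    (["discuss-", "outline", "describe", "examine"], 3, 2),
    (["method", "technique", "approach"], 5, 2),
]


def member(word: str, items: List[str]) -> bool:
    w = word.lower()
    return any(w.startswith(it[:-1]) if it.endswith("-") else w == it
               for it in items)


def score(words: Sequence[str], path: List[Tuple[List[str], int, int]]) -> int:
    """Sum the increments reached before the first step that fails.

    The sentence weight starts at zero and the weight gathered so far is
    kept when matching stops, so a short match still counts.
    """
    total, pos = 0, 0
    for items, skip_limit, weight in path:
        window = range(pos, min(pos + skip_limit + 1, len(words)))
        hit = next((i for i in window if member(words[i], items)), None)
        if hit is None:
            break
        total += weight
        pos = hit + 1
    return total


def is_indicator(words: Sequence[str]) -> bool:
    """A match counts as a useful indicator if its weight exceeds zero."""
    return score(words, PATH) > 0
```

Under these assumed increments, 'this paper we outline a method for' scores 3 + 2 + 2 = 7, while the bare 'this paper ...' scores only 3, reproducing the contrast drawn above between a full and a truncated indicator.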

In some cases, the various items within aword -g roup are not all equally 'good'. For instance, compare 'The survey is ...'; 'This survey is ... '; 'The present survey is ...'. The first example is very vague: it could refer to any survey which has previously been mentioned in passing. The second is rather more definite, but 'this' is often used exophorically. The third example, however, refers with very little doubt to the article containing it. Words in the word-groups can, therefore, contain minor weights which are added to the indicator's weight when that specific word is matched. Most minor weights are zero, non-zero values being regarded as deviations from the norm. In the above examples the minor weight of'the" is - 2 , the value for 'this' is 0 and for 'the present' the value is +2. In addition, the value of a minor weight occasionally depends on the position of a sentence in the text. For example, 'the following' has a positive minor weight if it occurs near the beginning of a

Page 9: The Automatic Generation of Literature Abstracts: An ...students.lti.cs.cmu.edu/11899/files/p172-paice-cue-phrases.pdf · sentence to another (Halliday and Hasan, 1976). (As already

180 The automatic generation of literature abstracts

paper: paper, article, report, review, analys-, discussion, summar-, outline, presentation, essay, stud-, survey.

investigation: investigation, experiment, research, enquir-, work, stud-, survey.

method: method, means, model, technique, approach, process, program-, procedure, system, measure, formula, function, idea(-2).

result: result, outcome, finding, conclusion.

discuss: discuss-, introduce, present, develop, examine, describe, review, report, outline, consider, investigate, explore, assess-, analy-, synthesi-, stud-, survey, ask, simplif-, deal- with, trace(-2), cover(-2).

show: show, shew, reveal, demonstrate, confirm, prove, indicate, explain, illustrate, illuminate, clarify, make clear, made clear, determine, discover, find, found, suggest(-2), propose(-2), [illegible](-2), [illegible](-2).

this: this, the(-2), the present(+2), our(+2), the following (+2 near start of paper, -2 near end), the above (+2 near end of paper, -2 near start), the preceding (as last), the foregoing (as last).

the: the, a, an, some, our(+2).

we: we, I.

is: is, are, was, were, has been, have been, will be.

how: how, why, what, whether, that.

Figure 11.3 Examples of word-groups (slightly simplified). Minor weights are shown (in parentheses) only when non-zero

text, but negative if it occurs near the end, while the opposite applies to 'the above', 'the foregoing' and 'the preceding'.

In Figure 11.3 a number of the main word-groups are shown; minor weights are included (in parentheses) only when they are non-zero.
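The way minor weights adjust an indicator's weight can be illustrated with a minimal Python sketch. It covers only part of the 'this' word-group; the base weight of 10, the function names and the dictionary layout are assumptions made for illustration, and the negative position-dependent values are taken to be -2.

```python
# Part of the 'this' word-group from Figure 11.3 (hypothetical encoding).
# A plain number is a fixed minor weight; a pair holds the values used
# (near the start of the paper, near the end of the paper).
THIS_GROUP = {
    "this": 0,
    "the": -2,
    "the present": 2,
    "our": 2,
    "the following": (2, -2),   # assumed: -2 near the end
    "the above": (-2, 2),       # assumed: -2 near the start
}


def minor_weight(word, near_start):
    """Return the minor weight of a matched word for a given sentence position."""
    value = THIS_GROUP[word]
    if isinstance(value, tuple):
        return value[0] if near_start else value[1]
    return value


def indicator_score(base_weight, matched_words, near_start=True):
    """Add the minor weights of the matched words to the indicator's weight."""
    return base_weight + sum(minor_weight(w, near_start) for w in matched_words)


print(indicator_score(10, ["the present"]))                    # 'The present survey is ...'
print(indicator_score(10, ["the"]))                            # 'The survey is ...'
print(indicator_score(10, ["the following"], near_start=False))
```

With a common base weight, 'The present survey is ...' therefore outranks 'The survey is ...' by four points, which is the ordering argued for in the text.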

It is probably the case that some of our templates are more complicated than they really need to be, owing to separation into two or more groups of words which might all be lumped into a single large group. This happens when there are certain contexts in which words of one group are appropriate, but not words of a related group. For instance, the indicator 'This paper describes an investigation into ...' becomes senseless if 'paper' and 'investigation' are interchanged: hence, they are kept in separate groups. However, in other situations words of these two groups are freely interchangeable. Since we are not in the business of distinguishing sensible sentences from silly ones, the making of fine distinctions probably serves no real purpose.

The paths within the templates comprise strictly ordered sets of paradigms, so that indicators which differ merely in the arrangement of the paradigms have to be treated as distinct indicators. Thus, 'The aim of this paper is to show ...' is a group A indicator, whereas 'It is the aim of this paper to show ...' belongs to group G. It might seem an attractive idea to take a more 'grammatical' approach to our quest for indicators. This would involve scanning a text for clusters of 'indicator words' (basically, those contained in our present word-groups) and attempting to parse the clusters and transform them into a basic underlying form: thus, the two examples above should both be converted into the same underlying form. In principle, this approach should be more powerful than the use of rigid template structures, although it is an open question whether it would be as fast or could be made to work as well in practice.

11.4 The aggregation rules

The addition of supportive sentences to an indicated sentence to produce a one-part abstract is controlled by three threshold values. The first two are adjustable parameters of the system, and specify the desired minimum and maximum lengths of the abstract. The final length of the abstract is not absolutely bound to fall between these limits, since (a) it may for some reason prove impossible to find enough sentences to reach the lower limit, and (b) in situations where the upper limit is used, aggregation only ceases after the limit has been exceeded. The third value is the target length, and is computed by the system from the other two; it is supposed to ensure that, under 'ideal circumstances', the length of the abstract will be close to the mean of the lower and upper limits.

In discussing the aggregation rules, we use the term extract to refer to a collection of sentences which have been gathered together in the process of constructing an abstract. Initially, each extract consists of a single indicated sentence. We use the term exophora to mean any feature within an extract which refers to a sentence outside: references linking sentences within the extract are said to have been resolved, and the term 'exophora' is no longer applied to them. Any extract which has no unresolved exophora is said to be tidy and may serve as an abstract, provided that its length is acceptable.
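The terms just defined can be pictured as a small data structure. The following Python sketch is illustrative only: the class and method names are hypothetical, length is measured in words, and 'acceptable length' is assumed here to mean lying between the two adjustable limits, which the text itself treats less strictly.

```python
from dataclasses import dataclass, field


@dataclass
class Extract:
    """A collection of sentences gathered while constructing an abstract."""
    sentences: list                                  # sentence strings, in document order
    unresolved: set = field(default_factory=set)     # exophora still pointing outside the extract

    def length(self):
        """Length of the extract in words."""
        return sum(len(s.split()) for s in self.sentences)

    def is_tidy(self):
        """An extract is tidy when it has no unresolved exophora."""
        return not self.unresolved

    def may_serve_as_abstract(self, lower, upper):
        """Assumed acceptance test: tidy, and length within the two limits."""
        return self.is_tidy() and lower <= self.length() <= upper


start = Extract(["This paper describes a method."])
print(start.is_tidy(), start.length(), start.may_serve_as_abstract(80, 150))
```

An extract that begins as a single indicated sentence is usually tidy but far too short, which is why the aggregation rules that follow are needed.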

In building up an extract, the most important task is always to deal with any exophoric references. To this end, sentences are added until either every exophoric feature has been resolved or the upper limit has been exceeded. In the latter case, the algorithm tries to neutralise the remaining exophora by either deleting or modifying them. For instance, in 'The problem of assigning suitable weights to key-terms is discussed later' the word 'later' could simply be deleted. However, if the full extract cannot be neutralised, it is dismantled sentence by sentence, the algorithm trying again at each stage. If the length of the extract falls back to below the target length, and the exophora are of a 'long-range' kind (as discussed in the next section), an additional stratagem is adopted. This consists of taking sentences, up to the maximum length limit, from the very beginning of the document, to see whether these resolve the difficulty. However, if all of these attempts fail, the process is abandoned and another indicator is sought.

If a tidy extract has been obtained, but with a length less than the target length, then the algorithm tries to add further sentences to the extract until it is long enough. However, the addition of each sentence or series of sentences can only succeed if any exophora they contain can be resolved or neutralised.

There are four distinct procedures for adding these further sentences. First, if the end of the extract is not at the end of a paragraph, then succeeding sentences are added (if possible) until either the target length has been exceeded or the end of the paragraph has been reached. Second, if the extract does not start at the beginning of a paragraph, preceding sentences are added, working backwards, until the target length is exceeded or the start of the paragraph is reached. Third, the whole paragraph immediately following the extract is added, provided that it is in the same section and that its addition does not cause the maximum length limit to be exceeded. And, finally, the whole preceding paragraph is added to the extract, subject to the same conditions. Each of these procedures is invoked only if the extract is less than the target length.
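The four procedures can be sketched as follows. This is a simplified illustration rather than the program itself: sentences are assumed to be dictionaries carrying hypothetical 'text', 'para' and 'section' fields, paragraphs are assumed to be numbered consecutively, the extract is a contiguous range of sentence indices, and the check that each added sentence contains no unresolvable exophora is omitted.

```python
def words(sents, lo, hi):
    """Word count of sentences lo..hi inclusive."""
    return sum(len(s["text"].split()) for s in sents[lo:hi + 1])


def paragraph_indices(sents, para, section):
    """Indices of all sentences in the given paragraph of the given section."""
    return [i for i, s in enumerate(sents)
            if s["para"] == para and s["section"] == section]


def expand(sents, lo, hi, target, maximum):
    """Grow the extract sents[lo..hi] towards the target length."""
    # 1. Succeeding sentences, up to the end of the paragraph.
    while (words(sents, lo, hi) < target and hi + 1 < len(sents)
           and sents[hi + 1]["para"] == sents[hi]["para"]):
        hi += 1
    # 2. Preceding sentences, working backwards to the start of the paragraph.
    while (words(sents, lo, hi) < target and lo > 0
           and sents[lo - 1]["para"] == sents[lo]["para"]):
        lo -= 1
    # 3. The whole following paragraph, if in the same section and within the maximum.
    if words(sents, lo, hi) < target:
        nxt = paragraph_indices(sents, sents[hi]["para"] + 1, sents[hi]["section"])
        if nxt and words(sents, lo, nxt[-1]) <= maximum:
            hi = nxt[-1]
    # 4. The whole preceding paragraph, subject to the same conditions.
    if words(sents, lo, hi) < target:
        prv = paragraph_indices(sents, sents[lo]["para"] - 1, sents[lo]["section"])
        if prv and words(sents, prv[0], hi) <= maximum:
            lo = prv[0]
    return lo, hi
```

Note that procedures 1 and 2 add single sentences, whereas 3 and 4 add a neighbouring paragraph only as a whole, which matches the cohesion argument given later in this section.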

If, despite these endeavours, the length of the extract is still less than the lower limit, sentences from the beginning of the document are added until the target length is exceeded. Both of the procedures that take sentences from the start of the document are so framed that it is almost impossible to select a single sentence in this way: to do so, of course, would be likely to produce a disjointed abstract. It may also be noted that there is no need to look for backward references in such sentences: provided that they contain no forward reference, the first n sentences of a document must always be tidy.

These aggregation procedures are intended to reflect the normal cohesive structure of a text. An indicated sentence will be strongly related to other sentences in the same paragraph, and especially to its immediate neighbours. It will have a weaker relationship to adjacent paragraphs; moreover, since the sentences in a neighbouring paragraph will cohere strongly together, it is, in general, inappropriate to select only a part of that paragraph. Of course, there will be cases when part of an adjacent paragraph could be successfully selected, but this can only be decided by a structural and thematic analysis, which is at present beyond us. The relation of an indicated paragraph with sentences in a different section of a document is usually weak and diffuse, so such sentences are excluded from the abstract-building process -- except for the first few sentences of the document, which are deemed to have a relationship to every later sentence.

The rules outlined above undoubtedly need further refinement. For instance, an otherwise tidy sequence of sentences may include one which contains some undesirable feature (see end of Section 11.5, below). It would be useful to delete such a sentence, while leaving the rest of the sequence intact, but as yet we have not given this matter enough attention to know whether such a procedure would be safe.

Our aggregation rules are designed to produce single, tidy extracts whose lengths lie within certain limits. If more than one extract is produced from a document, that with the highest weight is adopted as the abstract. The production of multi-part abstracts would ideally involve a more flexible approach. During the aggregation process there may be several points where a tidy extract has been found. At each such point the extract could be saved, so that eventually there would be several alternative extracts based on the same indicator, differing in length and perhaps also in weight. In a document containing several indicators the several groups of alternative extracts would serve as data for a composition program, which would choose among the various possibilities to construct a multi-part abstract of desired length.

11.5 Exophora

Rules for dealing with exophora have to be concerned first with recognition and second with treatment. Potentially exophoric features are usually quite easy to find, but a decision must then be made as to whether they are actually being used exophorically. For instance, in 'Our investigations have shown this to be true' the word 'this' is plainly exophoric, whereas in 'In this paper we outline a method for assigning weights to key terms, and present results which show that this is beneficial' neither occurrence of 'this' is exophoric. The first of the latter pair is covered by a rule that no word which forms part of an indicator can be exophoric. The second is an internal reference, and for the present a crude rule is used which says that 'this' can only be exophoric if it occurs in the first eight words of a sentence. This kind of rule, with adjustments, is used for many potential exophora: in due course we shall look for a more satisfactory criterion, but, as yet, the table of rules is in an undeveloped and provisional state.
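The two rules for 'this' can be expressed in a few lines of Python. The sketch below is only an illustration: the function name is hypothetical, and the position of the matched indicator is assumed to be supplied by the caller as a pair of word indices (end exclusive).

```python
def exophoric_this(sentence, indicator_span=None, window=8):
    """Return the 0-based word positions at which 'this' is treated as exophoric.

    A 'this' is ignored if it lies inside the matched indicator, or if it
    falls outside the first `window` words of the sentence.
    """
    tokens = [w.strip(".,;:'\"()").lower() for w in sentence.split()]
    hits = []
    for i, tok in enumerate(tokens[:window]):
        if tok != "this":
            continue
        if indicator_span and indicator_span[0] <= i < indicator_span[1]:
            continue                     # part of an indicator: never exophoric
        hits.append(i)
    return hits


print(exophoric_this("Our investigations have shown this to be true"))
print(exophoric_this(
    "In this paper we outline a method for assigning weights to key terms, "
    "and present results which show that this is beneficial",
    indicator_span=(0, 5),               # assumed extent of 'In this paper we outline'
))
```

The first call reports the 'this' in fifth position; the second reports nothing, because one occurrence belongs to the indicator and the other lies beyond the eighth word.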

The treatment of an exophoric feature depends on its direction, its range and its precision. A reference which points forward to a later sentence or passage is said to be cataphoric, while one which refers to something earlier in the text is termed anaphoric (Halliday and Hasan, 1976). Thus, the resolution of a cataphoric feature entails the finding and inclusion of later sentences, while the resolution of anaphoric features involves working backwards. Cataphoric features are certainly the less common of the two, but they often involve quite extensive structures, especially those which announce an enumeration. For example, in order to resolve the sentence 'There are three distinct methods to be considered' it is necessary to find three phrases, sentences or paragraphs, and to include them all. Each separate component, or 'prong', will normally be preceded by a suitable number, letter or word, some examples being as follows:

1. ... 2. ... 3. ...

(i) ... (ii) ... (iii) ...

(a) ... (b) ... (c) ...

First, ... Second, ... Third, ...

Firstly, ... Secondly, ... Finally, ...

However, if the cataphoric sentence had said 'There are several factors to be considered', only the last of these examples would have been neatly delimited: the others (apart from possible layout clues such as end of indentation) would have necessitated searching forward some way to see whether a fourth prong could be discovered.

Moreover, enumerations frequently contain anomalous punctuation. Spurious periods, such as those in the first example above, occur in various guises and situations in almost any text, but more peculiar are examples involving colons, such as

There are several factors to be considered: (i) the distribution of the points about the mean.

(ii) the influence of ...

in which the cataphoric sentence appears to extend as far as 'mean', but no further. The rules for dealing with enumerations are plainly complicated, and have yet to be properly formalised. Of course, some cataphoric features can be resolved within the sentence, and thus cause few problems.
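Since the rules for enumerations are not yet formalised, the following Python sketch shows only the simplest possible treatment: text units following the announcing sentence are collected for as long as each begins with one of the markers listed above. The regular expression and the function name are assumptions, and none of the complications just described (spurious periods, colons, open-ended 'several') is handled.

```python
import re

# Markers that may introduce one 'prong' of an enumeration (simplified).
MARKER = re.compile(
    r"(\d+\.|\([ivx]+\)|\([a-z]\)|first(ly)?\b|second(ly)?\b|third(ly)?\b"
    r"|finally\b|lastly\b)",
    re.IGNORECASE,
)


def collect_prongs(units, announcer_index):
    """Return indices of consecutive marked units after the announcing unit."""
    prongs = []
    for i in range(announcer_index + 1, len(units)):
        if MARKER.match(units[i].lstrip()):
            prongs.append(i)
        else:
            break                        # first unmarked unit ends the enumeration
    return prongs


units = [
    "There are three distinct methods to be considered.",
    "(i) the first method ...",
    "(ii) the second method ...",
    "(iii) the third method ...",
    "We now turn to the results.",
]
print(collect_prongs(units, 0))
```

A fuller treatment would also compare the number of prongs found with the number announced ('three'), and would search further ahead when the announcement is indefinite.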

The range of an exophoric feature concerns the distance of the item to which it refers. Many anaphoric features, such as 'this', 'such', 'but', 'however', 'moreover', 'nonetheless', etc., occurring early in a sentence, almost always refer to the preceding sentence: we refer to them as short-range anaphora. On the other hand, terms such as 'above', 'below', 'earlier' and 'later' are long-range features. Features which could fall into either class we refer to as 'free range'.

Terms such as 'above' and 'later' refer to parts of the document which not only are more or less distant, but also whose locations are only vaguely indicated: in other words, they are of low precision. Short-range exophora, which refer to an adjacent sentence, are of medium precision, whereas a feature which can be linked to a specific word or phrase in a nearby sentence is of high precision. A proper grammatical analysis could probably convert many of our medium-precision features into high-precision ones, but for our purposes this is probably unnecessary, since resolution of an exophoric feature is a matter of deciding what sentence it refers to, not what word.

As has previously been mentioned, if a feature cannot be resolved by finding and including the sentence referred to, an attempt is made to neutralise it. Many short-range anaphora occurring at or near the start of a sentence can be deleted without harm; examples are 'but', 'however', 'furthermore', 'moreover' and 'yet'. Long-range anaphora can often be treated in the same way, or can be replaced (for example, conversion of '... is discussed later' to '... is also discussed'). However, since short-range features add to the cohesion of a text, their deletion is in most cases treated as a last-resort action. And in many cases, of course, neutralisation is not permitted (for example, 'this', 'these' and 'such').

All this means that each exophoric feature has to be represented by a set of properties defining how to decide whether it really is exophoric; how to resolve it; and, if that fails, whether it can be neutralised, and how. Some of these rules are tabulated in Figure 11.4.
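One possible representation of such a set of properties is sketched below in Python. The record layout, the names and the three sample entries are assumptions chosen for illustration; each entry keeps only the latest word position at which the expression counts as exophoric and the fallback action, and the various 'unless' conditions are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExophoraRule:
    expression: str       # word to look for (single words only in this sketch)
    max_position: int     # latest 1-based word position at which it is exophoric
    fallback: str         # action if the previous text unit cannot be included


RULES = [
    ExophoraRule("but", 1, "delete expression"),
    ExophoraRule("however", 8, "delete expression"),
    ExophoraRule("they", 8, "discount sentence"),
]


def treat(sentence, previous_available):
    """Return the action chosen for the first rule that applies to the sentence."""
    words = [w.strip(".,;:").lower() for w in sentence.split()]
    for rule in RULES:
        if rule.expression in words[:rule.max_position]:
            if previous_available:
                return "include previous text unit"     # resolution is preferred
            return rule.fallback                        # otherwise neutralise or reject
    return "no action"


print(treat("However, the results are poor.", previous_available=False))
print(treat("They are listed below.", previous_available=False))
```

Keeping recognition, resolution and neutralisation together in one record makes it straightforward to extend the table as the provisional rules are refined.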

So far we have ignored anaphoric features containing the word 'the', but since these are especially problematic, they deserve special mention. In the example 'We have shown that the model can predict ...' 'the model' is evidently an anaphoric reference to a particular model which has been introduced previously. The range of the reference is free, but this is offset by a high degree of precision. Resolution consists of finding a sentence in which the word 'model' appears in a non-anaphoric setting: for instance, 'This paper presents and investigates a model for ...'. The matching is done permissively, so that differences which only affect the last few letters of a word do not rule out a match (Paice, 1977: Section 4.3.2). If the word following 'the' is one of a range of adjectives such as 'above', 'proposed', 'given' or 'current', then the word for which a link is to be found is the next word. These linkages are looked for as part of the aggregation process described earlier, but, in addition, a special check takes place right at the start. This relies on the idea that any abstract will start with the title of the document concerned, and so any 'the'-reference which can be linked to a word or word-stem in the title is deemed to have been resolved.
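The title check might be sketched in Python as follows. The permissive matching shown here is not the procedure of Paice (1977), whose details are not given in this chapter: it simply assumes that two words match if they share at least four leading letters and differ only in their last three letters or fewer. All names and thresholds are hypothetical.

```python
SKIPPED_ADJECTIVES = {"above", "proposed", "given", "current"}


def permissive_match(a, b, min_prefix=4, max_tail=3):
    """Assumed rule: long common prefix, short differing endings."""
    a, b = a.lower(), b.lower()
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n >= min_prefix and len(a) - n <= max_tail and len(b) - n <= max_tail


def tokens(text):
    return [w.strip(".,;:'\"()").lower() for w in text.split()]


def linked_to_title(sentence, title):
    """Return the head words of 'the ...' phrases that can be linked to the title."""
    words, title_words = tokens(sentence), tokens(title)
    linked = []
    for i, w in enumerate(words[:-1]):
        if w != "the":
            continue
        j = i + 1
        if words[j] in SKIPPED_ADJECTIVES and j + 1 < len(words):
            j += 1                       # link the word after the adjective instead
        if any(permissive_match(words[j], t) for t in title_words):
            linked.append(words[j])
    return linked


print(linked_to_title("We have shown that the proposed model can predict growth",
                      "A model for term weighting"))
```

Any 'the'-reference whose head word is returned would be deemed resolved before aggregation begins; the remainder would be left to the normal search through earlier sentences.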

This all presupposes that we can readily decide which occurrences of 'the' are anaphoric and which are not. Of 200 instances of 'the' in our sample, only 15 were felt to be plainly anaphoric, although a few more were borderline. In the example '... we consider the distribution of key terms ...' 'the' is not anaphoric, because it relates to 'key terms'. In fact, the structure 'the ... of', which appears rarely to be anaphoric, accounts for 57 per cent of our instances of 'the'. In most cases 'the' and 'of' are separated by a single word, but our rules


(i) Anaphoric if found in position 1: 'and', 'but', 'hence', 'so'.

Action: include previous text unit, otherwise delete expression.

(ii) Anaphoric in position 1 if f.by a comma: 'besides', 'more interesting', 'more to the point'.

Action: include previous text unit, otherwise delete expression.

(iii) Anaphoric in positions 1-8, unless 'and' or 'but' occur earlier: 'again', 'also', 'anyhow', 'anyway', 'as a consequence' (unless f.by 'of'), 'as a result' (unless f.by 'of'), 'consequently', 'furthermore', 'however', 'in addition' (unless f.by 'to'), 'in another sense', 'in any case', 'in a similar fashion', 'in doing so', 'in other words', 'in particular', 'in so doing', 'in the same way', 'instead' (unless f.by 'of'), 'moreover', 'more --ly', 'on the other hand', 'similarly', 'therefore', 'thus', 'unfortunately', 'what is more', 'yet' (unless f.by 'more').

Action: include previous text unit, otherwise delete expression.

(iv) Anaphoric in positions 1-8, unless 'than' occurs shortly after: 'better', 'further', 'greater', 'less', 'more', 'other', 'rather'.

Action: include previous text unit, otherwise discount sentence.

(v) Anaphoric in positions 1-8: 'alternatively', 'both' (unless later f.by 'and'), 'an example is', 'examples are', 'examples include', 'for example', 'for instance', 'it', 'similar', 'them', 'they'.

Action: include previous text unit, otherwise discount sentence.

(vi) Anaphoric in positions 1-8, and also if the following word matches a word in the preceding (and not in the same) text unit: 'each' (unless f.by 'of'), 'such' (unless f.by 'as'), 'these', 'this'.

Action: include previous text unit, otherwise discount sentence.

(vii) Padding expressions: 'in effect', 'in fact', 'now' (unless f.by 'that' or 'and').

Action: delete expression.

(viii) Cataphoric or disindicative in positions 1-8: 'not'.

Action: find and include up to a matching 'although', 'but', 'except', 'however', 'instead', 'nonetheless' or 'unless'; otherwise discount sentence.

(ix) Cataphoric in any position: 'on the one hand'.

Action: link forward to 'on the other [hand]'.

(x) Words used in enumerations: 'firstly' etc.; 'second', 'third' etc.; 'finally', 'lastly'.

Action: link into multi-pronged structure, otherwise discount sentence.

Figure 11.4 Provisional rules for common exophoric words and expressions. 'Position P' means expression starts at Pth word of sentence. 'Text unit' means sentence or paragraph, as appropriate. 'f.by' stands for 'followed by'

allow up to six, provided that a significant word does not intervene. A number of other rules can be drawn up, and some of these are shown in Figure 11.5. If 'the' is followed a few words later by an identifiable verb, this is taken as a sign that it is anaphoric. In cases where this applies (7.5 per cent), or where none of the other rules can be found to apply (about 16 per cent) an attempt is made to resolve the feature in the way already described. In some cases, of course, resolution is possible within the same sentence.


The following are treated as null words after 'the', unless followed by <verb>: 'best', 'biggest', 'fastest', 'highest', 'ideal', 'largest', 'least', 'longest', 'lowest', 'maxim-[al/um]', 'minim-[al/um]', 'most [1:1]', 'optim-[al/um]', 'poorest', 'shortest', 'worst'.

The following are anaphoric and free-range: 'the [1:5] <verb>'.

The following are anaphoric and short-range: 'the consequent', 'the former', 'the latter', 'the resultant', 'the resulting', 'the conclusion/outcome/result/etc.' not f.by 'of'.

The following are not normally exophoric: 'the [1:5] of', 'the [1:5] associated with', 'the [1:5] about', 'the [1:5] concerned with', 'the [1:5] concerns', 'the [1:5] used', 'the [1:5] which', where the word before 'which' may be a preposition, 'the problem/question how/when/whether/why/regarding/etc.', 'the reason why', 'the connection/relation- between', 'the same [0:4] as', 'the same', 'the [0:2] belief/certainty/fact/feeling/idea/impression/etc. that', 'the effect-/influence on', 'the interval/range' unless f.by <verb>, 'the year 19..', 'the period/years 19..', 'the period/years between..', 'the reader', 'the [1:5] user', 'the [0:3] literature', 'the [0:3] media', 'the [0:3] calculus', 'the enquiry/experiment/investigation/research/study/etc.';

'at the best', 'at the worst', 'at the most', 'at the least', 'at the moment', 'at the present time', 'at the same time';

'call- the', 'know- as the', 'refer- to as the', 'term- the';

'the [0:2] <Title>', where <Title> is a sequence of at least two words starting with capital letters (which may include the word 'of'), or an abbreviation or acronym containing more capital than small letters (e.g. 'M[illegible]').

Comment: forms such as 'the A and the B which' or 'the X, the Y and the Z of' are ambiguous in structure (is 'the A' anaphoric, or does it pair up with 'which'?). In such cases, an attempt is made to resolve all items except the last, and then two is deducted from the sentence weight for each unresolved item.

Figure 11.5 Some rules for diagnosing expressions starting with 'the'. [1:5] represents an intervening sequence of at least one and not more than five non-significant words. 'f.by' stands for 'followed by'. <verb> includes 'is', 'are', 'will', 'was', 'were', 'may', 'might', 'could', 'would', 'should', 'ought', 'must', 'can', 'has', 'have'

While in most cases 'the' can be successfully handled by the above rules, in a significant minority of cases (perhaps one in four) problems remain. These are mainly due (a) to an instance being wrongly assumed to be anaphoric, because it is not covered by the rules in Figure 11.5, and (b) to a linkage which cannot be found. Consider the example:

This paper is concerned with the assignment of key-terms to documents using statistical criteria. The effectiveness of the indexing is compared with ...

Here 'the indexing' cannot be resolved (we will suppose) because the word-stem 'index-' cannot be found earlier in the document or in its title. In fact, it should be resolved by linking to 'assignment of key terms' in the previous sentence. Some such examples involve synonyms, while in others the terms are just closely associated. Clearly, the introduction of a certain term in one sentence predisposes the reader's mind to accept a number of related terms in later sentences. The incorporation of a thesaurus into our system would go a long way towards removing such problems.

Finally, we must mention the occurrence of features whose presence in a sentence makes that sentence less likely to be an acceptable part of an abstract*. While they are not exophoric (at least in the narrow sense of the term), it will probably be best to allow for their identification and treatment within the table of exophora. In some cases these features can be neutralised; where they cannot, they may either categorically rule out the sentence or else cause a decrease in the weight of the extract.

One group of features of this kind consists of references to diagrams, tables and appendices: for instance, the sentence 'The results of the survey are shown in Table II' would not be accepted into an extract. Such features are easy to detect, but the rules must be framed so as not to dismiss innocent occurrences of words such as 'figure' or 'table': 'Incorporation of this procedure should reduce the error rate to an acceptable figure.' The other common and clearly defined group of such features consists of references to other documents. A sentence such as 'A method of this kind has been proposed by Johnson and Forster [99]' is concerned more with the content of a previous document than with the document containing it. On the other hand, a sentence containing an unascribed reference, such as 'The system investigated makes use of a multi-level classified file [99]', is usually acceptable, although the '[99]' should, of course, be deleted. Identification of these features requires definition of formats for authors' names ('Johnson', 'Johnson and Forster', 'Johnson, Smith and Forster', 'Forster et al.'), and the settings in which they may occur ('... by Johnson [100]', '... by Johnson in 1972 [100]', '... Johnson's [100] investigation ...', 'Johnson [99, 100] has ...'). In some occurrences authors' names are not stated, or are distant from the reference itself (for example, '... discussed in [99]'). Another problem with references, of course, is the variety of conventions which are used for representing them.
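These two groups of features lend themselves to pattern matching. The Python sketch below is an illustration under several assumptions: only the bracketed-number reference convention is handled, an author's name is taken to be a single capitalised word, a reference to a table, figure or appendix is recognised only when the capitalised word is followed by a number, and an offending sentence is rejected outright rather than merely reduced in weight. The pattern and function names are hypothetical.

```python
import re

REF = r"\[\d+(?:,\s*\d+)*\]"                          # [99] or [99, 100]
NAME = r"[A-Z][a-z]+"
AUTHORS = rf"{NAME}(?:(?:, {NAME})* and {NAME}| et al\.)?"

# An author's name (optionally 'in 1972' or a possessive) directly before a reference.
ASCRIBED = re.compile(rf"\b{AUTHORS}(?: in \d{{4}})?(?:'s)? {REF}")

# 'Table II', 'Figure 3', 'Appendix 2': capitalised word followed by a number.
EXHIBIT = re.compile(r"\b(?:Table|Figure|Fig\.|Appendix)\s+(?:[IVXLC]+|\d+)\b")


def screen(sentence):
    """Return None if the sentence is rejected, else the sentence with references removed."""
    if EXHIBIT.search(sentence) or ASCRIBED.search(sentence):
        return None
    return re.sub(r"\s*" + REF, "", sentence)


print(screen("The results of the survey are shown in Table II."))
print(screen("A method of this kind has been proposed by Johnson and Forster [99]."))
print(screen("The system investigated makes use of a multi-level classified file [99]."))
print(screen("Incorporation of this procedure should reduce the error rate to an acceptable figure."))
```

The first two sentences are rejected, the third is kept with its '[99]' removed, and the fourth passes unchanged because 'figure' is neither capitalised nor followed by a number. Cases such as '... discussed in [99]', where no author is named, are treated here as unascribed.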

11.6 Results

A number of abstracts have been produced by hand from various journals (all for 1976) chosen more or less at random from Lancaster University Library, as well as from Information Processing and Management for that year. The two most highly weighted abstracts from the first two papers in each journal are shown in Figure 11.6†. The minimum and maximum length thresholds were 80 and 150 words, which gave a target threshold of 105 words.

In producing these abstracts it was quickly realised that it would be absurd not to make adjustments to the rules where appropriate. However, it must be emphatically stated that this was only done when the new rule was obvious and appeared to be safe. It is clear that progressive refinement of the rules will need to continue for some time.

* These, of course, correspond to the negative terms in Edmundson's (1969) cue method and in the WCL of the ADAM system (Rush, Salvador and Zamora, 1971; Pollock and Zamora, 1975).

† Only one in cases where two could not be found.


'An information processing constraints approach to the conjunction of macroeconomic and macropolitical theory', by W.E. McAlpine. Inf. Proc. & Management, 12, 1-17 (1976).

(1) The purpose of this article is to introduce a means of conjoining macropolitical with macroeconomic theory through the use of information processing concepts. It is an introduction and therefore is intentionally limited on all fronts. 'Macropolitical' here refers to the general, 'intellectual' governance mechanisms of a society, such as 'culture' and 'ideology'. Just as macroeconomics takes no specific focus on the theory of the firm, so macropolitics as discussed here will require no focus on specific institutions of governance.

79 words. Weight: 10 (indicator coded Agpdm).

(2) I suggest that the proper conclusion to be drawn from this presentation is that a strong conjunction of macroeconomic and macropolitical theory is both possible and probably profitable. This can be pursued by an examination of the effects of information processing constraints. Limited benefit can be expected from the Keynesian model described here. It is probably more illustrative of the fact that conjunction is possible than a profound means for pursuing the enquiry. One might like to know for example what causes a curve to shift.

87 words. Weights: +5 (indicator coded Es).
-2 (minor weight for 'suggest').
+7 (indicator coded Aeh).
Total: 10.

'Compact grammaer [sic] for algorithmic Wiswesser notation using Morgan name', by S. Krishnan & E.V. Krishnamurthy. Inf. Proc. & Management, 12, 19-34 (1976).

In an earlier paper a new chemical notation system Algorithmic Wiswesser Notation (ALWIN) based on the Wiswesser Line Notation (WLN) was described. Although ALWIN has several advantages over WLN, the encodi[illegible] [sic] procedure still involves many complex rules (though fewer than in WLN) for obtaining a unique notation. This introduces complexity in automatic encoding. In this paper, we simplify as well as eliminate [illegible] of the precedence rules in ALWIN by assigning unique numeric labels to the nodes of the given chemical graph using Morgan algorithm.

85 words. Weight: 8 (indicator coded Bpd).

'Fine structure of fluid segregation organelles of Paramecium contractile vacuoles', by J.A. McKanna. J. Ultrastructure Res., 54, 1-10 (1976).

Contractile vacuoles (CV's) are osmoregulatory organelles found in many protozoans and some other organisms. By extracting a dilute solution from the cytoplasmic colloid and expelling it from the cell, these membranous structures contribute to the maintenance of cellular water/electrolyte balance. The initial step in this process, termed fluid segregation (FS) by Pappas and Brandt, is carried out by systems of vesicles or tubules that communicate with the main vacuole.

The present paper reports the fine structure of the FS apparatus in Paramecium aurelia and discusses the functional implications of the

Figure 11.6 Sample abstracts based on indicated sentences. Upper length threshold, 150 words; lower length threshold, 80 words; target length threshold, 105 words


membrane c o a t i n l i g h t o f c o m p a r a t i v e u l t r a s t r u c t u r a l d a t a . 97 words. Weights : + 4 ( i nd i ca to r coded Cpd).

- 3 ( d i s L m d i c a t o r ' b y P a p p a s and B r a n d t ' ) . T o t a l : 1 . Remark : a s u b o r d i n a t e p h r a s e o f t h e t y p e ' t e r m e d X b y Y' s h o u l d

p r o b a b l y n o t be r e g a r d e d a s a d i s ± n d i c a t o r a t a l l .

'Ultrastructural morphometric analysis of the unstimulated adrenal cortex of rats', by H. P. Rohr et al. J. Ultrastruct. Res., 54, 11-21 (1976).

Subsequent to the initial ultrastructural studies on the adrenal cortex of the rat, numerous descriptive reports from various animal species have been published. As a consequence, and probably as an expression of the increasing number of quantitative biochemical findings, extensive data reporting on ultrastructural changes in the three zones exposed to a wide variety of functional stimuli are available.

The purpose of this paper is to present a stereological model of the zones, the cell, and the cell compartments of the adrenal cortex. This model is intended to serve as a baseline for further experimentation under defined metabolic conditions. Currently, this type of study appears to be all the more promising as numerous biochemical data on the synthesis of steroid hormones and its inhibition are available.

127 words. Weights : +10 (indicator coded Agpdm), -4 (two unresolved items in ambiguous 'the'-structure). Total : 6.

'A resolution of the regionalization problem and its implications for political geography and social justice', by S. Gale. Geogr. Annaler, 58B, 1-16 (1976).

(1) In this paper, I would like to renew at least a part of the debate concerning the nature of regions in terms of a discussion of the models used for identifying and describing regions. The model will be based on rules for the kinds of non-boolean (i.e., fuzzy) sets which (I shall argue) can represent the way people perceive 'participating in' or 'belonging to' a region. The resultant model will provide a basis for subjective characterisations of regions; subjective, that is, in the sense of potentially representing responses of individuals to questions relating to belonging and participation.

93 words. Weight : 6 (indicator coded Bp).

(2) My intention is to demonstrate that if the usual conceptions of 'region' and 'boundary' do not provide an adequate (or interesting) picture of the nature of areal units, and if such conceptualizations can lead to widespread social injustice (say, due to the effects of misclassification or aggregation) then by amending our language it may be possible to obtain a more reasonable descriptive and prescriptive model. This emendation is of particular importance to the foundations of the discipline since it is specifically linked to the ways in which we use geographic concepts. Moreover, my argument reflects the very central role which categorial thinking has come to play in the development of concepts, explanations, and prescriptions in geography.

105 words. Weight : 7 (indicator coded Fgs).

Figure 11.6 (continued)

The examples produced appear on the whole to provide helpful indicative abstracts, although obviously some cosmetic rearrangement would be desirable. Looking for indicated sentences by eye, and working through the rather complicated aggregation rules and tables, means that great care is necessary. Impartial application of the rules by computer will properly reveal the successes and failures of the method, but as yet there is much programming


'Conceptual connection and causal relation', by Max Deutscher. Australasian Jnl. of Philosophy, 54, 5-13 (1976).

Whether an intention is at least part of the cause of an action done upon that intention, is an interesting issue in itself. It is, however, not at issue here. I refer to this controversy again because it has made people focus attention on the question whether the relation between cause and effect is necessary or whether it is contingent.

My intention is to ask 'Can the causal relation be conceptual? Can it be necessary? Can it be contingent?' I shall try out various plausible meanings for 'conceptual', 'necessary' and 'contingent', and see what happens.

96 words. Weight : 6 (indicator coded Fgd).

'Eternal sentences', by S.H.Voss & C.Sayward. Australasian Jnl. of Philosophy, 54, 14-23 (1976).

Eternity in all its forms is traditionally the object of profound vision rather than close scrutiny. In the face of this tradition, we argue here that two apparently attractive conceptions of an eternal sentence only cloud the view, and we present an alternative conception which we think allows greater insight into the nature of semantic concepts.

56 words. Weights : +8 (indicator coded Esh), +3 (indicator coded Ea). Total : 11.

'Three steps towards robust regression', by H.Wainer & D.Thissen. Psychometrika, 41, 9-34 (1976).

(1) Many models which are used in the analysis of behavioral data assume that errors of measurement are normally distributed; in multivariate situations, this assumption becomes one of multivariate normality. However, with many sets of behavioral data, there is reason to believe that the assumption of normal errors is not justified.

In this paper, we are concerned with the area of robust regression. In particular we are concerned primarily with the case of robustness against distributions which are longer-tailed than Gaussian. Thus, whenever we use the term 'robust' we mean against long-tailed distributions. The other problem, of short-tailed distributions, is a short tale. We shall make some brief comments on it at the end of this paper.

119 words. Weights : +6 (indicator coded Bpw), +4 (indicator coded Ec). Total : 10.

(2) In this paper we have explored a variety of schemes for estimating coefficients of linear functions with respect to their ability to yield reasonable answers when the form of the data distribution ranges broadly. Our strongest finding is that the most commonly applied methodology, least squares estimators (LSE), are the worst performers in general. Only in the Pollyannaish case of multivariate normality were LSE the estimators of choice, and with even modest deviations from this ideal LSE's fall from grace was precipitous.

81 words. Weight : +8 (indicator coded Bpwd).

'Interpretation of canonical analysis: rotated vs. unrotated solutions', by N.Cliff & D.J.Krus. Psychometrika, 41, 35-42 (1976). Remark: paper contains one good indicated paragraph, but this is not accepted due to presence of an unresolved 'the-reference', which is in fact not anaphoric ('for the behavioral researcher').

Figure 11.6 (continued)

work to be done. The results obtained so far, however, do make it look as though the programming effort will be well worth while.
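As a small illustration of the arithmetic behind the totals reported in Figure 11.6, the following Python sketch sums the signed contributions listed for four of the sample abstracts. It is only a sketch under stated assumptions: the dictionary keys and the function name `total_weight` are hypothetical labels introduced here, and the aggregation rules and tables that assign the individual weights are not reproduced.

```python
# Signed contributions copied from the sample abstracts in Figure 11.6.
# Positive values are indicator weights; negative values are penalties
# (a disindicator, or unresolved items in an ambiguous 'the'-structure).
# The labels used as keys are hypothetical names chosen for this sketch.
FIGURE_WEIGHTS = {
    "contractile vacuoles (Cpd)": [+4, -3],
    "Rohr et al. (Agpdm)": [+10, -4],
    "Voss & Sayward (Esh, Ea)": [+8, +3],
    "Wainer & Thissen, abstract 1 (Bpw, Ec)": [+6, +4],
}


def total_weight(contributions):
    """Return the net weight of a candidate abstract as a plain sum."""
    return sum(contributions)


# Net weight for each sample abstract, keyed by the same labels.
totals = {label: total_weight(parts) for label, parts in FIGURE_WEIGHTS.items()}

for label, value in totals.items():
    print(f"{label}: total {value}")
```

The computed totals agree with the values printed in the figure (1, 6, 11 and 10 respectively).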

The whole method relies on the finding of indicators within a document. Quite a few indicators have been found which have not so far been incorporated into our rules, while a number of less overt indicators may be discovered by closer study of document texts. Nonetheless, there will always be some documents in which no indicator can be found at all. In such cases other methods will have to be used to identify significant sentences. Once the sentences have been found, however, it should be possible to construct an abstract using the same aggregation rules as for indicated sentences.

References

BAXENDALE, P. B. (1958). 'Machine-made index for technical literature - an experiment', IBM Journal of Research and Development, 2, 354-361

EARL, L. L. (1970). 'Experiments in automatic extracting and indexing', Information Storage and Retrieval, 6, 313-334

EDMUNDSON, H. P. (1969). 'New methods in automatic extracting', Journal of the Association for Computing Machinery, 16, 264-285

EDMUNDSON, H. P. and WYLLYS, R. E. (1961). 'Automatic abstracting and indexing: survey and recommendations', Communications of the Association for Computing Machinery, 4, 226-235

HALLIDAY, M. A. K. and HASAN, R. (1976). Cohesion in English, Longman, London

JANOS, J. (1979). 'Theory of functional sentence perspective and its application for the purposes of automatic abstracting', Information Processing and Management, 15, 19-25

KARASEV, S. A. (1978). 'Building an auto-abstracting system with text analysis and synthesis', Nauchno-Tekhniceskaya Informazia, Ser. 2, No. 5, 26-29 (in Russian)

LUHN, H. P. (1958). 'The automatic creation of literature abstracts', IBM Journal of Research and Development, 2, 159-165

MATHIS, B., RUSH, J. E. and YOUNG, C. E. (1973). 'Improvement of automatic abstracts by the use of structural analysis', Journal of the American Society for Information Science, 24, 101-109

PAICE, C. D. (1977). Information Retrieval and the Computer, Chapter 5, Macdonald and Jane's, London

POLLOCK, J. J. and ZAMORA, A. (1975). 'Automatic abstracting research at Chemical Abstracts Service', Journal of Chemical Information and Computer Sciences, 15, 226-233

RUSH, J. E., SALVADOR, R. and ZAMORA, A. (1971). 'Automatic abstracting and indexing. II. Production of indicative abstracts by application of contextual inference and syntactic coherence criteria', Journal of the American Society for Information Science, 22, 260-274

SKOROKHOD'KO, E. F. (1972). 'Adaptive method of automatic abstracting and indexing', in IFIP Congress 71, Ljubljana, Jugoslavia, pp. 1179-1182, North-Holland, Amsterdam

TAYLOR, S. L. (1977). 'Experiments with an automatic abstracting system', Proceedings of the ASIS Annual Meeting, Chicago, 1977, Vol. 14, Information Management in the 1980s