the effect of anaphora and ellipsis resolution on ... · the effect of anaphora and ellipsis...

35
The Effect of Anaphora and Ellipsis Resolution on Proximity Searching in a Text Database Ari Pirkola and Kalervo J‰ rvelin Dept. of Information Studies University of Tampere P.O.Box 607 FIN-33101 TAMPERE, Finland Corresponding author: Kalervo J‰ rvelin, Dept. of Information Studies, University of Tampere P.O.Box 607 FIN-33101 TAMPERE, Finland phone: + 358-31-2156 782 email: [email protected] Running title: Anaphor and Ellipsis Resolution in Proximity Searching

Upload: lycong

Post on 15-Apr-2018

234 views

Category:

Documents


2 download

TRANSCRIPT

The Effect of Anaphora and Ellipsis Resolution on Proximity Searching in a Text Database

Ari Pirkola and Kalervo J‰rvelinDept. of Information StudiesUniversity of TampereP.O.Box 607FIN-33101 TAMPERE, Finland

Corresponding author:

Kalervo J‰rvelin,Dept. of Information Studies,University of TampereP.O.Box 607FIN-33101 TAMPERE, Finland

phone: + 358-31-2156 782email: [email protected]

Running title:

Anaphor and Ellipsis Resolution in Proximity Searching

______________________Reprints may be requested from Kalervo J‰rvelin, Dept. of Information Studies, University ofTampere, P.O.Box 607, FIN-33101 TAMPERE, Finland

AbstractSo far, methods for ellipsis and anaphor resolution have been developed and the effects of anaphorresolution have been analyzed in the context of statistical information retrieval (IR) of scientificabstracts. No significant improvement has been observed. In this study, the effects of ellipsis andanaphor resolution on proximity searching in a full text database are analyzed. Anaphora and ellipsesare classified on the basis of the type of their correlates / antecedents rather than, as traditional, on thebasis of their own linguistic type. The classification differentiates proper names and common nouns ofbasic words, compound words, and phrases. The study was carried out in a newspaper article databasecontaining 55.000 full text articles. A set of 154 keyword pairs in different categories was created.Human resolution of keyword ellipses and anaphora was performed to identify sentences andparagraphs which would match proximity searches after resolution. The findings indicate that ellipsisand anaphor resolution is most relevant for proper name phrases and only marginal in the otherkeyword categories. Therefore the recall effect of restricted resolution of proper name phrases only wasanalyzed for keyword pairs containing at least one proper name phrase. Our findings indicate a recallincrease of 38.2 % in sentence searches, and 28.8 % in paragraph searches when proper name ellipseswere resolved. The recall increase was 17.6 % in sentence searches, and 10.8 % in paragraph searcheswhen proper name anaphora were resolved. This suggests that some simple and computationallyjustifiable resolution method might be developed only for proper name phrases to support keyword-based full-text IR. Elements of such a method are discussed.

1. INTRODUCTIONThe evolution of information technology has made the electronic storage of very large text collectionspossible. Such collections, text databases (or full-text databases), often contain in a computer readableform the whole text, or at least large excerpts, of documents collected for some purpose, e.g.,newspaper article databases, other news databases, electronic mail databases, scholarly journal articledatabases, etc. Information retrieval (IR) in operational text databases is based to a great degree on thesame methods as retrieval in bibliographic databases with differences in the importance of someoperators (e.g., the proximity operators), in the availability of full text and frequent non-availability ofcontrolled vocabularies for searching. The relative effectiveness as well as associated potentials andproblems of text databases and full searching have been discussed in the literature (e.g., Blair, 1990;Blair & Maron, 1985; Kristensen, 1993; McKinin & al. 1991; Salton, 1989; Tenopir, 1985; Tenopir &Ro, 1990).Elliptic and anaphoric references in natural language full text constitute an important problem area ininformation retrieval. In either case, original keywords used for searching may occur in text alsoindirectly through the references and this may affect adversely the retrieval of relevant documents.Anaphora are textual elements which refer to earlier text elements (called correlates) and share themeaning of the correlates (Halliday & Hasan, 1976). For example words in italics in the sentences:"Kal does not have a dog but he would like to." and "Ari and Kal both own a car. The latter has aToyota." are anaphora. Ellipses are incomplete constructions derived by the omission of one or morewords that are obviously understood but must be supplied to make a construction grammatically

complete (Webster's, 1981, p. 737). For example (omissions indicated in parentheses), "Kal does nothave a dog but Ari does (have a dog)." and "Amnesty International published the annual human rightsreport yesterday. Amnesty's (International's) (human rights) report indicates ..." are elliptic sentences.Anaphora may be classified in several ways (e.g., Halliday & Hasan, 1976; Fraurud, 1988; Hirst,1981). Liddy et al. (1987) presented perhaps the first broad and coherent classification of anaphorawith classes (1) central pronouns (personal, possessive, reflexive), (2) nominal demonstratives, (3)relative pronouns, (4) nominal substitutes (e.g., 'former'), (5) the pro-verb 'do', (6) indefinite pronouns,(7) pro-adjectives (e.g., identical), (8) pro-adverbials (e.g., similarly), (9) subject references, and (10)the definite article 'the'. Ellipses are classified traditionally according to the structural relationships ofthe elliptical form and its antecedent into replacement ellipses, elaboration ellipses and repetitionellipses (Johnson, 1994). However, in the context of information seeking dialogues, pragmatic analysisis often needed in the interpretation of ellipses and thus they can be classified in novel, more helpfulways (Johnson, 1994). The categories of ellipses and anaphora occurring in our data are given inAppendix I.In the resolution of ellipses and anaphora their correct correlates are identified in the text. Theirresolution is necessary if the meaning of text is to be understood correctly. Thus resolution is needed innatural language understanding (NLU) systems (Grishman, 1986; Hirst, 1981; Johnson, 1994; Liddy,1990) as well as IR (Bonzi & Liddy, 1989; Liddy & al., 1987; Liddy, 1990). Liddy et al. (1987) pointout, that in IR-related systems, anaphoric resolution is needed in all systems based on semantic textrepresentations, question-answering systems, systems analyzing the questions posed by informationseekers (Johnson, 1994), automatic abstraction systems (Paice, 1990), and statistical IR. Ellipsisresolution is needed for the same tasks. In this paper we show that resolution would benefit alsotraditional Boolean retrieval.Many resolution methods have been developed for ellipses (Johnson, 1994; Winograd, 1983) andanaphora (Carter, 1987; Fraurud, 1988; Grishman, 1986; Hirst, 1981; Hobbs, 1978; Kuno, 1987;Lappin & Leass, 1994). Comprehensive and correct resolution of ellipses and anaphora is known asdifficult and requiring both morphological, syntactic, semantic (including discourse level) as well aspragmatic processing (Akmajian & al., 1990; Hirst, 1981; Smeaton, 1991). Lappin and Leass (1994)provide a recent overview of anaphoric resolution. It is not likely that a comprehensive resolutionsystem for unrestricted natural language text, such as newspaper text, could be built in the near future.Much of the literature on resolution is theoretical and we do not know of any empirical evaluations ofellipsis resolution. Fraurud (1988), Hobbs (1978), Liddy et. al. (1987) and Lappin and Leass (1994)have evaluated anaphor resolution empirically. Fraurud's manual method performed 91 % correct, onthe average, but the performance varied depending on text type (99 % correct in stories, 75 % inarticles on innovations in technology). Hobbs' manual method gave correct results in 88 % - 92 % ofanaphora depending on the composition of the method. His data consisted of Newsweek articles, anovel, and a monograph in archeology. Liddy et al. (1987) developed a manual resolution method andfound that the average frequency of anaphora per scientific abstract in the PsycINFO and INSPECdatabases was 4.49 and 2.86, respectively. Their resolution method achieved 83 % - 99 % correctnessdepending on anaphor class (see above) in this corpus. Lappin and Leass (1994) developed anautomatic resolution method for anaphora, in particular third person pronouns, reflexives andreciprocals. It is based on a syntactic and morphological filter and a salience weighting mechanism andachieved in a computer manual corpus 86% correct resolution. In general, the higher the requirementson quality are, the more knowledge must be harnessed and the higher are the computational overheads.Bonzi and Liddy (1989) found out that anaphora refer to central concepts of scientific abstracts andtheir resolution increases the term weights of their correlates in statistical IR. However, anaphorresolution does not help to differentiate between relevant and irrelevant documents because query

terms had higher weights than other words already prior to resolution. Therefore the relevance ofanaphora (or their resolution) to IR remains an open question (Liddy, 1990).In this study we approach anaphora and ellipsis resolution in a very different way. We do not develop anovel resolution method but, instead, analyze the effects a correct resolution method would have onproximity searching in a full text database. We do not know of earlier studies on this problem.Moreover, we classify anaphora and ellipsis on the basis of the type of their correlates / antecedents(the search keywords !) rather than on the basis of their own type. This classification is based ondifferentiating between proper names and common nouns on one hand, and between basic words,compound words and phrases on the other.The problems studied were (1) what are the frequencies of ellipses and anaphora in newspaper articlesfor keywords of different categories, (2) which share of documents not found in proximity searching(on sentences or paragraphs) for keywords in various categories actually contain sentences orparagraphs which would match the searches if resolution was performed, and (3) how great an increasein recall is achievable through resolution in each keyword category. We also wanted to find outwhether some restricted and computationally less demanding form of resolution might be sufficient inthe most relevant keyword categories.Our test environment was a full text newspaper article database containing some 55.000 Finnishnewspaper articles (127 MB of text) operated under the TOPIC information retrieval system (Chong,1989; TOPIC, 1990) allowing ordinary Boolean and proximity searching. A set of 154 keyword pairscontaining proper names and common nouns in the categories of basic (non-compound) words,compound words, and (noun) phrases was created. The sets of documents retrieved in proximitysearching without resolution and those retrieved only by Boolean conjunction of the keywords wereanalyzed. Human resolution of keyword ellipses and anaphora was performed in the latter sets toidentify sentences and paragraphs which would match in proximity searching after resolution. Exceptfor some exclusions explained in Section 2.2, documents containing the original keywords or theirellipses or anaphora in the required passages (sentences or paragraphs) were counted as relevant. Nogenuine information needs nor traditional human relevance assessment was involved. Recall increasewas calculated by counting the proportion of relevant documents retrievable only after resolution tothose retrievable before resolution. We argue that this recall increase indicates recall increase ingenuine retrieval situations, based on real information needs. This is based on our assumption thatdocuments containing matching passages (sentences or paragraphs) after resolution are relevant with anequal likelihood as documents containing matching passages before resolution. This assumption isjustified in Section 4 (see Recall in genuine IR situations).The methods and data of this research are described in detail in Section 2. The findings are reported inSection 3 and discussed in Section 4. Section 5 gives the conclusions.2. METHODS AND DATA2.1. The test database and retrieval systemThe test environment was a text database containing Finnish newspaper articles operated under theTOPIC retrieval system. The database contained about 55.000 articles published in three Finnishnewspapers in 1988-1992. One subset of articles (some 24.500) was a sample of articles on foreignaffairs published by Aamulehti (Tampere, Finland), another (some 14.000) a sample of all articlespublished by Keskisuomalainen (Jyv‰skyl‰, Finland), and the third (some 17.000) a sample of allarticles published by Kauppalehti (Helsinki, Finland). The first two are leading general newspapers intheir provinces and the last one the leading national newspaper on economics. The average articlelength was 202 words (standard deviation 155 words). Typical paragraphs were two or three sentencesin length. The whole database contained some 11 million words which required 127 MB of storage.The database is large enough for providing realistic findings (cf. Blair, 1986).

The database dictionary contained all word occurrences in their inflected form. Therefore there were,theoretically, 2.000 possible forms for each noun in the database dictionary. In practice, there oftenwere tens of inflected forms. (N.B. At the time of data collection we did not have a morphologyanalysis program which would have allowed a basic word form dictionary and searching in TOPIC).These language-specific features were masked in searching and document analysis. Thus our results arenot language bound in this respect.The retrieval system TOPIC allows ordinary Boolean searching ('and', 'or', 'not') and proximitysearching by operators 'sentence', 'paragraph' and 'phrase' as well as string matching by several types ofwild cards. In sentence proximity searching, TOPIC interpreted full stops '.' as sentence delimiters.Thus for TOPIC, the preceeding grammatical sentence contains two technical sentences. The findingsreported in this study are based on grammatical sentences, not on TOPIC's technical sentences.2.2. The test keywords and searchesIn order to analyze the dependence of resolution effects on keyword types, it was necessary to collect arepresentative sample of keywords in each keyword category and to formulate searches based on thesekeywords. Many real searches are based on simple conjunctions of two keywords. Such a brief-searchstrategy is relevant in high-precision proximity searching in large text databases, especially by end-users. In addition, keyword pairs provide the simplest possibility for analyzing keyword type effects.Therefore keyword pairs were used for searching. All pairs were generated by the researchers so that itwas easy to imagine a real information need which could be represented for searching by the keywordsin question. In each pair ('a', 'b'), the first word ('a'), called henceforth the focus word, determined thekeyword category of the pair. The keyword 'b' is called the second keyword. The categories, theirabbreviations used and the number of keyword pairs in each category were as follows:ï Basic word proper names (basic word/p. name): 22 keyword pairsï Basic word common nouns (basic word/com. noun): 27 keyword pairsï Compound word proper names (comp. word/p. name): 21 keyword pairsï Compound word common nouns (comp. word/com. noun): 27 keyword pairsï Proper name phrases (phrase/p. name): 31 keyword pairsï Common noun phrases (phrase/com. noun): 26 keyword pairsThe total number of keyword pairs and searches conducted was thus 154. Sample keyword pairs aregiven in Appendix II. In each category, the focus words were of the same type and the secondkeywords were purposefully made to vary over all types.In TOPIC, searches on a Boolean combination of two keywords 'a' and 'b' and proximity searches on 'a'and 'b' are expressed as follows:ï "a" and "b" denotes the ordinary Boolean conjunction ;ï "a" or "b" denotes the ordinary Boolean disjunction ;ï "a" and not "b" denotes the ordinary Boolean negation of 'b' with respect to 'a' ;ï sentence("a", "b") requires 'a' and 'b' to appear within the same (technical) sentence;ï paragraph("a", "b") requires 'a' and 'b' to appear within the same paragraph ;ï phrase("a", "b") requires 'a' and 'b' to appear in sequence.Four search types were employed in the study:ï sentence("a", "b") (1)ï paragraph("a", "b") (2)ï "a" and "b" and not sentence("a", "b") (3)ï "a" and "b" and not paragraph("a", "b") (4)The search types (1) and (2) are called the sentence search type and the paragraph search type,respectively. Searches of this type retrieve all documents containing the keywords 'a' and 'b' within one(technical) sentence (respectively, paragraph). Sentences (paragraphs) where either one occurs only via

an elliptic or anaphoric reference do not match this search type. The results of a sentence search arealways subsets of the results of the corresponding paragraph search.The search types (3) and (4) are called the KWNS -search type (for keywords-not-in-sentence) and theKWNP -search type (for keywords-not-in-paragraph). Searches of this type retrieve all documentscontaining the keywords 'a' and 'b' somewhere within the document but never within one (technical)sentence (respectively, paragraph). Thus the search results contain all documents whose retrieval insentence searching (respectively, paragraph searching) might be affected by ellipsis or anaphorresolution. (N.B. The KWNS search type finds documents which may contain the keywords in onegrammatical sentence but not in one technical sentence. However, this did not occur in the data.). Theresults of the KWNP -search type are always subsets of the results of corresponding KWNS -search.Thus it was sufficient to analyze the documents retrieved by KWNS -searches.In all search types, phrases as keywords were searched by the 'phrase'-operation.At the time of data collection, the TOPIC system did not have a morpological analysis program forFinnish for building the database index from words in their grammatical basic forms. With inflectedword forms in the index, keyword truncation is necessary and yields documents which would not beretrieved with a basic word form index. Thus truncation was applied to mask all inflectional variants ofkeywords (also within phrases). The resulting false drops due to truncation were excluded from furtheranalysis.Table 2.1. Documents excluded from analysis by search type and focus word category (%)

Documents excluded from analysis by search type, %

Focus Word CategorySentence searchParagraph searchKWNS -searchKWNP -search

Basic word/p. name33.027.825.326.1

Basic word/com. noun27.928.227.628.2

Comp. word/p. name42.439.7

23.822.3

Comp. word/com. noun25.823.622.721.7

Phrase/p. name27.722.34.23.9

Phrase/com. noun47.441.326.125.7

Average33.830.722.922.5

For each keyword pair, only documents containing at least one independent occurrence of the focusword and the second keyword were submitted to analysis. An occurrence is independent when it is nota part of a compound word or a part of a noun phrase. For example, the phrase "power plant" is notindependent but, instead, an embedded phrase, in the sentence "The Fifth Nuclear Power Plant Projectwas launched by the Department of Trade and Industry yesterday.". Neither is 'time' independent in'timetable'. Thus we excluded from analysis documents whose retrieval in proximity searching wouldnot be affected by (correct) ellipsis or anaphoric resolution. This was done to the results of all searchtypes. The percentages of documents excluded from analysis (either due to being false drops caused bytruncation or due to containing only non-independent keyword occurrences) are given in Table 2.1. Weshall argue in the discussion (see Recall in genuine IR situations) that document exclusion does notaffect our findings concerning recall increase in genuine IR situations due to anaphor and ellipsisresolution.Large document sets to be analyzed were sampled. If the set size exceeded 20, then a systematicsample of 20 documents was picked (only one case). The total number of documents retrieved andanalyzed in the KWNS -searches was 780.2.3. Data analysisWe introduce the following concepts for data analysis. Their formal and precise definitions are givenby Pirkola and J‰rvelin (1994).KWNS and KWNP -sets. The result of the KWNS -search for the keyword pair 'a' and 'b' is, excludingfalse drops, the set of documents, called KWNS-set, where 'a' and 'b' appear independently somewhere

in the document text but never within one grammatical sentence. Analogically, the result of thecorresponding KWNP -search is called the KWNP-set.SS and PS-sets. The set of documents containing independent occurrences of the keyword pair 'a' and'b' within a single sentence (respectively, a paragraph) is called the SS-set (PS-set, respectively).A document with an anaphoric matching sentence (paragraph) is a document which contains at leastone sentence (respectively, paragraph) in which a focus word anaphor co-occurs with the secondkeyword, its ellipsis or anaphor. Note that we did not have any lists of possible ellipses and anaphora ofany keywords. They were all interpreted in the document texts after document retrieval (see below). Adocument with an elliptic matching sentence (paragraph) is defined analogically.Figure 2.1 illustrates the sets formed in the analysis of anaphoric occurrences of a focus word 'a' in thesame sentence as any occurrence of the second keyword 'b'. The set Tab contains documents where 'a'and 'b' appear independently somewhere in the document text, the set Sab documents where 'a' and 'b'appear independently within one sentence of the document text, and the set SAab documents with withan anaphoric matching sentence for 'a' and 'b'. The documents in the sets Sab and NSab = Tab - Sabwere first formed (by retrieval and exclusion of false drops as well as documents based on non-independent keyword occurrences). Then the number of documents in these sets were recorded.In the analysis of the documents of the set NSab, the number of anaphora of the focus word 'a' in eachdocument was first counted. Multiple anaphora in one grammatical sentence were counted as one. Thuswe obtained, in fact, the number of anaphoric sentences for each focus word. Secondly, the documentswith an anaphoric matching sentence were identified in NSab as the NSAab = SAab - Sab and its sizewas counted. Both analyses were based on human resolution of anaphora._Fig. 2.1. Relationships of document setsTo characterize the effect of anaphor resolution on document retrieval, two figures were computed:

| SAab - Sab | / | Tab - Sab | and | SAab - Sab | / | Sab |.The former figure gives the share of documents with an anaphoric matching sentence in the KWNS-setfor 'a' and 'b'. This share of the KWNS-set result should be of interest in a sentence search for 'a' and 'b'because the keywords occur within a single sentence ó although indirectly. The latter figure gives therecall increase with respect to the sentence search result due to anaphor resolution. It is also ourestimate of recall increase in sentence searching due to anaphor resolution in the context of realinformation needs (see Section 4).The analysis of the effects of anaphor resolution in the case of paragraph searching was similar with theexception that the number of anaphora was not counted any more. The analysis of the effects of ellipsisresolution in the case of sentence and paragraph searching was similar to the analysis of anaphorresolution. Ellipses of basic words (i.e. cases where nothing replaces the antecedent as in "Kal does nothave a dog but Ari does.") were not considered in the analysis of ellipses.These analyses show the maximal recall increase achievable by ellipsis and anaphor resolution inproximity searching in each focus word category. It turns out in the analysis that statistically significantrecall improvement occurs only in some focus word categories. Thus we analyzed also whether theobserved recall improvement remains statistically significant if ellipsis and anaphor resolution isperformed only for keywords of those categories. More specifically, in the case of anaphora in sentencesearching, the recall improvement was computed in these focus word categories by

| SAr,ab - Sab | / | Sab |,where the set SAr,ab is the set of documents containing an anaphoric matching sentence under therestricted resolution of anaphora.The corresponding calculations used in the case of paragraph searching, and in the case of ellipses areobvious. Such a restricted category-based resolution policy would be justified if it is computationally

considerably less demanding than comprehensive resolution. In those focus word categories, where themaximal achievable recall increase is practically minor and statistically insignificant, the effects areeven smaller if less than comprehensive resolution is performed.Statistical significance of the findings on recall improvement due to resolution were tested by theWilcoxon signed rank test (Siegel & Castellan, 1989). The hypothesis H0 "Resolution does not affectrecall in proximity searching" was tested for both resolution types (ellipsis, anaphora) and bothproximity search types (sentence, paragraph) in each focus word category.3. FINDINGS3.1. The frequency of elliptic and anaphoric sentencesThe frequency of elliptic and anaphoric sentences by focus word category is reported in Table 3.1. Thesecond column in the table shows the number of KWNS-searches in each focus word category and thethird the number of documents (with independent keyword occurrences) found in these searches ineach category. In other words, altogether 112 documents were found to match a Boolean search but nota sentence search on 22 pairs of keywords consisting of a basic word -focus word and a secondkeyword. In column 4 the cells for basic words are empty because ellipses of basic words were notconsidered.Table 3.1. The frequency of elliptic and anaphoric sentences by focus word category in the results ofthe KWNS -searches.

Focus Word CategoryNo. ofNo. of Docs FoundSentence Frequency per Doc

Searchesin KWNS-SetsEllipticAnaphoric

Basic word/p. name22112-0.21

Basic word/com. noun27131-0.06

Comp. word/p. name21960.11

0.22

Comp. word/com. noun271430.220.05

Phrase/p. name312061.990.55

Phrase/com. noun261020.390.02

Columns 4 and 5 report the frequencies per document of elliptic and anaphoric sentences identified inthe KWNS-sets. There are, on the average 1.99 elliptic sentences and 0.55 anaphoric sentences forproper name phrases per document, and 0.39 elliptic sentences for common noun phrases perdocument. In other categories the average number of elliptic and anaphoric sentences for the focuswords are at most 0.22 per document. Thus ellipses and anaphora of phrases seem most important forresolution. In other cases resolution effects are likely to be much smaller even if correct resolution waspossible. These results are not accurate for all ellipses and anaphora in the text database becausemultiple occurrences in any sentence were counted as one, and ellipses and anaphora occur also indocuments matching sentence searches (and thus not included in KWNS -sets). However, thedifferences are minor and can not alter the clear result: ellipses of both phrase categories and anaphoraof proper name phrases are frequent in newspaper text. Otherwise the frequencies are relatively small.3.2. The effect of ellipsis and anaphor resolutionThe analyses of ellipsis and anaphor resoltion are logically similar. Therefore we present these analysestogether.The number and percentage of documents with elliptic (anaphoric) matching sentences and paragraphsis presented by focus word category in Table 3.2 (Table 3.3, respectively). In other words, the tablegives, by focus word category, the number and percentage of documents containing a focus wordellipsis (anaphor) in the same sentence and paragraph as an original, elliptic or anaphoric occurrence ofthe second keyword. This many new documents would be retrieved if the focus word ellipses(anaphora), and possible ellipses and anaphora of the other keyword were resolved. Columns 2 and 3show the number of documents in the KWNS and KWNP -sets. Because ellipses of basic words werenot considered, Table 3.2 does not contain these focus word categories.Table 3.2. The number and percentage of documents with elliptic matching sentences and paragraphsby focus word category.

Focus Word CategoryTotal Cardinality of Sets of Type

No. of Docs with Elliptic Matching

% of Docs with Elliptic Matching

KWNSKWNPSentenceParagraphSentenceParagraph

Comp. word/p. name9680111.01.3

Comp. word/com. noun143112654.24.5

Phrase/p. name206171394018.923.4

Phrase/com. noun10284423.92.4

Columns 4 and 6 (for sentences) and 5 and 7 (for paragraphs) reveal the number and percentage of

documents with elliptic (anaphoric) matching sentences and paragraphs ó documents not retrievableby either proximity search without resolution. These figures do not describe the possible increase inrecall (considered below) but, instead, the share of 'hot' documents among those only retrievable byrelaxing the proximity conditions. The statistics show again that the category of proper name phrases isrich with 'hot' documents.Table 3.3. The number and percentage of documents with matching sentences and paragraphs by focusword category

Focus Word CategoryTotal Cardinality of Sets of Type

No. of Docs with Anaphoric Matching

% of Docs with Anaphoric Matching

KWNSKWNPSentenceParagraphSentenceParagraph

Basic word/p. name11282443.64.9

Basic word/com. noun131107000.00.0

Comp. word/p. name9680111.01.3

Comp. word/com. noun143112201.40.0

Phrase/p. name20617119169.29.4

Phrase/com. noun10284000.00.0

The statistics for proper name phrases are elaborated in Tables 3.4 and 3.5. The results for ellipsesindicate that a significant share (about one fifth) of documents containing, in particular, person namesand organization names but not found by sentence search in fact contain a sentence which would matchin sentence searching if ellipses were resolved. In paragraph searching the share is even larger, about aquarter for both of person names and organization names. Also in the case of anaphora, a significantshare (about 12 %) of documents containing person names but not found by proximity search could befound through anaphor resolution. For ellipses, in the case of event names, and for anaphora, in thecase of organization and event names, documents with elliptic matching sentences and paragraphs areclearly less frequent among the documents not found by either of the proximity searches.Table 3.4. The number and percentage of documents with elliptic matching sentences and paragraphsby proper name phrase types.

Proper Name Phrase TypeTotal Cardinality of Sets of Type

No. of Docs with Elliptic Matching

% of Docs with Elliptic Matching

KWNS

KWNPSentenceParagraphSentenceParagraph

Person Name122102252720.526.5

Organization Name45339820.024.2

Event Name39365512.813.9

Total206171394018.923.4

Table 3.5. The number and percentage of documents with anaphoric matching sentences andparagraphs by proper name phrase type

Proper Name Phrase TypeTotal Cardinality of Sets of Type

No. of Docs with Anaphoric Matching

% of Docs with Anaphoric Matching

KWNSKWNPSentenceParagraphSentenceParagraph

Person Name122102141311.512.7

Organization Name4533428.96.1

Event Name3936112.62.8

Total20617119169.29.4

The recall improvement which can be achieved in sentence searching through ellipsis (anaphor)resolution is shown in Tables 3.6 and 3.7. Column 2 gives the total number of documents in SS-sets ineach focus word category, column 3 the number of documents with elliptic (anaphoric) matchingsentences, and column 4 the recall improvement due to resolution (i.e. the percentage of column 3figures of column 2 figures). It is clear that ellipsis resolution would have a practically significant

effect on recall in both phrase categories of focus words: 20 - 38 %. However, the resolution effect isstatistically significant only for proper name phrases. Anaphor resolution would have a significanteffect on recall in the focus word category of proper name phrases (18.6 %) and perhaps in the categoryof proper name basic words (6.3 %).Table 3.6. Recall improvement due to ellipsis resolution in sentence searches by focus word category.

Focus Word CategoryTotal Cardinality of SS-SetsNo. of Ellipsis Resolution Retrievable DocumentsRecall increase % due to Ellipsis Resolution

Comp. word/p. name5311.9

Comp. word/com. noun9266.5

Phrase/p. name1023938.2***

ï Person Name602541.7

ï Organization Name27933.3

ï Event Name15533.3

Phrase/com. noun20420.0*

*** significance level 0.001 * significance level 0.1

The statistics for proper name phrases are elaborated in Tables 3.6 and 3.7 to see the resolution effectsin searches for different proper name phrase types. The figures indicate that ellipsis resolution increasesrecall most in the case of person names (by over 40 %) but in the other cases also by one third.Anaphor resolution increases recall most in the case of person names (by over 23 %) and considerablyless in the other proper name cases.Table 3.7. Recall improvement due to anaphor resolution in sentence searches by focus word category.

Focus Word CategoryTotal Cardinality of SS-SetsNo. of Anaph. Resolution Retrievable DocumentsRecall increase % due to Anaphor resolution

Basic word/p. name6346.3

Basic word/com. noun6200.0

Comp. word/p. name5311.9

Comp. word/com. noun9222.2

Phrase/p. name1021918.6***

ï Person Name601423.3

ï Organization Name27414.8

ï Event Name1516.7

Phrase/com. noun2000.0

*** significance level 0.001Tables 3.8 and 3.9 show the ellipsis and anaphor resolution effects on recall of paragraph searches.Recall improvements are constantly smaller than in the case of sentence searches. This was expectedbecause the larger context of paragraphs is more likely to provide original occurrences of keywordsthan the narrower sentence contexts. However, the trends are the same. Recall improvements arepractically and statistically significant in the category of proper name phrases (nearly 30 % for ellipsesand 11.5 % for anaphora) and otherwise minor.Table 3.8. Recall improvement due to ellipsis resolution in paragraph searches by focus word category.

Focus Word CategoryTotal Cardinality of PS-SetsNo. of Ellipsis Resolution Retrievable DocumentsRecall increase % due to Ellipsis Resolution

Comp. word/p. name7011.4

Comp. word/com. noun12654.0

Phrase/p. name1394028.8***

ï Person Name822732.9

ï Organization Name388

21.1

ï Event Name19526.3

Phrase/com. noun3725.4

*** significance level 0.001Within the category of proper name phrases the recall improvements due to ellipsis resolution aregreatest among person names (about a third) and otherwise of the order of a fifth or a quarter. In thecase of anaphor resolution, the recall improvements are greatest, again, among person names (about 16%) and otherwise of the order of only 5 %.Table 3.9. Recall improvement due to anaphor resolution in paragraph searches by focus wordcategory.

Focus Word CategoryTotal Cardinality of PS-SetsNo. of Anaph. Resolution Retrievable DocumentsRecall increase % due to Anaphor resolution

Basic word/p. name9644.2

Basic word/com. noun8400.0

Comp. word/p. name7011.4

Comp. word/com. noun12600.0

Phrase/p. name139

1611.5***

ï Person Name821315.9

ï Organization Name3825.3

ï Event Name1915.3

Phrase/com. noun3700.0

*** significance level 0.0013.3. Resolution of person name ellipses and anaphoraThe findings above indicate statistically significant recall improvement in the category of proper namephrases as focus words. However, this was achieved by resolving (comprehensively) all ellipses andanaphora of the second keyword irrespective of their category. Therefore we calculated the recallimprovement and statistical significance of proper name phrase resolution if only proper name phrasesare resolved. In the case of ellipses, ellipses of the focus word (a proper name phrase) were resolved.Moreover, if the second keyword was also a proper name phrase, its ellipses were resolved. In the caseof anaphora, anaphora of the focus word (a proper name phrase) were resolved. However, if the actualantecedent of an anaphor was not a proper name phrase but its ellipsis, also this ellipsis was resolved inorder to have the real referent resolved. If the second keyword also was a proper name phrase, itsanaphora and ellipses were resolved in the same way. The results are given in Table 3.10.Due to restricted resolution, the figures for recall increase drop with respect to the comprehensiveresolution case, but only slightly. This was expected: when there are very few ellipses and anaphora perdocument for keywords other than proper name phrases, the probability of a proper name phraseellipsis or anaphor co-occurring with an ellipsis or anaphor of another type of keyword in the samesentence or paragraph is very small. In fact, only one document was missed due to restricting resolutionto proper name phrases. The findings on recall improvement are statistically significant in all cases.Therefore it is worthwhile to consider the resolution of ellipses and anaphora of proper name phrases.Table 3.10. Recall improvement due to resolution in proper name phrase searches by search type andresolution type (31 keyword pairs).

Resolution type / Search TypeTotal Cardinality of SS- and PS-Sets

No. of Documents Retrievable due to Resolution Recall increase % due to Resolution

Ellipses, sentence1023938.2***

Ellipses, paragraph1394028.8***

Anaphora, sentence1021817.6***

Anaphora, paragraph1391510.8**

*** significance level 0.001 ** significance level 0.002In newspaper text, a person is usually referred to through a personal pronoun (most often 'he' or 'she'),some other pronouns (e.g., 'who'), substantive anaphora, an apposition attribute (e.g., president GeorgeBush), the first name, the family name, or through a nick name (e.g., 'Jackie') or epithet (e.g., 'derEiserne Kanzler'). In our data, family name ellipses and the personal pronoun he/she ('h‰n' in Finnishequals both 'he' and 'she' ) were by far the most frequent. There were 308 person ellipses of which 305were family name ellipses and 3 first name ellipses. Moreover, there were over 100 person nameanaphora of which 81 were the pronouns he/she. Four epithets were identified. Other types ofreferences were infrequent. These statistics show that by resolving personal pronouns he/she and familyname ellipses, a great majority of personal references would be resolved. Because this would improverecall considerably, we analyzed whether this resolution could be performed in a simple way, withoutsyntactic analysis.In Finnish, also proper names have inflected forms which may share, depending on the name length,only a few first letters. If morphological analysis is not performed, a very quick-and-dirty method forresolving family name ellipses when adding documents into the database is to take the first three lettersof a family name of a person name (e.g., 'Bus' for 'George Bush') and match them with text words withthe hope that they would match the family name in the text. For each match, also the full name phrasewould be stored in the index with the same address as the matched word has. In addition to correctresolutions, such a method would make two kinds of mistakes: (1) incorrect extraneous resolutions dueto putting a full name into a wrong place, e.g., 'Hans Holmer' matching 'Holger Romander', 'JaakkoLassila' matching 'Lassila & Tikanoja' (a company), and 'Barbara Bush' matching 'George Bush'; (2)incorrect extraneous resolutions due to assumming erroneously some word pair as a full name andusing the second word for matching, e.g., take the first two words of the sentence 'If Lillehammar winsthe Olympic games ...' as a full name to resolve any subsequent occurrences of 'Lillehammar' or other

words, including family name ellipses, beginning with 'Lil'. This is silly but partly harmless becausehardly anybody searches for 'If Lillehammar'. However, if a family name ellipsis is matched, thepossibility for full name searching is lost. Such occasions were very rare in our data.By augmenting the method by some rules (e.g., 'do not resolve a match preceded or followed by a wordwith a capital first letter' and 'do not resolve within a capital phrase containing special characters' and'among several antecedent candidates, choose the first one'), errors of the first type would be reducedeffectively. In fact, all 305 family name ellipses would be resolved correctly. Errors of the second typeare avoided if full names are recognized correctly in documents, e.g., by augmenting the method bystop words not valid for first names, special characters not allowed in full names, or lists of most usualfirst names (in several languages) as cues for identifying the full names. Surprisingly simple seems towork surprisingly well. In English text, or in Finnish text with morphological analysis, full familynames instead of three-letter beginnings can be used for matching and thus there are less chances formistakes.If resolution is done in the search phase, the full name keyword gives the correct form of the antecedentpossibly having ellipses in a document. This suggests that a proximity search on a person name, saysentence(phrase("idi", "amin*"), "exile"), should be augmented automatically by a search: phrase("idi","amin*") and sentence("amin*", "exile"), if the user declares 'idi amin' as a person's name. In fact, noresolution is actually needed.The 81 (s)he -pronouns contained inter- and intra-sentence references to an ellipsis of the person nameused as the keyword as well as inter- and intra-sentence references to the full name. There were 6 intra-sentence references to the full name and 7 sentences with double occurrences of the pronoun. Thusthere were 68 sentences which might be affected by the resolution of the pronoun in proximitysearching. Assuming that person names can be identified in the text, the simplest resolution method isto assume the last preceding person name as the referent (Fraurud, 1988). This would resolve correctly77 of the 81 pronoun occurrences (95%). Again, the method could be improved by simple rules: (1) Aquoted sentence, followed by ' , he said.' (etc.) does not contain the referent of 'he'. (2) When thepronoun is preceded by both a family name ellipsis and another full name, the former is more likely tobe the correct referent. The latter rule is based on Fraurud's (1988) finding that the correct referentoften is the document's primary discourse referent and our finding that document's primary personnames tend to have many ellipses. By combining the rule for "the last preceding person name" with theabove rules (1) and (2), all personal pronouns (3rd person, singular) in the data can be resolvedcorrectly. Again, surprisingly simple seems to work surprisingly well provided that person names canbe identified. However, this 100 % correct resolution result should be confirmed in a larger data set. Ifresolution is done in the query phase, the full name keyword helps in the resolution of both ellipses andanaphora. If resolution is done while adding documents into the database it would benefit from lists ofcommon first names as cues for identifying the full names.4. DISCUSSIONWe have approached the study of anaphor and ellipsis resolution in IR in a novel way. We haveconsidered resolution from the viewpoint of proximity searching in a text database. We did not test anyspecific resolution method but, instead, performed manual and correct resolution in order to find themaximal IR effects achievable through resolution. Moreover, we have analyzed the resolution effectson recall in several keyword categories, i.e. in the categories of the correlates / antecedents of anaphoraand ellipses, not in the linguistic categories of the anaphora and ellipses themselves. Thus it becomespossible to judge what kind of resolution is needed from the proximity search viewpoint.Our findings show that anaphor and ellipsis resolution have a significant effect on search results inproximity searching in a newspaper article database, and that the effect depends on keyword category.The single most important keyword category for anaphor and ellipsis resolution is the proper name

phrase category. In the keywords in this category, the frequency of elliptic and anaphoric sentences perdocument is the greatest, 1.99 and 0.55, respectively. Therefore the likelihood of resolution effects isthe greatest in this category. For the focus words in this category, the KWNS and KWNP-setscontained the highest percentages of documents with matching elliptic sentences (18.9 %) andparagraphs (23.4 %) when comprehensive resolution of ellipses and anaphora of the second keywordwas performed. In other categories these percentages were at most 4.5 %. Within the category of propername phrases, the percentages were higher for person names and organization names, and lower forevent names. Consequently, the recall increase due to ellipsis resolution in the focus word category ofproper name phrases was statistically significant and high, 38.2 % (sentence searches) or 28.8 %(paragraph searches) both in the case of comprehensive resolution and restricted resolution (of propername phrases only) of the second keyword. In other focus word categories the resolution effects wereconsiderably smaller and statistically insignificant.The results are similar in the case of focus word anaphora. In comprehensive resolution of ellipses andanaphora of the second keyword, the KWNS and KWNP-sets contained the highest percentages ofdocuments with matching anaphoric sentences (9.2 %) and paragraphs (9.4 %) of focus words in theproper name phrase category. In other categories these percentages were at most 4.9 %. Within thecategory of proper name phrases, the percentages were highest for person names, lower fororganization names, and lowest for event names. Consequently, the recall increase due to anaphorresolution in focus word category of proper name phrases was statistically significant and high, 18.6 %(sentence searches) or 11.5 % (paragraph searches) in the case of comprehensive resolution of thesecond keyword, and 17.6 % (sentence searches) or 10.8 % (paragraph searches) in the case ofrestricted resolution (of proper name phrases only) of the second keyword. In other focus wordcategories the resolution effects were considerably smaller and statistically insignificant.In conclusion, although anaphor resolution did not have a positive effect on document ranking instatistical IR for scientific abstracts (Bonzi & Liddy, 1989; Liddy, 1990), it seems to have, like ellipsisresolution, a positive effect on recall in full text proximity searching on proper name phrases, at least ina newspaper article database. It is not surprising that, in such a database, person names are the singlemost important category for ellipsis and anaphor resolution: persons and their deeds are central foci inthe news.Comprehensive and correct resolution of ellipses and anaphora requires both morphological, syntactic,semantic as well as pragmatic knowledge (Akmajian & al., 1990; Hirst, 1981). With currenttechnology, morphological and surface-syntactic analysis is efficient but semantic analysis ofunrestricted text is not (Voutilainen & Heikkil‰, 1994). Person names form a keyword category whereresolution may be implemented more simply than in other categories. Therefore restricted types ofresolution, in particular of proper name phrases, which are easier to implement and effective inkeyword-based full text IR, should be considered. Even if resolution was done fully and correctly inother word categories, recall improvements would be much smaller and perhaps unjustified due tohigher resolution costs. More comprehensive and computationally more demanding resolution methods(Grishman, 1986; Fraurud, 1988; Hirst, 1981; Hobbs, 1978; Kuno, 1987; Lappin & Leass, 1994;Winograd, 1983) for other word categories may be justified in other IR tasks (e.g., semantic textrepresentation, question-answering systems, question analysis systems, automatic abstraction systems;Liddy & al., 1987).Some further justifications for these conclusions are, however, in order.(1) Recall in genuine IR situations. In the study setting recall could not decrease due to resolutionbecause of our technical approach to relevance. All documents containing the keywords, their ellipsesor anaphora within the required passage (sentence or paragraph) were considered relevant. Thusproximity searching with resolution is bound to retrieve more documents than without resolution and

the new documents due to resolution were judged relevant. Thus recall was bound to improve. Whatare the expected resolution effects if the queries are based on real information needs and real relevanceassessment ? It is obvious that, for any pair of keywords properly representing an information need,many documents matching a conjunctive Boolean search are not relevant. Many false drops are due to(e.g., Lancaster, 1968; 1986):(1) keywords not related in the text,(2) incorrect relationship between keywords,(3) homonyms,(4) the text dealing with the search topic only marginally or in a wrong way.In proximity searching, cause (1) is certainly less frequent and (4) perhaps equally frequent. Causes (2)and (3) may be less frequent in the narrower scope required in proximity searching. Consequently, theprecision of proximity search results has often been found better than that of conjunctive search resultsbut recall much lower (e.g., Kristensen, 1993; Tenopir & Ro, 1990). Tenopir and Ro (1990) point out,based on a study conducted in a database of magazine articles, that recall deteriorates in proximitysearching because all words are not explicitly repeated in every text passage dealing with the topicrepresented by them. Nevertheless, false drops turn up also in proximity searching. We believe thatdocuments with (correctly identified) elliptic or anaphoric matching sentences are at least equally likelyto be relevant as documents containing the original keywords in the required scope. In fact, they mayeven be more likely to be relevant because ellipses and anaphora probably refer to central concepts alsoin newspaper articles ó as seen in the context of scientific abstracts (Bonzi & Liddy, 1989). The aboveargument by Tenopir and Ro (1990) also supports this conclusion.Document exclusion done in data collection (see Table 2.1) might affect our recall findings in twoways. (1) Some findings on positive resolution effects could be overly optimistic, if it can be shownthat some of the documents excluded in sentence (paragraph) searches are in fact relevant. (2) Somefindings on only marginal resolution effects could be overly pessimistic, if it can be shown that some ofthe documents excluded in KWNS (KWNP) searches are in fact relevant and contain elliptic oranaphoric matching sentences as well. The first counter-argument can be weighted through a worst-case analysis. What happens to the recall figures, if one assumes that all documents excluded insentence (paragraph) searches were relevant ? In all categories except for phrases, the reportedmarginal resolution effect would turn out to be even more marginal. In the case of proper namephrases, the recall increase figures would drop as follows:ï ellipsis resolution, sentence search: from 38.2 % to 27.7 %ï ellipsis resolution, paragraph search: from 28.8 % to 22.3 %ï anaphor resolution, sentence search: from 18.6 % to 13.5 %ï anaphor resolution, paragraph search: from 11.5 % to 8.9 %All these figures indicate a great recall increase and are statistically significant. In the case of commonnoun phrases, the recall increase figure for ellipsis resolution in sentence search would drop from 20 %to 10.5 % (both statistically insignificant). Thus the worst case analysis does not change our conclusionthat resolution has significant effects on recall only for proper name phrases. Moreover, the assumptionthat all documents excluded in sentence (paragraph) searches are relevant, is highly unrealistic. In agenuine IR situation, the proportion of relevant documents certainly is higher among the includeddocuments than among the excluded documents.The second counter-argument can be weighted through a best-case analysis. What happens to the recallfigures, if one assumes that all documents excluded in KWNS (KWNP) searches which contain ellipticor anaphoric matching sentences, were relevant ? We do not know, how many of the excludeddocuments actually contained elliptic or anaphoric matching sentences, since they were not analyzed inthis respect. Thus we have to make further assumptions. We assume that, in each category, the same

share of documents excluded in KWNS (KWNP) searches contained elliptic or anaphoric matchingsentences as among those included. For example, in the case of anaphor resolution for basic words, 4documents of 112 in the KWNS-set contained elliptic or anaphoric matching sentences. As 25.3 % (or38 documents) of the KWNS search result were excluded, we estimate that the excluded documentswould contain (4 / 112) * 38 = 1.4 _ 1 document with an elliptic or anaphoric matching sentence.Assuming this document relevant would move the resolution effect on recall from 6.3 % to 8.0 %which is not significant. Performing the same analysis in all cases resulted always in zero to twoadditional documents which, if assumed relevant, do not change our statistical findings in anyinteresting way. Our statistical findings do not change even if the excluded documents were assumed tocontain elliptic or anaphoric matching sentences twice as frequently as those included. Also here therelevance assumption is unrealistic.In summary, we have argued that (1) documents with elliptic or anaphoric matching sentences are atleast equally likely to be relevant as documents containing the original keywords in the required scopeand (2) that document exclusion in data collection does not change the statistical results on recallimprovement nor their statistical significance. Thus we conclude, that our findings on increasing recallin proximity searching due to (correct) ellipsis and anaphor resolution, based on our technical approachto relevance, strongly suggest that an equally large increase in recall in proximity searching due toellipsis and anaphor resolution is achieved when the queries are based on real information needs. If theresolution methods are unreliable the recall improvements are smaller and search precision maydeteriorate.We have a new project under way which is based on 35 questions representing real information needson the same database (55.000 newspaper articles) as in the present study and a base of 5.000 documentswith relevance judgments (4-point scale) for the queries. In this project, the recall effects of ellipsis andanaphor resolution are considered by focus word type in queries of varying exhaustiveness (the numberof conjunctive search concepts, levels 2-5).(2) Database type. We have studied the case of newspaper text. Attention should be given to thedatabase type ó newspaper text is not representative of all texts of interest in IR. News text probablycontains proper name phrases more frequently than scientific text, not to mention scientific abstracts.The distributions of keyword categories in text and queries certainly vary among database types. Moreresearch is needed in databases (or corpora) of different types.(3) Compound words. The share of compound words differs between languages. Finnish, Swedish,German have many compound words while English has less. However, as the results indicate,compound words do not seem to be an important area for ellipsis or anaphor resolution for proximitysearching.Our findings also suggest development of search strategies even if no ellipsis or anaphor resolution isperformed, e.g., family name ellipses can be accounted for by searching for the full name phrase in thedocuments together with a passage containing the family name ellipsis and the other keywords.Another possible development resembles the CIP (Cataloguing-In-Publication) convention: withcurrent technology, most ellipses and anaphora could be resolved comfortably and effectively if donejointly by the author of a document and the document processing system at the document preparationstage. Because proximity searching on organization names would also benefit from resolution, andbecause organization names are very important in many IR situations, automatic recognition oforganization names augmented by ellipsis and anaphor resolution for them seems worth developing.5. CONCLUSIONSWe have considered anaphor and ellipsis resolution in a novel way, from the viewpoint of proximitysearching in a newspaper article database. We have performed correct manual resolution of anaphoraand ellipses to find the maximal achievable IR effects. These were analyzed in six keyword categories,

i.e. in the categories of the antecedents of anaphora and ellipses, not in the linguistic categories of theanaphora and ellipses themselves. Our findings indicate that anaphor and ellipsis resolution have asignificant positive effect on search recall in proximity searching in the proper name phrase category ofkeywords. Therefore also restricted resolution of proper name phrase ellipses and anaphora wasanalyzed. This meant, that for keyword pairs (A, B), where A was a proper name phrase and B akeyword in any keyword category, were considered. The keywords A were resolved and the keywordsB were left intact unless they were also proper name phrases. When A's ellipses were resolved, therecall increase was 38.2 % in sentence searches, and 28.8 % in paragraph searches. When A's anaphorawere resolved, the recall increase was 17.6 % in sentence searches, and 10.8 % in paragraph searches.In other keyword categories the recall increase was marginal even with comprehensive resolution ofellipses and anaphora.A restricted resolution method for proper name phrase ellipses and anaphora only is bound to be easierto implement and more efficient to apply than a comprehensive resolution method for all wordcategories. Therefore restricted resolution methods, in particular for proper name phrases should beconsidered in keyword-based full text IR. More comprehensive and computationally more demandingresolution methods for other keyword categories may be justified in other IR tasks like semantic textrepresentation, question-answering systems, question analysis systems, and automatic abstractionsystems.REFERENCESAkmajian, A., Demers, R., Farmer, A., & Harnish, R. (1990). Linguistics: An introduction to languageand communication. Cambridge, MA: The MIT Press.Blair, D.C. (1986). Full text retrieval: Evaluation and implications. International Classification, 13(1),18-23.Blair, D.C. (1990). Language and Representation in Information Retrieval. Amsterdam: Elsevier.Blair, D.C., & Maron, M.E. (1985) An Evaluation of retrieval effectiveness for a full-text documentretrieval system. Communications of the ACM, 28(3), 289-299.Bonzi, S., & Liddy, E. (1989). The use of anaphoric resolution for document description in informationretrieval. Information Processing & Management, 25(4), 429-441.Carter, D. (1987). Interpreting anaphors in natural language texts. New York, NY: John Wiley & Sons.Chong, A. (1989). Topic: A concept-based document rietrieval system. Library Software Review, 8,281-284.Fraurud, K. (1988). Pronoun resolution in unrestricted text. Nordic Journal of Linquistics, 11(1-2), 47-68.Grishman, R. (1986). Computational linguistics: An introduction. Cambridge, UK: CambridgeUniversity Press.Halliday, M., & Hasan, R. (1976). Cohesion in English. London, UK: Longman.Hirst, G. (1981). Anaphora in natural languare understanding: A survey. Lecture Notes in ComputerScience 119. Berlin: Springer.Hobbs, J. (1978). Resolving pronoun references. Lingua, 44, 311-338.Johnson, F.C. (1994). A classification of ellipsis based on a corpus of information seeking dialogues.Information Processing & Management, 30(3), 315-325.Kristensen, J. (1993). Expanding end-users' query statements for free text searching with a search-aidthesaurus. Information Processing & Management, 29(6), 733-744.Kuno, S. (1987). Functional syntax: Anaphora, discourse and empathy. Chicago, IL: The University ofChicago Press.Lancaster, F.W. (1968). Information retrieval systems: Characteristics, testing, and evaluation. NewYork, NY: John Wiley & Sons.

Lancaster, F.W. (1986).Vocabulary control for information retrieval. (2nd ed.) Arlington, VA:Information Resources Press.Lappin, S., & Leass, H.J. (1994). An algorithm for pronominal anaphora resolution. ComputationalLinguistics, 20(4), 535-561.Liddy, E. (1990). Anaphora in natural language processing and information retrieval. InformationProcessing & Management, 26(1), 39-52.Liddy, E., Bonzi, S., Katzer, J., & Oddy, E. (1987). A study of discourse anaphora in scientificabstracts. Journal of the American Society for Information Science, 38(4), 255-261.McKinin, E. J., Sievert, M.-E., Johnson, E.D., & Mitchell, J.A. (1991). The Medline/Full-TextResearch Project. Journal of the American Society for Information Science, 42(4), 297-307.Paice, C. (1990). Constructing Literature Abstracts by Computer: Techniques and prospects.Information Processing & Management, 26(1), 171-186.Pirkola, A., & J‰rvelin, K. (1994). The effect of anaphora and ellipsis resolution on proximitysearching in a text database. Tampere, Finland: University of Tampere, Dept. of Information Studies,RN-1994-1.Salton, G. (1989). Automatic text processing: The transformation, analysis, and retrieval of informationby computer. Reading, MA: Addison-Wesley.Siegel, S., & Castellan, J. (1989). Nonparametric statistics for the behavioral sciences. New York, NY:McGraw-Hill.Smeaton, A. (1991, October). Natural language processing and information retrieval. A tutorialpresented at the ACM SIGIR '91, The 14th International ACM/SIGIR Conference on Research andDevelopment in Information Retrieval, Chicago, IL.Tenopir, C. (1985). Full text database retrieval performance. Online Review, 9(2), 149 - 164.Tenopir, C., & Ro, J.S. (1990). Full text databases. New York, NY: Greenwood Press. (New Directionsin Information Management: 21).TOPIC (1990).TOPIC retrieval technology: A technical overview. Mountain View, CA: Verity, Inc.Voutilainen, A., & Heikkil‰, J. (1994). An English constraint grammar (ENGCG): a surface-syntacticparser of English. In U. Fries, G. Tottie & P. Schneider (Ed.), Creating and using English languagecorpora (pp. 189-199). Amsterdam: Rodopi.Webster's third new international dictionary of the English language, (1981). Chicago, IL:Encyclopaedia Britannica.Winograd, T. (1983). Language as a cognitive process. Volume 1: Syntax. Reading, MA: Addison-Wesley.APPENDIX I: Categories of ellipses and anaphoraThe following categories of ellipses and anaphora occurred in our data.Anaphora1. Pronouns

- personal pronouns (e.g., "he", "they")- demonstrative pronouns (e.g., "this", "it")- indefinite pronouns (e.g., "each", "both")

2. Nouns- ordinary nouns (e.g., "novelist")- epithets (e.g., "!the state oil-giant"ó referring to a company, "the Lion of Laukaa" ó referring to a rally pilot)

3. Numerals- Numerals occurred only in anaphoric complexes, e.g., "... of those two countries ..."

Ellipses

1. Omission of a compound word component, e.g., "metals_industry" ó "industry", "NoverañConsortium" ó "consortium". (NB. Here the underscore "_" denotes that the Finnish word for "metals industry" is a compound word.)

2. Omission of a phrase component- both components nouns, e.g., "Hans Holmer" ó "Holmer", "Tampella Power" ó"Power" and "stock exchange" ó "exhange"- an adjective and a noun, e.g., "Amnesty International" ó "Amnesty ", "short-term market interests" ó "market interests"

APPENDIX II: Sample keyword pairsWe provide here some examples of English equivalents of original Finnish keywords (six percategory). Finnish common nouns do not always translate into the same focus word category inEnglish. Sample keywords, which in Finnish are compound words but in English phrases, are givenbelow with an underscore connecting the components, e.g., nazi_criminal.Basic word proper names as focus wordsargentina nazi_criminalengland northern_irelandkola pollution (the Kola Peninsula in Russia)lancia juha kankkunen (a car brand and a Finnish rally pilot)vaisala turnover (a Finnish company)tamfelt revenue (a Finnish company)Basic word common nouns as focus wordsarcheologist excavationdrought somaliaitalian mafia_leaderpolice bank_robberpollution mediterraneanlibrary information seekingCompound word proper names as focus wordssouth_korea seoul olympicscenter_party agriculture's overproductionnew_zealand greenpeaceusa japanese carnagorno_karabah fight (a disturbed region in Azerbaizan)novera_group postipankki (a Finnish company and a bank)Compound word common nouns as focus wordsdoping_substance athleteindependence_fight estoniaissue_of_shares instrumentarium (a Finnish company)paper_industryprofitabilitymetal_industryexport valuedeath_sentence salman rushdieProper name phrases as focus wordsben johnson carl lewisidi amin exiletampella power orderbookamnesty international human_rights_crimewar in afganistan soviet_union

jyv‰skyl‰ rallyprotest movement (the Thousand Lakes Rally in Finland)Common noun phrases as focus wordsapartment prices helsinkidrug smuggling soviet unionpublic sector state debtillegal immigration germanynordic cooperation icelandmetals production rautaruukki (a Finnish company)

Pirkola - J‰rvelin

Pirkola - J‰rvelin

Pirkola - J‰rvelin

Pirkola - J‰rvelin

Pirkola - J‰rvelin

�̂ ÙÙÙÙÙÙÌÌÌÌÌÌÌÌÌÌÙÙÙÎÎÎ ÎÙMiscStatus2228625DefaultIconfm20,0InprocServerToolboxBitmap32fm20, 165ProgIDForms.MultiPage.1TypeLib&{0D452EE1-E08F-101A-852E-02608C4D0BB4}NormalNormalDefault Paragraph FontDefault Paragraph FontFooterFooterHeader

HeaderFootnote ReferenceFootnote ReferenceFootnote TextFootnote TextHeading 7Heading 7Heading 6Heading 6Heading 5Heading 5Heading 4Heading 4Heading 3Heading 3Heading 2Heading 2Heading 1Heading 1Page NumberPage NumberDocTitleDocTitleAbsHeadingAbsHeadingReferenceReferenceAuthorsAuthorsFigureHeadFigureHeadListingListingformulaformulaAbsTextAbsTextAbsListingAbsListingsql-likesql-likePrologPrologMainClassMainClassperustekstiperusteksti

Lukujen otsikotLukujen otsikotAbstractAbstractArticle titleArticle titlealilukuotsikkoalilukuotsikkoNF2-instanceNF2-instanceTimes New RomanTimes New RomanSymbolSymbolCourier NewCourier NewYThe Effect of Anaphora and Ellipsis Resolution on Proximity Searching in a Text DatabaseYThe Effect of Anaphora and Ellipsis Resolution on Proximity Searching in a Text DatabaseKal JarvelinKal JarvelinKal JarvelinKal JarvelinControThe Effect of Anaphora and Ellipsis Resolution on Proximity Searching in a Text DatabaseKal JarvelinNormalKal JarvelinMicrosoft Word 8.0Dictionary-based MethodsMultilingual ThesauriParallel (aligned) CorporaMachine Translation3. Dictionary-based CLIR ProblemsSource language processingPhrase/compound identificationInflectional morphologySource language ambiguityTranslation dictionary (MDR) conversionTranslation process - dictionary coverage:Special terms Expansion effect (syns)Compound wordsProper namesCLIR Problems, 2Target language ambiguitySource language ambiguity + target language ambiguity = translation ambiguityQuery structuringWeights?

Phrases?Operators?CLIR Problems, 3Language specific problemsTerm segmentation in Chinese (Asian languages)Fogemorphemes in, e.g., Germanic languagesInflection in Fenno-Ugrian & African languagesPhrase / compound recognitionMany other language specific characteristicsGeneral problemsAvailability of dictionariesTransitive translation methods an alternative3.1. Compound ProcessingAtomic wordsCompounds, compound wordsappear in several parts-of-speechdeterminative, e.g. Weinkeller, vink‰llarecopulative , e.g. schwartzweiss, svartvitcompositional, e.g. Stadtverwaltung, stadsfˆ rvaltningnon-compositional, e.g. Erdbeere, jordgubbeCompound Word TranslationCompound wordsComponent words spelled togetherCompound in English may also denote a phraseAll compounds are not in a dictionarySome languages are very productiveNon-dict compounds often cannot be translated directly but must be split and translated by componentsTrend: large dicts: many compositional compounds added - small dicts: atomic words, old non-compositional compsCompounds relieve phrase identification problemsFogemorphemesJoining morphemes complicate compound analysis & translationFogemorphem types in Swedish<omission> flicknamn-s r‰ttsfall-e flickebarn-a g‰stabud-u gatubelysnin˛ ˇUniversity of TampereThe Effect of Anaphora and Ellipsis Resolution on Proximity Searching in a Text Database_PID_GUID{405D6A00-2CA8-11D6-AF0A-000502266052}{405D6A00-2CA8-11D6-AF0A-000502266052}Compound ProcessingA Finnish natural language query:l‰‰kkeet syd‰nvaivoihin(medicines for heart problems)

The output of morpho-logical analysisl‰‰kesyd‰nvaiva, syd‰n, vaivaDictionary translation and the output of component tagging:l‰‰ke ---> medication drugsyd‰nvaiva -Dictionary translation and the output of component tagging:l‰‰ke ---> medication drugsyd‰nvaiva - î not in dictîsyd‰n ---> heartvaiva ---> ailment, complaint, discomfort, inconvenience, trouble, vexationMany ways to combine components in queryCompound Processing, 2A Finnish natural language query:l‰‰kkeet syd‰nvaivaThe corresponding English CLIR query#and(#syn(medication drug) #syn(#uw3(heart ailment) #uw3(heart complaint) #uw3(heart discomfort)#uw3(heart inconvenience) #uw3(heart trouble) #uw3(heart vexation)))#uw3 = proximity operator for three intervening words, free word order3.2. Expression Forms - PhrasesPhrases (sometimes also called compounds)also appear in several parts-of-speechdeterminative (endocentric), e.g. wine cellarappositional , e.g. president Bushcompositional, e.g. city councilnon-compositional/idiomatic, e.g. rains cats and dogsfixed phrases, e.g. hot dogsyntactic phrases, e.g. This is a red housestatistical phrases -- based on word collocationsPhrase ProcessingPossibilitiesPhrase recognition in queriesPhrase translation (as phrases)Phrase indexing / query structuring (proximity operator)No specific processingBallesteros & Croft, 1996, Hull & Grefenstette, 1996manual analysis -> phrase translation -> improved resultsPhrase recognition essential for CLIRWithout it their translation may end up with quite different meaning (non-compositional phrasesespecially)Root Entry1TableWordDocumentSummaryInformationSummaryInformationDocumentSummaryInformationDocumentSummaryInformation

CompObjCompObjObjectPoolMicrosoft Word DocumentWord.Document.8ies1. nouns2. adjectives3. pronouns4. numerals5. verbs6. adverbs7. prepositions and postpositions8. conjunctions9. interjectionsTypical inflectional categories in most languagesnumber, e.g. nouns and adjectivescase: nouns, adjectives, pronouns, numeralsgender, e.g. adjectivescomparison, adjectivesperson, tempus, modu