TEXT SUMMARIZATION: AN OVERVIEW∗
Elena Lloret
Dept. Lenguajes y Sistemas Informaticos
Universidad de Alicante, Alicante, Spain
Abstract
This paper presents an overview of Text Summarization. Text Summarization is a challenging problem these days. Due to the great amount of information we are provided with, and thanks to the development of Internet technologies, the need to produce summaries has become more and more widespread. Summarization is a very interesting and useful task that gives support to many other tasks, and it also takes advantage of the techniques developed for related Natural Language Processing tasks. The paper we present here may help the reader get an idea of what Text Summarization is and what it can be useful for.
Keywords: automatic text summarization; extracts and abstracts
∗This paper has been supported by the Spanish Government under the project TEXT-MESS (TIN2006-15265-C06-01)
1 Introduction
The World Wide Web has brought us a vast amount of on-line information. Due to this fact, every time someone searches for something on the Internet, the response obtained is a large number of different Web pages with a great deal of information, which is impossible for a person to read completely. Although the attempts to generate automatic summaries began 50 years ago [40], in recent years the field of automatic Text Summarization (TS) has experienced exponential growth [25] [27] [46] due to these new technologies.
This paper addresses the current state of the art of Text Summarization. Section 2 gives an overview of the TS field and presents the factors related to it. Section 3 explains the different approaches to generating summaries. In section 4 we present a number of Text Summarization systems existing today. Section 5 presents the common measures to evaluate those systems, whereas section 6 exposes the
tendency adopted these days in Text Summarization. Finally, section 7 concludes this paper and discusses future work.
2 What is Text Summarization?
2.1 Definition and types
A summary can be defined as a text that is produced from one or more texts, that contains a significant portion of the information in the original text(s), and that is no longer than half of the original text(s) [23]. According to [39], text summarization is the process of distilling the most important information from a source (or sources) to produce an abridged version for a particular user (or users) and task (or tasks).
When this is done by means of a computer, i.e. automatically, we call this Automatic Text Summarization. Despite the fact that text summarization has traditionally been focused on text input, the input to the summarization process can also be multimedia information, such as images, video or audio, as well as on-line information or hypertexts. Furthermore, we can talk about summarizing a single document or multiple ones. In the latter case, the process is known as Multi-document Summarization (MDS), and the source documents can be in a single language (monolingual) or in different languages (translingual or multilingual).
The output of a summarization system may be an extract (i.e. when a selection of "significant" sentences of a document is performed) or an abstract, when the summary can serve as a substitute for the original document [15]. We can also distinguish between generic summaries and user-focused summaries (a.k.a. query-driven). The first type of summaries can serve as a surrogate of the original text, as they may try to represent all relevant features of a source text. They are text-driven and follow a bottom-up approach using IR techniques. User-focused summaries rely on a specification of a user information need, such as a topic or query. They follow a top-down approach using IE techniques.
Concerning the style of the output, a broad distinction is normally made between indicative summaries, which are used to indicate what topics are addressed in the source text and can give a brief idea of what the original text is about, and informative summaries, which are intended to cover the topics in the source text [40][46].
2.2 Process of Automatic Text Summarization
Traditionally, summarization has been decomposed into three main stages [23][40][53]. We will follow the Sparck Jones [53] approach, which is:
• interpretation of the source text to obtain a text representation,

• transformation of the text representation into a summary representation, and,
• finally, generation of the summary text from the summary representation.
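The three stages above can be sketched in code. The concrete choices below (sentences as units, term-frequency weights, a fixed compression ratio) are illustrative assumptions of ours; [53] leaves the representations themselves abstract.

```python
# A minimal sketch of the Sparck Jones three-stage pipeline.
from collections import Counter

def interpret(source_text):
    """Interpretation: turn the source text into a representation
    (here: sentences paired with their term-frequency vectors)."""
    sentences = [s.strip() for s in source_text.split(".") if s.strip()]
    return [(s, Counter(s.lower().split())) for s in sentences]

def transform(text_repr, ratio=0.5):
    """Transformation: condense the text representation into a
    summary representation (here: keep the top-weighted half)."""
    scored = sorted(text_repr, key=lambda p: -sum(p[1].values()))
    return scored[: max(1, int(len(scored) * ratio))]

def generate(summary_repr):
    """Generation: produce the summary text from the representation."""
    return ". ".join(s for s, _ in summary_repr) + "."

text = "Cats sleep a lot. Cats eat fish. Dogs bark."
print(generate(transform(interpret(text))))  # prints "Cats sleep a lot."
```

Real systems differ in what each stage produces; only the three-stage decomposition itself is part of the model.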
Effective summarizing requires an explicit and detailed analysis of context factors. Sparck Jones in [53] distinguishes three classes of context factors: input, purpose and output factors. We will describe them briefly. For further information, see [53].
• Input factors. The features of the text to be summarized crucially determine the way a summary can be obtained. These fall into three groups: text form (e.g. document structure), subject type (ordinary, specialized or restricted) and unit (single or multiple documents as input).
• Purpose factors. These are the most important factors. They fall under three categories: situation refers to the context within which the summary is to be used; audience (i.e. summary readers); and use (what is the summary for?).
• Output factors. In this class we can group: material (i.e. content); format; and style.
3 Approaches to Text Summarization
Although many different approaches to text summarization can be found in the literature [46], [55], in this paper we will only take into account the one proposed by Mani and Maybury (1999) [40]. This classification, based on the level of processing that each system performs, gives an idea of which traditional approaches exist, and can be a suitable reference point from which many techniques can be developed. Based on the traditional approaches, summarization can be characterized as approaching the problem at the surface, entity, or discourse levels [40].
Surface level
This approach tends to represent information using shallow features and then selectively combine them in order to obtain a salience function that can be used to extract information. Among these features, we have:
- Thematic features rely on word (significant word) occurrence statistics, so that sentences containing words that occur frequently in a text have higher weight than the rest. That means that these sentences are the important ones and they are hence extracted. Luhn (1958) [37], who used the term frequency technique in his work, followed this approach. Before computing term frequencies, a filtering task must be done using a stop-list which contains words such as pronouns, prepositions and articles. This is the classical statistical approach. However, from the point of view of a corpus-based approach, the tf*idf measure (commonly used in information retrieval) is very useful to determine keywords in a text.
- Location refers to the position in the text, paragraph or any other particular section, in the sense that certain positions tend to contain the target sentences to be included in the summary. This is usually genre-dependent, but there are two
basic general methods: the lead method and the title-based method. The first one consists of extracting only the first sentences, assuming that these are the most relevant ones, whereas the second considers that words in the headings or titles are positively relevant to summarization. Edmundson (1969) in [15] used this approach together with the cue-word method, which is explained later.
- Background assumes that the importance of meaning units is determined by the presence of terms from the title or headings, the initial part of the text, or a user's query.
- Cue words and phrases, such as "in conclusion", "important", "in this paper", etc., can be very useful to determine signals of relevance or irrelevance. These kinds of units can be detected automatically as well as manually.
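A salience function combining these surface features might look as follows. The stop-list, cue list and equal feature weighting are illustrative assumptions of ours, not the settings of any published system.

```python
# Sketch of a surface-level salience function combining word frequency
# (thematic), position (lead/location), title overlap, and cue phrases.
from collections import Counter

STOP_WORDS = {"the", "a", "of", "in", "and", "is", "are", "to"}
BONUS_CUES = {"in conclusion", "in this paper", "important"}

def salience(sentences, title):
    # Thematic feature: frequency of non-stop-list words across the text.
    freq = Counter(w for s in sentences
                   for w in s.lower().split() if w not in STOP_WORDS)
    title_words = set(title.lower().split()) - STOP_WORDS
    scores = []
    for i, s in enumerate(sentences):
        words = [w for w in s.lower().split() if w not in STOP_WORDS]
        thematic = sum(freq[w] for w in words) / (len(words) or 1)
        location = 1.0 - i / len(sentences)   # lead method: earlier is better
        title_overlap = len(title_words & set(words))
        cue = sum(c in s.lower() for c in BONUS_CUES)
        scores.append(thematic + location + title_overlap + cue)
    return scores

def extract(sentences, title, k=1):
    """Return the k most salient sentences, in document order."""
    scores = salience(sentences, title)
    top = sorted(range(len(sentences)), key=lambda i: -scores[i])[:k]
    return [sentences[i] for i in sorted(top)]
```

In practice, systems weight these features differently (often with weights tuned on a corpus) rather than summing them equally as here.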
Entity level
This approach attempts to build a representation of the text, modeling text entities and their relationships. The objective is to help to determine what is salient. These relations between entities include:
- Similarity occurs, for example, when two words share a common stem, i.e. whose form is similar. This can be extended to phrases or paragraphs. Similarity can be calculated by vocabulary overlap or with linguistic techniques.
- Proximity refers to the distance between text units. With that information it is possible to establish entity relations.
- Co-occurrence: meaning units can be related if they occur in common texts.
- Thesaural relationships among words can be described as relationships like synonymy, hypernymy, and part-of relations (meronymy).
- Coreference. The idea behind coreference is that referring expressions can be linked, so that coreference chains can be built with coreferring expressions.
- Logical relations such as agreement, contradiction, entailment, and consistency.

- Syntactic relations are based on parse trees.

- Meaning representation-based relations, establishing relations between entities in the text, as for example predicate-argument relations.
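The vocabulary-overlap similarity mentioned above can be sketched as a Dice coefficient over crudely stemmed word sets; the suffix-stripping "stemmer" here is an illustrative stand-in for a real one (e.g. a Porter stemmer).

```python
# Vocabulary-overlap similarity between two text units, using a crude
# suffix-stripping stem so that words sharing a stem count as matches.
def stem(word):
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def overlap_similarity(a, b):
    """Dice coefficient over the stemmed vocabularies of two units."""
    va = {stem(w) for w in a.lower().split()}
    vb = {stem(w) for w in b.lower().split()}
    if not va or not vb:
        return 0.0
    return 2 * len(va & vb) / (len(va) + len(vb))

sim = overlap_similarity("the parser parses sentences",
                         "a parser parsed the sentence")
```

Here "parses" and "parsed" both reduce to the stem "pars", so they contribute to the overlap even though their surface forms differ.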
Discourse level
The target of discourse level approaches is to model the global structure of the text and its relations in order to achieve communicative goals. The information that can be exploited at this level includes:
- Format of the document, such as hypertext markup or document outlines.

- Threads of topics as they are revealed in the text.
- Rhetorical structure of text, representing argumentative or narrative structure. The idea behind this deals with the possibility of building the coherence structure of a text, so that the 'centrality' of textual units will reflect their importance.
To apply all these methods, two different approaches can be taken: the techniques described so far can be developed by using linguistic knowledge or by applying machine learning techniques. The latter have a support role, for example, in identifying the information to be applied at specific process stages such as interpretation or generation (for instance, they seem useful for training output sentence order).
4 Text Summarization Systems
The approaches presented so far are examples of pure techniques to apply in order to develop summarization systems. The predominant tendency in current systems is to adopt a hybrid approach and combine and integrate some of the techniques mentioned before (e.g. the cue phrases method combined with position and word frequency based methods in [24], or position, length and weight of sentences combined with the similarity of these sentences with the headline in [21]). As we have given a general overview of the classical techniques used in summarization, and there is a large number of different techniques and systems, we are going to describe in this
section only a few of them briefly, considering systems as wholes. However, in table 1 some more systems are shown, as well as their main features. To finish this section, the most recent approaches concerning summarization are mentioned.
The systems which have been selected to be described in more detail are the following: MEAD [51], WebInEssence [50], NeATS [36], GISTexter [20], and NetSum [54].
- MEAD [51]: This system was developed at the University of Michigan in 2001. It can produce both single and multi-document extractive summaries. The idea behind it is the use of the centroid-based feature. Moreover, two more features are used: position and overlap with the first sentence. Then, the linear combination of the three determines what sentences are most salient to include in the summary.
The system works as follows: first of all, MEAD uses the CIDR Topic Detection and Tracking system to identify all the articles related to an emerging event. CIDR produces a set of clusters. From each cluster a centroid is built. Then, for each sentence, three values are computed: the centroid score, which measures how close the sentence is to the centroid; the position score, which indicates how far the sentence is from the beginning of a document; and finally, the overlap with the first sentence or title of the document, obtained by calculating
tf*idf between the given sentence and the first one. Then all these measures are normalized, and sentences which are too similar to others are discarded (based on a cosine similarity measure). Any sentence that has not been discarded can be included in the summary.
- WebInEssence [50]: This system was also developed at the University of Michigan in 2001. It is more than a summarization system: it is a search engine that summarizes clusters of related Web pages, providing more contextual and summary information to help users explore retrieval results more efficiently. A version of MEAD [51] was used in the development of this Web-based summarizer, so the features used to produce extracts are the same as the ones used in MEAD. The overall architecture of the system can be decomposed into two main stages: the first one behaves as a Web spider that collects URLs from the Internet and then groups the URLs into clusters. The second main stage is to create a multi-document summary from each cluster using the MEAD centroid algorithm.
- NeATS [36] was first developed in 2001 by the University of Southern California's Information Sciences Institute. It is tailored to the genre of newspaper news. Its architecture consists of three main components: content selection, content filtering and content presentation. The goal of content selection is to identify important concepts mentioned in a document collection. The techniques used at this stage are term frequency, topic signatures and term clustering. For content filtering three different filters are used: sentence position, stigma words and a redundancy filter. To achieve the latter, NeATS uses a simplified version of the MMR [9] algorithm. To ensure coherence of the summary, NeATS outputs the final sentences in their chronological order. From this system, iNeATS [35], i.e., an interactive multi-document summarization system that provides user control over the summarization process, was later developed.
- GISTexter [20]: This system was developed in 2002 and produces single and multi-document extracts and abstracts by template-driven IE. The system performs differently depending on whether it is working with single-document or multi-document summarization. For single documents, the most relevant sentences are extracted and compressed by rules learned from a corpus of human-written abstracts. In the final stage, reduction is performed to trim the whole summary to the length of 100 words. When multi-document summarization has to be done, the system, based on Information Extraction (IE) techniques, uses IE-style templates, either from a prior
set (if the topic is well-known) or by ad-hoc generation (if it is unknown). The templates generated by the CICERO IE system are then mapped onto text snippets from the texts, in which anaphoric expressions are resolved. These text snippets can be used to generate coherent, informative multi-document summaries.
- NetSum [54]. Different from the other approaches previously shown, NetSum, developed in 2007 by Microsoft Research, focuses on single-document instead of multi-document summarization. The system produces fully automated single-document extracts of newswire articles based on neural networks. It uses machine learning techniques in this way: a training set is labeled so that the labels identify the best sentences. Then a set of features is extracted from each sentence in the training and test sets, and the training set is used to train the system. The system is then evaluated on the test set. The system learns from the training set the distribution of features for the best sentences and outputs a ranked list of sentences for each document. Sentences are ranked using the RankNet algorithm [8].
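The MEAD scoring and redundancy-filtering scheme described above can be sketched as follows. The equal feature weights and the 0.7 similarity threshold are illustrative assumptions, not MEAD's actual parameters.

```python
# Simplified MEAD-style scoring: each sentence gets a linear combination
# of centroid, position, and first-sentence-overlap scores; a cosine
# filter then drops near-duplicate sentences.
import math
from collections import Counter

def cosine(c1, c2):
    """Cosine similarity between two term-frequency vectors."""
    num = sum(c1[t] * c2[t] for t in set(c1) & set(c2))
    den = (math.sqrt(sum(v * v for v in c1.values()))
           * math.sqrt(sum(v * v for v in c2.values())))
    return num / den if den else 0.0

def mead_scores(sentences, centroid, w=(1.0, 1.0, 1.0)):
    vecs = [Counter(s.lower().split()) for s in sentences]
    n = len(sentences)
    return [w[0] * cosine(v, centroid)        # centroid score
            + w[1] * (n - i) / n              # position score
            + w[2] * cosine(v, vecs[0])       # overlap with first sentence
            for i, v in enumerate(vecs)]

def redundancy_filter(sentences, ranked, threshold=0.7):
    """Keep sentences in ranked order, dropping near-duplicates."""
    vecs = [Counter(s.lower().split()) for s in sentences]
    kept = []
    for i in ranked:
        if all(cosine(vecs[i], vecs[j]) <= threshold for j in kept):
            kept.append(i)
    return kept
```

The kept sentences would finally be emitted in document order to form the extract.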
After the brief description of the former systems, the reader can take a look at table 1, where a few more systems can be found. In order to understand what each column means, the following information is provided. In the first column (SYSTEM [REF.], YEAR) the name of the system with its reference and year is written; the second column (# INPUTS) distinguishes between single-document and multi-document summarization (both values are also possible, meaning the system can handle both inputs). The next column (DOMAIN SYSTEM) indicates the genre of the input, that is, whether the system is designed for domain-specific topics or for a non-restricted domain. The fourth column (FEATURES) lists the main characteristics and techniques used in each system, and finally, the last column (OUTPUT) indicates whether the summary generated is an extract or an abstract (with its variants; for some particular systems both values are also possible).
Table 1: Text Summarization Systems
SYSTEM [REF.], YEAR
# INPUTS DOMAIN SYSTEM
FEATURES OUTPUT
Luhn [37], 1958 single-document
domain-specific (technical articles)
- term filtering and word frequency counting are carried out (low-frequency terms are removed)
- sentences are weighted by the significant terms they contain
- sentence segmentation and extraction is performed
extracts1
Edmundson[15], 1969
single-document
domain-specific(articles)
- techniques used in this approach are: word frequency, cue phrases, title and heading words and sentence location
- it uses a corpus-based methodology
extracts
ADAM [48],1975
single-document
domain-specific(chemistry)
- Cue phrases
- term frequencies
- sentence selection and rejection
indicative abstracts2
ANES [7], 1995 single-document
domain-specific(news)
- term and sentence weighting (tf*idf)
- non-anaphora resolution
- first sentences are added to the summary
extracts
Barzilay & Elhadad [4], 1997
single-document
unknown
- topic identification of the text by grouping words into lexical chains3
- sentence extraction is helped by strong chains identification
- non-anaphora resolution
extracts
1 Although the output in Luhn's work is called abstract, it is more correct to say extract, as sentences from the source document take part in the summary.

2 Output sentences are edited to produce something different from the original ones, but no new sentences are generated.

3 Lexical chains are sequences of related terms grouped together by text cohesion relationships (e.g. synonymy or holonymy).
Boguraev & Kennedy [6], 1997
single-document
domain-independent
- linguistic techniques to identify salient phrasal units (topic stamps)
- content characterisation methods to reflect global context (capsule overview)
- anaphora resolution
capsule overview4
DimSum [3],1997
single-document
unknown
- it uses corpus-based statistical NLP techniques
- multi-word phrases are extracted automatically
- conceptual representation of the text is performed
- discourse features of lexical items within a text (name aliases, synonyms, and morphological variants) are exploited
extracts
Marcu [41],1997
single-document
domain-specific(news)
- it uses text coherence models
- RST5 trees are built
- this kind of representation is useful to determine the most important units in a text
extracts
SUMMARIST[24], 1998
single-document
domain-specific(news)
- symbolic concept-level world knowledge is combined with NLP techniques
- the stages for summarization are: topic identification, interpretation and generation
- it is a multi-lingual system
extracts
4 A capsule overview is not a conventional summary, i.e. it does not attempt to output document content as a sequence of sentences. It is a semi-formal representation of the document.

5 Rhetorical Structure Theory
SUMMONS6 [44], 1998
multi-document
domain-specific (online news)
- its input is a set of templates from the Message Understanding Conference7
- key sentences from an article are extracted using statistical techniques and measures
- planning operators8 such as contradiction, agreement or superset are used to synthesize a single article
extractsandabstracts
FociSum [28],1999
single-document
domain-independent
- it merges information extraction (IE) with sentence extraction techniques
- the topic of the text (called foci in this system) is determined dynamically from named entities and multiword terms
extracts
MultiGen [5],1999
multi-document
domain-specific(news)
- it identifies and synthesizes similar elements across related texts from a set of multiple documents
- it is based on information fusion and reformulation
- sets of similar sentences (themes) are extracted
abstracts
Chen & Lin[26], 2000
multi-document
domain-specific(news)
- it produces multilingual (only English and Chinese) news summaries
- monolingual and multilingual clustering is done
- meaning unit detection, such as topic chains or linking elements, is performed
- similarity between meaning units is measured
extracts
6 SUMMarizing Online NewS articles

7 http://www-nlpir.nist.gov/related projects/muc/index.html

8 for more detailed information see [44]
CENTRIFUSER[30] [29], 2001
multi-document
domain-specific (health-care articles)
- it produces query-driven summaries
- clustering is applied by the SIMFINDER tool
- it is based on a document topic tree (each individual document is represented by a tree data structure)
- composite topic trees are designed (they carry topic information for all articles)
- query mapping using a similarity function enriched with structural information from the topic trees is done
extracts
Cut & Paste[22], 2001
single-document
domain-independent
- it uses sentence reduction and sentence combination techniques
- key sentences are identified by a sentence extraction algorithm that covers these techniques: lexical coherence, tf*idf score, cue phrases and sentence position
abstracts9
MEAD [51],2001
multi-document
domain-specific(news)
- it is based on sentence extraction through the features: centroid score, position and overlap with the first sentence
- sentences too similar to others are discarded
- experiments with CST10 and cross-document subsumption have been made
extracts
9 In this case, summaries are generated by reformulating the text of the original document.

10 Cross-Document Structural Relationships proposes a taxonomy of the informational relationships between documents in clusters of related documents. This concept is similar to Rhetorical Structure Theory (RST).
NeATS11 [36],2001
multi-document
domain-specific(news)
- to select important content it uses: sentence position, term frequency, topic signature, term clustering
- to avoid redundancy it employs the MMR12 technique [9]
- to improve cohesion and coherence, stigma words and time stamps are used
extracts
NewsInEssence[49], 2001
multi-document
domain-specific (online news)
- clusters are built through the CIDR topic detection and tracking component
- it is based on the Cross-document Structure Theory (CST)
- its summaries are produced by MEAD [51]
personalizedextracts
WebInEssence[50], 2001
multi-document
domain-independent
- it is a Web-based summarization and recommendation system
- it employs a centroid-based sentence extraction technique
- it uses similar techniques to NewsInEssence [49], but applied to Web documents
extractsandpersonalizedsummaries
COLUMBIAMDS [43], 2002
multi-document
domain-specific(news)
- it is a composite system that uses different summarizers depending on the input: MultiGen for single events, or DEMS13 for multiple events or biographical documents
- statistical techniques are used
extractsandabstracts
11 Next Generation Automated Text Summarization

12 Maximal Marginal Relevance

13 Dissimilarity Engine for Multidocument Summarization
Copeck et al.[12], 2002
single-document
domain-specific (biographies)
- it uses machine learning techniques
- keyphrases are extracted and ranked
- the text is segmented according to sentences that talk about the same topic
extracts
GISTexter [20],2002
single and multi-document
domain-specific(news)
- for single-document summarization, it extracts key sentences automatically using the technique of single-document decomposition
- for multi-document summaries, it relies on the CICERO IE system to extract relevant information by applying templates that are determined by the topic of the collection
extractsandabstracts
GLEANS [13],2002
multi-document
unknown
- it performs document mapping to obtain a database-like representation that makes explicit their main entities and relations
- each document in the collection is classified into one of these categories: single person, single event, multiple event and natural disaster
- the system is IE-based
headlines,extractsandabstracts
NTT [21], 2002 single-document
unknown
- it employs the Support Vector Machine (SVM) machine learning technique to classify a sentence as relevant or non-relevant
- it also uses the following features to describe a sentence: position, length, weight, similarity with the headline and presence of certain verbs or prepositions
extracts
Karamuftuoglu[31], 2002
single-document
unknown
- it is based on the extract-reduce-organize paradigm
- as a pattern matching method it uses lexical links and bonds14
- sentences are selected by the SVM technique
extracts
Kraaij et al.[33], 2002
multi-document
unknown
- it is based on probabilistic methods: sentence position, length, cue phrases
- for headline generation, noun phrases are extracted
headlinesandextracts
Lal & Reuger[34], 2002
single-document
unknown
- it is built within the GATE15 framework
- it uses a simple Bayes classifier to extract sentences
- it resolves anaphora using GATE's ANNIE16 module
extracts
Newsblaster[42], 2002
multi-document
domain-specific (online news)
- news articles are clustered using Topic Detection and Tracking (TDT)
- it is a composite summarization system (it uses different strategies depending on the type of documents in each cluster)
- it uses similar techniques to [43]
- thumbnails of images are displayed
extracts
SumUM [52],2002
single-document
domain-specific (technical articles)
- shallow syntactic and semantic analysis
- concept identification and relevant information extraction
- summary representation construction and text regeneration
abstracts
14 A lexical link between two sentences is a word that appears in both sentences. When two or more lexical links between a pair of sentences occur, a lexical bond between them is constituted.

15 General Architecture for Text Engineering, University of Sheffield

16 A Nearly New Information Extraction System
Alfonseca & Rodríguez [1], 2003
single-document
domain-specific(articles)
- relevant sentence identification (with a genetic algorithm)
- relevant words and phrases from identified sentences are extracted
- coherence is kept in the output
very shortextracts(10 words)
GISTSumm[47], 2003
single-document
domain-independent
- it is based on the gist17 of the source text
- it uses statistical measures (keywords) to determine what the gist sentence is
- by means of the gist sentence, it is possible to build coherent extracts
extracts
K.U. Leuven[2], 2003
single and multi-document
unknown
- topic segmentation and clustering techniques are used for the multi-document task
- topic segmentation, sentence scoring (weight, position, proximity to the topic) and compression are used for single-document summarization
extracts
Univ. Lethbridge [10], 2003
single and multi-document
unknown
- it performs topic segmentation of the text
- it computes lexical chains for each segment
- sentence extraction techniques are performed
- it uses heuristics to do some surface repairs to make summaries coherent and readable
extracts
LAKE [14], 2004 single-document
domain-specific(news)
- it exploits a keyphrase extraction methodology to identify relevant terms in the document
- it is based on a supervised learning approach
- it considers linguistic features like named entity recognition or multiword terms
very shortextracts
17 The most important passage of the source text
MSR-NLP Summarizer [56], 2004
multi-document
domain-specific(news)
- its objective is to identify important events as opposed to entities
- it uses a graph-scoring algorithm to identify highly weighted nodes and relations
- summaries are generated by extracting and merging portions of logical forms
extracts
CATS [16], 2005 multi-document
domain-specific(news)
- it analyzes which information in the document is important in order to include it in the summary
- it is an Answering Text Summarizer
- statistical techniques are used to compute a score for each sentence; temporal expressions and redundancy are also resolved
extracts
CLASSY [11],2005
multi-document
domain-specific(news)
- it is a query-based system
- it is based on a Hidden Markov Model algorithm for sentence scoring and selection
- it classifies sentences into two sets: those belonging to the summary and those which do not
extracts
QASUM-TALP [17],2005
multi-document
domain-specific(news)
- it is a query-driven system
- summary content is selected from a set of candidate sentences in relevant passages
- summaries are produced in four steps: (1) collection pre-processing, (2) question generation, (3) relevant information extraction, and (4) summary content selection
extracts
ERRS [57], 2007 single and multi-document
domain-specific(news)
- it is basically a heuristic-based system
- all kinds of summaries are generated with the same data structure: the Fuzzy Coreference Cluster Graph
extracts
FemSum [18],2007
single and multi-document
domain-specific(news)
- the system is aimed at providing answers to complex questions
- summaries are produced taking into account a syntactic and a semantic representation of the sentences
- it uses a graph representation to establish relations between candidate sentences
- it is composed of three language-independent components: RID (Relevant Information Detector), CE (Content Extractor), SC (Summary Composer)
extracts
GOFAISUM[19], 2007
multi-document
domain-specific(news)
- it is based only on a symbolic approach
- basically, the techniques used are tf·idf and syntactic pruning
- the sentences with the highest scores are selected to build the summary
extracts
NetSum [54],2007
single-document
domain-specific(news)
- it is based on machine learning techniques to generate summaries
- it uses a neural network algorithm to enhance sentence features
- the three sentences that best match the document's highlights are extracted
extracts
From the systems described above, it is possible to notice that each system follows a different methodology to produce summaries. Furthermore, the inputs and the genre can be different too. That gives us an idea of how developed the state of the art is and of the number of different approaches that exist to tackle this field of research.
In the latest ACL conference (ACL'07)18, attempts to summarize entire books [45] have been made. The argument to support this idea is that most studies have been focused on short documents, especially on news reports, and very little effort has been devoted to the summarization of long documents, like books. Generating book summaries can be very useful for the reader to choose or discard a book just by looking at the extract or abstract of that book. Similarly, in the previous year's European ACL conference (EACL'06)19, summarization of short fiction stories was also investigated [32], arguing that summarization is the key issue in determining whether to read a whole story or not.
Techniques employed in recent years are very similar to the classical ones, but they have to be adapted to each particular kind of system and its objectives. Improvements in machine learning techniques have allowed
18The 45th Annual Meeting of the Association for Computational Linguistics was held in Prague, Czech Republic, June 23rd-30th, 2007, http://ufal.mff.cuni.cz/acl2007/
19The 11th Conference of the European Chapter of the Association for Computational Linguistics was held in Trento, Italy, April 3rd-7th, 2006, http://eacl06.itc.it/
them to be used to train and develop summarization systems as well. NetSum [54], presented at ACL'07, is an example of a system that uses machine learning algorithms to perform summarization.
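The machine-learning view of extractive summarization can be sketched as learning, from labelled examples, a weighting over simple sentence features. The perceptron and the three features below are illustrative assumptions on our part, not NetSum's actual RankNet-based model:

```python
def sentence_features(sentence, position, total, title_words):
    """Three features commonly used by trainable extractive summarizers:
    sentence position, length, and overlap with the document title."""
    words = set(sentence.lower().split())
    return [
        1.0 - position / max(total - 1, 1),                   # earlier = higher
        min(len(words) / 20.0, 1.0),                          # capped length
        len(words & title_words) / max(len(title_words), 1),  # title overlap
    ]

def train_perceptron(examples, labels, epochs=50, lr=0.1):
    """Tiny perceptron trained on (feature vector, in-summary?) pairs;
    returns the learned weights and bias."""
    w = [0.0] * len(examples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(examples, labels):
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            if y != pred:
                w = [wi + lr * (y - pred) * xi for wi, xi in zip(w, x)]
                b += lr * (y - pred)
    return w, b
```

At summarization time, each sentence of an unseen document is mapped to the same feature vector and the highest-scoring sentences form the extract; real systems use richer features and stronger learners, but the pipeline is the same.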
5 Measures of Evaluation
Methods for evaluating automatic text summarization can be classified into two categories: intrinsic and extrinsic [38]. Intrinsic methods measure a system's performance on its own, whereas extrinsic methods evaluate whether summaries are good enough to accomplish some other specific task, e.g. filtering in information retrieval or report generation. The assessment of a summary can be carried out in different ways; several examples, such as the Shannon Game or the Question Game, can be found in [23]. In summary evaluation programmes such as SUMMAC20, DUC21 or NTCIR22, automatically generated summaries (extracts or abstracts) are evaluated mostly intrinsically against human reference or gold-standard summaries (ideal summaries). The problem is to establish what an ideal summary is. Humans know how to sum up the most important information of a text, but different experts may disagree on which information is the most important to extract. Automatic evaluation programmes have therefore been developed to try to give
20http://www-nlpir.nist.gov/related_projects/tipster_summac/
21http://duc.nist.gov
22http://research.nii.ac.jp/ntcir/
an objective point of view on evaluation. Systems such as SEE23, ROUGE24 or BE25 have been created to assist with this task.
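The idea behind n-gram-overlap evaluation systems such as ROUGE can be illustrated with a simplified ROUGE-N recall score against a single reference summary. This is a sketch only; the official ROUGE toolkit adds stemming, stopword-removal options, multiple references, and longest-common-subsequence variants:

```python
from collections import Counter

def rouge_n_recall(candidate, reference, n=1):
    """Simplified ROUGE-N: the fraction of the reference summary's
    n-grams that also appear in the candidate summary (clipped counts)."""
    def ngrams(text):
        words = text.lower().split()
        return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    if not ref:
        return 0.0
    # Clip each n-gram's credit at its count in the candidate.
    overlap = sum(min(count, cand[g]) for g, count in ref.items())
    return overlap / sum(ref.values())
```

Being recall-oriented, the score rewards candidates that cover the reference's content; precision- and F-measure variants follow the same counting scheme.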
6 The Evolution of Text Summarization Approaches
Throughout recent years, summarization has undergone a remarkable evolution. Thanks to the evaluation programmes that take place every year, the field of Text Summarization has improved considerably. For example, the tasks performed in the Document Understanding Conferences (DUC) have moved from simple tasks to more complex ones. At the beginning, efforts concentrated on generating simple extracts from single documents, usually in English. Lately, the trend has evolved towards more sophisticated summaries, such as abstracts produced from a number of documents rather than a single one, and in a variety of languages. Different tasks have been introduced year after year, so that, apart from the general main task, one can find tasks consisting of producing summaries in response to a specific question or user need, or of generating a summary from updated news. Finally, concerning evaluation, the tendency has moved towards extrinsic evaluation, i.e. how useful a summary is in helping other tasks, rather
23Summarization Evaluation Environment,http://www.isi.edu/publications/licensed-sw/see/
24Recall-Oriented Understudy for Gisting Evaluation, http://haydn.isi.edu/ROUGE
25Basic Elements, http://haydn.isi.edu/BE/
than intrinsic evaluation. However, intrinsic evaluation is also important for measuring linguistic quality and content responsiveness, so manual evaluation by humans is still performed, together with automatic evaluation systems such as BE, SEE or ROUGE, introduced in the previous section. The evolution of summarization systems is not over yet: much effort is still required to achieve high-quality summaries, whether extracts or abstracts.
7 Conclusion and Future Work
In this paper, we have presented a general overview of automatic text summarization. The status and state of automatic summarizing have changed radically over the years. The field has especially benefited from work on other tasks, e.g. information retrieval, information extraction or text categorization. Research in this field will continue, since the text summarization task is far from solved and there is still much to investigate and improve. We have covered the definition, types, different approaches and evaluation methods, as well as the features and techniques of summarization systems already developed. In the future, we plan to contribute to this field by improving the quality of summaries and by studying the influence of techniques from neighbouring tasks on summarization.
References
[1] Alfonseca, E., Rodríguez, P. Description of the UAM system for generating very short summaries at DUC-2003. In HLT/NAACL Workshop on Text Summarization / DUC 2003, 2003.
[2] Angheluta, R., Moens, M. F., De Busser, R. K.U.Leuven Summarization System. In DUC 2003, Edmonton, Alberta, Canada, 2003.
[3] Aone, C., Okurowski, M. E., Gorlinsky, J., et al. A Scalable Summarization System Using Robust NLP. In Proceedings of the ACL/EACL'97 Workshop on Intelligent Scalable Text Summarization, pages 66–73, 1997.
[4] Barzilay, R., Elhadad, M. Using Lexical Chains for Text Summarization. In Inderjeet Mani and Mark Maybury, editors, Advances in Automatic Text Summarization. MIT Press, 1999.
[5] Barzilay, R., McKeown, K., Elhadad, M. Information fusion in the context of multi-document summarization. In Proceedings of ACL 1999, 1999.
[6] Boguraev, B., Kennedy, C. Salience-based content characterisation of text documents. In Proceedings of the ACL'97 Workshop on Intelligent, Scalable Text Summarization, pages 2–9, Madrid, Spain, 1997.
[7] Brandow, R., Mitze, K., Rau, L. F. Automatic condensation of electronic publications by sentence selection. Information Processing & Management, 31(5):675–685, 1995.
[8] Burges, C., Shaked, T., Renshaw, E.,et al. Learning to rank using gradi-ent descent. In ICML ’05: Proceedings
of the 22nd international conference onMachine learning, pages 89–96, NewYork, NY, USA, 2005. ACM.
[9] Carbonell, J., Goldstein, J. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In SIGIR '98: Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, pages 335–336, New York, NY, USA, 1998. ACM.
[10] Chali, Y., Kolla, M., Singh, N., et al. The University of Lethbridge Text Summarizer at DUC 2003. In HLT/NAACL Workshop on Text Summarization / DUC 2003, 2003.
[11] Conroy, J. M., Schlesinger, J. D., Gold-stein, J. CLASSY Query-Based Multi-Document Summarization. In the Doc-ument Understanding Workshop (pre-sented at the HLT/EMNLP AnnualMeeting), Vancouver, B.C., Canada,2005.
[12] Copeck, T., Szpakowicz, S., Jap-kowic, N. Learning How Best to Sum-marize. In Workshop on Text Summa-rization (In conjunction with the ACL2002 and including the DARPA/NISTsponsored DUC 2002 Meeting on TextSummarization), Philadelphia, 2002.
[13] Daume III, H., Echihabi, A., Marcu, D., et al. GLEANS: A Generator of Logical Extracts and Abstracts for Nice Summaries. In Workshop on Text Summarization (In conjunction with the ACL 2002 and including the DARPA/NIST sponsored DUC 2002 Meeting on Text Summarization), Philadelphia, 2002.
[14] D’Avanzo, E., Magnini, B., Vallin, A.Keyphrase Extraction for Summariza-tion Purposes: The LAKE System atDUC-2004. In the Document Un-derstanding Workshop (presented atthe HLT/NAACL Annual Meeting),Boston, USA, 2004.
[15] Edmundson, H. P. New Methods in Automatic Extracting. In Inderjeet Mani and Mark Maybury, editors, Advances in Automatic Text Summarization. MIT Press, 1999.
[16] Farzindar, A., Rozon, F., Lapalme, G.CATS A Topic-Oriented Multi-Document Summarization System atDUC 2005. In the Document Un-derstanding Workshop (presented atthe HLT/EMNLP Annual Meeting),Vancouver, B.C., Canada, 2005.
[17] Fuentes, M., Gonzalez, E., Ferres, D., et al. QASUM-TALP at DUC 2005 Automatically Evaluated with a Pyramid Based Metric. In the Document Understanding Workshop (presented at the HLT/EMNLP Annual Meeting), Vancouver, B.C., Canada, 2005.
[18] Fuentes, M., Rodrıguez, H., Fer-res, D. FEMsum at DUC 2007. Inthe Document Understanding Work-shop (presented at the HLT/NAACL),Rochester, New York USA, 2007.
[19] Gotti, F., Lapalme, G., Nerima, L., et al. GOFAISUM: A Symbolic Summarizer for DUC. In the Document
Understanding Workshop (presented atthe HLT/NAACL), Rochester, NewYork USA, 2007.
[20] Harabagiu, S., Lacatusu, F. Generat-ing Single and Multi-Document Sum-maries with GISTEXTER. In Work-shop on Text Summarization (In Con-junction with the ACL 2002 and in-cluding the DARPA/NIST sponsoredDUC 2002 Meeting on Text Summa-rization) Philadelphia, Pennsylvania,USA, 2002.
[21] Hirao, T., Sasaki, Y., Isozaki, H., etal. NTT’s Text Summarization sys-tem for DUC-2002. In Workshopon Text Summarization (In conjunc-tion with the ACL 2002 and includ-ing the DARPA/NIST sponsored DUC2002 Meeting on Text Summariza-tion), Philadelphia, 2002.
[22] Jing, H. Cut-and-paste text summarization. PhD thesis, 2001. Adviser: Kathleen R. McKeown.
[23] Hovy, E. H. Automated Text Summa-rization. In R. Mitkov (ed), The Ox-ford Handbook of Computational Lin-guistics, chapter 32, pages 583–598.Oxford University Press, 2005.
[24] Hovy, E., Lin, C. Y. Automated TextSummarization in SUMMARIST. InInderjeet Mani and Mark Marbury,editors, Advances in Automatic TextSummarization. MIT Press, 1999.
[25] Hovy, E., Lin, C. Y., Zhou, L., et al. Automated Summarization Evaluation with Basic Elements. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC). Genoa, Italy, 2006.
[26] Hsin-Hsi, C., Chuan-Jie, L. A multi-lingual news summarizer. In Proceed-ings of the 18th conference on Com-putational linguistics, pages 159–165,Morristown, NJ, USA, 2000. Associa-tion for Computational Linguistics.
[27] Jackson, P., Moulinier, I. Naturallanguage processing for online appli-cations. John Benjamins PublishingCompany, 2002.
[28] Kan, M. Y., McKeown, K. Information extraction and summarization: Domain independence through focus types. Technical report, Computer Science Department, Columbia University.
[29] Kan, M. Y., McKeown, K. R., Kla-vans, J. L. Applying Natural Lan-guage Generation to Indicative Sum-marization. In the 8th European Work-shop on Natural Language Generation,Toulouse, France, 2001.
[30] Kan, M. Y., McKeown, K. R., Kla-vans, J. L. Domain-specific informa-tive and indicative summarization forinformation retrieval. In the Docu-ment Understanding Conference, NewOrleans, USA, 2001.
[31] Karamuftuoglu, M. An approach to summarization based on lexical bonds. In Workshop on Text Summarization (In conjunction with the ACL 2002 and including the DARPA/NIST sponsored DUC 2002 Meeting on Text Summarization), Philadelphia, 2002.
[32] Kazantseva, A. An Approach to Sum-marizing Short Stories. In Proceedingsof the Student Research Workshop atthe 11th Conference of the EuropeanChapter of the Association for Com-putational Linguistics, 2006.
[33] Kraaij, W., Spitters, M, Hulth, A.Headline extraction based on a com-bination of uni- and multidocumentsummarization techniques. In Work-shop on Text Summarization (In con-junction with the ACL 2002 and in-cluding the DARPA/NIST sponsoredDUC 2002 Meeting on Text Summa-rization), Philadelphia, 2002.
[34] Lal, P., Rueger, S. Extract-based Summarization with Simplification. In Workshop on Text Summarization (In conjunction with the ACL 2002 and including the DARPA/NIST sponsored DUC 2002 Meeting on Text Summarization), Philadelphia, 2002.
[35] Leuski, A., Lin, C. Y., Hovy, E.iNEATS: Interactive Multi-documentSummarization. In ACL’03, 2003.
[36] Lin, C. Y., Hovy, E. Automated Multi-document Summarization in NeATS.In Proceedings of the Human Lan-guage Technology (HLT) Conference.San Diego, CA, 2001.
[37] Luhn, H. P. The Automatic Creation of Literature Abstracts. In Inderjeet Mani and Mark Maybury, editors, Advances in Automatic Text Summarization. MIT Press, 1999.
[38] Mani, I. Summarization evaluation: An overview. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL) Workshop on Automatic Summarization, 2001.
[39] Mani, I., House, D., Klein, G., et al. The TIPSTER SUMMAC Text Summarization Evaluation. In Proceedings of EACL, 1999.
[40] Mani, I., Maybury, M. T., Ed. Ad-vances in Automatic Text Summariza-tion. The MIT Press, 1999.
[41] Marcu, D. Discourse Trees Are Good Indicators of Importance in Text. In Inderjeet Mani and Mark Maybury, editors, Advances in Automatic Text Summarization. MIT Press, 1999.
[42] McKeown, K., Barzilay, R., Evans, D., et al. Tracking and summarizing news on a daily basis with Columbia's Newsblaster. In Proceedings of the Human Language Technology (HLT) Conference. San Diego, CA, 2002.
[43] McKeown, K., Evans, D., Nenkova, A.,et al. The Columbia Multi-DocumentSummarizer for DUC 2002. In Work-shop on Text Summarization (In con-junction with the ACL 2002 and in-cluding the DARPA/NIST sponsoredDUC 2002 Meeting on Text Summa-rization), Philadelphia, 2002.
[44] McKeown, K., Radev, D. R. Generating Summaries of Multiple News Articles. In Inderjeet Mani and Mark Maybury, editors, Advances in Automatic Text Summarization. MIT Press, 1999.
[45] Mihalcea, R., Ceylan, H. Explorationsin Automatic Book Summarization. InProceedings of the Joint Conferenceon Empirical Methods in Natural Lan-guage Processing and ComputationalNatural Language Learning (EMNLP-CoNLL), pages 380–389, 2007.
[46] Padro Cirera, L., Fuentes, M.J.,Alonso, L., et al. Approaches to TextSummarization: Questions and An-swers. Revista Iberoamericana de In-teligencia Artificial, ISSN 1137-3601,(22):79–102, 2004.
[47] Pardo, T., Rino, L., Nunes, M. Gist-Summ: A summarization tool based ona new extractive method. In 6th Work-shop on Computational Processing ofthe Portuguese Language – Writtenand Spoken. Number 2721 in Lec-ture Notes in Artificial Intelligence,Springer (2003) 210–218, 2003.
[48] Pollock, J. J., Zamora, A. Automatic Abstracting Research at Chemical Abstracts. In Inderjeet Mani and Mark Maybury, editors, Advances in Automatic Text Summarization. MIT Press, 1999.
[49] Radev, D., Blair-Goldensohn, S.,Zhang, Z., et al. NewsInEssence:a system for domain-independent,real-time news clustering and multi-document summarization. In HLT’01: Proceedings of the first interna-tional conference on Human languagetechnology research, pages 1–4, Mor-ristown, NJ, USA, 2001. Associationfor Computational Linguistics.
[50] Radev, D., Fan, W., Zhang, Z. WebInEssence: A personalized web-based multi-document summarization and recommendation system. In NAACL Workshop on Automatic Summarization, Pittsburgh, 2001.
[51] Radev, D., Blair-Goldensohn, S., Zhang, Z. Experiments in Single and Multi-Document Summarization using MEAD. In First Document Understanding Conference, New Orleans, LA, 2001.
[52] Saggion, H., Lapalme, G. Generating Indicative-Informative Summaries with SumUM. Computational Linguistics, 28(4), 2002.
[53] Sparck Jones, K. Automatic summarizing: factors and directions. In Inderjeet Mani and Mark Maybury, editors, Advances in Automatic Text Summarization. MIT Press, 1999.
[54] Svore, K., Vanderwende, L., Burges, C.Enhancing Single-Document Summa-rization by Combining RankNet andThird-Party Sources. In Proceed-ings of the Joint Conference on Em-pirical Methods in Natural LanguageProcessing and Computational NaturalLanguage Learning (EMNLP-CoNLL),pages 448–457, 2007.
[55] Tucker, R. Automatic summarisingand the CLASP system. PhD thesis,1999. Cambridge.
[56] Vanderwende, L., Banko, M.,Menezes, A. Event-Centric Sum-mary Generation. In the DocumentUnderstanding Workshop (presented
at the HLT/NAACL Annual Meeting),Boston, USA, 2004.
[57] Witte, R., Krestel, R., Bergler, S. Gen-erating Update Summaries for DUC2007. In the Document Under-standing Workshop (presented at theHLT/NAACL), Rochester, New YorkUSA, 2007.