TEXT SUMMARIZATION: AN OVERVIEW∗
Elena Lloret
Dept. Lenguajes y Sistemas Informaticos
Universidad de Alicante, Alicante, Spain
Abstract
This paper presents an overview of Text Summarization. Text Summarization is a challenging problem these days. Due to the great amount of information we are provided with, and thanks to the development of Internet technologies, the need to produce summaries has become more and more widespread. Summarization is a very interesting and useful task that gives support to many other tasks, and it also takes advantage of the techniques developed for related Natural Language Processing tasks. The paper we present here may help the reader get an idea of what Text Summarization is and what it can be useful for.
Keywords: automatic text summarization; extracts and abstracts
∗This paper has been supported by the Spanish Government under the project TEXT-MESS (TIN2006-15265-C06-01)
1 Introduction
The World Wide Web has brought us a vast amount of on-line information. Due to this fact, every time someone searches for something on the Internet, the response obtained is a large number of different Web pages with a great deal of information, which is impossible for a person to read completely. Although the attempts to generate automatic summaries began 50 years ago [40], in recent years the field of automatic Text Summarization (TS) has experienced exponential growth [25] [27] [46] due to these new technologies.
This paper addresses the current state of the art of Text Summarization. Section 2 gives an overview of the TS field and presents the factors related to it. Section 3 explains the different approaches to generating summaries. In section 4 we present a number of Text Summarization systems existing today. Section 5 presents the common measures to evaluate those systems, whereas section 6 exposes the
tendency adopted these days in Text Summarization. Finally, section 7 concludes this paper and discusses future work.
2 What is Text Summarization?
2.1 Definition and types
A summary can be defined as a text that is produced from one or more texts, that contains a significant portion of the information in the original text(s), and that is no longer than half of the original text(s) [23]. According to [39], text summarization is the process of distilling the most important information from a source (or sources) to produce an abridged version for a particular user (or users) and task (or tasks).
When this is done by means of a computer, i.e. automatically, we call this Automatic Text Summarization. Despite the fact that text summarization has traditionally been focused on text input, the input to the summarization process can also be multimedia information, such as images, video or audio, as well as on-line information or hypertexts. Furthermore, we can talk about summarizing a single document or multiple ones. In the latter case, the process is known as Multi-document Summarization (MDS), and the source documents can be in a single language (monolingual) or in different languages (translingual or multilingual).
The output of a summarization system may be an extract (i.e. when a selection of "significant" sentences of a document is performed) or an abstract, when the summary can serve as a substitute for the original document [15]. We can also distinguish between generic summaries and user-focused summaries (a.k.a. query-driven). The first type of summaries can serve as a surrogate of the original text, as they may try to represent all relevant features of a source text. They are text-driven and follow a bottom-up approach using IR techniques. User-focused summaries rely on a specification of a user information need, such as a topic or query. They follow a top-down approach using IE techniques.
Concerning the style of the output, a broad distinction is normally made between indicative summaries, which are used to indicate what topics are addressed in the source text and can give a brief idea of what the original text is about, and informative summaries, which are intended to cover the topics in the source text [40][46].
2.2 Process of Automatic Text Summarization
Traditionally, summarization has been decomposed into three main stages [23][40][53]. We will follow the Sparck Jones [53] approach, which is:
• interpretation of the source text to obtain a text representation,

• transformation of the text representation into a summary representation, and,
• finally, generation of the summary text from the summary representation.
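The three stages above can be sketched in code. The concrete choices below (sentences as units, term-frequency weights, a fixed compression ratio) are illustrative assumptions of ours; [53] leaves the representations themselves abstract.

```python
# A minimal sketch of the Sparck Jones three-stage pipeline.
from collections import Counter

def interpret(source_text):
    """Interpretation: turn the source text into a representation
    (here: sentences paired with their term-frequency vectors)."""
    sentences = [s.strip() for s in source_text.split(".") if s.strip()]
    return [(s, Counter(s.lower().split())) for s in sentences]

def transform(text_repr, ratio=0.5):
    """Transformation: condense the text representation into a
    summary representation (here: keep the top-weighted half)."""
    scored = sorted(text_repr, key=lambda p: -sum(p[1].values()))
    return scored[: max(1, int(len(scored) * ratio))]

def generate(summary_repr):
    """Generation: produce the summary text from the representation."""
    return ". ".join(s for s, _ in summary_repr) + "."

text = "Cats sleep a lot. Cats eat fish. Dogs bark."
print(generate(transform(interpret(text))))  # prints "Cats sleep a lot."
```

Real systems differ in what each stage produces; only the three-stage decomposition itself is part of the model.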
Effective summarizing requires an explicit and detailed analysis of context factors. Sparck Jones in [53] distinguishes three classes of context factors: input, purpose and output factors. We will describe them briefly. For further information, see [53].
• Input factors. The features of the text to be summarized crucially determine the way a summary can be obtained. These fall into three groups: text form (e.g. document structure), subject type (ordinary, specialized or restricted) and unit (single or multiple documents as input).
• Purpose factors. These are the most important factors. They fall under three categories: situation refers to the context within which the summary is to be used; audience (i.e. summary readers); and use (what is the summary for?).
• Output factors. In this class we can group: material (i.e. content); format; and style.
3 Approaches to Text Summarization
Although many different approaches to text summarization can be found in the literature [46], [55], in this paper we will only take into account the one proposed by Mani and Maybury (1999) [40]. This classification, based on the level of processing that each system performs, gives an idea of which traditional approaches exist, and can be a suitable reference point from which many techniques can be developed. Based on the traditional approaches, summarization can be characterized as approaching the problem at the surface, entity, or discourse levels [40].
Surface level
This approach tends to represent information using shallow features and then selectively combine them in order to obtain a salience function that can be used to extract information. Among these features, we have:
- Thematic features rely on word (significant word) occurrence statistics, so that sentences containing words that occur frequently in a text have higher weight than the rest. That means that these sentences are the important ones and they are hence extracted. Luhn (1958) [37], who used the term frequency technique in his work, followed this approach. Before computing term frequencies, a filtering task must be done using a stop-list which contains words such as pronouns, prepositions and articles. This is the classical statistical approach. However, from the point of view of a corpus-based approach, the tf*idf measure (commonly used in information retrieval) is very useful to determine keywords in a text.
- Location refers to the position in the text, paragraph or any other particular section, in the sense that certain positions tend to contain the target sentences to be included in the summary. This is usually genre-dependent, but there are two
basic general methods: the lead method and the title-based method. The first one consists of extracting only the first sentences, assuming that these are the most relevant ones, whereas the second considers that words in the headings or titles are positively relevant to summarization. Edmundson (1969) in [15] used this approach together with the cue-word method, which is explained later.
- Background assumes that the importance of meaning units is determined by the presence of terms from the title or headings, the initial part of the text, or a user's query.
- Cue words and phrases, such as "in conclusion", "important", "in this paper", etc., can be very useful to determine signals of relevance or irrelevance. These kinds of units can be detected automatically as well as manually.
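A salience function combining these surface features might look as follows. The stop-list, cue list and equal feature weighting are illustrative assumptions of ours, not the settings of any published system.

```python
# Sketch of a surface-level salience function combining word frequency
# (thematic), position (lead/location), title overlap, and cue phrases.
from collections import Counter

STOP_WORDS = {"the", "a", "of", "in", "and", "is", "are", "to"}
BONUS_CUES = {"in conclusion", "in this paper", "important"}

def salience(sentences, title):
    # Thematic feature: frequency of non-stop-list words across the text.
    freq = Counter(w for s in sentences
                   for w in s.lower().split() if w not in STOP_WORDS)
    title_words = set(title.lower().split()) - STOP_WORDS
    scores = []
    for i, s in enumerate(sentences):
        words = [w for w in s.lower().split() if w not in STOP_WORDS]
        thematic = sum(freq[w] for w in words) / (len(words) or 1)
        location = 1.0 - i / len(sentences)   # lead method: earlier is better
        title_overlap = len(title_words & set(words))
        cue = sum(c in s.lower() for c in BONUS_CUES)
        scores.append(thematic + location + title_overlap + cue)
    return scores

def extract(sentences, title, k=1):
    """Return the k most salient sentences, in document order."""
    scores = salience(sentences, title)
    top = sorted(range(len(sentences)), key=lambda i: -scores[i])[:k]
    return [sentences[i] for i in sorted(top)]
```

In practice, systems weight these features differently (often with weights tuned on a corpus) rather than summing them equally as here.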
Entity level
This approach attempts to build a representation of the text, modeling text entities and their relationships. The objective is to help to determine what is salient. These relations between entities include:
- Similarity occurs, for example, when two words share a common stem, i.e. whose form is similar. This can be extended to phrases or paragraphs. Similarity can be calculated by vocabulary overlap or with linguistic techniques.
- Proximity refers to the distance between text units. With that information it is possible to establish entity relations.
- Co-occurrence: meaning units can be related if they occur in common texts.
- Thesaural relationships among words can be described as relationships like synonymy, hypernymy, and part-of relations (meronymy).
- Coreference. The idea behind coreference is that referring expressions can be linked, so that coreference chains can be built with coreferring expressions.
- Logical relations such as agreement, contradiction, entailment, and consistency.

- Syntactic relations are based on parse trees.

- Meaning representation-based relations, establishing relations between entities in the text, as for example predicate-argument relations.
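The vocabulary-overlap similarity mentioned above can be sketched as a Dice coefficient over crudely stemmed word sets; the suffix-stripping "stemmer" here is an illustrative stand-in for a real one (e.g. a Porter stemmer).

```python
# Vocabulary-overlap similarity between two text units, using a crude
# suffix-stripping stem so that words sharing a stem count as matches.
def stem(word):
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def overlap_similarity(a, b):
    """Dice coefficient over the stemmed vocabularies of two units."""
    va = {stem(w) for w in a.lower().split()}
    vb = {stem(w) for w in b.lower().split()}
    if not va or not vb:
        return 0.0
    return 2 * len(va & vb) / (len(va) + len(vb))

sim = overlap_similarity("the parser parses sentences",
                         "a parser parsed the sentence")
```

Here "parses" and "parsed" both reduce to the stem "pars", so they contribute to the overlap even though their surface forms differ.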
Discourse level
The target of discourse level approaches is to model the global structure of the text and its relations in order to achieve communicative goals. The information that can be exploited at this level includes:
- Format of the document, such as hypertext markup or document outlines.

- Threads of topics as they are revealed in the text.
- Rhetorical structure of text, representing argumentative or narrative structure. The idea behind this deals with the possibility of building the coherence structure of a text, so that the 'centrality' of textual units will reflect their importance.
To apply all these methods, two different approaches can be taken: the techniques described so far can be developed by using linguistic knowledge or by applying machine learning techniques. The latter have a support role, for example, in identifying the information to be applied at specific process stages such as interpretation or generation (for instance, they seem useful for training output sentence order).
4 Text Summarization Systems
The approaches presented so far are examples of pure techniques to apply in order to develop summarization systems. The predominant tendency in current systems is to adopt a hybrid approach and combine and integrate some of the techniques mentioned before (e.g. the cue phrases method combined with position and word frequency based methods in [24], or position, length and weight of sentences combined with the similarity of these sentences with the headline in [21]). As we have given a general overview of the classical techniques used in summarization, and there is a large number of different techniques and systems, we are going to describe in this
section only a few of them briefly, considering systems as wholes. However, in table 1 some more systems are shown, as well as their main features. To finish this section, the most recent approaches concerning summarization are mentioned.
The systems which have been selected to be described in more detail are the following: MEAD [51], WebInEssence [50], NeATS [36], GISTexter [20], and NetSum [54].
- MEAD [51]: This system was developed at the University of Michigan in 2001. It can produce both single and multi-document extractive summaries. The idea behind it is the use of the centroid-based feature. Moreover, two more features are used: position and overlap with the first sentence. Then, the linear combination of the three determines what sentences are most salient to include in the summary.
The system works as follows: first of all, MEAD uses the CIDR Topic Detection and Tracking system to identify all the articles related to an emerging event. CIDR produces a set of clusters. From each cluster a centroid is built. Then, for each sentence, three values are computed: the centroid score, which measures how close the sentence is to the centroid; the position score, which indicates how far the sentence is from the beginning of a document; and finally, the overlap with the first sentence or title of the document, obtained by calculating
tf*idf between the given sentence and the first one. Then all these measures are normalized, and sentences which are too similar to others are discarded (based on a cosine similarity measure). Any sentence that has not been discarded can be included in the summary.
- WebInEssence [50]: This system was also developed at the University of Michigan in 2001. It is more than a summarization system: it is a search engine that summarizes clusters of related Web pages, providing more contextual and summary information to help users explore retrieval results more efficiently. A version of MEAD [51] was used in the development of this Web-based summarizer, so the features used to produce extracts are the same as the ones used in MEAD. The overall architecture of the system can be decomposed into two main stages: the first one behaves as a Web spider that collects URLs from the Internet and then groups the URLs into clusters. The second main stage is to create a multi-document summary from each cluster using the MEAD centroid algorithm.
- NeATS [36] was first developed in 2001 by the University of Southern California's Information Sciences Institute. It is tailored to the genre of newspaper news. Its architecture consists of three main components: content selection, content filtering and content presentation. The goal of content selection is to identify important concepts mentioned in a document collection. The techniques used at this stage are term frequency, topic signatures and term clustering. For content filtering three different filters are used: sentence position, stigma words and a redundancy filter. To achieve the latter, NeATS uses a simplified version of the MMR [9] algorithm. To ensure coherence of the summary, NeATS outputs the final sentences in their chronological order. From this system, iNeATS [35], i.e., an interactive multi-document summarization system that provides user control over the summarization process, was later developed.
- GISTexter [20]: This system was developed in 2002 and produces single and multi-document extracts and abstracts by template-driven IE. The system performs differently depending on whether it is working with single-document or multi-document summarization. For single documents, the most relevant sentences are extracted and compressed by rules learned from a corpus of human-written abstracts. In the final stage, reduction is performed to trim the whole summary to the length of 100 words. When multi-document summarization has to be done, the system, based on Information Extraction (IE) techniques, uses IE-style templates, either from a prior
set (if the topic is well-known) or by ad-hoc generation (if it is unknown). The templates generated by the CICERO IE system are then mapped onto text snippets from the texts, in which anaphoric expressions are resolved. These text snippets can be used to generate coherent, informative multi-document summaries.
- NetSum [54]. Different from the other approaches previously shown, NetSum, developed in 2007 by Microsoft Research, focuses on single-document instead of multi-document summarization. The system produces fully automated single-document extracts of newswire articles based on neural networks. It uses machine learning techniques in this way: a training set is labeled so that the labels identify the best sentences. Then a set of features is extracted from each sentence in the training and test sets, and the training set is used to train the system. The system is then evaluated on the test set. The system learns from the training set the distribution of features for the best sentences and outputs a ranked list of sentences for each document. Sentences are ranked using the RankNet algorithm [8].
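The MEAD scoring and redundancy-filtering scheme described above can be sketched as follows. The equal feature weights and the 0.7 similarity threshold are illustrative assumptions, not MEAD's actual parameters.

```python
# Simplified MEAD-style scoring: each sentence gets a linear combination
# of centroid, position, and first-sentence-overlap scores; a cosine
# filter then drops near-duplicate sentences.
import math
from collections import Counter

def cosine(c1, c2):
    """Cosine similarity between two term-frequency vectors."""
    num = sum(c1[t] * c2[t] for t in set(c1) & set(c2))
    den = (math.sqrt(sum(v * v for v in c1.values()))
           * math.sqrt(sum(v * v for v in c2.values())))
    return num / den if den else 0.0

def mead_scores(sentences, centroid, w=(1.0, 1.0, 1.0)):
    vecs = [Counter(s.lower().split()) for s in sentences]
    n = len(sentences)
    return [w[0] * cosine(v, centroid)        # centroid score
            + w[1] * (n - i) / n              # position score
            + w[2] * cosine(v, vecs[0])       # overlap with first sentence
            for i, v in enumerate(vecs)]

def redundancy_filter(sentences, ranked, threshold=0.7):
    """Keep sentences in ranked order, dropping near-duplicates."""
    vecs = [Counter(s.lower().split()) for s in sentences]
    kept = []
    for i in ranked:
        if all(cosine(vecs[i], vecs[j]) <= threshold for j in kept):
            kept.append(i)
    return kept
```

The kept sentences would finally be emitted in document order to form the extract.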
After the brief description of the former systems, the reader can take a look at table 1, where a few more systems can be found. In order to understand what each column means, the following information is provided. In the first column (SYSTEM [REF.], YEAR) the name of the system with its reference and year is written; the second column (# INPUTS) distinguishes between single-document and multi-document summarization (both values are also possible, meaning the system can handle both inputs). The next column (DOMAIN SYSTEM) indicates the genre of the input, that is, whether the system is designed for domain-specific topics or for a non-restricted domain. The fourth column (FEATURES) lists the main characteristics and techniques used in each system, and finally, the last column (OUTPUT) indicates whether the summary generated is an extract or an abstract (with its variants; for some particular systems both values are also possible).
Table 1: Text Summarization Systems
SYSTEM [REF.], YEAR
# INPUTS DOMAIN SYSTEM
FEATURES OUTPUT
Luhn [37], 1958 single-document
domain-specific (technical articles)
- term filtering and word frequency counting are carried out (low-frequency terms are removed)
- sentences are weighted by the significant terms they contain
- sentence segmentation and extraction is performed
extracts1
Edmundson[15], 1969
single-document
domain-specific(articles)
- techniques used in this approach are: word frequency, cue phrases, title and heading words and sentence location
- it uses a corpus-based methodology
extracts
ADAM [48],1975
single-document
domain-specific(chemistry)
- Cue phrases
- term frequencies
- sentence selection and rejection
indicative abstracts2
ANES [7], 1995 single-document
domain-specific(news)
- term and sentence weighting (tf*idf)
- non-anaphora resolution
- first sentences are added to the summary
extracts
Barzilay & Elhadad [4], 1997
single-document
unknown
- topic identification of the text by grouping words into lexical chains3
- sentence extraction is helped by strong chains identification
- non-anaphora resolution
extracts
1 Although the output in Luhn's work is called abstract, it is more correct to say extract, as sentences from the source document take part in the summary.

2 Output sentences are edited to produce something different from the original ones, but no new sentences are generated.

3 Lexical chains are sequences of related terms grouped together by text cohesion relationships (e.g. synonymy or holonymy).
Boguraev & Kennedy [6], 1997
single-document
domain-independent
- linguistic techniques to identify salient phrasal units (topic stamps)
- content characterisation methods to reflect global context (capsule overview)
- anaphora resolution
capsule overview4
DimSum [3],1997
single-document
unknown
- it uses corpus-based statistical NLP techniques
- multi-word phrases are extracted automatically
- conceptual representation of the text is performed
- discourse features of lexical items within a text (name aliases, synonyms, and morphological variants) are exploited
extracts
Marcu [41],1997
single-document
domain-specific(news)
- it uses text coherence models
- RST5 trees are built
- this kind of representation is useful to determine the most important units in a text
extracts
SUMMARIST[24], 1998
single-document
domain-specific(news)
- symbolic concept-level world knowledge is combined with NLP techniques
- the stages for summarization are: topic identification, interpretation and generation
- it is a multi-lingual system
extracts
4 A capsule overview is not a conventional summary, i.e. it does not attempt to output document content as a sequence of sentences. It is a semi-formal representation of the document.

5 Rhetorical Structure Theory
SUMMONS6 [44], 1998
multi-document
domain-specific (online news)
- its input is a set of templates from the Message Understanding Conference7
- key sentences from an article are extracted using statistical techniques and measures
- planning operators8 such as contradiction, agreement or superset are used to synthesize a single article
extractsandabstracts
FociSum [28],1999
single-document
domain-independent
- it merges information extraction (IE) with sentence extraction techniques
- the topic of the text (called foci in this system) is determined dynamically from named entities and multiword terms
extracts
MultiGen [5],1999
multi-document
domain-specific(news)
- it identifies and synthesizes similar elements across related texts from a set of multiple documents
- it is based on information fusion and reformulation
- sets of similar sentences (themes) are extracted
abstracts
Chen & Lin[26], 2000
multi-document
domain-specific(news)
- it produces multilingual (only English and Chinese) news summaries
- monolingual and multilingual clustering is done
- meaning unit detection, such as topic chains or linking elements, is performed
- similarity between meaning units is measured
extracts
6 SUMMarizing Online NewS articles

7 http://www-nlpir.nist.gov/related projects/muc/index.html

8 for more detailed information see [44]
CENTRIFUSER[30] [29], 2001
multi-document
domain-specific (health-care articles)
- it produces query-driven summaries
- clustering is applied by the SIMFINDER tool
- it is based on a document topic tree (each individual document is represented by a tree data structure)
- composite topic trees are designed (they carry topic information for all articles)
- query mapping using a similarity function enriched with structural information from the topic trees is done
extracts
Cut & Paste[22], 2001
single-document
domain-independent
- it uses sentence reduction and sentence combination techniques
- key sentences are identified by a sentence extraction algorithm that covers these techniques: lexical coherence, tf*idf score, cue phrases and sentence position
abstracts9
MEAD [51],2001
multi-document
domain-specific(news)
- it is based on sentence extraction through the features: centroid score, position and overlap with the first sentence
- sentences too similar to others are discarded
- experiments with CST10 and cross-document subsumption have been made
extracts
9 In this case, summaries are generated by reformulating the text of the original document.

10 Cross-Document Structural Relationships proposes a taxonomy of the informational relationships between documents in clusters of related documents. This concept is similar to Rhetorical Structure Theory (RST).
NeATS11 [36],2001
multi-document
domain-specific(news)
- to select important content it uses: sentence position, term frequency, topic signature, term clustering
- to avoid redundancy it employs the MMR12 technique [9]
- to improve cohesion and coherence, stigma words and time stamps are used
extracts
NewsInEssence[49], 2001
multi-document
domain-specific (online news)
- clusters are built through the CIDR topic detection and tracking component
- it is based on the Cross-document Structure Theory (CST)
- its summaries are produced by MEAD [51]
personalizedextracts
WebInEssence[50], 2001
multi-document
domain-independent
- it is a Web-based summarization and recommendation system
- it employs a centroid-based sentence extraction technique
- it uses similar techniques to NewsInEssence [49], but applied to Web documents
extractsandpersonalizedsummaries
COLUMBIAMDS [43], 2002
multi-document
domain-specific(news)
- it is a composite system that uses different summarizers depending on the input: MultiGen for single events, or DEMS13 for multiple events or biographical documents
- statistical techniques are used
extractsandabstracts
11 Next Generation Automated Text Summarization

12 Maximal Marginal Relevance

13 Dissimilarity Engine for Multidocument Summarization
Copeck et al.[12], 2002
single-document
domain-specific (biographies)
- it uses machine learning techniques
- keyphrases are extracted and ranked
- the text is segmented according to sentences that talk about the same topic
extracts
GISTexter [20],2002
single and multi-document
domain-specific(news)
- for single-document summarization, it extracts key sentences automatically using the technique of single-document decomposition
- for multi-document summaries, it relies on the CICERO IE system to extract relevant information by applying templates that are determined by the topic of the collection
extractsandabstracts
GLEANS [13],2002
multi-document
unknown
- it performs document mapping to obtain a database-like representation that makes explicit their main entities and relations
- each document in the collection is classified into one of these categories: single person, single event, multiple event and natural disaster
- the system is IE-based
headlines,extractsandabstracts
NTT [21], 2002 single-document
unknown
- it employs the Support Vector Machine (SVM) machine learning technique to classify a sentence as relevant or non-relevant
- it also uses the following features to describe a sentence: position, length, weight, similarity with the headline and presence of certain verbs or prepositions
extracts
Karamuftuoglu[31], 2002
single-document
unknown
- it is based on the extract-reduce-organize paradigm
- as a pattern matching method it uses lexical links and bonds14
- sentences are selected by the SVM technique
extracts
Kraaij et al.[33], 2002
multi-document
unknown
- it is based on probabilistic methods: sentence position, length, cue phrases
- for headline generation, noun phrases are extracted
headlinesandextracts
Lal & Reuger[34], 2002
single-document
unknown
- it is built within the GATE15 framework
- it uses a simple Bayes classifier to extract sentences
- it resolves anaphora using GATE's ANNIE16 module
extracts
Newsblaster[42], 2002
multi-document
domain-specific (online news)
- news articles are clustered using Topic Detection and Tracking (TDT)
- it is a composite summarization system (it uses different strategies depending on the type of documents in each cluster)
- it uses similar techniques to [43]
- thumbnails of images are displayed
extracts
SumUM [52],2002
single-document
domain-specific (technical articles)
- shallow syntactic and semantic analysis
- concept identification and relevant information extraction
- summary representation construction and text regeneration
abstracts
14 A lexical link between two sentences is a word that appears in both sentences. When two or more lexical links between a pair of sentences occur, a lexical bond between them is constituted.

15 General Architecture for Text Engineering, University of Sheffield

16 A Nearly New Information Extraction System
Alfonseca & Rodríguez [1], 2003
single-document
domain-specific(articles)
- relevant sentence identification (with a genetic algorithm)
- relevant words and phrases from identified sentences are extracted
- coherence is kept in the output
very shortextracts(10 words)
GISTSumm[47], 2003
single-document
domain-independent
- it is based on the gist17 of the source text
- it uses statistical measures (keywords) to determine what the gist sentence is
- by means of the gist sentence, it is possible to build coherent extracts
extracts
K.U. Leuven[2], 2003
single and multi-document
unknown
- topic segmentation and clustering techniques are used for the multi-document task
- topic segmentation, sentence scoring (weight, position, proximity to the topic) and compression are used for single-document summarization
extracts
Univ. Lethbridge [10], 2003
single and multi-document
unknown
- it performs topic segmentation of the text
- it computes lexical chains for each segment
- sentence extraction techniques are performed
- it uses heuristics to do some surface repairs to make summaries coherent and readable
extracts
LAKE [14], 2004 single-document
domain-specific(news)
- it exploits a keyphrase extraction methodology to identify relevant terms in the document
- it is based on a supervised learning approach
- it considers linguistic features like named entity recognition or multiword terms
very shortextracts
17 The most important passage of the source text
MSR-NLP Summarizer [56], 2004
multi-document
domain-specific(news)
- its objective is to identify important events as opposed to entities
- it uses a graph-scoring algorithm to identify highly weighted nodes and relations
- summaries are generated by extracting and merging portions of logical forms
extracts
CATS [16], 2005 multi-document
domain-specific(news)
- it analyzes which information in the document is important in order to include it in the summary
- it is an Answering Text Summarizer
- statistical techniques are used to compute a score for each sentence; temporal expressions and redundancy are also resolved
extracts
CLASSY [11],2005
multi-document
domain-specific(news)
- it is a query-based system
- it is based on a Hidden Markov Model algorithm for sentence scoring and selection
- it classifies sentences into two sets: those belonging to the summary and those which do not
extracts
QASUM-TALP [17],2005
multi-document
domain-specific(news)
- it is a query-driven system
- summary content is selected from a set of candidate sentences in relevant passages
- summaries are produced in four steps: (1) collection pre-processing, (2) question generation, (3) relevant information extraction, and (4) summary content selection
extracts
ERRS [57], 2007 single and multi-document
domain-specific(news)
- it is basically a heuristic-based system
- all kinds of summaries are generated with the same data structure: the Fuzzy Coreference Cluster Graph
extracts
FemSum [18],2007
single and multi-document
domain-specific(news)
- the system is aimed at providing answers to complex questions
- summaries are produced taking into account a syntactic and a semantic representation of the sentences
- it uses a graph representation to establish relations between candidate sentences
- it is composed of three language-independent components: RID (Relevant Information Detector), CE (Content Extractor), SC (Summary Composer)
extracts
GOFAISUM[19], 2007
multi-document
domain-specific(news)
- it is based only on a symbolic approach
- basically, the techniques used are tf·idf and syntactic pruning
- the sentences with the highest scores are selected to build the summary
extracts
NetSum [54],2007
single-document
domain-specific(news)
- it is based on machine learning techniques to generate summaries
- it uses a neural network algorithm to enhance sentence features
- the three sentences that best match the document's highlights are extracted
extracts
From the systems described above, it is possible to notice that each system follows a different methodology to produce summaries. Furthermore, the inputs and the genre can be different too. That gives us an idea of how developed the state of the art is and of the number of different approaches that exist to tackle this field of research.
In the latest ACL conference (ACL'07)18, attempts to summarize entire books [45] have been made. The argument to support this idea is that most studies have been focused on short documents, especially on news reports, and very little effort has been devoted to the summarization of long documents, like books. Generating book summaries can be very useful for the reader to choose or discard a book just by looking at the extract or abstract of that book. Similarly, in the previous year's European ACL conference (EACL'06)19, summarization of short fiction stories was also investigated [32], arguing that summarization is the key issue in determining whether to read a whole story or not.
Techniques employed in recent years are very similar to the classical ones, but they have to be adapted to each particular kind of system and its objectives. Improvements in machine learning techniques have allowed
18The 45th Annual Meeting of the Association for Computational Linguistics was held in Prague, Czech Republic, June 23rd-30th, 2007, http://ufal.mff.cuni.cz/acl2007/
19The 11th Conference of the European Chapter of the Association for Computational Linguistics was held in Trento, Italy, April 3rd-7th, 2006, http://eacl06.itc.it/
them to be used to train and develop summarization systems as well. NetSum [54], presented at ACL'07, is an example of a system that uses machine learning algorithms to perform summarization.
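The machine-learning view of extractive summarization can be sketched as learning, from labelled examples, a weighting over simple sentence features. The perceptron and the three features below are illustrative assumptions on our part, not NetSum's actual RankNet-based model:

```python
def sentence_features(sentence, position, total, title_words):
    """Three features commonly used by trainable extractive summarizers:
    sentence position, length, and overlap with the document title."""
    words = set(sentence.lower().split())
    return [
        1.0 - position / max(total - 1, 1),                   # earlier = higher
        min(len(words) / 20.0, 1.0),                          # capped length
        len(words & title_words) / max(len(title_words), 1),  # title overlap
    ]

def train_perceptron(examples, labels, epochs=50, lr=0.1):
    """Tiny perceptron trained on (feature vector, in-summary?) pairs;
    returns the learned weights and bias."""
    w = [0.0] * len(examples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(examples, labels):
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            if y != pred:
                w = [wi + lr * (y - pred) * xi for wi, xi in zip(w, x)]
                b += lr * (y - pred)
    return w, b
```

At summarization time, each sentence of an unseen document is mapped to the same feature vector and the highest-scoring sentences form the extract; real systems use richer features and stronger learners, but the pipeline is the same.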
5 Measures of Evaluation
Methods for evaluating automatic text summarization can be classified into two categories: intrinsic and extrinsic [38]. Intrinsic methods measure a system's performance on its own, whereas extrinsic methods evaluate whether summaries are good enough to accomplish some other specific task, e.g. filtering in information retrieval or report generation. The assessment of a summary can be carried out in different ways; several examples, such as the Shannon Game or the Question Game, can be found in [23]. In summary evaluation programmes such as SUMMAC20, DUC21 or NTCIR22, automatically generated summaries (extracts or abstracts) are evaluated mostly intrinsically against human reference or gold-standard summaries (ideal summaries). The problem is to establish what an ideal summary is. Humans know how to sum up the most important information of a text, but different experts may disagree on which information is the most important to extract. Automatic evaluation programmes have therefore been developed to try to give
20http://www-nlpir.nist.gov/related_projects/tipster_summac/
21http://duc.nist.gov
22http://research.nii.ac.jp/ntcir/
an objective point of view on evaluation. Systems such as SEE23, ROUGE24 or BE25 have been created to assist with this task.
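The idea behind n-gram-overlap evaluation systems such as ROUGE can be illustrated with a simplified ROUGE-N recall score against a single reference summary. This is a sketch only; the official ROUGE toolkit adds stemming, stopword-removal options, multiple references, and longest-common-subsequence variants:

```python
from collections import Counter

def rouge_n_recall(candidate, reference, n=1):
    """Simplified ROUGE-N: the fraction of the reference summary's
    n-grams that also appear in the candidate summary (clipped counts)."""
    def ngrams(text):
        words = text.lower().split()
        return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    if not ref:
        return 0.0
    # Clip each n-gram's credit at its count in the candidate.
    overlap = sum(min(count, cand[g]) for g, count in ref.items())
    return overlap / sum(ref.values())
```

Being recall-oriented, the score rewards candidates that cover the reference's content; precision- and F-measure variants follow the same counting scheme.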
6 The Evolution of Text Summarization Approaches
Throughout recent years, summarization has undergone a remarkable evolution. Thanks to the evaluation programmes that take place every year, the field of Text Summarization has improved considerably. For example, the tasks performed in the Document Understanding Conferences (DUC) have moved from simple tasks to more complex ones. At the beginning, efforts concentrated on generating simple extracts from single documents, usually in English. Lately, the trend has evolved towards more sophisticated summaries, such as abstracts produced from a number of documents rather than a single one, and in a variety of languages. Different tasks have been introduced year after year, so that, apart from the general main task, one can find tasks consisting of producing summaries in response to a specific question or user need, or of generating a summary from updated news. Finally, concerning evaluation, the tendency has moved towards extrinsic evaluation, i.e. how useful a summary is in helping other tasks, rather
23Summarization Evaluation Environment,http://www.isi.edu/publications/licensed-sw/see/
24Recall-Oriented Understudy for Gisting Evaluation, http://haydn.isi.edu/ROUGE
25Basic Elements, http://haydn.isi.edu/BE/
than intrinsic evaluation. However, intrinsic evaluation is also important for measuring linguistic quality and content responsiveness, so manual evaluation by humans is still performed, together with automatic evaluation systems such as BE, SEE or ROUGE, introduced in the previous section. The evolution of summarization systems is not over yet: much effort is still required to achieve high-quality summaries, whether extracts or abstracts.
7 Conclusion and Future Work
In this paper, we have presented a general overview of automatic text summarization. The status and state of automatic summarizing have changed radically over the years. The field has especially benefited from work on other tasks, e.g. information retrieval, information extraction or text categorization. Research in this field will continue, since the text summarization task is far from solved and there is still much to investigate and improve. We have covered the definition, types, different approaches and evaluation methods, as well as the features and techniques of summarization systems already developed. In the future, we plan to contribute to this field by improving the quality of summaries and by studying the influence of techniques from neighbouring tasks on summarization.
References
[1] Alfonseca, E., Rodríguez, P. Description of the UAM system for generating very short summaries at DUC-2003. In HLT/NAACL Workshop on Text Summarization / DUC 2003, 2003.
[2] Angheluta, R., Moens, M. F., De Busser, R. K.U.Leuven Summarization System. In DUC 2003, Edmonton, Alberta, Canada, 2003.
[3] Aone, C., Okurowski, M. E., Gorlinsky, J., et al. A Scalable Summarization System Using Robust NLP. In Proceedings of the ACL/EACL'97 Workshop on Intelligent Scalable Text Summarization, pages 66–73, 1997.
[4] Barzilay, R., Elhadad, M. Using Lexical Chains for Text Summarization. In Inderjeet Mani and Mark Maybury, editors, Advances in Automatic Text Summarization. MIT Press, 1999.
[5] Barzilay, R., McKeown, K., Elhadad, M. Information fusion in the context of multi-document summarization. In Proceedings of ACL 1999, 1999.
[6] Boguraev, B., Kennedy, C. Salience-based content characterisation of text documents. In Proceedings of the ACL'97 Workshop on Intelligent, Scalable Text Summarization, pages 2–9, Madrid, Spain, 1997.
[7] Brandow, R., Mitze, K., Rau, L. F. Automatic condensation of electronic publications by sentence selection. Information Processing & Management, 31(5):675–685, 1995.
[8] Burges, C., Shaked, T., Renshaw, E.,et al. Learning to rank using gradi-ent descent. In ICML ’05: Proceedings
of the 22nd international conference onMachine learning, pages 89–96, NewYork, NY, USA, 2005. ACM.
[9] Carbonell, J., Goldstein, J. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In SIGIR '98: Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, pages 335–336, New York, NY, USA, 1998. ACM.
[10] Chali, Y., Kolla, M., Singh, N., et al. The University of Lethbridge Text Summarizer at DUC 2003. In HLT/NAACL Workshop on Text Summarization / DUC 2003, 2003.
[11] Conroy, J. M., Schlesinger, J. D., Gold-stein, J. CLASSY Query-Based Multi-Document Summarization. In the Doc-ument Understanding Workshop (pre-sented at the HLT/EMNLP AnnualMeeting), Vancouver, B.C., Canada,2005.
[12] Copeck, T., Szpakowicz, S., Jap-kowic, N. Learning How Best to Sum-marize. In Workshop on Text Summa-rization (In conjunction with the ACL2002 and including the DARPA/NISTsponsored DUC 2002 Meeting on TextSummarization), Philadelphia, 2002.
[13] Daume III, H., Echihabi, A., Marcu, D., et al. GLEANS: A Generator of Logical Extracts and Abstracts for Nice Summaries. In Workshop on Text Summarization (In conjunction with the ACL 2002 and including the DARPA/NIST sponsored DUC 2002 Meeting on Text Summarization), Philadelphia, 2002.
[14] D’Avanzo, E., Magnini, B., Vallin, A.Keyphrase Extraction for Summariza-tion Purposes: The LAKE System atDUC-2004. In the Document Un-derstanding Workshop (presented atthe HLT/NAACL Annual Meeting),Boston, USA, 2004.
[15] Edmundson, H. P. New Methods in Automatic Extracting. In Inderjeet Mani and Mark Maybury, editors, Advances in Automatic Text Summarization. MIT Press, 1999.
[16] Farzindar, A., Rozon, F., Lapalme, G.CATS A Topic-Oriented Multi-Document Summarization System atDUC 2005. In the Document Un-derstanding Workshop (presented atthe HLT/EMNLP Annual Meeting),Vancouver, B.C., Canada, 2005.
[17] Fuentes, M., Gonzalez, E., Ferres, D., et al. QASUM-TALP at DUC 2005 Automatically Evaluated with a Pyramid Based Metric. In the Document Understanding Workshop (presented at the HLT/EMNLP Annual Meeting), Vancouver, B.C., Canada, 2005.
[18] Fuentes, M., Rodrıguez, H., Fer-res, D. FEMsum at DUC 2007. Inthe Document Understanding Work-shop (presented at the HLT/NAACL),Rochester, New York USA, 2007.
[19] Gotti, F., Lapalme, G., Nerima, L., et al. GOFAISUM: A Symbolic Summarizer for DUC. In the Document
Understanding Workshop (presented atthe HLT/NAACL), Rochester, NewYork USA, 2007.
[20] Harabagiu, S., Lacatusu, F. Generat-ing Single and Multi-Document Sum-maries with GISTEXTER. In Work-shop on Text Summarization (In Con-junction with the ACL 2002 and in-cluding the DARPA/NIST sponsoredDUC 2002 Meeting on Text Summa-rization) Philadelphia, Pennsylvania,USA, 2002.
[21] Hirao, T., Sasaki, Y., Isozaki, H., etal. NTT’s Text Summarization sys-tem for DUC-2002. In Workshopon Text Summarization (In conjunc-tion with the ACL 2002 and includ-ing the DARPA/NIST sponsored DUC2002 Meeting on Text Summariza-tion), Philadelphia, 2002.
[22] Jing, H. Cut-and-paste text summarization. PhD thesis, 2001. Adviser: Kathleen R. McKeown.
[23] Hovy, E. H. Automated Text Summa-rization. In R. Mitkov (ed), The Ox-ford Handbook of Computational Lin-guistics, chapter 32, pages 583–598.Oxford University Press, 2005.
[24] Hovy, E., Lin, C. Y. Automated TextSummarization in SUMMARIST. InInderjeet Mani and Mark Marbury,editors, Advances in Automatic TextSummarization. MIT Press, 1999.
[25] Hovy, E., Lin, C. Y., Zhou, L., et al. Automated Summarization Evaluation with Basic Elements. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC). Genoa, Italy, 2006.
[26] Hsin-Hsi, C., Chuan-Jie, L. A multi-lingual news summarizer. In Proceed-ings of the 18th conference on Com-putational linguistics, pages 159–165,Morristown, NJ, USA, 2000. Associa-tion for Computational Linguistics.
[27] Jackson, P., Moulinier, I. Naturallanguage processing for online appli-cations. John Benjamins PublishingCompany, 2002.
[28] Kan, M. Y., McKeown, K. Information extraction and summarization: Domain independence through focus types. Technical report, Computer Science Department, Columbia University.
[29] Kan, M. Y., McKeown, K. R., Kla-vans, J. L. Applying Natural Lan-guage Generation to Indicative Sum-marization. In the 8th European Work-shop on Natural Language Generation,Toulouse, France, 2001.
[30] Kan, M. Y., McKeown, K. R., Kla-vans, J. L. Domain-specific informa-tive and indicative summarization forinformation retrieval. In the Docu-ment Understanding Conference, NewOrleans, USA, 2001.
[31] Karamuftuoglu, M. An approach to summarization based on lexical bonds. In Workshop on Text Summarization (In conjunction with the ACL 2002 and including the DARPA/NIST sponsored DUC 2002 Meeting on Text Summarization), Philadelphia, 2002.
[32] Kazantseva, A. An Approach to Sum-marizing Short Stories. In Proceedingsof the Student Research Workshop atthe 11th Conference of the EuropeanChapter of the Association for Com-putational Linguistics, 2006.
[33] Kraaij, W., Spitters, M, Hulth, A.Headline extraction based on a com-bination of uni- and multidocumentsummarization techniques. In Work-shop on Text Summarization (In con-junction with the ACL 2002 and in-cluding the DARPA/NIST sponsoredDUC 2002 Meeting on Text Summa-rization), Philadelphia, 2002.
[34] Lal, P., Rueger, S. Extract-based Summarization with Simplification. In Workshop on Text Summarization (In conjunction with the ACL 2002 and including the DARPA/NIST sponsored DUC 2002 Meeting on Text Summarization), Philadelphia, 2002.
[35] Leuski, A., Lin, C. Y., Hovy, E.iNEATS: Interactive Multi-documentSummarization. In ACL’03, 2003.
[36] Lin, C. Y., Hovy, E. Automated Multi-document Summarization in NeATS.In Proceedings of the Human Lan-guage Technology (HLT) Conference.San Diego, CA, 2001.
[37] Luhn, H. P. The Automatic Creation of Literature Abstracts. In Inderjeet Mani and Mark Maybury, editors, Advances in Automatic Text Summarization. MIT Press, 1999.
[38] Mani, I. Summarization evaluation: An overview. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL) Workshop on Automatic Summarization, 2001.
[39] Mani, I., House, D., Klein, G., et al. The TIPSTER SUMMAC Text Summarization Evaluation. In Proceedings of EACL, 1999.
[40] Mani, I., Maybury, M. T., Ed. Ad-vances in Automatic Text Summariza-tion. The MIT Press, 1999.
[41] Marcu, D. Discourse Trees Are Good Indicators of Importance in Text. In Inderjeet Mani and Mark Maybury, editors, Advances in Automatic Text Summarization. MIT Press, 1999.
[42] McKeown, K., Barzilay, R., Evans, D., et al. Tracking and summarizing news on a daily basis with Columbia's Newsblaster. In Proceedings of the Human Language Technology (HLT) Conference. San Diego, CA, 2002.
[43] McKeown, K., Evans, D., Nenkova, A.,et al. The Columbia Multi-DocumentSummarizer for DUC 2002. In Work-shop on Text Summarization (In con-junction with the ACL 2002 and in-cluding the DARPA/NIST sponsoredDUC 2002 Meeting on Text Summa-rization), Philadelphia, 2002.
[44] McKeown, K., Radev, D. R. Generating Summaries of Multiple News Articles. In Inderjeet Mani and Mark Maybury, editors, Advances in Automatic Text Summarization. MIT Press, 1999.
[45] Mihalcea, R., Ceylan, H. Explorationsin Automatic Book Summarization. InProceedings of the Joint Conferenceon Empirical Methods in Natural Lan-guage Processing and ComputationalNatural Language Learning (EMNLP-CoNLL), pages 380–389, 2007.
[46] Padro Cirera, L., Fuentes, M.J.,Alonso, L., et al. Approaches to TextSummarization: Questions and An-swers. Revista Iberoamericana de In-teligencia Artificial, ISSN 1137-3601,(22):79–102, 2004.
[47] Pardo, T., Rino, L., Nunes, M. Gist-Summ: A summarization tool based ona new extractive method. In 6th Work-shop on Computational Processing ofthe Portuguese Language – Writtenand Spoken. Number 2721 in Lec-ture Notes in Artificial Intelligence,Springer (2003) 210–218, 2003.
[48] Pollock, J. J., Zamora, A. Automatic Abstracting Research at Chemical Abstracts. In Inderjeet Mani and Mark Maybury, editors, Advances in Automatic Text Summarization. MIT Press, 1999.
[49] Radev, D., Blair-Goldensohn, S.,Zhang, Z., et al. NewsInEssence:a system for domain-independent,real-time news clustering and multi-document summarization. In HLT’01: Proceedings of the first interna-tional conference on Human languagetechnology research, pages 1–4, Mor-ristown, NJ, USA, 2001. Associationfor Computational Linguistics.
[50] Radev, D., Fan, W., Zhang, Z. WebInEssence: A personalized web-based multi-document summarization and recommendation system. In NAACL Workshop on Automatic Summarization, Pittsburgh, 2001.
[51] Radev, D., Blair-Goldensohn, S., Zhang, Z. Experiments in Single and Multi-Document Summarization using MEAD. In First Document Understanding Conference, New Orleans, LA, 2001.
[52] Saggion, H., Lapalme, G. Generating Indicative-Informative Summaries with SumUM. Computational Linguistics, 28(4), 2002.
[53] Sparck Jones, K. Automatic summarizing: factors and directions. In Inderjeet Mani and Mark Maybury, editors, Advances in Automatic Text Summarization. MIT Press, 1999.
[54] Svore, K., Vanderwende, L., Burges, C.Enhancing Single-Document Summa-rization by Combining RankNet andThird-Party Sources. In Proceed-ings of the Joint Conference on Em-pirical Methods in Natural LanguageProcessing and Computational NaturalLanguage Learning (EMNLP-CoNLL),pages 448–457, 2007.
[55] Tucker, R. Automatic summarisingand the CLASP system. PhD thesis,1999. Cambridge.
[56] Vanderwende, L., Banko, M.,Menezes, A. Event-Centric Sum-mary Generation. In the DocumentUnderstanding Workshop (presented
at the HLT/NAACL Annual Meeting),Boston, USA, 2004.
[57] Witte, R., Krestel, R., Bergler, S. Gen-erating Update Summaries for DUC2007. In the Document Under-standing Workshop (presented at theHLT/NAACL), Rochester, New YorkUSA, 2007.