TRANSCRIPT
Graz University of Technology
706.048 Information Search and Retrieval
WS 2009
Automatic Text Summarization
Group 10
Rejhan Basagic [0330740]
Damir Krupic [0231316]
Bojan Suzic [0631814]
Supervisor:
Dipl.-Ing. Dr.techn. Christian Gütl
Institute for Information Systems and Computer Media, Graz
Copyright (C) 2009 Rejhan Basagic, Damir Krupic, Bojan Suzic.
This work may be used by anyone in accordance with the terms of the License for Free Content (Lizenz für Freie Inhalte). The license terms can be retrieved at http://www.uvm.nrw.de/opencontent or requested in writing from the office of the Kompetenznetzwerk Universitätsverbund MultiMedia NRW, Universitätsstraße 11, D-58097 Hagen.
ABSTRACT
One of the problems that has arisen with the rapid growth of the web and of information availability in general (sometimes referred to as information overload) is the increased need for effective and powerful text summarization. In this document we present a short historical overview of automatic text summarization and the most relevant approaches currently used in this area, covering both single- and multi-document summarization. Furthermore, examples of the use of the technology are summarized, and a short conclusion and future directions are given.
ZUSAMMENFASSUNG
One of the problems that has arisen recently with the rapid growth of the Internet and the general availability of information is the increasing need for effective and powerful text summarization. This document presents a short historical overview and the development of automatic text summarization. Further, the most relevant and important concepts are presented, with a focus on single- and multi-document summarization. In addition, examples of the use of the technology are summarized, and a short conclusion and future directions are given.
TABLE OF CONTENTS
Abstract
Zusammenfassung
Table of Contents
1 Introduction
1.1 Applications of automatic summarization
1.2 Scope of this work
2 Background
2.1 1955-1979 Early Extraction and Linguistic Approaches
2.2 1980s and 1990s - Artificial Intelligence Approaches and "Renaissance"
3 Taxonomy of Summarization Methods
4 Single Document Summarization
4.1 Ontology Knowledge Based Summarization
4.2 Feature Appraisal Based Summarization
4.3 Neural Network Based Approach
5 Multi-Document Summarization
5.1 History of Multi-Document Summarization
5.1.1 SUMMONS
5.2 Abstraction
5.3 Topic Driven Summarization and MMR
5.4 Centroid Based Summarization
6 Lessons Learned
7 Conclusion
References
Abbreviations
Figures
Tables
1 INTRODUCTION
In recent times, the wide availability of information on the Internet, driven by many factors including the rapid digitization of paper documents and the rapid expansion of the Internet under the Web 2.0 paradigm, has caused a global information abundance. This has raised the need for alternative ways of displaying and selecting textual and multimedia content, such that its most important parts are presented to the user to support the decision about further steps: should an article be read and investigated further, or should another one be considered? Nowadays there is too much content on the web, and information quality and relevance are of great concern. Users expect short summaries of written information in order to select the most appropriate items for further work.
1.1 APPLICATIONS OF AUTOMATIC SUMMARIZATION
One example of document summarization is in the legal area. Legal experts perform difficult and responsible work, and their resources are scarce and expensive in both time and expertise. Thus, a system for concise summarization is necessary so that experts can effectively and quickly find compressed and restated content of relevant judicial documents, including laws and legislative proposals, relevant court decisions, and summaries of tribunal proceedings.

In the medical domain there is often an overload of information, and in many cases medical personnel are required to find relevant information about a patient's condition in a timely manner. This involves crawling through many documents and patient records to obtain the necessary information. In this area, text summarization specifically adjusted to the medical domain is of considerable use, saving time and optimizing the availability of medical experts.
On the Internet there are many examples of summarization in use. For instance, news portals like Google1, Microsoft News2 or Columbia Newsblaster3 rely on such techniques to provide short news summaries to their visitors. There are also services providing blog summarization and aggregation4, as well as opinion survey systems [Song et al. 2007].

Further, document summarization is also applied on PDA devices with small screens, where only limited screen size and time are available for reading. For businesses it is also important to have summaries of meetings available, possibly coupled with speech recognition systems, to provide meeting minutes quickly and without using excessive
1 http://news.google.com
2 http://msnbc.msn.com
3 http://newsblaster.cs.columbia.edu
4 For instance, http://www.bloghearld.com
human and other resources. In the area of accessibility, text summarization systems are also of great help to handicapped people. Combined with speech synthesis technologies, they can save readers much time by allowing them to quickly recognize and separate important from less important content according to their interests. [Lin et al. 2009]
1.2 SCOPE OF THIS WORK
This work is focused on the topic of automatic text summarization. In the first chapter some interesting uses of this technology are summarized. In the second chapter the focus is on a short historical overview of text summarization. The third chapter presents a taxonomy of summarization methods.

In the fourth and fifth chapters, the parts of this area relating to single and multiple document summarization are investigated in more detail, including the currently promising methods. Some characteristics of these proposals and suggestions for future improvements are presented. In the sixth chapter the lessons learned are analyzed in more detail, together with proposals for future work.

Finally, in the seventh chapter an appropriate conclusion is given, together with a short overview of this work.
2 BACKGROUND
2.1 1955-1979 EARLY EXTRACTION AND LINGUISTIC APPROACHES
Hans Peter Luhn, a popular IBM inventor, was a pioneer in using computers for information retrieval. He published the first paper in this area, entitled "A new method of recording and searching information" (Luhn, 1953). After becoming manager of information retrieval research at IBM, Luhn explored and developed many IR applications.

One such application is KWIC (Keyword in Context), which uses three elements fundamental to information retrieval: keyword, title and context. [Heting 2007]
Edmundson described new methods of automatically extracting documents for screening purposes. His previous work was based on the presence of high-frequency content words (keywords). In his paper from 1969 he also described three additional components: pragmatic words (cue words); title and heading words; and structural indicators.

The results indicated that the three newly proposed components dominate the frequency component in the creation of better extracts. Edmundson also tried to develop an algorithm for the automatic extraction of summaries from a corpus. [Edmundson 1969]
2.2 1980S AND 1990S - ARTIFICIAL INTELLIGENCE APPROACHES AND "RENAISSANCE"
Interest shifted toward using AI methods, hybrid approaches, and the summarization of document groups and multimedia documents.
"... a series of knowledge-based text summarization systems evolved, the methodology of which was
almost exclusively based on the Schankian-type of Conceptual Dependency (CD) representations (e.g.
(Cullingford 78, Lehnert 81, DeJong 82, Dyer 83, Tait 85, Alterman 86)) CD representations, however,
are formally underspecified representation devices lacking any serious formal foundation.
According to this, the summarization operations these first-generations systems provide use only
informal heuristics to determine the salient topics from the text representation structures for the
purpose of summarization.
A second generation of summarization systems then adopted a more mature knowledge
representation approach, one based on the evolving methodological framework of hybrid,
classification-based knowledge representation languages (cf. (Woods & Schmolze 92) for a survey).
Among these systems count SUSY (Fum et. al. 85), SCISOR (Rau 87), and TOPIC (Reimer & Hahn 88),
but even in these frameworks no attempt was made to properly integrate the text summarization
process into the formal reasoning mechanisms of the underlying knowledge representation language."
[Reimer and Hahn 1997]
3 TAXONOMY OF SUMMARIZATION METHODS
There are several types of text summarization. According to the form of the summary, the following may be distinguished:

Extracts: summaries consisting entirely of sentences or word sequences contained in the original document. Besides complete sentences, extracts can contain phrases and paragraphs. The problem with this approach is usually a lack of balance and cohesion: sentences can be extracted out of context, and anaphoric references can be broken.

Abstracts: these contain word sequences not present in the original. They are usually built from the existing content using more advanced methods. It is generally hard for a computer to meet the requirements of such an approach because of many limitations, including the state of the art in language generation and the complexity of human language.

From the point of view of the processing level involved in creating document summaries, the following can be recognized:

Surface level approach: here information is represented through shallow features. These include different types of terms, e.g. statistically and positionally salient ones, terms from cue phrases, or domain-specific and user-supplied terms. This approach usually produces an extraction-based summary as output.

Deeper level approach: this approach may involve sentence generation. Advanced semantic analysis is necessary to achieve such a task. The output of this approach may take the form of abstracts or extracts.

It is also important to recognize the audience of the summaries:

Generic summaries are aimed at a broad community of readers.

Query-based summaries are built on top of a previously submitted user query.

User or topic focused summaries are tailored to the interests of the user and represent only a particular topic.

Summaries can also be single- or multi-document based. They can be produced in a mono- or multilingual context and applied to different genres.

In this work we concentrate on general types of summaries in single and multi-document contexts, analyzing several approaches currently under research.
4 SINGLE DOCUMENT SUMMARIZATION
Currently, most of the work done in the field relates to sentence-extraction-based document summarization. Since the single-document summarization track was dropped from the DUC challenge (2003), research in the area of single document summarization has been somewhat declining. According to [Nenkova 2005], summarization systems tend to perform better on multi-document than on single-document summarization tasks. This is counter-intuitive given the general feeling that multi-document summarization is more difficult than single-document summarization. However, it can be partially explained by the fact that repetitive occurrences in the input documents can be used as an indication of importance in multi-document environments.

Recent work on single document summarization is based on several different approaches, ranging from classical ones, through statistically based ones, to ones built on top of deep natural language analysis methods. In this section the most representative approaches are analyzed and presented.
4.1 ONTOLOGY KNOWLEDGE BASED SUMMARIZATION
The proposal from [Verma et al. 2007] focuses on dynamic summary generation based on a user input query. This approach has been designed for application in a specific domain (medical); however, it can be used in a general domain too.

The idea presented in this proposal is based on the fact that the user selects keywords to search for documents with specific requirements. However, these keywords may not match the document's main idea, so the summary provided by the static author-written abstract may not be a good summary for the user and the specific search query. Hence, the summary needs to be generated dynamically, according to the user requirements given by the search query.

The system is coupled with two ontology knowledge sources, WordNet and UMLS. WordNet is a widely known lexical database for English, developed at Princeton University. The database consists of linked words, including nouns, verbs, adjectives and adverbs. These words are connected into sets of cognitive synonyms called synsets, representing basic inter-relations such as hypernym, meronym and pertainym. The second source, UMLS, is maintained by the US National Library of Medicine and includes three knowledge sources: the Metathesaurus, the Semantic Network and the Specialist Lexicon. In the current approach only the first two sources are used, as its focus is on semantic analysis.
Basically, there are three steps involved in the creation of the document summary in such system:
1) Evaluation and adjustment of the query with regard to the WordNet and/or UMLS ontology knowledge. Redundant keywords are removed and relevant ones added.

2) Calculation of the distance of the document's sentences to the relevant query. Sentences below the predefined threshold are candidates for inclusion in the document summary.

3) Calculation of the distance among the candidate summary sentences. The candidates are then separated into groups based on the threshold, and the highest ranked candidate from each group becomes part of the document summary.

A system based on this method was presented at DUC 2007. The problems found with this approach relate to redundancy reduction, the lack of syntax analysis and insufficient query analysis. Namely, the redundancy reduction step performed so far is only partially effective, since repetitive coverage of the same information from multiple documents is still present. In the first tests as a part of DUC 2007 the method showed good potential, and the authors stated that the present issues and improvements are to be addressed in future work. Future versions of the system will include natural language processing, including parsing and syntax analysis. Moreover, statistical data from the documents' abstracts will be used as part of the scoring algorithm in sentence extraction.

4.2 FEATURE APPRAISAL BASED SUMMARIZATION

Another similar approach based on semantic analysis of the document has been proposed in [Bawakid and Oussalah 2008]. In this approach the authors propose a scoring system for document extraction based on static and dynamic document features. Static features include sentence locations and named entities (NE) in each sentence. Dynamic features used for scoring include the semantic similarity between sentences and the user query.

FIGURE 1: THE SUMMARIZER ARCHITECTURE [BAWAKID AND OUSSALAH 2008]
In the proposed system, three steps are involved in creating a document summary: preprocessing, analysis and summary generation.

In the preprocessing stage, unnecessary elements such as HTML tags, news agency names or table numbers are removed from the document. Further, the document is tokenized and sentence boundaries are detected. In further processing, named entities such as locations, organizations or names are detected, the words are POS (part-of-speech) tagged, and finally co-reference resolution is performed.
In the second step, the features are extracted and analyzed. Basically, features such as sentence location, named entities in the sentence, and semantic similarity to the document title and user query are used to build a relevancy score for each sentence.

An important feature used here is also the semantic similarity to the other sentences in the document. The external information source used in this step is WordNet, which provides synonyms for the adjectives and adverbs found in the sentences. A further step in the sentence similarity calculation involves linking adjectives to their respective nouns and adverbs to their verbs. To distinguish the relative importance of the nouns in a sentence, linguistic quantifiers are used. Basically, two classes of linguistic quantifiers are used: ones increasing the importance of a word (like "very" or "more") and ones decreasing it (like "less" or "none"). Then, for both of the compared sentences, the best average match for each noun and verb is calculated, taking the related quantifiers into account. After the average matches are calculated, the similarity between the sentences is computed by summing the best average matches of both sentences and dividing the result by the sum of the previously calculated sentence scores. This means that the result depends on every verb and noun in both sentences. The same method is used to determine the similarity score with the user query and the document title.
Finally, the sentence score is calculated as a linear combination of the weighted features, in the following way:

Score(si) = w1 · Sim(si, T) + w2 · Sim(si, Q) + w3 · P(si) + w4 · n(si) / (N − 1) + w5 · FNE(si) / NE

where the wk are feature weights and:

- N is the total number of sentences in the document
- n(si) is the number of sentences having semantic similarity to si above a predefined threshold
- P(si) is the sentence position weight
- Sim(si, T) and Sim(si, Q) are the semantic similarities to the document title and user query, respectively
- NE is the number of named entities in the document
- FNE(si) is the number of named entities contained in sentence si
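A hedged sketch of such a linear combination, with the feature names taken from the list above; the weight values are placeholders, since the actual weights used by the authors are not given here:

```python
def sentence_score(sim_title, sim_query, pos_weight, n_similar, n_total,
                   f_ne, ne_total, w=(1.0, 1.0, 1.0, 1.0, 1.0)):
    """Linear combination of the features described above.
    w holds illustrative per-feature weights."""
    similar_frac = n_similar / (n_total - 1) if n_total > 1 else 0.0
    ne_frac = f_ne / ne_total if ne_total else 0.0
    return (w[0] * sim_title       # Sim(si, T)
            + w[1] * sim_query     # Sim(si, Q)
            + w[2] * pos_weight    # P(si)
            + w[3] * similar_frac  # n(si) / (N - 1)
            + w[4] * ne_frac)      # FNE(si) / NE

# Example: a sentence with two similar neighbours out of five sentences
# and one of the document's four named entities.
print(sentence_score(0.5, 0.5, 1.0, 2, 5, 1, 4))  # → 2.75
```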
The summary is generated by choosing the most important sentences in the document (those with the highest scores) and arranging them in chronological order. Multi-document summaries can be generated in a similar way by calculating the sentence scores in each document separately and then choosing the highest scoring sentences from all documents.

A system built on top of this method was presented at TAC 2008. The system demonstrated better performance at finding relevant content than at removing irrelevant content. Additional weighting of the user query also performed better than weighting based on headlines. For future work, the authors plan to implement redundancy checking and remove redundant information. This should be done on two levels: first, removing repeated or non-essential content within sentences by adding a new redundancy penalty metric, and second, maximizing the diversity of content information by introducing additional threshold metrics. Co-reference resolution and sentence compression are also planned, implemented by means of syntactic trimming.
4.3 NEURAL NETWORK BASED APPROACH
The NetSum system developed at Microsoft Research [Svore et al. 2007] utilizes a machine-learning method based on the neural network ranking algorithm RankNet. The system is customized for summary extraction from news articles, producing three highlighted sentences. The goal is pure extraction, without any sentence compression or sentence generation. Thus, the system is designed to extract the three sentences from a single document that best match the document's three highlights.

To rank the extracted sentences, RankNet, a ranking algorithm based on neural networks, is used. The system is trained on pairs of sentences from a single document, such that the first sentence in the pair should be ranked higher than the second one. Training is based on a modified back-propagation algorithm for two-layer networks; NetSum itself is a two-layer neural network.
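As a rough illustration of the pairwise setup, the sketch below shows a toy two-layer scorer and the RankNet-style cross-entropy loss on a sentence pair. The network size and weights are invented for the example and are not those of NetSum.

```python
import math

def two_layer_score(features, W1, b1, w2, b2):
    """Toy two-layer network: features -> tanh hidden layer -> scalar score."""
    hidden = [math.tanh(sum(w * x for w, x in zip(row, features)) + b)
              for row, b in zip(W1, b1)]
    return sum(w * h for w, h in zip(w2, hidden)) + b2

def ranknet_pair_loss(s_better, s_worse):
    """RankNet cross-entropy: P(better outranks worse) = sigmoid(s_better - s_worse),
    loss = -log P, written stably with log1p."""
    return math.log1p(math.exp(-(s_better - s_worse)))

# A correctly ordered pair yields a small loss, an inverted one a large loss:
print(ranknet_pair_loss(2.0, 0.5) < ranknet_pair_loss(0.5, 2.0))  # → True
```

During training, the loss gradient is back-propagated through both copies of the scorer so that correctly ordered pairs are reinforced and inverted pairs penalized.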
This model utilizes the following features:

Symbol     Feature Name
F(Si)      Is First Sentence
Pos(Si)    Sentence Position
SB(Si)5    SumBasic Score
SBB(Si)    SumBasic Bigram Score
Sim(Si)6   Title Similarity Score

TABLE 1: FEATURES USED IN THE SYSTEM

5 Describes sentence importance based on word frequency
6 Based on the relative probability that a term in a particular sentence is present in the document title
Symbol Feature Name7
NT(Si) Average News Query Term Score
NT+(Si) News Query Term Sum Score
NTr(Si) Relative News Query Term Score
WE(Si) Average Wikipedia Entity Score
WE+(Si) Wikipedia Entity Sum Score
TABLE 2: FEATURES USED IN THE SYSTEM, BASED ON EXTERNAL DATA SOURCES
The news query logs are gathered from Microsoft's news search engine8, while article titles are used from Wikipedia9. Hence, if parts of a news search query or a Wikipedia title appear frequently in the news article to be summarized, a higher importance score is attached to the sentences containing those terms.
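A minimal sketch of this kind of feature: the fraction of a sentence's words that occur in logged news queries. The paper's exact NT and WE definitions differ; this only illustrates the principle, and the example query log is made up.

```python
def query_term_score(sentence, query_log):
    """Fraction of the sentence's words that appear in any logged query
    (a toy stand-in for the NT-style news query term features)."""
    terms = {t for q in query_log for t in q.lower().split()}
    words = sentence.lower().split()
    return sum(1 for w in words if w in terms) / len(words) if words else 0.0

print(query_term_score("earthquake hits the city",
                       ["earthquake news", "city council"]))  # → 0.5
```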
The results of this summarization approach are encouraging. Based on evaluation using ROUGE and comparison to a baseline system, this system performs better than all previous systems for news article summarization from the DUC workshops. In feature ablation studies the authors confirmed that the inclusion of news search queries and Wikipedia titles improves performance; the gains of NetSum with external features are statistically significant at 95% confidence. However, NetSum also performs better than the baseline even without external features.

For future work the authors recommend a more detailed investigation of external features for possible inclusion. It is also expected that separate feature selection for each of the three highlight sentences could further improve performance, as the highlights have different characteristics from one another. Further development of sentence simplification, splicing and merging features is also planned.
7 Features dependent on external data sources, namely query logs and Wikipedia
8 http://search.live.com/news
9 http://www.wikipedia.org
5 MULTI-DOCUMENT SUMMARIZATION
Nowadays it is enormously important to develop procedures for finding text efficiently. There are systems, such as single document summarization systems, that support the automatic generation of extracts, abstracts, or query-based summaries. Single-document summaries provide limited information about the contents of a single document and assist the user in deciding whether or not to read it.

Consider the situation in which a user makes an inquiry about a topic that has recently been treated in the news. Such an inquiry returns hundreds of documents. Although they differ in some respects, many of these documents provide the same information. A summary of each document would help in this case; however, the summaries would be semantically similar. In today's society, in which time plays an important role, multi-document summarizers therefore play an essential role in such situations.
5.1 HISTORY OF MULTI-DOCUMENT SUMMARIZATION
The extraction of a summary text from multiple documents became popular in the mid 1990s, mostly in the domain of news articles. Several web-based news clustering systems10 were inspired by research on multi-document summarization. As already stated, the difference between single-document and multi-document summarization is that the latter involves multiple sources of information. The key task of multi-document summarization is not just identifying redundancy across documents, but also recognizing novelty and ensuring that the final summary is both coherent and complete. [Das and Martins 2007]

This field of automatic summarization was pioneered by the NLP group at Columbia University [McKeown and Radev, 1995], where a summarization system called SUMMONS was developed by extending existing technology for template-driven message understanding systems.
5.1.1 SUMMONS
SUMMONS is a knowledge-based multi-document summarization system that produces summaries from a series of articles in the domain of terrorism. A set of semantic templates, previously extracted by a message understanding system, is supplied as input to the system. The system recognizes specific patterns across these templates, such as changes of perspective, contradictions, refinements, definitions and elaborations. The techniques used in SUMMONS require a large amount of knowledge engineering effort, even for relatively small text domains.

10 Google News, Columbia News Blaster, News in Essence (previously referenced)
Its architecture is based on a language generator. The generator in SUMMONS consists mainly of two components: a content planner, which selects the information to be added to the text, and a linguistic component, which selects the words used to express the selected concepts. The content planner decides which information derived from the templates should be included. The linguistic component uses a grammar and a dictionary to determine the syntactic form of the summary. The length of the summary is determined by input parameters. Information is taken from several articles and evaluated in order of importance.
FIGURE 2: ARCHITECTURE OF SUMMONS [MCKEOWN AND RADEV, 1995]
5.2 ABSTRACTION
In contrast to the extractive method, the abstraction method of automatic text summarization is
based on text generation techniques. In this approach the summaries may contain words that are not
present in the original document. Because of the complexity and ambiguity of natural language,
this is generally a hard task to solve computationally. The following example shows the use of
SUMMONS to create an abstract from four articles containing similar information.
These articles are provided to the Message Understanding System. The system generates templates,
which are stored in a database. SUMMONS then uses the templates to generate the summary text.
Example: Four articles
FIGURE 3: ARTICLES PROVIDED TO THE SYSTEM

JERUSALEM - A Muslim suicide bomber blew apart 18 people on a Jerusalem bus and wounded 10 in a mirror-image of an attack one week ago. The carnage could rob Israel's Prime Minister Shimon Peres of the May 29 election victory he needs to pursue Middle East peacemaking. Peres declared all-out war on Hamas but his tough talk did little to impress stunned residents of Jerusalem who said the election would turn on the issue of personal security.

JERUSALEM - A bomb at a busy Tel Aviv shopping mall killed at least 10 people and wounded 30, Israel radio said quoting police. Army radio said the blast was apparently caused by a suicide bomber. Police said there were many wounded.

A bomb blast ripped through the commercial heart of Tel Aviv Monday, killing at least 13 people and wounding more than 100. Israeli police say an Islamic suicide bomber blew himself up outside a crowded shopping mall. It was the fourth deadly bombing in Israel in nine days. The Islamic fundamentalist group Hamas claimed responsibility for the attacks, which have killed at least 54 people. Hamas is intent on stopping the Middle East peace process. President Clinton joined the voices of international condemnation after the latest attack. He said the ``forces of terror shall not triumph'' over peacemaking efforts.

TEL AVIV (Reuter) - A Muslim suicide bomber killed at least 12 people and wounded 105, including children, outside a crowded Tel Aviv shopping mall Monday, police said. Sunday, a Hamas suicide bomber killed 18 people on a Jerusalem bus. Hamas has now killed at least 56 people in four attacks in nine days. The windows of stores lining both sides of Dizengoff Street were shattered, the charred skeletons of cars lay in the street, the sidewalks were strewn with blood. The last attack on Dizengoff was in October 1994 when a Hamas suicide bomber killed 22 people on a bus.

Templates generated from the articles:
MESSAGE: ID TST-REU-0002
SECSOURCE: SOURCE Reuters
SECSOURCE: DATE March 4, 1996 07:20
PRIMSOURCE: SOURCE Israel Radio
INCIDENT: DATE March 4, 1996
INCIDENT: LOCATION Tel Aviv
INCIDENT: TYPE Bombing
HUM TGT: NUMBER “killed: at least 10''
“wounded: more than 100”
PERP: ORGANIZATION ID
MESSAGE: ID TST-REU-0001
SECSOURCE: SOURCE Reuters
SECSOURCE: DATE March 3, 1996 11:30
PRIMSOURCE: SOURCE
INCIDENT: DATE March 3, 1996
INCIDENT: LOCATION Jerusalem
INCIDENT: TYPE Bombing
HUM TGT: NUMBER “killed: 18''
“wounded: 10”
PERP: ORGANIZATION ID
FIGURE 4: TEMPLATES GENERATED FROM THE PROVIDED ARTICLES [RADEV 2004]
The content planner selects the information to include in the summary by combining the input
templates. The linguistic generator selects the right words to express the information as
grammatical and coherent text. Some of the operations performed by the content planner require
resolving conflicts, for example contradictory information among different sources or time
instants; others complete pieces of information that are included in some articles but not in
others. At the end, the linguistic generator gathers all the combined information and uses
connective phrases to synthesize a summary:
FIGURE 5: SUMMARY SYNTHESIZED BY THE SYSTEM [RADEV 2004]
This method is very promising when the domain is narrow, but generalizing it to broader
domains would be problematic. This was improved later by McKeown and Barzilay.
MESSAGE: ID TST-REU-0004
SECSOURCE: SOURCE Reuters
SECSOURCE: DATE March 4, 1996 14:30
PRIMSOURCE: SOURCE
INCIDENT: DATE March 4, 1996
INCIDENT: LOCATION Tel Aviv
INCIDENT: TYPE Bombing
HUM TGT: NUMBER “killed: at least 12''
“wounded: 105”
PERP: ORGANIZATION ID
MESSAGE: ID TST-REU-0003
SECSOURCE: SOURCE Reuters
SECSOURCE: DATE March 4, 1996 14:20
PRIMSOURCE: SOURCE
INCIDENT: DATE March 4, 1996
INCIDENT: LOCATION Tel Aviv
INCIDENT: TYPE Bombing
HUM TGT: NUMBER “killed: at least 13''
“wounded: more than 100”
PERP: ORGANIZATION ID “Hamas”
Reuters reported that 18 people were killed on Sunday in a bombing in Jerusalem. The next
day, a bomb in Tel Aviv killed at least 10 people and wounded 30 according to Israel radio.
Reuters reported that at least 12 people were killed and 105 wounded in the second incident.
Later the same day, Reuters reported that Hamas has claimed responsibility for the act.
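The conflict-resolution step performed by the content planner can be illustrated with a small sketch. The following Python fragment is our own illustration, not part of SUMMONS: the function name plan_content and the simplified template fields are hypothetical, and real message-understanding templates are far richer.

```python
def plan_content(templates):
    """Toy content-planner step: group templates by incident and, where two
    sources contradict each other (different casualty counts for the same
    incident), prefer the most recent report while recording the earlier figure."""
    by_incident = {}
    for t in sorted(templates, key=lambda x: x["report_time"]):
        key = (t["incident_date"], t["location"])
        prev = by_incident.get(key)
        if prev is not None and prev["killed"] != t["killed"]:
            # contradiction/refinement: keep the newer figure, note the old one
            t = dict(t, superseded_killed=prev["killed"])
        by_incident[key] = t
    return list(by_incident.values())
```

In this sketch, contradictory casualty figures for the same incident are resolved by preferring the later report, mirroring the refinement operation visible in the example above, where the Tel Aviv death toll rises from "at least 10" to "at least 13" across successive reports.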
5.3 TOPIC DRIVEN SUMMARIZATION AND MMR
MMR (maximal marginal relevance) is based on the vector-space model of text retrieval, and is
well-suited to query-based and multi-document summarization. In MMR, sentences are chosen
according to a weighted combination of their relevance to a query and their redundancy with the
sentences that have already been extracted.
"Let Q be a query or user profile and R a ranked list of documents retrieved by a search engine.
Consider an incremental procedure that selects documents, one at a time, and adds them to a set S.
So let S be the set of already selected documents in a particular step, and R \ S the set of yet
unselected documents in R. For each candidate document Di ∈ R\S, its marginal relevance MR(Di) is
computed as:" [Das and Martins 2007:14]
MR(Di) = λ · Sim1(Di, Q) − (1 − λ) · max{Dj ∈ S} Sim2(Di, Dj)

In this formula λ is a parameter that controls the relative importance given to relevance versus
redundancy. Sim1 and Sim2 are two similarity measures, both set to the standard cosine similarity
used in the vector-space model:

Sim1(x, y) = Sim2(x, y) = (x · y) / (|x| · |y|)

The document with the highest marginal relevance is then selected, added to S, and the procedure
continues until a maximum number of documents has been selected. It has been found that
dynamically changing the parameter λ gives more effective results than keeping it fixed.
To perform summarization, documents are first segmented into sentences; after a query is
submitted, the MMR algorithm can be applied. Top-ranking passages (sentences) are selected,
reordered according to their positions in the original documents, and presented as the summary.
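The selection procedure described above can be sketched in a few lines. The following Python fragment is a minimal illustration under our own assumptions (documents are represented as term-weight dictionaries; the names cosine and mmr_select are ours, not taken from the cited work):

```python
import math

def cosine(x, y):
    """Standard vector-space cosine similarity between two term-weight dicts."""
    dot = sum(w * y.get(t, 0.0) for t, w in x.items())
    nx = math.sqrt(sum(w * w for w in x.values()))
    ny = math.sqrt(sum(w * w for w in y.values()))
    return dot / (nx * ny) if nx and ny else 0.0

def mmr_select(query, candidates, lam=0.7, k=3):
    """Greedy MMR: repeatedly pick the candidate maximizing
    lam * Sim1(D, Q) - (1 - lam) * max over selected S of Sim2(D, S)."""
    selected, remaining = [], list(candidates)
    while remaining and len(selected) < k:
        def mr(d):
            redundancy = max((cosine(d, s) for s in selected), default=0.0)
            return lam * cosine(d, query) - (1 - lam) * redundancy
        best = max(remaining, key=mr)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With a small λ the redundancy term dominates, so a near-duplicate of an already selected sentence is passed over in favour of a less redundant, partially relevant one.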
5.4 CENTROID BASED SUMMARIZATION
MEAD is a centroid-based summarizer which uses several algorithms (position-based, TF-IDF11,
longest common subsequence and keywords) to generate summaries. As a result, it returns
centroid-based summaries. A centroid is a set of words which is statistically significant for a cluster.
11 Term frequency / inverse document frequency
Centroids are used both for the classification of relevant documents and for the identification of
phrases in a cluster.
The MEAD summarizer consists mainly of three components:
- Feature extractor,
- Sentence scorer and
- Sentence reranker.
The feature extractor calculates the values of a set of features, defined by the user, for each
sentence. Then the sentence scorer assigns a score (a linear combination of the features) to every
sentence, and the sentences are sorted according to this weighting. The task of the sentence
reranker is to insert the sentences into the resulting summary, beginning with the highest-ranked
ones. In addition, the sentence reranker checks the sentences in the summary for similarity: if the
degree of similarity to an already included sentence is over a certain threshold, the reranker
ignores the sentence and moves on to the next one.
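The interplay of the three components can be sketched as follows. This is a hypothetical Python illustration, not MEAD's actual code: the two features and their weights are our own simplified stand-ins for the position-based, TF-IDF, longest-common-subsequence and keyword features mentioned above.

```python
def position_feature(i, n):
    # earlier sentences score higher (position-based feature)
    return (n - i) / n

def length_feature(sent):
    # mild preference for longer sentences, capped (illustrative only)
    return min(len(sent.split()) / 20.0, 1.0)

def overlap(a, b):
    """Word-overlap similarity used by the reranker to detect redundancy."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def mead_summarize(sentences, weights=(1.0, 0.5), sim_threshold=0.7, k=2):
    n = len(sentences)
    # 1. feature extractor + 2. sentence scorer: linear combination of features
    scored = sorted(
        ((weights[0] * position_feature(i, n) + weights[1] * length_feature(s), s)
         for i, s in enumerate(sentences)),
        reverse=True)
    # 3. sentence reranker: add high-ranked sentences, skipping near-duplicates
    summary = []
    for _, s in scored:
        if len(summary) >= k:
            break
        if all(overlap(s, t) < sim_threshold for t in summary):
            summary.append(s)
    return summary
```

In this sketch the reranker drops a near-duplicate of an already selected sentence even though it ranks second, exactly the threshold behaviour described above.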
Similar documents are grouped into a cluster using an algorithm that is described in detail by
Radev. Each document is represented as a weighted vector of TF-IDF values. CIDR (a system for the
automatic placement of text documents in clusters) then generates a centroid from the first
document in the cluster.
As new documents are processed, their TF-IDF vectors are compared with the centroid using the
formula:

cj = (1 / |Cj|) · Σ{d ∈ Cj} d~

In this formula, the following notation is used:
- cj is the centroid of the j-th cluster
- Cj is the set of documents that belong to the cluster
- d~ is a "truncated version" of d that vanishes on those words whose TF-IDF scores are below a
threshold.
If the similarity of a new document to the centroid cj is within the threshold, the document is
added to the cluster.
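The incremental clustering step can be sketched as follows. This is our own simplified Python illustration of the idea, not the actual CIDR implementation; the threshold values and the truncation rule are assumptions.

```python
import math

def cosine(x, y):
    """Cosine similarity between two TF-IDF term-weight dicts."""
    dot = sum(w * y.get(t, 0.0) for t, w in x.items())
    nx = math.sqrt(sum(w * w for w in x.values()))
    ny = math.sqrt(sum(w * w for w in y.values()))
    return dot / (nx * ny) if nx and ny else 0.0

def truncate(doc, tfidf_threshold):
    """d~: keep only words whose TF-IDF score reaches the threshold."""
    return {t: w for t, w in doc.items() if w >= tfidf_threshold}

def cluster(docs, sim_threshold=0.3, tfidf_threshold=0.1):
    """Incremental clustering in the style of CIDR: a new document joins a
    cluster when its similarity to the centroid cj passes the threshold,
    and the centroid is recomputed as the average of the truncated members."""
    clusters = []  # each cluster: {"docs": [...], "centroid": {...}}
    for d in docs:
        d_t = truncate(d, tfidf_threshold)
        best = max(clusters, key=lambda c: cosine(d_t, c["centroid"]), default=None)
        if best is not None and cosine(d_t, best["centroid"]) >= sim_threshold:
            best["docs"].append(d)
            members = [truncate(m, tfidf_threshold) for m in best["docs"]]
            terms = {t for m in members for t in m}
            best["centroid"] = {t: sum(m.get(t, 0.0) for m in members) / len(members)
                                for t in terms}
        else:
            clusters.append({"docs": [d], "centroid": d_t})
    return clusters
```

The first document of each topic seeds a new centroid; subsequent documents about the same topic pass the similarity threshold and pull the centroid toward the cluster average.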
6 LESSONS LEARNED
The area of text summarization has been an active focus of the research community since the 1990s.
Several families of methods exist; in recent times the most widely applied ones have been based on
machine learning and statistical techniques, including binary classifiers, Bayesian methods, and
heuristic methods with weighted feature vectors. Graph-based methods have also been employed, as
have approaches based on neural-network training. [Svore et al. 2007]
According to the methods presented in this work and the research results obtained so far, many
researchers agree that better performance requires multi-feature approaches, especially ones based
on external sources of information. For instance, recent works cited here use Wikipedia titles or
news queries from search engines to improve summarization performance. Which other sources of
information and features should be used remains to be investigated. It is also important to note
that summarization systems depend on the other modules used in the preprocessing and processing
stages.
Where external NLP tools are used, the summarization system depends strongly on the quality and
performance of the underlying POS taggers, chunkers, lemmatizers, stemmers and sentence detectors.
For many languages such tools are not fully available, or they are at an immature stage of
development.
Where external ontology-based sources are used, the quality and depth of the ontology also plays
an important role. It has been shown [Farzindar and Lapalme 2004; Verma et al. 2007] that
domain-specific systems give better overall results for document summarization. Works based on
user query input [Svore et al. 2007] also show promising performance, in that summaries can be
adjusted and built according to user requirements and expectations.
7 CONCLUSION
In this work a short introduction to automatic text summarization has been given. Interesting uses
of this technology were described, as were the issues confronting the successful application of
such methods. Further, the short historical background of research in the area has been covered,
while most attention has been paid to the general approaches to summarization currently used and
proposed. These include both single-document and multi-document summarization methods.
Among the approaches analyzed, three single-document methods relying on machine learning
techniques were covered: an approach based on ontology knowledge, an approach based on feature
appraisal, and the application of NLP in summarization. Finally, a recent development based on the
application of neural networks to extractive text summarization was presented, together with the
findings and recommendations related to it.
For multi-document summarization, a simple example of abstractive summary generation was given,
and topic-driven and centroid-based summarization were analyzed.
REFERENCES
[Bawakid and Oussalah 2008] Bawakid, A., Oussalah, M.: "A Semantic Summarization System: University of Birmingham at TAC 2008"; Proceedings of the First Text Analysis Conference (2008)
[Das and Martins 2007] Das, D., Martins, A. F.T.: "A Survey on Automatic Text Summarization"; Literature Survey for the Language and Statistics II course at CMU (2007)
[Edmundson 1969] Edmundson, H.P.: "New Methods in Automatic Extracting"; Journal of the ACM (JACM). Volume 16 , Issue 2 (April 1969). 264- 285
[Farzindar and Lapalme 2004] Farzindar, A. and Lapalme, G.: "LetSum, an automatic Legal Text Summarizing system"; Legal Knowledge and Information Systems. Jurix 2004: The Seventeenth Annual Conference. Amsterdam. IOS Press (2004), 11-18
[Heting 2007] Heting, C.: "Information Representation and Retrieval in the Digital Age"; Asist Monograph Series (2007), 7-8
[Jezek and Steinberger 2008] Jezek, K. and Steinberger, J.: "Automatic Text Summarization (The state of the art 2007 and new challenges)"; Znalosti 2008. FIIT STU Bratislava, Slovakia (2008), 1-12
[Lin et al. 2009] Lin, J., Ozsu, M. Tamer, Liu, L.: "Summarization"; Encyclopedia of Database Systems, Springer (2009)
[McKeown and Radev, 1995] McKeown, K.R. and Radev, D.R.: "Generating summaries of multiple news articles."; In Proceedings of SIGIR '95 (1995), 74-82
[Nenkova 2005] Nenkova, A.: "Automatic Text Summarization of Newswire: Lessons Learned from the Document Understanding Conference"; American Association for Artificial Intelligence (2005)
[Neto et al. 2002] Neto, J.L., Freitas, A.A., Kaestner C.A.A.: "Automatic Text Summarization Using a Machine Learning Approach"; SBIA 2002. Springer Verlag Berlin Heidelberg (2002), 205-215
[Radev 2004] Radev, D.R.: "Text summarization"; Tutorial ACM SIGIR, Sheffield, UK (2004), http://www.summarization.com/sigirtutorial2004.ppt
[Reimer and Hahn 1997] Reimer, U. and Hahn, U.: "A Formal Model of Text Summarization Based on Condensation Operators of a Terminological Logic"; Advances in Automated Text Summarization (1997), 97-104
[Song et al. 2007] Song, X., Chi, Y., Hino, K., Tseng, B. L.: "Summarization System by Identifying Influential Blogs"; ICWSM 2007 (2007)
[Svore et al. 2007] Svore, K.M., Vanderwende, L., Burges, C.J.C.: "Enhancing Single-document Summarization by Combining RankNet and Third-party Sources"; Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (2007), Czech Republic, 448-457
[Verma et al. 2007] Verma, R., Chen, P., Lu, W.: "A Semantic Free-text Summarization System Using Ontology Knowledge"; Proceedings of the Document Understanding Conference (2007)
[Zhiqi et al. 2005] Zhiqi, W., Yongcheng, W., Chuanhan, L., Derong, L.: "An Automatic Summarization Service System Based on Web Services"; Proceedings of the 2005 The Fifth International Conference on Computer and Information Technology (2005)
ABBREVIATIONS
DUC Document Understanding Conference
MEAD Platform for multi-document – multi-lingual text summarization
NE Named Entity
NLP Natural Language Processing
PDA Personal Digital Assistant
ROUGE Recall Oriented Understudy for Gisting Evaluation
TAC Text Analysis Conference
UMLS Unified Medical Language System
FIGURES
Figure 1: The summarizer Architecture [ ] ................................................................................................................ 12
Figure 2: Architecture of SUMMONS [McKeown and Radev, 1995] ................................................................... 17
Figure 3: Articles provided to the system ................................................................................................................... 18
Figure 4: Templates generated from the provided articles [Radev, ACM SIGIR Tutorial] ...................... 19
Figure 5: Summary synthesized by the system [Radev, ACM SIGIR Tutorial] .............................................. 19
TABLES
Table 1: Features used in the system ............................................................................................................................ 14
Table 2: Features used in the system, based on external data sources........................................................... 15