TRANSCRIPT
Graz University of Technology
706.048 Information Search and Retrieval
WS 2009
Automatic Text Summarization
Group 10
Rejhan Basagic [0330740]
Damir Krupic [0231316]
Bojan Suzic [0631814]
Supervisor:
Dipl.-Ing. Dr.techn. Christian Gütl
Institute for Information Systems and Computer Media, Graz
Copyright (C) 2009 Rejhan Basagic, Damir Krupic, Bojan Suzic.
This work may be used by anyone in accordance with the terms of the License for Free Content (Lizenz für Freie Inhalte). The license terms can be retrieved at http://www.uvm.nrw.de/opencontent or requested in writing from the office of the Kompetenznetzwerk Universitätsverbund MultiMedia NRW, Universitätsstraße 11, D-58097 Hagen.
ABSTRACT
One of the problems that has arisen with the rapid growth of the web and of information availability in general (sometimes referred to as information overload) is the increased need for effective and powerful text summarization. In this document we present a short historical overview of automatic text summarization and the most relevant approaches currently used in this area, covering both single- and multi-document summarization. Furthermore, examples of the use of the technology are summarized, and a short conclusion and future directions are given.
ZUSAMMENFASSUNG
One of the problems that has arisen recently with the rapid growth of the Internet and the general availability of information is the increasing need for effective and powerful text summarization. This document presents a short historical overview and the development of automatic text summarization. Further, the most relevant and important concepts are presented, with a focus on single- and multi-document summarization. In addition, examples of the use of the technology are summarized, and a short conclusion and future directions are given.
TABLE OF CONTENTS
Abstract
Zusammenfassung
Table of Contents
1 Introduction
1.1 Applications of automatic summarization
1.2 Scope of this work
2 Background
2.1 1955-1979 Early Extraction and Linguistic Approaches
2.2 1980s and 1990s - Artificial Intelligence Approaches and "Renaissance"
3 Taxonomy of Summarization Methods
4 Single Document Summarization
4.1 Ontology Knowledge Based Summarization
4.2 Feature Appraisal Based Summarization
4.3 Neural Network Based Approach
5 Multi-Document Summarization
5.1 History of Multi-Document Summarization
5.1.1 SUMMONS
5.2 Abstraction
5.3 Topic Driven Summarization and MMR
5.4 Centroid Based Summarization
6 Lessons Learned
7 Conclusion
References
Abbreviations
Figures
Tables
1 INTRODUCTION
In recent times, the wide availability of information on the Internet, driven by many factors including the rapid digitization of paper documents and the rapid expansion of the Internet under the Web 2.0 paradigm, has caused a global information abundance. This has raised the need for alternative ways of displaying and selecting textual and multimedia content, such that its most important parts are presented to the user to support the decision about further steps: should an article be read and investigated further, or should another one be considered? Nowadays there is too much content on the web, and information quality and relevance are of great concern. Users expect short summaries of written information in order to select the most appropriate items for further work.
1.1 APPLICATIONS OF AUTOMATIC SUMMARIZATION
One example of document summarization is in the legal area. Legal experts perform difficult and responsible work, and their resources are scarce and expensive in both time and expertise. Thus, a system for concise summarization is necessary so that experts can effectively and quickly find compressed and restated content of relevant judicial documents, including laws and legislative proposals, relevant court decisions, and summaries of tribunal proceedings.

In the medical domain there is often an overload of information, and in many cases medical personnel are required to find relevant information about a patient's condition in a timely manner. This involves crawling through many documents and patient records to obtain the necessary information. In this area, text summarization specifically adjusted to the medical domain is of considerable use, saving time and optimizing the availability of medical experts.
On the Internet there are many examples of summarization in use. For instance, news portals like Google1, Microsoft News2 or Columbia Newsblaster3 rely on such techniques to provide short news summaries to their visitors. There are also services providing blog summarization and aggregation4, as well as opinion survey systems [Song et al. 2007].

Further, document summarization is also applied on PDA devices with small screens, where only limited screen size and time are available for reading. For businesses it is also important to have summaries of meetings available, possibly coupled with speech recognition systems, to provide meeting minutes quickly and without using excessive
1 http://news.google.com
2 http://msnbc.msn.com
3 http://newsblaster.cs.columbia.edu
4 For instance, http://www.bloghearld.com
human and other resources. In the area of accessibility, text summarization systems are also of great help to handicapped people. Combined with speech synthesis technologies, they can save readers much time by allowing them to quickly recognize and separate important from less important content according to their interests. [Lin et al. 2009]
1.2 SCOPE OF THIS WORK
This work is focused on the topic of automatic text summarization. In the first chapter some interesting uses of this technology are summarized. In the second chapter the focus is on a short historical overview of text summarization. The third chapter presents a taxonomy of summarization methods.

In the fourth and fifth chapters, the parts of this area relating to single and multiple document summarization are investigated in more detail, including the currently promising methods. Some characteristics of these proposals and suggestions for future improvements are presented. In the sixth chapter the lessons learned are analyzed in more detail, together with proposals for future work.

Finally, in the seventh chapter an appropriate conclusion is given, together with a short overview of this work.
2 BACKGROUND
2.1 1955-1979 EARLY EXTRACTION AND LINGUISTIC APPROACHES
Hans Peter Luhn, a popular IBM inventor, was a pioneer in using computers for information retrieval. He published the first paper in this area, entitled "A new method of recording and searching information" (Luhn, 1953). After becoming manager of information retrieval research at IBM, Luhn explored and developed many IR applications.

One such application is KWIC (Keyword in Context), which uses three elements fundamental to information retrieval: keyword, title and context. [Heting 2007]
Edmundson described new methods of automatically extracting documents for screening purposes. His previous work was based on the presence of high-frequency content words (keywords). In his paper from 1969 he also described three additional components: pragmatic words (cue words); title and heading words; and structural indicators.

The results indicated that the three newly proposed components dominate the frequency component in the creation of better extracts. Edmundson also tried to develop an algorithm for the automatic extraction of summaries from a corpus. [Edmundson 1969]
2.2 1980S AND 1990S - ARTIFICIAL INTELLIGENCE APPROACHES AND "RENAISSANCE"
Interest shifted toward using AI methods, hybrid approaches, and the summarization of document groups and multimedia documents.
"... a series of knowledge-based text summarization systems evolved, the methodology of which was
almost exclusively based on the Schankian-type of Conceptual Dependency (CD) representations (e.g.
(Cullingford 78, Lehnert 81, DeJong 82, Dyer 83, Tait 85, Alterman 86)) CD representations, however,
are formally underspecified representation devices lacking any serious formal foundation.
According to this, the summarization operations these first-generations systems provide use only
informal heuristics to determine the salient topics from the text representation structures for the
purpose of summarization.
A second generation of summarization systems then adopted a more mature knowledge
representation approach, one based on the evolving methodological framework of hybrid,
classification-based knowledge representation languages (cf. (Woods & Schmolze 92) for a survey).
Among these systems count SUSY (Fum et. al. 85), SCISOR (Rau 87), and TOPIC (Reimer & Hahn 88),
but even in these frameworks no attempt was made to properly integrate the text summarization
process into the formal reasoning mechanisms of the underlying knowledge representation language."
[Reimer and Hahn 1997]
3 TAXONOMY OF SUMMARIZATION METHODS
There are several types of text summarization. According to the form of the summary, the following may be distinguished:

Extracts: summaries consisting entirely of sentences or word sequences contained in the original document. Besides complete sentences, extracts can contain phrases and paragraphs. The problem with this approach is usually a lack of balance and cohesion: sentences can be extracted out of context, and anaphoric references can be broken.

Abstracts: these contain word sequences not present in the original. They are usually built from the existing content using more advanced methods. It is generally hard for a computer to meet the requirements of such an approach because of many limitations, including the state of the art in language generation and the complexity of human language.

From the point of view of the processing level involved in creating document summaries, the following can be recognized:

Surface level approach: here information is represented through shallow features. These include different types of terms, e.g. statistically and positionally salient ones, terms from cue phrases, or domain-specific and user-supplied terms. This approach usually produces an extraction-based summary as output.

Deeper level approach: this approach may involve sentence generation. Advanced semantic analysis is necessary to achieve such a task. The output of this approach may take the form of abstracts or extracts.

It is also important to recognize the audience of the summaries:

Generic summaries are aimed at a broad community of readers.

Query-based summaries are built on top of a previously submitted user query.

User or topic focused summaries are tailored to the interests of the user and represent only a particular topic.

Summaries can also be single- or multi-document based. They can be produced in a mono- or multilingual context and applied to different genres.

In this work we concentrate on general types of summaries in single and multi-document contexts, analyzing several approaches currently under research.
4 SINGLE DOCUMENT SUMMARIZATION
Currently, most of the work done in the field relates to sentence-extraction-based document summarization. Since the single-document summarization track was dropped from the DUC challenge (2003), research in the area of single document summarization has been somewhat declining. According to [Nenkova 2005], summarization systems tend to perform better on multi-document than on single-document summarization tasks. This is counter-intuitive given the general feeling that multi-document summarization is more difficult than single-document summarization. However, it can be partially explained by the fact that repetitive occurrences in the input documents can be used as an indication of importance in multi-document environments.

Recent work on single document summarization is based on several different approaches, ranging from classical ones, through statistically based ones, to ones built on top of deep natural language analysis methods. In this section the most representative approaches are analyzed and presented.
4.1 ONTOLOGY KNOWLEDGE BASED SUMMARIZATION
The proposal from [Verma et al. 2007] focuses on dynamic summary generation based on a user input query. This approach has been designed for application in a specific domain (medical); however, it can be used in a general domain too.

The idea presented in this proposal is based on the fact that the user selects keywords to search for documents with specific requirements. However, these keywords may not match the document's main idea, so the summary provided by the static author-written abstract may not be a good summary for the user and the specific search query. Hence, the summary needs to be generated dynamically, according to the user requirements given by the search query.

The system is coupled with two ontology knowledge sources, WordNet and UMLS. WordNet is a widely known lexical database for English, developed at Princeton University. The database consists of linked words, including nouns, verbs, adjectives and adverbs. These words are connected into sets of cognitive synonyms called synsets, representing basic inter-relations such as hypernym, meronym and pertainym. The second source, UMLS, is maintained by the US National Library of Medicine and includes three knowledge sources: the Metathesaurus, the Semantic Network and the Specialist Lexicon. In the current approach only the first two sources are used, as its focus is on semantic analysis.
Basically, there are three steps involved in the creation of the document summary in such system:
1) Evaluation and adjustment of the query with regard to the WordNet and/or UMLS ontology knowledge. Redundant keywords are removed and relevant ones added.

2) Calculation of the distance of the document's sentences to the relevant query. Sentences below the predefined threshold are candidates for inclusion in the document summary.

3) Calculation of the distance among the candidate summary sentences. The candidates are then separated into groups based on the threshold, and the highest ranked candidate from each group becomes part of the document summary.

A system based on this method was presented at DUC 2007. The problems found with this approach relate to redundancy reduction, the lack of syntax analysis and insufficient query analysis. Namely, the redundancy reduction step performed so far is only partially effective, since repetitive coverage of the same information from multiple documents is still present. In the first tests as a part of DUC 2007 the method showed good potential, and the authors stated that the present issues and improvements are to be addressed in future work. Future versions of the system will include natural language processing, including parsing and syntax analysis. Moreover, statistical data from the documents' abstracts will be used as part of the scoring algorithm in sentence extraction.

4.2 FEATURE APPRAISAL BASED SUMMARIZATION

Another similar approach based on semantic analysis of the document has been proposed in [Bawakid and Oussalah 2008]. In this approach the authors propose a scoring system for document extraction based on static and dynamic document features. Static features include sentence locations and named entities (NE) in each sentence. Dynamic features used for scoring include the semantic similarity between sentences and the user query.

FIGURE 1: THE SUMMARIZER ARCHITECTURE [BAWAKID AND OUSSALAH 2008]
In the proposed system, three steps are involved in creating a document summary: preprocessing, analysis and summary generation.

In the preprocessing stage, unnecessary elements such as HTML tags, news agency names or table numbers are removed from the document. Further, the document is tokenized and sentence boundaries are detected. In further processing, named entities such as locations, organizations or names are detected, the words are POS (part-of-speech) tagged, and finally co-reference resolution is performed.
In the second step, the features are extracted and analyzed. Basically, features such as sentence location, named entities in the sentence, and semantic similarity to the document title and user query are used to build a relevancy score for each sentence.

An important feature used here is also the semantic similarity to the other sentences in the document. The external information source used in this step is WordNet, which provides synonyms for the adjectives and adverbs found in the sentences. A further step in the sentence similarity calculation involves linking adjectives to their respective nouns and adverbs to their verbs. To distinguish the relative importance of the nouns in a sentence, linguistic quantifiers are used. Basically, two classes of linguistic quantifiers are used: ones increasing the importance of a word (like "very" or "more") and ones decreasing it (like "less" or "none"). Then, for both of the compared sentences, the best average match for each noun and verb is calculated, taking the related quantifiers into account. After the average matches are calculated, the similarity between the sentences is computed by summing the best average matches of both sentences and dividing the result by the sum of the previously calculated sentence scores. This means that the result depends on every verb and noun in both sentences. The same method is used to determine the similarity score with the user query and the document title.
Finally, the sentence score is calculated as a linear combination of the weighted features, in the following way:

Score(si) = w1 · Sim(si, T) + w2 · Sim(si, Q) + w3 · P(si) + w4 · n(si) / (N − 1) + w5 · FNE(si) / NE

where the wk are feature weights and:

- N is the total number of sentences in the document
- n(si) is the number of sentences having semantic similarity to si above a predefined threshold
- P(si) is the sentence position weight
- Sim(si, T) and Sim(si, Q) are the semantic similarities to the document title and user query, respectively
- NE is the number of named entities in the document
- FNE(si) is the number of named entities contained in sentence si
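A hedged sketch of such a linear combination, with the feature names taken from the list above; the weight values are placeholders, since the actual weights used by the authors are not given here:

```python
def sentence_score(sim_title, sim_query, pos_weight, n_similar, n_total,
                   f_ne, ne_total, w=(1.0, 1.0, 1.0, 1.0, 1.0)):
    """Linear combination of the features described above.
    w holds illustrative per-feature weights."""
    similar_frac = n_similar / (n_total - 1) if n_total > 1 else 0.0
    ne_frac = f_ne / ne_total if ne_total else 0.0
    return (w[0] * sim_title       # Sim(si, T)
            + w[1] * sim_query     # Sim(si, Q)
            + w[2] * pos_weight    # P(si)
            + w[3] * similar_frac  # n(si) / (N - 1)
            + w[4] * ne_frac)      # FNE(si) / NE

# Example: a sentence with two similar neighbours out of five sentences
# and one of the document's four named entities.
print(sentence_score(0.5, 0.5, 1.0, 2, 5, 1, 4))  # → 2.75
```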
The summary is generated by choosing the most important sentences in the document (those with the highest scores) and arranging them in chronological order. Multi-document summaries can be generated in a similar way by calculating the sentence scores in each document separately and then choosing the highest scoring sentences from all documents.

A system built on top of this method was presented at TAC 2008. The system demonstrated better performance at finding relevant content than at removing irrelevant content. Additional weighting of the user query also performed better than weighting based on headlines. For future work, the authors plan to implement redundancy checking and remove redundant information. This should be done on two levels: first, removing repeated or non-essential content within sentences by adding a new redundancy penalty metric, and second, maximizing the diversity of content information by introducing additional threshold metrics. Co-reference resolution and sentence compression are also planned, implemented by means of syntactic trimming.
4.3 NEURAL NETWORK BASED APPROACH
The NetSum system developed at Microsoft Research [Svore et al. 2007] utilizes a machine-learning method based on the neural network ranking algorithm RankNet. The system is customized for summary extraction from news articles, producing three highlighted sentences. The goal is pure extraction, without any sentence compression or sentence generation. Thus, the system is designed to extract the three sentences from a single document that best match the document's three highlights.

To rank the extracted sentences, RankNet, a ranking algorithm based on neural networks, is used. The system is trained on pairs of sentences from a single document, such that the first sentence in the pair should be ranked higher than the second one. Training is based on a modified back-propagation algorithm for two-layer networks; NetSum itself is a two-layer neural network.
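As a rough illustration of the pairwise setup, the sketch below shows a toy two-layer scorer and the RankNet-style cross-entropy loss on a sentence pair. The network size and weights are invented for the example and are not those of NetSum.

```python
import math

def two_layer_score(features, W1, b1, w2, b2):
    """Toy two-layer network: features -> tanh hidden layer -> scalar score."""
    hidden = [math.tanh(sum(w * x for w, x in zip(row, features)) + b)
              for row, b in zip(W1, b1)]
    return sum(w * h for w, h in zip(w2, hidden)) + b2

def ranknet_pair_loss(s_better, s_worse):
    """RankNet cross-entropy: P(better outranks worse) = sigmoid(s_better - s_worse),
    loss = -log P, written stably with log1p."""
    return math.log1p(math.exp(-(s_better - s_worse)))

# A correctly ordered pair yields a small loss, an inverted one a large loss:
print(ranknet_pair_loss(2.0, 0.5) < ranknet_pair_loss(0.5, 2.0))  # → True
```

During training, the loss gradient is back-propagated through both copies of the scorer so that correctly ordered pairs are reinforced and inverted pairs penalized.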
This model utilizes the following features:

Symbol     Feature Name
F(Si)      Is First Sentence
Pos(Si)    Sentence Position
SB(Si)5    SumBasic Score
SBB(Si)    SumBasic Bigram Score
Sim(Si)6   Title Similarity Score

TABLE 1: FEATURES USED IN THE SYSTEM

5 Describes sentence importance based on word frequency
6 Based on the relative probability that a term in a particular sentence is present in the document title
Symbol Feature Name7
NT(Si) Average News Query Term Score
NT+(Si) News Query Term Sum Score
NTr(Si) Relative News Query Term Score
WE(Si) Average Wikipedia Entity Score
WE+(Si) Wikipedia Entity Sum Score
TABLE 2: FEATURES USED IN THE SYSTEM, BASED ON EXTERNAL DATA SOURCES
The news query logs are gathered from Microsoft's news search engine8, while article titles are used from Wikipedia9. Hence, if parts of a news search query or a Wikipedia title appear frequently in the news article to be summarized, a higher importance score is attached to the sentences containing those terms.
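A minimal sketch of this kind of feature: the fraction of a sentence's words that occur in logged news queries. The paper's exact NT and WE definitions differ; this only illustrates the principle, and the example query log is made up.

```python
def query_term_score(sentence, query_log):
    """Fraction of the sentence's words that appear in any logged query
    (a toy stand-in for the NT-style news query term features)."""
    terms = {t for q in query_log for t in q.lower().split()}
    words = sentence.lower().split()
    return sum(1 for w in words if w in terms) / len(words) if words else 0.0

print(query_term_score("earthquake hits the city",
                       ["earthquake news", "city council"]))  # → 0.5
```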
The results of this summarization approach are encouraging. Based on evaluation using ROUGE and comparison to a baseline system, this system performs better than all previous systems for news article summarization from the DUC workshops. In feature ablation studies the authors confirmed that the inclusion of news search queries and Wikipedia titles improves performance; the gains of NetSum with external features are statistically significant at 95% confidence. However, NetSum also performs better than the baseline even without external features.

For future work the authors recommend a more detailed investigation of external features for possible inclusion. It is also expected that separate feature selection for each of the three highlight sentences could further improve performance, as the highlights have different characteristics from one another. Further development of sentence simplification, splicing and merging features is also planned.
7 Features dependent on external data sources, namely query logs and Wikipedia
8 http://search.live.com/news
9 http://www.wikipedia.org
5 MULTI-DOCUMENT SUMMARIZATION
Nowadays it is enormously important to develop procedures for finding text efficiently. There are systems, such as single document summarization systems, that support the automatic generation of extracts, abstracts, or query-based summaries. Single-document summaries provide limited information about the contents of a single document and assist the user in deciding whether or not to read it.

Consider the situation in which a user makes an inquiry about a topic that has recently been treated in the news. Such an inquiry returns hundreds of documents. Although they differ in some respects, many of these documents provide the same information. A summary of each document would help in this case; however, the summaries would be semantically similar. In today's society, in which time plays an important role, multi-document summarizers therefore play an essential role in such situations.
5.1 HISTORY OF MULTI-DOCUMENT SUMMARIZATION
The extraction of a summary text from multiple documents became popular in the mid 1990s, mostly in the domain of news articles. Several web-based news clustering systems10 were inspired by research on multi-document summarization. As already stated, the difference between single-document and multi-document summarization is that the latter involves multiple sources of information. The key task of multi-document summarization is not just identifying redundancy across documents, but also recognizing novelty and ensuring that the final summary is both coherent and complete. [Das and Martins 2007]

This field of automatic summarization was pioneered by the NLP group at Columbia University [McKeown and Radev, 1995], where a summarization system called SUMMONS was developed by extending existing technology for template-driven message understanding systems.
5.1.1 SUMMONS
SUMMONS is a knowledge-based multi-document summarization system that produces summaries from a series of articles in the domain of terrorism. A set of semantic templates, previously extracted by a message understanding system, is supplied as input to the system. The system recognizes specific patterns across these templates, such as changes of perspective, contradictions, refinements, definitions and elaborations. The techniques used in SUMMONS require a large amount of knowledge engineering effort, even for relatively small text domains.

10 Google News, Columbia News Blaster, News in Essence (previously referenced)
Its architecture is based on a language generator. The generator in SUMMONS consists mainly of two components: a content planner, which selects the information to be added to the text, and a linguistic component, which selects the words used to express the selected concepts. The content planner decides which information derived from the templates should be included. The linguistic component uses a grammar and a dictionary to determine the syntactic form of the summary. The length of the summary is determined by input parameters. Information is taken from several articles and evaluated in order of importance.
FIGURE 2: ARCHITECTURE OF SUMMONS [MCKEOWN AND RADEV, 1995]
5.2 ABSTRACTION
In contrast to the extractive method, the abstraction method of automatic text summarization is
based on text generation techniques. In this approach the summaries may contain words that are not
present in the original document. Because of the complexity and ambiguity of natural language,
this is generally a hard task to solve computationally. The following example shows the use of
SUMMONS to create an abstract from four articles containing similar information.
These articles are provided to the Message Understanding System. The system generates templates,
which are stored in a database. SUMMONS then uses the templates to generate the summary text.
Example: Four articles
FIGURE 3: ARTICLES PROVIDED TO THE SYSTEM

JERUSALEM - A Muslim suicide bomber blew apart 18 people on a Jerusalem bus and wounded 10 in a mirror-image of an attack one week ago. The carnage could rob Israel's Prime Minister Shimon Peres of the May 29 election victory he needs to pursue Middle East peacemaking. Peres declared all-out war on Hamas but his tough talk did little to impress stunned residents of Jerusalem who said the election would turn on the issue of personal security.

JERUSALEM - A bomb at a busy Tel Aviv shopping mall killed at least 10 people and wounded 30, Israel radio said quoting police. Army radio said the blast was apparently caused by a suicide bomber. Police said there were many wounded.

A bomb blast ripped through the commercial heart of Tel Aviv Monday, killing at least 13 people and wounding more than 100. Israeli police say an Islamic suicide bomber blew himself up outside a crowded shopping mall. It was the fourth deadly bombing in Israel in nine days. The Islamic fundamentalist group Hamas claimed responsibility for the attacks, which have killed at least 54 people. Hamas is intent on stopping the Middle East peace process. President Clinton joined the voices of international condemnation after the latest attack. He said the ``forces of terror shall not triumph'' over peacemaking efforts.

TEL AVIV (Reuter) - A Muslim suicide bomber killed at least 12 people and wounded 105, including children, outside a crowded Tel Aviv shopping mall Monday, police said. Sunday, a Hamas suicide bomber killed 18 people on a Jerusalem bus. Hamas has now killed at least 56 people in four attacks in nine days. The windows of stores lining both sides of Dizengoff Street were shattered, the charred skeletons of cars lay in the street, the sidewalks were strewn with blood. The last attack on Dizengoff was in October 1994 when a Hamas suicide bomber killed 22 people on a bus.

Templates generated from the articles:
MESSAGE: ID TST-REU-0002
SECSOURCE: SOURCE Reuters
SECSOURCE: DATE March 4, 1996 07:20
PRIMSOURCE: SOURCE Israel Radio
INCIDENT: DATE March 4, 1996
INCIDENT: LOCATION Tel Aviv
INCIDENT: TYPE Bombing
HUM TGT: NUMBER “killed: at least 10''
“wounded: more than 100”
PERP: ORGANIZATION ID
MESSAGE: ID TST-REU-0001
SECSOURCE: SOURCE Reuters
SECSOURCE: DATE March 3, 1996 11:30
PRIMSOURCE: SOURCE
INCIDENT: DATE March 3, 1996
INCIDENT: LOCATION Jerusalem
INCIDENT: TYPE Bombing
HUM TGT: NUMBER “killed: 18''
“wounded: 10”
PERP: ORGANIZATION ID
FIGURE 4: TEMPLATES GENERATED FROM THE PROVIDED ARTICLES [RADEV 2004]
The content planner selects the information to include in the summary by combining the input
templates. The linguistic generator selects the right words to express the information as
grammatical and coherent text. Some of the operations performed by the content planner require
resolving conflicts, for example contradictory information among different sources or time
instants; others complete pieces of information that are included in some articles but not in
others. At the end, the linguistic generator gathers all the combined information and uses
connective phrases to synthesize a summary:
FIGURE 5: SUMMARY SYNTHESIZED BY THE SYSTEM [RADEV 2004]
This method is very promising when the domain is narrow, but generalizing it to broader
domains would be problematic. This was improved later by McKeown and Barzilay.
MESSAGE: ID TST-REU-0004
SECSOURCE: SOURCE Reuters
SECSOURCE: DATE March 4, 1996 14:30
PRIMSOURCE: SOURCE
INCIDENT: DATE March 4, 1996
INCIDENT: LOCATION Tel Aviv
INCIDENT: TYPE Bombing
HUM TGT: NUMBER “killed: at least 12''
“wounded: 105”
PERP: ORGANIZATION ID
MESSAGE: ID TST-REU-0003
SECSOURCE: SOURCE Reuters
SECSOURCE: DATE March 4, 1996 14:20
PRIMSOURCE: SOURCE
INCIDENT: DATE March 4, 1996
INCIDENT: LOCATION Tel Aviv
INCIDENT: TYPE Bombing
HUM TGT: NUMBER “killed: at least 13''
“wounded: more than 100”
PERP: ORGANIZATION ID “Hamas”
Reuters reported that 18 people were killed on Sunday in a bombing in Jerusalem. The next
day, a bomb in Tel Aviv killed at least 10 people and wounded 30 according to Israel radio.
Reuters reported that at least 12 people were killed and 105 wounded in the second incident.
Later the same day, Reuters reported that Hamas has claimed responsibility for the act.
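The conflict-resolution step performed by the content planner can be illustrated with a small sketch. The following Python fragment is our own illustration, not part of SUMMONS: the function name plan_content and the simplified template fields are hypothetical, and real message-understanding templates are far richer.

```python
def plan_content(templates):
    """Toy content-planner step: group templates by incident and, where two
    sources contradict each other (different casualty counts for the same
    incident), prefer the most recent report while recording the earlier figure."""
    by_incident = {}
    for t in sorted(templates, key=lambda x: x["report_time"]):
        key = (t["incident_date"], t["location"])
        prev = by_incident.get(key)
        if prev is not None and prev["killed"] != t["killed"]:
            # contradiction/refinement: keep the newer figure, note the old one
            t = dict(t, superseded_killed=prev["killed"])
        by_incident[key] = t
    return list(by_incident.values())
```

In this sketch, contradictory casualty figures for the same incident are resolved by preferring the later report, mirroring the refinement operation visible in the example above, where the Tel Aviv death toll rises from "at least 10" to "at least 13" across successive reports.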
5.3 TOPIC DRIVEN SUMMARIZATION AND MMR
MMR (maximal marginal relevance) is based on the vector-space model of text retrieval, and is
well-suited to query-based and multi-document summarization. In MMR, sentences are chosen
according to a weighted combination of their relevance to a query and their redundancy with the
sentences that have already been extracted.
"Let Q be a query or user profile and R a ranked list of documents retrieved by a search engine.
Consider an incremental procedure that selects documents, one at a time, and adds them to a set S.
So let S be the set of already selected documents in a particular step, and R \ S the set of yet
unselected documents in R. For each candidate document Di ∈ R\S, its marginal relevance MR(Di) is
computed as:" [Das and Martins 2007:14]
MR(Di) = λ · Sim1(Di, Q) − (1 − λ) · max{Dj ∈ S} Sim2(Di, Dj)

In this formula λ is a parameter that controls the relative importance given to relevance versus
redundancy. Sim1 and Sim2 are two similarity measures, both set to the standard cosine similarity
used in the vector-space model:

Sim1(x, y) = Sim2(x, y) = (x · y) / (|x| · |y|)

The document with the highest marginal relevance is then selected, added to S, and the procedure
continues until a maximum number of documents has been selected. It has been found that
dynamically changing the parameter λ gives more effective results than keeping it fixed.
To perform summarization, documents are first segmented into sentences; after a query is
submitted, the MMR algorithm can be applied. Top-ranking passages (sentences) are selected,
reordered according to their positions in the original documents, and presented as the summary.
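The selection procedure described above can be sketched in a few lines. The following Python fragment is a minimal illustration under our own assumptions (documents are represented as term-weight dictionaries; the names cosine and mmr_select are ours, not taken from the cited work):

```python
import math

def cosine(x, y):
    """Standard vector-space cosine similarity between two term-weight dicts."""
    dot = sum(w * y.get(t, 0.0) for t, w in x.items())
    nx = math.sqrt(sum(w * w for w in x.values()))
    ny = math.sqrt(sum(w * w for w in y.values()))
    return dot / (nx * ny) if nx and ny else 0.0

def mmr_select(query, candidates, lam=0.7, k=3):
    """Greedy MMR: repeatedly pick the candidate maximizing
    lam * Sim1(D, Q) - (1 - lam) * max over selected S of Sim2(D, S)."""
    selected, remaining = [], list(candidates)
    while remaining and len(selected) < k:
        def mr(d):
            redundancy = max((cosine(d, s) for s in selected), default=0.0)
            return lam * cosine(d, query) - (1 - lam) * redundancy
        best = max(remaining, key=mr)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With a small λ the redundancy term dominates, so a near-duplicate of an already selected sentence is passed over in favour of a less redundant, partially relevant one.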
5.4 CENTROID BASED SUMMARIZATION
MEAD is a centroid-based summarizer which uses several algorithms (position-based, TF-IDF11,
longest common subsequence and keywords) to generate summaries. As a result, it returns
centroid-based summaries. A centroid is a set of words which is statistically significant for a cluster.
11 Term frequency / inverse document frequency
Centroids are used both for the classification of relevant documents and for the identification of
phrases in a cluster.
The MEAD summarizer consists mainly of three components:
- Feature extractor,
- Sentence scorer and
- Sentence reranker.
The feature extractor calculates the values of a set of features, defined by the user, for each
sentence. Then the sentence scorer assigns a score (a linear combination of the features) to every
sentence, and the sentences are sorted according to this weighting. The task of the sentence
reranker is to insert the sentences into the resulting summary, beginning with the highest-ranked
ones. In addition, the sentence reranker checks the sentences in the summary for similarity: if the
degree of similarity to an already included sentence is over a certain threshold, the reranker
ignores the sentence and moves on to the next one.
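The interplay of the three components can be sketched as follows. This is a hypothetical Python illustration, not MEAD's actual code: the two features and their weights are our own simplified stand-ins for the position-based, TF-IDF, longest-common-subsequence and keyword features mentioned above.

```python
def position_feature(i, n):
    # earlier sentences score higher (position-based feature)
    return (n - i) / n

def length_feature(sent):
    # mild preference for longer sentences, capped (illustrative only)
    return min(len(sent.split()) / 20.0, 1.0)

def overlap(a, b):
    """Word-overlap similarity used by the reranker to detect redundancy."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def mead_summarize(sentences, weights=(1.0, 0.5), sim_threshold=0.7, k=2):
    n = len(sentences)
    # 1. feature extractor + 2. sentence scorer: linear combination of features
    scored = sorted(
        ((weights[0] * position_feature(i, n) + weights[1] * length_feature(s), s)
         for i, s in enumerate(sentences)),
        reverse=True)
    # 3. sentence reranker: add high-ranked sentences, skipping near-duplicates
    summary = []
    for _, s in scored:
        if len(summary) >= k:
            break
        if all(overlap(s, t) < sim_threshold for t in summary):
            summary.append(s)
    return summary
```

In this sketch the reranker drops a near-duplicate of an already selected sentence even though it ranks second, exactly the threshold behaviour described above.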
Similar documents are grouped into a cluster using an algorithm that is described in detail by
Radev. Each document is represented as a weighted vector of TF-IDF values. CIDR (a system for the
automatic placement of text documents in clusters) then generates a centroid from the first
document in the cluster.
As new documents are processed, their TF-IDF vectors are compared with the centroid using the
formula:

cj = (1 / |Cj|) · Σ{d ∈ Cj} d~

In this formula, the following notation is used:
- cj is the centroid of the j-th cluster
- Cj is the set of documents that belong to the cluster
- d~ is a "truncated version" of d that vanishes on those words whose TF-IDF scores are below a
threshold.
If the similarity of a new document to the centroid cj is within the threshold, the document is
added to the cluster.
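The incremental clustering step can be sketched as follows. This is our own simplified Python illustration of the idea, not the actual CIDR implementation; the threshold values and the truncation rule are assumptions.

```python
import math

def cosine(x, y):
    """Cosine similarity between two TF-IDF term-weight dicts."""
    dot = sum(w * y.get(t, 0.0) for t, w in x.items())
    nx = math.sqrt(sum(w * w for w in x.values()))
    ny = math.sqrt(sum(w * w for w in y.values()))
    return dot / (nx * ny) if nx and ny else 0.0

def truncate(doc, tfidf_threshold):
    """d~: keep only words whose TF-IDF score reaches the threshold."""
    return {t: w for t, w in doc.items() if w >= tfidf_threshold}

def cluster(docs, sim_threshold=0.3, tfidf_threshold=0.1):
    """Incremental clustering in the style of CIDR: a new document joins a
    cluster when its similarity to the centroid cj passes the threshold,
    and the centroid is recomputed as the average of the truncated members."""
    clusters = []  # each cluster: {"docs": [...], "centroid": {...}}
    for d in docs:
        d_t = truncate(d, tfidf_threshold)
        best = max(clusters, key=lambda c: cosine(d_t, c["centroid"]), default=None)
        if best is not None and cosine(d_t, best["centroid"]) >= sim_threshold:
            best["docs"].append(d)
            members = [truncate(m, tfidf_threshold) for m in best["docs"]]
            terms = {t for m in members for t in m}
            best["centroid"] = {t: sum(m.get(t, 0.0) for m in members) / len(members)
                                for t in terms}
        else:
            clusters.append({"docs": [d], "centroid": d_t})
    return clusters
```

The first document of each topic seeds a new centroid; subsequent documents about the same topic pass the similarity threshold and pull the centroid toward the cluster average.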
6 LESSONS LEARNED
The area of text summarization has been an active focus of the research community since the 1990s.
Several families of methods exist; in recent times the most widely applied ones have been based on
machine learning and statistical techniques, including binary classifiers, Bayesian methods, and
heuristic methods with weighted feature vectors. Graph-based methods have also been employed, as
have approaches based on neural-network training. [Svore et al. 2007]
According to the methods presented in this work and the research results obtained so far, many
researchers agree that better performance requires multi-feature approaches, especially ones based
on external sources of information. For instance, recent works cited here use Wikipedia titles or
news queries from search engines to improve summarization performance. Which other sources of
information and features should be used remains to be investigated. It is also important to note
that summarization systems depend on the other modules used in the preprocessing and processing
stages.
Where external NLP tools are used, the summarization system depends strongly on the quality and
performance of the underlying POS taggers, chunkers, lemmatizers, stemmers and sentence detectors.
For many languages such tools are not fully available, or they are at an immature stage of
development.
Where external ontology-based sources are used, the quality and depth of the ontology also plays
an important role. It has been shown [Farzindar and Lapalme 2004; Verma et al. 2007] that
domain-specific systems give better overall results for document summarization. Works based on
user query input [Svore et al. 2007] also show promising performance, in that summaries can be
adjusted and built according to user requirements and expectations.
7 CONCLUSION
In this work a short introduction to automatic text summarization has been given. Interesting uses
of this technology were described, as were the issues confronting the successful application of
such methods. Further, the short historical background of research in the area has been covered,
while most attention has been paid to the general approaches to summarization currently used and
proposed. These include both single-document and multi-document summarization methods.
Among the approaches analyzed, three single-document methods relying on machine learning
techniques were covered: an approach based on ontology knowledge, an approach based on feature
appraisal, and the application of NLP in summarization. Finally, a recent development based on the
application of neural networks to extractive text summarization was presented, together with the
findings and recommendations related to it.
For multi-document summarization, a simple example of abstractive summary generation was given,
and topic-driven and centroid-based summarization were analyzed.
REFERENCES
[Bawakid and Oussalah 2008] Bawakid, A., Oussalah, M.: "A Semantic Summarization System: University of Birmingham at TAC 2008"; Proceedings of the First Text Analysis Conference (2008)
[Das and Martins 2007] Das, D., Martins, A. F.T.: "A Survey on Automatic Text Summarization"; Literature Survey for the Language and Statistics II course at CMU (2007)
[Edmundson 1969] Edmundson, H.P.: "New Methods in Automatic Extracting"; Journal of the ACM (JACM). Volume 16 , Issue 2 (April 1969). 264- 285
[Farzindar and Lapalme 2004] Farzindar, A. and Lapalme, G.: "LetSum, an automatic Legal Text Summarizing system"; Legal Knowledge and Information Systems. Jurix 2004: The Seventeenth Annual Conference. Amsterdam. IOS Press (2004), 11-18
[Heting 2007] Heting, C.: "Information Representation and Retrieval in the Digital Age"; Asist Monograph Series (2007), 7-8
[Jezek and Steinberger 2008] Jezek, K. and Steinberger, J.: "Automatic Text Summarization (The state of the art 2007 and new challenges)"; Znalosti 2008. FIIT STU Bratislava, Slovakia (2008), 1-12
[Lin et al. 2009] Lin, J., Ozsu, M. Tamer, Liu, L.: "Summarization"; Encyclopedia of Database Systems, Springer (2009)
[McKeown and Radev, 1995] McKeown, K.R. and Radev, D.R.: "Generating summaries of multiple news articles."; In Proceedings of SIGIR '95 (1995), 74-82
[Nenkova 2005] Nenkova, A.: "Automatic Text Summarization of Newswire: Lessons Learned from the Document Understanding Conference"; American Association for Artificial Intelligence (2005)
[Neto et al. 2002] Neto, J.L., Freitas, A.A., Kaestner C.A.A.: "Automatic Text Summarization Using a Machine Learning Approach"; SBIA 2002. Springer Verlag Berlin Heidelberg (2002), 205-215
[Radev 2004] Radev, D.R.: "Text summarization"; Tutorial ACM SIGIR, Sheffield, UK (2004), http://www.summarization.com/sigirtutorial2004.ppt
[Reimer and Hahn 1997] Reimer, U. and Hahn, U.: "A Formal Model of Text Summarization Based on Condensation Operators of a Terminological Logic"; Advances in Automated Text Summarization (1997), 97-104
[Song et al. 2007] Song, X., Chi, Y., Hino, K., Tseng, B. L.: "Summarization System by Identifying Influential Blogs"; ICWSM 2007 (2007)
[Svore et al. 2007] Svore, K.M., Vanderwende, L., Burges, C.J.C.: "Enhancing Single-document Summarization by Combining RankNet and Third-party Sources"; Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (2007), Czech Republic, 448-457
[Verma et al. 2007] Verma, R., Chen, P., Lu, W.: "A Semantic Free-text Summarization System Using Ontology Knowledge"; Proceedings of the Document Understanding Conference (2007)
[Zhiqi et al. 2005] Zhiqi, W., Yongcheng, W., Chuanhan, L., Derong, L.: "An Automatic Summarization Service System Based on Web Services"; Proceedings of the 2005 The Fifth International Conference on Computer and Information Technology (2005)
ABBREVIATIONS
DUC Document Understanding Conference
MEAD Platform for multi-document – multi-lingual text summarization
NE Named Entity
NLP Natural Language Processing
PDA Personal Digital Assistant
ROUGE Recall Oriented Understudy for Gisting Evaluation
TAC Text Analysis Conference
UMLS Unified Medical Language System
FIGURES
Figure 1: The summarizer Architecture [ ] ................................................................................................................ 12
Figure 2: Architecture of SUMMONS [McKeown and Radev, 1995] ................................................................... 17
Figure 3: Articles provided to the system ................................................................................................................... 18
Figure 4: Templates generated from the provided articles [Radev, ACM SIGIR Tutorial] ...................... 19
Figure 5: Summary synthesized by the system [Radev, ACM SIGIR Tutorial] .............................................. 19
TABLES
Table 1: Features used in the system ............................................................................................................................ 14
Table 2: Features used in the system, based on external data sources........................................................... 15