automatic summarization of news using wordnet concept graphs laura plaza, alberto díaz, pablo...

25
Automatic Automatic Summarization of Summarization of News using News using WordNet Concept WordNet Concept Graphs Graphs Laura Plaza, Alberto Díaz, Pablo Gervás Departamento de Ingeniería del Software e Inteligencia Artificial Universidad Complutense de Madrid

Upload: charity-spencer

Post on 26-Dec-2015

214 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Automatic Summarization of News using WordNet Concept Graphs Laura Plaza, Alberto Díaz, Pablo Gervás Departamento de Ingeniería del Software e Inteligencia

Automatic Automatic Summarization Summarization of News using of News using

WordNet WordNet Concept GraphsConcept Graphs

Laura Plaza, Alberto Díaz, Pablo Gervás

Departamento de Ingeniería del Software e Inteligencia ArtificialUniversidad Complutense de Madrid

Page 2: Automatic Summarization of News using WordNet Concept Graphs Laura Plaza, Alberto Díaz, Pablo Gervás Departamento de Ingeniería del Software e Inteligencia

IADIS Informatics 2009

Automatic Summarization of News using WordNet Concept Graphs

Introduction

2

ª Digital newspapers have increased explosively

ª In order to tackle this information overload, text summarization can play a role

ª More linguistic and ontological resources that can benefit NLP systems

We present an ontology-based method for automatic summarization of news items

• based on mapping the text to concepts in WordNet and representing the document and its sentences as concept graphs

INTRODUCTION

PREVIOUS WORK

WORDNET

SUMMARIZATIONMETHOD

EXPERIMENTAL RESULTS

CONCLUSIONS FUTURE WORK

Page 3: Automatic Summarization of News using WordNet Concept Graphs Laura Plaza, Alberto Díaz, Pablo Gervás Departamento de Ingeniería del Software e Inteligencia

IADIS Informatics 2009

Automatic Summarization of News using WordNet Concept Graphs

Related Work

3

ª Text summarization extractive methods• Segments of text (usually sentences) that contain the

most relevant information are extracted • Classic methods compute the score of a sentence

using a linear combination of the weights that result from a set of different features

positional, linguistic, statistical • Graph-based methods generally represent the

sentences by its tf*idf vectors and compute sentence connectivity using the cosine similarity

INTRODUCTION

RELATED WORK

WORDNET

SUMMARIZATIONMETHOD

EXPERIMENTAL RESULTS

CONCLUSIONS FUTURE WORK

Page 4: Automatic Summarization of News using WordNet Concept Graphs Laura Plaza, Alberto Díaz, Pablo Gervás Departamento de Ingeniería del Software e Inteligencia

IADIS Informatics 2009

Automatic Summarization of News using WordNet Concept Graphs

Related Work

4

ª Most approaches do not capture the semantic relations between terms (synonymy, hypernymy, homonymy, co-occurrence …)

INTRODUCTION

RELATED WORK

WORDNET

SUMMARIZATIONMETHOD

EXPERIMENTAL RESULTS

CONCLUSIONS FUTURE WORK

1. “Hurricanes are useful to the climate machine. Their primary role is to transport heat from the lower to the upper atmosphere,'' he said.

2. He explained that cyclones are part of the atmospheric circulation mechanism, as they move heat from the superior to the inferior atmosphere.

Concept graph-based approach

same meaning!!!

Page 5: Automatic Summarization of News using WordNet Concept Graphs Laura Plaza, Alberto Díaz, Pablo Gervás Departamento de Ingeniería del Software e Inteligencia

IADIS Informatics 2009

Automatic Summarization of News using WordNet Concept Graphs

Related Work

5

ª Graph-based document clustering for extractive multi-document summarization (CSUGAR, Yoo et al. 2007)

• The entire collection of documents is represented as an extended graph of MeSH concepts

• This graph is clustered using an algorithm based on the Scale-Free Network Theory that generates, for each cluster: a graph and a set of vertices as centroids (Hub Vertices Set, HVS)

• Each document is assigned to a cluster using a vote mechanism based on whether the vertices in the document belong to HVS or non-HVS

• All documents assigned to a same clusters contribute to produce a single summary

INTRODUCTION

RELATED WORK

WORDNET

SUMMARIZATIONMETHOD

EXPERIMENTAL RESULTS

CONCLUSIONS FUTURE WORK

Page 6: Automatic Summarization of News using WordNet Concept Graphs Laura Plaza, Alberto Díaz, Pablo Gervás Departamento de Ingeniería del Software e Inteligencia

IADIS Informatics 2009

Automatic Summarization of News using WordNet Concept Graphs

WordNet

6

ª An electronic lexical database developed at the Princeton University (Miller et al, 1993)

ª Words with the same meaning are grouped in a Synset

ª Each synset has a gloss that defines it

ª Synsets are connected via semantic relations:

• Hypernymy and Hyponymy:

feline is a hypernym of catcat is a hyponym of feline

• Holonymy and Meronymy:

vehicle is a holonym of wheelwheel is a meronym of vehicle

• Troponymy: to lisp is a troponym of to talk• Entailment: to sleep is entailed by to snore

INTRODUCTION

RELATED WORK

WORDNET

SUMMARIZATIONMETHOD

EXPERIMENTAL RESULTS

CONCLUSIONS FUTURE WORK

Page 7: Automatic Summarization of News using WordNet Concept Graphs Laura Plaza, Alberto Díaz, Pablo Gervás Departamento de Ingeniería del Software e Inteligencia

IADIS Informatics 2009

Automatic Summarization of News using WordNet Concept Graphs

Summarization Method

7

ª An extractive mono-document method that selects sentences based on their similarity with concept clusters

ª Steps:• Preprocessing• Graph-based Document Representation• Concept Clustering and Theme Recognition• Sentence Selection

ª A worked document example from the DUC 2002 collection will illustrate the method

INTRODUCTION

PREVIOUS WORK

UMLS

SUMMARIZATION

METHOD

EXPERIMENTAL RESULTS

CONCLUSIONS FUTURE WORK

Page 8: Automatic Summarization of News using WordNet Concept Graphs Laura Plaza, Alberto Díaz, Pablo Gervás Departamento de Ingeniería del Software e Inteligencia

IADIS Informatics 2009

Automatic Summarization of News using WordNet Concept Graphs

Preprocessing

8

ª Extracting the body of the news itemsª Splitting the text into sentences using

GATE (http://gate.ac.uk) ª Generic words and frequent terms are

removed• They are not useful in discriminating between

relevant and non relevant sentences

INTRODUCTION

PREVIOUS WORK

UMLS

SUMMARIZATION

METHOD

PREPROCESSING

EXPERIMENTAL RESULTS

CONCLUSIONS FUTURE WORK

Page 9: Automatic Summarization of News using WordNet Concept Graphs Laura Plaza, Alberto Díaz, Pablo Gervás Departamento de Ingeniería del Software e Inteligencia

IADIS Informatics 2009

Automatic Summarization of News using WordNet Concept Graphs

Graph-based Document Representation

9

ª Representing the news item body as a graph

• Where the vertices are the concepts in WordNet associated to the terms

• And the edges indicate the relations between them

ª Steps1. Mapping between terms and concepts2. Graph-based Sentence Representation3. Graph-based Document Representation

INTRODUCTION

PREVIOUS WORK

UMLS

SUMMARIZATION

METHOD

PREPROCESSING

GRAPH - BASEDDOCUMENT

REPRESENTATION

EXPERIMENTAL RESULTS

CONCLUSIONS FUTURE WORK

Page 10: Automatic Summarization of News using WordNet Concept Graphs Laura Plaza, Alberto Díaz, Pablo Gervás Departamento de Ingeniería del Software e Inteligencia

IADIS Informatics 2009

Automatic Summarization of News using WordNet Concept Graphs

Graph-based Document Representation

10

1. Mapping between terms and concepts

INTRODUCTION

RELATED WORK

WORDNET

SUMMARIZATION

METHOD

PREPROCESSING

GRAPH - BASEDDOCUMENT

REPRESENTATION

EXPERIMENTAL RESULTS

CONCLUSIONS FUTURE WORK

WordNet Sense Relate (word sense

disambiguation)

Hurricane Gilbert swept toward the Dominican Republic Sunday, and the Civil Defense alerted its heavily populated south coast to prepare for

high winds, heavy rains and high seas.

Concept WN Sense Concept WN Sense Concept WN Sensehurricane 1 defense 9 prepare 4Gilbert 2 alert 1 high 2sweep 1 heavily 2 wind 1Dominican Rep 1 populate 2 heavy 1sunday 1 south 1 rain 1civil 1 coast 1 sea 1

Page 11: Automatic Summarization of News using WordNet Concept Graphs Laura Plaza, Alberto Díaz, Pablo Gervás Departamento de Ingeniería del Software e Inteligencia

IADIS Informatics 2009

Automatic Summarization of News using WordNet Concept Graphs

Graph-based Document Representation

11

2. Graph-based Sentence Representation• WordNet concepts derived from common nouns

are extended with their hypernyms

INTRODUCTION

RELATED WORK

WORDNET

SUMMARIZATION

METHOD

PREPROCESSING

GRAPH - BASEDDOCUMENT

REPRESENTATION

EXPERIMENTAL RESULTS

CONCLUSIONS FUTURE WORK

physical entity

physical object

abstract entity

abstraction

group

social group

organization

defense

measure

fundamentalquantity

time period

calendar day

entity

process

phenomenon

naturalphenomenon

physicalphenomenon

atmosphericphenomenon

windstorm

cyclone

hurricane

location

region

territory

territorialdivision

country day of the week

rest day

sunday

geologicalformation

shore

coast

weather

wind precipitation

rain

thing

Body ofwater

sea

Page 12: Automatic Summarization of News using WordNet Concept Graphs Laura Plaza, Alberto Díaz, Pablo Gervás Departamento de Ingeniería del Software e Inteligencia

IADIS Informatics 2009

Automatic Summarization of News using WordNet Concept Graphs

Graph-based Document Representation

12

3. Graph-based Document Representation

INTRODUCTION

RELATED WORK

WORDNET

SUMMARIZATION

METHOD

PREPROCESSING

GRAPH - BASEDDOCUMENT

REPRESENTATION

EXPERIMENTAL RESULTS

CONCLUSIONS FUTURE WORK

• The sentence graphs are merged into a document graph

• Enriched with similarity relations between leaf concepts, when similarity > threshold

• Each edge is assigned a weight directly proportional to the depth of the concepts that it connects (Yoo et al. 2007)

WordNet Similarity Package

physical entity physical object

location

region

territory territorial division

country

geo. formation shore coast

thing

body of water sea

1/2 1/2 2/3 2/3

2/3 3/4 3/4 3/4 4/5 4/5

5/6

6/7

4/5

Page 13: Automatic Summarization of News using WordNet Concept Graphs Laura Plaza, Alberto Díaz, Pablo Gervás Departamento de Ingeniería del Software e Inteligencia

IADIS Informatics 2009

Automatic Summarization of News using WordNet Concept Graphs

Concept Clustering and Theme Recognition

13

ª The document graph is clustered in concept subgraphs, using a clustering algorithm similar to that used in CSUGAR (Yoo et al., 2007)

• a subgraph and a HVS are obtained for each cluster

ª Each cluster is composed by a set of concepts that are closely related in meaning

• It can be seen as a theme in the document

INTRODUCTION

RELATED WORK

WORDNET

SUMMARIZATION

METHOD

PREPROCESSING

GRAPH - BASEDDOCUMENTREPRESENTATION

CONCEPT CLUSTERING ANDTHEME RECOGNITION

EXPERIMENTAL RESULTS

CONCLUSIONS FUTURE WORK

Page 14: Automatic Summarization of News using WordNet Concept Graphs Laura Plaza, Alberto Díaz, Pablo Gervás Departamento de Ingeniería del Software e Inteligencia

IADIS Informatics 2009

Automatic Summarization of News using WordNet Concept Graphs

Sentence Selection

14

ª Selecting significant sentences for the summary, based on the similarity between sentences and clusters

ª The similarity between a cluster Ci and a sentence Oj is measured with a vote mechanism that is based on whether the vertices of the sentence graph belongs to HVS or non-HVS.

simililarity (Ci, Oj) =

5.0)(

0.1)(

0

,

,

,

jkik

jkik

jkik

wCHVSv

wCHVSv

wCv

INTRODUCTION

RELATED WORK

WORDNET

SUMMARIZATION

METHOD

PREPROCESSING

GRAPH - BASEDDOCUMENTREPRESENTATION

CONCEPT CLUSTERING &THEME RECOGNITION

SENTENCE SELECTION

EXPERIMENTAL RESULTS

CONCLUSIONS FUTURE WORK

jkk Ovv

jkji wOCsimilitud ,),(

Page 15: Automatic Summarization of News using WordNet Concept Graphs Laura Plaza, Alberto Díaz, Pablo Gervás Departamento de Ingeniería del Software e Inteligencia

IADIS Informatics 2009

Automatic Summarization of News using WordNet Concept Graphs

Sentence Selection

15

ª Three heuristics are tested

• Heuristic 1: for each cluster, the top sentences (in a number proportional to its size) are selected

• Heuristic 2: all sentences are selected from the cluster with more concepts

• Heuristic 3: a single score for each sentence is computed, as the sum of the similarity assigned to each cluster adjusted to their sizes, and the sentences with higher scores are selected

INTRODUCTION

RELATED WORK

WORDNET

SUMMARIZATION

METHOD

PREPROCESSING

GRAPH - BASEDDOCUMENTREPRESENTATION

CONCEPT CLUSTERING &THEME RECOGNITION

SENTENCE SELECTION

EXPERIMENTAL RESULTS

CONCLUSIONS FUTURE WORK

Page 16: Automatic Summarization of News using WordNet Concept Graphs Laura Plaza, Alberto Díaz, Pablo Gervás Departamento de Ingeniería del Software e Inteligencia

IADIS Informatics 2009

Automatic Summarization of News using WordNet Concept Graphs

Experimental Results

16

ª Two separate groups of experiments

• Parameter setting

• Comparison with a positional lead baseline

ª Evaluation Collection

• 10 news items from the DUC 2002 collection

• Each item comes with (at least) a model abstract written by a human

ª Compression rate of 30%ª Evaluation Metric

• ROUGE recall measures: R-1, R-2, R-L, R-W-1.2

INTRODUCTION

RELATED WORK

WORDNET

SUMMARIZATIONMETHOD

EXPERIMENTAL RESULTS

CONCLUSIONS FUTURE WORK

Page 17: Automatic Summarization of News using WordNet Concept Graphs Laura Plaza, Alberto Díaz, Pablo Gervás Departamento de Ingeniería del Software e Inteligencia

IADIS Informatics 2009

Automatic Summarization of News using WordNet Concept Graphs

Experimental Results

17

ª Algorithm parametrization

• Parameter 1: Which heuristic produces the best summaries?

• Parameter 2: What percentage of vertices should be considered as hub vertices (HV) by the clustering method?

• Parameter 3: Is it better to consider both types of relations between concepts (hypernymy and similarity) or just the hypernymy relation?

• Parameter 4: If the similarity relation is taken into account, what similarity threshold should be considered?

INTRODUCTION

RELATED WORK

WORDNET

SUMMARIZATIONMETHOD

EXPERIMENTAL RESULTS

CONCLUSIONS FUTURE WORK

Page 18: Automatic Summarization of News using WordNet Concept Graphs Laura Plaza, Alberto Díaz, Pablo Gervás Departamento de Ingeniería del Software e Inteligencia

IADIS Informatics 2009

Automatic Summarization of News using WordNet Concept Graphs

Experimental Results

18

ª Parameters 1 & 2: Heuristic and Hub Vertices percentage

INTRODUCTION

RELATED WORK

WORDNET

SUMMARIZATIONMETHOD

EXPERIMENTAL RESULTS

CONCLUSIONS FUTURE WORK

Average R-1

Average R-2

Average R-L

Average R-W-1.2

Heuristic 1

1- percent 0,69324 0,33723 0,65202 0,24847

2- percent 0,71474 0,34520 0,69115 0,26225

5- percent 0,67814 0,28308 0,63836 0,23646

10- percent 0,67201 0,27093 0,62717 0,22283

Heuristic 2

1- percent 0,71446 0,33185 0,67367 0,25384

2- percent 0,72487 0,34438 0,68040 0,25810

5- percent 0,70358 0,30756 0,65924 0,24488

10- percent 0,72449 0,32887 0,67966 0,25547

Heuristic 3

1- percent 0,72056 0,34105 0,68058 0,25727

2- percent 0,72755 0,34438 0,68308 0,25886

5- percent 0,70560 0,31164 0,66377 0,24612

10- percent 0,71273 0,31980 0,66888 0,25162

Page 19: Automatic Summarization of News using WordNet Concept Graphs Laura Plaza, Alberto Díaz, Pablo Gervás Departamento de Ingeniería del Software e Inteligencia

IADIS Informatics 2009

Automatic Summarization of News using WordNet Concept Graphs

Experimental Results

19

ª Parameter 3: Semantic Relations

ª Parameter 4: Similarity threshold

INTRODUCTION

RELATED WORK

WORDNET

SUMMARIZATIONMETHOD

EXPERIMENTAL RESULTS

CONCLUSIONS FUTURE WORK

Average R-1 Average R-2 Average R-L Average R-W-1.2

Heuristic 1Hypernymy 0,71470 0,32736 0,67565 0,25250

Hyp. & Sim. 0,72736 0,34230 0,68558 0,25681

Heuristic 2Hypernymy 0,72487 0,34438 0,68040 0,25810

Hyp. & Sim. 0,72920 0,33949 0,68463 0,25664

Heuristic 3Hypernymy 0,72755 0,34438 0,68308 0,25886

Hyp. & Sim. 0,73118 0,32941 0,67838 0,25323

Average R-1 Average R-2 Average R-L Average R-W-1.20.01 0,71145 0,31381 0,66314 0,248720.05 0,71470 0,32736 0,67565 0,25250

0.1 0,71953 0,32573 0,67578 0,254770.2 0,73118 0,32941 0,67838 0,253230.5 0,71058 0,31786 0,66690 0,24892

Page 20: Automatic Summarization of News using WordNet Concept Graphs Laura Plaza, Alberto Díaz, Pablo Gervás Departamento de Ingeniería del Software e Inteligencia

IADIS Informatics 2009

Automatic Summarization of News using WordNet Concept Graphs

Experimental Results

20

ª Best Parametrization

ª Comparison with a positional lead baseline

INTRODUCTION

RELATED WORK

WORDNET

SUMMARIZATIONMETHOD

EXPERIMENTAL RESULTS

CONCLUSIONS FUTURE WORK

• Heuristic: Undetermined

• Hub Vertices Percentage: 2%

• Relations : Hypernymy and similarity

• Similarity threshold: 0.2

Average R-1 Average R-2 Average R-L Average R-W-1.2Best Configuration 0,73118 0,32941 0,67838 0,25323

Lead Baseline 0,59436 0,18826 0,55522 0,20488

Page 21: Automatic Summarization of News using WordNet Concept Graphs Laura Plaza, Alberto Díaz, Pablo Gervás Departamento de Ingeniería del Software e Inteligencia

Conclusions

IADIS Informatics 2009

Automatic Summarization of News using WordNet Concept Graphs

Conclusions

21

ª Representing the text using ontology concepts allows a richer representation than the one provided by a vector space model

ª The method proposed can be applied to documents from different domains

ª Previously tested on biomedical scientific literature

ª Preliminary evaluation

• There are not significant differences between the heuristics

• Results are clearly better than the baseline

INTRODUCTION

RELATED WORK

WORDNET

SUMMARIZATIONMETHOD

EXPERIMENTAL RESULTS

CONCLUSIONS FUTURE WORK

Page 22: Automatic Summarization of News using WordNet Concept Graphs Laura Plaza, Alberto Díaz, Pablo Gervás Departamento de Ingeniería del Software e Inteligencia

Conclusions

IADIS Informatics 2009

Automatic Summarization of News using WordNet Concept Graphs

Future Work

22

ª Long sentences have a higher probability of being selected

• Normalizing the sentence scores overthe number of concepts

ª Performing a large-scale evaluation on the DUC 2002 Corpus (ROUGE)

ª Comparing with similar systems (i.e. LexRank)

ª Extending the method to multidocument summarization

INTRODUCTION

RELATED WORK

WORDNET

SUMMARIZATIONMETHOD

EXPERIMENTAL RESULTS

CONCLUSIONS FUTURE WORK

Page 23: Automatic Summarization of News using WordNet Concept Graphs Laura Plaza, Alberto Díaz, Pablo Gervás Departamento de Ingeniería del Software e Inteligencia

Thank you!Thank you!

Page 24: Automatic Summarization of News using WordNet Concept Graphs Laura Plaza, Alberto Díaz, Pablo Gervás Departamento de Ingeniería del Software e Inteligencia

Conclusions

IADIS Informatics 2009

Automatic Summarization of News using WordNet Concept Graphs

Clustering Algorithm

24

1. We compute the salience of all vertices in the document graph

2. The n vertices with higher salience (Hub Vertices) are iteratively grouped in Hub Vertex Sets (centroids)

3. The remaining vertices (non HVS) are assigned to that the cluster to which they are more connected

4. This process repeats iteratively, adjusting both HVS and the remaining vertices progressively

),(

)()(kijkj vvconectaeve

ji eweightvsalience

Page 25: Automatic Summarization of News using WordNet Concept Graphs Laura Plaza, Alberto Díaz, Pablo Gervás Departamento de Ingeniería del Software e Inteligencia

Conclusions

IADIS Informatics 2009

Automatic Summarization of News using WordNet Concept Graphs

Rouge

25

ª Recall-Oriented Understudy for Gisting Evaluation

ª Comparison between automatic summaries and model or ideal summaries created by humans

ª Different recall metrics• ROUGE-N (1…4) : number or ngrams co-ocurring in a candidate

summary and the reference summaries

• ROUGE- L (Longest Common Subsequence): the number of the LCS is used to estimate similarity between the candidate and the reference summaries

• ROUGE-W-1.2 (Weighted Longest Common Subsequence): improves ROUGE-W by taking into account the presence of consecutive matches