
MEAD 3.09: A platform for multidocument multilingual text summarization

University of Michigan, Smith College, Columbia University, University of Pennsylvania, Johns Hopkins University, Chinese University of Hong Kong, University of Alabama, University of Sheffield, University of Cambridge

JHU Summer School 2004 - Baltimore

MEAD - JHU 2004 2

Text summarization

• Identifying the “most important” information from a document or set of documents.

• Extractive/abstractive

• Single-document/multi-document

• Informative/Indicative


MEAD

• Multi-document, multilingual, extractive summarization platform

• Open-source (Perl & Java), with a well-documented API and utilities

• v. 1.0-2.0 (Michigan 2000), v. 3.0 (JHU 2001)

• Latest release is v. 3.09 (Michigan 2001-2004)


Four stages

• Preprocessing and clustering
  – CIDR, XML representation

• Feature extraction
  – Default + custom

• Score extraction
  – Feature combination

• Sentence reranking
  – Cross-sentence relationships: repetitions, chronology, source preferences
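
The four stages above can be sketched as a small pipeline. This is only an illustration of the data flow, not MEAD's Perl implementation: the function names, the regex sentence splitter, the two toy features, and the weights are all assumptions made for the example, and clustering (CIDR) is left out.

```python
import re
from typing import Callable, Dict, List

FeatureFn = Callable[[List[str], int], float]


def preprocess(text: str) -> List[str]:
    """Stage 1 (simplified): split raw text into sentences."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extract_features(sentences: List[str],
                     feature_fns: Dict[str, FeatureFn]) -> List[Dict[str, float]]:
    """Stage 2: compute every named feature for every sentence."""
    return [{name: fn(sentences, i) for name, fn in feature_fns.items()}
            for i in range(len(sentences))]


def score(features: List[Dict[str, float]], weights: Dict[str, float]) -> List[float]:
    """Stage 3: combine the features of each sentence into one score."""
    return [sum(weights.get(name, 0.0) * value for name, value in f.items())
            for f in features]


def rerank(scores: List[float], k: int) -> List[int]:
    """Stage 4 (simplified): keep the k best sentences, in document order."""
    best = sorted(range(len(scores)), key=lambda i: -scores[i])[:k]
    return sorted(best)


def summarize(text: str, feature_fns: Dict[str, FeatureFn],
              weights: Dict[str, float], k: int) -> List[str]:
    sentences = preprocess(text)
    features = extract_features(sentences, feature_fns)
    chosen = rerank(score(features, weights), k)
    return [sentences[i] for i in chosen]


# Hypothetical toy features and weights, chosen only for this example.
FEATURES: Dict[str, FeatureFn] = {
    "Position": lambda sents, i: 1.0 / (i + 1),
    "RealLength": lambda sents, i: float(len(sents[i].split())),
}
WEIGHTS = {"Position": 1.0, "RealLength": 0.1}
```

Calling `summarize(text, FEATURES, WEIGHTS, 2)` returns the two highest-scoring sentences of `text` in their original order.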


Sample .config file

<MEAD-CONFIG TARGET='GA3' LANG='ENG' CLUSTER-PATH='/clair4/mead/data/GA3' DATA-DIRECTORY='/clair4/mead/data/GA3/docsent'>
  <FEATURE-SET BASE-DIRECTORY='/clair4/mead/data/GA3/feature/'>
    <FEATURE NAME='Centroid' SCRIPT='/clair4/mead/bin/feature-scripts/Centroid.pl HK-WORD-enidf ENG'/>
    <FEATURE NAME='Position' SCRIPT='/clair4/mead/bin/feature-scripts/Position.pl'/>
    <FEATURE NAME='Length' SCRIPT='/clair4/mead/bin/feature-scripts/Length.pl'/>
  </FEATURE-SET>
  <CLASSIFIER COMMAND-LINE='/clair4/mead/bin/default-classifier.pl \
      Centroid 1 Position 1 Length 9' SYSTEM='MEADORIG' RUN='10/09'/>
  <RERANKER COMMAND-LINE='/clair4/mead/bin/default-reranker.pl MEAD-cosine 0.7'/>
  <COMPRESSION BASIS='sentences' PERCENT='20'/>
</MEAD-CONFIG>


Sample .sentfeature file

<SENT-FEATURE>
  <S DID="87" SNO="1" >
    <FEATURE N="Centroid" V="0.2749" />
  </S>
  <S DID="87" SNO="2" >
    <FEATURE N="Centroid" V="0.8288" />
  </S>
  <S DID="81" SNO="1" >
    <FEATURE N="Centroid" V="0.1538" />
  </S>
  <S DID="81" SNO="2" >
    <FEATURE N="Centroid" V="1.0000" />
  </S>
  <S DID="41" SNO="1" >
    <FEATURE N="Centroid" V="0.1539" />
  </S>
  <S DID="41" SNO="2" >
    <FEATURE N="Centroid" V="0.9820" />
  </S>
</SENT-FEATURE>
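
A file in this shape can be read with any XML parser. The sketch below uses Python's standard `xml.etree.ElementTree` on the sample shown above; the function name `read_sentfeature` and the choice of a `(DID, SNO)` tuple as the key are assumptions for illustration and are not part of MEAD.

```python
import xml.etree.ElementTree as ET

# The sample .sentfeature content from the slide, inlined for the example.
SAMPLE = """<SENT-FEATURE>
  <S DID="87" SNO="1"><FEATURE N="Centroid" V="0.2749" /></S>
  <S DID="87" SNO="2"><FEATURE N="Centroid" V="0.8288" /></S>
  <S DID="81" SNO="1"><FEATURE N="Centroid" V="0.1538" /></S>
  <S DID="81" SNO="2"><FEATURE N="Centroid" V="1.0000" /></S>
  <S DID="41" SNO="1"><FEATURE N="Centroid" V="0.1539" /></S>
  <S DID="41" SNO="2"><FEATURE N="Centroid" V="0.9820" /></S>
</SENT-FEATURE>"""


def read_sentfeature(xml_text):
    """Map (document id, sentence number) to a {feature name: value} dict."""
    table = {}
    root = ET.fromstring(xml_text)
    for s in root.iter("S"):
        key = (s.get("DID"), int(s.get("SNO")))
        table[key] = {f.get("N"): float(f.get("V")) for f in s.findall("FEATURE")}
    return table


features = read_sentfeature(SAMPLE)
```

Here `features[("81", 2)]["Centroid"]` is the centroid score of the second sentence of document 81.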


Sample .extract file

<!DOCTYPE EXTRACT SYSTEM '/clair/tools/mead/dtd/extract.dtd'>

<EXTRACT QID='GA3' LANG='ENG' COMPRESSION='7' SYSTEM='MEADORIG' RUN='Sun Oct 13 11:01:19 2002'>
  <S ORDER='1' DID='41' SNO='2' />
  <S ORDER='2' DID='41' SNO='3' />
  <S ORDER='3' DID='41' SNO='11' />
  <S ORDER='4' DID='81' SNO='3' />
  <S ORDER='5' DID='81' SNO='7' />
  <S ORDER='6' DID='87' SNO='2' />
  <S ORDER='7' DID='87' SNO='3' />
</EXTRACT>


Sample .query

<!DOCTYPE QUERY SYSTEM "/clair4/mead/dtd/query.dtd" >

<QUERY QID="Q-551-E" QNO="551" TRANSLATED="NO">
  <TITLE> Natural disaster victims aided </TITLE>
  <DESCRIPTION> The description is usually a few sentences describing the cluster. </DESCRIPTION>
  <NARRATIVE> The narrative often describes exactly what the user is looking for in the summary. </NARRATIVE>
</QUERY>


Features

• Centroid: cosine overlap with the centroid vector of the cluster

• SimWithFirst: cosine overlap with the first sentence in the document (or with the title, if it exists)

• Length: 1 if the length of the sentence is above a given threshold and 0 otherwise

• RealLength: the length of the sentence in words

• Position: the position of the sentence in the document

• QueryOverlap: cosine overlap with a query sentence or phrase

• KeywordMatch: full match from a list of keywords

• CosineCentrality: eigenvector centrality of the sentence on the lexical connectivity matrix with a defined threshold
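
Several of these features reduce to a word-vector cosine or a simple count. The sketch below is a minimal approximation: it uses raw term frequencies on whitespace tokens, whereas the sample configuration passes an IDF database (`enidf`) to the feature scripts, so the values will not match MEAD's output. All function names are hypothetical.

```python
import math
from collections import Counter


def cosine(a: str, b: str) -> float:
    """Cosine overlap of two texts using raw term-frequency vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def real_length(sentence: str) -> int:
    """RealLength: the length of the sentence in words."""
    return len(sentence.split())


def length_flag(sentence: str, threshold: int) -> int:
    """Length: 1 if the sentence is longer than the threshold, else 0."""
    return 1 if real_length(sentence) > threshold else 0


def sim_with_first(sentences, i: int) -> float:
    """SimWithFirst: cosine overlap with the first sentence of the document."""
    return cosine(sentences[i], sentences[0])


def query_overlap(sentence: str, query: str) -> float:
    """QueryOverlap: cosine overlap with a query sentence or phrase."""
    return cosine(sentence, query)
```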


Centrality in summarization

• Motivation: capture the most central words in a document or cluster

• Centroid score [Radev & al. 2000, 2004a]

• Alternative methods for computing centrality?


Social networks

• Induced by a relation r

• Prestige (centrality) in social networks:
  – Degree centrality: number of friends
  – Geodesic centrality: bridge quality
  – Eigenvector centrality: who your friends are


Eigenvectors of stochastic graphs

• Square connectivity matrix

• Directed vs. undirected

• An eigenvalue of a square matrix $A$ is a scalar $\lambda$ such that there exists a vector $x \neq 0$ with $Ax = \lambda x$

• The normalized eigenvector associated with the largest $\lambda$ is called the principal eigenvector of $A$

• A matrix is called a stochastic matrix when the entries in each row sum to 1 and none is negative. All stochastic matrices have a principal eigenvector

• The connectivity matrix used in PageRank [Page & al. 1998] is irreducible [Langville & Meyer 2003]

• An iterative method (power method) can be used to compute the principal eigenvector

• That eigenvector corresponds to the stationary value of the Markov stochastic process described by the connectivity matrix

• This is also equivalent to performing a random walk on the matrix
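
The power method mentioned above can be sketched in a few lines. This is a generic illustration in plain Python, assuming a row-stochastic matrix `E` given as a list of lists; the function name, tolerance, and the 3-state example matrix are made up for the example.

```python
def power_method(E, tol=1e-10, max_iter=1000):
    """Iterate p <- E^T p from a uniform start until p stops changing."""
    n = len(E)
    p = [1.0 / n] * n
    for _ in range(max_iter):
        # Entry v of E^T p: probability mass flowing into state v.
        new_p = [sum(E[u][v] * p[u] for u in range(n)) for v in range(n)]
        if max(abs(a - b) for a, b in zip(new_p, p)) < tol:
            return new_p
        p = new_p
    return p


# Hypothetical 3-state chain; each row sums to 1.
E = [
    [0.50, 0.50, 0.00],
    [0.25, 0.50, 0.25],
    [0.00, 0.50, 0.50],
]
stationary = power_method(E)
```

For this chain the stationary distribution is (0.25, 0.5, 0.25), which can be checked by hand from the balance equations. Because each row of `E` sums to 1, every iterate keeps summing to 1, so no renormalization step is needed.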


Eigenvectors of stochastic graphs

• The stationary value of the Markov stochastic matrix can be computed using an iterative power method:

$$p = E^T p \qquad\Longleftrightarrow\qquad (I - E^T)\,p = 0$$

• PageRank adds an extra twist to deal with dead-end pages. With a probability $1-\epsilon$, a random starting point is chosen. This has a natural interpretation in the case of Web page ranking:

$$p(v) = \frac{1-\epsilon}{n} + \epsilon \sum_{u \in pr[v]} \frac{p(u)}{|su[u]|}$$

where $n$ is the number of nodes, su = successor nodes, and pr = predecessor nodes.
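
A minimal sketch of the PageRank-style update follows, assuming the form p(v) = (1 - eps)/n + eps * (sum over predecessors u of v of p(u)/|su[u]|). The graph is passed as a dict of successor lists in which every node appears as a key and has at least one successor; the function name, the value eps = 0.85, and the three-node graph are assumptions for illustration.

```python
def pagerank(su, eps=0.85, tol=1e-12, max_iter=1000):
    """Damped random walk over a graph given as {node: [successor, ...]}."""
    nodes = list(su)
    n = len(nodes)
    # Build predecessor lists pr[v] from the successor lists su[u].
    pr = {v: [] for v in nodes}
    for u, successors in su.items():
        for v in successors:
            pr[v].append(u)
    p = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        new_p = {
            v: (1 - eps) / n + eps * sum(p[u] / len(su[u]) for u in pr[v])
            for v in nodes
        }
        delta = max(abs(new_p[v] - p[v]) for v in nodes)
        p = new_p
        if delta < tol:
            break
    return p


# Hypothetical graph: a -> b, a -> c, b -> c, c -> a.
ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
```

In this example node `c` receives the most mass, followed by `a` and then `b`. A node with no successors would leak probability under this update, so a real implementation has to treat such dead ends explicitly.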


LexPageRank (Cosine centrality)

1 (d1s1) Iraqi Vice President Taha Yassin Ramadan announced today, Sunday, that Iraq refuses to back down from its decision to stop cooperating with disarmament inspectors before its demands are met.

2 (d2s1) Iraqi Vice president Taha Yassin Ramadan announced today, Thursday, that Iraq rejects cooperating with the United Nations except on the issue of lifting the blockade imposed upon it since the year 1990.

3 (d2s2) Ramadan told reporters in Baghdad that "Iraq cannot deal positively with whoever represents the Security Council unless there was a clear stance on the issue of lifting the blockade off of it.

4 (d2s3) Baghdad had decided late last October to completely cease cooperating with the inspectors of the United Nations Special Commission (UNSCOM), in charge of disarming Iraq's weapons, and whose work became very limited since the fifth of August, and announced it will not resume its cooperation with the Commission even if it were subjected to a military operation.

5 (d3s1) The Russian Foreign Minister, Igor Ivanov, warned today, Wednesday against using force against Iraq, which will destroy, according to him, seven years of difficult diplomatic work and will complicate the regional situation in the area.

6 (d3s2) Ivanov contended that carrying out air strikes against Iraq, who refuses to cooperate with the United Nations inspectors, ``will end the tremendous work achieved by the international group during the past seven years and will complicate the situation in the region.''

7 (d3s3) Nevertheless, Ivanov stressed that Baghdad must resume working with the Special Commission in charge of disarming the Iraqi weapons of mass destruction (UNSCOM).

8 (d4s1) The Special Representative of the United Nations Secretary-General in Baghdad, Prakash Shah, announced today, Wednesday, after meeting with the Iraqi Deputy Prime Minister Tariq Aziz, that Iraq refuses to back down from its decision to cut off cooperation with the disarmament inspectors.

9 (d5s1) British Prime Minister Tony Blair said today, Sunday, that the crisis between the international community and Iraq ``did not end'' and that Britain is still ``ready, prepared, and able to strike Iraq.''

10 (d5s2) In a gathering with the press held at the Prime Minister's office, Blair contended that the crisis with Iraq ``will not end until Iraq has absolutely and unconditionally respected its commitments'' towards the United Nations.

11 (d5s3) A spokesman for Tony Blair had indicated that the British Prime Minister gave permission to British Air Force Tornado planes stationed in Kuwait to join the aerial bombardment against Iraq.

Example (cluster d1003t)


Cosine centrality

       1     2     3     4     5     6     7     8     9    10    11
 1  1.00  0.45  0.02  0.17  0.03  0.22  0.03  0.28  0.06  0.06  0.00
 2  0.45  1.00  0.16  0.27  0.03  0.19  0.03  0.21  0.03  0.15  0.00
 3  0.02  0.16  1.00  0.03  0.00  0.01  0.03  0.04  0.00  0.01  0.00
 4  0.17  0.27  0.03  1.00  0.01  0.16  0.28  0.17  0.00  0.09  0.01
 5  0.03  0.03  0.00  0.01  1.00  0.29  0.05  0.15  0.20  0.04  0.18
 6  0.22  0.19  0.01  0.16  0.29  1.00  0.05  0.29  0.04  0.20  0.03
 7  0.03  0.03  0.03  0.28  0.05  0.05  1.00  0.06  0.00  0.00  0.01
 8  0.28  0.21  0.04  0.17  0.15  0.29  0.06  1.00  0.25  0.20  0.17
 9  0.06  0.03  0.00  0.00  0.20  0.04  0.00  0.25  1.00  0.26  0.38
10  0.06  0.15  0.01  0.09  0.04  0.20  0.00  0.20  0.26  1.00  0.12
11  0.00  0.00  0.00  0.01  0.18  0.03  0.01  0.17  0.38  0.12  1.00
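
Thresholding this matrix turns it into a graph, and the simplest centrality on that graph is the degree of each sentence. The sketch below uses the matrix values above and counts, for each sentence, how many other sentences have a cosine strictly greater than the threshold. The strict comparison is an assumption; it matters at t = 0.2, where several entries equal 0.20 exactly.

```python
LABELS = ["d1s1", "d2s1", "d2s2", "d2s3", "d3s1", "d3s2",
          "d3s3", "d4s1", "d5s1", "d5s2", "d5s3"]

COSINE = [
    [1.00, 0.45, 0.02, 0.17, 0.03, 0.22, 0.03, 0.28, 0.06, 0.06, 0.00],
    [0.45, 1.00, 0.16, 0.27, 0.03, 0.19, 0.03, 0.21, 0.03, 0.15, 0.00],
    [0.02, 0.16, 1.00, 0.03, 0.00, 0.01, 0.03, 0.04, 0.00, 0.01, 0.00],
    [0.17, 0.27, 0.03, 1.00, 0.01, 0.16, 0.28, 0.17, 0.00, 0.09, 0.01],
    [0.03, 0.03, 0.00, 0.01, 1.00, 0.29, 0.05, 0.15, 0.20, 0.04, 0.18],
    [0.22, 0.19, 0.01, 0.16, 0.29, 1.00, 0.05, 0.29, 0.04, 0.20, 0.03],
    [0.03, 0.03, 0.03, 0.28, 0.05, 0.05, 1.00, 0.06, 0.00, 0.00, 0.01],
    [0.28, 0.21, 0.04, 0.17, 0.15, 0.29, 0.06, 1.00, 0.25, 0.20, 0.17],
    [0.06, 0.03, 0.00, 0.00, 0.20, 0.04, 0.00, 0.25, 1.00, 0.26, 0.38],
    [0.06, 0.15, 0.01, 0.09, 0.04, 0.20, 0.00, 0.20, 0.26, 1.00, 0.12],
    [0.00, 0.00, 0.00, 0.01, 0.18, 0.03, 0.01, 0.17, 0.38, 0.12, 1.00],
]


def degree_centrality(matrix, t):
    """Number of other sentences whose cosine with sentence i exceeds t."""
    n = len(matrix)
    return [sum(1 for j in range(n) if j != i and matrix[i][j] > t)
            for i in range(n)]


degrees = dict(zip(LABELS, degree_centrality(COSINE, 0.1)))
most_central = max(degrees, key=degrees.get)
```

At t = 0.1 the highest degree belongs to d4s1, which is linked to 8 of the other 10 sentences. At t = 0.3 only two pairs remain connected (d1s1 with d2s1, and d5s1 with d5s3), so most of the graph is isolated nodes.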

Cosine centrality (t=0.3)

[Figure: similarity graph over the 11 sentences of cluster d1003t (nodes d1s1 to d5s3) at cosine threshold t = 0.3]

Cosine centrality (t=0.2)

[Figure: similarity graph over the 11 sentences of cluster d1003t (nodes d1s1 to d5s3) at cosine threshold t = 0.2]

Cosine centrality (t=0.1)

[Figure: similarity graph over the 11 sentences of cluster d1003t (nodes d1s1 to d5s3) at cosine threshold t = 0.1, with d4s1 highlighted]

Sentences vote for the most central sentence!


Cosine centrality vs. centroid centrality

ID    LPR (0.1)  LPR (0.2)  LPR (0.3)  Centroid
d1s1  0.6007     0.6944     0.0909     0.7209
d2s1  0.8466     0.7317     0.0909     0.7249
d2s2  0.3491     0.6773     0.0909     0.1356
d2s3  0.7520     0.6550     0.0909     0.5694
d3s1  0.5907     0.4344     0.0909     0.6331
d3s2  0.7993     0.8718     0.0909     0.7972
d3s3  0.3548     0.4993     0.0909     0.3328
d4s1  1.0000     1.0000     0.0909     0.9414
d5s1  0.5921     0.7399     0.0909     0.9580
d5s2  0.6910     0.6967     0.0909     1.0000
d5s3  0.5921     0.4501     0.0909     0.7902


Classifiers

• Default: linear combination (possibly using thresholds)

• Lead-based: positional and chronological

• Random

• Decision-tree: trainable
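
The default classifier can be pictured as a weighted sum with a cutoff. The sketch below mirrors the argument pattern `Centroid 1 Position 1 Length 9` from the sample configuration, under the assumption that the number after `Length` is a word-count cutoff below which a sentence scores 0, while the other numbers are weights. The function name and the feature values are hypothetical, and the actual script may differ in detail.

```python
def linear_score(features, weights, min_length):
    """Weighted sum of features; sentences shorter than min_length score 0."""
    if features["RealLength"] < min_length:
        return 0.0
    return sum(weight * features[name] for name, weight in weights.items())


# Weights and cutoff taken from the sample configuration's argument pattern.
WEIGHTS = {"Centroid": 1.0, "Position": 1.0}
MIN_LENGTH = 9

# Hypothetical feature values for two sentences.
long_sentence = {"Centroid": 0.8, "Position": 0.5, "RealLength": 20}
short_sentence = {"Centroid": 0.9, "Position": 1.0, "RealLength": 4}

long_score = linear_score(long_sentence, WEIGHTS, MIN_LENGTH)
short_score = linear_score(short_sentence, WEIGHTS, MIN_LENGTH)
```

The short sentence is discarded despite its higher feature values, which is the effect a length threshold is meant to have on headlines and fragments.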


Rerankers

• Identity: trivial

• Default: remove sentences that are too similar

• Time-based: use chronology

• Source-based: source preference

• Novelty

• CST-based: cross-document structure theory [Radev 2000, Zhang&al. 2002, Zhang&Radev 2004]

• MMR: maximal marginal relevance [Carbonell & Goldstein 1998]
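
The idea behind the default reranker, dropping sentences that are too similar to ones already chosen, can be sketched as a greedy pass over the scored sentences. This is an illustration rather than the MEAD script: the function name is hypothetical, the pairwise similarities are assumed to be precomputed, and the threshold of 0.7 is borrowed from the `MEAD-cosine 0.7` argument in the sample configuration.

```python
def similarity_rerank(scores, sim, threshold, k):
    """Pick up to k sentences by descending score, skipping near-duplicates.

    scores[i] is the classifier score of sentence i and sim[i][j] the
    similarity between sentences i and j.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    chosen = []
    for i in order:
        if len(chosen) == k:
            break
        # Keep sentence i only if it is not too close to any chosen one.
        if all(sim[i][j] <= threshold for j in chosen):
            chosen.append(i)
    return sorted(chosen)  # report in document order


# Hypothetical scores and similarities for three sentences.
scores = [0.9, 0.8, 0.7]
sim = [
    [1.00, 0.95, 0.10],
    [0.95, 1.00, 0.20],
    [0.10, 0.20, 1.00],
]
selected = similarity_rerank(scores, sim, threshold=0.7, k=2)
```

Sentence 1 has the second-best score but is nearly identical to sentence 0, so the third sentence is selected in its place.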


Evaluation methods

• Precision/recall/f-measure: baseline

• Kappa: interjudge agreement and difficulty

• Relative utility: non-binary judgements [Radev 2000]

• Relevance correlation: IR-based

• Cosine: default or TF*IDF

• Longest common subsequence [Saggion&al. 2002]

• Word overlap

• BLEU: n-gram precision [Papineni&al. 2002]

• ROUGE: n-gram recall and LCS [Lin 2004]
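
For extractive summaries, the precision/recall/F-measure baseline can be computed directly on sentence identifiers. The sketch below treats an extract as a set of (DID, SNO) pairs, as in the sample .extract file; the function name and the two example extracts are hypothetical.

```python
def precision_recall_f(system, reference):
    """Sentence-level precision, recall, and F1 between two extracts."""
    system, reference = set(system), set(reference)
    overlap = len(system & reference)
    precision = overlap / len(system) if system else 0.0
    recall = overlap / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical system extract and human reference extract.
system_extract = [(41, 2), (41, 3), (81, 3)]
reference_extract = [(41, 2), (81, 3), (87, 2), (87, 3)]
p, r, f = precision_recall_f(system_extract, reference_extract)
```

Two of the three system sentences appear in the reference, giving precision 2/3, recall 2/4, and an F-measure of 4/7.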


Recent applications

• NewsInEssence (www.newsinessence.com)

• DUC 2001-2004

• WapMEAD

• Java-MEAD interface

• Chronological fact extraction

• Novelty detection

• Protein interaction extraction


More recent additions

• MEAD “addons” – conversion from plain text, HTML, PDF, etc. to MEAD XML

• Client + server

• Summary to sentjudge conversion

• Trainable version of MEAD using decision trees, maxent, and SVM


Successes

• Large-scale effort (more than 20 people have participated in it)

• Open architecture

• Downloaded more than 1,000 times in the last 2 years

• Used in teaching

• Novel models of centrality: centroid, degree, cosine centrality

• Currently in five languages: English, Chinese, Korean, Spanish, Japanese

• DUC (including several first-place rankings in 2003, 2004)


Sample .meadrc file

compression_basis sentences
compression_absolute 1
classifier \
  /clair4/projects/mead307/source/mead/bin/default-classifier.pl \
  Centroid 3.0 Position 1.0 Length 15 SimWithFirst 2.0
reranker \
  /clair4/projects/mead307/source/mead/bin/default-reranker.pl \
  MEAD-cosine 0.9 enidf