
1

Reference

• Julian Kupiec, Jan Pedersen, Francine Chen, “A Trainable Document Summarizer”, SIGIR’95 Seattle WA USA, 1995.

• Xiaodan Zhu, Gerald Penn, “Evaluation of Sentence Selection for Speech Summarization”, Proceedings of the 2nd International Conference on Recent Advances in Natural Language Processing (RANLP-05), Borovets, Bulgaria, pp. 39-45. September 2005.

• C.D. Paice, “Constructing literature abstracts by computer: Techniques and prospects”. Information Processing and Management, 26:171-186, 1990.

2

A Trainable Document Summarizer

Julian Kupiec, Jan Pedersen and Francine Chen
Xerox Palo Alto Research Center

3

Outline

• Introduction
• A Trainable Summarizer
• Experiments and Evaluation
• Discussion and Conclusions

4

Introduction

• To summarize is to reduce in complexity, and hence in length, while retaining some of the essential qualities of the original

• This paper focuses on document extracts, a particular kind of computed document summary

• Document extracts consisting of roughly 20% of the original can be as informative as the full text of a document, which suggests that even shorter extracts may be useful indicative summaries

• Titles, key-words, tables-of-contents and abstracts might all be considered as forms of summary

• They approach extract selection as a statistical classification problem

• This framework provides a natural evaluation criterion: the classification success rate or precision

• It does require a “training corpus” of documents with labelled extracts

5

A Trainable Summarizer

• Features
– Paice groups sentence scoring features into seven categories

• Frequency-keyword heuristics
• The title-keyword heuristic
• Location heuristics
• Indicator phrase (e.g., “this report…”)
• Related heuristic: involves cue words
– Two sets of words which are positively and negatively correlated with summary sentences
– Bonus: e.g., greatest and significant
– Stigma: e.g., hardly and impossible

• Ref.: the frequency-keyword approach, the title-keyword method, the location method, syntactic criteria, the cue method, the indicator-phrase method, and relational criteria

6

A Trainable Summarizer

• Features
– Sentence Length Cut-off Feature
• Given a threshold (e.g., 5 words)
• The feature is true for all sentences longer than the threshold, and false otherwise
– Fixed-phrase Feature
• This feature is true for sentences that contain any of 26 indicator phrases, or that follow section heads that contain specific key words
– Paragraph Feature
– Thematic Word Feature
• The most frequent content words are defined as thematic words
• This feature is binary, depending on whether a sentence is present in the set of highest-scoring sentences
– Uppercase Word Feature
(a sketch of computing these features follows below)
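A minimal sketch (not the paper's implementation) of how these discrete features might be computed for one sentence; the indicator phrases, thresholds and helper names are illustrative assumptions, and the Paragraph Feature is omitted because it needs paragraph boundaries.

```python
from collections import Counter

# Illustrative stand-ins: the paper uses 26 indicator phrases and its own thresholds.
INDICATOR_PHRASES = ["this report", "in conclusion", "we conclude"]  # assumed examples
LENGTH_CUTOFF = 5          # words; the example threshold from the slide
NUM_THEMATIC_WORDS = 10    # assumed size of the thematic-word set

def sentence_features(sentence, document_sentences):
    words = sentence.lower().split()

    # Sentence Length Cut-off Feature: true for sentences longer than the threshold.
    length_ok = len(words) > LENGTH_CUTOFF

    # Fixed-phrase Feature: true if the sentence contains an indicator phrase
    # (the section-head condition is left out of this sketch).
    fixed_phrase = any(p in sentence.lower() for p in INDICATOR_PHRASES)

    # Thematic Word Feature: binary, based on the most frequent words of the
    # document (no stop-word filtering here, unlike true content words).
    counts = Counter(w for s in document_sentences for w in s.lower().split())
    thematic = {w for w, _ in counts.most_common(NUM_THEMATIC_WORDS)}
    thematic_hit = sum(w in thematic for w in words) >= 2   # assumed cut-off

    # Uppercase Word Feature: approximated as any all-uppercase token
    # longer than one character (e.g., an acronym).
    uppercase = any(w.isupper() and len(w) > 1 for w in sentence.split())

    return {"length": length_ok, "fixed_phrase": fixed_phrase,
            "thematic": thematic_hit, "uppercase": uppercase}
```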

7

A Trainable Summarizer

• Classifier
– For each sentence s, compute the probability that it will be included in a summary S given its k features $F_j,\ j = 1, \ldots, k$, which can be expressed using Bayes’ rule as follows:

  $P(s \in S \mid F_1, F_2, \ldots, F_k) = \dfrac{P(F_1, F_2, \ldots, F_k \mid s \in S)\, P(s \in S)}{P(F_1, F_2, \ldots, F_k)}$

– Assuming statistical independence of the features:

  $P(s \in S \mid F_1, \ldots, F_k) = \dfrac{\prod_{j=1}^{k} P(F_j \mid s \in S)\; P(s \in S)}{\prod_{j=1}^{k} P(F_j)}$

– $P(s \in S)$ is a constant, and $P(F_j \mid s \in S)$ and $P(F_j)$ can be estimated directly from the training set by “counting occurrences” (a scoring sketch follows below)
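A minimal sketch of the resulting scoring rule, assuming binary features and probabilities already estimated by counting on the labelled training set; the dictionary layout is hypothetical, and counts are assumed to be smoothed so no probability is exactly 0 or 1.

```python
import math

def summary_score(features, p_f_given_s, p_f, p_in_summary):
    """Log-domain score proportional to P(s in S | F1..Fk) under the
    independence assumption.  p_f_given_s[name] = P(Fj = true | s in S),
    p_f[name] = P(Fj = true), p_in_summary = P(s in S)."""
    score = math.log(p_in_summary)
    for name, value in features.items():
        p_given = p_f_given_s[name] if value else 1.0 - p_f_given_s[name]
        p_marg = p_f[name] if value else 1.0 - p_f[name]
        score += math.log(p_given) - math.log(p_marg)
    return score

# Sentences are ranked by this score and the top fraction of the document
# (e.g., 20%) is returned as the extract.
```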

8

Experiments and Evaluation

• The corpus
– There are 188 document/summary pairs, sampled from 21 publications in the scientific/technical domain
– The average number of sentences per document is 86
– Each document was “normalized” so that the first line of each file contained the document title

9

Experiments and Evaluation

• The corpus
– Sentence Matching:
• Direct sentence match (verbatim or with minor modification)
• Direct join (two or more sentences)
• Unmatchable
• Incomplete (some overlap: includes a sentence from the original document, but also contains other information)
• The correspondences were produced in two passes
• 79% of the summary sentences have direct matches

10

Experiments and Evaluation

• The corpus

11

Experiments and Evaluation

• Evaluation
– A cross-validation strategy is used for evaluation
– Unmatchable and incomplete sentences were excluded from both training and testing, yielding a total of 498 unique sentences
– Performance:
• First way (the highest performance):
– A sentence produced by the summarizer is counted as correct if it is a direct sentence match or a direct join
– Of the 568 summary sentences, 195 direct sentence matches and 6 direct joins were correctly identified, for a total of 201 correctly identified summary sentences: 35%
– Only direct matches and direct joins can be reproduced at all, so the attainable maximum is (451 + 19)/568 ≈ 83%
• Second way:
– Against the 498 matchable sentences: 42%

12

Experiments and Evaluation

• Evaluation

– The best combination is (Paragraph+fixed-phrase+sentence-length)

– Addition of the frequency-keyword features (thematic and uppercase word features) results in a slight decrease in overall performance

– As a baseline, selecting sentences from the beginning of the document (using the sentence length cut-off feature alone) gives 24% (121 sentences correct)

13

Experiments and Evaluation

– Figure 3 shows the performance of the summarizer (using all features) as a function of summary size

– Edmundson cites a sentence-level performance of 44%

– By analogy, 25% of the average document length (86 sentences) in our corpus is about 20 sentences

– Reference to the table indicates performance at 84%

14

Discussion and Conclusions

• The trends in these results are in agreement with those of Edmundson, who used a subjectively weighted combination of features as opposed to training the feature weights on a corpus

• Frequency-keyword features also gave the poorest individual performance in the evaluation

• They have, however, retained these features in the final system for several reasons
– The first is robustness
– Secondly, as the number of sentences in a summary grows, more dispersed informative material tends to be included

15

Discussion and Conclusions

• The goal is to provide a summarization program that is of general utility
– The first issue concerns robustness
– The second issue concerns presentation and other forms of summary information

16

Reference

• Julian Kupiec, Jan Pedersen, Francine Chen, “A Trainable Document Summarizer”, SIGIR’95 Seattle WA USA, 1995.

• Xiaodan Zhu, Gerald Penn, “Evaluation of Sentence Selection for Speech Summarization”, Proceedings of the 2nd International Conference on Recent Advances in Natural Language Processing (RANLP-05), Borovets, Bulgaria, pp. 39-45. September 2005.

17

Evaluation of Sentence Selection for Speech Summarization

Xiaodan Zhu and Gerald Penn
Department of Computer Science, University of Toronto

18

Outline

• Introduction
• Speech Summarization by Sentence Selection
• Evaluation Metrics
• Experiments
• Conclusions

19

Introduction

• This paper considers whether ASR-inspired evaluation metrics produce different results than those taken from text summarization

• The goal of speech summarization is to distill important information from speech data

• In this paper, we will focus on sentence-level extraction

20

Speech Summarization by Sentence Selection

• “LEAD”: select the first N% of sentences from the beginning of the transcript

• “RAND”: random selection

• Knowledge-based Approach: “SEM”
– To calculate semantic similarity between a given utterance and the dialogue, the noun portion of WordNet is used as a knowledge source, with semantic distance between senses computed using normalized path length
– The performance of the system is reported as better than LEAD, RAND and TF*IDF-based methods
– Word senses are not manually disambiguated; Brill’s POS tagger is applied to acquire the nouns
– A semantic similarity package is used (a rough sketch follows below)
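A rough sketch of the SEM idea using NLTK's WordNet interface (an assumption; the slide only mentions a "semantic similarity package"), scoring an utterance by the average best path similarity between its nouns and the nouns of the whole dialogue, taking the first WordNet sense since senses are not disambiguated.

```python
from nltk.corpus import wordnet as wn   # requires nltk and nltk.download('wordnet')

def first_noun_sense(word):
    senses = wn.synsets(word, pos=wn.NOUN)
    return senses[0] if senses else None

def sem_score(utterance_nouns, dialogue_nouns):
    """Average best path similarity between the utterance's nouns and the
    dialogue's nouns; the noun lists are assumed to come from a POS tagger."""
    utt = [s for s in (first_noun_sense(w) for w in utterance_nouns) if s]
    dia = [s for s in (first_noun_sense(w) for w in dialogue_nouns) if s]
    if not utt or not dia:
        return 0.0
    total = 0.0
    for u in utt:
        # path_similarity can return None for unrelated senses.
        total += max(u.path_similarity(d) or 0.0 for d in dia)
    return total / len(utt)

# Utterances are ranked by sem_score and the top N% form the summary.
```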

21

Speech Summarization by Sentence Selection

• MMR-based Approach: “MMR”
– A candidate sentence is scored by whether it is more similar to the whole dialogue and less similar to the sentences that have so far been selected:

  $\mathrm{next\_sent} = \arg\max_{t_{j,nr}} \Big[\, \lambda\, \mathrm{Sim}_1(t_{j,nr}, \mathrm{query}) - (1 - \lambda) \max_{t_{k,r}} \mathrm{Sim}_2(t_{j,nr}, t_{k,r}) \Big]$

  where the subscripts nr and r denote not-yet-selected and already-selected sentences (a selection sketch follows below)

• Classification-Based Approaches
– Sentence selection is formulated as a binary classification problem
– The best two classifiers have consistently been SVM and logistic regression
– SVM: “SVM” (OSU-SVM package)
• SVM seeks an optimal separating hyperplane, where the margin is maximal
• The decision function is:

  $f(x) = \mathrm{sgn}\Big(\sum_{j=1}^{N_s} a_j y_j K(s_j, x) + b\Big)$
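A minimal sketch of MMR-style selection over term-frequency vectors, assuming cosine similarity for both Sim1 and Sim2 and a hand-picked λ (neither is specified on the slide).

```python
import math

def cosine(a, b):
    """Cosine similarity between two term-frequency dicts."""
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def mmr_select(sentences, dialogue, size, lam=0.7):
    """sentences: list of term-frequency dicts; dialogue: tf dict of the whole
    dialogue (the "query").  Repeatedly picks the sentence most similar to the
    dialogue and least similar to what has already been selected."""
    selected, remaining = [], list(range(len(sentences)))
    while remaining and len(selected) < size:
        def mmr(i):
            redundancy = max((cosine(sentences[i], sentences[j]) for j in selected),
                             default=0.0)
            return lam * cosine(sentences[i], dialogue) - (1 - lam) * redundancy
        best = max(remaining, key=mmr)
        selected.append(best)
        remaining.remove(best)
    return selected
```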

22

Speech Summarization by Sentence Selection

• Features

23

Speech Summarization by Sentence Selection

• Classification-Based Approaches
– Logistic Regression: “LOG”
• Models the posterior probabilities of the class labels with linear functions:

  $\log \dfrac{P(Y = k \mid X = x)}{P(Y = K \mid X = x)} = \beta_{k0} + \beta_k^{T} x$

• For the two-class case:

  $P(Y = 1 \mid X = x) = \dfrac{\exp(\beta_{10} + \beta_1^{T} x)}{1 + \exp(\beta_{10} + \beta_1^{T} x)}$,  $P(Y = 2 \mid X = x) = \dfrac{1}{1 + \exp(\beta_{10} + \beta_1^{T} x)}$

• X are the feature vectors and Y are the class labels (a sketch follows below)
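A minimal sketch of the LOG classifier using scikit-learn (the slides do not name an implementation); X would hold per-sentence feature vectors from the feature table and y the 0/1 in-summary labels, both hypothetical here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: one feature row per sentence, 0/1 labels.
X_train = np.array([[12, 0.3, 1], [4, 0.1, 0], [20, 0.7, 1], [6, 0.2, 0]], dtype=float)
y_train = np.array([1, 0, 1, 0])

clf = LogisticRegression()
clf.fit(X_train, y_train)

# P(Y = 1 | X = x) for new sentences; the top-scoring ~10% would be selected.
X_test = np.array([[15, 0.5, 1], [3, 0.05, 0]], dtype=float)
probs = clf.predict_proba(X_test)[:, 1]
ranking = np.argsort(-probs)
```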

24

Evaluation Metrics

• Precision/Recall

– When evaluated on binary annotations and using precision/recall metrics, sys1 and sys2 achieve 50% and 0%

• Relative Utility
– For the above example, using relative utility, sys1 gets 18/19 and sys2 gets 15/19 (a scoring sketch follows below)
– The values obtained are higher than with P/R, but they are higher for all of the systems evaluated
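A minimal sketch of a relative-utility score under the assumption that each sentence carries a judge-assigned utility (as described on a later slide): the utility of the system's selection is divided by the utility of the best possible selection of the same size.

```python
def relative_utility(selected, utilities):
    """selected: sentence indices chosen by the summarizer;
    utilities: per-sentence utility scores (e.g., summed over judges)."""
    k = len(selected)
    achieved = sum(utilities[i] for i in selected)
    best = sum(sorted(utilities, reverse=True)[:k])
    return achieved / best if best else 0.0

# Hypothetical utilities chosen so the selection scores 18/19, like sys1 above.
utilities = [9, 9, 10, 4]
print(relative_utility([0, 1], utilities))   # 18/19 ≈ 0.947
```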

25

Evaluation Metrics

• Word Error Rate
– Sentence level and word level
– The sum of the insertion, substitution and deletion errors of words, divided by the number of all these errors plus the number of correct words

• Zechner’s Summarization Accuracy
– The summarization accuracy is defined as the sum of the relevance scores of all the words in the automatic summary, divided by the maximum achievable relevance score with the same number of words (see the sketch below)

• ROUGE
– Measures overlapping units such as n-grams, word sequences and word pairs
– ROUGE-N and ROUGE-L
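A minimal sketch of Zechner-style summarization accuracy under an assumed data layout (per-word relevance scores aligned with the transcript's word positions).

```python
def summarization_accuracy(selected_idx, relevance):
    """selected_idx: indices of transcript words that appear in the automatic
    summary; relevance: relevance score of every word in the transcript."""
    n = len(selected_idx)
    achieved = sum(relevance[i] for i in selected_idx)
    # Maximum achievable relevance with the same number of words.
    best = sum(sorted(relevance, reverse=True)[:n])
    return achieved / best if best else 0.0

# Example: a 3-word summary capturing 8 of the 9 relevance points
# obtainable with any 3 words.
print(summarization_accuracy([0, 2, 4], [4, 0, 3, 2, 1]))   # 8/9 ≈ 0.889
```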

26

Experiments

• Corpus: the SWITCHBOARD dataset (a corpus of open-domain spoken dialogue)

• To randomly select 27 spoken dialogues from SWITCHBOARD

• Three annotators are asked to assign 0/1 labels to indicate whether a sentence is in the summary or not (required to select around 10% of the sentences into the summary)

• Each judge’s annotations are evaluated relative to another judge’s (F-scores)

27

Experiments

• Precision/Recall
– One gold standard marks a sentence as in the summary only when all three annotators agree
– LOG and SVM have similar performance and outperform the others, with MMR following, and then SEM and LEAD
– A second standard: at least two of the three judges include the sentence in the summary

28

Experiments

• Precision/Recall
– A third standard: any of the three annotators includes the sentence

• Relative Utility
– From the three human judges, an assignment of a number between 0 and 9 to each sentence is obtained, indicating the confidence that the sentence should be included in the summary

29

Experiments

• Relative Utility

– The performance ranks of the five summarizers are the same here as they are in the three P/R evaluations
• First, the P/R agreement among annotators is not low
• Second, the redundancy in the data is much less than in multi-document summarization tasks
• Third, the summarizers compared might tend to select the same sentences

30

Experiments

• Word Error Rate and Summarization Accuracy

31

Experiments

• Word Error Rate and Summarization Accuracy

32

Experiments

• ROUGE

33

Conclusion

• Five summarizers were evaluated on three text-summarization-inspired metrics: precision/recall (P/R), relative utility (RU) and ROUGE, as well as on two ASR-inspired evaluation metrics: word error rate (WER) and summarization accuracy (SA)

• The preliminary conclusion is that considerably greater caution must be exercised when using ASR-based measures than has been witnessed to date in the speech summarization literature