Summarizing Encyclopedic Term Descriptions on the Web from Coling 2004 Atsushi Fujii and Tetsuya Ishikawa Graduate School of Library, Information and Media Studies, University of Tsukuba


Summarizing Encyclopedic Term Descriptions on the Web

from Coling 2004

Atsushi Fujii and Tetsuya Ishikawa

Graduate School of Library, Information and Media Studies, University of Tsukuba


Motivation

Existing encyclopedias often lack new terms and new definitions for existing terms

The Web contains an enormous volume of up-to-date information and is a source of new term descriptions

Using existing search engines for this purpose has many problems


Problems with search engines

They often retrieve extraneous pages that do not describe the submitted term

A user has to identify page fragments describing the term

Descriptions in multiple pages are independent

Word senses are not distinguished for ambiguous terms


They propose a summarization method that produces a concise and condensed term description from multiple paragraphs

In this paper, they focus on Japanese technical terms in the computer domain


Overview of CYCLONE


Summarization Method

Given a set of paragraph-style descriptions for a single term in a specific domain, their summarization method produces a concise text describing the term from different viewpoints

12 viewpoints in the computer domain: definition, abbreviation, exemplification, purpose, synonym, reference, product, advantage, drawback, history, component, function


Four steps

Identification: Recognize the language unit associated with a viewpoint

Classification: Merge units with the same viewpoint into a single group

Selection: Determine one or more representative units for each group

Presentation: Produce a summary in a format


Identification

A sentence is often associated with multiple viewpoints

e.g. XML is an abbreviation for eXtensible Markup Language, and is a markup language

Segment Japanese sentences into simple sentences; zero pronoun detection and anaphora resolution can be used to recover the omitted subject

XML is an abbreviation for eXtensible Markup Language (abbreviation viewpoint)

XML is a markup language (definition viewpoint)


Four steps

Identification: Recognize the language unit associated with a viewpoint

Classification: Merge units with the same viewpoint into a single group

Selection: Determine one or more representative units for each group

Presentation: Produce a summary in a format


Classification

12 viewpoints

36 linguistic patterns are used to describe terms from a specific viewpoint

Simple sentences that match patterns, possibly for multiple viewpoints, are classified into the corresponding viewpoint groups
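
The pattern-matching step can be sketched roughly as follows. This is only an illustration: the regular expressions are hypothetical English stand-ins (the slides do not list the 36 Japanese patterns), and placing a sentence in every group it matches is one reading of the slide, not a confirmed detail of the authors' system.

```python
import re
from collections import defaultdict

# Hypothetical English stand-ins for the linguistic patterns; the real
# system uses 36 Japanese patterns that are not listed in these slides.
PATTERNS = {
    "abbreviation": [re.compile(r"\bis an abbreviation (for|of)\b")],
    "definition": [re.compile(r"\bis an? (?!abbreviation\b)\w+")],
    "purpose": [re.compile(r"\bis used (for|to)\b")],
}


def classify_by_patterns(sentences):
    """Group simple sentences by every viewpoint whose pattern they match."""
    groups = defaultdict(list)
    unmatched = []
    for sentence in sentences:
        viewpoints = [
            vp for vp, regexes in PATTERNS.items()
            if any(rx.search(sentence) for rx in regexes)
        ]
        for vp in viewpoints:
            groups[vp].append(sentence)
        if not viewpoints:
            # Sentences matching no pattern are handled in a later step.
            unmatched.append(sentence)
    return dict(groups), unmatched
```

A sentence such as "SGML is a standard and is used for document markup" would land in both the definition and purpose groups under these assumed patterns.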


Classification (cont)

What about sentences that do not match any pattern?

Classify each remaining sentence into the group to which its most similar sentence belongs

Compute the similarity between each unclassified sentence and each of the classified sentences (Dice coefficient)

“miscellaneous” group
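
A minimal sketch of this fallback step, under stated assumptions: the Dice coefficient is computed over whitespace-separated word sets (Japanese text would need morphological analysis instead), and the `threshold` below which a sentence goes to the "miscellaneous" group is a hypothetical parameter, since the slides do not say how that group is filled.

```python
def dice(a, b):
    """Dice coefficient between two sentences over their word sets."""
    x, y = set(a.lower().split()), set(b.lower().split())
    if not x or not y:
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))


def assign_remaining(unclassified, groups, threshold=0.3):
    """Place each unmatched sentence in the group of its nearest classified sentence."""
    result = {vp: list(members) for vp, members in groups.items()}
    result.setdefault("miscellaneous", [])
    for sentence in unclassified:
        best_vp, best_sim = "miscellaneous", 0.0
        for vp, members in groups.items():
            for member in members:
                sim = dice(sentence, member)
                if sim > best_sim:
                    best_vp, best_sim = vp, sim
        # Assumed rule: weak matches fall back to the miscellaneous group.
        if best_sim < threshold:
            best_vp = "miscellaneous"
        result[best_vp].append(sentence)
    return result
```

The Dice coefficient of two word sets X and Y is 2|X ∩ Y| / (|X| + |Y|), so it ranges from 0 (no shared words) to 1 (identical sets).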


Example


Four steps

Identification: Recognize the language unit associated with a viewpoint

Classification: Merge units with the same viewpoint into a single group

Selection: Determine one or more representative units for each group

Presentation: Produce a summary in a format


Selection

The number of sentences selected from each group depends on the desired size of the resultant summary

Compute the score for each sentence and select sentences with greater scores in each group

# of common words included (W): sentences including frequent words are preferred

Rank order in CYCLONE (R)

# of characters included (C): short sentences are preferred

Normalize each factor and compute the final score as a weighted average of the three factors above (W>R>C)
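
The scoring can be sketched as below. Several details are assumptions, not taken from the slides: min-max normalization, the concrete weights (0.5, 0.3, 0.2, chosen only to respect W>R>C), whitespace tokenization, and counting a word as "common" when it occurs in more than one sentence of the group.

```python
from collections import Counter


def min_max(values):
    """Scale values to [0, 1]; constant lists map to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def select_representatives(group, n=1, weights=(0.5, 0.3, 0.2)):
    """Rank one viewpoint group; group holds (sentence, cyclone_rank) pairs."""
    if not group:
        return []
    tokens = [set(s.lower().split()) for s, _ in group]
    freq = Counter(word for words in tokens for word in words)
    # W: number of words that also occur in another sentence of the group.
    w = min_max([sum(1 for word in words if freq[word] > 1) for words in tokens])
    # R: a smaller CYCLONE rank is better, so negate before scaling.
    r = min_max([-rank for _, rank in group])
    # C: fewer characters is better, so negate before scaling.
    c = min_max([-len(s) for s, _ in group])
    w_w, w_r, w_c = weights
    scored = [
        (w_w * w[i] + w_r * r[i] + w_c * c[i], group[i][0])
        for i in range(len(group))
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sentence for _, sentence in scored[:n]]
```

Here `n` plays the role of the number of representative sentences per group, which the slides tie to the desired summary size.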


Selection (cont)

For the miscellaneous group, they select the sentence most dissimilar to the representative sentences selected from the regular groups
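
One possible reading of this rule, as a sketch: score each miscellaneous sentence by its highest Dice similarity to any already selected representative, and keep the sentences with the lowest score. Both the use of the maximum and the word-set Dice measure are assumptions; the slides only say "most dissimilar".

```python
def dice(a, b):
    """Dice coefficient between two sentences over their word sets."""
    x, y = set(a.lower().split()), set(b.lower().split())
    if not x or not y:
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))


def pick_miscellaneous(misc, representatives, n=1):
    """Return the n miscellaneous sentences least similar to the representatives."""
    def max_similarity(sentence):
        # Similarity to the closest representative; 0.0 if there are none.
        return max((dice(sentence, rep) for rep in representatives), default=0.0)
    return sorted(misc, key=max_similarity)[:n]
```

The intent is that the miscellaneous picks add information not already covered by the regular viewpoint groups.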


Presentation


Presentation (cont)

Top 50 paragraphs for the term “XML”

Only one sentence was selected from each group

Each viewpoint label or sentence is hyperlinked to the associated group or the source paragraph


Evaluation

Summarization evaluation can be classified into intrinsic and extrinsic approaches

Intrinsic: the quality of a text, informativeness

Extrinsic: whether a summary improves the efficiency of a specific task


Evaluation (cont)

15 Japanese terms are used as test inputs

To calculate the coverage, for each of the 15 terms, two students annotated each simple sentence in the top 50 paragraphs of the CYCLONE results with one or more viewpoints

They defined 28 viewpoints, including the 12 viewpoints used by the method

Compression ratio and coverage were calculated over the top 50 paragraphs
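
The slides do not spell out the two measures. A sketch under the assumption that compression ratio is the summary length divided by the length of the source paragraphs (in characters), and coverage is the share of viewpoints annotated in the source paragraphs that also appear in the summary:

```python
def compression_ratio(summary, source_paragraphs):
    """Characters in the summary divided by characters in the source paragraphs."""
    total = sum(len(p) for p in source_paragraphs)
    return len(summary) / total if total else 0.0


def coverage(summary_viewpoints, source_viewpoints):
    """Fraction of annotated source viewpoints that also appear in the summary."""
    source = set(source_viewpoints)
    if not source:
        return 0.0
    return len(set(summary_viewpoints) & source) / len(source)
```

Under these assumed definitions, a lower compression ratio means a shorter summary and a higher coverage means fewer viewpoints lost.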


Results

#Reps: the number of representative sentences selected from each viewpoint group

#Chars: the number of characters in a summary

They select five sentences from the miscellaneous group

VBS: viewpoint-based summarization method

Lead: systematically extracts the top N characters from the CYCLONE results


Conclusion

To compile encyclopedic term descriptions from the Web, they introduced a summarization method

They identify simple sentences, classify those sentences into viewpoint groups, select representative sentences from each group, and present them

VBS achieved a good compression ratio, and its coverage was better than that of the Lead baseline

Future work includes generating a coherent text and performing an extrinsic evaluation