
Summarisation Work at Sheffield

Robert Gaizauskas
Natural Language Processing Group
Department of Computer Science

University of Sheffield


January, 2001 AKT Workshop

Outline

Terminology

Approach 1: Generation from Templates

Approach 2: Coreference Chains

Approach 3: Statistical


Terminology

Extract vs Abstract
  Extract - subset of the sentences in the original
  Abstract - fusion of topics in the original + text generation

Generic vs User-focused
  Generic - captures the essence of the text, independent of the user’s interests
  User-focused - summarises content with respect to a particular user interest

Indicative vs Informative
  Indicative - indicates whether the document should be examined in more detail
  Informative - serves as a surrogate for the original


Approach 1: Generation from Templates

To generate user-focused informative abstracts

we have used an IE system + simple NL generation techniques to produce simple summaries


Example: A Wall Street Journal Article

<DOC>
<DOCID> wsj94_008.0212 </DOCID>
<DOCNO> 940413-0062. </DOCNO>
<HL> Who's News:@ Burns Fry Ltd. </HL>
<DD> 04/13/94 </DD>
<SO> WALL STREET JOURNAL (J), PAGE B10 </SO>
<CO> MER </CO>
<IN> SECURITIES (SCR) </IN>
<TXT>
<p>
BURNS FRY Ltd. (Toronto) -- Donald Wright, 46 years old, was named executive vice president and director of fixed income at this brokerage firm. Mr. Wright resigned as president of Merrill Lynch Canada Inc., a unit of Merrill Lynch & Co., to succeed Mark Kassirer, 48, who left Burns Fry last month. A Merrill Lynch spokeswoman said it hasn't named a successor to Mr. Wright, who is expected to begin his new position by the end of the month.
</p>
</TXT>
</DOC>


Example: BNF Definition of a Management Succession Event Template (MUC-6)

<TEMPLATE> :=
    DOC_NR: "NUMBER" ^
    CONTENT: <SUCCESSION_EVENT> *
<SUCCESSION_EVENT> :=
    ORGANIZATION: <ORGANIZATION> ^
    POST: "POSITION TITLE" | "no title" ^
    IN_AND_OUT: <IN_AND_OUT> +
    VACANCY_REASON: {DEPART_WORKFORCE, REASSIGNMENT, NEW_POST_CREATED, OTH_UNK} ^
<IN_AND_OUT> :=
    PERSON: <PERSON> ^
    NEW_STATUS: {IN, IN_ACTING, OUT, OUT_ACTING} ^
    ON_THE_JOB: {YES, NO, UNCLEAR}
    OTHER_ORG: <ORGANIZATION> -
    REL_OTHER_ORG: {SAME_ORG, RELATED_ORG, OUTSIDE_ORG} -
<ORGANIZATION> :=
    ORG_NAME: "NAME" -
    ORG_ALIAS: "ALIAS" *
    ORG_DESCRIPTOR: "DESCRIPTOR" -
    ORG_TYPE: {GOVERNMENT, COMPANY, OTHER} ^
    ORG_LOCALE: LOCALE_STRING {{CITY, PROVINCE, COUNTRY, REGION, UNK} *
    ORG_COUNTRY: NORMALIZED-COUNTRY-or-REGION | COUNTRY-or-REGION-STRING *
<PERSON> :=
    PER_NAME: "NAME" -
    PER_ALIAS: "ALIAS" *
    PER_TITLE: "TITLE" *


Example: A (Partially) Filled Management Succession Event Template

<TEMPLATE-9404130062> :=
    DOC_NR: "9404130062"
    CONTENT: <SUCCESSION_EVENT-1>
<SUCCESSION_EVENT-1> :=
    SUCCESSION_ORG: <ORGANIZATION-1>
    POST: "executive vice president"
    IN_AND_OUT: <IN_AND_OUT-1> <IN_AND_OUT-2>
    VACANCY_REASON: OTH_UNK
<IN_AND_OUT-1> :=
    IO_PERSON: <PERSON-1>
    NEW_STATUS: OUT
    ON_THE_JOB: NO
<IN_AND_OUT-2> :=
    IO_PERSON: <PERSON-2>
    NEW_STATUS: IN
    ON_THE_JOB: NO
    OTHER_ORG: <ORGANIZATION-2>
    REL_OTHER_ORG: OUTSIDE_ORG
<ORGANIZATION-1> :=
    ORG_NAME: "Burns Fry Ltd."
    ORG_ALIAS: "Burns Fry"
    ORG_DESCRIPTOR: "this brokerage firm"
    ORG_TYPE: COMPANY
    ORG_LOCALE: Toronto CITY
    ORG_COUNTRY: Canada
<ORGANIZATION-2> :=
    ORG_NAME: "Merrill Lynch Canada Inc."
    ORG_ALIAS: "Merrill Lynch"
    ORG_DESCRIPTOR: "a unit of Merrill Lynch & Co."
    ORG_TYPE: COMPANY
<PERSON-1> :=
    PER_NAME: "Mark Kassirer"
<PERSON-2> :=
    PER_NAME: "Donald Wright"
    PER_ALIAS: "Wright"
    PER_TITLE: "Mr."


Example: One Use for a Template - Generating a Summary

From the completely filled version of the preceding template the LaSIE system generates the following natural language summary:

  BURNS FRY Ltd. named Donald Wright as executive vice president.
  Donald Wright resigned as president of Merrill Lynch Canada Inc..
  Mark Kassirer left as president of BURNS FRY Ltd.

Producing summaries in other languages is relatively easy (compared to full machine translation).


Approach 2: Coreference Chains

To generate generic informative extracts

we have used coreference chains


Approach 2: Coreference Chains (cont)

Background:
  Morris and Hirst (’91) investigated lexical chains - chains of lexically-related words in a text that serve to make texts cohere
  Barzilay + Elhadad (’97) suggested using lexical chains as a basis for selecting sentences to form a summary - rank chains based on number of links + extent over text
  Halliday and Hasan (’76) proposed coreference as another major factor contributing to coherence of NL texts

Idea:
  Explore use of coreference chains to produce summaries


Approach 2: Coreference Chains (cont)

Technique
  Use LaSIE to carry out discourse analysis of text, including coreference resolution
  Extract all coreference chains
  Rank chains by a metric which counts chain length + extent + starting point
    • Intuition: entities which occur most frequently and most widely in a text are those which the text is most “about”
  Depending on desired summary length, select m sentences from top n chains
  Details in Azzam, Humphreys and Gaizauskas ’99
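
A minimal sketch of the ranking and selection steps, assuming the coreference chains have already been extracted and each chain is given as the list of sentence indices in which its mentions occur. The equal weighting of length, extent and starting point is an assumption made for illustration; the metric actually used is described in Azzam, Humphreys and Gaizauskas ’99. The chains in the example are hand-built from the Wall Street Journal article, not LaSIE output.

```python
from typing import Dict, List


def chain_score(mention_sents: List[int], num_sents: int) -> float:
    """Score one chain from the sentence index of each of its mentions."""
    length = len(mention_sents)                                         # number of mentions
    extent = (max(mention_sents) - min(mention_sents) + 1) / num_sents  # share of text spanned
    start = 1.0 - min(mention_sents) / num_sents                        # earlier start scores higher
    return length + extent + start                                      # equal weights: an assumption


def rank_chains(chains: Dict[str, List[int]], num_sents: int) -> List[str]:
    """Return chain names, best-scoring first."""
    return sorted(chains, key=lambda name: chain_score(chains[name], num_sents), reverse=True)


def summarise(sentences: List[str], chains: Dict[str, List[int]], n: int, m: int) -> List[str]:
    """Select up to m sentences from the top n chains, returned in document order."""
    picked: List[int] = []
    for name in rank_chains(chains, len(sentences))[:n]:
        for idx in sorted(set(chains[name])):
            if idx not in picked and len(picked) < m:
                picked.append(idx)
    return [sentences[i] for i in sorted(picked)]


sentences = [
    "Donald Wright was named executive vice president at this brokerage firm.",
    "Mr. Wright resigned as president of Merrill Lynch Canada Inc. to succeed Mark Kassirer, who left Burns Fry.",
    "A Merrill Lynch spokeswoman said it hasn't named a successor to Mr. Wright.",
]
chains = {
    "Wright": [0, 1, 2],
    "Burns Fry": [0, 1],
    "Merrill Lynch": [1, 2],
    "Kassirer": [1],
}
ranking = rank_chains(chains, len(sentences))
extract = summarise(sentences, chains, n=1, m=2)
```

With these toy chains the "Wright" chain ranks first because it is the longest, spans the whole text and starts in the first sentence, which matches the intuition that the text is most "about" that entity.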


Approach 3: Statistical

To generate generic indicative extracts

we have used a statistical approach based on a set of factors


Approach 3: Statistical (cont)

Factors which have been examined in selecting sentences for inclusion in extractive summaries include:
  number of content words shared with title/headings (T)
  presence of “cue words” (C)
  location of sentence in text (L)
  number of content words discriminative of current text as opposed to corpus of texts from which it is drawn, using, e.g. tf-idf measure (K)
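
The four factors can be made concrete with a small sketch. Everything here is an illustrative assumption rather than the Sheffield implementation: the stopword and cue-word lists are tiny placeholders, the location score simply favours early sentences, and the tf-idf variant uses add-one smoothing.

```python
import math
import re
from collections import Counter
from typing import Dict, List

STOPWORDS = {"the", "a", "an", "of", "in", "to", "and", "is", "at", "as", "was", "by"}  # placeholder list
CUE_WORDS = {"conclusion", "summary", "significant"}                                    # hypothetical cue words


def content_words(text: str) -> List[str]:
    """Lower-cased alphabetic tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def title_overlap(sentence: str, title: str) -> int:
    """Factor T: number of distinct content words shared with the title."""
    return len(set(content_words(sentence)) & set(content_words(title)))


def has_cue(sentence: str) -> int:
    """Factor C: 1 if the sentence contains a cue word, else 0."""
    return int(any(w in CUE_WORDS for w in content_words(sentence)))


def location(index: int, num_sentences: int) -> float:
    """Factor L: 1.0 for the first sentence, decreasing towards the end."""
    return 1.0 - index / num_sentences


def tfidf(document: str, corpus: List[str]) -> Dict[str, float]:
    """tf-idf score per content word of document, relative to corpus."""
    tf = Counter(content_words(document))
    n = len(corpus)
    scores = {}
    for word, count in tf.items():
        df = sum(1 for doc in corpus if word in content_words(doc))
        scores[word] = count * math.log((1 + n) / (1 + df))  # add-one smoothing avoids division by zero
    return scores


def keyword_count(sentence: str, scores: Dict[str, float], k: int = 5) -> int:
    """Factor K: content words of the sentence among the k highest tf-idf words."""
    top = set(sorted(scores, key=scores.get, reverse=True)[:k])
    return sum(1 for w in content_words(sentence) if w in top)
```

A word that occurs in every text of the corpus gets a tf-idf score of zero under this variant, so it never counts as discriminative of the current text.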


Approach 3: Statistical (cont)

Assign a weight to each sentence according to a weighted linear combination of these factors

Learn weights to optimise sentence selection as measured against a corpus of extracts + texts

Select top ranked sentences up to desired summary length

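$$W(s) = w_T\,T(s) + w_C\,C(s) + w_L\,L(s) + w_K\,\frac{1}{\mathit{numkey}(s)} \sum_{i} \mathit{keyword}(s, i)$$

where w_T, w_C, w_L and w_K are the learned weights for the factors T, C, L and K.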
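
The weighted linear combination and the final selection step can be sketched as follows, assuming the four factor values have already been computed for each sentence. The weights and factor values in the example are invented for illustration; they are not learned from a corpus, and the function names are hypothetical.

```python
from typing import Dict, List

FACTORS = ("T", "C", "L", "K")  # title overlap, cue words, location, keywords


def sentence_weight(features: Dict[str, float], weights: Dict[str, float]) -> float:
    """W(s) as a weighted linear combination of the factor values of one sentence."""
    return sum(weights[f] * features[f] for f in FACTORS)


def select_sentences(sentences: List[str],
                     features: List[Dict[str, float]],
                     weights: Dict[str, float],
                     max_sentences: int) -> List[str]:
    """Keep the top-ranked sentences up to the desired length, in document order."""
    scored = [(sentence_weight(f, weights), i) for i, f in enumerate(features)]
    top = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:max_sentences]
    return [sentences[i] for i in sorted(i for _, i in top)]


# Hypothetical weights and factor values for a three-sentence text.
weights = {"T": 1.0, "C": 0.5, "L": 0.5, "K": 1.0}
features = [
    {"T": 2.0, "C": 0.0, "L": 1.0, "K": 1.0},
    {"T": 0.0, "C": 0.0, "L": 0.5, "K": 0.5},
    {"T": 1.0, "C": 1.0, "L": 0.25, "K": 0.5},
]
sentences = ["first sentence", "second sentence", "third sentence"]
extract = select_sentences(sentences, features, weights, max_sentences=2)
```

Ties are broken in favour of the earlier sentence, and the selected sentences are returned in their original order so that the extract reads in document order.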