Post on 19-Dec-2015
![Page 1: 1 Natural Language Processing for the Web Prof. Kathleen McKeown 722 CEPSR, 939-7118 Office Hours: Wed, 1-2; Tues 4-5 TA: Yves Petinot 719 CEPSR, 939-7116](https://reader035.vdocument.in/reader035/viewer/2022062714/56649d365503460f94a0e9a0/html5/thumbnails/1.jpg)
1
Natural Language Processing for the Web
Prof. Kathleen McKeown
722 CEPSR, 939-7118
Office Hours: Wed, 1-2; Tues 4-5
TA:
Yves Petinot
719 CEPSR, 939-7116
Office Hours: Thurs 12-1, 8-9
2
Projects
Proposal due today
Hand in via CourseWorks (by midnight)
3
Today
Discussants – Each person should sign up for TWO papers
Automated Discovery and Analysis of Social Networks from Threaded Discussions (Kathy)
From Social Bookmarking to Social Summarization (Kathy)
Joint Group and Topic Discovery from Relations and Text (Kathy)
Discovering Authorities in Question Answering Communities Using Link Analysis (Kenny)
Discussants: Lauren, Weiwei
4
Automated Discovery and Analysis of Social Networks
Threaded discussion:
  Online class discussion board
  “…examining social networks – including the roles and positions of actors in a social network, their influence on others, and what exchanges support and sustain the network – is an important goal for understanding networked learning processes”
Social network background
  Most studies use metadata about links
  This work uses NL analysis of text within postings
5
Who talks to whom? (ties)
Chain network
  Create a link from poster to previous poster
  Create a link from poster to thread starter plus previous poster
  Create a link from poster to all previous posters in a thread, decreasing weight with distance
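The three linking schemes above can be sketched in a few lines. This is only an illustration: the function name `build_ties`, the scheme labels, and the geometric `decay` weighting are assumptions, since the slide says only that weight decreases with distance.

```python
from collections import defaultdict

def build_ties(thread, scheme="previous", decay=0.5):
    """Return {(poster, target): weight} for one thread.

    thread is a list of poster ids in posting order. The geometric decay
    used for "all_previous" is an assumed weighting, not the paper's.
    """
    ties = defaultdict(float)
    for i, poster in enumerate(thread):
        if i == 0:
            continue  # the thread starter replies to no one
        if scheme == "previous":
            targets = {thread[i - 1]: 1.0}
        elif scheme == "starter_plus_previous":
            targets = {thread[0]: 1.0, thread[i - 1]: 1.0}
        elif scheme == "all_previous":
            targets = {}
            for distance, earlier in enumerate(reversed(thread[:i]), start=1):
                weight = decay ** (distance - 1)
                targets[earlier] = max(targets.get(earlier, 0.0), weight)
        else:
            raise ValueError(f"unknown scheme: {scheme}")
        for target, weight in targets.items():
            if target != poster:  # ignore self-links
                ties[(poster, target)] += weight
    return dict(ties)

thread = ["ann", "bob", "cat", "ann"]
print(build_ties(thread, "previous"))
print(build_ties(thread, "all_previous"))
```

Comparing the resulting networks for the same thread shows how much the choice of scheme changes who appears to talk to whom.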
6
7
Topic of this paper: identifying names in a posting
Possible names
  Previous poster (To line, direct reference, indirect reference, subject of discussion)
  Someone else entirely (author)
  Current poster (self-reference, in address line, signature)
8
Methods for identifying “Who”
  Use of name lists
  Class roster
  Use of titles (Prof.), addresses (dear)
  Exclusion of 3-word capitalized sequences
  Confidence level (page vs. “Page”)
  Misspellings (manual review and edits)
Results: Local vs. LingPipe
  Precision: .88 vs. .60
  Recall: .66 vs. .68
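A rough sketch of the roster-plus-cues idea follows. It covers only the roster lookup, the capitalization confidence, and the title/greeting cue. The cue list `TITLES`, the confidence increments, and the function name `find_names` are all made up for illustration; the 3-word exclusion and misspelling handling are left out.

```python
import re

# Hypothetical cue words: titles and greetings that often precede a name.
TITLES = {"prof", "prof.", "dr", "dr.", "dear", "hi"}

def find_names(text, roster):
    """Return [(name, confidence)] for roster names mentioned in text."""
    roster_lc = {name.lower() for name in roster}
    tokens = re.findall(r"[A-Za-z]+\.?", text)
    found = []
    for i, token in enumerate(tokens):
        word = token.rstrip(".")
        if word.lower() not in roster_lc:
            continue
        confidence = 0.5                 # on the roster at all
        if word[0].isupper():
            confidence += 0.25           # "Page" is more name-like than "page"
        if i > 0 and tokens[i - 1].lower() in TITLES:
            confidence += 0.25           # preceded by a title or greeting
        found.append((word, confidence))
    return found

print(find_names("Dear Page, see page 3 of Prof. Smith's notes", ["Page", "Smith"]))
```

The lowercase "page" still matches the roster but gets the lowest confidence, which is the page vs. “Page” distinction on the slide.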
9
Methods for identifying “ties”: links
Chance of a tie proportional to the number of times each mentions the other as addressee or subject
  Add 1 to a poster and all names found in posting
More than one name: link name to userid using collocational analysis (A vs. P)
  Association type: P (poster), A (addressee)
Information exchange: essential social interaction
  Measure via content of exchange (vs. network structure)
  Information weight of an exchange (vs. discourse act such as announcement)
  Yahoo Term Extractor to find content nouns
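The counting rule ("add 1 to a poster and all names found in posting") might look like the sketch below. The input format, the function names, and the choice to count a name at most once per posting are assumptions for illustration.

```python
from collections import Counter

def count_ties(postings):
    """postings: list of (poster, [names found in the posting]).

    Adds 1 to the directed tie poster -> name for each distinct name
    found in a posting (counting once per posting is an assumption).
    """
    ties = Counter()
    for poster, names in postings:
        for name in set(names):
            if name != poster:  # skip self-references and signatures
                ties[(poster, name)] += 1
    return ties

def tie_strength(ties, a, b):
    """Undirected strength: how often each mentions the other."""
    return ties[(a, b)] + ties[(b, a)]

postings = [("ann", ["bob", "bob"]), ("bob", ["ann", "cat"]), ("ann", ["bob"])]
ties = count_ties(postings)
print(ties[("ann", "bob")], tie_strength(ties, "ann", "bob"))
```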
10
Example
Keep in mind that google and other search technology are still evolving and getting better. I certainly don't believe that they will be as effective as a library in 2-5 years, but if they improve significantly, it will continue to be difficult for the public to perceive the difference.
11
Evaluation
Metric: QAP correlation
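For reference, QAP (quadratic assignment procedure) correlation compares two networks over the same actors: correlate the off-diagonal cells of the two adjacency matrices, then permute the rows and columns of one matrix together to see how often a random relabeling does as well. A minimal pure-Python sketch follows; the function names and the one-sided test are illustrative choices, not the paper's setup.

```python
import random

def offdiag(m):
    """Flatten a square matrix, skipping the diagonal (no self-ties)."""
    n = len(m)
    return [m[i][j] for i in range(n) for j in range(n) if i != j]

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def qap_correlation(a, b, permutations=1000, seed=0):
    """Return (observed correlation, one-sided permutation p-value)."""
    rng = random.Random(seed)
    n = len(a)
    observed = pearson(offdiag(a), offdiag(b))
    order = list(range(n))
    hits = 0
    for _ in range(permutations):
        rng.shuffle(order)
        # Relabel actors: permute rows and columns of b with the same order.
        permuted = [[b[order[i]][order[j]] for j in range(n)] for i in range(n)]
        if pearson(offdiag(a), offdiag(permuted)) >= observed:
            hits += 1
    return observed, hits / permutations

name_net = [[0, 3, 0, 1], [2, 0, 1, 0], [0, 1, 0, 4], [1, 0, 3, 0]]
chain_net = [[0, 2, 0, 1], [2, 0, 1, 1], [0, 1, 0, 3], [1, 0, 2, 0]]
print(qap_correlation(name_net, chain_net, permutations=500))
```

The two toy matrices stand in for, say, a name-based network and a chain network built from the same discussion board.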
12
From Social Bookmarking to Social Summarization
Exploit user-created content
  Del.icio.us web page tags, Flickr, review sites
Approach expands on query-focused summarization
  Extract bookmark tags for a page p: (b1, b2, ..)
  Issue a search with tags as query
  Extract snippets associated with p in result: S(bi, p)
  Normalize snippets
  Score each snippet according to frequency
  Rank order as summary
13
Some details
Limit results of search to top N
Normalize by
  Determining overlap (like cosine)
  Match if overlap score above threshold T
  Take shorter sentence of a match
Determine frequency of selected snippets in search result to rank
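The normalize-then-rank steps can be sketched as follows, assuming the snippets for page p have already been collected from the top N search results. Plain bag-of-words cosine stands in for the paper's overlap measure, and the threshold value and function names are placeholders.

```python
import math
from collections import Counter

def cosine(a, b):
    """Bag-of-words cosine between two snippets (stand-in overlap measure)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

def social_summary(snippets, threshold=0.6):
    """Merge matching snippets, then rank the survivors by frequency."""
    clusters = []  # each entry: [representative snippet, count]
    for snippet in snippets:
        for cluster in clusters:
            if cosine(snippet, cluster[0]) > threshold:
                cluster[1] += 1
                if len(snippet) < len(cluster[0]):
                    cluster[0] = snippet  # take the shorter sentence of a match
                break
        else:
            clusters.append([snippet, 1])
    clusters.sort(key=lambda c: -c[1])
    return [(rep, count) for rep, count in clusters]

snippets = [
    "free photo sharing site",
    "a free photo sharing site",
    "tools for developers",
    "free photo sharing site",
]
print(social_summary(snippets))
```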
14
15
Evaluation
Baselines: OTS, MEAD
Metric: ROUGE
Set-up 1: SS used full set of tags, average length = 24%
  Problem?
Relative improvement:
  SS to OTS: 31-39%
  SS to MEAD: 24-29%
16
Set-up 2: Vary summary length from 10% to 50%
17
Set-up 3: Community-based summarization
  Use tags generated by a specific community: skier community (“skiing” as seed bookmark) vs. a travel community (“travel” as seed bookmark)
Evaluation: recall in summary of terms in seed set
18
19
Joint Group and Topic Discovery from Relations and Text
Example: legislative body and alliances
  Different alliances may form depending on the resolution topic (taxation vs. foreign trade)
GT model
  Discovery of groups guided by emerging topics
  Discovery of topics guided by emerging groups
  Example: resolutions that would have been assigned to one group based on topic may be assigned to a different one given voting patterns; distinct word-based topics may be merged if entities vote similarly.
20
GT Model
Simultaneously clusters entities into groups and words into topics
Data set: voting data from the US Senate and the UN General Assembly
21
Sentence extraction
Sparck Jones:
‘what you see is what you get’, some of what is on view in the source text is transferred to constitute the summary
22
Background
Sentence extraction the main approach
Some more sophisticated features for extraction in recent years
Lexical chains, anaphoric reference, topic signatures
Machine learning models for learning an extraction summarizer (e.g., Kupiec)
23
Today’s systems
How can we edit the selected text?
24
Karen Sparck Jones
Automatic Summarizing: Factors and Directions
25
Sparck Jones claims
Need more power than text extraction and more flexibility than fact extraction (p. 4)
In order to develop effective procedures it is necessary to identify and respond to the context factors, i.e. input, purpose and output factors, that bear on summarising and its evaluation. (p. 1)
It is important to recognize the role of context factors because the idea of a general-purpose summary is manifestly an ignis fatuus. (p. 5)
Similarly, the notion of a basic summary, i.e., one reflective of the source, makes hidden fact assumptions, for example that the subject knowledge of the output’s readers will be on a par with that of the readers for whom the source was intended. (p. 5)
I believe that the right direction to follow should start with intermediate source processing, as exemplified by sentence parsing to logical form, with local anaphor resolutions
26
Questions (from Sparck Jones)
Does the subject matter of the source influence summary style (e.g., chemical abstracts vs. sports reports)?
Should we take the reader into account and how?
Is the state of the art sufficiently mature to allow summarization from intermediate representations and still allow robust processing of domain independent material?
27
Consider the papers we read in light of Sparck Jones’ remarks on the influence of context:
  Input: source form, subject type, unit
  Purpose: situation, audience, use
  Output: material, format, style
28
Cut and Paste in Professional Summarization
Humans also reuse the input text to produce summaries
But they “cut and paste” the input rather than simply extract
Automatic corpus analysis (Ziff-Davis)
  300 summaries, 1,642 sentences
  81% of sentences were constructed by cutting and pasting
29
Major Cut and Paste Operations
(1) Sentence reduction
31
Major Cut and Paste Operations
(1) Sentence reduction
(2) Sentence combination
32
(3) Generalization
"a proposed new law that would require Web publishers to obtain parental consent before collecting personal information from children" -> "legislation to protect children's privacy on-line"
33
Cut and Paste Based Single Document Summarization -- System Architecture
[Architecture diagram. Pipeline: Input (single document) -> Extraction -> Extracted sentences -> Generation (sentence reduction, sentence combination) -> Output (summary). Supporting resources: Ziff-Davis corpus, decomposition, lexicon, parser, co-reference.]
34
Sentence Reduction
Step 1: Use linguistic knowledge to decide what phrases MUST NOT be removed
  Obligatory arguments of verbs are saved
Step 2: Determine what phrases are most important in the local context
  Phrases with words that link forward or backward
Step 3: Compute the probabilities of humans removing a certain type of phrase
Step 4: Combine the three factors to decide
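The slide does not say how the three factors are combined in Step 4. One simple possibility, shown purely as an assumption, is a rule cascade: never drop obligatory phrases, keep phrases that matter in context, and otherwise follow the corpus removal probability. The thresholds, field names, and example scores below are invented.

```python
def keep_phrase(obligatory, context_score, removal_prob,
                context_threshold=0.5, prob_threshold=0.5):
    """Decide whether a phrase survives reduction (assumed combination rule)."""
    if obligatory:                          # Step 1: linguistic knowledge
        return True
    if context_score >= context_threshold:  # Step 2: importance in local context
        return True
    return removal_prob < prob_threshold    # Step 3: how often humans remove it

def reduce_sentence(phrases):
    """phrases: list of dicts with text, obligatory, context, p_remove."""
    kept = [p["text"] for p in phrases
            if keep_phrase(p["obligatory"], p["context"], p["p_remove"])]
    return " ".join(kept)

phrases = [
    {"text": "The committee", "obligatory": True, "context": 0.9, "p_remove": 0.0},
    {"text": "approved", "obligatory": True, "context": 0.9, "p_remove": 0.0},
    {"text": "the budget", "obligatory": True, "context": 0.8, "p_remove": 0.1},
    {"text": "on Tuesday", "obligatory": False, "context": 0.1, "p_remove": 0.8},
    {"text": "despite objections", "obligatory": False, "context": 0.7, "p_remove": 0.6},
]
print(reduce_sentence(phrases))
```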
35
Sentence Fusion for Multi-document Summarization
http://newsblaster.cs.columbia.edu
36
37
Fusion
38
Sentence Fusion Computation: Content Selection
Common information identification
  Alignment of constituents in parsed theme sentences: only some subtrees match
  Bottom-up local multi-sequence alignment
Similarity depends on
  Word/paraphrase similarity
  Tree structure similarity
39
Sim(T, T’) = max(nodecompare(T, T’), Sim(T, children(T’)), Sim(children(T), T’))
Nodecompare searches for the best possible alignment of all child nodes
Node similarity depends on similarity between words of atomic nodes
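The recursion is easier to see in code. This sketch makes several assumptions: trees are `(label, children)` tuples, word similarity is exact match rather than a paraphrase-aware score, the child alignment is found by brute force over permutations (fine only for tiny trees), and scores are left unnormalized.

```python
from itertools import permutations

def word_sim(a, b):
    """Stand-in for word/paraphrase similarity: exact match only."""
    return 1.0 if a.lower() == b.lower() else 0.0

def node_compare(t, u):
    """Score the best one-to-one alignment of the children of t and u."""
    (label_t, kids_t), (label_u, kids_u) = t, u
    if not kids_t and not kids_u:   # two atomic nodes: compare the words
        return word_sim(label_t, label_u)
    if not kids_t or not kids_u:    # a word against a phrase: no direct match
        return 0.0
    small, large = (kids_t, kids_u) if len(kids_t) <= len(kids_u) else (kids_u, kids_t)
    best = 0.0
    for chosen in permutations(large, len(small)):
        best = max(best, sum(sim(a, b) for a, b in zip(small, chosen)))
    return best

def sim(t, u):
    """Sim(T, T') = max(nodecompare, Sim(T, children(T')), Sim(children(T), T'))."""
    candidates = [node_compare(t, u)]
    candidates += [sim(t, child) for child in u[1]]
    candidates += [sim(child, u) for child in t[1]]
    return max(candidates)

t1 = ("S", [("NP", [("police", [])]),
            ("VP", [("arrested", []), ("NP", [("suspect", [])])])])
t2 = ("S", [("NP", [("police", [])]),
            ("VP", [("detained", []), ("NP", [("suspect", [])])])])
print(sim(t1, t2))  # the verbs differ, so only two of the three words align
```

The second and third terms of the max are what let a small subtree match deep inside a larger sentence, which is the "only some subtrees match" case.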
40
41
Sentence Fusion: Generation
Fusion lattice computation
  Choose a basis sentence
  Add subtrees from fusion not present in basis
  Add alternative verbalizations
  Remove subtrees from basis not present in fusion
Lattice linearization
  Generate all possible sentences from the fusion lattice
  Score sentences using statistical language model
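Linearization can be illustrated with a toy version. Here the lattice is flattened to a sequence of slots, each with alternative phrasings (the real fusion lattice is a graph). The language model is an add-one-smoothed bigram model trained on three sentences, and the length normalization is an added assumption.

```python
import math
from collections import Counter
from itertools import product

def train_bigram(corpus):
    """Return a log-probability function from an add-one bigram model."""
    histories, bigrams = Counter(), Counter()
    vocab = {"<s>", "</s>"}
    for sentence in corpus:
        words = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(words)
        histories.update(words[:-1])
        bigrams.update(zip(words, words[1:]))
    size = len(vocab)

    def logprob(sentence):
        words = ["<s>"] + sentence.split() + ["</s>"]
        return sum(math.log((bigrams[(a, b)] + 1) / (histories[a] + size))
                   for a, b in zip(words, words[1:]))
    return logprob

def linearize(lattice, logprob):
    """Enumerate every path through the slots and keep the best-scoring one."""
    candidates = [" ".join(" ".join(choice).split()) for choice in product(*lattice)]
    # Per-word normalization so longer candidates are not unfairly penalized.
    return max(candidates, key=lambda s: logprob(s) / (len(s.split()) + 1))

corpus = ["the army attacked the town",
          "the army shelled the town",
          "rebels attacked the town"]
lattice = [["the army", "army the"], ["attacked"], ["the town"]]
print(linearize(lattice, train_bigram(corpus)))
```

Enumerating all paths grows exponentially with the number of slots, so a real system would need a search strategy rather than a full product.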
42
43
Questions
Jing: Not a statistical approach, not learned. Is this OK? Does it buy us anything over the approaches using learning?
Barzilay: Also not statistical, OK? How to compare with Jing? Is redundancy a good criterion for content selection? What could go wrong?
44
45
Supervised and Unsupervised Learning for Sentence Compression
J. Turner and E. Charniak
46
Knight and Marcu Model
Noisy Channel Model
Ziff-Davis corpus
Given a long sentence l, determine the short sentence s that maximizes P(s|l)
Bayes rule: P(s|l) = P(s) P(l|s) / P(l)
P(l) is constant across all candidate short sentences, so it is dropped
Language model: combination of PCFG and bigram of s
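Once P(l) is dropped, decoding reduces to picking the candidate s that maximizes P(s) P(l|s). A toy illustration with invented probabilities (the tables `source_lm` and `channel` are hypothetical lookups, not the K&M models):

```python
import math

def best_compression(candidates, source_lm, channel):
    """Pick argmax over s of log P(s) + log P(l|s) for one fixed long sentence l.

    source_lm[s] plays the role of P(s); channel[s] plays the role of P(l|s).
    """
    return max(candidates,
               key=lambda s: math.log(source_lm[s]) + math.log(channel[s]))

# Invented numbers for l = "the dog barked loudly"
source_lm = {"the dog barked": 0.02, "dog barked loudly": 0.001, "the barked": 0.0001}
channel = {"the dog barked": 0.3, "dog barked loudly": 0.5, "the barked": 0.6}
print(best_compression(list(source_lm), source_lm, channel))
```

The shortest candidate has the highest channel score here, but the source model rejects it as ungrammatical; that tension is the point of the noisy channel setup.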
47
Two problems with K&M
Lack of training data – Why?
Probability model is ad hoc
48
Turner and Charniak Approach – K&M modification
  Use syntactic language model
  Slightly change channel model
  Parameter to encourage compression
49
Alternate models
“Special rule” additions + K&M variation
  NP(1) -> NP(2) CC NP(3) compressed to NP(2)
Unsupervised version using PTB: no parallel corpus
  P(l|s) learned by comparing similar rules
  NP -> DT JJ NN (3X), NP -> DT NN (4X): P(l|s) = 3/7
Semi-supervised: fall back on unsupervised when no data from supervised
Constraints: complement/adjunct distinction: never allow deletion of complement
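One reading that reproduces the slide's 3/7: treat every treebank rule whose right-hand side contains the short rule's right-hand side as a subsequence as a possible "long" version, and normalize counts over that set. The helper names are illustrative, and the actual estimator in the paper may differ in detail.

```python
from collections import Counter
from fractions import Fraction

# Rule counts from the slide: NP -> DT JJ NN seen 3 times, NP -> DT NN seen 4 times.
rule_counts = Counter({("NP", ("DT", "JJ", "NN")): 3,
                       ("NP", ("DT", "NN")): 4})

def is_subsequence(short, long):
    """True if short can be obtained from long by deleting symbols."""
    remaining = iter(long)
    return all(symbol in remaining for symbol in short)

def expansion_prob(long_rule, short_rule, counts):
    """Estimate P(l|s) as count(l) over counts of all rules s could expand to."""
    lhs, short_rhs = short_rule
    compatible = [rule for rule in counts
                  if rule[0] == lhs and is_subsequence(short_rhs, rule[1])]
    return Fraction(counts[long_rule], sum(counts[rule] for rule in compatible))

short = ("NP", ("DT", "NN"))
long = ("NP", ("DT", "JJ", "NN"))
print(expansion_prob(long, short, rule_counts))   # 3/7
```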
50
Results (evaluated using judges)
51
Questions
How does this compare with Jing? Will same manual rules be captured?
Verb arguments not deleted? Context determines importance
What does statistics capture that is not captured by the manual approach?
How about revisions other than reduction?
52
Compression Beyond Word Deletion
Cohn and Lapata
53
Goal and Approach
To delete, substitute, re-order
Collect a new corpus: why?
  30 newspaper articles, 575 sentences
  Is this adequate?
Extract compressions
Collect paraphrases using MT
54
Abstraction Example
High winds and snowfalls have, however, grounded at a lower level the powerful US Navy Sea Stallion helicopters used to transport the slabs.
Bad weather, however, has grounded the helicopters transporting the slabs.
55
Extraction of compression rules
  Synchronous Tree Substitution Grammar
  (S, S) -> (NP VBD NP, NP was VBN by NP)
  Probabilistic (each grammar rule assigned a learned weight)
Prediction: generation finds the best-scoring compression using the grammar rules
(Skip training section)
56
Extension (contribution)
Paraphrasing with their corpus a problem
Learn paraphrase grammar rules
  Parallel bilingual corpus
  Learns over syntax tree fragments
  Translate from English to French and back again -> an English paraphrase of original
These rules are added into extracted compression grammar
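The "English to French and back" idea is commonly formalized as pivoting: P(e2|e1) = sum over f of P(f|e1) P(e2|f). The sketch below works on flat phrases with invented translation tables, whereas the slide says the paper learns over syntax tree fragments, so treat it as an analogy rather than their method.

```python
from collections import defaultdict

def pivot_paraphrases(e2f, f2e):
    """e2f[e][f] = P(f|e), f2e[f][e] = P(e|f). Returns P(e2|e1) by pivoting."""
    paraphrases = defaultdict(lambda: defaultdict(float))
    for e1, translations in e2f.items():
        for f, p_f in translations.items():
            for e2, p_e in f2e.get(f, {}).items():
                if e2 != e1:  # a phrase is not its own paraphrase
                    paraphrases[e1][e2] += p_f * p_e
    return {e1: dict(options) for e1, options in paraphrases.items()}

# Invented phrase-table entries for illustration only.
e2f = {"bad weather": {"mauvais temps": 0.8, "intempéries": 0.2}}
f2e = {"mauvais temps": {"bad weather": 0.7, "poor weather": 0.3},
       "intempéries": {"bad weather": 0.5, "storms": 0.5}}
print(pivot_paraphrases(e2f, f2e))
```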
57
Combined grammar
Incorporates an n-gram language model as a feature
  Helps prevent ungrammatical output
Like K&M, Turner and Charniak, a parameter to penalize short output
Union of compression plus paraphrasing grammar plus a COPY grammar derived from the source side
58
Results

| Models | Grammaticality | Importance | Comp Rate |
|---|---|---|---|
| extract | 3.1 | 2.43 | 82.5 |
| abstract | 3.38 | 2.85 | 79.2 |
| gold | 4.51 | 4.02 | 58.4 |
(O = original, E = extract, A = abstract, G = gold)
O: The scheme was intended for people of poor or moderate means.
E: The scheme was intended for people of poor means.
A: The scheme was intended for poor people.
G: The scheme was intended for the poor.

O: He died last Thursday at his home from complications following a fall, said his wife author Margo Kurtz.
E: He died last at his home from complications following a fall, said wife, author Margo Kurtz.
A: His wife author Margo Kurtz died from complications after a decline.
G: He died from complications following a fall.
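Assuming the compression rate is the percentage of words retained (compressed length over original length, counted by whitespace tokens), the first example above works out as follows. The definition is an assumption about how the table's Comp Rate column was computed.

```python
def compression_rate(original, compressed):
    """Percentage of whitespace-delimited tokens kept in the compression."""
    return 100.0 * len(compressed.split()) / len(original.split())

original = "The scheme was intended for people of poor or moderate means."
outputs = {
    "extract": "The scheme was intended for people of poor means.",
    "abstract": "The scheme was intended for poor people.",
    "gold": "The scheme was intended for the poor.",
}
for system, text in outputs.items():
    print(system, round(compression_rate(original, text), 1))
# extract 81.8, abstract 63.6, gold 63.6
```

A lower rate means more aggressive compression, which is why the gold row in the table has the lowest value.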
59
Questions
Is this comparable to K&M, Turner and Charniak?
Is it OK to take a risk? What are the weak points?