![Page 1: 1 Learning to Paraphrase: An Unsupervised Approach Using Multiple-Sequence Alignment Regina Barzilay and Lillian Lee Cornell University HLT-NAACL 2003](https://reader030.vdocument.in/reader030/viewer/2022032414/56649ee75503460f94bf84c4/html5/thumbnails/1.jpg)
1
Learning to Paraphrase: An Unsupervised Approach Using Multiple-Sequence Alignment
Regina Barzilay and Lillian Lee
Cornell University
HLT-NAACL 2003 (22%, wow!)
2
Objective
• Generate paraphrases automatically by learning from comparable corpora
• Domain-dependent paraphrasing
• News-oriented
• Example: “The plane bombed the town.” → “The town was bombed by the plane.”
3
Three Stages
1. Pattern Selection (Within Corpus)
• Find recurring patterns such as “X kicked Y”
2. Paraphrase Acquisition (Across Corpora)
• Pair patterns such as “X kicked Y” and “Y was kicked by X”
3. Generation
• Convert “Alice kicked Bob” to “Bob was kicked by Alice”
4
Free bonus: a diagram
5
1. Pattern Selection (Within Corpus)
2. Paraphrase Acquisition (Across Corpora)
3. Generation
One step at a time
6
Pattern Selection: Sentence Clustering
• Use complete-link clustering to cluster similar sentences within a corpus
• Use n-gram overlap as similarity measure
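• Anyone who has not taken the professor's IR course, was not paying attention in class, or has already forgotten: please refer to the professor's slides.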
• Replace dates, numbers and proper names in sentences with generic tokens to account for argument variability
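The clustering step can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the Jaccard measure over word unigrams and bigrams, the 0.3 merge threshold, and the function names are all assumptions, and only numbers are replaced with a generic token (dates and proper names would need a tagger).

```python
import re
from itertools import combinations

def generalize(sentence):
    """Lowercase, replace digit runs with a generic NUM token, split into words."""
    return re.sub(r"\d+", "NUM", sentence.lower()).split()

def ngrams(tokens, max_n=2):
    """Set of all word n-grams for n = 1..max_n."""
    return {tuple(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)}

def overlap(a, b, max_n=2):
    """Jaccard overlap of n-gram sets (one possible n-gram overlap measure)."""
    ga, gb = ngrams(a, max_n), ngrams(b, max_n)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)

def complete_link(sentences, threshold=0.3):
    """Agglomerative clustering; returns clusters as lists of sentence indices."""
    tokens = [generalize(s) for s in sentences]
    clusters = [[i] for i in range(len(sentences))]

    def link(c1, c2):
        # Complete link: cluster similarity is that of the LEAST similar pair.
        return min(overlap(tokens[i], tokens[j]) for i in c1 for j in c2)

    while len(clusters) > 1:
        (x, y), best = max(
            (((x, y), link(clusters[x], clusters[y]))
             for x, y in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1])
        if best < threshold:  # assumed stopping criterion
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]       # safe because x < y
    return [sorted(c) for c in clusters]
```

Replacing numbers with `NUM` makes "killed 5 people" and "killed 12 people" look identical to the similarity measure, which is the point of the generic tokens on the slide.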
7
Sample Cluster
Papers with lots of examples are the best
8
Pattern Selection: Inducing Patterns
• Use multiple sequence alignment (MSA), which is an extension of pairwise sequence alignment
• Pairwise sequence alignment: similar to edit distance!
– Aligning two identical words scores 1; inserting a word scores -0.01; aligning two different words scores -0.5
– Want to find the alignment that has the highest score
• MSA’s scoring function is the sum of all the pairwise alignment scores
• Man proposes, heaven disposes: MSA is NP-complete! But there is an approximation algorithm
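The pairwise step can be sketched with a standard dynamic program (Needleman-Wunsch style global alignment) using the three scores from the slide. The function name and the choice to charge the insertion score once per inserted word are assumptions; this covers only the pairwise case, not the approximate MSA procedure.

```python
def align_score(a, b, match=1.0, mismatch=-0.5, gap=-0.01):
    """Highest global alignment score between two word lists."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0.0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one insertion per word.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # a[i-1] aligned to a gap
                score[i][j - 1] + gap,       # b[j-1] aligned to a gap
            )
    return score[-1][-1]
```

Under these particular scores, two insertions (-0.02) cost less than one mismatch (-0.5), so the sketch prefers opening gaps to aligning two different words.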
9
Lattice Example
10
Identify the Variables
• We want to identify the variable parts (e.g. events, person names, locations, …)
• The non-variable part (backbone node) is defined as a node which is shared by more than 50% of the cluster’s sentences
• Remaining problem: telling argument variability (bad) apart from synonym variability (good)
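The 50% rule for backbone nodes can be illustrated with a small sketch. It assumes the lattice has been flattened into aligned columns, with one row per sentence and `None` marking a gap; that representation and the function name are simplifications chosen for illustration.

```python
from collections import Counter

def backbone_nodes(aligned, ratio=0.5):
    """Return (column, word) nodes shared by more than `ratio` of the sentences.

    aligned: equal-length rows, one per sentence; None marks a gap.
    """
    n = len(aligned)
    result = []
    for col, words in enumerate(zip(*aligned)):
        counts = Counter(w for w in words if w is not None)
        for word, count in counts.items():
            if count > ratio * n:  # strictly more than 50% by default
                result.append((col, word))
    return result
```

Columns where no word clears the threshold are the candidate variable regions, which the next slide splits into arguments and synonyms.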
11
Argument Variability VS Synonym Variability
• Idea: arguments (e.g. different location names) vary more than synonyms do
• Define a synonymy threshold: 30%
• If none of the parallel nodes has at least 30% of all edges pointing to it, then the parallel nodes are arguments rather than synonyms
• Replace parallel argument nodes with a slot
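The threshold test can be written in a few lines. In this sketch, `edge_counts` is a hypothetical mapping from each parallel node to the number of edges pointing to it; the function name and the example counts are illustrative only.

```python
def is_argument_slot(edge_counts, synonymy_threshold=0.30):
    """True if no parallel node receives at least `synonymy_threshold` of the edges."""
    total = sum(edge_counts.values())
    if total == 0:
        raise ValueError("no edges to classify")
    return all(count / total < synonymy_threshold
               for count in edge_counts.values())
```

For example, five location names with two edges each (20% apiece) would be replaced with a slot, while a "killed"/"wounded" split of 7 to 3 would be kept as synonyms.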
12
Lattice with Slots
13
1. Pattern Selection (Within Corpus)
2. Paraphrase Acquisition (Across Corpora)
3. Generation
One step at a time
14
Paraphrase Acquisition: A Successful Match
• Want to pair up lattices from two different corpora (e.g. “X kicked Y” in Corpus A and “Y was kicked by X” in Corpus B)
• Idea: paraphrases have the same slot values
• “Take a pair of lattices from different corpora, look back at the sentence clusters from which the two lattices were derived, and compare the slot values of those cross-corpus sentence pairs that appear in articles written on the same day” (© Barzilay 2003)
• For example, we can have “the plane bombed the town” in Corpus A, and “the town was bombed by the plane” in Corpus B both written on the same day.
• The more the slot values overlap, the better the match
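One way to picture the comparison is the sketch below. It assumes each lattice keeps, for every sentence in its cluster, the article date and the set of slot values; the Jaccard-style score, the data layout, and the dates in the usage example are assumptions made for illustration, not the paper's exact measure.

```python
def slot_overlap(lattice_a, lattice_b):
    """Overlap of slot values across same-day, cross-corpus sentence pairs.

    Each lattice is a list of (date, set_of_slot_values) entries,
    one per sentence in the cluster the lattice was built from.
    """
    shared = 0
    total = 0
    for date_a, slots_a in lattice_a:
        for date_b, slots_b in lattice_b:
            if date_a != date_b:
                continue  # only compare sentences written on the same day
            shared += len(slots_a & slots_b)
            total += len(slots_a | slots_b)
    return shared / total if total else 0.0
```

A pair of lattices would then be accepted as paraphrases when this score is high enough; where to put that cutoff is left open here.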
15
1. Pattern Selection (Within Corpus)
2. Paraphrase Acquisition (Across Corpora)
3. Generation
One step at a time
16
Generation: Mission Accomplished
• Given an input sentence, use MSA to find the most similar training sentence
• Use the training sentence’s lattice to generate paraphrases
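The final rewriting step can be illustrated with a deliberately simplified sketch. It swaps in plain regular-expression matching against a flat template in place of MSA against a lattice, and it assumes a hypothetical convention that single uppercase letters such as `X` and `Y` are slots, each used at most once per template.

```python
import re

def template_to_regex(template):
    """Compile a flat template like 'X kicked Y' into a regex with named slots."""
    parts = []
    for token in template.split():
        if len(token) == 1 and token.isupper():  # hypothetical slot convention
            parts.append(f"(?P<{token}>.+?)")
        else:
            parts.append(re.escape(token))
    return re.compile("^" + r"\s+".join(parts) + "$")

def paraphrase(sentence, source, target):
    """Rewrite `sentence` from the `source` template into the `target` template."""
    match = template_to_regex(source).match(sentence)
    if match is None:
        return None  # the sentence does not fit the source template
    slots = match.groupdict()
    return " ".join(slots.get(token, token) for token in target.split())
```

With the paired templates from the earlier slides, `paraphrase("Alice kicked Bob", "X kicked Y", "Y was kicked by X")` returns `"Bob was kicked by Alice"`.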
17
Evaluation Corpora
• Corpus 1: Agence France-Presse (AFP)
• Corpus 2: Reuters
• Between September 2000 and August 2002
• Focus on violence in Israel and army raids on Palestinian territories
• 9 MB of articles in total
18
Experiment 1/2: Template
• Evaluate the quality of the templates generated
• Example: Is “X kicked Y” equivalent to “Y was kicked by X”?
• Baseline: DIRT, another paraphrase system (focuses on shorter phrases)
• 4 human judges
• Randomly extract 250 pairs of paraphrases per system
• 100 pairs (50 per system) are evaluated by all 4 judges
• Each judge evaluates a different 100 of the remaining 400 pairs
19
Result 1/2: Open for Call-ins
20
Result 1/2: Open for Call-ins
• MSA outperforms DIRT by about 38% in all cases
• But DIRT focuses only on short phrases, so it’s unfair
• But no one has done sentence-level paraphrasing before
21
Experiment 2/2: Paraphrase
• Evaluate the actual paraphrases generated
• For testing, choose 20 AFP articles about violence in the Middle East that are not in the training corpus
• Try to paraphrase every sentence in the 20 articles
• Baseline: randomly substitute words in the sentence with WordNet synonyms
• 2 judges (human labor is too expensive, and there is no "five years, fifty billion" budget)
22
Result 2/2: Open for Call-ins
• 59 out of 484 sentences have paraphrases (12.2%)
| Judge | MSA | WordNet Synonym |
|---|---|---|
| 1 | 81.4% | 69.5% |
| 2 | 78.0% | 66.1% |
23
Finally
• Generating sentence-level paraphrases had not been addressed previously
• Use comparable corpora instead of parallel corpora
• The lab has already presented 5 papers by Barzilay (not counting this one)