LREC 2012, May 24th, 2012
Language Technologies Institute
School of Computer Science
Carnegie Mellon University, USA
Diversifiable Bootstrapping for Acquiring High-Coverage Paraphrase Resource
Carnegie Mellon
Hideki Shima
Teruko Mitamura
Can a machine recognize the meaning similarity?
John killed Mary.
Mary was killed by John. (passivization)
John is the killer of Mary. (nominalization)
John assassinated Mary. (entailment)
John is the 187 suspect of Mary. (slang)
187 means: “California penal code for murder, made popular in west coast gangsta rap”. – From The Urban Dictionary dot com
Usage: “This is Gavilan. In pursuit of possible 187 suspects.” – From the movie Hollywood Homicide
John terminated Mary with extreme prejudice. (euphemism)
“In military and other covert operations, terminate with extreme prejudice is a euphemism for execution” – Wikipedia
Humans use various expressions to convey the same or similar meaning, which makes it difficult for machines to “read” text.
Can a machine recognize the meaning similarity?
X killed Y.
Y was killed by X. (passivization)
X is the killer of Y. (nominalization)
X assassinated Y. (entailment)
X is the 187 suspect of Y. (slang)
X terminated Y with extreme prejudice. (euphemism)
Goal: automatically acquire paraphrase patterns that are lexically diverse
Paraphrase Recognition / Generation is a common need in various applications
Automatic Evaluation
– In Machine Translation [Kauchak & Barzilay, 2006][Padó et al., 2009]
– In Text Summarization [Zhou et al., 2006]
– In Question Answering [Ibrahim et al., 2003][Dalmas, 2007]
Text Summarization [Lloret et al., 2008][Tatar et al., 2009]
Information Retrieval [Parapar et al., 2005][Riezler et al., 2007]
Information Extraction [Romano et al., 2006]
Question Answering [Harabagiu & Hickl, 2006][Dogdan et al., 2008]
Collocation Error Correction [Dahlmeier and Ng, 2011]
Outline
– Motivation
– Method: Diversifiable Bootstrapping
– Experiment
– Related Works
– Conclusion
Bootstrap Paraphrase Learning
[Diagram: INPUT (monolingual plain corpus, seed instances) → BOOTSTRAP LEARNING ALGORITHM → OUTPUT (more instances, patterns)]
Bootstrap Paraphrase Learning: seed instances (INPUT)
X (killer)           Y (victim)
John Wilkes Booth    Abraham Lincoln
Mark David Chapman   John Lennon
Nathuram Godse       Mahatma Gandhi
Yigal Amir           Yitzhak Rabin
John Bellingham      Spencer Perceval
Mohammed Bouyeri     Theo van Gogh
Dan White            Mayor George Moscone
Sirhan Sirhan        Robert F. Kennedy
El Sayyid Nosair     Meir Kahane
Mijailo Mijailovic   Anna Lindh
Bootstrap Paraphrase Learning: patterns (OUTPUT)
X, the assassin of Y
assassination of Y by X
X assassinated Y
the assassination of Y by X
of X, the assassin of Y
X assassinated Y in
: : :
Unlike many other bootstrapping works, the goal is to acquire patterns, not instances.
Bootstrap Learning Algorithm
[Diagram, 1st iteration: Seed Instances → Sentences → Extracted Patterns → Ranked Patterns → Sentences → Extracted Instances → Ranked Instances, then on to the 2nd iteration, and so on]
This framework is based on ESPRESSO [Pantel & Pennacchiotti, 2006]
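The cycle above can be sketched as a simple loop. This is only a skeleton under stated assumptions, not the authors' implementation: the function name `bootstrap` and the six step functions it receives are hypothetical placeholders that a caller would have to supply.

```python
def bootstrap(seed_instances, corpus,
              find_by_instances, extract_patterns, rank_patterns,
              find_by_patterns, extract_instances, rank_instances,
              iterations=2):
    """Run the instance -> pattern -> instance cycle a fixed number of times.

    All six step arguments are hypothetical callables; the slides only name
    the steps, they do not define these signatures.
    """
    instances = list(seed_instances)
    patterns = []
    for _ in range(iterations):
        # Instances -> sentences -> extracted patterns -> ranked patterns
        sentences = find_by_instances(corpus, instances)
        patterns = rank_patterns(extract_patterns(sentences))
        # Patterns -> sentences -> extracted instances -> ranked instances
        sentences = find_by_patterns(corpus, patterns)
        instances = rank_instances(extract_instances(sentences, patterns))
    return patterns, instances
```

Since the goal here is patterns rather than instances, a caller would mainly use the first element of the returned pair.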
Bootstrap Learning Algorithm: Search sentences by instances
[Diagram step: Seed Instances → Sentences]
Edwin Booth was brother of John Wilkes Booth, the assassin of Abraham Lincoln.
John Wilkes Booth, the assassin of Abraham Lincoln, was inspired by Brutus.
In 1969 Berman was part of the defense team of Sirhan Sirhan, the assassin of Robert F. Kennedy.
: : :
Bootstrap Learning Algorithm: Search sentences by instances (instances shown as slots X and Y)
[Diagram step: Seed Instances → Sentences]
Edwin Booth was brother of X, the assassin of Y.
X, the assassin of Y, was inspired by Brutus.
In 1969 Berman was part of the defense team of X, the assassin of Y.
: : :
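The search step can be illustrated with plain substring matching. This is a minimal sketch, assuming exact string matches for the seed names; the function names are hypothetical, and the last corpus sentence is an added distractor rather than a sentence from the slides.

```python
def generalize(sentence, x, y):
    """Replace the surface strings of one (X, Y) instance with slot markers."""
    return sentence.replace(x, "X").replace(y, "Y")

def search_by_instances(corpus, instances):
    """Return slot-generalized sentences that mention both members of an instance."""
    results = []
    for sentence in corpus:
        for x, y in instances:
            if x in sentence and y in sentence:
                results.append(generalize(sentence, x, y))
                break  # one matching instance per sentence is enough here
    return results

seeds = [
    ("John Wilkes Booth", "Abraham Lincoln"),
    ("Sirhan Sirhan", "Robert F. Kennedy"),
    ("Mark David Chapman", "John Lennon"),
]
corpus = [
    "Edwin Booth was brother of John Wilkes Booth, the assassin of Abraham Lincoln.",
    "John Wilkes Booth, the assassin of Abraham Lincoln, was inspired by Brutus.",
    "In 1969 Berman was part of the defense team of Sirhan Sirhan, "
    "the assassin of Robert F. Kennedy.",
    "John Lennon was born in Liverpool.",  # distractor: only one member of a pair
]
generalized = search_by_instances(corpus, seeds)
```

A real system would query an index over the corpus instead of scanning every sentence.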
Bootstrap Learning Algorithm: Extract patterns from sentences
[Diagram step: Sentences → Extracted Patterns]
… brother of X, the assassin of Y.
X, the assassin of Y, was …
… team of X, the assassin of Y.
Extracted Pattern: Longest Common Substring among retrieved sentences
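A token-level longest common substring over the retrieved sentences can be sketched as below. The slides do not say how the substring is computed, so the brute-force search and the tokenizer are assumptions, and all function names are hypothetical.

```python
import re

def tokenize(sentence):
    """Split into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)

def contains(tokens, ngram):
    """True if ngram occurs as a contiguous run inside tokens."""
    n = len(ngram)
    return any(tokens[i:i + n] == ngram for i in range(len(tokens) - n + 1))

def longest_common_token_span(sentences):
    """Longest contiguous token sequence shared by every sentence."""
    token_lists = [tokenize(s) for s in sentences]
    shortest = min(token_lists, key=len)
    # Try candidate spans of the shortest sentence, longest first.
    for n in range(len(shortest), 0, -1):
        for start in range(len(shortest) - n + 1):
            ngram = shortest[start:start + n]
            if all(contains(tokens, ngram) for tokens in token_lists):
                return ngram
    return []

sentences = [
    "Edwin Booth was brother of X, the assassin of Y.",
    "X, the assassin of Y, was inspired by Brutus.",
    "In 1969 Berman was part of the defense team of X, the assassin of Y.",
]
pattern_tokens = longest_common_token_span(sentences)
pattern = " ".join(pattern_tokens).replace(" ,", ",")
```

On these three sentences the shared span is the pattern shown on the slide, "X, the assassin of Y".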
Bootstrap Learning Algorithm: Score and rank patterns
[Diagram step: Extracted Patterns → Ranked Patterns]
Rank by reliability of pattern: r(p).
r(p) is based on an association measure with each instance in the corpus.
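The slides do not spell out the association measure. One possible reading, assuming the ESPRESSO-style definition in which a pattern's reliability is the average over instances of PMI (normalized by the maximum PMI) weighted by instance reliability, is sketched below. The function name, the data layout and the toy numbers are all assumptions for illustration.

```python
def pattern_reliability(pattern, instances, pmi, instance_reliability):
    """ESPRESSO-style reliability of a pattern (assumed definition).

    pmi maps (instance, pattern) pairs to a PMI value; pairs that never
    co-occur are treated as contributing 0.
    """
    max_pmi = max(pmi.values())
    total = sum(
        pmi.get((inst, pattern), 0.0) / max_pmi * instance_reliability[inst]
        for inst in instances
    )
    return total / len(instances)

# Toy values, not taken from the experiments.
toy_pmi = {("a", "p1"): 2.0, ("b", "p1"): 1.0, ("a", "p2"): 0.5}
toy_instances = ["a", "b"]
toy_reliability = {"a": 1.0, "b": 1.0}  # seed instances start fully trusted
r_p1 = pattern_reliability("p1", toy_instances, toy_pmi, toy_reliability)
r_p2 = pattern_reliability("p2", toy_instances, toy_pmi, toy_reliability)
```

Under this reading, a pattern that co-occurs strongly with many reliable instances (p1 here) scores higher than one that co-occurs weakly with a single instance (p2).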
Bootstrap Learning Algorithm: Score and rank patterns
[Diagram step: Extracted Patterns → Ranked Patterns]
1. 0.422 X, the assassin of Y
2. 0.324 assassination of Y by X
3. 0.312 X assassinated Y
4. 0.231 the assassination of Y by X
5. 0.208 of X, the assassin of Y
: : :
Bootstrap Learning Algorithm: Search sentences by pattern(s)
[Diagram step: Ranked Patterns → Sentences]
Still shot from the CCTV video footage showing Oguen Samast, the assassin of Hrant Dink.
Henry Bellingham is a descendant of John Bellingham, the assassin of Spencer Perceval.
Bootstrap Learning Algorithm: Extract instances from sentences
[Diagram step: Sentences → Extracted Instances]
Still shot from the CCTV video footage showing Oguen Samast, the assassin of Hrant Dink.
Henry Bellingham is a descendant of John Bellingham, the assassin of Spencer Perceval.
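Instance extraction can be illustrated by turning a pattern into a regular expression with one capture group per slot. This is a minimal sketch: the rule that a slot filler is a run of capitalized words is an assumption made for this example, and the names `NAME`, `pattern_to_regex` and `extract_instances` are hypothetical.

```python
import re

# Assumption: a slot filler is a run of capitalized words, optionally with
# an abbreviation period inside (as in "Robert F. Kennedy").
NAME = r"\b[A-Z]\w*(?:\.? [A-Z]\w*)*"

def pattern_to_regex(pattern):
    """Compile a pattern such as 'X, the assassin of Y' into a regex."""
    parts = re.split(r"\b([XY])\b", pattern)
    pieces = [
        f"(?P<{part}>{NAME})" if part in ("X", "Y") else re.escape(part)
        for part in parts
    ]
    return re.compile("".join(pieces))

def extract_instances(pattern, sentences):
    """Return the (X, Y) pairs matched by the pattern, one per sentence."""
    regex = pattern_to_regex(pattern)
    found = []
    for sentence in sentences:
        match = regex.search(sentence)
        if match:
            found.append((match.group("X"), match.group("Y")))
    return found

sentences = [
    "Still shot from the CCTV video footage showing Oguen Samast, "
    "the assassin of Hrant Dink.",
    "Henry Bellingham is a descendant of John Bellingham, "
    "the assassin of Spencer Perceval.",
]
instances = extract_instances("X, the assassin of Y", sentences)
```

On the two sentences above, this yields the pairs (Oguen Samast, Hrant Dink) and (John Bellingham, Spencer Perceval). A named-entity recognizer would be a more robust way to delimit slot fillers.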
Bootstrap Learning Algorithm: Score and rank instances
[Diagram step: Extracted Instances → Ranked Instances, then on to the 2nd iteration]
Rank instances by reliability: r(i) (similar to pattern reliability scoring)
Issue: Lack of Lexical Diversity
X, the assassin of Y
assassination of Y by X
X assassinated Y
the assassination of Y by X
of X, the assassin of Y
X assassinated Y in
Words participating in patterns are skewed
As a solution, we propose the Diversifiable Bootstrapping
Diversifiable Bootstrapping
$r'(p) = \lambda \cdot r(p) + (1 - \lambda) \cdot \mathrm{diversity}(p)$
r(p): original reliability score of a pattern
diversity(p): how lexically different this pattern is from other patterns originally ranked higher than it
λ: interpolation parameter, $0 \le \lambda \le 1$
Key contribution: by tweaking the parameter λ, one can control the degree to which the acquired patterns are diversified.
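The interpolation can be sketched as a re-ranking pass. The slides do not define the diversity function here, so the version below (one minus the highest Jaccard word overlap with any pattern originally ranked higher) is a stand-in assumption, as are the stop-word list, the function names and the toy scores.

```python
STOP = {"x", "y", "the", "of", "by", "in", "a", "to", "was"}  # assumed list

def content_words(pattern):
    """Lowercased words of a pattern, minus slots and a few function words."""
    words = {w.lower().strip(",.'") for w in pattern.split()}
    return words - STOP - {""}

def diversity(pattern, higher_ranked):
    """Stand-in diversity: 1 - max Jaccard overlap with higher-ranked patterns."""
    if not higher_ranked:
        return 1.0
    words = content_words(pattern)
    def jaccard(a, b):
        return len(a & b) / len(a | b) if (a | b) else 1.0
    return 1.0 - max(jaccard(words, content_words(q)) for q in higher_ranked)

def rerank(scored_patterns, lam):
    """Apply r'(p) = lam * r(p) + (1 - lam) * diversity(p) and re-sort."""
    original = sorted(scored_patterns, key=lambda pr: pr[1], reverse=True)
    rescored = []
    for rank, (p, r) in enumerate(original):
        higher = [q for q, _ in original[:rank]]  # original ranking only
        rescored.append((p, lam * r + (1 - lam) * diversity(p, higher)))
    return sorted(rescored, key=lambda pr: pr[1], reverse=True)

# Toy reliability scores for illustration only, not from the experiments.
toy = [
    ("X assassinated Y", 0.9),
    ("X assassinated Y in", 0.8),
    ("named X assassinated Y", 0.7),
    ("X murdered Y", 0.6),
]
no_diversification = [p for p, _ in rerank(toy, 1.0)]
diversified = [p for p, _ in rerank(toy, 0.5)]
```

With λ = 1 the original order is preserved. With λ = 0.5 the lexically new pattern "X murdered Y" moves up to second place, while the near-duplicate "X assassinated Y in" drops to last.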
Experimental Settings
Bootstrapping Algorithm
– Based on ESPRESSO framework [Pantel & Pennacchiotti, 2006]
– Unlike ESPRESSO, we aim to obtain patterns, not instances
Lexical diversity scoring function
– Based on Shima & Mitamura [2011]
Seed instances: Schlaefer et al. [2006]
Corpus: English Wikipedia
Acquired Paraphrases: killed
λ = 1 (no diversification):
X, the assassin of Y
assassination of Y by X
X assassinated Y
the assassination of Y by X
of X, the assassin of Y
X assassinated Y in
X, the man who assassinated Y
Y's assassin, X
of Y's assassin X
of the assassination of Y by X
X shot and killed Y
Y was assassinated by X
named X assassinated Y
Y was shot by X
X to assassinate Y
λ = 0.7:
X, the assassin of Y
X assassinated Y
assassination of Y by X
Y was shot by X
X, who killed Y
the assassination of Y by X
X assassinated Y in
X tells his version of Y
X shoot Y
X murdered Y
Y's killer, X
Y, at the theatre after X
Y, push X to his breaking point
X to assassinate Y
of X, the assassin of Y
λ = 0.3:
X, the assassin of Y
X, who killed Y
Y was shot by X
X tells his version of Y
X shoot Y
X murdered Y
Y's killer, X
Y, at the theatre after X
Y, push X to his breaking point
X assassinated Y
assassination of Y by X
X to assassinate Y
X kills Y
of X shooting Y
X assassinated Y in
Acquired Paraphrases: died-of
λ = 1:
X died of Y
X died of Y in
X died of Y on
X died of lung Y
X died of lung Y in
X died of lung Y on
X died of Y in the
X died of Y at
X died of stomach Y
X died of natural Y
X died of breast Y in
X died of a Y
X died of Y in his
X passed away from Y
X died of a Y in
λ = 0.7:
X died of Y in
X died of Y
X's death from Y
X passed away from Y
Y of X, news
Y of X, a former
that X was suffering from Y
the suspected Y of X
X to breast Y in
X was diagnosed with ovarian Y
X dies of Y
X was dying of Y
X died of lung Y
X died of Y on
X died of lung Y in
λ = 0.3:
X died of Y in
X's death from Y
X passed away from Y
Y of X, news
Y of X, a former
that X was suffering from Y
the suspected Y of X
X succumbed to lung Y
X to breast Y in
X was diagnosed with ovarian Y
X dies of Y
X was dying of Y
X died of Y
X's death from Y in
X died of lung Y
Acquired Paraphrases: was-led-by
λ = 1:
Y came to power in X in
Y came to power in X
Y to power in X
Y came to power in X in the
when Y came to power in X in
when Y came to power in X
Y took power in X
Y rose to power in X
after Y came to power in X
Y became chancellor of X
Y came to power in X and
Y seized power in X
Y gained power in X
to power of Y in X
Y's rise to power in X
λ = 0.7:
Y came to power in X
Y to power in X
regime of Y in X
Y came to power in X in
Y to power in X in
Y became chancellor of X
the rise of Y in X
X's dictator Y
X's president Y
Y took control of X
Y, who ruled X
Y's success and X's saviour
Y declared that X had
X's leader Y
government of Y in X
λ = 0.3:
Y came to power in X in
regime of Y in X
X's dictator Y
Y became chancellor of X
X's president Y
the rise of Y in X
X's leader Y
Y, who ruled X
Y took control of X
government of Y in X
X, led by Y
quisling had visited Y in X
to flee X after Y
Y in X the year before
X, under the leadership of Y
Related Works – Use of Thesaurus
E.g., WordNet [Miller, 1995], FrameNet [Baker et al., 1998], Nomlex [Macleod et al., 1998], VerbNet [Kipper et al., 2006]
Synonyms of “lead (v)” in WordNet
ID   Words                               Definition
S1   lead, take, direct, conduct, guide  take somebody somewhere
S2   leave, result, lead                 produce as a result or residue
:
S6   run, go, pass, lead, extend         stretch out over a distance, space, time, or scope
:
S14  moderate, chair, lead               preside over
WEAKNESS
Need WSD or contexts to avoid false-positives.
Related Works – Paraphrase Acquisition
Alignment Approach
– Monolingual Comparable Corpus [Shinyama et al., 2002]
– Bilingual Parallel Corpus [Barzilay & McKeown, 2001][Bannard & Callison-Burch, 2005][Callison-Burch, 2008]
Distributional Approach
– Context as Vector Space [Pasca & Dienes, 2005][Bhagat & Ravichandran, 2008]
– Context as Surface Pattern [Lin & Pantel, 2001][Ravichandran & Hovy, 2002]
Related Works – Paraphrase Acquisition
Paraphrases acquired by Metzler et al. [2011]
Columns: [Bannard & Callison-Burch, 2005], [Callison-Burch, 2008], [Bhagat & Ravichandran, 2008], [Pasca & Dienes, 2005]
[Table rows as extracted, one row per line; cell boundaries within a row are not recoverable]
murdered murdered killed in used
died dead killed , made
beaten death that killed involved
been killed deaths killed NN people found
are died killed NN born
lost victims killed by done
were killed killing were wounded in injured
kill been killed and wounding seen
have died dead , including taken
, hundreds released
Differences from Related Works
Our work requires just a plain non-parallel corpus
– Language portability:
  • Good news for resource/tool-scarce languages
– There is potential to learn words used in a closed community (slang, technical terms, etc.) by providing a domain-specific corpus
Bootstrapping works iteratively with minimum supervision
– Less human effort is required than with heavily supervised learning methods, or with relying on domain experts to hand-craft patterns.
Conclusion
We proposed Diversifiable Bootstrapping, which can acquire lexically diverse paraphrase patterns.
We gave initial experimental results on a few relations, which look promising.
As future work, we hope to conduct formal evaluations on a larger set of relations in different languages.
Acknowledgment
We also gratefully acknowledge the support of Defense Advanced Research Projects Agency (DARPA) Machine Reading Program under Air Force Research Laboratory (AFRL) prime contract no. FA8750-09-C-0172. Any opinions, findings, and conclusion or recommendations expressed in this material are those of the authors and do not necessarily reflect the view of the DARPA, AFRL, or the US government.
This publication was made possible in part by a NPRP grant (No: 09-873-1-129) from the Qatar National Research Fund (a member of The Qatar Foundation). The statements made herein are solely the responsibility of the authors.
Questions?