Day 4: Reranking/Attention shift; surprisal-based sentence processing

Roger Levy
University of Edinburgh & University of California – San Diego
Overview for the day
• Reranking & Attention shift
• Crash course in information theory
• Surprisal-based sentence processing
Reranking & Attention shift
• Suppose an input prefix w1…i determines a ranked set of incremental structural analyses, call it Struct(w1…i)
• In general, adding a new word wi+1 to the input will determine a new ranked set of analyses Struct(w1…i+1)
• A reranking theory attributes processing difficulty to some function comparing the old and new ranked sets of analyses (sketched below)
• An attention shift theory is a special case where difficulty is predicted only when the highest-ranked analysis differs between Struct(w1…i) and Struct(w1…i+1)
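To make the two linking hypotheses concrete, here is a minimal sketch in Python. All names, the choice of comparison function, and the toy probabilities are my own illustration, not from the slides:

```python
# Struct(w1...i) represented as a dict mapping analysis labels to
# conditional probabilities P(T | w1...i).

def reranking_cost(old, new):
    """One possible reranking cost: total variation distance between the
    old and new distributions (the theory only says 'some function
    comparing the analyses')."""
    analyses = set(old) | set(new)
    return 0.5 * sum(abs(old.get(t, 0.0) - new.get(t, 0.0)) for t in analyses)

def attention_shift(old, new):
    """Attention shift: difficulty is predicted only when the
    top-ranked analysis changes from one word to the next."""
    return max(old, key=old.get) != max(new, key=new.get)

# "The warehouse fires many workers..." -- toy numbers only
struct_after_fires = {"NN compound": 0.7, "N + V": 0.3}
struct_after_many  = {"NN compound": 0.1, "N + V": 0.9}

print(reranking_cost(struct_after_fires, struct_after_many))   # 0.6 (graded)
print(attention_shift(struct_after_fires, struct_after_many))  # True (categorical)
```

The reranking cost is graded; the attention-shift predictor is categorical, firing only on a change at the top of the ranking.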
Conceptual issues
• Granularity: what precisely is specified in an incremental structural analysis?
• Ranking metric: how are analyses ranked?
  • e.g. in terms of conditional probabilities P(T | w1…i)
• Degree of parallelism: how many (and which) analyses are retained in Struct(w1…i)?
Attention shift: an example
• Parallel comprehension: two or more analyses entertained simultaneously
• Disambiguation comes with the following context, “many workers…”
• There is an extra cost paid (reading is slower) at the disambiguating context
  • Eye-tracking (Frazier and Rayner 1987)
  • Self-paced reading (MacDonald 1993)
The warehouse fires many workers each spring…
Pruning isn’t enough
• Jurafsky (1996) analyzed the NN/NV ambiguity for “warehouse fires” and concluded that no pruning could happen
• [Slide figure: estimated probability ratios of the competing analyses, 3.8 : 1 and 267 : 1]
Idea of attention shift
• Suppose that a change in the top-ranked candidate induces empirically observed “difficulty”
• This is not the same as serial parsing, which doesn’t even entertain alternate parses unless the current parse breaks down
• Why would this happen?
  • People could be gathering more information about the preferred parse, and need extra time to do this when the preferred parse changes
  • People could simply be surprised, and this could interrupt “normal reading processes”
Crocker & Brants 2000
• Adopt an attention-shift linking hypothesis (page 660; unfortunately not stated very explicitly)
• Architectural aspects of their system:
  • Bottom-up, incremental parsing architecture
  • Some pruning at every “layer” from the bottom on up
  • No lexicalization in the grammar
  • Skip other details…
N/V ambiguity under attention shift
• Crocker & Brants 2000: relative strength of each interpretation changes from word to word
N/V attention shift: which probs?
• This analysis relies on lexical & syntactic probabilities
  • P(fires|NN) is higher than P(fires|VBZ)
  • P(NP → Det NN NN) is low, and putting “many” after a subject NP is low-probability
• Is this a satisfactory analysis? (cf. day 1!)
• MacDonald 1993 found no disambiguating-context difficulty when noun (corporation) doesn’t support noun-compound analysis
• These are, at the least, bilexical affinities
The corporation fires many workers each spring
Results from MacDonald 1993
• Difficulty at the disambiguating region only with “warehouse fires”, not “corporation fires”
• Observed difficulty was delayed a bit (spillover)
• [Slide figure: reading times showing the relative difficulty in the ambiguous case]
How to estimate parse probs
• In an attention-shift model, conditional probabilities are of primary interest
• “warehouse fires” vs. “corporation fires” creates a practical problem
• Model should include P(fires|warehouse,{NN,NV}) and P(fires|corporation,{NN,NV})
• But no parsed corpus even contains “fires” in the same sentence with either of these words
• What do we do here?
How to estimate parse probs (2)
• MacDonald 1993’s approach: collect relevant quantitative norm data and correlate with RTs
  • warehouse: head vs. modifying noun frequency (corresponds to P(NN|warehouse))
  • fires: noun/verb ambiguous word usage (corresponds, indirectly, to P(fires|NN))
  • warehouse fires: modifier+head cooccurrence rate (corresponds to P(fires|warehouse,NN))
  • warehouse fires: plausibility ratings as NV vs. as NN (“how plausible is it to have a fire in a warehouse?” vs. “how plausible is it to have a warehouse fire someone?”)
How to estimate parse probs (3)
• In the era of gigantic corpora (e.g., the Web), another approach: the counting method
  • To estimate P(NN|the warehouse fires), simply collect a sample of “the warehouse fires” and count how many of them are NN usages
• Many pitfalls!
  • often can’t hold external sentence context constant
  • vulnerable to undisclosed workings of search engines
  • hand-filtering the results is imperative
  • assumes human probability estimates will match corpus frequencies
• BUT it gives access to huge data!
How to estimate parse probs (4)
• Crude method: we’ll use a corpus search (Google) to estimate P(NN|warehouse,fires)
  • 21 instances of “warehouse fires” found (excluding psycholinguistics hits!); all were NN
  • two of these were potentially NV contexts
• At least some evidence that P(NN|warehouse,fires) is above 0.5 (see the check below)
• Supports the attention-shift analysis

Examples:
  “I heard an interview on NPR of a Vieux Carre (French Quarter) native who explained how the warehouse fires started...”
  “Not all the warehouse fires were so devastating, ...”
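A quick back-of-the-envelope check (my own, not from the slides): if P(NN) were really only 0.5, a 21-for-21 NN sample would be extremely unlikely:

```python
# Probability of observing 21 NN usages in 21 independent samples
# if the true P(NN | "the warehouse fires") were 0.5:
p_if_half = 0.5 ** 21
print(f"{p_if_half:.1e}")  # ~4.8e-07, so P(NN) > 0.5 looks very safe
```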
Attention shift in MV/RR ambiguity?
• The McRae et al. 1998 results also have an attention-shift interpretation (pursued by Narayanan & Jurafsky 2002)
• [Slide figure: interpretation probabilities word by word for “the {crook/cop}…” items; the shift to the RR analysis happens at different points for good patients vs. good agents]
Reranking/Attention shift summary
• Reranking attributes difficulty to changes in the ranking over interpretations caused by a given word
• Attention shift is a special form in which only changes in the highest-ranked candidate matter
Overview for the day
• Reranking & Attention shift
• Tiny introduction to information theory
• Surprisal-based sentence processing
Tiny intro to information theory
• Shannon information content, or surprisal, of an event x (sometimes called the entropy of event x):

$$h(x) = \log_2 \frac{1}{P(x)} = -\log_2 P(x)$$

• Example: a bent coin with P(heads) = 0.4:

$$h(\text{heads}) = \log_2 \frac{1}{0.4} = 1.32 \qquad h(\text{tails}) = \log_2 \frac{1}{0.6} = 0.74$$

• A loaded die with P(1) = 0.4 also has h(1) = 1.32
Tiny intro to information theory (2)
• The entropy of a discrete probability distribution is the expected value of its Shannon information content:

$$H(X) = \sum_x P(x) \log_2 \frac{1}{P(x)}$$

• Example: the entropy of a fair coin is

$$H(X) = 0.5 \log_2 \frac{1}{0.5} + 0.5 \log_2 \frac{1}{0.5} = 0.5 \log_2 2 + 0.5 \log_2 2 = 1$$

• Our bent P(heads) = 0.4 coin has entropy less than 1:

$$H(X) = 0.4 \log_2 \frac{1}{0.4} + 0.6 \log_2 \frac{1}{0.6} = 0.53 + 0.44 = 0.97$$
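The same arithmetic in a few lines of Python (a sketch; the function names are mine):

```python
import math

def surprisal(p):
    """Shannon information content of an event with probability p, in bits."""
    return math.log2(1 / p)

def entropy(dist):
    """Entropy of a discrete distribution: expected surprisal, in bits."""
    return sum(p * surprisal(p) for p in dist if p > 0)

print(surprisal(0.4))        # h(heads) = 1.32 for the bent coin
print(entropy([0.5, 0.5]))   # fair coin: 1.0
print(entropy([0.4, 0.6]))   # bent coin: ~0.97
```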
[Slide figure: Entropy of a loaded coin: the binary entropy h2(p) plotted against p, rising from 0 at p = 0 to a maximum of 1 at p = 0.5 and falling back to 0 at p = 1]
Tiny intro to information theory (3)
• Our loaded die with P(1)=0.4 doesn’t have its entropy completely determined yet: it depends on how the remaining 0.6 of probability mass is distributed over the other faces. Two examples are sketched below.
• A fair die has entropy of 2.58
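The slide’s own two example distributions are in its figure, but any two choices with P(1) = 0.4 make the point; here are two of my own (reusing an entropy helper):

```python
import math

def entropy(dist):
    # expected surprisal in bits; zero-probability outcomes contribute nothing
    return sum(p * math.log2(1 / p) for p in dist if p > 0)

# Two loaded dice, both with P(1) = 0.4:
spread = [0.4] + [0.12] * 5      # remaining 0.6 spread evenly over faces 2-6
lumped = [0.4, 0.6, 0, 0, 0, 0]  # remaining 0.6 all on face 2

print(entropy(spread))      # ~2.36 bits
print(entropy(lumped))      # ~0.97 bits (same as the bent coin)
print(entropy([1/6] * 6))   # fair die: ~2.58 bits
```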
Overview for the day
• Reranking & Attention shift
• Crash course in information theory
• Surprisal-based sentence processing
Hale 2001, Levy 2005: surprisal
• Let the difficulty of a word be its surprisal given its context:

$$\text{difficulty}(w_i) \propto \log_2 \frac{1}{P(w_i \mid w_{1\ldots i-1})}$$
• Captures the expectation intuition: the more we expect an event, the easier it is to process
• Many probabilistic formalisms, including PCFGs (Jelinek & Lafferty 1991, Stolcke 1995), can give us word surprisals
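Concretely, surprisal reduces to a difference of log prefix probabilities (a standard identity, stated here for completeness):

$$h(w_i) = \log_2 \frac{1}{P(w_i \mid w_{1\ldots i-1})} = \log_2 P(w_{1\ldots i-1}) - \log_2 P(w_{1\ldots i})$$

where $P(w_{1\ldots k})$ is the prefix probability: the summed probability of all trees whose yield begins with $w_1 \ldots w_k$. This is the quantity the Jelinek & Lafferty / Stolcke algorithms compute for PCFGs.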
Intuitions for surprisal & PCFGs
• Consider the following PCFG
• Calculate surprisal at destroyed in these sentences:
P(S → NP VP) = 1.0
P(NP → DT N) = 0.4
P(NP → DT N N) = 0.3
P(NP → DT Adj N) = 0.3
P(N → warehouse) = 0.03
P(N → fires) = 0.02
P(DT → the) = 0.3
P(VP → V) = 0.3
P(VP → V NP) = 0.4
P(VP → V PP) = 0.1
P(V → fires) = 0.05
P(V → destroyed) = 0.04

the warehouse fires destroyed the neighborhood.
the fires destroyed the neighborhood.
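The transcript drops the worked calculation, so here is a sketch of it. The prefix-analysis enumeration is mine, done by hand from the grammar above; surprisal at destroyed is the negative log of the ratio of (joint to prefix) probabilities:

```python
import math

# Rule probabilities from the grammar above
p_np_dt_n   = 0.4   # NP -> DT N
p_np_dt_n_n = 0.3   # NP -> DT N N
p_the       = 0.3   # DT -> the
p_warehouse = 0.03  # N -> warehouse
p_n_fires   = 0.02  # N -> fires
p_vp_any    = 0.3 + 0.4 + 0.1  # every VP rule begins with V
p_v_fires   = 0.05  # V -> fires
p_destroyed = 0.04  # V -> destroyed

# --- Prefix "the warehouse fires" has two incremental analyses ---
nn_compound = p_np_dt_n_n * p_the * p_warehouse * p_n_fires            # [NP the warehouse fires] ...
noun_verb   = p_np_dt_n * p_the * p_warehouse * p_vp_any * p_v_fires   # [NP the warehouse] [VP fires ...
prefix = nn_compound + noun_verb

# "destroyed" only continues the noun-compound analysis (as the V heading the VP)
joint = nn_compound * p_vp_any * p_destroyed
print(-math.log2(joint / prefix))    # ~6.84 bits

# --- Prefix "the fires": a complete NP, or the first N of an unfinished compound ---
np_done = p_np_dt_n * p_the * p_n_fires
np_open = p_np_dt_n_n * p_the * p_n_fires
prefix2 = np_done + np_open
joint2 = np_done * p_vp_any * p_destroyed
print(-math.log2(joint2 / prefix2))  # ~5.77 bits
```

On these toy numbers destroyed is more surprising after the warehouse fires (≈6.8 bits) than after the fires (≈5.8 bits): the N/V ambiguity of fires siphons prefix probability mass onto the subject-plus-verb analysis, which destroyed then rules out.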
Connection with reranking models
• Levy 2005 shows that surprisal is a special form of reranking model
• In particular, if reranking cost is taken as the KL divergence* between old & new parse distributions…
• …then reranking cost turns out to be equivalent to the surprisal of the new word wi
• Thus representation neutrality is an interesting consequence of the surprisal theory
*a measure of the penalty incurred by encoding one probability distribution with another
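A sketch of why (my reconstruction of the argument, assuming each full analysis T determines the words it generates, so that conditioning on wi+1 merely zeroes out the inconsistent analyses and renormalizes the rest):

$$\begin{aligned} D_{\mathrm{KL}}\bigl(P(T \mid w_{1\ldots i+1}) \,\|\, P(T \mid w_{1\ldots i})\bigr) &= \sum_T P(T \mid w_{1\ldots i+1})\,\log_2 \frac{P(T \mid w_{1\ldots i+1})}{P(T \mid w_{1\ldots i})} \\ &= \sum_T P(T \mid w_{1\ldots i+1})\,\log_2 \frac{1}{P(w_{i+1} \mid w_{1\ldots i})} \\ &= \log_2 \frac{1}{P(w_{i+1} \mid w_{1\ldots i})} \end{aligned}$$

since for every surviving analysis $P(T \mid w_{1\ldots i+1}) = P(T \mid w_{1\ldots i}) / P(w_{i+1} \mid w_{1\ldots i})$, making the log ratio a constant. The result is the surprisal of wi+1, whatever the parse representation (hence the representation neutrality above).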
Levy 2006: syntactically constrained contexts
• In many cases, you know that you have to encounter a particular category C
• But you don’t know when you’ll encounter it, or which member of C will actually appear
• Call these syntactically constrained contexts
• In these contexts, the more information related to C you obtain, the sharper your expectations about C generally turn out to be
• Interesting contrast to some non-probabilistic theories that say holding onto the related information is hard
Constrained contexts: final verbs
• Konieczny 2000 looked at reading times at German final verbs
Er hat die Gruppe geführt
  He has the group led (“He led the group”)
Er hat die Gruppe auf den Berg geführt
  He has the group to the mountain led (“He led the group to the mountain”)
Er hat die Gruppe auf den SEHR SCHÖNEN Berg geführt
  He has the group to the VERY BEAUTIFUL mountain led (“He led the group to the very beautiful mountain”)
Surprisal’s predictions

[Slide figure: reading time at the final verb (ms) and the verb’s negative log probability, plotted for the No PP, Short PP, and Long PP conditions of “Er hat die Gruppe (auf den (sehr schönen) Berg) geführt”]
Deriving Konieczny’s results

• Once we’ve seen a PP goal we’re unlikely to see another
• So the expectation of seeing anything else goes up
• For p_i(w), used a PCFG derived empirically from a syntactically annotated corpus of German (the NEGRA treebank)
• Seeing more = having more information
• More information = more accurate expectations (see the toy sketch below)

[Slide diagram: incremental parse of “Er hat die Gruppe auf den Berg geführt”: S over NP (Er), Vfin (hat), and a VP containing NP (die Gruppe) and PP (auf den Berg), with open expectations after the PP: PP-goal? PP-loc? Verb? ADVP?]
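A toy version of the mechanism (the numbers are invented for illustration, not drawn from the NEGRA grammar):

```python
import math

# Toy continuation probabilities after "Er hat die Gruppe":
# the final verb competes with a (single) PP-goal continuation.
before_pp = {"PP-goal": 0.5, "verb": 0.4, "other": 0.1}

# Verb surprisal before any PP is seen:
s_before = -math.log2(before_pp["verb"])        # ~1.32 bits

# After "auf den Berg" the PP-goal slot is filled; the remaining
# probability mass renormalizes over the other continuations:
rest = before_pp["verb"] + before_pp["other"]
s_after = -math.log2(before_pp["verb"] / rest)  # ~0.32 bits

print(s_before, s_after)  # the verb is less surprising after the PP
```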
Facilitative ambiguity and surprisal
• Review of when ambiguity facilitates processing (Traxler et al. 1998; Van Gompel et al. 2001, 2005):

  The daughter_i of the colonel_j who shot himself_*i/j   (harder)
  The daughter_i of the colonel_j who shot herself_i/*j   (harder)
  The son_i of the colonel_j who shot himself_i/j   (easier)
Traditional account: probabilistic serial disambiguation

• Sometimes the reader attaches the RC low... and everything’s OK
• But sometimes the reader attaches the RC high... and the continuation is anomalous
• So we’re seeing garden-pathing ‘some’ of the time

[Slide diagram: NP [NP the daughter] [PP of [NP the colonel]] with the RC “who shot…” attachable low (to the colonel) or high (to the daughter); “himself” is shown under each attachment site]
Surprisal as a parallel alternative
• Assume a generative model where the choice between herself and himself is determined only by the antecedent’s gender

[Slide diagrams: two parses of “the daughter of the colonel who shot…”, one with the RC attached low (to the colonel) and one attached high (to the daughter), each generating an anaphor “self” spelled out as herself or himself]

• Surprisal marginalizes over the possible syntactic structures:

$$p_i(w) = \sum_T p(T)\, p(w \mid T)$$
For “the daughter of the colonel who shot himself”, only the low attachment can generate himself:

$$p_i(\text{himself}) = \underbrace{p(T_\text{RC-low})\,p(\text{self} \mid T_\text{RC-low})}_{x_\text{low}\,y_\text{low}} \cdot 1 \;+\; \underbrace{p(T_\text{RC-high})\,p(\text{self} \mid T_\text{RC-high})}_{x_\text{high}\,y_\text{high}} \cdot 0$$

For “the son of the colonel who shot himself”, both attachments can:

$$p_i(\text{himself}) = x_\text{low}\,y_\text{low} \cdot 1 \;+\; x_\text{high}\,y_\text{high} \cdot 1$$

(The final factor in each term is $p(\text{himself} \mid \text{self}, T)$: 1 or 0 according to whether the attachment site’s antecedent gender allows himself.)
Ambiguity reduces the surprisal
$$p_i(\text{himself} \mid \text{daughter}) = x_\text{high}\,y_\text{high} \cdot 0 + x_\text{low}\,y_\text{low} \cdot 1$$

$$p_i(\text{himself} \mid \text{son}) = x_\text{high}\,y_\text{high} \cdot 1 + x_\text{low}\,y_\text{low} \cdot 1$$

• daughter…who shot… can’t contribute probability mass to himself, but son…who shot… can

$$p_i(\text{himself} \mid \text{daughter}) < p_i(\text{himself} \mid \text{son})$$

so himself carries less surprisal in the ambiguous son case.
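With toy numbers (my own) for the attachment probabilities x and the anaphor probabilities y, the ambiguity advantage comes out directly:

```python
import math

x_high, x_low = 0.6, 0.4   # toy RC-attachment probabilities
y_high = y_low = 0.2       # toy P(reflexive anaphor | attachment)

# daughter: only the low (colonel) attachment can generate "himself"
p_daughter = x_high * y_high * 0 + x_low * y_low * 1   # 0.08
# son: both attachments can generate "himself"
p_son = x_high * y_high * 1 + x_low * y_low * 1        # 0.20

print(-math.log2(p_daughter))  # ~3.64 bits
print(-math.log2(p_son))       # ~2.32 bits: lower surprisal when ambiguous
```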
Ambiguity/surprisal conclusion
• Cases where ambiguity reduces difficulty aren’t problematic for parallel constraint satisfaction
  • Although they are problematic for competition
• Attributing difficulty to surprisal rather than competition is a satisfactory revision of constraint-based theories
Surprisal and garden paths: theory
• Revisiting the horse raced past the barn fell
• After the horse raced past the barn, assume 2 parses (main-verb vs. reduced-relative):
• Jurafsky 1996 estimated the probability ratio of these parses as 82:1
• The surprisal differential of fell in the reduced versus unreduced conditions should thus be log2 83 ≈ 6.4 bits
*(assuming independence between RC reduction and main verb)
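Spelling out the arithmetic (a check of the slide’s numbers, using the footnoted independence assumption): in the unreduced condition the RR analysis is certain, while in the reduced condition it carries only 1/(82+1) of the prefix probability mass, so

$$\log_2 \frac{1}{\tfrac{1}{83}\,P(\text{fell} \mid \text{RR})} - \log_2 \frac{1}{P(\text{fell} \mid \text{RR})} = \log_2 83 \approx 6.4 \text{ bits}$$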
Surprisal and garden paths: practice
• An unlexicalized PCFG (from the Brown corpus) gets the right monotonicity of surprisals at the disambiguating word “fell”
• But there are some unwanted results too
[Slide figure: per-word surprisals; annotations: the difference at “fell” is right but small; the surprisals at “raced” are way too high]
Surprisal and garden paths
• raced has high surprisal because the grammar is unlexicalized – no connection with horse
  • Unfortunately, lexicalization in practice wouldn’t help: race as a verb never co-occurs with horse in the Penn Treebank!
• surprisal differential at fell is small for the same reason
  • failure to account for the lexical preferences of raced means that the probability of the RR alternative is likely overestimated
• Is surprisal a plausible source of explanation for the most dramatic garden-path effects? Still seems unclear.
Surprisal summary
• Motivation: expectations affect processing
  • When people encounter something unexpected, they are surprised
  • This translates into slower reading (= processing difficulty?)
• This intuition can be captured and formalized using tools from probability theory, information theory, and statistical NLP
Tomorrow
• Other information-theoretic approaches to on-line sentence processing
• Brief look at connectionist approaches to sentence processing
• General discussion & course wrap-up