TRANSCRIPT
From Performance to Competence
Language evolution, language change, and evidence that language is shaped by usage
EVELIN 2012, Psycholinguistics and Grammar, Lecture 1
Florian Jaeger, Human Language Processing Lab
http://www.hlp.rochester.edu/
Please interrupt when you have questions
[2]
Email list
• Please subscribe to our email list
https://groups.google.com/group/evelin2012-psycholinguistic/subscribe
All updates will be sent to this address rather than the previous email list I used.
[3]
Goal of this class
• Present and discuss data and insights from psycholinguistics that should be of relevance to linguistics, and in doing so give an introduction to a few current views and issues in psycholinguistics.
• Our entry point will be language use, which by some is treated as orthogonal to the study of linguistics (e.g. certain approaches in generative grammar), whereas others consider it pivotal (e.g. so-called functionalist linguistics)
[4]
Language use (performance) & Grammar (competence)?
• Competence vs. performance:
– Ontologically (i.e. by definition) such a distinction can be drawn
– But we can ask another question, and in some sense it's a personal question: Is this distinction, in the definition offered by others, useful or even relevant to (your) research? Is it productive in that it generates predictions that can be investigated? Etc.
• An old claim (let's call it the functionalist hypothesis, even though it's older): Grammar cannot be studied without reference to language use. [e.g. Bates & MacWhinney 82, 89; Bybee 01, 02; Givón 91, 92, 01; Hawkins 94, 01, 02, 04, 07; Hockett 60; Langacker 91, 92; Slobin 73]
[5]
If so …
'Grammar' cannot be understood without psycholinguistics and sociolinguistics as the studies of language use
[6]
What is meant by this?
• What are the properties of 'grammar' in need of explanation?
– Language acquisition
• The mere fact that it is possible (cf. poverty of the stimulus, subset problem)
• The time course of language acquisition
– Patterns of language change
– Typological distributions ('Universals')
• Two interpretations of the functionalist hypothesis:
– Strong: All of the above properties can be explained without reference to arbitrary linguistic biases/rules/constraints.
– Weak: Some/many of the above properties can be explained without reference to arbitrary linguistic biases/rules/constraints.
[7]
Types of universals
• Categorical
– Absolute: Every/No language has property X
– Implicational: Every language that has property X also has property Y
• Gradient/Statistical
– E.g. word order preferences:
[based on Tomlin 1986]
– E.g. Every language that has property X tends to also have property Y
• I'll use the term 'universal' for all of these, for convenience's sake.
[8]
The functionalist hypothesis for typological generalizations
• Functional pressures on language change: Grammatical properties may be observed more often across languages because they improve a language's 'utility'. [e.g. Bates & MacWhinney 82, 89; Bybee 01, 02; Christiansen & Chater 08; Croft 04; Givón 91, 92, 01; Hawkins 94, 01, 02, 04, 07; Hockett 60; Langacker 91; Slobin 73; Zipf 49]
[9]
Biological vs. cultural evolution
Biological evolution
Cultural evolution/language change ‘transmission’
[10]
[taken from Nowak et al. 2002‐Nature]
The functionalist hypothesis for typological generalizations
• Functional pressures on language change: Grammatical properties may be observed more often across languages because they improve a language's 'utility'. [e.g. Bates & MacWhinney 82, 89; Bybee 01, 02; Christiansen & Chater 08; Croft 04; Givón 91, 92, 01; Hawkins 94, 01, 02, 04, 07; Hockett 60; Langacker 91; Slobin 73; Zipf 49]
• This is an intriguing possibility, as it promises to reduce the number of cognitively arbitrary (= language-specific) properties of language that we need to explain.
[11]
Challenges
1. 'Transmission problem': Where do hypothesized pressures operate? That is, how would such pressures come to shape language over time?
– Biases on language acquisition, changing the structures acquired by the next generation [Lecture 2]
– Biases operating throughout adult life that change the output provided to the next generation (this would imply linguistic malleability throughout life) [Lectures 3, 4]
[12]
Challenges
2. What is 'utility'? What is good? [Lectures 1 & 4]
– Learnability [cf. Deacon 98; Slobin 76; Newport 81; Christiansen & Chater 08]
– Ease of processing, e.g. minimization of memory cost [cf. Gildea & Temperley 08; Hawkins 94, 01, 02, 04, 07, 09; Levy 05; Tily 11]
– Trade-off between production and comprehension effort [Zipf 1935, 1949; Levy & Jaeger 2007]
– Efficient and robust communication [cf. Aylett & Turk 04; Ferrer i Cancho 05, 07, 10; Genzel & Charniak 02, 03; Jaeger 06, 10; Levy & Jaeger 07; Piantadosi, Tily & Gibson 11; Qian & Jaeger 09, 10, submitted]
We need to define utility based on clear principles that are validated against psycholinguistic (and sociolinguistic) data
[13]
[Hawkins, 2004]
[14]
[15]
Plan
• Today:
– Crash course on word recognition and some aspects of sentence processing
– Typological and diachronic evidence that languages across the world are shaped by general learning biases, processing preferences, and communicative pressures. [Gildea and Temperley, 2010; Hawkins, 2004; Manin, 2006; Piantadosi et al., 2011a,b; Zipf, 1935, 1949]
– Introduction to fundamentals of information theory
[16]
Readings
• Required:
– Pinker (2000) – 2 pp
– Jaeger and Tily (2011) – 7 pp
[17]
Part 2
Let's start with a well-known fact about the mental lexicon
[18]
The frequency ~ length correlation
German
[19]
[taken from Zipf 1935:23; based on Kaeding 1928]
American English
[20]
[taken from Zipf 1935:28]
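The frequency ~ length correlation itself is easy to reproduce mechanically. The sketch below uses invented frequency counts (not Kaeding's or Zipf's data) purely to illustrate how one would measure the correlation between word length and log frequency:

```python
import math

# Invented frequency counts, for illustration only:
# frequent words tend to be short.
counts = {"the": 5000, "of": 3500, "about": 800, "between": 300,
          "understand": 120, "nevertheless": 25}
x = [len(w) for w in counts]                # word length (letters)
y = [math.log(c) for c in counts.values()]  # log frequency

# Pearson correlation, computed by hand
n = len(x)
mx, my = sum(x) / n, sum(y) / n
cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
r = cov / math.sqrt(sum((a - mx) ** 2 for a in x) *
                    sum((b - my) ** 2 for b in y))
# r comes out strongly negative for these toy numbers
```

With real corpus counts, as in Zipf's plots above, the negative correlation is weaker but robust.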
The Principle of Least Effort [Zipf 1949:1]
In simple terms, the Principle of Least Effort means, for example, that a person in solving his immediate problems will view these against the background of his probable future problems as estimated by himself. […] The person will strive to minimize the probable average rate of his work-expenditure (over time).
[emphasis in original; Zipf attributes the roots of similar ideas to Maupertuis in the 18th century]
[21]
[22]
Two opposing forces
• Prior to considering that language use might (among other things) serve to communicate:
– Speaker economy (force of unification): map all meanings onto the same (short) word /ə/
– Hearer economy (force of diversification): map each meaning onto a different word
These two forces together are assumed to affect (through diachronic change) the structure of the mental lexicon.
[23]
• Zipf's insights and speculations preceded two important events that are crucial to his general idea:
– The formulation of information theory (Shannon, 1948)
– The rise of modern psycholinguistics (1960s): e.g. What makes a word easy to produce or recognize?
[24]
Part 3
Information theory (light)
Information theory [Shannon, 1948]
[26]
Information
• Shannon information, for example, of a word:
I(w) = log[ 1 / p(w) ] = −log p(w)
– Log often taken to base 2; the unit of information is then 'bits'
• Intuitive properties:
– 0 bits of new information if something is perfectly predictable (cf. surprisal)
– More new information, the less predictable something is
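The definition can be computed directly; `information_bits` is a hypothetical helper name, added here as a minimal sketch of the formula above:

```python
import math

def information_bits(p):
    """Shannon information I(w) = -log2 p(w), in bits."""
    return -math.log2(p)

# A perfectly predictable word carries no new information:
assert information_bits(1.0) == 0.0
# The less predictable, the more bits of new information:
assert information_bits(0.5) == 1.0    # one fair coin flip's worth
assert information_bits(0.125) == 3.0  # p = 1/8 -> 3 bits
```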
[27]
Communication through a noisy channel
[Figure 1 from Shannon 1949]
[28]
Shannon's noisy channel theorem (for mere mortals like us)
[Shannon, 1948; Wolfowitz, 1961]
• For any noisy digital channel with capacity C > 0 and any rate of information transmission 0 < R < C, there is a finite sequence of n codes (a finite language with 1 ≤ n < ∞ words) such that
– The number of possible words increases exponentially with the maximum length l < n
– The error probability decreases exponentially with n
– If R < C, it is possible to communicate at an arbitrarily low error rate
• The converse holds, too: for any R > C, the error probability will converge to 1 the larger the codeword vocabulary.
[29]
Noisy Channel Theorem
• The channel capacity defines the maximal rate of information per time step/sent signal that allows communication at an arbitrarily low error rate.
• An optimal code then transmits information at an average rate close to, but not exceeding, the channel capacity.
– Constant Entropy Rate [Genzel and Charniak, 2002]
– Smooth Signal Redundancy [Aylett and Turk, 2004]
– Uniform Information Density [Jaeger, 2006; Levy and Jaeger, 2007]
[30]
Part 4
Some background on word recognition
Spoken word recognition
The problem: Spoken words unfold as a series of transient acoustic events extending over a few hundred ms, without reliable cues to word boundaries.
Imagine reading this page through a two-letter aperture, the text scrolling past without spaces separating the words, at a variable rate one could not control, with the visual features for each letter arriving asynchronously.
We don't say what we hear
• Most of what we hear isn't really in the speech stream, or at least not in the sequential order that one might think
[Figure 2 from Johnson, 2004]
[34]
Variability (noise) in production
• Even when phonemes are realized, they don't map deterministically onto acoustic dimensions. The acoustic signal created by speakers maps linguistic categories probabilistically onto acoustic dimensions (e.g. energy distributions over frequencies).
[from Fox and Vaughn, in prep]
35
Perceptual noise
• The human nervous system is a biological system and as such it exhibits noisy responses to stimuli. Consider, for example, a neuron's response rate to its preferred stimulus (e.g. a line with a given spatial orientation):
[Figure: neural activity (one trial) around the preferred stimulus, illustrating noise at the neural level]
36
An illustration: Noisy visual input during reading
Over time
Speech perception: A challenging task!
• Listeners have to extract features and phonemes out of the speech stream
• Perception is noisy
• Lack of invariance [cf. Lecture 3]
– for segments and for speakers
– Sounds can be affected by phonetic context, syllabic stress, prosody & intonation, speaking rate, and emotional state
• So, how do we do it?
Integration of bottom-up noisy signal with top-down expectations
[Rumelhart] [McClelland]
[46]
[Rumelhart and McClelland, 1981; this is a model of visual word recognition, but similar models have been proposed for spoken word recognition]
Sine Wave Speech
Tones 1 and 2
Tones 1 and 3
Tones 2 and 3
All three
Original speech
Source: http://www.haskins.yale.edu/haskins/misc/SWS/tonecombo.html
Two types of top-down knowledge (illustrated for visual word recognition)
Knowledge of words captures degraded or partial input (word superiority):
• The 'word superiority' effect: we are faster at recognizing letters in words than in non-words. In online processing, this effect is present, too.
• The 'pronounceable non-word' effect: we are faster to recognize orthographically licensed forms.
How can one study spoken word recognition?
Eye camera
Scene camera
"Pick up the beaker"
[Allopenna et al., 1998]
Modern eye-trackers
[50]
[Figure: proportion of fixations over time (one sample every 200 ms), trials 1–5. Instructions: "Look at the cross. Click on the beaker." Conditions: Target = beaker; Cohort = beetle; Unrelated = carriage]
Results
?
Another top-down effect: Frequency effects
• Evidence for stronger activation of more frequent words in lexical processing in speech (Dahan et al., 2001)
53
Back to Zipf
• We are biased towards expecting more frequent words
• Think about the fact that frequent words are recognized faster even when they are equally long.
• Not only does it save production effort to have phonological words shorter (as Zipf speculated), it also seems that less bottom-up signal is required to understand them!
• Interestingly, more frequent words that have the same number of phonemes still tend to be pronounced with shorter duration in speech. [e.g. Gahl, 2008]
[54]
More top-down knowledge: contextual predictability
eat + (something edible)
"I'm gonna eat the …"
Surprisal
• Hale (2001) proposed that a word's complexity in sentence comprehension is determined by its surprisal, an old measure borrowed from information theory (it's the same as Shannon information)
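A toy sketch of surprisal in context (the next-word distribution below is invented for illustration): after "I'm gonna eat the …", edible continuations are far more probable, so they carry fewer bits than an implausible one:

```python
import math

# Invented next-word probabilities after "I'm gonna eat the ...";
# the remaining probability mass goes to other continuations.
p_next = {"cake": 0.4, "sandwich": 0.3, "apple": 0.2, "piano": 0.001}

def surprisal(word, dist):
    """Surprisal = -log2 p(word | context), in bits."""
    return -math.log2(dist[word])

# Edible continuations are expected and carry few bits;
# "piano" is a roughly 10-bit surprise.
assert surprisal("cake", p_next) < 2
assert surprisal("piano", p_next) > 9
```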
56
57
[from Smith and Levy, in prep]
58
More top-down expectations
• In addition to general lexical knowledge, frequency, and contextual predictability, we also are incredibly good at taking into consideration the current visual context and what we think the speaker knows.
Listeners overcome the problem of noisy perception of non-invariant input by rapidly integrating the noisy bottom-up signal with top-down information from many information sources.
This is quite compatible with Zipf's observations, although he, of course, did not know about all of these facts about word recognition (he postulated his laws in 1949!).
[59]
Part 5
Re-visiting Zipf with information theory & psycholinguistics in mind
[60]
Re-thinking Zipf [Piantadosi, Tily, and Gibson, 2011]
• If spoken word recognition is sensitive to what is contextually expected, perhaps contextual expectations, rather than frequency, are what (over generations) determines the phonological length of words (i.e. the amount of bottom-up signal)
• The average information a word w carries in its different contexts C is:
I(w) = −Σ_C p(C|w) log p(w|C)
– p(w|C): how probable is the word given the context?
– p(C|w): how probable is the context given the word? Sum (marginalize!) over contexts
which can be estimated from a corpus as:
Î(w) = −(1/N) Σ_{i=1..N} log p(w|C_i)
[61]
Result for English
[62]
[Figure 2 from Piantadosi et al., 2011]
Bigram results for all 11 languages
[63]
[part of Figure 1 from Piantadosi et al., 2011]
[Figure 1 from Piantadosi et al., 2011]
• Comparing the results for different n in the ngram model
[64]
One more example: Russian
[Figure 1 and 2 from Manin, 2006]
• Manin uses human judgments from a cloze‐like task to estimate information (or unpredictability)
[65]
Take-home points (1)
• Bottom-up input in language processing is noisy (the output of production is variable, and perception itself is a noisy process)
• Listeners overcome the challenges of spoken word recognition by relying on probabilistic cues (top-down knowledge about the language, the current context, etc.)
• Words that are hard to process because they are on average unexpected (have high information) tend to be phonologically longer.
[66]
[from http://roosterteeth.com/comics/strip.php; thx to Nurit Melnik]
[67]
Part 6
Beyond the lexicon: An example from sentence processing
[68]
Memory and dependency length [cf. Jaeger and Tily, 2011 for a concise overview]
• Similar to word recognition, sentence processing is sensitive to probabilities: structures that are less expected in the context take longer to process. [e.g. MacDonald and Shillcock, 2001; Kamide et al., 2003; Staub and Clifton, 2006; Levy, 2008; Smith and Levy, 2009]
• Syntactic processing is also sensitive to memory demands: longer dependencies take longer to process [e.g. Gibson 1998, 2000; Gibson and Grodner, 2005; Lewis et al., 2006; Vasishth et al., 2005]
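A common operationalization of this notion, sketched below on an invented toy parse (not from a treebank), is to sum the linear distance between each word and its head:

```python
def total_dependency_length(arcs):
    """Sum of linear distances between each dependent and its head.
    arcs: (dependent_index, head_index) pairs, 0-indexed word positions."""
    return sum(abs(dep - head) for dep, head in arcs)

# Toy dependency parse (my own example) of
# "The reporter disliked the editor":
#   the->reporter, reporter->disliked, the->editor, editor->disliked
arcs = [(0, 1), (1, 2), (3, 4), (4, 2)]
assert total_dependency_length(arcs) == 5  # 1 + 1 + 1 + 2
```

Word orders that keep this sum small are, on the memory-cost view, easier to process.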
[69]
Memory and dependency length
[Figure 1 from Jaeger and Tily, 2011]
[from Tily, 2012]
[70]
How do psycholinguists study effects of dependency length?
• Self-paced reading
– Word-by-word reading time between button presses
– Comprehension accuracy in questions displayed after the sentence
• Stop-making-sense task
– Like self-paced reading, but critical sentences are ungrammatical and we measure how long it takes before subjects press "n": nope, not a sentence anymore.
• Eye-tracking reading
– First-pass fixations
– Fixation durations
– Proportion of regressive saccades
[71]
An example finding [Figure 4 from Gibson, 1998]
• Confirmed in many independent studies, though there are still questions about whether the result can be reduced to expectation-based processing [e.g. Wells et al., 2009; MacDonald]
[80]
Is this processing preference reflected in grammar?
• So, do languages tend to keep dependencies short?
• This question was recently addressed for English and German by Gildea and Temperley (2010)
[81]
[ Table 1 from Gildea and Temperley, 2010]
Why is German not as optimal as English in terms of dependencies?
• It could be the availability of case, another cue to the underlying structure of the sentence (which reduces memory costs)
• Intriguing evidence comes from Tily (2011), who studies the development of dependency length from Old English, which had case, to Modern English, which does not.
[82]
[from Tily, 2012]
Wrapping up
[83]
Take-home point (2)
• Given that language has the properties expected if humans and languages evolved to transfer information efficiently, and given that these properties are unlikely to be due to chance (see e.g. Ferrer i Cancho and Solé, 2003 on Zipf's law), this provides tentative support for the idea that functional pressures shape language over time.
• But so far we’ve only seen correlations. How would processing and communicative biases come to affect language over time?
• Tomorrow we'll see how one can test more directly how processing and communicative biases could come to affect language over generations by affecting language acquisition.
[84]
How to estimate the average information of a word?
• Piantadosi et al. (2011) use ngram models with smoothing (n = 2, 3, 4) based on the Google ngram corpus, available for 11 languages.
– The large size of the Google ngram corpus is important to obtain reliable estimates of the average information a word carries in context.
• These estimates are then regressed against word length.
[85]
How is information estimated from a corpus?
• As Shannon information is defined with reference to probability, we need to estimate the probability of words in order to estimate their information.
• So-called ngram models provide a simple way that is frequently employed to derive probability estimates from a collection of speech or writing.
• So, let's assume we have such a corpus.
[86]
A very small corpus:
Over the last two decades, cognitive science has undergone a paradigm shift towards probabilistic models of the brain and cognition. Many aspects of human cognition are now understood in terms of rational use of available information in the light of uncertainty (e.g. models in memory, categorization, generalization and concept learning, visual inference, motor planning). Building on a long tradition of computational models for language, such rational models have also been proposed for language processing and acquisition. This class provides an overview of the newly emerging field of computational psycholinguistics, which combines insights and methods from linguistic theory, natural language processing, machine learning, psycholinguistics, and cognitive science into the study of how we understand and produce language. There has been a surge in work in this area, which is attracting scholars from many disciplines. The goal of this class is to provide students with enough background to start their own research in computational psycholinguistics.
• Now let's extract the bigrams from this text. That's really just the list of all two-word sequences in the above text, followed by how often they occur.
[87]
From bigrams to Shannon information
this class 2
over the 1
the last 1
last two 1
this area 1
…
• (For a neat tool that lets you estimate bigrams based on the Brown corpus, see http://word.snu.ac.kr/ngram/)
• E.g. the word this is followed 2 out of 3 times by the word class. Hence, our best (maximum likelihood) estimate of p(class | this) = 2/3.
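The same maximum-likelihood estimate can be computed mechanically. The mini-corpus below is invented so that, as in the text above, "this" is followed by "class" in 2 out of 3 cases:

```python
from collections import Counter

def bigram_probs(tokens):
    """Maximum-likelihood estimates p(w2 | w1) = c(w1, w2) / c(w1)."""
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    context_counts = Counter(tokens[:-1])
    return {(w1, w2): c / context_counts[w1]
            for (w1, w2), c in bigram_counts.items()}

# "this" occurs 3 times, followed twice by "class":
tokens = "this class is fun this class rocks this year".split()
p = bigram_probs(tokens)
assert abs(p[("this", "class")] - 2 / 3) < 1e-9
```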
[88]
Getting the average information of a word in context
• In our sample text, the word class only occurred twice, each time preceded by this. Recall that the average information of a word in context based on a corpus is calculated as
Î(w) = −(1/N) Σ_{i=1..N} log p(w|C_i)
where C is the context (here simply the preceding word) and N is the number of different contexts (here 2, since class occurs twice in the corpus).
• Hence, the average information of class given the preceding word in our sample is: −1/2 (log2 2/3 + log2 2/3) = −1/2 · log2 4/9 ≈ 0.58 bits of information that class on average provides.
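The worked example can be checked in code. The sketch below uses an invented mini-corpus with the same structure (the word "class" occurs twice, both times preceded by "this", with p(class | this) = 2/3):

```python
import math
from collections import Counter

def avg_information(word, tokens):
    """-(1/N) * sum of log2 p(word | preceding word), over the N
    occurrences of `word` in the token sequence."""
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    context_counts = Counter(tokens[:-1])
    occurrences = [(prev, w) for prev, w in zip(tokens, tokens[1:])
                   if w == word]
    info = sum(-math.log2(bigram_counts[bg] / context_counts[bg[0]])
               for bg in occurrences)
    return info / len(occurrences)

# Both occurrences of "class" follow "this", and p(class | this) = 2/3,
# so the average is -log2(2/3), roughly 0.58 bits:
tokens = "this class is fun this class rocks this year".split()
assert abs(avg_information("class", tokens) - 0.58) < 0.01
```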
[89]