
Page 1:

Probabilistic models in phonology

John Goldsmith

LSA Institute 2003

Page 2:

Aims for today

1. Explain the basics of a probabilistic model: notion of a distribution, etc.

2. List some possible uses for such a model

3. Explore English data, perhaps French, Japanese, Dutch

4. Compare the computation of probability with an OT tableau’s computation

5. Time permitting: dealing with alternations

Page 3:

Message not to take away

• The message is not that we don’t need structure in phonology, or that numerical methods can replace the use of phonological structure.

• The message is that there is a great deal of information present even in the simplest of structures (here, linear sequences) – and that we should not posit new structure until we have deduced all of the theoretical consequences of the structure we have already posited.

Page 4:

Probabilistic model in [vanilla] linguistics

• Model the phonotactics of a language as a function assigning to each word its plog probability

• Solve (part of the) problem of acquisition of phonological knowledge

• Calculate the plog prob of the entire data, so that one model’s performance can be compared with another model’s (or theory’s).

• That is so important that I’ll repeat it: the probabilistic point of view tells us that the goal of our analysis is to maximize the probability of the observed data -- not to deal with special cases.

Page 5:

Possible applications

1. Nativization moves a word towards higher probability (lower plog).

2. Conjecture: phonological alternations move words towards lower plog (“harmonic rule application”).

3. Russian hypocoristics (Svetlana Soglasnova): lower average plog compared to base name.

Page 6:

4. Cross-linguistic language-identification

5. [Vowel harmony]

6. Optimality theory and probabilistic models: twins separated at birth

Page 7:

Basics of probability

1. Sample space
2. Distribution
3. Probability and frequency
4. Unigram model: prob(string) as product of the probs of individual phones
5. Log prob, plog prob
6. Conditional probability
7. Mutual information
8. Average plog = entropy
9. Maximize probability of entire corpus (or minimize plog or entropy – same thing, in these cases)
10. Bayes’ Law [or rule]
11. Back-off models

Page 8:

Observations

• I will assume that we have a body of observations that we are analyzing. We can count things in it, and we want to compute its probability in various ways.

• There isn’t just one way to compute the probability; the probability isn’t something that lies in the object: it is a function of the model that we establish.

Page 9:

Sample space

• Is our universe of possibilities.

• In our case, we assume for today that we have an alphabet A, and our main sample space is the set of all strings from A, $A^+$.

• We also assume that # is in A, and all strings in our sample space begin and end with #.

Page 10:

Distribution

• A distribution is a set of non-negative numbers that sum to 1.

• We require that there be a function on the members of our sample space (“prob”), and the values of prob, as we go through all the elements of the sample space, must form a distribution.

• Think of it as if there were a kilo of stuff called probability mass that is being divided up over all the members of the sample space.

Page 11:

Sums and products ofdistributions

• We won’t have the time to get into this now in any detail, but the following points are extremely important if you want to develop these models:– The product of two distributions is a distribution.– The weighted average of two distributions is a

distribution: if D1 and D2 are distributions, then x.D1 + (1-x).D2 is also a distribution. (Here “x” is the amount of the probability mass distributed by D1, and “1-x” is the amount distributed by D2).

Page 12:

Alphabet

• We begin by recognizing that we will need a distribution over the letters of our alphabet.

• Let us use the letters’ frequencies as their probabilities. A letter’s frequency is its count in the corpus divided by the total number of letters in the corpus.

• Think about that….
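Here is a minimal sketch (not from the slides) of what that amounts to in Python; the toy corpus and variable names are hypothetical:

```python
# Estimate a distribution over an alphabet from a corpus: each letter's
# probability is its count divided by the total number of letters.
from collections import Counter

corpus = "#dog##dolly##cat#"   # '#' marks word boundaries, as in the slides
counts = Counter(corpus)
total = sum(counts.values())
prob = {letter: n / total for letter, n in counts.items()}

assert abs(sum(prob.values()) - 1.0) < 1e-9   # the values form a distribution
print(prob['d'])                              # relative frequency of 'd'
```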

Page 13:

Probability of a string

• Simplest model is called the unigram model; it takes phonemes into consideration, but not their order.

W = dog; W[1] = ‘d’; W[2] = ‘o’; W[3] = ‘g’;

$\mathrm{Prob}(W) = pr(d) \times pr(o) \times pr(g) = \prod_{i=1}^{|W|} pr(W[i])$

Page 14:

Log prob

• It’s much easier to think in terms of log probabilities:

$\log \mathrm{prob}(W) = \sum_{i=1}^{|W|} \log pr(W[i])$

• Better yet, calculate plogs (multiply by -1):

$\mathrm{plog}(W) = -\sum_{i=1}^{|W|} \log pr(W[i])$
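A sketch of both computations, with made-up letter probabilities; plogs are taken base 2, so they come out in bits:

```python
# Unigram model: Prob(W) is the product of the letters' probabilities,
# and plog(W) is the positive sum of their plogs.
import math

prob = {'d': 0.04, 'o': 0.08, 'g': 0.02}   # hypothetical letter probabilities

def unigram_prob(word):
    p = 1.0
    for letter in word:
        p *= prob[letter]
    return p

def plog(word):
    return -sum(math.log2(prob[letter]) for letter in word)

# plog(W) = -log2 Prob(W): the two formulations agree.
assert math.isclose(plog("dog"), -math.log2(unigram_prob("dog")))
```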

Page 15:

The joy of logarithms

• Probabilities get too small too fast.
• Logs help.
• Log probabilities are very big and negative.
• We make them positive.
• Positive log probabilities.

More standard notation: $\tilde{p}$

Page 16:

Page 17:

Base 2 logs

• If $2^x = y$, then $\log_2(y) = x$.

• $2^3 = 8$, so $\log(8) = 3$;

• $2^{-3} = 1/8$, so $\log(1/8) = -3$.

• $\log(ab) = \log(a) + \log(b)$

• $\log(a/b) = \log(a) - \log(b)$

• $\log(x) = -\log(1/x)$

• $\log(a^n) = n \log(a)$
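A quick numeric check of these identities with Python’s math.log2:

```python
import math

a, b, n = 8.0, 2.0, 3
assert math.log2(8) == 3 and math.log2(1 / 8) == -3
assert math.isclose(math.log2(a * b), math.log2(a) + math.log2(b))
assert math.isclose(math.log2(a / b), math.log2(a) - math.log2(b))
assert math.isclose(math.log2(a), -math.log2(1 / a))
assert math.isclose(math.log2(a ** n), n * math.log2(a))
```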

Page 18:

• It's important to be comfortable with notation, so that you see easily that the preceding equation can be written as follows, where the left side uses the capital pi to indicate products, and the right side uses a capital sigma to indicate sums:

$\log \prod_{i=1}^{4} \mathrm{prob}(S[i]) = \sum_{i=1}^{4} \log \mathrm{prob}(S[i])$

Page 19:

• Let’s look at them…

Page 20:

Conditional probability: Bigram model

• If there is a dependency between two phenomena, we can better describe their occurrences if our models take that dependency into account. Effects in language are primarily local…

• Compute the probability of each phoneme, conditioned on the preceding phoneme.

Page 21:

Definition: …AB…

• The conditional probability of B, given A, is the probability of AB occurring together, divided by the probability of A. (It’s as if the sample space temporarily shrank down to the universe of events compatible with A).

Page 22:

In the case of strings…

Pr ( W[i] = L1 | W[i-1] = L2 )

This says: the probability that the ith letter is L1, given that the i-1th letter is L2.

$\Pr(W[i] = L_1 \mid W[i-1] = L_2) = \frac{\Pr(L_2 L_1)}{\Pr(L_2)} = \frac{Ct(L_2 L_1)/N}{Ct(L_2)/N} = \frac{Ct(L_2 L_1)}{Ct(L_2)}$

(where Ct is the count in the corpus, and N is the size of the corpus)
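A minimal sketch of estimating these conditional probabilities from counts, on a hypothetical toy corpus:

```python
# Bigram estimation: Pr(W[i] = l1 | W[i-1] = l2) = Ct(l2 l1) / Ct(l2).
from collections import Counter

corpus_words = ["#dog#", "#dolly#", "#cat#"]
unigram_counts = Counter()
bigram_counts = Counter()
for w in corpus_words:
    unigram_counts.update(w)                # letter counts
    bigram_counts.update(zip(w, w[1:]))     # adjacent-letter pair counts

def cond_prob(l1, l2):
    # Simplification: Ct(l2) counts all occurrences of l2, including
    # word-final ones that start no bigram.
    return bigram_counts[(l2, l1)] / unigram_counts[l2]

print(cond_prob('o', 'd'))   # probability that 'd' is followed by 'o'
```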

Page 23:

• We’ll take logs of that in a moment. But first, mutual information.

• The mutual information between two letters is the log of the ratio of the observed frequency of the pair to the frequency we would predict if the data had no structure, namely the product of the frequencies of the letters:

Page 24:

$MI(X,Y) = \log \frac{freq(XY)}{freq(X)\, freq(Y)} = \log freq(XY) - \log freq(X) - \log freq(Y)$

This is a measure of the stickiness between X and Y in the data: they can attract (MI>0) or repel (MI < 0).
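A sketch of the computation, with made-up frequencies:

```python
# Pointwise mutual information between letters X and Y:
# MI(X, Y) = log2( freq(XY) / (freq(X) * freq(Y)) ).
import math

def mutual_information(freq_xy, freq_x, freq_y):
    """Positive when X and Y attract, negative when they repel."""
    return math.log2(freq_xy / (freq_x * freq_y))

# A pair observed four times more often than independence would predict:
print(mutual_information(freq_xy=0.02, freq_x=0.05, freq_y=0.1))   # 2.0 bits
```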

Page 25:

Back to bigram model

• The bigram model, with its particular conditional probability, is just the sum (using logs) of the unigram model plus the mutual information:

$pr(B \mid A) = pr(AB) / pr(A)$

$\log pr(B \mid A) = \log pr(AB) - \log pr(A)$

$MI(A,B) = \log \frac{pr(AB)}{pr(A)\, pr(B)}$, so

$\log pr(AB) = MI(A,B) + \log pr(A) + \log pr(B)$

… so $\log pr(B \mid A) = MI(A,B) + \log pr(A) + \log pr(B) - \log pr(A) = MI(A,B) + \log pr(B)$
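A numeric check of the identity, with made-up probabilities for a single bigram AB:

```python
# Verify log pr(B|A) = MI(A,B) + log pr(B).
import math

pr_a, pr_b, pr_ab = 0.05, 0.1, 0.02          # hypothetical values
log_cond = math.log2(pr_ab / pr_a)           # log pr(B | A)
mi = math.log2(pr_ab / (pr_a * pr_b))        # MI(A, B)
assert math.isclose(log_cond, mi + math.log2(pr_b))
```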

Page 26:

Mutual information

• …is the measure of the structural stickiness between pieces.

Important question: in a given set of data, how much does the quality of the analysis (which is the plog you’ve calculated) depend on a given MI? The answer is: the number of times that bigram occurred, times the MI. Let’s call that the weighted mutual information.

Page 27:

Bayes’ Rule:

The relationship between
• pr(A | B), "the probability of A given B", and
• pr(B | A), "the probability of B given A":

$pr(A \mid B) = \frac{pr(B \mid A)\, pr(A)}{pr(B)}$

Page 28:

$pr(A \text{ and } B) = pr(A \mid B) \cdot pr(B)$

$pr(A \text{ and } B) = pr(B \mid A) \cdot pr(A)$

so $pr(A \mid B) \cdot pr(B) = pr(B \mid A) \cdot pr(A)$, and therefore:

$pr(A \mid B) = \frac{pr(B \mid A)\, pr(A)}{pr(B)}$
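A toy numeric check (the probabilities are made up; pr(B) comes from total probability):

```python
pr_A = 0.3
pr_B_given_A = 0.5
pr_B_given_notA = 0.2
pr_B = pr_B_given_A * pr_A + pr_B_given_notA * (1 - pr_A)   # = 0.29
pr_A_given_B = pr_B_given_A * pr_A / pr_B                   # Bayes' rule
print(pr_A_given_B)                                         # 0.15 / 0.29 ≈ 0.517
```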

Page 29:

Plogs form a vector

• It is extremely helpful to be able to think of (and visualize) the ~26 plog parameters of our language as a vector, an arrow pointing from the origin to a point in a 26-dimensional space whose coordinates are the parameter values:

(plog(a), plog(b), plog(c), … , plog(z) )

Same thing for the mutual information (MI) parameters, living in a bigger space.

Page 30:

Vector

• The phonotactics of the language as a whole are thought of as a fixed vector.

• Each individual word is also a vector, whose coordinates count the pieces of the word. “Dolly” has a 1 in the d’s place, a “1” in the o’s place, a “2” in the l’s place, etc.

• Then the plog of “dolly” is just the inner product of these two vectors:

Page 31:

Inner product

• Remember, the inner product of 2 vectors is the sum of the products of their respective coordinates…

• (2, 3, 1) · (0.22, 0.1, −0.5) = 2 × 0.22 + 3 × 0.1 + 1 × (−0.5) = 0.24

It’s also directly connected to the angle between the vectors: the inner product is the product of the lengths of the 2 vectors times the cosine of the angle between them.
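A sketch of the plog-as-inner-product computation, with made-up plog parameters:

```python
# plog of a word = (letter-count vector of the word) · (plog vector of the
# language). The plog values below are hypothetical.
from collections import Counter

plog_params = {'d': 3.2, 'o': 2.9, 'l': 3.5, 'y': 4.1}

def word_plog(word):
    counts = Counter(word)                   # "dolly": d:1, o:1, l:2, y:1
    return sum(n * plog_params[c] for c, n in counts.items())

print(word_plog("dolly"))   # 3.2 + 2.9 + 2*3.5 + 4.1 = 17.2
```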

Page 32:

This is a well-understood domain…

• The math is simple and well-understood.

Page 33:

OT central ideas

1. A language provides a ranking of a set of constraints

2. Candidate representations are checked against each constraint to count the number of violations present

3. A well-known algorithm chooses the winning candidate, given the information in (1) and (2)

Page 34:

Revisions

• Constraints are assigned a value: the plog of their violations in the corpus. This defines the vector L of the language.

• Candidate representations are checked against each constraint to count the number of violations present (no change)

• We take the inner product of candidate representation R with L, for each R.

• Candidate with lowest inner product wins.

Page 35:

        *a   *b   *c   *f   *z
cab      1    1    1    0    0
foyk     0    0    0    1    0

Page 36:

• Each constraint says: No Phoneme x;

• Counting phonemes is the same as counting violations; frequency of phonemes is the same as frequency of violations.

Page 37:

Why is vector inner product the same as OT tableau winner-selection?

• Let N be the largest number of violations that crucially selects a winner in a given language: e.g. N = 3… in fact, to make life easy, let’s suppose N is not greater than 10…

         Con 1   Con 2   Con 3
Cand 1   ***
Cand 2   **      **      ***
Cand 3   ***

Page 38:

• Then we can reformulate the information inherent in the ranking of a set of constraints as a set of numbers:

• The top constraint is given the number 0.1; the next gets 0.01; the next gets 0.001; in general, the nth gets $10^{-n}$.

• Each candidate’s score is computed by weighting its violation counts by the costs of the constraints it violates.

• As long as there are never more than 9 violations, strict ranking effects will hold.

(Constraint conjunction effects will emerge otherwise.)

Candidate with lowest score wins: its plog is minimal.

Page 39:

         Score   Con 1   Con 2   Con 3
Cand 1   0.300   ***
Cand 2   0.223   **      **      ***
Cand 3   0.310   ***     *

Page 40:

Same point, repeated:

• If we compare 2 candidates with scores 0.10122 and 0.10045, the second wins, because the “third” constraint (the one with weight 0.001) was violated less by it.
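A sketch of winner-selection as a weighted sum, using the violation counts from the tableau above:

```python
# Constraint n (counting from the top, 0-indexed) costs 10^-(n+1) per
# violation; the candidate with the lowest total score wins.
violations = {
    "Cand 1": [3, 0, 0],   # ***            -> 0.300
    "Cand 2": [2, 2, 3],   # ** / ** / ***  -> 0.223
    "Cand 3": [3, 1, 0],   # *** / *        -> 0.310
}
weights = [10 ** -(n + 1) for n in range(3)]   # [0.1, 0.01, 0.001]

scores = {cand: sum(v * w for v, w in zip(vs, weights))
          for cand, vs in violations.items()}
winner = min(scores, key=scores.get)
print(scores, winner)   # Cand 2 wins: strict ranking falls out of the weights
```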

Page 41:

Of prose, of poetry

• Like Molière’s M. Jourdain, we’ve been doing vector inner products every time we found a tableau winner, and we didn’t even know it.

Page 42:

• Who is imitating whom?

• The inner product model is simply applying the fundamental principle: maximize the probability of the data.

• As reconstructed here, you don’t need a cognitive theory to do OT. If there is a cognitive basis, it’s in the choice of constraints.

Page 43:

Beyond phonotactics: Dealing with alternations

Consider representations on two levels, and suppose we know the correct UR. Then we wish to choose the representation with the highest probability that includes the UR (i.e., given the UR).

The simplest formulation: the probability of a representation is the product of the probability of its surface form times the probability of the UR/SR correspondences:

Page 44:

• /i/ can be realized as [i] or as [e]:

• /i/ realized as [i]: 9/10 (0.15 bits)

• /i/ realized as [e]: 1/10 (3.3 bits)
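A minimal sketch of where those bit values come from, assuming plogs are taken base 2:

```python
import math

def plog_bits(p):
    return -math.log2(p)

print(plog_bits(9 / 10))   # /i/ -> [i]: about 0.15 bits
print(plog_bits(1 / 10))   # /i/ -> [e]: about 3.32 bits
```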

Page 45:

Russian hypocoristics

• The average complexity for the four categories of nouns in this corpus was also calculated, with the following results: 2.8 for second declension nouns, 2.67 for female names, 2.40 for affective hypocoristics, and 2.33 for neutral hypocoristics. This ordering shows that on average, female names in this corpus are lower in their complexity than second declension nouns in general, affective hypocoristics lower than female names, and neutral hypocoristics the lowest in complexity.

• More detail can be gained from inspecting the distribution of average complexity values for the three groups – second declension nouns, female names, and hypocoristics. The graphs of complexity distribution for second declension nouns and female names are shown in Fig. 1. The graphs of complexity distribution for second declension nouns and hypocoristics are shown in Fig. 2. [1]

[1] Neutral and affective hypocoristics are grouped together. The distribution in Figs. 1 and 2 shows the percentage of data in each of 36 buckets with 0.1 step, produced by rounding off the average complexity value to one decimal place (numerical values from 1.2 to 5.0).

Page 46:

[Figure: line chart comparing the distribution of complexity (bigram) values for second declension nouns vs. all hypocoristics; x-axis: Complexity (bigrams), 0–6; y-axis: Percentage of cases, 0–0.25.]

Fig. 2. The distribution of well-formedness values in second declension nouns vs. hypocoristics

Page 47:

[Figure: line chart comparing the distribution of complexity (bigram) values for second declension nouns vs. female names; x-axis: Complexity (bigrams), 0–6; y-axis: Percentage of cases, 0–0.25.]

Fig. 1. The distribution of well-formedness values in second declension nouns vs. female names