
Page 1:

Probabilistic models in phonology

John Goldsmith

LSA Institute 2003

Page 2:

Aims for today

1. Explain the basics of a probabilistic model: notion of a distribution, etc.

2. List some possible uses for such a model

3. Explore English data, perhaps French, Japanese, Dutch

4. Compare the computation of probability with an OT tableau’s computation

5. Time permitting: dealing with alternations

Page 3:

Message not to take away

• The message is not that we don’t need structure in phonology, or that numerical methods can replace the use of phonological structure.

• The message is that there is a great deal of information present even in the simplest of structures (here, linear sequences) – and that we should not posit new structure until we have deduced all of the theoretical consequences of the structure we have already posited.

Page 4:

Probabilistic model in [vanilla] linguistics

• Model the phonotactics of a language as a function assigning to each word its plog probability

• Solve (part of the) problem of acquisition of phonological knowledge

• Calculate the plog prob of the entire data, so that one model’s performance can be compared with another model’s (or theory’s).

• That is so important that I’ll repeat it: the probabilistic point of view tells us that the goal of our analysis is to maximize the probability of the observed data -- not to deal with special cases.

Page 5:

Possible applications

1. Nativization moves a word towards higher probability (lower plog).

2. Conjecture: phonological alternations move words towards lower plog (“harmonic rule application”).

3. Russian hypocoristics (Svetlana Soglasnova): lower average plog compared to base name.

Page 6:

4. Cross-linguistic language-identification

5. [Vowel harmony]

6. Optimality theory and probabilistic models: twins separated at birth

Page 7:

Basics of probability

1. Sample space
2. Distribution
3. Probability and frequency
4. Unigram model: prob(string) as product of the probs of individual phones
5. Log prob, plog prob
6. Conditional probability
7. Mutual information
8. Average plog = entropy
9. Maximize probability of entire corpus (or minimize plog or entropy – same thing, in these cases)
10. Bayes’ Law [or rule]
11. Back-off models

Page 8:

Observations

• I will assume that we have a body of observations that we are analyzing. We can count things in it, and we want to compute its probability in various ways.

• There isn’t just one way to compute the probability; the probability isn’t something that lies in the object: it is a function of the model that we establish.

Page 9:

Sample space

• Is our universe of possibilities.

• In our case, we assume for today that we have an alphabet A, and our main sample space is the set of all strings from A, $A^+$.

• We also assume that # is in A, and all strings in our sample space begin and end with #.

Page 10:

Distribution

• A distribution is a set of non-negative numbers that sum to 1.

• We require that there be a function on the members of our sample space (“prob”), and the values of prob, as we go through all the elements of the sample space, must form a distribution.

• Think of it as if there were a kilo of stuff called probability mass that is being divided up over all the members of the sample space.

Page 11:

Sums and products ofdistributions

• We won’t have the time to get into this now in any detail, but the following points are extremely important if you want to develop these models:– The product of two distributions is a distribution.– The weighted average of two distributions is a

distribution: if D1 and D2 are distributions, then x.D1 + (1-x).D2 is also a distribution. (Here “x” is the amount of the probability mass distributed by D1, and “1-x” is the amount distributed by D2).

Page 12:

Alphabet

• We begin by recognizing that we will need a distribution over the letters of our alphabet.

• Let us use the letters’ frequencies as their probabilities. A letter’s frequency is its count in the corpus divided by the total number of letters in the corpus.

• Think about that….
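Here is a minimal sketch (not from the slides) of what that amounts to in Python; the toy corpus and variable names are hypothetical:

```python
# Estimate a distribution over an alphabet from a corpus: each letter's
# probability is its count divided by the total number of letters.
from collections import Counter

corpus = "#dog##dolly##cat#"   # '#' marks word boundaries, as in the slides
counts = Counter(corpus)
total = sum(counts.values())
prob = {letter: n / total for letter, n in counts.items()}

assert abs(sum(prob.values()) - 1.0) < 1e-9   # the values form a distribution
print(prob['d'])                              # relative frequency of 'd'
```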

Page 13:

Probability of a string

• Simplest model is called the unigram model; it takes phonemes into consideration, but not their order.

W = dog; W[1] = ‘d’; W[2] = ‘o’; W[3] = ‘g’;

$\mathrm{Prob}(W) = pr(d) \times pr(o) \times pr(g) = \prod_{i=1}^{|W|} pr(W[i])$

Page 14:

Log prob

• It’s much easier to think in terms of log probabilities:

$\log \mathrm{prob}(W) = \sum_{i=1}^{|W|} \log pr(W[i])$

• Better yet, calculate plogs (multiply by -1):

$\mathrm{plog}(W) = -\sum_{i=1}^{|W|} \log pr(W[i])$
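A sketch of both computations, with made-up letter probabilities; plogs are taken base 2, so they come out in bits:

```python
# Unigram model: Prob(W) is the product of the letters' probabilities,
# and plog(W) is the positive sum of their plogs.
import math

prob = {'d': 0.04, 'o': 0.08, 'g': 0.02}   # hypothetical letter probabilities

def unigram_prob(word):
    p = 1.0
    for letter in word:
        p *= prob[letter]
    return p

def plog(word):
    return -sum(math.log2(prob[letter]) for letter in word)

# plog(W) = -log2 Prob(W): the two formulations agree.
assert math.isclose(plog("dog"), -math.log2(unigram_prob("dog")))
```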

Page 15:

The joy of logarithms

• Probabilities get too small too fast.
• Logs help.
• Log probabilities are very big and negative.
• We make them positive.
• Positive log probabilities.

More standard notation: $\tilde{p}$

Page 16:

Page 17:

Base 2 logs

• If $2^x = y$, then $\log_2(y) = x$.

• $2^3 = 8$, so $\log(8) = 3$;

• $2^{-3} = 1/8$, so $\log(1/8) = -3$.

• $\log(ab) = \log(a) + \log(b)$

• $\log(a/b) = \log(a) - \log(b)$

• $\log(x) = -\log(1/x)$

• $\log(a^n) = n \log(a)$
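A quick numeric check of these identities with Python’s math.log2:

```python
import math

a, b, n = 8.0, 2.0, 3
assert math.log2(8) == 3 and math.log2(1 / 8) == -3
assert math.isclose(math.log2(a * b), math.log2(a) + math.log2(b))
assert math.isclose(math.log2(a / b), math.log2(a) - math.log2(b))
assert math.isclose(math.log2(a), -math.log2(1 / a))
assert math.isclose(math.log2(a ** n), n * math.log2(a))
```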

Page 18:

• It's important to be comfortable with notation, so that you see easily that the preceding equation can be written as follows, where the left side uses the capital pi to indicate products, and the right side uses a capital sigma to indicate sums:

$\log \prod_{i=1}^{4} \mathrm{prob}(S[i]) = \sum_{i=1}^{4} \log \mathrm{prob}(S[i])$

Page 19:

• Let’s look at them…

Page 20:

Conditional probability: Bigram model

• If there is a dependency between two phenomena, we can better describe their occurrences if our models take that dependency into account. Effects in language are primarily local…

• Compute the probability of each phoneme, conditioned on the preceding phoneme.

Page 21:

Definition: …AB…

• The conditional probability of B, given A, is the probability of AB occurring together, divided by the probability of A. (It’s as if the sample space temporarily shrank down to the universe of events compatible with A).

Page 22:

In the case of strings…

Pr ( W[i] = L1 | W[i-1] = L2 )

This says: the probability that the ith letter is L1, given that the i-1th letter is L2.

$\Pr(W[i] = L_1 \mid W[i-1] = L_2) = \frac{\Pr(L_2 L_1)}{\Pr(L_2)} = \frac{Ct(L_2 L_1)/N}{Ct(L_2)/N} = \frac{Ct(L_2 L_1)}{Ct(L_2)}$

(where Ct is the count in the corpus, and N is the size of the corpus)
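A minimal sketch of estimating these conditional probabilities from counts, on a hypothetical toy corpus:

```python
# Bigram estimation: Pr(W[i] = l1 | W[i-1] = l2) = Ct(l2 l1) / Ct(l2).
from collections import Counter

corpus_words = ["#dog#", "#dolly#", "#cat#"]
unigram_counts = Counter()
bigram_counts = Counter()
for w in corpus_words:
    unigram_counts.update(w)                # letter counts
    bigram_counts.update(zip(w, w[1:]))     # adjacent-letter pair counts

def cond_prob(l1, l2):
    # Simplification: Ct(l2) counts all occurrences of l2, including
    # word-final ones that start no bigram.
    return bigram_counts[(l2, l1)] / unigram_counts[l2]

print(cond_prob('o', 'd'))   # probability that 'd' is followed by 'o'
```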

Page 23:

• We’ll take logs of that in a moment. But first, mutual information.

• The mutual information between two letters is the log of the ratio of the observed frequency of the pair to the frequency we would predict if the data had no structure, namely the product of the frequencies of the letters:

Page 24:

$MI(X,Y) = \log \frac{freq(XY)}{freq(X)\, freq(Y)} = \log freq(XY) - \log freq(X) - \log freq(Y)$

This is a measure of the stickiness between X and Y in the data: they can attract (MI>0) or repel (MI < 0).
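A sketch of the computation, with made-up frequencies:

```python
# Pointwise mutual information between letters X and Y:
# MI(X, Y) = log2( freq(XY) / (freq(X) * freq(Y)) ).
import math

def mutual_information(freq_xy, freq_x, freq_y):
    """Positive when X and Y attract, negative when they repel."""
    return math.log2(freq_xy / (freq_x * freq_y))

# A pair observed four times more often than independence would predict:
print(mutual_information(freq_xy=0.02, freq_x=0.05, freq_y=0.1))   # 2.0 bits
```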

Page 25:

Back to bigram model

• The bigram model, with its particular conditional probability, is just the sum (using logs) of the unigram model plus the mutual information:

$pr(B \mid A) = pr(AB) / pr(A)$

$\log pr(B \mid A) = \log pr(AB) - \log pr(A)$

$MI(A,B) = \log \frac{pr(AB)}{pr(A)\, pr(B)}$, so

$\log pr(AB) = MI(A,B) + \log pr(A) + \log pr(B)$

… so $\log pr(B \mid A) = MI(A,B) + \log pr(A) + \log pr(B) - \log pr(A) = MI(A,B) + \log pr(B)$
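A numeric check of the identity, with made-up probabilities for a single bigram AB:

```python
# Verify log pr(B|A) = MI(A,B) + log pr(B).
import math

pr_a, pr_b, pr_ab = 0.05, 0.1, 0.02          # hypothetical values
log_cond = math.log2(pr_ab / pr_a)           # log pr(B | A)
mi = math.log2(pr_ab / (pr_a * pr_b))        # MI(A, B)
assert math.isclose(log_cond, mi + math.log2(pr_b))
```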

Page 26:

Mutual information

• …is the measure of the structural stickiness between pieces.

Important question: in a given set of data, how much does the quality of the analysis (which is the plog you’ve calculated) depend on a given MI? The answer is: the number of times that bigram occurred, times the MI. Let’s call that the weighted mutual information.

Page 27:

Bayes’ Rule:

The relationship between
• pr(A | B), "the probability of A given B", and
• pr(B | A), "the probability of B given A":

$pr(A \mid B) = \frac{pr(B \mid A)\, pr(A)}{pr(B)}$

Page 28:

$pr(A \text{ and } B) = pr(A \mid B) \cdot pr(B)$

$pr(A \text{ and } B) = pr(B \mid A) \cdot pr(A)$

so $pr(A \mid B) \cdot pr(B) = pr(B \mid A) \cdot pr(A)$, and therefore:

$pr(A \mid B) = \frac{pr(B \mid A)\, pr(A)}{pr(B)}$
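A toy numeric check (the probabilities are made up; pr(B) comes from total probability):

```python
pr_A = 0.3
pr_B_given_A = 0.5
pr_B_given_notA = 0.2
pr_B = pr_B_given_A * pr_A + pr_B_given_notA * (1 - pr_A)   # = 0.29
pr_A_given_B = pr_B_given_A * pr_A / pr_B                   # Bayes' rule
print(pr_A_given_B)                                         # 0.15 / 0.29 ≈ 0.517
```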

Page 29:

Plogs form a vector

• It is extremely helpful to be able to think of (and visualize) the ~26 plog parameters of our language as a vector, an arrow pointing from the origin to a point in a 26-dimensional space whose coordinates are the parameter values:

(plog(a), plog(b), plog(c), … , plog(z) )

Same thing for the mutual information (MI) parameters, living in a bigger space.

Page 30:

Vector

• The phonotactics of the language as a whole are thought of as a fixed vector.

• Each individual word is also a vector, whose coordinates count the pieces of the word. “Dolly” has a 1 in the d’s place, a “1” in the o’s place, a “2” in the l’s place, etc.

• Then the plog of “dolly” is just the inner product of these two vectors:

Page 31:

Inner product

• Remember, the inner product of 2 vectors is the sum of the products of their respective coordinates…

• (2, 3, 1) · (0.22, 0.1, −0.5) = 2 × 0.22 + 3 × 0.1 + 1 × (−0.5) = 0.24

It’s also directly connected to the angle between the vectors: the inner product is the product of the lengths of the 2 vectors times the cosine of the angle between them.
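A sketch of the plog-as-inner-product computation, with made-up plog parameters:

```python
# plog of a word = (letter-count vector of the word) · (plog vector of the
# language). The plog values below are hypothetical.
from collections import Counter

plog_params = {'d': 3.2, 'o': 2.9, 'l': 3.5, 'y': 4.1}

def word_plog(word):
    counts = Counter(word)                   # "dolly": d:1, o:1, l:2, y:1
    return sum(n * plog_params[c] for c, n in counts.items())

print(word_plog("dolly"))   # 3.2 + 2.9 + 2*3.5 + 4.1 = 17.2
```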

Page 32:

This is a well-understood domain…

• The math is simple and well-understood.

Page 33:

OT central ideas

1. A language provides a ranking of a set of constraints

2. Candidate representations are checked against each constraint to count the number of violations present

3. A well-known algorithm chooses the winning candidate, given the information in (1) and (2)

Page 34:

Revisions

• Constraints are assigned a value: the plog of their violations in the corpus. This defines the vector L of the language.

• Candidate representations are checked against each constraint to count the number of violations present (no change)

• We take the inner product of candidate representation R with L, for each R.

• Candidate with lowest inner product wins.

Page 35:

        *a   *b   *c   *f   *z
cab      1    1    1    0    0
foyk     0    0    0    1    0

Page 36:

• Each constraint says: No Phoneme x;

• Counting phonemes is the same as counting violations; frequency of phonemes is the same as frequency of violations.

Page 37:

Why is vector inner product the same as OT tableau winner-selection?

• Let N be the largest number of violations that crucially selects a winner in a given language: e.g. N = 3… in fact, to make life easy, let’s suppose N is not greater than 10…

         Con 1   Con 2   Con 3
Cand 1   ***
Cand 2   **      **      ***
Cand 3   ***

Page 38:

• Then we can reformulate the information inherent in the ranking of a set of constraints as a set of numbers:

• The top constraint is given the number 0.1; the next gets 0.01; the next gets 0.001; in general, the nth gets $10^{-n}$.

• Each candidate’s score is computed by weighting its violation counts by the costs of the constraints it violates.

• As long as there are never more than 9 violations, strict ranking effects will hold.

(Constraint conjunction effects will emerge otherwise.)

Candidate with lowest score wins: its plog is minimal.

Page 39:

         Score   Con 1   Con 2   Con 3
Cand 1   0.300   ***
Cand 2   0.223   **      **      ***
Cand 3   0.310   ***     *

Page 40:

Same point, repeated:

• If we compare 2 candidates with scores 0.10122 and 0.10045, the second wins, because the “third” constraint (the one with weight 0.001) was violated less by it.
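A sketch of winner-selection as a weighted sum, using the violation counts from the tableau above:

```python
# Constraint n (counting from the top, 0-indexed) costs 10^-(n+1) per
# violation; the candidate with the lowest total score wins.
violations = {
    "Cand 1": [3, 0, 0],   # ***            -> 0.300
    "Cand 2": [2, 2, 3],   # ** / ** / ***  -> 0.223
    "Cand 3": [3, 1, 0],   # *** / *        -> 0.310
}
weights = [10 ** -(n + 1) for n in range(3)]   # [0.1, 0.01, 0.001]

scores = {cand: sum(v * w for v, w in zip(vs, weights))
          for cand, vs in violations.items()}
winner = min(scores, key=scores.get)
print(scores, winner)   # Cand 2 wins: strict ranking falls out of the weights
```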

Page 41:

Of prose, of poetry

• Like Molière’s M. Jourdain, we’ve been doing vector inner products every time we found a tableau winner, and we didn’t even know it.

Page 42:

• Who is imitating whom?

• The inner product model is simply applying the fundamental principle: maximize the probability of the data.

• As reconstructed here, you don’t need a cognitive theory to do OT. If there is a cognitive basis, it’s in the choice of constraints.

Page 43:

Beyond phonotactics: Dealing with alternations

Consider representations on two levels, and suppose we know the correct UR. Then we wish to choose the representation with the highest probability that includes the UR (i.e., given the UR).

The simplest formulation: the probability of a representation is the product of the probability of its surface form times the probability of the UR/SR correspondences:

Page 44:

• /i/ can be realized as [i] or as [e]:

• /i/ realized as [i]: 9/10 (0.15 bits)

• /i/ realized as [e]: 1/10 (3.3 bits)
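A minimal sketch of where those bit values come from, assuming plogs are taken base 2:

```python
import math

def plog_bits(p):
    return -math.log2(p)

print(plog_bits(9 / 10))   # /i/ -> [i]: about 0.15 bits
print(plog_bits(1 / 10))   # /i/ -> [e]: about 3.32 bits
```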

Page 45:

Russian hypocoristics

• The average complexity for the four categories of nouns in this corpus was also calculated, with the following results: 2.8 for second declension nouns, 2.67 for female names, 2.40 for affective hypocoristics, and 2.33 for neutral hypocoristics. This ordering shows that on average, female names in this corpus are lower in their complexity than second declension nouns in general, affective hypocoristics lower than female names, and neutral hypocoristics the lowest in complexity.

• More detail can be gained from inspecting the distribution of average complexity values for the three groups – second declension nouns, female names, and hypocoristics. The graphs of complexity distribution for second declension nouns and female names are shown in Fig. 1. The graphs of complexity distribution for second declension nouns and hypocoristics are shown in Fig. 2. [1]

[1] Neutral and affective hypocoristics are grouped together. The distribution in Figs. 1 and 2 shows the percentage of data in each of 36 buckets with 0.1 step, produced by rounding off the average complexity value to one decimal place (numerical values from 1.2 to 5.0).

Page 46:

[Figure: line chart comparing the distribution of complexity (bigram) values for second declension nouns vs. all hypocoristics; x-axis: Complexity (bigrams), 0–6; y-axis: Percentage of cases, 0–0.25.]

Fig. 2. The distribution of well-formedness values in second declension nouns vs. hypocoristics

Page 47:

[Figure: line chart comparing the distribution of complexity (bigram) values for second declension nouns vs. female names; x-axis: Complexity (bigrams), 0–6; y-axis: Percentage of cases, 0–0.25.]

Fig. 1. The distribution of well-formedness values in second declension nouns vs. female names