Unsupervised language acquisition (Carl de Marcken, 1996)

Page 1

Unsupervised language acquisition

Carl de Marcken 1996

Page 2

Broad outline

• Goal: take a large unbroken corpus (no indication of where word boundaries are), find the best analysis of the corpus into words.

• “Best”? Interpret the goal in the context of MDL (Minimum Description Length) theory

We begin with a corpus, and a lexicon which initially has all and only the individual characters (letters, or phonemes) as its entries.

Page 3

1. Iterate several times (e.g., 7 times):
   1. Construct tentative new entries for the lexicon, with tentative counts; from the counts, calculate rough probabilities.
   2. EM (Expectation/Maximization): iterate 5 times:
      1. Expectation: find all possible occurrences of each lexical entry in the corpus; assign relative weights to each occurrence found, based on its probability; use this to assign (non-integral!) counts of words in the corpus.
      2. Maximization: convert counts into probabilities.
   3. Test each lexical entry to see whether the description length is better without it in the lexicon. If true, remove it.

2. Find the best parse (Viterbi parse), the one with highest probability.

Page 4

T H E R E N T I S D U E

Lexicon: D E H I N R S T U

T 2

E 3

All others 1

Total count: 12

Page 5

Step 0

• Initialize the lexicon with all of the symbols that occur in the corpus (the alphabet, the set of phonemes, whatever it is).

• Each symbol has a probability, which is simply its frequency.

• There are no (non-trivial) chunks in the lexicon.
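A minimal Python sketch of Step 0, assuming the corpus is just the toy string THERENTISDUE from the previous slide: the lexicon is the set of characters, each with probability equal to its frequency.

```python
from collections import Counter

corpus = "THERENTISDUE"

char_counts = Counter(corpus)                                   # T: 2, E: 3, all others 1
lexicon = {c: n / len(corpus) for c, n in char_counts.items()}  # e.g. pr(T) = 2/12, pr(E) = 3/12
print(len(corpus), lexicon["T"], lexicon["E"])                  # 12  0.166...  0.25
```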

Page 6

Step 1

• 1.1 Create tentative members
– TH HE ER RE EN NT TI IS SD DU UE
– Give each of these a count of 1.
– Now the total count of “words” in the corpus is 12 + 11 = 23.
– Calculate new probabilities: pr(E) = 3/23; pr(TH) = 1/23.
– Probs of the lexicon form a distribution.
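A minimal Python sketch of this step on the toy corpus (not de Marcken's code, just the bookkeeping described above): the eleven adjacent bigrams are added as tentative entries with a count of 1 each, and the probabilities are recomputed over the 23 total counts.

```python
from collections import Counter

corpus = "THERENTISDUE"

counts = Counter(corpus)                                  # the existing single-character entries (12 counts)
counts.update(a + b for a, b in zip(corpus, corpus[1:]))  # tentative bigrams TH, HE, ..., UE, count 1 each

total = sum(counts.values())                              # 12 + 11 = 23
probs = {w: c / total for w, c in counts.items()}
print(total, probs["E"], probs["TH"])                     # 23  3/23 ~ 0.130  1/23 ~ 0.043
```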

Page 7

Expectation/Maximization (EM), iterative:

• This is a widely used algorithm to do something important and almost miraculous: to find the best values for hidden parameters.

• Expectation: Find all occurrences of each lexical item in the corpus.

Use the Forward/Backward algorithm.

Page 8

Forward algorithm

• Find all ways of parsing the corpus from the beginning to each point, and associate with each point the sum of the probabilities for all of those ways. We don’t know which is the right one, really.
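A sketch of that forward pass in Python (not de Marcken's own code). Forward(i) accumulates the summed probability of all chunkings of the first i characters; since the longest lexical entry at this stage is 2 characters, we only need to look back 2 positions (a point made again a few slides below).

```python
def forward(s, prob, max_len=2):
    """Forward(i): summed probability of all ways of chunking s[:i] into lexicon entries."""
    f = [1.0] + [0.0] * len(s)            # Forward(0) = 1: the empty prefix has exactly one (empty) parse
    for i in range(1, len(s) + 1):
        for j in range(max(0, i - max_len), i):
            chunk = s[j:i]
            if chunk in prob:             # one more way of parsing the prefix, ending with the word s[j:i]
                f[i] += f[j] * prob[chunk]
    return f

# Toy lexicon from the previous slides: the nine characters plus the eleven bigrams, 23 counts in all.
prob = {c: 1 / 23 for c in "DEHINRSTU"}
prob.update({"T": 2 / 23, "E": 3 / 23})
prob.update({bg: 1 / 23 for bg in ["TH", "HE", "ER", "RE", "EN", "NT", "TI", "IS", "SD", "DU", "UE"]})

print(forward("THERENTISDUE", prob)[:4])  # ~ [1.0, 0.087, 0.047, 0.0099]
```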

Page 9

Forward

Start at position 1, after T:

THERENTISDUE

The only way to get there and put a word break there (“T HERENTISDUE”) utilizes the word(?) “T”, whose probability is 2/23. Forward(1) = 2/23.

Now, after position 2, after TH:

There are 2 ways to get this:

T H ERENTISDUE (a) or

TH ERENTISDUE (b)

(a) has probability 2/23 * 1/23 = 2/529 = 0.003781

(b) has probability 1/23 = 0.0435

Page 10

There are 2 ways to get this: T H ERENTISDUE (a) or TH ERENTISDUE (b)

(a) has probability 2/23 * 1/23 = 2/529 = 0.003781; (b) has probability 1/23 = 0.0435.

So the Forward probability after letter 2 (after “TH”) is 0.0472.

After letter 3 (after “THE”), we have to consider the possibilities:

(1) T-HE and (2) TH-E and (3) T-H-E

Page 11

(1) T-HE (2) TH-E (3) T-H-E

(1) We calculate this prob as: prob of a break after position 1 (“T”), which is 2/23 = 0.0869, times prob(HE), which is 1/23 = 0.0435, giving 0.00378.

(2) We combine cases (2) and (3), giving us for both together: prob of a break after position 2 (the H), already calculated as 0.0472, times prob(E) = 3/23 = 0.13, giving 0.0472 * 0.13 = 0.00616.

So Forward(3) = 0.00378 + 0.00616 = 0.00994.
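The same arithmetic, spelled out as a quick check in Python:

```python
pr_T, pr_H, pr_E, pr_TH, pr_HE = 2/23, 1/23, 3/23, 1/23, 1/23

f1 = pr_T                      # Forward(1): the parse "T ..."                       ~ 0.087
f2 = f1 * pr_H + pr_TH         # Forward(2): "T H ..." plus "TH ..."                 ~ 0.047
f3 = f1 * pr_HE + f2 * pr_E    # Forward(3): "T HE ..." plus (break after 2) + "E"   ~ 0.0099
print(f1, f2, f3)
```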

Page 12

Forward

[Diagram: the string T H E, with two paths, P1 (split into P1a and P1b) and P2, leading to the break after E. The value of Forward here is the sum of the probabilities going by the two paths, P1 and P2.]

Page 13

Forward

[Diagram: the string T H E, with paths P2 (split into P2a and P2b) and P3 leading to the break after E. The value of Forward here is the sum of the probabilities going by the two paths, P2 and P3.]

You only need to go back (from where you are) the length of the longest lexical entry (which is now 2).

Page 14

Conceptually

• We are computing for each break (between letters) what the probability is that there is a break there, by considering all possible chunkings of the (prefix) string, the string up to that point from the left.

• This is the Forward probability of that break.

Page 15

Backward

• We do exactly the same thing from right to left, giving us a backward probability:

• …. D U E
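A sketch of the backward pass, the mirror image of the forward sketch above: Backward(i) is the summed probability of all ways of chunking the suffix that starts at position i.

```python
def backward(s, prob, max_len=2):
    """Backward(i): summed probability of all ways of chunking the suffix s[i:]."""
    b = [0.0] * len(s) + [1.0]            # Backward(len(s)) = 1: the empty suffix
    for i in range(len(s) - 1, -1, -1):
        for j in range(i + 1, min(len(s), i + max_len) + 1):
            chunk = s[i:j]
            if chunk in prob:             # one more way of parsing the suffix, starting with the word s[i:j]
                b[i] += prob[chunk] * b[j]
    return b
```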

Page 16

Now the tricky step:

• T H E R E N T I S D U E

• Note that we know the probability of the entire string (it's Forward(12), which is the sum of the probabilities of all the ways of chunking the string) = Pr(string)

• What is the probability that -R- is a word, given the string?

Page 17

T H E R E N T I S D U E

• That is, we’re wondering whether the R here is a chunk, or part of the chunk ER, or part of the chunk RE. It can’t be all three, but we’re not in a position (yet) to decide which it is. How do we count it?

• We take the count of 1, and divide it up among the three options in proportion to their probabilities.

Page 18

T H E R E N T I S D U E

• Probability that R is a word can be found in this expression:

Forward(3) * Pr(R is a word) * Backward(4) / Pr(string)

This is the fractional count that goes to R!
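Putting the two passes together, here is a sketch of how the Expectation step's fractional counts can be collected; it assumes the forward and backward functions sketched above are in scope. For an occurrence of entry w spanning characters i+1 through j, it adds Forward(i) * pr(w) * Backward(j) / Pr(string) to w's count, exactly the expression above.

```python
from collections import defaultdict

def expected_counts(s, prob, max_len=2):
    f = forward(s, prob, max_len)         # forward and backward from the sketches above
    b = backward(s, prob, max_len)
    total = f[len(s)]                     # Pr(string): probability summed over all chunkings
    counts = defaultdict(float)
    for i in range(len(s)):
        for j in range(i + 1, min(len(s), i + max_len) + 1):
            w = s[i:j]
            if w in prob:
                counts[w] += f[i] * prob[w] * b[j] / total   # fractional count for this occurrence
    return counts
```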

Page 19

Do this for all members of the lexicon

• Compute Forward and Backward just once for the whole corpus, or for each sentence or subutterance if you have that information.

• Compute the counts of all lexical items that conceivably could occur (in each sentence, etc.).

• End of Expectation.

Page 20

• We’ll go through something just like this again in a few minutes, when we calculate the Viterbi-best parse.

Page 21

Maximization

• We now have a bunch of counts for the lexical items. None of the counts are integral (except by accident).

• Normalize: take sum of counts over the lexicon = N, and calculate frequency of each word = Count (word)/N;

• Set prob(word) = freq (word).
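A minimal sketch of that normalization:

```python
def maximization(counts):
    """Convert (fractional) counts into probabilities: prob(word) = count(word) / N."""
    n = sum(counts.values())
    return {w: c / n for w, c in counts.items()}
```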

Page 22

Why “Maximization”?

• Because the values for the probabilities that maximize the probability of the whole corpus are obtained by using the frequencies as the probabilities.

• That’s not obvious….

Page 23

Testing a lexical entry

A lexical entry makes a positive contribution to the analysis iff the Description Length of the corpus is lower when we incorporate that lexical entry than when we don’t, all other things being equal.

What is the description length (DL), and how is it calculated?

Page 24

Description Length

DL (Corpus) = - log prob (Corpus)

+ length of the Lexicon.
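The test from Page 23 then amounts to comparing this quantity with and without the entry in the lexicon. A sketch with made-up numbers, purely to illustrate the comparison (in practice both terms are measured in bits):

```python
def description_length(neg_log_prob_corpus, lexicon_length):
    # DL(Corpus) = -log prob(Corpus) + length of the lexicon, both in bits
    return neg_log_prob_corpus + lexicon_length

# Hypothetical numbers: dropping an entry makes the corpus a bit more expensive to encode
# but makes the lexicon shorter; keep the entry only if the total DL is lower with it in.
dl_with    = description_length(neg_log_prob_corpus=1040.0, lexicon_length=310.0)
dl_without = description_length(neg_log_prob_corpus=1065.0, lexicon_length=295.0)
print(dl_with, dl_without, dl_with < dl_without)   # 1350.0 1360.0 True -> keep the entry
```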

Length of the lexicon?

Page 25

Let’s look at 2 ways this could be viewed

1. Lexicon as a list of words

2. (de Marcken’s approach) Lexicon as a list of pointers

Page 26

Lexicon as a list of items

• Our grammar for this simple task is just a list of words. What’s the probability of a list of words?

• First task is figuring out how to set up a distribution over an infinite set.

• We need a set of numbers (> 0) which add up to 1.0.

Page 27

Convergent series

• You may remember this from a calculus class…or not.

• Some series do not converge: 1/2 + 1/3 + 1/4 + 1/5 + …

And some do:

1/2 + 1/4 + 1/8 + 1/16 + … = 1

3/10 + 3/100 + 3/1000 + … = .333… = 1/3

Page 28

Geometric

• A series in which each term is a constant multiple p of the one before it is a geometric series, and it converges when |p| < 1.

• p + p^2 + p^3 + p^4 + …

Suppose A = p + p^2 + p^3 + p^4 + …

Then Ap = p^2 + p^3 + p^4 + …

So A - Ap = p, surprisingly enough. Hence

A(1 - p) = p, or A = p/(1 - p).

Page 29

• This means that if you multiply each of the terms by (1-p), then they’ll add up to 1.0.

• We can use this to divide up our (1 kilo of) probability mass over A+, all strings made from our alphabet A.

• To play the role of p, we pick a parameter C1 < 1. It is responsible for how much less probability the longer strings get. In particular, all the strings (in toto, not individually) of length L get probability C1^L * (1 - C1).

Page 30

• How many different strings of length L are there? If the alphabet A has Z characters in it (2, or 27, or whatever), then there are Z^L different strings.

• All of them are created equal; none is better than any other. So we divide the probability mass among all of them, and each gets probability:

• C1^L * (1 - C1) / Z^L. See that?

• So the log of what each gets is:

log prob = L log C1 + log(1 - C1) - L log Z.

Page 31

• Or: plog prob = L plog C1 + plog(1 - C1) + L plog Z,

where plog(1 - C1) is some small constant, so this is

approximately L * (plog C1 + plog Z).

Remember that C1 is an unknown constant (UG?), Z is a fact about the language, and L is the length of the string (so longer lexical entries cost more bits).
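A small sketch of that raw, letter-by-letter cost of a lexical entry, with assumed values for C1 and Z (neither is given on the slides):

```python
import math

def plog(x):
    return -math.log2(x)      # "positive log": the cost in bits of probability x

C1, Z = 0.5, 27               # assumed values: length-decay constant and alphabet size

def raw_entry_bits(L):
    # plog prob = L*plog(C1) + plog(1 - C1) + L*plog(Z)  ~  L * (plog(C1) + plog(Z))
    return L * plog(C1) + plog(1 - C1) + L * plog(Z)

print(raw_entry_bits(3), raw_entry_bits(6))   # longer entries cost proportionally more bits
```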

Page 32

2nd approach, closer to de Marcken’s

• The lexicon does not consist of strings of letters, but rather strings of pointers;

• The length of each pointer = plog of what it points to.

• “Optimal compressed length of that data”

• Fundamental idea in information theory: a string can be encoded by an (often smaller) number of bits, approximately equal to the plog of its frequency.

Page 33

Approximately?

• Since the plog is rarely an integer, you may have to round up to the next integer, but that's all.

• So the more often something is used in the lexicon, the cheaper it is for it to be used by the words that use it.

• Just like in morphology.
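Under this second view, then, the cost of each pointer can be sketched as the plog of the frequency of the entry it points to, rounded up to a whole number of bits:

```python
import math

def pointer_bits(prob):
    """Bits needed to point to an entry used with this probability: its plog, rounded up."""
    return math.ceil(-math.log2(prob))

# A frequent entry is cheap to point to, a rare one expensive:
print(pointer_bits(1/2), pointer_bits(1/8), pointer_bits(1/100))   # 1 3 7
```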

Page 35

T H E R E N T I S D U E

A, B, C, … : 3 bits each
HE: 4
THE: 3
HERE: 5
THERE: 5
RENT: 7
IS: 4
TIS: 8
DUE: 7
ERE: 8

Viterbi-best parse
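A sketch of the Viterbi-best parse as a simple dynamic program over these bit costs (single letters cost 3 bits, the listed chunks cost what the table says, and nothing longer than 5 characters is in the lexicon). It is one way to implement the cell-by-cell search that the following slides walk through by hand.

```python
# Bit costs from the table above; any single letter costs 3 bits.
COSTS = {"HE": 4, "THE": 3, "HERE": 5, "THERE": 5,
         "RENT": 7, "IS": 4, "TIS": 8, "DUE": 7, "ERE": 8}

def chunk_cost(chunk):
    if len(chunk) == 1:
        return 3
    return COSTS.get(chunk)          # None if the chunk is not in the lexicon

def viterbi_parse(s, max_len=5):
    best = [0] + [float("inf")] * len(s)   # best[i]: cheapest encoding of s[:i], in bits
    back = [0] * (len(s) + 1)              # back[i]: where the last chunk of that encoding starts
    for i in range(1, len(s) + 1):
        for j in range(max(0, i - max_len), i):
            c = chunk_cost(s[j:i])
            if c is not None and best[j] + c < best[i]:
                best[i], back[i] = best[j] + c, j
    chunks, i = [], len(s)
    while i > 0:
        chunks.append(s[back[i]:i])
        i = back[i]
    return list(reversed(chunks)), best[len(s)]

print(viterbi_parse("THERENTISDUE"))   # (['THE', 'RENT', 'IS', 'DUE'], 21), as in the hand walkthrough
```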

Page 36

T H E R E N T I S D U E


After character 1: Best analysis: T: 3 bits

Page 37

T H E R E N T I S D U E


After character 1: Best (only) analysis: T: 3 bits

After character 2: (1,2) not in lexicon

(1,1)'s best analysis + (2,2), which exists: 3 + 3 = 6 bits

Page 38

T H E R E N T I S D U E

After character 1: Best (only) analysis: T: 3 bits

After character 2: (1,2) not in lexicon
(1,1)'s best analysis + (2,2), which exists: 3 + 3 = 6 bits (WINNER)

After character 3: (1,3) is in lexicon: THE: 3 bits
(1,1)'s best analysis + (2,3), which exists: 3 + 4 = 7 bits (T-HE)
(1,2)'s best analysis + (3,3): T-H-E: 6 + 3 = 9 bits

THE wins (3 bits)


Page 39

T H E R E N T I S D U E

After character 1: Best (only) analysis: T: 3 bits

After character 2: (1,2) not in lexicon
(1,1)'s best analysis + (2,2), which exists: 3 + 3 = 6 bits

After character 3: (1,3) is in lexicon: THE: 3 bits
(1,1)'s best analysis + (2,3), which exists: 3 + 4 = 7 bits (T-HE)
(1,2)'s best analysis + (3,3): T-H-E: 6 + 3 = 9; THE wins

After character 4: (1,4) not in lexicon; (2,4) not in lexicon; (3,4) not in lexicon;
best up to 3: THE plus R yields THE-R, cost is 3 + 3 = 6. (Winner, sole entry)


Page 40

T H E R E N T I S D U E

1: T 3
2: T-H 6
3: THE 3
4: THE-R 6

5: (1,5) THERE: 5
(1,1) + (2,5) HERE = 3 + 5 = 8
(1,2) + (3,5) ERE = 6 + 8 = 14
(4,5) not in lexicon
(1,4) + (5,5) = THE-R-E = 6 + 3 = 9
THERE is the winner (5 bits)

6: (1,6) not checked because it exceeds the lexicon's max length
(2,6) HEREN not in lexicon
(3,6) EREN not in lexicon
(4,6) REN not in lexicon
(5,6) EN not in lexicon
(1,5) + (6,6) = THERE-N = 5 + 3 = 8 (Winner)


Page 41

T H E R E N T I S D U E

1: T 3
2: T-H 6
3: THE 3
4: THE-R 6
5: THERE 5
6: THERE-N 8

7: start with ERENT: not in lexicon
(1,3) + (4,7): THE-RENT = 3 + 7 = 10
ENT not in lexicon
NT not in lexicon
(1,6) + (7,7) = THERE-N-T = 8 + 3 = 11

THE-RENT winner (10 bits)


Page 42

T H E R E N T I S D U E

1: T 3
2: T-H 6
3: THE 3
4: THE-R 6
5: THERE 5
6: THERE-N 8
7: THE-RENT 10

8: Start with RENTI: not in lexicon
ENTI, NTI, TI: none in lexicon
(1,7) THE-RENT + (8,8) I = 10 + 3 = 13
The winner by default

9: Start with ENTIS: not in lexicon, nor is NTIS
(1,6) THERE-N + (7,9) TIS = 8 + 8 = 16
(1,7) THE-RENT + (8,9) IS = 10 + 4 = 14
(1,8) THE-RENT-I + (9,9) S = 13 + 3 = 16
THE-RENT-IS is the winner (14)


Page 43

T H E R E N T I S D U E

1: T 3
2: T-H 6
3: THE 3
4: THE-R 6
5: THERE 5
6: THERE-N 8
7: THE-RENT 10
8: THE-RENT-I 13
9: THE-RENT-IS 14

10: Not found: NTISD, TISD, ISD, SD
(1,9) THE-RENT-IS + (10,10) D = 14 + 3 = 17

11: Not found: TISDU, ISDU, SDU, DU
(1,10) THE-RENT-IS-D + (11,11) U = 17 + 3 = 20
Winner: THE-RENT-IS-D-U (20)

12: Not found: ISDUE, SDUE, UE
(1,9) THE-RENT-IS + (10,12) DUE = 14 + 7 = 21 (WINNER)
(1,11) THE-RENT-IS-D-U + (12,12) E = 20 + 3 = 23
Best parse of the whole string: THE-RENT-IS-DUE (21 bits)

Page 44

Let’s look at results

• It doesn’t know whether it’s finding letters, letter chunks, morphemes, words, or phrases.

• Why not?

• Statistical learning is heavily structure-bound: don’t forget that! If the structure is there, it must be found.