applications of speaker/lg recognition · lsa 352 summer 2007 1 lsa 352: speech recognition and...

17
1 1 LSA 352 Summer 2007 LSA 352: Speech Recognition and Synthesis Dan Jurafsky Lecture 1: 1) Overview of Course 2) Refresher: Intro to Probability 3) Language Modeling IP notice: some slides for today from: Josh Goodman, Dan Klein, Bonnie Dorr, Julia Hirschberg, Sandiway Fong 2 LSA 352 Summer 2007 Outline Overview of Course Probability Language Modeling Language Modeling means “probabilistic grammar” 3 LSA 352 Summer 2007 Definitions Speech Recognition Speech-to-Text – Input: a wavefile, – Output: string of words Speech Synthesis Text-to-Speech – Input: a string of words – Output: a wavefile 4 LSA 352 Summer 2007 Automatic Speech Recognition (ASR) Automatic Speech Understanding (ASU) Applications Dictation Telephone-based Information (directions, air travel, banking, etc) Hands-free (in car) Second language ('L2') (accent reduction) Audio archive searching Linguistic research – Automatically computing word durations, etc 5 LSA 352 Summer 2007 Applications of Speech Synthesis/Text-to-Speech (TTS) Games Telephone-based Information (directions, air travel, banking, etc) Eyes-free (in car) Reading/speaking for disabled Education: Reading tutors Education: L2 learning 6 LSA 352 Summer 2007 Applications of Speaker/Lg Recognition Language recognition for call routing Speaker Recognition: Speaker verification (binary decision) – Voice password, telephone assistant Speaker identification (one of N) – Criminal investigation

Upload: others

Post on 10-Jun-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Applications of Speaker/Lg Recognition · LSA 352 Summer 2007 1 LSA 352: Speech Recognition and Synthesis Dan Jurafsky Lecture 1: 1) Overview of Course 2) Refresher: Intro to Probability

1

1LSA 352 Summer 2007

LSA 352:Speech Recognition and Synthesis

Dan Jurafsky

Lecture 1:1) Overview of Course2) Refresher: Intro to Probability3) Language Modeling

IP notice: some slides for today from: Josh Goodman, Dan Klein, Bonnie Dorr, Julia Hirschberg,Sandiway Fong

2LSA 352 Summer 2007

Outline

Overview of CourseProbabilityLanguage Modeling

Language Modeling means “probabilistic grammar”

3LSA 352 Summer 2007

Definitions

Speech RecognitionSpeech-to-Text

– Input: a wavefile,– Output: string of words

Speech SynthesisText-to-Speech

– Input: a string of words– Output: a wavefile

4LSA 352 Summer 2007

Automatic Speech Recognition (ASR)Automatic Speech Understanding (ASU)

ApplicationsDictationTelephone-based Information (directions, airtravel, banking, etc)Hands-free (in car)Second language ('L2') (accent reduction)Audio archive searchingLinguistic research– Automatically computing word durations, etc

5LSA 352 Summer 2007

Applications of SpeechSynthesis/Text-to-Speech (TTS)

GamesTelephone-based Information (directions, air travel,banking, etc)Eyes-free (in car)Reading/speaking for disabledEducation: Reading tutorsEducation: L2 learning

6LSA 352 Summer 2007

Applications of Speaker/LgRecognition

Language recognition for call routingSpeaker Recognition:

Speaker verification (binary decision)– Voice password, telephone assistant

Speaker identification (one of N)– Criminal investigation

Page 2: Applications of Speaker/Lg Recognition · LSA 352 Summer 2007 1 LSA 352: Speech Recognition and Synthesis Dan Jurafsky Lecture 1: 1) Overview of Course 2) Refresher: Intro to Probability

2

7LSA 352 Summer 2007

History: foundational insights1900s-1950s

Automaton:Markov 1911Turing 1936McCulloch-Pitts neuron (1943)

– http://marr.bsee.swin.edu.au/~dtl/het704/lecture10/ann/node1.html

– http://diwww.epfl.ch/mantra/tutorial/english/mcpits/html/Shannon (1948) link between automata and Markov models

Human speech processingFletcher at Bell Labs (1920’s)

Probabilistic/Information-theoretic modelsShannon (1948)

8LSA 352 Summer 2007

Synthesis precursors

Von Kempelen mechanical (bellows, reeds) speechproduction simulacrum1929 Channel vocoder (Dudley)

9LSA 352 Summer 2007

History: Early Recognition

• 1920’s Radio RexCelluloid dog with iron baseheld within house byelectromagnet against force ofspringCurrent to magnet flowedthrough bridge which wassensitive to energy at 500 Hz500 Hz energy caused bridge tovibrate, interrupting current,making dog spring forwardThe sound “e” (ARPAbet [eh])in Rex has 500 Hz component

10LSA 352 Summer 2007

History: early ASR systems

• 1950’s: Early Speech recognizers1952: Bell Labs single-speaker digit recognizer

– Measured energy from two bands (formants)– Built with analog electrical components– 2% error rate for single speaker, isolated digits

1958: Dudley built classifier that used continuous spectrumrather than just formants1959: Denes ASR combining grammar and acousticprobability

1960’sFFT - Fast Fourier transform (Cooley and Tukey 1965)LPC - linear prediction (1968)1969 John Pierce letter “Whither Speech Recognition?”

– Random tuning of parameters,– Lack of scientific rigor, no evaluation metrics– Need to rely on higher level knowledge

11LSA 352 Summer 2007

ASR: 1970’s and 1980’s

Hidden Markov Model 1972Independent application of Baker (CMU) and Jelinek/Bahl/Mercerlab (IBM) following work of Baum and colleagues at IDA

ARPA project 1971-19765-year speech understanding project: 1000 word vocab, continousspeech, multi-speakerSDC, CMU, BBNOnly 1 CMU system achieved goal

1980’s+Annual ARPA “Bakeoffs”Large corpus collection

– TIMIT– Resource Management– Wall Street Journal

12LSA 352 Summer 2007

State of the Art

ASRspeaker-independent, continuous, no noise, world’sbest research systems:

– Human-human speech: ~13-20% Word Error Rate(WER)

– Human-machine speech: ~3-5% WER

TTS (demo next week)

Page 3: Applications of Speaker/Lg Recognition · LSA 352 Summer 2007 1 LSA 352: Speech Recognition and Synthesis Dan Jurafsky Lecture 1: 1) Overview of Course 2) Refresher: Intro to Probability

3

13LSA 352 Summer 2007

LVCSR Overview

Large Vocabulary Continuous (Speaker-Independent)Speech Recognition

Build a statistical model of the speech-to-wordsprocessCollect lots of speech and transcribe all the wordsTrain the model on the labeled speechParadigm: Supervised Machine Learning + Search

14LSA 352 Summer 2007

Unit Selection TTS Overview

Collect lots of speech (5-50 hours) from one speaker,transcribe very carefully, all the syllables and phonesand whatnotTo synthesize a sentence, patch together syllablesand phones from the training data.Paradigm: search

15LSA 352 Summer 2007

Requirements and Grading

Readings:Required Text:Selected chapters on web from

– Jurafsky & Martin, 2000. Speech and Language Processing.– Taylor, Paul. 2007. Text-to-Speech Synthesis.

GradingHomework: 75% (3 homeworks, 25% each)Participation: 25%You may work in groups

16LSA 352 Summer 2007

Overview of the course

http://nlp.stanford.edu/courses/lsa352/

17LSA 352 Summer 2007

6. Introduction to Probability

Experiment (trial)Repeatable procedure with well-defined possible outcomes

Sample Space (S)– the set of all possible outcomes– finite or infinite

Example– coin toss experiment– possible outcomes: S = {heads, tails}

Example– die toss experiment– possible outcomes: S = {1,2,3,4,5,6}

Slides from Sandiway Fong 18LSA 352 Summer 2007

Introduction to Probability

Definition of sample space depends on what we are askingSample Space (S): the set of all possible outcomesExample

– die toss experiment for whether the number is even or odd– possible outcomes: {even,odd}– not {1,2,3,4,5,6}

Page 4: Applications of Speaker/Lg Recognition · LSA 352 Summer 2007 1 LSA 352: Speech Recognition and Synthesis Dan Jurafsky Lecture 1: 1) Overview of Course 2) Refresher: Intro to Probability

4

19LSA 352 Summer 2007

More definitions

Eventsan event is any subset of outcomes from the sample space

Exampledie toss experimentlet A represent the event such that the outcome of the die tossexperiment is divisible by 3A = {3,6}A is a subset of the sample space S= {1,2,3,4,5,6}

ExampleDraw a card from a deck

– suppose sample space S = {heart,spade,club,diamond} (foursuits)

let A represent the event of drawing a heartlet B represent the event of drawing a red cardA = {heart}B = {heart,diamond}

20LSA 352 Summer 2007

Introduction to Probability

Some definitionsCounting

– suppose operation oi can be performed in ni ways, then– a sequence of k operations o1o2...ok– can be performed in n1 × n2 × ... × nk ways

Example– die toss experiment, 6 possible outcomes– two dice are thrown at the same time– number of sample points in sample space = 6 × 6 = 36

21LSA 352 Summer 2007

Definition of Probability

The probability law assigns to an event a nonnegativenumberCalled P(A)Also called the probability AThat encodes our knowledge or belief about thecollective likelihood of all the elements of AProbability law must satisfy certain properties

22LSA 352 Summer 2007

Probability Axioms

NonnegativityP(A) >= 0, for every event A

AdditivityIf A and B are two disjoint events, then theprobability of their union satisfies:P(A U B) = P(A) + P(B)

NormalizationThe probability of the entire sample space S is equalto 1, I.e. P(S) = 1.

23LSA 352 Summer 2007

An example

An experiment involving a single coin tossThere are two possible outcomes, H and TSample space S is {H,T}If coin is fair, should assign equal probabilities to 2 outcomesSince they have to sum to 1P({H}) = 0.5P({T}) = 0.5P({H,T}) = P({H})+P({T}) = 1.0

24LSA 352 Summer 2007

Another example

Experiment involving 3 coin tossesOutcome is a 3-long string of H or TS ={HHH,HHT,HTH,HTT,THH,THT,TTH,TTTT}Assume each outcome is equiprobable

“Uniform distribution”What is probability of the event that exactly 2 heads occur?A = {HHT,HTH,THH}P(A) = P({HHT})+P({HTH})+P({THH})= 1/8 + 1/8 + 1/8=3/8

Page 5: Applications of Speaker/Lg Recognition · LSA 352 Summer 2007 1 LSA 352: Speech Recognition and Synthesis Dan Jurafsky Lecture 1: 1) Overview of Course 2) Refresher: Intro to Probability

5

25LSA 352 Summer 2007

Probability definitions

In summary:

Probability of drawing a spade from 52 well-shuffled playingcards:

26LSA 352 Summer 2007

Probabilities of two events

If two events A and B are independentThen

P(A and B) = P(A) x P(B)

If flip a fair coin twiceWhat is the probability that they are both heads?

If draw a card from a deck, then put it back, draw a card fromthe deck again

What is the probability that both drawn cards are hearts?

A coin is flipped twiceWhat is the probability that it comes up heads both times?

27LSA 352 Summer 2007

How about non-uniformprobabilities? An example

A biased coin,twice as likely to come up tails as heads,is tossed twice

What is the probability that at least one head occurs?Sample space = {hh, ht, th, tt} (h = heads, t = tails)Sample points/probability for the event:

ht 1/3 x 2/3 = 2/9 hh 1/3 x 1/3= 1/9th 2/3 x 1/3 = 2/9 tt 2/3 x 2/3 = 4/9

Answer: 5/9 = ≈0.56 (sum of weights in red)

28LSA 352 Summer 2007

Moving toward language

What’s the probability of drawing a 2 from adeck of 52 cards with four 2s?

What’s the probability of a random word(from a random dictionary page) being averb?!

P(drawing a two) =4

52=

1

13= .077

!

P(drawing a verb) =#of ways to get a verb

all words

29LSA 352 Summer 2007

Probability and part of speech tags

• What’s the probability of a random word (from arandom dictionary page) being a verb?

• How to compute each of these• All words = just count all the words in the dictionary• # of ways to get a verb: number of words which are

verbs!• If a dictionary has 50,000 entries, and 10,000 are

verbs…. P(V) is 10000/50000 = 1/5 = .20

!

P(drawing a verb) =#of ways to get a verb

all words

30LSA 352 Summer 2007

Conditional Probability

A way to reason about the outcome of an experimentbased on partial information

In a word guessing game the first letter for the wordis a “t”. What is the likelihood that the second letteris an “h”?How likely is it that a person has a disease given thata medical test was negative?A spot shows up on a radar screen. How likely is itthat it corresponds to an aircraft?

Page 6: Applications of Speaker/Lg Recognition · LSA 352 Summer 2007 1 LSA 352: Speech Recognition and Synthesis Dan Jurafsky Lecture 1: 1) Overview of Course 2) Refresher: Intro to Probability

6

31LSA 352 Summer 2007

More precisely

Given an experiment, a corresponding sample space S, and aprobability lawSuppose we know that the outcome is within some given eventBWe want to quantify the likelihood that the outcome alsobelongs to some other given event A.We need a new probability law that gives us the conditionalprobability of A given BP(A|B)

32LSA 352 Summer 2007

An intuition

• A is “it’s raining now”.• P(A) in dry California is .01• B is “it was raining ten minutes ago”

• P(A|B) means “what is the probability of it raining now if it wasraining 10 minutes ago”

• P(A|B) is probably way higher than P(A)• Perhaps P(A|B) is .10

• Intuition: The knowledge about B should change our estimate ofthe probability of A.

33LSA 352 Summer 2007

Conditional probability

One of the following 30 items is chosen at randomWhat is P(X), the probability that it is an X?What is P(X|red), the probability that it is an X giventhat it is red?

34LSA 352 Summer 2007

S

Conditional Probability

let A and B be eventsp(B|A) = the probability of event B occurring given event A occursdefinition: p(B|A) = p(A ∩ B) / p(A)

35LSA 352 Summer 2007

Conditional probability

P(A|B) = P(A ∩ B)/P(B)

Or

)(

),()|(

BP

BAPBAP =

A BA,B

Note: P(A,B)=P(A|B) · P(B)Also: P(A,B) = P(B,A)

36LSA 352 Summer 2007

Independence

What is P(A,B) if A and B are independent?

P(A,B)=P(A) · P(B) iff A,B independent.

P(heads,tails) = P(heads) · P(tails) = .5 · .5 = .25

Note: P(A|B)=P(A) iff A,B independentAlso: P(B|A)=P(B) iff A,B independent

Page 7: Applications of Speaker/Lg Recognition · LSA 352 Summer 2007 1 LSA 352: Speech Recognition and Synthesis Dan Jurafsky Lecture 1: 1) Overview of Course 2) Refresher: Intro to Probability

7

37LSA 352 Summer 2007

Bayes Theorem

)(

)()|()|(

AP

BPBAPABP =

• Swap the conditioning

• Sometimes easier to estimate onekind of dependence than the other

38LSA 352 Summer 2007

Deriving Bayes Rule

!

P(B | A) =P(A" B)

P(A)

!

P(A |B) =P(A" B)

P(B)

!

P(B | A)P(A) = P(A" B)

!

P(A |B)P(B) = P(A" B)

!

P(A |B)P(B) = P(B | A)P(A)

!

P(A |B) =P(B | A)P(A)

P(B)

39LSA 352 Summer 2007

Summary

ProbabilityConditional ProbabilityIndependenceBayes Rule

40LSA 352 Summer 2007

How many words?

I do uh main- mainly business data processingFragmentsFilled pauses

Are cat and cats the same word?Some terminology

Lemma: a set of lexical forms having the samestem, major part of speech, and rough word sense

– Cat and cats = same lemma

Wordform: the full inflected surface form.– Cat and cats = different wordforms

41LSA 352 Summer 2007

How many words?

they picnicked by the pool then lay back on the grass and looked at thestars

16 tokens14 types

SWBD:~20,000 wordform types,2.4 million wordform tokens

Brown et al (1992) large corpus583 million wordform tokens293,181 wordform types

Let N = number of tokens, V = vocabulary = number of typesGeneral wisdom: V > O(sqrt(N))

42LSA 352 Summer 2007

Language Modeling

We want to compute P(w1,w2,w3,w4,w5…wn), theprobability of a sequenceAlternatively we want to computeP(w5|w1,w2,w3,w4,w5): the probability of a wordgiven some previous wordsThe model that computes P(W) or P(wn|w1,w2…wn-1) is called the language model.A better term for this would be “The Grammar”But “Language model” or LM is standard

Page 8: Applications of Speaker/Lg Recognition · LSA 352 Summer 2007 1 LSA 352: Speech Recognition and Synthesis Dan Jurafsky Lecture 1: 1) Overview of Course 2) Refresher: Intro to Probability

8

43LSA 352 Summer 2007

Computing P(W)

How to compute this joint probability:

P(“the”,”other”,”day”,”I”,”was”,”walking”,”along”,”and”,”saw”,”a”,”lizard”)

Intuition: let’s rely on the Chain Rule of Probability

44LSA 352 Summer 2007

The Chain Rule of Probability

Recall the definition of conditional probabilities

Rewriting:

More generallyP(A,B,C,D) = P(A)P(B|A)P(C|A,B)P(D|A,B,C)In generalP(x1,x2,x3,…xn) = P(x1)P(x2|x1)P(x3|x1,x2)…P(xn|x1…xn-1)

)(

)^()|(

BP

BAPBAP =

)()|()^( BPBAPBAP =

45LSA 352 Summer 2007

The Chain Rule Applied to jointprobability of words in sentence

P(“the big red dog was”)=

P(the)*P(big|the)*P(red|the big)*P(dog|the big red)*P(was|thebig red dog)

46LSA 352 Summer 2007

Very easy estimate:

How to estimate?P(the|its water is so transparent that)

P(the|its water is so transparent that)=C(its water is so transparent that the)_______________________________C(its water is so transparent that)

47LSA 352 Summer 2007

Unfortunately

There are a lot of possible sentences

We’ll never be able to get enough data to computethe statistics for those long prefixes

P(lizard|the,other,day,I,was,walking,along,and,saw,a)OrP(the|its water is so transparent that)

48LSA 352 Summer 2007

Markov Assumption

Make the simplifying assumptionP(lizard|the,other,day,I,was,walking,along,and,saw,a) = P(lizard|a)

Or maybeP(lizard|the,other,day,I,was,walking,along,and,saw,a) = P(lizard|saw,a)

Page 9: Applications of Speaker/Lg Recognition · LSA 352 Summer 2007 1 LSA 352: Speech Recognition and Synthesis Dan Jurafsky Lecture 1: 1) Overview of Course 2) Refresher: Intro to Probability

9

49LSA 352 Summer 2007

So for each component in the product replace withthe approximation (assuming a prefix of N)

Bigram version

!

P(wn |w1n"1) # P(wn |w

n"N +1

n"1)

Markov Assumption

!

P(wn|w1

n"1) # P(w

n|w

n"1)

50LSA 352 Summer 2007

Estimating bigram probabilities

The Maximum Likelihood Estimate

!

P(wi|w

i"1) =count(w

i"1,wi)

count(wi"1)

!

P(wi|w

i"1) =c(w

i"1,wi)

c(wi"1)

51LSA 352 Summer 2007

An example

<s> I am Sam </s><s> Sam I am </s><s> I do not like green eggs and ham </s>

This is the Maximum Likelihood Estimate, because it is the onewhich maximizes P(Training set|Model)

52LSA 352 Summer 2007

Maximum Likelihood Estimates

The maximum likelihood estimate of some parameter of a modelM from a training set T

Is the estimatethat maximizes the likelihood of the training set T given themodel M

Suppose the word Chinese occurs 400 times in a corpus of amillion words (Brown corpus)What is the probability that a random word from some other textwill be “Chinese”MLE estimate is 400/1000000 = .004

This may be a bad estimate for some other corpus

But it is the estimate that makes it most likely that “Chinese”will occur 400 times in a million word corpus.

53LSA 352 Summer 2007

More examples: BerkeleyRestaurant Project sentences

can you tell me about any good cantoneserestaurants close bymid priced thai food is what i’m looking fortell me about chez panissecan you give me a listing of the kinds of food that areavailablei’m looking for a good place to eat breakfastwhen is caffe venezia open during the day

54LSA 352 Summer 2007

Raw bigram counts

Out of 9222 sentences

Page 10: Applications of Speaker/Lg Recognition · LSA 352 Summer 2007 1 LSA 352: Speech Recognition and Synthesis Dan Jurafsky Lecture 1: 1) Overview of Course 2) Refresher: Intro to Probability

10

55LSA 352 Summer 2007

Raw bigram probabilities

Normalize by unigrams:

Result:

56LSA 352 Summer 2007

Bigram estimates of sentenceprobabilities

P(<s> I want english food </s>) =p(i|<s>) x p(want|I) x p(english|want)

x p(food|english) x p(</s>|food) = .24 x .33 x .0011 x 0.5 x 0.68 =.000031

57LSA 352 Summer 2007

What kinds of knowledge?

P(english|want) = .0011P(chinese|want) = .0065P(to|want) = .66P(eat | to) = .28P(food | to) = 0P(want | spend) = 0P (i | <s>) = .25

58LSA 352 Summer 2007

The Shannon VisualizationMethod

Generate random sentences:Choose a random bigram <s>, w according to its probabilityNow choose a random bigram (w, x) according to its probabilityAnd so on until we choose </s>Then string the words together<s> I

I want want to to eat eat Chinese

Chinese food food </s>

59LSA 352 Summer 2007 60LSA 352 Summer 2007

Shakespeare as corpus

N=884,647 tokens, V=29,066Shakespeare produced 300,000 bigram types out ofV2= 844 million possible bigrams: so, 99.96% of thepossible bigrams were never seen (have zero entriesin the table)Quadrigrams worse: What's coming out looks likeShakespeare because it is Shakespeare

Page 11: Applications of Speaker/Lg Recognition · LSA 352 Summer 2007 1 LSA 352: Speech Recognition and Synthesis Dan Jurafsky Lecture 1: 1) Overview of Course 2) Refresher: Intro to Probability

11

61LSA 352 Summer 2007

The wall street journal is notshakespeare (no offense)

62LSA 352 Summer 2007

Evaluation

We train parameters of our model on a training set.How do we evaluate how well our model works?We look at the models performance on some new dataThis is what happens in the real world; we want toknow how our model performs on data we haven’tseenSo a test set. A dataset which is different than ourtraining setThen we need an evaluation metric to tell us howwell our model is doing on the test set.One such metric is perplexity (to be introducedbelow)

63LSA 352 Summer 2007

Unknown words: Open versusclosed vocabulary tasks

If we know all the words in advancedVocabulary V is fixedClosed vocabulary task

Often we don’t know thisOut Of Vocabulary = OOV wordsOpen vocabulary task

Instead: create an unknown word token <UNK>Training of <UNK> probabilities

– Create a fixed lexicon L of size V– At text normalization phase, any training word not in L changed to

<UNK>– Now we train its probabilities like a normal word

At decoding time– If text input: Use UNK probabilities for any word not in training

64LSA 352 Summer 2007

Evaluating N-gram models

Best evaluation for an N-gramPut model A in a speech recognizerRun recognition, get word error rate (WER)for APut model B in speech recognition, get worderror rate for BCompare WER for A and BIn-vivo evaluation

65LSA 352 Summer 2007

Difficulty of in-vivo evaluation ofN-gram models

In-vivo evaluationThis is really time-consumingCan take days to run an experiment

SoAs a temporary solution, in order to run experimentsTo evaluate N-grams we often use an approximationcalled perplexityBut perplexity is a poor approximation unless the testdata looks just like the training dataSo is generally only useful in pilot experiments(generally is not sufficient to publish)But is helpful to think about.

66LSA 352 Summer 2007

Perplexity

Perplexity is the probability of the test set(assigned by the language model),normalized by the number of words:

Chain rule:

For bigrams:

Minimizing perplexity is the same as maximizing probabilityThe best language model is one that best predicts anunseen test set

Page 12: Applications of Speaker/Lg Recognition · LSA 352 Summer 2007 1 LSA 352: Speech Recognition and Synthesis Dan Jurafsky Lecture 1: 1) Overview of Course 2) Refresher: Intro to Probability

12

67LSA 352 Summer 2007

A totally different perplexityIntuition

How hard is the task of recognizing digits‘0,1,2,3,4,5,6,7,8,9,oh’: easy, perplexity 11 (or if we ignore ‘oh’,perplexity 10)How hard is recognizing (30,000) names at Microsoft. Hard:perplexity = 30,000If a system has to recognize

Operator (1 in 4)Sales (1 in 4)Technical Support (1 in 4)30,000 names (1 in 120,000 each)Perplexity is 54

Perplexity is weighted equivalent branching factor

Slide from Josh Goodman68LSA 352 Summer 2007

Perplexity as branching factor

69LSA 352 Summer 2007

Lower perplexity = better model

Training 38 million words, test 1.5 million words, WSJ

70LSA 352 Summer 2007

Lesson 1: the perils of overfitting

N-grams only work well for word prediction if the testcorpus looks like the training corpus

In real life, it often doesn’tWe need to train robust models, adapt to test set, etc

71LSA 352 Summer 2007

Lesson 2: zeros or not?

Zipf’s Law:A small number of events occur with high frequencyA large number of events occur with low frequencyYou can quickly collect statistics on the high frequency eventsYou might have to wait an arbitrarily long time to get validstatistics on low frequency events

Result:Our estimates are sparse! no counts at all for the vast bulk ofthings we want to estimate!Some of the zeroes in the table are really zeros But others aresimply low frequency events you haven't seen yet. After all,ANYTHING CAN HAPPEN!How to address?

Answer:Estimate the likelihood of unseen N-grams!

Slide adapted from Bonnie Dorr and Julia Hirschberg 72LSA 352 Summer 2007

Smoothing is like Robin Hood:Steal from the rich and give to the poor (inprobability mass)

Slide from Dan Klein

Page 13: Applications of Speaker/Lg Recognition · LSA 352 Summer 2007 1 LSA 352: Speech Recognition and Synthesis Dan Jurafsky Lecture 1: 1) Overview of Course 2) Refresher: Intro to Probability

13

73LSA 352 Summer 2007

Laplace smoothing

Also called add-one smoothingJust add one to all the counts!Very simple

MLE estimate:

Laplace estimate:

Reconstructed counts:

74LSA 352 Summer 2007

Laplace smoothed bigram counts

75LSA 352 Summer 2007

Laplace-smoothed bigrams

76LSA 352 Summer 2007

Reconstituted counts

77LSA 352 Summer 2007

Note big change to counts

C(count to) went from 608 to 238!P(to|want) from .66 to .26!Discount d= c*/c

d for “chinese food” =.10!!! A 10x reductionSo in general, Laplace is a blunt instrumentCould use more fine-grained method (add-k)

But Laplace smoothing not used for N-grams, as we have muchbetter methodsDespite its flaws Laplace (add-k) is however still used to smoothother probabilistic models in NLP, especially

For pilot studiesin domains where the number of zeros isn’t so huge.

78LSA 352 Summer 2007

Better discounting algorithms

Intuition used by many smoothing algorithmsGood-TuringKneser-NeyWitten-Bell

Is to use the count of things we’ve seen once to helpestimate the count of things we’ve never seen

Page 14: Applications of Speaker/Lg Recognition · LSA 352 Summer 2007 1 LSA 352: Speech Recognition and Synthesis Dan Jurafsky Lecture 1: 1) Overview of Course 2) Refresher: Intro to Probability

14

79LSA 352 Summer 2007

Good-Turing: Josh Goodmanintuition

Imagine you are fishingThere are 8 species: carp, perch, whitefish, trout,salmon, eel, catfish, bass

You have caught10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel= 18 fish

How likely is it that next species is new (i.e. catfish orbass)

3/18

Assuming so, how likely is it that next species istrout?

Must be less than 1/18

Slide adapted from Josh Goodman 80LSA 352 Summer 2007

Good-Turing Intuition

Notation: Nx is the frequency-of-frequency-xSo N10=1, N1=3, etc

To estimate total number of unseen speciesUse number of species (words) we’ve seen oncec0

* =c1 p0 = N1/N

All other estimates are adjusted (down) to giveprobabilities for unseen

Slide from Josh Goodman

81LSA 352 Summer 2007

Good-Turing Intuition

Notation: Nx is the frequency-of-frequency-xSo N10=1, N1=3, etc

To estimate total number of unseen speciesUse number of species (words) we’ve seen oncec0

* =c1 p0 = N1/N p0=N1/N=3/18

All other estimates are adjusted (down) to giveprobabilities for unseen

P(eel) = c*(1) = (1+1) 1/ 3 = 2/3

Slide from Josh Goodman 82LSA 352 Summer 2007

83LSA 352 Summer 2007

Bigram frequencies offrequencies and GT re-estimates

84LSA 352 Summer 2007

Complications

In practice, assume large counts (c>k for some k) are reliable:

That complicates c*, making it:

Also: we assume singleton counts c=1 are unreliable, so treat N-gramswith count of 1 as if they were count=0Also, need the Nk to be non-zero, so we need to smooth (interpolate)the Nk counts before computing c* from them

Page 15: Applications of Speaker/Lg Recognition · LSA 352 Summer 2007 1 LSA 352: Speech Recognition and Synthesis Dan Jurafsky Lecture 1: 1) Overview of Course 2) Refresher: Intro to Probability

15

85LSA 352 Summer 2007

Backoff and Interpolation

Another really useful source of knowledgeIf we are estimating:

trigram p(z|xy)but c(xyz) is zero

Use info from:Bigram p(z|y)

Or even:Unigram p(z)

How to combine the trigram/bigram/unigram info?

86LSA 352 Summer 2007

Backoff versus interpolation

Backoff: use trigram if you have it, otherwisebigram, otherwise unigramInterpolation: mix all three

87LSA 352 Summer 2007

Interpolation

Simple interpolation

Lambdas conditional on context:

88LSA 352 Summer 2007

How to set the lambdas?

Use a held-out corpusChoose lambdas which maximize the probability ofsome held-out data

I.e. fix the N-gram probabilitiesThen search for lambda valuesThat when plugged into previous equationGive largest probability for held-out setCan use EM to do this search

89LSA 352 Summer 2007

Katz Backoff

90LSA 352 Summer 2007

Why discounts P* and alpha?

MLE probabilities sum to 1

So if we used MLE probabilities but backed off tolower order model when MLE prob is zeroWe would be adding extra probability massAnd total probability would be greater than 1

Page 16: Applications of Speaker/Lg Recognition · LSA 352 Summer 2007 1 LSA 352: Speech Recognition and Synthesis Dan Jurafsky Lecture 1: 1) Overview of Course 2) Refresher: Intro to Probability

16

91LSA 352 Summer 2007

GT smoothed bigram probs

92LSA 352 Summer 2007

Intuition of backoff+discounting

How much probability to assign to all the zerotrigrams?

Use GT or other discounting algorithm to tell us

How to divide that probability mass among differentcontexts?

Use the N-1 gram estimates to tell us

What do we do for the unigram words not seen intraining?

Out Of Vocabulary = OOV words

93LSA 352 Summer 2007

OOV words: <UNK> word

Out Of Vocabulary = OOV wordsWe don’t use GT smoothing for these

Because GT assumes we know the number of unseen eventsInstead: create an unknown word token <UNK>

Training of <UNK> probabilities– Create a fixed lexicon L of size V– At text normalization phase, any training word not in L changed to

<UNK>– Now we train its probabilities like a normal word

At decoding time– If text input: Use UNK probabilities for any word not in training

94LSA 352 Summer 2007

Practical Issues

We do everything in log spaceAvoid underflow(also adding is faster than multiplying)

95LSA 352 Summer 2007

ARPA format

96LSA 352 Summer 2007

Page 17: Applications of Speaker/Lg Recognition · LSA 352 Summer 2007 1 LSA 352: Speech Recognition and Synthesis Dan Jurafsky Lecture 1: 1) Overview of Course 2) Refresher: Intro to Probability

17

97LSA 352 Summer 2007

Language Modeling Toolkits

SRILMCMU-Cambridge LM Toolkit

98LSA 352 Summer 2007

Google N-Gram Release

99LSA 352 Summer 2007

Google N-Gram Release

serve as the incoming 92serve as the incubator 99serve as the independent 794serve as the index 223serve as the indication 72serve as the indicator 120serve as the indicators 45serve as the indispensable 111serve as the indispensible 40serve as the individual 234

100LSA 352 Summer 2007

Advanced LM stuff

Current best smoothing algorithmKneser-Ney smoothing

Other stuffVariable-length n-gramsClass-based n-grams

– Clustering– Hand-built classes

Cache LMsTopic-based LMsSentence mixture modelsSkipping LMsParser-based LMs

101LSA 352 Summer 2007

Summary

LMN-gramsDiscounting: Good-TuringKatz backoff with Good-Turing discountingInterpolationUnknown wordsEvaluation:

– Entropy, Entropy Rate, Cross Entropy– Perplexity

Advanced LM algorithms