
7 - 1

Chapter 7  Mathematical Foundations

7 - 2

Notions of Probability Theory

• Probability theory deals with predicting how likely it is that something will happen.
• The process by which an observation is made is called an experiment or a trial.
• The collection of basic outcomes (or sample points) for our experiment is called the sample space Ω.
• An event is a subset of the sample space.
• Probabilities are numbers between 0 and 1, where 0 indicates impossibility and 1 certainty.
• A probability function/distribution distributes a probability mass of 1 throughout the sample space:

  $P: \mathcal{F} \rightarrow [0, 1], \qquad P(\Omega) = 1$

7 - 3

Example

• A fair coin is tossed 3 times (an experiment). What is the chance of 2 heads?
  – Sample space: Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
  – Uniform distribution (the probabilistic function): P(basic outcome) = 1/8
  – The event of interest, a subset of Ω: A = {HHT, HTH, THH}
  – P(A) = 3/8
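As a quick check of the enumeration above, here is a minimal Python sketch (not part of the original slides) that builds the sample space with the standard library and recovers P(A) = 3/8:

```python
# Minimal sketch: P(exactly 2 heads in 3 tosses) by enumerating the uniform sample space.
from itertools import product
from fractions import Fraction

omega = list(product("HT", repeat=3))            # sample space: 8 equally likely outcomes
A = [w for w in omega if w.count("H") == 2]      # event of interest: exactly 2 heads
p_A = Fraction(len(A), len(omega))               # uniform distribution: |A| / |Omega|
print(A)      # [('H', 'H', 'T'), ('H', 'T', 'H'), ('T', 'H', 'H')]
print(p_A)    # 3/8
```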

7 - 4

Conditional Probability

• Conditional probabilities measure the probability of events given some knowledge.
• Prior probabilities measure the probabilities of events before we consider our additional knowledge.
• Posterior probabilities are probabilities that result from using our additional knowledge.

Example (three coin tosses):
  Event B: the 1st toss is H.
  Event A: 2 Hs among the 1st, 2nd and 3rd tosses.

  $P(B) = \tfrac{4}{8} = \tfrac{1}{2}, \qquad P(A \cap B) = \tfrac{2}{8}, \qquad P(A|B) = \frac{2/8}{4/8} = \tfrac{1}{2}$
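A small sketch of the same idea in Python (the events A and B are the ones defined on this slide; the helper p() is just a convenience for the uniform distribution):

```python
# Sketch: P(A|B) = P(A ∩ B) / P(B) on the 3-toss sample space.
from itertools import product
from fractions import Fraction

omega = list(product("HT", repeat=3))

A = {w for w in omega if w.count("H") == 2}      # exactly two heads
B = {w for w in omega if w[0] == "H"}            # first toss is heads

def p(event):
    # uniform probability: |event| / |Omega|
    return Fraction(len(event), len(omega))

print(p(B))             # 1/2
print(p(A & B))         # 1/4
print(p(A & B) / p(B))  # 1/2  == P(A|B)
```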

7 - 5

Conditional probability:

  $P(A|B) = \frac{P(A \cap B)}{P(B)}$

The multiplication rule:

  $P(A \cap B) = P(B)\,P(A|B) = P(A)\,P(B|A)$

The chain rule (used in Markov models, …):

  $P(A_1 \cap \cdots \cap A_n) = P(A_1)\,P(A_2|A_1)\,P(A_3|A_1 \cap A_2)\cdots P\Big(A_n \,\Big|\, \bigcap_{i=1}^{n-1} A_i\Big)$

7 - 6

Independence

• The chain rule relates intersection with conditionalization (important to NLP).
• Independence and conditional independence of events are two very important notions in statistics.
  – Independence:  $P(A \cap B) = P(A)\,P(B)$, equivalently $P(A) = P(A|B)$
  – Conditional independence:  $P(A \cap B \mid C) = P(A|C)\,P(B|C)$

7 - 7

Bayes’ Theorem

• Bayes’ Theorem lets us swap the order of dependence between events. This is important when the former quantity is difficult to determine.

  $P(B|A) = \frac{P(A|B)\,P(B)}{P(A)}$

• P(A) is a normalizing constant.

7 - 8

An Application

• Pick the best conclusion c given some evidence e:
  – (1) Evaluate the probability P(c|e)  (unknown)
  – (2) Select the c with the largest P(c|e).
  P(c) and P(e|c) are known:

  $P(c|e) = \frac{P(c)\,P(e|c)}{P(e)}$

• Example:
  – the relative probability of a disease, and how often a symptom is associated with it.

7 - 9

Bayes’ Theorem

  $\arg\max_B \frac{P(A|B)\,P(B)}{P(A)} = \arg\max_B P(A|B)\,P(B)$

(P(A) does not depend on B, so it can be ignored when maximizing over B.)

  $P(A) = P(A \cap B) + P(A \cap \bar{B}) = P(A|B)\,P(B) + P(A|\bar{B})\,P(\bar{B})$

7 - 10

Bayes’ Theorem
(used in the noisy channel model)

If the group of sets $B_1, \ldots, B_n$ partitions A, i.e. $A \subseteq \bigcup_{i=1}^{n} B_i$, $P(A) > 0$, and $B_i \cap B_j = \emptyset$ for $i \ne j$, then

  $P(B_j|A) = \frac{P(A|B_j)\,P(B_j)}{P(A)} = \frac{P(A|B_j)\,P(B_j)}{\sum_{i=1}^{n} P(A|B_i)\,P(B_i)}$

7 - 11

An Example

• A parasitic gap occurs once in 100,000 sentences.
• A complicated pattern matcher attempts to identify sentences with parasitic gaps.
• The answer is positive with probability 0.95 when a sentence has a parasitic gap, and positive with probability 0.005 when it has no parasitic gap.
• When the test says that a sentence contains a parasitic gap, what is the probability that this is true?

  $P(G) = 0.00001, \quad P(\bar{G}) = 0.99999, \quad P(T|G) = 0.95, \quad P(T|\bar{G}) = 0.005$

  $P(G|T) = \frac{P(T|G)\,P(G)}{P(T|G)\,P(G) + P(T|\bar{G})\,P(\bar{G})} = \frac{0.95 \times 0.00001}{0.95 \times 0.00001 + 0.005 \times 0.99999} \approx 0.002$
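The arithmetic on this slide can be reproduced with a few lines of Python; the variable names (p_g, p_t_given_g, …) are ad hoc labels for the slide's quantities:

```python
# Sketch of the slide's Bayes computation for the parasitic-gap test.
p_g = 0.00001              # P(G): a sentence has a parasitic gap
p_t_given_g = 0.95         # P(T|G): test is positive given a gap
p_t_given_not_g = 0.005    # P(T|~G): false positive rate

p_t = p_t_given_g * p_g + p_t_given_not_g * (1 - p_g)   # total probability of a positive test
p_g_given_t = p_t_given_g * p_g / p_t                   # Bayes' theorem
print(round(p_g_given_t, 4))   # ~0.0019, i.e. about 0.002
```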

7 - 12

Random Variables

• A random variable is a function X: Ω → Rⁿ.
  – It lets us talk about probabilities of numeric values that are related to the event space.
• A discrete random variable is a function X: Ω → S, where S is a countable subset of R.
• If X: Ω → {0, 1}, then X is called a Bernoulli trial.
• The probability mass function (pmf) for a random variable X gives the probability that the random variable takes different numeric values:

  $p(x) = P(X = x) = P(A_x), \quad \text{where } A_x = \{\omega \in \Omega : X(\omega) = x\}$

7 - 13

Events: toss two dice and sum their faces.

  Ω = {(1,1), (1,2), …, (1,6), (2,1), (2,2), …, (6,1), (6,2), …, (6,6)}
  S = {2, 3, 4, …, 12},  X: Ω → S

Values of X (first die vs. second die):

  first\second |  1   2   3   4   5   6
  1            |  2   3   4   5   6   7
  2            |  3   4   5   6   7   8
  3            |  4   5   6   7   8   9
  4            |  5   6   7   8   9  10
  5            |  6   7   8   9  10  11
  6            |  7   8   9  10  11  12

  x       |  2     3     4     5    6     7    8     9    10    11    12   | total
  p(X=x)  | 1/36  1/18  1/12  1/9  5/36  1/6  5/36  1/9  1/12  1/18  1/36  |   1

pmf example:  p(3) = p(X=3) = P(A₃) = P({(1,2), (2,1)}) = 2/36,
where A₃ = {ω : X(ω) = 3} = {(1,2), (2,1)}.
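A short sketch that rebuilds this pmf by enumeration (standard library only; the dictionary pmf is an ad-hoc representation, not notation from the slides):

```python
# Sketch: build the pmf of X = sum of two dice by enumerating the sample space.
from itertools import product
from fractions import Fraction
from collections import Counter

omega = list(product(range(1, 7), repeat=2))     # 36 equally likely outcomes
counts = Counter(d1 + d2 for d1, d2 in omega)
pmf = {x: Fraction(c, len(omega)) for x, c in sorted(counts.items())}

print(pmf[3])              # 1/18  (= 2/36: outcomes (1,2) and (2,1))
print(sum(pmf.values()))   # 1     (the probability mass sums to one)
```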

7 - 14

Expectation

• The expectation is the mean (μ) or average of a random variable:

  $E(X) = \sum_x x\,p(x)$

• Example: roll one die and let Y be the value on its face.

  $E(Y) = \sum_{y=1}^{6} y\,p(y) = \tfrac{1}{6}\sum_{y=1}^{6} y = \tfrac{21}{6} = 3\tfrac{1}{2}$

• For a function g of a random variable:  $E(g(Y)) = \sum_y g(y)\,p(y)$
• E(aY + b) = aE(Y) + b
• E(X + Y) = E(X) + E(Y)

7 - 15

Variance

• The variance (σ²) of a random variable is a measure of whether the values of the random variable tend to be consistent over trials or to vary a lot.
• The standard deviation (σ) is the square root of the variance.

  $\mathrm{Var}(X) = E\big((X - E(X))^2\big) = E(X^2) - E^2(X)$

7 - 16

Example
  Y: toss one die; Y is the value on its face.
  X: toss two dice and sum their faces, i.e. X = Y₁ + Y₂, a random variable that is the sum of the numbers on the two dice.

  x       |  2     3     4     5    6     7    8     9    10    11    12
  p(X=x)  | 1/36  1/18  1/12  1/9  5/36  1/6  5/36  1/9  1/12  1/18  1/36

  $E(X) = 2\cdot\tfrac{1}{36} + 3\cdot\tfrac{1}{18} + 4\cdot\tfrac{1}{12} + 5\cdot\tfrac{1}{9} + 6\cdot\tfrac{5}{36} + 7\cdot\tfrac{1}{6} + 8\cdot\tfrac{5}{36} + 9\cdot\tfrac{1}{9} + 10\cdot\tfrac{1}{12} + 11\cdot\tfrac{1}{18} + 12\cdot\tfrac{1}{36} = 7$

  Equivalently, using $E(g(Y)) = \sum_y g(y)\,p(y)$ and linearity:

  $E(X) = E(Y_1 + Y_2) = E(Y_1) + E(Y_2) = 3\tfrac{1}{2} + 3\tfrac{1}{2} = 7$

  $\mathrm{Var}(X) = E\big((X - E(X))^2\big) = \sum_x p(x)\,(x - E(X))^2 = \tfrac{35}{6} = 5\tfrac{5}{6}$
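Continuing the same sketch, expectation and variance can be computed directly from the pmf; the values 7 and 35/6 match the slide:

```python
# Sketch: expectation and variance of X (sum of two dice) from its pmf.
from itertools import product
from fractions import Fraction
from collections import Counter

omega = list(product(range(1, 7), repeat=2))
counts = Counter(d1 + d2 for d1, d2 in omega)
pmf = {x: Fraction(c, len(omega)) for x, c in counts.items()}

E = sum(x * p for x, p in pmf.items())                 # E(X) = sum_x x p(x)
Var = sum(p * (x - E) ** 2 for x, p in pmf.items())    # Var(X) = E((X - E(X))^2)
print(E)     # 7
print(Var)   # 35/6  (= 5 5/6)
```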

7 - 17

Joint and Conditional Distributions

• More than one random variable can be defined over a sample space. In this case, we talk about a joint or multivariate probability distribution.
• The joint probability mass function for two discrete random variables X and Y is: p(x, y) = P(X = x, Y = y).
• The marginal probability mass functions total up the probability masses for the values of each variable separately:

  $p_X(x) = \sum_y p(x, y), \qquad p_Y(y) = \sum_x p(x, y)$

• If X and Y are independent, then  $p(x, y) = p_X(x)\,p_Y(y)$.

7 - 18

Joint and Conditional Distributions

• Similar intersection rules hold for joint distributions as for events:

  $p_{X|Y}(x|y) = \frac{p(x, y)}{p_Y(y)}$

• Chain rule in terms of random variables:

  $p(w, x, y, z) = p(w)\,p(x|w)\,p(y|w, x)\,p(z|w, x, y)$

7 - 19

Estimating Probability Functions

• What is the probability that the sentence “The cow chewed its cud” will be uttered? The unknown P must be estimated from a sample of data.
• An important measure for estimating P is the relative frequency of the outcome, i.e., the proportion of times a certain outcome occurs:

  $f_u = \frac{C(u)}{N}$

• Assuming that certain aspects of language can be modeled by one of the well-known distributions is called using a parametric approach.
• If no such assumption can be made, we must use a non-parametric or distribution-free approach.

7 - 20

Parametric Approach

• Select an explicit probabilistic model.
• Specify a few parameters to determine a particular probability distribution.
• The amount of training data required is not great, and how much data is needed to make good probability estimates can be calculated.

7 - 21

Standard Distributions

• In practice, one commonly finds the same basic form of a probability mass function, but with different constants employed.
• Families of pmfs are called distributions, and the constants that define the different possible pmfs in one family are called parameters.
• Discrete distributions: the binomial distribution, the multinomial distribution, the Poisson distribution.
• Continuous distributions: the normal distribution, the standard normal distribution.

7 - 22

Binomial Distribution

• A series of trials with only two outcomes, each trial independent of all the others.
• The number r of successes out of n trials, given that the probability of success in any trial is p (n and p are the parameters; r is the variable):

  $b(r; n, p) = \binom{n}{r} p^r (1 - p)^{n - r}, \qquad \binom{n}{r} = \frac{n!}{(n - r)!\,r!}, \qquad 0 \le r \le n$

• Expectation: np.  Variance: np(1 − p).
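A minimal sketch of the binomial pmf as defined above, using math.comb; the check against the earlier coin example (2 heads in 3 fair tosses) and the n = 10, p = 0.3 values are illustrative choices, not from the slides:

```python
# Sketch: the binomial pmf b(r; n, p) = C(n, r) p^r (1-p)^(n-r).
from math import comb

def binom_pmf(r, n, p):
    return comb(n, r) * p**r * (1 - p)**(n - r)

print(binom_pmf(2, 3, 0.5))                           # 0.375 == 3/8, the coin example

n, p = 10, 0.3                                        # illustrative parameter choice
mean = sum(r * binom_pmf(r, n, p) for r in range(n + 1))
var = sum((r - mean)**2 * binom_pmf(r, n, p) for r in range(n + 1))
print(round(mean, 6), round(var, 6))                  # 3.0 2.1 == np, np(1-p)
```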


7 - 23

7 - 24

The Normal Distribution

• Two parameters, for the mean μ and the standard deviation σ; the curve is given by

  $n(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x - \mu)^2 / (2\sigma^2)}$

• Standard normal distribution: μ = 0, σ = 1.


7 - 25

7 - 26

Bayesian Statistics I: Bayesian Updating

• Frequentist statistics vs. Bayesian statistics
  – Toss a coin 10 times and get 8 heads: 8/10 (the maximum likelihood estimate).
  – But 8 heads out of 10 just happens sometimes with a small sample.
• Assume that the data come in sequentially and are independent.
• Given an a priori probability distribution, we can update our beliefs when a new datum comes in by calculating the Maximum A Posteriori (MAP) distribution.
• The MAP probability becomes the new prior and the process repeats on each new datum.

7 - 27

Frequentist Statistics

  m: the model that asserts P(head) = m
  s: a particular sequence of observations yielding i heads and j tails

  $P(s \mid m) = m^i (1 - m)^j$

Find the MLE (maximum likelihood estimate), $\arg\max_m P(s \mid m)$, by differentiating this polynomial:

  $f(m) = m^i (1 - m)^j$
  $f'(m) = i\,m^{i-1}(1 - m)^j - j\,m^i (1 - m)^{j-1} = m^{i-1}(1 - m)^{j-1}\big(i(1 - m) - jm\big) = m^{i-1}(1 - m)^{j-1}\big(i - (i + j)m\big)$

Setting f'(m) = 0 gives  $m = \dfrac{i}{i + j}$.

  8 heads and 2 tails: 8/(8+2) = 0.8

7 - 28

Bayesian Statistics

  Belief in the fairness of the coin: an a priori probability distribution favouring a regular, fair one:
  $P(m) = 6\,m(1 - m)$

  A particular sequence of observations s: i heads and j tails:
  $P(s \mid m) = m^i (1 - m)^j$

  New belief in the fairness of the coin after seeing s:
  $P(m \mid s) = \frac{P(s \mid m)\,P(m)}{P(s)} = \frac{m^i (1 - m)^j \cdot 6\,m(1 - m)}{P(s)} = \frac{6\,m^{i+1}(1 - m)^{j+1}}{P(s)}$

  Find $\arg\max_m P(m \mid s)$ by differentiating:
  $f(m) = 6\,m^{i+1}(1 - m)^{j+1}$
  $f'(m) = 6(i + 1)\,m^{i}(1 - m)^{j+1} - 6(j + 1)\,m^{i+1}(1 - m)^{j} = 6\,m^{i}(1 - m)^{j}\big((i + 1)(1 - m) - (j + 1)m\big)$

  Setting f'(m) = 0 gives  $m = \dfrac{i + 1}{i + j + 2}$  (the new prior).

  $\arg\max_m P(m \mid s) = \tfrac{3}{4}$  (when i = 8, j = 2)
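The closed forms on these two slides (MLE i/(i+j) and MAP (i+1)/(i+j+2) under the prior 6m(1−m)) can be checked numerically; the grid search below is an illustrative sketch, not the slides' method:

```python
# Sketch: numerical check of MLE vs. MAP for the coin model with i=8 heads, j=2 tails.
i, j = 8, 2
grid = [k / 10000 for k in range(1, 10000)]   # candidate values of m in (0, 1)

def likelihood(m):            # P(s|m) = m^i (1-m)^j
    return m**i * (1 - m)**j

def prior(m):                 # the slide's a priori distribution P(m) = 6m(1-m)
    return 6 * m * (1 - m)

mle = max(grid, key=likelihood)
map_est = max(grid, key=lambda m: likelihood(m) * prior(m))
print(round(mle, 3), i / (i + j))                 # 0.8  0.8
print(round(map_est, 3), (i + 1) / (i + j + 2))   # 0.75 0.75
```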

7 - 29

Bayesian Statistics II: Bayesian Decision Theory

• Bayesian statistics can be used to evaluate which model or family of models better explains some data.
• We define two different models of the event and calculate the likelihood ratio between these two models.

7 - 30

Entropy

• The entropy is the average uncertainty of a single random variable.
• Let p(x) = P(X = x), where x ∈ X.

  $H(p) = H(X) = -\sum_{x \in X} p(x)\,\log_2 p(x)$

• In other words, entropy measures the amount of information in a random variable. It is normally measured in bits.
• Example: toss two coins and count the number of heads:
  p(0) = 1/4, p(1) = 1/2, p(2) = 1/4, so H(X) = 1.5 bits.

7 - 31

Example

• Roll an 8-sided die:

  $H(X) = -\sum_{i=1}^{8} p(i)\,\log p(i) = -\sum_{i=1}^{8} \tfrac{1}{8}\log\tfrac{1}{8} = -\log\tfrac{1}{8} = \log 8 = 3 \text{ bits}$

• Entropy (another view): the average length of the message needed to transmit an outcome of that variable.

  outcome:  1    2    3    4    5    6    7    8
  code:    001  010  011  100  101  110  111  000

• Optimal code length to send a message of probability p(i):  $-\log p(i)$ bits.

7 - 32

Example

• Problem: send a friend a message that is a number from 0 to 3. How long a message must you send (in terms of number of bits)?
• Example: watch a house with two occupants.

  Case | Situation      | Probability1 | Probability2
  0    | no occupants   | 0.25         | 0.5
  1    | 1st occupant   | 0.25         | 0.125
  2    | 2nd occupant   | 0.25         | 0.125
  3    | both occupants | 0.25         | 0.25

  Case | Probability1 | Code | Probability2 | Code
  0    | 0.25         | 00   | 0.5          | 0
  1    | 0.25         | 01   | 0.125        | 110
  2    | 0.25         | 10   | 0.125        | 111
  3    | 0.25         | 11   | 0.25         | 10

7 - 33

• Variable-length encoding:
  0 – no occupants
  10 – both occupants
  110 – first occupant
  111 – second occupant

• Code tree requirements:
  – (1) all messages are handled;
  – (2) it is clear when one message ends and the next starts;
  – (3) fewer bits for more frequent messages, more bits for less frequent messages.

• Expected length:

  $\tfrac{1}{2}\cdot 1\text{ bit} + \tfrac{1}{4}\cdot 2\text{ bits} + \tfrac{1}{8}\cdot 3\text{ bits} + \tfrac{1}{8}\cdot 3\text{ bits} = 1.75\text{ bits} \;(< 2\text{ bits})$
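A short sketch that verifies the 1.75-bit figure: it computes the entropy of the occupancy distribution and the expected length of the variable-length code above (dictionary keys are ad-hoc labels):

```python
# Sketch: entropy of the two-occupant example vs. expected length of the code 0, 110, 111, 10.
from math import log2

p = {"none": 0.5, "first": 0.125, "second": 0.125, "both": 0.25}
code = {"none": "0", "first": "110", "second": "111", "both": "10"}

H = -sum(q * log2(q) for q in p.values())          # entropy in bits
avg_len = sum(p[w] * len(code[w]) for w in p)      # expected code length in bits
print(H, avg_len)   # 1.75 1.75 -- the code meets the entropy lower bound
```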

7 - 34

  W: random variable for a message
  V(W): possible messages
  P: probability distribution

• Lower bound on the number of bits needed to encode such a message (the entropy of the random variable):

  $H(W) = -\sum_{w \in V(W)} P(w)\,\log P(w)$

  $H(occ) = -\Big(\tfrac{1}{2}\log\tfrac{1}{2} + \tfrac{1}{4}\log\tfrac{1}{4} + \tfrac{1}{8}\log\tfrac{1}{8} + \tfrac{1}{8}\log\tfrac{1}{8}\Big) = 1.75$

• Number of bits actually required:

  $\sum_{w \in V(W)} P(w)\cdot(\text{bits required for } w) = \tfrac{1}{2}(1) + \tfrac{1}{4}(2) + \tfrac{1}{8}(3) + \tfrac{1}{8}(3) = 1.75$

  where the number of bits required for w is $-\log P(w)$.

7 - 35

• (1) Entropy of a message: a lower bound for the average number of bits needed to transmit that message.
• (2) Encoding method: use −log P(w) bits for message w.
• Entropy (another view): a measure of the uncertainty about what a message says.
  – Fewer bits for a more certain message, more bits for a less certain message.

7 - 36

  $H(p) = H(X) = -\sum_{x \in X} p(x)\,\log_2 p(x)$

  $H(X) = \sum_{x \in X} p(x)\,\log\frac{1}{p(x)}$

  Recall $E(X) = \sum_x x\,p(x)$, so entropy is an expectation:

  $H(X) = E\left(\log\frac{1}{p(X)}\right)$

7 - 37

Simplified Polynesian

• Letter probabilities:

  letter:  p    t    k    a    i    u
  prob:   1/8  1/4  1/8  1/4  1/8  1/8

• Per-letter entropy:

  $H(P) = -\sum_{i \in \{p,t,k,a,i,u\}} P(i)\,\log P(i) = -\Big[4\cdot\tfrac{1}{8}\log\tfrac{1}{8} + 2\cdot\tfrac{1}{4}\log\tfrac{1}{4}\Big] = 2\tfrac{1}{2}\text{ bits}$

• A code for the letters:

  letter:  p    t   k    a   i    u
  code:   100  00  101  01  110  111

  Fewer bits are used to send more frequent letters.
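The same kind of check for Simplified Polynesian; the code below recomputes the 2.5-bit per-letter entropy and the expected length of the code in the table:

```python
# Sketch: per-letter entropy of Simplified Polynesian vs. expected code length.
from math import log2

p = {"p": 1/8, "t": 1/4, "k": 1/8, "a": 1/4, "i": 1/8, "u": 1/8}
code = {"p": "100", "t": "00", "k": "101", "a": "01", "i": "110", "u": "111"}

H = -sum(q * log2(q) for q in p.values())
avg_len = sum(p[c] * len(code[c]) for c in p)
print(H, avg_len)   # 2.5 2.5 bits per letter
```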

7 - 38

Joint Entropy and Conditional Entropy

• The joint entropy of a pair of discrete random variables X, Y ~ p(x,y) is the amount of information needed on average to specify both their values:

  $H(X, Y) = -\sum_{x \in X}\sum_{y \in Y} p(x, y)\,\log p(x, y)$

• The conditional entropy of a discrete random variable Y given another X, for X, Y ~ p(x,y), expresses how much extra information you still need to supply on average to communicate Y given that the other party knows X:

  $H(Y|X) = \sum_{x \in X} p(x)\,H(Y \mid X = x) = -\sum_{x \in X}\sum_{y \in Y} p(x)\,p(y|x)\,\log p(y|x) = -\sum_{x \in X}\sum_{y \in Y} p(x, y)\,\log p(y|x)$

7 - 39

Chain rule for entropy

  $H(X, Y) = H(X) + H(Y|X)$
  $H(X_1, \ldots, X_n) = H(X_1) + H(X_2|X_1) + \cdots + H(X_n|X_1, \ldots, X_{n-1})$

Proof:

  $H(X, Y) = -E_{p(x,y)}\big(\log p(x, y)\big) = -E_{p(x,y)}\big(\log(p(x)\,p(y|x))\big) = -E_{p(x,y)}\big(\log p(x) + \log p(y|x)\big)$
  $= -E_{p(x,y)}\big(\log p(x)\big) - E_{p(x,y)}\big(\log p(y|x)\big) = H(X) + H(Y|X)$

7 - 40

Simplified Polynesian Revisited

• Distinction between model and reality.
• Simplified Polynesian has syllable structure: all words consist of sequences of CV (consonant-vowel) syllables.
• A better model in terms of two random variables C and V.

7 - 41

Joint distribution P(C, V) of consonant and vowel within a syllable:

              p      t      k    | P(•,V)
  a          1/16   3/8    1/16  |  1/2
  i          1/16   3/16    0    |  1/4
  u           0     3/16   1/16  |  1/4
  P(C,•)     1/8    3/4    1/8   |

Entropy of the consonant marginal:

  $H(C) = -\Big(\tfrac{1}{8}\log\tfrac{1}{8} + \tfrac{3}{4}\log\tfrac{3}{4} + \tfrac{1}{8}\log\tfrac{1}{8}\Big) = 2\cdot\tfrac{3}{8} + \tfrac{3}{4}(2 - \log 3) = \tfrac{9}{4} - \tfrac{3}{4}\log 3 \approx 1.061\text{ bits}$

7 - 42

  $H(V|C) = \sum_{c \in \{p,t,k\}} p(C = c)\,H(V \mid C = c)$
  $= \tfrac{1}{8}\,H\!\left(\tfrac{1}{2}, \tfrac{1}{2}, 0\right) + \tfrac{3}{4}\,H\!\left(\tfrac{1}{2}, \tfrac{1}{4}, \tfrac{1}{4}\right) + \tfrac{1}{8}\,H\!\left(\tfrac{1}{2}, 0, \tfrac{1}{2}\right)$
  $= \tfrac{1}{8}\cdot 1 + \tfrac{3}{4}\cdot\tfrac{3}{2} + \tfrac{1}{8}\cdot 1 = \tfrac{11}{8} = 1.375\text{ bits}$

  $H(C, V) = H(C) + H(V|C) = 1.061 + 1.375 \approx 2.44\text{ bits}$
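A sketch that recomputes H(C), H(V|C) and H(C,V) directly from the joint table on slide 7-41 (the joint dictionary simply transcribes that table):

```python
# Sketch: entropies of the Simplified Polynesian syllable model from its joint table.
from math import log2

joint = {   # keys: (vowel, consonant), values: P(C=c, V=v)
    ("a", "p"): 1/16, ("a", "t"): 3/8,  ("a", "k"): 1/16,
    ("i", "p"): 1/16, ("i", "t"): 3/16, ("i", "k"): 0.0,
    ("u", "p"): 0.0,  ("u", "t"): 3/16, ("u", "k"): 1/16,
}
p_c = {c: sum(p for (v, c2), p in joint.items() if c2 == c) for c in "ptk"}

H_C = -sum(p * log2(p) for p in p_c.values() if p > 0)
H_CV = -sum(p * log2(p) for p in joint.values() if p > 0)
H_V_given_C = H_CV - H_C                       # chain rule: H(C,V) = H(C) + H(V|C)
print(round(H_C, 3), round(H_V_given_C, 3), round(H_CV, 3))   # 1.061 1.375 2.436
```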

7 - 43

Per-letter probabilities implied by the syllable model (each letter is a consonant or a vowel with probability 1/2):

  letter:  p     t    k     a    i    u
  prob:   1/16  3/8  1/16  1/4  1/8  1/8

  $H(P) = -\Big(\tfrac{1}{16}\log\tfrac{1}{16} + \tfrac{3}{8}\log\tfrac{3}{8} + \tfrac{1}{16}\log\tfrac{1}{16} + \tfrac{1}{4}\log\tfrac{1}{4} + \tfrac{1}{8}\log\tfrac{1}{8} + \tfrac{1}{8}\log\tfrac{1}{8}\Big)$
  $= \tfrac{8}{16} + \tfrac{3}{8}(3 - \log 3) + \tfrac{2}{4} + \tfrac{6}{8} \approx 2.28\text{ bits}$

Compare with the original flat per-letter model:

  letter:  p    t    k    a    i    u
  prob:   1/8  1/4  1/8  1/4  1/8  1/8

  $H(P) = 2\tfrac{1}{2}\text{ bits per letter}, \qquad 2\,H(P) = 2 \times 2\tfrac{1}{2} = 5\text{ bits per syllable}$

7 - 44

Short Summary

• Better understanding means much less uncertainty (2.44 bits < 5 bits).
• An incorrect model’s cross entropy is larger than that of the correct model:

  $H(W_{1,n}) \le H(W_{1,n}, P_M), \qquad H(W_{1,n}, P_M) \stackrel{def}{=} -\sum_{w_{1,n}} P(w_{1,n})\,\log P_M(w_{1,n})$

  where P(W₁,ₙ) is the correct model and P_M(W₁,ₙ) is the approximate model.

7 - 45

Entropy rate: per-letter/word entropy

• The amount of information contained in a message depends on the length of the message.

  $H_{rate} = \frac{1}{n} H(X_{1n}) = -\frac{1}{n}\sum_{x_{1n}} p(x_{1n})\,\log p(x_{1n})$

• The entropy of a human language L:

  $H_{rate}(L) = \lim_{n \to \infty} \frac{1}{n}\,H(X_1, X_2, \ldots, X_n)$

7 - 46

Mutual Information

• By the chain rule for entropy, we have H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y).
• Therefore, H(X) − H(X|Y) = H(Y) − H(Y|X).
• This difference is called the mutual information between X and Y.
• It is the reduction in uncertainty of one random variable due to knowing about another, or, in other words, the amount of information one random variable contains about another.

7 - 47

(Venn diagram: H(X) and H(Y) overlap in I(X;Y); the non-overlapping parts are H(X|Y) and H(Y|X); the union is H(X,Y).)

  $I(X; Y) = H(X) - H(X|Y) = H(X) + H(Y) - H(X, Y)$

  $H(X) + H(Y) - H(X, Y) = \sum_x p(x)\,\log\frac{1}{p(x)} + \sum_y p(y)\,\log\frac{1}{p(y)} + \sum_{x,y} p(x, y)\,\log p(x, y) = \sum_{x,y} p(x, y)\,\log\frac{p(x, y)}{p(x)\,p(y)}$

  where  $p(x) = \sum_y p(x, y)$  and  $p(y) = \sum_x p(x, y)$.
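A small sketch showing the two equivalent computations of I(X;Y); the 2×2 joint distribution is a made-up toy example, not from the slides:

```python
# Sketch: I(X;Y) from the definition and as H(X) + H(Y) - H(X,Y).
from math import log2

joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}   # toy p(x,y)
p_x = {x: sum(p for (x2, y), p in joint.items() if x2 == x) for x in (0, 1)}
p_y = {y: sum(p for (x, y2), p in joint.items() if y2 == y) for y in (0, 1)}

def H(dist):
    return -sum(p * log2(p) for p in dist.values() if p > 0)

I_def = sum(p * log2(p / (p_x[x] * p_y[y])) for (x, y), p in joint.items() if p > 0)
I_alt = H(p_x) + H(p_y) - H(joint)
print(round(I_def, 4), round(I_alt, 4))   # both ~0.2781 bits
```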

7 - 48

• Since I(X; Y) = H(X) − H(X|Y), mutual information is 0 only when the two variables are independent; for two dependent variables, mutual information grows not only with the degree of dependence, but also with the entropy of the variables.

Conditional mutual information:

  $I(X; Y \mid Z) = I((X; Y) \mid Z) = H(X|Z) - H(X|Y, Z)$

Chain rule for mutual information:

  $I(X_{1n}; Y) = I(X_1; Y) + I(X_2; Y \mid X_1) + \cdots + I(X_n; Y \mid X_1, \ldots, X_{n-1}) = \sum_{i=1}^{n} I(X_i; Y \mid X_1, \ldots, X_{i-1})$

7 - 49

Pointwise Mutual Information

  $I(x, y) = \log\frac{p(x, y)}{p(x)\,p(y)}$

Applications: clustering words, word sense disambiguation.

7 - 50

• Clustering by Next Word (Brown et al., 1992)
  1. Each word wᵢ was characterized by the word that immediately followed it:
     c(wᵢ) = <|w₁|, |w₂|, …, |w_W|>, where the j-th component totals how often "… wᵢ wⱼ …" occurs in the corpus.
  2. Define the distance measure on such vectors: mutual information I(x, y), the amount of information one outcome gives us about the other:

  $I(x, y) \stackrel{def}{=} (-\log P(x)) - (-\log P(x|y)) = \log\frac{P(x, y)}{P(x)\,P(y)}$

  (uncertainty of x) − (uncertainty of x once y is known) = the certainty that y gives about x.

7 - 51

• Example: how much information the word “pancake” gives us about the following word “syrup”:

  $I(\text{pancake}; \text{syrup}) = \log\frac{P(\text{pancake}, \text{syrup})}{P(\text{pancake})\,P(\text{syrup})}$

7 - 52

Physical meaning of MI

(1) wᵢ and wⱼ have no particular relation to each other: P(wⱼ|wᵢ) = P(wⱼ), so I(x; y) ≈ 0 — x and y give no information about each other.

  $I(w_i; w_j) = \log\frac{P(w_i, w_j)}{P(w_i)\,P(w_j)} = \log\frac{P(w_j|w_i)}{P(w_j)} = \log\frac{P(w_j)}{P(w_j)} = 0$

7 - 53

Physical meaning of MI

(2) wᵢ and wⱼ are perfectly coordinated: I(x; y) >> 0 — once y is known, it provides a great deal of information about x.

  $I(w_i; w_j) = \log\frac{P(w_i, w_j)}{P(w_i)\,P(w_j)} = \log\frac{P(w_i)}{P(w_i)\,P(w_j)} = \log\frac{1}{P(w_j)}$, a very large number.

7 - 54

Physical meaning of MI

(3) wᵢ and wⱼ are negatively correlated: I(x; y) << 0, a large negative number — the two words avoid appearing together.

7 - 55

The Noisy Channel Model

• Assuming that you want to communicate messages over a channel of restricted capacity, optimize (in terms of throughput and accuracy) the communication in the presence of noise in the channel.
• A channel’s capacity can be reached by designing an input code that maximizes the mutual information between the input and output over all possible input distributions:

  $C = \max_{p(X)} I(X; Y)$

  Message W (from a finite alphabet) → Encoder → X (input to channel) → noisy channel p(y|x) → Y (output from channel) → Decoder → Ŵ (attempt to reconstruct the message based on the output)

7 - 56

A binary symmetric channel: a 1 or a 0 in the input gets flipped on transmission with probability p, and passed through unchanged with probability 1 − p.

  $I(X; Y) = H(Y) - H(Y|X) = H(Y) - H(p)$

  $\max_{p(X)} I(X; Y) = 1 - H(p)$

• The channel capacity is 1 bit only if the entropy H(p) is 0, i.e., if p = 0 the channel reliably transmits a 0 as 0 and a 1 as 1, and if p = 1 it always flips bits.
• The channel capacity is 0 when both 0s and 1s are transmitted with equal probability as 0s and 1s (i.e., p = 1/2): a completely noisy binary channel, since

  $H\!\left(\tfrac{1}{2}\right) = -\tfrac{1}{2}\log\tfrac{1}{2} - \tfrac{1}{2}\log\tfrac{1}{2} = 1$

  where  $H(p) = -\sum_x p(x)\,\log_2 p(x)$.

7 - 57

The noisy channel model in linguistics:

  I → noisy channel p(o|i) → O → Decoder → Î

  $\hat{I} = \arg\max_i p(i|o) = \arg\max_i \frac{p(i)\,p(o|i)}{p(o)} = \arg\max_i p(i)\,p(o|i)$

  p(i): language model;  p(o|i): channel probability.
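A minimal sketch of this decoding rule over a two-candidate set; the language-model and channel probabilities are made-up toy numbers (they echo the big/pig example on slide 7-59), not values from the slides:

```python
# Sketch: decode by argmax_i p(i) * p(o|i) over a small candidate set.
candidates = ["the big dog", "the pig dog"]
lm      = {"the big dog": 0.0008, "the pig dog": 0.0001}   # p(i): toy language model
channel = {"the big dog": 0.3,    "the pig dog": 0.3}      # p(o|i): toy channel probabilities

best = max(candidates, key=lambda i: lm[i] * channel[i])
print(best)   # 'the big dog'
```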

7 - 58

Speech Recognition

• Find the sequence of words W₁,ₙ that maximizes P(W₁,ₙ | Speech Signal):

  $P(W_{1,n} \mid \text{Speech Signal}) = \frac{P(W_{1,n})\,P(\text{Speech Signal} \mid W_{1,n})}{P(\text{Speech Signal})}$

• Maximize P(W₁,ₙ) · P(Speech Signal | W₁,ₙ), where P(W₁,ₙ) is the language model and P(Speech Signal | W₁,ₙ) covers the acoustic aspects of speech.
• Example: if the candidate words for three segments of the signal are W₁ ∈ {a₁, a₂, a₃}, W₂ ∈ {b₁, b₂} and W₃ ∈ {c₁, c₂, c₃, c₄}, then for a candidate sequence such as a₂ b₁ c₄ we evaluate P(a₂, b₁, c₄) · P(Speech Signal | a₂, b₁, c₄).

7 - 59

  The ___ dog …   (big or pig?)

  Assume P(big | the) = P(pig | the).

  P(the big dog) = P(the) P(big | the) P(dog | the big)
  P(the pig dog) = P(the) P(pig | the) P(dog | the pig)

  P(dog | the big) > P(dog | the pig)

  “the big dog” is selected  =>  “dog” selects “big”.


7 - 60

7 - 61

Relative Entropy or Kullback-Leibler Divergence

• For 2 pmfs, p(x) and q(x), their relative entropy is:

  $D(p \| q) = \sum_{x \in X} p(x)\,\log\frac{p(x)}{q(x)}$

• The relative entropy (also known as the Kullback-Leibler divergence) is a measure of how different two probability distributions (over the same event space) are.
• The KL divergence between p and q can also be seen as the average number of bits that are wasted by encoding events from a distribution p with a code based on a not-quite-right distribution q.

  $D(p \| q) \ge 0$, with equality iff p = q, i.e., no bits are wasted when p = q.
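A short sketch of D(p||q) for two small pmfs over the same three-element event space; p and q are arbitrary illustrative distributions:

```python
# Sketch: Kullback-Leibler divergence in bits (log base 2).
from math import log2

def kl(p, q):
    # assumes q(x) > 0 wherever p(x) > 0
    return sum(px * log2(px / q[x]) for x, px in p.items() if px > 0)

p = {"a": 0.5, "b": 0.25, "c": 0.25}
q = {"a": 1/3, "b": 1/3, "c": 1/3}
print(round(kl(p, q), 4))   # 0.085  -- bits wasted by coding p with a code built for q
print(kl(p, p))             # 0.0    -- D(p||p) = 0
```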

7 - 62

  $D(p \| q) = E_p\!\left(\log\frac{p(X)}{q(X)}\right)$

A measure of how far a joint distribution is from independence:

  $I(X; Y) = D\big(p(x, y)\,\|\,p(x)\,p(y)\big)$

Conditional relative entropy:

  $D\big(p(y|x)\,\|\,q(y|x)\big) = \sum_x p(x)\sum_y p(y|x)\,\log\frac{p(y|x)}{q(y|x)}$

Chain rule for relative entropy:

  $D\big(p(x, y)\,\|\,q(x, y)\big) = D\big(p(x)\,\|\,q(x)\big) + D\big(p(y|x)\,\|\,q(y|x)\big)$

Application: measure selectional preferences in selection.

Recall:  $E(X) = \sum_x x\,p(x)$,  $D(p \| q) = \sum_{x \in X} p(x)\,\log\frac{p(x)}{q(x)}$,  $I(X; Y) = \sum_{x,y} p(x, y)\,\log\frac{p(x, y)}{p(x)\,p(y)}$

7 - 63

The Relation to Language: Cross-Entropy

• Entropy can be thought of as a matter of how surprised we will be to see the next word given the previous words we already saw.
• The cross entropy between a random variable X with true probability distribution p(x) and another pmf q (normally a model of p) is given by:

  $H(X, q) = H(X) + D(p \| q) = -\sum_x p(x)\,\log q(x) = E_p\!\left(\log\frac{1}{q(x)}\right)$

• Cross-entropy can help us find out what our average surprise for the next word is.

  $H(X) + D(p \| q) = \sum_x p(x)\,\log\frac{1}{p(x)} + \sum_x p(x)\,\log\frac{p(x)}{q(x)} = \sum_x p(x)\left(\log\frac{1}{p(x)} + \log\frac{p(x)}{q(x)}\right) = \sum_x p(x)\,\log\frac{1}{q(x)}$

7 - 64

Cross Entropy

• Cross entropy of a language L = (Xᵢ) ~ p(x) with respect to a model m:

  $H(L, m) = -\lim_{n \to \infty} \frac{1}{n}\sum_{x_{1n}} p(x_{1n})\,\log m(x_{1n})$

• If the language is “nice”:

  $H(L, m) = -\lim_{n \to \infty} \frac{1}{n}\,\log m(x_{1n})$

• With a large body of utterances available:

  $H(L, m) \approx -\frac{1}{n}\,\log m(x_{1n})$

7 - 65

How much good does the approximate model do?

  $H(W_{1,n}) \le H(W_{1,n}, P_M), \qquad H(W_{1,n}, P_M) \stackrel{def}{=} -\sum_{w_{1,n}} P(w_{1,n})\,\log P_M(w_{1,n})$

  P(W₁,ₙ): correct model;  P_M(W₁,ₙ): approximate model.

7 - 66

Proof that  $H(W_{1,n}) \le H(W_{1,n}, P_M)$:

Let x₁, …, xₙ be the probabilities assigned by the correct model and y₁, …, yₙ those assigned by the approximate model, so that

  $\sum_{i=1}^{n} x_i = 1 \;\;(1), \qquad \sum_{i=1}^{n} y_i = 1 \;\;(2)$

Using the inequality  $\log_e x \le x - 1$  (the line y = x − 1 lies above the curve y = logₑ x):

  $\sum_{i=1}^{n} x_i \log_2\frac{y_i}{x_i} = \frac{1}{\log_e 2}\sum_{i=1}^{n} x_i \log_e\frac{y_i}{x_i} \le \frac{1}{\log_e 2}\sum_{i=1}^{n} x_i\left(\frac{y_i}{x_i} - 1\right) = \frac{1}{\log_e 2}\left(\sum_{i=1}^{n} y_i - \sum_{i=1}^{n} x_i\right) = 0$

7 - 67

  $\sum_{i=1}^{n} x_i \log_2\frac{y_i}{x_i} \le 0 \;\Rightarrow\; \sum_{i=1}^{n} x_i \log_2 y_i - \sum_{i=1}^{n} x_i \log_2 x_i \le 0 \;\Rightarrow\; -\sum_{i=1}^{n} x_i \log_2 x_i \le -\sum_{i=1}^{n} x_i \log_2 y_i$

  Hence  $H(W_{1,n}) \le H(W_{1,n}, P_M)$.

Cross entropy of a language L given a model M:

  $H(L, P_M) \stackrel{def}{=} -\lim_{n \to \infty}\frac{1}{n}\sum_{w_{1,n}} P(w_{1,n})\,\log P_M(w_{1,n})$

  The sum over all w₁,ₙ (“all” English text) is infinite, so in practice we use representative samples of English text.

7 - 68

  $H(L, P_M) = -\lim_{n \to \infty}\frac{1}{n}\sum P(W_{1,n})\,\log P_M(W_{1,n}) = -\lim_{n \to \infty}\frac{1}{n}\,\log P_M(W_{1,n})$

7 - 69

Cross Entropy as a Model Evaluator

• Example: find the best model to produce messages of 20 words.
• Correct probabilistic model:

  Message | Probability | Message | Probability
  M1      | 0.05        | M5      | 0.10
  M2      | 0.05        | M6      | 0.20
  M3      | 0.05        | M7      | 0.20
  M4      | 0.10        | M8      | 0.25

7 - 70

• Per-word cross entropy of an approximate model P:

  $-\frac{1}{20}\big(0.05\log P(M1) + 0.05\log P(M2) + 0.05\log P(M3) + 0.10\log P(M4) + 0.10\log P(M5) + 0.20\log P(M6) + 0.20\log P(M7) + 0.25\log P(M8)\big)$

• A test set of 100 samples, each message independent of the next:

  M1 × 5, M2 × 5, M3 × 5, M4 × 10, M5 × 10, M6 × 20, M7 × 20, M8 × 25

7 - 71

Probability the approximate model P assigns to the whole test set:

  $P_M = \underbrace{P(M1)\cdots P(M1)}_{5\text{ times}} \times \underbrace{P(M2)\cdots P(M2)}_{5\text{ times}} \times \cdots \times \underbrace{P(M8)\cdots P(M8)}_{25\text{ times}}$

  $\log(P_M) = 5\log P(M1) + 5\log P(M2) + 5\log P(M3) + 10\log P(M4) + 10\log P(M5) + 20\log P(M6) + 20\log P(M7) + 25\log P(M8)$

7 - 72

• Per-word cross entropy measured on the test set:

  $-\frac{1}{20 \times 100}\log P_M = -\frac{1}{20}\Big(\tfrac{5}{100}\log P(M1) + \tfrac{5}{100}\log P(M2) + \tfrac{5}{100}\log P(M3) + \tfrac{10}{100}\log P(M4) + \tfrac{10}{100}\log P(M5) + \tfrac{20}{100}\log P(M6) + \tfrac{20}{100}\log P(M7) + \tfrac{25}{100}\log P(M8)\Big)$

• Here the test suite of 100 examples was exactly indicative of the probabilistic model.
• Measuring the cross entropy of a model works only if the test sequence has not been used by the model builder: an open test (outside test) rather than a closed test (inside test).
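A sketch of the per-word cross-entropy computation on these slides; here the model is simply set equal to the true distribution (an assumption for illustration), which is the case that minimizes the cross entropy:

```python
# Sketch: per-word cross entropy of a model on the 100-sample test suite (20-word messages).
from math import log2

true_p = {"M1": .05, "M2": .05, "M3": .05, "M4": .10,
          "M5": .10, "M6": .20, "M7": .20, "M8": .25}
model = dict(true_p)          # assumed model; replace with any other pmf to compare

words_per_msg = 20
H_cross = -sum(p * log2(model[m]) for m, p in true_p.items()) / words_per_msg
print(round(H_cross, 4))      # ~0.1371 bits per word; larger for any incorrect model
```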

7 - 73

The Composition of the Brown and LOB Corpora

• Brown Corpus: the Brown University Standard Corpus of Present-day American English.
• LOB Corpus: the Lancaster-Oslo/Bergen Corpus of British English.

7 - 74

  Text category                                                              | Brown | LOB
  A  Press: reportage                                                        |  44   |  44
  B  Press: editorial                                                        |  27   |  27
  C  Press: reviews                                                          |  17   |  17
  D  Religion                                                                |  17   |  17
  E  Skills and hobbies                                                      |  36   |  38
  F  Popular lore                                                            |  48   |  44
  G  Belles lettres, biography, memoirs, etc.                                |  75   |  77
  H  Miscellaneous (mainly government documents, publications, reports)      |  30   |  30
  J  Learned (including science and technology)                              |  80   |  80
  K  General fiction                                                         |  29   |  29
  L  Mystery and detective fiction                                           |  24   |  24
  M  Science fiction                                                         |   6   |   6
  N  Adventure and western fiction                                           |  29   |  29
  P  Romance and love story                                                  |  29   |  29
  R  Humour                                                                  |   9   |   9

7 - 75

The Entropy of English

• We can model English using n-gram models (also known as Markov chains).
• These models assume limited memory, i.e., we assume that the next word depends only on the previous k ones (kth order Markov approximation):

  $P(X_n = x_n \mid X_{n-1} = x_{n-1}, \ldots, X_1 = x_1) = P(X_n = x_n \mid X_{n-1} = x_{n-1}, \ldots, X_{n-k} = x_{n-k})$

7 - 76

  $P(w_{1,n}) = P(W_1 = w_1, W_2 = w_2, \ldots, W_n = w_n) = P(w_1)\,P(w_2|w_1)\,P(w_3|w_{1,2})\cdots P(w_n|w_{1,n-1})$

bigram:

  $P(W_1 = w_1, \ldots, W_n = w_n) = P(w_1)\,P(w_2|w_1)\,P(w_3|w_2)\cdots P(w_n|w_{n-1})$

trigram:

  $P(W_1 = w_1, \ldots, W_n = w_n) = P(w_1)\,P(w_2|w_1)\,P(w_3|w_{1,2})\cdots P(w_n|w_{n-2,n-1})$

7 - 77

The Entropy of English

• What is the entropy of English?

  Model                 | Cross entropy (bits)
  zeroth order          | 4.76  (uniform model, so log₂ 27)
  first order           | 4.07
  second order          | 2.8
  Shannon’s experiment  | 1.3 (1.34)  (Cover and Thomas 1991: 140)

7 - 78

Perplexity

• A measure related to the notion of cross-entropy and used in the speech recognition community is called the perplexity:

  $\mathrm{perplexity}(x_{1n}, m) = 2^{H(x_{1n}, m)} = m(x_{1n})^{-\frac{1}{n}}$

• A perplexity of k means that you are as surprised on average as you would have been if you had had to guess between k equiprobable choices at each step.
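A final sketch relating the two quantities: given an assumed model probability for an n-word test sequence, the per-word cross entropy and the perplexity follow directly:

```python
# Sketch: perplexity = 2^(per-word cross entropy) = m(x_1n)^(-1/n).
from math import log2

n = 10
m_prob = 1e-6                     # assumed model probability of the whole n-word sequence
H = -log2(m_prob) / n             # per-word cross entropy estimated on the sample
perplexity = 2 ** H
print(round(H, 3), round(perplexity, 3))   # 1.993  3.981  (== m_prob ** (-1/n))
```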