
Page 1: Probabilistic Model of Sequences

Probabilistic Model of Sequences

Ata Kaban

The University of Birmingham

Page 2: Probabilistic Model of Sequences

Sequence

• Example 1: a b a c a b a b a c
• Example 2: 1 0 0 1 1 0 1 0 0 1
• Example 3: 1 2 3 4 5 6 1 2 3 4 5 6 1 2 3
• Roll a six-sided die N times. You get a sequence.
• Roll it again: you get another sequence.
• Here is a sequence of characters; can you see it?
• What is a sequence?
• Alphabet1 = {a, b, c}, Alphabet2 = {0, 1}, Alphabet3 = {1, 2, 3, 4, 5, 6}

Page 3: Probabilistic Model of Sequences

Probabilistic Model

• Model = a system that simulates the sequence under consideration

• Probabilistic model = a model that produces different outcomes with different probabilities
  – It includes uncertainty
  – It can therefore simulate a whole class of sequences and assigns a probability to each individual sequence

• Could you simulate any of the sequences on the previous slide?

Page 4: Probabilistic Model of Sequences

Random sequence model

• Back to the die example (the die can possibly be loaded)
  – A model of one roll has 6 parameters: p1, p2, p3, p4, p5, p6
  – Here, p_i is the probability of throwing i
  – To be probabilities, these must be non-negative and must sum to one.
  – What is the probability of the sequence [1, 6, 3]? p1*p6*p3

• NOTE: in the random sequence model, the individual symbols in a sequence do not depend on each other. This is the simplest sequence model.
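A minimal sketch of this model in Python (the die probabilities below are illustrative values for a slightly loaded die, not taken from the slides):

```python
# Random sequence model: symbols are independent, so the probability of a
# sequence is just the product of the individual symbol probabilities.
p = {1: 0.20, 2: 0.15, 3: 0.15, 4: 0.15, 5: 0.15, 6: 0.20}  # must sum to 1

def sequence_probability(seq, p):
    """P(seq) under the random sequence model: product of p over the symbols."""
    prob = 1.0
    for symbol in seq:
        prob *= p[symbol]
    return prob

print(sequence_probability([1, 6, 3], p))  # p1 * p6 * p3 = 0.2 * 0.2 * 0.15
```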

Page 5: Probabilistic Model of Sequences

Maximum Likelihood parameter estimation

• The parameters of a probabilistic model are typically estimated from a large set of trusted examples, called a training set.

• Example (t = tail, h = head): [t t t h t h h t]
  – Count up the frequencies: t: 5, h: 3
  – Compute probabilities: p(t) = 5/(5+3), p(h) = 3/(5+3)
  – These are the Maximum Likelihood (ML) estimates of the parameters of the coin.
  – Does it make sense?
  – What if you know the coin is fair?
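The counting recipe above is simple to implement; a minimal sketch:

```python
from collections import Counter

def ml_estimates(seq):
    """Maximum Likelihood estimates: the relative frequency of each symbol."""
    counts = Counter(seq)
    total = sum(counts.values())
    return {symbol: n / total for symbol, n in counts.items()}

print(ml_estimates(list("ttththht")))  # {'t': 0.625, 'h': 0.375}, i.e. 5/8 and 3/8
```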

Page 6: Probabilistic Model of Sequences

Overfitting

• A fair coin has probabilities p(t) = 0.5, p(h) = 0.5

• If you throw it 3 times and get [t, t, t], then the ML estimates for this sequence are p(t)=1, p(h)=0.

• Consequently, from these estimates, the probability of e.g. the sequence [h, t, h, t] = ………….

• This is an example of what is called overfitting. Overfitting is the greatest enemy of Machine Learning!

• Solution 1: get more data

• Solution 2: build what you already know into the model (we will return to this during the module)
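For Solution 2, one standard way (an assumption here; the slides do not commit to a specific method) is to add pseudo-counts that encode the prior belief that both outcomes are possible:

```python
from collections import Counter

def smoothed_estimates(seq, alphabet, pseudo_count=1):
    """ML estimates with pseudo-counts added, so no symbol gets probability 0."""
    counts = Counter(seq)
    total = len(seq) + pseudo_count * len(alphabet)
    return {s: (counts[s] + pseudo_count) / total for s in alphabet}

print(smoothed_estimates(list("ttt"), ["t", "h"]))
# {'t': 0.8, 'h': 0.2} instead of the overfitted {'t': 1.0, 'h': 0.0}
```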

Page 7: Probabilistic Model of Sequences

Why is it called Maximum Likelihood?

• It can be shown that using the frequencies to compute probabilities maximises the probability of the observed data given the model parameters, P(Data|parameters), which is called the likelihood.

Page 8: Probabilistic Model of Sequences

Probabilities

• Have two dice, D1 and D2
• The probability of rolling i given die D1 is called P(i|D1). This is a conditional probability.
• Pick a die at random with probability P(Dj), j = 1 or 2. The probability of picking die Dj and rolling i is called a joint probability and is P(i, Dj) = P(Dj) P(i|Dj).
• For any events X and Y, P(X,Y) = P(X|Y) P(Y)
• If we know P(X,Y), then the so-called marginal probability P(X) can be computed as

$$P(X) = \sum_{Y} P(X, Y)$$
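These rules in code, for the two-dice example (the dice probabilities are made-up illustrative numbers):

```python
# Marginal probability: P(i) = sum over j of P(Dj) * P(i | Dj).
P_D = {"D1": 0.5, "D2": 0.5}               # probability of picking each die
P_i_given_D = {"D1": [1/6] * 6,            # D1 is fair
               "D2": [0.1] * 5 + [0.5]}    # D2 is loaded towards 6

def marginal(i):
    """P(i) = sum_j P(i, Dj) = sum_j P(Dj) * P(i | Dj)."""
    return sum(P_D[d] * P_i_given_D[d][i - 1] for d in P_D)

print(marginal(6))  # 0.5 * 1/6 + 0.5 * 0.5 ≈ 0.333
```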

Page 9: Probabilistic Model of Sequences

• Now, we show that maximising P(Data|parameters) for the random sequence model leads to the frequency-based computation that we did intuitively.

Notations:
- $S$: sequence of symbols $s_1, \dots, s_L$
- $L$: length of the sequence
- $T$: size of the alphabet
- $p(1), \dots, p(T)$: parameters (probabilities)
- $x_t$: frequency of occurrence of symbol $t$, for $t = 1, \dots, T$

Maximise
$$P(S \mid p(1), \dots, p(T)) = \prod_{l=1}^{L} p(s_l) = \prod_{t=1}^{T} p(t)^{x_t}$$

Equivalently, maximise the logarithm of the likelihood:
$$\log P(S \mid p(1), \dots, p(T)) = \sum_{l=1}^{L} \log p(s_l) = \sum_{t=1}^{T} x_t \log p(t)$$

Remember that the $p(t)$ need to be probabilities, so $\sum_{t=1}^{T} p(t) = 1$. This constraint can be imposed by adding a Lagrangian term $\lambda \left( \sum_{t=1}^{T} p(t) - 1 \right)$.

Page 10: Probabilistic Model of Sequences

Therefore we need to maximise
$$\mathrm{Obj} = \sum_{t=1}^{T} x_t \log p(t) + \lambda \left( \sum_{t=1}^{T} p(t) - 1 \right)$$

Remember, at a maximum, the derivative of a function is zero:
$$\frac{\partial\, \mathrm{Obj}}{\partial\, p(t)} = \frac{x_t}{p(t)} + \lambda = 0$$

Now, to compute $\lambda$, multiply both sides by $p(t)$ and add up both sides over $t = 1, \dots, T$:
$$\sum_{t=1}^{T} x_t + \lambda \sum_{t=1}^{T} p(t) = 0, \quad \text{so} \quad \lambda = -\sum_{t=1}^{T} x_t$$

So,
$$p(t) = \frac{x_t}{\sum_{t=1}^{T} x_t}$$

Why did we bother?

Because in more complicated models we cannot ‘guess’ the result.
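A quick numerical sanity check of the result, using the coin sequence from earlier (the grid search is purely illustrative):

```python
import math

# For [t t t h t h h t]: x_t = 5, x_h = 3; log-likelihood as a function of p(t).
x_t, x_h = 5, 3
log_likelihood = lambda p: x_t * math.log(p) + x_h * math.log(1 - p)

# The maximiser over a fine grid coincides with the frequency estimate.
best = max((i / 1000 for i in range(1, 1000)), key=log_likelihood)
print(best)  # 0.625, i.e. 5/8
```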

Page 11: Probabilistic Model of Sequences

Markov Chains

• Further examples of sequences:
  – Bio-sequences
  – Web page request sequences while browsing

• These are no longer random sequences; they have a time structure.

• How many parameters would such a model have?

• We need to make simplifying assumptions to end up with a reasonable number of parameters

• The first-order Markov assumption: each observation depends only on the immediately previous one, not on the longer history

• Markov Chain = sequence model which makes the Markov assumption

$$P(s_l \mid s_1, \dots, s_{l-1}) = P(s_l \mid s_{l-1})$$

Page 12: Probabilistic Model of Sequences

Markov Chains

• The probability of a Markov sequence:

$$P(S) = P(s_1)\, P(s_2 \mid s_1)\, P(s_3 \mid s_2) \cdots P(s_L \mid s_{L-1}) = P(s_1) \prod_{l=2}^{L} P(s_l \mid s_{l-1})$$

• The alphabet's symbols are also called states
• Once the parameters are estimated from training data, the Markov chain can be used for prediction
• Amongst others, Markov Chains are successful for web browsing behaviour prediction
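A sketch of this computation, reusing the rain/no-rain numbers from the "Simple example" slide a little further on:

```python
def markov_sequence_probability(seq, initial, transition):
    """P(S) = P(s1) * product over l of P(s_l | s_{l-1})."""
    prob = initial[seq[0]]
    for prev, curr in zip(seq, seq[1:]):
        prob *= transition[(prev, curr)]   # transition[(j, i)] = P(i | j)
    return prob

# Two-state chain; the initial probabilities are an assumption.
initial = {"rain": 0.5, "dry": 0.5}
transition = {("rain", "rain"): 0.8, ("rain", "dry"): 0.2,
              ("dry", "rain"): 0.6, ("dry", "dry"): 0.4}
print(markov_sequence_probability(["rain", "rain", "dry"], initial, transition))
# 0.5 * 0.8 * 0.2 = 0.08
```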

Page 13: Probabilistic Model of Sequences

Markov Chains

• A Markov Chain is stationary if it has the same transition probabilities at every time step.

• We assume stationary models here.

• Then the parameters of the model consist of the transition probability matrix & initial state probabilities.

$$P(s_t = i \mid s_{t-1} = j), \quad \text{shorthand: } P(i \mid j)$$

Page 14: Probabilistic Model of Sequences

ML parameter estimation

• We can derive how to compute the parameters of a Markov Chain from data, using Maximum Likelihood, as we did for random sequences.

• The ML estimate of the transition matrix will again be very intuitive:

$$\hat{P}(i \mid j) = \frac{x_{ij}}{\sum_{i=1}^{T} x_{ij}}$$

where $x_{ij}$ is the number of observed transitions from state $j$ to state $i$. Remember that $\sum_{i=1}^{T} P(i \mid j) = 1$, for all $j$.
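A sketch of this counting in Python (the toy sequence is made up; the exercise on a later slide applies the same recipe):

```python
from collections import Counter, defaultdict

def estimate_transition_matrix(seq):
    """ML estimate: P(i | j) = x_ij / sum_i x_ij, with x_ij counting j -> i."""
    counts = Counter(zip(seq, seq[1:]))     # x_ij: count of transitions j -> i
    totals = defaultdict(int)
    for (j, _), n in counts.items():
        totals[j] += n
    return {(j, i): n / totals[j] for (j, i), n in counts.items()}

print(estimate_transition_matrix(list("ababbab")))
# e.g. P(b | a) = 1.0: every 'a' in this sequence is followed by 'b'
```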

Page 15: Probabilistic Model of Sequences

Simple example

• If it is raining today, it will rain tomorrow with probability 0.8; this implies the contrary has probability 0.2
• If it is not raining today, it will rain tomorrow with probability 0.6; this implies the contrary has probability 0.4
• Build the transition matrix:
• Be careful which numbers need to sum to one and which don't. Such a matrix is called a stochastic matrix.
• Q: It rained all week, including today. What does this model predict for tomorrow? Why? What does it predict for a day from tomorrow? (*Homework)

$$\begin{pmatrix} P(\text{rain tomorrow} \mid \text{rain today}) & P(\text{no rain tomorrow} \mid \text{rain today}) \\ P(\text{rain tomorrow} \mid \text{no rain today}) & P(\text{no rain tomorrow} \mid \text{no rain today}) \end{pmatrix} = \begin{pmatrix} 0.8 & 0.2 \\ 0.6 & 0.4 \end{pmatrix}$$
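The mechanics of prediction with this matrix, as a sketch (running it answers the homework questions, so try them by hand first):

```python
# k-step-ahead prediction: repeatedly multiply the current state distribution
# by the transition matrix. Index 0 = rain, index 1 = no rain.
T = [[0.8, 0.2],
     [0.6, 0.4]]

def step(dist, T):
    """One day ahead: new_dist[i] = sum_j dist[j] * T[j][i]."""
    return [sum(dist[j] * T[j][i] for j in range(len(T))) for i in range(len(T))]

dist = [1.0, 0.0]          # it rained today
for day in (1, 2):         # tomorrow, then the day after tomorrow
    dist = step(dist, T)
    print(day, dist)
```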

Page 16: Probabilistic Model of Sequences

Examples of Web Applications

• HTTP request prediction:

– To predict the probabilities of the next requests from the same user based on the history of requests from that client.

• Adaptive Web navigation:

– To build a navigation agent which suggests which other links would be of interest to the user based on the statistics of previous visits.

– The predicted link does not strictly have to be a link present in the Web page currently being viewed.

• Tour generation:

– Takes a starting URL as input and generates a sequence of states (or URLs) using the Markov chain process.

Page 17: Probabilistic Model of Sequences

Building Markov Models from Web Log Files

• A Web log file is a collection of records of user requests for documents on a Web site; an example:

177.21.3.4 - - [04/Apr/1999:00:01:11 +0100] "GET /studaffairs/ccampus.html HTTP/1.1" 200 5327 "http://www.ulst.ac.uk/studaffairs/accomm.html" "Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)"

• The transition matrix can be seen as a graph
  – Link pair: (r - referrer, u - requested page, w - hyperlink weight)
  – Link graph: it is called the state diagram of the Markov Chain
    • a directed weighted graph
    • a hierarchy from the homepage down to multiple levels
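A hedged sketch of extracting (referrer, requested page) pairs from such records; the regular expression is an assumption about the exact log format, which varies between servers:

```python
import re
from collections import Counter

# Matches the requested path and the referrer in a combined-format log line.
LOG_PATTERN = re.compile(r'"GET (\S+) [^"]*" \d+ \d+ "([^"]*)"')

def link_pair_counts(log_lines):
    """Count (r, u) pairs: these are the hyperlink weights w of the link graph."""
    counts = Counter()
    for line in log_lines:
        m = LOG_PATTERN.search(line)
        if m:
            u, r = m.group(1), m.group(2)
            counts[(r, u)] += 1
    return counts

line = ('177.21.3.4 - - [04/Apr/1999:00:01:11 +0100] '
        '"GET /studaffairs/ccampus.html HTTP/1.1" 200 5327 '
        '"http://www.ulst.ac.uk/studaffairs/accomm.html" "Mozilla/4.0 (...)"')
print(link_pair_counts([line]))
```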

Page 18: Probabilistic Model of Sequences

Link Graph: an example (University of Ulster site)

[Figure: a directed weighted link graph. The root is the University of Ulster homepage (entered from a Start state), with nodes such as Department, Student Information, CS, Science & Arts, International Office, Library, Undergraduate, Graduate, Jobs and Register, and an Exit state; each arrow is labelled with its transition count (e.g. 9000, 4500, 2700, 1800).]

State diagram:

- Nodes: states

- Weighted arrows: number of transitions

Zhu et al. 2002

Page 19: Probabilistic Model of Sequences

Experimental Results (Sarukkai, 2000)

• Simulations:

– ‘Correct link’ refers to the actual link chosen at the next step.

– ‘Depth of the correct link’ is measured by counting the number of links which have a probability greater than or equal to that of the correct link.

– Over 70% of correct links are in the top 20 scoring states.

– Difficulties: very large state space

Page 20: Probabilistic Model of Sequences

Simple exercise

• Build the Markov transition matrix of the following sequence:

[a b b a c a b c b b d e e d e d e d]

State space: {…………….}

$$\begin{pmatrix} P(a|a) & P(b|a) & P(c|a) & P(d|a) & P(e|a) \\ P(a|b) & P(b|b) & P(c|b) & P(d|b) & P(e|b) \\ P(a|c) & P(b|c) & P(c|c) & P(d|c) & P(e|c) \\ P(a|d) & P(b|d) & P(c|d) & P(d|d) & P(e|d) \\ P(a|e) & P(b|e) & P(c|e) & P(d|e) & P(e|e) \end{pmatrix}$$

Page 21: Probabilistic Model of Sequences

Further topics

• Hidden Markov Model
  – Does not make the Markov assumption on the observed sequence
  – Instead, it assumes that the observed sequence was generated by another sequence which is unobservable (hidden), and this other sequence is assumed to be Markovian
  – More powerful
  – Estimation is more complicated

• Aggregate Markov model
  – Useful for clustering sub-graphs of a transition graph
  – Factorises the transition probabilities through $K$ hidden classes:

$$P(s_t \mid s_{t-1}) = \sum_{k=1}^{K} P(s_t \mid k)\, P(k \mid s_{t-1})$$

Page 22: Probabilistic Model of Sequences

HMM at an intuitive level

• Suppose that we know all the parameters of the following HMM, as shown on the state-diagram below. What is the probability of observing the sequence [A,B] if the initial state is S1? The same question if the initial state is chosen randomly with equal probabilities.

ANSWER:

If the initial state is S1: 0.2*(0.4*0.8+0.6*0.7) = 0.148.

In the second case: 0.5*0.148+0.5*0.3*(0.3*0.7+0.7*0.8) = 0.1895.
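The same numbers fall out of the forward algorithm, the standard way to compute observation probabilities in an HMM. The parameters below are inferred from the worked answer, since the state diagram itself is not reproduced in this transcript:

```python
# Two-state HMM; parameters reconstructed from the worked answer above.
A = {("S1", "S1"): 0.4, ("S1", "S2"): 0.6,   # transition probabilities
     ("S2", "S1"): 0.7, ("S2", "S2"): 0.3}
E = {("S1", "A"): 0.2, ("S1", "B"): 0.8,     # emission probabilities
     ("S2", "A"): 0.3, ("S2", "B"): 0.7}
states = ["S1", "S2"]

def forward_probability(obs, init):
    """P(obs): alpha[s] = P(o_1..o_l, state_l = s), summed out at the end."""
    alpha = {s: init[s] * E[(s, obs[0])] for s in states}
    for o in obs[1:]:
        alpha = {s: sum(alpha[r] * A[(r, s)] for r in states) * E[(s, o)]
                 for s in states}
    return sum(alpha.values())

print(forward_probability(["A", "B"], {"S1": 1.0, "S2": 0.0}))  # 0.148
print(forward_probability(["A", "B"], {"S1": 0.5, "S2": 0.5}))  # 0.1895
```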

Page 23: Probabilistic Model of Sequences

Conclusions

• Probabilistic Model

• Maximum Likelihood parameter estimation

• Random sequence model

• Markov chain model

---------------------------------

• Hidden Markov Model

• Aggregate Markov Model

Page 24: Probabilistic Model of Sequences

Any questions?