
Page 1: Probabilistic Model of Sequences

Probabilistic Model of Sequences

Ata Kaban

The University of Birmingham

Page 2: Probabilistic Model of Sequences

Sequence

• Example 1: a b a c a b a b a c
• Example 2: 1 0 0 1 1 0 1 0 0 1
• Example 3: 1 2 3 4 5 6 1 2 3 4 5 6 1 2 3
• Roll a six-sided die N times. You get a sequence.
• Roll it again: you get another sequence.
• Here is a sequence of characters; can you see it?
• What is a sequence?
• Alphabet1 = {a, b, c}, Alphabet2 = {0, 1}, Alphabet3 = {1, 2, 3, 4, 5, 6}

Page 3: Probabilistic Model of Sequences

Probabilistic Model

• Model = a system that simulates the sequence under consideration

• Probabilistic model = a model that produces different outcomes with different probabilities
  – It includes uncertainty
  – It can therefore simulate a whole class of sequences and assigns a probability to each individual sequence

• Could you simulate any of the sequences on the previous slide?

Page 4: Probabilistic Model of Sequences

Random sequence model

• Back to the die example (the die can possibly be loaded)
  – A model of one roll has 6 parameters: p1, p2, p3, p4, p5, p6
  – Here, p_i is the probability of throwing i
  – To be probabilities, these must be non-negative and must sum to one.
  – What is the probability of the sequence [1, 6, 3]? p1*p6*p3

• NOTE: in the random sequence model, the individual symbols in a sequence do not depend on each other. This is the simplest sequence model.
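A minimal sketch of this model in Python (the die probabilities below are illustrative values for a slightly loaded die, not taken from the slides):

```python
# Random sequence model: symbols are independent, so the probability of a
# sequence is just the product of the individual symbol probabilities.
p = {1: 0.20, 2: 0.15, 3: 0.15, 4: 0.15, 5: 0.15, 6: 0.20}  # must sum to 1

def sequence_probability(seq, p):
    """P(seq) under the random sequence model: product of p over the symbols."""
    prob = 1.0
    for symbol in seq:
        prob *= p[symbol]
    return prob

print(sequence_probability([1, 6, 3], p))  # p1 * p6 * p3 = 0.2 * 0.2 * 0.15
```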

Page 5: Probabilistic Model of Sequences

Maximum Likelihood parameter estimation

• The parameters of a probabilistic model are typically estimated from a large set of trusted examples, called a training set.

• Example (t = tail, h = head): [t t t h t h h t]
  – Count up the frequencies: t: 5, h: 3
  – Compute probabilities: p(t) = 5/(5+3), p(h) = 3/(5+3)
  – These are the Maximum Likelihood (ML) estimates of the parameters of the coin.
  – Does it make sense?
  – What if you know the coin is fair?
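The counting recipe above is simple to implement; a minimal sketch:

```python
from collections import Counter

def ml_estimates(seq):
    """Maximum Likelihood estimates: the relative frequency of each symbol."""
    counts = Counter(seq)
    total = sum(counts.values())
    return {symbol: n / total for symbol, n in counts.items()}

print(ml_estimates(list("ttththht")))  # {'t': 0.625, 'h': 0.375}, i.e. 5/8 and 3/8
```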

Page 6: Probabilistic Model of Sequences

Overfitting

• A fair coin has probabilities p(t) = 0.5, p(h) = 0.5

• If you throw it 3 times and get [t, t, t], then the ML estimates for this sequence are p(t)=1, p(h)=0.

• Consequently, from these estimates, the probability of e.g. the sequence [h, t, h, t] = ………….

• This is an example of what is called overfitting. Overfitting is the greatest enemy of Machine Learning!

• Solution 1: get more data

• Solution 2: build what you already know into the model (we will return to this during the module)
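For Solution 2, one standard way (an assumption here; the slides do not commit to a specific method) is to add pseudo-counts that encode the prior belief that both outcomes are possible:

```python
from collections import Counter

def smoothed_estimates(seq, alphabet, pseudo_count=1):
    """ML estimates with pseudo-counts added, so no symbol gets probability 0."""
    counts = Counter(seq)
    total = len(seq) + pseudo_count * len(alphabet)
    return {s: (counts[s] + pseudo_count) / total for s in alphabet}

print(smoothed_estimates(list("ttt"), ["t", "h"]))
# {'t': 0.8, 'h': 0.2} instead of the overfitted {'t': 1.0, 'h': 0.0}
```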

Page 7: Probabilistic Model of Sequences

Why is it called Maximum Likelihood?

• It can be shown that using the frequencies to compute probabilities maximises the probability of the observed data given the model parameters, P(Data|parameters), which is called the likelihood.

Page 8: Probabilistic Model of Sequences

Probabilities

• Have two dice, D1 and D2
• The probability of rolling i given die D1 is called P(i|D1). This is a conditional probability.
• Pick a die at random with probability P(Dj), j = 1 or 2. The probability of picking die Dj and rolling i is called a joint probability and is P(i, Dj) = P(Dj) P(i|Dj).
• For any events X and Y, P(X,Y) = P(X|Y) P(Y)
• If we know P(X,Y), then the so-called marginal probability P(X) can be computed as

$$P(X) = \sum_{Y} P(X, Y)$$
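These rules in code, for the two-dice example (the dice probabilities are made-up illustrative numbers):

```python
# Marginal probability: P(i) = sum over j of P(Dj) * P(i | Dj).
P_D = {"D1": 0.5, "D2": 0.5}               # probability of picking each die
P_i_given_D = {"D1": [1/6] * 6,            # D1 is fair
               "D2": [0.1] * 5 + [0.5]}    # D2 is loaded towards 6

def marginal(i):
    """P(i) = sum_j P(i, Dj) = sum_j P(Dj) * P(i | Dj)."""
    return sum(P_D[d] * P_i_given_D[d][i - 1] for d in P_D)

print(marginal(6))  # 0.5 * 1/6 + 0.5 * 0.5 ≈ 0.333
```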

Page 9: Probabilistic Model of Sequences

• Now, we show that maximising P(Data|parameters) for the random sequence model leads to the frequency-based computation that we did intuitively.

Notations:
- $S$: sequence of symbols $s_1, \dots, s_L$
- $L$: length of the sequence
- $T$: size of the alphabet
- $p(1), \dots, p(T)$: parameters (probabilities)
- $x_t$: frequency of occurrence of symbol $t$, for $t = 1, \dots, T$

Maximise
$$P(S \mid p(1), \dots, p(T)) = \prod_{l=1}^{L} p(s_l) = \prod_{t=1}^{T} p(t)^{x_t}$$

Equivalently, maximise the logarithm of the likelihood:
$$\log P(S \mid p(1), \dots, p(T)) = \sum_{l=1}^{L} \log p(s_l) = \sum_{t=1}^{T} x_t \log p(t)$$

Remember that the $p(t)$ need to be probabilities, so $\sum_{t=1}^{T} p(t) = 1$. This constraint can be imposed by adding a Lagrangian term $\lambda \left( \sum_{t=1}^{T} p(t) - 1 \right)$.

Page 10: Probabilistic Model of Sequences

Therefore we need to maximise
$$\mathrm{Obj} = \sum_{t=1}^{T} x_t \log p(t) + \lambda \left( \sum_{t=1}^{T} p(t) - 1 \right)$$

Remember, at a maximum, the derivative of a function is zero:
$$\frac{\partial\, \mathrm{Obj}}{\partial\, p(t)} = \frac{x_t}{p(t)} + \lambda = 0$$

Now, to compute $\lambda$, multiply both sides by $p(t)$ and add up both sides over $t = 1, \dots, T$:
$$\sum_{t=1}^{T} x_t + \lambda \sum_{t=1}^{T} p(t) = 0, \quad \text{so} \quad \lambda = -\sum_{t=1}^{T} x_t$$

So,
$$p(t) = \frac{x_t}{\sum_{t=1}^{T} x_t}$$

Why did we bother?

Because in more complicated models we cannot ‘guess’ the result.
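A quick numerical sanity check of the result, using the coin sequence from earlier (the grid search is purely illustrative):

```python
import math

# For [t t t h t h h t]: x_t = 5, x_h = 3; log-likelihood as a function of p(t).
x_t, x_h = 5, 3
log_likelihood = lambda p: x_t * math.log(p) + x_h * math.log(1 - p)

# The maximiser over a fine grid coincides with the frequency estimate.
best = max((i / 1000 for i in range(1, 1000)), key=log_likelihood)
print(best)  # 0.625, i.e. 5/8
```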

Page 11: Probabilistic Model of Sequences

Markov Chains

• Further examples of sequences:
  – Bio-sequences
  – Web page request sequences while browsing

• These are no longer random sequences; they have a time structure.

• How many parameters would such a model have?

• We need to make simplifying assumptions to end up with a reasonable number of parameters

• The first-order Markov assumption: each observation depends only on the immediately previous one, not on the longer history

• Markov Chain = sequence model which makes the Markov assumption

$$P(s_l \mid s_1, \dots, s_{l-1}) = P(s_l \mid s_{l-1})$$

Page 12: Probabilistic Model of Sequences

Markov Chains

• The probability of a Markov sequence:

$$P(S) = P(s_1)\, P(s_2 \mid s_1)\, P(s_3 \mid s_2) \cdots P(s_L \mid s_{L-1}) = P(s_1) \prod_{l=2}^{L} P(s_l \mid s_{l-1})$$

• The alphabet's symbols are also called states
• Once the parameters are estimated from training data, the Markov chain can be used for prediction
• Amongst others, Markov Chains are successful for web browsing behaviour prediction
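A sketch of this computation, reusing the rain/no-rain numbers from the "Simple example" slide a little further on:

```python
def markov_sequence_probability(seq, initial, transition):
    """P(S) = P(s1) * product over l of P(s_l | s_{l-1})."""
    prob = initial[seq[0]]
    for prev, curr in zip(seq, seq[1:]):
        prob *= transition[(prev, curr)]   # transition[(j, i)] = P(i | j)
    return prob

# Two-state chain; the initial probabilities are an assumption.
initial = {"rain": 0.5, "dry": 0.5}
transition = {("rain", "rain"): 0.8, ("rain", "dry"): 0.2,
              ("dry", "rain"): 0.6, ("dry", "dry"): 0.4}
print(markov_sequence_probability(["rain", "rain", "dry"], initial, transition))
# 0.5 * 0.8 * 0.2 = 0.08
```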

Page 13: Probabilistic Model of Sequences

Markov Chains

• A Markov Chain is stationary if it has the same transition probabilities at every time step.

• We assume stationary models here.

• Then the parameters of the model consist of the transition probability matrix & initial state probabilities.

$$P(s_t = i \mid s_{t-1} = j), \quad \text{shorthand: } P(i \mid j)$$

Page 14: Probabilistic Model of Sequences

ML parameter estimation

• We can derive how to compute the parameters of a Markov Chain from data, using Maximum Likelihood, as we did for random sequences.

• The ML estimate of the transition matrix will again be very intuitive:

$$\hat{P}(i \mid j) = \frac{x_{ij}}{\sum_{i=1}^{T} x_{ij}}$$

where $x_{ij}$ is the number of observed transitions from state $j$ to state $i$. Remember that $\sum_{i=1}^{T} P(i \mid j) = 1$, for all $j$.
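A sketch of this counting in Python (the toy sequence is made up; the exercise on a later slide applies the same recipe):

```python
from collections import Counter, defaultdict

def estimate_transition_matrix(seq):
    """ML estimate: P(i | j) = x_ij / sum_i x_ij, with x_ij counting j -> i."""
    counts = Counter(zip(seq, seq[1:]))     # x_ij: count of transitions j -> i
    totals = defaultdict(int)
    for (j, _), n in counts.items():
        totals[j] += n
    return {(j, i): n / totals[j] for (j, i), n in counts.items()}

print(estimate_transition_matrix(list("ababbab")))
# e.g. P(b | a) = 1.0: every 'a' in this sequence is followed by 'b'
```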

Page 15: Probabilistic Model of Sequences

Simple example

• If it is raining today, it will rain tomorrow with probability 0.8; this implies the contrary has probability 0.2
• If it is not raining today, it will rain tomorrow with probability 0.6; this implies the contrary has probability 0.4
• Build the transition matrix:
• Be careful which numbers need to sum to one and which don't. Such a matrix is called a stochastic matrix.
• Q: It rained all week, including today. What does this model predict for tomorrow? Why? What does it predict for a day from tomorrow? (*Homework)

$$\begin{pmatrix} P(\text{rain tomorrow} \mid \text{rain today}) & P(\text{no rain tomorrow} \mid \text{rain today}) \\ P(\text{rain tomorrow} \mid \text{no rain today}) & P(\text{no rain tomorrow} \mid \text{no rain today}) \end{pmatrix} = \begin{pmatrix} 0.8 & 0.2 \\ 0.6 & 0.4 \end{pmatrix}$$
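The mechanics of prediction with this matrix, as a sketch (running it answers the homework questions, so try them by hand first):

```python
# k-step-ahead prediction: repeatedly multiply the current state distribution
# by the transition matrix. Index 0 = rain, index 1 = no rain.
T = [[0.8, 0.2],
     [0.6, 0.4]]

def step(dist, T):
    """One day ahead: new_dist[i] = sum_j dist[j] * T[j][i]."""
    return [sum(dist[j] * T[j][i] for j in range(len(T))) for i in range(len(T))]

dist = [1.0, 0.0]          # it rained today
for day in (1, 2):         # tomorrow, then the day after tomorrow
    dist = step(dist, T)
    print(day, dist)
```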

Page 16: Probabilistic Model of Sequences

Examples of Web Applications

• HTTP request prediction:

– To predict the probabilities of the next requests from the same user based on the history of requests from that client.

• Adaptive Web navigation:

– To build a navigation agent which suggests which other links would be of interest to the user based on the statistics of previous visits.

– The predicted link does not strictly have to be a link present in the Web page currently being viewed.

• Tour generation:

– Takes a starting URL as input and generates a sequence of states (or URLs) using the Markov chain process.

Page 17: Probabilistic Model of Sequences

Building Markov Models from Web Log Files

• A Web log file is a collection of records of user requests for documents on a Web site; an example:

177.21.3.4 - - [04/Apr/1999:00:01:11 +0100] "GET /studaffairs/ccampus.html HTTP/1.1" 200 5327 "http://www.ulst.ac.uk/studaffairs/accomm.html" "Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)"

• The transition matrix can be seen as a graph
  – Link pair: (r - referrer, u - requested page, w - hyperlink weight)
  – Link graph: it is called the state diagram of the Markov Chain
    • a directed weighted graph
    • a hierarchy from the homepage down to multiple levels
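A hedged sketch of extracting (referrer, requested page) pairs from such records; the regular expression is an assumption about the exact log format, which varies between servers:

```python
import re
from collections import Counter

# Matches the requested path and the referrer in a combined-format log line.
LOG_PATTERN = re.compile(r'"GET (\S+) [^"]*" \d+ \d+ "([^"]*)"')

def link_pair_counts(log_lines):
    """Count (r, u) pairs: these are the hyperlink weights w of the link graph."""
    counts = Counter()
    for line in log_lines:
        m = LOG_PATTERN.search(line)
        if m:
            u, r = m.group(1), m.group(2)
            counts[(r, u)] += 1
    return counts

line = ('177.21.3.4 - - [04/Apr/1999:00:01:11 +0100] '
        '"GET /studaffairs/ccampus.html HTTP/1.1" 200 5327 '
        '"http://www.ulst.ac.uk/studaffairs/accomm.html" "Mozilla/4.0 (...)"')
print(link_pair_counts([line]))
```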

Page 18: Probabilistic Model of Sequences

Link Graph: an example (University of Ulster site)

[Figure: a directed weighted link graph. The root is the University of Ulster homepage (entered from a Start state), with nodes such as Department, Student Information, CS, Science & Arts, International Office, Library, Undergraduate, Graduate, Jobs and Register, and an Exit state; each arrow is labelled with its transition count (e.g. 9000, 4500, 2700, 1800).]

State diagram:

- Nodes: states

- Weighted arrows: number of transitions

Zhu et al. 2002

Page 19: Probabilistic Model of Sequences

Experimental Results (Sarukkai, 2000)

• Simulations:

– ‘Correct link’ refers to the actual link chosen at the next step.

– ‘Depth of the correct link’ is measured by counting the number of links which have a probability greater than or equal to that of the correct link.

– Over 70% of correct links are in the top 20 scoring states.

– Difficulties: very large state space

Page 20: Probabilistic Model of Sequences

Simple exercise

• Build the Markov transition matrix of the following sequence:

[a b b a c a b c b b d e e d e d e d]

State space: {…………….}

$$\begin{pmatrix} P(a|a) & P(b|a) & P(c|a) & P(d|a) & P(e|a) \\ P(a|b) & P(b|b) & P(c|b) & P(d|b) & P(e|b) \\ P(a|c) & P(b|c) & P(c|c) & P(d|c) & P(e|c) \\ P(a|d) & P(b|d) & P(c|d) & P(d|d) & P(e|d) \\ P(a|e) & P(b|e) & P(c|e) & P(d|e) & P(e|e) \end{pmatrix}$$

Page 21: Probabilistic Model of Sequences

Further topics

• Hidden Markov Model
  – Does not make the Markov assumption on the observed sequence
  – Instead, it assumes that the observed sequence was generated by another sequence which is unobservable (hidden), and this other sequence is assumed to be Markovian
  – More powerful
  – Estimation is more complicated

• Aggregate Markov model
  – Useful for clustering sub-graphs of a transition graph
  – Factorises the transition probabilities through $K$ hidden classes:

$$P(s_t \mid s_{t-1}) = \sum_{k=1}^{K} P(s_t \mid k)\, P(k \mid s_{t-1})$$

Page 22: Probabilistic Model of Sequences

HMM at an intuitive level

• Suppose that we know all the parameters of the following HMM, as shown on the state-diagram below. What is the probability of observing the sequence [A,B] if the initial state is S1? The same question if the initial state is chosen randomly with equal probabilities.

ANSWER:

If the initial state is S1: 0.2*(0.4*0.8+0.6*0.7) = 0.148.

In the second case: 0.5*0.148+0.5*0.3*(0.3*0.7+0.7*0.8) = 0.1895.
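The same numbers fall out of the forward algorithm, the standard way to compute observation probabilities in an HMM. The parameters below are inferred from the worked answer, since the state diagram itself is not reproduced in this transcript:

```python
# Two-state HMM; parameters reconstructed from the worked answer above.
A = {("S1", "S1"): 0.4, ("S1", "S2"): 0.6,   # transition probabilities
     ("S2", "S1"): 0.7, ("S2", "S2"): 0.3}
E = {("S1", "A"): 0.2, ("S1", "B"): 0.8,     # emission probabilities
     ("S2", "A"): 0.3, ("S2", "B"): 0.7}
states = ["S1", "S2"]

def forward_probability(obs, init):
    """P(obs): alpha[s] = P(o_1..o_l, state_l = s), summed out at the end."""
    alpha = {s: init[s] * E[(s, obs[0])] for s in states}
    for o in obs[1:]:
        alpha = {s: sum(alpha[r] * A[(r, s)] for r in states) * E[(s, o)]
                 for s in states}
    return sum(alpha.values())

print(forward_probability(["A", "B"], {"S1": 1.0, "S2": 0.0}))  # 0.148
print(forward_probability(["A", "B"], {"S1": 0.5, "S2": 0.5}))  # 0.1895
```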

Page 23: Probabilistic Model of Sequences

Conclusions

• Probabilistic Model

• Maximum Likelihood parameter estimation

• Random sequence model

• Markov chain model

---------------------------------

• Hidden Markov Model

• Aggregate Markov Model

Page 24: Probabilistic Model of Sequences

Any questions?