Hidden Markov Models
• Markov assumption:
– The current state is conditionally independent of all the other past states given the state in the previous time step
– The evidence at time t depends only on the state at time t
• Transition model: P(Xt | X0:t-1) = P(Xt | Xt-1)
• Observation model: P(Et | X0:t, E1:t-1) = P(Et | Xt)
[Figure: HMM as a Bayes net: state chain X0 → X1 → X2 → … → Xt, with each state Xi emitting an observation Ei]
An example HMM
• States: X = {home, office, cafe}
• Observations: E = {sms, facebook, email}
Slide credit: Andy White
The Joint Distribution
• Transition model: P(Xt | Xt-1)
• Observation model: P(Et | Xt)
• How do we compute the full joint P(X0:t, E1:t)?
P(X0:t, E1:t) = P(X0) ∏i=1..t P(Xi | Xi-1) P(Ei | Xi)
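The factorization above can be sketched directly in code: the joint probability of a state sequence and an observation sequence is the prior times a product of one transition and one emission term per step. The numeric tables below are illustrative only (the slides do not give full tables); they use the example's home/office/cafe states and sms/facebook/email observations.

```python
# Joint probability of a state/observation sequence under an HMM:
# P(x_{0:t}, e_{1:t}) = P(x0) * prod_{i=1..t} P(x_i | x_{i-1}) * P(e_i | x_i)
# The numbers below are hypothetical, chosen only to make rows sum to 1.

prior = {"home": 0.5, "office": 0.3, "cafe": 0.2}
trans = {  # trans[a][b] = P(X_t = b | X_{t-1} = a)
    "home":   {"home": 0.3, "office": 0.6, "cafe": 0.1},
    "office": {"home": 0.5, "office": 0.2, "cafe": 0.3},
    "cafe":   {"home": 0.1, "office": 0.8, "cafe": 0.1},
}
emit = {  # emit[a][e] = P(E_t = e | X_t = a)
    "home":   {"sms": 0.4, "facebook": 0.5, "email": 0.1},
    "office": {"sms": 0.1, "facebook": 0.1, "email": 0.8},
    "cafe":   {"sms": 0.3, "facebook": 0.4, "email": 0.3},
}

def joint_probability(states, observations):
    """P(x_{0:t}, e_{1:t}): states has length t+1, observations length t."""
    p = prior[states[0]]
    for i, e in enumerate(observations, start=1):
        # Each step contributes one transition factor and one emission factor.
        p *= trans[states[i - 1]][states[i]] * emit[states[i]][e]
    return p
```

For example, `joint_probability(["home", "office"], ["email"])` multiplies the prior for home, the home→office transition, and the email emission from office.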
HMM inference tasks
• Filtering: what is the distribution over the current state Xt given all the evidence so far, e1:t?
• Smoothing: what is the distribution of some state Xk given the entire observation sequence e1:t?
• Evaluation: compute the probability of a given observation sequence e1:t
• Decoding: what is the most likely state sequence X0:t given the observation sequence e1:t?

[Figure: HMM chain X0, …, Xk, …, Xt with observations E1, …, Ek, …, Et; Xk is the query variable for smoothing, the Ei are the evidence variables]
Filtering
• Task: compute the probability distribution over the current state given all the evidence so far: P(Xt | e1:t)
• Recursive formulation: suppose we know P(Xt-1 | e1:t-1)

[Figure: two-column trellis over states {Home, Office, Cafe} at times t-1 and t; the belief P(Xt-1 | e1:t-1) is Home 0.6, Office 0.3, Cafe 0.1; et-1 = Facebook, et = Email; the transition probabilities into Office are P(Office | Home) = 0.6, P(Office | Office) = 0.2, P(Office | Cafe) = 0.8]

• Prediction: what is P(Xt = Office | e1:t-1)?

P(Xt | e1:t-1) = Σxt-1 P(Xt, xt-1 | e1:t-1)
             = Σxt-1 P(Xt | xt-1, e1:t-1) P(xt-1 | e1:t-1)
             = Σxt-1 P(Xt | xt-1) P(xt-1 | e1:t-1)

P(Xt = Office | e1:t-1) = 0.6 * 0.6 + 0.2 * 0.3 + 0.8 * 0.1 = 0.5

• Correction: what is P(Xt = Office | e1:t), given P(et = Email | Xt = Office) = 0.8?

P(Xt | e1:t) = P(Xt | et, e1:t-1)
            = P(et | Xt, e1:t-1) P(Xt | e1:t-1) / P(et | e1:t-1)
            ∝ P(et | Xt) P(Xt | e1:t-1)

P(Xt = Office | e1:t) ∝ 0.5 * 0.8 = 0.4

Note: must also compute this value for Home and Cafe, and renormalize to sum to 1
Filtering: The Forward Algorithm
• Task: compute the probability distribution over the current state given all the evidence so far: P(Xt | e1:t)
• Recursive formulation: suppose we know P(Xt-1 | e1:t-1)
– Base case: priors P(X0)
• Prediction: propagate belief from Xt-1 to Xt

P(Xt | e1:t-1) = Σxt-1 P(Xt | xt-1) P(xt-1 | e1:t-1)

• Correction: weight by evidence et

P(Xt | e1:t) = P(Xt | et, e1:t-1) ∝ P(et | Xt) P(Xt | e1:t-1)

• Renormalize to have all P(Xt = x | e1:t) sum to 1
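One forward-algorithm update can be sketched as a predict/correct/renormalize function. The transition probabilities into Office and P(Email | Office) = 0.8 match the slides' worked example; the remaining table entries are made up for illustration, chosen so rows sum to 1.

```python
# One step of the forward algorithm for the home/office/cafe example.
# Only the transitions into "office" and P(email | office) come from the
# slides; the rest of the numbers are hypothetical.

trans = {  # trans[a][b] = P(X_t = b | X_{t-1} = a)
    "home":   {"home": 0.3, "office": 0.6, "cafe": 0.1},
    "office": {"home": 0.5, "office": 0.2, "cafe": 0.3},
    "cafe":   {"home": 0.1, "office": 0.8, "cafe": 0.1},
}
emit = {  # emit[a][e] = P(E_t = e | X_t = a)
    "home":   {"sms": 0.4, "facebook": 0.5, "email": 0.1},
    "office": {"sms": 0.1, "facebook": 0.1, "email": 0.8},
    "cafe":   {"sms": 0.3, "facebook": 0.4, "email": 0.3},
}

def filter_step(belief, e):
    """One forward update of P(X_{t-1} | e_{1:t-1}) into P(X_t | e_{1:t})."""
    # Prediction: propagate the belief through the transition model.
    predicted = {b: sum(belief[a] * trans[a][b] for a in belief) for b in trans}
    # Correction: weight each state by the likelihood of the new evidence.
    corrected = {b: emit[b][e] * predicted[b] for b in predicted}
    # Renormalize so the posterior sums to 1.
    z = sum(corrected.values())
    return {b: p / z for b, p in corrected.items()}
```

Starting from the slides' belief (Home 0.6, Office 0.3, Cafe 0.1) and observing Email, the predicted value for Office is 0.5, the unnormalized corrected value is 0.4, and the normalized posterior for Office is 0.4 / 0.482 ≈ 0.83.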
Filtering: The Forward Algorithm

[Figure: trellis over states {Home, Office, Cafe} from time 0 (the priors) through times t-1 and t, with evidence et-1 and et; arrows connect every state at one time step to every state at the next]
Evaluation
• Compute the probability of the current sequence: P(e1:t)
• Recursive formulation: suppose we know P(e1:t-1)

P(e1:t) = P(et, e1:t-1)
       = P(et | e1:t-1) P(e1:t-1)
       = P(e1:t-1) Σxt P(et, xt | e1:t-1)
       = P(e1:t-1) Σxt P(et | xt, e1:t-1) P(xt | e1:t-1)
       = P(e1:t-1) Σxt P(et | xt) P(xt | e1:t-1)

where P(xt | e1:t-1) is available from the filtering recursion
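In practice, evaluation falls out of the forward algorithm for free: if we carry the unnormalized forward message αt(x) = P(x, e1:t), then summing it over states gives P(e1:t) directly. A minimal sketch on the running example (the numeric tables are hypothetical, as before):

```python
# Sequence likelihood P(e_{1:t}) via unnormalized forward messages.
# All numeric tables are illustrative.

prior = {"home": 0.5, "office": 0.3, "cafe": 0.2}
trans = {
    "home":   {"home": 0.3, "office": 0.6, "cafe": 0.1},
    "office": {"home": 0.5, "office": 0.2, "cafe": 0.3},
    "cafe":   {"home": 0.1, "office": 0.8, "cafe": 0.1},
}
emit = {
    "home":   {"sms": 0.4, "facebook": 0.5, "email": 0.1},
    "office": {"sms": 0.1, "facebook": 0.1, "email": 0.8},
    "cafe":   {"sms": 0.3, "facebook": 0.4, "email": 0.3},
}

def sequence_likelihood(observations):
    """P(e_{1:t}): run the forward recursion without normalizing, then sum.
    alpha[x] = P(X_i = x, e_{1:i}) after processing i observations."""
    alpha = dict(prior)  # alpha_0(x) = P(X_0 = x)
    for e in observations:
        alpha = {b: emit[b][e] * sum(alpha[a] * trans[a][b] for a in alpha)
                 for b in trans}
    return sum(alpha.values())
```

A useful sanity check: the likelihoods of all possible length-1 sequences must sum to 1.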
Smoothing
• What is the distribution of some state Xk given the entire observation sequence e1:t?
Smoothing
• What is the distribution of some state Xk given the entire observation sequence e1:t?
• Solution: the forward-backward algorithm
[Figure: trellis over states {Home, Office, Cafe} from time 0 to time t; forward messages are computed left-to-right up to time k, backward messages right-to-left down to time k, and the two meet at the query state Xk]

Forward message: P(Xk | e1:k)
Backward message: P(ek+1:t | Xk)
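The forward-backward combination can be sketched as follows: run the forward recursion up to time k, run a backward recursion from time t down to k, multiply the two messages pointwise, and renormalize. The tables reuse the running example's hypothetical numbers.

```python
# Forward-backward smoothing: P(X_k | e_{1:t}) ∝ P(X_k | e_{1:k}) P(e_{k+1:t} | X_k).
# Numeric tables are illustrative.

prior = {"home": 0.5, "office": 0.3, "cafe": 0.2}
trans = {
    "home":   {"home": 0.3, "office": 0.6, "cafe": 0.1},
    "office": {"home": 0.5, "office": 0.2, "cafe": 0.3},
    "cafe":   {"home": 0.1, "office": 0.8, "cafe": 0.1},
}
emit = {
    "home":   {"sms": 0.4, "facebook": 0.5, "email": 0.1},
    "office": {"sms": 0.1, "facebook": 0.1, "email": 0.8},
    "cafe":   {"sms": 0.3, "facebook": 0.4, "email": 0.3},
}

def smooth(observations, k):
    """P(X_k | e_{1:t}) for 0 <= k <= t; observations[0] is e_1."""
    # Forward message up to time k (unnormalized; we normalize at the end).
    f = dict(prior)
    for e in observations[:k]:
        f = {b: emit[b][e] * sum(f[a] * trans[a][b] for a in f) for b in trans}
    # Backward message: b[x] = P(e_{k+1:t} | X_k = x), built right-to-left.
    bwd = {x: 1.0 for x in trans}
    for e in reversed(observations[k:]):
        bwd = {a: sum(trans[a][b] * emit[b][e] * bwd[b] for b in trans)
               for a in trans}
    post = {x: f[x] * bwd[x] for x in trans}
    z = sum(post.values())
    return {x: p / z for x, p in post.items()}
```

When k = t the backward message is all ones, so smoothing reduces to plain filtering, which gives a simple consistency check.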
Decoding: Viterbi Algorithm
• Task: given observation sequence e1:t, compute most likely state sequence x0:t

x*0:t = argmaxx0:t P(x0:t | e1:t)

[Figure: HMM chain X0, …, Xk, …, Xt with observations E1, …, Ek, …, Et]
Decoding: Viterbi Algorithm
• The most likely path that ends in a particular state xt consists of the most likely path to some state xt-1 followed by the transition to xt

[Figure: trellis from time 0 to time t; the best path ending in xt extends the best path ending in some predecessor xt-1]
Decoding: Viterbi Algorithm
• Let mt(xt) denote the probability of the most likely path that ends in xt:

mt(xt) = maxx0:t-1 P(x0:t-1, xt | e1:t)
       ∝ maxx0:t-1 P(x0:t-1, xt, e1:t)
       = P(et | xt) maxxt-1 [ P(xt | xt-1) mt-1(xt-1) ]

[Figure: trellis from time 0 to time t; mt-1(xt-1) labels the best path into xt-1, which is extended by the transition P(xt | xt-1)]
Learning
• Given: a training sample of observation sequences
• Goal: compute model parameters
– Transition probabilities P(Xt | Xt-1)
– Observation probabilities P(Et | Xt)
• What if we had complete data, i.e., e1:t and x0:t?
– Then we could estimate all the parameters by relative frequencies:

P(Xt = b | Xt-1 = a) = (# of times state b follows state a) / (total # of transitions from state a)
P(E = e | X = a) = (# of times e is emitted from state a) / (total # of emissions from state a)
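The relative-frequency estimates above amount to counting and normalizing. A minimal sketch, assuming each training example pairs a fully observed state sequence x0:t with its observation sequence e1:t:

```python
from collections import Counter, defaultdict

def estimate_parameters(state_seqs, obs_seqs):
    """Relative-frequency (maximum-likelihood) estimates from complete data:
    P(b | a) = (# times b follows a) / (# transitions out of a)
    P(e | a) = (# times a emits e)  / (# emissions from a)"""
    trans_counts = defaultdict(Counter)
    emit_counts = defaultdict(Counter)
    for states, obs in zip(state_seqs, obs_seqs):
        for a, b in zip(states, states[1:]):
            trans_counts[a][b] += 1
        for x, e in zip(states[1:], obs):  # e_i is emitted by x_i
            emit_counts[x][e] += 1
    trans = {a: {b: c / sum(cs.values()) for b, c in cs.items()}
             for a, cs in trans_counts.items()}
    emit = {x: {e: c / sum(cs.values()) for e, c in cs.items()}
            for x, cs in emit_counts.items()}
    return trans, emit
```

In practice these counts are usually smoothed (e.g. add-one) so that unseen transitions and emissions do not get probability zero; that refinement is omitted here.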
Learning
• What if we knew the model parameters?
– Then we could use inference to find the posterior distribution of the hidden states given the observations
Learning
• The EM (expectation-maximization) algorithm:
– Starting with a random initialization of parameters λ(0):
• E-step: find the posterior distribution of the hidden variables given observations and current parameter estimate
• M-step: re-estimate parameter values given the expected values of the hidden variables

λ(i+1) = argmaxλ Σx P(X = x | e, λ(i)) log P(e, X = x | λ)
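For HMMs this instance of EM is the Baum-Welch algorithm: the E-step runs forward-backward to get posterior state and transition probabilities, and the M-step normalizes those expected counts exactly like the relative-frequency estimates from complete data. The sketch below handles a single observation sequence and keeps the prior fixed; the starting tables are the same hypothetical numbers used throughout.

```python
# One Baum-Welch (EM) iteration for the running example, single sequence.
# Numeric tables are illustrative starting parameters.

prior = {"home": 0.5, "office": 0.3, "cafe": 0.2}
trans = {
    "home":   {"home": 0.3, "office": 0.6, "cafe": 0.1},
    "office": {"home": 0.5, "office": 0.2, "cafe": 0.3},
    "cafe":   {"home": 0.1, "office": 0.8, "cafe": 0.1},
}
emit = {
    "home":   {"sms": 0.4, "facebook": 0.5, "email": 0.1},
    "office": {"sms": 0.1, "facebook": 0.1, "email": 0.8},
    "cafe":   {"sms": 0.3, "facebook": 0.4, "email": 0.3},
}

def em_step(obs, prior, trans, emit):
    """One EM iteration: E-step via forward-backward, M-step by normalizing
    the expected transition and emission counts."""
    S = list(trans)
    T = len(obs)
    # Forward messages: a[i][x] = P(X_i = x, e_{1:i}); a[0] is the prior.
    a = [dict(prior)]
    for e in obs:
        a.append({y: emit[y][e] * sum(a[-1][x] * trans[x][y] for x in S)
                  for y in S})
    # Backward messages: b[i][x] = P(e_{i+1:T} | X_i = x).
    b = [None] * (T + 1)
    b[T] = {x: 1.0 for x in S}
    for i in range(T - 1, -1, -1):
        e = obs[i]
        b[i] = {x: sum(trans[x][y] * emit[y][e] * b[i + 1][y] for y in S)
                for x in S}
    # E-step: expected transition counts (xi) and emission counts (gamma).
    tc = {x: {y: 0.0 for y in S} for x in S}
    ec = {x: {} for x in S}
    for i in range(T):
        e = obs[i]  # emitted at time i+1
        z = sum(a[i][x] * b[i][x] for x in S)  # = P(e_{1:T}) at every i
        for x in S:
            for y in S:
                tc[x][y] += a[i][x] * trans[x][y] * emit[y][e] * b[i + 1][y] / z
            ec[x][e] = ec[x].get(e, 0.0) + a[i + 1][x] * b[i + 1][x] / z
    # M-step: renormalize the expected counts into probabilities.
    new_trans = {x: {y: tc[x][y] / sum(tc[x].values()) for y in S} for x in S}
    new_emit = {x: {e: c / sum(cs.values()) for e, c in cs.items()}
                for x, cs in ec.items()}
    return new_trans, new_emit
```

Iterating `em_step` climbs the likelihood of the observations; like all EM runs it converges to a local optimum that depends on the initialization.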