Hidden Markov Models
• Markov assumption:
– The current state is conditionally independent of all the other past states given the state in the previous time step
– The evidence at time t depends only on the state at time t
• Transition model: P(Xt | X0:t-1) = P(Xt | Xt-1)
• Observation model: P(Et | X0:t, E1:t-1) = P(Et | Xt)
[Figure: HMM as a Bayes net: state chain X0 → X1 → X2 → … → Xt, with each state Xi emitting an observation Ei]
An example HMM
• States: X = {home, office, cafe}
• Observations: E = {sms, facebook, email}
Slide credit: Andy White
The Joint Distribution
• Transition model: P(Xt | Xt-1)
• Observation model: P(Et | Xt)
• How do we compute the full joint P(X0:t, E1:t)?
P(X0:t, E1:t) = P(X0) ∏i=1..t P(Xi | Xi-1) P(Ei | Xi)
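The factorization above can be sketched directly in code: the joint probability of a state sequence and an observation sequence is the prior times a product of one transition and one emission term per step. The numeric tables below are illustrative only (the slides do not give full tables); they use the example's home/office/cafe states and sms/facebook/email observations.

```python
# Joint probability of a state/observation sequence under an HMM:
# P(x_{0:t}, e_{1:t}) = P(x0) * prod_{i=1..t} P(x_i | x_{i-1}) * P(e_i | x_i)
# The numbers below are hypothetical, chosen only to make rows sum to 1.

prior = {"home": 0.5, "office": 0.3, "cafe": 0.2}
trans = {  # trans[a][b] = P(X_t = b | X_{t-1} = a)
    "home":   {"home": 0.3, "office": 0.6, "cafe": 0.1},
    "office": {"home": 0.5, "office": 0.2, "cafe": 0.3},
    "cafe":   {"home": 0.1, "office": 0.8, "cafe": 0.1},
}
emit = {  # emit[a][e] = P(E_t = e | X_t = a)
    "home":   {"sms": 0.4, "facebook": 0.5, "email": 0.1},
    "office": {"sms": 0.1, "facebook": 0.1, "email": 0.8},
    "cafe":   {"sms": 0.3, "facebook": 0.4, "email": 0.3},
}

def joint_probability(states, observations):
    """P(x_{0:t}, e_{1:t}): states has length t+1, observations length t."""
    p = prior[states[0]]
    for i, e in enumerate(observations, start=1):
        # Each step contributes one transition factor and one emission factor.
        p *= trans[states[i - 1]][states[i]] * emit[states[i]][e]
    return p
```

For example, `joint_probability(["home", "office"], ["email"])` multiplies the prior for home, the home→office transition, and the email emission from office.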
HMM inference tasks
• Filtering: what is the distribution over the current state Xt given all the evidence so far, e1:t?
• Smoothing: what is the distribution of some state Xk given the entire observation sequence e1:t?
• Evaluation: compute the probability of a given observation sequence e1:t
• Decoding: what is the most likely state sequence X0:t given the observation sequence e1:t?

[Figure: HMM chain X0, …, Xk, …, Xt with observations E1, …, Ek, …, Et; Xk is the query variable for smoothing, the Ei are the evidence variables]
Filtering
• Task: compute the probability distribution over the current state given all the evidence so far: P(Xt | e1:t)
• Recursive formulation: suppose we know P(Xt-1 | e1:t-1)

[Figure: two-column trellis over states {Home, Office, Cafe} at times t-1 and t; the belief P(Xt-1 | e1:t-1) is Home 0.6, Office 0.3, Cafe 0.1; et-1 = Facebook, et = Email; the transition probabilities into Office are P(Office | Home) = 0.6, P(Office | Office) = 0.2, P(Office | Cafe) = 0.8]

• Prediction: what is P(Xt = Office | e1:t-1)?

P(Xt | e1:t-1) = Σxt-1 P(Xt, xt-1 | e1:t-1)
             = Σxt-1 P(Xt | xt-1, e1:t-1) P(xt-1 | e1:t-1)
             = Σxt-1 P(Xt | xt-1) P(xt-1 | e1:t-1)

P(Xt = Office | e1:t-1) = 0.6 * 0.6 + 0.2 * 0.3 + 0.8 * 0.1 = 0.5

• Correction: what is P(Xt = Office | e1:t), given P(et = Email | Xt = Office) = 0.8?

P(Xt | e1:t) = P(Xt | et, e1:t-1)
            = P(et | Xt, e1:t-1) P(Xt | e1:t-1) / P(et | e1:t-1)
            ∝ P(et | Xt) P(Xt | e1:t-1)

P(Xt = Office | e1:t) ∝ 0.5 * 0.8 = 0.4

Note: must also compute this value for Home and Cafe, and renormalize to sum to 1
Filtering: The Forward Algorithm
• Task: compute the probability distribution over the current state given all the evidence so far: P(Xt | e1:t)
• Recursive formulation: suppose we know P(Xt-1 | e1:t-1)
– Base case: priors P(X0)
• Prediction: propagate belief from Xt-1 to Xt

P(Xt | e1:t-1) = Σxt-1 P(Xt | xt-1) P(xt-1 | e1:t-1)

• Correction: weight by evidence et

P(Xt | e1:t) = P(Xt | et, e1:t-1) ∝ P(et | Xt) P(Xt | e1:t-1)

• Renormalize to have all P(Xt = x | e1:t) sum to 1
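One forward-algorithm update can be sketched as a predict/correct/renormalize function. The transition probabilities into Office and P(Email | Office) = 0.8 match the slides' worked example; the remaining table entries are made up for illustration, chosen so rows sum to 1.

```python
# One step of the forward algorithm for the home/office/cafe example.
# Only the transitions into "office" and P(email | office) come from the
# slides; the rest of the numbers are hypothetical.

trans = {  # trans[a][b] = P(X_t = b | X_{t-1} = a)
    "home":   {"home": 0.3, "office": 0.6, "cafe": 0.1},
    "office": {"home": 0.5, "office": 0.2, "cafe": 0.3},
    "cafe":   {"home": 0.1, "office": 0.8, "cafe": 0.1},
}
emit = {  # emit[a][e] = P(E_t = e | X_t = a)
    "home":   {"sms": 0.4, "facebook": 0.5, "email": 0.1},
    "office": {"sms": 0.1, "facebook": 0.1, "email": 0.8},
    "cafe":   {"sms": 0.3, "facebook": 0.4, "email": 0.3},
}

def filter_step(belief, e):
    """One forward update of P(X_{t-1} | e_{1:t-1}) into P(X_t | e_{1:t})."""
    # Prediction: propagate the belief through the transition model.
    predicted = {b: sum(belief[a] * trans[a][b] for a in belief) for b in trans}
    # Correction: weight each state by the likelihood of the new evidence.
    corrected = {b: emit[b][e] * predicted[b] for b in predicted}
    # Renormalize so the posterior sums to 1.
    z = sum(corrected.values())
    return {b: p / z for b, p in corrected.items()}
```

Starting from the slides' belief (Home 0.6, Office 0.3, Cafe 0.1) and observing Email, the predicted value for Office is 0.5, the unnormalized corrected value is 0.4, and the normalized posterior for Office is 0.4 / 0.482 ≈ 0.83.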
Filtering: The Forward Algorithm

[Figure: trellis over states {Home, Office, Cafe} from time 0 (the priors) through times t-1 and t, with evidence et-1 and et; arrows connect every state at one time step to every state at the next]
Evaluation
• Compute the probability of the current sequence: P(e1:t)
• Recursive formulation: suppose we know P(e1:t-1)

P(e1:t) = P(et, e1:t-1)
       = P(et | e1:t-1) P(e1:t-1)
       = P(e1:t-1) Σxt P(et, xt | e1:t-1)
       = P(e1:t-1) Σxt P(et | xt, e1:t-1) P(xt | e1:t-1)
       = P(e1:t-1) Σxt P(et | xt) P(xt | e1:t-1)

where P(xt | e1:t-1) is available from the filtering recursion
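In practice, evaluation falls out of the forward algorithm for free: if we carry the unnormalized forward message αt(x) = P(x, e1:t), then summing it over states gives P(e1:t) directly. A minimal sketch on the running example (the numeric tables are hypothetical, as before):

```python
# Sequence likelihood P(e_{1:t}) via unnormalized forward messages.
# All numeric tables are illustrative.

prior = {"home": 0.5, "office": 0.3, "cafe": 0.2}
trans = {
    "home":   {"home": 0.3, "office": 0.6, "cafe": 0.1},
    "office": {"home": 0.5, "office": 0.2, "cafe": 0.3},
    "cafe":   {"home": 0.1, "office": 0.8, "cafe": 0.1},
}
emit = {
    "home":   {"sms": 0.4, "facebook": 0.5, "email": 0.1},
    "office": {"sms": 0.1, "facebook": 0.1, "email": 0.8},
    "cafe":   {"sms": 0.3, "facebook": 0.4, "email": 0.3},
}

def sequence_likelihood(observations):
    """P(e_{1:t}): run the forward recursion without normalizing, then sum.
    alpha[x] = P(X_i = x, e_{1:i}) after processing i observations."""
    alpha = dict(prior)  # alpha_0(x) = P(X_0 = x)
    for e in observations:
        alpha = {b: emit[b][e] * sum(alpha[a] * trans[a][b] for a in alpha)
                 for b in trans}
    return sum(alpha.values())
```

A useful sanity check: the likelihoods of all possible length-1 sequences must sum to 1.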
Smoothing
• What is the distribution of some state Xk given the entire observation sequence e1:t?
Smoothing
• What is the distribution of some state Xk given the entire observation sequence e1:t?
• Solution: the forward-backward algorithm
[Figure: trellis over states {Home, Office, Cafe} from time 0 to time t; forward messages are computed left-to-right up to time k, backward messages right-to-left down to time k, and the two meet at the query state Xk]

Forward message: P(Xk | e1:k)
Backward message: P(ek+1:t | Xk)
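The forward-backward combination can be sketched as follows: run the forward recursion up to time k, run a backward recursion from time t down to k, multiply the two messages pointwise, and renormalize. The tables reuse the running example's hypothetical numbers.

```python
# Forward-backward smoothing: P(X_k | e_{1:t}) ∝ P(X_k | e_{1:k}) P(e_{k+1:t} | X_k).
# Numeric tables are illustrative.

prior = {"home": 0.5, "office": 0.3, "cafe": 0.2}
trans = {
    "home":   {"home": 0.3, "office": 0.6, "cafe": 0.1},
    "office": {"home": 0.5, "office": 0.2, "cafe": 0.3},
    "cafe":   {"home": 0.1, "office": 0.8, "cafe": 0.1},
}
emit = {
    "home":   {"sms": 0.4, "facebook": 0.5, "email": 0.1},
    "office": {"sms": 0.1, "facebook": 0.1, "email": 0.8},
    "cafe":   {"sms": 0.3, "facebook": 0.4, "email": 0.3},
}

def smooth(observations, k):
    """P(X_k | e_{1:t}) for 0 <= k <= t; observations[0] is e_1."""
    # Forward message up to time k (unnormalized; we normalize at the end).
    f = dict(prior)
    for e in observations[:k]:
        f = {b: emit[b][e] * sum(f[a] * trans[a][b] for a in f) for b in trans}
    # Backward message: b[x] = P(e_{k+1:t} | X_k = x), built right-to-left.
    bwd = {x: 1.0 for x in trans}
    for e in reversed(observations[k:]):
        bwd = {a: sum(trans[a][b] * emit[b][e] * bwd[b] for b in trans)
               for a in trans}
    post = {x: f[x] * bwd[x] for x in trans}
    z = sum(post.values())
    return {x: p / z for x, p in post.items()}
```

When k = t the backward message is all ones, so smoothing reduces to plain filtering, which gives a simple consistency check.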
Decoding: Viterbi Algorithm
• Task: given observation sequence e1:t, compute most likely state sequence x0:t

x*0:t = argmaxx0:t P(x0:t | e1:t)

[Figure: HMM chain X0, …, Xk, …, Xt with observations E1, …, Ek, …, Et]
Decoding: Viterbi Algorithm
• The most likely path that ends in a particular state xt consists of the most likely path to some state xt-1 followed by the transition to xt

[Figure: trellis from time 0 to time t; the best path ending in xt extends the best path ending in some predecessor xt-1]
Decoding: Viterbi Algorithm
• Let mt(xt) denote the probability of the most likely path that ends in xt:

mt(xt) = maxx0:t-1 P(x0:t-1, xt | e1:t)
       ∝ maxx0:t-1 P(x0:t-1, xt, e1:t)
       = P(et | xt) maxxt-1 [ P(xt | xt-1) mt-1(xt-1) ]

[Figure: trellis from time 0 to time t; mt-1(xt-1) labels the best path into xt-1, which is extended by the transition P(xt | xt-1)]
Learning
• Given: a training sample of observation sequences
• Goal: compute model parameters
– Transition probabilities P(Xt | Xt-1)
– Observation probabilities P(Et | Xt)
• What if we had complete data, i.e., e1:t and x0:t?
– Then we could estimate all the parameters by relative frequencies:

P(Xt = b | Xt-1 = a) = (# of times state b follows state a) / (total # of transitions from state a)
P(E = e | X = a) = (# of times e is emitted from state a) / (total # of emissions from state a)
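The relative-frequency estimates above amount to counting and normalizing. A minimal sketch, assuming each training example pairs a fully observed state sequence x0:t with its observation sequence e1:t:

```python
from collections import Counter, defaultdict

def estimate_parameters(state_seqs, obs_seqs):
    """Relative-frequency (maximum-likelihood) estimates from complete data:
    P(b | a) = (# times b follows a) / (# transitions out of a)
    P(e | a) = (# times a emits e)  / (# emissions from a)"""
    trans_counts = defaultdict(Counter)
    emit_counts = defaultdict(Counter)
    for states, obs in zip(state_seqs, obs_seqs):
        for a, b in zip(states, states[1:]):
            trans_counts[a][b] += 1
        for x, e in zip(states[1:], obs):  # e_i is emitted by x_i
            emit_counts[x][e] += 1
    trans = {a: {b: c / sum(cs.values()) for b, c in cs.items()}
             for a, cs in trans_counts.items()}
    emit = {x: {e: c / sum(cs.values()) for e, c in cs.items()}
            for x, cs in emit_counts.items()}
    return trans, emit
```

In practice these counts are usually smoothed (e.g. add-one) so that unseen transitions and emissions do not get probability zero; that refinement is omitted here.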
Learning
• What if we knew the model parameters?
– Then we could use inference to find the posterior distribution of the hidden states given the observations
Learning
• The EM (expectation-maximization) algorithm:
– Starting with a random initialization of parameters λ(0):
• E-step: find the posterior distribution of the hidden variables given observations and current parameter estimate
• M-step: re-estimate parameter values given the expected values of the hidden variables

λ(i+1) = argmaxλ Σx P(X = x | e, λ(i)) log P(e, X = x | λ)
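For HMMs this instance of EM is the Baum-Welch algorithm: the E-step runs forward-backward to get posterior state and transition probabilities, and the M-step normalizes those expected counts exactly like the relative-frequency estimates from complete data. The sketch below handles a single observation sequence and keeps the prior fixed; the starting tables are the same hypothetical numbers used throughout.

```python
# One Baum-Welch (EM) iteration for the running example, single sequence.
# Numeric tables are illustrative starting parameters.

prior = {"home": 0.5, "office": 0.3, "cafe": 0.2}
trans = {
    "home":   {"home": 0.3, "office": 0.6, "cafe": 0.1},
    "office": {"home": 0.5, "office": 0.2, "cafe": 0.3},
    "cafe":   {"home": 0.1, "office": 0.8, "cafe": 0.1},
}
emit = {
    "home":   {"sms": 0.4, "facebook": 0.5, "email": 0.1},
    "office": {"sms": 0.1, "facebook": 0.1, "email": 0.8},
    "cafe":   {"sms": 0.3, "facebook": 0.4, "email": 0.3},
}

def em_step(obs, prior, trans, emit):
    """One EM iteration: E-step via forward-backward, M-step by normalizing
    the expected transition and emission counts."""
    S = list(trans)
    T = len(obs)
    # Forward messages: a[i][x] = P(X_i = x, e_{1:i}); a[0] is the prior.
    a = [dict(prior)]
    for e in obs:
        a.append({y: emit[y][e] * sum(a[-1][x] * trans[x][y] for x in S)
                  for y in S})
    # Backward messages: b[i][x] = P(e_{i+1:T} | X_i = x).
    b = [None] * (T + 1)
    b[T] = {x: 1.0 for x in S}
    for i in range(T - 1, -1, -1):
        e = obs[i]
        b[i] = {x: sum(trans[x][y] * emit[y][e] * b[i + 1][y] for y in S)
                for x in S}
    # E-step: expected transition counts (xi) and emission counts (gamma).
    tc = {x: {y: 0.0 for y in S} for x in S}
    ec = {x: {} for x in S}
    for i in range(T):
        e = obs[i]  # emitted at time i+1
        z = sum(a[i][x] * b[i][x] for x in S)  # = P(e_{1:T}) at every i
        for x in S:
            for y in S:
                tc[x][y] += a[i][x] * trans[x][y] * emit[y][e] * b[i + 1][y] / z
            ec[x][e] = ec[x].get(e, 0.0) + a[i + 1][x] * b[i + 1][x] / z
    # M-step: renormalize the expected counts into probabilities.
    new_trans = {x: {y: tc[x][y] / sum(tc[x].values()) for y in S} for x in S}
    new_emit = {x: {e: c / sum(cs.values()) for e, c in cs.items()}
                for x, cs in ec.items()}
    return new_trans, new_emit
```

Iterating `em_step` climbs the likelihood of the observations; like all EM runs it converges to a local optimum that depends on the initialization.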