
EM and variants of HMM Lecture #9


Page 1: EM and variants of HMM Lecture #9

EM and variants of HMM Lecture #9

Background Readings: Chapters 11.2, 11.6, 3.4 in the text book, Biological Sequence Analysis, Durbin et al., 2001.

Page 2: EM and variants of HMM Lecture #9


Reminder: Relative Entropy

Let p,q be two probability distributions on the same sample space. The relative entropy between p and q is defined by

H(p||q) = D(p||q) = ∑x p(x) log[p(x)/q(x)]

= ∑x p(x) log(1/q(x)) − ∑x p(x) log(1/p(x)),

where the second term is the entropy H(p).

“The inefficiency of assuming distribution q when the correct distribution is p”.

Page 3: EM and variants of HMM Lecture #9

Non-negativity of relative entropy

Claim: D(p||q) = ∑x p(x) log[p(x)/q(x)] ≥ 0, with equality only if q = p.

Proof

We may take the log to base e, i.e., log x = ln x. Then, for all x > 0, ln x ≤ x − 1, with equality only if x = 1.

Thus

−D(p||q) = ∑x p(x) ln[q(x)/p(x)] ≤ ∑x p(x)[q(x)/p(x) − 1] = ∑x [q(x) − p(x)] = 0,

hence D(p||q) ≥ 0.
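As a quick numeric sanity check (an illustration, not part of the original slides; the two small distributions below are made up), the following Python snippet computes D(p||q) and confirms that it is non-negative and vanishes when q = p:

```python
import math

def relative_entropy(p, q):
    """D(p||q) = sum_x p(x) * log(p(x)/q(x)); terms with p(x) = 0 contribute 0."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

print(relative_entropy(p, q))   # positive, since q != p
print(relative_entropy(p, p))   # 0.0, equality holds only when q = p
```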

Page 4: EM and variants of HMM Lecture #9


Relative entropy as average score for sequence comparisons

Recall that we have defined the scoring function via

s(a,b) = log[ P(a,b) / (Q(a)Q(b)) ]

Note that the average score is the relative entropy

D(P||Q) = ∑a,b P(a,b) log[P(a,b)/Q(a,b)]

where Q(a,b) = Q(a) Q(b).

Page 5: EM and variants of HMM Lecture #9


The EM algorithm

Consider a model where, for observed data x and model parameters θ, p(x|θ) is defined by:

p(x|θ) = ∑y p(x,y|θ), where y are the “hidden parameters”.

The EM algorithm receives x and parameters θ, and returns new parameters λ such that p(x|λ) ≥ p(x|θ).

Note: In the Durbin et al. book, the initial parameters are denoted by θ0, and the new parameters by θ.

Page 6: EM and variants of HMM Lecture #9

The EM algorithm

In each iteration the EM algorithm does the following:

(E step): Calculate Qθ(λ) = ∑y p(y|x,θ) log p(x,y|λ)

(M step): Find λ* which maximizes Qθ(λ)

(The next iteration sets θ ← λ* and repeats).

Comments:

1. When θ is clear, we shall use Q(λ) instead of Qθ(λ).

2. At the M-step we only need that Qθ(λ*) ≥ Qθ(θ). This change yields the so-called Generalized EM algorithm. It is important when it is hard to find the optimal λ*.
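As a structural sketch only (the function names below are my own illustrative placeholders, not part of the lecture), the E/M loop can be organized in Python as follows, with the model-specific pieces supplied by the caller:

```python
# A minimal EM skeleton.  The caller supplies:
#   e_step(x, theta)          -> expected sufficient statistics under p(y | x, theta)
#   m_step(stats)             -> parameters maximizing Q_theta(lambda)
#   log_likelihood(x, theta)  -> log p(x | theta), used here only for the stopping test
def em(x, theta, e_step, m_step, log_likelihood, tol=1e-8, max_iter=1000):
    ll = log_likelihood(x, theta)
    for _ in range(max_iter):
        stats = e_step(x, theta)            # E step
        theta = m_step(stats)               # M step
        new_ll = log_likelihood(x, theta)
        if new_ll - ll < tol:               # the likelihood never decreases
            break
        ll = new_ll
    return theta
```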

Page 7: EM and variants of HMM Lecture #9


Example: EM for 2 coin tosses

Consider the following experiment:

Given a coin with two possible outcomes: H (head) and T (tail), with probabilities qH and qT = 1 − qH.

The coin is tossed twice, but only the 1st outcome, T, is seen. So the data is x = (T,*).

We wish to apply the EM algorithm to get parameters that increase the likelihood of the data.

Let the initial parameters be θ = (qH, qT) = ( ¼, ¾ ).

Page 8: EM and variants of HMM Lecture #9


EM for 2 coin tosses

The hidden data which can produce x are the sequences y1 = (T,H) and y2 = (T,T) (note that with this definition (x,yi) = yi).

The likelihood of x with parameters (qH, qT) is qH qT + qT².

For the initial parameters θ = ( ¼, ¾ ), we have:

p(x|θ) = P(x,y1|θ) + P(x,y2|θ) = ¾·¼ + ¾·¾ = ¾

(note that in this case P(x,yi|θ) = P(yi|θ), for i = 1,2).

Page 9: EM and variants of HMM Lecture #9


EM for 2 coin tosses : Expectation step

Calculate Qθ(λ) = Qθ(qH,qT). Note: here qH, qT are variables.

Qθ(λ) = p(y1|x,θ) log p(x,y1|λ) + p(y2|x,θ) log p(x,y2|λ)

p(y1|x,θ) = p(y1,x|θ)/p(x|θ) = (¾·¼)/(¾) = ¼

p(y2|x,θ) = p(y2,x|θ)/p(x|θ) = (¾·¾)/(¾) = ¾

Thus we have

Qθ(λ) = ¼ log p(x,y1|λ) + ¾ log p(x,y2|λ)

Page 10: EM and variants of HMM Lecture #9


EM for 2 coin tosses : Expectation step

For a sequence y of coin tosses, let NH(y) be the number of H’s in y, and NT(y) be the number of T’s in y. Then

log p(y|λ) = NH(y) log qH + NT(y) log qT

[ In our example:

log p(y1|λ) = log qH + log qT

log p(y2|λ) = 2 log qT ]

Page 11: EM and variants of HMM Lecture #9


EM for 2 coin tosses : Expectation step

Thus

¼ log p(x,y1|λ) = ¼ (NH(y1) log qH + NT(y1) log qT) = ¼ (log qH + log qT)

¾ log p(x,y2|λ) = ¾ (NH(y2) log qH + NT(y2) log qT) = ¾ (2 log qT)

Substituting in the equation for Qθ(λ):

Qθ(λ) = ¼ log p(x,y1|λ) + ¾ log p(x,y2|λ)

= (¼ NH(y1) + ¾ NH(y2)) log qH + (¼ NT(y1) + ¾ NT(y2)) log qT

Qθ(λ) = NH log qH + NT log qT,  where NH = ¼ and NT = 7/4.

Page 12: EM and variants of HMM Lecture #9


EM for 2 coin tosses : Maximization step

Find λ* which maximizes Qθ(λ).

Qθ(λ) = NH log qH + NT log qT = ¼ log qH + 7/4 log qT

We saw earlier that this is maximized when:

qH = NH / (NH + NT)  and  qT = NT / (NH + NT);

that is, qH = (¼)/(¼ + 7/4) = 1/8 and qT = (7/4)/(¼ + 7/4) = 7/8,

and p(x|λ) = qH qT + qT² = (1/8)(7/8) + (7/8)(7/8) = 7/8.

[The optimal parameters (qH, qT) = (0, 1) will never be reached by the EM algorithm!]
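The arithmetic above is small enough to check in code. A minimal Python sketch of this specific two-toss example (the variable names are mine, not the lecture's) reproduces NH = ¼, NT = 7/4 and the updated parameters (1/8, 7/8), and shows the likelihood climbing toward 1 without ever reaching the optimum (0, 1):

```python
# EM for the two-coin-toss example: data x = (T,*), hidden completions y1 = (T,H), y2 = (T,T).
def em_step(q_h, q_t):
    # E step: posterior weights of the two completions under theta = (q_h, q_t)
    p_y1, p_y2 = q_t * q_h, q_t * q_t            # joint p(x, y_i | theta)
    p_x = p_y1 + p_y2                            # p(x | theta)
    w1, w2 = p_y1 / p_x, p_y2 / p_x              # p(y_i | x, theta)
    # expected counts of H and T under the posterior
    n_h = w1 * 1 + w2 * 0
    n_t = w1 * 1 + w2 * 2
    # M step: maximize N_H log q_H + N_T log q_T
    return n_h / (n_h + n_t), n_t / (n_h + n_t)

q_h, q_t = 0.25, 0.75
for _ in range(5):
    q_h, q_t = em_step(q_h, q_t)
    print(q_h, q_t, q_h * q_t + q_t ** 2)        # first iteration: 0.125, 0.875, likelihood 0.875
```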

Page 13: EM and variants of HMM Lecture #9


EM for general stochastic processes

This time (x,y) is generated by a general stochastic process, which employs r discrete random variables (dice) Z1,...,Zr. This can be viewed as a probabilistic state machine, where at each state one of the random variables Zi is sampled, and then the next state is determined, until a final state is reached.

As before, we wish to maximize the likelihood of the observation x with hidden data y, i.e. maximize

p(x|θ) = ∑y p(x,y|θ).

Page 14: EM and variants of HMM Lecture #9


EM for general stochastic processes

In an HMM, the random variables correspond to the state transitions, with probabilities akl, and to the symbol emissions, with probabilities ek(b).

x stands for the visible information (the emitted symbols)

y stands for the sequence s of states

(x,y) stands for the complete data of the HMM

[Figure: the HMM as a chain of hidden states s1, s2, …, sL-1, sL, where each state si emits a symbol Xi.]

For brevity, we assume that (x,y) = y (otherwise we set y’ = (x,y) and replace the “y”s by “y’”s).

Page 15: EM and variants of HMM Lecture #9


EM for general stochastic processes

Each random variable Zk (k = 1,...,r) has mk values zk,1,...,zk,mk with probabilities {qkl | l = 1,...,mk}.

Each y defines a sequence of outcomes (zk1,l1,...,zkn,ln) of the random variables used in y.

In the HMM, these are the specific transitions and emissions, defined by the states and outputs of the sequence yj .

Let Nkl(y) = #(zkl appears in y).

Page 16: EM and variants of HMM Lecture #9

EM for general stochastic processes (cont.)

Similarly to the dice case, we have:

log p(y|λ) = ∑k=1..r ∑l=1..mk Nkl(y) log qkl

Define Nkl as the expected value of Nkl(y) under θ:

Nkl = E(Nkl(y)|x,θ) = ∑y p(y|x,θ) Nkl(y).

Then we have:

Page 17: EM and variants of HMM Lecture #9

Q(λ) for general stochastic processes

Qθ(λ) = ∑y p(y|x,θ) log p(y|λ)

= ∑y p(y|x,θ) ∑k=1..r ∑l=1..mk Nkl(y) log qkl

= ∑k=1..r ∑l=1..mk [∑y p(y|x,θ) Nkl(y)] log qkl

= ∑k=1..r ∑l=1..mk Nkl log qkl,

which is maximized for qkl = Nkl / ∑l’ Nkl’.

Page 18: EM and variants of HMM Lecture #9


EM algorithm for general stochastic processes

Similarly to the one-die case we get:

Expectation step:
Set Nkl to E(Nkl(y)|x,θ), i.e. Nkl = ∑y p(y|x,θ) Nkl(y)

Maximization step:
Set qkl = Nkl / (∑l’ Nkl’)
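In code, and only under the (toy) assumption that the hidden completions y can be enumerated together with their posteriors, the two steps amount to accumulating posterior-weighted counts and normalizing within each random variable; this sketch is my illustration, not the lecture's algorithm for HMMs (where the counts come from forward/backward instead):

```python
from collections import defaultdict

def em_step(posteriors, counts):
    """posteriors: dict y -> p(y | x, theta);  counts: dict y -> dict (k, l) -> N_kl(y).
    Returns the new probabilities q_kl = N_kl / sum_l' N_kl'."""
    # E step: expected counts N_kl = sum_y p(y | x, theta) * N_kl(y)
    n = defaultdict(float)
    for y, w in posteriors.items():
        for (k, l), c in counts[y].items():
            n[(k, l)] += w * c
    # M step: normalize within each random variable Z_k
    totals = defaultdict(float)
    for (k, l), v in n.items():
        totals[k] += v
    return {(k, l): v / totals[k] for (k, l), v in n.items()}
```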

Page 19: EM and variants of HMM Lecture #9


EM algorithm for n independent observations x1,…, xn :

Expectation step

It can be shown that, if the xj are independent, then:

Nkl = ∑j=1..n Nkl^j,

where Nkl^j = ∑y p(y|xj,θ) Nkl(y,xj) = (1/p(xj|θ)) ∑y p(y,xj|θ) Nkl(y,xj).

Page 20: EM and variants of HMM Lecture #9


Application to HMM

For an HMM, the outcomes zkl are the state transitions and symbol emissions from state k, and the qkl are the corresponding probabilities akl and ek(b).

Page 21: EM and variants of HMM Lecture #9


EM algorithm for HMM: (the Baum-Welch training):

Expectation step (single observation x):

Akl, the expected number of (k,l) transitions:

Akl = ∑s p(s|x,θ) Nkl(x,s)

is computed by:

Akl = (1/p(x)) ∑i fk(i) · akl · el(xi+1) · bl(i+1)

Ekb, the expected number of emissions of b from state k:

Ekb = ∑s p(s|x,θ) Ekb(x,s)

is computed by:

Ekb = (1/p(x)) ∑{i : xi = b} fk(i) · bk(i)

(where f and b are the forward and backward variables).

Page 22: EM and variants of HMM Lecture #9


EM algorithm for HMM: (the Baum-Welch training):

Expectation step (n observations x1,...,xn):

Akl, the expected number of (k,l) transitions:

Akl = ∑j ∑s p(s|xj,θ) Nkl(xj,s)

is computed by:

Akl = ∑j=1..n (1/p(xj)) ∑i fk^j(i) · akl · el(xi+1^j) · bl^j(i+1)

Ekb = ∑j ∑s p(s|xj,θ) Ekb(xj,s)

is computed by:

Ekb = ∑j=1..n (1/p(xj)) ∑{i : xi^j = b} fk^j(i) · bk^j(i)

(where fk^j and bk^j are the forward and backward variables computed for the sequence xj).

Page 23: EM and variants of HMM Lecture #9


EM algorithm for HMM: (the Baum-Welch training):

Maximization step:

The new parameters are given by:

akl = Akl / ∑l’ Akl’   and   ek(b) = Ekb / ∑b’ Ekb’
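For concreteness, here is a compact sketch of one Baum-Welch iteration for a small discrete HMM (plain Python dictionaries, no explicit begin/end states, uniform initial distribution assumed, and no numerical safeguards; it illustrates the formulas above and is not the lecture's reference implementation):

```python
def forward_backward(x, a, e, states):
    """f[i][k] = P(x_1..x_i, s_i = k); b[i][k] = P(x_{i+1}..x_L | s_i = k).
    a[k][l]: transition probabilities, e[k][sym]: emission probabilities."""
    L = len(x)
    init = 1.0 / len(states)                      # assumed uniform start distribution
    f = [{k: init * e[k][x[0]] for k in states}]
    for i in range(1, L):
        f.append({l: e[l][x[i]] * sum(f[i - 1][k] * a[k][l] for k in states) for l in states})
    b = [dict() for _ in range(L)]
    b[L - 1] = {k: 1.0 for k in states}
    for i in range(L - 2, -1, -1):
        b[i] = {k: sum(a[k][l] * e[l][x[i + 1]] * b[i + 1][l] for l in states) for k in states}
    px = sum(f[L - 1][k] for k in states)          # p(x | theta)
    return f, b, px

def baum_welch_step(xs, a, e, states, symbols):
    # E step: expected transition counts A_kl and emission counts E_kb over all sequences
    A = {k: {l: 0.0 for l in states} for k in states}
    E = {k: {s: 0.0 for s in symbols} for k in states}
    for x in xs:
        f, b, px = forward_backward(x, a, e, states)
        for i in range(len(x)):
            for k in states:
                E[k][x[i]] += f[i][k] * b[i][k] / px
                if i + 1 < len(x):
                    for l in states:
                        A[k][l] += f[i][k] * a[k][l] * e[l][x[i + 1]] * b[i + 1][l] / px
    # M step: normalize the expected counts
    new_a = {k: {l: A[k][l] / sum(A[k].values()) for l in states} for k in states}
    new_e = {k: {s: E[k][s] / sum(E[k].values()) for s in symbols} for k in states}
    return new_a, new_e
```

A practical implementation would rescale the forward/backward variables or work in log space to avoid underflow on long sequences (Durbin et al., Ch. 3).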

Page 24: EM and variants of HMM Lecture #9


Correctness proof of EM

Theorem:

If λ* maximizes Q(λ) = ∑i p(yi|x,θ) log p(yi|λ), then P(x|λ*) ≥ P(x|θ).

Comment: In the proof we will assume only that

Q(λ*) ≥ Q(θ).

Page 25: EM and variants of HMM Lecture #9

Proof

For each y we have p(x|λ) p(y|x,λ) = p(y,x|λ), and hence:

log p(x|λ) = log p(y,x|λ) − log p(y|x,λ).

Multiplying both sides by p(y|x,θ) and summing over y leaves the left side unchanged, since ∑y p(y|x,θ) = 1; using p(y,x|λ) = p(y|λ) (recall (x,y) = y), we get

log p(x|λ) = ∑y p(y|x,θ) [log p(y|λ) − log p(y|x,λ)]

Page 26: EM and variants of HMM Lecture #9

Proof (end)

log p(x|λ) = ∑y p(y|x,θ) log p(y|λ) + ∑y p(y|x,θ) log[1/p(y|x,λ)]

The first term is Q(λ). Thus

log p(x|λ*) − log p(x|θ) = Q(λ*) − Q(θ) + D( p(y|x,θ) || p(y|x,λ*) )

≥ Q(λ*) − Q(θ) ≥ 0   [since the relative entropy is ≥ 0 and λ* maximizes Q(λ)].  QED

Page 27: EM and variants of HMM Lecture #9

Example: The ABO locus

A locus is a particular place on the chromosome. Each locus’ state (called genotype) consists of two alleles – one paternal and one maternal. Some loci (plural of locus) determine distinguished features. The ABO locus, for example, determines blood type.

The ABO locus has six possible genotypes {a/a, a/o, b/o, b/b, a/b, o/o}. The first two genotypes determine blood type A, the next two determine blood type B, then blood type AB, and finally blood type O. We wish to estimate the proportions of the 6 genotypes in a population.

Suppose we randomly sampled N individuals and found that Na/a have genotype a/a, Na/b have genotype a/b, etc. Then the MLE is given by:

qa/a = Na/a / N,  qa/o = Na/o / N,  qb/b = Nb/b / N,  qb/o = Nb/o / N,  qa/b = Na/b / N,  qo/o = No/o / N.

Page 28: EM and variants of HMM Lecture #9


The ABO locus (Cont.)

However, testing individuals for their genotype is very expensive. Can we estimate the proportions of the genotypes using the common, cheap blood test, whose outcome is one of the four blood types (A, B, AB, O)?

The problem is that among individuals measured to have blood type A, we don’t know how many have genotype a/a and how many have genotype a/o. So what can we do ?

Page 29: EM and variants of HMM Lecture #9


The ABO locus (Cont.)

We use the Hardy-Weinberg equilibrium rule, which tells us that in equilibrium the frequencies of the three alleles qa, qb, qo in the population determine the frequencies of the genotypes as follows: qa/b = 2qa qb, qa/o = 2qa qo, qb/o = 2qb qo, qa/a = [qa]², qb/b = [qb]², qo/o = [qo]². So now we have three parameters that we need to estimate.

The Hardy-Weinberg equilibrium rule follows from modeling this problem as data x with hidden parameters y: we have three possible alleles a, b and o. The blood type A, B, AB or O is determined by two successive samplings of alleles (which define the genotype). For instance, blood type A corresponds to the samplings (a,a), (a,o) and (o,a).

Page 30: EM and variants of HMM Lecture #9


The Likelihood Function

We wish to determine the probabilities of the six genotypes xa/a, xa/o, xb/b, xb/o, xa/b, xo/o. These are defined by the parameters Θ = {qa, qb, qo}, e.g.:

P(X = xa/b | Θ) = P({(a,b), (b,a)} | Θ) = 2qa qb. Similarly P(X = xo/o | Θ) = qo qo. And so on for the other four genotypes. So all we need is to find the parameters Θ = {qa, qb, qo}.

Page 31: EM and variants of HMM Lecture #9

The Likelihood Function

We wish to compute the parameters by sampling data and then using MLE. This is naturally dealt with by EM, because the sampled data – the blood types – have hidden parameters (the genotypes).

Assume the sampled data is {B,A,B,B,O,A,B,A,O,B,AB}. What is its probability, for given parameters Θ?

P(Data|Θ) = (qa² + 2qa qo)³ · (qb² + 2qb qo)⁵ · (2qa qb) · (qo²)²

Obtaining the maximum of this function yields the MLE. We use the EM algorithm to replace Θ by parameters λ which increase the likelihood.

Page 32: EM and variants of HMM Lecture #9

ABO loci as a special case of HMM

Model the ABO sampling as an HMM with 6 states (genotypes): a/a, a/b, a/o, b/b, b/o, o/o, and 4 outputs (blood types): A, B, AB, O. Assume 3 transition types: a, b and o, and a state is determined by 2 successive transitions. The probability of transition x is qx.

Emission is done at every other state, and is determined by the state. E.g., ea/o(A) = 1, since a/o produces blood type A.

[Figure: a fragment of the HMM; e.g. transitions a and o lead to state a/o, which emits A, and transitions a and b lead to state a/b, which emits AB.]

Page 33: EM and variants of HMM Lecture #9


A faster and simpler EM for ABO loci

This can be solved via the Baum-Welch EM training, but it is quite inefficient: for L samples it requires running the forward and backward algorithms on an HMM of length 2L, even though there are only 6 distinct genotypes. Direct application of the EM algorithm yields a simpler and more efficient way:

Consider the input data {B,A,B,B,O,A,B,A,O,B,AB} as observations x1,…,x11. The hidden data of an observation are the genotypes which can produce it. E.g., for O it is (o,o), and for B it is (o,b), (b,o) and (b,b).

Page 34: EM and variants of HMM Lecture #9


A faster EM for ABO loci

For each genotype y we have Na(y), Nb(y) and No(y). E.g., Na(o,b) = 0; Nb(o,b) = No(o,b) = 1.

For each observation of blood type xj and for each allele z in {a,b,o} we compute Nz^j, the expected number of times that z appears in xj.

Page 35: EM and variants of HMM Lecture #9

A faster EM for ABO loci

The computation for blood type B:

P(B|Θ) = P((b,b)|Θ) + P((b,o)|Θ) + P((o,b)|Θ) = qb² + 2qb qo.

No^B and Nb^B, the expected numbers of occurrences of o and b in an observation B, are given by:

No^B = ∑y p(y|B,Θ) No(y) = 2qb qo / (qb² + 2qb qo)

Nb^B = ∑y p(y|B,Θ) Nb(y) = (2qb² + 2qb qo) / (qb² + 2qb qo)

Observe that Nb^B + No^B = 2.

Page 36: EM and variants of HMM Lecture #9

A faster EM for ABO loci

Similarly, P(A|Θ) = qa² + 2qa qo, and

No^A = 2qa qo / (qa² + 2qa qo),   Na^A = (2qa² + 2qa qo) / (qa² + 2qa qo).

P(AB|Θ) = p((b,a)|Θ) + p((a,b)|Θ) = 2qa qb;   P(O|Θ) = p((o,o)|Θ) = qo².

Na^AB = Nb^AB = 1,   No^O = 2

[ Nb^O = Na^O = No^AB = Nb^A = Na^B = 0 ]

Page 37: EM and variants of HMM Lecture #9


E step: compute Na, Nb and No

Let #(A)=3, #(B)=5, #(AB)=1, #(O)=2 be the number of observations of A, B, AB, and O respectively.

Na = #(A)·Na^A + #(AB)·Na^AB

Nb = #(B)·Nb^B + #(AB)·Nb^AB

No = #(A)·No^A + #(B)·No^B + #(O)·No^O

Note that Na + Nb + No = 2N (here 2·11 = 22).

M step: compute new values of qa, qb and qo:

qa = Na / 2N;   qb = Nb / 2N;   qo = No / 2N
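Putting the E and M steps together for this data set, a short Python sketch (my own illustration of the recipe above, using the same counts #(A)=3, #(B)=5, #(AB)=1, #(O)=2 and an arbitrary uniform starting point) is:

```python
def abo_em_step(qa, qb, qo, nA=3, nB=5, nAB=1, nO=2):
    # E step: expected allele counts per observed blood type
    pA = qa * qa + 2 * qa * qo
    pB = qb * qb + 2 * qb * qo
    Na_A = (2 * qa * qa + 2 * qa * qo) / pA   # expected a's in one A observation
    No_A = (2 * qa * qo) / pA                 # expected o's in one A observation
    Nb_B = (2 * qb * qb + 2 * qb * qo) / pB
    No_B = (2 * qb * qo) / pB
    Na = nA * Na_A + nAB * 1                  # each AB contributes one a and one b
    Nb = nB * Nb_B + nAB * 1
    No = nA * No_A + nB * No_B + nO * 2       # each O contributes two o's
    # M step: renormalize (the total number of alleles is 2N)
    total = Na + Nb + No
    return Na / total, Nb / total, No / total

qa, qb, qo = 1/3, 1/3, 1/3
for _ in range(20):
    qa, qb, qo = abo_em_step(qa, qb, qo)
print(qa, qb, qo)   # converges to the MLE allele frequencies for this sample
```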

Page 38: EM and variants of HMM Lecture #9


EM in Practice

Initial parameters: random parameter setting, or a “best” guess from another source.

Stopping criteria: small change in the likelihood of the data, or small change in the parameter values.

Avoiding bad local maxima: multiple restarts, and early “pruning” of unpromising runs (see the sketch below).
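A hedged sketch of the multiple-restart idea (the helper names run_em and init_random are placeholders introduced here, not functions from the lecture):

```python
def em_with_restarts(x, run_em, init_random, n_restarts=10):
    """Run EM from several random initializations and keep the best local maximum.
    run_em(x, theta0) is assumed to return (theta, log_likelihood_at_convergence)."""
    best_theta, best_ll = None, float('-inf')
    for _ in range(n_restarts):
        theta, ll = run_em(x, init_random())
        if ll > best_ll:
            best_theta, best_ll = theta, ll
    return best_theta, best_ll
```

Early pruning would simply stop runs whose likelihood after a few iterations lags far behind the current best.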

Page 39: EM and variants of HMM Lecture #9

MLE from Incomplete Data

Finding the MLE parameters, i.e. maximizing log P(x|θ), is a nonlinear optimization problem.

Expectation Maximization (EM): use the “current point” to construct an alternative function (which is “nice”): the expected complete-data log-likelihood E[log P(x,y|λ)] under p(y|x,θ). Guarantee: the maximum of the new function has a higher likelihood than the current point.

Page 40: EM and variants of HMM Lecture #9

HMM model structure: 1. Duration Modeling

Markov chains are rather limited in describing sequences of symbols with non-random structures. For instance, a Markov chain forces the length distribution of segments in which some state is repeated k times to be geometric, (1−p)p^(k−1), for some p.

There are several ways to enable modeling of other distributions. One is to assign more than one state to represent the same “real” state, as sketched below.

[Figure: a chain of states A1 → A2 → A3 → A4, all representing the same “real” state A.]
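To see the effect, the following small simulation (an illustration, not from the slides) compares the duration of a single state with self-loop probability p to the total duration of a chain of 4 such states; the chained version has a negative-binomial (peaked) duration distribution rather than a geometric one:

```python
import random

def geometric(p):
    """Number of consecutive visits to a state with self-loop probability p (always >= 1)."""
    k = 1
    while random.random() < p:
        k += 1
    return k

def chain_duration(p, n_copies=4):
    # total time spent in a "real" state modeled by n_copies chained states
    return sum(geometric(p) for _ in range(n_copies))

random.seed(0)
single = [geometric(0.5) for _ in range(10000)]
chained = [chain_duration(0.5) for _ in range(10000)]
print(sum(single) / len(single))    # mean ~2 (geometric, mode at 1)
print(sum(chained) / len(chained))  # mean ~8 (negative binomial, peaked near its mean)
```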

Page 41: EM and variants of HMM Lecture #9

HMM model structure: 2. Silent states

States which do not emit symbols (as we saw in the ABO locus). They can be used to model duration distributions, and also to allow arbitrary jumps (needed for the use of HMMs in pairwise alignments). The Forward and Backward algorithms need to be adjusted to account for the silent states.

[Figure: silent states vs. regular (emitting) states.]

Page 42: EM and variants of HMM Lecture #9

HMM model structure: 3. High Order Markov Chains

Markov chains in which the transition probability depends on the last n states:

P(xi|xi-1,...,x1) = P(xi|xi-1,...,xi-n)

Can be represented by a standard Markov chain with more states, as sketched below.

[Figure: a 2nd-order chain over {A,B} represented as a 1st-order chain on the four pair-states AA, AB, BA, BB.]
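A small Python sketch of this state-space expansion for n = 2 (the transition numbers are made up for illustration):

```python
from itertools import product

def expand_second_order(p2, alphabet):
    """Represent a 2nd-order chain P(x_i | x_{i-1}, x_{i-2}) as a 1st-order chain
    on pair-states (x_{i-1}, x_i).  p2[(u, v)][w] = P(next = w | last two = u, v)."""
    states = list(product(alphabet, alphabet))          # e.g. AA, AB, BA, BB
    trans = {s: {} for s in states}
    for (u, v) in states:
        for w in alphabet:
            # from pair-state (u, v) the chain can only move to a pair-state (v, w)
            trans[(u, v)][(v, w)] = p2[(u, v)][w]
    return trans

# a hypothetical 2nd-order chain over {A, B}
p2 = {('A', 'A'): {'A': 0.7, 'B': 0.3},
      ('A', 'B'): {'A': 0.4, 'B': 0.6},
      ('B', 'A'): {'A': 0.5, 'B': 0.5},
      ('B', 'B'): {'A': 0.2, 'B': 0.8}}
print(expand_second_order(p2, ['A', 'B'])[('A', 'B')])  # {('B','A'): 0.4, ('B','B'): 0.6}
```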