CISC 841 Bioinformatics (Fall 2008) Hidden Markov Models

CISC841, F08, Liao


Model comparison

How to tell if two HMMs are equivalent?

- If not equivalent, how (dis-)similar are they?

Remember: HMMs are generative

Given a sequence x, P(x|M) is the probability that x can be generated from the model M.

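How to compare two probability distributions?

Mutual entropy (also known as relative entropy or Kullback-Leibler divergence), summed over all sequences x:

$$H(M, M') = \sum_x P(x|M) \log \frac{P(x|M)}{P(x|M')}$$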
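As a minimal sketch of this quantity, the function below sums P(x|M) log [P(x|M)/P(x|M')] over an explicitly enumerated set of sequences. The two toy distributions and the names `mutual_entropy`, `p_m` and `p_m2` are hypothetical; a real HMM assigns probability to far too many sequences to enumerate this way.

```python
import math

def mutual_entropy(p, q):
    """Return sum_x p(x) * log(p(x) / q(x)) over outcomes x with p(x) > 0."""
    total = 0.0
    for x, px in p.items():
        if px == 0.0:
            continue  # terms with p(x) = 0 contribute nothing
        qx = q.get(x, 0.0)
        if qx == 0.0:
            return math.inf  # p puts mass where q has none
        total += px * math.log(px / qx)
    return total

# Hypothetical toy distributions over three sequences (not from the lecture).
p_m = {"AC": 0.5, "AG": 0.3, "GC": 0.2}   # stands in for P(x|M)
p_m2 = {"AC": 0.4, "AG": 0.4, "GC": 0.2}  # stands in for P(x|M')

print(mutual_entropy(p_m, p_m2))  # small positive number
print(mutual_entropy(p_m2, p_m))  # generally differs: the measure is not symmetric
print(mutual_entropy(p_m, p_m))   # zero when the distributions coincide
```

Note that the measure is not symmetric in its two arguments, so it is not a distance in the metric sense.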


Mutual entropy:

$H(p|q) \ge 0$ (why?)

$H(p|q) = 0$ iff $p = q$
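One standard way to answer the "why?" is the inequality $\log t \le t - 1$ (natural logarithm, with equality only at $t = 1$). The derivation below is a sketch of that argument rather than the lecture's own; the sums run over the outcomes x with $p(x) > 0$.

```latex
\begin{align*}
-H(p|q) &= \sum_x p(x) \log \frac{q(x)}{p(x)} \\
        &\le \sum_x p(x) \left( \frac{q(x)}{p(x)} - 1 \right) \\
        &= \sum_x q(x) - \sum_x p(x) \\
        &\le 1 - 1 = 0
\end{align*}
```

Equality requires $q(x)/p(x) = 1$ wherever $p(x) > 0$, which is the statement that $H(p|q) = 0$ iff $p = q$.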

Complexity of comparing HMMs

- It is proved to be NP-hard. (Lyngso and Pedersen, LNCS, 2001, 2223:416-428.)


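Hidden Markov Model

[Figure: profile HMM with Start and End states, three match states (M), three delete states (D), and four insert states (I); node positions 0 to 3.]

Alignment (x marks the match columns, numbered 1 to 3):

        x x . . . x
bat     A G - - - C
rat     A - A G - C
cat     A G - A A -
gnat    - - A A A C
goat    A G - - - C
        1 2       3

Observed emission/transition counts

node position    0  1  2  3
---------------------------
Match emissions
A                -  4  0  0
C                -  0  0  4
G                -  0  3  0
T                -  0  0  0
---------------------------
Insert emissions
A                0  0  6  0
C                0  0  0  0
G                0  0  1  0
T                0  0  0  0
---------------------------
State transitions
MM               4  3  2  4
MD               1  1  0  0
MI               0  0  1  0
IM               0  0  2  0
ID               0  0  1  0
II               0  0  4  0
DM               -  0  0  1
DD               -  1  0  0
DI               -  0  2  0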
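Counts such as these are usually turned into model parameters by normalizing over the alternatives that leave the same state. The sketch below does this for a few rows of the counts; the helper name `normalize` and the optional `pseudocount` argument are assumptions for illustration, not something the slide specifies.

```python
def normalize(counts, pseudocount=0.0):
    """Turn a dict of counts into probabilities, optionally adding a pseudocount."""
    total = sum(counts.values()) + pseudocount * len(counts)
    if total == 0:
        raise ValueError("no observations and no pseudocount: cannot normalize")
    return {key: (count + pseudocount) / total for key, count in counts.items()}

# Counts copied from the slide.
match_emissions_node2 = {"A": 0, "C": 0, "G": 3, "T": 0}
insert_emissions_node2 = {"A": 6, "C": 0, "G": 1, "T": 0}
transitions_from_match_node0 = {"MM": 4, "MD": 1, "MI": 0}

# Maximum-likelihood estimates (no pseudocount).
print(normalize(transitions_from_match_node0))
print(normalize(insert_emissions_node2))

# Adding one pseudocount per symbol (an assumed choice) avoids zero probabilities.
print(normalize(match_emissions_node2, pseudocount=1.0))
```

Without a pseudocount, any residue or transition never seen in the alignment gets probability zero, which is why some form of smoothing is common for small alignments.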

Comparison levels for homology detection

Sequence-to-sequence (pairwise)
- for proteins with relatively high sequence identity
- dynamic programming methods

Sequence-to-profile
- for distant relationships and improved alignment accuracy
- PSI-BLAST, HMMER, SAM

Profile-to-profile
- for more sensitivity and accuracy of alignment
- COMPASS, Prof_sim

[Figure: example alignments illustrating the three comparison levels, with panels labeled query, hit, Seq dbase, and Prof dbase. The pairwise example aligns V G A H - A G E Y with A G A H D - G E F.]


Performance quantifiers

Ability to detect distant relationships
- sensitivity
- specificity

Accuracy of alignment prediction (when compared to the corresponding structure-based alignment)

Sequence-based alignment:
V G A H - A G E Y
A G A H D - G E F

Structure-based alignment:
G A H A G E
G A H D G E

- Modeler's accuracy metric (Qm) = Nc / Nseq
- Developer's accuracy metric (Qd) = Nc / Nstr
- Combined metric (Qc) = Nc / (Nseq + Nstr - Nc)

where
- Nc = number of aligned pairs common to both alignments
- Nseq = number of aligned pairs in the sequence-based alignment
- Nstr = number of aligned pairs in the structure-based alignment

For the example above:

sid   Qm    Qd    Qc
5/6   5/7   5/6   5/8
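The three metrics can be checked on the slide's example with a short sketch. The function names are hypothetical, and the fully gapped form of the structure-based alignment (`V-GAHAGEY-` over `-AGAHDGE-F`) is an assumption: the slide shows only the six aligned columns, so the unaligned end residues are placed in gap columns here.

```python
from fractions import Fraction

def aligned_pairs(row1, row2, gap="-"):
    """Return the set of (i, j) residue-index pairs that share a column."""
    if len(row1) != len(row2):
        raise ValueError("alignment rows must have equal length")
    pairs = set()
    i = j = 0  # running residue indices in the ungapped sequences
    for a, b in zip(row1, row2):
        if a != gap and b != gap:
            pairs.add((i, j))
        if a != gap:
            i += 1
        if b != gap:
            j += 1
    return pairs

def accuracy_metrics(seq_aln, str_aln):
    """Return (Qm, Qd, Qc) for a sequence-based and a structure-based alignment."""
    seq_pairs = aligned_pairs(*seq_aln)
    str_pairs = aligned_pairs(*str_aln)
    n_c = len(seq_pairs & str_pairs)
    n_seq, n_str = len(seq_pairs), len(str_pairs)
    return (Fraction(n_c, n_seq),
            Fraction(n_c, n_str),
            Fraction(n_c, n_seq + n_str - n_c))

seq_aln = ("VGAH-AGEY", "AGAHD-GEF")    # sequence-based alignment from the slide
str_aln = ("V-GAHAGEY-", "-AGAHDGE-F")  # assumed gapped form of the structure-based alignment

qm, qd, qc = accuracy_metrics(seq_aln, str_aln)
print(qm, qd, qc)  # 5/7 5/6 5/8
```

Indexing pairs by residue position rather than by column is what makes the two alignments comparable even though they have different lengths.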

[Figure: average accuracy parameter (0 to 1) plotted against sequence identity regions 1 to 6.]

query       hit          e-val      relationship   tp   fp
HBA_HUMAN   HBB_HUMAN    3.87e-60   +1             1    0
            MYG_PHYCA    5.02e-23   +1             2    0
            GLB3_CHITP   5.60e-4    +1             3    0
            GLB5_PETMA   1.43e-1    -1             3    1
            GLB2_LUPLU   1.56e+1    +1             4    1
            GLB1_GLYDI   1.45e+3    -1             4    2

tp: true positive count
fp: false positive count
relationship is +1 if the query and hit sequences are related at the superfamily level, and -1 otherwise
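The running tp and fp columns follow directly from the ranked relationship labels, as the sketch below shows. The function names are hypothetical, and the area computation is one common way to summarize a ROC curve (the fraction of related/unrelated pairs ranked in the right order); the lecture does not say which variant it uses.

```python
def cumulative_counts(labels):
    """Walk down a hit list sorted by e-value and return running tp and fp counts."""
    tp = fp = 0
    tps, fps = [], []
    for label in labels:
        if label > 0:
            tp += 1
        else:
            fp += 1
        tps.append(tp)
        fps.append(fp)
    return tps, fps

def roc_area(labels):
    """Fraction of (related, unrelated) hit pairs in which the related hit ranks higher."""
    tp = 0
    area = 0
    for label in labels:
        if label > 0:
            tp += 1
        else:
            area += tp  # every related hit seen so far outranks this unrelated hit
    n_pos = tp
    n_neg = len(labels) - tp
    return area / (n_pos * n_neg)

# Relationship column for query HBA_HUMAN, in order of increasing e-value.
labels = [+1, +1, +1, -1, +1, -1]

tps, fps = cumulative_counts(labels)
print(tps)  # [1, 2, 3, 3, 4, 4]
print(fps)  # [0, 0, 0, 1, 1, 2]
print(roc_area(labels))  # 0.875
```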

[Figure: ROC plot of tp against fp for the hit list above.]


On profile-profile comparisons

From MSA to numeric profiles
- sampling
- dropping columns

Alignment of numeric profiles
- scoring functions
- dynamic programming

Example: COMPASS (Sadreyev et al., J. Mol. Biol. (2003) 326, pp. 317-336)
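To make the "MSA to numeric profile" and "scoring function" steps concrete, here is a minimal sketch. It is not COMPASS's procedure: the gap-fraction cutoff for dropping columns, the plain frequency profile, and the dot-product column score are all assumptions chosen for brevity, and the toy MSA is hypothetical.

```python
from collections import Counter

def numeric_profile(msa, alphabet, max_gap_fraction=0.5, gap="-"):
    """Convert an MSA (list of equal-length rows) to per-column residue frequencies.

    Columns whose gap fraction exceeds max_gap_fraction are dropped.
    """
    n_rows = len(msa)
    profile = []
    for column in zip(*msa):
        residues = [c for c in column if c != gap]
        gap_fraction = (n_rows - len(residues)) / n_rows
        if not residues or gap_fraction > max_gap_fraction:
            continue  # drop gap-rich column
        counts = Counter(residues)
        profile.append({a: counts[a] / len(residues) for a in alphabet})
    return profile

def column_score(col1, col2):
    """Dot product of two frequency columns (a stand-in scoring function)."""
    return sum(col1[a] * col2[a] for a in col1)

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical toy MSA; the fourth column is 75% gaps and gets dropped.
msa = ["VGA-H",
       "V-A-N",
       "VEAD-",
       "VKA--"]

profile = numeric_profile(msa, AMINO_ACIDS)
print(len(profile))                           # 4 columns kept out of 5
print(column_score(profile[0], profile[0]))   # identical all-V columns
print(column_score(profile[0], profile[2]))   # no residues in common
```

A column score of this kind would then fill the L1 x L2 matrix that dynamic programming searches for the best-scoring alignment of the two profiles.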

[Figure: two MSAs are converted into numeric profiles (rows 1..20, lengths L1 and L2), which are compared through a substitution matrix to produce an alignment.]


Quasi consensus based comparison of HMMs

- Build profile HMMs using existing packages (SAM-T99 or HMMER)
- Generation of a quasi consensus sequence from each model
- Alignment of the consensus sequence of one model with the other model
- Extraction of two alignments, one in each direction
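The slides do not spell out how the quasi consensus sequence is generated. One simple possibility, assumed here purely for illustration, is to take the most probable residue at each match state; the function name and the toy emission tables below are hypothetical.

```python
def quasi_consensus(match_emissions):
    """Pick the most probable residue at each match state (assumed scheme)."""
    return "".join(max(column, key=column.get) for column in match_emissions)

# Hypothetical match-state emission probabilities for a four-state model.
match_emissions = [
    {"V": 0.7, "A": 0.2, "I": 0.1},
    {"G": 0.6, "K": 0.4},
    {"A": 0.9, "S": 0.1},
    {"N": 0.5, "H": 0.3, "D": 0.2},
]

print(quasi_consensus(match_emissions))  # VGAN
```

The resulting string can be scored against another profile HMM with the usual sequence-to-profile tools, which is what reduces the profile-to-profile problem to a sequence-to-profile one.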



[Figure: quasi consensus based comparison. Model M1 (built from Seed 1) yields Consensus 1 (V G A N V A E H) and model M2 (built from Seed 2) yields Consensus 2 (V K A T I A E H). Each consensus is scored against the other model, giving S(c2|M1) and S(c1|M2), and the two alignments Aln21 and Aln12 are extracted.]




Benchmark experiment I: Detection ability

All-vs-all comparisons of 569 MSAs from (Wang and Dunbrack, 2004) using COMPASS and QC-COMP. Two MSAs are said to be related if their seed sequences are from the same SCOP superfamily.

In all-vs-all comparisons using QC-COMP, the ith HMM is used to score the consensus sequences from the remaining 568 HMMs, and the resulting scores are transformed into z-scores:

$$z_i(c_k) = \frac{s_i(c_k) - \langle s \rangle}{\sigma}$$

where $\langle s \rangle$ and $\sigma$ are the mean and standard deviation of the scores produced by the ith HMM.

$M_i = \{ z_i(c_1), z_i(c_2), \ldots, z_i(c_{i-1}), z_i(c_{i+1}), \ldots, z_i(c_{569}) \}$

$M_j = \{ z_j(c_1), z_j(c_2), \ldots, z_j(c_{j-1}), z_j(c_{j+1}), \ldots, z_j(c_{569}) \}$

- Asymmetric similarity measure between $M_i$ and $M_j$: $d_{ij} = M_i \cdot e_j = z_i(c_j)$
- Symmetric similarity measure between $M_i$ and $M_j$: $d_{ij} = M_i \cdot e_j + M_j \cdot e_i = z_i(c_j) + z_j(c_i)$

The same experiment is repeated using seed sequences instead of consensus sequences.

For COMPASS, the ith profile is compared with the remaining 568 profiles and the scores are transformed into z-scores. The same similarity measures are used. We also consider E-value measures.
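The z-score transformation and the two similarity measures can be sketched on a small all-vs-all score matrix. The raw scores below are hypothetical, the function names are illustrative, and using the population standard deviation of each row is an assumption.

```python
import statistics

def z_scores(raw):
    """Row-wise z-scores of raw[i][k], the score of consensus k under model i.

    Diagonal entries (a model against its own consensus) are excluded.
    """
    n = len(raw)
    z = [[None] * n for _ in range(n)]
    for i in range(n):
        others = [raw[i][k] for k in range(n) if k != i]
        mean = statistics.mean(others)
        sd = statistics.pstdev(others)  # assumed: population standard deviation
        for k in range(n):
            if k != i:
                z[i][k] = (raw[i][k] - mean) / sd
    return z

def asymmetric(z, i, j):
    """d_ij = z_i(c_j)."""
    return z[i][j]

def symmetric(z, i, j):
    """d_ij = z_i(c_j) + z_j(c_i)."""
    return z[i][j] + z[j][i]

# Hypothetical raw scores for four models (None on the diagonal).
raw = [
    [None, 12.0, 3.0, 1.0],
    [10.0, None, 2.0, 4.0],
    [2.0, 3.0, None, 9.0],
    [1.0, 5.0, 8.0, None],
]

z = z_scores(raw)
print(asymmetric(z, 0, 1), asymmetric(z, 1, 0))  # generally different
print(symmetric(z, 0, 1), symmetric(z, 1, 0))    # equal by construction
```

Normalizing each row separately puts scores from different models on a comparable scale before they are added in the symmetric measure.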

Results for detection ability experiment

ROC values

          COMPASS    SEED       CON
sym       0.883450   0.858050   0.914950
asym      0.839538   0.761250   0.866337
e-value   0.876912   -          -


Benchmark experiment II: Alignment accuracy

2305 pairs of MSAs from (Wang and Dunbrack, 2004) were aligned using COMPASS and QC-COMP.

The same experiment is repeated using seed sequences instead of consensus sequences.

Region   Identity range   #Pairs
1        0.00 - 0.05      58
2        0.05 - 0.10      522
3        0.10 - 0.15      598
4        0.15 - 0.20      382
5        0.20 - 0.25      258
6        0.25 - 0.30      217
7        0.30 - 0.35      162
8        0.35 - 0.40      108
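Assigning a pair to its identity region is a simple binning step, sketched below. The function name is hypothetical, and treating each range as half-open (lower bound included, upper bound excluded) is an assumption, since the table does not say which region a boundary value belongs to.

```python
from bisect import bisect_right

# Upper bounds of regions 1..8 and the pair counts from the table.
UPPER_BOUNDS = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40]
PAIR_COUNTS = [58, 522, 598, 382, 258, 217, 162, 108]

def identity_region(identity):
    """Return the region number (1 to 8) for a sequence identity in [0.00, 0.40)."""
    if not 0.0 <= identity < UPPER_BOUNDS[-1]:
        raise ValueError("identity outside the benchmark range")
    return bisect_right(UPPER_BOUNDS, identity) + 1

print(identity_region(0.12))  # 3
print(sum(PAIR_COUNTS))       # 2305, the total number of MSA pairs
```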

[Figure: each pair of MSAs is aligned with COMPASS, or through the consensus scores S(c2|M1) and S(c1|M2), which give the alignments Aln21 and Aln12. The extracted alignment is evaluated with the accuracy parameters Qm, Qd and Qc.]

Extraction schemes
- MAX
- AND
- AND1
- AND2
- AND3


Results for alignment accuracy experiment

Consensus based



Seed based



Mix

Mixing scheme: if the symmetric similarity measure between a pair of HMMs is less than -22.0, the seed-based alignment is taken; otherwise, the consensus-based alignment is chosen. The threshold -22.0 was determined using a separate training set (1136 pairs of HMMs).
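The mixing rule amounts to a single threshold test, sketched below. The function and parameter names are hypothetical; only the threshold value and the direction of the comparison come from the slide.

```python
MIX_THRESHOLD = -22.0  # value reported in the lecture

def choose_alignment(symmetric_score, seed_alignment, consensus_alignment,
                     threshold=MIX_THRESHOLD):
    """Return the seed-based alignment below the threshold, else the consensus-based one."""
    if symmetric_score < threshold:
        return seed_alignment
    return consensus_alignment

print(choose_alignment(-30.0, "seed", "consensus"))  # seed
print(choose_alignment(-22.0, "seed", "consensus"))  # consensus (not strictly below)
print(choose_alignment(5.0, "seed", "consensus"))    # consensus
```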
