
Protein analysis 3

Multiple sequence alignment

http://www.bioinfo.biocenter.helsinki.fi/downloads/teaching/spring2006/proteiinianalyysi

Multiple sequence alignment (msa)

• A) define characters for phylogenetic analysis
• B) search for additional family members

SeqA  N ∙ F L S
SeqB  N ∙ F – S
SeqC  N K Y L S
SeqD  N ∙ Y L S

[Figure: tree relating the sequences NYLS, NKYLS, NFS and NFLS; changes marked on the tree: +K, -L, Y/F]

Complexity of performing msa

• Sum of pairs (SP) score
• Extension of sequence pair alignment by dynamic programming
  – O(L^N), where L is sequence length and N is number of sequences
• MSA algorithm searches subspace bounded by
  – Heuristic multiple alignment (suboptimal)
  – Pairwise optimal alignments
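The SP score can be illustrated on the four-sequence alignment shown earlier. This is a minimal sketch: the match, mismatch and gap values are illustrative assumptions, not taken from the slides, and both the ∙ and – symbols are written as a plain "-" gap.

```python
from itertools import combinations


def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score one pair of aligned symbols (scoring values are assumptions)."""
    if a == "-" and b == "-":
        return 0  # a gap aligned with a gap costs nothing
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def sum_of_pairs(alignment):
    """Add pair_score over every column and every pair of rows."""
    total = 0
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            total += pair_score(a, b)
    return total


# SeqA..SeqD from the example alignment, with "-" for every gap symbol.
msa = ["N-FLS", "N-F-S", "NKYLS", "N-YLS"]
sp = sum_of_pairs(msa)
print(sp)
```

Each column contributes N(N-1)/2 pair scores, so scoring a given alignment is cheap; the O(L^N) cost comes from searching over all alignments.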

Pairwise projection

[Figure: dynamic programming matrix with axes Sequence A and Sequence B, showing the pairwise optimal alignment and the greedy multiple alignment.]

Global msa optimum (sequences A,B,C,…) is probably found in the area in-between

Dynamic programming

[Figure: weighted graph with nodes BEGIN, A, B, C, D and END; edge weights shown include 1, 30, 31, 2 and 4.]

Maximal path sum BEGIN → END?
(a) Enumerate every path (brute force)
(b) Use induction: only one optimal path up to any node in graph.
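Option (b) can be sketched as follows. The graph below is hypothetical: the node names follow the slide, but the edges and weights are made up for illustration because the figure did not survive extraction.

```python
from functools import lru_cache

# Hypothetical weighted DAG (assumption: not the graph from the slide).
EDGES = {
    "BEGIN": {"A": 1, "C": 2},
    "A": {"B": 30, "D": 4},
    "C": {"B": 3, "D": 31},
    "B": {"END": 1},
    "D": {"END": 2},
    "END": {},
}


def best_path(edges, source="BEGIN", target="END"):
    """Return (score, path) of the maximal-sum path from source to target."""
    # Index the incoming edges of every node.
    preds = {}
    for u, outs in edges.items():
        for v, w in outs.items():
            preds.setdefault(v, []).append((u, w))

    @lru_cache(maxsize=None)
    def best(node):
        # Induction: the best path to a node extends the best path
        # to one of its predecessors, so each node is solved once.
        if node == source:
            return 0, (source,)
        candidates = []
        for u, w in preds.get(node, []):
            sub = best(u)
            if sub is not None:
                candidates.append((sub[0] + w, sub[1] + (node,)))
        return max(candidates) if candidates else None

    return best(target)


score, path = best_path(EDGES)
print(score, path)
```

Memoising the best score per node is what replaces brute-force enumeration: the work grows with the number of edges, not the number of paths.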

Example: all paths leading to B

[Figure: the graph with nodes BEGIN, A, B, C, D and END, annotated with edge weights and path sums.]

Motivation: existing alignment methods

Heuristic alignment algorithms

method      progressive pairwise alignment             profile hidden Markov models
software    ClustalW, MUSCLE, ProbCons                 SAM, HMMER
strengths   consider evolutionary relatedness          site-specific emission & indel probabilities
drawbacks   assume homogeneous process across sites    assume star-like tree, problems with long gaps
other       require guide tree                         require training alignment


Progressive alignment

• 1. derive a guide tree from pairwise sequence distances

• 2. align the closest (groups of) sequences

• Pluses: works with many sequences

• Minuses: errors made early on are frozen in the final alignment

• Programs: Clustal, Pileup, T-Coffee

Progressive pairwise alignment

[Figure: two panels. "Evolution" (arrow labelled time): a tree showing the sequences TACCTA, TACGTA, CACCTA, TATA, TACGTA, TACGTC and TACAGTA. "Alignment" (arrow labelled progress): the sequences CACCTA, TATA, TACGTC and TACAGTA.]

Progressive pairwise alignment

[Figure: the sequences CACCTA, TATA, TACGTC and TACAGTA are added in three steps (arrow labelled progress). Step 1: dynamic programming matrix aligning TACGTC with T A C A G T A; the moves are labelled match, vertical gap and horizontal gap. Step 2: TATA is added. Step 3: CACCTA is added.]

Progressive pairwise alignment

[Figure: steps 1, 2 and 3 side by side, ending in the alignment of the four sequences. Gap positions are not recoverable from the extracted text:
T A C A G T A
T A C G T C
T A T A
C A C G T A]
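Each step of the progressive scheme is a pairwise dynamic programming alignment with the three moves named in the figure (match, vertical gap, horizontal gap). A minimal Needleman-Wunsch sketch for step 1 follows; the scoring values are assumptions, and aligning a sequence to an existing alignment (steps 2 and 3) is not covered.

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, gap=-1):
    """Global alignment of x and y; returns (score, aligned_x, aligned_y)."""
    n, m = len(x), len(y)

    def sub(i, j):
        return match if x[i - 1] == y[j - 1] else mismatch

    # score[i][j] = best score for aligning x[:i] with y[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # match or mismatch
                score[i - 1][j] + gap,            # gap in y
                score[i][j - 1] + gap,            # gap in x
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return score[n][m], "".join(reversed(ax)), "".join(reversed(ay))


nw_score, row_x, row_y = needleman_wunsch("TACAGTA", "TACGTC")
print(nw_score)
print(row_x)
print(row_y)
```

Several alignments can share the optimal score, so the rows printed depend on the tie-breaking order in the traceback.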

Profile HMMs

[Figure: the sequence CAPNTAGNHPCNGFGTCVPGHGDGSAANNVF repeated in six rows.]

Iterative methods

• Gotoh
  – inner loop uses progressive alignment
  – outer loop recomputes distances and guide tree based on the msa obtained
• DIALIGN
  – Identify high-scoring ungapped segment pairs
  – Build up msa using consistent segment pairs
  – Greedy algorithm DIALIGN2; exact DIALIGN3

Motifs

- several proteins are grouped together by similarity searches
- they share a conserved motif
- motif is stringent enough to retrieve the family members from the complete protein database

MMCOL10A1_1.483  SGSAIMELTENDQVWLQLPNA-ESNGLYSSEYVHSSFSGFLVAPM-------
Ca1x_Chick       SGSAVIDLMENDQVWLQLPNS-ESNGLYSSEYVHSSFSGFLFAQI-------
S15435           SGSAVLLLRPGDRVFLQMPSE-QAAGLYAGQYVHSSFSGYLLYPM-------
CA18_MOUSE.597   SGSAVLLLRPGDQVFLQNPFE-QAAGLYAGQYVHSSFSGYLLYPM-------
Ca28_Human       SGGAVLQLRPNDQVWVQIPSD-QANGLYSTEYIHSSFSGFLLCPT-------
MM37222_1.98     SGSVLLHLEVGDQVWLQVYGDGDHNGLYADNVNDSTFTGFLLYHDTN-----
COLE_LEPMA.264   SNLALLHLTDGDQVWLETLR--DWNGXYSSSEDDSTFSGFLLYPDTKKPTAM
HP27_TAMAS.72    SGTAILQLGMEDRVWLENKL--SQTDLERG-TVQAVFSGFLIHEN-------
S19018           AGGTVLQLRRGDEVWIEKDP--AKGRIYQGTEADSIFSGFLIFPS-------
C1qb_Mouse       TGGVVLKLEQEEVVHLQATD---KNSLLGIEGANSIFTGFLLFPD-------
C1qb_Human       TGGMVLKLEQGENVFLQATD---KNSLLGMEGANSIFSGFLLFPD-------
Cerb_Human       SNGVLIQMEKGDRAYLKLER---GN-LMGG-WKYSTFSGFLVFPL-------
2.HS27109_1      TGDALLELNYGQEVWLRLAK----GTIPAKFPPVTTFSGYLLYRT-------
(conservation line, column positions lost in extraction)  :: : : : * *:*

Local msa

• Profile analysis takes a portion of the alignment that is highly conserved and produces a type of scoring matrix called a profile. New sequences can be aligned to the profile using dynamic programming.

• Block analysis scans a global msa for ungapped regions, called blocks, and these blocks are then used in sequence alignments.

• Pattern-searching or statistical methods find recurrent regions of sequence similarity in a set of initially unaligned sequences.

TEIRESIAS

• Finds all patterns with minimum support in input set of unaligned sequences
  – Maximum spacer length
  – For example L…G…………..A….L…L
• Enumerative algorithm
  – Build up longer patterns by merging overlapping short patterns
  – For example A….L + L…L → A….L…L
• Fewer instances by chance when more positions are specified
• Pluses: exact
• Minuses: biological significance of the patterns?
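The merge step and the notion of support can be sketched as below. This is not the TEIRESIAS implementation, only an illustration of the two ideas: patterns are written with ASCII dots as wildcards, and the three toy sequences are hypothetical.

```python
import re


def merge(left, right, overlap=1):
    """Join two patterns whose shared end and start (length `overlap`) agree."""
    if left[-overlap:] != right[:overlap]:
        raise ValueError("patterns do not overlap")
    return left + right[overlap:]


def support(pattern, sequences):
    """Return the sequences that contain the pattern ('.' matches any residue)."""
    regex = re.compile(pattern)
    return [s for s in sequences if regex.search(s)]


# Hypothetical unaligned input sequences (assumption, not from the slides).
toy_sequences = ["MAKTGELPQRLS", "AVVVVLGGGLK", "GAPPPPLKK"]

short_pattern = "A....L"
long_pattern = merge("A....L", "L...L")  # A....L + L...L -> A....L...L

print(long_pattern)
print(len(support(short_pattern, toy_sequences)))
print(len(support(long_pattern, toy_sequences)))
```

The longer pattern is supported by fewer sequences than the short one, which is the point of the bullet on specifying more positions.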

Expectation maximization (EM)

• Initial guess of motif location in each sequence
• E-step: Estimate frequencies of amino acids in each motif column. Columns not in motif provide background frequencies. For each possible site location, calculate the probability that the site starts there.
• M-step: use the site probabilities as weights to provide a new table of expected values for base counts for each of the site positions.
• Repeat E-step and M-step until convergence.

EM procedure

• Let the observed variables be known as y and the latent variables as z. Together, y and z form the complete data. Assume that p is a joint model of the complete data with parameters θ: p(y,z | θ). An EM algorithm will then iteratively improve an initial estimate θ0 and construct new estimates θ1 through θN. An individual re-estimation step that derives θn+1 from θn takes the following form (shown for the discrete case; the continuous case is similar):
  – Eqn 1
• In other words, θn+1 is the value that maximizes (M) the expectation (E) of the complete data log-likelihood with respect to the conditional distribution of the latent data under the previous parameter value. This expectation is usually denoted as Q(θ):
  – Eqn 2
• Speaking of an expectation (E) step is a bit of a misnomer. What is calculated in the first step are the fixed, data-dependent parameters of the function Q. Once the parameters of Q are known, it is fully determined and is maximized in the second (M) step of an EM algorithm.
• It can be shown that an EM iteration does not decrease the observed data likelihood function, and that the only stationary points of the iteration are the stationary points of the observed data likelihood function. In practice, this means that an EM algorithm will converge to a local maximum of the observed data likelihood function.
• EM is particularly useful when maximum likelihood estimation of a complete data model is easy. If closed-form estimators exist, the M step is often trivial. A classic example is maximum likelihood estimation of a finite mixture of Gaussians, where each component of the mixture can be estimated trivially if the mixing distribution is known.
• "Expectation-maximization" is a description of a class of related algorithms, not a specific algorithm; EM is a recipe or meta-algorithm which is used to devise particular algorithms. The Baum-Welch algorithm is an example of an EM algorithm applied to hidden Markov models. Another example is the EM algorithm for fitting a mixture density model.
• An EM algorithm can also find maximum a posteriori (MAP) estimates, by performing MAP estimation in the M step, rather than maximum likelihood.
• There are other methods for finding maximum likelihood estimates, such as gradient descent, conjugate gradient or variations of the Gauss-Newton method. Unlike EM, such methods typically require the evaluation of first and/or second derivatives of the likelihood function.
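The two equations were images on the slide and did not survive extraction. A reconstruction from the verbal description above, for the discrete case and in the standard textbook form (an assumption, not copied from the slide), would read:

```latex
% Eqn 1: one re-estimation step
\theta_{n+1} = \arg\max_{\theta} \sum_{z} p(z \mid y, \theta_n)\, \log p(y, z \mid \theta)

% Eqn 2: the expected complete-data log-likelihood
Q(\theta) = \sum_{z} p(z \mid y, \theta_n)\, \log p(y, z \mid \theta)
```

With this notation Eqn 1 is simply θn+1 = arg max Q(θ): the weights p(z | y, θn) are fixed by the previous estimate, and only the log-likelihood term depends on θ.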

Exercise

• Analyze the following ten DNA sequences by the expectation maximization algorithm. Assume that the background base frequencies are each 0.25 and that the middle three positions are a motif. The size of the motif is a guess that is based on a molecular model. The alignment of the sequences is also a guess.

Seq1   C CAG A
Seq2   G TTA A
Seq3   G TAC C
Seq4   T TAT T
Seq5   C AGA T
Seq6   T TTT G
Seq7   A TAC T
Seq8   C TAT G
Seq9   A GCT C
Seq10  G TAG A

(a) Calculate the observed frequency of each base at each of the three middle positions

Base   1st   2nd   3rd
A      0.1   0.6   0.2
C      0.1   0.1   0.2
G      0.1   0.1   0.2
T      0.7   0.2   0.4

(b) Calculate the odds likelihood of finding the motif at each of the possible locations in sequence 5

Using the frequency table from (a). Brackets mark the candidate motif position in Seq5, CAGAT:

[CAG]AT: 0.1*0.6*0.2/0.25/0.25/0.25=0.768
C[AGA]T: 0.1*0.1*0.2/0.25/0.25/0.25=0.128
CA[GAT]: 0.1*0.6*0.4/0.25/0.25/0.25=1.536

Non-motif sites: 0.25/0.25=1

(c) Calculate the probability of finding the motif at each position of sequence 5

Using the frequency table from (a). Brackets mark the candidate motif position in Seq5, CAGAT:

[CAG]AT: 0.1*0.6*0.2*0.25*0.25=0.00075
C[AGA]T: 0.25*0.1*0.1*0.2*0.25=0.000125
CA[GAT]: 0.25*0.25*0.1*0.6*0.4=0.0015

Non-motif sites: p=0.25

(d) Calculate what change will be made to the base count in each column of the motif table as a result of matching the motif to sequence 5.

[CAG]AT: 0.1*0.6*0.2*0.25*0.25=0.00075, rel. weight 0.32
C[AGA]T: 0.25*0.1*0.1*0.2*0.25=0.000125, rel. weight 0.05
CA[GAT]: 0.25*0.25*0.1*0.6*0.4=0.0015, rel. weight 0.63

Updated counts from sequence 5 (the initial location guess, C[AGA]T, was shaded on the slide):

Base   1st    2nd               3rd
A      0.05   0.32+0.63=0.95    0.05
C      0.32   0                 0
G      0.63   0.05              0.32
T      0      0                 0.63

(e) What other steps are taken to update or maximize the table values?

• The weighted sequence data from the remaining sequences are also added to the counts table
• The base frequencies in the new table are used as an updated estimate of the site residue composition.
• The expectation and maximization steps are repeated until the base frequencies do not change.
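Parts (a), (c) and (d) of the exercise can be reproduced with a short script. This is a sketch of a single pass for sequence 5 only, not a full EM implementation; the function names are illustrative.

```python
from collections import Counter

# The ten exercise sequences, written without the spaces around the motif.
SEQS = ["CCAGA", "GTTAA", "GTACC", "TTATT", "CAGAT",
        "TTTTG", "ATACT", "CTATG", "AGCTC", "GTAGA"]
BASES = "ACGT"
BG = 0.25   # background frequency of every base
WIDTH = 3   # guessed motif width


def column_frequencies(seqs, start=1, width=WIDTH):
    """(a) Observed base frequencies in each motif column."""
    table = []
    for k in range(width):
        counts = Counter(s[start + k] for s in seqs)
        table.append({b: counts[b] / len(seqs) for b in BASES})
    return table


def site_probabilities(seq, freqs):
    """(c) Probability of the sequence for each possible motif start."""
    probs = []
    for start in range(len(seq) - len(freqs) + 1):
        p = 1.0
        for i, base in enumerate(seq):
            k = i - start
            # Inside the motif use the column frequency, outside the background.
            p *= freqs[k][base] if 0 <= k < len(freqs) else BG
        probs.append(p)
    return probs


def expected_counts(seq, freqs):
    """(d) Relative weights and the weighted base counts they contribute."""
    probs = site_probabilities(seq, freqs)
    total = sum(probs)
    weights = [p / total for p in probs]
    counts = [{b: 0.0 for b in BASES} for _ in freqs]
    for start, w in enumerate(weights):
        for k in range(len(freqs)):
            counts[k][seq[start + k]] += w
    return weights, counts


freqs = column_frequencies(SEQS)
weights, counts = expected_counts("CAGAT", freqs)
print([round(w, 2) for w in weights])
print([{b: round(c, 2) for b, c in col.items()} for col in counts])
```

Repeating `expected_counts` for every sequence and summing the tables would give the updated counts described in (e).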


Gibbs sampler

• Start from random msa; realign one sequence against profile derived from the other sequences; iterate

• Finds the most probable pattern common to all of the sequences by sliding them back and forth until the ratio of the motif probability to the background probability is a maximum.

A. Estimate the amino acid frequencies in the motif columns of all but one sequence. Also obtain background.

[Figure: random start. The sequences are drawn as rows of x, with the randomly chosen motif position in each row marked M.]

B. Use the estimates from A to calculate the ratio of probability of motif to background score at each position in the left-out sequence. This ratio for each possible location is the weight of the position.

[Figure: the left-out sequence xxxxxxxxxx shown four times, with the motif M placed at a different position each time.]

C. Choose a new location for the motif in the left-out sequence by a random selection using the weights to bias the choice.

xxxxxxxMxx   Estimated location of the motif in left-out sequence.

D. Repeat steps A to C many (>100) times.

Exercise: Gibbs sampler

• Analyze the following DNA sequences by the Gibbs sampling algorithm.

Seq1   C CAG A
Seq2   G TTA A
Seq3   G TAC C
Seq4   T TAT T
Seq5   C AGA T
Seq6   T TTT G
Seq7   A TAC T
Seq8   C TAT G
Seq9   A GCT C
Seq10  G TAG A

(a) Assuming that the background base frequencies are 0.25, calculate a log odds matrix for the central three positions.

Observed frequencies:

Base   1st   2nd   3rd
A      0.1   0.6   0.2
C      0.1   0.1   0.2
G      0.1   0.1   0.2
T      0.7   0.2   0.4

Log-odds scores, log2(frequency/0.25):

Base   1st     2nd     3rd
A      -1.32   1.26    -0.32
C      -1.32   -1.32   -0.32
G      -1.32   -1.32   -0.32
T      1.49    -0.32   0.68

(b) Assuming that another sequence GTTTG is the left-out sequence, slide the log-odds matrix along the left-out sequence and find the log-odds score at each of three possible positions.

Using the log-odds matrix from (a). Brackets mark the position of the matrix:

[GTT]TG: -1.32 + -0.32 + 0.68 = -0.96
G[TTT]G: 1.49 + -0.32 + 0.68 = 1.85
GT[TTG]: 1.49 + -0.32 + -0.32 = 0.85

(c) Calculate the probability of a match at each position in the left-out sequence

log-odds s = log2(p/bg), so p = 2^s * bg, with bg = 0.25^3 = 1/64

Each line gives p from the log-odds score, followed by the same value computed directly from the frequency table in (a):

[GTT]TG: 2^-0.96/64=0.008, 0.1*0.2*0.4=0.008
G[TTT]G: 2^1.85/64=0.056, 0.7*0.2*0.4=0.056
GT[TTG]: 2^0.85/64=0.028, 0.7*0.2*0.2=0.028

(d) How do we choose a possible location for the motif in the left-out sequence?

• 0.008+0.056+0.028=0.092
• Normalised weights:
  – [GTT]TG: 0.008/0.092=0.09
  – G[TTT]G: 0.056/0.092=0.61
  – GT[TTG]: 0.028/0.092=0.30
• The new location is drawn at random, using these weights to bias the choice (step C).
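The four parts of this exercise map onto one sampling step for the left-out sequence. A minimal sketch, assuming the frequency table from (a) and a uniform background of 0.25; the function names are illustrative.

```python
import math
import random

# Column frequencies from part (a), one dict per motif column.
FREQS = [
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.6, "C": 0.1, "G": 0.1, "T": 0.2},
    {"A": 0.2, "C": 0.2, "G": 0.2, "T": 0.4},
]
BG = 0.25


def log_odds_matrix(freqs, bg=BG):
    """(a) log2 of observed over background frequency for every cell."""
    return [{b: math.log2(p / bg) for b, p in col.items()} for col in freqs]


def window_scores(seq, lod):
    """(b) Log-odds score of the matrix at each possible start."""
    w = len(lod)
    return [sum(lod[k][seq[s + k]] for k in range(w))
            for s in range(len(seq) - w + 1)]


def sample_start(seq, lod, rng):
    """(c)+(d) Turn scores into normalised weights and draw a start."""
    odds = [2 ** s for s in window_scores(seq, lod)]
    total = sum(odds)
    # The constant background factor cancels in the normalisation.
    weights = [o / total for o in odds]
    start = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    return start, weights


lod = log_odds_matrix(FREQS)
start, weights = sample_start("GTTTG", lod, random.Random(0))
print([round(w, 2) for w in weights])
print(start)
```

The sampled start varies with the random seed; only the weights are fixed by the data. In a full sampler this step is repeated with a different sequence left out each time.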

Modelling protein families

[Figure: the multiple alignment shown on the Motifs slide, repeated.]

PSSM

• The PSSM is constructed by a logarithmic transformation of a matrix giving the frequency of each amino acid in the motif.
• If a good sampling of sequences is available, the number of sequences is sufficiently large, and the motif structure is not too complex, it should, in principle, be possible to produce a PSSM that is highly representative of the same motif in other sequences also.
  – Pseudocounts
  – Sequence logo, information content
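The two sub-points can be made concrete with a small sketch. It assumes the simplest choices: the same pseudocount added to every residue, and information content measured against a uniform background without any small-sample correction.

```python
import math


def column_frequencies(counts, pseudocount=1.0):
    """Add a pseudocount to every residue count, then normalise."""
    total = sum(counts.values()) + pseudocount * len(counts)
    return {r: (c + pseudocount) / total for r, c in counts.items()}


def information_content(freqs):
    """Bits of information in one column relative to a uniform background."""
    entropy = -sum(p * math.log2(p) for p in freqs.values() if p > 0)
    return math.log2(len(freqs)) - entropy


# A column where only A was observed (six sequences).
observed = {"A": 6, "C": 0, "G": 0, "T": 0}
smoothed = column_frequencies(observed)
print(smoothed)
print(information_content(smoothed))
```

Without the pseudocount the unseen bases would have frequency 0 and a log-odds score of minus infinity; with it they get a finite penalty. The information content is the quantity a sequence logo uses for the height of each column.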

Exercise: construct a PSSM

Base   Column 1 frequency   Column 2 frequency   Column 3 frequency   Column 4 frequency
A      0.6                  0.1                  0.2                  0.1
C      0.1                  0.7                  0.1                  0.1
G      0.1                  0.1                  0.6                  0.1
T      0.2                  0.1                  0.1                  0.7

Exercise, continued

• (a) Assuming the background frequency is 0.25 for each base, calculate a log odds score for each table position, i.e. log to the base 2 of the ratio of each observed value to the expected frequency.

Observed/expected frequencies

Base   Column 1        Column 2        Column 3        Column 4
A      0.6/0.25=2.4    0.1/0.25=0.4    0.2/0.25=0.8    0.1/0.25=0.4
C      0.1/0.25=0.4    0.7/0.25=2.8    0.1/0.25=0.4    0.1/0.25=0.4
G      0.1/0.25=0.4    0.1/0.25=0.4    0.6/0.25=2.4    0.1/0.25=0.4
T      0.2/0.25=0.8    0.1/0.25=0.4    0.1/0.25=0.4    0.7/0.25=2.8

Log-odds scores

Base   Column 1           Column 2           Column 3           Column 4
A      log2(2.4)=1.26     log2(0.4)=-1.32    log2(0.8)=-0.32    log2(0.4)=-1.32
C      log2(0.4)=-1.32    log2(2.8)=1.49     log2(0.4)=-1.32    log2(0.4)=-1.32
G      log2(0.4)=-1.32    log2(0.4)=-1.32    log2(2.4)=1.26     log2(0.4)=-1.32
T      log2(0.8)=-0.32    log2(0.4)=-1.32    log2(0.4)=-1.32    log2(2.8)=1.49

Exercise, continued

• (b) Align the matrix with each position in the sequence TCACGTAA starting at position 1, 2, etc., and calculate the log odds score of the matrix at that position.

Log-odds scores

Base   Column 1   Column 2   Column 3   Column 4
A      1.26       -1.32      -0.32      -1.32
C      -1.32      1.49       -1.32      -1.32
G      -1.32      -1.32      1.26       -1.32
T      -0.32      -1.32      -1.32      1.49

Brackets mark the position of the matrix:

[TCAC]GTAA: -0.32 + 1.49 + -0.32 + -1.32 = -0.47
T[CACG]TAA: -1.32 + -1.32 + -1.32 + -1.32 = -5.28
TC[ACGT]AA: 1.26 + 1.49 + 1.26 + 1.49 = 5.50
TCA[CGTA]A: -1.32 + -1.32 + -1.32 + -1.32 = -5.28
TCAC[GTAA]: -1.32 + -1.32 + -0.32 + -1.32 = -4.28

(c) Calculate the probability of the best matching position.

• p(TC[ACGT]AA): 0.6*0.7*0.6*0.7=0.1764
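The whole PSSM exercise can be checked with a few lines of code. A minimal sketch, assuming the frequency table from the exercise and a background of 0.25; scores are computed from unrounded log-odds values, so they can differ from the hand calculation in the second decimal.

```python
import math

# Frequencies from the exercise table: base -> columns 1 to 4.
FREQS = {
    "A": [0.6, 0.1, 0.2, 0.1],
    "C": [0.1, 0.7, 0.1, 0.1],
    "G": [0.1, 0.1, 0.6, 0.1],
    "T": [0.2, 0.1, 0.1, 0.7],
}
BG = 0.25

# (a) Log-odds matrix: log2(observed / expected).
PSSM = {b: [math.log2(f / BG) for f in row] for b, row in FREQS.items()}


def scan(seq, pssm):
    """(b) Score of the matrix at every start position in seq."""
    width = len(next(iter(pssm.values())))
    return [sum(pssm[seq[s + k]][k] for k in range(width))
            for s in range(len(seq) - width + 1)]


seq = "TCACGTAA"
scores = scan(seq, PSSM)
best = max(range(len(scores)), key=scores.__getitem__)  # 0-based start

# (c) Probability of the best window under the motif model.
p_best = math.prod(FREQS[seq[best + k]][k] for k in range(4))

print([round(s, 2) for s in scores])
print(best + 1, seq[best:best + 4], round(p_best, 4))
```

The best window is the one starting at position 3 (ACGT), which matches the highest-frequency base in every column.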

Hidden Markov Models

Alignment of sequences with a structure: hidden Markov models

● HMMs are well suited to describing correlation among neighbouring sites
● probabilistic framework allows for realistic modelling
● well developed mathematical methods; provide
  1. best solution (most probable path through the model)
  2. confidence score (posterior probability of any single solution)
  3. inference of structure (posterior probability of using model states)
● we can align multiple sequences using complex models and simultaneously predict their internal structure
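Point 1, the most probable path through the model, is usually found with the Viterbi algorithm. The sketch below uses a hypothetical two-state model (motif versus background) whose probabilities are made up for illustration; it is not a profile HMM.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (log probability, path) of the most probable state path."""
    def lg(p):
        return math.log(p) if p > 0 else float("-inf")

    # best[s] = (log probability, path) of the best path ending in state s
    best = {s: (lg(start_p[s]) + lg(emit_p[s][obs[0]]), [s]) for s in states}
    for symbol in obs[1:]:
        best = {
            s: max(
                (lp + lg(trans_p[prev][s]) + lg(emit_p[s][symbol]), path + [s])
                for prev, (lp, path) in best.items()
            )
            for s in states
        }
    return max(best.values())


# Hypothetical model parameters (assumptions, not from the slides).
STATES = ("motif", "background")
START = {"motif": 0.5, "background": 0.5}
TRANS = {
    "motif": {"motif": 0.8, "background": 0.2},
    "background": {"motif": 0.2, "background": 0.8},
}
EMIT = {
    "motif": {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    "background": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
}

log_p, state_path = viterbi("GTTTG", STATES, START, TRANS, EMIT)
print(state_path)
```

This is the same induction as in the maximal path sum example: only the best path into each state is kept at every position. Posterior probabilities (points 2 and 3) need the forward and backward algorithms instead.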