Pairwise Sequence Alignment

Misha Kapushesky
Slides: Stuart M. Brown, Fourie Joubert, NYU
St. Petersburg, Russia, 2010

Introduction to Bioinformatics, Spring 2010: Lecture 7


Page 2

Protein Evolution

“For many protein sequences, evolutionary history can be traced back 1-2 billion years”
- William Pearson

• When we align sequences, we assume that they share a common ancestor

• They are then homologous

• Protein fold is much more conserved than protein sequence

• DNA sequences tend to be less informative than protein sequences

Page 3

Definition

• Homology: related by descent

• Homologous sequence positions

ATTGCGC → ATTGCGC

→ ATCCGCC

ATTGCGC
AT-CCGC →

ATTGCGC

Page 4

Orthologous and paralogous

• Orthologous sequences differ because they are found in different species (a speciation event)

• Paralogous sequences differ due to a gene duplication event

• Sequences may be both orthologous and paralogous

Page 5

Pairwise Alignment

• The alignment of two sequences (DNA or protein) is a relatively straightforward computational problem.
• There are lots of possible alignments.
• Two sequences can always be aligned.
• Sequence alignments have to be scored.
• Often there is more than one solution with the same score.

Page 6

Methods of Alignment

• By hand - slide sequences on two lines of a word processor
• Dot plot
  • with windows
• Rigorous mathematical approach
  • Dynamic programming (slow, optimal)
• Heuristic methods (fast, approximate)
  • BLAST and FASTA
  • Word matching and hash tables

Page 7

Align by Hand

GATCGCCTA_TTACGTCCTGGAC <----> AGGCATACGTA_GCCCTTTCGC

You still need some kind of scoring system to find the best alignment

Page 8

Percent Sequence Identity

• The extent to which two nucleotide or amino acid sequences are invariant

A C C T G A G - A G
A C G T G - G C A G

70% identical (7 of 10 columns match; the other columns are one mismatch and two indels)
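The 70% figure can be checked with a few lines of code. This is a minimal sketch that assumes the slide's example is read as two aligned rows of ten columns and that identity is computed over all alignment columns, gap columns included; other conventions for the denominator exist. The function name `percent_identity` is ours.

```python
def percent_identity(row_a: str, row_b: str) -> float:
    """Percentage of alignment columns where both rows carry the same residue."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    # A column is identical only if the residues match and neither is a gap.
    matches = sum(1 for a, b in zip(row_a, row_b) if a == b and a != "-")
    return 100.0 * matches / len(row_a)


row1 = "ACCTGAG-AG"
row2 = "ACGTG-GCAG"
print(percent_identity(row1, row2))  # 70.0
```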

Page 9

Dotplot:

[Figure: dotplot with Sequence 1 (T A C A T T A C G T A C) on the horizontal axis and Sequence 2 on the vertical axis; a dot marks every pair of positions that carry the same residue]

A dotplot gives an overview of all possible alignments
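A text dotplot is easy to generate. The sketch below uses Sequence 1 from the slide and takes Sequence 2 to be A T A C A C T T A, as printed in the "one possible alignment" example on the next slide; the orientation of the axes is an assumption, and the function name `dotplot` is ours.

```python
def dotplot(seq1: str, seq2: str) -> list[str]:
    """One text row per residue of seq2; '*' marks a match with seq1, '.' a mismatch."""
    return ["".join("*" if a == b else "." for a in seq1) for b in seq2]


seq1 = "TACATTACGTAC"  # horizontal axis
seq2 = "ATACACTTA"     # vertical axis (assumed orientation)

print("  " + seq1)
for residue, row in zip(seq2, dotplot(seq1, seq2)):
    print(residue, row)
```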

Page 10

Dotplot:

[Figure: the same dotplot of Sequence 1 against Sequence 2, with one diagonal highlighted]

One possible alignment:

T A C A T T A C G T A C
A T A C A C T T A

In a dotplot each diagonal corresponds to a possible (ungapped) alignment
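Because every diagonal is one ungapped alignment, the diagonals can be enumerated and ranked by their number of matches. A minimal sketch, with a hypothetical helper name and the convention (ours) that offset k pairs position i + k of the first sequence with position i of the second:

```python
def diagonal_matches(seq1: str, seq2: str) -> dict[int, int]:
    """Map each diagonal offset k to the number of matches on that diagonal."""
    scores = {}
    for k in range(-(len(seq2) - 1), len(seq1)):
        scores[k] = sum(
            1
            for i in range(len(seq2))
            if 0 <= i + k < len(seq1) and seq1[i + k] == seq2[i]
        )
    return scores


scores = diagonal_matches("TACATTACGTAC", "ATACACTTA")
best = max(scores, key=scores.get)
print("best offset:", best, "matches:", scores[best])
```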

Page 11

Insertions / Deletions in a Dotplot

[Figure: dotplot of Sequence 1 (T A C T G T T C A T) against Sequence 2 (T A C T G T C A T); the diagonal shifts at the insertion/deletion]

T A C T G - T C A T
| | | | |   | | | |
T A C T G T T C A T

Page 12

Dotplot (Window = 130 / Stringency = 9)

[Figure: dotplot of hemoglobin α-chain against hemoglobin β-chain]

Page 13

Word Size Algorithm

Word Size = 3

[Figure: a word of size 3 is slid along the two sequences T A C G G T A T G and A C A G T A T C; a dot is recorded only where a whole word matches]

Page 14

Scoring Matrix Filtering

Window / Stringency

Matrix: PAM250

Window = 12, Stringency = 9

PTHPLASKTQILPEDLASEDLTI
PTHPLAGERAIGLARLAEEDFGM

[Figure: a window slides along the two sequences and is scored with the matrix at each position; the window scores shown are Score = 7 and Score = 11]

Page 15

Dotplot (Window = 18 / Stringency = 10)

[Figure: dotplot of hemoglobin α-chain against hemoglobin β-chain]

Page 16

Considerations

• The window/stringency method is more sensitive than the word size method (ambiguities are permitted).

• The smaller the window, the larger the weight of statistical (unspecific) matches.

• With large windows the sensitivity for short sequences is reduced.

• Insertions/deletions are not treated explicitly.

Page 17

Alignment methods

• Rigorous algorithms = Dynamic Programming
  • Needleman-Wunsch (global)
  • Smith-Waterman (local)
• Heuristic algorithms (faster but approximate)
  • BLAST
  • FASTA

Page 18

The Rocks game

• N rocks, 2 piles, 2 players
• Player can
  • Remove 1 rock from either pile
  • Remove 1 rock from each pile
• Last to remove a rock wins
• Assume 10 rocks in each pile – winning strategy?
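The game can be solved by working up from small piles, which is the idea behind dynamic programming. The sketch below is one way to do that under the rules as stated on the slide; the function name is ours, and memoization stands in for filling a table explicitly.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def mover_wins(a: int, b: int) -> bool:
    """True if the player about to move can force a win from piles of size (a, b)."""
    moves = []
    if a > 0:
        moves.append((a - 1, b))      # one rock from the first pile
    if b > 0:
        moves.append((a, b - 1))      # one rock from the second pile
    if a > 0 and b > 0:
        moves.append((a - 1, b - 1))  # one rock from each pile
    # With no move left, the opponent took the last rock and the mover has lost.
    # Otherwise the mover wins if some move leaves the opponent in a losing position.
    return any(not mover_wins(x, y) for x, y in moves)


print(mover_wins(10, 10))  # False
```

Under these rules a position is lost for the player to move exactly when both piles are even, so with 10 rocks in each pile the second player can always win by restoring that state after every move.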

Page 19

Dynamic Programming

Page 20

Basic principles of dynamic programming

- Creation of an alignment path matrix

- Stepwise calculation of score values

- Backtracking (evaluation of the optimal path)

Page 21

Dynamic Programming

• Dynamic Programming is a very general programming technique.
• It is applicable when a large search space can be structured into a succession of stages, such that:
  • the initial stage contains trivial solutions to sub-problems
  • each partial solution in a later stage can be calculated by recurring on a fixed number of partial solutions in an earlier stage
  • the final stage contains the overall solution

Page 22

Creation of an alignment path matrix

Idea: Build up an optimal alignment using previous solutions for optimal alignments of smaller subsequences

• Construct matrix F indexed by i and j (one index for each sequence)

• F(i,j) is the score of the best alignment between the initial segment x1...xi of x and the initial segment y1...yj of y

• Build F(i,j) recursively beginning with F(0,0) = 0

Page 23
Page 24

Creation of an alignment path matrix

• If F(i-1,j-1), F(i-1,j) and F(i,j-1) are known we can calculate F(i,j)

• Three possibilities:
  • xi and yj are aligned, F(i,j) = F(i-1,j-1) + s(xi,yj)
  • xi is aligned to a gap, F(i,j) = F(i-1,j) - d
  • yj is aligned to a gap, F(i,j) = F(i,j-1) - d

• The best score up to (i,j) will be the largest of the three options
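The recurrence and the backtracking step can be sketched in a few lines. This is an illustrative implementation, not the lecture's own code: it assumes a simple match/mismatch score s and a linear gap penalty d with made-up default values, whereas the worked example in the lecture uses a substitution matrix with d = 8, so the numbers will differ from the slide.

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, d=2):
    """Global alignment with linear gap penalty d; returns (score, row_x, row_y)."""
    n, m = len(x), len(y)

    def s(a, b):
        return match if a == b else mismatch

    # F[i][j] = best score for aligning x[:i] with y[:j]
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -d * i
    for j in range(1, m + 1):
        F[0][j] = -d * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),  # x_i aligned to y_j
                F[i - 1][j] - d,                          # x_i aligned to a gap
                F[i][j - 1] - d,                          # y_j aligned to a gap
            )

    # Backtracking: retrace one optimal path from F[n][m] back to F[0][0].
    row_x, row_y = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            row_x.append(x[i - 1])
            row_y.append(y[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and F[i][j] == F[i - 1][j] - d:
            row_x.append(x[i - 1])
            row_y.append("-")
            i -= 1
        else:
            row_x.append("-")
            row_y.append(y[j - 1])
            j -= 1
    return F[n][m], "".join(reversed(row_x)), "".join(reversed(row_y))


score, top, bottom = needleman_wunsch("HEAGAWGHEE", "PAWHEAE")
print(score)
print(top)
print(bottom)
```

When several paths tie, this traceback prefers the diagonal move; any other tie-breaking rule gives an alignment with the same score.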

Page 25

Backtracking

         H   E   A   G   A   W   G   H   E   E
    0  -8 -16 -24 -32 -40 -48 -56 -64 -72 -80
P  -8  -2  -9 -17 -25 -33 -42 -49 -57 -65 -73
A -16 -10  -3  -4 -12 -20 -28 -36 -44 -52 -60
W -24 -18 -11  -6  -7 -15  -5 -13 -21 -29 -37
H -32 -14 -18 -13  -8  -9 -13  -7  -3 -11 -19
E -40 -22  -8 -16 -16  -9 -12 -15  -7   3  -5
A -48 -30 -16  -3 -11 -11 -12 -12 -15  -5   2
E -56 -38 -24 -11  -6 -12 -14 -15 -12  -9   1

Scores along the optimal path: 0, -8, -16, -17, -25, -20, -5, -13, -3, 3, -5, 1

Optimal global alignment:

HEAGAWGHE-E
--P-AW-HEAE

Page 26

Global vs. Local Alignments

• Global alignment algorithms start at the beginning of two sequences and add gaps to each until the end of one is reached.

• Local alignment algorithms find the region (or regions) of highest similarity between two sequences and build the alignment outward from there.

Page 27
Page 28

Global Alignment

Two closely related sequences:

needle (Needleman & Wunsch) creates an end-to-end alignment.

Page 29

Global Alignment

Two sequences sharing several regions of local similarity:

1 AGGATTGGAATGCTCAGAAGCAGCTAAAGCGTGTATGCAGGATTGGAATTAAAGAGGAGGTAGACCG.... 67
  |||||||||||||| | | | |||| || | | | ||
1 AGGATTGGAATGCTAGGCTTGATTGCCTACCTGTAGCCACATCAGAAGCACTAAAGCGTCAGCGAGACCG 70

Page 30

Global Alignment (Needleman-Wunsch)

• The Needleman-Wunsch algorithm creates a global alignment over the length of both sequences (needle)

• Global algorithms are often not effective for highly diverged sequences: they do not reflect the biological reality that two sequences may only share limited regions of conserved sequence.
  • Sometimes two sequences may be derived from ancient recombination events where only a single functional domain is shared.

• Global methods are useful when you want to force two sequences to align over their entire length

Page 31

Local Alignment (Smith-Waterman)

• Local alignment
  • Identify the most similar sub-region shared between two sequences
• Smith-Waterman
  • EMBOSS: water
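For comparison with the global case, here is a minimal Smith-Waterman sketch. It is not what EMBOSS water runs internally; it assumes a match/mismatch score and a linear gap penalty with illustrative values, and it reports only one best-scoring local region.

```python
def smith_waterman(x, y, match=2, mismatch=-1, d=2):
    """Local alignment; returns (best score, aligned part of x, aligned part of y)."""
    n, m = len(x), len(y)

    def s(a, b):
        return match if a == b else mismatch

    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # The extra 0 lets an alignment restart anywhere (the "local" part).
            H[i][j] = max(
                0,
                H[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                H[i - 1][j] - d,
                H[i][j - 1] - d,
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a cell with score 0 is reached.
    i, j = best_pos
    row_x, row_y = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            row_x.append(x[i - 1])
            row_y.append(y[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] - d:
            row_x.append(x[i - 1])
            row_y.append("-")
            i -= 1
        else:
            row_x.append("-")
            row_y.append(y[j - 1])
            j -= 1
    return best, "".join(reversed(row_x)), "".join(reversed(row_y))


print(smith_waterman("TTTACGTTT", "GGGACGGGG"))  # (6, 'ACG', 'ACG')
```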

Page 32

Parameters of Sequence Alignment

Scoring Systems:

• Each symbol pairing is assigned a numerical value, based on a symbol comparison table.

Gap Penalties:

• Opening: The cost to introduce a gap

• Extension: The cost to elongate a gap

Page 33

DNA Scoring Systems - very simple

Sequence 1: actaccagttcatttgatacttctcaaa

Sequence 2: taccattaccgtgttaactgaaaggacttaaagact

  A G C T
A 1 0 0 0
G 0 1 0 0
C 0 0 1 0
T 0 0 0 1

Match: 1
Mismatch: 0
Score = 5

Page 34

Protein Scoring Systems

Sequence 1: PTHPLASKTQILPEDLASEDLTI

Sequence 2: PTHPLAGERAIGLARLAEEDFGM

Scoring matrix:

    C   S   T   P   A   G   N   D  ..
C   9
S  -1   4
T  -1   1   5
P  -3  -1  -1   7
A   0   1   0  -1   4
G  -3   0  -2  -2   0   6
N  -3   1   0  -2  -2   0   5
D  -3   0  -1  -1  -2  -1   1   6
.
.

T:G = -2
T:T = 5
Score = 48
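Scoring with a substitution matrix is a table lookup per aligned pair. The sketch below encodes only the eight-residue excerpt printed on the slide (C, S, T, P, A, G, N, D), so it cannot reproduce the slide's total of 48 for the full sequences; the short peptides TAG and TSG are made up for illustration.

```python
# Lower-triangular excerpt as printed on the slide, in the order C S T P A G N D.
ORDER = "CSTPAGND"
ROWS = [
    [9],
    [-1, 4],
    [-1, 1, 5],
    [-3, -1, -1, 7],
    [0, 1, 0, -1, 4],
    [-3, 0, -2, -2, 0, 6],
    [-3, 1, 0, -2, -2, 0, 5],
    [-3, 0, -1, -1, -2, -1, 1, 6],
]

# Expand the triangle into a symmetric lookup table keyed by residue pairs.
SCORE = {}
for i, row in enumerate(ROWS):
    for j, value in enumerate(row):
        SCORE[ORDER[i], ORDER[j]] = value
        SCORE[ORDER[j], ORDER[i]] = value


def ungapped_score(a: str, b: str) -> int:
    """Sum of pair scores for two equally long, already aligned sequences."""
    return sum(SCORE[p, q] for p, q in zip(a, b))


print(SCORE["T", "G"], SCORE["T", "T"])  # -2 5
print(ungapped_score("TAG", "TSG"))      # 5 + 1 + 6 = 12
```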

Page 35

Protein Scoring Systems

• Amino acids have different biochemical and physical properties that influence their relative replaceability in evolution.

[Figure: Venn diagram grouping the amino acids by property: hydrophobic, polar, charged, positive, aliphatic, aromatic, small, tiny]

Page 36

Protein Scoring Systems

• Scoring matrices reflect:
  – # of mutations to convert one to another
  – chemical similarity
  – observed mutation frequencies
  – the probability of occurrence of each amino acid

• Widely used scoring matrices:
  • PAM
  • BLOSUM

Page 37

PAM matrices

• Family of matrices PAM 80, PAM 120, PAM 250

• The number with a PAM matrix represents the evolutionary distance between the sequences on which the matrix is based

• Greater numbers denote greater distances

Page 38

PAM (Percent Accepted Mutations) matrices

• The numbers of replacements were used to compute a so-called PAM-1 matrix.

• The PAM-1 matrix reflects an average change of 1% of all amino acid positions. PAM matrices for larger evolutionary distances can be extrapolated from the PAM-1 matrix.

• PAM250 = 250 mutations per 100 residues.

• Greater numbers mean bigger evolutionary distance
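The extrapolation is done by multiplying the PAM-1 mutation probability matrix by itself: n multiplications give the probabilities for a distance of n PAM. The sketch below shows this on a made-up two-letter alphabet, not on the real 20 x 20 Dayhoff data, and stops at probabilities; turning them into log-odds scores is a further step not shown here.

```python
def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def mat_pow(m, power):
    """m multiplied by itself `power` times (power = 0 gives the identity)."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, m)
    return result


# Hypothetical PAM-1 for a two-letter alphabet: 1% of positions change per step.
toy_pam1 = [[0.99, 0.01],
            [0.01, 0.99]]
toy_pam250 = mat_pow(toy_pam1, 250)

# Probability that a position shows its original letter after 250 steps.
print(round(toy_pam250[0][0], 3))
```

Because positions can mutate repeatedly and revert, 250 accepted mutations per 100 residues does not mean that every residue differs; in this toy case roughly half of the positions still show the original letter.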

Page 39

PAM (Percent Accepted Mutations) matrices

• Derived from global alignments of protein families. Family members share at least 85% identity (Dayhoff et al., 1978).

• Construction of phylogenetic tree and ancestral sequences of each protein family

• Computation of number of replacements for each pair of amino acids

Page 40

PAM 250

    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V   B   Z
A   2  -2   0   0  -2   0   0   1  -1  -1  -2  -1  -1  -3   1   1   1  -6  -3   0   2   1
R  -2   6   0  -1  -4   1  -1  -3   2  -2  -3   3   0  -4   0   0  -1   2  -4  -2   1   2
N   0   0   2   2  -4   1   1   0   2  -2  -3   1  -2  -3   0   1   0  -4  -2  -2   4   3
D   0  -1   2   4  -5   2   3   1   1  -2  -4   0  -3  -6  -1   0   0  -7  -4  -2   5   4
C  -2  -4  -4  -5  12  -5  -5  -3  -3  -2  -6  -5  -5  -4  -3   0  -2  -8   0  -2  -3  -4
Q   0   1   1   2  -5   4   2  -1   3  -2  -2   1  -1  -5   0  -1  -1  -5  -4  -2   3   5
E   0  -1   1   3  -5   2   4   0   1  -2  -3   0  -2  -5  -1   0   0  -7  -4  -2   4   5
G   1  -3   0   1  -3  -1   0   5  -2  -3  -4  -2  -3  -5   0   1   0  -7  -5  -1   2   1
H  -1   2   2   1  -3   3   1  -2   6  -2  -2   0  -2  -2   0  -1  -1  -3   0  -2   3   3
I  -1  -2  -2  -2  -2  -2  -2  -3  -2   5   2  -2   2   1  -2  -1   0  -5  -1   4  -1  -1
L  -2  -3  -3  -4  -6  -2  -3  -4  -2   2   6  -3   4   2  -3  -3  -2  -2  -1   2  -2  -1
K  -1   3   1   0  -5   1   0  -2   0  -2  -3   5   0  -5  -1   0   0  -3  -4  -2   2   2
M  -1   0  -2  -3  -5  -1  -2  -3  -2   2   4   0   6   0  -2  -2  -1  -4  -2   2  -1   0
F  -3  -4  -3  -6  -4  -5  -5  -5  -2   1   2  -5   0   9  -5  -3  -3   0   7  -1  -3  -4
P   1   0   0  -1  -3   0  -1   0   0  -2  -3  -1  -2  -5   6   1   0  -6  -5  -1   1   1
S   1   0   1   0   0  -1   0   1  -1  -1  -3   0  -2  -3   1   2   1  -2  -3  -1   2   1
T   1  -1   0   0  -2  -1   0   0  -1   0  -2   0  -1  -3   0   1   3  -5  -3   0   2   1
W  -6   2  -4  -7  -8  -5  -7  -7  -3  -5  -2  -3  -4   0  -6  -2  -5  17   0  -6  -4  -4
Y  -3  -4  -2  -4   0  -4  -4  -5   0  -1  -1  -4  -2   7  -5  -3  -3   0  10  -2  -2  -3
V   0  -2  -2  -2  -2  -2  -2  -1  -2   4   2  -2   2  -1  -1  -1   0  -6  -2   4   0   0
B   2   1   4   5  -3   3   4   2   3  -1  -2   2  -1  -3   1   2   2  -4  -2   0   6   5
Z   1   2   3   4  -4   5   5   1   3  -1  -1   2   0  -4   1   1   1  -4  -3   0   5   6

Highlighted entries: C-W = -8, W-W = 17

Page 41

PAM - limitations

• Based on only one original dataset

• Examines proteins with few differences (85% identity)

• Based mainly on small globular proteins so the matrix is biased

Page 42

BLOSUM matrices

• Different BLOSUMn matrices are calculated independently from BLOCKS (ungapped local alignments)

• BLOSUMn is based on a cluster of BLOCKS of sequences that share at least n percent identity

• BLOSUM62 represents closer sequences than BLOSUM45

Page 43

BLOSUM (Blocks Substitution Matrix)

• Derived from alignments of domains of distantly related proteins (Henikoff & Henikoff, 1992).

• Occurrences of each amino acid pair in each column of each block alignment are counted.

• The numbers derived from all blocks were used to compute the BLOSUM matrices.

Example column: A A C E C

A - C = 4
A - E = 2
C - E = 2
A - A = 1
C - C = 1
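The pair counts for the example column A A C E C can be reproduced by enumerating every pair of rows in the column. A minimal sketch; the function name and the "X-Y" key format are ours.

```python
from collections import Counter
from itertools import combinations


def count_pairs(column: str) -> Counter:
    """Count unordered residue pairs over all pairs of rows in one alignment column."""
    counts = Counter()
    for a, b in combinations(column, 2):
        # Sort so that A-C and C-A are counted under the same key.
        counts["-".join(sorted((a, b)))] += 1
    return counts


for pair, n in sorted(count_pairs("AACEC").items()):
    print(pair, "=", n)
```

A column of five sequences yields 10 pairs in total, which matches the sum of the counts on the slide (4 + 2 + 2 + 1 + 1).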

Page 44

The Blosum50 Scoring Matrix

Page 45

BLOSUM (Blocks Substitution Matrix)

• Sequences within blocks are clustered according to their level of identity.

• Clusters are counted as a single sequence.

• Different BLOSUM matrices differ in the percentage of sequence identity used in clustering.

• The number in the matrix name (e.g. 62 in BLOSUM62) refers to the percentage of sequence identity used to build the matrix.

• Greater numbers mean smaller evolutionary distance.

Page 46

PAM vs. BLOSUM

Roughly equivalent matrices, from closer to more distant sequences:

PAM100 = BLOSUM90
PAM120 = BLOSUM80
PAM160 = BLOSUM60
PAM200 = BLOSUM52
PAM250 = BLOSUM45

• BLOSUM62 for general use
• BLOSUM80 for close relations
• BLOSUM45 for distant relations

• PAM120 for general use
• PAM60 for close relations
• PAM250 for distant relations

Page 47

TIPS on choosing a scoring matrix

• Generally, BLOSUM matrices perform better than PAM matrices for local similarity searches (Henikoff & Henikoff, 1993).

• When comparing closely related proteins one should use lower PAM or higher BLOSUM matrices; for distantly related proteins, higher PAM or lower BLOSUM matrices.

• For database searching the commonly used matrix is BLOSUM62.

Page 48

Scoring Insertions and Deletions

T A T G T G G A A T G A
A T G T A A T G C A

insertion / deletion:

T A T G T G G A A T G A
A T G T - - A A T G C A

The creation of a gap is penalized with a negative score value.

Page 49

Why Gap Penalties?

Match = 5
Mismatch = -4

Gaps not permitted, Score: 0

1 GTGATAGACACAGACCGGTGGCATTGTGG 29
  ||| | | ||| | || || |
1 GTGTCGGGAAGAGATAACTCCGATGGTTG 29

Gaps allowed but not penalized, Score: 88

1 GTG.ATAG.ACACAGA..CCGGT..GGCATTGTGG 29
  ||| || | | | ||| || | | || || |
1 GTGTAT.GGA.AGAGATACC..TCCG..ATGGTTG 29

Page 50

Why Gap Penalties?

• The optimal alignment of two similar sequences is usually that which
  • maximizes the number of matches and
  • minimizes the number of gaps.
• There is a tradeoff between these two: adding gaps reduces mismatches

• Permitting the insertion of arbitrarily many gaps can lead to high scoring alignments of non-homologous sequences.

• Penalizing gaps forces alignments to have relatively few gaps.

Page 51

Gap Penalties

• How to balance gaps with mismatches?

• Gaps must get a steep penalty, or else you’ll end up with nonsense alignments.

• In real sequences, multi-base (or amino acid) gaps are quite common
  • genetic insertion/deletion events

• “Affine” gap penalties give a big penalty for each new gap, but a much smaller “gap extension” penalty.

Page 52

Scoring Insertions and Deletions

match = 1
mismatch = 0

Without gaps:

T A T G T G C G T A T A
  A T G T T A T A C

Total Score: 4

With an insertion / deletion:

T A T G T G C G T A T A
  A T G T - - - T A T A C

Gap parameters:
d = 3 (gap opening)
e = 0.1 (gap extension)
g = 3 (gap length)

γ(g) = -d - (g - 1)e = -3 - (3 - 1) × 0.1 = -3.2

Total Score: 8 - 3.2 = 4.8
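The arithmetic of the affine penalty, with the slide's parameters d = 3, e = 0.1 and a gap of length g = 3, can be written out as a small function. The function name is ours; the count of 8 matches is taken from the slide's total.

```python
def affine_gap_penalty(g: int, d: float = 3.0, e: float = 0.1) -> float:
    """gamma(g) = -d - (g - 1) * e: d opens the gap, e is paid for each extra position."""
    if g < 1:
        raise ValueError("gap length must be at least 1")
    return -d - (g - 1) * e


matches = 8  # matching columns in the gapped alignment, each scoring 1
total = matches * 1 + affine_gap_penalty(3)
print(round(affine_gap_penalty(3), 1))  # -3.2
print(round(total, 1))                  # 4.8
```

With a linear penalty of 3 per gap position the same gap would cost 9 and the gapped alignment would score below the ungapped one; the affine scheme is what makes the single three-residue gap worthwhile here.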

Page 53

Modification of Gap Penalties

Score Matrix: BLOSUM62

gap opening penalty = 0
gap extension penalty = 0.1
score = 11.3

1 V...LSPADKFLTNV 12
  | |||| | | |
1 VFTELSPA.K..T.V 11

gap opening penalty = 3
gap extension penalty = 0.1
score = 6.3

1 ...VLSPADKFLTNV 12
  ||||
1 VFTELSPAKTV.... 11

Page 54

BLAST Algorithm
Basic Local Alignment Search Tool

• Fast alignment technique(s)
  • Similar to FASTA algorithms (not used much now)
  • There are more accurate ones, but they’re slower
  • BLAST makes heavy use of lookup tables
• Idea: statistically significant alignments (hits)
  • Will have regions of at least 3 identical letters
  • Or at least high scoring with respect to BLOSUM matrix
• Based on small local alignments

CCNDHRKMTCSPNDNNRK
TTNDHRMTACSPDNNNKH

more likely than

CCNDHRKMTCSPNDNNRK
YTNHHMMTTYSLDNNNKK

Page 55

BLAST Overview

• Given a query sequence Q
• Seven main stages

1. Remove (filter) low complexity regions from Q
2. Harvest k-tuples (triples) from Q
3. Expand each triple into ~50 high scoring words
4. Seed a set of possible alignments
5. Generate high scoring pairs (HSPs) from the seeds
6. Test significance of matches from HSPs
7. Report the alignments found from the HSPs

Page 56

BLAST Algorithm Part 1: Removing Low-complexity Segments

• Imagine matching
  • HHHHHHHHKMAY and HHHHHHHHURHD
  • The KMAY and URHD are the interesting parts
  • But this pair scores highly using BLOSUM
• It’s a good idea to remove the HHHHHHHs
  • From the query sequence (low complexity)
• SEG program does this kind of thing
  • Comes with most BLAST implementations
  • Often doesn’t do much, and it can be turned off

Page 57

Removing Low-complexity Segments

• Given a segment of length L
  • With each amino acid occurring n1, n2, …, n20 times
• Use the following measure for “compositional complexity”:

[Formula: not preserved in this transcript]

• To use this measure
  • Slide a “window” of ~12 residues along query sequence Q
  • Use a threshold to determine low complexity windows
  • Use a minimise routine to replace the segment
    • With an optimal minimised segment (or just an X)

Page 58

BLAST Algorithm Part 2: Harvesting k-tuples

• Collect all the k-tuples of elements in Q
  • k set to 3 for residues and 11 for DNA (can vary)
  • Triples are called ‘words’. Call this set W

S T S L S T S D K L M R
STS
 TSL
  SLS
   LST
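Harvesting the words amounts to sliding a window of length k along the query one position at a time. A minimal sketch using the example sequence from the slide; the function name is ours.

```python
def harvest_words(query: str, k: int = 3) -> list[str]:
    """All overlapping k-tuples of the query, in order of their start position."""
    return [query[i:i + k] for i in range(len(query) - k + 1)]


words = harvest_words("STSLSTSDKLMR")
print(words[:4])   # ['STS', 'TSL', 'SLS', 'LST']
print(len(words))  # 10 words for a query of length 12
```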

Page 59

BLAST Algorithm Part 3: Finding High Scoring Triples

• Given a word w from W
• Find all other words w’ of same length (3), which:
  • Appear in some database sequence
  • Blosum(w,w’) > a threshold T
• Choose T to limit number to around 50
  • Call these the high scoring triples (words) for w
• Example: letting w=PQG, set T to be 13
  • Suppose that PQG, PEG, PSG, PQA are found in database
  • Blosum(PQG,PQG) = 18, Blosum(PQG,PEG) = 15
  • Blosum(PQG,PSG) = 13, Blosum(PQG,PQA) = 12
  • Hence, PQG and PEG only are kept
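The filtering step in the example can be sketched as follows. To stay with the numbers given on the slide, the word scores are supplied as a lookup table rather than computed from a BLOSUM matrix; the table and the function name are illustrative only. Note the strict inequality: PSG scores exactly 13 and is dropped.

```python
def high_scoring_words(w, candidates, score, threshold):
    """Keep the candidate words whose score against w is strictly above the threshold."""
    return [c for c in candidates if score(w, c) > threshold]


# Scores against w = PQG as quoted in the example (not computed here).
EXAMPLE_SCORES = {"PQG": 18, "PEG": 15, "PSG": 13, "PQA": 12}

kept = high_scoring_words(
    "PQG",
    ["PQG", "PEG", "PSG", "PQA"],
    score=lambda w, c: EXAMPLE_SCORES[c],
    threshold=13,
)
print(kept)  # ['PQG', 'PEG']
```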

Page 60

Finding High Scoring Triples

• For each w in W, find all the high scoring words
• Organise these sets of words
  • Remembering all the places where w was found in Q
• Each high scoring triple is going to be a seed
  • In order to generate possible alignment(s)
  • One seed can generate more than one alignment
• End of the first half of the algorithm
  • Going to find alignments now

Page 61

BLAST Algorithm Part 4: Seeding Possible Alignments

• Look at first triple V in query sequence Q
  • Actually from Q (not from W - which has omissions)
• Retrieve the set of ~50 high scoring words
  • Call this set HV
• Retrieve the list of places in Q where V occurs
  • Call this set PV
• For every pair (word, pos)
  • Where word is from HV and pos is from PV
• Find all the database sequences D
  • Which have an exact match with word at position pos’
• Store an alignment between Q and D
  • With V matched at pos in Q and pos’ in D
• Repeat this for the second triple in Q, and so on

Page 62

Seeding Possible Alignments: Example

• Suppose Q = QQGPHUIQEGQQG
• Suppose V = QQG, HV = {QQG, QEG}
  • Then PV = {1, 11}
• Suppose we are looking in the database at:
  • D = PKLMMQQGKQEG
• Then the alignments seeded (Q against D) are:
  • word=QQG, pos=1
  • word=QQG, pos=11
  • word=QEG, pos=1
  • word=QEG, pos=11
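The seeding example can be reproduced by looking up word positions in Q and D. This sketch scans the strings directly, whereas a real implementation would use precomputed lookup tables; positions are 1-based to match the slide, and the function names are ours.

```python
def find_all(text: str, word: str) -> list[int]:
    """1-based start positions of every (possibly overlapping) occurrence of word."""
    return [i + 1 for i in range(len(text) - len(word) + 1) if text.startswith(word, i)]


def seed_alignments(Q, D, V, HV):
    """All (word, pos in Q, pos' in D) seeds for triple V and its high scoring words HV."""
    PV = find_all(Q, V)
    return [(word, pos, pos_d) for word in HV for pos in PV for pos_d in find_all(D, word)]


Q = "QQGPHUIQEGQQG"
D = "PKLMMQQGKQEG"
print(find_all(Q, "QQG"))  # [1, 11]
for seed in seed_alignments(Q, D, "QQG", ["QQG", "QEG"]):
    print(seed)
```

This yields four seeds, one for each combination of word and pos listed on the slide; QQG starts at position 6 of D and QEG at position 10.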

Page 63

BLAST Algorithm Part 5: Generating High Scoring Pairs (HSPs)

• For each alignment A
  • Where sequences Q and D are matched
  • Original region matching was M
• Extend M to the left
  • Until the Blosum score begins to decrease
• Extend M to the right
  • Until the Blosum score begins to decrease
• Larger stretch of sequence now matches
  • May have higher score than the original triple
  • Call these high scoring pairs
• Throw away any alignments for which the score S of the extended region M is lower than some cutoff score

Page 64

Extending Alignment Regions: Example

Q = QQGPHUIQEGQQGKEEDPP
D = PKLMMQQGKQEGM

Blosum(QQG,QQG) = 16
Blosum(QQGK,QQGK) = 21
Blosum(QQGKE,QQGKQ) = 23
Blosum(QQGKEE,QQGKQE) = 28
Blosum(QQGKEED,QQGKQEG) = 27

So, the extension to the right stops here. The HSP (before left extension) is QQGKEE, scoring 28
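The right extension in this example can be traced in code. The sketch follows the slide's simplified stopping rule (stop as soon as the running score would decrease), which is not necessarily how production BLAST terminates an extension. The pair scores are the ones implied by the cumulative totals on the slide (16, 21, 23, 28, 27); only the pairs needed here are included, and indices are 0-based.

```python
# Pair scores implied by the slide's running totals; not a full substitution matrix.
PAIR = {
    ("Q", "Q"): 5, ("G", "G"): 6, ("K", "K"): 5,
    ("E", "Q"): 2, ("E", "E"): 5, ("D", "G"): -1,
}


def extend_right(q, d, q_start, d_start, length):
    """Grow an ungapped match to the right until the next pair would lower the score."""
    score = sum(PAIR[q[q_start + k], d[d_start + k]] for k in range(length))
    while q_start + length < len(q) and d_start + length < len(d):
        step = PAIR[q[q_start + length], d[d_start + length]]
        if step < 0:
            break  # the score would begin to decrease: stop here
        score += step
        length += 1
    return length, score


Q = "QQGPHUIQEGQQGKEEDPP"
D = "PKLMMQQGKQEGM"
length, score = extend_right(Q, D, q_start=10, d_start=5, length=3)
print(Q[10:10 + length], score)  # QQGKEE 28
```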

Page 65

BLAST Algorithm Part 6: Checking Statistical Significance

• Reason we extended alignment regions
  • Give a more accurate picture of the probability of that BLOSUM score occurring by chance
• Question: is an HSP significant?
• Suppose we have an HSP such that
  • It scores S for a region of length L in sequences Q & D
• Then the probability of two random sequences Q’ and D’ scoring S in a region of length L is calculated
  • Where Q’ is same length as Q and D’ is same length as D
• This probability needs to be low for significance

Page 66

BLAST Algorithm Part 7: Reporting the Alignments

• For each statistically significant HSP
  • The alignment is reported
• If a sequence D has two HSPs with Query Q
  • Two different alignments are reported
• Later versions of BLAST
  • Try to unify the two alignments