Upload
others
View
0
Download
0
Embed Size (px)
Citation preview
Bioinformatics Chapter 8.
Sequence Alignment and Database Searching
오 석 준
서울대 바이오정보기술연구센터
E-mail: [email protected]
Sirk-June AughCenter for Bioinformation Technology (CBIT)
Seoul National University
This material is available at http://bi.snu.ac.kr./ &http://cbit.snu.ac.kr/
장 병 탁
서울대 컴퓨터공학부 바이오지능연구실 �
바이오정보기술연구센터
E-mail: [email protected]
Byoung-Tak ZhangCenter for Bioinformation Technology (CBIT) &
Biointelligence Laboratory School of Computer Science and Engineering
Seoul National University
(C) 2002 SNU CSE Biointelligence Lab (BI) 2
Outline
! The Evolutionary Basis of Sequence Alignment
! The Modular Nature of Proteins
! Optimal Alignment Methods
! Substitution Scores and Gap Penalties
! Statistical Significance of Alignments
! Database Similarity Searching
! FASTA
! BLAST
! Database Searching Artifacts
! Position-Specific Scoring Matrices
! Spliced Alignments
(C) 2002 SNU CSE Biointelligence Lab (BI) 3
Similarity vs. Homology
! Similarity:♦ An observable quantity that might be expressed, as say, percent
identity or some other suitable measure.
! Homology:♦ Refers to a conclusion drawn from these data that two genes
share a common evolutionary history
(C) 2002 SNU CSE Biointelligence Lab (BI) 4
Conserved Positions are often of Functional Importance
(C) 2002 SNU CSE Biointelligence Lab (BI) 5
Alignment Jargon
! Substitutions
! Insertions
! Deletions
ancestor
A B
Observed sequences
Evolutionarily related sequences differ from one other because of several processes:
(C) 2002 SNU CSE Biointelligence Lab (BI) 6
Alignment Jargon
ACG
GCG ACG
Substitution
A→G
GCG||ACG
• 1 mismatch• 2 matches
(C) 2002 SNU CSE Biointelligence Lab (BI) 7
Alignment Jargon
ACG
ATCG ACG
Insertion
→T
ATCG| ||A-CG
• 0 mismatches• 3 matches• 1 gap
(C) 2002 SNU CSE Biointelligence Lab (BI) 8
Alignment Jargon
Deletion
ATCG
ACG ATCG
←T
ATCG| ||A-CG
• 0 mismatches• 3 matches• 1 gap
(C) 2002 SNU CSE Biointelligence Lab (BI) 9
Alignment vs. Prediction: When are Alignment Reliable?! New sequences are aligned against all sequences in DB
♦ Hints toward structural/functional relationships + functional insights
! Fundamental question♦ When is the sequence similarity high enough that one may infer a
structural/functional similarity from the pairwise alignment of two sequences?
♦ Given the detected overlap in a sequence segment, can a similarity threshold be defined that sifts out cases where the inference will be reliable?
! Answer♦ It depends on the structural/functional aspect one wants to investigate♦ Different for each task
! Prediction♦ When alignment alone is not enough to lead a reliable inference
(C) 2002 SNU CSE Biointelligence Lab (BI) 10
Local Alignment vs. Global Alignment
! Local Alignment♦ Attempts to align regions of sequences with the highest density
of matches. In doing so, one of more islands of subalignments are created in the aligned sequences.
! Global Alignment♦ Attempts to match as many characters as possible, from end to
end, in a set of two or more sequences.
(C) 2002 SNU CSE Biointelligence Lab (BI) 11
Local Alignment vs. Global Alignment
A C C A C A C A0 m m m m m m m m
A m 1 -1 -3 -5 -7 -9 -11 -13
C m -1 2 0 -2 -4 -6 -8 -10
A m -3 0 1 1 -1 -3 -5 -7
C m -5 -2 1 0 2 0 -2 -4
C m -7 -4 -1 0 1 1 1 -1
A m -9 -6 -3 0 -1 2 0 2
T m -11 -8 -5 -2 -1 0 1 0
A m -13 -10 -7 -4 -1 0 -1 2
A C C A C A C A0 0 0 0 0 0 0 0 0
A 0 1 0 0 1 0 1 0 1
C 0 0 2 1 0 2 0 2 0
A 0 1 0 1 2 0 3 1 3
C 0 0 2 1 0 3 1 4 2
C 0 0 0 3 2 1 2 2 3
A 0 1 0 1 4 2 2 1 3
T 0 0 0 0 2 3 1 1 1
A 0 1 0 0 1 1 4 2 2
(C) 2002 SNU CSE Biointelligence Lab (BI) 12
BLAST search result
Query: Soy bean (plant)leghemoglobin
Database: homo sapiens
Alignment result shows two merely matched sequences, but their functions and structures are surprisingly coincided.
(C) 2002 SNU CSE Biointelligence Lab (BI) 13
Optimal Global Sequence Alignment
(C) 2002 SNU CSE Biointelligence Lab (BI) 14
Prediction of Functional Features
! Two protein sequences share sequence similarity♦ These proteins share common function?
! New sequence identity threshold is required for prediction of function♦ Threshold used for structural problems can not be used.
! Solution♦ Split each sequence into a number of subsequences
♦ The fraction of aligned site per alignment
(C) 2002 SNU CSE Biointelligence Lab (BI) 15
Modular Nature of Proteins
Figure 8.3. Modular structure of two proteins involved in blood clotting. Schematic representation of the modular structure of human tissue plasminogen activator and coagulation factor XII. A module labeled C is shared by several proteins involved in blood clotting. F1 and F2 are frequently repeated units that were first seen in fibronectin. E is a module resembling epidermal growth factor. A module known as a “kringle domain” is denoted K.
Catalytic
Catalytic
KEF1EF2F12
PLAT KEF1 K
(C) 2002 SNU CSE Biointelligence Lab (BI) 16
Measuring Alignment Quality
Good alignments should have …
! “many” exact matches
! “few” mismatches! “many” of the mismatches should be similar residues
! “few” gaps
(C) 2002 SNU CSE Biointelligence Lab (BI) 17
Measuring Alignment Quality
QTRPQNVLNPP
|||
STRQNVINPWAAQ
Begin with...
Longest Exact Match
S = 3aS=alignment scorea=match score
(C) 2002 SNU CSE Biointelligence Lab (BI) 18
Measuring Alignment Quality
QTRPQNVLNPP
||| ||
STRQNVINPWAAQ
… allow some mismatches
S = 5a - 1bS=alignment scorea=match scoreb=mismatch penalty
(C) 2002 SNU CSE Biointelligence Lab (BI) 19
Measuring Alignment Quality
QTRPQNVLNPP
|| ||| ||
STR-QNVINPWAAQ
…and finally, introduce some gaps
S = 7a - 1b -1c
S=alignment scorea=match scoreb=mismatch penaltyc=gap penalty
(C) 2002 SNU CSE Biointelligence Lab (BI) 20
Dot-matrix Representations
!Figure 8.4
(C) 2002 SNU CSE Biointelligence Lab (BI) 21
Dot-matrix Representations
! Figure 8.5
(C) 2002 SNU CSE Biointelligence Lab (BI) 22
Local Alignments
! Many problems in computer science can be reduce to the task of finding the optimal path through a graph
! Some positive incremental scores will be used for aligning identical residues, with negative scores used for substitutions and gaps.
! Finding optimal local alignment♦ Dynamic programming.
!Needleman-Wunsch algorithm (Needleman and Wunsch, 1970)
!Smith-Waterman algorithm (Smith and Waterman, 1981)
!Viterbi decoding algorithm
(C) 2002 SNU CSE Biointelligence Lab (BI) 23
Dynamic Programming
! General optimization technique
! Application environment♦ Problem can be recursively subdivided into two similar
subproblems of smaller size
♦ Solution can be obtained by piecing together the solutions to the two subproblems
♦ Example: finding shortest path!Problem: [A " C " B] # [A " C] and [C " B]
!Solution: [A " C] + [C " B] # [A " B]
(C) 2002 SNU CSE Biointelligence Lab (BI) 24
Smith-Waterman Algorithm (1)
! Finding local alignments
! Using dynamic programming
TCAT*G*CATTG
(C) 2002 SNU CSE Biointelligence Lab (BI) 25
Smith-Waterman Algorithm (2)
! Best results and slow performance
! Can grasp results missed by BLAST or FASTA.! Available on web:
spiral.genes.nig.ac.jp/homolgy/ssearch-e.shtml! Ref : Smith, T. F. and Watermann, M. S. (1981)
Identification of common molecular subsequence.Journal of Molecular Biology, 147: 196-197.
(C) 2002 SNU CSE Biointelligence Lab (BI) 26
(C) 2002 SNU CSE Biointelligence Lab (BI) 27
Scoring Matrix
! A unitary matrix is used for base pairs♦ Each position can be given a score of +1 if it matches and a
score of zero if it does not.
! Substitution matrices are used for amino acid alignments. ♦ Certain amino acids can substitute easily for one another in
related proteins because of their similar physicochemical properties.
(C) 2002 SNU CSE Biointelligence Lab (BI) 28
Substitution Scores
! Point accepted mutation (PAM) model of evolution♦ A unit of evolutionary divergence in which 1% of the amino
acids have been changed
♦ Log-odds approach!The substitution scores in the matrix are proportional to the natural
log of the ratio of target frequencies to background frequencies
! BLOSUM substitution matrices ♦ Use of a different strategy for estimating the target frequencies
♦ The underlying data are derived from the BLOCKS database, which contains local alignments (“blocks”)
(C) 2002 SNU CSE Biointelligence Lab (BI) 29
Gap Penalties
! Affine gap penalties
♦ For a gap with length n:!Penalty = G + Ln
!G: gap-opening penalty (around 10-15)
!L: gap-extension penalty (around 1 or 2)
(C) 2002 SNU CSE Biointelligence Lab (BI) 30
PAM Scoring Matrices
! Point accepted mutation (PAM) model of evolution♦ A unit of evolutionary divergence in which 1% of the amino
acids have been changed
♦ Log-odds approach!The substitution scores in the matrix are proportional to the natural
log of the ratio of target frequencies to background frequencies
! Parameters were estimated from a small sample of closely related proteins.
(C) 2002 SNU CSE Biointelligence Lab (BI) 31
The PAM250 Scoring Matrix
A 2R -2 6N 0 0 2D 0 -1 2 4C -2 -4 -4 -5 12Q 0 1 1 2 -5 4E 0 -1 1 3 -5 2 4G 1 -3 0 1 -3 -1 0 5H -1 2 2 1 -3 3 1 -2 6I -1 -2 -2 -2 -2 -2 -2 -3 -2 5L -2 -3 -3 -4 -6 -2 -3 -4 -2 2 6K -1 3 1 0 -5 1 0 -2 0 -2 -3 5M -1 0 -2 -3 -5 -1 -2 -3 -2 2 4 0 6F -3 -4 -3 -6 -4 -5 -5 -5 -2 1 2 -5 0 9P 1 0 0 -1 -3 0 -1 0 0 -2 -3 -1 -2 -5 6S 1 0 1 0 0 -1 0 1 -1 -1 -3 0 -2 -3 1 2T 1 -1 0 0 -2 -1 0 0 -1 0 -2 0 -1 -3 0 1 3W -6 2 -4 -7 -8 -5 -7 -7 -3 -5 -2 -3 -4 0 -6 -2 -5 17Y -3 -4 -2 -4 0 -4 -4 -5 0 -1 -1 -4 -2 7 -5 -3 -3 0 10V 0 -2 -2 -2 -2 -2 -2 -1 -2 4 2 -2 -2 -1 -1 -1 0 -6 -2 4
A R N D C Q E G H I L K M F P S T W Y V
(C) 2002 SNU CSE Biointelligence Lab (BI) 32
BLOSUM Scoring Matrices
! Construct a database of “blocks”: ungapped, aligned conserved regions of proteins)
! Cluster sequences within a block that are more similar than a chosen threshold (e.g., 62% for BLOSUM62)
! Represent each cluster of sequences using their “average”sequence
! After the averaging procedure, each modified block consists of average sequence(s) and sequences that did not cluster with the others
! Examine all pairs of sequences in the modified blocks, tabulate pij the probability of observing aa i in one sequence and aa j in a second sequence.
(C) 2002 SNU CSE Biointelligence Lab (BI) 33
The BLOSUM62 Scoring Matrix
A 4R -1 5N -2 0 6D -2 -2 1 6C 0 -3 -3 -3 9Q -1 1 0 0 -3 5E -1 0 0 2 -4 2 5G 0 -2 0 -1 -3 -2 -2 6H -2 0 1 -1 -3 0 0 -2 8I -1 -3 -3 -3 -1 -3 -3 -4 -3 4L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 1K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5F -2 -3 -3 -2 -3 -3 -3 -1 -3 0 0 -3 0 6P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 1 1 5W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4
A R N D C Q E G H I L K M F P S T W Y V
(C) 2002 SNU CSE Biointelligence Lab (BI) 34
Statistical Significance: E-values
! Let Pi be the frequency of aa i. For unrelated sequences, an alignment of i with j has probability Pij.
! Given Pij and sij, we can calculate normalized scores (“bit scores”) from the raw score, S:
! When 2 random sequences of length m and n are aligned, the expected number of HSPs with normalized scores greater than S� is approximately
′ = −S
S Kλ ln
ln 2K is a function of the database size,λ is a function of the scoring matrix
E N N nmS= =′2, where
(C) 2002 SNU CSE Biointelligence Lab (BI) 35
BLAST & FASTA
! Using heuristic algorithms
! Word based match
! Faster than Smith &Waterman
! BLAST: www.ncbi.nlm.nih.gov/BLASTRef: Altschul, S. F., et al. (1990) Basic local alignment
search tool. Journal of Molecular Biology, 215, 403-410.
! FASTA: www.ebi.ac.uk/fasta3Ref: Wilbur, W. J. and Lipman, D. J. (1983) Rapid
similarity searches of nucleic acid and protein databanks. Proceedings of the National Academy of Sciences USA 80, 726-30.
(C) 2002 SNU CSE Biointelligence Lab (BI) 36
! Figure 8.9
(C) 2002 SNU CSE Biointelligence Lab (BI) 37
FASTA
! Score using a substitution matrix♦ Applies a substitution matrix to the 10 best
regions
♦ The matrix encapsulates the biological significance of word matches
♦ Single best subalignment - init1
♦ db strings with init1 < threshold are filtered
(C) 2002 SNU CSE Biointelligence Lab (BI) 38
BLAST Result
Program: BLASTP
Query: human MYB
binding protein
Database: Swissprot
(C) 2002 SNU CSE Biointelligence Lab (BI) 39 (C) 2002 SNU CSE Biointelligence Lab (BI) 40
The BLAST Search
For the query, find the list of high scoring words of length W.
Compares the word list to the database and identifies exact matches.
For each word match, extends the alignment in both directions to find alignments that score greater than a threshold of value S.
(C) 2002 SNU CSE Biointelligence Lab (BI) 41
BLAST Programs
May be useful for EST analysis, but computationally intensive
Nucleotide
(translated)
Nucleotide
(translated)
TBLASTX
Useful for finding unannotated coding regions in database sequences
Nucleotide
(translated)
ProteinTBLASTN
Useful for analysis of new DNA sequences andESTs
ProteinNucleotide
(translated)
BLASTX
Tuned for very high-scoring matches, not distant relationships
NucleotideNucleotideBLASTN
Uses substitution matrix for finding distant relationship relationships; SEG filtering available
ProteinProteinBLASTP
CommentsDatabaseQueryProgram
(C) 2002 SNU CSE Biointelligence Lab (BI) 42
Protein Sequence Databases for use with BLAST
Databas Description
nr
month
swissprot
pdb
ecoliyeastdrosoph
Non-redundant merge of SWISS-PROT, PIR, PRF, and proteins derived form GenBank coding sequences and PDB atomic coordinates
Subset of nr witch is new or modified within the last 30 days
The SWISS-PROT database
Amino acid sequences parsed from atomic coordinates of three-dimensional structuresComplete set of proteins encoded by the E. coli genomeComplete set of proteins encoded by the S. cerevisiae genomeComplete set of proteins encoded by the E. melanogaster genome
(C) 2002 SNU CSE Biointelligence Lab (BI) 43
Nucleotide Sequence Databases for use with BLAST
Databas Description
nrmonth
htgs
ecoliyeastdrosoph
Non-redundant Genbank, excluding the EST, STS, and GSS divisions
Subset of nr, witch is new or modified within the last 30 days
Genbank HTG division (high-throughput genomic sequences)
Complete genome sequence of E. coliComplete genome sequence of S. cerevisiaeComplete genome sequence of E. melanogaster
est Genbank EST division (expressed sequence tags)sts Genbank STS division (sequence tagged sites)
gss Genbank GSS division (genome survey sequences)
mito Complete genome sequence of vertebrate mitochondria
alu Collection of primate Alu repeat sequencesvector Collection of popular cloning vectors
(C) 2002 SNU CSE Biointelligence Lab (BI) 44
TBLASTN Search
! The protein product of the Alzheimer’s disease susceptibility gene ( GenBank L43964) was used as the query in a TBLASTN search against the est database. The goal was to identify cDNA clones from other organisms that may represent homologs of the human gene.
(C) 2002 SNU CSE Biointelligence Lab (BI) 45 (C) 2002 SNU CSE Biointelligence Lab (BI) 46
! Portion of the hit list showing the 25 best hits. Each sequence is identified by GenBank accession number and a portion of the definition line. The reading frame and score of the best HSP areshown, together with the sum probability of a chance occurrence.The value in the last column gives the number of HSPs that were used in the sum probability calculation At least 10 sequence from mouse and one from Drosophila may be seen on the hit list.
! Match to the conceptual translation of the Drosophila EST sequence (GenBank AA390557). Two HSPs were found, each in a different reading frame. Identical residues are echoed to the central line, and plus (+) symbols indicate pairs of nonidentical amino acids with positive substitution scores.
(C) 2002 SNU CSE Biointelligence Lab (BI) 47
! Figure 8.11
(C) 2002 SNU CSE Biointelligence Lab (BI) 48
! Figure 8.12
(C) 2002 SNU CSE Biointelligence Lab (BI) 49
Position Specific Scoring Matrices
! Position-Specific Iterated BLAST (PSI-BLAST)
�������
����
��� �������
�������������� �����
����������������������
�����
������������������
��������������������������������������������������
(C) 2002 SNU CSE Biointelligence Lab (BI) 50
! Figure 8.13!Figure 8.13
(C) 2002 SNU CSE Biointelligence Lab (BI) 51
Spliced Alignments Problem
! Given strings G (genomic sequence) and T (target sequence), and a set B of substrings of G, find a set of non-overlapping strings from B whose concatenation C fits T the best; i.e., the edit distance between C and T is minimal among all sets of blocks from B.
ATCAGTGCAATGCAGCCATGAComplement of mRNA T = t1t2 ... tm
ATCAGTGCAATGT..........AGGCAGCCATGA
Exon Intron ExonG
B = { b1, b2, ... bp }
(C) 2002 SNU CSE Biointelligence Lab (BI) 52
! Figure 8.14
Figure 8.14. Spliced alignment. The sim4 program was used to align a novel human mRNA (RefSeqNM_015372) to the genomic sequence of a cosmid from chromosome 22 (EMBL Z82248). Three exons were identified on the complementary (the third one has been truncated for brevity). The “>>>” symbols indicate splice sites found at the exon/intron boundaries.
(C) 2002 SNU CSE Biointelligence Lab (BI) 53
Systems biologyNeuroimmunology
BioinformaticsData miningCombinatorial chemistrySynthetic biology
Biochip & microfluidics (LOC)Biocomputation
BIOLOGY IN THE FUTURE