Sequence Alignment and Database Searching · 2015-11-24 · (C) 2002 SNU CSE Biointelligence Lab (BI)

Bioinformatics Chapter 8.

Sequence Alignment and Database Searching

Sirk-June Augh

Center for Bioinformation Technology (CBIT)

Seoul National University

E-mail: [email protected]

This material is available at http://bi.snu.ac.kr./ & http://cbit.snu.ac.kr/

Byoung-Tak Zhang

Center for Bioinformation Technology (CBIT) &

Biointelligence Laboratory, School of Computer Science and Engineering

Seoul National University

E-mail: [email protected]


Outline

! The Evolutionary Basis of Sequence Alignment

! The Modular Nature of Proteins

! Optimal Alignment Methods

! Substitution Scores and Gap Penalties

! Statistical Significance of Alignments

! Database Similarity Searching

! FASTA

! BLAST

! Database Searching Artifacts

! Position-Specific Scoring Matrices

! Spliced Alignments


Similarity vs. Homology

! Similarity
  ♦ An observable quantity that might be expressed as, say, percent identity or some other suitable measure.

! Homology
  ♦ Refers to a conclusion drawn from these data that two genes share a common evolutionary history.


Conserved Positions are often of Functional Importance


Alignment Jargon

Evolutionarily related sequences differ from one another because of several processes:

! Substitutions

! Insertions

! Deletions

[Diagram: an ancestor sequence diverges into two observed sequences, A and B]


Alignment Jargon

Substitution (A→G)

Ancestor: ACG
Observed sequences: GCG and ACG

GCG
 ||
ACG

• 1 mismatch
• 2 matches


Alignment Jargon

Insertion (→T)

Ancestor: ACG
Observed sequences: ATCG and ACG

ATCG
| ||
A-CG

• 0 mismatches
• 3 matches
• 1 gap


Alignment Jargon

Deletion (←T)

Ancestor: ATCG
Observed sequences: ACG and ATCG

ATCG
| ||
A-CG

• 0 mismatches
• 3 matches
• 1 gap


Alignment vs. Prediction: When Are Alignments Reliable?

! New sequences are aligned against all sequences in the database
  ♦ Hints toward structural/functional relationships + functional insights

! Fundamental question
  ♦ When is the sequence similarity high enough that one may infer a structural/functional similarity from the pairwise alignment of two sequences?
  ♦ Given the detected overlap in a sequence segment, can a similarity threshold be defined that sifts out cases where the inference will be reliable?

! Answer
  ♦ It depends on the structural/functional aspect one wants to investigate
  ♦ Different for each task

! Prediction
  ♦ Needed when alignment alone is not enough to support a reliable inference


Local Alignment vs. Global Alignment

! Local Alignment
  ♦ Attempts to align regions of sequences with the highest density of matches. In doing so, one or more islands of subalignments are created in the aligned sequences.

! Global Alignment
  ♦ Attempts to match as many characters as possible, from end to end, in a set of two or more sequences.


Local Alignment vs. Global Alignment

Global alignment matrix:

        A   C   C   A   C   A   C   A
    0   m   m   m   m   m   m   m   m
A   m   1  -1  -3  -5  -7  -9 -11 -13
C   m  -1   2   0  -2  -4  -6  -8 -10
A   m  -3   0   1   1  -1  -3  -5  -7
C   m  -5  -2   1   0   2   0  -2  -4
C   m  -7  -4  -1   0   1   1   1  -1
A   m  -9  -6  -3   0  -1   2   0   2
T   m -11  -8  -5  -2  -1   0   1   0
A   m -13 -10  -7  -4  -1   0  -1   2

Local alignment matrix:

        A   C   C   A   C   A   C   A
    0   0   0   0   0   0   0   0   0
A   0   1   0   0   1   0   1   0   1
C   0   0   2   1   0   2   0   2   0
A   0   1   0   1   2   0   3   1   3
C   0   0   2   1   0   3   1   4   2
C   0   0   0   3   2   1   2   2   3
A   0   1   0   1   4   2   2   1   3
T   0   0   0   0   2   3   1   1   1
A   0   1   0   0   1   1   4   2   2
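The two matrices can be filled with one dynamic-programming recurrence. The sketch below is a minimal illustration, not the slide author's code. It assumes a match score of +1, a mismatch score of -1, and a linear gap penalty of -2; these values are inferred from the matrices above rather than stated in the source. It also assumes that "m" on the global border stands for an unreachable cell (minus infinity). The function name `fill_matrix` is hypothetical.

```python
NEG_INF = float("-inf")


def fill_matrix(rows, cols, match=1, mismatch=-1, gap=-2, local=False):
    """Fill a dynamic-programming score matrix for two sequences.

    rows and cols label the matrix rows and columns.
    local=False: border cells are unreachable (assumed meaning of "m"),
                 so the alignment must start at the first residues.
    local=True:  border cells are 0 and no cell may drop below 0.
    """
    border = 0 if local else NEG_INF
    H = [[border] * (len(cols) + 1) for _ in range(len(rows) + 1)]
    H[0][0] = 0
    for i, r in enumerate(rows, start=1):
        for j, c in enumerate(cols, start=1):
            diag = H[i - 1][j - 1] + (match if r == c else mismatch)
            best = max(diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            H[i][j] = max(best, 0) if local else best
    return H


global_matrix = fill_matrix("ACACCATA", "ACCACACA")
local_matrix = fill_matrix("ACACCATA", "ACCACACA", local=True)

# Global score: bottom-right cell. Local score: largest cell anywhere.
global_score = global_matrix[-1][-1]
local_score = max(max(row) for row in local_matrix)
print(global_score, local_score)  # 2 4
```

The only differences between the two modes are the border initialization and the floor at zero, which is what lets a local alignment start and end anywhere in the matrix.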


BLAST search result

Query: soybean (plant) leghemoglobin

Database: Homo sapiens

The alignment shows only weak sequence similarity between the two proteins, yet their functions and structures are surprisingly similar.


Optimal Global Sequence Alignment


Prediction of Functional Features

! Two protein sequences share sequence similarity
  ♦ Do these proteins share a common function?

! A new sequence identity threshold is required for prediction of function
  ♦ The threshold used for structural problems cannot be used.

! Solution
  ♦ Split each sequence into a number of subsequences
  ♦ The fraction of aligned sites per alignment


Modular Nature of Proteins

Figure 8.3. Modular structure of two proteins involved in blood clotting. Schematic representation of the modular structure of human tissue plasminogen activator and coagulation factor XII. A module labeled C is shared by several proteins involved in blood clotting. F1 and F2 are frequently repeated units that were first seen in fibronectin. E is a module resembling epidermal growth factor. A module known as a “kringle domain” is denoted K.

[Figure: domain diagrams of PLAT and F12, each built from the modules F1, F2, E, and K plus a catalytic domain]


Measuring Alignment Quality

Good alignments should have …

! “many” exact matches

! “few” mismatches

! “many” of the mismatches should be similar residues

! “few” gaps



Begin with the longest exact match:

QTRPQNVLNPP
    |||
 STRQNVINPWAAQ

S = 3a

S = alignment score, a = match score



… allow some mismatches:

QTRPQNVLNPP
    ||| ||
 STRQNVINPWAAQ

S = 5a - 1b

S = alignment score, a = match score, b = mismatch penalty



… and finally, introduce some gaps:

QTRPQNVLNPP
 || ||| ||
STR-QNVINPWAAQ

S = 7a - 1b - 1c

S = alignment score, a = match score, b = mismatch penalty, c = gap penalty
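The score S = 7a - 1b - 1c can be checked by counting column types over the aligned segment. The sketch below is illustrative only: the function name `score_alignment` and the unit values a = b = c = 1 are assumptions, and the input is the aligned region that runs from T to P in the gapped example above.

```python
def score_alignment(top, bottom, a=1, b=1, c=1):
    """Return (matches, mismatches, gaps, S) for two equal-length gapped strings.

    S = matches * a - mismatches * b - gaps * c, with every gap character
    counted once (a simple linear gap cost).
    """
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have equal length")
    matches = mismatches = gaps = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            gaps += 1
        elif x == y:
            matches += 1
        else:
            mismatches += 1
    return matches, mismatches, gaps, matches * a - mismatches * b - gaps * c


result = score_alignment("TRPQNVLNP", "TR-QNVINP")
print(result)  # (7, 1, 1, 5)
```

With unit scores the segment scores 5; substituting real match, mismatch, and gap values changes only the final arithmetic, not the counts.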


Dot-matrix Representations

! Figure 8.4


Dot-matrix Representations

! Figure 8.5


Local Alignments

! Many problems in computer science can be reduced to the task of finding the optimal path through a graph

! Positive incremental scores are used for aligning identical residues, with negative scores used for substitutions and gaps.

! Finding the optimal local alignment
  ♦ Dynamic programming
    ! Needleman-Wunsch algorithm for global alignment (Needleman and Wunsch, 1970)
    ! Smith-Waterman algorithm for local alignment (Smith and Waterman, 1981)
    ! Viterbi decoding algorithm


Dynamic Programming

! General optimization technique

! Application environment
  ♦ Problem can be recursively subdivided into two similar subproblems of smaller size
  ♦ Solution can be obtained by piecing together the solutions to the two subproblems
  ♦ Example: finding the shortest path
    ! Problem: [A → C → B] ⇒ [A → C] and [C → B]
    ! Solution: [A → C] + [C → B] ⇒ [A → B]
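The shortest-path decomposition can be made concrete with the Floyd-Warshall recurrence, which repeatedly asks whether the route from A to B improves when it passes through an intermediate node C. This is one standard dynamic-programming formulation chosen for illustration; the slide does not name a specific algorithm, and the graph and the function name `all_pairs_shortest` are hypothetical.

```python
INF = float("inf")


def all_pairs_shortest(dist):
    """Floyd-Warshall: d[a][b] = min(d[a][b], d[a][c] + d[c][b]) for every c."""
    nodes = list(dist)
    d = {u: dict(dist[u]) for u in nodes}  # copy so the input is not modified
    for c in nodes:          # candidate intermediate node
        for a in nodes:      # start node
            for b in nodes:  # end node
                via_c = d[a][c] + d[c][b]
                if via_c < d[a][b]:
                    d[a][b] = via_c
    return d


# Hypothetical directed graph: the direct edge A -> B costs 10,
# while A -> C costs 3 and C -> B costs 4.
edges = {
    "A": {"A": 0, "B": 10, "C": 3},
    "B": {"A": INF, "B": 0, "C": INF},
    "C": {"A": INF, "B": 4, "C": 0},
}
shortest = all_pairs_shortest(edges)
print(shortest["A"]["B"])  # 7
```

The optimal A to B route is assembled from the optimal A to C and C to B routes, which is the same "piece together subproblem solutions" idea used when filling an alignment matrix.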


Smith-Waterman Algorithm (1)

! Finding local alignments

! Using dynamic programming

TCAT*G*CATTG


Smith-Waterman Algorithm (2)

! Best results, but slow performance

! Can find alignments missed by BLAST or FASTA.

! Available on the web: spiral.genes.nig.ac.jp/homolgy/ssearch-e.shtml

! Ref: Smith, T. F. and Waterman, M. S. (1981) Identification of common molecular subsequences. Journal of Molecular Biology, 147: 195-197.



Scoring Matrix

! A unitary matrix is used for nucleotide bases
  ♦ Each position can be given a score of +1 if it matches and a score of zero if it does not.

! Substitution matrices are used for amino acid alignments.
  ♦ Certain amino acids can substitute easily for one another in related proteins because of their similar physicochemical properties.


Substitution Scores

! Point accepted mutation (PAM) model of evolution
  ♦ A unit of evolutionary divergence in which 1% of the amino acids have been changed
  ♦ Log-odds approach
    ! The substitution scores in the matrix are proportional to the natural log of the ratio of target frequencies to background frequencies

! BLOSUM substitution matrices
  ♦ Use a different strategy for estimating the target frequencies
  ♦ The underlying data are derived from the BLOCKS database, which contains local alignments (“blocks”)
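The log-odds idea can be sketched in a few lines. The names below (`log_odds_score`, `q_ij`, `p_i`, `p_j`, `scale`) and the frequencies are hypothetical and chosen only for illustration. The background frequency of a pair is taken to be the product of the two individual residue frequencies, which is the usual assumption for unrelated sequences.

```python
import math


def log_odds_score(q_ij, p_i, p_j, scale=1.0):
    """Score proportional to ln(target frequency / background frequency).

    q_ij  : target frequency of the aligned pair (i, j) in related sequences
    p_i, p_j : background frequencies of the two residues
    scale : proportionality constant (real matrices also round to integers)
    """
    return scale * math.log(q_ij / (p_i * p_j))


# Hypothetical frequencies, not taken from any real matrix.
enriched = log_odds_score(q_ij=0.02, p_i=0.1, p_j=0.1)   # seen twice as often as chance
depleted = log_odds_score(q_ij=0.005, p_i=0.1, p_j=0.1)  # seen half as often as chance
print(round(enriched, 3), round(depleted, 3))  # 0.693 -0.693
```

Pairs observed more often than chance get positive scores and pairs observed less often get negative scores, which is why conservative substitutions score above zero in PAM and BLOSUM matrices.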


Gap Penalties

! Affine gap penalties
  ♦ For a gap of length n:
    ! Penalty = G + Ln
    ! G: gap-opening penalty (around 10-15)
    ! L: gap-extension penalty (around 1 or 2)
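A short sketch of the affine penalty follows. The defaults G = 11 and L = 1 are illustrative picks from the ranges quoted above, and the function name `affine_gap_penalty` is hypothetical.

```python
def affine_gap_penalty(n, G=11, L=1):
    """Penalty = G + L * n for a single gap of length n (n >= 1)."""
    if n < 1:
        raise ValueError("gap length must be at least 1")
    return G + L * n


penalties = [affine_gap_penalty(n) for n in (1, 2, 5)]
print(penalties)  # [12, 13, 16]

# One gap of length 5 is far cheaper than five separate gaps of length 1.
print(affine_gap_penalty(5), 5 * affine_gap_penalty(1))  # 16 60
```

Because opening a gap costs much more than extending it, the scheme favors a few long gaps over many scattered short ones.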


PAM Scoring Matrices

! Point accepted mutation (PAM) model of evolution
  ♦ A unit of evolutionary divergence in which 1% of the amino acids have been changed
  ♦ Log-odds approach
    ! The substitution scores in the matrix are proportional to the natural log of the ratio of target frequencies to background frequencies

! Parameters were estimated from a small sample of closely related proteins.


The PAM250 Scoring Matrix

A   2
R  -2   6
N   0   0   2
D   0  -1   2   4
C  -2  -4  -4  -5  12
Q   0   1   1   2  -5   4
E   0  -1   1   3  -5   2   4
G   1  -3   0   1  -3  -1   0   5
H  -1   2   2   1  -3   3   1  -2   6
I  -1  -2  -2  -2  -2  -2  -2  -3  -2   5
L  -2  -3  -3  -4  -6  -2  -3  -4  -2   2   6
K  -1   3   1   0  -5   1   0  -2   0  -2  -3   5
M  -1   0  -2  -3  -5  -1  -2  -3  -2   2   4   0   6
F  -3  -4  -3  -6  -4  -5  -5  -5  -2   1   2  -5   0   9
P   1   0   0  -1  -3   0  -1   0   0  -2  -3  -1  -2  -5   6
S   1   0   1   0   0  -1   0   1  -1  -1  -3   0  -2  -3   1   2
T   1  -1   0   0  -2  -1   0   0  -1   0  -2   0  -1  -3   0   1   3
W  -6   2  -4  -7  -8  -5  -7  -7  -3  -5  -2  -3  -4   0  -6  -2  -5  17
Y  -3  -4  -2  -4   0  -4  -4  -5   0  -1  -1  -4  -2   7  -5  -3  -3   0  10
V   0  -2  -2  -2  -2  -2  -2  -1  -2   4   2  -2  -2  -1  -1  -1   0  -6  -2   4
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V


BLOSUM Scoring Matrices

! Construct a database of “blocks” (ungapped, aligned conserved regions of proteins)

! Cluster sequences within a block that are more similar than a chosen threshold (e.g., 62% for BLOSUM62)

! Represent each cluster of sequences using their “average” sequence

! After the averaging procedure, each modified block consists of average sequence(s) and sequences that did not cluster with the others

! Examine all pairs of sequences in the modified blocks and tabulate p_ij, the probability of observing amino acid i in one sequence and amino acid j in a second sequence.
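The final tabulation step can be sketched as follows. This is a simplified illustration: it skips the clustering and averaging steps and treats every sequence in the block with equal weight. The toy block and the function name `pair_frequencies` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pair_frequencies(block):
    """Tabulate p_ij from the columns of an ungapped block.

    block is a list of equal-length strings. Every pair of sequences
    contributes one residue pair per column; pairs are unordered, so
    (A, S) and (S, A) are counted together.
    """
    counts = Counter()
    for column in zip(*block):
        for x, y in combinations(column, 2):
            counts[tuple(sorted((x, y)))] += 1
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


block = ["AVL", "AIL", "SVL"]  # hypothetical toy block, 3 sequences x 3 columns
freqs = pair_frequencies(block)
for pair, p in sorted(freqs.items()):
    print(pair, round(p, 3))
```

Each column of three sequences yields three residue pairs, so this toy block contributes nine pairs in total; the fully conserved last column accounts for three of them, all (L, L).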


The BLOSUM62 Scoring Matrix

A   4
R  -1   5
N  -2   0   6
D  -2  -2   1   6
C   0  -3  -3  -3   9
Q  -1   1   0   0  -3   5
E  -1   0   0   2  -4   2   5
G   0  -2   0  -1  -3  -2  -2   6
H  -2   0   1  -1  -3   0   0  -2   8
I  -1  -3  -3  -3  -1  -3  -3  -4  -3   4
L  -1  -2  -3  -4  -1  -2  -3  -4  -3   2   4
K  -1   2   0  -1  -3   1   1  -2  -1  -3  -2   5
M  -1  -1  -2  -3  -1   0  -2  -3  -2   1   2  -1   5
F  -2  -3  -3  -3  -2  -3  -3  -3  -1   0   0  -3   0   6
P  -1  -2  -2  -1  -3  -1  -1  -2  -2  -3  -3  -1  -2  -4   7
S   1  -1   1   0  -1   0   0   0  -1  -2  -2   0  -1  -2  -1   4
T   0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   1   5
W  -3  -3  -4  -4  -2  -2  -3  -2  -2  -3  -2  -3  -1   1  -4  -3  -2  11
Y  -2  -2  -2  -3  -2  -1  -2  -3   2  -1  -1  -2  -1   3  -3  -2  -2   2   7
V   0  -3  -3  -3  -1  -2  -2  -3  -3   3   1  -2   1  -1  -2  -2   0  -3  -1   4
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V


Statistical Significance: E-values

! Let P_i be the frequency of amino acid i. For unrelated sequences, an alignment of i with j has probability P_i P_j.

! Given the P_i and the substitution scores s_ij, we can calculate normalized scores (“bit scores”) S′ from the raw score S:

$$S' = \frac{\lambda S - \ln K}{\ln 2}$$

  K is a scaling parameter for the search space size, and λ is a scaling parameter for the scoring matrix.

! When two random sequences of lengths m and n are aligned, the expected number of HSPs with normalized scores of at least S′ is approximately

$$E = \frac{N}{2^{S'}}, \quad \text{where } N = nm$$
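The bit-score and E-value relations, S′ = (λS − ln K) / ln 2 and E = N / 2^S′ with N = nm, translate directly into code. The values of λ, K, the raw score, and the sequence lengths below are hypothetical placeholders, not parameters of any particular scoring matrix; the function names are likewise illustrative.

```python
import math


def bit_score(raw_score, lam, K):
    """Normalized score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def expected_hits(raw_score, m, n, lam, K):
    """E = N / 2**S' with search space N = n * m."""
    return (m * n) / 2 ** bit_score(raw_score, lam, K)


# Hypothetical parameter values for illustration only.
lam, K = 0.3, 0.1
s_prime = bit_score(50, lam, K)
e_value = expected_hits(50, m=300, n=1_000_000, lam=lam, K=K)
print(s_prime, e_value)
```

Substituting S′ back into E gives the equivalent form E = K·m·n·e^(−λS): each additional bit of score halves the number of hits expected by chance, and doubling the search space doubles it.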


BLAST & FASTA

! Using heuristic algorithms

! Word-based matching

! Faster than Smith-Waterman

! BLAST: www.ncbi.nlm.nih.gov/BLAST
  Ref: Altschul, S. F., et al. (1990) Basic local alignment search tool. Journal of Molecular Biology, 215, 403-410.

! FASTA: www.ebi.ac.uk/fasta3
  Ref: Wilbur, W. J. and Lipman, D. J. (1983) Rapid similarity searches of nucleic acid and protein data banks. Proceedings of the National Academy of Sciences USA 80, 726-30.


! Figure 8.9


FASTA

! Score using a substitution matrix
  ♦ Applies a substitution matrix to the 10 best regions
  ♦ The matrix encapsulates the biological significance of word matches
  ♦ Single best subalignment: init1
  ♦ Database sequences with init1 < threshold are filtered out


BLAST Result

Program: BLASTP

Query: human MYB binding protein

Database: Swissprot


The BLAST Search

For the query, find the list of high scoring words of length W.

Compares the word list to the database and identifies exact matches.

For each word match, extends the alignment in both directions to find alignments that score greater than a threshold of value S.
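The three steps can be sketched as a toy seed-and-extend search. This is a deliberately simplified illustration and not the BLAST implementation: it seeds only on exact words (no neighborhood words scored with a substitution matrix), extends without gaps to the ends of the sequences instead of using a drop-off cutoff, and uses hypothetical match and mismatch scores of +1 and -1. All function names are made up for the example.

```python
def best_extension(pairs, match, mismatch):
    """Best cumulative score over prefixes of a stream of residue pairs."""
    best = run = best_len = 0
    for k, (x, y) in enumerate(pairs, start=1):
        run += match if x == y else mismatch
        if run > best:
            best, best_len = run, k
    return best, best_len


def seed_and_extend(query, subject, W=3, match=1, mismatch=-1, S=4):
    """Return ungapped hits (score, query_start, subject_start, length) with score > S."""
    # Step 1: list the words of length W in the query.
    words = {}
    for q in range(len(query) - W + 1):
        words.setdefault(query[q:q + W], []).append(q)

    hits = set()
    # Step 2: scan the subject for exact matches to those words.
    for s in range(len(subject) - W + 1):
        for q in words.get(subject[s:s + W], []):
            # Step 3: extend the seed in both directions without gaps.
            right, r_len = best_extension(
                zip(query[q + W:], subject[s + W:]), match, mismatch)
            left, l_len = best_extension(
                zip(reversed(query[:q]), reversed(subject[:s])), match, mismatch)
            score = W * match + left + right
            if score > S:
                hits.add((score, q - l_len, s - l_len, l_len + W + r_len))
    return sorted(hits, reverse=True)


hits = seed_and_extend("GGACGTTCA", "TTACGTTCC")
print(hits)  # [(6, 2, 2, 6)]
```

Four different seeds (ACG, CGT, GTT, TTC) all extend to the same six-residue segment ACGTTC, so collecting hits in a set reports it once.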


BLAST Programs

Program   Query                     Database                  Comments
BLASTP    Protein                   Protein                   Uses substitution matrix for finding distant relationships; SEG filtering available
BLASTN    Nucleotide                Nucleotide                Tuned for very high-scoring matches, not distant relationships
BLASTX    Nucleotide (translated)   Protein                   Useful for analysis of new DNA sequences and ESTs
TBLASTN   Protein                   Nucleotide (translated)   Useful for finding unannotated coding regions in database sequences
TBLASTX   Nucleotide (translated)   Nucleotide (translated)   May be useful for EST analysis, but computationally intensive


Protein Sequence Databases for use with BLAST

Database    Description
nr          Non-redundant merge of SWISS-PROT, PIR, PRF, and proteins derived from GenBank coding sequences and PDB atomic coordinates
month       Subset of nr which is new or modified within the last 30 days
swissprot   The SWISS-PROT database
pdb         Amino acid sequences parsed from atomic coordinates of three-dimensional structures
ecoli       Complete set of proteins encoded by the E. coli genome
yeast       Complete set of proteins encoded by the S. cerevisiae genome
drosoph     Complete set of proteins encoded by the D. melanogaster genome


Nucleotide Sequence Databases for use with BLAST

Database    Description
nr          Non-redundant GenBank, excluding the EST, STS, and GSS divisions
month       Subset of nr which is new or modified within the last 30 days
htgs        GenBank HTG division (high-throughput genomic sequences)
ecoli       Complete genome sequence of E. coli
yeast       Complete genome sequence of S. cerevisiae
drosoph     Complete genome sequence of D. melanogaster
est         GenBank EST division (expressed sequence tags)
sts         GenBank STS division (sequence tagged sites)
gss         GenBank GSS division (genome survey sequences)
mito        Complete genome sequence of vertebrate mitochondria
alu         Collection of primate Alu repeat sequences
vector      Collection of popular cloning vectors


TBLASTN Search

! The protein product of the Alzheimer’s disease susceptibility gene (GenBank L43964) was used as the query in a TBLASTN search against the est database. The goal was to identify cDNA clones from other organisms that may represent homologs of the human gene.


! Portion of the hit list showing the 25 best hits. Each sequence is identified by GenBank accession number and a portion of the definition line. The reading frame and score of the best HSP are shown, together with the sum probability of a chance occurrence. The value in the last column gives the number of HSPs that were used in the sum probability calculation. At least 10 sequences from mouse and one from Drosophila may be seen on the hit list.

! Match to the conceptual translation of the Drosophila EST sequence (GenBank AA390557). Two HSPs were found, each in a different reading frame. Identical residues are echoed to the central line, and plus (+) symbols indicate pairs of nonidentical amino acids with positive substitution scores.


! Figure 8.11


! Figure 8.12


Position Specific Scoring Matrices

! Position-Specific Iterated BLAST (PSI-BLAST)


! Figure 8.13


Spliced Alignments Problem

! Given strings G (genomic sequence) and T (target sequence), and a set B of substrings of G, find a set of non-overlapping strings from B whose concatenation C fits T best; i.e., the edit distance between C and T is minimal among all sets of blocks from B.

T = t1 t2 ... tm (complement of mRNA):  ATCAGTGCAATGCAGCCATGA

G (exon, intron, exon):                 ATCAGTGCAATGT..........AGGCAGCCATGA

B = { b1, b2, ... bp }
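The problem statement can be illustrated with a brute-force search: enumerate every ordered chain of non-overlapping blocks, concatenate them, and keep the chain with the smallest edit distance to T. This is a sketch for small inputs only; practical spliced-alignment programs use dynamic programming rather than enumeration. The intron letters, the candidate block coordinates, and the function names below are hypothetical.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # match or substitution
        prev = cur
    return prev[-1]


def best_spliced_alignment(G, T, blocks):
    """Return (distance, chain) for the best chain of non-overlapping blocks.

    blocks are half-open (start, end) intervals on G.
    """
    best = (edit_distance("", T), ())
    ordered = sorted(blocks)
    for r in range(1, len(ordered) + 1):
        for chain in combinations(ordered, r):
            # Consecutive blocks must not overlap.
            if all(chain[k][1] <= chain[k + 1][0] for k in range(r - 1)):
                C = "".join(G[s:e] for s, e in chain)
                d = edit_distance(C, T)
                if d < best[0]:
                    best = (d, chain)
    return best


# Two exons taken from the slide, joined by a made-up GT...AG intron.
G = "ATCAGTGCAAT" + "GTAAGTTTTCAG" + "GCAGCCATGA"
T = "ATCAGTGCAATGCAGCCATGA"
blocks = [(0, 11), (0, 5), (11, 23), (23, 33), (25, 33)]  # hypothetical candidates
result = best_spliced_alignment(G, T, blocks)
print(result)  # (0, ((0, 11), (23, 33)))
```

The winning chain skips the intron block and the two truncated candidates, and reproduces T exactly, so the edit distance is zero.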


! Figure 8.14

Figure 8.14. Spliced alignment. The sim4 program was used to align a novel human mRNA (RefSeq NM_015372) to the genomic sequence of a cosmid from chromosome 22 (EMBL Z82248). Three exons were identified on the complementary strand (the third one has been truncated for brevity). The “>>>” symbols indicate splice sites found at the exon/intron boundaries.


BIOLOGY IN THE FUTURE

! Systems biology
! Neuroimmunology
! Bioinformatics
! Data mining
! Combinatorial chemistry
! Synthetic biology
! Biochip & microfluidics (LOC)
! Biocomputation
