14
Bioinformatics Chapter 8. Sequence Alignment and Database Searching 오석준 서울대 바이오정보기술연구센터 E-mail: [email protected] Sirk-June Augh Center for Bioinformation Technology (CBIT) Seoul National University This material is available at http://bi.snu.ac.kr./ & http://cbit.snu.ac.kr/ 장병탁 서울대 컴퓨터공학부 바이오지능연구실 바이오정보기술연구센터 E-mail: [email protected] Byoung-Tak Zhang Center for Bioinformation Technology (CBIT) & Biointelligence Laboratory School of Computer Science and Engineering Seoul National University (C) 2002 SNU CSE Biointelligence Lab (BI) 2 Outline ! The Evolutionary Basis of Sequence Alignment ! The Modular Nature of Proteins ! Optimal Alignment Methods ! Substitution Scores and Gap Penalties ! Statistical Significance of Alignments ! Database Similarity Searching ! FASTA ! BLAST ! Database Searching Artifacts ! Position-Specific Scoring Matrices ! Spliced Alignments (C) 2002 SNU CSE Biointelligence Lab (BI) 3 Similarity vs. Homology ! Similarity: An observable quantity that might be expressed, as say, percent identity or some other suitable measure. ! Homology: Refers to a conclusion drawn from these data that two genes share a common evolutionary history (C) 2002 SNU CSE Biointelligence Lab (BI) 4 Conserved Positions are often of Functional Importance

Sequence Alignment and Database Searching · 2015-11-24 · C 002 1 031 42 C 00 03 212 23 A 01 014 22 13 T 00 00 231 11 A 01 00 114 22 (C) 2002 SNU CSE Biointelligence Lab (BI) 12

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Sequence Alignment and Database Searching · 2015-11-24 · C 002 1 031 42 C 00 03 212 23 A 01 014 22 13 T 00 00 231 11 A 01 00 114 22 (C) 2002 SNU CSE Biointelligence Lab (BI) 12

Bioinformatics Chapter 8.

Sequence Alignment and Database Searching

오 석 준

서울대 바이오정보기술연구센터

E-mail: [email protected]

Sirk-June AughCenter for Bioinformation Technology (CBIT)

Seoul National University

This material is available at http://bi.snu.ac.kr./ &http://cbit.snu.ac.kr/

장 병 탁

서울대 컴퓨터공학부 바이오지능연구실 �

바이오정보기술연구센터

E-mail: [email protected]

Byoung-Tak ZhangCenter for Bioinformation Technology (CBIT) &

Biointelligence Laboratory School of Computer Science and Engineering

Seoul National University

(C) 2002 SNU CSE Biointelligence Lab (BI) 2

Outline

! The Evolutionary Basis of Sequence Alignment

! The Modular Nature of Proteins

! Optimal Alignment Methods

! Substitution Scores and Gap Penalties

! Statistical Significance of Alignments

! Database Similarity Searching

! FASTA

! BLAST

! Database Searching Artifacts

! Position-Specific Scoring Matrices

! Spliced Alignments

(C) 2002 SNU CSE Biointelligence Lab (BI) 3

Similarity vs. Homology

! Similarity:♦ An observable quantity that might be expressed, as say, percent

identity or some other suitable measure.

! Homology:♦ Refers to a conclusion drawn from these data that two genes

share a common evolutionary history

(C) 2002 SNU CSE Biointelligence Lab (BI) 4

Conserved Positions are often of Functional Importance

Page 2: Sequence Alignment and Database Searching · 2015-11-24 · C 002 1 031 42 C 00 03 212 23 A 01 014 22 13 T 00 00 231 11 A 01 00 114 22 (C) 2002 SNU CSE Biointelligence Lab (BI) 12

(C) 2002 SNU CSE Biointelligence Lab (BI) 5

Alignment Jargon

! Substitutions

! Insertions

! Deletions

ancestor

A B

Observed sequences

Evolutionarily related sequences differ from one other because of several processes:

(C) 2002 SNU CSE Biointelligence Lab (BI) 6

Alignment Jargon

ACG

GCG ACG

Substitution

A→G

GCG||ACG

• 1 mismatch• 2 matches

(C) 2002 SNU CSE Biointelligence Lab (BI) 7

Alignment Jargon

ACG

ATCG ACG

Insertion

→T

ATCG| ||A-CG

• 0 mismatches• 3 matches• 1 gap

(C) 2002 SNU CSE Biointelligence Lab (BI) 8

Alignment Jargon

Deletion

ATCG

ACG ATCG

←T

ATCG| ||A-CG

• 0 mismatches• 3 matches• 1 gap

Page 3: Sequence Alignment and Database Searching · 2015-11-24 · C 002 1 031 42 C 00 03 212 23 A 01 014 22 13 T 00 00 231 11 A 01 00 114 22 (C) 2002 SNU CSE Biointelligence Lab (BI) 12

(C) 2002 SNU CSE Biointelligence Lab (BI) 9

Alignment vs. Prediction: When are Alignment Reliable?! New sequences are aligned against all sequences in DB

♦ Hints toward structural/functional relationships + functional insights

! Fundamental question♦ When is the sequence similarity high enough that one may infer a

structural/functional similarity from the pairwise alignment of two sequences?

♦ Given the detected overlap in a sequence segment, can a similarity threshold be defined that sifts out cases where the inference will be reliable?

! Answer♦ It depends on the structural/functional aspect one wants to investigate♦ Different for each task

! Prediction♦ When alignment alone is not enough to lead a reliable inference

(C) 2002 SNU CSE Biointelligence Lab (BI) 10

Local Alignment vs. Global Alignment

! Local Alignment♦ Attempts to align regions of sequences with the highest density

of matches. In doing so, one of more islands of subalignments are created in the aligned sequences.

! Global Alignment♦ Attempts to match as many characters as possible, from end to

end, in a set of two or more sequences.

(C) 2002 SNU CSE Biointelligence Lab (BI) 11

Local Alignment vs. Global Alignment

A C C A C A C A0 m m m m m m m m

A m 1 -1 -3 -5 -7 -9 -11 -13

C m -1 2 0 -2 -4 -6 -8 -10

A m -3 0 1 1 -1 -3 -5 -7

C m -5 -2 1 0 2 0 -2 -4

C m -7 -4 -1 0 1 1 1 -1

A m -9 -6 -3 0 -1 2 0 2

T m -11 -8 -5 -2 -1 0 1 0

A m -13 -10 -7 -4 -1 0 -1 2

A C C A C A C A0 0 0 0 0 0 0 0 0

A 0 1 0 0 1 0 1 0 1

C 0 0 2 1 0 2 0 2 0

A 0 1 0 1 2 0 3 1 3

C 0 0 2 1 0 3 1 4 2

C 0 0 0 3 2 1 2 2 3

A 0 1 0 1 4 2 2 1 3

T 0 0 0 0 2 3 1 1 1

A 0 1 0 0 1 1 4 2 2

(C) 2002 SNU CSE Biointelligence Lab (BI) 12

BLAST search result

Query: Soy bean (plant)leghemoglobin

Database: homo sapiens

Alignment result shows two merely matched sequences, but their functions and structures are surprisingly coincided.

Page 4: Sequence Alignment and Database Searching · 2015-11-24 · C 002 1 031 42 C 00 03 212 23 A 01 014 22 13 T 00 00 231 11 A 01 00 114 22 (C) 2002 SNU CSE Biointelligence Lab (BI) 12

(C) 2002 SNU CSE Biointelligence Lab (BI) 13

Optimal Global Sequence Alignment

(C) 2002 SNU CSE Biointelligence Lab (BI) 14

Prediction of Functional Features

! Two protein sequences share sequence similarity♦ These proteins share common function?

! New sequence identity threshold is required for prediction of function♦ Threshold used for structural problems can not be used.

! Solution♦ Split each sequence into a number of subsequences

♦ The fraction of aligned site per alignment

(C) 2002 SNU CSE Biointelligence Lab (BI) 15

Modular Nature of Proteins

Figure 8.3. Modular structure of two proteins involved in blood clotting. Schematic representation of the modular structure of human tissue plasminogen activator and coagulation factor XII. A module labeled C is shared by several proteins involved in blood clotting. F1 and F2 are frequently repeated units that were first seen in fibronectin. E is a module resembling epidermal growth factor. A module known as a “kringle domain” is denoted K.

Catalytic

Catalytic

KEF1EF2F12

PLAT KEF1 K

(C) 2002 SNU CSE Biointelligence Lab (BI) 16

Measuring Alignment Quality

Good alignments should have …

! “many” exact matches

! “few” mismatches! “many” of the mismatches should be similar residues

! “few” gaps

Page 5: Sequence Alignment and Database Searching · 2015-11-24 · C 002 1 031 42 C 00 03 212 23 A 01 014 22 13 T 00 00 231 11 A 01 00 114 22 (C) 2002 SNU CSE Biointelligence Lab (BI) 12

(C) 2002 SNU CSE Biointelligence Lab (BI) 17

Measuring Alignment Quality

QTRPQNVLNPP

|||

STRQNVINPWAAQ

Begin with...

Longest Exact Match

S = 3aS=alignment scorea=match score

(C) 2002 SNU CSE Biointelligence Lab (BI) 18

Measuring Alignment Quality

QTRPQNVLNPP

||| ||

STRQNVINPWAAQ

… allow some mismatches

S = 5a - 1bS=alignment scorea=match scoreb=mismatch penalty

(C) 2002 SNU CSE Biointelligence Lab (BI) 19

Measuring Alignment Quality

QTRPQNVLNPP

|| ||| ||

STR-QNVINPWAAQ

…and finally, introduce some gaps

S = 7a - 1b -1c

S=alignment scorea=match scoreb=mismatch penaltyc=gap penalty

(C) 2002 SNU CSE Biointelligence Lab (BI) 20

Dot-matrix Representations

!Figure 8.4

Page 6: Sequence Alignment and Database Searching · 2015-11-24 · C 002 1 031 42 C 00 03 212 23 A 01 014 22 13 T 00 00 231 11 A 01 00 114 22 (C) 2002 SNU CSE Biointelligence Lab (BI) 12

(C) 2002 SNU CSE Biointelligence Lab (BI) 21

Dot-matrix Representations

! Figure 8.5

(C) 2002 SNU CSE Biointelligence Lab (BI) 22

Local Alignments

! Many problems in computer science can be reduce to the task of finding the optimal path through a graph

! Some positive incremental scores will be used for aligning identical residues, with negative scores used for substitutions and gaps.

! Finding optimal local alignment♦ Dynamic programming.

!Needleman-Wunsch algorithm (Needleman and Wunsch, 1970)

!Smith-Waterman algorithm (Smith and Waterman, 1981)

!Viterbi decoding algorithm

(C) 2002 SNU CSE Biointelligence Lab (BI) 23

Dynamic Programming

! General optimization technique

! Application environment♦ Problem can be recursively subdivided into two similar

subproblems of smaller size

♦ Solution can be obtained by piecing together the solutions to the two subproblems

♦ Example: finding shortest path!Problem: [A " C " B] # [A " C] and [C " B]

!Solution: [A " C] + [C " B] # [A " B]

(C) 2002 SNU CSE Biointelligence Lab (BI) 24

Smith-Waterman Algorithm (1)

! Finding local alignments

! Using dynamic programming

TCAT*G*CATTG

Page 7: Sequence Alignment and Database Searching · 2015-11-24 · C 002 1 031 42 C 00 03 212 23 A 01 014 22 13 T 00 00 231 11 A 01 00 114 22 (C) 2002 SNU CSE Biointelligence Lab (BI) 12

(C) 2002 SNU CSE Biointelligence Lab (BI) 25

Smith-Waterman Algorithm (2)

! Best results and slow performance

! Can grasp results missed by BLAST or FASTA.! Available on web:

spiral.genes.nig.ac.jp/homolgy/ssearch-e.shtml! Ref : Smith, T. F. and Watermann, M. S. (1981)

Identification of common molecular subsequence.Journal of Molecular Biology, 147: 196-197.

(C) 2002 SNU CSE Biointelligence Lab (BI) 26

(C) 2002 SNU CSE Biointelligence Lab (BI) 27

Scoring Matrix

! A unitary matrix is used for base pairs♦ Each position can be given a score of +1 if it matches and a

score of zero if it does not.

! Substitution matrices are used for amino acid alignments. ♦ Certain amino acids can substitute easily for one another in

related proteins because of their similar physicochemical properties.

(C) 2002 SNU CSE Biointelligence Lab (BI) 28

Substitution Scores

! Point accepted mutation (PAM) model of evolution♦ A unit of evolutionary divergence in which 1% of the amino

acids have been changed

♦ Log-odds approach!The substitution scores in the matrix are proportional to the natural

log of the ratio of target frequencies to background frequencies

! BLOSUM substitution matrices ♦ Use of a different strategy for estimating the target frequencies

♦ The underlying data are derived from the BLOCKS database, which contains local alignments (“blocks”)

Page 8: Sequence Alignment and Database Searching · 2015-11-24 · C 002 1 031 42 C 00 03 212 23 A 01 014 22 13 T 00 00 231 11 A 01 00 114 22 (C) 2002 SNU CSE Biointelligence Lab (BI) 12

(C) 2002 SNU CSE Biointelligence Lab (BI) 29

Gap Penalties

! Affine gap penalties

♦ For a gap with length n:!Penalty = G + Ln

!G: gap-opening penalty (around 10-15)

!L: gap-extension penalty (around 1 or 2)

(C) 2002 SNU CSE Biointelligence Lab (BI) 30

PAM Scoring Matrices

! Point accepted mutation (PAM) model of evolution♦ A unit of evolutionary divergence in which 1% of the amino

acids have been changed

♦ Log-odds approach!The substitution scores in the matrix are proportional to the natural

log of the ratio of target frequencies to background frequencies

! Parameters were estimated from a small sample of closely related proteins.

(C) 2002 SNU CSE Biointelligence Lab (BI) 31

The PAM250 Scoring Matrix

A 2R -2 6N 0 0 2D 0 -1 2 4C -2 -4 -4 -5 12Q 0 1 1 2 -5 4E 0 -1 1 3 -5 2 4G 1 -3 0 1 -3 -1 0 5H -1 2 2 1 -3 3 1 -2 6I -1 -2 -2 -2 -2 -2 -2 -3 -2 5L -2 -3 -3 -4 -6 -2 -3 -4 -2 2 6K -1 3 1 0 -5 1 0 -2 0 -2 -3 5M -1 0 -2 -3 -5 -1 -2 -3 -2 2 4 0 6F -3 -4 -3 -6 -4 -5 -5 -5 -2 1 2 -5 0 9P 1 0 0 -1 -3 0 -1 0 0 -2 -3 -1 -2 -5 6S 1 0 1 0 0 -1 0 1 -1 -1 -3 0 -2 -3 1 2T 1 -1 0 0 -2 -1 0 0 -1 0 -2 0 -1 -3 0 1 3W -6 2 -4 -7 -8 -5 -7 -7 -3 -5 -2 -3 -4 0 -6 -2 -5 17Y -3 -4 -2 -4 0 -4 -4 -5 0 -1 -1 -4 -2 7 -5 -3 -3 0 10V 0 -2 -2 -2 -2 -2 -2 -1 -2 4 2 -2 -2 -1 -1 -1 0 -6 -2 4

A R N D C Q E G H I L K M F P S T W Y V

(C) 2002 SNU CSE Biointelligence Lab (BI) 32

BLOSUM Scoring Matrices

! Construct a database of “blocks”: ungapped, aligned conserved regions of proteins)

! Cluster sequences within a block that are more similar than a chosen threshold (e.g., 62% for BLOSUM62)

! Represent each cluster of sequences using their “average”sequence

! After the averaging procedure, each modified block consists of average sequence(s) and sequences that did not cluster with the others

! Examine all pairs of sequences in the modified blocks, tabulate pij the probability of observing aa i in one sequence and aa j in a second sequence.

Page 9: Sequence Alignment and Database Searching · 2015-11-24 · C 002 1 031 42 C 00 03 212 23 A 01 014 22 13 T 00 00 231 11 A 01 00 114 22 (C) 2002 SNU CSE Biointelligence Lab (BI) 12

(C) 2002 SNU CSE Biointelligence Lab (BI) 33

The BLOSUM62 Scoring Matrix

A 4R -1 5N -2 0 6D -2 -2 1 6C 0 -3 -3 -3 9Q -1 1 0 0 -3 5E -1 0 0 2 -4 2 5G 0 -2 0 -1 -3 -2 -2 6H -2 0 1 -1 -3 0 0 -2 8I -1 -3 -3 -3 -1 -3 -3 -4 -3 4L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 1K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5F -2 -3 -3 -2 -3 -3 -3 -1 -3 0 0 -3 0 6P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 1 1 5W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4

A R N D C Q E G H I L K M F P S T W Y V

(C) 2002 SNU CSE Biointelligence Lab (BI) 34

Statistical Significance: E-values

! Let Pi be the frequency of aa i. For unrelated sequences, an alignment of i with j has probability Pij.

! Given Pij and sij, we can calculate normalized scores (“bit scores”) from the raw score, S:

! When 2 random sequences of length m and n are aligned, the expected number of HSPs with normalized scores greater than S� is approximately

′ = −S

S Kλ ln

ln 2K is a function of the database size,λ is a function of the scoring matrix

E N N nmS= =′2, where

(C) 2002 SNU CSE Biointelligence Lab (BI) 35

BLAST & FASTA

! Using heuristic algorithms

! Word based match

! Faster than Smith &Waterman

! BLAST: www.ncbi.nlm.nih.gov/BLASTRef: Altschul, S. F., et al. (1990) Basic local alignment

search tool. Journal of Molecular Biology, 215, 403-410.

! FASTA: www.ebi.ac.uk/fasta3Ref: Wilbur, W. J. and Lipman, D. J. (1983) Rapid

similarity searches of nucleic acid and protein databanks. Proceedings of the National Academy of Sciences USA 80, 726-30.

(C) 2002 SNU CSE Biointelligence Lab (BI) 36

! Figure 8.9

Page 10: Sequence Alignment and Database Searching · 2015-11-24 · C 002 1 031 42 C 00 03 212 23 A 01 014 22 13 T 00 00 231 11 A 01 00 114 22 (C) 2002 SNU CSE Biointelligence Lab (BI) 12

(C) 2002 SNU CSE Biointelligence Lab (BI) 37

FASTA

! Score using a substitution matrix♦ Applies a substitution matrix to the 10 best

regions

♦ The matrix encapsulates the biological significance of word matches

♦ Single best subalignment - init1

♦ db strings with init1 < threshold are filtered

(C) 2002 SNU CSE Biointelligence Lab (BI) 38

BLAST Result

Program: BLASTP

Query: human MYB

binding protein

Database: Swissprot

(C) 2002 SNU CSE Biointelligence Lab (BI) 39 (C) 2002 SNU CSE Biointelligence Lab (BI) 40

The BLAST Search

For the query, find the list of high scoring words of length W.

Compares the word list to the database and identifies exact matches.

For each word match, extends the alignment in both directions to find alignments that score greater than a threshold of value S.

Page 11: Sequence Alignment and Database Searching · 2015-11-24 · C 002 1 031 42 C 00 03 212 23 A 01 014 22 13 T 00 00 231 11 A 01 00 114 22 (C) 2002 SNU CSE Biointelligence Lab (BI) 12

(C) 2002 SNU CSE Biointelligence Lab (BI) 41

BLAST Programs

May be useful for EST analysis, but computationally intensive

Nucleotide

(translated)

Nucleotide

(translated)

TBLASTX

Useful for finding unannotated coding regions in database sequences

Nucleotide

(translated)

ProteinTBLASTN

Useful for analysis of new DNA sequences andESTs

ProteinNucleotide

(translated)

BLASTX

Tuned for very high-scoring matches, not distant relationships

NucleotideNucleotideBLASTN

Uses substitution matrix for finding distant relationship relationships; SEG filtering available

ProteinProteinBLASTP

CommentsDatabaseQueryProgram

(C) 2002 SNU CSE Biointelligence Lab (BI) 42

Protein Sequence Databases for use with BLAST

Databas Description

nr

month

swissprot

pdb

ecoliyeastdrosoph

Non-redundant merge of SWISS-PROT, PIR, PRF, and proteins derived form GenBank coding sequences and PDB atomic coordinates

Subset of nr witch is new or modified within the last 30 days

The SWISS-PROT database

Amino acid sequences parsed from atomic coordinates of three-dimensional structuresComplete set of proteins encoded by the E. coli genomeComplete set of proteins encoded by the S. cerevisiae genomeComplete set of proteins encoded by the E. melanogaster genome

(C) 2002 SNU CSE Biointelligence Lab (BI) 43

Nucleotide Sequence Databases for use with BLAST

Databas Description

nrmonth

htgs

ecoliyeastdrosoph

Non-redundant Genbank, excluding the EST, STS, and GSS divisions

Subset of nr, witch is new or modified within the last 30 days

Genbank HTG division (high-throughput genomic sequences)

Complete genome sequence of E. coliComplete genome sequence of S. cerevisiaeComplete genome sequence of E. melanogaster

est Genbank EST division (expressed sequence tags)sts Genbank STS division (sequence tagged sites)

gss Genbank GSS division (genome survey sequences)

mito Complete genome sequence of vertebrate mitochondria

alu Collection of primate Alu repeat sequencesvector Collection of popular cloning vectors

(C) 2002 SNU CSE Biointelligence Lab (BI) 44

TBLASTN Search

! The protein product of the Alzheimer’s disease susceptibility gene ( GenBank L43964) was used as the query in a TBLASTN search against the est database. The goal was to identify cDNA clones from other organisms that may represent homologs of the human gene.

Page 12: Sequence Alignment and Database Searching · 2015-11-24 · C 002 1 031 42 C 00 03 212 23 A 01 014 22 13 T 00 00 231 11 A 01 00 114 22 (C) 2002 SNU CSE Biointelligence Lab (BI) 12

(C) 2002 SNU CSE Biointelligence Lab (BI) 45 (C) 2002 SNU CSE Biointelligence Lab (BI) 46

! Portion of the hit list showing the 25 best hits. Each sequence is identified by GenBank accession number and a portion of the definition line. The reading frame and score of the best HSP areshown, together with the sum probability of a chance occurrence.The value in the last column gives the number of HSPs that were used in the sum probability calculation At least 10 sequence from mouse and one from Drosophila may be seen on the hit list.

! Match to the conceptual translation of the Drosophila EST sequence (GenBank AA390557). Two HSPs were found, each in a different reading frame. Identical residues are echoed to the central line, and plus (+) symbols indicate pairs of nonidentical amino acids with positive substitution scores.

(C) 2002 SNU CSE Biointelligence Lab (BI) 47

! Figure 8.11

(C) 2002 SNU CSE Biointelligence Lab (BI) 48

! Figure 8.12

Page 13: Sequence Alignment and Database Searching · 2015-11-24 · C 002 1 031 42 C 00 03 212 23 A 01 014 22 13 T 00 00 231 11 A 01 00 114 22 (C) 2002 SNU CSE Biointelligence Lab (BI) 12

(C) 2002 SNU CSE Biointelligence Lab (BI) 49

Position Specific Scoring Matrices

! Position-Specific Iterated BLAST (PSI-BLAST)

�������

����

��� �������

�������������� �����

����������������������

�����

������������������

��������������������������������������������������

(C) 2002 SNU CSE Biointelligence Lab (BI) 50

! Figure 8.13!Figure 8.13

(C) 2002 SNU CSE Biointelligence Lab (BI) 51

Spliced Alignments Problem

! Given strings G (genomic sequence) and T (target sequence), and a set B of substrings of G, find a set of non-overlapping strings from B whose concatenation C fits T the best; i.e., the edit distance between C and T is minimal among all sets of blocks from B.

ATCAGTGCAATGCAGCCATGAComplement of mRNA T = t1t2 ... tm

ATCAGTGCAATGT..........AGGCAGCCATGA

Exon Intron ExonG

B = { b1, b2, ... bp }

(C) 2002 SNU CSE Biointelligence Lab (BI) 52

! Figure 8.14

Figure 8.14. Spliced alignment. The sim4 program was used to align a novel human mRNA (RefSeqNM_015372) to the genomic sequence of a cosmid from chromosome 22 (EMBL Z82248). Three exons were identified on the complementary (the third one has been truncated for brevity). The “>>>” symbols indicate splice sites found at the exon/intron boundaries.

Page 14: Sequence Alignment and Database Searching · 2015-11-24 · C 002 1 031 42 C 00 03 212 23 A 01 014 22 13 T 00 00 231 11 A 01 00 114 22 (C) 2002 SNU CSE Biointelligence Lab (BI) 12

(C) 2002 SNU CSE Biointelligence Lab (BI) 53

Systems biologyNeuroimmunology

BioinformaticsData miningCombinatorial chemistrySynthetic biology

Biochip & microfluidics (LOC)Biocomputation

BIOLOGY IN THE FUTURE