Homology Search Tools
Kun-Mao Chao (趙坤茂)
Department of Computer Science and Information Engineering
National Taiwan University, Taiwan
WWW: http://www.csie.ntu.edu.tw/~kmchao
Homology Search Tools
• Smith-Waterman (Smith and Waterman, 1981; Waterman and Eggert, 1987)
• FASTA (Wilbur and Lipman, 1983; Lipman and Pearson, 1985)
• BLAST (Altschul et al., 1990; Altschul et al., 1997)
• BLAT (Kent, 2002)
• PatternHunter (Li et al., 2004)
Finding Exact Word Matches
• Hash Tables
• Suffix Trees
• Suffix Arrays
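Hash Tables
Sequence: G A T C C A T C T T (positions 1 to 10). Each 3-letter word is encoded with 2 bits per base (A = 00, C = 01, G = 10, T = 11) and stored with the positions where it starts.
Word  Code (index)   Positions
AAA   000000 (0)
…
ATC   001101 (13)    2, 6
…
CAT   010011 (19)    5
CCA   010100 (20)    4
…
CTT   011111 (31)    8
…
GAT   100011 (35)    1
…
TCC   110101 (53)    3
TCG   110110 (54)
TCT   110111 (55)    7
…
TTT   111111 (63)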
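The table above can be reproduced with a short sketch. This is only an illustration, assuming the 2-bit encoding read off the slide (A = 00, C = 01, G = 10, T = 11); the function names `encode` and `build_index` are hypothetical and not part of any tool discussed here.

```python
from collections import defaultdict

# Assumed 2-bit code per base, as implied by the slide: A=00, C=01, G=10, T=11.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode(word):
    """Pack a DNA word into an integer, 2 bits per base."""
    value = 0
    for base in word:
        value = (value << 2) | CODE[base]
    return value

def build_index(seq, k=3):
    """Map each k-mer code to the 1-based positions where the k-mer starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[encode(seq[i:i + k])].append(i + 1)
    return index

index = build_index("GATCCATCTT")
print(encode("ATC"), index[encode("ATC")])  # 13 [2, 6]
```

Looking up a word then costs one encoding plus one table access, which is why hash tables suit fixed-length word matching.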
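Suffix Trees (I)
Sequence: G A T C C A T C T T (positions 1 to 10)
[Figure: suffix tree of the sequence. Each leaf is labelled with the starting position of its suffix. No terminator is appended, so suffix 10 (T) ends at an internal node.]
root
  ATC
    CATCTT (2)
    TT (6)
  C
    ATCTT (5)
    CATCTT (4)
    TT (8)
  GATCCATCTT (1)
  T (10)
    C
      CATCTT (3)
      TT (7)
    T (9)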
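Suffix Trees (II)
Sequence: G A T C C A T C T T $ (positions 1 to 11)
[Figure: suffix tree of the sequence with the terminator $ appended, so that every suffix ends at a leaf.]
root
  $ (11)
  ATC
    CATCTT$ (2)
    TT$ (6)
  C
    ATCTT$ (5)
    CATCTT$ (4)
    TT$ (8)
  GATCCATCTT$ (1)
  T
    $ (10)
    C
      CATCTT$ (3)
      TT$ (7)
    T$ (9)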
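To make the structure concrete, here is a minimal sketch of an uncompressed suffix trie with a terminator. It is not a true suffix tree: single-child paths are not merged into one edge and construction takes quadratic time and space, whereas real suffix trees are built in linear time. The names `build_suffix_trie` and `occurrences` are hypothetical.

```python
def build_suffix_trie(seq, terminator="$"):
    """Insert every suffix of seq + terminator into a nested-dict trie."""
    text = seq + terminator
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["leaf"] = start + 1  # 1-based start position of this suffix
    return root

def occurrences(root, word):
    """Return the sorted 1-based positions where word occurs."""
    node = root
    for ch in word:
        if ch not in node:
            return []
        node = node[ch]
    found, stack = [], [node]
    while stack:  # collect every leaf below the node reached by word
        node = stack.pop()
        for key, child in node.items():
            if key == "leaf":
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)

trie = build_suffix_trie("GATCCATCTT")
print(occurrences(trie, "ATC"))  # [2, 6]
```

The query walks one path from the root, so its cost depends on the word length and the number of matches rather than on the length of the indexed sequence.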
Suffix Arrays
Sequence: G A T C C A T C T T (positions 1 to 10)
Suffix        Position
ATCCATCTT     2
ATCTT         6
CATCTT        5
CCATCTT       4
CTT           8
GATCCATCTT    1
T             10
TCCATCTT      3
TCTT          7
TT            9
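The sorted list above can be built and queried with a few lines. This is a naive sketch that sorts the suffixes directly, which is fine for short sequences but not how production tools build suffix arrays; `build_suffix_array` and `find_word` are hypothetical names.

```python
def build_suffix_array(seq):
    """Return the 1-based start positions of all suffixes in lexicographic order."""
    return sorted(range(1, len(seq) + 1), key=lambda i: seq[i - 1:])

def find_word(seq, sa, word):
    """Binary-search the suffix array, then collect every suffix starting with word."""
    lo, hi = 0, len(sa)
    while lo < hi:  # find the leftmost suffix whose prefix is >= word
        mid = (lo + hi) // 2
        start = sa[mid] - 1
        if seq[start:start + len(word)] < word:
            lo = mid + 1
        else:
            hi = mid
    hits = []
    while lo < len(sa) and seq[sa[lo] - 1:].startswith(word):
        hits.append(sa[lo])
        lo += 1
    return sorted(hits)

seq = "GATCCATCTT"
sa = build_suffix_array(seq)
print(sa)                        # [2, 6, 5, 4, 8, 1, 10, 3, 7, 9]
print(find_word(seq, sa, "TC"))  # [3, 7]
```

All occurrences of a word sit in one contiguous block of the array, which is what makes the binary search sufficient.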
FASTA
1) Find runs of identities, and identify regions with the highest density of identities.
2) Re-score using PAM matrix, and keep top scoring segments.
3) Eliminate segments that are unlikely to be part of the alignment.
4) Optimize the alignment in a band.
FASTA
Step 1: Find runs of identities, and identify regions with the highest density of identities.
[Figure: Sequence A plotted against Sequence B.]
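Step 1 can be sketched as counting word identities per diagonal. This is a simplified illustration, not the FASTA implementation: the word length `k = 2`, the function name `diagonal_hits`, and the two example sequences are assumptions chosen for the demonstration.

```python
from collections import Counter, defaultdict

def diagonal_hits(seq_a, seq_b, k=2):
    """Count k-tuple identities on each diagonal (offset = position in B - position in A)."""
    positions = defaultdict(list)
    for i in range(len(seq_a) - k + 1):
        positions[seq_a[i:i + k]].append(i)
    counts = Counter()
    for j in range(len(seq_b) - k + 1):
        for i in positions.get(seq_b[j:j + k], ()):
            counts[j - i] += 1
    return counts

counts = diagonal_hits("GATCCATCTT", "GCTATCATTCT")
print(counts.most_common(1))  # [(2, 4)]
```

The diagonal with the most hits is the candidate region of highest identity density that the later steps re-score and refine.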
FASTA
Step 2: Re-score using PAM matrix, and keep top scoring segments.
FASTA
Step 3: Eliminate segments that are unlikely to be part of the alignment.
FASTA
Step 4: Optimize the alignment in a band.
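A banded alignment restricts dynamic programming to cells near a chosen diagonal. The sketch below is a generic banded local alignment score, not FASTA's own code: the linear gap penalty of -6, the +5/-4 substitution scores, and the parameter names `center` and `width` are all assumptions made for illustration.

```python
def banded_local_score(a, b, center, width, match=5, mismatch=-4, gap=-6):
    """Best local alignment score using only cells with |(j - i) - center| <= width."""
    prev = {}  # scores of the previous row, keyed by column j (1-based)
    best = 0
    for i in range(1, len(a) + 1):
        cur = {}
        lo = max(1, i + center - width)
        hi = min(len(b), i + center + width)
        for j in range(lo, hi + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score = max(0,
                        prev.get(j - 1, 0) + s,   # diagonal move
                        prev.get(j, 0) + gap,     # gap in b
                        cur.get(j - 1, 0) + gap)  # gap in a
            cur[j] = score
            best = max(best, score)
        prev = cur
    return best

# With width 0 the band is a single diagonal, so no gaps are possible.
print(banded_local_score("GATCCATCTT", "GCTATCATTCT", center=2, width=0))  # 22
```

The work is proportional to the sequence length times the band width rather than to the product of both lengths, which is the point of confining the optimization to a band.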
BLAST
Basic Local Alignment Search Tool (by Altschul, Gish, Miller, Myers and Lipman)
The central idea of the BLAST algorithm is that a statistically significant alignment is likely to contain a high-scoring pair of aligned words.
The maximal segment pair measure
A maximal segment pair (MSP) is defined to be the highest-scoring pair of identical-length segments chosen from two sequences (for DNA: identities +5; mismatches -4).
• The MSP score can be computed in time proportional to the product of the two sequence lengths. (How?) Even so, an exact procedure is too time-consuming.
• BLAST heuristically attempts to calculate the MSP score.
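One possible answer to the "How?" above: an ungapped segment pair lies on a single diagonal, so scanning each diagonal once with a running maximum-sum pass touches every cell of the score matrix exactly once. The sketch below assumes the +5/-4 DNA scores from the slide; `msp_score` is a hypothetical name and the example sequences are chosen for illustration.

```python
def msp_score(a, b, match=5, mismatch=-4):
    """Exact MSP score: best-scoring ungapped segment pair over all diagonals."""
    best = 0
    for offset in range(-(len(a) - 1), len(b)):  # offset = j - i
        i = max(0, -offset)
        j = i + offset
        running = 0
        while i < len(a) and j < len(b):
            # Restart the segment whenever the running score drops below zero.
            running = max(0, running + (match if a[i] == b[j] else mismatch))
            best = max(best, running)
            i += 1
            j += 1
    return best

print(msp_score("GATCCATCTT", "GCTATCATTCT"))  # 22
```

For two sequences of lengths m and n this takes O(mn) time, which is exact but far too slow when one of the sequences is an entire database.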
A matrix of similarity scores
(Identities: +5; Mismatches: -4)
       G   C   T   A   T   C   A   T   T   C   T
G      5  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4
A     -4  -4  -4   5  -4  -4   5  -4  -4  -4  -4
T     -4  -4   5  -4   5  -4  -4   5   5  -4   5
C     -4   5  -4  -4  -4   5  -4  -4  -4   5  -4
C     -4   5  -4  -4  -4   5  -4  -4  -4   5  -4
A     -4  -4  -4   5  -4  -4   5  -4  -4  -4  -4
T     -4  -4   5  -4   5  -4  -4   5   5  -4   5
C     -4   5  -4  -4  -4   5  -4  -4  -4   5  -4
T     -4  -4   5  -4   5  -4  -4   5   5  -4   5
T     -4  -4   5  -4   5  -4  -4   5   5  -4   5
A maximum-scoring segment
[Figure: the similarity matrix of GATCCATCTT (rows, positions 1 to 10) against GCTATCATTCT (columns, positions 1 to 11), with one diagonal segment highlighted.]
Rows 2 to 9:      A   T   C   C   A   T   C   T
Columns 4 to 11:  A   T   C   A   T   T   C   T
Scores:           5   5   5  -4  -4   5   5   5   (total 22)
BLAST
1) Build the hash table for Sequence A.
2) Scan Sequence B for hits.
3) Extend hits.
BLAST
Step 1: Build the hash table for Sequence A. (3-tuple example)
For DNA sequences:
Seq. A = AGATCGAT (positions 1 to 8)
AAA
AAC
..
AGA  1
..
ATC  3
..
CGA  5
..
GAT  2, 6
..
TCG  4
..
TTT
For protein sequences:
Seq. A = ELVIS
Add xyz to the hash table if Score(xyz, ELV) ≥ T;
Add xyz to the hash table if Score(xyz, LVI) ≥ T;
Add xyz to the hash table if Score(xyz, VIS) ≥ T;
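The protein rule can be sketched by enumerating every 3-letter word and keeping those that score at least T against a query word. The scoring function here is a toy stand-in (+5 for an identity, -1 otherwise) and the alphabet is cut down to six letters; both are assumptions for illustration only, since BLAST scores words with a substitution matrix such as PAM or BLOSUM over the full amino-acid alphabet.

```python
from itertools import product

def toy_score(x, y):
    """Toy substitution score (assumption): +5 for an identity, -1 otherwise."""
    return 5 if x == y else -1

def neighborhood_words(query, k, threshold, score, alphabet):
    """Map each word scoring >= threshold against a query k-mer to that k-mer's positions."""
    table = {}
    for pos in range(len(query) - k + 1):
        target = query[pos:pos + k]
        for letters in product(alphabet, repeat=k):
            word = "".join(letters)
            if sum(score(x, y) for x, y in zip(word, target)) >= threshold:
                table.setdefault(word, []).append(pos + 1)  # 1-based position
    return table

table = neighborhood_words("ELVIS", k=3, threshold=9, score=toy_score, alphabet="AEILSV")
print(table["ELV"], table["ELI"])  # [1] [1]
```

With this toy score and T = 9, a word enters the table when it agrees with a query word in at least two of three positions, so near-matches as well as exact matches produce hits.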
BLAST
Step 2: Scan Sequence B for hits.
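Scanning amounts to looking up every word of Sequence B in the table built from Sequence A. In this sketch Sequence A is the AGATCGAT example from Step 1, while the Sequence B value `CGATC` and the function names are hypothetical.

```python
from collections import defaultdict

def build_table(seq_a, k=3):
    """Map each k-mer of Sequence A to its 1-based start positions."""
    table = defaultdict(list)
    for i in range(len(seq_a) - k + 1):
        table[seq_a[i:i + k]].append(i + 1)
    return table

def scan_for_hits(table, seq_b, k=3):
    """Return (position in A, position in B) for every shared k-mer."""
    hits = []
    for j in range(len(seq_b) - k + 1):
        for i in table.get(seq_b[j:j + k], ()):
            hits.append((i, j + 1))
    return hits

table = build_table("AGATCGAT")
print(scan_for_hits(table, "CGATC"))  # [(5, 1), (2, 2), (6, 2), (3, 3)]
```

Each hit is a pair of positions, and hits sharing the same difference between the two positions lie on the same diagonal.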
BLAST
Step 3: Extend hits.
[Figure: extending a hit.]
Terminate if the score of the extension fades away. (That is, when we reach a segment pair whose score falls a certain distance below the best score found for shorter extensions.)
BLAST 2.0 saves the time spent in extension, and considers gapped alignments.
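The termination rule can be sketched as an ungapped extension that stops once the running score falls more than a fixed drop-off below the best score seen. The +5/-4 scores, the drop-off values, and the names `extend` and `extend_hit` are assumptions for illustration, not BLAST's actual parameters or code.

```python
def extend(a, b, x_drop, match=5, mismatch=-4):
    """Extend left to right along a and b; return the best score gained."""
    score = best = 0
    for x, y in zip(a, b):
        score += match if x == y else mismatch
        if score > best:
            best = score
        elif best - score > x_drop:
            break  # the score has faded too far below the best seen
    return best

def extend_hit(a, b, i, j, k, x_drop=10, match=5, mismatch=-4):
    """Score of the segment pair grown from a k-letter hit at a[i:], b[j:] (0-based)."""
    seed = sum(match if x == y else mismatch for x, y in zip(a[i:i + k], b[j:j + k]))
    right = extend(a[i + k:], b[j + k:], x_drop, match, mismatch)
    left = extend(a[:i][::-1], b[:j][::-1], x_drop, match, mismatch)
    return seed + left + right

a, b = "GATCCATCTT", "GCTATCATTCT"
print(extend_hit(a, b, 1, 3, 3, x_drop=10))  # 22
print(extend_hit(a, b, 1, 3, 3, x_drop=5))   # 15
```

A generous drop-off lets the extension cross the two mismatches and reach the full segment; a tight drop-off stops early and keeps only the seed score, which shows the speed versus sensitivity trade-off.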
Gapped BLAST (I)
The two-hit method
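As a rough sketch of the two-hit idea: an extension is triggered only when a hit follows an earlier, non-overlapping hit on the same diagonal within a fixed window. The details below (how overlapping hits are skipped, the `window` parameter, the name `two_hit_seeds`) are simplifying assumptions rather than the exact rule used by gapped BLAST.

```python
def two_hit_seeds(hits, k, window):
    """Return hits that follow a non-overlapping hit on the same diagonal within window.

    hits are (i, j) word start positions in the two sequences; k is the word length.
    """
    last_on_diagonal = {}  # diagonal (j - i) -> j of the most recent kept hit
    seeds = []
    for i, j in sorted(hits, key=lambda h: h[1]):
        d = j - i
        prev = last_on_diagonal.get(d)
        if prev is not None and j - prev < k:
            continue  # overlaps the previous hit on this diagonal; ignore it
        if prev is not None and j - prev <= window:
            seeds.append((i, j))  # second hit close enough: trigger an extension
        last_on_diagonal[d] = j
    return seeds

hits = [(1, 3), (2, 4), (6, 8), (20, 5)]
print(two_hit_seeds(hits, k=3, window=40))  # [(6, 8)]
```

Requiring two hits filters out most isolated chance matches before any extension work is done, which is where the time saving comes from.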
Gapped BLAST (II)
Confining the dynamic programming
[Figure labels: HSP with score at least Sq; seed residue pair; region confined by Xq]
BLAT
[Figure labels: database; index; query]
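The figure labels suggest the direction of indexing: the database is indexed and the query is scanned against it. The sketch below assumes, following the description in Kent (2002), that the index holds non-overlapping k-mers of the database while every k-mer of the query is looked up; the function names and the tiny example are hypothetical.

```python
from collections import defaultdict

def index_database(database, k):
    """Index the non-overlapping k-mers of the database by 0-based start position."""
    index = defaultdict(list)
    for pos in range(0, len(database) - k + 1, k):  # step k: non-overlapping words
        index[database[pos:pos + k]].append(pos)
    return index

def scan_query(index, query, k):
    """Look up every (overlapping) k-mer of the query; return (query pos, database pos)."""
    return [(q, d)
            for q in range(len(query) - k + 1)
            for d in index.get(query[q:q + k], ())]

index = index_database("GATCCATCTT", k=2)
print(scan_query(index, "ATCC", k=2))  # [(1, 2), (1, 6)]
```

Indexing only non-overlapping words keeps the index roughly k times smaller, and scanning all overlapping query words still finds any shared run long enough to cover one indexed word.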
PatternHunter (I)
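PatternHunter (II)
Sequence: G A T C C A T C T T (positions 1 to 10). The entries below are consistent with the spaced seed 11*1: each word takes the bases at offsets 0, 1, and 3 from its starting position.
Word  Code (index)   Positions
AAA   000000 (0)
…
ATC   001101 (13)    2
ATG   001110 (14)
ATT   001111 (15)    6
…
CAC   010001 (17)    5
…
CCT   010111 (23)    4
…
GAC   100001 (33)    1
…
TCA   110100 (52)    3
…
TCT   110111 (55)    7
…
TTT   111111 (63)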
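A spaced-seed index differs from the plain hash table only in which positions are read for each word. This sketch assumes the seed 11*1, written here as the string "1101" with 1 for a sampled position and 0 for a skipped one; that seed is inferred from the table entries, and the function names are hypothetical.

```python
from collections import defaultdict

def spaced_word(seq, start, seed):
    """Read the bases of seq at the sampled ('1') offsets of the seed, from start."""
    return "".join(seq[start + off] for off, bit in enumerate(seed) if bit == "1")

def build_spaced_index(seq, seed="1101"):
    """Map each spaced word to the 1-based positions where the seed is placed."""
    index = defaultdict(list)
    for start in range(len(seq) - len(seed) + 1):
        index[spaced_word(seq, start, seed)].append(start + 1)
    return index

index = build_spaced_index("GATCCATCTT")
print(dict(index))
# {'GAC': [1], 'ATC': [2], 'TCA': [3], 'CCT': [4], 'CAC': [5], 'ATT': [6], 'TCT': [7]}
```

Because the skipped position is ignored, two sequences that differ only there still produce the same word and therefore a hit.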
Remarks
• Filtering is based on the observation that a good alignment usually includes short identical or very similar fragments.
• The idea of filtration was used in FASTA, BLAST, BLAT, and PatternHunter.