http://datamining.xmu.edu.cn
Approximate Search
邹权 (Zou Quan)
Ph.D., Assistant Professor
Outline
Global alignment
Local alignment
BLAST
Why compare sequences?
Sequence comparison:
the operation of finding which parts of the sequences are alike and which parts differ
Algorithms for solving it efficiently
TT....TGTGTGCATTTAAGGGTGATAGTGTATTTGCTCTTTAAGAGCTG
||     ||   ||  |  | ||| |  |||| |||||    |||  |||
TTGACAGGTACCCAACTGTGTGTGCTGATGTA.TTGCTGGCCAAGGACTG

AGTGTTTGAGCCTCTGTTTGTGTGTAATTGAGTGTGCATGTGTGGGAGTG
|  | |              |  |||||| |   ||||  | || |   |
AAGGATC.............TCAGTAATTAATCATGCACCTATGTGGCGG

AAATTGTGGAATGTGTATGCTCATAGCACTGAGTGAAAATAAAAGATTGT
||| | ||| || || |||   |   |||||||||  ||   |||||| |
AAA.TATGGGATATGCATGTCGA...CACTGAGTG..AAGGCAAGATTAT
Two notions
• Similarity: a measure of how similar
two sequences are
• Alignment: a basic operation to
compare two sequences, a way of
placing one sequence above the
other in order to make clear the
correspondence between similar
characters or substrings from the
sequences.
comparing two sequences
alignments involving:
global comparisons: entire
sequences
local comparisons: just
substrings of sequences
dynamic programming (DP)
Global comparison: example
Example of aligning
GACGGATTAG
GATCGGAATAG

GA-CGGATTAG
GATCGGAATAG
Differences: an extra T in the second sequence; a change from A to T. A space is written as a dash.
Global comparison: the basic algorithm
Definitions
Alignment:
• insert spaces so that both sequences have the same size
• place one sequence over the other to create a correspondence between characters
• a column with a space in both sequences is not allowed
• spaces may also be inserted at the beginning or end
Scoring function: a measure of similarity between elements
• a match (identical characters): +1
• a mismatch (distinct characters): -1
• a space: -2
• Scoring system: rewards matches and penalizes mismatches and spaces
Global comparison: the basic algorithm
GA-CGGATTAG
GATCGGAATAG
Example: total score is 6 (9 matches, 1 mismatch, 1 space: 9 - 1 - 2 = 6)
Similarity sim(s, t)
• the maximum alignment score; many alignments may attain it
Best alignment
• an alignment whose score equals the similarity
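As a small illustration of the scoring system above, the following sketch scores a given alignment column by column. The function name `alignment_score` and its parameter names are chosen here for illustration; the default values are the +1 / -1 / -2 scores from the slides.

```python
def alignment_score(top, bottom, match=1, mismatch=-1, space=-2):
    """Score two equal-length rows of an alignment, column by column."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for x, y in zip(top, bottom):
        if x == "-" and y == "-":
            # The definition of an alignment forbids a space over a space.
            raise ValueError("a column of two spaces is not allowed")
        if x == "-" or y == "-":
            total += space
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


print(alignment_score("GA-CGGATTAG", "GATCGGAATAG"))  # 6
```

The printed value agrees with the total score of 6 given for the example alignment.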
Basic DP algorithm for comparison of two sequences
The number of alignments between two sequences is exponential
Efficient algorithm
• DP: solve for prefixes, from shorter to longer
• Idea: an (m+1)×(n+1) array a; entry (i, j) is the similarity between s[1..i] and t[1..j]
• p(i, j) = +1 if s[i] = t[j], and -1 if s[i] ≠ t[j] (written in the upper left corner of each cell)
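$$
\mathrm{sim}(s[1..i],\, t[1..j]) = \max
\begin{cases}
\mathrm{sim}(s[1..i],\, t[1..j-1]) - 2 \\
\mathrm{sim}(s[1..i-1],\, t[1..j-1]) + p(i, j) \\
\mathrm{sim}(s[1..i-1],\, t[1..j]) - 2
\end{cases}
$$

In terms of the array a:

$$
a[i, j] = \max
\begin{cases}
a[i, j-1] - 2 \\
a[i-1, j-1] + p(i, j) \\
a[i-1, j] - 2
\end{cases}
$$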
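The recurrence above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `similarity_array` is an assumption, and the first row and column are filled with multiples of the space penalty, which is the usual initialization for global alignment but is not spelled out on the slide.

```python
def similarity_array(s, t, match=1, mismatch=-1, gap=-2):
    """Fill the (m+1) x (n+1) array a, where a[i][j] = sim(s[1..i], t[1..j])."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]
    # Assumed initialization: a prefix aligned against nothing costs one space per character.
    for i in range(1, m + 1):
        a[i][0] = i * gap
    for j in range(1, n + 1):
        a[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(a[i][j - 1] + gap,    # space in s
                          a[i - 1][j - 1] + p,  # s[i] over t[j]
                          a[i - 1][j] + gap)    # space in t
    return a


a = similarity_array("GACGGATTAG", "GATCGGAATAG")
print(a[-1][-1])  # 6
```

The bottom-right entry is sim(s, t); for the earlier example it equals the score 6 of the alignment shown.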
DP array for s = AAAC (rows) and t = AGC (columns); match +1, mismatch -1, space -2

          A    G    C
     0   -2   -4   -6
A   -2    1   -1   -3
A   -4   -1    0   -2
A   -6   -3   -2   -1
C   -8   -5   -4   -1
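Once the array is filled, an optimal alignment can be read off by walking back from the bottom-right entry. The sketch below is one way to do this, assuming the same +1 / -1 / -2 scores; the function name `global_alignment` and the tie-breaking order (diagonal first, then up, then left) are choices made here, so other optimal alignments may exist.

```python
def global_alignment(s, t, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned s, aligned t) for one optimal global alignment."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        a[i][0] = i * gap
    for j in range(1, n + 1):
        a[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(a[i][j - 1] + gap, a[i - 1][j - 1] + p, a[i - 1][j] + gap)

    # Traceback: at each cell, follow a predecessor that explains its value.
    top, bottom = [], []
    i, j = m, n
    while i > 0 or j > 0:
        p = match if i > 0 and j > 0 and s[i - 1] == t[j - 1] else mismatch
        if i > 0 and j > 0 and a[i][j] == a[i - 1][j - 1] + p:
            top.append(s[i - 1]); bottom.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and a[i][j] == a[i - 1][j] + gap:
            top.append(s[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(t[j - 1]); j -= 1
    return a[m][n], "".join(reversed(top)), "".join(reversed(bottom))


print(global_alignment("AAAC", "AGC"))  # (-1, 'AAAC', '-AGC')
```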
local comparison
Problem:
local alignment between s and t: an
alignment between a substring of s and a
substring of t
Algorithm: to find the highest scoring local
alignment between two sequences
Local comparison
Idea:
Data structure:
• an (m+1)×(n+1) array
• each entry (i, j) holds the highest score of an alignment between a suffix of s[1..i] and a suffix of t[1..j]
Initialization:
• the first row and first column are initialized with zeros
• reason: for any entry (i, j), there is always the alignment between the empty suffixes of s[1..i] and t[1..j], which has score zero
$$
a[i, j] = \max
\begin{cases}
a[i, j-1] + g \\
a[i-1, j-1] + p(i, j) \\
a[i-1, j] + g \\
0
\end{cases}
$$

where g is the score of a space.
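A minimal sketch of the local recurrence follows. It differs from the global fill only in the zero initialization, the extra 0 inside the max, and in taking the best entry anywhere in the array. The function name `local_score` is chosen for illustration, and g = -2 is assumed from the earlier scoring system.

```python
def local_score(s, t, match=1, mismatch=-1, gap=-2):
    """Highest score of an alignment between a substring of s and a substring of t."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]  # first row and column stay zero
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(a[i][j - 1] + gap,
                          a[i - 1][j - 1] + p,
                          a[i - 1][j] + gap,
                          0)  # the empty suffixes always score zero
            best = max(best, a[i][j])
    return best


print(local_score("GGCAGTTCC", "AACAGTTAA"))  # 5, from the shared segment CAGTT
```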
Global alignment
Local vs. Global Alignment (cont’d)
Global alignment

--T--CC-C-AGT--TATGT-CAGGGGACACG-A-GCATGCAGA-GAC
  |  || |  ||  | | | |||    || |      | ||||   |
AATTGCCGCC-GTCGT-T-TTCAG----CA-GTTATG-T-CAGAT--C

Local alignment: better at finding a conserved segment

                  tccCAGTTATGTCAGgggacacgagcatgcagagac
                     ||||||||||||
aattgccgccgtcgttttcagCAGTTATGTCAGatc
Local Alignment: Example
Global alignment
Local alignment
Compute a “mini” Global Alignment to get Local
Semiglobal comparison: summary
Forgiving initial spaces: initialize certain positions with zero
Forgiving final spaces: look for the maximum along certain positions

Place where spaces are not charged for | Action
Beginning of first sequence            | Initialize first row with zeros
End of first sequence                  | Look for maximum in last row
Beginning of second sequence           | Initialize first column with zeros
End of second sequence                 | Look for maximum in last column
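The four rows of the summary map directly onto four switches in the DP. The sketch below is illustrative only: the function name `semiglobal_score` and the four boolean flag names are assumptions made here, with s as the first sequence (rows) and t as the second (columns), and the +1 / -1 / -2 scores from earlier.

```python
def semiglobal_score(s, t, free_start_s=False, free_end_s=False,
                     free_start_t=False, free_end_t=False,
                     match=1, mismatch=-1, gap=-2):
    """Global DP in which spaces at selected sequence ends are not charged."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(1, n + 1):
        # Beginning of first sequence: first row of zeros.
        a[0][j] = 0 if free_start_s else j * gap
    for i in range(1, m + 1):
        # Beginning of second sequence: first column of zeros.
        a[i][0] = 0 if free_start_t else i * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(a[i][j - 1] + gap, a[i - 1][j - 1] + p, a[i - 1][j] + gap)
    candidates = [a[m][n]]
    if free_end_s:
        candidates.extend(a[m])  # End of first sequence: maximum in last row.
    if free_end_t:
        candidates.extend(a[i][n] for i in range(m + 1))  # End of second: last column.
    return max(candidates)


# A short s placed inside a longer t, with end spaces in s not charged:
print(semiglobal_score("CAGT", "AACAGTAA", free_start_s=True, free_end_s=True))  # 4
print(semiglobal_score("CAGT", "AACAGTAA"))  # -4, the plain global similarity
```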
Saving space
Computing sim(s, t)
Algorithm BestScore
  input: sequences s and t
  output: vector a
  m ← |s|
  n ← |t|
  for j ← 0 to n do
    a[j] ← j × g
  for i ← 1 to m do
    old ← a[0]
    a[0] ← i × g
    for j ← 1 to n do
      temp ← a[j]
      a[j] ← max(a[j] + g, old + p(i, j), a[j-1] + g)
      old ← temp
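For readers who want to run BestScore, here is a direct Python transcription. The function name `best_score` and the default scores (+1, -1, g = -2) are assumptions carried over from the earlier slides; the logic follows the pseudocode line by line.

```python
def best_score(s, t, match=1, mismatch=-1, g=-2):
    """Return the last row of the similarity array using O(|t|) space.

    On return, a[j] = sim(s, t[1..j]); in particular a[-1] = sim(s, t).
    """
    m, n = len(s), len(t)
    a = [j * g for j in range(n + 1)]
    for i in range(1, m + 1):
        old = a[0]      # value of a[i-1][0], the diagonal neighbor for j = 1
        a[0] = i * g
        for j in range(1, n + 1):
            temp = a[j]  # a[i-1][j], needed as the next diagonal value
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[j] = max(a[j] + g, old + p, a[j - 1] + g)
            old = temp
    return a


print(best_score("AAAC", "AGC"))  # [-8, -5, -4, -1]
```

Only one row is kept, so the score is available but the alignment itself cannot be traced back from this vector alone.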
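An optimal alignment in linear space
Idea: divide and conquer strategy
Fix a position i in s and consider what matches s[i] in the alignment. There are two possibilities:
1. The symbol t[j] will match s[i], for some j in 1..n:

$$
\mathrm{optimal}\!\left(\begin{array}{c} s[1..i-1] \\ t[1..j-1] \end{array}\right)
+ \begin{array}{c} s[i] \\ t[j] \end{array}
+ \mathrm{optimal}\!\left(\begin{array}{c} s[i+1..m] \\ t[j+1..n] \end{array}\right)
\tag{3.6}
$$

2. A space between t[j] and t[j+1] will match s[i], for some j in 0..n:

$$
\mathrm{optimal}\!\left(\begin{array}{c} s[1..i-1] \\ t[1..j] \end{array}\right)
+ \begin{array}{c} s[i] \\ - \end{array}
+ \mathrm{optimal}\!\left(\begin{array}{c} s[i+1..m] \\ t[j+1..n] \end{array}\right)
\tag{3.7}
$$

Recursive method
1. For a fixed i, find the best j over both cases and recurse on the two remaining subproblems.
2. To decide which value of i to use in each recursive call: pick i as close as possible to the middle of the sequence.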
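The divide and conquer idea can be sketched as follows. Note that this is the common Hirschberg-style variant, which splits s into two halves and picks the best split point j of t, rather than fixing a single character s[i] and distinguishing cases (3.6) and (3.7) explicitly; the function names `last_row` and `align` and the default scores are assumptions made for this illustration.

```python
def last_row(s, t, match=1, mismatch=-1, gap=-2):
    """Last row of the global DP array for s versus t, in O(len(t)) space."""
    a = [j * gap for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        old = a[0]
        a[0] = i * gap
        for j in range(1, len(t) + 1):
            temp = a[j]
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[j] = max(a[j] + gap, old + p, a[j - 1] + gap)
            old = temp
    return a


def align(s, t, match=1, mismatch=-1, gap=-2):
    """Return one optimal global alignment (aligned s, aligned t)."""
    if not s:
        return "-" * len(t), t
    if not t:
        return s, "-" * len(s)
    if len(s) == 1:
        # Base case: put the single character over its best partner in t,
        # or over a space if even the best pairing is worse than two spaces.
        scores = [match if s == c else mismatch for c in t]
        j = max(range(len(t)), key=scores.__getitem__)
        if scores[j] >= 2 * gap:
            return "-" * j + s + "-" * (len(t) - j - 1), t
        return s + "-" * len(t), "-" + t
    mid = len(s) // 2  # split s as close to the middle as possible
    left = last_row(s[:mid], t, match, mismatch, gap)
    right = last_row(s[mid:][::-1], t[::-1], match, mismatch, gap)
    n = len(t)
    # Choose the split of t that maximizes (score of left part) + (score of right part).
    j = max(range(n + 1), key=lambda k: left[k] + right[n - k])
    top1, bottom1 = align(s[:mid], t[:j], match, mismatch, gap)
    top2, bottom2 = align(s[mid:], t[j:], match, mismatch, gap)
    return top1 + top2, bottom1 + bottom2


top, bottom = align("GACGGATTAG", "GATCGGAATAG")
print(top)
print(bottom)
```

Each level of recursion keeps only two score vectors of length |t| + 1, so space stays linear while the result is still an optimal alignment.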
saving space
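BLAST/Lucene steps
Build an inverted index for the database
Query the inverted index
Extend and verify
Issue: choosing the value of K
Variable-length k-mers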
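The three steps above can be illustrated with a toy seed-and-extend search. This is a deliberately simplified sketch and not how BLAST itself is implemented: it uses exact k-mer seeds and extends only while characters match exactly, whereas real BLAST scores word hits and extensions. The names `build_index` and `search` and the choice k = 4 are assumptions made for the example.

```python
from collections import defaultdict


def build_index(database, k):
    """Step 1: inverted index mapping each k-mer to its (sequence id, position) list."""
    index = defaultdict(list)
    for seq_id, seq in enumerate(database):
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((seq_id, pos))
    return index


def search(index, database, query, k):
    """Steps 2 and 3: look up each query k-mer, then extend every seed."""
    hits = set()
    for qpos in range(len(query) - k + 1):
        for seq_id, dpos in index.get(query[qpos:qpos + k], []):
            seq = database[seq_id]
            # Extend to the left while the characters agree.
            left = 0
            while (qpos - left > 0 and dpos - left > 0
                   and query[qpos - left - 1] == seq[dpos - left - 1]):
                left += 1
            # Extend to the right while the characters agree.
            right = k
            while (qpos + right < len(query) and dpos + right < len(seq)
                   and query[qpos + right] == seq[dpos + right]):
                right += 1
            hits.add((seq_id, qpos - left, dpos - left, left + right))
    return hits


database = ["AATTGCCGCCGTCGTTTTCAGCAGTTATGTCAGATC"]
query = "TCCCAGTTATGTCAGGGGACACGAGCATGCAGAGAC"
hits = search(build_index(database, 4), database, query, 4)
# Each hit is (sequence id, query start, database start, length).
print(max(hits, key=lambda h: h[3]))
```

The choice of k illustrates the issue noted above: a small k produces many spurious seeds to extend, while a large k can miss shorter conserved segments.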
Homework
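Build a keyword tree for {apple, please, eat, apply} and draw all failure links.
Align the two strings aaac and agc, assuming a match scores 2, a mismatch scores -1, and a space scores -2. Draw the dynamic programming table and the traceback path, and give the alignment corresponding to that path.
Briefly describe the main idea of BLAST.
Compute the sp and sp' values at each position of the string "abababc".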