Approximate Search (近似搜索)
Zou Quan (邹权), Ph.D., Assistant Professor
http://datamining.xmu.edu.cn


Outline

Global alignment

Local alignment

BLAST


Why compare sequences?

Sequence comparison: the operation of finding which parts of the sequences are alike and which parts differ.

Algorithms for an efficient solution


TT....TGTGTGCATTTAAGGGTGATAGTGTATTTGCTCTTTAAGAGCTG
||     ||   ||  |  | ||| |  |||| |||||    |||  |||
TTGACAGGTACCCAACTGTGTGTGCTGATGTA.TTGCTGGCCAAGGACTG

AGTGTTTGAGCCTCTGTTTGTGTGTAATTGAGTGTGCATGTGTGGGAGTG
|  | |              |  |||||| |   ||||  | || |   |
AAGGATC.............TCAGTAATTAATCATGCACCTATGTGGCGG

AAATTGTGGAATGTGTATGCTCATAGCACTGAGTGAAAATAAAAGATTGT
||| | ||| || || |||   |   |||||||||  ||   |||||| |
AAA.TATGGGATATGCATGTCGA...CACTGAGTG..AAGGCAAGATTAT


Two notions

• Similarity: a measure of how similar two sequences are.

• Alignment: a basic operation for comparing two sequences; a way of placing one sequence above the other in order to make clear the correspondence between similar characters or substrings from the sequences.


Comparing two sequences

Alignments involving:

• global comparisons: entire sequences
• local comparisons: just substrings of sequences

Dynamic programming (DP)


Global comparison: example

Example of aligning GACGGATTAG and GATCGGAATAG:

GA-CGGATTAG
GATCGGAATAG

The second sequence has an extra T, and there is one change from A to T. A space is written as a dash.


Global comparison: the basic algorithm

Definitions

Alignment:
• insertion of spaces into the sequences so that both reach the same size
• creating a correspondence: one sequence placed over the other
• a column with a space in both sequences is not allowed
• (spaces can be inserted at the beginning or the end)

Scoring function: a measure of similarity between elements
• a match: +1 (identical characters)
• a mismatch: -1 (distinct characters)
• a space: -2
• Scoring system: rewards matches and penalizes mismatches and spaces


Global comparison: the basic algorithm

GA-CGGATTAG
GATCGGAATAG

Example: the total score is 6 (9 matches, 1 mismatch, 1 space: 9 - 1 - 2 = 6).

Similarity: sim(s, t)
• the maximum alignment score; several alignments may reach this score

Best alignment
• an alignment whose score equals the similarity
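The scoring of a given alignment can be checked mechanically. The following is a minimal sketch, assuming the two aligned rows are passed as equal-length strings with `-` standing for a space; the function name `score_alignment` and its parameter names are illustrative and do not come from the slides.

```python
def score_alignment(top, bottom, match=1, mismatch=-1, space=-2):
    """Score two aligned rows column by column (hypothetical helper)."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for x, y in zip(top, bottom):
        if x == "-" and y == "-":
            # The definitions forbid a column made of two spaces.
            raise ValueError("a column may not contain two spaces")
        if x == "-" or y == "-":
            total += space
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


# The alignment from the slide: 9 matches, 1 mismatch, 1 space.
print(score_alignment("GA-CGGATTAG", "GATCGGAATAG"))
```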


Basic DP algorithm for comparison of two sequences

The number of alignments between two sequences is exponential, so an efficient algorithm is needed.

• DP: compute the similarity of prefixes, from shorter to longer
• Idea: an (m+1)×(n+1) array a, where entry (i, j) is the similarity between s[1..i] and t[1..j]
• p(i, j) = +1 if s[i] = t[j], and -1 if s[i] ≠ t[j] (shown in the upper left corners of the table cells)

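$$
\mathrm{sim}(s[1..i], t[1..j]) = \max
\begin{cases}
\mathrm{sim}(s[1..i], t[1..j-1]) - 2 \\
\mathrm{sim}(s[1..i-1], t[1..j-1]) + p(i, j) \\
\mathrm{sim}(s[1..i-1], t[1..j]) - 2
\end{cases}
$$

$$
a[i, j] = \max
\begin{cases}
a[i, j-1] - 2 \\
a[i-1, j-1] + p(i, j) \\
a[i-1, j] - 2
\end{cases}
$$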
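The recurrence can be sketched directly as a table fill. This is a minimal sketch rather than a definitive implementation: the function name `similarity_table` is illustrative, and the boundary values a[i, 0] = i·g and a[0, j] = j·g (with g = -2) are an assumption, since the recurrence itself does not state them.

```python
def similarity_table(s, t, match=1, mismatch=-1, g=-2):
    """Fill the (m+1) x (n+1) array a; a[i][j] = sim(s[1..i], t[1..j])."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]
    # Assumed boundary: a prefix aligned against nothing costs one space per character.
    for i in range(1, m + 1):
        a[i][0] = i * g
    for j in range(1, n + 1):
        a[0][j] = j * g
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(a[i][j - 1] + g,      # space in s
                          a[i - 1][j - 1] + p,  # s[i] over t[j]
                          a[i - 1][j] + g)      # space in t
    return a


table = similarity_table("GACGGATTAG", "GATCGGAATAG")
print(table[-1][-1])  # sim(s, t) is the bottom-right entry
```

Both time and space are proportional to (m+1)(n+1) here, because the whole array is kept.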


DP table a[i, j] for s = AAAC (rows) and t = AGC (columns), with match +1, mismatch -1, space -2:

             j=0    A    G    C
  i=0          0   -2   -4   -6
  i=1  A      -2    1   -1   -3
  i=2  A      -4   -1    0   -2
  i=3  A      -6   -3   -2   -1
  i=4  C      -8   -5   -4   -1
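As a worked check of the recurrence for s = AAAC and t = AGC (match +1, mismatch -1, space -2), and assuming boundary values a[i, 0] = -2i and a[0, j] = -2j, the first interior entry and the final entry come out as follows:

$$
a[1,1] = \max\{\, a[1,0] - 2,\ a[0,0] + p(1,1),\ a[0,1] - 2 \,\} = \max\{-4,\ 0 + 1,\ -4\} = 1
$$

$$
a[4,3] = \max\{\, a[4,2] - 2,\ a[3,2] + p(4,3),\ a[3,3] - 2 \,\} = \max\{-6,\ -2 + 1,\ -3\} = -1
$$

so sim(AAAC, AGC) = a[4, 3] = -1.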


Local comparison

Problem: a local alignment between s and t is an alignment between a substring of s and a substring of t.

Algorithm: find the highest-scoring local alignment between two sequences.


Local comparison

Idea

Data structure:
• an (m+1)×(n+1) array a
• entry (i, j) holds the highest score of an alignment between a suffix of s[1..i] and a suffix of t[1..j]

Initialization:
• the first row and column are initialized with zeros, because for any entry (i, j) there is always the alignment between the empty suffixes of s[1..i] and t[1..j], which has score zero

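$$
a[i, j] = \max
\begin{cases}
a[i, j-1] + g \\
a[i-1, j-1] + p(i, j) \\
a[i-1, j] + g \\
0
\end{cases}
$$

Here g is the score of a space (g = -2 in the scoring system used above).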
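The local recurrence can be sketched as below. This is a minimal sketch under stated assumptions: the function name `local_alignment` is illustrative, and the traceback (start from the largest entry, stop at the first zero) is the usual way to recover the aligned substrings rather than something spelled out on the slide.

```python
def local_alignment(s, t, match=1, mismatch=-1, g=-2):
    """Return (best score, aligned part of s, aligned part of t)."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]  # first row and column stay zero
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(a[i][j - 1] + g, a[i - 1][j - 1] + p, a[i - 1][j] + g, 0)
            if a[i][j] > best:
                best, best_pos = a[i][j], (i, j)
    # Trace back from the best entry until a zero entry is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and a[i][j] > 0:
        p = match if s[i - 1] == t[j - 1] else mismatch
        if a[i][j] == a[i - 1][j - 1] + p:
            top.append(s[i - 1]); bottom.append(t[j - 1]); i -= 1; j -= 1
        elif a[i][j] == a[i - 1][j] + g:
            top.append(s[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(t[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = local_alignment("xxxACGTyyy", "zzACGTww")
print(score, top, bottom)
```

Taking the maximum over the whole array, instead of reading the bottom-right entry, is what lets the alignment end anywhere in either sequence.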


Global alignment


Local vs. Global Alignment (cont’d)

Global alignment:

--T--CC-C-AGT--TATGT-CAGGGGACACG--A-GCATGCAGA-GAC
  |  || |  ||  | | | |||    || |  | |  | ||||   |
AATTGCCGCC-GTCGT-T-TTCAG----CA-GTTATG--T-CAGAT--C

Local alignment (a better alignment for finding the conserved segment):

                  tccCAGTTATGTCAGgggacacgagcatgcagagac
                     ||||||||||||
aattgccgccgtcgttttcagCAGTTATGTCAGatc


Local Alignment: Example

Global alignment

Local alignment

Compute a “mini” Global Alignment to get Local


Semiglobal comparison: summary

Forgiving initial spaces: initialize certain positions with zero.

Forgiving final spaces: look for the maximum along certain positions.

Place where spaces are not charged for    Action
Beginning of first sequence               Initialize first row with zeros
End of first sequence                     Look for maximum in last row
Beginning of second sequence              Initialize first column with zeros
End of second sequence                    Look for maximum in last column
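The four rules in the summary map onto four independent switches. The sketch below is illustrative only: the function name `semiglobal_score` and the flag names are hypothetical, s is taken as the first sequence (rows) and t as the second (columns), and the scoring values are the ones used earlier in these slides.

```python
def semiglobal_score(s, t, free_start_s=False, free_end_s=False,
                     free_start_t=False, free_end_t=False,
                     match=1, mismatch=-1, g=-2):
    """Similarity where spaces at the chosen sequence ends are not charged."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(1, n + 1):
        # Beginning of first sequence free: first row of zeros.
        a[0][j] = 0 if free_start_s else j * g
    for i in range(1, m + 1):
        # Beginning of second sequence free: first column of zeros.
        a[i][0] = 0 if free_start_t else i * g
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(a[i][j - 1] + g, a[i - 1][j - 1] + p, a[i - 1][j] + g)
    candidates = [a[m][n]]
    if free_end_s:
        candidates.extend(a[m])                  # maximum in last row
    if free_end_t:
        candidates.extend(row[n] for row in a)   # maximum in last column
    return max(candidates)


# Short s placed inside a longer t without paying for the overhanging ends of t.
print(semiglobal_score("ACGT", "TTTACGTTTT", free_start_s=True, free_end_s=True))
```

With all four flags left at False the function reduces to the global similarity.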


Saving space: computing sim(s, t)

Algorithm BestScore
input: sequences s and t
output: vector a
m ← |s|
n ← |t|
for j ← 0 to n do
    a[j] ← j × g
for i ← 1 to m do
    old ← a[0]
    a[0] ← i × g
    for j ← 1 to n do
        temp ← a[j]
        a[j] ← max(a[j] + g, old + p(i, j), a[j-1] + g)
        old ← temp
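A direct Python transcription of BestScore is sketched below. The name `best_score` and the default scoring values (match +1, mismatch -1, g = -2) are assumptions carried over from the scoring system used earlier; the pseudocode itself leaves p and g abstract.

```python
def best_score(s, t, match=1, mismatch=-1, g=-2):
    """Return the last row of the similarity table using one vector of length n+1."""
    n = len(t)
    a = [j * g for j in range(n + 1)]
    for i in range(1, len(s) + 1):
        old = a[0]          # value of a[i-1][0], needed as the diagonal for j = 1
        a[0] = i * g
        for j in range(1, n + 1):
            temp = a[j]     # a[i-1][j], becomes the diagonal for the next column
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[j] = max(a[j] + g, old + p, a[j - 1] + g)
            old = temp
    return a


print(best_score("AAAC", "AGC"))  # last entry is sim(s, t)
```

Only the scores survive this way; the alignment itself cannot be traced back from a single row.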


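An optimal alignment in linear space

Idea: divide-and-conquer strategy. Fix a position i in s and consider what matches s[i] in an optimal alignment. There are two possibilities:

1. The symbol t[j] will match s[i], for some j in 1..n:

$$
\mathrm{optimal}\begin{pmatrix} s[1..i-1] \\ t[1..j-1] \end{pmatrix}
+ \begin{pmatrix} s[i] \\ t[j] \end{pmatrix}
+ \mathrm{optimal}\begin{pmatrix} s[i+1..m] \\ t[j+1..n] \end{pmatrix}
\tag{3.6}
$$

2. A space between t[j] and t[j+1] will match s[i], for some j in 0..n:

$$
\mathrm{optimal}\begin{pmatrix} s[1..i-1] \\ t[1..j] \end{pmatrix}
+ \begin{pmatrix} s[i] \\ - \end{pmatrix}
+ \mathrm{optimal}\begin{pmatrix} s[i+1..m] \\ t[j+1..n] \end{pmatrix}
\tag{3.7}
$$

Recursive method:

1. For a fixed i, find the best j among the possibilities above.
2. To decide which value of i to use in each recursive call, pick i as close as possible to the middle of the sequence.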
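The divide-and-conquer idea (commonly attributed to Hirschberg) can be sketched as follows. This is a sketch under stated assumptions, not the course's reference implementation: the names `p`, `G`, `best_score` and `linear_space_align` are illustrative, the scoring values are those used earlier, and Python string slices are used for readability even though they copy data.

```python
G = -2  # assumed space penalty


def p(x, y):
    """Assumed scoring: +1 for a match, -1 for a mismatch."""
    return 1 if x == y else -1


def best_score(s, t):
    """Last row of the similarity table, kept in a single vector."""
    a = [j * G for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        old, a[0] = a[0], i * G
        for j in range(1, len(t) + 1):
            temp = a[j]
            a[j] = max(a[j] + G, old + p(s[i - 1], t[j - 1]), a[j - 1] + G)
            old = temp
    return a


def linear_space_align(s, t):
    """Return an optimal global alignment as two rows with '-' for spaces."""
    if not s:
        return "-" * len(t), t
    if not t:
        return s, "-" * len(s)
    n = len(t)
    i = (len(s) + 1) // 2                     # 1-based middle position of s
    pref = best_score(s[:i - 1], t)           # pref[j] = sim(s[1..i-1], t[1..j])
    suff = best_score(s[i:][::-1], t[::-1])   # suff[k] = sim(s[i+1..m], last k chars of t)
    best = None
    for j in range(1, n + 1):                 # case (3.6): s[i] over t[j]
        score = pref[j - 1] + p(s[i - 1], t[j - 1]) + suff[n - j]
        if best is None or score > best[0]:
            best = (score, j, True)
    for j in range(0, n + 1):                 # case (3.7): s[i] over a space after t[j]
        score = pref[j] + G + suff[n - j]
        if score > best[0]:
            best = (score, j, False)
    _, j, matched = best
    if matched:
        left = linear_space_align(s[:i - 1], t[:j - 1])
        mid = (s[i - 1], t[j - 1])
    else:
        left = linear_space_align(s[:i - 1], t[:j])
        mid = (s[i - 1], "-")
    right = linear_space_align(s[i:], t[j:])
    return left[0] + mid[0] + right[0], left[1] + mid[1] + right[1]


top, bottom = linear_space_align("GACGGATTAG", "GATCGGAATAG")
print(top)
print(bottom)
```

Each call keeps only two score vectors of length n+1, and both recursive calls work on strictly shorter pieces of s, so the recursion always terminates.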


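BLAST/Lucene steps

Build an inverted index for the database

Query the inverted index

Extend and verify

Problem: choosing the value of K

Variable-length k-mers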
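The three steps can be sketched on exact k-mers. This is a deliberately simplified illustration, not BLAST itself: the function names are hypothetical, the index maps each k-mer to (sequence id, position) pairs, and the extension step only grows a hit while characters match exactly, whereas real BLAST scores the extension and tolerates mismatches.

```python
from collections import defaultdict


def build_index(database, k):
    """Step 1: inverted index from each k-mer to the places it occurs."""
    index = defaultdict(list)
    for seq_id, seq in enumerate(database):
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((seq_id, pos))
    return index


def find_seeds(query, index, k):
    """Step 2: look up every k-mer of the query; return (seq_id, qpos, dpos) hits."""
    seeds = []
    for qpos in range(len(query) - k + 1):
        for seq_id, dpos in index.get(query[qpos:qpos + k], []):
            seeds.append((seq_id, qpos, dpos))
    return seeds


def extend_seed(query, target, qpos, dpos, k):
    """Step 3: extend an exact hit left and right while characters keep matching."""
    left = 0
    while (qpos - left > 0 and dpos - left > 0
           and query[qpos - left - 1] == target[dpos - left - 1]):
        left += 1
    right = k
    while (qpos + right < len(query) and dpos + right < len(target)
           and query[qpos + right] == target[dpos + right]):
        right += 1
    return qpos - left, dpos - left, left + right  # start in query, start in target, length


database = ["xxACGTAyy"]
query = "ACGTA"
k = 3
index = build_index(database, k)
for seq_id, qpos, dpos in find_seeds(query, index, k):
    print(seq_id, extend_seed(query, database[seq_id], qpos, dpos, k))
```

The choice of k is the trade-off named on the slide: a small k produces many spurious seeds to extend, while a large k can miss similar regions that contain no exact k-mer in common.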


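Homework

Build a keyword tree for {apple, please, eat, apply} and draw all the failure links.

Align the two strings aaac and agc, assuming a match scores 2, a mismatch scores -1, and a space scores -2. Draw the dynamic programming table and the traceback path, and give the alignment that corresponds to that traceback path.

Briefly describe the main idea of BLAST.

Compute the sp and sp' values at every position of the string "abababc".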
