
http://datamining.xmu.edu.cn Approximate Search (近似搜索), Quan Zou (邹权), Ph.D., Assistant Professor


Page 1

Approximate Search (近似搜索)

Quan Zou (邹权)

Ph.D., Assistant Professor

http://datamining.xmu.edu.cn

Page 2

Outline

Global alignment

Local alignment

BLAST

Page 3

Why compare sequences?

Sequence comparison: the operation of finding which parts of the sequences are alike and which parts differ.

Algorithms for an efficient solution

Page 4

TT....TGTGTGCATTTAAGGGTGATAGTGTATTTGCTCTTTAAGAGCTG
||     ||   ||  |  | ||| |  |||| |||||    |||  |||
TTGACAGGTACCCAACTGTGTGTGCTGATGTA.TTGCTGGCCAAGGACTG

AGTGTTTGAGCCTCTGTTTGTGTGTAATTGAGTGTGCATGTGTGGGAGTG
|  | |              |  |||||| |   ||||  | || |   |
AAGGATC.............TCAGTAATTAATCATGCACCTATGTGGCGG

AAATTGTGGAATGTGTATGCTCATAGCACTGAGTGAAAATAAAAGATTGT
||| | ||| || || |||   |   |||||||||  ||   |||||| |
AAA.TATGGGATATGCATGTCGA...CACTGAGTG..AAGGCAAGATTAT

Page 5

Two notions

• Similarity: a measure of how similar two sequences are

• Alignment: a basic operation to compare two sequences; a way of placing one sequence above the other in order to make clear the correspondence between similar characters or substrings from the sequences

Page 6

Comparing two sequences

Alignments involving:

• global comparisons: entire sequences

• local comparisons: just substrings of sequences

Dynamic programming (DP)

Page 7

global comparison - example

Example of aligning

GACGGATTAG
GATCGGAATAG

GA-CGGATTAG
GATCGGAATAG

an extra T; a change from A to T; a space is written as a dash

Page 8

global comparison - the basic algorithm

Definitions

Alignment:

• insertion of spaces so that both sequences have the same size

• creating a correspondence: one sequence placed over the other

• a space may not be aligned with another space

• spaces can be inserted at the beginning or end

Scoring function: a measure of similarity between elements

• a match: +1 (identical characters)

• a mismatch: -1 (distinct characters)

• a space: -2

• Scoring system: rewards matches and penalizes mismatches and spaces

Page 9

global comparison - the basic algorithm

GA-CGGATTAG
GATCGGAATAG

Example: total score is 6 (9 matches, 1 mismatch, 1 space: 9 - 1 - 2 = 6)

similarity: sim(s, t)

• the maximum alignment score; many alignments may achieve this score

best alignment

• an alignment whose score equals the similarity
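The column-by-column scoring used in the example can be sketched in a few lines of Python. The function name `score_alignment` and its parameter names are illustrative choices, not part of the slides; the default scores are the +1 / -1 / -2 values given above.

```python
def score_alignment(top: str, bottom: str,
                    match: int = 1, mismatch: int = -1, space: int = -2) -> int:
    """Sum the column scores of an alignment given as two equal-length rows."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" and b == "-":
            # The definition of an alignment forbids a space over a space.
            raise ValueError("two spaces may not be aligned with each other")
        if a == "-" or b == "-":
            total += space
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


print(score_alignment("GA-CGGATTAG", "GATCGGAATAG"))  # 6
```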

Page 10

Basic DP algorithm for comparison of two sequences

The number of alignments between two sequences is exponential, so an efficient algorithm is needed.

• DP: solve the problem for prefixes, from shorter to longer

• Idea: an (m+1)×(n+1) array a, where entry (i, j) is the similarity between s[1..i] and t[1..j]

• First row and column: a[0, j] = -2j and a[i, 0] = -2i (a prefix aligned against spaces only)

• p(i, j) = +1 if s[i] = t[j], and -1 if s[i] ≠ t[j] (written in the upper left corner of each cell of the DP table)

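$$
\mathrm{sim}(s[1..i], t[1..j]) = \max
\begin{cases}
\mathrm{sim}(s[1..i], t[1..j-1]) - 2 \\
\mathrm{sim}(s[1..i-1], t[1..j-1]) + p(i, j) \\
\mathrm{sim}(s[1..i-1], t[1..j]) - 2
\end{cases}
$$

$$
a[i, j] = \max
\begin{cases}
a[i, j-1] - 2 \\
a[i-1, j-1] + p(i, j) \\
a[i-1, j] - 2
\end{cases}
$$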
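As a minimal sketch of the recurrence, assuming the +1 / -1 / -2 scoring from the earlier slides, the following Python function fills the (m+1)×(n+1) array and then walks back from the bottom-right corner to recover one best alignment. The name `global_alignment` and the tie-breaking order in the traceback are assumptions made for illustration.

```python
def global_alignment(s, t, match=1, mismatch=-1, space=-2):
    """Return (sim(s, t), aligned s, aligned t) using the full DP table."""
    m, n = len(s), len(t)

    def p(i, j):
        # 1-based indices, as on the slides.
        return match if s[i - 1] == t[j - 1] else mismatch

    a = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        a[i][0] = i * space
    for j in range(1, n + 1):
        a[0][j] = j * space
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            a[i][j] = max(a[i][j - 1] + space,
                          a[i - 1][j - 1] + p(i, j),
                          a[i - 1][j] + space)

    # Traceback: follow any predecessor that explains the current entry.
    top, bottom = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and a[i][j] == a[i - 1][j - 1] + p(i, j):
            top.append(s[i - 1]); bottom.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and a[i][j] == a[i - 1][j] + space:
            top.append(s[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(t[j - 1]); j -= 1
    return a[m][n], "".join(reversed(top)), "".join(reversed(bottom))


score, row_s, row_t = global_alignment("GACGGATTAG", "GATCGGAATAG")
print(score)
print(row_s)
print(row_t)
```

Several alignments can share the best score, so the rows printed depend on the order in which the traceback tests the three predecessors.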


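Page 12

DP table for s = AAAC (rows) and t = AGC (columns); match +1, mismatch -1, space -2. In the original figure the upper left corner of each cell holds p(i, j): +1 where the row and column letters agree, -1 otherwise.

i \ j     0    1 (A)   2 (G)   3 (C)
0         0    -2      -4      -6
1 (A)    -2     1      -1      -3
2 (A)    -4    -1       0      -2
3 (A)    -6    -3      -2      -1
4 (C)    -8    -5      -4      -1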

Page 13

local comparison

Problem: a local alignment between s and t is an alignment between a substring of s and a substring of t

Algorithm: find the highest scoring local alignment between two sequences

Page 14

local comparison

Idea:

Data structure:

• an (m+1)×(n+1) array a

• entry (i, j) holds the highest score of an alignment between a suffix of s[1..i] and a suffix of t[1..j]

Initialization:

• The first row and column are initialized with zeros: for any entry (i, j), there is always the alignment between the empty suffixes of s[1..i] and t[1..j], which has score zero.

Recurrence (g is the space penalty, e.g. g = -2):

$$
a[i, j] = \max
\begin{cases}
a[i, j-1] + g \\
a[i-1, j-1] + p(i, j) \\
a[i-1, j] + g \\
0
\end{cases}
$$
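A short Python sketch of local comparison, under the same +1 / -1 scoring and with g = -2 assumed: the array is filled with the four-way maximum (including 0), the best entry anywhere in the array is located, and the traceback stops at the first zero. The function name `local_alignment` is hypothetical.

```python
def local_alignment(s, t, match=1, mismatch=-1, g=-2):
    """Return (best local score, aligned substring of s, aligned substring of t)."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]  # first row and column stay 0
    best, best_i, best_j = 0, 0, 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(0,
                          a[i][j - 1] + g,
                          a[i - 1][j - 1] + p,
                          a[i - 1][j] + g)
            if a[i][j] > best:
                best, best_i, best_j = a[i][j], i, j

    # Traceback from the maximum entry until an entry with score 0 is reached.
    top, bottom = [], []
    i, j = best_i, best_j
    while a[i][j] > 0:
        p = match if s[i - 1] == t[j - 1] else mismatch
        if a[i][j] == a[i - 1][j - 1] + p:
            top.append(s[i - 1]); bottom.append(t[j - 1]); i -= 1; j -= 1
        elif a[i][j] == a[i - 1][j] + g:
            top.append(s[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(t[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(local_alignment("xxxCAGTTATGTCAGyyy", "zzzzCAGTTATGTCAGwww"))
# (12, 'CAGTTATGTCAG', 'CAGTTATGTCAG')
```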

Page 18

Global alignment

Page 19

Local vs. Global Alignment (cont’d)

Global Alignment

--T—-CC-C-AGT—-TATGT-CAGGGGACACG—A-GCATGCAGA-GAC
| || | || | | | ||| || | | | | |||| |
AATTGCCGCC-GTCGT-T-TTCAG----CA-GTTATG—T-CAGAT--C

Local Alignment: better for finding a conserved segment

                  tccCAGTTATGTCAGgggacacgagcatgcagagac
                     ||||||||||||
aattgccgccgtcgttttcagCAGTTATGTCAGatc

Page 20

Local Alignment: Example

Global alignment

Local alignment

Compute a “mini” Global Alignment to get Local

Page 21

semiglobal comparison

Summary

Forgiving initial spaces: initialize certain positions with zero

Forgiving final spaces: look for the maximum along certain positions

Place where spaces are not charged for    Action
Beginning of first sequence               Initialize first row with zeros
End of first sequence                     Look for maximum in last row
Beginning of second sequence              Initialize first column with zeros
End of second sequence                    Look for maximum in last column
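The four rules in the summary map directly onto four switches in the DP. The sketch below is one possible arrangement in Python, assuming the first sequence s indexes the rows and the second sequence t indexes the columns; the function name `semiglobal_score` and the four flag names are invented for this illustration.

```python
def semiglobal_score(s, t,
                     free_start_s=False, free_end_s=False,
                     free_start_t=False, free_end_t=False,
                     match=1, mismatch=-1, g=-2):
    """Best alignment score when selected end spaces are not charged for."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(1, n + 1):
        # First row: spaces placed before the beginning of s.
        a[0][j] = 0 if free_start_s else j * g
    for i in range(1, m + 1):
        # First column: spaces placed before the beginning of t.
        a[i][0] = 0 if free_start_t else i * g
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(a[i][j - 1] + g, a[i - 1][j - 1] + p, a[i - 1][j] + g)

    candidates = [a[m][n]]
    if free_end_s:
        candidates.extend(a[m])  # maximum in the last row
    if free_end_t:
        candidates.extend(a[i][n] for i in range(m + 1))  # maximum in the last column
    return max(candidates)


print(semiglobal_score("AGC", "TTAGCTT"))  # -5 (plain global comparison)
print(semiglobal_score("AGC", "TTAGCTT", free_start_s=True, free_end_s=True))  # 3
```

With all four flags off the function reduces to the global similarity; turning on both flags for s lets the short sequence sit anywhere inside the long one at no cost.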


Page 23

saving space

Computing sim(s, t)

Algorithm BestScore
  input: sequences s and t
  output: vector a
  m ← |s|
  n ← |t|
  for j ← 0 to n do
    a[j] ← j × g
  for i ← 1 to m do
    old ← a[0]
    a[0] ← i × g
    for j ← 1 to n do
      temp ← a[j]
      a[j] ← max(a[j] + g, old + p(i, j), a[j-1] + g)
      old ← temp
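A direct Python transcription of BestScore, given as a sketch: the +1 / -1 values for p(i, j) and g = -2 are the scores from the earlier slides, passed here as default arguments. The returned vector is the last row of the DP table, so its final entry is sim(s, t).

```python
def best_score(s, t, match=1, mismatch=-1, g=-2):
    """Last row of the global DP table, computed with a single vector of length n+1."""
    m, n = len(s), len(t)
    a = [j * g for j in range(n + 1)]
    for i in range(1, m + 1):
        old = a[0]          # value of a[i-1][0], needed for the diagonal move
        a[0] = i * g
        for j in range(1, n + 1):
            temp = a[j]     # a[i-1][j], about to be overwritten
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[j] = max(a[j] + g, old + p, a[j - 1] + g)
            old = temp      # becomes a[i-1][j-1] for the next column
    return a


print(best_score("AAAC", "AGC"))  # [-8, -5, -4, -1]
```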

Page 24

An optimal alignment in linear space

Idea: divide and conquer strategy

Fix a position i in s and consider what matches s[i] in an optimal alignment. There are two possibilities:

1. The symbol t[j] matches s[i], for some j in 1..n:

$$
\mathrm{optimal}\begin{pmatrix} s[1..i-1] \\ t[1..j-1] \end{pmatrix}
+ \begin{matrix} s[i] \\ t[j] \end{matrix}
+ \mathrm{optimal}\begin{pmatrix} s[i+1..m] \\ t[j+1..n] \end{pmatrix}
\tag{3.6}
$$

2. A space between t[j] and t[j+1] matches s[i], for some j in 0..n:

$$
\mathrm{optimal}\begin{pmatrix} s[1..i-1] \\ t[1..j] \end{pmatrix}
+ \begin{matrix} s[i] \\ \text{-} \end{matrix}
+ \mathrm{optimal}\begin{pmatrix} s[i+1..m] \\ t[j+1..n] \end{pmatrix}
\tag{3.7}
$$

Recursive method

1. For a fixed i, choose the j and the case, (3.6) or (3.7), that give the best total score

2. To decide which value of i to use in each recursive call: pick i as close as possible to the middle of the sequence
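The divide and conquer idea can be sketched in Python as follows. This is the common formulation that cuts s into two halves and searches for the matching cut point j in t, rather than singling out s[i] with the two cases (3.6) and (3.7); both give an optimal alignment in linear space. The names `last_row`, `linear_space_alignment`, and `alignment_score`, and the +1 / -1 / -2 scores, are assumptions for this example.

```python
MATCH, MISMATCH, G = 1, -1, -2


def p(x, y):
    return MATCH if x == y else MISMATCH


def last_row(s, t):
    """Last row of the global DP table for s versus t, in O(len(t)) space."""
    a = [j * G for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        old, a[0] = a[0], i * G
        for j in range(1, len(t) + 1):
            temp = a[j]
            a[j] = max(a[j] + G, old + p(s[i - 1], t[j - 1]), a[j - 1] + G)
            old = temp
    return a


def linear_space_alignment(s, t):
    """Return one optimal global alignment of s and t as two gapped rows."""
    if not s:
        return "-" * len(t), t
    if not t:
        return s, "-" * len(s)
    if len(s) == 1:
        # Base case: pair s[0] with its best partner in t, or with a space.
        k = max(range(len(t)), key=lambda j: p(s[0], t[j]))
        if p(s[0], t[k]) >= 2 * G:
            return "-" * k + s + "-" * (len(t) - k - 1), t
        return s + "-" * len(t), "-" + t
    i = len(s) // 2  # cut s as close to the middle as possible
    left = last_row(s[:i], t)                # sim(s[:i], t[:j]) for every j
    right = last_row(s[i:][::-1], t[::-1])   # sim(s[i:], t[j:]) read backwards
    n = len(t)
    j = max(range(n + 1), key=lambda k: left[k] + right[n - k])
    top1, bottom1 = linear_space_alignment(s[:i], t[:j])
    top2, bottom2 = linear_space_alignment(s[i:], t[j:])
    return top1 + top2, bottom1 + bottom2


def alignment_score(top, bottom):
    return sum(G if "-" in (x, y) else p(x, y) for x, y in zip(top, bottom))


rows = linear_space_alignment("GACGGATTAG", "GATCGGAATAG")
print(rows[0])
print(rows[1])
print(alignment_score(*rows))  # 6
```

Each level of the recursion only keeps score vectors of length n+1, and the two subproblems together cover half of the table area of their parent, which keeps the total running time proportional to m×n.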

Page 25

saving space

Page 26

BLAST/Lucene steps

Build an inverted index for the database

Query the inverted index

Extend and verify

Open question: choosing the value of K

Variable-length k-mers
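The three steps can be illustrated with a toy Python sketch. It is a simplification, not BLAST itself: seeds are exact k-mer matches, extension only continues while characters are identical (BLAST uses a scored extension), and the names `build_index`, `search`, and `min_len` are hypothetical.

```python
from collections import defaultdict


def build_index(database, k):
    """Step 1: inverted index from each k-mer to the (sequence id, position) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in enumerate(database):
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((seq_id, pos))
    return index


def search(query, database, index, k, min_len):
    """Steps 2 and 3: look up every query k-mer, then extend each seed and keep long hits."""
    hits = set()
    for qpos in range(len(query) - k + 1):
        for seq_id, dpos in index.get(query[qpos:qpos + k], []):
            seq = database[seq_id]
            left = 0   # number of extra identical characters to the left of the seed
            while (qpos - left - 1 >= 0 and dpos - left - 1 >= 0
                   and query[qpos - left - 1] == seq[dpos - left - 1]):
                left += 1
            right = k  # seed length plus identical characters to its right
            while (qpos + right < len(query) and dpos + right < len(seq)
                   and query[qpos + right] == seq[dpos + right]):
                right += 1
            if left + right >= min_len:
                # (sequence id, start in database sequence, start in query, length)
                hits.add((seq_id, dpos - left, qpos - left, left + right))
    return sorted(hits)


database = ["TTGACAGGTACC", "AAACAGTTATGTCAGGG"]
index = build_index(database, k=4)
print(search("CCCAGTTATGTCAGAT", database, index, k=4, min_len=8))
# [(1, 3, 2, 12)]
```

The choice of k shows up directly here: a small k produces many seeds to extend, while a large k misses matches shorter than k.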

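Page 27

Homework

Build a keyword tree for {apple, please, eat, apply} and draw all the failure links.

Align the two strings aaac and agc, assuming a match scores 2, a mismatch scores -1, and a space scores -2. Draw the dynamic programming table and the traceback path, and give the alignment that corresponds to that traceback path.

Briefly describe the main idea of BLAST.

Compute the sp and sp' values at every position of the string "abababc".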