Chapter 4: Dynamic Programming Techniques
Zou Quan (邹权), Ph.D.
Department of Computer Science
4.1 Introduction
Fibonacci number F(n)
F(n) = 1                  if n = 0 or 1
F(n) = F(n-1) + F(n-2)    if n > 1
n     0  1  2  3  4  5  6   7   8   9   10
F(n)  1  1  2  3  5  8  13  21  34  55  89
Pseudo code for the recursive algorithm:
F(n)
  if n = 0 or n = 1 then return 1
  else return F(n-1) + F(n-2)
The execution of F(7)
[Figure: recursion tree of F(7). F7 calls F6 and F5, F6 calls F5 and F4, and so on down to the leaves F1 and F0.]
Computation of F(2) is repeated 8 times!
Computation of F(3) is also repeated 5 times!
Many computations are repeated! How to avoid this?
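The repetition counts quoted for F(7) can be checked with a small sketch. It is a direct transcription of the recursive pseudocode; the `calls` counter and the name `fib_naive` are additions made here for illustration only.

```python
from collections import Counter

def fib_naive(n, calls):
    """Recursive F(n) from the pseudocode; `calls` records how often each argument is requested."""
    calls[n] += 1
    if n == 0 or n == 1:
        return 1
    return fib_naive(n - 1, calls) + fib_naive(n - 2, calls)

calls = Counter()
print(fib_naive(7, calls))  # F(7) with F(0) = F(1) = 1
print(calls[2], calls[3])   # how many times F(2) and F(3) were computed
```

The counter grows like the Fibonacci numbers themselves, which is why the running time of the plain recursion is exponential in n.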
Idea for improvement
Memoization
Store F1(i) somewhere after we have computed its value.
Afterward, we don't need to re-compute F1(i); we can retrieve its value from our memory.
F1(n)
  if v[n] < 0 then
    v[n] ← F1(n-1) + F1(n-2)
  return v[n]
Main()
  v[0] ← 1; v[1] ← 1
  for i ← 2 to n do
    v[i] ← -1
  output F1(n)
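A minimal Python sketch of the memoized version follows. It mirrors F1 and Main, using -1 as the "not yet computed" marker; the wrapper name `fib_memo` is an assumption of this sketch, not part of the slides.

```python
def fib_memo(n):
    """Memoized Fibonacci: v[k] < 0 means F(k) has not been computed yet."""
    v = [1, 1] + [-1] * max(0, n - 1)  # v[0] = v[1] = 1, the rest unknown

    def f1(k):
        if v[k] < 0:                   # compute each value at most once
            v[k] = f1(k - 1) + f1(k - 2)
        return v[k]                    # otherwise just look it up

    return f1(n)

print(fib_memo(7))
```

Each entry of v is filled exactly once, so the number of additions drops from exponential to linear in n.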
Look at the execution of F(7)
[Figure: the recursion tree of F(7), redrawn after each step with the subtrees that no longer need to be recomputed removed. F(i) is written Fi.]
Contents of v as the execution proceeds:
              v[0] v[1] v[2] v[3] v[4] v[5] v[6] v[7]
initially       1    1   -1   -1   -1   -1   -1   -1
after F1(2)     1    1    2   -1   -1   -1   -1   -1
after F1(3)     1    1    2    3   -1   -1   -1   -1
after F1(4)     1    1    2    3    5   -1   -1   -1
after F1(5)     1    1    2    3    5    8   -1   -1
after F1(6)     1    1    2    3    5    8   13   -1
after F1(7)     1    1    2    3    5    8   13   21
This new implementation saves lots of overhead.
Can we do even better?
Observation
  The 2nd version still makes many function calls, and each wastes time in parameter passing, dynamic linking, ...
  In general, to compute F(i), we need F(i-1) and F(i-2) only.
Idea to further improve
  Compute the values in bottom-up fashion. That is, compute F(2) (we already know F(0) = F(1) = 1), then F(3), then F(4), …
F2(n)
  F[0] ← 1
  F[1] ← 1
  for i ← 2 to n do
    F[i] ← F[i-1] + F[i-2]
  return F[n]
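The bottom-up procedure F2 translates almost line for line into Python. The sketch below keeps the whole table, as the pseudocode does; the function name is chosen here for illustration.

```python
def fib_bottom_up(n):
    """Bottom-up Fibonacci: fill the table from the smallest subproblem upward."""
    f = [1] * (n + 1)                # f[0] = f[1] = 1
    for i in range(2, n + 1):
        f[i] = f[i - 1] + f[i - 2]   # both operands are already in the table
    return f[n]

print([fib_bottom_up(k) for k in range(11)])
```

Since each step reads only the two previous entries, the table could be replaced by two variables if only F(n) is wanted.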
Recursive vs Dynamic programming
Recursive version (too slow!):
F(n)
  if n = 0 or n = 1 then return 1
  else return F(n-1) + F(n-2)
Dynamic Programming version (efficient! Time complexity is O(n)):
F2(n)
  F[0] ← 1
  F[1] ← 1
  for i ← 2 to n do
    F[i] ← F[i-1] + F[i-2]
  return F[n]
Summary of the methodology
  Write down a formula that relates a solution of a problem to those of its sub-problems. E.g. F(n) = F(n-1) + F(n-2).
  Index the sub-problems so that they can be stored and retrieved easily in a table (i.e., an array).
  Fill the table in some bottom-up manner; start by filling in the solution of the smallest problem. This ensures that when we solve a particular sub-problem, the solutions of all the smaller sub-problems that it depends on are available.
For historical reasons, we call such a methodology Dynamic Programming. In the late 40's (when computers were rare), programming referred to the "tabular method".
Dynamic programming VS Divide-and-conquer
Divide-and-conquer method
  1. Subproblems are independent.
  2. The same subproblem may be solved repeatedly.
Dynamic programming (DP)
  1. Subproblems are not independent.
  2. Each subproblem is solved just once.
Common: the problem is partitioned into one or more subproblems, then the solutions of the subproblems are combined.
DP reduces computation by
  Solving subproblems in a bottom-up fashion.
  Storing the solution to a subproblem the first time it is solved.
  Looking up the solution when the subproblem is met again.
4.2 Assembly-line scheduling
Si[j]: the jth station on line i (i = 1 or 2).
ai[j]: the assembly time required at station Si[j].
ti[j]: the time to transfer away from line i after station Si[j].
ei: entry time. xi: exit time.
[Figure: two assembly lines (Line 1 and Line 2) running from Begin to End. Line i has entry time ei, stations Si[1], …, Si[n] with assembly times ai[1], …, ai[n], transfer times ti[1], …, ti[n-1] to the other line, and exit time xi.]
One Solution
Brute force
  Enumerate all possibilities of selecting stations.
  Compute how long it takes in each case and choose the best one.
  A selection can be written as a bit vector of length n, e.g. 1 0 0 1 1 for steps 1, 2, 3, 4, …, n: bit j is 1 if line 1 is chosen at step j and 0 if line 2 is chosen at step j.
Problem:
  There are 2^n possible ways to choose stations.
  Infeasible when n is large.
fi[j] = the fastest time to get from the starting point through station Si[j]
Let f* denote the fastest time to get a chassis all the way through the factory.
j = 1 (getting through station 1):
  f1[1] = e1 + a1[1]
  f2[1] = e2 + a2[1]
f* = min(f1[n] + x1, f2[n] + x2)
2. A Recursive Solution (cont.)
Compute fi[j] for j = 2, 3, …, n, and i = 1, 2
The fastest way through S1[j] is either:
  the fastest way through S1[j-1], then directly through S1[j]:
    f1[j] = f1[j-1] + a1[j]
  or the fastest way through S2[j-1], a transfer from line 2 to line 1, then through S1[j]:
    f1[j] = f2[j-1] + t2[j-1] + a1[j]
Hence f1[j] = min(f1[j-1] + a1[j], f2[j-1] + t2[j-1] + a1[j])
[Figure: stations S1[j-1], S1[j] and S2[j-1] with times a1[j-1], a1[j], a2[j-1] and the transfer t2[j-1].]
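The recursive equation
\[
f_1[j] = \begin{cases} e_1 + a_1[1] & \text{if } j = 1 \\ \min\bigl(f_1[j-1] + a_1[j],\; f_2[j-1] + t_2[j-1] + a_1[j]\bigr) & \text{if } j \ge 2 \end{cases}
\]
\[
f_2[j] = \begin{cases} e_2 + a_2[1] & \text{if } j = 1 \\ \min\bigl(f_2[j-1] + a_2[j],\; f_1[j-1] + t_1[j-1] + a_2[j]\bigr) & \text{if } j \ge 2 \end{cases}
\]
For example, n = 5:
Solving top-down would result in exponential running time.
[Figure: table of f1[j] and f2[j] for j = 1, …, 5. In a top-down evaluation f1[4] and f2[4] are each computed 2 times, f1[3] and f2[3] 4 times, and the counts keep doubling down to f1[1] and f2[1].]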
3. Computing the Optimal Solution
For j ≥ 2, each value fi[j] depends only on the values of f1[j-1] and f2[j-1].
Compute the values of fi[j] in increasing order of j.
Bottom-up approach
  First find optimal solutions to subproblems.
  Find an optimal solution to the problem from the subproblems.
[Figure: table of f1[j] and f2[j] for j = 1, …, 5, filled in order of increasing j.]
Additional Information
To construct the optimal solution we need the sequence of lines used at each station:
  li[j]: the line number (1 or 2) whose station j-1 has been used to get in the fastest time through Si[j], j = 2, 3, …, n
  l*: the line whose station n is used to get in the fastest way through the entire factory
[Figure: table of l1[j] and l2[j] for j = 2, …, 5, filled in order of increasing j.]
Step 3: Computing the fastest way
DPFastestWay(a, t, e, x, n)
  f1[1] ← e1 + a1[1]; f2[1] ← e2 + a2[1]
  for j ← 2 to n do
    if f1[j-1] + a1[j] ≤ f2[j-1] + t2[j-1] + a1[j] then
      f1[j] ← f1[j-1] + a1[j]
      l1[j] ← 1
    else
      f1[j] ← f2[j-1] + t2[j-1] + a1[j]
      l1[j] ← 2
    if f2[j-1] + a2[j] ≤ f1[j-1] + t1[j-1] + a2[j] then
      f2[j] ← f2[j-1] + a2[j]
      l2[j] ← 2
    else
      f2[j] ← f1[j-1] + t1[j-1] + a2[j]
      l2[j] ← 1
  if f1[n] + x1 ≤ f2[n] + x2 then
    f* ← f1[n] + x1
    l* ← 1
  else
    f* ← f2[n] + x2
    l* ← 2
Running time: Θ(n)
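As a hedged illustration, DPFastestWay can be written in Python with 0-based indices. The function name, the 0/1 line encoding and the returned route are choices made for this sketch. The input values are the six-station example of this section as best recoverable from the garbled figure, so their arrangement is an assumption (it does reproduce the f table and f* = 38).

```python
def fastest_way(a, t, e, x):
    """Assembly-line DP. a[i][j]: station times, t[i][j]: transfer time off line i
    after station j, e[i]/x[i]: entry/exit times. Lines are 0 and 1 internally."""
    n = len(a[0])
    f = [[0] * n for _ in range(2)]
    l = [[None] * n for _ in range(2)]
    for i in range(2):
        f[i][0] = e[i] + a[i][0]
    for j in range(1, n):
        for i in range(2):
            o = 1 - i                                   # the other line
            stay = f[i][j - 1] + a[i][j]
            switch = f[o][j - 1] + t[o][j - 1] + a[i][j]
            if stay <= switch:                          # same tie-breaking as the pseudocode
                f[i][j], l[i][j] = stay, i
            else:
                f[i][j], l[i][j] = switch, o
    line = 0 if f[0][n - 1] + x[0] <= f[1][n - 1] + x[1] else 1
    best = f[line][n - 1] + x[line]
    route = [line]
    for j in range(n - 1, 0, -1):                       # walk the l table backwards
        line = l[line][j]
        route.append(line)
    route.reverse()
    return best, [r + 1 for r in route]                 # report lines as 1 or 2

# Example instance (arrangement of the values is an assumption, see above).
a = [[7, 9, 3, 4, 8, 4], [8, 5, 6, 4, 5, 7]]
t = [[2, 3, 1, 3, 4], [2, 1, 2, 2, 1]]
print(fastest_way(a, t, e=[2, 4], x=[3, 2]))
```

The returned list gives the line used at stations 1 through n, which is the same information PrintStations prints in reverse order.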
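For example
[Figure: a factory with two assembly lines of six stations each, from Begin to End.]
e1 = 2, e2 = 4, x1 = 3, x2 = 2
j        1   2   3   4   5   6
a1[j]    7   9   3   4   8   4
a2[j]    8   5   6   4   5   7
t1[j]    2   3   1   3   4
t2[j]    2   1   2   2   1
f1[j]    9  18  20  24  32  35
f2[j]   12  16  22  25  30  37
f* = 38
j        2   3   4   5   6
l1[j]    1   2   1   1   2
l2[j]    1   2   1   2   2
l* = 1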
4. Construct an Optimal Solution
PrintStations(l, n)
  i ← l*
  print "line " i ", station " n
  for j ← n downto 2 do
    i ← li[j]
    print "line " i ", station " j-1
Output for the example (l* = 1):
  line 1, station 6
  line 2, station 5
  line 2, station 4
  line 1, station 3
  line 2, station 2
  line 1, station 1
[Figure: the example factory with the fastest route highlighted.]
The fastest assembly way: f* = 38
4.3 Matrix-chain multiplication
Problem:
Given a chain A_1, A_2, …, A_n of n matrices, where for i = 1, 2, …, n matrix A_i has dimension p_{i-1} × p_i, fully parenthesize the product A_1 A_2 ··· A_n in a way that minimizes the number of scalar multiplications.
  A_1: p_0 × p_1,  A_2: p_1 × p_2,  …,  A_i: p_{i-1} × p_i,  A_{i+1}: p_i × p_{i+1},  …,  A_n: p_{n-1} × p_n
A product of matrices is fully parenthesized if it is either a single matrix or the product of two fully parenthesized matrix products, surrounded by parentheses.
For example, if the chain of matrices is <A1, A2, A3, A4>, the product A1 A2 A3 A4 can be fully parenthesized in five distinct ways:
(A1 (A2 (A3 A4))) ,
(A1 ((A2 A3) A4)) ,
((A1 A2) (A3 A4)) ,
((A1 (A2 A3)) A4) ,
(((A1 A2) A3) A4).
MatrixMultiply(A, B)   (A is n × m, B is p × q)
  if m ≠ p then
    print "Two matrices cannot multiply"
  else
    for i ← 1 to n do
      for j ← 1 to q do
        C[i, j] ← 0
        for k ← 1 to m do
          C[i, j] ← C[i, j] + A[i, k] · B[k, j]
    return C
Running time: Θ(nmq)
Consider the problem of a chain A1, A2, A3 of three matrices. Suppose that the dimensions of the matrices are 10×100, 100×5, and 5×50, respectively.
1. ((A1A2) A3): A1A2 = 10×100×5 = 5,000 (10×5)
((A1 A2) A3) = 10×5×50 = 2,500
Total: 7,500 scalar multiplications
2. (A1 (A2A3)): A2A3 = 100×5×50 = 25,000 (100×50)
(A1 (A2 A3)) = 10×100×50 = 50,000
Total: 75,000 scalar multiplications
one order of magnitude difference!!
1. The structure of an optimal parenthesization
Notation:
  A_{i..j} = A_i A_{i+1} ··· A_j, i ≤ j
For i < j and some k with i ≤ k < j:
  A_{i..j} = A_i A_{i+1} ··· A_j
           = A_i A_{i+1} ··· A_k A_{k+1} ··· A_j
           = A_{i..k} A_{k+1..j}
2. A recursive solution
Subproblem:
  determine the minimum cost of parenthesizing A_{i..j} = A_i A_{i+1} ··· A_j for 1 ≤ i ≤ j ≤ n
Let m[i, j] = the minimum number of multiplications needed to compute A_{i..j}
Full problem (A_{1..n}): m[1, n]
2. A Recursive Solution (cont.)
Consider the subproblem of parenthesizing A_{i..j} = A_i A_{i+1} ··· A_j for 1 ≤ i ≤ j ≤ n
  A_{i..j} = A_{i..k} A_{k+1..j} for i ≤ k < j
m[i, j] = the minimum number of multiplications needed to compute the product A_{i..j}
m[i, j] = m[i, k] + m[k+1, j] + p_{i-1} p_k p_j
  m[i, k]: min # of multiplications to compute A_{i..k}
  m[k+1, j]: min # of multiplications to compute A_{k+1..j}
  p_{i-1} p_k p_j: # of multiplications to compute A_{i..k} A_{k+1..j}
2. A Recursive Solution (cont.)
m[i, j] = m[i, k] + m[k+1, j] + p_{i-1} p_k p_j
We do not know the value of k. There are j - i possible values for k: k = i, i+1, …, j-1
Minimizing the cost of parenthesizing the product A_i A_{i+1} ··· A_j becomes:
\[
m[i, j] = \begin{cases} 0 & \text{if } i = j, \\ \min\limits_{i \le k < j} \{\, m[i, k] + m[k+1, j] + p_{i-1}\, p_k\, p_j \,\} & \text{if } i < j. \end{cases}
\]
3. Computing the Optimal Costs
How many subproblems do we have?
  Parenthesize A_{i..j} for 1 ≤ i ≤ j ≤ n
  One problem for each choice of i and j: Θ(n^2) subproblems
A recursive algorithm may encounter each subproblem many times in different branches of the recursion: overlapping subproblems
Compute a solution using a tabular bottom-up approach
\[
m[i, j] = \begin{cases} 0 & \text{if } i = j, \\ \min\limits_{i \le k < j} \{\, m[i, k] + m[k+1, j] + p_{i-1}\, p_k\, p_j \,\} & \text{if } i < j. \end{cases}
\]
Reconstructing the Optimal Solution
Additional information to maintain:
  s[i, j] = a value of k at which we can split the product A_i A_{i+1} ··· A_j in order to obtain an optimal parenthesization
3. Computing the Optimal Costs (cont.)
How do we fill in the tables m[1..n, 1..n] and s[1..n, 1..n]?
  Determine which entries of the table are used in computing m[i, j]
  m[i, j] = cost of computing a product of j - i + 1 matrices
  m[i, j] depends only on costs for products of fewer than j - i + 1 matrices
  A_{i..j} = A_{i..k} A_{k+1..j}
Fill in m such that it corresponds to solving problems of increasing length
3. Computing the Optimal Costs (cont.)
Length = 0: i = j, i = 1, 2, …, n
Length = 1: j = i + 1, i = 1, 2, …, n-1
[Figure: the n × n table m indexed by i and j. The main diagonal is filled first, the next diagonal second.]
Compute rows from bottom to top and from left to right.
In a similar matrix s we keep the optimal values of k.
m[1, n] gives the optimal solution to the problem.
Example: m[i, j] = min {m[i, k] + m[k+1, j] + p_{i-1} p_k p_j}
m[2, 5] = min of
  m[2, 2] + m[3, 5] + p_1 p_2 p_5   (k = 2)
  m[2, 3] + m[4, 5] + p_1 p_3 p_5   (k = 3)
  m[2, 4] + m[5, 5] + p_1 p_4 p_5   (k = 4)
Values m[i, j] depend only on values that have been previously computed.
[Figure: the 6 × 6 table m indexed by i and j, with the entries used for m[2, 5] marked.]
Example: Compute A_1 A_2 A_3
A_1: 10×100 (p_0 × p_1)
A_2: 100×5 (p_1 × p_2)
A_3: 5×50 (p_2 × p_3)
m[i, i] = 0 for i = 1, 2, 3
m[1, 2] = m[1, 1] + m[2, 2] + p_0 p_1 p_2 = 0 + 0 + 10×100×5 = 5,000   (A_1 A_2)
m[2, 3] = m[2, 2] + m[3, 3] + p_1 p_2 p_3 = 0 + 0 + 100×5×50 = 25,000   (A_2 A_3)
m[1, 3] = min of
  m[1, 1] + m[2, 3] + p_0 p_1 p_3 = 75,000   (A_1 (A_2 A_3))
  m[1, 2] + m[3, 3] + p_0 p_2 p_3 = 7,500    ((A_1 A_2) A_3)
Resulting table (each entry shows m[i, j], with s[i, j] in parentheses):
        j=1    j=2         j=3
i=1     0      5000 (1)    7500 (2)
i=2            0           25000 (2)
i=3                        0
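The tabular computation can be sketched in Python as follows. This is a minimal illustration of the recurrence for m[i, j] filled by increasing chain length; the function name and the 1-based tables padded with an unused row and column 0 are choices made here.

```python
def matrix_chain_order(p):
    """p[i-1] x p[i] is the dimension of matrix A_i. Returns the tables (m, s), 1-based."""
    n = len(p) - 1
    m = [[0] * (n + 1) for _ in range(n + 1)]   # m[i][i] = 0
    s = [[0] * (n + 1) for _ in range(n + 1)]
    for length in range(2, n + 1):              # number of matrices in the sub-chain
        for i in range(1, n - length + 2):
            j = i + length - 1
            m[i][j] = float("inf")
            for k in range(i, j):               # try every split A_{i..k} A_{k+1..j}
                cost = m[i][k] + m[k + 1][j] + p[i - 1] * p[k] * p[j]
                if cost < m[i][j]:
                    m[i][j], s[i][j] = cost, k
    return m, s

m, s = matrix_chain_order([10, 100, 5, 50])     # the three-matrix example above
print(m[1][3], s[1][3])
```

There are Θ(n^2) table entries and each takes O(n) work, so the whole table is filled in O(n^3) time.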
4. Construct the Optimal Solution
Store the optimal choice made at each subproblem
  s[i, j] = a value of k such that an optimal parenthesization of A_{i..j} splits the product between A_k and A_{k+1}
s[1, n] is associated with the entire product A_{1..n}
  The final matrix multiplication will be split at k = s[1, n]
  A_{1..n} = A_{1..s[1, n]} A_{s[1, n]+1..n}
For each subproduct, recursively find the corresponding value of k that results in an optimal parenthesization
4. Construct the Optimal Solution (cont.)
s[i, j] = value of k such that the optimal parenthesization of A_i A_{i+1} ··· A_j splits the product between A_k and A_{k+1}
s[i, j]   i=1  i=2  i=3  i=4  i=5  i=6
j=6        3    3    3    5    5    -
j=5        3    3    3    4    -
j=4        3    3    3    -
j=3        1    2    -
j=2        1    -
j=1        -
A_{1..n} = A_{1..s[1, n]} A_{s[1, n]+1..n}
s[1, 6] = 3:  A_{1..6} = A_{1..3} A_{4..6}
s[1, 3] = 1:  A_{1..3} = A_{1..1} A_{2..3}
s[4, 6] = 5:  A_{4..6} = A_{4..5} A_{6..6}
4. Construct the Optimal Solution (cont.)
PrintParens(s, i, j)
  if i = j then print "A"i
  else
    print "("
    PrintParens(s, i, s[i, j])
    PrintParens(s, s[i, j] + 1, j)
    print ")"
Example: A_1 ··· A_6, using the table s[1..6, 1..6]
PrintOPTParens(s, i, j), abbreviated POP:
  if i = j then print "A"i
  else
    print "("
    PrintOPTParens(s, i, s[i, j])
    PrintOPTParens(s, s[i, j] + 1, j)
    print ")"
Trace:
POP(s, 1, 6): s[1, 6] = 3, i = 1, j = 6, print "("
  POP(s, 1, 3): s[1, 3] = 1, i = 1, j = 3, print "("
    POP(s, 1, 1): print "A1"
    POP(s, 2, 3): s[2, 3] = 2, i = 2, j = 3, print "("
      POP(s, 2, 2): print "A2"
      POP(s, 3, 3): print "A3"
      print ")"
    print ")"
  …
Result: ((A1(A2A3))((A4A5)A6))
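The recursive printing routine can be sketched in Python. The split table `S` below is transcribed from the s table of the six-matrix example; storing it as a dictionary keyed by (i, j) and returning a string instead of printing are choices made for this sketch.

```python
# s[i, j] for the six-matrix example, transcribed from the table above.
S = {
    (1, 2): 1, (1, 3): 1, (1, 4): 3, (1, 5): 3, (1, 6): 3,
    (2, 3): 2, (2, 4): 3, (2, 5): 3, (2, 6): 3,
    (3, 4): 3, (3, 5): 3, (3, 6): 3,
    (4, 5): 4, (4, 6): 5,
    (5, 6): 5,
}

def parens(s, i, j):
    """Return the optimal parenthesization of A_i ... A_j as a string."""
    if i == j:
        return f"A{i}"
    k = s[(i, j)]                       # optimal split point between A_k and A_{k+1}
    return "(" + parens(s, i, k) + parens(s, k + 1, j) + ")"

print(parens(S, 1, 6))
```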
4.4 The longest-common-subsequence problem
Subsequence
Common subsequence
Maximum-length common subsequence
Problem: given two sequences Xm = <x1, x2, ..., xm> and Yn = <y1, y2, ..., yn>, we wish to find a maximum-length common subsequence of Xm and Yn.
E.g.:
X = <A, B, C, B, D, A, B>
Subsequences of X:
  A subset of elements in the sequence taken in order
  <A, B, D>, <B, C, D, B>, etc.
Example
X = <A, B, C, B, D, A, B>
Y = <B, D, C, A, B, A>
<B, C, B, A> and <B, D, A, B> are longest common subsequences of X and Y (length = 4)
<B, C, A>, however, is not a LCS of X and Y
Brute-Force Solution
For every subsequence of Xm, check whether it's a subsequence of Yn
There are 2^m subsequences of Xm to check
Each subsequence takes Θ(n) time to check
  Scan Yn for the first letter, from there scan for the second, and so on
Running time: Θ(n 2^m)
Notations
Given a sequence Xm = <x1, x2, …, xm> we define the i-th prefix of Xm, for i = 0, 1, 2, …, m:
  Xi = <x1, x2, …, xi>
c[i, j] = the length of a LCS of the sequences Xi = <x1, x2, …, xi> and Yj = <y1, y2, …, yj>
Step 1: Optimal substructure
Theorem (Optimal substructure of an LCS)
Let Xm = <x1, x2, ..., xm> and Yn = <y1, y2, ..., yn> be sequences, and let Zk = <z1, z2, ..., zk> be some LCS of Xm and Yn.
1. If xm = yn, then zk = xm = yn and Z_{k-1} is an LCS of X_{m-1} and Y_{n-1}.
2. If xm ≠ yn and zk ≠ xm, then Zk is an LCS of X_{m-1} and Yn.
3. If xm ≠ yn and zk ≠ yn, then Zk is an LCS of Xm and Y_{n-1}.
Step 2: A recursive solution
Case 1: xi = yj
e.g.: Xi = <A, B, D, E>
      Yj = <Z, B, E>
  Append xi = yj to the LCS of X_{i-1} and Y_{j-1}
  Must find a LCS of X_{i-1} and Y_{j-1}: an optimal solution to a problem includes optimal solutions to subproblems
\[
c[i, j] = \begin{cases} 0 & \text{if } i = 0 \text{ or } j = 0, \\ c[i-1, j-1] + 1 & \text{if } i, j > 0 \text{ and } x_i = y_j, \\ \max(c[i, j-1],\, c[i-1, j]) & \text{if } i, j > 0 \text{ and } x_i \ne y_j. \end{cases}
\]
A Recursive Solution
Case 2: xi ≠ yj
e.g.: Xi = <A, B, D, G>
      Yj = <Z, B, D>
  Must solve two problems
    find a LCS of X_{i-1} and Yj: X_{i-1} = <A, B, D> and Yj = <Z, B, D>
    find a LCS of Xi and Y_{j-1}: Xi = <A, B, D, G> and Y_{j-1} = <Z, B>
  An optimal solution to a problem includes optimal solutions to subproblems
c[i, j] = max { c[i-1, j], c[i, j-1] }
Overlapping Subproblems
To find a LCS of Xm and Yn
  We may need to find the LCS of Xm and Y_{n-1} and that of X_{m-1} and Yn
  Both of the above subproblems have the subproblem of finding the LCS of X_{m-1} and Y_{n-1}
Subproblems share subsubproblems
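Step 3: Computing the Length of the LCS
c[i, j] = 0                              if i = 0 or j = 0
c[i, j] = c[i-1, j-1] + 1                if xi = yj
c[i, j] = max(c[i, j-1], c[i-1, j])      if xi ≠ yj
[Figure: the table c[i, j] with rows i = 0, 1, 2, …, m labelled x1, x2, …, xm and columns j = 0, 1, 2, …, n labelled y1, y2, …, yn. Row 0 and column 0 are filled with zeros; the remaining rows are filled first row first, second row second, each from left to right.]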
Additional Information
c[i, j] = 0                              if i = 0 or j = 0
c[i, j] = c[i-1, j-1] + 1                if xi = yj
c[i, j] = max(c[i, j-1], c[i-1, j])      if xi ≠ yj
A matrix b[i, j]: it tells us what choice was made to obtain the optimal value
  If xi = yj
    b[i, j] = "↖"
  Else, if c[i-1, j] ≥ c[i, j-1]
    b[i, j] = "↑"
  else
    b[i, j] = "←"
[Figure: the tables b and c with rows labelled by xi and columns by yj; entry (i, j) is computed from c[i-1, j] above it and c[i, j-1] to its left.]
LCSLength(X, Y, m, n)
  for i ← 1 to m do c[i, 0] ← 0
  for j ← 0 to n do c[0, j] ← 0
  for i ← 1 to m do
    for j ← 1 to n do
      if xi = yj then
        c[i, j] ← c[i-1, j-1] + 1
        b[i, j] ← "↖"
      else if c[i-1, j] ≥ c[i, j-1] then
        c[i, j] ← c[i-1, j]
        b[i, j] ← "↑"
      else
        c[i, j] ← c[i, j-1]
        b[i, j] ← "←"
  return c and b
Running time: Θ(mn)
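A possible Python rendering of LCSLength is sketched below. It follows the pseudocode step by step; representing the arrows by the strings "diag", "up" and "left" is a convention chosen here.

```python
def lcs_length(x, y):
    """Fill the tables c (LCS lengths of prefixes) and b (choice made at each cell)."""
    m, n = len(x), len(y)
    c = [[0] * (n + 1) for _ in range(m + 1)]    # row 0 and column 0 stay 0
    b = [[""] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:             # strings are 0-based, tables 1-based
                c[i][j] = c[i - 1][j - 1] + 1
                b[i][j] = "diag"
            elif c[i - 1][j] >= c[i][j - 1]:
                c[i][j] = c[i - 1][j]
                b[i][j] = "up"
            else:
                c[i][j] = c[i][j - 1]
                b[i][j] = "left"
    return c, b

c, b = lcs_length("ABCBDAB", "BDCABA")
print(c[7][6])   # length of an LCS of the two example sequences
```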
Example
X = <A, B, C, B, D, A, B>
Y = <B, D, C, A, B, A>
c[i, j] = 0                              if i = 0 or j = 0
c[i, j] = c[i-1, j-1] + 1                if xi = yj
c[i, j] = max(c[i, j-1], c[i-1, j])      if xi ≠ yj
         j    0    1    2    3    4    5    6
  i      yj        B    D    C    A    B    A
  0  xi       0    0    0    0    0    0    0
  1  A        0    0    0    0    1    1    1
  2  B        0    1    1    1    1    2    2
  3  C        0    1    1    2    2    2    2
  4  B        0    1    1    2    2    3    3
  5  D        0    1    2    2    2    3    3
  6  A        0    1    2    2    3    3    4
  7  B        0    1    2    2    3    4    4
if xi = yj then
  b[i, j] = "↖"
else if c[i-1, j] ≥ c[i, j-1] then
  b[i, j] = "↑"
else
  b[i, j] = "←"
4. Constructing a LCS
Start at b[m, n] and follow the arrows
When we encounter a "↖" in b[i, j], xi = yj is an element of the LCS
[Figure: the c and b tables of the example, rows xi = A, B, C, B, D, A, B and columns yj = B, D, C, A, B, A.]
Step 4: Constructing an LCS
PrintLCS(b, X, i, j)
  if i = 0 or j = 0 then return
  if b[i, j] = "↖" then
    PrintLCS(b, X, i-1, j-1)
    print xi
  else if b[i, j] = "↑" then
    PrintLCS(b, X, i-1, j)
  else
    PrintLCS(b, X, i, j-1)
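The traceback can be sketched in Python as follows. To keep the block self-contained it rebuilds the tables with the same tie-breaking rule as LCSLength; the function returns the LCS as a string rather than printing it and walks the arrows iteratively, both choices made for this sketch.

```python
def lcs(x, y):
    """Return one LCS of x and y by following the b arrows from (m, n)."""
    m, n = len(x), len(y)
    c = [[0] * (n + 1) for _ in range(m + 1)]
    b = [[""] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                c[i][j], b[i][j] = c[i - 1][j - 1] + 1, "diag"
            elif c[i - 1][j] >= c[i][j - 1]:
                c[i][j], b[i][j] = c[i - 1][j], "up"
            else:
                c[i][j], b[i][j] = c[i][j - 1], "left"
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if b[i][j] == "diag":        # x[i-1] == y[j-1] belongs to the LCS
            out.append(x[i - 1])
            i, j = i - 1, j - 1
        elif b[i][j] == "up":
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))    # elements were collected back to front

print(lcs("ABCBDAB", "BDCABA"))
```

Other tie-breaking rules may return a different LCS of the same length, such as BDAB for this input.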
Improving the Code
What can we say about how each entry c[i, j] is computed?
  It depends only on c[i-1, j-1], c[i-1, j], and c[i, j-1]
  Eliminate table b and compute in O(1) which of the three values was used to compute c[i, j]
  We save Θ(mn) space from table b
  However, we do not asymptotically decrease the auxiliary space requirements: we still need table c
If we only need the length of the LCS
  LCSLength works only on two rows of c at a time: the row being computed and the previous row
  We can reduce the asymptotic space requirements by storing only these two rows
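The two-row idea can be sketched in Python as follows. The function name is an assumption; only the length is returned, since the discarded rows would be needed to recover the subsequence itself.

```python
def lcs_length_two_rows(x, y):
    """LCS length using only the previous and the current row of c."""
    prev = [0] * (len(y) + 1)                    # row i-1 of c (row 0 initially)
    for xi in x:
        cur = [0]                                # c[i][0] = 0
        for j, yj in enumerate(y, start=1):
            if xi == yj:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur                               # the old row is no longer needed
    return prev[-1]

print(lcs_length_two_rows("ABCBDAB", "BDCABA"))
```

The extra space is proportional to the length of y, so passing the shorter sequence as y gives O(min(m, n)) space.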
An Interesting Example before this chapter
Answers for English Exam
4.5 sequence comparison
why compare sequences?
sequence comparison: operation consisting of finding which parts of the sequences are alike and which parts differ / Algorithms for an efficient solution
TT....TGTGTGCATTTAAGGGTGATAGTGTATTTGCTCTTTAAGAGCTG || || || | | ||| | |||| ||||| ||| ||| TTGACAGGTACCCAACTGTGTGTGCTGATGTA.TTGCTGGCCAAGGACTG
AGTGTTTGAGCCTCTGTTTGTGTGTAATTGAGTGTGCATGTGTGGGAGTG | | | | |||||| | |||| | || | | AAGGATC.............TCAGTAATTAATCATGCACCTATGTGGCGG
AAATTGTGGAATGTGTATGCTCATAGCACTGAGTGAAAATAAAAGATTGT ||| | ||| || || ||| | ||||||||| || |||||| | AAA.TATGGGATATGCATGTCGA...CACTGAGTG..AAGGCAAGATTAT
Two notions
• Similarity: a measure of how similar two sequences are
• Alignment: a basic operation to compare two sequences, a way of placing one sequence above the other in order to make clear the correspondence between similar characters or substrings from the sequences.
Four cases: task and application
• (1) Two sequences of tens of thousands (10,000s) of characters that differ by isolated insertions, deletions, and substitutions, occurring rarely (about one per hundred characters). Task: find the differences. Application: comparing two different sequencings.
• (2) Two sequences of hundreds of characters. Task: decide whether a prefix of one is similar to a suffix of the other and, if so, report it.
• (3) Two sequences. Task: decide whether one is a subsequence of the other. Application: searching for a certain sequence pattern.
• (4) Two sequences. Task: decide whether there are two similar substrings, one from each sequence. Application: analyzing conserved sequences.
comparing two sequences
alignments involving:
  global comparisons: entire sequences
  local comparisons: just substrings of sequences
  semiglobal comparisons: prefixes and suffixes
dynamic programming (DP)
global comparison - example
example of aligning GACGGATTAG and GATCGGAATAG:
  GA-CGGATTAG
  GATCGGAATAG
an extra T; a change from A to T; space: dash
global comparison - the basic algorithm
Definitions
Alignment:
• insertion of spaces so that both sequences have the same size
• creating a correspondence: one sequence placed over the other
• a column with spaces in both sequences is not allowed
• (spaces can be inserted at the beginning or end)
Scoring function: a measure of similarity between elements (nucleotides, amino acids, gaps)
• a match (identical characters): +1
• a mismatch (distinct characters): -1
• a space: -2
• Scoring system: rewards matches and penalizes mismatches and spaces
global comparison - the basic algorithm
  GA-CGGATTAG
  GATCGGAATAG
Example: the total score is 6 (9 matches, 1 mismatch, 1 space)
similarity: sim(s, t)
• the maximum alignment score; there may be many alignments with this score
best alignment
• an alignment whose score equals the similarity
global comparison - the basic algorithm
Basic DP algorithm for comparison of two sequences
  number of alignments between two sequences: exponential
  Efficient algorithm
  • DP: prefixes, from shorter to larger
  • Idea: an (m+1)×(n+1) array; entry (i, j) is the similarity between s[1..i] and t[1..j]
  p(i, j) = +1 if s[i] = t[j], and -1 if s[i] ≠ t[j] (shown in the upper left corners of the cells)
\[
\mathrm{sim}(s[1..i], t[1..j]) = \max \begin{cases} \mathrm{sim}(s[1..i], t[1..j-1]) - 2 \\ \mathrm{sim}(s[1..i-1], t[1..j-1]) + p(i, j) \\ \mathrm{sim}(s[1..i-1], t[1..j]) - 2 \end{cases}
\]
\[
a[i, j] = \max \begin{cases} a[i, j-1] - 2 \\ a[i-1, j-1] + p(i, j) \\ a[i-1, j] - 2 \end{cases}
\]
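global comparison - the basic algorithm
Example: s = AAAC (rows), t = AGC (columns); the value p(i, j) shown in the upper left corner of each cell in the figure is omitted here.
          j=0   1 (A)   2 (G)   3 (C)
i=0         0     -2      -4      -6
1 (A)      -2      1      -1      -3
2 (A)      -4     -1       0      -2
3 (A)      -6     -3      -2      -1
4 (C)      -8     -5      -4      -1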
global comparison - the basic algorithm
a good computing order:
  row by row: left to right on each row
  column by column: top to bottom on each column
  other orders: any order that makes sure a[i, j-1], a[i-1, j], and a[i-1, j-1] are available when a[i, j] must be computed
notes:
  parameter g: specifies the space penalty (usually g < 0); here g = -2
  scoring function p for pairs of characters: p(a, b) = 1 if a = b, and p(a, b) = -1 if a ≠ b
global comparison - the basic algorithm
Algorithm Similarity
input: sequences s and t
output: similarity between s and t
  m ← |s|
  n ← |t|
  for i ← 0 to m do
    a[i, 0] ← i×g
  for j ← 0 to n do
    a[0, j] ← j×g
  for i ← 1 to m do
    for j ← 1 to n do
      a[i, j] ← max(a[i, j-1] + g, a[i-1, j-1] + p(i, j), a[i-1, j] + g)
  return a[m, n]
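Algorithm Similarity can be sketched in Python as follows, with the scoring of this section (match +1, mismatch -1, space g = -2) as default arguments. The parameter names `match` and `mismatch` are introduced here for illustration.

```python
def similarity(s, t, g=-2, match=1, mismatch=-1):
    """Global similarity sim(s, t) by the basic dynamic-programming algorithm."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        a[i][0] = i * g                      # s[1..i] against i spaces
    for j in range(n + 1):
        a[0][j] = j * g                      # t[1..j] against j spaces
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(a[i][j - 1] + g,       # space in s
                          a[i - 1][j - 1] + p,   # s[i] aligned with t[j]
                          a[i - 1][j] + g)       # space in t
    return a[m][n]

print(similarity("AAAC", "AGC"))
print(similarity("GACGGATTAG", "GATCGGAATAG"))
```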
optimal alignments
How to construct an optimal alignment between two sequences, given the similarity array
Idea of Algorithm Align
  All we need to do is start at entry (m, n) and follow the arrows until we get to (0, 0).
  An optimal alignment can be easily constructed from right to left if we have the matrix a computed by the basic algorithm.
  The variables align-s and align-t are treated as globals in the code.
  The call Align(m, n, len) will construct an optimal alignment.
  Note: max(|s|, |t|) ≤ len ≤ m+n
Recursive algorithm for optimal alignment
Algorithm Align
input: indices i, j, array a given by algorithm Similarity
output: alignment in align-s, align-t, and length in len
  if i = 0 and j = 0 then
    len ← 0
  else if i > 0 and a[i, j] = a[i-1, j] + g then
    Align(i-1, j, len)
    len ← len + 1
    align-s[len] ← s[i]
    align-t[len] ← -
  else if i > 0 and j > 0 and a[i, j] = a[i-1, j-1] + p(i, j) then
    Align(i-1, j-1, len)
    len ← len + 1
    align-s[len] ← s[i]
    align-t[len] ← t[j]
  else // has to be j > 0 and a[i, j] = a[i, j-1] + g
    Align(i, j-1, len)
    len ← len + 1
    align-s[len] ← -
    align-t[len] ← t[j]
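A Python sketch of the traceback follows. It recomputes the array a so that the block is self-contained, and it replaces the recursion by a loop that collects columns from right to left; the order of the three tests is the same as in Algorithm Align. The function name and return format are choices made here.

```python
def align(s, t, g=-2):
    """Return one optimal global alignment of s and t as two gapped strings."""
    def p(x, y):
        return 1 if x == y else -1
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        a[i][0] = i * g
    for j in range(n + 1):
        a[0][j] = j * g
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            a[i][j] = max(a[i][j - 1] + g,
                          a[i - 1][j - 1] + p(s[i - 1], t[j - 1]),
                          a[i - 1][j] + g)
    top, bottom = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and a[i][j] == a[i - 1][j] + g:            # space in t
            top.append(s[i - 1]); bottom.append("-"); i -= 1
        elif i > 0 and j > 0 and a[i][j] == a[i - 1][j - 1] + p(s[i - 1], t[j - 1]):
            top.append(s[i - 1]); bottom.append(t[j - 1]); i -= 1; j -= 1
        else:                                               # space in s
            top.append("-"); bottom.append(t[j - 1]); j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom))

for row in align("GACGGATTAG", "GATCGGAATAG"):
    print(row)
```

Because the columns are found from right to left, the test order decides which optimal alignment is returned when several exist.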
optimal alignments
Arrow preference
  When there is a choice, a column with a space in t has precedence (maximum preference) over a column with two symbols, which in turn has precedence over a column with a space in s (minimum preference).
  For ATAT and TATA we therefore get
    -ATAT
    TATA-
  rather than
    ATAT-
    -TATA
  For AA and AAAA, there are 4!/(2!·2!) = 6 optimal alignments.
optimal alignments
Complexity of the algorithms for time and space:
• Basic dynamic programming (comparison of two sequences, computing Similarity): O(mn), or O(n^2) when both sequences have length n
• Recursive algorithm for optimal alignment: O(len) = O(m+n)
local comparison
Problem: a local alignment between s and t is an alignment between a substring of s and a substring of t
Algorithm: find the highest scoring local alignment between two sequences
local comparison
Idea:
  Data structure:
  • an (m+1)×(n+1) array
  • entry (i, j): holds the highest score of an alignment between a suffix of s[1..i] and a suffix of t[1..j]
  Initialization
  • First row and column: initialized with zeros, because for any entry (i, j) there is always the alignment between the empty suffixes of s[1..i] and t[1..j], which has score zero.
\[
a[i, j] = \max \begin{cases} a[i, j-1] + g \\ a[i-1, j-1] + p(i, j) \\ a[i-1, j] + g \\ 0 \end{cases}
\]
Local alignment
Example: s = afgfde (rows), t = abbcdae (columns)
 i \ j     0   1   2   3   4   5   6   7
               a   b   b   c   d   a   e
 0         0   0   0   0   0   0   0   0
 1  a      0   2   1   0   0   0   2   1
 2  f      0   1   1   0   0   0   1   1
 3  g      0   0   0   0   0   0   0   0
 4  f      0   0   0   0   0   0   0   0
 5  d      0   0   0   0   0   2   1   0
 6  e      0   0   0   0   0   1   1   3
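The local-comparison recurrence can be sketched in Python as below. The scoring used for the example (match +2, mismatch -1, space -1) is not stated in the text; it is inferred here from the entries of the table and should be read as an assumption.

```python
def local_similarity(s, t, match=2, mismatch=-1, g=-1):
    """Highest score of an alignment between a substring of s and a substring of t."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]    # first row and column are zeros
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(0,                     # the empty alignment is always available
                          a[i][j - 1] + g,
                          a[i - 1][j - 1] + p,
                          a[i - 1][j] + g)
            best = max(best, a[i][j])            # the answer may sit anywhere in the array
    return best

print(local_similarity("afgfde", "abbcdae"))
```

An optimal local alignment is recovered by tracing back from a maximum entry until an entry with value 0 is reached.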
Global alignment
semiglobal comparison
Similar to global comparison, but ignores the score of end spaces
definitions
  end spaces: spaces that appear before the first or after the last character in a sequence
  semiglobal comparison: score alignments ignoring some of the end spaces in the sequences
Suitability: when the lengths of the two sequences differ considerably (e.g. 8 and 18), there are many spaces in any alignment, giving a large negative contribution to the score.
semiglobal comparison example
(3.4) [score: -19, or 3 when end spaces are ignored]
  CAGCA-CTTGGATTCTCGG
  ---CAGCGTGG--------
(3.5) [score: -12]
  CAGCACTTGGATTCTCGG
  CAGC-----G-T----GG
Comparison:
  According to the scoring system we have been using so far, (3.5) (-12) is better than (3.4) (-19).
  If we disregard (do not charge for) end spaces, (3.4) (3) is better than (3.5) (-12), and its score is pretty good.
  From the point of view of finding similar regions in the sequences, the shorter sequence in (3.5) was simply torn apart brutally by the spaces just for the sake of matching its characters exactly.
  If we are looking for regions of the longer sequence that are approximately the same as the shorter sequence, then undoubtedly (3.4) is more to the point.
semiglobal comparison
Ignore the end spaces of the first sequence s: maximum along the last row
\[
\mathrm{sim}(s, t) = \max_{j=1}^{n} a[m, j]
\]
Ignore the end spaces of the second sequence t: maximum along the last column
\[
\mathrm{sim}(s, t) = \max_{i=1}^{m} a[i, n]
\]
Not charging for end spaces in either sequence: maximum along the border formed by the last row and the last column
semiglobal comparison
Ignore the beginning of the first sequence s
  Not charging for initial spaces in s leads to the best alignment between s and a suffix of t.
  Each entry (i, j) contains the highest similarity between s[1..i] and a suffix of a prefix of t, which is to say a suffix of t[1..j].
  Initialize the first row with 0; the answer is in a[m, n].
Between suffixes of both s and t:
• Both the first row and the first column are initialized with 0
• Optimal alignment: trace back from the maximum entry until the border is reached, then go back to the origin
semiglobal comparison
Summary
  Forgiving initial spaces: initialize certain positions with zero
  Forgiving final spaces: look for the maximum along certain positions
Place where spaces are not charged for    Action
Beginning of first sequence               Initialize first row with zeros
End of first sequence                     Look for maximum in last row
Beginning of second sequence              Initialize first column with zeros
End of second sequence                    Look for maximum in last column
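The four rules of the summary table can be sketched as optional switches on the basic algorithm. The flag names below are hypothetical and only mirror the four rows of the table; s is the first sequence and t the second.

```python
def semiglobal(s, t, g=-2, free_start_s=False, free_end_s=False,
               free_start_t=False, free_end_t=False):
    """Similarity in which selected end spaces are not charged for."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(1, n + 1):
        a[0][j] = 0 if free_start_s else j * g   # first row: initial spaces in s
    for i in range(1, m + 1):
        a[i][0] = 0 if free_start_t else i * g   # first column: initial spaces in t
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = 1 if s[i - 1] == t[j - 1] else -1
            a[i][j] = max(a[i][j - 1] + g, a[i - 1][j - 1] + p, a[i - 1][j] + g)
    candidates = [a[m][n]]
    if free_end_s:
        candidates += a[m]                       # maximum in the last row
    if free_end_t:
        candidates += [row[n] for row in a]      # maximum in the last column
    return max(candidates)

# The 18-character and 8-character sequences of the example, with the end
# spaces of the shorter (second) sequence not charged for.
print(semiglobal("CAGCACTTGGATTCTCGG", "CAGCGTGG", free_start_t=True, free_end_t=True))
```

With all four flags left at False the function reduces to the global similarity.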
extensions to the basic algorithms
motivation
  Better algorithms for special situations
improvements
  Computational complexity
  • space: O(mn) ⇒ O(m+n), at the cost of doubling the time
  • time reduction: only for similar sequences and for a certain family of scoring parameters
  Biological interpretation of alignments
  • A series of consecutive spaces is more realistic than individual spaces
saving space
Computing sim(s, t)
Algorithm BestScore
input: sequences s and t
output: vector a
  m ← |s|
  n ← |t|
  for j ← 0 to n do
    a[j] ← j×g
  for i ← 1 to m do
    old ← a[0]
    a[0] ← i×g
    for j ← 1 to n do
      temp ← a[j]
      a[j] ← max(a[j] + g, old + p(i, j), a[j-1] + g)
      old ← temp
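Algorithm BestScore translates to Python as sketched below, with the default scoring of this section (match +1, mismatch -1, g = -2). It returns the final vector a, that is, the last row of the full similarity array.

```python
def best_score(s, t, g=-2):
    """Linear-space similarity: returns the last row of the DP array."""
    n = len(t)
    a = [j * g for j in range(n + 1)]        # row 0
    for i in range(1, len(s) + 1):
        old = a[0]                           # a[i-1][0], the diagonal neighbour of column 1
        a[0] = i * g
        for j in range(1, n + 1):
            temp = a[j]                      # a[i-1][j], needed as the next diagonal
            p = 1 if s[i - 1] == t[j - 1] else -1
            a[j] = max(a[j] + g,             # from above:    a[i-1][j]   + g
                       old + p,              # from diagonal: a[i-1][j-1] + p
                       a[j - 1] + g)         # from the left: a[i][j-1]   + g
            old = temp
    return a

print(best_score("GACGGATTAG", "GATCGGAATAG")[-1])
```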
saving space
An optimal alignment in linear space
Idea: divide and conquer strategy
Fix a position i in s, and consider what s[i] is matched with in the alignment. There are two possibilities:
1. The symbol t[j] will match s[i], for some j in 1..n:
\[
\mathrm{optimal}\binom{s[1..i-1]}{t[1..j-1]} + \binom{s[i]}{t[j]} + \mathrm{optimal}\binom{s[i+1..m]}{t[j+1..n]} \tag{3.6}
\]
2. A space between t[j] and t[j+1] will match s[i], for some j in 0..n:
\[
\mathrm{optimal}\binom{s[1..i-1]}{t[1..j]} + \binom{s[i]}{-} + \mathrm{optimal}\binom{s[i+1..m]}{t[j+1..n]} \tag{3.7}
\]
Recursive method
1. For a fixed i, choose the best among the possibilities (3.6) and (3.7).
2. To decide which value of i to use in each recursive call: pick i as close as possible to the middle of the sequence.
saving space
general gap penalty functions
Examples
  AAAAAAAA      AAAAAAAA
  -A-A-A--      AAA-----
motivation
  Gap definition: a consecutive number k > 1 of spaces
  Fact: the occurrence of a gap with k spaces is more probable than the occurrence of k isolated spaces when mutations are involved
  Reason: a gap is usually due to a single mutational event, while separated spaces are due to distinct events
  Penalty function: a linear function w(k) = bk makes no distinction between clustered and isolated spaces
general gap penalty functions
blocks
  Scores are additive at the block level rather than at the column level
  Three kinds of blocks
  • 1. Two aligned characters from ∑
  • 2. A maximal series of consecutive characters in t aligned with spaces in s
  • 3. A maximal series of consecutive characters in s aligned with spaces in t
  • Scores: 1) p(a, b); 2) and 3) -w(k)
104
general gap penalty functions
Example
  S1 AAC---AATTCCGACTAC
  S2 ACTACCT------CGC--
Blocks cannot follow other blocks arbitrarily, especially blocks of type 2 or 3
Algorithm
Initialization / data structure
• 3 arrays of (m+1)×(n+1): one for each type of ending block
  a[0, 0] = 0: alignments ending in character-character blocks
  b[0, j] = -w(j): alignments ending in spaces in s
  c[i, 0] = -w(i): alignments ending in spaces in t
  Others: -∞
105
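general gap penalty functions
Updated process / recurrence relations

\[
a[i,j] = p(i,j) + \max \begin{cases} a[i-1,j-1] \\ b[i-1,j-1] \\ c[i-1,j-1] \end{cases}
\]

\[
b[i,j] = \max \begin{cases} a[i,j-k] - w(k), & \text{for } 1 \le k \le j \\ c[i,j-k] - w(k), & \text{for } 1 \le k \le j \end{cases}
\]

\[
c[i,j] = \max \begin{cases} a[i-k,j] - w(k), & \text{for } 1 \le k \le i \\ b[i-k,j] - w(k), & \text{for } 1 \le k \le i \end{cases}
\]

Sim(s, t): maximum among a[m,n], b[m,n] and c[m,n]
Time complexity: O(mn² + m²n)
• Entry (i, j) needs 3 + 2j + 2i accesses to previously computed entries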
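The three-array recurrence for a general gap function can be sketched as below. The function name `sim_general_gap`, the match/mismatch values, and passing w as a Python callable are assumptions for illustration.

```python
NEG_INF = float("-inf")


def sim_general_gap(s, t, w, match=1, mismatch=-1):
    """Similarity with a general gap penalty w(k), in O(mn^2 + m^2 n) time.

    a: alignments ending in a character-character block
    b: alignments ending in spaces in s
    c: alignments ending in spaces in t
    """
    m, n = len(s), len(t)
    a = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    b = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    c = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    a[0][0] = 0
    for j in range(1, n + 1):
        b[0][j] = -w(j)
    for i in range(1, m + 1):
        c[i][0] = -w(i)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = p + max(a[i - 1][j - 1], b[i - 1][j - 1], c[i - 1][j - 1])
            # A block of spaces in s of length k must follow a block of another type.
            b[i][j] = max(max(a[i][j - k], c[i][j - k]) - w(k) for k in range(1, j + 1))
            c[i][j] = max(max(a[i - k][j], b[i - k][j]) - w(k) for k in range(1, i + 1))
    return max(a[m][n], b[m][n], c[m][n])


# A constant penalty of 3 per gap, regardless of its length.
print(sim_general_gap("AAA", "A", lambda k: 3))
```

With the linear function w(k) = 2k this reduces to the basic algorithm with g = -2, which gives a simple sanity check.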
106
general gap penalty functions
\[
\sum_{i=1}^{m} \sum_{j=1}^{n} (2i + 2j + 3) = mn^2 + 5mn + m^2 n
\]

Optimal alignment: as before
Summary
• Considerable increase in time and space (auxiliary arrays): critical
107
affine gap penalty functions
motivation
• Linear gap function: O(n²)
• General gap function: O(n³)
• Here: a less general function that still charges less for a gap with k spaces than for k isolated spaces. Can it be handled in O(n²)?
Two concepts
• Subadditive function: w(k1+k2+…+kn) ≤ w(k1) + … + w(kn)
• Affine function:
  • w(k) = h + gk for k ≥ 1, with w(0) = 0
  • Subadditive if h, g > 0
  • The first space in a gap costs h+g; each other space costs g
108
affine gap penalty functions
Algorithm
Initialization / data structure
• 3 arrays of (m+1)×(n+1): one for each type of ending block
a[0, 0]=0
a[i, 0]= -∞ for 1≤i≤m
a[0, j]= -∞ for 1≤j≤n
b[i, 0]= -∞ for 1≤i≤m
b[0, j]= -(h+gj) for 1≤j≤n
c[i, 0]= -(h+gi) for 1≤i≤m
c[0, j]= -∞ for 1≤j≤n
109
affine gap penalty functions
Updated process / recurrence relations

\[
a[i,j] = p(i,j) + \max \begin{cases} a[i-1,j-1] \\ b[i-1,j-1] \\ c[i-1,j-1] \end{cases}
\]

\[
b[i,j] = \max \begin{cases} -(h+g) + a[i,j-1] \\ -g + b[i,j-1] \\ -(h+g) + c[i,j-1] \end{cases}
\]

\[
c[i,j] = \max \begin{cases} -(h+g) + a[i-1,j] \\ -(h+g) + b[i-1,j] \\ -g + c[i-1,j] \end{cases}
\]

Sim(s, t): maximum among a[m,n], b[m,n] and c[m,n]
Optimal alignment: trace back from (m, n) until (0, 0)
Time complexity: O(mn) / space complexity: O(mn) or O(n)
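A minimal sketch of the affine recurrences, using the initialization from the previous slide. The name `sim_affine` and the match/mismatch values are illustrative assumptions; h and g are the positive penalty parameters of w(k) = h + gk.

```python
NEG_INF = float("-inf")


def sim_affine(s, t, h, g, match=1, mismatch=-1):
    """Similarity with affine gap penalty w(k) = h + g*k, in O(mn) time."""
    m, n = len(s), len(t)
    a = [[NEG_INF] * (n + 1) for _ in range(m + 1)]   # ends in character-character block
    b = [[NEG_INF] * (n + 1) for _ in range(m + 1)]   # ends in spaces in s
    c = [[NEG_INF] * (n + 1) for _ in range(m + 1)]   # ends in spaces in t
    a[0][0] = 0
    for j in range(1, n + 1):
        b[0][j] = -(h + g * j)
    for i in range(1, m + 1):
        c[i][0] = -(h + g * i)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = p + max(a[i - 1][j - 1], b[i - 1][j - 1], c[i - 1][j - 1])
            # Extending a gap costs g; opening a new one costs h + g.
            b[i][j] = max(-(h + g) + a[i][j - 1], -g + b[i][j - 1], -(h + g) + c[i][j - 1])
            c[i][j] = max(-(h + g) + a[i - 1][j], -(h + g) + b[i - 1][j], -g + c[i - 1][j])
    return max(a[m][n], b[m][n], c[m][n])


print(sim_affine("AAA", "A", h=2, g=1))
```

Each entry looks at a constant number of neighbours, which is where the O(mn) bound comes from.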
110
comparing similar sequences
motivation
• Two sequences are similar ("look alike") when the scores of their alignments are very close to the maximum possible
• Goal: faster algorithms to find good alignments
• Global alignments only
assumption
• s and t have the same length n
properties
• The DP matrix is a square matrix
• The main diagonal from (0, 0) to (n, n) is the unique alignment without spaces between s and t
• Spaces are always inserted in pairs, one in s and one in t
• Each space throws the alignment off the main diagonal
• The number of space pairs is greater than or equal to the maximum departure from the main diagonal
111
comparing similar sequences
[Figure: DP matrix for s = GCGCATGGATTGAGCGA and t = TGCGCCATGGATGAGCA]
112
comparing similar sequences
example
  s = GCGCATGGATTGAGCGA
  t = TGCGCCATGGATGAGCA

  s = -GCGC-ATGGATTGAGCGA
  t = TGCGCCATGGAT-GAGC-A
idea
• If the sequences are similar, the path of the best alignment stays near the main diagonal
• It is not necessary to fill the entire matrix ⇒ a narrow band suffices
Algorithm KBand
• Fill in a band of horizontal (or vertical) width 2k+1 around the main diagonal
• a[n, n]: highest score of an alignment confined to that band
• O(kn): a big win over the usual O(n²) if k ≪ n
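The band-filling idea can be sketched as follows. The name `kband`, the dictionary used to store only band cells, and the scoring values are assumptions for illustration; both sequences must have the same length n.

```python
def kband(s, t, k, match=1, mismatch=-1, gap=-2):
    """Best score of a global alignment confined to |i - j| <= k (equal-length inputs)."""
    n = len(s)
    if len(t) != n:
        raise ValueError("kband assumes sequences of equal length")
    a = {(0, 0): 0}                       # only cells inside the band are stored
    for d in range(1, min(k, n) + 1):
        a[(d, 0)] = d * gap
        a[(0, d)] = d * gap
    for i in range(1, n + 1):
        for d in range(-k, k + 1):
            j = i + d
            if 1 <= j <= n:
                p = match if s[i - 1] == t[j - 1] else mismatch
                best = a[(i - 1, j - 1)] + p          # diagonal neighbour is always in the band
                if abs((i - 1) - j) <= k:             # upper neighbour inside the band?
                    best = max(best, a[(i - 1, j)] + gap)
                if abs(i - (j - 1)) <= k:             # left neighbour inside the band?
                    best = max(best, a[(i, j - 1)] + gap)
                a[(i, j)] = best
    return a[(n, n)]


# With gap = -1, the shifted alignment -AAAC / CAAA- needs a band of k = 1.
print(kband("AAAC", "CAAA", 0, gap=-1), kband("AAAC", "CAAA", 1, gap=-1))
```

At most (2k + 1) cells per row are computed, so the work is O(kn).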
113
comparing multiple sequences
motivation
• Multiple alignment (MA) of sequences s1, …, sk: shows which parts of the sequences are similar and which parts are different
• Multiple alignment is a generalization of pairwise alignment and uses a similar operation: spaces are inserted so that all sequences have the same length, and no column is made exclusively of spaces
114
comparing multiple sequences
• Multiple alignments are more common with proteins (amino acid sequences)
• How to evaluate different MAs of the same set of sequences?
115
comparing multiple sequences
Multiple sequence alignments are used for many reasons, including:
(1) to detect regions of variability or conservation in a family of proteins,
(2) to provide stronger evidence than pairwise similarity for structural and functional inferences,
(3) to serve as the first step in phylogenetic reconstruction, in RNA secondary structure
prediction, and in building profiles (probabilistic models) for protein families or DNA
signals.
116
comparing multiple sequences
Scoring schemes:
(1) SP measure: scoring an alignment based on pairwise alignments
(2) star alignment
(3) tree alignment
117
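the SP measure
Scoring MA
• Additive functions here: the score of an alignment is the sum of its column scores

\[
\mathrm{Score}(\alpha) = \sum_i s_i
\]

  where s_i is the score of column i
• One way: a value for each combination of arguments / for k = 10 ⇒ 1023, more than 1000
"Reasonable" properties
(1) The function is independent of the order of the sequences, i.e. SP(I, -, I, V) = SP(V, I, I, -)
(2) It rewards the presence of many equal or strongly related residues and penalizes unrelated residues and spaces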
the SP measure
The sum-of-pairs (SP) function is a function that meets the two properties.
E.g., SP-score(I, -, I, V) = p(I, -) + p(I, I) + p(I, V) + p(-, I) + p(-, V) + p(I, V)
• With match = 1, mismatch = -1, and gap = -2:
  SP(I, -, I, V) = score(I, -) + score(I, I) + score(I, V) + score(-, I) + score(-, V) + score(I, V)
  = -2 + 1 + (-1) + (-2) + (-2) + (-1) = -7
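The column score in this example can be checked with a few lines of Python. The names `pair_score` and `sp_column` are hypothetical, and scoring a pair of two spaces as 0 is an assumption that follows the usual SP convention.

```python
from itertools import combinations


def pair_score(x, y, match=1, mismatch=-1, gap=-2):
    """Score of one aligned pair; two spaces score 0 by convention."""
    if x == "-" and y == "-":
        return 0
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch


def sp_column(column):
    """Sum-of-pairs score of one column of a multiple alignment."""
    return sum(pair_score(x, y) for x, y in combinations(column, 2))


print(sp_column("I-IV"))   # -7
print(sp_column("VII-"))   # -7, the order of the sequences does not matter
```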
119
the SP measure
Although there is never an entire column of gaps, if we look at any 2 sequences in the alignment, there may be columns where both have gaps
p(-, -)=0
120
the SP measure
1. In an MA, select two of the sequences, forget all the rest, and remove the columns with two spaces to derive a true PA
2. E.g.,
   PEAALYGRFT---IKSDVM
   PEALNYGRY---SSESDVW
   becomes
   PEAALYGRFT-IKSDVM
   PEALNYGRY-SSESDVW
• αij: PA induced by α on si and sj

\[
\text{SP-score}(\alpha) = \sum_{i<j} \mathrm{score}(\alpha_{ij})
\]

summary
• Way 1: compute the score of each column, then add all column scores
• Way 2: compute the scores of the induced PAs, then add these scores
• The two ways agree only if p(-, -) = 0
• Common: both add the scores p(s'i[c], s'j[c]) / s': extended sequence (with spaces inserted)
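A small sketch comparing the two ways of computing the SP score, assuming p(-, -) = 0 and the illustrative scoring match = 1, mismatch = -1, gap = -2. The function names are hypothetical.

```python
from itertools import combinations


def pair_score(x, y, match=1, mismatch=-1, gap=-2):
    """Score of one aligned pair; p(-, -) = 0."""
    if x == "-" and y == "-":
        return 0
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch


def induced_pair(row_i, row_j):
    """Pairwise alignment induced by an MA: drop columns where both rows have a space."""
    kept = [(x, y) for x, y in zip(row_i, row_j) if not (x == "-" and y == "-")]
    return "".join(x for x, _ in kept), "".join(y for _, y in kept)


def sp_by_columns(rows):
    """Way 1: score every column with the SP function, then add the column scores."""
    return sum(
        sum(pair_score(x, y) for x, y in combinations(col, 2))
        for col in zip(*rows)
    )


def sp_by_pairs(rows):
    """Way 2: score every induced pairwise alignment, then add these scores."""
    total = 0
    for row_i, row_j in combinations(rows, 2):
        ai, aj = induced_pair(row_i, row_j)
        total += sum(pair_score(x, y) for x, y in zip(ai, aj))
    return total


rows = ["AT", "A-", "-T", "AT", "AT"]
print(sp_by_columns(rows), sp_by_pairs(rows))
print(induced_pair("PEAALYGRFT---IKSDVM", "PEALNYGRY---SSESDVW"))
```

Because the dropped columns contribute p(-, -) = 0, both functions return the same total.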
121
the SP measure
Note: An optimum MSA-SP may not give the optimal pairwise alignments. For example, the optimum MSA-SP for the sequences AT, A, T, AT and AT is as follows, with the induced alignment
between A & T also shown.
122
the SP measure
However, the optimum pairwise arrangement for the two sequences A and T is
which, if it were part of the MSA would yield:
123
DP
124
DP
Therefore, we must:
(1) fill out the entire array, which has (n + 1)^k cells,
(2) consider 2^k - 1 possibilities for each cell in the array,
(3) compute the sum-of-pairs measure for each of these possibilities.
Therefore, the time complexity of DP is

\[
(n + 1)^k \, (2^k - 1) \binom{k}{2}
\]

This is impractical even for numbers as small as k = 10
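To get a feeling for how fast this bound grows, the product can simply be evaluated. The function name `msa_dp_operations` and the sample values of n and k are illustrative assumptions.

```python
from math import comb


def msa_dp_operations(n, k):
    """Evaluate (n + 1)^k * (2^k - 1) * C(k, 2) for k sequences of length n."""
    cells = (n + 1) ** k            # entries of the k-dimensional array
    predecessors = 2 ** k - 1       # possibilities considered per cell
    pairs = comb(k, 2)              # pairs summed by the SP measure
    return cells * predecessors * pairs


for k in (2, 3, 5, 10):
    print(k, msa_dp_operations(100, k))
```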
125
DP
summary
• Total time complexity: O(k² 2^k n^k) ⇒ exponential in the number of input sequences
• Polynomial algorithms: unlikely ← the MA problem with the SP measure is NP-complete
126
star alignments
Heuristic method for multiple sequence alignment
• Select a sequence sc as the center of the star
• For each sequence si among s1, …, sk with index i ≠ c, perform a global alignment of si with sc
• Aggregate the alignments using the principle "once a gap, always a gap"
127
star alignments
For example, say your sequences are:
  S1 ATTGCCATT
  S2 ATGGCCATT
  S3 ATCCAATTTT
  S4 ATCTTCTT
  S5 ACTGACC

(1) Find the center sequence: choose sc to maximize

\[
\sum_{i \neq c} \mathrm{score}(s_i, s_c)
\]
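Step (1) can be sketched as follows. The names `sim` and `pick_center` and the scoring values (match = 1, mismatch = -1, gap = -2) are assumptions; the slides do not state the scoring scheme used for this example, so no result is claimed here for S1 to S5.

```python
def sim(s, t, match=1, mismatch=-1, gap=-2):
    """Global alignment score computed in linear space."""
    a = [j * gap for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        old, a[0] = a[0], i * gap
        for j in range(1, len(t) + 1):
            temp = a[j]
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[j] = max(a[j] + gap, old + p, a[j - 1] + gap)
            old = temp
    return a[-1]


def pick_center(seqs):
    """Index c maximizing the sum of score(s_i, s_c) over all i != c (first one on ties)."""
    def total(c):
        return sum(sim(seqs[i], seqs[c]) for i in range(len(seqs)) if i != c)
    return max(range(len(seqs)), key=total)


seqs = ["ATTGCCATT", "ATGGCCATT", "ATCCAATTTT", "ATCTTCTT", "ACTGACC"]
c = pick_center(seqs)
print("center: S%d = %s" % (c + 1, seqs[c]))
```

This needs k(k-1)/2 distinct pairwise alignments if the symmetric scores are cached; the sketch recomputes them for brevity.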
128
star alignments
(2) do pairwise alignments
129
star alignments
(3) build the multiple alignment
130
tree alignments
• Model the k sequences with a tree having k leaves (one-to-one correspondence)
• Compute a weight for each edge, which is the similarity score
• The sum of all the weights is the score of the tree
• Find the tree with maximum score
131
tree alignments
motivation: an evolutionary tree for the sequences involved
Comparison with SP measure
• Tree alignments: compute overall similarity based on PAs along tree edges
• SP measure: takes into account all pairwise similarities
tree alignment
• Leaves: one-to-one correspondence between k leaves and k sequences
• Interior nodes: sequences are to be assigned
• Weight of each edge: similarity between the two sequences in the nodes incident to the edge
• Score of tree: sum of all weights
• Tree alignment problem: find a sequence assignment maximizing the score
• Star alignments: a particular case in which the tree is a star
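Scoring a tree for a given sequence assignment can be sketched as below. The names `sim` and `tree_score`, the node labels, and the scoring values (match = +1, mismatch = 0, gap = -1) are assumptions for illustration; finding the best assignment for the interior nodes is the hard part and is not attempted here.

```python
def sim(s, t, match=1, mismatch=0, gap=-1):
    """Global alignment score computed in linear space."""
    a = [j * gap for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        old, a[0] = a[0], i * gap
        for j in range(1, len(t) + 1):
            temp = a[j]
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[j] = max(a[j] + gap, old + p, a[j - 1] + gap)
            old = temp
    return a[-1]


def tree_score(edges, assignment):
    """Sum of edge weights; each weight is the similarity of the two incident sequences."""
    return sum(sim(assignment[u], assignment[v]) for u, v in edges)


# Hypothetical tree: two leaves attached to one interior node x.
assignment = {"leaf1": "CAT", "leaf2": "CTG", "x": "CT"}
edges = [("leaf1", "x"), ("leaf2", "x")]
print(tree_score(edges, assignment))
```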
132
tree alignments
example
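• Match +1, gap -1, mismatch 0
• Tree alignment problem: NP-hard
• Approximation algorithms: weight = distance / find the sequence assignment minimizing the distance sum
[Figure: tree with node sequences CAT, CTG and CGGT, interior assignments x = CT and y = CG, and edge weights 1, 1, 2, 1, 1]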
133
References
• Algorithms on Strings, Trees and Sequences (Dan Gusfield)
• Introduction to Computational Molecular Biology (Setubal; Chinese edition: 计算分子生物学导论)
134
4.6 0/1 Knapsack Problem
The 0-1 knapsack problem
• A thief robbing a store finds n items: the i-th item is worth vi dollars and weighs wi pounds (vi, wi integers)
• The thief can only carry W pounds in his knapsack
• Items must be taken entirely or left behind
• Which items should the thief take to maximize the value of his load?
135
The 0-1 Knapsack Problem
Thief has a knapsack of capacity W
There are n items: for i-th item value vi and weight wi
Goal:
find xi ∈ {0, 1}, i = 1, 2, …, n, such that

\[
\sum_{i=1}^{n} w_i x_i \le W \quad \text{and} \quad \sum_{i=1}^{n} x_i v_i \text{ is maximum}
\]
136
Optimal Substructure
Consider the most valuable load that weighs at most W pounds.
If we remove item j from this load, the remaining load must be the most valuable load weighing at most W - wj that can be taken from the remaining n - 1 items (otherwise a better remaining load plus item j would beat the original load).
137
0-1 Knapsack - Dynamic Programming
V(i, w): the maximum profit that can be obtained from items 1 to i, if the knapsack has size w
Case 1: thief takes item i
  V(i, w) = vi + V(i - 1, w - wi)
Case 2: thief does not take item i
  V(i, w) = V(i - 1, w)
138
DPKnapsack(S, W)
  for w ← 0 to w1 - 1 do V[1, w] ← 0
  for w ← w1 to W do V[1, w] ← v1
  for i ← 2 to n do
    for w ← 0 to W do
      if wi > w then
        V[i, w] ← V[i-1, w]
        b[i, w] ← "↑"
      else if V[i-1, w] > V[i-1, w-wi] + vi then
        V[i, w] ← V[i-1, w]
        b[i, w] ← "↑"
      else
        V[i, w] ← V[i-1, w-wi] + vi
        b[i, w] ← "↖"
  return V and b

Running time: Θ(nW)
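A runnable sketch of the same table-filling idea, with two deliberate simplifications that are assumptions of this sketch and not part of the pseudocode: an extra row 0 of zeros replaces the special initialization of row 1, and the chosen items are recovered by comparing V[i][w] with V[i-1][w] instead of storing the arrow table b (on ties the item is left out).

```python
def knapsack(weights, values, W):
    """Return (best value, sorted list of chosen 1-based item numbers)."""
    n = len(weights)
    V = [[0] * (W + 1) for _ in range(n + 1)]     # V[0][w] = 0: no items available
    for i in range(1, n + 1):
        wi, vi = weights[i - 1], values[i - 1]
        for w in range(W + 1):
            V[i][w] = V[i - 1][w]                 # item i not taken
            if wi <= w and V[i - 1][w - wi] + vi > V[i][w]:
                V[i][w] = V[i - 1][w - wi] + vi   # item i taken
    # Trace back from V[n][W]: a change between rows means item i was taken.
    items, w = [], W
    for i in range(n, 0, -1):
        if V[i][w] != V[i - 1][w]:
            items.append(i)
            w -= weights[i - 1]
    return V[n][W], sorted(items)


# Four items with (w, v) = (2, 12), (1, 10), (3, 20), (2, 15) and W = 5.
print(knapsack([2, 1, 3, 2], [12, 10, 20, 15], 5))   # (37, [1, 2, 4])
```

The two nested loops give the Θ(nW) running time; the traceback adds only O(n).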
139
0-1 Knapsack - Dynamic Programming
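V(i, w) = max { vi + V(i - 1, w - wi), V(i - 1, w) }
• First term: item i was taken
• Second term: item i was not taken
[Figure: table V with rows 0..n and columns 0..W; row 0 and column 0 are initialized to 0; entry (i, w) is computed from entries (i-1, w-wi) and (i-1, w) of row i-1]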
140
Example

V(i, w) = max { vi + V(i - 1, w - wi), V(i - 1, w) }

Items (W = 5):

Item i   wi   vi
1        2    12
2        1    10
3        3    20
4        2    15

Table V(i, w):

i \ w    0    1    2    3    4    5
0        0    0    0    0    0    0
1        0    0   12   12   12   12
2        0   10   12   22   22   22
3        0   10   12   22   30   32
4        0   10   15   25   30   37

Computation:

V(1, 1) = V(0, 1) = 0
V(1, 2) = max{12+0, 0} = 12
V(1, 3) = max{12+0, 0} = 12
V(1, 4) = max{12+0, 0} = 12
V(1, 5) = max{12+0, 0} = 12

V(2, 1) = max{10+0, 0} = 10
V(2, 2) = max{10+0, 12} = 12
V(2, 3) = max{10+12, 12} = 22
V(2, 4) = max{10+12, 12} = 22
V(2, 5) = max{10+12, 12} = 22

V(3, 1) = V(2, 1) = 10
V(3, 2) = V(2, 2) = 12
V(3, 3) = max{20+0, 22} = 22
V(3, 4) = max{20+10, 22} = 30
V(3, 5) = max{20+12, 22} = 32

V(4, 1) = V(3, 1) = 10
V(4, 2) = max{15+0, 12} = 15
V(4, 3) = max{15+10, 22} = 25
V(4, 4) = max{15+12, 30} = 30
V(4, 5) = max{15+22, 32} = 37
141
Reconstructing the Optimal Solution
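[Figure: the table V(i, w) from the example (W = 5), with the traceback path from V(4, 5) marked]
• Start at V(n, W)
• When you go left-up, item i has been taken
• When you go straight up, item i has not been taken
For the example: V(4, 5) = 37 ≠ V(3, 5) = 32, so item 4 is taken (w becomes 3); V(3, 3) = V(2, 3) = 22, so item 3 is not taken; V(2, 3) = 22 ≠ V(1, 3) = 12, so item 2 is taken (w becomes 2); V(1, 2) = 12 ≠ V(0, 2) = 0, so item 1 is taken.
• Item 4
• Item 2
• Item 1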
142
The problem we presented before
We get to observe the “qualities” of m secretaries: X1,…,Xm sequentially according to a random order. Our goal is to maximize the probability of finding the “best” candidate with no looking back!
Pm=1/m;