Dynamic programming: one algorithmic key to many biological locks
Mikhail Gelfand
RTCB, IITP, RAS and FBB, MSU
2010-2011

BIOINFORMATICS FOR BIOLOGISTS
Pavel Pevzner and Ron Shamir, eds. (Cambridge University Press, 2011)
Ch. 4. DYNAMIC PROGRAMMING: ONE ALGORITHMIC KEY FOR MANY BIOLOGICAL LOCKS
Mikhail Gelfand
Research and Training Center “Bioinformatics” of the Institute for Information Transmission Problems, RAS, and Faculty of Bioengineering and Bioinformatics, M.V. Lomonosov Moscow State University
Alignment
Three (of many) alignments of two sequences. A plus denotes a match; a dot, a mismatch; a minus, a gap. (a) Two matches, five mismatches; (b) three matches, one mismatch, two gaps of size three (six indels, that is, one-nucleotide insertions/deletions); (c) four matches, two gaps of size three (six indels).
The number of alignments is large
Number of alignments of two sequences of length N: ~ $(1+\sqrt{2})^{2N+1}/\sqrt{N}$
at N = 1000: ≈ $10^{767}$
at N = 100: ≈ $10^{76}$
Number of elementary particles in the Universe: ≈ $10^{80}$
Assume 1 operation per alignment and $10^{12}$ operations per second
=> need about $10^{57}$ years even for N = 100
=> we cannot consider them one by one
Gene recognition
Segmentation of a genomic fragment into protein-coding and non-coding regions
based on differences in statistical properties of these regions
difficult in eukaryotes due to the existence of introns, non-coding regions within genes
Toy example
How many operations are needed to calculate
$\sum_{i=1}^{m} \sum_{j=1}^{n} x_i y_j =$
$= x_1 y_1 + x_1 y_2 + \dots + x_1 y_n +$
$+ x_2 y_1 + x_2 y_2 + \dots + x_2 y_n +$
$+ \dots +$
$+ x_m y_1 + x_m y_2 + \dots + x_m y_n$ ?
Naïve answer: mn multiplications and mn–1 additions
but rewrite as…
$(x_1 + x_2 + \dots + x_m) \cdot (y_1 + y_2 + \dots + y_n) = \sum_{i=1}^{m} x_i \cdot \sum_{j=1}^{n} y_j$
and it becomes m+n–2 additions and just 1 multiplication
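The operation counts above can be checked with a short sketch. The function names `naive_sum` and `factored_sum` and the sample numbers are illustrative assumptions, not part of the slides.

```python
def naive_sum(xs, ys):
    """Sum of all products x_i * y_j; returns (value, multiplications, additions)."""
    total, mults, adds = 0, 0, 0
    first = True
    for x in xs:
        for y in ys:
            prod = x * y
            mults += 1
            if first:
                total, first = prod, False  # the first term needs no addition
            else:
                total += prod
                adds += 1
    return total, mults, adds


def factored_sum(xs, ys):
    """Same value computed as (sum of xs) * (sum of ys)."""
    sx, sy, adds = xs[0], ys[0], 0
    for x in xs[1:]:
        sx += x
        adds += 1
    for y in ys[1:]:
        sy += y
        adds += 1
    return sx * sy, 1, adds


xs, ys = [1, 2, 3], [4, 5, 6, 7]      # m = 3, n = 4
print(naive_sum(xs, ys))              # (132, 12, 11): mn and mn-1
print(factored_sum(xs, ys))           # (132, 1, 5): 1 and m+n-2
```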
Quiz
How many multiplications do we need to calculate
$\prod_{i=1}^{m} \prod_{j=1}^{n} x_i^{y_j} = x_1^{y_1} \cdot x_1^{y_2} \cdots x_1^{y_n} \cdot x_2^{y_1} \cdot x_2^{y_2} \cdots x_2^{y_n} \cdots x_m^{y_1} \cdot x_m^{y_2} \cdots x_m^{y_n}$
if we are (a) naïve? (b) sophisticated? (c) What if, in addition to multiplication, we have an operation “taking to the power”? (d) What if we may perform not only multiplication, but also addition?
Lesson
Restructuring the order of calculations using properties of the data may sharply decrease the number of operations
Graphs
Vertices/nodes: $v_1, v_2, \dots, v_n$
Arcs/edges: directed pairs of vertices, $a_m = (v_i, v_j)$
[Figure: example graphs, one containing cycles and one with multiple sources and sinks]
“Bad” graphs and not graphs
[Figure: multiple arcs; loop; multiple components; not a graph (hanging arc); undirected graph]
Sources, sinks, paths, cycles
A source is a vertex that is not an end vertex for any arc.
A sink is a vertex that is not a start vertex for any arc.
A walk w of length N is an ordered set of N arcs $w = (a_1, \dots, a_N)$ such that the end vertex of arc $a_n = (b_n, e_n)$ coincides with the start vertex of arc $a_{n+1}$, $e_n = b_{n+1}$, for all n = 1, …, N–1.
[Figure: three example graphs (no source and sink; multiple sources and sinks; one source and one sink) with example walks w = (a(v1,v3), a(v3,v2), a(v2,v4), a(v4,v3), a(v3,v1), a(v1,v3)), w = (a(v4,v5), a(v5,v3)), and w = (a(v2,v1))]
Sources, sinks, paths, cycles
In a graph without loops and multiple arcs, each walk may also be defined as an ordered set of vertices $w = (v_1, \dots, v_{N+1})$ such that for each pair of adjacent vertices $v_n, v_{n+1}$ there is an arc $a_n = (v_n, v_{n+1})$, n = 1, …, N.
[Figure: the same three graphs (no source and sink; multiple sources and sinks; one source and one sink) with the example walks written as vertex sequences: w = (v1,v3,v2,v4,v3,v1,v3), w = (v4,v5,v3), and w = (v2,v1)]
Sources, sinks, paths, cycles
A path is a walk in which no arc is passed twice.
A cycle is a path in which the end vertex of the last arc $a_N$ coincides with the start vertex of the first arc $a_1$, $e_N = b_1$.
An acyclic graph contains no cycles.
[Figure: the same three graphs with example paths. In the graph with no source and sink, p = c = (v1,v3,v2,v4,v3,v1) is both a path and a cycle, so this graph is cyclic; the other two graphs, with example paths p = (v4,v5,v3) and p = (v2,v1), are acyclic.]
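The definitions of sources, sinks, walks, paths and cycles can be sketched as small checks. The arc list below is an assumption: it contains only the arcs used by the example walk (v1,v3,v2,v4,v3,v1,v3), and the graph on the slide may contain more. The function names are illustrative.

```python
def sources_and_sinks(vertices, arcs):
    """A source has no incoming arc; a sink has no outgoing arc."""
    has_incoming = {u for (_, u) in arcs}
    has_outgoing = {v for (v, _) in arcs}
    sources = [v for v in vertices if v not in has_incoming]
    sinks = [v for v in vertices if v not in has_outgoing]
    return sources, sinks


def is_walk(vertex_seq, arcs):
    """Every pair of adjacent vertices must be joined by an arc."""
    arc_set = set(arcs)
    return all(step in arc_set for step in zip(vertex_seq, vertex_seq[1:]))


def is_path(vertex_seq, arcs):
    """A path is a walk in which no arc is passed twice."""
    steps = list(zip(vertex_seq, vertex_seq[1:]))
    return is_walk(vertex_seq, arcs) and len(steps) == len(set(steps))


def is_cycle(vertex_seq, arcs):
    """A cycle is a path that returns to its start vertex."""
    return (len(vertex_seq) > 1 and is_path(vertex_seq, arcs)
            and vertex_seq[0] == vertex_seq[-1])


# Assumed arcs, reconstructed from the example walk only.
vertices = ["v1", "v2", "v3", "v4"]
arcs = [("v1", "v3"), ("v3", "v2"), ("v2", "v4"), ("v4", "v3"), ("v3", "v1")]

walk = ["v1", "v3", "v2", "v4", "v3", "v1", "v3"]
cycle = ["v1", "v3", "v2", "v4", "v3", "v1"]
print(sources_and_sinks(vertices, arcs))            # ([], [])
print(is_walk(walk, arcs), is_path(walk, arcs))     # True False
print(is_cycle(cycle, arcs))                        # True
```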
Quiz
(a) Draw all acyclic connected oriented graphs with three vertices (up to vertex labels).
(b) How many oriented graphs will there be if we label vertices with symbols A, B and C?
(c) Prove that in an acyclic graph there is at least one source and at least one sink.
(d) Draw sinks and sources in the graphs of (a).
Problem
Consider an acyclic graph with one source and one sink. Assign each arc a number, called its weight. For a given path, its path score is defined as the sum of the weights of its arcs.
Given a weighted acyclic graph, find the highest scoring path from the source to the sink.
Observation
If two subpaths P and Q end at the same vertex v, and the score of P is larger than the score of Q, then for all pairs of paths P* and Q* that start with P and Q, respectively, and coincide after v, the score of P* is higher than the score of Q*.
Hence, we do not need to consider all paths, as it is sufficient to construct the highest scoring subpath from the source to each vertex, finishing at the sink.
[Figure: subpaths P and Q meet at vertex v and continue along a common tail as P* and Q*; P > Q implies P* > Q*]
Let’s do it for this graph
[Figure sequence: a weighted acyclic graph processed in nine steps (Step 1 to Step 9), followed by backtracing. At each step, a vertex with all incoming arcs processed is selected and the best scores of the vertices reached by its outgoing arcs are updated. The score at the sink after Step 9 is 20; backtracing from the sink along the recorded last arcs recovers the highest scoring path.]
Quiz
At what steps did we have more than one vertex with all incoming arcs processed?
Algorithm
Data types and definitions:
  vertices: v, u, Source, Sink;
  arcs: (v,u), a;
  start vertex of arc a: Beginning_vertex(a);
  weight of arc (v,u): W(v,u);
  path: BestPath; // defined as a set of arcs
  the highest score of a subpath ending at v: Score(v);
  the highest score of a subpath coming through (v,u) and ending at u: Top_score(v,u);
  the last arc of the highest scoring subpath ending at u: Last_arc(u).

Initialize:
  for each vertex v: Score(v) := minus_infinity;
  Score(Source) := 0. // the empty subpath from the source to itself

Forward process:
  while there are unprocessed vertices:
    v := arbitrary unprocessed vertex with all incoming arcs processed;
    for each arc (v,u): // consider all arcs starting at v
      Top_score(v,u) := Score(v) + W(v,u);
      if Top_score(v,u) > Score(u) then: // the subpath coming through v is better than the current best subpath ending at u
        Score(u) := Top_score(v,u); // update the data for u
        Last_arc(u) := (v,u);
      endif;
      mark arc (v,u) as processed;
    endfor;
    mark vertex v as processed;
  endwhile.

Backtracing:
  BestPath := empty_set; // initialize
  v := Sink; // go from the sink backwards by marked arcs
  until v = Source:
    add Last_arc(v) to BestPath; // add the last arc of the best path ending at the current vertex
    v := Beginning_vertex(Last_arc(v)); // go to the start vertex of this arc
  enduntil.

Output BestPath.
The number of operations
The limiting procedure is processing vertices and adding arcs to paths, and we consider each arc only once
Hence the number of operations is linear in the number of arcs A: the run time of the algorithm is O (A)
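A minimal Python sketch of the forward process and backtracing, assuming the graph is given as a dictionary of weighted arcs and has one source and one sink. The function name `best_path`, the data layout and the small example graph are assumptions made for illustration; the example is not the graph from the slides. Each arc is examined once in the forward process, in line with the O(A) estimate.

```python
from collections import defaultdict, deque


def best_path(arcs, source, sink):
    """arcs: {(v, u): weight} for an acyclic graph. Returns (score, vertex list)."""
    out_arcs = defaultdict(list)
    unprocessed_in = defaultdict(int)   # number of incoming arcs not yet processed
    for (v, u), w in arcs.items():
        out_arcs[v].append((u, w))
        unprocessed_in[u] += 1
        unprocessed_in[v] += 0          # make sure every vertex has an entry

    score = {v: float("-inf") for v in unprocessed_in}
    score[source] = 0                   # empty subpath from the source to itself
    last_vertex = {}                    # start vertex of the best last arc

    # Forward process: take any vertex whose incoming arcs are all processed.
    ready = deque(v for v in unprocessed_in if unprocessed_in[v] == 0)
    while ready:
        v = ready.popleft()
        for u, w in out_arcs[v]:
            if score[v] + w > score[u]:
                score[u] = score[v] + w
                last_vertex[u] = v
            unprocessed_in[u] -= 1
            if unprocessed_in[u] == 0:
                ready.append(u)

    # Backtracing from the sink to the source.
    path = [sink]
    while path[-1] != source:
        path.append(last_vertex[path[-1]])
    return score[sink], path[::-1]


example = {("s", "a"): 2, ("s", "b"): 5, ("a", "b"): 4, ("a", "t"): 1, ("b", "t"): 3}
print(best_path(example, "s", "t"))     # (9, ['s', 'a', 'b', 't'])
```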
Greedy algorithm
Start at the source and select the highest-weighted arc at each step.
It does not work: on the example graph the greedy path scores 13 < 20.
[Figure: the example graph]
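To illustrate why the greedy choice can fail, here is a sketch on a small hypothetical four-vertex graph (not the graph from the slides). The function names and the exhaustive reference search `best_by_enumeration` are assumptions for illustration, and both functions assume that every vertex can reach the sink.

```python
def greedy_path(arcs, source, sink):
    """Follow the highest-weighted outgoing arc at every step."""
    v, total, path = source, 0, [source]
    while v != sink:
        u, w = max(((u, w) for (x, u), w in arcs.items() if x == v),
                   key=lambda pair: pair[1])
        total += w
        path.append(u)
        v = u
    return total, path


def best_by_enumeration(arcs, source, sink):
    """Reference answer by trying every path (only feasible for tiny graphs)."""
    if source == sink:
        return 0, [sink]
    best = None
    for (x, u), w in arcs.items():
        if x == source:
            sub_score, sub_path = best_by_enumeration(arcs, u, sink)
            candidate = (sub_score + w, [source] + sub_path)
            if best is None or candidate[0] > best[0]:
                best = candidate
    return best


toy = {("s", "a"): 5, ("s", "b"): 1, ("a", "t"): 1, ("b", "t"): 10}
print(greedy_path(toy, "s", "t"))          # (6, ['s', 'a', 't'])
print(best_by_enumeration(toy, "s", "t"))  # (11, ['s', 'b', 't'])
```

The greedy walk commits to the arc of weight 5 and never sees the arc of weight 10 behind the less attractive first step.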
Quiz
(a) Construct the simplest possible graph in which the greedy algorithm yields the highest scoring path.
(b) Construct a graph with three vertices in which the greedy algorithm does not yield the highest scoring path.
(c) Construct a graph with three vertices in which the greedy algorithm does yield the highest scoring path.
(d) Assign new weights to the arcs of the above graph so that the greedy algorithm will yield the highest scoring path.
Quiz cont’d
(e) Write an algorithm for construction of the path with the maximum number of arcs and apply it to the above graph.
Hint: do not change the algorithm, set proper arc weights.
(f) Modify the maximum score algorithm so as to construct the path with the minimal score and find this path for the above graph.
(g) Provide a greedy algorithm for finding the path of minimal score in a graph, and apply it to the above graph.
(h) For the above graph, find the path with the minimal number of arcs.
Lesson
The generic dynamic programming algorithm may be applied to different problems. The common feature of these problems is that each one can be decomposed into an ordered set of smaller subproblems, and to solve a more complex subproblem one needs to know only the solutions of the simpler ones, but not the entire set of possibilities.
Note
There exist path optimization problems that cannot be solved by dynamic programming.
Traveling salesman problem. Given a non-oriented graph with weighted arcs, we need to construct the lowest scoring path passing through all the vertices (the salesman needs to visit all cities with travel time between the cities given by the arc weights, while spending the least amount of time traveling).
All cities need to be visited in a single trip => NP-complete problem.
No efficient algorithms are known. Most computer scientists believe that for all NP-complete problems the number of operations required to provide an optimal solution is exponential in the problem size.
Alignment
Given two symbol sequences (nucleotides or amino acids) of lengths M and N, set a correspondence between these sequences so that some symbols are set in pairs, matching or mismatching, whereas other symbols are ignored (indels). The order of corresponding symbols in the subsequences should coincide.
The alignment score is the sum of match premiums r per matching pair minus the sum of mismatch penalties p per mismatching pair and deletion penalties q per ignored symbol.
The goal is to construct the highest scoring alignment.
Quiz
What are the scores of the alignments
Reduction to the optimal path problem
Construct a graph.
Vertices correspond to pairs of positions (endpoints of partial alignments).
Outgoing arcs (for each vertex) are of three types:
• match (weight r) or mismatch (weight (–p)); total M∙N arcs
• deletion in the 1st sequence (weight (–q)); total M∙(N+1) arcs
• deletion in the 2nd sequence (weight (–q)); total (M+1)∙N arcs
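The reduction can be sketched directly: vertices are pairs of positions (i, j), and the three arc types become the three candidates considered for each vertex. This is a minimal sketch assuming linear deletion penalties; the function name `align`, the default values of r, p and q, and the test sequences are illustrative assumptions.

```python
def align(s, t, r=1, p=1, q=1):
    """Highest scoring global alignment; returns (score, aligned s, aligned t)."""
    M, N = len(s), len(t)
    neg_inf = float("-inf")
    score = [[neg_inf] * (N + 1) for _ in range(M + 1)]
    back = [[None] * (N + 1) for _ in range(M + 1)]
    score[0][0] = 0                                   # the source vertex
    for i in range(M + 1):                            # row by row is a valid
        for j in range(N + 1):                        # processing order
            candidates = []
            if i > 0 and j > 0:                       # match or mismatch arc
                w = r if s[i - 1] == t[j - 1] else -p
                candidates.append((score[i - 1][j - 1] + w, "diag"))
            if i > 0:                                 # symbol of s is ignored
                candidates.append((score[i - 1][j] - q, "up"))
            if j > 0:                                 # symbol of t is ignored
                candidates.append((score[i][j - 1] - q, "left"))
            for value, move in candidates:
                if value > score[i][j]:
                    score[i][j], back[i][j] = value, move

    # Backtracing from the sink (M, N) to the source (0, 0).
    top, bottom, i, j = [], [], M, N
    while (i, j) != (0, 0):
        move = back[i][j]
        if move == "diag":
            top.append(s[i - 1]); bottom.append(t[j - 1]); i -= 1; j -= 1
        elif move == "up":
            top.append(s[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(t[j - 1]); j -= 1
    return score[M][N], "".join(reversed(top)), "".join(reversed(bottom))


print(align("ACGT", "AGT"))   # (2, 'ACGT', 'A-GT')
```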
Alignment graph
[Figure: the alignment graph for the two example sequences; rows and columns are labeled by the sequence symbols]
Alignment graph with weights
[Figure: the same graph with arcs labeled r (match), p (mismatch), and q (deletion)]
Paths for the three alignments
[Figure: the paths through the alignment graph that correspond to the three alignments]
Variants
• Hanging-end alignment (genome assembly)
  – zero-weight arcs from the source to the top and left “perimeter” and from the right and bottom perimeter to the sink
• Local alignment
  – zero-weight arcs from the source to all internal vertices and from internal vertices to the sink
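For the local alignment variant, a zero-weight arc from the source to every vertex amounts to never letting a score drop below zero, and a zero-weight arc from every vertex to the sink amounts to taking the maximum over all vertices. A minimal sketch that returns only the score, under these assumptions; the function name and parameter defaults are illustrative.

```python
def local_alignment_score(s, t, r=1, p=1, q=1):
    """Score of the best local alignment of s and t (linear deletion penalty)."""
    M, N = len(s), len(t)
    score = [[0] * (N + 1) for _ in range(M + 1)]
    best = 0
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            w = r if s[i - 1] == t[j - 1] else -p
            score[i][j] = max(
                0,                          # zero-weight arc from the source
                score[i - 1][j - 1] + w,    # match or mismatch
                score[i - 1][j] - q,        # deletion
                score[i][j - 1] - q,        # deletion
            )
            best = max(best, score[i][j])   # zero-weight arc to the sink
    return best


print(local_alignment_score("xxACGTyy", "zzACGTww"))   # 4
```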
Weights
• Amino-acid substitution weight matrices
  – evolutionary
    • PAM (sure alignment of closely related proteins, take matrix to the power)
    • BLOSUM (alignment of conserved regions in distantly related proteins)
  – based on physical and chemical properties of residues
• Deletion penalty
  – affine penalties (opening and extension penalties)
• Structural alignment as the gold standard
Quiz
For the above alignments, assuming match premium r=10, what combinations of mismatch and deletion penalties would yield optimal alignments (a), (b), and (c)?
Multiple alignment
• triple alignment: cubic graph, etc.
• for K sequences of length N requires $O(N^K)$ operations
• soon becomes unworkable
• progressive alignment
  – all pairwise alignments, distance matrices
  – guide tree
  – alignment of partial alignments
Lesson
Weights matter. The same graph with differently assigned arc weights will yield different types of alignment.
Gene recognition
Define a gene as a sequence fragment consisting of exons and introns.
The boundaries between them are donor sites (between exons and introns, usually GT) and acceptor sites (between introns and exons, usually AG).
Each exon and intron is assigned a weight, measuring coding affinity (respectively, non-coding affinity) of its sequence.
The gene’s score is the sum of weights of constituent exons and introns.
The goal is, given a sequence and a set of candidate donor and acceptor sites, to construct the highest-scoring exon–intron structure for a gene.
Construct a graph
actgagactgcagacggacgtacggcactgacgtataagccccacagtccttacgtctga
actgagactgcagACGGACGTACGGCACTGACgtataagCCCCACAGTCCTTACgtctga
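[Figure, panels (a) and (b): the sequence and the graph constructed over its candidate donor and acceptor sites; uppercase letters mark exons]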
Complexity
Assume even distribution of sites (leave out details)
=> $O(L)$ vertices, $O(L^2)$ arcs
Can we do better?
It makes sense to assume that the segment weights are additive (we assume that for exons anyhow). Then we have just $O(L)$ arcs.
actgagactgcagacggacgtacggcactgacgtataagccccacagtccttacgtctga
actgagactgcagACGGACGTACGGCACTGACgtataagCCCCACAGTCCTTACgtctga
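[Figure, panels (a) and (b): graph representations of this sequence, including the graph with additive segment weights]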
Quiz
There are two paths in the segment graph that describe exon–intron structures not represented in the exon–intron graph. What are they? What arcs need to be added to the exon–intron graph to represent these structures?
Lesson
Structure matters. The same problem may be represented by different graphs, and the conceptually simplest representation is not necessarily the most efficient one.
Return to the toy problem
calculate
the standard trick would not work because
x∙z + y∙z = (x + y)∙z (used before) holds, but
(x + z)∙(y + z) = x∙y + z generally does not.
Quiz. When does (x + z)∙(y + z) = x∙y + z hold?
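DP, generic statement. 1. Path weights
Let $\otimes$ be the operation of calculating the path score S given arc weights W. We require that the associative rule hold: $(a \otimes b) \otimes c = a \otimes (b \otimes c)$.
Hence we can simply write $a \otimes b \otimes c$.
The path weight (former $S(P) = \sum_{a \in P} W(a)$) becomes $S(P) = \bigotimes_{a \in P} W(a)$.
DP, generic statement. 2. Graph score
Let Ψ be the set of all paths and $\oplus$ the operation of selecting the path. We require that $\oplus$ obey the associative and commutative rules for combining paths: $(a \oplus b) \oplus c = a \oplus (b \oplus c)$ and $a \oplus b = b \oplus a$.
The graph score is defined as $\Omega = \bigoplus_{P \in \Psi} S(P)$ (for the optimal path problem, $\oplus = \max$).
DP, generic statement. 3. Transitivity
To use dynamic programming, we need the distributive law: $(a \oplus b) \otimes c = (a \otimes c) \oplus (b \otimes c)$ and $c \otimes (a \oplus b) = (c \otimes a) \oplus (c \otimes b)$.
This is a generalization of the property used for calculating the optimal path: max (x + z, y + z) = max (x, y) + z.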
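The generic statement can be sketched as one routine parameterized by two operations: `otimes`, which combines arc weights along a path, and `oplus`, which combines alternative paths. These names, the argument `one` (the neutral element of `otimes`), the requirement that vertices are passed in a processing order, and the example graph are assumptions made for illustration.

```python
import operator


def graph_score(arcs, order, source, sink, otimes, oplus, one):
    """arcs: {(v, u): weight}; order: vertices listed so that every arc goes forward."""
    incoming = {}
    for (v, u), w in arcs.items():
        incoming.setdefault(u, []).append((v, w))
    score = {source: one}
    for u in order:
        for v, w in incoming.get(u, []):
            if v not in score:            # v cannot be reached from the source
                continue
            term = otimes(score[v], w)    # extend the subpaths ending at v
            score[u] = oplus(score[u], term) if u in score else term
    return score[sink]


example = {("s", "a"): 2, ("s", "b"): 5, ("a", "b"): 4, ("a", "t"): 1, ("b", "t"): 3}
order = ["s", "a", "b", "t"]

# Optimal path: otimes is addition, oplus is max.
print(graph_score(example, order, "s", "t", operator.add, max, 0))           # 9
# Sum over all paths of the product of arc weights.
print(graph_score(example, order, "s", "t", operator.mul, operator.add, 1))  # 41
```

Only the two operations change between the calls; the traversal itself is the same, which is the point of requiring the distributive law.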
DP, algorithm
Problem (physics of polymers)
Linear polymer chain of L+1 monomers k = 0, …, L.
Each monomer assumes one of N states, $\sigma(k) \in \{\sigma_i \mid i = 1, \dots, N\}$.
Energy of interactions between adjacent monomers is defined by an N×N matrix $\xi(\sigma_i, \sigma_j)$ (measured in kT units).
Chain conformation P is defined by the states of the monomers $\{\sigma(0), \sigma(1), \dots, \sigma(L)\}$.
Exponent of energy: $S(P) = \exp(-E(P)) = \prod_{k=1}^{L} \exp(-\xi(\sigma(k-1), \sigma(k)))$.
Ψ is the set of all conformations.
Calculate the partition function of the set of all conformations, $\Omega = \sum_{P \in \Psi} S(P)$.
Graph construction and reduction to DP
Vertices correspond to monomer states, so that their number is (L+1)∙N+2 (two additional vertices are the source and the sink, corresponding to the virtual start and end of the chain).
Arcs link vertices corresponding to adjacent monomers.
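Arc weights are the exponents of the interaction energies, $\exp(-\xi(\sigma_i, \sigma_j))$, so that the product of arc weights along a path equals S(P).
Paths through this graph exactly correspond to the chain conformations.
The path operation $\otimes$ is ordinary multiplication, and the path-combining operation $\oplus$ is addition.
The path score is the product of arc weights.
The total graph score is the sum of these products.
Standard DP solves the problem.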
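A sketch of this reduction, compared with direct enumeration of all conformations. It assumes that arcs between adjacent monomers carry the weight exp(–ξ) and that arcs from the source and to the sink carry weight 1; the function names and the sample energy matrix are illustrative assumptions.

```python
import math
from itertools import product


def partition_function(xi, L):
    """xi: N x N matrix of interaction energies (kT units); chain of L+1 monomers."""
    N = len(xi)
    z = [1.0] * N                      # vertex scores for monomer 0
    for _ in range(L):                 # move to the next monomer
        z = [sum(z[i] * math.exp(-xi[i][j]) for i in range(N)) for j in range(N)]
    return sum(z)                      # combine the states of the last monomer


def partition_function_direct(xi, L):
    """Same quantity by summing exp(-E(P)) over every conformation P."""
    N = len(xi)
    total = 0.0
    for conf in product(range(N), repeat=L + 1):
        energy = sum(xi[conf[k - 1]][conf[k]] for k in range(1, L + 1))
        total += math.exp(-energy)
    return total


xi = [[0.0, 1.0], [1.0, 0.5]]
print(partition_function(xi, 4), partition_function_direct(xi, 4))
```

The direct version visits every conformation, so its cost grows exponentially with the chain length, while the DP version performs a fixed amount of work per monomer.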
Quiz
(a) How many operations shall we need?
(b) How many operations shall we need if we calculate the partition function directly?
(c) Provide an algorithm for calculating the number of paths in a graph. Hint: invent suitable arc weights and reduce to the previous problem.
(d) What will Ω be if both the path operation $\otimes$ and the path-combining operation $\oplus$ are the operation of taking the maximum?
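Problem
Calculate the minimum energy and the number of conformations with the minimum energy.
Arc weights are pairs [1, ξ], with ξ as defined previously.
Path scores are pairs [n, ε], where ε is the energy, and n is the number of conformations having this energy.
When two systems are combined, the resulting energy is the sum of the systems’ energies, whereas the number of states is the product of the numbers of states. Hence taking $[n_1, \varepsilon_1] \otimes [n_2, \varepsilon_2] = [n_1 n_2, \varepsilon_1 + \varepsilon_2]$ along a path, and combining alternative paths with an operation $\oplus$ that keeps the pair with the lower energy and adds the counts n when the energies are equal, solves the problem.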
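A sketch of this pair-valued dynamic programming. The combination rule for alternative paths (keep the lower energy, add the counts on a tie) is an assumption inferred from the problem statement, and the function names are illustrative. Integer energies are used so that the equality test for ties is exact.

```python
def otimes(a, b):
    """Combine two systems: counts multiply, energies add."""
    return (a[0] * b[0], a[1] + b[1])


def oplus(a, b):
    """Combine alternatives: keep the lower energy; on a tie, add the counts."""
    if a[1] < b[1]:
        return a
    if b[1] < a[1]:
        return b
    return (a[0] + b[0], a[1])


def min_energy_and_count(xi, L):
    """Returns (number of minimum-energy conformations, minimum energy)."""
    N = len(xi)
    state = [(1, 0)] * N                       # monomer 0: one way, zero energy
    for _ in range(L):
        new_state = []
        for j in range(N):
            acc = None
            for i in range(N):
                term = otimes(state[i], (1, xi[i][j]))   # arc weight [1, xi]
                acc = term if acc is None else oplus(acc, term)
            new_state.append(acc)
        state = new_state
    result = state[0]
    for pair in state[1:]:
        result = oplus(result, pair)
    return result


print(min_energy_and_count([[0, 1], [1, 0]], 3))   # (2, 0)
print(min_energy_and_count([[1, 1], [1, 1]], 2))   # (8, 2)
```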
Lesson
Generalizations are useful
Note
Not all problems that can be solved by dynamic programming have a simple graph representation. For example, reconstruction of the secondary structure of an RNA molecule given its sequence can be decomposed into simpler, embedded problems and can be solved by a variant of the dynamic programming algorithm, but in the language of this paragraph it requires slightly more complicated objects called hypergraphs.
Thank you
• Mikhail Roytberg
• Andrei Mironov
• Anatoly Rubinov
• Pavel Pevzner