
Greedy Approximation Algorithms for Covering Problems in Computational Biology


Ion Mandoiu
Computer Science & Engineering Department

University of Connecticut


Why Approximation Algorithms?

• Most practical optimization problems are NP-hard
• Approximation algorithms offer the next best thing to an efficient exact algorithm
– Polynomial time
– Solutions guaranteed to be “close” to optimum
– $\alpha$-approximation algorithm: solution cost within a multiplicative factor of $\alpha$ of optimum cost

• Practical relevance: insights needed to establish approximation guarantee often lead to fast, highly effective practical implementations


Why Computational Biology?

• Exploding multidisciplinary field at the intersection of computer science, biology, discrete mathematics, statistics, optimization, chemistry, physics, …

• Source of a fast growing number of combinatorial optimization applications:
– TSP and Euler paths in DNA sequencing
– Dynamic Programming in sequence alignment
– Integer Programming in Haplotype inference
– …

• This talk: two “covering” problems in computational biology (primer set selection and string barcoding)


Overview

• Potential function greedy algorithm
- The set cover problem and the greedy algorithm
- Potential function generalization
• Primer Set Selection for Multiplex PCR
• The String Barcoding Problem
• Conclusions


The Set Cover Problem

Given:
- Universal set $U$ with $n$ elements
- Family of sets $(S_x,\ x \in X)$ covering all elements of $U$

Find:
- Minimum size subset $X'$ of $X$ s.t. $(S_x,\ x \in X')$ covers all elements of $U$

Greedy Algorithm:
- Start with empty $X'$, and repeatedly add $x$ such that $S_x$ contains the most uncovered elements until $U$ is covered
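
The greedy rule above can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the function name `greedy_set_cover`, the dictionary layout (label x mapped to the set S_x), and the toy instance are all assumptions made for the example.

```python
def greedy_set_cover(universe, sets):
    """Return labels of sets whose union covers `universe`.

    `sets` maps a label x to the set S_x (illustrative layout).
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Pick the set containing the most still-uncovered elements.
        best = max(sets, key=lambda x: len(sets[x] & uncovered))
        if not sets[best] & uncovered:
            raise ValueError("the given sets do not cover the universe")
        chosen.append(best)
        uncovered -= sets[best]
    return chosen


# Toy instance (made up for illustration).
universe = range(1, 8)
sets = {
    "a": {1, 2, 3, 4},
    "b": {4, 5, 6},
    "c": {6, 7},
    "d": {1, 5, 7},
}
cover = greedy_set_cover(universe, sets)
```

Ties are broken by dictionary order here; any tie-breaking rule is compatible with the greedy description.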


Approximation Guarantee

Classical result (Johnson’74, Lovasz’75, Chvatal’79): the greedy setcover algorithm has an approximation factor of $H(n) = 1 + 1/2 + 1/3 + \dots + 1/n \le 1 + \ln n$

- The approximation factor is tight
- Cannot be approximated within a factor of $(1-\varepsilon)\ln n$ unless $NP \subseteq DTIME(n^{\log\log n})$


General setting

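“Potential function” $\Phi(X') \ge 0$
- $\Phi(\{\}) = \Phi_{\max}$
- $\Phi(X') = 0$ for all feasible solutions
- $X'' \supseteq X' \Rightarrow \Phi(X'') \le \Phi(X')$
- If $\Phi(X') > 0$, then there exists $x$ s.t. $\Phi(X'+x) < \Phi(X')$
- $X'' \supseteq X' \Rightarrow \Delta(x,X'') \le \Delta(x,X')$ for every $x$, where $\Delta(x,X') := \Phi(X') - \Phi(X'+x)$

Problem: find minimum size set $X'$ with $\Phi(X') = 0$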


Generic Greedy Algorithm

• Theorem (Konwar et al.’05) The generic greedy algorithm has an approximation factor of $1+\ln \Delta_{\max}$

$X' \leftarrow \{\}$
While $\Phi(X') > 0$
    Find $x$ with maximum $\Delta(x,X')$
    $X' \leftarrow X' + x$
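
A minimal Python sketch of the generic greedy loop follows, assuming the potential is supplied as a function on sets of candidates. The names `potential_greedy` and `uncovered_count`, and the toy instance, are hypothetical; set cover is used as the example potential (number of uncovered elements).

```python
def potential_greedy(candidates, phi):
    """Generic greedy: `phi` maps a frozenset of chosen candidates to a
    nonnegative number that is 0 exactly on feasible solutions."""
    chosen = frozenset()
    order = []
    while phi(chosen) > 0:
        current = phi(chosen)
        # Gain Delta(x, X') = phi(X') - phi(X' + x) for each unused candidate.
        gains = {x: current - phi(chosen | {x})
                 for x in candidates if x not in chosen}
        best = max(gains, key=gains.get, default=None)
        if best is None or gains[best] <= 0:
            raise ValueError("no candidate decreases the potential")
        chosen = chosen | {best}
        order.append(best)
    return order


# Set cover as an instance: potential = number of uncovered elements.
sets = {"a": {1, 2, 3, 4}, "b": {4, 5, 6}, "c": {6, 7}, "d": {1, 5, 7}}
universe = set().union(*sets.values())


def uncovered_count(chosen):
    covered = set().union(*(sets[x] for x in chosen))
    return len(universe - covered)


picked = potential_greedy(sets, uncovered_count)
```

Recomputing every gain in every round keeps the sketch short; it is not meant to be efficient.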


Proof Idea

Let $x_1, x_2, \dots, x_g$ be the elements selected by greedy, in the order in which they are chosen, and $x^*_1, x^*_2, \dots, x^*_k$ be the elements of an optimum solution.

Charging scheme: $x_i$ charges to $x^*_j$ a cost of […], where $\delta_{ij} = \Delta(x_i, \{x_1,\dots,x_{i-1}\} \cup \{x^*_1,\dots,x^*_j\})$

Fact 1: Each $x^*_j$ gets charged a total cost of at most $1+\ln \Delta_{\max}$


Proof of claim 2

Fact 2: Each $x_i$ charges at least 1 unit of cost

Facts 1 and 2 together give $g \le$ total charge $\le k\,(1+\ln \Delta_{\max})$


Overview

• Potential function greedy algorithm
• Primer set selection for multiplex PCR
- Motivation and problem formulation
- Greedy applied to primer set selection
- Experimental results
• The String Barcoding Problem
• Conclusions


DNA Structure

• Four nucleotide types: A,C,T,G

• Normally double stranded

• A’s paired with T’s

• C’s paired with G’s


The Polymerase Chain Reaction

[Figure: one PCR cycle showing the target sequence, the polymerase, and primers 1 and 2; repeated for 20-30 cycles]


Primer Pair Selection Problem

• Given:

• Genomic sequence around amplification locus

• Primer length k

• Amplification upper bound L

• Find: Forward and reverse primers of length k that hybridize within a distance of L of each other and optimize amplification efficiency (melting temperature, secondary structure, mis-priming, etc.)

[Figure: forward and reverse primers on opposite strands (5'/3' ends marked), hybridizing within distance L around the amplification locus]


Multiplex PCR

• Multiplex PCR (MP-PCR)
– Multiple DNA fragments amplified simultaneously
– Each amplified fragment still defined by two primers
– A primer may participate in amplification of multiple targets

• Primer set selection
– Typically done by time-consuming trial and error
– An important objective is to minimize number of primers
    • Reduced assay cost
    • Higher effective concentration of primers → higher amplification efficiency
    • Reduced unintended amplification


Primer Set Selection Problem

• Given:

• Genomic sequences around n amplification loci

• Primer length k

• Amplification upper bound L

• Find:

• Minimum size set S of primers of length k such that, for each amplification locus, there are two primers in S hybridizing with the forward and reverse genomic sequences within a distance of L of each other


Applications

• Single Nucleotide Polymorphism (SNP) genotyping
– Up to thousands of SNPs genotyped simultaneously
– Selective PCR amplification required for improved accuracy

• Spotted microarray synthesis [Fernandes&Skiena’02]
– Primers can be used multiple times
– For each target, need a pair of primers amplifying that target and only that target (amplification uniqueness constraint)
– Can still reduce #primers from $2n$ to $O(n^{1/2})$


Previous Work on Primer Selection

• Well-studied problem: [Pearson et al. 96], [Linhart & Shamir’02], [Souvenir et al.’03], etc.

• Almost all problem formulations decouple selection of forward and reverse primers
– To enforce bound of L on amplification length, select only primers that hybridize within L/2 bases of desired target

– In worst case, this method can increase the number of primers by a factor of O(n) compared to the optimum

• [Pearson et al. 96] Greedy set cover algorithm gives O(ln n) approximation factor for the “decoupled” formulation


Previous Work (contd.)

• [Fernandes&Skiena’02] study primer set selection with uniqueness constraints

• Minimum Multi-Colored Subgraph Problem:
– Vertices correspond to candidate primers
– Edge colored by color i between u and v iff corresponding primers hybridize within a distance of L of each other around i-th amplification locus

– Goal is to find minimum size set of vertices inducing edges of all colors

• Can capture amplification length constraints too


Integer Program Formulation

• 0/1 variable $x_u$ for every vertex $u$

• 0/1 variable $y_e$ for every edge $e$
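
The program itself did not survive on this slide. One natural integer program consistent with the two variable families, offered only as an assumption, is sketched below. The symbols $V$ (vertex set) and $E_i$ (edges of color $i$) are notation introduced here, not taken from the slide.

```latex
\begin{align*}
\text{minimize}\quad   & \sum_{u \in V} x_u \\
\text{subject to}\quad & \sum_{e \in E_i} y_e \ge 1
                       && \text{for every color } i \\
                       & y_e \le x_u,\quad y_e \le x_v
                       && \text{for every edge } e = (u,v) \\
                       & x_u,\ y_e \in \{0,1\}
\end{align*}
```

Under this reading, $y_e = 1$ is only allowed when both endpoints of $e$ are selected, so the first constraint forces the selected vertices to induce at least one edge of every color.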


LP-Rounding Algorithm

Theorem [Konwar et al.’04]: The LP-rounding algorithm finds a feasible solution at most $O(m^{1/2}\ln n)$ times larger than the optimum, where $m$ is the maximum color class size, and $n$ is the number of nodes

For primer selection, $m \le L^2$ $\Rightarrow$ approximation factor is $O(L \ln n)$

Better approximation?
- Unlikely for minimum multi-colored subgraph problem

(1) Solve linear programming relaxation

(2) Select node u with probability xu

(3) Repeat step 2 O(ln(n)) times and return selected nodes
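
Steps (2) and (3) can be sketched as follows, assuming step (1) has already produced a fractional solution from some LP solver (not shown). The function name `round_lp_solution` and the constant `c` standing in for the one hidden in O(ln n) are hypothetical.

```python
import math
import random


def round_lp_solution(x_frac, c=2.0, rng=None):
    """Randomized rounding of a fractional LP solution.

    x_frac: dict node -> fractional value x_u in [0, 1], assumed to come
            from an LP solver outside this sketch.
    c:      hypothetical constant hidden in the O(ln n) number of rounds.
    """
    rng = rng or random.Random()
    n = len(x_frac)
    rounds = max(1, math.ceil(c * math.log(max(n, 2))))
    selected = set()
    for _ in range(rounds):
        for u, value in x_frac.items():
            # Step (2): select node u with probability x_u.
            if rng.random() < value:
                selected.add(u)
    return selected


# Made-up fractional values for illustration.
picked = round_lp_solution({"u": 1.0, "v": 0.0, "w": 0.5}, rng=random.Random(0))
```

The sketch does not check that the returned nodes induce an edge of every color; a full implementation would need that feasibility check.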


Selection w/o Uniqueness Constraints

• Can be seen as a “simultaneous set covering” problem:

- The ground set is partitioned into n disjoint sets $S_i$ (one for each target), each with 2L elements

- The goal is to select a minimum number of sets (i.e., primers) that cover at least half of the elements in each partition

[Figure: SNP i with L bases on each side]


Greedy Algorithm

• Potential function $\Phi$ = minimum number of elements that must still be covered = $\sum_i \max\{0,\ L - \#\text{covered elements in } S_i\}$

• Initially, $\Phi = nL$

• For feasible solutions, $\Phi = 0$

• $\Delta(\Phi) \le nL$ (much smaller in practice)

• Theorem [Konwar et al.’05]: The number of primers selected by the greedy algorithm is at most a factor of $1+\ln(nL)$ larger than the optimum
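
A minimal sketch of this potential on the abstract "simultaneous set covering" view is given below. The data layout (each primer mapped to the set of `(target, element)` pairs it covers), the function name `greedy_primer_cover`, and the toy instance are assumptions for illustration; real primer data would need a hybridization model.

```python
def greedy_primer_cover(primers, n_targets, L):
    """primers: dict primer -> set of (target, element) pairs it covers.
    Target i is satisfied once at least L elements of S_i are covered."""

    def potential(covered):
        per_target = [0] * n_targets
        for target, _ in covered:
            per_target[target] += 1
        # Elements that must still be covered, summed over targets.
        return sum(max(0, L - c) for c in per_target)

    covered = set()
    chosen = []
    while potential(covered) > 0:
        current = potential(covered)
        gains = {p: current - potential(covered | primers[p])
                 for p in primers if p not in chosen}
        best = max(gains, key=gains.get, default=None)
        if best is None or gains[best] <= 0:
            raise ValueError("instance is infeasible")
        chosen.append(best)
        covered |= primers[best]
    return chosen


# Toy instance: 2 targets, L = 2 (each S_i has 2L = 4 elements).
primers = {
    "p1": {(0, 0), (0, 1), (1, 0)},
    "p2": {(1, 1), (1, 2)},
    "p3": {(0, 2), (1, 3)},
    "p4": {(0, 0), (0, 1)},
}
selected = greedy_primer_cover(primers, n_targets=2, L=2)
```

The `max(0, ...)` term is what distinguishes this from plain set cover: covering more than L elements of one target earns no extra gain.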


Experimental Setting

• Datasets extracted from NCBI databases, L=1000
• Dell PowerEdge 2.8GHz Xeon
• Compared algorithms

– G-FIX: greedy primer cover algorithm [Pearson et al.]

– MIPS-PT: iterative beam-search heuristic [Souvenir et al.]

• Restrict primers to L/2 bases around amplification locus

– G-VAR: naïve modification of G-FIX

• First selected primer can be up to L bases away

• Opposite sequence truncated after selecting first primer

– G-POT: potential function driven greedy algorithm


Experimental Results, NCBI tests

Algorithms: G-FIX (Pearson et al.), G-VAR (G-FIX with dynamic truncation), MIPS-PT (Souvenir et al.), G-POT (potential-function greedy)

#Targets   k   G-FIX             G-VAR             MIPS-PT           G-POT
               #Primers CPU sec  #Primers CPU sec  #Primers CPU sec  #Primers CPU sec
20         8   7        0.04     7        0.08     8        10       6        0.10
20         10  9        0.03     10       0.08     13       15       9        0.08
20         12  14       0.04     13       0.08     18       26       13       0.11
50         8   13       0.13     15       0.30     21       48       10       0.32
50         10  23       0.22     24       0.36     30       150      18       0.33
50         12  31       0.14     32       0.30     41       246      29       0.28
100        8   17       0.49     20       0.89     32       226      14       0.58
100        10  37       0.37     37       0.72     50       844      31       0.75
100        12  53       0.59     48       0.84     75       2601     42       0.61


[Figure: #primers, as percentage of 2n, plotted against n (l=8)]


[Figure: #primers, as percentage of 2n, plotted against n (l=10)]


[Figure: #primers, as percentage of 2n, plotted against n (l=12)]


[Figure: CPU seconds plotted against n (l=10)]


Overview

• Potential function greedy algorithm
• Primer Set Selection for Multiplex PCR
• The String Barcoding Problem
- Problem Formulation
- Integer programming and greedy algorithms
- Experimental results
• Conclusions


Motivation

• Rapid pathogen detection
– Given
    • Pathogen with unknown identity
    • Database of known pathogens
– Problem
    • Identify unknown pathogen quickly

• Ideal solution: determine DNA sequence of unknown pathogen


Real World

• Not possible to quickly sequence an unknown pathogen
– Only have sequence for pathogens in database

• Can quickly test for presence of short substrings in unknown virus (substring tests) using hybridization

• String barcoding [Borneman et al.’01, Rash&Gusfield’02]
– Use substring tests that uniquely identify each pathogen in the database


String Barcoding Problem

Given:
Genomic sequences $g_1, \dots, g_n$

Find:
Minimum number of distinguisher strings $t_1, \dots, t_k$

Such that:
For every $g_i \ne g_j$, there exists a string $t_l$ which is a substring of $g_i$ or $g_j$, but not of both

- At least $\log_2 n$ distinguishers needed
- Fingerprints $\Rightarrow$ $n$ distinguishers


Example

• Given sequences:

1. cagtgc

2. cagttc

3. catgga

• Feasible set of distinguishers: {tg, atgga}

tg atgga

cagtgc 1 0

cagttc 0 0

catgga 1 1

Row vectors: unique barcodes for each pathogen
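
The barcode table can be recomputed with a short Python sketch. The helper names `barcodes` and `is_feasible` are illustrative; the sequences and distinguishers are the ones from the example.

```python
def barcodes(sequences, distinguishers):
    """One 0/1 tuple per sequence: entry is 1 iff the distinguisher
    occurs as a substring of that sequence."""
    return [tuple(int(t in g) for t in distinguishers) for g in sequences]


def is_feasible(sequences, distinguishers):
    """A distinguisher set is feasible iff all barcodes are distinct."""
    codes = barcodes(sequences, distinguishers)
    return len(set(codes)) == len(codes)


sequences = ["cagtgc", "cagttc", "catgga"]
codes = barcodes(sequences, ["tg", "atgga"])
```

Dropping either distinguisher makes two rows collide, which is why both are needed for these three sequences.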


Computational Complexity

• [Berman et al.’04] Cannot be approximated within a factor of $(1-\varepsilon)\ln n$ unless $NP \subseteq DTIME(n^{\log\log n})$


Setcover Greedy Algorithm

• Distinguisher selection as setcover problem

– Elements to be covered are the pairs of sequences

– Each candidate distinguisher defines a set of pairs that it separates

• Another view: covering all edges of a complete graph with n vertices by the minimum number of given cuts

• For n sequences, largest set can have $O(n^2)$ elements
$\Rightarrow$ The setcover greedy guarantees $\ln(n^2) = 2 \ln n$ approximation
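
The reduction to set cover can be sketched as below: elements are pairs of sequence indices, and each candidate substring covers the pairs it separates. Generating candidates as all substrings is a simplification chosen for this sketch (fine for toy inputs only), and all function names are hypothetical.

```python
from itertools import combinations


def all_substrings(sequences):
    return {g[i:j] for g in sequences
            for i in range(len(g)) for j in range(i + 1, len(g) + 1)}


def separated_pairs(candidate, sequences):
    """Pairs (i, j) of sequence indices that `candidate` distinguishes."""
    hits = [candidate in g for g in sequences]
    return {(i, j) for i, j in combinations(range(len(sequences)), 2)
            if hits[i] != hits[j]}


def setcover_greedy_distinguishers(sequences):
    candidates = sorted(all_substrings(sequences))  # sorted for reproducibility
    cover = {t: separated_pairs(t, sequences) for t in candidates}
    uncovered = set(combinations(range(len(sequences)), 2))
    chosen = []
    while uncovered:
        # Pick the candidate separating the most not-yet-separated pairs.
        best = max(candidates, key=lambda t: len(cover[t] & uncovered))
        if not cover[best] & uncovered:
            raise ValueError("some sequences cannot be separated")
        chosen.append(best)
        uncovered -= cover[best]
    return chosen


sequences = ["cagtgc", "cagttc", "catgga"]
chosen = setcover_greedy_distinguishers(sequences)
```

For three sequences any single candidate separates at most two of the three pairs, so the greedy necessarily returns two distinguishers on this input.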


Integer Program Formulation

• 0/1 variable for each candidate distinguisher
    • 1 → candidate is selected
    • 0 → candidate is not selected

• For each pair of sequences, at least one candidate separating them is selected

• Objective Function
– Minimize #selected candidates


Practical Issues

• Quadratic # of constraints, huge # of variables
– Genome sizes range from thousands of bases for phage and viruses to millions for bacteria to billions for higher organisms

• Many variables can be removed:
– Candidates that appear in all sequences
– Sufficient to keep a single candidate among those that appear in the same set of sequences

• How to efficiently remove useless variables?
– Rash&Gusfield use suffix trees


Suffix Tree Example

• Strings:

1. cagtgc

2. cagttc

3. catgga

v1 - {1,2,3} v2 - {1,2,3} v3 - {3} v4 - {1} v5 - {3}

v6 - {1,2} v7 - {2} v8 - {1} v9 - {1,2,3} v10 - {1,2,3}

v11 - {1,2} v12 - {1} v13 - {2} v14 - {3} v15 - {1,2,3}

v16 - {2} v17 - {2} v18 - {1,3} v19 - {1} v20 - {3}

v21 - {1,2,3} v22 - {3} v23 - {2} v24 - {1,2} v25 - {1}


Integer Program

Minimize
V18 + V22 + V11 + V17 + V8 #objective function
Such that
V18 + V17 + V8 >= 1 #constraint to cover pair 1,2
V22 + V11 + V8 >= 1 #constraint to cover pair 1,3
V18 + V22 + V11 + V17 >= 1 #constraint to cover pair 2,3
Binaries #all variables are 0/1
V18 V22 V11 V17 V8
End

tg (V18) atgga (V22)

cagtgc 1 0

cagttc 0 0

catgga 1 1
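
With only five variables, the integer program above is small enough to check by exhaustive enumeration. The sketch below does that; it is a stand-in for a real IP solver, and the function name `solve_by_enumeration` is made up for the example. The constraint sets are copied from the program.

```python
from itertools import combinations

variables = ["V18", "V22", "V11", "V17", "V8"]
constraints = {
    (1, 2): {"V18", "V17", "V8"},
    (1, 3): {"V22", "V11", "V8"},
    (2, 3): {"V18", "V22", "V11", "V17"},
}


def solve_by_enumeration(variables, constraints):
    """Try subsets in order of increasing size and return the first one
    that satisfies every covering constraint (exponential time)."""
    for size in range(len(variables) + 1):
        for subset in combinations(variables, size):
            if all(set(subset) & allowed for allowed in constraints.values()):
                return list(subset)
    return None


optimum = solve_by_enumeration(variables, constraints)
```

No single variable appears in all three constraints, so the optimum has two variables; the first pair found is V18 and V22, the tg and atgga columns of the table.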


Limitations of Integer Program Method

• Works only for small instances
– 50-150 sequences
– Average length ~1000 characters
– Over 4 hours needed to come within 20% of optimum!

• Scalable Heuristics?


Distinguisher Induced Partition

• Key idea [Berman et al. 04]: Keep track of the partition defined by distinguishers selected so far

[Figure: sequences 1, 2, 3, …, n-1, n split into classes by Distinguisher 1 and Distinguisher 2]


Information Content Heuristic

• $\Phi$ = partition entropy = $\log_2$(#permutations compatible with current partition)
– Initial partition entropy = $\log_2(n!) \approx n \log_2 n$
– For feasible distinguisher sets, partition entropy = 0
– $\Delta(\Phi) \le n$:
    • $\log_2(n!) - \log_2(k!\,(n-k)!) < \log_2(2^n) = n$

• Information content heuristic (ICH) = greedy driven by partition entropy

• Theorem [Berman et al.’04] ICH has an approximation factor of 1+ln(n)
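
A minimal sketch of ICH follows, assuming a partition is stored as a list of classes of sequence indices, so that the number of compatible permutations is the product of the class-size factorials. The function names and the short candidate list are illustrative choices, not taken from [Berman et al.’04].

```python
from math import factorial, log2


def partition_entropy(classes):
    """log2 of the number of permutations compatible with the partition."""
    return sum(log2(factorial(len(c))) for c in classes)


def refine(classes, candidate, sequences):
    """Split every class by presence/absence of `candidate`."""
    refined = []
    for c in classes:
        present = [i for i in c if candidate in sequences[i]]
        absent = [i for i in c if candidate not in sequences[i]]
        refined.extend(part for part in (present, absent) if part)
    return refined


def ich(sequences, candidates):
    classes = [list(range(len(sequences)))]
    chosen = []
    while partition_entropy(classes) > 0:
        current = partition_entropy(classes)
        # Greedy step: largest drop in partition entropy.
        best = max(candidates, key=lambda t: current
                   - partition_entropy(refine(classes, t, sequences)))
        new_classes = refine(classes, best, sequences)
        if partition_entropy(new_classes) >= current:
            raise ValueError("candidates cannot separate all sequences")
        chosen.append(best)
        classes = new_classes
    return chosen


sequences = ["cagtgc", "cagttc", "catgga"]
chosen = ich(sequences, ["tg", "atgga", "ca", "tt"])
```

On this input the entropy drops from $\log_2 3! \approx 2.58$ to 1 and then to 0.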


ICH Limitations

• Real genomic data has degenerate nucleotides
– Ambiguous sequencing
– Single nucleotide polymorphisms

• For sequences with degenerate nucleotides there are three possibilities for distinguisher hybridization
– Sure hybridization
– Sure mismatch
– Uncertain hybridization

No partition to work with!


Practical Implementation

• ICH and setcover greedy give nearly identical results on data w/o degenerate bases

• Setcover greedy can also be extended to handle
– degenerate bases in the sequences
– redundancy requirements (each pair of sequences must be separated r times)

• Two main steps for both algorithms:
– Candidate generation
– Greedy selection


Candidate Generation

• Can be done using suffix trees

• We use a simpler yet efficient incremental approach

• Candidates that match all or only one sequence are removed from consideration

• Solution quality is similar even when candidates are generated from a single sequence
– Equivalent to considering only distinguisher sets that assign a barcode of (1,1,…,1) to the source sequence


Candidate Selection

• Evaluate $\Delta(\Phi)$ for all candidates and choose best

• Speed-up techniques
– Efficient gain computation using partition data structure
– Lazy gain update: if old $\Delta(\Phi)$ is lower than best so far, do not recompute
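
The lazy gain update can be sketched with a priority queue, shown here on plain set cover rather than on barcoding data. It relies on gains never increasing as more candidates are selected, so an old gain is an upper bound on the current one. The function name `lazy_greedy_set_cover` and the toy instance are assumptions for illustration.

```python
import heapq


def lazy_greedy_set_cover(universe, sets):
    uncovered = set(universe)
    # Max-heap via negated gains; the initial gains are exact.
    heap = [(-len(s & uncovered), x) for x, s in sets.items()]
    heapq.heapify(heap)
    chosen = []
    while uncovered and heap:
        _, x = heapq.heappop(heap)
        gain = len(sets[x] & uncovered)  # recompute only the popped candidate
        if gain == 0:
            continue  # stays useless, since gains never increase
        if heap and gain < -heap[0][0]:
            # Stale entry: another candidate's old gain is higher, so
            # reinsert with the fresh gain and look at that one next.
            heapq.heappush(heap, (-gain, x))
            continue
        chosen.append(x)
        uncovered -= sets[x]
    if uncovered:
        raise ValueError("the given sets do not cover the universe")
    return chosen


sets = {"a": {1, 2, 3, 4}, "b": {4, 5, 6}, "c": {6, 7}, "d": {1, 5, 7}}
universe = set().union(*sets.values())
cover = lazy_greedy_set_cover(universe, sets)
```

Candidates whose old gain is already below the best fresh gain are never re-evaluated in that round, which is where the savings come from on large candidate pools.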


Experimental Results

n      mat      mat lazy   part    part lazy   #dist
100    35.4     22.1       2.2     1.4         8.0
200    221.6    125.2      8.8     4.6         10.0
500    2168.8   1144.4     53.0    18.7        12.3
1000   5600.4   2756.4     113.6   31.7        14.1

• Averages over 10 testcases, sequence length = 10,000
• Barcodes for 100 sequences of length 1,000,000 computed in less than 10 minutes


Overview

• Potential function greedy algorithm
• Primer Set Selection for Multiplex PCR
• The String Barcoding Problem
• Conclusions


Conclusions

• General potential function framework for designing and analyzing greedy covering algorithms

• Improved approximation guarantees and practical performance for two important optimization problems in computational biology: primer set selection for multiplex PCR, and distinguisher selection for string barcoding


Ongoing Work

• Primer Set Selection

– Improved hybridization models

– Degenerate primers

– Partitioning into multiple multiplexed PCR reactions

– Close approximation gap for minimum multicolored sub-graph

• String Barcoding

– Probe mixtures as distinguishers

– Beyond redundancy: error correcting

– Simultaneous detection of multiple pathogens


Acknowledgments

• B. DasGupta, K. Konwar, A. Russell, A. Shvartsman

• UCONN Research Foundation