
[IEEE 2011 First International Conference on Informatics and Computational Intelligence (ICI) - Bandung, Indonesia (2011.12.12-2011.12.14)] 2011 First International Conference on Informatics


A Probabilistic Branch-and-Bound Approach for Reconstructing of Haplotype using SNP-Fragments and Related Genotype

Rasoul Taghipour€, Naemeh GanoodiΦ, Ehsan AsgarianΨ

€ Department of Computer Engineering, Neyshabur Branch, Islamic Azad University, Neyshabur, Iran

Φ Faculty of Science, University of Birjand, Birjand, Iran

Ψ Department of Engineering, Quchan Institute of Engineering and Technology, Quchan, Iran

[email protected], Φ[email protected], Ψ[email protected]

Abstract: Most positions of the human genome are invariant (about 99%); only a small fraction (about 1%) commonly vary, and these variant positions are associated with complex genetic diseases. The haplotype reconstruction problem divides aligned single nucleotide polymorphism (SNP) fragments into two classes and infers a pair of haplotypes from them. An important computational model of this problem is minimum error correction (MEC), but it is effective only when the error rate of the fragments is low. MEC/GI, an extension of MEC, uses compatible genotype information in addition to the SNP fragments and therefore yields more accurate inference. Because haplotyping problems are NP-hard, several computational and heuristic methods have been proposed to find feasible answers. In this paper, we develop a new branch-and-bound algorithm with running time $O\left(nm \times 2^{h} \left\lceil \frac{n-h}{k} \right\rceil\right)$, in which m is the maximum length of the SNP fragments over heterozygous SNP sites, n is the number of fragments and h is the depth of our exploration in the binary tree. Since h (h ≪ n) is small in real biological applications, the proposed algorithm is practical and efficient.

Keywords: Branch-and-Bound Problem; haplotype; SNP fragments; genotype information; classification; reconstruction rate.

I. INTRODUCTION

MEC has been proved to be an NP-hard problem [1-3], so heuristic methods are used to reduce the running time [4-6]. The problem has been addressed with several classification and heuristic methods [7, 8]. Zhang introduced a classification algorithm based on two distances (the Hamming distance and a proposed distance) for comparing SNP fragments [9]; in that work an algorithm was implemented to solve the MEC model. Real and simulated data sets are available as two standard databases, and the inputs in these databases have error rates between 10% and 40%. The method in [10] is based on the K-means algorithm. Although its results and running time were acceptable, K-means is widely believed not to work well on noisy inputs.

A genetic algorithm (GA) for solving the MEC and MEC/GI models of haplotype reconstruction was published in [11]. The goal of the MEC/GI model is again to find the best partition of the SNP fragments. With a modified error function, the algorithms for the MEC model can be used directly to solve the MEC/GI model [11, 12]. The GA's haplotyping results were better than those of K-means, but it required more execution time. On the other hand, GA adapts to the error rate and is unlikely to get stuck in local minima.

Because each of these approaches has its own strengths and weaknesses, we set out to design a practical and efficient method that searches the solution space in more detail.

In this paper we design a Probabilistic Branch-and-Bound (PBB) algorithm to increase the reconstruction rate. Plain branch and bound searches the solution space more precisely to find the best answer, but it is not practical because of its high execution time. We therefore prune some nodes of the binary tree using probabilistic methods. The proposed approach is practical and improves the results for this problem. To demonstrate its effectiveness, we applied the PBB algorithm to a standard SNP fragment database and obtained good results in comparison with K-means [10] and GA [11].

II. NOTATION AND PROBLEM DEFINITION

Suppose that there are m SNP fragments from a pair of haplotypes. M = (mij) is the matrix of fragments, in which each entry mij has the value ‘A’, ‘B’ or ‘-’ (‘-’ marks a missing or skipped SNP site, which is called a gap). We use a partition P(C1, C2), where C1 and C2 are two classes, to formulate the problem. P, produced by an exact algorithm or a clustering method, divides the fragments into C1 and C2. Each haplotype is reconstructed from the members of one of the classes with a voting function. The function is defined as follows, where $N_A^{j}(M)$ (or $N_B^{j}(M)$) denotes the number of ‘A’s (or ‘B’s) in the jth column of matrix M:

2011 First International Conference on Informatics and Computational Intelligence

978-0-7695-4618-6/11 $26.00 © 2011 IEEE. DOI 10.1109/ICI.2011.17


$$V_{ij} = \begin{cases} A & \text{if } N_A^{j}(C_i) \ge N_B^{j}(C_i) \\ B & \text{otherwise} \end{cases}, \qquad i \in \{1,2\},\ 0 < j \le n$$

A haplotype h is a vector (h1, … ,hm) over {0,1}. A genotype g is a vector (g1, … ,gn) over {0,1,2}. Let h1, h2 be a pair of haplotypes from the corresponding pair of chromosomes. Then the relationship between h1, h2 and g is:

$$g[i] = \begin{cases} h_1[i] & \text{if } h_1[i] = h_2[i] \\ 2 & \text{otherwise} \end{cases} \qquad (i = 1, \dots, n)$$

We denote this by h1 ⊕ h2 = g, or h1 = g ⊖ h2, h2 = g ⊖ h1, where h1 and h2 are said to resolve the genotype g.

Reconstruction rate (RR) is a simple and popular measure for comparing the results of algorithms on existing datasets. RR, which is based on the Hamming distance (HD), is the degree of similarity between the original haplotypes (h = (h1, h2)) and the reconstructed ones (h' = (h'1, h'2)). The Hamming distance of two fragments, HD(fi, fj), and RR(h, h') are formulated as:

$$d(h_{ij}, h_{kj}) = \begin{cases} 1 & \text{if } h_{ij} \neq h_{kj} \\ 0 & \text{otherwise} \end{cases}$$

$$HD(h_i, h_k) = \sum_{j=1}^{n} d(h_{ij}, h_{kj})$$

$$r_{ij} = HD(h_i, h'_j), \qquad i, j = 1, 2$$

$$RR(h, h') = 1 - \frac{\min(r_{11} + r_{22},\ r_{12} + r_{21})}{2n}$$

HD1 and HD2 denote the two distances obtained by comparing fi with the two other fragments (f1 and f2).
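As an illustration of these two measures, the following Python sketch computes HD and RR for haplotypes written as strings over {A, B}. The function names `hamming` and `reconstruction_rate` and the string encoding are assumptions made for this example; they are not taken from the original implementation.

```python
def hamming(h1, h2):
    """HD: number of positions at which two equal-length haplotypes differ."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same length")
    return sum(1 for a, b in zip(h1, h2) if a != b)


def reconstruction_rate(original, reconstructed):
    """RR(h, h') = 1 - min(r11 + r22, r12 + r21) / (2n)."""
    (h1, h2), (g1, g2) = original, reconstructed
    n = len(h1)
    r11, r22 = hamming(h1, g1), hamming(h2, g2)
    r12, r21 = hamming(h1, g2), hamming(h2, g1)
    return 1 - min(r11 + r22, r12 + r21) / (2 * n)


original = ("ABAB", "BABA")
# The reconstructed pair comes out in swapped order, with one wrong site.
reconstructed = ("BABA", "ABBB")
print(reconstruction_rate(original, reconstructed))  # 0.875
```

Taking the minimum over the two pairings makes RR independent of the order in which the two reconstructed haplotypes are reported.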

In this paper we study the MEC/GI (Minimum Error Correction with Genotype Information) model. In this model a matrix of SNP fragments is available as input, and we try to decrease the number of errors in the reconstructed haplotypes in comparison with the corresponding real haplotypes.
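The voting function and the genotype relation defined in this section can be sketched in Python as follows. This is a minimal sketch under stated assumptions: fragments are strings over {A, B, -}, gaps do not vote, ties are resolved toward 'A', and haplotypes are encoded as lists over {0, 1} for the genotype relation. The names `vote` and `genotype` are hypothetical.

```python
def vote(cluster):
    """Column-wise majority over the fragments of one class; gaps ('-') do not vote.

    Ties are resolved toward 'A', which is an assumption of this sketch.
    """
    return "".join(
        "A" if column.count("A") >= column.count("B") else "B"
        for column in zip(*cluster)
    )


def genotype(h1, h2):
    """g[i] = h1[i] where the two haplotypes agree, and 2 where they differ."""
    return [a if a == b else 2 for a, b in zip(h1, h2)]


c1 = ["AB-A", "ABBA", "-BBB"]
print(vote(c1))                                # ABBA
print(genotype([0, 1, 1, 0], [0, 0, 1, 1]))    # [0, 2, 1, 2]
```

A site with genotype value 2 is heterozygous, which is where the two reconstructed haplotypes must differ if they are to resolve the genotype.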

III. PROPOSED APPROACHES

Our algorithm searches for a path in a binary tree in which the node on the jth level denotes the jth fragment and the branch on the path connecting it to its child denotes the fragment's class membership. Owing to the symmetry of complete binary trees, we only need to search half of the tree to cover all possible solutions. To speed up the algorithm, E(c1, c2, ..., ck) can be replaced with a tighter lower bound, similar to the one used by Koontz et al. [13], once the search process arrives at a certain depth.

We use a binary string to express a solution vector which represents a classification of SNP fragments (a feasible solution to the MEC model). The length of each solution vector in the hypothesis space is the number of SNP fragments. After labeling SNP fragments by 1,2,...,m, the value 0 or 1 on the ith position of an individual characterizes the class-membership of the ith SNP fragment. Thus, all of the binary strings having length of m constitute the hypothesis space:

H = {(x1,x2,…,xm) | xi ∈ {0,1}, i = 1,2,…,m}
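The symmetry argument can be checked on a small case. The Python sketch below enumerates the hypothesis space for m = 4 and confirms that fixing the first label to 0 still reaches every distinct partition, because a binary string and its complement describe the same pair of classes. The helper names `half_space` and `as_partition` are assumptions introduced for this illustration.

```python
from itertools import product


def half_space(m):
    """Yield the binary solution vectors of length m whose first label is 0."""
    for rest in product((0, 1), repeat=m - 1):
        yield (0,) + rest


def as_partition(x):
    """Unordered pair of classes (sets of fragment indices) encoded by x."""
    c0 = frozenset(i for i, xi in enumerate(x) if xi == 0)
    c1 = frozenset(i for i, xi in enumerate(x) if xi == 1)
    return frozenset((c0, c1))


m = 4
all_vectors = list(product((0, 1), repeat=m))
full = {as_partition(x) for x in all_vectors}
half = {as_partition(x) for x in half_space(m)}
print(len(all_vectors), len(full), len(half))  # 16 8 8
```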

In this paper, we present a branch-and-bound algorithm that finds sub-optimal solutions to the MEC/GI model. The main idea is to use a fast greedy method to search the solution space. To do so, the vector representing a solution (a clustering) is generated gradually over a number of steps. The length of a solution vector depends on the number of SNP fragments in the dataset. To reach the best possible solution, t solution vectors are first generated simultaneously; the best vector is then selected based on its fitness value. The process of generating a solution vector is described below.

Table 1: Reconstruction rate on ACE, DALY, SIM0 and SIM50 datasets for different Gap and error rate


In the first step, we create all possible states of separating the first h SNP fragments into two classes, which gives $2^{h}$ cluster states. In each subsequent step, all possible states over the next levels are generated, again $2^{h}$ in total. We select $2^{h}/k$ (or $2^{h - \log_2 k}$) states as the set of possible solutions based on a probability p and eliminate the rest. For each possible solution, p increases with its fitness. The total number of generated states at each step is $2^{h}$, out of which only $2^{h - \log_2 k}$ solution vectors with greater fitness are selected. For each selected solution, the next levels are expanded (the cluster labels of the next $\log_2 k$ SNP fragments are assigned); in other words, k different clustering states are generated. Therefore $2^{h}$ new states are considered over all $2^{h - \log_2 k}$ selected solutions ($k \times 2^{h - \log_2 k} = 2^{h}$). This process repeats until the solution vector is complete. Finally, the solution vector with the greatest fitness is selected as the final solution. The running time of our algorithm is:

$$O\left(nm \times 2^{h} \left\lceil \frac{n-h}{k} \right\rceil\right)$$

where m is the maximum length of the SNP fragments over heterozygous SNP sites, n is the number of fragments, and h and k are the depths of our exploration of the binary tree in the first step and in the subsequent steps, respectively. Since h (h ≪ n) is small in real biological applications, the proposed algorithm is practical and efficient. The algorithm was implemented with t = 10, h = 10 and k = 3 (Table 2). In this problem the objective function is defined as:

$$\text{Maximize } f(x_1, x_2, \dots, x_m) = \frac{mn - E(P\{x_1, x_2, \dots, x_m\})}{mn}, \qquad x_i \in \{0, 1\}$$

where m is the length of an SNP fragment; n is the number of SNP fragments; xi is the ith SNP of the current SNP fragment; P{x1, x2,…, xm} is a partitioning of {x1, x2,…, xm}; and E(P{x1,x2,…,xm}) is the corresponding error correction measured against the class centers, i.e. the distance between each fragment and the center of its class.
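To make the objective concrete, the following Python sketch evaluates f for a given labeling of a few toy fragments. It is an illustration under several assumptions that the text does not fix: the solution vector `x` holds one class label per fragment, the class center is the column-wise majority vote with ties going to 'A', gaps are not counted as errors, and the genotype term of MEC/GI is omitted. The names `center`, `mec_error` and `fitness` are hypothetical.

```python
def center(cluster, length):
    """Column-wise majority vote over one class; ties and empty columns give 'A'."""
    result = []
    for j in range(length):
        column = [f[j] for f in cluster]
        result.append("A" if column.count("A") >= column.count("B") else "B")
    return "".join(result)


def mec_error(fragments, x):
    """E(P): mismatches between every fragment and the center of its own class."""
    length = len(fragments[0])
    error = 0
    for label in (0, 1):
        cluster = [f for f, xi in zip(fragments, x) if xi == label]
        c = center(cluster, length)
        for f in cluster:
            # Gaps ('-') are skipped: they carry no evidence for either allele.
            error += sum(1 for a, b in zip(f, c) if a != "-" and a != b)
    return error


def fitness(fragments, x):
    """f = (mn - E(P)) / (mn), with m the fragment length and n the fragment count."""
    m, n = len(fragments[0]), len(fragments)
    return (m * n - mec_error(fragments, x)) / (m * n)


fragments = ["ABAB", "AB-B", "BABA", "BAAA", "A-AB"]
print(fitness(fragments, [0, 0, 1, 1, 0]))  # 0.95
print(fitness(fragments, [0, 1, 0, 1, 0]))  # a worse partition scores lower
```

Because a majority-vote center can disagree with at most half of the entries in any column, the fitness under these assumptions always lies between 0.5 and 1.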

Algorithm: Probabilistic Branch-and-Bound (PBB)

Input: SNP fragments and related genotype

Output: Two haplotypes

Step 0: Initialize the parameters (t, h and k).

Parallel steps (repeated t times, each run generating one solution vector):

Step 1: Create all possible states of separating the first h SNP fragments into two classes ($2^{h}$ cluster states).

Step 2: Select $2^{h - \log_2 k}$ solution vectors according to the fitness value of each state.

Step 3: For each selected solution, expand the next levels (the cluster labels of the next $\log_2 k$ SNP fragments), so that $2^{h}$ new states are generated.

Step 4: Go to Step 2 until the solution vector is complete.

Serial steps (combining the results of the parallel steps):

Step 5: Select the best vector (from among the t solutions) based on fitness value.

Step 6: Produce two haplotypes from the two current clusters using the voting function.

Table 2: Pseudo code for PBB
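The steps of Table 2 can be sketched in Python as below. This is not the authors' implementation; it is a minimal sketch under assumptions that the pseudo code leaves open: the selection probability is taken to be proportional to fitness, k is required to be a power of two so that $\log_2 k$ fragments are labeled per expansion, partial solutions are scored on the fragments labeled so far, the t runs are executed one after another, and the genotype information of MEC/GI is not used. All function names are hypothetical.

```python
import math
import random
from itertools import product


def vote(cluster, m):
    """Column-wise majority over one class; ties and empty columns give 'A' (assumption)."""
    return "".join(
        "A" if sum(f[j] == "A" for f in cluster) >= sum(f[j] == "B" for f in cluster) else "B"
        for j in range(m)
    )


def fitness(fragments, x):
    """(mn - E) / (mn), evaluated on the fragments labeled so far (a prefix of the input)."""
    part = fragments[: len(x)]
    m, n = len(part[0]), len(part)
    error = 0
    for label in (0, 1):
        cluster = [f for f, xi in zip(part, x) if xi == label]
        c = vote(cluster, m)
        error += sum(a != "-" and a != b for f in cluster for a, b in zip(f, c))
    return (m * n - error) / (m * n)


def weighted_sample(rng, states, weights, count):
    """Draw `count` distinct states; each draw is proportional to the remaining weights."""
    states, weights = list(states), list(weights)
    chosen = []
    for _ in range(min(count, len(states))):
        i = rng.choices(range(len(states)), weights=weights)[0]
        chosen.append(states.pop(i))
        weights.pop(i)
    return chosen


def one_run(fragments, h, k, rng):
    """Steps 1-4: build one complete solution vector."""
    n = len(fragments)
    step = int(math.log2(k))  # fragments labeled per expansion
    depth = min(h, n)
    states = list(product((0, 1), repeat=depth))  # Step 1: all 2^h prefixes
    while depth < n:
        weights = [fitness(fragments, s) for s in states]
        keep = max(1, len(states) // k)
        survivors = weighted_sample(rng, states, weights, keep)  # Step 2
        grow = min(step, n - depth)
        states = [s + ext for s in survivors for ext in product((0, 1), repeat=grow)]  # Step 3
        depth += grow  # Step 4: repeat until every fragment is labeled
    return max(states, key=lambda s: fitness(fragments, s))


def pbb(fragments, t=10, h=4, k=2, seed=0):
    if k < 2 or k & (k - 1):
        raise ValueError("this sketch requires k to be a power of two")
    rng = random.Random(seed)
    runs = [one_run(fragments, h, k, rng) for _ in range(t)]  # run serially here
    best = max(runs, key=lambda s: fitness(fragments, s))  # Step 5
    m = len(fragments[0])
    classes = [[f for f, xi in zip(fragments, best) if xi == label] for label in (0, 1)]
    return best, tuple(vote(c, m) for c in classes)  # Step 6


fragments = ["ABAB", "AB-B", "BABA", "BAAA", "A-AB", "BA-A"]
best, haplotypes = pbb(fragments, t=5, h=3, k=2)
print(best, haplotypes, fitness(fragments, best))
```

When h is at least the number of fragments, Step 1 already enumerates every labeling and the sketch reduces to exhaustive search; for smaller h the result depends on the random selection in Step 2, which is why several runs are combined in Step 5.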

IV. EXPERIMENTAL RESULTS

Real biological datasets and simulated datasets are available for the haplotyping problem, such as ACE, DALY, SIM0 and SIM50. We chose the DALY dataset, which includes 4 subsets. Each subset has a different error rate (10%, 20%, 30%, 40%) and includes 384 test cases [14]. Table 1 shows the results of the experiments on the DALY set for the MEC/GI model. Figure 1 compares the results of the methods for various gap and error rates on the DALY database. All algorithms were implemented for the MEC/GI model.

V. CONCLUSION

In this paper, we focused on probabilistic branch-and-bound for solving the MEC/GI model, which is intended to infer haplotypes with high accuracy by employing genotype information. All of the aforementioned methods were implemented and tested on the MEC/GI model, and their results were compared in terms of reconstruction rate. In the MEC/GI model, PBB outperforms the other approaches, because GA finds only near-optimal solutions in the search space and K-means acts as a local heuristic classifier.


Figure 1: Difference between the reconstruction rate of PBB and those of K-means and GA at different gap rates on the DALY database.

REFERENCES

[1] V. Bafna, et al., "Polynomial and APX-hard cases of the individual haplotyping problem," Theoretical Computer Science, vol. 335, pp. 109-125, 2005.

[2] R. Cilibrasi, et al., "On the complexity of several haplotyping problems," Algorithms in Bioinformatics, pp. 128-139, 2005.

[3] R. Cilibrasi, et al., "The complexity of the single individual SNP haplotyping problem," Algorithmica, vol. 49, pp. 13-36, 2007.

[4] P. Bonizzoni, et al., "The haplotyping problem: an overview of computational models and solutions," Journal of Computer Science and Technology, vol. 18, pp. 675-688, 2003.

[5] B. Halldórsson, et al., "A survey of computational methods for determining haplotypes," Computational Methods for SNPs and Haplotype Inference, pp. 613-614, 2004.

[6] M. H. Moeinzadeh, et al., "Three Heuristic Clustering Methods for Haplotype Reconstruction Problem with Genotype Information," in International Conference on Innovations in Information Technology, 2007, pp. 402-406.

[7] E. Asgarian, et al., "Solving MEC model of haplotype reconstruction using information fusion, single greedy and parallel clustering approaches," presented at the Sixth ACS/IEEE International Conference on Computer Systems and Applications, 2008.

[8] M. H. Moeinzadeh, et al., "Neural network based approaches, solving haplotype reconstruction in MEC and MEC/GI models," in International Conference on Modeling & Simulation, 2008, pp. 934-939.

[9] X. S. Zhang, et al., "Minimum conflict individual haplotyping from SNP fragments and related genotype," Evolutionary bioinformatics online, vol. 2, p. 261, 2006.

[10] Y. Wang, et al., "A clustering algorithm based on two distance functions for MEC model," Computational biology and chemistry, vol. 31, pp. 148-150, 2007.

[11] R. S. Wang, et al., "Haplotype reconstruction from SNP fragments by minimum error correction," Bioinformatics, vol. 21, p. 2456, 2005.

[12] J. Wang, et al., "A practical exact algorithm for the individual haplotyping problem MEC/GI," Algorithmica, vol. 56, pp. 283-296, 2010.

[13] W. L. G. Koontz, et al., "A branch and bound clustering algorithm," IEEE Transactions on Computers, vol. 100, pp. 908-915, 1975.

[14] M. J. Daly, et al., "High-resolution haplotype structure in the human genome," Nature Genetics, vol. 29, pp. 229-232, 2001.
