CSB 2006 1 Efficient Computation of Minimum Recombination With Genotypes (Not Haplotypes) Yufeng Wu and Dan Gusfield University of California, Davis


Efficient Computation of Minimum Recombination

With Genotypes (Not Haplotypes)

Yufeng Wu and Dan Gusfield

University of California, Davis


Haplotypes/Genotypes

• Diploid organisms have two (not identical) copies of each chromosome. A single copy is a haplotype, a vector of 0,1. The mixed description is a genotype, a vector of 0,1,2. At each site,
– If both haplotypes are 0, genotype is 0
– If both haplotypes are 1, genotype is 1
– If one is 0 and the other is 1, genotype is 2

• Key fact: easier to collect genotypes, but many downstream applications work better with haplotypes
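The site-by-site encoding above can be sketched in a few lines of Python. The function name `genotype` is an assumption made for illustration; the two haplotypes are the pair shown on the Haplotyping slide.

```python
def genotype(h1, h2):
    """Combine two 0/1 haplotypes into a 0/1/2 genotype, site by site."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same number of sites")
    # Equal alleles keep their value; a 0/1 mismatch (heterozygous site) becomes 2.
    return [a if a == b else 2 for a, b in zip(h1, h2)]


h1 = [0, 1, 1, 1, 0, 0, 1, 1, 0]
h2 = [1, 1, 0, 1, 0, 0, 1, 0, 0]
print(genotype(h1, h2))  # [2, 1, 2, 1, 0, 0, 1, 2, 0]
```

The mapping loses information: a 2 records that the two copies differ, but not which copy carries the 1.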


Haplotyping

Sites:     1 2 3 4 5 6 7 8 9
Haplotype: 0 1 1 1 0 0 1 1 0
Haplotype: 1 1 0 1 0 0 1 0 0
Genotype:  2 1 2 1 0 0 1 2 0

Haplotype Inference (HI) Problem: given a set of n genotypes, infer the real n haplotype pairs that form the given genotypes

Phasing the 2s: the first three sites (2 1 2) of the genotype 2 1 2 1 0 0 1 2 0 can be phased in two ways:

0 1 1        0 1 0
1 1 0   or   1 1 1
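As a rough illustration of why a genotype admits many HI solutions, the sketch below enumerates every haplotype pair consistent with one genotype. The function name `phasings` is hypothetical; a genotype with k sites equal to 2 has 2^(k-1) distinct unordered pairs.

```python
from itertools import product


def phasings(g):
    """Yield every unordered haplotype pair (h1, h2) consistent with genotype g."""
    het = [i for i, x in enumerate(g) if x == 2]
    if not het:
        yield list(g), list(g)
        return
    # Fix h1 = 0 at the first heterozygous site so that (h1, h2) and (h2, h1)
    # are not both reported.
    for bits in product((0, 1), repeat=len(het) - 1):
        h1, h2 = list(g), list(g)
        for i, b in zip(het, (0,) + bits):
            h1[i], h2[i] = b, 1 - b
        yield h1, h2


for h1, h2 in phasings([2, 1, 2]):
    print(h1, h2)
# [0, 1, 0] [1, 1, 1]
# [0, 1, 1] [1, 1, 0]
```

For the full nine-site genotype 2 1 2 1 0 0 1 2 0 there are three 2s, so four candidate pairs.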


Two-stage Approach

• Given a set of genotypes G, we are interested in downstream problems

• There are many HI solutions for G

• Two-stage: first infer the “correct” HI solution from the genotypes, then do the downstream analysis with the inferred haplotypes

• Haplotype inference: extensively studied and believed to be accurate to a certain extent


One-stage Approach

• What effect does haplotyping inaccuracy have on downstream questions?

• Our work: directly use genotype data for downstream problems
– Without fixing a choice for the HI solution
– Minimum recombination problem


Recombination: Single Crossover

• Recombination is one of the principal genetic forces shaping variation within species

• Two equal length sequences generate a third equal length sequence

Parent 1:     11000|1111111001   (contributes the prefix)
Parent 2:     00011|0000001111   (contributes the suffix)
Recombinant:  11000|0000001111   (| marks the breakpoint)
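A single crossover can be written as a one-line string operation. This is a minimal sketch; the function name `single_crossover` and the convention that `breakpoint` counts the sites taken from the first parent are assumptions for illustration.

```python
def single_crossover(prefix_parent, suffix_parent, breakpoint):
    """Return the recombinant made of prefix_parent's first `breakpoint` sites
    followed by the remaining sites of suffix_parent."""
    if len(prefix_parent) != len(suffix_parent):
        raise ValueError("parents must have equal length")
    if not 0 < breakpoint < len(prefix_parent):
        raise ValueError("breakpoint must fall strictly inside the sequence")
    return prefix_parent[:breakpoint] + suffix_parent[breakpoint:]


print(single_crossover("110001111111001", "000110000001111", 5))
# 110000000001111
```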


Kreitman’s Data (1983)

00000000110000000011011101111000000000000000010000000000000001101110111100000000000000000000000000000000000000000000000001000010100000000000000001100000000000000000100110000001100010110011110000000000000000001000000001000000000000100000000000000101011100001000100000000000010000000000000111111010000001111100010111001000000000000011111101100000111110001011100100000000000001111110110000011111000101110010000000000000111111011000001111111110000101000010001000011111101000000

Question: what is the minimum number of recombinations needed to derive these sequences?

Assume at most 1 mutation per site


Minimizing Recombination

• Compute the minimum number of recombinations (Rmin) for deriving a set of haplotypes, assuming at most 1 mutation per site
– NP-hard in general
– Heuristics
– Lower bounds on Rmin


Lower Bounds on Genotypes

• For a particular recombination lower bound method L, what is the range of possible bounds for L over all possible HI solutions?
– MinL(G): minimum L over all HI solutions for G.
– MaxL(G): maximum L over all HI solutions for G.

• This paper: HK bound, connected component bound and relaxed haplotype bound.
– Polynomial-time algorithms for MaxHK, MinCC.
– Heuristic method for relaxed haplotype bound.


Lower Bound: Incompatibility

Matrix M (rows a to g, sites 1 to 5):

     1 2 3 4 5
a    0 0 0 1 0
b    1 0 0 1 0
c    0 0 1 0 0
d    1 0 1 0 0
e    0 1 1 0 0
f    0 1 1 0 1
g    0 0 1 0 1

• Two sites (columns) p, q are incompatible if columns p, q contain all four ordered pairs (gametes): 00, 01, 10, 11

• Sites p, q are incompatible ⇒ a recombination must occur between p, q

Incompatibility Graph (IG): a node for each site, an edge between each incompatible pair
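The incompatibility test (often called the four-gamete test) and the resulting graph can be sketched as follows. The function names are hypothetical; the matrix is the seven-row, five-site matrix M from this slide, and sites are reported with 1-based numbers.

```python
from itertools import combinations


def incompatible(rows, p, q):
    """Four-gamete test on 0-based columns p and q of a 0/1 matrix."""
    return len({(r[p], r[q]) for r in rows}) == 4


def incompatibility_graph(rows):
    """Return the IG as a sorted list of 1-based site pairs (p, q) with p < q."""
    m = len(rows[0])
    return [(p + 1, q + 1)
            for p, q in combinations(range(m), 2)
            if incompatible(rows, p, q)]


M = [
    [0, 0, 0, 1, 0],  # a
    [1, 0, 0, 1, 0],  # b
    [0, 0, 1, 0, 0],  # c
    [1, 0, 1, 0, 0],  # d
    [0, 1, 1, 0, 0],  # e
    [0, 1, 1, 0, 1],  # f
    [0, 0, 1, 0, 1],  # g
]
print(incompatibility_graph(M))  # [(1, 3), (1, 4), (2, 5)]
```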


HK Bound (1985)

• Arrange the nodes of the incompatibility graph on a line in the order in which the sites appear in the sequence.

• HK bound = maximum number of non-overlapping edges in incompatibility graph (IG).

• Easy to compute for haplotype data.

Example (incompatibility graph on sites 1 to 5): HK Lower Bound = 1
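Treating each edge (p, q) as the interval from site p to site q, the HK bound is a maximum set of non-overlapping intervals, which a greedy scan by right endpoint finds. The sketch below is one standard way to compute it, not necessarily the authors' implementation; the name `hk_bound` is hypothetical, and edges that share only an endpoint are assumed to count as non-overlapping (consistent with E(G) = {12, 23, 35} later in the talk).

```python
def hk_bound(edges):
    """Maximum number of pairwise non-overlapping edges (p, q), p < q.

    Edges that only touch at an endpoint, such as (1, 2) and (2, 3),
    are treated as non-overlapping.
    """
    count = 0
    last_right = float("-inf")
    # Greedy interval scheduling: always take the edge that ends first.
    for left, right in sorted(edges, key=lambda e: (e[1], e[0])):
        if left >= last_right:
            count += 1
            last_right = right
    return count


print(hk_bound([(1, 3), (1, 4), (2, 5)]))  # 1
print(hk_bound([(1, 2), (2, 3), (3, 5)]))  # 3
```

The first call uses the edges obtained by applying the incompatibility test to the matrix M of the previous slide, and reproduces the bound of 1 stated above.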


IG for HI Solutions

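Genotypes G (sites 1 to 5):

0 1 0 1 0
1 0 1 0 1
0 0 2 0 2
2 2 2 0 0

HI1 (HK = 1):

0 1 0 1 0
0 1 0 1 0
1 0 1 0 1
1 0 1 0 1
0 0 0 0 0
0 0 1 0 1
0 1 0 0 0
1 0 1 0 0

HI2 (HK = 3):

0 1 0 1 0
0 1 0 1 0
1 0 1 0 1
1 0 1 0 1
0 0 0 0 1
0 0 1 0 0
0 0 0 0 0
1 1 1 0 0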


HK Bounds on Genotypes

• Known efficient algorithm for MinHK(G) (Wiuf, 2004).

• This paper: polynomial-time algorithm for MaxHK(G)


Maximal Incompatibility Graph

• An edge between sites p and q if there is a phasing of p, q so p and q are incompatible
– Each pair of sites is considered independently

• E(G): a maximum-sized set of non-overlapping edges in MIG(G)

G (sites 1 to 5):

0 1 0 1 0
1 0 1 0 1
0 0 2 0 2
2 2 2 0 0

MIG(G): graph on sites 1 2 3 4 5

E(G) = {12, 23, 35}
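The sketch below shows one way to build MIG(G) and pick E(G) for the genotype matrix G on this slide. It is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, each site pair is examined independently, and the only real choice is in rows where both sites are 2, which can contribute either {00, 11} or {01, 10}.

```python
from itertools import combinations

ALL_GAMETES = {(0, 0), (0, 1), (1, 0), (1, 1)}


def can_be_incompatible(genotypes, p, q):
    """True if some phasing of 0-based sites p, q yields all four gametes."""
    forced = set()   # gametes present under every phasing
    double_het = 0   # rows with 2 at both sites (free choice of orientation)
    for g in genotypes:
        a, b = g[p], g[q]
        if a == 2 and b == 2:
            double_het += 1
            continue
        for x in ((0, 1) if a == 2 else (a,)):
            for y in ((0, 1) if b == 2 else (b,)):
                forced.add((x, y))
    if double_het == 0:
        return forced == ALL_GAMETES
    if double_het == 1:
        return ((forced | {(0, 0), (1, 1)}) == ALL_GAMETES
                or (forced | {(0, 1), (1, 0)}) == ALL_GAMETES)
    return True  # two such rows can supply both orientations


def mig(genotypes):
    """Edges of the maximal incompatibility graph, as 1-based site pairs."""
    m = len(genotypes[0])
    return [(p + 1, q + 1) for p, q in combinations(range(m), 2)
            if can_be_incompatible(genotypes, p, q)]


def max_nonoverlapping(edges):
    """Greedy maximum set of non-overlapping edges (shared endpoints allowed)."""
    chosen, last_right = [], float("-inf")
    for left, right in sorted(edges, key=lambda e: (e[1], e[0])):
        if left >= last_right:
            chosen.append((left, right))
            last_right = right
    return chosen


G = [[0, 1, 0, 1, 0], [1, 0, 1, 0, 1], [0, 0, 2, 0, 2], [2, 2, 2, 0, 0]]
print(mig(G))                      # [(1, 2), (1, 3), (1, 5), (2, 3), (3, 5)]
print(max_nonoverlapping(mig(G)))  # [(1, 2), (2, 3), (3, 5)]
```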


MaxHK(G)

• Claim: MaxHK(G) = |E(G)|

• MaxHK(G) ≤ |E(G)|
– MIG(G): supergraph of IG(H) for any HI solution H

• If we can find an HI solution H, whose every pair of sites in E(G) is incompatible, then HK(H) ≥ |E(G)|

• Together, MaxHK(G) = |E(G)|
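On a small example the claim can be checked by brute force. The sketch below enumerates every HI solution of the four-genotype example G used in this talk (eight solutions in total) and computes the HK bound of each; all function names are hypothetical, and the exhaustive search is exponential in the number of 2s, so it is only a sanity check.

```python
from itertools import combinations, product


def phasings(g):
    """All unordered haplotype pairs consistent with genotype g."""
    het = [i for i, x in enumerate(g) if x == 2]
    if not het:
        return [(tuple(g), tuple(g))]
    out = []
    for bits in product((0, 1), repeat=len(het) - 1):
        h1, h2 = list(g), list(g)
        for i, b in zip(het, (0,) + bits):
            h1[i], h2[i] = b, 1 - b
        out.append((tuple(h1), tuple(h2)))
    return out


def hk(haplotypes):
    """HK bound of a 0/1 haplotype matrix (greedy over incompatible pairs)."""
    m = len(haplotypes[0])
    edges = [(p, q) for p, q in combinations(range(m), 2)
             if len({(h[p], h[q]) for h in haplotypes}) == 4]
    count, last = 0, -1
    for p, q in sorted(edges, key=lambda e: e[1]):
        if p >= last:
            count, last = count + 1, q
    return count


def hk_over_all_hi_solutions(genotypes):
    """HK bound of every HI solution (one phasing chosen per genotype)."""
    values = []
    for choice in product(*(phasings(g) for g in genotypes)):
        haplotypes = [h for pair in choice for h in pair]
        values.append(hk(haplotypes))
    return values


G = [[0, 1, 0, 1, 0], [1, 0, 1, 0, 1], [0, 0, 2, 0, 2], [2, 2, 2, 0, 0]]
values = hk_over_all_hi_solutions(G)
print(min(values), max(values))  # 1 3
```

The maximum of 3 agrees with |E(G)| for E(G) = {12, 23, 35}, and the minimum of 1 is the value attained by the solution labelled HI1.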


Finding such an H

• Phase sites from left to right.

• Each component in E(G) is a simple path

• Each site is only constrained by at most one site to the left


Phasing G for Incompatibility

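Haplotype rows (sites 1 to 5) at three successive stages of phasing; ? marks a site not yet phased:

Stage 1      Stage 2      Stage 3
0 1 0 1 0    0 1 0 1 0    0 1 0 1 0
0 1 0 1 0    0 1 0 1 0    0 1 0 1 0
1 0 1 0 1    1 0 1 0 1    1 0 1 0 1
1 0 1 0 1    1 0 1 0 1    1 0 1 0 1
0 0 ? 0 ?    0 0 ? 0 ?    0 0 1 0 ?
0 0 ? 0 ?    0 0 ? 0 ?    0 0 0 0 ?
0 ? ? 0 0    0 0 ? 0 0    0 0 0 0 0
1 ? ? 0 0    1 1 ? 0 0    1 1 1 0 0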

• No matter how a previous site p is phased, can always phase this site q to make p, q incompatible
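The left-to-right phasing idea can be sketched as below. This is an illustrative reading of the slides, not the authors' implementation: the function name is hypothetical, edges are 1-based pairs (p, q) with p < q, and each site is assumed to be the right endpoint of at most one edge. For an edge (p, q), only rows that are 2 at both sites matter, and each such row is oriented "aligned" ({00, 11}) or "crossed" ({01, 10}) relative to the phasing already chosen at p.

```python
ALL_GAMETES = {(0, 0), (0, 1), (1, 0), (1, 1)}


def phase_for_incompatibility(genotypes, edges):
    """Phase sites left to right so that every edge (p, q) is incompatible.

    Returns two haplotypes per genotype, in genotype order.
    """
    left_of = {q: p for p, q in edges}
    pairs = [[list(g), list(g)] for g in genotypes]
    for q in range(1, len(genotypes[0]) + 1):
        j = q - 1
        het = [i for i, g in enumerate(genotypes) if g[j] == 2]
        for i in het:  # default phasing, possibly overridden below
            pairs[i][0][j], pairs[i][1][j] = 0, 1
        if q not in left_of:
            continue
        k = left_of[q] - 1
        both = [i for i in het if genotypes[i][k] == 2]
        # Gametes that appear no matter how site q is phased.
        forced = {(h[k], h[j]) for i, pr in enumerate(pairs)
                  if i not in both for h in pr}
        wanted = []  # 0 = aligned with site p, 1 = crossed
        if not {(0, 0), (1, 1)} <= forced:
            wanted.append(0)
        if not {(0, 1), (1, 0)} <= forced:
            wanted.append(1)
        if len(wanted) > len(both):
            raise ValueError(f"sites {left_of[q]}, {q} cannot be made incompatible")
        for i, flip in zip(both, wanted):
            for h in pairs[i]:
                h[j] = h[k] ^ flip
    return [h for pr in pairs for h in pr]


G = [[0, 1, 0, 1, 0], [1, 0, 1, 0, 1], [0, 0, 2, 0, 2], [2, 2, 2, 0, 0]]
H = phase_for_incompatibility(G, [(1, 2), (2, 3), (3, 5)])
for h in H:
    print(*h)
```

On this input the sketch returns the same eight haplotypes as the solution labelled HI2 earlier in the talk.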


Haplotyping With Minimum Number of Recombinations

• Compute Rmin(G)
– Haplotyping on a network with fewest recombinations

• NP-hard

• This paper: a branch and bound method computing exact Rmin(G) for data with a small number of sites

• APOE data: 47 non-trivial genotypes, 9 sites
– Our method: 2 minutes, Rmin(G) = 5


Application: Recombination Hotspot

• Recombination hotspot: regions where recombination rate is much higher than neighboring regions

• Previous study (Bafna and Bansal, 2005): a recombination lower bound computed on inferred haplotypes was used to identify recombination hotspots

• Our work: compute the exact Rmin(G) with genotypes for a sliding window of a small number of SNPs to detect recombination hotspots


Result from haplotypes (Bafna and Bansal, 2005)

Result from original genotypes (this paper)

MS32 data (Jeffreys, et al. 2001)


Other Applications

• Finding true Rmin from genotypes G
– Two-stage approach: run PHASE to get an HI solution H, and compute Rmin(H)
– One-stage approach: directly compute Rmin(G)

• Accuracy of haplotype inference on a minimum network

• Simulation results: comparable, slightly weaker and non-conclusive


Summary

• Main goal of this paper: develop computational tools for the minimum recombination problem with genotypes
– Polynomial-time algorithms for MaxHK and MinCC problems
– Practical heuristics for other problems
– Simulation results for several application questions are not conclusive
– Our tools facilitate the study of these problems


Thank You

• Software: available upon request