Page 1:

Noam Shental

Department of Computer Science, The Open University of Israel

Identification of rare alleles and their carriers

using compressed se(que)nsing

Page 2:

The Task:

Identification of rare alleles and their carriers

Input

• Genes (‘regions’) of interest

• Large set of DNA samples

Output

[Figure labels: rare allele at position #1; rare allele at position #2]

• List of mutation loci.

• List of carriers

We try to perform this task with minimal resources

Page 3:

Motivation #1: Nationwide Carrier Screen for known risk alleles

Example: Rare recessive genetic diseases

Genotype    Phenotype
Normal      Healthy
Carrier     Healthy!
Affected    Sick

Page 5:

Large scale carrier screen

Genetic Disorder          Carrier rate
Tay-Sachs                 1:25
Cystic Fibrosis           1:30
Familial Dysautonomia     1:30
Usher Syndrome            1:40
Canavan                   1:40
Glycogen Storage          1:71
Fanconi Anemia C          1:80
Niemann-Pick              1:80
Mucolipidosis type 4      1:100
Bloom                     1:102
Nemaline Myopathy         1:108

(rates vary across ethnic groups)

Page 6:

Genetic Disorder    Carrier rate
Tay-Sachs           1:25

Specific mutations

HEXA gene on chromosome 15

over 100 mutations are known

TS (1277)        3.50%
TS (1421)        0.30%
TS (F304/305)    0.10%
TS (G269)        0.10%
TS (R170Q)       0.20%

Page 7:

Motivation #2: de novo SNP identification

A. known genes

Tay-Sachs carriers – find new mutations

HEXA gene on chromosome 15

over 100 mutations are known

TS (1277)        3.50%
TS (1421)        0.30%
TS (F304/305)    0.10%
TS (G269)        0.10%
TS (R170Q)       0.20%

Page 8:

Motivation #2: de novo SNP identification

B. Association Studies

[Figure labels: Controls, Cases]

• ‘discoveries’ - variations that are less prevalent in the control groups.

• p-value

Page 9:

The Task:

Identification of rare alleles and their carriers

Input

• Genes (‘regions’) of interest

• Large set of DNA samples

Output

[Figure labels: rare allele at position #1; rare allele at position #2]

• List of mutation loci.

• List of carriers

We try to perform this task with minimal resources

Page 10:

Identification of rare alleles and their carriers

using compressed se(que)nsing

Joint work with

Amnon Amir

Department of Physics of Complex Systems, Weizmann Institute of Science

Or Zuk

Broad Institute of MIT and Harvard

Nucleic Acids Research, Aug 2010, doi:10.1093/nar/gkq675

Looking for collaborations

Page 11:

Outline

• Motivation

• Overview of the approach

• Introduction:

    compressed sensing (CS) and its correspondence to our problem

    next generation sequencing technology (NGST)

• Model:

    modeling NGST

• Simulation results:

    ‘pure simulations’

    Simulations based on experimental data

• Prior work

• Current research

• Conclusions

Page 13:

Specific mutations - notation

Reference genome                          …AGCGTTCT…    “A”
Single-nucleotide polymorphisms (SNPs)    …AGTGTTCT…    “B”
Insertions/Deletions (InDels)             …AGGTTCT      “B”

Carrier test screen: Amplify a sample of DNA and then test

Genotype                                  “AA”    “AB”
Fraction of B’s out of tested alleles     0       1/2

Page 14:

Naïve approach – one test per individual

collect DNA samples

Apply 9 independent tests

Genotype                                 AA   AB   AA   AA   AA   AB   AA   AA   AA
Fraction of B’s out of tested alleles    0    1/2  0    0    0    1/2  0    0    0
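The naïve screen can be sketched in a few lines. This is only an illustration: the helper name `b_fraction` is hypothetical, and the order of the nine genotypes is assumed to match the readouts 0, 1/2, 0, 0, 0, 1/2, 0, 0, 0 on the slide.

```python
from fractions import Fraction

def b_fraction(genotype: str) -> Fraction:
    """Fraction of B alleles among the two alleles of one individual."""
    return Fraction(genotype.count("B"), len(genotype))

# Nine individuals, two of them carriers (AB); the order is an assumption.
genotypes = ["AA", "AB", "AA", "AA", "AA", "AB", "AA", "AA", "AA"]

# Naive approach: one independent test per individual.
readouts = [b_fraction(g) for g in genotypes]
carriers = [i + 1 for i, f in enumerate(readouts) if f > 0]
num_tests = len(genotypes)

print(readouts, carriers, num_tests)
```

The number of tests equals the number of individuals, which is the cost the pooled design tries to avoid.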

Page 15:

Compressed sensing based group testing

[Figure: Next Generation Sequencing Technology measures the fraction of B’s in each pool; compressed sensing is used to infer/reconstruct the carriers; a few tests instead of 9]

Page 16:

Example

[Figure: axis label: # pools]

Page 17:

Take home message

• Generic approach that puts together sequencing and CS for identifying rare alleles and their carriers.

• Much higher efficiency over the naive approach - may significantly improve cost effectiveness in future Association Studies, and in screening large DNA cohorts for specific risk alleles.

• In our approach experimental costs (both sample preparation and direct sequencing costs) are proportional to the number of pools and NOT to the number of samples.

Page 19:

The CS problem

Task: Infer the entries of a vector x with a minimal number of “operations”

$x = (x_1, x_2, \ldots, x_N)^T$

Assumption: x is sparse

How is it done?

We can select a vector $m_i$ as we wish, and get a “measurement”: $y_i = m_i \cdot x$

Example: $x = (53.4, 0, 0, 0, 0, 345, 0, 0, 47.1, 0)^T$, $m_i = (\pm 1, \ldots, \pm 1)$ (ten entries), $y_i = m_i \cdot x = 351$
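A single measurement is just a dot product. The sketch below uses the three nonzero values of x shown on the slide; the sign pattern of `m_i` is a hypothetical choice (the slide's exact entries are not recoverable here), picked so that the result is close to the 351 shown.

```python
def measure(m, x):
    """One measurement: the dot product y_i = m_i . x."""
    if len(m) != len(x):
        raise ValueError("m and x must have the same length")
    return sum(mj * xj for mj, xj in zip(m, x))

# Sparse vector with the three nonzero values shown on the slide.
x = [53.4, 0, 0, 0, 0, 345, 0, 0, 47.1, 0]

# Hypothetical +/-1 measurement vector (the signs are an assumption).
m_i = [1, -1, 1, 1, -1, 1, -1, 1, -1, 1]

y_i = measure(m_i, x)
print(round(y_i, 1))
```

Only the signs at the three nonzero positions of x affect the result; the other entries of `m_i` multiply zeros.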

Page 20:

The CS problem (cont)

How is it done?

We can select a vector $m_i$ as we wish, and get a “measurement”: $y_i = m_i \cdot x$

We can repeat that k times, and get k “measurements”:

$y = M x$, where $y$ has $k$ entries and $M = (m_1, m_2, \ldots, m_k)^T$ is a $k \times N$ matrix

Problem: k<<N: under-determined system
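Why k << N is a problem can be seen on a toy system. The sketch below (all values are illustrative, not from the talk) stacks k = 2 measurements of a vector with N = 3 entries and shows two different vectors that produce the same y.

```python
def measure_all(M, x):
    """Stack k measurements: y = M x, one dot product per row of M."""
    return [sum(mj * xj for mj, xj in zip(row, x)) for row in M]

# k = 2 measurements of a vector with N = 3 entries (k < N).
M = [[1, 1, 0],
     [0, 1, 1]]

x_a = [1, 0, 1]
x_b = [0, 1, 0]

y_a = measure_all(M, x_a)
y_b = measure_all(M, x_b)
print(y_a, y_b)
```

Both candidates explain the measurements equally well, so an extra assumption such as sparsity is needed to choose between them.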

Page 21:

The CS problem (cont)

Solution: CS breakthrough #1:

If M obeys certain properties it is effectively invertible: $x = \text{“}M^{-1}\text{”}\, y$

Example: Bernoulli Matrix

$M$ with $\pm 1$ entries (3 rows of 10 entries), or $M$ with 0/1 entries:

1 0 0 1 0 1 0 0 1 0
0 1 0 1 1 0 1 1 0 0
1 1 1 0 1 1 0 1 0 1

Easy to create a “suitable” matrix M
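Creating such a matrix takes one line of random sampling. A minimal sketch, assuming independent 0/1 entries with probability p = 0.5; the function name, the default p and the seed are illustrative choices, not values from the talk.

```python
import random

def bernoulli_matrix(k, n, p=0.5, seed=0):
    """k x n matrix with independent 0/1 entries, each 1 with probability p."""
    rng = random.Random(seed)
    return [[1 if rng.random() < p else 0 for _ in range(n)] for _ in range(k)]

M = bernoulli_matrix(k=3, n=10)
for row in M:
    print("".join(str(v) for v in row))
```

In the pooling setting, row i of M can be read as the list of samples placed in pool i.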

Page 22:

The CS problem (cont)

Solution: CS breakthrough #2:

solving the following optimization yields the correct solution

$x^* = \arg\min_x \|x\|_1 \quad \text{s.t.} \quad M x = y$

$x^* = x$ with probability almost 1

many off-the-shelf algorithms – we applied the GPSR algorithm
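To give a feel for what such a solver does, here is a small sketch of iterative soft thresholding (ISTA) applied to the penalized form 0.5·||Mx − y||² + λ·||x||₁. This is not GPSR and not the authors' code; it is a swapped-in textbook method, and the values of λ, the iteration count and the 1-sparse test vector are assumptions chosen for illustration.

```python
def soft_threshold(v, t):
    """Shrink v toward zero by t (the proximal step for the l1 norm)."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def objective(M, y, x, lam):
    """0.5 * ||M x - y||_2^2 + lam * ||x||_1."""
    resid = [sum(a * b for a, b in zip(row, x)) - yi for row, yi in zip(M, y)]
    return 0.5 * sum(r * r for r in resid) + lam * sum(abs(v) for v in x)

def ista(M, y, lam=0.01, iters=2000):
    """Iterative soft thresholding for l1-penalized least squares."""
    n = len(M[0])
    # The squared Frobenius norm bounds the largest eigenvalue of M^T M,
    # so 1/L is a safe (conservative) step size.
    L = sum(v * v for row in M for v in row)
    x = [0.0] * n
    for _ in range(iters):
        resid = [sum(a * b for a, b in zip(row, x)) - yi for row, yi in zip(M, y)]
        grad = [sum(M[i][j] * resid[i] for i in range(len(M))) for j in range(n)]
        x = [soft_threshold(x[j] - grad[j] / L, lam / L) for j in range(n)]
    return x

# The 0/1 matrix from the Bernoulli example, and a 1-sparse x to recover.
M = [[1, 0, 0, 1, 0, 1, 0, 0, 1, 0],
     [0, 1, 0, 1, 1, 0, 1, 1, 0, 0],
     [1, 1, 1, 0, 1, 1, 0, 1, 0, 1]]
x_true = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
y = [sum(a * b for a, b in zip(row, x_true)) for row in M]

x_hat = ista(M, y)
print([round(v, 2) for v in x_hat])
```

The penalized form trades exact equality Mx = y for a small residual, which is also closer to what a noisy measurement needs.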

Page 23:

Compressed sensing based group testing

[Figure: Next Generation Sequencing Technology; Compressed sensing; infer/reconstruct]

Page 24:

Rare allele identification in a CS framework

$y_i = \frac{1}{2}\, m_i \cdot x$

$m_i$: individuals in the pool, $m_i = \frac{1}{5}(1, 0, 1, 1, 1, 0, 0, 0, 1)$

$x$: # rare alleles

Genotype    x
AA          0
AA          0
AA          0
AB          1
AA          0
AA          0
AA          0
AA          0
AB          1
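The mapping from genotypes to a pooled readout can be sketched as follows. The function name is hypothetical, the pool and genotype lists are taken in the order they appear in the slide text (that ordering is an assumption), and equal amounts of DNA per individual are assumed.

```python
def pool_b_fraction(pool, x):
    """Ideal fraction of B alleles in a pool.

    pool[j] is 1 if individual j is in the pool, else 0.
    x[j] is the number of rare alleles (0, 1 or 2) carried by individual j.
    Each individual contributes two alleles, hence the factor 1/2.
    """
    size = sum(pool)
    if size == 0:
        raise ValueError("empty pool")
    return 0.5 * sum(p * xj for p, xj in zip(pool, x)) / size

genotypes = ["AA", "AA", "AA", "AB", "AA", "AA", "AA", "AA", "AB"]
x = [g.count("B") for g in genotypes]   # -> [0, 0, 0, 1, 0, 0, 0, 0, 1]

pool = [1, 0, 1, 1, 1, 0, 0, 0, 1]      # five of the nine individuals
y_i = pool_b_fraction(pool, x)
print(y_i)
```

A homozygous carrier (BB) would simply contribute 2 instead of 1 to x, with no change to the function.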

Page 25:

Compressed sensing based group testing

[Figure: Next Generation Sequencing Technology; Compressed sensing; infer/reconstruct]

Page 26:

Measuring device – NGST

Roche/454

Illumina Solexa

Applied Biosystems SOLiD

Helicos

Time: a few days, Price: a few thousand $

Constantly improving (exponentially!)

Page 27:

NGST output

output: “reads”

Illumina: a few million reads per lane

454: almost 1 million

[Figure: each line = “read”]

coverage: # reads per location

Page 28:

NGST – targeted sequencing

We measure the number of reads containing B out of the total number of reads. Here: 1/16
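Counting the B reads at one locus can be sketched like this. The toy data are an assumption: 16 reads that all cover the same 8-base window, using the reference and SNP sequences from the notation slide; real reads start at different positions and would need alignment first.

```python
def b_read_fraction(reads, locus, b_base):
    """Fraction of reads carrying b_base at the given 0-based locus."""
    hits = sum(1 for read in reads if read[locus] == b_base)
    return hits / len(reads)

# Toy data: 16 reads over the same window. The reference reads AGCGTTCT;
# one read carries the SNP (C -> T at index 2).
reads = ["AGCGTTCT"] * 15 + ["AGTGTTCT"]

print(b_read_fraction(reads, locus=2, b_base="T"))
```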

Page 30:

model formulation

Parts of this modeling appeared in S. Prabhu & I. Pe’er, Genome Research July 09

Ideal measurement - the fraction of “B” reads: $y_i = \frac{1}{2}\, m_i \cdot x$

NGST measurement:

1. sampling noise: finite number of reads from each site - r

r is itself a random variable, with parameters $\left(\frac{\text{total \# reads}}{\text{\# loci}},\ 1\right)$

$z_i \sim \text{Binomial}(r, y_i)$, Estimated frequency: $\hat{y}_i = z_i / r$

2. Technical errors:

$e_r$ - read errors: 0.5-1%

DNA preparation errors

$x^* = \arg\min_{x \in \{0,1,2\}^N} \|x\|_1 \quad \text{s.t.} \quad \left\| \frac{1}{2} M x - \frac{\frac{1}{r} z - e_r}{1 - 2 e_r} \right\|_2 \le \epsilon$
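The two noise sources can be simulated in a few lines. This is a sketch under stated assumptions: the symmetric read-error model in `observed_b_probability` is an assumption introduced here for illustration, r is held fixed although the slide treats it as random, and the numeric values are illustrative only.

```python
import random

def observed_b_probability(y, e_r):
    """Probability that a single read shows B when the true pool fraction
    is y and each read is flipped with probability e_r (assumed model)."""
    return y * (1 - e_r) + (1 - y) * e_r

def corrected_estimate(z, r, e_r):
    """Undo the read-error mixing: (z/r - e_r) / (1 - 2 e_r)."""
    return (z / r - e_r) / (1 - 2 * e_r)

def simulate_pool(y, r, e_r, rng):
    """Draw z ~ Binomial(r, q), one Bernoulli trial per read."""
    q = observed_b_probability(y, e_r)
    return sum(1 for _ in range(r) if rng.random() < q)

rng = random.Random(0)
y_true, r, e_r = 0.1, 5000, 0.01   # illustrative values only
z = simulate_pool(y_true, r, e_r, rng)
print(z, corrected_estimate(z, r, e_r))
```

With more reads per site the corrected estimate concentrates around the true fraction, which is the coverage side of the coverage versus number-of-pools trade-off.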

Page 31:

Unique properties of this application

1. measurement noise is pool dependent: $y_i = \frac{1}{2}\, m_i \cdot x$, $z_i \sim \text{Binomial}(r, y_i)$

2. the matrix M is known up to noise: DNA preparation errors

3. potential constraints on the matrix M - sparseness: potential technical problems, total amount of DNA

M =
1 0 0 1 0 1 0 0 1 0
0 1 0 1 1 0 1 1 0 0
1 1 1 0 1 1 0 1 0 1

Page 32:

Matlab package available at

www.broadinstitute.org/mpg/comseq/

Page 33:

Results

‘pure’ in silico simulations

[Figure: axis label: # pools]

In the paper:

• 3-70 fold more efficient than the naïve approach

• dependence on the number of targeted loci

• homozygous rare allele case (BB)

• combination with barcodes

• coverage vs. number of pools trade off

• dependence on the three noise factors

• dependence on the number of samples per pool.

Page 34:

Results

Simulations based on experimental data

Objective – validate our model.

In the paper:

• Carrier identification based on pooled experimental data (Out et al. Hum. Mutat., 2009)

• Decoding mixtures based on the 1000 Genomes Pilot 3 project.

Page 35:

Prior work

“pooled DNA” experiments (e.g. GWAS): allele frequency – not carriers

Group testing in general

• Drug screening

• Streaming algorithms

• communications

(D. Du and F.K. Hwang., A.C. Gilbert and M.J. Strauss)

Group testing for rare allele detection

Genome Research journal July 09

S. Prabhu and I. Pe’er - “Overlapping pools for high-throughput targeted resequencing.” – single carrier

Y. Erlich, G.J. Hannon et al. “DNA Sudoku-harnessing high-throughput sequencing for multiplexed specimen” - pooling based on the Chinese Remainder Theorem and barcoding.

Y. Erlich et al., “Compressed Genotyping”, IEEE Information Theory, 2010

Y. Erlich, NS, Amnon Amir, Or Zuk, “Compressed Sensing Approach for High Throughput Carrier Screen”, Allerton 2009

Page 36:

Current work – Dor Yeshorim

In collaboration with Y. Erlich, Whitehead Institute, MIT.

3000 DNA samples

Page 37:

Current work – Dor Yeshorim

• 41 samples with known genotype

• Pool these 41 samples and sequence

BLOOM

CF1152

CF3849L

CF508

CN285

FD696R

GS

TS1277

Page 38:

Current work – Sorghum bicolor

Submitted proposal with

Eyal Fridman, Faculty of Agriculture, HUJI

Zhanguo Xin, USDA-ARS

Yaniv Erlich, Whitehead Institute, MIT

Collaborators:

Rivka Elbaum, Faculty of Agriculture, HUJI

Noam Shomron, TAU

Or Zuk, Broad Institute

“Efficient allele mining in a model sequenced crop via the compressed sequencing approach”

Page 39:

Current work – Sorghum bicolor

[Figure: Eyal Fridman, Rivka Elbaum]

Page 40:

Current work – Sorghum bicolor

Validation phase: mutant detection - the COMT as a target gene:

• 800 samples

• 2 known loci in the COMT gene; each in a single sample

Expected resources:

• A single Illumina lane

• # of pools

[Figure: axis label: # pools needed to reach 95% zero error simulations]

Page 41:

Current work – Sorghum bicolor

Exploration phase: de novo SNP identification.

• Candidate genes involved in silica and water homeostasis [Rivka Elbaum, HUJI]

• Target 120nt in 7 aquaporin NIP-like genes.

• 6400 samples

Pooling design - The same pools may be used to seek variation in any other gene

Expected resources: 2 Illumina lanes

Page 42:

suggested workflow

samples → create pools once and for all!

Target gene #1 → order primers for gene #1 → amplify the pools

Target gene #2 → order primers for gene #2 → amplify the pools

Costs are proportional to the number of pools!

• sample preparation

• direct sequencing costs

Page 43:

Conclusions

• Generic approach that puts together sequencing and CS for identifying rare alleles and their carriers.

• Much higher efficiency over the naive approach - may significantly improve cost effectiveness in future Association Studies, and in screening large DNA cohorts for specific risk alleles.

• The method naturally deals with all possible scenarios of multiple carriers and heterozygous or homozygous rare alleles.

• Seamlessly combined with barcodes

• In our approach experimental costs (both sample preparation and direct sequencing costs) are proportional to the number of pools and NOT to the number of samples.

Page 44:

Acknowledgements

Amnon Amir

Department of Physics of Complex Systems, Weizmann Institute of Science

Or Zuk

Broad Institute of MIT and Harvard

Yaniv Erlich

Whitehead Institute, MIT