43
DNA Sequencing and Assembly

DNA Sequencing and Assembly. DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA

  • View
    219

  • Download
    1

Embed Size (px)

Citation preview

Page 1: DNA Sequencing and Assembly. DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA

DNA Sequencingand Assembly

Page 2: DNA Sequencing and Assembly. DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA

DNA sequencing

How we obtain the sequence of nucleotides of a species

…ACGTGACTGAGGACCGTGCGACTGAGACTGACTGGGTCTAGCTAGACTACGTTTTATATATATATACGTCGTCGTACTGATGACTAGATTACAGACTGATTTAGATACCTGACTGATTTTAAAAAAATATT…

Page 3: DNA Sequencing and Assembly. DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA

Which representative of the species?

Which human?

Answer one:

Answer two: it doesn’t matter

Polymorphism rate: number of letter changes between two different members of a species

Humans: ~1/1,000 – 1/10,000

Other organisms have much higher polymorphism rates

Page 4: DNA Sequencing and Assembly. DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA

DNA sequencing – vectors

+ =

DNA

Shake

DNA fragments

VectorCircular genome(bacterium, plasmid)

Knownlocation

(restrictionsite)

Page 5: DNA Sequencing and Assembly. DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA

Different types of vectors

VECTOR Size of insert

Plasmid2,000-10,000

Can control the size

Cosmid 40,000

BAC (Bacterial Artificial Chromosome)

70,000-300,000

YAC (Yeast Artificial Chromosome)

> 300,000Not used much

recently

Page 6: DNA Sequencing and Assembly. DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA

DNA sequencing – gel electrophoresis

Start at primer(restriction site)

Grow DNA chain

Include dideoxynucleoside(modified a, c, g, t)

Stops reaction at allpossible points

Separate products withlength, using gel electrophoresis

Page 7: DNA Sequencing and Assembly. DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA

Electrophoresis diagrams

Page 8: DNA Sequencing and Assembly. DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA

Output of gel electrophoresis: a read

A read: 500-700 nucleotides

A C G A A T C A G …. A16 18 21 23 25 15 28 30 32 21

Quality scores: -10log10Prob(Error)

Reads can be obtained from leftmost, rightmost ends of the insert

Double-barreled sequencing:Both leftmost & rightmost ends are sequenced

Page 9: DNA Sequencing and Assembly. DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA

Method to sequence segments longer than 500

cut many times at random (Shotgun)

genomic segment

Get one or two reads from each segment

~500 bp ~500 bp

Page 10: DNA Sequencing and Assembly. DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA

Reconstructing the Sequence (Fragment Assembly)

Cover region with ~7-fold redundancy (7X)

Overlap reads and extend to reconstruct the original genomic region

reads

Page 11: DNA Sequencing and Assembly. DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA

Definition of Coverage

Length of genomic segment: LNumber of reads: nLength of each read: l

Definition: Coverage C = nl/L

How much coverage is enough?

(Lander-Waterman model):Assuming uniform distribution of reads, C=10 results in 1 gapped region /1,000,000 nucleotides

C

Page 12: DNA Sequencing and Assembly. DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA

Challenges with Fragment Assembly

• Sequencing errors~1-2% of bases are wrong

• Repeats

• Computation: ~ O( N2 ) where N = # reads

false overlap due to repeat

Page 13: DNA Sequencing and Assembly. DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA

Repeats

Bacterial genomes: 5%Mammals: 50%

Repeat types:

Low-Complexity DNA (e.g. ATATATATACATA…)Microsatellite repeats: (a1…ak)N where k ~ 3-6

(e.g. CAGCAGTAGCAGCACCAG)Common Repeat Families

SINE (Short Interspersed Nuclear Elements)(e.g. ALU: ~300-long, 106 copies)

LINE (Long Interspersed Nuclear Elements)~500-5,000-long, 200,000 copies

MIRLTR/Retroviral

Other-Genes that are duplicated & then diverge (paralogs)-Recent duplications, ~100,000-long, very similar copies

Page 14: DNA Sequencing and Assembly. DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA

What can we do about repeats?

Two main approaches:• Cluster the reads

• Link the reads

Page 15: DNA Sequencing and Assembly. DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA

What can we do about repeats?

Two main approaches:• Cluster the reads

• Link the reads

Page 16: DNA Sequencing and Assembly. DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA

Strategies for sequencing a whole genome

1. Hierarchical – Clone-by-clonei. Break genome into many long piecesii. Map each long piece onto the genomeiii. Sequence each piece with shotgun

Example: Yeast, Worm, Human, Rat

2. Online version of (1) – Walkingi. Break genome into many long piecesii. Start sequencing each piece with shotguniii. Construct map as you go

Example: Rice genome

3. Whole genome shotgun

One large shotgun pass on the whole genome

Example: Drosophila, Human (Celera), Neurospora, Mouse, Rat, Fugu

Page 17: DNA Sequencing and Assembly. DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA

Hierarchical Sequencing

Page 18: DNA Sequencing and Assembly. DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA

Hierarchical Sequencing Strategy

1. Obtain a large collection of BAC clones2. Map them onto the genome (Physical Mapping)3. Select a minimum tiling path4. Sequence each clone in the path with shotgun5. Assemble6. Put everything together

a BAC clone

mapgenome

Page 19: DNA Sequencing and Assembly. DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA

Methods of physical mapping

Goal:

Make a map of the locations of each clone relative to one another Use the map to select a minimal set of clones to sequence

Methods:

• Hybridization• Digestion

Page 20: DNA Sequencing and Assembly. DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA

1. Hybridization

Short words, the probes, attach to complementary words

1. Construct many probes2. Treat each BAC with all probes3. Record which ones attach to it4. Same words attaching to BACS X, Y overlap

p1 pn

Page 21: DNA Sequencing and Assembly. DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA

Hybridization – Computational Challenge

Matrix:m probes n clones

(i, j): 1, if pi hybridizes to Cj

0, otherwise

Definition: Consecutive ones matrixA matrix 1s are consecutive

Computational problem:Reorder the probes so that matrix is in consecutive-ones form

Can be solved in O(m3) time (m >> n)Unfortunately, data is not perfect

p1 p2 …………………….pm

C1

C2 …

……

……

….C

n

1 0 1…………………...01 1 0 …………………..0

0 0 1 …………………..1

pi1pi2…………………….pim

Cj1C

j2 …

……

……

….C

jn

1 1 1 0 0 0……………..00 1 1 1 1 1……………..00 0 1 1 1 0……………..0

0 0 0 0 0 0………1 1 1 00 0 0 0 0 0………0 1 1 1

Page 22: DNA Sequencing and Assembly. DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA

2. Digestion

Restriction enzymes cut DNA where specific words appear

1. Cut each clone separately with an enzyme2. Run fragments on a gel and measure length3. Clones Ca, Cb have fragments of length { li, lj, lk } overlap

Double digestion:Cut with enzyme A, enzyme B, then enzymes A + B

Page 23: DNA Sequencing and Assembly. DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA

Whole-Genome Shotgun Sequencing

Page 24: DNA Sequencing and Assembly. DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA

Whole Genome Shotgun Sequencing

cut many times at random

genome

forward-reverse linked reads

plasmids (2 – 10 Kbp)

cosmids (40 Kbp) known dist

~500 bp~500 bp

Page 25: DNA Sequencing and Assembly. DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA

The Overlap-Layout-Consensus approach

1. Find overlapping reads

4. Derive consensus sequence ..ACGATTACAATAGGTT..

2. Merge good pairs of reads into longer contigs

3. Link contigs to form supercontigs

+ many heuristics

Page 26: DNA Sequencing and Assembly. DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA

1. Find Overlapping Reads

• Sort all k-mers in reads (k ~ 24)

TAGATTACACAGATTAC

TAGATTACACAGATTAC|||||||||||||||||

• Find pairs of reads sharing a k-mer

• Extend to full alignment – throw away if not >95% similar

T GA

TAGA| ||

TACA

TAGT||

Page 27: DNA Sequencing and Assembly. DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA

1. Find Overlapping Reads

One caveat: repeats

A k-mer that appears N times, initiates N2 comparisons

ALU: 1,000,000 times

Solution:

Discard all k-mers that appear more than c Coverage, (c ~ 10)

Page 28: DNA Sequencing and Assembly. DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA

1. Find Overlapping Reads

Create local multiple alignments from the overlapping reads

TAGATTACACAGATTACTGATAGATTACACAGATTACTGATAG TTACACAGATTATTGATAGATTACACAGATTACTGATAGATTACACAGATTACTGATAGATTACACAGATTACTGATAG TTACACAGATTATTGATAGATTACACAGATTACTGA

Page 29: DNA Sequencing and Assembly. DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA

1. Find Overlapping Reads (cont’d)

• Correct errors using multiple alignment

TAGATTACACAGATTACTGATAGATTACACAGATTACTGATAG TTACACAGATTATTGATAGATTACACAGATTACTGATAGATTACACAGATTACTGA

C: 20C: 35T: 30C: 35C: 40

C: 20C: 35C: 0C: 35C: 40

• Score alignments

• Accept alignments with good scores

A: 15A: 25A: 40A: 25-

A: 15A: 25A: 40A: 25A: 0

Page 30: DNA Sequencing and Assembly. DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA

Basic principle of assembly

Repeats confuse us

Ability to merge two reads ability to detect repeats

We can dismiss as repeat any overlap of < t% similarity

Role of error correction:

Discards ~90% of single-letter sequencing errors

Threshold t% increases

Page 31: DNA Sequencing and Assembly. DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA

2. Merge Reads into Contigs (cont’d)

Merge reads up to potential repeat boundaries(Myers, 1995)

repeat region

Page 32: DNA Sequencing and Assembly. DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA

2. Merge Reads into Contigs (cont’d)

• Ignore non-maximal reads• Merge only maximal reads into contigs

repeat region

Page 33: DNA Sequencing and Assembly. DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA

2. Merge Reads into Contigs (cont’d)

• Ignore “hanging” reads, when detecting repeat boundaries

sequencing errorrepeat boundary???

b

a

Page 34: DNA Sequencing and Assembly. DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA

2. Merge Reads into Contigs (cont’d)

?????

Unambiguous

• Insert non-maximal reads whenever unambiguous

Page 35: DNA Sequencing and Assembly. DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA

3. Link Contigs into Supercontigs

Too dense: Overcollapsed?

(Myers et al. 2000)

Inconsistent links: Overcollapsed?

Normal density

Page 36: DNA Sequencing and Assembly. DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA

Find all links between unique contigs

3. Link Contigs into Supercontigs (cont’d)

Connect contigs incrementally, if 2 links

Page 37: DNA Sequencing and Assembly. DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA

Fill gaps in supercontigs with paths of overcollapsed contigs

3. Link Contigs into Supercontigs

Page 38: DNA Sequencing and Assembly. DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA

Define G = ( V, E )V := contigs

E := ( A, B ) such that d( A, B ) < C

Reason to do so: Efficiency; full shortest paths cannot be computed

3. Link Contigs into Supercontigs

d ( A, B )Contig A

Contig B

Page 39: DNA Sequencing and Assembly. DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA

3. Link Contigs into Supercontigs

Contig AContig B

Define T: contigs linked to either A or B

Fill gap between A and B if there is a path in G passing only from contigs in T

Page 40: DNA Sequencing and Assembly. DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA

4. Derive Consensus Sequence

Derive multiple alignment from pairwise read alignments

TAGATTACACAGATTACTGA TTGATGGCGTAA CTATAGATTACACAGATTACTGACTTGATGGCGTAAACTATAG TTACACAGATTATTGACTTCATGGCGTAA CTATAGATTACACAGATTACTGACTTGATGGCGTAA CTATAGATTACACAGATTACTGACTTGATGGGGTAA CTA

TAGATTACACAGATTACTGACTTGATGGCGTAA CTA

Derive each consensus base by weighted voting

Page 41: DNA Sequencing and Assembly. DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA

Mouse Genome

Several heuristics of iteratively:Breaking supercontigs that are suspiciousRejoining supercontigs

Size of problem: 32,000,000 reads

Time: 15 days, 1 processorMemory: 28 Gb

N50 Contig size: 16.3 Kb 24.8 Kb N50 Supercontig size: .265 Mb 16.9 Mb

Page 42: DNA Sequencing and Assembly. DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA

Mouse Assembly

Page 43: DNA Sequencing and Assembly. DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA

Sequencing in the (near) future

CMOS ChipPhotodiodes

m

4m

Microfluidic Chip

m

Inlet

Outlet

m

m