Chap. 4 FRAGMENT ASSEMBLY OF DNA Introduction to Computational Molecular Biology Chapter 4

Chap. 4 FRAGMENT ASSEMBLY OF DNA

Introduction to Computational Molecular Biology

Chapter 4

4.1 Biological Background: The ideal case

The four sequences:

    ACCGT
    CGTGC
    TTAC
    TACCGT

Fragment assembly:

    --ACCGT--
    ----CGTGC
    TTAC-----
    -TACCGT--

    TTACCGTGC   (consensus)

The answer is known to have approximately 10 bases. The consensus sequence is TTACCGTGC, which has 9 bases, so the answer is close to the expected length.

4.1 Biological Background: Substitution

There was a substitution error in the second position of the last fragment, where A was replaced by G.

The consensus is still correct because of majority voting.

The four sequences:

    ACCGT
    CGTGC
    TTAC
    TGCCGT

Fragment assembly:

    --ACCGT--
    ----CGTGC
    TTAC-----
    -TGCCGT--

    TTACCGTGC   (consensus)
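The majority vote can be sketched in a few lines of Python. This is a minimal illustration, not the chapter's algorithm: the function name `consensus` and the convention that leading and trailing dashes are padding while interior dashes are gaps are assumptions made for this example, and ties are broken arbitrarily.

```python
from collections import Counter


def consensus(layout):
    """Column-wise majority vote over a layout given as equal-length rows.

    Leading and trailing '-' in a row are padding (the fragment does not
    cover that column); a '-' between bases is a gap and counts as a vote.
    Assumes every row contains at least one base.
    """
    width = len(layout[0])
    votes = [Counter() for _ in range(width)]
    for row in layout:
        covered = [i for i, ch in enumerate(row) if ch != "-"]
        for i in range(covered[0], covered[-1] + 1):
            votes[i][row[i]] += 1
    winners = (col.most_common(1)[0][0] for col in votes if col)
    # Columns won by a gap contribute no base to the consensus.
    return "".join(ch for ch in winners if ch != "-")


layout = [
    "--ACCGT--",
    "----CGTGC",
    "TTAC-----",
    "-TGCCGT--",  # substitution: G instead of A in the second position
]
print(consensus(layout))  # TTACCGTGC
```

In the third column the erroneous G is outvoted by two A's, which is why the consensus is unaffected.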

4.1 Biological Background: Insertion

There was an insertion error in the second position of the second fragment: base A appeared where there should be none.

The consensus is still correct: the inserted base is outvoted by gaps in its column, and the gap is dropped from the final sequence, giving TTACCGTGC.

The four sequences:

    ACCGT
    CAGTGC
    TTAC
    TACCGT

Fragment assembly:

    --ACC-GT--
    ----CAGTGC
    TTAC------
    -TACC-GT--

    TTACC-GTGC   (consensus)

4.1 Biological Background: Deletion

There was a deletion of the third (or fourth) base in the last fragment.

The consensus is still correct.

The four sequences:

    ACCGT
    CGTGC
    TTAC
    TACGT

Fragment assembly:

    --ACCGT--
    ----CGTGC
    TTAC-----
    -TAC-GT--

    TTACCGTGC   (consensus)

4.1 Biological Background: Chimera

The last fragment in this input set is a chimera: it joins TTA, from the left end of the target, to TGC, from the right end.

The five sequences:

    ACCGT
    CGTGC
    TTAC
    TACCGT
    TTATGC

Fragment assembly:

    --ACCGT--
    ----CGTGC
    TTAC-----
    -TACCGT--

    TTACCGTGC   (consensus)

    TTA---TGC   (chimera)

4.1 Biological Background: Unknown orientation

Fragments can come from either DNA strand, and we generally do not know to which strand a particular fragment belongs.

We do know, however, that whatever the strand, the sequence is read from 5' to 3'.

Because the strands are complementary and run in opposite directions, saying that a fragment is a substring of one strand is equivalent to saying that its reverse complement is a substring of the other.
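The reverse complement is simple to compute. The sketch below assumes fragments are strings over the alphabet A, C, G, T with no ambiguity codes; the name `reverse_complement` is chosen for this example.

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(fragment):
    """Return the 5'-to-3' reading of the opposite strand."""
    return fragment.translate(COMPLEMENT)[::-1]


print(reverse_complement("ACTACG"))  # CGTAGT
```

Applying the function twice returns the original fragment, which is why an assembler can try both orientations of each fragment without losing information.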

4.1 Biological Background: Fragment assembly with unknown orientation

Initially we do not know the orientation of the fragments.

Input:

    CACGT
    ACGT
    ACTACG
    GTACT
    ACTGA
    CTGA

Answer (the third and fourth fragments, ACTACG and GTACT, are used as their reverse complements CGTAGT and AGTAC):

    CACGT--------
    -ACGT--------
    --CGTAGT-----
    -----AGTAC---
    --------ACTGA
    ---------CTGA

    CACGTAGTACTGA   (consensus)

4.1 Biological Background: Repeated regions

Repeated regions, or repeats, are sequences that appear two or more times in the target molecule. If the level of similarity between two copies of a repeat is high enough, the differences can be mistaken for base-call errors.

[Figure: a target sequence with two blocks, X1 and X2, that are approximately the same sequence.]

4.1 Biological Background: The kinds of problems (Repeats)

A target sequence with three copies of a repeat X (repeats of the form XXX) leads to ambiguous assembly, because the segments between the copies can be exchanged:

    A X B X C X D
    A X C X B X D
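The ambiguity can be checked on a toy case. In the sketch below the segments A, B, C, D and the repeat X are hypothetical strings chosen only for illustration, and fragments are idealized as all error-free substrings of a fixed length. When that length is shorter than the repeat, both arrangements produce exactly the same set of fragments.

```python
def kmers(seq, k):
    """Set of all substrings of length k (idealized error-free fragments)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


# Hypothetical segments chosen only for illustration.
A, B, C, D = "TT", "GGA", "CCT", "AA"
X = "ACGTTGCA"  # the repeat, 8 bases long

target1 = A + X + B + X + C + X + D
target2 = A + X + C + X + B + X + D

print(target1 != target2)                      # the two targets differ
print(kmers(target1, 6) == kmers(target2, 6))  # same fragments of length 6
```

Only fragments long enough to span a whole copy of X together with a base on each side can tell the two targets apart.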

4.1 Biological Background: The kinds of problems (Repeats)

A target sequence with two interleaved repeats X and Y (repeats of the form XYXY) also leads to ambiguous assembly, because the segments between an X and the following Y can be exchanged:

    A X B Y C X D Y E
    A X D Y C X B Y E

4.1 Biological Background: The kinds of problems (Repeats)

Inverted repeats, which are repeated regions in opposite strands, can also occur and are potentially more dangerous.

[Figure: target sequence with an inverted repeat. The two copies of X lie on opposite strands; the second drawing shows the molecule rotated by 180°.]

4.2 Models: Shortest Common Superstring

Problem: Shortest Common Superstring (SCS)
Input: A collection F of strings.
Output: A shortest possible string S such that for every f ∈ F, S is a superstring of f.

Example

F = {ACT, CTA, AGT}

S = ACTAGT is the SCS of F. For instance, CTA is a substring of S.

    ACT---
    -CTA--
    ---AGT

    ACTAGT
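For a collection this small the SCS can be found by brute force. The sketch below is an illustration under stated assumptions rather than a practical method: it tries every ordering, so it is exponential in the number of fragments, and it assumes that no fragment is a substring of another. The function names are chosen for this example.

```python
from itertools import permutations


def overlap(a, b):
    """Length of the longest suffix of a that is a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def shortest_common_superstring(fragments):
    """Try every ordering and merge neighbours on their maximal overlap.

    Exponential in the number of fragments; assumes no fragment is a
    substring of another.
    """
    best = None
    for order in permutations(fragments):
        merged = order[0]
        for frag in order[1:]:
            merged += frag[overlap(merged, frag):]
        if best is None or len(merged) < len(best):
            best = merged
    return best


print(shortest_common_superstring(["ACT", "CTA", "AGT"]))  # ACTAGT
```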

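4.2 Models: Shortest Common Superstring

Problem: repeats.

[Figure: target sequence with a long repeat X, present in two copies, that contains many fragments.]

Fragments that lie entirely inside either copy of X look alike, so a shortest superstring can cover them with a single copy of the repeat.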

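4.2 Models: Reconstruction

This model takes into account both errors and unknown orientation. To deal with errors, it uses the substring edit distance:

$$ d_s(a, b) = \min_{s \in S(b)} d(a, s) $$

where S(b) is the set of all substrings of b and d is the classical edit distance. The substring edit distance is asymmetric: in general d_s(a, b) ≠ d_s(b, a).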

4.2 Models: Reconstruction

Example: an optimal alignment for the substring edit distance, which does not charge for end deletions in the first string. Here a = GCGATAG (top row) and b = CAGTCGCTGATCGTACG (bottom row).

    -----GC-GATAG----
    CAGTCGCTGATCGTACG

d_s(a, b) = 2
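The substring edit distance can be computed with the usual edit-distance dynamic program, changed so that starting and ending anywhere in b is free. The sketch below assumes unit costs for substitutions, insertions, and deletions; the function name is chosen for this example.

```python
def substring_edit_distance(a, b):
    """d_s(a, b): minimum edit distance between a and any substring of b."""
    # prev[j] = cost of aligning the processed prefix of a so that the
    # alignment ends at position j of b. Row 0 is all zeros because the
    # matched substring may start anywhere in b at no cost.
    prev = [0] * (len(b) + 1)
    for i, ca in enumerate(a, start=1):
        curr = [i]  # a[:i] against the empty substring: i deletions
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # base of a left unmatched
                curr[j - 1] + 1,           # base of b left unmatched
                prev[j - 1] + (ca != cb),  # match or substitution
            ))
        prev = curr
    # The matched substring may also end anywhere in b.
    return min(prev)


print(substring_edit_distance("GCGATAG", "CAGTCGCTGATCGTACG"))  # 2
```

Swapping the arguments gives a much larger value, because every substring of the shorter string is at least ten bases shorter than the longer one. This illustrates the asymmetry.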

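4.2 Models: Reconstruction

An error tolerance ε is a real number between 0 and 1. A string f is an approximate substring of S at error level ε when

$$ d_s(f, S) \le \epsilon\,|f| $$

where |f| is the length of f. In other words, on average ε errors are permitted for each base in f.

Problem: Reconstruction
Input: A collection F of strings and an error tolerance ε between 0 and 1.
Output: A shortest possible string S such that for every f ∈ F

$$ \min\bigl(d_s(f, S),\; d_s(\bar{f}, S)\bigr) \le \epsilon\,|f| $$

where \bar{f} is the reverse complement of f.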
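The condition in the Reconstruction problem can be checked directly for a candidate string S. The sketch below only tests feasibility; it does not search for a shortest S. The function names, and the choice of the substitution example from Section 4.1 as test data, are assumptions made for illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(f):
    """Reverse the fragment and complement every base."""
    return f.translate(COMPLEMENT)[::-1]


def substring_edit_distance(a, b):
    """Minimum unit-cost edit distance between a and any substring of b."""
    prev = [0] * (len(b) + 1)
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return min(prev)


def is_feasible(S, fragments, eps):
    """True if every fragment, in one orientation, fits S within eps * |f|."""
    return all(
        min(substring_edit_distance(f, S),
            substring_edit_distance(reverse_complement(f), S)) <= eps * len(f)
        for f in fragments
    )


S = "TTACCGTGC"
F = ["ACCGT", "CGTGC", "TTAC", "TGCCGT"]  # last fragment has one substitution
print(is_feasible(S, F, eps=0.2))  # True: one error in six bases is allowed
print(is_feasible(S, F, eps=0.1))  # False: TGCCGT would need zero errors
```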

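4.2 Models: Multicontig

Three layouts for the fragments GTAC, TAATG, and TGTAA. A layout is a t-contig when its weakest link, the smallest overlap between consecutive fragments, is at least t.

3-contig (GTAC is left on its own):

    --TAATG
    TGTAA--

    GTAC

2-contig (GTAC is left on its own):

    TAATG---
    ---TGTAA

    GTAC

1-contig:

    TGTAA-----
    --TAATG---
    ------GTAC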
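The effect of the threshold t can be sketched as follows. This illustration assumes exact (error-free) overlaps and a given left-to-right order of the fragments; the function names are chosen for this example. It cuts the order wherever the overlap between neighbours falls below t.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def split_into_contigs(order, t):
    """Cut an ordered list of fragments wherever the overlap is below t."""
    contigs = [[order[0]]]
    for prev, frag in zip(order, order[1:]):
        if overlap(prev, frag) >= t:
            contigs[-1].append(frag)
        else:
            contigs.append([frag])
    return contigs


print(split_into_contigs(["TGTAA", "TAATG", "GTAC"], t=3))
print(split_into_contigs(["TAATG", "TGTAA", "GTAC"], t=2))
print(split_into_contigs(["TGTAA", "TAATG", "GTAC"], t=1))
```

With t = 3 or t = 2 the fragment GTAC ends up alone, while with t = 1 all three fragments form a single contig.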

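4.3 Algorithms: Overlap multigraph

Fragments (nodes of the graph):

    a = TACGA
    b = ACCC
    c = CTAAAG
    d = GACA

Edges with nonzero weight, where the weight is the length of the overlap: a -> b (1), b -> c (1), c -> d (1), d -> b (1), a -> d (2). For example, the edge from c to d represents the overlap between fragments c and d, the single base G.

PATH1 = dbc

    GACA--------
    ---ACCC-----
    ------CTAAAG

PATH2 = abcd

    a = TACGA-----------
    b = ----ACCC--------
    c = -------CTAAAG---
    d = ------------GACA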
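The positive-weight edges of the overlap multigraph can be listed with a short script. This sketch assumes exact overlaps and omits the zero-weight edges that the full multigraph also contains; the helper name `overlaps` is chosen for this example.

```python
from itertools import permutations

fragments = {"a": "TACGA", "b": "ACCC", "c": "CTAAAG", "d": "GACA"}


def overlaps(u, v):
    """All k > 0 such that the last k bases of u equal the first k of v."""
    return [k for k in range(1, min(len(u), len(v)))
            if u[-k:] == v[:k]]


# One edge per ordered pair of distinct fragments and per overlap length.
edges = sorted(
    (x, y, k)
    for x, y in permutations(fragments, 2)
    for k in overlaps(fragments[x], fragments[y])
)
for x, y, k in edges:
    print(f"{x} -> {y}  weight {k}")
```

A pair of fragments can overlap in more than one way, which is why the structure is a multigraph; in this small example each pair has at most one positive overlap.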

4.3 Algorithms: The greedy algorithm

Looking for shortest common superstrings is the same as looking for Hamiltonian paths of maximum weight in a directed multigraph.

4.3 Algorithms: The greedy algorithm

Example

S = AGTATTGGCAATCGATGCAAACCTTTTGGCAATCACT

    w = AGTATTGGCAATC
    z = AATCGATG
    u = ATGCAAACCT
    x = CCTTTTGG
    y = TTGGCAATCACT

The greedy algorithm first joins w and y, whose 9-base overlap TTGGCAATC is the largest. The solution it generates has length 36, one base shorter than S. However, its weakest link is zero.
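One common formulation of a greedy heuristic repeatedly merges the two strings with the largest overlap. The sketch below follows that formulation, which may differ in detail from the graph-based version in the chapter; it assumes exact overlaps, and its tie-breaking rule is arbitrary.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(fragments):
    """Repeatedly merge the pair of strings with the largest overlap."""
    strings = list(fragments)
    while len(strings) > 1:
        # Ties are broken by the largest index pair, an arbitrary choice.
        k, i, j = max(
            (overlap(a, b), i, j)
            for i, a in enumerate(strings)
            for j, b in enumerate(strings)
            if i != j
        )
        merged = strings[i] + strings[j][k:]
        strings = [s for n, s in enumerate(strings) if n not in (i, j)]
        strings.append(merged)
    return strings[0]


w, z, u, x, y = ("AGTATTGGCAATC", "AATCGATG", "ATGCAAACCT",
                 "CCTTTTGG", "TTGGCAATCACT")
result = greedy_superstring([w, z, u, x, y])
print(result, len(result))
```

On this input the first merge joins w and y, and the last merge joins two strings that do not overlap at all, so the result has length 36 and a weakest link of zero.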

4.3 Algorithms: Acyclic

Hamiltonian path: w -> z -> u -> x -> y, with overlaps 4, 3, 3, 4.

This solution has length 37. Its weakest link is 3.
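For five fragments, the path with the strongest weakest link can be found by checking all orderings. The sketch below is a brute-force illustration, not the algorithm of this section; it assumes exact overlaps and uses helper names chosen for this example.

```python
from itertools import permutations


def overlap(a, b):
    """Length of the longest suffix of a that is a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


frags = {
    "w": "AGTATTGGCAATC",
    "z": "AATCGATG",
    "u": "ATGCAAACCT",
    "x": "CCTTTTGG",
    "y": "TTGGCAATCACT",
}


def links(order):
    """Overlaps between consecutive fragments of a path."""
    return [overlap(frags[p], frags[q]) for p, q in zip(order, order[1:])]


# Pick the Hamiltonian path whose smallest overlap is as large as possible.
best = max(permutations(frags), key=lambda order: min(links(order)))
length = sum(len(s) for s in frags.values()) - sum(links(best))
print("".join(best), links(best), length)
```

The total overlap of this path is 14, one less than that of the greedy solution, but no link is weaker than 3 and the resulting string has the same length as the target S.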

4.4 Heuristics: Alignment and consensus

Suppose we have a path f -> g -> h.

    f = CATAGTC
    g = TAACTAT
    h = AGACTATCC

Layout and consensus:

    CATAGTC-----
    --TA-ACTAT--
    ---AGACTATCC

    CATAGACTATCC   (consensus)

4.4 Heuristics: Alignment and consensus

Two layouts for the same sequences, each shown with its third and fourth columns repeated on the right:

    ACT-GG    T-
    ACTTGG    TT
    AC-TGG    -T
    ACT-GG    T-
    AC-TGG    -T
    ACTTGG    TT

    ACT-GG    T-
    ACTTGG    TT
    AC-TGG    -T
    ACT-GG    T-
    AC-TGG    -T
    ACTTGG    TT

Using sum-of-pairs scoring
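A sum-of-pairs score adds up a pairwise score over every pair of rows in every column of a layout. The sketch below uses assumed scores (+1 for equal bases, 0 for a gap against a gap, -1 otherwise), and the layout named `aligned` is a hypothetical alternative, introduced only for comparison, in which all gaps fall in the same column.

```python
from itertools import combinations


def pair_score(p, q):
    """Assumed scores: +1 equal bases, 0 gap against gap, -1 otherwise."""
    if p == "-" and q == "-":
        return 0
    return 1 if p == q else -1


def sum_of_pairs(layout):
    """Add pair_score over every pair of rows in every column."""
    return sum(
        pair_score(p, q)
        for column in zip(*layout)
        for p, q in combinations(column, 2)
    )


scattered = ["ACT-GG", "ACTTGG", "AC-TGG", "ACT-GG", "AC-TGG", "ACTTGG"]
# Hypothetical alternative with every gap moved to the same column.
aligned = ["ACT-GG", "ACTTGG", "ACT-GG", "ACT-GG", "ACT-GG", "ACTTGG"]

print(sum_of_pairs(scattered), sum_of_pairs(aligned))
```

Under these assumed scores the layout with consistently placed gaps scores higher, which is the kind of preference a sum-of-pairs criterion expresses.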