CS262 Lecture 14, Win07, Batzoglou Multiple Sequence Alignments


Page 1: CS262 Lecture 14, Win07, Batzoglou Multiple Sequence Alignments

Multiple Sequence Alignments

Page 2: CS262 Lecture 14, Win07, Batzoglou Multiple Sequence Alignments

Progressive Alignment

• When the evolutionary tree is known:

  Align the closest sequences first, in the order of the tree.

  In each step, align two sequences x, y, or profiles px, py, to generate a new alignment with associated profile presult.

  Weighted version: tree edges have weights proportional to the divergence along that edge. The new profile is a weighted average of the two old profiles.

[Figure: guide tree over sequences x, y, z, w; x and y merge into profile pxy, z and w into pzw, and these two into pxyzw]
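The weighted-average step can be sketched as follows. This is a minimal illustration, not the lecture's implementation: it assumes the two profiles have already been aligned so that their columns correspond (the alignment step itself is omitted), that a profile is a list of per-column frequency dictionaries, and that the weights `w_x` and `w_y` are supplied by the caller. The function name and data layout are hypothetical.

```python
def merge_profiles(p_x, p_y, w_x, w_y):
    """Weighted average of two profiles whose columns already correspond.

    p_x, p_y: lists of dicts mapping symbol -> frequency (hypothetical layout).
    w_x, w_y: non-negative weights, e.g. derived from tree edge weights.
    """
    if len(p_x) != len(p_y):
        raise ValueError("profiles must have the same number of columns")
    total = w_x + w_y
    merged = []
    for col_x, col_y in zip(p_x, p_y):
        symbols = set(col_x) | set(col_y)
        merged.append({
            s: (w_x * col_x.get(s, 0.0) + w_y * col_y.get(s, 0.0)) / total
            for s in symbols
        })
    return merged


# Two toy profiles with two columns each; "-" stands for a gap.
p_x = [{"A": 1.0}, {"C": 0.5, "-": 0.5}]
p_y = [{"A": 0.5, "G": 0.5}, {"C": 1.0}]
p_xy = merge_profiles(p_x, p_y, w_x=1.0, w_y=3.0)
```

With equal weights this reduces to a plain average of the two profiles.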

Page 3: CS262 Lecture 14, Win07, Batzoglou Multiple Sequence Alignments

Progressive Alignment

• When the evolutionary tree is unknown:

  Perform all pairwise alignments.

  Define a distance matrix D, where D(x, y) is a measure of evolutionary distance based on the pairwise alignment.

  Construct a tree (UPGMA / Neighbor Joining / other methods).

  Align on the tree.

[Figure: sequences x, y, z, w joined by a tree that is still unknown (?)]
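As one way to carry out the tree-construction step, here is a small UPGMA sketch. It assumes the distance matrix is given as a dictionary over unordered pairs of sequence names and returns the tree as nested tuples; both conventions, and the function name, are illustrative choices rather than anything the slides prescribe.

```python
from itertools import combinations


def upgma(dist):
    """Build a guide tree by UPGMA (average-linkage clustering).

    dist: {(a, b): distance} for every unordered pair of leaf names.
    Returns the tree as nested 2-tuples of leaf names.
    """
    d = {frozenset(pair): value for pair, value in dist.items()}
    size = {}  # cluster -> number of leaves it contains
    for pair in dist:
        for leaf in pair:
            size[leaf] = 1
    while len(size) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(size, 2), key=lambda p: d[frozenset(p)])
        merged = (a, b)
        for c in size:
            if c != a and c != b:
                # Size-weighted average of the distances to the two children.
                d[frozenset((merged, c))] = (
                    size[a] * d[frozenset((a, c))] + size[b] * d[frozenset((b, c))]
                ) / (size[a] + size[b])
        size[merged] = size.pop(a) + size.pop(b)
    return next(iter(size))


# Toy distances in which x, y are close and z, w are close.
D = {("x", "y"): 2, ("x", "z"): 10, ("x", "w"): 10,
     ("y", "z"): 10, ("y", "w"): 10, ("z", "w"): 4}
tree = upgma(D)
```

On this toy matrix the result groups x with y and z with w, matching the shape of the tree used in the known-tree case. The pair search is recomputed every round, so this sketch favors clarity over speed.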

Page 4: CS262 Lecture 14, Win07, Batzoglou Multiple Sequence Alignments

Some Resources

Genome Resources

Annotation and alignment genome browser at UCSC: http://genome.ucsc.edu/cgi-bin/hgGateway

Specialized VISTA alignment browser at LBNL: http://pipeline.lbl.gov/cgi-bin/gateway2

Protein Multiple Aligners

CLUSTALW (most widely used): http://www.ebi.ac.uk/clustalw/

MUSCLE (most scalable): http://phylogenomics.berkeley.edu/cgi-bin/muscle/input_muscle.py

PROBCONS (most accurate): http://probcons.stanford.edu/

Page 5: CS262 Lecture 14, Win07, Batzoglou Multiple Sequence Alignments

Real-world protein aligners

• MUSCLE
    High throughput
    One of the best in accuracy

• ProbCons
    High accuracy
    Reasonable speed

Page 6: CS262 Lecture 14, Win07, Batzoglou Multiple Sequence Alignments

MUSCLE at a glance

1. Fast measurement of all pairwise distances between sequences
   • DDRAFT(x, y) defined in terms of # common k-mers (k ~ 3): O(N^2 L log L) time

2. Build tree TDRAFT based on those distances, with UPGMA

3. Progressive alignment over TDRAFT, resulting in multiple alignment MDRAFT

4. Measure new Kimura-based distances D(x, y) based on MDRAFT

5. Build tree T based on D

6. Progressive alignment over T, to build M
   • Only perform alignment steps for the parts of the tree that have changed

7. Iterative refinement; for many rounds, do:
   • Tree Partitioning: split M on one branch and realign the two resulting profiles
   • If the new alignment M’ has a better sum-of-pairs score than the previous one, accept
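To make step 1 concrete, here is a rough sketch of a k-mer based draft distance. The slide only says the distance is defined in terms of the number of common k-mers, so the specific formula below (one minus the fraction of shared k-mers, normalized by the shorter sequence) is an assumption for illustration and may differ from what MUSCLE actually computes.

```python
from collections import Counter


def kmer_counts(seq, k=3):
    """Multiset of all overlapping k-mers in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_distance(x, y, k=3):
    """Draft distance in [0, 1]: 0 when all k-mers are shared, 1 when none are.

    The normalization is a hypothetical choice, not MUSCLE's exact definition.
    """
    n_kmers = min(len(x), len(y)) - k + 1
    if n_kmers <= 0:
        raise ValueError("sequences must be at least k long")
    shared = sum((kmer_counts(x, k) & kmer_counts(y, k)).values())
    return 1.0 - shared / n_kmers
```

Because no alignment is computed, each pairwise distance costs time roughly linear in the sequence length, which is what makes the draft stage fast.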

Page 7: CS262 Lecture 14, Win07, Batzoglou Multiple Sequence Alignments

PROBCONS at a glance

1. Computation of all posterior matrices Mxy: Mxy(i, j) = Prob(xi ~ yj), using an HMM

2. Re-estimation of posterior matrices M’xy with probabilistic consistency

   • M’xy(i, j) = (1/N) Σ_{sequence z} Σ_k Mxz(i, k) Myz(j, k);  M’xy = Avg_z(Mxz Mzy)

3. Compute for every pair x, y the maximum expected accuracy alignment

   • Axy: the alignment A that maximizes Σ_{aligned (i, j) in A} M’xy(i, j)

   • Define E(x, y) = Σ_{aligned (i, j) in Axy} M’xy(i, j)

4. Build tree T with hierarchical clustering using similarity measure E(x, y)

5. Progressive alignment on T to maximize E(.,.)

6. Iterative refinement; for many rounds, do:
   • Randomized Partitioning: split the sequences in M into two subsets by flipping a coin for each sequence, and realign the two resulting profiles
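The consistency re-estimation in step 2 amounts to averaging matrix products over all intermediate sequences z. The sketch below uses plain nested lists and assumes that the terms z = x and z = y are handled by treating Mxx as the identity matrix; the dictionary layout and function names are hypothetical.

```python
def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def consistency_transform(post, lengths):
    """One round of M'xy = Avg_z(Mxz Mzy).

    post[(x, y)]: posterior matrix with rows indexed by x and columns by y,
                  stored once per unordered pair (hypothetical layout).
    lengths[x]:   length of sequence x.
    """
    names = sorted(lengths)

    def m(a, b):
        if a == b:
            return identity(lengths[a])  # assumption: Mxx is the identity
        return post[(a, b)] if (a, b) in post else transpose(post[(b, a)])

    updated = {}
    for x, y in post:
        acc = [[0.0] * lengths[y] for _ in range(lengths[x])]
        for z in names:
            prod = matmul(m(x, z), m(z, y))
            for i, row in enumerate(prod):
                for j, value in enumerate(row):
                    acc[i][j] += value
        updated[(x, y)] = [[v / len(names) for v in row] for row in acc]
    return updated
```

Evidence that xi aligns to zk and zk aligns to yj raises the re-estimated probability that xi aligns to yj, which is the point of the transformation.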

Page 8: CS262 Lecture 14, Win07, Batzoglou Multiple Sequence Alignments


Rapid Global Alignments

How to align genomic sequences in (more or less) linear time

Page 9: CS262 Lecture 14, Win07, Batzoglou Multiple Sequence Alignments


Page 10: CS262 Lecture 14, Win07, Batzoglou Multiple Sequence Alignments

Motivation

• Genomic sequences are very long:

    Human genome = 3 × 10^9 long
    Mouse genome = 2.7 × 10^9 long

• Aligning genomic regions is useful for revealing common gene structure

    It is useful to compare regions > 1,000,000 long

Page 11: CS262 Lecture 14, Win07, Batzoglou Multiple Sequence Alignments


The UCSC Browser

• http://genome.ucsc.edu/cgi-bin/hgGateway

Page 12: CS262 Lecture 14, Win07, Batzoglou Multiple Sequence Alignments


Main Idea

Genomic regions of interest contain islands of similarity, such as genes

1. Find local alignments

2. Chain an optimal subset of them

3. Refine/complete the alignment

Systems that use this idea to various degrees:

MUMmer, GLASS, DIALIGN, CHAOS, AVID, LAGAN, TBA, & others

Page 13: CS262 Lecture 14, Win07, Batzoglou Multiple Sequence Alignments


Saving cells in DP

1. Find local alignments

2. Chain: O(N log N) with L.I.S.

3. Restricted DP

Page 14: CS262 Lecture 14, Win07, Batzoglou Multiple Sequence Alignments

Methods to CHAIN Local Alignments

Sparse Dynamic Programming: O(N log N)

Page 15: CS262 Lecture 14, Win07, Batzoglou Multiple Sequence Alignments

The Problem: Find a Chain of Local Alignments

(x, y) → (x’, y’) requires x < x’ and y < y’

Each local alignment has a weight

FIND the chain with highest total weight

Page 16: CS262 Lecture 14, Win07, Batzoglou Multiple Sequence Alignments

Quadratic Time Solution

• Build a Directed Acyclic Graph (DAG):

    Nodes: local alignments [(xa, xb) (ya, yb)] & score

    Directed edges: local alignments that can be chained
      • edge ( (xa, xb, ya, yb), (xc, xd, yc, yd) )
      • xa < xb < xc < xd
      • ya < yb < yc < yd

Each local alignment is a node vi with alignment score si

Page 17: CS262 Lecture 14, Win07, Batzoglou Multiple Sequence Alignments

Quadratic Time Solution

Initialization:
    Find each node va s.t. there is no edge (u, va)
    Set score of V(a) to be sa

Iteration:
    For each vi, the optimal path ending in vi has total score:
    V(i) = max over j s.t. there is an edge (vj, vi) of ( weight(vj, vi) + V(j) )

Termination:
    Optimal global chain:
    j = argmax ( V(j) ); trace chain from vj

Worst case time: quadratic
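The quadratic DAG recurrence can be sketched directly. This is a minimal version under two assumptions that the slides leave open: the weight of an edge (vj, vi) is taken to be the score si of the destination node, and alignments are given as `(xa, xb, ya, yb, score)` tuples. The function name is hypothetical.

```python
def chain_quadratic(alignments):
    """Highest-scoring chain of local alignments in O(N^2) time.

    alignments: list of (xa, xb, ya, yb, score) tuples.
    Returns (best total score, indices of the chained alignments in order).
    """
    # Any predecessor j of i has xb_j < xa_i, hence xa_j < xa_i, so sorting
    # by xa gives a valid topological order of the DAG.
    order = sorted(range(len(alignments)), key=lambda i: alignments[i][0])
    best = {}  # best[i] plays the role of V(i)
    back = {}  # backpointers for the traceback
    for pos, i in enumerate(order):
        xa, _, ya, _, score = alignments[i]
        best[i], back[i] = score, None
        for j in order[:pos]:
            _, xb_j, _, yb_j, _ = alignments[j]
            if xb_j < xa and yb_j < ya and best[j] + score > best[i]:
                best[i], back[i] = best[j] + score, j
    if not best:
        return 0, []
    node = max(best, key=best.get)
    total, chain = best[node], []
    while node is not None:
        chain.append(node)
        node = back[node]
    return total, chain[::-1]


# Four toy alignments; alignment 1 overlaps alignment 0 in x.
hits = [(1, 3, 1, 3, 4), (2, 5, 6, 8, 3), (5, 8, 4, 7, 6), (9, 12, 8, 10, 2)]
print(chain_quadratic(hits))  # (12, [0, 2, 3])
```

The inner loop examines every earlier alignment, which is where the quadratic worst case comes from.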

Page 18: CS262 Lecture 14, Win07, Batzoglou Multiple Sequence Alignments

Sparse Dynamic Programming

Back to the LCS problem:

• Given two sequences

    x = x1, …, xm
    y = y1, …, yn

• Find the longest common subsequence

    Quadratic solution with DP

• How about when “hits” xi = yj are sparse?

Page 19: CS262 Lecture 14, Win07, Batzoglou Multiple Sequence Alignments

Sparse Dynamic Programming

[Figure: dot-plot grid with x = 15 3 24 16 20 4 24 3 11 18 along the columns and y = 4 20 24 3 11 15 11 4 18 20 along the rows; the hits xi = yj are marked]

• Imagine a situation where the number of hits is much smaller than O(nm), maybe O(n) instead

Page 20: CS262 Lecture 14, Win07, Batzoglou Multiple Sequence Alignments

Sparse Dynamic Programming – L.I.S.

• Longest Increasing Subsequence

• Given a sequence over an ordered alphabet

    x = x1, …, xm

• Find the longest subsequence

    s = s1, …, sk

    such that s1 < s2 < … < sk

Page 21: CS262 Lecture 14, Win07, Batzoglou Multiple Sequence Alignments

Sparse Dynamic Programming – L.I.S.

Let input be w: w1, …, wn

INITIALIZATION:
    L: last LIS elt. array
        L[0] = -inf
        L[1] = w1
        L[2…n] = +inf
    B: array holding the indices of the LIS elts; B[0] = 0, B[1] = 1
    P: array of backpointers
    // L[j]: smallest last element wi of any j-long increasing subsequence seen so far

ALGORITHM
for i = 2 to n {
    Find j such that L[j – 1] < w[i] ≤ L[j]
    L[j] ← w[i]
    B[j] ← i
    P[i] ← B[j – 1]
}

That’s it!

• Running time?
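A runnable version of the same idea, as a sketch: binary search (Python's standard `bisect_left`) finds the position j in the array of smallest last elements, and backpointers recover one longest strictly increasing subsequence. The variable names are illustrative and the indexing is 0-based, unlike the pseudocode.

```python
from bisect import bisect_left


def longest_increasing_subsequence(w):
    """Return one longest strictly increasing subsequence of w."""
    tails = []      # tails[j]: smallest last element of an increasing run of length j + 1
    tails_idx = []  # index in w of the element stored in tails[j]
    prev = [None] * len(w)  # backpointers
    for i, value in enumerate(w):
        j = bisect_left(tails, value)  # first j with tails[j] >= value
        if j == len(tails):
            tails.append(value)
            tails_idx.append(i)
        else:
            tails[j] = value
            tails_idx[j] = i
        prev[i] = tails_idx[j - 1] if j > 0 else None
    result = []
    i = tails_idx[-1] if tails_idx else None
    while i is not None:
        result.append(w[i])
        i = prev[i]
    return result[::-1]


print(longest_increasing_subsequence([3, 1, 4, 1, 5, 9, 2, 6]))  # [1, 4, 5, 6]
```

Each element costs one binary search over an array of at most n entries, so the total running time is O(n log n).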

Page 22: CS262 Lecture 14, Win07, Batzoglou Multiple Sequence Alignments


Sparse LCS expressed as LIS

Create a sequence w

• Every matching point (i, j), is inserted into w as follows:

• For each column j = 1…m, insert in w the points (i, j), in decreasing row i order

• The 11 example points are inserted in the order given

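• a = (y, x), b = (y’, x’) can be chained iff a is before b in w, and y < y’

[Figure: dot-plot grid with x = 15 3 24 16 20 4 24 3 11 18 along the columns and y = 4 20 24 3 11 15 11 4 18 20 along the rows; the 11 hits are numbered 1 to 11 in their order of insertion into w]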

Page 23: CS262 Lecture 14, Win07, Batzoglou Multiple Sequence Alignments


Sparse LCS expressed as LIS

Create a sequence w

w = (4,2) (3,3) (10,5) (2,5) (8,6) (1,6) (3,7) (4,8) (7,9) (5,9) (9,10)

Consider now w’s elements as ordered lexicographically, where

• (y, x) < (y’, x’) if y < y’

Claim: An increasing subsequence of w is a common subsequence of x and y

[Figure: the same dot-plot grid, with x = 15 3 24 16 20 4 24 3 11 18 along the columns, y = 4 20 24 3 11 15 11 4 18 20 along the rows, and the hits numbered 1 to 11]

Page 24: CS262 Lecture 14, Win07, Batzoglou Multiple Sequence Alignments

Sparse Dynamic Programming for LIS

Example:

w = (4,2) (3,3) (10,5) (2,5) (8,6) (1,6) (3,7) (4,8) (7,9) (5,9) (9,10)

L = [L1] [L2] [L3] [L4] [L5] …

 1. (4,2)
 2. (3,3)
 3. (3,3) (10,5)
 4. (2,5) (10,5)
 5. (2,5) (8,6)
 6. (1,6) (8,6)
 7. (1,6) (3,7)
 8. (1,6) (3,7) (4,8)
 9. (1,6) (3,7) (4,8) (7,9)
10. (1,6) (3,7) (4,8) (5,9)
11. (1,6) (3,7) (4,8) (5,9) (9,10)

Longest common subsequence: s = 4, 24, 3, 11, 18

[Figure: the same dot-plot grid, with x = 15 3 24 16 20 4 24 3 11 18 along the columns, y = 4 20 24 3 11 15 11 4 18 20 along the rows, and the hits numbered 1 to 11]
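The whole reduction can be sketched end to end: collect the hits column by column in decreasing row order, then run the O(n log n) LIS on the row indices. The function name is hypothetical and the indices are 0-based. Note that computing hits directly from the x and y sequences as transcribed here also produces the point (6,1), since x1 = y6 = 15, which is not among the 11 points listed in the example; it does not change the answer.

```python
from bisect import bisect_left


def sparse_lcs(x, y):
    """Longest common subsequence via LIS over the hits x[j] == y[i]."""
    rows_of = {}  # symbol -> row indices i where y[i] equals that symbol
    for i, symbol in enumerate(y):
        rows_of.setdefault(symbol, []).append(i)

    # Build w: for each column j, the rows of its hits in decreasing order,
    # so two hits in the same column can never be chained together.
    w = [i for symbol in x for i in reversed(rows_of.get(symbol, []))]

    # Strictly increasing subsequence of row indices, with backpointers.
    tails, tails_pos = [], []
    prev = [None] * len(w)
    for pos, row in enumerate(w):
        j = bisect_left(tails, row)
        if j == len(tails):
            tails.append(row)
            tails_pos.append(pos)
        else:
            tails[j] = row
            tails_pos[j] = pos
        prev[pos] = tails_pos[j - 1] if j > 0 else None

    chain = []
    pos = tails_pos[-1] if tails_pos else None
    while pos is not None:
        chain.append(y[w[pos]])
        pos = prev[pos]
    return chain[::-1]


x = [15, 3, 24, 16, 20, 4, 24, 3, 11, 18]
y = [4, 20, 24, 3, 11, 15, 11, 4, 18, 20]
print(sparse_lcs(x, y))  # [4, 24, 3, 11, 18]
```

The work is proportional to the number of hits times a logarithmic factor, rather than to the full n × m table.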

Page 25: CS262 Lecture 14, Win07, Batzoglou Multiple Sequence Alignments


Sparse DP for rectangle chaining

• 1,…, N: rectangles

• (hj, lj): y-coordinates of rectangle j

• w(j): weight of rectangle j

• V(j): optimal score of chain ending in j

• L: list of triplets (lj, V(j), j)

    L is sorted by lj: smallest (North) to largest (South) value

    L is implemented as a balanced binary tree

[Figure: a rectangle spanning y-coordinates h (top) to l (bottom)]

Page 26: CS262 Lecture 14, Win07, Batzoglou Multiple Sequence Alignments

Sparse DP for rectangle chaining

Main idea:

• Sweep through x-coordinates

• To the right of b, anything chainable to a is chainable to b (for a rectangle b that ends no lower than a, i.e. lb ≤ la)

• Therefore, if V(b) > V(a), rectangle a is “useless” for subsequent chaining

• In L, keep rectangles j sorted with increasing lj-coordinates, which then also keeps them sorted with increasing V(j) score

[Figure: rectangles a and b with their scores V(a) and V(b)]

Page 27: CS262 Lecture 14, Win07, Batzoglou Multiple Sequence Alignments

Sparse DP for rectangle chaining

Go through rectangle x-coordinates, from lowest to highest:

1. When on the leftmost end of rectangle i:

    a. j: rectangle in L, with largest lj < hi

    b. V(i) = w(i) + V(j)

2. When on the rightmost end of i:

    a. k: rectangle in L, with largest lk ≤ li

    b. If V(i) > V(k):

        i. INSERT (li, V(i), i) in L

        ii. REMOVE all (lj, V(j), j) with V(j) ≤ V(i) & lj ≥ li

[Figure: rectangle i, with j the rectangle found at its left end and k the rectangle found at its right end]

Is k ever removed?
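The sweep can be sketched as follows, under several stated assumptions. Rectangles are given as `(x_left, x_right, h, l, weight)` tuples with h < l (y grows southward); when a left end and a right end share an x-coordinate, the left end is processed first so that chaining requires strictly increasing x; and L is kept in plain sorted Python lists searched with `bisect`, so insertions and deletions cost O(N) here rather than the O(log N) of a balanced binary search tree. All names are hypothetical.

```python
from bisect import bisect_left, bisect_right


def chain_rectangles(rects):
    """Sparse-DP chaining. Returns (best score, rectangle indices in chain order)."""
    events = []
    for i, (x_left, x_right, h, l, weight) in enumerate(rects):
        events.append((x_left, 0, i))   # 0 = left end, sorts before right ends
        events.append((x_right, 1, i))  # 1 = right end
    events.sort()

    keys = []   # the l-coordinates stored in L, ascending
    items = []  # parallel list of (V(j), j); V is ascending as well
    V = [0] * len(rects)
    back = [None] * len(rects)

    for _, kind, i in events:
        _, _, h, l, weight = rects[i]
        if kind == 0:
            p = bisect_left(keys, h)  # entries before p have l_j < h_i
            if p > 0:
                V[i], back[i] = weight + items[p - 1][0], items[p - 1][1]
            else:
                V[i] = weight
        else:
            p = bisect_right(keys, l)  # entry p - 1 is k: largest l_k <= l_i
            if p == 0 or V[i] > items[p - 1][0]:
                q = bisect_left(keys, l)
                end = q
                # Dominated entries (l_j >= l_i, V(j) <= V(i)) are consecutive.
                while end < len(keys) and items[end][0] <= V[i]:
                    end += 1
                del keys[q:end]
                del items[q:end]
                keys.insert(q, l)
                items.insert(q, (V[i], i))

    if not rects:
        return 0, []
    node = max(range(len(rects)), key=lambda i: V[i])
    total, chain = V[node], []
    while node is not None:
        chain.append(node)
        node = back[node]
    return total, chain[::-1]


# Five toy rectangles (coordinates invented for illustration).
boxes = [(0, 2, 0, 2, 5), (1, 4, 5, 7, 6), (3, 5, 3, 4, 3),
         (6, 8, 8, 9, 4), (7, 9, 5, 6, 2)]
print(chain_rectangles(boxes))  # (12, [0, 2, 3])
```

In this toy run, rectangle 1 is removed from L when rectangle 2 finishes, because rectangle 2 ends higher up and has a better score. That is the "useless rectangle" case from the main idea.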

Page 28: CS262 Lecture 14, Win07, Batzoglou Multiple Sequence Alignments

Example

[Figure: five rectangles on x/y axes with weights a: 5, b: 6, c: 3, d: 4, e: 2, alongside a table of V(a), …, V(e) and the triplets (li, V(i), i) held in L as the sweep proceeds]

1. When on the leftmost end of rectangle i:

    a. j: rectangle in L, with largest lj < hi

    b. V(i) = w(i) + V(j)

2. When on the rightmost end of i:

    a. k: rectangle in L, with largest lk ≤ li

    b. If V(i) > V(k):

        i. INSERT (li, V(i), i) in L

        ii. REMOVE all (lj, V(j), j) with V(j) ≤ V(i) & lj ≥ li

Page 29: CS262 Lecture 14, Win07, Batzoglou Multiple Sequence Alignments

Time Analysis

1. Sorting the x-coords takes O(N log N)

2. Going through x-coords: N steps

3. Each of N steps requires O(log N) time:

    • Searching L takes log N
    • Inserting to L takes log N
    • All deletions are consecutive, so log N per deletion
    • Each element is deleted at most once: < N log N for all deletions

• Recall that INSERT, DELETE, SUCCESSOR take O(log N) time in a balanced binary search tree

Page 30: CS262 Lecture 14, Win07, Batzoglou Multiple Sequence Alignments

Whole-genome Alignment Pipelines

Given N species, phylogenetic tree:

1. Local alignment between all pairs (BLAST)

2. In the order of the tree:

    1. Synteny mapping: find long regions with lots of collinear alignments

    2. In each synteny region:

        1. Chaining

        2. Global alignment

Alternatively, all species are mapped to one reference (e.g., human)

Then, in each unbroken synteny region between multiple species, perform chaining & progressive multiple alignment

Page 31: CS262 Lecture 14, Win07, Batzoglou Multiple Sequence Alignments

Examples

Human Genome Browser

ABC

Page 32: CS262 Lecture 14, Win07, Batzoglou Multiple Sequence Alignments


Whole-genome alignment Rat—Mouse—Human

Page 33: CS262 Lecture 14, Win07, Batzoglou Multiple Sequence Alignments


Next 2 years: 20+ mammals, & many other animals, will be sequenced & aligned