UNIVERSITA’ DI MILANO-BICOCCA LAUREA MAGISTRALE IN BIOINFORMATICA

UNIVERSITA’ DI MILANO-BICOCCAUNIVERSITA’ DI MILANO-BICOCCALAUREA MAGISTRALE IN LAUREA MAGISTRALE IN

BIOINFORMATICABIOINFORMATICA

Corso di

BIOINFORMATICA: TECNICHE DI BASE

Prof. Giancarlo Mauri

Lezione 8

Allineamento multiplo di sequenze

2

Perché l’allineamento multiplo?

Analisi filogenetica Proteine con funzioni simili in specie diverse L’allineamento multiplo ottimale fornisce informazioni sulla storia evolutiva delle specie

Scoperta di irregolarità (es. gene della Fibrosi Cistica)

Individuazione regioni conservate L’allineamento multiplo locale rivela regioni conservate

Le regioni conservate di solito sono regioni chiave per la funzionalità

Le regioni conservate sono target prioritari per lo sviluppo di farmaci

3

Definizione del problema

INPUTINPUT:

un insieme S = {S1,S2,…,Sn} di n sequenze definite su un alfabeto euna matrice di punteggio d: ({-})2 R

OUTPUTOUTPUT:

un insieme S’ = {S’1,S’2,…,S’n} di n sequenze sull’alfabeto U{} con le seguenti proprietà: Si’| = L i

eliminando gli spazi da Si’ si ottiene Si i

il punteggio di allineamento A è massimo

4

1 2 3 4 5 6 7 Indice di colonna

M Q P I L L L

M L R - L L -

M K - I - L L

M P P V I L L

Esempio 1

5

Esempio 2

SUGAR

SUCRE

AZUCAR

ZUCKER

SAKARI

ZUCCHERO

SOKKER

SUGAR-

SUC-RE-------

SUG/C?R?

-SUGAR-

-SUC-RE

AZUCAR-

-ZUCKER-

-SAKARI

-ZUCKERO

-SOKKER---------

-SUCKAR- E

6

Tutte le formulazioni del problema di

allineamento multiplo sono NP-difficili

Complessità del problema

Occorre trovare algoritmi che producano un buon allineamento e siano efficienti

rispetto al tempo utilizzato

7

La valutazione di un allineamento multiplo è basata su una funzione di punteggio predefinita; l’obiettivo è l’ottenimento di uno fra gli allineamenti di punteggio massimo

Il punteggio relativo ad un allineamento S’ di S, è dato dalla somma dei punteggi associati a tutte le colonne di S’

Dovremmo dare una funzione w: (U{})k R a k argomenti, e quindi considerare un numero esponenziale di casi

Sappiamo calcolare il punteggio dell’allineamento a coppie, per somma dei punteggi delle colonne; come estendere all’allineamento multiplo? (N.B: diamo un punteggio anche all’allineamento tra due spazi, ad esempio d(,) = 0).

Punteggio di un allineamento multiplo

8

La misura SP (Sum of Pairs)

Il punteggio relativo ad una colonna dell’allineamento S’ è dato dalla somma dei punteggi di tutte le coppie di simboli nella colonna (incluse coppie di gap)

Il punteggio dell’allineamento S’ è la somma dei punteggi di tutte le colonne

Euristiche per accelerare l’algoritmo di programmazione dinamica

9

Sequenza di consenso

Dato un allineamento S’ di un insieme S di n sequenze si definiscee la sequenza di consenso Sc per S’ nel seguente modo Sc ha la stessa lunghezza l è la della

generica sequenza S’i in S’

Il simbolo i-esimo in Sc è uguale al simbolo

più frequente nella colonna i-esima di S’

Il punteggio dell’allineamento S’ è la somma dei punteggi degli allineamenti di ciascuna sequenza S’i in S’ con Sc

10

Allineamento con albero

Dato l’insieme S = {S1,S2,…,Sn} e l’albero T

che ha S come insieme dei nodi, si determinano gli allineamenti ottimi tra le coppie (Si,Sj) che appartengono all’insieme degli archi di T

A T si associa un punteggio P dato dalla somma dei punteggi degli allineamenti di cui al punto 1 (somma dei punteggi degli archi)

Si ricostruisce l’allineamento multiplo S’ di S a partire dagli allineamenti ottimi determinati al punto 1

11

Metodo Star Alignment

1. Dato l’insieme S = {S1,S2,…,Sn}, si determinano gli allineamenti ottimi tra tutte le coppie di sequenze (Si,Sj) (j=1,2,…,n e j i)

2. Ad ogni Si si associa un punteggio Pi dato dalla somma dei punteggi degli allineamenti di cui al punto 1 con tutte le altre sequenze Sj

3. Si considera l’indice i per cui il valore di Pi è ottimo (minimo o massimo)

4. Si ricostruisce l’allineamento multiplo S’ di S a partire dagli allineamenti ottimi determinati al punto 1 per la stella di indice i, aggiungendo man mano gaps a Si

12

S1 S2 S3 S4 S5

S1

S3

S4

S2

S5

7 -2 0 -3

7 -2 0 -4

-2 -2 0 -7

0 0 0 -3

-3 -4 -7 -3

Pi 2 1 -11-3-17

Star Alignment: esempio

S = { ATTGCCATT,

ATGGCCATT,

ATCCAATTTT,

ATCTTCTT,

ACTGACC}

Schema di punteggio:

d(x,x) = 1

d(x,y) = -1

d(x,-) = d(-,x) = -2

13

Metodo Star Alignment: esempio

Centro stella S1=ATTGCCATT

Ricostruzione dell’allineamento multiplo

ATTGCCATT--

ATGGCCATT--

ATC-CAATTTT

ATCTTC-TT--

ACTGACC----

14

Sofware disponibili

CLUSTAL

Basato su algoritmo di Feng-Doolittle

Idea: allineare a coppie le sequenze del set di input S utilizzare l’insieme dei punteggi trovati come matrice delle distanze del metodo neighbor-joining per costruire un albero filogenetico per le sequenze in S

allineare le sequenze secondo l’ordine fissato dall’albero filogenetico (prima le sequenze più simili)

DiAlign

Idea: individuare diagonali (sottosequenze allineate senza spazi)

costruire l’allineamento a partire dalle diagonali

15

Sofware disponibili

CLUSTAL-W Standard popular software

It does multiple alignment as follows: Align 2 Repeat: keep on adding a new sequence to the alignment until no more, or do tree-like heuristics

Problem: It is simply a heuristics

Alternative: dynamic programming nk for k sequences. This is simply too slow

We need to understand the problem and solve it right

16

Making the problem simpler!

Multiple alignment is very hard For k sequences, nk time, by dynamic programming

NP hard in general, not clear how to approximate

Popular practice -- alignment within a band: the p-th letter in one sequence is not more than c places away from the p-th letter in another sequence in the final alignment – the alignment is along a diagonal bandwidth 2c

Used in final stage of FASTA program

17

In literature

NP hardness under various models Wang-Jiang (JCB) Li-Ma-Wang (STOC99) Just

Approximation results Gusfield (2- 1/L) Bafna, Lawler, Pevzner (CPM94, 2-k/L) star alignment

Sankoff, Kruskal discussed “within a band”

Pearson showed alignment within a band gives very good results for a lot protein superfamilies

Altschul and Lipman, Chao-Pearson-Miller, Fickett, Ukkonen, Spouge (survey) all have studied alignment within a band

18

The following were proved

SP-Alignment

NP hard

PTAS for constant band

PTAS for constant number of insertion/deletion gaps per sequence on average (for coding regions, this assumption makes a lot of sense)

Star-Alignment

PTAS in constant band

PTAS for constant number of insertion/deletion gaps per sequence on average

19

We will do only SP-alignment

Notation: in an alignment, a block of inserted “---” is called a gap. If a multiple alignment has c gaps on the average for each sequence, we call it average c-gap alignment

We first design a PTAS for the average c-gap SP alignment

Then using the PTAS for the average c-gap SP alignment, we design a PTAS for SP-alignment within a band

20

Average c-gap SP Alignment

Key Idea: choose r representative sequences, we find their “correct” alignment in the optimal alignment, by exhaustive search. Then we use this alignment as reference

Then we align every other sequence against this alignment

Then choose the best

All we have to show is that there are r sequences whose letter frequencies in each column of their alignment approximates the complete alignment

21

Some over-simplified reasoning

If M is optimal average c-gap SP alignment

In this alignment, many sequences have less than cl gaps.

So if we take r of these sequences, and try every possibly way, one way coincides with M

Then hopefully, its letter frequencies in each column “more or less” approximates that of M’s

Then we can simply optimally align all the rest of the sequences one by one according to this frequency matrix

22

If this column has k percent a’s

We also expectthis column

has ~ k percent a’s

Complete Alignment Alignment with r sequences

j j

Sampling r sequences

23

for L=m to nm {

for any r sequences {

for all possible alignment M’ of length L

and with no more than cl gaps {

align all other sequences to M’ //one alignment

}}}

Output the best alignment

AverageSPAlign

24

SP Alignment within c-Band

Basic Idea

Dynamically cut sequences into segments

Each segment satisfies the average c-gap condition. Hence use previous algorithm

Then assemble the segments together

Divide and Conquer

Cutting these sequences into 6 segments, each segmenthas c-gaps per sequence onaverage in optimal alignment

25

while (not finished) {

find a maximum prefix for each sequence (same length) such that AverageSPAlign returns “low”cost. Keep the multiple alignment for this segment

}

Concatenate the multiple alignments for all segments together to as final alignment

The final algorithm: diagonalAlign

Documents

UNIVERSITA’ DI MILANO-BICOCCA LAUREA MAGISTRALE IN BIOINFORMATICA