28
La place de la phylogénie en Bio-informatique Jean-Michel CLAVERIE Structural & Genetic Information Laboratory CNRS UPR2589 Luminy, France http://www.igs.cnrs-mrs.fr Journées RNG « Phylogénie »

La place de la phylogénie en Bio-informatique

  • Upload
    garin

  • View
    36

  • Download
    0

Embed Size (px)

DESCRIPTION

La place de la phylogénie en Bio-informatique. Jean-Michel CLAVERIE Structural & Genetic Information Laboratory CNRS UPR2589 Luminy , France. http://www.igs.cnrs-mrs.fr. Journées RNG « Phylogénie ». Bio-Informatique des Séquences: Une progression naturelle. - PowerPoint PPT Presentation

Citation preview

Page 1: La place de la phylogénie  en Bio-informatique

La place de la phylogénie en Bio-informatique

Jean-Michel CLAVERIE

Structural & Genetic Information LaboratoryCNRS UPR2589Luminy, France

http://www.igs.cnrs-mrs.fr

Journées RNG « Phylogénie »

Page 2: La place de la phylogénie  en Bio-informatique

Bio-Informatique des Séquences:Une progression naturelle

• Une séquence: trouver où sont les motifs et les gènes

• Deux séquences: les aligner

• Trois et plus: faire une arbre

Page 3: La place de la phylogénie  en Bio-informatique

Gene finding: the general principles behind all current methods

1. Gene-encoded peptide sequences have to lead to foldable, compact structure in a water environnement,

2. Real protein are thus made of a balanced mixture of hydrophilic, hydrophobic, rigid, flexible, neutral, or charged residues

3. This constraint induces a recognizable statistical bias in coding DNA sequences, that we use for detection.

4. Around the good looking regions, we look for facultative additional signals,

5. The optimal prediction is then the best scoring one according to a « standard » gene model.

Page 4: La place de la phylogénie  en Bio-informatique

64 codons, including 3 STOPs : TGA, TAA, TAGSimple situation like E.coli: (A=T=G=C=0.25)

Between two successive STOP codons: each of the 61 codons -> 1.6 % expected relative frequency

In contrast, codon distributions within bona fide genes have to results into ‘foldable’ proteins, hence balancing hydrophylic and hydrophobic, + and - charged, rigid and flexible residues. 

Non-coding vs. Protein-coding DNA sequence:Why is there a bias?

Coding segments have to be compatible with the typical amino acid composition of

natural proteins

Page 5: La place de la phylogénie  en Bio-informatique

  Amino Acid % in N codon % expected proteins at random 

Ala 9.46 4 6.4 +Arg 5.84 6 9.6 -Asn 3.99 2 3.2 +Asp 5.10 2 3.2 +Cys 1.20 2 3.2 -Gln 4.57 2 3.2 +Glu 5.50 2 3.2 +Gly 7.41 4 6.4 +His 2.44 2 3.2 -Ile 5.95 3 4.8 +Leu 10.5 6 9.6 +Lys 4.14 2 3.2 +Met 2.72 1 1.6 +Phe 4.19 2 3.2 +Pro 4.39 4 6.4 -Ser 5.71 6 9.6 -Thr 5.50 4 6.4 -Trp 1.48 1 1.6Tyr 2.76 2 3.2Val 7.16 4 6.4

AA with

+

or

20% deviation

from random

Page 6: La place de la phylogénie  en Bio-informatique

A variety of methods take advantageof this bias

Sequence

SlidingWindow

Chi2

(obs-rand)

CodingRegion

Page 7: La place de la phylogénie  en Bio-informatique

The Caesar (Julius) Cipher

Plaintext : veni, vidi, viciCiphertext: YHQL, YLGL, YLFL

Plain alphabet: abcdefghijklmnopqrstuvwxyzCipher alphabet: DEFGHIJKLMNOPQRSTUVWXYZABC

The Caesar cipher is based on a cipher alphabet that is shifted a certain number of places, relative to the plain alphabet

Page 8: La place de la phylogénie  en Bio-informatique

Frequency of letters in

English

letters % letters %a 8.2 n 6.7

b 1.5 o 7.5

c 2.8 p 1.9

d 4.3 q 0.1

e 12.7 r 6.0

f 2.2 s 6.3

g 2.0 t 9.1

h 6.1 u 2.8

i 7.0 v 1.0

j 0.2 w 2.4

k 0.8 x 0.2

l 4.0 y 2.0

m 2.4 z 0.1

From 100,362 characters

(Beker & Piper, 1950)

Page 9: La place de la phylogénie  en Bio-informatique

Frequency analysis of an

enciphered message

letters % letters %A 0.9 N 0.9

B 7.4 O 11.2

C 8.0 P 9.2

D 4.1 Q 0.6

E 1.5 R 1.8

F 0.6 S 2.1

G 0.3 T 0.0

H 0.0 U 1.8

I 3.3 V 5.3

J 5.3 W 0.3

K 7.7 X 10.1

L 7.4 Y 5.6

M 3.3 Z 1.5

XZAVOIDBYGERSPCFHJKLMNQTUWabcdefghijklmnopqrstuvwxyz

e = 12.7

Page 10: La place de la phylogénie  en Bio-informatique

The foundingFather of

« Gene-finding » bioinformatics

‘Al-kin-dee-u’

Al-Kindī

Al-Kindī (0850)Decyphering Cryptographic MessagesPMID: 0000001

Page 11: La place de la phylogénie  en Bio-informatique

Homophonic substitution cipher

a b c d e f g h i j k l m n o p q r s t u v w x y

09 48 13 01 14 10 06 23 32 15 04 26 22 18 00 38 94 29 11 17 08 34 60 28 21

12 81 41 03 16 31 25 39 70 37 27 58 05 95 35 19 20 61 89 52

33 62 45 24 50 73 51 66 54 42 76 43

47 79 44 56 83 84 71 72 77 86 49

53 46 65 88 91 90 80 96 69

67 55 68 93 99 75

78 57 85

92 64 97

74

82

87

Frequent letters, many codes

Page 12: La place de la phylogénie  en Bio-informatique

  Amino Acid % in N codon % expected proteins at random 

Leu 10.5 6 9.6 +Arg 5.84 6 9.6 -Ser 5.71 6 9.6 -Ala 9.46 4 6.4 +Gly 7.41 4 6.4 +Val 7.16 4 6.4Thr 5.50 4 6.4 -Pro 4.39 4 6.4 -Ile 5.95 3 4.8 +Glu 5.50 2 3.2 +Asp 5.10 2 3.2 +Gln 4.57 2 3.2 +Phe 4.19 2 3.2 +Lys 4.14 2 3.2 +Asn 3.99 2 3.2 +Tyr 2.76 2 3.2His 2.44 2 3.2 -Cys 1.20 2 3.2 -Met 2.72 1 1.6 +Trp 1.48 1 1.6

« Homophonic » Substitution in the Genetic Code

Page 13: La place de la phylogénie  en Bio-informatique

From old to new methods

Methods:1. Codon usage (Staden & McLachlan, 1982) : profile2. Differential k-tuple frequency (1986) : profile

Page 14: La place de la phylogénie  en Bio-informatique

K-tuple concepts and methods

ATGCTAGCATAGCTGCATGACATGCATGCAATGC TGCA TGCT ATGC GCTA CATG CTAG GCAT TAGC TGCA AGCA ATGC

4-tuple(tetramer)

Pre-compute: Fcoding (ATGC) , Fncoding (ATGC)

……………. , ………………

……………. , ………………

Fcoding (°°°°) , Fncoding (°°°°)

Page 15: La place de la phylogénie  en Bio-informatique

K-tuple profile

Sequence

Fc

-------

Fc

+Fnc

CodingRegion

ATGC

Fcoding (ATGC)

------------------------------------------------

Fcoding (ATGC) + Fncoding (ATGC)

0.5

Page 16: La place de la phylogénie  en Bio-informatique

k-tuple frequency analysis was probably knownof the Arabs (1000) and rediscovered

during the Renaissance

Homophonic deciphering:

looking for 2-tuples that often, never or rarely occurs: je, jx, wa

Page 17: La place de la phylogénie  en Bio-informatique

Blaise de Vigenère(1586)

invented his undecipherable cipher

to resist k-tuple analysis

Page 18: La place de la phylogénie  en Bio-informatique

The Vigenère

square

Plain abcdefghijklmnopqrstuvwxyz 1 BCDEFGHIJKLMNOPQRSTUVWXYZA 2 CDEFGHIJKLMNOPQRSTUVWXYZAB 3 DEFGHIJKLMNOPQRSTUVWXYZABC 4 EFGHIJKLMNOPQRSTUVWXYZABCD 5 FGHIJKLMNOPQRSTUVWXYZABCDE 6 GHIJKLMNOPQRSTUVWXYZABCDEF 7 HIJKLMNOPQRSTUVWXYZABCDEFG 8 IJKLMNOPQRSTUVWXYZABCDEFGH 9 JKLMNOPQRSTUVWXYZABCDEFGHI10 KLMNOPQRSTUVWXYZABCDEFGHIJ11-20 ……………………………………………………………………21 VWXYZABCDEFGHIJKLMNOPQRSTU22 WXYZABCDEFGHIJKLMNOPQRSTUV23 XYZABCDEFGHIJKLMNOPQRSTUVW24 YZABCDEFGHIJKLMNOPQRSTUVWX25 ZABCDEFGHIJKLMNOPQRSTUVWXY26 ABCDEFGHIJKLMNOPQRSTUVWXYZ

Page 19: La place de la phylogénie  en Bio-informatique

The Vigenère square: example

Keyword : WHITE WHITEWHITEWHITEWHIPlaintext : diverttroopstoeastridgeCiphertext: ZPDXVPAZHSLZBHIWZBKMZNM

Plain abcdefghijklmnopqrstuvwxyz 4 EFGHIJKLMNOPQRSTUVWXYZABCD 7 HIJKLMNOPQRSTUVWXYZABCDEFG 8 IJKLMNOPQRSTUVWXYZABCDEFGH19 TUVWXYZABCDEFGHIJKLMNOPQRS22 WXYZABCDEFGHIJKLMNOPQRSTUV

WHITE

Page 20: La place de la phylogénie  en Bio-informatique

Charles Babbage:

Another pioneerIn bioinformatics

Cryptanalysis ofThe Vigenère cypher (1854)

PMID: 0000002

Page 21: La place de la phylogénie  en Bio-informatique

1. Look for repeats in the cipher text

2. Infer the keyword lenght from consistent repeat distance: L

3. Then analyze the character frequency at each position independently: 1, 2, 3 , …., L

4. Deduce (« call ») the corresponding Ceasar shift

The invention of inhomogeneous (and hidden) Markov Models

Page 22: La place de la phylogénie  en Bio-informatique

From old to new methods

Methods:1. Codon usage (Staden & McLachlan, 1982) : profile2. Differential k-tuple frequency (1986) : profile3. Differential in-phase k-tuples (1988) : profile4. Non-homogeous Markov chains (Borodvsky,1986;

Tavare & Song, 1989) : profile/call5. Neural-net (Mural & Uberbacher, 1991) : call6. Hidden Markov chain (Kulp & al., 1996) : call

Page 23: La place de la phylogénie  en Bio-informatique

Bioinformatics vs. Biology

• 1951 1st protein sequence (Insulin, Sanger)

• 1960 Sequence-structure relationship (Globins, Perutz)

• 1965 "Evolutionary divergence & convergence in Proteins" Zuckerkandl & Pauling

• 1967 "Construction of Phylogenetic Trees" Fitch & Margoliash. • 1968 Atlas of Protein Sequences (M. Dayhoff, Georgetown)

• 1970 "A general method applicable to the search for similaries in amino-acid sequences of two proteins" Needleman & Wunsch

• 1973 Genetic engineering (Cohen, Boyer et al.)

• 1974 "Prediction of Protein Conformation" Chou & Fasman • 1977 ADN sequencing (Sanger, Maxam, Gilbert)

• 1977 1st bioinformatic "package" (Staden; DB/ assembly, analysis)

• 1978 Databases: ACNUC, PIR, EMBL, GenBank• 1980 Database access via telephone lines ( PIR )• 1981 Los Alamos-GenBank: 270 seqs, 370.000 nt

Page 24: La place de la phylogénie  en Bio-informatique

• 1965 First «  industrial » computer IBM/360• "Evolutionary divergence and convergence in Proteins"

Zuckerkandl & Pauling• 1967 "Construction of Phylogenetic Trees" Fitch & Margoliash. • 1968 First « mini » computer DEC PDP-8 (floor top)• Atlas of Protein Sequences (M. Dayhoff, Georgetown)• 1970 "A general method applicable to the search for similaries "

Needleman & Wunsch.• 1971 1st work on RNA folding (Ninio)• 1972 First « micro » processor Intel 8008• 1973 Genetic engineering (Cohen et al.)• 1974 "Prediction of Protein Conformation" Chou & Fasman • 1975 Intel 8080, kit Altair• 1977 1st bioinformatic package (Staden; PDP 8/11) • DEC-VAX Mini-computers• Micro-computer (Apple, Commodore, Radioshack)• 1978 Databases: ACNUC, PIR, EMBL, GenBank• 1980 Telephone access to the PIR database• 1981 Los Alamos-GenBank: 270 seqs, 370.000 nt• IBM-PC (8088), 16-32kb• 1983 IBM-XT HARD disk (10 Mbytes)• 1984 MacIntosh : graphic/mouse interface

Bioinformatics vs. Computers

Page 25: La place de la phylogénie  en Bio-informatique

Perspective historique (suite)

• 1981 Local Alignement (Smith-Waterman , JMB)

• 1985 "Fasta" (Pearson-Lipman, PNAS)

• 1989 ARPANET --> INTERNET

• 1990 "Blast" (Altschul et al., JMB)

• 1990 Positional cloning of NF-1

• 1991 "Grail", first practical gene «finder » (Mural et al., PNAS)

• 1991 "EST" (Venter et al., Matsubara et al.)

• 1992 Complete sequence of yeast chr. 3

• 1995 Complete sequence of H. influenza

• 1996 Complete sequence of S. cerevisiae

• 1997 "Gapped Blast" (Alschul et al., NAR)/Genscan (Burge & Karlin)

• 1997 11 complete bacterial genomes available

• 1998 2 Mbase/day of new public sequence data, C. elegans• 2000 Human Chr 22, 21, 90%draft, Drosophila, 30 bacterial

genomes

Bioinformatics & Genomics: more recent past

Page 26: La place de la phylogénie  en Bio-informatique

Bon workshop!

Page 27: La place de la phylogénie  en Bio-informatique

Genes & Genomes

DNA

RNA (transcripts)

transcription

Gene 1 Gene 2 Gene 3 Gene 4

translation+ folding

Proteins (or RNAs)Function 2 Function 3

Function 4

Function 1

Bio

Info

rmat

ics

Page 28: La place de la phylogénie  en Bio-informatique