Protein folding: A bioinformatics perspective. Ugo Bastolla, Red Nacional de Bioinformática and Centro de Astrobiología (CSIC-INTA). Universidad Politécnica de Madrid, 14 January 2003


Page 1: Plegamiento de proteínas: Una perspectiva bioinformática

Protein folding: A bioinformatics perspective

Ugo Bastolla, Red Nacional de Bioinformática and Centro de Astrobiología (CSIC-INTA)

Universidad Politécnica de Madrid, 14 January 2003

Page 2: Plegamiento de proteínas: Una perspectiva bioinformática

Proteins as interdisciplinary molecules

Proteins are evolving molecular machines, at the border between Physics and Biology.

• They are molecular machines that obey the laws of statistical mechanics.

• They are evolving machines, produced through the action of mutation and natural selection.

Bioinformatics integrates both sources of information to predict biological properties. Thermodynamics sheds light on protein evolution, and evolutionary considerations shed light on protein folding.

Page 3: Plegamiento de proteínas: Una perspectiva bioinformática

Proteins are polymers formed by 20 amino-acid types bound by peptide bonds.

Soft degrees of freedom: phi-psi angles

Page 4: Plegamiento de proteínas: Una perspectiva bioinformática

Torsion angles cluster at values corresponding to regular local structure (secondary structure), stabilized by hydrogen bonds.

Page 5: Plegamiento de proteínas: Una perspectiva bioinformática

Hierarchical organization of protein structure

Page 6: Plegamiento de proteínas: Una perspectiva bioinformática

Many proteins (e.g. antibodies) are formed by several, almost independently folding units called domains.

Page 7: Plegamiento de proteínas: Una perspectiva bioinformática

Protein Folding

Most proteins fold spontaneously into a well-defined three-dimensional conformation, the Native State.

It is believed that the Native State is the state of minimal free energy available to the protein plus solvent system. This depends on the state of the solvent, e.g. on temperature and pH.

AMTYHLDVVSAEQQMFSGLVEKIQVT…..

Page 8: Plegamiento de proteínas: Una perspectiva bioinformática

Statistical mechanics of protein folding

N residues: exponentially large number of conformations, of order $e^{aN}$.

Boltzmann distribution in configuration space:

$P(C) \propto \exp(-E(C)/k_B T)$

Here $k_B$ is the Boltzmann constant and $T$ is the absolute temperature. The effective free energy $E(C)$ depends on the state of the solvent (averaged out) through temperature, pH, presence of denaturants…

Under proper conditions, the configuration with minimal effective free energy and its neighbors have a total Boltzmann probability close to one.

They are always observed in independent experiments. They represent the Native State.
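As a rough illustration of the Boltzmann argument above, the following Python sketch computes the probability of a single low-energy state competing against many high-energy states. The toy energy values, the number of states, and the function name `boltzmann_probabilities` are assumptions made for the example, not data from the talk.

```python
import math

def boltzmann_probabilities(energies, kT=1.0):
    """Return P(C) = exp(-E(C)/kT) / Z for a list of effective energies."""
    e_min = min(energies)  # shift by the minimum for numerical stability
    weights = [math.exp(-(e - e_min) / kT) for e in energies]
    z = sum(weights)  # partition function (up to the constant shift)
    return [w / z for w in weights]

# Hypothetical toy landscape: one "native" state and 1000 unfolded states.
energies = [-10.0] + [0.0] * 1000
p_cold = boltzmann_probabilities(energies, kT=1.0)
p_hot = boltzmann_probabilities(energies, kT=5.0)

print("native probability at kT=1:", round(p_cold[0], 3))
print("native probability at kT=5:", round(p_hot[0], 3))
```

In this toy landscape the native state dominates at low temperature and becomes negligible at high temperature, where the entropy of the many unfolded states wins.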

[Figure: free energy profile showing the Unfolded state, the Transition state and the Native state.]

Page 9: Plegamiento de proteínas: Una perspectiva bioinformática

Lattice models of protein folding

• Exponentially large number of conformations, Monte Carlo simulations.

• Well-designed sequences fold fast to the lowest free energy state and are stable both thermodynamically and against mutations. They have a well-correlated energy landscape.

• Qualitative features reproduced, no experimental comparison possible.
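To make the "exponentially large number of conformations" concrete, here is a small Python sketch that counts chain conformations on a lattice by brute force. It assumes a 2D square lattice and self-avoiding chains, which is one common choice for lattice models and not necessarily the one used in the talk; the function name is hypothetical.

```python
def count_self_avoiding_walks(n_steps):
    """Count self-avoiding walks with n_steps bonds on the 2D square lattice."""
    moves = ((1, 0), (-1, 0), (0, 1), (0, -1))

    def extend(pos, visited, remaining):
        if remaining == 0:
            return 1
        total = 0
        for dx, dy in moves:
            nxt = (pos[0] + dx, pos[1] + dy)
            if nxt not in visited:  # excluded volume: no site is visited twice
                visited.add(nxt)
                total += extend(nxt, visited, remaining - 1)
                visited.remove(nxt)
        return total

    return extend((0, 0), {(0, 0)}, n_steps)

counts = [count_self_avoiding_walks(n) for n in range(1, 7)]
print(counts)
```

The count grows by a factor of roughly 2.7 per added bond, so exhaustive enumeration is only feasible for very short chains, which is why Monte Carlo sampling is used instead.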

Page 10: Plegamiento de proteínas: Una perspectiva bioinformática

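Random sequences: slow folding, low stability, small $\alpha$.

Designed sequences: fast folding, high stability, large $\alpha$.

$\frac{E(C) - E(C_0)}{|E(C_0)|} > \alpha \, (1 - q(C, C_0))$

Here $C_0$ is the ground-state conformation and $q(C, C_0)$ its similarity with conformation $C$. The normalized energy gap $\alpha$ gives a quantitative measure of energy landscape correlations.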

Page 11: Plegamiento de proteínas: Una perspectiva bioinformática

Molecular Dynamics

• Model: all atoms in the protein

• Solvent either explicit or implicit

• Molecular dynamics simulations integrate Newton's equations $m_i \, d^2 x_i/dt^2 = F_i(x_1, \dots, x_N)$, with a characteristic time of $10^{-12}$ s.

• Force field ideally from “first principles” (but simplifications are needed!). Examples: CHARMM, AMBER.

Very useful to model the functioning of an enzyme, but useless for folding prediction: Time scales are too long, simulations can be trapped in energy minima, and it is not even clear whether the model is accurate enough.
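The equations of motion above are integrated numerically, step by step. The sketch below uses the velocity Verlet scheme, a standard integrator in molecular dynamics, on a single particle in a harmonic potential. The one-dimensional setting, the units, and all parameter values are assumptions chosen to keep the example short; a real force field such as CHARMM or AMBER has many more terms.

```python
import math

def velocity_verlet(x, v, force, mass, dt, n_steps):
    """Integrate m * d2x/dt2 = F(x) for one particle in 1D (velocity Verlet)."""
    f = force(x)
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / mass   # half-step velocity update
        x = x + dt * v_half                # full-step position update
        f = force(x)                       # force at the new position
        v = v_half + 0.5 * dt * f / mass   # complete the velocity update
    return x, v

# Toy harmonic "bond" F(x) = -k x, in arbitrary units (hypothetical values).
k, mass = 1.0, 1.0
x0, v0 = 1.0, 0.0
x1, v1 = velocity_verlet(x0, v0, lambda x: -k * x, mass, dt=0.01, n_steps=1000)

energy0 = 0.5 * mass * v0 ** 2 + 0.5 * k * x0 ** 2
energy1 = 0.5 * mass * v1 ** 2 + 0.5 * k * x1 ** 2
print("x(t=10):", x1, "analytic:", math.cos(10.0))
print("energy drift:", energy1 - energy0)
```

The time step must be much shorter than the fastest motion in the system, which is the reason why folding time scales are so hard to reach with this approach.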

Page 12: Plegamiento de proteínas: Una perspectiva bioinformática

The holy grail of protein folding

Develop a model simple enough to allow computation, yet realistic enough to be comparable with experiments.

The only a-priori reliable model needs quantum interactions (for instance, interactions between aromatic amino acids) and all atoms of the solvent (PROBLEM!).

The simplest models have 2N torsion-angle degrees of freedom. The number of possible conformations is $O(e^{2N})$, far too many to enumerate even for a short chain.

Page 13: Plegamiento de proteínas: Una perspectiva bioinformática

Homology modelling

Just use biology, not physics! (homology = common origin)

Proteins with more than 25% sequence similarity always have very similar structure, because structure is very conserved in evolution.

• Align query and template

• Build the backbone from aligned template

• Build non-aligned regions (loops)

• Build side chains.

The more similar the sequences, the more similar the structures and the better the model.

Page 14: Plegamiento de proteínas: Una perspectiva bioinformática

Homology models tend to be rather good in conserved regions, but they are poor in more variable regions (loops).

They are only reliable if sequence similarity is above 25% (this threshold has been decreased due to better alignment techniques), whereas most protein pairs have lower similarity.

Page 15: Plegamiento de proteínas: Una perspectiva bioinformática

The bioinformatic approach: look at known protein structures

• Related proteins with the same fold typically have low sequence similarity: their relationship can hardly be recognized by sequence alignment alone.

• Score: how suitable is a structure template to a query sequence? (Effective energy function)

• Recognize the known structure which best fits an unknown sequence

• No physical derivation for the scoring scheme, but thermodynamic estimates are sometimes possible

Page 16: Plegamiento de proteínas: Una perspectiva bioinformática

Reduced representation of proteins

We represent protein structures as contact maps:

$C_{ij} = 1$ if residues $i$ and $j$ are in contact, $0$ otherwise.

Similarity is measured as the fraction of common contacts, or overlap:

$q(C, C') = \frac{\sum_{i<j} C_{ij} C'_{ij}}{\max\left(\sum_{i<j} C_{ij}, \sum_{i<j} C'_{ij}\right)}$

Alternative structures of sequence $A = \{A_1 \dots A_N\}$ are generated by aligning A without gaps with all structures in the PDB (gapless threading).

The energy is assumed to be of the form

$E(C, A)/k_B T = \sum_{i<j} C_{ij} \, U(A_i, A_j)$

depending on 210 parameters $U(a,b)$, one for each unordered pair of the 20 amino acid types.
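The contact map, the overlap and the contact energy defined on this slide can be sketched in a few lines of Python. The distance cutoff, the minimum sequence separation, the two-letter H/P alphabet and the values in `u` are all assumptions made for the example; the actual 210 parameters of the talk are not reproduced here.

```python
from itertools import combinations

def contact_map(coords, cutoff=4.5, min_separation=3):
    """Return the set of pairs (i, j), i < j, closer than `cutoff` in space."""
    contacts = set()
    for i, j in combinations(range(len(coords)), 2):
        if j - i >= min_separation:  # skip trivial contacts along the chain
            d2 = sum((a - b) ** 2 for a, b in zip(coords[i], coords[j]))
            if d2 <= cutoff ** 2:
                contacts.add((i, j))
    return contacts

def overlap(c1, c2):
    """q(C, C') = number of common contacts / max(contacts in C, contacts in C')."""
    if not c1 and not c2:
        return 1.0
    return len(c1 & c2) / max(len(c1), len(c2))

def contact_energy(contacts, sequence, u):
    """E(C, A)/kBT = sum over contacts of U(A_i, A_j)."""
    return sum(u[tuple(sorted((sequence[i], sequence[j])))] for i, j in contacts)

# Toy 2D structures: a hairpin and a fully extended chain (hypothetical).
hairpin = [(0, 0), (1, 0), (2, 0), (2, 1), (1, 1), (0, 1)]
extended = [(i, 0) for i in range(6)]
u = {("H", "H"): -1.0, ("H", "P"): 0.0, ("P", "P"): 0.0}  # illustrative values

c_hairpin = contact_map(hairpin, cutoff=1.0)
c_extended = contact_map(extended, cutoff=1.0)
print(sorted(c_hairpin), overlap(c_hairpin, c_extended))
print(contact_energy(c_hairpin, "HHPPHH", u))
```

Storing the contact map as a set of index pairs keeps both the overlap and the energy sum proportional to the number of contacts rather than to N squared.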

Page 17: Plegamiento de proteínas: Una perspectiva bioinformática

Effective energy for simplified protein models

We have optimized the parameters of a contact energy function such that the Native State has the lowest energy and the energy landscape is well correlated for most independent proteins in the Protein Data Bank.

Our optimization method is based on the maximization of the Boltzmann average of the similarity with the native state:

$Q(A) \sim \sum_C \exp(-E(C, A)/k_B T) \, q(C, C_{nat})$

up to normalization by the partition function $\sum_C \exp(-E(C, A)/k_B T)$.

When this parameter is maximal ($Q \sim 1$) the native state has the lowest energy and dissimilar states have high energy (the energy landscape is well correlated). This can be achieved for nearly all proteins in the PDB.
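The behaviour of Q can be illustrated with a minimal Python sketch that evaluates the normalized Boltzmann average for two toy landscapes. The energies (in units of kBT), the decoy overlaps and the number of decoys are hypothetical values chosen for illustration; the optimization of the parameters U(a,b) itself is not shown.

```python
import math

def boltzmann_average_overlap(energies, overlaps):
    """Q = sum_C exp(-E(C)) q(C, C_nat) / sum_C exp(-E(C)), with E in units of kBT."""
    e_min = min(energies)  # shift for numerical stability
    weights = [math.exp(-(e - e_min)) for e in energies]
    z = sum(weights)
    return sum(w * q for w, q in zip(weights, overlaps)) / z

n_decoys = 1000
overlaps = [1.0] + [0.2] * n_decoys  # the native state has overlap 1 with itself

# Well-correlated landscape: decoys lie far above the native energy.
q_good = boltzmann_average_overlap([-20.0] + [-5.0] * n_decoys, overlaps)
# Poorly designed sequence: decoys are almost degenerate with the native state.
q_bad = boltzmann_average_overlap([-20.0] + [-19.0] * n_decoys, overlaps)

print(round(q_good, 4), round(q_bad, 4))
```

With a large gap Q is close to one, whereas with nearly degenerate decoys Q falls towards the typical decoy overlap, which is the situation the optimization tries to avoid.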

Page 18: Plegamiento de proteínas: Una perspectiva bioinformática

Effective energy applied to crystal structures

The Native States have the lowest energy and the energy landscapes are well correlated.

The resulting normalized energy gap (0.2-0.8) is much higher than for random sequences (<0.1) and increases with chain length.

Prediction of unfolding free energies using crystal structures and the effective energy function:

$\Delta G / (N k_B T) = E_{nat} / (N k_B T) - s$

Page 19: Plegamiento de proteínas: Una perspectiva bioinformática

The main contribution to the energy parameters comes from hydrophobicity

Page 20: Plegamiento de proteínas: Una perspectiva bioinformática

A facility for protein structure prediction at CAB (http://www.cab.inta.es/~CAFASP/)

The PROTFINDER algorithm looks for the structure in the PDB database that best aligns (with gaps) to the query sequence. It took part in CAFASP4.

It is available through a web server developed and maintained by Alain Lepinette of CAB.

Page 21: Plegamiento de proteínas: Una perspectiva bioinformática

Scoring function:

• Contact free energy

• Configurational entropy loss $S_0$ for each aligned residue

• Gap penalties $G_0$ (create) and $G_1$ (extend)

For a sequence-structure alignment $a(i)$:

$\mathrm{Score} = -\sum_{i<j} C_{a(i)a(j)} \, U(A_i, A_j) - S_0 L_{ali} - G_0 N_{gaps} - G_1 L_{gaps}$

Sequence homology information is not used.

A semi-deterministic algorithm is used to generate candidate alignments.
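The scoring function of this slide can be evaluated for a given alignment as in the Python sketch below. This is not the PROTFINDER code: the representation of the alignment as a list, the gap-counting convention (unaligned query residues and skipped template positions are counted separately, end gaps included), and all numerical parameters are assumptions made for illustration.

```python
def threading_score(query, alignment, template_contacts, u, s0, g0, g1):
    """Score = -sum C(a(i),a(j)) U(A_i,A_j) - S0*L_ali - G0*N_gaps - G1*L_gaps.

    alignment[i] is the template position a(i) of query residue i, or None.
    """
    aligned = [(i, a) for i, a in enumerate(alignment) if a is not None]
    energy = 0.0
    for x in range(len(aligned)):
        for y in range(x + 1, len(aligned)):
            (i, ai), (j, aj) = aligned[x], aligned[y]
            if (min(ai, aj), max(ai, aj)) in template_contacts:
                energy += u[tuple(sorted((query[i], query[j])))]

    n_gaps = l_gaps = 0
    in_gap, prev = False, None
    for a in alignment:
        if a is None:                # query residue not aligned to the template
            l_gaps += 1
            if not in_gap:
                n_gaps += 1
                in_gap = True
        else:
            in_gap = False
            if prev is not None and a > prev + 1:  # template positions skipped
                n_gaps += 1
                l_gaps += a - prev - 1
            prev = a
    return -energy - s0 * len(aligned) - g0 * n_gaps - g1 * l_gaps

# Hypothetical template contact map and two-letter contact energies.
template_contacts = {(0, 5), (1, 4)}
u = {("H", "H"): -1.0, ("H", "P"): 0.0, ("P", "P"): 0.0}
params = dict(s0=0.1, g0=1.0, g1=0.2)

score_full = threading_score("HHPPHH", [0, 1, 2, 3, 4, 5], template_contacts, u, **params)
score_gapped = threading_score("HHPPHH", [0, 1, None, 3, 4, 5], template_contacts, u, **params)
print(score_full, score_gapped)
```

In this toy case the gapped alignment keeps both favourable contacts but pays the gap penalties, so the ungapped alignment scores higher.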

Page 22: Plegamiento de proteínas: Una perspectiva bioinformática

Fold recognition

The ability to predict protein structures depends crucially on the similarity $q_{max}$ of the most similar structure available in the database.

• Very similar structure present in the database: it is correctly selected on the basis of the energy.

• No structure above a threshold of similarity: the prediction is almost random.

The high similarity needed is frequent among proteins of detectable homology, but not among very distant homologs.

Page 23: Plegamiento de proteínas: Una perspectiva bioinformática

Sequence-structure alignments obtained through ProtFinder are very similar to those found in databases of protein alignments (PFAM).

Page 24: Plegamiento de proteínas: Una perspectiva bioinformática

The CASP experiment evaluates protein structure prediction methods

Page 25: Plegamiento de proteínas: Una perspectiva bioinformática

Stability of orthologous proteins

Related proteins with the same fold have typically low sequence similarity.

What are the common features of their sequences?

How similar are their thermodynamic properties?

Page 26: Plegamiento de proteínas: Una perspectiva bioinformática

With our tools we can compare thermodynamic properties of homologous proteins.

We estimate two key parameters: the folding free energy $\Delta G$ and the normalized energy gap $\alpha$.

We apply our energy function to families of orthologous proteins to predict their Native Structure. In all cases, this coincides with the structure of the closest analog in the PDB, even though our algorithm does not use information on sequence similarity.

Page 27: Plegamiento de proteínas: Una perspectiva bioinformática

List of genes:

ATPE, ACKA, AROQ, COAD, DDL, DUT, EFTS, FLAV, FOLA, FTSJ, PDF, PTH, PTHP, RL14, RNH, RNPA, TRXA, TRXB, TPIS, TRPA, DNAK

List of organisms:

• Free-living: B.subtilis, B.anthracis, C.crescentus, D.radiodurans, E.coli, E.acidophylus, H.influenzae, L.lactis, L.monocytogenes, L.innocua, M.tuberculosis, M.smegmatis, N.meningitidis, P.multocida, P.aeruginosa, P.putida, R.loti, R.meliloti, S.typhimurium, S.aureus, S.pyogenes, S.coelicolor, Synechococcus, T.pallidum, V.cholerae, X.fastidiosa, Z.mobilis

• Intracellular: B.burgdorferi, B.aphidicola (APS, BPS, SGR), C.jejuni, C.pneumoniae, C.trachomatis, H.pylori, M.capricolum, M.genitalium, M.pneumoniae, M.leprae, R.prowazekii, U.parvum, Y.pestis, W.glossinidia, Wolbachia sp.

• Thermophiles: A.aeolicus, B.stearothermophilus, T.maritima, T.aquaticus

• Archaea: A.pernix, A.fulgidus, M.jannaschii, M.thermoautotrophicum, P.furiosus

Page 28: Plegamiento de proteínas: Una perspectiva bioinformática

Protein folding thermodynamics depends on hydrophobicity.

More hydrophobic sequences have more negative folding free energy (they are more stable against unfolding), but they have lower energy gap (they are less stable against misfolding).

Evolution has to look for a compromise between these properties! (Frustration)

Page 29: Plegamiento de proteínas: Una perspectiva bioinformática

Folding efficiency (normalized energy gap) is correlated with genome size.

Smaller genomes, such as those of intracellular bacteria, have reduced folding efficiency. Possible misfolding problems are consistent with observed high expression of chaperones in these bacteria.

Page 30: Plegamiento de proteínas: Una perspectiva bioinformática

Intracellular bacteria

The genomes of obligate intracellular organisms (organelles, endosymbionts, parasites) share important common features:

• Very small genomes

• High AT content

• High hydrophobicity

• Reduced population size

• Reduced folding ability of proteins

These features can be explained from the point of view of evolutionary theory.

Page 31: Plegamiento de proteínas: Una perspectiva bioinformática

Our results show that the normalized energy gap is smaller for intracellular bacteria than for free living bacteria.

This fact can be explained (a) because intracellular genomes have mutation bias towards A+T, hence express more hydrophobic proteins; (b) because of the weaker selection experienced by intracellular bacteria due to their small populations.

A smaller folding parameter implies that the occurrence of misfolding is much higher. This can lead to protein aggregation, very dangerous for cellular processes.

To avoid aggregation, these bacteria express very high amounts of chaperones, proteins in charge of helping protein folding. The chaperone DNAK appears more stable in organisms with smaller genomes.

Page 32: Plegamiento de proteínas: Una perspectiva bioinformática

What do sequences with the same fold have in common?

Spectral decomposition of the interaction matrix:

$E = \sum_{i<k} C_{ik} \, U(A_i, A_k) \sim \sum_{i<k} C_{ik} \, h(A_i) \, h(A_k)$

Sequences with the same fold have similar Hydrophobicity Vectors $h(A_i)$ (HV).

The HV has a large correlation $r(h,c)$ with the Principal Eigenvector (PE) $c$ of the contact matrix $C_{ij}$.

Therefore, sequences with the same fold have a common hydrophobic fingerprint that coincides with the PE of the contact matrix.

The evolutionary average HV correlates with the PE much more strongly than the HV of a single sequence.
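The correlation between a hydrophobicity profile and the principal eigenvector of a contact matrix can be computed as in the following Python sketch. The power iteration is a generic textbook method, and the six-residue contact matrix and the hydrophobicity values are invented for illustration; they are not taken from the talk.

```python
import math

def principal_eigenvector(matrix, n_iter=500):
    """Power iteration for the principal eigenvector of a symmetric 0/1 matrix."""
    n = len(matrix)
    v = [1.0] * n
    for _ in range(n_iter):
        # Iterate with (C + I): same eigenvectors, no oscillation on bipartite maps.
        w = [v[i] + sum(matrix[i][k] * v[k] for k in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def pearson(x, y):
    """Pearson correlation coefficient r(x, y)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical contact map: residue 0 is buried, residue 2 has no contacts.
n = 6
contacts = [(0, 3), (0, 4), (0, 5), (1, 4), (3, 5)]
c = [[0] * n for _ in range(n)]
for i, k in contacts:
    c[i][k] = c[k][i] = 1

pe = principal_eigenvector(c)
h = [1.0, 0.2, 0.0, 0.7, 0.7, 0.7]  # illustrative hydrophobicity profile
r = pearson(h, pe)
print([round(x, 3) for x in pe], round(r, 3))
```

Residues with many contacts, typically buried in the core, get large PE components, so a profile that places hydrophobic residues there correlates strongly with the PE.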

Page 33: Plegamiento de proteínas: Una perspectiva bioinformática

Bioinformatics

Biological information is accumulating at a very fast pace.

• Need to classify this information for storage and retrieval (one could say that biology is the art of classifying!)

• Protein structures: decomposition, structural classification, hidden evolutionary relationships.

• Biological sequences: identification of protein sequences (genes), classification, structure and function prediction.

• Molecular interactions: reconstruction of metabolic networks and cellular regulatory networks (systems biology)

• Organisms: evolutionary classification (phylogeny)

• Biological literature: classification and retrieval

Page 34: Plegamiento de proteínas: Una perspectiva bioinformática

Proteins are made of modules (domains) that are duplicated and combined in many possible ways to create ever new molecules.

Page 35: Plegamiento de proteínas: Una perspectiva bioinformática
Page 36: Plegamiento de proteínas: Una perspectiva bioinformática

The Protein Data Bank (PDB) contains roughly 24000 protein structures, determined either by X-ray crystallography or by NMR spectroscopy. Fewer than 4000 are different folds. The number of new folds (blue bar) is decreasing each year. Other classification schemes yield fewer than 1000 different folds. Evolution uses a reduced number of folds for a large number of biological functions.

Page 37: Plegamiento de proteínas: Una perspectiva bioinformática

CATH structural classification: 813 folds (Topology level) (Thornton, Orengo)

Page 38: Plegamiento de proteínas: Una perspectiva bioinformática

SCOP Structural Classification of Proteins: 800 folds (Chothia, Murzin)

Page 39: Plegamiento de proteínas: Una perspectiva bioinformática

DALI: Algorithm and server for automatic classification of protein structures (Holm and Sander).

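It aligns protein structures by minimizing the dissimilarity score:

$S = \sum_{ik} \frac{|r^a_{ik} - r^b_{ik}|}{r^a_{ik} + r^b_{ik}} \exp\left(-\frac{(r^a_{ik} - r^b_{ik})^2}{4 r_0^2}\right), \quad r_0 = 20$ Å

The sum runs over pairs of Cα atoms $i,k$; $r^a_{ik}$ and $r^b_{ik}$ are their distances in structures $a$ and $b$.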

It generates the database FSSP of structurally similar proteins (S much smaller than for random pairs of structures, Z score criterion).

Page 40: Plegamiento de proteínas: Una perspectiva bioinformática

For each new structure:

• Store it in the PDB with proper format.

• Decompose it in domains;

• Classify domains, discover new evolutionary relationships.

For each new sequence:

• Find the gene sequences in the genome (easy for prokaryotes, very difficult for eukaryotes because genes are interrupted by introns).

• Find homologous domains, infer structure and function.

• Decide whether structure determination is worthwhile

Page 41: Plegamiento de proteínas: Una perspectiva bioinformática

Protein databases

GenBank: Protein sequences (not annotated), from genome projects.

SwissProt: Annotated protein sequences. Domain organization, structure, function, active site may be known from homology.

Protein Data Bank (PDB): Protein structures

Page 42: Plegamiento de proteínas: Una perspectiva bioinformática

Sequence Alignment

Page 43: Plegamiento de proteínas: Una perspectiva bioinformática

Alignment is the main tool in Bioinformatics. It is justified by the fact that aligned elements have a common evolutionary origin (homology).

Amino acids or nucleotides in evolution can be conserved, substituted (usually with minimal modification of the Native State), inserted or deleted. The last two processes generate gaps in the alignment.

The score for an alignment $a(i)$ between two sequences $A^1_i$ and $A^2_k$ is

$\mathrm{Score} = \sum_i S(A^1_i, A^2_{a(i)}) - G_0 N_{gaps} - G_1 L_{gaps}$

The $20 \times 20$ matrix $S(a,b)$ is called the substitution matrix and is determined from aligned protein families. The most used are the BLOSUM62 and PAM250 matrices. $G_0$ is the gap opening penalty and $G_1$ is the gap extension penalty.

The number of possible alignments grows exponentially with sequence length, but the optimal alignment can be found exactly by dynamic programming (Needleman & Wunsch, Smith & Waterman), in $O(L^2)$ time for the gap penalties above.

The optimal solution is often, but not always, the biologically relevant one. The gap parameters and substitution matrix used are crucial! One has to check the statistical significance.
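The dynamic programming idea can be sketched in Python as a global (Needleman & Wunsch type) alignment score. To keep the example short it makes two simplifying assumptions that differ from the slide: a match/mismatch score replaces the substitution matrix S(a,b), and a single linear gap penalty replaces the separate opening and extension penalties.

```python
def needleman_wunsch_score(seq1, seq2, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score with a linear gap penalty."""
    n, m = len(seq1), len(seq2)
    # prev[j]: best score aligning the first i-1 residues of seq1 with seq2[:j]
    prev = [j * gap for j in range(m + 1)]
    for i in range(1, n + 1):
        curr = [i * gap] + [0] * m
        for j in range(1, m + 1):
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            curr[j] = max(
                prev[j - 1] + s,    # align residue i with residue j
                prev[j] + gap,      # residue i faces a gap
                curr[j - 1] + gap,  # residue j faces a gap
            )
        prev = curr
    return prev[m]

print(needleman_wunsch_score("ACGT", "ACGT"))
print(needleman_wunsch_score("ACGT", "AGT"))
```

Only two rows of the table are kept because the score alone is requested; recovering the alignment itself would require storing the full table and tracing back from the last cell.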

Page 44: Plegamiento de proteínas: Una perspectiva bioinformática

Multiple Sequence Alignments

Multiple alignment of M sequences is an NP-hard problem: no algorithm polynomial in M is thought to exist. In fact, once the first two sequences have been aligned, the score for the next one has been modified!

The most used solution, implemented in the algorithm CLUSTALW, consists in aligning the easy pairs first:

• Align all pairs of sequences with a fast algorithm

• Build a tree of their relationship

• Start aligning accurately the two most closely related sequences (easiest). Represent both of them with a single profile.

• Iterate, looking again for the two most closely related sequences or profiles.

Page 45: Plegamiento de proteínas: Una perspectiva bioinformática

Database search

Often, we do not need accurate alignments but just a list of database entries that are evolutionarily related to our query sequence. The most used algorithms for this purpose are BLAST and FASTA.

BLAST compares the query sequence to all sequences in a database like SwissProt or GenBank in a few seconds. For each pair of sequences, it finds all exact matches of length k, extends and combines them, and reports the probability (P value) that such matches are found by chance.
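The seed-and-extend idea described above can be sketched as follows. This is a simplified illustration and not the BLAST implementation: it uses exact word matches only, extends without gaps while residues are identical, and computes no statistics. The subject sequence and the function names are hypothetical; the query is the sequence fragment shown earlier in the talk.

```python
from collections import defaultdict

def find_seeds(query, subject, k=3):
    """Return (query_pos, subject_pos) for every exact shared word of length k."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)  # word -> positions in the query
    seeds = []
    for j in range(len(subject) - k + 1):
        for i in index.get(subject[j:j + k], ()):
            seeds.append((i, j))
    return seeds

def extend_seed(query, subject, i, j, k):
    """Extend a seed in both directions while residues are identical (no gaps)."""
    start_i, start_j = i, j
    while start_i > 0 and start_j > 0 and query[start_i - 1] == subject[start_j - 1]:
        start_i -= 1
        start_j -= 1
    end_i, end_j = i + k, j + k
    while end_i < len(query) and end_j < len(subject) and query[end_i] == subject[end_j]:
        end_i += 1
        end_j += 1
    return start_i, start_j, end_i - start_i  # start positions and match length

query = "AMTYHLDVVSAEQQ"
subject = "GGTYHLDVKK"  # hypothetical database sequence
seeds = find_seeds(query, subject, k=3)
hit = extend_seed(query, subject, *seeds[0], 3)
print(seeds, hit)
```

Indexing the words of the query first means that each database sequence is scanned only once, which is what makes this kind of search fast.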

PSI-BLAST is an iterative procedure based on BLAST.

• Find all sequences significantly related to the query.

• Construct a profile (amino acid distribution per site) from the multiple alignment

• Iterate the search using the profile as query.

In this way, very distant evolutionary relationships can be retrieved confidently. This method is very useful for protein structure prediction.

Page 46: Plegamiento de proteínas: Una perspectiva bioinformática

Phylogenetic trees

[Figure: phylogenetic tree with leaves A, B, C, D, E, F, G; axis: Distance.]

Evolving species can be placed on the leaves of a phylogenetic tree.

The time elapsed since the last common ancestor of species A and B, d(A,B), is a distance that allows classification. This is based on the ultrametric property: in every triangle, the two longest sides are equal.
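The ultrametric property is easy to test on a distance matrix. The Python sketch below checks every triple of species; the two small matrices are hypothetical examples, the first one corresponding to a tree ((A,B),C).

```python
from itertools import combinations

def is_ultrametric(dist, tol=1e-9):
    """Check that in every triangle the two longest sides are equal."""
    n = len(dist)
    for i, j, k in combinations(range(n), 3):
        a, b, c = sorted((dist[i][j], dist[i][k], dist[j][k]))
        if abs(c - b) > tol:  # the two largest distances must coincide
            return False
    return True

# Tree ((A,B),C): A and B diverged recently, C split off earlier.
clock_like = [
    [0, 2, 6],
    [2, 0, 6],
    [6, 6, 0],
]
# Same topology, but the lineage of B evolved faster than that of A.
not_clock_like = [
    [0, 2, 5],
    [2, 0, 6],
    [5, 6, 0],
]
print(is_ultrametric(clock_like), is_ultrametric(not_clock_like))
```

Distances measured from real sequences are only approximately ultrametric, so in practice a tolerance is needed, or a tree-building method that does not assume equal rates.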

Phylogenetic trees were once built by comparing external characters, but now they are built using macromolecules such as proteins, RNA and DNA.

Page 47: Plegamiento de proteínas: Una perspectiva bioinformática

The molecular clock

Empirical observation: the number of amino acid substitutions between two orthologous proteins (e.g. Myoglobin) of two species A and B is linearly correlated with their divergence time t(A,B). Fluctuations of the number of substitutions are small.

K(A,B) ~ a t(A,B)

If the divergence time is not known, the number of substitutions can be used to estimate it. K(A,B) can be obtained from the number of mismatches in the sequence alignment, using some model of evolution to correct for multiple substitutions.
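One simple model for correcting multiple substitutions is the Poisson correction, K = -ln(1 - p), where p is the fraction of mismatched positions in the alignment. The Python sketch below applies it and then inverts K ~ a t. The choice of the Poisson model, the two short sequences and the rate value are assumptions made for illustration only.

```python
import math

def poisson_corrected_distance(seq1, seq2):
    """K = -ln(1 - p), with p the fraction of mismatches (gap columns skipped)."""
    pairs = [(a, b) for a, b in zip(seq1, seq2) if a != "-" and b != "-"]
    p = sum(a != b for a, b in pairs) / len(pairs)
    return -math.log(1.0 - p)  # undefined for p = 1 (fully saturated)

def divergence_time(k, rate):
    """Invert K ~ a * t for a given substitution rate a."""
    return k / rate

# Hypothetical aligned fragments differing at 2 positions out of 10.
seq_a = "MKTAYIAKQR"
seq_b = "MKSAYLAKQR"
k_ab = poisson_corrected_distance(seq_a, seq_b)
t_ab = divergence_time(k_ab, rate=1e-9)  # illustrative rate per site per year
print(round(k_ab, 4), t_ab)
```

The corrected distance is always larger than the raw mismatch fraction, and the correction grows quickly as the sequences approach saturation.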

Methods to generate phylogenetic trees range from deterministic clustering algorithms to optimization methods. The two most used are:

• Neighbor Joining: Join the two closest sequences, recalculate distances, iterate. Very fast but not very accurate.

• Maximum Likelihood: For a model of sequence evolution (independent sites needed!), calculate the likelihood of the observed sequences given the parameters and the tree. Exhaustive search of the ML tree is impossible, but approximate algorithms give good results.

Page 48: Plegamiento de proteínas: Una perspectiva bioinformática
Page 49: Plegamiento de proteínas: Una perspectiva bioinformática

Tree of seven replication proteins found in all bacterial genomes (using the BLAST algorithm), obtained with the Neighbor-Joining method. The numbers represent bootstrap values (number of times, out of 1000, that the plotted branching is observed when the aligned positions are randomly resampled).

Some groups (clades) can be confidently reconstructed, for instance Proteobacteria and Gram-positive bacteria, but some divergences are too ancient and no similarity signal is found in their proteins.

Page 50: Plegamiento de proteínas: Una perspectiva bioinformática

Some problems with phylogenetics

• The protein tree, which we reconstruct, does not always coincide with the species tree, if there has been gene transfer between species (frequent in bacteria) or gene duplication prior to species separation (paralogous proteins).

• The molecular clock is known to hold for neutral evolution (when the properties of the protein do not change), but adaptations happen at a much faster rate. The substitution rate can vary in different branches also due to different mutation rate or generation time. When the rate is too variable, the estimates of branch lengths and the reconstructed trees are not reliable.

• The number of substitutions K(A,B) can be reliably estimated from the number of mismatches only when it is not saturated.

• An indication of these problems is that different proteins usually give different tree topologies.

Page 51: Plegamiento de proteínas: Una perspectiva bioinformática

Some courses on the web:

http://www.pdg.cnb.uam.es/cursos/BioInfo2002/pages/index.html Doctoral course: Bioinformatics

http://www.cryst.bbk.ac.uk/PPS2/index.html Principles of Protein Structure Using the Internet

http://www.biochemtech.uni-halle.de/PPS2/projects/day/TDayDi The Source of Stability in Proteins

http://www.fst.reading.ac.uk/courses/fs916/index.htm Protein Structure and Function

http://www.cm.utexas.edu/academic/courses/Spring2002/CH339K/Robertus/

http://www.oup.com/lesk/bioinf Site of the book: Introduction to Bioinformatics, by A.M. Lesk (Oxford)

Page 52: Plegamiento de proteínas: Una perspectiva bioinformática

Main databases and resources:

http://www.ncbi.nlm.nih.gov/ National Center for Biotechnology Information: Genomes, PubMed (literature), genes, proteins...

http://www.bmn.com/ BioMedNet (Medline): Biological literature

http://www.tigr.org/ The Institute for Genomic Research

http://www.ebi.ac.uk/swissprot/ Swiss-Prot: annotated proteins

http://pfam.wustl.edu/ Pfam: aligned protein families

http://gibk26.bse.kyutech.ac.jp/jouhou/jouhoubank.html BioInfo Bank: several databases

http://www.rcsb.org/pdb/ Protein Data Bank: protein structures

http://www.ebi.ac.uk/dali/ FSSP: Alignment of protein domains

http://www.biochem.ucl.ac.uk/bsm/cath/ CATH: Classification of domains

http://scop.mrc-lmb.cam.ac.uk/scop/ SCOP: Classification of domains

http://www.ebi.ac.uk/

http://pqs.ebi.ac.uk/ Protein Quaternary Structure (interactions)

http://BioInfo.PL/cafasp/ Servers for automatic protein structure prediction