The Inverse Protein Folding Problem*

Arvind Gupta

Simon Fraser University

May 24, 2005

*Joint work with J. Manuch, C. Mead, L. Stacho, B. Bhattacharyya, X. Huang

Canada-China Industrial Workshop, 2005 Hong Kong Baptist University

Outline

• Background

• Forces in Protein Folding

• Hydrophobic-Polar Model

• Protein Databank

• Determining Attributes of the Ideal Lattice

• Future Steps

DNA

• Genetic code

• A “string” of nucleotides over A, C, G, T

• Code for all proteins

• Self-replicating

Proteins

• A “string” over 20 amino acids

• In solvent, will fold into a unique 3D spatial structure with minimal energy

Protein Structure

• Structure determines protein function.

• Proteins normally are in an aqueous environment.

• Proteins are globular.

Proteins in the body

• Proteins are involved in all processes in the body:

Insulin

Hemoglobin

Proteins and diseases

M. Thorpe, Protein Folding, HIV and Drug Design, Physics and Technology Forefronts (2003).

Forward Protein Folding Problem

• Identify the protein structure for a specific amino acid sequence.

MAGWTRLS..

• Central open problem in biology

• NP-hard under most models

Inverse Protein Folding Problem

• Given a structure (or a functionality), identify an amino acid sequence whose fold will be that structure (or will exhibit that functionality).

• Crucial problem in drug design.

• NP-hard under most models.

Forces acting on Proteins

• Hydrogen bonding

• Van der Waals interactions

• Ion pairing

• Disulfide bonds

• Intrinsic properties (conformational preference)

• Hydrophobicity: the dominant force in protein folding (Dill, 1990)

Hydrophilic: hydro (water) + philic (loving); hydrophobic: hydro (water) + phobic (fearing)

Hydrophobic Interactions

• Each amino acid can be classified as either hydrophobic or hydrophilic (polar)

• Hydrophobic [polar] amino acids are in a higher [lower] energy state in an aqueous environment.

Hydrophobic – Polar (HP) Model

• Introduced by Dill (1985) and Chan (1985)

• “0” for polar; “1” for hydrophobic

• Protein sequence embedded on a lattice

• Each amino acid in exactly one cell

• Interactions across adjacent cells

• Empty lattice cells contain water

• Given a protein, maximize hydrophobic interactions (native fold).

• I.e., given a 0-1 string, embed it onto a lattice, maximizing the number of adjacent 1’s.
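The objective above can be made concrete with a minimal sketch for the 2D square lattice. The function name `hp_contacts` and the sample fold coordinates are illustrative assumptions, not taken from the slides; the sketch also assumes that two 1's joined by a peptide bond (consecutive in the chain) do not count as a hydrophobic interaction.

```python
def hp_contacts(sequence, fold):
    """Count hydrophobic interactions of a fold on the 2D square lattice.

    sequence: 0-1 string ("1" = hydrophobic, "0" = polar)
    fold: list of (x, y) lattice cells, one per amino acid, in chain order
    """
    if len(sequence) != len(fold):
        raise ValueError("one lattice cell per amino acid is required")
    if len(set(fold)) != len(fold):
        raise ValueError("each lattice cell may hold at most one amino acid")
    for (x1, y1), (x2, y2) in zip(fold, fold[1:]):
        if abs(x1 - x2) + abs(y1 - y2) != 1:
            raise ValueError("consecutive amino acids must occupy adjacent cells")

    index_at = {cell: i for i, cell in enumerate(fold)}
    contacts = 0
    for i, (x, y) in enumerate(fold):
        if sequence[i] != "1":
            continue
        # Look only right and up so that each adjacent pair is counted once.
        for neighbor in ((x + 1, y), (x, y + 1)):
            j = index_at.get(neighbor)
            # Assumption: chain neighbors (|i - j| == 1) are not interactions.
            if j is not None and sequence[j] == "1" and abs(i - j) > 1:
                contacts += 1
    return contacts


# Hypothetical fold of the string 010010010010 (coordinates chosen for illustration).
SEQUENCE = "010010010010"
FOLD = [(0, 1), (1, 1), (1, 0), (2, 0), (2, 1), (3, 1),
        (3, 2), (2, 2), (2, 3), (1, 3), (1, 2), (0, 2)]
print(hp_contacts(SEQUENCE, FOLD))  # prints 4
```

In this fold the four 1's sit on a 2x2 block of cells, so each of them touches two others.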

The 2-D Square Lattice

• Legend (symbols shown in the figure): hydrophobic “1”, polar “0”, peptide bond, hydrophobic interaction

• Example.

Protein: [shown in the figure]

Inverse protein folding

• Problem: For a given shape find a protein (amino acid string) with a native fold approximating the shape.

• Example.

Constructible structures

Theorem: For any constructible structure S, there exists a protein p(S) with a native fold exactly filling the structure S.

• Proof by induction:

– Base case: p(S)=010010010010

– Inductive case: [shown in the figure]

• Proof:

– Folds are saturated: every hydrophobic “1” is involved in two hydrophobic interactions

– Saturated implies native
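Why saturated implies native on the square lattice: every cell has four neighbors, and a non-terminal amino acid uses two of them for its peptide bonds, so a non-terminal “1” can take part in at most two hydrophobic interactions. If the string has k 1's, none of them terminal, no fold can have more than 2k/2 = k interactions, and a saturated fold reaches exactly k. The sketch below checks saturation for a given fold; the helper names and the sample fold coordinates are illustrative assumptions, not taken from the slides.

```python
def contacts_per_residue(sequence, fold):
    """Map each hydrophobic position to its number of hydrophobic interactions."""
    index_at = {cell: i for i, cell in enumerate(fold)}
    counts = {}
    for i, (x, y) in enumerate(fold):
        if sequence[i] != "1":
            continue
        counts[i] = 0
        for neighbor in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            j = index_at.get(neighbor)
            # Chain neighbors are joined by a peptide bond, not an interaction.
            if j is not None and sequence[j] == "1" and abs(i - j) > 1:
                counts[i] += 1
    return counts


def is_saturated(sequence, fold):
    """True if every hydrophobic "1" is involved in two hydrophobic interactions."""
    return all(c == 2 for c in contacts_per_residue(sequence, fold).values())


# Hypothetical fold of the base-case string (coordinates chosen for illustration).
BASE = "010010010010"
BASE_FOLD = [(0, 1), (1, 1), (1, 0), (2, 0), (2, 1), (3, 1),
             (3, 2), (2, 2), (2, 3), (1, 3), (1, 2), (0, 2)]
STRAIGHT = [(i, 0) for i in range(len(BASE))]

print(is_saturated(BASE, BASE_FOLD))  # True
print(is_saturated(BASE, STRAIGHT))   # False
```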

Stability of proteins

• A protein is stable if it has a unique “native fold” (fold with minimal energy).

• Most natural proteins are stable.

• The protein in our example is not stable: together 82 native folds!
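For short strings, stability can be checked by brute force: enumerate every self-avoiding fold on the 2D square lattice and keep those with the maximum number of hydrophobic interactions. The sketch below is one possible way to do this and is not the method used in the talk. It counts folds up to rotation and reflection of the lattice, which is an assumption about how native folds are counted, and its running time grows exponentially with the string length.

```python
def native_folds(sequence):
    """Return (max_contacts, folds) over all folds, up to rotation and reflection."""
    n = len(sequence)
    if n < 2:
        return 0, [[(0, 0)] * n]
    best_contacts, best_folds = -1, []

    def extend(path, occupied, contacts, turned):
        nonlocal best_contacts, best_folds
        i = len(path)
        if i == n:
            if contacts > best_contacts:
                best_contacts, best_folds = contacts, [list(path)]
            elif contacts == best_contacts:
                best_folds.append(list(path))
            return
        x, y = path[-1]
        for dx, dy in ((1, 0), (0, 1), (0, -1), (-1, 0)):
            cell = (x + dx, y + dy)
            if cell in occupied:
                continue
            # Break mirror symmetry: the first step off the x-axis must go up.
            if not turned and dy == -1:
                continue
            gained = 0
            if sequence[i] == "1":
                cx, cy = cell
                for nb in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                    j = occupied.get(nb)
                    # Skip the chain predecessor (peptide bond, not an interaction).
                    if j is not None and j != i - 1 and sequence[j] == "1":
                        gained += 1
            occupied[cell] = i
            path.append(cell)
            extend(path, occupied, contacts + gained, turned or dy != 0)
            path.pop()
            del occupied[cell]

    # Break rotational symmetry: the first peptide bond always points right.
    extend([(0, 0), (1, 0)], {(0, 0): 0, (1, 0): 1}, 0, False)
    return best_contacts, best_folds


def is_stable(sequence):
    """A protein is stable if exactly one fold attains the maximum."""
    return len(native_folds(sequence)[1]) == 1


print(native_folds("1001"))  # (1, [[(0, 0), (1, 0), (1, 1), (0, 1)]])
```

The tiny string 1001 has a single optimal fold (the U-shape), so `is_stable("1001")` returns `True`.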

Stability of proteins

Conjecture: For any constructible structure S, the protein p(S) is stable.

• Tested for >20,000 constructible structures.

• Mathematically proved for two simple infinite classes of constructible structures, L0 and L1.

[Figures: examples of L0 and L1 structures]

Boundary squares

• Diagonal frame: the smallest diagonal rectangle containing all hydrophobic “1”s.

• Boundary square: a hydrophobic “1” lying on the border of the diagonal frame.

5 boundary squares

Boundary squares

• Useful for finding the last tile of a constructible structure.

• A saturated fold has at least 4 of them.
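Boundary squares are easy to compute if “diagonal rectangle” is read as a rectangle whose sides run along the 45-degree diagonals of the lattice; that reading is an assumption here. Under it, rotating coordinates to u = x + y and v = x - y turns the diagonal frame into an ordinary bounding box. The function name and the sample fold are illustrative.

```python
def boundary_squares(sequence, fold):
    """Return the hydrophobic cells lying on the border of the diagonal frame."""
    ones = [cell for cell, c in zip(fold, sequence) if c == "1"]
    # Diagonals x + y = const and x - y = const become the sides of an
    # axis-parallel bounding box in (u, v) coordinates.
    u = [x + y for x, y in ones]
    v = [x - y for x, y in ones]
    u_min, u_max, v_min, v_max = min(u), max(u), min(v), max(v)
    return [cell for cell, ui, vi in zip(ones, u, v)
            if ui in (u_min, u_max) or vi in (v_min, v_max)]


# Hypothetical saturated fold of 010010010010 (coordinates chosen for illustration).
P = "010010010010"
P_FOLD = [(0, 1), (1, 1), (1, 0), (2, 0), (2, 1), (3, 1),
          (3, 2), (2, 2), (2, 3), (1, 3), (1, 2), (0, 2)]
print(len(boundary_squares(P, P_FOLD)))  # prints 4
```

For this small saturated fold all four 1's are boundary squares, which matches the lower bound of 4.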

Lemma. Let p=0{0,1}*0 be a protein string containing none of 11, 000, and 10101 as a substring. For every saturated fold of p, each boundary square not adjacent to a terminal is the main square of a corner-closed core.
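The hypothesis of the Lemma is a purely syntactic condition on the string, so it can be tested directly. A minimal sketch (the function name is an illustrative assumption):

```python
import re

FORBIDDEN = ("11", "000", "10101")


def satisfies_lemma_hypothesis(p):
    """True if p has the form 0{0,1}*0 and contains no forbidden substring."""
    if re.fullmatch(r"0[01]*0", p) is None:
        return False
    return not any(s in p for s in FORBIDDEN)


print(satisfies_lemma_hypothesis("010010010010"))  # True
print(satisfies_lemma_hypothesis("0101010"))       # False: contains 10101
```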

Proof for L0 structures

• Take a saturated fold of p(S), where S is in L0.

• It has at least 4 boundary squares, and at least 2 not adjacent to a terminal (the first or the last amino acid).

• By Lemma, each is contained in a corner-closed core, i.e., is a red 1 of substring 1001001 of the protein string.

• In p(S)=0(10010)^n(01001)^n0, there are only two occurrences of the substring 1001001, and they overlap.

• Hence, cores match each other and form a fully-closed core (closed on 3 sides) - the last tile.

• Cut the last tile and apply induction.
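The counting step of this proof can be checked mechanically. The sketch below builds the L0 string for a given n and lists every (possibly overlapping) occurrence of 1001001; the helper names are illustrative assumptions.

```python
def p_L0(n):
    """The string 0(10010)^n(01001)^n0."""
    return "0" + "10010" * n + "01001" * n + "0"


def occurrences(text, pattern):
    """Start indices of all occurrences of pattern, overlaps included."""
    found, start = [], text.find(pattern)
    while start != -1:
        found.append(start)
        start = text.find(pattern, start + 1)
    return found


for n in range(1, 5):
    print(n, occurrences(p_L0(n), "1001001"))
```

For n = 1 the string is 010010010010, the base case of the induction, and the two occurrences start at positions 1 and 4, so they share four characters.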

L1 structures are more complex

• p(S)=0(10010)^n010(10010)^m(01001)^m01(01001)^(n-1)0

• p(S) contains one occurrence of the substring 10101 (so the Lemma cannot be applied directly) and three occurrences of 1001001 (two corner-closed cores do not imply a fully-closed core).
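The two counts can be reproduced from the formula as it reads in this text. A minimal sketch with illustrative helper names:

```python
def p_L1(n, m):
    """The string 0(10010)^n 010 (10010)^m (01001)^m 01 (01001)^(n-1) 0."""
    return ("0" + "10010" * n + "010" + "10010" * m
            + "01001" * m + "01" + "01001" * (n - 1) + "0")


def count_overlapping(text, pattern):
    """Number of occurrences of pattern in text, overlaps included."""
    return sum(text.startswith(pattern, i) for i in range(len(text)))


for n, m in ((2, 1), (3, 2)):
    p = p_L1(n, m)
    print(n, m, count_overlapping(p, "10101"), count_overlapping(p, "1001001"))
```

Both parameter choices give one occurrence of 10101 and three of 1001001. With the formula exactly as written here, n = 1 yields no occurrence of 10101, so the stated counts appear to presume n of at least 2.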

Choosing a Lattice

• 2D is easier

– Fewer options for combinatorial case analysis

– More visually intuitive

– Torsion angles describe protein mainchain

• 3D is more relevant

– More biologically relevant

– More representative of actual protein structures

– Directly applicable to known protein structures

Protein Data Bank (PDB)

• Worldwide repository for 3-D biological macromolecular structure data

• Contains 30,857 known protein structures (May 17, 2005)

• Structures derived using different techniques

– Nuclear Magnetic Resonance spectroscopy

– X-ray crystallography

• PDB ‘known structures’ are really models of the structure of a protein

Determining Ideal Lattice Attributes

1. Should all edges of the lattice be identical in length?

2. How should distances between non-adjacent lattice points behave?

3. What angles should the lattice have?

4. How regular should the lattice be?

Use PDB statistics to answer these questions

Assemble a Set of Proteins

Create a subset of good-quality protein structures from the PDB:

a) Protein structures generated using X-ray diffraction

b) High-resolution structures (<= 1.75 Å)

c) Model fits the experimental data well

Result: 3704 protein structures in the subset

Q1: Uniform Edge Length?

Overall distribution of consecutive residue distance:

The distance between consecutive residues is consistently close to 3.8 Å.

Answer to Question 1: All edge lengths should be uniform with length 3.8 Å.
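A statistic of this kind could be computed as follows, assuming each residue is represented by a single 3D point (for instance its Cα atom, which is an assumption here) and that the coordinates have already been read from a structure file. The function name and the sample coordinates are illustrative, not real PDB data.

```python
import math


def consecutive_distances(coords):
    """Distances between consecutive residue positions, in the units of the input."""
    return [math.dist(a, b) for a, b in zip(coords, coords[1:])]


# Synthetic coordinates in angstroms, made up for illustration.
sample = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0)]
print(consecutive_distances(sample))
```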

Q2: Non-adjacent Vertex Distances?

Overall distribution of non-consecutive residue distance:

• minimum distance: 3.06 Å

• only 10 distances < 3.5 Å

• 1813 distances < 3.8 Å (out of 426 billion pairs).

Answer to Question 2: Non-adjacent vertices should be at least 3.8 Å apart.
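The minimum over non-consecutive pairs can be sketched the same way, again assuming one 3D point per residue. The quadratic loop below is only practical for a single chain; a survey over hundreds of billions of pairs would need a spatial index. Names and coordinates are illustrative.

```python
import math
from itertools import combinations


def min_nonconsecutive_distance(coords):
    """Smallest distance between residues that are not chain neighbors."""
    return min(math.dist(coords[i], coords[j])
               for i, j in combinations(range(len(coords)), 2)
               if j - i > 1)


# Synthetic square of side 3.8 angstroms, made up for illustration.
square = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 0.0)]
print(min_nonconsecutive_distance(square))
```

Here the closest non-consecutive pair is the first and last point, one edge length apart.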

Q3: Lattice Angles?

One amino acid

Amino acid chain

Q3: Lattice Angles?

• Calculate Cα angles: the angle formed by three consecutive Cα atoms

• Group results by middle amino acid residue type

Overall distribution of Cα angles:

Bimodal distribution:

• Sharp peak at 90°

• Shallow peak at 120°
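The angle at the middle of three consecutive atoms follows from the dot product of the two vectors pointing from the middle atom to its neighbors. A minimal sketch with illustrative function names and made-up coordinates:

```python
import math


def angle_at(a, b, c):
    """Angle in degrees at point b formed by the points a, b, c."""
    u = [p - q for p, q in zip(a, b)]
    v = [p - q for p, q in zip(c, b)]
    cos_theta = sum(x * y for x, y in zip(u, v)) / (math.hypot(*u) * math.hypot(*v))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_theta))))


def chain_angles(coords):
    """Angles at every interior point of a chain of 3D coordinates."""
    return [angle_at(a, b, c) for a, b, c in zip(coords, coords[1:], coords[2:])]


print(angle_at((1, 0, 0), (0, 0, 0), (0, 1, 0)))
```

Grouping by the middle residue type would then amount to collecting each angle under the residue name of point b.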

Q3: Lattice Angles?

Some differences appear for Cα angles around certain amino acids.

Shown: Proline, Phenylalanine, Aspartic acid

Q4: Lattice Regularity?

• Determine average coordinate root mean square deviation (c-RMS) values between the original PDB structure and lattice-approximated structures (over the entire 3704-structure PDB subset)

$$\text{c-RMS} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \lVert a_i - b_i \rVert^2}$$

a_i = coordinates of the lattice vertex corresponding to b_i

b_i = coordinates of residue i in the protein X-ray structure

n = number of residues in the structure
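The c-RMS definition translates directly into code. The sketch below assumes the lattice vertices and the residue coordinates are already paired up and expressed in the same frame; how that pairing is obtained is not described in the slides. The function name and sample points are illustrative.

```python
import math


def c_rms(lattice_points, residue_points):
    """Root mean square of the distances between paired lattice and residue points."""
    if len(lattice_points) != len(residue_points) or not residue_points:
        raise ValueError("need two non-empty point lists of equal length")
    n = len(residue_points)
    total = sum(math.dist(a, b) ** 2 for a, b in zip(lattice_points, residue_points))
    return math.sqrt(total / n)


# Every residue is exactly one unit away from its lattice vertex.
a = [(0, 0, 0), (1, 0, 0)]
b = [(0, 0, 1), (1, 0, 1)]
print(c_rms(a, b))  # prints 1.0
```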

Q4: Lattice Regularity?

• Periodic lattices: cubic and face-centered cubic (FCC)

• Randomized lattices: shift each vertex of a periodic lattice by a random value from a normal (0, 0.0025) distribution, preserving edges

• De novo random lattices: generate random nodes and edges, maintaining the average degree and edge length of the periodic lattices

Q4: Lattice Regularity?

• Average c-RMS values generally increase as the randomization of the lattices increases

Average c-RMS by lattice model:

Lattice   Degree   Periodic lattice   Randomized periodic lattice   De novo random lattice
FCC       12       1.82               1.967                         4.85
Cubic     6        3.11               3.21                          3.96

Answer to Question 4: Periodic lattices achieve a better approximation of protein structure than random lattices of the same degree

Results: Ideal Lattice Attributes

• Uniform edge lengths of 3.8 Å

• Minimum distance between any two vertices of 3.8 Å

• Supporting mainly 90° and 120° angles

• Periodic in structure
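As a quick check of these criteria against one periodic lattice, the sketch below lists the 12 nearest-neighbor directions of the face-centered cubic lattice (all permutations of (±1, ±1, 0), a standard description of FCC), scales them to the 3.8 Å edge length, and collects the angles that occur between two edges at a vertex. Function names are illustrative assumptions.

```python
import math
from itertools import permutations, product

EDGE = 3.8  # target edge length in angstroms


def fcc_neighbor_vectors():
    """The 12 nearest-neighbor directions of FCC: permutations of (+-1, +-1, 0)."""
    vectors = set()
    for s1, s2 in product((1, -1), repeat=2):
        vectors.update(permutations((s1, s2, 0)))
    return sorted(vectors)


def angle_between(u, v):
    """Angle between two vectors, rounded to the nearest degree."""
    cos_theta = sum(a * b for a, b in zip(u, v)) / (math.hypot(*u) * math.hypot(*v))
    return round(math.degrees(math.acos(max(-1.0, min(1.0, cos_theta)))))


vectors = fcc_neighbor_vectors()
scaled = [tuple(EDGE / math.sqrt(2) * c for c in v) for v in vectors]
angles = sorted({angle_between(u, v) for u in vectors for v in vectors if u != v})

print(len(vectors))  # 12
print(angles)        # [60, 90, 120, 180]
```

Every scaled vector has length 3.8 Å, and both 90° and 120° appear among the available edge angles.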

Candidate lattices (space-filling)

Cubic, hexagonal prism, truncated octahedron, cuboctahedron, truncated tetrahedron

Candidate lattices (vector-based)

Face-centered cubic (FCC)

Side+FCC (S+FCC)

Extended FCC (e-FCC)

RMS comparison of lattices

Lattice                  c-RMS    d-RMS    a-RMS
Truncated octahedron     5.3053   3.2479   13.0982
Hexagonal prism          3.8704   2.4312   10.0313
Truncated tetrahedron    3.6913   2.4133   19.9030
Simple cubic             3.1123   2.1081   21.1005
Cuboctahedron            2.5581   1.7427   8.3526
FCC                      1.8212   1.4369   8.3346
S+FCC                    2.1791   1.5819   6.2022
e-FCC                    1.5385   1.1048   2.5700

Angle comparison of lattices

Lattice               Degree   Closeness to 90   Closeness to 120
Trunc. octahedron     4        20                10
Hexagonal prism       5        18                24
Trunc. tetrahedron    6        42                36
Cubic                 6        18                36
Cuboctahedron         8        30                34.29
FCC                   12       30                32.73
S+FCC                 18       28.82             36.47
e-FCC                 42       31.40             38.72

Future

1. Investigate candidate lattices to determine an ideal lattice for inverse protein folding

2. Mathematically prove that the ideal lattice can generate stable sequences for specified protein shapes within the HP model

3. Attempt to assign specific amino acids to lattice sites

Future

4. Investigate protein sequences generated by the model for stability and folding properties.

5. Incorporate other protein folding forces

– Hydrogen bonding

– Van der Waals interactions

– Intrinsic properties (conformational preference)

– Ion pairing

– Disulfide bonds

Questions?
