A Simple Model for Protein Structure, 施奇廷 (Department of Physics, Tunghai University)


The Models

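HP Model:

$$H = \sum_{i<j} E_{p_i p_j}\,\Delta(\mathbf{r}_i - \mathbf{r}_j)$$

where $p_i$ is H or P, and $\Delta(\mathbf{r}_i - \mathbf{r}_j) = 1$ for contacts (lattice neighbors that are not adjacent along the chain) and 0 otherwise.

$E_{HH} = -2.3$, $E_{HP} = -1$, and $E_{PP} = 0$ (Li et al., Science 273, 666)

For the "additive" case: $E_{HH} = -2$, $E_{HP} = -1$, and $E_{PP} = 0$, i.e. $E_{p_i p_j} = -(p_i + p_j)$, where $p_i = 1$ (0) for H (P) residues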
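The contact sum can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names, the coordinate-list representation of a fold, and the 2×2 test fold are all assumptions made for the example. A contact is taken to be a pair of lattice neighbors that are not bonded along the chain.

```python
from itertools import combinations

# Contact energies quoted in the text (Li et al.): E_HH, E_HP, E_PP.
E_LI = {("H", "H"): -2.3, ("H", "P"): -1.0, ("P", "H"): -1.0, ("P", "P"): 0.0}

def li_energy(a, b):
    """Pair energy from the table above."""
    return E_LI[(a, b)]

def additive_energy(a, b):
    """Additive variant: E = -(p_a + p_b) with p = 1 for H and 0 for P."""
    return -float((a == "H") + (b == "H"))

def hp_energy(sequence, coords, pair_energy):
    """Sum pair energies over contacts (lattice neighbors not bonded in the chain)."""
    total = 0.0
    for i, j in combinations(range(len(sequence)), 2):
        if j - i == 1:
            continue  # chain-bonded neighbors are not contacts
        (xi, yi), (xj, yj) = coords[i], coords[j]
        if abs(xi - xj) + abs(yi - yj) == 1:
            total += pair_energy(sequence[i], sequence[j])
    return total

# Hypothetical 2x2 fold of a 4-residue chain: only residues 0 and 3 are in contact.
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
print(hp_energy("HPPH", square, li_energy))        # one H-H contact
print(hp_energy("HPPP", square, additive_energy))  # one H-P contact
```

Passing the pair-energy function as an argument keeps the two parameter sets interchangeable without touching the contact search.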

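HP Model (2nd type):

$$H(\vec{p},\vec{s}) = \tfrac{1}{2}\left(|\vec{s}-\vec{p}|^2 - |\vec{p}|^2 - |\vec{s}|^2\right) = -\vec{p}\cdot\vec{s}$$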

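Finding the Lowest-Energy State

For each amino-acid sequence, place it into every possible conformation, compute its energy, and take the conformation with the lowest energy as its ground state. Note that the ground state must be non-degenerate; otherwise the conformation is unstable and would be eliminated by evolution. For example, on a 4×4 lattice, one sequence is:

HHPHHPHPPPPHHPHH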
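The exhaustive search described here can be sketched as follows. This is an illustrative brute-force version, assuming compact conformations are the directed self-avoiding paths that fill the 4×4 lattice and using the contact energies $E_{HH} = -2.3$, $E_{HP} = -1$, $E_{PP} = 0$ from the HP model slide. The function names are hypothetical, and energies are stored in units of 0.1 so that equal energies compare exactly.

```python
# Contact energies in units of 0.1 (integers, so degeneracy checks are exact).
E = {"HH": -23, "HP": -10, "PH": -10, "PP": 0}

def hamiltonian_paths(n):
    """Yield every directed self-avoiding path that visits all n*n lattice sites."""
    def extend(path, visited):
        if len(path) == n * n:
            yield tuple(path)
            return
        x, y = path[-1]
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= nxt[0] < n and 0 <= nxt[1] < n and nxt not in visited:
                path.append(nxt)
                visited.add(nxt)
                yield from extend(path, visited)
                path.pop()
                visited.remove(nxt)
    for start in ((x, y) for x in range(n) for y in range(n)):
        yield from extend([start], {start})

def energy(seq, path):
    """Sum contact energies over lattice edges that are not chain bonds."""
    index = {site: i for i, site in enumerate(path)}
    total = 0
    for (x, y), i in index.items():
        for nb in ((x + 1, y), (x, y + 1)):  # visit each lattice edge once
            j = index.get(nb)
            if j is not None and abs(i - j) > 1:
                total += E[seq[i] + seq[j]]
    return total

def ground_states(seq, n):
    """Return the minimum energy and all directed paths that reach it."""
    best, winners = None, []
    for path in hamiltonian_paths(n):
        e = energy(seq, path)
        if best is None or e < best:
            best, winners = e, [path]
        elif e == best:
            winners.append(path)
    return best, winners

seq = "HHPHHPHPPPPHHPHH"
e_min, winners = ground_states(seq, 4)
# Every structure shows up as 8 rotated/reflected copies, so divide by 8.
print("ground-state energy:", e_min / 10, "distinct structures:", len(winners) // 8)
```

The ground state counts as non-degenerate when the number of distinct structures printed is 1; the division by 8 relies on the sequence not being a palindrome, which holds for this example.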

HP Model (1st Type)

Second Model: A Mean-Field Approximation

The second model can be viewed as a "mean-field approximation" of the HP model. Lattice sites are divided into two classes: surface (S) and core (C). If a hydrophobic residue sits in the core (out of contact with water), the energy is lowered by one unit. In this approximation a shape can be represented by an N-dimensional vector ($\vec{s}$), with 0 for S and 1 for C, and likewise the amino-acid sequence ($\vec{p}$), with 0 for P and 1 for H.

HP Model (2nd Type)

[Figure: a sequence vector $\vec{p}$ and two structure vectors $\vec{s}_1$, $\vec{s}_2$ on the 4×4 lattice]

(0110111101101000)

(0110000001100000)

(0001110010000000)
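As a worked check of the mean-field energy, the sketch below evaluates $-\vec{p}\cdot\vec{s}$ for the three binary vectors shown on this slide and confirms the Hamming-distance form of the same energy. It assumes the first vector is the sequence and the other two are structures, which is inferred from the counts of 1s (nine versus four, and a 4×4 lattice has four core sites); the helper names are hypothetical.

```python
def bits(text):
    """Turn a string of 0s and 1s into a list of integers."""
    return [int(c) for c in text]

def mean_field_energy(p, s):
    """H = -p.s: one unit of energy gained per H residue on a core site."""
    return -sum(pi * si for pi, si in zip(p, s))

def hamming(p, s):
    """Number of positions where the two binary vectors differ (= |s - p|^2)."""
    return sum(pi != si for pi, si in zip(p, s))

p = bits("0110111101101000")  # assumed to be the sequence vector
structures = ["0110000001100000", "0001110010000000"]  # assumed structure vectors

for text in structures:
    s = bits(text)
    e = mean_field_energy(p, s)
    # For binary vectors |v|^2 is the number of 1s, so
    # -p.s = (|s - p|^2 - |p|^2 - |s|^2) / 2.
    assert 2 * e == hamming(p, s) - sum(p) - sum(s)
    print(text, e)
# prints -4 for the first structure and -2 for the second
```

Since $|\vec{p}|^2$ is fixed by the sequence and $|\vec{s}|^2$ is the same for every compact structure on a given lattice, the lowest energy belongs to the structure closest to $\vec{p}$ in Hamming distance.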

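Designability: there are $2^N$ sequences of length $N$. For each sequence, find its ground-state conformation (excluding sequences whose ground state is degenerate) and count how many times each conformation is chosen as a ground state. That count is the conformation's designability.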

Designability of a given structure:

Number of peptide sequences that choose a particular geometric structure as their non-degenerate ground state.
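The counting procedure can be sketched for the mean-field (2nd type) model on a small lattice. This is an illustration under stated assumptions: a structure is identified with its core/surface string (so paths that share a string are treated as one structure, a simplification), core sites are the interior sites of the n×n lattice, and all function names are hypothetical. A 3×3 lattice is used so the full enumeration runs instantly.

```python
from itertools import product

def hamiltonian_paths(n):
    """Yield every directed self-avoiding path that visits all n*n lattice sites."""
    def extend(path, visited):
        if len(path) == n * n:
            yield tuple(path)
            return
        x, y = path[-1]
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= nxt[0] < n and 0 <= nxt[1] < n and nxt not in visited:
                path.append(nxt)
                visited.add(nxt)
                yield from extend(path, visited)
                path.pop()
                visited.remove(nxt)
    for start in ((x, y) for x in range(n) for y in range(n)):
        yield from extend([start], {start})

def structure_vectors(n):
    """Distinct core/surface strings s (1 = core site) over all compact paths."""
    def is_core(x, y):
        return 0 < x < n - 1 and 0 < y < n - 1
    return sorted({tuple(int(is_core(x, y)) for x, y in path)
                   for path in hamiltonian_paths(n)})

def designability(n):
    """Count, per structure, the sequences that have it as unique ground state."""
    structures = structure_vectors(n)
    counts = dict.fromkeys(structures, 0)
    degenerate = 0
    for p in product((0, 1), repeat=n * n):
        energies = [-sum(a * b for a, b in zip(p, s)) for s in structures]
        e_min = min(energies)
        winners = [s for s, e in zip(structures, energies) if e == e_min]
        if len(winners) == 1:
            counts[winners[0]] += 1
        else:
            degenerate += 1  # sequences with a degenerate ground state are discarded
    return counts, degenerate

counts, degenerate = designability(3)
for s, d in counts.items():
    print("".join(map(str, s)), d)
print("sequences with degenerate ground state:", degenerate)
```

The same function works for larger lattices, but the $2^{N}$ sequence loop grows quickly, so a bit-packed or vectorized version would be the natural next step.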

Geometrical understanding of the HP model (2nd type)

$$H(\vec{p},\vec{s}) = \tfrac{1}{2}\left(|\vec{s}-\vec{p}|^2 - |\vec{p}|^2 - |\vec{s}|^2\right) = -\vec{p}\cdot\vec{s}$$

LS Model: (C. Micheletti et al., PRL 80, 4987)

$$H = -\sum_i z_i\,A\big(z(\sigma_i) - z_i\big)$$

$\sigma_i$ = L (0, large) or S (1, small);

$z(\sigma_i) = 1$ (2) for L (S) residues inside the chain and $z(\sigma_i) = 2$ (3) for L (S) residues at the ends of the chain;

$z_i$ is the number of contacts at site $i$;

$A(x) = 1$ for $x \ge 0$ and $-a$ otherwise ($a > 0$; $a = \infty$ in the Ref.).

In the N×N square lattices:

$$H = -n_s^L - n_s^S - 2n_c^S + 2a\,n_c^L = -n^S + n_o^S - n_c^S + 2a\,n^L - 2a\,n_o^L - (1+2a)\,n_s^L$$

Notation: $n_z$ is the number of residues (L or S) on the $z$-type sites, with $z = o$ ($s$, $c$) for corner (side, core) sites, and $n = \sum_z n_z$.
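The occupation-number form can be checked numerically against a site-by-site sum. The sketch below is one assumed reading, not the authors' derivation: each site contributes $-z_i A(z(\sigma_i) - z_i)$, chain ends are neglected, and a residue inside the chain has 0, 1, or 2 contacts on a corner, side, or core site of a filled lattice. The occupation numbers are hypothetical values for a 6×6 lattice.

```python
Z_MAX = {"L": 1, "S": 2}             # z(sigma) for residues inside the chain
CONTACTS = {"o": 0, "s": 1, "c": 2}  # assumed contacts on corner/side/core sites

def A(x, a):
    """A(x) = 1 for x >= 0 and -a otherwise."""
    return 1.0 if x >= 0 else -a

def site_energy(kind, site, a):
    """Assumed per-site term -z_i * A(z(sigma_i) - z_i)."""
    z = CONTACTS[site]
    return -z * A(Z_MAX[kind] - z, a)

def energy_from_sites(n, a):
    """Sum the per-site term over all occupation numbers n[(kind, site)]."""
    return sum(count * site_energy(kind, site, a) for (kind, site), count in n.items())

def energy_first_form(n, a):
    """H written with side and core occupation numbers only."""
    return -n["L", "s"] - n["S", "s"] - 2 * n["S", "c"] + 2 * a * n["L", "c"]

def energy_second_form(n, a):
    """Same H after eliminating n_s^S and n_c^L with n = sum over z of n_z."""
    n_L = sum(c for (kind, _), c in n.items() if kind == "L")
    n_S = sum(c for (kind, _), c in n.items() if kind == "S")
    return (-n_S + n["S", "o"] - n["S", "c"] + 2 * a * n_L
            - 2 * a * n["L", "o"] - (1 + 2 * a) * n["L", "s"])

# Hypothetical occupation numbers for a 6x6 lattice (4 corner, 16 side, 16 core sites).
n = {("L", "o"): 3, ("L", "s"): 10, ("L", "c"): 2,
     ("S", "o"): 1, ("S", "s"): 6, ("S", "c"): 14}
a = 5.0  # finite penalty; a = infinity would simply forbid L on core sites
print(energy_from_sites(n, a), energy_first_form(n, a), energy_second_form(n, a))
# prints -24.0 three times
```

A finite `a` is used on purpose: with `a = float("inf")` and no L residue on a core site, the product `0 * inf` would evaluate to NaN in floating point.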

For $a \gg 1$ but finite, we get:

[Equation lost in extraction: $H$ expressed through $a$, $n_o^L$, $n_s^L$, and a constant $E_0$.]

For $a = \infty$, L residues are prohibited on the core sites, so $n_c^L = 0$.

The most encodable compact structures of the LS model on the 6×6 lattice. The shape of the structure with the highest score is identical to that found for the HP model.

Geometrical Properties of the 2D Square Lattices

$n_{00}$ ($n_{10}$, $n_{11}$): number of peptide bonds connecting 00 (10, 11) residues. The 1-0 bonds partition the sequence into $n_{10}+1$ segments of contiguous 1's or 0's.

Constraints for N>4:

1. An isolated single 1 may only occur at an end of a path

2. An isolated single 0 may occur only at an end of a path or one 1-segment away from an end of a path

3. Each of the four corners on the lattice belongs to a 0-segment with at least 4 sites, except when the corner is an end of a path

4. For a path (1…1): $2n_{00} + n_{10} = 8N-8$ and $2 \le n_{10} \le 4N-12$

5. (0010011…1): $2n_{00} + n_{10} = 8N-9$ and $5 \le n_{10} \le 4N-11$

6. (0010011…1100100): $2n_{00} + n_{10} = 8N-10$, and $10 \le n_{10} \le 4N-10$ for $N > 6$, $8 \le n_{10} \le 4N-10$ for $N \le 6$

7. (0010011…0) but not 6.: $2n_{00} + n_{10} = 8N-10$ and $4 \le n_{10} \le 4N-12$

8. (0…0) but not 6. and 7.: $2n_{00} + n_{10} = 8N-10$ and $4 \le n_{10} \le 4N-12$

9. (0…1) but not 5.: $2n_{00} + n_{10} = 8N-9$ and $1 \le n_{10} \le 4N-13$
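The bond counts and the sum rules in constraints 4 to 9 can be checked on any structure string with a short sketch. The function names are hypothetical. The sum rule follows from counting bonds at the 0 sites: every 0 inside the chain has two bonds and a 0 at a chain end has one, and an N×N lattice has $4N-4$ surface sites, which gives $8N-8$, $8N-9$, or $8N-10$ for zero, one, or two surface ends.

```python
def bond_counts(s):
    """Return (n00, n10, n11) for a structure string of 0s and 1s."""
    pairs = [a + b for a, b in zip(s, s[1:])]
    n00 = pairs.count("00")
    n11 = pairs.count("11")
    n10 = len(pairs) - n00 - n11  # bonds joining a 0 and a 1, in either order
    return n00, n10, n11

def segments(s):
    """Number of maximal runs of identical symbols; equals n10 + 1."""
    return 1 + sum(a != b for a, b in zip(s, s[1:]))

def surface_bond_sum(s):
    """2*n00 + n10, which counts the bonds attached to 0 (surface) sites."""
    n00, n10, _ = bond_counts(s)
    return 2 * n00 + n10

# 4x4 structure string taken from the HP Model (2nd Type) slide; both ends are 0.
s = "0110000001100000"
N = 4
print(bond_counts(s))       # (9, 4, 2)
print(segments(s))          # 5
print(surface_bond_sum(s))  # 22, equal to 8N - 10
```

The range limits on $n_{10}$ are stated in the list for $N > 4$, so only the sum rule is checked on this 4×4 example.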

Geometrical Properties of the 2D Square Lattices (cont.)

Example:

Constraint 4: (1……1) type

Left: maximum n10=12 and Right: minimum n10=2

Distribution of the Allowed Structures in the Hyperspace

From a combinatorial point of view, more of the possible binary strings with larger $n_{10}$ are disallowed as structures $s$ than of those with smaller $n_{10}$.

Minimal Hamming distance $d_H(s_1, s_2)$ between two paths $s_1$, $s_2$ is approximately $4k$ ($2k$ for triangular lattices) if $n_{10} = 4k$ or $4k-2$:

1. (…01111110…10000001…)→(01111000…10011001…)

2. (…01111110…10000001…)→(01100110…10011001…)

On average, structures $s$ with larger $n_{10}$ have larger designability. The result should also hold for 2D lattices of other shapes.

Comparison with Protein Data Bank

Metric representation of a sequence $p$ with length $l = 2k$:

$$x = \sum_{i=1}^{k} p_{k-i+1}\,2^{-i}; \qquad y = \sum_{i=1}^{k} p_{k+i}\,2^{-i}$$
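A sequence of length $2k$ can be mapped to a point in the unit square as sketched below. The index convention is an assumption made for this example: $x$ reads the first half of the sequence backward from the center and $y$ reads the second half forward, each as a binary fraction, so residues nearest the center carry the largest weight. The function name is hypothetical.

```python
def metric_point(p):
    """Map a 0/1 sequence of even length 2k to a point (x, y) in the unit square."""
    k, remainder = divmod(len(p), 2)
    if remainder:
        raise ValueError("sequence length must be even")
    # x uses p_k, p_{k-1}, ..., p_1 (1-based), i.e. list indices k-1 down to 0.
    x = sum(p[k - i] * 2.0 ** -i for i in range(1, k + 1))
    # y uses p_{k+1}, ..., p_{2k} (1-based), i.e. list indices k up to 2k-1.
    y = sum(p[k + i - 1] * 2.0 ** -i for i in range(1, k + 1))
    return x, y

print(metric_point([0, 1, 1, 0]))  # (0.5, 0.5)
print(metric_point([1, 0, 0, 1]))  # (0.25, 0.25)
print(metric_point([1, 1, 1, 1]))  # (0.75, 0.75)
```

With this convention each of the $2^{2k}$ subsequences lands in its own cell of a $2^k \times 2^k$ grid, which is what allows a frequency distribution to be drawn as an image over the unit square.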

For a set of sequences collected from a model, calculate the frequency distribution of all subsequences of length $2k$ and plot it in the unit square. Then calculate the correlation of the distribution functions:

$$O_{ij}^{(l)} = \sum_{m=1}^{2^l} F_i^{(l)}(m)\,F_j^{(l)}(m)$$

where $F_i^{(l)}(m)$ is the normalized frequency of the $m$th subsequence with length $l$ in the set $i$.
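The correlation between two sequence sets can be sketched as a sum of products of subsequence frequencies. Two assumptions are made here and are not taken from the slides: subsequences are collected with a sliding window of step 1, and "normalized" means each count is divided by the total number of windows in the set. The function names and the toy sequence sets are hypothetical.

```python
from collections import Counter

def subsequence_frequencies(sequences, l):
    """Normalized frequency of every length-l window found in a set of sequences."""
    counts = Counter()
    for seq in sequences:
        for start in range(len(seq) - l + 1):
            counts[seq[start:start + l]] += 1
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()}

def overlap(f_i, f_j):
    """Sum over subsequences m of F_i(m) * F_j(m); absent words contribute zero."""
    return sum(f * f_j.get(word, 0.0) for word, f in f_i.items())

# Toy binary sequence sets (hypothetical data, for illustration only).
set_a = ["0101"]
set_b = ["0011"]
f_a = subsequence_frequencies(set_a, 2)
f_b = subsequence_frequencies(set_b, 2)
print(f_a)                # {'01': 0.666..., '10': 0.333...}
print(overlap(f_a, f_b))  # only '01' is shared: (2/3) * (1/3)
```

Storing only the subsequences that actually occur avoids allocating all $2^l$ entries, which matters at $l = 12$ and beyond.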

Results and Discussion

Average designabilities of the paths vs. $n_{10}$ for the (c) 4×7 and (d) 6×6 lattices, respectively.

The frequencies of all the subsequences of length 12 observed in (a) all proteins in the PDB, (b) the alpha-helix parts of (a), (c) the sequences belonging to the highly designable structures, and (d) the sequences belonging to the low-designability structures of the HP model.

The frequencies of all the subsequences of length 12 observed in (a) all proteins in the PDB and (b) the sequences belonging to highly designable structures of the LS model; (c) normalized frequencies of (a); (d) normalized frequencies of (b).

Summary

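The HP model is the simplest model for studying protein structure; it considers only hydrophobic and hydrophilic interactions.

The study of designability can explain why many different proteins fold into similar shapes.

Highly designable structures have "wrinkled" surfaces → this naturally yields α-helix secondary structure at the surface, in agreement with experiment.

The LS model is mathematically equivalent to the HP model, but its physical meaning is different.

By comparing with real protein sequences and structures, we can judge the relative merits of the different simplified models.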
