รศ · 2019. 9. 5. · NMR Spectrometer (~1-2.5A) Electron Microscope for cryo-EM ~10-15A Three complex techniques X-ray crystallography NMR spectroscopy Cyro-electron microscopy

Assoc. Prof. Dr. Kiattawee Choowongkomon E-mail: [email protected] Phone: 085-555-1480


Assoc. Prof. Kiattawee Choowongkomon, Ph.D.

Department of Biochemistry

Faculty of Science

Kasetsart University

Email: [email protected]

Mobile Phone: 085-555-1480


Central Dogma of Biochemistry

Central Paradigm of Bioinformatics

3 billion base pairs => 6 billion (6 G) letters, and

1 letter => 1 byte

The whole genome can be recorded on just 10 CD-ROMs!
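The storage estimate above is simple arithmetic and can be checked with a short sketch. The 650 MB CD-ROM capacity is an assumption (the slide does not state one), and the constant and function names are chosen here for illustration only.

```python
import math

BASE_PAIRS = 3_000_000_000   # human genome size given on the slide
LETTERS_PER_BASE_PAIR = 2    # the slide counts both letters of each pair
BYTES_PER_LETTER = 1         # 1 letter => 1 byte
CD_ROM_BYTES = 650 * 10**6   # ASSUMPTION: one CD-ROM holds 650 MB


def cd_roms_needed(base_pairs: int = BASE_PAIRS) -> int:
    """Return how many CD-ROMs hold the genome under the assumptions above."""
    total_bytes = base_pairs * LETTERS_PER_BASE_PAIR * BYTES_PER_LETTER
    return math.ceil(total_bytes / CD_ROM_BYTES)


print(cd_roms_needed())
```

With a different assumed capacity the count changes slightly, so the figure of 10 is best read as an order-of-magnitude statement.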

In 2003, the human genome sequence was deciphered!

Genome is the complete set of genes of a living thing.

In 2003, the human genome sequencing was completed.

The human genome contains about 3 billion base pairs.

The number of genes is estimated to be between 20,000 and 25,000.

The difference between the human genome and the chimpanzee genome is only 1.23%!

Biological Diversity

Bacteria

Fruit Fly

Human

Yeast

Escherichia coli Methanococcus jannaschii

Yeast Human

                                E. coli   M. jannaschii   S. cerevisiae   H. sapiens
Characterized experimentally       2046              97            3307        10189
Characterized by similarity        1083            1025            1055        10901
Unknown, conserved                  285             211            1007         2723
Unknown, no similarity              874             411             966         7965

Proteins and Diseases

Proteins and Diseases (2)

Distribution of PDB structure according to diseases

Protein Structure and Function

Proteins are polymers consisting of amino acids linked by covalent peptide bonds.

Native conformation: the folded state in which a protein has biological activity.

Because proteins are complex, their structure is defined in terms of four levels.

Major examples of protein functions

⚫Binding

⚫Catalysis

⚫Molecular switches

⚫Structural components

Protein Function and Architecture

Binding

Catalysis

Switching

Structure

DNA

Protein

Nucleotides

sequence

Gene expression = Protein production

Bioinformatics: storage of biological information

DNA (Genome)

RNA (Transcriptome)

Protein (Proteome)

Computational applications for DNA (genome)

Simple sequence analysis

Database searching

Pairwise analysis

Regulatory analysis

Gene Finding

Whole genome annotation and analysis

Comparative genomic analysis

DNA sequence:

Bases: A, G, T, C

Coding or non-coding sequence?

Does it contain regulatory elements?

Information of DNA (genome)

>gi|109637932:950-1310 Hepatitis B virus (SUBTYPE ADW2), genotype A, complete genome

GAAAACTTCCTGTTAACAGGCCTATTGATTGGAAAGTATGTCAAAGAATTGTGG

GTCTTTTGGGCTTTGCTGCTCCATTTACACAATGTGGATATCCTGCCTTAATGC

CTTTGTATGCCTGTATACAAGCTAAACAGGCTTTCACTTTCTCGCCAACTTACA

AGGCCTTTCTAAGTAAACAGTACATGAACCTTTACCCCGTTGCTCGGCAACGGC

CTGGTCTGTGCCAAGTGTTTGCTGACGCAACCCCCACTGGCTGGGGCTTGGCCA

TAGGCCATCAGCGCATGCGTGGAACCTTTGTGGCTCCTCTGCCGATCCATACTG

CGGAACTCCTAGCCGCTTGTTTTGCTCGCAGCCGGTC
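As a minimal sketch of "simple sequence analysis", the FASTA record above can be split into its header and sequence and the base composition counted. Only the first two sequence lines from the slide are included to keep the example short; the function names are illustrative and not taken from any particular library.

```python
from collections import Counter

# Header and the first two sequence lines copied from the slide.
FASTA_TEXT = """>gi|109637932:950-1310 Hepatitis B virus (SUBTYPE ADW2), genotype A, complete genome
GAAAACTTCCTGTTAACAGGCCTATTGATTGGAAAGTATGTCAAAGAATTGTGG
GTCTTTTGGGCTTTGCTGCTCCATTTACACAATGTGGATATCCTGCCTTAATGC
"""


def parse_fasta(text):
    """Return (header, sequence) for a single-record FASTA string."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("FASTA text must start with a '>' header line")
    return lines[0][1:], "".join(lines[1:]).upper()


def base_composition(sequence):
    """Count each of the four bases A, G, T, C."""
    counts = Counter(sequence)
    return {base: counts.get(base, 0) for base in "AGTC"}


def gc_content(sequence):
    """Fraction of the sequence that is G or C."""
    comp = base_composition(sequence)
    return (comp["G"] + comp["C"]) / len(sequence)


header, seq = parse_fasta(FASTA_TEXT)
print(header)
print(base_composition(seq), round(gc_content(seq), 3))
```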

Information of protein (proteome)

Protein:

20 amino acids

Primary sequence structure

domain binding analysis

Multiple sequence analysis

Secondary and tertiary structure

Active site

Binding analysis

Functional analysis including mutation analysis

"middle surface protein (HepB)"

MQWNSTAFHQALQDPKVRGLYFPAGGSSSGTV

NPAPNIASHISSISARTGDPVTNMENITSGFL

GPLLVLQAGFFLLTRILTIPQSLDSWWTSLNF

LGGSPVCLGQNSQSPTSNHSPTSCPPICPGYR

WMCLRRFIIFLFILLLCLIFLLVLLDYQGMLP

VCPLIPGTTTTSTGPCKTCTTPAQGNSMFPSC

CCTKPSDGNCTCIPIPSSWAFAKYLWEWASV

RFSWLSLLVPFVQWFVGLSPTVWLSAIWMMWY

WGPSLYSIVSPFIPLLPIFFCLWVYI

Structure levels of protein

Primary Structure

Sequential order of amino acids in a polypeptide

Written left to right (N-terminus to C-terminus)

Protein Structure - Primary

Protein: chain of amino acids joined by peptide bonds

Amino Acid

Central carbon (Cα) attached to:

○ Hydrogen (H)

○ Amino group (-NH2)

○ Carboxyl group (-COOH)

○ Side chain (R)

General Amino Acid Structure

[Figure: Cα bonded to H, R, COOH, and H2N]

20 amino acids


Amino Acid R-groups

Polar

Uncharged: Cysteine, Proline, Serine, Glutamine, Asparagine

Charged: Arginine (+), Glutamic acid (-), Aspartic acid (-), Lysine (+), Histidine (+)

Non-Polar

Hydrophobic: Tryptophan, Phenylalanine, Isoleucine, Tyrosine, Leucine, Valine, Methionine

Ambivalent: Glycine, Threonine, Alanine

Linear polymers of proteins

Amino acids are connected by amide bonds, often called peptide bonds

A peptide bond is a covalent bond formed between a carboxyl group and an amino group with the loss of a water molecule
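Because each peptide bond releases one water molecule, the mass of a peptide is the sum of the free amino acid masses minus one water per bond. The sketch below illustrates that bookkeeping; the mass table covers only five amino acids, uses rounded average masses, and the function name is an illustrative assumption.

```python
WATER = 18.015  # g/mol, lost once for every peptide bond formed

# Rounded average masses (g/mol) of a few free amino acids; extend as needed.
FREE_AMINO_ACID_MASS = {
    "G": 75.067,   # glycine
    "A": 89.094,   # alanine
    "S": 105.093,  # serine
    "V": 117.148,  # valine
    "L": 131.175,  # leucine
}


def peptide_mass(sequence):
    """Mass of a linear peptide given in one-letter code."""
    if not sequence:
        raise ValueError("sequence must contain at least one residue")
    total = sum(FREE_AMINO_ACID_MASS[aa] for aa in sequence)
    peptide_bonds = len(sequence) - 1  # n residues are joined by n-1 bonds
    return total - peptide_bonds * WATER


# Ala-Gly: two amino acids, one peptide bond, one water lost.
print(round(peptide_mass("AG"), 3))
```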

Peptide bond

The stability of the peptide bond is due to resonance (delocalization of electrons over several atoms)

Resonance increases the polarity of the peptide bond

○ It also generates a dipole moment

Resonance gives the bond partial double-bond character

○ The peptide group is coplanar and non-rotatable

Primary Structure of Proteins

The amino acid sequence (the primary structure) of a protein determines its three-dimensional structure, which, in turn, determines its properties.

In every protein, the correct three-dimensional structure is needed for correct functioning.

Determining the sequence of amino acids in a protein is a routine, but not trivial, operation in classical biochemistry.

Christian B. Anfinsen: Nobel Prize in Chemistry (1972)

1 KETAAAKFERQHMDSSTSAASSSNYCNQMMKS

RNLTKDRCKPVNTFVHES

LADVQAVCSQKNVACKNGQTNCYQSYSTMSITD

CRETGSSKYPNCAYKTT

QANKHIIVACEGNPYVPVHFDASV 124

Sequence Determines Structure

All of the information necessary for folding the peptide chain into its "native" structure is contained in the primary amino acid structure of the peptide.

A case study of the effects of mutation: Sickle cell anemia

Secondary Structure

Local conformational organization of the polypeptide chain

Backbone Torsion Angles

• Dihedral angle ω (omega): rotation about the peptide bond, namely Cα1-{C-N}-Cα2

• Dihedral angle φ (phi): rotation about the bond between N and Cα

• Dihedral angle ψ (psi): rotation about the bond between Cα and the carbonyl carbon
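Each of these torsion angles is a dihedral angle defined by four consecutive backbone atoms, so all three can be computed with one routine. The following is a minimal sketch using plain tuples for coordinates; the helper names are illustrative, and the atom quadruples listed in the comments follow the usual backbone convention rather than anything stated on the slide.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees about the p1-p2 bond.

    Usual backbone quadruples (stated here as an assumption):
      phi:   C(i-1), N(i),  CA(i),  C(i)
      psi:   N(i),   CA(i), C(i),   N(i+1)
      omega: CA(i),  C(i),  N(i+1), CA(i+1)
    """
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)  # unit vector along the central bond
    # Project b0 and b2 onto the plane perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))


# Four points arranged so the far bond is rotated a quarter turn.
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
```

A cis arrangement gives 0 degrees and a trans arrangement gives 180 degrees (reported as +180 or -180 depending on rounding).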

Ramachandran Plot

White = sterically disallowed conformations (atoms come closer than the sum of their van der Waals radii)

Blue = sterically allowed conformations

Secondary Structure Prediction

One of the first fields to emerge in bioinformatics (~1967)

Grew from the simple observation that certain amino acids or combinations of amino acids seemed to prefer to be in certain secondary structures

Subject of hundreds of papers and dozens of books, many methods…

PSSP Algorithms

There are three generations of PSSP algorithms

• First Generation: based on statistical information about single amino acids

• Second Generation: based on windows (segments) of amino acids. Typically a window contains 11-21 amino acids

• Third Generation: based on the use of windows on evolutionary information

PSSP: First Generation

First generation PSSP systems are based on statistical information on a single amino acid

The most relevant algorithms:

Chou-Fasman, 1974

GOR, 1978

Both algorithms claimed a predictive accuracy of 74-78%, but when tested with better constructed datasets they proved to have a predictive accuracy of ~50% (Nishikawa, 1983)

Chou & Fasman

Determined the frequency of occurrence of each amino acid in helices and sheets.

Calculated from a survey of 15 proteins of known structure.

http://fasta.bioch.virginia.edu/fasta_www2/fasta_www.cgi?rm=misc1

Chou-Fasman parameters

Note: The parameters given in the textbook are 100*Pi

Simplified C-F Algorithm

Select a window of 7 residues

Calculate the average Pa over this window and assign that value to the central residue

Repeat the calculation for Pb and Pc

Slide the window down one residue and repeat until the sequence is complete

Analyze the resulting "plot" and assign to each residue the secondary structure (H, B, C) with the highest value.

Exercise

Predict the secondary structure of the following protein sequence:

Residue   Ala  Pro  Ala  Phe  Ser  Val  Ser  Leu  Ala  Ser  Gly  Ala
Pa        142   57  142  113   77  106   77  121  142   77   57  142
Pb         83   55   83  138   75  170   75  130   83   75   75   83
Pc         66  152   66   60  143   50  143   59   66  143  156   66
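The simplified C-F procedure can be sketched directly on the exercise data. The sketch assumes the three number rows are the helix, beta, and coil parameters (100*P) in that order, and that the 7-residue window is simply truncated at the ends of the sequence, which the slide does not specify. Function names are illustrative.

```python
# Exercise data from the slide (parameters are 100 * P).
SEQUENCE = ["Ala", "Pro", "Ala", "Phe", "Ser", "Val",
            "Ser", "Leu", "Ala", "Ser", "Gly", "Ala"]
P_HELIX = [142, 57, 142, 113, 77, 106, 77, 121, 142, 77, 57, 142]
P_BETA = [83, 55, 83, 138, 75, 170, 75, 130, 83, 75, 75, 83]
P_COIL = [66, 152, 66, 60, 143, 50, 143, 59, 66, 143, 156, 66]


def window_average(values, window=7):
    """Average over a centred window; ASSUMPTION: truncate at the ends."""
    half = window // 2
    averages = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        segment = values[lo:hi]
        averages.append(sum(segment) / len(segment))
    return averages


def predict(p_helix, p_beta, p_coil, window=7):
    """Assign H, B, or C to each residue by the highest windowed average."""
    tracks = {
        "H": window_average(p_helix, window),
        "B": window_average(p_beta, window),
        "C": window_average(p_coil, window),
    }
    states = []
    for i in range(len(p_helix)):
        # On an exact tie, max() keeps the first state in the order H, B, C.
        states.append(max("HBC", key=lambda s: tracks[s][i]))
    return "".join(states)


print(" ".join(SEQUENCE))
print(predict(P_HELIX, P_BETA, P_COIL))
```

The full Chou-Fasman method adds nucleation and extension rules, so this windowed average should be read only as the simplified variant described above.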

Simplified C-F Algorithm

[Figure: helix, beta, and coil profiles plotted along the sequence, residues 10 to 60]

Prediction Performance

[Figure: bar chart of prediction scores (%), axis from 45 to 75, for the methods CF, GOR I, LIM, LEVIN, PTIT, JASEP7, GOR III, ZHANG, and PHD]








Download DNA/protein sequence

Translate DNA to protein

Comparing Protein sequences

Extract information

pI / MW

secondary structure prediction








Tertiary Structure of Proteins

The tertiary structure of a protein is the three-dimensional arrangement of all the atoms in the molecule.

The conformations of the side chains and the positions of any prosthetic groups are parts of the tertiary structure, as is the arrangement of helical and pleated-sheet sections with respect to one another.

Important bonds for protein folding and stability

Disulfide bond: the oxidation of the sulfhydryl groups of two cysteine residues (intramolecular: ribonuclease; intersubunit: dimeric protein insulin)

Hydrogen bond: weak (2-5 kcal/mol vs. covalent: 70-100 kcal/mol), but numerous

Ionic bond: weak (3 kcal/mol), affected by pH value

Dipole molecules attract each other by van der Waals forces (transient and weak: 0.1-0.2 kcal/mol)

Hydrophobic interaction: the tendency of hydrophobic groups or molecules to be excluded from interaction with a hydrophilic environment

Structural Modeling Methods

1. Experimental Methods

X-ray crystallography

NMR spectroscopy

Cryo-electron microscopy

Other Biophysical Methods

2. Computer modeling Methods

Comparative methods

Ab initio methods

X-ray Crystallography

Diffraction pattern

X-ray source Crystal

Intensities

Nuclear Magnetic Resonance (NMR)

Uniformly 15N- and/or 13C-labeled peptides

Pulse sequences

Magnet

NMR Spectra

Analysis, Assignment, and Structure Calculation

Structure

Nuclear Overhauser Effect (NOE)

[Figure: NOEs observed between pairs of hydrogen (H) atoms]

Cryo-Electron Microscope

Single particle image reconstruction

Koning et al. (2003)

Bacteriophage MS2

Cryo-EM density at 7 Å of Adenovirus PIIIA

Three complex techniques

❑ X-ray crystallography: synchrotron radiation X-ray crystallography (~1 Å)

❑ NMR spectroscopy: high-field NMR spectrometer (~1-2.5 Å)

❑ Cryo-electron microscopy: electron microscope for cryo-EM (~10-15 Å)

Structure representation of Proteins

Ribbon mesh Surface contour

Ribbon with sidechain Space-filling model

Protein Data Bank

http://www.rcsb.org/pdb/home/home.do

How to solve the structures

Rate of success to determine the structures

http://targetdb.pdb.org/statistics/TargetStatistics.html

PDB File Format

REMARK FILENAME="/usr/people/nonella/xplor/benchmark1/ALANIN.PDB"

REMARK PARAM11.PRO ( from PARAM6A )

...

REMARK JACS 103:3976-3985 WITH 1-4 RC=1.80/0.1

REMARK DATE:16-Feb-89 11:21:32 created by user: nonella

ATOM 1 CA ACE 1 -2.184 0.591 0.910 1.00 7.00 MAIN

ATOM 2 C ACE 1 -0.665 0.627 0.966 1.00 0.00 MAIN

ATOM 3 O ACE 1 -0.069 1.213 1.868 1.00 0.00 MAIN

...

ATOM 64 N CBX 12 8.610 8.962 9.714 1.00 0.00 MAIN

ATOM 65 H CBX 12 8.050 8.324 9.225 1.00 0.00 MAIN

ATOM 66 CA CBX 12 9.223 8.571 11.014 1.00 0.00 MAIN

END

Example: alanin.pdb in the VMD Distribution

Comments

Coordinate Information

PDB File Format (2)

RTyp  Num  Atm  Res  Ch  ResN      X      Y      Z   Occ   Temp   PDB  Line
ATOM    1  N    ASP  L      1  4.060  7.307  5.186  1.00  51.58  1FDL    93
ATOM    2  CA   ASP  L      1  4.042  7.776  6.553  1.00  48.05  1FDL    94

RTyp: Record Type

Num: Serial number of the atom. Each atom has a unique serial number.

Atm: Atom name (IUPAC format).

Res: Residue name (IUPAC format).

Ch: Chain to which the atom belongs (in this case, L for light chain of an

antibody).

ResN: Residue sequence number.

X, Y, Z: Cartesian coordinates specifying atomic position in space.

Occ: Occupancy factor

Temp: Temperature factor (atoms disordered in the crystal have high

temperature factors).

PDB: The PDB data file unique identifier.

Line: Line (record) number in the data file.

(http://www.umass.edu/microbio/rasmol/pdb.htm)
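The fields listed above occupy fixed character columns in a real PDB file, so an ATOM record is normally read by slicing rather than by splitting on spaces. The sketch below rebuilds the first 1FDL record from the slide's values in fixed-width form and parses it back. The column positions follow the standard PDB ATOM layout as understood here, and the function and key names are illustrative; the trailing identifier and line-number columns are left out.

```python
def format_atom(serial, name, res_name, chain, res_seq, x, y, z, occ, temp):
    """Build a fixed-width ATOM record (columns 1-66 only)."""
    return ("{:<6}{:>5} {:<4}{:1}{:>3} {:1}{:>4}{:1}   "
            "{:8.3f}{:8.3f}{:8.3f}{:6.2f}{:6.2f}").format(
        "ATOM", serial, name, " ", res_name, chain, res_seq, " ",
        x, y, z, occ, temp)


def parse_atom(line):
    """Slice one ATOM record into the fields described on the slide."""
    return {
        "record": line[0:6].strip(),     # RTyp
        "serial": int(line[6:11]),       # Num
        "name": line[12:16].strip(),     # Atm
        "res_name": line[17:20].strip(), # Res
        "chain": line[21],               # Ch
        "res_seq": int(line[22:26]),     # ResN
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]), # Occ
        "temp_factor": float(line[60:66]),  # Temp
    }


# Values taken from the first 1FDL line shown on the slide.
record = format_atom(1, " N", "ASP", "L", 1, 4.060, 7.307, 5.186, 1.00, 51.58)
print(record)
print(parse_atom(record))
```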

How to Represent a Molecule?

Small Molecules

Example: C6H6

Lines, Sticks, Ball & Sticks

Dots, Spacefill, B&S&D

An Example: Alanine Peptide

Lines Sticks (Bonds)

Ribbons Cartoon

Surface Balls & Sticks & Ribbons

More Examples: Proteins

Cytochrome P450cam

Avidin-Biotin Complex

http://www.ks.uiuc.edu/Research/vmd/gallery/

3D Visualization Tools

Web-based tools

JMOL (http://jmol.sourceforge.net/)

CHIME (http://www.mdli.com/)

Free Programs

Rasmol (http://www.OpenRasMol.org/)

Deep Viewer (http://www.expasy.org/spdbv/)

Pymol (http://pymol.sourceforge.net/)

Commercial Programs

Discovery Studio (Insight)

SYBYL

Chime/ JMOL

[ http://www.expasy.org/spdbv/ ]

Deep Viewer (Spdb viewer)

Discovery Studio








RCSB web

Swiss PDB Viewer

Pymol

Discovery Studio


Flow Chart for Protein Modeling

Target sequence (1alc), structure unknown (?):

KQFTKCELSQNLYDIDGYGRIALPELICTMF

HTSGYDTQAIVENDESTEYGLFQISNALWCK

SSQSPQSRNICDITCDKFLDDDITDDIMCAK

KILDIKGIDYWIAHKALCTEKLEQWLCEKE

Template sequence (8lyz):

KVFGRCELAAAMKRHGLDNYRGYSLGNWVCAAK

FESNFNTQATNRNTDGSTDYGILQINSRWWCND

GRTPGSRNLCNIPCSALLSSDITASVNCAKKIV

SDGNGMNAWVAWRNRCKGTDVQAWIRGCRL

The two proteins share a similar sequence (homologous), so 8lyz is used as the template to model 1alc by homology modeling (comparative modeling).

Homology Modeling Steps

1. Search homologous proteins: by sequence search tools such as dot plot or BLAST

2. Alignment (key step): find Structurally Conserved Regions (SCRs) and Structurally Variable Regions (SVRs)

3. Core modeling: copy backbone coordinates from the homologous protein with known structure

4. Loop modeling: search fragment library

5. Side chain modeling: search rotamer library

6. Optimizing model: energy minimization

7. Evaluating model: tools such as WHAT IF, PROCHECK, and Verify3D can be used
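The alignment step can be illustrated with a small global alignment. The sketch below uses the Needleman-Wunsch dynamic-programming algorithm with a deliberately simple scoring scheme (match +1, mismatch -1, gap -2); these scores and the function names are assumptions for illustration, and real homology-modeling pipelines use substitution matrices and more careful gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align two sequences; return the two gapped strings."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(aligned_a, aligned_b):
    """Identical columns as a percentage of the alignment length."""
    same = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return 100.0 * same / len(aligned_a)


# First line of each sequence from the flow chart slide.
top, bottom = needleman_wunsch("KQFTKCELSQNLYDIDGYGRIALPELICTMF",
                               "KVFGRCELAAAMKRHGLDNYRGYSLGNWVCAAK")
print(top)
print(bottom)
print(round(percent_identity(top, bottom), 1))
```

Percent identity has several definitions in common use; dividing by the alignment length, as done here, is only one of them.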

Searching Sequence Homologues in PDB

[Figure: a query sequence is compared against the sequences in the PDB, and the matching entries are returned as Hit #1 and Hit #2]

Why Homology Modeling?

Value in structure based drug design

Find common catalytic sites/molecular recognition sites

Use as a guide to planning and interpreting experiments

There is a 70-80% chance that a target protein has a fold similar to that of a protein whose structure has already been determined by X-ray crystallography or NMR spectroscopy

Sometimes it’s the only option or best guess

Homology Modeling Limitations

Cannot study conformational changes

Cannot find new catalytic/binding sites

Large Bias towards structure of template

Three exceptional cases to keep in mind

Same fold but not similar sequence

○ Myoglobin & hemoglobin, only 20% similarity

Different structures but similar functions

○ subtilisin

Similar sequences but different functions

○ Chymotrypsinogen, trypsinogen and plasminogen

○ 40% homologous

○ 2 active, 1 no activity, cannot explain why

Homology modeling

Web Servers

SWISS-MODEL (http://swissmodel.expasy.org/SWISS-MODEL.html)

CPHmodels (http://www.cbs.dtu.dk/services/CPHmodels/)

ESyPred3D (http://www.fundp.ac.be/sciences/biologie/urbm/bioinfo/esypred/)

3Djigsaw (http://www.bmm.icnet.uk/servers/3djigsaw/)

Geno3D (http://geno3d-pbil.ibcp.fr/)

Free Programs

MODELLER (http://salilab.org/modeller/)

TINKER (http://dasher.wustl.edu/tinker/)

Commercial Programs

COMPOSER (www.tripos.com/data/SYBYL/)

Protein Threading

The word threading implies that one drags the sequence (ACDEFG...) step by step through each location on each template

Protein Threading or Fold Recognition

KQFTKCELSQNLYDIDG

YGRIALPELICTMFHTS

GYDTQAIVENDESTEYG

LFQISNALWCKSSQSPQ

SRNICDITCDKFLDDDI

TDDIMCAKKILDIKGID

YWIAHKALCTEKLEQWL

CEKE

What do we need for protein threading?

Accurate prediction of secondary structure

All (known) protein folding patterns

Scoring function to get the best result

http://zhanglab.ccmb.med.umich.edu/I-TASSER/








3D modeling

Swissmodel

https://swissmodel.expasy.org/interactive/Kw3ETX/models/

https://swissmodel.expasy.org/interactive/RdDYfp/models/

I-tasser

http://zhanglab.ccmb.med.umich.edu/I-TASSER/output/S460254/

https://zhanglab.ccmb.med.umich.edu/I-TASSER/output/S460261/

A Good Protein Structure..

Minimizes disallowed torsion angles

Maximizes number of hydrogen bonds

Maximizes buried hydrophobic ASA

Maximizes exposed hydrophilic ASA

Minimizes interstitial cavities or spaces

A Good Protein Structure..

Minimizes number of "bad" contacts

Minimizes number of buried charges

Minimizes radius of gyration

Minimizes covalent and noncovalent (van der Waals and coulombic) energies
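Of the criteria listed, the radius of gyration is the easiest to compute from coordinates alone. A minimal sketch follows, assuming every atom is given equal weight (a mass-weighted version is also common); the function name is illustrative, and the two example points are the first and last CA coordinates from the alanin.pdb excerpt shown earlier.

```python
import math


def radius_of_gyration(coords):
    """Root-mean-square distance of the points from their centroid.

    ASSUMPTION: all atoms are weighted equally (no atomic masses).
    """
    n = len(coords)
    if n == 0:
        raise ValueError("at least one coordinate is required")
    center = tuple(sum(c[k] for c in coords) / n for k in range(3))
    mean_sq = sum(
        sum((c[k] - center[k]) ** 2 for k in range(3)) for c in coords
    ) / n
    return math.sqrt(mean_sq)


# First and last CA atoms from the alanin.pdb excerpt.
ca_atoms = [(-2.184, 0.591, 0.910), (9.223, 8.571, 11.014)]
print(round(radius_of_gyration(ca_atoms), 3))
```

A compact, well-packed fold gives a smaller value than an extended chain of the same length, which is why the quantity appears in the checklist.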

Error Detections

Geometrical error detection

PROCHECK

WHATCHECK

Error detection using mean-force/statistical approaches

PROVE

PROSA II

ANOLEA

Assessment of Φ,Ψ values: the Ramachandran plot

Ramachandran et al. drew a simple scatterplot of Φ,Ψ for a set of proteins. Because of clashes between backbone atoms (N, Cα, C, O) and the Cβ or other parts of the side chain, only a small part of the Φ,Ψ plot is actually populated.

A useful property is that the "Ramachandran constraints" are not included in refinement programs, which makes the Ramachandran statistics an orthogonal indicator. In addition, there is an excellent empirical correlation between the quality of structural models (as measured by the resolution) and their compliance with the "Ramachandran constraints" (cf. the plot of "what makes a good quality indicator").

[Figure: Ramachandran plot with axes Φ and Ψ]

Example: PROCHECK output

most favored region

allowed region

generously allowed region

disallowed region

Plot 1. Ramachandran plot

phi-psi torsion angles for all

Glycine residues are shown as triangles, as these are not restricted to the regions of the plot

The coloring on the plot represents the different regions:

Red areas = most favourable core

Yellow areas = additional allowed region

Bright Yellow areas = generously allowed region

White area = disallowed region

Less than 2% in disallowed region

Example of Prove Output

The average Z-score of atoms in well-resolved structures tends to be between -0.10 and 0.10.

The Z-score rms of atoms in well-resolved structures tends to be less than 1.0.

Outlier atoms: > 3.0 SD
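The three PROVE criteria can be applied to a list of per-atom Z-scores with a few lines of arithmetic. This is a sketch only: it does not reproduce how PROVE derives the Z-scores, it assumes "Z-score rms" means the root mean square of the scores themselves, and the function and key names are illustrative.

```python
import math


def summarize_z_scores(z_scores, outlier_cutoff=3.0):
    """Apply the slide's rules of thumb to a list of per-atom Z-scores."""
    n = len(z_scores)
    if n == 0:
        raise ValueError("at least one Z-score is required")
    mean = sum(z_scores) / n
    # ASSUMPTION: rms is taken over the raw scores, not deviations from the mean.
    rms = math.sqrt(sum(z * z for z in z_scores) / n)
    outliers = sum(1 for z in z_scores if abs(z) > outlier_cutoff)
    return {
        "mean": mean,
        "rms": rms,
        "outliers": outliers,
        "within_typical_range": -0.10 <= mean <= 0.10 and rms < 1.0,
    }


# Hypothetical scores, not taken from any real structure.
print(summarize_z_scores([0.2, -0.4, 0.1, 3.6, -0.3]))
```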

Summary of Validation Programs

PROCHECK http://www.biochem.ucl.ac.uk/~roman/procheck/procheck.html

WHATCHECK/WHATIF http://swift.cmbi.kun.nl/swift/whatcheck/

PROSA http://www.came.sbg.ac.at/Services/prosa.html

VERIFY3D http://nihserver.mbi.ucla.edu/Verify_3D/

ANOLEA http://swissmodel.expasy.org/anolea/

PROTABLE http://www.tripos.com (Commercial license)

Web services

http://www.jcsg.org/prod/scripts/validation/sv2.cgi

PROCHECK

SFCHECK

PROVE

ERRAT

WASP

DDQ

WHATCHECK
