Assoc. Prof. Kiattawee Choowongkomon, Ph.D.
Department of Biochemistry
Faculty of Science
Kasetsart University
Email: [email protected]
Mobile Phone: 085-555-1480
Central Dogma of Biochemistry
Central Paradigm of Bioinformatics
3 billion base pairs => 6 G letters, and 1 letter => 1 byte, so about 6 GB in total.
The whole genome can be recorded on just 10 CD-ROMs!
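The storage estimate can be checked with a few lines of arithmetic. This is only a rough sketch: the 700 MB disc capacity is an assumption (the slide does not state one), and the 6 G letters are taken to mean both letters of each base pair.

```python
import math

BASE_PAIRS = 3_000_000_000     # human genome size quoted on the slide
LETTERS_PER_BASE_PAIR = 2      # both bases of each pair are counted
BYTES_PER_LETTER = 1           # one character per base
CD_ROM_BYTES = 700 * 10**6     # assumed disc capacity, not from the slide


def genome_bytes(base_pairs=BASE_PAIRS):
    """Bytes needed to store every base as one character."""
    return base_pairs * LETTERS_PER_BASE_PAIR * BYTES_PER_LETTER


def cd_roms_needed(total_bytes, capacity=CD_ROM_BYTES):
    """Number of whole discs needed to hold total_bytes."""
    return math.ceil(total_bytes / capacity)


if __name__ == "__main__":
    size = genome_bytes()
    print(f"{size / 10**9:.0f} GB on {cd_roms_needed(size)} CD-ROMs")
```

With these assumptions the result is 9 discs, in line with the slide's round figure of 10.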
In 2003, the human genome sequence was deciphered!
A genome is the complete set of genes of a living thing.
Human genome sequencing was completed in 2003.
The human genome contains about 3 billion base pairs.
The number of genes is estimated to be between 20,000 and 25,000.
The human and chimpanzee genomes differ by only 1.23%!
Biological Diversity
Bacteria
Fruit Fly
Human
Yeast
Organisms: Escherichia coli, Methanococcus jannaschii, yeast (S. cerevisiae), human (H. sapiens)
                                E. coli   M. jannaschii   S. cerevisiae   H. sapiens
Characterized experimentally       2046              97            3307        10189
Characterized by similarity        1083            1025            1055        10901
Unknown, conserved                  285             211            1007         2723
Unknown, no similarity              874             411             966         7965
Proteins and Diseases
Proteins and Diseases (2)
Distribution of PDB structure according to diseases
Protein Structure and Function
Proteins are polymers consisting of amino acids linked by covalent peptide bonds.
Native conformations: proteins have biological activity only in their native conformation.
Proteins are complex, so they are defined in terms of four levels of structure.
Major examples of protein functions
⚫Binding
⚫Catalysis
⚫Molecular switches
⚫Structural components
Protein Function and Architecture
Binding
Catalysis
Switching
Structure
DNA
Protein
Nucleotides
sequence
Gene expression = Protein production
Bioinformatics: storage of biological information
DNA (Genome)
RNA (Transcriptome)
Protein (Proteome)
Computational applications for DNA (genome)
Simple sequence analysis
Database searching
Pairwise analysis
Regulatory analysis
Gene Finding
Whole genome annotation and analysis
Comparative genomic analysis
DNA sequence:
Bases: A, G, T, C
Coding or non-coding sequence?
Does it contain regulatory elements?
Information of DNA (genome)
>gi|109637932:950-1310 Hepatitis B virus (SUBTYPE
ADW2), genotype A, complete genome
GAAAACTTCCTGTTAACAGGCCTATTGATTGGAAAGTATGTCAAAGAATTGTGG
GTCTTTTGGGCTTTGCTGCTCCATTTACACAATGTGGATATCCTGCCTTAATGC
CTTTGTATGCCTGTATACAAGCTAAACAGGCTTTCACTTTCTCGCCAACTTACA
AGGCCTTTCTAAGTAAACAGTACATGAACCTTTACCCCGTTGCTCGGCAACGGC
CTGGTCTGTGCCAAGTGTTTGCTGACGCAACCCCCACTGGCTGGGGCTTGGCCA
TAGGCCATCAGCGCATGCGTGGAACCTTTGTGGCTCCTCTGCCGATCCATACTG
CGGAACTCCTAGCCGCTTGTTTTGCTCGCAGCCGGTC
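A record like the one above is in FASTA format: one header line starting with ">" followed by sequence lines. The sketch below shows one way to read such a record and count its bases. The function names are illustrative, and it assumes a single record whose header sits on one line (the header above is wrapped only by the slide layout).

```python
from collections import Counter


def parse_fasta(text):
    """Return (header, sequence) for a single-record FASTA string."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("a FASTA record must start with '>'")
    header = lines[0][1:]
    sequence = "".join(lines[1:]).upper()
    return header, sequence


def base_composition(sequence):
    """Count how often each base occurs."""
    return Counter(sequence)


def gc_content(sequence):
    """Fraction of the sequence made up of G and C."""
    if not sequence:
        return 0.0
    counts = base_composition(sequence)
    return (counts["G"] + counts["C"]) / len(sequence)


if __name__ == "__main__":
    # First sequence line of the Hepatitis B fragment shown above.
    record = """>HBV fragment, first line only
GAAAACTTCCTGTTAACAGGCCTATTGATTGGAAAGTATGTCAAAGAATTGTGG
"""
    header, seq = parse_fasta(record)
    print(header, len(seq), dict(base_composition(seq)))
    print("GC content:", round(gc_content(seq), 3))
```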
Information of protein (proteome)
Protein:
20 amino acids
Primary sequence structure
domain binding analysis
Multiple sequence analysis
Secondary and tertiary structure
Active site
Binding analysis
Functional analysis including mutation analysis
"middle surface protein (HepB)"
MQWNSTAFHQALQDPKVRGLYFPAGGSSSGTV
NPAPNIASHISSISARTGDPVTNMENITSGFL
GPLLVLQAGFFLLTRILTIPQSLDSWWTSLNF
LGGSPVCLGQNSQSPTSNHSPTSCPPICPGYR
WMCLRRFIIFLFILLLCLIFLLVLLDYQGMLP
VCPLIPGTTTTSTGPCKTCTTPAQGNSMFPSC
CCTKPSDGNCTCIPIPSSWAFAKYLWEWASV
RFSWLSLLVPFVQWFVGLSPTVWLSAIWMMWY
WGPSLYSIVSPFIPLLPIFFCLWVYI
Structure levels of protein
Primary Structure
Sequential order of amino acids in a polypeptide
Written left to right (N-terminus to C-terminus)
Protein Structure - Primary
Protein: chain of amino acids joined by
peptide bonds
Amino Acid
Central carbon (Cα) attached to:
○ Hydrogen (H)
○ Amino group (-NH2)
○ Carboxyl group (-COOH)
○ Side chain (R)
General Amino Acid Structure
Cα bonded to H, R, H2N (amino group), and COOH (carboxyl group)
20 amino acids
Amino Acid R-groups
Polar
○ Uncharged: Cysteine, Proline, Serine, Glutamine, Asparagine
○ Charged: Arginine (+), Glutamic acid (-), Aspartic acid (-), Lysine (+), Histidine (+)
Non-Polar
○ Hydrophobic: Tryptophan, Phenylalanine, Isoleucine, Tyrosine, Leucine, Valine, Methionine
○ Ambivalent: Glycine, Threonine, Alanine
Linear polymers of proteins
Amino acids are connected by amide bonds, often called peptide bonds.
A peptide bond is a covalent bond formed between a carboxylic acid group and an amino group with the loss of a water molecule.
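Because one water molecule is lost for every peptide bond, the mass of a peptide is the sum of its residue masses plus a single water for the free ends. The sketch below illustrates this. The residue masses are standard average values supplied here for illustration (they are not from the slides), and the function names are hypothetical.

```python
# Average residue masses in daltons (free amino acid minus one water).
# Standard reference values, not taken from the slides.
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594,
    "N": 114.1038, "D": 115.0886, "Q": 128.1307, "K": 128.1741,
    "E": 129.1155, "M": 131.1926, "H": 137.1411, "F": 147.1766,
    "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.01528


def peptide_bonds(sequence):
    """A linear chain of n residues has n - 1 peptide bonds."""
    return max(len(sequence) - 1, 0)


def molecular_weight(sequence):
    """Average molecular weight (Da) of a linear, unmodified peptide."""
    seq = sequence.upper()
    unknown = set(seq) - set(RESIDUE_MASS)
    if unknown:
        raise ValueError(f"unknown residue codes: {sorted(unknown)}")
    if not seq:
        return 0.0
    # Residue masses already exclude the water lost at each bond;
    # add one water back for the terminal H and OH.
    return sum(RESIDUE_MASS[aa] for aa in seq) + WATER_MASS


if __name__ == "__main__":
    for peptide in ("GG", "GAG"):
        print(peptide, peptide_bonds(peptide),
              round(molecular_weight(peptide), 2))
```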
Peptide bond
Stability of peptide bond is due to resonance
(delocalization of electrons over several atoms)
Increase the polarity of the peptide bond
○ Also generate dipole moment
Partial double-bond character
○ coplanar / non-rotatable
Primary Structure of Proteins
The amino acid sequence (the primary structure) of a protein determines its three-dimensional structure, which, in turn, determines its properties.
In every protein, the correct three-dimensional structure is needed for correct functioning.
Determining the sequence of amino acids in a protein is a routine, but not trivial, operation in classical biochemistry.
Christian B. Anfinsen: Nobel Prize in Chemistry (1972)
1 KETAAAKFERQHMDSSTSAASSSNYCNQMMKS
RNLTKDRCKPVNTFVHES
LADVQAVCSQKNVACKNGQTNCYQSYSTMSITD
CRETGSSKYPNCAYKTT
QANKHIIVACEGNPYVPVHFDASV 124
Sequence Determines Structure
All of the information necessary for folding the peptide chain into its "native" structure is contained in the primary amino acid structure of the peptide.
A case study of the effects of
mutation: Sickle cell anemia
Secondary Structure
Organization of the local conformation of the polypeptide chain
Backbone Torsion Angles
• Dihedral angle ω (omega): rotation about the peptide bond, namely Cα1-{C-N}-Cα2
• Dihedral angle φ (phi): rotation about the bond between N and Cα
• Dihedral angle ψ (psi): rotation about the bond between Cα and the carbonyl carbon
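Each of these torsion angles is a dihedral angle defined by four consecutive atoms. A minimal sketch of the calculation is given below, using the common atan2 formulation. The coordinates in the usage example are made-up points chosen for easy checking, not real backbone atoms.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Dihedral angle in degrees for four points, rotation about p1-p2."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1 = _cross(b1, b2)          # normal of the plane p0, p1, p2
    n2 = _cross(b2, b3)          # normal of the plane p1, p2, p3
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


if __name__ == "__main__":
    # Hypothetical points: the last atom is swung around the z axis.
    print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1)))   # cis
    print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))   # quarter turn
    print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1)))  # trans
```

For φ the four atoms would be C(i-1), N, Cα, C; for ψ they would be N, Cα, C, N(i+1); for ω they would be Cα, C, N(i+1), Cα(i+1).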
Ramachandran Plot
White = sterically disallowed conformations (atoms come closer than the sum of their van der Waals radii)
Blue = sterically allowed conformations
Secondary Structure Prediction
One of the first fields to emerge in bioinformatics (~1967)
Grew from the simple observation that certain amino acids or combinations of amino acids seemed to prefer certain secondary structures
Subject of hundreds of papers and dozens of books; many methods...
PSSP Algorithms
There are three generations of PSSP (protein secondary structure prediction) algorithms
• First Generation: based on statistical information on single amino acids
• Second Generation: based on windows (segments) of amino acids. Typically a window contains 11-21 amino acids
• Third Generation: based on the use of windows on evolutionary information
PSSP: First Generation
First-generation PSSP systems are based on statistical information on single amino acids
The most relevant algorithms:
Chou-Fasman, 1974
GOR, 1978
Both algorithms claimed 74-78% predictive accuracy, but when tested on better-constructed datasets their predictive accuracy proved to be ~50% (Nishikawa, 1983)
Chou & Fasman
Determined the frequency of occurrence of each amino acid in helices and sheets.
Calculated from a survey of 15 proteins of known structure.
http://fasta.bioch.virginia.edu/fasta_www2/fasta_www.cgi?rm=misc1
Chou-Fasman parameters
Note: The parameters given in the textbook are 100*Pi
Simplified C-F Algorithm
1. Select a window of 7 residues
2. Calculate the average Pa over this window and assign that value to the central residue
3. Repeat the calculation for Pb and Pc
4. Slide the window down one residue and repeat until the sequence is complete
5. Analyze the resulting "plot" and assign to each residue the secondary structure (H, B, C) with the highest value
Exercise
Predict the secondary structure of the following protein sequence:
     Ala  Pro  Ala  Phe  Ser  Val  Ser  Leu  Ala  Ser  Gly  Ala
Pa   142   57  142  113   77  106   77  121  142   77   57  142
Pb    83   55   83  138   75  170   75  130   83   75   75   83
Pc    66  152   66   60  143   50  143   59   66  143  156   66
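The exercise can be worked through with a short script. This is a sketch of the simplified sliding-window procedure only, not the full Chou-Fasman method. It assumes the three rows of numbers are Pa (helix), Pb (beta), and Pc (coil) in that order, and it shortens the window at the chain ends, a choice the slides do not specify.

```python
# Parameters (100 * P) for the residues in the exercise: (Pa, Pb, Pc).
PARAMS = {
    "Ala": (142, 83, 66),
    "Pro": (57, 55, 152),
    "Phe": (113, 138, 60),
    "Ser": (77, 75, 143),
    "Val": (106, 170, 50),
    "Leu": (121, 130, 59),
    "Gly": (57, 75, 156),
}
STATES = "HBC"  # helix, beta, coil, in the same order as the tuples


def window_averages(residues, center, window=7):
    """Average (Pa, Pb, Pc) over the window centred on one residue."""
    half = window // 2
    start = max(0, center - half)                   # truncate at the N-terminus
    stop = min(len(residues), center + half + 1)    # truncate at the C-terminus
    segment = residues[start:stop]
    return tuple(
        sum(PARAMS[res][k] for res in segment) / len(segment)
        for k in range(3)
    )


def predict(residues, window=7):
    """Assign H, B, or C to each residue by the highest window average."""
    states = []
    for i in range(len(residues)):
        averages = window_averages(residues, i, window)
        # On a tie, the first state in STATES wins.
        states.append(STATES[averages.index(max(averages))])
    return "".join(states)


if __name__ == "__main__":
    sequence = "Ala Pro Ala Phe Ser Val Ser Leu Ala Ser Gly Ala".split()
    for i, res in enumerate(sequence):
        pa, pb, pc = window_averages(sequence, i)
        print(f"{i + 1:2d} {res}  {pa:6.1f} {pb:6.1f} {pc:6.1f}")
    print(predict(sequence))
```

For example, the window centred on Phe (residue 4) covers residues 1 to 7 and averages to Pa = 102.0 and Pb = 97.0, so that residue is assigned H.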
Simplified C-F Algorithm
[Figure: helix, beta, and coil propensity plotted against residue number (10 to 60)]
Prediction Performance
[Figure: bar chart of prediction scores (%), axis from 45 to 75, for the methods CF, GOR I, LIM, LEVIN, PTIT, JASEP7, GOR III, ZHANG, and PHD]
Download DNA/protein sequence
Translate DNA to protein
Comparing Protein sequences
Extract information
pI / MW
secondary structure prediction
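The "Translate DNA to protein" step can be sketched as follows. The codon table is the standard genetic code written in TCAG order; the function name and the choice to stop at the first stop codon by default are illustrative assumptions, and only the first reading frame is translated.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna, to_stop=True):
    """Translate the first reading frame of a DNA (or RNA) sequence."""
    seq = dna.upper().replace("U", "T")
    protein = []
    for i in range(0, len(seq) - 2, 3):
        aa = CODON_TABLE.get(seq[i:i + 3], "X")  # X marks an unknown codon
        if aa == "*" and to_stop:
            break
        protein.append(aa)
    return "".join(protein)


if __name__ == "__main__":
    print(translate("ATGGCCTAA"))
    print(translate("ATGTAAGCC", to_stop=False))
```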
Tertiary Structure of Proteins
The tertiary structure of a protein is the three-
dimensional arrangement of all the atoms in
the molecule.
The conformations of the side chains and the
positions of any prosthetic groups are parts of
the tertiary structure as is the arrangement of
helical and pleated-sheet sections with
respect to one another.
Important bonds for protein folding and stability
• Disulfide bond: oxidation of the sulfhydryl groups of two cysteine residues (intramolecular: ribonuclease; intersubunit: the dimeric protein insulin)
• Hydrogen bond: weak (2-5 kcal/mol vs. 70-100 kcal/mol for a covalent bond), but numerous
• Ionic interaction: weak (3 kcal/mol), affected by pH
• Van der Waals force: dipoles attract each other (transient and weak: 0.1-0.2 kcal/mol)
• Hydrophobic interaction: the tendency of hydrophobic groups or molecules to be excluded from contact with the hydrophilic environment
Structural Modeling Methods
1. Experimental Methods
X-ray crystallography
NMR spectroscopy
Cryo-electron microscopy
Other Biophysical Methods
2. Computer modeling Methods
Comparative methods
Ab initio methods
X-ray Crystallography
X-ray source → Crystal → Diffraction pattern → Intensities
Nuclear Magnetic Resonance (NMR)
Uniformly 15N- and/or 13C-labeled peptides → Pulse sequences (magnet) → NMR spectra → Analysis, assignment, and structure calculation → Structure
Nuclear Overhauser Effect (NOE)
[Figure: NOE contacts between neighbouring hydrogen atoms]
Cryo-Electron Microscope
Single particle image reconstruction
Koning et al. (2003)
Bacteriophage MS2
Cryo-EM density at 7 Å of Adenovirus PIIIA
Three complex techniques
❑ X-ray crystallography: synchrotron radiation (~1 Å)
❑ NMR spectroscopy: high-field NMR spectrometer (~1-2.5 Å)
❑ Cryo-electron microscopy: electron microscope (~10-15 Å)
Structure representation of Proteins
Ribbon, Mesh, Surface contour, Ribbon with sidechain, Space-filling model
Protein Data Bank
http://www.rcsb.org/pdb/home/home.do
How to solve the structures
Rate of success in determining the structures
http://targetdb.pdb.org/statistics/TargetStatistics.html
PDB File Format
REMARK FILENAME="/usr/people/nonella/xplor/benchmark1/ALANIN.PDB"
REMARK PARAM11.PRO ( from PARAM6A )
...
REMARK JACS 103:3976-3985 WITH 1-4 RC=1.80/0.1
REMARK DATE:16-Feb-89 11:21:32 created by user: nonella
ATOM 1 CA ACE 1 -2.184 0.591 0.910 1.00 7.00 MAIN
ATOM 2 C ACE 1 -0.665 0.627 0.966 1.00 0.00 MAIN
ATOM 3 O ACE 1 -0.069 1.213 1.868 1.00 0.00 MAIN
...
ATOM 64 N CBX 12 8.610 8.962 9.714 1.00 0.00 MAIN
ATOM 65 H CBX 12 8.050 8.324 9.225 1.00 0.00 MAIN
ATOM 66 CA CBX 12 9.223 8.571 11.014 1.00 0.00 MAIN
END
Example: alanin.pdb in the VMD distribution
REMARK lines: comments
ATOM lines: coordinate information
PDB File Format (2)
RTyp  Num  Atm  Res  Ch  ResN      X      Y      Z   Occ   Temp   PDB  Line
ATOM    1  N    ASP  L      1  4.060  7.307  5.186  1.00  51.58  1FDL    93
ATOM    2  CA   ASP  L      1  4.042  7.776  6.553  1.00  48.05  1FDL    94
RTyp: Record Type
Num: Serial number of the atom. Each atom has a unique serial number.
Atm: Atom name (IUPAC format).
Res: Residue name (IUPAC format).
Ch: Chain to which the atom belongs (in this case, L for light chain of an
antibody).
ResN: Residue sequence number.
X, Y, Z: Cartesian coordinates specifying atomic position in space.
Occ: Occupancy factor
Temp: Temperature factor (atoms disordered in the crystal have high
temperature factors).
PDB: The PDB data file unique identifier.
Line: Line (record) number in the data file.
(http://www.umass.edu/microbio/rasmol/pdb.htm)
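The fields listed above sit in fixed character columns of each ATOM record, so a parser slices by position rather than splitting on spaces. The sketch below builds one record with that layout and reads it back. The helper names are hypothetical, only the fields described above are handled, and the trailing identifier and line number shown on the slide are ignored.

```python
def format_atom(serial, name, res_name, chain, res_seq, x, y, z, occ, temp):
    """Build an ATOM record using fixed PDB columns (helper for the demo)."""
    # Atom names shorter than four characters start in the second column
    # of the four-character name field.
    name_field = f" {name:<3}" if len(name) < 4 else name
    return (
        f"ATOM  {serial:>5} {name_field} {res_name:>3} {chain}{res_seq:>4}    "
        f"{x:8.3f}{y:8.3f}{z:8.3f}{occ:6.2f}{temp:6.2f}"
    )


def parse_atom(line):
    """Slice one ATOM record into the fields described on the slide."""
    if not line.startswith("ATOM"):
        raise ValueError("not an ATOM record")
    return {
        "record": line[0:6].strip(),        # RTyp
        "serial": int(line[6:11]),          # Num
        "name": line[12:16].strip(),        # Atm
        "res_name": line[17:20].strip(),    # Res
        "chain": line[21],                  # Ch
        "res_seq": int(line[22:26]),        # ResN
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]),    # Occ
        "temp_factor": float(line[60:66]),  # Temp
    }


if __name__ == "__main__":
    # Values of the first ATOM record shown above.
    line = format_atom(1, "N", "ASP", "L", 1, 4.060, 7.307, 5.186, 1.00, 51.58)
    print(line)
    print(parse_atom(line))
```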
How to Represent a Molecule?
Small Molecules
Example: C6H6
Representations: Lines, Sticks, Ball & Sticks, Dots, Spacefill, B&S&D
An Example: Alanine Peptide
Lines Sticks (Bonds)
Ribbons Cartoon
Surface Balls & Sticks & Ribbons
More Examples: Proteins
Cytochrome P450cam
Avidin-Biotin Complex
http://www.ks.uiuc.edu/Research/vmd/gallery/
3D Visualization Tools
Web-based tools
JMOL (http://jmol.sourceforge.net/)
CHIME (http://www.mdli.com/)
Free Programs
Rasmol (http://www.OpenRasMol.org/)
Deep Viewer (http://www.expasy.org/spdbv/)
Pymol (http://pymol.sourceforge.net/)
Commercial Programs
Discovery Studio (Insight)
SYBYL
[Figures: screenshots of Chime/JMOL, Deep Viewer (Spdb viewer, http://www.expasy.org/spdbv/), Pymol, SYBYL, and Discovery Studio]
RCSB web
Swiss PDB Viewer
Pymol
Discovery Studio
Flow Chart for Protein Modeling
Target sequence (1alc), structure unknown (?):
KQFTKCELSQNLYDIDGYGRIALPELICTMF
HTSGYDTQAIVENDESTEYGLFQISNALWCK
SSQSPQSRNICDITCDKFLDDDITDDIMCAK
KILDIKGIDYWIAHKALCTEKLEQWLCEKE
Homology modeling (comparative modeling): use the known structure as a template and model the target
Template sequence (8lyz):
KVFGRCELAAAMKRHGLDNYRGYSLGNWVCAAK
FESNFNTQATNRNTDGSTDYGILQINSRWWCND
GRTPGSRNLCNIPCSALLSSDITASVNCAKKIV
SDGNGMNAWVAWRNRCKGTDVQAWIRGCRL
The two proteins share a similar sequence: homologous
Homology Modeling Steps
1. Search for homologous proteins: use sequence search tools such as dot plot or BLAST
2. Alignment (key step): find Structurally Conserved Regions (SCRs) and Structurally Variable Regions (SVRs)
3. Core modeling: copy backbone coordinates from the homologous protein with known structure
4. Loop modeling: search a fragment library
5. Side chain modeling: search a rotamer library
6. Optimizing the model: energy minimization
7. Evaluating the model: tools such as WHAT IF, PROCHECK, and Verify3D can be used
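Steps 1 and 2 both rest on sequence alignment. The sketch below is a minimal Needleman-Wunsch global alignment with a percent identity measure. The scoring values (match +1, mismatch -1, gap -1) are arbitrary illustrative choices; real tools use substitution matrices such as BLOSUM and affine gap penalties, and this is not how BLAST itself works.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align a and b; return (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(aligned_a, aligned_b):
    """Identical columns as a percentage of the alignment length."""
    if not aligned_a:
        return 0.0
    same = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return 100.0 * same / len(aligned_a)


if __name__ == "__main__":
    # Opening segments of the target and template sequences shown earlier.
    target = "KQFTKCELSQNLYDIDGYGRIALPELICTMF"
    template = "KVFGRCELAAAMKRHGLDNYRGYSLGNWVCAAK"
    s, x, y = needleman_wunsch(target, template)
    print(x)
    print(y)
    print("score:", s, "identity: %.1f%%" % percent_identity(x, y))
```

Percent identity has several conventions; dividing by the alignment length, as here, is only one of them.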
Searching Sequence Homologues in PDB
PRTEINSEQENCEPRTEINSEQUENC
EPRTEINSEQNCEQWERYTRASDFHG
TREWQIYPASDFGHKLMCNASQERWW
PRETWQLKHGFDSADAMNCVCNQWER
GFDHSDASFWERQWK
Query Sequence PDB
[Figure: the query sequence is compared against the sequences stored in the PDB; the matching entries are returned as Hit #1 and Hit #2]
Why Homology Modeling?
Value in structure based drug design
Find common catalytic sites/molecular recognition sites
Use as a guide to planning and interpreting experiments
There is a 70-80% chance that the target protein has a fold similar to a protein whose structure has already been determined by X-ray crystallography or NMR spectroscopy
Sometimes it’s the only option or best guess
Homology Modeling Limitations
Cannot study conformational changes
Cannot find new catalytic/binding sites
Large Bias towards structure of template
Three exceptional cases to keep in mind
• Same fold but not similar sequence
○ Myoglobin & hemoglobin, only 20% similarity
• Different structures but similar functions
○ subtilisin
• Similar sequences but different functions
○ Chymotrypsinogen, trypsinogen and plasminogen
○ 40% homologous
○ 2 active, 1 no activity, cannot explain why
Homology modeling
Web Servers
SWISS-MODEL (http://swissmodel.expasy.org/SWISS-MODEL.html)
CPHmodels (http://www.cbs.dtu.dk/services/CPHmodels/)
ESyPred3D (http://www.fundp.ac.be/sciences/biologie/urbm/bioinfo/esypred/)
3Djigsaw (http://www.bmm.icnet.uk/servers/3djigsaw/)
Geno3D (http://geno3d-pbil.ibcp.fr/)
Free Programs
MODELLER (http://salilab.org/modeller/)
TINKER (http://dasher.wustl.edu/tinker/)
Commercial Programs
COMPOSER (www.tripos.com/data/SYBYL/)
Protein Threading
The word threading implies that one drags the
sequence (ACDEFG...) step by step through each
location on each template
Protein Threading or Fold Recognition
KQFTKCELSQNLYDIDG
YGRIALPELICTMFHTS
GYDTQAIVENDESTEYG
LFQISNALWCKSSQSPQ
SRNICDITCDKFLDDDI
TDDIMCAKKILDIKGID
YWIAHKALCTEKLEQWL
CEKE
What do we need for protein threading?
Accurate prediction of secondary structure
All (known) protein folding patterns
A scoring function to get the best result
http://zhanglab.ccmb.med.umich.edu/I-TASSER/
3D modeling
SWISS-MODEL
https://swissmodel.expasy.org/interactive/Kw3ETX/models/
https://swissmodel.expasy.org/interactive/RdDYfp/models/
I-TASSER
http://zhanglab.ccmb.med.umich.edu/I-TASSER/output/S460254/
https://zhanglab.ccmb.med.umich.edu/I-TASSER/output/S460261/
A Good Protein Structure...
Minimizes disallowed torsion angles
Maximizes number of hydrogen bonds
Maximizes buried hydrophobic ASA (accessible surface area)
Maximizes exposed hydrophilic ASA
Minimizes interstitial cavities or spaces
A Good Protein Structure (2)...
Minimizes number of "bad" contacts
Minimizes number of buried charges
Minimizes radius of gyration
Minimizes covalent and noncovalent (van der Waals and coulombic) energies
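Of the criteria above, the radius of gyration is the simplest to compute from coordinates: it is the root-mean-square distance of the atoms from their centroid. The sketch below assumes all atoms have equal mass, a simplification; the coordinates in the example are made up.

```python
import math


def radius_of_gyration(coords):
    """Root-mean-square distance of points from their centroid.

    coords is a sequence of (x, y, z) tuples; all atoms are given
    equal weight, which is an assumption of this sketch.
    """
    n = len(coords)
    if n == 0:
        raise ValueError("at least one coordinate is required")
    center = [sum(c[k] for c in coords) / n for k in range(3)]
    mean_sq = sum(
        sum((c[k] - center[k]) ** 2 for k in range(3)) for c in coords
    ) / n
    return math.sqrt(mean_sq)


if __name__ == "__main__":
    compact = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0)]
    extended = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0)]
    # The extended arrangement has the larger radius of gyration.
    print(radius_of_gyration(compact), radius_of_gyration(extended))
```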
Error Detections
Geometrical error detection
○ PROCHECK
○ WHATCHECK
Error detection using mean-force/statistical approaches
○ PROVE
○ PROSA II
○ ANOLEA
Assessment of Φ,Ψ values: the Ramachandran plot
Ramachandran et al. drew a simple scatterplot of Φ,Ψ for a set of proteins. Because of clashes between backbone atoms (N, Cα, C, O) and the Cβ or other parts of the side chain, only a small part of the Φ,Ψ plot is actually populated.
The good thing is that the "Ramachandran constraints" are not included in refinement programs, which makes the Ramachandran statistics an orthogonal indicator. In addition, there is an excellent empirical correlation between the quality of structural models (as measured by the resolution) and their compliance with the "Ramachandran constraints" (cf. the plot of "what makes a good quality indicator").
[Figure: Ramachandran plot with axes Φ and Ψ]
Example: PROCHECK output
most favored region
allowed region
generously allowed region
disallowed region
Plot 1. Ramachandran plot
phi-psi torsion angles for all residues
Glycine residues are shown as triangles, as these are not restricted to the regions of the plot
The coloring on the plot represents the different regions:
Red areas = most favourable core regions
Yellow areas = additional allowed regions
Bright yellow areas = generously allowed regions
White area = disallowed region
Less than 2% in disallowed region
Example of Prove Output
The average Z-score of atoms in well-resolved structures tends to be between -0.10 and 0.10.
The Z-score rms of atoms in well-resolved structures tends to be less than 1.0.
Outlier atoms: > 3.0 SD
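The three numbers above are all derived from per-atom Z-scores. The sketch below shows the arithmetic only; it is not PROVE's implementation. The measured values, the reference mean, and the reference standard deviation in the example are hypothetical placeholders.

```python
import math


def z_scores(values, reference_mean, reference_sd):
    """Z-score of each value relative to a reference distribution."""
    return [(v - reference_mean) / reference_sd for v in values]


def summarize(z, outlier_cutoff=3.0):
    """Return (mean Z-score, Z-score rms, indices of outliers)."""
    n = len(z)
    mean = sum(z) / n
    rms = math.sqrt(sum(v * v for v in z) / n)
    outliers = [i for i, v in enumerate(z) if abs(v) > outlier_cutoff]
    return mean, rms, outliers


if __name__ == "__main__":
    # Hypothetical per-atom measurements and reference statistics.
    measured = [10.0, 12.0, 8.0, 20.0]
    z = z_scores(measured, reference_mean=10.0, reference_sd=2.0)
    mean, rms, outliers = summarize(z)
    print(z, round(mean, 2), round(rms, 2), outliers)
```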
Summary of Validation Programs
PROCHECK http://www.biochem.ucl.ac.uk/~roman/procheck/procheck.html
WHATCHECK/WHATIF http://swift.cmbi.kun.nl/swift/whatcheck/
PROSA http://www.came.sbg.ac.at/Services/prosa.html
VERIFY3D http://nihserver.mbi.ucla.edu/Verify_3D/
ANOLEA http://swissmodel.expasy.org/anolea/
PROTABLE http://www.tripos.com (Commercial license)
Web services
http://www.jcsg.org/prod/scripts/validation/sv2.cgi
PROCHECK
SFCHECK
PROVE
ERRAT
WASP
DDQ
WHATCHECK