Proteins


Lecture 10

Protein bioinformatics: Proteomics

Proteomics is the study of the properties of the full range of proteins produced by an organism, using a systematic, system-wide, high-throughput approach.

1. Which proteins does the organism produce? (Gene prediction)

2. Which modifications are made to the protein?

3. What is the secondary structure of these proteins?

4. What is the three-dimensional structure of these proteins? (Structural genomics)

5. What is the function of these proteins? (Functional genomics)

6. What is the expression pattern of these proteins? (Expression pattern)

7. What is their mechanism of action? For example, protein-protein interactions.

8. How are the proteins modified (phosphorylation, glycosylation, etc.)?

9. How are the proteins degraded and recycled?

X-ray crystallography

1. Obtain an ordered protein crystal.

2. Check x-ray diffraction.

3. Analyze diffraction pattern and produce an electron density map.

4. Thread the known protein sequence into the density map.

[Figure label: Tyrosine]

NMR

1. Nuclear Magnetic Resonance measures the radio-frequency absorption of different nuclei in the protein.

2. Measure the distances between different atoms in the protein.

3. Use thousands of distance measurements to construct a single protein model under the constraints.

• X-ray crystallography is the most widely used method.
• NMR is typically applicable to small proteins.
• Quaternary structure of large proteins (ribosomes, virus particles, etc.) can be determined by electron microscopy.

Structural bioinformatics

• Specialization in computational methods for the analysis of protein and DNA structure.

• The ability to "predict" structural information, and thereby save long and expensive experiments.

• The ability to design molecular structures.

ExPASy Proteomics Server http://www.expasy.org/

Swiss-Prot entry file format

www.expasy.ch/tools/

What can be learned from the primary sequence?
- Molecular weight
- pI values
- Hydrophobic/hydrophilic regions
- Transmembrane regions
- Presence of motifs
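As an illustration of one of these sequence-derived properties, here is a minimal sketch of a sliding-window hydropathy profile. It is not the ExPASy implementation. It uses the Kyte-Doolittle hydropathy scale; the window of 19 residues and the threshold of 1.6 are the values commonly quoted for spotting candidate membrane-spanning segments, and both are assumptions that can be changed. The toy sequence is hypothetical.

```python
# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KD = {
    "I": 4.5, "V": 4.2, "L": 3.8, "F": 2.8, "C": 2.5, "M": 1.9, "A": 1.8,
    "G": -0.4, "T": -0.7, "S": -0.8, "W": -0.9, "Y": -1.3, "P": -1.6,
    "H": -3.2, "E": -3.5, "Q": -3.5, "D": -3.5, "N": -3.5, "K": -3.9, "R": -4.5,
}


def hydropathy_profile(seq, window=19):
    """Mean hydropathy of every full window along the sequence."""
    scores = [KD[aa] for aa in seq.upper()]
    return [sum(scores[i:i + window]) / window
            for i in range(len(scores) - window + 1)]


def hydrophobic_windows(seq, window=19, threshold=1.6):
    """0-based start positions of windows whose mean hydropathy exceeds the threshold."""
    profile = hydropathy_profile(seq, window)
    return [i for i, score in enumerate(profile) if score > threshold]


# Hypothetical toy sequence: a poly-leucine stretch flanked by polar residues.
toy = "DKRESDKRES" + "L" * 20 + "DKRESDKRES"
print(hydrophobic_windows(toy))
```

Windows that lie mostly inside the leucine stretch are reported, while the windows at the two ends, which are dominated by charged residues, are not.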


PDB (Protein Data Bank) www.rcsb.org

• Holds 3D models of biological molecules (protein, RNA, DNA).

• Contains almost 26,000 different models. Many entries are redundant; only about 3,000-5,000 unique proteins are actually included.

• Models are arranged in files (accession number, xyz coordinates for each atom).

• PDB is growing fast, but the number of NEW folds discovered is not growing fast.
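To make the "xyz coordinates for each atom" concrete, the following sketch reads atom records from the text of a PDB file using the fixed-column layout of ATOM/HETATM lines. The function names `parse_atoms` and `format_atom_line` are hypothetical helpers written for this example, and the sample atom is made up; a real analysis would normally use an established parser instead.

```python
def parse_atoms(pdb_text):
    """Extract name, residue, chain and xyz from ATOM/HETATM records.

    Slices follow the fixed-column PDB layout (0-based, end-exclusive).
    """
    atoms = []
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM", "HETATM")):
            atoms.append({
                "name": line[12:16].strip(),      # atom name, columns 13-16
                "res_name": line[17:20].strip(),  # residue name, columns 18-20
                "chain": line[21],                # chain identifier, column 22
                "res_seq": int(line[22:26]),      # residue number, columns 23-26
                "xyz": (float(line[30:38]),       # x, columns 31-38
                        float(line[38:46]),       # y, columns 39-46
                        float(line[46:54])),      # z, columns 47-54
            })
    return atoms


def format_atom_line(serial, name, res_name, chain, res_seq, x, y, z):
    """Build a correctly aligned ATOM line (hypothetical helper for the demo)."""
    return "ATOM  %5d %-4s %3s %1s%4d    %8.3f%8.3f%8.3f" % (
        serial, name, res_name, chain, res_seq, x, y, z)


# Made-up atom, only to show the round trip.
sample = format_atom_line(1, "CA", "TYR", "A", 42, 1.5, -2.25, 10.0)
print(parse_atoms(sample))
```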

PDB model

A model defines the 3D positions of atoms in one or more molecules:

A protein, a protein complex, protein and DNA, etc.

The models also include the positions of ligand molecules, solvent molecules, metal ions, etc.

Example – 1D66 (4 chains): GAL4, an S. cerevisiae transcription factor.

[Figure labels: homodimer, DNA, cadmium ion]

http://www.rcsb.org/pdb/cgi/explore.cgi?&pdbId=1D66

Chime exe for Windows: /scratch/BT05/ChimePlugIn/

Terminology

• Motif – a group of packed secondary structures.

• Domain – A fundamental unit of the tertiary structure. The domain can fold independently and perform its function independently. Domains are made of one or more motifs.

[Figure: Beta-Alpha-Beta motif]

• Secondary structure: loop, beta strand, alpha helix.

Website: proteinexplorer.org. Watch the demo there!

Visualizations

• View by chain
• View by structure
• View by amino acid group (hydrophobic)
• View by solvent accessible surface
• View by surface and electrostatic potential

We can also examine the active site in detail.

Structural alignment

• Structural alignment is considered more accurate than sequence alignment since, during evolution, the structure is more conserved than the sequence.

• Flexprot: pairwise alignment (rigid alignment or flexible alignment).

• Input: PDB files
• Output: possible alignments

• The results can be viewed with Protein Explorer.

Default parameters

• Structural alignments were used to find the default parameters for the matrices of sequence alignment (PAM, BLOSUM).

[Flowchart] Compare the "true" alignment (structural alignment) with a possible alignment (sequence alignment with different parameters). Are they equal? Yes: these become the default parameters. No: try a new set of parameters.

Relationships between MSA and structure

• Aligning similar proteins reveals conserved residues that are important for function or structure (recall: IPNS example).

ConSurf: working process

1. Input a protein with a known 3D structure (PDB ID or a file provided by the user)
2. Find homologous protein sequences (PSI-BLAST)
3. Perform multiple sequence alignment (removing duplicates)
4. Construct an evolutionary tree
5. Calculate the mutation rate for each site
6. Present the calculations by coloring the 3D structure

http://consurf.tau.ac.il/


ConSurf: why is phylogeny important?

Site           1 2 3 4 5 6 7
Human          D M M A H M M
Chimp          D E M A G G C
S. cerevisiae  D D G A F M A
S. pombe       D D G A L G E

Site 3 on the tree: Human (M), Chimp (M), S. cerevisiae (G), S. pombe (G)

Site 6 on the tree: Human (M), Chimp (G), S. cerevisiae (M), S. pombe (G)

Example: Which of the two sites is more conserved? Or: which site evolves faster?
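One way to answer the question is to count the minimum number of substitutions each site needs on the tree. The sketch below uses Fitch's small-parsimony algorithm for that count. It is an illustration only, not ConSurf's actual rate calculation, and the tree topology (Human with Chimp, S. cerevisiae with S. pombe) is an assumption, since the slide's tree figure is not reproduced in the text.

```python
def fitch(tree, states):
    """Return (candidate ancestral states, minimum number of changes) for one site.

    tree is a nested 2-tuple of leaf names; states maps leaf name -> residue.
    """
    if isinstance(tree, str):                 # leaf: its observed residue, no change
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    common = left_set & right_set
    if common:                                # children agree: no extra change
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1


# Assumed topology: the two mammals are sisters, and so are the two yeasts.
TREE = (("Human", "Chimp"), ("S. cerevisiae", "S. pombe"))

site_3 = {"Human": "M", "Chimp": "M", "S. cerevisiae": "G", "S. pombe": "G"}
site_6 = {"Human": "M", "Chimp": "G", "S. cerevisiae": "M", "S. pombe": "G"}

print("site 3 changes:", fitch(TREE, site_3)[1])
print("site 6 changes:", fitch(TREE, site_6)[1])
```

Both columns contain two M and two G, so a count that ignores the tree cannot tell them apart. On the assumed tree, site 3 needs one change and site 6 needs two, which suggests that site 6 evolves faster.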

Protein structure classification

Why classify proteins?
• The number of solved structures grows rapidly
• Generate an overview of structure types
• Detect similarities (evolutionary relationships)
• Build a model of a protein based on proteins from the same class
• Set up prediction benchmarks

When are two structures similar?
• RMS above 6 Å: not related. RMS of 3-6 Å: related. RMS below 3 Å: similar.
• Two structures are of the same fold if they have RMS < 3 Å over 70% of their length.
• Use the RMS (root mean square) measure for the superposition of corresponding residues.
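The RMS measure can be sketched as follows. The code assumes the two structures have already been superposed and that residues correspond one to one; finding the optimal superposition is a separate step that is not shown. The function names and the toy coordinates are hypothetical, and the category boundaries simply restate the thresholds given above.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square deviation between corresponding (x, y, z) points."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum((xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
                  for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


def classify(rms):
    """Map an RMS value (in Angstrom) to the categories used in the text."""
    if rms < 3.0:
        return "similar"
    if rms <= 6.0:
        return "related"
    return "not related"


# Toy alpha-carbon traces: the second is the first shifted by 1 Angstrom along x.
trace_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
trace_b = [(x + 1.0, y, z) for x, y, z in trace_a]
value = rmsd(trace_a, trace_b)
print(value, classify(value))
```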

Classification schemes

• SCOP

– Manual classification (A. Murzin)

http://scop.mrc-lmb.cam.ac.uk/scop/

• CATH

– Semi-manual classification (C. Orengo)

• FSSP

– Automatic classification (L. Holm)

5 classes

Major classes in SCOP

Classes
– All alpha proteins
– All beta proteins
– Alpha and beta proteins (a/b)
– Alpha and beta proteins (a+b)
– Multi-domain proteins
– Membrane and cell surface proteins
– Small proteins

Examples
All alpha: Hemoglobin (1bab)
All beta: Immunoglobulin (8fab)
Alpha/beta: (1hti)
Alpha+beta: Lysozyme (1jsf)

Folds - Proteins which have the same (>~50%) secondary structure elements arranged in the same order in the protein chain and in three dimensions are classified as having the same fold.

Superfamilies - A superfamily contains proteins which are thought to be evolutionarily related due to (*) sequence, (*) function, and (*) special structural features.

• Relationships between members of a superfamily may not be readily recognizable from the sequence alone.

Families - A family contains members whose relationship is readily recognizable from the sequence (>~25% sequence identity).

• Families are further subdivided into Proteins.
• Proteins are divided into Species.

– The same protein may be found in several species

Families

Folds: how many?
• Chothia (1992): approx. 1,000 folds
• Estimates vary from 1,000 to 10,000
• With 100,000 human proteins, ~20 genes per fold on average (that is, taking about 5,000 folds)

Chothia, C., Proteins. One thousand families for the molecular biologist. Nature, 1992. 357(6379): p. 543-4.

Zhang, C. and C. DeLisi, Estimating the number of protein folds. J Mol Biol, 1998. 284(5): p. 1301-5.

Distribution of protein seqs. among protein families

http://bioinfo.mbb.yale.edu/lectures/t2000/talk.pdf

Finding domains in your protein

Only a fraction of the domains have been experimentally characterized. If you find such a domain within your protein, you may get information such as active site, fold, protein modification, interactions with protein/DNA molecules, and ligands.

Collections of domains: each collection has pros and cons, using different modeling and curators.

"Curated": the families are created semi-automatically based on expert knowledge, sequence similarity, and other databases.

The major domain servers enable looking simultaneously at a few domain collections:
• InterPro: search in Prosite, Prints, ProDom and Pfam.
• CD-Search: search in Pfam, CDD.
• Pfscan: search in the most updated PROSITE.

Collection   Information                                        Curation
Prosite      Regular expressions: patterns (+ profiles)         curated
Prints       Aligned ungapped unweighted motifs: fingerprints   curated
BLOCKS       Aligned ungapped weighted motifs: BLOCKS           automated
Profiles     Weighted matrices: profiles                        curated
Pfam         Hidden Markov Models                               curated

Prosite http://www.expasy.ch/prosite

• Characterization of protein families by conserved motifs (called patterns) observed in multiple sequence alignments of known homologues.

• Each family is defined by a single motif.

• A pattern is a way of describing a conserved sequence using regular expressions.

• Entries are divided into two files:
– Pattern/profile file: the pattern and all SwissProt matches.
– Documentation file: details of the characterized family, a description of the biological role of the chosen motif, references.

Example: [AC]-x-V-x(4)-{ED}. [Ala or Cys]-any-Val-any-any-any-any-{any but Glu or Asp}
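Because a PROSITE pattern is essentially a regular expression in a different notation, it can be translated mechanically. The sketch below covers only the elements used in the example above (residue sets in square brackets, `x` for any residue, repeat counts in parentheses, excluded residues in curly braces); the N- and C-terminal anchors of the full syntax are deliberately left out. The function name `prosite_to_regex` and the two test peptides are hypothetical.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern into a Python regular expression."""
    regex_parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        # Split an element such as "x(4)" or "x(2,4)" into core and repeat counts.
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.groups()
        if core == "x":                                  # any residue
            part = "."
        elif core.startswith("{") and core.endswith("}"):
            part = "[^" + core[1:-1] + "]"               # any residue except these
        else:
            part = core                                  # single residue or [set]
        if low and high:
            part += "{" + low + "," + high + "}"
        elif low:
            part += "{" + low + "}"
        regex_parts.append(part)
    return "".join(regex_parts)


regex = prosite_to_regex("[AC]-x-V-x(4)-{ED}.")
print(regex)
print(bool(re.search(regex, "AAVGGGGK")), bool(re.search(regex, "AAVGGGGE")))
```

The first peptide ends in K and matches; the second ends in E, which the last pattern element excludes.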

Modeling domain - example 1

Multiple alignment:
A A C T T G
A A G T C G
C A C T T C

Consensus:
A A C T T G

Consensus with N at the variable positions:
N A N T N N

Pattern:
[AC]-A-[GC]-T-[TC]-[GC]

Profile (frequency of each base per position):
     1     2     3     4     5
A    0.66  1     0     0     .
T    0     0     0     1     .
C    0.33  0     0.66  0     .
G    0     0     0.33  0     .
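The four representations in this example can all be derived from the same column counts. The following sketch does so for the three aligned sequences of the slide; the function names are hypothetical, and the letters inside each bracket of the pattern are sorted alphabetically, so their order may differ from the slide.

```python
from collections import Counter

ALIGNMENT = ["AACTTG", "AAGTCG", "CACTTC"]


def column_counts(alignment):
    """One Counter of residues per alignment column."""
    return [Counter(column) for column in zip(*alignment)]


def consensus(alignment):
    """Most frequent residue in each column."""
    return "".join(c.most_common(1)[0][0] for c in column_counts(alignment))


def masked_consensus(alignment):
    """Residue where the column is invariant, N where it varies."""
    return "".join(next(iter(c)) if len(c) == 1 else "N"
                   for c in column_counts(alignment))


def pattern(alignment):
    """Every residue observed in each column, as a bracketed set."""
    parts = []
    for c in column_counts(alignment):
        letters = "".join(sorted(c))
        parts.append(letters if len(letters) == 1 else "[" + letters + "]")
    return "-".join(parts)


def profile(alignment):
    """Relative frequency of each residue in each column."""
    n = len(alignment)
    return [{res: count / n for res, count in c.items()}
            for c in column_counts(alignment)]


print(consensus(ALIGNMENT))
print(masked_consensus(ALIGNMENT))
print(pattern(ALIGNMENT))
print(profile(ALIGNMENT)[0])
```

Going from consensus to pattern to profile keeps progressively more of the information in the alignment: the consensus keeps one residue per column, the pattern keeps which residues occur, and the profile also keeps how often.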

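Modeling domains

• What is a "good" model?

• How do we measure the success of a pattern?

          Positive (hit)    Negative
True      True-positive     True-negative
False     False-positive    False-negative

(Rows: whether the sequence is a true or a false domain. Columns: whether the model reports a hit.)

Sensitivity = TP / true (the fraction of true domains that are hit)
Specificity = TP / positives (the fraction of hits that are true domains; elsewhere this quantity is often called precision)

Modeling is a compromise between sensitivity and specificity (we want many true positives and few false positives).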

Modeling domain - example 2

Training set:
AGACT
AGTCT
ACAGT
ACTGT
AGTCT

Genome, additional true domains:
AGACT
AGTCT
ACAGT
ACTGT

Genome, false domains:
ACTCT
AGTGT
ACACT
AGAGT

Consensus representation:
AGTCT

0 mismatches search:
• 1 true positive
• 3 true negatives
• No false positives
High specificity (1), low sensitivity (1/4)

1 mismatch search:
• 2 true positives
• 2 true negatives
• 2 false positives
Specificity = 1/2, sensitivity = 1/2

Sensitivity = TP / True
Specificity = TP / Predicted
[Figure labels: True, Predicted, TP]
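The counts for the consensus search can be reproduced with a few lines of code. This sketch scores the genome sequences listed above against the consensus AGTCT with a given number of allowed mismatches, and uses the slide's own definitions (sensitivity = TP / true, specificity = TP / predicted). The function names are hypothetical.

```python
TRUE_SITES = ["AGACT", "AGTCT", "ACAGT", "ACTGT"]    # additional true domains
FALSE_SITES = ["ACTCT", "AGTGT", "ACACT", "AGAGT"]   # false domains


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def evaluate(consensus, max_mismatches):
    """Return (TP, FP, sensitivity, specificity) for a consensus search."""
    tp = sum(hamming(consensus, s) <= max_mismatches for s in TRUE_SITES)
    fp = sum(hamming(consensus, s) <= max_mismatches for s in FALSE_SITES)
    sensitivity = tp / len(TRUE_SITES)                # TP / True
    specificity = tp / (tp + fp) if tp + fp else 0.0  # TP / Predicted
    return tp, fp, sensitivity, specificity


for allowed in (0, 1):
    print(allowed, evaluate("AGTCT", allowed))
```

Allowing one mismatch finds one more true domain but also lets in two false ones, which is the sensitivity and specificity trade-off described above.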

Modeling domain - example 2

Training set:
AGACT
AGTCT
ACAGT
ACTGT
AGTCT

Genome, additional binding sites:
AGACT
AGTCT
ACAGT
ACTGT

Genome, non binding sites:
ACTCT
AGTGT
ACACT
AGAGT

PSSM:
     1    2    3    4    5
A    1    0    2/5  0    0
G    0    3/5  0    2/5  0
C    0    2/5  0    3/5  0
T    0    0    3/5  0    1

High probability search:
• 4 true positives
• 0 true negatives
• 4 false positives
Specificity = 1/2, high sensitivity! (1)

Sensitivity = TP / True
Specificity = TP / Predicted

If the model is more detailed: higher sensitivity, lower specificity.
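The PSSM search can be sketched in the same way. The matrix below is built as plain column frequencies from the five training sequences, and a genome sequence counts as a hit when its probability under the matrix is above a threshold. The threshold of 0 (any sequence the matrix can generate) is an assumption chosen to reproduce the counts on the slide; the function names are hypothetical.

```python
from collections import Counter

TRAINING = ["AGACT", "AGTCT", "ACAGT", "ACTGT", "AGTCT"]
TRUE_SITES = ["AGACT", "AGTCT", "ACAGT", "ACTGT"]    # additional binding sites
FALSE_SITES = ["ACTCT", "AGTGT", "ACACT", "AGAGT"]   # non binding sites


def build_pssm(sequences):
    """Per-position base frequencies estimated from aligned sequences."""
    n = len(sequences)
    return [{base: count / n for base, count in Counter(column).items()}
            for column in zip(*sequences)]


def probability(pssm, sequence):
    """Product of the per-position frequencies of the sequence's bases."""
    p = 1.0
    for column, base in zip(pssm, sequence):
        p *= column.get(base, 0.0)
    return p


def evaluate(pssm, threshold=0.0):
    """Return (TP, FP, sensitivity, specificity) using the slide's definitions."""
    tp = sum(probability(pssm, s) > threshold for s in TRUE_SITES)
    fp = sum(probability(pssm, s) > threshold for s in FALSE_SITES)
    sensitivity = tp / len(TRUE_SITES)                # TP / True
    specificity = tp / (tp + fp) if tp + fp else 0.0  # TP / Predicted
    return tp, fp, sensitivity, specificity


pssm = build_pssm(TRAINING)
print(pssm[1])          # position 2: frequencies of G and C
print(evaluate(pssm))
```

Because the matrix treats positions independently, it cannot represent the pairing of G at position 2 with C at position 4 (and C with G) seen in the training set, so it accepts all four non binding sites as well.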

Blocks http://www.blocks.fhcrc.org

• Blocks are MSAs corresponding to the most highly conserved regions of proteins.
• Families are taken from InterPro.
• BLOCKS are created by automatically detecting the most highly conserved regions of each protein family.
• Blocks are alignments 5-200 aa long.
• A family is characterized by a group of common blocks.

Prints http://www.bioinf.man.ac.uk/ddbrowser/PRINTS/

• Fingerprints are made up of several conserved motifs.

Pfam http://www.sanger.ac.uk/Software/Pfam

• Pfam contains MSAs and HMM profiles of complete protein domains.

ftp://ftp.genetics.wustl.edu/pub/eddy/papers/hmmreview-bioinformatics-98.pdf

Searching InterPro http://www.ebi.ac.uk/interpro/

Searching CD-Search (rpsblast)

How is the secondary structure determined when the three-dimensional structure is known?

Secondary structure is not "measured" experimentally. Although it is usually easy to locate secondary structures within the three-dimensional structure, the assignment is not always unambiguous. The problem is especially severe at the ends of the structures.

The DSSP code (www.sander.ebi.ac.uk/dssp/):
H = alpha helix
B = residue in isolated beta-bridge
E = extended strand, participates in beta ladder
G = 3-helix (3/10 helix)
I = 5-helix (pi helix)
T = hydrogen-bonded turn
S = bend

Kabsch and Sander, Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 1983. 22: p. 2577-2637.

DSSP is a database of secondary structure assignments (and much more) for all protein entries in the Protein Data Bank (PDB). DSSP is also the program that calculates DSSP entries from PDB entries.

How is the secondary structure predicted when the three-dimensional structure is not known?

• The basic idea: different amino acids turn out to have different propensities to occur in each secondary structure.
• Secondary structures have a typical length, so the prediction has to be consistent over a stretch of residues.
• Periodicity of hydrophobic / hydrophilic amino acids.
• The presence of small polar amino acids (A, S, T), and especially G, is characteristic of turns.
• Proline has a special main chain and therefore creates a kink. It is very unsuitable, for example, for the middle of helices, and suitable for the N-terminal end of helices.

Programs: psa, sscp, sosui, PHD, PRIDCT

http://www.embl-heidelberg.de/predictprotein/predictprotein.html

Input: sequence

Output: secondary structure prediction, globular regions, coiled-coil regions, transmembrane helices, PROSITE motifs, bound cysteines…

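How is prediction success measured?

The commonly used measure is Q3: the percentage of residues whose state was predicted correctly, out of all residues.

Sequence:        MWHSGAVTTYPNKLYTREADSGGYVSAVL
Prediction:      THHHHHTTTEEEETTTEEEEETTTEEEET
Real assignment: TTHHHHHHTTEEETTTEEEETTHHHHHTT

Q3 is 18/29 ≈ 0.62 (18 of the 29 residues shown are predicted correctly).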
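Q3 is simple to compute once the predicted and the real assignment are written as strings of equal length. The sketch below does this for the two strings of the example as they appear in the text, which are 29 residues long; the function name `q3` is hypothetical.

```python
def q3(predicted, observed):
    """Return (matches, length, fraction) for two per-residue state strings."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("need two non-empty strings of equal length")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches, len(observed), matches / len(observed)


prediction = "THHHHHTTTEEEETTTEEEEETTTEEEET"
real = "TTHHHHHHTTEEETTTEEEETTHHHHHTT"

matches, length, fraction = q3(prediction, real)
print(matches, length, round(fraction, 2))
```

For these strings the function counts 18 matching positions out of 29. Note that Q3 treats every residue equally, so a prediction can score reasonably well while still missing a whole element, as happens here with the last helix.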

Prediction accuracy

Authors         Year   % accuracy   Method
Chou-Fasman     1974   50%          propensities of aa's in secondary structures
Garnier         1978   62%          interactions between aa's
Levin           1993   69%          multiple seq. alignments (MSA)
Rost & Sander   1994   72%          neural networks + MSA

In recent years the success of predictions has been tested as part of the CASP competitions, which objectively assess the ability to predict secondary structures and three-dimensional structure. The accuracy of secondary structure prediction currently stands at 75-78%.

What is the limit of achievable prediction?

The following points should be taken into account:
• There is a limit to the extent to which the local sequence determines the three-dimensional structure.
• Proteins contain many regions that have no stable secondary structure.
• There is a problem in defining the structures unambiguously, especially at their ends.
• As a result, it is commonly assumed that the limit of achievable prediction is about 90%.

What do we need this prediction for?

Usually, secondary structure prediction has no importance on its own, but it is considered an important step in other predictions.

1. For a more successful multiple sequence alignment (note the circular argument), in order to highlight the shared regions, which are usually the functionally important regions.

2. As a first step in Modeling: predicting the three-dimensional structure based on similarity to the known structure of a protein with a similar sequence.

3. As a first step in tertiary structure prediction: first we predict the secondary structure, and then we decide how to arrange these elements in order to determine the three-dimensional structure.

Three-dimensional structure prediction

Direct methods
• Energy minimization calculations
• Molecular dynamics
• Secondary structure prediction, followed by assembly of the secondary elements into a three-dimensional structure.
• Prediction of local structures for short sequences, keeping several options for each sequence; joining the short structures into full structures in different ways (a process that generates tens of thousands of possible structures); and filtering the structures using energy functions to obtain a small number of final predictions.

Bioinformatic methods

The guiding principle: there are many more sequences than structures (a few thousand structures). Therefore there is a reasonable chance that the desired structure is already known.

If there is high similarity (above 25-30%) to a protein of known structure, it can be used as a basis for the new protein (Homology modeling).

If the similarity is low (the twilight zone, 15-25%), profile methods and Threading should be used. Here the goal is not to obtain an accurate structure but only to identify the fold!

Homology modeling

Comparative protein modelling via SwissModel

• Proteins from different sources can have similar sequences, and it is generally accepted that high sequence similarity is reflected by distinct structural similarity. Indeed, the root mean square deviation (rmsd) of the alpha-carbon coordinates for protein cores sharing 50% residue identity is expected to be around 1Å. This fact served as the premise for the development of comparative protein modeling, which is presently the most reliable method. Comparative model building consists of extrapolating the structure for a new (target) sequence from the known 3D structure of related family members (templates).

• Step 1: Identification of modeling templates. We need at least one sequence of known 3D structure with significant similarity to the target sequence. We run BLAST with the target sequence against a database of sequences derived from the PDB. Hits scoring below 10^-5 in BLAST are considered for the model building procedure.
> 50% similarity: good predictions.
> 25% similarity: less reliable.

• Step 2: Now we generate a structurally corrected multiple sequence alignment of the hits.

http://www.expasy.org/swissmod/SWISS-MODEL.html
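The template selection rule of Step 1 can be written as a small filter. This is only a sketch of the thresholds quoted above, not SWISS-MODEL's code: it assumes the BLAST value being compared with 10^-5 is available as `e_value`, and the hit names and numbers in the example are made up.

```python
def rate_template(e_value, percent_similarity):
    """Classify a BLAST hit as a modeling template using the quoted thresholds."""
    if e_value >= 1e-5:
        return "rejected"            # not significant enough to be considered
    if percent_similarity > 50:
        return "good"                # > 50% similarity: good predictions
    if percent_similarity > 25:
        return "less reliable"       # > 25% similarity: less reliable
    return "below the stated range"  # the text gives no guidance here


# Hypothetical hits: (template name, BLAST value, percent similarity).
hits = [("templ_1", 1e-40, 62.0), ("templ_2", 1e-8, 31.0), ("templ_3", 0.2, 55.0)]
for name, e_value, similarity in hits:
    print(name, rate_template(e_value, similarity))
```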

• Step 3: Aligning the target sequence with the template sequence. The target sequence is aligned with the template sequence or, if several templates were selected, with the structurally corrected multiple sequence alignment.

• Step 4: Building the model

• Framework construction: The next step is the construction of a framework, which is computed by averaging the position of each atom in the target sequence, based on the location of the corresponding atoms in the template. When more than one template is available, the relative contribution, or weight, of each structure is determined by its local degree of sequence identity with the target sequence.
• Improvement: building non-conserved loops, completing the backbone, adding side chains.
• Model refinement: performing energy minimization with force fields.

THREADING (Threader)

• The idea was to physically "thread" a sequence of amino acid side chains onto a backbone structure (a fold) and to evaluate this proposed 3-D structure using a set of pair potentials and (importantly) a separate solvation potential.

• A library of unique protein folds is derived from the database of protein structures. The test sequence is then optimally fitted to each library fold (allowing for relative insertions and deletions in loop regions), with the 'energy' of each possible fit (or threading) being calculated by summing the proposed pairwise interactions. The library of folds is then ranked in ascending order of total energy, with the lowest energy fold being taken as the most probable match.

http://bioinf.cs.ucl.ac.uk/threader/threader.html
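The scoring and ranking idea behind threading can be shown with a deliberately simplified toy. Everything specific in this sketch is hypothetical: the pair potential (a reward of -1 for any contact between two hydrophobic residues), the two-fold library given as lists of contacting positions, and the test sequence. It threads the sequence without gaps and has no solvation term, so it is far from what Threader actually does.

```python
HYDROPHOBIC = set("AVILMFWC")


def pair_energy(a, b):
    """Toy pair potential: hydrophobic-hydrophobic contacts are favorable."""
    return -1.0 if a in HYDROPHOBIC and b in HYDROPHOBIC else 0.0


def threading_energy(sequence, contacts):
    """Sum the pair energies over all residue pairs in contact in the fold."""
    return sum(pair_energy(sequence[i], sequence[j]) for i, j in contacts)


def rank_folds(sequence, fold_library):
    """Return (energy, fold name) pairs in ascending order of energy."""
    return sorted((threading_energy(sequence, contacts), name)
                  for name, contacts in fold_library.items())


# Hypothetical library: each fold is described by its contacting positions.
FOLDS = {
    "fold_A": [(0, 3), (1, 4)],
    "fold_B": [(0, 1), (2, 5)],
}

ranking = rank_folds("LKDVSE", FOLDS)
print(ranking)
```

For this sequence, fold_A brings L and V into contact and therefore gets the lowest energy, so it is ranked first as the most probable match.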

Profiles

For each known structure, a structural profile is built: each position is characterized by structural parameters such as secondary structure, inside/outside, and polarity. Statistics are collected from the PDB on how well each type of amino acid fits these features. The new sequence is then aligned to the characterization string of the existing structure.
