2019.10.01. Bioinformatics - Proteomics
Bioinformatics − Proteomics Lecture 4
Prof. László Poppe
BME Department of Organic Chemistry and Technology
Bioinformatics – Proteomics
Lecture and practice
Biological databases
[Figure: overview of biological databases. The user reaches the databases through search programs (e.g. BLAST).
Structural databases: primary: PDB (protein); secondary: SCOP, CATH.
Sequence databases: primary: SwissProt, TrEMBL, PIR (protein) and GenBank, DDBJ, EMBL (nucleotide); secondary: PFAM, BLOCKS, PROSITE.
Integrated databases: INTERPRO.]
Secondary databases
The secondary sequence databases, which contain sequence pattern data, are derived from
primary databases (i.e. those containing sequences).
Motifs can be determined from multiple alignments of primary sequence data.
A fingerprint is a group of conserved motifs used to characterise a protein family. Regular
expressions or frequency matrices can be derived from the motifs.
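[Figure: sequences in a multiple alignment, with insertions between the conserved motifs; the group of motifs forms a fingerprint. From a motif, a regular expression, a frequency matrix or a weighted frequency matrix (block) can be derived.]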
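As a minimal sketch of the step from an aligned motif to a frequency matrix, the following Python snippet counts how often each residue occurs in each alignment column. The four aligned segments are invented for illustration, and `frequency_matrix` is a hypothetical helper name, not part of any database tool.

```python
from collections import Counter


def frequency_matrix(aligned_motif):
    """Return one {residue: relative frequency} dict per alignment column."""
    lengths = {len(segment) for segment in aligned_motif}
    if len(lengths) != 1:
        raise ValueError("all motif segments must have the same length")
    n_sequences = len(aligned_motif)
    matrix = []
    # zip(*...) walks through the alignment column by column
    for column in zip(*aligned_motif):
        counts = Counter(column)
        matrix.append({aa: count / n_sequences for aa, count in counts.items()})
    return matrix


# Invented example motif (one ungapped region of a multiple alignment)
motif = ["GDRY", "GERF", "GDRY", "GERY"]
matrix = frequency_matrix(motif)
for position, column in enumerate(matrix, start=1):
    print(position, column)
```

A fully conserved column yields a single residue with frequency 1.0, which is the case a regular expression would write as a plain letter.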
Secondary databases
Secondary database           Primary or secondary source   Content
PROSITE                      SwissProt                     Regular expressions (motifs)
Profiles (part of PROSITE)   SwissProt                     Weighted matrices (profiles)
PRINTS                       SwissProt + TrEMBL            Aligned motifs (fingerprints)
Pfam                         SwissProt                     Hidden Markov models (HMMs)
BLOCKS*                      PROSITE / PRINTS              Aligned motifs (blocks)
eMOTIF*                      BLOCKS / PRINTS               "Fuzzy" regular expressions (patterns)
* Derived from a secondary database
Secondary databases
[Figure: methods behind the secondary databases: single-motif methods (exact and fuzzy regular expressions), multi-motif methods, and methods based on full alignment (profiles).]
Application paradigm
Similar sequence - Similar structure - Similar function
[Figure: diagram relating sequence similarity to homology (orthology and paralogy) and to similarity of structure and function.]
The basic question of bioinformatics: given a new sequence, what are the protein's function,
structural family, etc.?
Search tools (FASTA, BLAST, PSI−BLAST, etc.) are good at finding homologs, but identifying
orthologs is often more important (a homolog that is found may be a paralog rather than an
ortholog, which is less useful).
Secondary databases (derived mostly from sequences of proteins with similar function) can help
to identify orthology.
PROSITE – Regular expressions
ADLGAVFALCDRYFQ
SDVGPRSCFCERFYQ
ADLGRTQNRCDRYYQ
ADIGQPHSLCERYFQ
Alignment of four proteins
[AS]−D−[IVL]−G−x4−{PG}−C−[DE]−R−[FY]2−Q
· Standard IUPAC single-letter amino acid (AA) codes are used
· Positions are separated by −
· A single AA letter: fully conserved position (e.g. −G−)
· Square brackets: any one of the listed AAs (e.g. [AS])
· Curly braces: any AA except the listed ones (e.g. {PG})
· x: any AA
· Number: repetition (e.g. [FY]2, x4)
· x(2,4): x repeated 2, 3 or 4 times
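The notation above maps almost directly onto ordinary regular expressions. The sketch below shows one possible translation into Python's `re` syntax. `prosite_to_regex` is a hypothetical helper written for this example (it is not a PROSITE tool), and it accepts both the shorthand used on this slide (x4, [FY]2) and the parenthesised form (x(4), x(2,4)).

```python
import re

# One pattern element: a residue specification followed by an optional repeat count.
ELEMENT = re.compile(
    r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})"      # x, single letter, [..] or {..}
    r"(?:\((\d+)(?:,(\d+))?\)|(\d+))?"      # (n), (n,m) or the bare shorthand n
)


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    # Accept both the ASCII hyphen and the typographic minus sign as separator.
    for element in pattern.replace("\u2212", "-").split("-"):
        match = ELEMENT.fullmatch(element.strip())
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residue, low, high, bare = match.groups()
        low = low or bare
        if residue == "x":
            core = "."                           # any amino acid
        elif residue.startswith("{"):
            core = "[^" + residue[1:-1] + "]"    # any amino acid except the listed ones
        else:
            core = residue                       # a letter or a [..] set is already valid
        if low is None:
            repeat = ""
        elif high is None:
            repeat = "{" + low + "}"
        else:
            repeat = "{" + low + "," + high + "}"
        parts.append(core + repeat)
    return "".join(parts)


pattern = "[AS]-D-[IVL]-G-x4-{PG}-C-[DE]-R-[FY]2-Q"
regex = prosite_to_regex(pattern)
sequences = ["ADLGAVFALCDRYFQ", "SDVGPRSCFCERFYQ",
             "ADLGRTQNRCDRYYQ", "ADIGQPHSLCERYFQ"]
for sequence in sequences:
    print(sequence, bool(re.fullmatch(regex, sequence)))
```

All four aligned sequences of the example match the translated expression, whereas a sequence with P or G at the {PG} position does not.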
PROSITE – Regular expressions
Sequence matching
H-x-[LIVM]-{P}-x(0,2)-G-x(4)-W
Example:
H-C-I-N--G-YFRA-W
PROSITE - Patterns
Homologous regions from multiple alignments that have a relevant function within a
protein family, e.g.:
Catalytic sites of enzymes
Binding sites for prosthetic groups (e.g. heme, biotin, etc.)
Metal binding sites
Disulfide-bridge-forming cysteines
Binding sites for ligands (ADP/ATP, GDP/GTP, calcium, DNA, etc.)
A single-motif database, based on manual alignments of SwissProt sequences, annotated by
experts using experimental and literature data.
The quality of the expressions is regularly checked and improved.
Reliable and exhaustive documentation.
PROSITE – Pattern records
PROSITE – Pattern description
PROSITE – Documentation section
PROSITE - Search
PRINTS – "Fingerprints"
PRINTS is a compendium of protein fingerprints. A fingerprint is a group of conserved
motifs used to characterise a protein family; its diagnostic power is refined by iterative
scanning of a SWISS-PROT/TrEMBL composite. Usually the motifs do not overlap, but are
separated along a sequence, though they may be contiguous in 3D-space. Fingerprints can
encode protein folds and functionalities more flexibly and powerfully than can single
motifs, full diagnostic potency deriving from the mutual context provided by motif
neighbours.
Building PRINTS
Primary database: SWISSPROT+TrEMBL
Start with several manually aligned sequences of a protein family
Find the conserved regions (mostly by visual inspection) -> starting motifs
Derive a frequency matrix for each motif
Search the database (SwissProt+TrEMBL) with these frequency matrices
Add the best-scoring sequences to the starting motifs
Repeat the process iteratively until no further improvement is found
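The iterative procedure above can be sketched roughly as follows. This is a toy illustration for a single motif, not the actual PRINTS software: the log-odds scoring with a uniform background, the pseudocount, the fixed score threshold, the tiny "database" and all function names are assumptions made for the example.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def log_odds_matrix(segments, pseudocount=1.0):
    """Position-specific log-odds scores from aligned, equal-length segments."""
    background = 1.0 / len(AMINO_ACIDS)   # assumption: uniform background frequencies
    total = len(segments) + pseudocount * len(AMINO_ACIDS)
    matrix = []
    for column in zip(*segments):
        scores = {}
        for aa in AMINO_ACIDS:
            frequency = (column.count(aa) + pseudocount) / total
            scores[aa] = math.log(frequency / background)
        matrix.append(scores)
    return matrix


def best_window(sequence, matrix):
    """Return (score, segment) for the best-scoring window of the sequence."""
    width = len(matrix)
    best = (float("-inf"), "")
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # residues outside the 20 standard amino acids contribute a score of 0.0
        score = sum(column.get(aa, 0.0) for column, aa in zip(matrix, window))
        if score > best[0]:
            best = (score, window)
    return best


def refine_motif(seed_segments, database, threshold, max_rounds=10):
    """Scan, add new high-scoring segments, rebuild the matrix, repeat."""
    segments = list(seed_segments)
    for _ in range(max_rounds):
        matrix = log_odds_matrix(segments)
        new_hits = []
        for sequence in database:
            score, window = best_window(sequence, matrix)
            is_new = window not in segments and window not in new_hits
            if score >= threshold and is_new:
                new_hits.append(window)
        if not new_hits:          # converged: nothing new was found
            break
        segments.extend(new_hits)
    return segments


seed = ["GDRY", "GERY"]                              # invented starting motif
database = ["MKGDRFAA", "AAAGERYKK", "MKTLLPSW"]     # invented sequences
print(refine_motif(seed, database, threshold=2.0))
```

In this toy run the first scan picks up one new segment and the second scan finds nothing further, so the loop stops. A real fingerprint would track several motifs per sequence at once and keep one motif instance per matched sequence rather than one per distinct string.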
PRINTS – Database
BLOCKS – "Blocks"
An older, matrix-based approach, derived from the SwissProt database
BLOCKS fields
BLOCKS search (no longer maintained; more sophisticated methods are available nowadays)
By keyword, description, etc.
Comparison of a sequence with BLOCKS (using weighted frequency matrices) -> matching
blocks, each with an E value.
The so-called logo of a block (in which the AA frequencies are converted to letter heights) can
be visualized.
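To make the idea of converting frequencies to letter heights concrete, here is a small sketch using the information-content scaling that is common for sequence logos: each column gets a total height of log2(20) minus its entropy (in bits), shared among the residues in proportion to their frequency. Whether the BLOCKS logos used exactly this scaling is an assumption; the aligned block and the function name `logo_heights` are made up for the example, and no small-sample correction is applied.

```python
import math
from collections import Counter


def logo_heights(aligned_block, alphabet_size=20):
    """For each column, return {residue: letter height in bits}."""
    n_sequences = len(aligned_block)
    max_bits = math.log2(alphabet_size)      # height of a fully conserved column
    heights = []
    for column in zip(*aligned_block):
        frequencies = {aa: c / n_sequences for aa, c in Counter(column).items()}
        entropy = -sum(f * math.log2(f) for f in frequencies.values())
        information = max_bits - entropy      # total stack height of this column
        heights.append({aa: f * information for aa, f in frequencies.items()})
    return heights


block = ["GDRY", "GERF", "GDRY", "GERY"]      # invented example block
heights = logo_heights(block)
for position, column in enumerate(heights, start=1):
    print(position, {aa: round(h, 2) for aa, h in column.items()})
```

Fully conserved positions produce one tall letter, while variable positions produce a shorter stack of several letters.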
Profiles – Prosite, Pfam
Sequence profiles are mathematical objects that describe full sequences and are derived from
aligned sequences. Profiles are essentially patterns in which each position of the segment (or
motif) is assigned a probability value for each possible amino acid residue type. There are two
main types of profiles:
Weight matrices: weighted frequency matrices (similar to those in BLOCKS), extended with
position-dependent gap opening and gap extension penalties (22 numbers in each row of the
matrix: 20 AAs + 2 gap penalties). PROSITE uses them to describe protein families for which no
good regular expression has been found.
Hidden Markov models (HMMs): an HMM is a probabilistic model that "generates" sequences,
i.e. a linear chain of Match (M), Insertion (I) and Deletion (D) states with values assigned to the
transitions between them. HMMs are a theoretically sound modeling paradigm for collections of
motifs, and efficient algorithms exist for them.
Hidden Markov models (HMM)
The hidden Markov model (HMM) is a virtual machine that generates sequences. The machine
has a finite number of states and steps between these states. At each state or state transition, a
sequence unit (i.e. one AA or nucleotide) may be generated. The generated sequences are
assembled from these units.
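The "sequence-generating machine" can be illustrated with a toy profile-style HMM. In the sketch below, match (M) and insertion (I) states emit one residue each, deletion (D) states are silent, and the machine stops once it steps past the last model position. All probabilities, the four-position model and the function names are invented for this example; for simplicity the same transition probabilities are used at every position.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Toy model with four match positions (all numbers are invented).
MATCH_EMISSIONS = [
    {"G": 0.9, "A": 0.1},
    {"D": 0.5, "E": 0.5},
    {"R": 1.0},
    {"Y": 0.7, "F": 0.3},
]
INSERT_EMISSIONS = {aa: 1 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
TRANSITIONS = {
    "M": {"M": 0.90, "I": 0.05, "D": 0.05},
    "I": {"M": 0.60, "I": 0.40},
    "D": {"M": 0.80, "D": 0.20},
}


def pick(distribution, rng):
    """Draw one key of a {outcome: probability} dict."""
    outcomes = list(distribution)
    weights = [distribution[outcome] for outcome in outcomes]
    return rng.choices(outcomes, weights=weights)[0]


def generate_sequence(rng, match_emissions=MATCH_EMISSIONS,
                      insert_emissions=INSERT_EMISSIONS,
                      transitions=TRANSITIONS):
    """Walk through the model once and return the emitted sequence."""
    residues = []
    state, position = "M", 0     # silent begin state, treated like a match state
    while True:
        next_state = pick(transitions[state], rng)
        if next_state == "I":
            # insertion: emit a residue but stay at the same model position
            residues.append(pick(insert_emissions, rng))
        else:
            position += 1
            if position > len(match_emissions):
                return "".join(residues)      # stepped past the last position: end
            if next_state == "M":
                residues.append(pick(match_emissions[position - 1], rng))
            # next_state == "D": silent state, nothing is emitted
        state = next_state


rng = random.Random(2019)
samples = [generate_sequence(rng) for _ in range(5)]
print(samples)
```

Most generated sequences have four residues and resemble the match-state preferences; occasional insertions or deletions make some of them longer or shorter.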
Hidden Markov models (HMM)
An HMM can be defined on the basis of related sequences. If the model represents the related
family efficiently, this HMM can generate novel sequences similar to those present in the
starting set of sequences.
In sequence analysis, it is possible to calculate for a sequence the probability that the HMM
could generate it. If this probability is high, then the sequence belongs to the family from which
the HMM was constructed.
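The probability that a model generates a given sequence is usually computed with the forward algorithm, which sums over all state paths. The sketch below applies it to a toy profile-style HMM with match (M), insertion (I) and silent deletion (D) states. The model parameters and the function name are invented for illustration, the same transition probabilities are used at every position, and moving to M or D from the last position is treated as reaching the end state.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Toy model with four match positions (all numbers are invented).
MATCH_EMISSIONS = [
    {"G": 0.9, "A": 0.1},
    {"D": 0.5, "E": 0.5},
    {"R": 1.0},
    {"Y": 0.7, "F": 0.3},
]
INSERT_EMISSIONS = {aa: 1 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
TRANSITIONS = {
    "M": {"M": 0.90, "I": 0.05, "D": 0.05},
    "I": {"M": 0.60, "I": 0.40},
    "D": {"M": 0.80, "D": 0.20},
}


def forward_probability(sequence, match_emissions=MATCH_EMISSIONS,
                        insert_emissions=INSERT_EMISSIONS,
                        transitions=TRANSITIONS):
    """P(sequence | model), summed over all paths through the model."""
    n, length = len(sequence), len(match_emissions)

    def t(source, target):
        return transitions.get(source, {}).get(target, 0.0)

    # f_X[j][i]: probability of having emitted the first i residues
    # and being in state type X at model position j
    f_m = [[0.0] * (n + 1) for _ in range(length + 1)]
    f_i = [[0.0] * (n + 1) for _ in range(length + 1)]
    f_d = [[0.0] * (n + 1) for _ in range(length + 1)]
    f_m[0][0] = 1.0   # silent begin state, treated like a match state

    for j in range(length + 1):
        for i in range(n + 1):
            if j == 0 and i == 0:
                continue
            if j > 0 and i > 0:   # match: consumes a residue and a model position
                incoming = (f_m[j - 1][i - 1] * t("M", "M")
                            + f_i[j - 1][i - 1] * t("I", "M")
                            + f_d[j - 1][i - 1] * t("D", "M"))
                f_m[j][i] = match_emissions[j - 1].get(sequence[i - 1], 0.0) * incoming
            if i > 0:             # insertion: consumes a residue only
                incoming = (f_m[j][i - 1] * t("M", "I")
                            + f_i[j][i - 1] * t("I", "I")
                            + f_d[j][i - 1] * t("D", "I"))
                f_i[j][i] = insert_emissions.get(sequence[i - 1], 0.0) * incoming
            if j > 0:             # deletion: consumes a model position only
                f_d[j][i] = (f_m[j - 1][i] * t("M", "D")
                             + f_i[j - 1][i] * t("I", "D")
                             + f_d[j - 1][i] * t("D", "D"))

    def to_end(state):
        return t(state, "M") + t(state, "D")

    return (f_m[length][n] * to_end("M")
            + f_i[length][n] * to_end("I")
            + f_d[length][n] * to_end("D"))


for candidate in ["GDRY", "AERF", "GDKRY", "MKTL"]:
    print(candidate, forward_probability(candidate))
```

In practice the raw probability is usually compared with the probability under a background (null) model, and the resulting log-odds score, rather than the probability itself, is used to decide family membership.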
Profiles – Prosite, Pfam
PROSITE profile records
Basic parameters: the transition scores, e.g. MI (Match−Insertion)
M: Match states, with parameters (elements of the weight matrix)
I: Insertion states, with parameters
Pfam records
Description records: descriptions of the families (sequence lists)
HMM record: gives the HMM itself
Pfam−A: well-documented families
Pfam−B: poorly documented, automatically generated families
Search in profile databases
Sequence comparisons with the profiles (various programs / servers)
Integrated secondary database - INTERPRO
Integration of the best-documented secondary databases (PROSITE, PRINTS) with other
secondary databases (Pfam, PRODOM, etc.).
Thousands of protein families
Integrated biological database - NCBI
Integrated biological database – NCBI Structure
Integrated biological database – NCBI PubMed