
2019.10.01. Bioinformatics - Proteomics

Bioinformatics − Proteomics Lecture 4

Prof. László Poppe

BME Department of Organic Chemistry and Technology

Bioinformatics – Proteomics

Lecture and practice


Biological databases

[Diagram: the user searches the biological databases through a search program such as BLAST.]

· Structural databases
    Primary databases (protein): PDB
    Secondary databases: SCOP, CATH

· Sequence databases
    Primary databases (protein): SwissProt, TrEMBL, PIR
    Primary databases (nucleotide): GenBank, DDBJ, EMBL
    Secondary databases: PFAM, BLOCKS, PROSITE

· Integrated databases: INTERPRO


Secondary databases

The secondary sequence databases, which contain sequence pattern data, are derived from primary databases (i.e. those containing the sequences themselves).

Motifs can be determined from multiple alignments of primary sequence data.

A fingerprint is a group of conserved motifs used to characterise a protein family. Regular expressions or frequency matrices can be derived from the motifs.

[Diagram labels: sequences in a multiple alignment, insertions, motif, fingerprint, regular expression, frequency matrix, balanced frequency matrix (block).]
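How a motif turns into a frequency matrix and a regular expression can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the motif is ungapped, the four sequences are invented for the example, and the function names column_frequencies and naive_regex are hypothetical rather than taken from any database tool.

```python
from collections import Counter

def column_frequencies(aligned):
    """Return one {residue: relative frequency} dict per alignment column."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    matrix = []
    for column in zip(*aligned):
        counts = Counter(column)
        matrix.append({aa: n / len(aligned) for aa, n in counts.items()})
    return matrix

def naive_regex(matrix):
    """Build a regular expression accepting every residue seen in each column."""
    parts = []
    for column in matrix:
        residues = "".join(sorted(column))
        parts.append(residues if len(residues) == 1 else "[" + residues + "]")
    return "".join(parts)

# Hypothetical ungapped motif, for illustration only.
motif = ["GDSGG", "GDSGA", "GNSGG", "GDTGG"]
matrix = column_frequencies(motif)
pattern = naive_regex(matrix)
print(matrix[1])   # frequencies in the second column
print(pattern)     # G[DN][ST]G[AG]
```

The regular expression keeps only which residues occur at a position, while the frequency matrix also keeps how often they occur, which is why matrix-based methods can score partial matches.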


Secondary databases

Secondary database            Primary or secondary source   Content
PROSITE                       SwissProt                     Regular expressions (motifs)
Profiles (part of PROSITE)    SwissProt                     Balanced matrices (profiles)
PRINTS                        SwissProt + TrEMBL            Aligned motifs (fingerprints)
Pfam                          SwissProt                     Hidden Markov models (HMMs)
BLOCKS*                       PROSITE / PRINTS              Aligned motifs (blocks)
eMOTIF*                       BLOCKS / PRINTS               "Fuzzy" regular expressions (patterns)

* Derived from a secondary database


Secondary databases

[Diagram labels: single-motif methods (fuzzy regular expression, accurate regular expression); multi-motif methods; methods based on full alignment (profiles).]


Application paradigm

[Diagram: the paradigm "similar sequence - similar structure - similar function", relating homology (orthology and paralogy) to similarity of structure and of function.]

The basic question of bioinformatics: new sequence -> protein function, structure, family, etc.

Search tools (FASTA, BLAST, PSI-BLAST, etc.) are good at finding homology, but identifying orthology is sometimes more important (a homolog found may be a paralog rather than an ortholog, which is less useful).

Secondary databases (derived mostly from sequences of proteins of similar function) can help to find orthology.


PROSITE – Regular expressions

ADLGAVFALCDRYFQ

SDVGPRSCFCERFYQ

ADLGRTQNRCDRYYQ

ADIGQPHSLCERYFQ

Alignment of four proteins

[AS]−D−[IVL]−G−x4−{PG}−C−[DE]−R−[FY]2−Q

· Standard IUPAC single-letter amino acid (AA) codes

· Positions are separated by −

· An AA letter: fully conserved position (e.g. −G−)

· Square brackets: one of the listed AAs (e.g. [AS])

· {AA}: any AA except the listed ones (e.g. {PG})

· x: any AA

· Number: repetition (e.g. [FY]2, x4)

· x(2,4): x repeated 2, 3 or 4 times
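The notation above maps almost one-to-one onto ordinary regular expressions. The following Python sketch converts such a pattern and checks it against the four aligned sequences of this slide. It assumes that patterns contain only the elements listed above; the function name prosite_to_regex is illustrative and not part of any PROSITE tool.

```python
import re

# One pattern element: a residue, [allowed set], {forbidden set} or x,
# optionally followed by a repeat count written as (n), (n,m) or a bare n.
_ELEMENT = re.compile(
    r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Zx])(?:\((\d+)(?:,(\d+))?\)|(\d+))?"
)

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    # Accept the typographic minus sign used on the slide as a separator too.
    for element in pattern.replace("\u2212", "-").split("-"):
        m = _ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError("unsupported pattern element: %r" % element)
        residue, low, high, bare = m.groups()
        low = low or bare
        if residue == "x":
            core = "."                          # any amino acid
        elif residue.startswith("{"):
            core = "[^" + residue[1:-1] + "]"   # any AA except the listed ones
        else:
            core = residue                      # single letter or [set]
        if low is None:
            repeat = ""
        elif high is None:
            repeat = "{%s}" % low
        else:
            repeat = "{%s,%s}" % (low, high)
        parts.append(core + repeat)
    return "".join(parts)

slide_pattern = "[AS]-D-[IVL]-G-x4-{PG}-C-[DE]-R-[FY]2-Q"
aligned = ["ADLGAVFALCDRYFQ", "SDVGPRSCFCERFYQ",
           "ADLGRTQNRCDRYYQ", "ADIGQPHSLCERYFQ"]

regex = prosite_to_regex(slide_pattern)
print(regex)  # [AS]D[IVL]G.{4}[^PG]C[DE]R[FY]{2}Q
print(all(re.fullmatch(regex, seq) for seq in aligned))  # True
```

To scan a longer protein sequence for an occurrence of the motif, re.search would be used in place of re.fullmatch.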


PROSITE – Regular expressions

Sequence matching

Pattern: H-x-[LIVM]-{P}-x(0,2)-G-x(4)-W

Example match: H-C-I-N--G-YFRA-W (here x(0,2) matches zero residues)


PROSITE - Patterns

Homologous regions from multiple alignments that have a relevant function within a protein family, e.g.:

· Catalytic sites of enzymes

· Binding sites for prosthetic groups (e.g. heme, biotin, etc.)

· Metal binding sites

· Disulfide-bridge-forming cysteines

· Binding sites for ligands (ADP/ATP, GDP/GTP, calcium, DNA, etc.)

Single-motif database, based on manual SwissProt alignments, annotated by experts using experimental and literature data.

The quality of the expressions is regularly checked and improved.

Reliable and exhaustive documentation.


PROSITE – Pattern records


PROSITE – Pattern description


PROSITE – Documentation section


PROSITE - Search


PRINTS – "Fingerprints"

PRINTS is a compendium of protein fingerprints. A fingerprint is a group of conserved

motifs used to characterise a protein family; its diagnostic power is refined by iterative

scanning of a SWISS-PROT/TrEMBL composite. Usually the motifs do not overlap, but are

separated along a sequence, though they may be contiguous in 3D-space. Fingerprints can

encode protein folds and functionalities more flexibly and powerfully than can single

motifs, full diagnostic potency deriving from the mutual context provided by motif

neighbours.

Building PRINTS

Primary database: SwissProt + TrEMBL

Start with several manually aligned sequences of a protein family

Find conserved regions (mostly visually) -> starting motifs

Derive a frequency matrix for each motif

Search SwissProt + TrEMBL with these frequency matrices

Add the best-scoring sequences to the starting motifs

Repeat the process iteratively until no further improvement is found
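The iterative procedure above can be sketched for a single motif as follows. This is a deliberately simplified illustration, not the actual PRINTS implementation: a real fingerprint consists of several motifs, and the raw count score, the threshold parameter, the toy "database" and all function names are assumptions made for this example.

```python
from collections import Counter

def frequency_matrix(motifs):
    """One Counter of residue counts per motif column."""
    return [Counter(column) for column in zip(*motifs)]

def best_window(matrix, sequence):
    """Return (score, segment) of the best-scoring window of motif width."""
    width = len(matrix)
    best = (-1, "")
    for start in range(len(sequence) - width + 1):
        segment = sequence[start:start + width]
        score = sum(column[aa] for column, aa in zip(matrix, segment))
        best = max(best, (score, segment))
    return best

def refine_motif(seed_motifs, database, threshold):
    """Grow the motif set by repeated scanning until nothing new is added.

    `threshold` is the minimum fraction of the maximum possible score
    that a segment must reach to be accepted (hypothetical criterion).
    """
    motifs = list(seed_motifs)
    while True:
        matrix = frequency_matrix(motifs)
        max_score = sum(max(column.values()) for column in matrix)
        added = False
        for sequence in database:
            score, segment = best_window(matrix, sequence)
            if segment and segment not in motifs and score >= threshold * max_score:
                motifs.append(segment)
                added = True
        if not added:
            return motifs

# Invented seed motif and sequences, for illustration only.
seed = ["GDSGG", "GDSGA"]
database = ["MKLGDSGGPLV", "AAGNSGGTT", "MKWWWWWPLV"]
refined = refine_motif(seed, database, threshold=0.7)
print(refined)  # ['GDSGG', 'GDSGA', 'GNSGG']
```

In the first pass the segment GNSGG scores 7 of a possible 9 and is added; the second pass finds nothing new, so the loop stops.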


PRINTS – Database


BLOCKS – "Blocks"

Older, matrix-based approach, derived from the SwissProt database

BLOCKS fields

Keyword, description, etc.

BLOCKS search (no longer maintained; nowadays more sophisticated methods are available)

Comparison of a sequence with BLOCKS (using balanced frequency matrices) -> matching blocks, each with an E value.

The so-called logo of a block (the AA frequencies are converted to letter heights) can be visualized, e.g.:


Profiles – Prosite, Pfam

Sequence profiles are mathematical objects, derived from aligned sequences, that describe full sequences. Profiles are essentially patterns in which each position in the sequence of the segment (or motif) has been assigned a probability value for each possible amino-acid residue type. There are two main types of profiles:

Weight matrices: balanced frequency matrices (similar to those of BLOCKS), extended with position-dependent gap opening and extension penalties (22 numbers in each row of the matrix: 20 AAs + 2 gap penalties). PROSITE uses them to describe protein families for which no good regular expression has been found.

Hidden Markov models (HMMs): an HMM is a probability model that "generates" sequences, i.e. a linear chain of Match (M), Insertion (I) and Deletion (D) states with probabilities assigned to the transitions between them. The HMM is a theoretically sound modelling paradigm for collections of motifs, for which efficient algorithms exist.


Hidden Markov models (HMM)

The hidden Markov model (HMM) is a virtual machine that generates sequences. The machine has a finite number of states and steps between them. At each state or state transition, a sequence unit (i.e. one AA or nucleotide) may be generated. The generated sequences are assembled from these units.


Hidden Markov models (HMM)

An HMM can be defined on the basis of related sequences. If the model represents the family well, the HMM can generate novel sequences similar to those present in the starting set of sequences.

In sequence analysis, it is possible to calculate the probability that the HMM could generate a given sequence. If this probability is high, then the sequence probably belongs to the family from which the HMM was constructed.
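The probability that an HMM generates a given sequence is usually computed with the forward algorithm, which sums over all state paths. The sketch below is a minimal Python version under stated assumptions: every state emits one symbol (the silent deletion states of real profile HMMs are left out), and the toy model with two match states and one insertion state, including all its probabilities, is invented for illustration.

```python
def forward_probability(sequence, start, transitions, emissions, end):
    """Probability that the model emits exactly `sequence` and then stops."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    states = list(emissions)
    # forward[s]: probability of emitting the prefix read so far
    # and being in state s after the last emitted symbol.
    forward = {
        s: start.get(s, 0.0) * emissions[s].get(sequence[0], 0.0)
        for s in states
    }
    for symbol in sequence[1:]:
        forward = {
            s: emissions[s].get(symbol, 0.0)
            * sum(forward[p] * transitions[p].get(s, 0.0) for p in states)
            for s in states
        }
    return sum(forward[s] * end.get(s, 0.0) for s in states)

# Hypothetical toy model: M1 -> (optional insertions I1) -> M2 -> end.
start = {"M1": 1.0}
transitions = {
    "M1": {"M2": 0.8, "I1": 0.2},
    "I1": {"I1": 0.3, "M2": 0.7},
    "M2": {},
}
emissions = {
    "M1": {"H": 0.9, "A": 0.1},
    "I1": {"H": 0.5, "A": 0.5},
    "M2": {"W": 0.8, "A": 0.2},
}
end = {"M2": 1.0}

for candidate in ("HW", "HAW", "WW"):
    p = forward_probability(candidate, start, transitions, emissions, end)
    print(candidate, p)
```

With these numbers "HW" is the most probable sequence, "HAW" (one inserted residue) is less probable, and "WW" cannot be generated at all. Practical tools typically report a log-odds score against a background model rather than the raw probability, since raw probabilities shrink with sequence length.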


Profiles – Prosite, Pfam

PROSITE profile records

Basic parameters: the scores for transitions such as MI (Match -> Insertion)

M: Match states, with parameters (elements of the weight matrix)

I: Insertion states, with parameters

Pfam records

Description records: descriptions of the families (sequence lists)

HMM record: defines the HMM

Pfam-A: well-documented families

Pfam-B: poorly documented, automatically generated families

Search in profile databases

Sequence comparisons with the profiles (various programs / servers)


Integrated secondary database - INTERPRO

Integration of the best-documented secondary databases (PROSITE, PRINTS) with other secondary databases (Pfam, PRODOM, etc.).

Thousands of protein families




Integrated biological database – NCBI


Integrated biological database – NCBI Structure


Integrated biological database – NCBI PubMed