Bioinformatics and Computational Biology

KFC/BIN

VI. Genes – prediction and ontology

RNDr. Karel Berka, Ph.D.

Univerzita Palackého v Olomouci

Source: fch.upol.cz/wp-content/uploads/2018/03/BIN_04_Geny_ontologie_vz3.pdf

Gene prediction

A gene is "a locatable region of genomic sequence, corresponding to a unit of inheritance, which is associated with regulatory regions, transcribed regions, and/or other functional sequence regions".

An allele is one variant of a gene; colloquial uses of the word "gene" (e.g. "good genes", "hair color gene") often actually refer to an allele.

Gregor Mendel

Prediction:

based on the different information content of coding (CDS) and non-coding (UTR) sequences in the genome.

http://en.wikipedia.org/wiki/Locus_(genetics)

http://en.wikipedia.org/wiki/Genome

http://en.wikipedia.org/wiki/Gregor_Mendel

Information content

i l-ve you

hr-jlka ds

The value of genome sequences

lies in their annotation

• Annotation – Characterizing genomic

features using computational and

experimental methods

• Genes: Four levels of annotation

– Gene Prediction – Where are genes?

– What do they look like?

– Domains – What do the proteins do?

– Role – Which pathway(s) are they involved in?

How many genes does a human have?

Consortium: 35,000 genes?

Celera: 30,000 genes?

Affymetrix: 60,000 human genes on GeneChips?

Incyte and HGS: over 120,000 genes?

GenBank: 49,000 unique gene coding sequences?

UniGene: > 89,000 clusters of unique ESTs?


Current consensus (in flux …)

• 20,000 known genes (2010)

– (similarity to previously isolated genes and

expressed sequences from a large variety of

different organisms)

– 15,000 known in 2003

• 22,333 predicted (RefSeq)

– problems with prediction algorithms (low accuracy) (Nature blog 2010)

How do we get from here …

… to here?

What are genes? - 1

• Complete DNA segments responsible for making functional products
• Products
– Proteins
– Functional RNA molecules
• RNAi (interfering RNA)
• rRNA (ribosomal RNA)
• snRNA (small nuclear)
• snoRNA (small nucleolar)
• tRNA (transfer RNA)

What are genes? - 2

• Definition vs. dynamic concept

• Consider

– Prokaryotic vs. eukaryotic gene models

– Introns/exons

– Posttranscriptional modifications

– Alternative splicing

– Differential expression

– Genes-in-genes

– Genes-ad-genes

– Posttranslational modifications

– Multi-subunit proteins


Prokaryotic gene model: ORF-genes

• “Small” genomes, high gene density
– Haemophilus influenzae genome 85% genic
• Operons
– One transcript, many genes
• No introns
– One gene, one protein
• Open reading frames (ORF)
– One ORF per gene
– ORFs begin with a start codon and end with a stop codon (DNA / RNA):
- TAG / UAG ("amber")
- TAA / UAA ("ochre")
- TGA / UGA ("opal" or "umber")

Mnemonic: UGA "U Go Away", UAA "U Are Away", UAG "U Are Gone"

TIGR: http://cmr.jcvi.org/tigr-scripts/CMR/CmrHomePage.cgi
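The ORF rule for prokaryotes (start codon, a run of triplets, one of the three stop codons) can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: it scans the forward strand only, treats ATG as the only start codon, and the function name `find_orfs` and the `min_codons` cutoff are hypothetical choices, not part of any tool named in these slides.

```python
# Minimal ORF scan on the forward strand only (illustrative sketch).
STOP_CODONS = {"TAA", "TAG", "TGA"}  # ochre, amber, opal
START_CODON = "ATG"  # assumption: only the canonical start codon is considered


def find_orfs(dna, min_codons=2):
    """Return (start, end, frame) for each ATG...stop ORF on the forward strand.

    start is 0-based, end is exclusive and includes the stop codon.
    min_codons counts the start codon but not the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk the sequence codon by codon in this reading frame.
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START_CODON:
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # close the ORF and look for the next one
    return orfs


if __name__ == "__main__":
    for start, end, frame in find_orfs("CCATGAAATAGGG"):
        print(start, end, frame)
```

A real ORF detector would also scan the reverse complement and apply a much longer length cutoff, since short ORFs occur by chance.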

Eukaryotic gene model: spliced genes

• Posttranscriptional modification
– 5’-CAP, polyA tail, splicing
• Open reading frames
– Mature mRNA contains ORF
– All internal exons contain open “read-through”
– Pre-start and post-stop sequences are UTRs
• Multiple translates
– One gene – many proteins via alternative splicing

Expansions and Clarifications

• ORFs

– Start – triplets – stop

– Prokaryotes: gene = ORF

– Eukaryotes: spliced genes or ORF genes

• Exons

– Remain after introns have been removed

– Flanking parts contain non-coding sequence (5’- and 3’-UTRs)

Where do genes live?

• In genomes
• Example: the human genome
– 3,274,571,503 bp (Ensembl 2010)
– 25 chromosomes: 1-22, X, Y, mt
– 22,333 genes (RefSeq estimate 2010)
– Gene size: from 128 nucleotides (an RNA gene) to 2,800 kb (DMD)
– Ca. 25% of the genome is genic (introns and exons)
– Ca. 1% of the genome codes for amino acids (CDS)
– 30 kb average gene length
– 1.4 kb average ORF length
– 3 transcripts per gene on average
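As a rough consistency check, the genic and coding fractions can be recomputed from the gene count and the average lengths quoted on this slide. The sketch below is back-of-the-envelope arithmetic only; it assumes that genes do not overlap and that the averages apply uniformly, and the variable names are illustrative.

```python
# Back-of-the-envelope check of the human genome figures quoted above.
GENOME_BP = 3_274_571_503   # Ensembl 2010
N_GENES = 22_333            # RefSeq estimate 2010
AVG_GENE_BP = 30_000        # average gene length (introns + exons)
AVG_ORF_BP = 1_400          # average ORF length

# Assumption: genes do not overlap, so total length = count * average length.
genic_fraction = N_GENES * AVG_GENE_BP / GENOME_BP
coding_fraction = N_GENES * AVG_ORF_BP / GENOME_BP

if __name__ == "__main__":
    print(f"genic:  {genic_fraction:.1%}")
    print(f"coding: {coding_fraction:.2%}")
```

The result is about one fifth of the genome for genes and just under 1% for coding sequence, in the same range as the "ca. 25%" and "ca. 1%" stated on the slide.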

Genomic sequence features

• Repeats – Transposable elements, simple repeats

– RepeatMasker (http://www.repeatmasker.org/)

• Genes

– Vary in density, length, structure

– Identification depends on evidence and methods and may require concerted application of bioinformatics methods and lab research

• Pseudogenes

– Look-alikes of genes; they obstruct gene-finding efforts.

• Non-coding RNAs (ncRNA)

– tRNA, rRNA, snRNA, snoRNA, miRNA

– tRNASCAN-SE, COVE (http://selab.janelia.org/software.html)

http://repeatmasker.genome.washington.edu/cgi-bin/RepeatMasker

http://www.genetics.wustl.edu/eddy/software/

Gene identification

• Homology-based gene prediction
– Similarity searches (e.g. BLAST, BLAT)
– Genome browsers
– RNA evidence (ESTs, expressed sequence tags from cDNA)
• Ab initio gene prediction
– Gene prediction programs
– Prokaryotes
• ORF identification
– Eukaryotes
• Promoter prediction
• PolyA-signal prediction
• Splice site, start/stop-codon predictions

Gene prediction through comparative

genomics

• Highly similar (conserved) regions between two genomes are likely to be functional, or else they would have diverged

• If genomes are too closely related all regions are similar,

not just genes

• If genomes are too far apart, analogous regions may be

too dissimilar to be found


Genome Browsers

Generic Genome Browser (CSHL)

www.wormbase.org/db/seq/gbrowse

NCBI Map Viewer

www.ncbi.nlm.nih.gov/mapview/

Ensembl Genome Browser

www.ensembl.org/

Apollo Genome Browser

www.bdgp.org/annot/apollo/

UCSC Genome Browser

genome.ucsc.edu/cgi-bin/hgGateway?org=human

http://www.ncbi.nlm.nih.gov/mapview/static/MVstart.html

Gene discovery using ESTs

• Expressed Sequence Tags (ESTs)

represent sequences from expressed

genes.

• If region matches EST with high

consensus then region is probably a gene

or pseudogene.

– EST overlapping exon boundary gives an

accurate prediction of exon boundary.


Ab initio gene prediction

• Prokaryotes

– ORF-Detectors

• Eukaryotes

– Position, extent & direction: through promoter

and polyA-signal predictors

– Structure: through splice site predictors

– Exact location of coding sequences: through

determination of relationships between

potential start codons, splice sites, ORFs, and

stop codons


How it works I – ORFs (SWF film)

How it works II – Motif identification

Exon-Intron Borders = Splice Sites

Exon Intron Exon

~~gaggcatcag|gtttgtagac~~~~~~~~~~~tgtgtttcag|tgcacccact~~

~~ccgccgctga|gtgagccgtg~~~~~~~~~~~tctattctag|gacgcgcggg~~

~~tgtgaattag|gtaagaggtt~~~~~~~~~~~atatctccag|atggagatca~~

~~ccatgaggag|gtgagtgcca~~~~~~~~~~~ttatttccag|gtatgagacg~~

Splice site Splice site

Exon Intron Exon

~~gaggcatcag|GTttgtagac~~~~~~~~~~~tgtgtttcAG|tgcacccact~~

~~ccgccgctga|GTgagccgtg~~~~~~~~~~~tctattctAG|gacgcgcggg~~

~~tgtgaattag|GTaagaggtt~~~~~~~~~~~atatctccAG|atggagatca~~

~~ccatgaggag|GTgagtgcca~~~~~~~~~~~ttatttccAG|gtatgagacg~~

Splice site Splice site

Motif Extraction Programs at http://www-btls.jst.go.jp/
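The splice-site motif highlighted in the aligned examples (introns start with GT and end with AG) can be checked and applied with a short Python sketch. The sequences are copied from the four examples above; the function names and the `min_len` cutoff are hypothetical, and a real splice-site predictor would score a wider sequence context rather than two bases.

```python
# Exon|intron and intron|exon borders copied from the four aligned examples above.
EXAMPLES = [
    ("gaggcatcag|gtttgtagac", "tgtgtttcag|tgcacccact"),
    ("ccgccgctga|gtgagccgtg", "tctattctag|gacgcgcggg"),
    ("tgtgaattag|gtaagaggtt", "atatctccag|atggagatca"),
    ("ccatgaggag|gtgagtgcca", "ttatttccag|gtatgagacg"),
]


def intron_ends(donor, acceptor):
    """Return the first two and the last two intron bases of one example."""
    intron_start = donor.split("|")[1]    # sequence right after the exon|intron border
    intron_end = acceptor.split("|")[0]   # sequence right before the intron|exon border
    return intron_start[:2].upper(), intron_end[-2:].upper()


def candidate_introns(dna, min_len=4):
    """List (start, end) spans that begin with GT and end with AG (end exclusive)."""
    dna = dna.upper()
    donors = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "GT"]
    acceptors = [j + 2 for j in range(len(dna) - 1) if dna[j:j + 2] == "AG"]
    return [(i, j) for i in donors for j in acceptors if j - i >= min_len]


if __name__ == "__main__":
    print([intron_ends(d, a) for d, a in EXAMPLES])
    print(candidate_introns("AAGTCCAGTT"))
```

Because GT and AG are very common dinucleotides, this naive scan reports far more candidates than there are real introns, which is why the motif alone is not enough for gene prediction.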

How it works III – The (ugly) truth


Gene prediction programs

• Homology-based programs
– Use BLAST-like similarity searches.
– Examples: Exofish, CRITICA
• Rule-based programs
– Use explicit set of rules to make decisions.
– Example: GeneFinder
• Neural Network-based programs
– Use data set to build rules.
– Examples: Grail, GrailEXP, Genemark
• Hidden Markov Model-based programs
– Use probabilities of states and transitions between these states to predict features.
– Examples: Genscan, GenomeScan

http://argon.cshl.org/genefinder/

http://compbio.ornl.gov/Grail-1.3/

http://compbio.ornl.gov/grailexp/

http://genes.mit.edu/GENSCAN.html

http://genes.mit.edu/genomescan.html
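To make "probabilities of states and transitions between these states" concrete, here is a toy two-state hidden Markov model decoded with the Viterbi algorithm. It is a sketch only: the states, the transition and emission probabilities are made-up illustration values, and the model is far simpler than what Genscan or GenomeScan actually use.

```python
import math

STATES = ("noncoding", "coding")
# All probabilities below are made-up illustration values, not trained parameters.
START = {"noncoding": 0.5, "coding": 0.5}
TRANS = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
EMIT = {
    "noncoding": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},  # AT-rich
    "coding": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},     # GC-rich
}


def viterbi(seq):
    """Return the most probable state path for seq under the toy HMM."""
    seq = seq.upper()
    if not seq:
        return []
    # Log-probabilities avoid numerical underflow on long sequences.
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # one dict of back-pointers per position after the first
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = score[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][base])
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


if __name__ == "__main__":
    print(viterbi("ATATATATGCGCGCGC"))
```

The state path labels each base as coding or noncoding; HMM-based gene finders extend the same idea with states for exons, introns, promoters and polyA signals.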


Tools

• ORF detectors
– NCBI: http://www.ncbi.nih.gov/gorf/gorf.html
• Promoter predictors
– CSHL: http://rulai.cshl.org/software/index1.htm
– BDGP: http://www.fruitfly.org/seq_tools/promoter.html
– ICG: TATA-Box predictor (http://wwwmgs.bionet.nsc.ru/mgs/programs/bdna/tata_bdna.html)
• PolyA signal predictors
– CSHL: http://argon.cshl.org/tabaska/polyadq_form.html
• Splice site predictors
– BDGP: http://www.fruitfly.org/seq_tools/splice.html
• Start-/stop-codon identifiers
– DNALC: Translator/ORF-Finder (http://www.dnalc.org/bioinformatics/dnalc_nucleotide_analyzer.htm)
– BCM: Searchlauncher (http://searchlauncher.bcm.tmc.edu/seq-util/seq-util.html)

CRITICA

prediction of prokaryotic genes

search for the RBS (ribosome binding site, Shine-Dalgarno sequence)

Principle:

1. Run TBLASTP against a protein database and select the clearly coding parts (usually only parts of genes).
2. Compute a statistical model from these parts.
3. Predict genes.
4. Build a new statistical model from the prediction, predict again, and so on.

Genscan

prediction of eukaryotic genes

different statistical models for the first and last exon

search for promoters, terminators, polyA signals

different statistical parameters for different GC content

www: http://genes.mit.edu/GENSCAN.html

Genscan

probability    exons   exactly   partially   overlaps   error
0.99 - 1.00    917     97.7%     0.9%        0.0%       1.4%
0.95 - 0.99    551     92.4%     3.4%        0.2%       4.0%
0.90 - 0.95    263     87.8%     6.1%        0.4%       5.7%
0.75 - 0.90    337     74.8%     16.0%       1.2%       8.0%
0.50 - 0.75    362     54.1%     26.2%       2.2%       17.4%
0.00 - 0.50    248     29.8%     27.8%       4.0%       38.3%

Genscan - example

GENSCAN 1.0 Date run: 31-Oct-100 Time: 15:54:20
Sequence HERV17_004640 : 40714 bp : 37.79% C+G : Isochore 1 ( 0.00 - 43.00 C+G%)
Parameter matrix: HumanIso.smat

Predicted genes/exons:

Gn.Ex Type S .Begin ...End .Len Fr Ph I/Ac Do/T CodRg P.... Tscr..
----- ---- - ------ ------ ---- -- -- ---- ---- ----- ----- ------
 1.01 Init +   1825   1853   29  0  2   86   71    45 0.579   1.72
 1.02 Term +   3886   4075  190  1  1   85   44   198 0.941  11.04
 1.03 PlyA +   4961   4966    6                               1.05
 2.00 Prom +   6668   6707   40                              -4.65
 2.01 Init +  17251  17375  125  0  2   45   72    80 0.590   1.81
 2.02 Term +  20137  20329  193  1  1   85   43   196 0.990  10.71
 2.03 PlyA +  20809  20814    6                               1.05
 3.08 PlyA -  21608  21603    6                              -3.24
 3.07 Term -  22315  21651  665  2  2  -17   55   522 0.952  31.44
 3.06 Intr -  24268  22592 1677  2  0   81   94  2124 0.885 198.67
…
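Rows of the predicted genes/exons table can be read programmatically by splitting on whitespace. The sketch below is a hypothetical helper, not part of Genscan: it assumes exon rows (Init, Intr, Term) carry 13 whitespace-separated fields and PlyA/Prom rows carry 7, as in the example output, and that the `P....` column is the exon probability and `Tscr..` the score.

```python
from typing import NamedTuple, Optional


class Feature(NamedTuple):
    gene: int
    index: int
    kind: str              # Init, Intr, Term, PlyA, Prom
    strand: str            # "+" or "-"
    begin: int
    end: int
    length: int
    prob: Optional[float]  # exon probability; None for PlyA/Prom rows
    score: float


def parse_line(line):
    """Parse one predicted-feature row; return None for any other line."""
    tokens = line.split()
    if len(tokens) not in (7, 13) or tokens[2] not in {"+", "-"}:
        return None
    try:
        gene, index = (int(part) for part in tokens[0].split("."))
        begin, end, length = (int(t) for t in tokens[3:6])
        prob = float(tokens[11]) if len(tokens) == 13 else None
        score = float(tokens[-1])
    except ValueError:
        return None  # e.g. the dashed separator line under the header
    return Feature(gene, index, tokens[1], tokens[2], begin, end, length, prob, score)


if __name__ == "__main__":
    rows = [
        "1.01 Init + 1825 1853 29 0 2 86 71 45 0.579 1.72",
        "1.02 Term + 3886 4075 190 1 1 85 44 198 0.941 11.04",
        "1.03 PlyA + 4961 4966 6 1.05",
    ]
    features = [f for f in map(parse_line, rows) if f is not None]
    # Keep only exons with probability of at least 0.90.
    print([f for f in features if f.prob is not None and f.prob >= 0.90])
```

Filtering on the probability column is one simple way to keep only the exons that the reliability table marks as most often exactly correct.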

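Prediction quality

Sensitivity = TP / (TP + FN)
How many genes were found out of all present?

Specificity = TP / (TP + FP)
How many predicted genes are indeed genes?

Correlation Coefficient:

$$CC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{PP \cdot PN \cdot RP \cdot RN}}$$

where PP = TP + FP (predicted positives), PN = TN + FN (predicted negatives), RP = TP + FN (real positives), RN = TN + FP (real negatives).

[Figure: real annotation (RP, RN) aligned against the prediction (PP, PN), with segments labelled TP, FP, TN, FN]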
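The three quality measures can be computed directly from the four counts. The sketch below uses Sensitivity = TP / (TP + FN), Specificity = TP / (TP + FP) as defined on this slide, and the correlation coefficient in its standard form (TP·TN − FP·FN) / sqrt(PP·PN·RP·RN); the function name and the example counts are invented for illustration.

```python
import math


def prediction_quality(tp, fp, tn, fn):
    """Return (sensitivity, specificity, correlation coefficient)."""
    sn = tp / (tp + fn)        # share of real positives that were found
    sp = tp / (tp + fp)        # share of predictions that are real
    pp, pn = tp + fp, tn + fn  # predicted positives / negatives
    rp, rn = tp + fn, tn + fp  # real positives / negatives
    cc = (tp * tn - fp * fn) / math.sqrt(pp * pn * rp * rn)
    return sn, sp, cc


if __name__ == "__main__":
    # Invented nucleotide-level counts, for illustration only.
    sn, sp, cc = prediction_quality(tp=90, fp=10, tn=890, fn=10)
    print(f"Sn={sn:.2f} Sp={sp:.2f} CC={cc:.3f}")
```

Note that "specificity" here follows the gene-prediction convention used on the slide (TP over all predictions), which other fields call precision.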

Gene prediction accuracies

• Nucleotide level: 95% Sn, 90% Sp (lows less than 50%)
• Exon level: 75% Sn, 68% Sp (lows less than 30%)
• Gene level: 40% Sn, 30% Sp (lows less than 10%)
• Programs that combine statistical evaluations with similarity searches are the most powerful.

Common difficulties

• First and last exons difficult to annotate because

they contain UTRs.

• Smaller genes are not statistically significant so

they are thrown out.

• Algorithms are trained with sequences from

known genes which biases them against genes

about which nothing is known.

• Masking repeats frequently removes potentially

indicative chunks from the untranslated regions

of genes that contain repetitive elements.


The annotation pipeline

• Mask repeats using RepeatMasker.

• Run sequence through several programs.

• Take predicted genes and do similarity

search against ESTs and genes from

other organisms.

• Do similarity search for non-coding

sequences to find ncRNA.


Annotation nomenclature

• Known Gene – Predicted gene matches the entire length of a known gene.

• Putative Gene – Predicted gene contains region conserved with known gene. Also referred to as “like” or “similar to”.

• Unknown Gene – Predicted gene matches a gene or EST of which the function is not known.

• Hypothetical Gene – Predicted gene that does not contain significant similarity to any known gene or EST.

Gene Ontology (GO)

• URL: http://www.geneontology.org/

• Gene Ontology is

– A hierarchy of roles of genes and gene

products independent of any organism.

– Composed of three independent ontologies:

• molecular function,

• biological process,

• cellular component

– GO itself does not contain any information

on genes or gene products

Luis Tari


Gene Ontology

• Developed by an international consortium
– about 50 members
• Editorial office, 4 full-time editors (ish)
• Many other part-time editors at databases
• Multiple changes made a day
– made live immediately

Evolution of GO

• GO development traditionally annotation-driven
– development directed by use
• Terms added as new species annotated
• Terms added on an as-needed basis
• Resulted in ‘organic’ structure, little formality
• Ontological formality added subsequently
– philosophical and logical

Growth of GO

GO term history 2001 - 2007

[Figure: number of terms (0 to 30,000) by date (Jan-01 to Jan-07); series: obsolete, undefined terms, defined terms]

GO annotations

• http://www.geneontology.org/GO.current.annotations.shtml
• Curators use GO terms to record findings about genes (these records are known as annotations) for various organisms (about 20 of them).
• Different kinds of evidence codes
– Annotations with the IEA (inferred from electronic annotation) evidence code are not manually verified (least reliable)

Structure of GO relationships
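GO terms are linked by relationships such as is_a, so a gene annotated to a specific term is implicitly associated with all of that term's ancestors. A minimal sketch of this idea follows; the small graph and the gene annotations are toy data invented for illustration, not an excerpt of the real ontology, and the function names are hypothetical.

```python
# Toy is_a graph: child term -> list of parent terms (illustrative, not a GO excerpt).
# Lists are used because a term may have more than one parent.
PARENTS = {
    "apoptosis": ["programmed cell death"],
    "programmed cell death": ["cell death"],
    "cell death": ["cellular process"],
    "cellular process": ["biological process"],
    "biological process": [],
}

# Annotations are kept apart from the ontology: gene -> directly annotated terms.
# Gene names are hypothetical.
ANNOTATIONS = {"geneA": {"apoptosis"}, "geneB": {"cell death"}}


def ancestors(term, parents=PARENTS):
    """Return every term reachable from term by following is_a links upward."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def genes_for(term, annotations=ANNOTATIONS):
    """Genes annotated to term directly or to any of its descendants."""
    return {
        gene
        for gene, terms in annotations.items()
        if any(t == term or term in ancestors(t) for t in terms)
    }


if __name__ == "__main__":
    print(sorted(ancestors("apoptosis")))
    print(sorted(genes_for("cell death")))
```

Keeping `PARENTS` and `ANNOTATIONS` as separate structures mirrors the point that GO itself holds no information on genes or gene products; that information lives in the annotation files.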

GO Molecular Function Ontology

• Describes activities, such as catalytic or binding activities,

that can be performed by individual gene products or

assembled complexes of gene products at the molecular

level.

• Example of activities

– transporter activity

• Genes that enable the directed movement of substances

(such as macromolecules, small molecules, ions) into, out

of, within or between cells.

• Example of binding

– insulin receptor binding

• Genes that interact with insulin receptors


GO Biological Process Ontology

• Defined as a biological objective to which the

gene or gene product contributes.

• Examples

– cell proliferation

• Genes that are responsible for the multiplication or

reproduction of cells, resulting in the rapid expansion of a cell

population.

– learning/memory

• Genes involved in the acquisition and processing of information and/or the storage and retrieval of this information over time.

GO Cellular Component Ontology

• Refers to the place in the cell where the gene product is

active.

• Examples

– bud

– nucleus

– cell membrane

GO

An example showing a partial hierarchy of the Gene Ontology that

involves the term apoptosis. Snapshot taken from the TGen GOBrowser.


Example of a gene product

• A gene product has one or more molecular functions and is

used in one or more biological processes; it might be

associated with one or more cellular components.

An example showing all

occurrences of SODC in

the Gene Ontology from

the human annotation.


Common applications of GO

• Analysis of microarray data

– Finding genes with similar functions

– Utilize biological process ontology

• Evaluation of protein-protein interactions

– Proteins are likely to interact if they are in the

same location

– Utilize cellular component ontology


Extension to Ontology?

• We know that APOE is involved in Alzheimer’s disease.

• Based on the Gene Ontology annotation, APOE is involved in the “learning and/or memory” biological process.
• If we ask “is the gene APOE related to Alzheimer’s disease?”
– Yes, because APOE is known to be involved in “learning and/or memory”.
• BUT there is NO ontology that says
– learning and/or memory can influence Alzheimer’s disease
– Degradation of ubiquitin cycle can cause extra long/short half-life of genes
– Extra long/short half-life of genes can cause cancer

Credits

• http://www.dnalc.org/bioinformatics/presentations/hhmi_2003/2003_3.ppt
• Paces and Vondrasek, Bioinformatics course, Univerzita Karlova (UK)