Bioinformatics and Computational Biology
KFC/BIN
VI. Genes: Prediction and Ontology
RNDr. Karel Berka, Ph.D.
Palacký University Olomouc
Gene prediction
A gene is "a locatable region of genomic sequence,
corresponding to a unit of inheritance, which is
associated with regulatory regions, transcribed regions,
and/or other functional sequence regions".
An allele is one variant of a gene; colloquial phrases such as
"good genes" or "hair color gene" usually refer to alleles.
Gregor Mendel
Prediction:
coding (CDS) and non-coding (UTR) sequences in the genome
differ in their information content.
Information content
i l-ve you
hr-jlka ds
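One simple way to put a number on "information content" is the Shannon entropy of the codon (triplet) distribution. The sketch below is only an illustration under that assumption: the two toy sequences are made up, and real gene finders rely on richer statistics such as codon usage or hexamer frequencies.

```python
import math
from collections import Counter

def shannon_entropy(symbols):
    """Shannon entropy in bits per symbol of an iterable of hashable symbols."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def codons(seq, frame=0):
    """Split seq into non-overlapping triplets, starting at offset `frame`."""
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]

# Made-up toy sequences (not real genes): one reuses two codons, one uses eight.
biased = "GCTGCTGCTGAAGCTGAAGCTGCT"
mixed = "ACGTTGCAAGCTCATGGATCCTAG"

h_biased = shannon_entropy(codons(biased))  # few distinct codons, low entropy
h_mixed = shannon_entropy(codons(mixed))    # all codons distinct, high entropy
```

A strongly skewed triplet distribution (low entropy) is the kind of regularity that lets a reader fill the gap in "i l-ve you" but not in "hr-jlka ds".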
The value of genome sequences
lies in their annotation
• Annotation – Characterizing genomic
features using computational and
experimental methods
• Genes: Four levels of annotation
– Gene Prediction – Where are genes?
– What do they look like?
– Domains – What do the proteins do?
– Role – What pathway(s) are they involved in?
How many genes does a human have?
Consortium: 35,000 genes?
Celera: 30,000 genes?
Affymetrix: 60,000 human genes on GeneChips?
Incyte and HGS: over 120,000 genes?
GenBank: 49,000 unique gene coding sequences?
UniGene: > 89,000 clusters of unique ESTs?
Current consensus (in flux …)
• 20,000 known genes (2010)
– (similarity to previously isolated genes and
expressed sequences from a large variety of
different organisms)
– 15 000 known in 2003
• 22,333 predicted (RefSeq)
– problems with prediction algorithms (low
accuracy) (Nature blog 2010)
How do we get from here …
… to here?
What are genes? - 1
• Complete DNA segments responsible for
making functional products
• Products
– Proteins
– Functional RNA molecules
• RNAi (interfering RNA)
• rRNA (ribosomal RNA)
• snRNA (small nuclear)
• snoRNA (small nucleolar)
• tRNA (transfer RNA)
What are genes? - 2
• Definition vs. dynamic concept
• Consider
– Prokaryotic vs. eukaryotic gene models
– Introns/exons
– Posttranscriptional modifications
– Alternative splicing
– Differential expression
– Genes-in-genes
– Genes-ad-genes
– Posttranslational modifications
– Multi-subunit proteins
Prokaryotic gene model: ORF-genes
• “Small” genomes, high gene density
– Haemophilus influenzae genome 85% genic
• Operons
– One transcript, many genes
• No introns
– One gene, one protein
• Open reading frames (ORF)
– One ORF per gene
– ORFs begin with a start codon and end with a stop codon (DNA / RNA)
- TAG / UAG ("amber")
- TAA / UAA ("ochre")
- TGA / UGA ("opal" or "umber")
Mnemonic: UGA: "U Go Away", UAA: "U Are Away", UAG: "U Are Gone"
TIGR: http://cmr.jcvi.org/tigr-scripts/CMR/CmrHomePage.cgi
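The "start, triplets, stop" rule can be sketched as a naive six-frame ORF scan. This is a minimal sketch, not how any particular ORF finder is implemented: it assumes ATG as the only start codon (prokaryotes also use alternatives), and the function names and the `min_codons` parameter are made up for illustration.

```python
START = "ATG"
STOPS = {"TAG", "TAA", "TGA"}  # amber, ochre, opal

def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def find_orfs(seq, min_codons=2):
    """Return (strand, frame, start, end) for every ATG...stop ORF.

    Coordinates are 0-based, half-open, and relative to the scanned strand.
    min_codons counts the start and the stop codon as well.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == START:
                    start = i                      # open an ORF at the first ATG
                elif start is not None and codon in STOPS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3))
                    start = None                   # close it at the first in-frame stop
    return orfs

orfs = find_orfs("CCATGAAATAGGG")  # toy sequence containing ATG AAA TAG
```

On real genomes a length threshold much larger than two codons is needed, because short ORFs occur by chance in every frame.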
Eukaryotic gene model: spliced genes
Posttranscriptional modification
5’-CAP, polyA tail, splicing
Open reading frames
Mature mRNA contains ORF
All internal exons contain open “read-through”
Pre-start and post-stop sequences are UTRs
Multiple translation products
One gene – many proteins via alternative splicing
Expansions and Clarifications
• ORFs
– Start – triplets – stop
– Prokaryotes: gene = ORF
– Eukaryotes: spliced genes or ORF genes
• Exons
– Remain after introns have been removed
– Flanking parts contain non-coding sequence (5’-
and 3’-UTRs)
Where do genes live?
• In genomes
• Example: the human genome
– 3,274,571,503 bp (Ensembl 2010)
– 25 chromosomes: 1-22, X, Y, mt
– 22,333 genes (RefSeq estimate 2010)
– Gene size: from 128 nucleotides (RNA gene) to 2,800 kb (DMD)
– Ca. 25% of the genome is genes (introns, exons)
– Ca. 1% of the genome codes for amino acids (CDS)
– 30 kb gene length (average)
– 1.4 kb ORF length (average)
– 3 transcripts per gene (average)
Genomic sequence features
• Repeats – Transposable elements, simple repeats
– RepeatMasker (http://www.repeatmasker.org/)
• Genes
– Vary in density, length, structure
– Identification depends on evidence and methods and may require concerted application of bioinformatics methods and lab research
• Pseudogenes
– Look-alikes of genes that obstruct gene-finding efforts
• Non-coding RNAs (ncRNA)
– tRNA, rRNA, snRNA, snoRNA, miRNA
– tRNAscan-SE, COVE (http://selab.janelia.org/software.html)
Gene identification
• Homology-based gene prediction
– Similarity searches (e.g. BLAST, BLAT)
– Genome browsers
– RNA evidence (ESTs: expressed sequence tags from cDNA)
• Ab initio gene prediction
– Gene prediction programs
– Prokaryotes
• ORF identification
– Eukaryotes
• Promoter prediction
• PolyA-signal prediction
• Splice site, start/stop-codon predictions
Gene prediction through comparative genomics
• Highly similar (conserved) regions between two
genomes are likely to be functional; otherwise they would have diverged
• If the genomes are too closely related, all regions are similar,
not just genes
• If the genomes are too far apart, corresponding regions may be
too dissimilar to be found
Genome Browsers
Generic Genome Browser (CSHL)
www.wormbase.org/db/seq/gbrowse
NCBI Map Viewer
www.ncbi.nlm.nih.gov/mapview/
Ensembl Genome Browser
www.ensembl.org/
Apollo Genome Browser
www.bdgp.org/annot/apollo/
UCSC Genome Browser
genome.ucsc.edu/cgi-bin/hgGateway?org=human
Gene discovery using ESTs
• Expressed Sequence Tags (ESTs)
represent sequences from expressed
genes.
• If a region closely matches an EST,
then the region is probably a gene
or pseudogene.
– An EST overlapping an exon boundary gives an
accurate prediction of that boundary.
Ab initio gene prediction
• Prokaryotes
– ORF-Detectors
• Eukaryotes
– Position, extent & direction: through promoter
and polyA-signal predictors
– Structure: through splice site predictors
– Exact location of coding sequences: through
determination of relationships between
potential start codons, splice sites, ORFs, and
stop codons
How it works I – ORFs (SWF film)
How it works II – Motif identification
Exon-intron borders = splice sites
Aligning the borders reveals the conserved dinucleotides GT at the start and AG at the end of each intron (upper case):
    Exon    |            Intron             |    Exon
~~gaggcatcag|GTttgtagac~~~~~~~~~~~tgtgtttcAG|tgcacccact~~
~~ccgccgctga|GTgagccgtg~~~~~~~~~~~tctattctAG|gacgcgcggg~~
~~tgtgaattag|GTaagaggtt~~~~~~~~~~~atatctccAG|atggagatca~~
~~ccatgaggag|GTgagtgcca~~~~~~~~~~~ttatttccAG|gtatgagacg~~
       Splice site                     Splice site
Motif Extraction Programs at http://www-btls.jst.go.jp/
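The idea behind motif identification can be sketched by counting bases column by column across the aligned intron ends from the slide. This is a toy illustration with hypothetical helper names, not the method of any particular motif extraction program; four sequences are far too few for real statistics.

```python
from collections import Counter

# First and last 10 intron bases of the four examples on the slide.
introns = [
    ("gtttgtagac", "tgtgtttcag"),
    ("gtgagccgtg", "tctattctag"),
    ("gtaagaggtt", "atatctccag"),
    ("gtgagtgcca", "ttatttccag"),
]

def column_counts(seqs):
    """Per-position base counts for equal-length sequences (a tiny count matrix)."""
    return [Counter(column) for column in zip(*seqs)]

def consensus(seqs):
    """Most common base per column, upper case where all sequences agree."""
    out = []
    for counts in column_counts(seqs):
        base, n = counts.most_common(1)[0]
        out.append(base.upper() if n == len(seqs) else base)
    return "".join(out)

donor = consensus([five_prime for five_prime, _ in introns])       # intron start
acceptor = consensus([three_prime for _, three_prime in introns])  # intron end
```

In both consensus strings the fully conserved positions come out in upper case: the GT at the intron start and the AG at the intron end. Splice site predictors turn such column counts into position weight matrices and score every candidate GT and AG in a sequence.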
How it works III – The (ugly) truth
Gene prediction programs
• Homology-based programs
– Use BLAST-like similarity searches.
– Examples: Exofish, CRITICA
• Rule-based programs
– Use an explicit set of rules to make decisions.
– Example: GeneFinder
• Neural network-based programs
– Use a data set to build rules.
– Examples: Grail, GrailEXP
• Hidden Markov Model-based programs
– Use probabilities of states and transitions between these states to predict features.
– Examples: Genscan, GenomeScan, GeneMark
Tools
• ORF detectors
– NCBI: http://www.ncbi.nih.gov/gorf/gorf.html
• Promoter predictors
– CSHL: http://rulai.cshl.org/software/index1.htm
– BDGP: fruitfly.org/seq_tools/promoter.html
– ICG: TATA-Box predictor
• PolyA signal predictors
– CSHL: argon.cshl.org/tabaska/polyadq_form.html
• Splice site predictors
– BDGP: http://www.fruitfly.org/seq_tools/splice.html
• Start-/stop-codon identifiers
– DNALC: Translator/ORF-Finder
– BCM: Searchlauncher
CRITICA
prediction of prokaryotic genes
search for the RBS (ribosome binding site, Shine-Dalgarno
sequence)
Principle:
BLAST search against a sequence database and selection of the clearly
coding parts (usually only parts of the genes).
Calculation of a statistical model from these parts.
Prediction of genes.
New statistical model from the predicted genes, new prediction, and so on iteratively.
Genscan
prediction of eukaryotic genes
different statistical models for the first and last
exon
search for promoters, terminators, polyA signals
different statistical parameters for different GC content
www:
http://genes.mit.edu/GENSCAN.html
Genscan
probability   exons  exactly  partially  overlaps  error
0.99 - 1.00   917    97.7%    0.9%       0.0%      1.4%
0.95 - 0.99   551    92.4%    3.4%       0.2%      4.0%
0.90 - 0.95   263    87.8%    6.1%       0.4%      5.7%
0.75 - 0.90   337    74.8%    16.0%      1.2%      8.0%
0.50 - 0.75   362    54.1%    26.2%      2.2%      17.4%
0.00 - 0.50   248    29.8%    27.8%      4.0%      38.3%
GENSCAN 1.0 Date run: 31-Oct-100 Time: 15:54:20
Sequence HERV17_004640 : 40714 bp : 37.79% C+G : Isochore 1 ( 0.00 -
43.00 C+G%)
Parameter matrix: HumanIso.smat
Predicted genes/exons:
Gn.Ex Type S .Begin ...End .Len Fr Ph I/Ac Do/T CodRg P.... Tscr..
----- ---- - ------ ------ ---- -- -- ---- ---- ----- ----- ------
1.01 Init + 1825 1853 29 0 2 86 71 45 0.579 1.72
1.02 Term + 3886 4075 190 1 1 85 44 198 0.941 11.04
1.03 PlyA + 4961 4966 6 1.05
2.00 Prom + 6668 6707 40 -4.65
2.01 Init + 17251 17375 125 0 2 45 72 80 0.590 1.81
2.02 Term + 20137 20329 193 1 1 85 43 196 0.990 10.71
2.03 PlyA + 20809 20814 6 1.05
3.08 PlyA - 21608 21603 6 -3.24
3.07 Term - 22315 21651 665 2 2 -17 55 522 0.952 31.44
3.06 Intr - 24268 22592 1677 2 0 81 94 2124 0.885 198.67
…
Genscan - example
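The tabular output above is easy to post-process, for example to keep only exons with a high probability P. The parser below is a minimal sketch that assumes the whitespace-separated layout of the rows shown here; the `Feature` type and function name are hypothetical, and the `Sngl` (single-exon gene) type is an assumption, since it does not occur in this excerpt.

```python
from typing import NamedTuple, Optional

EXON_TYPES = {"Init", "Intr", "Term", "Sngl"}  # "Sngl" assumed, not in the excerpt
SIGNAL_TYPES = {"Prom", "PlyA"}

class Feature(NamedTuple):
    gene_exon: str          # Gn.Ex column, e.g. "1.02"
    kind: str               # Type column
    strand: str             # S column
    begin: int
    end: int
    length: int
    prob: Optional[float]   # P column; None for Prom/PlyA rows
    score: float            # Tscr column

def parse_genscan_rows(text):
    """Parse exon and signal rows; header, separator and other lines are skipped."""
    features = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 7 or fields[1] not in EXON_TYPES | SIGNAL_TYPES:
            continue
        # On exon rows P is the second-to-last column; signal rows have no P.
        prob = float(fields[-2]) if fields[1] in EXON_TYPES else None
        features.append(Feature(fields[0], fields[1], fields[2], int(fields[3]),
                                int(fields[4]), int(fields[5]), prob, float(fields[-1])))
    return features

SAMPLE = """\
Gn.Ex Type S .Begin ...End .Len Fr Ph I/Ac Do/T CodRg P.... Tscr..
----- ---- - ------ ------ ---- -- -- ---- ---- ----- ----- ------
 1.01 Init + 1825 1853 29 0 2 86 71 45 0.579 1.72
 1.02 Term + 3886 4075 190 1 1 85 44 198 0.941 11.04
 1.03 PlyA + 4961 4966 6 1.05
 2.00 Prom + 6668 6707 40 -4.65
 3.07 Term - 22315 21651 665 2 2 -17 55 522 0.952 31.44
"""

features = parse_genscan_rows(SAMPLE)
confident = [f.gene_exon for f in features if f.prob is not None and f.prob >= 0.90]
```

Note that on the minus strand Begin is larger than End, so coordinates should be normalized before comparing predictions with other annotation.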
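Prediction quality
Sensitivity = TP / (TP + FN)
How many of the genes actually present were found?
Specificity = TP / (TP + FP)
How many of the predicted genes are indeed genes?
Correlation Coefficient = (TP · TN − FP · FN) / √(PP · PN · RP · RN)
where RP = TP + FN (real positives), RN = TN + FP (real negatives),
PP = TP + FP (predicted positives), PN = TN + FN (predicted negatives)
[Figure: the real annotation (RP, RN) aligned against the prediction (PP, PN); regions along the sequence are labelled TP, FP, TN and FN.]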
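These measures can be computed at the nucleotide level from two per-base labellings. A minimal sketch, assuming the real and predicted annotations are given as equal-length strings of 0/1 (1 = coding); the function names and the toy strings are made up for illustration.

```python
import math

def confusion_counts(real, predicted):
    """Count TP, FP, TN, FN from two equal-length 0/1 strings (1 = coding)."""
    if len(real) != len(predicted):
        raise ValueError("annotations must have the same length")
    pairs = list(zip(real, predicted))
    tp = sum(r == "1" and p == "1" for r, p in pairs)
    fp = sum(r == "0" and p == "1" for r, p in pairs)
    tn = sum(r == "0" and p == "0" for r, p in pairs)
    fn = sum(r == "1" and p == "0" for r, p in pairs)
    return tp, fp, tn, fn

def prediction_quality(tp, fp, tn, fn):
    """Return (sensitivity, specificity, correlation coefficient) as defined above."""
    sn = tp / (tp + fn)
    sp = tp / (tp + fp)   # specificity in the gene-prediction sense
    rp, rn = tp + fn, tn + fp
    pp, pn = tp + fp, tn + fn
    cc = (tp * tn - fp * fn) / math.sqrt(pp * pn * rp * rn)
    return sn, sp, cc

real = "0011110000"       # toy annotation: bases 2-5 are coding
predicted = "0001111000"  # toy prediction: shifted by one base
sn, sp, cc = prediction_quality(*confusion_counts(real, predicted))
```

The correlation coefficient is undefined when any of PP, PN, RP or RN is zero (for example, when nothing is predicted), so a real evaluation has to handle that case separately.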
Gene prediction accuracies
• Nucleotide level: 95% Sn, 90% Sp (lows less than
50%)
• Exon level: 75% Sn, 68% Sp (lows less than 30%)
• Gene level: 40% Sn, 30% Sp (lows less than 10%)
• Programs that combine statistical evaluations with
similarity searches are the most powerful.
Common difficulties
• First and last exons difficult to annotate because
they contain UTRs.
• Smaller genes are not statistically significant so
they are thrown out.
• Algorithms are trained with sequences from
known genes which biases them against genes
about which nothing is known.
• Masking repeats frequently removes potentially
indicative chunks from the untranslated regions
of genes that contain repetitive elements.
The annotation pipeline
• Mask repeats using RepeatMasker.
• Run sequence through several programs.
• Take predicted genes and do similarity
search against ESTs and genes from
other organisms.
• Do similarity search for non-coding
sequences to find ncRNA.
Annotation nomenclature
• Known Gene – Predicted gene matches the entire length of a known gene.
• Putative Gene – Predicted gene contains region conserved with known gene. Also referred to as “like” or “similar to”.
• Unknown Gene – Predicted gene matches a gene or EST of which the function is not known.
• Hypothetical Gene – Predicted gene that does not contain significant similarity to any known gene or EST.
Gene Ontology (GO)
• URL: http://www.geneontology.org/
• Gene Ontology is
– A hierarchy of roles of genes and gene
products independent of any organism.
– Composed of three independent ontologies:
• molecular function,
• biological process,
• cellular component
– GO itself does not contain any information
on genes or gene products
Luis Tari
Gene Ontology
• Developed by an international consortium
– about 50 members
• Editorial office, 4 full-time editors (ish)
• Many other part-time editors at databases
• Multiple changes made a day
– made live immediately
Evolution of GO
• GO development traditionally annotation-driven
– development directed by use
• Terms added as new species annotated
• Terms added on an as-needed basis
• Resulted in ‘organic’ structure, little formality
• Ontological formality added subsequently
– philosophical and logical
Growth of GO
[Figure: GO term history 2001 - 2007. Number of terms (axis 0 to 30,000) by date (Jan-01 to Jan-07), split into defined terms, undefined terms and obsolete terms.]
GO annotations
• http://www.geneontology.org/GO.current.annotations.shtml
• Curators record their findings about genes and gene products as
GO annotations, for various organisms (about 20 of them).
• Different kinds of evidence codes
– Annotations with the IEA (inferred from electronic
annotation) evidence code are not manually
verified (least reliable)
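In practice this often means filtering annotations by evidence code before an analysis. A minimal sketch with made-up toy records (gene, GO term, evidence code); these tuples are for illustration only and are not real GO annotations or a real file format.

```python
# Toy records: (gene symbol, GO term name, evidence code). Illustrative only.
annotations = [
    ("GENE_A", "transporter activity", "IDA"),       # inferred from direct assay
    ("GENE_A", "cell membrane", "IEA"),              # electronic, not manually verified
    ("GENE_B", "insulin receptor binding", "IEA"),
    ("GENE_B", "cell proliferation", "IMP"),         # inferred from mutant phenotype
]

def manually_curated(records):
    """Keep only annotations whose evidence code is not IEA."""
    return [record for record in records if record[2] != "IEA"]

curated = manually_curated(annotations)
```

Dropping IEA annotations trades coverage for reliability: fewer genes keep any annotation at all, but those that remain have been reviewed by a curator.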
Structure of GO relationships
GO Molecular Function Ontology
• Describes activities, such as catalytic or binding activities,
that can be performed by individual gene products or
assembled complexes of gene products at the molecular
level.
• Example of activities
– transporter activity
• Genes that enable the directed movement of substances
(such as macromolecules, small molecules, ions) into, out
of, within or between cells.
• Example of binding
– insulin receptor binding
• Genes that interact with insulin receptors
GO Biological Process Ontology
• Defined as a biological objective to which the
gene or gene product contributes.
• Examples
– cell proliferation
• Genes that are responsible for the multiplication or
reproduction of cells, resulting in the rapid expansion of a cell
population.
– learning/memory
• Genes involved in the acquisition and processing of information
and/or the storage and retrieval of this information over time.
GO Cellular Component Ontology
• Refers to the place in the cell where the gene product is
active.
• Examples
– bud
– nucleus
– cell membrane
GO
An example showing a partial hierarchy of the Gene Ontology that
involves the term apoptosis. Snapshot taken from the TGen GOBrowser.
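A hierarchy like this is a directed acyclic graph: a term may have several parents, and a gene product annotated to a term is implicitly annotated to all of its ancestors. The sketch below uses a simplified, hypothetical excerpt around apoptosis, not the actual GO graph, and the function and gene names are made up.

```python
# Simplified toy excerpt (child -> parents); NOT the real GO structure.
PARENTS = {
    "apoptosis": ["programmed cell death"],
    "programmed cell death": ["cell death", "cellular process"],
    "cell death": ["cellular process"],
    "cellular process": ["biological process"],
    "biological process": [],
}

def ancestors(term, parents=PARENTS):
    """All terms reachable from `term` by following parent links."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen

def propagate(direct_annotations, parents=PARENTS):
    """Expand {gene: set of direct terms} to include every ancestor term."""
    return {
        gene: set(terms).union(*(ancestors(t, parents) for t in terms))
        for gene, terms in direct_annotations.items()
    }

expanded = propagate({"GENE_X": {"apoptosis"}})  # GENE_X is a placeholder name
```

Because "cellular process" is reachable along two paths, the traversal keeps a set of visited terms so that each ancestor is reported only once.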
Example of a gene product
• A gene product has one or more molecular functions and is
used in one or more biological processes; it might be
associated with one or more cellular components.
An example showing all
occurrences of SODC in
the Gene Ontology from
the human annotation.
Common applications of GO
• Analysis of microarray data
– Finding genes with similar functions
– Utilize biological process ontology
• Evaluation of protein-protein interactions
– Proteins are likely to interact if they are in the
same location
– Utilize cellular component ontology
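For microarray data, "finding genes with similar functions" is commonly done as an enrichment test: is a GO term more frequent among the selected genes than expected by chance? A minimal sketch using the hypergeometric distribution, with a hypothetical function name and made-up numbers; it needs Python 3.8+ for `math.comb` and ignores multiple-testing correction.

```python
from math import comb

def enrichment_p_value(population, annotated, selected, hits):
    """P(X >= hits) for a hypergeometric draw.

    population: number of genes on the array
    annotated:  genes in the population carrying the GO term
    selected:   genes in the list of interest (e.g. differentially expressed)
    hits:       selected genes carrying the GO term
    """
    upper = min(annotated, selected)
    favourable = sum(
        comb(annotated, i) * comb(population - annotated, selected - i)
        for i in range(hits, upper + 1)
    )
    return favourable / comb(population, selected)

# Made-up example: 200 of 20,000 genes carry the term, and 10 of them
# appear in a list of 100 selected genes (about 1 would be expected by chance).
p = enrichment_p_value(20000, 200, 100, 10)
```

A small p-value suggests that the term is over-represented in the list; since many terms are tested at once, the p-values should be corrected before interpretation.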
Extension to Ontology?
• We know that APOE is involved in Alzheimer’s disease.
• Based on the Gene Ontology annotation, APOE is involved in the “learning and/or memory” biological process.
• If we ask “is the gene APOE related to Alzheimer’s disease?”
– Yes, because APOE is known to be involved in “learning
and/or memory”.
• BUT there is NO ontology that says
– learning and/or memory can influence Alzheimer’s disease
– Degradation of ubiquitin cycle can cause extra long/short half-life of genes
– Extra long/short half-life of genes can cause cancer
Credits
• http://www.dnalc.org/bioinformatics/presentations/hhmi_2003/2003_3.ppt
• Paces and Vondrasek, Bioinformatics course,
Charles University (UK)