Slide 1
Orthology predictions for whole mammalian genomes Leo Goodstadt MRC Functional Genomics Unit Oxford University
Slide 2
Finishing
  Evolution of Orthologues: Selection pressures in orthologues and paralogs
  Gene Duplications: Reproduction, immunity or chemosensation
  Synonymous substitution rates: Mutation and selection vary by chromosome size
  Gene birth in the human lineage: Ongoing duplications underlie polymorphism
Slide 3
Orthology is the key
Slide 4
How it started
  We are consumers of orthology / paralogy
  Started off using Ensembl predictions
  Ensembl 1:1 covered 50% of predicted mouse genes. Ewan's manual survey said 80%
Slide 5
Paralogues evolve fast (and are fun!) 1) General observations for all mammalian genomes
Slide 6
2) Observations for whole clades of species
[Figure: lineage-specific dN/dS by species, axis from 0.00 to 0.14. Drosophila: dmel, dsim, dyak, dere, dana, dpse, dvir, dmoj, dgri. Nematodes: cele, cbri, crem, c2801. Amniotes: hsap, mmus, cfam, mdom, oana, ggal.]
5) Treasure trove in the details
clade: #2 (ortholog_id = 17117 in panda)
159 mus genes
47 genes new to assembly 36
10 genes completely new to assembly 36
Interpro matches for this clade:
!!! Expansion mainly on chr5 and 14, although single (pseudogene?) versions on chr13 and chr16.
!!! Mouse DLG5 is: chr14:22,966,420-22,978,653 (expressed in testis: AK147699)
gene identifier      order chrm exons stop length
-------------------- ----- ---- ----- ---- ------
MUS_GENE_21705 6639 5 spermatogenesis associated glutamate (E)-rich protein 1, pseudogene 1 ; ENSMUSP00000086007 4 182
MUS_GENE_22420 6643 5 predicted gene, EG623898 ; ENSMUSP00000099126 2 72 <
MUS_GENE_19599 6646 5 spermatogenesis associated glutamate (E)-rich protein 1, pseudogene 1 (Speer1-ps1) on chromosome 5 ; NCBIMUSP_83776567 4 157 <
MUS_GENE_23688 6651 5 predicted gene, EG623898 ; ENSMUSP00000094421 2 72
MUS_GENE_19774 6657 5 spermatogenesis associated glutamate (E)-rich protein 3 ;
Ongoing mouse inparalogue analysis: Lots and lots of reproductive genes
Slide 10
Secretoglobin Protein Family members: Androgen-binding proteins. Emes et al. (2004) Genome Res. 14(8):1516-29 6) Candidates for evolutionary and functional analyses
How do we find function in the genome? Nothing in Biology Makes Sense Except in the Light of Evolution. Theodosius Dobzhansky (1900-1975).
Slide 13
How to find the function in the genome? Similar Sequences Common Ancestry (homology) Similar Structures / Folds Similar Functions ? (Genes / Genome regions)
Slide 14
How much of the genome is functional? Compare with the mouse
[Figure: conservation score distributions for ARs and Whole Genome]
Ancestral Repetitive (AR) sequence is non-functional and has evenly distributed conservation scores (red) (symmetrical bell shape due to biological variation)
Whole Genome sequence contains some functional sequence under selection and thus has a small excess of conserved sequence under purifying selection (asymmetrical)
Functional sequence = Whole Genome - Ancestral Repetitive = 5%
N.B. This is an estimate that doesn't take into account sequence:
  Turning over rapidly (not shared by mouse/human)
  Under positive (diversifying) selection
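The subtraction on this slide can be illustrated with a small sketch. It assumes both distributions are available as histograms over the same conservation-score bins and treats the whole genome as a mixture of neutral (AR-like) and selected sequence. The function name, the bin counts and the `min_neutral_density` threshold are hypothetical, and this lower-bound mixture estimate is only one way to do the subtraction, not necessarily the one behind the 5% figure.

```python
def functional_fraction(whole_counts, neutral_counts, min_neutral_density=0.01):
    """Lower-bound estimate of the fraction of sequence under selection.

    Both arguments are histograms over the same conservation-score bins.
    The whole genome is modelled as p0 * neutral + (1 - p0) * selected;
    the largest p0 that leaves a non-negative remainder in every bin is
    the smallest whole/neutral density ratio.
    """
    if len(whole_counts) != len(neutral_counts):
        raise ValueError("histograms must share the same bins")
    whole_total = sum(whole_counts)
    neutral_total = sum(neutral_counts)
    whole = [c / whole_total for c in whole_counts]
    neutral = [c / neutral_total for c in neutral_counts]
    # Ignore bins where the neutral density is too small for a stable ratio.
    ratios = [w / n for w, n in zip(whole, neutral) if n >= min_neutral_density]
    if not ratios:
        raise ValueError("no bin has enough neutral density")
    neutral_share = min(1.0, min(ratios))
    return 1.0 - neutral_share


# Toy histograms (hypothetical): the genome is the AR shape plus a small
# excess in the most conserved bin.
ar_counts = [10, 20, 40, 20, 10]
genome_counts = [95, 190, 380, 190, 145]
print(round(functional_fraction(genome_counts, ar_counts), 3))
```

As the N.B. on the slide says, an estimate of this kind misses sequence that is turning over rapidly or is under positive selection.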
Slide 15
The human genome (euchromatic sequence)
  Unknown (old repetitive junk?)
  Protein coding: 1.2%
  UTR: 0.3%
  Repeats (Transposable elements, ...) ~45%
  Conserved non-coding (3.5% ?)
  Neutral
Slide 16
Conserved non-coding material
  Transcription factor binding sites
  Enhancers, insulators and other non-transcribed regulatory elements
  Alternative splicing signals
  Transfer RNAs, ribosomal RNAs
  Small RNAs (e.g. snoRNAs, microRNAs, siRNAs and piRNAs): regulatory / gene silencing / RNA degradation
  MacroRNAs (e.g. Xist): enzymatic? / chromosome inactivation
Slide 17
Functional parts of genes are highly conserved
Slide 18
How many protein coding genes?
  Walter Gilbert [1980s] 100k
  Antequera & Bird [1993] 70-80k
  John Quackenbush et al. (TIGR) [2000] 120k
  Ewing & Green [2000] 30k
  Tetraodon analysis [2001] 35k
  Human Genome Project (public) [2001] ~31k
  Human Genome Project (Celera) [2001] 24-40k
  Mouse Genome Project (public) [2002] 25k-30k
  Lee Rowen [2003] 25,947
  Human Genome Project (finishing) [2004] 20-25k
  Current predictions [2008] 19-20k
Slide 19
Traditional Genome Orthology
Reciprocal BLAST best hits between longest transcript of each gene (+ synteny)
Assumes:
  Protein similarity is proportional to evolutionary distance (selection is invariant!)
  Pairwise relationships adequately represent the evolutionary tree
  No gene losses or missing predictions
  Alternative splicing can be ignored!
  No gene translocations after tandem duplication
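A minimal sketch of reciprocal best hits between two genomes, assuming the all-against-all search has already been reduced to a score per (query, subject) pair for the longest transcript of each gene. The dictionary layout, function names and gene identifiers are hypothetical, and the synteny step is left out.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring subject gene.

    scores: dict {(query, subject): score}, e.g. BLAST bit scores.
    Ties keep whichever hit was seen first.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical scores: M2a and M2b are a mouse-specific duplication of the H2 orthologue.
human_vs_mouse = {("H1", "M1"): 200, ("H1", "M2a"): 50,
                  ("H2", "M2a"): 180, ("H2", "M2b"): 170}
mouse_vs_human = {("M1", "H1"): 200, ("M2a", "H1"): 50,
                  ("M2a", "H2"): 180, ("M2b", "H2"): 170}
print(reciprocal_best_hits(human_vs_mouse, mouse_vs_human))
```

In this toy input M2b is dropped even though its best hit is H2, which is the sense in which the method only yields 1:1 pairs.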
Slide 20
Orthology prediction methods
  Two genomes: Reciprocal best blast hit
  Multiple genomes: Clustering of reciprocal best hits
[Figure labels: protein similarities; Query; Blast hits]
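For the multiple-genome case, one simple way to cluster reciprocal best hits is single linkage: any two genes joined by a chain of reciprocal best hits end up in the same group. The sketch below uses a union-find structure for this. The choice of single linkage and the gene names are assumptions for illustration; published clustering schemes use stricter criteria.

```python
def cluster_pairs(pairs):
    """Group genes connected by reciprocal-best-hit pairs (single linkage)."""
    parent = {}

    def find(gene):
        parent.setdefault(gene, gene)
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]  # path halving
            gene = parent[gene]
        return gene

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for gene in parent:
        clusters.setdefault(find(gene), set()).add(gene)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical reciprocal best hits from three pairwise genome comparisons.
rbh_pairs = [("hsap_G1", "mmus_G1"), ("mmus_G1", "cfam_G1"), ("hsap_G2", "mmus_G2")]
print(cluster_pairs(rbh_pairs))
```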
Slide 21
Reciprocal Blast Best Hits
Advantages:
  Fast, well understood
  Works well for distant lineages
  Can correlate with protein structure (domains)
Disadvantages:
  Only provides 1:1 orthologues in the best case
  Can be difficult to reconcile with the species tree
Slide 22
Genes on chromosome of species 1 Genes on chromosome of species 2
Slide 23
Reciprocal Blast Best Hits
Slide 24
?
Slide 25
How to add duplicated genes? Synteny
  Ensembl compara in the past
  Local gene order tends to be conserved in mammalian lineages
  Look for inparalogs locally even if the protein distances don't add up (sequence error, sampling error, etc.)
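The local search can be sketched as a scan over gene order around a gene that already has an orthologue. Everything below is an assumption made for illustration (the function name, the window size, the score threshold and the similarity lookup); it is not a description of how Ensembl compara implemented this step.

```python
def local_inparalog_candidates(anchor, gene_order, similarity, window=5, min_score=50.0):
    """Return genes near `anchor` in gene order that are similar to it.

    gene_order: gene ids along one chromosome, in positional order.
    similarity: dict {frozenset({gene1, gene2}): score}.
    """
    position = gene_order.index(anchor)
    start = max(0, position - window)
    end = min(len(gene_order), position + window + 1)
    candidates = []
    for gene in gene_order[start:end]:
        if gene == anchor:
            continue
        if similarity.get(frozenset((anchor, gene)), 0.0) >= min_score:
            candidates.append(gene)
    return candidates


# Hypothetical chromosome: g5 is a nearby homolog of g4, g3 is only weakly
# similar, and g8 is similar but outside the window.
order = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"]
scores = {frozenset(("g4", "g5")): 120.0,
          frozenset(("g4", "g3")): 30.0,
          frozenset(("g4", "g8")): 200.0}
print(local_inparalog_candidates("g4", order, scores, window=2))
```

Genes returned this way are only candidates: a nearby homolog is not necessarily an inparalog.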
Slide 26
? Blast Best Hits in Local Regions
Slide 27
Slide 28
Problems with relying only on synteny
Local homologs are often not inparalogs:
  Local rearrangements
  Missing predictions (neighbouring orphans)
Need sanity checking
Slide 29
Human and Mouse chromosomes:
  Extensive rearrangements only over larger regions
  Conservation of gene order in the short range
Slide 30
Olfactory Orthology from compara
[Figure: Mouse chromosome 2 against Rat chromosome 3. Legend: One to one, One to many, Many to many, Many to one]
Slide 31
Olfactory Orthology
[Figure: Mouse chromosome 2 against Rat chromosome 3. Legend: One to one, One to many, Many to many, Many to one]
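The four categories in the legend can be derived from a list of predicted orthologue pairs by grouping connected genes and counting how many fall on each side. The sketch below assumes pairs are given as (mouse gene, rat gene); the function name and gene identifiers are hypothetical.

```python
from collections import defaultdict


def classify_orthology(pairs):
    """Group (gene_a, gene_b) pairs and label each group one/many to one/many."""
    a_to_b = defaultdict(set)
    b_to_a = defaultdict(set)
    for a, b in pairs:
        a_to_b[a].add(b)
        b_to_a[b].add(a)

    seen_a = set()
    groups = []
    for start in sorted(a_to_b):
        if start in seen_a:
            continue
        group_a, group_b = {start}, set()
        frontier = [start]
        # Walk back and forth across the pairs until the group stops growing.
        while frontier:
            a = frontier.pop()
            for b in a_to_b[a]:
                if b not in group_b:
                    group_b.add(b)
                    for other_a in b_to_a[b]:
                        if other_a not in group_a:
                            group_a.add(other_a)
                            frontier.append(other_a)
        seen_a |= group_a
        label = "{} to {}".format("one" if len(group_a) == 1 else "many",
                                  "one" if len(group_b) == 1 else "many")
        groups.append((sorted(group_a), sorted(group_b), label))
    return groups


# Hypothetical mouse (m*) and rat (r*) orthologue pairs.
example_pairs = [("m1", "r1"),
                 ("m2", "r2a"), ("m2", "r2b"),
                 ("m3a", "r3"), ("m3b", "r3"),
                 ("m4a", "r4a"), ("m4b", "r4a"), ("m4b", "r4b")]
for mouse_genes, rat_genes, label in classify_orthology(example_pairs):
    print(label, mouse_genes, rat_genes)
```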
Slide 32
Inparanoid
Remm, M., Storm, C.E. and Sonnhammer, E.L.L. (2001) Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. J. Mol. Biol. 314, 1041-1052.
  Avoids multiple alignments and phylogenetic methods for speed and to avoid errors
  Heuristics are implicitly phylogenetic
Slide 33
How Inparanoid works
  Longest Transcripts
  Pairwise alignment scores
  Reciprocal Best Hits are orthologues
  Add lineage-specific duplicates (inparalogs), with confidences
  Resolve conflicts
  Use cutoff
  Orthology
Slide 34
Identify main orthologues; identify inparalog candidates
  Longest Transcripts
  Pairwise alignment scores
  Reciprocal Best Hits are orthologues
  Add lineage-specific duplicates (inparalogs), with confidences
  Resolve conflicts
  Use cutoff
  Orthology
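The inparalog step can be sketched for a single main orthologue pair that has already been chosen by reciprocal best hit. A gene from the same species is kept as an inparalog when it scores at least as well against the main orthologue as the main orthologue scores against its partner in the other species. The confidence normalisation used here (dividing by the self-score minus the orthologue score) is an assumption based on Remm et al. (2001); the function name, argument layout and scores are hypothetical, and conflict resolution between overlapping groups is left out.

```python
def inparalogs_with_confidence(partner_score, self_score, within_scores):
    """Select inparalogs of one main orthologue and give each a confidence.

    partner_score: score between the two main orthologues (A vs B).
    self_score: score of the main orthologue aligned to itself (A vs A).
    within_scores: dict {other gene in the same species: score vs A}.
    Returns {gene: confidence between 0.0 and 1.0}.
    """
    span = self_score - partner_score
    if span <= 0:
        return {}
    result = {}
    for gene, score in within_scores.items():
        # Closer to A than A is to its orthologue B: lineage-specific duplicate.
        if score >= partner_score:
            result[gene] = min(1.0, (score - partner_score) / span)
    return result


# Hypothetical scores for main orthologue A1 (A1 vs B1 = 400, A1 vs A1 = 1000).
candidates = {"A2": 700, "A3": 300, "A4": 1000}
print(inparalogs_with_confidence(400, 1000, candidates))
```

With these toy scores A3 is rejected because it is less similar to A1 than the orthologue B1 is, and A4, which scores the same as A1 against itself, gets the maximum confidence.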
Slide 35
Confidence values for inparalogs
1. Most confident inparalog is when the inparalog is sequence-identical to the main orthologue.
2. Maximum value = score identical - score orthologs
3. Confidence = (score inparalog - score orthologs) / (score ide