
Orthology predictions for whole mammalian genomes. Leo Goodstadt, MRC Functional Genomics Unit, Oxford University

  • Slide 1
  • Orthology predictions for whole mammalian genomes. Leo Goodstadt, MRC Functional Genomics Unit, Oxford University
  • Slide 2
  • Finishing. Evolution of orthologues: selection pressures in orthologues and paralogues. Gene duplications: reproduction, immunity or chemosensation. Synonymous substitution rates: mutation and selection vary by chromosome size. Gene birth in the human lineage: ongoing duplications underlie polymorphism.
  • Slide 3
  • Orthology is the key
  • Slide 4
  • How it started: We are consumers of orthology / paralogy. Started off using Ensembl predictions. Ensembl 1:1 covered 50% of predicted mouse genes. Ewan's manual survey said 80%.
  • Slide 5
  • 1) General observations for all mammalian genomes: paralogues evolve fast (and are fun!)
  • Slide 6
  • 2) Observations for whole clades of species. [Figure: lineage-specific dN/dS (axis 0.00 to 0.14) by species, grouped as Drosophila (dmel, dsim, dyak, dere, dana, dpse, dvir, dmoj, dgri), Nematodes (cele, cbri, crem, c2801) and Amniotes (hsap, mmus, cfam, mdom, oana, ggal)]
  • Slide 7
  • 3) Inparalogues define lineage-specific biology. Marsupial / Monodelphis biology revealed by lineage-specific genes: Chemosensation (OR, V1R and V2R); Reproduction (vomeronasal receptors, lipocalins, β-microseminoprotein (12:1)); Immunity (IG chains, butyrophilins, leukocyte IG-like receptors, T-cell receptor chains and carcinoembryonic antigen-related cell adhesion molecules); pancreatic RNAses; Detoxification (hypoxanthine phosphoribosyltransferase homologues; nitrogen-poor diets); KRAB ZnFingers
  • Slide 8
  • 4) Interesting stories in the aggregate
  • Slide 9
  • 5) Treasure trove in the details. Ongoing mouse inparalogue analysis: lots and lots of reproductive genes
    clade: #2 (ortholog_id = 17117 in panda)
    159 mus genes
    47 genes new to assembly 36
    10 genes completely new to assembly 36
    Interpro matches for this clade:
    !!! Expansion mainly on chr5 and 14, although single (pseudogene?) versions on chr13 and chr16.
    !!! Mouse DLG5 is: chr14:22,966,420-22,978,653 (expressed in testis: AK147699)
    gene identifier       order  chrm  exons  stop  length
    --------------------  -----  ----  -----  ----  ------
    MUS_GENE_21705  6639  5  spermatogenesis associated glutamate (E)-rich protein 1, pseudogene 1 ; ENSMUSP00000086007  4  182
    MUS_GENE_22420  6643  5  predicted gene, EG623898 ; ENSMUSP00000099126  2  72  <
    MUS_GENE_19599  6646  5  spermatogenesis associated glutamate (E)-rich protein 1, pseudogene 1 (Speer1-ps1) on chromosome 5 ; NCBIMUSP_83776567  4  157  <
    MUS_GENE_23688  6651  5  predicted gene, EG623898 ; ENSMUSP00000094421  2  72
    MUS_GENE_19774  6657  5  spermatogenesis associated glutamate (E)-rich protein 3 ;
  • Slide 10
  • 6) Candidates for evolutionary and functional analyses. Secretoglobin protein family members: androgen-binding proteins. Emes et al. (2004) Genome Res. 14(8):1516-29
  • Slide 11
  • Available genomes and divergences. Hedges, S.B. Nature Reviews Genetics 3, 838-849 (2002)
  • Slide 12
  • How do we find function in the genome? "Nothing in Biology Makes Sense Except in the Light of Evolution." Theodosius Dobzhansky (1900-1975).
  • Slide 13
  • How to find the function in the genome? Similar sequences; common ancestry (homology); similar structures / folds; similar functions? (Genes / genome regions)
  • Slide 14
  • How much of the genome is functional? Compare with the mouse. Ancestral Repetitive (AR) sequence is non-functional and has evenly distributed conservation scores (red; symmetrical and bell-shaped due to biological variation). Whole Genome sequence contains some functional sequence under selection and thus has a small excess of conserved sequence under purifying selection (asymmetrical). Functional sequence = Whole Genome - Ancestral Repetitive = 5%. N.B. This is an estimate that doesn't take into account sequence that is turning over rapidly (not shared by mouse/human) or under positive (diversifying) selection. [Figure: conservation score distributions for ARs and Whole Genome]
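
The subtraction on this slide can be illustrated with a small sketch. It is not necessarily the procedure behind the 5% figure: it assumes both score distributions are available as histograms over the same bins, scales the AR (neutral) histogram as far as it will fit under the whole-genome histogram, and reports whatever is left over. The function name and the toy counts are hypothetical.

```python
def excess_conserved_fraction(genome_counts, neutral_counts):
    """Lower bound on the share of the genome histogram that the
    neutral (AR) histogram cannot explain.

    Both arguments are lists of counts over the same score bins.
    """
    if len(genome_counts) != len(neutral_counts):
        raise ValueError("histograms must share the same bins")
    genome_total = sum(genome_counts)
    neutral_total = sum(neutral_counts)
    if genome_total <= 0 or neutral_total <= 0:
        raise ValueError("histograms must not be empty")

    # Ratio of genome density to neutral density in every bin where
    # the neutral model has any mass.
    ratios = [
        (g / genome_total) / (n / neutral_total)
        for g, n in zip(genome_counts, neutral_counts)
        if n > 0
    ]
    # The neutral component can be scaled up only until it touches the
    # genome histogram in some bin; that scale is its largest share.
    neutral_share = min(1.0, min(ratios))
    return 1.0 - neutral_share


# Toy counts (made up): the genome histogram is the AR shape plus an
# excess in the most conserved bin worth 5% of all genome windows.
ar_counts = [10, 20, 40, 20, 10]
genome_counts = [190, 380, 760, 380, 290]
print(round(excess_conserved_fraction(genome_counts, ar_counts), 3))  # prints 0.05
```

Bins with very few AR counts make the ratio noisy, so a real analysis would need smoothing or a minimum count per bin. As the slide notes, an estimate of this kind also misses rapidly turning-over and positively selected sequence.
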
  • Slide 15
  • The human genome (euchromatic sequence): Unknown (old repetitive junk?); Protein coding: 1.2%; UTR: 0.3%; Repeats (transposable elements, …): ~45%; Conserved non-coding (3.5% ?); Neutral
  • Slide 16
  • Conserved non-coding material: Transcription factor binding sites; Enhancers, insulators and other non-transcribed regulatory elements; Alternative splicing signals; Transfer RNAs, ribosomal RNAs; Small RNAs (e.g. snoRNAs, microRNAs, siRNAs and piRNAs): regulatory / gene silencing / RNA degradation; MacroRNAs (e.g. Xist): enzymatic? / chromosome inactivation
  • Slide 17
  • Functional parts of genes are highly conserved
  • Slide 18
  • How many protein coding genes?
    Walter Gilbert [1980s]: 100k
    Antequera & Bird [1993]: 70-80k
    John Quackenbush et al. (TIGR) [2000]: 120k
    Ewing & Green [2000]: 30k
    Tetraodon analysis [2001]: 35k
    Human Genome Project (public) [2001]: ~31k
    Human Genome Project (Celera) [2001]: 24-40k
    Mouse Genome Project (public) [2002]: 25k-30k
    Lee Rowen [2003]: 25,947
    Human Genome Project (finishing) [2004]: 20-25k
    Current predictions [2008]: 19-20k
  • Slide 19
  • Traditional genome orthology: reciprocal BLAST best hits between the longest transcript of each gene (+ synteny). Assumes: protein similarity is proportional to evolutionary distance (selection is invariant!); pairwise relationships adequately represent the evolutionary tree; no gene losses or missing predictions; alternative splicing can be ignored!; no gene translocations after tandem duplication
  • Slide 20
  • Orthology prediction methods. Two genomes: reciprocal best BLAST hit. Multiple genomes: clustering of reciprocal best hits. [Diagram labels: protein similarities; query; BLAST hits]
  • Slide 21
  • Reciprocal BLAST best hits. Advantages: fast; well understood; works well for distant lineages; can correlate with protein structure (domains). Disadvantages: only provides 1:1 orthologues in the best case; can be difficult to reconcile with the species tree
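
A minimal sketch of reciprocal best hits, assuming the pairwise similarity scores have already been computed and loaded into dictionaries keyed by (query, subject). The gene names and scores are invented for illustration, and no BLAST output format is parsed here.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring subject gene.

    `scores` maps (query, subject) pairs to similarity scores.
    On a tie, the pair seen first is kept.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs where each gene is the other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Toy example: H2 has two close homologues in the second genome.
species1_vs_species2 = {
    ("H1", "M1"): 500, ("H1", "M2a"): 80,
    ("H2", "M2a"): 400, ("H2", "M2b"): 390,
}
species2_vs_species1 = {
    ("M1", "H1"): 500,
    ("M2a", "H2"): 400,
    ("M2b", "H2"): 390,
}
print(reciprocal_best_hits(species1_vs_species2, species2_vs_species1))
# prints [('H1', 'M1'), ('H2', 'M2a')]
```

In the toy data M2b is left out even though H2 is its best hit, which is the 1:1 limitation listed under the disadvantages.
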
  • Slide 22
  • [Diagram: genes on chromosome of species 1; genes on chromosome of species 2]
  • Slide 23
  • Reciprocal Blast Best Hits
  • Slide 24
  • ?
  • Slide 25
  • How to add duplicated genes? Synteny (Ensembl compara in the past). Local gene order tends to be conserved in mammalian lineages. Look for inparalogs locally even if the protein distances don't add up (sequence error, sampling error etc.)
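
One way to picture "look for inparalogs locally" is a window scan around an orthologue that has already been placed. The sketch below is illustrative only and is not Ensembl compara's implementation; the function name, the `window` parameter and the gene identifiers are all hypothetical.

```python
def local_inparalog_candidates(anchor_index, gene_order, homologs, window=5):
    """Return homologous genes lying within `window` genes of the anchor.

    gene_order: gene ids in chromosomal order.
    anchor_index: position in gene_order of the already-assigned orthologue.
    homologs: set of gene ids that show similarity to the anchor.
    """
    start = max(0, anchor_index - window)
    stop = min(len(gene_order), anchor_index + window + 1)
    return [
        gene
        for index, gene in enumerate(gene_order[start:stop], start=start)
        if index != anchor_index and gene in homologs
    ]


# Toy chromosome: g4 is the anchor; g3 and g5 are neighbouring homologues,
# g8 is a homologue too far away to count as local.
order = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"]
print(local_inparalog_candidates(3, order, {"g3", "g5", "g8"}, window=2))
# prints ['g3', 'g5']
```

Candidates found this way are only local homologues, so they still need checking against the sequence distances before being called inparalogs.
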
  • Slide 26
  • ? Blast Best Hits in Local Regions
  • Slide 27
  • Slide 28
  • Problems with relying only on synteny. Local homologs are often not inparalogs: local rearrangements; missing predictions (neighbouring orphans). Need sanity checking
  • Slide 29
  • Human and mouse chromosomes: extensive rearrangements only over larger regions; conservation of gene order in the short range
  • Slide 30
  • Olfactory orthology from compara. [Figure: mouse chromosome 2 vs rat chromosome 3; legend: one to one, one to many, many to many, many to one]
  • Slide 31
  • Olfactory orthology. [Figure: mouse chromosome 2 vs rat chromosome 3; legend: one to one, one to many, many to many, many to one]
  • Slide 32
  • Inparanoid. Remm, M., Storm, C.E. and Sonnhammer, E.L.L. (2001) Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. J. Mol. Biol. 314, 1041-1052. Avoids multiple alignments and phylogenetic methods for speed and to avoid errors. Heuristics are implicitly phylogenetic
  • Slide 33
  • How Inparanoid works, in order: longest transcripts; pairwise alignment scores; reciprocal best hits are orthologues; add lineage-specific duplicates (inparalogs) with confidences; resolve conflicts; use cutoff; orthology
  • Slide 34
  • Identify main orthologues; identify inparalog candidates. (Animated build of the workflow: longest transcripts; pairwise alignment scores; reciprocal best hits are orthologues; add lineage-specific duplicates (inparalogs) with confidences; resolve conflicts; use cutoff; orthology)
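
The "add lineage-specific duplicates" step can be sketched as follows. It assumes a main orthologue pair has already been chosen by reciprocal best hits, and it assumes that a same-species gene qualifies when it scores higher against the main orthologue than the main orthologue scores against its partner in the other species. That threshold rule is one reading of the slides rather than a quotation from them, and all names and scores are hypothetical.

```python
def add_inparalogs(main_gene, ortholog_score, within_species_scores):
    """Return same-species genes that cluster with `main_gene` as inparalogs.

    main_gene: the main orthologue in this species.
    ortholog_score: score between the two main orthologues (cross-species).
    within_species_scores: gene id -> score against main_gene, same species.
    """
    return sorted(
        gene
        for gene, score in within_species_scores.items()
        # Assumed rule: closer to the main orthologue than the
        # cross-species partner is.
        if gene != main_gene and score > ortholog_score
    )


# Toy scores: H2b is a recent duplicate of H2, H3 is a distant relative.
scores_against_h2 = {"H2": 900, "H2b": 600, "H3": 150}
print(add_inparalogs("H2", 400, scores_against_h2))
# prints ['H2b']
```
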
  • Slide 35
  • Confidence values for inparalogs: 1. Most confident inparalog is when the inparalog is sequence-identical to the main orthologue. 2. Maximum value = score identical - score orthologs. 3. Confidence = (score inparalog - score orthologs) / (score ide
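
The slide text is cut off mid-formula, so the sketch below rests on an assumption: the denominator is taken to be the "maximum value" from item 2, read as the identical-sequence score minus the score between the main orthologues. Under that reading an inparalog identical to the main orthologue gets confidence 1. The function and parameter names are hypothetical.

```python
def inparalog_confidence(score_inparalog, score_orthologs, score_identical):
    """Confidence of an inparalog on a 0..1 scale (assumed denominator).

    score_inparalog: score of the inparalog against the main orthologue.
    score_orthologs: score between the two main orthologues.
    score_identical: score of the main orthologue against itself.
    """
    # Assumption: item 2's "maximum value" is the denominator.
    max_value = score_identical - score_orthologs
    if max_value <= 0:
        raise ValueError("identical score must exceed the ortholog score")
    return (score_inparalog - score_orthologs) / max_value


print(inparalog_confidence(900, 400, 900))  # prints 1.0
print(inparalog_confidence(650, 400, 900))  # prints 0.5
```
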
