Alternative Splicing from ESTs Eduardo Eyras Bioinformatics UPF – February 2004


Alternative Splicing from ESTs

Eduardo Eyras

Bioinformatics UPF – February 2004


Intro

ESTs

Prediction of Alternative Splicing from ESTs

[Figure: transcription of a gene (exons and introns) into pre-mRNA, splicing into the mature mRNA (5’ cap, poly-A tail), and translation into a peptide.]

[Figure: the same pre-mRNA under different splicing gives a different mature mRNA and, after translation, a different peptide.]

Alt splicing as a mechanism of gene regulation

Functional domains can be added or removed, increasing protein diversity

It can introduce early stop codons, resulting in truncated proteins or unstable mRNAs

It can modify the activity of transcription factors, affecting the expression of genes

It is observed in nearly all metazoans

Estimated to occur in 30%-40% of human genes


Forms of alternative splicing

Exon skipping / inclusion

Alternative 3’ splice site

Alternative 5’ splice site

Mutually exclusive exons

Intron retention

[Figure legend: constitutive exon; alternatively spliced exons]


How to study alternative splicing?


ESTs (Expressed Sequence Tags)

Single-pass sequencing of a small (end) piece of cDNA

Typically 200-500 nucleotides long

It may contain coding and/or non-coding regions

ESTs

[Figure: preparation of cDNA from cells of a specific organ, tissue or developmental stage. Steps shown: mRNA extraction; add oligo-dT primer; reverse transcriptase; Ribonuclease H; DNA polymerase and Ribonuclease H; double stranded cDNA. Strand labels: RNA, DNA.]

[Figure: the cDNA is cloned into a vector; multiple cDNA clones give single-pass sequence reads, a 5’ EST and a 3’ EST.]

Splice variants

[Figure: genomic sequence, primary transcript, splicing, cDNA clones, and the 5’ and 3’ EST sequences derived from them.]

Alternative Splicing from ESTs

ESTs can also provide information about potential alternative splicing when aligned to the genome (and when aligned to mRNA data)


EST sequencing

Is fast and cheap

Gives direct information about the gene sequence

Partial information

Resulting ESTs (DB searches): known gene, similar to known gene, contaminant, or novel gene

ESTs provide expression data

eVOC Ontologies http://www.sanbi.ac.za/evoc/

Anatomical System: The tissue, organ or anatomical system from which the sample was prepared. Examples are digestive, lung and retina.

Cell Type: The precise cell type from which a sample was prepared. Examples are: B-lymphocyte, fibroblast and oocyte.

Pathology: The pathological state of the sample from which the sample was prepared. Examples are: normal, lymphoma, and congenital.

Developmental Stage: The stage during the organism's development at which the sample was prepared. Examples are: embryo, fetus, and adult.

Pooling: Indicates whether the tissue used to prepare the library was derived from single or multiple samples. Examples are pooled, pooled donor and pooled tissue.

Linking the expression vocabulary to gene annotations

[Figure: ESTs linked to genes]


Normalized vs. non-normalized libraries

The downside of ESTs

Cannot detect lowly/rarely expressed genes or non-expressed sequences (regulatory)

Random sampling: the more ESTs we sequence, the fewer new useful sequences we get

Gene Hunting

Sequencing of the Human Genome (HGP)

EST Sequencing


Origin of the ESTs

Science. 1991 Jun 21;252(5013):1651-6

Complementary DNA sequencing: expressed sequence tags and human genome project.

Adams MD, Kelley JM, Gocayne JD, Dubnick M, Polymeropoulos MH, Xiao H, Merril CR, Wu A, Olde B, Moreno RF, et al.

Section of Receptor Biochemistry and Molecular Biology, National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD.

Automated partial DNA sequencing was conducted on more than 600 randomly selected human brain complementary DNA (cDNA) clones to generate expressed sequence tags (ESTs). ESTs have applications in the discovery of new human genes, mapping of the human genome, and identification of coding regions in genomic sequences. Of the sequences generated, 337 represent new genes, including 48 with significant similarity to genes from other organisms, such as a yeast RNA polymerase II subunit; Drosophila kinesin, Notch, and Enhancer of split; and a murine tyrosine kinase receptor. Forty-six ESTs were mapped to chromosomes after amplification by the polymerase chain reaction. This fast approach to cDNA characterization will facilitate the tagging of most human genes in a few years at a fraction of the cost of complete genomic sequencing, provide new genetic markers, and serve as a resource in diverse biological research fields.


EST-sequencing explosion

Merck and WashU (1994)

public ESTs

GenBank

dbEST

non-exclusivity (1992)

Number of public entries: 20,039,613

Summary by organism

Homo sapiens (human) 5,472,005

Mus musculus + domesticus (mouse) 4,056,481

Rattus sp. (rat) 583,841

Triticum aestivum (wheat) 549,926

Ciona intestinalis 492,511

Gallus gallus (chicken) 460,385

Danio rerio (zebrafish) 450,652

Zea mays (maize) 391,417

Xenopus laevis (African clawed frog) 359,901

…

dbEST release 20 February 2004

EST lengths

[Figure: human EST length distribution (dbEST, Sep. 2003), labeled ~450 bp]


Recover the mRNA from the ESTs


What is an EST cluster?

A cluster is a set of fragmented EST data (plus mRNA data if known), consolidated according to sequence similarity

Clusters are indexed by gene such that all expressed data concerning a single gene is in a single index class, and each index class contains the information for only one gene.  (Burke, Davison, Hide, Genome Research 1999).

EST pre-processing

Vector, repeats, mitochondrial, xenocontaminants


EST Clustering

UniGene (NCBI) www.ncbi.nlm.nih.gov/UniGene

TIGR Human Gene Index www.tigr.org

(The Institute for Genomic Research)

StackDB www.sanbi.ac.za

(South African Bioinformatics Institute)


UniGene

Species UniGene Entries

Homo sapiens 118,517

Mus musculus 82,482

Rattus norvegicus 43,942

Sus scrofa 20,426

Gallus gallus 11,970

Xenopus laevis 21,734

Xenopus tropicalis 17,102


ESTs and the Genome

ESTs aligned to the genome

Some advantages:

• It defines the location of exons and introns

• We can verify the splice sites of introns (e.g. GT-AG), and hence also check the correct strand of spliced ESTs

• It helps prevent chimeras

• It can avoid putting together ESTs from paralogous genes

• We can avoid including pseudogenes in our analysis
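
The splice-site and strand check in the list above can be illustrated with a minimal Python sketch. It assumes introns are given as 0-based, end-exclusive intervals on the forward genomic strand, and it only looks for the canonical GT-AG pair. The names `revcomp` and `infer_strand` and the toy sequence are hypothetical, not taken from the slides.

```python
def revcomp(seq):
    """Reverse complement of a DNA string."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def infer_strand(genome, introns):
    """Guess the strand of a spliced alignment from its intron boundaries.

    genome  : genomic sequence, forward strand
    introns : list of (start, end) intervals, 0-based, end exclusive
    Returns '+', '-' or None when the evidence is absent or contradictory.
    """
    votes = {"+": 0, "-": 0}
    for start, end in introns:
        intron = genome[start:end].upper()
        if intron[:2] == "GT" and intron[-2:] == "AG":
            votes["+"] += 1          # GT-AG read on the forward strand
        else:
            rc = revcomp(intron)
            if rc[:2] == "GT" and rc[-2:] == "AG":
                votes["-"] += 1      # GT-AG read on the reverse strand
    if votes["+"] and not votes["-"]:
        return "+"
    if votes["-"] and not votes["+"]:
        return "-"
    return None


# Toy example (assumed data): one intron "GTAAAAAG" between two exons.
toy_genome = "AAAC" + "GTAAAAAG" + "CCCC"
strand = infer_strand(toy_genome, [(4, 12)])
```

An alignment whose introns vote for both strands, or for neither, returns `None`, which is one way such an EST could be flagged for rejection.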

Aligning ESTs to the Genome

Many ESTs: fast programs, fast computers

Nearly exact matches: Coverage >= 97%, Percent_id >= 97%

Splice sites: GT-AG, AT-AC, GC-AG
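
The thresholds above can be applied as a simple filter. The sketch below is illustrative only: it assumes coverage means the percentage of EST bases included in the alignment and percent identity means the percentage of aligned bases that match the genome, which the slide does not define. The `Alignment` class and the function names are hypothetical.

```python
from dataclasses import dataclass

# Donor/acceptor dinucleotide pairs listed on the slide.
ALLOWED_SPLICE_SITES = {("GT", "AG"), ("AT", "AC"), ("GC", "AG")}


@dataclass
class Alignment:
    est_id: str
    est_length: int      # length of the EST in bases
    aligned_bases: int   # EST bases included in the alignment (assumed definition)
    matches: int         # aligned bases identical to the genome (assumed definition)

    @property
    def coverage(self):
        return 100.0 * self.aligned_bases / self.est_length

    @property
    def percent_id(self):
        if self.aligned_bases == 0:
            return 0.0
        return 100.0 * self.matches / self.aligned_bases


def passes_thresholds(aln, min_coverage=97.0, min_percent_id=97.0):
    """Keep only nearly exact matches."""
    return aln.coverage >= min_coverage and aln.percent_id >= min_percent_id


def is_allowed_intron(donor, acceptor):
    """True if the intron starts and ends with an accepted dinucleotide pair."""
    return (donor.upper(), acceptor.upper()) in ALLOWED_SPLICE_SITES
```

For example, an EST of 400 bases with 392 aligned bases and 385 matches has 98% coverage and about 98.2% identity, so it passes both thresholds.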

Extra pre-processing of ESTs:

Clip poly A tails / Clip 20bp from either end

Best in genome

Remove potential processed pseudogenes

Give preference to ESTs that are spliced
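
The clipping step can be sketched as follows. This is a minimal example under stated assumptions: both clips are applied in sequence, a tail counts as poly-A only if it has at least `min_tail` bases (a hypothetical parameter), and a leading poly-T run is treated the same way for reads taken from the opposite strand. The function name `clip_est` is not from the slides.

```python
import re


def clip_est(seq, end_clip=20, min_tail=6):
    """Remove a poly-A tail (or poly-T head), then trim both ends.

    Returns an empty string when nothing is left after clipping.
    """
    seq = seq.upper()
    seq = re.sub(r"A{%d,}$" % min_tail, "", seq)   # trailing poly-A
    seq = re.sub(r"^T{%d,}" % min_tail, "", seq)   # leading poly-T
    if len(seq) <= 2 * end_clip:
        return ""
    return seq[end_clip:len(seq) - end_clip]


# Toy read (assumed data): 20 bases, an 8-base core, 20 bases, then a poly-A tail.
toy_read = "C" * 20 + "ACGTACGT" + "G" * 20 + "A" * 10
clipped = clip_est(toy_read)
```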

Human ESTGenes

[Figure: genomic length distribution of aligned human ESTs; labels: ~400 bp, tail up to ~800 kb]

The Problem

What are the transcripts represented in this set of mapped ESTs?

[Figure: ESTs aligned to the genome]

Predict Transcripts from ESTs

Merge ESTs according to splicing structure compatibility

[Figure: ESTs and the transcript predictions derived from them]

Representation

Every 2 ESTs in a Genomic Cluster may represent the same splicing (redundant) or not

Sort by the smallest coordinate ascending and by the largest coordinate descending

The redundancy relation is a graph:

[Figure: ESTs x, y and z on the genome; graph edges labeled Extension (x, y) and Inclusion (x, z)]
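
The sort order used for the representation can be written as a two-part key. This is a small sketch that assumes each EST is a `(name, start, end)` tuple of genomic coordinates; the tuple layout and the function name `sort_ests` are hypothetical.

```python
def sort_ests(ests):
    """Sort by smallest coordinate ascending, then largest coordinate descending."""
    return sorted(ests, key=lambda est: (est[1], -est[2]))


# Toy cluster (assumed data): x spans z, and y starts further downstream.
toy_cluster = [("y", 150, 400), ("z", 100, 200), ("x", 100, 500)]
ordered = [name for name, _, _ in sort_ests(toy_cluster)]
```

With this order, among ESTs that start at the same position the one reaching furthest comes first, so in the toy cluster `x` is visited before `z`.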


Criteria of merging

Allow internal mismatches

Allow intron mismatches

Allow edge-exon mismatches
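
The slide does not define these criteria precisely, so the following is only a rough sketch of a compatibility test with tolerances. It assumes each EST is a sorted list of `(start, end)` exons, and it models two tolerances with hypothetical parameters: `splice_tolerance` (allowed difference between matching intron boundaries) and `edge_overhang` (how far an edge exon may run into an intron of the other EST). How these map onto the three criteria above is a guess.

```python
def introns(exons):
    """Introns as (end of upstream exon, start of downstream exon)."""
    return [(a_end, b_start) for (_, a_end), (b_start, _) in zip(exons, exons[1:])]


def compatible(a, b, splice_tolerance=0, edge_overhang=0):
    """True if two ESTs overlap and show the same introns in the shared range."""
    lo = max(a[0][0], b[0][0])
    hi = min(a[-1][1], b[-1][1])
    if lo > hi:
        return False  # no genomic overlap, nothing to compare

    def shared(exons):
        # Introns that reach into the shared range by more than the overhang.
        return [(s, e) for s, e in introns(exons)
                if e > lo + edge_overhang and s < hi - edge_overhang]

    ia, ib = shared(a), shared(b)
    if len(ia) != len(ib):
        return False
    return all(abs(s1 - s2) <= splice_tolerance and abs(e1 - e2) <= splice_tolerance
               for (s1, e1), (s2, e2) in zip(ia, ib))


# Toy ESTs (assumed data).
est_a = [(100, 200), (300, 400)]
est_b = [(150, 200), (300, 400), (500, 600)]   # same intron, extends est_a
est_c = [(100, 200), (300, 400), (500, 600)]
est_d = [(100, 200), (500, 600)]               # skips the middle exon of est_c
```

Here `est_a` and `est_b` are compatible, while `est_c` and `est_d` are not, because the exon-skipping form has a different intron in the shared range.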

Transitivity

[Figure: extension and inclusion relations among ESTs w, x, y and z]

This reduces the number of comparisons needed

ClusterMerge graph

[Figure: graph over ESTs w, x, y and z]

Each node defines an inclusion sub-tree

Extensions form acyclic graphs

Recovering the Solution

[Figure: example graph with nodes 1 to 9, with the root and the leaves marked]

Mergeable sets of ESTs can be recovered as special paths in the graph

Root: does not extend any node

Leaf: not-extended and root of an inclusion tree

Any set of ESTs in a path from a root to a leaf is mergeable

Add the inclusion tree attached to each node in the path

Lists produced: (1,2,3,4,5,6,7,8) (1,2,3,4,5,6,7,9)

This representation minimizes the necessary comparisons between ESTs
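
The recovery of the lists can be sketched as a walk over root-to-leaf paths. In this minimal example the graph is stored as two dictionaries: `extends[x]` lists the nodes that extend `x` (every node outside an inclusion tree must appear as a key), and `includes[x]` lists the nodes included in `x`. The edges below are hypothetical, chosen only so that the output matches the two lists on the slide; the actual edges of the figure are not recoverable from the text.

```python
def mergeable_lists(extends, includes):
    """One list per root-to-leaf extension path, plus attached inclusion trees."""
    targets = {y for ys in extends.values() for y in ys}
    roots = sorted(set(extends) - targets)   # roots extend no other node

    def inclusion_tree(x):
        found = []
        for child in includes.get(x, []):
            found.append(child)
            found.extend(inclusion_tree(child))
        return found

    def paths(x):
        successors = extends.get(x, [])
        if not successors:                   # leaf: not extended by any node
            yield [x]
        for y in successors:
            for tail in paths(y):
                yield [x] + tail

    result = []
    for root in roots:
        for path in paths(root):
            members = set(path)
            for x in path:
                members.update(inclusion_tree(x))
            result.append(sorted(members))
    return result


# Hypothetical edges (assumption): 1 -> 2 -> 6 -> {8, 9} by extension,
# with 3, 4, 5 and 7 hanging off the path as inclusion trees.
toy_extends = {1: [2], 2: [6], 6: [8, 9], 8: [], 9: []}
toy_includes = {2: [3, 4], 4: [5], 6: [7]}
lists = mergeable_lists(toy_extends, toy_includes)
```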


How to build the graph

Mutual Recursion

Search graph (leaves)

Recursion search along extension branch

Search sub-graph

Inclusion => go up in the tree

Example

[Figure sequence: an example graph with ESTs 1 to 6, into which EST 7 is placed. Step labels, in order: Leaves; Inclusion; Inclusion; Extension; Inclusion; Place 7; Inclusion; tagged as visited - skip.]

Possible sub-trees beyond 1 or 3 remain unseen!

The representation minimizes the necessary comparisons

Deriving the transcripts from the lists

Internal Splice Sites: external coordinates of the 5’ and 3’ exons are not allowed to contribute


Splice Sites: are set to the most common coordinate

5’ and 3’ coordinates: are set to the exon coordinate that extends the potential UTR the most
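
These rules can be sketched in Python under some loud assumptions. Each EST in a mergeable list is represented as a dictionary mapping an exon slot of the merged transcript (0, 1, 2, ...) to the `(start, end)` the EST shows for that exon, on the forward strand; how ESTs are assigned to slots is not covered here. The function name `derive_transcript` and the toy data are hypothetical.

```python
from collections import Counter


def derive_transcript(ests):
    """Consensus exons from a list of mergeable ESTs (slot -> (start, end))."""
    n_slots = max(max(est) for est in ests) + 1
    transcript = []
    for k in range(n_slots):
        all_starts, all_ends = [], []
        internal_starts, internal_ends = [], []
        for est in ests:
            if k not in est:
                continue
            start, end = est[k]
            all_starts.append(start)
            all_ends.append(end)
            # The outer ends of an EST's own first and last exons are not
            # splice sites, so they do not vote.
            if k != min(est):
                internal_starts.append(start)
            if k != max(est):
                internal_ends.append(end)
        if k == 0:
            start = min(all_starts)          # extend the 5' end the most
        else:
            start = Counter(internal_starts or all_starts).most_common(1)[0][0]
        if k == n_slots - 1:
            end = max(all_ends)              # extend the 3' end the most
        else:
            end = Counter(internal_ends or all_ends).most_common(1)[0][0]
        transcript.append((start, end))
    return transcript


# Toy list (assumed data): three ESTs covering a three-exon transcript.
toy_ests = [
    {0: (100, 200), 1: (300, 400)},
    {1: (302, 400), 2: (500, 650)},
    {0: (90, 200), 1: (300, 400), 2: (500, 600)},
]
consensus = derive_transcript(toy_ests)
```

In the toy list the second EST starts at 302 inside exon 1, but that coordinate is the external end of the EST and is ignored, so the acceptor site is set to 300.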


Single exon transcripts

Reject resulting single exon transcripts when using ESTs


Annotation with ESTs

ESTs aligned to the genome can provide information about UTRs and alternative splicing


EST-Transcripts at www.ensembl.org


Results for Human and Mouse

Human EST-genes (assembly ncbi33):

38,581 Genes

122,247 Transcripts (42% with full CDS)

Mouse EST-genes (assembly ncbi30):

32,848 Genes

103,664 Transcripts (36% with full CDS)


How many transcripts are conserved?

Is Alternative Splicing conserved?

EST-transcript pairs

42,625 transcript pairs (in 18,242 gene pairs)

Gene pairs:

78% with one transcript pair conserved

22% with more than one transcript pair conserved

For 22% of the gene pairs some form of alt. splicing is conserved

Conservation of Alt. Splicing

Take gene-pairs with more than one transcript-pair

19% of alt. variants in human are conserved in mouse

32% of alt. variants in mouse are conserved in human

\[
\%\,\text{conservation} = \frac{\sum \left( \text{number of paired transcripts} - 1 \right)}{\sum \left( \text{number of transcripts} - 1 \right)}
\]

∑ = sum over genes in a gene pair with more than one variant (subtract the ‘main’ transcript form)
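
As a worked illustration of the ratio, the sketch below computes it for a handful of genes. It assumes the caller has already selected the genes to sum over, passes one `(paired transcripts, total transcripts)` tuple per gene, and wants the result multiplied by 100 to read as a percentage. The function name and the counts are made up for the example.

```python
def percent_conservation(genes):
    """genes: list of (number of paired transcripts, number of transcripts)."""
    numerator = sum(paired - 1 for paired, _ in genes)
    denominator = sum(total - 1 for _, total in genes)
    if denominator == 0:
        raise ValueError("no alternative variants to compare")
    return 100.0 * numerator / denominator


# Toy counts (assumed data): three genes from gene pairs with several variants.
toy_genes = [(2, 4), (3, 5), (2, 2)]
value = percent_conservation(toy_genes)   # (1 + 2 + 1) / (3 + 4 + 1) = 50%
```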

How many predicted ‘novel’ genes are validated by Human-Mouse comparison?

Novel genes

[Figure: overlap between three sets of ESTGenes]

ESTGenes not in Ensembl: 13,174

Human ESTGenes validated by comparison to mouse: 18,242

ESTGenes with at least one complete ORF: 24,201

ESTGenes not in Ensembl, validated by comparison to mouse, with a complete ORF: 984


THE END