Clusters of Internally Primed Transcripts Reveal Novel Long ...79716/UQ_PV...Clusters of Internally Primed Transcripts Reveal Novel Long Noncoding RNAs Masaaki Furuno1[, Ken C. Pang2,3[,

Clusters of Internally Primed TranscriptsReveal Novel Long Noncoding RNAsMasaaki Furuno

1[, Ken C. Pang

2,3[, Noriko Ninomiya

4, Shiro Fukuda

4, Martin C. Frith

2,4, Carol Bult

1, Chikatoshi Kai

4,

Jun Kawai4,5

, Piero Carninci4,5

, Yoshihide Hayashizaki4,5

, John S. Mattick2

, Harukazu Suzuki4*

1 Mouse Genome Informatics Consortium, The Jackson Laboratory, Bar Harbor, Maine, United States of America, 2 Australian Research Council Special Research Centre for

Functional and Applied Genomics, Institute for Molecular Bioscience, University of Queensland, Brisbane, Australia, 3 T Cell laboratory, Ludwig Institute for Cancer Research,

Austin Health, Heidelberg, Victoria, Australia, 4 Genome Exploration Research Group (Genome Network Project Core Group), RIKEN Genomic Sciences Center, RIKEN

Yokohama Institute, Yokohama, Japan, 5 Genome Science Laboratory, Discovery Research Institute, RIKEN Wako Institute, Wako, Japan

Non-protein-coding RNAs (ncRNAs) are increasingly being recognized as having important regulatory roles. Althoughmuch recent attention has focused on tiny 22- to 25-nucleotide microRNAs, several functional ncRNAs are orders ofmagnitude larger in size. Examples of such macro ncRNAs include Xist and Air, which in mouse are 18 and 108 kilobases(Kb), respectively. We surveyed the 102,801 FANTOM3 mouse cDNA clones and found that Air and Xist were presentnot as single, full-length transcripts but as a cluster of multiple, shorter cDNAs, which were unspliced, had little codingpotential, and were most likely primed from internal adenine-rich regions within longer parental transcripts. Wetherefore conducted a genome-wide search for regional clusters of such cDNAs to find novel macro ncRNA candidates.Sixty-six regions were identified, each of which mapped outside known protein-coding loci and which had a meanlength of 92 Kb. We detected several known long ncRNAs within these regions, supporting the basic rationale of ourapproach. In silico analysis showed that many regions had evidence of imprinting and/or antisense transcription. Theseregions were significantly associated with microRNAs and transcripts from the central nervous system. We selectedeight novel regions for experimental validation by northern blot and RT-PCR and found that the majority representpreviously unrecognized noncoding transcripts that are at least 10 Kb in size and predominantly localized in thenucleus. Taken together, the data not only identify multiple new ncRNAs but also suggest the existence of many moremacro ncRNAs like Xist and Air.

Citation: Furuno M, Pang KC, Ninomiya N, Fukuda S, Frith MC, et al. (2006) Clusters of internally primed transcripts reveal novel long noncoding RNAs. PLoS Genet 2(4): e37.DOI: 10.1371/journal.pgen.0020037

Introduction

The existence of non-protein-coding RNAs (ncRNAs) hasbeen known for many decades, and the importance ofessential infrastructural ncRNAs such as ribosomal RNAsand transfer RNAs in facilitating protein synthesis has longbeen recognized. Recently, other ncRNAs have generatedintense interest based upon their ability to regulate geneexpression. Foremost among these are microRNAs (miRNAs),which are about 22 nucleotides in length and function bytargeting mRNAs for cleavage or translational repression.Hundreds of miRNAs have been identified in animals, plants,and viruses, and they mediate critical regulatory functions ina range of developmental and physiological pathways [1–3].Another prominent class of ncRNAs is the short interferingRNAs (siRNAs), which were discovered as a tool for knockingdown gene expression in the lab but have subsequently beenfound to act as natural endogenous regulators of geneexpression [1].

Given the considerable attention that these tiny ncRNAshave attracted, it would be understandable to think thatregulatory ncRNAs are short. However, a small number offunctional ncRNAs have also been identified that are ordersof magnitude larger in size than miRNAs and siRNAs. Well-known examples of such macro ncRNAs include Xist and Air,which in mouse are approximately 18 and 108 Kb, respec-tively [4,5]. Xist plays an essential role in mammals byassociating with chromatin and causing widespread genesilencing on the inactive X chromosome [6], while Air is

Editors: Judith Blake (The Jackson Laboratory, US), John Hancock (MRC-Harwell,UK), Bill Pavan (NHGRI-NIH, US), and Lisa Stubbs (Lawrence Livermore NationalLaboratory, US), together with PLoS Genetics EIC Wayne Frankel (The JacksonLaboratory, US)

Received August 16, 2005; Accepted February 1, 2006; Published April 28, 2006

DOI: 10.1371/journal.pgen.0020037

Copyright: � 2006 Furuno et al. This is an open-access article distributed underthe terms of the Creative Commons Attribution License, which permits unrestricteduse, distribution, and reproduction in any medium, provided the original authorand source are credited.

Abbreviations: CNS, central nervous system; ENOR, expressed noncoding region;EST, expressed sequence tag; GEV, FANTOM3 Genomic Element Viewer; GNF,Genomics Institute of the Novartis Research Foundation; Kb, kilobase(s); miRNA,microRNA; ncRNA, non-protein-coding RNA; qRT-PCR, quantitative real-time RT-PCR; siRNA, short interfering RNA; snoRNA, small nucleolar RNA; TU, transcriptionalunit; UCSC, University of California Santa Cruz; UNA, unspliced, noncoding, andcontaining adjunct adenine-rich regions

* To whom correspondence should be addressed. E-mail: [email protected]

[ These authors contributed equally to this work.

PLoS Genetics | www.plosgenetics.org April 2006 | Volume 2 | Issue 4 | e370537

required for paternal silencing of the Igf2r/Slc22a2/Slc22a3gene cluster [5]. Apart from their extreme length, Xist and Airshare two other important features: genomic imprinting andantisense transcription. Genomic imprinting is a process bywhich certain genes are expressed differently according towhether they have been inherited from the maternal orpaternal allele. Imprinting is critical for normal development,and loss of imprinting has been implicated in a variety ofhuman diseases [7]. ncRNAs have been discovered at manydifferent imprinted loci and appear to be important in theimprinting process itself [5,8]. The other feature that Xist andAir have in common is that both are members of naturallyoccurring cis-antisense transcript pairs. Previous studies haveindicated the existence of thousands of mammalian cis-antisense transcripts [9–12]. These transcripts may regulategene expression in a variety of ways including RNAinterference, translational regulation, RNA editing, alterna-tive splicing, and alternative polyadenylation [13,14],although the exact mechanisms by which antisense RNAsfunction are unknown.

In addition to well-documented ncRNAs, recent evidencefrom both high-density tiling arrays [15,16] and large-scaleanalyses of full-length enriched cDNA libraries [17] suggeststhat there may be thousands more ncRNAs within themammalian transcriptome. Many of these candidates haveemerged from the RIKEN Mouse Gene Encyclopedia project[17,18], and full-length sequencing and analysis by theFANTOM consortium of 102,801 cDNAs recently revealedthat around one-third (34,030) lack an apparent protein-coding region as judged by manual annotation [19]. Althoughsome of these RNAs have been shown to have biologicalfunction [20,21], the vast majority of these putative non-coding cDNAs remain of uncertain significance, especiallygiven that many are likely to represent internally primedtranscription artifacts (which arise during first-strand cDNAsynthesis when oligo[dT] primers bind not to genuine polyAtails but rather to internal adenine-rich regions within longertranscripts) and are not true, full-length transcripts [22,23].

In surveying the FANTOM3 mouse cDNAs, we observed

that macro ncRNAs such as Air and Xist were present not assingle, full-length transcripts but rather as fragmentedclusters of cDNAs, most of which were not only internallyprimed but also unspliced and of minimal protein-codingpotential. We hypothesized that we might discover novelmacro ncRNAs by conducting a genome-wide search forsimilar clusters of cDNAs. We subsequently identified 66candidate ncRNA regions. A few of these overlap with knownlong ncRNAs, and many contain imprinted cDNA candidates,cis-antisense transcripts, or miRNAs. Eight regions werecharacterized experimentally, and the majority were foundto represent previously unknown long ncRNAs that arelocalized to the nucleus. Taken together, the data suggest theexistence of many more macro ncRNAs that, like Xist and Air,may fulfill important regulatory roles in mammalian biology.

Results

Xist and Air Are Represented by Clusters of TruncatedNoncoding cDNAsAs part of the FANTOM3 project, we looked for the

existence of known ncRNAs among the 102,801 cDNAs. Wefound that 16 of 43 (39%) non-small-nucleolar, non-microreference mouse ncRNAs that are present in RNAdb, adatabase of mammalian ncRNAs [24], were detectable amongthe RIKEN cDNA collection, as judged by similarity usingBLASTN (Table 1). The two longest ncRNAs detected wereXist and Air. Very long transcripts such as these createsubstantial difficulties for cDNA cloning protocols for avariety of well-established technical reasons [23,25]. We weretherefore not surprised that examination of both loci via theFANTOM3 Genomic Element Viewer (GEV) (http://fantom32p.gsc.riken.jp/gev-f3/gbrowse/mm5) revealed thatXist and Air were represented by a cluster of truncatedRIKEN and non-RIKEN cDNAs interspersed along the lengthof their parent transcripts. Inspection of the individualcDNAs demonstrated that the majority were unspliced, heldminimal protein-coding potential, and had adjunct genomicadenine-rich regions immediately downstream of their 39ends, suggesting that they had been internally primed. Figure1A illustrates transcription within the Air/Igf2r locus. Air isrepresented by 20 individual cDNAs dispersed along itsreported length, of which 14 are unspliced, noncoding RIKENcDNAs that contain an adjunct adenine-rich region. Figure1B shows Xist and its antisense partner Tsix. Here, ninecDNAs are seen along the length of the spliced Xist transcript,of which four are unspliced, noncoding RIKEN cDNAs thatcontain an adjunct adenine-rich region.

Genome-Wide Search Reveals Multiple Clusters ofUnspliced, Internally Primed Noncoding Transcripts LyingOutside Protein-Coding LociBased upon these observations (Table 1; Figure 1), we

reasoned that it might be possible to discover novel macroncRNAs via a genome-wide search for clusters of transcriptsthat were unspliced, noncoding, and contained adjunctadenine-rich regions (UNA transcripts) (Figure 2). To begin,we classified transcriptional units (TUs) into protein-codingand noncoding using the manual annotations of FANTOM3collaborators [19], where a TU is defined as a group oftranscripts that share at least one exonic nucleotide overlapand that map to the same chromosomal strand [19]. Of 37,348


Clusters of Transcripts Reveal Long ncRNAs

Synopsis

The human genome has been sequenced, and, intriguingly, lessthan 2% specifies the information for the basic protein buildingblocks of our bodies. So, what does the other 98% do? It nowappears that the mammalian genome also specifies the instructionsfor many previously undiscovered ‘‘non protein-coding RNA’’(ncRNA) genes. However, what these ncRNAs do is largely unknown.In recent years, strategies have been designed that have successfullyidentified hundreds of short ncRNAs—termed microRNAs—many ofwhich have since been shown to act as genetic regulators. Alsoknown to be functionally important are a handful of ncRNAs ordersof magnitude larger in size than microRNAs. The availability ofcomplete genome and comprehensive transcript sequences allowsfor the systematic discovery of more large ncRNAs. The authorsdeveloped a computational strategy to screen the mouse genomeand identify large ncRNAs. They detected existing large ncRNAs,thus validating their approach, but, more importantly, discoveredmore than 60 other candidates, some of which were subsequentlyconfirmed experimentally. This work opens the door to a virtuallyunexplored world of large ncRNAs and beckons future experimentalwork to define the cellular functions of these molecules.

TUs, 20,708 were classified as noncoding TUs. We knew,however, from previous work that noncoding TUs oftenoverlap with protein-coding genes, since they can beinternally primed off long pre-mRNAs [22]. Figure 1C showsan example of this, where a cluster of five UNA cDNAsoverlap with intronic regions of the large dystrophin (Dmd)transcript. Of 20,708 noncoding TUs, we excluded 8,228located within intronic regions of protein-coding TUs. We

then selected UNA TUs based on the following criteria: (1) anadjunct adenine-rich region was present at the TU end, (2) nomajor polyA signal (AATAAA/ATTAAA) was present within100 nucleotides of the TU end, and (3) the TU was unspliced.Of 12,480 noncoding TUs, 2,699 satisfied the criteria. We thenclustered these 2,699 UNA TUs by merging any two or morelocated within 100 Kb of one another, provided that (1) therewere no intervening protein-coding transcripts or gene

Table 1. Detection of Known Mouse ncRNAs within the FANTOM3 cDNA Collection

Reference ncRNA FANTOM3 cDNA BLASTN Hit

Name Lengtha Spliced Clone ID Adjunct

Adenine-Rich

Region Present

Spliced Length Percent

cDNA

Length

Percent

ncRNA

Length

Air 107,796 Unspliced 2810051F02 No Yes 1,064 95% 1%

6430704I19 Yes No 2,849 100% 3%

6720429I20 Yes No 3,699 100% 3%

9530009E11 Yes No 2,160 100% 2%

A330053J22 Yes No 1,266 100% 1%

A530029P07 Yes No 1,893 100% 2%

A530040I05 Yes No 1,899 100% 2%

B930018I07 Yes No 2,054 100% 2%

C130030E20 Yes No 1,401 100% 1%

C130073E08 Yes No 1,050 100% 1%

D130094O12 Yes No 682 100% 1%

D930036N23 Yes No 2,651 99% 2%

E130107J18 Yes No 2,205 100% 2%

G130203I22 Yes No 1,644 100% 2%

Cior 2,135 Spliced 2310040O21 No Yes 1,088 99% 51%

7120406M20 No Yes 2,112 99% 99%

7120489O22 No Yes 2,111 99% 99%

9130011J15 No Yes 2,112 98% 99%

9630004F23 No Yes 2,112 74% 99%

E430002L08 No Yes 2,111 99% 99%

I830031I11 No Yes 2,110 100% 99%

I830072L22 No Yes 2,110 100% 99%

I830083L02 No Yes 2,110 100% 99%

Ftx 676 Spliced 9530061G23 No Yes 637 15% 98%

B230206F22 Yes Yes 676 100% 100%

D430040K02 No Yes 615 27% 100%

Gtl2 3,199 Spliced 2900058E08 Yes No 1,229 100% 39%

6330403F08 Yes Yes 3,199 100% 100%

B230342G15 No No 418 100% 13%

H19 1,899 Spliced 1100001A04 No Yes 868 99% 46%

I0C0030C13 No Yes 1,842 82% 97%

Ipw 155 Spliced B230105C16 No Yes 155 4% 100%

Jpx/Enox 259 Spliced 2510040I06 No Yes 148 22% 93%

9830107K21 No Yes 259 7% 100%

G370019D15 No Yes 259 8% 100%

Kcnq1ot1/LIT1/Kvlqt1-AS 4,729 Unspliced C130002M05 Yes No 2,467 100% 52%

mirg 1,297 Spliced 2810474H01 No Yes 592 100% 46%

Mit1/Lb9 1,879 Unspliced 3110055B08 No No 240 97% 13%

Nespas 3,806 Unspliced and spliced D030028H20 Yes No 3,806 100% 100%

D330038P10 No No 629 100% 100%

Peg13 4,419 Unspliced F630009C06 No No 3,334 100% 75%

F930102O09 No No 3,277 100% 74%

telomerase RNA 397 Unspliced D430035J07 Yes No 397 14% 100%

U17HG 383 Spliced 3830421G02 No Yes 383 99% 100%

UHG 591 Spliced A730062M15 No Yes 476 100% 81%

Xist 17,919 Spliced 0610031I13 Yes No 518 100% 3%

2610022A11 Yes No 890 100% 5%

A430022B11 Yes No 1,495 100% 12%

D030072M03 Yes No 3,420 100% 19%

FANTOM3 cDNAs were tested for similarity to a reference set of non-sno, non-micro ncRNAs from RNAdb using BLASTN (for details, see Materials and Methods).aWhere multiple sequences exist for the reference ncRNA (e.g., different splice variants), the indicated length is that of the longest sequence.DOI: 10.1371/journal.pgen.0020037.t001



Figure 1. Snapshots of the GEV Showing Transcription

(A) The Air/Igf2r locus (Chromosome 17: 12,091,531–12,258,195).(B) The Xist/Tsix locus (X chromosome: 94,835,096–94,888,536).(C) The dystrophin (Dmd) locus (X chromosome: 76,500,000–76,754,601).For the transcripts, cDNA sequences from the RIKEN and public databases are shown, and are colored in brown and purple depending upon their



predictions (based on either FANTOM3 annotations, NCBIRefSeq sequences, or Ensembl gene models) and (2) therewere no intervening transcripts with major polyA signals andwithout adjunct adenine-rich regions (which would indicate

likely transcript termination sites). Using this approach, weidentified 191 genomic regions, containing 528 clusteredUNA TUs. To increase the likelihood that these regionsrepresented genuine long transcripts, we excluded any thatwere less than 10 Kb long or contained less than tenexpressed sequence tags (ESTs). This left 86 regions, whichwere then manually inspected using the GEV to look forpossible internal transcriptional start sites (e.g., CpG islands,CAGE tags, or multiple ESTs arising from the same position)or transcripts encoding small proteins not already filtered outin the discovery pipeline. Following this, 66 regions remained(Table 2). We named these long expressed noncoding regions(ENORs).

ENORs Successfully Identify Several Known Long ncRNAsTo assess the validity of our approach, we examined

whether known mouse macro ncRNAs were detected amongthe 66 ENORs. Notably, the cluster we had manuallyidentified as corresponding to Xist was not detected. Thiswas because one of the original Xist transcripts (GenBankaccession number X59289) remains annotated as a hypo-thetical protein of 299 amino acids based upon an earlierpresumption that it was translated [26]; consequently, thiscluster of cDNAs was automatically classified as beingprotein-coding and thus rejected. We did, however, succeedin identifying Air (ENOR60) and several other long ncRNAs.These included the following: Kcnq1ot1 (ENOR24), an im-printed antisense transcript of ;54 Kb [27]; Rian (ENOR44), aspliced 5.4-Kb imprinted transcript that spans more than 10Kb of mouse genome and acts as a host gene for multiplesmall nucleolar RNAs (snoRNAs) [28,29]; and Ube3a-ats(ENOR22), an imprinted, ;1,000-Kb antisense transcript thatis brain-specific and hosts numerous snoRNAs [30]. Addi-tionally, we detected Dleu2 (ENOR49), an alternatively splicedantisense ncRNA of ;1.4 Kb that spans more than 80 Kb andis a host gene for miRNAs [31,32]. Apart from Xist, the onlyother ncRNA in the RNAdb reference set longer than 5 Kbthat was not detected was Emx2os, a 5.04-Kb antisensetranscript that spans ;35 Kb [33]. Inspection of this locusshowed that it contained only one UNA cDNA. Takentogether, these observations indicated that our approachwas able to successfully detect existing long ncRNAs, althoughit missed some either because of annotation errors or becausethe number of UNA transcripts fell below our discoverypipeline threshold. Our approach also appeared to detectshorter ncRNAs such as Dleu2 that were spliced and spanneda long genomic region.

In Silico Characterization of ENOR RegionsNext, we sought to characterize the 66 ENORs in greater

detail (Table 2). The maximum number of UNA TUs perregion was 12 (ENOR59), and the average was 3.8 per 100 Kb.The region length ranged from 11 to 458 Kb, with a mean of92 Kb. The number, length, and distribution of the ENORsacross each chromosome are shown in Table S1 and FigureS1. Chromosome 8 had the highest number of ENORs (nine),with a total length of 860 Kb. Chromosome 16 had thegreatest length (1,089 Kb), as represented by three ENORs.

chromosomal strand of origin. Predicted genes from Ensembl, NCBI, and RefSeq databases are shown in gray. CpG islands as defined by the UCSCGenome Browser are shown. Blue circles indicate unspliced, noncoding RIKEN cDNAs with adjunct adenine-rich regions. Red circles indicate RIKENimprinted cDNA candidates [38].DOI: 10.1371/journal.pgen.0020037.g001

Figure 2. Discovery Pipeline for ENORs

FANTOM and public transcripts were clustered into 37,348 TUs bygrouping any two or more transcripts that shared genomic coordinates.Then, the following procedures were applied. (1) Protein-coding TUs wereexcluded by removing any whose transcripts had an open reading frameof either 150 amino acids or more (RIKEN/MGC cDNAs) or one amino acidor more (non-RIKEN/MGC cDNAs). (2) TUs wholly encompassed withinintrons of protein-coding TUs were excluded to avoid possible pre-mRNAintronic transcripts. (3) Intron-containing TUs were excluded to select forunspliced transcripts. (4) TUs lacking adjunct adenine-rich regions orcontaining polyA signals were excluded to select for internally primedtranscripts. (5) Remaining UNA TUs that mapped within 100 Kb of oneanother on the mouse genome (mm5) were clustered together, providedthey did not overlap the genomic coordinates of a protein-coding TU/NCBI RefSeq/Ensembl gene model with a CDS of 150 amino acids or moreor a noncoding TU with a polyA signal within 100 bp of the 39 end andwithout an adjunct adenine-rich region. (6) Reliably expressed UNA TUclusters were selected by identifying those with at least ten supportingESTs. (7) Selected UNA TU clusters were then manually screened andseparated based upon evidence of possible internal transcription statesites (based upon CpG islands, CAGE tags, and EST clusters), resulting inthe identification of 66 ENORs.DOI: 10.1371/journal.pgen.0020037.g002



Ta

ble

2.

Bio

info

rmat

icC

har

acte

riza

tio

no

f6

6EN

OR

s

Re

gio

n

ID

Ch

rom

oso

ma

l

Lo

cati

on

aS

tra

nd

Le

ng

th

(Kb

)

Nu

mb

er

of

UN

A

TU

s

Av

era

ge

TU

s/1

00

Kb

Sp

lici

ng

Sta

tus

Kn

ow

n

Se

nse

Tra

nsc

rip

tsb

An

tise

nse

Tra

nsc

rip

ts

(Pro

tein

-Co

din

g)c

An

tise

nse

Tra

nsc

rip

ts

(No

nco

din

g)c

sno

RN

As

miR

NA

sC

an

did

ate

Imp

rin

ted

RIK

EN

cDN

Asd

Hu

ma

n–

Mo

use

Sy

nte

ny

e

ENO

R1

chr1

:63

51

87

34

..63

54

14

20

þ2

32

8.8

Splic

ed

5830

412M

15R

ikN

du

fs1

(NA

DH

de

hyd

rog

en

ase

[ub

iqu

ino

ne

]

Fe-S

pro

tein

1)

No

ENO

R2

chr1

:13

77

90

61

4..1

37

81

57

60þ

25

41

5.9

Un

splic

ed

Ye

s

ENO

R3

chr1

:13

78

24

57

0..1

37

88

51

16þ

61

69

.9U

nsp

lice

dA

4301

06G

13R

ikm

ir-2

13,

mir

-181

b-1

Ye

s

ENO

R4

chr1

:19

12

66

59

0..1

91

28

21

91þ

16

21

2.8

Un

splic

ed

Pp

p2r

5a(p

rote

in

ph

osp

hat

ase

2,

reg

ula

tory

sub

un

itB

[B5

6],

alp

ha

iso

form

)

No

ENO

R5

chr1

:19

48

86

61

6..1

94

93

80

31þ

51

35

.8Sp

lice

dA

3300

23F2

4Rik

Mcp

(me

mb

ran

e

cofa

cto

rp

rote

in)

mir

-29b

-2,

mir

-29c

A2

30

02

5O

03

(M),

96

30

03

8K

03

(P)

Ye

s

ENO

R6

chr2

:38

78

92

88

..38

82

55

24

þ3

62

5.5

Un

splic

ed

Nr6

a1

(an

orp

han

nu

cle

arre

cep

tor)

mir

-181

a,

mir

-181

b-2

No

ENO

R7

chr2

:50

29

55

92

..50

37

56

87

þ8

02

2.5

Splic

ed

B2

30

31

2L0

2cD

NA

Ye

s

ENO

R8

chr2

:52

82

66

14

..52

93

65

21

þ1

10

54

.5U

nsp

lice

dD

2Ert

d29

5eY

es

ENO

R9

chr2

:67

59

27

86

..67

71

10

02

þ1

18

65

.1U

nsp

lice

d9

23

00

05

N2

3cD

NA

Ye

s

ENO

R1

0ch

r2:1

69

14

37

55

..16

92

18

83

5�

75

34

.0U

nsp

lice

d7

53

04

18

G0

4cD

NA

No

ENO

R1

1ch

r3:7

35

08

27

9..7

35

34

68

8þ

26

27

.6Sp

lice

dA

93

00

12

C2

4cD

NA

No

ENO

R1

2ch

r4:4

72

00

81

2..4

72

80

79

8�

80

33

.8U

nsp

lice

d4

93

24

22

O1

8cD

NA

Ye

s

ENO

R1

3ch

r4:1

40

33

20

09

..14

03

68

87

2þ

37

25

.4Sp

lice

dB

23

03

69

C1

9cD

NA

49

32

44

3I2

3ES

TY

es

ENO

R1

4ch

r4:1

40

80

63

32

..14

08

79

25

5�

73

34

.1U

nsp

lice

dN

o

ENO

R1

5ch

r5:2

83

32

43

5..2

83

99

70

2�

67

23

.0Sp

lice

dA

2300

98N

10R

ikF7

30

11

9A

01

cDN

AN

o

ENO

R1

6ch

r5:5

04

78

87

0..5

05

35

33

7�

56

47

.1U

nsp

lice

dA

33

00

78

N1

0(M

)Y

es

ENO

R1

7ch

r6:1

36

63

95

8..1

36

80

75

4þ

17

21

1.9

Splic

ed

1110

019D

14R

ikN

o

ENO

R1

8ch

r6:1

55

99

31

5..1

57

02

70

9�

10

32

1.9

Un

splic

ed

Ye

s

ENO

R1

9ch

r6:3

09

56

52

0..3

10

24

43

0�

68

34

.4U

nsp

lice

dY

es

ENO

R2

0ch

r6:5

29

61

31

8..5

29

95

71

0�

34

25

.8U

nsp

lice

dY

es

ENO

R2

1ch

r6:6

20

56

40

7..6

22

04

00

2þ

14

85

3.4

Un

splic

ed

No

ENO

R2

2ch

r7:4

68

08

31

1..4

69

37

81

4�

13

06

4.6

Un

splic

ed

Ub

e3a

-ats

(ub

iqu

itin

pro

tein

ligas

e

E3A

anti

sen

se)

Ub

e3a

(ub

iqu

itin

pro

tein

ligas

eE3

A)

A1

30

03

3P

09

(M),

B2

30

34

1L1

0(M

),

93

30

16

2G

02

(M),

93

30

15

6G

04

(M),

A2

30

07

3K

19

(M)

No

ENO

R2

3ch

r7:4

87

44

56

3..4

90

25

14

4�

28

18

2.9

Splic

ed

A33

0076

H08

Rik

,

A23

0057

D06

Rik

mir

-344

96

30

05

4K

16

(M)

No

ENO

R2

4ch

r7:1

30

91

78

83

..13

10

58

68

4�

14

16

4.3

Un

splic

ed

Kcn

q1o

t1(K

vlq

t1-a

s)

(KC

NQ

1o

verl

app

ing

tran

scri

pt

1)

Kcn

q1

(po

tass

ium

volt

age

-gat

ed

chan

ne

l,su

bfa

mily

Q,

me

mb

er

1)

A3

30

04

9H

05

(M)

Ye

s

ENO

R2

5ch

r8:3

10

90

53

8..3

13

46

69

8�

25

66

2.3

Un

splic

ed

E13

01

19

F02

cDN

AY

es

ENO

R2

6ch

r8:3

57

18

16

3..3

57

64

58

4�

46

48

.6U

nsp

lice

dY

es

ENO

R2

7ch

r8:4

95

45

73

7..4

95

70

09

6�

24

31

2.3

Splic

ed

1700

019L

22R

ik7

42

04

04

P1

9cD

NA

Ye

s

ENO

R2

8ch

r8:4

98

95

33

3..5

00

79

90

4�

18

58

4.3

Un

splic

ed

Ye

s

ENO

R2

9ch

r8:5

63

89

75

0..5

64

02

70

8�

13

32

3.1

Un

splic

ed

Ye

s

ENO

R3

0ch

r8:5

64

24

55

9..5

64

84

02

3�

59

35

.0U

nsp

lice

dA

93

00

38

F16

EST

Ye

s



Ta

ble

2.

Co

nti

nu

ed

Re

gio

n

ID

Ch

rom

oso

ma

l

Lo

cati

on

aS

tra

nd

Le

ng

th

(Kb

)

Nu

mb

er

of

UN

A

TU

s

Av

era

ge

TU

s/1

00

Kb

Sp

lici

ng

Sta

tus

Kn

ow

n

Se

nse

Tra

nsc

rip

tsb

An

tise

nse

Tra

nsc

rip

ts

(Pro

tein

-Co

din

g)c

An

tise

nse

Tra

nsc

rip

ts

(No

nco

din

g)c

sno

RN

As

miR

NA

sC

an

did

ate

Imp

rin

ted

RIK

EN

cDN

Asd

Hu

ma

n–

Mo

use

Sy

nte

ny

e

ENO

R3

1ch

r8:5

69

96

42

2..5

70

94

39

0�

98

55

.1U

nsp

lice

dA

33

00

54

M1

7(M

)N

o

ENO

R3

2ch

r8:1

08

49

59

97

..10

85

71

33

6þ

75

22

.7Sp

lice

dD

0300

68K

23R

ik49

2250

2B01

Rik

Ye

s

ENO

R3

3ch

r8:1

14

85

42

10

..11

49

56

54

1�

10

23

2.9

Un

splic

ed

Ye

s

ENO

R3

4ch

r9:8

37

75

88

6..8

38

16

08

5þ

40

25

.0U

nsp

lice

dN

o

ENO

R3

5ch

r9:9

11

60

52

3..9

11

83

08

7�

23

31

3.3

Un

splic

ed

No

ENO

R3

6ch

r9:1

01

01

05

85

..10

10

44

46

7þ

34

38

.9U

nsp

lice

d32

2240

2P14

Rik

No

ENO

R3

7ch

r10

:39

50

35

01

..39

52

01

82

�1

72

12

.0Sp

lice

dA

I42

67

48

Tra

f3ip

2(T

raf3

inte

ract

ing

pro

tein

2)

Ye

s

ENO

R3

8ch

r10

:91

88

02

73

..91

92

69

36

�4

73

6.4

Splic

ed

Dm

t2(d

ors

o-m

ed

ial

tele

nce

ph

alo

ng

en

e2

)

Ye

s

ENO

R3

9ch

r10

:10

95

70

57

5..1

09

89

65

99�

32

69

2.8

Splic

ed

9630

020C

08R

ikY

es

ENO

R4

0ch

r11

:18

64

68

95

..18

76

60

69

�1

19

32

.5Sp

lice

d84

3041

9K02

Rik

No

ENO

R4

1ch

r11

:11

27

11

49

5..1

12

81

02

79�

99

33

.0Sp

lice

d26

1003

5D17

Rik

Ye

s

ENO

R4

2ch

r12

:23

90

33

32

..23

92

55

72

þ2

22

9.0

Splic

ed

53

30

41

9A

06

cDN

AN

o

ENO

R4

3ch

r12

:51

69

13

66

..51

71

72

85

�2

64

15

.4Sp

lice

dE0

3001

9B13

Rik

Ye

s

ENO

R4

4ch

r12

:10

43

43

57

1..1

04

42

57

59þ

82

56

.1Sp

lice

dR

ian

(RN

Aim

pri

nte

d

and

accu

mu

late

d

inn

ucl

eu

s)

MB

II-48

,

MB

II-49

,

MB

II-34

3

mir

-341

,

mir

-370

A2

30

05

2J0

8(P

)Y

es

ENO

R4

5ch

r13

:27

82

50

37

..28

24

68

20

�4

22

11

2.6

Splic

ed

2610

307P

16R

ikY

es

ENO

R4

6ch

r13

:28

34

86

60

..28

41

62

95

þ6

82

3.0

Splic

ed

A33

0102

I10R

ikN

o

ENO

R4

7ch

r13

:35

48

54

15

..35

55

11

02

�6

63

4.6

Splic

ed

BC

03

46

64

cDN

AY

es

ENO

R4

8ch

r14

:41

37

76

72

..41

45

21

18

þ7

42

2.7

Splic

ed

A7

30

01

4E0

5cD

NA

A9

30

01

5C

18

cDN

AN

o

ENO

R4

9ch

r14

:53

58

59

43

..53

66

40

74

�7

82

2.6

Splic

ed

Dle

u2

(de

lete

din

lym

ph

ocy

tic

leu

kem

ia2

)

Trim

13(t

rip

arti

tem

oti

f

pro

tein

13

),K

cnrg

(po

tass

ium

chan

ne

l

reg

ula

tor)

mir

-16–

1,

mir

-15a

Ye

s

ENO

R5

0ch

r14

:10

10

67

46

1..1

01

16

56

69�

98

44

.1U

nsp

lice

dA

23

00

20

C1

9(M

)N

o

ENO

R5

1ch

r15

:83

02

25

9..8

35

69

66

þ5

52

3.7

Un

splic

ed

4933

421G

18R

ikY

es

ENO

R5

2ch

r15

:12

03

00

64

..12

04

18

53

þ1

22

17

.0U

nsp

lice

d9

33

01

15

C1

7(M

)N

o

ENO

R5

3ch

r15

:20

60

20

48

..20

62

25

83

þ2

12

9.7

Splic

ed

D1

30

08

0K

17

cDN

AY

es

ENO

R5

4ch

r15

:68

61

15

38

..68

62

59

28

�1

42

13

.9U

nsp

lice

dY

es

ENO

R5

5ch

r15

:82

69

59

45

..82

72

83

84

þ3

23

9.2

Un

splic

ed

Cyp

2d22

(cyt

och

rom

e

P4

50

,fa

mily

2,

sub

fam

ilyd

,

po

lyp

ep

tid

e2

2)

No

ENO

R5

6ch

r15

:96

69

18

93

..96

70

30

25

�1

12

18

.0Sp

lice

d26

1003

7D02

Rik

No

ENO

R5

7ch

r16

:72

37

69

98

..72

83

54

71

þ4

58

10

2.2

Un

splic

ed

B2

30

33

8E1

2(M

),

B2

30

31

5D

22

(M)

Ye

s

ENO

R5

8ch

r16

:75

22

64

29

..75

55

14

63

�3

25

72

.2U

nsp

lice

dA

73

00

76

E23

(M)

Ye

s

ENO

R5

9ch

r16

:77

74

11

08

..78

04

65

67

þ3

05

12

3.9

Splic

ed

2810

055G

20R

ikm

ir-9

9a,

let-

7c-1

,

mir

-125

b-2

Ye

s

ENO

R6

0ch

r17

:12

17

26

64

..12

27

54

48

þ1

03

76

.8Sp

lice

dA

ir(a

nti

sen

seIg

f2r

RN

A),

D17

Ertd

663e

Igf2

r(i

nsu

lin-l

ike

gro

wth

fact

or

2re

cep

tor)

A3

30

05

3J2

2(M

),

A5

30

04

0I0

5(M

)

No

ENO

R6

1ch

r17

:50

01

50

30

..50

06

46

12

þ5

05

10

.1U

nsp

lice

dSa

tb1

(sp

eci

al

AT

-ric

hse

qu

en

ce

bin

din

gp

rote

in1

)

No



The total length of the 66 ENORs was 6,044 Kb, correspond-ing to 0.23% of the mouse genome.We classified the 66 ENORs based upon the frequency of

spliced and unspliced ESTs (Table 3). Twenty-eight regionscontained numerous spliced ESTs, while the remaining 38regions included no or very few spliced ESTs. The longestunspliced region was ENOR57, which included ten UNA TUsspanning almost 460 Kb. Interestingly, we found that Air,which has previously been reported as unspliced [5], over-lapped with several spliced cDNAs and ESTs, suggesting thatAir may also exist as spliced isoforms. Consistent with thisidea, there is another ncRNA, Nespas, for which multiplespliced and unspliced forms have been reported, and thehuman–mouse conservation of these different isoformssuggests that they may be functionally relevant [34].Sequence conservation between different species indirectly

suggests function. To assess the conservation of the ENORs,we searched for syntenic human loci using mouse–humanwhole genome alignments available from the University ofCalifornia Santa Cruz (UCSC) Genome Browser [35,36]. ManyENORs (38 of 66) could be successfully aligned betweenmouse and human over at least 50% of their length (Table 2).However, a significant minority were not well-conserved, andthese included known functional ncRNAs such as Air andUbe3a-ats, which highlights that a lack of conservation doesnot necessarily imply a lack of function [37]. Because somelong poorly conserved ncRNAs such as Xist retain patches ofwell-conserved sequence [6], we also examined ENOR con-servation in short 50-nucleotide windows (Figure S2). Thisapproach indicated not only that ENORs have patches of highconservation but also that they are more conserved than thegenome average, so that while only ;45% of the mousegenome windows are alignable to the human genome, ;60%of ENOR windows are alignable.

ENORs Show Evidence of Imprinting and AntisenseTranscriptionBecause previous studies revealed associations between

macro ncRNAs and both imprinting and antisense tran-scription, we looked to see if our ENOR loci were associatedwith either of these phenomena.To examine imprinting, we obtained 2,114 candidate

imprinted mouse transcripts previously identified by Nikaidoet al. [38]. By mapping these transcripts to the mouse genome(May 2004 assembly; mm5), we found that 13 ENORs(containing 20 candidate imprinted cDNAs) showed evidenceof imprinting (Tables 2 and 3). This number was significantlyhigher than expected by chance (Chi-square, p , 0.001). Ofthe 13 ENORs identified, four contained well-characterizedimprinted ncRNAs (Rian, Air, Ube3a-ats, and Kcnq1ot1) andnine represent potentially imprinted ncRNAs.To characterize cis-antisense transcription, we searched

for transcripts that appeared in the complementary strandof each ENOR (Tables 2 and 3). Of 28 spliced ENORs, twocorresponded to known antisense ncRNAs (Air and Dleu2),and a further eight represented potentially novel antisensetranscripts to either protein-coding genes (Mcp, Ndufs1,and Traf3ip2) or to noncoding transcripts. In the case ofDleu2, which has been suggested to play a role in the splice-site regulation of its cognate antisense partner Trim13 [32],we also identified a potentially new antisense partner,Kcnrg. Of 38 unspliced ENORs, two corresponded toTa

ble

2.

Co

nti

nu

ed

Re

gio

n

ID

Ch

rom

oso

ma

l

Lo

cati

on

aS

tra

nd

Le

ng

th

(Kb

)

Nu

mb

er

of

UN

A

TU

s

Av

era

ge

TU

s/1

00

Kb

Sp

lici

ng

Sta

tus

Kn

ow

n

Se

nse

Tra

nsc

rip

tsb

An

tise

nse

Tra

nsc

rip

ts

(Pro

tein

-Co

din

g)c

An

tise

nse

Tra

nsc

rip

ts

(No

nco

din

g)c

sno

RN

As

miR

NA

sC

an

did

ate

Imp

rin

ted

RIK

EN

cDN

Asd

Hu

ma

n–

Mo

use

Sy

nte

ny

e

ENO

R6

2ch

r17

:69

28

15

88

..69

29

41

09

þ1

32

16

.0U

nsp

lice

dTg

if(T

Gin

tera

ctin

gfa

cto

r)N

o

ENO

R6

3ch

r18

:55

49

21

79

..55

58

21

08

þ9

02

2.2

Splic

ed

A7

30

05

2O

22

cDN

AY

es

ENO

R6

4ch

r19

:20

55

38

02

..20

56

50

04

þ1

12

17

.9U

nsp

lice

dN

o

ENO

R6

5ch

rX:7

17

60

15

..72

10

90

8�

35

25

.7U

nsp

lice

dN

o

ENO

R6

6ch

rX:1

51

93

51

62

..15

20

88

23

4�

15

36

3.9

Un

splic

ed

94

30

04

3O

19

(M)

Ye

s

aB

ou

nd

arie

sre

fer

toth

eg

en

om

icco

ord

inat

es

of

the

star

tan

de

nd

of

the

mo

st59

and

39

cDN

A,

resp

ect

ive

ly,

wit

hin

eac

hEN

OR

.b‘‘

Sen

se’’

refe

rsto

atr

ansc

rip

tw

ho

seg

en

om

icco

ord

inat

es

ove

rlap

pe

dw

ith

that

of

anEN

OR

atth

ep

re-m

RN

Ale

vel

and

that

was

loca

ted

on

the

sam

est

ran

das

the

ENO

R.

c‘‘

An

tise

nse’’

refe

rsto

atr

ansc

rip

tw

ho

seg

en

om

icco

ord

inat

es

ove

rlap

pe

dw

ith

that

of

anEN

OR

atth

ep

re-m

RN

Ale

vel

and

that

was

loca

ted

on

the

op

po

site

stra

nd

toth

eEN

OR

.d‘‘

(P)’’

and‘‘

(M)’’

refe

rto

pat

ern

ally

and

mat

ern

ally

imp

rin

ted

cDN

As,

resp

ect

ive

ly.

e‘‘

Ye

s’’

ind

icat

es

that

mo

reth

an5

0%

of

the

ENO

Rw

assu

cce

ssfu

llyal

ign

ed

toth

eh

um

ang

en

om

e.

DO

I:1

0.1

37

1/j

ou

rnal

.pg

en

.00

20

03

7.t

00

2



known ncRNAs antisense to Ube3a and Kcnq1, and a further13 represented potentially novel antisense transcripts toeither protein-coding genes (Cyp2d22, Nr6a1, Ppp2r5a,Satb1, Tgif, 3222402P14Rik, and 4933421G18Rik) or tononcoding transcripts. Many of the protein-coding genesare involved in development and disease, and, as with Igf2rand Air, the discovery of long noncoding antisense tran-scripts may be very important in understanding theregulation of these genes.

ENORs Are Associated with miRNAs and Show Tissue-Specific Expression

As indicated above, a number of ENORs corresponded toknown ncRNAs that act as host genes for either snoRNAs ormiRNAs. We were therefore interested to see whether any

other ENORs contained miRNAs or snoRNAs. We down-loaded 224 known miRNAs and 175 snoRNAs from themiRBase Registry and RNAdb, respectively [24,39]. We thenmapped these sequences to the mouse genome, and examinedthem for overlap with the 66 ENORs. We found that sevenENORs overlapped with 14 known miRNAs (14/224; 6%;Table 2), an association unlikely to have occurred by chance(p , 0.0001). Some of these ENORs also contained imprintedcDNA candidates, in keeping with a previously notedassociation between miRNAs and imprinting [40]. No newsnoRNA hosts were found.Next, we examined the expression of ENOR transcripts.

Using the publicly available mouse gene expression atlas datafrom the Genomics Institute of the Novartis Research

Table 3. Summary Characteristics of 66 ENORs

Splicing Status Antisense Transcription Imprinted Candidate miRNA Host snoRNA Host

Category Number Category Number

Spliced 28 Antisense to protein-coding mRNA 5 2 2 0

Antisense to ncRNA 5 0 0 0

No antisense transcripts 18 2 3 1

Unspliced 38 Antisense to protein-coding mRNA 9 2 1 0

Antisense to ncRNA 6 0 1 0

No antisense transcripts 23 7 0 0

DOI: 10.1371/journal.pgen.0020037.t003

Figure 3. ENOR Tissue Expression

Tissue expression information for individual ENORs was obtained using publicly available GNF Gene Expression Atlas data. GNF probes that overlappedENORs were identified, and the corresponding relative expression ratios for 61 tissues were hierarchically clustered. Red squares indicate highexpression, black squares indicate low expression, and grey squares indicate where expression was not reliably detected (based upon Affymetrix MAS5absent/present calls). med. olfactory epi., medial olfactory epithelium.DOI: 10.1371/journal.pgen.0020037.g003



Foundation (GNF) [41], we found that 23 ENORs wereexpressed in at least one of the 61 tissues examined. Thirty-three of the remaining ENORs did not have any correspond-ing GNF probes, while a further ten had probes whoseexpression was not reliably detected. Of the 23 ENORs, somewere expressed almost ubiquitously, while others showed arestricted, tissue-specific expression profile (Figure 3). Nota-bly, many ENORs were enriched in the central nervous system(CNS), and these included known brain-specific ncRNAsUbe3a-ats (ENOR22) and Rian (ENOR44) [28,30]. Because onlya minority of ENORs had supporting GNF information, wealso used RIKEN EST data to assess whether ENOR tran-scripts showed preferential expression in particular tissues.We searched for RIKEN ESTs that mapped within each ENORand tallied the number of clones associated with the ESTsthat were derived from a particular tissue (as per EdinburghMouse Atlas Project descriptions), and then we comparedthese counts with those of the entire FANTOM3 set. Wefound that ENOR transcripts as a whole are significantlyoverrepresented in a number of tissues including the CNS(Table S2). A caveat to this result is that RIKEN ESTs werederived after intensive subtraction, and their relativeabundance might therefore not reflect natural tissue ex-

pression, although the EST data were in general agreementwith the GNF results for a number of ENORs.As noted earlier, spliced ENORs such as that corresponding

to Dleu2 may not necessarily represent macro ncRNAsbecause clusters of UNA transcripts may be derived fromthe introns of longer pre-mRNAs whose final product may beless than 10 Kb. To proceed, we therefore focused ourattention on the unspliced class of ENORs, which wereasoned were most likely to represent novel macro ncRNAs.

Long PCR and Quantitative RT-PCR Provide IndirectEvidence of Macro ncRNAsAs proof of principle, we selected two regions for initial

experimental characterization: ENOR28 and ENOR31 (Figure4). ENOR28 (Figure 4B) was located on Chromosome 8(49,895,333–50,079,904; mm5), appeared unspliced, spanned185 Kb, and contained eight UNA TUs. ENOR31 (Figure 4C)was also on Chromosome 8 (56,996,422–57,094,390), appearedunspliced, spanned 98 Kb, and contained five UNA TUs, oneof which was a possible imprinted transcript [38]. Themajority of cDNAs in both regions were from commontissues (CNS), and this—together with their lack of splicingand greater than average length—made them good initialcandidates.

Figure 4. qRT-PCR Analysis

Analysis of (A) Air, (B) ENOR28, and (C) ENOR31 loci. Above in each panel, screen shots of the GEV featuring the loci around Air, ENOR28, and ENOR31 areshown. The orange bars indicate the regions for Air, ENOR28, and ENOR31. cDNA sequences from the RIKEN and public databases are shown. Sequencesmapped on the plus strand and minus strand are brown and purple, respectively. Predicted genes from Ensembl, NCBI, and RefSeq databases are shownin gray. For RIKEN imprinted transcripts, imprinted cDNA candidates identified previously [38] are shown. CpG islands as defined by the UCSC GenomeBrowser are shown. Positions of primer pairs are marked by small vertical arrows. Below in each panel, qRT-PCR results for midbrain, hippocampus,thalamus, striatum, and testis using the corresponding primer pairs are shown.DOI: 10.1371/journal.pgen.0020037.g004



Initially, we looked for the presence of transcriptionbetween neighboring cDNAs by long PCR (all of the PCRprimers are shown in Table S3). Figure 5 shows that wesuccessfully amplified transcripts between cDNAs 7120464J01and C030047J05 in ENOR31, and between cDNAsC130073E24 and D130067E14 in ENOR28. These resultssuggested that directly adjacent cDNAs arise from a commontranscript.

If each region represents one continuous transcript underthe control of a single promoter in a given tissue, wereasoned, then across multiple tissues the levels of expressionfor each ENOR cDNA should remain consistent with those ofthe other cDNAs in the region. Total RNA was thereforeisolated from eight different tissues, and each ENOR cDNAexpression profile was examined by quantitative real-timeRT-PCR (qRT-PCR). We found that the expression of allcDNAs (apart from A730038J21 in ENOR28) was highlyintercorrelated (R . 0.9; Table S4), thus providing indirectevidence that not only directly adjacent cDNAs but also thosemore remote from one another were from the sametranscript.

Northern Blots Directly Confirm Existence of MultipleNovel Macro ncRNAs

Northern blot analysis is a direct means to demonstrate theexistence of very large RNAs. We therefore selected eightENORs (ENOR2, ENOR14, ENOR16, ENOR28, ENOR31,ENOR54, ENOR61, and ENOR62), which together wererepresentative of a broad range of lengths, chromosomes,and EST abundance, and tested them by northern blot usingspecific probes (Table S3). As a positive control, we also

examined ENOR60, which corresponds to Air. Figure 6 showsthat Air was readily detected as a band greater than 10 Kb insize. Similarly, probes against ENOR2, ENOR16, ENOR28,ENOR31, and ENOR54 all detected clearly visible bandsgreater than 10 Kb. Other ENORs gave less clear results.ENOR61 had a broad signal that appeared as a smearoriginating from the upper reaches of the gel, and it wasunclear whether this was due to degradation of a largetranscript or was nonspecific. ENOR62 produced a similarresult. Probes for ENOR14, on the other hand, detected onemajor product larger than 7.5 Kb and possibly another largerthan 10 Kb. Thus, in six of nine cases, we were able tosuccessfully demonstrate macro RNAs larger than 10 Kb andin the remaining three cases the results were equivocal.

Detailed qRT-PCR Analysis Reveals That ENORs MightContain Multiple Long TranscriptsNorthern blots do not accurately resolve the size of

transcripts larger than 10 Kb. For this reason, the 108-KbAir transcript in Figure 6 appears only just above 10 Kb(which is similar to the original northern blot result obtainedby Lyle et al. [42]), and the actual size of the other macroncRNAs cannot be successfully determined from the blots. Togain a better understanding of the true extent of tran-scription across our regions, we therefore performed furtherqRT-PCR analysis across our original candidate ENORs,ENOR28 and ENOR31. Specific primer pairs were designedbefore, after, and along the length of the region, incorporat-ing individual cDNAs as well as the areas in between (Figure4B and 4C; Table S3). As a control, ENOR60, containing Air,was analyzed in a similar manner (Figure 4A; Table S3).To begin, we extracted total RNA from different CNS

tissues (midbrain, hippocampus, corpus striatum, and thala-mus) and from testis, and assessed the level of expression of atleast 20 separate subregions spanning the length of bothENORs as well as Air. Figure 4A demonstrates that, in the Airlocus, expression arises from downstream of a CpG island, aspreviously reported [43], and then remains relatively constantfor the next 70–80 Kb. Beyond this, expression falls below 100transcripts per 12.5 ng of total RNA for the next 30–40 Kb

Figure 5. Presence of Transcription between Adjacent cDNAs

PCR was carried out with and without reverse transcription (RT[þ] andRT[�], respectively) using midbrain total RNA and the correspondingprimer pairs (see Table S3). PCR using genomic DNA was also carried outas a control. A DNA ladder (Promega; http://www.promega.com) wasused as a size marker. The amplified fragments were confirmed as theexpected ones by analyzing digestion pattern using several restrictionenzymes. The lower band, observed in the RT(þ) lane of the amplifiedfragment C, seems to be nonspecific, because it was amplified using onlythe right primer and because it showed a digestion pattern withrestriction enzymes quite different from that of the upper band and theband of the genomic DNA (unpublished data).DOI: 10.1371/journal.pgen.0020037.g005

Figure 6. Northern Blot Analysis of ENOR Transcripts

Mouse whole brain total RNA (10 lg/lane) was used for the analysisexcept for ENOR2 and ENOR61, where mouse thymus total RNA wasused. DNA fragments without any predicted repeated sequences werePCR-amplified from cDNAs in ENORs (Table S3), labeled with 32P-dCTP(Amersham Biosciences), and then used as probes. RNA size wasestimated with an RNA ladder (Invitrogen). ENORs are listed in increasingorder based on the estimated length of each region.DOI: 10.1371/journal.pgen.0020037.g006



(primer pairs 11–15) then rises and plateaus again for afurther 30 Kb. Examination of the alignment between thegenomic DNA sequence of Air (GenBank accession numberAJ249895) and the genome assembly revealed that there aretwo inserted sequences in the genome assembly (dotted linesin Figure 4A). These sequences are disconnected by gaps inthe genome assembly, indicating that the transient fall inexpression is an artifact. Overall, then, this result was inkeeping with the presence of a continuous macro ncRNA;100–110 Kb in size, and provided evidence that the qRT-PCR-based strategy employed here was able to successfullydetect such long transcripts and to provide a reasonableestimate of their size.

Figure 4B illustrates expression across the ENOR28 locus.In CNS tissues, the overall expression pattern was similar tothat of Air, with sustained expression over tens of kilobases attranscript copy numbers in the hundreds to thousands (per12.5 ng of total RNA). Looking more closely and starting fromupstream of the 59 end, expression levels are at their lowestaround primer pair 20, dipping below 100 copies; next,transcript levels for primer pairs 12–18 are intermediate;finally, from primer pairs 1–11 (a distance larger than 100Kb), expression is highest of all, is relatively constant, andextends well beyond the previously defined 39 ENORboundary. Our previous experience using primer pairsagainst different positions of the same protein-coding geneshad indicated that the expected differences in transcript copynumber are generally less than 2-fold for the same transcript(unpublished data). Assuming such results can be appliedhere, the roughly 10-fold variation in CNS expression acrossENOR28 challenges our original hypothesis of a singlepromoter driving expression across the entire region. Rather,it is possible that a number of separate transcripts arepresent, the longest of which spans primer pairs 1–11 andappears to be larger than 100 Kb. Interestingly, testisexpression fell below detection threshold at both the 59 and39 ends of ENOR28, suggesting the existence of a shortertestis-specific transcript.

Figure 4C shows that expression in the ENOR31 locus wasrelatively constant and extensive, with transcript copynumbers in CNS greater than 1,000 per 12.5 ng of totalRNA not only within but also up- and downstream of theoriginal ENOR boundaries. Approximately 10-fold expres-sion spikes at primer pairs 15–16 and 7–8 suggested thepossibility of up to three separate transcripts larger than 50Kb. Testis expression gave a similar pattern but was muchlower than CNS expression. Overall, then, assuming that a 10-fold variation in transcript levels between primer pairs isindicative of separate transcripts, both the ENOR28 andENOR31 loci appear to produce not one but several macroncRNAs (all of which are enriched in brain). However, it isworth noting that our data for Air (Figure 4A) (excludingregions with assembly gaps) also showed ;10-fold variation intranscript levels (e.g., primer pairs 9–10). Since Air isgenerally acknowledged to be a continuous transcriptspanning ;108 Kb [5], it seems plausible that a 10-foldvariation in transcript levels between primer pairs need notindicate multiple transcripts. If that is true, then the data forENOR28 and ENOR31 would support the alternative con-clusion that each region gives rise to a single macro ncRNAlarger than 100 Kb.

ENOR Transcripts Predominantly Localize to the NucleusSubcellular localization may provide clues to the function

of ENOR transcripts. For instance,Xist exerts its chromosomalsilencing effect within the nucleus [6]. We therefore examinedthe localization of the same eight ENORs (ENOR2, ENOR14,ENOR16, ENOR28, ENOR31, ENOR54, ENOR61, andENOR62) we previously had characterized via northern blotby comparing brain expression levels from cytoplasmic andtotal RNA (the latter consists of both cytoplasmic and nuclearRNA). To validate our method, we initially tested b-glucur-onidase (Gusb) mRNA, a housekeeping gene, and the RianncRNA (ENOR44), which preferentially localize to thecytoplasm and nucleus, respectively [28]. Figure 7 shows that,in keeping with our expectations, the copy number for GusbmRNA was similar in cytoplasmic and total RNA (whichsuggests that there is a negligible nuclear RNA component)while Rian exists in cytoplasmic RNA at much lower levelsthan in total RNA (which suggests that the nuclear componentpredominates). Interestingly, when we examined the eightENOR transcripts in an identical manner (Figure 7), seven ofthem (ENOR2, ENOR14, ENOR16, ENOR28, ENOR31,ENOR54, and ENOR62) showed much higher expression intotal RNA, suggesting that they are localized in the nucleus.ENOR 61, on the other hand, appeared to be cytoplasmic.

Discussion

The analysis of full-length enriched cDNA libraries has beenof vital importance in improving our understanding of themammalian transcriptome. In this regard, however, unsplicednoncoding cDNAs are often viewed with skepticism becausethey can arise as truncation artifacts of cDNA libraryconstruction. Here, we have shown that such artifacts clusterwithin very long, functionally important ncRNAs such as Airand Xist, and, rather than summarily dismissing these cDNAsas worthless, we have employed a strategy that uses them toidentify long ncRNAs genome-wide. The resulting list of 66

Figure 7. Localization of ENOR Transcripts

qRT-PCR was carried out using total and cytoplasmic RNA from mousewhole brain and the corresponding primer pairs (Table S3). ENORs arelisted in increasing order based on the estimated length of each region.Apart from the results shown, we also examined the localization of othermRNAs (b-actin and GAPDH) and additional regions of Rian and otherENORs, and these results were consistent with the rest (unpublished data).DOI: 10.1371/journal.pgen.0020037.g007



candidate ENORs—itself almost certainly an underestimate—potentially expands several-fold the number of known mousencRNAs larger than 10 Kb in size, which, to date, includes onlya few examples such as Xist, Air, Kcnq1ot1, and Ube3a-ats, mostof which were successfully detected with our methods. In thepast, such macro ncRNAs have been discovered experimen-tally on an ad hoc basis, and it has not been possible tosystematically identify large ncRNAs by bioinformatics means,since most existing tools are limited to the discovery of smallerncRNAs with conserved primary sequences and/or secondarystructures [44]. Our strategy offers a solution to this problem.

Expression studies produced a number of interestingobservations. First, in silico analysis indicated that someENORs cluster together within the genome and are coex-pressed. For example, ENOR22 and ENOR23 are locatedwithin 2,300 Kb of each other on Chromosome 7 and arespecifically expressed inCNS.Onepossible explanation for thiscoexpression is that these regions share a common chromatindomain. Second, we found that the majority of ENORtranscripts were predominantly nuclear, similar to functionalncRNAs such as Xist and Tsix. ncRNAs like these areincreasingly being recognized as important in altering chro-matin structure [45,46], and it is tempting to speculate that theENOR transcripts might also function in this way. Third, qRT-PCR studies of the ENOR28 and ENOR31 loci (Figure 4)indicated that the actual transcribed regions are almostcertainly underestimated based upon current ENOR bounda-ries. This is not surprising, since the boundaries were estimatedusing internally primed transcript coordinates, and reflectsthat our discovery pipeline was not designed to capturetranscription start and end sites. Lastly, despite the possibleexistence of multiple macro ncRNAs in ENOR28 and ENOR31,expression correlation between the individual cDNAs wasextremely high (average R ¼ 0.96). This indicates that even ifthere are separate transcripts arising from each region theyappear to be under the influence of similar regional promoters,enhancers, or chromatin domains. Fluorescence in situ hybrid-ization studies might prove useful to visualize the ENORtranscripts and their surrounding chromatin structure (via theuse of histone-specific antibodies), and may also directlydemonstrate in which specific cell types and subcellularcompartments ENORs are localized. For instance, knowingexactly which groups of neurons in the brain express ENOR28andENOR31 transcriptsmight provide indirect information asto their function. Understanding how the expression of thesetranscripts is regulated will also be important. For instance,fine-detailed mapping of transcript copy number by qRT-PCRusing more primer pairs might better define the relevanttranscriptional start sites and promoter regions.

Macro ncRNAs can function in a variety of ways, and someclues to the possible function of the ENORs can be gleanedfrom their association with antisense transcription, candidateimprinting domains, and miRNAs. Antisense transcripts exertregulatory effects in a number of ways, as mentioned earlier.Some of these effects (e.g., RNA interference and translationregulation) can be mediated by small miRNAs and siRNAs,and it is unclear if longer antisense transcripts—such as thoseidentified in this study—are required to function in certainregulatory contexts. Of course, long antisense transcriptsmight be processed into smaller functional RNAs, althoughthere has been no evidence that Xist or Air, for instance, workin this manner. Macro ncRNAs can also regulate genomic

imprinting. Ube3a-ats, Kcnq1ot1, and Air have all beenimplicated in the imprinting control of their antisensetranscripts. These three ncRNAs are themselves imprinted, afact correctly predicted by the methods we used here. Thesesame methods suggest that a further nine ENORs mightrepresent potentially imprinted ncRNAs, which, if confirmed,would add substantially to the number of imprinted ncRNAscurrently characterized. Finally, in silico analysis detectedoverlap between ENORs and more than 5% of known mousemiRNAs, suggesting that one of the possible functions of someof these regions may be to act as miRNA host genes. Given arecent report indicating that many mammalian miRNAs arestill to be discovered [47], the possibility exists that moreENORs will be associated with novel miRNAs in the future.Lacking any direct evidence of ENOR function, we also

acknowledge the possibility that some of these regions do notplay any functional role as RNAs. It has been shown, forinstance, that expression of the yeast noncoding RNA Srg1 isnecessary for the repression of its downstream gene, Ser3, butthis appears to be due to the act of Srg1 transcription (causingSer3 promoter interference) rather than any direct action ofthe Srg1 RNA itself [48]. Meanwhile, Wyers et al. found thatintergenic transcripts in yeast are rapidly degraded by aspecific nuclear quality control pathway and are thereforelikely to be nonfunctional [49]. Another recent report inwhich megabase deletions of noncoding DNA were engi-neered and failed to produce any detectable phenotype inmice [50] suggests that large noncoding regions of the genomemay not have function. It should be noted, however, that theregions targeted in this deletion study lacked evidence oftranscription, in direct contrast to the regions we havecharacterized. A suggestion has also been made that manynoncoding transcripts simply represent useless by-products of‘‘leaky transcription’’ [51]. Based upon our expression studiesof ENOR28 and ENOR31, transcripts from both these regionsappear to be clearly expressed in brain (estimated at 1–8copies/cell based upon our previous work [52], which is similarto Air [Figure 4] and to most mRNAs [53]), suggesting that inthese cases, at least, transcripts are controlled. To demon-strate the importance (or otherwise) of the ENORs, it willultimately be necessary to test their function directly. This,together with efforts to better understand the gene structure,expression, and regulation of individual transcripts withineach region, is the challenge that lies ahead.

Materials and Methods

Identification of known mouse ncRNAs within the FANTOM3cDNA collection. Non-sno, non-micro reference mouse ncRNAsequences were downloaded from RNAdb, a database of mammalianncRNAs (http://jsm-research.imb.uq.edu.au/rnadb) [24]. BLASTN wasused to assess the similarity between the 102,801 FANTOM3 cDNAsand the reference ncRNAs using an initial E-value cutoff of 0.01, andany resulting hits with 98% or greater identity across 90% or more ofthe length of either a query cDNA or reference ncRNA sequence wereconsidered significant matches. Repetitive sequences were identifiedin the FANTOM3 sequences using the union of RepeatMasker (http://www.repeatmasker.org) and runnseg predictions, and BLAST options–F ‘‘m’’ –U T were used to ignore repeats in the seeding but not theextension stage of the alignment.

Genome-wide search for clusters of internally primed cDNAs. Weused the TU data prepared for the FANTOM3 project (ftp://fantom.gsc.riken.jp/RTPS/fantom3_mouse/primary_est_rtps/TU),which were generated by clustering the following mouse cDNA andEST sequences: (1) 56,006 mRNA sequences from GenBank (Release139.0 and daily [2004–1–27]), (2) 102,597 RIKEN cDNAs from theFANTOM3 set, (3) 606,629 RIKEN 59-end ESTs (59-end set), (4)



907,007 RIKEN 39-end ESTs (39-end set), and (5) 1,569,444 GenBankEST sequences. Figure 2 summarizes the subsequent search pipeline,a full description of which was provided in the Results.

Bioinformatic analysis of candidate clusters. To judge whetherENOR sequences were spliced or unspliced, we searched for all TUsthat overlapped with the chromosomal boundaries of each ENOR andwere on the same strand. We included any spliced TUs whose intronicarea overlapped with a region. We then counted ESTs associated withthe TUs and classified the regions as follows: spliced, if spliced ESTswere more than 10% of total ESTs; otherwise unspliced. We used thethreshold of 10% since a certain number of ESTs can be expected tobe inappropriately mapped onto the genome and may thereforeappear as falsely spliced ESTs. To find transcripts on the sense orantisense strand, we searched for TUs that overlapped with theregions on the same or opposite strand based on genomiccoordinates. We searched for the gene name associated with theseTUs, as defined by the RTPS pipeline used for FANTOM2 andFANTOM3 [54], and selected appropriate names manually. For thespliced ENORs, we selected the gene name of major transcripts on thesame strand. For the unspliced ENORs, we used only informativegene names because uninformative names such as the RIKEN cloneIDs were associated with unspliced cDNAs that covered only shortregions. We also searched for gene names on the MGI database (http://www.informatics.jax.org) and used official gene symbols if available.

To examine ENOR conservation, we used blastz axtNet alignmentsfrom UCSC (ftp://hgdownload.cse.ucsc.edu/goldenPath/mm5/vsHg17)to identify blocks in the mouse genome that successfully align withthe human genome. We classified individual ENORs as conserved ifthe total length of alignable blocks was greater than 50% of theENOR length. To determine the overall conservation levels of ENORsequences, the mouse genome was divided into 50-nucleotidewindows, and the number of identically matching nucleotides ineach window in the human genome was counted for both the ENORsand the genome as a whole.

Information on candidate imprinted cDNAs was provided byNikaido et al. ([38]; http://fantom2.gsc.riken.go.jp/imprinting), andlists of miRNA and snoRNAs were downloaded from the miRBaseRegistry (http://www.sanger.ac.uk/Software/Rfam/mirna) and RNAdb,respectively, and mapped to the mouse genome (mm5) usingMEGABLAST with options –F F –D 1 –J F. We then searched theseimprinted cDNAs, miRNAs, and snoRNAs for overlap with the ENORloci based on genomic coordinates. To determine whether theassociation between candidate imprinted cDNAs and ENOR loci waslikely to have occurred by chance, we randomly sampled 66 regionswith an average cDNA density equal to that of the ENORs (five RIKENcDNAs per region) and determined the number of regions thatcontained at least one candidate imprinted cDNA; this procedure wasrepeated 100 times, and the significance was determined using a Chi-square test. To determine the significance of the association betweenmiRNAs and ENOR loci, we performed the following calculation:given that ENORs cover 0.23% of the genome, the probability that amiRNA lies in an ENOR on the same strand is 0.00233 0.5. Using thebinomial distribution, the probability that 14 or more out of 224miRNAs lie in ENORs is about 3 3 10�20 (i.e., p , 0.0001).

To examine ENOR expression, we identified GNF Gene ExpressionAtlas (http://expression.gnf.org/cgi-bin/index.cgi) probes that over-lapped with the genomic loci of the ENORs via the UCSC GenomeBrowser (http://genome.ucsc.edu), then downloaded the relevantexpression data (http://symatlas.gnf.org). Affymetrix MAS5 softwareabsent/present calls were used to identify probes with detectableexpression in at least one of the 61 tissues tested. Log 2 ratio expressiondata for these probes were then hierarchically clustered via averagelinkage clustering using Cluster software [55]. Additionally, we down-loaded the list of RIKEN libraries and their corresponding EdinburghMouse Atlas Project (http://genex.hgu.mrc.ac.uk) tissue descriptions,then searched for RIKENESTs thatmappedwithin an ENOR region onthe same strand, and tallied the number of ESTs that were derived fromeach tissue library. We counted 59 EST and 39 EST sequences derivedfrom a same clone only once. Library information for some ESTs couldnot be used because of uninformative tissue descriptions.

Primers. Primer pairs were designed using Primer3 software [56],with an optimal primer size of 20 bases and annealing temperature of60 8C (see Table S3). The uniqueness of the designed primer pairs waschecked by a BLAST search (http://www.ncbi.nlm.nih.gov/BLAST) sothat homologous regions were not cross-amplified by the sameprimer pair.

Preparation of RNA samples. Adult male C57BL/6J mice werekilled according to the RIKEN Institute’s guidelines, and the tissueswere removed. Total RNA was extracted by the acid phenol-guanidinium thiocyanate-chloroform method [57]. Cytoplasmic

RNA was prepared as described elsewhere [58]. RNA was checkedby agarose gel electrophoresis and was treated with DNaseI beforeRT-PCR as described elsewhere [52].

RT-PCR analysis of candidate clusters. First-strand cDNA synthesis(5 lg of total RNA per 20-ll reaction) was carried out using a randomprimer and the ThermoScript RT-PCR System (Invitrogen; http://www.invitrogen.com) in accordance with the manufacturer’s proto-col. qRT-PCR was carried out with first-strand cDNA correspondingto 12.5 ng of total RNA per test well using the tailor-made reaction[52]. The PCR reactions were performed with an ABI Prism machine(Applied Biosystems; http://www.appliedbiosystems.com) using thefollowing cycling protocols: 15-min hot start at 94 8C, followed by 40cycles of 15 s at 94 8C, 30 s at 60 8C, and 30 s at 72 8C. The thresholdcycle (Ct) value was calculated from amplification plots, in which thefluorescence signal detected was plotted against the PCR cycle. Thenumber of transcripts was calculated from the slope of the standardcurve using genomic DNA.

Long PCR. Long PCR was carried out with first-strand cDNAcorresponding to 500 ng of total RNA and KOD DNA polymerase(Toyobo; http://www.toyobo.co.jp/e) per 50-ll reaction according tothe manufacturer’s protocol. We also used 200 ng of mouse genomicDNA, instead of first-strand cDNA, to amplify the fragments from thegenome. The PCR reactions were performed with an ABI9700(Applied Biosystems) using the following cycling protocols: 2-minhot start at 94 8C, followed by 35 cycles of 15 s at 94 8C, 30 s at 60 8C,and 5 min at 68 8C. One to two microliters of sample was subjected to1% agarose gel electrophoresis.

Northern blot. Total RNA was denatured by formaldehyde/forma-mide and electrophoresed in a 1% agarose gel. RNA was transferredontoHybond-Nþnylonmembrane (GEHealthcare Life Sciences; http://www4.amershambiosciences.com). Hybridization was carried out using32P-labeled DNA probe and ExpressHyb hybridization solution (BDBiosciences; http://www.bdbiosciences.com) according to the manufac-turer’s protocol. The hybridization signal was detected using a BAS2500image analyzer (Fujifilm; http://www.fujifilm.com).

Supporting Information

Figure S1. Genomic Distribution of 66 ENORs

Found at DOI: 10.1371/journal.pgen.0020037.sg001 (68 KB PPT).

Figure S2. ENORs Are More Conserved than the Genome Average

Found at DOI: 10.1371/journal.pgen.0020037.sg002 (68 KB PPT).

Table S1. ENORs on Each Chromosome

Found at DOI: 10.1371/journal.pgen.0020037.st001 (17 KB XLS).

Table S2. EST Tissue Data for 66 ENORs


Table S3. Primer Pairs


Table S4. Expression Correlation between cDNAs within ENOR28and ENOR31

Data for (A) ENOR28 and (B) ENOR31.


Accession Numbers

The MGI (http://www.informatics.jax.org) accession numbers for thesequences described in this paper are 3222402P14Rik (2442104),4933421G18Rik (1913976), Air (1353471), Cyp2d22 (1929474), Dleu2(1934030), Dmd (94909), Emx2os (3052329), Gusb (95872), Igf2r (96435),Kcnq1ot1 (1926855), Kcnrg (2685591), Mcp (1203290), Ndufs1 (2443241),Nespas (1861674), Nr6a1 (1352459), Ppp2r5a (1929474), Rian(19222995), Satb1 (105084), Slc22a2 (18339), Slc22a3 (1333817), Tgif(1194497), Traf3ip2 (2143599), Trim13 (1913847), Tsix (1336196), Ube3a(105098), and Xist (98974). The SGD (http://www.yeastgenome.org)accession number for yeast Srg1 is S000029010. The NCBI EntrezGene(http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db¼gene) accessionnumber for yeast Ser3 is 856814.

Acknowledgments

Author contributions. MF, KCP, PC, YH, and HS conceived anddesigned the experiments. NN, SF, CK, and HS performed theexperiments. MF, KCP, MCF, and HS analyzed the data. MF, CK, JK,



PC, YH, and HS contributed reagents/materials/analysis tools. MF,KCP, CB, JSM, and HS wrote the paper.

Funding. This study was supported by research grants from theAustralian Research Council to JSM and from the Ministry ofEducation, Culture, Sports, Science and Technology of the JapaneseGovernment to YH for: (1) the Genome Network Project, (2) theRIKEN Genome Exploration Research Project, and (3) the Strategic

Programs for R&D of RIKEN. MF is supported by The JacksonLaboratory Postdoctoral Fellowship award. KCP is supported by aNational Health and Medical Research Council Medical PostgraduateScholarship. MCF is supported by a University of QueenslandPostdoctoral Fellowship.

Competing interests. The authors have declared that no competinginterests exist. &

References1. Mattick JS, Makunin IV (2005) Small regulatory RNAs in mammals. Hum

Mol Genet 14 (Suppl 1): R121–R132.2. Pfeffer S, Zavolan M, Grasser FA, Chien M, Russo JJ, et al. (2004)

Identification of virus-encoded microRNAs. Science 304: 734–736.3. Bartel DP (2004) MicroRNAs: Genomics, biogenesis, mechanism, and

function. Cell 116: 281–297.4. Brockdorff N, Ashworth A, Kay GF, McCabe VM, Norris DP, et al. (1992)

The product of the mouse Xist gene is a 15 kb inactive X-specific transcriptcontaining no conserved ORF and located in the nucleus. Cell 71: 515–526.

5. Sleutels F, Zwart R, Barlow DP (2002) The non-coding Air RNA is requiredfor silencing autosomal imprinted genes. Nature 415: 810–813.

6. Wutz A (2003) Xist RNA associates with chromatin and causes genesilencing. In: Barciszewski J, Erdmann VA, editors. Noncoding RNAs:Molecular biology and molecular medicine. Georgetown (Texas): LandesBioscience. pp. 49–65.

7. Delaval K, Feil R (2004) Epigenetic regulation of mammalian genomicimprinting. Curr Opin Genet Dev 14: 188–195.

8. Kelley RL, Kuroda MI (2000) Noncoding RNA genes in dosage compensa-tion and imprinting. Cell 103: 9–12.

9. Chen J, Sun M, Kent WJ, Huang X, Xie H, et al. (2004) Over 20% of humantranscripts might form sense-antisense pairs. Nucleic Acids Res 32: 4812–4820.

10. Yelin R, Dahary D, Sorek R, Levanon EY, Goldstein O, et al. (2003)Widespread occurrence of antisense transcription in the human genome.Nat Biotechnol 21: 379–386.

11. Kiyosawa H, Yamanaka I, Osato N, Kondo S, Hayashizaki Y (2003) Antisensetranscripts with FANTOM2 clone set and their implications for generegulation. Genome Res 13: 1324–1334.

12. Katayama S, Tomaru Y, Kasukawa T, Waki K, Nakanishi M, et al. (2005)Antisense transcription in the mammalian transcriptome. Science 309:1564–1566.

13. Lavorgna G, Dahary D, Lehner B, Sorek R, Sanderson CM, et al. (2004) Insearch of antisense. Trends Biochem Sci 29: 88–94.

14. Jen CH, Michalopoulos I, Westhead DR, Meyer P (2005) Natural antisensetranscripts with coding capacity in Arabidopsis may have a regulatory rolethat is not linked to double-stranded RNA degradation. Genome Biol 6:R51.

15. Cawley S, Bekiranov S, Ng HH, Kapranov P, Sekinger EA, et al. (2004)Unbiased mapping of transcription factor binding sites along humanchromosomes 21 and 22 points to widespread regulation of noncodingRNAs. Cell 116: 499–509.

16. Cheng J, Kapranov P, Drenkow J, Dike S, Brubaker S, et al. (2005)Transcriptional maps of 10 human chromosomes at 5-nucleotide reso-lution. Science 308: 1149–1154.

17. Okazaki Y, Furuno M, Kasukawa T, Adachi J, Bono H, et al. (2002) Analysisof the mouse transcriptome based on functional annotation of 60,770 full-length cDNAs. Nature 420: 563–573.

18. Numata K, Kanai A, Saito R, Kondo S, Adachi J, et al. (2003) Identificationof putative noncoding RNAs among the RIKEN mouse full-length cDNAcollection. Genome Res 13: 1301–1306.

19. Carninci P, Kasukawa T, Katayama S, Gough J, Frith MC, et al. (2005) Thetranscriptional landscape of the mammalian genome. Science 309: 1559–1563.

20. Willingham AT, Orth AP, Batalov S, Peters EC, Wen BG, et al. (2005) Astrategy for probing the function of noncoding RNAs finds a repressor ofNFAT. Science 309: 1570–1573.

21. Mattick JS (2005) The functional genomics of noncoding RNA. Science 309:1527–1528.

22. Ravasi T, Suzuki H, Pang KC, Katayama S, Furuno M, et al. (2006)Experimental validation of the regulated expression of large numbers ofnon-coding RNAs from the mouse genome. Genome Res 16: 11–19.

23. Nam DK, Lee S, Zhou G, Cao X, Wang C, et al. (2002) Oligo(dT) primergenerates a high frequency of truncated cDNAs through internal poly(A)priming during reverse transcription. Proc Natl Acad Sci U S A 99: 6152–6156.

24. Pang KC, Stephen S, Engstrom PG, Tajul-Arifin K, Chen W, et al. (2005)RNAdb—A comprehensive mammalian noncoding RNA database. NucleicAcids Res 33: D125–D130.

25. Carninci P, Waki K, Shiraki T, Konno H, Shibata K, et al. (2003) Targeting acomplex transcriptome: The construction of the mouse full-length cDNAencyclopedia. Genome Res 13: 1273–1289.

26. Borsani G, Tonlorenzi R, Simmler MC, Dandolo L, Arnaud D, et al. (1991)Characterization of a murine gene expressed from the inactive Xchromosome. Nature 351: 325–329.

27. Engemann S, Strodicke M, Paulsen M, Franck O, Reinhardt R, et al. (2000)Sequence and functional comparison in the Beckwith-Wiedemann region:Implications for a novel imprinting centre and extended imprinting. HumMol Genet 9: 2691–2706.

28. Hatada I, Morita S, Obata Y, Sotomaru Y, Shimoda M, et al. (2001)Identification of a new imprinted gene, Rian, on mouse chromosome 12 byfluorescent differential display screening. J Biochem (Tokyo) 130: 187–190.

29. Cavaille J, Seitz H, Paulsen M, Ferguson-Smith AC, Bachellerie JP (2002)Identification of tandemly-repeated C/D snoRNA genes a

Documents

Clusters of Internally Primed Transcripts Reveal Novel Long ...79716/UQ_PV...Clusters of Internally Primed Transcripts Reveal Novel Long Noncoding RNAs Masaaki Furuno1[, Ken C. Pang2,3[,