Mapping of 19032 mouse cDNAs on mouse chromosomesgenome.gsc.riken.jp/jsfg_1.pdf · 2001. 12. 3. · Page 1 Mapping of 19032 mouse cDNAs on mouse chromosomes Itaru Yamanaka,1 Hidenori

Page 1

Mapping of 19032 mouse cDNAs on mouse chromosomes

Itaru Yamanaka,1 Hidenori Kiyosawa,1,2 Shinji Kondo,1 Tetsuya Saito,1 Akira Shinagawa,1

Katsunori Aizawa,1 Shiro Fukuda,1 Ayako Hara,1 Masayoshi Itoh,1 Jun Kawai,1 Kazuhiro

Shibata,1 Takahiro Arakawa,1 Yoshiyuki Ishii,

1 Yoshihide Hayashizaki1,2,3,*

1Laboratory for Genome Exploration Research Group, RIKEN Genomic Sciences Center

(GSC), RIKEN Yokohama Institute, 1- 7- 22 Suehiro- cho, Tsurumi- ku, Yokohama City,

Kanagawa 230- 0045, Japan

2Division of Genomic Information Resource Exploration, Science of Biological

Supramolecular Systems, Yokohama City University Graduate School of Integrated Science,

1- 7- 29 Suehiro- cho, Tsurumi- ku, Yokohama City, Kanagawa 230- 0045, Japan

3University of Tsukuba, 1- 1- 1 Tennodai, Ts ukuba, Ibaraki 305- 8577, Japan

Key words: cDNA, database, mapping, mouse, radiation hybrid, synteny

Abstract

Finding genes by the positional candidate approach requires abundant cDNAs mapped to

chromosomes. To provide such important information, we computationally mapped 19032 of

our mouse cDNAs to mouse chromosomes by using data from public databases.

We used 2 approaches. In the first, we integrated the mapping data of cDNAs on the

human genome, known gene - related data, and comparative mapping data. From this , we

calculated map positions on the mouse chromosomes. For this first approach, we developed a

simple and powerful criterion to choose the correct map position from candidate positions in

sequence homology searches. In the second approach, we related cDNAs to expressed

sequence tags ( EST) previously mapped in radiation hybrid experiments. We discuss

* To whom correspondence should be addressed:

Tel: +81- 45- 503- 9222; Fax: +81- 45- 503- 9216; E- mail: rgscerg@ gsc.riken.go.jp

Page 2

improving the mapping by combining the 2 methods.

Introduction

Genome and cDNA sequencing projects are progressing all over the world. In the RIKEN

(The institute of Physical and Chemical Research) project, the Mouse Full- length cDNA

Encyclopedia

1–14 is being filled with cDNAs, sequence data, and expression profiles. Many

other projects to accumulate phenotyp ic data are systematically producing mutants with

mutagens or are planned. To understand the cause behind phenotypes at the molecular level,

bridging both sorts of information is crucial.

Mapping of the cDNA sequences onto the mouse chromosomes is essential to making the

bridge, because phenotyp ic data can be converted to genetic positions by linkage analysis and

much mapping data on phenotypes and mutants is publicly available. Here we report an

attempt to computationally map RIKEN’s mouse cDNAs to the mouse chromosomes by using

public data. To process the large amount of data rapidly, we made active use of public

databases and automation. We used the human genome sequence,7,15 –17 the human– mouse

homology database,18,19 genetic mapping data19 gained by linkage analysis, and mapping data

gained by radiatio n hybrid experiments.20

The main theoretical idea we used is that of conserved synteny. Conserved synteny means

that when 2 loci are linked in 1 species, their homologs are linked in another species. At the

molecular level, by comparing the chromosomes of 2 species, we see evolutionally conserved

chromosomal segments. Between mouse and human, most parts of the chromosomes consist

of conserved segments. In a conserved segment, genes should lie on the chromosome of the

first species in the same order as do the homologous genes of the other species. Examples of

synteny can be seen in some lists of human–mouse homologous genes (comparative maps),

for example, the Jackson Laboratory’s human– mouse homology data18 in the MGI database.19

http://www.informatics.jax.org/

Page 3

We used 2 methods to map the RIKEN mouse cDNAs onto the mouse chromosomes. The

first method mapped the cDNAs onto the mouse chromosomes by using synteny. We call this

method synteny mapping. The second method us ed EST data20 that had been mapped onto the

mouse chromosomes by using radiation hybrid (RH) mapping. We related our clones to the

ESTs by sequence homology and then computationally mapped our cDNAs.

We report the results of both methods and discuss improving the mapping by combining

the 2 methods.

Methods

1) Synteny map

The synteny map was constructed in 2 steps. The first step was to map the sequences of the

RIKEN mouse cDNAs onto the human genome draft sequences by sequence homology. In

Reference 21, Kondo et al. had worked out this step by using NCBI BLASTN (-e 1.0), the

human draft sequences assembled by University of California, Santa Cruz ( UCSC)15 and the

human finished sequences for chromosomes 21 and 22.16,17 They had inspected strands and

positions of aligned regions to determine each locus.21 The second step was to convert their

positions on the human genome to positions on the mouse chromosomes.

Each cDNA was mapped to more than 1 candidate loci of the human genome. To choose

the most reliable locus, we used the length of the genomic DNA occupied by each locus that

was predicted by the sequence homology. The length was defined to include expected introns.

We called it the “occupied length” of a locus. We took the locus that had the longest occupied

length among the candidates to be the correct locus. The assumption was also true for some

pseudogenes and their original genes, because some pseudogenes had lost introns and the ir

occupied length was shorter than that of the original genes. Therefore, we mapped the cDNA

to the locus with the longest occupied length. We call this criterion the “longest occupied

length criterion.” Its accuracy is reported in the Results section.

Page 4

In the following subsections, we explain the 3 steps to convert the positions of the RIKEN

mouse cDNAs on the human genome to positions on the mouse chromosomes.

1.1) Identification of the positions of the known human genes

We ran the e- PCR computer program22 with the human sequence- tagged site (STS ) database

of NCBI ( filename human.sts)23 on the human draft sequence. Using another NCBI data file

(hs_mrna .epcr)23 that relates STSs to the known human genes, we identified the positions of

some known genes on the human genome sequence.

1.2) Fundamental comparative map

We used the Jackson Laboratory’s human– mouse homology data18,19 to identify the

relationship between the known human genes and the known mouse genes. These data include

chromosome number and position information (in cM).

By procedures 1.1 and 1.2, we related the positions of 956 known genes on the human

genome to positions on the mouse chromosomes.

1.3) Mapping of the RIKEN cDNAs on the mouse chromosomes

In a conserved chromosomal segment, mouse genes should lie on the mouse chromosome in

the same order as the homologous human genes on the human chromosome. To map 1 of

RIKEN’s cDNAs to the mouse chromosomes, we used 2 pairs of human–mouse homologous

genes that sandwiched the RIKEN cDNA on the human genome. If the 2 mouse genes were

on the same mouse chromosome, we assumed that the 2 mouse genes and the RIKEN cDNA

were on 1 conserved segment. Then we mapped the RIKEN cDNA between the positions of

the 2 mouse genes (see Fig. 1 for an example).

To indicate the most probable map position in the interval bounded by the 2 mouse genes,

by using the map position on the human genome, we calculated the position on the mouse

chromosomes that maintained the ratio of distances to the pairs of genes. If the 2 mouse genes

were mapped on different mouse chromosomes, then there should be a chromosomal

http://www.ncbi.nlm.nih.gov/dbSTS/index.html

http://www.ncbi.nlm.nih.gov/dbSTS/index.html

Page 5

rearrangement breakpoint. If this was the case, we did not map the RIKEN cDNA, because

we could not decide which conserved segment the cDNA belonged to.

2) Radiation hybrid map

We performed a BLASTN search of the RIKEN cDNAs on the RH mapped sequences of the

mouse RH data from the Whitehead Institute’s Center for Genome Research at the MIT

(WICGR).20 We assumed correspondence if the homologous region was > 100 bp and the

similarity was > 98%. The RIKEN cDNAs were computationally mapped on the mouse

genome.

Results

We used 44184 cDNAs, 21076 of which had full- length sequences and 23108 of which had 3'

or 5' end sequences. There is redundancy among the cDNA set: the redundancy rate is about

1.6 (= 21076/12890).24 By the 2 methods described in the Methods section, we mapped 19032

cDNAs on the mouse chromosomes (Table 1). We ma pped 9082 full- length cDNAs by the

synteny method. The positions in the genetic map of 761 of these were already known at the

FANTOM meeting of 2000.24 Therefore, the synteny method newly mapped 8321 full- length

cDNAs in centimorgans. We evaluated the correctness of the synteny map by using a set of

cDNAs whose map position was already known in the MGI database.19 For 69% of the

cDNAs of this set, the differences between the results of our synteny map and the known map

position we re less than 5 cM (please see Fig.2 for detail).

Figure 3 indicates the results of the synteny map. Figure 4 shows the results of the RH

map. The details of the map data are available at http://genome.gsc.riken.go.jp/m/mouse/ (see

Appe ndix). There we provide, for example, graphics to show the fundamental data of the

synteny map for some cDNAs (Fig. 1 shows an example).

We report the accuracy of the longest occupied length criterion for the full- length

sequences and the end sequences of the cDNAs separately.

http://genome.gsc.riken.go.jp/m/mouse/


http://www.gsc.riken.go.jp/e/FANTOM/

Page 6

Each full- length cDNA was initially mapped to more than 1 locus on the mouse

chromosomes by synteny. We selected 540 cDNAs whose genetic positions were known in

the MGI database19 and for which at least 1 of the candidate loci was within 5 cM of the

correct position. Using the criterion, we were able to choose 1 candidate locus that was closest

to the correct position for 466 cDNAs. This gave a success rate of 86% (466/540). Therefore,

the longest occupied length criterion provides a simple method to choose the correct mapping

posit ions in many cases. Note that the accuracy reported here should be distinguished from

the correctness of the synteny map, because the criterion is only 1 of several processes to

derive the synteny map.

The criterion was also applicable to the end sequences. To show that, we selected clones

which have both of full- length sequence and end sequence. Then we compared the map

position calculated by using the full- length sequence and the map position calculated by using

the end sequence. We found that 96% (6930/7223) of the end sequences were correctly

mapped relative to the map of the full- length sequences. The reason for the 4% inconsistency

may be that the BLASTN program sometimes cannot detect the correct exons.

Discussion

The set of mapped cDNAs will be a powerful tool for finding genes responsible for various

diseases in the positional candidate approach. Even if we select all the cDNAs as candidates

within 10 cM around a predicted disease locus, the number of selected cDNAs represents only

about 0.6% of all mapped cDNAs. To identify the responsible genes, we may need other

experimental data, such as cDNA expression data. At the same time, it is still important to

improve the mapping data of cDNAs, because the quality directly affects the choice of

candidate cDNAs in finding genes.

The RH map can be used to improve the quality of the mapping data. We constructed a

genetic synteny map based on the mouse genetic map. It is easy to imagine another sort of


Page 7

synteny map, which is based on an RH map. First, a comparative map can be made of cDNAs

that have sequence homology with both RH-mapped ESTs and the human genome. Then

other cDNAs located on the human genome can be mapped to the mouse chromosomes by

using the comparative map as a guide. We call such a map an RH synteny map.

We may be able to use the RH synteny map to improve the genetic synteny map. In the

synteny map, around some chromosome rearrangement breakpoints there appeared blank

regions where no cDNAs were mapped (Fig. 3). Each blank region was bounded by

human–mouse homologous genes adjacent to the breakpoint. Therefore, the size of the blank

region depended on the density of the human–mouse homologous genes. We can construct an

RH synteny map to fill some parts of the blank regions. We hope to report the results of this

in the future.

To improve the longest occupied length criterion, we can incorporate the number of

identity base pairs from the homology search. The idea works when more than 1 longest

occupied length loci exist. In general, we need to search for a good evaluation function that

depends on both the occupied length and the number of identity base pairs. The evaluation

function can be tested by using the sequences of some genes whose loci are known.

This work will help in the correct ordering of the mouse draft genome sequences. The

correct assembly of the draft sequences of genome contigs is no easy task, because the

number of contigs and the number of possible combinations are enormous. Using the order of

genes on mouse chromosomes derived from synteny maps may help the correct assembly of

the contigs.

Acknowledgments

This study has been supported by Research Grant for the RIKEN Genome Exploration

Research Project from the Ministry of Education, Culture, Sports, Science and Technology of

the Japanese Government, ACT-JST (Research and Development for Applying Advanced

Page 8

Computational Science and Technology) of Japan Science and Technology Corporation (JST)

and Special Coordination Funds for Promoting the Bioresource Program from RIKEN

Tsukuba Institute. This work also was supported as the Rice Genome Full-Length cDNA

Library Construction Project by BRAIN (Bio-oriented Technology Research Advancement

Institution).

Appendix: Supplementary Information Web Site

For the details of the map data and the clones mapped in this paper, please visit

http://genome.gsc.riken.go.jp/m/mouse/. The site includes a map viewer of the synteny map

for the full- length sequences, and tables including the map coordinates of all mapped cDNAs.

In addition, we provide graphics to show the fundamental data of the synteny map for some

cDNAs (Fig. 1 shows an example ).

The list of the DDBJ (public database) accession numbers and the RIKEN IDs for the

sequences mapped in this work is also available at the web site.

References

1. Carninci, P. and Hayashizaki, Y. (1999) Methods Enzymol, 303, 19–44.

2. Carninci, P. et al. (1996) Genomics, 37, 327–336.

3. Carninci, P. et al. (1997) DNA Res., 4, 61–66.

4. Carninci, P. et al. (1998) Proc. Natl. Acad. Sci. USA, 95, 520–524.

5. Carninci, P. et al. (2000) Genome Res., 10, 1617–1630.

6. Shibata, K. et al. (2000) Genome Res., 10, 1757–1771.

7. International Human Genome Sequencing Consortium (2001) Nature, 409, 860–921.

8. Itoh, M. et al. (1997) Nucleic Acids Res., 25, 1315–1316.

9. Itoh, M. et al. (1999) Genome Res., 9, 463–470.

http://genome.gsc.riken.go.jp/m/mouse/

Page 9

10. Izawa, M. et al. (1998) J. Biol. Chem., 273, 14242–14246.

11. Sasaki, N. et al. (1998) Genomics, 49, 167–179.

12. Sasaki, N. et al. (1998) Proc. Natl. Acad. Sci. USA, 95, 3455–3460.

13. Sasaki, N. et al. (1998) Gene, 222, 17–23.

14. Konno, H. et al. (2001) Genome Res., 11(2), 281–289.

15. Human Genome Project Working Draft at UCSC, http://genome.ucsc.edu/.

16. Hattori, M. et al. (2000) Nature, 405, 311–319.

17. Dunham, I. et al. (1999) Nature, 402, 489–495.

18. Homo logy data for this paper were retrieved from the Mouse Genome Database (MGD),

Mouse Genome Informatics Web Site, the Jackson Laboratory, Bar Harbor, Maine, USA:

http://www.informatics.jax.org/reports/ (September, 2000).

19. Blake, J.A., Eppig, J.T., Richardson, J.E., Davisson, M. T., and the Mouse Genome

Database Group 2000 (2000) Nucleic Acids Res., 28, 108–111.

20. van Etten, W. J. et al. (1999) Nature Genet. Lett., 22, 384–387. Whitehead Institute/MIT

Center for Genome Research, Mouse EST RH Mapping Project, Public Data Release 7

(August, 2000).

21. Kondo, S. et al. (2001) Mamm. Genome. in press.

22. Schuler, G.D. (1997) Genome Res., 7, 541–550.

23. The e-PCR data files, human.sts and hs_mrna.epcr, were developed by the National

Center for Biotechnology Information at the National Library of Medicine

( ftp://ncbi.nlm.nih.gov/repository/dbSTS/human.sts,

ftp://ncbi.nlm.nih.gov/repository/dbSTS/ePCR.Reports/hs/hs_mrna.epcr)

24. The RIKEN Genome Exploration Research Group Phase II Team and the FANTOM

Consortium (2001) Nature, 409, 685–690.

http://genome.ucsc.edu/

ftp://ncbi.nlm.nih.gov/repository/dbSTS/

ftp://ncbi.nlm.nih.gov/repository/dbSTS/ePCR.Reports/hs/




Page 10

Figures

Figure 1.

An example of the relationships among the fundamental data of the synteny map

(RIKEN cDNA 1110019C08). The 2 vertical lines indicate a human chromosome and a

mouse chromosome. Each dashed line connects a pair of homologous genes according to the

comparative map of the Jackson Laboratory. The fact that they cross indicates that the order is

reversed. The other short horizontal line on the human genome indicates the position where

the sequence of the cDNA 1110019C08 is similar to the human genome sequence. On the

mouse chromosome, the other short horizontal line indicates the position of the cDNA

predicted by the synteny map. In this diagram, “D12887” and “stSG8903” are STS marker

names. “POU1F1” and “GAP43” are human gene names. “Pit1” and “Gap43” are mouse gene

names. The numbers on the human genome indicate base pair coordinates. “99950164” and

“99957519” indicate the start and the end positions of the region similar to the cDNA

1110019C08.

Figure 2.

Evaluation of the synteny map for known genes. The map positions of 737 cDNAs were

known in the MGI database. Among them, for each of 575 cDNAs, our synteny map

predicted the same chromosome number that is known in the MGI database. We used the 575

cDNAs to create this figure. The horizontal axis of the figure indicates the distance between

the map position of our synteny map and the known genetic map position of the MGI

database.

Page 11

Figure 3.

Synteny map of the RIKEN cDNAs on the mouse chromosomes. The map positions of

17125 cDNAs are indicated by the horizontal lines to the right of the vertical lines

representing the mouse chromosomes.

Figure 4.

RH map of the RIKEN cDNAs on the mouse chromosomes using mouse RH map data

from the Whitehead Institute’s Center for Genome Research at the MIT (WICGR). The

map positions of 3943 cDNAs are indicated by the horizontal lines to the right of the vertical

lines representing the mouse chromosomes.

Table

Table 1. Numbers of cDNAs mapped during this study

Full- length

sequences

End

sequences

Total

cDNAs used for this mapping 21076 23108 44184

cDNAs mapped to the human genome 12979 11087 24066

cDNAs mapped by synteny 9082 8043 17125

cDNAs mapped to the RH map 2469 1474 3943

cDNAs mapped by synteny or the RH

map

10274 8758 19032

Documents

Mapping of 19032 mouse cDNAs on mouse chromosomesgenome.gsc.riken.jp/jsfg_1.pdf · 2001. 12. 3. · Page 1 Mapping of 19032 mouse cDNAs on mouse chromosomes Itaru Yamanaka,1 Hidenori