Upload
others
View
2
Download
0
Embed Size (px)
Citation preview
Page 1
Mapping of 19032 mouse cDNAs on mouse chromosomes
Itaru Yamanaka,1 Hidenori Kiyosawa,1,2 Shinji Kondo,1 Tetsuya Saito,1 Akira Shinagawa,1
Katsunori Aizawa,1 Shiro Fukuda,1 Ayako Hara,1 Masayoshi Itoh,1 Jun Kawai,1 Kazuhiro
Shibata,1 Takahiro Arakawa,1 Yoshiyuki Ishii,
1 Yoshihide Hayashizaki1,2,3,*
1Laboratory for Genome Exploration Research Group, RIKEN Genomic Sciences Center
(GSC), RIKEN Yokohama Institute, 1- 7- 22 Suehiro- cho, Tsurumi- ku, Yokohama City,
Kanagawa 230- 0045, Japan
2Division of Genomic Information Resource Exploration, Science of Biological
Supramolecular Systems, Yokohama City University Graduate School of Integrated Science,
1- 7- 29 Suehiro- cho, Tsurumi- ku, Yokohama City, Kanagawa 230- 0045, Japan
3University of Tsukuba, 1- 1- 1 Tennodai, Ts ukuba, Ibaraki 305- 8577, Japan
Key words: cDNA, database, mapping, mouse, radiation hybrid, synteny
Abstract
Finding genes by the positional candidate approach requires abundant cDNAs mapped to
chromosomes. To provide such important information, we computationally mapped 19032 of
our mouse cDNAs to mouse chromosomes by using data from public databases.
We used 2 approaches. In the first, we integrated the mapping data of cDNAs on the
human genome, known gene - related data, and comparative mapping data. From this , we
calculated map positions on the mouse chromosomes. For this first approach, we developed a
simple and powerful criterion to choose the correct map position from candidate positions in
sequence homology searches. In the second approach, we related cDNAs to expressed
sequence tags ( EST) previously mapped in radiation hybrid experiments. We discuss
* To whom correspondence should be addressed:
Tel: +81- 45- 503- 9222; Fax: +81- 45- 503- 9216; E- mail: rgscerg@ gsc.riken.go.jp
Page 2
improving the mapping by combining the 2 methods.
Introduction
Genome and cDNA sequencing projects are progressing all over the world. In the RIKEN
(The institute of Physical and Chemical Research) project, the Mouse Full- length cDNA
Encyclopedia
1–14 is being filled with cDNAs, sequence data, and expression profiles. Many
other projects to accumulate phenotyp ic data are systematically producing mutants with
mutagens or are planned. To understand the cause behind phenotypes at the molecular level,
bridging both sorts of information is crucial.
Mapping of the cDNA sequences onto the mouse chromosomes is essential to making the
bridge, because phenotyp ic data can be converted to genetic positions by linkage analysis and
much mapping data on phenotypes and mutants is publicly available. Here we report an
attempt to computationally map RIKEN’s mouse cDNAs to the mouse chromosomes by using
public data. To process the large amount of data rapidly, we made active use of public
databases and automation. We used the human genome sequence,7,15 –17 the human– mouse
homology database,18,19 genetic mapping data19 gained by linkage analysis, and mapping data
gained by radiatio n hybrid experiments.20
The main theoretical idea we used is that of conserved synteny. Conserved synteny means
that when 2 loci are linked in 1 species, their homologs are linked in another species. At the
molecular level, by comparing the chromosomes of 2 species, we see evolutionally conserved
chromosomal segments. Between mouse and human, most parts of the chromosomes consist
of conserved segments. In a conserved segment, genes should lie on the chromosome of the
first species in the same order as do the homologous genes of the other species. Examples of
synteny can be seen in some lists of human–mouse homologous genes (comparative maps),
for example, the Jackson Laboratory’s human– mouse homology data18 in the MGI database.19
Page 3
We used 2 methods to map the RIKEN mouse cDNAs onto the mouse chromosomes. The
first method mapped the cDNAs onto the mouse chromosomes by using synteny. We call this
method synteny mapping. The second method us ed EST data20 that had been mapped onto the
mouse chromosomes by using radiation hybrid (RH) mapping. We related our clones to the
ESTs by sequence homology and then computationally mapped our cDNAs.
We report the results of both methods and discuss improving the mapping by combining
the 2 methods.
Methods
1) Synteny map
The synteny map was constructed in 2 steps. The first step was to map the sequences of the
RIKEN mouse cDNAs onto the human genome draft sequences by sequence homology. In
Reference 21, Kondo et al. had worked out this step by using NCBI BLASTN (-e 1.0), the
human draft sequences assembled by University of California, Santa Cruz ( UCSC)15 and the
human finished sequences for chromosomes 21 and 22.16,17 They had inspected strands and
positions of aligned regions to determine each locus.21 The second step was to convert their
positions on the human genome to positions on the mouse chromosomes.
Each cDNA was mapped to more than 1 candidate loci of the human genome. To choose
the most reliable locus, we used the length of the genomic DNA occupied by each locus that
was predicted by the sequence homology. The length was defined to include expected introns.
We called it the “occupied length” of a locus. We took the locus that had the longest occupied
length among the candidates to be the correct locus. The assumption was also true for some
pseudogenes and their original genes, because some pseudogenes had lost introns and the ir
occupied length was shorter than that of the original genes. Therefore, we mapped the cDNA
to the locus with the longest occupied length. We call this criterion the “longest occupied
length criterion.” Its accuracy is reported in the Results section.
Page 4
In the following subsections, we explain the 3 steps to convert the positions of the RIKEN
mouse cDNAs on the human genome to positions on the mouse chromosomes.
1.1) Identification of the positions of the known human genes
We ran the e- PCR computer program22 with the human sequence- tagged site (STS ) database
of NCBI ( filename human.sts)23 on the human draft sequence. Using another NCBI data file
(hs_mrna .epcr)23 that relates STSs to the known human genes, we identified the positions of
some known genes on the human genome sequence.
1.2) Fundamental comparative map
We used the Jackson Laboratory’s human– mouse homology data18,19 to identify the
relationship between the known human genes and the known mouse genes. These data include
chromosome number and position information (in cM).
By procedures 1.1 and 1.2, we related the positions of 956 known genes on the human
genome to positions on the mouse chromosomes.
1.3) Mapping of the RIKEN cDNAs on the mouse chromosomes
In a conserved chromosomal segment, mouse genes should lie on the mouse chromosome in
the same order as the homologous human genes on the human chromosome. To map 1 of
RIKEN’s cDNAs to the mouse chromosomes, we used 2 pairs of human–mouse homologous
genes that sandwiched the RIKEN cDNA on the human genome. If the 2 mouse genes were
on the same mouse chromosome, we assumed that the 2 mouse genes and the RIKEN cDNA
were on 1 conserved segment. Then we mapped the RIKEN cDNA between the positions of
the 2 mouse genes (see Fig. 1 for an example).
To indicate the most probable map position in the interval bounded by the 2 mouse genes,
by using the map position on the human genome, we calculated the position on the mouse
chromosomes that maintained the ratio of distances to the pairs of genes. If the 2 mouse genes
were mapped on different mouse chromosomes, then there should be a chromosomal
Page 5
rearrangement breakpoint. If this was the case, we did not map the RIKEN cDNA, because
we could not decide which conserved segment the cDNA belonged to.
2) Radiation hybrid map
We performed a BLASTN search of the RIKEN cDNAs on the RH mapped sequences of the
mouse RH data from the Whitehead Institute’s Center for Genome Research at the MIT
(WICGR).20 We assumed correspondence if the homologous region was > 100 bp and the
similarity was > 98%. The RIKEN cDNAs were computationally mapped on the mouse
genome.
Results
We used 44184 cDNAs, 21076 of which had full- length sequences and 23108 of which had 3'
or 5' end sequences. There is redundancy among the cDNA set: the redundancy rate is about
1.6 (= 21076/12890).24 By the 2 methods described in the Methods section, we mapped 19032
cDNAs on the mouse chromosomes (Table 1). We ma pped 9082 full- length cDNAs by the
synteny method. The positions in the genetic map of 761 of these were already known at the
FANTOM meeting of 2000.24 Therefore, the synteny method newly mapped 8321 full- length
cDNAs in centimorgans. We evaluated the correctness of the synteny map by using a set of
cDNAs whose map position was already known in the MGI database.19 For 69% of the
cDNAs of this set, the differences between the results of our synteny map and the known map
position we re less than 5 cM (please see Fig.2 for detail).
Figure 3 indicates the results of the synteny map. Figure 4 shows the results of the RH
map. The details of the map data are available at http://genome.gsc.riken.go.jp/m/mouse/ (see
Appe ndix). There we provide, for example, graphics to show the fundamental data of the
synteny map for some cDNAs (Fig. 1 shows an example).
We report the accuracy of the longest occupied length criterion for the full- length
sequences and the end sequences of the cDNAs separately.
Page 6
Each full- length cDNA was initially mapped to more than 1 locus on the mouse
chromosomes by synteny. We selected 540 cDNAs whose genetic positions were known in
the MGI database19 and for which at least 1 of the candidate loci was within 5 cM of the
correct position. Using the criterion, we were able to choose 1 candidate locus that was closest
to the correct position for 466 cDNAs. This gave a success rate of 86% (466/540). Therefore,
the longest occupied length criterion provides a simple method to choose the correct mapping
posit ions in many cases. Note that the accuracy reported here should be distinguished from
the correctness of the synteny map, because the criterion is only 1 of several processes to
derive the synteny map.
The criterion was also applicable to the end sequences. To show that, we selected clones
which have both of full- length sequence and end sequence. Then we compared the map
position calculated by using the full- length sequence and the map position calculated by using
the end sequence. We found that 96% (6930/7223) of the end sequences were correctly
mapped relative to the map of the full- length sequences. The reason for the 4% inconsistency
may be that the BLASTN program sometimes cannot detect the correct exons.
Discussion
The set of mapped cDNAs will be a powerful tool for finding genes responsible for various
diseases in the positional candidate approach. Even if we select all the cDNAs as candidates
within 10 cM around a predicted disease locus, the number of selected cDNAs represents only
about 0.6% of all mapped cDNAs. To identify the responsible genes, we may need other
experimental data, such as cDNA expression data. At the same time, it is still important to
improve the mapping data of cDNAs, because the quality directly affects the choice of
candidate cDNAs in finding genes.
The RH map can be used to improve the quality of the mapping data. We constructed a
genetic synteny map based on the mouse genetic map. It is easy to imagine another sort of
Page 7
synteny map, which is based on an RH map. First, a comparative map can be made of cDNAs
that have sequence homology with both RH-mapped ESTs and the human genome. Then
other cDNAs located on the human genome can be mapped to the mouse chromosomes by
using the comparative map as a guide. We call such a map an RH synteny map.
We may be able to use the RH synteny map to improve the genetic synteny map. In the
synteny map, around some chromosome rearrangement breakpoints there appeared blank
regions where no cDNAs were mapped (Fig. 3). Each blank region was bounded by
human–mouse homologous genes adjacent to the breakpoint. Therefore, the size of the blank
region depended on the density of the human–mouse homologous genes. We can construct an
RH synteny map to fill some parts of the blank regions. We hope to report the results of this
in the future.
To improve the longest occupied length criterion, we can incorporate the number of
identity base pairs from the homology search. The idea works when more than 1 longest
occupied length loci exist. In general, we need to search for a good evaluation function that
depends on both the occupied length and the number of identity base pairs. The evaluation
function can be tested by using the sequences of some genes whose loci are known.
This work will help in the correct ordering of the mouse draft genome sequences. The
correct assembly of the draft sequences of genome contigs is no easy task, because the
number of contigs and the number of possible combinations are enormous. Using the order of
genes on mouse chromosomes derived from synteny maps may help the correct assembly of
the contigs.
Acknowledgments
This study has been supported by Research Grant for the RIKEN Genome Exploration
Research Project from the Ministry of Education, Culture, Sports, Science and Technology of
the Japanese Government, ACT-JST (Research and Development for Applying Advanced
Page 8
Computational Science and Technology) of Japan Science and Technology Corporation (JST)
and Special Coordination Funds for Promoting the Bioresource Program from RIKEN
Tsukuba Institute. This work also was supported as the Rice Genome Full-Length cDNA
Library Construction Project by BRAIN (Bio-oriented Technology Research Advancement
Institution).
Appendix: Supplementary Information Web Site
For the details of the map data and the clones mapped in this paper, please visit
http://genome.gsc.riken.go.jp/m/mouse/. The site includes a map viewer of the synteny map
for the full- length sequences, and tables including the map coordinates of all mapped cDNAs.
In addition, we provide graphics to show the fundamental data of the synteny map for some
cDNAs (Fig. 1 shows an example ).
The list of the DDBJ (public database) accession numbers and the RIKEN IDs for the
sequences mapped in this work is also available at the web site.
References
1. Carninci, P. and Hayashizaki, Y. (1999) Methods Enzymol, 303, 19–44.
2. Carninci, P. et al. (1996) Genomics, 37, 327–336.
3. Carninci, P. et al. (1997) DNA Res., 4, 61–66.
4. Carninci, P. et al. (1998) Proc. Natl. Acad. Sci. USA, 95, 520–524.
5. Carninci, P. et al. (2000) Genome Res., 10, 1617–1630.
6. Shibata, K. et al. (2000) Genome Res., 10, 1757–1771.
7. International Human Genome Sequencing Consortium (2001) Nature, 409, 860–921.
8. Itoh, M. et al. (1997) Nucleic Acids Res., 25, 1315–1316.
9. Itoh, M. et al. (1999) Genome Res., 9, 463–470.
Page 9
10. Izawa, M. et al. (1998) J. Biol. Chem., 273, 14242–14246.
11. Sasaki, N. et al. (1998) Genomics, 49, 167–179.
12. Sasaki, N. et al. (1998) Proc. Natl. Acad. Sci. USA, 95, 3455–3460.
13. Sasaki, N. et al. (1998) Gene, 222, 17–23.
14. Konno, H. et al. (2001) Genome Res., 11(2), 281–289.
15. Human Genome Project Working Draft at UCSC, http://genome.ucsc.edu/.
16. Hattori, M. et al. (2000) Nature, 405, 311–319.
17. Dunham, I. et al. (1999) Nature, 402, 489–495.
18. Homo logy data for this paper were retrieved from the Mouse Genome Database (MGD),
Mouse Genome Informatics Web Site, the Jackson Laboratory, Bar Harbor, Maine, USA:
http://www.informatics.jax.org/reports/ (September, 2000).
19. Blake, J.A., Eppig, J.T., Richardson, J.E., Davisson, M. T., and the Mouse Genome
Database Group 2000 (2000) Nucleic Acids Res., 28, 108–111.
20. van Etten, W. J. et al. (1999) Nature Genet. Lett., 22, 384–387. Whitehead Institute/MIT
Center for Genome Research, Mouse EST RH Mapping Project, Public Data Release 7
(August, 2000).
21. Kondo, S. et al. (2001) Mamm. Genome. in press.
22. Schuler, G.D. (1997) Genome Res., 7, 541–550.
23. The e-PCR data files, human.sts and hs_mrna.epcr, were developed by the National
Center for Biotechnology Information at the National Library of Medicine
( ftp://ncbi.nlm.nih.gov/repository/dbSTS/human.sts,
ftp://ncbi.nlm.nih.gov/repository/dbSTS/ePCR.Reports/hs/hs_mrna.epcr)
24. The RIKEN Genome Exploration Research Group Phase II Team and the FANTOM
Consortium (2001) Nature, 409, 685–690.
Page 10
Figures
Figure 1.
An example of the relationships among the fundamental data of the synteny map
(RIKEN cDNA 1110019C08). The 2 vertical lines indicate a human chromosome and a
mouse chromosome. Each dashed line connects a pair of homologous genes according to the
comparative map of the Jackson Laboratory. The fact that they cross indicates that the order is
reversed. The other short horizontal line on the human genome indicates the position where
the sequence of the cDNA 1110019C08 is similar to the human genome sequence. On the
mouse chromosome, the other short horizontal line indicates the position of the cDNA
predicted by the synteny map. In this diagram, “D12887” and “stSG8903” are STS marker
names. “POU1F1” and “GAP43” are human gene names. “Pit1” and “Gap43” are mouse gene
names. The numbers on the human genome indicate base pair coordinates. “99950164” and
“99957519” indicate the start and the end positions of the region similar to the cDNA
1110019C08.
Figure 2.
Evaluation of the synteny map for known genes. The map positions of 737 cDNAs were
known in the MGI database. Among them, for each of 575 cDNAs, our synteny map
predicted the same chromosome number that is known in the MGI database. We used the 575
cDNAs to create this figure. The horizontal axis of the figure indicates the distance between
the map position of our synteny map and the known genetic map position of the MGI
database.
Page 11
Figure 3.
Synteny map of the RIKEN cDNAs on the mouse chromosomes. The map positions of
17125 cDNAs are indicated by the horizontal lines to the right of the vertical lines
representing the mouse chromosomes.
Figure 4.
RH map of the RIKEN cDNAs on the mouse chromosomes using mouse RH map data
from the Whitehead Institute’s Center for Genome Research at the MIT (WICGR). The
map positions of 3943 cDNAs are indicated by the horizontal lines to the right of the vertical
lines representing the mouse chromosomes.
Table
Table 1. Numbers of cDNAs mapped during this study
Full- length
sequences
End
sequences
Total
cDNAs used for this mapping 21076 23108 44184
cDNAs mapped to the human genome 12979 11087 24066
cDNAs mapped by synteny 9082 8043 17125
cDNAs mapped to the RH map 2469 1474 3943
cDNAs mapped by synteny or the RH
map
10274 8758 19032