
Report for Taikichiro Mori Memorial Research Grants 2019 (FY2019 Mori Fund Research Results Report)

Novel discovery and functional elucidation of enzymes involved in the replication of life

Comprehensive evolutionary analysis of reverse transcriptases in viruses and prokaryotes

Shohei Nagata

Institute for Advanced Biosciences, Keio University, Tsuruoka 997-0035, Japan and Systems Biology Program, Graduate School of Media and Governance, Keio University, Fujisawa 252-0882, Japan.

Abstract

Reverse transcriptases (RTs) are enzymes that polymerize DNA from RNA templates. RTs are usually thought of as viral and eukaryotic elements, but they are also present in bacteria. Bacterial RTs are thought to be the ancestors of eukaryotic RTs, and several types have been identified, i.e., group II introns, retrons, CRISPR/Cas-associated RTs, diversity-generating retroelements (DGRs), and Abi-like genes. Recently, several studies have reported the existence of RTs in a recently described bacterial group, the candidate phyla radiation (CPR). These CPR RTs are thought to play important roles in CPR bacterial ecology, since CPR bacteria retain RT genes while lacking numerous biosynthetic pathways. In this study, I comprehensively collected RT-like sequences from CPR genomes and systematically characterized their functions and evolution. Using profiles of known RT functional domains as queries, a sequence similarity search was performed against 804 near-complete genomes of CPR bacteria in the database. I obtained 514 RT sequences, which are widely distributed across CPR phyla. CPR bacteria are known to use DGR-associated RTs to adapt to rapidly changing environments; in addition, I found RTs related to group II introns, retrons, and abortive infection (Abi). I discuss the possible roles and evolution of RTs in CPR bacteria.

Contact: [email protected]

1 Introduction

The central dogma of molecular biology, proposed in 1958, describes the flow of genetic information: information stored in DNA is transcribed into mRNA and translated into protein. In 1970, however, the discovery of an RNA-dependent DNA polymerase (reverse transcriptase; RT), which synthesizes DNA from an RNA template, showed that this flow can be reversed [1,2]. RT was discovered through studies of tumor-associated retroviruses that infect eukaryotes, and the various types of RT discovered thereafter were primarily related to eukaryotes. In addition to viruses infecting eukaryotic organisms (retrovirus, pararetrovirus, hepadnavirus), RT-homologous regions have been found in long terminal repeat (LTR) retroelements, non-LTR retroelements, and telomerase.

In 1989, the retron, a type of RT-encoding element, was found in bacteria [3,4]. Subsequently, various types of RTs were discovered in bacteria and archaea, including those of group II introns [5–7] and diversity-generating retroelements (DGRs) [8–10]. Retrons consist of an RT and an adjacent repeat sequence, but their function remains unknown.


Group II introns are retroelements consisting of a catalytic RNA and an RT protein, which mediate splicing and mobility reactions [11–13]. DGRs are retroelements that have lost mobility functions and use reverse transcription to generate sequence variation in specific target genes [10]. It has since been revealed that RT genes are widely present in the three domains of life (bacteria, archaea, eukaryotes) and in viruses [14–17]. In bacteria, RT-homologous regions are also known to exist in abi genes, which are related to abortive infection (Abi) of phages [18,19], and in the cas1 gene of the CRISPR/Cas immune system [20,21]. Three bacterial RT-related proteins are involved in phage resistance: AbiA, AbiK, and Abi-P2 [15]. AbiA and AbiK are thought to provide phage immunity through abortive infection. Recently, it has also been reported that many uncharacterized RT-like sequences exist, mainly in bacteria [15,20,21]. However, what functions or activities they possess and how they diverged remain unclear.

More recently, advances in metagenomic analysis and single-cell genomics have made it clear that a vast group of previously unknown microbial lineages exists among bacteria. Metagenomic approaches revealed a huge diversity of previously unknown bacterial and archaeal phyla, which have divergent 16S rRNA sequences. In bacteria, these metagenomically recovered lineages were described as the candidate phyla radiation (CPR), which comprises at least 15% of all bacteria [22]. The CPR appears to be monophyletic and clearly separated from other bacteria (Figure 1.1; Castelle and Banfield, 2018; Hug et al., 2016). CPR bacteria are widely distributed across various environments, such as the human microbiome [25], deep subsurface sediments [26], the dolphin mouth [27], drinking water [28], soil [29], marine sediment [30], and other environments [24,31].

CPR bacteria have various unusual features compared with non-CPR bacteria. CPR genomes are smaller than 1.5 Mb, whereas the genome of a non-CPR bacterium, Escherichia coli, is 4.6 Mb. Most CPR bacteria have lost TCA-cycle genes, and they have introns in their rRNA genes [22,31]. Whether CPR bacteria are cellular organisms is sometimes questioned; however, CPR genomes encode genetic systems for cell division (e.g., FtsZ-based mechanisms, which are not found in some symbionts with very reduced genomes), and measurements of replication rates and images showing cell division indicate that the cells are metabolically active. It is also thought that they may adhere to the surface of other microorganisms to survive.

CPR bacteria have been reported to carry RT-like sequences in their genomes; however, the types of these RTs, their functions, and the evolutionary scenario of their diversification are not well understood. In this study, I performed a comprehensive analysis of RT sequences from CPR bacterial genomes to reveal the roles and evolution of RTs in CPR bacteria.

Figure 1.1 A current view of the tree of life. The phylogenetic tree of bacteria, archaea, and eukaryotes includes 92 named bacterial phyla, 26 archaeal phyla, and all five eukaryotic supergroups. The tree was estimated by the maximum-likelihood method using a concatenation of ribosomal protein sequences. The figure is adapted from reference [23].

2 Methods

2.1 Data sources

Complete genome sequences of bacteria and archaea were downloaded from the Reference Sequence Database (RefSeq) [32] at the National Center for Biotechnology Information (NCBI) as of May 2018. The acquired dataset (denoted as RefSeq prokaryotes in this manuscript) comprised 9,078 genomes (8,825 bacterial and 253 archaeal).

Nearly full-length genomes of CPR bacteria (≥ 70% of the estimated full length recovered), 804 genomes (790 species) in total, were obtained from NCBI GenBank based on Hug et al. [23].

Known RT sequences were obtained from a previous study by Simon et al. [20]. Sequences annotated as “Unknown”, “Unclassified”, or “nonRTs” were removed, and a total of 930 RT sequences were collected.



2.2 Identification of RT sequences

From the collected prokaryotic genomes, RT-like sequences containing RT functional domains were identified using HMMER v.3.2 (hmmscan program; E-value ≤ 1e-5) [33] to search against the sequence profiles “RVT_1” (PF00078) and “RVT_2” (PF07727) in Pfam-A 32.0 [34]. In the first version of the pipeline, “RVT_3” (PF13456) was also included among the query profiles because the “RVT_3” domain is registered as “Reverse transcriptase-like” in the database. However, the proteins collected with the “RVT_3” profile were RNase H proteins rather than RTs. Therefore, I excluded proteins that exist as RNase H alone, rather than as part of an RT protein, in order to observe the diversity and evolution of RT domains and proteins in the analyses that include CPR bacteria.
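The domain-based filtering described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the pipeline used in this study: it assumes the hmmscan results were saved in HMMER's tabular `--tblout` format (profile name, profile accession, and query name in the first three columns, full-sequence E-value in the fifth), and the function name `rt_like_proteins` and the example hit lines are hypothetical.

```python
# Pfam accessions used as RT profiles in the text: RVT_1 and RVT_2.
RT_PROFILES = {"PF00078", "PF07727"}
MAX_EVALUE = 1e-5  # E-value cut-off stated in the text


def rt_like_proteins(tblout_lines, profiles=RT_PROFILES, max_evalue=MAX_EVALUE):
    """Return the set of query proteins with a hit to one of the RT profiles.

    Assumes whitespace-separated hmmscan --tblout rows:
    target name, target accession, query name, query accession, E-value, ...
    """
    hits = set()
    for line in tblout_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split()
        if len(fields) < 5:
            continue  # not a complete hit row
        pfam_accession = fields[1].split(".")[0]  # drop the version suffix
        query_name = fields[2]
        evalue = float(fields[4])
        if pfam_accession in profiles and evalue <= max_evalue:
            hits.add(query_name)
    return hits


# Made-up rows for illustration only (not results from this study).
example = [
    "# target name accession query name accession E-value score bias",
    "RVT_1 PF00078.27 protA - 2.1e-30 105.3 0.1",
    "RVT_3 PF13456.6 protB - 1.0e-20 70.2 0.0",   # RNase H-like profile, ignored
    "RVT_2 PF07727.14 protC - 1.0e-02 12.5 0.0",  # fails the E-value cut-off
]
print(sorted(rt_like_proteins(example)))
```

Leaving “RVT_3” out of the profile set mirrors the decision above to exclude proteins that match only the RNase H-like profile.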

2.3 Network analysis based on sequence similarities

Sequence similarity scores were calculated to construct a weighted undirected graph, the sequence similarity network (SSN). The similarity scores (Basic Local Alignment Search Tool [BLAST] bit scores) [35] for all collected protein sequences were calculated with an all-against-all BLASTP (BLAST 2.7.1+) analysis [36,37], with a cut-off E-value of ≤ 1e−5. Using the BLAST bit scores, the sequence similarities were normalized to 0.0–1.0 with the following equation [38,39]:

$$\mathrm{sim}(x, y) = \frac{\max\bigl(\mathrm{bitscore}(x, y),\ \mathrm{bitscore}(y, x)\bigr)}{\max\bigl(\mathrm{bitscore}(x, x),\ \mathrm{bitscore}(y, y)\bigr)}$$

where sim(x, y) represents the normalized sequence similarity between two sequences x and y. A pair with a score of 1.0 was deemed identical. A weighted undirected graph was constructed from the scores of all pairs of sequences, with the edges weighted by the scores. I set a threshold value and connected two nodes when their sequence similarity exceeded the threshold. The threshold was determined by comparing networks constructed with an incremental series of threshold values. The constructed networks were visualized with Cytoscape 3.7.1 [40], using the “Prefuse Force-Directed OpenCL Layout” with default parameters, except that the “Force deterministic layouts” option was enabled.
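The normalization and thresholding steps in this section can be illustrated with a short Python sketch. It is a sketch under assumptions rather than the code used in this study: the bit scores are assumed to be available as a dictionary keyed by (query, subject) pairs that includes every self-hit, a missing pair is treated as a bit score of 0, and the names `normalized_similarity` and `build_network` as well as the toy scores are hypothetical.

```python
def normalized_similarity(bits, x, y):
    """sim(x, y) = max(b(x, y), b(y, x)) / max(b(x, x), b(y, y))."""
    best_pair = max(bits.get((x, y), 0.0), bits.get((y, x), 0.0))
    best_self = max(bits[(x, x)], bits[(y, y)])  # self-hits are assumed present
    return best_pair / best_self


def build_network(bits, threshold):
    """Build a weighted undirected graph as {node: {neighbor: weight}}.

    Two nodes are connected when their normalized similarity exceeds
    the threshold; the edge weight is the similarity itself.
    """
    nodes = sorted({q for q, _ in bits} | {s for _, s in bits})
    graph = {node: {} for node in nodes}
    for i, x in enumerate(nodes):
        for y in nodes[i + 1:]:
            score = normalized_similarity(bits, x, y)
            if score > threshold:
                graph[x][y] = score
                graph[y][x] = score
    return graph


# Toy bit scores for three sequences (illustrative values only).
toy_bits = {
    ("a", "a"): 200.0, ("b", "b"): 180.0, ("c", "c"): 150.0,
    ("a", "b"): 120.0, ("b", "a"): 110.0,
    ("a", "c"): 30.0,
}

# Compare networks over an incremental series of thresholds.
for threshold in (0.1, 0.3, 0.5):
    network = build_network(toy_bits, threshold)
    n_edges = sum(len(neighbors) for neighbors in network.values()) // 2
    print(threshold, n_edges)
```

Taking the maximum over both directions makes the score symmetric even though BLAST bit scores for (x, y) and (y, x) can differ, which is what allows the graph to be undirected.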

2.4 Sequence comparison and phylogenetic analysis

To compare RTs from RefSeq prokaryotes with RTs from CPR bacteria, I aligned the RT sequences using MAFFT v.7.407 (L-INS-i algorithm) [41] and estimated a maximum-likelihood tree using RAxML v.8.2.11 [42] with the PROTGAMMAJTT evolutionary model for amino acid sequences. Both analyses were run and visualized through the Environment for Tree Exploration (ETE) v.3.1.1 [43]. In addition, the identified CPR RTs were mapped onto the phylogenetic tree estimated by Hug et al. [23] using iTOL [44].

2.5 Estimation of frameshift mutations in polymerases

To verify whether frameshift mutations occurred only in CPR bacterial RTs, DNA polymerase family A proteins were identified and compared with the RTs. DNA polymerase family A proteins were identified using HMMER v.3.2 (hmmscan program; E-value ≤ 1e-5) [33] to search against the sequence profile “DNA_pol_A” (PF00476) in Pfam 32.0 [34]. To increase the phylogenetic coverage of the polymerases across the CPR phylogeny, the retrieved DNA polymerase protein sequences (438 sequences for CPR bacteria) were additionally searched against all coding sequences of the datasets using BLAST v.2.8.1+ (blastp program; E-value ≤ 1e-5; query coverage per subject ≥ 50%) [35–37], and 670 sequences were identified for CPR bacteria. With the same pipeline, I also re-identified RT sequences using the 514 CPR RT sequences as queries and retrieved 539 RTs from CPR genomes.

2.6 Domain architecture of related proteins

The domain organization of CPR RTs was visualized with DoMosaics v.0.95 [45]. The visualized domains were extracted using HMMER v.3.2 (hmmscan program) [33] to search against the Pfam-A 32.0 [34] database; HMMER was run and the results were combined within DoMosaics. Other sequences with specific domain architectures were searched for with InterProScan [46] against the InterPro database [47].

2.7 Identification of RT-related group II introns

Since most bacterial group II introns encode an RT as the intron-encoded protein (IEP) in their open reading frame (ORF), I identified the introns in order to annotate RT functions. To detect the characteristic RNA secondary structures surrounding the IEP (RT), structures homologous to the specific domains of the introns (domains I-VI) were searched for in CPR bacterial genomes. Domains V and VI were searched using Infernal v.1.1.2 (cmsearch program with the --nohmm option; score > 24) against the RNA secondary structure profile “Intron_gpII” (RF00029) in the Rfam database [48]. For domains I-IV, Infernal v.1.1.2 (cmsearch program with the --rfam option; E-value ≤ 1e-10) was used against the profiles “group-II-D1D4-1” (RF01998), “group-II-D1D4-2” (RF01999), “group-II-D1D4-3” (RF02001), “group-II-D1D4-4” (RF02003), “group-II-D1D4-5” (RF02004), “group-II-D1D4-6” (RF02005), and “group-II-D1D4-7” (RF02012) in the database. Based on the search results, and considering the distances between the intron components, the types of group II introns were defined as follows: full-length, which has all of domains I-IV, ORF-RT, and domains V-VI; ORF-less, which lacks ORF-RT but has the domains; and others, which lack one of the three components.
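The three intron categories defined above can be expressed as a small decision rule. The Python sketch below is illustrative only: it takes the presence or absence of each component as given, reads “ORF-less” as having both domain sets but no ORF-RT, and does not model the distance criteria between components, which are not specified in the text. The function names `classify_group2_intron` and `count_intron_types` are hypothetical.

```python
from collections import Counter


def classify_group2_intron(has_domains_1_4, has_orf_rt, has_domains_5_6):
    """Assign a candidate locus to one of the three categories in the text."""
    if has_domains_1_4 and has_orf_rt and has_domains_5_6:
        return "full-length"  # domains I-IV, ORF-RT, and domains V-VI
    if has_domains_1_4 and has_domains_5_6:
        return "ORF-less"     # both domain sets present, no ORF-RT
    return "others"           # at least one domain set is missing


def count_intron_types(candidates):
    """Tally categories for (domains I-IV, ORF-RT, domains V-VI) flag tuples."""
    return Counter(classify_group2_intron(*flags) for flags in candidates)


# Hypothetical candidates, for illustration only.
candidates = [
    (True, True, True),
    (True, False, True),
    (False, True, True),
]
print(count_intron_types(candidates))
```

In practice the three flags would be derived from the cmsearch and hmmscan hits together with whatever distance limits are applied between neighboring components.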

3 Results and discussion

3.1 Overall relationships among prokaryotic RTs

To see the overall sequence relationships of RTs and RT-related proteins in prokaryotes and viruses, I constructed and visualized a sequence similarity network (SSN) (Figure 3.1). The SSN is a graphical representation of the similarities between sequences. Each sequence is indicated by a point (node), and the similarity between sequences is represented by the length of the line (edge) connecting the points: the smaller the distance between nodes, the greater the similarity between the sequences. I used RT and related protein sequences identified from the prokaryotic and viral genomes in the RefSeq dataset. Nodes are colored according to the origin of the sequences (bacteria [non-CPR], archaea, or virus; Figure 3.1A) or according to the types of RT and RT-related proteins (Figure 3.1B). An overview of the entire network structure shows that the RT and RT-related proteins can be divided into four groups: RTs of bacteria and archaea, RTs of viruses, RNA-dependent RNA polymerases (RdRp), and ribonuclease (RNase) H. The groups of viral RTs and viral RdRp consisted only of sequences derived from viruses, whereas the RNase H group and the bacterial RT group both contained sequences derived from all three taxonomic domains (bacteria, archaea, virus). Some bacterial types of RT, such as DGR RTs, have been found in virus (bacteriophage) genomes [49], and they are mainly associated with the bacterial RT group in the network.

The phylogenetic relationships of the obtained RTs and RT-related proteins were analyzed together with the structure of their protein functional domains (Figure 3.2). Several sequences were selected from each type of RT. The colors of the tips in the tree correspond to the colors of the nodes in Figure 3.1, and the type of RT and the taxonomic domain of origin (bacteria, archaea, virus) are shown alongside. Retroviral, LTR, non-LTR, and retron II types of RT were located close together on the phylogenetic tree, whereas group II intron and retron I RTs were split across multiple branches. Many viral RTs possessed various protein domains in addition to the central RT domain (shown as “RVT_1” in the figure), and their sequences were considerably longer than those of prokaryotic RTs. This is probably because viruses often encode a single protein with multiple functions.

Figure 3.1 Sequence similarity network of RTs from RefSeq prokaryotes. Nodes (colored dots) represent RT protein sequences, and edge lengths represent sequence similarities. (A) Nodes are colored according to the origin of the sequences: bacteria (non-CPR), archaea, or virus. (B) Nodes are colored according to the types of RT and RT-related proteins.

RdRp and RNase H, which are not RTs themselves, were obtained as RT-related proteins. RdRp has been considered to be evolutionarily related to RT [50], so it is not surprising that the RdRp domain sequences were highly similar to the RT domain. RNase H, on the other hand, was obtained because I included Pfam ID “RVT_3” when selecting protein sequences containing an RT domain. Although the “RVT_3” domain is registered as “Reverse transcriptase-like” in NCBI CDD, it does not belong to the “RVT_1 Superfamily” or the “RT_like Superfamily” with the other RTs, but to the “RNase_H_like Superfamily”. In many cases, an RT has an RNase H domain as part of it [51,52]. Hereafter, however, I excluded proteins that exist as RNase H alone, rather than as part of an RT protein, in order to observe the diversity and evolution of RT domains and proteins in the subsequent analyses that include CPR bacteria.

To analyze the characteristics of RTs in CPR bacteria, I first plotted histograms of the amino acid sequence lengths of RTs extracted from CPR bacteria and from non-CPR prokaryotes registered in RefSeq (Figure 3.3). For the histograms only, the RefSeq prokaryotic RTs were represented by cluster representative sequences from clusters containing at least 5 sequences, in order to ensure sequence diversity. For CPR bacteria and RefSeq prokaryotes, respectively, the minimum length was 78 and 72 residues, the mean length was 311 and 475 residues, and the maximum length was 763 and 1879 residues. The distribution for CPR bacterial RTs was unimodal with little variation in sequence length, whereas that for RefSeq prokaryotes was multimodal with roughly three peaks. This suggests that the RTs of prokaryotes registered in RefSeq include a wide variety of RT types, whereas most RTs of CPR bacteria belong to specific types.

Figure 3.2 Phylogenetic tree and domain architecture of RTs. Several sequences were selected from each type of RT and RT-related protein identified in the RefSeq prokaryotic genomes. The colors of the tips of the phylogenetic tree correspond to the colors of the nodes in Figure 3.1. The type of RT and the taxonomic domain of origin (bacteria, archaea, virus) are also shown. Functional protein domains are colored by domain type, and the domain names in the Pfam database are indicated.

To compare sequences between RTs in CPR bacteria and non-CPR prokaryotes, I constructed and visualized an SSN of RTs from both datasets (Figure 3.4). Note that for Figure 3.1, Pfam ID “RVT_3” was also included in the extraction of RTs, and as a result a considerable number of RNase H sequences were included in that network. The RNase H protein profile (Pfam ID “RVT_3”) was excluded here because I wanted to target only sequences close to the RT enzyme. CPR bacterial RTs, whose nodes are colored blue, formed a cluster of sequences on the left side of the network, along with sequences scattered slightly to the lower left (Figure 3.4A). These RTs were classified as the group II intron type and the retron type, respectively (Figure 3.4B). In addition, some CPR bacterial RTs were annotated as RNase H-like proteins (5 sequences) or appeared similar to viral RdRp (3 sequences).

Figure 3.3 Distribution of sequence lengths of the identified RT proteins. Distribution of amino acid sequence lengths of the identified RTs from (A) CPR bacteria and (B) non-CPR prokaryotes registered in the RefSeq database. Note that panel B shows representative sequences of clusters containing 5 or more sequences, in order to reduce the bias in the RefSeq sequence data.

Figure 3.4 Sequence similarity network of RTs from RefSeq (non-CPR) prokaryotes and CPR bacteria. Nodes (colored dots) represent RT protein sequences, and edge lengths represent sequence similarities. (A) Nodes are colored according to the origin of the sequences: bacteria (non-CPR), CPR bacteria, archaea, or virus. (B) Nodes are colored according to the types of RT and RT-related proteins.

Nodes annotated as group II intron-type RTs from CPR bacteria were clustered in the network (Figure 3.4) and appeared to make up the majority of the CPR RTs. Previous studies reported that 75% of the RTs in bacterial genomes belong to group II introns, 12% to retrons, and 3% to DGRs [15,53]. However, it should be noted that, as mentioned above, a detailed discussion requires more accurate annotation of RT types. This cluster of sequences might represent RTs characteristic of CPR bacteria, given its distance in the network from the RTs of non-CPR bacteria and archaea. If these are indeed RTs associated with group II introns, new types of group II introns might be present in CPR bacteria. Since this RT annotation was determined only by the best hits against the NCBI CDD profiles, more detailed types are identified in the next section by phylogenetic analysis with known types of RTs.

3.2 Functional analysis and classification of CPR RTs

The sequence similarity-based search for RT domains identified 514 RT protein sequences. To observe the phylogenetic distribution of the RTs, they were mapped onto the CPR bacterial phylogeny [23] (Figure 3.5). RTs were widely distributed in CPR bacteria and appeared in both major superphyla of the CPR, Parcubacteria (OD1) and Microgenomates. RTs were found in 313 of the 804 CPR bacterial genomes.

I combined the CPR RT sequences with the known RT sequences and constructed a phylogenetic tree (Figure 3.6). The CPR RTs were not monophyletic, and RTs related to retrons, abortive infection (AbiK and Abi-P2, but not AbiA), DGRs, group II introns, and group II intron-like elements were observed in the CPR. Most CPR RTs (441 sequences, 86% of the CPR RTs) were involved in DGRs.

Figure 3.5 Phylogenetic distribution of RTs in CPR bacteria. RTs were found in 313 genomes and were mapped onto the CPR phylogeny (804 genomes). Genomes with RT proteins are colored blue. The CPR phylogeny was adapted from Hug et al. [23].


Figure 3.6 Phylogenetic relationships of CPR RTs and known RTs. The 514 CPR RT sequences and 930 known RT sequences were combined, and the phylogenetic tree was estimated using the maximum-likelihood method. The types of RTs are indicated to the right of the tree tips. Collapsed nodes consist of known RT sequences, except for the node of 416 CPR RT sequences indicated by a star.

Most CPR RTs have only an RT domain, but five RT sequences have additional domains (Figure 3.8). Three of them (GenBank accessions KKR03365.1, KKU32234.1, and OGV95898.1) have the “GIIM” (Pfam ID PF08388) domain, a maturase-specific domain of group II introns. Protein OGV95898.1 also has the “RVT_N” (PF13655) domain, the N-terminal domain of reverse transcriptase. Interestingly, the other two proteins have distinctive domains, “zf-CHC2” (PF01807) and “UDG” (PF03167). “zf-CHC2” is a CHC2-type zinc finger domain, which may bind metals such as zinc or iron, or no metal at all. “UDG” is the uracil DNA glycosylase (UDG) domain. UDG is an enzyme that removes uracil from DNA and is crucial for DNA repair. I further searched, by domain architecture, for sequences having both an RT domain and a UDG domain in a protein database covering almost all publicly available sequences. However, the only sequences I found were from Candidatus Giovannonibacteria, the same host taxon as the sequence mentioned above (OGF82770.1). Therefore, I concluded that RTs with a UDG domain are specific to Candidatus Giovannonibacteria.

To examine RTs in bacterial group II introns, the introns were detected by searching for their characteristic RNA secondary structures (Table 3.1). In total, 12 group II introns were detected, and six of them had an RT as the IEP. The remaining six group II introns had both domains I-IV and domains V-VI, but the distance between the domains was so short (~450 nt) that they did not possess corresponding RTs.

The phylogenetic mapping and the analyses above revealed that several types of RT proteins are widely distributed and conserved across CPR bacterial phyla. The roles and functions of RTs in CPR bacterial ecology have been poorly understood, except for those of DGRs [16,54]. DGRs, retroelements that use an RT to generate sequence variation in specific target genes, have been identified in CPR bacteria and suggested to be used for adaptation to a dynamic host-dependent environment [54]. The phylogenetic distribution of CPR RTs observed here was congruent with the previous study that identified DGRs [54]. Although that study focused on RTs as components of DGRs, I newly identified RTs in additional CPR phyla, i.e., Candidatus Collierbacteria, Candidatus Pacebacteria, and subdivisions RIF-10, 15, 16, 17, 20, and 21 of Candidatus Parcubacteria. In particular, RTs in RIF-10, 15, 16, 17, 20, and 21 of Parcubacteria were identified in multiple genomes, which indicates that non-DGR RTs exist in these phyla and may play important roles in their ecology.

The identified group II introns seem to have been acquired recently by horizontal gene transfer, since they have a well-conserved group II intron maturase domain and their RTs are closely related to RTs

[Figure: RT sequence labels omitted. Legend: Retrons, Abi-like, DGRs, Group II introns]

Table 3.1 Identified group II introns in CPR bacteria.

| Genome (GenBank accession) | Existence of RNA domains and RT | Corresponding RT (GenBank accession) |
| --- | --- | --- |
| Microgenomates_group_bacterium_RBG_16_45_19 (MHDC01000039.1) | Full (Domain I-IV, ORF-RT, Domain V,VI) | OGV95898.1 |
| Candidatus_Kerfeldbacteria_bacterium_RIFCSPLOWO2_02_FULL_42_19 (MHKF01000008.1) | Full (Domain I-IV, ORF-RT, Domain V,VI) | OGY84268.1 |
| Candidatus_Uhrbacteria_bacterium_GW2011_GWF2_46_218 (LCMG01000021.1) | Full (Domain I-IV, ORF-RT, Domain V,VI) | KKU32234.1 |
| Candidatus_Uhrbacteria_bacterium_GW2011_GWF2_39_13 (LBWG01000031.1) | ORF-RT, Domain V,VI | KKR03365.1 |
| Candidatus_Vogelbacteria_bacterium_RIFOXYB1_FULL_42_16 (MHTH01000005.1) | Domain I-IV, ORF-RT | OHA59186.1 |
| Candidate_division_Kazan_bacterium_RIFCSPLOWO2_01_FULL_48_13 (METE01000004.1) | Domain I-IV, ORF-RT | OGB85388.1 |
| Candidatus_Levybacteria_bacterium_RIFCSPHIGHO2_01_FULL_37_33 (MFNM01000037.1) | ORF-less (Domain I-IV, Domain V,VI) | - |
| Candidatus_Levybacteria_bacterium_RIFCSPLOWO2_02_FULL_37_10 (MFPB01000045.1) | ORF-less (Domain I-IV, Domain V,VI) | - |
| Candidatus_Colwellbacteria_bacterium_RIFCSPHIGHO2_12_FULL_44_17 (MHIX01000008.1) | ORF-less (Domain I-IV, Domain V,VI) | - |
| Candidatus_Nealsonbacteria_bacterium_RBG_13_42_11 (MHLY01000006.1) | ORF-less (Domain I-IV, Domain V,VI) | - |
| Candidatus_Terrybacteria_bacterium_RIFCSPLOWO2_01_FULL_40_23 (MHSW01000005.1) | ORF-less (Domain I-IV, Domain V,VI) | - |
| Candidatus_Gottesmanbacteria_bacterium_GW2011_GWB1_49_7 (LCQD01000034.1) | ORF-less (Domain I-IV, Domain V,VI) | - |


from non-CPR prokaryotes. RTs involved in abortive infection and retrons were also identified. Abortive infection is a process that provides phage immunity by blocking phage multiplication through programmed death of the infected cell. Three bacterial RT-related proteins are involved in phage resistance: AbiA, AbiK, and Abi-P2. AbiK and Abi-P2 were identified in CPR bacterial genomes in this study, and several RTs that are close to AbiK but form a distinct clade were also identified. Retrons consist of an RT gene and an adjacent inverted repeat sequence, but their function remains unknown. The detailed sequence properties of these RTs will be investigated in further research.

Although the CPR phyla form a radiation that is clearly separated from other bacteria, the CPR RTs were polyphyletic in the RT tree. This result implies that CPR bacteria and non-CPR bacteria occasionally exchange RT genes, contributing to RT evolution and diversification. I identified several types of RTs whose biological properties have not been elucidated, but these RTs might contribute to CPR ecology, since they have been retained rather than discarded in the extremely small CPR genomes.
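The polyphyly argument rests on the observation that no single clade of the RT tree contains all CPR RTs and nothing else. A minimal Python sketch of that test on a rooted tree is given below. The nested-tuple tree encoding, the leaf names, and the function names `leaves` and `is_monophyletic` are all hypothetical and chosen for illustration; the toy trees are not the RT tree of this study. Failing the test shows only that a group is not monophyletic, and distinguishing polyphyly from paraphyly needs further inspection of the tree.

```python
def leaves(tree):
    """Return the set of leaf names under a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def is_monophyletic(tree, group):
    """True if some clade of the rooted tree holds exactly the group's leaves."""
    group = set(group)
    if leaves(tree) == group:
        return True
    if isinstance(tree, str):
        return False
    return any(is_monophyletic(child, group) for child in tree)


# Toy trees with hypothetical leaf names.
MIXED_TREE = (("CPR_RT_1", "nonCPR_RT_1"), ("CPR_RT_2", "nonCPR_RT_2"))
SEPARATED_TREE = (("CPR_RT_1", "CPR_RT_2"), ("nonCPR_RT_1", "nonCPR_RT_2"))
CPR_RTS = {"CPR_RT_1", "CPR_RT_2"}

if __name__ == "__main__":
    print(is_monophyletic(MIXED_TREE, CPR_RTS))      # CPR RTs are interleaved
    print(is_monophyletic(SEPARATED_TREE, CPR_RTS))  # CPR RTs form one clade
```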

Figure 3.8 Domain architecture of CPR RTs with other domains. Functional domains in RT proteins are visualized. Domains are based on the Pfam database and defined as follows: RVT_1 (PF00078), Reverse transcriptase (RNA-dependent DNA polymerase); GIIM (PF08388), Group II intron, maturase-specific domain; RVT_N (PF13655), N-terminal domain of reverse transcriptase; UDG (PF03167), Uracil DNA glycosylase superfamily; zf-CHC2 (PF01807), CHC2 zinc finger.

2.3 Sequence analysis of small CPR RTs implies putative ribosomal frameshifts

To examine the features of CPR RTs, I compared the sequences of CPR RTs with those of typical bacterial RTs (Figure 3.7). Compared with the group II intron RT (in the intron-encoded protein), the retron RT, and the DGR RT, multiple sequence alignments showed that some CPR RTs have very short sequences and appear to be truncated: their first halves aligned well, but they lacked the latter halves.

I hypothesized that these truncations of the coding sequences were caused by ribosomal frameshifts, so that the latter halves may be encoded in the region downstream of the annotated coding region. To test this hypothesis, the downstream regions were concatenated with their coding sequences and aligned (Figure 3.9). The downstream regions of the truncated RTs aligned well with the full-length (not truncated) RT, so a frameshift mutation appears to have occurred at the end of the coding region.
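One way to picture the concatenation step is to join a truncated coding sequence with its downstream region and translate from the annotated stop codon onward in each of the three forward frames; a shifted frame that stays open is a candidate for the missing latter half. The Python sketch below illustrates this idea under stated assumptions and is not the alignment procedure used for Figure 3.9: the sequences are invented toy strings, the function names `translate` and `extension_by_frame` are hypothetical, and the standard genetic code is assumed.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq, frame=0):
    """Translate a DNA string from the given offset; '*' marks a stop codon."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")  # 'X' for codons with unknown bases
        for i in range(frame, len(seq) - 2, 3)
    )


def extension_by_frame(cds, downstream):
    """Translate from the CDS stop codon onward in frames 0, +1 and +2.

    Each translation is cut at its first stop codon, so a long result in a
    shifted frame suggests that the protein could continue in that frame.
    """
    joined = cds + downstream
    start = len(cds) - 3  # first base of the annotated stop codon
    return {
        shift: translate(joined[start:], shift).split("*")[0]
        for shift in range(3)
    }


if __name__ == "__main__":
    toy_cds = "ATGGCTAAATAA"         # invented CDS ending in a TAA stop
    toy_downstream = "GTTGACGATTGA"  # invented downstream region
    print(translate(toy_cds))
    print(extension_by_frame(toy_cds, toy_downstream))
```

In the toy example the in-frame extension is empty because it starts at the stop codon, whereas the +1 frame reads through the downstream region. For real data, the translated extension would still need to be aligned with a full-length RT, as done in the text.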

Figure 3.9 Sequence alignment of RT-coding regions and their neighboring regions. Representative truncated RTs and their downstream regions were concatenated and aligned with a full-length (not truncated) RT (protein ID: KKS64476.1 of Parcubacteria group bacterium GW2011_GWB1_42_6). The black dashed box represents the putative translational frameshift site.

Domain architectures of the RT proteins shown in Figure 3.8:

| Protein (GenBank accession) | Contig (GenBank accession) | Annotation | Domain architecture |
| --- | --- | --- | --- |
| KKR03365.1 | LBWG01000031.1 | RNA-directed DNA polymerase (Reverse transcriptase) | RVT_1, GIIM |
| KKU32234.1 | LCMG01000021.1 | Reverse transcriptase (RNA-dependent DNA polymerase) | RVT_N, RVT_1, GIIM |
| OGF82770.1 | MFIA01000018.1 | hypothetical protein | RVT_1, UDG |
| OGG55025.1 | MFLC01000025.1 | hypothetical protein | RVT_1, zf-CHC2 |
| OGV95898.1 | MHDC01000039.1 | group II intron reverse transcriptase/maturase | RVT_1, GIIM |

[Figure 3.9 alignment panels: rows are the full-length RT and truncated RTs ① to ⑤, each shown as CDS and as CDS + downstream.]

Figure 3.7 Phylogenetic relationships and schematic sequence alignment of RT protein sequences. Six CPR RTs and three other RTs (from a group II intron, a DGR, and a retron, respectively) were aligned and are represented schematically on the right. Grey boxes represent the presence of residues, and the other regions are alignment gaps. CPR RTs are enclosed in a blue square.


To examine the phylogenetic distribution of these putative frameshift RTs, the RTs were mapped onto the CPR bacterial phylogeny [23] (Figure 3.10A). Putative frameshift RTs were not confined to specific phyla but were widely distributed across CPR bacteria. They appeared in both major superphyla of the CPR, Parcubacteria (OD1) and Microgenomates. RTs were found in 318 of the 804 CPR species, and putative frameshift RTs were present in 38 species. In total, 63 of 539 RTs (11.7%) in CPR bacteria and 367 of 7,143 RTs (5.1%) in RefSeq prokaryotes were identified as frameshift RTs.

To test whether the frameshifts occur specifically in RTs, DNA polymerase family A proteins were retrieved and analyzed for frameshifts in the same way (Figure 3.10B). In contrast to RTs, only 4 of 670 DNA polymerase family A proteins (0.6%) were identified as frameshift proteins, and they were found in only 4 species.
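The percentages quoted for the frameshift proteins follow directly from the reported counts. The short Python sketch below recomputes them; the counts are taken from the text, while the dictionary labels and the helper name `percent` are illustrative.

```python
# (putative frameshift proteins, all proteins of that type), as reported above.
COUNTS = {
    "RT, CPR bacteria": (63, 539),
    "RT, RefSeq prokaryotes": (367, 7143),
    "DNA polymerase family A": (4, 670),
}


def percent(part, total):
    """Return part/total as a percentage rounded to one decimal place."""
    return round(100 * part / total, 1)


if __name__ == "__main__":
    for label, (frameshift, total) in COUNTS.items():
        print(f"{label}: {frameshift}/{total} = {percent(frameshift, total)}%")
```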

From this analysis, I found that RT proteins containing frameshift mutations are widely conserved across CPR phyla. It is difficult to imagine that these proteins are conserved without any reason, so I assumed that they might be fully translated through translational frameshifting, in which the translating ribosome slips, skips nucleotides, and reads a different frame thereafter.

Few studies have comprehensively identified RTs in CPR bacteria, except for DGRs [16,54]. DGRs, retroelements consisting of an RT and specific types of related sequences, were identified in CPR bacteria and suggested to be used for adaptation to a dynamic host-dependent environment [54]. The phylogenetic distribution of CPR RTs was congruent with the previous study that identified DGRs [54]. Although that study focused on DGR RTs, I newly identified RTs in further CPR phyla, i.e. Candidatus Amesbacteria, Candidatus Collierbacteria, and Candidatus Pacebacteria. The types of the RTs I identified could not be clearly determined and the biological properties of these phyla have not been elucidated, but the RTs might contribute to CPR ecology, since they have been retained rather than discarded in the extremely small CPR genomes.

Considering that putative translational frameshifts were found frequently in RTs but rarely in another polymerase (DNA polymerase family A), some mechanism specific to RTs might give rise to the frameshifts. I supposed that RTs in bacterial group II introns contributed to the mutations. Group II introns are mobile elements whose mobility reactions are mediated by the RTs encoded in the introns [6,7]. RTs synthesize DNA from transcribed group II introns, and frameshift mutations may arise during reverse transcription. Although group II intron RTs have high processivity and fidelity, which preserve the function of the RT [12,55,56], RTs lack the 3’ to 5’ proofreading capability of other DNA polymerases [57] and are prone to errors.

Since most CPR bacterial genomes were reconstructed from metagenomes, the RT frameshifts could be mere sequencing artifacts. To assess this possibility, I examined frameshifts in another protein, DNA polymerase family A. Frameshifts were found in only a few cases, implying that the RT frameshifts are unlikely to be artifacts of the metagenomic methodology.

Figure 3.10 Phylogenetic distribution of putative frameshift proteins. The presence of putative frameshift and full-length RTs (A) and DNA polymerase family A proteins (B) was mapped onto the CPR phylogeny (804 genomes). Genomes with full-length proteins are colored in blue and those with putative frameshift proteins are colored in red. RTs were found in 318 species and putative frameshift RTs in 38 species; DNA polymerase family A proteins were found in 619 species and putative frameshift proteins in 4 species. The CPR phylogeny was taken from Hug et al. [23] and modified.


4 Conclusion

In this study, I revealed that several types of RT proteins are widely distributed and conserved in CPR bacterial phyla. CPR bacteria have at least RTs related to DGRs, group II introns, retrons, and abortive infection (Abi), and their relative abundance differs from that in non-CPR bacteria. CPR bacteria are thought to live attached to other microorganisms, and the predominance of DGR RTs, which are a minority in other prokaryotes, suggests that CPR bacteria have successfully used the property of DGRs, namely introducing mutations into target genes, to adapt to rapidly changing host environments. The polyphyletic distribution of CPR RTs in the RT tree implies that CPR and non-CPR bacteria occasionally exchange RT genes, contributing to RT evolution and diversification. In addition, sequence comparisons among CPR RTs and other prokaryotic and viral RTs showed that there were several truncated RT protein sequences. These were RTs containing frameshift mutations, and they were widely distributed in CPR phyla. Since this phenomenon is specific to RTs and it is unlikely that group II introns introduce mutations when replicating their own sequences, I speculate that these RTs with frameshift mutations may retain some function.

Acknowledgements

I would like to thank Professor Akio Kanai for his great support of my research. He taught me the fascinating aspects of molecular biology and evolution, and he always encouraged my research. I also thank Ms. Megumi Tsurumaki, Mr. Masahiro Miura, and all the members of the RNA Group for their insightful discussions. I would also like to express my gratitude to my family for their moral support and warm encouragement. Finally, I would like to thank Professor Masaru Tomita for providing a stimulating environment for my research at IAB. This work was supported, in part, by Taikichiro Mori Memorial Research Grants.

References

1. Baltimore D. RNA-dependent DNA polymerase in virions of RNA tumour viruses. Nature. 1970;226(5252):1209–11.

2. Temin HM, Mizutani S. RNA-dependent DNA polymerase in virions of Rous sarcoma virus. Nature. 1970;226(5252):1211–3.

3. Lampson BC, Sun J, Hsu MY, Vallejo-Ramirez J, Inouye S, Inouye M. Reverse transcriptase in a clinical strain of Escherichia coli: production of branched RNA-linked msDNA. Science. 1989;243(4894 Pt 1):1033–8.

4. Lim D, Maas WK. Reverse transcriptase-dependent synthesis of a covalently linked, branched DNA-RNA compound in E. coli B. Cell. 1989;56(5):891–904.

5. Michel F, Jacquier A, Dujon B. Comparison of fungal mitochondrial introns reveals extensive homologies in RNA secondary structure. Biochimie. 1982;64(10):867–81.

6. McNeil BA, Semper C, Zimmerly S. Group II introns: Versatile ribozymes and retroelements. Wiley Interdiscip Rev RNA. 2016;7(3):341–55.

7. Lambowitz AM, Belfort M. Mobile Bacterial Group II Introns at the Crux of Eukaryotic Evolution. Microbiol Spectr. 2015;3(2):1–26.

8. Liu M, Deora R, Doulatov SR, Gingery M, Eiserling FA, Preston A, et al. Reverse transcriptase-mediated tropism switching in Bordetella bacteriophage. Science. 2002;295(5562):2091–4.

9. Liu M, Deora R, Simons RW, Doulatov S, Hodes A, Dai L, et al. Tropism switching in Bordetella bacteriophage defines a family of diversity-generating retroelements. Nature. 2004;431(7007):476–81.

10. Arambula D, Miller JF, Guo H, Ghosh P. Diversity-generating Retroelements in Phage and Bacterial Genomes. Microbiol Spectr. 2014;2(6):1–16.

11. van der Veen R, Arnberg AC, van der Horst G, Bonen L, Tabak HF, Grivell LA. Excised group II introns in yeast mitochondria are lariats and can be formed by self-splicing in vitro. Cell. 1986;44(2):225–34.

12. Cousineau B, Smith D, Lawrence-Cavanagh S, Mueller JE, Yang J, Mills D, et al. Retrohoming of a bacterial group II intron: Mobility via complete reverse splicing, independent of homologous DNA recombination. Cell. 1998;94(4):451–62.

13. Lazowska J, Meunier B, Macadre C. Homing of a group II intron in yeast mitochondrial DNA is accompanied by unidirectional co-conversion of upstream-located markers. EMBO J. 1994;13(20):4963–72.

14. Gladyshev EA, Arkhipova IR. A widespread class of reverse transcriptase-related cellular genes. Proc Natl Acad Sci U S A. 2011;108(51):20311–6.

15. Zimmerly S, Wu L. An Unexplored Diversity of Reverse Transcriptases in Bacteria. Microbiol Spectr. 2015;3(2):1–16.

16. Wu L, Gingery M, Abebe M, Arambula D, Czornyj E, Handa S, et al. Diversity-generating retroelements: Natural variation, classification and evolution inferred from a large-scale genomic survey. Nucleic Acids Res. 2018;46(1):11–24.

17. Menéndez-Arias L, Sebastián-Martín A, Álvarez M. Viral reverse transcriptases. Virus Res. 2017;234:153–76.

18. Wang C, Villion M, Semper C, Coros C, Moineau S, Zimmerly S. A reverse transcriptase-related protein mediates phage resistance and polymerizes untemplated DNA in vitro. Nucleic Acids Res. 2011;39(17):7620–9.

19. Fortier LC, Bouchard JD, Moineau S. Expression and site-directed mutagenesis of the lactococcal abortive phage infection protein AbiK. J Bacteriol. 2005;187(11):3721–30.

20. Simon DM, Zimmerly S. A diversity of uncharacterized reverse transcriptases in bacteria. Nucleic Acids Res. 2008;36(22):7219–29.

21. Kojima KK, Kanehisa M. Systematic survey for novel types of prokaryotic retroelements based on gene neighborhood and protein architecture. Mol Biol Evol. 2008;25(7):1395–404.

22. Brown CT, Hug LA, Thomas BC, Sharon I, Castelle CJ, Singh A, et al. Unusual biology across a group comprising more than 15% of domain Bacteria. Nature. 2015;523(7559):208–11.

23. Hug LA, Baker BJ, Anantharaman K, Brown CT, Probst AJ, Castelle CJ, et al. A new view of the tree of life. Nat Microbiol. 2016;1(5):16048.

24. Castelle CJ, Banfield JF. Major New Microbial Groups Expand Diversity and Alter our Understanding of the Tree of Life. Cell. 2018;172(6):1181–97.

25. He X, McLean JS, Edlund A, Yooseph S, Hall AP, Liu SY, et al. Cultivation of a human-associated TM7 phylotype reveals a reduced genome and epibiotic parasitic lifestyle. Proc Natl Acad Sci U S A. 2015;112(1):244–9.

26. Anantharaman K, Brown CT, Hug LA, Sharon I, Castelle CJ, Probst AJ, et al. Thousands of microbial genomes shed light on interconnected biogeochemical processes in an aquifer system. Nat Commun. 2016;7:1–11.

27. Dudek NK, Sun CL, Burstein D, Kantor RS, Aliaga Goltsman DS, Bik EM, et al. Novel Microbial Diversity and Functional Potential in the Marine Mammal Oral Microbiome. Curr Biol. 2017;27(24):3752-3762.e6.

28. Danczak RE, Johnston MD, Kenah C, Slattery M, Wrighton KC, Wilkins MJ. Members of the Candidate Phyla Radiation are functionally differentiated by carbon- and nitrogen-cycling capabilities. Microbiome. 2017;5(1):112.

29. Starr EP, Shi S, Blazewicz SJ, Probst AJ, Herman DJ, Firestone MK, et al. Stable isotope informed genome-resolved metagenomics reveals that Saccharibacteria utilize microbially-processed plant-derived carbon. Microbiome. 2018;6(1):1–12.

30. Orsi WD, Richards TA, Francis WR. Predicted microbial secretomes and their target substrates in marine sediment. Nat Microbiol. 2018;3(1):32–7.

31. Castelle CJ, Brown CT, Anantharaman K, Probst AJ, Huang RH, Banfield JF. Biosynthetic capacity, metabolic variety and unusual biology in the CPR and DPANN radiations. Nat Rev Microbiol. 2018;

32. Haft DH, DiCuccio M, Badretdin A, Brover V, Chetvernin V, O’Neill K, et al. RefSeq: An update on prokaryotic genome annotation and curation. Nucleic Acids Res. 2018;46(D1):D851–60.

33. Mistry J, Finn RD, Eddy SR, Bateman A, Punta M. Challenges in homology search: HMMER3 and convergent evolution of coiled-coil regions. Nucleic Acids Res. 2013;41(12).

34. El-Gebali S, Mistry J, Bateman A, Eddy SR, Luciani A, Potter SC, et al. The Pfam protein families database in 2019. Nucleic Acids Res. 2018;(8):1–6.

35. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215(3):403–10.

36. Altschul S, Madden T, Schaffer A, Zhang J, Zhang Z, Miller W, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25(17):3389–402.

37. Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, et al. BLAST+: architecture and applications. BMC Bioinformatics. 2009;10(421):1.

38. Dufour YS, Kiley PJ, Donohue TJ. Reconstruction of the core and extended regulons of global transcription factors. PLoS Genet. 2010;6(7):1–20.

39. Matsui M, Tomita M, Kanai A. Comprehensive computational analysis of bacterial CRP/FNR superfamily and its target motifs reveals stepwise evolution of transcriptional networks. Genome Biol Evol. 2013;5(2):267–82.

40. Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, et al. Cytoscape: A software environment for integrated models of biomolecular interaction networks. Genome Res. 2003;13(11):2498–504.

41. Katoh K, Standley DM. MAFFT multiple sequence alignment software version 7: Improvements in performance and usability. Mol Biol Evol. 2013;30(4):772–80.

42. Stamatakis A. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics. 2014;30(9):1312–3.

43. Huerta-Cepas J, Serra F, Bork P. ETE 3: Reconstruction, Analysis, and Visualization of Phylogenomic Data. Mol Biol Evol. 2016;33(6):1635–8.

44. Letunic I, Bork P. Interactive tree of life (iTOL) v3: an online tool for the display and annotation of phylogenetic and other trees. Nucleic Acids Res. 2016;44(W1):W242–5.

45. Moore AD, Heldy A, Terrapon N, Weiner J, Bornberg-Bauer E. DoMosaics: Software for domain arrangement visualization and domain-centric analysis of proteins. Bioinformatics. 2014;30(2):282–3.

46. Jones P, Binns D, Chang HY, Fraser M, Li W, McAnulla C, et al. InterProScan 5: Genome-scale protein function classification. Bioinformatics. 2014;30(9):1236–40.

47. Mitchell AL, Attwood TK, Babbitt PC, Blum M, Bork P, Bridge A, et al. InterPro in 2019: Improving coverage, classification and access to protein sequence annotations. Nucleic Acids Res. 2019;47(D1):D351–60.

48. Kalvari I, Argasinska J, Quinones-Olvera N, Nawrocki EP, Rivas E, Eddy SR, et al. Rfam 13.0: Shifting to a genome-centric resource for non-coding RNA families. Nucleic Acids Res. 2018;46(D1):D335–42.

49. Miller JL, Le Coq J, Hodes A, Barbalat R, Miller JF, Ghosh P. Selective ligand recognition by a diversity-generating retroelement variable protein. PLoS Biol. 2008;6(6):1195–207.

50. Iyer LM, Koonin EV, Aravind L. Evolutionary connection between the catalytic subunits of DNA-dependent RNA polymerases and eukaryotic RNA-dependent RNA polymerases and the origin of RNA polymerases. 2003;23:1–23.

51. Moelling K, Broecker F, Russo G, Sunagawa S. RNase H As gene modifier, driver of evolution and antiviral defense. Front Microbiol. 2017;8(SEP):1–20.

52. Moelling K, Broecker F. The reverse transcriptase-RNase H: From viruses to antiviral defense. Ann N Y Acad Sci. 2015;1341(1):126–35.

53. Toro N, Nisa-Martínez R. Comprehensive phylogenetic analysis of bacterial reverse transcriptases. PLoS One. 2014;9(11):1–16.

54. Paul BG, Burstein D, Castelle CJ, Handa S, Arambula D, Czornyj E, et al. Retroelement-guided protein diversification abounds in vast lineages of Bacteria and Archaea. Nat Microbiol. 2017;2(April):1–7.

55. Mohr S, Ghanem E, Smith W, Sheeter D, Qin Y, King O, et al. Thermostable group II intron reverse transcriptase fusion proteins and their use in cDNA synthesis and next-generation RNA sequencing. RNA. 2013;19(7):958–70.

56. Conlan LH, Stanger MJ, Ichiyanagi K, Belfort M. Localization, mobility and fidelity of retrotransposed Group II introns in rRNA genes. Nucleic Acids Res. 2005;33(16):5262–70.

57. Kunkel TA, Bebenek K. DNA Replication Fidelity. Annu Rev Biochem. 2000;69(1):497–529.