Ph.D. Thesis
Development of computational tools for RNA tertiary structure prediction
Opracowanie narzędzi komputerowych do przewidywania struktury trzeciorzędowej RNA
Marcin Magnus
Supervisor:
Professor Janusz Marek Bujnicki
The work has been conducted in
the Laboratory of Bioinformatics and Protein Engineering
at the International Institute of Molecular and Cell Biology in Warsaw
and at
the Department of Biochemistry of Stanford University, USA
(under the supervision of professor Rhiju Das).
The Graduate School of Molecular Biology at
the Institute of Biochemistry and Biophysics Polish Academy of Sciences, in Warsaw
Warsaw, 2017
Cover image by Marcin Magnus and Janusz M. Bujnicki: An accurate 3D model of the HCV IRES
RNA structure, obtained with the fully automated RNA modeling method SimRNAweb (Magnus et
al. 2016), using only the RNA sequence as input. The model has an RMSD of 5.52 Å to the
experimentally determined structure (PDB id: 1KH6 (Kieft et al. 2002)). The superposition
was done and visualized using PyMOL, and the image was repainted in the style of the
“Transverse Line” painting by Wassily Kandinsky (1923), using the machine learning method
DeepArt.io.
https://academic.oup.com/nar/issue/45/1#127533-2871059
Abstract
Ribonucleic acid (RNA) is one of the key types of molecules found in living cells. It is
involved in a number of highly important biological processes, not only as the carrier of
genetic information but also serving catalytic, scaffolding and structural functions.
Interest in the field of non-coding RNA has been increasing for the past few decades, with
new types of non-coding RNAs discovered every year. As with proteins, the 3D structure
of an RNA molecule determines its function. In order to build a 3D model of an RNA molecule,
one can take advantage of high-resolution experimental techniques, such as biocrystallography.
However, experimental techniques are tedious, time-consuming, and expensive, require
specialized equipment, and cannot always be applied. Computational modeling methods are an
alternative to experimental techniques. However, the results of RNA-Puzzles, a collective
experiment for blind RNA structure prediction, show that accurate modeling of RNA is still
very difficult and that there is still much room for improvement.
To facilitate the task of RNA 3D structure prediction, two new approaches were proposed in
this study: one for the prediction of relative model accuracy, and another for the generation of
RNA 3D structure models.
First, I developed a new approach to answering the question: How to choose the structural
model that is closest to the native structure? This task is called “model evaluation” and it is
an important step in RNA 3D structure prediction. A few methods have been developed so far,
but their accuracy is not sufficient and they behave differently depending on the dataset they
are used on. This stimulated the development of a meta-predictor, mqapRNA, which combines
the primary methods and uses a deep learning model to take advantage of their combined
strengths and to eliminate their individual weaknesses. In addition, mqapRNA is equipped
with a module that helps the user to refine the prediction by applying distance restraints
obtained from an experimental method or evolutionary analysis. The method is available as
an easy-to-use web server.
Second, I developed a new approach to RNA 3D structure prediction, named EvoClustRNA,
that incorporates evolutionary information from distant sequence homologs, following a
classic strategy of protein structure prediction. Based on the empirical
observation that RNA sequences from the same RNA family typically fold into similar 3D
structures, I tested whether it is possible to guide in silico modeling by seeking a global
helical arrangement for the target sequence that is shared across de novo models of
numerous sequence homologs. EvoClustRNA performs a multi-step modeling process and
can be coupled with any method for RNA structure prediction, such as SimRNA or Rosetta.
The EvoClustRNA approach was tested in two blind RNA-Puzzles challenges. The predictions
ranked first among all submissions for the L-glutamine riboswitch and second for the
ZMP riboswitch. The method was also benchmarked on a testing dataset.
As a complementary activity, I developed a software package called rna-pdb-tools. It is a
Python library and a set of tools dedicated to handling and manipulating RNA structure files,
including (1) rebuilding of missing atoms in RNA structures, (2) structural clustering, (3)
standardization of the PDB format to comply with the format required by RNA-Puzzles, (4)
visualization of RNA secondary structures and drawing of arc diagrams of secondary
structure from Python scripts or Jupyter Notebooks, and much more. The code is
modular and well documented, which should encourage new developers to build new
applications on top of rna-pdb-tools. The code is open and free to use, and can serve as an
example of “scientific software that computes”.
The ability to predict RNA 3D structure opens great opportunities for new developments
in biotechnology and basic science. However, it is not possible to take advantage of these
opportunities without understanding the structures of these molecules. The proposed
projects can make investigations of RNA structures much more effective.
All developed tools are available online: https://genesilico.pl/mqapRNA/,
https://github.com/mmagnus/EvoClustRNA, https://github.com/mmagnus/rna-pdb-tools.
Streszczenie (Summary)
Ribonucleic acid (RNA) is one of the basic types of molecules found in living cells. It is
involved in numerous important biological processes, not only as a carrier of genetic
information but also in catalytic, regulatory, and structural roles. Because the function of
many families of RNA molecules depends on their spatial structure, we can try to understand
the mechanism by which a given type of RNA works by learning the shape of the molecule. To
determine the spatial structure of RNA, one can use high-resolution experimental techniques
such as biocrystallography. Experimental techniques, however, are time-consuming, costly,
and require specialized equipment, and their application does not always end in success.
Computational modeling methods are an alternative to experimental techniques. However, as
shown by the results of the RNA-Puzzles experiment, in which scientists from around the world
model RNA structures for a given sequence, this problem is very difficult and the prediction
results are often unsatisfactory.
To bring us closer to solving this problem, in this dissertation I propose two new
computational protocols, concerning the modeling of RNA 3D structure and the prediction of
model accuracy.
In typical computational modeling tasks, the researcher obtains a set of alternative
structural models of a given RNA and faces a very important question: How to choose the
model that is closest to the real (native) structure? The models obtained in silico must
undergo a procedure that estimates their quality and ranks them according to the resulting
values. Several methods for predicting RNA 3D structure have been developed so far, but
their accuracy is not sufficient, and they also behave differently depending on the dataset
used for testing. To address this problem I developed a new meta-predictor, mqapRNA. The
tool combines scores obtained from several component methods and then uses a statistical
model based on deep learning to assess the quality of structural models in a broader context.
In addition, mqapRNA is equipped with a module that helps the user refine the program's
results by adding distance restraints obtained from an experimental method or from
evolutionary analysis. The method is available as an easy-to-use web server.
Quality assessment of computationally predicted models aside, the prediction process itself
is a major challenge. I therefore decided to develop a new approach to RNA 3D structure
prediction, one that had previously been used with great success for modeling protein
structures. Based on the observation that RNA sequences belonging to the same family fold
into a very similar tertiary structure, I examined whether this phenomenon can be used to
improve the results of RNA modeling. Independent folding simulations of various homologous
sequences were carried out in order to detect a spatial arrangement of helical regions
common to them. The EvoClustRNA program performs a multi-step modeling process and can be
coupled with any RNA structure prediction method, for example SimRNA or Rosetta. The
EvoClustRNA approach was tested on two blind predictions in the RNA-Puzzles competition.
For the L-glutamine-binding riboswitch, the model obtained with EvoClustRNA ranked first in
the final ranking, and the model of the ZMP riboswitch ranked second. The method was also
tested on a test dataset.
In the third part of this dissertation I describe a software package whose development made
the above projects possible. rna-pdb-tools is a Python programming library and a set of
tools for handling and modifying RNA structure files in the PDB format, such as rebuilding
missing atoms in RNA structures and standardizing PDB formats, which can be run from Python
scripts or an interactive Jupyter notebook, among many others.
Effective computational prediction of RNA spatial structure opens new possibilities for
biotechnology and basic science. However, without understanding how the structure of RNA
depends on its sequence, it will not be possible to take advantage of them. The developed
tools can significantly facilitate the understanding of this relationship.
The tools are available at https://genesilico.pl/mqapRNA/,
https://github.com/mmagnus/EvoClustRNA, https://github.com/mmagnus/rna-pdb-tools
Acknowledgements
Thank You
To my parents Danuta and Jerzy, and my siblings, Ania, Adam, Natalia, Patryk, who kept their fingers crossed for me and always supported me.
To Janusz Bujnicki, for always nurturing my hard work, critical spirit, and motivation, and for his guidance.
To the present and past members of the Bujnicki Lab, Agnieszka Faliszewska, Dorota Niedzałek, Filip Stefaniak, Michał Boniecki, Tomasz Wirecki, Kasia Merdas, Pietro Boccaletto, Adrianna Żyła, Elżbieta Purta, Dawid Głów, Radosław Pluta, Astha, Katarzyna Merdas, Catarina Almeida, Błażej Bagiński, Krzysztof Szczepaniak, Dorota Matelska, Grzegorz Chojnowski, Grzegorz Łach, Wayne Dawson, Łukasz Kozłowski, Stanisław Dunin-Horkawicz, Tomasz Puton, Irina Tuszynska, Magdalena Machnicka, Magda Byszewska, Diana Toczydłowska, Paweł Piątkowski, Marcin Pawłowski, for the precious help, support, and fun.
To Magda Konarska & Rhiju Das for their mentorship, and long discussions about SCIENCE.
To Henri Sara, Elmar Bucher, Matthias Nees, John Patrick Mpindi for such an enjoyable time at VTT.
To all guests and members of the Do Science Family, for being such an inspiring and lovely community.
To Wojtek Siwek for always being there and his friendship.
To Grzegorz Lorek for inspiration, deep insight, free thought, and joy of "ONE" life.
To the IIMCB team: Jacek Kuźnicki, Marcin Nowotny, Daria Goś, Agnieszka Potęga, Dorota Libiszowska, Justyna Szopa, Agata Skaruz, Hanna Iwaniukowicz for precious help and support in all crazy activities.
To the developers of all open source tools used in this work.
To Paulina for her love and patience.
To SCIENCE!
Abbreviations
CM - Covariance model
Cryo-EM - cryo-electron microscopy
DCA - direct coupling analysis
DNA - deoxyribonucleic acid
EC - enrichment score
ESR/EPR - electron spin/paramagnetic resonance
FRET - Förster resonance energy transfer
INF - interaction network fidelity
MD - molecular dynamics
MOHCA - multiplexed hydroxyl radical cleavage analysis
MSA - multiple sequence alignment
NM - normal mode
NMR - nuclear magnetic resonance
PDB - Protein Data Bank
RMSD - root mean square deviation
RNA - ribonucleic acid
SHAPE - selective 2'-hydroxyl acylation analyzed by primer extension
ZMP - 5-aminoimidazole-4-carboxamide riboside 5′-monophosphate
Publications
The thesis partially covers the results described in the following scientific publications:
1. Z. Miao, R. W. Adamiak, M. Antczak, R. T. Batey, A. J. Becka, M. Biesiada, M. J.
Boniecki, J. M. Bujnicki, S.-J. Chen, C. Y. Cheng, F.-C. Chou, A. R. Ferré-D'Amaré, R. Das,
W. K. Dawson, F. Ding, N. V. Dokholyan, S. Dunin-Horkawicz, C. Geniesse, K. Kappel, W.
Kladwang, A. Krokhotin, G. E. Łach, F. Major, T. H. Mann, M. Magnus, K. Pachulska-
Wieczorek, D. J. Patel, J. A. Piccirilli, M. Popenda, K. J. Purzycka, A. Ren, G. M. Rice, J.
Santalucia, J. Sarzynska, M. Szachniuk, A. Tandon, J. J. Trausch, S. Tian, J. Wang, K. M.
Weeks, B. Williams, Y. Xiao, X. Xu, D. Zhang, T. Zok, and E. Westhof, “RNA-Puzzles
Round III: 3D RNA structure prediction of five riboswitches and one ribozyme.,” RNA, vol.
23, no. 5, pp. 655–672, May 2017.
2. M. Magnus*, M. J. Boniecki*, W. Dawson, and J. M. Bujnicki, “SimRNAweb: a web
server for RNA 3D structure modeling with optional restraints.,” Nucleic Acids Research,
vol. 44, no. 1, pp. W315–9, Jul. 2016.
3. P. Piatkowski, J. M. Kasprzak, D. Kumar, M. Magnus, G. Chojnowski, and J. M.
Bujnicki, “RNA 3D Structure Modeling by Combination of Template-Based Method
ModeRNA, Template-Free Folding with SimRNA, and Refinement with QRNAS.,” Methods
Mol. Biol., vol. 1490, no. Suppl, pp. 217–235, 2016.
4. Z. Miao, R. W. Adamiak, M.-F. Blanchet, M. Boniecki, J. M. Bujnicki, S.-J. Chen, C.
Cheng, G. Chojnowski, F.-C. Chou, P. Cordero, J. A. Cruz, A. R. Ferré-D'Amaré, R. Das, F.
Ding, N. V. Dokholyan, S. Dunin-Horkawicz, W. Kladwang, A. Krokhotin, G. Lach, M.
Magnus, F. Major, T. H. Mann, B. Masquida, D. Matelska, M. Meyer, A. Peselis, M.
Popenda, K. J. Purzycka, A. Serganov, J. Stasiewicz, M. Szachniuk, A. Tandon, S. Tian, J.
Wang, Y. Xiao, X. Xu, J. Zhang, P. Zhao, T. Zok, and E. Westhof, “RNA-Puzzles Round II:
assessment of RNA structure prediction programs applied to three large RNA structures.,”
RNA, vol. 21, no. 6, pp. 1066–1084, Jun. 2015.
5. M. Magnus*, D. Matelska*, G. Lach, G. Chojnowski, M. J. Boniecki, E. Purta, W.
Dawson, S. Dunin-Horkawicz, and J. M. Bujnicki, “Computational modeling of RNA 3D
structures, with the aid of experimental restraints.,” RNA Biol, vol. 11, no. 5, pp. 522–536,
2014.
* joint first authorship
Funding
This work was supported by the following sources.
Foundation for Polish Science (FNP) grant to professor Janusz Bujnicki, Modeling of RNA
and RNA-protein complexes: from sequence to structure to function, TEAM/2009-4/2/styp3.
Mazovia Scholarship to Marcin Magnus, executed under the Operational Programme
Human Capital – Priority 8.2.2, is addressed to PhD students engaged in the innovative
scientific research in areas considered particularly important for the development of Mazovia
Voivodship, 2014/2015 NR 669.
National Science Centre (NCN) grant Etiuda 2 to Marcin Magnus, Development and
application of bioinformatics tools to assess the quality of RNA structures,
2014/12/T/NZ2/00501.
National Science Centre (NCN) grant Preludium 9 to Marcin Magnus, RNA structure
prediction based on modeling the target sequence and homologous sequences, UMO-
2015/17/N/NZ2/03360.
Table of Contents
Abstract
Streszczenie
Acknowledgements
Abbreviations
Publications
Funding
Table of Contents
1 Introduction
1.1 Ribonucleic acid (RNA)
1.2 RNA structure
1.2.1 RNA secondary structure
1.2.2 RNA tertiary structure
1.3 RNA structure prediction with low-resolution experimental data
1.4 RNA families
1.5 RNA-Puzzles
2 Aim of this work
3 Materials & Methods
3.1 Hardware
3.2 Software
3.3 Structure visualizations
3.4 Databases
3.5 Development of mqapRNA
3.5.1 Datasets
3.5.2 Primary methods
3.5.3 Secondary structure comparison
3.5.4 Standardization of PDB files
3.5.5 Evaluation of scoring functions
3.5.6 Statistical analyses
3.5.7 Implementation of the web server
3.6 Development of EvoClustRNA
3.6.1 Multiple sequence alignment generation and selection of homologs
3.6.2 Modeling of sequences with SimRNA/SimRNAweb and Rosetta
3.6.3 Clustering routine
4 Results
4.1 mqapRNA
4.1.1 Implementation of mqapRNA
4.1.2 Performance of mqapRNA
4.1.3 mqapRNA web server: quality prediction with optional restraints
4.2 EvoClustRNA
4.2.1 Implementation of EvoClustRNA
4.2.2 Blind predictions with EvoClustRNA in the RNA-Puzzles
4.2.3 Performance of EvoClustRNA
4.3 rna-pdb-tools
5 Discussion
5.1 mqapRNA
5.1.1 Similar tools or approaches
5.2 EvoClustRNA
5.2.1 Similar tools or approaches
5.3 rna-pdb-tools
5.3.1 Future directions
5.4 Potential limitations of the RNA 3D structure prediction methods
5.4.1 RNA-ligand interactions
5.4.2 Non-canonical interactions
5.4.3 Loop modeling
5.4.4 Sampling of conformational space
6 Conclusions
7 Supplementary data
S1. List of all the sequences and secondary structures used in the benchmark of EvoClustRNA and a list of links to the SimRNAweb predictions
Table of Figures
Table of Tables
Reference
1 Introduction
1.1 Ribonucleic acid (RNA)
Ribonucleic acid (RNA) is one of the key types of molecules that are essential for the
functioning of living cells. It is involved in a number of highly important biological processes,
serving catalytic, scaffolding and structural functions. With the discovery that RNAs can
catalyze reactions, the view of RNA as a mere information-transfer molecule changed
dramatically. Such catalytic RNAs are called ribozymes, and for their discovery
Sidney Altman and Thomas Cech received the Nobel Prize in 1989. Today we know that
RNAs not only serve as an intermediary between DNA and proteins, but are also able to
catalyze reactions and are involved in a variety of processes in cells, such as
translation, transcription, gene expression and more!
The more we learn about RNA molecules, the more ways we discover for their potential
use in medicine, biotechnology, and basic science. For example, riboswitches are a unique
feature of bacteria with a great diversity and distribution (McCown et al. 2017), and therefore
have become a promising target for antibacterial treatments. Fluorescent riboswitches,
combined with “interchangeable” aptamer domains that can bind various ligands, are
becoming an important tool in basic science for monitoring metabolites in living cells
(Kellenberger et al. 2015; Strack et al. 2013). This can lead to a revolution similar to the
one that followed the discovery of the GFP protein (Nobel Prize in 2008). MicroRNAs are used
in medicine for new therapies and in molecular biology to silence genes of interest (Hayes et
al. 2014). Scientists are investigating CRISPR-Cas9 (Pennisi 2013), a prokaryotic immune
system, as a tool for genome editing. It has also been shown that long noncoding RNAs are
involved in cancer development (Li et al. 2017; Xu et al. 2017). Many antibiotics, e.g.,
aminoglycoside antibiotics (Kulik et al. 2015), bind to ribosomal RNA and disable bacterial
protein synthesis. Alas, we do not yet know the function of the newly discovered circular RNAs
(Szabo and Salzman 2016). Because of its ability to self-assemble (Chworos et al. 2004), RNA
seems to be ideal for creating nanorobots - biodevices that can be programmed, for
example, to detect microRNAs (Aw et al. 2016) related to human diseases in the blood, or to
regulate gene expression (Berens et al. 2015), and much more.
When investigating the wide range of functions mentioned above, we must always consider
that in order to perform any function, an RNA molecule must fold into a specific structure.
1.2 RNA structure
The structure of RNA is hierarchical, which means that we can distinguish levels of
organization: (1) a primary structure (the nucleic acid sequence), (2) a secondary structure
(canonical interactions between the bases in an RNA chain), and (3) a tertiary structure
(arrangement of secondary structure elements in three-dimensional space).
The first level of organization is the RNA sequence, the so-called RNA primary structure. The
primary structure is described as a chain of ribonucleotides. This chain is a linear polymer
whose units are linked by phosphodiester bonds. Each ribonucleotide (Fig. 1.2.1) consists of
a nucleoside (a ribose and a base) and a phosphate residue.
Figure 1.2.1: Ribonucleotide - a building block of RNA. Source (Wikimedia-Commons)
Four common ribonucleotides are the building blocks of RNA molecules. They differ in the
base connected to the ribose. These bases are the purines adenine (A)
and guanine (G), and the pyrimidines cytosine (C) and uracil (U).
At this level of organization RNA is very similar to DNA. However, there are very important
and biologically relevant differences. RNA has one extra oxygen atom, attached to the C2′
atom of the sugar as part of the 2′-hydroxyl group. This extra group makes RNA molecules more
reactive and prone to degradation. The second difference is the presence of uracil instead of
thymine (T). Like DNA, RNA has only four standard building blocks, which makes sequence
searches and sequence alignment far more difficult than for proteins, which consist of twenty
standard amino acids.
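The effect of alphabet size on sequence comparison can be illustrated with a back-of-the-envelope calculation. The sketch below assumes residues that are independent and uniformly distributed, which is a simplification introduced here for illustration and does not hold for real sequences; the function names are hypothetical.

```python
import math


def chance_identity(alphabet_size):
    """Expected fraction of identical positions between two random sequences."""
    return 1.0 / alphabet_size


def bits_per_position(alphabet_size):
    """Maximum information content of a single position, in bits."""
    return math.log2(alphabet_size)


for name, size in (("RNA", 4), ("protein", 20)):
    print(f"{name}: chance identity {chance_identity(size):.0%}, "
          f"{bits_per_position(size):.2f} bits per position")
```

Under these assumptions, two unrelated RNA sequences agree at about a quarter of their positions by chance, compared with about one in twenty for proteins, so a given level of sequence identity is weaker evidence of homology for RNA.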
Many RNAs found in nature contain additional residues beyond the standard ones. They are
generated by chemical modifications, which are introduced post-transcriptionally by different
enzymes. Modifications usually occur on the 2ʹ-OH group of the ribose moiety and/or on one
or more atoms of the base moiety. One of the most common modifications is
pseudouridine, in which uracil is linked to the ribose via a carbon-carbon bond instead of a
nitrogen-carbon bond. According to the MODOMICS database (Czerwoniec et al. 2009;
Dunin-Horkawicz et al. 2006; Machnicka et al. 2013), in September 2017 there were over
160 known modifications occurring in RNA. However, at the current stage of RNA structural
bioinformatics, all methods but ModeRNA (Rother et al. 2011a) neglect modified residues
and are designed to predict structures for standard residues only.
Another important difference between DNA and RNA is the tendency of RNA to fold
into complex three-dimensional (3D) structures. DNA molecules usually consist of two
strands coiled around each other to form a very well defined and very long double helix. By
contrast, RNA molecules tend to be relatively short, with a single strand folded into short
helices interspersed with loops, and their functional form requires an intricate, complex
structure.
We can distinguish two levels of this organization: secondary structure and tertiary structure.
1.2.1 RNA secondary structure
Nucleic acid bases can interact in various ways, including base-base stacking and
edge-to-edge pairing (canonical and non-canonical). While base stacking interactions provide
the key driving force for folding of an RNA molecule, the edge-to-edge pairing interactions,
mediated by hydrogen bonds, provide directionality and specificity (Leontis and Westhof
2001). Leontis and Westhof proposed a classification based on the observation that
planar edge-to-edge hydrogen-bonding interactions between RNA bases involve one
of three distinct edges: the Watson–Crick (W) edge, the Hoogsteen (H) edge, and the Sugar
(S) edge (Fig. 1.2.2A). Moreover, the bases in a pair can interact in either of two orientations
of the glycosidic bonds, cis or trans, relative to the hydrogen bonds (Fig.
1.2.2B). Three edges on each base and two orientations give twelve geometric base pair
families (Fig. 1.2.2C) and eighteen base pairing relations, the larger number being due to the
asymmetry of pairs formed by two different edges. In addition, bases can form triples, which
can also be classified and characterized (Abu Almakarem et al. 2012).
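The counts of twelve families and eighteen relations follow from simple combinatorics, which the short sketch below enumerates. It is only an illustration of the counting argument; the tuple representation and variable names are assumptions made here, not part of the Leontis/Westhof nomenclature.

```python
from itertools import combinations_with_replacement, product

EDGES = ("W", "H", "S")          # Watson-Crick, Hoogsteen, Sugar
ORIENTATIONS = ("cis", "trans")  # orientation of the glycosidic bonds

# Geometric families: the two interacting edges form an unordered pair,
# so W/H and H/W belong to the same family.
families = [
    (orientation, edge1, edge2)
    for orientation in ORIENTATIONS
    for edge1, edge2 in combinations_with_replacement(EDGES, 2)
]

# Pairing relations: the edges form an ordered pair, because it matters
# which of the two bases contributes which edge (W/H differs from H/W).
relations = [
    (orientation, edge1, edge2)
    for orientation in ORIENTATIONS
    for edge1, edge2 in product(EDGES, repeat=2)
]

print(len(families))   # 12
print(len(relations))  # 18
```

There are six unordered and nine ordered combinations of three edges, and each can occur in the cis or the trans orientation.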
Figure 1.2.2: Leontis/Westhof classification of base pairings. (A) RNA bases - adenine (A),
cytosine (C), guanine (G) and uracil (U) - interact through one of three distinct edges: the
Watson–Crick (W) edge, the Hoogsteen (H) edge, and the Sugar (S) edge. (B) Each pair of bases
can interact in either cis or trans orientation with respect to the glycosidic bonds. (C) For
these reasons, all base pairs can be grouped into twelve geometric base pair families and
eighteen pairing relationships (bases are represented as triangles). Each pair is represented
by a symbol that can be used in secondary structure and tertiary structure diagrams. Filled
symbols denote the cis base pair configuration, and open symbols the trans configuration. (D)
Interestingly, bases can form triples, which have their own classification devised by Leontis
and coworkers (Abu Almakarem et al. 2012) (Creative Commons License).
Canonical base pairs are G-C, connected by three hydrogen bonds, and A-U, connected by
two hydrogen bonds. These pairs are characterized by their isostericity (geometrical
equivalence), which gives rise to a regular A-form double helix, and allows each of the four
combinations of canonical pairs to substitute for each other, without distorting the 3D helical
structure.
The secondary structure is defined as a set of canonical interactions between the bases in an
RNA chain, while the tertiary structure is described by the positions of the atoms in
three-dimensional space. The fundamental elements of the secondary structure of RNA are
single-stranded fragments and paired fragments (helices). Depending on the structural context,
several types of unpaired fragments can be distinguished: (1) single-stranded fragments at the
5′ or 3′ ends of the RNA chain, (2) hairpin loops, occurring at the ends of double-stranded
fragments, (3) bulges of single nucleotides, (4) interior loops, consisting of several unpaired
nucleotides inside the helix, and (5) junctions, connecting several helices. RNA molecules
also form pseudoknots, a structural configuration in which one single-stranded region folds back
on itself and connects to another single-stranded region within a stem.
The number of possible RNA secondary structures increases exponentially with the length of
the sequence. Although secondary structure prediction is of key importance, it remains an unsolved
problem in the structural biology of RNA. The earliest algorithms used dynamic programming to search for the
secondary structure with the lowest free energy, taking into account the hydrogen bonding
energy of the canonical base pairs (Nussinov et al. 1978). Later generations of methods took
into account the energy of base stacking (Zuker and Stiegler 1981) and the possibility of
forming pseudoknots (Rivas and Eddy 1999). The CompaRNA web server (Puton et al. 2013)
provides a continuous benchmark of automated standalone and web server methods for RNA
secondary structure prediction, and collects predictions of over 40 tools. This server was
published in 2013, so one should expect that even more methods for RNA
secondary structure prediction are available today.
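To illustrate the dynamic programming idea behind the earliest algorithms, the following minimal Python sketch maximizes the number of nested canonical base pairs, in the spirit of the recursion of Nussinov and coworkers. It is a simplification and not the published algorithm: every pair scores one unit instead of an energy, G-U wobble pairs are ignored, and the function name max_base_pairs and the min_loop parameter are assumptions made for this example.

```python
def max_base_pairs(seq, min_loop=3):
    """Maximum number of nested canonical pairs (Nussinov-style recursion).

    min_loop is the minimum number of unpaired nucleotides enclosed by a pair
    (an assumed value, used here to forbid sterically impossible hairpins).
    """
    pairs = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
    n = len(seq)
    if n == 0:
        return 0
    # best[i][j] = maximum number of pairs within seq[i..j] (inclusive)
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            # Case 1: nucleotide j stays unpaired
            score = best[i][j - 1]
            # Case 2: nucleotide j pairs with some k, leaving >= min_loop bases between them
            for k in range(i, j - min_loop):
                if (seq[k], seq[j]) in pairs:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]


print(max_base_pairs("GGGAAACCC"))
```

The table is filled for increasingly long subsequences, so each entry reuses optimal solutions of shorter fragments; this is also why pseudoknots, which break the nesting assumption, require more elaborate recursions.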
Elements of secondary structure can fold to create more complex tertiary shapes.
1.2.2 RNA tertiary structure
The tertiary structure of RNA is formed by an appropriate spatial arrangement of secondary
structure elements. Its formation depends on long-range interactions formed between
single-stranded regions. The phosphate groups of RNA are negatively charged, making RNA
a charged molecule. For this reason, monovalent and divalent metal cations, including K+ and
Mg2+, which neutralize the negative charge, play an important role in RNA folding. It is
important to mention that most computational methods for RNA tertiary structure prediction
neglect the presence of ions and do not model them explicitly.
In nature, RNAs form complicated molecular 3D architectures. Some RNAs can perform
their function only when folded into a particular shape. By studying the spatial structure of
RNA, we can try to understand the mechanism of action of a particular type of RNA. To
determine the spatial structure of RNA, researchers can use experimental techniques, such as
biocrystallography or nuclear magnetic resonance (NMR) spectroscopy. However,
experimental techniques are laborious, expensive, and require specialized equipment. An
alternative to experimental techniques is computer modeling. Although
computer modeling methods are not as accurate as the experiments mentioned above, they can be
successfully used to investigate the function and mechanism of action of RNA molecules
(Kladwang et al. 2012). Therefore, there is a need for computational methods that can
provide reliable models of RNA structures efficiently and cheaply, using only information on
the nucleotide sequence. The goal of computational structural bioinformatics is not to replace
experimental techniques, but to complement them, especially when the answers to scientific
questions are beyond their reach. Unfortunately, although computational methods
are being continuously improved, they do not always predict the correct structures of RNA.
A comparison of an example secondary structure and the corresponding tertiary structure of a
self-cleaving ribozyme (the Pistol ribozyme) is shown in Figure 1.2.3.
Figure 1.2.3: Comparison of an example secondary (A) and the corresponding tertiary structure
(B) of the Pistol ribozyme (PDB code: 5K7C (Ren et al. 2016)). This ribozyme adopts a
compact tertiary architecture stabilized by an embedded pseudoknot fold (violet) and is
composed of three helical regions: P1 (green), P2 (blue), and P3 (orange). It is a self-cleaving
ribozyme that is widely distributed in nature (Jimenez et al. 2015). The cleavage site is
marked in yellow. The secondary structure diagram was generated with VARNA (Darty et al.
2009), and the tertiary structure figure was generated with PyMOL (DeLano 2002).
1.2.2.1 RNA tertiary structure computational prediction
Secondary structure determination (or prediction) is often the starting point for the
determination of the spatial (3D) structure of RNA. Programs for predicting the tertiary structure of RNA
generally fall into two categories: (1) methods based on the laws of physics, and (2) methods
that extrapolate knowledge from experimentally solved structures.
The first approach is based on Anfinsen's hypothesis (Anfinsen 1973), formulated in 1973 for
proteins and later adapted to other macromolecules, including RNA. According to Anfinsen,
under the environmental conditions at which folding occurs, the native structure is a unique,
stable and kinetically accessible minimum of the free energy. Since accurate quantum-chemical
calculations of the free energy derived directly from the Schrödinger equation are
very costly, many approximations are used. The potential energy function of the
system (i.e., the force field) is written as a sum of several terms, accounting for
the geometry of covalent bonds or the distances between atoms, and parameterized using
quantum-chemical calculations or experimental measurements. The most popular force fields
used for simulating biomolecular systems are Amber (Case et al. 2005) and CHARMM
(Brooks et al. 2009). However, their computational cost prevents Molecular Dynamics simulations of
whole macromolecular structures, and they are usually used only to optimize the geometry of
a model obtained by other methods. Force field methods are also used to simulate short
processes, such as ligand binding, and to investigate the stability of RNA fragments. DMD
(Discrete Molecular Dynamics) (Ding et al. 2008) is a program that uses discrete molecular
dynamics and a mostly physics-based energy function. To make physics-based calculations
feasible, the RNA molecule is reduced to a coarse-grained representation.
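The idea of a force field as a sum of simple terms can be sketched in a few lines of Python. The example below is a toy illustration only: it combines a harmonic bond term with a 12-6 Lennard-Jones term, all parameter values (r0, k, epsilon, sigma) are hypothetical, and real force fields such as Amber or CHARMM contain further terms (angles, dihedrals, electrostatics) with carefully fitted parameters.

```python
import math


def bond_energy(r, r0=1.6, k=300.0):
    """Harmonic energy of one covalent bond (r0 and k are hypothetical values)."""
    return 0.5 * k * (r - r0) ** 2


def lennard_jones(r, epsilon=0.2, sigma=3.4):
    """12-6 Lennard-Jones energy of one non-bonded pair (hypothetical parameters)."""
    s6 = (sigma / r) ** 6
    return 4.0 * epsilon * (s6 * s6 - s6)


def total_energy(coords, bonds):
    """Sum bonded terms over listed bonds and non-bonded terms over all other pairs."""
    bonded = {frozenset(b) for b in bonds}
    energy = sum(bond_energy(math.dist(coords[i], coords[j])) for i, j in bonds)
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            if frozenset((i, j)) not in bonded:
                energy += lennard_jones(math.dist(coords[i], coords[j]))
    return energy


# Three pseudo-atoms; only atoms 0 and 1 are covalently bonded.
coords = [(0.0, 0.0, 0.0), (1.6, 0.0, 0.0), (0.0, 3.4, 0.0)]
bonds = [(0, 1)]
print(total_energy(coords, bonds))
```

The double loop over all pairs shows why the cost grows quickly with system size, which is one motivation for the coarse-grained representations discussed above.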
The second group of methods is based on extrapolating knowledge from solved structures. For
some methods in this group, further simplifications are used, such as grouping atoms
into single pseudo-atoms. In programs that use a coarse-grained representation of a
molecule, the energy function is derived from solved molecular structures, with the aim of
obtaining a model that imitates the laws of RNA folding. The effectiveness of this approach in the
prediction of RNA 3D models has been documented for several large RNA molecules
modeled with constraints on the secondary structure and tertiary interactions (Jonikas et al.
2009; Miao et al. 2017). NAST (Jonikas et al. 2009), ERNWIN (Kerpedjiev et al. 2015) and
SimRNA (Boniecki et al. 2016) are good examples of state-of-the-art programs that use a
coarse-grained approach for RNA folding simulations. The methods that
reuse fragments of already solved structures include (1) fragment assembly methods, (2)
homology modeling (comparative modeling), and (3) manual building of structures from
fragments. In the first approach, structural motifs are found in a database of known structures,
and these fragments are assembled in accordance with the predicted topology
of the whole molecule. Each assembly is scored by an evaluation
function, and a final prediction is generated, or the process is repeated iteratively. Examples
of such methods are RNAComposer (Popenda et al. 2012), MC-Sym|MC-Fold (Parisien and
Major 2008), FARNA (Das and Baker 2007), and 3dRNA. Unlike fragment assembly,
comparative modeling methods, such as ModeRNA (Rother et al. 2011a; Rother et al.
2011b), RNA123 (Eriksson et al. 2014), and MacroMoleculeBuilder (Flores et al. 2010), require
a homologous structure of the RNA molecule to be specified, together with an alignment of
the corresponding sequences. Another subgroup comprises tools that can be used for manual structure
building, such as ERNA-3D (Zwieb and Müller 1997), MANIP (Massire and Westhof 1998),
Assemble (Jossinet et al. 2010), RNA2D3D (Martinez et al. 2008), S2S (Jossinet and
Westhof 2005), and the Nucleic Acid Builder program (NAB) (Macke and Case 2009). They have
been used with great success for modeling, for example, the architecture of group I catalytic
introns (Michel and Westhof 1990) and tmRNA (Burks et al. 2005). However, this thesis focuses
only on automated predictive methods (Table 1.2.1).
Protocols for RNA 3D structure prediction using template-based modeling with ModeRNA,
template-free folding with SimRNA, and refinement with QRNAS are described in detail in
(Piatkowski et al. 2016).
Table 1.2.1: Computational methods for RNA 3D structure prediction, based on (Magnus et al. 2014).

Type: Folding simulation
Method name: DMD (Discrete Molecular Dynamics)
Description: Coarse-grained simulation method that uses discrete molecular dynamics and a mostly physics-based energy function
Representation: Coarse-grained (3 centers/residue)
Probing of conformations: Discrete molecular dynamics

Type: Folding simulation
Method name: SimRNA
Description: Coarse-grained simulation method that uses a Monte Carlo sampling method and a knowledge-based energy function
Representation: Coarse-grained (5 centers/residue)
Probing of conformations: Monte Carlo

Type: Folding simulation
Method name: NAST (The Nucleic Acid Simulation Tool)
Description: Very coarse-grained simulation method that uses molecular dynamics and relies almost completely on restraints supplied by a user
Representation: Coarse-grained (1 center/residue)
Probing of conformations: Molecular dynamics

Type: Fragment assembly
Method name: FARNA (Fragment Assembly of RNA) / FARFAR (Fragment Assembly of RNA with Full Atom Refinement)
Description: Adaptation of the ROSETTA method for RNA structure prediction; assembles the structure from short single-stranded fragments using a Monte Carlo procedure and a hybrid physics/statistics-based scoring function, followed by full-atom refinement with a physics-based function
Representation: Full-atom
Probing of conformations: Monte Carlo

Type: Fragment assembly
Method name: MC-Fold|MC-Sym
Description: A method that assembles RNA structures from nucleotide cyclic motifs (NCM), with the sampling defined as a constraint satisfaction problem, and evaluates the resulting conformations with a hybrid physics/statistics-based scoring function
Representation: Full-atom
Probing of conformations: Constraint satisfaction problem

Type: Fragment assembly
Method name: RNAComposer
Description: A method that can assemble large RNA structures from fragments taken from RNA FRABASE, using user-defined restraints, based on the machine translation principle
Representation: Full-atom
Probing of conformations: Machine translation workflow
For the projects covered in this thesis, two RNA 3D structure prediction methods were used:
SimRNA, developed by Dr. Michał Boniecki and colleagues in the laboratory of professor
Janusz Bujnicki, and FARNA (an extension of ROSETTA), developed by professor Rhiju Das
and colleagues, first in the laboratory of professor Baker and later in his own group. These
methods are described briefly below.
Michał Boniecki, Janusz Bujnicki and colleagues at the International Institute of Molecular
and Cell Biology in Warsaw developed SimRNA, a method for RNA folding simulations and
3D structure prediction that uses a coarse-grained representation of five atoms per residue
and a statistical potential. The method predicts RNA 3D structure from
sequence alone and, if available, can use additional structural information in the form of
secondary structure restraints or distance restraints that define the local arrangement of certain
atoms. Moreover, the method can start the simulation from a 3D structure provided in a
PDB file. The energy function is based entirely on statistics derived from databases of known
structures. The conformational space is sampled with a Monte Carlo algorithm. SimRNA is
available as a standalone package that requires the user to have some computer skills and a
powerful computer: a typical simulation (of a sequence of ~70 residues) takes around 6 hours on
an 80-core machine. To help biologists with no bioinformatics background use SimRNA, a
web server, SimRNAweb, was implemented (Magnus et al. 2016). The web service
simplifies the steps of the standalone package, does not require the user to supply computing
power and memory, provides a simple interface, and displays the progress of the
simulation in real time. This makes the approach available to users who are not
necessarily experts in RNA structure and do not have access to state-of-the-art 3D
molecular modeling facilities, but who need a model of the RNA 3D structure, for instance
to design biochemical experiments, or who want to observe the conformational changes of
the RNA as it folds.
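The principle of Monte Carlo sampling guided by an energy function can be illustrated with a generic Metropolis scheme. The Python sketch below is not the SimRNA implementation: the function names, the one-dimensional toy energy, the move set and the temperature are all assumptions chosen to keep the example self-contained.

```python
import math
import random


def metropolis(energy, propose, state, steps, temperature, seed=0):
    """Generic Metropolis sampling; returns the lowest-energy state visited."""
    rng = random.Random(seed)
    current_e = energy(state)
    best_state, best_e = state, current_e
    for _ in range(steps):
        candidate = propose(state, rng)
        candidate_e = energy(candidate)
        delta = candidate_e - current_e
        # Downhill moves are always accepted; uphill moves are accepted
        # with the Boltzmann probability exp(-delta / temperature).
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            state, current_e = candidate, candidate_e
            if current_e < best_e:
                best_state, best_e = state, current_e
    return best_state, best_e


def toy_energy(x):
    """Hypothetical one-dimensional energy landscape with a minimum at x = 2."""
    return (x - 2.0) ** 2


def toy_move(x, rng):
    """Hypothetical move: a small random displacement of the coordinate."""
    return x + rng.uniform(-0.5, 0.5)


best_x, best_e = metropolis(toy_energy, toy_move, state=10.0, steps=5000, temperature=0.1)
print(best_x, best_e)
```

In an RNA folding simulation the state would be a set of coarse-grained coordinates, the moves would perturb the positions of residues, and the energy would be the statistical potential; occasional acceptance of uphill moves lets the simulation escape local minima.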
Rhiju Das, David Baker and colleagues at the University of Washington developed the
Fragment Assembly of RNA (FARNA) tool based on the Rosetta protein modeling suite
(Leaver-Fay et al. 2011). The program uses a simplified representation of the RNA model, in which each
nucleotide is represented by one pseudo-atom. The method predicts the tertiary
structure by assembling short 3-residue fragments, sampled with a Monte Carlo algorithm
guided by a knowledge-based energy function. The method was upgraded in 2010 by Das and his
team by adding extensive new energy terms to the force field. The new
method is called Fragment Assembly of RNA with Full-Atom Refinement (FARFAR).
FARFAR defines terms for hydrogen bonding between bases and backbone oxygen atoms
and, importantly, includes information about hydrogen bonds involving the hydroxyl O2′
group (which distinguishes RNA from DNA). It also includes an energy term for
C-H...O contacts, which contribute to the conformational preferences of the nucleotides and
participate in the formation of some non-Watson–Crick base pairs. A description of how to use
FARNA/FARFAR can be found in (Cheng et al. 2015). For short RNA fragments (up to 32
nucleotides), Rosetta can be accessed via the Rosetta Online Server That Includes Everyone
(ROSIE) (Moretti et al. 2017).
1.2.2.2 Model quality assessment programs
Computer modeling typically yields a set of alternative models of an RNA
structure. How can one choose the model that is closest to the real one? This predictive task,
called “model evaluation”, “quality assessment”, or “scoring”, is a crucial step in RNA 3D
structure prediction.
Programs that try to solve this problem are called MQAPs (Model Quality Assessment
Programs). MQAPs analyze structural models and calculate a quality score for each of them,
which often aims at predicting the global and/or local accuracy of the model, as compared to
the “real” structure, which is typically not available in real-life cases. In addition, MQAPs
can provide the user with a list of errors for a given model, reporting, for example,
chain breaks, incorrect rotamers, unusual bond lengths, and steric conflicts. It is worth
mentioning that the methods we would today call MQAPs were initially not used to evaluate
theoretical models resulting from computer modeling. The very first MQAPs were used to
detect errors in structures determined by X-ray crystallography. The first
crystallographic structures, due to low resolution and the difficulty of tracing the protein
chain in the density map, often contained serious errors; e.g., in the first crystallographic
model of the small subunit of the Rubisco protein, the polypeptide chain was traced in the opposite
direction to that in the true structure (Chapman et al. 1988). The best-known
model evaluation methods of this kind, the so-called “stereochemistry MQAPs”, are PROCHECK
(Laskowski et al. 1993) and WHATCHECK (Dunbrack 2004).
The second group comprises MQAPs that are knowledge-based statistical potentials. An
example of these programs is Verify3D (Eisenberg et al. 1997). In this approach, a
statistical potential is first developed based on a database of experimentally solved
structures. Next, for each analyzed structural model, the statistical potential returns a
quality score that reflects how often the given structural features occur in the
database. Proteins with rare structural features receive a poor quality score. MQAPs based on
statistical potentials are less sensitive to small errors and can be used to evaluate the quality
of theoretical models. With the development of structural bioinformatics, the need for such
methods has increased.
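A common way to turn feature frequencies into a score is the inverse Boltzmann relation, in which the energy of a feature is the negative logarithm of its observed frequency relative to a reference frequency. The Python sketch below illustrates this idea under strong simplifications: it is not the Verify3D method, the feature labels ("stack", "pair", "clash") and the toy counts are invented for the example, and the pseudocount and the penalty for unseen features are assumed values.

```python
import math
from collections import Counter


def build_potential(observed_features, reference_features, pseudocount=1.0):
    """Inverse-Boltzmann energies (in units of kT) from feature frequencies."""
    obs = Counter(observed_features)
    ref = Counter(reference_features)
    keys = set(obs) | set(ref)
    n_obs = sum(obs.values()) + pseudocount * len(keys)
    n_ref = sum(ref.values()) + pseudocount * len(keys)
    potential = {}
    for key in keys:
        p_obs = (obs[key] + pseudocount) / n_obs
        p_ref = (ref[key] + pseudocount) / n_ref
        # Features seen more often than expected get a negative (favorable) energy.
        potential[key] = -math.log(p_obs / p_ref)
    return potential


def score_model(features, potential, unseen_penalty=2.0):
    """Sum the energies of all features found in a model (lower is better)."""
    return sum(potential.get(f, unseen_penalty) for f in features)


# Toy "database" of features from solved structures and a uniform reference.
observed = ["stack"] * 8 + ["pair"] * 6 + ["clash"] * 1
reference = ["stack"] * 5 + ["pair"] * 5 + ["clash"] * 5
potential = build_potential(observed, reference)

good_model = ["stack", "pair", "stack"]
bad_model = ["clash", "clash", "pair"]
print(score_model(good_model, potential), score_model(bad_model, potential))
```

Features that are rare in the database, such as the hypothetical "clash" above, receive a positive energy, so models containing them obtain a worse total score.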
Another group of programs is called, “clustering MQAPs”. These programs need multiple
alternative models (often tens or even thousands) to predict scores for them. The quality
assessment reflects the average similarity of the structural features of a given model to the
rest of the models in the analyzed pool. Models with overrepresented structural features are
rated as most likely to be most accurate. On the other hand, models with structural features
occurring only in a small number of other models get poor quality rating. Examples of such
programs include Pcons (Lundström et al. 2008), ModFOLDclust (McGuffin 2008) and 3D-
Jury (Ginalski et al. 2003).
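The consensus principle used by clustering MQAPs can be sketched as follows. In this illustrative Python example, which does not reproduce Pcons, ModFOLDclust or 3D-Jury, each model is reduced to a set of base pairs, similarity is measured with the Jaccard index, and a model's score is its average similarity to all other models in the pool. The model names and base pair sets are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two feature sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def consensus_scores(models):
    """Score each model by its average similarity to the rest of the pool."""
    scores = {}
    for name, features in models.items():
        others = [f for other, f in models.items() if other != name]
        if others:
            scores[name] = sum(jaccard(features, f) for f in others) / len(others)
        else:
            scores[name] = 0.0
    return scores


# Hypothetical pool: each model is described by its set of base pairs (i, j).
models = {
    "model_1": {(1, 20), (2, 19), (3, 18)},
    "model_2": {(1, 20), (2, 19), (4, 17)},
    "model_3": {(1, 20), (2, 19), (3, 18), (4, 17)},
    "model_4": {(5, 12), (6, 11)},
}
scores = consensus_scores(models)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

The model that shares the most features with the rest of the pool is ranked first, while the outlier with no shared base pairs is ranked last; the approach therefore works only if correct features are the ones that recur across the pool.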
The last group of MQAPs comprises methods based on so-called “meta-predictors”. These programs
use statistical models to interpret scores calculated by third-party tools, called primary
predictors. Meta-predictors are based on machine learning methods, such as support vector machines,
linear regression, neural networks, or, more recently, deep neural networks. Examples of such programs are
QA-ModFOLD (McGuffin 2008) and Meta-MQAP (Pawlowski et al. 2008), developed in the
laboratory of professor Janusz Bujnicki. mqapRNA, a program described in this
study, is an attempt to bring the principle underlying Meta-MQAP to RNA structural
bioinformatics.
MQAPs can also be divided according to the type of quality assessment they compute: global or
local. Global MQAPs calculate a single quality value for a structural model. In contrast, local
MQAPs also assess the local quality of a model and can be used to detect parts of a model
that need refolding or further minimization.
For proteins, many of the approaches mentioned above have turned out to be very effective for scoring
models (Kryshtafovych et al. 2017). In contrast, quality assessment of RNA models is
at a very early stage. However, several attempts have been made recently toward the
development of statistical potentials for quality assessment of RNA 3D models, e.g.,
Ribonucleic Acids Statistical Potential (RASP) (Capriotti et al. 2011), RNA KB potential
(Bernauer et al. 2011), 3dRNAscore (Wang et al. 2015), and εSCORE (Bottaro et al. 2014).
RASP is a statistical potential that is derived from a non-redundant set of 85 RNA structures.
The method is based on geometrical descriptors that explicitly account for base pairing and
base stacking interactions, and it includes a representation of local and non-local interactions
in RNA structures. In addition, the method is capable of a local quality assessment. The total
RASP score is the sum of the individual scores of all interactions found within an RNA
molecule. The method is easy to install and to use. Moreover, it can also be used via a web
server http://melolab.org/webrasp/home.php (Norambuena et al. 2013).
RNA KB includes two fully differentiable knowledge-based potentials, a coarse-grained one
and an all-atom one. The potentials were derived from a curated dataset of RNA structures.
Based on the distances observed in this dataset, a potential of mean force was
built, as described previously for proteins (Lu and Skolnick 2001). RNA KB potentials
implicitly incorporate all base interactions into distance-based potentials. The tool is quite
hard to use and requires a basic knowledge of Molecular Dynamics. RNA KB is distributed
as a force field that can be used with the Molecular Dynamics implementation in the GROMACS
package.
3dRNAscore is a knowledge-based potential that combines distance-dependent and
dihedral-dependent energies. The functional form of 3dRNAscore was derived from the
Boltzmann distribution and contains two energy terms: the distance-dependent energy and
the backbone dihedral-dependent energy. The parameters of the scoring function were
obtained from a training set of non-redundant RNA tertiary structures.
εSCORE employs a coarse-grained representation (one bead per nucleotide) and is not
sequence dependent. εSCORE describes an RNA structure as a collection of vectors that
represent base-base and stacking interactions. The method was trained on the crystal
structure of the H. marismortui large ribosomal subunit. εSCORE is not only a scoring
function, but also a metric that can be used to compare two RNA structures. The software
for performing the calculations is freely available as a part of the Barnaba package.
Scoring methods can also be useful if we have a pool of relatively good quality models. Since
predicting the structure of large (>70 nucleotides) RNAs remains a challenging task (Laing and
Schlick 2010; Miao et al. 2017), experimental data can be used to increase the accuracy of RNA
structure prediction by bioinformatics tools, at both the secondary and the tertiary
level.
1.3 RNA structure prediction with low-resolution experimental data
Experimental techniques for RNA secondary structure determination typically use
chemical or enzymatic probing and can be applied either in vitro or in vivo (Table 1.3.1). The
main principle is that the chemical reagents and nucleases used for this type of analysis interact
differentially with paired and unpaired nucleotide residues, e.g., ribonuclease (RNase) V1 is
reactive toward residues in double-stranded RNA, and nuclease S1 is reactive toward
single-stranded regions. The use of base-selective chemical reagents (DMS, kethoxal, CMCT; see
Table 1.3.1) provides structural information about the base stacking, hydrogen bonding, and
electrostatic environment adjacent to the base. Local nucleotide flexibility and dynamics can
be inferred from experiments that interrogate all four RNA nucleotides. For instance, the
selective 2′-hydroxyl acylation analyzed by primer extension (SHAPE) technique uses
hydroxyl-selective electrophiles that react with the 2′-hydroxyl group at flexible or disordered
nucleotides (Merino et al. 2005). The in-line probing method does not require the use of any
chemical reagents but exploits the natural instability of RNA molecules. The RNA is incubated at
slightly alkaline pH, and the spontaneous cleavage of the sugar-phosphate backbone by adjacent
2′-hydroxyl groups, which reflects the local nucleotide flexibility, is monitored (Nahvi and
Green 2013). Although there is a clear correlation between the local reactivities of RNA
molecules and base pairing probabilities, incorporating the probing data
into a computational modeling procedure is not straightforward. The difficulty originates from
the fact that reactivities depend on the structural context and are influenced by tertiary
contacts (Washietl et al. 2012). Thus, computational methods have been adapted either to
transform the reactivities into discrete states (paired or unpaired) or to calibrate the
interaction energy term proportionally to the reactivities (Mathews et al. 2004). There have
been attempts to integrate Molecular Dynamics simulations with SHAPE reactivities
(Kirmizialtin et al. 2015) and in-line probing experiments (Mlynsky and Bussi 2017).
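The two strategies mentioned above, discretizing reactivities and scaling an energy term with them, can be sketched in Python as follows. This is a hedged illustration and not the procedure of any specific program: the thresholds (0.3 and 0.7), the state symbols and the linear penalty with its scale parameter are assumptions, and normalized reactivities in the range 0 to 1 are presumed.

```python
def reactivities_to_states(reactivities, low=0.3, high=0.7):
    """Convert normalized reactivities into discrete per-nucleotide states.

    'u' = treated as unpaired, 'p' = treated as paired, '?' = no constraint.
    The thresholds are hypothetical; None marks positions without data.
    """
    states = []
    for r in reactivities:
        if r is None:
            states.append("?")
        elif r >= high:
            states.append("u")
        elif r <= low:
            states.append("p")
        else:
            states.append("?")
    return "".join(states)


def pairing_penalty(reactivity, scale=1.0):
    """Energy penalty for pairing a nucleotide, proportional to its reactivity."""
    return 0.0 if reactivity is None else scale * reactivity


profile = [0.05, 0.9, 0.5, None, 0.7, 0.3]
print(reactivities_to_states(profile))
print([pairing_penalty(r) for r in profile])
```

Because a low reactivity may also result from a tertiary contact rather than from base pairing, the "p" state in such a scheme should be read as a soft hint; the intermediate "?" class leaves ambiguous positions unconstrained.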
Experimental techniques can also be used to detect non-local interactions, and the data they
generate can be processed into a list of distance (long-range) restraints. Distance restraints are
important for the modeling process, as even a small number of them is sufficient to reduce
the conformational space enough to allow accurate prediction of native RNA structures
(Lavender et al. 2010). The “mutate-and-map” strategy (Kladwang et al. 2011) is based on the
observation that when a paired nucleotide is mutated, its partner becomes more accessible to
the reagent, which can be readily detected by subsequent chemical probing (e.g., by SHAPE).
Importantly, this strategy can reveal not only pairings in the secondary structure, but also tertiary
contacts between sequentially distant fragments of the molecule. Multiplexed hydroxyl
radical cleavage analysis (MOHCA) (Das et al. 2008) is another technique that provides
information about long-range contacts. In this method, RNAs are created with randomly incorporated
nucleotides tethered to an Fe(II)–EDTA moiety, which can be used to induce through-space
cleavage of nearby residues in the RNA. The sites of cleavage and the location of the probe
nucleotide can be identified by two-dimensional gel electrophoresis. Other experimental methods
that are used to probe long-range contacts include UV- or chemically induced cross-linking,
site-directed cleavage, fluorescence resonance energy transfer (FRET) (Klostermeier and
Millar 2001), and electron spin resonance (ESR/EPR) (Qin and Dieckmann 2004). All of the
experimental techniques mentioned above can be used both to guide the prediction process and
to filter out the best predictions from a pool of RNA 3D models. mqapRNA allows for
filtering based on a set of distance restraints, and this aspect will be discussed in this thesis.
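Filtering a pool of models with distance restraints can be illustrated with a short Python sketch. It is not the mqapRNA implementation: the data layout (one representative coordinate per residue), the function names satisfied_fraction and filter_models, the upper-bound form of the restraints and the toy coordinates are all assumptions made for this example.

```python
import math


def satisfied_fraction(coords, restraints):
    """Fraction of restraints (i, j, max_distance) satisfied by one model.

    coords maps a residue number to the (x, y, z) position of one
    representative atom of that residue.
    """
    if not restraints:
        return 1.0
    satisfied = sum(
        1 for i, j, max_distance in restraints
        if math.dist(coords[i], coords[j]) <= max_distance
    )
    return satisfied / len(restraints)


def filter_models(models, restraints, min_fraction=1.0):
    """Keep the names of models that satisfy at least min_fraction of restraints."""
    return [
        name for name, coords in models.items()
        if satisfied_fraction(coords, restraints) >= min_fraction
    ]


# Two hypothetical models described by three residues each (distances in Å).
models = {
    "compact": {1: (0.0, 0.0, 0.0), 10: (5.0, 0.0, 0.0), 20: (0.0, 8.0, 0.0)},
    "extended": {1: (0.0, 0.0, 0.0), 10: (30.0, 0.0, 0.0), 20: (0.0, 8.0, 0.0)},
}
restraints = [(1, 10, 10.0), (1, 20, 10.0)]
print(filter_models(models, restraints))
```

Lowering min_fraction makes the filter tolerant of a few violated restraints, which may be useful when the experimental data are noisy.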
Table 1.3.1: Low-resolution experimental methods that generate particularly useful data for
computational prediction of RNA 3D structure, based on (Magnus et al. 2014). An accurate
secondary structure and/or distance restraints can be used with mqapRNA to refine the final
ranking.

Type of restraints: Secondary structure
Method: SHAPE (Selective 2′-Hydroxyl Acylation analyzed by Primer Extension)
Description: Method for quantitative detection of local nucleotide flexibility. 2′-OH in flexible, unpaired nucleotides reacts preferentially with a probing reagent, forming adducts that can be identified as stops to primer extension by reverse transcriptase.

Type of restraints: Secondary structure
Method: DMS (dimethyl sulfate) footprinting
Description: DMS reacts with adenine at N1 and cytosine at N3. Reactive cytosines and adenines can be detected by reverse transcription and are considered as unpaired.

Type of restraints: Secondary structure
Method: CMCT (1-cyclohexyl-(2-morpholinoethyl) carbodiimide metho-p-toluene sulfonate)
Description: CMCT reacts with N3 of uridine and, to a lesser extent, N1 of guanine. Reactive residues can be detected by reverse transcription and are considered as unpaired.

Type of restraints: Secondary structure
Method: Kethoxal
Description: Kethoxal specifically attacks accessible N1 and N2 of guanine, and it is used for detection of unpaired guanines. The modified sites can be detected by reverse transcription.

Type of restraints: Secondary structure + tertiary contacts
Method: Mutate-and-map
Description: SHAPE/DMS/CMCT chemical probing for a large number (preferably all) of point mutants of the RNA sequence. Analysis of changes in secondary structures of the set of point mutants can be used to infer tertiary contacts.

Type of restraints: Tertiary contacts
Method: MOHCA (multiplexed hydroxyl radical cleavage analysis)
Description: Enables the detection of pairs of contacting residues via random incorporation of radical cleavage agents. Contacting residues are detected from a cleavage pattern analyzed in two-dimensional gel electrophoresis.

Type of restraints: Tertiary contacts
Method: Cross-linking
Description: Based on the formation of covalent bonds between spatially close regions of RNA that may be distant in sequence. Can be achieved using physical factors such as UV light or by chemical reagents.

Type of restraints: Distances between labeled residues
Method: FRET (Förster Resonance Energy Transfer)
Description: Distances between fluorescent dyes linked to the RNA molecule are inferred from the intensity of energy transfer.

Type of restraints: Distances between labeled residues
Method: ESR/EPR (Electron Spin/Paramagnetic Resonance) spectroscopy
Description: Distances are derived from the measured spin–spin splittings for unpaired electrons localized on paramagnetic labels linked to the RNA molecule.
Experimental techniques are a great source of information that can be used for RNA 3D
structure prediction. However, we can also learn a lot about the structure through a thoughtful
analysis of RNA alignments.
1.4 RNA families
Just like proteins, RNAs can be grouped into families that have evolved from a common
ancestor. Sequences of RNAs from the same family can be aligned to each other to give a
multiple sequence alignment (MSA). The analysis of patterns of sequence conservation, or the
lack thereof, can be used to detect important conserved regions, e.g., regions that bind ligands,
active sites, or regions involved in other important functions.
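A very simple measure of sequence conservation is the fraction of sequences that share the most common residue in each alignment column. The Python sketch below illustrates it on a toy alignment; the function name column_conservation, the treatment of gaps and the four short sequences are assumptions made for this example, and dedicated tools use more elaborate measures (for instance, information content).

```python
from collections import Counter


def column_conservation(alignment):
    """Per-column fraction of sequences sharing the most common residue.

    Gaps ('-') never count as the conserved residue, but they do count in the
    denominator, so gapped columns are penalized.
    """
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for column in zip(*alignment):
        residues = [c for c in column if c != "-"]
        if not residues:
            scores.append(0.0)
            continue
        most_common_count = Counter(residues).most_common(1)[0][1]
        scores.append(most_common_count / len(column))
    return scores


# Toy alignment of four hypothetical homologous sequences.
alignment = ["GGCAU", "GGCUU", "GGC-U", "GACAU"]
scores = column_conservation(alignment)
conserved_columns = [i for i, s in enumerate(scores) if s == 1.0]
print(scores, conserved_columns)
```

Columns with a score of 1.0 are invariant across the family and are natural candidates for functionally important positions, whereas low-scoring columns indicate tolerated variation.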
An accurate RNA sequence alignment can improve secondary structure prediction.
According to the CompaRNA (Puton et al. 2013) continuous benchmarking platform,
methods that exploit RNA alignments, such as PETfold (Seemann et al. 2008), outperform
single-sequence predictive methods.
RNA alignments can also be used to improve tertiary structure prediction. Weinreb and coworkers
(Weinreb et al. 2016) adapted the maximum entropy model to RNA sequence alignments to
predict long-range contacts between residues for 180 RNA gene families. They applied the
information about predicted contacts to guide in silico simulations and observed a significant
improvement in the predictions for the five cases they investigated. mqapRNA, a method described in
this work, can process this type of restraint and use it for scoring
models. Another way to use RNA alignments is to take advantage of the observation that
members of the same family tend to fold into the same 3D shape (Fig. 1.4.1). RNA
alignments can be used to carry out independent folding simulations for a subset of the
homologous sequences in the MSA and then to identify the best models common to all
folded sequences via simultaneous clustering of the independent folding runs. This approach
was earlier implemented and benchmarked for proteins by Bonneau and coworkers (Bonneau
et al. 2001) and successfully applied to model in silico the tertiary structures of major protein
families (Bonneau et al. 2002). To the best of my knowledge, EvoClustRNA, developed in
this study, is the first attempt to use this approach for RNA 3D structure prediction.
Figure 1.4.1: RNA families tend to fold into the same 3D shape. Structures of the c-di-AMP
riboswitch solved independently by three groups: for two different sequences obtained from
Thermoanaerobacter pseudethanolicus (PDB id: 4QK8) and Thermovirga lienii (PDB id:
4QK9) (Gao and Serganov 2014), for a sequence from Thermoanaerobacter tengcongensis
(PDB id: 4QLM) (Ren and Patel 2014) and for a sequence from Bacillus subtilis (PDB id:
4W90) (the molecule in blue is a protein used to facilitate crystallization) (Jones and Ferré-
D'Amaré 2014). There is some variation between structures in the peripheral parts (marked
with red arrows), but the overall structure of the core is preserved.
Information about RNA families is collected in the Rfam database (Nawrocki et al. 2015).
Each RNA family is represented by multiple sequence alignments, consensus secondary
structures and covariance models (CMs). Another source of information about RNA families
is RNArchitecture (http://genesilico.pl/RNArchitecture/) (Boccaletto et al. 2017). It is a
database developed in the laboratory of professor Bujnicki, which provides a comprehensive
description of relationships between known families of structured RNAs (taken from Rfam),
with a focus on structural similarities. RNArchitecture includes 2688 families of which only
about 2.6% (70 families) have an experimentally solved structural model (Fig. 1.4.2). Thus, there is
a great need for fast and accurate methods for RNA structure determination, both
experimental and computational, to provide structural insights into these RNA molecules.
Figure 1.4.2: According to the RNArchitecture database, there are only 3% (70) Rfam
families with known experimentally solved structures, and 97% (2,618 families) without
known structures.
1.5 RNA-Puzzles
To track the progress of computational methods for RNA 3D structure prediction and how
close their models come to experimental structures, the RNA-Puzzles initiative was proposed
and implemented by Professor Eric Westhof
and coworkers. It is a collective experiment for blind RNA structure prediction modeled after
the well-established Critical Assessment of Techniques for Protein Structure Prediction
(CASP) experiment (Kryshtafovych et al. 2017). The organizers of RNA-Puzzles
receive from crystallographers an RNA sequence for which the structure has been solved in
their laboratories but is not yet publicly available. The sequence is sent out to groups
involved in the modeling of RNA structures all around the world. These groups have
approximately a month to apply available bioinformatics methods to model the structure for
the target sequence and to send relevant results to the organizers. The goal of the experiment
is to determine the capabilities and limitations of the current state of the art methods for 3D
RNA structure prediction based on sequence. This challenge is also an opportunity to
evaluate the progress that has been made in the RNA structure prediction methodologies, as
well as what has to be done to achieve better solutions. The initiative identifies specific
bottlenecks that may hold back the field and promotes the available methods, providing
guidance in the choice of suitable tools for real-world problems. For each target, a ranking of
models is prepared and can be sorted according to various criteria (Fig. 1.5.1), such as
root-mean-square deviation (RMSD) or interaction network fidelity (INF). In addition, each
submitted model has its own page where a JSmol (Hanson et al. 2013) visualization shows
the model superimposed on the native structure (Fig. 1.5.2). The rankings can be
found at http://ahsoka.u-strasbg.fr/rnapuzzlesv2/results/. Until now, twenty puzzles have been
set up and three publications, describing three rounds of the experiment, have been published
to summarize the results and discuss the progress in the field (Miao et al. 2017; Miao et al.
2015). To summarize them briefly, substantial progress has been made since the first round,
and some of the models reached near-atomic accuracy, as in the case of the twister-sister
ribozyme (RNA Puzzle 19 challenge) (Liu et al. 2017) or a Zika virus domain (RNA Puzzle 20
challenge) (Akiyama et al. 2016). A very important problem in RNA structure determination is
the prediction of non-Watson-Crick interactions, which are a key factor in RNA folding.
Another problem is that some of the submitted models have high Clash Scores.
Figure 1.5.1: The results of RNA Puzzle 13. The second model in the ranking (sorted
according to RMSD) is a model obtained with a prototype version of EvoClustRNA
developed at Stanford University. There is no single way to sort the models. Different
metrics have unique properties, and a researcher should decide what is useful for his/her
application. RMSD informs about the geometrical similarity between a prediction and the
crystallographic structure (the lower, the better). INF informs about the similarity of
interaction networks and ranges from 0 to 1 (the higher, the better). Several partial INFs can
be computed: INF WC (the canonical interactions only), INF NWC (the non-canonical
interactions only), INF stacking (the stacking interactions only). INF ALL takes into account
all the interactions mentioned above. This RNA-Puzzle shows one of the biggest problems in
RNA 3D structure prediction: very low INF NWC in all submissions, which means a lack
of accurate prediction of non-canonical interactions.
Figure 1.5.2: The detailed view of the results of the ZMP riboswitch (RNA Puzzle 13). For
each submitted model a detailed summary is available online that includes a superposition of
a prediction, in this case, the EvoClustRNA prediction (red), on the crystallographic structure
(green). Various metrics are shown in the result summary.
2 Aim of this work
The prediction of three-dimensional structures for complex RNAs remains a challenging task,
despite progress made recently by many researchers working in this field of science. The aim
of this work is to develop and benchmark three tools that make this process more feasible:
(1) mqapRNA – a model quality assessment tool for RNA 3D models,
(2) EvoClustRNA – a predictive method based on simulations of homologs,
(3) rna-pdb-tools – a toolbox for RNA structural bioinformatics.
mqapRNA is a new scoring method that uses a deep learning algorithm to provide
improved quality prediction for RNA structural models. To test the method, a collection of
datasets was prepared, and a benchmark was performed. EvoClustRNA is a clustering routine
for evolutionarily conserved regions (helical regions) used for RNA fold prediction.
rna-pdb-tools is a new package comprising a Python library and a set of over 50 tools to
facilitate the development of new applications and procedures in RNA structural bioinformatics.
3 Materials & Methods
3.1 Hardware
All calculations included in this work were performed on resources provided by the
International Institute of Molecular and Cell Biology: an HPC (High-Performance Computing)
cluster (hostname: Peyote2, operating system: Ubuntu 10.04.4 LTS), an Apple MacBook laptop
(macOS Sierra), and a virtual machine mqapRNA-vm (Ubuntu 14.04.3 LTS). The initial
version of EvoClustRNA was run at Stanford University on an HPC cluster (hostname: Biox3,
operating system: CentOS 6).
3.2 Software
All tools created in this study (mqapRNA, EvoClustRNA, and rna-pdb-tools) are written in
Python (version 2.7, http://www.python.org/). Python is a scripting language that supports
object-oriented programming and is open source and free to use. The Python libraries used in
the projects are as follows: multiprocessing, Pandas, pytest, argparse, pyflakes. The code follows best
practices described by Kristan Rother (Rother 2017), Robert Martin (Martin 2008), Andrew
Hunt & David Thomas (Hunt and Thomas 1999), e.g., short functions, extensive
documentation, automated testing, version control.
GNU Emacs (version 25.1, https://www.gnu.org/software/emacs/) is an extremely extensible
and customizable text editor that was used for this work in various areas: alignment
preparation, code editing, note-taking, and more. In addition to the standard installation of
GNU Emacs, the following extensions were used: magit, org-mode, markdown-mode,
python-mode.el, jedi, flycheck, yasnippet, projectile, sphinx-doc, RealGUD, autopep8,
Emacs Speaks Statistics (ESS). The configuration file can be found under
https://github.com/mmagnus/emacs-env. This editor was also used for modifying PDB files
with pdb-mode and for alignment preparation with RALEE (Griffiths-Jones 2005).
Git (version 2.11, https://git-scm.com/) was used to manage all the scientific code. Git is a
free and open source distributed version control system. “Distributed” means that there is no
(central) repository that each developer has to send his or her code to. Each copy of a
repository has its own history and can later be easily merged with the repository of a team or
another developer. To host the code online and make it available for everyone to download,
GitHub (online, https://github.com/) is used. All local changes to the programs are sent to
GitHub and can then be seen by users all around the world. GitHub is free of charge for open
source projects. The GitHub repositories of the projects can be found under the links:
https://github.com/mmagnus/EvoClustRNA and https://github.com/mmagnus/rna-pdb-tools.
A tutorial on how to start working with Git written by the author of this thesis can be found at
http://rna-pdb-tools.readthedocs.io/en/latest/git.html.
Documentation. All projects described in this work are documented using Python
docstrings in classes, modules, and functions in concert with Sphinx (version 1.6.3,
http://www.sphinx-doc.org/en/stable/). Sphinx is a free, open source, easy-to-use tool
that generates documentation in various formats, e.g., HTML, PDF, ePub. Sphinx is
run locally to generate documentation on local machines.
To be able to share the documentation publicly, Read the Docs (RTD, online,
https://readthedocs.org/) is used to provide a web interface for the documentation of the
projects. The Read the Docs servers track changes in the GitHub repositories. If there is a
change in the code or documentation, the RTD server is triggered, new documentation is
compiled and, after a few seconds, presented online. The RTD documentation can be found
under the links: http://EvoClustRNA.rtfd.io, http://rna-pdb-tools.rtfd.io. For all projects,
Google style docstrings (https://google.github.io/styleguide/pyguide.html) are used via
Napoleon. Napoleon is a Sphinx extension that enables Sphinx to parse Google style
docstrings, the style recommended by Khan Academy
(http://www.sphinx-doc.org/en/stable/ext/napoleon.html#type-annotations).
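As an illustration of the Google docstring layout that Napoleon parses, a minimal sketch is given below. The function count_residues and its behavior are hypothetical and are not taken from rna-pdb-tools; only the docstring sections (Args, Returns, Raises) are the point of the example.

```python
def count_residues(pdb_lines, chain_id="A"):
    """Count residues of one chain in PDB-formatted text.

    Hypothetical helper, used here only to illustrate Google style docstrings.

    Args:
        pdb_lines (list of str): lines of a PDB file.
        chain_id (str): one-character chain identifier.

    Returns:
        int: number of distinct residues found in ATOM records of the chain.

    Raises:
        ValueError: if chain_id is not a single character.
    """
    if len(chain_id) != 1:
        raise ValueError("chain_id must be a single character")
    residues = set()
    for line in pdb_lines:
        # PDB fixed columns (1-based): chain ID in column 22,
        # residue number in columns 23-26, insertion code in column 27.
        if line.startswith("ATOM") and len(line) > 26 and line[21] == chain_id:
            residues.add(line[22:27])
    return len(residues)
```

In a Sphinx conf.py, Napoleon is enabled by adding 'sphinx.ext.napoleon' to the extensions list, after which docstrings of this form are rendered by autodoc.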
3.3 Structure visualizations
Structure visualizations in 3D were generated with PyMOL (version 1.7.4 Edu Enhanced for
Mac OS X by Schrödinger) (DeLano 2002). VARNA (version 3.93) (Darty et al. 2009) is a
plug-in, written in Java, dedicated to drawing RNA secondary structures. It was used to
visualize the secondary structure of RNA in this work.
3.4 Databases
Protein Data Bank (Berman et al. 2000) (http://www.pdb.org/) is a database of
experimentally-determined structures of proteins, nucleic acids, and biomolecular complexes.
All experimentally solved structures used in this study were obtained from the Protein
Data Bank.
In the study, the Rfam database (http://rfam.xfam.org/) was also used (see Materials &
Methods 3.6.1).
3.5 Development of mqapRNA
3.5.1 Datasets
To train mqapRNA, two datasets were used: RASP and RNA KB.
The first dataset (RASP) was prepared by Capriotti and coworkers to develop the RASP method
(Capriotti et al. 2011). The dataset was obtained by generating a set of Gaussian restraints for
dihedral angles and atom distances from 85 native structures. For each native RNA structure,
a set of 500 decoy structures was built by randomly removing an increasing fraction of the
restraints generated from the native RNA structure. Each decoy was built using the MODELLER
computer program (Sali and Blundell 1993), using a subset of restraints as Gaussian
potentials. This dataset can be downloaded via http://melolab.org/supmat.html. Four
structures were too big (over 200 nucleotides) to be considered for the development of
mqapRNA, and, therefore, were removed from the dataset.
The second dataset was prepared to develop the RNA KB potential (Bernauer et al. 2011). The
dataset contains two subsets. The first one, RNA KB-Molecular Dynamics (MD), is based on
a set of Molecular Dynamics simulations in explicit solvent that generated structures
with RMSD values a few angstroms (typically 2 Å) away from the native structure. The
subset contains five sequences with 3500 distorted models per sequence. The second subset,
RNA KB-Normal Mode (NM), was generated by Normal Mode perturbation of the crystal
structures. The subset contains 15 sequences with 500 models per sequence. All the decoys
can be downloaded from http://csb.stanford.edu/rna, and further details are described in the
corresponding article.
In addition, a third dataset was prepared only for testing. It consists of all models submitted
to the RNA-Puzzles organizers (http://ahsoka.u-strasbg.fr/rnapuzzles/). First, all models were
manually inspected to detect discrepancies that cannot be resolved automatically with
rna-pdb-tools, e.g., inconsistent chain names. Second, all models were standardized with
rna-pdb-tools. This dataset is available at
https://github.com/mmagnus/RNA-Puzzles-Normalized-submissions, with detailed descriptions
of how the models were edited. This is a unique and valuable dataset that should be useful for
the community.
3.5.2 Primary methods
mqapRNA includes a Python interface consisting of wrappers around the primary methods. If a
given method also returns additional subscores, e.g., the energy of stacking, all subscores were
collected and used for the statistical model (Table 3.5.1). The primary predictors were divided
into three categories: (1) model quality methods: 3RNAscore (Wang and Xiao 2002), RASP
(Capriotti et al. 2011), RNAkb (Bernauer et al. 2011), εSCORE (Bottaro et al. 2014); (2)
RNA structure modeling methods: SimRNA (Boniecki et al. 2016) and Rosetta (in two
modes: low resolution (Das and Baker 2007), and full-atom high resolution (Das et al.
2010)); (3) other descriptors: a clash score calculator (using the Probe program from the
MolProbity suite (Adams et al. 2010)), a geometry correctness analyzer (using the Suitename
program from the MolProbity suite (Adams et al. 2010)), and the radius of gyration,
implemented in a Python script.
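The script used for the radius of gyration is not reproduced in this work, so the following is only a minimal sketch of how such a descriptor can be computed. It assumes that all atoms are weighted equally (no atomic masses) and that the coordinates have already been read from the ATOM records; the function name radius_of_gyration is hypothetical.

```python
import math


def radius_of_gyration(coords):
    """Return the unweighted radius of gyration of a list of (x, y, z) tuples."""
    n = float(len(coords))
    if n == 0:
        raise ValueError("no coordinates given")
    # Geometric center; equal weights for all atoms are an assumption of this sketch.
    cx = sum(c[0] for c in coords) / n
    cy = sum(c[1] for c in coords) / n
    cz = sum(c[2] for c in coords) / n
    # Mean squared distance of the atoms from the center.
    msd = sum((x - cx) ** 2 + (y - cy) ** 2 + (z - cz) ** 2
              for x, y, z in coords) / n
    return math.sqrt(msd)
```

A compact model yields a small value, and an extended model of the same length yields a larger one, which is why the descriptor can serve as a simple measure of compactness.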
Method: Subscores
1. SimRNA: simrna_steps, simrna_total_energy, simrna_base_base, simrna_short_stacking, simrna_base_backbone, simrna_local_geometry, simrna_bonds_dist_cp, simrna_bonds_dist_pc, simrna_flat_angles_cpc, simrna_flat_angles_pcp, simrna_tors_eta_theta, simrna_sphere_penalty, simrna_chain_energy
2a. Rosetta - coarse grained low resolution: farna_rna_vdw, farna_rna_base_backbone, farna_rna_backbone_backbone, farna_rna_repulsive, farna_rna_base_pair, farna_rna_base_axis, farna_rna_base_stagger, farna_rna_base_stack, farna_rna_base_stack_axis, farna_rna_rg, farna_atom_pair_constraint, farna_linear_chainbreak, farna_rna_data_backbone, farna_score_lowres
2b. Rosetta - full-atom high resolution: farna_fa_atr, farna_fa_rep, farna_fa_intra_rep, farna_lk_nonpolar, farna_fa_elec_rna_phos_phos, farna_ch_bond, farna_rna_torsion, farna_rna_sugar_close, farna_hbond_sr_bb_sc, farna_hbond_lr_bb_sc, farna_hbond_sc, farna_geom_sol, farna_atom_pair_constraint_hires, farna_linear_chainbreak_hires, farna_score_hires
3. RASP: rasp_c3_pdb_energy, rasp_c3_no_contacts, rasp_c3_norm_energy, rasp_c3_mean_energy, rasp_c3_sd_energy, rasp_c3_zscore, rasp_bb_pdb_energy, rasp_bb_no_contacts, rasp_bb_norm_energy, rasp_bb_mean_energy, rasp_bb_sd_energy, rasp_bb_zscore, rasp_bbr_pdb_energy, rasp_bbr_no_contacts, rasp_bbr_norm_energy, rasp_bbr_mean_energy, rasp_bbr_sd_energy, rasp_bbr_zscore, rasp_all_pdb_energy, rasp_all_no_contacts, rasp_all_norm_energy, rasp_all_mean_energy, rasp_all_sd_energy, rasp_all_zscore
4. RNA KB: rnakb_bond, rnakb_angle, rnakb_proper_dih, rnakb_improper_dih, rnakb_lj14, rnakb_coulomb14, rnakb_lj_sr, rnakb_coulomb_sr, rnakb_potential, rnakb_kinetic_en, rnakb_total_energy
5. 3RNAscore: x3rnascore
6. εSCORE: escore
7. Geometry Analysis: analyze_geometry
8. Clash Score: clash_score
Table 3.5.1: A list of subscores extracted from the primary methods used for training and
prediction with mqapRNA. For each analyzed structure, all these scores are provided in a
CSV output file, both in the standalone version and on the web server.
3.5.3 Secondary structure comparison
Secondary structure comparisons were calculated based on outputs of ClaRNA (Waleń et al.
2014) using the Interaction Network Fidelity (INF) value, which is computed as:
INF = \sqrt{ \frac{TP}{TP + FP} \times \frac{TP}{TP + FN} }
where TP is the number of correctly predicted base–base interactions, FP is the number of
predicted base–base interactions with no correspondence in the solution model, and FN is the
number of base–base interactions in the solution model not present in the predicted model
(Miao et al. 2017).
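The INF calculation can be sketched as follows, assuming the usual definition of INF as the geometric mean of the positive predictive value, TP / (TP + FP), and the sensitivity, TP / (TP + FN). The representation of an interaction as a tuple (i, j, type) and the function name are assumptions of this sketch; in this work the interactions themselves were detected with ClaRNA.

```python
import math


def interaction_network_fidelity(predicted, reference):
    """Compute INF from two collections of base-base interactions.

    Each interaction is any hashable item, e.g. a tuple (i, j, "WC")
    (a hypothetical representation chosen for this sketch).
    """
    predicted, reference = set(predicted), set(reference)
    tp = len(predicted & reference)   # interactions present in both models
    fp = len(predicted - reference)   # predicted, absent from the solution
    fn = len(reference - predicted)   # in the solution, missed by the prediction
    if tp == 0:
        return 0.0
    ppv = tp / float(tp + fp)
    sensitivity = tp / float(tp + fn)
    return math.sqrt(ppv * sensitivity)
```

Partial INF values (e.g., for canonical or non-canonical interactions only) follow from the same function by filtering both collections by interaction type before the call.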
3.5.4 Standardization of PDB files
All structures before scoring were standardized with rna-pdb-tools
(https://github.com/mmagnus/rna-pdb-tools).
3.5.5 Evaluation of scoring functions
To assess the accuracy of the prediction, 5-fold cross-validation was performed on the RASP
and RNA KB datasets. None of the structures of the RNA-Puzzles dataset was used at any stage
of training the statistical model. The cross-validation was performed using the built-in
functionality of the H2O platform via the H2O Flow web interface.
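For readers unfamiliar with the procedure, the idea of 5-fold cross-validation can be sketched in plain Python as below. This is a generic illustration and not the H2O implementation; the function name, the random shuffling and the seed are assumptions of this sketch.

```python
import random


def k_fold_indices(n_items, k=5, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Every k-th shuffled index goes to the same fold, so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

Each structure is used for testing exactly once and for training in the remaining four rounds, so the reported accuracy is always measured on data that the model did not see in that round.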
To assess the performance of the scoring functions, rank correlations (Spearman, R) between
scores and RMSDs were calculated. To compare structural models to native structures, root
mean square deviation (RMSD) was used. RMSD is defined by the following formula:
RMSD = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} \delta_i^2 }
where δ_i is the Euclidean distance between the i-th pair of corresponding atoms and N is the
number of atom pairs. RMSD is calculated for all heavy atoms. R ranges from -1 to 1. If the
ranking by energy perfectly follows the ranking by RMSD, R is equal to 1. If the energy is
random, R is close to 0.
The second metric used to assess the performance was Enrichment Score (ES), described in
the publication about RNA KB (Bernauer et al. 2011). The enrichment score is defined as:
ES = \frac{ | E_{top10\%} \cap R_{top10\%} | }{ 0.1 \times 0.1 \times N_{decoys} }
where Etop10% is the set of structures with energies in the top 10% (i.e., the best-scoring
10%), Rtop10% is the set of structures with the RMSD in the lowest 10%, and Ndecoys is the
number of structures in the decoy set. | Etop10% ∩ Rtop10% | is the number of structures
in the intersection of these two sets. The ES ranges from 0 to 10. If the energy ranks the
structures exactly as the RMSD does, ES is equal to 10. If the energy is random, ES is equal
to 1 on average.
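The enrichment score can be sketched as follows, assuming the definition ES = |Etop10% ∩ Rtop10%| / (0.1 × 0.1 × N), where N is the number of structures in the decoy set, and assuming that a lower energy means a better model. The function name and the rounding of the subset size are choices made for this sketch.

```python
def enrichment_score(energies, rmsds, fraction=0.1):
    """Enrichment score for one decoy set (lower energy and lower RMSD are better)."""
    n = len(energies)
    if n != len(rmsds) or n == 0:
        raise ValueError("energies and rmsds must be non-empty and of equal length")
    k = int(round(fraction * n))
    if k == 0:
        raise ValueError("decoy set too small for the requested fraction")
    # Indices of the k best structures by energy and by RMSD.
    best_by_energy = set(sorted(range(n), key=lambda i: energies[i])[:k])
    best_by_rmsd = set(sorted(range(n), key=lambda i: rmsds[i])[:k])
    overlap = len(best_by_energy & best_by_rmsd)
    return overlap / (fraction * fraction * n)
```

With 100 decoys, each subset holds 10 structures; a perfect ranking gives an overlap of 10 and ES = 10, while a random ranking gives an expected overlap of 1 and ES = 1.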
3.5.6 Statistical analyses
A wide range of statistical methods was applied to complete the project: Pearson and
Spearman's rank correlation coefficients, data normalization, statistical model building.
Statistical analyses were carried out using R (version 3.3) and Python (version 2.7) with
Jupyter, formerly IPython (Pérez and Granger 2007). The final statistical model was built with
the H2O platform (https://www.h2o.ai/). H2O is a free and open source machine-learning
platform that allows for building statistical models trained on big (and small) data.
3.5.7 Implementation of the web server
The web server of mqapRNA (http://genesilico.pl/mqapRNA/) was implemented in Python
(https://www.python.org/, version 2.7) coupled with Django (https://www.djangoproject.com,
version 1.5.1) and SQLite (http://www.sqlite.org, version 3.8.2). SQLite is a self-contained,
high-reliability, embedded, full-featured, public-domain, SQL database engine that keeps all
the data in one file. This database management system is used to store information about users’
submissions. Django is a Python Web framework that is designed for fast development of
web services. mqapRNAweb provides a clean interface that is designed to be user-friendly
even for users without prior expertise in RNA bioinformatics.
3.6 Development of EvoClustRNA
3.6.1 Multiple sequence alignment generation and selection of homologs
For each sequence, the corresponding Rfam (Nawrocki et al. 2015) alignment was
downloaded. Rfam (http://rfam.sanger.ac.uk/) is a database of RNA
sequences grouped into RNA families. Each family is represented as a statistical model (CM,
covariance model) built with the Infernal software (Nawrocki et al. 2009) that combines
sequence and secondary structure information. Sequences in alignments were sorted by
length, and redundancy was reduced at a sequence similarity threshold of 90% with
Jalview (Waterhouse et al. 2009). The four shortest sequences were selected for
modeling. The conserved regions were visually identified in Emacs using the RALEE
(Griffiths-Jones 2005) plugin. A new pseudo-sequence named “x” (Fig. 3.6.1) was created to
mark the conserved residues that should be cut out for clustering. If the target sequence
was not in the alignment, it was manually added. Based on the alignments, a set of FASTA
input files with sequences and their secondary structures was created with Jalview (Fig. 3.6.2) and
used as an input for modeling.
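Two steps of this procedure lend themselves to a short sketch: picking the shortest sequences from an alignment and translating the columns marked in the pseudo-sequence “x” into residue numbers of each homolog. In this work these steps were done with Jalview and RALEE; the functions below, their names and the gap characters are assumptions made only for illustration.

```python
def shortest_sequences(alignment, n=4, gap_chars="-."):
    """Return names of the n shortest sequences (gaps ignored).

    alignment: dict mapping a sequence name to its aligned (gapped) sequence.
    """
    def ungapped_length(name):
        return sum(1 for ch in alignment[name] if ch not in gap_chars)
    # Ties are broken alphabetically by name to keep the result deterministic.
    return sorted(alignment, key=lambda name: (ungapped_length(name), name))[:n]


def conserved_positions(aligned_seq, marker_line, marker="x", gap_chars="-."):
    """Map alignment columns flagged with `marker` to 1-based residue numbers."""
    if len(aligned_seq) != len(marker_line):
        raise ValueError("sequence and marker line must have the same length")
    positions = []
    residue_number = 0
    for base, flag in zip(aligned_seq, marker_line):
        if base in gap_chars:
            continue  # a gap does not correspond to any residue of this homolog
        residue_number += 1
        if flag == marker:
            positions.append(residue_number)
    return positions
```

The residue numbers differ between homologs because of insertions and deletions, but they refer to the same alignment columns, so the extracted fragments can be compared directly. If a marked column is a gap in some homolog, that homolog yields fewer positions, which is worth checking before the fragments are compared.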
Figure 3.6.1: The alignment preparation. The conserved residues are marked with “x” in the
pseudo-sequence “x”. The columns marked as conserved can be inspected in an
arc diagram of the RNA secondary structure (Lai et al. 2012) as the pink line (at the very
bottom).
Figure 3.6.2: Each sequence and its associated secondary structure was saved ("Save as") with
the Jalview program to a FASTA file and used at the next stage of modeling.
3.6.2 Modeling of sequences with SimRNA/SimRNAweb and Rosetta
For modeling with SimRNA (Boniecki et al. 2016), the SimRNAweb (Magnus et al. 2016)
(http://genesilico.pl/SimRNAweb) server was used with the default parameters (1% of the
lowest-energy frames taken for clustering; number of simulation steps: 500). SimRNA
trajectories were downloaded from the server and one hundred low-energy models were
obtained from each SimRNA trajectory with programs implemented in rna-pdb-tools
(https://rna-pdb-tools.readthedocs.io/en/latest/utils.html#simrna).
For Rosetta, a pipeline implemented in rna-pdb-tools (utils Rosetta,
https://rna-pdb-tools.readthedocs.io/en/latest/utils.html#rosetta) was used as described in
the work of Cheng and coworkers (Cheng et al. 2015). The procedure starts with the
pre-assembly of helices. Then Rosetta is run without minimization to obtain 10,000 output
models. Next, 1/6 (17%) of the lowest-energy models are minimized. For each Rosetta run,
one hundred low-energy models
were selected for clustering with EvoClustRNA.
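The selection of low-energy models can be sketched as below. The sketch assumes that each model is available as a (name, energy) pair; it does not reproduce the rna-pdb-tools programs used in this work, and the function name is hypothetical.

```python
def select_lowest_energy(models, n=None, fraction=None):
    """Return the lowest-energy models, sorted from the lowest energy up.

    models: list of (name, energy) tuples.
    Exactly one of `n` (a count) or `fraction` (0-1) must be given.
    """
    if (n is None) == (fraction is None):
        raise ValueError("give exactly one of n or fraction")
    ranked = sorted(models, key=lambda model: model[1])
    if fraction is not None:
        n = int(round(len(ranked) * fraction))
    return ranked[:n]
```

For example, the Rosetta stage described above would correspond to a call with fraction=1/6.0 on the 10,000 unminimized models, followed by a call with n=100 on the minimized ones.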
3.6.3 Clustering routine
The clustering procedure used with EvoClustRNA was originally implemented by Irina Tuszyńska
for use with DARS-RNP and QUASI-RNP (statistical potentials for protein-RNA docking)
(Tuszyńska and Bujnicki 2011). In the case of EvoClustRNA, the procedure was slightly
modified, but the underlying principles remained the same. The program is an
implementation of an algorithm used for clustering with Rosetta for protein structure
prediction (Simons et al. 1999), also described in (Bonneau et al. 2001). Briefly, one hundred
low-energy structures for each homolog are taken for clustering. The clustering procedure is
iterative and begins with calculating a list of neighbors for each structure. Two structures are
considered neighbors when the RMSD between them is smaller than a given distance
cutoff. To find a proper cutoff, the iterative procedure starts from 0.5 Å, and the cutoff is
incremented by 0.5 Å until the three biggest clusters contain at least half of all structures
used for clustering. For example, for five homologs, 500 structures are clustered, and the
iterative clustering stops when there are at least 250 structures in the three biggest clusters.
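The procedure can be sketched as follows. The sketch assumes a precomputed symmetric matrix of pairwise RMSDs and a greedy extraction of clusters in which the structure with the most neighbors becomes a cluster center and is removed together with its neighbors. This is one common reading of the Rosetta-style algorithm and is not the code used in EvoClustRNA; all names and the max_cutoff safeguard are assumptions.

```python
def cluster_at_cutoff(dist, cutoff):
    """Greedy neighbor-based clustering of items described by a distance matrix.

    Returns a list of clusters (sorted lists of indices) in order of extraction.
    """
    remaining = set(range(len(dist)))
    clusters = []
    while remaining:
        best_members = None
        for i in sorted(remaining):
            # Neighbors of i among the structures not yet assigned (i itself included).
            members = [j for j in remaining if dist[i][j] < cutoff]
            if best_members is None or len(members) > len(best_members):
                best_members = members
        clusters.append(sorted(best_members))
        remaining -= set(best_members)
    return clusters


def cluster_with_adaptive_cutoff(dist, start=0.5, step=0.5, top=3,
                                 coverage=0.5, max_cutoff=100.0):
    """Raise the cutoff until the `top` biggest clusters hold `coverage` of all items."""
    cutoff = start
    while cutoff <= max_cutoff:
        clusters = cluster_at_cutoff(dist, cutoff)
        sizes = sorted((len(c) for c in clusters), reverse=True)
        if sum(sizes[:top]) >= coverage * len(dist):
            return cutoff, clusters
        cutoff += step
    raise ValueError("no cutoff up to max_cutoff satisfied the coverage criterion")
```

For 500 structures, the loop ends at the first cutoff for which the three biggest clusters together contain at least 250 structures.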
4 Results
4.1 mqapRNA
mqapRNA (where “mqap” stands for “model quality assessment program”) is a computer
program that analyses a set of models provided by the user in the PDB format and predicts
quality scores. It is a meta-predictor, a method designed to use other methods (called
primary methods) and to analyze their outputs with a dedicated statistical model. Such an
approach could provide a better prediction by overcoming weaknesses of individual methods
and building on their individual strengths. The meta-prediction approach has proven
successful in structural bioinformatics, in particular in protein (Albrecht et al. 2003) and
RNA secondary structure prediction (Siebert and Backofen 2005), protein fold-recognition
(Kurowski and Bujnicki 2003), identification of protein domains (Saini and Fischer 2005),
and evaluation of protein model quality (Pawlowski et al. 2008). Earlier, I used this kind of
approach to improve prediction of subcellular localization for proteins (Magnus et al. 2012).
4.1.1 Implementation of mqapRNA
Primary prediction methods. mqapRNA relies on existing methods and a statistical model
to potentially provide better prediction than each individual method. Based on the results
obtained for the series of primary predictors, mqapRNA uses their outputs to generate a
consensus prediction. Table 3.5.1 in the Materials & Methods section lists all the primary
predictors used in this work.
Figure 4.1.1: Graphical diagram of primary methods used by mqapRNA to describe the
analyzed model. (A) other methods for model quality assessment, (B) RNA modeling
software (C) Others.
Primary methods are divided into three groups. The first group includes dedicated
methods for model quality assessment: 3RNAscore, RASP, RNA KB, εSCORE (Fig. 4.1.1A).
The second group comprises methods for RNA structure modeling, which also allow for
calculating structural descriptors and energy values that can be used as input to the final
statistical model. For each analyzed structure, mqapRNA runs a single-step simulation with
SimRNA and Rosetta (the latter in two modes: low resolution (FARNA), and full-atom
high resolution (FARFAR)) to generate scores for the input model (Fig. 4.1.1B). The third
group contains methods for calculating the clash score, correctness of geometry and radius
of gyration (Fig. 4.1.1C). For each analyzed structure, mqapRNA runs all the
above-mentioned programs to generate a list of scores and uses it for quality prediction with
a deep learning statistical model.
Datasets used for training and testing. To build a statistical model, two datasets were used:
RASP and RNA KB. These datasets consist of near-native (deliberately perturbed) structural
models (“decoys”). Decoy structures are used to test discriminative power of scoring
methods. A good scoring method should be effective in identifying near-native decoys in a
pool of structures.
The RASP dataset was generated by MODELLER (Sali and Blundell 1993) with a set of
Gaussian restraints for dihedral angles and atom distances from 85 native structures. The
dataset includes 85 decoy sets, each containing 500 structures (Fig. 4.1.2).
Figure 4.1.2: Example of a decoy set from the RASP dataset of the adenine riboswitch (PDB
ID: 1Y26). (A) The native structure. (B-F) A set of structures (files in the PDB format)
selected from this decoy with increasing deviation from the native (in parentheses are
RMSDs to the native). Files: (B) 1y26X_M100 (RMSD: 1.7Å), (C) 1y26X_M200 (RMSD:
2.49Å), (D) 1y26X_M300 (RMSD: 3.23Å), (E) 1y26X_M400 (RMSD: 3.31Å), (F)
1y26X_M500 (RMSD: 5.12Å).
The second dataset used for training mqapRNA was RNA KB. This dataset includes two
subsets: RNA KB-Molecular Dynamics (MD) and RNA KB-Normal Mode (NM). The first
subset was generated by position-restrained molecular dynamics and Replica-Exchange
Molecular Dynamics (REMD) simulations and covers a wide near-native RMSD range (from
0.1 to 10 Å, Fig. 4.1.3). In the REMD protocol, 1 ns REMD simulations were performed for
each RNA structure. The subset contains five decoy sets, each containing 3500 structures.
The second subset of RNA KB was generated by the Normal Mode perturbation method. The
structures in this subset possess stereochemically correct bond lengths and angles but without
correct base pairing (Fig. 4.1.4). The subset contains 15 decoy sets, each including 500
structures.
The third dataset was used only for testing and includes all models submitted to the RNA-
Puzzle organizers by participating groups.
Figure 4.1.3: Histograms of RMSDs [Å] per dataset. In red, the datasets used for training
mqapRNA; in orange, the dataset used only for testing. X: number of structures (not scaled in
the same way for all plots because of the very diverse ranges), Y: RMSDs [Å].
All datasets have different structural properties (Fig. 4.1.3 and Fig. 4.1.4). The RASP dataset
covers RMSD from 0 Å to 10 Å (median: 3.94 Å) and includes structures with very distorted
base pairing (median of INF 0.79). RNA KB-MD contains the structures closest to the native
structures in terms of geometry (median of RMSD: 1.59 Å) and base pairing (median of INF:
1.0). RNA KB-NM covers a narrower range of RMSD but presents geometrical
distortions different from those of the physics-based force field method, with very divergent
secondary structure similarity (median of INF 0.77). The RNA-Puzzles dataset consists of
structures that are far from the native structures (median: 14.78 Å, standard deviation: 7.46 Å).
Figure 4.1.4: Histograms of Secondary Structure (INFs) per dataset. In red, the datasets used
for training mqapRNA, in orange, the dataset used only for testing. X: number of structures
(not scaled in the same way for all plots because of the very diverse ranges), Y: Secondary
Structure similarity of a given model to a secondary structure of a native structure (INFs).
Training of the statistical model. mqapRNA is based on machine learning methods that
recognize patterns in the data in order to make predictions about new data. It utilizes “supervised
learning” where the good examples and bad examples are presented to a statistical model in
order to train it to make predictions. For each structure of the training datasets, a list of scores
from primary methods was obtained. Each structure was described by 70 variables - scores
obtained from 8 primary methods. The response variable was the value of the structure
RMSD to the native structure (Fig. 4.1.5A), and this is the value that mqapRNA projects for
new structures (of unknown RMSD) based only on scores from the primary methods (Fig.
4.1.5B).
Figure 4.1.5: mqapRNA is a machine learning based method. (A) First, a statistical model
was built on a training dataset of structures of known RMSD to native structures. Each
structure is described by a list of scores, the results of the primary methods. Since this is the
training set, the RMSD of these structures to the native structures is known. This process
allows mqapRNA to learn the correspondence between scores and RMSDs. (B) Next, the
statistical model is applied to new cases, where the RMSD is unknown.
The statistical model used in mqapRNA is based on a deep learning algorithm. A series
of grid searches was performed to find an optimal set of parameters for the statistical model.
Five-fold cross-validation was performed to limit the bias towards the training dataset.
The whole procedure was performed using the machine learning platform H2O. At the very
end of this process, the final statistical model was selected to be used in mqapRNA. An
accurate statistical model, based on the training data, should be able to discover links
between scores and RMSD values, therefore, for a new vector where only scores are known,
the method should predict the theoretical RMSD. The predicted RMSDs are a measure of the
quality of the structure.
Figure 4.1.6: Contribution (“Importance”) of a given subscore (“Variable”) to the final deep
learning model developed for mqapRNA (a plot generated with the H2O Flow notebook).
The higher the value, the more a given subscore is required for accurate predictions of the
statistical model.
Figure 4.1.6 shows the impact (“variable importance”) of each variable on the final
statistical model. Surprisingly, the variable with the greatest impact was the score that describes
the radius of gyration of an analyzed model. This might mean that the appropriate radius of
gyration (compactness) of models is important for quality prediction. The second score
(scaled importance: 0.68) on the list is a component of RASP (“RASP All Interactions
Normalized Energy”), and the third is 3RNAscore (scaled importance: 0.59). The statistical
model also depends on the number of chains (“No Chains”) and the length of an analyzed
structural model (7th and 8th in the ranking, respectively).
4.1.2 Performance of mqapRNA
To test how the scores of the methods correlate with the observed structural deviation from
the native conformation, rank correlations were calculated between the structural deviations,
measured as RMSD to the native structure, and scores on three datasets: RASP, RNA KB
(Molecular Dynamics & Normal Mode), and the submissions to RNA-Puzzles.
The first benchmark uses rank correlations (R) to show how well a given method is able to
rank all the models, from very good to very bad. Figure 4.1.7 shows the rank correlations
for each decoy set and scoring method. To compare the performance of the scoring methods on
datasets of different sizes, a weighted average was introduced. mqapRNA (Fig. 4.1.7, 3rd
column) outperformed all other scoring methods, achieving a weighted average (Fig. 4.1.7,
the last row of the plot) of rank correlations of 0.77. The second was RASP with a weighted
average of 0.74. SimRNA scored as the third method with a weighted average of 0.71. Note
that mqapRNA also achieved very high rank correlations (an average per decoy set of over 0.8)
for the RASP, RNA KB-Molecular Dynamics, and RNA KB-Normal Mode datasets. Clash Score and
Analyze Geometry performed poorly, with weighted averages of 0.23 and 0.42, respectively.
However, all the methods scored poorly on the RNA-Puzzles datasets compared to the others.
This low-quality prediction is due to the higher level of distortion complexity of the
RNA-Puzzles datasets. This might suggest that the RASP and RNA KB datasets do not represent
the deviations of models that one might encounter in real-life case studies of RNA structure
prediction.
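The weighted average used in this benchmark can be illustrated with a minimal sketch. It
assumes that the rank correlation is Spearman's coefficient and that each decoy set is
weighted by its number of models; neither detail is specified above, and the function names
are hypothetical, introduced only for this illustration.

```python
from typing import Dict, List, Sequence, Tuple


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next value is tied with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(scores: Sequence[float], rmsds: Sequence[float]) -> float:
    """Spearman rank correlation (Pearson correlation of the ranks).

    Assumes that neither input is constant (otherwise the variance is zero).
    """
    rs, rr = average_ranks(scores), average_ranks(rmsds)
    n = len(rs)
    mean_s, mean_r = sum(rs) / n, sum(rr) / n
    cov = sum((a - mean_s) * (b - mean_r) for a, b in zip(rs, rr))
    var_s = sum((a - mean_s) ** 2 for a in rs)
    var_r = sum((b - mean_r) ** 2 for b in rr)
    return cov / (var_s * var_r) ** 0.5


def weighted_average(per_set: Dict[str, Tuple[float, int]]) -> float:
    """Average the correlations, weighting each decoy set by its model count.

    `per_set` maps a decoy-set name to (rank correlation, number of models).
    """
    total = sum(n_models for _, n_models in per_set.values())
    return sum(r * n_models for r, n_models in per_set.values()) / total
```

With such a weighting, a decoy set with many models contributes proportionally more to the
final value than a set with only a few models, which is the motivation given above for not
using a plain mean.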
The second benchmark uses the Enrichment Score (ES) to show how many of the 10% of the best
models (by RMSD) were also placed by a given method among its 10% best-scored models. This
metric tests the capability of methods to identify the subset of the best models in a given
decoy set. Figure 4.1.8 shows the Enrichment Scores for each decoy set and scoring method.
In this test, SimRNA achieved the highest weighted average of 5.4 and outperformed mqapRNA
with a weighted average of 5.3. Once again, the RNA-Puzzles dataset was the most difficult
for quality prediction. The best method on this dataset was FARNA (in the high-resolution
mode) with an average of 2.3. mqapRNA was the second on this dataset with an average of 1.8.
Interestingly, Secondary Structure (INF), a score that compares the secondary structure of a
model with the true secondary structure obtained from a crystal structure, achieved an
average of 3.6. This scoring assumes that the secondary structure predicted for a given
sequence is the same as the secondary structure of the crystal structure, which in practice
is very difficult to obtain. For the first four consecutive RNA-Puzzles, almost all methods
achieved an ES of zero. For these RNA-Puzzles, the participating groups submitted only one
or two models per group. Thus, to get a high ES value for RNA-Puzzle 1 with only twelve
submitted models (for example), a scoring function should detect one particular model, since
10% of twelve is 1.2.
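The idea behind the Enrichment Score can be sketched as follows. The sketch assumes one
common formulation: the overlap between the top 10% of models by score and the top 10% by
RMSD, divided by the overlap expected for a random ranking. It also assumes that a lower
score means a better model. The exact formula and rounding used in the benchmark are not
restated here, so the function below is an illustration only.

```python
from typing import Sequence


def enrichment_score(scores: Sequence[float], rmsds: Sequence[float],
                     fraction: float = 0.1) -> float:
    """Overlap of the best-scored and the lowest-RMSD subsets, relative to chance.

    Assumption: lower score = better model (energy-like scores).
    """
    n = len(scores)
    # Size of the "top" subset; at least one model even for tiny decoy sets.
    k = max(1, int(round(fraction * n)))
    best_by_score = set(sorted(range(n), key=lambda i: scores[i])[:k])
    best_by_rmsd = set(sorted(range(n), key=lambda i: rmsds[i])[:k])
    overlap = len(best_by_score & best_by_rmsd)
    # Two random subsets of size k drawn from n models share k*k/n models on average.
    expected_by_chance = k * k / n
    return overlap / expected_by_chance
```

For a set of twelve models the subset size is a single model, so the score is either zero
or its maximum, which illustrates why small RNA-Puzzles submissions give such unstable
values.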
Interestingly, for all the methods, a huge difference in performance was recorded
between the RASP datasets (the dark red area in Figure 4.1.8) and the RNA KB datasets (the
mixed red and blue area in the middle of Figure 4.1.8). This might suggest that the RASP
dataset is composed of a limited number of near-native models. Since the majority of models
are far from the native structures, they cannot be detected as good ones. In the case of the
RNA KB datasets, it appears that there are many near-native models, but the scoring functions
have problems distinguishing them from worse models. Figure 4.1.8 is consistent with this:
the RNA KB decoys are similar to the native structures, which makes their assignment to the
best 10% of the models much harder.
Figure 4.1.9 shows a close-up of the RNA-Puzzle 14 scorings. mqapRNA achieved an
ES of 7.7, being able to identify a group of near-native models. Other methods were not able
to rank models properly. Note that both modes of FARNA (the high-resolution mode and the
low-resolution mode) scored second with an ES of 5.8.
Figure 4.1.7: Rank correlations for each decoy set and scoring method. mqapRNA (3rd
column) outperformed other scoring functions with a weighted average of rank correlations
of 0.77.
Figure 4.1.8: Enrichment Score (ES) for each decoy set and scoring method. mqapRNA (3rd
column) is outperformed by SimRNA (10th column) by 0.1 in terms of the weighted average ES.
Figure 4.1.9: Close-up of the RNA-Puzzle 14 results in the form of RMSD [Å] vs. score plots.
A perfect method should follow the diagonal of a plot. mqapRNA achieved an ES of 7.7 and
was able to identify a group of near-native models. Other methods were not able to rank
models properly.
4.1.3 mqapRNA web server: quality prediction with optional restraints
The mqapRNA web server is a workflow based on a combination of computational tools. It
offers a user-friendly web interface to submit RNA PDB structures and view the results. All
steps of the analysis are automated, which makes the scoring available to users who would
otherwise have to install many programs locally. All intermediate results can be downloaded
and processed by the user. The server is available at http://genesilico.pl/mqapRNA/. The
server is free and open to all users, with no login requirement. Further details can be
found on the documentation page of the server
(http://genesilico.pl/mqapRNA/documentation).
A user can submit RNA structural files in the PDB format in three different forms: (1) a
single file, (2) a single file with many models (“NMR-style”), or (3) a ZIP file with
multiple PDB files (Fig. 4.1.10). All PDB files are processed with rna-pdb-tools to obtain
standardized RNA structures on which the primary methods can be run. Since incorporation of
the information about RNA secondary structure improves the quality prediction of models,
users can provide their own secondary structure or let mqapRNA predict it. The secondary
structure can be predicted with the use of data from an experimental chemical probing
method, SHAPE. The user can also provide a set of distance restraints to refine the quality
prediction. The distance restraints and the secondary structure do not have to be provided
upfront; the user can also submit them on the result page. Moreover, both types of
restraints can be easily re-submitted on the result page to help the user select the right
models. A complete quality prediction result consists of a plot of mqapRNA scores (Fig.
4.1.11), a table of the scores (Fig. 4.1.11), and the distance restraints editor (Fig.
4.1.12). The server accepts distance restraints in a flat text file. We tested improving the
quality of prediction with evolutionary restraints and with MOHCA-seq restraints. For
evolutionary restraints, the suggested distance is 7 Å, while for MOHCA-seq it is 25 Å (Das
et al. 2008). Analysis of those scores can help the user decide which structure to select
for further investigation. The raw output files from each step of the prediction are also
available, and the user can carry out additional data analysis, if desired.
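How a set of distance restraints can be checked against a model is sketched below. The
restraint file layout used here (two residues written as chain:number followed by a maximal
distance in Å) and the choice of the C4' atom as the reference point are assumptions made
only for this illustration; the format accepted by the server is described in its
documentation, and the function names are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Residue = Tuple[str, int]                 # (chain ID, residue number)
Restraint = Tuple[Residue, Residue, float]  # (residue 1, residue 2, max distance in A)


def read_c4prime_atoms(pdb_text: str) -> Dict[Residue, Tuple[float, float, float]]:
    """Collect C4' coordinates from ATOM records (fixed-column PDB format)."""
    coords = {}
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "C4'":
            residue = (line[21], int(line[22:26]))
            coords[residue] = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
    return coords


def parse_restraints(text: str) -> List[Restraint]:
    """Parse lines such as 'A:10 A:45 7.0' (hypothetical layout); '#' starts a comment."""
    restraints = []
    for line in text.splitlines():
        line = line.split("#")[0].strip()
        if not line:
            continue
        first, second, cutoff = line.split()
        chain1, number1 = first.split(":")
        chain2, number2 = second.split(":")
        restraints.append(((chain1, int(number1)), (chain2, int(number2)), float(cutoff)))
    return restraints


def fraction_satisfied(coords, restraints: List[Restraint]) -> float:
    """Fraction of restraints whose residue pair lies within the given distance.

    A restraint that refers to a residue missing from the model counts as unsatisfied.
    """
    if not restraints:
        return 0.0
    satisfied = 0
    for res1, res2, cutoff in restraints:
        if res1 in coords and res2 in coords:
            if math.dist(coords[res1], coords[res2]) <= cutoff:
                satisfied += 1
    return satisfied / len(restraints)
```

With the distances suggested above, an evolutionary restraint would use a cutoff of 7 Å and
a MOHCA-seq restraint a cutoff of 25 Å; models could then be compared by the fraction of
restraints they fulfil.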
Figure 4.1.10: The homepage of the mqapRNA web server.
Figure 4.1.11: A result page of mqapRNA. The page is divided into three panels: a plot of
the mqapRNA scores, a table of the scores, and the restraints editor. The distance
restraints can be easily modified and re-submitted to the server. The results are updated
immediately, which might encourage the user to try different sets of restraints.
Figure 4.1.12: Distance restraints editor at the bottom of the result page. The user can upload
a file with distance restraints or use an online editor to modify his/her query. After the re-
submission, the scores are re-calculated, and a new plot is generated.
4.2 EvoClustRNA
4.2.1 Implementation of EvoClustRNA
Based on the observation that RNA sequences from the same RNA family fold into a highly
conserved structure, together with Professor Das, we assumed that a similar process can be
observed in silico. We assumed that computational modeling could be used to detect global
helical arrangements for the target sequence, based on the arrangements within a subset of
homologs. Thus, this project explores the use of multiple sequence alignment information
and parallel modeling of RNA homologs to improve ab initio RNA structure prediction
methods. To build a structural model of the target sequence, a multi-step modeling
process must be performed (Fig. 4.2.1).
Figure 4.2.1: The scheme of the proposed methodology. (A) Homologous sequences are
found for the target sequence, and an RNA alignment is created. (B) Using SimRNA and/or
Rosetta, structural models for all sequences are generated. (C) The conserved regions are
cut out and clustered. (D) The final prediction of the method is the model containing the
most commonly preserved structural arrangements in the set of homologs.
First, a subset of homologous sequences for the target sequence is selected using an
alignment from the Rfam database. Alignments are processed as described in the Materials &
Methods section 3.6.1. Subsequently, independent folding simulations are performed with
SimRNAweb and Rosetta for the selected sequences to generate initial models. Then,
structural fragments, which correspond to the evolutionarily conserved helical regions
determined from the alignment, are extracted from all obtained models and clustered. The
center (the model with the highest number of neighbors) of the biggest cluster is taken as
the final prediction.
In the current implementation of the method, the user should create a new line “x” in the
alignment that marks the regions that are selected for the clustering. This line can be
created automatically with rna-pdb-tools. However, the user can also define the regions for
clustering manually. This step is critical for the whole process, and the user should take
care to include in the clustering only the desired regions.
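The role of the “x” line can be illustrated with a short sketch that maps the marked
alignment columns onto residue numbers of each (ungapped) homolog. It assumes that the
marker line has the same length as the aligned sequences, that selected columns carry the
character “x”, and that gaps are written as “-” or “.”; the function name is hypothetical
and this is not the code used by EvoClustRNA itself.

```python
from typing import List


def marked_residues(aligned_seq: str, x_line: str, gap_chars: str = "-.") -> List[int]:
    """Return 1-based residue numbers of `aligned_seq` that fall in 'x' columns.

    Residues are numbered along the ungapped sequence, so the same alignment
    column can map to different residue numbers in different homologs.
    """
    if len(aligned_seq) != len(x_line):
        raise ValueError("sequence and marker line must have the same length")
    residues = []
    position = 0  # running residue number in the ungapped sequence
    for base, mark in zip(aligned_seq, x_line):
        if base in gap_chars:
            continue  # a gap does not correspond to any residue
        position += 1
        if mark == "x":
            residues.append(position)
    return residues
```

For example, with the marker line `xx-..xx`, the aligned sequence `GG-AUCC` gives residues
1, 2, 5 and 6, while `G-GAUCC` gives only 1, 5 and 6, because its second marked column is a
gap. Fragments with different numbers of selected residues cannot be superimposed
atom-by-atom, which is one reason why the marked regions must be chosen carefully.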
The initial version of the method, which was developed at Stanford University with Professor
Rhiju Das, used models generated with Rosetta. However, the EvoClustRNA method itself is
independent of the source of the analyzed initial structural models. For this reason, I
decided to also test EvoClustRNA with models generated with SimRNAweb, a tool for RNA
structure prediction developed in the laboratory of Professor Janusz Bujnicki.
EvoClustRNA is implemented as a set of Python programs, which can be downloaded
together with the documentation and examples from the GitHub repository
(https://github.com/mmagnus/EvoClustRNA). The main script, evoClustRNA.py, requires an
input alignment and a folder with initial models of all homologs to generate an all-vs-all
distance matrix between the selected clustering fragments. The next step is to use
evoClust_autoclustix.py, which is an implementation of an iterative clustering procedure.
As a result of this script, a set of clusters is generated. The structure with the highest
number of neighbors in the first (biggest) cluster is taken as the final prediction.
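The selection of the final prediction from an all-vs-all distance matrix can be sketched as
a greedy neighbor-count clustering. The sketch assumes a fixed RMSD cutoff that defines
which models are neighbors; the iterative procedure implemented in evoClust_autoclustix.py
is not reproduced here, and the function name is hypothetical.

```python
from typing import List, Sequence, Tuple


def greedy_clusters(dist: Sequence[Sequence[float]],
                    cutoff: float) -> List[Tuple[int, List[int]]]:
    """Cluster models by repeatedly picking the model with the most neighbors.

    `dist` is a symmetric all-vs-all distance matrix (e.g. RMSD in A).
    Returns (center index, member indices) pairs, biggest cluster first.
    """
    remaining = set(range(len(dist)))
    clusters = []
    while remaining:
        best_center, best_members = -1, set()
        for i in sorted(remaining):
            # A model is its own neighbor (distance 0), so clusters are never empty.
            members = {j for j in remaining if dist[i][j] <= cutoff}
            if len(members) > len(best_members):
                best_center, best_members = i, members
        clusters.append((best_center, sorted(best_members)))
        remaining -= best_members
    return clusters
```

The center of the first cluster returned by such a procedure corresponds to the model with
the highest number of neighbors, i.e. the structure that would be reported as the final
prediction.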
4.2.2 Blind predictions with EvoClustRNA in the RNA-Puzzles
EvoClustRNA was tested on the RNA-Puzzle 13 problem. The target of 71 nucleotides was
an RNA 5-aminoimidazole-4-carboxamide riboside 5′-monophosphate (ZMP) riboswitch,
which can up-regulate de novo purine synthesis in response to increased intracellular levels
of ZMP (Trausch et al. 2015). The alignment for this riboswitch was downloaded from the
Rfam database (RF01750), from which ten homologs were selected for modeling with Rosetta.
The secondary structures for all homologs were derived with Jalview based on the Rfam
alignment. The pseudoknot was suggested in the available literature (Kim et al. 2015) and
was used for modeling. The EvoClustRNA prediction, with an RMSD of 5.55 Å with respect
to the native structure (Fig. 4.2.2), was the second in the total ranking of RNA-Puzzles
(http://ahsoka.u-strasbg.fr/rnapuzzlesv2/result/Puzzle13/). The final prediction was made
based on the visual inspection of the best clusters, which were obtained by using the
EvoClustRNA method.
Figure 4.2.2: RNA-Puzzle 13, the ZMP riboswitch. The superposition of the native
structure (green) and the EvoClustRNA prediction (blue). The RMSD between the structures is
5.55 Å; the prediction was ranked second in the total ranking of the RNA-Puzzles
(according to the RMSD values).
EvoClustRNA was also used in the RNA-Puzzles to model Puzzle 14. The RNA
molecule of interest was the 61-nucleotide long L-glutamine riboswitch, which upon
glutamine binding undergoes a major conformational change in the P3 helix (Ren et al.
2015). It was the first RNA-Puzzle for which the participating groups were asked to model
two forms of the RNA molecule: one with a ligand (“bound”) and another one without a
ligand (“free”). However, the EvoClustRNA method was used only to model the “bound”
form. The alignment for this RNA family (RFAM ID: RF01739) was downloaded from the
Rfam database, from which two homologs were selected for modeling with Rosetta. It was
suggested in the literature (Westhof 2010) that the structure included an E-loop motif. This
motif was found in the PDB database and was used as a rigid fragment during the modeling.
Three independent simulations were performed, and the final prediction was obtained in a
fully automated manner. The native structure of the riboswitch superimposed on the model
obtained with the EvoClustRNA method is shown in Fig. 4.2.3. The EvoClustRNA
prediction was ranked first in the overall ranking, with an RMSD of 5.56 Å with respect
to the native structure (http://ahsoka.u-strasbg.fr/rnapuzzlesv2/result/Puzzle14Bound/).
Figure 4.2.3: The RNA Puzzle 14 - L-glutamine riboswitch. The RMSD between the native
structure (green) and the EvoClustRNA prediction (blue) is 5.56 Å.
4.2.3 Performance of EvoClustRNA
To rigorously test the EvoClustRNA methodology, a dataset composed of nine RNAs with
known, experimentally solved structures was used. This dataset included (1) five RNAs used
by Weinreb and coworkers to benchmark modeling restraints from direct coupling analysis
(DCA) (Weinreb et al. 2016), and (2) four RNA-Puzzles: 6, 13, 14, and 17 (Table 4.2.1, rows
6 to 9). To compare the results obtained by Weinreb et al. with the single-sequence
predictions and the EvoClustRNA runs, Table 4.2.1 includes a column “DCA” with RMSDs
calculated for models from Weinreb’s publication. The single-sequence and EvoClustRNA
predictions were performed using both SimRNAweb and Rosetta.
According to our results, EvoClustRNA|SimRNAweb improved the results in 5 out of 9
cases. However, the improvement was relatively small, on average 0.30 Å RMSD. In the case
of EvoClustRNA|Rosetta, the obtained models were on average 0.36 Å RMSD less accurate than
the Rosetta models generated for single sequences. Interestingly, SimRNAweb and Rosetta
gave similar results on average in terms of RMSD.
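The averages quoted above follow directly from the RMSD values of Table 4.2.1, where the
improvement is the RMSD of the single-sequence model minus the RMSD of the corresponding
EvoClustRNA model (a positive value means that EvoClustRNA was more accurate). The short
sketch below recomputes them; the variable and function names are introduced only for this
illustration.

```python
# RMSD [A] from Table 4.2.1:
# RNA: (SimRNAweb, EvoClustRNA|SimRNAweb, Rosetta, EvoClustRNA|Rosetta)
RMSD = {
    "Ade":    (6.85, 7.52, 9.02, 13.89),
    "TPP":    (22.37, 24.08, 20.88, 13.92),
    "tRNA":   (14.37, 10.35, 13.11, 14.60),
    "cdiGMP": (12.26, 9.65, 11.41, 14.53),
    "THF":    (12.22, 11.35, 4.83, 7.68),
    "COB":    (31.02, 33.39, 31.44, 33.19),
    "ZMP":    (6.42, 6.73, 8.32, 6.73),
    "GlnA":   (4.71, 4.44, 6.54, 4.83),
    "Pistol": (12.19, 12.17, 12.72, 12.17),
}


def summarize(single_idx: int, evo_idx: int):
    """Return (number of improved cases, mean improvement in A)."""
    improvements = [row[single_idx] - row[evo_idx] for row in RMSD.values()]
    improved = sum(1 for delta in improvements if delta > 0)
    return improved, sum(improvements) / len(improvements)


simrnaweb_cases, simrnaweb_mean = summarize(0, 1)
rosetta_cases, rosetta_mean = summarize(2, 3)
print(f"EvoClustRNA|SimRNAweb: improved {simrnaweb_cases}/9, mean {simrnaweb_mean:.2f} A")
print(f"EvoClustRNA|Rosetta:   improved {rosetta_cases}/9, mean {rosetta_mean:.2f} A")
```

Note that the mean improvement is dominated by a few large differences (for example, TPP
for Rosetta and tRNA for SimRNAweb), so the number of improved cases and the mean should be
read together.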
All sequences and secondary structures used for modeling are listed as Supplementary
Information S1.
1   2           3     4      5          6             7               8        9             10
No  RNA         Len.  DCA    SimRNAweb  EvoClustRNA|  Improvement of  Rosetta  EvoClustRNA|  Improvement of
                                        SimRNAweb     EvoClustRNA|             Rosetta       EvoClustRNA|
                                                      SimRNAweb                              Rosetta
1.  Ade         72    9.23   6.85       7.52          -0.67           9.02     13.89         -4.87
2.  TPP         80    10.35  22.37      24.08         -1.71           20.88    13.92         6.96
3.  tRNA        76    8.58   14.37      10.35         4.02            13.11    14.60         -1.49
4.  cdiGMP      76    11.10  12.26      9.65          2.61            11.41    14.53         -3.12
5.  THF         89    8.84   12.22      11.35         0.87            4.83     7.68          -2.85
6.  COB #6      168   NA     31.02      33.39         -2.37           31.44    33.19         -1.75
7.  ZMP #13     71    NA     6.42       6.73          -0.31           8.32     6.73          1.59
8.  GlnA #14    61    NA     4.71       4.44          0.27            6.54     4.83          1.71
9.  Pistol #17  62    NA     12.19      12.17         0.02            12.72    12.17         0.55
-   Average     -     -      13.60      13.30         0.30            13.14    13.50         -0.36
Table 4.2.1: The performance of EvoClustRNA on the test dataset. The results for nine
RNAs. Column 1, original numeration. Column 2, RNA type (with the RNA-Puzzle number for
rows 6-9). Column 3, sequence length. Column 4, RMSD [Å] of models obtained by Weinreb et
al., only for RNAs 1-5. Column 5, RMSD [Å] of the first cluster obtained with SimRNAweb.
Column 6, RMSD [Å] of the first cluster obtained with EvoClustRNA|SimRNAweb. Column
7, the improvement, i.e., column 5 minus column 6. Column 8, RMSD [Å] of the first cluster
obtained with Rosetta. Column 9, RMSD [Å] of the first cluster obtained with
EvoClustRNA|Rosetta. Column 10, the improvement, i.e., column 8 minus column 9. Positive
values (marked in green in the original table) are improvements in RMSD when EvoClustRNA
is used; negative values (marked in red) are cases where EvoClustRNA worsened the results.
Adenine riboswitch (Ade, PDB ID: 1Y26, RFAM ID: RF00167). The first RNA in Table 4.2.1
is the adenine riboswitch. The sequence used for modeling is 72 nucleotides long. This
riboswitch has a pseudoknot, which was used for modeling. The best RMSD was achieved by
SimRNAweb (6.85 Å), which was even better than modeling with the use of evolutionary
restraints by Weinreb et al. (9.23 Å). EvoClustRNA did not improve the results: the
EvoClustRNA|SimRNAweb model gave 7.52 Å, while a value of 13.89 Å was obtained by using
EvoClustRNA|Rosetta (Fig. 4.2.4).
Figure 4.2.4: (A) The native structure (PDB ID: 1Y26). Models generated by (B) Weinreb et
al. (C) SimRNAweb (D) EvoClustRNA|SimRNAweb (E) Rosetta (F) EvoClustRNA|Rosetta.
All models exhibit a native-like fold. However, only models C and D exhibit an orientation
of secondary structure elements similar to that of the native structure.
Thiamine pyrophosphate-sensing riboswitch (TPP, PDB ID: 2GDI, RFAM ID:
RF00059). The model obtained with DCA restraints achieved an RMSD of 10.35 Å. This
riboswitch was predicted poorly by all four other approaches, with RMSDs ranging from
13.92 Å to 24.08 Å. Among them, the model obtained with EvoClustRNA|Rosetta was the most
accurate, with an RMSD of 13.92 Å (Fig. 4.2.5).
Figure 4.2.5: (A) The native structure (PDB ID: 2GDI). Models generated by (B) Weinreb et
al. (C) SimRNAweb (D) EvoClustRNA|SimRNAweb (E) Rosetta (F) EvoClustRNA|Rosetta.
Only model B shares the three-dimensional fold with the native structure. Among the other
models, the lowest RMSD is 13.92 Å (model F).
HIV reverse-transcription primer tRNA (PDB ID: 1FIR, RFAM ID: RF00005). The best
model for this tRNA structure was obtained by Weinreb et al. using DCA restraints (RMSD
8.58 Å). The most accurate model from the four other approaches was generated by
EvoClustRNA|SimRNAweb, with an RMSD of 10.35 Å (Fig. 4.2.6).
Figure 4.2.6: (A) The native structure (PDB ID: 1FIR). Models generated by (B) Weinreb
et al. (C) SimRNAweb (D) EvoClustRNA|SimRNAweb (E) Rosetta (F)
EvoClustRNA|Rosetta. Only model B shares the three-dimensional fold with the native
structure. Among the other models, the lowest RMSD is 10.35 Å (model D).
c-di-GMP-II riboswitch (cdiGMP, PDB ID: 3Q3Z, RFAM ID: RF01786). Similarly to the
THF riboswitch, this structure is a long helix that folds back on itself and forms a
pseudoknot. The best model, in terms of RMSD, was generated with
EvoClustRNA|SimRNAweb (RMSD 9.65 Å). However, the fold of this RNA is more complicated
than that of the THF riboswitch, and as a result, none of the methods generated this fold
correctly (Fig. 4.2.7).
Figure 4.2.7: (A) The native structure (PDB ID: 3Q3Z). Models generated by (B) Weinreb
et al. (C) SimRNAweb (D) EvoClustRNA|SimRNAweb (E) Rosetta (F)
EvoClustRNA|Rosetta. The RMSDs range from 9.65 Å to 14.53 Å.
Tetrahydrofolate riboswitch (THF, PDB ID: 4LVV, RFAM ID: RF01831). This structure
is a simple long helix that folds back on itself and forms a pseudoknot. The fold of this
RNA is relatively simple, and the models were predicted with RMSDs ranging from
4.83 Å to 12.22 Å. Interestingly, the DCA modeling (RMSD 8.84 Å) was outperformed by
Rosetta (RMSD 4.83 Å) and EvoClustRNA|Rosetta (RMSD 7.68 Å) (Fig. 4.2.8).
Figure 4.2.8: (A) The native structure (PDB ID: 4LVV). Models generated by (B) Weinreb
et al. (C) SimRNAweb (D) EvoClustRNA|SimRNAweb (E) Rosetta (F)
EvoClustRNA|Rosetta. Model E is the closest to the native structure, with an RMSD of 4.83 Å.
Adenosylcobalamin riboswitch - RNA Puzzle 6 (COB, PDB ID: 4GXY, RFAM ID:
RF00174). The structure of this riboswitch was solved experimentally with the ligand bound.
Since none of the methods explicitly predicts RNA-ligand interactions, all generated models
were far from the native structure, with RMSDs ranging from 31.02 Å to 33.39 Å (Fig. 4.2.9).
Figure 4.2.9: (A) The native structure (PDB ID: 4GXY). Models generated by (B)
SimRNAweb (C) EvoClustRNA|SimRNAweb (D) Rosetta (E) EvoClustRNA|Rosetta. Due to
missing RNA-ligand interactions, none of the models is close to the native structure (RMSDs
range from 31.02 Å to 33.39 Å).
ZMP (5-aminoimidazole-4-carboxamide ribonucleotide) riboswitch - RNA Puzzle 13
(ZMP, PDB ID: 4XW7, RFAM ID: RF01750). The best model of this short (71-nucleotide long)
riboswitch was obtained with SimRNAweb (RMSD 6.42 Å). EvoClustRNA improved the prediction
only in the case of EvoClustRNA|Rosetta, by 1.59 Å. The P2 helix (Figure 4.2.10, in green)
in the native structure interacts with the binding pocket where the ZMP ligand binds.
These interactions are missing in all predictions, as the P2 helix protrudes outward from
the binding pocket. Once again, missing RNA-ligand interactions hampered correct
modeling of the RNA.
Figure 4.2.10: (A) The native structure (PDB ID: 4XW7). Models generated by (B)
SimRNAweb (C) EvoClustRNA|SimRNAweb (D) Rosetta (E) EvoClustRNA|Rosetta.
L-glutamine riboswitch - RNA Puzzle 14 (GlnA, PDB ID: 5DDO, RFAM ID: RF01739).
The best model of this 61-nucleotide long riboswitch was obtained with
EvoClustRNA|SimRNAweb (RMSD 4.44 Å). In all predictions, a fragment of an E-loop
motif was used. The RMSDs of the models ranged from 4.44 Å to 6.54 Å (Fig. 4.2.11).
EvoClustRNA improved predictions in both modes, using models from SimRNAweb
(improvement of 0.27 Å) and Rosetta (improvement of 1.71 Å). This structure was
experimentally solved with the ligand. However, as in the case of the previous RNAs, the
ligand was not modeled explicitly.
Figure 4.2.11: (A) The native structure (PDB ID: 5DDO). Models generated by (B)
SimRNAweb (C) EvoClustRNA|SimRNAweb (D) Rosetta (E) EvoClustRNA|Rosetta. The
most accurate model of this riboswitch was generated with EvoClustRNA|SimRNAweb
(RMSD 4.44 Å).
Pistol ribozyme - RNA Puzzle 17 (Pistol, PDB ID: 5K7C, RFAM ID: RF02679). RNA-Puzzle 17
is the Pistol ribozyme, a 62-nucleotide long, self-cleaving ribozyme.
Figure 4.2.12: (A) The native structure (PDB ID: 5K7C). Models generated by (B)
SimRNAweb (C) EvoClustRNA|SimRNAweb (D) Rosetta (E) EvoClustRNA|Rosetta.
In all predicted models (Fig. 4.2.12), the substrate (Fig. 4.2.12, chain in red) is located
“behind” (SimRNAweb) or within (Rosetta) the molecule. The predictions reached RMSDs
around 12.5 Å, ranging from 12.17 Å to 12.72 Å.
4.3 rna-pdb-tools
To facilitate the daily work of a researcher working in RNA structural bioinformatics, a
project named rna-pdb-tools was initiated (https://github.com/mmagnus/rna-pdb-tools).
rna-pdb-tools is a Python library and a set of tools dedicated to handling and manipulation
of RNA structural files, such as (1) rebuilding of missing atoms in RNA structures, (2)
structural clustering, (3) standardization of PDB formats to comply with the format required
by RNA-Puzzles, (4) visualization of RNA secondary structures and drawing RNA arc diagrams
of secondary structure, triggered from Python scripts or Jupyter Notebooks, and much more.
Additionally, rna-pdb-tools should be considered a library of functions rather than a closed
program with one fixed set of functionalities. rna-pdb-tools is a framework of various
functions, and if needed, the user is invited to extend it with his/her own scripts on top
of the existing package. In this way, it is possible to adapt the framework to a specific
case, for example, to add a particular parser or converter needed for a very specific
application.
Furthermore, to ensure the quality control of the code, the software is tested
automatically by Travis CI every time a change is detected. Travis CI is a hosted,
distributed, continuous integration service used to build and test software projects
(https://travis-ci.org/). To verify the correctness of all operations performed by the
software, a set of input files is prepared. During each test, the input files are processed
to get output files, and the new output files are compared with the previously generated
(expected) output files.
rna-pdb-tools is a core part of my other projects: NPDock (RNA/DNA-protein docking
method, http://genesilico.pl/NPDock/) (Tuszyńska et al. 2015), SimRNAweb (RNA 3D
structure prediction method) (Magnus et al. 2016), EvoClustRNA, and mqapRNA. The
rna-pdb-tools package has been recognized by the organizers of the RNA-Puzzles, and it has
been suggested on the homepage of the experiment as an approved tool to process structures
for the contest (http://ahsoka.u-strasbg.fr/rnapuzzles/). The step-by-step tutorial that
explains how to prepare files for submission to the RNA-Puzzles can be found at
https://rna-pdb-tools.readthedocs.io/en/latest/rna-puzzles.html.
The central part of the package is the rna_pdb_tools_lib library and the rna_pdb_toolsx.py
script. The script uses functions coded in the main library and serves as an interface to
run them from the command line. The full list of operations of rna_pdb_toolsx.py can be
displayed by passing “-h” as an argument of the command:
$ rna_pdb_toolsx.py -h
usage: rna_pdb_toolsx.py [-h] [--version] [-r] [-c] [--is_pdb] [--is_nmr]
[--un_nmr] [--orgmode] [--get_chain GET_CHAIN]
[--fetch] [--fetch_ba] [--get_seq] [--get_ss]
[--rosetta2generic] [--get_rnapuzzle_ready] [--rpr]
[--no_hr] [--renumber_residues]
[--dont_rename_chains] [--dont_fix_missing_atoms]
[--dont_report_missing_atoms] [--collapsed_view]
[--cv] [-v] [--replace_hetatm] [--inplace]
[--edit EDIT] [--delete DELETE]
file [file ...]
rna_pdb_tools - a swiss army knife to manipulation of RNA pdb structures
Usage
$ for i in *pdb; do rna_pdb_toolsx.py --delete A:46-56 $i > ../rpr_rm_loop/$i ; done
$ rna_pdb_toolsx.py --get_seq *
# BujnickiLab_RNApuzzle14_n01bound
> A:1-61
# BujnickiLab_RNApuzzle14_n02bound
> A:1-61
CGUUAGCCCAGGAAACUGGGCGGAAGUAAGGCCCAUUGCACUCCGGGCCUGAAGCAACGCG
[...]
positional arguments:
file file
optional arguments:
-h, --help show this help message and exit
--version
-r, --report get report
-c, --clean get clean structure
--is_pdb check if a file is in the pdb format
--is_nmr check if a file is NMR-style multiple model pdb
--un_nmr Split NMR-style multiple model pdb files into
individual models [biopython]
--orgmode get a structure in org-mode format <sick!>
--get_chain GET_CHAIN
get chain, .e.g A
--fetch fetch file from the PDB db
--fetch_ba fetch biological assembly from the PDB db
--get_seq get seq
--get_ss get secondary structure
--rosetta2generic convert ROSETTA-like format to a generic pdb
--get_rnapuzzle_ready
get RNApuzzle ready (keep only standard atoms).Be
default it does not renumber residues, use
--renumber_residues [requires biopython]
--rpr alias to get_rnapuzzle ready)
--no_hr do not insert the header into files
--renumber_residues by default is false
--dont_rename_chains used only with --get_rnapuzzle_ready. By default
--get_rnapuzzle_ready rename chains from ABC.. to stop
behavior switch on this option
--dont_fix_missing_atoms
used only with --get_rnapuzzle_ready
--dont_report_missing_atoms
used only with --get_rnapuzzle_ready
--collapsed_view
--cv alias to collapsed_view
-v, --verbose tell me more what you're doing, please!
--replace_hetatm replace 'HETATM' with 'ATOM' [tested only with
--get_rnapuzzle_ready]
--inplace in place edit the file! [experimental, only for
get_rnapuzzle_ready, delete, get_ss, get_seq]
--edit EDIT edit 'A:6>B:200', 'A:2-7>B:2-7'
--delete DELETE delete the selected fragment, e.g. A:10-16
The functions of the package can also be imported into one’s own projects. For example, the
“--is_pdb” check can be run from the command line:
$ rna_pdb_toolsx.py --is_pdb input/1I9V_A.pdb
True
$ rna_pdb_toolsx.py --is_pdb input/image.png
False
but also from a Python script:
>>> from rna_pdb_tools_lib import *
>>> s = RNAStructure('input/1I9V_A.pdb')
>>> s.is_pdb()
True
rna-pdb-tools can also be used from the Emacs text editor (Fig. 4.3.1), and some of its
functions can be executed via plugins in my note-taking system Geekbook
(https://github.com/mmagnus/geekbook).
Figure 4.3.1: rna-pdb-tools can also be run from Emacs. A researcher can edit a PDB file
using the text-oriented functionality of this editor and then, without leaving the editor,
apply the RNApuzzle function to standardize the file.
A list of example command-line utils included in the package:
The user: “I want to”                               Command-line utils to use
get a sequence based on a PDB file                  rna_pdb_toolsx.py --get_seq *.pdb
download a PDB file                                 rna_pdb_toolsx.py --fetch <PDB id>
compare the text content of PDB files               diffpdb.py <fn1.pdb> <fn2.pdb> (Fig. 4.3.4)
annotate secondary structure of my PDB files        clarna_app.py - a wrapper to ClaRNA, or
                                                    rna_x3dna.py - a wrapper to 3dna
calculate RMSDs between the target file and a set   rmsd_calc_to_target.py -t <target.pdb> *.pdb
of other files
compare interaction networks (base pairs) between   rna_calc_inf.py -t <target.pdb> *.pdb
two 3D structures                                   - a wrapper to ClaRNA
filter a set of PDB files to select ones that       rna_filter.py -s <restraints.txt> -s *.pdb
fulfil required distance restraints                 - calculate distances based on given restraints
                                                      on PDB files or SimRNA trajectories
refine my models                                    rna_refinement.py -n <steps> *.pdb
                                                    - a wrapper around QRNAS
merge single files into an NMR-style                rna_pdb_merge_into_one.py *.pdb > out.pdb
multiple-model PDB file
model an RNA sequence with Rosetta and process      a set of tools to work with Rosetta:
output files                                        rna_rosetta_check_progress.py,
                                                    rna_rosetta_cluster.py, rna_rosetta_min.py,
                                                    rna_rosetta_run.py
process output files of SimRNA/SimRNAweb            a set of tools to work with SimRNA:
                                                    rna_simrna_cluster.py, rna_simrna_extract.py,
                                                    rna_simrna_lowest.py
download output files from the SimRNAweb server     rna_simrnaweb_download_job.py <job id>
for a given job id
edit occupancy or B-factor in only a part of a      e.g. rna_pdb_edit_occupancy_bfactor.py --occupancy
PDB file                                            --select A:1-40,B:1-22 --set-to 0 <pdb.pdb>
edit a part of a chain (change fragment A:1-75      e.g. rna_pdb_toolsx.py --edit 'A:1-75>A:7-81'
to A:7-81)                                          3q3z_rpr.pdb > 3q3z_rpr_A7-81.pdb
add missing bases                                   rna_pdb_toolsx.py --get_rnapuzzle_ready <fn.pdb>
                                                    (Fig. 4.3.2)
Figure 4.3.2: rna_pdb_toolsx.py is able to rebuild a missing base (drawn with thin lines) to
complete a structure.
Figure 4.3.3: rna-pdb-tools comes with detailed documentation that can be viewed online
or as a PDF file.
The rna-pdb-tools package is published as an open source project (GPL-3.0 license); thus, it
can be widely used, deployed, and modified by the scientific community. rna-pdb-tools is
documented in online documentation and tutorials that walk the user through various use
cases (http://rna-pdb-tools.readthedocs.io/en/latest/), as well as in a PDF manual
(over 130 pages as of September 2017) (Fig. 4.3.3).
Figure 4.3.4: diffpdb.py is a tool to detect differences in formatting between two PDB files.
First, the tool removes columns of coordinates, and next compares only columns with
annotation (atom naming, numbering).
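The idea of comparing only the annotation columns can be sketched in a few lines of Python.
The sketch relies on the fixed-column PDB format, in which the coordinates start at column
31, and uses the standard difflib module; it is an illustration of the principle under these
assumptions, not the implementation of diffpdb.py, and the function names are hypothetical.

```python
import difflib
from typing import List


def strip_coordinates(pdb_text: str) -> List[str]:
    """Keep only the annotation part (columns 1-30) of ATOM/HETATM records.

    Columns 1-30 hold the record name, atom serial number, atom name,
    residue name, chain ID and residue number; coordinates follow them.
    """
    kept = []
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM", "HETATM")):
            kept.append(line[:30].rstrip())
    return kept


def diff_annotation(pdb_a: str, pdb_b: str) -> List[str]:
    """Return the annotation lines that differ between two PDB files.

    Lines present only in the first file are prefixed with '- ',
    lines present only in the second file with '+ '.
    """
    diff = difflib.ndiff(strip_coordinates(pdb_a), strip_coordinates(pdb_b))
    return [line for line in diff if line[:2] in ("- ", "+ ")]
```

Two models of the same molecule with different coordinates give an empty list, while a
difference in atom naming or numbering (for example, O1P versus OP1) is reported as a pair
of lines.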
The rna-pdb-tools package was extended with a set of functions for
analyzing/editing/formatting RNA alignments. A set of operations that can be done with
rna-pdb-tools is shown in Fig. 4.3.5 and can be explored in a Jupyter notebook available at
https://github.com/mmagnus/rna-pdb-tools/blob/master/rna_pdb_tools/utils/rna_alignment/rna_alignment.ipynb.
A user can easily load a new alignment, subset columns or sequences (rows), save a subset to
a new file, plot an RChie plot, get a secondary structure and a sequence for each of the
sequences in an alignment (Fig. 4.3.5), and more. The functions can be imported into a
user’s own Python scripts and also into a Jupyter notebook. The scripts were used to process
the data (RNA alignments, secondary structures, and tertiary structures) for the database
and classification system of RNA families, RNArchitecture
(http://genesilico.pl/RNArchitecture) (Boccaletto et al. 2017).
Figure 4.3.5: A fragment of the demo of the RNA alignment functionality implemented in
rna-pdb-tools. Top: a user can load a new alignment and plot an RChie plot. Bottom: a user
can also get a secondary structure and a sequence for a row taken from an alignment (gaps
are removed) in the text format or get a visualization using VARNA. The functions can be
imported into a user’s own Python scripts and also into a Jupyter notebook (as shown in the
figure).
5 Discussion
5.1 mqapRNA
The aim of the first project described in this work was to facilitate the selection of the
most accurate RNA 3D models from a pool of models obtained with various RNA 3D
structure prediction methods. The new method, mqapRNA, is a meta-predictor, which
combines the existing methods and uses a deep learning model to take advantage of their
combined strengths and to eliminate their individual weaknesses. In the benchmark presented
in this study, mqapRNA (on average) outperformed other existing methods, and at this stage,
the method is a good starting point for further statistical model optimization and improved
training on even bigger datasets of more diverse structures. In addition, mqapRNA allows for
interactive refinement of the predictions by applying distance restraints obtained from
experimental methods or evolutionary analysis, and by using secondary structure
information. The method is available as an easy-to-use web server.
However, it is important to realize how much theoretical datasets generated for method
development can differ from cases encountered in practice. The benchmark showed that, although
all the methods perform very well on theoretical decoys, they perform poorly when scoring
models created by scientists in a real-life scenario, e.g., for the RNA-Puzzles targets. One
reason could be that the models submitted by the participating groups have patterns of
distortions that the datasets do not account for. A second reason could be that, if a 3D RNA
structure prediction starts from the sequence only, it is very hard to reach models accurate
enough to be scored efficiently by the existing methods.
The benchmark devised for this study highlights the importance of a correct secondary
structure. A correct secondary structure can be used as a reasonable evaluation method and
can help to identify models of poor quality. The secondary structure prediction is a complex
problem on its own, and one should understand how to apply experimental data to obtain an
accurate prediction. The RNA-Puzzles publications describe cases where a wrong secondary
structure led to wrong three-dimensional models.
mqapRNA can only be useful if it is applied to score accurate RNA 3D structural models.
However, the results of the RNA-Puzzles show that accurate modeling of RNA is still very
challenging and that the accuracy of the obtained models is still far from that of near-native
structures.
5.1.1 Similar tools or approaches
The program uses a deep learning statistical model to interpret the outputs of primary methods
and to provide quality predictions. According to the benchmark, the method outperforms other
existing methods. However, there is still ample room for improvement, in particular for models
with non-trivial distortions such as those in the RNA-Puzzles. One way forward would be to use
more diverse decoys, which could improve the quality and consistency of predictions and
clarify what distinguishes a good model of RNA from a bad one. A robust feature selection
analysis of the statistical model could help to identify the factors that are critical for
accurate assessment. The accuracy of the method strongly depends on the training set. mqapRNA
was not trained on decoys generated by SimRNA or Rosetta, or on experimentally observed
intermediates, which could probably improve the statistical model.
Another direction of development would be to add new primary methods that can score RNA
structural models. This is a very active field, and more methods can be expected within the
next few years.
mqapRNA can be run as an easy-to-use web server, similarly to WebRASP (Norambuena et al. 2013)
(http://melolab.org/webrasp/), the web server for RASP. RASP is very fast, easy to install,
and can be used for quality prediction. RNA KB is a force field in the GROMACS (Van Der Spoel
et al. 2005) package for molecular dynamics simulations, which makes it hard to use for
researchers without prior experience in running such simulations. The εSCORE and 3RNAscore
methods are easy to install and run; however, according to the benchmark, they do not perform
as well as the best method, mqapRNA.
mqapRNA predicts the “global quality” of models and provides just one score per structure.
However, one can think of a tool that assesses quality at the level of individual residues.
This approach is called “local quality assessment” and could be developed and tested in future
implementations of mqapRNA. Such functionality is implemented in Meta-MQAP (Pawlowski et al.
2008), an analogous tool for protein quality assessment. With this kind of local quality
assessment, it would be possible to (1) detect misfolded parts of an RNA and apply further
optimization, (2) replace a given molecular fragment with a new one from a database of
fragments, or (3) refold the RNA entirely using SimRNA. QA-RecombineIt (Pawlowski et al. 2013)
is a method, developed in the Bujnicki laboratory, that assesses the quality of protein 3D
structure models and improves the accuracy of these models by merging fragments of multiple
input models.
5.2 EvoClustRNA
An efficient scoring method can be applied successfully only if the pool of input models
contains near-native structures. The analysis of the decoys from the RNA-Puzzles experiment
suggests that more accurate methods for RNA 3D structure prediction are needed to begin with.
To facilitate RNA structure prediction, EvoClustRNA, a new evolutionary approach to RNA 3D
structure prediction, was implemented and benchmarked.
EvoClustRNA could be tested with models produced by other modeling methods, e.g.,
RNAComposer (Popenda et al. 2012), MC-Sym|MC-Fold (Parisien and Major 2008),
iFoldRNA (Ding et al. 2008), etc.
EvoClustRNA could be potentially improved by a different set of parameters for clustering.
The procedure of selecting homologs could also be investigated and its variants tested.
Clustering visualized with Clans could improve the final selection of models.
However, combining EvoClustRNA with a DCA analysis would be the most beneficial. This is also
the direction in which the protein version of the methodology has gone (Richard Bonneau,
private communication).
One of the drawbacks of EvoClustRNA is the alignment preparation. At the current stage,
alignments were prepared manually with some help of scripts from the rna-pdb-tools package.
This could be further simplified with a new script developed as part of the package.
EvoClustRNA improved the results in some cases. However, EvoClustRNA depends strongly on the
initial models, which makes it as limited as the original predictive methods. Thus, the major
current challenges in RNA structure prediction lie in improving the algorithms for (1) the
prediction of RNA-ligand interactions, (2) the prediction of non-canonical interactions, and
(3) loop modeling.
5.2.1 Similar tools or approaches
EvoClustRNA was inspired by a similar approach used in protein structure prediction (Bonneau
et al. 2001). The approach is still used for protein structure prediction (Richard Bonneau,
private communication) and was applied, for example, to model structures for major protein
families (Bonneau et al. 2002). To the best of my knowledge, EvoClustRNA is the first
application of this methodology to RNA.
However, there are other ways in which RNA sequence alignments could be used to improve
tertiary structure prediction.
The first one is to use evolutionary restraints as described by Weinreb and coworkers
(Weinreb et al. 2016) and De Leonardis and coworkers (De Leonardis et al. 2015). These
methods require alignments with over 1000 sequences (De Leonardis et al. 2015) to provide
sufficient statistics for detecting nucleotide coevolution, which is not always possible. In
addition, these methods are very sensitive to false positives, which can result in wrong
models. In contrast, EvoClustRNA can be used even when only a few (3-5) homologs are available.
The second way is to apply methods such as RMdetect (Cruz and Westhof 2011), to detect
RNA motifs from an RNA alignment. However, this approach gives only information about
some part of an RNA molecule, and a tertiary structure prediction method must be used to
obtain a full-length model.
5.3 rna-pdb-tools
Structural bioinformatics of RNA is a relatively young area of science that is struggling with
a lack of bioinformatics tools to facilitate the daily work of a researcher. The main problem
with the existing tools is that there is no universal parser that solves all the problems one
might have when working with PDB files and that suits the needs of various users.
There are already many libraries developed for researchers to work with PDB structures, in
languages such as R (Bio3d (Grant et al. 2006)), Haskell (hPDB (Gajda 2013)), and Python
(BioPython (Cock et al. 2009), PyCogent (Knight et al. 2007)). The problem with these
packages is that they are primarily designed to work on protein structure files. In principle,
protein structure files are not different from RNA ones. However, for everyday work,
researchers working on RNA structures need a set of RNA-related functions, such as
preparation of a structure for the RNA-Puzzles competition, preparation for a SimRNA
simulation, extraction of the secondary structure, etc. Several parsers of RNA structure files
are available to the scientific community, including a set of tools that comes with Rosetta,
developed by professor Rhiju Das and coworkers
(https://www.rosettacommons.org/docs/latest/application_documentation/rna/RNA-tools),
and forgi, developed by Peter Kerpedjiev and coworkers (Kerpedjiev et al. 2015)
(https://github.com/ViennaRNA/forgi); both are written in Python. However, RNA-tools is
intended to work on input and output files for Rosetta and is not designed as a complete
package. Forgi is a Python library for manipulating RNA secondary structure and can solve
only a limited set of problems.
rna-pdb-tools provides an easy-to-adapt framework for a user’s own tools. Just by copying,
pasting, and then modifying existing code, a user can build a new application very quickly.
rna-pdb-tools also shows how third-party tools can be efficiently wrapped into command-line
utilities, e.g., ClaRNA (Waleń et al. 2014). ClaRNA is a classifier of contacts in RNA 3D
structures. The program is written in Python and, due to its single-threaded architecture, it
is relatively slow. rna-pdb-tools includes a wrapper around ClaRNA that can run multiple
instances of ClaRNA on all available processors and thus make the whole procedure much faster.
Moreover, rna_calc_inf.py provides the same interface for inputs as rna_calc_rmsd.py, a script
for calculating RMSDs. For that reason, both utilities can be run in a similar way:
“rna_calc_inf.py -t <native.pdb> *.pdb”, “rna_calc_rmsd.py -t <native.pdb> *.pdb”,
which simplifies composing complex workflows.
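The principle of running a single-threaded tool on many files at once can be sketched as follows. This is a minimal illustration and not the code of the wrapper shipped with rna-pdb-tools. The names run_on_all_cores and score_one are hypothetical, and score_one stands for any function that processes one file, for example by launching an external program with subprocess.run. A thread pool is assumed to be sufficient here because, when each call starts an external program, the actual work is done in separate child processes.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def run_on_all_cores(score_one, paths, workers=None):
    """Apply score_one(path) to every path, several at a time.

    Returns a dictionary {path: result}. By default, as many workers
    are used as there are processors reported by the operating system.
    """
    paths = list(paths)
    workers = workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns the results in the order of the inputs
        results = list(pool.map(score_one, paths))
    return dict(zip(paths, results))


# toy usage: the "tool" here only measures the length of the file name
print(run_on_all_cores(len, ["a.pdb", "bb.pdb"]))
```

Because the function takes a list of files and returns one value per file, it can sit behind the same “-t <native.pdb> *.pdb” style of interface regardless of which tool is being wrapped.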
rna-pdb-tools can be used both as a set of command-line tools and in a Jupyter Notebook
(https://jupyter.org/) (formerly IPython (Pérez and Granger 2007)). The Jupyter Notebook is an
open-source web application that allows users to create and share documents that contain live
code (in languages such as Python, R, and Scala), equations, visualizations, and explanatory
text. The functions implemented in rna-pdb-tools can be imported into such notebooks to create
reproducible analyses that can be uploaded online and shared with the RNA structural
bioinformatics community. One such notebook is distributed together with the rna-pdb-tools
package and illustrates the steps performed by the Bujnicki group to collect information about
the RNA-Puzzle 18 problem
(https://github.com/mmagnus/rna-pdb-tools/blob/master/rp18.ipynb) (Fig. 5.3.1). The notebook
reports the results of various secondary structure prediction methods and a successful hit for
the target sequence in the PDB database. The structure found in the PDB database, an
Xrn1-resistant RNA from the 3' untranslated region of a flavivirus (PDB ID: 4PQV) (Chapman et
al. 2014), turned out to be a homolog of the RNA-Puzzle 18 target and was used for comparative
modeling. Given the problem of reproducibility in bioinformatics (Sandve et al. 2013),
rna-pdb-tools combined with Jupyter notebooks seems to be a valuable way to help scientists
share their analyses, e.g., protocols used for modeling in the RNA-Puzzles challenge, so that
they can later be reproduced by others.
Figure 5.3.1: The Jupyter notebook (a part of the whole notebook) for the RNA-Puzzle 18
problem. The notebook reports steps of a bioinformatical analysis to collect information
about the target sequence, such as: secondary structure predictions using three different
methods and a BLAST search on the PDB database that led to the detection of a homolog
used later for comparative modeling.
5.3.1 Future directions
In the future, the rna-pdb-tools package could be merged with BioPython. rna-pdb-tools
already uses the Bio.AlignIO module of the BioPython package internally. The next step in the
development of rna-pdb-tools would be to bring the functions implemented in my package to
BioPython to provide a unified package for structural bioinformatics.
Another direction of development would be to modify some functions so that they also work
on mmCIF (macromolecular Crystallographic Information File) files.
rna-pdb-tools also needs even better documentation and better tests. Hopefully, the still
small but growing community of rna-pdb-tools users will contribute to improving the
documentation and adding new tests.
There is a need for a new comprehensive workflow for RNA structure prediction. Building such
workflows is a huge problem in the field because of incompatibilities in input and output
data formats. This fragmentation of tools in bioinformatics makes it difficult to combine them
efficiently into a full setup for a complete analysis. Even the variants of the PDB format
used by the methods to define models may be very different. An enormous amount of time in the
development of mqapRNA was spent on preparing tools that process input files and convert them
into formats accepted by the existing scoring methods. The realization of this task led to the
development of rna-pdb-tools as a practical converter of one set of formats into another. To
build complex workflows, we need well-written wrappers around tools that expose unified
interfaces and allow for building complete pipelines. The documentation of rna-pdb-tools could
be a place to describe the tools and fill gaps in their original documentation. Such workflows
could be implemented as a set of command-line tools or as an IPython Notebook in which a
Python script controls the flow of programs and data.
“... scientific programming does not compute” (Merali 2010). In my opinion, this is very true.
Merali’s paper described cases where wrong implementations caused retractions of
publications. What can be done so that scientific programming does compute? Write clean code,
document it, and test it. I hope that rna-pdb-tools will serve as an example of scientific code
that computes. More about rna-pdb-tools can be found at
https://media.readthedocs.org/pdf/rna-pdb-tools/latest/rna-pdb-tools.pdf.
5.4 Potential limitations of the RNA 3D structure prediction methods
Based on the results of RNA 3D structure prediction runs, the potential limitations of the
predictive methods will be discussed in this section.
5.4.1 RNA-ligand interactions
The structure of the adenosylcobalamin riboswitch (RNA Puzzle 6) was experimentally solved
with the ligand bound. Rosetta and SimRNA do not explicitly predict RNA-ligand interactions;
therefore, all the predicted models were in an “unfolded” conformation and the RMSDs were
high.
Figure 5.4.1. The native structure (PDB ID: 4GXY) solved with the ligand (indicated by the
arrow).
To test whether any interactions that could improve this modeling can be detected, a DCA
analysis was conducted (as described in (Weinreb et al. 2016)). A set of interactions was
detected (Fig. 5.4.2); however, none of them occurred between the ligand-bound structured core
of the RNA and the bent peripheral domain (Fig. 5.4.2, in yellow). This suggests that DCA
restraints would not improve the prediction by bringing these two parts closer in space.
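The check described above amounts to asking whether any predicted contact links a residue of the core with a residue of the peripheral domain. A minimal sketch of such a test is given below. The function name contacts_between is hypothetical, and the contact list and residue ranges in the usage example are made-up placeholders, not the values obtained for the adenosylcobalamin riboswitch.

```python
def contacts_between(contacts, domain_a, domain_b):
    """Return the contacts (i, j) that bridge two sets of residue numbers."""
    a, b = set(domain_a), set(domain_b)
    return [(i, j) for i, j in contacts
            if (i in a and j in b) or (i in b and j in a)]


# made-up example: every contact stays within one of the two domains
core = range(1, 101)
peripheral = range(101, 151)
predicted = [(5, 40), (12, 88), (110, 140)]
print(contacts_between(predicted, core, peripheral))  # prints []
```

An empty result, as in this toy case, corresponds to the situation in which the predicted contacts carry no information about the relative placement of the two domains.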
Figure 5.4.2: The results of a DCA analysis performed for the adenosylcobalamin
riboswitch. The bars represent interactions detected by the DCA analysis (the structure is
made transparent to highlight the bars). The red box indicates the interface between the core
and the peripheral domain, where no interactions were predicted.
5.4.2 Non-canonical interactions
tRNAs are difficult to model in silico because they form many non-canonical interactions (Fig.
5.4.3), which SimRNA and Rosetta are not able to predict correctly. Moreover, most tRNAs
contain modified nucleotides. Since SimRNA and Rosetta do not model modified nucleotides, an
“unmodified” sequence consisting only of A, G, C, and U residues was used for the modeling.
For these two reasons, the DCA-based modeling was expected to outperform other predictive
approaches.
Figure 5.4.3: A network of canonical and non-canonical interactions depicted using the
Leontis/Westhof classification obtained with RNAView (Yang et al. 2003) for the structure
of tRNA (PDB id: 1FIR).
The thiamine pyrophosphate-sensing riboswitch binds directly to thiamine pyrophosphate
(TPP) to regulate gene expression through a variety of mechanisms in archaea, bacteria and
eukaryotes. The high RMSDs of the predicted models can be explained by the lack of key
interactions in generated models.
Figure 5.4.4: Secondary/tertiary structure presentation in the Leontis–Westhof nomenclature.
Two non-canonical interactions A69-C38 and A69-C22 (highlighted in red) were not
predicted by SimRNA or Rosetta (Lang et al. 2007).
The bound ligand keeps two helices (P3 and P5) together, and the binding is stabilized by two
non-canonical interactions, A69-C38 and A69-C22 (Fig. 5.4.4) (Lang et al. 2007). At the
current stage of the development of SimRNA and Rosetta, this type of interaction is
impossible to predict. Hence, the DCA-based modeling in this case outperformed other
approaches, as expected.
5.4.3 Loop modeling
The Pistol ribozyme (RNA Puzzle 17) includes a conserved region with an A-minor motif.
The AAA trinucleotide (Fig. 5.4.5A, red) interacts with the minor groove of the P1 stem
(Fig. 5.4.5A, green) in the native structure. However, the motif was not formed in any of the
predictions (Fig. 5.4.5B-E, red). The ribozyme cleaves the substrate and there is a very sharp
bend in the backbone involving the G53-U54 cleavage site (Fig. 5.4.5A, yellow). This bend
was not predicted by any of the methods used. The structure includes a six-base-pair
pseudoknot involving complementary loop segments between the hairpin and the internal
loops with the pseudoknot duplex positioned between stems P1 and P3. This pseudoknot was
used as an input for modeling and it was accurately modeled in all the predictions (Fig.
5.4.5A-E, violet).
Figure 5.4.5: Color-coded: G53-U54 cleavage site (yellow), P1 (green), pseudoknot (violet),
P2 (blue), loops (dark blue) (A) the native structure (PDB ID: 5K7C), and models generated
by (B) SimRNAweb (C) EvoClustRNA|SimRNAweb (D) Rosetta (E) EvoClustRNA|Rosetta.
Figure 5.4.6: Superposition of all predicted (A) P1 stems and pseudoknots, (B) P2 stems, (C)
P3 stems. All the fragments are of good accuracy (RMSDs up to 3.5 Å).
Figure 5.4.7: Fragments of stems P1 with pseudoknots and single-stranded regions extracted
from all the predictions. A conserved region with the AAA trinucleotide (red) is interacting
with the minor groove of the P1 stem (green) in the native structure. However, the motif was
not formed in any of the predictions.
Interestingly, although the models looked very different from each other, the extracted
fragments (pseudoknots with P1 stems, P2 stems, and P3 stems) were very similar when
superimposed, with RMSDs up to 3.5 Å (Fig. 5.4.6). Even the P1 stem with the pseudoknot, 22
nucleotides in total, was predicted accurately in all models, with RMSDs between 2.97 Å and
3.41 Å.
The largest deviations were observed in the loop, with RMSDs ranging from 9.97 Å to 11.59
Å (Fig. 5.4.7). Loops are difficult to model because they are usually formed owing to
non-canonical interactions and/or RNA-ligand interactions. Moreover, there are fewer loops
than helical regions in the PDB database; therefore, statistical potentials might have more
problems modeling them correctly. Loops are peculiar, and with the accumulation of new
experimentally solved RNA structures, predictive methods are expected to generate better
predictions.
5.4.4 Sampling of conformational space
To test whether the pool of 500 structures of homologs contained structures sharing the
topology of the native structure, the results of clustering were visualized with Clans
(Frickey and Lupas 2004) (Fig. 5.4.8). Clans uses a version of the Fruchterman–Reingold graph
layout algorithm to visualize pairwise sequence similarities in either two-dimensional or
three-dimensional space. The program was designed to calculate pairwise attraction values to
compare protein sequences; however, it is possible to load a matrix of precomputed attraction
values and thereby display any kind of data based on pairwise interactions. Therefore, the
Clanstix program from the rna-pdb-tools package was used to convert the all-vs-all distance
(RMSD) matrix, calculated between the fragments selected for clustering from the
EvoClustRNA|SimRNAweb run, into an input file for Clans. The results are shown in Fig. 5.4.8.
In this clustering visualization, 100 models of each of five homologs are shown (each homolog
is uniquely colored; models of the target sequence are colored in lime). Models with a
pairwise RMSD lower than 6 Å are connected. The native structure was added to this clustering
(Fig. 5.4.8A, big dot) to see where it would be mapped. Interestingly, the native structure
was mapped to a small cluster. In this cluster, there are three models of the target sequence.
The model closest to the center of this cluster (Fig. 5.4.8B) achieved an RMSD of 6.98 Å to
the native structure. This clustering visualization showed that models with the correct fold
were generated, but none of them was selected as the final prediction. The final prediction
was the center of the biggest cluster (Fig. 5.4.8C).
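The connectivity shown in such a plot can be derived from the all-vs-all RMSD matrix alone. The sketch below links two models when their RMSD is below a cutoff (6 Å by default, as in Fig. 5.4.8) and then collects the linked models into groups. It illustrates only this thresholding step. It is not the Clanstix code and it does not reproduce the layout computed by Clans. The function name connected_groups and the toy matrix are hypothetical.

```python
def connected_groups(rmsd, cutoff=6.0):
    """Group models that are linked by pairwise RMSDs below the cutoff.

    rmsd is a square, symmetric matrix (list of lists) of pairwise RMSDs.
    Returns lists of model indices, the biggest group first.
    """
    n = len(rmsd)
    neighbours = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if rmsd[i][j] < cutoff:
                neighbours[i].add(j)
                neighbours[j].add(i)

    groups, seen = [], set()
    for start in range(n):
        if start in seen:
            continue
        # collect everything reachable from this model
        stack, group = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            group.append(node)
            for nxt in neighbours[node] - seen:
                seen.add(nxt)
                stack.append(nxt)
        groups.append(sorted(group))
    return sorted(groups, key=len, reverse=True)


toy = [[0.0, 3.0, 8.0, 12.0],
       [3.0, 0.0, 5.0, 12.0],
       [8.0, 5.0, 0.0, 12.0],
       [12.0, 12.0, 12.0, 0.0]]
print(connected_groups(toy))
```

Note that models 0 and 2 of the toy matrix end up in the same group although their mutual RMSD is above the cutoff, because both are linked to model 1. Groups formed in this way can therefore be broader than clusters defined around a single center.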
Figure 5.4.8: Clustering visualized with Clans: (A) the native structure, (B) the model with
a fold close to the native, detected in a small cluster, (C) the biggest cluster with the model
that was returned as the final prediction.
An analogous analysis was performed on the results of clustering of the
EvoClustRNA|SimRNAweb run for the TPP riboswitch. Models with a pairwise RMSD lower than 9 Å
are connected. Interestingly, the native structure (Fig. 5.4.9A, big dot) was mapped to a
cluster of models of one of the homologs (Fig. 5.4.9, blue). The center of this cluster (Fig.
5.4.9B) achieved an RMSD (of helical, shared fragments) of 9 Å to the native structure. In
this cluster, there were no models of the target sequence. Since SimRNAweb was not able to
detect non-canonical interactions, most of the structures were in an “open” conformation and
clustered far from the native structure. The final prediction (Fig. 5.4.9C) achieved an RMSD
of 24.08 Å with respect to the native structure.
Figure 5.4.9: Clustering visualized with Clans: (A) the native structure, (B) the model with
a fold close to the native, (C) the biggest cluster with the model that was returned as the
final prediction.
These two analyses showed that SimRNAweb was able to sample the conformational space
efficiently and that near-native structures were generated during the simulations. The
incorrect predictions resulted from the inability of the energy function to score the models
properly.
6 Conclusions
RNAs are among the key molecules of life and are involved in a number of highly important
biological processes, from information storage through signaling to enzymatic activity, and
many others. RNAs can also serve as excellent tools and targets in medicine (e.g., miRNA
therapies, rRNAs as targets for antibiotics) and biotechnology (e.g., gene editing with
CRISPR-Cas9).
To perform their functions, complex RNA molecules must fold into specific structures. Since
high-resolution experimental techniques are not always applicable, in this study two new
methods for computational modeling were developed and their results were investigated.
A new scoring method, mqapRNA, was developed and shown to be relatively efficient in
scoring 3D structural models. To provide accurate models for scoring, EvoClustRNA, a new
evolutionary approach to RNA 3D structure prediction, was implemented and benchmarked.
The realization of these two projects would not have been possible without rna-pdb-tools, a
toolbox of various scripts that allow for fast building of new applications and efficient data
management.
mqapRNA and EvoClustRNA depend strongly on the initial models and suffer from the
limitations of the predictive methods identified in this study: (1) lack of prediction of
RNA-ligand interactions (Fig. 5.1A) and (2) of non-canonical interactions (Fig. 5.1B), and
(3) difficulties in modeling loops (Fig. 5.2C).
The results described in this thesis suggest that there is a need for a more holistic and
thoughtful pipeline for RNA structure prediction (Fig. 5.1D), which must include:
methods for homology search and sequence alignment preparation,
methods for secondary structure predictions based on sequence alignments and
methods for local motif detection (to use them as fragments in prediction),
methods for RNA 3D structure prediction with the capability to predict RNA-ligand
and non-canonical interactions with the aid of experimental or evolutionary restraints,
which should be run to predict structures of a few homologs,
methods for scoring the obtained models to generate the final prediction.
In this work, three tools that could be included in the ultimate pipeline for RNA 3D
structure prediction were described.
Figure 5.4.1: Limitations of the predictive methods identified based on the results of this
study (A-C) and a description of the ultimate pipeline for RNA 3D structure prediction (D).
By exploring new ideas, developing new tools, and identifying limitations of the
current RNA 3D structure prediction methods, this work brings us closer to near-native
computational RNA 3D models.
7 Supplementary data
S1. List of all the sequences and secondary structures used in the
benchmark of EvoClustRNA and a list of links to the SimRNAweb
predictions
ade
> 1Y26:X|PDBID|CHAIN|SEQUENCE
CGCUUCAUAUAAUCCUAAUGAUAUGGUUUGGGAGUUUCUACCAAGAGCCUUAAACUCUUGAUUAUGAAGUG
(((((((((...((((((......[[.))))))........((((((]].....))))))..)))))))))
http://genesilico.pl/SimRNAweb/jobs/ade_pk-35b2a2c1/
> AAML04000013.1
UAUAACAUAUAAUUUUGACAAUAUGGGUCAUAAGUUUCUACCGGAAUACCGUAAAUAUUCUGACUAUGUAUA
((((.((((...((.((((.....[[)))).))........(.(((((]].....))))).)..))))))))
http://genesilico.pl/SimRNAweb/jobs/9c6339e0-591c-498d-9745-1a828f9ee81d/
> BA000028.3/1103960-1104044
UUUUCAUAUAAUCGCGGGGAUAUGGCCUGCAAGUUUCUACCGGUUUACCGUAAAUGAACCGACUAUGGAAA
(((.((((...(.(((((.....[[))))).)........(((((.(]].....).)))))..)))).)))
http://genesilico.pl/SimRNAweb/jobs/7bc1d432-eac8-47cf-a42e-aa3c89efc721/
> U51115.1/15606-15691
ACCUCAUAUAAUCUUGGGAAUAUGGCCCAUAAGUUUCUACCCGGCAACCGUAAAUUGCCGGACUAUGCAGG
.(..((((...(..((((.....[[))))..)........(((((((]].....)))))))..))))..).
http://genesilico.pl/SimRNAweb/jobs/e614e4a0-0898-45f2-9964-52db07279965/
> AAFV01000199.1/524-602
((((((((...((((((....[[)))))).........(((((]]....)))))...))))))))
http://genesilico.pl/SimRNAweb/jobs/2e496700-b989-4044-883d-d34257b022ab/
tpp
> tpp
gGACUCGGGGUGCCCUUCUGCGUGAAGGCUGAGAAAUACCCGUAUCACCUGAUCUGGAUAAUGCCAGCGUAGGGAAGUUc
(((((((((.((((.(((.....))))))......)..)))).....(((...((((......))))...)))..)))))
http://genesilico.pl/SimRNAweb/jobs/16662ebf-cf31-42d1-98a3-2aae31f28087/
>CP000050.1/1019813-1019911
CCGCCGAAGUGGGGGUACCACAGCACUGCUGCGGUUGAGAUAGUCCCUUCGAACCUGAUCCGGCUCAUACCGGCGUAGGGAAGCUUCGUUAGA
UGCGCU
.....(((((((((..(((.(.........).))).........)))).....(((...((((......))))...)))...)))))......
......
http://iimcb.genesilico.pl/SimRNAweb/jobs/aed2c40b-bb70-44a7-846d-b133359fc6bd/
>BX248356.1/234808-234920
ACGAGAUGCCCGGGUGCCAUGUGCUUGCUGUACGUGGCUGAGACGGCUGUUUGGCCGAACCGUAGAACCUGAUCUGGGUAAUACCAGCGAUAG
GAAGACUUCAUACUGUGACU
.....(.(.((((..((((...............)))).....((((......))))..))).....(((...((((......))))....))
)..).).)............
http://iimcb.genesilico.pl/SimRNAweb/jobs/0abbb76e-9cda-482f-abb2-94557e91acd8/
>AE017180.1/640928-641029
AUAGUCUGCUGGGGGAGUUCUUGGGAACUGAGACGGGCAACGCCCGAACCCUUUGAACCUGAUCCGGUUUAUACCGGCGUAGGGAAGCGGCCA
GAAACAAUC
.....(.(((.(((..((((....)))).....(((((...)))))..)))......(((...((((......))))...)))..))).)...
.........
http://iimcb.genesilico.pl/SimRNAweb/jobs/6bff10d7-d4ec-43ce-8f79-8f538fa1ae65/
>AL766847.1/75304-75402
CACAAGGGAGUGCCUUGAGCUGAGAUUGCAGAUAUGCAAAAUCCUCUAACCUGAUCUCGUUAGGACGAGCGUAGGAAUUGUG
(((((((((..((.....)).....(((((....)))))..))))....((((..((((......))))..))))..)))))
http://genesilico.pl/SimRNAweb/jobs/d2609d4d-bd6f-49fd-acbe-0ab278e0166b/
tRNA
>1fir
GCCCGGAUAGCUCAGUCGGAGAGCAUCAGACUUUUAAUCUGAGGGUCCAGGGUUCAAGUCCCUGUUCGGGCGCCA
(((((((..((((........))))..(((.........)))......(((((.......))))))))))))....
http://iimcb.genesilico.pl/SimRNAweb/jobs/a9bc516d-e3da-489d-93ef-5eb20e3f13c3/
>AF396436.1/47447-47513
GCCGCUUGGAUGGUUCCGGUGUGGGCUCAUUUCCCAUAACUAUAAAGUUCGAUUCUUUAAAGUGGCU
(((((((...(((..)))..(((((.......))))).....(((((.......)))))))))))).
http://iimcb.genesilico.pl/SimRNAweb/jobs/822df074-320e-4166-9fd1-8fbcf085908a/
>M57527.1/1-70
ACUCUUAUAGCUUAAUAUUAAAGUAUAGCGCUGAAAACGCUAAGAUGAACCCUAAAAAGUUCUAGGGGUA
(((((((..((((.......)))).(((((.......)))))....(((((......)))))))))))).
http://iimcb.genesilico.pl/SimRNAweb/jobs/613bcfcf-f513-4945-9cf4-6df7db04545e/
>AB009835.1/1-71
CAUUAGAUGACUGAAAGCAAGUACUGGUCUCUUAAACCAUUUAAUAGUAAAUUAGCACUUACUUCUAAUGA
(((((((..(((.......)))..((((.......))))......(((.((.......)))))))))))).
http://iimcb.genesilico.pl/SimRNAweb/jobs/cf61bea5-88c4-4e82-8042-dc04ce5cadcf/
>M26977.1/379-453
GGGGCCAUAGGGUAGCCUGGUCUAUCCUUUGGGCUUUGGGAGCCUGAGACCCCGGUUCAAAUCCGGGUGGCCCCA
(((((((..((((...........)))).(((((.......)))))....(((((.......)))))))))))).
http://iimcb.genesilico.pl/SimRNAweb/jobs/8ca21d4d-7ceb-4736-9619-7c1814c75637/
GMP
>gmp
gCGCGGAAACAAUGAUGAAUGGGUUUAAAUUGGGCACUUGACUCAUUUUGAGUUAGUAGUGCAACCGACCGUGCUgg
((((((..((......(((((((((......[[[[[[[.)))))))))...))....]]]]]..]]..))))))...
http://iimcb.genesilico.pl/SimRNAweb/jobs/faa97ed7/
>AE015927.1/474745-474827
AUUUUAAGAGGAAAUUUUGAACUAUAUACUUAUUUGGGCACUUUGUAUAUAGGGAGUUAGUAGUGCAACCGACCUUGAUUAAU
(((....((((.(((......((((((((......[[[[[[[..))))))))...)))...]]]]]..]]..))))....)))
http://genesilico.pl/SimRNAweb/jobs/e59064f8-ef9c-4c2c-864a-e20b4092cb03/
>ABFD02000011.1/154500-154585
AAAUAUUAUAGAGAUGUUGAAGUAUAUUCUAUUAUUGGGCACCUUAUGGAUAUACUGAGUCAGUGGUGCAACCGGCUAUGAAUAUA
.....((((((.(((......((((((((.......[[[[[[[....))))))))...)))...]]]]]..]]..)))))).....
http://genesilico.pl/SimRNAweb/jobs/5c0d22ec-c061-4567-aa68-3f8e5ac9ab46/
>BA000004.3/387918-388001
AAUCAAUAGGGAAGCAACGAAGCAUAGCCUUUAUAUGGACACUUGGGUUAUGUGGAGCUACUAGUGUAACCGGCCCUCCUUUAA
....(..((((.(((......((((((((.......[[[[[[[..))))))))...)))...]]]]]..]]..))))..)....
http://genesilico.pl/SimRNAweb/jobs/e5332c4d-e096-4d01-91f0-6b5ef2f92d37/
>AE000513.1/1919839-1919923
CUGUCGAAGAGACGCGAUGAAUCCCGCCCUGUAAUUCGGGCACCUCGGACGGGAGGAGCAAGUGGUGCGACCGGCUUUUCGUUGG
(((.(((((((..((......(((((.((........[[[[[[[..)).)))))...))....]]]]]..]]..))))))).)))
http://genesilico.pl/SimRNAweb/jobs/e462d8a5-7079-41df-b1bb-25edcb065cca/
THF
>thf
GGAGAGUAGAUGAUUCGCGUUAAGUGUGUGUGAAUGGGAUGUCGUCACACAACGAAGCGAGAGCGCGGUGAAUCAUUGCAUCCGCUCCA
((((....((((((((((((......(((((((...[[[[....))))))).....((....))))).)))))))))..]]]].)))).
http://genesilico.pl/SimRNAweb/jobs/7f0f8826/
>ACCL02000010.1/116901-116991
AGUAGAGUAGGUCUUAUACGUAAAGUGUCAUCGGAUGGGGAGACUUCCGGUGAACGAAGGGUUACCGCGUUAUAUGACCGCUUCCGCUACU
(((((....((((.((((......((.(((.((((..[[[[....)))).)))))...((....))....)))).))))..]]]].)))))
http://iimcb.genesilico.pl/SimRNAweb/jobs/a690ac93-1e57-4f25-9f63-aabf0700574d/
>ACKX01000080.1/10519-10620
UGCAGAGUAGAGAAUAAAGUGGUUAGUGCCCGACACACAGGGAGUUGGUGUCGAGACGAAGAGCCGAAUCGGUUCCCAGUUUUAUUUUCGCAU
CCCGCUGCC
(((((....(((((((((((.....((.((((((((...[[[[....))))))))))...((((.......))))...)))))))))))...]
]]].)))))
http://iimcb.genesilico.pl/SimRNAweb/jobs/cb6e7e4d/
>haq
UGCAAAAUAGGUUUCCAUGCGUCAAGUGUUUUGUGGAUGGGGAGUUGCCACAGAAACGAAAAGUCGGUUCGCGUGCGGACCGGACUUACGAUA
UGGUUACCGCACCCGUUGCA
(((((....(((..(((((......((.((((((((...[[[.....))))))))))...(((.....................)))....))
)))..)))...]]].)))))
http://genesilico.pl/SimRNAweb/jobs/497811c4/
>hcp
GGUAGAGUAGGUGUCUCGCGUUAAGUGCCAAGGGAUGGGACGUUGCCCUUGGACGAAAGCUAUUAAGAGCUGCGUUGGGACAUCGCGUUCGCU
AUC
(((((....((((((((((.....((.(((((((...[[[[....)))))))))...((((......))))...))))))))))..]]]].))
)))
http://iimcb.genesilico.pl/SimRNAweb/jobs/fae110a9/
RNA Puzzle 06
>4gxy AP006840.1/2074430-2074237
cggcaggugcucccgacccugcggucgggaguuaaaagggaagccggugcaaguccggcacggucccgccacugugacggggagucgccccuc
gggaugugccacuggcccgaaggccgggaaggcggaggggcggcgaggauccggagucaggaaaccugccugccg
((((((((((((((((((....))))))))))....(((...(((((.......)))))[[(((...))).(((...(((...((((((((((
......((((.((((((....))))))...))))))))))))))......)))..]])))....)))))))))))
http://genesilico.pl/SimRNAweb/jobs/9d39f986/
>BX571869.1/30799-30632
AUGGUGUGGUUGGGAAGGAGGUGAAAGUCCUCCGCAGCCCCCGCUGCUGUGAUGCUGACAACUCCGCUGAUGCCACUGGUCGGAAAGACUGGG
AAGGUUGCGGGGAAGGGUGACGCUAAGCCAGAAGACCGACCUG
..(((......((...(((((.......)))))[[(((....))).(((....((........((((.....((.(..(((.....)))..).
..))..))))...........))...]])))....)).)))..
http://genesilico.pl/SimRNAweb/jobs/ca9c767d-06b5-494d-841f-f1eb1ed904f1/
>cp771
CUUUGCAUGUUGAAAGGGAAGCCCGGUGAAAAUCCGGCGCGGGGCCGCCACCGUGAGUGGGGACGAAAUUCACAAUAUACCACUGGCCUAAUU
UUGGCUGGGAAGGUGUGAAGAGUAGGAUGAUCCACGAGUCGGGAGACCUAACAUGCAAAG
.((((((((((...(((...(.((((.......)))))[[((.....)).(((...((((........(((((.....(((.((((((.....
..))))))...))))))))............))))..]])))....))))))))))))).
http://genesilico.pl/SimRNAweb/jobs/3bbb8853-dd87-4913-acab-47caaed213ed/
>af193
uuaagguucuuugucauuggcaaagcuaagagggaaacuggugcgaaagaauuuucaaagccagugcugcccccgcaacuguaaacggcgagc
aaagaucaaaaugccacugauauuauuaucgggaaggcugaucggacgcggugacccgucaagucaggagaccugccuuaa
http://genesilico.pl/SimRNAweb/jobs/d752779c-bd51-411c-9716-064bcbd8606e/
>AM406670.1/3903431-3903207
UCAGGUGCCCGAAGGCGGUCCUCGCCCCAGGGUUAAACGGGAAACAGGUGCGCGCCUCCGGCGCAAUGCCUGUGCUGCCCCCGCAACGGUAAG
CGAGUGCAAGGCGCAUCAACAGCCACUGGGUCGUCCCCGGGAAGGCGAUGCGUCGGAGCCGGCCACAGCCGCUCCAGCCCGCGAGCCCGGAUA
CCGGCCCGA
((.(((.(((...((((.....))))...))).....(((...(((((...(((((...)))))....)))))[[(((....))).(((...(
((.........((((((....(((.(((((.....)))))...)))))))))..((((.((((....))))))))....)))..]])))....
)))))).))
http://genesilico.pl/SimRNAweb/jobs/6ab4c5c2-7605-4a81-a8b9-62cda22bb4a6/
RNA Puzzle 13
>zmp
gggucgugacuggcgaacaggugggaaaccaccggggagcgaccccggcaucgauagccgcccgccugggc
(((((((....[[[[....(((((....))))).....)))))))...........(((...]]]]..)))
http://genesilico.pl/SimRNAweb/jobs/175dd34c-100b-4a46-9aaa-e773b1468c39/
>CU234118.1/352539-352459
gcucucgcgcgacuggcgacuuuggauggagcaccaucggggagcgcgggaucgaccgccgugcgccugggc
((((((((((....[[[[......(((((....))))).....))))))))))....(((...]]]]..)))
http://genesilico.pl/SimRNAweb/jobs/0bf5c25e-4936-4da7-b145-928eea4031c7/
>BAAV01000055.1/28972982
ugaguuuucugcgacugacggauuauugcagagcacugcaagggaacagaaaaacucuuuuucagccgaccgucugggcacaccug
....(((((((.....[[[[[.....(((((....)))))......)))))))...........(((..]]]]]..))).......
http://genesilico.pl/SimRNAweb/jobs/8a418378-29f5-45df-af4a-5ecac1a5e7a4/
>CP000927.1/5164264-5164343
gcccguucgcgugacuggcgcuagugauggggaaccaucggggagcgcgaaccacaucgccgcgcgccugggcuccucga
....((((((((....[[[[[....(((((....))))).....))))))))......(((..]]]]]..))).......
http://genesilico.pl/SimRNAweb/jobs/d1969c5d-5a55-4025-944e-089de20719cf/
>AP009385.1/718103-718202
ucaccccugcgugacuggcgauagaacccucggguucaagguggagcaucccaccgugaagcgcagggcgccguuuuugccguucgccugggc
agccguu
....((((((((....[[[[[..(((((....)))))..(((((......))))).....))))))))........(((((..]]]]]..)))
)).....
http://genesilico.pl/SimRNAweb/jobs/9ca56ed4-69bb-477b-8ac2-35bfd085685f/
RNA Puzzle 14
>rp14
CGUUGACCCAGGAAACUGGGCGGAAGUAAGGCCCAUUGCACUCCGGGCCUGAAGCAACGCG
(((((.(((((....)))))........((((((..........))))))....)))))..
http://genesilico.pl/SimRNAweb/jobs/1aa9a03c-33e4-4718-899e-54ab3158d64c/
>AJ630128.1
AUCGUUCAUUCGCUAUUCGCAAAUAGCGAACGCAAAAGCCGACUGAAGGAACGGGAC
..(((((.(((((((........)))))))......((....))....)))))....
http://genesilico.pl/SimRNAweb/jobs/r14aj63pk-2f5f0e3d/
>AACY023015051.1
CGUUCAUCUUAUUUUAUUAAAUAGGACGGAAGUAGGAAGAUAGGAAAACCUCUUUCUUUUUUAAAGAAAGGCUAGCAAGUACCGCUUGGGUUA
AUUUAUCUUAGGCGGGAACGAGACCGAAUAUCUGCCGAAGGAACGC
(((((.(((((((......)))))))..[.....((..((((((....)).((((((((...))))))))(((....))).(((((((((...
......)))))))))...((....))..))))..))....))))).
RNA Puzzle 17
>rp17
CGUGGUUAGGGCCACGUUAAAUAGUUGCUUAAGCCCUAAGCGUUGAUAAAUAUCAGGUGCAA
((((..[[[[[.))))........((((.....]]]]]....(((((....)))))..))))
http://iimcb.genesilico.pl/SimRNAweb/jobs/27b5093d/
>hcf
UGCCGUUUGAGCGGCAUUAAACAGGUCUUAAGCUCAAAGCGUCACCGCCUACAAUGCUAGGCGGUGGGUGACA
((((..[[[[[.))))........(((.....]]]]]....((((((((((......))))))))))..))).
http://genesilico.pl/SimRNAweb/jobs/6d8062dd/
>s223
GCUCGUCUGGGCGAGGAUAAAUAGCUGUUAGGCCCAGAGCGGCUCUUCGGAUUGUGUUCCCUCCGCAAUCCGGGGAGCGUCAGC
.(((..[[[[[.)))........((((.....]]]]]....(((((((((((((((.......)))))))))))))))..))))
http://genesilico.pl/SimRNAweb/jobs/36828e10/
>s221
AGCCGUUGCGGCGGCUAUAAAUAGGACAUUAAGCCGCAAGCGUUGCCCGGUAUACCGCCGGGCAGGUUGUC
((((..[[[[[.))))........((((.....]]]]]....(((((((((.....)))))))))..))))
http://genesilico.pl/SimRNAweb/jobs/742b47e6/
>pisol
AGCCGUUCGGGCGGCUAUAAACAGACCUCAGGCCCGAAGCGUGGCGGCGCCGCCGGUGGUA
((((..[[[[[.)))).......((((.....]]]]]....((((((()))))))..))))
http://genesilico.pl/SimRNAweb/jobs/336e0098/
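The listings above pair each sequence with a secondary structure in dot-bracket notation, in which round brackets denote nested base pairs and square brackets denote pseudoknot pairs. The following is a minimal sketch of how such an entry could be checked for equal lengths and balanced brackets before submission. It assumes that only these two bracket types occur and that an entry wrapped over several lines has first been joined into a single string. The function names parse_dot_bracket and check_entry are hypothetical and are not part of SimRNAweb or rna-pdb-tools.

```python
from typing import List, Tuple

# Bracket types seen in the listings: "()" for nested helices, "[]" for pseudoknots.
BRACKETS = {"(": ")", "[": "]"}


def parse_dot_bracket(structure: str) -> List[Tuple[int, int]]:
    """Return sorted 1-based (i, j) base pairs; raise ValueError if unbalanced."""
    closing = {close: open_ for open_, close in BRACKETS.items()}
    stacks = {open_: [] for open_ in BRACKETS}  # one stack per bracket type
    pairs = []
    for pos, char in enumerate(structure, start=1):
        if char in BRACKETS:
            stacks[char].append(pos)
        elif char in closing:
            stack = stacks[closing[char]]
            if not stack:
                raise ValueError(f"unmatched '{char}' at position {pos}")
            pairs.append((stack.pop(), pos))
        elif char != ".":
            raise ValueError(f"unexpected character '{char}' at position {pos}")
    for open_, stack in stacks.items():
        if stack:
            raise ValueError(f"unmatched '{open_}' at position {stack[-1]}")
    return sorted(pairs)


def check_entry(sequence: str, structure: str) -> List[Tuple[int, int]]:
    """Check that sequence and structure have equal length, then parse the pairs."""
    if len(sequence) != len(structure):
        raise ValueError(
            f"length mismatch: sequence {len(sequence)}, structure {len(structure)}"
        )
    return parse_dot_bracket(structure)


# The >rp17 entry from the listing above (RNA Puzzle 17).
RP17_SEQ = "CGUGGUUAGGGCCACGUUAAAUAGUUGCUUAAGCCCUAAGCGUUGAUAAAUAUCAGGUGCAA"
RP17_SS = "((((..[[[[[.))))........((((.....]]]]]....(((((....)))))..))))"

if __name__ == "__main__":
    rp17_pairs = check_entry(RP17_SEQ, RP17_SS)
    print(f"rp17: {len(RP17_SEQ)} nt, {len(rp17_pairs)} base pairs")
```

Keeping one stack per bracket type lets the pseudoknot pairs cross the nested helices without being reported as errors. A check of this kind only verifies that the notation is well formed and says nothing about whether the paired residues are chemically compatible.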
Table of Figures
Figure 1.2.1: Ribonucleotide - a building block of RNA. Source (Wikimedia-Commons) ..... 2
Figure 1.2.2: Leontis/Westhof classification of base pairings. (A) RNA bases - adenine (A),
cytosine (C), guanine (G) and uracil (U) - interact through one of three distinct edges: the
Watson–Crick (W) edge, the Hoogsteen (H) edge, and the Sugar (S) edge. (B) Each pair
of bases can interact in either the cis or the trans orientation with respect to the glycosidic
bonds. (C) For these reasons, all base pairs can be grouped into twelve geometric base pair
families and eighteen pairing relationships (bases are represented as triangles). Each pair is
represented by a symbol that can be used in secondary and tertiary structure diagrams.
Filled symbols denote the cis base pair configuration, and open symbols the trans
configuration. (D) Bases can also form triples, which have their own classification devised
by Leontis and coworkers (Abu Almakarem et al. 2012) (Creative Commons License) ... 5
Figure 1.2.3: An example secondary structure (A) and the corresponding tertiary structure
(B) of the Pistol ribozyme (PDB code: 5K7C (Ren et al. 2016)). This ribozyme adopts a
compact tertiary architecture stabilized by an embedded pseudoknot (violet) fold and is
composed of three helical regions: P1 (green), P2 (blue), and P3 (orange). It is a
self-cleaving ribozyme that is widely distributed in nature (Jimenez et al. 2015). The
cleavage site is marked in yellow. The secondary structure diagram was generated with
VARNA (Darty et al. 2009), and the tertiary structure figure was generated with
PyMOL (DELANO 2002) ................................................................................................. 8
Figure 1.4.1: RNA families tend to fold into the same 3D shape. Structures of the c-di-AMP
riboswitch solved independently by three groups: for two different sequences obtained
from Thermoanaerobacter pseudethanolicus (PDB id: 4QK8) and Thermovirga lienii
(PDB id: 4QK9) (Gao and Serganov 2014), for a sequence from Thermoanaerobacter
tengcongensis (PDB id: 4QLM) (Ren and Patel 2014) and for a sequence from Bacillus
subtilis (PDB id: 4W90) (the molecule in blue is a protein used to facilitate
crystallization) (Jones and Ferré-D'Amaré 2014). There is some variation between
structures in the peripheral parts (marked with red arrows), but the overall structure of
the core is preserved......................................................................................................... 20
Figure 1.4.2: According to the RNArchitecture database, only 3% (70) of Rfam families
have experimentally solved structures, whereas 97% (2,618 families) have no
known structure. ............................................................................................................. 21
Figure 1.5.1: The results of RNA Puzzle 13. The second model in the ranking (sorted
according to RMSD) is a model obtained with a prototype version of EvoClustRNA
developed at Stanford University. There is no single way to sort the models.
Different metrics have different properties, and a researcher should decide which is useful
for his/her application. RMSD informs about the geometrical similarity between a
prediction and the crystallographic structure (the lower, the better). INF informs about the
similarity of interaction networks and ranges from 0 to 1 (the higher, the better). Several
partial INFs can be computed: INF WC (the canonical interactions only), INF NWC (the
non-canonical interactions only), INF stacking (the stacking interactions only). INF
ALL takes into account all the interactions mentioned above. This RNA-Puzzle shows
one of the biggest problems in RNA 3D structure prediction: a very low INF NWC in
all submissions, which indicates a lack of accurate prediction of non-canonical interactions. .. 23
Figure 1.5.2: The detailed view of the results of the ZMP riboswitch (RNA Puzzle 13). For
each submitted model a detailed summary is available online that includes a
superposition of a prediction, in this case, the EvoClustRNA prediction (red), on the
crystallographic structure (green). Various metrics are shown in the result summary. ... 23
Figure 3.6.1: The alignment preparation. The conserved residues are marked with “x” in the
pseudo-sequence “x”. The columns marked as conserved can be inspected
in an arc diagram of RNA secondary structure (Lai et al. 2012) as the pink line (at the
very bottom). .................................................................................................................... 32
Figure 3.6.2: Each sequence and its associated secondary structure were saved ("Save as") with
the Jalview program to a FASTA file and used in the next stage of modeling. .......... 32
Figure 4.1.1: Graphical diagram of primary methods used by mqapRNA to describe the
analyzed model. (A) other methods for model quality assessment, (B) RNA modeling
software, (C) others. ......................................................................................................... 35
Figure 4.1.2: Example of a decoy set from the RASP dataset of the adenine riboswitch (PDB
ID: 1Y26). (A) The native structure. (B-F) A set of structures (files in the PDB format)
selected from this decoy set with increasing deviation from the native (RMSDs to the
native in parentheses). Files: (B) 1y26X_M100 (RMSD: 1.7 Å), (C) 1y26X_M200
(RMSD: 2.49 Å), (D) 1y26X_M300 (RMSD: 3.23 Å), (E) 1y26X_M400 (RMSD:
3.31 Å), (F) 1y26X_M500 (RMSD: 5.12 Å). .................................................................... 36
Figure 4.1.3: Histograms of RMSDs [Å] per dataset. In red, the datasets used for training
mqapRNA; in orange, the dataset used only for testing. X: number of structures (not
scaled in the same way for all plots because of the very diverse ranges), Y: RMSDs [Å].
.......................................................................................................................................... 37
Figure 4.1.4: Histograms of Secondary Structure (INFs) per dataset. In red, the datasets used
for training mqapRNA, in orange, the dataset used only for testing. X: number of
structures (not scaled in the same way for all plots because of the very diverse ranges),
Y: Secondary Structure similarity of a given model to a secondary structure of a native
structure (INFs). ............................................................................................................... 37
Figure 4.1.5: mqapRNA is a machine learning-based method. (A) First, a statistical model
was built on a training dataset of structures with known RMSDs to their native structures.
Each structure is described by a list of scores, the results of the primary methods. This
process allows mqapRNA to learn the correspondence between scores and RMSDs. (B) Next,
the statistical model is applied to new cases, where the RMSD is unknown. ...................... 38
Figure 4.1.6: Contribution (“Importance”) of a given subscore (“Variable”) to the final deep
learning model developed for mqapRNA (a plot generated with the H2O Flow
notebook). The higher the importance, the more a given subscore contributes to accurate
predictions of the statistical model. ......................................................................................... 39
Figure 4.1.7: Rank correlations for each decoy set and scoring method. mqapRNA (3rd
column) outperformed other scoring functions with a weighted average of rank
correlations of 0.77. ......................................................................................................... 42
Figure 4.1.8: Enrichment Score for each decoy set and scoring method. mqapRNA (3rd
column) is outperformed by SimRNA (10th column) by 0.1 in terms of EC. ................. 43
Figure 4.1.9: Close-up on the RNA-Puzzle 14 results in the form of RMSD [Å] vs. score plots.
A perfect method would follow the diagonal of the plot. mqapRNA achieved an EC of 7.7
and was able to identify a group of the near-native models. Other methods were not able
to rank models properly. .................................................................................................. 44
Figure 4.1.10: The homepage of the mqapRNA web server. ................................................. 46
Figure 4.1.11: A result page of mqapRNA. The page is divided into three panels: a plot of
mqapRNA score, a table of the score, and the restraints editor. The distance restraints
can be easily modified and re-submitted to the server. The results will be immediately
updated which might encourage the user to try different sets of restraints. .................... 47
Figure 4.1.12: Distance restraints editor at the bottom of the result page. The user can upload
a file with distance restraints or use an online editor to modify his/her query. After the
re-submission, the scores are re-calculated, and a new plot is generated. ....................... 48
Figure 4.2.1: The scheme of the proposed methodology. (A) Homologous sequences are
found for the target sequence, and an RNA alignment is created. (B) Using SimRNA
and/or Rosetta, structural models for all sequences are generated. (C) The
conserved regions are cut out and clustered. (D) The final prediction of the method is the
model containing the most commonly preserved structural arrangements in the set of
homologs. ......................................................................................................................... 49
Figure 4.2.2: The RNA-Puzzle 13 - the ZMP riboswitch. The superposition of the native
structure (green) and the EvoClustRNA prediction (blue). The RMSD between the
structures is 5.55 Å; the prediction was ranked second in the overall ranking of the
RNA-Puzzles (according to the RMSD values). .............................................................. 51
Figure 4.2.3: The RNA Puzzle 14 - L-glutamine riboswitch. The RMSD between the native
structure (green) and the EvoClustRNA prediction (blue) is 5.56 Å. .............................. 52
Figure 4.2.4: (A) The native structure (PDB ID: 1Y26). Models generated by (B) Weinberg et al.
(C) SimRNAweb (D) EvoClustRNA|SimRNAweb (E) Rosetta (F)
EvoClustRNA|Rosetta. All models exhibit the native-like fold. However, only models
C and D exhibit an orientation of secondary structure elements similar to that of the
native structure. ................................................................................................................ 54
Figure 4.2.5: (A) The native structure (PDB ID: 2GDI). Models generated by (B) Weinberg et al.
(C) SimRNAweb (D) EvoClustRNA|SimRNAweb (E) Rosetta (F)
EvoClustRNA|Rosetta. Only model B shares the three-dimensional fold with the native
structure, with an RMSD of 13.92 Å. ................................................................................ 55
Figure 4.2.6: (A) The native structure (PDB ID: 1FIR). Models generated by (B) Weinberg
et al. (C) SimRNAweb (D) EvoClustRNA|SimRNAweb (E) Rosetta (F)
EvoClustRNA|Rosetta. Only model B shares the three-dimensional fold with the native
structure, with an RMSD of 10.35 Å. .............................................................................. 56
Figure 4.2.7: (A) The native structure (PDB ID: 3Q3Z). Models generated by (B) Weinberg
et al. (C) SimRNAweb (D) EvoClustRNA|SimRNAweb (E) Rosetta (F)
EvoClustRNA|Rosetta. The RMSDs range from 9.65 Å to 14.53 Å. .............................. 56
Figure 4.2.8: (A) The native structure (PDB ID: 4LVV). Models generated by (B) Weinberg
et al. (C) SimRNAweb (D) EvoClustRNA|SimRNAweb (E) Rosetta (F)
EvoClustRNA|Rosetta. Model E is the closest to the native structure, with an RMSD of
4.83 Å. .............................................................................................................................. 57
Figure 4.2.9: (A) The native structure (PDB ID: 4GXY). Models generated by (B)
SimRNAweb (C) EvoClustRNA|SimRNAweb (D) Rosetta (E) EvoClustRNA|Rosetta.
Due to missing RNA-ligand interactions, none of the models is close to the native
structure (RMSDs range from 31.02 Å to 33.39 Å). ....................................................... 58
Figure 4.2.10: (A) The native structure (PDB ID: 4XW7). Models generated by (B)
SimRNAweb (C) EvoClustRNA|SimRNAweb (D) Rosetta (E) EvoClustRNA|Rosetta.
.......................................................................................................................................... 58
Figure 4.2.11: (A) The native structure (PDB ID: 5DDO). Models generated by (B)
SimRNAweb (C) EvoClustRNA|SimRNAweb (D) Rosetta (E) EvoClustRNA|Rosetta.
The most accurate model of this riboswitch was generated with
EvoClustRNA|SimRNAweb (RMSD 4.44 Å). ................................................................ 59
Figure 4.2.12: (A) The native structure (PDB ID: 5K7C). Models generated by (B)
SimRNAweb (C) EvoClustRNA|SimRNAweb (D) Rosetta (E) EvoClustRNA|Rosetta.
.......................................................................................................................................... 59
Figure 4.3.1: rna-pdb-tools can also be run from Emacs. A researcher can edit a PDB file
using the text-oriented functionality of this editor and then, without leaving the editor,
apply the RNApuzzle function to standardize the file. .................................................... 63
Figure 4.3.2: rna_pdb_toolsx.py is able to rebuild a missing base (drawn with a thin line) to
complete a structure. ........................................................................................................ 65
Figure 4.3.3: rna-pdb-tools comes with detailed documentation that can be viewed online
or as a PDF file. ............................................................................................................... 65
Figure 4.3.4: diffpdb.py is a tool to detect differences in formatting between two PDB files.
First, the tool removes columns of coordinates, and next compares only columns with
annotation (atom naming, numbering). ............................................................................ 66
Figure 4.3.5: A fragment of the demo on the RNA alignment functionality implemented in
rna-pdb-tools. Top: a user can load a new alignment and plot an RChie plot, bottom: a
user can also get a secondary structure and a sequence for a row taken from an alignment
(gaps are removed) in the text format or get a visualization using VARNA. The
functions can be imported to a user’s own Python scripts but also to a Jupyter notebook
(as shown in the figure).................................................................................................... 67
Figure 5.3.1: The Jupyter notebook (a part of the whole notebook) for the RNA-Puzzle 18
problem. The notebook reports steps of a bioinformatic analysis to collect information
about the target sequence, such as secondary structure predictions using three different
methods and a BLAST search on the PDB database that led to the detection of a
homolog used later for comparative modeling. ................................................................. 75
Figure 5.4.1. The native structure (PDB ID: 4GXY) solved with the ligand (indicated by the
arrow). .............................................................................................................................. 77
Figure 5.4.2: The results of a DCA analysis performed for the adenosylcobalamin
riboswitch. The bars represent interactions detected by the DCA analysis (the structure
is made transparent to highlight the bars). The red box indicates the interface between the
core and the peripheral domain, where no interactions were predicted. ........................... 78
Figure 5.4.3: A network of canonical and non-canonical interactions depicted using the
Leontis/Westhof classification obtained with RNAView (Yang et al. 2003) for the
structure of tRNA (PDB id: 1FIR). .................................................................................. 79
Figure 5.4.4: Secondary/tertiary structure presentation in the Leontis–Westhof nomenclature.
Two non-canonical interactions A69-C38 and A69-C22 (highlighted in red) were not
predicted by SimRNA or Rosetta (Lang et al. 2007). ...................................................... 79
Figure 5.4.5: Color-coded: G53-U54 cleavage site (yellow), P1 (green), pseudoknot (violet),
P2 (blue), loops (dark blue) (A) the native structure (PDB ID: 5K7C), and models
generated by (B) SimRNAweb (C) EvoClustRNA|SimRNAweb (D) Rosetta (E)
EvoClustRNA|Rosetta. .................................................................................................... 80
Figure 5.4.6: Superposition of all predicted (A) P1 stems and pseudoknots, (B) P2 stems, (C)
P3 stems. All the fragments are of good accuracy (RMSDs up to 3.5 Å). ....................... 81
Figure 5.4.7: Fragments of stems P1 with pseudoknots and single-stranded regions extracted
from all the predictions. A conserved region with the AAA trinucleotide (red) is
interacting with the minor groove of the P1 stem (green) in the native structure.
However, the motif was not formed in any of the predictions. ........................................ 81
Figure 5.4.8: Clustering visualized with Clans: (A) the native structure, (B) a model with
a fold close to the native, detected in a small cluster, (C) the biggest cluster, with the
model that was returned as the final prediction. .............................................................. 83
Figure 5.4.9: Clustering visualized with Clans: (A) the native structure, (B) a model with
a fold close to the native, (C) the biggest cluster, with the model that was returned as
the final prediction. .......................................................................................................... 84
Figure 5.4.1: Limitations of the predictive methods identified based on the results of this
study (A-C) and a description of the ultimate pipeline for RNA 3D structure prediction
(D). ................................................................................................................................... 86
Table of Tables
Table 1.2.1: Computational methods for RNA 3D structure prediction, based on (Magnus et al.
2014). ............................................................................................................................... 11
Table 1.3.1: Low-resolution experimental methods that generate particularly useful data for
computational prediction of RNA 3D structure, based on (Magnus et al. 2014). An
accurate secondary structure and/or distance restraints can be used with mqapRNA to
refine the final ranking. .................................................................................................... 18
Table 3.5.1: A list of subscores extracted from the primary methods used for training and
prediction with mqapRNA. For each analyzed structure, all these scores are provided in
a CSV output file, both in the standalone version and in the web server. ........................ 29
Table 4.2.1: The performance of EvoClustRNA on the test dataset. The results for nine
RNAs. Column 1, original numeration. Column 2, RNA type and PDB ID code for each
RNA. Column 3, sequence length. Column 4, RMSD [Å] of models obtained by
Weinreb et al., only for RNAs 1-5. Column 5, RMSD [Å] of the first cluster obtained with
SimRNAweb. Column 6, RMSD [Å] of the first cluster obtained with
EvoClustRNA|SimRNAweb. Column 7, the difference between column 6 and column 5.
Column 8, RMSD [Å] of the first cluster obtained with Rosetta. Column 9, RMSD [Å]
of the first cluster obtained with EvoClustRNA|Rosetta. Column 10, the difference between
column 9 and column 8. The improvements in RMSDs when EvoClustRNA is used are
marked in green, the cases where EvoClustRNA worsened the results are marked in red.
.......................................................................................................................................... 53
References
Abu Almakarem, Amal S, Anton I Petrov, Jesse Stombaugh, Craig L Zirbel, and Neocles B
Leontis. 2012. “Comprehensive Survey and Geometric Classification of Base Triples in
RNA Structures.” Nucleic Acids Research 40 (4): 1407–23. doi:10.1093/nar/gkr810.
Adams, Paul D, Pavel V Afonine, Gábor Bunkóczi, Vincent B Chen, Ian W Davis, Nathaniel
Echols, Jeffrey J Headd, et al. 2010. “PHENIX: a Comprehensive Python-Based System
for Macromolecular Structure Solution.” Acta Crystallographica. Section D, Biological
Crystallography 66 (Pt 2). International Union of Crystallography: 213–21.
doi:10.1107/S0907444909052925.
Akiyama, Benjamin M, Hannah M Laurence, Aaron R Massey, David A Costantino, Xuping
Xie, Yujiao Yang, Pei-yong Shi, Jay C Nix, J David Beckham, and Jeffrey S Kieft. 2016.
“Zika Virus Produces Noncoding RNAs Using a Multi-Pseudoknot Structure That
Confounds a Cellular Exonuclease.” Science (New York, N.Y.) 354 (6316): 1148–52.
doi:10.1126/science.aah3963.
Albrecht, Mario, Silvio C E Tosatto, Thomas Lengauer, and Giorgio Valle. 2003. “Simple
Consensus Procedures Are Effective and Sufficient in Secondary Structure Prediction.”
Protein Engineering 16 (7): 459–62.
Anfinsen, C B. 1973. “Principles That Govern the Folding of Protein Chains.” Science (New
York, N.Y.) 181 (4096): 223–30.
Aw, Sherry S, Melissa XM Tang, Yin Nah Teo, and Stephen M Cohen. 2016. “A
Conformation-Induced Fluorescence Method for microRNA Detection.” Nucleic Acids
Research 44 (10): e92–e92. doi:10.1093/nar/gkw108.
Berens, Christian, Florian Groher, and Beatrix Suess. 2015. “RNA Aptamers as Genetic
Control Devices: the Potential of Riboswitches as Synthetic Elements for Regulating
Gene Expression.” Biotechnology Journal 10 (2). WILEY‐VCH Verlag: 246–57.
doi:10.1002/biot.201300498.
Berman, H M, J Westbrook, Z Feng, G Gilliland, T N Bhat, H Weissig, I N Shindyalov, and
P E Bourne. 2000. “The Protein Data Bank.” Nucleic Acids Research 28 (1). Oxford
University Press: 235–42.
Bernauer, Julie, Xuhui Huang, Adelene Y L Sim, and Michael Levitt. 2011. “Fully
Differentiable Coarse-Grained and All-Atom Knowledge-Based Potentials for RNA
Structure Evaluation.” RNA (New York, N.Y.) 17 (6): 1066–75.
doi:10.1261/rna.2543711.
Boccaletto, Pietro, Marcin Magnus, Catarina Almeida, Adriana Zyła, Astha, Radosław Pluta,
Blazej Bagiński, et al. 2017. “RNArchitecture: a Database and a Classification System of
RNA Families, with a Focus on Structural Information.” Submitted for Review.
Boniecki, Michal J, Grzegorz Lach, Wayne K Dawson, Konrad Tomala, Pawel Lukasz,
Tomasz Soltysinski, Kristian M Rother, and Janusz M Bujnicki. 2016. “SimRNA: a
Coarse-Grained Method for RNA Folding Simulations and 3D Structure Prediction.”
Nucleic Acids Research 44 (7): e63–e63. doi:10.1093/nar/gkv1479.
Bonneau, Richard, Charlie E M Strauss, and David Baker. 2001. “Improving the Performance
of Rosetta Using Multiple Sequence Alignment Information and Global Measures of
Hydrophobic Core Formation.” Proteins: Structure, Function, and Bioinformatics 43 (1).
John Wiley & Sons, Inc.: 1–11. doi:10.1002/1097-0134(20010401)43:1<1::AID-PROT1012>3.0.CO;2-A.
Bonneau, Richard, Charlie E M Strauss, Carol A Rohl, Dylan Chivian, Phillip Bradley, Lars
Malmström, Tim Robertson, and David Baker. 2002. “De Novo Prediction of
Three-Dimensional Structures for Major Protein Families.” Journal of Molecular Biology 322
(1): 65–78.
Bottaro, Sandro, Francesco Di Palma, and Giovanni Bussi. 2014. “The Role of Nucleobase
Interactions in RNA Structure and Dynamics.” Nucleic Acids Research 42 (21):
13306–14. doi:10.1093/nar/gku972.
Brooks, B R, C L Brooks, A D Mackerell, L Nilsson, R J Petrella, B Roux, Y Won, et al.
2009. “CHARMM: the Biomolecular Simulation Program.” Edited by Charles L Brooks
III and David A Case. Journal of Computational Chemistry 30 (10). Wiley Subscription
Services, Inc., A Wiley Company: 1545–1614. doi:10.1002/jcc.21287.
Burks, Jody, Christian Zwieb, Florian Müller, Iwona Wower, and Jacek Wower. 2005.
“Comparative 3-D Modeling of tmRNA.” BMC Molecular Biology 6 (1). BioMed
Central: 14. doi:10.1186/1471-2199-6-14.
Capriotti, Emidio, Tomas Norambuena, Marc A Marti-Renom, and Francisco Melo. 2011.
“All-Atom Knowledge-Based Potential for RNA Structure Prediction and Assessment.”
Bioinformatics (Oxford, England) 27 (8): 1086–93. doi:10.1093/bioinformatics/btr093.
Case, David A, Thomas E Cheatham, Tom Darden, Holger Gohlke, Ray Luo, Kenneth M
Merz, Alexey Onufriev, Carlos Simmerling, Bing Wang, and Robert J Woods. 2005.
“The Amber Biomolecular Simulation Programs.” Journal of Computational Chemistry
26 (16). Wiley Subscription Services, Inc., A Wiley Company: 1668–88.
doi:10.1002/jcc.20290.
Chapman, Erich G, David A Costantino, Jennifer L Rabe, Stephanie L Moon, Jeffrey Wilusz,
Jay C Nix, and Jeffrey S Kieft. 2014. “The Structural Basis of Pathogenic Subgenomic
Flavivirus RNA (sfRNA) Production.” Science (New York, N.Y.) 344 (6181): 307–10.
doi:10.1126/science.1250897.
Chapman, Michael S, Se Won Suh, Paul M G Curmi, Duilio Cascio, Ward W Smith, and
David S Eisenberg. 1988. “Tertiary Structure of Plant RuBisCO: Domains and Their
Contacts.” Science (New York, N.Y.) 241 (4861). American Association for the
Advancement of Science: 71–75.
Cheng, Clarence Yu, Fang-Chieh Chou, and Rhiju Das. 2015. “Modeling Complex RNA
Tertiary Folds with Rosetta.” In Computational Methods for Understanding
Riboswitches, 553:35–64. Methods in Enzymology. Elsevier.
doi:10.1016/bs.mie.2014.10.051.
Chworos, Arkadiusz, Isil Severcan, Alexey Y Koyfman, Patrick Weinkam, Emin Oroudjev,
Helen G Hansma, and Luc Jaeger. 2004. “Building Programmable Jigsaw Puzzles with
RNA.” Science (New York, N.Y.) 306 (5704). American Association for the
Advancement of Science: 2068–72. doi:10.1126/science.1104686.
Cock, Peter J A, Tiago Antao, Jeffrey T Chang, Brad A Chapman, Cymon J Cox, Andrew
Dalke, Iddo Friedberg, et al. 2009. “Biopython: Freely Available Python Tools for
Computational Molecular Biology and Bioinformatics.” Bioinformatics 25 (11):
1422–23. doi:10.1093/bioinformatics/btp163.
Cruz, José Almeida, and Eric Westhof. 2011. “Sequence-Based Identification of 3D
Structural Modules in RNA with RMDetect.” Nature Methods 8 (6): 513–19.
doi:10.1038/nmeth.1603.
Czerwoniec, Anna, Stanislaw Dunin-Horkawicz, Elżbieta Purta, Katarzyna H Kaminska,
Joanna M Kasprzak, Janusz M Bujnicki, Henri Grosjean, and Kristian Rother. 2009.
“MODOMICS: a Database of RNA Modification Pathways. 2008 Update.” Nucleic
Acids Research 37 (Database issue): D118–21. doi:10.1093/nar/gkn710.
Darty, Kévin, Alain Denise, and Yann Ponty. 2009. “VARNA: Interactive Drawing and
Editing of the RNA Secondary Structure.” Bioinformatics (Oxford, England) 25 (15):
1974–75. doi:10.1093/bioinformatics/btp250.
Das, Rhiju, and David Baker. 2007. “Automated De Novo Prediction of Native-Like RNA
Tertiary Structures.” Proceedings of the National Academy of Sciences 104 (37).
National Acad Sciences: 14664–69. doi:10.1073/pnas.0703836104.
Das, Rhiju, John Karanicolas, and David Baker. 2010. “Atomic Accuracy in Predicting and
Designing Noncanonical RNA Structure.” Nature Methods 7 (4): 291–94.
doi:10.1038/nmeth.1433.
Das, Rhiju, Madhuri Kudaravalli, Magdalena Jonikas, Alain Laederach, Robert Fong, Jason P
Schwans, David Baker, Joseph A Piccirilli, Russ B Altman, and Daniel Herschlag. 2008.
“Structural Inference of Native and Partially Folded RNA by High-Throughput Contact
Mapping.” Proceedings of the National Academy of Sciences 105 (11). National Acad
Sciences: 4144–49. doi:10.1073/pnas.0709032105.
De Leonardis, Eleonora, Benjamin Lutz, Sebastian Ratz, Simona Cocco, Rémi Monasson,
Alexander Schug, and Martin Weigt. 2015. “Direct-Coupling Analysis of Nucleotide
Coevolution Facilitates RNA Secondary and Tertiary Structure Prediction.” Nucleic
Acids Research, September, gkv932–12. doi:10.1093/nar/gkv932.
DeLano, W L. 2002. “The PyMOL Molecular Graphics System.” Pymol.org 52 (1).
DeLano Scientific: 62–67. doi:10.5940/jcrsj.52.62.
Ding, Feng, Shantanu Sharma, Poornima Chalasani, Vadim V Demidov, Natalia E Broude,
and Nikolay V Dokholyan. 2008. “Ab Initio RNA Folding by Discrete Molecular
Dynamics: From Structure Prediction to Folding Mechanisms.” RNA 14 (6). Cold Spring
Harbor Lab: 1164–73. doi:10.1261/rna.894608.
Dunbrack, Roland. 2004. Whatcheck. Vol. 13. Chichester, UK: John Wiley & Sons, Ltd.
doi:10.1002/9780471650126.dob0791.pub2.
Dunin-Horkawicz, Stanislaw, Anna Czerwoniec, Michal J Gajda, Marcin Feder, Henri
Grosjean, and Janusz M Bujnicki. 2006. “MODOMICS: a Database of RNA
Modification Pathways.” Nucleic Acids Research 34 (Database issue): D145–49.
doi:10.1093/nar/gkj084.
Eisenberg, David, Roland Lüthy, and James U Bowie. 1997. “VERIFY3D: Assessment of
Protein Models with Three-Dimensional Profiles.” In Macromolecular Crystallography
Part B, 277:396–404. Methods in Enzymology. Elsevier. doi:10.1016/S0076-
6879(97)77022-8.
Eriksson, Emma S E, Lokesh Joshi, Martin Billeter, and Leif A Eriksson. 2014. “De Novo
Tertiary Structure Prediction Using RNA123--Benchmarking and Application to
Macugen.” Journal of Molecular Modeling 20 (8). Springer Berlin Heidelberg: 2389.
doi:10.1007/s00894-014-2389-z.
Flores, Samuel C, Yaqi Wan, Rick Russell, and Russ B Altman. 2010. “Predicting RNA
Structure by Multiple Template Homology Modeling.” Pacific Symposium on
Biocomputing. NIH Public Access, 216–27.
Frickey, T, and A Lupas. 2004. “CLANS: a Java Application for Visualizing Protein Families
Based on Pairwise Similarity.” Bioinformatics 20 (18): 3702–4.
doi:10.1093/bioinformatics/bth444.
Gajda, Michał Jan. 2013. “HPDB-Haskell Library for Processing Atomic Biomolecular
Structures in Protein Data Bank Format.” BMC Research Notes 6 (1). BioMed Central:
483. doi:10.1186/1756-0500-6-483.
Gao, Ang, and Alexander Serganov. 2014. “Structural Insights Into Recognition of
C-Di-AMP by the ydaO Riboswitch.” Nature Chemical Biology 10 (9): 787–92.
doi:10.1038/nchembio.1607.
Ginalski, K, A Elofsson, D Fischer, and L Rychlewski. 2003. “3D-Jury: a Simple Approach
to Improve Protein Structure Predictions.” Bioinformatics 19 (8): 1015–18.
doi:10.1093/bioinformatics/btg124.
Grant, Barry J, Ana P C Rodrigues, Karim M ElSawy, J Andrew McCammon, and Leo S D
Caves. 2006. “Bio3d: an R Package for the Comparative Analysis of Protein Structures.”
Bioinformatics 22 (21). Oxford University Press: 2695–96.
doi:10.1093/bioinformatics/btl461.
Griffiths-Jones, Sam. 2005. “RALEE—RNA ALignment Editor in Emacs.” Bioinformatics
21 (2). Oxford University Press: 257–59. doi:10.1093/bioinformatics/bth489.
Hanson, Robert M, Jaime Prilusky, Zhou Renjian, Takanori Nakane, and Joel L Sussman.
2013. “JSmol and the Next‐Generation Web‐Based Representation of 3D Molecular
Structure as Applied to Proteopedia.” Israel Journal of Chemistry 53 (3‐4). WILEY‐VCH
Verlag: 207–16. doi:10.1002/ijch.201300024.
Hayes, Josie, Pier Paolo Peruzzi, and Sean Lawler. 2014. “MicroRNAs in Cancer:
Biomarkers, Functions and Therapy.” Trends in Molecular Medicine 20 (8): 460–69.
doi:10.1016/j.molmed.2014.06.005.
Hunt, Andrew, and David Thomas. 1999. The Pragmatic Programmer. Addison-Wesley
Professional.
Jimenez, Randi M, Julio A Polanco, and Andrej Lupták. 2015. “Chemistry and Biology of
Self-Cleaving Ribozymes.” Trends in Biochemical Sciences 40 (11): 648–61.
doi:10.1016/j.tibs.2015.09.001.
Jones, Christopher P, and Adrian R Ferré-D'Amaré. 2014. “Crystal Structure of a C-Di-AMP
Riboswitch Reveals an Internally Pseudo-Dimeric RNA.” The EMBO Journal 33 (22).
EMBO Press: 2692–2703. doi:10.15252/embj.201489209.
Jonikas, Magdalena A, Randall J Radmer, and Russ B Altman. 2009. “Knowledge-Based
Instantiation of Full Atomic Detail Into Coarse-Grain RNA 3D Structural Models.”
Bioinformatics 25 (24): 3259–66. doi:10.1093/bioinformatics/btp576.
Jossinet, Fabrice, and Eric Westhof. 2005. “Sequence to Structure (S2S): Display,
Manipulate and Interconnect RNA Data From Sequence to Structure.” Bioinformatics 21
(15): 3320–21. doi:10.1093/bioinformatics/bti504.
Jossinet, Fabrice, Thomas E Ludwig, and Eric Westhof. 2010. “Assemble: an Interactive
Graphical Tool to Analyze and Build RNA Architectures at the 2D and 3D Levels.”
Bioinformatics (Oxford, England) 26 (16): 2057–59. doi:10.1093/bioinformatics/btq321.
Kellenberger, Colleen A, Chen Chen, Aaron T Whiteley, Daniel A Portnoy, and Ming C
Hammond. 2015. “RNA-Based Fluorescent Biosensors for Live Cell Imaging of Second
Messenger Cyclic Di-AMP.” Journal of the American Chemical Society 137 (20).
American Chemical Society: 6432–35. doi:10.1021/jacs.5b00275.
Kerpedjiev, Peter, Christian Höner Zu Siederdissen, and Ivo L Hofacker. 2015. “Predicting
RNA 3D Structure Using a Coarse-Grain Helix-Centered Model.” RNA (New York, N.Y.)
21 (6). Cold Spring Harbor Lab: 1110–21. doi:10.1261/rna.047522.114.
Kieft, Jeffrey S, Kaihong Zhou, Angie Grech, Ronald Jubin, and Jennifer A Doudna. 2002.
“Crystal Structure of an RNA Tertiary Domain Essential to HCV IRES-Mediated
Translation Initiation.” Nature Structural Biology, April. doi:10.1038/nsb781.
Kim, Peter B, James W Nelson, and Ronald R Breaker. 2015. “An Ancient Riboswitch Class
in Bacteria Regulates Purine Biosynthesis and One-Carbon Metabolism.” Molecular
Cell 57 (2): 317–28. doi:10.1016/j.molcel.2015.01.001.
Kirmizialtin, Serdal, Scott P Hennelly, Alexander Schug, Jose N Onuchic, and Karissa Y
Sanbonmatsu. 2015. “Integrating Molecular Dynamics Simulations with Chemical
Probing Experiments Using SHAPE-FIT.” Methods in Enzymology 553. Elsevier:
215–34. doi:10.1016/bs.mie.2014.10.061.
Kladwang, Wipapat, Christopher C VanLang, Pablo Cordero, and Rhiju Das. 2011. “A
Two-Dimensional Mutate-and-Map Strategy for Non-Coding RNA Structure.” Nature
Chemistry 3 (12): 954–62. doi:10.1038/nchem.1176.
Kladwang, Wipapat, Fang-Chieh Chou, and Rhiju Das. 2012. “Automated RNA Structure
Prediction Uncovers a Kink-Turn Linker in Double Glycine Riboswitches.” Journal of
the American Chemical Society 134 (3): 1404–7. doi:10.1021/ja2093508.
Klostermeier, D, and D P Millar. 2001. “Time-Resolved Fluorescence Resonance Energy
Transfer: a Versatile Tool for the Analysis of Nucleic Acids.” Biopolymers 61 (3). Wiley
Subscription Services, Inc., A Wiley Company: 159–79. doi:10.1002/bip.10146.
Knight, Rob, Peter Maxwell, Amanda Birmingham, Jason Carnes, J Gregory Caporaso, Brett
C Easton, Michael Eaton, et al. 2007. “PyCogent: a Toolkit for Making Sense From
Sequence.” Genome Biology 8 (8). BioMed Central: R171. doi:10.1186/gb-2007-8-8-
r171.
Kryshtafovych, Andriy, Bohdan Monastyrskyy, Krzysztof Fidelis, Torsten Schwede, and
Anna Tramontano. 2017. “Assessment of Model Accuracy Estimations in CASP12.”
Proteins: Structure, Function, and Bioinformatics 84 (Suppl 1): 349.
doi:10.1002/prot.25371.
Kulik, Marta, Anna M Goral, Maciej Jasiński, Paulina M Dominiak, and Joanna Trylska.
2015. “Electrostatic Interactions in Aminoglycoside-RNA Complexes.” Biophysical
Journal 108 (3): 655–65. doi:10.1016/j.bpj.2014.12.020.
Kurowski, Michal A, and Janusz M Bujnicki. 2003. “GeneSilico Protein Structure Prediction
Meta-Server.” Nucleic Acids Research 31 (13). Oxford University Press: 3305–7.
Lai, D, J R Proctor, J Y A Zhu, and I M Meyer. 2012. “R-CHIE: a Web Server and R Package
for Visualizing RNA Secondary Structures.” Nucleic Acids Research.
Laing, Christian, and Tamar Schlick. 2010. “Computational Approaches to 3D Modeling of
RNA.” Journal of Physics: Condensed Matter 22 (28): 283101–19. doi:10.1088/0953-
8984/22/28/283101.
Lang, Kathrin, Renate Rieder, and Ronald Micura. 2007. “Ligand-Induced Folding of the
thiM TPP Riboswitch Investigated by a Structure-Based Fluorescence Spectroscopic
Approach.” Nucleic Acids Research 35 (16): 5370–78. doi:10.1093/nar/gkm580.
Laskowski, R A, M W MacArthur, D S Moss, and J M Thornton. 1993. “PROCHECK: a
Program to Check the Stereochemical Quality of Protein Structures.” Journal of Applied
Crystallography 26 (2). International Union of Crystallography: 283–91.
doi:10.1107/S0021889892009944.
Lavender, Christopher A, Feng Ding, Nikolay V Dokholyan, and Kevin M Weeks. 2010.
“Robust and Generic RNA Modeling Using Inferred Constraints: a Structure for the
Hepatitis C Virus IRES Pseudoknot Domain.” Biochemistry 49 (24): 4931–33.
doi:10.1021/bi100142y.
Leaver-Fay, Andrew, Michael Tyka, Steven M Lewis, Oliver F Lange, James Thompson,
Ron Jacak, Kristian Kaufman, et al. 2011. “ROSETTA3: an Object-Oriented Software
Suite for the Simulation and Design of Macromolecules.” Methods in Enzymology 487.
Elsevier: 545–74. doi:10.1016/B978-0-12-381270-4.00019-6.
Leontis, Neocles B, and Eric Westhof. 2001. “Geometric Nomenclature and Classification of
RNA Base Pairs.” RNA 7 (4). Cambridge University Press: 499–512.
Li, He, Si-Qing Ma, Jin Huang, Xiao-Ping Chen, and Hong-Hao Zhou. 2017. “Roles of Long
Noncoding RNAs in Colorectal Cancer Metastasis.” Oncotarget 8 (24). Impact Journals:
39859–76. doi:10.18632/oncotarget.16339.
Liu, Yijin, Timothy J Wilson, and David M J Lilley. 2017. “The Structure of a Nucleolytic
Ribozyme That Employs a Catalytic Metal Ion.” Nature Chemical Biology 13 (5). Nature
Research: 508–13. doi:10.1038/nchembio.2333.
Lu, H, and J Skolnick. 2001. “A Distance-Dependent Atomic Knowledge-Based Potential for
Improved Protein Structure Selection.” Proteins 44 (3): 223–32.
Lundström, Jesper, Leszek Rychlewski, Arne Elofsson, and Janusz M Bujnicki. 2008.
“Pcons: a Neural-Network-Based Consensus Predictor That Improves Fold Recognition.”
Protein Science 10 (11). Cold Spring Harbor Laboratory Press: 2354–62.
doi:10.1110/ps.08501.
Machnicka, Magdalena A, Kaja Milanowska, Okan Osman Oglou, Elżbieta Purta,
Malgorzata Kurkowska, Anna Olchowik, Witold Januszewski, et al. 2013.
“MODOMICS: a Database of RNA Modification Pathways--2013 Update.” Nucleic
Acids Research 41 (Database issue): D262–67. doi:10.1093/nar/gks1007.
Macke, Thomas J, and David A Case. 2009. “Modeling Unusual Nucleic Acid Structures.” In
Molecular Modeling of Nucleic Acids, 682:379–93. ACS Symposium Series.
Washington, DC: American Chemical Society. doi:10.1021/bk-1998-0682.ch024.
Magnus, Marcin, Dorota Matelska, Grzegorz Lach, Grzegorz Chojnowski, Michal J
Boniecki, Elżbieta Purta, Wayne Dawson, Stanislaw Dunin-Horkawicz, and Janusz M
Bujnicki. 2014. “Computational Modeling of RNA 3D Structures, with the Aid of
Experimental Restraints.” RNA Biology 11 (5): 522–36. doi:10.4161/rna.28826.
Magnus, Marcin, Marcin Pawlowski, and Janusz M Bujnicki. 2012. “MetaLocGramN: a
Meta-Predictor of Protein Subcellular Localization for Gram-Negative Bacteria.” BBA -
Proteins and Proteomics 1824 (12). Elsevier B.V.: 1425–33.
doi:10.1016/j.bbapap.2012.05.018.
Magnus, Marcin, Michał J Boniecki, Wayne Dawson, and Janusz M Bujnicki. 2016.
“SimRNAweb: a Web Server for RNA 3D Structure Modeling with Optional
Restraints.” Nucleic Acids Research 44 (W1): W315–19. doi:10.1093/nar/gkw279.
Martin, Robert C. 2008. Clean Code. Pearson Education.
Martinez, Hugo M, Jacob V Maizel, and Bruce A Shapiro. 2008. “RNA2D3D: a Program for
Generating, Viewing, and Comparing 3-Dimensional Models of RNA.” Journal of
Biomolecular Structure & Dynamics 25 (6): 669–83.
doi:10.1080/07391102.2008.10531240.
Massire, C, and E Westhof. 1998. “MANIP: an Interactive Tool for Modelling RNA.”
Journal of Molecular Graphics & Modelling 16 (4-6): 197–205, 255–57.
Mathews, David H, Matthew D Disney, Jessica L Childs, Susan J Schroeder, Michael Zuker,
and Douglas H Turner. 2004. “Incorporating Chemical Modification Constraints Into a
Dynamic Programming Algorithm for Prediction of RNA Secondary Structure.”
Proceedings of the National Academy of Sciences 101 (19): 7287–92.
doi:10.1073/pnas.0401799101.
McCown, Phillip J, Keith A Corbino, Shira Stav, Madeline E Sherlock, and Ronald R
Breaker. 2017. “Riboswitch Diversity and Distribution.” RNA 23 (7): 995–1011.
doi:10.1261/rna.061234.117.
McGuffin, L J. 2008. “The ModFOLD Server for the Quality Assessment of Protein
Structural Models.” Bioinformatics 24 (4): 586–87. doi:10.1093/bioinformatics/btn014.
Merali, Zeeya. 2010. “Computational Science: ...Error.” Nature, October 14.
doi:10.1038/467775a.
Merino, Edward J, Kevin A Wilkinson, Jennifer L Coughlan, and Kevin M Weeks. 2005.
“RNA Structure Analysis at Single Nucleotide Resolution by Selective 2'-Hydroxyl
Acylation and Primer Extension (SHAPE).” Journal of the American Chemical Society
127 (12): 4223–31. doi:10.1021/ja043822v.
Miao, Zhichao, Ryszard W Adamiak, Maciej Antczak, Robert T Batey, Alexander J Becka,
Marcin Biesiada, Michał J Boniecki, et al. 2017. “RNA-Puzzles Round III: 3D RNA
Structure Prediction of Five Riboswitches and One Ribozyme.” RNA (New York, N.Y.)
23 (5): 655–72. doi:10.1261/rna.060368.116.
Miao, Zhichao, Ryszard W Adamiak, Marc-Frédérick Blanchet, Michal Boniecki, Janusz M
Bujnicki, Shi-Jie Chen, Clarence Cheng, et al. 2015. “RNA-Puzzles Round II:
Assessment of RNA Structure Prediction Programs Applied to Three Large RNA
Structures.” RNA (New York, N.Y.) 21 (6). Cold Spring Harbor Lab: 1066–84.
doi:10.1261/rna.049502.114.
Michel, F, and E Westhof. 1990. “Modelling of the Three-Dimensional Architecture of
Group I Catalytic Introns Based on Comparative Sequence Analysis.” Journal of
Molecular Biology 216 (3): 585–610. doi:10.1016/0022-2836(90)90386-Z.
Mlynsky, Vojtech, and Giovanni Bussi. 2017. “Understanding in-Line Probing Experiments
by Modeling Cleavage of Non-Reactive RNA Nucleotides.” RNA 23 (5). Cold Spring
Harbor Lab: 712–20. doi:10.1261/rna.060442.116.
Moretti, Rocco, Sergey Lyskov, Rhiju Das, Jens Meiler, and Jeffrey J Gray. 2017. “Web-
Accessible Molecular Modeling with Rosetta: the Rosetta Online Server That Includes
Everyone (ROSIE).” Protein Science: a Publication of the Protein Society, September.
doi:10.1002/pro.3313.
Nahvi, Ali, and Rachel Green. 2013. “Structural Analysis of RNA Backbone Using in-Line
Probing.” Methods in Enzymology 530. Elsevier: 381–97.
doi:10.1016/B978-0-12-420037-1.00022-1.
Nawrocki, E P, S W Burge, A Bateman, J Daub, R Y Eberhardt, S R Eddy, E W Floden, et al.
2015. “Rfam 12.0: Updates to the RNA Families Database.” Nucleic Acids Research 43
(D1): D130–37. doi:10.1093/nar/gku1063.
Nawrocki, Eric P, Diana L Kolbe, and Sean R Eddy. 2009. “Infernal 1.0: Inference of RNA
Alignments.” Bioinformatics (Oxford, England) 25 (10): 1335–37.
doi:10.1093/bioinformatics/btp157.
Norambuena, T, J F Cares, E Capriotti, and F Melo. 2013. “WebRASP: a Server for
Computing Energy Scores to Assess the Accuracy and Stability of RNA 3D Structures.”
Bioinformatics (Oxford, England). doi:10.1093/bioinformatics/btt441.
Nussinov, Ruth, George Pieczenik, Jerrold R Griggs, and Daniel J Kleitman. 1978.
“Algorithms for Loop Matchings.” SIAM Journal on Applied Mathematics 35 (1): 68–82.
doi:10.1137/0135006.
Parisien, Marc, and François Major. 2008. “The MC-Fold and MC-Sym Pipeline Infers RNA
Structure From Sequence Data.” Nature 452 (7183): 51–55. doi:10.1038/nature06684.
Pawlowski, Marcin, Albert Bogdanowicz, and Janusz M Bujnicki. 2013. “QA-RecombineIt:
a Server for Quality Assessment and Recombination of Protein Models.” Nucleic Acids
Research 41 (W1). Oxford University Press: W389–97. doi:10.1093/nar/gkt408.
Pawlowski, Marcin, Michal J Gajda, Ryszard Matlak, and Janusz M Bujnicki. 2008.
“MetaMQAP: a Meta-Server for the Quality Assessment of Protein Models.” BMC
Bioinformatics 9 (1). BioMed Central: 403. doi:10.1186/1471-2105-9-403.
Pennisi, Elizabeth. 2013. “The CRISPR Craze.” Science (New York, N.Y.), August 23.
doi:10.1126/science.341.6148.833.
Pérez, F, and B E Granger. 2007. “IPython: a System for Interactive Scientific Computing.”
Computing in Science & Engineering 9 (3): 21–29.
doi:10.1109/MCSE.2007.53.
Piatkowski, Pawel, Joanna M Kasprzak, Deepak Kumar, Marcin Magnus, Grzegorz
Chojnowski, and Janusz M Bujnicki. 2016. “RNA 3D Structure Modeling by
Combination of Template-Based Method ModeRNA, Template-Free Folding with
SimRNA, and Refinement with QRNAS.” Methods in Molecular Biology (Clifton, N.J.)
1490 (Suppl). New York, NY: Springer New York: 217–35. doi:10.1007/978-1-4939-
6433-8_14.
Popenda, M, M Szachniuk, M Antczak, K J Purzycka, P Lukasiak, N Bartol, J Blazewicz,
and R W Adamiak. 2012. “Automated 3D Structure Composition for Large RNAs.”
Nucleic Acids Research 40 (14): e112–12. doi:10.1093/nar/gks339.
Puton, Tomasz, Lukasz P Kozlowski, Kristian M Rother, and Janusz M Bujnicki. 2013.
“CompaRNA: a Server for Continuous Benchmarking of Automated Methods for RNA
Secondary Structure Prediction.” Nucleic Acids Research 41 (7): 4307–23.
doi:10.1093/nar/gkt101.
Qin, Peter Z, and Thorsten Dieckmann. 2004. “Application of NMR and EPR Methods to the
Study of RNA.” Current Opinion in Structural Biology 14 (3): 350–59.
doi:10.1016/j.sbi.2004.04.002.
Ren, Aiming, and Dinshaw J Patel. 2014. “C-Di-AMP Binds the ydaO Riboswitch in Two
Pseudo-Symmetry-Related Pockets.” Nature Chemical Biology 10 (9): 780–86.
doi:10.1038/nchembio.1606.
Ren, Aiming, Kanagalaghatta R Rajashankar, and Dinshaw J Patel. 2015. “Global RNA Fold
and Molecular Recognition for a Pfl Riboswitch Bound to ZMP, a Master Regulator of
One-Carbon Metabolism.” Structure 23 (8): 1375–81. doi:10.1016/j.str.2015.05.016.
Ren, Aiming, Nikola Vušurović, Jennifer Gebetsberger, Pu Gao, Michael Juen, Christoph
Kreutz, Ronald Micura, and Dinshaw J Patel. 2016. “Pistol Ribozyme Adopts a
Pseudoknot Fold Facilitating Site-Specific in-Line Cleavage.” Nature Chemical Biology
12 (9): 702–8. doi:10.1038/nchembio.2125.
Rivas, Elena, and Sean R Eddy. 1999. “A Dynamic Programming Algorithm for RNA
Structure Prediction Including Pseudoknots.” Journal of
Molecular Biology 285 (5): 2053–68. doi:10.1006/jmbi.1998.2436.
Rother, Kristian. 2017. Pro Python Best Practices. Berkeley, CA: Apress. doi:10.1007/978-1-
4842-2241-6.
Rother, M, K Milanowska, T Puton, J Jeleniewicz, K Rother, and Janusz M Bujnicki. 2011.
“ModeRNA Server: an Online Tool for Modeling RNA 3D Structures.” Bioinformatics
27 (17): 2441–42. doi:10.1093/bioinformatics/btr400.
Rother, Magdalena, Kristian Rother, Tomasz Puton, and Janusz M Bujnicki. 2011.
“ModeRNA: a Tool for Comparative Modeling of RNA 3D Structure.” Nucleic Acids
Research 39 (10). Oxford University Press: 4007–22. doi:10.1093/nar/gkq1320.
Saini, Harpreet Kaur, and Daniel Fischer. 2005. “Meta-DP: Domain Prediction
Meta-Server.” Bioinformatics 21 (12): 2917–20. doi:10.1093/bioinformatics/bti445.
Sali, A, and T L Blundell. 1993. “Comparative Protein Modelling by Satisfaction of Spatial
Restraints.” Journal of Molecular Biology 234 (3): 779–815.
doi:10.1006/jmbi.1993.1626.
Sandve, Geir Kjetil, Anton Nekrutenko, James Taylor, and Eivind Hovig. 2013. “Ten Simple
Rules for Reproducible Computational Research.” Edited by Philip E Bourne. PLoS
Computational Biology 9 (10). Public Library of Science: e1003285.
doi:10.1371/journal.pcbi.1003285.
Seemann, Stefan E, Jan Gorodkin, and Rolf Backofen. 2008. “Unifying Evolutionary and
Thermodynamic Information for RNA Folding of Multiple Alignments.” Nucleic Acids
Research 36 (20): 6355–62. doi:10.1093/nar/gkn544.
Siebert, S, and R Backofen. 2005. “MARNA: Multiple Alignment and Consensus Structure
Prediction of RNAs Based on Sequence Structure Comparisons.” Bioinformatics 21 (16):
3352–59. doi:10.1093/bioinformatics/bti550.
Simons, Kim T, Ingo Ruczinski, Charles Kooperberg, Brian A Fox, Chris Bystroff, and
David Baker. 1999. “Improved Recognition of Native‐Like Protein Structures Using a
Combination of Sequence‐Dependent and Sequence‐Independent Features of Proteins.”
Proteins: Structure, Function, and Bioinformatics 34 (1). John Wiley & Sons, Inc.: 82–
95. doi:10.1002/(SICI)1097-0134(19990101)34:1<82::AID-PROT7>3.0.CO;2-A.
Strack, Rita L, Wenjiao Song, and Samie R Jaffrey. 2013. “Using Spinach-Based Sensors for
Fluorescence Imaging of Intracellular Metabolites and Proteins in Living Bacteria.”
Nature Protocols 9 (1): 146–55. doi:10.1038/nprot.2014.001.
Szabo, Linda, and Julia Salzman. 2016. “Detecting Circular RNAs: Bioinformatic and
Experimental Challenges.” Nature Reviews Genetics 17 (11): 679–92.
doi:10.1038/nrg.2016.114.
Trausch, J J, J G Marcano-Velázquez, and M M Matyjasik. 2015. “Metal Ion-Mediated
Nucleobase Recognition by the ZTP Riboswitch.” doi:10.1016/j.chembiol.2015.06.007.
Tuszyńska, Irina, and Janusz M Bujnicki. 2011. “DARS-RNP and QUASI-RNP: New
Statistical Potentials for Protein-RNA Docking.” BMC Bioinformatics 12 (1). BioMed
Central: 348. doi:10.1186/1471-2105-12-348.
Tuszyńska, Irina, Marcin Magnus, Katarzyna Jonak, Wayne Dawson, and Janusz M Bujnicki.
2015. “NPDock: a Web Server for Protein-Nucleic Acid Docking.” Nucleic Acids
Research 43 (W1): W425–30. doi:10.1093/nar/gkv493.
Van Der Spoel, David, Erik Lindahl, Berk Hess, Gerrit Groenhof, Alan E Mark, and Herman
J C Berendsen. 2005. “GROMACS: Fast, Flexible, and Free.” Journal of Computational
Chemistry 26 (16). Wiley Subscription Services, Inc., A Wiley Company: 1701–18.
doi:10.1002/jcc.20291.
Waleń, Tomasz, Grzegorz Chojnowski, Przemysław Gierski, and Janusz M Bujnicki. 2014.
“ClaRNA: a Classifier of Contacts in RNA 3D Structures Based on a Comparative
Analysis of Various Classification Schemes.” Nucleic Acids Research 42 (19). Oxford
University Press: e151–51. doi:10.1093/nar/gku765.
Wang, J, Y Zhao, C Zhu, and Y Xiao. 2015. “3dRNAscore: a Distance and Torsion Angle
Dependent Evaluation Function of 3D RNA Structures.” Nucleic Acids Research 43 (10):
e63–e63. doi:10.1093/nar/gkv141.
Wang, Jian, and Yi Xiao. 2002. Using 3dRNA for RNA 3-D Structure Prediction and
Evaluation. Vol. 17. Hoboken, NJ, USA: John Wiley & Sons, Inc. doi:10.1002/cpbi.21.
Washietl, Stefan, Ivo L Hofacker, Peter F Stadler, and Manolis Kellis. 2012. “RNA Folding
with Soft Constraints: Reconciliation of Probing Data and Thermodynamic Secondary
Structure Prediction.” Nucleic Acids Research 40 (10): 4261–72.
doi:10.1093/nar/gks009.
Waterhouse, A M, J B Procter, and D M A Martin. 2009. “Jalview Version 2—a Multiple
Sequence Alignment Editor and Analysis Workbench.” ….
Weinreb, Caleb, Adam J Riesselman, John B Ingraham, Torsten Gross, Chris Sander, and
Debora S Marks. 2016. “3D RNA and Functional Interactions From Evolutionary
Couplings.” Cell, October. Elsevier Inc., 1–14. doi:10.1016/j.cell.2016.03.030.
Westhof, Eric. 2010. “The Amazing World of Bacterial Structured RNAs.” Genome Biology
11 (3). BioMed Central: 108. doi:10.1186/gb-2010-11-3-108.
Wikimedia-Commons. 2017. “File:RNA_Chemical_Structure.GIF.”
Commons.Wikimedia.org. Accessed September 16.
https://commons.wikimedia.org/wiki/File:RNA_chemical_structure.GIF.
Xu, Shouping, Dejia Kong, Qianlin Chen, Yanyan Ping, and Da Pang. 2017. “Oncogenic
Long Noncoding RNA Landscape in Breast Cancer.” Molecular Cancer 16 (1). BioMed
Central: 129. doi:10.1186/s12943-017-0696-6.
Yang, Huanwang, Fabrice Jossinet, Neocles Leontis, Li Chen, John Westbrook, Helen
Berman, and Eric Westhof. 2003. “Tools for the Automatic Identification and
Classification of RNA Base Pairs.” Nucleic Acids Research 31 (13). Oxford University
Press: 3450–60.
Zuker, Michael, and Patrick Stiegler. 1981. “Optimal Computer Folding of Large RNA
Sequences Using Thermodynamics and Auxiliary Information.” Nucleic Acids Research
9 (1): 133–48. doi:10.1093/nar/9.1.133.
Zwieb, C, and F Müller. 1997. “Three-Dimensional Comparative Modeling of RNA.”
Nucleic Acids Symposium Series, no. 36: 69–71.