Mol Biol Evol 2001 Laprevotte 1231 45

  • Upload
    incxx

  • View
    220

  • Download
    0

Embed Size (px)

Citation preview

  • 8/2/2019 Mol Biol Evol 2001 Laprevotte 1231 45

    1/15

    1231

    Mol. Biol. Evol. 18(7):12311245. 2001

    2001 by the Society for Molecular Biology and Evolution. ISSN: 0737-4038

    HIV-1 and HIV-2 LTR Nucleotide Sequences: Assessment of the Alignment

    by N-block Presentation, Retroviral Signatures of Overrepeated

    Oligonucleotides, and a Probable Important Role of Scrambled Stepwise

    Duplications/Deletions in Molecular Evolution

    Ivan Laprevotte,* Maude Pupin, Eivind Coward, Gilles Didier,* Christophe Terzian,Claudine Devauchelle,* and Alain Henaut*

    *Laboratoire Genome et Informatique, Universite de Versailles Saint Quentin-en-Yvelines, Versailles, France; LaboratoiredInformatique Fondamentale de Lille, Equipe Bioinformatique, Universite des Sciences et Technologie de Lille, VilleneuvedAscq, France; Deutsches Krebsforschungszentrum Theoretische Bioinformatik (H0300) Im Neuenheimer Feld 280,Heidelberg, Germany; and Institut de Genetique Humaine, Montpellier, France

    Previous analyses of retroviral nucleotide sequences, suggest a so-called scrambled duplicative stepwise molecularevolution (many sectors with successive duplications/deletions of short and longer motifs) that could have stemmedfrom one or several starter tandemly repeated short sequence(s). In the present report, we tested this hypothesis byfocusing on the long terminal repeats (LTRs) (and flanking sequences) of 24 human and 3 simian immunodeficiencyviruses. By using a calculation strategy applicable to short sequences, we found consensus overrepresented motifs(often containing CTG or CAG) that were congruent with the previously defined retroviral signature. We alsoshow many local repetition patterns that are significant when compared with simply shuffled sequences. First- andsecond-order Markov chain analyses demonstrate that a major portion of the overrepresented oligonucleotides canbe predicted from the dinucleotide compositions of the sequences, but by no means can biological mechanisms bededuced from these results: some of the listed local repetitions remain significant against dinucleotide-conservingshuffled sequences; together with previous results, this suggests that interspersed and/or local mononucleotide andoligonucleotide repetitions could have biased the dinucleotide compositions of the sequences. We searched forsuggestive evolutionary patterns by scrutinizing a reliable multiple alignment of the 27 sequences. A manuallyconstructed alignment based on homology blocks was in good agreement with the polypeptide alignment in thecoding sectors and has been exhaustively assessed by using a multiplied alphabet obtained by the promising math-ematical strategy called the N-block presentation (taking into account the environment of each nucleotide in asequence). Sector by sector, we hypothesize many successive duplication/deletion scenarios that fit our previousevolutionary hypotheses. This suggests an important duplication/deletion role for the reverse transcriptase, partic-ularly in inducing stuttering cryptic simplicity patterns.

    Introduction

    Previously, computer-aided analyses of retroviralnucleotide sequences aimed to unravel putative molec-ular evolution models from sequence comparisons andoligonucleotide distributions (Laprevotte et al. 1984,1997; Laprevotte 1989, 1992; Terzian et al. 1997). Ananalysis of 24 viruses from the 10 classes of vertebrateretroviruses has shown common features for the over-represented oligonucleotides three to six bases in length(Laprevotte et al. 1997): alternating purine and pyrimi-dine stretches are emphasized and displayed clearly ineach of two subsets on both sides of CTG (or TTG) orCAG, respectively. Two general consensuses show up:CCTGG and CAGR; both are found in most classes ofretroviruses and at least one is found in each class. Thisretroviral signature was not found among yeast, plant,and invertebrate retrotransposons, which indicates that

    the vertebrate retroviruses are a distinct homogeneousgroup (Terzian et al. 1997); this fits a common evolu-

    Key words: HIV-1 and HIV-2 LTR nucleotide sequences, multiplealignment, N-block presentation, retroviral signatures of overre-peated oligonucleotides, scrambled stepwise duplications/deletions,cryptic simplicity.

    Address for correspondence and reprints: Ivan Laprevotte, Labor-atoire Genome et Informatique, Universite de Versailles Saint Quentin-en-Yvelines, 45 avenue des Etats-Unis, 78035 Versailles cedex, France.E-mail: [email protected].

    tionary origin for the retroviruses and is consistent withthe universal rule of a trend toward TG/CT excess,which was proposed as a generative principle of nucle-otide sequences (Ohno and Yomo 1990). Most of theoligonucleotides 36 bases in length with significantlylarger than average numbers of occurrences appear tobe internally repeated (with mono- or oligonucleotideinternal iterations), suggesting an evolutionary stage byslippage-like local duplications. As a whole, these re-sults are consistent with a scrambled duplicative step-wise molecular evolution (many sectors with succes-sive duplications/deletions of short and longer motifs).Core consensuses could correspond to intermediary evo-lutionary stages, with short tandem repeats giving riseto longer oligonucleotide repeats as previously hypoth-esized (Southern 1972; Ohno 1988).

    In the present study, we tested these evolutionaryhypotheses by focusing on the long terminal repeats(LTRs) (and flanking sequences) that bound proviralDNA sequences from two groups of human immuno-deficiency viruses (HIV): 15 HIV-1s together with achimpanzee simian immunodeficiency virus (SIV), and9 HIV-2s together with a macaque SIV and a sootymangabey SIV. It is known that following retrovirus in-tegration into the host-cell genome, the double-strandedproviral DNA is flanked by two identical LTRs, with the5 LTR element serving as the binding site for transcrip-tion factors (reviewed in Ou and Gaynor 1995; Pereira

  • 8/2/2019 Mol Biol Evol 2001 Laprevotte 1231 45

    2/15

    1232 Laprevotte et al.

    FIG. 1.Diagrammatic presentation of the 5/3 long terminal re-peats (LTRs) of the human and simian immunodeficiency viruses. U3,R, U5, PBS, PPT, and Nef are defined in the text. U3, R, and U5 makeup the LTR. The intercalated and interrupted sectors, together with thedotted line, represent the rest of the retrovirus sequence. The hatchedzones correspond to the flanking eukaryotic sequences and the Nefgene-coding sequence. The actual sequences aligned and studied in thepresent paper are recapitulated in the bottom line.

    et al. 2000). The HIV nef gene open reading frame par-

    tially overlaps the 3 LTR (fig. 1). The HIV LTRs areshort sequences that can be visually compared and havebeen subjected to exhaustive sequencing and biologicalstudies because of an important pathological concern (asa result of the worldwide AIDS crisis) and the presenceof transcription control sites on them. It is alreadyknown that retrovirus LTRs have multiplied motifs thatmay correspond to experimentally determined regula-tory elements (Frech, Brack-Werner, and Werner 1996).Several studies have shown that the LTR structures andtheir regulation are of particular interest for HIV ex-pression (Gaynor 1992). Here, we use the control sitesthat are conserved during evolution as starter homologyblocks for a reliable multiple alignment of the 27 LTR

    sequences. We also list overrepresented words by usinga new calculation strategy (Klaerr-Blanchard, Chiapello,and Coward 2000) applicable to short sequences suchas the LTRs and their coding and noncoding sectors.CTG is often found in the overlapping multiplied motifsdescribed in HIV-1 LTRs (Seto, Brunck, and Bernstein1989). Moreover, the sequences of HIV (1 and 2) arethe most biased in favor of the overrepresented trinu-cleotides in the LTRs (Laprevotte et al. 1997).We searchfor putative short- and longer-range duplications/dele-tions by comparisons with shuffled sequences and byscrutinizing the thoroughly assessed alignment of the 27sequences sector by sector. The results are in accordancewith the previous hypotheses for the retrovirus nucleo-

    tide sequences of molecular evolution by scrambledstepwise short- and longer-range duplications/deletions.

    Materials and MethodsThe 27 Studied Nucleotide Sequences

    The sequences represented a portion of the plusstrand of the proviral DNA (this strand corresponds tothe viral RNA). These are the LTRs together with flank-ing sectors (fig. 1 and the alignment on the web page).The 27 5 nucleotides were those located upstream ofthe 3 LTR. They included the polypurine tract (PPT)

    that is the binding site for the primer of DNA plus-strand reverse transcription. The 40 3 nucleotideswere those located downstream of the 5 LTR. Theyincluded the primer-binding site (PBS) for minus-strandreverse transcription. These two flanking sectors werehighly conserved 5 and 3 landmarks that bound thealignment (see below). The three regions of the LTRwere 5-U3-R-U5-3 (reviewed in Peterlin 1995; Vogt1997). U3 corresponds to the unique regulatory se-quence at the 3 end of viral RNA, R corresponds to theterminal direct repeat RNA, and U5 corresponds to theunique regulatory sequence at the 5 end of viral RNA.The 27 sequences are listed in table 1 and in the leftcolumn of the alignment on the web page. The upper(HIV-1) group is that of the HIV-1s together with arelated (Berkhout 1996) chimpanzee SIV (CIVCG[X52154]). The lower (HIV-2) group corresponds to theHIV-2s together with related SIVs (Berkhout 1996): asooty mangabey SIV (RESIVSMM [X14307]) and amacaque SIV (RESIMM251 [M19499]). HIV-1 se-quences were regrouped in accordance with their de-grees of reciprocal homologies (determined by pairwisecomparisons taking into account only the putative basesubstitutions and excluding the other evolutionaryevents; data not shown). HIV-2 sequences (L07625 andX61240 excluded) are listed in alphabetical order oftheir EMBL accession numbers. L07625 and X61240were brought together: as seen in the alignment, theyshared many common features that separate them fromthe other HIV-2 sequences (Kreutz et al. 1992; Barnettet al. 1993).

    The Method for Finding Exceptional Words in aSequence

    The calculation strategy (Klaerr-Blanchard, Chia-

    pello, and Coward 2000) is applicable to short sequenc-es. The basic operation is to count occurrences of wordsof a given length in sub-sequences called windows(here, the LTRs and their coding and noncoding sectors).The P value is the probability for any word to occur inat least its observed number of positions. The calcula-tion strategy involves the possibility that a word couldoverlap with itself. Since the probabilities are calculatedfor words actually occurring in the window, the firstoccurrence of any word is considered given (probability 1). The P value is considered significant when it islower than 0.01 or even 0.001 (tables 25). When theBernoulli model is used, the calculation takes into ac-count the length and the base composition of the ob-

    served window. When the model is a first-order Markovchain, the dinucleotide composition is additionally takeninto account. The implementation of these probabilitycalculations is a part of the program Excep (Coward1998).

    Methods for Finding Probable Local RepetitionSectors

    We set a priori patterns that could correspond tolocal repetitions. A numerical value was used, that is,the percentage of the bases in the observed sequence

  • 8/2/2019 Mol Biol Evol 2001 Laprevotte 1231 45

    3/15

    HIV-1 and HIV-2 LTR Nucleotide Sequences 1233

    Table 1The 27 Retroviral Nucleotide Sequences Analyzed in this Work

    Virus

    EMBLAccession

    No.

    Chimpanzee immunodeficiency virus (CIVCG), SIV(cpz). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . X52154

    Human immunodeficiency virus type 1, isolate BRU. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .Human immunodeficiency virus type 1, isolate PV 22 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

    Human immunodeficiency virus type 1, (HXB2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .Human immunodeficiency virus type 1, NY5/BRU (LAV-1) recombinant clone pNL4-3 . . . . . . .Human T-cell leukaemia type III (HTLV-III) proviral genome (AIDS virus) . . . . . . . . . . . . . . . . .Human immunodeficiency virus type 1, isolate ARV-2/SF2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .Human immunodeficiency virus type 1, isolate MN. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .Human immunodeficiency virus type 1, isolate RF (HAT-3). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .Human immunodeficiency virus type 1 (HIV-1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .Human immunodeficiency virus type 1, isolate ELI. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .Human immunodeficiency virus type 1, isolate Z2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .Human immunodeficiency virus type 1 (HIV-1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .Human lymphadenopathy virus (MAL isolate) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .Human immunodeficiency virus type 1 (HIV-1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .Human immunodeficiency virus type 1, Ugandan isolate U455 . . . . . . . . . . . . . . . . . . . . . . . . . . . .

    K02013K02083

    K03455M19921X01762K02007M17449M17451M26727K03454M22639M27323K03456L20571M62320

    Simian immunodeficiency virus from sooty mangabey monkey (RESIVSMM) . . . . . . . . . . . . . . .Simian (macaque) immunodeficiency virus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

    X14307M19499

    Human immunodeficiency virus type 2 from strain HIV-2UC1 . . . . . . . . . . . . . . . . . . . . . . . . . . . .

    Human immunodeficiency virus type 2, isolate D205 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .Human immunodeficiency virus type 2, isolate HIV2FG. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .Human immunodeficiency virus type 2, isolate SBLISY. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .Human immunodeficiency virus type 2 (HIV-2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .Human immunodeficiency virus type 2, isolate ROD. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .Human immunodeficiency virus type 2 (HIV-2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .Human immunodeficiency virus type 2, isolate GH-1, clone 8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . .Human immunodeficiency virus type 2 (HIV-2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

    L07625

    X61240J03654J04498J04542M15390M30502M30895M31113

    that were included in at least one of the repetitions sodefined. For each of the 27 sequences, the actual valuewas compared with those of 100 simply shuffled se-quences (Bernoulli model) or 100 shuffled sequencesadditionally conserving the exact starter dinucleotide

    count (first-order Markov chain model). The result wasconsidered significant (table 6) when any random valuewas lower than that of the observed sequence. The resultwas considered somewhat significant when no morethan 5 of the 100 random values were above the ob-served one. A computer program that shuffled the lettersof a sequence while accurately conserving its dinucle-otide, or even trinucleotide, composition, was imple-mented (Kandel et al. 1996; Coward 1999). Searchingof approximately duplicated sectors was done using theprogram BESTFIT (Smith and Waterman 1981; Dever-eux 1989): the alignment score between two sectors wascompared with 100 random values corresponding to thealignments of one of the two observed sectors with 100

    shufflings of the other.

    The N-Presentation Algorithm

    The N-presentation strategy (Didier 1999) enablesone to code a set of biological sequences (here, nucle-otide sequences) by using a multiplied alphabet account-ing for the neighborhood of each letter in the sequences.Here, the N-presentation rank was the length of theneighborhood considered (here, the 8- and the 12-pre-sentations were performed). In the N-presentation, eachnucleotide is renamed by its letter followed by a number

    (its type). A type of nucleotide appears in two differentpositions in the sequences if the same neighborhood oflength N covers these two positions with the same rel-ative rank (i.e., the neighborhood starts at the same dis-tance [N] from the two positions). The calculation pro-

    cedure also enables one to identically rename the samestarter characters when they are included in oligonucle-otides that are only partially similar. In this study, theN-block presentation computer program was used to as-sess the alignment of the 27 sequences. Actually, thisalignment was difficult to assess with the four-letter al-phabet of DNA because of a large number of putativeduplications/deletions that made the problem of the gapsdifficult to manage and because the homology blockswere often difficult to distinguish from noise. The ho-mology blocks were much easier to delineate using analphabet made up of a large number of characters (4,711here when the 12-presentation was used). To begin with,the alignment was constructed manually (see below), in-

    asmuch as the available software could not use an al-phabet with more than 26 letters.

    The Alignment Strategy

    The alignment can be found on the web page http://g enome.geneti que.uv sq. fr/ laprev ott e/. Within eachHIV-1 or HIV-2 group, the sequences were closely re-lated and the alignments are easily constructed, such thatthe two consensuses were easily deduced. The point isto align HIV-1 and HIV-2 sequences together (these se-quences are supposed to have a common evolutionary

  • 8/2/2019 Mol Biol Evol 2001 Laprevotte 1231 45

    4/15

    1234 Laprevotte et al.

    Table2

    OverrepresentedOligonucleotidesintheEntireSequences

    A

    C

    G

    T

    A

    A

    A

    C

    C

    C

    C

    G

    G

    T

    T

    C

    G

    G

    A

    T

    T

    T

    C

    G

    C

    G

    T

    A

    G

    G

    C

    G

    T

    T

    A

    T

    G

    A

    A

    A

    A

    A

    A

    A

    A

    A

    A

    A

    A

    C

    C

    C

    C

    C

    C

    C

    C

    A

    C

    C

    C

    C

    C

    G

    G

    G

    G

    G

    G

    A

    A

    A

    C

    C

    C

    T

    T

    G

    A

    A

    C

    T

    T

    A

    A

    C

    C

    G

    T

    A

    G

    G

    A

    C

    T

    A

    C

    A

    A

    C

    C

    A

    G

    A

    C

    A

    T

    A

    G

    G

    A

    G

    G

    T

    G

    G

    T

    C

    C

    C

    C

    G

    G

    G

    G

    G

    G

    G

    G

    G

    T

    T

    T

    T

    T

    T

    T

    T

    T

    T

    T

    A

    A

    A

    C

    C

    C

    C

    G

    G

    A

    C

    C

    G

    G

    G

    G

    G

    G

    T

    T

    A

    C

    G

    A

    C

    T

    T

    A

    G

    C

    T

    T

    C

    C

    G

    G

    A

    G

    G

    T

    G

    T

    G

    G

    T

    G

    T

    A

    A

    T

    C

    G

    C

    T

    C

    G

    CIVCG...........

    K02013...........

    K02083...........

    K03455...........

    M19921...........

    X01762...........

    K02007...........

    M17449...........

    M17451...........

    M26727...........

    K03454...........

    M22639...........

    M27323...........

    K03456...........

    L20571...........

    M62320...........

    X

    x

    X

    x

    X

    x

    X

    x

    X

    x

    X

    x

    X

    x

    X

    x

    X X X

    x

    X

    x

    X X X X

    x

    x

    X

    x

    X

    x

    X

    x

    x

    X

    x

    X

    x

    X

    x

    X

    x

    x

    X

    x

    X x

    x

    x

    X

    x

    x

    x

    x

    x

    X

    x

    X

    x

    x

    x

    x

    x

    x

    x

    x

    x

    X

    x

    x

    x

    x

    x

    x

    x x

    x

    x

    X

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    X

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    X

    x

    x

    RESIVSMM.......

    RESIMM251......

    L07625...........

    X61240...........

    J03654............

    J04498............

    J04542............

    M15390...........

    M30502...........

    M30895...........

    M31113...........

    x

    X

    X

    X

    x

    X

    x

    X

    x

    X

    x

    X

    X

    x

    x

    X

    x

    X

    x

    X

    X

    X

    X

    x

    X

    x

    x

    X

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    X

    x

    X

    x

    X

    x

    x

    x

    x

    x

    x

    x

    X

    x

    x

    x

    x

    X

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    X

    x

    x

    x

    x

    x

    x

    x

    x

    x x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    X

    x

    x

    x

    x

    X

    x

    X

    X

    x

    x

    X

    X

    x

    x

    x

    x

    x

    x

    X

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    NOTE.

    Overrepresentedoligonucleotidesaredisplayedfromtoptobottom(thoseincludedinCCTGGorCAGRareinboldfacetype).X

    significant;x

    somewhatsignificant(seeMaterialsandMethods).

  • 8/2/2019 Mol Biol Evol 2001 Laprevotte 1231 45

    5/15

    HIV-1 and HIV-2 LTR Nucleotide Sequences 1235

    Table 3Overrepresented Oligonucleotides in the Coding Sectors

    A

    G

    C

    T

    A

    C

    A

    A

    G

    A

    C

    T

    G

    G

    C

    T

    A

    A

    G

    A

    A

    C

    A

    A

    A

    C

    A

    C

    A

    G

    A

    A

    A

    G

    G

    C

    C

    A

    C

    A

    C

    C

    A

    G

    C

    T

    G

    G

    G

    A

    A

    G

    G

    A

    C

    T

    G

    A

    G

    A

    G

    G

    A

    A

    G

    G

    A

    T

    G

    G

    C

    T

    T

    A

    C

    C

    T

    C

    A

    G

    T

    G

    G

    A

    T

    G

    G

    C

    T

    T

    T

    G

    CIVCG . . . . . . .K02013 . . . . . . .

    K02083 . . . . . . .M19921. . . . . . .K02007 . . . . . . .M17451. . . . . . .M26727. . . . . . .K03454 . . . . . . .M22639. . . . . . .M27323. . . . . . .K03456 . . . . . . .L20571 . . . . . . .M62320. . . . . . .

    x

    x

    x

    x

    x x X

    x

    x

    x

    x

    X

    x

    x

    x

    x

    x

    x

    xx

    x

    RESIVSMM . . .RESIMM251. . .L07625 . . . . . . .X61240 . . . . . . .J03654. . . . . . . .J04498. . . . . . . .

    J04542. . . . . . . .M15390. . . . . . .M30502. . . . . . .M30895. . . . . . .M31113. . . . . . .

    x

    x

    x X x

    x

    X

    x

    xx

    x

    x

    X

    x

    x

    x

    x

    x

    x x

    x

    x

    x

    x

    x

    NOTE.Overrepresented oligonucleotides are displayed from top to bottom (those included in CCTGG or CAGR are in boldface type). X significant; x

    somewhat significant (see Materials and Methods). Three HIV-1 sequences are not included because of premature stop codons.

    Table 4Overrepresented Oligonucleotides in the Noncoding Sectors

    A

    G

    C

    T

    A

    C

    T

    A

    G

    A

    A

    G

    G

    C

    A

    G

    C

    T

    G

    C

    T

    T

    G

    C

    T

    T

    C

    T

    A

    A

    G

    C

    A

    G

    A

    G

    A

    G

    C

    A

    A

    G

    T

    G

    C

    A

    C

    T

    C

    A

    G

    G

    C

    C

    C

    T

    C

    C

    T

    G

    C

    T

    A

    G

    C

    T

    C

    T

    C

    T

    G

    C

    C

    T

    G

    G

    C

    T

    T

    G

    C

    T

    T

    T

    G

    A

    C

    T

    G

    C

    A

    G

    G

    C

    T

    G

    G

    G

    G

    A

    T

    A

    A

    A

    T

    A

    C

    T

    T

    C

    T

    C

    T

    G

    C

    T

    T

    G

    G

    G

    CIVCG . . . . . . .K02013. . . . . . .K02083. . . . . . .M19921 . . . . . .K02007. . . . . . .M17451 . . . . . .M26727 . . . . . .K03454. . . . . . .M22639 . . . . . .M27323 . . . . . .K03456. . . . . . .L20571 . . . . . . .M62320 . . . . . .

    x

    X

    X

    X

    X

    X

    X

    X

    X

    X

    x

    X

    X

    X x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    X

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    X

    x

    x

    x

    X

    X

    X

    X

    x

    X

    x

    x

    x

    x

    X

    X

    X

    X

    X

    X

    X

    x

    x

    x

    x

    x x

    RESIVSMM. . .RESIMM251 . .L07625 . . . . . . .X61240. . . . . . .J03654 . . . . . . .J04498 . . . . . . .J04542 . . . . . . .M15390 . . . . . .M30502 . . . . . .M30895 . . . . . .M31113 . . . . . .

    x

    x

    x

    x

    x

    x

    x

    x

    X

    X

    X

    X

    X

    X

    X

    X

    X

    X

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    X

    X

    x

    x

    x

    X

    x

    x

    x

    X

    x

    x

    x

    x

    x

    x

    X x

    X

    x

    x

    x

    X

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    X

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    NOTE.Overrepresented oligonucleotides are displayed from top to bottom (those included in CCTGG or CAGR are in boldface type). X significant; x

    somewhat significant (see Materials and Methods).

  • 8/2/2019 Mol Biol Evol 2001 Laprevotte 1231 45

    6/15

    1236 Laprevotte et al.

    Table 5Overrepresented Oligonucleotides Versus a First-Order Markov Chain Model

    A

    T

    C

    C

    G

    C

    A

    C

    A

    A

    A

    G

    A

    C

    A

    G

    T

    G

    A

    T

    C

    C

    C

    G

    C

    T

    C

    G

    G

    G

    C

    T

    G

    G

    G

    A

    A

    G

    G

    A

    T

    C

    G

    C

    G

    C

    G

    G

    G

    A

    G

    T

    G

    T

    T

    A

    A

    A

    T

    A

    C

    C

    T

    A

    T

    A

    T

    G

    G

    C

    T

    G

    T

    A

    CIVCG. . . . . . .

    K02013. . . . . . .K02083. . . . . . .M19921 . . . . . .X01762. . . . . . .K02007. . . . . . .

    M17451 . . . . . .

    M26727 . . . . . .K03454. . . . . . .M22639 . . . . . .

    M27323 . . . . . .

    K03456. . . . . . .L20571 . . . . . . .

    M62320 . . . . . .

    716a

    [1353]b

    [360700]c

    [360700]c

    [360700]c

    700a

    700a

    [360700]c

    700a

    [360700]c

    [360700]c

    [360700]c

    700a

    [1356]b

    [360700]c

    700a

    [360700]c

    [360697]c

    702a

    [360702]c[357708]c

    x

    x

    x

    x

    x

    X

    x

    x

    x

    x

    Xx

    X

    x

    x

    x

    x

    x

    x

    X

    x

    xx

    x

    RESIVSMM. . .

    RESIMM251 . .L07625 . . . . . . .X61240. . . . . . .

    J03654 . . . . . . .J04498 . . . . . . .

    M30502 . . . . . .M30895 . . . . . .

    881a

    [1569]b

    [1383]b

    [375921]c

    [1413]b

    [417917]c

    694a

    921a

    [1410]b

    [1413]b

    920a

    [411920]c

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    NOTE.Overrepresented oligonucleotides are displayed from top to bottom (those included in CCTGG or CAGR are in boldface type). X significant; x

    somewhat significant (see Materials and Methods).a The entire sequence.b The coding sector.c The noncoding sector.

    progenitor). The alignment was constructed manually.To begin with, it was based on eight consensus elements(or groups of elements), that is, 18 positions highlightedby Frech, Brack-Werner, and Werner (1996), who stud-ied common modular structures in primate lentiviralLTR sequences. Most of these core blocks enable oneto propose a reliable alignment of the corresponding andneighboring HIV-1 and HIV-2 sectors, provided thatsome local corrections are done. In addition, PPT and

    PBS were very significant core blocks, together with the5 and 3 ends of the LTRs, respectively. The rest of thealignment was built by recursively searching the inter-calary sectors for perfectly matched segments of at leastthree bases in length. For each step, a new intercalaryalignment was then based on the longest perfect matchbetween any paired HIV-1 and HIV-2 sequences, that is,consistent with the prealigned bordering sectors. Thismatch was a priori assumed to be the closest to theputative original sequence. In addition, probable dupli-cation/deletion events were taken into account. Espe-cially in the case of an unequal number of repeated mo-

    tifs between HIV-1s and HIV-2s, gaps were inserted(gaps were not treated explicitly but remain as thoseparts of the sequences that did not belong to any of thealigned segments; Morgenstern, Dress, and Werner1996). Eventually, the alignment was based on the nu-cleotides printed on the line labeled common sectors.These nucleotides covered 643 positions (58%) of thealignment. In the coding reading frame, the polypeptidealignment was constructed in the same way based on

    the conserved amino acids (and the corresponding co-dons). In order to assess and to locally correct the nu-cleotide alignment while increasing the signal-to-noiseratio, alphabets of more than four letters were addition-ally used: that of the polypeptide alignment in the cod-ing reading frame (as just mentioned), and that obtainedby the 8- and 12-ranked N-block presentation for thewhole of the sequences. Obviously, there was goodagreement between the polypeptide and the nucleotidealignments except for a few locations (see the webpage). All of the aligned sequences were coded using a12-ranked and an 8-ranked N-block presentation (the

  • 8/2/2019 Mol Biol Evol 2001 Laprevotte 1231 45

    7/15

    HIV-1 and HIV-2 LTR Nucleotide Sequences 1237

    Table 6Sectors Suggesting Local Repetition Scenarios

    A

    0 1

    B

    0 1

    C

    1

    D

    1

    E

    1

    F

    0 1

    G

    0 1

    H

    0 1

    I

    0 1

    J

    0 1

    K

    0 1

    CIVCG. . . . . . . . .K02013 . . . . . . . .K02083 . . . . . . . .K03455 . . . . . . . .M19921 . . . . . . . .X01762 . . . . . . . .K02007 . . . . . . . .M17449 . . . . . . . .M17451 . . . . . . . .M26727 . . . . . . . .K03454 . . . . . . . .M22639 . . . . . . . .M27323 . . . . . . . .K03456 . . . . . . . .L20571. . . . . . . . .M62320 . . . . . . . .

    X

    x

    x

    x

    x

    x

    X

    X

    X

    x

    x

    x

    x

    X

    x

    x

    x

    x

    x

    X

    x

    x

    x

    x

    x

    x x

    X

    X

    X

    X

    X

    X

    X

    X

    X

    X

    X

    X

    X

    x

    X

    x

    X

    X

    X

    x

    X

    X

    x

    x

    X

    x

    X

    x

    X

    X

    X

    X

    X

    X

    X

    X

    X

    X

    X

    X

    X

    X

    X

    X

    x

    X

    x

    X

    X

    X

    X

    X

    X

    X

    X

    X

    X

    X

    X

    X

    X

    X

    X

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    X

    x

    x

    X

    X

    x

    x

    X

    x

    x

    X

    x

    x

    X

    X

    x

    x

    x

    x

    x

    x

    x

    X

    X

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    X

    x

    X

    x

    X

    X

    X

    x

    X

    x

    X

    x

    X

    x

    x

    X

    X

    x

    X

    X

    x

    X

    x

    X

    X

    x

    x

    X

    RESIVSMM . . . .RESIMM251 . . . .L07625. . . . . . . . .X61240 . . . . . . . .J03654 . . . . . . . . .J04498 . . . . . . . . .J04542 . . . . . . . . .M15390 . . . . . . . .M30502 . . . . . . . .M30895 . . . . . . . .M31113 . . . . . . . .

    x

    X

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    x

    X

    X

    x

    x

    x

    X

    X

    x

    X

    X

    x

    x

    X

    X

    X

    x

    X

    x

    x

    x

    x

    x

    x

    x

    X

    x

    X

    x

    x

    x

    x

    x

    X

    X

    x

    X

    X

    x

    x

    X

    x

    X

    x

    X

    X

    x

    x

    X

    X

    x

    X

    X

    X

    x

    X

    X

    x

    x

    NOTE.A perfect tandem repeats of words at least two bases in length plus local repetitions of words at least three bases in length with no more than five

    bases intercalated; B tandemly repeated dinucleotides; C motifs at least six bases in length made up of two distinct and alternate letters; D ABCDABCDABCD

    patterns; E nonoverlapping repetitions of words at least 10 bases in length with no more than five letters intercalated; F dinucleotides repeated at least nine

    times in windows 50 bases in length; G trinucleotides repeated at least six times in windows 50 bases in length; H trinucleotides repeated at least four times

    in windows 25 bases in length; I tetranucleotides repeated at least four times in windows 50 bases in length; J sectors more than 15 bases in length that are

    made up of no more than two letters; K sectors at least 30 bases in length made up of no more than three letters (at most one base excepted, this latter not

    included in the computation of the numerical value defined in Materials and Methods). For 0, the results are compared with a Bernoulli model; For 1, the results

    are compared with a first-order Markov chain model. X significant; x somewhat significant (see Materials and Methods).

    latter being less stringent). Obviously (see the webpage), the N-block presentation corroborated the ho-mology blocks (in addition, local corrections of thealignment were made possible).

    ResultsOverrepresented Oligonucleotides Appear to BeCongruent with the So-Called Retroviral Signature

    We used a new calculation strategy (Materials andMethods) to perform on a short sector of the retroviralgenome an investigation similar to that performed oncomplete sequences (Laprevotte et al. 1997). For theBernoulli model, the results are displayed in tables 2

    4. The overrepresented oligonucleotides of 24 bases inlength are shown in the whole sequences (table 2), inthe 5 part (coding for the 3 end of the nef gene; table3), and in the 3 (noncoding) part (table 4).

    As a whole, the overrepeated words selected fromthe entire lengths of the sequences (table 2) appeared tobe congruent with the retroviral signature previouslyfound, particularly the core consensuses CCTGG andCAGR (Laprevotte et al. 1997). For instance, 32 out ofthe 53 oligonucleotides displayed shared over more thanhalf of their lengths, a continuous sector with one ofthese consensuses. The 2 dinucleotides, 5 (out of 11)

    trinucleotides, and 4 (out of 40) tetranucleotides werecompletely included in these consensuses. In compari-son, the corresponding proportions of such oligonucle-otides in the all possible di-, tri-, and tetranucleotideswere 7/16, 6/64, and 4/256, respectively. The most oftenselected words were AG, CT, AGA, CTG, CTGG,GGGA, and, only for the HIV-2 group, CAG, which iscomplementary to CTG. The word CTG was overrep-resented in all of the sequences HIV-1 except forM26727 (although it did contain an overrepeatedCCAG). Except for the macaque virus (although itshowed an overrepresented CAG), the HIV-2 group alsoshowed an overrepresented CTG, with the sooty man-gabey virus (RESIVSMM), which is supposed to be theevolutionary progenitor of the HIV-2s (Gao et al. 1999),

    included.In the coding sector (table 3), only CIVCG and

    RESIVSMM showed overrepresented oligonucleotidesincluding CTG. In addition, two HIV-1s and six HIV-

    2s showed an overrepresented CCAG. The noncodingsector (table 4) appeared to be much more congruentwith the retroviral signature: the sequences studiedshowed at least one overrepresented oligonucleotide in-

    cluding CTG (except for K03456, M26727, RE-SIVSMM, and RESIMM251); six HIV-2s out of nine

  • 8/2/2019 Mol Biol Evol 2001 Laprevotte 1231 45

    8/15

    1238 Laprevotte et al.

    FIG. 2.Percentages of the sequences occupied by approximate tandem repeats (perfect tandem repeats of a word at least two bases inlength and locally repeated oligonucleotides at least three bases in length) for 27 100 simple (left), dinucleotide-conserving (middle), andtrinucleotide-conserving (right) shufflings of the 27 sequences.

    showed at least one overrepresented word includingCAG. For K03455, X01762, and M17449, only the en-tire lengths of the sequences were studied because of apremature stop codon (tables 35).

    Table 5 displays the tri- and tetranucleotides thatwere found to be overrepresented when a first-orderMarkov chain model was used. Only the sequence RE-SIVSMM had an overrepresented oligonucleotide(CTGG, in the entire sequence and in the coding sector)that was congruent with the so-called retroviral signa-ture. GGGA remained significantly overrepresented inthe noncoding sectors of all of the HIV-1 sequences thatwere tested (except for CIVCG) and three HIV-2 se-quences (L07625, M30895, and X61240). Obviously,the major portion of these overrepresented GGGAs wasclustered in the sectors (aligned with CIVCG 397458),where the repeated sites NF-KB and SP.1 were located(see the alignment on the web page). Actually, for these

    sequences with overrepresented GGGAs, the noncodingsectors were 341547 bases in length and included be-tween 7 and 10 occurrences of this word. In these actualsectors, the zone in which NF-KB and SP.1 sites wereclustered (being only 5973 bases in length) includedas many as four or five occurrences of GGGA (boxedby a thick line in the alignment).

    Sectors Suggesting Local Repetition Scenarios

    The simulation procedures (Materials and Meth-ods) were aimed at finding local repetition processes

    such as those hypothesized previously (Laprevotte et al.1997). For each sequence, the patterns studied in table6 (column A) were approximate tandem repeats (i.e.,either perfect tandem repeats of a word at least two ba-

    ses in length or local repetitions of a motif at least threebases in length with no more than five bases intercalatedbetween two successive occurrences of this word).These repeated words are boxed by a thin line in thealignment on the web page. For each sequence, the sig-nificance of the numerical value (Materials and Meth-ods) was assessed against the Bernoulli model (table 6,column A, left). Except for K03456, HIV-1 group se-quences (15 sequences) appeared to be significant (forCIVCG, K02007, M17449, M17451, and L20571) orsomewhat significant (for the other 10 sequences). In theHIV-2 group, only 3 sequences out of 11 were at leastsomewhat significant: L07625 (somewhat significant),M15390 (somewhat significant), and X61240 (signifi-

    cant). On average, the significant results were greater by5% than the nonsignificant results. The first-order Mar-kov chain model (table 6, column A, right) showed thatfor a major portion, the significant results correlatedwith the dinucleotide compositions of the correspondingsequences: 2 HIV-2- and 8 HIV-1 sequences were nolonger significant; X61240 and 4 HIV-1 sequences in-cluding CIVCG become only somewhat significant; thedegree of significance was conserved only for K02083,M17451, and M26727.

    In figure 2, the numerical values defined for table6 (column A) are displayed for three sets of 2,700 (27

  • 8/2/2019 Mol Biol Evol 2001 Laprevotte 1231 45

    9/15

    HIV-1 and HIV-2 LTR Nucleotide Sequences 1239

    100) shuffled sequences. For each distribution graph,each of the 27 starter sequences was shuffled 100 times(Materials and Methods). For the left graph, the se-quences only rigorously conserved the starter nucleotidecompositions. For the middle graph, the dinucleotidecounts were additionally exactly conserved, as were thetrinucleotide compositions for the right graph. The mid-dle graph was more shifted from the left than was theright from the middle, such that the major part of theincrease of the random values was accounted for by thefirst-order Markov chain model. Hence, the repeated se-quences investigated in table 6 (column A) appear to beaccounted for, to a large extent, by the dinucleotide com-positions of the sequences.

    Columns B, C, D, and E of table 6 focus on par-ticular repetitions that were part of those recapitulatedin column A. Column B focuses on the tandemly re-peated dinucleotides and displays but a few significantresults. For columns C, D, and E, significant results aredefined using a first-order Markov chain model. ColumnC displays sequences with overrepresented motifs of atleast six bases in length, made up of only two distinctand alternate letters; such are all of the HIV-1 sequences(CIVCG excepted), and a single HIV-2 sequence; onaverage, this overrepresentation accounts for an over-estimation of about 5% of the numerical value that iscomputed for column A. Column D shows that all ofthe HIV-2 sequences have overrepresented repetitions asABCDABCDABCD; actually, this only highlights thesector GCTTGCTTGCTT (boxed by a thick line in thealignment on the web page), which extends from posi-tion RESIVSMM-671 to position RESIVSMM-682.Column E accounts for nonoverlapping repetitions ofwords at least 10 bases in length with no more than fiveletters intercalated between two successive identical mo-

    tifs. For each starter sequence, no more than one randomsequence with such a repetition (and with only two cop-ies) occurs, such that in an observed sequence, this rep-etition (boxed by a thick line in the alignment) is thusconsidered significant (it accounts for an overestimationof about 3% of the numerical value calculated for col-umn A). Together with the overrepetitions displayed incolumn C, these overrepresented patterns could accountfor at least some of the significant results presented incolumn A. Only a single HIV-2 sequence (L07625)shows such a pattern (two copies of a decanucleotide),located in the sector of the SP-1 sites (see below). Onthe contrary, all the HIV-1 sequences but L20571 showsuch a repetition, that is, 2 copies of the NF-KB site (see

    below). In the same sector of the alignment, L20571(which is usually distinct from the other sequences ofits group; Gurtler et al. 1994) shows a probable imper-fect duplication (boxed by a thick line in the alignment)of a segment including the NF-KB site (underlined), butwith an unequal number of CTGs and Gs:

    359 ACTG---ACACTGC-GGGACTTTCCAG 381

    382 ACTGCTGACACTGCGGGGACTTTCCAG 408.

    Columns F, G, H, and I of table 6 account for signifi-

    cantly clustered di-, tri-, or tetranucleotides in windowsof 50 bases in length (for column H, trinucleotides re-peated at least four times in a window 25 bases inlength). The window slides step by step along the se-quence (step 1), with the purpose of delineating sim-ply repeated sequences or cryptic simplicity stretch-es (rich in short direct repeats, as discussed in Tautz,Trick, and Dover 1986; Treier, Pfeifle, and Tautz 1989).The most numerous significant results are those dis-played in column F (dinucleotides repeated at least ninetimes in windows 50 bases in length). As compared withthe Bernoulli model, all of the sequences (RESIMM251excepted) show at least a somewhat significant result.As compared with a first-order Markov chain model, theresults remain significant for the HIV-1 group except forL20571 (CIVCG remains only somewhat significant); ofthe HIV-2 sequences, five remain at least somewhat sig-nificant, in contrast to the other five (J03654, L07625,M15390, M31113, and X61240). On average, the nu-merical value to be compared (Materials and Methods)was overestimated by about 12% for the sequences that

    remained significant when a one-order Markov chainwas used, as compared with those which were not sig-nificant or those not remaining significant. Results sim-ilar to those in column F of table 6 (albeit less numer-ous) are displayed in column G (trinucleotides repeatedat least six times in windows 50 bases in length), H,and I (tetranucleotides repeated at least four times inwindows 50 bases in length).

    Column J of table 6 accounts for significant over-representations of sectors of more than 15 bases inlength that are made up of no more than two letters.According to the Bernoulli model, the HIV-1 sequences(except for K02013, K02007, and K03456) were some-what significant; except for M17451, they were not sig-

    nificant when compared with a first-order Markov chainmodel. For the HIV-2 group, when a first-order Markovchain model was used, RESIVSMM, J04542, L07625,M30502, and M31113 remained significant, whileJ03654, J04498, and M15390 did not; the three othersequences in the group were not significant anyway. Forthe sequences that remained significant when comparedto shuffled sequences conserving the exact starter di-nucleotide count (first-order Markov chain model), thenumerical value was 3.6% or 3.7%, except for M17451(2.3%) conserving only a borderline significance; forthose sequences which did not conserve their signifi-cance or remain nonsignificant, the parameter was 0%2.3%.

    Column K of table 6 accounts for significant over-representations of sectors at least 30 bases in lengthmade up of no more than three letters (with at most onebase excepted, this latter not being included in the com-putation of the numerical value defined above). HIV-1sequences (except for M17449 and L20571) were sig-nificant even against a first-order Markov chain model.For significant sequences, the numerical values rangefrom 18.7% to 32% (14% and 10.1% for M17449 andL20571, respectively). As a whole, HIV-2 sequences(value from 0% to 17%) were not significant.

  • 8/2/2019 Mol Biol Evol 2001 Laprevotte 1231 45

    10/15

    1240 Laprevotte et al.

    Table 7Tests of Three Available Multiple-Sequence AlignmentPrograms Against HIV-1 and HIV-2 Common CoreBlocks

    Clustal-X Mabios Dialign

    Polypurine tract . . . . . . . . . . . . . . . . . .LTR (U3) 5 end . . . . . . . . . . . . . . . . .Primate element blocka . . . . . . . . . . . .

    NF-KBb . . . . . . . . . . . . . . . . . . . . . . . . .SP-1c . . . . . . . . . . . . . . . . . . . . . . . . . . .TATA box. . . . . . . . . . . . . . . . . . . . . . .U3-R junction. . . . . . . . . . . . . . . . . . . .TAR common sectord . . . . . . . . . . . . .Poly (A) . . . . . . . . . . . . . . . . . . . . . . . .R-U5 junction. . . . . . . . . . . . . . . . . . . .Poly (A) downstream elemente . . . . . .LTR (U5) 3 end . . . . . . . . . . . . . . . . .Primer-binding site. . . . . . . . . . . . . . . .

    X

    XX

    xxXX

    x

    X

    XX

    xxXX

    x

    XXX

    xXX

    XxxxxX

    xXX

    NOTE.X perfect alignment; x imperfect or partial alignment.a Four sites.b Dialign is the only program highlighting the duplication of the NF- KB site

    in the HIV-1 group.c The SP-1-sites sector exhibits clustered GGGAs and stretches of Gs, so

    several alignments are possible.

    d Dialign constructs an almost perfect alignment for approximately the 5two thirds of the sector.

    e Dialign constructs a correct alignment except for the two 3 bases.

    The Reliability of the Alignment of the 27 NucleotideSequences

    A reliable alignment is an essential tool in the pre-sent work. The accurate alignment of previously iden-tified benchmarks and its congruency with the polypep-tide alignment and with the N-block presentation codingof the sequences (Materials and Methods and the align-ment on the web page) allowed us to consider this align-ment reliable for testing local molecular evolution hy-

    potheses. Three available multiple-sequence alignmentprograms were tested (table 7) against the benchmarksfound in both the HIV-1 and the HIV-2 groups to selectthe most suitable algorithm for aligning the actual se-quences studied in this work. Clustal-X (Thompson,Plewniak, and Poch 1999) is a progressive alignmentmethod comparing individual residues by using a Nee-dleman-Wunschbased algorithm (Needleman andWunsch 1970) and employing gap penalties; Mabios(Abdeddam 1997) and Dialign (Morgenstern, Dress,and Werner 1996) calculate homology blocks of whichthe best combinations are chosen in order to select thebenchmarks on which the rest of the alignment is con-structed. At first, it appeared that there was no program

    constructing the same alignment that another did. Theprogram Clustal-X produced a total misalignment down-stream of the HIV-1 deletion zone following TARCommon Sector (CIVCG-519); moreover, the deletedsequence J03654 was oddly aligned (data not shown).Mabios and/or Dialign aligned all of the benchmarks(the R-U5 junction excepted); the alignment of five ofthese benchmarks was more accurate when Mabios wasused (with Dialign constructing an alignment that wasless accurate or only partial). However, the Dialign pro-gram was the only one aligning all but one of the in-dicated benchmarks, particularly the much-conserved

    polyadenylation site (Poly (A)) and highlighting the du-plication of the NF-KB site in the HIV-1 group (by in-serting a gap in HIV-2 sequences in front of one of thetwo NF-KB copies). Hence, as regards the present align-ment, Dialign appeared to be the most reliable programof those tested; it was further tested for two sectorswhere the alignment was difficult to construct even man-ually: the set of sequences aligned with CIVCG 328463 and that aligned with CIVCG 558609 (see the webpage). The nucleotides of the alignment constructed withDialign in these sectors were coded by the 8-ranked N-block presentation, HIV-1 and HIV-2 matching lettersbeing highlighted in red as on the web page (data notshown). Actually, these highlighted letters are less nu-merous than in the manually aligned corresponding sec-tors, which suggests that the alignment constructed withsuch a program has at least to be visually refined.

    Discussion

    Previous analyses of retroviral nucleotide sequenc-es have suggested a scrambled stepwise duplicative mo-lecular evolution. Genetic diversity in these sequencesis usually presumed to arise as a consequence of reversetranscriptase infidelity (Katz and Skalka 1990), which isdue to numerous phenomena, such as nucleotide mis-copying, duplication, deletion, recombination, and G-to-A hypermutation of the viral genome (Vartanian et al.1991). Duplications have been reported to occur muchless frequently than deletions. Additionally, an oligo-purine sequence bias has been reported to occur in eu-karyotic viruses (Beasty and Behe 1988). In this paper,we focused on the LTRs of human and simian immu-nodeficiency viruses to perform a high-resolution study.The listed overrepresented oligonucleotides (often con-

    taining CTG or CAG, as previously shown) are congru-ent with the previously named retroviral signature. Thisis particularly clear in the noncoding part of the LTRs,which strengthens the previous conclusion that theseoverrepresented oligonucleotides are not merely pre-dictable from the codon usage in the coding frames ofretroviral sequences; nucleotide repetitiveness and co-don usage appeared not to be strictly correlated, andoverrepresented/clustered nucleotide motifs showed upwithout regard to the coding/noncoding sectors and thephase of the reading frame (Laprevotte 1989, 1992).However, one must keep in mind that an initially codingforeign genetic material could be inserted by recombi-nation into a noncoding part of a retroviral sequence

    (Katz and Skalka 1990). For instance, the nef gene isthought to be a captured element (Myers 1997), whichcould account for the less characteristic results achievedhere in the coding part of the LTRs. Additionally, it hasrecently been suggested that reading-frame-independentforce(s) may influence synonymous codon choice (An-tezana and Kreitman 1999).

    By no means can biological mechanisms be de-duced from the correlation of overrepresented words andsignificant local repetitions with the dinucleotide com-positions of the corresponding sequences. It is impos-sible to decide between two hypotheses: either the dou-

  • 8/2/2019 Mol Biol Evol 2001 Laprevotte 1231 45

    11/15

    HIV-1 and HIV-2 LTR Nucleotide Sequences 1241

    blet frequencies, due to any event, account for thesewords found to be overrepresented or clustered whencompared with a Bernoulli model, or a large number ofduplications of oligonucleotides (such as AG and CT;tables 24) bias the dinucleotide composition of the se-quence and account for the nonsignificance of many rep-etitions when tested against a first-order Markov chainmodel. Such duplications could favor particular nucle-otide motifs for biochemical reasons or because of start-er tandem repeated sequences. Previous studies of com-plete retroviral sequences (Laprevotte 1992; Laprevotteet al. 1997) strengthen the second hypothesis by dem-onstrating that for most of the overrepeated oligonucle-otides, the observed frequency is not merely a conse-quence of dinucleotide distribution (many overrepre-sented oligonucleotides remained significant versus afirst-order Markov chain model; moreover, the correla-tion between the dinucleotide distribution in the subsetof the overrepresented oligonucleotides and that of thewhole sequence was variable, high, weak, or even null).The fact that for RESIVSMM (which is supposed to bethe evolutionary progenitor of the HIV-2s; Gao et al.1999) CTGG is overrepresented even when a first-orderMarkov chain model is used (table 5) fits the same hy-pothesis. Moreover, many of the putative locally re-peated sectors remain significant even against a first-order Markov chain model (table 6), giving examples ofprobable duplications that are obviously not accountedfor by the dinucleotide composition of the sequence;these are tandem repeats, local repetitions, clusters ofoligonucleotides, and monotonous sectors made up ofno more than two or three letters. Columns F and K oftable 6 show many repetitions and monotonous sec-tors that may cover up to 30% of the sequence andthat are significant even against Markov-1 random

    sequences (in these cases, the percentage is overesti-mated by about 10%15%). Moreover, about one thirdof the alignment includes sectors boxed by a thick lineat at least one sequence or one HIV-1/HIV-2 consensus(see the web page). As seen below, these sectors suggestlocal repetition events. Hence, it appears that in any casethe dinucleotide compositions cannot account for all ofthe listed repetitive patterns and that these patterns covera large portion of the sequences. The discrepancies be-tween the results (tables 26) for the HIV-1 and theHIV-2 groups, respectively, suggest distinct mono- oroligonucleotide duplication/deletion scenarios that couldhave occurred since the evolutionary divergence be-tween the two groups; this led us to search the reliable

    alignment of the sequences for patterns both statisticallysignificant and evolutionary suggestive.Differentiated sectors can be delineated in the

    alignment (see the web page) in terms of the degrees ofhomology between HIV-1 and HIV-2 aligned sectors.The 5 landmark that is the PPT, together with the 5end of the LTR (CIVCG 1037) and the 3 landmarkthat is the PBS (CIVCG 682705), are highly conservedand highlighted by the 12-ranked N-block presentation,as are six other homology blocks; out of these sixblocks, the NF-KB site (CIVCG 397407 and 409418)and the polyadenylation signal (CIVCG 570581) are in

    the noncoding part of the sequence; the other four(CIVCG [105136], [168181], [215229], and [284296]), align with conserved sectors in the polypeptidesequence nef.

    The major portion of the coding sectors (up to andincluding position CIVCG-346), is to be distinguishedfrom the rest of the alignment: the aligned sectors (ex-cept for two) measure about the same length (337346bases). Except for M17449 (which exhibits a prematurestop codon), the lengths are equal or differ, as expected,by multiples of three. The length is longer for theL20571 sequence (349 bases); a scan of the coloredalignment obviously corroborates the fact that L20571is a divergent isolate among the HIV-1 group (Gurtleret al. 1994). In the HIV-2 group, J03654 (Zagury et al.1988) is deleted between positions CIVCG-88 andCIVCG-317 (excluded). This could be accounted for bytwo successive deletion events. Let us write the HIV-2consensus between the positions CIVCG-79 andCIVCG-94 while supposing a jump of the reverse tran-scriptase (Katz and Skalka 1990; Zhang and Temin1994) from the first aga to the second; then, the se-quence becomes TATACTTAGAAGG. Eleven out ofthe 13 letters of this motif match the HIV-2 consensusbetween positions CIVCG-309 and CIVCG-321 (TA-TARYTACAAGG), suggesting a second jump betweenthe two motifs. Furthermore, in spite of the conservationof the lengths of the major part of the coding sectors,gaps have to be inserted in the sequences in order toalign the homology blocks, suggesting any number ofduplications/deletions. For instance, for the sectorsaligned with CIVCG from position 39 to position 58,four demonstrative sequences lead to the proposal of asuggestive alignment:

    M26727 ATTTACTCCCAG----AAAAGACA 58

    M62320 ATTCACTCACAG----AAAAGACA 58

    RESIMM251 ATTTATT-ACAGTGCAAGAAGACA 58

    L07625 ATTTACT-ATAGTGAGAGAAGACA 58

    If this is biologically meaningful, the (underlined) serinecodons (TCM for the HIV-1 group, AGT for the HIV-2group) cannot be aligned; the evolutionary hypothesisof a simultaneous double-nucleotide substitution(TCAG), as discussed elsewhere (Averof et al. 2000),is not confirmed in this actual case of an assumed re-

    verse transcriptase directed evolution. This suggests aless straightforward evolutionary mechanism and furtheremphasizes the importance of accurate alignments intesting local evolutionary hypotheses. The sectorsaligned with CIVCG-81CIVCG-95 are particularlysuggestive. Between the two landmarks that are the con-served 5 TAY (tyrosine) and 3 GGV (glycine), theHIV-1 and HIV-2 groups do not match; this would leadto consideration of a large number of nucleotide substi-tutions if only these molecular evolution events were tobe taken into account in the alignment. In addition, ex-cept for L20571, the HIV-1 sectors include a stretch of

  • 8/2/2019 Mol Biol Evol 2001 Laprevotte 1231 45

    12/15

    1242 Laprevotte et al.

    more than six bases made up of only two different al-ternate letters; seven of the aligned HIV-2 sectors in-clude a stretch of more than 15 bases made up of onlytwo letters. In the two groups, these selected stretchesare boxed by a thick line in the alignment (see the webpage) when the pattern is significantly overrepresentedin the entire corresponding sequence (table 6, columnsC and J). The two aligned K03454 (HIV-1) and M31113(HIV-2) sequences can be taken as an example:

    K03454 TACAACACAC------AAGGCAT 97

    M31113 TACTTAGAAAAGGAAGAGGGAAT 97

    tyrosine glycine

    One can imagine that at a pause site (Wu et al. 1995),the reverse transcriptase could replicate the same shorttemplate several times, thus expanding the last mono-or oligonucleotide of the nascent DNA strand. If that isthe case, the alternate Cs and As in the HIV-1 groupshould have arisen from the 5 end of the sector, and

    the aligned sector in the HIV-2 group should have arisenfrom the 3 end. This could somewhat mimic the per-formance of the telomerase, a cellular reverse transcrip-tase which synthesizes short repeat sequences rich in Tand G and carries its own RNA template with a segmentcomplementary to one and a half copies of the telomericrepeat. One should keep in mind that there is a closerelationship between the cellular telomerase active sub-unit and retroelement reverse transcriptases (reviewed inBoeke and Stoye 1997; Malik, Burke, and Eickbush2000). It has been suggested that the first steps of DNAsynthesis by reverse transcriptases of non-LTR retro-transposons might be similar to the generation of telo-meric repeats (Chaboissier, Finnegan, and Bucheton

    2000).From position CIVCG-347 downward, the major

    portion of the aligned sequences is noncoding. Conse-quently, their lengths do not necessarily differ by mul-tiples of three; they are much more divergent betweenthe HIV-1 and the HIV-2 groups and even, within theHIV-2 group, between the two SIV-2 and the HIV-2 se-quences. The duplication/deletion events appear to havebeen much less constrained during evolution than they

    have been in the coding parts. In this respect, severalsectors deserve scrutiny.

    The alignment between positions CIVCG-348 andCIVCG-386 can be accounted for by stepwise duplica-tions/deletions (see the web page). HIV-1 clones havebeen described (Estable et al. 1996) where the HIV-1empty sectors are occupied by the so-called most fre-quent naturally occurring length polymorphism(MFLNP on the web page), which shows varyinglengths and appears more or less clearly to contain re-peated sectors. Here, the aligned sectors in the HIV-2group do not appear to be deleted.

    Between CIVCG-394 and CIVCG-413 (excluded),HIV-2 sequences (SIVs excluded) exhibit two imper-fectly repeated sectors that could be the remnant of aduplication event. L07625 and X61240 HIV-2 sequencesare to be distinguished from the other seven (Kreutz etal. 1992; Barnett et al. 1993), as they differ in numerouslocations all along the alignment. In each of them, thetwo homologous sectors (boxed by a thick line in thealignment) extend from position L07625-436 to positionL07625-460 and from position L07625-461 to positionL07625-484, respectively, and do not coincide withthose of the other HIV-2 sequences:

    GG-AACTAGCTGACACTGCACAAGAR

    GGAAACTAGCWGACACYGCA--GGGA

    Each alignment score is greater than those of 100 ran-dom alignments (BESTFIT program). The other sevenHIV-2 sequences show two successive homologous sec-tors with five intercalated letters that appear to be a du-plication of the 3 end of the first sector. The alignmentscore is less than a random score for at most five randomsequences out of 100 (BESTFIT). For J03654, M15390,

    and M31113 (homologous sectors boxed by a thick linein the alignment), no random score is greater than theobserved one. The first sector extends from positionL07625-408 to position L07625-454, the second fromposition L07625-457 to position L07625-483 (withM30895 having an additional sector because of a prob-able duplication of CTGCAG at the 3 end of the align-ment). The most significant alignment is that ofM15390:

    408 AGTTAA--AGACAGGAACAGCT-ATACTTGGTCAGGG 441

    442 CAGGA 446

    447 AGT-AACTA-ACAGAAACAGCTGAGACT-G--CAGGG 478

    Additionally, from position L07625-465 to position

    L07625-536, binding sites for transcription factors show

    a variable number of copies, which also suggests dupli-

    cation/deletion events. First, from the left to the right,

    two sites alternate: the Bel-1 similar region (for RE-

    SIVSMM and RESIMM251, to a lesser extent for the

    HIV-2 sequences, and only partially for the HIV-1

    group), NF-KB (only for the HIV-1 group), the Bel-1

    similar region (for L20571 and partially for M30895),

    and, finally, NF-KB (for both the HIV-1 and the HIV-2

    groups). Aligned L20571 (HIV-1) and RESIMM251

    (SIV HIV-2) clearly show this pattern:

  • 8/2/2019 Mol Biol Evol 2001 Laprevotte 1231 45

    13/15

    HIV-1 and HIV-2 LTR Nucleotide Sequences 1243

    L20571 ACT-GC---------GGGACTTTCCAGACTGCTGACA-CTGCGGGGACTTTCCA 407

    RESIMM251 ACTCGCTGAGATAG---------------------------CAGGGACTTTCCA 451

    [Bel-1 similar region] [ NF-KB ][ Bel-1 similar region ] [ NF-KB ]

    Downstream, the SP-1 sites are located in a variable wayat four possible segments in a zone with G stretches anda cluster of GGGAs including those located in the NF-KB sites (see above).

    Column K of table 6 shows that the HIV-1 sequenc-es (except for M17449 and L20571) include overrep-resented sectors at least 30 bases in length made up ofno more than three letters (with at most one base ex-cepted). Such a sector is found in these sequences be-tween positions K02013-462 and K02013-494 (the cor-responding sector is boxed by a thick line at the HIV-1consensus; see the web page). In this sector, as well asupstream and downstream, the HIV-1 group shows clus-ters of CTs, CTGs, and CTGGs (table 6, columns F,

    G, H, and I). These words are boxed by a thick line inthe alignment when the corresponding pattern is over-represented against a one-order Markov chain model inthe corresponding sequence taken as a whole (table 6).All of these words are scattered in a region that couldbe accounted for, at least partly, by stepwise duplica-tions/deletions of mono- or oligonucleotides taken fromtandemly repeated CTGs.

    The aligned sectors extending from the 5 end ofthe R region (CIVCG-501, the beginning of viral RNA;see above) to the positions aligned with CIVCG-565correspond to the TAR region, which has been exten-sively studied concerning its biological meaning and thestable stem-loop structure that forms TAR RNA (re-

    viewed in Ou and Gaynor 1995; Rabson and Graves1997). The HIV-1 TAR RNA contains both a loop anda bulge structure that are critical for Tat-mediated acti-vation. The HIV-2 TAR RNA is capable of forming acomplex structure that consists of two discrete stem-loop regions. Possible evolution routes from simple one-hairpin to complex branched TAR structures have beendiscussed in the literature. The extended portion of theHIV-2 TAR, relative to the HIV-1 TAR, have the great-est similarity to a human immunoglobulin pseudogenesequence, suggesting (see above) that this sub-sequenceis a captured element (reviewed in Myers 1997). In thealignment, the sector referred to as TAR Common Sec-tor is conserved between HIV-1 and HIV-2. It corre-

    sponds to the upper portion of the HIV-1 stem-loop (thebulge-and-loop zone) and to the 5 HIV-2 stem-loop re-gion. The two successive sectors of the HIV-2 consensusthat are boxed by a thick line in the alignment on theweb page (the first including the TAR Common Sector),correspond (apart from a few bases) to the two HIV-2TAR RNA discrete stem-loop regions:

    GCAGATTGAGCCCTGGGAGGTTCTCT-CCAGCACT

    GCAGGTAGAG-CCTGGG-TGTTCCCTGCTAG-ACT

    These two sectors have an alignment score greater than

    any obtained from 100 randomizations (BESTFIT).Whatever the possible evolution routes discussed withregard to this region, the possibility of a duplicationevent has to be taken into account. In contrast, the cor-responding zone of the HIV-1 sequence appears to bemuch deleted. Additionally, the HIV-2 M30502 se-quence (from position 596 to position 648) shows (table6, column F) a cluster of AGs (boxed by a thick linein the alignment). The rest of the HIV-2 TAR regionadditionally shows an extended portion relative to HIV-1. The 42/43-base-long HIV-2 sectors included betweenpositions CIVCG-559 and CIVCG-565 (aligned with azone of HIV-2 consensus that is boxed by a thick line)show clustered oligonucleotides (boxed by a thick line

    when they correspond to a pattern that is overrepre-sented in the entire corresponding sequence; table 6, col-umns D and FI), which, again, suggest stuttering localduplications (see the web page).

    As a whole, the results discussed above fit the mo-lecular-evolution model hypothesized previously (Lapre-votte 1989, 1992; Laprevotte et al. 1997): overrepre-sented oligonucleotides are scattered throughout the en-tire range of the retroviral sequences; they share com-plementary core consensuses that fit the rule of a trendto TG/CT excess (Ohno and Yomo 1990) and suggeststarter tandemly repeated oligonucleotides (short tandemrepeats giving rise to longer oligonucleotide repeats, ashypothesized previously [Southern 1972; Ohno 1988]);

    they are mixed with scrambled short-scale repetitions,deletions/duplications, tandem repeats, and cryptic sim-plicity patterns, suggesting a molecular evolution byscrambled stepwise short- and longer-range duplica-tions/deletions (in addition to nucleotide miscopying).

    Even though this model gives a good account ofthe repetitive aspects of retroviral nucleotide sequences,other evolutionary processes may be considered, suchas gene conversion (leading to homogeneity throughoutDNA sequences; see discussion in Laprevotte 1989) anda converging evolution toward repeated motifs servinguseful functions (Laprevotte et al. 1997). This also leadsto consideration of possible selective pressures main-taining the repeats.

    Conclusions

    The listed overrepresented oligonucleotides (se-lected here by using a calculation strategy applicable toshort sequences and often containing CTG or CAG) arecongruent with the retroviral signature (previously de-fined for the entire sequences) when focusing on thenoncoding part of the HIV LTRs (this retroviral signa-ture was not found among yeast, plant, and invertebrateretrotransposons; Terzian et al. 1997). The coding partis much less characteristic, which strengthens the hy-

  • 8/2/2019 Mol Biol Evol 2001 Laprevotte 1231 45

    14/15

    1244 Laprevotte et al.

    pothesis, namely, that the nefgene is a captured element.It appears that the search for consensus overrepresentedoligonucleotides could be part of an analysis aimed atdefining retrovirus-like sequences in a genome. The bi-ased dinucleotide distribution could be, in large part, theconsequence of interspersed duplications of oligonucle-otide motifs.

    Sector by sector, we hypothesize a large number oflocal duplication/deletion scenarios that span a greatportion of the alignment and could account for lengthdivergences between the HIV-1 and HIV-2 groups. Con-sequently, base substitutions are by no means the uniqueevolutionary process to take into account for compari-sons of such sequences and their phylogenetic analyses.Altogether, our results support our previous hypotheseson the molecular evolution of retroviral nucleotide se-quences: a large portion of the sequences can be ac-counted for by scrambled stepwise short- and longer-range duplications/deletions. There is an emerging hy-pothesis of an important duplication/deletion role for thereverse transcriptase that could (in addition to already-

    proposed scenarios) generate perfect or stuttering tan-dem repeats and then a cryptic simplicity of the se-quence. The consensus overrepresented motifs and thenumerous cryptic simplicity sectors observed suggestone or several starter tandemly repeated short motif(s).Additional comparisons of decreasingly homologous se-quences using a fast and reliable method for the align-ments could further unravel these evolutionary patterns.

    A reliable and accurate alignment of the comparedsequences is an essential tool for performing a high-resolution molecular evolution study. The accurate as-sessment of the nucleotide alignment with already-iden-tified benchmarks, with the polypeptide alignment, andwith the N-presentation coding of the sequences allows

    us to consider the alignment reliable. The multiplied al-phabet obtained by the mathematical strategy called N-block presentation appears to be a promising method toincrease the signal-to-noise ratio in the nucleotide align-ment studies.

    It is well known that in eukaryotic cells, reversetranscription processes are not restricted to parasitic ret-roviruses, and that a diverse set of genes, referred to asretrotranscripts, derived from their normal progenitorgenes via an mRNA intermediate (Boeke and Stoye1997). These elements, as well as retroviruses and re-trotransposons, are a source of genomic variation, ascould be an increasing number of human endogenousretrovirus sequences that have been demonstrated (Kjell-man, Sjogren, and Widegren 1999). The endogenousIAP particles of mice may also contribute to the gen-eration of genetic diversity in this host population. Fur-thermore, it has been hypothesized that if the prebioticgenetic material was RNA, reverse transcription mighthave been required to formulate DNA-based genetic in-formation (Katz and Skalka 1990). All of these data andothers, taken together, suggest that further investigationof the reverse transcription could shed light on someaspects of eukaryotic genome evolution and consequent-ly not be restricted to the biology of retroviruses.

    Supplementary Material

    The multiple alignment of the 27 HIV-1 and HIV-2 LTR nucleotide sequences is available from the web-site http://genome.genetique.uvsq.fr/laprevotte. In addi-tion, this full sequence alignment is directly availablefrom I.L.

    LITERATURE CITED

    ABDEDDAIM, S. 1997. Incremental computation of transitiveclosure and greedy alignment. Lect. Notes Comput. Sci.1264:167179.

    ANTEZANA, M. A., and M. KREITMAN. 1999. The nonrandomlocation of synonymous codons suggests that readingframe-independent forces have patterned codon preferences.J. Mol. Evol. 49:3643.

    AVEROF, M., A. ROKAS, K. H. WOLFE, and P. M. SHARP. 2000.Evidence for a high frequency of simultaneous double-nu-cleotide substitutions. Science 287:12831286.

    BARNETT, S. W., M. QUIROGU, A. WERNER, D. DINA, and J.A. LEVY. 1993. Distinguishing features of an infectious mo-lecular clone of the highly divergent and noncytopathic hu-man immunodeficiency virus type 2 UC1 strain. J. Virol.67:10061014.

    BEASTY, A. M., and M. J. BEHE. 1988. An oligopurine se-quence bias occurs in eukaryotic viruses. Nucleic AcidsRes. 16:15171528.

    BERKHOUT, B. 1996. Structure and function of the human im-munodeficiency virus. Prog. Nucleic Acid Res. Mol. Biol.54:134.

    BOEKE, J. D., and J. P. STOYE. 1997. Retrotransposons, endog-enous retroviruses, and the evolution of the retroelements.Pp. 343435 in J. M. COFFIN, S. H. HUGHES, and H. E.VARMUS, eds. Retroviruses. Cold Spring Harbor LaboratoryPress, New York.

    CHABOISSIER , M. C., D. FINNEGAN, and A. BUCHETON. 2000.Retrotransposition of the Ifactor, a non-long terminal repeatretrotransposon of Drosophila, generates tandem repeats at

    the 3 end. Nucleic Acids Res. 28:24672472.COWARD, E. 1998. Mathematical methods for repeated patterns

    in biological sequences. Dr.Ing. thesis, Norwegian Univer-sity of Science and Technology, Trondheim, Norway.

    . 1999. Shufflet: shuffling sequences while conservingthe k-let counts. Bioinformatics 15:10581059.

    DEVEREUX, J. 1989. The GCG sequence analysis softwarepackage. Version 6.0. Genetics Computer Group, Madison,Wis.

    DIDIER, G. 1999. Caracterisation des N-ecritures et applicationa letude des suites de complexite ultimement n cste.Theor. Comput. Sci. 215:3149.

    ESTABLE, M. C., B. BELL, A. MERZOUKI, J. S. G. MONTANER,M. V. OSHAUGHNESSY, and I. J. SADOWSKI. 1996. Humanimmunodeficiency virus type 1 long terminal repeat variants

    from 42 patients representing all stages of infection displaya wide range of sequence polymorphism and transcriptionactivity. J. Virol. 70:40534062.

    FRECH, K., R. BRACK-WERNER, and T. WERNER. 1996. Com-mon modular structure of lentivirus LTRs. Virology 224:256267.

    GAO, F., E. BAILES, L. ROBERTSON et al. (12 co-authors). 1999.Origin of HIV-1 in the chimpanzee Pan troglodytes trog-lodytes. Nature 397:436441.

    GAYNOR, R. 1992. Cellular transcription factors involved in theregulation of HIV-1 gene expression. AIDS 6:347363.

    GURTLER, L. G., P. H. HAUSER, J. EBERLI, A. VON BRUNN, S.KNAPP, L. ZEKENG, J. M. TSAGUE, and L. KAPTUE. 1994.

  • 8/2/2019 Mol Biol Evol 2001 Laprevotte 1231 45

    15/15

    HIV-1 and HIV-2 LTR Nucleotide Sequences 1245

    A new subtype of human immunodeficiency virus type 1(MVP-5180) from Cameroon. J. Virol. 68:15811585.

    KANDEL, D., Y. MATIAS, R. UNGER, and P. WINKLER. 1996.Shuffling biological sequences. Discrete Appl. Math. 71:171185.

    KATZ, R. A., and A. M. SKALKA. 1990. Generation of diversityin retroviruses. Annu. Rev. Genet. 24:409445.

    KJELLMAN, C., H. O. SJOGREN, a n d B . WIDEGREN. 1999.

    HERV-F, a new group of human endogenous retrovirus se-quences. J. Gen. Virol. 80:23832392.KLAERR-BLANCHARD, M., H . CHIAPELLO, a nd E. COWARD.

    2000. Detecting localized repeats in genomic sequences: anew strategy and its application to B. subtilis and A. thali-ana sequences. Comput. Chem. 24:5770.

    KREUTZ, R., U. DIETRICH, H. KUHNEL, K. NIESELT-STRUWE,M. EIGEN, and H. RUBSAMEN-WAIGMANN. 1992. Analysisof the envelope region of the highly divergent HIV-2 ALTisolate extends the known range of variability within theprimate immunodeficiency viruses. AIDS Res. Hum. Ret-roviruses 8:16191629.

    LAPREVOTTE, I. 1989. Scrambled duplications in the feline leu-kemia virus gag gene: a putative pattern for molecular evo-lution. J. Mol. Evol. 29:135148.

    1992. Mo-MuLV nucleotide sequence exhibits threelevels of oligomeric repetitions, suggesting a stepwise mo-lecular evolution. J. Mol. Evol. 35:420428.

    LAPREVOTTE, I., S. BROUILLET, C. TERZIAN, and A. HENAUT.1997. Retroviral oligonucleotide distributions correlate withbiased nucleotide compositions of retrovirus sequences,suggesting a duplicative stepwise molecular evolution. J.Mol. Evol. 44:214225.

    LAPREVOTTE, I., A. HAMPE, C. J . SHERR, and F. GALIBERT.1984. Nucleotide sequence of the gag gene and gag-poljunction of feline leukemia virus. J. Virol. 50:884894.

    MALIK, H. S., W. D. BURKE, and T. H. EICKBUSH. 2000. Pu-tative telomerase catalytic subunits from Giardia lambliaand Caenorhabditis elegans. Gene 251:101108.

    MORGENSTERN, B., A. DRESS, and T. WERNER. 1996. MultipleDNA and protein sequence alignment based on segment-to-segment comparison. Proc. Natl. Acad. Sci. USA 93:1209812103.

    MYERS, G. 1997. Retroviral sequences. Pp. 709755 in J. M.COFFIN, S. H. HUGHES, and H. E. VARMUS, eds. Retrovi-ruses. Cold Spring Harbor Laboratory Press, New York.

    NEEDLEMAN, S. B., and C. D. WUNSCH. 1970. A general meth-od applicable to the search for similarities in the amino acidsequence of two proteins. J. Mol. Biol. 48:443453.

    OHNO, S. 1988. Codon preference is but an illusion created bythe construction principle of coding sequences. Proc. Natl.Acad. Sci. USA 85:43784382.

    OHNO, S., and T. YOMO. 1990. Various regulatory sequencesare deprived of their uniqueness by the universal rule ofTA/CG deficiency and TG/CT excess. Proc. Natl. Acad. Sci.USA 87:12181222.

    OU, S.-H. I., and R. B. GAYNOR. 1995. Intracellular factorsinvolved in gene expression of human retroviruses. Pp. 97184 in J. LEVY, ed. The Retroviridae. Vol. 4. Plenum Press,New York and London.

    PEREIRA, L. A., K. BENTLEY, A. PEETERS, M. J. CHURCHILL,and N. J. DEACON. 2000. A compilation of cellular tran-scription factor interactions with the HIV-1 LTR promoter.Nucleic Acids Res. 28:663668.

    PETERLIN, B. M. 1995. Molecular biology of HIV. Pp. 185238 in J. LEVY, ed. The Retroviridae. Vol. 4. Plenum Press,New York and London.

    RABSON, A. B., and B. J. GRAVES. 1997. Synthesis and pro-

    cessing of viral RNA. Pp. 205261in

    J. M. COFFIN

    , S. H.HUGHES, and H. E. VARMUS, eds. Retroviruses. Cold SpringHarbor Laboratory Press, New York.

    SETO, M. H., T. K. BRUNCK, and R. L. BERNSTEIN. 1989. Over-lapping redundant sextuplets identical with regulatory ele-ments of HIV-1 and SV40. Nucleic Acids Res. 17:27832800.

    SMITH, T. F., and M. S. WATERMAN. 1981. Identification ofcommon molecular subsequences. J. Mol. Biol. 147:195197.

    SOUTHERN, E. 1972. Repetitive DNA in mammals. Pp. 1927in R. A. PFEIFFER, ed. Modern aspects of cytogenetics: con-stitutive heterochromatin in man. Symposia MedicaHoechst No. 6. Schattauer Verlag, Stuttgart, Germany.

    TAUTZ, D., M. TRICK, and G. A. DOVER. 1986. Cryptic sim-plicity in DNA is a major source of genetic variation. Na-

    ture 322:652656.TERZIAN, C., I. LAPREVOTTE, S. BROUILLET, and A. HENAUT.

    1997. Genomic signatures: tracing the origin of retroele-ments at the nucleotide level. Genetica 100:271279.

    THOMPSON, J. D., F. PLEWNIAK, and O. POCH. 1999. A com-prehensive comparison of multiple sequence alignment pro-grams. Nucleic Acids Res. 27:26822690.

    TREIER, M., C. PFEIFLE, and D. TAUTZ. 1989. Comparison ofthe gap segmentation gene hunchback between Drosophilamelanogasterand Drosophila virilis reveals novel modes ofevolutionary change. EMBO J. 8:15171525.

    VARTANIAN, J.-P., A. MEYERHANS, B. ASJO, and S. WAIN-HOB-SON. 1991. Selection, recombination, and GA hypermu-tation of human immunodeficiency virus type 1 genomes.J. Virol. 65:17791788.

    VOGT, P. K. 1997. Historical introduction to the general prop-erties of retroviruses. Pp. 125 in J . M. COFFIN, S. H .HUGHES, and H. E. VARMUS, eds. Retroviruses. Cold SpringHarbor Laboratory Press, New York.

    WU, W., B. M. BLUMBERG, P. J. FAY, and R. A. BAMBARA.1995. Strand transfer mediated by immunodeficiency virusreverse transcriptase in vitro is promoted by pausing andresults in misincorporation. J. Biol. Chem. 270:325332.

    ZAGURY, J. F., G. FRANCHINI, M. REITZ et al. (15 co-authors).1988. Genetic variability between isolates of human im-munodeficiency virus (HIV) type 2 is comparable to thevariability among HIV type 1. Proc. Natl. Acad. Sci. USA85:59415945.

    ZHANG, H., and H. M. TEMIN. 1994. Retrovirus recombinationdepends on the length of sequence identity and is not errorprone. J. Virol. 68:24092414.

    PIERRE CAPY, reviewing editor

    Accepted March 14, 2001