
    Better External Memory Suffix Array Construction

Roman Dementiev, Juha Kärkkäinen, Jens Mehnert, Peter Sanders

Abstract

Suffix arrays are a simple and powerful data structure for text processing that can be used for full text indexes, data compression, and many other applications, in particular in bioinformatics. However, so far it has looked prohibitive to build suffix arrays for huge inputs that do not fit into main memory. This paper presents design, analysis, implementation, and experimental evaluation of several new and improved algorithms for suffix array construction. The algorithms are asymptotically optimal in the worst case or on the average. Our implementation can construct suffix arrays for inputs of up to 4 GBytes in hours on a low cost machine.

As a tool of possible independent interest we present a systematic way to design, analyze, and implement pipelined algorithms.

1 Introduction

The suffix array [21, 13], a lexicographically sorted array of the suffixes of a string, has numerous applications, e.g., in string matching [21, 13], genome analysis [1] and text compression [6]. For example, one can use it as a full text index: To find all occurrences of a pattern P in a text T, do binary search in the suffix array of T, i.e., look for the interval of suffixes that have P as a prefix. A lot of effort has been devoted to efficient construction of suffix arrays, culminating recently in three direct linear time algorithms [17, 18, 19]. One of the linear time algorithms [17] is very simple and can also be adapted to obtain an optimal algorithm for external memory: The DC3-algorithm [17] constructs a suffix array of a text T of length n using O(sort(n)) I/Os, where sort(n) is the number of I/Os needed for sorting the characters of T.
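To make the full text index usage concrete, the following is a minimal in-memory sketch (our code, not from the paper) of this binary search; findOccurrences and the truncation-based comparators are our own illustrative names:

    #include <algorithm>
    #include <string_view>
    #include <utility>
    #include <vector>

    // Sketch: find the interval of suffixes that have pattern p as a prefix.
    // 'sa' is assumed to be the suffix array of 'text'; returns [lo, hi) as
    // indices into 'sa'. Runs in O(|p| log n) comparisons.
    std::pair<std::size_t, std::size_t>
    findOccurrences(std::string_view text, const std::vector<std::size_t>& sa,
                    std::string_view p) {
        auto suffixBelowPat = [&](std::size_t s, std::string_view pat) {
            // Compare suffix text[s..) with pat, looking at most |pat| chars.
            return text.substr(s, pat.size()) < pat;
        };
        auto patBelowSuffix = [&](std::string_view pat, std::size_t s) {
            return pat < text.substr(s, pat.size());
        };
        auto lo = std::lower_bound(sa.begin(), sa.end(), p, suffixBelowPat);
        auto hi = std::upper_bound(lo, sa.end(), p, patBelowSuffix);
        return {static_cast<std::size_t>(lo - sa.begin()),
                static_cast<std::size_t>(hi - sa.begin())};
    }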

However, suffix arrays are still rarely used for processing huge inputs. Less powerful techniques like an index of all words appearing in a text are very simple, have favorable constant factors, and can be

Fakultät für Informatik, Universität Karlsruhe, 76128 Karlsruhe, Germany, [dementiev,sanders]@ira.uka.de.

Department of Computer Science, FIN-00014 University of Helsinki, Finland, [email protected].

Max-Planck-Institut für Informatik, Stuhlsatzenhausweg 85, 66123 Saarbrücken, Germany, [email protected].

implemented to work well with external memory for practical inputs. In contrast, the only previous external memory implementations of suffix array construction [8] are not only asymptotically suboptimal, but also so slow that measurements could only be done for small inputs and artificially reduced internal memory size.

The main objective of the present paper is to narrow the gap between theory and practice by engineering algorithms for constructing suffix arrays that are at the same time asymptotically optimal and the best practical algorithms, and that can process really large inputs in realistic time. In the context of this paper, engineering includes algorithm design, theoretical analysis, careful implementation, and experiments with large, realistic inputs all working together to improve relevant constant factors, to understand realistic inputs, and to obtain fair comparisons between different algorithms.

1.1 Basic Concepts. We use the shorthands [i, j] = {i, ..., j} and [i, j) = [i, j−1] for ranges of integers, and extend them to substrings as seen below. The input of our algorithms is an n character string T = T[0]T[1]···T[n−1] = T[0, n) of characters in the alphabet Σ = [1, n]. The restriction to the alphabet [1, n] is not a serious

one. For a string T over any alphabet, we can first sort the characters of T, remove duplicates, assign a rank to each character, and construct a new string T′ over the alphabet [1, n] by renaming the characters of T with their ranks. Since the renaming is order preserving, the order of the suffixes does not change. A similar technique called lexicographic naming will play an important role in all of our algorithms, where a string (e.g., a substring of T) is replaced by its rank in some set of strings.
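A minimal in-memory sketch of this renaming step (our code; renameToRanks is an illustrative name):

    #include <algorithm>
    #include <cstdint>
    #include <string>
    #include <vector>

    // Map each character of T to its rank among the distinct characters,
    // yielding a string over [1, n] with the same suffix order.
    std::vector<std::uint32_t> renameToRanks(const std::string& t) {
        std::string alphabet = t;
        std::sort(alphabet.begin(), alphabet.end());
        alphabet.erase(std::unique(alphabet.begin(), alphabet.end()),
                       alphabet.end());
        std::vector<std::uint32_t> renamed(t.size());
        for (std::size_t i = 0; i < t.size(); ++i) {
            // Rank = 1 + position of t[i] in the sorted, deduplicated alphabet.
            renamed[i] = 1 + static_cast<std::uint32_t>(
                std::lower_bound(alphabet.begin(), alphabet.end(), t[i]) -
                alphabet.begin());
        }
        return renamed;
    }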

Let $ be a special character that is smaller than any character in the alphabet. We use the convention that T[i] = $ if i ≥ n. T_i = T[i, n) denotes the i-th suffix of T. The suffix array SA of T is a permutation of [0, n) such that T_{SA[i]} < T_{SA[j]} whenever 0 ≤ i < j < n. Let lcp(i, j) denote the length of the longest common prefix of T_{SA[i]} and T_{SA[j]} (lcp(i, j) = 0 if i < 0 or j ≥ n). Then dps(i) := 1 + max{lcp(i−1, i), lcp(i, i+1)} is the distinguishing prefix size of T_{SA[i]}. We get the following derived quantities that can be used to characterize the difficulty of an input or that will turn out to play such


a role in our analysis:

    maxlcp := max_{0 ≤ i < n} lcp(i, i+1)
    logdps := (1/n) · Σ_{0 ≤ i < n} ⌈log(1 + dps(i))⌉
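To make these quantities concrete, here is a naive in-memory sketch (ours) that computes the lcp values of suffix-array neighbors, from which maxlcp and dps(i) follow directly:

    #include <algorithm>
    #include <string>
    #include <vector>

    // Naive O(n * maxlcp) computation of lcp(i-1, i) for suffix-array
    // neighbors; fine for experiments on small inputs. Real implementations
    // would use a linear-time LCP algorithm.
    std::vector<long> neighborLcp(const std::string& t,
                                  const std::vector<long>& sa) {
        long n = static_cast<long>(t.size());
        std::vector<long> lcp(n, 0);   // lcp[i] = lcp(i-1, i); lcp[0] = 0
        for (long i = 1; i < n; ++i) {
            long a = sa[i - 1], b = sa[i], l = 0;
            while (a + l < n && b + l < n && t[a + l] == t[b + l]) ++l;
            lcp[i] = l;
        }
        return lcp;
    }

    // dps(i) = 1 + max{lcp(i-1, i), lcp(i, i+1)}; out-of-range lcp is 0.
    long dps(const std::vector<long>& lcp, long i) {
        long n = static_cast<long>(lcp.size());
        long left = (i > 0) ? lcp[i] : 0;            // lcp(i-1, i)
        long right = (i + 1 < n) ? lcp[i + 1] : 0;   // lcp(i, i+1)
        return 1 + std::max(left, right);
    }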


…indexes [20, 14, 15]. We view this as an indication that internal memory is too expensive for the big suffix arrays one would like to build. Going to external memory can be viewed as an alternative and more scalable solution to this problem. Once this step is made, space consumption is less of an issue, because disk space is two orders of magnitude cheaper than RAM.

The biggest suffix array computations we are aware of are for the human genome [23, 20]. One [20] computes the compressed suffix array on a PC with 3 GBytes of memory in 21 hours. Compressed suffix arrays work well in this case (they need only 2 GBytes of space) because the small alphabet size present in genomic information enables efficient compression. The other implementation [23] uses a supercomputer with 64 GBytes of memory and needs 7 hours. Our algorithms have comparable speed using external memory.

Pipelining to reduce I/Os is a well-known technique in executing database queries [24]. However, previous algorithm libraries for external memory [4, 9] do not

support it. We decided quite early in the design of our library Stxxl [10] that we wanted to remove this deficit. Since suffix array construction can profit immensely from pipelining, and since the different algorithms give a rich set of examples, we decided to use this application as a test bed for a prototype implementation of pipelining.

2 Doubling Algorithm

Figure 1 gives pseudocode for the doubling algorithm. The basic idea is to replace characters T[i] of the input by lexicographic names that respect the lexicographic order of the length-2^k substrings T[i, i + 2^k) in the k-th iteration. In contrast to previous variants of this algorithm, our formulation never actually builds the resulting string of names. Rather, it manipulates a sequence P of pairs (c, i) where each name c is tagged with its position i in the input. To obtain names for the next iteration k + 1, the names for T[i, i + 2^k) and T[i + 2^k, i + 2^{k+1}) together with the position i are stored in a sequence S and sorted. The new names can now be obtained by scanning this sequence and comparing adjacent tuples. Sequence S can be built using consecutive elements of P if we sort P using the pair (i mod 2^k, i div 2^k). Previous formulations of the algorithm use i as the sorting criterion and therefore have to access elements that are 2^k characters apart. Our approach saves I/Os and simplifies the pipelining optimization described in Section 3.

The algorithm performs a constant number of sorting and scanning operations for sequences of size n in each iteration. The number of iterations is determined by the logarithm of the maximum longest common prefix length.

Function doubling(T)
    S := [((T[i], T[i+1]), i) : i ∈ [0, n)]                         (0)
    for k := 1 to log n do
        sort S                                                      (1)
        P := name(S)                                                (2)
        invariant ∀(c, i) ∈ P :
            c is a lexicographic name for T[i, i + 2^k)
        if the names in P are unique then
            return [i : (c, i) ∈ P]                                 (3)
        sort P by (i mod 2^k, i div 2^k)                            (4)
        S := [((c, c′), i) : j ∈ [0, n),                            (5)
              (c, i) = P[j], (c′, i + 2^k) = P[j + 1]]

Function name(S : Sequence of Pair)
    q := r := 0; (ℓ, ℓ′) := ($, $)
    result := ⟨⟩
    foreach ((c, c′), i) ∈ S do
        q++
        if (c, c′) ≠ (ℓ, ℓ′) then r := q; (ℓ, ℓ′) := (c, c′)
        append (r, i) to result
    return result

    Figure 1: The doubling algorithm.

Theorem 2.1. The doubling algorithm computes a suffix array using O(sort(n) log maxlcp) I/Os.
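Before turning to the external, pipelined version, it may help to see the doubling idea in plain internal memory form. The following sketch (our code, not the paper's implementation) uses std::sort in place of the external sorting steps and an integer rank array in place of the pair sequence P:

    #include <algorithm>
    #include <numeric>
    #include <string>
    #include <vector>

    // In-memory prefix doubling, O(n log^2 n): the idea behind Figure 1,
    // with std::sort standing in for the external sorting steps.
    std::vector<int> suffixArrayDoubling(const std::string& t) {
        int n = static_cast<int>(t.size());
        std::vector<int> sa(n), rank(n), tmp(n);
        std::iota(sa.begin(), sa.end(), 0);
        if (n <= 1) return sa;
        for (int i = 0; i < n; ++i)          // initial names: the characters
            rank[i] = static_cast<unsigned char>(t[i]);
        for (int k = 1; ; k *= 2) {
            // Compare suffixes by (name, name 2^k positions later); a rank
            // past the end of the text acts as -1, i.e., the $ sentinel.
            auto cmp = [&](int i, int j) {
                if (rank[i] != rank[j]) return rank[i] < rank[j];
                int ri = i + k < n ? rank[i + k] : -1;
                int rj = j + k < n ? rank[j + k] : -1;
                return ri < rj;
            };
            std::sort(sa.begin(), sa.end(), cmp);
            // Naming: adjacent equal pairs get equal names (cf. name()).
            tmp[sa[0]] = 0;
            for (int i = 1; i < n; ++i)
                tmp[sa[i]] = tmp[sa[i - 1]] + (cmp(sa[i - 1], sa[i]) ? 1 : 0);
            rank = tmp;
            if (rank[sa[n - 1]] == n - 1) break;   // all names unique
        }
        return sa;
    }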

3 Pipelining

The I/O volume of the doubling algorithm from Figure 1 can be reduced significantly by observing that rather than writing the sequence S to external memory, we

can directly feed it to the sorter in Line (1). Similarly, the sorted tuples need not be written, but can be directly fed into the naming procedure in Line (2), which can in turn forward its output to the sorter in Line (4). The result of this sorting operation need not be written either, but can directly yield the tuples of S that are fed into the next iteration of the doubling algorithm. Appendix A gives a simplified analysis of this example of pipelining.

Let us discuss a more systematic model: The computations in many external memory algorithms can be viewed as a data flow through a directed acyclic graph G = (V = F ∪ S ∪ R, E). The file nodes F represent data that has to be stored physically on disk. When a file node f ∈ F is accessed, we need a buffer of size b(f) = Ω(BD). The streaming nodes s ∈ S read zero, one or several sequences and output zero, one or several new sequences using internal buffers of size b(s).¹ The sorting nodes r ∈ R read a sequence

¹Streaming nodes may cause additional I/Os for internal processing, e.g., for large FIFO queues or priority queues. These


and output it in sorted order. Sorting nodes have a buffer requirement of b(r) = Θ(M) and outdegree one². Edges are labeled with the number of machine words w(e) flowing between two nodes. In the proof of Theorem 3.2 you find the flow graph for the pipelined doubling algorithm. We will see somewhat more complicated graphs in Sections 4 and 6. The following theorem gives necessary and sufficient conditions for an I/O efficient execution of such a data flow graph. Moreover, it shows that streaming computations can be scheduled completely systematically in an I/O efficient way.

Theorem 3.1. The computations of a data flow graph G = (V = F ∪ S ∪ R, E) with edge flows w : E → ℝ₊ and buffer requirements b : V → ℝ₊ can be executed using

(3.4)    Σ_{e ∈ E ∩ (F×V ∪ V×F)} scan(w(e))  +  Σ_{e ∈ E ∩ (V×R)} sort(w(e))

I/Os iff the following conditions are fulfilled. Consider the graph G′ which is a copy of G except that edges between streaming nodes are replaced by bidirected edges. The strongly connected components (SCCs) of G′ are required to be either single file nodes, single sorting nodes, or sets of streaming nodes. The total buffer requirement of each SCC C of streaming nodes plus the buffer requirements of the nodes directly connected to C has to be bounded by the internal memory size M.

Proof. The basic observation is that all streaming nodes within an SCC C of G′ must be executed together, exchanging data through their internal buffers: if any node from C is excluded, it will eventually stall the computation because input or output buffers fill up.

Now assume that G fulfills the requirements. We schedule the computations for each SCC of G′ in topologically sorted order. First consider an SCC C of streaming nodes. We perform in a single pass all the computations of the streaming nodes in C, reading from the file nodes with edges entering C, writing to the file nodes with edges coming from C, performing the first phase of sorting (e.g., run formation) for the sorting nodes with edges coming from C, and performing the last phase of sorting (e.g., multiway merging) for the sorting nodes with edges entering C. The requirement on the buffer sizes ensures that there is sufficient internal memory. The topological sorting ensures that all the data from incoming edges is available. Since there are

I/Os are not counted in our analysis.

²We could allow additional outgoing edges at an I/O cost n/DB. However, this would mean performing the last phase of the sorting algorithm several times.

only streaming nodes in C, data can freely flow through them, respecting the topological sorting of G.³

When a sorting node is encountered as an SCC, we may have to perform I/Os to make sure that the final phase can incrementally produce the sorted elements. However, for a sorting volume of O(M²/B), multiway merging only needs the run formation phase, which will already have been done, and the final merging phase, which will be done later. For SCCs consisting of file nodes we do nothing.

Now assume that G violates the requirements. If there is an SCC that exceeds its buffer requirements, there is no systematic way to execute all its nodes together.

If an SCC C of G′ contains a sorting node v, there must be a streaming node w that directly or indirectly needs input from v, i.e., it cannot start executing before v starts to produce output. Node v cannot produce any output before it has seen its complete input. This input directly or indirectly depends on some other streaming node u in C. Since u and w are in the same SCC, they have to be executed together. But the data dependencies above make this impossible. The argument for a file node within an SCC is analogous.

Theorem 3.1 can be used to design and analyze pipelined external memory algorithms in a systematic way. All we have to do is to give a data flow graph that fulfills the requirements, and we can then read off the I/O complexity. Using the relations a·scan(x) = scan(a·x) + O(1) and a·sort(x) ≤ sort(a·x) + O(1), we can represent the result in the form sort(x) + scan(y) + O(1), i.e., we can characterize the complexity in terms of the sorting volume x and the scanning volume y. One could further evaluate this function by plugging in the I/O complexity of a particular sorting algorithm (e.g., 2x/DB for x ≤ M²/DB and M ≫ DB), but this may not be desirable because we lose information. In particular, scanning implies less internal work and can usually be implemented using bulk I/Os in the sense of [8] (we then need larger buffers b(v) for file nodes), whereas sorting requires many random accesses for information theoretic reasons [2].
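As a toy illustration (our own sketch, not part of Stxxl), the bookkeeping behind this "read off the I/O complexity" step is a simple sum over the edges of the flow graph, distinguishing edges that touch file nodes (scanned volume) from edges that enter sorting nodes (sorted volume):

    #include <vector>

    // Toy evaluation of formula (3.4): given the edges of a data flow graph,
    // accumulate scan volume (edges touching file nodes) and sort volume
    // (edges into sorting nodes). Names and types are ours, for illustration.
    enum class Kind { File, Stream, Sort };
    struct Edge { Kind from, to; double words; };
    struct IOVolume { double scanWords = 0, sortWords = 0; };

    IOVolume ioVolume(const std::vector<Edge>& edges) {
        IOVolume v;
        for (const Edge& e : edges) {
            if (e.from == Kind::File || e.to == Kind::File)
                v.scanWords += e.words;   // contributes scan(w(e))
            if (e.to == Kind::Sort)
                v.sortWords += e.words;   // contributes sort(w(e))
        }
        return v;
    }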

Now we apply Theorem 3.1 to the doubling algorithm:

Theorem 3.2. The doubling algorithm from Figure 1 can be implemented to run using sort(5n)·⌈log(1 + maxlcp)⌉ + O(sort(n)) I/Os.

³In our implementations the detailed scheduling within the components is done by the user to keep the overhead small. However, one could also schedule them automatically, possibly using multithreading.


Proof. The following flow graph shows that each iteration can be implemented using sort(2n) + sort(3n) ≤ sort(5n) I/Os. The numbers refer to the line numbers in Figure 1.

[Flow graph omitted: a cycle through sorting node (1), streaming node (2), sorting node (4), and streaming node (5), with edge weight 3n into node (1) and 2n into node (4).]

After ⌈log(1 + maxlcp)⌉ iterations, the algorithm finishes. The O(sort(n)) term accounts for the I/Os needed in Line (0) and for computing the final result. Note that there is a small technicality here: Although naming can find out for free whether all names are unique, the result is known only when naming finishes. However, at this time, the first phase of the sorting step in Line (4) has also finished and has already incurred some I/Os. Moreover, the convenient arrangement of the pairs in P is destroyed now. However, we can then abort the sorting process, undo the wrong sorting, and compute the correct output.

In Stxxl the data flow nodes are implemented as objects with an interface similar to the STL input iterators [10]. A node reads data from input nodes using their * operators. With the help of their preincrement operators, a node proceeds to the next elements of the input sequences. The interface also defines an empty() function which signals the end of a sequence. After creating all node objects, the computation starts in a lazy fashion, first trying to evaluate the result of the topologically latest node. That node reads its input nodes element by element. Those nodes continue in the same mode, pulling the inputs needed to produce an output element. The process terminates when the result of the topologically latest node is computed. To support nodes with more than one output, Stxxl exposes an interface where a node generates output accessible not only via the * operator, but where a node can also push an output element to output nodes.

The library already offers basic generic classes which implement the functionality of sorting, file, and streaming nodes.
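To illustrate the shape of this pull interface, here is a hypothetical streaming node in the style described above (this is not the actual Stxxl API; the class and member names are ours):

    #include <utility>

    // Illustration of the pull interface: a streaming node that consumes an
    // input node In and emits (value, position) pairs. Any input node is
    // assumed to provide operator*, operator++ and empty(), as described.
    template <class In>
    class PairWithPosition {
    public:
        using value_type = std::pair<typename In::value_type, unsigned long>;

        explicit PairWithPosition(In& input) : in_(input), pos_(0) {}

        // Current element of the output sequence (lazy: computed on demand).
        value_type operator*() const { return {*in_, pos_}; }

        // Advance to the next element of the output sequence.
        PairWithPosition& operator++() { ++in_; ++pos_; return *this; }

        // End-of-sequence signal, forwarded from the input node.
        bool empty() const { return in_.empty(); }

    private:
        In& in_;
        unsigned long pos_;
    };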

4 Discarding

Let c_i^k be the lexicographic name of T[i, i + 2^k), i.e., the value paired with i at iteration k in Figure 1. Since c_i^k is the number of strictly smaller substrings of length 2^k, it is a non-decreasing function of k. More precisely, c_i^{k+1} − c_i^k is the number of positions j such that c_j^k = c_i^k but c_{j+2^k}^k < c_{i+2^k}^k. This provides an alternative way of computing the names, given in Figure 3.

Another consequence of the above observation is that if c_i^k is unique, i.e., c_j^k ≠ c_i^k for all j ≠ i, then

Function name2(S : Sequence of Pair)
    q := q′ := 0; (ℓ, ℓ′) := ($, $)
    result := ⟨⟩
    foreach ((c, c′), i) ∈ S do
        if c ≠ ℓ then q := q′ := 0; (ℓ, ℓ′) := (c, c′)
        else if c′ ≠ ℓ′ then q := q′; ℓ′ := c′
        append (c + q, i) to result
        q′++
    return result

    Figure 3: The alternative naming procedure.

c_i^h = c_i^k for all h > k. The idea of the discarding algorithm is to take advantage of this, i.e., to discard a pair (c, i) from further iterations once c is unique. A key to this is the new naming procedure in Figure 3, because it works correctly even if we exclude from S all tuples ((c, c′), i) where c is unique. Note, however, that we cannot exclude ((c, c′), i) if c′ is unique but c is not. Therefore, we will partially discard (c, i) when

Function doubling+discarding(T)
    S := [((T[i], T[i+1]), i) : i ∈ [0, n)]              (1)
    sort S                                               (2)
    U := name(S)     // undiscarded                      (3)
    P := ⟨⟩          // partially discarded
    F := ⟨⟩          // fully discarded
    for k := 1 to log n do
        mark unique names in U                           (4)
        sort U by (i mod 2^k, i div 2^k)                 (5)
        merge P into U; P := ⟨⟩                          (6)
        S := ⟨⟩; count := 0
        foreach (c, i) ∈ U do                            (7)
            if c is unique then
                if count < 2 then append (c, i) to F
                else append (c, i) to P
                count := 0
            else
                let (c′, i′) be the next pair in U
                append ((c, c′), i) to S
                count++
        if S = ⟨⟩ then
            sort F by the first component                (8)
            return [i : (c, i) ∈ F]                      (9)
        sort S                                           (10)
        U := name2(S)                                    (11)

    Figure 4: The doubling with discarding algorithm.


[Data flow graph omitted: streaming and sorting nodes labeled 1, (2,10), (3,11), 4, 5, 6, 7, 8, 9, with a file node P; edge weights include 3N, 2N, and 2n between input and output.]

Figure 2: Data flow graph for doubling with discarding. The numbers refer to line numbers in Figure 4. The edge weights are sums over the whole execution, with N = n · logdps.

c is unique. We will fully discard (c, i) = (c_i^k, i) when additionally either c_{i−2^k}^k or c_{i−2^k}^{k+1} is unique, because then in any iteration h > k, the first component of the tuple ((c_{i−2^h}^h, c_i^h), i − 2^h) must be unique. The final algorithm is given in Figure 4.

Theorem 4.1. Doubling with discarding can be implemented to run using sort(5n·logdps) + O(sort(n)) I/Os.

Proof. We prove the theorem by showing that the total amount of data in the different steps of the algorithm over the whole execution is as in the data flow graph in Figure 2. The nontrivial points are that at most N = n·logdps tuples are processed in each sorting step over the whole execution, and that at most n tuples are written to P. The former follows from the fact that a suffix i is involved in the sorting steps as long as it has a non-unique rank, which happens in exactly ⌈log(1 + dps(i))⌉ iterations. To show the latter, we note that a tuple (c, i) is written to P in iteration k only if the previous tuple (c′, i − 2^k) was not unique. That previous tuple will become unique in the next iteration, because it is represented by ((c′, c), i − 2^k) in S. Since each tuple turns unique only once, the total number of tuples written to P is at most n.

A slightly different algorithm with the same asymptotic complexity is described in [16]. The algorithm in [8] does partial but not full discarding, adding the term O(scan(n·log maxlcp)) to its complexity.
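For concreteness, a direct in-memory transcription of the name2 procedure from Figure 3 (our code; the sentinel encoding of $ is an assumption):

    #include <cstdint>
    #include <utility>
    #include <vector>

    // C++ transcription of Function name2: input tuples ((c, c2), i) sorted
    // by (c, c2); the new name is c plus the number of tuples with the same
    // c but smaller c2 seen so far, so names stay consistent even when
    // tuples with a unique first component c have been discarded.
    using Tuple =
        std::pair<std::pair<std::uint64_t, std::uint64_t>, std::uint64_t>;

    std::vector<std::pair<std::uint64_t, std::uint64_t>>
    name2(const std::vector<Tuple>& s) {
        std::vector<std::pair<std::uint64_t, std::uint64_t>> result;
        std::uint64_t q = 0, q2 = 0;
        std::uint64_t l = ~0ULL, l2 = ~0ULL;  // ($, $): impossible-name sentinel
        for (const auto& [cc, i] : s) {
            auto [c, c2] = cc;
            if (c != l) { q = q2 = 0; l = c; l2 = c2; }
            else if (c2 != l2) { q = q2; l2 = c2; }
            result.emplace_back(c + q, i);
            ++q2;
        }
        return result;
    }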

5 From Doubling to a-Tupling

It is straightforward to generalize the doubling algorithms from Figures 1 and 4 so that they maintain the invariant that in iteration k, lexicographic names represent strings of length a^k: just gather a names from the last iteration that are a^{k−1} characters apart, then sort and name as before.

Theorem 5.1. The a-tupling algorithm can be implemented to run using

    sort(((a + 3)/log a) · n) · log maxlcp + O(sort(n))    or
    sort(((a + 3)/log a) · n) · logdps + O(sort(n))

I/Os, without or with discarding respectively.

We get a tradeoff between a higher cost per iteration and a smaller number of iterations, determined by the ratio (a + 3)/log a. Evaluating this expression, we get the optimum for a = 5. But the value for a = 4 is only 1.5% worse, needs less memory, and calculations are much easier because four is a power of two. Hence, we choose a = 4 for our implementation of the a-tupling algorithm. This quadrupling algorithm needs 30% less I/O than doubling.
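The claimed optimum is easy to verify numerically; this throwaway snippet (ours) tabulates the ratio:

    #include <cmath>
    #include <cstdio>

    // Tabulate the per-character sorting volume factor (a + 3) / log2(a)
    // of the a-tupling algorithm for small a. The minimum is at a = 5
    // (8 / log2(5) is about 3.45); a = 4 gives 7/2 = 3.5, about 1.5% worse,
    // while doubling (a = 2) gives 5, about 30% more than quadrupling.
    int main() {
        for (int a = 2; a <= 8; ++a)
            std::printf("a = %d: (a+3)/log2(a) = %.3f\n", a,
                        (a + 3) / std::log2(a));
        return 0;
    }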

6 A Pipelined I/O-Optimal Algorithm

The following three steps outline a linear time algorithm for suffix array construction [17]:

1. Construct the suffix array of the suffixes starting at positions i with i mod 3 ≠ 0. This is done by reduction to the suffix array construction of a string of two thirds the length, which is solved recursively.

2. Construct the suffix array of the remaining suffixes using the result of the first step.

    3. Merge the two suffix arrays into one.

Figure 5 gives pseudocode for an external implementation of this algorithm and Figure 6 gives a data flow graph that allows pipelined execution. Step 1 is implemented by Lines (1)–(6) and starts out quite similarly to the tripling (3-tupling) algorithm described in Section 5. The main difference is that triples are only obtained for two thirds of the suffixes and that we use recursion to find lexicographic names that exactly characterize the relative order of these sample suffixes. As a preparation for Steps 2 and 3, in Lines (7)–(10) these sample names are used to annotate each suffix position i with enough information to determine its global rank. More precisely, at most two sample names and the first one or two characters suffice to completely determine the rank of a suffix. This information can be obtained I/O efficiently by simultaneously scanning the input and the names of the sample suffixes sorted by their position in the input. With this information, Step 2 reduces to sorting the suffixes T_i with i mod 3 = 0 by their first character and the name for T_{i+1} in the sample (Line 11). Line (12) reconstructs the order of the mod-1 suffixes and mod-2 suffixes. Line (13) implements Step 3 by ordinary comparison based merging. The slight com-


plication is the comparison function. There are three cases:

A mod-0 suffix T_i can be compared with a mod-1 suffix T_j by looking at the first characters and the names for T_{i+1} and T_{j+1} in the sample, respectively.

For a comparison between a mod-0 suffix T_i and a mod-2 suffix T_j, the above technique does not work since T_{j+1} is not in the sample. However, both T_{i+2} and T_{j+2} are in the sample, so it suffices to look at the first two characters and the names of T_{i+2} and T_{j+2}, respectively.

Mod-1 suffixes and mod-2 suffixes can be compared by looking at their names in the sample.
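In code, the three cases boil down to lexicographic comparisons of short tuples. The following sketch (our code; the record layouts and names are illustrative, mirroring the tuple formats of Figure 5) shows one way to write them:

    #include <cstdint>
    #include <tuple>

    // Hypothetical record layouts mirroring S0, S1, S2 in Figure 5;
    // the 'c...' fields hold sample names (ranks), 't...' hold characters.
    struct S0Rec { char t, t1; std::uint64_t cNext, cNext2, i; };  // i mod 3 = 0
    struct S1Rec { std::uint64_t c; char t; std::uint64_t cNext, i; };  // mod 1
    struct S2Rec { std::uint64_t c; char t, t1; std::uint64_t cNext2, i; };  // mod 2

    // Case 1: mod-0 vs mod-1: first character, then name of the successor.
    bool lessS0S1(const S0Rec& a, const S1Rec& b) {
        return std::make_tuple(a.t, a.cNext) < std::make_tuple(b.t, b.cNext);
    }
    // Case 2: mod-0 vs mod-2: two characters, then name two positions ahead.
    bool lessS0S2(const S0Rec& a, const S2Rec& b) {
        return std::make_tuple(a.t, a.t1, a.cNext2)
             < std::make_tuple(b.t, b.t1, b.cNext2);
    }
    // Case 3: mod-1 vs mod-2: the sample names alone decide.
    bool lessS1S2(const S1Rec& a, const S2Rec& b) { return a.c < b.c; }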

The resulting data flow graph is large but fairly straightforward, except for the file node which stores a copy of the input stream T. The problem is that the input is needed twice. First, Line (2) uses it for generating the sample, and later the node implementing Lines (8)–(10) scans it simultaneously with the names of the sample suffixes. It is not possible to pipeline both scans, because we would violate the requirement of Theorem 3.1 that edges between streaming nodes must not cross sorting nodes. This problem can be solved by writing a temporary copy of the input stream. Note that this is still cheaper than using a file representation for the input, since that would mean that this file is read twice. We are now ready to analyze the I/O complexity of the algorithm.

Theorem 6.1. The DC3-algorithm from Figure 5 can be implemented to run using sort(30n) + scan(6n) I/Os.

Proof. Let V(n) denote the number of I/Os of the external DC3 algorithm. Using Theorem 3.1 and the data flow diagram from Figure 6, we can conclude that

    V(n) ≤ sort((8/3 + 4/3 + 4/3 + 5/3 + 4/3 + 5/3)·n) + scan(2n) + V(2n/3)
         = sort(10n) + scan(2n) + V(2n/3).

This recurrence has the solution V(n) ≤ 3·(sort(10n) + scan(2n)) ≤ sort(30n) + scan(6n), since the sorting and scanning volumes form geometric series with ratio 2/3 and Σ_{i≥0} (2/3)^i = 3. Note that the data flow diagram assumes that the input is a data stream into the procedure call. However, we get the same complexity if the original input is a file. In that case, we have to read the input once, but we save writing it to the local file node T.

7 Experiments

We have implemented the algorithms in C++ using the g++ 3.2.3 compiler (optimization level -O2

Function DC3(T)
    S := [((T[i, i+2]), i) : i ∈ [0, n), i mod 3 ≠ 0]               (1)
    sort S by the first component                                   (2)
    P := name(S)                                                    (3)
    if the names in P are not unique then
        sort the (c, i) ∈ P by (i mod 3, i div 3)                   (4)
        SA12 := DC3([c : (c, i) ∈ P])                               (5)
        P := [(j + 1, SA12[j]) : j ∈ [0, 2n/3)]                     (6)
    sort P by the second component                                  (7)
    S0 := [(T[i], T[i+1], c′, c″, i) :                              (8)
           i mod 3 = 0, (c′, i + 1) ∈ P, (c″, i + 2) ∈ P]
    S1 := [(c, T[i], c′, i) :                                       (9)
           i mod 3 = 1, (c, i) ∈ P, (c′, i + 1) ∈ P]
    S2 := [(c, T[i], T[i+1], c′, i) :                               (10)
           i mod 3 = 2, (c, i) ∈ P, (c′, i + 2) ∈ P]
    sort S0 by components 1, 3                                      (11)
    sort S1 and S2 by component 1                                   (12)
    S := merge(S0, S1, S2) with the comparison function:            (13)
        (t, t′, c′, c″, i) ∈ S0 ≤ (d, u, d′, j) ∈ S1
            ⇔ (t, c′) ≤ (u, d′)
        (t, t′, c′, c″, i) ∈ S0 ≤ (d, u, u′, d′, j) ∈ S2
            ⇔ (t, t′, c″) ≤ (u, u′, d′)
        (c, t, c′, i) ∈ S1 ≤ (d, u, u′, d′, j) ∈ S2
            ⇔ c ≤ d
    return [last component of s : s ∈ S]                            (14)

    Figure 5: The DC3-algorithm.

-fomit-frame-pointer) and the external memory library Stxxl version 0.52 [10]. Our experimental platform has two 2.0 GHz Intel Xeon processors, one GByte

of RAM, and we use four 80 GByte IBM 120GXP disks. Refer to [11] for a performance evaluation of this machine, whose cost was 2500 Euro in July 2002. The following instances have been considered:

Random2: Two concatenated copies of a random string of length n/2. This is a difficult instance that is hard to beat using simple heuristics.

Gutenberg: Freely available English texts from http://promo.net/pg/list.html.

Genome: The known pieces of the human genome from http://genome.ucsc.edu/downloads.html (status May 2004). We have normalized this input to ignore the distinction between upper case and lower case letters. The result is characters in an alphabet of size 5 (A, C, G, T, and sometimes long sequences of unknown characters).

HTML: Pages from a web crawl containing only pages from .gov domains. These pages are filtered so that only text and HTML code is contained, but no pictures and no binary files.


[Figure omitted: plots of execution time per input character and I/O volume [byte] per input character versus input size n (2^24 to 2^32) for the instances Random2, Gutenberg, Genome, HTML, and Source, comparing nonpipelined doubling, Doubling, Discarding, Quadrupling, Quad-Discarding, and DC3.]


One can see that pipelining brings a huge reduction of I/O volume, whereas the execution time is affected much less, a clear indication that our algorithms are dominated by internal calculations. We also have reasons to believe that our nonpipelined sorter is more highly tuned than the pipelined one, so that the advantage of pipelining may grow in future versions of our implementation. We do not show the nonpipelined algorithm for the other inputs, since its relative performance compared to pipelined doubling should remain about the same.

A comparison of the new algorithms with previous algorithms is more difficult. The implementation of [8] works only up to 2 GBytes of total external memory consumption and would thus have to compete with space efficient internal algorithms on our machine. At least we can compare the I/O volume per byte of input for the measurements in [8]. Their most scalable algorithm for the largest real world input tested (26 MBytes of text from the Reuters news agency) is nonpipelined doubling with partial discarding. This algorithm needs an I/O volume of 1303 bytes per character of input. The DC3-algorithm needs about 5 times less I/O. Furthermore, it is to be expected that the lead gets bigger for larger inputs. The GBS algorithm [13] needs 486 bytes of I/O per character for this input in [8], i.e., even for this small input DC3 already outperforms the GBS algorithm. We can also attempt a speed comparison in terms of clock cycles per byte of input. Here, [8] needs 157 000 cycles per byte for doubling with simple discarding and 147 000 cycles per byte for the GBS algorithm, whereas DC3 needs only about 20 000 cycles. Again, the advantage should grow for larger inputs, in particular when comparing with the GBS algorithm.

The following small table shows the execution time of DC3 for 1 to 8 disks on the Source instance:

D             1      2      4      6      8
t [µs/byte]   13.96  9.88   8.81   8.65   8.52

We see that adding more disks gives only a very small speedup. (And we would see very similar speedups for the other algorithms, except nonpipelined doubling.) Even with 8 disks, DC3 has an I/O rate of less than 30 MByte/s, which is less than the peak performance of a single disk (45 MByte/s). Hence, by more effective overlapping of I/O and computation, it should be possible to sustain the performance of eight disks using a single cheap disk, so that even very cheap PCs could be used for external suffix array construction.

8 A Checker

To ensure the correctness of our algorithms, we have designed and implemented a simple and fast suffix array checker. It is given in Figure 7 and is based on the following result.

Lemma 8.1. ([5]) An array SA[0, n) is the suffix array of a text T iff the following conditions are satisfied:

1. SA contains a permutation of [0, n).

2. ∀i, j : r_i ≤ r_j ⇒ (T[i], r_{i+1}) ≤ (T[j], r_{j+1}), where r_i denotes the rank of the suffix S_i according to the suffix array.

Proof. The conditions are clearly necessary. To show sufficiency, assume that SA contains exactly a permutation of [0, n) but in the wrong order. Let S_i and S_j be a pair of wrongly ordered suffixes, say S_i > S_j but r_i < r_j, that maximizes i + j. The second condition is violated if T[i] > T[j]. Otherwise, we must have T[i] = T[j] and S_{i+1} > S_{j+1}. But then r_{i+1} > r_{j+1} by the maximality of i + j, and the second condition is again violated.

Theorem 8.1. The suffix array checker from Figure 7 can be implemented to run using sort(5n) + scan(2n) I/Os.

Function Checker(SA, T)
    P := [(SA[i], i + 1) : i ∈ [0, n)]                   (1)
    sort P by the first component                        (2)
    if [i : (i, r) ∈ P] ≠ [0, n) then return false
    S := [(r, (T[i], r′)) : i ∈ [0, n),                  (3)
          (i, r) = P[i], (i + 1, r′) = P[i + 1]]
    sort S by the first component                        (4)
    if [(c, r′) : (r, (c, r′)) ∈ S] is sorted            (5)
        then return true else return false

    Figure 7: The suffix array checker.
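An in-memory version of this checker takes only a few lines; the following sketch (our code, with names of our choosing) verifies both conditions of Lemma 8.1 directly:

    #include <string>
    #include <vector>

    // In-memory check of Lemma 8.1: (1) sa is a permutation of [0, n);
    // (2) consecutive suffixes in sa order satisfy
    //     (T[sa[k]], rank[sa[k]+1]) <= (T[sa[k+1]], rank[sa[k+1]+1]),
    // where the rank of the out-of-range position n acts as -1 ($ sentinel).
    bool isSuffixArray(const std::string& t, const std::vector<long>& sa) {
        long n = static_cast<long>(t.size());
        if (static_cast<long>(sa.size()) != n) return false;
        std::vector<long> rank(n + 1, -1);   // rank[n] stands for $
        std::vector<bool> seen(n, false);
        for (long k = 0; k < n; ++k) {
            if (sa[k] < 0 || sa[k] >= n || seen[sa[k]]) return false;  // (1)
            seen[sa[k]] = true;
            rank[sa[k]] = k;
        }
        for (long k = 0; k + 1 < n; ++k) {   // condition (2), pairwise
            long i = sa[k], j = sa[k + 1];
            if (t[i] > t[j]) return false;
            if (t[i] == t[j] && rank[i + 1] > rank[j + 1]) return false;
        }
        return true;
    }

Checking only consecutive pairs suffices because the pairwise order is transitive, matching the "is sorted" test in Line (5) of Figure 7.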

9 Conclusion

Our efficient external version of the DC3-algorithm is theoretically optimal and clearly outperforms all previous algorithms in practice. Since all practical previous algorithms are asymptotically suboptimal and their performance is input dependent, this closes a gap between theory and practice. DC3 even outperforms the pipelined quadrupling-with-discarding algorithm for real world instances. This underlines the practical usefulness of DC3, since a mere comparison with the relatively simple, nonpipelined previous implementations would have been unfair.


As a side effect, the various generalizations of doubling yield an interesting case study in the systematic design of pipelined external algorithms.

The most important practical question is whether constructing suffix arrays in external memory is now feasible. We believe that the answer is a careful yes: we can now process 4·10^9 characters overnight on a low cost machine, two orders of magnitude more than in [8], in a time faster than or comparable to previous internal memory computations [23, 20] on more expensive machines.

There are also many opportunities to scale to even larger inputs. For example, one could exploit that about half of the sorting operations are just permutations, which should be implementable with less internal work than general sorting. It should also be possible to better overlap I/O and computation. More interestingly, there are many ways to parallelize. On a small scale, pipelining allows us to run several sorters and one streaming thread in parallel. On a large scale, DC3 is also perfectly parallelizable [17]. Since the algorithm is largely compute bound, even cheap switched Gigabit Ethernet should allow high efficiency (DC3 sorts about 13 MByte/s in our measurements). Considering all these improvements and the continuing advance in technology, there is no reason why it should not be possible to handle inputs that are another two orders of magnitude larger in a few years.

Acknowledgements. We would like to thank Stefan Burkhardt and Knut Reinert for valuable pointers to interesting experimental input. Lutz Kettner helped with the design of Stxxl. The HTML pages were supplied by Sergey Sizov from the information retrieval group at MPII. Christian Klein helped with Unix tricks for assembling the data.

    References

[1] M. I. Abouelhoda, S. Kurtz, and E. Ohlebusch. The enhanced suffix array and its applications to genome analysis. In Proc. 2nd Workshop on Algorithms in Bioinformatics, volume 2452 of LNCS, pages 449–463. Springer, 2002.

[2] A. Aggarwal and J. S. Vitter. The input/output complexity of sorting and related problems. Communications of the ACM, 31(9):1116–1127, 1988.

[3] L. Arge, P. Ferragina, R. Grossi, and J. S. Vitter. On sorting strings in external memory. In 29th ACM Symposium on Theory of Computing, pages 540–548, El Paso, May 1997. ACM Press.

[4] L. Arge, O. Procopiuc, and J. S. Vitter. Implementing I/O-efficient data structures using TPIE. In 10th European Symposium on Algorithms (ESA), volume 2461 of LNCS, pages 88–100. Springer, 2002.

[5] S. Burkhardt and J. Kärkkäinen. Fast lightweight suffix array construction and checking. In Proc. 14th Annual Symposium on Combinatorial Pattern Matching, volume 2676 of LNCS, pages 55–69. Springer, 2003.

[6] M. Burrows and D. J. Wheeler. A block-sorting lossless data compression algorithm. Technical Report 124, SRC (digital, Palo Alto), May 1994.

[7] C.-F. Cheung, J. X. Yu, and H. Lu. Constructing suffix trees for gigabyte sequences with megabyte memory. IEEE Transactions on Knowledge and Data Engineering, 17(1):90–105, 2005.

[8] A. Crauser and P. Ferragina. Theoretical and experimental study on the construction of suffix arrays in external memory. Algorithmica, 32(1):1–35, 2002.

[9] A. Crauser and K. Mehlhorn. LEDA-SM: a platform for secondary memory computations. Technical report, MPII, 1998. Draft.

[10] R. Dementiev. The Stxxl library. Documentation and download at http://www.mpi-sb.mpg.de/~rdementi/stxxl.html.

[11] R. Dementiev and P. Sanders. Asynchronous parallel disk sorting. In 15th ACM Symposium on Parallelism in Algorithms and Architectures, pages 138–148, San Diego, 2003.

[12] M. Farach-Colton, P. Ferragina, and S. Muthukrishnan. On the sorting-complexity of suffix tree construction. Journal of the ACM, 47(6):987–1011, 2000.

[13] G. Gonnet, R. Baeza-Yates, and T. Snider. New indices for text: PAT trees and PAT arrays. In W. B. Frakes and R. Baeza-Yates, editors, Information Retrieval: Data Structures & Algorithms. Prentice-Hall, 1992.

[14] W.-K. Hon, T.-W. Lam, K. Sadakane, and W.-K. Sung. Constructing compressed suffix arrays with large alphabets. In Proc. 14th International Symposium on Algorithms and Computation, volume 2906 of LNCS, pages 240–249. Springer, 2003.

[15] W.-K. Hon, K. Sadakane, and W.-K. Sung. Breaking a time-and-space barrier in constructing full-text indices. In Proc. 44th Annual Symposium on Foundations of Computer Science, pages 251–260. IEEE, 2003.

[16] J. Kärkkäinen. Full-text indexes in external memory. In Algorithms for Memory Hierarchies, volume 2625 of LNCS, pages 171–192. Springer, 2003.

[17] J. Kärkkäinen and P. Sanders. Simple linear work suffix array construction. In Proc. 30th International Conference on Automata, Languages and Programming, volume 2719 of LNCS, pages 943–955. Springer, 2003.

[18] D. K. Kim, J. S. Sim, H. Park, and K. Park. Linear-time construction of suffix arrays. In Proc. 14th Annual Symposium on Combinatorial Pattern Matching, volume 2676 of LNCS, pages 186–199. Springer, June 2003.

[19] P. Ko and S. Aluru. Space efficient linear time construction of suffix arrays. In Proc. 14th Annual Symposium on Combinatorial Pattern Matching, volume 2676 of LNCS, pages 200–210. Springer, 2003.


[20] T.-W. Lam, K. Sadakane, W.-K. Sung, and S.-M. Yiu. A space and time efficient algorithm for constructing compressed suffix arrays. In Proc. 8th Annual International Conference on Computing and Combinatorics, volume 2387 of LNCS, pages 401–410. Springer, 2002.

[21] U. Manber and G. Myers. Suffix arrays: A new method for on-line string searches. SIAM Journal on Computing, 22(5):935–948, October 1993.

[22] G. Manzini and P. Ferragina. Engineering a lightweight suffix array construction algorithm. In Proc. 10th Annual European Symposium on Algorithms, volume 2461 of LNCS, pages 698–710. Springer, 2002.

[23] K. Sadakane and T. Shibuya. Indexing huge genome sequences for solving various problems. Genome Informatics, 12:175–183, 2001.

[24] A. Silberschatz, H. F. Korth, and S. Sudarshan. Database System Concepts. McGraw-Hill, 4th edition, 2001.

[25] J. S. Vitter and E. A. M. Shriver. Algorithms for parallel memory, I: Two-level memories. Algorithmica, 12(2/3):110–147, 1994.

A An Introductory Example for Pipelining

To motivate the idea of pipelining, let us first analyze the constant factor in a naive implementation of the doubling algorithm from Figure 1. For simplicity, assume for now that inputs are not too large, so that sorting m words can be done in 4m/DB I/Os using two passes over the data. For example, one run formation phase could build sorted runs of size M, and one multiway merging phase could merge the runs into a single sorted sequence.

Line (1) sorts n triples and hence needs 12n/DB I/Os. Naming in Line (2) scans the triples and writes name-index pairs, using 3n/DB + 2n/DB = 5n/DB I/Os. The naming procedure can also determine whether all names are unique now, hence the test in Line (3) needs no I/Os. Sorting the pairs in P in Line (4) costs 8n/DB I/Os. Scanning the pairs and producing triples in Line (5) costs another 5n/DB I/Os. Overall, we get (12 + 5 + 8 + 5)n/DB = 30n/DB I/Os for each iteration.

This can be radically reduced by interpreting the sequences S and P not as files but as pipelines, similar to the pipes available in UNIX. In the beginning we explicitly scan the input T and produce triples for S. We do not count these I/Os since they are not needed in the subsequent iterations. The triples are not output directly but are immediately fed into the run formation phase of the sorting operation in Line (1). The runs are output to disk (3n/DB I/Os). The multiway merging phase reads the runs (3n/DB I/Os) and directly feeds the sorted triples into the naming procedure called in Line (2), which generates pairs that are immediately fed into the run formation process of the next sorting operation in Line (4) (2n/DB I/Os). The multiway merging phase (2n/DB I/Os) for Line (4) does not write the sorted pairs; instead, Line (5) generates from them the triples of S that are fed into the pipeline for the next iteration. We have eliminated all the I/Os for scanning and half of the I/Os for sorting, resulting in only 10n/DB I/Os per iteration, only one third of the I/Os needed for the naive implementation.

Note that pipelining would have been more complicated in the more traditional formulation, where Line (4) sorts P directly by the index i. In that case, a pipelined formulation would require a FIFO of size 2^k to produce the shifted sequence. When 2^k > M, this FIFO would have to be maintained externally, causing 2n/DB additional I/Os per iteration, i.e., our modification simplifies the algorithm and saves up to 20% of the I/Os.