tcstr_06_21

Embed Size (px)

Citation preview

  • 7/28/2019 tcstr_06_21

    1/7

    An Adaptive Algorithm for Splitting Large Sets of Strings

    and Its Application to Efficient External Sorting

    Tatsuya Asai Seishi Okamoto Hiroki Arimura

    Knowledge Research Center, Fujitsu Laboratories Ltd.411 Kamikodanaka, Nakahara-ku, Kawasaki 2118588, Japan

    Email: [email protected]: [email protected]

    Graduate School of Information Science and Technology, Hokkaido UniversityKita 14jo, Nishi 9chome, Sapporo 0600814, Japan

    Email: [email protected]

    Abstract

    In this paper, we study the problem of sorting a large

    collection of strings in external memory. Based onadaptive construction of a summary data structure,called adaptive synopsis trie, we present a practi-cal string sorting algorithm DistStrSort, which issuitable to sorting string collections of large size inexternal memory, and also suitable for more com-plex string processing problems in text and semi-structured databases such as counting, aggregation,and statistics. Case analyses of the algorithm and ex-periments on real datasets show the efficiency of ouralgorithm in realistic setting.

    Keywords: external string sorting, trie sort, semi-

    structured database, text database, data splitting

    1 Introduction

    Sorting strings is a fundamental task of string pro-cessing, which is not only sorting the objects in as-sociated ordering, but also used as a basis of moresophisticated processing such as grouping, count-ing, and aggregation. With a recent emergence ofmassive amount of semi-structured data (Abiteboulet al. 2000), such as plain texts, Web pages, orXML documents, it is often the case that we wouldlike to sort a large collection of strings that do notfit into main memory. Moreover, there are poten-tial demands for flexible methods tailored for semi-structured data beyond basic sorting facilities suchas statistics computation, counting, and aggregationover various objects including texts, data, codes,and streams (Abiteboul et al. 2005, Stonebraker etal. 2005).

    In this paper, we study efficient methods for sort-ing strings in external memory. We present a newapproach to sorting a large collection of strings in ex-ternal memory. The main idea of our method is togeneralize distribution sort algorithms for collections

    of strings by the extensive use of trie data structure.To implement this idea, we introduce a synopsis datastructure for strings, called a synopsis trie, for ap-proximate distribution, and develop an adaptive con-

    STEP1: BuildDietTrie

    STEP2: SplitDietTrie STEP3: SortBucket STEP4: Merge

    InputStringListS

    OutputStringLis

    tT

    Growth

    parameter

    Sorted

    Sorted

    Bucket

    Sorted

    Bucket

    Sorted

    Bucket

    Bucket

    Bucket

    Bucket

    Internal memoryInternal memory

    Synopsis Trie TS

    Figure 1: The architecture of an external sorting al-gorithm DistStrSort using adaptive constructionof a synopsis trie

    struction algorithm for synopsis tries for a large col-lection of strings with the idea of adaptive growth.

    Based on this technique, we present an externalstring sorting algorithm DistStrSort. In Fig. 1, weshow an architecture of our external sorting. Thisalgorithm first adaptively construct a small synopsistrie as a model of string distribution by a single scanof the input data, splits input strings into a collec-tion of buckets using the synopsis trie as a splitter,then sort the strings in each bucket in main memory,and finally merge the sorted buckets into an outputcollection.

    By an analysis, we show that if both of the size ofthe synopsis trie and the maximum size of the bucketsare small enough to fit into main memory of size M,then our algorithm DistStrSort correctly solves theexternal string sorting problem in O(N) time withinmain memory M using 3N sequential read, N ran-dom write, and N sequential write in the externalmemory, where N is the total size of an input stringlist S. Thus, this algorithm of O(N) time complex-ity is attractive compared with a popular merge sort-based external string sort algorithm of O(Nlog N)time complexity. Experiments on real datasets show

    the effectiveness of our algorithm.As an advantage, this algorithm is suitable

    to implement complex data manipulation opera-tions (Ramakrishnan and Gehrke 2000) such as

  • 7/28/2019 tcstr_06_21

    2/7

    group-by aggregation, statistics calculation, and func-tional join for texts.

    1.1 Related works

    In what follows, we denote by K = |S| and N = ||S||be the cardinality and the total size of an input stringlist S ().

    A straightforward extension of quick sort al-

    gorithm solves the string sorting problem inO(LKlog K) time in main memory, where L is themaximum length of the input strings and N =O(LK). This algorithm is efficient if L = O(1), butinefficient if L is large.

    Arge et al. presented an algorithm that solvesthe string sorting problem in O(Klog K + N) timein main memory (Arge et al. 1997). The String B-tree is an efficient dynamic indexing data structurefor strings (Ferragina and Grossi 1999). This datastructure also has almost optimal performance for ex-ternal string sorting in theory, while the performance

    degenerates in practice because of many random ac-cess to external memory.Sinha and Zobel presented a trie-based string

    sorting algorithm, called the burst sort, and showthat the algorithm outperforms many of previousstring sorting algorithms on real datasets (Sinha andZobel 2003). However, their algorithm is an inter-nal sorting algorithm, and furthermore, the main aimof this paper is proposal of an adaptive strategy forrealizing an efficient external string sorting.

    Although there are a number of studies on sort-ing the suffixes of a single input string (Bentley andSedgewick 1997, Manber and Myers 1993, Manzini

    and Ferragina 2002, Sadakane 1999), they are ratherirrelevant to this work since they are mainly incorealgorithms and utilize the repetition structure of suf-fixes of a string.

    The adaptive construction of a trie for a stream ofletters has been studied since 90s in time-series mod-eling (Laird, Saul 1994), data compression (Moffat1990), and bioinformatics (Ron, Singer, Tishby 1996).However, the application of adaptive trie constructionto external sorting seems new and an interesting di-rection in text and semi-structured data processing.Recently, (Stonebraker et al. 2005) showed the use-

    fulness of distributed stream processing as a massivedata application of new type.

    1.2 Organization

    This paper is organized as follows. In Section 2, weprepare basic notions and definitions on string sortingand related string processing problems. In Section 3,we present our external string sorting algorithm basedon adaptive construction of a synopsis trie. In Section4, we report experiments on real datasets, and finally,we conclude in Section 5.

    2 Preliminaries

    In this section, we give basic notions and definitionson external string sorting.

    2.1 Strings

    Let = {A,B,A1, A2, . . .} be an alphabet of letters,with which a total order on is associated. Inparticular, we assume that our alphabet is a set ofintegers in a small range, i.e., = {0, . . . , C 1}for some integer C 0, which is a constant or atmost O(N) for the total input size N. This is thecase for ASCII alphabet {0, . . . , 255} or DNA alpha-

    bet {A , T , C , G}.We denote by the set of all finite strings on

    and by the empty string. Let s = a1 an

    be a string on . For 1 i j n, we denote by|s| = n the length of s, by s[i] = ai the i-th letter,and by s[i..j] = ai aj the substring from i-th toj-th letters. If s = pq for some p, q then we saythat p is a prefix of s and denote p s.

    We define the lexicographic order lex on stringsof by extending the total order on letters inthe standard way.

    2.2 String sorting problemFor strings s1, . . . , sm

    , we distinguish an or-dered list (s1, . . . , sm), an unordered list (or multi-set) {{s1, . . . , sm}}, and a set {s1, . . . , sm} (of uniquestrings) each other. The notations and = are de-fined accordingly. We use the intentional notation( s | P(s) ) or { s | P(s) }. For lists of strings S1, S2,we denote by S1S2 (S1 S2, resp.) the concatenationof S1, S2 as ordered lists (as unordered lists, resp.).

    An input to our algorithm is a string list of sizeK 0 that is just an ordered list S = (s1, . . . , sK) () of possibly identical strings, where si

    is a

    strings on for every 1 i K. We denote by |S| =K the cardinality of S and ||S|| = N =

    Ki=1 |si| the

    total size ofS. We denote the set of all unique stringsin S by uniq(S) = { si | i = 1, . . . , K }. A string listS = (s1, . . . , sn) is sorted if si lex si+1 holds forevery i = 1, . . . , n 1.

    Definition 1 The string sorting problem (STR-SORT) is, given a string list S = (s1, . . . , sK) (), to compute the sorted list (S) =(s(1), . . . , s(K)) (

    ) for some permutation :{1, . . . , K } {1, . . . , K } of its indices.

    We can extend our framework for more compli-cated string processing problems as follows.

    Definition 2 The string counting problem (STR-COUNT) is, given a string list S = (s1, . . . , sK) (), to compute the histogram h : uniq(S) Nsuch that h(s) is the number of occurrences of theunique string s.

    Let (, ) be a pair of a collection ofvalues anda corresponding associative operation : 2 .For example, (N, +) and (R, max) are examples of(, ). A string database of size K 0 over dom(D)

    is a list D = ( (si, vi) | i = 1, . . . , K ) ( )of pairs of a string and a value. We denote byuniq(D) = { si | (si, vi) D } the set of unique stringsin S. For (, ), the aggregate on s w.r.t. , denoted

  • 7/28/2019 tcstr_06_21

    3/7

    Algorithm DistStrSort:

    Input: A string list S = (s1, . . . , sK) () and

    a maximal bucket size 0 B M.

    Output: The sorted string list T ().

    1. Determine a growth parameter 0. Build anadaptive synopsis trie T,S for S in main memorywith the growth parameter and a bucket size B.

    (BuildDietTrie in Fig. 4)

    2. Split the input list S in external memory intopartition of buckets S1, . . . , S d of size at most B(d 0) by using T,S, where Si Si+1 for everyi = 1, . . . , n 1. (BuildDietTrie in Fig. 5)

    3. For every i = 1, . . . , K , sort the bucket Bi by ar-bitrary internal sorting algorithm and write backto Bi in external memory. (BuildDietTrie inFig. 6)

    4. Return the concatenation T = B1 BK ofsorted buckets.

    Figure 2: An outline of the external string sortingalgorithm with adaptive splitting DistStrSort

    by aggr(s), is defined as aggr(s) = ( v | (s, v) D ) = v1 vm, where (s1, v1), . . . , (sm, vm) Dis the list of all pairs in D with si = s.

    Definition 3 The string aggregation problem w.r.t.operation (, ) (STR-AGGREGATE(, )) is, givena string database S = ( (si, vi) | i = 1, . . . , K ) (), to compute the pair (s,aggr(s)) for every

    unique string s uniq(D).

    The above string counting and string aggregationproblems as well as string sorting problem can be ef-ficiently solved by a modification of trie-based sorter.In this paper, we extend such a trie-based sorter foran external memory environment.

    2.3 Model of computation

    In what follows, we denote by K = |S| and N = ||S||the cardinality and the total size of the input stringlist S. We assume a naive model of computation such

    that a CPU has random access to a main memoryof size M 0 and sequential and random access toan external memory of unbounded size, where mem-ory space is measured in the number of integers tobe stored. Though the external memory can be par-titioned into blocks of size 0 B M, we do notstudy a detailed analysis of I/O complexity. In thispaper, we are particularly interested in the case thatthe input size does not fit into the main memory butnot so large, namely, N = O(M2).

    3 Our Algorithm

    3.1 Outline of the main algorithm

    In Fig. 2, we show the outline of our external stringsorting algorithm DistStrSort using an adaptive

    Figure 3: A trie

    construction of synopsis trie for splitting, which is avariant of distribution sort for integer lists. In Fig. 1shows the architecture ofDistStrSort.

    The key of the algorithm is Step 2 that splits theinput list S into buckets S1, . . . , S d of at most B bycomputing an ordered partition defined as follows.For string lists S1, S2 (

    ), S1 precedes S2, de-noted by S1 S2, if s1

  • 7/28/2019 tcstr_06_21

    4/7

    Algorithm: BuildDietTrie

    Input: A string list S = (s1, . . . , sK), a maximumbucket size M B 0, and a growth parameter > 0.

    Output: A synopsis trie T,S for S.

    1. T := {root};

    2. For each i := 1, . . . , K , do the following:

    (a) Read the next string s = si from the exter-nal disk;

    (b) For each j := 1, . . . , |s|, do:

    y := goto(x, s[j]); (Trace the edge cj.)

    If y = holds, then:

    Ifcount(x) holds, then:Create a new state y; T := T {y};count(y) := 1; goto(x, cj) := y;

    Else:break the inner for-loop and goto

    the step 2(a); Else, x := y andcount(y) := count(y) + 1;

    3. Return T;

    Figure 4: An algorithm for constructing a synopsistrie

    If a trie T satisfies that Str(T) = S then T is saidto be a trie for S. The trie T for a string list S can beconstructed in O(N) time and O(N) space in the total

    input size N = ||S|| for constant (Fredkin 1960).Then, a synopsis trie for S is a trie T for S which

    satisfies that for every string s uniq(S), some leaf T represents a prefix of s, i.e., str() s. Notethat a synopsis trie for S is not unique and the emptytrie T = {root} is a trivial one for any set S.

    Given a string list S, a synopsis trie T can storestrings in S. For every leaf L(T), we define thesublist of strings in S belongs to by list(, S) =( s S | str() s ) and the count of the vertex bycount(, S) = |list(, S)|. The total space requiredto implement a trie for S is O(||S||) even if we includethe information on count and list. On the incorecomputation by a trie, we have the following lemma.

    Lemma 1 We can solve the string sorting, the stringcounting, and the string aggregation problems for aninput string list S in O(N) time using main memoryof size O(N), where N = ||S||.

    We will extensively use this trie data structure inthe following subsections in order to implement thecomputation at Step 1, Step 2, and Step 3 of themain sorting algorithm in Fig. 2.

    3.3 STEP1: Adaptive Construction of a Syn-opsis Trie

    In Fig. 4, we show the algorithm BuildDietTrie forconstructing an adaptive synopsis trie.

    Although the empty trie T = {root} obviouslysatisfies the above condition, it is useless for com-puting a good ordered partition. Instead, the goalis adaptive construction of a synopsis trie T for Ssatisfying the following conditions:

    (i) T fits into the main memory of size M, i.e.,size(T) M.

    (ii) For an input string list S, count(, S) B for aparameter B M

    The adaptive construction is done as follows.For an alphabet = {c0, . . . , cs1}, each statev of the automaton is implemented as the list(goto(v, 0), . . . ,goto(v, s 1)) of s pointers. For analphabet of constant size, this is implemented as anarray of pointers. At the creation of a new statev, these pointers are set to NULL and a countercount(v) is set to 1.

    When we construct a synopsis trie T, the algo-rithm uses a positive integer > 0, called the thresh-

    old for growing. This threshold represents a minimumfrequency for extending a new state to the currentedge.

    The algorithm firstly initializes a trie T as the trieT = {root} consisting of the root node only. Whenthe algorithm inserts a string s S into T, it tracesthe corresponding edge to w starting from root bygoto pointers. Each state v in the synopsis trie Thas an integer count(v) incremented when a string wreaches to v. These counters represent approximateoccurrences of suffixes of input strings.

    When there are no states the current state v trans-fers to, the synopsis trie tries to generate a new state.However, the algorithm does not permit the trie toextend a new edge unless the counter count(v) ofv ismore than the given threshold . Ifcount(v) exceeds, the trie generates a new state, and attach it to thecurrent state with a new edge.

    Lemma 2 Let S be an input string list of total sizeN. The algorithmBuildDietTrie in Fig. 4 com-putes a synopsis trie for S in O(N) time and O(|T |)space.

    At present, we have no theoretical upper boundsof |T | and the maximum value ofcount() according

    to the value of growth parameter 0. Tentatively,setting = cB for some constant 0 c 1 workswell in practice.

    3.4 STEP2: Splitting a String List into Buck-

    ets

    Let BI D = {1, . . . , b} be a set ofbucket-ids for someb 0. Then, a bucket-id assignment for T is a map-ping : L(T) BI D that assigns bucket ids tothe leaves. Let (T,BID) be a synopsis trie T whoseleaves are labeled with the bucket-id assignment .

    A bucket-id assignment for T is said to be order-preserving if it satisfies the following condition: forevery leaves 1, 2, ifstr(1) lex str(2) then (1) (2).

  • 7/28/2019 tcstr_06_21

    5/7

    Algorithm SplitDietTrie:

    Input: A string list S = (s1, . . . , sK) (in externalmemory), a synopsis trie T,S for S (in internalmemory), and a positive integer M B 0.

    Output: An ordered partition B1 Bb for S.

    1. //Compute a bucket-id assignment

    Let N

    =

    L(T,S) count(); B

    =BN/N; k = 1; A = 0;

    For each leaf L(T,S) in the lexico-graphic order of str(), do: If (A+count() B), then bucket() :=k and A := A + count(); Else, k := k + 1 and A := 0;

    b = k;

    2. //Distribute all strings in S into the correspond-ing buckets.

    For each k = 1, . . . , b, do: Bi := ;

    For each string s S, do: Find the leaf = vertex(s, T,S) reach-able from the root by s; Put s to the k-th bucket Bk in externalmemory for k = ().

    Figure 5: An algorithm SplitDietTrie for splittingan input string list

    By the definition of a synopsis trie above, we knowthat for every string s uniq(S), there is the unique

    leaf L(T) such that str() s. We denotethis leaf by vertex(s, T) = . Then, the bucket-id assigned to s is defined by bucket(T, , s) =(vertex(s, T)). Given an input string list S (), (T, ) defines the partition

    PART(S, T, ) = S1 . . . Sb

    where Sk = ( s S | bid(T, , s) = k ) holds forevery bid k BI D. Then, the maximum bucket sizeof (T,BID) on input S is defined by max{ Sk | k =1, . . . , b }. Clearly, S1 . . . Sb = S holds.

    Lemma 3 If is order-preserving then

    PART(S, T, ) is an ordered partition.

    In Fig. 5, we show an algorithm SplitDiet-Trie for splitting an input string list S into bucketsPART(S, T, ) = B1 . . . Bb. We can easily seethat the bucket-id assignment computed by algorithmSplitDietTrie is always order-preserving. Further-more, we have the following lemma.

    Lemma 4 LetS be an input list and T be a synopsistrie with bucket-id assignment for S. Suppose that|T | M. Then, the algorithmSplitDietTrie of inFig. 5, given S and T, computes an ordered partition

    ofS inO(N) time andO(|T |) space in main memory,where N = ||S||.

    We have the following lemma on the accuracy ofthe approximation.

    Algorithm SortBucket:

    Input: A bucket B ().

    Output: A bucket Ci () obtained from B bysorted in lex.

    1. Build a trie TB for the bucket B in main memory;

    2. Traverse all vertices v of TB in the lexicographicorder of path strings str(v) and output str(v).

    Figure 6: An algorithm SortBucket for sortingeach bucket

    Lemma 5 Let B 0 (0 B M) and k 1 beany integers. Let T be the synopsis trie for S. If thetrie T on S satisfies 0 count(v) 1

    kB for any

    vertex L(T), then the resulting ordered partitionS = B1 Bb (b 0) satisfies that

    k1k

    B Bi B holds for any bucket Bi (1 i m).

    3.5 STEP3: Internal Sorting of Buckets

    Once an input string list S of total size N has beensplit into an ordered partition S1 . . . Sb = Sof maximum bucket size B M, we can sort eachbucket Bi in main memory by using any internalmemory sorting algorithm. For this purpose, we usea internal trie sort which is described in Fig. 6.

    Lemma 6 The algorithm SortBucket of Fig. 6sorts a bucket B of strings in O(N) time using mainmemory of size O(N), where N = ||B|| and is aconstant alphabet.

    3.6 STEP4: Final concatenation

    Step 4 is just a concatenation of already sorted buck-ets B1, . . . , Bb. Therefore, this step requires no extracost.

    3.7 Analysis of the algorithm

    In this subsection, we give a case analysis of a practi-cal performance of the proposed external sorting al-gorithm assuming a practical situations for such analgorithm. For the purpose, we suppose the followingcondition:

    Condition 1 the synopsis trie computed by Build-DietTrie and SplitDietTrie on the input stringlist S and the choice of a growth parameter satis-fies the following conditions:

    The synopsis trie fits into main memory, that is,size(T) M.

    The maximum bucket size of the ordered partitioncomputed bySplitDietTrie at Step 2 does notexceed M.

    For the above condition to hold, we know that atleast the input S satisfies that N = O(M2). Notethat at this time, we only have a heuristic algorithm

  • 7/28/2019 tcstr_06_21

    6/7

    Table 1: Parameters of the datasets

    Dataset data A data B

    File size 67MB 1.1GB

    Number of records (with duplications) 140104 2290104

    Number of records (no duplications) 22104 109104

    Average of record length 45.0 44.9

    Variance of record length 45.1 19.7

    Maximum record length 58 58

    Minimum record length 15 15

    for computing a good synopsis trie without any the-oretical guarantee for its performance. Finally, wehave the following result.

    Theorem 7 Suppose that the algorithms BuildDi-etTrie andSplitDietTrie satisfy Condition 1 onthe input S and 0. The algorithmDistStrSortcorrectly solves the external string sorting problem inO(N) time within main memory M using3N sequen-

    tial read (Step 1Step 3), N random write (Step 2),and N sequential write (Step 3) in the external mem-ory.

    From Theorem 7 and Lemma 1, we can show thefollowing corollary using modified version of Dist-StrSort algorithm.

    Corollary 8 Let(, ) be any associative binary op-erator, where operation can be computed in O(1)time. Suppose that the algorithmsBuildDietTrieandSplitDietTrie satisfy Condition 1 on the inputS and 0. There exists an algorithm that solves

    the string counting (STR-COUNT), and the stringaggregation problems (STR-AGGREGATE(, )) foran input string list S in O(N) time using main mem-ory of size M in external memory.

    4 Experimental Results

    In this section, we present experimental results onreal datasets to evaluate the performance of our al-gorithms. We implemented our algorithms in C. Theexperiments were run on a PC (Xeon 3.6GHz, Win-dows XP) with 3.25 gigabytes of main memory.

    We prepared two datasets data A and data Bcon-sisting of cookie values from Web access logs. Theparameters of the datasets are shown in Table 1.

    4.1 Size of Synopsis Trie

    First, we studied sizes of synopsis tries constructed bythe algorithm BuildDietTrie by varying the growthparameter = 100, 1000, or 10000. Table 2 shows thenumber of states of the constructed synopsis trie foreach on the data A. We can see that the bigger is,the smaller the size of the constructed synopsis triebecomes.

    Table 2: Sizes of the constructed synopsis triesGrowth parameter 100 1,000 10,000Number of states 426,209 92,768 5,030

    0

    5

    10

    15

    2025

    30

    35

    [10,

    100]

    [50,

    100]

    [100

    ,100

    ]

    [200

    ,100

    ]

    [10.

    1000

    ]

    [50,

    1000

    ]

    [100

    ,100

    0]

    [200

    ,100

    0]

    [10,

    1000

    0]

    [50,

    1000

    0]

    [100

    ,100

    00]

    [200

    ,100

    00]

    [Number of buckets, Threshold for growing]

    Runtime(sec)

    S p l i t D i e t T r i e

    B u i l d D i e t T r i e

    Figure 7: Running time of BuildDietTrie andSplitDietTrie

    4.2 Running Time

    4.2.1 BuildDietTrie and SplitDietTrie

    Next, we measured the running times of the subpro-cedures BuildDietTrie and SplitDietTrie on thedata A, by varying the number of buckets and thegrowth parameter. Fig. 7 show the results. The hori-zontal axis in the figure represents pairs of the numberof buckets and the growth parameter. The verticalaxis represents the running time.

    The results indicate that the running time isshorter for larger growth parameters. Thus, a smallersynopsis trie can be constructed in a reasonable com-putational time.

    4.2.2 DistStrSort

    Finally, we studied the running time of the algorithmDistStrSort. Fig. 8 shows the running times on thedata BofDistStrSort and the GNU sort 5.97. Therunning time of our algorithm scales linearly with thesize of data and is faster than the GNU sort for larger

    data sizes.

    5 Conclusion

    In this paper, we presented a new algorithm for sort-ing large collections of strings in external memory.The main idea of our method is first splitting a set ofstrings to some buckets by adaptive construction ofa synopsis trie for input strings, then sort the stringsin each bucket, and finally merge the sorted buck-ets into an output collection. Experiments on realdatasets show the effectiveness of our algorithm.

    In this paper, we only made empirical studies on

    the performance of adaptive strategy for constructionof a synopsis trie for data splitting. The probabilisticanalysis of the upper bounds of|T | and the maximum

  • 7/28/2019 tcstr_06_21

    7/7

    15

    1GB2270

    GNU sort

    0

    100

    200

    300

    400

    500

    600

    700

    800

    900

    0 5000000 10000000 15000000 20000000 25000000

    Number of strings

    Runtimesec

    DistStrSort

    GNU sort

    Figure 8: Running times ofDistStrSort and GNUsort

    value of count() concerning to the value of growthparameter 0 is an interesting future research.

    We considered only the case that the input sizeis at most the square of the memory size, and thenpresented a two-level external sorting algorithm. It isa future research to develop a hierarchical version ofour algorithm for inputs of unbounded size. A stream-based distributed version of distribution string sortalgorithms is another possible direction.

    References

    S. Abiteboul, R. Agrawal, P. A. Bernstein,M. J. Carey, S. Ceri, W. B. Croft, D. J. De-

    Witt, et al., The Lowell database research self-assessment, C. ACM, 48(5), 111118, 2005.

    S. Abiteboul, P. Buneman, D. Suciu, Data on theWeb, Morgan Kaufmann, 2000.

    L. Arge, P. Ferragina, R. Grossi, and J. S. Vitter, OnSorting Strings in External Memory, Proc. the29th Annual ACM Symposium on Theory ofComputing (STOC97), 540548, 1997.

    J. Bentley, R. Sedgewick, Fast Algorithms for Sort-ing and Searching Strings, Proc. the 8th AnnualACM-SIAM Symposium on Discrete Algorithms

    (SODA97), 360369, 1997.P. Ferragina, R. Grossi, The String B-tree: A New

    Data Structure for String Search in ExternalMemory and Its Applications, J. ACM, 46(2),236280, 1999.

    E. Fredkin, Trie Memory, C. ACM, 3(9), 490499,1960.

    P. Laird and R. Saul, Discrete sequence predictionand its applications, Machine Learning, 15(1):4368, 1994.

    U. Manber, E. W. Myers, Suffix Arrays: A NewMethod for On-Line String Searches, SIAM J.Comput., 22(5), 935948, 1993.

    G. Manzini and P. Ferragina, Engineering aLightweight Suffix Array Construction Algo-

    rithm, Proc. the 10th European Symposium onAlgorithms (ESA02), 698-710, 2002.

    A. Moffat, Implementing the PPM data compressionscheme, IEEE Trans Communications COM-38(11): 19171921, 1990.

    R. Ramakrishnan, J. Gehrke, Database ManagementSystems, McGraw-Hill Professional, 2000.

    D. Ron, Y. Singer, and N. Tishby, The power of am-nesia: learning probabilistic automata with vari-able memory length, Machine Learning, 25(2-3):117 149, 1996.

    K. Sadakane A Fast Algorithms for Making Suf-fix Arrays and for Burrows-Wheeler Transforma-tion, Proc. the 8th Data Compression Confer-ence (DCC98), 129138, 1999.

    R. Sinha, J. Zobel, Efficient Trie-Based Sort-ing of Large Sets of Strings, Proc. the26th Australasian Computer Science Conference(ACSC03), 2003.

    M. Stonebraker, U. Cetintemel, One Size Fits All:An Idea Whose Time Has Come and Gone,Proc. the IEEE 21st International Conferenceon Data Engineering (ICDE05), keynote, 211,2005.