tcstr_06_21

7/28/2019 tcstr_06_21

1/7

An Adaptive Algorithm for Splitting Large Sets of Strings

and Its Application to Efficient External Sorting

Tatsuya Asai Seishi Okamoto Hiroki Arimura

Knowledge Research Center, Fujitsu Laboratories Ltd.411 Kamikodanaka, Nakahara-ku, Kawasaki 2118588, Japan

Email: [email protected]: [email protected]

Graduate School of Information Science and Technology, Hokkaido UniversityKita 14jo, Nishi 9chome, Sapporo 0600814, Japan

Email: [email protected]

Abstract

In this paper, we study the problem of sorting a large

collection of strings in external memory. Based onadaptive construction of a summary data structure,called adaptive synopsis trie, we present a practi-cal string sorting algorithm DistStrSort, which issuitable to sorting string collections of large size inexternal memory, and also suitable for more com-plex string processing problems in text and semi-structured databases such as counting, aggregation,and statistics. Case analyses of the algorithm and ex-periments on real datasets show the efficiency of ouralgorithm in realistic setting.

Keywords: external string sorting, trie sort, semi-

structured database, text database, data splitting

1 Introduction

Sorting strings is a fundamental task of string pro-cessing, which is not only sorting the objects in as-sociated ordering, but also used as a basis of moresophisticated processing such as grouping, count-ing, and aggregation. With a recent emergence ofmassive amount of semi-structured data (Abiteboulet al. 2000), such as plain texts, Web pages, orXML documents, it is often the case that we wouldlike to sort a large collection of strings that do notfit into main memory. Moreover, there are poten-tial demands for flexible methods tailored for semi-structured data beyond basic sorting facilities suchas statistics computation, counting, and aggregationover various objects including texts, data, codes,and streams (Abiteboul et al. 2005, Stonebraker etal. 2005).

In this paper, we study efficient methods for sort-ing strings in external memory. We present a newapproach to sorting a large collection of strings in ex-ternal memory. The main idea of our method is togeneralize distribution sort algorithms for collections

of strings by the extensive use of trie data structure.To implement this idea, we introduce a synopsis datastructure for strings, called a synopsis trie, for ap-proximate distribution, and develop an adaptive con-

STEP1: BuildDietTrie

STEP2: SplitDietTrie STEP3: SortBucket STEP4: Merge

InputStringListS

OutputStringLis

tT

Growth

parameter

Sorted

Sorted

Bucket

Sorted

Bucket

Sorted

Bucket

Bucket

Bucket

Bucket

Internal memoryInternal memory

Synopsis Trie TS

Figure 1: The architecture of an external sorting al-gorithm DistStrSort using adaptive constructionof a synopsis trie

struction algorithm for synopsis tries for a large col-lection of strings with the idea of adaptive growth.

Based on this technique, we present an externalstring sorting algorithm DistStrSort. In Fig. 1, weshow an architecture of our external sorting. Thisalgorithm first adaptively construct a small synopsistrie as a model of string distribution by a single scanof the input data, splits input strings into a collec-tion of buckets using the synopsis trie as a splitter,then sort the strings in each bucket in main memory,and finally merge the sorted buckets into an outputcollection.

By an analysis, we show that if both of the size ofthe synopsis trie and the maximum size of the bucketsare small enough to fit into main memory of size M,then our algorithm DistStrSort correctly solves theexternal string sorting problem in O(N) time withinmain memory M using 3N sequential read, N ran-dom write, and N sequential write in the externalmemory, where N is the total size of an input stringlist S. Thus, this algorithm of O(N) time complex-ity is attractive compared with a popular merge sort-based external string sort algorithm of O(Nlog N)time complexity. Experiments on real datasets show

the effectiveness of our algorithm.As an advantage, this algorithm is suitable

to implement complex data manipulation opera-tions (Ramakrishnan and Gehrke 2000) such as

7/28/2019 tcstr_06_21

2/7

group-by aggregation, statistics calculation, and func-tional join for texts.

1.1 Related works

In what follows, we denote by K = |S| and N = ||S||be the cardinality and the total size of an input stringlist S ().

A straightforward extension of quick sort al-

gorithm solves the string sorting problem inO(LKlog K) time in main memory, where L is themaximum length of the input strings and N =O(LK). This algorithm is efficient if L = O(1), butinefficient if L is large.

Arge et al. presented an algorithm that solvesthe string sorting problem in O(Klog K + N) timein main memory (Arge et al. 1997). The String B-tree is an efficient dynamic indexing data structurefor strings (Ferragina and Grossi 1999). This datastructure also has almost optimal performance for ex-ternal string sorting in theory, while the performance

degenerates in practice because of many random ac-cess to external memory.Sinha and Zobel presented a trie-based string

sorting algorithm, called the burst sort, and showthat the algorithm outperforms many of previousstring sorting algorithms on real datasets (Sinha andZobel 2003). However, their algorithm is an inter-nal sorting algorithm, and furthermore, the main aimof this paper is proposal of an adaptive strategy forrealizing an efficient external string sorting.

Although there are a number of studies on sort-ing the suffixes of a single input string (Bentley andSedgewick 1997, Manber and Myers 1993, Manzini

and Ferragina 2002, Sadakane 1999), they are ratherirrelevant to this work since they are mainly incorealgorithms and utilize the repetition structure of suf-fixes of a string.

The adaptive construction of a trie for a stream ofletters has been studied since 90s in time-series mod-eling (Laird, Saul 1994), data compression (Moffat1990), and bioinformatics (Ron, Singer, Tishby 1996).However, the application of adaptive trie constructionto external sorting seems new and an interesting di-rection in text and semi-structured data processing.Recently, (Stonebraker et al. 2005) showed the use-

fulness of distributed stream processing as a massivedata application of new type.

1.2 Organization

This paper is organized as follows. In Section 2, weprepare basic notions and definitions on string sortingand related string processing problems. In Section 3,we present our external string sorting algorithm basedon adaptive construction of a synopsis trie. In Section4, we report experiments on real datasets, and finally,we conclude in Section 5.

2 Preliminaries

In this section, we give basic notions and definitionson external string sorting.

2.1 Strings

Let = {A,B,A1, A2, . . .} be an alphabet of letters,with which a total order on is associated. Inparticular, we assume that our alphabet is a set ofintegers in a small range, i.e., = {0, . . . , C 1}for some integer C 0, which is a constant or atmost O(N) for the total input size N. This is thecase for ASCII alphabet {0, . . . , 255} or DNA alpha-

bet {A , T , C , G}.We denote by the set of all finite strings on

and by the empty string. Let s = a1 an

be a string on . For 1 i j n, we denote by|s| = n the length of s, by s[i] = ai the i-th letter,and by s[i..j] = ai aj the substring from i-th toj-th letters. If s = pq for some p, q then we saythat p is a prefix of s and denote p s.

We define the lexicographic order lex on stringsof by extending the total order on letters inthe standard way.

2.2 String sorting problemFor strings s1, . . . , sm

, we distinguish an or-dered list (s1, . . . , sm), an unordered list (or multi-set) {{s1, . . . , sm}}, and a set {s1, . . . , sm} (of uniquestrings) each other. The notations and = are de-fined accordingly. We use the intentional notation( s | P(s) ) or { s | P(s) }. For lists of strings S1, S2,we denote by S1S2 (S1 S2, resp.) the concatenationof S1, S2 as ordered lists (as unordered lists, resp.).

An input to our algorithm is a string list of sizeK 0 that is just an ordered list S = (s1, . . . , sK) () of possibly identical strings, where si

is a

strings on for every 1 i K. We denote by |S| =K the cardinality of S and ||S|| = N =

Ki=1 |si| the

total size ofS. We denote the set of all unique stringsin S by uniq(S) = { si | i = 1, . . . , K }. A string listS = (s1, . . . , sn) is sorted if si lex si+1 holds forevery i = 1, . . . , n 1.

Definition 1 The string sorting problem (STR-SORT) is, given a string list S = (s1, . . . , sK) (), to compute the sorted list (S) =(s(1), . . . , s(K)) (

) for some permutation :{1, . . . , K } {1, . . . , K } of its indices.

We can extend our framework for more compli-cated string processing problems as follows.

Definition 2 The string counting problem (STR-COUNT) is, given a string list S = (s1, . . . , sK) (), to compute the histogram h : uniq(S) Nsuch that h(s) is the number of occurrences of theunique string s.

Let (, ) be a pair of a collection ofvalues anda corresponding associative operation : 2 .For example, (N, +) and (R, max) are examples of(, ). A string database of size K 0 over dom(D)

is a list D = ( (si, vi) | i = 1, . . . , K ) ( )of pairs of a string and a value. We denote byuniq(D) = { si | (si, vi) D } the set of unique stringsin S. For (, ), the aggregate on s w.r.t. , denoted

7/28/2019 tcstr_06_21

3/7

Algorithm DistStrSort:

Input: A string list S = (s1, . . . , sK) () and

a maximal bucket size 0 B M.

Output: The sorted string list T ().

1. Determine a growth parameter 0. Build anadaptive synopsis trie T,S for S in main memorywith the growth parameter and a bucket size B.

(BuildDietTrie in Fig. 4)

2. Split the input list S in external memory intopartition of buckets S1, . . . , S d of size at most B(d 0) by using T,S, where Si Si+1 for everyi = 1, . . . , n 1. (BuildDietTrie in Fig. 5)

3. For every i = 1, . . . , K , sort the bucket Bi by ar-bitrary internal sorting algorithm and write backto Bi in external memory. (BuildDietTrie inFig. 6)

4. Return the concatenation T = B1 BK ofsorted buckets.

Figure 2: An outline of the external string sortingalgorithm with adaptive splitting DistStrSort

by aggr(s), is defined as aggr(s) = ( v | (s, v) D ) = v1 vm, where (s1, v1), . . . , (sm, vm) Dis the list of all pairs in D with si = s.

Definition 3 The string aggregation problem w.r.t.operation (, ) (STR-AGGREGATE(, )) is, givena string database S = ( (si, vi) | i = 1, . . . , K ) (), to compute the pair (s,aggr(s)) for every

unique string s uniq(D).

The above string counting and string aggregationproblems as well as string sorting problem can be ef-ficiently solved by a modification of trie-based sorter.In this paper, we extend such a trie-based sorter foran external memory environment.

2.3 Model of computation

In what follows, we denote by K = |S| and N = ||S||the cardinality and the total size of the input stringlist S. We assume a naive model of computation such

that a CPU has random access to a main memoryof size M 0 and sequential and random access toan external memory of unbounded size, where mem-ory space is measured in the number of integers tobe stored. Though the external memory can be par-titioned into blocks of size 0 B M, we do notstudy a detailed analysis of I/O complexity. In thispaper, we are particularly interested in the case thatthe input size does not fit into the main memory butnot so large, namely, N = O(M2).

3 Our Algorithm

3.1 Outline of the main algorithm

In Fig. 2, we show the outline of our external stringsorting algorithm DistStrSort using an adaptive

Figure 3: A trie

construction of synopsis trie for splitting, which is avariant of distribution sort for integer lists. In Fig. 1shows the architecture ofDistStrSort.

The key of the algorithm is Step 2 that splits theinput list S into buckets S1, . . . , S d of at most B bycomputing an ordered partition defined as follows.For string lists S1, S2 (

), S1 precedes S2, de-noted by S1 S2, if s1

7/28/2019 tcstr_06_21

4/7

Algorithm: BuildDietTrie

Input: A string list S = (s1, . . . , sK), a maximumbucket size M B 0, and a growth parameter > 0.

Output: A synopsis trie T,S for S.

1. T := {root};

2. For each i := 1, . . . , K , do the following:

(a) Read the next string s = si from the exter-nal disk;

(b) For each j := 1, . . . , |s|, do:

y := goto(x, s[j]); (Trace the edge cj.)

If y = holds, then:

Ifcount(x) holds, then:Create a new state y; T := T {y};count(y) := 1; goto(x, cj) := y;

Else:break the inner for-loop and goto

the step 2(a); Else, x := y andcount(y) := count(y) + 1;

3. Return T;

Figure 4: An algorithm for constructing a synopsistrie

If a trie T satisfies that Str(T) = S then T is saidto be a trie for S. The trie T for a string list S can beconstructed in O(N) time and O(N) space in the total

input size N = ||S|| for constant (Fredkin 1960).Then, a synopsis trie for S is a trie T for S which

satisfies that for every string s uniq(S), some leaf T represents a prefix of s, i.e., str() s. Notethat a synopsis trie for S is not unique and the emptytrie T = {root} is a trivial one for any set S.

Given a string list S, a synopsis trie T can storestrings in S. For every leaf L(T), we define thesublist of strings in S belongs to by list(, S) =( s S | str() s ) and the count of the vertex bycount(, S) = |list(, S)|. The total space requiredto implement a trie for S is O(||S||) even if we includethe information on count and list. On the incorecomputation by a trie, we have the following lemma.

Lemma 1 We can solve the string sorting, the stringcounting, and the string aggregation problems for aninput string list S in O(N) time using main memoryof size O(N), where N = ||S||.

We will extensively use this trie data structure inthe following subsections in order to implement thecomputation at Step 1, Step 2, and Step 3 of themain sorting algorithm in Fig. 2.

3.3 STEP1: Adaptive Construction of a Syn-opsis Trie

In Fig. 4, we show the algorithm BuildDietTrie forconstructing an adaptive synopsis trie.

Although the empty trie T = {root} obviouslysatisfies the above condition, it is useless for com-puting a good ordered partition. Instead, the goalis adaptive construction of a synopsis trie T for Ssatisfying the following conditions:

(i) T fits into the main memory of size M, i.e.,size(T) M.

(ii) For an input string list S, count(, S) B for aparameter B M

The adaptive construction is done as follows.For an alphabet = {c0, . . . , cs1}, each statev of the automaton is implemented as the list(goto(v, 0), . . . ,goto(v, s 1)) of s pointers. For analphabet of constant size, this is implemented as anarray of pointers. At the creation of a new statev, these pointers are set to NULL and a countercount(v) is set to 1.

When we construct a synopsis trie T, the algo-rithm uses a positive integer > 0, called the thresh-

old for growing. This threshold represents a minimumfrequency for extending a new state to the currentedge.

The algorithm firstly initializes a trie T as the trieT = {root} consisting of the root node only. Whenthe algorithm inserts a string s S into T, it tracesthe corresponding edge to w starting from root bygoto pointers. Each state v in the synopsis trie Thas an integer count(v) incremented when a string wreaches to v. These counters represent approximateoccurrences of suffixes of input strings.

When there are no states the current state v trans-fers to, the synopsis trie tries to generate a new state.However, the algorithm does not permit the trie toextend a new edge unless the counter count(v) ofv ismore than the given threshold . Ifcount(v) exceeds, the trie generates a new state, and attach it to thecurrent state with a new edge.

Lemma 2 Let S be an input string list of total sizeN. The algorithmBuildDietTrie in Fig. 4 com-putes a synopsis trie for S in O(N) time and O(|T |)space.

At present, we have no theoretical upper boundsof |T | and the maximum value ofcount() according

to the value of growth parameter 0. Tentatively,setting = cB for some constant 0 c 1 workswell in practice.

3.4 STEP2: Splitting a String List into Buck-

ets

Let BI D = {1, . . . , b} be a set ofbucket-ids for someb 0. Then, a bucket-id assignment for T is a map-ping : L(T) BI D that assigns bucket ids tothe leaves. Let (T,BID) be a synopsis trie T whoseleaves are labeled with the bucket-id assignment .

A bucket-id assignment for T is said to be order-preserving if it satisfies the following condition: forevery leaves 1, 2, ifstr(1) lex str(2) then (1) (2).

7/28/2019 tcstr_06_21

5/7

Algorithm SplitDietTrie:

Input: A string list S = (s1, . . . , sK) (in externalmemory), a synopsis trie T,S for S (in internalmemory), and a positive integer M B 0.

Output: An ordered partition B1 Bb for S.

1. //Compute a bucket-id assignment

Let N

=

L(T,S) count(); B

=BN/N; k = 1; A = 0;

For each leaf L(T,S) in the lexico-graphic order of str(), do: If (A+count() B), then bucket() :=k and A := A + count(); Else, k := k + 1 and A := 0;

b = k;

2. //Distribute all strings in S into the correspond-ing buckets.

For each k = 1, . . . , b, do: Bi := ;

For each string s S, do: Find the leaf = vertex(s, T,S) reach-able from the root by s; Put s to the k-th bucket Bk in externalmemory for k = ().

Figure 5: An algorithm SplitDietTrie for splittingan input string list

By the definition of a synopsis trie above, we knowthat for every string s uniq(S), there is the unique

leaf L(T) such that str() s. We denotethis leaf by vertex(s, T) = . Then, the bucket-id assigned to s is defined by bucket(T, , s) =(vertex(s, T)). Given an input string list S (), (T, ) defines the partition

PART(S, T, ) = S1 . . . Sb

where Sk = ( s S | bid(T, , s) = k ) holds forevery bid k BI D. Then, the maximum bucket sizeof (T,BID) on input S is defined by max{ Sk | k =1, . . . , b }. Clearly, S1 . . . Sb = S holds.

Lemma 3 If is order-preserving then

PART(S, T, ) is an ordered partition.

In Fig. 5, we show an algorithm SplitDiet-Trie for splitting an input string list S into bucketsPART(S, T, ) = B1 . . . Bb. We can easily seethat the bucket-id assignment computed by algorithmSplitDietTrie is always order-preserving. Further-more, we have the following lemma.

Lemma 4 LetS be an input list and T be a synopsistrie with bucket-id assignment for S. Suppose that|T | M. Then, the algorithmSplitDietTrie of inFig. 5, given S and T, computes an ordered partition

ofS inO(N) time andO(|T |) space in main memory,where N = ||S||.

We have the following lemma on the accuracy ofthe approximation.

Algorithm SortBucket:

Input: A bucket B ().

Output: A bucket Ci () obtained from B bysorted in lex.

1. Build a trie TB for the bucket B in main memory;

2. Traverse all vertices v of TB in the lexicographicorder of path strings str(v) and output str(v).

Figure 6: An algorithm SortBucket for sortingeach bucket

Lemma 5 Let B 0 (0 B M) and k 1 beany integers. Let T be the synopsis trie for S. If thetrie T on S satisfies 0 count(v) 1

kB for any

vertex L(T), then the resulting ordered partitionS = B1 Bb (b 0) satisfies that

k1k

B Bi B holds for any bucket Bi (1 i m).

3.5 STEP3: Internal Sorting of Buckets

Once an input string list S of total size N has beensplit into an ordered partition S1 . . . Sb = Sof maximum bucket size B M, we can sort eachbucket Bi in main memory by using any internalmemory sorting algorithm. For this purpose, we usea internal trie sort which is described in Fig. 6.

Lemma 6 The algorithm SortBucket of Fig. 6sorts a bucket B of strings in O(N) time using mainmemory of size O(N), where N = ||B|| and is aconstant alphabet.

3.6 STEP4: Final concatenation

Step 4 is just a concatenation of already sorted buck-ets B1, . . . , Bb. Therefore, this step requires no extracost.

3.7 Analysis of the algorithm

In this subsection, we give a case analysis of a practi-cal performance of the proposed external sorting al-gorithm assuming a practical situations for such analgorithm. For the purpose, we suppose the followingcondition:

Condition 1 the synopsis trie computed by Build-DietTrie and SplitDietTrie on the input stringlist S and the choice of a growth parameter satis-fies the following conditions:

The synopsis trie fits into main memory, that is,size(T) M.

The maximum bucket size of the ordered partitioncomputed bySplitDietTrie at Step 2 does notexceed M.

For the above condition to hold, we know that atleast the input S satisfies that N = O(M2). Notethat at this time, we only have a heuristic algorithm

7/28/2019 tcstr_06_21

6/7

Table 1: Parameters of the datasets

Dataset data A data B

File size 67MB 1.1GB

Number of records (with duplications) 140104 2290104

Number of records (no duplications) 22104 109104

Average of record length 45.0 44.9

Variance of record length 45.1 19.7

Maximum record length 58 58

Minimum record length 15 15

for computing a good synopsis trie without any the-oretical guarantee for its performance. Finally, wehave the following result.

Theorem 7 Suppose that the algorithms BuildDi-etTrie andSplitDietTrie satisfy Condition 1 onthe input S and 0. The algorithmDistStrSortcorrectly solves the external string sorting problem inO(N) time within main memory M using3N sequen-

tial read (Step 1Step 3), N random write (Step 2),and N sequential write (Step 3) in the external mem-ory.

From Theorem 7 and Lemma 1, we can show thefollowing corollary using modified version of Dist-StrSort algorithm.

Corollary 8 Let(, ) be any associative binary op-erator, where operation can be computed in O(1)time. Suppose that the algorithmsBuildDietTrieandSplitDietTrie satisfy Condition 1 on the inputS and 0. There exists an algorithm that solves

the string counting (STR-COUNT), and the stringaggregation problems (STR-AGGREGATE(, )) foran input string list S in O(N) time using main mem-ory of size M in external memory.

4 Experimental Results

In this section, we present experimental results onreal datasets to evaluate the performance of our al-gorithms. We implemented our algorithms in C. Theexperiments were run on a PC (Xeon 3.6GHz, Win-dows XP) with 3.25 gigabytes of main memory.

We prepared two datasets data A and data Bcon-sisting of cookie values from Web access logs. Theparameters of the datasets are shown in Table 1.

4.1 Size of Synopsis Trie

First, we studied sizes of synopsis tries constructed bythe algorithm BuildDietTrie by varying the growthparameter = 100, 1000, or 10000. Table 2 shows thenumber of states of the constructed synopsis trie foreach on the data A. We can see that the bigger is,the smaller the size of the constructed synopsis triebecomes.

Table 2: Sizes of the constructed synopsis triesGrowth parameter 100 1,000 10,000Number of states 426,209 92,768 5,030

0

5

10

15

2025

30

35

[10,

100]

[50,

100]

[100

,100

]

[200

,100

]

[10.

1000

]

[50,

1000

]

[100

,100

0]

[200

,100

0]

[10,

1000

0]

[50,

1000

0]

[100

,100

00]

[200

,100

00]

[Number of buckets, Threshold for growing]

Runtime(sec)

S p l i t D i e t T r i e

B u i l d D i e t T r i e

Figure 7: Running time of BuildDietTrie andSplitDietTrie

4.2 Running Time

4.2.1 BuildDietTrie and SplitDietTrie

Next, we measured the running times of the subpro-cedures BuildDietTrie and SplitDietTrie on thedata A, by varying the number of buckets and thegrowth parameter. Fig. 7 show the results. The hori-zontal axis in the figure represents pairs of the numberof buckets and the growth parameter. The verticalaxis represents the running time.

The results indicate that the running time isshorter for larger growth parameters. Thus, a smallersynopsis trie can be constructed in a reasonable com-putational time.

4.2.2 DistStrSort

Finally, we studied the running time of the algorithmDistStrSort. Fig. 8 shows the running times on thedata BofDistStrSort and the GNU sort 5.97. Therunning time of our algorithm scales linearly with thesize of data and is faster than the GNU sort for larger

data sizes.

5 Conclusion

In this paper, we presented a new algorithm for sort-ing large collections of strings in external memory.The main idea of our method is first splitting a set ofstrings to some buckets by adaptive construction ofa synopsis trie for input strings, then sort the stringsin each bucket, and finally merge the sorted buck-ets into an output collection. Experiments on realdatasets show the effectiveness of our algorithm.

In this paper, we only made empirical studies on

the performance of adaptive strategy for constructionof a synopsis trie for data splitting. The probabilisticanalysis of the upper bounds of|T | and the maximum

7/28/2019 tcstr_06_21

7/7

15

1GB2270

GNU sort

0

100

200

300

400

500

600

700

800

900

0 5000000 10000000 15000000 20000000 25000000

Number of strings

Runtimesec

DistStrSort

GNU sort

Figure 8: Running times ofDistStrSort and GNUsort

value of count() concerning to the value of growthparameter 0 is an interesting future research.

We considered only the case that the input sizeis at most the square of the memory size, and thenpresented a two-level external sorting algorithm. It isa future research to develop a hierarchical version ofour algorithm for inputs of unbounded size. A stream-based distributed version of distribution string sortalgorithms is another possible direction.

References

S. Abiteboul, R. Agrawal, P. A. Bernstein,M. J. Carey, S. Ceri, W. B. Croft, D. J. De-

Witt, et al., The Lowell database research self-assessment, C. ACM, 48(5), 111118, 2005.

S. Abiteboul, P. Buneman, D. Suciu, Data on theWeb, Morgan Kaufmann, 2000.

L. Arge, P. Ferragina, R. Grossi, and J. S. Vitter, OnSorting Strings in External Memory, Proc. the29th Annual ACM Symposium on Theory ofComputing (STOC97), 540548, 1997.

J. Bentley, R. Sedgewick, Fast Algorithms for Sort-ing and Searching Strings, Proc. the 8th AnnualACM-SIAM Symposium on Discrete Algorithms

(SODA97), 360369, 1997.P. Ferragina, R. Grossi, The String B-tree: A New

Data Structure for String Search in ExternalMemory and Its Applications, J. ACM, 46(2),236280, 1999.

E. Fredkin, Trie Memory, C. ACM, 3(9), 490499,1960.

P. Laird and R. Saul, Discrete sequence predictionand its applications, Machine Learning, 15(1):4368, 1994.

U. Manber, E. W. Myers, Suffix Arrays: A NewMethod for On-Line String Searches, SIAM J.Comput., 22(5), 935948, 1993.

G. Manzini and P. Ferragina, Engineering aLightweight Suffix Array Construction Algo-

rithm, Proc. the 10th European Symposium onAlgorithms (ESA02), 698-710, 2002.

A. Moffat, Implementing the PPM data compressionscheme, IEEE Trans Communications COM-38(11): 19171921, 1990.

R. Ramakrishnan, J. Gehrke, Database ManagementSystems, McGraw-Hill Professional, 2000.

D. Ron, Y. Singer, and N. Tishby, The power of am-nesia: learning probabilistic automata with vari-able memory length, Machine Learning, 25(2-3):117 149, 1996.

K. Sadakane A Fast Algorithms for Making Suf-fix Arrays and for Burrows-Wheeler Transforma-tion, Proc. the 8th Data Compression Confer-ence (DCC98), 129138, 1999.

R. Sinha, J. Zobel, Efficient Trie-Based Sort-ing of Large Sets of Strings, Proc. the26th Australasian Computer Science Conference(ACSC03), 2003.

M. Stonebraker, U. Cetintemel, One Size Fits All:An Idea Whose Time Has Come and Gone,Proc. the IEEE 21st International Conferenceon Data Engineering (ICDE05), keynote, 211,2005.

Documents

tcstr_06_21