Space-Efficient Data Structures for Top-k Completion 蔡晓华


Outline

•Motivation

•Completion Trie

•RMQ Trie

•Score-Decomposed Trie

•Q & A

Motivation: focus on the case where the string set is so large that compression is needed to fit the data structure in memory

Contribution: present three different trie-based data structures to address this problem

Definition: scored string set. The completion suggestions are drawn from a set of strings, each associated with a score. We call such a set a scored string set

Problem: top-k completion. Given a string p and an integer k, a top-k completion query in the scored string set S returns the k highest-scored pairs among the strings in S that have p as a prefix

Completion Trie

Trie: in a simple trie, each edge represents a single character

Compacted Trie: allows a sequence of characters to be associated with each edge (except root)

Completion Trie

Score:

(1) Assign to each leaf node the score of the string it represents

(2) Assign to each intermediate node the maximum score among its descendant leaf nodes

By construction, the score of each non-leaf node is simply the maximum score among its children

Completion Trie

Compacted trie with max scores in each node

Completion Trie

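The way to find top-k completions:

a) Find the locus node for the input prefix p

b) Add the locus node to a priority queue ordered by score

c) Repeatedly extract the highest-scored node from the priority queue: if it is a leaf node, report it as the next completion; otherwise insert all of its children into the priority queue. Stop once k completions have been reported or the queue is empty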
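
The three steps above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: it assumes an uncompacted trie (one character per edge), positive integer scores, and a hypothetical end marker `END` so that every string ends at a leaf. The names `build_trie` and `top_k` are made up for this sketch.

```python
import heapq
from itertools import count

END = "$"  # hypothetical end-of-string marker; assumed absent from the strings


def build_trie(scored_strings):
    """Build a one-character-per-edge trie; each node keeps its max score."""
    root = {"score": 0, "children": {}}
    for s, score in scored_strings:
        node = root
        node["score"] = max(node["score"], score)
        for ch in s + END:
            node = node["children"].setdefault(ch, {"score": 0, "children": {}})
            node["score"] = max(node["score"], score)
    return root


def top_k(root, prefix, k):
    # a) find the locus node
    node = root
    for ch in prefix:
        node = node["children"].get(ch)
        if node is None:
            return []
    # b) seed the queue; heapq is a min-heap, so scores are negated
    tie = count()  # tie-breaker so that nodes themselves are never compared
    heap = [(-node["score"], next(tie), prefix, node)]
    results = []
    # c) expand the best node until k leaves have been reported
    while heap and len(results) < k:
        neg_score, _, text, node = heapq.heappop(heap)
        if not node["children"]:
            results.append((text[:-1], -neg_score))  # drop the END marker
        else:
            for ch, child in node["children"].items():
                heapq.heappush(heap, (-child["score"], next(tie), text + ch, child))
    return results


data = [("caca", 3), ("cbac", 2), ("cacc", 1), ("ab", 2)]
print(top_k(build_trie(data), "c", 2))
```

Because every internal node carries the maximum score of its subtree, the first leaf popped from the queue is always the best remaining completion.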

Completion Trie

Improvement

Instead of inserting all children of each expanded node into the priority queue, keep the children of every node sorted by decreasing score

Result

When a node is extracted, only its first child and its next sibling (if any) need to be added to the queue
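
A possible sketch of this improvement in Python is shown below. It is an illustration under assumptions, not the original code: the trie is uncompacted, scores are positive integers, `END` is a hypothetical end marker, and the names `Node`, `build` and `top_k_sorted` are invented here. Each queue entry points at one position in a sorted sibling list, so extracting it adds at most two new entries.

```python
import heapq
from itertools import count

END = "$"  # hypothetical end-of-string marker; assumed absent from the strings


class Node:
    def __init__(self):
        self.score = 0
        self.kids = {}      # char -> Node, used for construction and lookup
        self.ordered = []   # (char, Node) pairs sorted by decreasing score


def build(scored_strings):
    root = Node()
    for s, score in scored_strings:
        node = root
        for ch in s + END:
            node = node.kids.setdefault(ch, Node())
            node.score = max(node.score, score)
    stack = [root]
    while stack:  # sort every child group once, at build time
        node = stack.pop()
        node.ordered = sorted(node.kids.items(), key=lambda kv: -kv[1].score)
        stack.extend(node.kids.values())
    return root


def top_k_sorted(root, prefix, k):
    node = root
    for ch in prefix:
        node = node.kids.get(ch)
        if node is None:
            return []
    tie, heap, results = count(), [], []

    def push(siblings, i, text):
        # text is the string spelled by the parent of siblings[i]
        if i < len(siblings):
            heapq.heappush(heap, (-siblings[i][1].score, next(tie), text, siblings, i))

    push(node.ordered, 0, prefix)
    while heap and len(results) < k:
        neg_score, _, text, siblings, i = heapq.heappop(heap)
        ch, child = siblings[i]
        push(siblings, i + 1, text)           # next sibling, if any
        if child.ordered:
            push(child.ordered, 0, text + ch)  # first child
        else:
            results.append((text, -neg_score))  # leaf reached via the END edge
    return results


data = [("caca", 3), ("cbac", 2), ("cacc", 1), ("ab", 2)]
print(top_k_sorted(build(data), "c", 2))
```

The first child always has the same score as its parent and the next sibling never has a higher one, so the queue order remains valid even though most children are never inserted.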

Completion Trie

• Reduce the time complexity to find the top-k completions

• In practice , reduce the number of comparisons needed to find the locus node

Completion Trie

Input: prefix = c, k = 2

Find locus node: c

[Figure: completion trie below the locus node, with node labels c:3, ac:3, a:3, b:2]

Result: String = caca, String = cbac

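Advantages of the algorithm

• In a traditional trie, each leaf node stores the score of its string. To answer a top-k query, all leaf nodes matching the prefix must be found and sorted on the fly before the first k elements are returned

• Problem: when the prefix is short, a large number of results are returned, and sorting them costs both time and space

• One alternative is to preprocess the data, finding the k completions of every prefix and storing them with it

• Problem: k must be known in advance, and k is fixed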

Compressed Encoding

Motivation

• Improve the theoretical time complexity

• Improve the locality of memory access (random access to RAM or a hard drive is much slower than access to the CPU cache)

Compressed Encoding

Two strategies

BFS: when finding the locus node, store each group of child nodes consecutively

Accessing the next sibling is then less likely to incur a cache miss

Prefix=c

Cache size=2

DFS: when enumerating completions, store the nodes in depth-first order

As each internal node is assigned the maximum score of its children and the children are sorted by decreasing score, following the first child is guaranteed to reach a leaf node matching the score of an internal node

Typically incurs only one cache miss per completion

Encoding for each node

• Character sequence associated with its incoming edge

• Score

• Whether it is the last sibling

• An offset pointer to its first child (0 if it has none)

Plain encoding: (l + 1) + 4 + 1 + 4 = l + 10 bytes, where l is the length of the character sequence

Apply variable-byte encoding to scores and offsets

• Store only the score difference between the current node and its previous sibling

• Store only the difference between the first-child offset of the current node and that of its previous sibling
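
To make the saving concrete, here is a small Python sketch of variable-byte encoding combined with score deltas. The byte layout (7 payload bits per byte, low bits first, high bit as continuation flag) is one common convention and is an assumption here, as is storing the first sibling's score in full; the slides do not specify either. The function names are made up.

```python
def encode_varint(n):
    """Encode a non-negative integer using 7 payload bits per byte."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(data, pos=0):
    """Decode one integer starting at pos; return (value, next position)."""
    result = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_sibling_scores(scores):
    """Scores of one sibling group, sorted by decreasing score.

    Assumption: the first score is stored as-is, the others as the
    (non-negative) gap to the previous sibling.
    """
    out = bytearray()
    prev = None
    for s in scores:
        out += encode_varint(s if prev is None else prev - s)
        prev = s
    return bytes(out)


def decode_sibling_scores(data, n):
    scores, pos, prev = [], 0, None
    for _ in range(n):
        value, pos = decode_varint(data, pos)
        prev = value if prev is None else prev - value
        scores.append(prev)
    return scores


encoded = encode_sibling_scores([1000, 990, 990, 3])
print(len(encoded), decode_sibling_scores(encoded, 4))
```

In this example the four scores take 6 bytes instead of the 16 bytes that fixed 4-byte fields would need, because small gaps fit in a single byte.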

Implementation Details

How do we recover the string that corresponds to a leaf node?

DFS: reconstruct the string by starting from the root node and iteratively finding the child whose subtrie node offset range includes the target leaf node

Reduce the cost by keeping additional bookkeeping in the search algorithm

• Store the nodes to be inserted into the queue in an array, each along with the index of its parent node in the array

• We can retrieve the path from each completion node to the locus node by following the parent indices
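
The bookkeeping idea can be illustrated with a few lines of Python. This is only a sketch: the array layout, the use of -1 as the parent index of the locus node, and the helper names `record` and `reconstruct` are assumptions made for the example.

```python
def record(entries, label, parent):
    """Append a visited node (edge label, parent index); return its index."""
    entries.append((label, parent))
    return len(entries) - 1


def reconstruct(entries, index, locus_string):
    """Follow parent indices up to the locus node and rebuild the string."""
    parts = []
    while index != -1:
        label, parent = entries[index]
        parts.append(label)
        index = parent
    return locus_string + "".join(reversed(parts))


# Hypothetical search below the locus node for the string "c".
entries = []
locus = record(entries, "", -1)   # the locus node has no parent in the array
a = record(entries, "a", locus)
b = record(entries, "b", locus)
leaf = record(entries, "ca", a)

print(reconstruct(entries, leaf, "c"))
```

With this array, recovering a completion costs one step per node on its path to the locus node, and no search from the root is needed.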

End of the First Section

Questions?

RMQ Trie

What is RMQ ?

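• RMQ is short for Range Minimum Query: given an array and an interval [a, b], it returns the position of the minimum element in that interval

• The trie maps the set of strings to consecutive integers in lexicographic order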

RMQ Trie

• If the string set S is represented with a trie, the set of strings prefixed by p is a subtrie

• If the scores are arranged in DFS order within an array R, the scores of Sp are those in an interval R[a,b]

• PrefixRange(p): an operation that, given p, returns the pair (a,b), or null if no string has p as a prefix

RMQ Trie

Build an RMQ data structure on top of R using an inverted ordering, i.e. the minimum is the highest score

Strategy

• The index of the first completion is i = RMQ(a,b)

• The second completion is the one with the highest score among RMQ(a,i-1) and RMQ(i+1,b)

• Recursive splitting

• In general, the index of the next completion is the highest-scored RMQ among all the intervals

• Maintain the intervals in a priority queue ordered by score
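
The interval-splitting strategy can be sketched in Python as below. To keep the example short, two stand-ins are used and should be read as assumptions: `prefix_range` works on a sorted list instead of a trie, and `rmq` is a linear scan rather than a real constant-time RMQ structure. All names are illustrative.

```python
import heapq
from bisect import bisect_left


def prefix_range(sorted_strings, p):
    """Stand-in for PrefixRange: inclusive (a, b), or None if nothing matches."""
    a = bisect_left(sorted_strings, p)
    b = a
    while b < len(sorted_strings) and sorted_strings[b].startswith(p):
        b += 1
    return (a, b - 1) if b > a else None


def rmq(R, a, b):
    """Stand-in RMQ with inverted ordering: index of the highest score in R[a..b]."""
    return max(range(a, b + 1), key=R.__getitem__)


def top_k(sorted_strings, R, p, k):
    rng = prefix_range(sorted_strings, p)
    if rng is None:
        return []
    heap = []

    def push(a, b):
        if a <= b:  # skip empty intervals
            i = rmq(R, a, b)
            heapq.heappush(heap, (-R[i], i, a, b))

    push(*rng)
    results = []
    while heap and len(results) < k:
        neg_score, i, a, b = heapq.heappop(heap)
        results.append((sorted_strings[i], -neg_score))
        push(a, i - 1)  # split the interval around the reported index
        push(i + 1, b)
    return results


strings = ["ab", "caca", "cacc", "cbac"]  # lexicographic order
R = [2, 3, 1, 2]                          # scores in the same order
print(top_k(strings, R, "c", 2))
```

Each reported completion replaces one interval in the queue by at most two smaller ones, so the queue holds at most k + 1 intervals.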

RMQ Trie

Advantage

• Simplicity and modularity (re-use an existing dictionary data structure without any significant modification)

Disadvantage

• Hard to implement the operation PrefixRange

• The cost of PrefixRange is significantly worse

End of the Second Section

Q&A

Score-Decomposed Trie

Path decompositions :

• Let T be the trie built on the strings of the scored string set S. A path decomposition of T is a tree Tc whose nodes correspond to node-to-leaf paths in T. It is built by choosing a root-to-leaf path Π in T and associating it with the root node of Tc; the children of the root node are defined recursively as the path decompositions of the subtries hanging off the path Π


• Find a root-to-leaf path

• Let the path be the root node of the new Trie Tc

• Recursively define the children of the root node

Note that while each string s in S corresponds to a root-to-leaf path in T, in Tc it corresponds to a root-to-node path.

Max-score path decomposition

• It is a way to choose the path

• Choose the path as the one to the leaf with the highest score

• The subtries at the same level are arranged in decreasing order of score (the score of a subtrie is defined as the highest score in the subtrie)



• r represents the highest score in the subtrie rooted at vi

• Add r to the label of the edge leading to the corresponding child, such that the label becomes the pair (b,r), where b is the branching character

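Score-decomposed Trie example

1. The first path is root + B. B is a leaf node, so the path ends there. We can see that ab has two siblings, hence the label 2ab

2. The edges of the decomposed tree are nodes of the original trie, e.g. c,3 and b,2 on the second level

3. Take c,3 on the left as an example: once that path has been taken, recursively find the path in the subtrie, i.e. node C + E + G. The path is aca; because ac and a each have one sibling, the label is 1ac1a

4. G has only one sibling, so the next path is H on its own. The two c's in cc are split: the path is c,1 and the node is c

5. The rest are handled in the same way

6. Note the rightmost node K: its label is just 1, because its character sequence is empty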

[Figure: score-decomposed trie. Node labels: 2ab, 1ac1a, 1, c, 1ac, a, a. Edge labels: c,3; b,2; c,1; b,1; b,2; b,1]


How to support top-k completions enumeration?

1) Because of the max-score decomposition strategy, the highest score in each subtrie is exactly the score of the decomposition path for that subtrie.

2) The tree has the heap property: the score of each node is less than or equal to the score of its parent.

3) This implies that for each (s,r) in S, if u is the node corresponding to s, then r is stored in the incoming edge of u, except when u is the root, whose score is stored separately.


• First, follow the algorithm of the Lookup operation until the prefix p is exhausted, leading to the locus node u, the highest node whose corresponding string has p as a prefix. Report that string as the first completion.

• Find the next completions: the prefix p ends at some position i in Lu. Thus all the other completions must be in the subtrees whose roots are the children of u branching after position i. Add those children to a priority queue.

• Extract the highest-scored node from the priority queue, report the string corresponding to it, and add all its children to the priority queue.
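
The lookup and enumeration steps above, together with the max-score path decomposition they rely on, can be sketched in Python as follows. This is a simplified in-memory illustration, not the succinct encoding of the original work: nodes are plain dictionaries, `END` is a hypothetical terminator that keeps the string set prefix-free, and the names `decompose`, `build` and `top_k` are invented for the example.

```python
import heapq
from itertools import count

END = "$"  # hypothetical terminator; assumed absent from the strings


def decompose(items):
    """Max-score path decomposition of distinct, prefix-free (string, score) pairs."""
    path, score = max(items, key=lambda it: it[1])  # path to the best leaf
    groups = {}
    for s, r in items:
        if s != path:
            i = 0
            while s[i] == path[i]:  # i: position where s branches off the path
                i += 1
            groups.setdefault((i, s[i]), []).append((s[i + 1:], r))
    children = [(i, b, decompose(g)) for (i, b), g in groups.items()]
    return {"path": path, "score": score, "children": children}


def build(scored_strings):
    return decompose([(s + END, r) for s, r in scored_strings])


def top_k(root, p, k):
    node, base = root, ""  # base: text spelled before the node's own path
    while True:  # Lookup: descend until p is exhausted inside a node label
        path, i = node["path"], 0
        while i < len(p) and i < len(path) and p[i] == path[i]:
            i += 1
        if i == len(p):
            break  # locus node found; p ends at position i of its label
        nxt = [c for j, b, c in node["children"] if j == i and b == p[i]]
        if not nxt:
            return []
        base, node, p = base + path[:i] + p[i], nxt[0], p[i + 1:]
    tie = count()
    heap = [(-node["score"], next(tie), base, node, len(p))]
    results = []
    while heap and len(results) < k:
        neg_score, _, base, node, start = heapq.heappop(heap)
        path = node["path"]
        results.append(((base + path)[:-1], -neg_score))  # drop END
        for j, b, child in node["children"]:
            if j >= start:  # for the locus node, skip branches before position i
                heapq.heappush(
                    heap, (-child["score"], next(tie), base + path[:j] + b, child, 0)
                )
    return results


data = [("caca", 3), ("cbac", 2), ("cacc", 1), ("ab", 2)]
print(top_k(build(data), "c", 2))
```

Unlike the completion trie, every node extracted from the queue here yields a completion directly, since each node of the decomposed tree corresponds to one string.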

Prefix = cac, k = 2

Locus node: 1ac1a

Completions: caca, caccc

Thank You! Questions?
