Space-Efficient Data Structures for Top-k Completion
蔡晓华
Outline
• Motivation
• Completion Trie
• RMQ Trie
• Score-Decomposed Trie
• Q & A
Motivation: focus on the case where the string set is so large that compression is needed to fit the data structure in memory
Contribution: present three different trie-based data structures to address this problem
Definition (scored string set): the completion suggestions are drawn from a set of strings, each associated with a score. We call such a set a scored string set
Problem (top-k completion): given a string p and an integer k, a top-k completion query in the scored string set S returns the k highest-scored pairs among the strings in S that are prefixed by p
Completion Trie
Trie: each edge represents a single character in the simple trie
Compacted Trie: allows a sequence of characters to be associated with each edge (except root)
Completion Trie
Score:
(1) Assign to each leaf node the score of the string it represents
(2) Assign to each intermediate node the maximum score among its descendant leaf nodes
By construction, the score of each non-leaf node is simply the maximum score among its children
Completion Trie
Compacted trie with max scores in each node
Completion Trie
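The way to find top-k completions:
a) Find the locus node for the input prefix p
b) Add the locus node to a priority queue ordered by score
c) Repeatedly extract the highest-scored node from the queue: if it is a leaf node, report its string as the next completion; otherwise insert all of its children into the priority queue. Stop once k completions have been reported or the queue is empty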
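Steps a) to c) can be sketched as follows. This is a minimal illustration, not the presenter's implementation: it uses a plain one-character-per-edge trie instead of a compacted one, appends a hypothetical terminator character `END` so that every string ends at a leaf, and assumes non-negative scores. The names `Node`, `build` and `top_k` are made up for this sketch.

```python
import heapq

END = "\0"  # hypothetical terminator so that every string ends at a leaf


class Node:
    def __init__(self):
        self.children = {}  # character -> Node
        self.score = 0      # maximum score among descendant leaves


def build(scored):
    """Build the trie from a dict {string: score}, storing max scores."""
    root = Node()
    for s, r in scored.items():
        node = root
        node.score = max(node.score, r)
        for ch in s + END:
            node = node.children.setdefault(ch, Node())
            node.score = max(node.score, r)
    return root


def top_k(root, prefix, k):
    # a) find the locus node
    node = root
    for ch in prefix:
        node = node.children.get(ch)
        if node is None:
            return []
    # b) start the priority queue with the locus node (heapq is a min-heap,
    #    so scores are negated; the path string breaks ties and is unique)
    heap = [(-node.score, prefix, node)]
    out = []
    # c) pop the best node; report leaves, expand internal nodes
    while heap and len(out) < k:
        neg, s, node = heapq.heappop(heap)
        if s.endswith(END):
            out.append((s[:-1], -neg))
        else:
            for ch, child in node.children.items():
                heapq.heappush(heap, (-child.score, s + ch, child))
    return out
```

For example, with `build({"ab": 4, "caca": 3, "cbac": 2, "caccc": 1})`, the query `top_k(root, "c", 2)` pops nodes in score order and reports `caca` before `cbac`.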
Completion Trie
Improvement
Instead of inserting all children of each expanded node into the priority queue, keep the children of every node sorted in order of decreasing score
Result
When a node is extracted from the queue, only its first child and its next sibling (if any) need to be added
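A hedged sketch of this first-child / next-sibling variant is shown below. It is self-contained and makes the same simplifying assumptions as a textbook trie: one character per edge, a hypothetical terminator `END` marking leaves, and non-negative scores. All names (`Node`, `build`, `top_k`) are illustrative only.

```python
import heapq

END = "\0"  # hypothetical terminator so that every string ends at a leaf


class Node:
    def __init__(self, ch=""):
        self.ch = ch
        self.score = 0
        self.children = []  # sorted by decreasing score once build() returns


def build(scored):
    root = Node()
    for s, r in scored.items():
        node = root
        node.score = max(node.score, r)
        for ch in s + END:
            nxt = next((c for c in node.children if c.ch == ch), None)
            if nxt is None:
                nxt = Node(ch)
                node.children.append(nxt)
            nxt.score = max(nxt.score, r)
            node = nxt
    # sort every child list by decreasing score
    stack = [root]
    while stack:
        node = stack.pop()
        node.children.sort(key=lambda c: -c.score)
        stack.extend(node.children)
    return root


def top_k(root, prefix, k):
    node = root
    for ch in prefix:
        node = next((c for c in node.children if c.ch == ch), None)
        if node is None:
            return []
    out = []
    # heap entry: (-score, path string, sibling list, index in that list)
    heap = [(-node.score, prefix, [node], 0)]
    while heap and len(out) < k:
        neg, s, sibs, i = heapq.heappop(heap)
        node = sibs[i]
        if i + 1 < len(sibs):            # next sibling, if any
            nxt = sibs[i + 1]
            heapq.heappush(heap, (-nxt.score, s[:-1] + nxt.ch, sibs, i + 1))
        if node.children:                # first (highest-scored) child
            first = node.children[0]
            heapq.heappush(heap, (-first.score, s + first.ch, node.children, 0))
        elif s.endswith(END):            # leaf: report the completion
            out.append((s[:-1], -neg))
    return out
```

Each extraction pushes at most two entries, so the queue stays small even when a node has many children.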
Completion Trie
• Reduces the time complexity of finding the top-k completions
• In practice, also reduces the number of comparisons needed to find the locus node
Completion Trie
Example
Input: prefix = c, k = 2
Locus node: c
[Figure: completion trie with node labels c:3, ac:3, a:3, b:2, b:2, ac:3]
String = caca
String = cbac
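Advantages of the algorithm
• In a traditional trie, each leaf node stores the score of its string. To answer a top-k query, one has to find all the leaf nodes that match the prefix, sort them on the fly, and return the first k elements
• Problem: when the prefix is short, there are many matching results, and sorting them costs both time and space
• One alternative is to preprocess the data, finding the k completions of every prefix and storing them with that prefix
• Problem: k must be known in advance, and it is fixed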
Compressed Encoding
Motivation
• Improve the theoretical time complexity
• Improve the locality of memory access (random access to RAM or the hard drive is much slower than access to the CPU cache)
Compressed Encoding
Two strategies
BFS: store each group of child nodes consecutively
When finding the locus node, accessing the next sibling is then less likely to incur a cache miss
Prefix=c
Cache size=2
DFS: encode the nodes in DFS order
As each internal node is assigned the maximum score of its children and the children are sorted by decreasing score, repeatedly following the first child is guaranteed to reach a leaf node whose score matches that of the internal node
Typically incurs only one cache miss per completion
Encoding for each node
• Character sequence associated with its incoming edge
• Score
• Whether it is the last sibling
• An offset pointer to its first child (0 if there is none)
For a character sequence of length l, this takes (l+1)+4+1+4 = l+10 bytes per node
Apply variable-byte encoding to scores and offsets
• Store only the score difference between the current node and its previous sibling
• Store the first-child offset as a delta from that of the previous sibling
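To make the saving concrete, here is a small sketch of delta plus variable-byte coding for the scores of one sibling group. The byte layout (7 payload bits per byte, high bit set on every byte except the last) is a common varint scheme chosen for illustration; the slides do not say which layout is used. Storing the first sibling's score as-is is also an assumption, and the function names are hypothetical.

```python
def encode_varbyte(n):
    """Encode a non-negative integer, 7 bits per byte, low bits first."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varbyte(data, pos=0):
    """Return (value, position just after the encoded integer)."""
    value, shift = 0, 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            return value, pos


def encode_sibling_scores(scores):
    """Scores of one sibling group, already sorted in decreasing order."""
    out = bytearray()
    prev = None
    for r in scores:
        # first sibling stored as-is (assumption), the rest as differences
        out += encode_varbyte(r if prev is None else prev - r)
        prev = r
    return bytes(out)


def decode_sibling_scores(data, count):
    scores, pos, prev = [], 0, None
    for _ in range(count):
        value, pos = decode_varbyte(data, pos)
        prev = value if prev is None else prev - value
        scores.append(prev)
    return scores
```

With fixed 4-byte scores, the group `[300, 5, 5, 1]` needs 16 bytes; in this sketch it is stored as the values 300, 295, 0 and 4, which take 2 + 2 + 1 + 1 = 6 bytes.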
Implementation Details
How do we recover the string that corresponds to a leaf node?
DFS: reconstruct the string by starting from the root node and iteratively finding the child whose subtrie offset range includes the target leaf node
The cost can be reduced by keeping additional bookkeeping in the search algorithm
• Store the nodes to be inserted into the queue in an array, each along with the index of its parent node in the array
• The path from each completion node to the locus node can then be retrieved by following the parent indices
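The parent-index bookkeeping can be illustrated with a few lines. In this sketch each array entry is a hypothetical pair `(parent_index, edge_label)`, the locus node sits at index 0 with parent `-1`, and its label is assumed to hold the whole string from the root to the locus; none of these conventions come from the slides.

```python
def reconstruct(entries, index):
    """Follow parent indices from a completion node back to the locus node."""
    parts = []
    while index != -1:
        parent, label = entries[index]
        parts.append(label)
        index = parent
    # labels were collected leaf-first, so reverse them
    return "".join(reversed(parts))


# nodes in the order the search visited them: (parent index, edge label)
entries = [
    (-1, "c"),   # 0: locus node
    (0, "a"),    # 1
    (0, "b"),    # 2
    (1, "ca"),   # 3: leaf
    (2, "ac"),   # 4: leaf
]
```

Here `reconstruct(entries, 3)` walks 3 → 1 → 0 and `reconstruct(entries, 4)` walks 4 → 2 → 0, without touching the encoded trie again.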
End of the First Section
Questions?
RMQ Trie
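What is RMQ?
• RMQ is short for Range Minimum Query: given an array and a range of indices, it returns the position of the minimum element in that range
• The underlying string dictionary maps the set of strings to consecutive integers in lexicographic order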
RMQ Trie
• If the string set S is represented with a trie, the set Sp of strings prefixed by p is a subtrie
• If the scores are arranged in DFS order within an array R, the scores of Sp are those in an interval R[a,b]
• PrefixRange(p): an operation that, given p, returns the pair (a,b), or null if no string is prefixed by p
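As an illustration of what PrefixRange returns, the sketch below computes it on a plain sorted Python list with binary search. This swaps the trie for a sorted array purely for brevity; a real compressed dictionary would answer the same query on its own representation. It assumes the last character of p is not the largest possible code point, and the function name is hypothetical.

```python
from bisect import bisect_left


def prefix_range(sorted_strings, p):
    """Return (a, b), inclusive, or None if no string is prefixed by p."""
    a = bisect_left(sorted_strings, p)
    if p:
        # smallest string that is greater than every string prefixed by p
        upper = p[:-1] + chr(ord(p[-1]) + 1)
        b = bisect_left(sorted_strings, upper) - 1
    else:
        b = len(sorted_strings) - 1
    return (a, b) if a <= b else None


strings = ["ab", "caca", "caccc", "cbac"]  # lexicographic order
```

On this list the prefix `c` covers positions 1 to 3 and `cac` covers positions 1 to 2, so the scores of the matching strings form one contiguous interval of R.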
RMQ Trie
Build an RMQ data structure on top of R using an inverted ordering, i.e. the minimum is the highest score
Strategy
• The index of the first completion is i=RMQ(a,b)
• The second completion is the one with the highest score among RMQ(a,i-1) and RMQ(i+1,b)
• Recursive splitting: in general, the index of the next completion is the highest-scored RMQ among all the remaining intervals
• Maintain the intervals in a priority queue ordered by score
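The interval-splitting strategy can be sketched as below. The RMQ itself is replaced by a linear scan (`rmq_max`) that stands in for a real constant-time RMQ structure; that substitution, and all names here, are assumptions made to keep the example short.

```python
import heapq


def rmq_max(R, a, b):
    """Stand-in for RMQ with inverted ordering: index of the max in R[a..b]."""
    return max(range(a, b + 1), key=R.__getitem__)


def top_k_indices(R, a, b, k):
    """Indices of the k highest scores in R[a..b], in decreasing score order."""
    out = []
    if a > b:
        return out
    i = rmq_max(R, a, b)
    # heap entry: (-score, index of the best element, interval lo, interval hi)
    heap = [(-R[i], i, a, b)]
    while heap and len(out) < k:
        _, i, lo, hi = heapq.heappop(heap)
        out.append(i)
        # split the interval around i and queue the best of each half
        for l, h in ((lo, i - 1), (i + 1, hi)):
            if l <= h:
                j = rmq_max(R, l, h)
                heapq.heappush(heap, (-R[j], j, l, h))
    return out


R = [4, 3, 1, 2]  # scores of "ab", "caca", "caccc", "cbac" in sorted order
```

With the interval (1, 3) for the prefix `c`, the first extraction yields index 1 and leaves the single interval (2, 3), whose best element is index 3. Each reported completion adds at most two intervals to the queue.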
RMQ Trie
Advantage
• Simplicity and modularity (re-uses an existing dictionary data structure without any significant modification)
Disadvantage
• Hard to implement the operation PrefixRange
• The cost of PrefixRange is significantly worse
End of the Second Section
Q&A
Score-Decomposed Trie
Path decompositions :
• Let T be the trie built on the strings of the scored string set S. A path decomposition of T is a tree Tc whose nodes correspond to node-to-leaf paths in T. It is built by choosing a root-to-leaf path Π in T and associating it with the root node of Tc; the children of the root node are defined recursively as the path decompositions of the subtries hanging off the path Π
Score-Decomposed Trie
• Find a root-to-leaf path
• Let the path be the root node of the new Trie Tc
• Recursively define the children of the root node
Note that while each string s in S corresponds to a root-to-leaf path in T, in Tc it corresponds to a root-to-node path.
Max-score path decomposition
• It is a way to choose the path
• Choose the path as the one to the leaf with the highest score. The subtries at the same level are arranged in decreasing order of score (the score of a subtrie is defined as the highest score in the subtrie)
Score-Decomposed Trie
• r represents the highest score in the subtrie rooted at vi
• Add r to the label of the edge leading to the corresponding child, so that the label becomes the pair (b,r), where b is the branching character
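Score-decomposed Trie example
1. The first path is root + B. B is a leaf node, so the path ends there. Since ab has two siblings, the node label is 2ab
2. The edges of the decomposed tree correspond to nodes of the original trie, e.g. c,3 and b,2 on the second level
3. Take the left edge c,3 as an example: after following the path, recursively find the path in the subtrie, i.e. nodes C + E + G. The path is aca, and since ac and a each have one sibling, the label is 1ac1a
4. G has only one sibling, so the next path is H by itself. The two-character cc is split: the edge is c,1 and the node is c
5. The rest is handled in the same way
6. Note the rightmost node K: its label is just 1, because its string is empty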
[Figure: score-decomposed trie for the example. Node labels: 2ab, 1ac1a, 1, c, 1ac, a, a. Edge labels: c,3; b,2; c,1; b,1; b,2; b,1]
Score-Decomposed Trie
How to support top-k completions enumeration?
1) Because of the max-score decomposition strategy, the highest score in each subtrie is exactly the score of the decomposition path for that subtrie.
2) The tree has the heap property: the score of each node is less than or equal to the score of its parent
3) This implies that for each (s,r) in S, if u is the node corresponding to s, then r is stored in the incoming edge of u, except when u is the root, whose score is stored separately.
Score-Decomposed Trie
• First, follow the algorithm of the Lookup operation until the prefix p is exhausted, leading to the locus node u, the highest node whose corresponding string is prefixed by p. Report this string as the first completion
• Find the next completions: the prefix p ends at some position i in Lu. Thus all the other completions must be in the subtrees whose roots are the children of u branching after position i. Add these children to a priority queue
• Extract the highest-scored node from the priority queue, report the string corresponding to it, and add all its children to the priority queue. Repeat until k completions have been reported
Example
Prefix = cac, k = 2
Locus node: 1ac1a
Completions: caca, caccc
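The enumeration over a max-score path decomposition can be sketched as follows. To stay short, each node here stores the full string of its path and a list of (branch position, child) pairs, rather than the compact labels such as 1ac1a; this layout, the requirement that strings are distinct, and the names `PathNode`, `build`, `lcp` and `top_k` are all assumptions of the sketch.

```python
import heapq


def lcp(a, b):
    """Length of the longest common prefix of a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class PathNode:
    def __init__(self, string, score):
        self.string = string   # full string of the max-score path
        self.score = score
        self.children = []     # (branch position, child PathNode)


def build(pairs):
    """Max-score path decomposition of a non-empty list of (string, score)."""
    best_s, best_r = max(pairs, key=lambda sr: sr[1])
    node = PathNode(best_s, best_r)
    groups = {}  # (branch position, branching character) -> subtrie strings
    for s, r in pairs:
        if s != best_s:
            pos = lcp(s, best_s)
            groups.setdefault((pos, s[pos:pos + 1]), []).append((s, r))
    for (pos, _), group in groups.items():
        node.children.append((pos, build(group)))
    return node


def top_k(root, p, k):
    # Lookup: descend until the node's string is prefixed by p
    node = root
    while True:
        m = lcp(node.string, p)
        if m == len(p):
            break
        node = next((c for pos, c in node.children
                     if pos == m and c.string[m:m + 1] == p[m]), None)
        if node is None:
            return []
    out = [(node.string, node.score)]  # the locus is the best completion
    # only children branching at or after the end of p can match
    heap = [(-c.score, c.string, c) for pos, c in node.children if pos >= len(p)]
    heapq.heapify(heap)
    while heap and len(out) < k:
        _, _, c = heapq.heappop(heap)
        out.append((c.string, c.score))
        for _, g in c.children:
            heapq.heappush(heap, (-g.score, g.string, g))
    return out[:k]
```

Because of the heap property, every extracted node is already a completion, so no internal nodes are expanded without producing a result. With `build([("ab", 4), ("caca", 3), ("cbac", 2), ("caccc", 1)])`, the prefix `cac` leads to the node for `caca`, and the only child branching at or after position 3 is the one for `caccc`.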
Thank You! Questions?