Data Structure - home.ustc.edu.cnhome.ustc.edu.cn/~huang83/ds/search2.pdf · 二叉查找树是二叉树的一个特例，因此BST的方法可以看作是特殊种类的二叉树方法)

2015/12/15

1

1

Data Structure

Ch.9 Search(2)

Dr. He Emil Huang

School of Computer Science and Technology

Soochow University

苏州大学计算机科学与技术学院网络工程系

E-mail: [email protected]

http://home.ustc.edu.cn/~huang83/ds.html

本章ppt与教材对应情况

本章涉及所有内容涵盖了Kruse教材以下章节 Chapter 7 Searching (P268~P316)

Chapter 9 Tables and Information Retrieval (P379~P428)

Chapter 10.2 (二叉查找树)

Chapter 10.3 (建立二叉查找树)

Chapter 10.4 (AVL树)

Chapter 11.3 (B-树，自学，重要)

2

在所有基于比较的查找算法中，折半查找效率高，但binary search并不适应insert and delete operations (插删操作)。本节讨论适用于动态查找，即查找过程中动态维护表结构。动态查找一般在树上进行，因此本小节我们称之为search on tree。

§ 9.3.1 二叉查找树 (binary search tree，BST)

定义：二叉查找树或是空树，或满足下述BST性质的二叉树：

①若它的左子树非空，则左子树上所有结点的keys均小于根的key

②若它的右子树非空，则右子树上所有结点的keys均大于根的key

③左右子树又都是二叉查找树

Note：性质1或2中，可以将“﹤”改为“≤”或“﹥”改为“≥”

BST property

binary search tree (二叉查找树 )的 inorder sequence(中序序列 )是一个incremental ordered sequence (递增有序序列) 3

§9.3 search on tree (dynamic search，动态查找) § 9.3.1 Binary Search Trees(二叉查找树、二叉排序树?)Definition: (a recursive definition 递归定义)

A binary search tree is a binary tree that is either empty or in which every node has a key (关键字) and satisfies the following conditions:

The key of the root is greater than the key in any node in the left subtree of the root.

The key of the root is less than the key in any node in the right subtree of the root.

The left and right subtrees of the root are again binary search trees.

No two entries in a binary search tree may have equal keys. (没有两个记录的关键字是一样的，Kruse教材中为了讨论方便进行的简化)

§ 9.3.1 Binary Search Trees

Ordered Lists and implementations:

We can regard binary search trees as a new ADT. 将二叉查找树看成一种新的数据类型，有自己的定义和方法

We may regard binary search trees as a specialization of binary trees.(二叉查找树是二叉树的一个特例，因此BST的方法可以看作是特殊种类的二叉树方法)

We may study binary search trees as a new implementation of the ADT ordered list. (可以把二叉查找树作为有序表的一种实现方法)

6

§ 9.3.1 二叉查找树 (binary search tree，BST)

example

storage structure：

typedef struct node {

KeyType key;

InfoType otherinfo;

struct node *lchild, *rchild;

} BSTNode, *BSTree;

53 7

2 94

2015/12/15

2


implementations: (P446 in Kruse book)

The Binary Search Tree Class (二叉查找树类)The binary search tree class will be derived from the binary tree class;hence all binary tree methods are inherited (继承).

template <class Record>

class Search_tree: public Binary_tree<Record> { //BST树类定义

public:Error_code insert (const Record &new_data);Error_code remove (const Record &old_data);Error_code tree_search (Record &target) const;

private: // Add auxiliary function prototypes here.

};

§9.3.1 Binary Search Trees

implementations:

The inherited methods include the constructors, the destructor, clear, empty, size, height, and the traversals preorder, inorder, and postorder.

A binary search tree (BST) also admits specialized methods called insert, remove(删除), and tree search(树查找).

The class Record has the behavior outlined in Chapter 7: Each Record is associated with a Key. The keys can be compared with the usual comparison operators(比较运算符). By casting records to their corresponding keys, the comparison operators apply to records as well as to keys.

9

§ 9.3.1 binary search tree

1. search basic thought

从根结点开始比较，若当前结点的key与待查的key相等，则查找成功，

返回该结点的指针；否则在左子树或右子树中继续查找。若树中没有待查结点，

则失败于某个空指针上。

algorithm

BSTNode *SearchBST( BSTree T, KeyType key ) {

//在T上查找key=K的节点，成功时返回该节点，否则返回NULL

if ( T==NULL || key==T->key ) return T; //若T为空，查找失败；否则成功if ( key < T->key )

return SearchBST ( T->lchild, key ); //查找左子树else

return SearchBST ( T->rchild, key ); //查找右子树}

53 7

2 94

§ 9.3.1 Binary Search TreesTree Search (Kruse P447)Public method for tree search:

Error_code Search_tree<Record> ::Tree_search(Record &target) const;

/*Post: If there is an entry in the tree whose key matches(符合) that in target, the parameter(参数) target is replaced by the corresponding record from the tree and a code of success is returned. Otherwise a code of not_present is returned.*/ 如果树中有一个元素的key与target匹配，则target参数被树中相应记录替换，返回success，否则返回not_present

This method will often be called with a parameter target that contains only a key value. The method will fill target with the complete data belonging to any corresponding Record in the tree.

§ 9.3.1 Binary Search TreesImplementations of tree Search

We program the searching process by calling an auxiliary recursive function.

We first compare it with the entry at the root of the tree. If their keys match, then we are finished.

Otherwise, we go to the left subtree or right subtree as appropriate and repeat the search in that subtree.

The process terminates(终止) when it either finds the target or hits an empty subtree.

The auxiliary search function returns a pointer to the node that contains the target back to the calling program.


implementationsBinary_node<Record> *Search_tree<Record> ::

search_for_node(Binary_node<Record>* sub_root, const Record &target) const;

/*Pre: sub_root is NULL or points to a subtree of a Search_tree

Post: If the key of target is not in the subtree, a result of NULL is returned.

Otherwise, a pointer to the subtree node containing the target is returned.*/

2015/12/15

3


Binary_node<Record> *Search_ tree<Record> :: search_for_node(

Binary_node<Record>* sub_root, const Record &target) const

{

if (sub_root == NULL || sub_root->data == target)

return sub_root;

else if (sub_root->data < target)

return search_for_ node(sub_ root->right, target);

else return search_ for_ node(sub_ root->left, target);

}

Recursive function 递归版本Non-recursive version: search的非递归实现


Binary_ node<Record> *Search_ tree<Record> :: search_for_node (Binary_ node<Record> *sub_ root, const Record &target) const

{

while (sub_ root != NULL && sub_root->data != target)

if (sub_ root->data < target)

sub_ root = sub_ root->right;

else

sub_ root = sub_ root->left;

return sub_ root;

}

§ 9.3.1 Binary Search TreesPublic method for tree search:


Error_code Search_tree<Record> :: tree_search(Record &target) const

{Error_code result = success;Binary_node<Record> *found = search_for_node(root, target);if (found == NULL)

result = not_present;else

target = found->data;return result;

}

§ 9.3.1 Binary Search TreesBinary Search Trees with the Same Keys (具有同一组关键字的二叉查找树可能形态)

17


time

若成功，则走了一条从根到待查节点的路径

若失败，则走了一条从根到叶子的路径? (不一定)

upper bound：O(h) h—BST树高

analysis：与树高相关

① the worst situation：单支树，ASL=(n+1)/2,与顺序查找相同

② the best situation：ASL≈lgn，形态与折半查找的判定树相似

③ average situation：假定n个keys所形成的n!种排列是等概率的，

则可证明由这n!个序列产生的n!棵BST(其中有的形态相同)的平均高度为O(lgn)，故查找时间仍为O(lgn)

53 7

2 94

1. search

18


Theorem：一棵有n个不同关键字的随机构建二叉查找树(搜索树/排序树)的期望高度为O(lgn)

证明：算法导论第12章二叉搜索树 12.4节 “随机构建二叉搜索树”章节

One useful conclusion you should remember：

在由n个随机key构建的二叉查找树中，查找命中/查找未命中/插入操作所需比较次数为O(lgn)，事实上经分析约为 1.39 lgn

1. search

2015/12/15

4


Analysis of Tree Search 树查找算法性能分析

Draw the comparison tree(比较树) for a binary search (on an ordered list). Binary search on the list does exactly the same comparisons as tree search will do if it is applied to the comparison tree. 回忆一下二分查找我们画出一棵比较树，如果将这棵比较树用于进行tree_search，那么比较次数相同

By Section 7.4, binary search performs O(log n) comparisons for a list of length n. This performance is excellent in comparison to other methods, since log n grows very slowly as n increases.

§ 9.3.1 Binary Search TreesAnalysis of Tree Search

The same keys may be built into binary search trees of many different shapes. 同样的一组key可以构建出多种不同形态的BST

If a binary search tree is nearly completely balanced(平衡的)(bushy 浓密的), then tree search on a tree with n vertices(顶点)will also do O(log n) comparisons of keys.

If the tree degenerates 退化 into a long chain, then tree search becomes the same as sequential search (顺序查找), doing O(n)comparisons on n vertices. This is the worst case for tree search.

The number of vertices between the root and the target, inclusive(包括它们), is the number of comparisons that must be done to find the target. The bushier the tree, the smaller the number of comparisons that will usually need to be done.


Analysis of Tree Search

It is often not possible to predict (in advance of building it) what shape of binary search tree will occur.

In practice, if the keys are built into a binary search tree in random order(以随机次序), then it is extremely unlikely that a binary search tree degenerates badly; tree_search usually performs almost as well as binary search.

22

2、insertion and generation of BSTBST的插入和生成

Insert 插入操作

algorithm idea: 保证新结点插入后满足BST性质，基本思想如下：

若T为空，则为待插入的Key申请一个新结点，并令其为根；

否则，从根开始向下查找插入位置，直到发现树中已有Key，或找到一个空指针为止；

将新结点作为叶子插入空指针的位置。

查找过程是一个关键字的比较过程，易于写出递归或非递

归算法，也可以调用查找操作。

algorithm implementation

implementations:

Insertion into a Binary Search Tree

24

2、insertion and generation of BSTvoid InsertBST( BSTree *T, KeyType key ) { //*T是根

BSTNode *f, *p=*T; //p指向根结点

while (p) { //当树非空时，查找插入位置

if ( p->key==key ) return; //树中已有key，不允许重复插入

f＝p；// f和p为父子关系

p＝( key < p->key ) ? p->lchild; p->rchild;

} //注意：树为空时，无须查找

p = (BSTNode *) malloc(sizeof(BSTNode));

p->key=key; p->lchild=p->rchild=NULL; //生成新结点

if ( *T ==NULL ) * T=p; //原树为空,新结点为根

else //原树非空, 新结点作为*f的左或右孩子插入

if ( key < f->key ) f->lchild=p;

else f->rchild=p;

} //时间为O(h)

非递归C版本

2015/12/15

5

Binary Search Trees

Insertion into a Binary Search TreeP453 in Kruse book

Error_code Search_tree<Record> ::insert(const Record &new_data);

/*Post: If a Record with a key matching that of new data already belongs to the Search_tree, a code of duplicate(重复的) error is returned. Otherwise, the Record new_datais inserted into the tree in such a way that the properties(性质) of a binary search tree are preserved, and a code of success is returned.*/

Binary Search Trees



Error_code Search_tree<Record> :: insert(const Record &new_data)

{

return search_and_insert(root, new_data);

}

Binary Search TreesInsertion into a Binary Search Tree

template <class Record>Error_code Search_tree<Record> :: search_and_insert(Binary_node<Record> * &sub_root, const Record &new_data){

if (sub_root == NULL) {sub_root = new Binary_node<Record>(new_data);return success;}else if (new_data < sub_root->data)

return search_and_insert(sub_root->left, new_data);else if (new_data > sub_root->data)

return search_and_insert(sub_root->right, new_data);else return duplicate_error;

}

Binary Search Trees


The method insert can usually insert a new node into a random binary search tree with n nodes in O(log n) steps. It is possible, but extremely unlikely (有可能，但可能性不大), that a random tree may degenerate (退化) so that insertions require as many as n steps.

If the keys are inserted in sorted order(有序次序) into an empty tree, however, this degenerate case will occur.

29

2、 insertion and generation of BSTGenerate 从一棵空树构建生成BST

algorithm：从空树开始，每输入一个数据就调用一次插入算法。

BSTree CreateBST( void ) {

BSTree T＝NULL; //初始时T为空

KeyType key;

scanf( “%d”, &key );

while( key ) { //假设key＝0表示输入结束

InsertBST ( &T, key ) ; //将key插入树T中

scanf( “%d”, &key )；// 读入下一个关键字

}

return T；// 返回根指针

}

example

30

2、insertion and generation of BST

10 10

18

10

183

10

183

8

10

183

8 12

10

183

8 122

10

183

8 122

7

输入实例 {10， 18， 3， 8， 12， 2， 7}

我们当然也可以设计方法允许重复插入相同元素

2015/12/15

6

31

2、 insertion and generation of BST

general situation

不同的输入实例 (数据集不同、或排列不同)，生成的树的形态一般不同。对n个结点的同一数据集，可生成n！棵BST。

exceptional situation

但有时不同的实例可能生成相同的BST，例如：(2，3，7，8，5，4)和(2，3，7，5，8，4)可构造同一棵BST。

origin of name of sorting tree (排序树名字由来)

因为BST的中序序列有序，所以对任意关键字序列，构造BST的过程，实际上是对其排序。

生成n个结点的BST的平均时间是O(nlgn), 但它约为堆排序的2~3倍，因此它并不适合排序。 32

3、Deletion of BST保证删一结点不能将以该结点为根的子树删去，且仍须满足BST性质。即：删一结点相当于删有序序列中的一个结点。

basic thought

查找待删结点*p，令parent指向其双亲 (初值NULL)；若找不到则返回，否则进入下一步。

在删除*p时处理其子树的连接问题，同时保持BST性质不变。

case1: *p是叶子，无须连接子树，只需将*parent中指向*p的指针置空

case2: *p只有1个孩子，只须连接唯一的1棵子树，故可令此孩子取代*p与其双亲连接 (4种状态)

parent

p

child child

parent

p

parent

p

child

parent

p

child

*p只有左孩子 *p只有右孩子

3、Deletion of BST

basic thought

case3: *p有2个孩子，有2种处理方式：

①找到*p的中序后继(或前驱)*s，用*p的右(或左)子树取代*p与其双亲*parent连接；而*p的左(或右)子树PL则作为*s的左(或右)子树与*s连接。缺点：树高可能增大。

PL sR

s

prparent

PL

sR

p

parent

s

pr

34

3、Deletion (Removal) of BST basic thought

②令q=p，找*q的中序后继*p，并令parent指向*p的双亲，将*p的右子树取代*p

与其双亲*parent连接。将*p的内容copy到*q中，相当于删去了*q，将删*q的

操作转换为删*p的操作。对称地，也可找*q的中序前驱

因为*p 多只有1棵非空的子树，属于case2。实际上case1也是case2的特例。

因此，case3采用该方式时，3种情况可以统一处理为case2。

q=p

parent

p

child

q=p

p

child

*q的中序后继就是其右孩子

35

3、Deletion (Removal) of BST algorithm

void DelBSTNode( BSTree *T, KeyType key) { //*T是根

BSTree *q, *child, *parent=NULL, *p=*T;

while ( p ) { //找待删结点

if ( p->key ==key ) break; //已找到

parent=p; //循环不变式是*parent为*p的双亲

p＝( key < p->key ) ? p->lchild; p->rchild;

}

if ( !p ) return; //找不到被删结点，返回

q=p; //q记住被删结点*p

if (q->lchild && q->rchild ) //case3，找*q的中序后继

for(parent=q,p=q->rchild; p->lchild; parent=p, p=p->lchild); 36

3、Deletion (Removal) of BST//现在3种情况已统一到情况2，被删结点*p 多只有1个非空的孩子

child=(p->lchild)？p->lchild; p->rchild; //case1时child空，否则非空

If ( !parent ) //*p的双亲为空，说明 *p是根，即删根结点

*T=child; //若是情况1，则树为空；否则*child取代*p成为根

else { //*p非根，它的孩子取代它与*p的双亲连接，即删 *p

if ( p==parent->lchild ) parent->lchild=child;

else parent->rchild=child;

if ( p != q ) //情况3，将*p的数据copy到*q中

q->key=p->key; //若有其他数据亦需copy

}

free( p ); //删除*p

} //时间O(h)

2015/12/15

7

5030 80

20 90

85

40

35

8832

(1)被删除的结点是叶子结点

例如: 被删关键字 = 2088

其双亲结点中相应指针域的值改为“空”

5030 80

20 90

85

40

35

8832

(2)被删除的结点只有左子树或者只有右子树

其双亲结点的相应指针域的值改为“指向被删除结点的左子树或

右子树”

被删关键字 = 4080

5030 80

20 90

85

40

35

8832

(3)被删除的结点既有左子树，也有右子树

40

40

以其前驱替代之，然后再删除

该前驱结点

被删结点前驱结点

被删关键字 = 50implementations:

Removal Method (BST中删除结点的实现)template <class Record>

Error_code Search_tree<Record> :: remove(const Record &target)

/* Post: If a Record with a key matching(符合) that of target belongs to the Search_tree a code of success is returned and the corresponding node is removed from the tree. Otherwise, a code of not_present is returned.

Uses: Function search_and_destroy */

{

return search_and_destroy(root, target);

}


Error_code Search_tree<Record> :: search_and_destroy(Binary_node<Record>* &sub_root, const Record &target)

/* Pre: sub_root is either NULL or points to a subtree of the Search_tree .

Post: If the key of target is not in the subtree, a code of not_present is returned. Otherwise, a code of success is returned and the subtree node containing target has been removed in such a way that the properties of a binary search tree have been preserved.

Uses: Functions search_and_destroy recursively and remove_root */

{

if (sub_root == NULL || sub_root->data == target)

return remove_root(sub_root);

else if (target < sub_root->data)

return search_and_destroy(sub_root->left, target);

else

return search_and_destroy(sub_root->right, target);

}

Binary Search Treesimplementations:

Auxiliary(辅助的) Function to Remove One Node


Error_code Search_tree<Record> :: remove_root(Binary_node<Record> * &sub_root)

/* Pre: sub_root is either NULL or points to a subtree of the Search_tree .

Post: If sub_root is NULL , a code of not_present is returned. Otherwise, theroot of the subtree is removed in such a way that the properties of a binarysearch tree are preserved. The parameter sub_root is reset as the root of themodified subtree, and success is returned. */

2015/12/15

8

Binary Search Trees

implementations:

Auxiliary Function to Remove One Node

{

if (sub_root == NULL) return not_present;

Binary_node<Record> *to_delete = sub_root;

// Remember node to delete at end.

if (sub_root->right == NULL) sub_root = sub_root->left;

else if (sub_root->left == NULL) sub_root = sub_root->right;

else { // Neither subtree is empty.

to_delete = sub_root->left;

// Move left to find predecessor(前驱).

Binary_node<Record> *parent = sub_root; // parent of to_delete

Binary Search Trees

while (to_delete->right != NULL) { // to delete is not the predecessor.

parent = to_delete;to_delete = to_delete->right;

}sub_root->data = to_delete->data; // Move from to_delete to root.if (parent == sub_root) sub_root->left = to_delete->left;else parent->right = to_delete->left;

}

delete to_delete; //Remove to_delete from tree.

return success;

}

Treesort 树排序

When a binary search tree is traversed in inorder, the keys will come out in sorted order.

This is the basis for a sorting method, called treesort: Take the entries to be sorted, use the method insert to build them into a binary search tree, and then use inorder traversal to put them out in order.

Theorem(定理) 10.1 Treesort makes exactly the same comparisons of keys as does quicksort(快速排序) when the pivot(基准) for each sublist(子表) is chosen to be the first key in the sublist.

Corollary(推论) 10.2 In the average case, on a randomly ordered list of length n, treesort performs:2n ln n + O(n)= 1.39n lg n+ O(n) comparisons of keys.

Treesort 树排序

Advantages and Drawback of Tree sort

First advantage(优点) of treesort over quicksort: The nodes need not all be available at the start of the process, but are built into the tree one by one as they become available.

Second advantage: The search tree remains available for later insertions and removals(删除).

Drawback(缺点): If the keys are already sorted, then treesort will be a disaster(灾难)--the search tree it builds will reduce to a chain.

Treesort should never be used if the keys are already sorted, or are nearly so.

Treesort 树排序

2015/12/15

9

STOC 2003理论计算机科学领域X炸天会议

The average binary search tree requires approximately 1.39 times as many comparisons as a completely balanced tree.

The average cost of not balancing a binary search tree is approximately 39% more comparisons

Tradeoff between balancing the tree and optimally search

52

3、平衡二叉树和平衡的二叉查找树 balanced binary tree (平衡二叉树)

任一结点的左右子树高度大致相同，即只要二叉树高度为O(lgn)，则可认为它是平衡的。

completely balanced binary tree (完全平衡二叉树)

任一结点的左右子树高度完全相同(满二叉树)

balanced binary search tree (平衡的二叉查找树)

满足BST性质的平衡二叉树

AVL树：任一结点的左右子树的高度之差的绝对值不超过1，h≈1.44lgn

红黑树：h≈2lgn

balance mechanism (平衡机制)

AVL树：4种旋转

红黑树：2种旋转

Breaking the lgn barrierBy use of key comparisons alone, it is impossible to complete a search of n items in fewer than lgn comparisons, on average。

通过关键字(key)比较的查找，平均情况下至少需要lgn次比较才可能完成n个item的查找,但是如果我们希望更快，打破lgn的约束

n条记录，每一条记录的key取值分别为0~n-1，存储在一个大小为n的array中，于是我们可以通过下标定位在O(1)时间定位

Ordinary table lookup or array access requires only constant time O(1).

普通的表格查找或数组访问时间只需O(1)

Both table lookup and searching share the same essential purpose, that of information retrieval. The key used for searching and the index used for table lookup have the same essential purpose: one piece of information that is used to locate further information.

Breaking the lgn barrierIn chapter 9 (Kruse book), we study ways to implement and access various kinds of tables in contiguous storage. 顺序存储中实现各类表格和访问表格的方法

Several steps may be needed to retrieve an entry from some kinds of tables, but the time required remains O(1). It is bounded by a constant that does not depend on the size of the table. Thus table lookup can be more efficient than any searching method. 无论表格的形态如何，我们检索一个元素的时间复杂度仍为O(1)，该结果表明检索时间复杂度不依赖于表格长度n，表格查找比任何查找方法更有效

The entries of the tables will be indexed by sequences of integers, just as array entries are indexed by such sequences. 我们所考虑的表格元素由整数索引，同数组类似。本身我们也是用数组实现抽象定义的table

2015/12/15

10

Breaking the lgn barrierWe shall implement abstractly defined tables with arrays. In order to distinguish between the abstract concept and its implementation, we introduce:(用数组来表示table)

约定：抽象定义的表格元素下标放在( )中，数组下标用[ ]表示

Example: Thus T(1, 2, 3) denotes the entry of the table T that is indexed by the sequence 1, 2, 3, and A[1][2][3] denotes the correspondingly indexed entry of the C++ array A.

(逻辑表示与存储表示的区分)

Rectangular Tables (矩形table)

Rectangular Tables

Row-Major and Column-Major Ordering Indexing Rectangular Tables 行主序以及列主序的矩形表格

Entry (i,j) in a rectangular table goes to position

ni+j in a sequential array.

A formula of this kind, which gives the sequential location of a table entry, is called an index function. (上面的公式给出了表格元素的顺序位置，我们称之为下标函数)

Variation: An Access Array (访问数组) access vector

This method is to keep an auxiliary array, a part of the multiplication table for n. The array will contain the values 0, n, 2n, 3n, ……, (m – 1)n

TABLES OF VARIOUS SHAPES(各种形态的矩阵)

Triangular Tables (三角矩阵)index function:

Entry (i，j) of the triangular table corresponds to entry,

1/2*i(i+1)+j

of the contiguous array.

access array,

0, 1, 1+2, (1+2)+3, ……

2015/12/15

11

Jagged Tables(锯齿矩阵)

where there is no predictable relation between the position of a row and its length.

表格查找与函数非常相似。根据自变量计算相应的值/根据下标确定相应的值

TABLES: A NEW ABSTRACT DATA TYPE

Functions


FunctionsIn mathematics a function is defined in terms of two sets and a correspondence from elements of the first set to elements of the second. If f is a function from a set A to a set B, then assigns to each element of A a unique element of B.

The set A is called the domain of f , and the set B is called the codomain of f .

The subset of B containing just those elements that occur as values of f is called the range of f .

Index set, value type(下标集合、值类型)

Table access begins with an index and uses the table to look up a corresponding value. Hence for a table we call the domain the index set, and we call the codomain the base type or value type.

Example : 1、double array[n]; 2、a triangular table;


Comparisons (List and table)retrievalThe time required for searching a list generally depends on the number n of entries in the list and is at least lgn, but the time for accessing a table does not usually depend on the number of entries in the table; that is, it is usually O(1). For this reason, in many applications, table access is significantly faster than list searching.


Comparisons(List and table)traversal

On the other hand, traversal is a natural operation for a list but not for a table. It is generally easy to move through a list performing some operation with every entry in the list. In general, it may not be nearly so easy to perform an operation on every entry in a table, particularly if some special order for the entries is specified in advance.(在table中遍历不是它的操作)

Comparisons( tables and arrays )In general, we shall use table as we have defined it in this section and restrict the term array to mean the programming feature available in C++ and most high-level languages and used for implementing both tables and contiguous lists.(array是高级语言中的概念，通常用于实现表格及表。)

Hashing哈希/散列

Hashing是算法在时间和空间上作出权衡的经典实例

Page 397 in Kruse book(very重要)

2015/12/15

12

67

§9.4 hashing technique

前述查找算法均是基于key的比较，由比较判定树可证明，在平均和坏情况下所需的比较次数的下界是：

lgn＋O(1)

要突破此界，就不能只依赖于比较来进行。散列(哈希)技术目标是无需通过比较即可找到key，期望时间为O(1)。

§ 9.4.1 the concept of hash table (哈希表/散列表的概念)

direct addressing (直接寻址)：当结点的key和存储位置间建立某种关系时，无须比较即可找到结点的存储位置。

e.g.，若500个结点的keys取值为1～500间互不相同的整数，则keys可作为下标，将其存储到T[1..500]之中，寻址时间为O(1)

68

§ 9.4.1 the concept of hash table 压缩映象：

例子：设影像出租店

总片数：10,000张

借还：500人次/每天

记录：(电话号码，姓名，住址，…), Key为7位电话号码

直接寻址：须保留1000万个记录

但实际只要大于500即可，多1万个记录

general case

全集U：可能发生的关键字集合，即关键字的取值集合

实际发生集K：实际需要存储的关键字集合

∵ |K| << |U|，当表T的规模取|U|浪费空间，有时甚至无法实现

∴ T的规模由|U|压缩至|K|，一般取O(K), 但压缩存储后失去了直接寻址的功能

69

§ 9.4.1 the concept of hash table 散列 (hash) 函数 h

基于上述原因，我们需要在keys和表地址之间建立一个对应关系h，将全集U映射到T[0..m-1]的下标集上：

h：U→{ 0, 1, … , m-1 }

∀ki∈U，h(ki) ∈ { 0,1,…,m-1 }

hash table: T

hash address (散列值/地址)：散列函数值h(ki)

hashing (散列)：将结点按其key的散列地址

存储到散列表中的过程

关键字集合

存储地址集合

hash

U. . . K k1 .. . . k2 .

. . k3.

h(k2)

h(k1)=h(k3)

0

m-1

T[0..m-1]

70

§ 9.4.1 the concept of hash table 散列 (hash) 函数h：将待处理的|U|个值减小到m个值

conflict (碰撞/冲突): 若ki ≠kj，有h(ki) ＝ h(kj)

同义词：发生冲突的两个关键字互为同义词(相对于h)

能完全避免冲突吗？

须满足：① |U|≤m ② 选择合适的hash函数

否则只能设计好的h使冲突尽可能少

通常情况下，h是压缩映像，无论怎样设计hash函数也不能完全避免冲突，因为K集合可由U中任意O(|K|)个关键字构成

解决冲突：须确定处理冲突的方法

装填因子：冲突的频繁程度除了与h的好坏相关，还与表的填满程度相关

α＝n/m， 0≤ α ≤ 1 (n:表中结点数，m:表长度)

71

§ 9.4.2 散列函数的构造方法

原则：简单、均匀

简单－－计算快速，均匀－－冲突小化(如何理解?)

冲突小化——使子集K均匀分布在地址集上，即冲突小化

假定：关键字key定义在自然数集合上，否则将其转换到自然数集合上

Choosing a Hash Function

A hash function should be easy and quick to compute. (简单、计算快速)

A hash function should achieve an even distribution of the keys that actually occur across the range of indices. (均匀)

The usual way to make a hash function is to take the key, chop it up, mix the pieces together in various ways, and thereby obtain an index that will be uniformly distributed over the range of indices. (对关键字进行分割，混合等多种运算)

Note that there is nothing random about a hash function. If the function is evaluated more than once on the same key, then it must give the same result every time, so the key can be retrieved without fail.(哈希函数对于某一个关键字，每次计算都应得到相同的结果。)

2015/12/15

13

Choosing a Hash Function

Truncation: Sometimes we ignore part of the key, and use the remaining part as the index. (截断)

Folding: We may partition the key into several parts and combine the parts in a convenient way. (折叠)

Modular arithmetic: We may convert the key to an integer, divide by the size of the index range, and take the remainder as the result. (模运算)

A better spread of keys is often obtained by taking the size of the table (the index range) to be a prime number. (质数模)

74

§ 9.4.2 the construction method of hash function

1、 middle square method (平方取中法)

先通过平方扩大相近数的差别，然后取中间的若干位作为散列地址。因为中间位和乘数的每一位相关，故产生的地址较为均匀。

int Hash( int k ) {

k *=k; k /=100; //去掉末尾两位数

return k％1000；//取中间3位，T的地址为0~999

}

75

§ 9.4.2 the construction method of hash function2、除余法

用表长除以关键字，取其余数作为散列地址

h(k)＝ k%p； p≤m，有时取p＝m

p 好取素数，或取不包含小于20的质因数的合数

优点：简单，无需定义为函数，直接写到程序里使用

3、 random number method (随机数法)

h(k)＝random( k );

伪随机函数须保证函数值在0～m－1之间

选取哈希函数，考虑的因素

计算哈希函数所需时间

关键字长度

哈希表长度(哈希地址范围)

关键字分布情况

记录的查找频率

Choosing a Hash Function C++ Example

As a simple example, let us write a hash function in C++ for transforming a key consisting of eight alphanumeric characters into an integer in the range 0 . . hash_size-1.

Therefore, we shall assume that we have a class Key with the methods and functions of the following definition:

class Key: public String{

public:

char key_letter(int position) const;

void make_blank( );

// Add constructors and other methods.

};

Choosing a Hash FunctionWe inherit the methods of the class String from Chapter 6. The

method key_letter(int position) must return the character in a particular position of the Key, or return a blank if the Key has length less than n. The final method make_blank sets up an empty Key

int hash(const Key &target)/* Post: target has been hashed, returning a value between 0 and hash_ size -1 .Uses: Methods for the class Key . */{int value = 0;for (int position = 0; position < 8; position++)

value = 4 * value + target.key_letter(position);return value%hash_size;}

78

§ 9.4.3 the methods of handling conflicts1、 Open Addressing, 开放地址法 (闭散列)

basic idea：当发生冲突时，使用某种探测技术在散列表中形成一个探测序列，沿此序列逐个单元查找，直到找到给定的key，或碰到1个开放地址(空单元)为止

插入时，探测到开地址可将新结点插入其中

查找时，探测到开地址表示查找失败

开地址表示：与应用相关，如key为字符串时，用空串表示开地址，即用不会出现在keys中的值表示进行初始化

general form

hi＝(h(k)＋di)％m 1 ≤i ≤ m－1 (hash-1)

探测序列：h(k), h1, h2 , …, hm-1

开地址法要求：α<1，实用中 0.5≤ α ≤ 0.9

增量序列

2015/12/15

14

increment functions (增量函数)

We will do just as well to find a more sophisticated way of determining the distance to move from the first hash position and apply this method, whatever the first hash location is. Hence we wish to design an increment function that can depend on the key or on the number of probes already made and that will avoid clustering.(比较好的方法是：确定从原哈希函数确定的位置的相对距离，因此我们设计一个增量函数，这个函数可能与关键字或已经进行的探测次数有关。)

80

§ 9.4.3 Collision Resolution (处理冲突) 的方法

Open Addressing (开放地址法) 分类：按照形成探测序列的方法不同，可将其分为3类：线性探测、二次探测、双重散列法

(1) Linear Probing (线性探测法)

Linear probing starts with the hash address and searchessequentially for the target key or an empty position. The arrayshould be considered circular, so that when the last location isreached, the search proceeds to the first location of the array.(向后顺序查找一个空位置，当后面没有空位时，从开始位置向后探测)

81

§ 9.4.3 Collision Resolution (处理冲突) 的方法

basic thought：将T[0..m-1]看作一循环向量，令hash-1式中的di＝i，即

hi＝(h(key)＋i)％m 0 ≤i ≤ m－1

若令d＝h(key)，则长的探测序列为：

d,d+1,…,m-1,0,1,…,d-1

探测过程终止于

① 探测到空单元：查找时失败，插入时写入

② 探测到含有key的单元：查找时成功，插入时失败

③ 探测到T[d-1]但未出现上述2种情况：查找、插入均失败

82

§ 9.4.3 the methods of handling conflicts 例1：已知一组keys为(26,36,41,38,44,15,68,12,06,51), 用除余法

和线性探测法构造散列表

解：∵α<1，n＝10，不妨取m＝13，此时α≈0.77

h( k )= k%13 keys：(26,36,41,38,44,15,68,12,06,51)

对应的散列地址：(0,10, 2, 12, 5, 2, 3, 12, 6, 12)

1211109876543210

3836444126

38365106446815411226

1191122131

T[0..12]

探测次数

83

§ 9.4.3 the methods of handling conflicts 非同义词冲突：上例中，h(15)=2，h(68)=3，15和68不是同义词，但是

15和41这两个同义词处理过程中，15先占用了T[3]，导致插入68时发生了非同义词的冲突。

聚集 (堆积)：一般地，用线性探测法解决冲突时，当表中i,i+1,…,i+k的位置上已有结点时，一个散列地址为i,i+1,…,i+k,i+k+1的结点都将争夺同一地址：i＋k＋1。这种不同地址的结点争夺同一后继地址的现象称为聚集。

聚集增加了探测序列的长度，使查找时间增加。应跳跃式地散列。

Clustering (聚集)

The major drawback of linear probing is that, as the table becomes about

half full, there is a tendency toward clustering; that is, records start to

appear in long strings of adjacent positions with gaps between the

strings. Thus the sequential searches needed to find an empty position

become longer and longer. 84

§ 9.4.3 the methods of handling conflicts

1211109876543210

3836444126

38365106446815411226

1191122131

T[0..12]

探测次数

2015/12/15

15

85


(2)二次探测法 Quadratic Probing：

探测序列为( 增量序列为di =i2 )：

hi＝(h(k)＋i*i)％m 0 ≤i ≤ m－1

缺点：不易探测到整个空间

Collision Resolution with Open Addressing

Quadratic Probing (二次探测)

Other methods

Key-dependent increments

Random probing

这时，增量函数为：i2 (i=1,2,3,……)

Example: we have the records as below{ 19, 01, 23, 14, 55, 68, 11, 82, 36 }

The hash function is H(key) = key MOD 11And the table has length 11

0 1 2 3 4 5 6 7 8 9 10

0 1 2 3 4 5 6 7 8 9 10

1901 23 1455 68

1901 23 14 68

Linear probing

Quadratic probing

11 82 36

55 11 82 36

1 1 2 1 3 6 2 5 1

88


(3)双重散列法：

是开地址法中好的方法之一，探测序列为

hi＝(h(k)＋i* h1(k) )％m 0 ≤i ≤ m－1

即： di =i* h1(k)

特点：使用了两个散列函数，一般须使h1(k)的值与m互素

avoid clustering

If we are to avoid the problem of clustering, then we must use some more sophisticated way to select the sequence of locations to check when a collision occurs.使用更好的方法解决聚集问题

There are many ways to do so. One, called rehashing, uses a second hash function to obtain the second position to consider. If this position is filled, then some other rehashing method is needed to get the third position, and so on. 再哈希法


C++ Algorithmsconst int hash_ size = 997; // a prime number of appropriate size

class Hash_table {

public:

Hash_table( );

void clear( );

Error_code insert(const Record &new entry);

Error_code retrieve(const Key &target, Record &found) const;

private:

Record table[hash_ size];

};


2015/12/15

16

Collision Resolution with Open Addressing Collision Resolution with Open AddressingHash Table InsertionError_code Hash_table :: insert(const Record &new_entry)

/* Post: If the Hash table is full, a code of overflow is returned. If the table already contains an item with the key of new entry a code of duplicate error is returned. Otherwise:The Record new entry is inserted into the Hash table and success is returned.

Uses: Methods for classes Key , and Record . The function hash . */

{ //采用二次探测解决冲突

Error_code result = success;

int probe_count, //Counter to be sure that table is not full.

increment, //Increment used for quadratic probing.此处increment指的是相邻两次探测的距离，与前面增量函数的含义不同。

probe; //Position currently probed in the hash table.

Key null; //Null key for comparison purposes.

null.make_blank( );

probe = hash(new_entry);

Hash Table Insertionprobe_count = 0;increment = 1;while (table[probe] != null // Is the location empty?

&& table[probe] != new_entry // Duplicate key?&& probe_ count < (hash_size + 1)/2) { // Has overflow occurred?

probe_ count++;probe = (probe + increment)%hash_size;increment += 2; // Prepare increment for next iteration.

}if (table[probe] == null) table[probe] = new_entry;// Insert new_entry.else if (table[probe] == new_entry) result = duplicate_error;else result = overflow; // The table is full.return result;}

Collision Resolution with Open Addressing Collision Resolution by Chaining

95

§ 9.4.3 the methods of handling conflicts2、拉链法(开散列)

基本思想：将所有keys为同义词的结点连接在同一个单链表中，若散列地址空间为0～m-1，则将表定义为m个头指针组成的指针数组T[0..m-1]，其初值为空。

例2：keys和hash函数同前例

01

234

56

789

101112

44

^41 15

26

68

06

36

1238

^

^

^^

^

^

^

^

^

^

^

^

51比较次数： 1 2 3 96


特点

无堆积现象、非同义词不会冲突、ASL较短

无法预先确定表长度时，可选此方法

删除操作易于实现，开地址方法只能做删除标记

允许α>1，当n较大时，节省空间

2015/12/15

17

97

§ 9.4.4 operations on the hash table

仅给出开放定址法处理冲突时的相关运算

storage structure#define NIL -1 //空结点标记

#define m 997 //表长，依赖应用和α

typedef struct {

KeyType key;

//….

}NodeType, HashTable[m];

1、search operation (查找运算)

和建表类似，使用的函数和处理冲突的方法应与建表一致

开地址可统一处理98

§ 9.4.4 operations on the hash tableint Hash( KeyType K, int i){

return ( h(K)+ Increment( i ) )%m; //Increment相当于di

}

Int HashSearch( HashTable T, KeyType K, int *pos){

int i=0; //探测次数计数器

do { *pos=Hash( K,i ); //求探测地址hi

if ( T[*pos].key==K ) return 1; //成功

if ( T[*pos].key==NIL ) return 0; //失败

} while (++i<m); // 多探测m次

return -1; //可能是表满，失败

}

2、 insert and build table

insert：先查找，找到开地址成功，否则(表满、K存在)失败

void HashInsert( HashTable T, NodeType new ){

int pos, sign;

sign=HashSearch( T, new.key, &pos ) ;

if ( !sign ) //标记为0，是开地址

T[pos]=new; //插入成功

else

if ( sign>0 ) printf(“duplicate key”) ; //重复，插入失败

else Error(“overflow”); //表满，插入失败

}

build table：将表中keys清空，然后依次调用插入算法插入结点

§ 9.4.4 operations on the hash table

100

§ 9.4.4 operations on the hash table3、delete

开地址法不能真正删除 (置空)，而应该置特定标记Deleted

查找时：探测到Deleted标记时，继续探测

插入时：探测到Deleted标记时，将其看作开地址，插入新结点

当有删除操作时，一般不用开地址法，而用拉链法

4、performance analysis

只分析查找时间，插删取决于查找，因为冲突，仍需比较。

例子(见例1和例2图)

success

线性探测：ASL＝(1*6+2*2+3*1+9*1)/10=2.2 //n=10

拉链法： ASL＝(1*7+2*2+3*1)/10=1.4 //n=10

101

§ 9.4.4 operations on the hash table failure

线性探测：直到探测到开地址为止

ASL＝(9+8+7+6+5+4+3+2+1+1+2+1+10)/13=4.54 //m=13

注意：当m>p(p＝|散列地址集|)时，分母应该为p

拉链法：不包括空指针判定

ASL＝(1+0+2+1+0+1+1+0+0+0+1+0+3)/13=0.77 //m=13

T[0..12]

1211109876543210

38365106446815411226

10223456789探测次数 111

102

§ 9.4.4 operations on the hash table failure

拉链法：不包括空指针判定

ASL＝(1+0+2+1+0+1+1+0+0+0+1+0+3)/13=0.77 //m=13

01

234

56

789

101112

44

^41 15

26

68

06

36

1238

^

^

^^

^

^

^

^

^

^

^

^

51

比较次数： 1 2 3

2015/12/15

18

103

§ 9.4.4 operations on the hash table general case

线性探测

成功：ASL＝(1＋1/(1－α))/2

失败：ASL＝(1＋1/(1－α)2)/2

显然，只与α相关，只要α合适，ASL＝O(1)。

其他方法优于线性探测：略

5、用途

加速查找

加密等信息安全领域

Collision Resolution by Chaining

Advantages and Disadvantage of chainingThe linked lists from the hash table are called chains.

If the records are large, a chained hash table can save space.

Collision resolution with chaining is simple; clustering is no problem.

The hash table itself can be smaller than the number of records; overflow is no problem.

Deletion is quick and easy in a chained hash table.

If the records are very small and the table nearly full, chaining may take more space.

Collision Resolution by ChainingCode for Chained Hash Tables

Definitionclass Hash_table {public:// Specify methods here.private:List<Record> table[hash size];};

The class List can be any one of the generic linked implementations of a list studied in Chapter 6.ConstructorThe implementation of the constructor simply calls the constructor for each list in the array.ClearTo clear the table, we must clear the linked list in each of the table positions, using the List method clear( ).

Collision Resolution by Chaining

Code for Chained Hash TablesRetrieval

sequential_search(table[hash(target)], target, position);

Insertion

table[hash(new entry)].insert(0, new entry);

Deletion

ANALYSIS OF HASHINGThe Birthday Surprise

The probability that m people all have different birthdays is

This expression becomes less than 0.5 whenever m 23.

结论:随机挑选23个人，就可能有2个人的生日是一样的。

ANALYSIS OF HASHING

For hashing, the birthday surprise says that for any problem of reasonable size, collisions will almost certainly occur.

A probe is one comparison of a key with the target. (定义一次探测为与目标记录的一次比较)

The load factor of the table is = n/t , where n positions are occupied out of a total of t positions in the table. (装载因子)

2015/12/15

19

ANALYSIS OF HASHING ANALYSIS OF HASHING

ANALYSIS OF HASHING CONCLUSIONS: COMPARISON OF METHODS

We have studied four principal methods of information retrieval,the first two for lists and the second two for tables. Often we can choose either lists or tables for our data structures.Sequential search is O(n)

Sequential search is the most flexible method. The data may be stored in any order, with either contiguous or linked representation.

Binary search is O(lg n)Binary search demands more, but is faster: The keys must be in order, and the data must be in random-access representation (contiguous storage).

CONCLUSIONS: COMPARISON OF METHODS

Table lookup is O(1)Ordinary lookup in contiguous tables is best, both in speed and convenience, unless a list is preferred, or the set of keys is sparse, or insertions or deletions are frequent.

Hash-table retrieval is Θ(1)Hashing requires the most structure, a peculiar ordering of the keys well suited to retrieval from the hash table, but generally useless for any other purpose. If the data are to be available for human inspection, then some kind of order is needed, and a hash table is inappropriate.

Documents

Data Structure - home.ustc.edu.cnhome.ustc.edu.cn/~huang83/ds/search2.pdf · 二叉查找树是二叉树的一个特例，因此BST的方法可以 看作是特殊种类的二叉树方法)

Data Structure - home.ustc.edu.cnhome.ustc.edu.cn/~huang83/ds/search2.pdf · 二叉查找树是二叉树的一个特例，因此BST的方法可以看作是特殊种类的二叉树方法)