Chapter 11
Binary Search Trees, etc.
The Dictionary data type is a collection of
data items, each of which has a key component.
If we use a hash table to represent a dictionary,
it only takes, on average, Θ(1) time to carry
out such basic operations as search, insertion,
and deletion. However, the worst-case time is O(n).
A hash table also cannot provide good average time
to carry out other operations such as outputting
the elements in ascending order of keys, finding
the kth element in ascending order, etc.
All of those operations are supported efficiently
by a balanced search tree: a sorted output can be
produced in Θ(n), and each of the other operations
can be done in Θ(log n).
We will begin with binary search trees, which are
not always "balanced."
1
Binary Search Trees
A binary search tree is a binary tree that may
be empty. A nonempty binary search tree has the
following properties: 1) Every element has a
key component, and all the key values are distinct.
2) The keys in the left subtree of the root are
smaller than the key in the root. 3) The keys
in the right subtree of the root are larger than
the key in the root. 4) Both the left and right
subtrees are also binary search trees.
An indexed BST is derived from a BST by
adding a field, LeftSize, to each tree node,
whose value is the number of nodes in that
node's left subtree plus 1.
2
The BSTree class
Since the number of nodes in a binary search
tree, as well as its shape, will change frequently,
it is appropriate to use a linked structure to
represent it. Moreover, it is convenient to derive
the class from the class of binary trees. For
example, we can then call the InOutput operation
defined for binary trees to output all the
elements in ascending order.
template<class E, class K>
class BSTree : public BinaryTree<E> {
public:
   bool Search(const K& k, E& e) const;
   BSTree<E,K>& Insert(const E& e);
   BSTree<E,K>& Delete(const K& k, E& e);
   void Ascend() {InOutput();}  // inorder traversal gives ascending order
};
3
The search operation
When we search for x, we begin at the root.
If it is NULL, then x can’t be found. Otherwise,
we compare x with the key of the root. If they
are equal, the search is a success. If it is larger
that the key of the root, it must be in the
right subtree, if it is inside the tree at all; so,
we recursively search for x in the right subtree.
Otherwise, we search the left one.
template<class E, class K>
bool BSTree<E,K>::Search(const K& k, E& e) const
{
   BinaryTreeNode<E> *p = root;
   while (p)                        // examine p->data
      if (k < p->data) p = p->LeftChild;
      else if (k > p->data) p = p->RightChild;
      else {                        // found the element
         e = p->data;
         return true;
      }
   return false;
}
4
Obviously, the search can be done in O(h),
where h is the height of the tree.
With an indexed BST, we can also search for
the kth element, again in O(h). For example,
to look for the third element in the following
indexed BST, we check the LeftSize field of
the root; since it is 4, the element must be
in the left subtree. We then check this field
for the root of the left subtree; it is 2,
hence the element we are looking for is the
smallest one in that node's right subtree. We
find that the LeftSize of that subtree's root
is 1, so we can conclude that this root itself
is the element we are looking for.
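The steps above can be sketched in C++. The node layout below (IndexedNode, with a leftSize field equal to the number of nodes in the left subtree plus 1) is a hypothetical minimal version of the indexed BST described here:

```cpp
#include <cassert>
#include <cstddef>

// Minimal indexed-BST node (hypothetical layout): leftSize is the number
// of nodes in the node's left subtree, plus 1.
struct IndexedNode {
    int data;
    int leftSize;
    IndexedNode *left;
    IndexedNode *right;
};

// Return the k-th smallest element (1-based), or nullptr if k is out of
// range. Each iteration descends one level, so the search runs in O(h).
const IndexedNode* kth(const IndexedNode* p, int k) {
    while (p) {
        if (k == p->leftSize)        // exactly k-1 elements are smaller
            return p;
        if (k < p->leftSize)         // k-th element lies in the left subtree
            p = p->left;
        else {                       // skip the left subtree and this node
            k -= p->leftSize;
            p = p->right;
        }
    }
    return nullptr;                  // k was out of range
}
```

Note that no key comparisons are needed: the LeftSize counts alone steer the descent.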
5
The insertion operation
To insert an element with key k inside a BST,
we first verify that this key does not occur any-
where inside the tree. If the search succeeds,
the insertion will not be done; otherwise, the
element will be inserted where the search re-
ports a failure.
The following are two examples:
If we insert into an indexed BST, we also have
to adjust the LeftSize fields of the nodes on
the path from the root down to the insertion
point. In both cases, insertion takes O(h),
where h is the height of the tree.
6
The code
template<class E, class K>
BSTree<E,K>& BSTree<E,K>::Insert(const E& e)
{
   BinaryTreeNode<E> *p = root,   // search pointer
                     *pp = 0;     // parent of p
   while (p) {                    // find a place to insert e
      pp = p;
      if (e < p->data) p = p->LeftChild;
      else if (e > p->data) p = p->RightChild;
      else throw BadInput();      // duplicate key
   }
   // get a node for e and attach it to pp
   BinaryTreeNode<E> *r = new BinaryTreeNode<E>(e);
   if (root) {                    // tree is nonempty
      if (e < pp->data) pp->LeftChild = r;
      else pp->RightChild = r;
   }
   else root = r;                 // insertion into an empty tree
   return *this;
}
7
The deletion operation
If the node to be deleted is a leaf, we simply
delete it. Deletion is also easy when the node
has just one child. When it has two children, it
becomes a bit tricky, since we have only one
location for the two subtrees to be hooked on.
8
template<class E, class K>
BSTree<E,K>& BSTree<E,K>::Delete(const K& k, E& e)
{
   // search for the node with key k; pp is its parent
   BinaryTreeNode<E> *p = root, *pp = 0;
   while (p && p->data != k) {
      pp = p;
      if (k < p->data) p = p->LeftChild;
      else p = p->RightChild;
   }
   if (!p) throw BadInput();      // no element with key k
   e = p->data;                   // save the element to be deleted

   // when p has two children, replace p->data with the largest
   // element in its left subtree, then delete that node instead
   if (p->LeftChild && p->RightChild) {
      BinaryTreeNode<E> *s = p->LeftChild, *ps = p;
      while (s->RightChild) {     // move to the rightmost node
         ps = s;
         s = s->RightChild;
      }
      p->data = s->data;
      p = s; pp = ps;
   }

   // now p has at most one child; link that child, c, to pp
   BinaryTreeNode<E> *c;
   if (p->LeftChild) c = p->LeftChild;
   else c = p->RightChild;
   if (p == root) root = c;
   else {
      if (p == pp->LeftChild) pp->LeftChild = c;
      else pp->RightChild = c;
   }
   delete p;
   return *this;
}
9
A BST could be high
The height of a binary search tree can be quite
large. For example, if we use the above insertion
function to add the elements 1, 2, . . . , n into an
empty BST, the resulting BST is in fact a linear
list. As a result, a basic operation can take as
much as O(n) time.
But, if insertions and deletions are made at
random, then the height of such a BST will be
O(log n).
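This worst case is easy to reproduce. The sketch below (a stripped-down, non-template insert with hypothetical names, not the class version given earlier) builds one BST from sorted keys and one from shuffled keys and compares their heights:

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <random>
#include <vector>

// A stripped-down BST node and insertion (duplicates are ignored).
struct Node {
    int key;
    Node *left = nullptr, *right = nullptr;
};

Node* insert(Node* root, int key) {
    if (!root) return new Node{key};
    if (key < root->key) root->left = insert(root->left, key);
    else if (key > root->key) root->right = insert(root->right, key);
    return root;
}

int height(const Node* p) {        // number of nodes on a longest root-leaf path
    if (!p) return 0;
    return 1 + std::max(height(p->left), height(p->right));
}

int heightAfterInserting(std::vector<int> keys) {
    Node* root = nullptr;
    for (int k : keys) root = insert(root, k);
    return height(root);           // nodes are leaked; fine for a demo
}
```

Inserting 1..100 in order yields height 100, while a shuffled order stays far shallower.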
10
A little assignment
Homework 11.1. Generate a random permutation
of the integers 1 through n, for n ∈ {100, 500,
1000, 10000, 20000, 50000}. Insert them into an
initially empty binary search tree in that
random order. Measure the height of the
resulting search tree. Repeat this experiment
for several random permutations and then
calculate the average of the measured heights.
Compare the obtained average with the
theoretical figure of 2⌈log2(n + 1)⌉.
11
AVL trees
We want to define the BST operations in such
a way that the tree stays balanced at all times,
while preserving the binary search tree proper-
ties. The essential algorithms for insertion and
deletion are exactly the same as for a BST,
except that, since they might destroy the prop-
erty of being balanced, we have to restore the
balance of the tree when needed. Such a tree is
called a balanced tree; it is guaranteed that all
the basic operations applied to a balanced tree
can be done in Θ(log n).
The class of AVL trees is one of the more pop-
ular balanced trees. A nonempty binary tree
with TL and TR as its left and right subtrees is
an AVL tree iff both TL and TR are AVL trees,
and |hL − hR| ≤ 1, where hL and hR are the
heights of TL and TR.
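The definition translates directly into a checker. This is a sketch with a hypothetical minimal node type; it returns the height when the tree is an AVL tree and −1 otherwise:

```cpp
#include <cassert>
#include <cstdlib>

struct Node {          // hypothetical minimal binary-tree node
    Node *left, *right;
};

// Return the height of the tree if it satisfies the AVL property
// (|hL - hR| <= 1 at every node), and -1 if it does not.
int avlHeight(const Node* p) {
    if (!p) return 0;
    int hl = avlHeight(p->left);
    if (hl < 0) return -1;          // an imbalance was already found below
    int hr = avlHeight(p->right);
    if (hr < 0) return -1;
    if (std::abs(hl - hr) > 1) return -1;
    return 1 + (hl > hr ? hl : hr);
}
```

Because the recursion matches the definition clause for clause, the check costs one visit per node.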
12
AVL search trees
An AVL search tree is a binary search tree
which is also an AVL tree. It has the following
properties.
1. The height of an AVL tree with n nodes is
O(log n).
2. For every n ≥ 0, there exists an AVL tree
with n nodes.
3. We can search an n-element AVL search
tree in O(log n).
4. A new element can be inserted into an
n-element AVL search tree so that the resulting
(n + 1)-element tree is also an AVL tree.
5. An element can be deleted from an n-element
AVL search tree so that the resulting
(n − 1)-element tree is also an AVL tree, and
the deletion can be done in O(log n).
13
Fibonacci tree
We first calculate the least number of nodes in
a balanced binary tree of height h: let S(h) be
this number. Obviously, S(0) = 1 and
S(1) = 2.
In general, we can construct a minimum balanced
tree of height h by joining, under a new root,
two minimum subtrees of heights h − 1 and h − 2.
We call such trees Fibonacci trees, for the
obvious reason.
By an inductive argument, we have that
S(h) = S(h − 1) + S(h − 2) + 1 ≥ fib(h),
which leads to S(h) > (3/2)^h − 1. Thus, the
least number of nodes in such a tree, n0,
satisfies n0 > (3/2)^h − 1.
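The recurrence and the claimed lower bound are easy to check numerically; this sketch tabulates S(h) and verifies S(h) > (3/2)^h − 1 for small h:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// S(h) = least number of nodes in a balanced (AVL) tree of height h:
// S(0) = 1, S(1) = 2, S(h) = S(h-1) + S(h-2) + 1.
std::vector<long long> minAvlNodes(int maxH) {
    std::vector<long long> s{1, 2};
    for (int h = 2; h <= maxH; ++h)
        s.push_back(s[h - 1] + s[h - 2] + 1);
    return s;
}
```

The sequence begins 1, 2, 4, 7, 12, 20, . . . , each term one more than the sum of the previous two, mirroring the Fibonacci numbers.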
14
Height of an AVL tree
Therefore, given any AVL tree of height h with
n nodes, we have that
n ≥ n0 > (3/2)^h − 1.
Further calculation leads to the following
result:
h ≤ 1.44 log2(n + 2) − 0.328.
A simpler, but unproved, result is that
h ≤ log2(n + 1) + 0.25.
15
Represent an AVL tree
An AVL tree is usually represented using the
linked representation for binary trees. To pro-
vide information to restore the balance, a bal-
ance factor is added to each node. For any
node, this factor is defined to be the differ-
ence between the heights of its left and right
subtrees.
It is easy to see that any search takes O(logn).
16
Restore the Balance
When a tree becomes unbalanced, as a result
of either insertion or deletion, we can restore
its balance by rotating the tree at and/or below
the node where the imbalance is detected.
Case 1: Assume that the left subtree of k2
has height larger than that of its right subtree,
caused by adding a node into subtree X or
deleting a node from subtree Z. We apply a
right rotation, which restores the balance; the
subtree is now rooted at k1.
Notice that the above rotation is symmetric.
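A single right rotation can be sketched as follows (hypothetical minimal node type; the comments follow the naming above, where k1 is the left child of k2):

```cpp
#include <cassert>

struct Node {
    int key;
    Node *left, *right;
};

// Right rotation at k2: its left child k1 becomes the subtree's new root.
// k1's old right subtree Y stays in BST order, since k1 < Y < k2.
Node* rotateRight(Node* k2) {
    Node* k1 = k2->left;       // k1 must exist when this case applies
    k2->left = k1->right;      // Y moves over to become k2's left subtree
    k1->right = k2;
    return k1;                 // the caller re-links k1 in place of k2
}
```

Only two pointers change, so a single rotation costs O(1).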
17
An example
Below shows how to restore the balance of a
tree by applying a single rotation.
Notice that here we have just added a node
with value 6.5 into the subtree rooted at 7,
labeled as k1, which itself is the left subtree
of the tree rooted at 8, labeled as k2.
18
Restore the Balance
Case 2: Suppose that the imbalance is caused
by the right subtree rooted at k1 being too
high, and the balance can't be restored by a
single rotation. We must apply two successive
single rotations, as follows.
When the imbalance is caused by the left
subtree being too high, we have the following,
symmetric, double rotation.
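A double rotation is literally two single rotations in sequence. The sketch below (same hypothetical node type as before) shows the right-left case, where the imbalance sits in the left subtree of the right child:

```cpp
#include <cassert>

struct Node {
    int key;
    Node *left, *right;
};

Node* rotateRight(Node* n) {       // left child becomes the new subtree root
    Node* l = n->left;
    n->left = l->right;
    l->right = n;
    return l;
}

Node* rotateLeft(Node* n) {        // right child becomes the new subtree root
    Node* r = n->right;
    n->right = r->left;
    r->left = n;
    return r;
}

// Right-left double rotation at k3: first a right rotation at k3's right
// child, then a left rotation at k3 itself.
Node* rotateRightLeft(Node* k3) {
    k3->right = rotateRight(k3->right);
    return rotateLeft(k3);
}
```

The mirror-image left-right case swaps the two calls.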
19
An example
Below shows how to restore the balance of a
tree by applying a double rotation.
Notice that here we have just added a node
with value 14, labeled as k2, into the subtree
rooted at 15, labeled as k1, which itself is the
right subtree of the tree rooted at 7, labeled
as k3.
20
Histogramming
We start with a collection of n keys and must
output the list of distinct keys together with
their frequencies. Assume that n = 10, and
keys = [2, 4, 2, 2, 3, 4, 2, 6, 4, 2].
When the range of the keys is quite small,
this problem can be solved, in Θ(n), by using
an array h[0..r], where r is the maximum key
value, and letting h[i] store the frequency of
key i.
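As a sketch (the function name and vector interface are my own):

```cpp
#include <cassert>
#include <vector>

// Histogram by direct indexing: h[i] counts the occurrences of key i.
// Assumes all keys are integers in the range 0..r.
std::vector<int> histogram(const std::vector<int>& keys, int r) {
    std::vector<int> h(r + 1, 0);
    for (int k : keys) ++h[k];    // one Θ(1) step per key, Θ(n) overall
    return h;
}
```

On the slide's data, the counts come out as 2 → 5, 3 → 1, 4 → 3, 6 → 1.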
21
When the key values are not integers, the above
procedure can't be used. Instead, we can sort
the n keys, in Θ(n log n), then scan the sorted
list from left to right to find the frequencies
of the respective keys.
This solution can be further improved when m,
the number of distinct keys, is quite small. We
can use a (balanced) BST, each of whose
nodes contains two fields, key and frequency.
We insert the n keys into such a tree one by
one; when a key is already there, we simply
increment its frequency by 1. This procedure
has an expected complexity of Θ(n log m). When
a balanced tree is used, this complexity is
guaranteed, even in the worst case.
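In C++ the same idea can be sketched with std::map, which is typically implemented as a red-black tree, i.e., a balanced BST whose nodes pair a key with its frequency:

```cpp
#include <cassert>
#include <map>
#include <vector>

// Tree-based histogram: each of the n insert-or-increment steps costs
// O(log m), where m is the number of distinct keys seen so far.
std::map<double, int> histogram(const std::vector<double>& keys) {
    std::map<double, int> freq;       // balanced search tree: key -> frequency
    for (double k : keys)
        ++freq[k];                    // inserts with count 0 first, then increments
    return freq;
}
```

Iterating over the finished map also yields the distinct keys in ascending order, for free.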
22
m−way search trees
An m−way search tree, when it is not empty,
satisfies the following conditions: 1) Each
internal node has up to m children and contains
between 1 and m − 1 elements. 2) Every node
with p elements has exactly p + 1 children.
3) For any node with p elements, let
k1 < · · · < kp be the keys of those elements,
and let c0, . . . , cp be the p + 1 children. Then,
the keys of all the elements in the subtree c0
are smaller than k1, the keys of all the elements
in the subtree cp are bigger than kp, and the
keys of all the elements in subtree ci are
between ki and ki+1, for 1 ≤ i ≤ p − 1.
23
It is easy to do a search in such a tree. To
insert an element, we search for it first. The
place where the search terminates indicates an
insertion point. But, if there is no space left
in that node, we have to add a node to the tree
to contain this element. It is a bit trickier to
do a deletion.
An m−way search tree of height h may have
as few as h elements, and as many as m^h − 1.
As a result, the height of such a tree with n
elements ranges between logm(n + 1) and n. For
example, a 200-way tree of height 5 can hold up
to 200^5 − 1 = 32 ∗ 10^10 − 1 elements.
When the search tree is located on a disk, the
disk access time is proportional to h, the
height of the tree. Thus, we want to cut the
height as much as possible.
24
B−tree
A B−tree of order m is an m−way search tree.
When it is not empty, it satisfies the following:
1) The root has at least two children. 2) All
internal nodes other than the root have at least
⌈m/2⌉ children. 3) All the external nodes are at
the same level.
25
2-3 trees
A B−tree of order 3 is a 3-way search tree,
so each of its internal nodes can have no
more than 3 children. On the other hand, such
a node is also required to have at least 2
children. Thus, each internal node has either
2 or 3 children, and such a tree is also
called a 2-3 tree.
26
Height of a B−tree of order m
Let N be the total number of keys. Since such
a tree is an m−way search tree, we immediately
have that
h ≥ logm(N + 1).
On the other hand, let d be ⌈m/2⌉. As the root
has at least two children and each of the other
internal nodes has at least d children, the
least numbers of nodes at the successive levels
are 1, 2, 2d, . . . , 2d^(h−2). Hence, the total
number of nodes, n, is at least
1 + 2(d^(h−1) − 1)/(d − 1).
Further calculation shows that
h ≤ 1 + logd((d − 1)(n − 1)/2 + 1).
Since every internal node other than the root
contains at least d − 1 keys, we also have that
n ≤ (N − 1)/(d − 1) + 1; thus,
h ≤ 1 + logd((N + 1)/2).
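A quick numeric check of the two bounds, as reconstructed above (the particular order and key count used in the test are the ones from the disk example later in these notes):

```cpp
#include <cassert>
#include <cmath>

// Height bounds for a B-tree of order m holding N keys, with d = ceil(m/2):
//   log_m(N + 1)  <=  h  <=  1 + log_d((N + 1) / 2)
double lowerBound(double m, double N) {
    return std::log(N + 1) / std::log(m);
}

double upperBound(double m, double N) {
    double d = std::ceil(m / 2);
    return 1 + std::log((N + 1) / 2) / std::log(d);
}
```

For an order-256 tree with 258,111 keys, the bounds evaluate to about 2.25 and 3.43, pinning the height at exactly 3.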
27
Operations of B-tree
Search is pretty straightforward. Basically, to
look for an element, we start with the root and
follow the lead given by the keys; we eventually
either find it or report a failure.
When inserting an element, because of the number
restriction, i.e., each node contains at most
m − 1 keys in a B−tree of order m, we might have
to split a node along the way, which may even
lead to the growth of the tree.
When deleting an element, there are two cases,
w.r.t. the nature of the node that holds it:
1) It is deleted from a leaf. 2) It is to be
deleted from a non-leaf. In case 2) we can
replace the element with either the largest
element in its left neighboring subtree, or the
smallest in its right neighboring subtree, and
then delete the substituting element instead.
Hence, case 2) can always be converted to
case 1).
28
In case 1), if the element is to be deleted from
a node that has at least one more than the
minimum required number of elements, i.e., more
than ⌈m/2⌉ − 1 for a B−tree of order m, we
simply delete it and write back the modified
node.
Otherwise, we try to move in an element from
either its left or right sibling. When neither
sibling has such an extra element, we have to
merge the two nodes, together with the element
between them in the parent node. This might
set off a chain reaction, possibly ending up
reducing the height of the whole tree.
29
The ISAM method
AVL trees and other balanced trees provide
good performance when the dictionary is small
enough to fit into internal memory. When we
have to deal with larger quantities of data, we
can get improved performance by using search
trees of higher degree.
Let's first look at ISAM (indexed sequential
access method), a popular access method for
external memory. In ISAM, the available disk
space is cut into blocks, the smallest unit of
disk space that can be input or output at the
cost of a single seek operation. The dictionary
elements are packed into blocks in ascending
order.
30
For sequential access, the blocks are input in
order, and the elements are retrieved in
ascending order. If each block contains m
elements, then the number of disk accesses per
element is 1/m. On the other hand, to support
random access, an index is maintained, which
keeps the largest key in each block. Since the
index holds only as many keys as there are
blocks, and since a block usually holds many
elements, this index can usually be kept in
internal memory.
To search for a record, the index is searched
first for the block that contains this record.
Then, that block is retrieved and searched
internally for the record. Thus, a single disk
access is enough.
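An in-memory sketch of this two-step lookup (the block layout and names are invented; retrieving the one block stands in for the single disk access):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Block {
    std::vector<int> keys;     // keys within a block, in ascending order
};

// index[i] holds the largest key stored in blocks[i]; the blocks are in
// ascending key order, mirroring the ISAM layout described above.
bool isamSearch(const std::vector<int>& index,
                const std::vector<Block>& blocks, int key) {
    // step 1: search the in-memory index for the block that could hold key
    auto it = std::lower_bound(index.begin(), index.end(), key);
    if (it == index.end()) return false;          // key exceeds every block
    // step 2: fetch that one block (the single disk access) and search it
    const Block& b = blocks[it - index.begin()];
    return std::binary_search(b.keys.begin(), b.keys.end(), key);
}
```

Both steps are binary searches, one over the index and one inside the block.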
31
This technique can be generalized when the data
have to be packed onto several disks: each disk
keeps its own block index, and an overall index
is kept in internal memory.
When searching for something in this case, the
overall index is searched first to determine
the single disk that contains the record. Then,
the block index for that disk is retrieved in
one disk access, and finally, the relevant block
is brought in to do an internal search. So, in
total, just two disk accesses are needed.
On the next slide, we will see a more specific
example, using a B−tree to maintain those
indexes.
32
An Example
Assume that we have a 128MB hard disk divided
into 2^16 pages of 2048 bytes each. We use 4
bytes to represent the address of a page.
We also assume that the whole disk is used to
store just one file of N records, each of which
needs 512 bytes. Thus, each page can store 4
such records. Further assume that the key for
each record is 4 bytes long. Thus, the key and
the disk address of a record together need 8
bytes, so each page can hold 256 of them.
If we keep this index information separate from
the records, we need N/256 pages to store it.
We also need N/4 pages to keep the records, and
the total number of pages cannot exceed 2^16.
This leads to the fact that N ≤ 258,111. Assume
that the equality holds.
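The arithmetic behind N ≤ 258,111 can be checked directly; to avoid fractional pages, the capacity condition N/256 + N/4 ≤ 2^16 is rewritten as 65N ≤ 256 · 2^16:

```cpp
#include <cassert>

// Page budget from the example: N/256 index pages plus N/4 record pages
// must fit in 2^16 pages, i.e. N/256 + 64N/256 = 65N/256 <= 65536.
bool fits(long long N) {
    return 65 * N <= 256LL * 65536;
}
```

The bound is tight: 65 × 258,111 = 16,777,215, just one short of 256 · 2^16 = 16,777,216.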
33
We will use a B-tree to hold the indices. As
each page can hold 256 of them, we decide to
use a tree of order 256. The height formula
tells us that the height of such a tree is at
most 3 (the bound 1 + log128((N + 1)/2)
evaluates to about 3.43). Therefore, to get
access to any record on the disk, it takes at
most three accesses to search the index and one
more to get the record itself, for a total of
four disk accesses.
Can we store the whole B-tree in RAM with
4MB, so that only one disk access is needed?
Each node of the tree must have room for 255
indices of 8 bytes each and 256 pointers of 4
bytes each; thus a single node costs 3064 bytes,
which leads to the fact that at most 1368 nodes
will fit in the RAM. However, the root may
contain just 1 index and each of the other
nodes may contain as few as 127 indexes, so the
total number of indices contained in those
nodes might be just 1 + 1367 × 127 = 173,610,
which is less than the total number of records.
Thus, the whole tree cannot fit into the RAM.
34
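The node-size and capacity arithmetic can likewise be verified. The guaranteed capacity is computed for two per-node minima, 127 keys (what the order-256 definition, at least ⌈256/2⌉ = 128 children, strictly guarantees) and 128 keys; the conclusion holds either way:

```cpp
#include <cassert>

// Node layout for the order-256 B-tree: 255 index entries of 8 bytes plus
// 256 child pointers of 4 bytes.
const int nodeBytes = 255 * 8 + 256 * 4;
const int nodesInRam = 4 * 1024 * 1024 / nodeBytes;  // nodes fitting in 4MB

// Worst-case index capacity if the root holds 1 entry and every other
// node holds only its minimum number of entries.
const long long capacity127 = 1 + (long long)(nodesInRam - 1) * 127;
const long long capacity128 = 1 + (long long)(nodesInRam - 1) * 128;
```

Either capacity falls well short of the 258,111 records, so the tree is not guaranteed to fit in 4MB of RAM.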