Chapter 11
Binary Search Trees, etc.
The Dictionary data type is a collection of
data items, each of which has a key component.
If we use a hash table to represent a dictionary,
it only takes, on average, Θ(1) time to carry
out such basic operations as search, insertion,
and deletion. However, the worst-case time is O(n).
A hash table also cannot provide good average time
to carry out other operations such as outputting
the elements in ascending order of keys, finding
the kth element in ascending order, etc.
All of those operations are supported efficiently
by a balanced search tree: a sorted output can be
produced in Θ(n), and each of the other operations
can be done in Θ(log n).
We will begin with binary search trees, which are
not always "balanced."
1
Binary Search Trees
A binary search tree is a binary tree that may
be empty. A nonempty binary search tree has the
following properties: 1) Every element has a
key component, and all the key values are distinct.
2) The keys in the left subtree of the root are
smaller than the key in the root. 3) The keys
in the right subtree of the root are larger than
the key in the root. 4) Both the left and right
subtrees are also binary search trees.
An indexed BST is derived from a BST by
adding a field, LeftSize, to each tree node,
whose value is the number of nodes in that
node's left subtree plus 1.
2
The BSTree class
Since the number of nodes in a binary search
tree, as well as its shape, will change frequently,
it is appropriate to use a linked structure to
represent it. Moreover, it is convenient to derive
the class from the class of binary trees. For
example, we can then call the InOutput operation
defined for binary trees to output all the
elements in ascending order.
template<class E, class K>
class BSTree : public BinaryTree<E> {
public:
   bool Search(const K& k, E& e) const;
   BSTree<E,K>& Insert(const E& e);
   BSTree<E,K>& Delete(const K& k, E& e);
   void Ascend() {InOutput();}  // inorder traversal gives ascending order
};
3
The search operation
When we search for x, we begin at the root.
If it is NULL, then x can’t be found. Otherwise,
we compare x with the key of the root. If they
are equal, the search is a success. If it is larger
that the key of the root, it must be in the
right subtree, if it is inside the tree at all; so,
we recursively search for x in the right subtree.
Otherwise, we search the left one.
template<class E, class K>
bool BSTree<E,K>::Search(const K& k, E& e) const
{
   BinaryTreeNode<E> *p = root;
   while (p)                        // examine p->data
      if (k < p->data) p = p->LeftChild;
      else if (k > p->data) p = p->RightChild;
      else {                        // found the element
         e = p->data;
         return true;
      }
   return false;
}
4
Obviously, the search can be done in O(h),
where h is the height of the tree.
With an indexed BST, we can also search for
the kth element, again in O(h). For example,
to look for the third element in the following
indexed BST, we check the LeftSize field of
the root; since it is 4, the element must be
in the left subtree. We then check this field
for the root of the left subtree; it is 2,
hence the element we are looking for is the
smallest one in that node's right subtree. We
find that the LeftSize of that subtree's root
is 1, so we can conclude that this root itself
is the element we are looking for.
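The steps above can be sketched in C++. The node layout below (IndexedNode, with a leftSize field equal to the number of nodes in the left subtree plus 1) is a hypothetical minimal version of the indexed BST described here:

```cpp
#include <cassert>
#include <cstddef>

// Minimal indexed-BST node (hypothetical layout): leftSize is the number
// of nodes in the node's left subtree, plus 1.
struct IndexedNode {
    int data;
    int leftSize;
    IndexedNode *left;
    IndexedNode *right;
};

// Return the k-th smallest element (1-based), or nullptr if k is out of
// range. Each iteration descends one level, so the search runs in O(h).
const IndexedNode* kth(const IndexedNode* p, int k) {
    while (p) {
        if (k == p->leftSize)        // exactly k-1 elements are smaller
            return p;
        if (k < p->leftSize)         // k-th element lies in the left subtree
            p = p->left;
        else {                       // skip the left subtree and this node
            k -= p->leftSize;
            p = p->right;
        }
    }
    return nullptr;                  // k was out of range
}
```

Note that no key comparisons are needed: the LeftSize counts alone steer the descent.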
5
The insertion operation
To insert an element with key k inside a BST,
we first verify that this key does not occur any-
where inside the tree. If the search succeeds,
the insertion will not be done; otherwise, the
element will be inserted where the search re-
ports a failure.
The following are two examples:
If we insert into an indexed BST, we also have
to adjust the LeftSize fields of the nodes on
the path from the root down to the insertion
point. In both cases, insertion takes O(h),
where h is the height of the tree.
6
The code
template<class E, class K>
BSTree<E,K>& BSTree<E,K>::Insert(const E& e)
{
   BinaryTreeNode<E> *p = root,   // search pointer
                     *pp = 0;     // parent of p
   while (p) {                    // find a place to insert e
      pp = p;
      if (e < p->data) p = p->LeftChild;
      else if (e > p->data) p = p->RightChild;
      else throw BadInput();      // duplicate key
   }
   // get a node for e and attach it to pp
   BinaryTreeNode<E> *r = new BinaryTreeNode<E>(e);
   if (root) {                    // tree is nonempty
      if (e < pp->data) pp->LeftChild = r;
      else pp->RightChild = r;
   }
   else root = r;                 // insertion into an empty tree
   return *this;
}
7
The deletion operation
If the node to be deleted is a leaf, we simply
delete it. Deletion is also easy when the node
has just one child. When it has two children, it
becomes a bit tricky, since we have only one
location for the two subtrees to be hooked on.
8
template<class E, class K>
BSTree<E,K>& BSTree<E,K>::Delete(const K& k, E& e)
{
   // search for the node with key k; pp is its parent
   BinaryTreeNode<E> *p = root, *pp = 0;
   while (p && p->data != k) {
      pp = p;
      if (k < p->data) p = p->LeftChild;
      else p = p->RightChild;
   }
   if (!p) throw BadInput();      // no element with key k
   e = p->data;                   // save the element to be deleted

   // when p has two children, replace p->data with the largest
   // element in its left subtree, then delete that node instead
   if (p->LeftChild && p->RightChild) {
      BinaryTreeNode<E> *s = p->LeftChild, *ps = p;
      while (s->RightChild) {     // move to the rightmost node
         ps = s;
         s = s->RightChild;
      }
      p->data = s->data;
      p = s; pp = ps;
   }

   // now p has at most one child; link that child, c, to pp
   BinaryTreeNode<E> *c;
   if (p->LeftChild) c = p->LeftChild;
   else c = p->RightChild;
   if (p == root) root = c;
   else {
      if (p == pp->LeftChild) pp->LeftChild = c;
      else pp->RightChild = c;
   }
   delete p;
   return *this;
}
9
A BST could be high
The height of a binary search tree can be quite
large. For example, if we use the above insertion
function to add the elements 1, 2, . . . , n into an
empty BST, the resulting BST is in fact a linear
list. As a result, a basic operation can take as
much as O(n) time.
But, if insertions and deletions are made at
random, then the height of such a BST will be
O(log n).
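This worst case is easy to reproduce. The sketch below (a stripped-down, non-template insert with hypothetical names, not the class version given earlier) builds one BST from sorted keys and one from shuffled keys and compares their heights:

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <random>
#include <vector>

// A stripped-down BST node and insertion (duplicates are ignored).
struct Node {
    int key;
    Node *left = nullptr, *right = nullptr;
};

Node* insert(Node* root, int key) {
    if (!root) return new Node{key};
    if (key < root->key) root->left = insert(root->left, key);
    else if (key > root->key) root->right = insert(root->right, key);
    return root;
}

int height(const Node* p) {        // number of nodes on a longest root-leaf path
    if (!p) return 0;
    return 1 + std::max(height(p->left), height(p->right));
}

int heightAfterInserting(std::vector<int> keys) {
    Node* root = nullptr;
    for (int k : keys) root = insert(root, k);
    return height(root);           // nodes are leaked; fine for a demo
}
```

Inserting 1..100 in order yields height 100, while a shuffled order stays far shallower.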
10
A little assignment
Homework 11.1. Generate a random permutation
of the integers 1 through n, for n ∈ {100, 500,
1000, 10000, 20000, 50000}. Insert them into an
initially empty binary search tree in that
random order. Measure the height of the
resulting search tree. Repeat this experiment
for several random permutations and then
calculate the average of the measured heights.
Compare the obtained average with the
theoretical figure of 2⌈log2(n + 1)⌉.
11
AVL trees
We want to define the BST operations in such
a way that the tree stays balanced at all times,
while preserving the binary search tree proper-
ties. The essential algorithms for insertion and
deletion are exactly the same as for a BST,
except that, since they might destroy the prop-
erty of being balanced, we have to restore the
balance of the tree when needed. Such a tree is
called a balanced tree; it is guaranteed that all
the basic operations applied to a balanced tree
can be done in Θ(log n).
The class of AVL trees is one of the more pop-
ular balanced trees. A nonempty binary tree
with TL and TR as its left and right subtrees is
an AVL tree iff both TL and TR are AVL trees,
and |hL − hR| ≤ 1, where hL and hR are the
heights of TL and TR.
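The definition translates directly into a checker. This is a sketch with a hypothetical minimal node type; it returns the height when the tree is an AVL tree and −1 otherwise:

```cpp
#include <cassert>
#include <cstdlib>

struct Node {          // hypothetical minimal binary-tree node
    Node *left, *right;
};

// Return the height of the tree if it satisfies the AVL property
// (|hL - hR| <= 1 at every node), and -1 if it does not.
int avlHeight(const Node* p) {
    if (!p) return 0;
    int hl = avlHeight(p->left);
    if (hl < 0) return -1;          // an imbalance was already found below
    int hr = avlHeight(p->right);
    if (hr < 0) return -1;
    if (std::abs(hl - hr) > 1) return -1;
    return 1 + (hl > hr ? hl : hr);
}
```

Because the recursion matches the definition clause for clause, the check costs one visit per node.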
12
AVL search trees
An AVL search tree is a binary search tree
which is also an AVL tree. It has the following
properties.
1. The height of an AVL tree with n nodes is
O(log n).
2. For every n ≥ 0, there exists an AVL tree
with n nodes.
3. We can search an n-element AVL search
tree in O(log n).
4. A new element can be inserted into an
n-element AVL search tree so that the resulting
(n + 1)-element tree is also an AVL tree.
5. An element can be deleted from an n-element
AVL search tree so that the resulting
(n − 1)-element tree is also an AVL tree, and
the deletion can be done in O(log n).
13
Fibonacci tree
We first calculate the least number of nodes in
a balanced binary tree of height h: let S(h) be
this number. Obviously, S(0) = 1 and
S(1) = 2.
In general, we can construct a minimum balanced
tree of height h by joining, under a new root,
two minimum subtrees of heights h − 1 and h − 2.
We call such trees Fibonacci trees, for the
obvious reason.
By an inductive argument, we have that
S(h) = S(h − 1) + S(h − 2) + 1 ≥ fib(h),
which leads to S(h) > (3/2)^h − 1. Thus, the
least number of nodes in such a tree, n0,
satisfies n0 > (3/2)^h − 1.
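The recurrence and the claimed lower bound are easy to check numerically; this sketch tabulates S(h) and verifies S(h) > (3/2)^h − 1 for small h:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// S(h) = least number of nodes in a balanced (AVL) tree of height h:
// S(0) = 1, S(1) = 2, S(h) = S(h-1) + S(h-2) + 1.
std::vector<long long> minAvlNodes(int maxH) {
    std::vector<long long> s{1, 2};
    for (int h = 2; h <= maxH; ++h)
        s.push_back(s[h - 1] + s[h - 2] + 1);
    return s;
}
```

The sequence begins 1, 2, 4, 7, 12, 20, . . . , each term one more than the sum of the previous two, mirroring the Fibonacci numbers.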
14
Height of an AVL tree
Therefore, given any AVL tree of height h with
n nodes, we have that
n ≥ n0 > (3/2)^h − 1.
Further calculation leads to the following
result:
h ≤ 1.44 log2(n + 2) − 0.328.
A simpler, but unproved, result is that
h ≤ log2(n + 1) + 0.25.
15
Represent an AVL tree
An AVL tree is usually represented using the
linked representation for binary trees. To pro-
vide information to restore the balance, a bal-
ance factor is added to each node. For any
node, this factor is defined to be the differ-
ence between the heights of its left and right
subtrees.
It is easy to see that any search takes O(logn).
16
Restore the Balance
When a tree becomes unbalanced, as a result
of either insertion or deletion, we can restore
its balance by rotating the tree at and/or below
the node where the imbalance is detected.
Case 1: Assume that the left subtree of k2
has height larger than that of its right subtree,
caused by adding a node into subtree X or
deleting a node from subtree Z. We apply a
right rotation, which restores the balance; the
subtree is now rooted at k1.
Notice that the above rotation is symmetric.
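A single right rotation can be sketched as follows (hypothetical minimal node type; the comments follow the naming above, where k1 is the left child of k2):

```cpp
#include <cassert>

struct Node {
    int key;
    Node *left, *right;
};

// Right rotation at k2: its left child k1 becomes the subtree's new root.
// k1's old right subtree Y stays in BST order, since k1 < Y < k2.
Node* rotateRight(Node* k2) {
    Node* k1 = k2->left;       // k1 must exist when this case applies
    k2->left = k1->right;      // Y moves over to become k2's left subtree
    k1->right = k2;
    return k1;                 // the caller re-links k1 in place of k2
}
```

Only two pointers change, so a single rotation costs O(1).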
17
An example
Below shows how to restore the balance of a
tree by applying a single rotation.
Notice that here we have just added a node
with value 6.5 into the subtree rooted at 7,
labeled as k1, which itself is the left subtree
of the tree rooted at 8, labeled as k2.
18
Restore the Balance
Case 2: Suppose that the imbalance is caused
by the right subtree rooted at k1 being too
high, and the balance can't be restored by a
single rotation. We must apply two successive
single rotations, as follows.
When the imbalance is caused by the left
subtree being too high, we have the following,
symmetric, double rotation.
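A double rotation is literally two single rotations in sequence. The sketch below (same hypothetical node type as before) shows the right-left case, where the imbalance sits in the left subtree of the right child:

```cpp
#include <cassert>

struct Node {
    int key;
    Node *left, *right;
};

Node* rotateRight(Node* n) {       // left child becomes the new subtree root
    Node* l = n->left;
    n->left = l->right;
    l->right = n;
    return l;
}

Node* rotateLeft(Node* n) {        // right child becomes the new subtree root
    Node* r = n->right;
    n->right = r->left;
    r->left = n;
    return r;
}

// Right-left double rotation at k3: first a right rotation at k3's right
// child, then a left rotation at k3 itself.
Node* rotateRightLeft(Node* k3) {
    k3->right = rotateRight(k3->right);
    return rotateLeft(k3);
}
```

The mirror-image left-right case swaps the two calls.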
19
An example
Below shows how to restore the balance of a
tree by applying a double rotation.
Notice that here we have just added a node
with value 14, labeled as k2, into the subtree
rooted at 15, labeled as k1, which itself is the
right subtree of the tree rooted at 7, labeled
as k3.
20
Histogramming
We start with a collection of n keys and must
output the list of distinct keys together with
their frequencies. Assume that n = 10, and
keys = [2, 4, 2, 2, 3, 4, 2, 6, 4, 2].
When the range of the keys is quite small,
this problem can be solved, in Θ(n), by using
an array h[0..r], where r is the maximum key
value, and letting h[i] store the frequency of
key i.
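As a sketch (the function name and vector interface are my own):

```cpp
#include <cassert>
#include <vector>

// Histogram by direct indexing: h[i] counts the occurrences of key i.
// Assumes all keys are integers in the range 0..r.
std::vector<int> histogram(const std::vector<int>& keys, int r) {
    std::vector<int> h(r + 1, 0);
    for (int k : keys) ++h[k];    // one Θ(1) step per key, Θ(n) overall
    return h;
}
```

On the slide's data, the counts come out as 2 → 5, 3 → 1, 4 → 3, 6 → 1.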
21
When the key values are not integers, the above
procedure can't be used. Instead, we can sort
the n keys, in Θ(n log n), then scan the sorted
list from left to right to find the frequencies
of the respective keys.
This solution can be further improved when m,
the number of distinct keys, is quite small. We
can use a (balanced) BST, each of whose
nodes contains two fields, key and frequency.
We insert the n keys into such a tree one by
one; when a key is already there, we simply
increment its frequency by 1. This procedure
has an expected complexity of Θ(n log m). When
a balanced tree is used, this complexity is
guaranteed, even in the worst case.
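In C++ the same idea can be sketched with std::map, which is typically implemented as a red-black tree, i.e., a balanced BST whose nodes pair a key with its frequency:

```cpp
#include <cassert>
#include <map>
#include <vector>

// Tree-based histogram: each of the n insert-or-increment steps costs
// O(log m), where m is the number of distinct keys seen so far.
std::map<double, int> histogram(const std::vector<double>& keys) {
    std::map<double, int> freq;       // balanced search tree: key -> frequency
    for (double k : keys)
        ++freq[k];                    // inserts with count 0 first, then increments
    return freq;
}
```

Iterating over the finished map also yields the distinct keys in ascending order, for free.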
22
m−way search trees
An m−way search tree, when it is not empty,
satisfies the following conditions: 1) Each
internal node has up to m children and contains
between 1 and m − 1 elements. 2) Every node
with p elements has exactly p + 1 children.
3) For any node with p elements, let
k1 < · · · < kp be the keys of those elements,
and let c0, . . . , cp be the p + 1 children. Then,
the keys of all the elements in the subtree c0
are smaller than k1, the keys of all the elements
in the subtree cp are bigger than kp, and the
keys of all the elements in subtree ci are
between ki and ki+1, for 1 ≤ i ≤ p − 1.
23
It is easy to do a search in such a tree. To
insert an element, we search for it first. The
place where the search terminates indicates an
insertion point. But, if there is no space left
in that node, we have to add a node to the tree
to contain this element. It is a bit trickier to
do a deletion.
An m−way search tree of height h may have
as few as h elements, and as many as m^h − 1.
As a result, the height of such a tree with n
elements ranges between logm(n + 1) and n. For
example, a 200-way tree of height 5 can hold up
to 200^5 − 1 = 32 ∗ 10^10 − 1 elements.
When the search tree is located on a disk, the
disk access time is proportional to h, the
height of the tree. Thus, we want to cut the
height as much as possible.
24
B−tree
A B−tree of order m is an m−way search tree.
When it is not empty, it satisfies the following:
1) The root has at least two children. 2) All
internal nodes other than the root have at least
⌈m/2⌉ children. 3) All the external nodes are at
the same level.
25
2-3 trees
A B−tree of order 3 is a 3-way search tree,
so each of its internal nodes can have no
more than 3 children. On the other hand, such
a node is also required to have at least 2
children. Thus, each internal node has either
2 or 3 children, and such a tree is also
called a 2-3 tree.
26
Height of a B−tree of order m
Let N be the total number of keys. Since such
a tree is an m−way search tree, we immediately
have that
h ≥ logm(N + 1).
On the other hand, let d be ⌈m/2⌉. As the root
has at least two children and each of the other
internal nodes has at least d children, the
least numbers of nodes at the successive levels
are 1, 2, 2d, . . . , 2d^(h−2). Hence, the total
number of nodes, n, is at least
1 + 2(d^(h−1) − 1)/(d − 1).
Further calculation shows that
h ≤ 1 + logd((d − 1)(n − 1)/2 + 1).
Since every internal node other than the root
contains at least d − 1 keys, we also have that
n ≤ (N − 1)/(d − 1) + 1; thus,
h ≤ 1 + logd((N + 1)/2).
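A quick numeric check of the two bounds, as reconstructed above (the particular order and key count used in the test are the ones from the disk example later in these notes):

```cpp
#include <cassert>
#include <cmath>

// Height bounds for a B-tree of order m holding N keys, with d = ceil(m/2):
//   log_m(N + 1)  <=  h  <=  1 + log_d((N + 1) / 2)
double lowerBound(double m, double N) {
    return std::log(N + 1) / std::log(m);
}

double upperBound(double m, double N) {
    double d = std::ceil(m / 2);
    return 1 + std::log((N + 1) / 2) / std::log(d);
}
```

For an order-256 tree with 258,111 keys, the bounds evaluate to about 2.25 and 3.43, pinning the height at exactly 3.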
27
Operations of B-tree
Search is pretty straightforward. Basically, to
look for an element, we start with the root and
follow the lead given by the keys; we eventually
either find it or report a failure.
When inserting an element, because of the number
restriction, i.e., each node contains at most
m − 1 keys in a B−tree of order m, we might have
to split a node along the way, which may even
lead to the growth of the tree.
When deleting an element, there are two cases,
w.r.t. the nature of the node that holds it:
1) It is deleted from a leaf. 2) It is to be
deleted from a non-leaf. In case 2) we can
replace the element with either the largest
element in its left neighboring subtree, or the
smallest in its right neighboring subtree, and
then delete the substituting element instead.
Hence, case 2) can always be converted to
case 1).
28
In case 1), if the element is to be deleted from
a node that has at least one more than the
minimum required number of elements, i.e., more
than ⌈m/2⌉ − 1 for a B−tree of order m, we
simply delete it and write back the modified
node.
Otherwise, we try to move in an element from
either its left or right sibling. When neither
sibling has such an extra element, we have to
merge the two nodes, together with the element
between them in the parent node. This might
set off a chain reaction, possibly ending up
reducing the height of the whole tree.
29
The ISAM method
AVL trees and other balanced trees provide
good performance when the dictionary is small
enough to fit into internal memory. When we
have to deal with larger quantities of data, we
can get improved performance by using search
trees of higher degree.
Let's first look at ISAM (indexed sequential
access method), a popular access method for
external memory. In ISAM, the available disk
space is cut into blocks, the smallest unit of
disk space that can be input or output at the
cost of a single seek operation. The dictionary
elements are packed into blocks in ascending
order.
30
For sequential access, the blocks are input in
order, and the elements are retrieved in
ascending order. If each block contains m
elements, then the number of disk accesses per
element is 1/m. On the other hand, to support
random access, an index is maintained, which
keeps the largest key in each block. Since the
index holds only as many keys as there are
blocks, and since a block usually holds many
elements, this index can usually be kept in
internal memory.
To search for a record, the index is searched
first for the block that contains this record.
Then, that block is retrieved and searched
internally for the record. Thus, a single disk
access is enough.
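An in-memory sketch of this two-step lookup (the block layout and names are invented; retrieving the one block stands in for the single disk access):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Block {
    std::vector<int> keys;     // keys within a block, in ascending order
};

// index[i] holds the largest key stored in blocks[i]; the blocks are in
// ascending key order, mirroring the ISAM layout described above.
bool isamSearch(const std::vector<int>& index,
                const std::vector<Block>& blocks, int key) {
    // step 1: search the in-memory index for the block that could hold key
    auto it = std::lower_bound(index.begin(), index.end(), key);
    if (it == index.end()) return false;          // key exceeds every block
    // step 2: fetch that one block (the single disk access) and search it
    const Block& b = blocks[it - index.begin()];
    return std::binary_search(b.keys.begin(), b.keys.end(), key);
}
```

Both steps are binary searches, one over the index and one inside the block.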
31
This technique can be generalized when the data
have to be packed onto several disks: each disk
keeps its own block index, and an overall index
is kept in internal memory.
When searching for something in this case, the
overall index is searched first to determine
the single disk that contains the record. Then,
the block index for that disk is retrieved in
one disk access, and finally, the relevant block
is brought in to do an internal search. So, in
total, just two disk accesses are needed.
On the next slide, we will see a more specific
example, using a B−tree to maintain those
indexes.
32
An Example
Assume that we have a 128MB hard disk divided
into 2^16 pages of 2048 bytes each. We use 4
bytes to represent the address of a page.
We also assume that the whole disk is used to
store just one file of N records, each of which
needs 512 bytes. Thus, each page can store 4
such records. Further assume that the key for
each record is 4 bytes long. Thus, the key and
the disk address of a record together need 8
bytes, so each page can hold 256 of them.
If we keep this index information separate from
the records, we need N/256 pages to store it.
We also need N/4 pages to keep the records, and
the total number of pages cannot exceed 2^16.
This leads to the fact that N ≤ 258,111. Assume
that the equality holds.
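The arithmetic behind N ≤ 258,111 can be checked directly; to avoid fractional pages, the capacity condition N/256 + N/4 ≤ 2^16 is rewritten as 65N ≤ 256 · 2^16:

```cpp
#include <cassert>

// Page budget from the example: N/256 index pages plus N/4 record pages
// must fit in 2^16 pages, i.e. N/256 + 64N/256 = 65N/256 <= 65536.
bool fits(long long N) {
    return 65 * N <= 256LL * 65536;
}
```

The bound is tight: 65 × 258,111 = 16,777,215, just one short of 256 · 2^16 = 16,777,216.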
33
We will use a B-tree to hold the indices. As
each page can hold 256 of them, we decide to
use a tree of order 256. The height formula
tells us that the height of such a tree is at
most 3 (the bound 1 + log128((N + 1)/2)
evaluates to about 3.43). Therefore, to get
access to any record on the disk, it takes at
most three accesses to search the index and one
more to get the record itself, for a total of
four disk accesses.
Can we store the whole B-tree in RAM with
4MB, so that only one disk access is needed?
Each node of the tree must have room for 255
indices of 8 bytes each and 256 pointers of 4
bytes each; thus a single node costs 3064 bytes,
which leads to the fact that at most 1368 nodes
will fit in the RAM. However, the root may
contain just 1 index and each of the other
nodes may contain as few as 127 indexes, so the
total number of indices contained in those
nodes might be just 1 + 1367 × 127 = 173,610,
which is less than the total number of records.
Thus, the whole tree cannot fit into the RAM.
34
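The node-size and capacity arithmetic can likewise be verified. The guaranteed capacity is computed for two per-node minima, 127 keys (what the order-256 definition, at least ⌈256/2⌉ = 128 children, strictly guarantees) and 128 keys; the conclusion holds either way:

```cpp
#include <cassert>

// Node layout for the order-256 B-tree: 255 index entries of 8 bytes plus
// 256 child pointers of 4 bytes.
const int nodeBytes = 255 * 8 + 256 * 4;
const int nodesInRam = 4 * 1024 * 1024 / nodeBytes;  // nodes fitting in 4MB

// Worst-case index capacity if the root holds 1 entry and every other
// node holds only its minimum number of entries.
const long long capacity127 = 1 + (long long)(nodesInRam - 1) * 127;
const long long capacity128 = 1 + (long long)(nodesInRam - 1) * 128;
```

Either capacity falls well short of the 258,111 records, so the tree is not guaranteed to fit in 4MB of RAM.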