indexes a heap file allows record retrieval: by specifying the rid, or by scanning all records...


Upload: ralph-blair

Post on 20-Jan-2016


Page 1: Indexes A Heap file allows record retrieval: by specifying the rid, or by scanning all records sequentially Sometimes, retrieval of records by specifying

Indexes

• A heap file allows record retrieval:

– by specifying the rid, or

– by scanning all records sequentially

• Sometimes records must be retrieved by specifying the values in one or more fields (a semantic search or value-based query), e.g.,

– Find all students in the CS dept; find students with gpa > 3

• Indexes are files (separate from the data file they index) that enable answering these value-based queries efficiently.

• Indexes contain “search keys”, k, which are values from the attribute being indexed, and “data entries”, k*, which lead us to the records containing the search key value (usually pointers).


Index Classification

• Primary vs. secondary: If the search key contains the clustered primary key, then it is called a primary index, else it is called a secondary index.

• Clustered vs. unclustered: If the order of the data records is the same as, or close to, the order of the data entries, the index is called a clustered index; otherwise it is unclustered.

– A file can be clustered on at most 1 attribute (search key)

– Cost of retrieving data records through an index varies greatly based on whether index is clustered or not!


Primary Index

PRIMARY INDEX: I(k,p)

k = "key" field values from the ordered (clustered) field of the file, with the uniqueness property (each value can occur at most once).

p = pointer to the page containing the record(s) with value k.

Primary indexes can be either:

DENSE: every record is indexed, or

NON-DENSE: only the key values of the records at the beginning of a page (the anchor record of the page) are indexed, and the pointer is then a page number only.

Example: Assume the blocking factor (bfr) is 2, which means 2 records/page.

STUDENT
|S#|SNAME |LCODE |pg
|17|BAID  |NY2091|1
|25|CLAY  |NJ5101|1
|32|THAISZ|NJ5102|2
|38|GOOD  |FL6321|2
|57|BROWN |NY2092|3
|83|THOM  |ND3450|3

Non-dense Primary Index on S#
|S#|pg
|17| 1
|32| 2
|57| 3

Dense Primary Index on S# (pg and offset together form the RID)
|S#|pg|offset
|17| 1| 0
|25| 1| 1
|32| 2| 0
|38| 2| 1
|57| 3| 0
|83| 3| 1

Inserting and deleting are major problems:

- must move records to maintain ordering
- anchors change (in the non-dense case)
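A lookup through the non-dense primary index can be sketched as follows. This is only an illustration under assumed names (`PAGES`, `SPARSE_INDEX`, `sparse_lookup` are not from the slides): binary-search the anchors for the last one that is less than or equal to the key, then scan that single page.

```python
from bisect import bisect_right

# Data pages of the STUDENT example (bfr = 2), keyed by page number.
PAGES = {
    1: [(17, "BAID"), (25, "CLAY")],
    2: [(32, "THAISZ"), (38, "GOOD")],
    3: [(57, "BROWN"), (83, "THOM")],
}
# Non-dense primary index: (anchor S#, page number), sorted on S#.
SPARSE_INDEX = [(17, 1), (32, 2), (57, 3)]


def sparse_lookup(key):
    """Return the record whose S# equals key, or None if it is absent."""
    anchors = [k for k, _ in SPARSE_INDEX]
    pos = bisect_right(anchors, key) - 1      # last anchor <= key
    if pos < 0:
        return None                           # key precedes every anchor
    page_no = SPARSE_INDEX[pos][1]
    for record in PAGES[page_no]:             # scan the one candidate page
        if record[0] == key:
            return record
    return None


print(sparse_lookup(38))   # record on page 2
print(sparse_lookup(40))   # not in the file
```

Because S# is unique and the file is ordered on it, one page read after the index search is enough.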


Clustering Index

A clustering index is like a primary index except that the attribute need not be a key:
- the file must be clustered on the attribute, k
- the pointer for any k is the address of the 1st page with that k-value

ENROLL2
|S#|C#|GRADE|pg
|17|6 | 96  |1
|25|6 | 76  |1
|32|6 | 62  |2
|38|6 | 98  |2
|32|6 | 91  |3
|25|7 | 68  |3
|32|8 | 89  |4
|17|9 | 95  |4

Dense Clustering_Index on C#
|C#|pg|
|6 | 1|
|7 | 3|
|8 | 4|
|9 | 4|

Non-dense Clustering_Index on C# (indexing new anchor records only)
|C#|pg|
|6 | 1|
|8 | 4|

There's no more search overhead in the index with this 2nd type of non-dense clustering index, but

- How can you know which page has C#=7? (search pages starting at pg=1)

- How can you know which page has C#=9? (search pages starting at pg=4)
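The forward search that the non-dense clustering index requires can be sketched like this. The function and variable names are assumptions made for illustration; the data is the ENROLL2 example. The sketch starts at the page of the largest indexed C# not exceeding the target and reads pages until a larger C# appears.

```python
from bisect import bisect_right

# ENROLL2 pages as (S#, C#, GRADE); the file is clustered on C#, 2 records per page.
PAGES = {
    1: [(17, 6, 96), (25, 6, 76)],
    2: [(32, 6, 62), (38, 6, 98)],
    3: [(32, 6, 91), (25, 7, 68)],
    4: [(32, 8, 89), (17, 9, 95)],
}
NONDENSE = [(6, 1), (8, 4)]   # (C#, page) for new anchor values only


def find_course(c_no):
    """Return (matching records, number of data pages read)."""
    keys = [k for k, _ in NONDENSE]
    pos = bisect_right(keys, c_no) - 1
    if pos < 0:
        return [], 0
    page_no = NONDENSE[pos][1]
    hits, pages_read = [], 0
    while page_no in PAGES:
        pages_read += 1
        page = PAGES[page_no]
        hits += [r for r in page if r[1] == c_no]
        if page[-1][1] > c_no:    # sorted on C#: no later page can match
            break
        page_no += 1
    return hits, pages_read


print(find_course(7))   # the scan starts at pg=1
print(find_course(9))   # the scan starts at pg=4
```

The page counts make the trade-off visible: the smaller index costs extra data-page reads for values that are not anchors.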


Secondary Index

These indexes are the same as the previous ones except that
- the file need not be clustered on k
- p points to the page or record containing k
- every record must be indexed (dense)

ENROLL (unclustered on C#)
|S#|C#|GRADE|pg
|32|8 | 89  |1
|25|6 | 76  |1
|32|6 | 62  |2
|25|8 | 86  |2
|38|6 | 98  |3
|32|7 | 91  |3
|17|5 | 96  |4
|25|7 | 68  |4
|17|8 | 95  |5

Option1: If there are multiple occurrences of k, use multiple index entries for that k.

Secondary_Index, Option1 on C#
|C#|pg
|5 | 4
|6 | 1
|6 | 2
|6 | 3
|7 | 3
|7 | 4
|8 | 1
|8 | 2
|8 | 5

Option2: Use repeating groups of pointers (requires variable-length pointer lists).

|C#|page
|5 | 4
|6 | 1,2,3
|7 | 3,4
|8 | 1,2,5

Option3: Use 1 index entry for each value, with 1 pointer to a "list" or "linked list" of record pointers (1 level of indirection).

Secondary_Index, opt3 on C#
|C#| page
|5 | -->|4|
|6 | -->|1|->|2|->|3|
|7 | -->|3|->|4|
|8 | -->|1|->|2|->|5|
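As a rough sketch of how Option1 and Option2 differ, the snippet below builds both forms from the ENROLL table. The function names are invented for illustration, and the page numbers are read from the pg column of ENROLL.

```python
from collections import defaultdict

# ENROLL rows as (S#, C#, GRADE, pg); the file is not ordered on C#.
ENROLL = [
    (32, 8, 89, 1), (25, 6, 76, 1), (32, 6, 62, 2), (25, 8, 86, 2),
    (38, 6, 98, 3), (32, 7, 91, 3), (17, 5, 96, 4), (25, 7, 68, 4),
    (17, 8, 95, 5),
]


def option1(records):
    """Dense secondary index: one (C#, pg) entry per record, sorted on C#."""
    return sorted((c_no, pg) for _, c_no, _, pg in records)


def option2(records):
    """One entry per C# value holding a variable-length list of pages."""
    groups = defaultdict(list)
    for _, c_no, _, pg in records:
        if pg not in groups[c_no]:
            groups[c_no].append(pg)
    return {c_no: sorted(pages) for c_no, pages in sorted(groups.items())}


print(option1(ENROLL))
print(option2(ENROLL))
```

Option3 would hold the same page lists as Option2, only stored outside the index entry and reached through one extra pointer.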


Multi-level Index (made up of an index on an index)

For any index, since it is a file clustered on the key, k, it can have a primary or clustering index on it (constituting the second level of the multilevel index).

STUDENT
|S#|SNAME |LCODE |pg
|17|BAID  |NY2091|1
|25|CLAY  |NJ5101|1
|32|THAISZ|NJ5102|2
|38|GOOD  |FL6321|2
|57|BROWN |NY2092|3
|83|THOM  |ND3450|3
|91|PARK  |MN7334|4
|94|SIVA  |OR1123|4

S#-index (nondense, primary)
|S#|pg|pg (of index file)
|17| 1|1
|32| 2|1
|57| 3|2
|91| 4|2

2nd_LEVEL (a second-level, nondense index)
|S#|pg|
|17| 1|
|57| 2|
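The two-level search can be sketched as below, assuming the example's index pages are held in small Python structures (`LEVEL2`, `LEVEL1`, `DATA` and the helper names are illustrative only). Each level applies the same rule: follow the last entry whose key is less than or equal to the search key.

```python
from bisect import bisect_right

DATA = {1: [17, 25], 2: [32, 38], 3: [57, 83], 4: [91, 94]}   # S# values per data page
LEVEL1 = {1: [(17, 1), (32, 2)], 2: [(57, 3), (91, 4)]}       # pages of the S#-index
LEVEL2 = [(17, 1), (57, 2)]                                   # index on the index


def floor_entry(entries, key):
    """Pointer of the last (k, pointer) entry with k <= key, or None."""
    pos = bisect_right([k for k, _ in entries], key) - 1
    return entries[pos][1] if pos >= 0 else None


def lookup(key):
    """Return the data page holding S# == key, or None if it is absent."""
    index_page = floor_entry(LEVEL2, key)
    if index_page is None:
        return None
    data_page = floor_entry(LEVEL1[index_page], key)
    if data_page is None:
        return None
    return data_page if key in DATA[data_page] else None


print(lookup(83))   # index page 2, then data page 3
```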


Index Classification (Contd.)

• If there is at least one index entry per existing attribute value, then it is called dense, else sparse.

• Every sparse index must be clustered! Sparse indexes are smaller.

[Figure: a data file of (name, age, bonus) records: (Ashby, 25, 3000), (Basu, 33, 4003), (Bristow, 30, 2007), (Cass, 50, 5004), (Daniels, 22, 6003), (Jones, 40, 6003), (Smith, 44, 3000), (Tracy, 44, 5004). A sparse index on Name holds the anchor records of each page (Ashby, Cass, Smith). A dense index on Age holds every age value (22, 25, 30, 33, 40, 44, 44, 50).]

• Tree-structured indexing techniques support both range searches (AKA inequality searches) and equality searches.

• ISAM: (variation of multilevel clustering) static structure;

• B+ tree: dynamic; adjusts gracefully under insert and delete.


ISAM

1 index entry per page of the data file, of the form <k,k*>, sorted on the attribute value, k. k* points to the 1st page (possibly) containing k. This provides alternate entry points into the file, which is faster than binary search, which has just one entry point.

Index file may still be quite large. But we can apply the idea repeatedly!

[Figure: layout of an index page, K*0 | K1 K*1 | K2 K*2 | ... | Km K*m, where each pair Ki K*i is an index entry.]

[Figure: the ISAM structure, with non-leaf (inode) pages on top, leaf pages (primary pages) below them, and overflow pages chained to the leaf pages.]

Leaf pages contain data entries, <k,k*>. In inodes, k* = indirect ptr.


Example ISAM Tree

• Each node can hold 2 (k,k*) entries.

• In any internal node or inode (non-leaf), add a ptr for key values < the first k-value.

Root: [40,40*]
Index pages: [20,20* 33,33*] [51,51* 63,63*]
Leaf pages: [10,10* 15,15*] [20,20* 27,27*] [33,33* 37,37*] [40,40* 46,46*] [51,51* 55,55*] [63,63* 97,97*]


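Inserting k=23, k=48, k=41, k=42

Insert k=23: leaf [20,20* 27,27*] is full. Need overflow page.

Insert k=48: leaf [40,40* 46,46*] is full. Need overflow page.

Insert k=41: fits in the overflow page that already holds 48,48*.

Insert k=42: that overflow page is now full. Need overflow page.

Root: [40,40*]
Index pages: [20,20* 33,33*] [51,51* 63,63*]
Leaf pages (primary pages): [10,10* 15,15*] [20,20* 27,27*] [33,33* 37,37*] [40,40* 46,46*] [51,51* 55,55*] [63,63* 97,97*]
Overflow pages: [23,23*] chained to leaf [20,20* 27,27*]; [48,48* 41,41*] and then [42,42*] chained to leaf [40,40* 46,46*]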
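The overflow behavior can be reproduced with a small sketch. It makes two simplifying assumptions that are not in the slides: the two index levels are flattened into one sorted list of separators, and each leaf keeps its overflow chain as a Python list of pages. All names are illustrative.

```python
from bisect import bisect_right

CAPACITY = 2  # entries per page, as in the example tree


class Leaf:
    def __init__(self, keys):
        self.keys = list(keys)
        self.overflow = []            # chain of overflow pages (lists of keys)


# ISAM is static: the separators are fixed when the file is built.
SEPARATORS = [20, 33, 40, 51, 63]
LEAVES = [Leaf(p) for p in ([10, 15], [20, 27], [33, 37], [40, 46], [51, 55], [63, 97])]


def find_leaf(key):
    return LEAVES[bisect_right(SEPARATORS, key)]


def insert(key):
    leaf = find_leaf(key)
    if len(leaf.keys) < CAPACITY:                 # room on the primary page
        leaf.keys.append(key)
        leaf.keys.sort()
        return
    if not leaf.overflow or len(leaf.overflow[-1]) == CAPACITY:
        leaf.overflow.append([])                  # need a new overflow page
    leaf.overflow[-1].append(key)


def search(key):
    leaf = find_leaf(key)
    return key in leaf.keys or any(key in page for page in leaf.overflow)


for k in (23, 48, 41, 42):
    insert(k)
print(find_leaf(41).overflow)   # chain behind leaf [40, 46]
```

Searches for keys in a long chain must read every overflow page, which is why growing chains degrade performance.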


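Deleting 42, 51, 97

Tree before the deletions:

Root: [40,40*]
Index pages: [20,20* 33,33*] [51,51* 63,63*]
Leaf pages: [10,10* 15,15*] [20,20* 27,27*] [33,33* 37,37*] [40,40* 46,46*] [51,51* 55,55*] [63,63* 97,97*]
Overflow pages: [23,23*] chained to leaf [20,20* 27,27*]; [48,48* 41,41*] and then [42,42*] chained to leaf [40,40* 46,46*]

Note that after the deletions 51 still appears in the index levels, but 51* is no longer in a leaf!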


B+ Tree: The Most Widely Used Index

• Keeps the tree height-balanced.

• Minimum 50% occupancy (except for root). Each node contains m entries, where d ≤ m ≤ 2d.

– d is called the degree or order of the index.

• Supports equality and range searches efficiently.

[Figure: the index entries (“direct search set” or “index set”) form the upper levels; the data entries ("sequence set") form the leaf level.]


Example B+ Tree (d=2)

• Search begins at root; key comparisons direct it to a leaf.

Root: [13 17 24 30]
Leaves: [2* 3* 5* 7*] [14* 16*] [19* 20* 22*] [24* 27* 29*] [33* 34* 38* 39*]

• Search for 5: 5* is found in the first leaf.

• Search for 15: 15 is not in the file!

• Search for all data entries ≥ 24.

Leaves are doubly linked for fast sequential search, including < searches.


Example B+ Tree (contd.)

Root: [13 17 24 30]
Leaves: [2* 3* 5* 7*] [14* 16*] [19* 20* 22*] [24* 27* 29*] [33* 34* 38* 39*]

• Search for all data entries < 23 (note, this is the reason for the double linkage).
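The searches on the example tree can be sketched as follows. This is a simplification: the tree has one index level, so the root is a sorted key list, and the doubly linked leaves are modeled as adjacent entries of a Python list. The names are illustrative.

```python
from bisect import bisect_right

ROOT_KEYS = [13, 17, 24, 30]
LEAVES = [[2, 3, 5, 7], [14, 16], [19, 20, 22], [24, 27, 29], [33, 34, 38, 39]]


def find_leaf(key):
    """Position of the leaf that would hold key (go right when key >= separator)."""
    return bisect_right(ROOT_KEYS, key)


def contains(key):
    return key in LEAVES[find_leaf(key)]


def scan_at_least(low):
    """Entries >= low: descend once, then follow the links to the right."""
    i = find_leaf(low)
    out = [k for k in LEAVES[i] if k >= low]
    for leaf in LEAVES[i + 1:]:
        out.extend(leaf)
    return out


def scan_less_than(high):
    """Entries < high: descend once, then follow the links to the left."""
    i = find_leaf(high)
    out = [k for k in LEAVES[i] if k < high]
    for leaf in reversed(LEAVES[:i]):
        out = leaf + out
    return out


print(contains(5), contains(15))
print(scan_at_least(24))
print(scan_less_than(23))
```

Both range scans use a single root-to-leaf descent; the backward links are what make the < scan as cheap as the ≥ scan.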


Inserting a Data Entry into a B+ Tree

• Find correct leaf L.

• Put data entry in L.

– If L has enough space, done!

– Else, must split L (into L and a new node L2).

• Redistribute entries evenly, copy up (promote) the middle key (the middle value that was promoted is now the anchor key for L2).

• This can happen recursively (e.g., if there is no space for the promoted middle value in the inode to which it is promoted).

– To split an inode, redistribute entries evenly, but push up (promote) the middle key.

• So promote means: copy up at a leaf; move up at an inode.

• Splits “grow” the tree; only a root split increases height.

– Only tree growth possible: wider, or 1 level taller at top.
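The difference between copy-up and move-up can be shown with two small helper functions. They are a sketch for order d=2 only; the function names are invented, and each takes the already-overfull, sorted list of 2d+1 entries.

```python
D = 2  # order of the tree: non-root nodes hold between D and 2*D entries


def split_leaf(entries):
    """Split an overfull leaf. The middle key is COPIED up and stays in L2."""
    left, right = entries[:D], entries[D:]
    return left, right, right[0]


def split_inode(keys):
    """Split an overfull index node. The middle key is MOVED up."""
    return keys[:D], keys[D + 1:], keys[D]


# Inserting 8* into the full leaf [2, 3, 5, 7]:
print(split_leaf(sorted([2, 3, 5, 7] + [8])))
# Inserting the copied-up 5 into the full index node [13, 17, 24, 30]:
print(split_inode(sorted([13, 17, 24, 30] + [5])))
```

After a leaf split the promoted key appears both in the parent and in L2; after an inode split it appears only in the parent, and both halves keep exactly D keys.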


Inserting 8*

• Observe how minimum occupancy is guaranteed in both leaf and index page splits.

• Note the difference between copy-up (leaf) and move-up (inode).

Leaf split: there is no room for 8* in leaf [2* 3* 5* 7*], so split it into [2* 3*] and [5* 7* 8*]. Entry 5 is to be inserted in the parent node. (Note that 5 is copied up and continues to appear in the new leaf node, L2, as anchor value.)

Index split: there is no room for 5 in the index node [13 17 24 30], so split it into [5 13] and [24 30] and move 17 up. Entry 17 is to be inserted in the parent node. (Note that 17 is moved up and only appears once in the index. Contrast this with a leaf split.)


B+ Tree Before Inserting 8*

Note the height increase, and that balance and occupancy are maintained.

Before inserting 8*:

Root: [13 17 24 30]
Leaves: [2* 3* 5* 7*] [14* 16*] [19* 20* 22*] [24* 27* 29*] [33* 34* 38* 39*]

After inserting 8*:

Root: [17]
Index nodes: [5 13] [24 30]
Leaves: [2* 3*] [5* 7* 8*] [14* 16*] [19* 20* 22*] [24* 27* 29*] [33* 34* 38* 39*]


Deleting a Data Entry from a B+ Tree

• Start at root, find leaf L where entry belongs.

• Remove the entry.

– If L is at least half-full, done!

– If L has only d-1 entries,

• Try to re-distribute, borrowing from sibling (adjacent node with same parent as L).

• If re-distribution fails, merge L and a sibling.

• Merge could propagate to the root, thereby decreasing height.
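The redistribute-or-merge decision at the leaf level can be sketched as below for order d=2. It is a simplified illustration that only considers the right sibling; the function name and return convention are assumptions, and the sample values are leaves of the kind used in this deck's B+ tree example.

```python
D = 2  # order of the tree: a non-root leaf must keep at least D entries


def rebalance_leaf(leaf, right_sibling):
    """Fix a leaf that dropped to D-1 entries, using its right sibling.

    Returns (leaf, sibling, separator). After a merge the sibling and the
    separator are None, meaning the parent's index entry is tossed.
    """
    if len(right_sibling) > D:                  # sibling can spare an entry
        leaf = leaf + [right_sibling[0]]
        right_sibling = right_sibling[1:]
        return leaf, right_sibling, right_sibling[0]
    return leaf + right_sibling, None, None     # otherwise merge the two leaves


# Leaf [22] next to [24, 27, 29]: redistribute, the separator becomes 27.
print(rebalance_leaf([22], [24, 27, 29]))
# Leaf [22] next to [27, 29]: the sibling is at minimum occupancy, so merge.
print(rebalance_leaf([22], [27, 29]))
```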


Example Tree After Inserting 8*

• Deleting 19* is easy.

• Deleting 20* is done with re-distribution of 24* (and revision of the anchor value, from 24 to 27, in the inode).

After inserting 8*:

Root: [17]
Index nodes: [5 13] [24 30]
Leaves: [2* 3*] [5* 7* 8*] [14* 16*] [19* 20* 22*] [24* 27* 29*] [33* 34* 38* 39*]

Then deleting 19*, 20*:

Root: [17]
Index nodes: [5 13] [27 30]
Leaves: [2* 3*] [5* 7* 8*] [14* 16*] [22* 24*] [27* 29*] [33* 34* 38* 39*]


... And Then Deleting 24*

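• Must merge.

Before deleting 24*:

Root: [17]
Index nodes: [5 13] [27 30]
Leaves: [2* 3*] [5* 7* 8*] [14* 16*] [22* 24*] [27* 29*] [33* 34* 38* 39*]

• Removing 24* leaves [22*] below minimum occupancy, and its sibling [27* 29*] has nothing to spare, so the two leaves merge. Observe the `toss’ of index entry 27.

After the leaf merge:

Root: [17]
Index nodes: [5 13] [30]
Leaves: [2* 3*] [5* 7* 8*] [14* 16*] [22* 27* 29*] [33* 34* 38* 39*]

• Now that inode [30] is below minimum occupancy, merge it with its sibling; index entry 17 can be `pulled down’ (sibling merge, followed by pull-down).

Final tree:

Root: [5 13 17 30]
Leaves: [2* 3*] [5* 7* 8*] [14* 16*] [22* 27* 29*] [33* 34* 38* 39*]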


Summary so far

• Tree indexes are ideal for range searches and equality searches.

• ISAM is a static structure.

– Only leaf pages modified; overflow pages needed.

– Overflow chains can degrade performance unless size of data set and data distribution stay constant.

• B+ tree is a dynamic structure.

– Inserts/deletes leave tree height-balanced.

– High fanout (F) means depth rarely more than 3 or 4.

– Almost always better than maintaining a sorted file.

– Typically, 67% occupancy on average.

– Usually preferable to ISAM; adjusts to growth gracefully.

– Most widely used index in database management systems because of its versatility. One of the most optimized components of a DBMS.

– Caution! There is much variation in implementation.
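The claim about fanout and depth can be checked with a short calculation. The fanout of 100 and the page counts below are assumed figures chosen for illustration, not values from the slides.

```python
def levels_needed(num_leaf_pages, fanout):
    """Smallest number of index levels h such that fanout**h >= num_leaf_pages."""
    levels, reach = 0, 1
    while reach < num_leaf_pages:
        reach *= fanout
        levels += 1
    return levels


# Assumed fanout F = 100: one million leaf pages need 3 index levels,
# one hundred million need 4.
print(levels_needed(1_000_000, 100))
print(levels_needed(100_000_000, 100))
# A binary tree (fanout 2) over the same million pages needs 20 levels.
print(levels_needed(1_000_000, 2))
```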


Multidimensional Index

Multidimensional data almost always requires multidimensional indexing for effective access. One-dimensional indexes assume a single search column or attribute (search key), which can be a composite column or key.

Data structures that specifically support queries into multidimensional data fall into two categories:

1. Hash-table-like (e.g., grid files and partitioned hash functions)

2. Tree-like, e.g., multi-key indexes, kd-trees, quad-trees (for sets of points); R-trees (for sets of regions as well as sets of points); Predicate-trees (P-trees) for vertical, compressed representations of data


Hash-like Structures for Multidimensional Data, e.g., Grid Files

Partition the POINTS space into a grid. In each dimension, "grid lines" partition the space into stripes. Points that fall right on a grid line belong to the stripe above it (i.e., grid lines are the lower boundaries).

Example: 12 customer(age,sal) data points (i.e., records or tuples) (age,sal): (24,60) (46,60) (50,80) (50,100) (50,120) (70,100) (84,140) (30,260) (26,400) (44,360) (50,280) (60,260)

Place vertical grid lines at age=40 and age=56, and horizontal grid lines at SAL=90K and SAL=224K.

[Figure: scatter plot of the 12 points, with AGE (0 to 100) on the horizontal axis and SAL (0K to 400K) on the vertical axis, and the grid lines drawn in.]

Grid hash function
|age range|sal range|points
|0-39     |0-89K    |(24,60)
|40-55    |0-89K    |(46,60) (50,80)
|40-55    |90-223K  |(50,100) (50,120)
|56-99    |90-223K  |(70,100) (84,140)
|0-39     |224-400K |(30,260) (26,400)
|40-55    |224-400K |(44,360) (50,280)
|56-99    |224-400K |(60,260)

Inserting into Grid files: If there is room, insert; else (two methods)
1. add an overflow block and chain it to the primary block, or
2. reorganize the structure by adding or moving grid lines (similar to dynamic hashing)

A problem with Grid files is that the number of buckets grows exponentially with dimension and the grid may become sparse.
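The grid hash function amounts to one binary search per dimension. The sketch below assumes grid lines at age 40 and 56 and at sal 90K and 224K (the boundaries implied by the bucket ranges above); the names are illustrative.

```python
from bisect import bisect_right
from collections import defaultdict

AGE_LINES = [40, 56]      # vertical grid lines
SAL_LINES = [90, 224]     # horizontal grid lines, in K

POINTS = [(24, 60), (46, 60), (50, 80), (50, 100), (50, 120), (70, 100),
          (84, 140), (30, 260), (26, 400), (44, 360), (50, 280), (60, 260)]


def cell(age, sal):
    """(age stripe, sal stripe) of a point. A point lying on a grid line
    belongs to the stripe above it, which is what bisect_right gives."""
    return bisect_right(AGE_LINES, age), bisect_right(SAL_LINES, sal)


buckets = defaultdict(list)
for point in POINTS:
    buckets[cell(*point)].append(point)

print(cell(40, 90))            # on both grid lines, so the upper stripes
print(buckets[(1, 1)])         # age 40-55, sal 90-223K
print(len(buckets), "of 9 buckets are non-empty")
```

With k dimensions and s stripes per dimension there are s**k cells, which is the exponential growth the last point warns about.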


Hash-like Structures for Multidimensional Data, e.g., Partitioned Hash Files

A partitioned hash function is a sequence of hash functions, h=(h1,...,hn), such that hi produces the ith segment of bits in the hash key; that is, h(a) is the concatenation of the bit subsequences h1(a)h2(a)...hn(a).

Example: The data file is CUSTOMER(AGE,SAL), consisting again of (24,60) (46,60) (50,80) (50,100) (50,120) (70,100) (84,140) (30,260) (26,400) (44,360) (50,280) (60,260)

Use 2 hash functions and 3 bits: the 1st bit is for age, with hash function mod2(tens_digit of age), and the last 2 bits are for salary, with hash function mod4(hundreds_digit of sal).

The lookup table is:

Partitioned hash function
|key|points
|000|(24,060) (46,060) (26,400)
|001|(84,140)
|010|(60,260)
|011|(44,360)
|100|(50,080)
|101|(50,100) (50,120) (70,100)
|110|(30,260) (50,280)
|111|
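The two hash functions of the example translate directly into code. This is a minimal sketch; the function name `partitioned_hash` is an assumption.

```python
def partitioned_hash(age, sal):
    """3-bit key: 1 bit from age, then 2 bits from sal (sal is in K)."""
    age_bit = (age // 10) % 2       # mod2(tens digit of age)
    sal_bits = (sal // 100) % 4     # mod4(hundreds digit of sal)
    return f"{age_bit}{sal_bits:02b}"


POINTS = [(24, 60), (46, 60), (50, 80), (50, 100), (50, 120), (70, 100),
          (84, 140), (30, 260), (26, 400), (44, 360), (50, 280), (60, 260)]

table = {}
for point in POINTS:
    table.setdefault(partitioned_hash(*point), []).append(point)

for key in sorted(table):
    print(key, table[key])
```

A query that fixes only one attribute, say age, still narrows the search to the buckets whose first bit matches, which is the benefit of keeping the segments separate.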


Tree-like Structures for Multidimensional Data, e.g., Multi-key Index

Assume several attributes representing "dimensions" of the data points (data cube tuples).
- Uses a multi-level index. E.g., suppose there are 2 attributes: an index on the 1st attribute leads, for each of its values, to a second-level index on the 2nd attribute over all tuples with that same 1st attribute value.

[Figure: an index on the 1st attribute fanning out to several indexes on the 2nd attribute.]

Take the (age, salary) points again: (24,60) (24,260) (24,400) (50,80) (50,100) (50,120) (50,280) (60,100) (60,260) (84,140)

|age|index on sal    |points
|24 |60, 260, 400    |(24,060) (24,260) (24,400)
|50 |80, 100, 120, 280|(50,080) (50,100) (50,120) (50,280)
|60 |100, 260        |(60,100) (60,260)
|84 |140             |(84,140)
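A two-attribute multi-key index can be sketched with a dictionary from age to a sorted list of salaries. This is an illustration only: in-memory structures stand in for the two index levels, and the names are invented.

```python
from bisect import bisect_left, bisect_right

POINTS = [(24, 60), (24, 260), (24, 400), (50, 80), (50, 100), (50, 120),
          (50, 280), (60, 100), (60, 260), (84, 140)]

# First level: age -> second-level index (sorted salaries for that age).
index = {}
for age, sal in POINTS:
    index.setdefault(age, []).append(sal)
for sals in index.values():
    sals.sort()
ages = sorted(index)


def query(age_lo, age_hi, sal_lo, sal_hi):
    """All points with age_lo <= age <= age_hi and sal_lo <= sal <= sal_hi."""
    out = []
    for age in ages[bisect_left(ages, age_lo):bisect_right(ages, age_hi)]:
        sals = index[age]
        for sal in sals[bisect_left(sals, sal_lo):bisect_right(sals, sal_hi)]:
            out.append((age, sal))
    return out


print(query(50, 60, 100, 260))
```

A query that constrains only the 2nd attribute has to visit every second-level index, which is the known weakness of this structure.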


Tree-like Structures for Multidimensional Data, e.g., k-dimensional (kd-tree) Index

Interior nodes have (Attribute, Value, LowPointer, HighPointer)
- Value is a value which splits the data points.
- The example below shows (a, V, down, up) as a node {age, sal} with only the splitting attribute filled in, with pointers going down for LowPointer and up for HighPointer (it actually goes up on greater or equal).
- Attributes used for different levels are different and ROTATE among the dimensions (round robin).
- The leaves are blocks of records (assume data blocks hold 2 records, i.e., the blocking factor, bfr, is 2).
- To search: decide along the tree until you reach a leaf (going up on greater or equal).
- To insert: decide along the tree until you reach the proper leaf; if there is room there, insert; else split the block and divide its contents according to the appropriate attribute (the next one in the rotation).

Example (insert into the kd-tree in this order; each pair is age first, then salary, sal): (50,80) (84,140) (30,260) (44,360) (50,120) (70,100) (24,60) (26,400) (50,280) (46,60) (60,260) (50,100)

Insert the first 2 pairs (no tree yet, since there is just 1 leaf block):

leaf: 50,80  84,140

Insert 30,260 (leaf is full, so split it and divide the contents by sal=150):

{ ,150}
  up:   30,260
  down: 50,80  84,140

Tree-like Structures for Multidimensional Data, e.g., k-dimensional (kd-tree) Index continued

Insert 44,360 (leaf is not full, so insert):

{ ,150}
  up:   30,260  44,360
  down: 50,80  84,140

Insert 50,120 (leaf is full, so split; divide contents by age=55):

{ ,150}
  up:   30,260  44,360
  down: { 55, }
          up:   84,140
          down: 50,80  50,120

Tree-like Structures for Multidimensional Data, e.g., k-dimensional (kd-tree) Index continued

Insert 70,100 (leaf is not full, so insert):

{ ,150}
  up:   30,260  44,360
  down: { 55, }
          up:   84,140  70,100
          down: 50,80  50,120

Insert 24,060 (leaf is full, so split; divide by sal=75):

{ ,150}
  up:   30,260  44,360
  down: { 55, }
          up:   84,140  70,100
          down: { ,75}
                  up:   50,080  50,120
                  down: 24,060

Tree-like Structures for Multidimensional Data, e.g., k-dimensional (kd-tree) Index continued

Insert 26,400 (leaf is full, so split; divide by age=28):

{ ,150}
  up:   { 28, }
          up:   30,260  44,360
          down: 26,400
  down: { 55, }
          up:   84,140  70,100
          down: { ,75}
                  up:   50,080  50,120
                  down: 24,060

Insert 50,280 (leaf is full, so split; divide by sal=300):

{ ,150}
  up:   { 28, }
          up:   { ,300}
                  up:   44,360
                  down: 30,260  50,280
          down: 26,400
  down: { 55, }
          up:   84,140  70,100
          down: { ,75}
                  up:   50,080  50,120
                  down: 24,060

Tree-like Structures for Multidimensional Data, e.g., k-dimensional (kd-tree) Index continued

Insert 46,060 (leaf is not full, so insert):

{ ,150}
  up:   { 28, }
          up:   { ,300}
                  up:   44,360
                  down: 30,260  50,280
          down: 26,400
  down: { 55, }
          up:   84,140  70,100
          down: { ,75}
                  up:   50,080  50,120
                  down: 24,060  46,060

Tree-like Structures for Multidimensional Data, e.g., k-dimensional (kd-tree) Index continued

Insert 60,260 (leaf is full, so split by age=40):

{ ,150}
  up:   { 28, }
          up:   { ,300}
                  up:   44,360
                  down: { 40, }
                          up:   50,280  60,260
                          down: 30,260
          down: 26,400
  down: { 55, }
          up:   84,140  70,100
          down: { ,75}
                  up:   50,080  50,120
                  down: 24,060  46,060

Tree-like Structures for Multidimensional Data, e.g., k-dimensional (kd-tree) Index continued

Insert 50,100 (leaf is full, so split by age=50; the upper leaf is full again, so split by sal=90):

{ ,150}
  up:   { 28, }
          up:   { ,300}
                  up:   44,360
                  down: { 40, }
                          up:   50,280  60,260
                          down: 30,260
          down: 26,400
  down: { 55, }
          up:   84,140  70,100
          down: { ,75}
                  up:   { 50, }
                          up:   { ,90}
                                  up:   50,120  50,100
                                  down: 50,080
                          down: (empty)
                  down: 24,060  46,060
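Searching the finished kd-tree can be sketched as follows. The tree is written out by hand as nested tuples that mirror the final state of the example (after all 12 inserts); the tuple layout and function names are assumptions made for this illustration.

```python
AGE, SAL = 0, 1

# Interior node: (attribute, split value, low child, high child); leaf: list of points.
TREE = (SAL, 150,
        (AGE, 55,
         (SAL, 75,
          [(24, 60), (46, 60)],
          (AGE, 50,
           [],
           (SAL, 90, [(50, 80)], [(50, 120), (50, 100)]))),
         [(84, 140), (70, 100)]),
        (AGE, 28,
         [(26, 400)],
         (SAL, 300,
          (AGE, 40, [(30, 260)], [(50, 280), (60, 260)]),
          [(44, 360)])))


def find_leaf(node, point):
    """Descend to the leaf for point, taking the high side on >= the split value."""
    while isinstance(node, tuple):
        attr, value, low, high = node
        node = high if point[attr] >= value else low
    return node


def range_query(node, lo, hi):
    """All points p with lo[i] <= p[i] <= hi[i] in both dimensions."""
    if isinstance(node, list):
        return [p for p in node if all(lo[i] <= p[i] <= hi[i] for i in (AGE, SAL))]
    attr, value, low, high = node
    out = []
    if lo[attr] < value:            # part of the range lies on the low side
        out += range_query(low, lo, hi)
    if hi[attr] >= value:           # part of the range lies on the high side
        out += range_query(high, lo, hi)
    return out


print(find_leaf(TREE, (60, 260)))
print(range_query(TREE, (40, 0), (60, 130)))
```

The range query prunes whole subtrees whenever the query rectangle lies entirely on one side of a split value.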


Tree-like Structures for Multidimensional Datasets, e.g., Quad-tree Indexes

- Interior nodes (inodes) correspond to rectangles in 2-D (more generally, they can be constructed to represent hypercubes in higher-dimensional space).

- If the number of points in the rectangle fits in a block, it's a leaf; else the rectangle is treated as an interior node with children corresponding to its 4 quadrants.

- To insert into the quad-tree index: search to find the proper leaf; if there's room, insert; else split the node into 4 quadrants and divide the contents appropriately.

Example: Build the quad-tree index as it would develop, assuming the (age,sal) pairs arrive in this order: (24,60) (46,60) (50,80) (50,100) (50,120) (70,100) (84,140) (30,260) (26,400) (44,360) (50,280) (60,260)

Insert (24,60) (46,60)

[Figure: scatter plot with AGE (0 to 100) on the horizontal axis and SAL (0K to 400K) on the vertical axis, showing the two points.]

The only leaf node is:

age,sal
24,060
46,060

Tree-like Structures for Multidimensional Datasets, e.g., Quad-tree Indexes

Insert 50,080 (leaf is full, so split, e.g., at age=50 and sal=200, and divide the contents by quadrant):

[Figure: scatter plot with AGE (0 to 100) on the horizontal axis and SAL (0K to 400K) on the vertical axis, showing the three points.]

age,sal {50,200}
  NW: (empty)
  NE: (empty)
  SW: 24,060  46,060
  SE: 50,080

Tree-like Structures for Multidimensional Datasets, e.g., Quad-tree Indexes

Insert 50,100 (SE is not full, so insert):

age,sal {50,200}
  NW: (empty)
  NE: (empty)
  SW: 24,060  46,060
  SE: 50,080  50,100

Insert 50,120 (SE is full, so split SE at 75,100):

age,sal {50,200}
  NW: (empty)
  NE: (empty)
  SW: 24,060  46,060
  SE: {75,100}
        NW: 50,100  50,120
        NE: (empty)
        SW: 50,080
        SE: (empty)

ETC.
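The insert procedure for a point quad-tree can be sketched as below. It assumes a leaf capacity of 2, splits every region at its center (which gives the split points 50,200 and 75,100 used in the example), and sends points on a dividing line to the north or east side. The class and method names are illustrative, and the sketch does not guard against more than 2 identical points, which would split forever.

```python
CAPACITY = 2   # points per leaf block


class Quad:
    def __init__(self, x0, y0, x1, y1):
        self.box = (x0, y0, x1, y1)
        self.points = []
        self.children = None           # quadrant name -> Quad, once split

    def _center(self):
        x0, y0, x1, y1 = self.box
        return (x0 + x1) / 2, (y0 + y1) / 2

    def _child_for(self, p):
        cx, cy = self._center()
        name = ("N" if p[1] >= cy else "S") + ("E" if p[0] >= cx else "W")
        return self.children[name]

    def _split(self):
        x0, y0, x1, y1 = self.box
        cx, cy = self._center()
        self.children = {
            "NW": Quad(x0, cy, cx, y1), "NE": Quad(cx, cy, x1, y1),
            "SW": Quad(x0, y0, cx, cy), "SE": Quad(cx, y0, x1, cy),
        }
        old, self.points = self.points, []
        for p in old:                  # divide the contents by quadrant
            self._child_for(p).insert(p)

    def insert(self, p):
        if self.children is not None:
            self._child_for(p).insert(p)
        elif len(self.points) < CAPACITY:
            self.points.append(p)
        else:
            self._split()
            self._child_for(p).insert(p)


root = Quad(0, 0, 100, 400)            # AGE 0..100, SAL 0..400 (in K)
for point in [(24, 60), (46, 60), (50, 80), (50, 100), (50, 120)]:
    root.insert(point)
print(root.children["SW"].points)
print(root.children["SE"].children["NW"].points)
```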


Tree-like Structures for Multidimensional Datasets: Region-tree (R-tree) Indexes

- Inodes of an R-tree correspond to interior regions, which can overlap (usually the regions are rectangles, though not necessarily).

- R-tree regions have subregions that represent the contents of their children.

- The subregions need not cover the region they subdivide (but all data must be within a subregion).

Example: Consider the spatial image:

[Figure: a map with both axes running from 0 to 100, containing six objects: school, road1, road2, house1, house2 and pipeline.]

Assume a leaf can hold 6 regions (bfr=6) and that the 6 regions or objects above are together on 1 leaf block, whose region is shown as the outer red rectangle.

Thus the R-tree has a root and 1 leaf:

Root: ( (0,0), (100,90) ) (corners of the outer red region)
Leaf: road1 | road2 | house1 | school | house2 | pipeline (a full leaf with 6 objects)

Rtree indexes cont.

Now suppose a local cellular phone company adds a POP as shown.

[Figure: the same map with the POP added.]

Split the full leaf, putting 4 objects in 1 new leaf and 3 in the other (minimize overlap and split ~evenly).

Root: ( (0,0), (100,90) )
Leaf 1, region ( (0,0), (60,50) ): road1 | road2 | house1 | POP
Leaf 2, region ( (20,20), (100,80) ): school | house2 | pipeline

Rtree indexes cont.

Now suppose we insert house3.

[Figure: the same map with house3 added.]

Since house3 is not in either region (and both have room), we must decide which one to expand.

If we pick the green, expanding it to ( (20,0), (100,80) ), we add 1600 units².

If we pick the purple, expanding it to ( (0,0), (80,50) ), we add 1000 units².

So to minimize the added area we pick the purple.

Root: ( (0,0), (100,90) )
Leaf 1 (purple), region ( (0,0), (80,50) ): road1 | road2 | house1 | POP | house3
Leaf 2 (green), region ( (20,20), (100,80) ): school | house2 | pipeline

Note that house2 is in both regions.
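The "least enlargement" rule used to pick a region can be sketched as below. The two leaf regions are the ones from the example; the bounding box chosen for house3 is a hypothetical one (the slides give no coordinates for it), picked so that it lies outside both regions. Function names are illustrative.

```python
def area(rect):
    (x0, y0), (x1, y1) = rect
    return (x1 - x0) * (y1 - y0)


def union(a, b):
    """Smallest rectangle covering both a and b."""
    (ax0, ay0), (ax1, ay1) = a
    (bx0, by0), (bx1, by1) = b
    return ((min(ax0, bx0), min(ay0, by0)), (max(ax1, bx1), max(ay1, by1)))


def enlargement(region, obj):
    """Area added to region if it is grown just enough to cover obj."""
    return area(union(region, obj)) - area(region)


def choose_leaf(regions, obj):
    return min(regions, key=lambda name: enlargement(regions[name], obj))


REGIONS = {"purple": ((0, 0), (60, 50)), "green": ((20, 20), (100, 80))}
HOUSE3 = ((70, 0), (80, 10))   # hypothetical bounding box, outside both regions

for name, region in REGIONS.items():
    print(name, union(region, HOUSE3), enlargement(region, HOUSE3))
print(choose_leaf(REGIONS, HOUSE3))
```

Real R-tree implementations typically break ties (for example by smaller resulting area); this sketch simply takes the first minimum.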


Binary Radix Tree Index (AKA a trie): an additional index structure (e.g., used in IBM AS/400 systems)

Similar to a B-tree, except
- only the common parts of key values are embedded in inodes
- a single bit is used to make the navigation direction decision at each level (0 for up and 1 for down); zero-based bit positions are used

Example (in this example, the tree structure is being built left-to-right), starting with an empty structure:

INSERT JAY | LA | 25 | STAR (assigned RRN=1)

CUSTOMER FILE
RRN | nam | loc | age | job
1 | JAY | LA | 25 | STAR

nam_trie_INDEX
namepart RRN
JAY 1

INSERT JON | LA | 45 | HOOD (assigned RRN=2). The 1st letters are the same (J), so the common part is embedded in the root. 2nd letters: A and O; bit 3 (zero-based count) is the 1st difference (and makes the decision).

bit positions  0123 4567
EBCDIC for A = 1100 0001
EBCDIC for O = 1101 0110

CUSTOMER FILE
RRN | nam | loc | age | job
1 | JAY | LA | 25 | STAR
2 | JON | LA | 45 | HOOD

nam_trie_INDEX
J, b3
  0 (up):   AY 1
  1 (down): ON 2

Binary Radix Tree Index (AKA a trie) cont.

INSERT JAN | RO | 93 | DOC (assigned RRN=3). JAN and JAY share JA; the 3rd letters are N and Y, and bit 2 is the 1st difference.

bit positions  0123 4567
EBCDIC for N = 1101 0101
EBCDIC for Y = 1110 1000

CUSTOMER FILE
RRN | nam | loc | age | job
1 | JAY | LA | 25 | STAR
2 | JON | LA | 45 | HOOD
3 | JAN | RO | 93 | DOC

nam_trie_INDEX before the insert:

J, b3
  0 (up):   AY 1
  1 (down): ON 2

nam_trie_INDEX after the insert:

J, b3
  0 (up):   A, b2
              0 (up):   N 3
              1 (down): Y 1
  1 (down): ON 2

Binary Radix Tree Index (AKA a trie) cont.

INSERT SUE | RO | 16 | PROG (assigned RRN=4). The 1st letters, J and S, differ; bit 2 is the 1st difference.

bit positions  0123 4567
EBCDIC for J = 1101 0001
EBCDIC for S = 1110 0010

CUSTOMER FILE
RRN | nam | loc | age | job
1 | JAY | LA | 25 | STAR
2 | JON | LA | 45 | HOOD
3 | JAN | RO | 93 | DOC
4 | SUE | RO | 16 | PROG

nam_trie_INDEX after the insert:

b2
  0 (up):   J, b3
              0 (up):   A, b2
                          0 (up):   N 3
                          1 (down): Y 1
              1 (down): ON 2
  1 (down): SUE 4
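The decision bit at each trie node is the first bit in which two characters differ. The sketch below computes it with Python's `cp037` codec, which is one EBCDIC code page and is assumed here to match the encoding used in the example; the function names are illustrative.

```python
def ebcdic_bits(ch):
    """8-bit EBCDIC (code page 037) pattern of a character, as 'xxxx xxxx'."""
    bits = format(ch.encode("cp037")[0], "08b")
    return bits[:4] + " " + bits[4:]


def first_diff_bit(a, b):
    """Zero-based position (counted from the left) of the first differing bit,
    or None if the two characters have the same code."""
    x = a.encode("cp037")[0] ^ b.encode("cp037")[0]
    if x == 0:
        return None
    return 8 - x.bit_length()


for a, b in [("A", "O"), ("N", "Y"), ("J", "S")]:
    print(a, ebcdic_bits(a), b, ebcdic_bits(b), "-> b%d" % first_diff_bit(a, b))
```

In each pair the character with a 0 in the decision bit goes up and the one with a 1 goes down, matching the branches in the index diagrams.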