introduction to data management · 2020. 11. 25. · database management systems 3ed, r....

19
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1 Introduction to Data Management Lecture #21 (Storage & Indexing, cont.) *** The “Online” Edition *** Instructor: Mike Carey [email protected] SQL

Upload: others

Post on 19-Jan-2021

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Introduction to Data Management · 2020. 11. 25. · Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 4 Even Better: Tree Indexes! vLeaf pages containdata entries, and

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1

Introduction to Data Management

Lecture #21(Storage & Indexing, cont.)

*** The “Online” Edition ***

Instructor: Mike Carey [email protected]

SQL

Page 2: Introduction to Data Management · 2020. 11. 25. · Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 4 Even Better: Tree Indexes! vLeaf pages containdata entries, and

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 2

Announcements

v Currently on a between-HW break!§ Planning to change to a Sunday 4PM schedule to buy more time (watch the wiki!)

v Only a handful of talking-head lectures remain after today!§ Week 9 MWF and week 10 MWF

Page 3: Introduction to Data Management · 2020. 11. 25. · Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 4 Even Better: Tree Indexes! vLeaf pages containdata entries, and

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 3

Ex: Emp (eid, ename, sal, deptid)

111 (P1,1)

222 (P1,2)

333 (P1,3)

444 (P1,4)

555 (P2,1)

666 (P2,2)

... ...

3K (P1,1)

12K (P1,4)

18K (P2,1)

18K (P10000,1)

23K (P2,3)

60K (P2,4)

... ...

eid sal

eid RID sal RID

......

Note:Can have multiple

unclusteredindexes!

v One simple approach would be to have another file, sorted on k, for each k that we want to index§ Hundreds of (key, RID) entries will fit on a single page§ Index is thus much smaller than the data file§ Less data (fewer reads!) to search to locate the RIDs of interest

(RID)(RID)

Page 4: Introduction to Data Management · 2020. 11. 25. · Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 4 Even Better: Tree Indexes! vLeaf pages containdata entries, and

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 4

Even Better: Tree Indexes!

v Leaf pages contain data entries, and are chainedv Non-leaf pages have index entries; role is to guide searchesv Query processing steps become:

1. Choose a good index to use (if one is available)2. Search the index to determine the interesting RID(s)3. Use the RID(s) to fetch the corresponding record(s)

Non-leafPagesof Index

Leaf Pagesof Index (Sorted by search key)

Recursiveapplicationof indexing

concept!(w/ pages)

Page 5: Introduction to Data Management · 2020. 11. 25. · Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 4 Even Better: Tree Indexes! vLeaf pages containdata entries, and

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 5

An Example (B+ Tree Preview)

2* 3*

Root

17

30

14* 16* 33* 34* 38* 39*

135

7*5* 8* 22* 24*

27

27* 29*

Entries < 17 Entries >= 17

Notice that data entriesat leaf level are sorted

Note: Just 3 page reads to get from root to (any) leaf here!

k* = (k, I(k)), e.g., (29, RID(s))

29...?

Page 6: Introduction to Data Management · 2020. 11. 25. · Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 4 Even Better: Tree Indexes! vLeaf pages containdata entries, and

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 6

Index Classification (Revisited)

v Primary vs. secondary: If search key contains the primary key, it’s called the primary index.§ Unique index: Search key contains a candidate key.

v Clustered vs. unclustered: If order of index entries is the same as, or “close to”, the order of stored data records, then it’s called a clustered index.§ A table can be clustered on at most one search key (see

RID ordering in example a few slides ago).§ Cost of retrieving data records via an index varies

greatly based on whether index is clustered or not!

Page 7: Introduction to Data Management · 2020. 11. 25. · Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 4 Even Better: Tree Indexes! vLeaf pages containdata entries, and

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 7

Clustered vs. Unclustered Indexes

Index entries

Data entries

direct search for

(Index File)(Data file)

Data Records

data entries

Data entries

Data Records

CLUSTERED UNCLUSTERED

*** * **

Page 8: Introduction to Data Management · 2020. 11. 25. · Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 4 Even Better: Tree Indexes! vLeaf pages containdata entries, and

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 8

Indexing in MySQL (w/InnoDB)

https://dev.mysql.com/doc/refman/8.0/en/create-index.html)

CREATE [UNIQUE|FULLTEXT|SPATIAL] INDEX index_name[index_type] ON tbl_name (index_col_name,...) [index_option] [algorithm_option | lock_option] ...

index_col_name: col_name [(length)] [ASC | DESC]

index_option: index_type | KEY_BLOCK_SIZE [=] value| WITH PARSER parser_name COMMENT 'string’

index_type: USING {BTREE | HASH}

algorithm_option: ALGORITHM [=] {DEFAULT|INPLACE|COPY}lock_option: LOCK [=] {DEFAULT|NONE|SHARED|EXCLUSIVE}

Ex: CREATE INDEX salidx ON Emp (sal) USING BTREE;

ß

Page 9: Introduction to Data Management · 2020. 11. 25. · Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 4 Even Better: Tree Indexes! vLeaf pages containdata entries, and

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 9

Tree-Structured Indexes: Over(re)view

v As for any index, 3 alternatives for data entries k*:§ Data record with key value k§ <k, RID of data record with search key value k>§ <k, list of RIDs of data records with search key k>

v This data entry choice is orthogonal to the indexing technique used to locate data entries k*.

v Tree-structured indexing techniques support both range searches and equality searches.

v ISAM: static structure; B+ tree: dynamic, adjusts gracefully under inserts and deletes.

Page 10: Introduction to Data Management · 2020. 11. 25. · Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 4 Even Better: Tree Indexes! vLeaf pages containdata entries, and

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 10

Range Searches (Review)

v “Find all students with gpa > 3.0”v If records are in a sorted file, do binary search to

find first such student, then scan to find others.§ Cost of file binary search can be quite high.

v Simple idea to do better: add an “index” file.

! Can now binary search the (smaller) index file!

Page 1 Page 2 Page NPage 3 Data File

k2 kNk1 Index File

Page 11: Introduction to Data Management · 2020. 11. 25. · Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 4 Even Better: Tree Indexes! vLeaf pages containdata entries, and

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 11

ISAM*

v Index file may still be quite large. But, we can apply the same idea recursively to address that!

! Leaf pages contain the data entries.

P0 K 1 P 1 K 2 P 2 K m P m

index entry

Non-leafPages

PagesOverflow

pagePrimary pages

Leaf

*ISAM: Indexed Sequential Access Method

Page 12: Introduction to Data Management · 2020. 11. 25. · Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 4 Even Better: Tree Indexes! vLeaf pages containdata entries, and

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 12

Comments on ISAM

v File creation: Leaf (data) pages first allocated sequentially, sorted by search key; then index pages allocated, and then overflow pages.

v Index entries: <search key value, page id>; they “direct” searches for data entries, which are in leaf pages.

v Search: Start at root; use key comparisons to go to leaf. I/O cost log F N; F = #entries/index pg, N = # leaf pgs

v Insert: Find leaf data entry belongs to, put entry there.v Delete: Find and remove entry from leaf; if overflow

page is empty now, de-allocate it. ! Static tree structure: inserts/deletes affect only leaf pages.

µ

Data Pages

Index Pages

Overflow pages

Page 13: Introduction to Data Management · 2020. 11. 25. · Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 4 Even Better: Tree Indexes! vLeaf pages containdata entries, and

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 13

Example ISAM Tree

v Suppose each node can hold 2 entries (really more like 200, since nodes are disk pages!)

10* 15* 20* 27* 33* 37* 40* 46* 51* 55* 63* 97*

20 33 51 63

40

Root

Page 14: Introduction to Data Management · 2020. 11. 25. · Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 4 Even Better: Tree Indexes! vLeaf pages containdata entries, and

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 14

After Inserting 23*, 48*, 41*, 42* ...

10* 15* 20* 27* 33* 37* 40* 46* 51* 55* 63* 97*

20 33 51 63

40

23* 48* 41*

42*

OverflowPages

Leaf

IndexPages

Pages

Primary

Page 15: Introduction to Data Management · 2020. 11. 25. · Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 4 Even Better: Tree Indexes! vLeaf pages containdata entries, and

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 15

... Then Deleting 42*, 51*, 97*

! Note that 51* still appears in index levels, but not in leaf!

10* 15* 20* 27* 33* 37* 40* 46* 55* 63*

20 33 51 63

40

23* 48* 41*

Page 16: Introduction to Data Management · 2020. 11. 25. · Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 4 Even Better: Tree Indexes! vLeaf pages containdata entries, and

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 16

B+ Tree: Most Widely Used Index!v Insert/delete at log F N cost; keep tree height-

balanced. (F = fanout, N = # leaf pages)v Minimum 50% occupancy (except for root).

Each node contains d <= m <= 2d entries. The (mythical) d is called the order of the B+ tree.

v Supports equality and range-searches efficiently.

Index Entries

Data Entries("Sequence set")

(Direct search)

...

Page 17: Introduction to Data Management · 2020. 11. 25. · Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 4 Even Better: Tree Indexes! vLeaf pages containdata entries, and

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 17

Example B+ Tree

v Search begins at root, and key comparisons direct the search to a leaf (as in ISAM).

v Ex: Search for 5*, 15*, all data entries >= 24*, ...

! Based on the search for 15*, we know it is not in the tree.

Root

17 24 30

2* 3* 5* 7* 14* 16* 19* 20* 22* 24* 27* 29* 33* 34* 38* 39*

13

! Range searches find the starting point, then scan across the leaves.

Page 18: Introduction to Data Management · 2020. 11. 25. · Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 4 Even Better: Tree Indexes! vLeaf pages containdata entries, and

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 18

Example B+ Tree (clarification)

v Search begins at root, and key comparisons direct the search to a leaf (as in ISAM).

v Ex: Search for 5*, 15*, all data entries >= 24*, ...Root

17 24 30

2* 3* 5* 7* 14* 16* 19* 20* 22* 24* 27* 29* 33* 34* 38* 39*

13

P1 P2 P3 P4 P5

P6All “pointers” here

are page IDs!

Page 19: Introduction to Data Management · 2020. 11. 25. · Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 4 Even Better: Tree Indexes! vLeaf pages containdata entries, and

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 19

To Be Continued...