introduction to data management · 2020. 11. 25. · database management systems 3ed, r....
TRANSCRIPT
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1
Introduction to Data Management
Lecture #21(Storage & Indexing, cont.)
*** The “Online” Edition ***
Instructor: Mike Carey [email protected]
SQL
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 2
Announcements
v Currently on a between-HW break!§ Planning to change to a Sunday 4PM schedule to buy more time (watch the wiki!)
v Only a handful of talking-head lectures remain after today!§ Week 9 MWF and week 10 MWF
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 3
Ex: Emp (eid, ename, sal, deptid)
111 (P1,1)
222 (P1,2)
333 (P1,3)
444 (P1,4)
555 (P2,1)
666 (P2,2)
... ...
3K (P1,1)
12K (P1,4)
18K (P2,1)
18K (P10000,1)
23K (P2,3)
60K (P2,4)
... ...
eid sal
eid RID sal RID
......
Note:Can have multiple
unclusteredindexes!
v One simple approach would be to have another file, sorted on k, for each k that we want to index§ Hundreds of (key, RID) entries will fit on a single page§ Index is thus much smaller than the data file§ Less data (fewer reads!) to search to locate the RIDs of interest
(RID)(RID)
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 4
Even Better: Tree Indexes!
v Leaf pages contain data entries, and are chainedv Non-leaf pages have index entries; role is to guide searchesv Query processing steps become:
1. Choose a good index to use (if one is available)2. Search the index to determine the interesting RID(s)3. Use the RID(s) to fetch the corresponding record(s)
Non-leafPagesof Index
Leaf Pagesof Index (Sorted by search key)
Recursiveapplicationof indexing
concept!(w/ pages)
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 5
An Example (B+ Tree Preview)
2* 3*
Root
17
30
14* 16* 33* 34* 38* 39*
135
7*5* 8* 22* 24*
27
27* 29*
Entries < 17 Entries >= 17
Notice that data entriesat leaf level are sorted
Note: Just 3 page reads to get from root to (any) leaf here!
k* = (k, I(k)), e.g., (29, RID(s))
✓
✓
✓
29...?
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 6
Index Classification (Revisited)
v Primary vs. secondary: If search key contains the primary key, it’s called the primary index.§ Unique index: Search key contains a candidate key.
v Clustered vs. unclustered: If order of index entries is the same as, or “close to”, the order of stored data records, then it’s called a clustered index.§ A table can be clustered on at most one search key (see
RID ordering in example a few slides ago).§ Cost of retrieving data records via an index varies
greatly based on whether index is clustered or not!
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 7
Clustered vs. Unclustered Indexes
Index entries
Data entries
direct search for
(Index File)(Data file)
Data Records
data entries
Data entries
Data Records
CLUSTERED UNCLUSTERED
*** * **
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 8
Indexing in MySQL (w/InnoDB)
https://dev.mysql.com/doc/refman/8.0/en/create-index.html)
CREATE [UNIQUE|FULLTEXT|SPATIAL] INDEX index_name[index_type] ON tbl_name (index_col_name,...) [index_option] [algorithm_option | lock_option] ...
index_col_name: col_name [(length)] [ASC | DESC]
index_option: index_type | KEY_BLOCK_SIZE [=] value| WITH PARSER parser_name COMMENT 'string’
index_type: USING {BTREE | HASH}
algorithm_option: ALGORITHM [=] {DEFAULT|INPLACE|COPY}lock_option: LOCK [=] {DEFAULT|NONE|SHARED|EXCLUSIVE}
Ex: CREATE INDEX salidx ON Emp (sal) USING BTREE;
ß
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 9
Tree-Structured Indexes: Over(re)view
v As for any index, 3 alternatives for data entries k*:§ Data record with key value k§ <k, RID of data record with search key value k>§ <k, list of RIDs of data records with search key k>
v This data entry choice is orthogonal to the indexing technique used to locate data entries k*.
v Tree-structured indexing techniques support both range searches and equality searches.
v ISAM: static structure; B+ tree: dynamic, adjusts gracefully under inserts and deletes.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 10
Range Searches (Review)
v “Find all students with gpa > 3.0”v If records are in a sorted file, do binary search to
find first such student, then scan to find others.§ Cost of file binary search can be quite high.
v Simple idea to do better: add an “index” file.
! Can now binary search the (smaller) index file!
Page 1 Page 2 Page NPage 3 Data File
k2 kNk1 Index File
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 11
ISAM*
v Index file may still be quite large. But, we can apply the same idea recursively to address that!
! Leaf pages contain the data entries.
P0 K 1 P 1 K 2 P 2 K m P m
index entry
Non-leafPages
PagesOverflow
pagePrimary pages
Leaf
*ISAM: Indexed Sequential Access Method
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 12
Comments on ISAM
v File creation: Leaf (data) pages first allocated sequentially, sorted by search key; then index pages allocated, and then overflow pages.
v Index entries: <search key value, page id>; they “direct” searches for data entries, which are in leaf pages.
v Search: Start at root; use key comparisons to go to leaf. I/O cost log F N; F = #entries/index pg, N = # leaf pgs
v Insert: Find leaf data entry belongs to, put entry there.v Delete: Find and remove entry from leaf; if overflow
page is empty now, de-allocate it. ! Static tree structure: inserts/deletes affect only leaf pages.
µ
Data Pages
Index Pages
Overflow pages
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 13
Example ISAM Tree
v Suppose each node can hold 2 entries (really more like 200, since nodes are disk pages!)
10* 15* 20* 27* 33* 37* 40* 46* 51* 55* 63* 97*
20 33 51 63
40
Root
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 14
After Inserting 23*, 48*, 41*, 42* ...
10* 15* 20* 27* 33* 37* 40* 46* 51* 55* 63* 97*
20 33 51 63
40
23* 48* 41*
42*
OverflowPages
Leaf
IndexPages
Pages
Primary
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 15
... Then Deleting 42*, 51*, 97*
! Note that 51* still appears in index levels, but not in leaf!
10* 15* 20* 27* 33* 37* 40* 46* 55* 63*
20 33 51 63
40
23* 48* 41*
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 16
B+ Tree: Most Widely Used Index!v Insert/delete at log F N cost; keep tree height-
balanced. (F = fanout, N = # leaf pages)v Minimum 50% occupancy (except for root).
Each node contains d <= m <= 2d entries. The (mythical) d is called the order of the B+ tree.
v Supports equality and range-searches efficiently.
Index Entries
Data Entries("Sequence set")
(Direct search)
...
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 17
Example B+ Tree
v Search begins at root, and key comparisons direct the search to a leaf (as in ISAM).
v Ex: Search for 5*, 15*, all data entries >= 24*, ...
! Based on the search for 15*, we know it is not in the tree.
Root
17 24 30
2* 3* 5* 7* 14* 16* 19* 20* 22* 24* 27* 29* 33* 34* 38* 39*
13
! Range searches find the starting point, then scan across the leaves.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 18
Example B+ Tree (clarification)
v Search begins at root, and key comparisons direct the search to a leaf (as in ISAM).
v Ex: Search for 5*, 15*, all data entries >= 24*, ...Root
17 24 30
2* 3* 5* 7* 14* 16* 19* 20* 22* 24* 27* 29* 33* 34* 38* 39*
13
P1 P2 P3 P4 P5
P6All “pointers” here
are page IDs!
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 19
To Be Continued...