cse 232a graduate database systems · splits “grow” the tree; root split increases height. tree...
TRANSCRIPT
Topic 2: Indexing and Sorting
Chapters 10, 11, and 13 of Cow Book
Arun Kumar
1
CSE 232A Graduate Database Systems
Slide ACKs: Jignesh Patel, Paris Koutris
2
Motivation for Indexing
❖ Consider the following SQL query:
SELECT * FROM Movies WHERE Year=2017
Movies (M) MovieID Name Year Director
Q: How to obtain the matching records from the file?
❖ Heap file? Need to do a linear scan! O(N) I/O and CPU ❖ “Sorted” file? Binary search! O(log2(N)) I/O and CPU
SELECT * FROM Movies WHERE Year>=2000 AND Year<2010
Indexing helps retrieve records faster for selective predicates!
3
Another View of Storage Manager
I/O Accesses
I/O Manager
Access Methods
Heap File
B+-tree Index
Sorted File
Hash Index
Buffer Manager
Conc
urre
ncy
Cont
rol M
anag
er
Reco
very
Man
ager
4
Indexing: Outline
❖ Overview and Terminology ❖ B+ Tree Index ❖ Hash Index ❖ Learned Index
5
Indexing
❖ Index: A data structure to speed up record retrieval ❖ Search Key: Attribute(s) on which file is indexed; also
called Index Key (used interchangeably) ❖ Any permutation of any subset of a relation’s attributes
can be index key for an index ❖ Index key need not be a primary/candidate key
❖ Two main types of indexes: ❖ B+ Tree index: good for both range and equality search ❖ Hash index: good for equality search
Overview of Indexes
❖ Need to consider efficiency of search, insert, and delete ❖ Primarily optimized to reduce (disk) I/O cost
❖ B+ Tree index: ❖ O(logF(N)) I/O and CPU cost for equality search (N:
number of “data entries”; F: “fanout” of non-leaf node) ❖ Range search, Insert, and Delete all start with an
equality search ❖ Hash index: ❖ O(1) I/O and CPU cost for equality search ❖ Insert and delete start with equality search ❖ Not “good” for range search! 6
What is stored in the Index?
❖ 2 things: Search/index key values and data entries ❖ Alternatives for data entries for a given key value k: ❖ AltRecord: Actual data records of file that match k ❖ AltRID: <k, RID of a record that matches k> ❖ AltRIDlist: <k, list of RIDs of records that match k>
❖ API for operations on records: ❖ Search (IndexKey); could be a predicate for B+Tree ❖ Insert (IndexKey, data entry) ❖ Delete (IndexKey); could be a predicate for B+Tree
7
Overview of B+ Tree Index
Index Entries (Non-leaf pages)
Data Entries (Leaf pages)Entries of the form: AltRID: (IndexKey value, RID)
Entries of the form: (IndexKey value, PageID)
❖ Non-leaf pages do not contain data values; they contain [d, 2d] index keys; d is order parameter
❖ Height-balanced tree; only root can have [1,d) keys ❖ Leaf pages in sorted order of IndexKey; connected as a
doubly linked listQ: What is the difference between “B+ Tree” and “B Tree”?
8
Overview of Hash Index
Hash functionSearchKey
Bucket pages
Overflow pages
h
10
N-1
Primary bucket pages
❖ Bucket pages have data entries (same 3 Alternatives) ❖ Hash function helps obtain O(1) search time
9
Trade-offs of Data Entry Alternatives
❖ Pros and cons of alternatives for data entries: ❖ AltRecord: Entire file is stored as an index! If
records are long, data entries of index are large and search time could be high
❖ AltRID and AltRIDlist: Data entries typically smaller than records; often faster for equality search
❖ AltRIDlist has more compact data entries than AltRID but entries are variable-length
Q: A file can have at most one AltRecord index. Why? 10
More Indexing-related Terminology
❖ Composite Index: IndexKey has > 1 attributes ❖ Primary Index: IndexKey contains the primary key ❖ Secondary Index: Any index that not a primary index ❖ Unique Index: IndexKey contains a candidate key ❖ All primary indexes are unique indexes!
MovieID Name Year Director IMDB_URLIMDB_URL is a candidate key
Index on MovieID? Index on Year? Index on Director? Index on IMDB_URL? Index on (Year,Name)? 11
More Indexing-related Terminology
❖ Clustered index: order in which records are laid out is same as (or “very close to”) order of IndexKey domain
❖ Matters for (range) search performance! ❖ AltRecord implies index is clustered. Why? ❖ In practice, clustered almost always implies AltRecord ❖ In practice, a file is clustered on at most 1 IndexKey
❖ Unclustered index: an index that is not clustered
MovieID Name Year Director IMDB_URL
Index on Year? Index on (Year, Name)? 12
13
Indexing: Outline
❖ Overview and Terminology ❖ B+ Tree Index ❖ Hash Index ❖ Learned Index
Root17 24 30
2* 3* 5* 7* 14* 16* 19* 20* 22* 24* 27* 29* 33* 34* 38* 39*
13Height = 1
B+ Tree Index: Search
14
❖ Given SearchKey k, start from root; compare k with IndexKeys in non-leaf/index entries; descend to correct child; keep descending like so till a leaf node is reached
❖ Comparison within non-leaf nodes: binary/linear search
Examples: search 7*; 8*; 24*; range [19*,33*]
Order = 2
Leaf
Pag
e
R1 K 1 R 2 K2 K n P n+1
data entries
record 1 record 2
Next Page Pointer
Rn
record n
P0
Prev Page Pointer
Non
-leaf
P
age P1 K 1 P2 K 2 P3 K m P m+1
index entries
Pointer to apage with Values < K1
Pointer to a page with values s.t. K1≤ Values < K2
Pointer to apage with values ≥Km
Pointer to a page with values s.t., K2≤ Values < K3
Pm
B+ Tree Index: Page Format
15
Order = m/2
B+ Trees in Practice
16
❖ Typical order value: 100 (so, non-leaf node can have up to 200 index keys)
❖ Typical occupancy: 67%; so, typical “fanout” = 133 ❖ Computing the tree’s capacity using fanout: ❖ Height 1 stores 133 leaf pages ❖ Height 4 store 1334 = 312,900,700 leaf pages
❖ Typically, higher levels of B+Tree cached in buffer pool ❖ Level 0 (root) = 1 page = 8 KB ❖ Level 1 = 133 pages ~ 1 MB ❖ Level 2 = 17,689 pages ~ 138 MB and so on
B+ Tree Index: Insert
17
❖ Search for correct leaf L ❖ Insert data entry into L; if L has enough space, done!
Otherwise, must split L (into new L and a new leaf L’) ❖ Redistribute entries evenly, copy up middle key ❖ Insert index entry pointing to L’ into parent of L
❖ A split might have to propagate upwards recursively: ❖ To split non-leaf node, redistribute entries evenly, but
push up the middle key (not copy up, as in leaf splits!) ❖ Splits “grow” the tree; root split increases height. ❖ Tree growth: gets wider or one level taller at top.
Root17 24 30
2* 3* 5* 7* 14* 16* 19* 20* 22* 24* 27* 29* 33* 34* 38* 39*
13
Entry to be inserted in parent nodeCopied up (and continues to appear in the leaf)
2* 3* 5* 7* 8*
5
B+ Tree Index: Insert
18
Example: Insert 8*
Split!
Split! Height++
Insert in parent node. Pushed up (and only appears once in the index)
5 24 30
17
13
Minimum occupancy is guaranteed in both leaf and non-leaf page splits
B+ Tree Index: Insert
19
Example: Insert 8*
2* 3*
New Root17
24 30
14*16* 19*20*22* 24* 27*29* 33*34*38*39*
135
7*5* 8*
B+ Tree Index: Insert
20
❖ Recursive splitting went up to root; height went up by 1 ❖ Splitting is somewhat expensive; is it avoidable? ❖ Can redistribute data entries with left or right sibling, if
there is space!
Example: Insert 8*
Root17 24 30
2* 3* 5* 7* 14* 16* 19* 20* 22* 24* 27* 29* 33* 34* 38* 39*
13
8* 14* 16*
8
Insert: Leaf Node Redistribution
21
❖ Redistributing data entries with a sibling improves page occupancy at leaf level and avoids too many splits; but usually not used for non-leaf node splits
❖ Could increase I/O cost (checking siblings) ❖ Propagating internal splits is better amortization ❖ Pointer management headaches
Example: Insert 8*
B+ Tree Index: Delete
22
❖ Start at root, find leaf L where entry belongs ❖ Remove the entry; if L is at least half-full, done! Else, if
L has only d-1 entries: ❖ Try to re-distribute, borrowing from sibling L’ ❖ If re-distribution fails, merge L and L’ into single leaf
❖ If merge occurred, must delete entry (pointing to L or sibling) from parent of L.
❖ A merge might have to propagate upwards recursively to root, which decreases height by 1
27*29*2* 3*
Root17
24 30
14*16* 19* 20*22* 24* 33*34*38*39*
135
7*5* 8*
27
24* 27* 29*
23
B+ Tree Index: Delete
Example: Delete 22*Example: Delete 20*
❖ Deleting 22* is easy ❖ Deleting 20* is followed by redistribution at leaf level.
Note how middle key is copied up.
Example: Delete 24*?
30
19* 27* 29* 33* 34* 38* 39*
Need to merge recursively upwards!
2* 3* 7* 14* 16* 19* 27* 29* 33* 34* 38* 39*5* 8*
New Root 30135 17
24
B+ Tree Index: Delete
Example: Delete 24*❖ Must merge leaf
nodes! ❖ In non-leaf node,
remove index entry with key value = 27
❖ Pull down of the index entry
Root
135 17 20
22
30
14* 16* 17* 18* 20* 33* 34* 38* 39*22* 27* 29*21*7*5* 8*3*2*
25
Delete: Non-leaf Node Redistribution
❖ Suppose this is the state of the tree when deleting 24* ❖ Instead of merge of root’s children, we can also
redistribute entry from left child of root to right child
14* 16* 33* 34*38* 39*22* 27* 29*17* 18* 20* 21*7*5* 8*2* 3*
Root
135
17
3020 22
26
Delete: After Redistribution
❖ Rotate IndexKeys through the parent node ❖ It suffices to re-distribute index entry with key 20; for
illustration, 17 also re-distributed
27
Delete: Redistribution Preferred
❖ Unlike Insert, where redistribution is discouraged for non-leaf nodes, Delete prefers redistribution over merge decisions at both leaf or non-leaf levels. Why?
❖ Files usually grow, not shrink; deletions are rare! ❖ High chance of redistribution success (high fanouts) ❖ Only need to propagate changes to parent node
28
Handling Duplicates/Repetitions
❖ Many data entries could have same IndexKey value ❖ Related to AltRIDlist vs AltRID for data entries ❖ Also, single data entry could still span multiple pages
❖ Solution 1: ❖ All data entries with a given IndexKey value reside
on a single page ❖ Use “overflow” pages, if needed (not inside leaf list)
❖ Solution 2: ❖ Allow repeated IndexKey values among data entries ❖ Modify Search appropriately ❖ Use RID to get a unique composite key!
29
Order Concept in Practice
❖ In practice, order (d) concept replaced by physical space criterion: at least half-full
❖ Non-leaf pages can typically hold many more entries than leaf pages, since leaf pages could have long data records (AltRecord) or RID lists (AltRIDlist)
❖ Often, different nodes could have different # entries: ❖ Variable sized IndexKey ❖ AltRecord and variable-sized data records ❖ AltRIDlist could leads to different numbers of data
entries sharing an IndexKey value
30
B+ Tree Index: Bulk Loading
❖ Given an existing file we want to index, multiple record-at-a-time Inserts are wasteful (too many IndexKeys!)
❖ Bulk loading avoids this overhead; reduces I/O cost ❖ 1) Sort data entries by IndexKey (AltRecord sorts file!) ❖ 2) Create empty root page; copy leftmost IndexKey of
leftmost leaf page to root (non-leaf) and assign child ❖ 3) Go left to right; Insert only leftmost IndexKey from
each leaf page into index as usual (NB: fewer keys!) ❖ 4) When non-leaf fills up, follow usual Split procedure
and recurse upwards, if needed
31
B+ Tree Index: Bulk Loading
145
14* 16* 17* 18* 20* 22* 27*21*7*5*3*2*
Sorted pages of data entries
Insert IndexKey 17 Split; push up 14
14
17 20
Insert IndexKey 22 No redistribution! Split again; push up 20 Height++
…
and so on
32
Indexing: Outline
❖ Overview and Terminology ❖ B+ Tree Index ❖ Hash Index ❖ Learned Index
33
Overview of Hash Indexing
❖ Reduces search cost to nearly O(1) ❖ Good for equality search (but not for range search) ❖ Many variants: ❖ Static hashing ❖ Extendible hashing ❖ Linear hashing, etc. (we will not discuss these)
34
Static Hashing
Hash function
SearchKey k
Bucket pages
Overflow pages
h
10
N-1
Primary bucket pages
❖ N is fixed; primary bucket pages never deallocated ❖ Bucket pages contain data entries (same 3 Alts) ❖ Search: ❖ Overflow pages help handle hash collisions ❖ Average search cost is O(1) + #overflow pages
E.g., h(k) = a*k + b
h(k) mod N
2
35
Static Hashing: Example
Say, AltRecord and only one record fits per page!
N = 3
MovieID Name Year Director20 Inception 2010 Christopher Nolan15 Avatar 2009 Jim Cameron52 Gravity 2013 Alfonso Cuaron74 Blue Jasmine 2013 Woody Allen
Hash function: a = 1, b = 0 h(k) = k MOD 3
15 MOD 3 = 0
(15, Avatar, …)
(52, Gravity, …)
(20, Inception, …) (74, Blue …)
0
1
274 MOD 3 = 2
36
Static Hashing: Insert and Delete
❖ Insert: ❖ Equality search; find space on primary bucket page ❖ If not enough space, add overflow page
❖ Delete: ❖ Equality search; delete record ❖ If overflow page becomes empty, remove it ❖ Primary bucket pages are never removed!
37
Static Hashing: Issues
❖ Since N is fixed, #overflow pages might grow and degrade search cost; deletes waste a lot of space
❖ Full reorg. is expensive and could block query proc. ❖ Skew in hashing: ❖ Could be due to “bad” hash function that does not
“spread” values—but this issue is well-studied/solved ❖ Could be due to skew in the data (duplicates); this
could cause more overflow pages—difficult to resolve
Extendible (dynamic) hashing helps resolve first two issues
38
Extendible Hashing
❖ Idea: Instead of hashing directly to data pages, maintain a dynamic directory of pointers to data pages
❖ Directory can grow and shrink; chosen to double/halve
❖ Search I/O cost: 1 (2 if direc. not cached)
Check last GD bits of h(k)
224* 12* 32* 16*
21* 13* 41* 17*
210*
27* 15*
Directory
Data Pages
Bucket A
Bucket B
Bucket C
Bucket D
Global Depth (GD) Local Depth (LD)
00011011
Example: Search 17* 10001 Search 42* …10
39
Extendible Hashing: Insert
❖ Search for k; if data page has space, add record; done ❖ If data page has no more space: ❖ If LD < GD, create Split Image of bucket; LD++ for
both bucket and split image; insert record properly ❖ If LD == GD, create Split Image; LD++ for both
buckets; insert record; but also, double the directory and GD++! Duplicate other direc. pointers properly
❖ Direc. typically grows in spurts
40
Extendible Hashing: Insert
224* 12* 32* 16*
21* 13* 41* 17*
210*
27* 15*
Directory
Data Pages
Bucket A
Bucket B
Bucket C
Bucket D
Global Depth (GD) Local Depth (LD)
Check last GD bits of h(k)
00011011
Example: Insert 24* …00
No space in bucket A Need to split and LD++ Since LD was = GD, GD++ and direc. doubles
332* 16* 24*
34* 12*
Bucket A
Bucket A2000
100
41
Extendible Hashing: Insert
3332* 16* 24*
21* 13* 41* 17*
210*
27* 15*
Bucket A
Bucket B
Bucket C
Bucket D
Global Depth (GD) Local Depth (LD)
000001010011
Example: Insert 24*
34* 12*
Bucket A2
100101110111
Example: Insert 18*…010
18*
Example: Insert 25*…001
Need to split bucket B! Since LD < GD, direc. does not double this time Only LD++ on old bucket and modify direc. ptrs
313*
Bucket B2
31* 41* 17*
42
Extendible Hashing: Delete
❖ Search for k; delete record on data page ❖ If data page becomes empty, we can merge with
another data page with same last GD-1 bits (its “historical split image”) and do LD--; update direc. ptrs
❖ If all splits images get merged back and all buckets have LD = GD-1, shrink direc. by half; GD--
NB: Never does a bucket’s LD become > GD!
43
Extendible Hashing: Delete
224* 12*
21* 13* 41* 17*
226*
27* 15*
Bucket A
Bucket B
Bucket C
Bucket D
Global Depth (GD) Local Depth (LD)
00011011
Example: Delete 26*…10
Bucket C is now empty Can merge with A; LD--
Q: Why did we pick A to merge with?
1
Example: Delete 41*…01
In practice, deletes and thus, bucket merges are rare So, directory shrinking is even more uncommon
44
Static Hashing vs Extendible Hashing
Q: Why not let N in static hashing grow and shrink too?
❖ Extendible hash direc. size is typically much smaller than data pages; with static hash, reorg. of all data pages is far more expensive
❖ Hashing skew is a common issue for both; in static hash, this could lead to large # overflow pages; in extendible hash, this could blow up direc. size (this is OK)
❖ If too many data entries share search key (duplicates), overflow pages needed for extendible hash too!
45
Indexing: Outline
❖ Overview and Terminology ❖ B+ Tree Index ❖ Hash Index ❖ Learned Index
46
Statistical Machine Learning 101
❖ An “ML model” is a program/function to approximate a hidden / unknown function based on a set of data points
❖ The set of functions representable by an ML model is called its hypothesis space
f : DX ! DY<latexit sha1_base64="FTGe5Y5wZ8mndx++66j4g7Qhpg8=">AAACE3icbVDLSsNAFJ34rPUVdelmsAjioiQiKK6KunBZwT6kCWEynbRDJzNhZqKU0H9w46+4caGIWzfu/BsnbRa29cCFwzn3cu89YcKo0o7zYy0sLi2vrJbWyusbm1vb9s5uU4lUYtLAggnZDpEijHLS0FQz0k4kQXHISCscXOV+64FIRQW/08OE+DHqcRpRjLSRAvs4uoBejHQfI5Zdj4I29CTt9TWSUjxOOfeBXXGqzhhwnrgFqYAC9cD+9roCpzHhGjOkVMd1Eu1nSGqKGRmVvVSRBOEB6pGOoRzFRPnZ+KcRPDRKF0ZCmuIajtW/ExmKlRrGoenMj1SzXi7+53VSHZ37GeVJqgnHk0VRyqAWMA8IdqkkWLOhIQhLam6FuI8kwtrEWDYhuLMvz5PmSdV1qu7taaV2WcRRAvvgABwBF5yBGrgBddAAGDyBF/AG3q1n69X6sD4nrQtWMbMHpmB9/QIPU55H</latexit><latexit sha1_base64="FTGe5Y5wZ8mndx++66j4g7Qhpg8=">AAACE3icbVDLSsNAFJ34rPUVdelmsAjioiQiKK6KunBZwT6kCWEynbRDJzNhZqKU0H9w46+4caGIWzfu/BsnbRa29cCFwzn3cu89YcKo0o7zYy0sLi2vrJbWyusbm1vb9s5uU4lUYtLAggnZDpEijHLS0FQz0k4kQXHISCscXOV+64FIRQW/08OE+DHqcRpRjLSRAvs4uoBejHQfI5Zdj4I29CTt9TWSUjxOOfeBXXGqzhhwnrgFqYAC9cD+9roCpzHhGjOkVMd1Eu1nSGqKGRmVvVSRBOEB6pGOoRzFRPnZ+KcRPDRKF0ZCmuIajtW/ExmKlRrGoenMj1SzXi7+53VSHZ37GeVJqgnHk0VRyqAWMA8IdqkkWLOhIQhLam6FuI8kwtrEWDYhuLMvz5PmSdV1qu7taaV2WcRRAvvgABwBF5yBGrgBddAAGDyBF/AG3q1n69X6sD4nrQtWMbMHpmB9/QIPU55H</latexit><latexit sha1_base64="FTGe5Y5wZ8mndx++66j4g7Qhpg8=">AAACE3icbVDLSsNAFJ34rPUVdelmsAjioiQiKK6KunBZwT6kCWEynbRDJzNhZqKU0H9w46+4caGIWzfu/BsnbRa29cCFwzn3cu89YcKo0o7zYy0sLi2vrJbWyusbm1vb9s5uU4lUYtLAggnZDpEijHLS0FQz0k4kQXHISCscXOV+64FIRQW/08OE+DHqcRpRjLSRAvs4uoBejHQfI5Zdj4I29CTt9TWSUjxOOfeBXXGqzhhwnrgFqYAC9cD+9roCpzHhGjOkVMd1Eu1nSGqKGRmVvVSRBOEB6pGOoRzFRPnZ+KcRPDRKF0ZCmuIajtW/ExmKlRrGoenMj1SzXi7+53VSHZ37GeVJqgnHk0VRyqAWMA8IdqkkWLOhIQhLam6FuI8kwtrEWDYhuLMvz5PmSdV1qu7taaV2WcRRAvvgABwBF5yBGrgBddAAGDyBF/AG3q1n69X6sD4nrQtWMbMHpmB9/QIPU55H</latexit><latexit sha1_base64="FTGe5Y5wZ8mndx++66j4g7Qhpg8=">AAACE3icbVDLSsNAFJ34rPUVdelmsAjioiQiKK6KunBZwT6kCWEynbRDJzNhZqKU0H9w46+4caGIWzfu/BsnbRa29cCFwzn3cu89YcKo0o7zYy0sLi2vrJbWyusbm1vb9s5uU4lUYtLAggnZDpEijHLS0FQz0k4kQXHISCscXOV+64FIRQW/08OE+DHqcRpRjLSRAvs4uoBejHQfI5Zdj4I29CTt9TWSUjxOOfeBXXGqzhhwnrgFqYAC9cD+9roCpzHhGjOkVMd1Eu1nSGqKGRmVvVSRBOEB6pGOoRzFRPnZ+KcRPDRKF0ZCmuIajtW/ExmKlRrGoenMj1SzXi7+53VSHZ37GeVJqgnHk0VRyqAWMA8IdqkkWLOhIQhLam6FuI8kwtrEWDYhuLMvz5PmSdV1qu7taaV2WcRRAvvgABwBF5yBGrgBddAAGDyBF/AG3q1n69X6sD4nrQtWMbMHpmB9/QIPU55H</latexit>
Input/features Output/targetExamples:
DY = R<latexit sha1_base64="o6TNPS9XaJ4hffLe0y8QUzyVNuA=">AAACA3icbVDLSsNAFL2pr1pfUXe6GSyCq5KIoBuhqAuXVexD2hAm00k7dPJgZiKUEHDjr7hxoYhbf8Kdf+Mk7UJbDwycOede7r3HizmTyrK+jdLC4tLySnm1sra+sbllbu+0ZJQIQpsk4pHoeFhSzkLaVExx2okFxYHHadsbXeZ++4EKyaLwTo1j6gR4EDKfEay05Jp7vQCrIcE8vcrce3SOir/npbeZa1atmlUAzRN7SqowRcM1v3r9iCQBDRXhWMqubcXKSbFQjHCaVXqJpDEmIzygXU1DHFDppMUNGTrUSh/5kdAvVKhQf3ekOJByHHi6Mt9Qznq5+J/XTZR/5qQsjBNFQzIZ5CccqQjlgaA+E5QoPtYEE8H0rogMscBE6dgqOgR79uR50jqu2ZrfnFTrF9M4yrAPB3AENpxCHa6hAU0g8AjP8ApvxpPxYrwbH5PSkjHt2YU/MD5/AIBRl2o=</latexit><latexit sha1_base64="o6TNPS9XaJ4hffLe0y8QUzyVNuA=">AAACA3icbVDLSsNAFL2pr1pfUXe6GSyCq5KIoBuhqAuXVexD2hAm00k7dPJgZiKUEHDjr7hxoYhbf8Kdf+Mk7UJbDwycOede7r3HizmTyrK+jdLC4tLySnm1sra+sbllbu+0ZJQIQpsk4pHoeFhSzkLaVExx2okFxYHHadsbXeZ++4EKyaLwTo1j6gR4EDKfEay05Jp7vQCrIcE8vcrce3SOir/npbeZa1atmlUAzRN7SqowRcM1v3r9iCQBDRXhWMqubcXKSbFQjHCaVXqJpDEmIzygXU1DHFDppMUNGTrUSh/5kdAvVKhQf3ekOJByHHi6Mt9Qznq5+J/XTZR/5qQsjBNFQzIZ5CccqQjlgaA+E5QoPtYEE8H0rogMscBE6dgqOgR79uR50jqu2ZrfnFTrF9M4yrAPB3AENpxCHa6hAU0g8AjP8ApvxpPxYrwbH5PSkjHt2YU/MD5/AIBRl2o=</latexit><latexit sha1_base64="o6TNPS9XaJ4hffLe0y8QUzyVNuA=">AAACA3icbVDLSsNAFL2pr1pfUXe6GSyCq5KIoBuhqAuXVexD2hAm00k7dPJgZiKUEHDjr7hxoYhbf8Kdf+Mk7UJbDwycOede7r3HizmTyrK+jdLC4tLySnm1sra+sbllbu+0ZJQIQpsk4pHoeFhSzkLaVExx2okFxYHHadsbXeZ++4EKyaLwTo1j6gR4EDKfEay05Jp7vQCrIcE8vcrce3SOir/npbeZa1atmlUAzRN7SqowRcM1v3r9iCQBDRXhWMqubcXKSbFQjHCaVXqJpDEmIzygXU1DHFDppMUNGTrUSh/5kdAvVKhQf3ekOJByHHi6Mt9Qznq5+J/XTZR/5qQsjBNFQzIZ5CccqQjlgaA+E5QoPtYEE8H0rogMscBE6dgqOgR79uR50jqu2ZrfnFTrF9M4yrAPB3AENpxCHa6hAU0g8AjP8ApvxpPxYrwbH5PSkjHt2YU/MD5/AIBRl2o=</latexit><latexit sha1_base64="o6TNPS9XaJ4hffLe0y8QUzyVNuA=">AAACA3icbVDLSsNAFL2pr1pfUXe6GSyCq5KIoBuhqAuXVexD2hAm00k7dPJgZiKUEHDjr7hxoYhbf8Kdf+Mk7UJbDwycOede7r3HizmTyrK+jdLC4tLySnm1sra+sbllbu+0ZJQIQpsk4pHoeFhSzkLaVExx2okFxYHHadsbXeZ++4EKyaLwTo1j6gR4EDKfEay05Jp7vQCrIcE8vcrce3SOir/npbeZa1atmlUAzRN7SqowRcM1v3r9iCQBDRXhWMqubcXKSbFQjHCaVXqJpDEmIzygXU1DHFDppMUNGTrUSh/5kdAvVKhQf3ekOJByHHi6Mt9Qznq5+J/XTZR/5qQsjBNFQzIZ5CccqQjlgaA+E5QoPtYEE8H0rogMscBE6dgqOgR79uR50jqu2ZrfnFTrF9M4yrAPB3AENpxCHa6hAU0g8AjP8ApvxpPxYrwbH5PSkjHt2YU/MD5/AIBRl2o=</latexit>
H = {f}<latexit sha1_base64="iFoBkjwDSiYI+LMX2WYkixA7TL8=">AAAB/HicbVBNS8NAFHypX7V+RXv0slgETyURQS9C0UuPFWwtNKFstpt26WYTdjdCCPWvePGgiFd/iDf/jZs2B20dWBhm3uPNTpBwprTjfFuVtfWNza3qdm1nd2//wD486qk4lYR2Scxj2Q+wopwJ2tVMc9pPJMVRwOlDML0t/IdHKhWLxb3OEupHeCxYyAjWRhradS/CekIwz9szdI28PPRmQ7vhNJ050CpxS9KAEp2h/eWNYpJGVGjCsVID10m0n2OpGeF0VvNSRRNMpnhMB4YKHFHl5/PwM3RqlBEKY2me0Giu/t7IcaRUFgVmsoiqlr1C/M8bpDq88nMmklRTQRaHwpQjHaOiCTRikhLNM0MwkcxkRWSCJSba9FUzJbjLX14lvfOma/jdRaN1U9ZRhWM4gTNw4RJa0IYOdIFABs/wCm/Wk/VivVsfi9GKVe7U4Q+szx8s6ZRy</latexit><latexit sha1_base64="iFoBkjwDSiYI+LMX2WYkixA7TL8=">AAAB/HicbVBNS8NAFHypX7V+RXv0slgETyURQS9C0UuPFWwtNKFstpt26WYTdjdCCPWvePGgiFd/iDf/jZs2B20dWBhm3uPNTpBwprTjfFuVtfWNza3qdm1nd2//wD486qk4lYR2Scxj2Q+wopwJ2tVMc9pPJMVRwOlDML0t/IdHKhWLxb3OEupHeCxYyAjWRhradS/CekIwz9szdI28PPRmQ7vhNJ050CpxS9KAEp2h/eWNYpJGVGjCsVID10m0n2OpGeF0VvNSRRNMpnhMB4YKHFHl5/PwM3RqlBEKY2me0Giu/t7IcaRUFgVmsoiqlr1C/M8bpDq88nMmklRTQRaHwpQjHaOiCTRikhLNM0MwkcxkRWSCJSba9FUzJbjLX14lvfOma/jdRaN1U9ZRhWM4gTNw4RJa0IYOdIFABs/wCm/Wk/VivVsfi9GKVe7U4Q+szx8s6ZRy</latexit><latexit sha1_base64="iFoBkjwDSiYI+LMX2WYkixA7TL8=">AAAB/HicbVBNS8NAFHypX7V+RXv0slgETyURQS9C0UuPFWwtNKFstpt26WYTdjdCCPWvePGgiFd/iDf/jZs2B20dWBhm3uPNTpBwprTjfFuVtfWNza3qdm1nd2//wD486qk4lYR2Scxj2Q+wopwJ2tVMc9pPJMVRwOlDML0t/IdHKhWLxb3OEupHeCxYyAjWRhradS/CekIwz9szdI28PPRmQ7vhNJ050CpxS9KAEp2h/eWNYpJGVGjCsVID10m0n2OpGeF0VvNSRRNMpnhMB4YKHFHl5/PwM3RqlBEKY2me0Giu/t7IcaRUFgVmsoiqlr1C/M8bpDq88nMmklRTQRaHwpQjHaOiCTRikhLNM0MwkcxkRWSCJSba9FUzJbjLX14lvfOma/jdRaN1U9ZRhWM4gTNw4RJa0IYOdIFABs/wCm/Wk/VivVsfi9GKVe7U4Q+szx8s6ZRy</latexit><latexit sha1_base64="iFoBkjwDSiYI+LMX2WYkixA7TL8=">AAAB/HicbVBNS8NAFHypX7V+RXv0slgETyURQS9C0UuPFWwtNKFstpt26WYTdjdCCPWvePGgiFd/iDf/jZs2B20dWBhm3uPNTpBwprTjfFuVtfWNza3qdm1nd2//wD486qk4lYR2Scxj2Q+wopwJ2tVMc9pPJMVRwOlDML0t/IdHKhWLxb3OEupHeCxYyAjWRhradS/CekIwz9szdI28PPRmQ7vhNJ050CpxS9KAEp2h/eWNYpJGVGjCsVID10m0n2OpGeF0VvNSRRNMpnhMB4YKHFHl5/PwM3RqlBEKY2me0Giu/t7IcaRUFgVmsoiqlr1C/M8bpDq88nMmklRTQRaHwpQjHaOiCTRikhLNM0MwkcxkRWSCJSba9FUzJbjLX14lvfOma/jdRaN1U9ZRhWM4gTNw4RJa0IYOdIFABs/wCm/Wk/VivVsfi9GKVe7U4Q+szx8s6ZRy</latexit>
DX = R<latexit sha1_base64="4Zfe0oX1mHwIs7JR9hck+NzgEI8=">AAACA3icbVDLSsNAFL3xWesr6k43g0VwVRIRdCMUdeGyin1AG8JkOmmHTh7MTIQSAm78FTcuFHHrT7jzb5ykWWjrgYEz59zLvfd4MWdSWda3sbC4tLyyWlmrrm9sbm2bO7ttGSWC0BaJeCS6HpaUs5C2FFOcdmNBceBx2vHGV7nfeaBCsii8V5OYOgEehsxnBCstueZ+P8BqRDBPrzO3iy5Q8fe89C5zzZpVtwqgeWKXpAYlmq751R9EJAloqAjHUvZsK1ZOioVihNOs2k8kjTEZ4yHtaRrigEonLW7I0JFWBsiPhH6hQoX6uyPFgZSTwNOV+YZy1svF/7xeovxzJ2VhnCgakukgP+FIRSgPBA2YoETxiSaYCKZ3RWSEBSZKx1bVIdizJ8+T9knd1vz2tNa4LOOowAEcwjHYcAYNuIEmtIDAIzzDK7wZT8aL8W58TEsXjLJnD/7A+PwBfsCXaQ==</latexit><latexit sha1_base64="4Zfe0oX1mHwIs7JR9hck+NzgEI8=">AAACA3icbVDLSsNAFL3xWesr6k43g0VwVRIRdCMUdeGyin1AG8JkOmmHTh7MTIQSAm78FTcuFHHrT7jzb5ykWWjrgYEz59zLvfd4MWdSWda3sbC4tLyyWlmrrm9sbm2bO7ttGSWC0BaJeCS6HpaUs5C2FFOcdmNBceBx2vHGV7nfeaBCsii8V5OYOgEehsxnBCstueZ+P8BqRDBPrzO3iy5Q8fe89C5zzZpVtwqgeWKXpAYlmq751R9EJAloqAjHUvZsK1ZOioVihNOs2k8kjTEZ4yHtaRrigEonLW7I0JFWBsiPhH6hQoX6uyPFgZSTwNOV+YZy1svF/7xeovxzJ2VhnCgakukgP+FIRSgPBA2YoETxiSaYCKZ3RWSEBSZKx1bVIdizJ8+T9knd1vz2tNa4LOOowAEcwjHYcAYNuIEmtIDAIzzDK7wZT8aL8W58TEsXjLJnD/7A+PwBfsCXaQ==</latexit><latexit sha1_base64="4Zfe0oX1mHwIs7JR9hck+NzgEI8=">AAACA3icbVDLSsNAFL3xWesr6k43g0VwVRIRdCMUdeGyin1AG8JkOmmHTh7MTIQSAm78FTcuFHHrT7jzb5ykWWjrgYEz59zLvfd4MWdSWda3sbC4tLyyWlmrrm9sbm2bO7ttGSWC0BaJeCS6HpaUs5C2FFOcdmNBceBx2vHGV7nfeaBCsii8V5OYOgEehsxnBCstueZ+P8BqRDBPrzO3iy5Q8fe89C5zzZpVtwqgeWKXpAYlmq751R9EJAloqAjHUvZsK1ZOioVihNOs2k8kjTEZ4yHtaRrigEonLW7I0JFWBsiPhH6hQoX6uyPFgZSTwNOV+YZy1svF/7xeovxzJ2VhnCgakukgP+FIRSgPBA2YoETxiSaYCKZ3RWSEBSZKx1bVIdizJ8+T9knd1vz2tNa4LOOowAEcwjHYcAYNuIEmtIDAIzzDK7wZT8aL8W58TEsXjLJnD/7A+PwBfsCXaQ==</latexit><latexit sha1_base64="4Zfe0oX1mHwIs7JR9hck+NzgEI8=">AAACA3icbVDLSsNAFL3xWesr6k43g0VwVRIRdCMUdeGyin1AG8JkOmmHTh7MTIQSAm78FTcuFHHrT7jzb5ykWWjrgYEz59zLvfd4MWdSWda3sbC4tLyyWlmrrm9sbm2bO7ttGSWC0BaJeCS6HpaUs5C2FFOcdmNBceBx2vHGV7nfeaBCsii8V5OYOgEehsxnBCstueZ+P8BqRDBPrzO3iy5Q8fe89C5zzZpVtwqgeWKXpAYlmq751R9EJAloqAjHUvZsK1ZOioVihNOs2k8kjTEZ4yHtaRrigEonLW7I0JFWBsiPhH6hQoX6uyPFgZSTwNOV+YZy1svF/7xeovxzJ2VhnCgakukgP+FIRSgPBA2YoETxiSaYCKZ3RWSEBSZKx1bVIdizJ8+T9knd1vz2tNa4LOOowAEcwjHYcAYNuIEmtIDAIzzDK7wZT8aL8W58TEsXjLJnD/7A+PwBfsCXaQ==</latexit>
Linear Regression H = R2<latexit sha1_base64="mTEr5jAgHxn2wZoxVHV+OCctKmE=">AAACA3icbVDLSsNAFL3xWesr6k43g0VwVZIi6EYouumyin1AG8tkOmmHTh7MTIQSAm78FTcuFHHrT7jzb5ykWWjrgYEz59zLvfe4EWdSWda3sbS8srq2Xtoob25t7+yae/ttGcaC0BYJeSi6LpaUs4C2FFOcdiNBse9y2nEn15nfeaBCsjC4U9OIOj4eBcxjBCstDczDvo/VmGCeNFJ0ifKf6ya36X1tYFasqpUDLRK7IBUo0ByYX/1hSGKfBopwLGXPtiLlJFgoRjhNy/1Y0giTCR7RnqYB9ql0kvyGFJ1oZYi8UOgXKJSrvzsS7Es59V1dme0o571M/M/rxcq7cBIWRLGiAZkN8mKOVIiyQNCQCUoUn2qCiWB6V0TGWGCidGxlHYI9f/IiadeqtuY3Z5X6VRFHCY7gGE7BhnOoQwOa0AICj/AMr/BmPBkvxrvxMStdMoqeA/gD4/MHScqXRg==</latexit><latexit sha1_base64="mTEr5jAgHxn2wZoxVHV+OCctKmE=">AAACA3icbVDLSsNAFL3xWesr6k43g0VwVZIi6EYouumyin1AG8tkOmmHTh7MTIQSAm78FTcuFHHrT7jzb5ykWWjrgYEz59zLvfe4EWdSWda3sbS8srq2Xtoob25t7+yae/ttGcaC0BYJeSi6LpaUs4C2FFOcdiNBse9y2nEn15nfeaBCsjC4U9OIOj4eBcxjBCstDczDvo/VmGCeNFJ0ifKf6ya36X1tYFasqpUDLRK7IBUo0ByYX/1hSGKfBopwLGXPtiLlJFgoRjhNy/1Y0giTCR7RnqYB9ql0kvyGFJ1oZYi8UOgXKJSrvzsS7Es59V1dme0o571M/M/rxcq7cBIWRLGiAZkN8mKOVIiyQNCQCUoUn2qCiWB6V0TGWGCidGxlHYI9f/IiadeqtuY3Z5X6VRFHCY7gGE7BhnOoQwOa0AICj/AMr/BmPBkvxrvxMStdMoqeA/gD4/MHScqXRg==</latexit><latexit sha1_base64="mTEr5jAgHxn2wZoxVHV+OCctKmE=">AAACA3icbVDLSsNAFL3xWesr6k43g0VwVZIi6EYouumyin1AG8tkOmmHTh7MTIQSAm78FTcuFHHrT7jzb5ykWWjrgYEz59zLvfe4EWdSWda3sbS8srq2Xtoob25t7+yae/ttGcaC0BYJeSi6LpaUs4C2FFOcdiNBse9y2nEn15nfeaBCsjC4U9OIOj4eBcxjBCstDczDvo/VmGCeNFJ0ifKf6ya36X1tYFasqpUDLRK7IBUo0ByYX/1hSGKfBopwLGXPtiLlJFgoRjhNy/1Y0giTCR7RnqYB9ql0kvyGFJ1oZYi8UOgXKJSrvzsS7Es59V1dme0o571M/M/rxcq7cBIWRLGiAZkN8mKOVIiyQNCQCUoUn2qCiWB6V0TGWGCidGxlHYI9f/IiadeqtuY3Z5X6VRFHCY7gGE7BhnOoQwOa0AICj/AMr/BmPBkvxrvxMStdMoqeA/gD4/MHScqXRg==</latexit><latexit sha1_base64="mTEr5jAgHxn2wZoxVHV+OCctKmE=">AAACA3icbVDLSsNAFL3xWesr6k43g0VwVZIi6EYouumyin1AG8tkOmmHTh7MTIQSAm78FTcuFHHrT7jzb5ykWWjrgYEz59zLvfe4EWdSWda3sbS8srq2Xtoob25t7+yae/ttGcaC0BYJeSi6LpaUs4C2FFOcdiNBse9y2nEn15nfeaBCsjC4U9OIOj4eBcxjBCstDczDvo/VmGCeNFJ0ifKf6ya36X1tYFasqpUDLRK7IBUo0ByYX/1hSGKfBopwLGXPtiLlJFgoRjhNy/1Y0giTCR7RnqYB9ql0kvyGFJ1oZYi8UOgXKJSrvzsS7Es59V1dme0o571M/M/rxcq7cBIWRLGiAZkN8mKOVIiyQNCQCUoUn2qCiWB6V0TGWGCidGxlHYI9f/IiadeqtuY3Z5X6VRFHCY7gGE7BhnOoQwOa0AICj/AMr/BmPBkvxrvxMStdMoqeA/gD4/MHScqXRg==</latexit>
Polynomial Regression H = Rd<latexit sha1_base64="j57nMtFKsMQ9nf2oG2XASbwbh2E=">AAACA3icbVDLSsNAFL2pr1pfUXe6GSyCq5KIoBuh6KbLKvYBbSyTyaQdOnkwMxFKCLjxV9y4UMStP+HOv3GSdqGtBwbOnHMv997jxpxJZVnfRmlpeWV1rbxe2djc2t4xd/faMkoEoS0S8Uh0XSwpZyFtKaY47caC4sDltOOOr3O/80CFZFF4pyYxdQI8DJnPCFZaGpgH/QCrEcE8bWToEhU/101vs3tvYFatmlUALRJ7RqowQ3NgfvW9iCQBDRXhWMqebcXKSbFQjHCaVfqJpDEmYzykPU1DHFDppMUNGTrWiof8SOgXKlSovztSHEg5CVxdme8o571c/M/rJcq/cFIWxomiIZkO8hOOVITyQJDHBCWKTzTBRDC9KyIjLDBROraKDsGeP3mRtE9rtuY3Z9X61SyOMhzCEZyADedQhwY0oQUEHuEZXuHNeDJejHfjY1paMmY9+/AHxucPlZKXeA==</latexit><latexit sha1_base64="j57nMtFKsMQ9nf2oG2XASbwbh2E=">AAACA3icbVDLSsNAFL2pr1pfUXe6GSyCq5KIoBuh6KbLKvYBbSyTyaQdOnkwMxFKCLjxV9y4UMStP+HOv3GSdqGtBwbOnHMv997jxpxJZVnfRmlpeWV1rbxe2djc2t4xd/faMkoEoS0S8Uh0XSwpZyFtKaY47caC4sDltOOOr3O/80CFZFF4pyYxdQI8DJnPCFZaGpgH/QCrEcE8bWToEhU/101vs3tvYFatmlUALRJ7RqowQ3NgfvW9iCQBDRXhWMqebcXKSbFQjHCaVfqJpDEmYzykPU1DHFDppMUNGTrWiof8SOgXKlSovztSHEg5CVxdme8o571c/M/rJcq/cFIWxomiIZkO8hOOVITyQJDHBCWKTzTBRDC9KyIjLDBROraKDsGeP3mRtE9rtuY3Z9X61SyOMhzCEZyADedQhwY0oQUEHuEZXuHNeDJejHfjY1paMmY9+/AHxucPlZKXeA==</latexit><latexit sha1_base64="j57nMtFKsMQ9nf2oG2XASbwbh2E=">AAACA3icbVDLSsNAFL2pr1pfUXe6GSyCq5KIoBuh6KbLKvYBbSyTyaQdOnkwMxFKCLjxV9y4UMStP+HOv3GSdqGtBwbOnHMv997jxpxJZVnfRmlpeWV1rbxe2djc2t4xd/faMkoEoS0S8Uh0XSwpZyFtKaY47caC4sDltOOOr3O/80CFZFF4pyYxdQI8DJnPCFZaGpgH/QCrEcE8bWToEhU/101vs3tvYFatmlUALRJ7RqowQ3NgfvW9iCQBDRXhWMqebcXKSbFQjHCaVfqJpDEmYzykPU1DHFDppMUNGTrWiof8SOgXKlSovztSHEg5CVxdme8o571c/M/rJcq/cFIWxomiIZkO8hOOVITyQJDHBCWKTzTBRDC9KyIjLDBROraKDsGeP3mRtE9rtuY3Z9X61SyOMhzCEZyADedQhwY0oQUEHuEZXuHNeDJejHfjY1paMmY9+/AHxucPlZKXeA==</latexit><latexit sha1_base64="j57nMtFKsMQ9nf2oG2XASbwbh2E=">AAACA3icbVDLSsNAFL2pr1pfUXe6GSyCq5KIoBuh6KbLKvYBbSyTyaQdOnkwMxFKCLjxV9y4UMStP+HOv3GSdqGtBwbOnHMv997jxpxJZVnfRmlpeWV1rbxe2djc2t4xd/faMkoEoS0S8Uh0XSwpZyFtKaY47caC4sDltOOOr3O/80CFZFF4pyYxdQI8DJnPCFZaGpgH/QCrEcE8bWToEhU/101vs3tvYFatmlUALRJ7RqowQ3NgfvW9iCQBDRXhWMqebcXKSbFQjHCaVfqJpDEmYzykPU1DHFDppMUNGTrWiof8SOgXKlSovztSHEg5CVxdme8o571c/M/rJcq/cFIWxomiIZkO8hOOVITyQJDHBCWKTzTBRDC9KyIjLDBROraKDsGeP3mRtE9rtuY3Z9X61SyOMhzCEZyADedQhwY0oQUEHuEZXuHNeDJejHfjY1paMmY9+/AHxucPlZKXeA==</latexit>
47
❖ Most ML models are parametric; an ML model’s learning algorithm is a program to help pick a specific function from its hypothesis space, i.e., set parameter values
❖ Accuracy of ML model quantified by how well the picked function “fits” the observed data
❖ Two kinds of accuracy: “train” vs “test” ❖ Train is on same data seen by learning algorithm ❖ Test is on held out data; evaluates “generalization”
❖ Different ML models represent different sets of functions and thus, typically yield different accuracy
Statistical Machine Learning 101
48
Artificial Neural Networks (ANNs)
❖ Extremely powerful, versatile, and popular ML model ❖ Can be viewed as a generalization of linear regression:
append non-linearity after linear regression: composite is called “neuron”; line up many neurons: called a “layer”; stack many such layers from input to output
❖ Universal function approximator!
❖ A special form of ANNs can simulate arbitrary Turing Machines!
49
Artificial Neural Networks (ANNs)
❖ Different forms of ANNs tailored for different “raw” data representations for input and/or output
❖ Multi-Layer Perceptrons (MLPs) deal with a “flat” numeric vector for input and/or output
❖ Recurrent Neural Networks (RNNs) tailored for sequentially structured inputs (text, time series, etc.), Convolutional Neural Networks (CNNs) for images/spatially structured data, etc.
❖ Bottomline: ANNs can help approximate any function!
f : DX ! DY<latexit sha1_base64="FTGe5Y5wZ8mndx++66j4g7Qhpg8=">AAACE3icbVDLSsNAFJ34rPUVdelmsAjioiQiKK6KunBZwT6kCWEynbRDJzNhZqKU0H9w46+4caGIWzfu/BsnbRa29cCFwzn3cu89YcKo0o7zYy0sLi2vrJbWyusbm1vb9s5uU4lUYtLAggnZDpEijHLS0FQz0k4kQXHISCscXOV+64FIRQW/08OE+DHqcRpRjLSRAvs4uoBejHQfI5Zdj4I29CTt9TWSUjxOOfeBXXGqzhhwnrgFqYAC9cD+9roCpzHhGjOkVMd1Eu1nSGqKGRmVvVSRBOEB6pGOoRzFRPnZ+KcRPDRKF0ZCmuIajtW/ExmKlRrGoenMj1SzXi7+53VSHZ37GeVJqgnHk0VRyqAWMA8IdqkkWLOhIQhLam6FuI8kwtrEWDYhuLMvz5PmSdV1qu7taaV2WcRRAvvgABwBF5yBGrgBddAAGDyBF/AG3q1n69X6sD4nrQtWMbMHpmB9/QIPU55H</latexit><latexit sha1_base64="FTGe5Y5wZ8mndx++66j4g7Qhpg8=">AAACE3icbVDLSsNAFJ34rPUVdelmsAjioiQiKK6KunBZwT6kCWEynbRDJzNhZqKU0H9w46+4caGIWzfu/BsnbRa29cCFwzn3cu89YcKo0o7zYy0sLi2vrJbWyusbm1vb9s5uU4lUYtLAggnZDpEijHLS0FQz0k4kQXHISCscXOV+64FIRQW/08OE+DHqcRpRjLSRAvs4uoBejHQfI5Zdj4I29CTt9TWSUjxOOfeBXXGqzhhwnrgFqYAC9cD+9roCpzHhGjOkVMd1Eu1nSGqKGRmVvVSRBOEB6pGOoRzFRPnZ+KcRPDRKF0ZCmuIajtW/ExmKlRrGoenMj1SzXi7+53VSHZ37GeVJqgnHk0VRyqAWMA8IdqkkWLOhIQhLam6FuI8kwtrEWDYhuLMvz5PmSdV1qu7taaV2WcRRAvvgABwBF5yBGrgBddAAGDyBF/AG3q1n69X6sD4nrQtWMbMHpmB9/QIPU55H</latexit><latexit sha1_base64="FTGe5Y5wZ8mndx++66j4g7Qhpg8=">AAACE3icbVDLSsNAFJ34rPUVdelmsAjioiQiKK6KunBZwT6kCWEynbRDJzNhZqKU0H9w46+4caGIWzfu/BsnbRa29cCFwzn3cu89YcKo0o7zYy0sLi2vrJbWyusbm1vb9s5uU4lUYtLAggnZDpEijHLS0FQz0k4kQXHISCscXOV+64FIRQW/08OE+DHqcRpRjLSRAvs4uoBejHQfI5Zdj4I29CTt9TWSUjxOOfeBXXGqzhhwnrgFqYAC9cD+9roCpzHhGjOkVMd1Eu1nSGqKGRmVvVSRBOEB6pGOoRzFRPnZ+KcRPDRKF0ZCmuIajtW/ExmKlRrGoenMj1SzXi7+53VSHZ37GeVJqgnHk0VRyqAWMA8IdqkkWLOhIQhLam6FuI8kwtrEWDYhuLMvz5PmSdV1qu7taaV2WcRRAvvgABwBF5yBGrgBddAAGDyBF/AG3q1n69X6sD4nrQtWMbMHpmB9/QIPU55H</latexit><latexit sha1_base64="FTGe5Y5wZ8mndx++66j4g7Qhpg8=">AAACE3icbVDLSsNAFJ34rPUVdelmsAjioiQiKK6KunBZwT6kCWEynbRDJzNhZqKU0H9w46+4caGIWzfu/BsnbRa29cCFwzn3cu89YcKo0o7zYy0sLi2vrJbWyusbm1vb9s5uU4lUYtLAggnZDpEijHLS0FQz0k4kQXHISCscXOV+64FIRQW/08OE+DHqcRpRjLSRAvs4uoBejHQfI5Zdj4I29CTt9TWSUjxOOfeBXXGqzhhwnrgFqYAC9cD+9roCpzHhGjOkVMd1Eu1nSGqKGRmVvVSRBOEB6pGOoRzFRPnZ+KcRPDRKF0ZCmuIajtW/ExmKlRrGoenMj1SzXi7+53VSHZ37GeVJqgnHk0VRyqAWMA8IdqkkWLOhIQhLam6FuI8kwtrEWDYhuLMvz5PmSdV1qu7taaV2WcRRAvvgABwBF5yBGrgBddAAGDyBF/AG3q1n69X6sD4nrQtWMbMHpmB9/QIPU55H</latexit>
(Switch to Tim Kraska’s slides)
50
Sorting: Outline
❖ Overview and Warm-up ❖ Multi-way External Merge Sort (EMS) ❖ EMS and B+ trees
51
Motivation for Sorting
❖ User’s SQL query has ORDER BY clause! ❖ First step of bulk loading of a B+ tree index ❖ Used in implementations of many relational ops: project,
join, set ops, group by aggregate, etc. (next topic!)
Q: But sorting is well-known; why should a DBMS bother?
Often, the file (relation) to be sorted will not fit in RAM!
“External” Sorting
52
External Sorting: Overview
❖ Goal: Given relation R with N pages, SortKey A, M buffer pages (often, M << N), sort R on A to get sorted R’
❖ Idea: Sorting algorithm should be disk page I/O-aware! ❖ Desiderata: ❖ High efficiency, i.e., low I/O cost, even for very large N ❖ Use sequential I/Os rather than random I/Os AMAP ❖ Interleave I/O and comp. (DMA); reduce CPU cost too
NB: I/O-aware sorting is also a key part of the implementation of MapReduce/Hadoop!
!53
Warm-up: 2-way External Merge Sort
1. Sort phase: Read each page into buffer memory; do “internal” sort (use any popular fast sorting algorithm, e.g., quicksort); write it back to disk (a sorted “run”)
2. Merge phase: Read 2 runs, merge them on the fly, write out a new double-length run; recurse till whole file is a run!
Idea: Make Merge Sort I/O-aware!
NB: Sort phase is 1-pass; merge phase is often multi-pass!
Input file3,4 6,2 9,4 8,7 5,6 3,1 2
1-page runsPASS 0
3,4 5,62,6 4,9 7,8 1,3 2
2-page runs
PASS 1
2,34,6
4,7
8,91,35,6 2
4-page runs
PASS 2
2,3
4,46,7
8,9
1,23,56
8-page runs
PASS 3
9
1,22,3
3,4
4,56,6
7,8
!54
Warm-up: 2-way External Merge Sort
Each pass does 1 read and 1 write of whole file: 2N page I/Os per pass
Number of passes: Sort phase: 1 Merge phase: dlog2(N)e
I/O cost of 2-way EMS:2N(1 + dlog2(N)e)
Q: How to reduce this cost further?Whole file is sorted!
N=7 pages=2*7*4=56
55
Sorting: Outline
❖ Overview and Warm-up ❖ Multi-way External Merge Sort (EMS) ❖ EMS and B+ trees
!56
Multi-way EMS: Motivation
Idea: Why not exploit more buffer pages (say, B >> 3)?
Q: How many buffer pages does 2-way EMS use?
Sort phase: 2 (1 for read, 1 for write) Merge phase: 3 (1 for each run input; 1 for merged output)
So, 2-way EMS uses only 3 buffer pages!
Sort phase: Read B pages at a time (not just 1 at a time)! Write out sorted runs of length B each (not just 1) But I/O cost of sort phase is still the same!
dN/Be
!57
Multi-way EMS: B-way Merge Phase
Idea: In 2-way EMS, we merge 2 sorted runs at a time; in multi-way EMS, we merge B-1 sorted runs at a time!
B-1 way merge; total # buffer pages used: B
INPUT 1
INPUT B-1
OUTPUT
DiskDisk
INPUT 2
. . .. . .. . .
# passes for Merge Phase reduces to: dlogB�1(dN/Be)e
!58
Multi-way EMS I/O Cost
Overall, # passes = I/O cost per pass =
Total I/O cost of EMS =
1 + dlogB�1(dN/Be)e
2N(1 + dlogB�1(dN/Be)e)2N
Example: File with 100M records of length 0.5KB each; page size is 8KB; number of buffer pages for EMS B=1000
Number of pages N = 100M * 0.5KB / 8KB = 6.25M
Total I/O cost of EMS =
Only need the ceil!= 2 x 6.25M x (1 + 2) = 37.5M
2⇥ 6.25M ⇥ (1 + dlog999(6250)e)
!59
Multi-way EMS I/O Cost
Total number of passes =
N Naive 2-way EMS B=1K B=10K B=100K
1M 21 3 2 2
10M 25 3 2 2
100M 28 3 3 2
1B 31 4 3 2
1 + dlogB�1(dN/Be)e
With 8KB page, 7.5TB!
With 8KB page, 782MB
Only 2 passes to sort up to 74.5TB! (2 is the lower bound for EMS!)
!60
Multi-way EMS: Improvements
❖ While already efficient, some key algorithmic+systems-oriented improvements have been made to multi-way EMS to reduce overall runtime (not just counting I/O cost)
❖ Three prominent improvements: 1. Replacement sort (aka heap sort) as internal sort 2. “Blocked” I/O 3. Double Buffering
!61
Improvement 1: Replacement Sort
❖ In standard EMS, quick sort used during Sort Phase ❖ Produces runs of length B pages each
❖ Replacement sort is an alternative for Sort Phase ❖ Produces runs of average length 2B pages each ❖ So, number of runs reduced on average to ❖ Maintains a sorted heap in B-2 pages; 1 page for
reading; 1 for sorted output ❖ Slightly higher CPU cost; but signif. lower I/O cost
dN/Be
dN/2Be
(We are skipping the details of this algorithm)
New total I/O cost = 2N(1 + dlogB�1(dN/2Be)e)
!62
Improvement 2: “Blocked” I/O
❖ Merge Phase did not recognize distinction between sequential I/O and random I/O!
❖ Time difference not reflected in counting I/O cost ❖ Idea: Read a “block” of b pages of each run at a time! ❖ So, only runs can be merged at a time ❖ b controls trade-off of # passes vs time-per-pass
New total I/O cost =
bB/bc � 1
“Fan-in” of Merge Phase =
or
!63
Improvement 3: Double Buffering
❖ Most machines have DMA; enables I/O-CPU parallelism ❖ Trivially feasible to exploit DMA in the Sort Phase ❖ But in the Merge Phase, CPU blocked by I/O for runs ❖ Idea: Allocate double the buffers for each run; while CPU
processes one set, read pages (I/O) into other set! ❖ So, only runs can be merged at a time
New total I/O cost =
New fan-in of Merge Phase = F =
or
bB/2bc � 1
bB/2bc � 1
64
Sorting: Outline
❖ Overview and Warm-up ❖ Multi-way External Merge Sort (EMS) ❖ EMS and B+ trees
!65
Using B+ Tree for EMS
❖ Suppose we already have a B+ tree index with the SortKey being equal to (or a prefix of) the IndexKey
❖ Data entries of the B+ tree are already in sorted order!
Q: Is it a “good” idea to simply read the leaf level of the B+ tree to achieve the EMS then?
It depends!
On whether the index is clustered or not!
Good idea! Might be really bad!
Data Pages
Index
Data Entries
!66
Using Clustered B+ Tree for EMS
❖ Go down the tree to reach left-most leaf
❖ Scan leaf pages (data entries) left to right
❖ If AltRecord, done! O/W, retrieve data pages pointed to by successive data entries
❖ I/O cost if AltRecord: height + # leaf pages ❖ I/O cost otherwise: height + # leaf pages + # data pages ❖ Either way, I/O cost often << from-scratch EMS!
Data Pages
Index
Data Entries
!67
Using Unclustered B+ Tree for EMS
❖ Unclustered means not AltRecord! Why?
❖ Same procedure as for clustered B+ tree
❖ Same I/O “cost” as for clustered tree with AltRID/AltRIDlist but many back-to-back random I/Os; thrashing!
❖ Usually, much slower than from-scratch EMS!
Q: But when is this faster than from-scratch EMS?
!68
External Sorting as Competitive Sport!
The geekiest “sport” in the world: sortbenchmark.org