![Page 1: 8. External Sorting](https://reader035.vdocument.in/reader035/viewer/2022062409/56814868550346895db57541/html5/thumbnails/1.jpg)
8. External Sorting
Suppose that a file is so large that the whole file cannot be accommodated in the internal memory of a computer.
What shall we do?
Need to use EXTERNAL STORAGE DEVICE !!!
External Sorting
- Disk Sort
- Tape Sort
What is a major difference between two external sorts?
![Page 2: 8. External Sorting](https://reader035.vdocument.in/reader035/viewer/2022062409/56814868550346895db57541/html5/thumbnails/2.jpg)
Sorting with Disk
[Figure: k-way merging (“mergesort”): an internal sort produces the runs, which are then merged.]
![Page 3: 8. External Sorting](https://reader035.vdocument.in/reader035/viewer/2022062409/56814868550346895db57541/html5/thumbnails/3.jpg)
Example
4500 records
250 records/block
available memory = 3 blocks
Def’n : A segment of a file is said to be a run if all the records in the segment are sorted.
I (input file): runs 1 2 3 4 5 6
D1: runs 1 3 5 ……
D2: runs 2 4 6 ……
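A minimal sketch of how the initial runs in this example could be produced, assuming all 3 blocks of memory are filled and sorted internally at once. The function name `make_initial_runs` and the made-up keys are illustrative only and do not come from the slides.

```python
def make_initial_runs(records, records_per_block, memory_blocks):
    """Sort memory-sized chunks of the file; each sorted chunk is one run."""
    capacity = records_per_block * memory_blocks   # records that fit in memory
    return [sorted(records[i:i + capacity])
            for i in range(0, len(records), capacity)]

# The example's figures: 4500 records, 250 records/block, 3 blocks of memory.
records = [(7919 * i) % 4500 for i in range(4500)]   # made-up keys (a permutation of 0..4499)
runs = make_initial_runs(records, 250, 3)
print(len(runs), len(runs[0]))   # 6 runs of 750 records each
```

With 750 records per run, the 4500 records give the 6 runs shown on the slide.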
![Page 4: 8. External Sorting](https://reader035.vdocument.in/reader035/viewer/2022062409/56814868550346895db57541/html5/thumbnails/4.jpg)
[Figure: 2-way merge of the runs on D1 and D2 (run size 3) onto D3 and D4 (run size 6). The number attached to each run is the size of a run.]
![Page 5: 8. External Sorting](https://reader035.vdocument.in/reader035/viewer/2022062409/56814868550346895db57541/html5/thumbnails/5.jpg)
Example: 8 initial runs (1 to 8), merged 2 ways

| Run size | Runs |
|---|---|
| 3 | 1 3 5 7 / 2 4 6 8 |
| 6 | 12 34 56 78 |
| 12 | 1256 3478 |
| 24 | 12345678 |

How many passes?
$1 + \log_2 r$
($r$: # of initial runs)
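The pass count can be checked with a small in-memory sketch. It assumes the runs are plain Python lists and counts the run-generation pass as pass 1; `merge_two` and `two_way_merge_sort` are hypothetical names, and the keys are made up.

```python
def merge_two(a, b):
    """Merge two sorted lists into one sorted list."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out

def two_way_merge_sort(runs):
    """Repeatedly merge neighbouring runs; expects at least one run."""
    passes = 1                      # the pass that produced the initial runs
    while len(runs) > 1:
        merged = [merge_two(runs[i], runs[i + 1])
                  for i in range(0, len(runs) - 1, 2)]
        if len(runs) % 2:           # an odd run out is carried over unchanged
            merged.append(runs[-1])
        runs = merged
        passes += 1
    return runs[0], passes

keys = [(7 * i) % 24 for i in range(24)]                      # made-up keys
initial = [sorted(keys[i:i + 3]) for i in range(0, 24, 3)]    # r = 8 runs of size 3
result, passes = two_way_merge_sort(initial)
print(passes)   # 4, i.e. 1 + log2(8)
```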
![Page 6: 8. External Sorting](https://reader035.vdocument.in/reader035/viewer/2022062409/56814868550346895db57541/html5/thumbnails/6.jpg)
# of I/O operations
$O(n \log_2 r) = O\!\left(n \log_2 \frac{n}{a}\right)$, where $r = \frac{n}{a}$ and $a$ is the initial run size.
![Page 7: 8. External Sorting](https://reader035.vdocument.in/reader035/viewer/2022062409/56814868550346895db57541/html5/thumbnails/7.jpg)
k-way merging
[Figure: the r initial runs are merged k at a time; the merge tree has $\log_k r$ levels.]
# of passes
$1 + \log_k r$
# of I/O operations?
$O(n \log_k r)$
better than 2-way merging !!!
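A sketch of the effect of k on the number of merge passes, assuming in-memory lists stand in for runs on disk. It uses the standard-library `heapq.merge` for the k-way merge; `k_way_merge_passes` and the 64 made-up runs are illustrative only. The count here excludes the run-generation pass.

```python
import heapq

def k_way_merge_passes(runs, k):
    """Merge runs k at a time until one run is left; return it and the pass count."""
    passes = 0
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[i:i + k]))
                for i in range(0, len(runs), k)]
        passes += 1
    return runs[0], passes

runs = [[i, i + 64, i + 128] for i in range(64)]   # r = 64 made-up sorted runs
for k in (2, 4, 8):
    merged, passes = k_way_merge_passes(runs, k)
    print(k, passes)   # 2 -> 6, 4 -> 3, 8 -> 2 merge passes
```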
![Page 8: 8. External Sorting](https://reader035.vdocument.in/reader035/viewer/2022062409/56814868550346895db57541/html5/thumbnails/8.jpg)
How about # of comparisons?
Is k-way merging always better than 2-way merging?
![Page 9: 8. External Sorting](https://reader035.vdocument.in/reader035/viewer/2022062409/56814868550346895db57541/html5/thumbnails/9.jpg)
Replacement Selection
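[Figure: k-way merge tree over the r initial runs.]
# of passes, #(P)
$\#(P) = 1 + \log_k r$
#(P) decreases as k increases or as r decreases; r decreases as the run size increases.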
![Page 10: 8. External Sorting](https://reader035.vdocument.in/reader035/viewer/2022062409/56814868550346895db57541/html5/thumbnails/10.jpg)
# of comparisons(k-way merge)
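[Figure: selection tree for an 8-way merge. Nodes are numbered 1 to 15; nodes 8 to 15 are the leaves, one per run.]

| Run | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| 3rd record | 16 | 38 | 30 | 25 | 50 | 16 | 110 | 20 |
| 2nd record | 15 | 20 | 20 | 25 | 15 | 11 | 120 | 18 |
| 1st record (leaf) | 10 | 9 | 20 | 15 | 8 | 9 | 90 | 17 |

Winners at nodes 4 to 7: 9 15 8 17
Winners at nodes 2 and 3: 9 8
Winner at node 1 (root): 8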
![Page 11: 8. External Sorting](https://reader035.vdocument.in/reader035/viewer/2022062409/56814868550346895db57541/html5/thumbnails/11.jpg)
How many comparisons in a pass?
$n \log_2 k$ why?
Total # of comparisons?
(# of passes) × (# of comparisons in a pass)
$= (\log_k r)(n \log_2 k)$
$= n \log_2 r$, independent of k !!!
#(c) decreases as r decreases
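The step that makes k disappear is a change of base. A short worked version, assuming (as the selection tree suggests) about $\log_2 k$ comparisons per record output in each pass:

```latex
\begin{aligned}
(\log_k r)\,(n \log_2 k)
  &= \frac{\log_2 r}{\log_2 k}\cdot n \log_2 k
     && \text{since } \log_k r = \frac{\log_2 r}{\log_2 k} \\
  &= n \log_2 r
\end{aligned}
```

A larger k means fewer passes, but each pass costs proportionally more comparisons per record, so the product stays the same.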
![Page 12: 8. External Sorting](https://reader035.vdocument.in/reader035/viewer/2022062409/56814868550346895db57541/html5/thumbnails/12.jpg)
How to increase run size(initial run size)
$x_1, x_2, x_3, \ldots, x_m,\; x_{m+1}, x_{m+2}, x_{m+3}, \ldots, x_{2m},\; x_{2m+1}, x_{2m+2}, x_{2m+3}, \ldots$
(m keys) (m keys) (m keys) …...
r = # of runs = $\frac{n}{m}$
Any improvement?
Observation
See p.94 in textbook !!!
![Page 13: 8. External Sorting](https://reader035.vdocument.in/reader035/viewer/2022062409/56814868550346895db57541/html5/thumbnails/13.jpg)
4,2,32,12,18,24,91,11
(record size >> the size of pointer)
why do we need this?
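[Figure: selection tree built over the keys above, with numbered nodes; the nodes hold pointers to records rather than the records themselves.]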
![Page 14: 8. External Sorting](https://reader035.vdocument.in/reader035/viewer/2022062409/56814868550346895db57541/html5/thumbnails/14.jpg)
A tree of losers
Keys: 4, 2, 32, 12, 18, 24, 91, 11
Node fields: parent, loser
Updating pointers

    ptr := winner.parent;
    while ptr ≠ nil do
        if (ptr.loser.key < winner.key) then
            interchange(ptr.loser, winner);
        end {if}
        ptr := ptr.parent;
    end {while}

[Figure: tree of losers over the keys above, with the overall winner kept separately from the tree.]
![Page 15: 8. External Sorting](https://reader035.vdocument.in/reader035/viewer/2022062409/56814868550346895db57541/html5/thumbnails/15.jpg)
Explain p.97-101, textbook !!!
Exercise :
In a complete 2-tree(T) with n leaf nodes,
show that
total # of nodes in T = 2n -1
![Page 16: 8. External Sorting](https://reader035.vdocument.in/reader035/viewer/2022062409/56814868550346895db57541/html5/thumbnails/16.jpg)
Performance Analysis
(Average size of runs)
m0: # of records that fit in (real) memory.
H. Seward (M.S. Thesis, MIT, 1954)
gave a good reason to believe that a run contains more than 1.5 m0 records
(no proof)
E. Friend (JACM, 3, (1956))
found by experiment that the run length is about 2 m0
E. Moore (1961)
proved that 2 m0 is the expected run length.
![Page 17: 8. External Sorting](https://reader035.vdocument.in/reader035/viewer/2022062409/56814868550346895db57541/html5/thumbnails/17.jpg)
Sketch of Moore’s Proof
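[Figure: snowplow on a circular track under steadily falling snow (uniform distribution). In the steady state the snow lying on the track corresponds to the m0 records in memory, and the snow removed in one circuit corresponds to a run of 2m0 records.]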
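The 2 m0 figure can be observed with a small experiment. The sketch below implements replacement selection with a binary heap from the standard library (`heapq`) instead of the tree of losers used on the earlier slides; this is a swapped-in data structure, chosen only to keep the example short. The function name and the random test data are assumptions.

```python
import heapq
import random
from itertools import islice

def replacement_selection(records, m):
    """Split records into sorted runs using a memory of m records."""
    it = iter(records)
    heap = [(0, x) for x in islice(it, m)]    # entries are (run number, key)
    heapq.heapify(heap)
    runs, current, run_no = [], [], 0
    while heap:
        r, x = heapq.heappop(heap)
        if r != run_no:                       # smallest entry belongs to the next run
            runs.append(current)
            current, run_no = [], r
        current.append(x)
        for y in islice(it, 1):               # read one replacement record, if any
            # y may join the current run only if it is not below the last output key
            heapq.heappush(heap, (r if y >= x else r + 1, y))
    if current:
        runs.append(current)
    return runs

print(replacement_selection([4, 2, 32, 12, 18, 24, 91, 11], 3))
# [[2, 4, 12, 18, 24, 32, 91], [11]]

random.seed(1)
data = [random.random() for _ in range(20000)]
m0 = 100
avg = len(data) / len(replacement_selection(data, m0))
print(avg / m0)   # expected to be close to 2 for random input
```

Already sorted input yields a single run, while input in reverse order yields runs of exactly m records, which is the worst case.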
![Page 18: 8. External Sorting](https://reader035.vdocument.in/reader035/viewer/2022062409/56814868550346895db57541/html5/thumbnails/18.jpg)
Tape Sorting
• Balanced k-way merging
(similar to disk sorting)
• Polyphase merging
• Cascade merging
![Page 19: 8. External Sorting](https://reader035.vdocument.in/reader035/viewer/2022062409/56814868550346895db57541/html5/thumbnails/19.jpg)
Polyphase Merging (Motivation)
– Records (R1, R2, …, R5000)
– length(Ri) = 20 bytes
– Only 1000 records fit in the internal memory at one time (≈ 20k bytes)
– 4 tapes available

Balanced 2-way merge

| Step | T1 | T2 | T3 | T4 |
|---|---|---|---|---|
| Initial runs | R1,1000; R2001,3000; R4001,5000 | R1001,2000; R3001,4000 | | |
| Merge 1 | | | R1,2000; R4001,5000 | R2001,4000 |
| Merge 2 | R1,4000 | R4001,5000 | | |
| Merge 3 | | | R1,5000 | |

Total # of operations = 15000
![Page 20: 8. External Sorting](https://reader035.vdocument.in/reader035/viewer/2022062409/56814868550346895db57541/html5/thumbnails/20.jpg)
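| Step | Tape 1 | Tape 2 | Tape 3 | Tape 4 |
|---|---|---|---|---|
| Initial runs | R1,1000; R3001,4000 | R1001,2000; R4001,5000 | R2001,3000 | |
| 3-way merge, then (rewind) | R3001,4000 | R4001,5000 | | R1,3000 |
| 3-way merge | | | R1,5000 | |

• Total # of I/O operations
3000 + 5000 = 8000
Balanced Merge is not always best !!!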
![Page 21: 8. External Sorting](https://reader035.vdocument.in/reader035/viewer/2022062409/56814868550346895db57541/html5/thumbnails/21.jpg)
What if only 3 tapes available?
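| Step | Tape 1 | Tape 2 | Tape 3 | I/O |
|---|---|---|---|---|
| Initial runs | R1,1000; R2001,3000; R4001,5000 | R1001,2000; R3001,4000 | | |
| Merge | | | R1,2000; R2001,4000; R4001,5000 | 5000 |
| Redistribute | R1,2000; R4001,5000 | R2001,4000 | | 2000 |
| Merge | | | R1,4000; R4001,5000 | 5000 |
| Redistribute | R4001,5000 | R1,4000 | | 4000 |
| Merge | | | R1,5000 | 5000 |

Total # of I/O Operations
5000 + 2000 + 5000 + 4000 + 5000 = 21,000 !!!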
![Page 22: 8. External Sorting](https://reader035.vdocument.in/reader035/viewer/2022062409/56814868550346895db57541/html5/thumbnails/22.jpg)
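| Step | Tape 1 | Tape 2 | Tape 3 |
|---|---|---|---|
| Initial runs | R1,1000; R2001,3000; R4001,5000 | R1001,2000; R3001,4000 | |
| Merge two pairs of runs | R4001,5000 | | R1,2000; R2001,4000 |
| (rewind) Merge R4001,5000 with R1,2000 | | R1,2000; 4001,5000 | R2001,4000 |
| (rewind) Merge | R1,5000 | | |

Total # of I/O Operations
4000 + 3000 + 5000 = 12,000 !!!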
![Page 23: 8. External Sorting](https://reader035.vdocument.in/reader035/viewer/2022062409/56814868550346895db57541/html5/thumbnails/23.jpg)
Polyphase merge
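Runs are written as $\text{length}^{\text{count}}$, so $1^{31}$ means 31 runs of relative length 1.

| T1 | T2 | T3 | T4 | T5 | T6 |
|---|---|---|---|---|---|
| $1^{31}$ | $1^{30}$ | $1^{28}$ | $1^{24}$ | $1^{16}$ | |
| $1^{15}$ | $1^{14}$ | $1^{12}$ | $1^{8}$ | | $5^{16}$ |
| $1^{7}$ | $1^{6}$ | $1^{4}$ | | $9^{8}$ | $5^{8}$ |
| $1^{3}$ | $1^{2}$ | | $17^{4}$ | $9^{4}$ | $5^{4}$ |
| $1^{1}$ | | $33^{2}$ | $17^{2}$ | $9^{2}$ | $5^{2}$ |
| | $65^{1}$ | $33^{1}$ | $17^{1}$ | $9^{1}$ | $5^{1}$ |
| $129^{1}$ | | | | | |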
How to assign initial runs?
![Page 24: 8. External Sorting](https://reader035.vdocument.in/reader035/viewer/2022062409/56814868550346895db57541/html5/thumbnails/24.jpg)
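Cascade Merge

| | T1 | T2 | T3 | T4 | T5 | T6 |
|---|---|---|---|---|---|---|
| Initial | $1^{55}$ | $1^{50}$ | $1^{41}$ | $1^{29}$ | $1^{15}$ | |
| Pass 1 | $1^{40}$ | $1^{35}$ | $1^{26}$ | $1^{14}$ | | $5^{15}$ |
| | $1^{26}$ | $1^{21}$ | $1^{12}$ | | $4^{14}$ | $5^{15}$ |
| | $1^{14}$ | $1^{9}$ | | $3^{12}$ | $4^{14}$ | $5^{15}$ |
| | $1^{5}$ | | $2^{9}$ | $3^{12}$ | $4^{14}$ | $5^{15}$ |
| | | ($1^{5}$ | $2^{9}$ | $3^{12}$ | $4^{14}$ | $5^{15}$) |
| Pass 2 | $15^{5}$ | | $2^{4}$ | $3^{7}$ | $4^{9}$ | $5^{10}$ |
| | $15^{5}$ | $14^{4}$ | | $3^{3}$ | $4^{5}$ | $5^{6}$ |
| | $15^{5}$ | $14^{4}$ | $12^{3}$ | | $4^{2}$ | $5^{3}$ |
| | $15^{5}$ | $14^{4}$ | $12^{3}$ | $9^{2}$ | | $5^{1}$ |
| | ($15^{5}$ | $14^{4}$ | $12^{3}$ | $9^{2}$ | $5^{1}$) | |
| Pass 3 | $15^{4}$ | $14^{3}$ | $12^{2}$ | $9^{1}$ | | $55^{1}$ |
| | $15^{3}$ | $14^{2}$ | $12^{1}$ | | $50^{1}$ | $55^{1}$ |
| | $15^{2}$ | $14^{1}$ | | $41^{1}$ | $50^{1}$ | $55^{1}$ |
| | $15^{1}$ | | $29^{1}$ | $41^{1}$ | $50^{1}$ | $55^{1}$ |
| | | ($15^{1}$ | $29^{1}$ | $41^{1}$ | $50^{1}$ | $55^{1}$) |
| Pass 4 | $190^{1}$ | | | | | |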
![Page 25: 8. External Sorting](https://reader035.vdocument.in/reader035/viewer/2022062409/56814868550346895db57541/html5/thumbnails/25.jpg)
Polyphase Merge
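| Phase | T1 | T2 | T3 | T4 | T5 | T6 |
|---|---|---|---|---|---|---|
| 1 | $1^{31}$ | $1^{30}$ | $1^{28}$ | $1^{24}$ | $1^{16}$ | |
| 2 | $1^{15}$ | $1^{14}$ | $1^{12}$ | $1^{8}$ | | $5^{16}$ |
| 3 | $1^{7}$ | $1^{6}$ | $1^{4}$ | | $9^{8}$ | $5^{8}$ |
| 4 | $1^{3}$ | $1^{2}$ | | $17^{4}$ | $9^{4}$ | $5^{4}$ |
| 5 | $1^{1}$ | | $33^{2}$ | $17^{2}$ | $9^{2}$ | $5^{2}$ |
| 6 | | $65^{1}$ | $33^{1}$ | $17^{1}$ | $9^{1}$ | $5^{1}$ |
| 7 | $129^{1}$ | | | | | |

Gilstad (1960)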
{{1,0,0,0,0},{1,1,1,1,1},{2,2,2,2,1},{4,4,4,3,2},{8,8,7,6,4},
{16,15,14,12,8},{31,30,28,24,16}}
Perfect Fibonacci Distribution !!!
What is the underlying rule?
![Page 26: 8. External Sorting](https://reader035.vdocument.in/reader035/viewer/2022062409/56814868550346895db57541/html5/thumbnails/26.jpg)
i ai bi ci di ei
0 1 0 0 0 0
1 1 1 1 1 1
2 2 2 2 2 1
3 4 4 4 3 2
4 8 8 7 6 4
5 16 15 14 12 8
6 31 30 28 24 16
![Page 27: 8. External Sorting](https://reader035.vdocument.in/reader035/viewer/2022062409/56814868550346895db57541/html5/thumbnails/27.jpg)
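| Level | T1 | T2 | T3 | T4 | T5 |
|---|---|---|---|---|---|
| 1 | $a_0 + b_0$ | $a_0 + c_0$ | $a_0 + d_0$ | $a_0 + e_0$ | $a_0$ |
| 2 | $a_1 + b_1$ | $a_1 + c_1$ | $a_1 + d_1$ | $a_1 + e_1$ | $a_1$ |
| 3 | $a_2 + b_2$ | $a_2 + c_2$ | $a_2 + d_2$ | $a_2 + e_2$ | $a_2$ |
| $n$ | $a_n$ | $b_n$ | $c_n$ | $d_n$ | $e_n$ |
| $n+1$ | $a_n + b_n$ | $a_n + c_n$ | $a_n + d_n$ | $a_n + e_n$ | $a_n$ |
| | $= a_{n+1}$ | $= b_{n+1}$ | $= c_{n+1}$ | $= d_{n+1}$ | $= e_{n+1}$ |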
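The rule in this table can be turned into a short generator of perfect distributions. This is a sketch under the slide's setting of 5 input tapes; `polyphase_levels` is a hypothetical name.

```python
def polyphase_levels(num_levels, tapes=5):
    """Return levels 0..num_levels of the perfect polyphase distribution."""
    level = [1] + [0] * (tapes - 1)           # level 0: (1, 0, 0, 0, 0)
    out = [tuple(level)]
    for _ in range(num_levels):
        a = level[0]
        # (a, b, c, d, e) -> (a + b, a + c, a + d, a + e, a)
        level = [a + x for x in level[1:]] + [a]
        out.append(tuple(level))
    return out

for i, lv in enumerate(polyphase_levels(6)):
    print(i, lv, sum(lv))
# level 6 is (31, 30, 28, 24, 16), 129 runs in total
```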
![Page 28: 8. External Sorting](https://reader035.vdocument.in/reader035/viewer/2022062409/56814868550346895db57541/html5/thumbnails/28.jpg)
i ai bi ci di ei output
0 1 0 0 0 0 T6
1 1 1 1 1 1 T1
2 2 2 2 2 1 T2
3 4 4 4 3 2 T3
2 2 2 1 0 2
1 1 1 0 1 1
4 8 8 7 6 4 T4
5 16 15 14 12 8 T5
6 31 30 28 24 16 T6
7 61 59 55 47 31
T1 T2 T3 T4 T5
![Page 29: 8. External Sorting](https://reader035.vdocument.in/reader035/viewer/2022062409/56814868550346895db57541/html5/thumbnails/29.jpg)
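| Level | T1 | T2 | T3 | T4 | T5 |
|---|---|---|---|---|---|
| $n-1$ | $a_{n-1}$ | $b_{n-1}$ | $c_{n-1}$ | $d_{n-1}$ | $e_{n-1}$ |
| $n$ | $a_{n-1}+b_{n-1}$ | $a_{n-1}+c_{n-1}$ | $a_{n-1}+d_{n-1}$ | $a_{n-1}+e_{n-1}$ | $a_{n-1}$ |
| | $= a_n$ | $= b_n$ | $= c_n$ | $= d_n$ | $= e_n$ |

$e_n = a_{n-1}$
$d_n = a_{n-1} + e_{n-1} = a_{n-1} + a_{n-2}$
$c_n = a_{n-1} + d_{n-1} = a_{n-1} + (a_{n-2} + e_{n-2}) = a_{n-1} + a_{n-2} + a_{n-3}$
………….
$e_n = a_{n-1}$
$d_n = a_{n-1} + a_{n-2}$
$c_n = a_{n-1} + a_{n-2} + a_{n-3}$
$b_n = a_{n-1} + a_{n-2} + a_{n-3} + a_{n-4}$
$a_n = a_{n-1} + a_{n-2} + a_{n-3} + a_{n-4} + a_{n-5}$
($a_0 = 1$, $a_i = 0$ for $i = -1, -2, -3, -4$)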
![Page 30: 8. External Sorting](https://reader035.vdocument.in/reader035/viewer/2022062409/56814868550346895db57541/html5/thumbnails/30.jpg)
$e_n = a_{n-1}$
$d_n = a_{n-1} + a_{n-2}$
$c_n = a_{n-1} + a_{n-2} + a_{n-3}$
$b_n = a_{n-1} + a_{n-2} + a_{n-3} + a_{n-4}$
$a_n = a_{n-1} + a_{n-2} + a_{n-3} + a_{n-4} + a_{n-5}$
![Page 31: 8. External Sorting](https://reader035.vdocument.in/reader035/viewer/2022062409/56814868550346895db57541/html5/thumbnails/31.jpg)
i -4 -3 -2 -1 0 1 2 3 4 5 6 7
ai 0 0 0 0 1 1 2 4 8 16 31 61
1
bi 0
ci 0
di 0
ei 0
![Page 32: 8. External Sorting](https://reader035.vdocument.in/reader035/viewer/2022062409/56814868550346895db57541/html5/thumbnails/32.jpg)
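| $i$ | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| $a_i$ | 1 | 2 | 4 | 8 | 16 | 31 | 61 |
| $b_i$ | 1 | 2 | 4 | 8 | 15 | 30 | 59 |
| $c_i$ | 1 | 2 | 4 | 7 | 14 | 28 | 55 |
| $d_i$ | 1 | 2 | 3 | 6 | 12 | 24 | 47 |
| $e_i$ | 1 | 1 | 2 | 4 | 8 | 16 | 31 |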
![Page 33: 8. External Sorting](https://reader035.vdocument.in/reader035/viewer/2022062409/56814868550346895db57541/html5/thumbnails/33.jpg)
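$a_i = \langle 0, 0, 0, 0, 1, 1, 2, 4, 8, 16, 31, 61, \ldots \rangle$, $i = -4, -3, -2, -1, 0, 1, 2, \ldots$
“The kth order Fibonacci number”
$F_n^{(k)} = F_{n-1}^{(k)} + F_{n-2}^{(k)} + \cdots + F_{n-k}^{(k)}$ for $n \ge k$
$F_n^{(k)} = 0$ for $0 \le n \le k-2$, and $F_n^{(k)} = 1$ for $n = k-1$
e.g.) The second order Fibonacci number
0 1 1 2 3 5 ……
$F_n^{(2)} = F_{n-1}^{(2)} + F_{n-2}^{(2)}$
$F_n^{(2)} = 0$ if $n = 0$, and $F_n^{(2)} = 1$ if $n = 1$
Fibonacci number !!!
$a_n = F_{n+k-1}^{(k)}$ if k tapes (input) are used
why?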
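A quick numerical check of the definition, as a sketch with a hypothetical helper `fib_k`: for k = 2 it gives the ordinary Fibonacci numbers, and for k = 5 it reproduces the $a_i$ sequence above, shifted so that $a_n = F^{(5)}_{n+4}$.

```python
def fib_k(k, count):
    """First `count` terms F_0, F_1, ... of the kth order Fibonacci numbers."""
    seq = [0] * (k - 1) + [1]        # F_0 .. F_{k-2} = 0 and F_{k-1} = 1
    while len(seq) < count:
        seq.append(sum(seq[-k:]))    # each term is the sum of the previous k terms
    return seq[:count]

print(fib_k(2, 7))    # [0, 1, 1, 2, 3, 5, 8]
print(fib_k(5, 12))   # [0, 0, 0, 0, 1, 1, 2, 4, 8, 16, 31, 61]
```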
![Page 34: 8. External Sorting](https://reader035.vdocument.in/reader035/viewer/2022062409/56814868550346895db57541/html5/thumbnails/34.jpg)
What if not perfect Fib. Dist’n?
Use dummy runs !!!
5 input tapes and 53 initial runs.
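| Level | T1 | T2 | T3 | T4 | T5 | Total | Added to previous level |
|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 1 | 1 | 5 | |
| 2 | 2 | 2 | 2 | 2 | 1 | 9 | 1 1 1 1 0 |
| 3 | 4 | 4 | 4 | 3 | 2 | 17 | 2 2 2 1 1 |
| 4 | 8 | 8 | 7 | 6 | 4 | 33 | 4 4 3 3 2 |
| 5 | 16 | 15 | 14 | 12 | 8 | 65 > 53 | (8 7 7 6 4) |

Runs 34 to 53 are placed on top of the level-4 distribution:

| T1 | T2 | T3 | T4 | T5 |
|---|---|---|---|---|
| (34) | | | | |
| (35) | (36) | (37) | | |
| (38) | (39) | (40) | (41) | |
| (42) | (43) | (44) | (45) | |
| (46) | (47) | (48) | (49) | (50) |
| (51) | (52) | (53) | | |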
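The placement of runs 34 to 53 can be reproduced with a sketch of horizontal distribution in the style of Knuth's Algorithm D (The Art of Computer Programming, Vol. 3). That the slides use exactly this rule is an assumption; it does reproduce the rows above. The function name `distribute_runs` and the 0-based tape indices are illustrative.

```python
def distribute_runs(num_runs, p=5):
    """Place runs one at a time on p input tapes, tracking missing (dummy) runs."""
    a = [1] * p + [0]     # perfect distribution of the current level (+ output tape)
    d = [1] * p + [0]     # dummy runs still missing on each tape
    level, j = 1, 0
    placement = []        # 0-based tape index chosen for each run
    for i in range(num_runs):
        placement.append(j)
        d[j] -= 1
        if i == num_runs - 1:
            break                         # input exhausted
        if d[j] < d[j + 1]:
            j += 1                        # next tape still misses more runs
        elif d[j] == 0:                   # level complete: move up one level
            level += 1
            a0 = a[0]
            for t in range(p):
                d[t] = a0 + a[t + 1] - a[t]
                a[t] = a0 + a[t + 1]
            j = 0
        else:
            j = 0
    return level, d[:p], placement

level, dummies, placement = distribute_runs(53)
print(level, dummies)   # 5 [2, 2, 2, 3, 3]
```

The 12 dummy runs are the difference between the level-5 total of 65 and the 53 real runs.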
![Page 35: 8. External Sorting](https://reader035.vdocument.in/reader035/viewer/2022062409/56814868550346895db57541/html5/thumbnails/35.jpg)
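Dummy runs (in parentheses) on each tape:

| T1 | T2 | T3 | T4 | T5 | T6 |
|---|---|---|---|---|---|
| (2) | (2) | (2) | (3) | (3) | |

[Figure: the initial distribution with dummy runs and the first merge phase; the run counts ($1^{8}$, $1^{7}$, $1^{6}$, $1^{4}$ on the input tapes and $5^{8}$ on the output tape) are only partly recoverable from the transcript.]
not best
but simple and good !!!
For better one, see Knuth !!!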
![Page 36: 8. External Sorting](https://reader035.vdocument.in/reader035/viewer/2022062409/56814868550346895db57541/html5/thumbnails/36.jpg)
Example (3 tapes)
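| T1 | T2 | T3 |
|---|---|---|
| $(k)^{8}$ | $(k)^{5}$ | |
| $(k)^{3}$ | | $(2k)^{5}$ |
| | $(3k)^{3}$ | $(2k)^{2}$ |
| $(5k)^{2}$ | $(3k)^{1}$ | |
| $(5k)^{1}$ | | $(8k)^{1}$ |
| | $(13k)^{1}$ | |

Fibonacci numbers: 0, 1, 1, 2, 3, 5, 8

Runs on two input tapes (run size and I/O in units of k)

| # of runs | run size | # of pairs | # of I/O’s |
|---|---|---|---|
| 8, 5 | 1, 1 | 5 | 10 |
| 5, 3 | 2, 1 | 3 | 9 |
| 3, 2 | 3, 2 | 2 | 10 |
| 2, 1 | 5, 3 | 1 | 8 |
| 1, 1 | 8, 5 | 1 | 13 |
| 1 | 13 | | |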
How many passes over the data?
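One way to answer the question is to replay the table and add up the I/O column. The sketch below assumes run sizes and I/O are measured in units of k, as in the table; `polyphase_3tape` is a hypothetical name.

```python
def polyphase_3tape(n1, n2):
    """Replay a 3-tape polyphase merge starting with n1 and n2 runs of size 1."""
    big, small = (n1, 1), (n2, 1)     # (number of runs, run size in units of k)
    rows = []
    while big[0] > 0 and small[0] > 0:
        if big[0] <= small[0]:
            big, small = small, big
        pairs = small[0]              # one merge per run on the shorter tape
        size = big[1] + small[1]      # size of each merged run
        rows.append((big[0], small[0], big[1], small[1], pairs, pairs * size))
        # the longer tape keeps its leftover runs; the merged runs form the other tape
        big, small = (big[0] - pairs, big[1]), (pairs, size)
    return rows

rows = polyphase_3tape(8, 5)
total_io = sum(r[-1] for r in rows)
print(total_io)        # 10 + 9 + 10 + 8 + 13 = 50
print(total_io / 13)   # merge passes over the 13k records, a little under 4
```

The result excludes the pass that writes the initial runs.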
![Page 37: 8. External Sorting](https://reader035.vdocument.in/reader035/viewer/2022062409/56814868550346895db57541/html5/thumbnails/37.jpg)
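Total number of initial runs = $F_s$ for some s, where $F_s$ is the sth Fibonacci number.
$F_s = F_{s-1} + F_{s-2}$

| T1 | T2 | T3 |
|---|---|---|
| $F_{s-1}$ | $F_{s-2}$ | |
| $F_{s-3}$ | | $F_{s-2}$ |
| | $F_{s-3}$ | $F_{s-4}$ |
| ………… | | |

See Fig. p.107, textbook !!!
Total # of I/O operations = $\sum_{i=1}^{s-2} F_i F_{s-i+1}\, k$
# of passes = $\dfrac{\sum_{i=1}^{s-2} F_i F_{s-i+1}\, k}{F_s\, k} = \dfrac{\sum_{i=1}^{s-2} F_i F_{s-i+1}}{F_s}$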
![Page 38: 8. External Sorting](https://reader035.vdocument.in/reader035/viewer/2022062409/56814868550346895db57541/html5/thumbnails/38.jpg)
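Lemma :
$\sum_{i=1}^{s-2} F_i F_{s-i+1} = \dfrac{s-5}{5} F_{s+1} + \dfrac{2s+2}{5} F_s, \quad s \ge 2$
[proof] (By induction on s)
(s=2) LHS = $\sum_{i=1}^{0} F_i F_{3-i} = 0$
RHS = $\dfrac{2-5}{5} F_3 + \dfrac{2 \cdot 2+2}{5} F_2 = -\dfrac{6}{5} + \dfrac{6}{5} = 0$
(s=3) LHS = $\sum_{i=1}^{1} F_i F_{4-i} = F_1 F_3 = 2$
RHS = $\dfrac{3-5}{5} F_4 + \dfrac{2 \cdot 3+2}{5} F_3 = -\dfrac{6}{5} + \dfrac{16}{5} = 2$
(s=k′) Suppose that $\sum_{i=1}^{k'-2} F_i F_{k'-i+1} = \dfrac{k'-5}{5} F_{k'+1} + \dfrac{2k'+2}{5} F_{k'}$
(s=k′+1)
Exercise !!!
See page 106-107 in textbook !!!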
![Page 39: 8. External Sorting](https://reader035.vdocument.in/reader035/viewer/2022062409/56814868550346895db57541/html5/thumbnails/39.jpg)
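From the previous lemma,
# of passes = $\dfrac{1}{F_s} \sum_{i=1}^{s-2} F_i F_{s-i+1} = \dfrac{s-5}{5} \cdot \dfrac{F_{s+1}}{F_s} + \dfrac{2s+2}{5}$
$F_s = r$
$F_k = \dfrac{1}{\sqrt{5}} \left[ \left( \dfrac{1+\sqrt{5}}{2} \right)^k - \left( \dfrac{1-\sqrt{5}}{2} \right)^k \right] \approx \dfrac{1}{\sqrt{5}} \left( \dfrac{1+\sqrt{5}}{2} \right)^k$ (1)
why?
$\dfrac{F_{j+1}}{F_j} \approx \dfrac{13}{8}$ for $j \ge 5$. Golden Ratio !!!
From (1),
$s \approx \dfrac{\log_2 F_s + \log_2 \sqrt{5}}{\log_2 (1+\sqrt{5}) - 1} \approx 1.67 + 1.44 \log_2 F_s$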
![Page 40: 8. External Sorting](https://reader035.vdocument.in/reader035/viewer/2022062409/56814868550346895db57541/html5/thumbnails/40.jpg)
Theorem:
Polyphase merge with 3 tapes, where the two input tapes initially hold $F_{s-1}$ and $F_{s-2}$ runs
$F_s = r$ = # of initial runs
# of passes ≈ $1.04 \log_2 r$
![Page 41: 8. External Sorting](https://reader035.vdocument.in/reader035/viewer/2022062409/56814868550346895db57541/html5/thumbnails/41.jpg)
APPROXIMATED BEHAVIOR OF POLYPHASE MERGE SORTING
| Tapes | Phases | Passes | Pass/phase (percent) | Growth ratio |
|---|---|---|---|---|
| 3 | 2.078 lnS + 0.672 | 1.504 lnS + 0.992 | 72 | 1.6180340 |
| 4 | 1.641 lnS + 0.364 | 1.015 lnS + 0.965 | 62 | 1.8392868 |
| 5 | 1.524 lnS + 0.078 | 0.863 lnS + 0.921 | 57 | 1.9275620 |
| 6 | 1.479 lnS + 0.185 | 0.795 lnS + 0.864 | 54 | 1.9659482 |
| 7 | 1.460 lnS + 0.424 | 0.762 lnS + 0.797 | 52 | 1.9835828 |
| 8 | 1.451 lnS + 0.642 | 0.744 lnS + 0.723 | 51 | 1.9919642 |
| 9 | 1.447 lnS + 0.838 | 0.734 lnS + 0.646 | 51 | 1.9960312 |
| 10 | 1.445 lnS + 1.017 | 0.728 lnS + 0.568 | 50 | 1.9980295 |
| 20 | 1.443 lnS + 2.170 | 0.721 lnS – 0.030 | 50 | 1.9999981 |
APPROXIMATED BEHAVIOR OF CASCADE MERGE SORTING
| Tapes | Phases | Passes | Growth ratio |
|---|---|---|---|
| 3 | 2.078 lnS + 0.672 | 1.504 lnS + 0.992 | 1.6180340 |
| 4 | 1.235 lnS + 0.754 | 1.012 lnS + 0.820 | 2.2469796 |
| 5 | 0.946 lnS + 0.796 | 0.897 lnS + 0.800 | 2.8793852 |
| 6 | 0.796 lnS + 0.821 | 0.773 lnS + 0.808 | 3.5133371 |
| 7 | 0.703 lnS + 0.839 | 0.691 lnS + 0.822 | 4.1481149 |
| 8 | 0.639 lnS + 0.852 | 0.632 lnS + 0.834 | 4.7833861 |
| 9 | 0.592 lnS + 0.861 | 0.587 lnS + 0.845 | 5.4189757 |
| 10 | 0.555 lnS + 0.869 | 0.552 lnS + 0.854 | 6.0547828 |
| 20 | 0.397 lnS + 0.905 | 0.397 lnS + 0.901 | 12.4174426 |
![Page 42: 8. External Sorting](https://reader035.vdocument.in/reader035/viewer/2022062409/56814868550346895db57541/html5/thumbnails/42.jpg)
Cascade Merge
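| Level | $a_i$ | $b_i$ | $c_i$ | $d_i$ | $e_i$ |
|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | 5 | 4 | 3 | 2 | 1 |
| 3 | 15 | 14 | 12 | 9 | 5 |
| 4 | 55 | 50 | 41 | 29 | 15 |
| $n$ | $a_n$ | $b_n$ | $c_n$ | $d_n$ | $e_n$ |
| $n+1$ | $a_n + b_n + c_n + d_n + e_n$ | $a_{n+1} - e_n$ | $b_{n+1} - d_n$ | $c_{n+1} - c_n$ | $d_{n+1} - b_n$ |
| | $= a_{n+1}$ | $= b_{n+1}$ | $= c_{n+1}$ | $= d_{n+1}$ | $= e_{n+1} = a_n$ |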
Perfect dist’n
for detail see Knuth Vol III !!!
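The level-to-level rule in the table can be checked with a short sketch, again for 5 input tapes; `cascade_levels` is a hypothetical name.

```python
def cascade_levels(num_levels, tapes=5):
    """Return levels 0..num_levels of the perfect cascade distribution."""
    level = [1] + [0] * (tapes - 1)           # level 0: (1, 0, 0, 0, 0)
    out = [tuple(level)]
    for _ in range(num_levels):
        nxt = [sum(level)]                    # a_{n+1} = a_n + b_n + c_n + d_n + e_n
        for x in reversed(level[1:]):         # subtract e_n, d_n, c_n, b_n in turn
            nxt.append(nxt[-1] - x)
        level = nxt
        out.append(tuple(level))
    return out

for i, lv in enumerate(cascade_levels(4)):
    print(i, lv, sum(lv))
# level 4 is (55, 50, 41, 29, 15), 190 runs in total
```

The level-4 total of 190 matches the single run of length 190 that ends the cascade merge example.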