external sorting - aminer · 2.1 balanced n-way-merge merged runs: step 3: re-merging: 165 198 351...


External Sorting

Merge Sort, Replacement Selection

Overview

Structure:

1. What is “External Sorting”?

2. How does “Merge Sort” work?
   Balanced n-way-merging
   Improvements

3. What are the advantages of a “Selection Tree”?

4. What is “Replacement Selection”?
   Improvements
   Snow-plow example

5. Applicability and efficiency

1. Principle

Conventional sort algorithms:
e.g. Quick Sort, Heap Sort, Selection Sort, ...
Very efficient, but all data needs to fit completely into main memory.

External sorting:
Performing sorting operations on amounts of data that are too large to fit into main memory.

External sorting cannot be done in one step.

Internal Sorting and External Merging

Multiple steps:

1. Split the data into pieces that fit into main memory

2. Sort the pieces with conventional sorting algorithms

3. Merge those so-called runs and build the completely sorted data set
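The three steps can be sketched in a few lines of Python. This is a minimal illustration, not a tuned implementation: the function name `external_sort`, the parameter `memory_size`, and the "one integer per line" run-file format are assumptions made for this example, and the standard-library `heapq.merge` stands in for the merge phase.

```python
import heapq
import tempfile


def external_sort(records, memory_size):
    """Yield the integers in `records` in sorted order while holding at most
    `memory_size` of them in memory during run creation (hypothetical helper)."""
    run_files = []
    chunk = []

    def flush():
        # Steps 1 and 2: sort one memory-sized piece and write it out as a run.
        chunk.sort()
        f = tempfile.TemporaryFile(mode="w+")
        f.writelines(f"{x}\n" for x in chunk)
        f.seek(0)
        run_files.append(f)
        chunk.clear()

    for rec in records:
        chunk.append(rec)
        if len(chunk) == memory_size:
            flush()
    if chunk:
        flush()

    # Step 3: merge all runs, reading them back one line at a time.
    readers = [(int(line) for line in f) for f in run_files]
    try:
        yield from heapq.merge(*readers)
    finally:
        for f in run_files:
            f.close()
```

Calling `list(external_sort(data, 4))` on a list of integers returns the same result as `sorted(data)`, but only four records are sorted in memory at a time.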

2. Merge Sort

Principle: Internal Sorting and External Merging

[Figure: source data is read piece by piece from the source hard disk, sorted in main memory, and written to the working hard disk as initial runs.]

2.1 Balanced n-way-merge

Unsorted data-set:

535 288 351 354 412 198 451 852 291 448 898 165 217 366 756 665

Step 1: Creation of initial runs

In this example four elements each fit into main memory.

RUN1: 288 351 354 535
RUN2: 198 412 451 852
RUN3: 165 291 448 898
RUN4: 217 366 665 756

Step 2: Merging of initial runs

Initial runs:

RUN1: 288 351 354 535
RUN2: 198 412 451 852
RUN3: 165 291 448 898
RUN4: 217 366 665 756

Merged runs:

RUN5 (RUN1 + RUN2): 198 288 351 354 412 451 535 852
RUN6 (RUN3 + RUN4): 165 217 291 366 448 665 756 898

Step 3: Re-merging

Merged runs:

RUN5: 198 288 351 354 412 451 535 852
RUN6: 165 217 291 366 448 665 756 898

FINAL RUN:

165 198 217 288 291 351 354 366 412 448 451 535 665 756 852 898

Result: After two merge procedures our formerly unsorted set is in perfect order and merge sort is complete.
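The worked example can be replayed with a short Python sketch. The helper names `merge_two` and `balanced_two_way` are made up for this illustration, and everything is kept in lists rather than on disk, so this only demonstrates the order of operations.

```python
def merge_two(a, b):
    """Merge two sorted lists into one sorted list (2-way merge)."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])  # at most one of these two tails is non-empty
    out.extend(b[j:])
    return out


def balanced_two_way(runs):
    """Repeat 2-way merge passes until one run is left; return (run, passes)."""
    passes = 0
    while len(runs) > 1:
        merged = []
        for k in range(0, len(runs), 2):
            pair = runs[k:k + 2]
            # An odd run at the end is carried over unchanged.
            merged.append(merge_two(*pair) if len(pair) == 2 else pair[0])
        runs = merged
        passes += 1
    final = runs[0] if runs else []
    return final, passes


data = [535, 288, 351, 354, 412, 198, 451, 852,
        291, 448, 898, 165, 217, 366, 756, 665]
memory = 4  # four elements fit into main memory, as in the example
initial_runs = [sorted(data[i:i + memory]) for i in range(0, len(data), memory)]
final_run, passes = balanced_two_way(initial_runs)
```

With the 16 values above, `initial_runs` holds RUN1 to RUN4, and `final_run` is reached after `passes == 2` merge passes.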

This procedure is called balanced 2-way merging.

Explanation:

Balanced: source space as well as working space is required.

2-way: two runs are merged to form one new run.

The merging procedure can, of course, be applied to more than two runs at a time. It is then termed n-way merge or multiway merge. A balanced 3-way merge would be implemented as follows:

Example:

RUN1, RUN2, RUN3 → RUN1~ (1-3)
RUN4, RUN5, RUN6 → RUN2~ (4-6)
RUN7, RUN8 → RUN3~ (7-8)

RUN1~, RUN2~, RUN3~ → RUN1~ (1-8)
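A balanced n-way merge can be sketched as repeated passes, each of which merges groups of up to n runs. The function names `merge_pass` and `balanced_n_way` are hypothetical, and `heapq.merge` from the Python standard library performs the actual multiway merge. With r initial runs, the number of passes is the smallest k with n^k >= r, so 8 runs need 2 passes for n = 3 and 3 passes for n = 2.

```python
import heapq


def merge_pass(runs, n):
    """Merge consecutive groups of up to n sorted runs into one run each."""
    return [list(heapq.merge(*runs[i:i + n])) for i in range(0, len(runs), n)]


def balanced_n_way(runs, n):
    """Apply merge passes until a single run remains; return (runs, passes)."""
    if n < 2:
        raise ValueError("n must be at least 2")
    passes = 0
    while len(runs) > 1:
        runs = merge_pass(runs, n)
        passes += 1
    return runs, passes


# Eight small sorted runs, mirroring RUN1 to RUN8 in the example above.
eight_runs = [[i, i + 8] for i in range(8)]
after_first_pass = merge_pass(eight_runs, 3)      # three runs: (1-3), (4-6), (7-8)
result, passes = balanced_n_way(eight_runs, 3)    # one run after the second pass
```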

2.2 Sophisticated n-way-merge

Optimizations:

• Algorithms like polyphase merge and cascade merge.
• Reducing the number of intermediate steps by implementing n-way merging with large values of n.
• Saving time by distributing the runs ideally across the storage media.
• Maximizing speed by increasing the number of storage drives, which minimizes access time.

Disadvantages:

• Additional costs and expenditure

Significant speed increase by storing all runs on different drives for minimal access time:

Example:

[Figure: RUN1 to RUN5 are stored on drives 1 to 5 and merged into RUN1~ (1-5) on drive 6.]

3. Selection Tree

Problem:

Selecting the smallest element is very time-consuming. With p runs, it requires p - 1 comparisons when using a non-advanced algorithm: the first element is compared successively with all remaining p-1 elements.

Solution:

Building a selection tree saves lots of comparisons and speeds up the selection process: then, just log2 p comparisons are necessary.

RUN1: 288 351 354 535
RUN2: 198 412 451 852
RUN3: 165 291 448 898
RUN4: 217 366 665 756

Building a selection tree

Start:

RUN1: 288 351 354 535
RUN2: 198 412 451 852
RUN3: 165 291 448 898
RUN4: 217 366 665 756

Tree: 198 (RUN1 vs. RUN2), 165 (RUN3 vs. RUN4); top of the tree: 165

• The smallest element is always taken from the top of the tree.
• New elements are pulled forward in the current branch.
• This repeats until all branches of the selection tree are empty.

Pulling smallest elements forward

Step 1:

RUN1: 288 351 354 535
RUN2: 198 412 451 852
RUN3: 291 448 898
RUN4: 217 366 665 756

Tree: 198 (RUN1 vs. RUN2), 217 (RUN3 vs. RUN4); top of the tree: 198

Output so far: 165

Step 3:

RUN1: 288 351 354 535
RUN2: 412 451 852
RUN3: 291 448 898
RUN4: 366 665 756

Tree: 288 (RUN1 vs. RUN2), 291 (RUN3 vs. RUN4); top of the tree: 288

Output so far: 165 198 217

Step 5:

RUN1: 351 354 535
RUN2: 412 451 852
RUN3: 448 898
RUN4: 366 665 756

Tree: 351 (RUN1 vs. RUN2), 366 (RUN3 vs. RUN4); top of the tree: 351

Output so far: 165 198 217 288 291
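The steps above can be sketched as a small winner tree in Python. This is one possible way to realize a selection tree, not necessarily the variant the slides had in mind: the name `tree_merge` is made up, the tree is stored in an array, and `float("inf")` serves as an end-of-run marker, which assumes all keys are finite numbers. After each output, only the path from the affected leaf to the root is replayed, which costs about log2 p comparisons for p runs.

```python
INF = float("inf")  # end-of-run marker; assumes all real keys are finite


def tree_merge(runs):
    """Merge sorted lists by repeatedly taking the winner at the tree root."""
    p = 1
    while p < len(runs):
        p *= 2  # pad the number of leaves to a power of two
    iters = [iter(r) for r in runs]
    # tree[1] is the root, leaves are tree[p] .. tree[2p - 1];
    # every node stores (key, index of the run the key came from).
    tree = [(INF, -1)] * (2 * p)
    for i, it in enumerate(iters):
        tree[p + i] = (next(it, INF), i)
    for node in range(p - 1, 0, -1):
        tree[node] = min(tree[2 * node], tree[2 * node + 1])

    out = []
    while tree[1][0] != INF:
        key, i = tree[1]          # smallest element sits at the top
        out.append(key)
        node = p + i
        tree[node] = (next(iters[i], INF), i)  # pull the next element forward
        node //= 2
        while node >= 1:          # replay only this branch up to the root
            tree[node] = min(tree[2 * node], tree[2 * node + 1])
            node //= 2
    return out


runs = [[288, 351, 354, 535], [198, 412, 451, 852],
        [165, 291, 448, 898], [217, 366, 665, 756]]
merged = tree_merge(runs)
```

For the four runs of the example, the first five values of `merged` are 165, 198, 217, 288, 291, matching the output shown in step 5.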

4.1 Replacement selection

It is most efficient to keep the number of initial runs very low
→ The runs have to be as long as possible

Conventional run-creation:

The maximum size of a run is limited by the available size of main memory.

Modification:

Records are replaced in memory to form runs that are even longer than the available memory. This technique is called replacement selection.

Example of a replacement selection sequence:

Four elements each fit into main memory. Values in parentheses are tagged and cannot be used in the current run.

Values in memory          | Run
21    12    42    2       | 2
21    12    42    73      | 12
21    (5)   42    73      | 21
39    (5)   42    73      | 39
(17)  (5)   42    73      | 42
(17)  (5)   (18)  73      | 73
(17)  (5)   (18)  (11)    | (End)

Available memory: 4
Length of run: 6

size of run > size of memory

What happened:

1. The smallest record in memory is written to the run.
2. Right after that, a new record is loaded into its position in memory.
3. If this new record is smaller than the last element of the current run, it is tagged, because it cannot be used in the current run.
4. Records are replaced in memory to form runs that are longer than the available memory.

Result:

• Long runs, especially when the data is presorted
• Statistically, the length of runs levels off at 2 * size of memory
• In practice, runs tend to contain even more records, because in almost every commercial application the data is presorted
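The procedure above can be sketched with a heap standing in for "the smallest record in memory". The function name `replacement_selection` and the idea of tagging a record by storing the number of the run it belongs to are choices made for this illustration; the input order below is read off the example table (initial memory 21, 12, 42, 2, followed by 73, 5, 39, 17, 18, 11).

```python
import heapq
import itertools

_END = object()  # marks the end of the input stream


def replacement_selection(records, memory_size):
    """Split `records` into sorted runs while keeping at most `memory_size`
    records in memory. Returns a list of runs (lists)."""
    it = iter(records)
    # Heap entries are (run number, key); a tagged record carries the number
    # of the next run, so it sorts behind every record of the current run.
    heap = [(0, rec) for rec in itertools.islice(it, memory_size)]
    heapq.heapify(heap)
    runs = []
    while heap:
        run_no, key = heapq.heappop(heap)   # smallest usable record
        if run_no == len(runs):             # all untagged records are used up
            runs.append([])
        runs[run_no].append(key)
        nxt = next(it, _END)
        if nxt is not _END:
            # A record smaller than the one just written cannot join this run.
            tag = run_no if nxt >= key else run_no + 1
            heapq.heappush(heap, (tag, nxt))
    return runs


example = [21, 12, 42, 2, 73, 5, 39, 17, 18, 11]
runs = replacement_selection(example, 4)
```

Here `runs[0]` is the run of length 6 from the table (2, 12, 21, 39, 42, 73), and the four tagged records form the second run.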

4.2 Replacement selection

Demonstration: There is a well-known way to show why initial runs of length 2 * q can be expected when q is the size of main memory.

A snowplow is clearing a circular road with snow randomly distributed all over.

Because snow is falling at a constant rate, this stable situation will never change:

• The rectangle is cut in half by the line representing the current snow level

• The level of existing snow represents the records in main memory

• At the end of the road, no snow is left from the previous lap

• At that point all records in memory are tagged with the marker, so a new run has to be created.

• The volume of snow removed in one lap (namely the length of a run) is twice the amount that is present on the track at any time.
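The factor of two can also be checked empirically. The following sketch, under the assumption of uniformly random input, counts the runs that replacement selection produces and divides the input size by that count; the names `count_runs` and `average_run_length` and the chosen sizes are made up for this experiment. The ratio of average run length to q should come out close to 2, slightly below it because the first and last runs are shorter.

```python
import heapq
import itertools
import random

_END = object()  # marks the end of the input stream


def count_runs(records, q):
    """Count the runs replacement selection creates with q records in memory."""
    it = iter(records)
    heap = [(0, rec) for rec in itertools.islice(it, q)]
    heapq.heapify(heap)
    runs, current = 0, -1
    while heap:
        run_no, key = heapq.heappop(heap)
        if run_no != current:        # first record of a new run
            current = run_no
            runs += 1
        nxt = next(it, _END)
        if nxt is not _END:
            # Records smaller than the last output are tagged for the next run.
            heapq.heappush(heap, (run_no if nxt >= key else run_no + 1, nxt))
    return runs


def average_run_length(n, q, seed=0):
    """Average run length for n uniformly random records and memory size q."""
    rng = random.Random(seed)
    data = [rng.random() for _ in range(n)]
    return n / count_runs(data, q)


ratio = average_run_length(50_000, 100) / 100  # average run length in units of q
```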

5. Applicability and efficiency

Most popular algorithms:

1. Internal sorting: creates short runs with a constant maximum length equal to the size of main memory.

2. Replacement selection: mostly used, creates long runs.

As well as

3. Delayed Reconstitution of the Runs

4. Replacement Selection with natural selection

Today, the speed and efficiency of external sorting depend less on the algorithm than on the hardware used.

6. Conclusion

Speed:

External sorting cannot compete by far with the speed of internal sort algorithms.

Intention:

Minimize accesses to slow external media.

Provide a suitable and affordable solution.

Advantage:

In practice, data records are often presorted in some way.

In this case, replacement selection can produce extremely long runs.

Development:

Increase of speed because of more sophisticated algorithms.

Increase of speed because of much faster external hardware.