sanders: algorithm engineering january 20, 2011 algorithm...

276
Sanders: Algorithm Engineering January 20, 2011 1 Algorithm Engineering für grundlegende Datenstrukturen und Algorithmen Peter Sanders Was sind die schnellsten implementierten Algorithmen für das 1 × 1 der Algorithmik: Listen, Sortieren, Prioritätslisten, Sortierte Listen, Hashtabellen, Graphenalgorithemen?

Upload: others

Post on 26-Aug-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 1

Algorithm Engineering fürgrundlegende Datenstrukturen und

Algorithmen

Peter SandersWas sind dieschnellsten implementierten Algorithmen

für das 1×1 der Algorithmik:

Listen, Sortieren, Prioritätslisten, Sortierte Listen, Hashtabellen,

Graphenalgorithemen?

Page 2: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 2

Nützliche Vorkenntnisse

Informatik I/II

Algorithmentechnik (gleichzeitig hören vermutlich OK)

etwa Rechnerarchitektur (oder Ct lesen ;-)

passive Kenntnisse von C/C++

Vertiefungsgebiet: Algorithmik

Page 3: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 3

Material

Folien

wissenschaftliche Aufsätze.

Siehe Vorlesungshomepage

Basiskenntnisse: Algorithmenlehrbücher,

z.B. Mehlhorn/Sanders, Cormen et al.

Mehlhorn Näher: The LEDA Platform of Combinatorial

and Geometric Computing. Gut für die fortgeschrittenen

Papiere.

Page 4: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 4

Überblick Was ist Algorithm Engineering, Modelle, . . .

Erste Schritte: Arrays, verkettete Listen, Stacks, FIFOs,. . .

Sortieren rauf und runter

Prioritätslisten

Sortierte Listen

Hashtabellen

Minimale Spannbäume

Kürzeste Wege

Ausgewählte fortgeschrittene Algorithmen, z.B. maximale

Flüsse

Methodik: in Exkursen

Page 5: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 5

Algorithmics

= thesystematicdesign of efficient software and hardware

computer science

logics correct

efficient

soft− & hardware

the

ore

tic

al p

rac

tica

l

algorithmics

Page 6: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 6

(Caricatured) Traditional View: Algorithm Theory

models

design

analysis

perf. guarantees applications

implementation

deduction

Theory Practice

Page 7: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 7

Gaps Between Theory & Practice

Theory ←→ Practice

simple appl. model complex

simple machine model real

complex algorithms FOR simple

advanced data structures arrays,. . .

worst casemax complexity measure inputs

asympt. O(·) efficiency 42%constant factors

Page 8: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 8

Algorithmics as Algorithm Engineering

models

design

perf.−guarantees

deduction

analysis experiments

algorithmengineering

??

?

Page 9: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 9

Algorithmics as Algorithm Engineering

models

design

implementationperf.−guarantees

deduction

falsifiable

inductionhypothesesanalysis experiments

algorithmengineering

Page 10: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 10

Algorithmics as Algorithm Engineering

realisticmodels

implementationperf.−guarantees

deduction

falsifiable

inductionhypotheses experiments

algorithmengineering

design

analysis

real.

real.

Page 11: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 11

Algorithmics as Algorithm Engineering

realisticmodels

design

implementation

librariesalgorithm−

perf.−guarantees

deduction

falsifiable

inductionhypothesesanalysis experiments

algorithmengineering real

Inputs

Page 12: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 12

Algorithmics as Algorithm Engineering

appl. engin.

realisticmodels

design

implementation

librariesalgorithm−

perf.−guarantees

applications

falsifiable

inductionhypothesesanalysis experiments

algorithmengineering real

Inputs

deduction

Page 13: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 13

Goals

bridge gapsbetween theory and practice

acceleratetransferof algorithmic results intoapplications

keep the advantages of theoretical treatment:

generalityof solutions and

reliabiltiy, predictabiltyfrom performance guarantees

appl. engin.

realisticmodels

design

implementation

librariesalgorithm−

perf.−guarantees

applications

falsifiable

inductionhypothesesanalysis experiments

algorithmengineering real

Inputs

deduction

Page 14: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 14

Bits of History1843– Algorithms in theory and practice

1950s,1960sStill infancy

1970s,1980sPaper and pencil algorithm theory.

Exceptions exist, e.g.,[D. Johnson]

1986 Term used by[T. Beth],

lecture“Algorithmentechnik”in Karlsruhe.

1988– Library of Efficient Data Types and Algorithms

(LEDA) [K. Mehlhorn]

1997– Workshop on Algorithm Engineering

ESA applied track[G. Italiano]

1997 Term used in US policy paper[Aho, Johnson, Karp, et. al]

1998 Alex workshop in Italy ALENEX

Page 15: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 15

Warum diese Vorlesung?

Jeder Informatiker kennt einige Lehrbuchalgorithmen

wir können gleich mit Algorithm Engineering loslegen

Viele Anwendungen profitieren

Es ist frappierend, dass es hier noch Neuland gibt

Basis für Studien- Diplomarbeiten

Page 16: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 16

Was diese Vorlesung nicht ist:

Keine wiedergekäute Algorithmentechnik o.Ä.

Grundvorlesungen “vereinfachen” die Wahrheit oft

z.T. fortgeschrittene Algorithmen

steilere Lernkurve

Implementierungsdetails

Betonung von Messergebnissen

Page 17: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 17

Was diese Vorlesung nicht ist:

Keine Theorievorlesung

keine (wenig?) Beweise

Reale Leistung vor Asymptotik

Page 18: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 18

Was diese Vorlesung nicht ist:

Keine Implementierungsvorlesung

Etwas Algorithmenanalyse,. . .

Wenig Software Engineering

Keine Implementierungsübungen (aber, stay tuned für ein

Praktikum)

Page 19: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 19

Exkurs: Maschinenmodelle

RAM/von Neumann Modell

ALUO(1) registers

1 word = O(log n) bits

large memoryfreely programmable

Analyse: zähle Maschinenbefehle —

load, store, Arithmetik, Branch,. . .

Einfach

Sehr erfolgreich

zunehmrendunrealistisch

weil reale Hardware

immer komplexer wird

Page 20: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 20

Das Sekundärspeichermodell

1 wordcapacity Mfreely programmable

ALUregisters

B words

large memory

M: Schneller Speicher der GrößeM

B: Blockgröße

Analyse: zähle (nur?) Blockzugriffe (I/Os)

Page 21: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 21

Interpretationen des Sekundärspeichermodelle

Externspeicher Caches

großer Speicher Platte(n) Hauptspeicher

M Hauptspeicher ein cache level

B Plattenblock (MBytes!) Cache Block (16–256) ByteGgf. auch zwei cache levels.

Page 22: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 22

Parallele Platten

MM

B

...B

D1 2

unabhängige Platten

[Vitter Shriver 94]

1 2 D

Mehrkopfmodell

[Aggarwal Vitter 88]

Page 23: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 23

Mehr Modellaspekte

Instruktionsparallelismus (Superscalar, VLIW,

EPIC,SIMD,. . . )

Pipelining

Was kostet branch misprediction?

Multilevel Caches (gegenwärtig 2–3 levels) “cache

oblivious algorithms”

Parallele Prozessoren, Multithreading

Kommunikationsnetzwerke

. . .

Page 24: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 24

1 Arrays, Verkettete Listen und

abgeleitete Datenstrukturen

Bounded Arrays

Eingebaute Datenstruktur.

Größe muss von Anfang an bekannt sein

Page 25: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 25

Unbounded Array

z.B.std::vector

pushBack:Element anhängen

popBack: Letztes Element löschen

Idee: verdopple wenn der Platz ausgeht

halbiere wenn Platz verschwendet wird

Wenn man dasrichtig macht, brauchen

n pushBack/popBack Operationen ZeitO(n)

Algorithmese: pushBack/popBack haben konstanteamortisierte

Komplexität Was kann man falsch machen?

Page 26: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 26

Doppelt verkettete Listen

Page 27: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 27

ClassItemof Element // one link in a doubly linked list

e : Element

next : Handle// -

-

-

prev : Handle

invariant next→prev=prev→next=this

Trick: Use a dummy header

-

⊥-

· · ·· · ·

-

Page 28: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 28

Proceduresplice(a,b,t : Handle)assertb is not beforea∧ t 6∈ 〈a, . . . ,b〉//Cut out〈a, . . . ,b〉 a′ a b b′

· · · · · ·-

-

-

-

a′ := a→prevb′ := b→nexta′→next :=b′ //b′→prev :=a′ // · · · · · ·

R

-

-

-

Y

// insert〈a, . . . ,b〉 aftertt ′ := t→next //

t a b t′

· · · · · ·R

-

-

Y

b→next := t’ //a→prev :=t // · · · · · ·

R

-

-

-

Y

t→next := a //t ′→prev := b // · · · · · ·

-

-

-

-

Page 29: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 29

Einfach verkettete Listen

...

Vergleich mit doppelt verketteten Listen

Weniger Speicherplatz

Platz ist oft auch Zeit

Eingeschränkter z.B. kein delete

Merkwürdige Benutzerschnittstelle, z.B. deleteAfter

Page 30: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 30

Speicherverwaltung für Listen

kann leicht 90 % der Zeit kosten!

Lieber Elemente zwischen (Free)lists herschieben als echtemallocs

Alloziere viele Items gleichzeitig

Am Ende alles freigeben?

Speichere „parasitär“. z.B. Graphen:Knotenarray. Jeder Knoten speichert ein ListItem Partition der Knoten kann als verkettete Listengespeichert werden MST, shortest Path

Challenge: garbage collection, viele Datentypen auch ein Software Engineering Problem hier nicht

Page 31: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 31

Beispiel: Stack

......12 1 2

SList B-Array U-Array

dynamisch + − +

Platzverschwendung pointer zu groß? zu groß?

freigeben?

Zeitverschwendung cache miss + umkopieren

worst case time (+) + −Wars das?

Hat jede Implementierung gravierende Schwächen?

Page 32: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 32

The Best From Both Worlds

...

B

hybrid

dynamisch +

Platzverschwendungn/B+B

Zeitverschwendung +

worst case time +

Page 33: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 33

Eine Variante

...

...

B

Directory

Elemente

− Reallozierung im top level nicht worst case konstante

Zeit

+ Indizierter Zugrif aufS[i] in konstanter Zeit

Page 34: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 34

Beyond Stacks...

stack

...FIFO queue

...

pushBack popBackpushFrontpopFront

deque

FIFO: BArray−→ cyclic array

h

t0n

b

Aufgabe: Ein Array, das “[i]” in konstanter Zeit und

einfügen/löschen von Elementen in ZeitO(√

n) unterstützt

Aufgabe: Ein externer Stack, dern push/pop Operationen mit

O(n/B) I/Os unterstützt

Page 35: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 35

Aufgabe: Tabelle für hybride Datenstrukturen vervollständigenOperation List SList UArray CArray explanation of ‘∗’[·] n n 1 1|·| 1∗ 1∗ 1 1 not with inter-listsplice

first 1 1 1 1last 1 1 1 1insert 1 1∗ n n insertAfter onlyremove 1 1∗ n n removeAfter onlypushBack 1 1 1∗ 1∗ amortizedpushFront 1 1 n 1∗ amortizedpopBack 1 n 1∗ 1∗ amortizedpopFront 1 1 n 1∗ amortizedconcat 1 1 n nsplice 1 1 n nfindNext,. . . n n n∗ n∗ cache efficient

Page 36: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 36

Was fehlt?

Fakten Fakten Fakten

Messungen für

Verschiedene Implementierungsvarianten

Verschiedene Architekturen

Verschiedene Eingabegrößen

Auswirkungen auf reale Anwendungen

Kurven dazu

Interpretation, ggf. Theoriebildung

Aufgabe: Array durchlaufen versus zufällig allozierte verketteteListe

Page 37: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 37

Sorting — Overview

You think you understandquicksort?

Avoiding branch mispredictions: Super Scalar Sample Sort

(Parallel disk)externalsorting

Multicore sorting

Page 38: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 38

Quicksort

Function quickSort(s : Sequenceof Element) : Sequenceof Element

if |s| ≤ 1 then return s // base case

pick p∈ s uniformly at random // pivot key

a := 〈e∈ s : e< p〉b := 〈e∈ s : e= p〉c := 〈e∈ s : e> p〉return concatenate(quickSort(a),b,quickSort(c)

Page 39: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 39

Engineering Quicksort

array

2-way-Comparisons

sentinelsfor inner loop

inplace swaps

Recursion onsmallersubproblems

→ O(logn) additional space

break recursionfor small (20–100) inputs, insertion sort

(not one big insertion sort)

Page 40: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 40

ProcedureqSort(a : Array of Element;ℓ, r : N) // Sorta[ℓ..r]while r− ℓ≥ n0 do // Use divide-and-conquer

j := pickPivotPos(a, l , r)swap(a[ℓ],a[ j]) // Helps to establish the invariantp := a[ℓ]i := ℓ; j := rrepeat // a: ℓ i→ j← r

while a[i]< p do i++ // Scan over elements (A)while a[ j]> p do j−− // on the correct side (B)if i ≤ j then swap(a[i],a[ j]); i++ ; j−−

until i > j // Done partitioningif i < l+r

2 then qSort(a, ℓ, j); ℓ := jelse qSort(a, i, r) ; r := i

insertionSort(a[l ..r]) // faster for smallr− ℓ

Page 41: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 41

Picking Pivots Painstakingly — Theory

probabilistically: Expected1.4nlogn element comparisons

median of three:Expected1.2nlogn element comparisons

perfect:−→ nlogn element comparisons

(approximate using large samples)

Practice

3GHz Pentium 4 Prescott, g++

Page 42: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 42

Picking Pivots Painstakingly — Instructions

6

7

8

9

10

11

12

10 12 14 16 18 20 22 24 26

inst

ruct

ions

con

stan

t

n

out sort Instructions / n lg n for algs: random pivot - median of 3 - exact median - skewed pivot n/10 - n/11

random pivotmedian of 3

exact medianskewed pivot n/10skewed pivot n/11

Page 43: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 43

Picking Pivots Painstakingly — Time

6.8

7

7.2

7.4

7.6

7.8

8

8.2

8.4

10 12 14 16 18 20 22 24 26

nano

secs

con

stan

t

n

out sort Seconds / n lg n for algs: random pivot - median of 3 - exact median - skewed pivot n/10 - n/11

random pivotmedian of 3

exact medianskewed pivot n/10skewed pivot n/11

Page 44: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 44

Picking Pivots Painstakingly — Branch Misses

0.3

0.32

0.34

0.36

0.38

0.4

0.42

0.44

0.46

0.48

0.5

10 12 14 16 18 20 22 24 26

bran

ch m

isse

s co

nsta

nt

n

out sort Branch misses / n lg n for algs: random pivot - median of 3 - exact median - skewed pivot n/10 - n/11

random pivotmedian of 3

exact medianskewed pivot n/10skewed pivot n/11

Page 45: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 45

Can We Do Better? Previous WorkInteger Keys

+ Can be 2−3 times faster than quicksort

− Naive ones are cache inefficient andslowerthan quicksort

− Simple ones aredistributiondependent.

Cache efficient sortingk-ary merge sort

[Nyberg et al. 94, Arge et al. 04, Ranade et al. 00, Brodal et al.

04]

+ Faktor logk less cache faults

− Only≈ 20 % speedup, and only for laaarge inputs

Page 46: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 46

Sample Sort

Function sampleSort(e= 〈e1, . . . ,en〉,k)if n/k is “small” then return smallSort(e)

let S= 〈S1, . . . ,Sak−1〉 denote a randomsampleof e

sortS

〈s0,s1,s2, . . . ,sk−1,sk〉:=〈−∞,Sa,S2a, . . . ,S(k−1)a,∞〉for i := 1 to n do

find j ∈ 1, . . . ,ksuch thatsj−1 < ei ≤ sj

placeei in bucketb j

return concatenate(sampleSort(b1), . . . ,sampleSort(bk))

1s 5s 6s3s 7s 2s 4s

binary search?

buckets

Page 47: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 47

Why Sample Sort? traditionally:parallelizableon coarse grained machines

+ Cache efficient≈merge sort

− Binary searchnot much faster than merging

− complicatedmemory management

Super Scalar SampleSort Binary search−→ implicit search tree

Eliminate all conditionalbranches

Exploit instruction parallelism

Cache efficiencycomes to bear

“steal” memory management fromradix sort

Page 48: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 48

Classifying Elements

t:= 〈sk/2,sk/4,s3k/4,sk/8,s3k/8,s5k/8,s7k/8, . . .〉for i := 1 to n do

j:= 1

repeat logk times //unroll

j:= 2 j +(ai > t j)

j:= j−k+1∣∣b j∣∣++

oracle[i]:= j //oracle

6s

splitter array index

3s

<

< <

<<<< 1s 5s 7s6 7

buckets

4

3s 2 2

1 4s

5

decisions

decisions

decisions

> >

>>>>

*2

*2 *2

>+1

+1 +1

Now thecompilershould:

usepredicated instructions

interleave for-loop iterations (unrolling∨ software pipelining)

Page 49: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 49

template <class T>

void findOraclesAndCount(const T* const a,

const int n, const int k, const T* const s,

Oracle* const oracle, int* const bucket)

for (int i = 0; i < n; i++)

int j = 1;

while (j < k)

j = j*2 + (a[i] > s[j]);

int b = j-k;

bucket[b]++;

oracle[i] = b;

Page 50: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 50

Predication

Hardware mechanism that allows instructions to be

conditionally executed

Booleanpredicate registers(1–64) hold condition codes

predicate registersp are additional inputs ofpredicated

instructionsI

At runtime,I is executed if and only ifp is true

+ Avoids branch misprediction penalty

+ More flexible instruction scheduling

− Switched off instructions still take time

− Longer opcodes

− Complicated hardware design

Page 51: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 51

Example (IA-64)

Translation of:if (r1 > 2) r3 := r3 + 4

With a conditional branch: Via predication:

cmp.gt p6,p7=r1,r2 cmp.gt p6,p7=r1,r2

(p7) br.cond .label (p6) add r3=4,r3

add r3=4,r3

.label:

(...)

Other Current Architectures:

Conditional movesonly

Page 52: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 52

Unrolling ( k= 16)

template <class T>

void findOraclesAndCountUnrolled([...])

for (int i = 0; i < n; i++)

int j = 1;

j = j*2 + (a[i] > s[j]);

j = j*2 + (a[i] > s[j]);

j = j*2 + (a[i] > s[j]);

j = j*2 + (a[i] > s[j]);

int b = j-k;

bucket[b]++;

oracle[i] = b;

Page 53: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 53

More Unrolling k= 16, n eventemplate <class T>

void findOraclesAndCountUnrolled2([...])

for (int i = n & 1; i < n; i+=2) \

int j0 = 1; int j1 = 1;

T ai0 = a[i]; T ai1 = a[i+1];

j0=j0*2+(ai0>s[j0]); j1=j1*2+(ai1>s[j1]);

j0=j0*2+(ai0>s[j0]); j1=j1*2+(ai1>s[j1]);

j0=j0*2+(ai0>s[j0]); j1=j1*2+(ai1>s[j1]);

j0=j0*2+(ai0>s[j0]); j1=j1*2+(ai1>s[j1]);

int b0 = j0-k; int b1 = j1-k;

bucket[b0]++; bucket[b1]++;

oracle[i] = b0; oracle[i+1] = b1;

Page 54: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 54

Distributing Elements

for i := 1 to n do a′B[oracle[i]]++ := ai

o

B

a’

a

i

moverefer

refer

refer

Why Oracles?

simplifiesmemory management

no overflow testsor re-copying

simplifies softwarepipelining

separatescomputationandmemory accessaspects

small(n bytes)

sequential, predictablememory access

can behiddenusing prefetching / write buffering

Page 55: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 55

Distributing Elementstemplate <class T> void distribute(

const T* const a0, T* const a1,

const int n, const int k,

const Oracle* const oracle, int* const bucket)

for (int i = 0, sum = 0; i <= k; i++)

int t = bucket[i]; bucket[i] = sum; sum += t;

for (int i = 0; i < n; i++)

a1[bucket[oracle[i]]++] = a0[i];

Page 56: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 56

Experiments: 1.4 GHz Itanium 2

restrict keyword from ANSI/ISO C99

to indicate nonaliasing

Intel’s C++ compiler v8.0 usespredicated instructions

automatically

Profiling gives9%speedup

k= 256 splitters

Usestl:sort from g++ (n≤ 1000)!

insertion sort forn≤ 100

Random 32 bit integers in[0,109]

Page 57: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 57

Comparison with Quicksort

0

1

2

3

4

5

6

7

8

4096 16384 65536 218 220 222 224

time

/ n lo

g n

[ns]

n

GCC STL qsortsss-sort

Page 58: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 58

Breakdown of Execution Time

0

1

2

3

4

5

6

4096 16384 65536 218 220 222 224

time

/ n lo

g n

[ns]

n

TotalfindBuckets + distribute + smallSort

findBuckets + distributefindBuckets

Page 59: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 59

A More Detailed View

dynamic dynamic

instr. cycles IPC smalln IPCn= 225

findBuckets,

1× outer loop 63 11 5.4 4.5

distribute,

one element 14 4 3.5 0.8

Page 60: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 60

Comparison with Quicksort Pentium 4

0

2

4

6

8

10

12

14

4096 16384 65536 218 220 222 224

time

/ n lo

g n

[ns]

n

Intel STLGCC STL

sss-sort

Problems: fewregisters, onecondition codeonly, compilerneeds “help”

Page 61: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 61

Breakdown of Execution Time Pentium 4

0

1

2

3

4

5

6

7

8

4096 16384 65536 218 220 222 224

time

/ n lo

g n

[ns]

n

TotalfindBuckets + distribute + smallSort

findBuckets + distributefindBuckets

Page 62: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 62

Analysis

mem. acc.branchesdata dep. I/Os registersinstructions

k-way distribution:

sss-sort nlogk O(1) O(n) 3.5n/B3×unroll O(logk)

quicksort logk lvls. 2nlogk nlogk O(nlogk)2nB logk 4 O(1)

k-way merging:

memory nlogk nlogk O(nlogk) 2n/B 7 O(logk)

register 2n nlogk O(nlogk) 2n/B k O(k)

funnelk′logk′ k 2nlogk′ k nlogk O(nlogk) 2n/B 2k′+2 O(k′)

Page 63: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 63

Conclusions

sss-sort up totwice as fast as quicksort on Itanium

comparisons6= conditional branches

algorithm analysis is not just instructions and caches

New result:GPU-Sample-Sortis best comparison based sorting

algorithm on graphic hardware

[Leischner/Osipov/Sanders 2009]

Page 64: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 64

Criticism I

Why only random keys?

Answer I

Sample sort hardly depends on input distribution

Page 65: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 65

Criticism I´

What if there are many equal keys?

They all end up in the same bucket

Answer I´

Its not a bug its a feature:

si = si+1 = · · ·= sj indicates a frequent key!

Setsi := maxx∈ Key: x< si,(optional: dropsi+2, . . .sj )

Now bucketi+1 need not be sorted!

todo: implement

Page 66: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 66

Criticism II

Quicksort is inplace

Answer II

Usehybrid List-Array Representationof sequences

needsO(√

kn)

extra space fork-way sample sort

Page 67: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 67

Criticism II´

But I WANT Arrays for input and output

Answer II´

Inplace Konversion

input: easy

output: tricky. Exercise: develop rough idea. Hint: permute

blocks. Each permutation consists of a product of cyclic

permutations. An inplace cyclic permuation is easy.

Page 68: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 68

Future Work

high level fine-tuning, e.g., clever choice ofk

modern architectures

almostin-placeimplementation

multilevel cache-aware or cache-oblivious generalization

(oracles help)

Page 69: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 69

Externes Sortieren

n: Eingabegröße

M: Größe des schnellen Speichers

B: Blockgröße

1 wordcapacity Mfreely programmable

ALUregisters

B words

large memory

Page 70: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 70

ProcedureexternalMerge(a,b,c :File of Element)

x := a.readElement // AssumeemptyFile.readElement= ∞y := b.readElement

for j := 1 to |a|+ |b| doif x≤ y then c.writeElement(x); x := a.readElement

else c.writeElement(y); y := b.readElement

xac

b y

Page 71: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 71

Externes (binäres) Mischen – I/O-Analyse

Dateia lesen:⌈|a|/B⌉ ≤ |a|/B+1.

Dateib lesen:⌈|b|/B⌉ ≤ |b|/B+1.

Dateic schreiben:⌈(|a|+ |b|)/B⌉ ≤ (|a|+ |b|)/B+1.

Insgesamt:

≤ 3+2|a|+ |b|

B≈ 2|a|+ |b|

BBedingung: Wir brauchen 3 Pufferblöcke, d.h.,M > 3B.

xac

b y

Page 72: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 72

Run Formation

Sortiere Eingabeportionen der GrößeM

M

i=0 i=M i=2M

i=2Mi=0 i=M

run internalsort

f

t

I/Os:≈ 2nB

Page 73: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 73

Sortieren durch Externes Binäres Mischen

formRuns formRuns formRuns formRuns

merge merge

make_things_

__aeghikmnst

as_simple_as

__aaeilmpsss

_possible_bu

__aaeilmpsss

t_no_simpler

__eilmnoprst

____aaaeeghiiklmmnpsssst ____bbeeiillmnoopprssstu

merge

________aaabbeeeeghiiiiklllmmmnnooppprsssssssttu

ProcedureexternalBinaryMergeSort // I/Os:≈run formation // 2n/B

while more than one run leftdo //⌈log n

M

⌉×

mergepairs of runs // 2n/B

output remaining run // ∑ : 2nB

(

1+⌈

lognM

⌉)

Page 74: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 74

Zahlenbeispiel: PC 2007

n= 238 Byte

M = 231 Byte

B= 220 Byte

I/O braucht 2−6 s

Zeit: 2nB

(

1+⌈

lognM

⌉)

= 2·218 · (1+7) ·2−6 s= 216 s≈ 18 h

Idee: 8 Durchläufe 2 Durchläufe

Page 75: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 75

Zahlenbeispiel: PC 2007→ 2009

n= 238→39 Byte

M = 231→32 Byte

B= 220→21 Byte

I/O braucht 2−6 s

Zeit: 2nB

(

1+⌈

lognM

⌉)

= 2·218 · (1+7) ·2−6 s= 216 s≈ 18 h

Page 76: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 76

Mehrwegemischen

ProceduremultiwayMerge(a1, . . . ,ak,c :File of Element)

for i := 1 to k do xi:= ai .readElement

for j := 1 to ∑ki=1 |ai | do

find i ∈ 1..k that minimizesxi// no I/Os!,O(logk) time

c.writeElement(xi)

xi := ai .readElement

...

...

...

...

...

internal buffers

...

Page 77: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 77

Mehrwegemischen – Analyse

...

...

...

...

...

internal buffers

...

I/Os: Dateiai lesen:≈ |ai |/B.

Dateic schreiben:≈ ∑ki=1 |ai |/B

Insgesamt:

≤≈ 2∑k

i=1 |ai |B

Bedingung: Wir brauchenk Pufferblöcke, d.h.,k< M/B.

Interne Arbeit: (benutze Prioritätsliste !)

O

(

logkk

∑i=1|ai |)

Page 78: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 78

Sortieren durch Mehrwege-Mischen Sortiere⌈n/M⌉ runsmit je M Elementen 2n/B I/Os

MischejeweilsM/B runs 2n/B I/Os

bis nur noch ein run übrig ist ×⌈

logM/BnM

Mischphasen

————————————————-

Insgesamt sort(n) :=2nB

(

1+⌈

logM/BnM

⌉)

I/Os

formRuns formRuns formRuns formRuns

make_things_

__aeghikmnst

as_simple_as

__aaeilmpsss

_possible_bu

__aaeilmpsss

t_no_simpler

__eilmnoprst

________aaabbeeeeghiiiiklllmmmnnooppprsssssssttu

multi merge

Page 79: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 79

Sortieren durch Mehrwege-Mischen

Interne Arbeit:

O

run formation︷ ︸︸ ︷

nlogM + nlogMB

︸ ︷︷ ︸

PQ access per phase

phases︷ ︸︸ ︷⌈

logM/BnM

= O(nlogn)

Mehr als eine Mischphase?:Nicht für Hierarchie Hauptspeicher, Festplatte.

Grund

>1000︷︸︸︷

MB

>

<1000︷ ︸︸ ︷

RAM Euro/bitPlatte Euro/bit

11-2010:8GB4MB = 2048> 50/4GB

80/2T B = 320

Page 80: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 80

Mehr zu externem Sortieren

Untere Schranke≈ 2(?)nB

(

1+⌈

logM/BnM

⌉)

I/Os

[Aggarwal Vitter 1988]

Obere Schranke≈ 2nDB

(

1+⌈

logM/BnM

⌉)

I/Os (erwartet)

für D parallele Platten

[Hutchinson Sanders Vitter 2005, Dementiev Sanders2003]

Offene Frage: deterministisch?

Page 81: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 81

Sorting with Parallel DisksI/O Step:= Access to a single physical block per disk

M

B

... D1 2

independent disks

[Vitter Shriver 94]

Theory: Balance Sort[Nodine Vitter 93].

Deterministic, complex

asymptotically optimal O(·)

Multiway merging

“Usually” factor10? less I/Os.

Not asymptotically optimal. 42%

Basic Approach: Improve Multiway Merging

Page 82: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 82

Striping

21 D

emulated disk

logical block

physical blocks

That takes care ofrun formation

and writing theoutput...

...

...

...

...

internal buffers

...

???

But what aboutmerging?

Page 83: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 83

Naive Striping

Run single disk merge-sort on striped logical disk:

2nDB

(

1+⌈

logM/DBnM

⌉)

I/Os

Theory:Θ(logM/B) worse whenD≈M/B

Practice: 2→ 3 passes in some cases

Page 84: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 84

Prediction

[Folklore, Knuth]

Smallest Element

of each block

triggersfetch.

Prefetch buffers

allow parallel access

of next blocks

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

internal buffers

...

controls

prediction sequence

pref

etch

buf

fers

Page 85: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 85

Warmup: Multihead Model

M

B1 2 D

Multihead Model

[Aggarwal Vitter 88]

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

internal buffers

...

controls

prediction sequence

pref

etch

buf

fers

D prefetch buffers yield an optimal algorithm

sort(n) :=2nDB

(

1+⌈

logM/BnM

⌉)

I/Os

Page 86: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 86

Bigger Prefetch Buffer

...

...

...

...

...

...prediction sequence

...

internal buffers

...

1

2

k

prefetch buffers

Dk gooddeterministicperformance

O(D) would yield an optimal algorithm.

Possible?

Page 87: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 87

Randomized Cycling

[Vitter Hutchinson 01]

Block i of stripe j goes to diskπ j(i) for a rand. permutationπ j

...prediction sequence

internal buffers

...

...

...

...

...

...

prefetch buffers

Good fornaive prefetchingandΩ(D logD) buffers

Page 88: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 88

Buffered Writing[S-Egner-Korst SODA00, Hutchinson-S-Vitter ESA 01]

Theorem:Buffered Writing

is optimal

. . .

Buthow good is optimal?W/D

...

...

...

1 2 3

queues

D

Sequence Σof blocks

randomizedmapping

write whenever

one of W

buffers is free

otherwise, output

one block from

each nonempty queue

Theorem: Rand. cycling achievesefficiency 1−O(D/W).

Analysis:negative associationof random variables,application ofqueueing theoryto a “throttled” Alg.

Page 89: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 89

Optimal Offline Prefetching

Theorem:For buffer sizeW:

∃ (offline) prefetchingschedule forΣ with T input steps

⇔∃ (online)write schedule forΣR with T output steps

r q p o l i f

n g e

m k j

h

abcd

abcdefghijklmnopqr

ΣR

Σ

8 7 6 5 4 3 2 1

1output step 2 3 4 5 6 7 8

Puffer

input step

order of reading

order of writing

Page 90: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 90

Optimal Offline Prefetching

Theorem:For buffer sizeW:

∃ (offline) prefetchingschedule forΣ with T input steps

⇔∃ (online)write schedule forΣR with T output steps

abcdefghijklmnopqr

ΣR

Σrqponm

8 7 6 5 4 3 2 1

1output step 2 3 4 5 6 7 8

Puffer

input step

r

m

n

order of reading

order of writing

Page 91: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 91

Optimal Offline Prefetching

Theorem:For buffer sizeW:

∃ (offline) prefetchingschedule forΣ with T input steps

⇔∃ (online)write schedule forΣR with T output steps

abcdefghijklmnopqr

ΣR

Σ qpolk

j

k

8 7 6 5 4 3 2 1

1output step 2 3 4 5 6 7 8

Puffer

input step

qr

m

n

order of reading

order of writing

Page 92: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 92

Optimal Offline Prefetching

Theorem:For buffer sizeW:

∃ (offline) prefetchingschedule forΣ with T input steps

⇔∃ (online)write schedule forΣR with T output steps

abcdefghijklmnopqr

ΣR

Σpol

j

ki

h

j

h

p

8 7 6 5 4 3 2 1

1output step 2 3 4 5 6 7 8

Puffer

input step

qr

m

n

order of reading

order of writing

Page 93: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 93

Optimal Offline Prefetching

Theorem:For buffer sizeW:

∃ (offline) prefetchingschedule forΣ with T input steps

⇔∃ (online)write schedule forΣR with T output steps

abcdefghijklmnopqr

ΣR

Σ

ol

ki j

h

p

efg

g

o

8 7 6 5 4 3 2 1

1output step 2 3 4 5 6 7 8

Puffer

input step

qr

m

n

order of reading

order of writing

Page 94: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 94

Optimal Offline Prefetching

Theorem:For buffer sizeW:

∃ (offline) prefetchingschedule forΣ with T input steps

⇔∃ (online)write schedule forΣR with T output steps

abcdefghijklmnopqr

ΣR

Σ

l

ki j

h

p

ef

g

od

c e

l

d

8 7 6 5 4 3 2 1

1output step 2 3 4 5 6 7 8

Puffer

input step

qr

m

n

order of reading

order of writing

Page 95: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 95

Optimal Offline Prefetching

Theorem:For buffer sizeW:

∃ (offline) prefetchingschedule forΣ with T input steps

⇔∃ (online)write schedule forΣR with T output steps

abcdefghijklmnopqr

ΣR

Σ

ki j

h

p

f

g

o

c e

l

d

b

a

c

i

8 7 6 5 4 3 2 1

1output step 2 3 4 5 6 7 8

Puffer

input step

qr

m

n

order of reading

order of writing

Page 96: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 96

Optimal Offline Prefetching

Theorem:For buffer sizeW:

∃ (offline) prefetchingschedule forΣ with T input steps

⇔∃ (online)write schedule forΣR with T output steps

abcdefghijklmnopqr

ΣR

Σ

k j

h

p

f

g

o

e

l

d

b

a

c

i

b

f

8 7 6 5 4 3 2 1

1output step 2 3 4 5 6 7 8

Puffer

input step

qr

m

n

order of reading

order of writing

Page 97: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 97

Optimal Offline Prefetching

Theorem:For buffer sizeW:

∃ (offline) prefetchingschedule forΣ with T input steps

⇔∃ (online)write schedule forΣR with T output steps

abcdefghijklmnopqr

ΣR

Σ

k j

h

p

g

o

e

l

d

a

c

i

b

f

a

8 7 6 5 4 3 2 1

1output step 2 3 4 5 6 7 8

Puffer

input step

qr

m

n

order of reading

order of writing

Page 98: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 98

SynthesisMultiway merging

+ prediction [60s Folklore]

+optimal (randomized) writing [S-Egner-Korst SODA 2000]

+randomized cycling [Vitter Hutchinson 2001]

+optimal prefetching [Hutchinson-S-Vitter ESA 2002]

(1+o(1))·sort(n) I/Os

“answers” question in[Knuth 98];

difficulty 48 on a 1..50 scale.

Page 99: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 99

We are not done yet!

Internal work

Overlapping I/O and computation

Reasonable hardware

Interfacing with the Operating System

Parameter Tuning

Software engineering

Pipelining

Page 100: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 100

Key Sorting

TheI/O bandwidthof our machine is about1/3 of its main

memorybandwidth

If key size≪ element size

sort key pointer pairs to save memory bandwidth during run

formation

3 1 4 1 5 1 1 3 4 5

3 1 4 1 5

1 1 3 4 5

a

a

b

b

c

c

d

d

e

e

permute

extract

sort

Page 101: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 101

Tournament Trees for Multiway Merging

Assumek= 2K runs

K level complete binary tree

Leaves:smallest current element of each run

Internal nodes:loserof a competition for being smallest

Above root: global winner

6 2 7 9 1 4 74

6 7 9 7

4 4

2

1

36 2 7 9 4 74

6 7 9 7

4 4

3

2deleteMin+insertNext

3

Page 102: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 102

Why Tournament Trees Exactlylogk elementcomparisons

Implicit layout in an array simple index arithmetics(shifts)

Predictableload instructions and index computations(Unrollable) inner loop:for (int i=(winnerIndex+kReg)>>1; i>0; i>>=1)

currentPos = entry + i;

currentKey = currentPos->key;

if (currentKey < winnerKey)

currentIndex = currentPos->index;

currentPos->key = winnerKey;

currentPos->index = winnerIndex;

winnerKey = currentKey;

winnerIndex = currentIndex;

Page 103: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 103

Overlapping I/O and Computation

One thread for each disk (or asynchronous I/O)

Possibly additional threads

Blocks filled with elements are passedby references

between different buffers

Page 104: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 104

Overlapping During Run Formation

First postreadrequests for runs1 and2

Thread A: Loop wait-readi; sort i; post-write i;Thread B: Loop wait-write i; post-readi+2;

computebound case

I/Obound case

sort

read

time

control flow in thread A

sort

read

write

1

1

1

2

2

3

2

3

3

4

4

4

...

...

...

k−1

k−1

k−1

k

k

k

control flow in thread B

1 2

2

2

1

1

3

3

3

4

4

4

k−1

k−1

k−1

k

k

k

...

...

...write

Page 105: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 105

Overlapping During Merging

1B−12 3B−14 5B−16 · · ·Bad example: · · ·

1B−12 3B−14 5B−16 · · ·

Page 106: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 106

Overlapping During Merging

.

.

..

read buffers

.

.

..

k+3Doverlap buffers

merging

−ping

1

k

merge

2D w

rite buffers

D blocks

merge buffers

overlap−

prefetch buffers

disk scheduling

O(D

)I/O Threads:Writing haspriority over reading

Page 107: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 107

I/O bound case: The I/O thread never blocks

y= # of elements

merged during

one I/O step.

I/O bound

y> DB2

y≤ DB

blocking

r

DB−y DB 2DB−y 2DBw

kB+2DB

kB+DB+y

kB+DB

kB+3DB

kB+2DB+y

fetch

output#e

lem

ents

in o

verla

p+m

erge

buf

fers

#elements in output buffers

Page 108: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 108

Compute bound case:

The merging thread

never blocks

r

kB+2DB

kB+DB

kB+3DB

block I/O

kB+y

output

fetch

block merging

DB 2DBw

2DB−y

kB

kB+2y

#ele

men

ts in

ove

rlap+

mer

ge b

uffe

rs

#elements in output buffers

Page 109: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 109

Hardware (mid 2002)

Linux

(2× 2GHz Xeon× 2 Threads)

Several66 MHz PCI-buses

(SuperMicro P4DPE3)

Several fastIDE controllers

(4× Promise Ultra100 TX2)

Many fast IDEdisks

(8× IBM IC35L080AVVA07)

PCI−Busses

Controller

Channels

2x Xeon

2x64x66 Mb/s

4 Threads

E7500 128

MB/s

400x64 Mb/s

4x2x100

Intel

8x80GBMB/s

Chipset

8x45

RAMDDR1 GB

cost effective I/O-bandwidth (real 360 MB/s for≈ 3000 )e

Page 110: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 110

Hardware (end 2009) geschätzt

Linux

(2× 2.4 GHz Xeon E5530× 4 Cores× 2 Threads)

PCIe x8 SATA controller

16–241.5 TByte SATAdisks

(8× IBM IC35L080AVVA07)

24 GByte RAM

cost effective I/O-bandwidth (real 2 GB/s for≈ 6000 )e

Page 111: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 111

Software InterfaceGoals:efficient+ simple+ compatible

TX

XL

S

files, I/O requests, disk queues,

block prefetcher, buffered block writer

completion handlers

Asynchronous I/O primitives layer

Block management layertyped block, block manager, buffered streams,

Containers:

STL−user layervector, stack, set

priority_queue, mapsort, for_each, merge

Pipelined sorting,zero−I/O scanning

Streaming layer

Algorithms:

Operating System

Applications

Page 112: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 112

Default Measurement Parameters

t := number of available buffer blocks

Input Size: 16 GByte

Element Size:128 Byte

Keys: Random 32 bit integers

Run Size:256 MByte

Block sizeB: 2 MByte

Compiler: g++ 3.2 -O6

Write Buffers: max(t/4,2D)

Prefetch Buffers:2D+310

(t−w−2D)

PCI−Busses

Controller

Channels

2x Xeon

2x64x66 Mb/s

4 Threads

E7500 128

MB/s

400x64 Mb/s

4x2x100

Intel

8x80GBMB/s

Chipset

8x45

RAMDDR1 GB

Page 113: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 113

Element sizes (16 GByte, 8 disks)

0

50

100

150

200

250

300

350

400

16 32 64 128 256 512 1024

time

[s]

element size [byte]

run formationmergingI/O wait in merge phaseI/O wait in run formation phase

parallel disks bandwidth“for free” internal work, overlapping are relevant

Page 114: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 114

Earlier Academic ImplementationsSingle Disk,at most 2 GByte, old measurements useartificial M

0

100

200

300

400

500

600

700

800

16 32 64 128 256 512 1024

sort

tim

e [s

]

element size [byte]

LEDA-SMTPIE<stxxl> comparison based

Page 115: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 115

Earlier Acad. Implementations: Multiple Disks

2630

4050

75

100

200

300

400500616

16 32 64 128 256 512 1024

sort

tim

e [s

]

element size [byte]

LEDA-SM Soft-RAIDTPIE Soft-RAID<stxxl> Soft-RAID<stxxl>

Page 116: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 116

What are good block sizes (8 disks)?

12

14

16

18

20

22

24

26

128 256 512 1024 2048 4096 8192

sort

tim

e [n

s/by

te]

block size [KByte]

128 GBytes 1x merge128 GBytes 2x merge

16 GBytes

B is not a technology constant

Page 117: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 117

Optimal Versus Naive Prefetching

Total merge time

-3

-2

-1

0

1

0 20 40 60 80 100

diffe

renc

e to

nai

ve p

refe

tchi

ng [%

]

fraction of prefetch buffers [%]

176 read buffers112 read buffers

48 read buffers

Page 118: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 118

Impact of Prefetch and Overlap Buffers

115

120

125

130

135

140

145

150

16 32 48 64 80 96 112 128 144 160 172

mer

ging

tim

e [s

]

number of read buffers

no prefetch bufferheuristic schedule

Page 119: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 119

Tradeoff: Write Buffer Size Versus Read BufferSize

110

115

120

125

130

135

140

145

0 50 100 150 200

mer

ge ti

me

[s]

number of read buffers

Page 120: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 120

Scalability

0

5

10

15

20

16 32 64 128

sort

tim

e [n

s/by

te]

input size [GByte]

128-byte elements512-byte elements

4N/DB bulk I/Os

Page 121: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 121

Discussion

Theory and practice harmonize

No expensive server hardware necessary (SCSI,. . . )

No need to work with artificialM

No 2/4 GByte limits

Faster than academic implementations

(Must be) as fast as commercial implementations but with

performance guarantees

Blocks are much larger than often assumed. Not a

technology constant

Parallel disks

bandwidth“for free” don’t neglect internal costs

Page 122: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 122

More Parallel Disk Sorting?

Pipelining: Input does not come from disk but from a logicalinput stream. Output goes to a logical output stream only half the I/Os for sorting oftenno I/Osfor scanning todo: better overlapping

Parallelism:This is the only way to go forreally manydisks

Tuning and Special Cases:ssssort, permutations, balance workbetween merging and run formation?. . .

Longer Runs:not done with guaranteed overlapping, fastinternal sorting !

Distribution Sorting:Better for seeks etc.?

Inplace Sorting:Could also be faster

Determinism:A practical and theoretically efficient algorithm?

Page 123: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 123

ProcedureformLongRunsq,q′ : PriorityQueuefor i := 1 to M do q.insert(readElement)

invariant |q|+ |q′|= Mloop

while q 6= /0writeElement(e:= q.deleteMin)

if input exhaustedthen break outer loopif e′:= readElement < e then q′.insert(e′)elseq.insert(e′)

q:= q′; q′:= /0outputq in sorted order; outputq′ in sorted order

Knuth: average run length2Mtodo:cache-effizienteImplementierung

Page 124: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 124

2 Priority Queues (insert, deleteMin)Binary Heapsbest comparison based “flat memory” algorithm

+ On averageconstanttime for insertion

+ On averagelogn+O(1) key comparisons per delete-Min

using the “bottom-up” heuristics[Wegener 93].

− ≈ 2log(n/M) block accesses

per delete-MinCacheM

Page 125: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 125

Bottom Up Heuristics

97

2

97

2

movecompare swap

97

1

48

6

8

8

6

2

97

4

delete MinO(1)

sift down holelog(n)

averageO(1)

sift up

8

2

9

7

8

2

6

4 4

6

4

6

8

2

9

7

6

4

Factor two faster

than naive implementation

Page 126: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 126

Der Wettbewerber fit gemacht:

int i=1, m=2, t = a[1];

m += (m != n && a[m] > a[m + 1]);

if (t > a[m])

do a[i] = a[m];

i = m;

m = 2*i;

if (m > n) break;

m += (m != n && a[m] > a[m + 1]);

while (t > a[m]);

a[i] = t;

Keine signifikanten Leistungsunterschiede auf meiner Maschine

(heapsort von random integers)

Page 127: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 127

Vergleich

Speicherzugriffe:O(1) weniger als top downO(logn) worst

case. beieffizienter Implementierung

Elementvergleiche:≈ logn weniger für bottom up (average

case) aber die sindleicht vorhersagbar

Aufgabe: siftDown mitworst caselogn+O(log logn)

Elementvergleichen

Page 128: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 128

Heapkonstruktion

ProcedurebuildHeapBackwards

for i := ⌊n/2⌋ downto 1 do siftDown(i)

ProcedurebuildHeapRecursive(i : N)

if 4i ≤ n thenbuildHeapRecursive(2i)

buildHeapRecursive(2i+1)

siftDown(i)

Rekursive Funktion für große Eingaben 2× schneller!

(Rekursion abrollen für 2 unterste Ebenen)

Aufgabe: Erklärung

Page 129: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 129

Mittelgroße PQs –km≪M2/B Einfügungen

minsertionbuffer PQ

sortedsequences

PQ

...1 2 k

k−merge

mB

Insert: Anfangs ininsertion buffer.

Überlauf−→sort; flush; kleinster Schlüssel in merge-PQ

Delete-Min: deleteMin aus der PQ mit kleinerem min

Page 130: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 130

Analyse – I/Os

deleteMin: jedes Element wird≤ 1× gelesen, zusammen mitB

anderen –amortisiert1/B penalty fürinsert.

minsertionbuffer PQ

sortedsequences

PQ

...1 2 k

k−merge

mB

Page 131: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 131

Analyse – Vergleiche (Maß für interne Arbeit)

deleteMin: 1+O(max(logk, logm)) = O(logm)

genauere Argumentation: amortisiert1+ logk bei

geeigneter PQ

insert: ≈mlogm alle mOps.Amortisiert logm

Insgesamt nur logkmamortisiert !

minsertionbuffer PQ

sortedsequences

PQ

...1 2 k

k−merge

mB

Page 132: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 132

Large Queues[Sanders 00]

group− group− group−buffer bufferbuffer 1 2 3

cached

externalswapped

group 1

group 2

group 3

k k

k

m

m m

2T 3T

...1 2 k ...1 2 k ...1 2 k

k−merge k−merge k−merge

R−merge

m

m’

m

1T

deletion buffer

m

insertion bufferm

insert: group full−→merge group; shift into next group.merge invalid group buffers and move them into group 1.

Delete-Min: Refill. m′≪m. nothing else

Page 133: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 133

Example

Merge insertion buffer, deletion buffer, and leftmost group

buffer

b

g

v

jilq

k f

unmc

wtpoh

a

db

ex

3

sr

b

v

jilq

k f

unm

wtpoh

xab

ed

cg

sr

3insert( )

Page 134: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 134

Example

Merge group 1

b

g

v

jilq

k f

unmc

wtpoh

a

db

ex

3

sr

ex

b

v

ilq

unmc

wtpoh

a

db

fgjk

3

sr

Page 135: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 135

Example

Merge group 2

ex

b

v

ilq

unmc

wtpoh

a

db

fgjk

3

sr

ex

3a

fgjk

bdb

chil

mn o p

qrst u v

w

Page 136: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 136

Example

Merge group buffers

ex

3a

fgjk

bdb

chil

mn o p

qrst u v

w

ex

3a

fgjk

chil

mn o p

qrst u v

w

bbd

Page 137: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 137

Example

DeleteMin 3; DeleteMin a;

ex

3a

fgjk

chil

mn o p

qrst u v

w

bbd

ex

fgjk

chil

mn o p

qrst u v

w

bbd

Page 138: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 138

Example

DeleteMin b

ex

fgjk

il

mn o p

qrst u v

wd

chil

mn o p

qrst u v

w

bbd

ex

jk

bfg

ch

Page 139: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 139

Analysis

I insertions, buffer sizesm= Θ(M)

merging degreek= Θ(M/B)

block accesses:sort(I)+“small terms”

key comparisons:I logI + “small terms”

(on average)

... ... ...

R

R

W

WR

WR W

R W

RR

Other (similar, earlier)[Arge 95, Brodal-Katajainen 98, Brengel

et al. 99, Fadel et al. 97]data structures spend afactor≥ 3 more

I/Os toreplaceI by queue size.

Page 140: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 140

Implementation Details

Fast routines for 2–4 way merging keeping smallest

elements inregisters

Use sentinels to avoid special case treatments (empty

sequences, . . . )

Currently heap sort for sorting the insertion buffer

k 6= M/B: multiple levels, limited associativity, TLB

Page 141: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 141

Experiments

Keys: random 32 bit integers

Associated information:32 dummy bits

Deletion buffer size:32 Near optimal

Group buffer size:256 : performance on

Merging degreek: 128 all machines tried!

Compiler flags:Highly optimizing, nothing advanced

Operation Sequence:

(Insert-DeleteMin-Insert)N(DeleteMin-Insert-DeleteMin)N

Near optimal performance on all machines tried!

Page 142: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 142

MIPS R10000, 180 MHz

0

50

100

150

1024 4096 16384 65536 218 220 222 223

(T(d

elet

eMin

) +

T(in

sert

))/lo

g N

[ns

]

N

bottom up binary heapbottom up aligned 4-ary heapsequence heap

Page 143: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 143

Ultra-SparcIIi, 300 MHz

0

20

40

60

80

100

120

140

160

256 1024 4096 16384 65536 218 220 222 223

(T(d

elet

eMin

) +

T(in

sert

))/lo

g N

[ns

]

N

bottom up binary heapbottom up aligned 4-ary heap

sequence heap

Page 144: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 144

Alpha-21164, 533 MHz

0

20

40

60

80

100

120

140

256 1024 4096 16384 65536 218 220 222 223

(T(d

elet

eMin

) +

T(in

sert

))/lo

g N

[ns

]

N

bottom up binary heapbottom up aligned 4-ary heap

sequence heap

Page 145: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 145

Pentium II, 300 MHz

0

50

100

150

200

1024 4096 16384 65536 218 220 222 223

(T(d

elet

eMin

) +

T(in

sert

))/lo

g N

[ns

]

N

bottom up binary heapbottom up aligned 4-ary heap

sequence heap

Page 146: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 146

Core2 Duo Notebook, 1.??? GHz

0

5

10

15

20

25

1024 4096 16384 65536 218 220 222 223

(T(d

elet

eMin

) +

T(in

sert

))/lo

g N

[ns

]

N

bottom up binary heapsequence heap

Page 147: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 147

(insert(deleteMin insert)s)N

(deleteMin(insert deleteMin)s)N

32

64

128

256 1024 4096 16384 65536 218 220 222 223

(T(d

elet

eMin

) +

T(in

sert

))/lo

g N

[ns

]

N

s=0, binary heaps=0, 4-ary heap

s=0s=1s=4

s=16

Page 148: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 148

Methodological Lessons

If you want to comparesmallconstant factors inexecution time:

Reproducabilitydemandspublication of source codes(4-ary heaps, old study in Pascal)

Highly tuned codes in particularfor the competitors(binary heaps have factor 2 between good and naiveimplementation).How do you compare two mediocre implementations?

Careful choice/description ofinputs

Use multiple different hardwareplatforms

Augment withtheory(e.g., comparisons, datadependencies, cache faults, locality effects . . . )

Page 149: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 149

Open Problems

Dependence onsizerather than number of insertions

Parallel disks

Space efficientimplementation

Multi-level cache aware or cache-oblivious variants

Page 150: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 150

3 van Emde-Boas Search Trees

K’ bit

vEB

top

hash table

K’ bit

vEB

boti

Store setM of K = 2k-bit integers.

later: associated information

K = 1 or |M|= 1: store directly

K′:= K/2

Mi :=

x mod 2K′ : x div 2K′ = i

root points to nonemptyMi-s

top t = i : Mi 6= /0

insert, delete, search inO(logK) time

Page 151: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 151

Locate

//minx∈M : y≤ x

Function locate(y : N) : ElementHandle

if y> maxM then return ∞if K = 1 then return locateLocally(y)

if M = x then return x

i:= y div 2K/2

if Mi = /0∨y> maxMi thenreturn minMtop.locate(i)

elsereturn i2K/2+Mi.locate(y mod 2K/2)

Page 152: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 152

Comparison with Comparison Based Search TreesIdeally: logn log logn

Problems:

Many special

casetests

High space

consumption

100

1000

256 1024 4096 16384 65536 218 220 222 223

Tim

e fo

r ra

ndom

loca

te [n

s]

n

orig-STreeLEDA-STree

STL map(2,16)-tree

Page 153: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 153

Efficient 32 bit Implementation

top 3 layers of bit arrays

K’ bit

vEB

top

hash table

K’ bit

vEB

boti

hash table

K’ bit

vEB

boti

65535

630

0 31

0 31

32

32

63

63

4095

… … … …

… … … …

65535

630

0 31

0 31

32

32

63

63

4095

… … … …

… … … …

Page 154: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 154

Layers of Bit Arrays

t1[i] = 1 iff Mi 6= /0

t2[i] = t1[32i]∨ t1[32i+1]∨ ·· ·∨ t1[32i+31]

t3[i] = t2[32i]∨ t2[32i+1]∨ ·· ·∨ t2[32i+31]

Page 155: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 155

Efficient 32 bit Implementation

Break recursion after 3 layers

hash table

K’ bit

vEB

boti

65535

630

0 31

0 31

32

32

63

63

4095

… … … …

… … … …

65535

630

0 31

0 31

32

32

63

63

4095

… … … …

… … … …

630

0 31

0 31

32

32

63

63

4095

… … … …

… … … …

Level 1 – root

Bits 31-16

… …

hash table

… …

hash table

Level 2

Bits 15-8

Level 2

Bits 15-8

Level 1 – root

Bits 31-16

… …

hash table

… …

hash table

Level 2

Bits 15-8

Level 1 – root

Bits 31-16

… …

hash table

… …

hash table

Level 2

Bits 15-8

… …

hash table

… …

hash table

… …

hash table

… …

… …

hash table

… …

hash table

… …

… …

hash table

Level 2

Bits 15-8

Level 2

Bits 15-8

Level 3

Bits 7-0

… …

hash table

Level 3

Bits 7-0

… …

hash table

… …

hash table

… …

… …

hash table

hash tablehash table

65535

Page 156: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 156

Efficient 32 bit Implementation

root hash table root array

630

0 31

0 31

32

32

63

63

4095

… … … …

… … … …

Level 1 – root

Bits 31-16

… …

hash table

… …

hash table

Level 2

Bits 15-8

Level 2

Bits 15-8

Level 1 – root

Bits 31-16

… …

hash table

… …

hash table

Level 2

Bits 15-8

Level 1 – root

Bits 31-16

… …

hash table

… …

hash table

Level 2

Bits 15-8

… …

hash table

… …

hash table

… …

hash table

… …

… …

hash table

… …

hash table

… …

… …

hash table

Level 2

Bits 15-8

Level 2

Bits 15-8

Level 3

Bits 7-0

… …

hash table

Level 3

Bits 7-0

… …

hash table

… …

hash table

… …

… …

hash table

hash tablehash table

65535

630

0 31

0 31

32

32

63

63

4095

… … … …

… … … …

Level 1 – root

Bits 31-16

… …

hash table

… …

hash table

Level 2

Bits 15-8

Level 2

Bits 15-8

Level 1 – root

Bits 31-16

… …

hash table

… …

hash table

Level 2

Bits 15-8

Level 1 – root

Bits 31-16

… …

hash table

… …

hash table

Level 2

Bits 15-8

… …

hash table

… …

hash table

… …

hash table

… …

… …

hash table

… …

hash table

… …

… …

hash table

Level 2

Bits 15-8

Level 2

Bits 15-8

Level 3

Bits 7-0

… …

hash table

Level 3

Bits 7-0

… …

hash table

… …

hash table

… …

… …

hash table

65535

65535

array 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0…

65535

array 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0…

Page 157: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 157

Efficient 32 bit Implementation

Sorted doubly linked lists forassociated informationandrange

queries

630

0 31

0 31

32

32

63

63

4095

… … … …

… … … …

Level 1 – root

Bits 31-16

… …

hash table

… …

hash table

Level 2

Bits 15-8

Level 2

Bits 15-8

Level 1 – root

Bits 31-16

… …

hash table

… …

hash table

Level 2

Bits 15-8

Level 1 – root

Bits 31-16

… …

hash table

… …

hash table

Level 2

Bits 15-8

… …

hash table

… …

hash table

… …

hash table

… …

… …

hash table

… …

hash table

… …

… …

hash table

Level 2

Bits 15-8

Level 2

Bits 15-8

Level 3

Bits 7-0

… …

hash table

Level 3

Bits 7-0

… …

hash table

… …

hash table

… …

… …

hash table

65535

65535

array 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0…

65535

array 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0…

630

0 31

0 31

32

32

63

63

4095

… … … …

… … … …

Level 1 – root

Bits 31-16

… …

hash table

… …

… …

hash table

… …

hash table

… …

… …

hash table

Level 2

Bits 15-8

Level 2

Bits 15-8

Level 3

Bits 7-0

… …

hash table

… …

… …

hash table

65535

65535

array 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0…

65535

array 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0…

Page 158: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 158

Example

0 0 0 0 0 0 00 0 0 00 0...

v

hash

v

...

0 0 0 00 00 0 0 0 0 0 0 0 0 00 0 0 0 0 0 0

v

...

v

...

1 111111111111111

t3

t2

t1

=1,11,111

=111111=1111

=1,11,111,1111M

M00

M1M04

t100

t200

t20

t10

0

32

0

32 ...111

...111

0 0 0 0 0 0 0 0

0

root

L2

L3

1..1111

0 hash1..111

32

32

0 1

63

4095 root−top

65535

Element List

...4

=1,11,111,1111,111111M

Page 159: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 159

Locate High Level// return handle of minx∈M : y≤ x

Function locate(y : N) : ElementHandle

if y> maxM then return ∞i := y[16..31] // Level 1

if r[i] = nil ∨y> maxMi then return minMt1.locate(i)

if Mi = x then return x

j := y[8..15] // Level 2

if r i [ j] = nil ∨y> maxMi j then return minMi,t1i .locate( j)

if Mi j = x then return x

return r i j [t1i j .locate(y[0..7])] // Level 3

Page 160: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 160

Locate in Bit Arrays

//find the smallestj ≥ i such thattk[ j] = 1

Method locate(i) for a bit arraytk consisting ofn bit words

//n= 32 for t1, t2, t1i , t1

i j ; n= 64 for t3; n= 8 for t2i , t2

i j

assertsome bit intk to the right ofi is nonzero

j := i div n // which word?

a := tk[n j..n j+n−1]

seta[(i modn)+1..n−1] to zero // n−1· · · i modn· · ·0if a= 0 then

j := tk+1.locate( j)

a := tk[n j..n j+n−1]

return n j+msbPos(a) // e.g. floating point conversion

Page 161: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 161

Random Locate

100

1000

256 1024 4096 16384 65536 218 220 222 223

Tim

e fo

r lo

cate

[ns]

n

orig-STreeLEDA-STree

STL map(2,16)-tree

STree

Page 162: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 162

Random Insert

0

500

1000

1500

2000

2500

3000

3500

4000

1024 4096 16384 65536 218 220 222 223

Tim

e fo

r in

sert

[ns]

n

orig-STreeLEDA-STree

STL map(2,16)-tree

STree

Page 163: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 163

Delete Random Elements

0

500

1000

1500

2000

2500

3000

3500

4000

4500

5000

5500

1024 4096 16384 65536 218 220 222 223

Tim

e fo

r de

lete

[ns]

n

orig-STreeLEDA-STree

STL map(2,16)-tree

STree

Page 164: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 164

Open Problems

Measurement for “worst case” inputs

Measure Performance forrealistic inputs

– IP lookupetc.

– Best firstheuristics like, e.g., bin packing

More space efficientimplementation

(A few) more bits

Page 165: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 165

4 Hashing

“to hash” ≈ “to bring into completedisorder”

paradoxically, this helps us to find things

moreeasily!

store setM ⊆ Element.

key(e) is unique fore∈M.

supportdictionaryoperations inO(1) time:

M.insert(e : Element): M := M∪e

M.remove(k : Key): M := M \e, e= k

M.find(k : Key): returne∈M with e= k; ⊥ if none present

(Convention:key is implicit), e.g.e= k iff key(e) = k)

Page 166: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 166

An (Over)optimistic approach

t

h

M

A (perfect)hash functionh

maps elements ofM to

unique entries oftablet[0..m−1], i.e.,

t[h(key(e))] = e

Page 167: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 167

Collisions

perfect hash functions are difficult to obtain

t

h

M

Example: Birthday Paradoxon

Page 168: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 168

Collision Resolution

for example byclosed hashing

entries: elements sequencesof elements

< >< >

< >

< >

<><><>

<><>

<><><>

t[h(k)]

t

h

M

k

Page 169: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 169

Hashing with Chaining

Implement sequences in closed hashing by singly linked lists

insert(e): Inserte at the beginning oft[h(e)]. constant time

remove(k): Scan throught[h(k)]. If an elemente with h(e) = k isencountered, remove it and return.

find(k) : Scan throught[h(k)].If an elemente with h(e) = k is encountered,return it. Otherwise, return⊥.

O(|M|) worst case time forremove andfind

< >< >

< >

< >

<><><>

<><>

<><><>

t[h(k)]

t

h

M

k

Page 170: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 170

A Review of Probability

1s 5s 6s3s 7s 2s 4s

binary search?

buckets

Example from hashing

sample spaceΩ random hash functions0..m−1Key

events: subsets ofΩ E42 = h∈Ω : h(4) = h(2)px =probabilityof x∈Ω uniform distr. ph = m−|Key|

P [E ] = ∑x∈E px P [E42] =1m

random variableX0 : Ω→ R X = |e∈M : h(e) = 0|.

Page 171: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 171

expectation E[X0] = ∑y∈Ω pyX(y) E[X] = |M|m (*)

Linearity of Expectation: E[X+Y] = E[X]+E[Y]

Proof of (*):

Consider the 0-1 RVXe = 1 if h(e) = 0 for e∈M andXe = 0

else.

E[X0] = E[ ∑e∈M

Xe] = ∑e∈M

E[Xe] = ∑e∈M

P [Xe = 1] = |M| · 1m

Page 172: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 172

Analysis for Random Hash Functions< >< >

< >

< >

<><><>

<><>

<><><>t

h

M

t[h(k)]Satz 1. The expectedexecution timeof

remove(k) andfind(k)

is O(1) if |M|= O(m).

Beweis.Constant time plus thetime for scanningt[h(k)].X := |t[h(k)]|= |e∈M : h(e) = h(k)|.Consider the 0-1 RVXe = 1 if h(e) = h(k) for e∈M andXe = 0else.

E[X] = E[ ∑e∈M

Xe] = ∑e∈M

E[Xe] = ∑e∈M

P [Xe = 1] =|M|m

= O(1)

This isindependent of the input setM.

Page 173: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 173

Universal Hashing

Idea: use only certain“easy” hash functions

Definition:

U ⊆ 0..m−1Key is universal

if for all x, y in Key with x 6= y and randomh∈U ,

P [h(x) = h(y)] =1m

.

Satz 2. Theorem 1 also applies to universal families of hash

functions.

Beweis.For Ω = U we still haveP [Xe = 1] = 1m.

The rest is as before.

Page 174: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 174

H Ω

Page 175: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 175

A Simple Universal FamilyAssumem is prime, Key⊆ 0, . . . ,m−1k

Satz 3. For a= (a1, . . . ,ak) ∈ 0, . . . ,m−1k define

ha(x) = a·x mod m,H · =

ha : a∈ 0..m−1k

.

H · is a universal family of hash functions

x1 x2 x3

a2 a3a1* * * mod m = h (x)+ + a

Page 176: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 176

Beweis.Considerx = (x1, . . . ,xk), y = (y1, . . . ,yk) with x j 6= y j

counta-s with ha(x) = ha(y).For each choice ofais, i 6= j, ∃ exactly onea j with

ha(x) = ha(y):

∑1≤i≤k

aixi ≡ ∑1≤i≤k

aiyi( mod m)

⇔ a j(x j −y j)≡ ∑i 6= j,1≤i≤k

ai(yi−xi)( mod m)

⇔ a j ≡ (x j −y j)−1 ∑

i 6= j,1≤i≤k

ai(yi−xi)( mod m)

mk−1 ways to choose theai with i 6= j.

mk is total number ofas, i.e.,

Page 177: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 177

P [ha(x) = ha(y)] =mk−1

mk =1m.

Page 178: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 178

Bit Based Universal FamiliesLet m= 2w, Key = 0,1k

Bit-Matrix Multiplication: H⊕ =

hM : M ∈ 0,1w×k

wherehM (x) = Mx (arithmetics mod2, i.e., xor, and)

Table Lookup:H⊕[] =

h⊕[](t1,...,tb)

: ti ∈ 0..m−10..w−1

whereh⊕[](t1,...,tb)

((x0,x1, . . . ,xb)) = x0⊕⊕b

i=1ti [xi]

x0x2 x1

a wa

x

k

Page 179: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 179

Hashing with Linear Probing

Open hashing: go back to original idea.

Elements are directly stored in the table.

Collisions are resolved by finding other entries.

linear probing: search for next free place by scanning the table.

Wrap around at the end.

simple

space efficient

cache efficient

t

h

M

Page 180: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 180

The Easy Part

ClassBoundedLinearProbing(m,m′ : N; h : Key→ 0..m−1)

t=[⊥, . . . ,⊥] : Array [0..m+m′−1] of Element

invariant ∀i : t[i] 6=⊥⇒ ∀ j ∈ h(t[i])..i−1 : t[i] 6=⊥

Procedure insert(e : Element)

for i := h(e) to ∞ while t[i] 6=⊥ do ;

asserti < m+m′−1

t[i] := e

Function find(k : Key) : Element

for i := h(e) to ∞ while t[i] 6=⊥ doif t[i] = k then return t[i]

return ⊥

t

h

M

m’

m

Page 181: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 181

Removeexample:t = [. . . , x

h(z),y,z, . . .], remove(x)

invariant ∀i : t[i] 6=⊥⇒ ∀ j ∈ h(t[i])..i−1 : t[i] 6=⊥Procedureremove(k : Key)

for i := h(k) to ∞ while k 6= t[i] do // searchk

if t[i] =⊥ then return // nothing to do

//we plan for ahole ati.

for j := i+1 to ∞ while t[ j] 6=⊥ do//Establish invariant fort[ j].

if h(t[ j])≤ i thent[i] := t[ j] // Overwrite removed element

i := j // move planned hole

t[i] :=⊥ // erase freed entry

Page 182: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 182

More Hashing Issues

High probability andworst caseguarantees

more requirements on the hash functions

Hashing as a means of load balancing in parallel systems,

e.g., storage servers

– Different disk sizes and speeds

– Adding disks / replacing failed disks without much

copying

Page 183: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 183

Space Efficient Hashing

with Worst Case Constant Access Time

Represent a set ofn elements (with associated information)

using space(1+ ε)n.

Support operationsinsert, delete, lookup, (doall) efficiently.

Assume a truly random hash functionh

Page 184: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 184

Related Work

Uniform hashing:Expectedtime≈ 1

ε

h3h1 h2

Dynamic Perfect Hashing,[Dietzfelbinger et al. 94]Worst caseconstant timefor lookupbut ε is not small.

Approaching the Information Theoretic Lower Bound:[Brodnik Munro 99,Raman Rao 02]Space(1+o(1))×lower boundwithout associated information[Pagh 01]static case.

Page 185: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 185

Cuckoo Hashing

[Pagh Rodler 01]Table of size2+ ε .

Two choices for each element.

Insert moves elements;

rebuild if necessary.

Very fast lookup and insert.

Expected constant insertion time.

h1

h2

Page 186: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 186

d-ary Cuckoo Hashing

d choices for each element.

Worst cased probes fordeleteandlookup.

Task: maintainperfect matching

in thebipartite graph

(L = Elements,R= Cells,E = Choices),

e.g.,insertby BFSof random walk.

h1

h2

h3

Page 187: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 187

Tradeoff: Space↔ Lookup/Deletion Time

Lookup and Delete:d = O(log 1

ε)

probes

Insert:

(1ε

)O(log(1/ε)), (experiments)−→ O(1/ε)?

Page 188: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 188

Experiments

0

1

2

3

4

5

0 0.2 0.4 0.6 0.8 1

ε *

#pro

bes

for

inse

rt

space utilization

d=2d=3d=4d=5

Page 189: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 189

Open Questions and Further Results Tight analysis ofinsertion

Two choices withd slots each[Dietzfelbinger et al.]

cache efficiency

Good Implementation?

Automatic rehash

Always correct

Page 190: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 190

5 Minimum Spanning Trees

undirected GraphG= (V,E).

nodesV, n= |V|, e.g.,V = 1, . . . ,nedgese∈ E, m= |E|, two-element subsets ofV.

edge weightc(e), c(e) ∈ R+.

G is connected, i.e.,∃ path between any two nodes.4

2

3

1

792

5

Find a tree(V,T) with minimumweight∑e∈T c(e) that connects

all nodes.

Page 191: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 191

MST: Overview

Basics: Edge property and cycle property

Jarník-Prim Algorithm

Kruskals Algorithm

Filter-Kruskal

Comparison

(Advanced algorithms using the cycle property)

External MST

Page 192: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 192

Applications

Clustering

Subroutine in combinatorial optimization, e.g., Held-Karp

lower bound for TSP.

Challenging real world instances???

Image segementation−→ [Diss. Jan Wassenberg]

Anyway: almost ideal“fruit fly” problem

Page 193: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 193

Selecting and Discarding MST Edges

The Cut Property

For anyS⊂V consider the cut edges

C= u,v ∈ E : u∈ S,v∈V \SThelightestedge inC can be used in an MST. 4

2

3

1

72

59

The Cycle Property

Theheaviestedge on a cycle is not needed for an MST4

2

3

1

792

5

Page 194: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 194

The Jarník-Prim Algorithm [Jarník 1930, Prim

1957]

Idea: grow a tree

T:= /0

S:= s for arbitrary start nodes

repeatn−1 times

find (u,v) fulfilling the cut property forS

S:= S∪vT:= T ∪(u,v)

4

2

3

1

792

5

Page 195: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 195

Implementation Using Priority Queues

Function jpMST(V, E, w) : Setof Edge

dist=[∞, . . . ,∞] : Array [1..n]// dist[v] is distance ofv from the tree

pred: Array of Edge// pred[v] is shortest edge betweenSandv

q : PriorityQueueof Node withdist[·] as priority

dist[s] := 0; q.insert(s) for anys∈V

for i := 1 to n−1 do dou := q.deleteMin() // new node forS

dist[u] := 0

foreach (u,v) ∈ E doif c((u,v))< dist[v] then

dist[v] := c((u,v)); pred[v] := (u,v)

if v∈ q then q.decreaseKey(v) elseq.insert(v)

return pred[v] : v∈V \s

4

2

3

1

792

5

Page 196: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 196

Graph Representation for Jarník-Prim

We need node→ incident edges

4

2

3

1

792

5

m 8=m+1

V

E

1 3 5 7 91 n 5=n+1

4 1 3 2 4 1 3c 9 5 7 7 2 2 95

2

1

+ fast (cache efficient)

+ more compact than linked lists

− difficult to change

− Edges are stored twice

Page 197: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 197

Analysis

O(m+n) time outside priority queue

n deleteMin (timeO(nlogn))

O(m) decreaseKey (timeO(1) amortized)

O(m+nlogn) usingFibonacci Heaps

practical implementation using simplerpairing heaps.

But analysis is still partlyopen!

Page 198: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 198

Kruskal’s Algorithm [1956]

T := /0 // subforest of the MST

foreach (u,v) ∈ E in ascending order of weightdoif u andv are indifferent subtrees ofT then

T := T ∪(u,v) // Join two subtrees

return T

Page 199: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 199

The Union-Find Data Structure

ClassUnionFind(n : N) // Maintain a partition of 1..n

parent=[n+1, . . . ,n+1] : Array [1..n] of 1..n+⌈logn⌉Function find(i : 1..n) : 1..n

if parent[i]> n then return i

elsei′ := find(parent[i])

parent[i] := i′

return i′

Procedure link(i, j : 1..n)

asserti and j are leaders of different subsets

if parent[i]< parent[ j] then parent[i] := j

else ifparent[i]> parent[ j] then parent[ j] := i

elseparent[ j] := i; parent[i]++ // nextgeneration

Procedureunion(i, j) if find(i) 6= find( j) then link(find(i), find( j))

3 421

556 1

5 6

rank

1

rank

2

4

1 2

3

parent:

Page 200: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 200

Kruskal Using Union Find

T : UnionFind(n)

sortE in ascending order of weight

kruskal(E)

Procedurekruskal(E)

foreach (u,v) ∈ E dou′:= T.find(u)

v′:= T.find(v)

if u′ 6= v′ thenoutput(u,v)

T.link(u′,v′)

Page 201: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 201

Graph Representation for Kruskal

Just an edge sequence (array) !

+ very fast (cache efficient)

+ Edges are stored only once

more compact than adjacency array

Page 202: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 202

Analysis

O(sort(m)+mα(m,n)) = O(mlogm) whereα is the inverse

Ackermann function

Page 203: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 203

Kruskal versus Jarník-Prim I

Kruskal wins for very sparse graphs

Prim seems to win for denser graphs

Switching point isunclear

– How is the inputrepresented?

– How manydecreaseKeys are performed by JP?

(average case:nlog mn [Noshita 85])

– Experimental studies are quiteold [Moret Shapiro 91],

useslow graphrepresentationfor both algs,

andartificial inputs

see attached slides.

Page 204: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 204

5.1 Filtering by Sampling Rather Than Sorting

R:= random sample ofr edges fromE

F :=MST(R) // Wlog assume thatF spansV

L:= /0 // “light edges” with respect toR

foreache∈ E do // Filter

C := the unique cycle ine∪F

if e is not heaviest inC thenL:= L∪e

return MST((L∪F))

Page 205: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 205

5.1.1 Analysis

[Chan 98, KKK 95]

Observation:e∈ L only if e∈MST(R∪e).(Otherwisee could replace some heavier edge inF).

Lemma 4. E[|L∪F|]≤ mnr

Page 206: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 206

MST Verification by Interval Maxima

Number the nodes by the order they were added to the MST

by Prim’s algorithm.

wi = weight of the edge that inserted nodei.

Largest weight on path(u,v) = maxw j‖u< j ≤ v.

14

3

5

8

0 1 84 3 5

14

8

3

5

0 1 4 3 5 8

Page 207: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 207

Interval Maxima

Preprocessing: build

nlogn size array

PreSuf.

Level 0

Level 3

Level 2

Level 1

88 56 30 56 52 74 76

88 88 30 65 65 52 52 77 77 74 74 76 80

98 98 98 15 65 75 77 77 77 41 62 76 80

98 98 75 75 75 56 52 77 77 77 77 77 80

98 15 65 75 34 77 41 62 80

757598

98 75 74

98 3498

2 6To find maxa[i.. j]:

Find the level of the LCA: ℓ= ⌊log2(i⊕ j)⌋.

Returnmax(PreSuf[ℓ][i],PreSuf[ℓ][ j]).

Example: 2⊕6= 010⊕110= 100⇒ ℓ= 2

Page 208: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 208

A Simple Filter Based Algorithm

Chooser =√

mn.

We get expected time

TPrim(√

mn)+O(nlogn+m)+TPrim(mn√mn) = O(nlogn+m)

The constant factor in front of them is very small.

Page 209: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 209

Results10 000 nodes, SUN-Fire-15000, 900 MHz UltraSPARC-III+

Worst Case Graph

0

100

200

300

400

500

600

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

Tim

e pe

r ed

ge [n

s]

Edge density

PrimFilter

Page 210: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 210

Results10 000 nodes, SUN-Fire-15000, 900 MHz UltraSPARC-III+

Random Graph

0

100

200

300

400

500

600

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

Tim

e pe

r ed

ge [n

s]

Edge density

PrimFilter

Page 211: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 211

Results10 000 nodes,

NEC SX-5Vector Machine

“worst case”

0

200

400

600

800

1000

1200

1400

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

Tim

e pe

r ed

ge [n

s]

Edge density

Prim (Scalar)I-Max (Scalar)

Prim (Vectorized)I-Max (Vectorized)

Page 212: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 212

Edge Contraction

Let u,v denote an MST edge.

Eliminatev:

forall (w,v) ∈ E doE := E \ (w,v)∪(w,u) // but remember orignal terminals

4

1

4 3

1

792

59

2

7 (was 2,3)2

3

Page 213: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 213

Boruvka’s Node Reduction Algorithm

For each node find the lightest incident edge.

Include them into the MST (cut property)

contract these edges,

TimeO(m)

At least halves the number of remaining nodes

4

1

5

792

5 2

33

Page 214: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 214

5.2 Simpler and Faster Node Reduction

for i := n downto n′+1 dopick a random nodev

find thelightestedge(u,v) out of v and output it

contract(u,v)

E[degree(v)]≤ 2m/i

∑n′<i≤n

2mi

=2m

(

∑0<i≤n

1i− ∑

0<i≤n′

1i

)

≈2m(lnn− lnn′)=2mlnnn′

4

1

4 3

1

4 32

9 (was 1,4)795

9

7 (was 2,3)

output 2,3

2

3 7output 1,2

52 2

Page 215: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 215

5.3 Randomized Linear Time Algorithm

1. Factor 8 node reduction (3× Boruvka or sweep algorithm)

O(m+n).

2. R⇐m/2 random edges.O(m+n).

3. F ⇐MST(R) [Recursively].

4. Find light edgesL (edge reduction). O(m+n)

E[|L|]≤ mn/8m/2 = n/4.

5. T⇐MST(L∪F) [Recursively].

T(n,m)≤ T(n/8,m/2)+T(n/8,n/4)+c(n+m)

T(n,m)≤ 2c(n+m) fulfills this recurrence.

Page 216: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 216

5.4 External MSTs

Semiexternal Algorithms

Assumen≤M−2B:

runKruskal’s algorithmusingexternal sorting

Page 217: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 217

Streaming MSTs

If M is yet a bit larger we can even do it withm/B I/Os:

T:= /0 // currentapproximation of MST

while there are any unprocessed edgesdoload anyΘ(M) unprocessed edgesE′

T:= MST(T ∪E′) // for any internal MST alg.

Corollary: we can do it withlinearexpectedinternal work

Disadvantages to Kruskal:

Slower in practice

Smaller max.n

Page 218: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 218

General External MST

while n> M−2B doperform some node reduction

use semi-external Kruskal

n’<M

m/nn

Theory:O(sort(m)) expected I/Os by externalizing the linear

time algorithm.

(i.e., node reduction+ edge reduction)

Page 219: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 219

External Implementation I: Sweeping

π : random permutationV→V

sortedges(u,v) by min(π(u),π(v))for i := n downto n′+1 do

pick the nodev with π(v) = i

find thelightestedge(u,v) out of v and output it

contract(u,v)

u v...

relink

relink

outputsweep line

Problem: how to implement relinking?

Page 220: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 220

Relinking Using Priority Queues

Q: priority queue // Order:max node, thenmin edge weight

foreach (u,v ,c) ∈ E do Q.insert((π(u),π(v) ,c,u,v))current :=n+1

loop(u,v ,c,u0,v0) := Q.deleteMin()

if current6= maxu,vthenif current= M+1 then returnoutputu0,v0 ,ccurrent := maxu,vconnect := minu,v

elseQ.insert((minu,v ,connect ,c,u0,v0))

≈ sort(10mln nM ) I/Os with opt. priority queues [Sanders 00]

Problem:Computebound

u v...

relink

relink

outputsweep line

Page 221: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 221

Sweeping with linear internal work

Assumem= O(M2/B

)

k= Θ(M/B) external bucketswith n/k nodes each

M nodes for last“semiexternal”bucket

split currentbucket into

internalbuckets for eachnode

Sweeping:

Scan current internal bucket twice:

1. Find minimum

2. Relink

New external bucket: scan and put ininternalbuckets

Large degree nodes: move tosemiexternalbucket

...

externalexternal

internal

current semiexternal

Page 222: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 222

Experiments

Instances from “classical” MST study[Moret Shapiro 1994]

sparse random graphs

random geometric graphs

grids

O(sort(m)) I/Os

for planar graphs by

removing parallel edges!

Other instances are rather dense

or designed to fool specific algorithms.

PCI−Busses

Controller

Channels

2x64x66 Mb/s

4 Threads2x Xeon

128

E75001 GBDDRRAM

8x45

4x2x100MB/s

400x64 Mb/s

8x80GBMB/s

Chipset

Intel

Page 223: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 223

m≈ 2n

1

2

3

4

5

6

5 10 20 40 80 160 320 640 1280 2560

t / m

[µs]

m / 1 000 000

KruskalPrim

randomgeometric

grid

Page 224: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 224

m≈ 4n

1

2

3

4

5

6

5 10 20 40 80 160 320 640 1280 2560

t / m

[µs]

m / 1 000 000

KruskalPrim

randomgeometric

Page 225: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 225

m≈ 8n

1

2

3

4

5

6

5 10 20 40 80 160 320 640 1280 2560

t / m

[µs]

m / 1 000 000

KruskalPrim

randomgeometric

Page 226: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 226

MST Summary

Edge reduction helps for very dense, “hard” graphs

A fast and simplenode reductionalgorithm

4× less I/Os than previous algorithms

Refined semiexternal MST, use asbase case

Simple pseudo random permutations (no I/Os)

A fast implementation

Experiments with huge graphs (up ton= 4·109 nodes)

External MST is feasible

Page 227: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 227

Open Problems

New experiments for (improved) Kruskal versus

Jarník-Prim

Realistic (huge) inputs

Parallel external algorithms

Implementations for other graph problems

Page 228: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 228

Conclusions

Even fundamental, “simple” algorithmic problems

still raise interesting questions

Implementation and experiments are important and were

neglected by parts of the algorithms community

Theoryan (at least) equally important, essential component

of the algorithm design process

Page 229: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 229

More Algorithm Engineering on Graphs

Count triangles in very large graphs. Interesting as a

measure of clusteredness. (Cooperation with AG Wagner)

External BFS (Master thesis Deepak Ajwani)

Maximum flows: Is the theoretical best algorithm any

good? (Jein)

Approximate max. weighted matching (Studienarbeit Jens

Maue)

Page 230: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 230

Maximal Flows

Theory: O(mΛ log(n2/m) logU

)binary blocking

flow-algorithm mitΛ = minm1/2,n2/3 [Goldberg-Rao-97].

Problem: best case≈ worst case

[Hagerup Sanders Träff WAE 98]:

Implementable generalization

best case≪ worst case

best algorithms for some “difficult” instances

Page 231: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 231

Ergebnis

u v...

relink

relink

outputsweep line

Einfach extern implementierbar

n′ = M semiexterner Kruskal Algorithmus

InsgesamtO(sort(mln n

m))

erwartete I/Os

Für realistische Eingaben mindestens4× bisherals bisher

bekannte Algorithmen

Implementierung in<stxxl> mit bis zu96 GBytegro/3en

Graphen läuft „über Nacht“

Page 232: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 232

More On Experimental Methodology

Scientific Method:

Experiment need a possible outcome thatfalsifiesahypothesis

Reproducible

– keep data/code for at least 10 years

– clear and detaileddescription in papers / TRs

– share instances and code

appl. engin.

realisticmodels

design

implementation

librariesalgorithm−

perf.−guarantees

applications

falsifiable

inductionhypothesesanalysis experiments

algorithmengineering real

Inputs

deduction

Page 233: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 233

Quality Criteris

Beat the state of the art, globally – (not your own toy codes

or the toy codes used in your community!)

Clearlydemonstrate this !

– both codes use same data ideally from accepted

benchmarks (not just your favorite data!)

– comparable machines or fair (conservative) scaling

– Avoid uncomparabilities like: “Yeah we are worse but

but twice as fast”

– real world data wherever possible

– as much different inputs as possible

– its fine if you are better just on some (important) inputs

Page 234: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 234

Not Here but Important

describing the setup

finding sources of measurement errors

reducing measurement errors (averaging, median,unloaded

machine. . . )

measurements in thecreativephase of experimental

algorithmics.

Page 235: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 235

The Starting Point

(Several) Algorithm(s)

A few quantities to be measured: time, space, solution

quality, comparisons, cache faults,. . . There may also be

measurement errors.

An unlimited number of potential inputs. condense to a

few characteristic ones (size,|V|, |E|, . . . or problem

instances from applications)

Usually there is not a lack but anabundanceof data6= many

other sciences

Page 236: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 236

The Process

Waterfall model?

1. Design

2. Measurement

3. Interpretation

Perhaps the paper should at least look like that.

Page 237: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 237

The Process

Eventually stop asking questions (Advisors/Referees listen

!)

build measurement tools

automate (re)measurements

Choice of Experiments driven by risk and opportunity

Distinguish mode

explorative: many different parameter settings, interactive,

short turnaround times

consolidating:many large instances, standardized

measurement conditions, batch mode, many machines

Page 238: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 238

Of Risks and Opportunities

Example: Hypothesis= my algorithm is the best

big risk: untried main competitor

small risk: tuning of a subroutine that takes 20 % of the time.

big opportunity: use algorithm for a new application

new input instances

Page 239: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 239

Presenting Data from Experiments in

Algorithmics

Restrictions

black and white easy and cheap printing

2D (stay tuned)

no animation

no realism desired

Page 240: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 240

Basic Principles

Minimize nondata ink

(form follows function, not a beauty contest,. . . )

Letter size≈ surrounding text

Avoid clutter and overwhelming complexity

Avoid boredom (too little data perm2).

Make the conclusions evident

Page 241: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 241

Tables

+ easy

− easy overuse

+ accurate values (6= 3D)

+ more compact than bar chart

+ good for unrelated instances (e.g. solution quality)

− boring

− no visual processing

rule of thumb that “tables usually outperform a graph for smalldata sets of 20 numbers or less”[Tufte 83]

Curves in main paper, tables in appendix?

Page 242: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 242

2D Figures

default:x= input size, y= f (execution time)

Page 243: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 243

x AxisChoose unit to eliminate a parameter?

1

1.1

1.2

1.3

1.4

1.5

1.6

1.7

1.8

10 100 1000 10000 100000 1e+06

impr

ovem

ent m

in(T

* 1,T

* ∞)/

T* *

k/t0

P=64P=1024P=16384

lengthk fractional tree broadcasting. latencyt0+k

Page 244: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 244

x Axislogarithmic scale?

1

1.1

1.2

1.3

1.4

1.5

1.6

1.7

1.8

10 100 1000 10000 100000 1e+06

impr

ovem

ent m

in(T

* 1,T

* ∞)/

T* *

k/t0

P=64P=1024P=16384

yes if x range is wide

Page 245: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 245

x Axislogarithmic scale, powers of two (or

√2)

0

50

100

150

1024 4096 16384 65536 218 220 222 223

(T(d

elet

eMin

) +

T(in

sert

))/lo

g N

[ns

]

N

bottom up binary heapbottom up aligned 4-ary heapsequence heap

with tic marks, (plus a few small ones)

Page 246: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 246

gnuplotset xlabel "N"

set ylabel "(time per operation)/log N [ns]"

set xtics (256, 1024, 4096, 16384, 65536, "2^18" 262144,

set size 0.66, 0.33

set logscale x 2

set data style linespoints

set key left

set terminal postscript portrait enhanced 10

set output "r10000timenew.eps"

plot [1024:10000000][0:220]\

"h2r10000new.log" using 1:3 title "bottom up binary heap"

"h4r10000new.log" using 1:3 title "bottom up aligned 4-ary

"knr10000new.log" using 1:3 title "sequence heap" with linespoints

Page 247: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 247

Data File256 703.125 87.8906

512 729.167 81.0185

1024 768.229 76.8229

2048 830.078 75.4616

4096 846.354 70.5295

8192 878.906 67.6082

16384 915.527 65.3948

32768 925.7 61.7133

65536 946.045 59.1278

131072 971.476 57.1457

262144 1009.62 56.0902

524288 1035.69 54.51

1048576 1055.08 52.7541

2097152 1113.73 53.0349

4194304 1150.29 52.2859

8388608 1172.62 50.9836

Page 248: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 248

x Axislinear scale for ratios or small ranges (#processor,. . . )

0

100

200

300

400

500

600

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

Tim

e pe

r ed

ge [n

s]

Edge density

PrimFilter

Page 249: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 249

x AxisAn exotic scale: arrival rate 1− ε of saturation point

1

2

3

4

5

6

7

8

2 4 6 8 10 12 14 16 18 20

aver

age

dela

y

1/ε

nonredundantmirrorring shortest queuering with matchingshortest queue

Page 250: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 250

y AxisAvoid log scale ! scale such that theory gives≈ horizontal lines

0

50

100

150

1024 4096 16384 65536 218 220 222 223

(T(d

elet

eMin

) +

T(in

sert

))/lo

g N

[ns

]

N

bottom up binary heapbottom up aligned 4-ary heapsequence heap

but give easy interpretation of the scaling function

Page 251: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 251

y Axisgive units

0

50

100

150

1024 4096 16384 65536 218 220 222 223

(T(d

elet

eMin

) +

T(in

sert

))/lo

g N

[ns

]

N

bottom up binary heapbottom up aligned 4-ary heapsequence heap

Page 252: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 252

y Axisstart from 0if this does not waste too much space

0

50

100

150

1024 4096 16384 65536 218 220 222 223

(T(d

elet

eMin

) +

T(in

sert

))/lo

g N

[ns

]

N

bottom up binary heapbottom up aligned 4-ary heapsequence heap

you may assume readers to be out of Kindergarten

Page 253: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 253

y Axisclip outclassed algorithms

1

2

3

4

5

6

7

8

2 4 6 8 10 12 14 16 18 20

aver

age

dela

y

1/ε

nonredundantmirrorring shortest queuering with matchingshortest queue

Page 254: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 254

y Axisvertical size: weighted average of the slants of the line segments

in the figure should be about45[Cleveland 94]

1

1.1

1.2

1.3

1.4

1.5

1.6

1.7

1.8

10 100 1000 10000 100000 1e+06

impr

ovem

ent m

in(T

* 1,T

* ∞)/

T* *

k/t0

P=64P=1024P=16384

Page 255: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 255

y Axisgraph a bit wider than high, e.g., golden ratio[Tufte 83]

1

1.1

1.2

1.3

1.4

1.5

1.6

1.7

1.8

10 100 1000 10000 100000 1e+06

impr

ovem

ent m

in(T

* 1,T

* ∞)/

T* *

k/t0

P=64P=1024P=16384

Page 256: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 256

Multiple Curves

+ high information density

+ better than 3D (reading off values)

− Easily overdone

≤ 7 smoothcurves

Page 257: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 257

Reducing the Number of Curves

use ratios

1

1.1

1.2

1.3

1.4

1.5

1.6

1.7

1.8

10 100 1000 10000 100000 1e+06

impr

ovem

ent m

in(T

* 1,T

* ∞)/

T* *

k/t0

P=64P=1024P=16384

Page 258: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 258

Reducing the Number of Curves

omit curves

outclassed algorithms (for case shown)

equivalent algorithms (for case shown)

Page 259: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 259

Reducing the Number of Curves

split into two graphs

1

2

3

4

5

6

7

8

2 4 6 8 10 12 14 16 18 20

aver

age

dela

y

1/ε

nonredundantmirrorring shortest queuering with matchingshortest queue

Page 260: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 260

Reducing the Number of Curvessplit into two graphs

1

1.2

1.4

1.6

1.8

2

2 4 6 8 10 12 14 16 18 20

aver

age

dela

y

1/ε

shortest queuehybridlazy sharingmatching

Page 261: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 261

Keeping Curves apart: log y scale

0

200

400

600

800

1000

1200

1400

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

Tim

e pe

r ed

ge [n

s]

Edge density

Prim (Scalar)I-Max (Scalar)

Prim (Vectorized)I-Max (Vectorized)

Page 262: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 262

Keeping Curves apart: smoothing

1

2

3

4

5

6

7

8

2 4 6 8 10 12 14 16 18 20

aver

age

dela

y

1/ε

nonredundantmirrorring shortest queuering with matchingshortest queue

Page 263: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 263

Keys

1

2

3

4

5

6

7

8

2 4 6 8 10 12 14 16 18 20

aver

age

dela

y

1/ε

nonredundantmirrorring shortest queuering with matchingshortest queue

same order as curves

Page 264: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 264

Keysplace in white space

1

1.2

1.4

1.6

1.8

2

2 4 6 8 10 12 14 16 18 20

aver

age

dela

y

1/ε

shortest queuehybridlazy sharingmatching

consistent in different figures

Page 265: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 265

Todsünden1. forget explaining theaxes

2. connecting unrelatedpoints

by lines

3. mindless use/overinterpretation

of double-log plot

4. crypticabbreviations

5. microscopiclettering

6. excessivecomplexity

7. pie charts

Page 266: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 266

Arranging Instances

bar charts

stack components of execution time

careful with shading

postprocessing

phase 2

phase 1

preprocessing

Page 267: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 267

Arranging Instancesscatter plots

1

10

100

1000

1 10 100 1000

n / a

ctiv

e se

t siz

e

n/m

Page 268: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 268

Measurements and Connections

straight line between points do not imply claim of linear

interpolation

different with higher order curves

no points imply an even stronger claim. Good for very

dense smooth measurements.

Page 269: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 269

Grids and Ticks

Avoid grids or make it light gray

usually round numbers for tic marks!

sometimes plot important values on the axis

Page 270: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 270

0

5

10

15

20

32 1024 32768n

Measurementlog n + log ln n + 1log n

errors maynot be of statistical nature!

Page 271: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 271

3D

− you cannot read off absolute values

− interesting parts may be hidden

− only one surface

+ good impression of shape

Page 272: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 272

Caption

what is displayed

how has the data been obtained

surrounding text has more.

Page 273: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 273

Check List

Should the experimental setup from the exploratory phase

be redesigned to increase conciseness or accuracy?

What parameters should be varied? What variables should

be measured? How are parameters chosen that cannot be

varied?

Can tables be converted into curves, bar charts, scatter plots

or any other useful graphics?

Should tables be added in an appendix or on a web page?

Should a 3D-plot be replaced by collections of 2D-curves?

Can we reduce the number of curves to be displayed?

How many figures are needed?

Page 274: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 274

Scale thex-axis to makey-values independent of some

parameters?

Should thex-axis have a logarithmic scale? If so, do the

x-values used for measuring have the same basis as the tick

marks?

Should thex-axis be transformed to magnify interesting

subranges?

Is the range ofx-values adequate?

Do we have measurements for the rightx-values, i.e.,

nowhere too dense or too sparse?

Should they-axis be transformed to make the interesting

part of the data more visible?

Should they-axis have a logarithmic scale?

Page 275: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 275

Is it be misleading to start they-range at the smallest

measured value?

Clip the range ofy-values to exclude useless parts of

curves?

Can we use banking to 45?

Are all curves sufficiently well separated?

Can noise be reduced using more accurate measurements?

Are error bars needed? If so, what should they indicate?

Remember that measurement errors are usuallynot random

variables.

Use points to indicate for whichx-values actual data is

available.

Connect points belonging to the same curve.

Page 276: Sanders: Algorithm Engineering January 20, 2011 Algorithm …algo2.iti.kit.edu/sanders/courses/algen10-11/vorlesung.pdf · 7.2 7.4 7.6 7.8 8 8.2 8.4 10 12 14 16 18 20 22 24 26 nanosecs

Sanders: Algorithm EngineeringJanuary 20, 2011 276

Only use splines for connecting points if interpolation is

sensible.

Do not connect points belonging to unrelated problem

instances.

Use different point and line styles for different curves.

Use the same styles for corresponding curves in different

graphs.

Place labels defining point and line styles in the right order

and without concealing the curves.

Captions should make figures self contained.

Give enough information to make experiments

reproducible.