
Felix Putze and Peter Sanders

Algorithmics

design · implement · analyze · experiment

Course Notes
Algorithm Engineering

TU Karlsruhe, October 19, 2009


Preface

These course notes cover a lecture on algorithm engineering for the basic toolbox that Peter Sanders has been giving at Universität Karlsruhe since 2004. The text is compiled from slides, scientific papers, and other manuscripts. Most of this material is in English, so English was adopted as the main language. The primary sources of our material are given at the beginning of each chapter; please refer to the original publications for further references.

This document is still work in progress. Please report bugs of any type (content, language, layout, ...) to [email protected]. Thank you!


Contents

1 What is Algorithm Engineering?
  1.1 Introduction
  1.2 State of Research and Open Problems

2 Data Structures
  2.1 Arrays & Lists
  2.2 External Lists
  2.3 Stacks, Queues & Variants

3 Sorting
  3.1 Quicksort Basics
  3.2 Refined Quicksort
  3.3 Lessons from Experiments
  3.4 Super Scalar Sample Sort
  3.5 Multiway Merge Sort
  3.6 Sorting with Parallel Disks
  3.7 Internal Work of Multiway Mergesort
  3.8 Experiments

4 Priority Queues
  4.1 Introduction
  4.2 Binary Heaps
  4.3 External Priority Queues
  4.4 Addressable Priority Queues

5 External Memory Algorithms
  5.1 Introduction
  5.2 The External Memory Model and Things We Already Saw
  5.3 The Stxxl Library
  5.4 Time-Forward Processing
  5.5 Cache-oblivious Algorithms
    5.5.1 Matrix Transposition
    5.5.2 Searching Using Van Emde Boas Layout
    5.5.3 Funnel Sorting
    5.5.4 Is the Model an Oversimplification?
  5.6 External BFS
    5.6.1 Introduction
    5.6.2 Algorithm of Munagala and Ranade
    5.6.3 An Improved BFS Algorithm with Sublinear I/O
    5.6.4 Improvements in the Previous Implementations of MR BFS and MM BFS R
    5.6.5 A Heuristic for Maintaining the Pool
  5.7 Maximal Independent Set
  5.8 Euler Tours
  5.9 List Ranking

6 van Emde Boas Trees
  6.1 From Theory to Practice
  6.2 Implementation
  6.3 Experiments

7 Shortest Path Search
  7.1 Introduction
  7.2 "Classical" and Other Results
  7.3 Highway Hierarchy
    7.3.1 Introduction
    7.3.2 Hierarchies and Contraction
    7.3.3 Query
    7.3.4 Experiments
  7.4 Transit Node Routing
    7.4.1 Computing Transit Nodes
    7.4.2 Experiments
    7.4.3 Complete Description of the Shortest Path
  7.5 Dynamic Shortest Path Computation
    7.5.1 Covering Nodes
    7.5.2 Static Highway-Node Routing
    7.5.3 Construction
    7.5.4 Query
    7.5.5 Analogies To and Differences From Related Techniques
    7.5.6 Dynamic Multi-Level Highway Node Routing
    7.5.7 Experiments

8 Minimum Spanning Trees
  8.1 Definition & Basic Remarks
    8.1.1 Two Important Properties
  8.2 Classic Algorithms
    8.2.1 Excursus: The Union-Find Data Structure
  8.3 QuickKruskal
  8.4 The I-Max-Filter Algorithm
  8.5 External MST
    8.5.1 Semiexternal Algorithm
    8.5.2 External Sweeping Algorithm
    8.5.3 Implementation & Experiments
  8.6 Connected Components

9 String Sorting
  9.1 Introduction
  9.2 Multikey Quicksort
  9.3 Radix Sort

10 Suffix Array Construction
  10.1 Introduction
  10.2 The DC3 Algorithm
  10.3 External Suffix Array Construction

11 Presenting Data from Experiments
  11.1 Introduction
  11.2 The Process
  11.3 Tables
  11.4 Two-dimensional Figures
  11.5 Grids and Ticks
  11.6 Three-dimensional Figures
  11.7 The Caption
  11.8 A Check List

12 Appendix
  12.1 Used Machine Models
  12.2 Amortized Analysis for Unbounded Arrays
  12.3 Analysis of Randomized Quicksort
  12.4 Insertion Sort
  12.5 Lemma on Interval Maxima
  12.6 Random Permutations without Additional I/Os
  12.7 Proof of Discarding Theorem for Suffix Array Construction
  12.8 Pseudocode for the Discarding Algorithm

Chapter 1

What is Algorithm Engineering?

1.1 Introduction

Algorithms (including data structures) are at the heart of every computer application and are therefore of decisive importance for large areas of technology, business, science, and everyday life. Algorithmics deals with the systematic development of efficient algorithms and thus plays a crucial part in the effective development of reliable and resource-efficient technology. We mention here only a few particularly spectacular examples.

Fast search in the enormous data volumes of the Internet (e.g., with Google) has changed the way we handle knowledge and information. This was made possible by full-text search algorithms that can fish all hits out of terabytes of data within fractions of a second, and by ranking algorithms that process graphs with billions of nodes in order to filter relevant answers out of the flood of hits. Less visible, but similarly important, are algorithms for the efficient distribution of very frequently accessed data under massive load fluctuations or even overload attacks (distributed denial of service attacks). The market leader in this field, Akamai, was founded by algorithmicists. One of the most important scientific events of recent years was the publication of the human genome. Its early publication was due in no small part to the design of the sequencing process (whole genome shotgun sequencing) used by the company Celera, a design justified by algorithmic considerations. Here algorithmics did not confine itself to processing the data produced by natural scientists, but exerted a shaping influence on the entire process.

The list of areas in which sophisticated algorithms play a key role could be continued almost arbitrarily: computer graphics, image processing, geographic information systems, cryptography, production planning, logistics, traffic planning ...

So how does the transfer of algorithmic innovation into application areas work?


Figure 1.1: Two views of algorithmics. Left: traditional. Right: AE = algorithmics as a cycle of design, analysis, implementation, and experimental evaluation of algorithms, driven by falsifiable hypotheses. (The original figure also shows abstract vs. realistic models, performance guarantees, algorithm libraries, real inputs, and the interplay of deduction and induction feeding this cycle.)

Traditionally, algorithmics has used the methodology of algorithm theory, which stems from mathematics: algorithms are designed for simple, abstract problem and machine models. The main results are provable performance guarantees for all possible inputs. In many cases this approach leads to elegant, timeless solutions that can be adapted to many applications. The hard performance guarantees reliably yield high efficiency even for types of inputs that are unknown at implementation time. From the viewpoint of algorithm theory, taking up and implementing an algorithm is part of application development. According to general observation, however, this kind of result transfer is a very slow process. As the demand for innovative algorithms grows, this opens widening gaps between theory and practice: through parallelism, pipelining, memory hierarchies, and so on, real hardware moves ever further away from simple machine models. Applications become ever more complex. At the same time, algorithm theory develops ever more sophisticated algorithms that do contain important ideas but are sometimes hardly implementable. Moreover, real inputs often have little in common with the worst-case scenarios of theoretical analysis. In the extreme case, promising algorithmic approaches are neglected because a complete analysis would be mathematically too difficult.

Since the beginning of the 1990s, a broader view of algorithmics has therefore been gaining importance. It is known as algorithm engineering (AE), and in it design, analysis, implementation, and experimental evaluation of algorithms stand side by side as equals.


Its methodological toolbox, richer than that of algorithm theory, the inclusion of real software, and the closer connection to applications promise more realistic algorithms, the bridging of the gaps that have opened between theory and practice, and a faster transfer of algorithmic know-how into applications. Figure 1.1 shows this view of algorithmics as AE, divided into eight closely interacting activities. The goals and work program of the priority program follow from it naturally: to apply the full power of the AE methodology with the goal of bridging gaps between theory and practice.

1. Study of realistic models for machines and algorithmic problems.

2. Design of algorithms that are simple and also efficient in practice.

3. Analysis of practical algorithms in order to establish performance guarantees that bring theory and practice closer together.

4. Careful implementations that narrow the gaps between the best theoretical algorithm and the best implemented algorithm.

5. Systematic, reproducible experiments that serve to refute or support meaningful, falsifiable hypotheses arising from design, analysis, or earlier experiments. Often this will involve, for example, comparing algorithms whose theoretical analysis leaves too many questions open.

6. Development and maintenance of algorithm libraries that speed up application development and make algorithmic know-how widely available.

7. Collecting large and realistic problem instances and developing benchmarks.

8. Applying algorithmic know-how in concrete applications.

1.2 State of Research and Open Problems

In the following we describe the methodology of AE by means of examples.

Case Study: Route Planning in Road Networks

Everyone knows this increasingly important application: you enter a start and a destination into a navigation system and wait for it to output the fastest route. Here AE has recently developed solutions that compute optimal routes within fractions of a second, whereas commercial solutions, despite considerably longer computation times, have so far been unable to give quality guarantees and are occasionally far off the mark. At first glance, the underlying model is a classical and well-studied problem from


graph theory: shortest paths in graphs. The well-known textbook solution, Dijkstra's algorithm, would however have response times on the order of minutes on a high-performance server and would be hopelessly slow on weaker mobile hardware with limited main memory. Commercial route planners therefore resort to heuristics that achieve acceptable response times but do not always find the best route.

At second glance, a refined problem model suggests itself: one that allows information to be precomputed and then used for many queries. Theory waves this off, proving that in arbitrary graphs only an impractically large precomputed data structure can speed up the computation of fastest routes. Real road graphs, however, have properties that make the precomputation idea practicable. The effectiveness of these approaches depends on hypotheses about properties of road graphs, such as "sufficiently far away from start and destination, the search may be restricted to trans-regional roads" or "roads that lead away from the destination may be ignored". Such intuitive formulations must then be formalized in a way that allows algorithms with performance guarantees to be derived from them. Ultimately, however, these hypotheses can only be checked by experiments with implementations on realistic road graphs. The latter is difficult in practice, since many companies are reluctant to hand data to researchers. A freely available graph of the USA, constructed from data on the web and now to be used for a DIMACS Implementation Challenge on route planning, is therefore especially valuable. The experiments uncover weak spots, which in turn lead to the design of improved algorithms. For example, it turned out that just a few long-distance ferry connections drove the precomputation effort of the first version of the algorithm up enormously.

Despite these successes, many questions remain open. Can the heuristics be analyzed theoretically in order to arrive at more general performance guarantees? How well does the precomputation idea get along with application requirements such as changes to the road network, roadworks, traffic jams, or user-specific objective functions? How can the complex memory hierarchies of mobile devices be taken into account?

Models

An important aspect of AE is machine models. In principle they concern all applications, and they form the interface between algorithmics and the rapid technological development of ever more complex hardware. Because of its great simplicity, the strictly sequential von Neumann machine model with uniform memory is still the basis of most algorithmic work. This is a problem above all when large amounts of data are processed, since memory access times change by many orders of magnitude depending on whether


a processor's fastest cache, main memory, or the hard disk is accessed. In algorithmics, memory hierarchies have so far mostly been restricted to two levels (the I/O model). This model is very successful, and a multitude of results for it are known. Often, however, large gaps remain between the best known algorithms and the implemented methods. Libraries for external memory algorithms such as STXXL promise to improve this situation. Recently, there has also been growing interest in further, still simple, models for processing large amounts of data, e.g., simple models for multi-level memory hierarchies, data stream models in which the data arrive over a network, or sublinear-time algorithms that do not even need to touch all of the data.

For other complex properties of modern processors, such as the replacement mechanisms of hardware caches or branch prediction, only isolated results exist so far.

We expect research on parallel algorithms to undergo a renaissance in the near future, since with the spread of multithreading, multi-core CPUs, and clusters, parallel processing is now entering the mainstream of computing. The traditional "flat" models of parallel processing are of limited use here, however, since alongside the memory hierarchy there is a hierarchy of more or less tightly coupled processing units.

Design

A decisive component of AE is the development of implementable algorithms that can be expected to run efficiently in realistic situations. Ease of implementation means above all simplicity, but also opportunities for code reuse. In algorithm theory, efficient execution means good asymptotic running time and hence good scaling behavior for very large inputs. In AE, however, constant factors and the exploitation of easy problem instances matter as well.

An example: the external memory algorithm for computing minimum spanning trees was the first algorithm to solve a nontrivial graph problem with billions of nodes on a PC. Theoretically it is suboptimal, because it needs a factor O(log(m/M)) more disk accesses than the theoretically best algorithm (where m is the number of edges of the input graph and M is the size of the machine's main memory). On sensibly configured machines, however, it needs, now and for the foreseeable future, at most a third of the disk accesses of the asymptotically best known algorithms. Given an external memory priority queue such as the one in STXXL, the pseudocode of the algorithm fits in twelve lines and the analysis of its expected running time in seven.


Analysis

Even simple algorithms that have proven themselves in practice are often hard to analyze, and this is a main reason for gaps between practice and algorithm theory. The analysis of such algorithms is therefore an important aspect of AE. For example, randomized algorithms are often much simpler and faster than the best known deterministic algorithms, yet even simple randomized algorithms are often hard to analyze.

Many complex optimization problems are solved by metaheuristics such as (randomized) local search or genetic programming. Algorithms designed this way are simple and can be flexibly adapted to the problem at hand. So far, however, only very few of these algorithms have been analyzed, although performance guarantees would be of great theoretical and practical interest.

A famous example of local search is the simplex algorithm for linear programming, perhaps the most important algorithm in mathematical optimization in practice. Simple variants of the simplex algorithm need exponential time on specially constructed inputs. It is conjectured, however, that variants exist which run in polynomial time; in practice, at any rate, a linear number of iterations suffices. So far, only subexponential expected running time bounds are known, and only for impracticable variants. Spielman and Teng were able to show, though, that even small random perturbations of the coefficients of an arbitrary linear program suffice to make the expected running time of the simplex algorithm polynomial. This concept of smoothed analysis is a generalization of average-case analysis and is an interesting AE tool beyond the simplex algorithm as well. For example, Beier and Vöcking showed for an important family of NP-hard problems that their smoothed complexity is polynomial. Among other things, this result explains why the NP-hard knapsack problem can be solved efficiently in practice, and it has also led to improvements in the best codes for knapsack problems. There are also close connections between smoothed complexity, approximation algorithms, and so-called pseudopolynomial algorithms, which are likewise an interesting approach to the practical solution of NP-hard problems.

Implementation

Implementation only seems to be the most clearly prescribed and most boring step in the AE cycle. One reason is the large semantic gaps between abstractly formulated algorithms, imperative programming languages, and real hardware.

An extreme example of this semantic gap are the many geometric algorithms that are designed under the assumption of exact real arithmetic and without explicit


treatment of degenerate cases. The robustness of geometric algorithms can therefore be regarded as a branch of AE in its own right.

Even implementations of relatively simple basic algorithms can be very demanding, for there one often has to compare several candidates whose running times differ only by small constant factors. The only reliable approach is then to tune all contenders to their limits, since even small implementation details can grow into a factor of two in running time. Even a comparison of the generated machine code may be called for in order to settle cases of doubt.

Often it is only the implementation of an algorithm that provides the final evidence of its correctness or of the quality of its results. In geometry and in graph problems, a graphical rendering of the results is usually produced anyway, which makes weaknesses of the algorithm, or even outright errors, immediately visible.

For example, for the embedding of a planar graph, a paper by Hopcroft and Tarjan¹ was cited for 20 years. It contains, however, only a vague description of how a planarity test algorithm can be extended to compute an embedding. Several attempts at a more detailed description were erroneous, which was only noticed when the first correct implementations were produced. For a long time, nobody succeeded in implementing a famous algorithm² for computing triconnected components (an important tool in graph drawing and in signal processing). Only during an implementation effort in the year 2000 were the errors in the algorithm identified and corrected.

There are many interesting algorithms for important problems that have never been implemented: for example, the asymptotically best algorithms for many flow and matching problems, most algorithms for multi-level memory hierarchies (cache-oblivious algorithms), and geometric algorithms that use cuttings or ε-nets.

Experiments

Meaningful experiments are the key to closing the cycle of the AE process. For example, experiments³ on crossing minimization in graph drawing brought a new quality to this field. All previous studies had worked with relatively dense graphs and had shown that the achieved number of crossings came quite close to the respective theoretical upper bounds. The experiments referred to here,

¹J. Hopcroft and R. E. Tarjan: Efficient planarity testing. Journal of the ACM, 21(4):549–568, 1974.
²J. E. Hopcroft and R. E. Tarjan: Dividing a graph into triconnected components. SIAM Journal on Computing, 2(3):135–158, 1973.
³M. Jünger and P. Mutzel: 2-layer straightline crossing minimization: Performance of exact and heuristic algorithms. Journal of Graph Algorithms and Applications (JGAA), 1(1):1–25, 1997.


by contrast, also used optimal algorithms and worked with the sparse graphs that matter in practice. It turned out that the results of some heuristics exceed the optimal crossing number by a large factor. This paper has meanwhile become one of the most cited works in the area of graph drawing.

Experiments can also have a decisive influence on algorithm analysis. The reconstruction of a curve from a set of sample points is the most basic variant of an important family of image processing problems. A paper by Althaus and Mehlhorn⁴ investigates a seemingly rather expensive method based on the traveling salesman problem. In experiments it turned out that "reasonable" inputs lead to easily solvable instances of the traveling salesman problem. This observation was subsequently formalized and proven.

Compared to the natural sciences, AE is in the privileged position of being able to perform many experiments quickly and comparatively cheaply. The flip side of the coin, however, is that planning, evaluating, archiving, postprocessing, and interpreting these results is highly nontrivial. The starting point should be falsifiable hypotheses about the behavior of the algorithms under investigation, obtained from design, analysis, implementation, or earlier experiments. The outcome is a refutation, confirmation, or refinement of these hypotheses. Complementing provable performance guarantees, such hypotheses not only lead to a better understanding of the algorithms but also supply ideas for better algorithms, more precise analysis, or more efficient implementation.

Successful experimentation has a lot to do with software engineering. A modular structure of the implementations enables flexible experiments. Skillful use of tools simplifies the evaluation. Careful documentation and version management facilitate reproducibility, a central requirement of scientific experiments that is a major challenge given the rapid model changes of software and hardware.

Problem Instances

Collections of realistic problem instances for benchmarking have proven to be crucial for the further development of algorithms. For example, there are interesting collections for several NP-hard problems such as the traveling salesman problem, the Steiner tree problem, satisfiability, set covering, and graph partitioning. For the first two problems in particular, this has led to astonishing breakthroughs: with the help of deep mathematical insights into the structure of these problems, even large, realistic instances of the traveling salesman problem and the Steiner tree problem can be solved exactly.

⁴E. Althaus and K. Mehlhorn: Traveling salesman-based curve reconstruction in polynomial time. SIAM Journal on Computing, 31(1):27–66, 2002.



Curiously, realistic problem instances for polynomially solvable problems are much harder to come by. For example, there are dozens of practical applications of maximum flow computations, yet algorithm development has so far had to make do with synthetic instances.

Applications

Algorithmics plays a key role in the development of innovative IT applications, and accordingly application-oriented AE projects of all kinds are a very important part of the priority program. Here we name only a few grand challenge applications in which algorithmics could play an important role and which have particular potential to strongly influence science, technology, business, or daily life.⁵

Bioinformatics Besides the aforementioned problem of genome sequencing, microbiology holds many more algorithmic challenges: computing the tertiary structure of proteins; algorithms for computing phylogenetic trees of species; data mining in the gene activation data that are being collected on a large scale with DNA chips ... These problems can only be solved in close cooperation with molecular biologists or chemists.

Information Retrieval The indexing and ranking algorithms of Internet search engines mentioned at the beginning are very successful but still leave much to be desired. Many heuristics are barely published, let alone equipped with performance guarantees. Only smaller systems so far make serious attempts to support similarity search, and an arms race is emerging between ranking algorithms and spammers who try to deceive them.

Traffic Planning The use of algorithms in traffic planning has only just begun. Apart from individual applications in air traffic, where problem instances are relatively small and the savings potential is large, these applications are limited to relatively simple, isolated areas: data acquisition (networks, road categories, travel times), monitoring and partial control (Toll Collect, railway control centers),

⁵This notwithstanding, no single project is of course expected to achieve the breakthrough on a grand challenge by itself, and many projects will deal with less spectacular but equally interesting applications.


forecasting (simulation, prediction models), and simple user support (route planning, timetable queries). AE can contribute substantially to developing and integrating these various aspects further, towards powerful algorithms for better planning and control of our traffic systems (control via tolls, timetable optimization, line planning, vehicle and crew scheduling). Particular challenges here are the very complex application models and the resulting huge problem sizes.

Geographic Information Systems Modern earth observation satellites and other data sources now produce many terabytes of information per day, promising important applications in agriculture, environmental protection, disaster management, tourism, and so on. Processing such enormous amounts of data effectively is a real challenge in which know-how from geometric algorithms, parallel processing, and memory hierarchies, together with AE on real input data, will play an important role.

Communication Networks As networks become ever more versatile and larger, the need for efficient methods for organizing them grows. Of particular interest are mobile, ad-hoc, and sensor networks, as well as peer-to-peer networks and the coordination of competing agents by game-theoretic techniques. What all these novel applications have in common is that they must do without central planning and organization.

Many of the questions investigated here can be called not-yet applications. From the AE perspective it is particularly interesting that even practical work in this area lacks reliable data about the size and peculiarities of the eventual application scenario. On the one hand, this creates an even greater need for provable performance guarantees. On the other hand, the models of many theoretical works in this field are even further removed from reality than usual.

Planning Problems Schedules in production and logistics keep getting tighter, and the need for algorithmic support and optimization keeps growing. First algorithmic approaches are provided by online algorithms (dial-a-ride, scheduling) and flows over time (routing with time windows, dynamic flows). This development, however, is only in its infancy. For meaningful quality statements about online algorithms, competitive analysis in particular needs to be reconsidered, as it is oriented too strongly towards coarse worst-case behavior. Flows over time call for better techniques for handling the dimension of time as efficiently as possible algorithmically.


Chapter 2

Data Structures

Most material in this chapter was taken from a not yet published book manuscript by Peter Sanders and Kurt Mehlhorn. Some parts on external data structures were presented in [7]. Notice that during the lecture, the latter topics were covered in the talk on external algorithms, not in the introduction on data structures. If you are unfamiliar with external memory models, please read the introduction in Section 5.2 or the short overview in Appendix 12.1.

2.1 Arrays & Lists

For starters, we will study how algorithm engineering can be applied to the (apparently?) easy field of sequence data structures.

Bounded Arrays: Usually the most basic, built-in sequence data structure in programming languages. They offer constant running time for [·] (indexed access) and for the popBack and pushBack operations, which remove or add an element behind the currently last entry. Their major drawback is that their size has to be known in advance in order to reserve enough memory.

Unbounded Arrays: To bypass this often inconvenient restriction, unbounded arrays were introduced (std::vector from the C++ STL is an example). They are implemented on top of a bounded array. If this array runs out of space for new elements, a new array of double size is allocated and the old content is copied over. If the fill degree drops to a quarter due to pop operations, the array is replaced by a new one of half the size. Implemented this way, pushBack and popBack have amortized cost O(1); a proof is given in Appendix 12.2. Note that it is not possible to shrink the array already when it is half full, since repeated insertions and deletions at that point would lead to costs of O(n) for a single operation.
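To make the reallocation policy concrete, here is a minimal C++ sketch (the class name UArray and all details are ours, not the STL's; copy control and error handling are omitted):

#include <cstddef>
#include <utility>

// Minimal unbounded array: grows by doubling, shrinks to half the capacity
// when only a quarter of the slots are in use.
template <typename T>
class UArray {
    T* data_;
    std::size_t size_ = 0, capacity_ = 0;

    void reallocate(std::size_t newCap) {
        T* fresh = new T[newCap];
        for (std::size_t i = 0; i < size_; ++i) fresh[i] = std::move(data_[i]);
        delete[] data_;
        data_ = fresh;
        capacity_ = newCap;
    }

public:
    UArray() : data_(nullptr) { reallocate(4); }   // small initial capacity
    ~UArray() { delete[] data_; }

    T& operator[](std::size_t i) { return data_[i]; }
    std::size_t size() const { return size_; }

    void pushBack(const T& x) {
        if (size_ == capacity_) reallocate(2 * capacity_);  // full: double
        data_[size_++] = x;
    }

    void popBack() {                                // precondition: size() > 0
        --size_;
        if (capacity_ > 4 && 4 * size_ <= capacity_)
            reallocate(capacity_ / 2);              // quarter full: halve
    }
};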

Doubly Linked Lists¹: Figure 2.1 shows the basic building block of a linked list: a list item (the "link" of a chain) stores one element and pointers to its successor and predecessor.

¹Sometimes singly linked lists (maintaining only a successor pointer) are sufficient and more space efficient. Since they have non-intuitive semantics for some operations and are less versatile, we focus on doubly linked lists.


Class Item of Element
  e : Element
  next : Handle
  prev : Handle
  invariant next→prev = prev→next = this

Figure 2.1: Prototype of a list item in a doubly linked list

Figure 2.2: Structure of a doubly linked list

This sounds simple enough, but pointers are so powerful that we can make a big mess if we are not careful. What makes a consistent list data structure? We make a simple and innocent looking decision, and the basic design of our list data structure follows from it: the successor of the predecessor of an item must be the original item, and the same holds for the predecessor of a successor. If all items fulfill this invariant, they form a collection of cyclic chains. This may look strange, since we want to represent sequences rather than loops: sequences have a start and an end, whereas loops have neither. Most implementations of linked lists therefore go a different way and treat the first and last item of a list differently. Unfortunately, this makes the implementation of lists more complicated, more error-prone, and somewhat slower. Therefore, we stick to the simple cyclic internal representation.

For conciseness, we implement all basic list operations in terms of the single operation splice depicted in Figure 2.3. splice cuts a sublist out of one list and inserts it after some target item. The target can be either in the same list or in a different list, but it must not be inside the sublist. splice can easily be specialized to common methods like insert, delete, ...

Since splice never changes the number of items in the system, we assume that thereis one special list freeList that keeps a supply of unused elements. When inserting newelements into a list, we take the necessary items from freeList and when deleting elements



we return the corresponding items to freeList. The function checkFreeList allocates memory for new items when necessary. A freeList is not only useful for the splice operation; it also simplifies our memory management, which could otherwise easily take 90% of the work, since a malloc would be necessary for every element inserted². It remains to decide how to simulate the start and end of a list. The class List in Figure 2.2 introduces a dummy item h that does not store any element but separates the first element from the last element in the cycle formed by the list. By definition of Item, h points to the first "proper" item as its successor and to the last item as its predecessor. In addition, a handle head pointing to h can be used to encode a position before the first element or after the last element. Note that there are n+1 possible positions for inserting an element into a list with n elements, so an additional item is hard to circumvent if we want to code handles as pointers to items. With these conventions in place, a large number of useful operations can be implemented as one-line functions that all run in constant time. Thanks to the power of splice, we can even manipulate arbitrarily long sublists in constant time. The dummy header is also useful for other operations. For example, consider finding the next occurrence of x starting at item from; if x is not present, head should be returned. We use the header as a sentinel. A sentinel is a dummy element in a data structure that makes sure that some loop terminates. By storing the key we are looking for in the header, we make sure that the search terminates even if x is not originally present in the list. This trick saves an additional test per iteration for whether the end of the list has been reached (see the sketch below). A drawback of dummy headers is the additional space they require. This seems negligible for most applications but may be costly when there are many nearly empty lists; this is a typical scenario for hash tables that use chaining to resolve collisions.
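The search routine alluded to above did not survive the conversion of these notes, so here is a small C++ sketch of the sentinel trick (the struct mirrors Figure 2.1; int elements and all names are our simplification):

// A list item as in Figure 2.1; 'head' points to the dummy item h.
struct Item {
    int   e;       // element
    Item* next;    // successor in the cyclic chain
    Item* prev;    // predecessor in the cyclic chain
};

// Find the next occurrence of x starting at 'from'; return head if absent.
Item* findNext(Item* head, Item* from, int x) {
    head->e = x;              // sentinel: the loop is guaranteed to terminate
    while (from->e != x)      // no end-of-list test needed in each iteration
        from = from->next;
    return from;              // a proper item holding x, or the dummy head
}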

2.2 External Lists

The direct implementation of a linked list in an external memory model incurs a cost of one I/O for following a link, which leads to Θ(N) I/Os for traversing N elements. This is caused by the high degree of freedom in the allocation of list elements within memory³. A first idea to improve this is to introduce locality by storing B consecutive elements together. Traversal then costs only N/B = O(scan(N)) I/Os, but an insertion or deletion can cost Θ(N/B) I/Os for moving all following elements. We therefore relax the invariant: every pair of consecutive blocks must together contain ≥ 2B/3 elements. Traversal is still possible with ≤ 3N/B = O(scan(N)) I/Os. For an insertion into block i, we have to distinguish two cases: if block i has space, we pay 1 I/O and are done. If it is full but a neighbor has space, we push an element to it at a cost of O(1) I/Os.

²Another countermeasure to allocation overhead is to schedule many insertions at the same time, resulting in only one malloc and possibly fewer cache faults, as many items then reside in the same memory block.

³A faster traversal is possible if we use list ranking (see Section 5.9) as preprocessing, which can be done in O(sort(N)) I/Os. Sorting with respect to each element's rank (distance from the last node) then gives a scannable representation of the list.


//Remove 〈a, . . . , b〉 from its current list and insert it after t
//(. . . , a′, a, . . . , b, b′, . . . , t, t′, . . .) ↦ (. . . , a′, b′, . . . , t, a, . . . , b, t′, . . .)
Procedure splice(a, b, t : Handle)
  assert b is not before a ∧ t ∉ 〈a, . . . , b〉
  //cut out 〈a, . . . , b〉
  a′ := a→prev
  b′ := b→next
  a′→next := b′
  b′→prev := a′
  //insert 〈a, . . . , b〉 after t
  t′ := t→next
  b→next := t′
  a→prev := t
  t→next := a
  t′→prev := b

Figure 2.3: The splice method
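For comparison, a compact C++ version of splice on the Item struct from the sketch in Section 2.1 (our code, not part of the lecture):

// Remove <a,...,b> from its current list and insert it after t.
// Precondition: b is not before a, and t is not inside <a,...,b>.
void splice(Item* a, Item* b, Item* t) {
    Item* ap = a->prev;        // cut out <a,...,b>
    Item* bp = b->next;
    ap->next = bp;
    bp->prev = ap;
    Item* tp = t->next;        // insert <a,...,b> after t
    b->next = tp;
    a->prev = t;
    t->next = a;
    tp->prev = b;
}

Note that tp (t′ in the pseudocode) must be saved before t→next is overwritten; the remaining assignments only reorder pointers within their half of the operation.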

Figure 2.4: The direct implementation of linked lists is not suited for external memory.

If both neighbors are full, we split block i into two blocks of ≈ B/2 elements, at an amortized cost of O(1) I/Os (at least B/6 deletions are needed before the invariant can be violated again). For a deletion from block i: if blocks i and i+1, or blocks i and i−1, together hold ≤ 2B/3 elements, we merge the two blocks, again at an amortized cost of O(1) I/Os.

Figure 2.5: First approach: block B consecutive list elements together


Figure 2.6: Second approach: every pair of consecutive blocks holds ≥ 2B/3 elements

                 S-List        B-Array       U-Array
dynamic          +             −             +
space wasting    pointers      too large?    too large? set free?
time wasting     cache miss    +             resizing
worst case time  (+)           +             −

Table 2.1: Pros and cons for implementation variants of a stack

2.3 Stacks, Queues & Variants

We now want to use these general sequence types to implement another important data structure: a stack with operations push (insert at the end of the sequence) and pop (return and remove the last element), both of which we want to implement with constant cost. Let us examine the alternatives:

A bounded array is only feasible if you can give a tight limit on the number of inserted elements; otherwise, you have to allocate a lot of memory in advance to avoid running out of space. A linked list comes with nontrivial memory management and many cache faults (when every successor lies in a different memory block). An unbounded array has no constant cost guarantee for a single operation and can consume up to twice the actually required space. So none of the basic data structures comes without major drawbacks. For an optimal solution, we need to take a hybrid approach:

A hybrid stack (Figure 2.7) is a linked list containing bounded arrays of size B. When the current array is full, another one is allocated and linked.

Figure 2.7: A hybrid stack


Figure 2.8: A variant of the hybrid stack: a directory pointing to the blocks of elements

We now have a dynamic data structure with (small) constant worst-case access time⁴ at the back pointer. We give up at most n/B + B of wasted space (for pointers and one empty block). This is minimized for B = Θ(√n).
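Balancing the two waste terms makes this choice of B plausible; a one-line check in LaTeX notation:

\min_B \left( \frac{n}{B} + B \right) \text{ is attained at } B = \sqrt{n}, \text{ where } \frac{n}{B} + B = \frac{n}{\sqrt{n}} + \sqrt{n} = 2\sqrt{n} = \Theta(\sqrt{n}).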

A variant of this stack works as follows: instead of having each block maintain a pointer to its successor, we keep these pointers in a directory (implemented as an unbounded array). Together with two additional references to the current directory entry and the current position in the last block, we obtain the functionality of a stack. Additionally, it is now easy to implement [·] in constant time using integer division and modulo arithmetic. The drawback of this approach is a non-constant worst-case insertion time (although we still have constant amortized cost).
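A C++ sketch of this directory variant (our naming; since the directory itself is an unbounded array, a pushBack occasionally triggers a directory resize, which is exactly the non-constant worst case mentioned above):

#include <cstddef>
#include <vector>

// Directory-based stack: dir holds pointers to fixed-size blocks of B
// elements; indexed access costs O(1) via integer division and modulo.
template <typename T, std::size_t B>
class DirStack {
    std::vector<T*> dir;    // the directory (an unbounded array)
    std::size_t n = 0;      // number of stored elements

public:
    void pushBack(const T& x) {
        if (n % B == 0) dir.push_back(new T[B]);  // may resize the directory
        dir[n / B][n % B] = x;
        ++n;
    }

    void popBack() {                              // precondition: n > 0
        --n;
        if (n % B == 0) { delete[] dir.back(); dir.pop_back(); }
    }

    // Constant-time random access: block number, then offset in the block.
    T& operator[](std::size_t i) { return dir[i / B][i % B]; }

    std::size_t size() const { return n; }
    ~DirStack() { for (T* block : dir) delete[] block; }
};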

There are further specialized data structures that can be useful for certain algorithms: a FIFO queue allows insertion at one end and extraction at the other. FIFO queues are easy to implement with singly linked lists carrying a pointer to the last element. For bounded queues, we can also use cyclic arrays, where entry zero is the successor of the last entry. It then suffices to maintain two indices h and t delimiting the range of valid queue entries. These indices travel around the cycle as elements are queued and dequeued. The cyclic semantics of the indices can be implemented using arithmetic modulo the array size.⁵

Our implementation always leaves one entry of the array empty, because otherwise it would be difficult to distinguish a full queue from an empty one. Bounded queues can be made unbounded using techniques similar to those for unbounded arrays.
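A minimal C++ sketch of the bounded cyclic-array FIFO (our naming; note the one deliberately unused slot that distinguishes a full queue from an empty one):

#include <cstddef>

// Bounded FIFO on a cyclic array; h is the next element to pop, t the next
// free slot. One slot stays empty so that h == t always means "empty".
template <typename T, std::size_t N>
class BoundedFIFO {
    T b[N + 1];                  // one extra, deliberately unused slot
    std::size_t h = 0, t = 0;

public:
    bool empty() const { return h == t; }
    bool full()  const { return (t + 1) % (N + 1) == h; }

    bool pushBack(const T& x) {
        if (full()) return false;
        b[t] = x;
        t = (t + 1) % (N + 1);   // indices travel around the cycle
        return true;
    }

    bool popFront(T& out) {
        if (empty()) return false;
        out = b[h];
        h = (h + 1) % (N + 1);
        return true;
    }
};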

Finally, deques, which allow read/write access at both ends, cannot be implemented efficiently using singly linked lists. But the array-based FIFO from Figure 2.9 is easy to generalize. Circular arrays can also support access via [·], interpreting [i] as [(i + h) mod n].

With techniques from both the hybrid stack variant and the cyclic FIFO queue, we can derive a data structure with constant cost for random accesses and cost O(√n) for insertion and deletion at arbitrary positions.

⁴Although inserting at the end of the current array is still more costly.
⁵On some machines one might get significant speedups by choosing the array size as a power of two and replacing mod by bit operations.


Figure 2.9: A bounded FIFO queue implemented as a cyclic array b of size n, with indices h and t delimiting the valid entries

Instead of bounded arrays, we have our directory point to cyclic arrays. Random access works as above. For an insertion at a random location, we shift the elements in the corresponding cyclic array that follow the new element's position. If the array was full, there is no room for the last element, so it is propagated to the next cyclic array. There, it replaces the last element (which can travel further in the same way), and the indices are rotated by one, giving the new element index 0. In the worst case, we have B elements to move in the first array and constant-time operations for the other n/B subarrays. This is again minimized for B = Θ(√n).

Another specialized variant we can develop is an I/O-efficient stack⁶: we use two buffers of size B in main memory and a pointer to the end of the stack. When both buffers are full, we write the one containing the older elements to disk and use the freed room for new insertions. When both buffers run empty, we refill one with a block from disk. This leads to amortized I/O costs of O(1/B) per operation. Note that a single buffer is not sufficient: a sequence of B insertions followed by alternating insertions and deletions would incur one I/O per operation.
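A schematic C++ sketch of the two-buffer scheme (the vector-of-blocks "disk" is a stand-in for real block I/O, not an actual external memory interface):

#include <cstddef>
#include <vector>

// Two adjacent in-memory buffers of B elements each. When both are full, the
// buffer holding the OLDER elements is spilled; when both are empty, one
// block is read back. Each I/O moves B elements, so the amortized cost is
// O(1/B) I/Os per stack operation.
template <typename T, std::size_t B>
class ExternalStack {
    T buf[2 * B];
    std::size_t top = 0;               // elements currently in memory
    std::vector<std::vector<T>> disk;  // stand-in for block I/O

public:
    void push(const T& x) {
        if (top == 2 * B) {                       // both buffers full
            disk.emplace_back(buf, buf + B);      // spill the older buffer
            for (std::size_t i = 0; i < B; ++i)   // newer buffer slides down,
                buf[i] = buf[B + i];              // freeing room for new pushes
            top = B;
        }
        buf[top++] = x;
    }

    T pop() {                                     // precondition: not empty
        if (top == 0) {                           // both buffers empty
            for (std::size_t i = 0; i < B; ++i) buf[i] = disk.back()[i];
            disk.pop_back();                      // refill one buffer
            top = B;
        }
        return buf[--top];
    }
};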

Table 2.2 summarizes some of the results of this chapter by comparing the running times of common operations on the presented data structures. Predictably, arrays are better at indexed access, whereas linked lists have their strengths in sequence manipulation at arbitrary positions. However, both basic approaches can implement the special operations needed for stacks and queues roughly equally well. Where both approaches work, arrays are more cache efficient, whereas linked lists provide worst-case performance guarantees. This is particularly true for all kinds of operations that scan through the sequence; findNext is only one example.

⁶Appendix 12.1 gives an introduction to our external memory model.


Operation     List  UArray  hybr. Stack  hybr. Array  cycl. Array  explanation of '∗'
[·]           n     1       √n           1            1
| · |         1∗    1       1            1            1            not with inter-list splice
first         1     1       1            1            1
last          1     1       1            1            1
insert        1     n       n            √n           n
remove        1     n       n            √n           n
pushBack      1     1∗      1            1            1∗           amortized
pushFront     1     n       n            √n           1∗           amortized
popBack       1     1∗      1            1            1∗           amortized
popFront      1     n       n            √n           1∗           amortized
concat        1     n       n            n            n
splice        1     n       n            n            n
findNext,...  n     n∗      n∗           n∗           n∗           cache efficient

Table 2.2: Running times of operations on sequences with n elements. Entries have an implicit O(·) around them.


Chapter 3

Sorting

The findings on how branch mispredictions affect quicksort are taken from [1]. Super Scalar Sample Sort is described in [2], Multiway Merge Sort is covered in [3], and the analysis of the duality between prefetching and buffered writing is from [4].

3.1 Quicksort Basics

Sorting is one of the most important algorithmic problems, both practically and theoretically. Quicksort is perhaps the most frequently used sorting algorithm, since it is very fast in practice, needs almost no additional memory, and makes no assumptions about the distribution of the input.

Function quickSort(s : Sequence of Element) : Sequence of Element
  if |s| ≤ 1 then return s                   // base case
  pick p ∈ s uniformly at random             // pivot key
  a := 〈e ∈ s : e < p〉                       // (A)
  b := 〈e ∈ s : e = p〉                       // (B)
  c := 〈e ∈ s : e > p〉                       // (C)
  return concatenation of quickSort(a), b, and quickSort(c)

Figure 3.1: Quicksort (high-level implementation)
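A direct C++ transcription of Figure 3.1 (our code; it keeps the three-way split at the price of extra memory, unlike the in-place refinement of Section 3.2):

#include <random>
#include <vector>

// High-level quicksort: three-way split around a random pivot, then recurse
// on the strictly smaller and strictly larger parts.
std::vector<int> quickSort(std::vector<int> s) {
    if (s.size() <= 1) return s;                          // base case
    static std::mt19937 rng{std::random_device{}()};
    std::uniform_int_distribution<std::size_t> pick(0, s.size() - 1);
    const int p = s[pick(rng)];                           // pivot key

    std::vector<int> a, b, c;                             // < p, == p, > p
    for (int e : s) (e < p ? a : e == p ? b : c).push_back(e);

    a = quickSort(std::move(a));                          // sort the parts
    c = quickSort(std::move(c));
    a.insert(a.end(), b.begin(), b.end());                // concatenate a, b, c
    a.insert(a.end(), c.begin(), c.end());
    return a;
}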

Analysis shows that quicksort with randomly picked pivots performs an expected number of ≈ 1.4 n log n comparisons¹. A proof of this bound is given in Appendix 12.3.

¹With other strategies for selecting a pivot, better constant factors can be achieved: e.g., "median of three" reduces the expected number of comparisons to ≈ 1.2 n log n.
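For n distinct elements, the bound behind the constant 1.4 can be written out as follows (a standard calculation; the details are in Appendix 12.3):

E[\text{comparisons}] \le 2n \ln n = (2 \ln 2)\, n \log_2 n \approx 1.39\, n \log_2 n .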


Figure 3.2: Execution of both the high-level and the refined version of quickSort (Figure 3.1 and Figure 3.3) on 〈3, 6, 8, 1, 0, 7, 2, 4, 5, 9〉, using the first element of a subsequence as the pivot. The right block shows the first execution of the repeat loop for partitioning the input in qSort.

The worst case occurs if all elements are different and we are always so unlucky as to pick the largest or smallest element as the pivot; it results in Θ(n²) comparisons. As the number of executed instructions and cache faults is proportional to the number of comparisons, the comparison count is (at least in theory) a good measure of the total running time of quicksort.

3.2 Refined Quicksort

Figure 3.3 gives pseudocode for an array-based quicksort that works in place and uses several implementation tricks that make it faster and very space efficient.

To make a recursive algorithm compatible with the requirement of in-place sorting of an array, quicksort is called with a reference to the array and the range of array indices to be sorted. Very small subproblems, with size up to n₀, are sorted faster using a simple algorithm like insertion sort². The best choice for the constant n₀ depends on many details of the machine and the compiler; usually one should expect values around 10–40. An efficient implementation of insertion sort is given in Appendix 12.4.

The pivot element is chosen by a function pickPivotPos that we have not specified here. The idea is to find a pivot that splits the input more accurately than just choosing a random element. A method frequently used in practice chooses the median (‘middle’) of three elements. An even better method would choose the exact median of a random sample of elements.

The repeat-until loop partitions the subarray into two smaller subarrays. Elements

²Some books propose to leave small pieces unsorted and clean up at the end using a single insertion sort that will be fast as the sequence is already almost sorted. Although this nice trick reduces the number of instructions executed by the processor, our solution is faster on modern machines because the subarray to be sorted will already be in cache.


//Sort the subarray a[ℓ..r]
Procedure qSort(a : Array of Element; ℓ, r : N)
  while r − ℓ ≥ n0 do                           // Use divide-and-conquer
    j := pickPivotPos(a, ℓ, r)
    swap(a[ℓ], a[j])                            // Helps to establish the invariant
    p := a[ℓ]
    i := ℓ; j := r
    repeat                                      // a: ℓ … i→ … ←j … r
      invariant 1: ∀i′ ∈ ℓ..i−1 : a[i′] ≤ p
      invariant 2: ∀j′ ∈ j+1..r : a[j′] ≥ p
      invariant 3: ∃i′ ∈ i..r  : a[i′] ≥ p
      invariant 4: ∃j′ ∈ ℓ..j  : a[j′] ≤ p
      while a[i] < p do i++                     // Scan over elements (A)
      while a[j] > p do j−−                     // on the correct side (B)
      if i ≤ j then swap(a[i], a[j]); i++; j−−
    until i > j                                 // Done partitioning
    if i < (ℓ + r)/2 then qSort(a, ℓ, j); ℓ := i
    else qSort(a, i, r); r := j
  insertionSort(a[ℓ..r])                        // faster for small r − ℓ

Figure 3.3: Refined quicksort


equal to the pivot can end up on either side or between the two subarrays. Since quicksort spends most of its time in this partitioning loop, its implementation details are important. Index variable i scans the input from left to right and j scans from right to left. The key invariant is that elements left of i are no larger than the pivot whereas elements right of j are no smaller than the pivot. Loops (A) and (B) scan over elements that already satisfy this invariant. When a[i] ≥ p and a[j] ≤ p, scanning can be continued after swapping these two elements. Once indices i and j meet, the partitioning is completed. Now, a[ℓ..j] represents the left partition and a[i..r] represents the right partition. This sounds simple enough, but for a correct and fast implementation some subtleties come into play.

To ensure termination, we verify that no single piece represents all of a[ℓ..r] even if p is the smallest or largest array element. So, suppose p is the smallest element. Then loop A first stops at i = ℓ; loop B stops at the last occurrence of p. Then a[i] and a[j] are swapped (even if i = j) and i is incremented. Since i is never decremented, the right partition a[i..r] will not represent the entire subarray a[ℓ..r]. The case that p is the largest element can be handled using a symmetric argument.

The scanning loops A and B are very fast because they make only a single test. At first glance, that looks dangerous. For example, index i could run beyond the right boundary r if all elements in a[i..r] were smaller than the pivot. But this cannot happen. Initially, the pivot is in a[i..r] and serves as a sentinel that can stop scanning loop A. Later, the elements swapped to the right are large enough to play the role of a sentinel. Invariant 3 expresses this requirement, which ensures termination of scanning loop A. Symmetric arguments apply for Invariant 4 and scanning loop B.

Our array quicksort handles recursion in a seemingly strange way; it is something like “semi-recursive”. The smaller partition is sorted recursively, while the larger partition is sorted iteratively by adjusting ℓ and r. This measure ensures that the recursion can never go deeper than ⌈log(n/n0)⌉ levels. Hence, the space needed for the recursion stack is O(log n). Note that a completely recursive algorithm could reach a recursion depth of n − 1, so the space needed for the recursion stack could be considerably larger than for the input array itself.
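The control structure of this semi-recursion can be isolated in a few lines of C++ (a sketch under our own naming; partition and insertionSort are hypothetical helpers standing in for the corresponding parts of Figure 3.3):

// Assumed helpers, not given here:
void partition(int* a, int l, int r, int& i, int& j);  // yields pieces a[l..j], a[i..r]
void insertionSort(int* a, int l, int r);              // sorts small ranges

// Semi-recursive control: recurse on the smaller piece, iterate on the
// larger one, bounding the stack depth by O(log n).
void qSort(int* a, int l, int r) {
    const int n0 = 16;                  // tuning parameter, typically 10-40
    while (r - l >= n0) {
        int i, j;
        partition(a, l, r, i, j);
        if (i < (l + r) / 2) {          // left piece is the smaller one
            qSort(a, l, j);             // sort it recursively ...
            l = i;                      // ... and iterate on the right piece
        } else {
            qSort(a, i, r);             // right piece is smaller
            r = j;
        }
    }
    insertionSort(a, l, r);             // finish small pieces
}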

3.3 Lessons from experiments

We now run Quicksort on real machines to check whether it behaves differently than our analysis on the RAM model predicted. We will see that modern hardware architecture can influence the runtime, and we try to find algorithmic solutions to these problems.

In the analysis, we saw that the number of comparisons determines the runtime of Quicksort. On a real machine, a comparison and the corresponding if-clause are mapped to a branch instruction. In modern processors with long execution pipelines and superscalar execution, dozens of subsequent instructions are executed in parallel to achieve a high


[Plot: seconds per n lg n (in nanoseconds) versus n for random pivot, median of 3, exact median, skewed pivot n/10, and skewed pivot n/11]

Figure 3.4: Runtime for Quicksort using different strategies for pivot selection

peak throughput. To keep the pipeline filled, the outcome of each branch is predicted by the hardware (based on several possible heuristics). When a branch is mispredicted, much of the work already done on the instructions following the predicted branch direction turns out to be wasted. Therefore, ingenious and very successful schemes have been devised to accurately predict the direction a branch takes. Unfortunately, we are facing a dilemma here. Information theory tells us that the optimal number of n log n element comparisons for sorting can only be achieved if each element comparison yields one bit of information, i.e., there is a 50% chance for the branch to take either direction. In this situation, even the most clever branch prediction algorithm is helpless. A painfully large number of branch mispredictions seems to be unavoidable.

Figure 3.4 compares the runtime of Quicksort implementations using different strategies of selecting a pivot. Together with standard techniques (random, median of three, . . . ), α-skewed pivots are used, i.e., pivots which have a rank of αn. Theory suggests large constant factors in execution time for these strategies with α ≠ 1/2 compared to a perfect median. In practice, Figure 3.4 shows that these implementations are actually faster than those that use an (approximated) median as pivot.

An explanation for this can be found in Figure 3.5: A pivot with rank close to n/2 produces many more branch mispredictions than a pivot that separates the sequence in two parts of very different sizes. The costs to flush the entire instruction pipeline outweigh


[Plot: branch misses per n lg n versus n for random pivot, median of 3, exact median, skewed pivot n/10, and skewed pivot n/11]

Figure 3.5: Number of branch mispredictions for Quicksort using different strategies for pivot selection


the fewer partition steps of these variants.

3.4 Super Scalar Sample Sort

We now study a sorting algorithm which is aware of hardware phenomena like branch mispredictions or superscalar execution. This algorithm, called Super Scalar Sample Sort (SSSS), is an engineered version of Sample Sort, which in turn is a generalization of Quicksort.

Function sampleSort(e = 〈e_1, . . . , e_n〉 : Sequence of Element, k : Z) : Sequence of Element
  if n/k is “small” then return smallSort(e)        // base case, e.g. quicksort
  let 〈S_1, . . . , S_{ak−1}〉 denote a random sample of e
  sort S             // or at least locate the elements whose rank is a multiple of a
  〈s_0, s_1, s_2, . . . , s_{k−1}, s_k〉 := 〈−∞, S_a, S_{2a}, . . . , S_{(k−1)a}, ∞〉   // determine splitters
  for i := 1 to n do
    find j ∈ {1, . . . , k} such that s_{j−1} < e_i ≤ s_j
    place e_i in bucket b_j
  return concatenate(sampleSort(b_1), . . . , sampleSort(b_k))

Figure 3.6: Standard Sample Sort

Our starting point is ordinary sample sort. Fig. 3.6 gives high-level pseudocode. Small inputs are sorted using some other algorithm like quicksort. For larger inputs, we first take a sample of s = ak randomly chosen elements. The oversampling factor a allows a flexible tradeoff between the overhead for handling the sample and the accuracy of splitting. Our splitters are those elements whose rank in the sample is a multiple of a. Now each input element is located in the splitters and placed into the corresponding bucket. The buckets are sorted recursively and their concatenation is the sorted output. A first advantage of Sample Sort over Quicksort is its number of log_k n recursion levels, which is by a factor log₂ k smaller than the recursion depth log₂ n of Quicksort. Every element is moved once during each level, resulting in fewer cache faults for Sample Sort. However, this alone does not resolve the central issue of branch mispredictions and only comes to bear for very large inputs.

SSSS is an implementation strategy for the basic sample sort algorithm. All sequences are represented as arrays. More precisely, we need two arrays of size n: one for the original input and one for temporary storage. The flow of data between these two arrays alternates in different levels of recursion. If the number of recursion levels is odd, a final copy operation makes sure that the output is in the same place as the input. Using an array


Figure 3.7: Two-pass element distribution in Super Scalar Sample Sort

of size n to accommodate all buckets means that we need to know exactly how big each bucket is. In radix sort implementations this is done by locating each element twice. But this would be prohibitive in a comparison based algorithm. Therefore we use an additional auxiliary array o of n oracles – o(i) stores the bucket index for element e_i. A first pass computes the oracles and the bucket sizes. A second pass reads the elements again and places element e_i into bucket b_o(i). This two-pass approach incurs costs in space and time. However, these costs are rather small, since bytes suffice for the oracles and the additional memory accesses are sequential and thus can almost completely be hidden via software or hardware prefetching³. In exchange we get simplified memory management with no need to test for bucket overflows. Perhaps more importantly, decoupling the expensive tasks of finding buckets and distributing elements to buckets facilitates software pipelining by the compiler and prevents cache interferences of the two parts. This optimization is also known as loop distribution.
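A C++ sketch of this two-pass distribution (our own illustration; findBucket stands in for the splitter search of Figure 3.8, and tmp plays the role of the second array of size n):

#include <cstdint>
#include <vector>

// Pass 1 computes one byte-sized oracle per element plus the bucket sizes;
// pass 2 places each element using only the oracle, without re-searching.
void distribute(const std::vector<int>& a, std::vector<int>& tmp,
                int k, int (*findBucket)(int)) {
    const std::size_t n = a.size();           // tmp must also have size n
    std::vector<std::uint8_t> oracle(n);      // one byte per oracle => k <= 256
    std::vector<std::size_t> bucketSize(k, 0);
    for (std::size_t i = 0; i < n; ++i) {     // pass 1: find buckets, count sizes
        int j = findBucket(a[i]);
        oracle[i] = static_cast<std::uint8_t>(j);
        ++bucketSize[j];
    }
    std::vector<std::size_t> start(k, 0);     // prefix sums = bucket boundaries
    for (int j = 1; j < k; ++j)
        start[j] = start[j - 1] + bucketSize[j - 1];
    for (std::size_t i = 0; i < n; ++i)       // pass 2: place elements
        tmp[start[oracle[i]]++] = a[i];
}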

Theoretically the most expensive and algorithmically the most interesting part is how to locate elements with respect to the splitters. Fig. 3.8 gives pseudocode and a picture for this part. Assume k is a power of two. The splitters are placed into an array t such that they form a complete binary search tree with root t_1 = s_{k/2}. The left successor of t_j is stored at t_{2j} and the right successor is stored at t_{2j+1}. This is the arrangement well known from binary heaps, but used here for representing a search tree. To locate an element a_i, it suffices to travel down this tree, multiplying the index j by two in each level and adding one if the element is larger than the current splitter. This increment is the only instruction that depends on the outcome of the comparison. Some architectures

³This is true as long as we can accommodate one buffer per bucket in the cache, limiting the parameter k. Other limiting factors are the size of the TLB (translation lookaside buffer, storing mappings of virtual to physical memory addresses) and k ≤ 256 if we want to store the bucket indices in one byte.


t := 〈s_{k/2}, s_{k/4}, s_{3k/4}, s_{k/8}, s_{3k/8}, s_{5k/8}, s_{7k/8}, . . .〉
for i := 1 to n do                  // locate each element
  j := 1                            // current tree node := root
  repeat log k times                // will be unrolled
    j := 2j + (a_i > t_j)           // left or right?
  j := j − k + 1                    // bucket index
  |b_j|++                           // count bucket size
  o_i := j                          // remember oracle

Figure 3.8: Finding buckets using implicit search trees. The picture is for k = 8. We adopt the convention from C that “x > y” is one if x > y holds, and zero else.

  cmp.gt p7=r1,r2            cmp.gt p6=r1,r2
  (p7) br.cond .label        (p6) add r3=4,r3
  add r3=4,r3
  .label:

Table 3.1: Translation of if(r1 > r2) r3 := r3 + 4 with branches (left) and predicated instructions (right)

like IA-64 have predicated arithmetic instructions that are only executed if the previously computed condition code in the instruction’s predicate register is set. Others at least have a conditional move, so that we can compute j := 2j and then, speculatively, j′ := j + 1. Then we conditionally move j′ to j. The difference between such predicated instructions and ordinary branches is that they do not affect the instruction flow and hence cannot suffer from branch mispredictions.
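The decisive line is the branch-free descent itself; a C++ sketch (our naming; logK and the array layout t[1..k−1] follow Figure 3.8):

// Branch-free descent through the implicit splitter tree t[1..k-1].
// The comparison result is *added* to the index, so a compiler can emit
// a conditional move / predicated add instead of a conditional branch.
inline int findBucket(int ai, const int* t, int logK) {
    int j = 1;                            // root of the implicit tree
    for (int l = 0; l < logK; ++l)        // unrolled in a tuned implementation
        j = 2 * j + (ai > t[j]);          // go to left (0) or right (1) child
    return j - (1 << logK) + 1;           // leaf position -> bucket index
}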

Experiments (conducted on an Intel Itanium processor with Intel’s compiler to have support for predicated instructions and software pipelining) show that our implementation of SSSS outperforms two well known library implementations for sorting. In the experiment, 32 bit random integers in the range [0, 10⁹] were sorted⁴.

For this first version of SSSS, several improvements are possible. For example, the current implementation suffers from many identical keys. This could be fixed without much overhead: If s_{i−1} < s_i = s_{i+1} = · · · = s_j for j > i (identical splitters are an indicator for many identical keys), change s_i to s_i − 1 and do not recurse on buckets b_{i+1}, . . . , b_j – they all contain identical keys. Now SSSS can even profit from an input like this.

Another disadvantage compared to quicksort is that SSSS is not in-place. One could make it almost in-place, however. This is easiest to explain for the case that both input

⁴Note that the algorithm’s runtime is not influenced by the distribution of elements, so a random distribution of elements is no unfair advantage for SSSS.


[Plot: time per n log n in nanoseconds versus n for Intel STL, GCC STL, and sss-sort]

Figure 3.9: Runtime for sorting using SSSS and other algorithms

[Plot: time per n log n in nanoseconds versus n for Total, FB+DIST+BA, FB+DIST, and FB]

Figure 3.10: Breakdown of the execution time of SSSS (divided by n log n) into phases. “FB” denotes the finding of buckets for the elements, “DIST” the distribution of the elements to the buckets, “BA” the base sorting routines. The remaining time is spent in finding the splitters etc.


Figure 3.11: Run formation

[Figure: 4-way merging of the sorted runs formed from the string “make things as simple as possible but no simpler”]

Figure 3.12: Example of 4-way merging with M = 12, B = 2

and output are a sequence of blocks (compare chapter 2). Sampling takes sublinear space and time. Distribution needs at most 2k additional blocks and can otherwise recycle freed blocks of the input sequence. Although software pipelining may be more difficult for this distribution loop, the block representation facilitates a single-pass implementation without the time and space overhead for oracles, so that good performance may be possible. Since it is possible to convert in-place between a block list representation and an array representation in linear time, one could actually attempt an almost in-place implementation of SSSS.

3.5 Multiway Merge Sort

We will now study another algorithm based on the concept of Merge Sort which is especially well suited for external sorting. For external algorithms, an efficient sorting subroutine is even more important than for main memory algorithms because one often tries to avoid random disk accesses by ordering the data, allowing a sequential scan.

Multiway Merge Sort first splits the data into ⌈n/M⌉ runs which fit into main memory, where they are sorted. We merge these runs until only one is left. Instead of ordinary 2-way merging, we merge k := M/B runs in a single pass, resulting in a smaller number of merge phases. We only have to keep one block (containing the currently smallest elements) per run in main memory. We maintain a priority queue containing the smallest elements of each run in the current merging step to efficiently keep track of the overall smallest element.
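A compact in-memory C++ sketch of such a priority-queue-driven k-way merge (illustrative only — the external algorithm streams runs block-wise from disk and uses the tournament tree of Section 3.7 instead of std::priority_queue):

#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Merge k pre-sorted runs; the queue holds one (current) element per run.
std::vector<int> kWayMerge(const std::vector<std::vector<int>>& runs) {
    using Item = std::pair<int, std::size_t>;         // (key, run index)
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> pq;
    std::vector<std::size_t> pos(runs.size(), 0);
    for (std::size_t r = 0; r < runs.size(); ++r)
        if (!runs[r].empty()) pq.push({runs[r][0], r});
    std::vector<int> out;
    while (!pq.empty()) {
        auto [key, r] = pq.top(); pq.pop();           // overall smallest element
        out.push_back(key);
        if (++pos[r] < runs[r].size())                // refill from the same run
            pq.push({runs[r][pos[r]], r});
    }
    return out;
}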


Figure 3.13: Striping: one logical block consists of D physical blocks.

Every element is read/written twice for forming the runs (in blocks of size B) and twice for every merging phase. Access granularity is blocks. This leads to the following (asymptotically optimal) total number of I/Os:

    (2n/B) · (1 + ⌈log_k #runs⌉) = (2n/B) · (1 + ⌈log_{M/B} (n/M)⌉) =: sort(n)    (3.1)

Let us consider the following realistic parameters: B = 2 MB, M = 1 GB. For inputs up to a size of n = 512 GB, we get only one merging phase! In general, this is the case if we can store ⌈n/M⌉ buffers (one for each run) of size B in internal memory (i.e., n ≤ M²/B). Therefore, only one additional level can increase the I/O volume by 50%.
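As a quick sanity check of these numbers (assuming powers of two, i.e. B = 2²¹ bytes and M = 2³⁰ bytes):

n \le \frac{M^2}{B} = \frac{(2^{30})^2}{2^{21}}\,\text{bytes} = 2^{39}\,\text{bytes} = 512\,\text{GB},
\qquad\text{since } k = \frac{M}{B} = 2^{9} \text{ runs of size } M \text{ hold } kM = \frac{M^2}{B} \text{ bytes.}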

3.6 Sorting with parallel disks

We now consider a system with D disks. There are different ways to model this situation (see Figure 3.15), but all have in common that in one I/O step we can fetch up to D blocks, so we can hope to reduce the number of I/Os by this factor:

    (2n/(BD)) · (1 + ⌈log_{M/B} (n/M)⌉)    (3.2)

An obvious idea to handle multiple disks is the concept of striping: An emulated disk contains logical blocks of size DB consisting of one physical block per disk. The algorithms for run formation and writing the output can be used unchanged on this emulated disk. For the merging step, however, we have to be careful: With larger (logical) blocks, the number of I/Os becomes:

    (2n/(BD)) · (1 + ⌈log_{M/(BD)} (n/M)⌉)    (3.3)

The algorithm will move more data in one I/O step (compared to the setup with one disk) but may require more merge phases, since the logarithm now has a smaller base. In practice, this can make the difference between one and two merge phases. We therefore have to work on the level of physical


[Figure: the prediction sequence controls which prefetch buffers feed the internal merge buffers]

Figure 3.14: The smallest element of each block triggers fetch.

blocks to achieve optimal constant factors. This comes with the necessity to distribute the runs in an intelligent way among the disks and to find a schedule for fetching blocks into the merger.

For starters, it is necessary to find out which block on which disk will be required next when one of the merging buffers runs out of elements. This can be computed offline when all runs are formed: A block is required the moment its smallest element is required. We can therefore sort the set of all smallest elements to get a prediction sequence.

To be able to refill the merge buffers in time, we maintain prefetch buffers which we fill (if necessary) while the merging of the current elements takes place. This allows parallel access to the blocks due next and helps achieve an efficiency near 1 (i.e., fetching D blocks in one I/O step). How many prefetch buffers should we use?

We first approach this question by using a simplified model ((a) in figure 3.15) where we have D read/write heads on one large disk. Here, D prefetch buffers suffice: In one I/O step we can refill all buffers, transferring D blocks of size B, which leads to a total (optimal) number of I/Os as in equation 3.2.

If we replace the multihead model with D independent disks (each with its own read/write head), we get a more realistic model. But now D prefetch buffers seem too few, as it is possible that all of the next k blocks reside on the same disk, which would need that many I/O steps for filling the buffers while the other disks lie idle, leading to a non-optimal efficiency.

A first solution is to increase the number of prefetch buffers to kD. But that would leave us with less space for merge buffers, write buffers and other data that we have to


[Figure: (a) the multihead model [Aggarwal Vitter 88]; (b) D independent disks [Vitter Shriver 94]]

Figure 3.15: Different models for systems with several disks

Figure 3.16: Distribution of runs using randomized cycling.

keep in main memory. Instead, we use the randomized cycling pattern while forming runs: For every run j, we map block i to disk π_j(i mod D) for a random permutation π_j. This makes the event of getting a “difficult” distribution highly unlikely.
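A small C++ sketch of this mapping (our own illustration; one random permutation per run, reused for all of its blocks):

#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Randomized cycling: run j draws its own random permutation pi_j of the
// D disks, and block i of that run is placed on disk pi_j(i mod D).
struct RunPlacement {
    std::vector<int> pi;                        // random permutation of 0..D-1
    RunPlacement(int D, std::mt19937& gen) : pi(D) {
        std::iota(pi.begin(), pi.end(), 0);
        std::shuffle(pi.begin(), pi.end(), gen);
    }
    int diskForBlock(std::size_t i) const {     // disk of the i-th block
        return pi[i % pi.size()];
    }
};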

With a naive prefetching strategy and randomized cycling, we can achieve a good performance with only O(D log D) buffers. Is it possible to even reduce this to O(D)?

The prefetching strategy leaves more room for optimization. The naive approach fetches in one I/O step the next blocks from the prediction sequence until all free buffers are filled or a disk would be accessed twice.

The problem is now to find an optimal offline prefetching schedule (offline, because the prediction sequence yields the order in which the blocks on each disk are needed). For the solution, we make a digression to online buffered writing and use the principle of duality to transform our result there into a schedule for offline prefetching.

In online buffered writing, we have a sequence Σ of blocks to be written to one of D


[Figure: a sequence Σ of blocks is mapped randomly to D queues of W/D buffers each; a block is written whenever one of the W buffers is free, otherwise one block from each nonempty queue is output]

Figure 3.17: The online buffered writing problem and its optimal solution.

disks. We also have W buffers, W/D for each disk. It can be shown that writing randomly and equally distributed to one of the free buffers, and outputting one block of each nonempty queue if no capacity is left, is an optimal strategy and achieves an expected efficiency of 1 − O(D/W).

We can now reverse this process to obtain an optimal offline prefetching algorithm called lazy prefetching: Given the prediction sequence Σ, we calculate the optimal online writing sequence T for the reversed sequence Σᴿ and use its reversal Tᴿ as the prefetching schedule. Note that we do not use the distribution among the disks that the writing algorithm produces, and that the random distribution during the writing process corresponds to randomized cycling.

Figure 3.19 gives an example in which our optimal strategy yields a better result than a naive prefetching approach: The upper half shows the result of the example schedule from 3.18, created by inverting a writing schedule. The bottom half shows the result of naive prefetching, always fetching the next block from every disk in one step (as long as there are free buffers).

3.7 Internal work of Multiway Mergesort

Until now we have only considered the number of I/Os. In fact, when running with several disks, our sorting algorithm can very well be compute bound, i.e., prefetching D new blocks requires less time than merging them. We use the technique of overlapping to minimize wait time for whichever task is bounding our algorithm in a given environment. Take the following example on run formation (i denotes a run):

Thread A: loop { wait-read i; sort i; post-write i }
Thread B: loop { wait-write i; post-read i+2 }


[Figure: nine snapshots (a)–(i) of the buffer states, input steps and output steps while writing the blocks of Σ]

Figure 3.18: Example: Optimal randomized online writing


[Figure: for the same prediction sequence, the optimal offline reading schedule needs fewer input steps than naive prefetching [Barve-Grove-Vitter 97]]

Figure 3.19: Example: resulting offline reading schedule

During initialization, runs 1 and 2 are read and i is set to 1. Thread A sorts runs in memory and writes them to disk. Thread B waits until run i is finished (and thread A works on i + 1) and reads the next run i + 2 into the freed space. The thread doing the more intensive work will never wait for the other one.

A similar result can be achieved during the merging step, but this is considerably more complicated and beyond the scope of this course.

As internal work influences running time, we need a fast solution for the most compute-intensive step during merging: A Tournament Tree (or Loser Tree) is a specialized data structure for finding the smallest element of all runs. For k = 2^K, it is a complete binary tree with K levels, where each leaf contains the currently smallest element of one run. Each internal node contains the ‘loser’ (i.e., the greater) of the ‘competition’ between its two child nodes. Above the root node, we store the global winner along with a pointer to the corresponding run. After writing this element to the output buffer, we simply have to move the next element of its run up until there is a new global winner. Compared to general purpose data structures like binary heaps, we get exactly log k comparisons (no hidden constant factors). Similar to the implicit search trees we used for Sample Sort, Tournament Trees can be implemented as arrays, where finding the parent node simply maps to an index shift to the right. The inner loop for moving from leaf to root can be unrolled and contains predictable load instructions and index computations, allowing


Figure 3.20: A tournament tree

for (int i = (winnerIndex + kReg) >> 1; i > 0; i >>= 1) {
    currentPos = entry + i;                 // node on the leaf-to-root path
    currentKey = currentPos->key;
    if (currentKey < winnerKey) {           // new element loses at this node:
        currentIndex = currentPos->index;
        currentPos->key   = winnerKey;      // node keeps the old winner as loser
        currentPos->index = winnerIndex;
        winnerKey   = currentKey;           // the smaller key moves further up
        winnerIndex = currentIndex;
    }
}

Figure 3.21: Inner loop of Tournament Tree computation

exploitation of instruction parallelism.
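For completeness, a self-contained C++ distillation of this update step (our own naming; k is a power of two, the initial tournament over the first keys of all runs is omitted):

#include <utility>
#include <vector>

// Array-based loser tree for k runs: entry[1..k-1] hold the losers of the
// internal nodes; the overall winner is kept separately above the root.
struct LoserTree {
    struct Entry { int key; int index; };
    int k;                                  // number of runs, a power of two
    std::vector<Entry> entry;               // entry[i/2] is the parent of entry[i]
    int winnerKey, winnerIndex;             // global winner and its run
    // Replace the winner by the next key of its run; one comparison per level.
    int replaceWinner(int nextKey) {
        int wKey = nextKey, wIndex = winnerIndex;
        for (int i = (wIndex + k) >> 1; i > 0; i >>= 1) {
            if (entry[i].key < wKey) {              // new element loses here:
                std::swap(entry[i].key, wKey);      // node keeps the loser,
                std::swap(entry[i].index, wIndex);  // the winner moves up
            }
        }
        winnerKey = wKey; winnerIndex = wIndex;     // new global winner
        return winnerKey;
    }
};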

3.8 Experiments

Experiments on Multiway Merge Sort were performed in 2001 on a 2 × 2 GHz Xeon × 2 threads machine (Intel IV with Netburst) with several 66 MHz PCI buses, 4 fast IDE controllers (Promise Ultra100 TX2) and 8 fast IDE disks (IBM IC35L080AVVA07). This inexpensive (mid 2002) setup gave a high I/O bandwidth of 360 MB/s. The keys consisted of 16 GByte of random 32 bit integers, run size was 256 MByte, block size B was 2 MB (if not otherwise mentioned).

Figure 3.22 shows the running time for different element sizes (for a constant total data volume of 16 GByte). The smaller the elements, the costlier internal work becomes, especially during run formation (there are more elements to sort). With a high I/O throughput and intelligent prefetching algorithms, I/O wait time never accounts for more than half of the total running time. This proves the point that overlapping and tuning of internal work are important.


[Plot: time in seconds versus element size in bytes, broken down into run formation, merging, I/O wait in the merge phase, and I/O wait in the run formation phase]

Figure 3.22: Multiway Merge Sort with different element sizes

[Plot: sort time in ns/byte versus block size in KByte, for 16 GBytes and for 128 GBytes with one or two merge phases]

Figure 3.23: Performance using different block sizes


What is a good block size B? An intuitive approach would link B to the size of a physical disk block. However, figure 3.23 shows that B is no technology constant but a tuning parameter: A larger B is better (as it reduces the amortized cost of O(1/B) I/Os per element), as long as the resulting smaller k still allows for a single merge phase (see the curve for 128 GB).


Chapter 4

Priority Queues

The material on external priority queues was first published in [5].

4.1 Introduction

Priority queues are an important data structure for many applications, including: shortest path search (Dijkstra’s Algorithm), sorting, construction of minimum spanning trees, branch-and-bound search, discrete event simulation and many more. While the first examples are widely known and also covered in other chapters, we give a short explanation of the latter two applications: In the best-first branch-and-bound approach to optimization, elements are partial solutions of an optimization problem and the keys are optimistic estimates of the obtainable solution quality. The algorithm repeatedly removes the best looking partial solution, refines it, and inserts zero or more new partial solutions. In a discrete event simulation one has to maintain a set of pending events. Each event happens at some scheduled point in time and creates zero or more new events scheduled to happen at some time in the future. Pending events are kept in a priority queue. The main loop of the simulation deletes the next event from the queue, executes it, and inserts newly generated events into the priority queue. Our (non-addressable) priority queue M needs to support the following operations:

Procedure build({e_1, . . . , e_n})   M := {e_1, . . . , e_n}
Procedure insert(e)                 M := M ∪ {e}
Function deleteMin                  e := min M; M := M \ {e}; return e

There are different approaches to implementing priority queues, but most of them resort to an implicit or explicit tree representation which is heap-ordered¹: If w is a successor of v, the key stored in w is not smaller than the key stored in v. This way, the overall smallest key is stored in the root.

¹In 4.4 we will see implementations using a whole forest of heap-ordered trees.


4.2 Binary Heaps

Priority queues are often implemented as binary heaps, stored in an array h where the successors of an element at position i are stored at positions 2i and 2i + 1. This is an implicit representation of a near-perfect binary tree which might only lack some leaves in the bottom level. We require that this array is heap-ordered, i.e.,

if 2 ≤ j ≤ n then h[⌊j/2⌋] ≤ h[j].

Binary heaps with arrays are bounded in space, but they can be made unbounded in the same way as bounded arrays are made unbounded. Assuming non-hierarchical memory, we can implement all desired operations in an efficient manner:

An insert puts a new element e tentatively at the end of the heap h, i.e., e is put at a leaf of the tree represented by h. Then e is moved to an appropriate position on the path from the leaf h[n] to the root.

Procedure insert(e : Element)
  assert n < w
  n++; h[n] := e
  siftUp(n)

where siftUp(s) moves the contents of node s towards the root until the heap property holds.

Procedure siftUp(i : N)
  assert the heap property holds except maybe for j = i
  if i = 1 ∨ h[⌊i/2⌋] ≤ h[i] then return
  assert the heap property holds except for j = i
  swap(h[i], h[⌊i/2⌋])
  assert the heap property holds except maybe for j = ⌊i/2⌋
  siftUp(⌊i/2⌋)

Since siftUp will potentially move the element up to the root and perform a comparison on every level, insert takes O(log n) time. On average, a constant number of comparisons will suffice.

deleteMin in its basic form replaces the root with the last leaf h[n], which is then sifted down (analogously to siftUp), resulting in 2 log n key comparisons (on every level, we have to find the minimum of three elements). The bottom-up heuristic suggests an improvement for that operation: The hole left by the removed minimum is “sifted down” to a leaf (requiring only one comparison per level, between the two successors of the hole), and only then is it replaced by the last leaf, which is sifted up again (costing constant time on average, like an insertion).


[Figure: deleteMin takes O(1); the hole is sifted down in log(n) steps with one comparison per level; the last element is then sifted up in O(1) steps on average]

Figure 4.1: The bottom-up heuristic

int i = 1, m = 2, t = a[1];              // t: the element to be sifted down
m += (m != n && a[m] > a[m + 1]);        // branch-free choice of the smaller successor
if (t > a[m]) {
    do {
        a[i] = a[m];                     // move the smaller successor up
        i = m;
        m = 2 * i;
        if (m > n) break;
        m += (m != n && a[m] > a[m + 1]);
    } while (t > a[m]);
}
a[i] = t;

Figure 4.2: An efficient version of standard deleteMin

This approach should be a factor of two faster than the naive implementation. However, if the latter is programmed properly (see figure 4.2), there are no measurable differences in runtime: The given implementation has log n comparisons more than bottom-up, but these are stopping criteria for the loop and thus easy to handle for branch prediction. Notice how the increment of m avoids branches within the loop.

For the initial construction of a heap there are also two competing approaches:

buildHeapBackwards moves from the leaves to the root, ensuring the heap property on every level. buildHeapRecursive first establishes the property recursively on the two subtrees of the root and then sifts the remaining node down. Here, we have the reverse situation compared to deleteMin: Both algorithms asymptotically cost O(n) time, but on a real machine the recursive variant is faster by a factor of two: it is more cache efficient. Note that a subtree with B leaves and therefore log B levels can be stored in B log B blocks of


Procedure buildHeapBackwards
  for i := ⌊n/2⌋ downto 1 do siftDown(i)

Procedure buildHeapRecursive(i : N)
  if 4i ≤ n then
    buildHeapRecursive(2i)
    buildHeapRecursive(2i + 1)
  siftDown(i)

Figure 4.3: Two implementations for buildHeap

[Figure: an insertion buffer of size m, k sorted sequences of size up to m, a k-way merge and an output buffer of size m]

Figure 4.4: A simple external PQ for n < km

size B. If these blocks fit into the cache, we only require O(n/B) I/O operations.

4.3 External Priority Queues

We now study a variant of external priority queues² which are called sequence heaps.

Merging k sorted sequences into one sorted sequence (k-way merging) is an I/O-efficient subroutine used for sorting – we saw this in chapter 3.5. The basic idea of sequence heaps is to adapt k-way merging to the related but more dynamic problem of priority queues.

Let us start with the simple case that at most km insertions take place, where m is the size of a buffer that fits into fast memory. Then the data structure could consist of k sorted sequences of length up to m. We can use k-way merging for deleting a batch of the m smallest elements from the k sorted sequences. The next m deletions can then be served from a buffer in constant time.

A separate binary heap with capacity m allows an arbitrary mix of insertions and

²If “I/O” is replaced by “cache fault”, we can use this approach also one level higher in the memory hierarchy.


[Figure: insert heap; merge groups 1–3 with k sequences each of size up to m, km and k²m; group buffers 1–3 of size m; a final R-way merge; and the deletion buffer of size m′]

Figure 4.5: Overview of the complete data structure for R = 3 merge groups

deletions by holding the recently inserted elements. Deletions have to check whether the smallest element has to come from this insertion buffer. When this buffer is full, it is sorted, and the resulting sequence becomes one of the sequences for the k-way merge.

How can we generalize this approach to handle more than km elements? We cannot increase m beyond M, since the insertion heap would not fit into fast memory. We cannot arbitrarily increase k, since eventually k-way merging would start to incur cache faults. Sequence heaps make room by merging all the k sequences, producing a larger sequence of size up to km.

Now the question arises how to handle the larger sequences. Sequence heaps employ R merge groups G_1, . . . , G_R, where G_i holds up to k sequences of size up to mk^{i−1}. When group G_i overflows, all its sequences are merged, and the resulting sequence is put into group G_{i+1}.

Each group is equipped with a group buffer of size m to allow batched deletion from the sequences. The smallest elements of these buffers are deleted in batches of size m′ ≪ m. They are stored in the deletion buffer. Fig. 4.5 summarizes the data structure. We now have enough information to understand how deletion works:

DeleteMin: The smallest elements of the deletion buffer and insertion buffer are compared, and the smaller one is deleted and returned. If this empties the deletion buffer, it is refilled from the group buffers using an R-way merge. Before the refill, group buffers


with less than m′ elements are refilled from the sequences in their group (if the group is nonempty).

DeleteMin works correctly provided the data structure fulfills the heap property, i.e., elements in the group buffers are not smaller than elements in the deletion buffer, and in turn, elements in a sorted sequence are not smaller than the elements in the respective group buffer. Maintaining this invariant is the main difficulty for implementing insertion.

Insert: New elements are inserted into the insert heap. When its size reaches m, its elements are sorted (e.g. using merge sort or heap sort). The result is then merged with the concatenation of the deletion buffer and group buffer 1. The smallest resulting elements replace the deletion buffer and group buffer 1. The remaining elements form a new sequence of length at most m. The new sequence is finally inserted into a free slot of group G_1. If there is no free slot initially, G_1 is emptied by merging all its sequences into a single sequence of size at most km, which is then put into G_2. The same strategy is used recursively to free higher level groups when necessary. When group G_R overflows, R is incremented and a new group is created. When a sequence is moved from one group to the other, the heap property may be violated. Therefore, when G_1 through G_i have been emptied, the group buffers 1 through i + 1 are merged and put into G_1.

For cached memory, where the speed of internal computation matters, it is also crucial how to implement the operation of k-way merging. How this can be done in an efficient way is described in the chapter about Sorting (3.7).

Analysis

We will now sketch the I/O analysis of our priority queues. Let I denote the number of insertions; it is also an upper bound on the number of deleteMin operations.

First note that group G_i can overflow at most every m(k^i − 1) insertions: The only complication is the slot in group G_1 used for invalid group buffers. Nevertheless, when groups G_1 through G_i contain k sequences each, this can only happen if

    ∑_{j=1}^{i} m(k − 1)k^{j−1} = m(k^i − 1)

insertions have taken place. Therefore, R = ⌈log_k(I/m)⌉ groups suffice.

Now consider the I/Os performed for an element moving on the following canonical data path: It is first inserted into the insert buffer and then written to a sequence in group G_1 in a batched manner, i.e., 1/B I/Os are charged to the insertion of this element. Then it is involved in emptying groups until it arrives in group G_R. For each emptying operation, the element is involved in one batched read and one batched write, i.e., it is


[Figure, four stages:
(a) Inserting element 3 leads to overflow of the insert heap: it is merged with the deletion buffer and group buffer 1 and then inserted into group 1.
(b) Overflow in group 1: all old elements are merged and inserted into the next group.
(c) Overflow in group 2: all old elements are merged and inserted into the next group.
(d) The group buffers are now invalid: merge them and insert the result into group 1.]

Figure 4.6: Example of an insertion on the sequence heap


[Figure, two stages:
(a) Deletion of two elements empties the insert heap and the deletion buffer.
(b) Every group refills its group buffer via k-way merging; the deletion buffer is refilled from the group buffers via R-way merging.]

Figure 4.7: Example of a deletion on the sequence heap


charged with 2(R − 1)/B I/Os for the emptying operations. Eventually, the element is read into group buffer R, yielding a charge of 1/B I/Os. All in all, we get a charge of 2R/B I/Os for each insertion.

What remains to be shown (and is omitted here) is that the remaining I/Os only contribute lower order terms or replace I/Os done on the canonical path. For example, we save I/Os when an element is extracted before it reaches the last group. We use the costs charged for this to pay for swapping the group buffers in and out. Eventually, we have O(sort(I)) I/Os.

In a similar fashion, we can show that I operations inflict I log I key comparisons on average. As for sorting, this is a good measure for the internal work, since in efficient implementations of priority queues for the comparison model, this number is close to the number of unpredictable branch instructions (whereas loop control branches are usually well predictable by the hardware or the compiler), and the number of key comparisons is also proportional to the number of memory accesses. These two types of operations often have the largest impact on the execution time, since they are the most severe limit to instruction parallelism in a super-scalar processor.

Experiments

We now present the results of some experiments conducted to compare our sequence heap with other priority queue implementations. Random 32 bit integers were used as keys for another 32 bits of associated information. The operation sequence used was (insert deleteMin insert)^N (deleteMin insert deleteMin)^N. The choice of this sequence is nontrivial, as it can have a measurable influence (a factor of two and more) on the performance. Figure 4.9 shows this: Here we have the sequence (insert (deleteMin insert)^s)^N (deleteMin (insert deleteMin)^s)^N for several values of s. For larger s, the performance gets better when N is large enough. This can be explained with a “locality effect”: New elements tend to be smaller than most old elements (the smallest of the old elements have long been removed before). Therefore, many elements never make it into group G_1, let alone the groups for larger sequences. Since most work is performed while emptying groups, this work is saved. The instances with s = 0 should thus come close to the worst case. To make clear that sequence heaps are nevertheless still much better than binary or 4-ary heaps, Figure 4.9 additionally contains their timings for s = 0.

The parameters chosen for the experiments were m′ = 32, m = 256 and k = 128 on all machines tried. While there were better settings for individual machines, these global values gave near optimal performance in all cases.


[Plot: (T(deleteMin) + T(insert))/log N in nanoseconds versus N for bottom-up binary heap, bottom-up aligned 4-ary heap, and sequence heap]

Figure 4.8: Runtime comparison for several PQ implementations (on a 180 MHz MIPS R10000)

[Plot: (T(deleteMin) + T(insert))/log N in nanoseconds versus N for s = 0 (binary heap, 4-ary heap, sequence heap) and for sequence heaps with s = 1, 4, 16]

Figure 4.9: Runtime comparison for different operation sequences


Figure 4.10: Link: merge two trees, preserving the heap property

Figure 4.11: Cut: remove a subtree and add it to the forest

4.4 Addressable Priority Queues

For addressable priority queues, we want to add the following functionality to the interface of our basic data structure:

Function remove(h : Handle)                  e := h; M := M \ {e}; return e
Procedure decreaseKey(h : Handle, k : Key)   assert key(h) ≥ k; key(h) := k
Procedure merge(M′)                          M := M ∪ M′

This extended interface is required to efficiently implement Dijkstra’s Algorithm for shortest paths or the Jarník-Prim Algorithm for calculating minimum spanning trees (both make use of the decreaseKey operation).

It is not possible to extend our previous approach to become addressable, as keys are constantly swapped within our array during deleteMin and other operations. For this domain, we implement priority queues as a set of heap-ordered trees plus a pointer for finding the tree containing the globally minimal element. The elementary form of these priority queues is called Pairing Heap.

With just two basic operations, link (Figure 4.10) and cut (Figure 4.11), we can implement addressable priority queues. We can already give a high-level implementation of all necessary operations:

Procedure insertItem(h : Handle)
  newTree(h)

Procedure newTree(h : Handle)
  forest := forest ∪ {h}
  if key(h) < key(minPtr) then minPtr := h

Procedure decreaseKey(h : Handle, k : Key)
  key(h) := k
  if h is not a root then cut(h)

Function deleteMin : Handle
  m := minPtr
  forest := forest \ {m}
  foreach child h of m do newTree(h)
  perform a pairwise link of the tree roots in forest
  return m

Procedure merge(o : AddressablePQ)
  if key(minPtr) > key(o.minPtr) then minPtr := o.minPtr
  forest := forest ∪ o.forest
  o.forest := ∅

An insert adds a new single-node tree to the forest. So a sequence of n inserts into an initially empty heap will simply create n single-node trees. The cost of an insert is clearly O(1).

A deleteMin operation removes the node indicated by minPtr. This turns all children of the removed node into roots. We then scan the set of roots (old and new) to find the new minimum, a potentially very costly process. We make the process even more expensive (by a constant factor) by doing some useful work on the side, namely combining some trees into larger trees. Pairing heaps do this by just doing one step of pairwise linking of arbitrary trees. There are variants doing more complicated operations to prove better theoretical bounds.

We turn to the decreaseKey operation next. It is given a handle h and a new key k and decreases the key value of h to k. In order to maintain the heap property, we cut the subtree rooted at h and turn h into a root. Cutting out subtrees causes the more subtle problem that it may leave trees that have an awkward shape. While pairing heaps do nothing to prevent this, some variants of addressable priority queues perform additional operations to keep the trees in shape.

The remaining operations are easy. We can remove an item from the queue by first decreasing its key so that it becomes the minimum item in the queue and then performing a deleteMin. To merge a queue o into another queue we compute the union of roots and o.roots. To update minPtr it suffices to compare the minima of the merged queues. If the root sets are represented by linked lists, and no additional balancing is done, a merge needs only constant time.
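For concreteness, here is a C++ sketch of the link primitive on items laid out as in Figure 4.12 (a sketch under our own naming; the root list, minPtr maintenance, and the cut operation are omitted):

#include <utility>

// Pairing-heap item with child/sibling pointers (cf. Figure 4.12).
struct Node {
    int key;
    Node* child   = nullptr;   // leftmost child
    Node* sibling = nullptr;   // right sibling (or next root)
    Node* prev    = nullptr;   // left sibling, or parent for the leftmost child
};

// Link (Figure 4.10): make the root with the larger key the leftmost
// child of the other one; the heap property is preserved.
Node* link(Node* a, Node* b) {
    if (b->key < a->key) std::swap(a, b);  // a now carries the smaller key
    b->prev = a;                           // b's "left" pointer becomes its parent
    b->sibling = a->child;                 // prepend b to a's child list
    if (a->child) a->child->prev = b;
    a->child = b;
    return a;                              // root of the merged tree
}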

Pairing heaps are the simplest form of forest-based addressable priority queues. A more elaborate and (in theory, at least) faster variant are Fibonacci heaps. They maintain a rank (initially zero, denoting the number of children) for every element, which is increased for root nodes when another tree is linked to them, and a mark flag that is set when the node loses a child due to a decreaseKey. Root nodes of the same rank are


[Figure: a pairing heap item stores data, a pointer to one child, a right sibling pointer and a left sibling/parent pointer; a Fibonacci heap item additionally stores a parent pointer, a rank and a mark flag]

Figure 4.12: Structure of one item in a Pairing Heap or a Fibonacci Heap.

linked after a deleteMin to limit the number of trees. If a cut is executed on a node with an already marked parent, the parent is cut as well. These rules lead to an amortized complexity of O(log n) for deleteMin and O(1) for all other operations. However, both the constant factors and the worst case performance for a single operation are high, making Fibonacci heaps a mainly theoretical tool. In addition, more metainformation per node increases the memory overhead of Fibonacci heaps.


Chapter 5

External Memory Algorithms

The introduction of this chapter is based on [6]. The sections on time-forward processing, graph algorithms and cache-oblivious algorithms use material from the book chapters [10], [8] and [9]. The cache-oblivious model was first presented in [11]. The section on Funnelsort is based on [19]. The external BFS section is from [12] for the presentation of the algorithm and from [13] for tuning and experiments. Additional material in multiple sections is from [7].

5.1 Introduction

Massive data sets arise naturally in many domains. Spatial data bases of geographic information systems like GoogleEarth and NASA’s World Wind store terabytes of geographically-referenced information that includes the whole Earth. In computer graphics one has to visualize huge scenes using only a conventional workstation with limited memory. Billing systems of telecommunication companies evaluate terabytes of phone call log files. One is interested in analyzing huge network instances like a web graph or a phone call graph. Search engines like Google and Yahoo provide fast text search in their data bases indexing billions of web pages. A precise simulation of the Earth’s climate needs to manipulate petabytes of data. These examples are only a sample of the numerous applications which have to process huge amounts of data.

For economic reasons, it is not feasible to build all of the computer’s memory of the fastest type or to extend the fast memory to dimensions that could hold all relevant data. Instead, modern computer architectures contain a memory hierarchy of increasing size, decreasing speed and decreasing cost from top to bottom: On top, we have the registers integrated in the CPU, a number of caches, main memory and finally disks, which are often referred to as external memory as opposed to internal memory.

The internal memories of computers can keep only a small fraction of these large data sets. During processing, the applications need to access the external memory (e. g.


Figure 5.1: schematic construction of a hard disk

hard disks) very frequently. One such access can be about 10^6 times slower than a main memory access. Therefore, the disk accesses (I/Os) become the main bottleneck.

The reason for this high latency is the mechanical nature of the disk access. Figure 5.1 shows the schematic construction of a hard disk. The time needed for finding the data position on the disk is called seek time or (seek) latency and averages about 3–10 ms for modern disks. The seek time depends on the surface data density and the rotational speed and can hardly be reduced because of the mechanical nature of hard disk technology, which still remains the best way to store massive amounts of data. Note that after finding the required position on the surface, the data can be transferred at a higher speed which is limited only by the surface data density and the bandwidth of the interface connecting CPU and hard disk. This speed is called sustained throughput and reaches up to 80 MByte/s nowadays. In order to amortize the high seek latency, one reads or writes the data in chunks (blocks). The block size is balanced when the seek latency is only a fraction of the sustained transfer time for the block. Blocks containing a full track give good results. For older low-density disks of the early 90s the track capacities were about 16–64 KB. Nowadays, disk tracks have a capacity of several megabytes.
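To make this concrete, here is a small back-of-the-envelope calculation; the 5 ms seek time and the 80 MByte/s throughput used below are assumed values within the ranges quoted above. The time to access a block of B bytes is

    t(B) = t_seek + B / throughput,   so   t(4 KB) ≈ 5 ms + 0.05 ms   and   t(1 MB) ≈ 5 ms + 12.5 ms.

With 4 KB blocks the seek dominates the access time by a factor of 100; with 1 MB blocks it accounts for less than a third of it, so the disk spends most of its time actually transferring data.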

Operating systems implement the so-called virtual memory mechanism that provides additional working space for an application, mapping an external memory file (page file) to virtual main memory addresses. This idea supports the Random Access Machine model, in which a program has an infinitely large main memory with uniform random access cost. Since the memory view is unified in operating systems supporting virtual memory, the application does not know where its working space and program code are located: in the main memory or (partially) swapped out to the page file. For many applications and algorithms with non-linear access patterns, these remedies are not useful and


even counterproductive: the swap file is accessed very frequently; the program code can be swapped out in favor of data blocks; the swap file is highly fragmented and thus many random input/output operations (I/Os) are needed even for scanning.

5.2 The external memory model and things we already saw

If we bypass the virtual memory mechanism, we can no longer apply the RAM model for analysis, since we now have to explicitly handle different levels of the memory hierarchy, while the RAM model uses one large, uniform memory.

Several simple models have been introduced for designing I/O-efficient algorithms and data structures (also called external memory algorithms and data structures). The most popular and realistic model is the Parallel Disk Model (PDM) of Vitter and Shriver. In this model, I/Os are handled explicitly by the application. An I/O operation transfers a block of B consecutive bytes from/to a disk to amortize the latency. The application tries to transfer D blocks between the main memory of size M bytes and D independent disks in one I/O step to improve bandwidth. The input size is N bytes, which is (much) larger than M. The main complexity metrics of an I/O-efficient algorithm in this model are:

• I/O complexity: the number of I/O steps should be minimized (the main metric),

• CPU work complexity: the number of operations executed by the CPU should be minimized as well.

The PDM model has become the standard theoretical model for designing and analyzing I/O-efficient algorithms.

There are some “golden rules” that can guide the process of designing I/O-efficient algorithms: Unstructured memory access is often very expensive, as it comes with 1 I/O per operation, whereas we want 1/B I/Os per operation for an efficient algorithm. Instead, we want to scan the external memory, always loading the next block of size B in one step and processing it internally. An optimal scan costs only scan(N) := Θ(N/(D·B)) I/Os. If the data is not stored in a way that allows linear scanning, we can often use sorting to reorder and then scan it. As we saw in Chapter 3, external sorting can be implemented with sort(N) := Θ(N/(D·B) · log_{M/B}(N/B)) I/Os.

A simple example of this technique is the following task: we want to rearrange the elements of an array B into an array A according to a given “rank” stored in array C. This should be done in an I/O-efficient way.

int[1..N] A, B, C;
for i = 1 to N do A[i] := B[C[i]];


Figure 5.2: Vitter's I/O model with several independent disks: a CPU with internal memory of size M is connected to disks 1, . . . , D, each transferring blocks of size B.

The literal implementation would have worst-case costs of Ω(N) I/Os. For N = 10^6, this would take T ≈ 10000 seconds ≈ 3 hours. Using the sort-and-scan technique, we can lower this to sort(N), and the algorithm would finish in less than a second:

SCAN C:     (C[1]=17, 1), (C[2]=5, 2), ...
SORT(1st):  (C[73]=1, 73), (C[12]=2, 12), ...
par SCAN:   (B[1], 73), (B[2], 12), ...
SORT(2nd):  (B[C[1]], 1), (B[C[2]], 2), ...
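The following C++ sketch mirrors these four steps in memory, with std::sort standing in for the external sort and 0-based indices instead of the 1-based ones above; it is meant to show the data flow, not an actual external implementation:

#include <algorithm>
#include <utility>
#include <vector>

// Sort-and-scan realization of A[i] = B[C[i]] (0-based here); std::sort
// stands in for the external sort, everything else is a scan.
std::vector<int> permute(const std::vector<int>& B, const std::vector<int>& C) {
    size_t N = C.size();
    std::vector<std::pair<int,int> > req(N);      // (index into B, target i)
    for (size_t i = 0; i < N; ++i)
        req[i] = std::make_pair(C[i], (int)i);    // 1st scan of C
    std::sort(req.begin(), req.end());            // 1st sort: by index into B
    std::vector<std::pair<int,int> > hit(N);      // (target i, element B[C[i]])
    for (size_t j = 0; j < N; ++j)                // parallel scan of req and B
        hit[j] = std::make_pair(req[j].second, B[req[j].first]);
    std::sort(hit.begin(), hit.end());            // 2nd sort: by target i
    std::vector<int> A(N);
    for (size_t i = 0; i < N; ++i) A[i] = hit[i].second;  // final scan
    return A;
}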

We already saw some I/O-efficient algorithms using this model in previous chapters: Chapter 2 presented an external stack, a large section of Chapter 3 dealt with external sorting, and in Chapter 4 we saw external priority queues. Chapter 8 will present an external approach to minimum spanning trees. In this chapter, we will see some more algorithms, study how these algorithms and data structures can be implemented in a convenient way using an algorithm library, and learn about other models of external computation.

5.3 The Stxxl library

The Stxxl library is an algorithm library that aims to speed up the process of implementing I/O-efficient algorithms, abstracting away the details of how I/O is performed. Many high-performance features are supported: disk parallelism, explicit overlapping of I/O and


Figure 5.3: three layer structure of the Stxxl library. On top of the operating system sits the Asynchronous I/O primitives (AIO) layer (files, I/O requests, disk queues, completion handlers), above it the Block Management (BM) layer (typed block, block manager, buffered streams, block prefetcher, buffered block writer), and on top the STL-user layer (containers: vector, stack, set, priority_queue, map; algorithms: sort, for_each, merge) next to the Streaming layer (pipelined sorting, zero-I/O scanning) used by applications.

computation, external memory algorithm pipelining to save I/Os, and improved utilization of CPU resources (many of these techniques were introduced in Chapter 3 on external sorting). The high-level algorithms and data structures of the library implement interfaces of the well-known C++ Standard Template Library (STL). This allows one to elegantly reuse STL code such that it works I/O-efficiently without any change, and shortens development times for people who already know the STL. The STL-compatible I/O-efficient implementations include the following data structures and algorithms: unbounded array (vector), stacks, queue, deque, priority queue, search tree, sorting, etc. They are all well-engineered and have robust interfaces allowing a high degree of flexibility. Stxxl is a layered library consisting of three layers (see Figure 5.3):

The lowest layer, the Asynchronous I/O primitives layer (AIO layer), abstracts away the details of how asynchronous I/O is performed on a particular operating system. Other existing external memory algorithm libraries rely only on synchronous I/O APIs or allow reading ahead sequences stored in a file using the POSIX asynchronous I/O API. These libraries also rely on uncontrolled operating system I/O caching and buffering in order to overlap I/O and computation in some way. However, this approach has significant performance penalties for accesses without locality. Unfortunately, the asynchronous I/O APIs are very different for different operating systems (e.g. POSIX AIO and Win32 Overlapped I/O). Therefore, the AIO layer was introduced to make porting Stxxl easy. Porting the whole library to a different platform requires only reimplementing the AIO layer using native file access methods and/or native multithreading mechanisms.

The Block Management layer (BM layer) provides a programming interface emulating the parallel disk model. The BM layer provides an abstraction for a fundamental


concept in external memory algorithm design: a block of elements. The block manager implements block allocation/deallocation, allowing several block-to-disk assignment strategies: striping, randomized striping, randomized cycling, etc. The block management layer also provides an implementation of parallel disk buffered writing, optimal prefetching [HSV01], and block caching. The implementations are fully asynchronous and designed to explicitly support overlapping between I/O and computation.

The top of Stxxl consists of two modules. The STL-user layer provides external memory sorting, an external memory stack, an external memory priority queue, etc., which have (almost) the same interfaces (including syntax and semantics) as their STL counterparts. The Streaming layer provides efficient support for pipelining external memory algorithms. Many external memory algorithms implemented using this layer can save a factor of 2–3 in I/Os. For example, the algorithms for external memory suffix array construction implemented with this module require only 1/3 of the number of I/Os which must be performed by implementations that use conventional data structures and algorithms (either from the Stxxl STL-user layer, LEDA-SM, or TPIE). The win is due to an efficient interface that couples the input and the output of the algorithm components (scans, sorts, etc.). The output of one algorithm is directly fed into the next algorithm as input, without needing to store it on disk in between. This generic pipelining interface is the first of its kind for external memory algorithms.
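As a flavor of the STL-user layer, here is a minimal sketch of externally sorting a stxxl::vector. It follows the classic Stxxl interface, in which the comparator additionally supplies sentinel values and stxxl::sort receives an internal memory budget; exact signatures may differ between library versions:

#include <climits>
#include <cstdlib>
#include <stxxl/vector>
#include <stxxl/sort>

// External sorting needs sentinels, hence min_value()/max_value()
// in the comparator.
struct Cmp {
    bool operator()(const int& a, const int& b) const { return a < b; }
    int min_value() const { return INT_MIN; }
    int max_value() const { return INT_MAX; }
};

int main() {
    stxxl::vector<int> v;                 // unbounded external memory array
    for (int i = 0; i < 100000000; ++i)
        v.push_back(std::rand());
    // sort with a 512 MiB internal memory budget; disk parallelism,
    // prefetching and buffered writing happen in the lower layers
    stxxl::sort(v.begin(), v.end(), Cmp(), 512 * 1024 * 1024);
    return 0;
}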

5.4 Time-Forward Processing

This section is based on material from [10].

Time-forward processing is an elegant technique for solving problems that can be expressed as a traversal of a directed acyclic graph (DAG) from its sources to its sinks. Problems of this type arise mostly in I/O-efficient graph algorithms, even though applications of this technique for the construction of I/O-efficient data structures are also known. Formally, the problem that can be solved using time-forward processing is that of evaluating a DAG G: Let φ be an assignment of labels φ(v) to the vertices of G. Then the goal is to compute another labelling ψ of the vertices of G so that for every vertex v ∈ G, label ψ(v) can be computed from labels φ(v) and ψ(u1), . . . , ψ(uk), where u1, . . . , uk are the in-neighbors of v.

As an illustration, consider the problem of expression-tree evaluation. For this problem, the input is a binary tree T whose leaves store real numbers and whose internal vertices are labelled with one of the four elementary binary operations +, −, ∗, /. The value of a vertex is defined recursively. For a leaf v, its value val(v) is the real number stored at v. For an internal vertex v with label ◦ ∈ {+, −, ∗, /}, left child x, and right child y, val(v) = val(x) ◦ val(y). The goal is to compute the value of the root of T. Cast in terms of the general DAG evaluation problem defined above, tree T is a DAG whose


Figure 5.4: (a) The expression tree for the expression ((4 / 2) + (2 ∗ 3)) ∗ (7 − 1). (b) The same tree with its vertices labelled with their values.

edges are directed from children to parents, labelling φ is the initial assignment of real numbers to the leaves of T and of operations to the internal vertices of T, and labelling ψ is the assignment of the values val(v) to all vertices v ∈ T. For every vertex v ∈ T, its label ψ(v) = val(v) is computed from the labels ψ(x) = val(x) and ψ(y) = val(y) of its in-neighbors (children) and its own label φ(v) ∈ {+, −, ∗, /}.

In order to be able to evaluate a DAG G I/O-efficiently, two assumptions have to be satisfied: (1) The vertices of G have to be stored in topologically sorted order, that is, for every edge (v, w) ∈ G, vertex v precedes vertex w. (2) Label ψ(v) has to be computable from labels φ(v) and ψ(u1), . . . , ψ(uk) in O(sort(k)) I/Os. The second condition is trivially satisfied if every vertex of G has in-degree no more than M.

Given these two assumptions, time-forward processing visits the vertices of G in topologically sorted order to compute labelling ψ. Visiting the vertices of G in this order guarantees that for every vertex v ∈ G, its in-neighbors are evaluated before v is evaluated. Thus, if these in-neighbors "send" their labels ψ(u1), . . . , ψ(uk) to v, then v has these labels and its own label φ(v) at its disposal to compute ψ(v). After computing ψ(v), v sends its own label ψ(v) "forward in time" to its out-neighbors, which guarantees that these out-neighbors have ψ(v) at their disposal when it is their turn to be evaluated.

The implementation of this technique due to Arge is simple and elegant. The "sending" of information is realized using a priority queue Q. When a vertex v wants to send its label ψ(v) to another vertex w, it inserts ψ(v) into priority queue Q and gives it priority w. When vertex w is evaluated, it removes all entries with priority w from Q. Since every in-neighbor of w sends its label to w by queuing it with priority w, this provides w with the required inputs. Moreover, every vertex removes its inputs from the priority queue before it is evaluated, and all vertices with smaller numbers are evaluated before w. Thus,


at the time when w is evaluated, the entries in Q with priority w are those with the lowest priority, so that they can be removed using a sequence of DELETEMIN operations.
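The following C++ sketch applies this scheme to the expression-tree example, with an in-memory std::priority_queue standing in for Arge's external buffer tree; vertex numbers are assumed to form a topological order (children before parents), and all names are illustrative:

#include <functional>
#include <queue>
#include <utility>
#include <vector>

// For a leaf, leafValue holds phi(v); for an internal vertex, op holds
// the operation; out[v] lists the out-neighbors (here: the parent).
struct Vertex {
    bool isLeaf;
    double leafValue;
    char op;                              // one of '+', '-', '*', '/'
    std::vector<int> out;
};

typedef std::pair<int, double> Msg;       // (receiver w, label psi(u))

// In-memory stand-in for the external buffer tree: a min-priority
// queue ordered by receiver.
std::vector<double> evaluateDAG(const std::vector<Vertex>& G) {
    std::priority_queue<Msg, std::vector<Msg>, std::greater<Msg> > Q;
    std::vector<double> psi(G.size());
    for (int v = 0; v < (int)G.size(); ++v) {
        std::vector<double> in;           // labels sent to v ...
        while (!Q.empty() && Q.top().first == v) {  // ... are Q's minima now
            in.push_back(Q.top().second);
            Q.pop();
        }
        if (G[v].isLeaf) {
            psi[v] = G[v].leafValue;
        } else {                          // a real implementation would tag
            double x = in[0], y = in[1];  // messages with the operand position
            switch (G[v].op) {
                case '+': psi[v] = x + y; break;
                case '-': psi[v] = x - y; break;
                case '*': psi[v] = x * y; break;
                default : psi[v] = x / y; break;
            }
        }
        for (int w : G[v].out)            // send psi(v) forward in time
            Q.push(Msg(w, psi[v]));
    }
    return psi;
}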

Using the buffer tree of Arge to implement priority queue Q, INSERT and DELETEMIN operations on Q can be performed in O((1/B) · log_{M/B}(|E|/B)) I/Os amortized, because priority queue Q never holds more than |E| entries. The total number of priority queue operations performed by the algorithm is O(|E|): one INSERT and one DELETEMIN operation per edge. Hence, all updates of priority queue Q can be processed in O(sort(|E|)) I/Os. The computation of labels ψ(v) from labels φ(v) and ψ(u1), . . . , ψ(uk), for all vertices v ∈ G, can also be carried out in O(sort(|E|)) I/Os, using the above assumption that this computation takes O(sort(k)) I/Os for a single vertex v. Hence, we obtain the following result.

Theorem 1 Given a DAG G = (V, E) whose vertices are stored in topologically sorted order, graph G can be evaluated in O(sort(|V| + |E|)) I/Os, provided that the computation of the label of every vertex v ∈ G can be carried out in O(sort(deg−(v))) I/Os, where deg−(v) is the in-degree of vertex v.

5.5 Cache-oblivious Algorithms

Have a look at Table 5.1, which gives the size and other attributes of different levels of the memory hierarchy on various systems. How can we write portable code that runs efficiently on different multilevel caching architectures? Not only is the external memory model restricted to two levels of memory; for most algorithms we also have to explicitly give values for M and B, which are different on every level and system. The cache-oblivious model suggests a solution: we want to design algorithms that are not given M and B and that nevertheless perform well on every memory hierarchy.

The ideal cache-oblivious memory model is a two-level memory model. We will assume that the faster level has size M and the slower level always transfers B consecutive words of data to the faster level. These two levels could represent the main memory and the disk, the cache and the main memory, or any two consecutive levels of the memory hierarchy. In this chapter, M and B can be taken to be the sizes of any two consecutive levels of the memory hierarchy, subject to some assumptions about them (for instance the inclusion property, which we will see soon). We will assume that the processor can access the faster level of memory, which has size M. If the processor references something from the second level of memory, an I/O fault occurs and B words are fetched into the faster level of the memory. We will refer to a block as the minimum unit that can be present or absent from a level in the two-level memory hierarchy, and use B to denote the size of a block as in the external memory model. If the faster level of the memory is full (i.e. M is full), a block gets evicted to make space.


                    Pentium 4   Pentium III   MIPS 10000   AMD Athlon   Itanium 2
Clock rate          2400 MHz    800 MHz       175 MHz      1333 MHz     1137 MHz
L1 data cache size  8 KB        16 KB         32 KB        128 KB       32 KB
L1 line size        128 B       32 B          32 B         64 B         64 B
L1 associativity    4-way       4-way         2-way        2-way        4-way
L2 cache size       512 KB      256 KB        1024 KB      256 KB       256 KB
L2 line size        128 B       32 B          32 B         64 B         128 B
L2 associativity    8-way       4-way         2-way        8-way        8-way
TLB entries         128         64            64           40           128
TLB associativity   full        4-way         64-way       4-way        full
RAM size            512 MB      256 MB        128 MB       512 MB       3072 MB

Table 5.1: some exemplary cache and memory configurations

The ideal cache-oblivious memory model enables us to reason in a two-level memory model like the external memory model, but to prove results about a multi-level memory model. Compared with the external memory model it seems surprising that without any memory-specific parametrization, or in other words, without specifying the parameters M, B, an algorithm can be efficient for the whole memory hierarchy; nevertheless it is possible. The model is built upon some basic assumptions, which we enumerate next.

Optimal replacement: The replacement policy refers to the policy chosen to replace a block when a cache miss occurs and the cache is full. In most hardware, this is implemented as FIFO, LRU or Random. The model assumes that the cache line chosen for replacement is the one that is accessed furthest in the future. This is known as the optimal off-line replacement strategy.

Two levels of memory: There are certain assumptions in the model regarding the two levels of memory chosen. They should follow the inclusion property, which says that data cannot be present at level i unless it is present at level i + 1. Another assumption is that the size of level i of the memory hierarchy is strictly smaller than that of level i + 1.

Full associativity: When a block of data is fetched from the slower level of the memory, it can reside in any part of the faster level.

Automatic replacement: When a block is to be brought into the faster level of the memory, this is done automatically by the OS/hardware, and the algorithm designer does not have to care about it while designing the algorithm. Note that we could access single blocks for reading and writing in the external memory model, which is not allowed in the cache-oblivious model.

We will now examine each of the assumptions individually. First we consider the optimal replacement policy. The most commonly used replacement policy is LRU (least recently used). We have the following lemma, whose proof is omitted here:


Lemma 2 An algorithm that causes Q∗(n, M, B) cache misses on a problem of size n using an (M, B)-ideal cache incurs Q(n, M, B) ≤ 2Q∗(n, M/2, B) cache misses on an (M, B) cache that uses LRU or FIFO replacement. This is only true for algorithms which satisfy a regularity condition.

An algorithm whose cache complexity satisfies the condition Q(n, M, B) ≤ O(Q(n, 2M, B)) is called regular (all algorithms presented in this chapter are regular). Intuitively, algorithms that slow down by only a constant factor when the memory size M is halved are regular. It immediately follows from the above lemma that if an algorithm whose number of cache misses satisfies the regularity condition incurs Q(n, M, B) cache misses with optimal replacement, then it incurs Θ(Q(n, M, B)) cache misses on a cache with LRU or FIFO replacement.

The automatic replacement and full associativity assumptions can be implemented in software by using an LRU implementation based on hashing. It was shown that a fully associative LRU replacement policy can be implemented in O(1) expected time using O(M/B) records of size O(B) in ordinary memory. Note that the above description of the cache-oblivious model shows that any optimal cache-oblivious algorithm can also be optimally implemented in the external memory model.

We now turn our attention to multi-level ideal caches. We assume that all the levels of this cache hierarchy follow the inclusion property and are managed by an optimal replacement strategy. Thus on each level, an optimal cache-oblivious algorithm incurs an asymptotically optimal number of cache misses. By Lemma 2, this remains true for cache hierarchies maintained by LRU and FIFO replacement strategies.

Apart from not knowing the values of M, B explicitly, some cache-oblivious algorithms (for example optimal sorting algorithms) require a tall cache assumption. The tall cache assumption states that M = Ω(B²), which is usually true in practice. Recently, compiler support for cache-oblivious type algorithms has also been investigated.

In cache-oblivious algorithm design, some design techniques are used ubiquitously. One of them is the scan of an array which is laid out in contiguous memory. Irrespective of B, a scan takes at most 1 + ⌈N/B⌉ I/Os. The argument is trivial and very similar to the external memory scan algorithm. The difference is that in the cache-oblivious setting the buffer of size B is not explicitly maintained in memory. In the assumptions of the model, B is the size of the data that is always fetched from level 2 memory to level 1 memory. The scan does not touch the level 2 memory until it is ready to evict the last loaded buffer of size B already in level 1. Hence, the total number of times the scan algorithm forces the CPU to bring buffers from the level 2 memory to level 1 memory is upper bounded by 1 + ⌈N/B⌉.

Another common approach in the cache-oblivious model is divide and conquer. Why does divide and conquer help for cache-oblivious algorithms in general? Divide and conquer algorithms split the instance of the problem to be solved into several subproblems such that each of the subproblems can be solved independently. Since the


Figure 5.5: cache aware matrix transposition using block access

algorithm recurses on the subproblems, at some point the subproblems fit inside M, and after further recursion, they fit inside B.

5.5.1 Matrix Transposition

We will see the recursive approach in our first example, matrix transposition. We first give an algorithm in the external memory model that requires knowledge of M and B. We then have the opportunity to compare both implementations (cache aware and cache oblivious) experimentally.

The naive matrix transposition algorithm:

for (i=0; i<N; i++)
  for (j=0; j<N; j++)
    C[j][i] = A[i][j];

accesses the source matrix A in row-major fashion but writes the target matrix C in column-major fashion; writing each column of C costs Θ(N) I/Os, leading to a total of Θ(N²) I/Os. But if M and B are known, we can switch from row-major access to block access: partition A and C into blocks of size r × s with r, s = Θ(√M), so that a source block and a target block fit into main memory together, and apply the naive algorithm to the N/r × N/s matrix of blocks. This requires O((N/r) · (N/s) · rs/B) = O(N²/B) I/Os, which is optimal.

The cache-oblivious approach works by partitioning the matrix into two blocks A1 and A2, transposing them recursively and writing the results into the appropriate blocks of the result matrix C.

Here is the C code for cache-oblivious matrix transposition. The following code takes as input a submatrix given by (x, y) − (x + delx, y + dely) in the input matrix I and


A = (A1 A2)        C = (C1
                        C2)

CO_Transpose(A, C):
  CO_Transpose(A1, C1);
  CO_Transpose(A2, C2);

Figure 5.6: pseudo code for cache oblivious matrix transposition

void transpose(int x, int delx, int y, int dely,
               ElementType I[N][P], ElementType O[P][N])
{
  if ((delx == 1) && (dely == 1)) {
    O[y][x] = I[x][y];
    return;
  }
  if (delx >= dely) {
    int xmid = delx / 2;
    transpose(x, xmid, y, dely, I, O);
    transpose(x+xmid, delx-xmid, y, dely, I, O);
    return;
  }
  // Similarly cut from ymid into two subproblems ...
}

Figure 5.7: C code for cache oblivious matrix transposition

transposes it to the output matrix O. ElementType1 can be any element type, for instance long.

The code works by divide and conquer, dividing the longer side of the matrix in the middle and recursing. It is short, easy to understand, and contains no tuning parameters that have to be tweaked for every new setup. The algorithm is always I/O-efficient:

Let the input be a matrix of size N × P. There are four cases:

1In all our experiments, ElementType was set to long.


B = 32                              B = 128
log2 N   Naive   CA     CO          log2 N   Naive   CA     CO
10       0.21    0.10   0.08        10       0.14    0.12   0.09
11       0.86    0.49   0.45        11       0.87    0.42   0.47
12       3.37    1.63   2.16        12       3.36    1.46   2.03
13       13.56   6.38   6.69        13       13.46   5.74   6.86

Table 5.2: Running time of naive, cache aware (CA) and cache oblivious (CO) matrix transposition for B = 32 and B = 128

Case I: max{N, P} ≤ αB. In this case,

    Q(N, P) ≤ NP/B + O(1).

Case II: N ≤ αB < P. In this case,

    Q(N, P) ≤ { O(1 + N)             if αB/2 ≤ P ≤ αB,
              { 2Q(N, P/2) + O(1)    if N ≤ αB < P.

Case III: P ≤ αB < N. Analogous to Case II.

Case IV: min{N, P} ≥ αB. In this case,

    Q(N, P) ≤ { O(N + P + NP/B)      if αB/2 ≤ N, P ≤ αB,
              { 2Q(N, P/2) + O(1)    if P ≥ N,
              { 2Q(N/2, P) + O(1)    if N ≥ P.

The above recurrence solves to Q(N, P) = O(1 + NP/B).

There is a simpler way to visualize the above mess. Once the recursion makes the matrix small enough that max(N, P) ≤ αB ≤ β√M (here β is a suitable constant), i.e. the submatrix (or block) we need to transpose fits in memory, the number of I/O faults equals that of a scan of the elements in the submatrix. A packing argument for these not-so-small submatrices (blocks) in the large input matrix shows that we do not incur many more I/O faults than a linear scan over all the elements.

Table 5.2 shows the result of an experiment performed on a 300 MHz UltraSPARC-II with 2 MB L2 cache, 16 KB L1 cache, page size 8 KB and 64 TLB entries. The (tuned) cache-aware implementation is slightly slower than the cache-oblivious one, but both outperform the naive implementation.

5.5.2 Searching Using Van Emde Boas Layout

In this section we describe a method to speed up simple binary searches on a balanced binary tree. This method could be used to optimize or speed up any kind of search on


Figure 5.8: memory layout of cache oblivious search trees

Figure 5.9: BFS structure of cache oblivious search trees

a tree, as long as the tree is static and balanced. It is easy to code, uses the fact that the memory consists of a cache hierarchy, and can be exploited to speed up tree-based search structures on most current machines. Experimental results show that this method can speed up searches by a factor of 5 or more in certain cases!

It turns out that a balanced binary tree has a very simple layout that is cache-oblivious. By layout we mean the mapping of the nodes of a binary tree to the indices of an array where the nodes are actually stored. The nodes should be stored in the array in the order shown in Figure 5.8 for searches to be fast and to use the cache hierarchy well.

Given a complete binary tree, we describe a mapping from the nodes of the tree to positions of an array in memory. Suppose the tree has N items and height h = log N + 1. Split the tree in the middle, at height h/2. This breaks the tree into a top recursive subtree of height ⌊h/2⌋ and several bottom subtrees B1, B2, . . . , Bk of height ⌈h/2⌉. There are √N bottom recursive subtrees, each of size √N. The top subtree occupies the top part of the array of allocated nodes, and then the Bi's are laid out. Every subtree is recursively laid out.

Another way to see the algorithm is to run a breadth first search on the top node of


the tree and run it until √N nodes are in the BFS, see Fig. 5.9. The figure shows the run of the algorithm for the first BFS, which covers √N nodes of the tree. Then the tree consists of the part that is covered by the BFS and the subtrees hanging off it. The BFS can now be recursively run on each subtree, including the covered part. Note that in the second level of recursion, the tree size is √N and the BFS covers only N^{1/4} nodes, since the same algorithm is run on each subtree of size √N. The main idea behind the algorithm is to store recursive subtrees in contiguous blocks of memory.

Let us now analyze the number of cache misses when a search is performed. We can conceptually stop the recursion at the level of detail where the subtrees have size ≤ B. Since these subtrees are stored contiguously, each fits into at most two blocks (a contiguously stored subtree of size at most B cannot span three blocks). The height of these subtrees is log B. A search path from root to leaf crosses O(log N / log B) = O(log_B N) subtrees. So the total number of cache misses is bounded by O(log_B N).
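The recursive layout described above is easy to compute. The following C++ sketch assumes a complete binary tree in the usual heap numbering (root 1, children 2i and 2i + 1) and records for every node its van Emde Boas array position; all names are illustrative:

#include <vector>

// Compute the van Emde Boas layout of a complete binary tree of height h.
// Nodes are identified by their heap/BFS numbers (root 1, children 2i, 2i+1);
// pos[i] receives the array position of node i, 'next' the next free slot.
void vebLayout(long root, int h, std::vector<long>& pos, long& next) {
    if (h == 1) { pos[root] = next++; return; }
    int top = h / 2, bottom = h - top;       // split at the middle level
    vebLayout(root, top, pos, next);         // lay out the top subtree first
    long subtrees = 1L << top;               // 2^top bottom subtrees
    for (long k = 0; k < subtrees; ++k)      // ... then each bottom subtree,
        vebLayout((root << top) + k,         // rooted at a child of a leaf of
                  bottom, pos, next);        // the top subtree, contiguously
}

// Usage: std::vector<long> pos(1L << h); long next = 0; vebLayout(1, h, pos, next);
// A search then navigates i -> 2i, 2i+1 as usual, but reads the key at pos[i].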

We did a very simple experiment to see how much this kind of layout would help in real life. A vector was sorted and a binary search tree was built on it. A query vector was generated with random numbers and searched on this BST, which was laid out in pre-order. We chose pre-order rather than a random layout because most people code a BST in pre-, post- or in-order rather than laying it out randomly (which incidentally is very bad for cache health). Once this query was done, we laid out the BST using the van Emde Boas layout and gave it the same query vector. The experiment reported in Fig. 5.10 was done on an Itanium dual processor system with 2 GB RAM (only one processor was being used).

5.5.3 Funnel sorting

Funnel sorting is a cache-oblivious sorting strategy. We will describe a simplification called lazy funnelsort, which was introduced by Brodal and Fagerberg [18]. Funnelsort, in turn, is a sort of lazy mergesort. This algorithm will be our first application of the tall-cache assumption (see Section 5.5). For simplicity, we assume that M = Ω(B²). The same results can be obtained when M = Ω(B^{1+γ}) by increasing the constant 3; refer to [18] for details. Interestingly, optimal cache-oblivious sorting is not achievable without the tall-cache assumption. The heart of the funnelsort algorithm is a static data structure which we call a funnel. For now, we treat a K-Funnel as a black box that merges K sorted lists of total size K³ using O((K³/B) · log_{M/B}(K³/B) + K) memory transfers. The space occupied by a K-Funnel is Θ(K²).

How should we choose K? The larger the K, the faster the algorithm, because we cannotpredict the optimal (M/B) multiplicity of the merge. This property suggests choosingK = N , in which case the entire sorting algorithm is in the merge. In fact, however, a


Figure 5.10: Comparison of van Emde Boas searches with pre-order searches on a balanced binary tree. Similar to the last experiment, this experiment was performed on an Itanium with 48 byte node size.

K-Funnel is fast only if it is fed at least K³ elements. Also, a K-Funnel occupies Θ(K²) space, and we want a linear-space algorithm. Thus, we choose K = N^{1/3}. Now the sorting algorithm proceeds as follows:

1. Split the array into K = N^{1/3} contiguous segments each of size N/K = N^{2/3}.

2. Recursively sort each segment.

3. Apply the K-Funnel to merge the sorted segments.
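An in-memory C++ sketch of these three steps, with a plain heap-based K-way merger standing in for the K-Funnel (the real funnel produces the same output within the I/O bound discussed below); all names are illustrative:

#include <algorithm>
#include <cmath>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Stand-in for the K-Funnel: a plain heap-based K-way merger.
void kFunnelMerge(const std::vector<std::vector<double> >& runs,
                  std::vector<double>& out) {
    typedef std::pair<double, size_t> Head;       // (front value, run index)
    std::priority_queue<Head, std::vector<Head>, std::greater<Head> > heads;
    std::vector<size_t> next(runs.size(), 0);
    for (size_t r = 0; r < runs.size(); ++r)
        if (!runs[r].empty()) heads.push(Head(runs[r][0], r));
    out.clear();
    while (!heads.empty()) {
        Head h = heads.top(); heads.pop();
        out.push_back(h.first);
        size_t r = h.second;
        if (++next[r] < runs[r].size())
            heads.push(Head(runs[r][next[r]], r));
    }
}

void funnelsort(std::vector<double>& a) {
    size_t n = a.size();
    if (n <= 16) { std::sort(a.begin(), a.end()); return; }  // small base case
    size_t k = (size_t)std::cbrt((double)n);      // K = N^{1/3} segments ...
    size_t seg = (n + k - 1) / k;                 // ... each of size ~N^{2/3}
    std::vector<std::vector<double> > runs;
    for (size_t i = 0; i < n; i += seg) {         // 1. split
        runs.push_back(std::vector<double>(a.begin() + i,
                                           a.begin() + std::min(i + seg, n)));
        funnelsort(runs.back());                  // 2. sort each segment
    }
    kFunnelMerge(runs, a);                        // 3. merge the sorted runs
}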

Memory transfers are made just in Steps 2 and 3, leading to the recurrence:

    T(N) = N^{1/3} T(N^{2/3}) + O((N/B) log_{M/B}(N/B) + N^{1/3})


Figure 5.11: Example of a funnel merger

The base case is T(O(B²)) = O(B), because the tall-cache assumption says that M ≥ B². Above the base case, N = Ω(B²), so B = O(√N), and the (N/B) log_{M/B}(N/B) cost dominates the N^{1/3} cost. The recursion tree has N/B² leaves, each costing O(B log_{M/B} B + B^{1/3}) = O(B) memory transfers, for a total leaf cost of O(N/B). The root divide-and-merge cost is O((N/B) log_{M/B}(N/B)), which dominates the recurrence. Thus, modulo the details of the funnel, we have proved the following theorem:

Theorem 3 Assuming M = Ω(B²), funnelsort sorts N comparable elements in O((N/B) log_{M/B}(N/B)) memory transfers.

It can also be shown that the number of comparisons is O(N log N); see [18] for details.

Now, what do K-Funnels look like? Our goal is to develop a K-Funnel which merges K sorted lists of total size K³ using O((K³/B) log_{M/B}(K³/B) + K) memory transfers and Θ(K²) space.

A K-Funnel is a complete binary tree with K leaves, stored according to the van Emde Boas layout we saw in Section 5.5.2. Thus, each of the recursive subtrees of a K-Funnel is a √K-funnel. In addition to the nodes, the edges in a K-Funnel store buffers; see Figure 5.11. The edges at the middle level of a K-Funnel, partitioning the funnel into two levels of recursive √K-subfunnels, have size K^{3/2} each, for a total buffer size of K² at that level. Buffers within the subfunnels are recursively smaller. We store these buffers of size K^{3/2} in the


recursive layout alongside the recursive √K-subfunnels within the K-Funnel. The buffers can be stored in an arbitrary order along with the recursive subtrees.

For consistency in describing the algorithms, we view a K-Funnel as having an additional buffer of size K³ along the edge connecting the root of the tree to its imaginary parent. To maintain the claim above that the storage is O(K²), this buffer is not actually stored; rather, it can be viewed as the output mechanism. The algorithm to fill this buffer above the root node, thereby merging the entire input, is a simple recursion. We merge the elements in the buffers along the left and right child edges of the node, as long as those two buffers remain nonempty. (Initially, all buffers are empty.) Whenever either of the buffers becomes empty, we recursively fill it. At the bottom of the tree, a leaf buffer (a buffer immediately below a leaf) corresponds to one of the input lists.

For the analysis of K-Funnels, we refer to [18].

5.5.4 Is the Model an Oversimplification?

In theory, both the cache-oblivious and the external memory models are nice to work with because of their simplicity. A lot of the work done in the external memory model has been turned into practical results as well. Before one gets one's hands "dirty" implementing an algorithm in the cache-oblivious or the external memory model, one should be aware of practical issues that might become detrimental to the speed of the code but are not caught in the theoretical setup.

Here we list a few practical glitches that are shared by both the cache-oblivious and the external memory model. The ones that are not shared are marked2 accordingly. A reader who wants to use these models to design practical algorithms, and especially one who wants to write code, should keep these issues in mind. Code written and algorithms designed keeping the following things in mind could be a lot faster than directly coding an algorithm that is optimal in either the cache-oblivious or the external memory model.

TLB^o: TLBs are caches on page tables, are usually small with 128–256 entries, and are like just any other cache. They can be implemented as fully associative. The model does not take into account the fact that TLBs are not tall.

Concurrency: The model does not talk about I/O and CPU concurrency, which automatically loses a factor of 2 in terms of constants. The need for speed might drive future uniprocessor systems to diversify and look for alternative solutions in terms of concurrency on a single chip; the hyper-threading3 introduced by Intel in its latest Xeons is a glaring example. On these kinds of systems and other multiprocessor systems,

2A superscript 'o' means this issue only applies to the cache-oblivious model.
3One physical Intel Xeon MP processor forms two logical processors which share the CPU's computational resources. The software sees two CPUs and can distribute the work load between them as in a normal dual processor system.


coherence misses might become an issue. This is hard to capture in the cache-oblivious model, and for most algorithms that have already been devised in this model, concurrency is still an open problem. A parallel cache-oblivious model would be very welcome for practitioners who would like to apply cache-oblivious algorithms to multiprocessor systems.

Associativity^o: The assumption of a fully associative cache is not so nice. In reality caches are either direct mapped or k-way associative (typically k = 2, 4, 8). If two objects map to the same location in the cache and are referenced in temporal proximity, the accesses become costlier than assumed in the model (also known as the cache interference problem). Also, k-way set associative caches are implemented using more comparators.

Instruction/Unified Caches: Rarely executed, special-case code disrupts locality. Loops with few iterations that call other routines make loop locality hard to exploit, and plenty of loopless code hampers temporal locality. Issues related to instruction caches are not modeled in the cache-oblivious model. Unified caches (e.g. the L2 and L3 caches of the latest Intel Itanium chips), used in some machines where instruction and data caches are merged (e.g. Intel PIII, Itaniums), are another challenge to handle in the model.

Replacement Policy^o: Current operating systems do not page more than 4 GB of memory because of address space limitations. That means one would have to use legacy code on these systems for paging. This problem makes portability of cache-oblivious code for big problems a myth! In the experiments reported in this chapter, we could not do external memory experimentation because the OS did not allow us to allocate array sizes of more than a GB or so. One can overcome this problem by writing one's own paging system on top of the OS to experiment with cache-oblivious algorithms on huge data sizes. But then it is not so clear whether writing a paging system is easier than handling disks explicitly in an application. This problem does not exist on 64-bit operating systems and should go away with time.

Multiple Disks^o: For "most" applications where data is huge and external memory algorithms are required, using multiple disks is an option to increase I/O efficiency. As of now, the cache-oblivious model does not take into account the existence of multiple disks in a system.

Write-through caches^o: The L1 caches in many new CPUs are write-through, i.e. they transmit a written value to the L2 cache immediately. Write-through caches are simpler to manage and can always discard cache data without any bookkeeping (read misses cannot result in writes). With write-through caches (e.g. DECStation 3100, Intel Itanium), one can no longer argue that there are no misses once the problem size fits into cache! Victim caches, implemented in HP and Alpha machines, are small buffers that reduce the effect of conflicts in set-associative caches. These should also be kept in mind when designing code for these machines.

Complicated Algorithms^o and Asymptotics: For non-trivial problems the algorithms


can become quite complicated and impractical, a glaring instance of which is sorting. The speeds by which different levels of memory differ in data transfer are constants! For instance, the speed difference between L1 and L2 caches on a typical Intel Pentium can be around 10. Using O() notation for an algorithm that is trying to beat a constant of 10, and sometimes not even talking about those constants while designing algorithms, can show up in practice. For instance, there are "constants" involved in simulating a fully associative cache on a k-way associative cache. Not using I/O concurrently with the CPU can make an algorithm lose another constant factor. Can one really afford to hide these constants in the design of a cache-oblivious algorithm in real code?

Despite these limitations the model does perform very well for some applications, but it might be outperformed by more coding effort combined with cache-aware algorithms. Here is an excerpt from an experimental paper by Chatterjee and Sen:

Our major conclusions are as follows: Limited associativity in the mapping from main memory addresses to cache sets can significantly degrade running time; the limited number of TLB entries can easily lead to thrashing; the fanciest optimal algorithms are not competitive on real machines even at fairly large problem sizes unless cache miss penalties are quite high; low level performance tuning "hacks", such as register tiling and array alignment, can significantly distort the effect of improved algorithms, ...

5.6 External BFS

The material of this section was taken from [12].

5.6.1 Introduction

Large graphs arise naturally in many applications, and very often we need to traverse these graphs to solve optimization problems. Breadth first search (BFS) is a fundamental graph traversal strategy. It decomposes the input graph G = (V, E) of n nodes and m edges into at most n levels, where level i comprises all nodes that can be reached from a designated source s via a path of i edges, but cannot be reached using fewer than i edges. Typical real-world applications of BFS on large graphs (and of some of its generalizations like shortest paths or A∗) include crawling and analyzing the WWW, route planning using small navigation devices with flash memory cards, and state space exploration.

BFS is well-understood in the RAM model. There exists a simple linear time algorithm (hereafter referred to as IM BFS) for BFS traversal of a graph. IM BFS keeps a set of appropriate candidate nodes for the next vertex to be visited in a FIFO queue Q. Furthermore, in order to find out the unvisited neighbours of a node from its adjacency list, it marks the nodes as either visited or unvisited. Unfortunately, even when half of the


Figure 5.12: A phase in the BFS algorithm of Munagala and Ranade. Level L(t) is composed of the disjoint neighbor vertices of level L(t − 1), excluding those vertices already existing in either L(t − 2) or L(t − 1).

graph fits into main memory, the running time of this algorithm deviates significantly from the predicted RAM performance (hours as compared to minutes), and for massive graphs such approaches are simply non-viable. As discussed before, the main cause of the poor performance of this algorithm on massive graphs is the number of I/Os it incurs. Remembering visited nodes needs Θ(m) I/Os in the worst case, and the unstructured indexed accesses to adjacency lists may result in Θ(n) I/Os.

5.6.2 Algorithm of Munagala and Ranade

We turn to the basic BFS algorithm of Munagala and Ranade [14], MR BFS for short.

Let L(t) denote the set of nodes in BFS level t, and let |L(t)| be the number of nodes in L(t). MR BFS builds L(t) as follows: let A(t) := N(L(t − 1)) be the multi-set of neighbor vertices of nodes in L(t − 1); N(L(t − 1)) is created by |L(t − 1)| accesses to the adjacency lists, one for each node in L(t − 1). Since the graph is stored in adjacency-list representation, this takes O(|L(t − 1)| + |N(L(t − 1))|/B) I/Os. Then the algorithm removes duplicates from the multi-set A(t). This can be done by sorting A(t) according to the node indices, followed by a scan and compaction phase; hence, the duplicate elimination takes O(sort(|A(t)|)) I/Os. The resulting set A′(t) is still sorted.

Now the algorithm computes L(t) := A′(t) \ (L(t − 1) ∪ L(t − 2)). Fig. 5.12 provides an example. Filtering out the nodes already contained in the sorted lists L(t − 1) or L(t − 2) is possible by parallel scanning. Therefore, this step can be done using

    O(sort(|N(L(t − 1))|) + scan(|L(t − 1)| + |L(t − 2)|))

I/Os. Since Σ_t |N(L(t))| = O(|E|) and Σ_t |L(t)| = O(|V|), the whole execution of MR BFS requires O(|V| + sort(|E|)) I/Os.
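The following C++ sketch shows one phase of MR BFS in memory, with std::sort standing in for external sorting; the duplicate removal and the subtraction of L(t − 1) and L(t − 2) are scans of sorted sequences, and all names are illustrative:

#include <algorithm>
#include <iterator>
#include <vector>

// One phase of MR_BFS, in-memory for illustration.
std::vector<int> nextLevel(const std::vector<std::vector<int> >& adj,
                           const std::vector<int>& Lprev,   // L(t-1), sorted
                           const std::vector<int>& Lprev2)  // L(t-2), sorted
{
    std::vector<int> A;                       // multi-set N(L(t-1))
    for (int v : Lprev)                       // |L(t-1)| adjacency-list accesses
        A.insert(A.end(), adj[v].begin(), adj[v].end());
    std::sort(A.begin(), A.end());            // sort(|A(t)|)
    A.erase(std::unique(A.begin(), A.end()), A.end());  // remove duplicates
    std::vector<int> L, tmp;                  // L(t) = A' \ (L(t-1) u L(t-2))
    std::set_difference(A.begin(), A.end(), Lprev.begin(), Lprev.end(),
                        std::back_inserter(tmp));
    std::set_difference(tmp.begin(), tmp.end(), Lprev2.begin(), Lprev2.end(),
                        std::back_inserter(L));
    return L;
}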

The correctness of this BFS algorithm crucially depends on the fact that the input graph is undirected. Assume that the levels L(0), . . . , L(t − 1) have already been computed correctly. We consider a neighbor v of a node u ∈ L(t − 1): the distance from s


to v is at least t − 2, because otherwise the distance of u would be less than t − 1. Thus v ∈ L(t − 2) ∪ L(t − 1) ∪ L(t), and hence it is correct to assign precisely the nodes in A′(t) \ (L(t − 1) ∪ L(t − 2)) to L(t).

Theorem 4 BFS on arbitrary undirected graphs can be solved using O(|V| + sort(|V| + |E|)) I/Os.

5.6.3 An Improved BFS Algorithm with sublinear I/O

The MM BFS algorithm of Mehlhorn and Meyer [15] refines the approach of Munagala and Ranade [14]. It trades off unstructured I/Os against an increase in the number of iterations in which an edge may be involved. MM BFS operates in two phases: in a first phase it preprocesses the graph, and in a second phase it performs BFS using the information gathered in the first phase. We first sketch a variant with randomized preprocessing. Then we outline a deterministic version.

The Randomized Partitioning Phase

The preprocessing step partitions the graph into disjoint connected subgraphs S_i, 0 ≤ i ≤ K, with small expected diameter. It also partitions the adjacency lists accordingly, i.e., it constructs an external file F = F_0 F_1 . . . F_i . . . F_{K−1}, where F_i contains the adjacency lists of all nodes in S_i. The partition is built by choosing master nodes independently and uniformly at random with probability µ = min{1, √((|V| + |E|)/(B · |V|))} and running a local BFS from all master nodes "in parallel" (for technical reasons, the source node s is made the master node of S_0): in each round, each master node s_i tries to capture all unvisited neighbors of its current subgraph S_i; this is done by first sorting the nodes of the active fringes of all S_i (the nodes that have been captured in the previous round) and then scanning the dynamically shrinking adjacency-lists representation of the yet unexplored graph. If several master nodes want to include a certain node v into their partitions, then an arbitrary master node among them succeeds. The selection can be done by sorting and scanning the created set of neighbor nodes.

The expected number of master nodes is K := O(1 + µ · n), and the expected shortest-path distance (number of edges) between any two nodes of a subgraph is at most 2/µ. Hence, the expected total amount of data being scanned from the adjacency-lists representation during the "parallel partition growing" is bounded by

    X := O(Σ_{v∈V} 1/µ · (1 + degree(v))) = O((|V| + |E|)/µ).

The total number of fringe nodes and neighbor nodes sorted and scanned during the partitioning is at most Y := O(|V| + |E|). Therefore, the partitioning requires

    O(scan(X) + sort(Y)) = O(scan(|V| + |E|)/µ + sort(|V| + |E|))


expected I/Os.

After the partitioning phase each node knows the (index of the) subgraph to which it belongs. With a constant number of sort and scan operations, MM BFS can reorganize the adjacency lists into the format F_0 F_1 . . . F_i . . . F_{|S|−1}, where F_i contains the adjacency lists of the nodes in partition S_i; an entry (v, w, S(w), f_{S(w)}) from the adjacency list of v ∈ F_i stands for the edge (v, w) and provides the additional information that w belongs to subgraph S(w), whose subfile F_{S(w)} starts at position f_{S(w)} within F. The edge entries of each F_i are lexicographically sorted. In total, F occupies O((|V| + |E|)/B) blocks of external storage.

The BFS Phase

In the second phase the algorithm performs BFS as described by Munagala and Ranade (Section 5.6.2) with one crucial difference: MM BFS maintains an external file H (the hot adjacency lists); it comprises unused parts of the subfiles F_i that contain a node in the current level L(t − 1). MM BFS initializes H with F_0. Thus, initially, H contains the adjacency list of the root node s of level L(0). The nodes of each created BFS level will also carry identifiers for the subfiles F_i of their respective subgraphs S_i.

When creating level L(t) based on L(t − 1) and L(t − 2), MM BFS does not access single adjacency lists as MR BFS does. Instead, it performs a parallel scan of the sorted lists L(t − 1) and H and extracts N(L(t − 1)). In order to maintain the invariant that H contains the adjacency lists of all vertices on the current level, the subfiles F_i of nodes whose adjacency lists are not yet included in H are merged with H. This can be done by first sorting the respective subfiles and then merging the sorted set with H using one scan. Each subfile F_i is added to H at most once. After an adjacency list has been copied to H, it will be used for only O(1/µ) expected steps; afterwards it can be discarded from H. Thus, the expected total data volume for scanning H is O(1/µ · (|V| + |E|)), and the expected total number of I/Os to handle H and the F_i is O(µ · |V| + sort(|V| + |E|) + 1/µ · scan(|V| + |E|)). The final result follows with µ = min{1, √(scan(|V| + |E|)/|V|)}.

Theorem 5 ([15]) External memory BFS on undirected graphs can be solved using O(√(|V| · scan(|V| + |E|)) + sort(|V| + |E|)) expected I/Os.

The Deterministic Variant

In order to obtain the result of Theorem 5 in the worst case, too, it is sufficient to modify the preprocessing phase as follows: instead of growing subgraphs around randomly selected master nodes, the deterministic variant extracts the subfiles F_i from an Euler tour around a spanning tree of the connected component C_s that contains the source node s.


Figure 5.13: Using an Euler tour around a spanning tree of the input graph in order to obtain a partition for the deterministic BFS algorithm.

Observe that C_s can be obtained with the deterministic connected-components algorithm of [14] using O((1 + log log(B · |V|/|E|)) · sort(|V| + |E|)) = O(√(|V| · scan(|V| + |E|)) + sort(|V| + |E|)) I/Os. The same number of I/Os suffices to compute a (minimum) spanning tree T_s for C_s [20].

After T_s has been built, the preprocessing constructs an Euler tour around T_s using a constant number of sort and scan steps [16]. Then the tour is broken at the root node s; the elements of the resulting list can be stored in consecutive order using the deterministic list ranking algorithm of [16]. This takes O(sort(|V|)) I/Os. Subsequently, the Euler tour can be cut into pieces of size 2/µ in a single scan. These Euler tour pieces account for subgraphs S_i with the property that the distance between any two nodes of S_i in G is at most 2/µ − 1. See Fig. 5.13 for an example. Observe that a node v of degree d may be part of Θ(d) different subgraphs S_i. However, with a constant number of sorting steps it is possible to remove multiple node appearances and make sure that each node of C_s is part of exactly one subgraph S_i. Eventually, the reduced subgraphs S_i are used to create the reordered adjacency-list files F_i; this is done as in the randomized preprocessing and takes another O(sort(|V| + |E|)) I/Os. Note that the reduced subgraphs S_i may not be connected any more; however, this does not matter, as our approach only requires that any two nodes in a subgraph are relatively close in the original input graph.
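The chopping step itself is a single scan. A small C++ sketch, assuming the ranked tour is already stored as a consecutive sequence of node ids; 'len' plays the role of 2/µ, and the removal of duplicate node appearances is the separate sorting step described above:

#include <algorithm>
#include <vector>

// Chop a ranked Euler tour (node ids stored consecutively after list
// ranking) into pieces of 'len' consecutive entries; each piece defines
// one subgraph S_i.
std::vector<std::vector<int> > chopTour(const std::vector<int>& tour, size_t len) {
    std::vector<std::vector<int> > pieces;
    for (size_t i = 0; i < tour.size(); i += len)
        pieces.push_back(std::vector<int>(
            tour.begin() + i,
            tour.begin() + std::min(i + len, tour.size())));
    return pieces;
}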

The BFS phase of the algorithm remains unchanged; the modified preprocessing, however, guarantees that each adjacency list will be part of the external set H for at most 2/µ BFS levels: if a subfile F_i is merged with H for BFS level L(t), then the BFS level of any node v in S_i is at most L(t) + 2/µ − 1. Therefore, the adjacency list of v in F_i will be kept in H for at most 2/µ BFS levels.

Theorem 6 ([15]) External memory BFS on undirected graphs can be solved using O(√(|V| · scan(|V| + |E|)) + sort(|V| + |E|)) I/Os in the worst case.


5.6.4 Improvements in the previous implementations of MR BFS and MM BFS R

The computation of each level of MR BFS involves sorting and scanning the neighbours of the nodes in the previous level. Even if there are very few elements to be sorted, there is a certain overhead associated with initializing the external sorters. In particular, while the Stxxl stream sorter (with the flag STXXL_SMALL_INPUT_PSORT_OPT) does not incur an I/O for sorting less than B elements, it still needs to allocate some memory and does some computation for initialization. This overhead accumulates over all levels, and for large-diameter graphs it dominates the running time. This problem is also inherited by the BFS phase of MM BFS (we use MM BFS R to refer to the randomized variant and MM BFS D to the deterministic variant of MM BFS). Since in the pipelined implementation of [17] we do not know in advance the exact number of elements to be sorted, we cannot simply switch between the external and the internal sorter. In order to get around this problem, we buffer the first B elements and initialize the external sorter only when the buffer is full; otherwise, we sort the buffer internally.
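The following sketch illustrates the buffering idea; it is not the actual implementation: ExternalSorter is a stand-in for the Stxxl stream sorter, and B is the number of elements per disk block.

#include <algorithm>
#include <cstddef>
#include <memory>
#include <vector>

// Sketch: sort small inputs internally; only create the (expensive to
// initialize) external sorter once the first B elements have arrived.
template <typename T, typename ExternalSorter>
class AdaptiveSorter {
    std::vector<T> buffer_;               // holds up to the first B elements
    std::unique_ptr<ExternalSorter> ext_; // created lazily, only if really needed
    std::size_t B_;
public:
    explicit AdaptiveSorter(std::size_t B) : B_(B) { buffer_.reserve(B); }

    void push(const T& x) {
        if (ext_) { ext_->push(x); return; }
        buffer_.push_back(x);
        if (buffer_.size() == B_) {                       // buffer full: pay the
            ext_ = std::make_unique<ExternalSorter>();    // initialization cost now
            for (const T& y : buffer_) ext_->push(y);
            buffer_.clear();
        }
    }

    std::vector<T> finish() {             // small inputs: one internal sort, no I/O
        if (!ext_) { std::sort(buffer_.begin(), buffer_.end()); return buffer_; }
        return ext_->finish();            // large inputs: external merge as usual
    }
};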

In addition to this, we make the graph representation for MR BFS more compact: apart from the source and destination node pair, no other information is stored with the edges.

Designing MM BFS D

Graph class          n      m      Long clusters   Random clusters
Grid(2^14 × 2^14)    2^28   2^29   51              28

Table 5.3: Time taken (in hours) by the BFS phase of MM BFS D with long and random clustering

As for list ranking, we found Sibeyn's algorithm (discussed in Section 5.9) promising, as it has low constant factors in its I/O complexity. Sibeyn's implementation relies on the operating system for I/Os and does not guarantee that the top blocks of all the stacks remain in internal memory, which is a necessary assumption for the asymptotic analysis of the algorithm. Besides, its reliance on internal arrays and swap space puts a restriction on the size of the lists it can rank. The deeper integration of the algorithm into the Stxxl framework, using the Stxxl stacks and vectors in particular, made it possible to obtain a scalable solution which can handle graph instances of the size we require while keeping the theoretical worst-case bounds.



Figure 5.14: Schema depicting the implementation of our heuristic

To summarize, our Stxxl-based implementation of MM BFS D uses our adaptation of Sibeyn's algorithm for list-ranking the Euler tour around the minimum spanning tree computed by EM MST. The Euler tour is then chopped into sets of √B consecutive nodes, which after duplicate removal gives the requisite graph partitioning. The BFS phase remains similar to MM BFS R.

Quality of the spanning tree The quality of the computed spanning tree can have a significant impact on the clustering and on the disk layout of the adjacency lists after the deterministic preprocessing, and consequently on the BFS phase. For instance, in the case of grid graphs, a spanning tree containing a list with elements in a snake-like row-major order produces long and narrow clusters, while a "random" spanning tree is likely to result in clusters with low diameters. Such a "random" spanning tree can be obtained by assigning random weights to the edges of the graph and then computing a minimum spanning tree, or by randomly permuting the indices of the nodes. The nodes in the long and narrow clusters tend to stay longer in the pool and therefore their adjacency lists are scanned more often. This causes the pool to grow beyond the internal memory and results in a larger I/O volume. On the other hand, low-diameter clusters are evicted from the pool sooner and are scanned less often, reducing the I/O volume of the BFS phase. Consequently, as Table 5.3 shows, the BFS phase of MM BFS D takes only 28 hours with clusters produced by a "random" spanning tree, while it takes 51 hours with long and narrow clusters.

5.6.5 A Heuristic for maintaining the pool

As noted above, the asymptotic improvement and the performance gain of MM BFS over MR BFS is obtained by decomposing the graph into low-diameter clusters and maintaining an efficiently accessible pool of adjacency lists which will be required in the next few levels.


Whenever the first node of a cluster is visited during the BFS, the remaining nodes of this cluster will be reached soon after, and hence the cluster is loaded into the pool. For computing the neighbours of the nodes in the current level, we just need to scan the pool and not the entire graph. Efficient management of this pool is thus crucial for the performance of MM BFS. In this section, we propose a heuristic for efficient management of the pool, while keeping the worst-case I/O bounds of MM BFS.

For many large-diameter graphs, the pool fits into the internal memory most of the time. However, even if the number of edges in the pool is not very large, scanning all the edges in the pool for each level can be computationally quite expensive. Hence, we keep the portion of the pool that fits in the internal memory as a multi-map hash table: given a node as a key, it returns all the nodes adjacent to that node. Thus, to get the neighbours of a set of nodes, we just query the hash table for those nodes and delete them from it. To load a cluster, we insert all the adjacency lists of the cluster into the hash table, unless the hash table already has O(M) elements.
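A minimal sketch of this in-memory part of the pool, assuming a multimap keyed by node id; the names, types, and the capacity check are ours, not the original code.

#include <cstddef>
#include <unordered_map>
#include <utility>
#include <vector>

using Node = unsigned;
using Pool = std::unordered_multimap<Node, Node>;  // node -> one neighbour per entry

// Load a cluster's adjacency lists, unless the table already holds O(M) entries;
// the overflow part of the pool then stays in external memory.
void load_cluster(Pool& pool,
                  const std::vector<std::pair<Node, Node>>& cluster_edges,
                  std::size_t capacity) {
    if (pool.size() + cluster_edges.size() > capacity) return;
    for (const auto& e : cluster_edges) pool.insert(e);
}

// Fetch and remove the neighbours of v: one hash lookup instead of a pool scan.
std::vector<Node> pop_neighbours(Pool& pool, Node v) {
    std::vector<Node> result;
    auto range = pool.equal_range(v);
    for (auto it = range.first; it != range.second; ++it) result.push_back(it->second);
    pool.erase(range.first, range.second);
    return result;
}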

Recall that after the deterministic preprocessing, the elements are stored on disk in the order in which they appear on the Euler tour around a spanning tree of the input graph. The Euler tour is then chopped into clusters of √B elements (before duplicate removal), ensuring that the maximum distance between any two nodes in a cluster is at most √B − 1. However, the fact that elements contiguous on disk are also close in terms of BFS levels is not restricted to intra-cluster adjacency lists. The adjacency lists that come alongside the requisite cluster will also be required soon, and by caching these other adjacency lists we can save I/Os in the future. This caching is particularly beneficial when the pool fits in the internal memory. Note that we still load the √B-node clusters into the pool, but keep the remaining elements of the block in the pool-cache. For line graphs, this means that we load the √B nodes into the internal pool while keeping the remaining O(B) adjacency lists obtained in the same block in the pool-cache, thereby reducing the I/O complexity of the BFS traversal on line graphs to that of computing a spanning tree.

We represent the adjacency lists of the nodes in the graph as a Stxxl vector. Stxxl already provides a fully associative vector-cache with every vector. Before doing an I/O for loading a block of elements from the vector, it first checks whether the block is already in the vector-cache. If so, it avoids the I/O and loads the elements from the cache instead. Increasing the vector-cache size of the adjacency-list vector, whose layout is computed by the deterministic preprocessing, and choosing LRU as the replacement policy provides us with an implementation of the pool-cache. Figure 5.14 depicts the implementation of our heuristic.
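A sketch of how such a vector type could be declared; the VECTOR_GENERATOR parameter order and the concrete values follow the Stxxl documentation of that era and should be treated as assumptions, not as the configuration used in the experiments.

#include <stxxl/vector>

struct Edge { unsigned target; };

// Adjacency-list vector whose page cache doubles as the pool-cache. The
// template arguments are, in order: value type, blocks per page, pages kept
// in the vector-cache, block size in bytes, allocation strategy, and pager;
// the numbers below are illustrative only.
typedef stxxl::VECTOR_GENERATOR<
    Edge,
    4,                 // blocks per page
    64,                // cached pages, enlarged to act as the pool-cache
    2 * 1024 * 1024,   // block size in bytes
    stxxl::RC,         // striping of blocks over the disks
    stxxl::lru         // LRU replacement, as required by the heuristic
>::result AdjacencyVector;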


5.7 Maximal Independent Set

In this section we describe a simple technique proposed in [21] that can be used to make internal memory graph algorithms of a sufficiently simple structure I/O-efficient. For this technique to be applicable, the algorithm has to compute a labelling of the vertices of the graph, and it has to do so in a particular way. We call a vertex-labelling algorithm A single-pass if it computes the desired labelling λ of the vertices of the graph by visiting every vertex exactly once and assigns label λ(v) to v during this visit. We call A local if label λ(v) can be computed in O(sort(k)) I/Os from labels λ(u1), . . . , λ(uk), where u1, . . . , uk are the neighbors of v whose labels are computed before λ(v). Finally, algorithm A is presortable if there is an algorithm that takes O(sort(|V| + |E|)) I/Os to compute an order of the vertices of the graph so that A produces a correct result if it visits the vertices of the graph in this order. The technique we describe here is applicable if algorithm A is presortable, local, and single-pass.

So let A be a presortable local single-pass vertex-labelling algorithm computing some labelling λ of the vertices of a graph G = (V, E). In order to make algorithm A I/O-efficient, the two main problems are to determine an order in which algorithm A should visit the vertices of G and to devise a mechanism that provides every vertex v with the labels of its previously visited neighbors u1, . . . , uk. Since algorithm A is presortable, there exists an algorithm A′ that takes O(sort(|V| + |E|)) I/Os to compute an order of the vertices of G so that algorithm A produces the correct result if it visits the vertices of G in this order. Assume w.l.o.g. that this ordering of the vertices of G is expressed as a numbering. We use algorithm A′ to number the vertices of G and then derive a DAG G′ from G by directing every edge of G from the vertex with the smaller number to the vertex with the larger number. DAG G′ has the property that for every vertex v, the in-neighbors of v in G′ are exactly those neighbors of v that are labelled before v. Hence, labelling λ can be computed using time-forward processing. In particular, by the locality of A, the label λ(v) of every vertex can be computed in O(sort(k)) I/Os from the labels λ(u1), . . . , λ(uk) of its in-neighbors u1, . . . , uk in G′, which is a simplified version of the condition for the applicability of time-forward processing. This leads to the following result.

Theorem 7 ([21]) Every graph problem P that can be solved by a presortable local single-pass vertex labelling algorithm can be solved in O(sort(|V| + |E|)) I/Os.

An important observation to be made is that in this application of time-forward processing, the restriction that the vertices of the DAG to be evaluated have to be given in topologically sorted order does not pose a problem, because the directions of the edges are chosen only after fixing an order of the vertices that is to be the topological order.

In order to compute a maximal independent set S of a graph G = (V, E) in internal memory, the following simple algorithm can be used: Process the vertices in an arbitrary order. When a vertex v ∈ V is visited, add it to S if none of its neighbors is in S.


Translated into a labelling problem, the goal is to compute the characteristic function χS : V → {0, 1} of S, where χS(v) = 1 if v ∈ S, and χS(v) = 0 if v ∉ S. Also note that if S is initially empty, then any neighbor w of v that is visited after v cannot be in S at the time when v is visited, so that it is sufficient for v to inspect all its neighbors that are visited before v to decide whether or not v should be added to S. The result of these modifications is a vertex-labelling algorithm that is presortable (since the order in which the vertices are visited is unimportant), local (since only previously visited neighbors of v are inspected to decide whether v should be added to S, and a single scan of the labels χS(u1), . . . , χS(uk) suffices to do so), and single-pass. This leads to the following result.
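For concreteness, a small sketch of the single-pass MIS labelling in internal memory; the vertex numbering is assumed to be the visiting order, and in the external setting the labels of the in-neighbours would be delivered by time-forward processing instead of the direct array accesses used here.

#include <vector>

// Single-pass MIS labelling: chi[v] is computed from the labels of neighbours
// with smaller number only (the in-neighbours of the derived DAG G').
std::vector<bool> maximal_independent_set(
        const std::vector<std::vector<int>>& adj) {   // adjacency lists
    const int n = static_cast<int>(adj.size());
    std::vector<bool> chi(n, false);                  // characteristic function of S
    for (int v = 0; v < n; ++v) {                     // one pass over the vertices
        bool free = true;
        for (int u : adj[v])
            if (u < v && chi[u]) { free = false; break; } // earlier neighbour in S
        chi[v] = free;                                // add v to S iff no neighbour is in S
    }
    return chi;
}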

Theorem 8 Given an undirected graph G = (V, E), a maximal independent set of G can be found in O(sort(|V| + |E|)) I/Os and linear space.

5.8 Euler Tours

An Euler tour of a tree T = (V, E) is a traversal of T that traverses every edge exactly twice, once in each direction. Such a traversal is useful, as it produces a linear list of vertices or edges that captures the structure of the tree. Hence, it allows standard parallel or external memory algorithms to be applied to this list, in order to solve problems on tree T that can be expressed as some function to be evaluated over the Euler tour.

Formally, the tour is represented as a linked list L whose elements are the edges in the set {(v, w), (w, v) : {v, w} ∈ E}, such that for any two consecutive edges e1 and e2, the target of e1 is the source of e2. In order to define an Euler tour, choose a circular order of the edges incident to each vertex of T. Let {v, w1}, . . . , {v, wk} be the edges incident to vertex v. Then let succ((wi, v)) = (v, wi+1), for 1 ≤ i < k, and succ((wk, v)) = (v, w1). The result is a circular linked list of the edges in T. Now an Euler tour of T starting at some vertex r and returning to that vertex can be obtained by choosing an edge (v, r) with succ((v, r)) = (r, w), setting succ((v, r)) = null, and choosing (r, w) as the first edge of the traversal.

List L can be computed from the edge set of T in O(sort(N)) I/Os: First scan set E to replace every edge {v, w} with two directed edges (v, w) and (w, v). Then sort the resulting set of directed edges by their target vertices. This stores the incoming edges of every vertex consecutively. Hence, a scan of the sorted edge list now suffices to compute the successor of every edge in L.
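An internal-memory sketch of this construction (the external version would replace the grouping by an I/O-efficient sort; all names here are ours):

#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Build the Euler-tour successor function of a tree given as an undirected
// edge list. Each edge {v,w} becomes the two arcs (v,w) and (w,v); for the
// arcs v->w1,...,v->wk around v we set succ((wi,v)) = (v, w(i+1 mod k)).
using Arc = std::pair<int, int>;

std::map<Arc, Arc> euler_tour_succ(const std::vector<std::pair<int,int>>& edges,
                                   int n) {
    std::vector<std::vector<int>> out(n);          // circular order around each vertex
    for (auto [v, w] : edges) { out[v].push_back(w); out[w].push_back(v); }
    std::map<Arc, Arc> succ;
    for (int v = 0; v < n; ++v)
        for (std::size_t i = 0; i < out[v].size(); ++i) {
            int wi = out[v][i];
            int wnext = out[v][(i + 1) % out[v].size()];
            succ[{wi, v}] = {v, wnext};            // enter v via wi, leave towards wnext
        }
    return succ;   // a circular list; cut it at the root to obtain the tour
}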

Theorem 9 An Euler tour L of a tree with N vertices can be computed in O(sort(N)) I/Os.

Given an unrooted (and undirected) tree T, choosing one vertex of T as the root defines a direction on the edges of T by requiring that every edge be directed from the parent to the child.


The process of rooting tree T is that of computing these directions explicitly for all edges of T. To do this, we construct an Euler tour starting at an edge (r, v) and compute the rank of every edge in the list. For every pair of opposite edges (u, v) and (v, u), we call the edge with the lower rank a forward edge, and the other a back edge. Now it suffices to observe that for any vertex x ≠ r in T, edge (parent(x), x) is traversed before edge (x, parent(x)) by any Euler tour starting at r. Hence, for every pair of adjacent vertices x and parent(x), edge (parent(x), x) is a forward edge, and edge (x, parent(x)) is a back edge. That is, the set of forward edges is the desired set of edges directed from parents to children. Constructing and ranking an Euler tour starting at the root r takes O(sort(N)) I/Os. Given the ranks of all edges, the set of forward edges can be extracted by sorting all edges in L so that for any two adjacent vertices v and w, edges (v, w) and (w, v) are stored consecutively, and then scanning this sorted edge list to discard the edge with the higher rank from each of these edge pairs. Hence, a tree T can be rooted in O(sort(N)) I/Os.

Instead of discarding back edges, it may be useful to keep them, but tag every edge of the Euler tour L as either a forward or a back edge. Using this information, well-known labellings of the vertices of T can be computed by ranking list L after assigning appropriate weights to the edges of L. For example, consider the weighted ranks of the edges in L after assigning weight one to every forward edge and weight zero to every back edge. Then the preorder number of every vertex v ≠ r in T is one more than the weighted rank of the forward edge with target v; the preorder number of the root r is always one. The size of the subtree rooted at v is one more than the difference between the weighted ranks of the back edge with source v and the forward edge with target v. To compute a postorder numbering, we assign weight zero to every forward edge and weight one to every back edge. Then the postorder number of every vertex v ≠ r is the weighted rank of the back edge with source v. The postorder number of the root r is always N.

After labelling every edge in L as a forward or back edge, the appropriate weights for computing the above labellings can be assigned in a single scan of list L. The weighted ranks can then be computed in O(sort(N)) I/Os by Theorem 10. Extracting preorder and postorder numbers from these ranks takes a single scan of list L again. To extract the sizes of the subtrees rooted at the vertices of T, we sort the edges in L so that opposite edges with the same endpoints are stored consecutively. Then a single scan of this sorted edge list suffices to compute the size of the subtree rooted at every vertex v. Hence, all these labels can be computed in O(sort(N)) I/Os for a tree with N vertices.
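As a small illustration of the preorder rule, a sketch that assumes the tour is already given in ranked (tour) order; in the external setting the weighted ranks would come from the list-ranking algorithm of Section 5.9 instead of the running sum used here.

#include <vector>

struct TourEdge { int target; bool forward; };  // one entry per arc of the Euler tour

// Assign weight 1 to forward and 0 to back edges; the running prefix sum is
// the weighted rank, and preorder(v) = rank(forward edge into v) + 1.
std::vector<int> preorder_numbers(const std::vector<TourEdge>& tour,
                                  int n, int root) {
    std::vector<int> pre(n);
    pre[root] = 1;                        // the root always gets preorder number 1
    int rank = 0;
    for (const TourEdge& e : tour) {
        rank += e.forward ? 1 : 0;        // weighted rank of the current edge
        if (e.forward) pre[e.target] = rank + 1;
    }
    return pre;
}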

5.9 List Ranking

List ranking and the Euler tour technique (Section 5.8) are two techniques that have been applied successfully in the design of PRAM algorithms for labelling problems on lists and rooted trees, and for problems that can be reduced efficiently to one of these problems.



Figure 5.15: Example input and output for the list ranking task

Given the similarity of the issues to be addressed in parallel and external memory algorithms, it is not surprising that the same two techniques can be applied in I/O-efficient algorithms as well.

Let L be a linked list, i.e., a collection of vertices x1, . . . , xN such that each vertex xi, except the tail of the list, stores a pointer succ(xi) to its successor in L, no two vertices have the same successor, and every vertex can reach the tail of L by following successor pointers. Given a pointer to the head of the list (i.e., the vertex that no other vertex in the list points to), the list ranking problem is that of computing for every vertex xi of list L its distance from the head of L, i.e., the number of edges on the path from the head of L to xi.

In internal memory this problem can easily be solved in linear time using the following algorithm: Starting at the head of the list, follow successor pointers and number the vertices of the list from 0 to N − 1 in the order they are visited. Often we use the term "list ranking" to denote the following generalization of the list ranking problem, which is solvable in linear time using a straightforward generalization of the above algorithm: Given a function λ : {x1, . . . , xN} → X assigning labels to the vertices of list L and a multiplication ⊗ : X × X → X defined over X, compute a label φ(xi) for each vertex xi of L such that φ(xσ(1)) = λ(xσ(1)) and φ(xσ(i)) = φ(xσ(i−1)) ⊗ λ(xσ(i)), for 1 < i ≤ N, where σ : [1, N] → [1, N] is a permutation such that xσ(1) is the head of L and succ(xσ(i)) = xσ(i+1), for 1 ≤ i < N.

Unfortunately, the simple internal memory algorithm is not I/O-efficient: Since we have no control over the physical order of the vertices of L on disk, an adversary can easily arrange the vertices of L in a manner that forces the internal memory algorithm to perform one I/O per visited vertex, so that the algorithm performs Ω(N) I/Os in total.


On the other hand, the lower bound for list ranking shown in [16] is only Ω(perm(N)). Next we sketch a list ranking algorithm proposed in [16] that takes O(sort(N)) I/Os and thereby closes the gap between the lower and the upper bound.

We make the simplifying assumption that multiplication over X is associative. If this is not the case, we determine the distance of every vertex from the head of L, sort the vertices of L by increasing distances, and then compute the prefix product using the internal memory algorithm. After arranging the vertices by increasing distances from the head of L, the internal memory algorithm takes O(scan(N)) I/Os. Hence, the whole procedure still takes O(sort(N)) I/Os, and the associativity assumption is not a restriction.

Given that multiplication over X is associative, the algorithm of [16] uses graph contraction to rank list L as follows: First an independent set I of L is found so that |I| = Ω(N). Then the elements in I are removed from L. That is, for every element x ∈ I with predecessor y and successor z in L, the successor pointer of y is updated to succ(y) = z. The label of x is multiplied with the label of z, and the result is assigned to z as its new label in the compressed list. It is not hard to see that the weighted ranks of the elements in L − I remain the same after adjusting the labels in this manner. Hence, their ranks can be computed by applying the list ranking algorithm recursively to the compressed list. Once the ranks of all elements in L − I are known, the ranks of the elements in I are computed by multiplying their labels with the ranks of their predecessors in L.

If the algorithm, excluding the recursive invocation on the compressed list, takes O(sort(N)) I/Os, the total I/O complexity of the algorithm is given by the recurrence I(N) = I(cN) + O(sort(N)), for some constant 0 < c < 1. The solution of this recurrence is O(sort(N)). Hence, we have to argue that every step, except the recursive invocation, can be carried out in O(sort(N)) I/Os.
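Unrolling the recurrence shows why this works: since sort(cN) ≤ c · sort(N) for 0 < c < 1 (the logarithmic factor in sort(·) is monotone), the per-level costs form a geometric series:

I(N) \;\le\; \sum_{i \ge 0} O\big(\mathrm{sort}(c^i N)\big)
      \;\le\; O\Big(\mathrm{sort}(N) \sum_{i \ge 0} c^i\Big)
      \;=\; O\!\left(\frac{\mathrm{sort}(N)}{1-c}\right)
      \;=\; O(\mathrm{sort}(N)).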

Given independent set I, it suffices to sort the vertices in I by their successors and the vertices in L − I by their own IDs, and then scan the resulting two sorted lists to update the weights of the successors of all elements in I. The successor pointers of the predecessors of all elements in I can be updated in the same manner. In particular, it suffices to sort the vertices in L − I by their successors and the vertices in I by their own IDs, and then scan the two sorted lists to copy the successor pointer from each vertex in I to its predecessor. Thus, the construction of the compressed list takes O(sort(N)) I/Os, once set I is given.

Theorem 10 A list of length N can be ranked in O(sort(N)) I/Os.

List ranking alone is of very limited use. However, combined with the Euler tour technique, it becomes a very powerful tool for solving problems on trees that can be expressed as functions over a traversal of the tree, or problems on general graphs that can be expressed in terms of a traversal of a spanning tree of the graph. An important application is the rooting of an undirected tree T, which is the process of directing all edges of T from parents to children after choosing one vertex of T as the root.


Given a rooted tree T (i.e., one where all edges are directed from parents to children), the Euler tour technique and list ranking can be used to compute a preorder or postorder numbering of the vertices of T, or the sizes of the subtrees rooted at the vertices of T. Such labellings are used in many classical graph algorithms, so that the ability to compute them is a first step towards solving more complicated graph problems.


Chapter 6

van Emde Boas Trees

The original description of this search tree was published in [22]; the implementation study can be found in [23].

6.1 From theory to practice

We consider sorted lists with an auxiliary data structure that supports the following operations on a sorted sequence s:

build: Build the data structure from a set of elements

insert: Insert an element

remove: Delete an element specified by a key or by a reference to that element.

locate: Given a key k, find the smallest element e in s such that e ≥ k. If such an element does not exist, return an indication of this fact, i.e., a handle to a dummy element with key ∞.

range query: Return all elements in s with key in a specified range [k, k′].

Sorted sequences are one of the most versatile data structures. In current algorithm libraries, they are implemented using comparison-based data structures such as (a, b)-trees, red-black trees, splay trees, or skip lists. These implementations support insertion, deletion, and search in time O(log n) and range queries in time O(k + log n), where n is the number of elements and k is the size of the output. For w-bit integer keys, a theoretically attractive alternative are van Emde Boas stratified trees (vEB-trees) that replace the log n by a log w: A vEB tree T for storing subsets M of w = 2^(k+1)-bit integers stores the set directly if |M| = 1. Otherwise it contains a root (hash) table r such that r[i] points to a vEB tree Ti for 2^k-bit integers. Ti represents the set Mi = {x mod 2^(2^k) : x ∈ M ∧ x >> 2^k = i}, where '>>' is the C-like shift operator, i.e., x >> i = ⌊x/2^i⌋. Furthermore, T stores min M, max M, and a top data structure t consisting of a 2^k-bit vEB tree storing the set Mt = {x >> 2^k : x ∈ M}. This data structure takes space O(|M| log w) and can be modified to consume only linear space. It can also be combined with a doubly linked sorted list to support fast successor and predecessor queries.

However, for a long time there was no known implementation of vEB-trees that could compete with the comparison-based data structures used in algorithm libraries. The following describes a specialized and highly tuned version of vEB-trees for storing 32-bit integers that can often outperform the classic data structures in terms of runtime.

Figure 6.1 outlines the transformation from a general vEB-tree to our specialized version. The starting point were the vEB search trees described above, but we arrive at a nonrecursive data structure: we get a three-level search tree. The root is represented by an array of size 2^16, and the lower levels use hash tables of size up to 256. Due to this small size, hash functions can be implemented by table lookup. Locating entries in these tables is achieved using hierarchies of bit patterns.

The main operation we are interested in is locate(y), which returns min{x ∈ M : y ≤ x}. Note that for plain lookup, a hash table would be faster than any data structure discussed here.


6.2 Implementation

Root Table

The root-table r contains a plain array with one entry for each possible value of the 16 most significant bits of the keys. r[i] = null if there is no x ∈ M with x[16..31] = i. If |Mi| = 1, it contains a pointer to the element-list item corresponding to the unique element of Mi. Otherwise, r[i] points to an L2-table containing Mi = {x ∈ M : x[16..31] = i}. The two latter cases can be distinguished using a flag stored in the least significant bit of the pointer; this is portable without further measures because all modern systems use addresses that are multiples of four (except for strings). Note that the root-table only uses 256 kB of memory and therefore easily fits into the cache.
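A sketch of the pointer-tagging trick (our own illustration, not the original code):

#include <cstdint>

// Since allocations are 4-byte aligned, the least significant bit of a pointer
// is free and can distinguish a direct element-list item from an L2-table.
inline void* tag_pointer(void* p, bool is_single_element) {
    return reinterpret_cast<void*>(
        reinterpret_cast<std::uintptr_t>(p) | (is_single_element ? 1u : 0u));
}

inline bool is_single_element(void* tagged) {
    return (reinterpret_cast<std::uintptr_t>(tagged) & 1u) != 0;
}

inline void* untag_pointer(void* tagged) {
    return reinterpret_cast<void*>(
        reinterpret_cast<std::uintptr_t>(tagged) & ~std::uintptr_t(1));
}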

L2-table

An L2-table ri stores the elements in Mi. If |Mi| ≥ 2, it uses a hash table storing an entry with key j if ∃x ∈ Mi : x[8..15] = j.

Let Mij = {x ∈ M : x[8..15] = j, x[16..31] = i}. If |Mij| = 1, the hash table entry points to the element list, and if |Mij| ≥ 2, it points to an L3-table representing Mij using a similar trick as in the root-table.



Figure 6.1: Evolution of the vEB data structure: (a) the abstract definition; (b) efficient inner-level lookup structures; (c) recursion removed; (d) the large hash table replaced with an array; (e) range queries allowed.



L3-table

An L3-table rij stores the elements in Mij. If |Mij| ≥ 2, it uses a hash table storing an entry with key k if ∃x ∈ Mij : x[0..7] = k. This entry points to an item in the element list storing the element with x[0..7] = k, x[8..15] = j, x[16..31] = i.

Auxiliary data structures

To locate a key y in the data structure, we first look up i = y[16..31] in the root-table. If r[i] ≠ null and y ≤ max Mi, we can proceed to the next level of the tree (we store the maximum of every subtree to make this check efficient). Otherwise, we have to find the subtree Mr with r = min{k : k ≥ i ∧ Mk ≠ null} (if no such k exists, we return ∞). To do this efficiently, we need some additional data structures. For every level of the tree, we have some top data structures: t1 and t2 for every level, t3 only for the root level. We explain the concept for the root level. To find r, we first use t1, which is a bit table containing a flag for every possible subtree of the root table, indicating whether Mi ≠ null. Via i div n we find the machine word a (of length n) in which i is located and check whether it contains r by setting the bits ≤ i to zero and checking for the most significant bit. (Finding the position of the most significant bit can be implemented in constant time by converting the number to floating point and then inspecting the exponent; in our implementation, two 16-bit table lookups turn out to be somewhat faster.) Only if a = 0 do we have to inspect another word. To do that, we jump to t2, in which every entry is a logical OR over 32 bits of t1. Analogously to t1, we try to find the first nonnull word right of a. Again, we check only the word containing i and switch to t3 (every entry is a logical OR over 32 bits of t2) if unsuccessful. t3 is only 64 bits and can be searched efficiently.
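A modern sketch of the word-level locate step, under the assumption of the more common bit convention in which key j maps to the j-th least significant bit, so that "smallest j ≥ i" becomes the lowest set bit after masking (the original uses the mirrored layout with msbPos, implemented via two 16-bit table lookups):

#include <cstdint>

// Find the smallest j >= i with bit j set in a flat array of 32-bit words
// (bit j of the structure lives in word j/32). Uses the GCC/Clang builtin
// for "count trailing zeros".
int locate_bit(const std::uint32_t* t, int num_words, int i) {
    int w = i / 32;
    std::uint32_t a = t[w] & (~0u << (i % 32));  // erase bits below position i
    while (a == 0) {                             // word empty: scan to the right
        if (++w == num_words) return -1;         // (a real t2/t3 level avoids this scan)
        a = t[w];
    }
    return w * 32 + __builtin_ctz(a);            // position of the lowest set bit
}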

If we need to access L2 (when the located subtree contains more than one element), we calculate j = y[8..15]; j is our key for a hash table to locate the entry corresponding to y in L2. We repeat the procedure described above and possibly inspect L3 in the same manner. Figure 6.2 gives pseudocode for locate.

Our hash tables use open addressing with linear probing. The table size is always a power of two between 4 and 256 (much smaller than in the original definition, since we removed the large hash table from the top layer). The size is doubled when a table of size k contains more than 3k/4 entries and k < 256. The table shrinks when it contains less than k/4 entries. Since all keys are between 0 and 255, we can afford to implement the hash function as a full lookup table h that is shared between all tables.



// return handle of min{x ∈ M : y ≤ x}
Function locate(y : N) : ElementHandle
    if y > max M then return ∞                  // no larger element
    i := y[16..31]                              // index into root table r
    if r[i] = null ∨ y > max Mi then return min M_{t1.locate(i)}
    if Mi = {x} then return x                   // single element case
    j := y[8..15]                               // key for L2 hash table at Mi
    if ri[j] = null ∨ y > max Mij then return min M_{i, t1i.locate(j)}
    if Mij = {x} then return x                  // single element case
    return rij[t1ij.locate(y[0..7])]            // L3 hash table access

// find the smallest j ≥ i such that tk[j] = 1
Method locate(i) for a bit array tk consisting of n-bit words
// n = 32 for t1, t2, t1i, t1ij; n = 64 for t3; n = 8 for t2i, t2ij
// Assertion: some bit in tk to the right of i is nonzero
    j := i div n                                // which n-bit word in tk contains bit i?
    a := tk[nj..nj + n − 1]                     // get this word
    set a[(i mod n) + 1..n − 1] to zero         // erase the bits to the left of bit i
    if a = 0 then                               // nothing here → look in higher level bit array
        j := tk+1.locate(j)                     // tk+1 stores the OR of n-bit groups of tk
        a := tk[nj..nj + n − 1]                 // get the corresponding word in tk
    return nj + msbPos(a)

Figure 6.2: Pseudo code for locating the smallest x ∈ M with y ≤ x.

This lookup table is initialized to a random permutation h : 0..255 → 0..255. Hash function values for a table of size 256/2^i are obtained by shifting h[x] i bits to the right. Note that for tables of size 256 we obtain a perfect hash function, i.e., there are no collisions between different table entries.
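A sketch of this shared lookup-table hash (the seed and names are ours):

#include <algorithm>
#include <cstdint>
#include <numeric>
#include <random>

// One random permutation of 0..255 serves all tables; a table of size
// (256 >> shift) uses the top bits of h[x].
struct ByteHash {
    std::uint8_t h[256];
    ByteHash() {
        std::iota(h, h + 256, 0);          // identity permutation 0..255
        std::mt19937 rng(12345);
        std::shuffle(h, h + 256, rng);     // turn it into a random permutation
    }
    // Hash for a table of size (256 >> shift); shift = 0 gives a perfect hash.
    unsigned operator()(std::uint8_t x, unsigned shift) const {
        return h[x] >> shift;
    }
};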

The worst case for all input sizes occurs if there are pairs of elements that differ only in the 8 least significant bits and differ from all other elements in the 16 most significant bits. In this case, hash tables and top data structures at levels two and three are allocated for each such pair of elements. This example shows that the faster locate comes at the price of a potentially larger memory overhead.


Figure 6.3: Locating randomly distributed keys. (Time for locate in nanoseconds as a function of n, for orig-STree, LEDA-STree, STL map, (2,16)-tree, and STree.)

6.3 Experiments

We now compare several implementations of search-tree-like data structures. As comparison-based data structures we use the STL std::map, which is based on red-black trees, and the (a, b)-tree from LEDA with a = 2, b = 16, which fared best in a previous comparison of search tree data structures in LEDA.

The implementations run under Linux on a 2 GHz Intel Xeon processor with 512 KByte of L2 cache using an Intel E7500 chipset. The machine has 1 GByte of RAM and no swap space, to exclude swapping effects. We use the g++ 2.95.4 compiler with optimization level -O6. We report the average execution time per operation in nanoseconds on an otherwise unloaded machine. The average is taken over at least 100 000 executions of the operation. Elements are 32-bit unsigned integers plus a 32-bit integer as associated information.

Figure 6.3 shows the time for the locate operation for random 32-bit integers and independently drawn random 32-bit queries for locate.


Figure 6.4: Locating on a hard instance. (Time for locate in nanoseconds as a function of n, for STL map (hard), (2,16)-tree (hard), STree (hard), and STree (random).)


Already the comparison-based data structures show some interesting effects. For small n, when the data structures fit in cache, red-black trees outperform (2, 16)-trees, indicating that red-black trees execute fewer instructions. For larger n, this picture changes dramatically, presumably because (2, 16)-trees are more cache-efficient.

Our vEB tree (called STree here) is fastest over the entire range of inputs. For small n, it is faster than the comparison-based structures by up to a factor of 4.1. For random inputs of this size, locate mostly accesses the root-top data structure, which fits in cache and hence is very fast. It even gets faster with increasing n, because then locate rarely has to go to the second or even third level t2 and t3 of the root-top data structure. For medium-size inputs there is a range of steep increase in execution time, because the L2 and L3 data structures get used more heavily and the memory consumption quickly exceeds the cache size. But the speedup over (2, 16)-trees is always at least 1.5. For large n, the advantage over comparison-based data structures grows again, reaching a factor of 2.9 for the largest inputs.

Figure 6.4 shows the result of an attempt to obtain close-to-worst-case inputs for our vEB tree. For a given set size |M| = n, we store Mhard = {2^8·i·∆, 2^8·i·∆ + 255 : i = 0..n/2 − 1}, where ∆ = ⌊2^25/n⌋. Mhard maximizes the space consumption of our implementation. Furthermore, locate queries of the form 2^8·j·∆ + 128 for random j ∈ 0..n/2 − 1 force the vEB tree to go through the root table, the L2-table, both levels of the L3-top data structure, and the L3-table. As is to be expected, the comparison-based implementations are not affected by this change of input. For n ≤ 2^18, the vEB tree is now slower than its comparison-based competitors. However, for large n we still have a similar speedup as for random inputs.


Chapter 7

Shortest Path Search

The overview of classical algorithms was taken from [30]. The section on highway hierarchies is mainly based on material from [31] and [32]. The material on transit node routing was taken from [33], and the section on dynamic highway routing is from [34]. More and newer material can be found on Dominik Schultes' website: http://algo2.iti.uni-karlsruhe.de/schultes/hwy/.

Some material from this chapter (especially on Highway Hierarchies) was not covered during the lecture in 2007. For self-containment, we include these paragraphs for further studies. In the following chapter, these supplemental sections are marked with an asterisk: *.

7.1 Introduction

Computing shortest paths in graphs (networks) with nonnegative edge weights is a classical problem of computer science. From a worst-case perspective, the problem has largely been solved by Dijkstra in 1959, who gave an algorithm that finds all shortest paths from a starting node s using at most m + n priority queue operations for a graph G = (V, E) with n nodes and m edges.

However, motivated by important applications (e.g., in transportation networks), there has recently been considerable interest in the problem of accelerating shortest-path queries, i.e., the problem of finding a shortest path between a source node s and a target node t. In this case, Dijkstra's algorithm can stop as soon as the shortest path to t is found.

A classical technique that gives a constant-factor speedup is bidirectional search, which simultaneously searches forward from s and backwards from t until the search frontiers meet. All further speedup techniques either need additional information (e.g., geometry information for goal-directed search) or precomputation.


There is a trade-off between the time needed for precomputation, the space needed for storing the precomputed information, and the resulting query time.

In particular, from now on we focus on shortest paths in large road networks, where we use 'shortest' as a synonym for 'fastest'. The graphs used for North America or Western Europe already have around 20 000 000 nodes, so that significantly superlinear preprocessing time or even slightly superlinear space is prohibitive. To the best of our knowledge, all commercial applications currently only compute paths heuristically that are not always shortest possible. The basic idea of these heuristics is the observation that shortest paths "usually" use small roads only locally, i.e., at the beginning and at the end of a path. Hence the heuristic algorithm only performs some kind of local search from s and t and then switches to search in a highway network that is much smaller than the complete graph. Typically, an edge is put into the highway network if the information supplied on its road type indicates that it represents an important road.

7.2 "Classical" and other Results

The following section gives a short review of older speedup techniques.

Dijkstra's Algorithm

The classical algorithm for route planning maintains an array of tentative distances D[u] ≥ d(s, u) for each node. The algorithm visits (or settles) the nodes of the road network in the order of their distance to the source node and maintains the invariant that D[u] = d(s, u) for visited nodes. We call the rank of node u in this order its Dijkstra rank rs(u). When a node u is visited, its outgoing edges (u, v) are relaxed, i.e., D[v] is set to min(D[v], d(s, u) + w(u, v)). Dijkstra's algorithm terminates when the target node is visited. The size of the search space is O(n), and n/2 nodes on the average. We will assess the quality of route planning algorithms by looking at their speedup compared to Dijkstra's algorithm, i.e., how many times faster they can compute shortest-path distances.
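For reference, a textbook sketch of the algorithm with a binary heap (lazy deletion via stale-entry checks; an s–t query would simply return as soon as t is popped):

#include <functional>
#include <limits>
#include <queue>
#include <utility>
#include <vector>

using Weight = double;
struct Edge { int to; Weight w; };

// Settles nodes in order of their distance from s (their Dijkstra rank).
std::vector<Weight> dijkstra(const std::vector<std::vector<Edge>>& g, int s) {
    const Weight INF = std::numeric_limits<Weight>::infinity();
    std::vector<Weight> D(g.size(), INF);      // tentative distances D[u] >= d(s,u)
    using QE = std::pair<Weight, int>;
    std::priority_queue<QE, std::vector<QE>, std::greater<QE>> q;
    D[s] = 0; q.push({0, s});
    while (!q.empty()) {
        auto [d, u] = q.top(); q.pop();
        if (d > D[u]) continue;                // stale entry: u was already settled
        for (const Edge& e : g[u])             // relax the outgoing edges of u
            if (D[u] + e.w < D[e.to]) {
                D[e.to] = D[u] + e.w;
                q.push({D[e.to], e.to});
            }
    }
    return D;
}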

Priority Queues

Dijkstra's algorithm can be implemented using O(n) priority queue operations. In the comparison-based model this leads to O(n log n) execution time. In other models of computation, and on the average, better bounds exist. However, in practice the impact of priority queues on performance for large road networks is rather limited, since cache faults for accessing the graph are usually the main bottleneck.


In addition, our experiments indicate that the impact of priority queue implementations diminishes with advanced speedup techniques, since these techniques at the same time introduce additional overheads and dramatically reduce the queue sizes.

Bidirectional Search

Bidirectional search executes Dijkstra's algorithm simultaneously forward from the source and backwards from the target. Once some node has been visited from both directions, the shortest path can be derived from the information already gathered. In a road network, where search spaces take a roughly circular shape, we can expect a speedup of about two: one disk with radius d(s, t) has twice the area of two disks with half the radius. Bidirectional search is important since it can be combined with most other speedup techniques and, more importantly, because it is a necessary ingredient of the most efficient advanced techniques.

Geometric Goal-Directed Search (A∗)

The intuition behind goal-directed search is that shortest paths 'should' lead in the general direction of the target. A∗ search achieves this by modifying the weight of edge (u, v) to w(u, v) − π(u) + π(v), where π(v) is a lower bound on d(v, t). Note that this manipulation shortens edges that lead towards the target. Since the added and subtracted vertex potentials π(v) cancel along any path, this modification of edge weights preserves shortest paths. Moreover, as long as all edge weights remain nonnegative, Dijkstra's algorithm can still be used. The classical way to use A∗ for route planning in road maps estimates d(v, t) based on the Euclidean distance between v and t and the average speed of the fastest road anywhere in the network. Since this is a very conservative estimate, the speedup for finding quickest routes is rather small.
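A minimal sketch of this classical geometric potential; names and units are ours. Edge weights are travel times, so dividing the Euclidean distance by the fastest speed in the network gives a valid lower bound.

#include <cmath>

struct Point { double x, y; };

// pi(v): lower bound on the remaining travel time d(v, t).
double potential(Point v, Point t, double max_speed) {
    return std::hypot(v.x - t.x, v.y - t.y) / max_speed;
}

// Reduced weight w(u,v) - pi(u) + pi(v); nonnegative by the triangle inequality.
double reduced_weight(double w_uv, Point u, Point v, Point t, double max_speed) {
    return w_uv - potential(u, t, max_speed) + potential(v, t, max_speed);
}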

Heuristics

In the last decades, commercial navigation systems were developed which had to handle ever more detailed descriptions of road networks on rather low-powered processors. Vendors resorted to heuristics, still used today, that do not give any performance guarantees: use A∗ search with estimates of d(u, t) rather than lower bounds; do not look at 'unimportant' streets, unless close to the source or target. The latter heuristic needs careful hand-tuning of road classifications to produce reasonable results, but yields considerable speedups.


Small Separators

Road networks are almost planar, i.e., most edges intersect only at nodes. Hence, techniques developed for planar graphs will often also work for road networks. Using O(n log² n) space and preprocessing time, query time O(√n log n) can be achieved for directed planar graphs without negative cycles. Queries accurate within a factor (1 + ε) can be answered in near-constant time using O((n log n)/ε) space and preprocessing time. Most of these theoretical approaches look difficult to use in practice, since they are complicated and need superlinear space.

The first published practical approach to fast route planning uses a set of nodes V1 whose removal partitions the graph G = G0 into small components. Now consider the overlay graph G1 = (V1, E1), where edges in E1 are shortcuts corresponding to shortest paths in G that do not contain nodes from V1 in their interior. Routing can now be restricted to G1 and the components containing s and t, respectively. This process can be iterated, yielding a multi-level method. A limitation of this approach is that the graphs at higher levels become much more dense than the input graphs, thus limiting the benefits gained from the hierarchy. Also, computing small separators and shortcuts can become quite costly for large graphs.

Reach-Based Routing

Let R(v) := max_{s,t∈V} R_{st}(v) denote the reach of node v, where R_{st}(v) := min(d(s, v), d(v, t)). Gutman [35] observed that a shortest-path search can be stopped at nodes with a reach too small to get to source or target from there. Variants of reach-based routing work with the reach of edges or characterize reach in terms of geometric distance rather than shortest-path distance. The first implementation had disappointing speedups and preprocessing times that would be prohibitive for large networks.

Edge Labels

The idea behind edge labels is to precompute, for an edge e, information that specifies a set of nodes M(e) with the property that M(e) is a superset of all nodes that lie on a shortest path starting with e. In an s–t query, an edge e need not be relaxed if t ∉ M(e). In [26], M(e) is specified by an angular range. More effective is information that can distinguish between long-range and short-range edges. In [27], many geometric containers are evaluated; very good performance is observed for axis-parallel rectangles. A disadvantage of geometric containers is that they require a complete all-pairs shortest-path computation. Faster precomputation is possible by partitioning the graph into k regions that have similar size and only a small number of boundary nodes. Now M(e) is represented as a k-vector of edge flags [29, 28], where flag i indicates whether there is a


shortest path containing e that leads to a node in region i. Edge flags can be computed using a single-source shortest-path computation from all boundary nodes of the regions.

Landmark A∗

Using the triangle inequality, quite strong bounds on shortest-path distances can be obtained by precomputing distances to a set of around 20 landmark nodes that are well distributed over the far ends of the network [24]. Using reasonable space and much less preprocessing time than for edge labels, these lower bounds yield a considerable speedup for route planning.

Precomputed Cluster Distances (PCD)

In [25], we give a different way to use precomputed distances for goal-directed search. We partition the network into clusters and then precompute the shortest connection between any pair of clusters U and V, i.e., min_{u∈U,v∈V} d(u, v). PCDs cannot be used together with A∗ search, since reduced edge weights can become negative. However, PCDs yield upper and lower bounds for distances that can be used to prune the search. This gives a speedup comparable to landmark A∗ while using less space.

7.3 Highway Hierarchy

7.3.1 Introduction

Our first approach is based on the idea of computing exact shortest paths by defining the notions of local search and highway network appropriately. This is very simple. We define a local search to be a search that visits the H closest nodes from s (or t), where H is a tuning parameter. This definition already fixes the highway network. An edge (u, v) ∈ E should be a highway edge if there are nodes s and t such that (u, v) is on the shortest path from s to t, v is not within the H closest nodes from s, and u is not within the H closest nodes from t.

So far, the highway network still contains all the nodes of the original network. However, we can prune it significantly: isolated nodes are not needed; trees attached to a biconnected component can only be traversed at the beginning and end of a path; similarly, paths consisting of nodes with degree two can be replaced by a single edge. (This list of possible contractions was only used in an early version of the algorithm, but it still gives a good idea of where contraction might be useful.) The result is a contracted highway network that only contains nodes of degree at least three. We can iterate the above approach, define local search on the highway network, find a


"superhighway network", contract it, and so on. We arrive at a multi-level highway network: a highway hierarchy.

The next section formalizes some of these ideas.

7.3.2 Hierarchies and Contraction

Graphs and Paths. We expect a directed graph G = (V, E) with n nodes and m edges (u, v) with nonnegative weights w(u, v) as input. We assume w.l.o.g. that there are no self-loops, parallel edges, or zero-weight edges in the input; they could be dealt with easily in a preprocessing step. The length w(P) of a path P is the sum of the weights of the edges that belong to P. P∗ = ⟨s, . . . , t⟩ is a shortest path if there is no path P′ from s to t such that w(P′) < w(P∗). The distance d(s, t) between s and t is the length of a shortest path from s to t. If P = ⟨s, . . . , s′, u1, u2, . . . , uk, t′, . . . , t⟩ is a path from s to t, then P|s′→t′ = ⟨s′, u1, u2, . . . , uk, t′⟩ denotes the subpath of P from s′ to t′.

Dijkstra's Algorithm. Dijkstra's algorithm can be used to solve the single-source shortest-path (SSSP) problem, i.e., to compute the shortest paths from a single source node s to all other nodes in a given graph. It is covered by virtually any textbook on algorithms, so that we confine ourselves to introducing our terminology: Starting with the source node s as root, Dijkstra's algorithm grows a shortest-path tree that contains shortest paths from s to all other nodes. During this process, each node of the graph is either unreached, reached, or settled. A node that already belongs to the tree is settled. If a node u is settled, a shortest path P∗ from s to u has been found and the distance d(s, u) = w(P∗) is known. A node that is adjacent to a settled node is reached. Note that a settled node is also reached. If a node u is reached, a path P from s to u, which might not be the shortest one, has been found and a tentative distance δ(u) = w(P) is known. Nodes that are not reached are unreached.

A bidirectional version of Dijkstra's algorithm can be used to find a shortest path from a given node s to a given node t. Two Dijkstra searches are executed in parallel: one searches from the source node s in the original graph G = (V, E), also called the forward graph and denoted →G = (V, →E); another searches from the target node t backwards, i.e., it searches in the reverse graph ←G = (V, ←E), ←E := {(v, u) | (u, v) ∈ E}. The reverse graph ←G is also called the backward graph. When both search scopes meet, a shortest path from s to t has been found.

A highway hierarchy of a graph G consists of several levels G0, G1, G2, . . . , GL, where the number of levels L + 1 is given. We provide an inductive definition:

• Base case (G′0, G0): level 0 (G0 = (V0, E0)) corresponds to the original graph G; furthermore, we define G′0 := G0.


• First step (G′ℓ → Gℓ+1, 0 ≤ ℓ < L): for given neighbourhood radii, we will define the highway network Gℓ+1 of a graph G′ℓ.

• Second step (Gℓ → G′ℓ, 1 ≤ ℓ ≤ L): for a given set Bℓ ⊆ Vℓ of bypassable nodes, we will define the core G′ℓ of level ℓ (this is the contraction step).

First step (highway network). For each node u, we choose a nonnegative neighbourhood radius r→ℓ(u) for the forward graph and a radius r←ℓ(u) ≥ 0 for the backward graph. To avoid some case distinctions, for any direction ∈ {→, ←}, we set the neighbourhood radius rℓ(u) to infinity for u ∉ V′ℓ and for ℓ = L.

The level-ℓ neighbourhood of a node u ∈ V′ℓ is N→ℓ(u) := {v ∈ V′ℓ | dℓ(u, v) ≤ r→ℓ(u)} with respect to the forward graph and, analogously, N←ℓ(u) := {v ∈ V′ℓ | d←ℓ(u, v) ≤ r←ℓ(u)} with respect to the backward graph, where dℓ(u, v) denotes the distance from u to v in the forward graph Gℓ and d←ℓ(u, v) := dℓ(v, u) the distance in the backward graph ←Gℓ.

The highway network Gℓ+1 = (Vℓ+1, Eℓ+1) of a graph G′ℓ is the subgraph of G′ℓ induced by the edge set Eℓ+1: an edge (u, v) ∈ E′ℓ belongs to Eℓ+1 iff there are nodes s, t ∈ V′ℓ such that the edge (u, v) appears in some shortest path ⟨s, . . . , u, v, . . . , t⟩ from s to t in G′ℓ with the property that v ∉ N→ℓ(s) and u ∉ N←ℓ(t).

The definition of the highway network suggests that we would need an all-pairs shortest-path search to find all its edges, which would be very time-consuming. Fortunately, it is possible to design an efficient algorithm that performs only a 'local search' from each node. The main idea is that it is not necessary to look at node pairs s, t that are very far apart: suppose that (u, v) ∈ E1 is witnessed by source and target nodes s and t. If d(s, u) ≫ r→ℓ(s) and d(v, t) ≫ r←ℓ(t), then we may expect that there are other witnesses s′ and t′ that are much closer to the edge (u, v).

For each node s0 ∈ V, we compute and store the values r→ℓ(s0) and r←ℓ(s0). This can be done easily by a Dijkstra search from each node s0 that is aborted as soon as H nodes have been settled. Then, we start with an empty set of highway edges E1. For each node s0, two phases are performed: the forward construction of a partial shortest-path tree B and the backward evaluation of B. The construction is done by a single-source shortest-path (SSSP) search from s0; during the evaluation phase, paths from the leaves of B to the root s0 are traversed, and for each edge on these paths it is decided whether to add it to E1 or not. The crucial part is the specification of an abort criterion for the SSSP search in order to restrict it to a 'local search'.

Phase 1: Construction of a Partial Shortest Path Tree* A Dijkstra search from s0 is executed. During the search, a reached node is either in the state active or passive. The source node s0 is active; each node that is reached for the first time (insert) and each reached node that is updated (decreaseKey) adopts the activation state from its (tentative) parent in the shortest-path tree B.


Figure 7.1: Instead of a complete all-to-all shortest path search, we can identify all highway edges by a local search for each node, visiting only its close neighbors.


Figure 7.2: The abort criterion for finding highway edges ensures local search.


When a node p is settled using the path ⟨s0, s1, . . . , p⟩, then p's state is set to passive if |N(s1) ∩ N(p)| ≤ 1. When no active unsettled node is left, the search is aborted and the growth of B stops.

Phase 2: Selection of the Highway Edges* During Phase 2, all edges (u, v) are added to E1 that lie on paths ⟨s0, . . . , u, v, . . . , t0⟩ in B with the property that v ∉ N(s0) and u ∉ N(t0), where t0 is a leaf of B. This can be done in time O(|B|).

Speeding Up Construction. An active node v is declared to be a maverick if d(s0, v) > f · dH(s0), where f is a parameter. Normally, the search cannot be aborted before the search radius reaches d(s0, v) because we have to prove that we have found the shortest path. Now, when all active nodes are mavericks, the search from passive nodes is no longer continued. This way, the construction process is accelerated and E1 becomes a superset of the highway network. Hence, queries will be slower, but still compute exact shortest paths. The maverick factor f enables us to adjust the trade-off between construction and query time. Long-distance ferries are a typical example of mavericks.

Second Step (Core). For a given set Bℓ ⊆ Vℓ of bypassable nodes, we define the set Sℓ of shortcut edges that bypass the nodes in Bℓ: for each path P = ⟨u, b1, b2, …, bk, v⟩ with u, v ∈ Vℓ \ Bℓ and bi ∈ Bℓ, 1 ≤ i ≤ k, the set Sℓ contains an edge (u, v) with w(u, v) = w(P). The core G′ℓ = (V′ℓ, E′ℓ) of level ℓ is defined in the following way:

V′ℓ := Vℓ \ Bℓ and E′ℓ := (Eℓ ∩ (V′ℓ × V′ℓ)) ∪ Sℓ.

Contraction of a Graph. In order to obtain the core of a highway network, we contract it, which yields several advantages. The search space during the queries gets smaller since bypassed nodes are not touched, and the construction process gets faster since the next iteration only deals with the nodes that have not been bypassed. Furthermore, a more effective contraction allows us to use smaller neighbourhood sizes without compromising the shrinking of the highway networks. This improves both construction and query times. However, bypassing nodes involves the creation of shortcuts, i.e., edges that represent the bypasses. Due to these shortcuts, the average degree of the graph is increased and the memory consumption grows. In particular, more edges have to be relaxed during the queries. Therefore, we have to select nodes carefully, so that the benefits of bypassing them outweigh the drawbacks.

An intuitive justification for contraction is the following consideration, which was in fact the basis of contraction in an earlier version of Highway Hierarchies: imagine a long path where the inner nodes have no other edges. It is possible to contract this path to a single edge between the start and the end node and still retain all shortest paths. Another example of contractable structures are attached trees where every shortest path to a node outside has to go through the root.

We give an iterative algorithm that combines the selection of the bypassable nodesB` with the creation of the corresponding shortcuts. We manage a stack that contains



Figure 7.3: Contracting a graph separates bypassable components from the core.

all nodes that have to be considered, initially all nodes from Vℓ. As long as the stack is not empty, we deal with the topmost node u. We check the bypassability criterion #shortcuts ≤ c · (deg_in(u) + deg_out(u)), which compares the number of shortcuts that would be created if u was bypassed with the sum of the in- and outdegree of u. The magnitude of the contraction is determined by the parameter c. If the criterion is fulfilled, the node is bypassed, i.e., it is added to Bℓ and the appropriate shortcuts are created. Note that the creation of the shortcuts alters the degree of the corresponding endpoints, so that bypassing one node can influence the bypassability criterion of another node. Therefore, all adjacent nodes that have been removed from the stack earlier, have not been bypassed yet, and are bypassable now are pushed onto the stack once again; a sketch of this loop is given below. It can happen that shortcuts that were created at some point are discarded later when one of their endpoints is bypassed.
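The loop can be sketched as follows. This is a self-contained structural sketch under simplifying assumptions spelled out in the comments (a simple directed graph without self-loops; shortcut weights and the bookkeeping that discards obsolete shortcuts are omitted); all names are ours, not the original code.

#include <set>
#include <stack>
#include <vector>

struct ContractionSketch {
    // adjacency as one set of node ids per node; both vectors (and
    // 'bypassed') are sized to the number of nodes by the caller
    std::vector<std::set<int>> in, out;
    std::vector<bool> bypassed;

    // number of new shortcut edges that bypassing u would create
    int shortcutsIfBypassed(int u) const {
        int k = 0;
        for (int v : in[u])
            for (int w : out[u])
                if (v != w && !out[v].count(w)) ++k;
        return k;
    }

    void run(double c) {
        std::stack<int> stack;                 // nodes still to be considered
        for (int u = 0; u < (int)in.size(); ++u) stack.push(u);
        while (!stack.empty()) {
            int u = stack.top(); stack.pop();
            if (bypassed[u]) continue;
            int deg = (int)in[u].size() + (int)out[u].size();
            // bypassability criterion: #shortcuts <= c * (deg_in + deg_out)
            if (shortcutsIfBypassed(u) > c * deg) continue;
            bypassed[u] = true;
            for (int v : in[u])                // create the shortcuts
                for (int w : out[u])
                    if (v != w) { out[v].insert(w); in[w].insert(v); }
            for (int v : in[u]) out[v].erase(u);   // detach u from the core
            for (int w : out[u]) in[w].erase(u);
            // the neighbours' degrees changed, so they may have become
            // bypassable: push them once again
            for (int v : in[u]) if (!bypassed[v]) stack.push(v);
            for (int w : out[u]) if (!bypassed[w]) stack.push(w);
        }
    }
};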

7.3.3 Query

Our highway query algorithm is a modification of the bidirectional version of Dijkstra's algorithm. For now, we assume that the search is not aborted when both search scopes meet. We only describe the modifications of the forward search since forward and backward search are symmetric. In addition to the distance from the source, the key of each node includes the search level and the gap to the next applicable neighbourhood border. The search starts at the source node s in level 0. First, a local search in the neighbourhood of s is performed, i.e., the gap to the next border is set to the neighbourhood radius of s in level 0. When a node v is settled, it adopts the gap of its parent u minus the length of the edge (u, v). As long as we stay inside the current neighbourhood, everything works as usual. However, if an edge (u, v) crosses the neighbourhood border (i.e., the length of the



Figure 7.4: A schematic diagram of a highway query. Only the forward search is depicted.

edge is greater than the gap), we switch to a higher search level ℓ. The node u becomes an entrance point to the higher level. If the level of the edge (u, v) is less than the new search level ℓ, the edge is not relaxed—this is one of the two restrictions that cause the speedup in comparison to Dijkstra's algorithm (Restriction 1). Otherwise, the edge is relaxed. If the relaxation is successful, v adopts the new search level ℓ and the gap to the border of the neighbourhood of u in level ℓ since u is the corresponding entrance point to level ℓ. Figure 7.4 illustrates this process.

To increase the speedup and make use of the contracted graph, we introduce another restriction (Restriction 2): when a node u ∈ V′ℓ is settled, all edges (u, v) that lead to a bypassed node v ∈ Bℓ in search level ℓ are not relaxed.

A detailed example. Figure 7.5 gives a detailed example of the forward search of a highway query. The search starts at node s. The gap of s is initialised to the distance from s to the border of the neighbourhood of s in level 0. Within the neighbourhood of s, the search process corresponds to a standard Dijkstra search. The edge that leads to u leaves the neighbourhood. It is not relaxed due to Restriction 1 since the edge belongs only to level 0. In contrast, the edge that leaves s1 is relaxed since its level allows us to switch to level 1 in the search process. s1 and its direct successor are bypassed nodes in level 1. Their neighbourhoods are unbounded, i.e., their neighbourhood radii are infinity, so that the gap is set to infinity as well. At s′1, we leave the component of bypassed nodes and enter the core of level 1. Now, the search is continued in the core of level 1 within the neighbourhood of s′1. The gap is set appropriately. Note that the edge to v is not relaxed due to Restriction 2 since v is a bypassed node. Instead, the direct shortcut to s2 = s′2 is used. Here, we switch to level 2. In this case, we do not enter the next level through a component of bypassed nodes, but get directly into the core. The search is continued in the core of level 2 within the neighbourhood of s′2. And so on.

Despite Restriction 1, we always find the optimal path since the construction of the



Figure 7.5: A detailed example of a highway query. Only the forward search is depicted. Nodes in levels 0, 1, and 2 are vertically striped, solid, and horizontally striped, respectively. In level 1, dark shades represent core nodes, light shades bypassed nodes. Edges in levels 0, 1, and 2 are dashed, solid, and dotted, respectively.

highway hierarchy guarantees that the levels of the edges that belong to the optimal path are sufficiently high, so that these edges are not skipped. Restriction 2 does not invalidate the correctness of the algorithm since we have introduced shortcuts that bypass the nodes that do not belong to the core. Hence, we can use these shortcuts instead of the original paths.

The Algorithm. We use two priority queues Q→ and Q←, one for the forward search and one for the backward search. The key of a node u is a triple (δ(u), ℓ(u), gap(u)): the (tentative) distance δ(u) from s (or t) to u, the search level ℓ(u), and the gap gap(u) to the next applicable neighbourhood border. A key (δ, ℓ, gap) is less than another key (δ′, ℓ′, gap′) iff δ < δ′, or δ = δ′ ∧ ℓ > ℓ′, or δ = δ′ ∧ ℓ = ℓ′ ∧ gap < gap′.
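In code, this key order amounts to a three-way comparison in which, on distance ties, a *higher* search level wins. A minimal C++ sketch (field names are our own):

#include <cstdint>

struct Key {
    uint64_t delta;   // tentative distance delta(u) from s (or t)
    int      level;   // search level l(u)
    uint64_t gap;     // gap(u) to the next applicable neighbourhood border
};

// (delta, level, gap) < (delta', level', gap') iff delta < delta',
// or equal distance and higher level, or equal both and smaller gap.
bool keyLess(const Key& a, const Key& b) {
    if (a.delta != b.delta) return a.delta < b.delta;
    if (a.level != b.level) return a.level > b.level;   // higher level first
    return a.gap < b.gap;
}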

Figure 7.6 contains the pseudo-code of the highway query algorithm.

Speeding Up the Search in the Topmost Level. Let us assume that we have a distance table that contains for any node pair s, t ∈ V′L the optimal distance dL(s, t). Such a table can be precomputed during the preprocessing phase by |V′L| SSSP searches in V′L. Using the distance table, we do not have to search in level L. Instead, when we arrive at a node u ∈ V′L that 'leads' to level L, we add u to a set I→ or I← depending on the search direction; we do not relax the edge that leads to level L. After the sets I→ and I← have been determined, we consider all pairs (u, v), u ∈ I→, v ∈ I←, and compute the minimum path length D := d0(s, u) + dL(u, v) + d0(v, t). Then, the length of the shortest s-t-path is the minimum of D and the length of the tentative shortest path found so far (in case the search scopes have already met in a level < L).
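The combination step is a simple minimum over all entrance-point pairs. The sketch below assumes the top-level core nodes are renumbered 0, …, |V′L|−1, so that dL can be a plain two-dimensional table; all names are illustrative.

#include <algorithm>
#include <cstdint>
#include <vector>

struct Entrance { int node; uint64_t d0; };  // entrance point u with d0(s,u)

// Minimum of the tentative distance found in levels < L and
// d0(s,u) + dL(u,v) + d0(v,t) over all pairs (u,v) in Ifwd x Ibwd.
uint64_t topLevelCombine(const std::vector<Entrance>& Ifwd,
                         const std::vector<Entrance>& Ibwd,
                         const std::vector<std::vector<uint64_t>>& dL,
                         uint64_t tentative) {
    uint64_t best = tentative;
    for (const Entrance& u : Ifwd)
        for (const Entrance& v : Ibwd)
            best = std::min(best, u.d0 + dL[u.node][v.node] + v.d0);
    return best;
}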


input: source node s and target node t

 1  Q→.insert(s, (0, 0, r→0(s))); Q←.insert(t, (0, 0, r←0(t)));
 2  while (Q→ ∪ Q← ≠ ∅) do {
 3      select a direction ∈ {→, ←};
 4      u := Q.deleteMin();
 5      if gap(u) ≠ ∞ then gap′ := gap(u) else gap′ := rℓ(u)(u);
 6      foreach e = (u, v) ∈ E do {
 7          for (ℓ := ℓ(u), gap := gap′; w(e) > gap; ℓ++) gap := rℓ+1(u);  // go 'upwards'
 8          if ℓ(e) < ℓ then continue;  // Restriction 1
 9          if u ∈ V′ℓ ∧ v ∈ Bℓ then continue;  // Restriction 2
10          k := (δ(u) + w(e), ℓ, gap − w(e));
11          if v ∈ Q then Q.decreaseKey(v, k); else Q.insert(v, k);
12      }
13  }

Figure 7.6: The highway query algorithm. Differences to the bidirectional version of Dijkstra's algorithm are marked: additional and modified lines have a framed line number; in modified lines, the modifications are underlined.


7.3.4 Experiments

Environment and Instances. The experiments were done on one core of an AMD Opteron Processor 270 clocked at 2.0 GHz with 4 GB main memory and 2 × 1 MB L2 cache, running SuSE Linux 10.0 (kernel 2.6.13). The program was compiled by the GNU C++ compiler 4.0.2 using optimisation level 3. We use 32 bits to store edge weights and path lengths.

We deal with the road networks of Western Europe2 and of the USA (without Hawaii) and Canada. Both networks have been made available for scientific use by the company PTV AG. The original graphs contain for each edge a length and a road category, e.g., motorway, national road, regional road, urban street. We assign average speeds to the road categories, compute for each edge the average travel time, and use it as weight.

We report only the times needed to compute the shortest path distance between two nodes without outputting the actual route. In order to obtain the corresponding subpaths in the original graph, we are able to extract the used shortcuts without using any extra data. However, if a fast output routine is required, we might want to spend some additional space to accelerate the unpacking process. For details, we refer to the full paper. Table 7.1 summarises the properties of the used road networks and key results of the experiments.

Parameters. Unless otherwise stated, the following default settings apply. We use the contraction rate c = 1.5 and the neighbourhood sizes H as stated in Tab. 7.1—the same neighbourhood size is used for all levels of a hierarchy. First, we contract the original graph. Then, we perform four iterations of our construction procedure, which determines a highway network and its core. Finally, we compute the distance table between all level-4 core nodes.

In one test series (Fig. 7.7), we used all the default settings except for the neighbourhood size H, which we varied from 40 to 90. On the one hand, if H is too small, the shrinking of the highway networks is less effective, so that the level-4 core is still quite big. Hence, we need much time and space to precompute and store the distance table. On the other hand, if H gets bigger, the time needed to preprocess the lower levels increases because the area covered by the local searches depends on the neighbourhood size. Furthermore, during a query, it takes longer to leave the lower levels in order to get to the topmost level where the distance table can be used. Thus, the query time increases as well. We observe that we get good space-time tradeoffs for neighbourhood sizes around 60. In particular, we find that a good choice of the parameter H does not vary much from graph to graph.

In another test series (Tab. 7.2 (a)), we did not use a distance table, but repeated the construction process until the topmost level was empty or the hierarchy consisted of 15

2 14 countries: at, be, ch, de, dk, es, fr, it, lu, nl, no, pt, se, uk


                               Europe       USA/CAN    USA (Tiger)
INPUT
  #nodes                    18 029 721   18 741 705    24 278 285
  #directed edges           42 199 587   47 244 849    58 213 192
  #road categories                  13           13             4
PARAM.
  average speeds [km/h]         10–130       16–112        40–100
  H                                 50           60            60
PREPROC.
  CPU time [min]                    15           20            18
  ∅overhead/node [byte]             68           69            50
QUERY
  CPU time [ms]                   0.76         0.90          0.88
  #settled nodes                   884          951         1 076
  #relaxed edges                 3 182        3 630         4 638
  speedup (CPU time)             8 320        7 232         7 642
  speedup (#settled nodes)      10 196        9 840        11 080
  worst case (#settled nodes)    8 543        3 561         5 141

Table 7.1: Overview of the used road networks and key results. '∅overhead/node' accounts for the additional memory that is needed by our highway hierarchy approach (divided by the number of nodes). The amount of memory needed to store the original graph is not included. Query times are average values based on 10 000 random s-t-queries. 'Speedup' refers to a comparison with Dijkstra's algorithm (unidirectional). Worst case is an upper bound for any possible query in the respective graph.

Figure 7.7: Preprocessing and query performance depending on the neighbourhood size H.

levels. We varied the contraction rate c from 0.5 to 2. In case of c = 0.5 (and H = 50), the shrinking of the highway networks does not work properly, so that the topmost level is still very big. This yields huge query times. In earlier implementations we used a larger neighbourhood size to cope with this problem. Choosing larger contraction rates


(a)
contr.       PREPROCESSING                  QUERY
rate c   time [min]  overhead  ∅deg.   time [ms]  #settled nodes  #relaxed edges
0.5          89         27      3.2     176.05        242 156         505 086
1            16         27      3.7       1.97          2 321           8 931
1.5          13         27      3.8       1.58          1 704           7 935
2            13         28      3.9       1.70          1 681           8 607

(b)
            PREPROC.              QUERY
#levels  time [min]  overhead  time [ms]  #settled nodes
5            16         68       0.77           884
7            13         28       1.19         1 290
9            13         27       1.51         1 574
11           13         27       1.62         1 694

Table 7.2: Preprocessing and query performance for the European road network depending on the contraction rate c (a) and the number of levels (b). 'overhead' denotes the average memory overhead per node in bytes.

reduces the preprocessing and query times since the cores and search spaces get smaller. However, the memory usage and the average degree are increased since more shortcuts are introduced. Adding too many shortcuts (c = 2) further reduces the search space, but the number of relaxed edges increases, so that the query times get worse.

In a third test series (Tab. 7.2 (b)), we used the default settings except for the number of levels, which we varied from 5 to 11. In each test case, a distance table was used in the topmost level. The construction of the higher levels of the hierarchy is very fast and has no significant effect on the preprocessing times. In contrast, using only five levels yields a rather large distance table, which somewhat slows down the preprocessing and increases the memory usage. However, in terms of query times, '5 levels' is the optimal choice since using the distance table is faster than continuing the search in higher levels.

Fast vs. Precise Construction. During various experiments, we came to the conclusion that it is a good idea not to take a fixed maverick factor f for all levels of the construction process, but to start with a low value (i.e., fast construction) and increase it level by level (i.e., more precise construction). For the following experiments, we used the sequence 0, 2, 4, 6, ….

Best Neighbourhood Sizes. For two levels ℓ and ℓ + 1 of a highway hierarchy, the shrinking factor is the ratio between |E′ℓ| and |E′ℓ+1|. In our experiments, we observed that the highway hierarchies of the USA and Europe were almost self-similar in the sense that the shrinking factor remained nearly unchanged from level to level when we used the same neighbourhood size H for all levels. We kept this approach and applied the same H iteratively until the construction led to an empty highway network. Figure 7.8 demonstrates the shrinking process for Europe. For most levels, we observe an almost constant shrinking factor (which appears as a straight line due to the logarithmic scale of the y-axis). The greater the neighbourhood size, the greater the shrinking factor. The


Figure 7.8: Shrinking of the highway networks of Europe. For different neighbourhood sizes H and for each level ℓ, we plot |E′ℓ|, i.e., the number of edges that belong to the core of level ℓ.

first iteration (Level 0→1) and the last few iterations are exceptions: at the first iteration, the construction works very well due to the characteristics of the real-world road network (there are many trees and lines that can be contracted); at the last iterations, the highway network collapses, i.e., it shrinks very fast, because nodes that are close to the border of the network usually do not belong to the next level of the highway hierarchy, and when the network gets small, almost all nodes are close to the border.

Multilevel Queries. Table 7.1 contains average values for queries where the source and target nodes are chosen randomly. For the two large graphs we get a speedup of more than 2 000 compared to Dijkstra's algorithm, both with respect to (query) time3 and with respect to the number of settled nodes.

For our largest road network (USA), the number of nodes that are settled during the search is less than the number of nodes that belong to the shortest paths that are found. Thus, we get an efficiency that is greater than 100%. The reason is that edges at high levels will often represent long paths containing many nodes.4

For use in applications it is unrealistic to assume a uniform distribution of queries in large graphs such as Europe or the USA. On the other hand, it would be hardly more realistic to arbitrarily cut the graph into smaller pieces. Therefore, we decided to measure

3It is likely that Dijkstra would profit more from a faster priority queue than our algorithm. Therefore, the time-speedup could decrease by a small constant factor.

4The reported query times do not include the time for expanding these paths. We have made measurements with a naive recursive expansion routine which never takes more than 50% of the query time. Also note that this process could be radically sped up by precomputing unpacked representations of edges.


Figure 7.9: Multilevel Queries. For each road network and each Dijkstra rank on the x-axis, 1 000 queries from random source nodes were performed. The results are represented as a box-and-whisker plot: each box spreads from the lower to the upper quartile and contains the median; the whiskers extend to the minimum and maximum value omitting outliers, which are plotted individually. Note that a logarithmic scale is used for the x-axis.

local queries within the big graphs: for each power of two r = 2^k, we choose random sample points s and then use Dijkstra's algorithm to find the node t with Dijkstra rank rs(t) = r. We then use our algorithm to make an s-t query. By plotting the resulting statistics for each value r = 2^k, we can see how the performance scales with a natural measure of difficulty of the query. Figure 7.9 shows the query times. Note that the median query times scale quite smoothly, and the growth is much slower than the exponential increase we would expect in a plot with logarithmic x-axis, linear y-axis, and any growth rate of the form r^ρ for Dijkstra rank r and some constant power ρ. The curve is also not the straight line one would expect from a query time logarithmic in r. Note that for the largest rank, query times are actually decreasing. A possible explanation is that these queries have at least one of source or destination node in the border area of the map, where the road network is often not very dense (e.g., northern Norway). This plot was done without using distance tables, which would also cut the costs at some point where every query moves to the highest level and then resorts to the table.

Worst Case Upper Bounds. By executing a query from each node of a given graph to an added isolated dummy node and a query from the dummy node to each actual node in the backward graph, we obtain a distribution of the search spaces of the forward and backward searches, respectively. We can combine both distributions to get an upper bound for the distribution of the search spaces of bidirectional queries: when F→(x) (F←(x))



Figure 7.10: Histogram of upper bounds for the search spaces of s-t-queries.

denotes the number of source (target) nodes whose search space consists of x nodes in a forward (backward) search, we define F↔(z) := Σ_{x+y=z} F→(x) · F←(y), i.e., F↔(z) is the number of s-t-pairs such that the upper bound of the search space size of a query from s to t is z. In particular, we obtain the upper bound max {z | F↔(z) > 0} for the worst case without performing all n² possible queries. Figure 7.10 visualises the distribution F↔(z) as a histogram.
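Computing F↔ from the two measured distributions is a plain convolution of two histograms, e.g.:

#include <cstdint>
#include <vector>

// Fboth[z] = sum over x+y=z of Ffwd[x] * Fbwd[y]; the index is the search
// space size, the value the number of sources/targets with that size.
std::vector<uint64_t> combineDistributions(const std::vector<uint64_t>& Ffwd,
                                           const std::vector<uint64_t>& Fbwd) {
    if (Ffwd.empty() || Fbwd.empty()) return {};
    std::vector<uint64_t> Fboth(Ffwd.size() + Fbwd.size() - 1, 0);
    for (size_t x = 0; x < Ffwd.size(); ++x)
        for (size_t y = 0; y < Fbwd.size(); ++y)
            Fboth[x + y] += Ffwd[x] * Fbwd[y];
    return Fboth;
}
// The worst-case upper bound is the largest z with Fboth[z] > 0.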

7.4 Transit Node Routing

When you drive somewhere 'far away', you will leave your current location via one of only a few 'important' traffic junctions. For graphs representing road networks, this means: first, there is a relatively small set of transit nodes, about 10 000 for the US road network, with the property that for every pair of nodes that are 'not too close' to each other, the shortest path between them passes through at least one of these transit nodes. Second, for every node, the set of transit nodes encountered first when going far—we call these access nodes—is small (about 10). We will now try to exploit this property.

To simplify notation, we will present the approach for undirected graphs. However, the method is easily generalised to directed graphs, and our highway hierarchy implementation already handles directed graphs. Consider any set T ⊆ V of transit nodes, an access mapping A : V → 2^T, and a locality filter L : V × V → {true, false}. We require that


¬L(s, t) implies that the shortest path distance is

d(s, t) = min {d(s, u) + d(u, v) + d(v, t) : u ∈ A(s), v ∈ A(t)}.   (7.1)

Equation (7.1) implies that the shortest path between nodes that are not near each other goes through transit nodes at both ends. In principle, we can pick any set of transit nodes, any access mapping, and any locality filter fulfilling Equation (7.1) to obtain a transit node query algorithm: assume we have precomputed all distances between nodes in T. If ¬L(s, t), then compute d(s, t) using Equation (7.1). Else, use any other routing algorithm.
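The resulting query is only a few lines. The sketch below assumes the transit nodes are renumbered 0, …, |T|−1 so that the precomputed distances form a plain table; accessNodes, table, and localQuery are illustrative names for the precomputed access mapping, the distance table, and an arbitrary fallback routing algorithm.

#include <algorithm>
#include <cstdint>
#include <vector>

struct Access { int transitNode; uint64_t dist; };   // u in A(s) with d(s,u)

uint64_t localQuery(int s, int t);   // fallback (e.g. a highway query), assumed

uint64_t transitNodeQuery(int s, int t, bool isLocal /* = L(s,t) */,
                          const std::vector<std::vector<Access>>& accessNodes,
                          const std::vector<std::vector<uint64_t>>& table) {
    if (isLocal) return localQuery(s, t);   // L(s,t): use any other algorithm
    uint64_t best = UINT64_MAX;             // otherwise apply Equation (7.1)
    for (const Access& u : accessNodes[s])
        for (const Access& v : accessNodes[t])
            best = std::min(best,
                            u.dist + table[u.transitNode][v.transitNode] + v.dist);
    return best;
}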

Of course, we want a good choice of (T, A, L). T should be small but allow many global queries, L should efficiently identify as many of these global query pairs as possible, and we should be able to store and evaluate A efficiently.

We can apply a second layer of generalised transit node routing to the remaining local queries (which may dominate some real-world applications). We have a node set T2 ⊃ T, an access mapping A2 : V → 2^T2, and a locality filter L2 such that ¬L2(s, t) implies that the shortest path distance is given by Equation (7.1) or by

d(s, t) = min {d(s, u) + d(u, v) + d(v, t) : u ∈ A2(s), v ∈ A2(t)}.   (7.2)

In order to be able to evaluate Equation (7.2) efficiently, we need to precompute the local connections from {d(u, v) : u, v ∈ T2 ∧ L(u, v)} which cannot be obtained using Equation (7.1).

In an analogous way we can add further layers.

7.4.1 Computing Transit Nodes

Computing Access Nodes: Backward Approach.

Start a Dijkstra search from each transit node v ∈ T. Run it until all paths leading to nodes in the priority queue pass over another node w ∈ T. Record v as an access node for any node u on a shortest path from v that does not lead over another node in T. Record an edge (v, w) with weight d(v, w) for a transit graph G[T] = (T, E_T). When this local search has been performed from all transit nodes, we have found all access nodes, and the distance table can be computed using an all-pairs shortest path computation in G[T].

Layer-2 Information

is computed similarly to the top level information, except that a search on the transit graph G[T2] can be stopped when all paths in the priority queue pass over a top level transit node w ∈ T. Layer-2 distances from each node v ∈ T2 can be stored space-efficiently in a static hash table. We only need to store distances that actually improve on the distances obtained going via the top level T.


Computing Access Nodes: Forward Approach.

Start a Dijkstra search from each node u. Stop when all paths in the shortest path tree are 'covered' by transit nodes. Take these transit nodes as access points of u. Applied naively, this approach is rather inefficient. However, we can use two tricks to make it efficient. First, during the search we do not relax the edges leaving transit nodes. This leads to the computation of a superset of the access points. Fortunately, this set can easily be reduced if the distances between all transit nodes are already known: if an access point v′ can be reached from u via another access point v on a shortest path, we can discard v′. Second, we only determine the access point sets A(v) for all nodes v ∈ T2 and the sets A2(u) for all nodes u ∈ V. Then, for any node u, A(u) can be computed as ⋃_{v∈A2(u)} A(v). Again, we can use the reduction technique to remove unnecessary elements from the set union.

Locality Filters.

There seem to be two basic approaches to transit node routing. One starts with a locality filter L and then has to find a good set of transit nodes T for which L works. The other approach starts with T and then has to find a locality filter that can be efficiently evaluated and detects as accurately as possible whether local search is needed. One approach that we found very effective is to use the information gained when computing the distance table for layer i + 1 to define a locality filter for layer i. For example, we can compute the radius ri(u) of a circle around every node u ∈ Ti+1 that contains for each entry d(u, v) in the layer-(i + 1) table the meeting point of a bidirectional search between u and v. We can use this information in several ways. We can (pre)compute conservative circle radii for arbitrary nodes v as ri(v) := max {‖v − u‖2 + ri(u) : u ∈ Ai+1(v)}. Note that even if we are not able to store the information gathered during a precomputation at layer i + 1, it might still make sense to run it in order to gather the more compact locality information.
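The conservative radius formula translates directly into code. The sketch below assumes two-dimensional node coordinates and precomputed radii for the layer-(i+1) nodes; all names are ours, not the original implementation.

#include <algorithm>
#include <cmath>
#include <vector>

struct Point { double x, y; };

// r_i(v) := max { ||v-u||_2 + r_i(u) : u in A_{i+1}(v) }
double conservativeRadius(const Point& v,
                          const std::vector<int>& accessNodes,  // A_{i+1}(v)
                          const std::vector<Point>& coord,      // node coordinates
                          const std::vector<double>& r) {       // precomputed r_i(u)
    double radius = 0.0;
    for (int u : accessNodes)
        radius = std::max(radius,
                          std::hypot(v.x - coord[u].x, v.y - coord[u].y) + r[u]);
    return radius;
}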

Combining with Highway Hierarchies

Nodes on high levels of a highway hierarchy have the property that they are used on shortest paths far away from starting and target nodes. 'Far away' is defined with respect to the Dijkstra rank. Hence, it is natural to use (the core of) some level K of the highway hierarchy for the transit node set T. Note that we have quite good (though indirect) control over the resulting size of T by choosing the appropriate neighbourhood sizes and the appropriate value for K =: K1. In our current implementation this is level 4, 5, or 6. In addition, the highway hierarchy helps us to efficiently compute the required information. Note that there is a difference between the level of the highway hierarchy and the layer of transit node search.



Figure 7.11: Example for the extension of the geometric locality filter. The grey nodes constitute the set Ai+1(v).

We can also combine the technique of distance tables (many-to-many queries) with transit nodes. Roughly, this algorithm first performs independent backward searches from all transit nodes and stores the gathered distance information in buckets associated with each node. Then, a forward search from each transit node scans all buckets it encounters and uses the resulting path length information to update a table of tentative distances. This approach can be generalised for computing distances at layer i > 1. We use the forward approach to compute the access point sets (in our case, we do not perform Dijkstra searches, but highway searches).
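The bucket idea can be sketched as follows: the backward phase deposits (target, distance) entries at every node it settles, and the forward phase scans these buckets. The container layout (parallel vectors, sources and targets renumbered from 0) is an illustrative assumption.

#include <cstdint>
#include <vector>

struct BucketEntry { int target; uint64_t dist; };   // d(node, target)

// Forward phase for one source: for every node v settled with distance
// dForward[v], scan v's bucket and update the tentative distance table.
void scanBuckets(int source,
                 const std::vector<int>& settledNodes,
                 const std::vector<uint64_t>& dForward,
                 const std::vector<std::vector<BucketEntry>>& bucket,
                 std::vector<std::vector<uint64_t>>& table) {
    for (int v : settledNodes)
        for (const BucketEntry& e : bucket[v]) {
            uint64_t d = dForward[v] + e.dist;
            if (d < table[source][e.target]) table[source][e.target] = d;
        }
}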

7.4.2 Experiments

Environment, Instances, and Parameters

The experiments were done on one core of an AMD Opteron Processor 270 clocked at 2.0 GHz with 8 GB main memory and 2 × 1 MB L2 cache, running SuSE Linux 10.0 (kernel 2.6.13). The program was compiled by the GNU C++ compiler 4.0.2 using optimisation level 3. We deal with the same networks we already used for the experiments on Highway Hierarchies. We assign average speeds to the road categories, compute for each edge the average travel time, and use it as weight. In addition to this travel time metric, we perform experiments on variants of the European graph with a distance metric and the unit metric.

We use two variants of the transit node approach: variant economical aims at a good compromise between space consumption, preprocessing time, and query time. Economical uses two layers and reconstructs the access node set and the locality filter needed for the layer-1 query using information only stored with nodes in T2, i.e., for a layer-1 query


with source node s, we build the union ⋃_{u∈A2(s)} A(u) of all layer-1 access nodes of all layer-2 access nodes of s to determine on-the-fly a layer-1 access node set for s. Similarly, a layer-1 locality filter for s is built using the locality filters of the layer-2 access nodes. Variant generous accepts larger distance tables by choosing K = 4 (however, using somewhat larger neighbourhoods for constructing the hierarchy). Generous stores all information required for a query with every node. To obtain a high quality layer-2 filter L2, the generous variant performs a complete layer-3 preprocessing based on the core of level 1 and also stores a distance table for layer 3.

Since it has turned out that better performance is obtained when the preprocessing starts with a contraction phase, we practically skip the first construction step (by choosing neighbourhood sets that contain only the node itself) so that the first highway network virtually corresponds to the original graph. Then, the first real step is the contraction of level 1 to get its core. Note that, compared to the numbers presented for Highway Hierarchies, we use a slightly improved contraction heuristic, which sorts the nodes according to degree and then tries to bypass the node with the smallest degree first.

Main Results

Table 7.3 gives the preprocessing times for both road networks and both the travel time and the distance metric; in case of the travel time metric, we distinguish between the economical and the generous variant. In addition, some key facts on the results of the preprocessing, e.g., the sizes of the transit node sets, are presented. It is interesting to observe that for the travel time metric in layer 2, the actual distance table size is only about 0.1% of the size a naive |T2| × |T2| table would have. As expected, the distance metric yields more access points than the travel time metric (a factor of 2–3) since not only junctions on very fast roads (which are rare) qualify as access points. The fact that we have to increase the neighbourhood size from level to level in order to achieve an effective shrinking of the highway networks leads to comparatively high preprocessing times for the distance metric.

Table 7.4 summarises the average case performance of transit node routing. For the travel time metric, the generous variant achieves average query times more than two orders of magnitude lower than highway hierarchies alone. At the cost of a factor 2.4 in query time, the economical variant saves around a factor of two in space and a factor of 3.5 in preprocessing time.

Finding a good locality filter is one of the biggest challenges of a highway hierarchy based implementation of transit node routing. The values in Tab. 7.4 indicate that our filter is suboptimal: for instance, only 0.0064% of the queries performed by the economical variant in the US network with the travel time metric would require a local search to answer them correctly. However, the locality filter L2 forces us to perform local searches in 0.278% of all cases. The high-quality layer-2 filter employed by the generous variant


                       layer 1                  layer 2                      layer 3
metric  variant   |T|     |table|  |A|    |T2|      |table2|  |A2|    |T3|       |table3|   space     time
                          [×10^6]                  [×10^6]                      [×10^6]    [B/node]  [h]
USA  time  eco   12 111     147     6.1   184 379      30      4.9        –         –        111     0:59
     time  gen   10 674     114     5.7   485 410     204      4.2   3 855 407     173       244     3:25
     dist  eco   15 399     237    17.0   102 352      41     10.9        –         –        171     8:58
EUR  time  eco    8 964      80    10.1   118 356      20      5.5        –         –        110     0:46
     time  gen   11 293     128     9.9   323 356     130      4.1   2 954 721     119       251     2:44
     dist  eco   11 610     135    20.3    69 775      31     13.1        –         –        193     7:05

Table 7.3: Statistics on preprocessing for the highway hierarchy approach. For each layer, we give the size (in terms of number of transit nodes), the number of entries in the distance table, and the average number of access points to the layer. 'Space' is the total overhead of our approach.

                   layer 1 [%]         layer 2 [%]            layer 3 [%]
metric  variant  correct  stopped   correct   stopped     correct    stopped    query time
USA  time  eco    99.86    98.87    99.9936   99.7220        –          –        11.5 µs
     time  gen    99.89    99.20    99.9986   99.9862    99.99986   99.99984      4.9 µs
     dist  eco    98.43    91.90    99.9511   97.7648        –          –        87.5 µs
EUR  time  eco    99.46    97.13    99.9908   99.4157        –          –        13.4 µs
     time  gen    99.74    98.65    99.9985   99.9810    99.99981   99.99972      5.6 µs
     dist  eco    95.32    81.68    99.8239   95.7236        –          –       107.4 µs

Table 7.4: Performance of transit node routing with respect to 10 000 000 randomly chosen (s, t)-pairs. Each query is performed in a top-down fashion. For each layer i, we report the percentage of the queries that are answered correctly in some layer ≤ i and the percentage of the queries that are stopped after layer i (i.e., ¬Li(s, t)).

is considerably more effective; still, the percentage of false positives is about 90%.

For the distance metric, the situation is worse. Only 92% and 82% of the queries

are stopped after the top layer has been searched (for the US and the European network, respectively). This is due to the fact that we had to choose the cores of levels 6 and 4 as layers 1 and 2 since the shrinking of the highway networks is less effective, so that lower levels would be too big. It is important to note that we concentrated on the travel time metric—since we consider the travel time metric more important for practical applications—and spent comparatively little time tuning our approach for the distance metric. For example, a variant using a third layer (namely levels 6, 4, and 2 as layers 1, 2, and 3), which is not yet supported by our implementation, seems to be promising. Nevertheless, the current version shows feasibility and still achieves an improvement of a factor of 71 and 56 (for the US and the European network, respectively) over highway hierarchies alone. We again use a box-and-whisker plot to account for variance in


[Plot: query time [µs] against Dijkstra rank (2^5 to 2^24) for the economical and generous variants.]

Figure 7.12: Query times for the USA with the travel time metric as a function of Dijkstra rank.

query times. For the generous approach, we can easily recognise the three layers of transit node routing with small transition zones in between: for ranks 2^18–2^24, we usually have ¬L(s, t) and thus only require cheap distance table accesses in layer 1. For ranks 2^12–2^16, we need additional look-ups in the table of layer 2, so that the queries get somewhat more expensive. In this range, outliers can be considerably more costly, indicating that occasional local searches are needed. For small ranks we usually need local searches and additional look-ups in the table of layer 3. Still, the combination of a local search in a very small area and table look-ups in all three layers usually results in query times of only about 20 µs.

In the economical approach, we observe a high variance in query times for ranks 2^15–2^16. In this range, all types of queries occur, and the difference between the layer-1 queries and the local queries is rather big since the economical variant does not make use of a third layer. For smaller ranks, we see a picture very similar to basic highway hierarchies, with query time growing logarithmically with the Dijkstra rank.

7.4.3 Complete Description of the Shortest Path

For a given node pair (s, t), in order to get a complete description of the shortest s-t-path, we first perform a transit node query and determine the layer i that is used to obtain the shortest path distance. Then, we have to determine the path from s to the forward access point u to layer i, the path from the backward access point v to t, and the path from u to


v. In case of a local query, we can fall back on a normal highway search.

Currently, we provide an efficient implementation only for the case that the path goes

through the top layer. In all other cases, we just perform a normal highway search. The effect on the average times is very small since more than 99% of the queries are correctly answered using only the top search (in case of the travel time metric; cp. Tab. 7.4).

When a node s and one of its access points u are given, we can determine the next node on the shortest path from s to u by considering all adjacent nodes s′ of s and checking whether d(s, s′) + d(s′, u) = d(s, u). In most cases, the distance d(s′, u) is directly available since u is also an access point of s′. In a few cases—when u is not an access point of s′—we have to consider all access points u′ of s′ and check whether d(s, s′) + d(s′, u′) + d(u′, u) = d(s, u). Note that d(u′, u) can be looked up in the top distance table. Using this subroutine, we can determine the path from s to the forward access point u and from the backward access point v to t.
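A sketch of this subroutine, with dist standing for the distance lookup just described (direct access point entry, or the detour via the access points u′ of s′); the parallel neighbour/weight layout and all names are illustrative assumptions.

#include <cstdint>
#include <vector>

uint64_t dist(int a, int b);   // assumed: lookup as described in the text

struct Neighbour { int node; uint64_t edgeWeight; };

// Next node on a shortest path from s to its access point u:
// the neighbour s2 with d(s,s2) + d(s2,u) = d(s,u).
int nextNodeOnPath(int s, int u, const std::vector<Neighbour>& neighboursOfS) {
    for (const Neighbour& n : neighboursOfS)
        if (n.edgeWeight + dist(n.node, u) == dist(s, u))
            return n.node;     // n.node lies on a shortest s-u-path
    return -1;                 // cannot happen for a correct distance oracle
}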

A similar procedure can be used to find the path from u to v. However, in this case, we consider only adjacent nodes u′ of u that belong to the top layer as well, because only for these nodes can we look up d(u′, v). Since there are shortest paths between top layer nodes that leave the top layer—we call such paths hidden paths—we execute an additional preprocessing step that determines all hidden paths and stores them in a special data structure (after the used shortcuts have been expanded). Whenever we cannot find the next node on the path to v considering only adjacent nodes in the top layer, we look for the right hidden path that leads to the next node in the top layer. In Tab. 7.5 we give the additional preprocessing time and the additional disk space for the hidden paths and the unpacking data structures. Furthermore, we report the additional time that is needed to determine a complete description of the shortest path and to traverse5 it, summing up the weights of all edges as a sanity check—assuming that the distance query has already been performed. That means that the total average time to determine a shortest path is the time given in Tab. 7.5 plus the query time given in Tab. 7.4.

7.5 Dynamic Shortest Path Computation

The successful methods we have seen so far are static, i.e., they assume that the network—including its edge weights—does not change. This makes it possible to preprocess some information once and for all that can be used to accelerate all subsequent point-to-point queries. However, real road networks change all the time. In this section, we address two such dynamic scenarios: individual edge weight updates, e.g., due to traffic jams, and switching between different cost functions that take vehicle type, road restrictions, or driver preferences into account.

5Note that we do not traverse the path in the original graph, but directly scan the assembled description of the path.


       preproc. [min]  space [MB]  query [µs]  # hops (avg.)
USA         4:04           193         258         4 537
EUR         7:43           188         155         1 373

Table 7.5: Additional preprocessing time, additional disk space, and query time needed to determine a complete description of the shortest path and to traverse it, summing up the weights of all edges—assuming that the query to determine its length has already been performed. Moreover, the average number of hops—i.e., the average path length in terms of number of nodes—is given. These figures refer to experiments on the graphs with the travel time metric using the generous variant.

7.5.1 Covering Nodes

We now introduce the concept of 'covering nodes', which will be useful later.

Problem Definition.

During a Dijkstra search from s, we say that a settled node u is covered by a node set V′ if there is at least one node v ∈ V′ on the path from the root s to u. A queued node is covered if its tentative parent is covered. The current partial shortest-path tree T is covered if all currently queued nodes are covered. All nodes v ∈ V′ ∩ T that have no covered parent in T are covering nodes, forming the set CG(V′, s).

The crucial subroutine of all algorithms in the subsequent sections takes a graph G, a node set V′, and a root s and determines all covering nodes CG(V′, s). We distinguish between four different ways of doing this.

Conservative Approach.

The conservative variant (Fig. 7.13 (a)) works in the obvious way: a search from s is stopped as soon as the current partial shortest-path tree T is covered. Then, it is straightforward to read off all covering nodes. However, if the partial shortest-path tree contains one path that is not covered for a long time, the tree can get very big even though all other branches might have been covered very early. In our application, this is a critical issue due to long-distance ferry connections.

Aggressive Approach.

As an overreaction to the above observation, we might want to define an aggressive variant that does not continue the search from any covering node, i.e., some branches might


[Figure 7.13 panels: (a) conservative, (b) aggressive, (c) stall-in-advance, (d) stall-on-demand.]

Figure 7.13: Simple example for the computation of covering nodes. We assume thatall edges have weight 1 except for the edges (s, v) and (s, x), which have weight 10. Ineach case, the search process is started from s. The set V ′ consists of all nodes that arerepresented by a square. Thick edges belong to the search tree T . Nodes that belong tothe computed superset CG(V ′, s) of the covering nodes are highlighted in grey. Note thatthe actual covering node set contains only one node, namely u.

be terminated early, while only the non-covered paths are followed further on. Unfortunately, this provokes two problems. First, we can no longer guarantee that T contains only shortest paths. As a consequence, we get a superset CG(V′, s) of the covering nodes, which can still be used to obtain correct results; however, the performance will be impaired. In Section 7.5.2, we will explain how to reduce a given superset rather efficiently in order to obtain the exact covering node set. Second, the tree T can get even bigger since the search might continue around the covering nodes where we pruned the search.6 In our example (Fig. 7.13 (b)), the search is pruned at u, so that v is reached using a much longer path that leads around u. As a consequence, w is superfluously marked as a covering node.

Stall-in-Advance Technique.

If we decide not to prune the search immediately, but to go on 'for a while' in order to stall other branches, we obtain a compromise between the conservative and the aggressive variant, which we call stall-in-advance. One heuristic we use prunes the search at node z when the path explored from s to z contains p nodes of V′ for some tuning parameter p. Note that for p := 1, the stall-in-advance variant corresponds to the aggressive variant. In our example (Fig. 7.13 (c)), we use p := 2. Therefore, the search is not pruned until w is settled. This stalls the edge (s, v) and, in contrast to (b), the node v is covered. Still, the search is pruned too early, so that the edge (s, x) is used to settle x.

6Note that the query algorithm of the separator-based approach virtually uses the aggressive variant to compute covering nodes. This is reasonable since the search can never 'escape' the component where it started.


Stall-on-Demand Technique.

In the stall-in-advance variant, relaxing an edge leaving a covered node is based on the 'hope' that this might stall another branch. However, our heuristic is not perfect, i.e., some edges are relaxed in vain, while other edges that would have been able to stall other branches are not relaxed. Since we are not able to make the perfect decision in advance, we introduce a fourth variant, namely stall-on-demand. It is an extension of the aggressive variant, i.e., at first, edges leaving a covered node are not relaxed. However, if such a node u is reached later via another path, it is woken up and a breadth-first search (BFS) is performed from that node: an adjacent node v that has already been reached by the main search is inserted into the BFS queue if we can prove that the best path P found so far is suboptimal. This is certainly the case if the path from s via u to v is shorter than P. All nodes encountered during the BFS are marked as stalled. The main search is pruned at stalled nodes. Furthermore, stalled nodes are never marked as covering nodes. The stalling process cannot invalidate the correctness since only nodes are stalled that otherwise would contribute to suboptimal paths. In our example (Fig. 7.13 (d)), the search is pruned at u. When v is settled, we assume that the edge (v, w) is relaxed first. Then, the edge (v, u) wakes the node u up. A stalling process (a BFS) is started from u. The nodes v and w are marked as stalled. When w is settled, its outgoing edges are not relaxed. Similarly, the edge (x, w) wakes the stalled node w, and another stalling process is performed.
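One plausible reading of the wake-up step as code: starting from the woken node u, propagate 'stalling distances' with a BFS and stall every reached node whose tentative distance in the main search is provably suboptimal. The data layout is an assumption for illustration; the original implementation may differ in detail.

#include <cstdint>
#include <queue>
#include <vector>

struct Arc { int target; uint64_t weight; };

// dist[]: tentative distances of the main search (UINT64_MAX = unreached).
void stallOnDemand(int u,
                   const std::vector<std::vector<Arc>>& g,
                   const std::vector<uint64_t>& dist,
                   std::vector<bool>& stalled) {
    std::vector<uint64_t> stallDist(g.size(), UINT64_MAX);
    std::queue<int> bfs;
    stallDist[u] = dist[u];    // u was settled with distance dist[u]
    stalled[u] = true;
    bfs.push(u);
    while (!bfs.empty()) {
        int x = bfs.front(); bfs.pop();
        for (const Arc& a : g[x]) {
            int v = a.target;
            uint64_t d = stallDist[x] + a.weight;
            // stall v only if the main search reached it via a longer path,
            // i.e. the path from s via u to v proves v's best path suboptimal
            if (dist[v] != UINT64_MAX && d < dist[v] && !stalled[v]) {
                stalled[v] = true;
                stallDist[v] = d;
                bfs.push(v);
            }
        }
    }
}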

7.5.2 Static Highway-Node Routing

Multi-Level Overlay Graph.

For given highway-node sets V =: V0 ⊇ V1 ⊇ … ⊇ VL, we give a definition of the multi-level overlay graph G = (G0, G1, …, GL): G0 := G and for each ℓ > 0, we have Gℓ := (Vℓ, Eℓ) with

Eℓ := {(s, t) ∈ Vℓ × Vℓ | ∃ shortest path P = ⟨s, u1, u2, …, uk, t⟩ in Gℓ−1 s.t. ∀i : ui ∉ Vℓ}.

Node Selection

We can choose any highway node sets to get a correct procedure. However, the efficiency of both the preprocessing and the query very much depends on the highway node sets. Roughly speaking, a node that lies on a lot of shortest paths should belong to the node set of a high level. In a first implementation, we use the set of level-ℓ core nodes of the highway hierarchy of G as highway node set Vℓ. In other words, we let the construction procedure of the highway hierarchies decide the importance of the nodes.


7.5.3 Construction

The multi-level overlay graph is constructed in a bottom-up fashion. In order to construct level ℓ > 0, we perform for each node s ∈ Vℓ a Dijkstra search in Gℓ−1 that is stopped as soon as the partial shortest-path tree is covered by Vℓ \ {s}. For each path P = ⟨s, u1, u2, …, uk, t⟩ in T with the property that ∀i : ui ∉ Vℓ, we add an edge (s, t) with weight w(P) to Eℓ.

Theorem 11 The construction algorithm yields the multi-level overlay graph.

Faster Construction Heuristics. Using the above construction procedure, we encounter the same performance problems and provide similar solutions as for the Highway Hierarchies: if the partial shortest-path tree contains a path that is not covered by a highway node for a long time, the tree can get very big even though all other branches might have been covered very early. In particular, we observed this behaviour in the European road network for long-distance ferry connections and for some long dead-end streets in the Alps. It is possible to prune the search at any settled node that is covered by Vℓ \ {s}. However, applying this aggressive pruning technique has two disadvantages. First, we can no longer guarantee that T contains only shortest paths. As a consequence, we obtain a superset of Eℓ, which does not invalidate the correctness of the query, but slows it down. Second, the tree T can get even bigger since the search might continue on slower roads around the nodes where we pruned the search.

It turns out that a good compromise is to prune only some edges at some nodes. We use two heuristic pruning rules. First, if for the current covered node u and some constant ∆ we have d(s, u) + ∆ < min {δ(v) | v reached, not settled, not covered by Vℓ \ {s}}, then u's edges are not relaxed. Second, if on the path from s to the current node u there are at least p nodes in some level ℓ (for some constant p), then all edges (u, v) in levels < ℓ are pruned.

After efficiently computing a superset of an overlay edge set Eℓ, we can apply a fast reduction step to get rid of the superfluous edges: for each node u ∈ Vℓ, we perform a search in Gℓ (instead of Gℓ−1) until all adjacent nodes have been settled. For any node v that has been settled via a path that consists of more than one edge, we can remove the edge (u, v) since a (better) alternative that does not require this edge has been found.

7.5.4 Query

The query algorithm is a bidirectional procedure: the backward search works completely analogously to the forward search, so that it is sufficient to describe only the forward search. The search is performed in a bottom-up fashion. We perform a Dijkstra search from s in G0 and stop the search as soon as the search tree is covered by V1. From all covering nodes, the search is continued in G1 until it is covered by V2, and so on. In


the topmost level, we can abort when forward and backward search meet. Figure 7.14 contains the pseudo-code of the query algorithm for the forward direction.

input: source node s;
VL+1 := ∅;  // to avoid case distinctions
S0 := {s}; δ0(s) := 0;
for ℓ := 0 to L do
    V′ℓ := Vℓ ∪ {s′};  // s′ is a new artificial node
    E′ℓ := Eℓ ∪ {(s′, u) | u ∈ Sℓ}; w(s′, u) := δℓ(u);
    perform Dijkstra search from s′ in G′ℓ := (V′ℓ, E′ℓ),
        stop when search tree is covered by Vℓ+1;
    Sℓ+1 := ∅;
    foreach covering node u do
        add u to Sℓ+1;
        δℓ+1(u) := d(s′, u);

Figure 7.14: The query algorithm for the forward direction.

Theorem 12 The query algorithm always finds a shortest path.

7.5.5 Analogies To and Differences From Related Techniques

Transit Node Routing. Let us consider a Dijkstra search from some node in a road network. We observe that some branches are very important—they extend through the whole road network—while other branches are stalled by the more important branches at some point. For instance, there might be all types of roads (motorways, national roads, rural roads) that leave a certain region around the source node, but usually the branches that leave the region via rural roads end at some point, since all further nodes are reached on a faster path using motorways or national roads. Transit node routing exploits this observation: not all nodes that separate different regions are selected as transit nodes, but only the nodes on the important branches.

Multi-level highway node routing uses the same argument to select the highway nodes. However, the distances from each node to the neighbouring highway nodes are not precalculated but computed during the query (using an algorithm very similar to the preprocessing algorithm for transit node routing). Moreover, the distances between all highway nodes are not represented by tables, but by overlay graphs. The algorithms to construct the overlay graphs and to compute the distance tables for transit node routing (except for the topmost table) are very similar. The fact that multi-level highway node routing relies on less precomputed data allows the implementation of an efficient update operation.


Multi-Level Overlay Graphs. In contrast to transit node and multi-level highway node routing, in the original multi-level approach all nodes that separate different regions are selected, which leads to a comparatively high average node degree. This has a negative impact on the performance. Let us consider the original approach with the new selection strategy, i.e., only 'important' nodes are selected. Then, the graph is typically not decomposed into many small components, so the following performance problem arises in the query algorithm. From the highway/separator nodes, only edges of the overlay graph are relaxed. As a consequence, the unimportant branches are not stalled by the important branches. Thus, since the separator nodes on the unimportant branches have not been selected, the search might extend through large parts of the road network.

To sum up, there are two major steps to get from the original to the new multi-level approach: first, select only 'important' nodes and, second, at highway/separator nodes, do not switch immediately to the next level, but keep relaxing low-level edges 'for a while' until you can be sure that slow branches have been stalled.

Highway Hierarchies. We use the preprocessing of the highway hierarchies in order to select the highway nodes for our new approach. However, this is not the sole connection between both methods. In fact, we can interpret multi-level highway node routing as a modification of the highway hierarchy approach. (In particular, our actual implementation is a modification of the highway hierarchy program code.) An overlay graph can be represented by shortcut edges that belong to the appropriate level of the hierarchy. There are two main differences.

First, the neighbourhood of a node is defined in different ways. In case of the highway hierarchies, for a given parameter H, the H closest nodes belong to the neighbourhood. In case of multi-level highway node routing, all nodes belong to the neighbourhood that are settled by a search that is stopped when the search tree is covered by the highway node set.

Second, in case of the highway hierarchies, we decide locally when to switch to the next level, namely when the neighbourhood is left at some node. In case of multi-level highway node routing, we decide globally when to switch to the next level, namely when the complete search tree (not only the current branch) is covered by the highway node set7. By this modification, important branches can stall slow branches.

7.5.6 Dynamic Multi-Level Highway Node Routing

Various Scenarios

We could consider several types of changes in road networks, e.g.,

7This is a simplified description. As mentioned in Section 7.5.4, we enhance the query algorithm by some rules in order to deal with special cases like long-distance ferry connections more efficiently.


a) The structure of the road network changes: new roads are built, old roads are demolished. That means, edges can be added and removed.

b) A different cost function is used, which means that potentially all edge weights change. For example, a cost function can take into account different weightings of travel time, distance, and fuel consumption. With respect to travel time, we can think of different profiles of average speeds for each road category. In addition, for certain vehicle types there might be restrictions on some roads (e.g., bridges and tunnels). For many 'reasonable' cost functions, properties of the road network (like the inherent hierarchy) are possibly weakened, but not completely destroyed or even inverted. For instance, both a truck and a sports car—despite going different speeds—drive faster on a motorway than on an urban street.

c) An unexpected incident occurs: the travel time of a certain road or several roads in some area changes, e.g., due to a traffic jam. That means, a single or a few edge weights change. While a traffic jam causes a slow-down, the cancellation of a traffic jam causes a speed-up, so we have to deal with both increasing and decreasing edge weights.

d) The edge weights depend on the time of day according to some function known in advance. For example, such a function takes into account the rush hours.

The following paragraphs deal with types b) and c), respectively. We do not (explicitly) handle type a) since the addition of a new edge is a comparatively rare event in practical applications and the removal can be emulated by an edge weight change to infinity. Type d) is not (yet) covered by our work.

In case of type c), we can think of a server and a mobile scenario: in the former, a server has to react to incoming events by updating its data structures so that any point-to-point query can be answered correctly; in the latter, a mobile device has to react to incoming events by (re)computing a single point-to-point query taking into account the new situation. In the server scenario, it pays to invest some time to perform the update operation since a lot of queries depend on it. In the mobile scenario, however, we do not want to waste time updating parts of the graph that are irrelevant to the current query. In this paper, we concentrate on the server scenario.

Complete Recomputation

The more time-consuming part of the preprocessing is the determination of the highway node sets. As stated above, we assume that the application of a different profile of average speeds will not completely invalidate the hierarchical properties of the road network, i.e., a node that has been very important usually will not get completely unimportant and vice versa when a different vehicle type is used. Thus, we can still expect a good query performance when keeping the highway node sets and recomputing only the overlay graphs. In order to do so, we do not need any additional data structures. We can directly use the static approach, omitting the first preprocessing step (the determination of the highway node sets).


input: set of edges E^m with modified weight
define the set of modified nodes: V^m_0 := {u | (u, v) ∈ E^m}
foreach level ℓ ≥ 1 do
    V^m_ℓ := ∅
    foreach node v ∈ ⋃_{u ∈ V^m_{ℓ−1}} A^ℓ_u do
        repeat the construction step from v
        if something changes, put v into V^m_ℓ

Figure 7.15: The update algorithm that deals with a set of edge weight changes.

Updating a Few Edge Weights

Similar to the previous paragraph, when a single or a few edge weights change, we keep the highway node sets and update only the overlay graphs. In contrast to the previous scenario, we do not have to repeat the complete construction from scratch; it is sufficient to perform the construction step only from nodes that might be affected by the change. Certainly, a node v whose partial shortest-path tree of the initial construction did not contain any node u of a modified edge (u, x) is not affected: if we repeated the construction step from v, we would get exactly the same partial shortest-path tree and, consequently, the same result.

During the first construction (and all subsequent update operations), we manage sets A^ℓ_u of nodes whose level-ℓ preprocessing might be affected when an outgoing edge of u changes: when a level-ℓ construction step from some node v is performed, we add v to A^ℓ_u for each node u in the partial shortest-path tree. Note that these sets can be stored explicitly (as we do in our current implementation), or we could store a superset, e.g., by some kind of geometric container (a disk, for instance). Figure 7.15 contains the pseudo-code of the update algorithm.
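As an illustration, the following minimal C++ sketch shows one possible way to keep the sets A^ℓ_u; the names (AffectedSets, record, toRepeat) are ours and the actual implementation may organize the data differently:

#include <cstdint>
#include <vector>

// Sketch of the affected-set bookkeeping: affected[l][u] stores the nodes v
// whose level-l construction step visited u, i.e. the set A^l_u.
struct AffectedSets {
    std::vector<std::vector<std::vector<uint32_t>>> affected; // [level][u] -> {v, ...}

    AffectedSets(int levels, uint32_t n)
        : affected(levels, std::vector<std::vector<uint32_t>>(n)) {}

    // called for every node u in the partial shortest-path tree grown from v
    void record(int level, uint32_t u, uint32_t v) {
        affected[level][u].push_back(v);
    }

    // nodes whose level-'level' construction step must be repeated after the
    // outgoing edges of the nodes in 'modified' changed (may contain duplicates)
    std::vector<uint32_t> toRepeat(int level, const std::vector<uint32_t>& modified) const {
        std::vector<uint32_t> result;
        for (uint32_t u : modified)
            for (uint32_t v : affected[level][u])
                result.push_back(v);
        return result;
    }
};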

Theorem 13 After the update operation, we have the same situation as if we had repeated the complete construction procedure from scratch.

7.5.7 Experiments

Environment and Instances.

The experiments were done on one core of a single AMD Opteron Processor 270 clocked at 2.0 GHz with 8 GB main memory and 2 × 1 MB L2 cache, running SuSE Linux 10.0 (kernel 2.6.13). The program was compiled by the GNU C++ compiler 4.0.2 using optimisation level 3.

We deal with the road network of Western Europe, which was already used in the last sections. It consists of 18 029 721 nodes and 42 199 587 directed edges. The original graph contains for each edge a length and a road category. There are four major road categories (motorway, national road, regional road, urban street), which are divided into three subcategories each. In addition, there is one category for forest and gravel roads. We assign average speeds (130, 120, . . ., 10 km/h)8 to the road categories, compute for each edge the average travel time, and use it as weight. We call this our default speed profile. Experiments which we did on a US and Canadian road network of roughly the same size (provided by PTV as well) show exactly the same relative behaviour as in Section 7.3.4, namely that it is slightly more difficult to handle North America than Europe (e.g., 20% slower query times). We give detailed results only for Europe.

For now, we report the times needed to compute the shortest-path distance between two nodes without outputting the actual route. Note that we could also output full path descriptions. The query times are averages based on 10 000 randomly chosen (s, t)-pairs. In addition to providing average values, we use the methodology from 7.9 in order to plot query times against the 'distance' of the target from the source, where in this context the Dijkstra rank is used as a measure of distance: for a fixed source s, the Dijkstra rank of a node t is its rank w.r.t. the order in which Dijkstra's algorithm settles the nodes. Such plots are based on 1 000 random source nodes. After performing a lot of preliminary experiments, we decided to apply the stall-in-advance technique to the construction and update process (with p := 1 for the construction of level 1 and p := 5 for all other levels) and the stall-on-demand technique to the query.

Highway Hierarchy Construction.

In order to determine the highway node sets, we construct seven levels of the highway hierarchy using our default speed profile and neighbourhood size H = 70. This can be done in 16 minutes. For all further experiments, these highway-node sets are used.

Static Scenario.

The first data column of Tab. 7.6 contains the construction time of the multi-level overlay graph and the average query performance for the default speed profile. Figure 7.16 shows the query performance against the Dijkstra rank. The disk space overhead of the static variant is 8 bytes per node to store the additional edges of the multi-level overlay graph and the level data associated with the nodes. Note that this overhead can be further reduced to as little as 2.0 bytes per node, yielding query times of 1.55 ms (Tab. 7.9).

8We call this our default speed profile.


speed profile     default  (reduced)   fast car   slow car   slow truck   distance
constr. [min]     1:40     (3:04)      1:41       1:39       1:36         3:56
query [ms]        1.17     (1.12)      1.20       1.28       1.50         35.62
#settled nodes    1 414    (1 382)     1 444      1 507      1 667        7 057

Table 7.6: Construction time of the overlay graphs and query performance for different speed profiles using the same highway-node sets. For the default speed profile, we also give results for the case that the edge reduction step (Section 7.5.2) is applied.

                any road type        motorway                  national      regional     urban
|change set|    +    −    ∞    ×     +     −     ∞     ×       +     ∞       +    ∞       +    ∞
1               2.7  2.5  2.8  2.6   40.0  40.0  40.1  37.3    19.9  20.3    8.4  8.6     2.1  2.1
1 000           2.4  2.3  2.4  2.4   8.4   8.1   8.3   8.1     7.1   7.1     5.3  5.3     2.0  2.0

Table 7.7: Update times per changed edge [ms] for different road types and different update types: add a traffic jam (+), cancel a traffic jam (−), block a road (∞), and multiply the weight by 10 (×). Due to space constraints, some columns are omitted.

                affected     #settled nodes           query time [ms]
|change set|    queries      absolute   relative      init     search    total
1                0.6 %         2 347      (1.7)        0.3       2.0       2.3
10               6.3 %         8 294      (5.9)        1.9       7.2       9.1
100             41.3 %        43 042     (30.4)       10.6      36.9      47.5
1 000           82.6 %       200 465    (141.8)       62.0     181.9     243.9
10 000          97.5 %       645 579    (456.6)      309.9     627.1     937.0

Table 7.8: Query performance depending on the number of edge weight changes (select only motorways, multiply weight by 10). For ≤ 100 changes, 100 different edge sets are considered; for ≥ 1 000 changes, we deal only with one set. For each set, 1 000 queries are performed. We give the average percentage of queries whose shortest-path length is affected by the changes, the average number of settled nodes (also relative to zero changes), and the average query time, broken down into the init phase, where the reliable levels are determined, and the search phase.


              preprocessing       static queries       updates           dynamic queries
              time     space      time     #settled    compl.  single    #settled nodes
method        [min]    [B/node]   [ms]     nodes       [min]   [ms]      10 chgs.   1000 chgs.
HH pure       17       28         1.16     1 662       17      –         –          –
StHNR         19       8          1.12     1 382       3       –         –          –
StHNR mem     24       2          1.55     2 453       8       –         –          –
DynHNR        18       32         1.17     1 414       2       37        8 294      200 465
DynALT-16     (85)     128        (53.6)   74 441      (6)     (2 036)   75 501     255 754

Table 7.9: Comparison between pure highway hierarchies, three variants of highway-node routing (HNR), and dynamic ALT-16 [36]. 'Space' denotes the average disk space overhead. We give execution times both for a complete recomputation using a similar cost function and for an update of a single motorway edge multiplying its weight by 10. Furthermore, we give search space sizes after 10 and 1 000 edge weight changes (motorway, ×10) for the mobile scenario. Time measurements in parentheses have been obtained on a similar, but not identical machine.

Figure 7.16: Query performance (query time [ms]) against Dijkstra rank (2^11 .. 2^24) for the default speed profile, with edge reduction. Each box represents the three quartiles of a box-and-whisker plot.


The total disk space9 of 32 bytes per node also includes the original edges and a mapping from original to internal node IDs (which is needed since the nodes are reordered by level).

Changing the Cost Function.

In addition to our default speed profile, Tab. 7.6 also gives the construction and query times for a few other selected speed profiles (which have been provided by the company PTV AG) using the same highway-node sets. Note that for most road categories, our profile is slightly faster than PTV's fast car profile. The last speed profile ('distance') virtually corresponds to a distance metric, since for each road type the same constant speed is assumed. The performance in case of the three PTV travel time profiles is quite close to the performance for the default profile. Hence, we can switch between these profiles without recomputing the highway-node sets. The constant speed profile is a rather difficult case. Still, it would not completely fail, although the performance gets considerably worse. We assume that any other 'reasonable' cost function would rank somewhere between our default and the constant profile.

Updating a Few Edge Weights (Server Scenario).

In the dynamic scenario, we need additional space to manage the affected node sets A^ℓ_u. Furthermore, the edge reduction step is not yet supported in the dynamic case, so the total disk space usage increases to 56 bytes per node. In contrast to the static variant, the main memory usage is considerably higher than the disk space usage (around a factor of two), mainly because the dynamic data structures maintain vacancies that might be filled during future update operations.

We can expect different performances when updating very important roads (like motorways) or very unimportant ones (like urban streets, which are usually only relevant to very few connections). Therefore, for each of the four major road categories, we pick 1 000 edges at random. In addition, we randomly pick 1 000 edges irrespective of the road type. For each of these edge sets, we consider four types of updates: first, we add a traffic jam to each edge (by increasing the weight by 30 minutes); second, we cancel all traffic jams (by restoring the original weights); third, we block all edges (by increasing the weights by 100 hours, which virtually corresponds to 'infinity' in our scenario); fourth, we multiply the weights by 10 in order to allow comparisons to [36]. For each of these cases, Tab. 7.7 gives the average update time per changed edge. We distinguish between two change set sizes: dealing with only one change at a time and processing 1 000 changes simultaneously.

9The main memory usage is somewhat higher. However, we cannot give exact numbers for the static variant since our implementation does not allow switching off the dynamic data structures.


As expected, the performance depends mainly on the selected edge and hardly on the type of update. The average execution times for a single update operation range between 40 ms (for motorways) and 2 ms (for urban streets). Usually, an update of a motorway edge requires updates of most levels of the overlay graph, while the effects of an urban-street update are limited to the lowest levels. We get a better performance when several changes are processed at once: for example, 1 000 random motorway segments can be updated in about 8 seconds. Note that such an update operation will be even more efficient when the involved edges belong to the same local area (instead of being randomly spread), which might be a common case in real-world applications.

Updating a Few Edge Weights (Mobile Scenario).

Table 7.8 shows for the most difficult case (updating motorways) that using our modified query algorithm we can omit the comparatively expensive update operation and still get acceptable execution times, at least if only a moderate number of edge weight changes occur. Additional experiments have confirmed that, similar to the results in Tab. 7.7, the performance does not depend on the update type (add 30 minutes, multiply by 10, . . .), but on the edge type (motorway, urban street, . . .) and, of course, on the number of updates.

Comparisons.

Highway-node routing has similar preprocessing and query times as pure highway hierarchies, but (in the static case) a significantly smaller memory overhead. Table 7.9 gives detailed numbers, and it also contains a comparison to the dynamic ALT approach [36] with 16 landmarks. We can conclude that as a stand-alone method, highway-node routing is (clearly) superior to dynamic ALT w.r.t. all studied aspects.10

10Note that our comparison concentrates on only one variant of dynamic ALT: different landmark sets can yield different tradeoffs. Also, better results can be expected when a lot of very small changes are involved. Moreover, dynamic ALT can turn out to be very useful in combination with other dynamic speedup techniques yet to come.


Chapter 8

Minimum Spanning Trees

The section on the I-Max-Filter algorithm is based on [37]. The external memory algorithm is described in [38] and the addition on connected components was taken from [6].

8.1 Definition & Basic Remarks

Consider a connected1 undirected graph G = (V, E) with positive edge weights c : E → R+. A minimum spanning tree (MST) of G is defined by a set T ⊆ E of edges such that the graph (V, T) is connected and c(T) := ∑_{e∈T} c(e) is minimized. It is not difficult to see that T forms a tree2 and hence contains n − 1 edges.

Because MSTs are such a simple concept, they also show up in many seemingly unrelated problems such as clustering, finding paths that minimize the maximum edge weight used, or finding approximations for harder problems like TSP.

8.1.1 Two important properties

The following two properties are the base for nearly every MST algorithm. On an abstract level they even suffice to formulate the algorithms by Kruskal and Prim presented later.

Cut Property: Consider a proper subset S of V and an edge e ∈ {(s, t) ∈ E : s ∈ S, t ∈ V \ S} with minimal weight. Then there is an MST T of G that contains e.

Proof: Consider any MST T′ of G. Since T′ is a tree, T′ contains a unique edge e′ ∈ T′ connecting a node from S with a node from V \ S. Furthermore, T′ \ {e′} defines spanning trees for S and V \ S, and hence T = (T′ \ {e′}) ∪ {e} defines a spanning tree. By our assumption, c(e) ≤ c(e′) and therefore c(T) ≤ c(T′). Since T′ is an MST, we have c(T) = c(T′) and hence T is also an MST.

1If G is not connected, we may ask for a minimum spanning forest — a set of edges that defines an MST for each connected component of G.

2In this chapter we often identify a set of edges T with the subgraph (V, T).


Figure 8.1: The cut property.

Figure 8.2: The cycle property.


Cycle Property: Consider any cycle C ⊆ E and an edge e ∈ C with maximal weight. Then any MST of G′ = (V, E \ {e}) is also an MST of G.

Proof: Consider any MST T of G. Since trees contain no cycles, there must be some edge e′ ∈ C \ T. If e = e′ then T is also an MST of G′ and we are done. Otherwise, T′ = {e′} ∪ T \ {e} forms another tree, and since c(e′) ≤ c(e), T′ must also form an MST of G.

8.2 Classic Algorithms

The well known Jarnik-Prim algorithm starts from an (arbitrary) source node s and grows a minimum spanning tree by adding one node after the other, using the cut property. The set S is the set of nodes already added to the tree. This choice guarantees that the smallest edge leaving S is not in the tree yet.

This high-level description is of course not suited for implementation. The main challenge is to find (u, v) from the cut property efficiently. To this end, the algorithm in Figure 8.4 maintains the shortest connection between any node v ∈ V \ S and S in an (addressable) priority queue q. The smallest element in q gives the desired edge. To add a new node to S, we check its incident edges to see whether they give improved connections to nodes in V \ S. Note that by setting the distance of nodes in S to zero, edges leading back into S will be ignored as required by the cut property. This small trick saves a comparison in the innermost loop.

It may be interesting to study the form of graph representation we need for the Jarnik-Prim algorithm. The graph is accessed when we add a new node to the tree and scan its edges for new or cheaper connections to nodes outside the tree.


T := ∅
S := {s} for an arbitrary start node s
repeat n − 1 times
    find (u, v) fulfilling the cut property for S
    S := S ∪ {v}
    T := T ∪ {(u, v)}

Figure 8.3: Abstract description of the Jarnik-Prim algorithm

Function jpMST(V, E, c) : Set of Edge
    dist = [∞, . . . , ∞] : Array [1..n]    // dist[v] is the distance of v from the tree
    pred : Array of Edge                    // pred[v] is the shortest edge between S and v
    q : PriorityQueue of Node with dist[·] as priority
    dist[s] := 0; q.insert(s) for any s ∈ V
    for i := 1 to n − 1 do
        u := q.deleteMin()                  // new node for S
        dist[u] := 0
        foreach (u, v) ∈ E do
            if c((u, v)) < dist[v] then
                dist[v] := c((u, v)); pred[v] := (u, v)
                if v ∈ q then q.decreaseKey(v) else q.insert(v)
    return {pred[v] : v ∈ V \ {s}}

Figure 8.4: The Jarnik-Prim algorithm using priority queues
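For concreteness, here is a compact C++ rendering of Figure 8.4 (all names are ours). Since std::priority_queue offers no decreaseKey, the sketch uses the standard workaround of reinserting improved entries and skipping stale ones on deleteMin:

#include <functional>
#include <limits>
#include <queue>
#include <utility>
#include <vector>

using Weight = double;
// adj[u] lists (neighbour, weight); the graph is assumed undirected and connected
using Graph = std::vector<std::vector<std::pair<int, Weight>>>;

std::vector<std::pair<int, int>> jpMST(const Graph& adj) {
    const int n = static_cast<int>(adj.size());
    std::vector<Weight> dist(n, std::numeric_limits<Weight>::max());
    std::vector<int> pred(n, -1);          // tree endpoint of the best edge to v
    std::vector<bool> inTree(n, false);
    using QItem = std::pair<Weight, int>;  // (priority, node)
    std::priority_queue<QItem, std::vector<QItem>, std::greater<QItem>> q; // min-heap

    dist[0] = 0;
    q.push({0.0, 0});                      // arbitrary start node s = 0
    std::vector<std::pair<int, int>> mst;
    while (!q.empty()) {
        auto [d, u] = q.top();
        q.pop();
        if (inTree[u] || d > dist[u]) continue; // stale entry or already settled
        inTree[u] = true;
        if (pred[u] != -1) mst.push_back({pred[u], u});
        for (auto [v, c] : adj[u])
            if (!inTree[v] && c < dist[v]) {
                dist[v] = c;
                pred[v] = u;
                q.push({c, v});            // "decreaseKey" by reinsertion
            }
    }
    return mst;                            // n - 1 edges for a connected graph
}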


Figure 8.5: Adjacency array representation of an example graph (node array V with offsets into the edge array E, which stores targets and costs c).

T := ∅                                      // subforest of the MST
foreach (u, v) ∈ E in ascending order of weight do
    if u and v are in different subtrees of T then
        T := T ∪ {(u, v)}                   // join two subtrees
return T

Figure 8.6: An abstract description of Kruskal's algorithm

An adjacency array (a static variant of the well-known adjacency list) supports this mapping from nodes to incident edges: We maintain the edges in a sorted array, first listing all neighbors of node 1 (and the costs to reach them), then all neighbors of node 2, etc. A second array maintains a pointer for every node leading to its first incident edge.

This representation is very cache efficient for our application (in contrast to, e.g., a linked list). On the downside, we have to store every edge twice and obtain a very static data structure.
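A minimal C++ sketch of such an adjacency array (naming is ours; the input is assumed to be a list of undirected, weighted edges):

#include <tuple>
#include <vector>

// All edges sorted by source node in one array, plus one offset per node
// pointing to its first incident edge.
struct AdjacencyArray {
    std::vector<int> first;     // first[u]..first[u+1]-1 index u's edges; size n+1
    std::vector<int> target;    // head of each half-edge; size 2m
    std::vector<double> weight; // cost of each half-edge

    AdjacencyArray(int n, const std::vector<std::tuple<int, int, double>>& edges)
        : first(n + 1, 0) {
        for (const auto& [u, v, c] : edges) { ++first[u + 1]; ++first[v + 1]; }
        for (int u = 0; u < n; ++u) first[u + 1] += first[u];   // prefix sums
        target.resize(first[n]);
        weight.resize(first[n]);
        std::vector<int> next(first.begin(), first.end() - 1);  // write cursors
        for (const auto& [u, v, c] : edges) {                   // each edge stored twice
            target[next[u]] = v; weight[next[u]++] = c;
            target[next[v]] = u; weight[next[v]++] = c;
        }
    }
};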

For analysing the algorithm's runtime, we have to study the number of priority queue operations (all other instructions run in O(n + m)). We obviously have n deleteMin operations, costing O(log n) each. As every node is regarded exactly once, every edge is scanned at most twice, resulting in O(m) decreaseKey operations. The latter can be implemented in amortized time O(1) using Fibonacci heaps. In total, we have costs of O(m + n log n). This result is partly theoretical, as practical implementations will often resort to simpler pairing heaps, for which the analysis is still open.

Another classic algorithm is due to Kruskal (Figure 8.6). Again, correctness follows from the cut property (take S as one of the subtrees connected by (u, v)).

For an efficient implementation of this algorithm we need a fast way to determine whether two nodes are in the same subtree. We use the union-find data structure for this task: It maintains disjoint sets (in our case containing the subtrees of T) whose union is V. It allows near-constant operations to identify the subtree a node is in (via the find operation) and to merge two subtrees using link. A more general overview of union-find is given in Section 8.2.1.


T : UnionFind(n)
sort E in ascending order of weight
kruskal(E)

Procedure kruskal(E)
    foreach (u, v) ∈ E do
        u′ := T.find(u)
        v′ := T.find(v)
        if u′ ≠ v′ then
            output (u, v)
            T.link(u′, v′)

Figure 8.7: Kruskal's algorithm using union-find

Using union-find, we have a running time of O(sort(m) + m·α(m, n)) = O(m log m), where α is the inverse Ackermann function.

The necessary graph representation is very simple: an array of edges is enough and can be sorted and scanned very cache efficiently. Every edge is represented only once.

Which of these two algorithms is better? As often, there is no easy answer to this question. Kruskal wins for very sparse graphs while Prim's algorithm is more suited for dense graphs. The switching point is unclear and heavily dependent on the input representation, the structure of the graphs, etc. Systematic experimental evaluation is required.

8.2.1 Excursus: The Union-Find Data Structure

A partition of a set M into subsets M1, . . . , Mk has the property that the subsets are disjoint and cover M, i.e., Mi ∩ Mj = ∅ for i ≠ j and M = M1 ∪ · · · ∪ Mk. For example, in Kruskal's algorithm the forest T partitions V into subtrees — including trivial subsets of size one for isolated nodes. Kruskal's algorithm performs two operations on the partition: testing whether two elements are in the same subset (subtree) and joining two subsets into one (inserting an edge into T).

The union-find data structure maintains a partition of the set 1..n and supports these two operations. Initially, each element is in its own subset. Each subset is assigned a leader element (or representative). The function find(i) finds the leader of the subset containing i; link(i, j), applied to leaders of different subsets, joins these two subsets. Figure 8.8 gives an efficient implementation of this idea. The most important part of the data structure is the array parent. Leaders are their own parents.


Class UnionFind(n : N)                       // maintain a partition of 1..n
    parent = 〈1, 2, . . . , n〉 : Array [1..n] of 1..n
    gen = 〈0, . . . , 0〉 : Array [1..n] of 0.. log n   // generation of leaders

    Function find(i : 1..n) : 1..n
        if parent[i] = i then return i
        else i′ := find(parent[i])
            parent[i] := i′                  // path compression
            return i′

    Procedure link(i, j : 1..n)
        assert i and j are leaders of different subsets
        if gen[i] < gen[j] then parent[i] := j   // balance
        else
            parent[j] := i
            if gen[i] = gen[j] then gen[i]++

    Procedure union(i, j : 1..n)
        if find(i) ≠ find(j) then link(find(i), find(j))

Figure 8.8: An efficient union-find data structure maintaining a partition of the set {1, . . . , n}.

Following parent references leads to the leaders. The parent references of a subset form a rooted tree, i.e., a tree with all edges directed towards the root.3 Additionally, each root has a self-loop. Hence, find is easy to implement by following the parent references until a self-loop is encountered.

Linking two leaders i and j is also easy to implement by promoting one of the leaders to overall leader and making it the parent of the other. What we have said so far yields a correct but inefficient union-find data structure: the parent references could form long chains that are traversed again and again during find operations.

Therefore, Figure 8.8 makes two optimizations. The link operation uses the array gen to limit the depth of the parent trees. Promotion in leadership is based on the seniority principle: the older generation is always promoted. It can be shown that this measure alone limits the time for find to O(log n). The second optimization is path compression: a long chain of parent references is never traversed twice. Rather, find redirects all nodes it traverses directly to the leader. It is possible to prove that these two optimizations together make the union-find data structure "breathtakingly" efficient — the amortized cost of any operation is almost constant.

3Note that this tree may have a very different structure compared to the corresponding subtree in Kruskal's algorithm.
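A direct C++ transcription of Figure 8.8 (naming ours) could look as follows:

#include <numeric>
#include <vector>

// Path compression plus union by generation ("the older generation is
// always promoted").
class UnionFind {
    std::vector<int> parent, gen;
public:
    explicit UnionFind(int n) : parent(n), gen(n, 0) {
        std::iota(parent.begin(), parent.end(), 0); // everyone is their own leader
    }
    int find(int i) {
        if (parent[i] == i) return i;
        return parent[i] = find(parent[i]);         // path compression
    }
    void link(int i, int j) {                       // i and j must be leaders
        if (gen[i] < gen[j]) parent[i] = j;         // balance by generation
        else {
            parent[j] = i;
            if (gen[i] == gen[j]) ++gen[i];
        }
    }
    void unite(int i, int j) {                      // 'union' is a C++ keyword
        if (find(i) != find(j)) link(find(i), find(j));
    }
};

Together with a weight-sorted edge array, this class immediately yields Kruskal's algorithm from Figure 8.7: a single scan calling find on both endpoints and link when they differ.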


Procedure quickKruskal(E : Sequence of Edge)
    if m ≤ βn then kruskal(E)        // for some constant β
    else
        pick a pivot p ∈ E
        E≤ := 〈e ∈ E : e ≤ p〉        // partitioning à la
        E> := 〈e ∈ E : e > p〉        // quicksort
        quickKruskal(E≤)
        E′> := filter(E>)
        quickKruskal(E′>)

Function filter(E)
    make sure that leader[i] gives the leader of node i   // O(n)!
    return 〈(u, v) ∈ E : leader[u] ≠ leader[v]〉

Figure 8.9: The QuickKruskal algorithm


8.3 QuickKruskal

As Kruskal's algorithm becomes less attractive for dense graphs, we propose a variant that uses a quicksort-like recursion to deal with those instances.

When the average degree is bounded by some constant β (i.e., the graph is sparse), we know that Kruskal's algorithm performs well. Otherwise, we determine an MST recursively on the lightest edges of the graph, resulting in a set of connected components. The second recursion then only has to regard those heavy edges that connect two different components; the others are filtered out. The filtering subroutine again makes use of the union-find data structure: we set leader[v] := find(v) to determine the connected component node v is in. Note that n find operations have a running time of O(n)4.
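Assuming the UnionFind class sketched in Section 8.2.1 already holds the components produced by the first recursion, the filter step is a single scan (a sketch under our own naming; EdgeList is a plain vector of (u, v, weight) triples):

#include <tuple>
#include <vector>

using EdgeList = std::vector<std::tuple<int, int, double>>;

// Keep only edges whose endpoints lie in different components of the
// partial MST computed on the light edges (cf. filter in Figure 8.9).
EdgeList filter(const EdgeList& heavy, UnionFind& uf) {
    EdgeList kept;
    for (const auto& [u, v, c] : heavy)
        if (uf.find(u) != uf.find(v))   // leader[u] != leader[v]
            kept.push_back({u, v, c});
    return kept;
}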

We can now attempt an average-case analysis of QuickKruskal. We assume that the weights are unique and randomly chosen and that the pivot has median weight. Let T(m) denote the expected execution time for m edges. For m ≤ βn, our base case, we have T(m) = O(m log m) = O(n log n). In the general case, we have costs of O(m + n) = Θ(m) for partitioning and filtering. E≤ has size m/2 for an optimal pivot. The key observation here is that the number of edges surviving the filtering is only linear in n.

4This can be shown using amortized analysis: every element accessed during a find operation becomes a direct successor of its root node, resulting in constant costs for subsequent requests. O(n) is less than the general bound on n union-find operations.


R := random sample of r edges from E
F := MST(R)                         // Wlog assume that F spans V
L := ∅                              // "light edges" with respect to R
foreach e ∈ E do                    // filter
    C := the unique cycle in {e} ∪ F
    if e is not heaviest in C then
        L := L ∪ {e}
return MST(L ∪ F)

Figure 8.10: A simplified filtering algorithm using random samples

This leads to T(m) = Θ(m) + T(m/2) + T(2n). Since for β ≥ 2 the second recursion already falls back to the base case, we have the linear recurrence T(m) = Θ(m) + O(n log n) + T(m/2), which solves (using standard techniques) to O(m + n log n · log(m/n)).
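Unrolling the recurrence (a back-of-the-envelope calculation of ours) shows where the bound comes from: the recursion halves m for about log(m/(βn)) levels, each level pays Θ(m/2^i) for partitioning plus O(n log n) for the filtered branch, and the partitioning costs form a geometric sum:

\begin{align*}
T(m) &\le \sum_{i=0}^{\log(m/(\beta n))} \Bigl( \Theta\bigl(m/2^i\bigr) + O(n\log n) \Bigr)\\
     &= \Theta(m) + O\Bigl(n\log n \cdot \log\frac{m}{n}\Bigr)
      = O\Bigl(m + n\log n \log\frac{m}{n}\Bigr).
\end{align*}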

A hard instance for QuickKruskal consists of several very dense components of light edges, connected by heavy edges that are not sorted out during filtering because the first recursion concentrates on local MSTs within the components. More concretely: consider the fully connected graph K_{x²} (for x ∈ N) where every node is replaced by a copy of K_x. This graph has Θ(x⁴) edges. Let the "outer" edges have weight 2 whereas the "inner" edges have weight 1. The first recursion of QuickKruskal will only regard edges within the K_x components but completely ignore the heavy edges, so none of them is filtered out.

8.4 The I-Max-Filter algorithm

A similar approach also resorts to filtering, but the subgraph does not consist of the lightest edges but of a random sample; see Figure 8.10.

While Kruskal’s and Prim’s algorithms make use of the Cut Property, this code’s cor-rectness is guaranteed by the Cycle Property. Its performance depends on the size of Land F . It can be shown that if r edges are chosen, we expect only mn

redges to survive the

filtering5.The tricky part in implementing this algorithm is how to determine the heaviest edge

in C. We exploit that by renumbering nodes according to the order in which they areadded to the MST by the Jarnik-Prim algorithm, heaviest edge queries can be reduced

5This is because every edge that is not to be filtered has to be in MST({e} ∪ R), which has ≤ n elements. The probability of survival therefore is ≤ n/r, as r edges are regarded.


Figure 8.11: Example of a layers array for interval maxima (levels 0–3 built on top of an example array). The suffix sections are marked by an extra surrounding box.

A proof of this claim can be found in 12.5. We have therefore reduced the problem to efficiently computing interval maxima:

Given an array a[0] . . . a[n − 1], we explain how max a[i..j] can be computed in constant time using preprocessing time and space O(n log n). The emphasis is on very simple and fast queries, since we are looking at applications where many more than n log n queries are made. This algorithm might be of independent interest for other applications. Slight modifications of this basic algorithm are necessary in order to use it in the I-Max-Filter algorithm; they will be described later. In the following, we assume that n is a power of two. Adaption to the general case is simple by either rounding up to the next power of two and filling the array with −∞ or by introducing a few case distinctions while initializing the data structure.

Consider a complete binary tree built on top of a so that the entries of a are the leaves (see level 0 in Figure 8.11). The idea is to store an array of prefix or suffix maxima with every internal node of the tree: left successors store suffix maxima, right successors store prefix maxima. The size of an array is proportional to the size of the subtree rooted at the corresponding node. To compute the interval maximum max a[i..j], let v denote the least common ancestor of a[i] and a[j], let u denote the left successor of v, and let w denote the right successor of v. Let u[i] denote the suffix maximum corresponding to leaf i in the suffix maxima array stored in u. Correspondingly, let w[j] denote the prefix maximum corresponding to leaf j in the prefix maxima array stored in w. Then max a[i..j] = max(u[i], w[j]).

We observed that this approach can be implemented in a very simple way using a log(n) × n array preSuf. As can be seen in Figure 8.11, all suffix and prefix arrays in one layer can be assembled into one array as follows:

    preSuf[ℓ][i] = max(a[2^ℓ·b .. i])               if b is odd
    preSuf[ℓ][i] = max(a[i .. (b + 1)·2^ℓ − 1])     otherwise,

where b = ⌊i/2^ℓ⌋.


// compute the MST of G = ({0, . . . , n − 1}, E)
Function I-Max-Filter-MST(E) : Set of Edge
    E′ := random sample from E of size √(mn)
    E′′ := JP-MST(E′)
    let jpNum[0..n − 1] denote the order in which JP-MST added the nodes
    initialize the table preSuf[0.. log n][0..n − 1]
    // filtering loop
    forall edges e = (u, v) ∈ E do
        ℓ := msbPos(jpNum[u] ⊕ jpNum[v])
        if w(e) < preSuf[ℓ][jpNum[u]] and w(e) < preSuf[ℓ][jpNum[v]] then add e to E′′
    return JP-MST(E′′)

Figure 8.12: The I-Max-Filter algorithm.

Furthermore, the interval boundaries can be used to index the arrays. We simply have max a[i..j] = max(preSuf[ℓ][i], preSuf[ℓ][j]) where ℓ = msbPos(i ⊕ j); ⊕ is the bitwise exclusive-or operation and msbPos(x) = ⌊log2 x⌋ is the position of the most significant nonzero bit of x (starting at 0). Some architectures have this operation in hardware6; if not, msbPos(x) can be stored in a table (of size n) and found by table lookup. Layer 0 is identical to a. A further optimization stores a pointer to the array preSuf[ℓ] in the layer table. As the computation is symmetric, we can conduct a table lookup with indices i, j without knowing whether i < j or j < i.
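A self-contained C++ sketch of the basic variant of this table (naming is ours; n is assumed to be a power of two, and msbPos is realized with the GCC/Clang builtin __builtin_clz — a table lookup works as well):

#include <algorithm>
#include <vector>

// On level l, position i stores a prefix maximum of its 2^l-block if the
// block index b = i >> l is odd, and a suffix maximum otherwise — exactly
// the definition of preSuf given above.
struct IntervalMax {
    std::vector<std::vector<double>> preSuf; // preSuf[level][i]; level 0 equals a

    explicit IntervalMax(const std::vector<double>& a) {
        int n = static_cast<int>(a.size());
        int levels = 0;
        while ((1 << levels) < n) ++levels;
        preSuf.assign(levels + 1, a);
        for (int l = 1; l <= levels; ++l)
            for (int b = 0; b < (n >> l); ++b) {
                int lo = b << l, hi = lo + (1 << l) - 1;
                if (b & 1)                       // prefix maxima: max(a[lo..i])
                    for (int i = lo + 1; i <= hi; ++i)
                        preSuf[l][i] = std::max(preSuf[l][i - 1], a[i]);
                else                             // suffix maxima: max(a[i..hi])
                    for (int i = hi - 1; i >= lo; --i)
                        preSuf[l][i] = std::max(preSuf[l][i + 1], a[i]);
            }
    }

    // max(a[i..j]) for i != j; symmetric in i and j, so the caller need not
    // know which endpoint is smaller.
    double query(int i, int j) const {
        int l = 31 - __builtin_clz(static_cast<unsigned>(i ^ j)); // msbPos
        return std::max(preSuf[l][i], preSuf[l][j]);
    }
};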

To use this data structure for the I-Max-Filter algorithm we need a small modification, since we are interested in maxima of the form max a[min(i, j) + 1 .. max(i, j)] without knowing which of the two endpoints is the smaller. Here we simply note that the approach still works if we redefine the suffix maxima to exclude the first entry, i.e., preSuf[ℓ][i] = max(a[i + 1 .. (b + 1)·2^ℓ − 1]) if b = ⌊i/2^ℓ⌋ is even.

We can now return to the original problem of finding an MST. Figure 8.12 gives a detailed implementation of the I-Max-Filter algorithm.

The I-Max-Filter algorithm computes MSTs in expected time m·T_filter + O(n log n + √(nm)), where T_filter is the time required to query the filter about one edge.

The algorithms we saw until now all had specific requirements for the graph representation. The I-Max-Filter algorithm can be implemented to work well with any representation that allows sampling edges in time linear in the sample size and that allows fast iteration over all edges. In particular, it is sufficient to store each edge once. Our implementation of I-Max-Filter uses an array in which each edge appears once as (u, v) with u < v and the edges are sorted by source node (u).7

6One trick is to use the exponent in a floating point representation of x.

7These requirements could be dropped at very small cost. In particular, I-Max-Filter can work efficiently with a completely unsorted edge array or with an adjacency array representation that stores each edge only in one direction. The latter only needs space for m + n node indices and m edge weights.


Experiments

I-Max-Filter should work well for dense graphs where m ≫ n log n. We try to verify this claim in experiments.

Both algorithms, JP and I-Max-Filter, were implemented in C++ and compiled using GNU g++ version 3.0.4 with optimization level -O6. We use a SUN-Fire-15000 server with 900 MHz UltraSPARC-III+ processors.

We performed measurements with four different families of graphs, each with adjustable edge density ρ = 2m/(n(n − 1)). A test instance is defined by three parameters: the graph type, the number of nodes, and the density of edges (the number of edges is computed from these parameters). Each reported result is the average of ten executions of the relevant algorithm, each on a different randomly generated graph with the given parameters. Furthermore, the I-Max-Filter algorithm is randomized because the sample graph is selected at random. Despite the randomization, the variance of the execution times within one test was consistently very small (less than 1 percent), hence we only plot the averages.

Worst-Case: ρ · n(n − 1)/2 edges are selected at random and the edges are assigned weights that cause JP to perform as many decreaseKey operations as possible.

Linear-Random: ρ · n(n − 1)/2 edges are selected at random. Each edge (u, v) is assigned the weight w(u, v) = |u − v| where u and v are the integer IDs of the nodes.

Uniform-Random: ρ · n(n − 1)/2 edges are selected at random and each is assigned an edge weight selected uniformly at random.

Random-Geometric: Nodes are random 2D points in a 1 × y rectangle for some stretch factor y > 0. Edges connect nodes with Euclidean distance at most α, and the weight of an edge is equal to the distance between its endpoints. The parameter α indirectly controls density, whereas the stretch factor y allows us to interpolate between behavior similar to class Uniform-Random and behavior similar to class Linear-Random.

Fig. 8.13 shows execution times per edge on the SUN for the two graph families Worst-Case and Uniform-Random for n = 10000 nodes and varying density. We can see that I-Max-Filter is up to 2.46 times faster than JP. The speedup is smaller for Uniform-Random graphs. The reason is that for "average" inputs JP needs to perform only a sublinear number of decreaseKey operations, so that the part of the code dominating the execution time of JP is scanning adjacency lists and comparing the weight of each edge with the distance of the target node from the current MST. There is no hope to be significantly faster than that. Hence, when we say that I-Max-Filter outperforms JP, this is with respect to space consumption, simplicity of input conventions, and worst-case performance guarantees rather than average case execution time.

On very sparse graphs, I-Max-Filter is up to two times slower than JP, because √(mn) = Θ(m) and as a result both the sample graph and the graph that remains after the filtering stage are not much smaller than the original graph. The runtime is therefore comparable to two runs of JP on the input.



Figure 8.13: Worst-Case and Uniform-Random graphs, 10000 nodes on a SUN machine (time per edge [ns] against edge density; curves 'Prim' and 'Filter').


Figure 8.14: Example for node contraction (MST edges are output and their endpoints merged; relinked edges remember their original endpoints, e.g. the edge of weight 7 'was (2,3)').


8.5 External MST

After studying and extending some classic algorithms for graphs in main memory, we now consider another approach. We start with a simple randomized algorithm using a graph contraction operation, develop an even simpler variant, and step by step arrive at an external algorithm for huge graphs.

Contracting is defined as follows: if e = (u, v) ∈ E is known to be an MST edge, we can remove u from the problem by outputting e and identifying u and v, e.g., by removing node u and renaming an edge of the form (u, w) to a new edge (v, w). By remembering where (v, w) came from, we can reconstruct the MST of the original graph from the MST of the smaller graph.

With this operation, we can use Boruvka's algorithm, which consists of repeated execution of Boruvka phases: in each phase, find the lightest incident edge for each node. The set C of these edges can be output as part of the MST (because of the cut property). Now contract these edges, i.e., find a representative node for each connected component of (V, C) and rename an edge {u, v} to {componentId(u), componentId(v)}. This routine at least halves the number of nodes (as every edge is picked at most twice) and runs in O(m)8. In total, if we contract our graph until only one node is left, we have a runtime of O(m log n).
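A C++ sketch of one Boruvka phase, reusing the UnionFind class and the EdgeList type from the earlier sketches (components are represented implicitly by union-find leaders instead of explicit renaming):

#include <limits>
#include <tuple>
#include <vector>
// reuses class UnionFind (Section 8.2.1 sketch) and EdgeList from above

// One phase: pick the lightest edge incident to each current component,
// output it, and merge the touched components. Repeated O(log n) times
// this yields the complete MST.
void boruvkaPhase(int n, const EdgeList& edges, UnionFind& uf, EdgeList& mst) {
    std::vector<double> bestW(n, std::numeric_limits<double>::max());
    std::vector<int> best(n, -1);            // index of the lightest edge per leader
    for (int i = 0; i < static_cast<int>(edges.size()); ++i) {
        auto [u, v, c] = edges[i];
        int cu = uf.find(u), cv = uf.find(v);
        if (cu == cv) continue;              // internal edge, already contracted
        if (c < bestW[cu]) { bestW[cu] = c; best[cu] = i; }
        if (c < bestW[cv]) { bestW[cv] = c; best[cv] = i; }
    }
    for (int r = 0; r < n; ++r)
        if (best[r] != -1) {
            auto [u, v, c] = edges[best[r]];
            if (uf.find(u) != uf.find(v)) {  // the same edge may be chosen twice
                mst.push_back(edges[best[r]]);
                uf.unite(u, v);
            }
        }
}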

On our way to an external MST algorithm we will use a simpler variant of Boruvka's algorithm which does not use phases — Sibeyn's algorithm (Figure 8.15):

In the iteration when i nodes are left (note that i = n in the first iteration), the expected degree of a random node is at most 2m/i. Hence, the expected number of edges Xi inspected in iteration i is at most 2m/i.

8We can use union-find again: m operations to construct all components by merging and another m operations to find the edges between different components.


for i := n downto n′ + 1 do
    pick a random node v
    find the lightest edge (u, v) out of v and output it
    contract (u, v)

Figure 8.15: High level version of Sibeyn's MST algorithm.

Factor 8 node reduction (3× Boruvka or sweep algorithm)   // O(m + n)
R ⇐ m/2 random edges
F ⇐ MST(R) [Recursively]
Find light edges L (edge reduction)   // O(m + n), E[|L|] ≤ (n/8)·m/(m/2) = n/4
T ⇐ MST(L ∪ F) [Recursively]

Figure 8.16: Outline of a randomized linear time MST algorithm.

By the linearity of expectation, the total expected number of edges processed is

E[∑_{n′<i≤n} Xi] = ∑_{n′<i≤n} E[Xi] ≤ ∑_{n′<i≤n} 2m/i = 2m (∑_{1≤i≤n} 1/i − ∑_{1≤i≤n′} 1/i)
                = 2m(Hn − Hn′) ≤ 2m(ln n − ln n′) = 2m ln(n/n′),

where Hn = ln n + 0.577 · · · + O(1/n) is the n-th harmonic number.

The techniques of sampling and contraction can lead to an (impractical) randomized linear time algorithm, developed by Karger, Klein and Tarjan. It is presented in Figure 8.16 but not studied in detail. Its analysis depends again on the observation that clever sampling leads to an expected number of unfiltered (light) edges linear in n. The complicated step is the fourth, done in linear time using table lookups. The expected runtime of this algorithm is given by T(n, m) ≤ T(n/8, m/2) + T(n/8, n/4) + c(n + m), which is fulfilled by T(n, m) = 2c(n + m).

8.5.1 Semiexternal Algorithm

A first step towards an algorithm that can cope with huge graphs stored on disk is a semiexternal algorithm: we use Kruskal's algorithm but incorporate an external sorting algorithm. We then just have to scan the edges and maintain the union-find array in main memory.


π : random permutation V → V
sort edges (u, v) by min(π(u), π(v))
for i := n downto n′ + 1 do
    pick the node v with π(v) = i
    find the lightest edge (u, v) out of v and output it
    contract (u, v)

Figure 8.17: High level implementation of graph contraction with sweeping

Figure 8.18: Sweeping scans through the randomly ordered nodes, removes one, outputs its lightest edge and relinks the others.

This only requires one 32-bit word per node to store up to 2^32 − 32 = 4 294 967 264 nodes. There exist asymptotically better algorithms, but these come with discouraging constant factors and significantly larger data structures.

8.5.2 External Sweeping Algorithm

We can use the semiexternal algorithm while n < M − 2B. For larger inputs, we cannot store the additional data structures in main memory. To deal with those graphs, we again use the technique of node reduction via contraction until we can resort to our semiexternal algorithm.

This algorithm is a more concrete implementation of Sibeyn's algorithm from Figure 8.15. We replace the random selection of nodes by sweeping the nodes in an order fixed in advance. We assume that nodes are numbered 0..n − 1. We first rename the node indices using a random permutation π : 0..n − 1 → 0..n − 1 and then remove renamed nodes in the order n − 1, n − 2, . . . , n′. This way, we replace random access by sorting and scanning the nodes once. The appendix (in Section 12.6) describes a procedure to create a random permutation on the fly without additional I/Os.

There is a very simple external realization of the sweeping algorithm based on priority queues of edges. Edges are stored in the form ((u, v), c, eold) where (u, v) is the edge in the current graph, c is the edge weight, and eold identifies the edge in the original graph.


Q : priority queue                      // order: max node, then min edge weight
foreach ((u, v), c) ∈ E do Q.insert(((π(u), π(v)), c, (u, v)))
current := n + 1
loop
    ((u, v), c, (u0, v0)) := Q.deleteMin()
    if current ≠ max{u, v} then
        if current = M + 1 then return
        output ((u0, v0), c)
        current := max{u, v}
        connect := min{u, v}
    else Q.insert(((min{u, v}, connect), c, (u0, v0)))

Figure 8.19: Sweeping algorithm implementation using priority queues

The queue normalizes edges (u, v) in such a way that u ≥ v. We define a priority order ((u, v), c, eold) < ((u′, v′), c′, e′old) iff u > u′, or u = u′ and c < c′. With these conventions in place, the algorithm can be described using the simple pseudocode in Figure 8.19. If eold is just an edge identifier, e.g. a position in the input, an additional sorting step at the end can extract the actual MST edges. If eold stores both incident vertices, the MST edge and its weight can be output directly.

With optimal external priority queues, this implementation performs ≈ sort(10·m·log(n/M)) I/Os.
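The following C++ sketch mirrors Figure 8.19 for the in-memory case (a full-scale run would use an external priority queue, e.g. from the Stxxl). The termination check for the semiexternal switch is omitted, and self-loops produced by relinking are simply discarded; all naming is ours:

#include <algorithm>
#include <queue>
#include <vector>

struct QEdge {
    int u, v;       // normalized so that u >= v; already renamed by pi
    double c;
    int orig;       // identifies the edge in the original graph
    // max node first, then min edge weight (std::priority_queue is a max-heap)
    bool operator<(const QEdge& o) const {
        return u != o.u ? u < o.u : c > o.c;
    }
};

// Sweeps nodes n-1, n-2, ... and outputs one MST edge per removed node.
std::vector<int> sweepMST(int n, const std::vector<QEdge>& edges) {
    std::priority_queue<QEdge> q(edges.begin(), edges.end());
    std::vector<int> mst;
    int current = n + 1, connect = -1;
    while (!q.empty()) {
        QEdge e = q.top();
        q.pop();
        if (e.u == e.v) continue;           // self-loop from relinking
        if (current != e.u) {               // first = lightest edge of node e.u
            mst.push_back(e.orig);
            current = e.u;
            connect = e.v;                  // e.u is contracted into connect
        } else {                            // relink (e.u, e.v) to (e.v, connect)
            q.push({std::max(e.v, connect), std::min(e.v, connect), e.c, e.orig});
        }
    }
    return mst;
}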

The priority queue implementation unnecessarily sorts the edges adjacent to a node, whereas we really only care about the smallest edge coming first. We now describe an implementation of the sweeping algorithm that has internal work linear in the total I/O volume. We first make a few simplifying assumptions to get closer to our implementation.

The representation of edges and the renaming of nodes work as in the priority queue implementation. As before, in iteration i, node i is removed by outputting the lightest edge incident to it and relinking all the other edges. We split the node range n′..n − 1 into k = O(M/B) equal sized external buckets, i.e., subranges of size (n − n′)/k, and we define a special external bucket for the range 0..n′ − 1. An edge (u, v) with u > v is always stored in the bucket for u. We assume that the current bucket (that contains i) completely fits into main memory. The other buckets are stored externally, with only a write buffer block to accommodate recently relinked edges.

When i reaches a new external bucket, it is distributed to internal buckets — one for each node in the external bucket. The internal bucket for i is scanned twice: once for finding the lightest edge and once for relinking. Relinked edges destined for the current external bucket are immediately put into the appropriate internal bucket. The remaining edges are put into the write buffer of their external bucket. Write buffers are flushed to disk when they become full.


Figure 8.20: Sweeping with buckets (external buckets, the current bucket held internally, and the semiexternal bucket).

When only n′ nodes are left, the bucket for the range 0..n′ − 1 is used as input for the semiexternal Kruskal algorithm.

Nodes with very high degree (> M) can be moved to the bucket for the semiexternal case directly. These nodes can be assigned the numbers n′ + 1, n′ + 2, . . . without danger of confusing them with nodes with the same index in other buckets. To accommodate these additional nodes in the semiexternal case, n′ has to be reduced by at most O(M/B), since for m = O(M²/B) there can be at most O(M/B) nodes with degree Ω(M).

8.5.3 Implementation & Experiments

Our external implementation makes extensive use of the Stxxl9 and uses many techniques and data structures we already saw in earlier chapters. The semiexternal Kruskal and the priority queue based sweeping algorithm become almost trivial using external sorting and external priority queues. The bucket based implementation uses external stacks to represent external buckets. The stacks have a single private output buffer and share a common pool of additional output buffers that facilitates overlapping of output and internal computation. When a stack is switched to reading, it is assigned additional private buffers to facilitate prefetching.

The internal aspects of the bucket implementation are also crucial. In particular, we need a representation of internal buckets that is space efficient, cache efficient, and can grow adaptively. Therefore, internal buckets are represented as linked lists of small blocks that can hold several edges each. Edges in internal buckets do not store their source node because this information is redundant.

For experiments we use three families of graphs: instance families for random graphs with random edge weights and random geometric graphs where random points in the unit square are connected to their d closest neighbors.

9see chapter 5


Figure 8.21: Setup for experiments on external MST.

In order to obtain a simple family of planar graphs, we use grid graphs with random edge weights where the nodes are arranged in a grid and are connected to their (up to) four direct neighbors10.

The experiments have been performed on a low cost PC-server (around 3000 Euro in July 2002) with two 2 GHz Intel Xeon processors, 1 GByte RAM and 4 × 80 GByte disks (IBM 120GXP) that are connected to the machine in a bottleneck-free way. This machine runs Linux 2.4.20 using the XFS file system. Swapping was disabled. All programs were compiled with g++ version 3.2 and optimization level -O6. The total computer time spent for the experiments was about 25 days, producing a total I/O volume of several dozen terabytes.

Figure 8.22 summarizes the results for the bucket implementation. The curves only show the internal results for random graphs — at least Kruskal’s algorithm shows very similar behavior for the other graph classes.

The internal algorithms can handle up to 20 million edges. Kruskal’s algorithm is best for very sparse graphs (m ≤ 4n), whereas the Jarník-Prim algorithm (with a fast implementation of pairing heaps) is fastest for denser graphs but requires more memory. For n ≤ 160 000 000, we can run the semiexternal algorithm and get execution times within a factor of two of the internal algorithm.11 The curves are almost flat and very similar for all three graph families. This is not astonishing since Kruskal’s algorithm is not very dependent on the structure of the graph. Beyond 160 000 000 nodes, the full external algorithm is needed.

10 Note that for planar graphs we can give a bound of O(sort(n)) if we deal with parallel edges: When scanning the internal bucket for node i, the edges (i, v) are put into a hash table using v as a key. The corresponding table entry only keeps the lightest edge connecting i and v seen so far.

11 Both the internal and the semiexternal algorithm have a number of possibilities for further tuning (e.g., using integer sorting or a better external sorter for small elements). But none of these measures is likely to yield more than a factor of 2.


Figure 8.22: Execution time per edge for m ≈ 2n (top), m ≈ 4n (center), m ≈ 8n (bottom). “Kruskal” and “Prim” denote the results of these internal algorithms on the “random” input.


This immediately costs us another factor of two in execution time: We have additional costs for random renaming, node reduction, and a blowup of the size of an edge from 12 bytes to 20 bytes (for renamed nodes). For random graphs, the execution time keeps growing with n/M.

The behavior for grid graphs is much better than predicted. It is interesting that similar effects can be observed for geometric graphs. This is an indication that it is worth removing parallel edges for many nonplanar graphs.12 Interestingly, the time per edge decreases with m for grid graphs and geometric graphs. The reason is that the time for the semiexternal base case does not increase proportionally to the number of input edges. For example, 5.6 · 10^8 edges of a grid graph with 640 · 10^6 nodes survive the node reduction, and 6.3 · 10^8 edges of a grid graph with twice the number of edges.

Another observation is that for m = 2560 · 10^6 and random or geometric graphs we get the worst time per edge for m ≈ 4n. For m ≈ 8n, we do not need to run the node reduction very long. For m ≈ 2n we process fewer edges than predicted, even for random graphs, simply because one MST edge is removed for each node.

8.6 Connected Components

We modify and extend the bucket version of the spanning forest algorithm to compute the connected components of an external graph. As in the spanning forest algorithm, the input is an unweighted graph represented as a list of edges. The output of the algorithm is a list of entries (v, c), v ∈ V, where c is the connected component id of node v; at the same time, c is the id of a node belonging to the connected component. This special node c is sometimes called the representative node of a component. The algorithm makes two passes over the adjacency lists of the nodes (a left-to-right pass v = n − 1..0 and a right-to-left pass v = 0..n − 1, v ∈ V), relinking the edges such that they connect node v with the representative node of its connected component.

If there are k = O(M/B) external memory buckets, then bucket i ∈ 0..k − 1 contains the adjacent edges (u, v), u > v, of the nodes u with u_{i−1} < u < u_i, where u_i is the upper (lower) bound of node ids in bucket i (i + 1). Additionally, there are k question buckets and k answer buckets with the same bounds. A question is a tuple (v, r(v)) that represents the assignment of node v to a preliminary representative node r(v). An answer is a tuple (v, r(v)) that represents the assignment of node v to an ultimate representative node. The function b : V → 0..k − 1 maps a node id to the corresponding bucket id according to the bucket bounds. The bucket implementation is complemented with the following steps. During the processing of node v, the algorithm tentatively assigns r(v) the id of its neighbor with the smallest id. If no neighbor exists then r(v) := v.

12 Very few parallel edges are generated for random graphs. Therefore, switching off duplicate removal gives about 13 % speedup for random graphs compared to the numbers given.


After processing bucket i, we post the preliminary assignments (v, r(v)) of nodes v, u_{i−1} < v ≤ u_i, to question bucket b(r(v)) if r(v) does not belong to bucket i. Otherwise we can update r(v) with r(r(v)). If the new r(v) belongs to bucket i then it is the ultimate representative node of v and (v, r(v)) can be written to the answer bucket b(v); otherwise we post the question (v, r(v)) to the appropriate question bucket. Note that the first answer bucket is handled differently, as it is implemented as the union-find data structure in the base case. For v in the union-find data structure, r(v) is the id of the leader node of the union to which v belongs. The connected component algorithm needs an additional right-to-left scan to determine the ultimate representatives which have not been determined in the previous left-to-right scan. The buckets are read in the order 0..k − 1. For each (v, r(v)) in question bucket i we update r(v) with the ultimate representative r(r(v)), looking up values in answer bucket i. The final value (v, r(v)) is appended to answer bucket b(v). After answering all questions in bucket i, the content of answer bucket i is added to the output of the connected component algorithm. If one only needs to compute the component ids and no spanning tree edges, then the implementation does not keep the original edge id in the edge data structure. It is sufficient to invert the randomization of the node ids in the output, which can be done with the chosen randomization scheme without additional I/Os. Due to these measures, the total I/O volume and the memory requirements of the internal buckets are reduced, so that the block size of the external memory buckets can be made larger. All this leads to an overall performance improvement.
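To illustrate the base case, the following is a minimal sketch of a union-find data structure (our own code, not the Stxxl-based implementation): find(v) returns the id of the leader node r(v) of the component containing v, and unite merges two components.

#include <utility>
#include <vector>

struct UnionFind {
    std::vector<int> parent, size;
    explicit UnionFind(int n) : parent(n), size(n, 1) {
        for (int v = 0; v < n; ++v) parent[v] = v;  // every node is its own leader
    }
    int find(int v) {                               // leader r(v) of v's component
        while (parent[v] != v) {
            parent[v] = parent[parent[v]];          // path halving
            v = parent[v];
        }
        return v;
    }
    void unite(int u, int v) {                      // join the two components
        u = find(u); v = find(v);
        if (u == v) return;
        if (size[u] < size[v]) std::swap(u, v);     // union by size
        parent[v] = u;
        size[u] += size[v];
    }
};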


Chapter 9

String Sorting

This chapter is based on [39].

9.1 Introduction

The task is to sort a set R = {s_1, s_2, . . . , s_n} of n (non-empty) strings into lexicographic order. N is the total length of the strings, D the total length of the distinguishing prefixes. The distinguishing prefix of a string s in R is the shortest prefix of s that is not a prefix of another string (or s itself if s is a prefix of another string). It is the shortest prefix of s that determines the rank of s in R. A sorting algorithm needs to access every character in the distinguishing prefixes, but no character outside the distinguishing prefixes.

We can evaluate algorithms using different alphabet models: In an ordered alphabet, only comparisons of characters are allowed. In an ordered alphabet of constant size, a multiset of characters can be sorted in linear time using counting sort. An integer alphabet is {1, . . . , σ} for an integer σ ≥ 2. Here, sorting a multiset of k characters can be done in O(k + σ) time with the same algorithm. Table 9.1 states simple lower bounds for sorting in these models.
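As a concrete illustration of the integer alphabet model, a small counting sort sketch (our example code): it sorts a multiset of k characters from {1, . . . , σ} in O(k + σ) time.

#include <vector>

// Sort a multiset of characters over the alphabet {1, ..., sigma}
// in O(k + sigma) time by counting occurrences.
std::vector<int> countingSort(const std::vector<int>& chars, int sigma) {
    std::vector<int> count(sigma + 1, 0);
    for (int c : chars) ++count[c];          // histogram
    std::vector<int> sorted;
    sorted.reserve(chars.size());
    for (int c = 1; c <= sigma; ++c)         // emit each character count[c] times
        sorted.insert(sorted.end(), count[c], c);
    return sorted;
}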

If we use a standard sorting algorithm for strings, the worst case requires Θ(n log n) string comparisons. Consider s_i = αβ_i, where |α| = |β_i| = log n: every comparison has to scan the common prefix α, and D = Θ(n log n).

Figure 9.1: Example on distinguishing prefixes for the words alignment, all, allocate, alphabet, alternate, alternative.


alphabet    lower bound
ordered     Ω(D + n log n)
constant    Ω(D)
integer     Ω(D)

Table 9.1: Simple lower bounds for string sorting using different alphabet models


Figure 9.2: One partitioning step in multikey quicksort, with pivot ’l’ in ’allocate’

Our lower bound is Ω(D + n log n) = Ω(n log n), but standard sorting has costs of Θ(n log n) · Θ(log n) = Θ(n log^2 n). In the next sections, we try to approach the lower bound for string sorting.

9.2 Multikey Quicksort

Multikey quicksort [40] performs in every recursion level a ternary partitioning of the data elements. In contrast to the standard algorithm, the pivot is not a whole key (which would be a complete word), but only the first character following the common prefix shared by all elements.

We will now analyse the algorithm given in pseudo code in Figure 9.3. The running time is dominated by the comparisons done in the partitioning step. We will use amortized analysis to count these comparisons. If s[ℓ + 1] ≠ p[ℓ + 1], we charge the comparison on s. Assuming a perfect choice for the pivot element, we see that the total charge on s for this type of comparison is ≤ log n, as the partition containing s is at least halved. If we have s[ℓ + 1] = p[ℓ + 1], we charge the comparison on s[ℓ + 1]. After that, s[ℓ + 1] becomes part of the common prefix in its partition and will never again be chosen as pivot character. Therefore, the charge on s[ℓ + 1] is ≤ 1 and the total charge on all characters is ≤ D. Combining this with the above result, we get a total runtime of O(D + n log n). The only flaw in the above analysis is the assumption of a perfect pivot. Like in the analysis of standard quicksort, we can show that the expected number of ≠ comparisons is 2n ln n when using a random pivot character.

159

Page 161: Felix Putze and Peter Sanders - KITalgo2.iti.kit.edu/sanders/courses/algen17/skript.pdfFelix Putze and Peter Sanders Algorithmics design implement analyze experiment Course Notes Algorithm

Function Multikey-quicksort(R : Sequence of String, ℓ : Integer) : Sequence of String
    // ℓ is the length of the common prefix in R
    if |R| ≤ 1 then return R
    choose pivot p ∈ R
    R< := ⟨s ∈ R | s[ℓ + 1] < p[ℓ + 1]⟩
    R= := ⟨s ∈ R | s[ℓ + 1] = p[ℓ + 1]⟩
    R> := ⟨s ∈ R | s[ℓ + 1] > p[ℓ + 1]⟩
    Multikey-quicksort(R<, ℓ)
    Multikey-quicksort(R=, ℓ + 1)
    Multikey-quicksort(R>, ℓ)
    return concatenation of R<, R=, and R>

Figure 9.3: Pseudocode for Multikey Quicksort
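A runnable C++ transcription of Figure 9.3 (our sketch: it partitions by copying instead of in place, takes the middle string's character as pivot instead of a random one, and treats the end of a string as character 0, which also guards the recursion when all strings in R= end at position ℓ):

#include <string>
#include <utility>
#include <vector>

using Strings = std::vector<std::string>;

// Character of s at position l, or 0 if s has ended (strings are assumed
// to contain no 0 bytes).
static int charAt(const std::string& s, std::size_t l) {
    return l < s.size() ? (unsigned char)s[l] : 0;
}

Strings multikeyQuicksort(Strings R, std::size_t l) {  // l = common prefix length
    if (R.size() <= 1) return R;
    const int p = charAt(R[R.size() / 2], l);          // pivot character
    Strings lt, eq, gt;                                // R<, R=, R>
    for (std::string& s : R) {
        const int c = charAt(s, l);
        Strings& dest = c < p ? lt : (c == p ? eq : gt);
        dest.push_back(std::move(s));
    }
    lt = multikeyQuicksort(std::move(lt), l);
    if (p != 0)                                        // strings in eq share one more character
        eq = multikeyQuicksort(std::move(eq), l + 1);
    gt = multikeyQuicksort(std::move(gt), l);
    Strings result;                                    // concatenation of R<, R=, R>
    result.reserve(lt.size() + eq.size() + gt.size());
    for (std::string& s : lt) result.push_back(std::move(s));
    for (std::string& s : eq) result.push_back(std::move(s));
    for (std::string& s : gt) result.push_back(std::move(s));
    return result;
}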

9.3 Radix Sort

Another classic string sorting algorithm is radix sort. There exist two main variants: LSD-first radix sort starts from the end of the strings (Least Significant Digit first) and moves backward by one position in each step, until the first character is reached. In every phase, it partitions all strings according to the character at the current position (one group for every possible character). When this is done, the strings are recollected, starting with the group corresponding to the “smallest” character. For correct sorting, this has to be done in a stable way within a group. The LSD variant accesses all characters (as we have to reach the first character of each word for correct sorting), which implies costs of Ω(N) time. This is poor when D ≪ N.

MSD-first radix sort on the other hand starts from the beginning of the strings (Most Significant Digit first). It distributes the strings (using counting sort) to groups according to the character at the current position and sorts these groups recursively (increasing the position of the relevant character by 1). Then, all groups are concatenated, in the order of the corresponding characters.1 This variant accesses only the distinguishing prefixes.

What is the running time of MSD-first radix sort? Partitioning a group of k strings into σ buckets takes O(k + σ) time. As the total size of the partitioned groups is D, we have O(D) total time on constant alphabets.

The total number of times any string is assigned to a group is D (the total size of all groups created while sorting).

1 When implementing this algorithm, many ideas used for Super Scalar Sample Sort (e.g., the two-pass approach to determine the optimal bucket size) will also help for MSD-first radix sort. In fact, MSD-first radix sort inspired the development of Super Scalar Sample Sort.


Figure 9.4: Example of one partitioning phase in MSD-first radix sort using counting sort for allocation

alphabet    lower bound        upper bound        algorithm
ordered     Ω(D + n log n)     O(D + n log n)     multikey quicksort
constant    Ω(D)               O(D)               radix sort
integer     Ω(D)               O(D + n log σ)     radix sort + multikey quicksort

Figure 9.5: Overview on upper and lower bounds using different alphabet models

For every non-trivial partitioning step (where not all characters are equal), additional costs of O(σ) for creating groups occur. Obviously, the number of non-trivial partitionings is ≤ n. We therefore have costs of O(D + nσ), which becomes O(D) for constant alphabets. When dealing with integer alphabets, another improvement helps lowering the running time: When k < σ, where k is the number of strings to be partitioned in a certain step, switch to multikey quicksort. This results in a running time of O(D + n log σ).
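A compact C++ sketch of MSD-first radix sort over the byte alphabet using counting sort for the distribution (our code; for brevity it always recurses instead of switching to multikey quicksort for small groups, and it distributes into a temporary array rather than permuting in place):

#include <algorithm>
#include <iterator>
#include <string>
#include <vector>

// Key of s at position l: 0 for "string has ended", otherwise the byte
// value shifted to 1..256 so that shorter strings sort first.
static int key(const std::string& s, std::size_t l) {
    return l < s.size() ? (unsigned char)s[l] + 1 : 0;
}

void msdRadixSort(std::vector<std::string>& R, std::size_t l = 0) {
    if (R.size() <= 1) return;
    const int sigma = 257;                          // end marker + 256 byte values
    std::vector<std::size_t> start(sigma + 1, 0);   // start[c] = first index of group c
    for (const std::string& s : R) ++start[key(s, l) + 1];
    for (int c = 0; c < sigma; ++c) start[c + 1] += start[c];  // prefix sums
    std::vector<std::size_t> pos(start.begin(), start.begin() + sigma);
    std::vector<std::string> sorted(R.size());
    for (std::string& s : R) sorted[pos[key(s, l)]++] = std::move(s);  // distribute
    R.swap(sorted);
    for (int c = 1; c < sigma; ++c) {               // group 0 holds finished strings
        if (start[c + 1] - start[c] > 1) {
            std::vector<std::string> group(
                std::make_move_iterator(R.begin() + start[c]),
                std::make_move_iterator(R.begin() + start[c + 1]));
            msdRadixSort(group, l + 1);             // recurse on the next position
            std::move(group.begin(), group.end(), R.begin() + start[c]);
        }
    }
}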

Table 9.5 gives an overview over the results of this chapter. Some gaps could be closed; others require more elaborate techniques beyond this text.


Chapter 10

Suffix Array Construction

The description of the DC3 algorithm was taken from [41]. Material on external suffix array construction is from [42].

10.1 Introduction

The suffix tree of a string is a compact trie of all the suffixes of the string. It is a powerful data structure with numerous applications in computational biology and elsewhere. One of the important properties of the suffix tree is that it can be constructed in linear time in the length of the string. The classical linear time algorithms require a constant alphabet size, but Farach’s algorithm works also for integer alphabets, i.e., when characters are polynomially bounded integers.

The suffix array is a lexicographically sorted array of the suffixes of a string. For several applications, a suffix array is a simpler and more compact alternative to suffix trees. The suffix array can be constructed in linear time by a lexicographic traversal of the suffix tree, but such a construction loses some of the advantage that the suffix array has over the suffix tree. We introduce the DC3 algorithm, a linear-time direct suffix array construction algorithm for integer alphabets. The DC3 algorithm is simpler than any suffix tree construction algorithm. In particular, it is much simpler than linear time suffix tree construction for integer alphabets.

10.2 The DC3 Algorithm

The DC3 algorithm has the following structure:

1. Recursively construct the suffix array of the suffixes starting at positions i mod 3 ≠ 0. This is done by reduction to the suffix array construction of a string of two thirds the length, which is solved recursively.


Figure 10.1: The DC3 algorithm applied to s = mississippi. First get all suffixes with index mod 3 = 0, 2. Group their characters to triples and map these meta-characters to an integer alphabet. Use the resulting string as input for a recursive call. The result contains at position i the index of the suffix with rank i. Sort the suffixes with index mod 3 = 1 (using the rank of the following suffix). Merge both results.

2. Construct the suffix array of the remaining suffixes using the result of the first step.

3. Merge the two suffix arrays into one.

If we halved the string length for the recursion, step three would be very difficult and costly. Surprisingly, the use of two thirds instead of half of the suffixes in the recursion makes the last step almost trivial: a simple comparison-based merging is sufficient. For example, to compare suffixes starting at i and j with i mod 3 = 0 and j mod 3 = 1, we first compare the initial characters, and if they are the same, we compare the suffixes starting at i + 1 and j + 1, whose relative order is already known from the first step.

Algorithm DC3

Input The input is a string T = T[0, n) = t_0 t_1 · · · t_{n−1} over the alphabet [1, n], that is, a sequence of n integers from the range [1, n]. For convenience, we assume that t_n = t_{n+1} = t_{n+2} = 0.

The restriction to the alphabet [1, n] is not a serious one. For a string T over any alphabet, we can first sort the characters of T, remove duplicates, assign a rank to each character, and construct a new string T′ over the alphabet [1, n] by renaming the characters of T with their ranks.


Since the renaming is order preserving, the order of the suffixes does not change.

Output For i ∈ [0, n], let S_i denote the suffix T[i, n) = t_i t_{i+1} · · · t_{n−1}. We also extend the notation to sets: for C ⊆ [0, n], S_C = {S_i | i ∈ C}. The goal is to sort the set S_{[0,n]} of suffixes of T lexicographically. The output is the suffix array SA[0, n] of T, a permutation of [0, n] defined by

SA[i] = |{j ∈ [0, n] | S_j < S_i}|.

Step 0: Construct a sample. For k = 0, 1, 2, define

B_k = {i ∈ [0, n] | i mod 3 = k}.

Let C = B_1 ∪ B_2 be the set of sample positions and S_C the set of sample suffixes.

Step 1: Sort sample suffixes. For k = 1, 2, construct the strings

R_k = [t_k t_{k+1} t_{k+2}] [t_{k+3} t_{k+4} t_{k+5}] . . . [t_{max B_k} t_{max B_k + 1} t_{max B_k + 2}]

whose characters are the triples [t_i t_{i+1} t_{i+2}]. Note that the last character of R_k is always unique because t_{max B_k + 2} = 0. Let R = R_1 ⊙ R_2 be the concatenation of R_1 and R_2. Then the (nonempty) suffixes of R correspond to the set S_C of sample suffixes: [t_i t_{i+1} t_{i+2}] [t_{i+3} t_{i+4} t_{i+5}] . . . corresponds to S_i. The correspondence is order preserving, i.e., by sorting the suffixes of R we get the order of the sample suffixes S_C.

To sort the suffixes of R, first radix sort the characters of R. If all characters are different, the order of the characters directly gives the order of the suffixes. Otherwise, we use the technique of renaming the characters with their ranks, and then sort the suffixes of the resulting string using Algorithm DC3.

Once the sample suffixes are sorted, assign a rank to each suffix. For i ∈ C, let rank(S_i) denote the rank of S_i in the sample set S_C. Additionally, define rank(S_{n+1}) = rank(S_{n+2}) = 0. For i ∈ B_0, rank(S_i) is undefined.

Step 2: Sort nonsample suffixes. Represent each nonsample suffix S_i ∈ S_{B_0} with the pair (t_i, rank(S_{i+1})). Note that rank(S_{i+1}) is always defined for i ∈ B_0. Clearly we have, for all i, j ∈ B_0,

S_i ≤ S_j ⇐⇒ (t_i, rank(S_{i+1})) ≤ (t_j, rank(S_{j+1})).

The pairs (t_i, rank(S_{i+1})) are then radix sorted.


Step 3: Merge. The two sorted sets of suffixes are merged using a standard comparison-based merging. To compare a suffix S_i ∈ S_C with S_j ∈ S_{B_0}, we distinguish two cases:

i ∈ B_1: S_i ≤ S_j ⇐⇒ (t_i, rank(S_{i+1})) ≤ (t_j, rank(S_{j+1}))

i ∈ B_2: S_i ≤ S_j ⇐⇒ (t_i, t_{i+1}, rank(S_{i+2})) ≤ (t_j, t_{j+1}, rank(S_{j+2}))

Note that the ranks are defined in all cases.
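The case distinction translates directly into code. A small sketch of the merge comparator (our code; it assumes t and rank are stored as arrays padded as above, i.e., t[n] = t[n+1] = t[n+2] = 0 and rank[n+1] = rank[n+2] = 0):

#include <tuple>
#include <vector>

// Is sample suffix S_i (i mod 3 != 0) <= nonsample suffix S_j (j mod 3 = 0)?
// All ranks accessed below are defined, as noted in the text.
bool sampleLeq(int i, int j,
               const std::vector<int>& t, const std::vector<int>& rank) {
    if (i % 3 == 1)
        return std::make_tuple(t[i], rank[i + 1])
            <= std::make_tuple(t[j], rank[j + 1]);
    // i mod 3 = 2: one additional character breaks the tie
    return std::make_tuple(t[i], t[i + 1], rank[i + 2])
        <= std::make_tuple(t[j], t[j + 1], rank[j + 2]);
}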

Theorem 14 The time complexity of Algorithm DC3 is O(n).

Proof: Excluding the recursive call, everything can clearly be done in linear time. The recursion is on a string of length ⌈2n/3⌉. Thus the time is given by the recurrence T(n) = T(2n/3) + O(n), whose solution is O(n).

10.3 External Suffix Array Construction

In this section we try to engineer algorithms for suffix array construction that work on huge inputs, using the external memory model.

The Doubling Algorithm

Figure 10.2 gives pseudocode for the doubling algorithm. The basic idea is to replace characters T[i] of the input by lexicographic names that respect the lexicographic order of the length-2^k substring T[i, i + 2^k) in the k-th iteration. In contrast to previous variants of this algorithm, our formulation never actually builds the resulting string of names. Rather, it manipulates a sequence P of pairs (c, i) where each name c is tagged with its position i in the input. To obtain names for the next iteration k + 1, the names for T[i, i + 2^k) and T[i + 2^k, i + 2^{k+1}) together with the position i are stored in a sequence S and sorted. The new names can now be obtained by scanning this sequence and comparing adjacent tuples. Sequence S can be built using consecutive elements of P if we sort P using the pair (i mod 2^k, i div 2^k). Previous formulations of the algorithm use i as a sorting criterion and therefore have to access elements that are 2^k characters apart. Our approach saves I/Os and simplifies the pipelining optimization described in Section 10.3.

The algorithm performs a constant number of sorting and scanning operations on sequences of size n in each iteration. The number of iterations is determined by the logarithm of the longest common prefix.

Theorem 15 The doubling algorithm computes a suffix array using O(sort(n) ⌈log maxlcp⌉) I/Os, where maxlcp := max_{0 ≤ i < n} lcp(i, i + 1).


Function doubling(T)
    S := ⟨((T[i], T[i + 1]), i) : i ∈ [0, n)⟩                    (1)
    for k := 1 to ⌈log n⌉ do
        sort S                                                   (2)
        P := name(S)                                             (3)
        invariant ∀(c, i) ∈ P: c is a lexicographic name for T[i, i + 2^k)
        if the names in P are unique then
            return ⟨i : (c, i) ∈ P⟩                              (4)
        sort P by (i mod 2^k, i div 2^k)                         (5)
        S := ⟨((c, c′), i) : j ∈ [0, n),                         (6)
              (c, i) = P[j], (c′, i + 2^k) = P[j + 1]⟩

Function name(S : Sequence of Pair)
    q := r := 0; (ℓ, ℓ′) := ($, $)
    result := ⟨⟩
    foreach ((c, c′), i) ∈ S do
        q++
        if (c, c′) ≠ (ℓ, ℓ′) then r := q; (ℓ, ℓ′) := (c, c′)
        append (r, i) to result
    return result

Figure 10.2: The doubling algorithm.
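For comparison, here is a compact internal-memory sketch of the doubling idea (our code). It accesses name[i + 2^k] directly by index, i.e., it follows the traditional formulation rather than the I/O-efficient resorting by (i mod 2^k, i div 2^k), and it runs in O(n log^2 n) time with comparison sorting. It assumes the input convention from above: integers ≥ 1, conceptually padded with 0, which also guarantees termination.

#include <algorithm>
#include <tuple>
#include <vector>

// Returns the suffix array as indices in lexicographic order:
// sa[r] is the starting position of the suffix of rank r.
std::vector<int> doublingSuffixArray(const std::vector<int>& T) {
    const int n = (int)T.size();
    std::vector<std::tuple<int, int, int>> S(n);       // ((c, c'), i)
    std::vector<int> name(n), sa(n);
    auto ch = [&](int i) { return i < n ? T[i] : 0; }; // t_n = t_{n+1} = 0
    for (int i = 0; i < n; ++i) S[i] = std::make_tuple(ch(i), ch(i + 1), i);
    for (int k = 1; ; ++k) {               // after sorting, names cover length 2^k
        std::sort(S.begin(), S.end());     // "sort S"
        bool unique = true;                // "P := name(S)"
        for (int j = 0; j < n; ++j) {
            const int i = std::get<2>(S[j]);
            if (j > 0 && std::get<0>(S[j]) == std::get<0>(S[j - 1])
                      && std::get<1>(S[j]) == std::get<1>(S[j - 1])) {
                name[i] = name[std::get<2>(S[j - 1])];
                unique = false;
            } else {
                name[i] = j;               // rank of the first occurrence
            }
        }
        if (unique) {                      // "return <i : (c, i) in P>"
            for (int j = 0; j < n; ++j) sa[j] = std::get<2>(S[j]);
            return sa;
        }
        const int h = 1 << k;              // build next S with shift 2^k;
        for (int i = 0; i < n; ++i)        // -1 stands for the empty suffix
            S[i] = std::make_tuple(name[i], i + h < n ? name[i + h] : -1, i);
    }
}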

Pipelining

The I/O volume of the doubling algorithm from Figure 10.2 can be reduced significantly by observing that rather than writing the sequence S to external memory, we can directly feed it to the sorter in Line (1). Similarly, the sorted tuples need not be written but can be directly fed into the naming procedure in Line (2), which can in turn forward the result to the sorter in Line (4). The result of this sorting operation need not be written but can directly yield tuples of S that can be fed into the next iteration of the doubling algorithm.

To motivate the idea of pipelining, let us first analyze the constant factor in a naive implementation of the doubling algorithm from Figure 10.2. For simplicity assume for now that inputs are not too large, so that sorting m words can be done in 4m/DB I/Os using two passes over the data. For example, one run formation phase could build sorted runs of size M and one multiway merging phase could merge the runs into a single sorted sequence.

Line (1) sorts n triples and hence needs 12n/DB I/Os. Naming in Line (2) scans the triples and writes name-index pairs using 3n/DB + 2n/DB = 5n/DB I/Os.


The naming procedure can also determine whether all names are unique, hence the test in Line (3) needs no I/Os. Sorting the pairs in P in Line (4) costs 8n/DB I/Os. Scanning the pairs and producing triples in Line (5) costs another 5n/DB I/Os. Overall, we get (12 + 5 + 8 + 5)n/DB = 30n/DB I/Os for each iteration.

This can be radically reduced by interpreting the sequences S and P not as files but as pipelines similar to the pipes available in UNIX. In the beginning we explicitly scan the input T and produce triples for S. We do not count these I/Os since they are not needed for the subsequent iterations. The triples are not output directly but immediately fed into the run formation phase of the sorting operation in Line (1). The runs are output to disk (3n/DB I/Os). The multiway merging phase reads the runs (3n/DB I/Os) and directly feeds the sorted triples into the naming procedure called in Line (2), which generates pairs that are immediately fed into the run formation process of the next sorting operation in Line (4) (2n/DB I/Os). The multiway merging phase (2n/DB I/Os) for Line (4) does not write the sorted pairs; instead, Line (5) generates from them the triples for S that are fed into the pipeline for the next iteration. We have eliminated all the I/Os for scanning and half of the I/Os for sorting, resulting in only 10n/DB I/Os per iteration — only one third of the I/Os needed for the naive implementation.

Note that pipelining would have been more complicated in the more traditional formulation where Line (3) sorts P directly by the index i. In that case, a pipelining formulation would require a FIFO of size 2^k to produce shifted sequences. When 2^k > M this FIFO would have to be maintained externally, causing 2n/DB additional I/Os per iteration, i.e., our modification simplifies the algorithm and saves up to 20 % of the I/Os.

Let us discuss a more systematic model: The computations in many external memory algorithms can be viewed as a data flow through a directed acyclic graph G = (V = F ∪ S ∪ R, E). The file nodes F represent data that has to be stored physically on disk. When a file node f ∈ F is accessed, we need a buffer of size b(f) = Ω(BD). The streaming nodes s ∈ S read zero, one or several sequences and output zero, one or several new sequences using internal buffers of size b(s).1 The sorting nodes r ∈ R read a sequence and output it in sorted order. Sorting nodes have a buffer requirement of b(r) = Θ(M) and outdegree one.2 Edges are labeled with the number of machine words w(e) flowing between two nodes.

Theorem 16 The doubling algorithm from Figure 10.2 can be implemented to run using sort(5n) ⌈log(1 + maxlcp)⌉ + O(sort(n)) I/Os.

Proof: The flow graph below shows that each iteration can be implemented using sort(2n) + sort(3n) ≤ sort(5n) I/Os; the numbers in it refer to the line numbers in Figure 10.2.

1 Streaming nodes may cause additional I/Os for internal processing, e.g., for large FIFO queues or priority queues. These I/Os are not counted in our analysis.

2 We could allow additional outgoing edges at an I/O cost of n/DB. However, this would mean performing the last phase of the sorting algorithm several times.


Figure 10.3: Data flow graph for doubling with discarding. The numbers refer to line numbers in Figure 12.4. The edge weights are sums over the whole execution, with N = n log dps.

[Flow graph omitted: streaming nodes and sorting nodes for lines 1–5, with edge weights 3n and 2n.]

After ⌈log(1 + maxlcp)⌉ iterations, the algorithm finishes. The O(sort(n)) term accounts for the I/Os needed in Line 0 and for computing the final result. Note that there is a small technicality here: Although naming can find out “for free” whether all names are unique, the result is known only when naming finishes. However, at this time, the first phase of the sorting step in Line 4 has also finished and has already incurred some I/Os. Moreover, the convenient arrangement of the pairs in P is destroyed now. However, we can then abort the sorting process, undo the wrong sorting, and compute the correct output.

Discarding

Let c_i^k be the lexicographic name of T[i, i + 2^k), i.e., the value paired with i at iteration k in Figure 10.2. Since c_i^k is the number of strictly smaller substrings of length 2^k, it is a non-decreasing function of k. More precisely, c_i^{k+1} − c_i^k is the number of positions j such that c_j^k = c_i^k but c_{j+2^k}^k < c_{i+2^k}^k. This provides an alternative way of computing the names, given in Figure 12.3.

Another consequence of the above observation is that if c_i^k is unique, i.e., c_j^k ≠ c_i^k for all j ≠ i, then c_i^h = c_i^k for all h > k. The idea of the discarding algorithm is to take advantage of this, i.e., discard pair (c, i) from further iterations once c is unique. A key to this is the new naming procedure in Figure 12.3, because it works correctly even if we exclude from S all tuples ((c, c′), i) where c is unique. Note, however, that we cannot exclude ((c, c′), i) if c′ is unique but c is not. Therefore, we will partially discard (c, i) when c is unique. We will fully discard (c, i) = (c_i^k, i) when additionally either c_{i−2^k}^k or c_{i−2^{k+1}}^k is unique, because then in any iteration h > k, the first component of the tuple ((c_{i−2^h}^h, c_i^h), i − 2^h) must be unique. The final algorithm is given in Figure 12.4.


Theorem 17 Doubling with discarding can be implemented to run using sort(5n log dps) + O(sort(n)) I/Os.

The proof of Theorem 17 and the corresponding pseudocode can be found in Sections 12.7 and 12.8 in the appendix.


Chapter 11

Presenting Data from Experiments

11.1 Introduction

A paper in experimental algorithmics will often start by describing the problem and the experimental setup. Then a substantial part will be devoted to presenting the results together with their interpretation. Consequently, compiling the measured data into graphs is a central part of writing such a paper. This problem is often rather difficult because several competing factors are involved. First, the measurements can depend on many parameters: problem size and other quantities describing the problem instance; variables like the number of processors and the available memory describing the machine configuration used; and the algorithm variant together with tuning parameters such as the cooling rate in a simulated annealing algorithm.

Furthermore, many quantities can be measured, such as solution quality, execution time, memory consumption, and other more abstract complexity measures such as the number of comparisons performed by a sorting algorithm. Mathematically speaking, we sample function values of a mapping f : A → B where the domain A can be high-dimensional. We hope to uncover properties of f from the measurements, e.g., an estimate of the time complexity of an algorithm as a function of the input size. Measurement errors may additionally complicate this task.

As a consequence of the multitude of parameters, a meaningful experimental setup will often produce large amounts of data and still cover only a tiny fraction of the possible measurements. This data has to be presented in a way that clearly demonstrates the observed properties. The most important presentation usually takes place in conference proceedings or scientific journals, where limited space and format restrictions further complicate the task.

This paper collects rules that have proven to be useful in designing good graphs. Although the examples are drawn from the work of the author, this paper owes a lot to discussions with colleagues and detailed feedback from several referees.


Sections 11.3–11.7 explain the rules. The stress is on Section 11.4, where two-dimensional figures are discussed in detail.

Instead of an abstract conclusion, Section 11.8 collects all the rules in a check list that can be used for teaching and as a source of ideas for improving graphs.

11.2 The Process

In a simplified model of experimental algorithmics, a paper might be written using a “waterfall model”: the experimental design is followed by a description of the measurements, which is in turn followed by an interpretation. In reality, there are numerous feedbacks involved and some might even remain visible in a presentation. After an algorithm has been implemented, one typically builds a simple yet flexible tool that allows many kinds of measurements. After some explorative measurements the researcher gets a basic idea of interesting parameter settings. Hypotheses are formed and tested using more extensive measurements over particular parameter ranges. This is the scientifically most productive phase and often leads to new insights, which lead to algorithmic changes, which influence the entire setup.

It should be noted that most algorithmic problems are so complex that one cannot expect to arrive at an ultimate set of measurements that answers all conceivable questions. Rather, one is constantly facing a list of interesting open questions that require new measurements. The process of selecting the measurements that are actually performed is driven by risk and opportunity: The researcher will usually have a set of hypotheses that have some support from measurements, but more measurements might be important to confirm them. For example, if the hypothesis is “my algorithm is better than all the others”, a big risk might be that a promising other algorithm or important classes of problem instances have not been tried yet. A small risk might be that a tuning parameter has so far been set in an ad hoc fashion where it is clear that it can only improve a precomputation phase that takes 20 % of the execution time.

An opportunity might be a new idea of the authors’ that an algorithm might be useful for a new application it was not originally designed for. In that case, one might consider including problem instances from the new application in the measurements.

At some point, a group of researchers decides to cast the current state of results into a paper. The explorative phase is then stopped for a while. To make the presentation concise and convincing, alternative ways to display the data are designed that are compact enough to meet space restrictions and make the conclusions evident. This might also require additional measurements giving additional support to the hypotheses studied.


11.3 Tables

Tables are easier to produce than graphs, and perhaps this advantage is why they are often overused. Tables are more difficult to interpret and too large for large data sets. A more detailed explanation why tables are often a bad idea has been given by McGeoch and Moret [51]. Nevertheless, tables have their place. Tufte [59] gives the rule of thumb that “tables usually outperform a graph for small data sets of 20 numbers or less”. Tables give very accurate values, which make it easier to check whether some experiments can be reproduced. Furthermore, one sometimes wants to present some quantities, e.g., solution quality, as a function of problem instances which cannot be meaningfully arranged on the axis of a graph. In that case, a graph or bar chart may look nicer but does not add utility compared to a more accurate and compact table. Often a paper will contain small tables with particularly important results and graphs giving results in an abstract yet less accurate way. Furthermore, there may be an appendix or a link to a web page containing larger tables for more detailed documentation of the results.

11.4 Two-dimensional Figures

As our standard example we will use the case that execution time should be displayed as a function of input size. The same rules will usually apply for many other types of variables. Sometimes we mention special examples which should be displayed differently.

The x-Axis

The first question one can ask oneself is what unit one chooses for the x-axis. For example, assume we want to display the time it takes to broadcast a message of length k in some network where transmitting k′ bytes of data from one processor to another takes time t_0 + k′. Then it makes sense to plot the execution time as a function of k/t_0 because for many implementations, the shape of the curve will then become independent of t_0. More generally, by choosing an appropriate unit, we can sometimes get rid of one degree of freedom. Figure 11.1 gives an example.

The variable defining the x-axis can often vary over many orders of magnitude. Therefore one should always consider whether a logarithmic scale is appropriate for the x-axis. This is an accepted way to give a general idea of a function over a wide range of values. One will then choose measurement values such that they are about evenly spaced on the x-axis, e.g., powers of two or powers of √2. Figures 11.3, 11.5, and 11.6 all use powers of two. In this case, one should also choose tic marks which are powers of two and not powers of ten. Figures 11.1 and 11.4 use the “default” base ten because there is no choice of input sizes involved here.


Figure 11.1: Improvement of the fractional tree broadcasting algorithm [57] over the best of pipelined binary tree and sequential pipeline algorithm as a function of message transmission time k over startup overhead t_0, for P = 64, 1024, and 16384 processors. (See also Section 11.4.)


Sometimes it is appropriate to give more measurements for small x-values because they are easily obtained and particularly important. Conversely, it is not a good idea to measure using constant offsets (x ∈ {x_0 + i∆ : 0 ≤ i < i_max}) as if one had a linear scale and then to display the values on a logarithmic scale. This looks awkward because points are crowded for large values. Often there will be too few values for small x, and one nevertheless wastes a lot of measurement time for large inputs.

A plain linear scale is adequate if the interesting range of x-values is relatively small, for example if the x-axis is the number of processors used and one measures on a small machine with only 8 processors. A linear scale is also good if one wants to point out periodic behavior, for example if one wants to demonstrate that slow-downs due to cache conflicts get very large whenever the input size is a multiple of the cache size. However, one should resist the temptation to use a linear scale when x-values over many orders of magnitude are important but one’s own results look particularly good for large inputs.

Sometimes, transformations of the x-axis other than linear or logarithmic make sense. For example, in queuing systems one is often interested in the delay of requests as the system load approaches the maximum performance of the system. Figure 11.2 gives an example. Assume we have a disk server with 64 disks. Data is placed randomly on these disks using a hash function. Assume that retrieving a block from a disk takes one time unit and that there is a periodic stream of requests — one every (1 + ε)/64 time units. Using queuing theory one can show that the delay of a request is approximately proportional to 1/ε if only one copy of every block is available. Therefore, it makes sense to use 1/ε as the x-value. First, this transformation makes it easy to check whether the system measured also shows this behavior linear in 1/ε. Second, one gets high resolution for arrival rates near the saturation point of the system. Such high arrival rates are often more interesting than low arrival rates because they correspond to very efficient uses of the system.

The y-Axis

Given that the x-axis often has a logarithmic scale, we often seem to be forced to use a logarithmic scale also for the y-axis. For example, if the execution time is approximately some power of the problem size, such a double-logarithmic plot will yield a straight line.

However, plots of the execution time can be quite boring. Often, we already know the general shape of the curve. For example, a theoretical analysis may tell us that the execution time is between T(n) = Ω(n) and T(n) = O(n Polylog(n)). A double-logarithmic plot will show something very close to a diagonal and discerns very little about the Polylog term we are really interested in. In such a situation, we transform the y-axis so that a priori information is factored out. In the example above we could better display T(n)/n and then use a linear scale for the y-axis. A disadvantage of such transformations is that they may be difficult to explain. However, often this problem can be solved by finding a good term describing the quantity displayed.



Figure 11.2: Comparison of eight algorithms for scheduling accesses to parallel disks using the model described in the text; the two plots show {nonredundant, mirror, ring shortest queue, ring with matching, shortest queue} and {shortest queue, hybrid, lazy sharing, matching} (note that “shortest queue” appears in both figures). Only the two algorithms “nonredundant” and “mirror” exhibit a linear behavior of the access delay predicted by queuing theory. The four best algorithms are based on random duplicate allocation — every block is available on two randomly chosen disks and a scheduling algorithm [55] decides which copy to retrieve. (See also Section 11.4.)



Figure 11.3: Comparison of three different priority queue algorithms (bottom-up binary heap, bottom-up aligned 4-ary heap, sequence heap) [58] on a MIPS R10000 processor. N is the size of the queue. All algorithms use Θ(log N) key comparisons per operation. The y-axis shows the total execution time for some particular operation sequence divided by the number of deletion/insertion pairs and log N. Hence the plotted value is proportional to the execution time per key comparison. This scaling was chosen to expose cache effects, which are now the main source of variation in the y-value. (See also Section 11.4.)

For example, one speaks of “time per element” when one divides by the input size, of a “competitive ratio” when one divides by a lower bound, or of “efficiency” when one displays the ratio between an upper performance bound and the measured performance. Figure 11.3 gives an example for using such a ratio.

Another consideration is the range of y-values displayed. Assume y_min > 0 is the minimal value observed and y_max is the maximal value observed. Then one will usually choose [y_min, y_max] or (better) a somewhat larger interval as the displayed range. In this case, one should however be careful with overinterpreting the resulting picture. A change of the y-value by 1 % will look equal to a change of the y-value of 400 %. If one wants to support claims such as “for large x the improvements due to the new algorithm become very large” using a graph, choosing the range [0, y_max] can be a sounder choice. (At least if y_max/y_min is not too close to one. Some of the space “wasted” this way can often be used for placing curve labels.) In Figure 11.2, using y_min = 1 is appropriate since no request can get an access delay below one in the model used.

The choice of the maximum y-value displayed can also be nontrivial.


In particular, it may be appropriate to clip extreme values if they correspond to measurement points which are clearly useless in practice. For example, in Figure 11.2 it is not very interesting to see the entire curve for the algorithm “nonredundant” since it is clearly outclassed for large 1/ε anyway and since we have a good theoretical understanding of this particular curve.

A further degree of freedom is the vertical size of the graph. This parameter can be used to achieve the above goals and the rule of “banking to 45°”: The weighted average of the slants of the line segments in the figure should be about 45°.1 Refer to [47] for a detailed discussion. The weight of a segment is the x-interval it bridges. There is good empirical and mathematical evidence that graphs using this rule make changes in slope most easily visible.

If banking to 45° does not yield a clear insight regarding the graph size, a good rule of thumb is to make the graph a bit wider than high [59]. A traditional choice is the golden ratio, i.e., a graph that is 1.62 times wider than high.

Arranging Multiple Curves

An important feature of two-dimensional graphs is that we can place several curves in a single graph as in Figures 11.1, 11.2, and 11.3. In this way we can obtain a high information density without the disadvantages of three-dimensional plots. However, one can easily overdo it, resulting in a chaos of undecipherable points and lines. How many curves fit into one picture depends on the information density. When curves are very smooth and have few points where they cross each other, as in Figure 11.2, up to seven curves may fit in one figure. If curves are very complicated, even three curves may be too much. Often one will start with a straight-forward graph that turns out to be too ugly for publication. Then one can use a number of techniques to improve it:

• Remove unnecessary curves. For example, Figure 11.2 from [55] compares only eight algorithms out of eleven studied in this paper. The remaining three are clearly outclassed or equivalent to other algorithms for the measurement considered.

• If several curves are too close together in an important range of x-values, consider using another y-range or scale. If the small differences persist and are important, consider using a separate graph with a magnification. For example, in Figure 11.2 the four fastest algorithms were put into a separate plot to show the differences between them.

• Check whether several curves can be combined into one curve. For example, assume we want to compare a new improved algorithm with several inferior old algorithms for input sizes on the x-axis. Then it might be sufficient to plot the speedup of the new algorithm over the best of the old algorithms, perhaps labeling the sections of the speedup curve so that the best of the old algorithms can be identified for all x-values. Figure 11.1 gives an example where the speedup of one algorithm over two other algorithms is shown.

1 This is one of the few things described here which are not easy to do with gnuplot. But even keeping the principle of banking to 45° in mind is helpful.



• Decrease noise in the data as described in Section 11.4.

• Once noise is small, replace error bars with specifications of the accuracy in the caption as in Figure 11.6.

• Connect points belonging to the same curves using straight lines.

• Choose different point styles and line styles for different curves.

• Arrange labels explaining point and line styles in the “same order”2 as they appear in the graph. Sometimes one can also place the labels directly at the curves. But even then the labels should not obscure the curves. Unfortunately, gnuplot does not have this feature, so that we could not use it in this paper.

• Choose the x-range and the density of x-values appropriately.

Sometimes we need so many curves that they cannot fit into one figure, for example when the cross-product of several parameter ranges defines the set of curves needed. Then we may finally decide to use several figures. In this case, the same y-ranges should usually be chosen so that the results remain comparable. Also, one should choose the same point styles and line styles for related curves in different figures, e.g., for curves belonging to the same algorithm as for the “shortest queue” algorithm in Figure 11.2. Note that tools such as gnuplot cannot do that automatically.

The explanations of point and line styles should avoid cryptic abbreviations whenever possible and at the same time avoid overlapping the curves. Both requirements can be reconciled by placing the explanations appropriately. For example, in computer science, curves often go from the lower left corner to the upper right corner. In that case, the best place for the definition of line and point styles is the upper left corner.

Arranging Instances

If measurements like execution time for a small set of problem instances are to be displayed, a bar chart is an appropriate tool. If other parameters such as the algorithm used, or the time consumed by different parts of the algorithm, should be differentiated, the bars can be augmented to encode this. For example, several bars can be stacked in depth using three-dimensional effects, or different pieces of a bar can get different shadings.3

2 For example, one could use the order of the y-values at the largest x-value as in Figure 11.3.

3 Sophisticated fill styles give us additional opportunities for diversification, but Tufte notes that they are often too distracting [59].


If there are so many instances that bar charts consume too much space, a scatter plot can be useful. The x-axis stands for a parameter like problem size and we plot one point for every problem instance. Figure 11.4 gives a simple example. Point styles and colors can be used to differentiate different types of instances or variations of other parameters such as the algorithm used. Sometimes these points are falsely connected by lines. This should be avoided. It not only looks confusing but also wrongly suggests a relation between the data points that does not exist.


Figure 11.4: Each point gives the ratio between total problem size and “core” problem size in a fast algorithm for solving set covering problems from airline crew scheduling [43]. The larger this ratio, the larger the possible speedup for a new algorithm. The x-axis is the ratio between the number of variables and the number of constraints. This scale was chosen to show that there is a correlation between these two ratios that is helpful in understanding when the new algorithm is particularly useful. The deviating points at n/m = 10 are artificial problems rather different from typical crew scheduling problems. (See also Section 11.4.)

How to Connect Measurements

Tools such as gnuplot allow us to associate a measured value with a symbol like a cross or a star that clearly specifies that point and encodes some additional information about the measurement. For example, one will usually choose one point symbol for each displayed curve. Additionally, points belonging to the same curve can be connected by a straight line. Such lines should usually not be viewed as a claim that they present a good interpolation of the curve, but just as a visual aid to find points that belong together.


In this case, it is important that the points are large enough to stand out against the connecting lines. An alternative is to plot measurement points plus curves stemming from an analytic model as in Figure 11.5.

The situation is different if only lines and no points are plotted as in Figure 11.1. In this case, it is often impossible to tell which points have been measured. Hence such a lines-only plot implies the very strong claim that the points where we measured are irrelevant and the plotted curve is an accurate representation of the true behavior for the entire x-range. This only makes sense if very dense measurements have been performed and they indeed form a smooth line. Sometimes one sees smooth lines that are weighted averages over a neighborhood in the x-coordinates. Then one often uses very small points for the actual measurements, which form a cloud around this curve.

A related approach is connecting measured points with interpolated curves such as splines, which are smoother than lines. Such curves should only be used if we actually conjecture that the interpolation used is close to the truth.

Measurement Errors

Tools allow us to generalize measured points to ranges, which are usually a point plus an error bar specifying positive and negative deviations from the y-value.4 The main question from the point of view of designing graphs is what kind of deviations should be displayed, or how one can avoid the necessity for error bars entirely.

Let us start with the well-behaved case that we are simulating a randomized algorithm or work with randomly generated problem instances. In this case, the results from repeated runs are independent identically distributed random variables, and powerful methods from statistics can be invoked. For example, the point itself may be the average of the measured values and the error bar could be the standard deviation or the standard error [53]. Figure 11.5 gives an example. Note that the latter, less well known quantity is a better estimate for the difference between the average and the actual mean. By monitoring the standard error during the simulation, we can even repeat the measurement sufficiently often so that this error measure is below some prespecified value. In this case, no error bars are needed and it suffices to state the bound on the error in the caption of the graph. Figure 11.6 gives an example.
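A small sketch of this stopping rule (our code; measure() stands for one run of the randomized experiment and is hypothetical):

#include <cmath>
#include <vector>

double measure();  // one run of the experiment; provided elsewhere (hypothetical)

// Repeat the measurement until the standard error sigma/sqrt(r) of the
// mean drops below maxStdError; returns the average of all runs.
double meanWithBoundedStdError(double maxStdError, int minRuns = 10) {
    std::vector<double> x;
    for (;;) {
        x.push_back(measure());
        const int r = (int)x.size();
        if (r < minRuns) continue;            // avoid stopping on too few runs
        double mean = 0;
        for (double v : x) mean += v;
        mean /= r;
        double var = 0;                       // unbiased sample variance
        for (double v : x) var += (v - mean) * (v - mean);
        var /= (r - 1);
        if (std::sqrt(var / r) <= maxStdError) return mean;
    }
}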

The situation is more complicated for measurements of actual running times of deterministic algorithms, since this involves errors which are not of a statistical nature. Rather, the errors can stem from hidden variables such as operating system interrupts, which we cannot fully control. In this case, points and error bars based on order statistics might be more robust.

⁴Uncertainties in both x- and y-values can also be specified, but this case seems to be rare in Algorithmics.


[Plot: warmup iterations (0 to 20) versus n (32 to 32768, log scale); curves: Measurement, log n + log ln n + 1, log n.]

Figure 11.5: Number of iterations that the dynamic load balancing algorithm random polling spends in its warmup phase until all processors are busy. Hypothesized upper bound, lower bound, and measured averages with standard deviation [54, 56]. (See also Section 11.4.)

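To make the order-statistics variant concrete, here is a hedged C++ sketch (ours, not from the notes) that summarizes repeated wall-clock measurements of a deterministic computation by median, minimum, maximum, and 5 %/95 % quantiles; workload is a hypothetical placeholder.

#include <algorithm>
#include <chrono>
#include <iostream>
#include <vector>

static volatile long sink;                  // prevents optimizing the work away
void workload() { long s = 0; for (long i = 0; i < 1000000; ++i) s += i; sink = s; }

double timeOnceMs() {
    auto t0 = std::chrono::steady_clock::now();
    workload();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    const int runs = 100;                   // state this number in the caption
    std::vector<double> t(runs);
    for (auto& x : t) x = timeOnceMs();
    std::sort(t.begin(), t.end());
    auto q = [&](double p) { return t[static_cast<std::size_t>(p * (runs - 1))]; };
    std::cout << "median " << q(0.5) << " ms  (min " << t.front()
              << ", 5% " << q(0.05) << ", 95% " << q(0.95)
              << ", max " << t.back() << ")\n";
}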

11.5 Grids and Ticks

Tools for drawing graphs give us a lot of control over how axes are decorated with numbers, tick marks, and grid lines. The general rule, which is often achieved automatically, is to use a few round numbers on each axis and perhaps additional tick marks without numbers. The density of these numbers should not be too high. Not only should they appear well separated, but they should also be far from dominating the visual appearance of the graph. When a very large range of values is displayed, we sometimes have to force the system to use exponential notation on a part of the axis before numbers get too long. Figure 11.6 gives an example for the particularly important case of base-two scales. Sometimes we may decide that reading off values is so important in a particular graph that grid lines should be added, i.e., horizontal and vertical lines that span the entire range of the graph.


[Plot: max load − m/n (0 to 3) versus m (16 to 2^24, log scale); one curve each for n = 65536, 256, 16, 4.]

Figure 11.6: m balls are placed into n bins using balanced random allocation [44, 45]. The difference between maximal and average load is plotted for different values of m and n. The experiments have been repeated sufficiently often to reduce the standard error (σ/√repetitions [53]) below one percent. In order to minimize artifacts of the random number generator, we have used a generator with good reputation and very long period (2^19937 − 1) [49]. In addition, some experiments were repeated with the Unix generator srand48, leading to almost identical results. (See also Section 11.4.)

Care must be taken that such grid lines do not dilute the visual impression of the data points. Hence, grid lines should be avoided or at least made thin or, even better, light gray. Sometimes grid lines can be avoided by plotting the values corresponding to some particularly important data points also on the axes.

A principle behind many of the above considerations is called Data-Ink Maximization by Tufte [59]. In particular, one should reduce non-data ink and redundant data ink in the graph. The ratio of data ink to total ink used should be close to one. This principle also explains more obvious sins like pseudo-3D bar charts, complex fill styles, etc.

11.6 Three-dimensional Figures

At first glance, three-dimensional figures are attractive because they look sophisticated and promise to present large amounts of data in a compact way. However, there are many drawbacks.


• It is almost impossible to read absolute values from the two-dimensional projection of a function.

• In complicated functions interesting parts may be hidden from view.

• If several functions are to be compared, one is tempted to use a corresponding number of three-dimensional figures. But in this case, it is more difficult to interpret differences than in two-dimensional figures with cross-sections of all the functions.

It seems that three-dimensional figures only make sense if we want to present the general shape of a single function. Perhaps three-dimensional figures become more interesting using advanced interactive media where the user is free to choose viewpoints, read off precise values, view subsets of curves, etc.

11.7 The Caption

Graphs are usually put into "floating figures" which are placed by the text formatter so that page breaks are taken into account. These figures have a caption text at their bottom which makes the figure sufficiently self-contained. The caption explains what is displayed and how the measurements have been obtained. This includes the instances measured, the algorithms and their parameters, and, if relevant, the system configuration (hardware, compiler, ...). One should keep in mind that experiments in a scientific paper should be reproducible, i.e., the information available should suffice to repeat a similar experiment with similar results. Since the caption should not become too long, it usually contains explicit or implicit references to surrounding text, literature, or web resources.


11.8 A Check List

In the following we summarize the rules discussed above. The list can also serve as a check list one can refer to when preparing graphs and for teaching. The section numbers containing a more detailed discussion are appended in brackets. The order of the rules has been chosen so that in most cases they can be applied in the order given.

• Should the experimental setup from the exploratory phase be redesigned to increase conciseness or accuracy? (11.2)

• What parameters should be varied? What variables should be measured? How are parameters chosen that cannot be varied? (11.2)

• Can tables be converted into curves, bar charts, scatter plots or any other useful graphics? (11.3, 11.4)

• Should tables be added in an appendix or on a web page? (11.3)

• Should a 3D-plot be replaced by collections of 2D-curves? (11.6)

• Can we reduce the number of curves to be displayed? (11.4)

• How many figures are needed? (11.4)

• Scale the x-axis to make y-values independent of some parameters? (11.4)

• Should the x-axis have a logarithmic scale? If so, do the x-values used for measuring have the same basis as the tick marks? (11.4)

• Should the x-axis be transformed to magnify interesting subranges? (11.4)

• Is the range of x-values adequate? (11.4)

• Do we have measurements for the right x-values, i.e., nowhere too dense or too sparse? (11.4)

• Should the y-axis be transformed to make the interesting part of the data more visible? (11.4)

• Should the y-axis have a logarithmic scale? (11.4)

• Is it misleading to start the y-range at the smallest measured value? (11.4)

• Clip the range of y-values to exclude useless parts of curves? (11.4)

• Can we use banking to 45°? (11.4)

• Are all curves sufficiently well separated? (11.4)

• Can noise be reduced using more accurate measurements? (11.4)

• Are error bars needed? If so, what should they indicate? Remember that measurement errors are usually not random variables. (11.4)


• Use points to indicate for which x-values actual data is available. (11.4)

• Connect points belonging to the same curve. (11.4)

• Only use splines for connecting points if interpolation is sensible. (11.4)

• Do not connect points belonging to unrelated problem instances. (11.4)

• Use different point and line styles for different curves. (11.4)

• Use the same styles for corresponding curves in different graphs. (11.4)

• Place labels defining point and line styles in the right order and without concealing the curves. (11.4)

• Captions should make figures self-contained. (11.7)

• Give enough information to make experiments reproducible. (11.7)


Chapter 12

Appendix

12.1 Used machine models

In 1945, John von Neumann introduced a basic architecture of a computer. The design was very simple in order to make it possible to build it with the limited hardware technology of the time. Hardware design has evolved far beyond this in most aspects. However, the resulting programming model was so simple and powerful that it is still the basis for most programming. Usually it turns out that programs written with the model in mind also work well on the vastly more complex hardware of today's machines.

The variant of von Neumann's model we consider is the RAM (random access machine) model. The most important features of this model are that it is sequential, i.e., there is a single processing unit, and that it has a uniform memory, i.e., all memory accesses cost the same amount of time. The memory consists of cells S[0], S[1], S[2], . . . The ". . . " means that there are potentially infinitely many cells, although at any point in time only a finite number of them will be in use. We assume that "reasonable" functions of the input size n can be stored in a single cell. We should keep in mind, however, that our model allows us a limited form of parallelism: we can perform simple operations on log n bits in constant time.
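As a small aside (our illustration, not part of the notes), this limited parallelism is exactly what bit tricks exploit: one instruction on a 64-bit word acts on all its bits simultaneously. A classic example counts the set bits of a word with a handful of word operations instead of 64 single-bit steps:

#include <cstdint>
#include <iostream>

// SWAR population count: sums bits pairwise, then in nibbles, then in bytes.
std::uint64_t popcount64(std::uint64_t x) {
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
    return (x * 0x0101010101010101ULL) >> 56; // adds up the eight byte counts
}

int main() { std::cout << popcount64(0xF0F0F0F0F0F0F0F0ULL) << '\n'; } // prints 32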

The external memory model is like the RAM model except that the fast memory is limited in size to M words. Additionally, there is an external memory of unlimited size. There are special I/O operations that transfer B consecutive words between slow and fast memory. For example, the external memory could be a hard disk; M would then be the main memory size and B would be a block size that is a good compromise between low latency and high bandwidth. With current technology, M = 1 GByte and B = 1 MByte could be realistic values. One I/O step would then take around 10 ms, which is 10^7 clock cycles of a 1 GHz machine. With another setting of the parameters M and B, we could model the smaller access time differences between a hardware cache and main memory.
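For concreteness, the following hedged sketch (our own; the input size N is an assumption) plugs the parameter values above into the classic external-memory bounds: scanning costs about N/B I/Os, and multiway merge sort costs about 2(N/B) I/Os per pass with roughly 1 + ⌈log_{M/B}(N/M)⌉ passes.

#include <cmath>
#include <iostream>

int main() {
    // Values from the text, counted in 8-byte words: M = 1 GByte, B = 1 MByte.
    const double M = (1ull << 30) / 8.0;  // fast memory size in words
    const double B = (1ull << 20) / 8.0;  // block size in words
    const double N = 1e12;                // assumed input size in words (8 TByte)

    const double scanIOs = N / B;         // one pass over the data
    const double passes = 1 + std::ceil(std::log(N / M) / std::log(M / B));
    const double sortIOs = 2 * passes * (N / B);   // read + write per pass

    std::cout << "scan: " << scanIOs << " I/Os\n"
              << "sort: " << sortIOs << " I/Os in " << passes << " passes\n"
              << "at 10 ms per I/O, one pass takes " << 2 * (N / B) * 0.01 << " s\n";
}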


12.2 Amortized Analysis for Unbounded Arrays

Our implementation of unbounded arrays follows the algorithm design principle "make the common case fast". Array access with [·] is as fast as for bounded arrays. Intuitively, pushBack and popBack should "usually" be fast: we just have to update n. However, a single insertion into a large array might incur a cost of n. We now show that such a situation cannot happen for our implementation. Although some isolated procedure calls might be expensive, they are always rare, regardless of what sequence of operations we execute.

Lemma 18. Consider an unbounded array u that is initially empty. Any sequence σ = 〈σ_1, . . . , σ_m〉 of pushBack or popBack operations on u is executed in time O(m).

Corollary 19. Unbounded arrays implement the operation [·] in worst-case constant time and the operations pushBack and popBack in amortized constant time.

To prove Lemma 18, we use the accounting method. Most of us have already used this approach because it is the basic idea behind an insurance. For example, when you rent a car, in most cases you also have to buy an insurance that covers the ruinous costs you could incur by causing an accident. Similarly, we force all calls to pushBack and popBack to buy an insurance against a possible call of reallocate. The cost of the insurance is put on an account. If a reallocate should actually become necessary, the responsible call to pushBack or popBack does not need to pay but is allowed to use previous deposits on the insurance account. What remains to be shown is that the account will always be large enough to cover all possible costs.

Proof: Let m′ denote the total number of elements copied in calls of reallocate. The total cost incurred by the calls in the operation sequence σ is O(m + m′). Hence, it suffices to show that m′ = O(m). Our unit of cost is now the cost of one element copy.

We require an insurance of 3 units from each call of pushBack and claim that this suffices to pay for all calls of reallocate by both pushBack and popBack.

We prove by induction over the calls of reallocate that immediately after the call there are at least n units left on the insurance account.

First call of reallocate: The first call grows w from 1 to 2 after at least one call of pushBack. We have n = 1 and 3 − 1 = 2 > 1 units left on the insurance account.

For the induction step we prove that 2n units are on the account immediately before the current call to reallocate. Only n elements are copied, leaving n units on the account, enough to maintain our invariant. The two cases in which reallocate may be called are analyzed separately.


pushBack grows the array: The number of elements n has doubled since the last reallocate, when at least n/2 units were left on the account by the induction hypothesis (this holds regardless of the type of operation that caused the reallocate). The n/2 new elements paid 3n/2 units, giving a total of 2n units for insurance.

popBack shrinks the array: The number of elements has halved since the last reallocate, when at least 2n units were left on the account by the induction hypothesis. Since then, n/2 elements have been removed and n/2 elements have to be copied. After paying for the current reallocate, 2n − n/2 = 3n/2 > 2(n/2) units are left on the account.
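A minimal C++ sketch of such an unbounded array (our illustration; the notes describe the data structure only abstractly here): capacity doubles when the array is full and halves when only a quarter is used, so that every expensive reallocate is prepaid by the preceding cheap operations.

#include <algorithm>
#include <cassert>
#include <cstddef>

template <class T>
class UArray {
    T* b = new T[1];   // backing storage
    std::size_t w = 1; // capacity
    std::size_t n = 0; // number of used entries
    void reallocate(std::size_t w2) { // the expensive operation: copies n elements
        T* b2 = new T[w2];
        std::copy(b, b + n, b2);
        delete[] b; b = b2; w = w2;
    }
public:
    ~UArray() { delete[] b; }
    T& operator[](std::size_t i) { assert(i < n); return b[i]; }
    std::size_t size() const { return n; }
    void pushBack(const T& e) {
        if (n == w) reallocate(2 * w);    // paid for by earlier pushBacks
        b[n++] = e;
    }
    void popBack() {
        assert(n > 0); --n;
        if (w > 1 && 4 * n <= w) reallocate(w / 2); // shrink lazily
    }
};

int main() {
    UArray<int> a;
    for (int i = 0; i < 1000; ++i) a.pushBack(i);
    for (int i = 0; i < 900; ++i) a.popBack();
    return a[99] == 99 ? 0 : 1;
}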

12.3 Analysis of Randomized Quicksort

To analyze the running time of quicksort for an input sequence s = 〈e_1, . . . , e_n〉 we focus on the number of element comparisons performed. Other operations contribute only constant factors and small additive terms to the execution time.

Let C(n) denote the worst-case number of comparisons needed for any input sequence of size n and any choice of random pivots. The worst-case performance is easily determined. Lines (A), (B), and (C) in Figure 3.1 can be implemented in such a way that all elements except for the pivot are compared with the pivot once (we allow three-way comparisons here, with possible outcomes 'smaller', 'equal', and 'larger'). This makes n − 1 comparisons. Assume there are k elements smaller than the pivot and k′ elements larger than the pivot. We get C(0) = C(1) = 0 and

C(n) = n − 1 + max{C(k) + C(k′) : 0 ≤ k ≤ n − 1, 0 ≤ k′ < n − k}.

By induction it is easy to verify that

C(n) = n(n − 1)/2 = Θ(n²).

The worst case occurs if all elements are different and we are always so unlucky as to pick the largest or smallest element as the pivot.

The expected performance is much better.

Theorem 20. The expected number of comparisons performed by quicksort is C̄(n) ≤ 2n ln n ≤ 1.4n log n.

We concentrate on the case that all elements are different. Other cases are easier because a pivot that occurs several times results in a larger middle sequence b that need not be processed any further.


Let s′ = 〈e′_1, . . . , e′_n〉 denote the elements of the input sequence in sorted order. The elements e′_i and e′_j are compared at most once, and only if one of them is picked as a pivot. Hence, we can count comparisons by looking at the indicator random variables X_ij, i < j, where X_ij = 1 if e′_i and e′_j are compared and X_ij = 0 otherwise. We get

C̄(n) = E[∑_{i=1}^{n} ∑_{j=i+1}^{n} X_ij] = ∑_{i=1}^{n} ∑_{j=i+1}^{n} E[X_ij] = ∑_{i=1}^{n} ∑_{j=i+1}^{n} prob(X_ij = 1).

The middle transformation follows from the linearity of expectation. The last equation uses the definition of the expectation of an indicator random variable, E[X_ij] = prob(X_ij = 1). Before we can further simplify the expression for C̄(n), we need to determine this probability.

Lemma 21. For any i < j, prob(X_ij = 1) = 2/(j − i + 1).

Proof: Consider the (j − i + 1)-element set M = {e′_i, . . . , e′_j}. As long as no pivot from M is selected, e′_i and e′_j are not compared, but all elements from M are passed to the same recursive calls. Eventually, a pivot p from M is selected. Each element in M has the same chance 1/|M| of being selected. If p = e′_i or p = e′_j we have X_ij = 1. The probability for this event is 2/|M| = 2/(j − i + 1). Otherwise, e′_i and e′_j are passed to different recursive calls so that they will never be compared.

Now we can finish the proof of Theorem 20 using relatively simple calculations.

C̄(n) = ∑_{i=1}^{n} ∑_{j=i+1}^{n} prob(X_ij = 1) = ∑_{i=1}^{n} ∑_{j=i+1}^{n} 2/(j − i + 1) = ∑_{i=1}^{n} ∑_{k=2}^{n−i+1} 2/k
     ≤ ∑_{i=1}^{n} ∑_{k=2}^{n} 2/k = 2n ∑_{k=2}^{n} 1/k = 2n(H_n − 1) ≤ 2n(ln n + 1 − 1) = 2n ln n.

For the last steps, recall the definition of the harmonic number H_n := ∑_{k=1}^{n} 1/k and the bound H_n ≤ ln n + 1.
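To connect this analysis with the experimental theme of these notes, here is a small hedged sketch (our own) that counts the three-way comparisons of a random-pivot quicksort and prints them next to the 2n ln n bound.

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <iostream>
#include <numeric>
#include <random>
#include <vector>

static std::uint64_t comparisons = 0;
static std::mt19937_64 rng(123);

// Random-pivot quicksort with three-way partitioning; every loop iteration
// performs one three-way comparison against the pivot.
void quicksort(std::vector<int>& a, int lo, int hi) {
    if (hi <= lo) return;
    std::uniform_int_distribution<int> d(lo, hi);
    std::swap(a[lo], a[d(rng)]);          // move a random pivot to the front
    const int p = a[lo];
    int i = lo, k = lo, j = hi;           // invariant: [lo,i) < p, [i,k) = p, (j,hi] > p
    while (k <= j) {
        ++comparisons;
        if (a[k] < p) std::swap(a[i++], a[k++]);
        else if (a[k] > p) std::swap(a[k], a[j--]);
        else ++k;
    }
    quicksort(a, lo, i - 1);
    quicksort(a, j + 1, hi);
}

int main() {
    for (int n : {1 << 10, 1 << 14, 1 << 18}) {
        std::vector<int> a(n);
        std::iota(a.begin(), a.end(), 0);
        std::shuffle(a.begin(), a.end(), rng);
        comparisons = 0;
        quicksort(a, 0, n - 1);
        std::cout << "n = " << n << ": measured " << comparisons
                  << ", bound 2n ln n = " << 2.0 * n * std::log(n) << '\n';
    }
}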

12.4 Insertion Sort

Insertion sort maintains the invariant that the output sequence is always sorted: it picks an arbitrary element of the input sequence and takes care to insert this element at the right place in the output sequence. Figure 12.1 gives an in-place array implementation of insertion sort. This implementation is straightforward except for a small trick that allows the inner loop to use only a single comparison. When the element e to be inserted is smaller than all previously inserted elements, it can be inserted at the beginning without further tests. Otherwise, it suffices to scan the sorted part of a from right to left while e is smaller than the current element. This process has to stop because a[1] ≤ e. Insertion sort has a worst-case running time of Θ(n²) but is nevertheless a fast algorithm for small n.

Procedure insertionSort(a : Array [1..n] of Element)
    for i := 2 to n do
        invariant a[1] ≤ · · · ≤ a[i − 1]        // a[1..i−1] sorted, a[i..n] unsorted
        e := a[i]                                 // move a[i] to the right place
        if e < a[1] then                          // new minimum
            for j := i downto 2 do a[j] := a[j − 1]
            a[1] := e
        else                                      // use a[1] as a sentinel
            for j := i downto −∞ while a[j − 1] > e do a[j] := a[j − 1]
            a[j] := e

Figure 12.1: Insertion sort
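A compilable C++ rendering of Figure 12.1 (our sketch; indices are 0-based, so the sentinel sits in a[0]):

#include <iostream>
#include <vector>

// Insertion sort with the sentinel trick: after handling a new minimum
// separately, a[0] <= e guarantees that the right-to-left scan terminates,
// so the inner loop needs only a single comparison per step.
void insertionSort(std::vector<int>& a) {
    for (std::size_t i = 1; i < a.size(); ++i) {
        const int e = a[i];
        if (e < a[0]) {                  // new minimum: shift everything right
            for (std::size_t j = i; j > 0; --j) a[j] = a[j - 1];
            a[0] = e;
        } else {                         // scan right to left, a[0] is the sentinel
            std::size_t j = i;
            while (a[j - 1] > e) { a[j] = a[j - 1]; --j; }
            a[j] = e;
        }
    }
}

int main() {
    std::vector<int> v{5, 2, 9, 1, 7};
    insertionSort(v);
    for (int x : v) std::cout << x << ' '; // prints: 1 2 5 7 9
}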

12.5 Lemma on Interval Maxima

Lemma 22. Consider an MST T = ({0, . . . , n − 1}, E_T) where the Jarník–Prim (JP) algorithm adds the nodes to the tree in the order 0, . . . , n − 1. Let e_i, 0 < i < n, denote the edge used by the JP algorithm to add node i to the tree, and let w_i denote the weight of e_i. Then, for all nodes u < v, the heaviest edge on the path from u to v in T has weight max_{u<j≤v} w_j.

Proof: By induction over v. The claim is trivially true for v = 1. For the induction step we assume that the claim is true for all pairs of nodes (u, v′) with u < v′ < v and show that it is also true for the pair (u, v). First note that e_v is on the path from u to v, because in the JP algorithm u is inserted before v and v is an isolated node until e_v is added to the tree. Let v′ < v denote the node at the other end of edge e_v. Edge e_v is heavier than all the edges e_{v′+1}, . . . , e_{v−1}, because otherwise the JP algorithm would have added v, using e_v, earlier. There are two cases to consider (see Figure 12.2).


[Figure: two small example trees with edge weights 1, 3, 4, 5, 8, illustrating Case 1 (v′ < u) and Case 2 (v′ > u).]

Figure 12.2: Illustration of the two cases of Lemma 22. The JP algorithm adds the nodes from left to right.

Case v′ ≤ u: By the induction hypothesis, the heaviest edge on the path from v′ to u has weight max_{v′<j≤u} w_j. Since all these edges are lighter than e_v, the maximum over w_{u+1}, . . . , w_v finds the correct answer w_v.

Case v′ > u: By the induction hypothesis, the heaviest edge on the path between u and v′ has weight max_{u<j≤v′} w_j. Hence, the heaviest edge we are looking for has weight max{w_v, max_{u<j≤v′} w_j}. Maximizing over the larger set, i.e., taking max_{u<j≤v} w_j, returns the right answer since e_v is heavier than the edges e_{v′+1}, . . . , e_{v−1}.

Lemma 22 also holds when we have the MSF of an unconnected graph rather than the MST of a connected graph. When the JP algorithm starts to span a new connected component, it selects an arbitrary node i and adds it to the MSF with w_i = ∞. Then the interval maximum for two nodes that lie in two different components is ∞, as it should be.

12.6 Random Permutations without additional I/Os

For renaming nodes, we need a (pseudo)random permutation π : 0..n − 1 → 0..n − 1. Assume for now that n is a square, so that we can represent a node i as a pair (a, b) with i = a + b√n. Our permutations are constructed from Feistel permutations, i.e., permutations of the form π_f((a, b)) = (b, (a + f(b)) mod √n) for some random mapping f : 0..√n − 1 → 0..√n − 1. Since √n is small, we can afford to implement f using a lookup table filled with random elements. For example, for n = 2^32 the lookup table for f would require only 128 KByte. It is known that a permutation π(x) = π_f(π_g(π_h(π_l(x)))) built by chaining four Feistel permutations is "pseudorandom" in a sense useful for cryptography. The same holds if the innermost and outermost permutations are replaced by even simpler permutations. In our implementation we use just two stages of Feistel permutations. It is an interesting question what provable performance guarantees for the sweep algorithm or other algorithmic problems can be given for such permutations.

A permutation π′ on 0..⌈√n⌉² − 1 can be transformed into a permutation π on 0..n − 1 by iteratively applying π′ until a value below n is obtained. Since π′ is a permutation, this process must eventually terminate. If π′ is random, the expected number of iterations is close to 1 and it is unlikely that more than three iterations are necessary for any input.
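A small C++ sketch of the whole construction (our illustration; all parameter choices are assumptions): two table-based Feistel stages give a permutation on 0..s² − 1 with s = ⌈√n⌉, and cycle walking restricts it to 0..n − 1.

#include <cstdint>
#include <iostream>
#include <random>
#include <vector>

class FeistelPerm {
    std::uint64_t n, s;              // domain 0..n-1, s = ceil(sqrt(n))
    std::vector<std::uint64_t> f, g; // random lookup tables 0..s-1 -> 0..s-1
public:
    explicit FeistelPerm(std::uint64_t n, std::uint64_t seed = 1) : n(n) {
        s = 1; while (s * s < n) ++s;
        std::mt19937_64 rng(seed);
        std::uniform_int_distribution<std::uint64_t> d(0, s - 1);
        f.resize(s); g.resize(s);
        for (auto& v : f) v = d(rng);
        for (auto& v : g) v = d(rng);
    }
    std::uint64_t operator()(std::uint64_t i) const { // requires i < n
        do {                         // cycle walking: repeat until the value is < n
            std::uint64_t a = i % s, b = i / s;
            std::uint64_t a1 = b, b1 = (a + f[b]) % s;    // first Feistel stage
            std::uint64_t a2 = b1, b2 = (a1 + g[b1]) % s; // second Feistel stage
            i = a2 + b2 * s;
        } while (i >= n);
        return i;
    }
};

int main() {
    FeistelPerm pi(10);              // tiny domain for demonstration
    for (std::uint64_t i = 0; i < 10; ++i) std::cout << pi(i) << ' ';
    std::cout << '\n';               // prints a permutation of 0..9
}

Each Feistel stage is invertible (given output (b, c), the input satisfies a = (c − f(b) + s) mod s), so their composition is a permutation of 0..s² − 1, and cycle walking therefore yields a permutation of 0..n − 1.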

12.7 Proof of Discarding Theorem for Suffix Array Construction

Proof: We prove the theorem by showing that the total amount of data in the different steps of the algorithm over the whole execution is as in the data flow graph in Figure 10.3. The nontrivial points are that at most N = n log dps tuples are processed in each sorting step over the whole execution and that at most n tuples are written to P. The former follows from the fact that a suffix i is involved in the sorting steps as long as it has a non-unique rank, which happens in exactly ⌈log(1 + dps(i))⌉ iterations. To show the latter, we note that a tuple (c, i) is written to P in iteration k only if the previous tuple (c′, i − 2^k) was not unique. That previous tuple will become unique in the next iteration, because it is represented by ((c′, c), i − 2^k) in S. Since each tuple turns unique only once, the total number of tuples written to P is at most n.

12.8 Pseudocode for the Discarding Algorithm

Function name2(S : Sequence of Pair)
    q := q′ := 0; (ℓ, ℓ′) := ($, $)
    result := 〈〉
    foreach ((c, c′), i) ∈ S do
        if c ≠ ℓ then q := q′ := 0; (ℓ, ℓ′) := (c, c′)
        else if c′ ≠ ℓ′ then q′ := q; ℓ′ := c′
        append (c + q′, i) to result
        q++
    return result

Figure 12.3: The alternative naming procedure.


Function doubling+discarding(T)
    S := 〈((T[i], T[i+1]), i) : i ∈ [0, n)〉                      1
    sort S                                                        2
    U := name(S)                        // undiscarded            3
    P := 〈〉                            // partially discarded
    F := 〈〉                            // fully discarded
    for k := 1 to ⌈log n⌉ do
        mark unique names in U                                    4
        sort U by (i mod 2^k, i div 2^k)                          5
        merge P into U; P := 〈〉                                  6
        S := 〈〉; count := 0
        foreach (c, i) ∈ U do                                     7
            if c is unique then
                if count < 2 then append (c, i) to F
                else append (c, i) to P
                count := 0
            else
                let (c′, i′) be the next pair in U
                append ((c, c′), i) to S
                count++
        if S = ∅ then
            sort F by first component                             8
            return 〈i : (c, i) ∈ F〉                              9
        sort S                                                    10
        U := name2(S)                                             11

Figure 12.4: The doubling with discarding algorithm.


Bibliography

[1] K. Kaligosi and P. Sanders. How Branch Mispredictions Affect Quicksort. In 14th European Symposium on Algorithms (ESA), number 4168 in LNCS, pages 780–791, 2006.

[2] P. Sanders and S. Winkel. Super Scalar Sample Sort. In 12th European Symposium on Algorithms (ESA), number 3221 in LNCS, pages 784–796, 2004.

[3] R. Dementiev and P. Sanders. Asynchronous Parallel Disk Sorting. In 15th ACM Symposium on Parallelism in Algorithms and Architectures, pages 138–148, San Diego, 2003.

[4] D. A. Hutchinson, P. Sanders, and J. S. Vitter. Duality Between Prefetching and Queued Writing with Parallel Disks. In 9th European Symposium on Algorithms (ESA), number 2161 in LNCS, pages 62–73, 2001.

[5] P. Sanders. Fast Priority Queues for Cached Memory. In ALENEX '99, Workshop on Algorithm Engineering and Experimentation, number 1619 in LNCS, pages 312–327, 1999.

[6] R. Dementiev. Algorithm Engineering for Large Data Sets. Dissertation, Universität des Saarlandes, 2006.

[7] P. Sanders and R. Dementiev. I/O-Efficient Algorithms and Data Structures. Slides for the Algorithm Engineering course at Universität Karlsruhe, 2007.

[8] L. Toma and N. Zeh. I/O-Efficient Algorithms for Sparse Graphs. In Algorithms for Memory Hierarchies, number 2625 in LNCS, pages 85–109, 2003.

[9] P. Kumar. Cache Oblivious Algorithms. In Algorithms for Memory Hierarchies, number 2625 in LNCS, pages 193–212, 2003.

[10] A. Maheshwari and N. Zeh. A Survey of Techniques for Designing I/O-Efficient Algorithms. In Algorithms for Memory Hierarchies, number 2625 in LNCS, pages 36–61, 2003.


[11] M. Frigo, C. E. Leiserson, H. Prokop, and S. Ramachandran. Cache-Oblivious Algorithms. In 40th Symposium on Foundations of Computer Science, pages 285–298, 1999.

[12] I. Katriel and U. Meyer. Elementary Graph Algorithms in External Memory. In Algorithms for Memory Hierarchies, number 2625 in LNCS, pages 62–84, 2003.

[13] D. Ajwani, U. Meyer, and V. Osipov. Improved external memory BFS implementations. In Workshop on Algorithm Engineering and Experiments (ALENEX 07), pages 3–12, New Orleans, USA, 2007.

[14] K. Munagala and A. Ranade. I/O-Complexity of Graph Algorithms. In Proc. 10th Ann. Symposium on Discrete Algorithms, pages 687–694. ACM-SIAM, 1999.

[15] K. Mehlhorn and U. Meyer. External-memory breadth-first search with sublinear I/O. In Proc. 10th Ann. European Symposium on Algorithms (ESA), volume 2461 of LNCS, pages 723–735. Springer, 2002.

[16] Y.-J. Chiang, M. T. Goodrich, E. F. Grove, R. Tamassia, D. E. Vengroff, and J. S. Vitter. External-memory graph algorithms. In Proceedings of the Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 139–149, 1995.

[17] D. Ajwani, R. Dementiev, and U. Meyer. A computational study of external-memory BFS algorithms. In SODA, pages 601–610, 2006.

[18] G. S. Brodal and R. Fagerberg. Cache oblivious distribution sweeping. In Proceedings of the 29th International Colloquium on Automata, Languages, and Programming, pages 426–438, Málaga, Spain, July 2002.

[19] E. D. Demaine. Cache-Oblivious Algorithms and Data Structures. In Lecture Notes from the EEF Summer School on Massive Data Sets, 2002.

[20] L. A. Arge, G. S. Brodal, and L. Toma. On external-memory MST, SSSP and multi-way planar graph separation. In Proc. 8th Scand. Workshop on Algorithmic Theory, volume 1851 of LNCS, pages 433–447. Springer, 2000.

[21] N. Zeh. I/O-Efficient Algorithms for Shortest Path Related Problems. PhD thesis, School of Computer Science, Carleton University, 2002.

[22] P. van Emde Boas. Preserving order in a forest in less than logarithmic time. Information Processing Letters, 6(3):80–82, 1977.

[23] R. Dementiev, L. Kettner, J. Mehnert, and P. Sanders. Engineering a Sorted List Data Structure for 32 Bit Keys. In Workshop on Algorithm Engineering & Experiments, pages 142–151, New Orleans, 2004.


[24] A. V. Goldberg and C. Harrelson. Computing the Shortest Path: A* Meets Graph Theory. In 16th ACM-SIAM Symposium on Discrete Algorithms, pages 156–165, 2005.

[25] J. Maue, P. Sanders, and D. Matijevic. Goal Directed Shortest Path Queries Using Precomputed Cluster Distances. In 5th Workshop on Experimental Algorithms (WEA), LNCS vol. 4007, pages 316–328, 2006.

[26] F. Schulz, D. Wagner, and K. Weihe. Dijkstra's Algorithm On-Line: An Empirical Case Study from Public Railroad Transport. In 3rd Workshop on Algorithm Engineering, LNCS vol. 1668, pages 110–123, 1999.

[27] D. Wagner and T. Willhalm. Geometric Speed-Up Techniques for Finding Shortest Paths in Large Sparse Graphs. In 11th European Symposium on Algorithms, LNCS vol. 2832, pages 776–787, 2003.

[28] R. H. Möhring, H. Schilling, B. Schütz, D. Wagner, and T. Willhalm. Partitioning Graphs to Speed Up Dijkstra's Algorithm. In 4th International Workshop on Efficient and Experimental Algorithms, pages 189–202, 2005.

[29] E. Köhler, R. H. Möhring, and H. Schilling. Acceleration of Shortest Path and Constrained Shortest Path Computation. In 4th International Workshop on Efficient and Experimental Algorithms, 2005.

[30] P. Sanders and D. Schultes. Engineering Fast Route Planning Algorithms. In 6th Workshop on Experimental Algorithms (WEA), LNCS vol. 4525, pages 23–36, 2007.

[31] P. Sanders and D. Schultes. Highway Hierarchies Hasten Exact Shortest Path Queries. In 13th European Symposium on Algorithms (ESA), LNCS vol. 3669, pages 568–597, 2005.

[32] P. Sanders and D. Schultes. Engineering Highway Hierarchies. In 14th European Symposium on Algorithms (ESA), LNCS vol. 4168, pages 804–816, 2006.

[33] P. Sanders and D. Schultes. Robust, Almost Constant Time Shortest-Path Queries in Road Networks. In 9th DIMACS Challenge on Shortest Paths, 2007.

[34] P. Sanders and D. Schultes. Dynamic Highway-Node Routing. In 6th Workshop on Experimental Algorithms (WEA), LNCS vol. 4525, pages 66–79, 2007.

[35] R. Gutman. Reach-based Routing: A New Approach to Shortest Path Algorithms Optimized for Road Networks. In 6th Workshop on Algorithm Engineering and Experiments, pages 100–111, 2004.


[36] D. Delling and D. Wagner. Landmark-Based Routing in Dynamic Graphs. In 6th Workshop on Experimental Algorithms, 2007.

[37] I. Katriel, P. Sanders, and J. L. Träff. A Practical Minimum Spanning Tree Algorithm Using the Cycle Property. In 11th European Symposium on Algorithms (ESA), LNCS vol. 2832, pages 679–690, 2003.

[38] R. Dementiev, P. Sanders, D. Schultes, and J. Sibeyn. Engineering an External Memory Minimum Spanning Tree Algorithm. In 3rd IFIP International Conference on Theoretical Computer Science (TCS 2004), pages 195–208, 2004.

[39] J. Kärkkäinen and P. Sanders. Sorting Strings and Suffixes. Slides, 2003.

[40] J. Bentley and R. Sedgewick. Fast Algorithms for Sorting and Searching Strings. In SODA: ACM-SIAM Symposium on Discrete Algorithms, 1997.

[41] J. Kärkkäinen and P. Sanders. Simple Linear Work Suffix Array Construction. In 30th International Colloquium on Automata, Languages and Programming, LNCS vol. 2719, pages 943–955, 2003.

[42] R. Dementiev, J. Mehnert, J. Kärkkäinen, and P. Sanders. Better External Memory Suffix Array Construction. In Workshop on Algorithm Engineering & Experiments (ALENEX05), pages 86–97, 2005.

[43] P. Alefragis, P. Sanders, T. Takkula, and D. Wedelin. Parallel integer optimization for crew scheduling. Annals of Operations Research, 99(1):141–166, 2000.

[44] Y. Azar, A. Z. Broder, A. R. Karlin, and E. Upfal. Balanced allocations. SIAM Journal on Computing, 29(1):180–200, February 2000.

[45] P. Berenbrink, A. Czumaj, A. Steger, and B. Vöcking. Balanced allocations: The heavily loaded case. In 32nd Annual ACM Symposium on Theory of Computing, 2000.

[46] J. M. Chambers, W. S. Cleveland, B. Kleiner, and P. A. Tukey. Graphical Methods for Data Analysis. Duxbury Press, Boston, 1983.

[47] W. S. Cleveland. Elements of Graphing Data. Wadsworth, Monterey, CA, 2nd edition, 1994.

[48] D. S. Johnson. A theoretician's guide to the experimental analysis of algorithms. In M. Goldwasser, D. S. Johnson, and C. C. McGeoch, editors, Proceedings of the 5th and 6th DIMACS Implementation Challenges. American Mathematical Society, 2002.


[49] M. Matsumoto and T. Nishimura. Mersenne twister: A 623-dimensionally equidistributed uniform pseudo-random number generator. ACM Transactions on Modeling and Computer Simulation, 8:3–30, 1998. http://www.math.keio.ac.jp/~matumoto/emt.html.

[50] C. C. McGeoch, D. Precup, and P. R. Cohen. How to find big-oh in your data set (and how not to). In Advances in Intelligent Data Analysis, number 1280 in LNCS, pages 41–52, 1997.

[51] C. C. McGeoch and B. M. E. Moret. How to present a paper on experimental work with algorithms. SIGACT News, 30(4):85–90, 1999.

[52] B. M. E. Moret. Towards a discipline of experimental algorithmics. In 5th DIMACS Challenge, DIMACS Monograph Series, 2000. To appear.

[53] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes in C. Cambridge University Press, 2nd edition, 1992.

[54] P. Sanders. Lastverteilungsalgorithmen für parallele Tiefensuche. Number 463 in Fortschrittsberichte, Reihe 10. VDI Verlag, 1997.

[55] P. Sanders. Asynchronous scheduling of redundant disk arrays. In 12th ACM Symposium on Parallel Algorithms and Architectures, pages 89–98, 2000.

[56] P. Sanders and R. Fleischer. Asymptotic complexity from experiments? A case study for randomized algorithms. In Workshop on Algorithm Engineering, number 1982 in LNCS, pages 135–146, 2000.

[57] P. Sanders and J. Sibeyn. A bandwidth latency tradeoff for broadcast and reduction. In 6th Euro-Par, number 1900 in LNCS, pages 918–926, 2000.

[58] P. Sanders. Fast priority queues for cached memory. ACM Journal of Experimental Algorithmics, 5, 2000.

[59] E. R. Tufte. The Visual Display of Quantitative Information. Graphics Press, Cheshire, Connecticut, 1983.
