advanced algorithms and data structureselomaa/teach/aads-2015-1.pdf · • we might consider...
TRANSCRIPT
8/27/2015
1
Advanced Algorithmsand Data Structures
Prof. Tapio [email protected]
Course Prerequisites
• A seven credit unit course• Replaced OHJ-2156 Analysis of Algorithms• We take things a bit further than OHJ-2156
• We will assume familiarity with– Necessary mathematics– Elementary data structures– Programming
27-Aug-15MAT-72006 AADS, Fall 2015 2
8/27/2015
2
Course Basics
• There will be 4 hours of lectures per week• Weekly exercises start in two weeks time
• We will not have a programming exercise thisyear (unless you demand to have one)
• We might consider organizing a seminar withvoluntary presentations (yielding extra points)at the end of the course
27-Aug-15MAT-72006 AADS, Fall 2015 3
Organization & Timetable
• Lectures: Prof. Tapio Elomaa– Tue & Thu 2–4 PM in TB223 (& TB219 in PII)– Aug. 25 – Dec. 3, 2015
• Period break Oct. 12–18, 2015
• Exercises: M.Sc. Juho Lauri & M.Sc. TeemuHeinimäki,
Mon 10–12 TC315, Start: Sept. 9• Exam: Fri Dec. 18, 2015
27-Aug-15MAT-72006 AADS, Fall 2015 4
8/27/2015
3
Guest Lecture
• Dr. Timo Aho from Solita• In October
– Big data– Definitions of data analytics– Methods– Case studies
27-Aug-15MAT-72006 AADS, Fall 2015 5
Course Grading
27-Aug-15MAT-72006 AADS, Fall 2015 6
• Exam: Maximum of 30 points• Weekly exercises yield extra points
• 40% of questions answered: 1 point• 80% answered: 6 points• In between: linear scale (so that
decimals are possible)• Final grading depends on what we
agree as course components
8/27/2015
4
Material
• The textbook of the course is– Cormen, Leiserson, Rivest, Stein: Introduction
to Algorithms, 3rd ed., MIT Press, 2009• There is no prepared material, the slides
appear in the web as the lectures proceed– http://www.cs.tut.fi/~elomaa/teach/72006.html
• The exam is based on the lectures (i.e., noton the slides only)
27-Aug-15MAT-72006 AADS, Fall 2015 7
Content (Plan)
I FoundationsII (Sorting and) Order StatisticsIII Data StructuresIV Advanced Design and Analysis
TechniquesV Advanced Data StructuresVI Graph AlgorithmsVII Selected Topics
27-Aug-15MAT-72006 AADS, Fall 2015 8
8/27/2015
5
I Foundations
The Role of Algorithms in ComputingGetting Started
Growth of FunctionsRecurrences
Probabilistic Analysis andRandomized Algorithms
II (Sorting and) OrderStatistics
HeapsortQuicksort
Sorting in Linear TimeMedians and Order Statistics
8/27/2015
6
III Data Structures
Elementary Data StructuresHash Tables
Binary Search TreesRed-Black Trees
IV Advanced Design andAnalysis Techniques
Dynamic ProgrammingGreedy AlgorithmsAmortized Analysis
8/27/2015
7
V Advanced DataStructures
B-TreesBinomial Heaps
Fibonacci Heaps
VI Graph Algorithms
MAT-62756 GRAPH THEORY
Elementary Graph AlgorithmsMinimum Spanning Trees
Single-Source Shortest PathsAll-Pairs Shortest Paths
Maximum Flow
8/27/2015
8
VII Selected Topics
Matrix OperationsLinear Programming
Number-Theoretic AlgorithmsApproximation Algorithms
Part of these could also bestudent presentation topics
Online ChiMerge
• In machine learning and data mining algorithms oneoften needs to discretize numerical attributes
• There are initial intervals that cannot be divided(examples that have the same value for the attributein question)
• CHIMERGE is intended as a preprocessing techniquefor learning algorithms
• It combines adjacent initial intervals bottom-up tocome up with the final discretization of the valuerange of the numerical attribute in question
27-Aug-15MAT-72006 AADS, Fall 2015 16
8/27/2015
9
27-Aug-15MAT-72006 AADS, Fall 2015 17
Combine? Combine? Combine?
Combine? Combine?
Combine?
Combine? Combine?
Combine?
• The worst-case time complexity of the batchalgorithm is lg )
• The main requirement for the online versionof CHIMERGE (OCM) is a processing time thatis, per each received training example,logarithmic in the number of intervals
• Thus, we altogether match the lg ) timerequirement of the batch version ofCHIMERGE
27-Aug-15MAT-72006 AADS, Fall 2015 18
8/27/2015
10
• If each new example drawn fromDATASOURCE would update the currentdiscretization immediately, this would yield aslowing down of OCM by a factor ascompared to the batch version
• Therefore, we update the active discretizationonly periodically, freezing it for theintermediate period
• Examples received during this time can behandled at a logarithmic time
• The price to be paid is maintaining someother data structures in addition to the BST
27-Aug-15MAT-72006 AADS, Fall 2015 19
• OCM uses a balanced binary search tree intowhich each example drawn from DATASOURCE is(eventually) submitted
• The algorithm continuously takes in newexamples from DATASOURCE, but instead of juststoring them to , it updates the required datastructures simultaneously
• We do not halt the ow of the data streambecause of extra processing of the datastructures, but amortize the required work tonormal example handling
• BST plus a queue, two doubly linked lists, anda priority queue (heap) are maintained
27-Aug-15MAT-72006 AADS, Fall 2015 20
8/27/2015
11
27-Aug-15MAT-72006 AADS, Fall 2015 21
• The discretization process works in three phases• In Phase 1 (of length time steps), the
examples from DATASOURCE are cached fortemporary storage to the queue
• Meanwhile the initial intervals are collected oneat a time from to a list that holds theintervals in their numerical order
• For each pair of two adjacent intervals, the -value is computed and inserted into the priorityqueue
• The queue is needed to freeze for theduration of the Phase 1
27-Aug-15MAT-72006 AADS, Fall 2015 22
8/27/2015
12
Procedure OCM )Input: Integer giving the number of exampleson which initial discretization is based on and real
which is the threshold for interval mergingData structures: A BST , a queue , two doublylinked lists and , and a priority queue1. Initialize , , , and as empty;2. Read training examples into tree ;3. phase 1; TR E E MIN IM U M );4. Make an empty priority queue of size 1;5. while DA T A SO U R C E () nil do
27-Aug-15MAT-72006 AADS, Fall 2015 23
6. if phase = 1 then EN Q U E U E ( , )7. else TR E E IN S E R T fi;8. if phase = 1 then - - - - - - - - - - - - - - - - - - - - - -9. Insert a copy of as the last item in ;10. if | > 1 then11. Compute the -value of the two last intervals in
;12. Using the obtained value as a key, add a
pointer to the two intervals in into ;13. fi14. TR E E SU C C E S S O R );15. if = nil then phase 2 fi
27-Aug-15MAT-72006 AADS, Fall 2015 24
8/27/2015
13
• After seeing examples, all the initial intervalsin have been inserted to and all the -values have been computed and added to
• At this point holds (unprocessed) examples• At the end of Phase 1 all required preparations
for interval merging have been done• Phase 2 implements the merging as in batch
CHIMERGE, but without stopping the processingof DATASOURCE
• As a result of each merging, the intervals inas well as the related -values in are updated
27-Aug-15MAT-72006 AADS, Fall 2015 25
• In Phase 2 the example received fromDATASOURCE is submitted directly to along withanother, previously received example that iscached in
• The lowest -value, , corresponding to someadjacent pair of intervals and , is drawn from
• If it exceeds the merging threshold, Phase 2is complete
• Otherwise, the intervals and are merged toobtain a new interval , and the -values onboth sides of are updated into
27-Aug-15MAT-72006 AADS, Fall 2015 26
8/27/2015
14
16. else if phase = 2 then - - - - - - - - - - - -- - - - - - -17. DE Q U E U E ); TR E E IN S E R T );18. , EX T R A C T MIN );19. if nil and then20. Remove from P the D-values of I and J;21. LIS T -DE L E T E ); LIS T DE L E T E );22. ME R G E );23. Compute the -values of with its neighbors in
;24. Update the -values into priority queue ;25. else26. phase 3; fi
27-Aug-15MAT-72006 AADS, Fall 2015 27
27. else - - - - - - - - - - - - - Phase = 3 - - - - - - - - - - - -28. DE Q U E U E ); TR E E -IN S E R T );29. LIS T -DE L E T E , HE A D );30. LIS T -IN S E R T );31. if EM P T Y ) then32. phase 1; TR E E -MIN IM U M );33. Make an empty priority queue, size 1;34. fi35. fi36. od
27-Aug-15MAT-72006 AADS, Fall 2015 28
8/27/2015
15
Time Complexity of OCM• The insertion of the received example to
clearly takes constant time, as well as theinsertion of an interval from to
• Also computing the -value can be considered aconstant-time operation
• Inserting the obtained value to requires time(lg )
• Finding a successor in also takes time (lg )• Thus, the total time consumption of one iteration
of Phase 1 is (lg )27-Aug-15MAT-72006 AADS, Fall 2015 29
• Insertion of the new example and one from the queueto the BST both take time (lg )
• Extracting the lowest -value from the priority queue isalso an lg time operation
• As contains pointers to the items holding intervalsand in , removing them can be implemented inconstant time
• Merging of and to obtain can be considered aconstant-time operation as well as computing the new -values for and its neighbors in
• Updating each of the new -values into takes lgtime
• Thus, the total time consumption of one iteration ofPhase 2 is also lg
27-Aug-15MAT-72006 AADS, Fall 2015 30
8/27/2015
16
The sorting problem
• Input: A sequence of numbers, … ,
• Output: A permutation (reordering), … ,
of the input sequence such that
• The numbers that we wish to sort are alsoknown as keys
27-Aug-15MAT-72006 AADS, Fall 2015 31
INSERTION-SORT )1. for 2 to2.3. // Insert ] into the sorted sequence [1. . – 1]4.5. while > 0 and ] >6. + 1] ]7. 18. [ + 1]
27-Aug-15MAT-72006 AADS, Fall 2015 32
8/27/2015
17
5 2 4 6 1 3
27-Aug-15MAT-72006 AADS, Fall 2015 33
2 5 4 6 1 3
2 4 5 6 1 3 2 4 5 6 1 3
1 2 4 5 6 3 1 2 3 4 5 6
Correctness of the Algorithm
• The following loop invariant helps usunderstand why the algorithm is correct:
At the start of each iteration of the forloop of lines 1–8, the subarray [1. . – 1]consists of the elements originally in
[1. . – 1] but in sorted order
27-Aug-15MAT-72006 AADS, Fall 2015 34
8/27/2015
18
Initialization
• The loop invariant holds before the first loopiteration, when = 2:– The subarray, therefore, consists of just the
single element [1]– It is in fact the original element in [1]– This subarray is trivially sorted– Therefore, the loop invariant holds prior to
the first iteration of the loop
27-Aug-15MAT-72006 AADS, Fall 2015 35
Maintenance• Each iteration maintains the loop invariant:
– The body of the for loop works by moving– 1], – 2], – 3], … by one position to
the right until the proper position for isfound (lines 4–7)
– At this point the value of ] is inserted (line8)
– The subarray [1. . ] then consists of theelements originally in [1. . ], but in sortedorder
27-Aug-15MAT-72006 AADS, Fall 2015 36
8/27/2015
19
Termination
• The condition causing the for loop toterminate is that
• Because each loop iteration increases by 1,we must have + 1 at that time
• Substituting + 1 for j in the wording of loopinvariant, we have that the subarray [1. . ]consists of the elements originally in [1. . ],but in sorted order
• [1. . ] is the entire array27-Aug-15MAT-72006 AADS, Fall 2015 37
Analysis of insertion sort
• The time taken by the INSERTION-SORTdepends on the input:– sorting a thousand numbers takes longer than
sorting three numbers• Moreover, the procedure can take different
amounts of time to sort two input sequencesof the same size– depending on how nearly sorted they already
are27-Aug-15MAT-72006 AADS, Fall 2015 38
8/27/2015
20
Input size
• The time taken by an algorithm grows withthe size of the input
• Traditional to describe the running time of aprogram as a function of the size of its input
• For many problems, such as sorting mostnatural measure for input size is the numberof items in the input—i.e., the array size
27-Aug-15MAT-72006 AADS, Fall 2015 39
• For, e.g., multiplying two integers, the bestmeasure is the total number of bits needed torepresent the input in binary notation
• Sometimes, more appropriate to describe thesize with two numbers rather than one
• E.g., if the input to an algorithm is a graph,the input size can be described by thenumbers of vertices and edges in it
27-Aug-15MAT-72006 AADS, Fall 2015 40
8/27/2015
21
Running time
• Running time of an algorithm on an input:– The number of primitive operations (“steps”)
executed• Step as machine-independent as possible• For the moment:
– Constant amount of time to execute each lineof pseudocode
– We assume that each execution of the th linetakes time , where is a constant
27-Aug-15MAT-72006 AADS, Fall 2015 41
27-Aug-15MAT-72006 AADS, Fall 2015 42
INSERTION-SORT ) cost times
1 for 2 to 1
2 ] 2 – 13 // Insert into the sorted sequence [1. . ] 0 – 1
4 – 4 – 1
5 while > 0 and ] > 5
6 + 1] ] 6 1)
7 – 7 1)
8 + 1] 8 – 1
8/27/2015
22
• denotes the number of times the while looptest in line 5 is executed for that value of
• When a for or while loop exits in the usualway, the test is executed one time more thanthe loop body
• Comments are not executable statements, sothey take no time
• Running time of the algorithm is the sum ofthose for each statement executed
27-Aug-15MAT-72006 AADS, Fall 2015 43
• To compute ), the running time ofINSERTION-SORT on an input of values,– we sum the products of the cost and times
columns, obtaining
1 + 1 +
1) 1 + 1)
27-Aug-15MAT-72006 AADS, Fall 2015 44
8/27/2015
23
Best case
• The best case occurs if the array is alreadysorted
• For each = 2, 3, … , , we then nd that] < in line 5 when has its initial value of1
• Thus = 1 for = 2, 3, … , , and the best-caserunning time is
1 + 1 + 1 + 1)
27-Aug-15MAT-72006 AADS, Fall 2015 45
Worst case
• We can express this as for constantsand that depend on the statement costs
• It is a linear function of• The worst case results when the array is in
reverse sorted order — in decreasing order• We must compare each element ] with each
element in the entire sorted subarray [1. . – 1],and so for = 2, 3, … ,
27-Aug-15MAT-72006 AADS, Fall 2015 46
8/27/2015
24
• Note that
=+ 1)
21
and
1 =1)
2by the summation of an arithmetic series
=+ 1)
227-Aug-15MAT-72006 AADS, Fall 2015 47
• The worst-case running time of INSERTION-SORT is
1 + 1+ 1)
2 1 +1)
21)
2 1)
= 2 + 2 + 2 +
( + + )27-Aug-15MAT-72006 AADS, Fall 2015 48
8/27/2015
25
• We can express this worst-case running timeas 2 for constants , , and thatdepend on the statement costs
• It is a quadratic function of• The rate of growth, or order of growth, of
the running time really interests us• We consider only the leading term of a
formula ( 2); the lower-order terms arerelatively insigni cant for large values of
27-Aug-15MAT-72006 AADS, Fall 2015 49
• We also ignore the leading term’s coef cient,constant factors are less signi cant than therate of growth in determining computationalef ciency for large inputs
• For insertion sort, we are left with the factorof 2 from the leading term
• We write that insertion sort has a worst-caserunning time of 2)(“theta of -squared”)
27-Aug-15MAT-72006 AADS, Fall 2015 50
8/27/2015
26
2.3 Designing algorithms
• Insertion sort is an incremental approach: havingsorted [1. . – 1], we insert ] into its properplace, yielding sorted subarray [1. . ]
• Let us examine an alternative design approach,known as “divide-and-conquer”
• We design a sorting algorithm whose worst-caserunning time is much lower
• The running times of divide-and-conqueralgorithms are often easily determined
27-Aug-15MAT-72006 AADS, Fall 2015 51
The divide-and-conquer approach
• Many useful algorithms are recursive:– to solve a problem, they call themselves to
deal with closely related subproblems• These algorithms typically follow a divide-
and-conquer approach:– Break the problem into subproblems that
resemble the original problem but are smaller,– Solve the subproblems recursively,– Combine these solutions to create a solution
to the original problem27-Aug-15MAT-72006 AADS, Fall 2015 52
8/27/2015
27
• The paradigm involves three steps at eachlevel of the recursion:1. Divide the problem into a number of
subproblems that are smaller instances ofthe same problem
2. Conquer the subproblems by solving themrecursively
• If the sizes are small enough, just solve thesubproblems in a straightforward manner
3. Combine the solutions to the subproblemsinto the solution for the original problem
27-Aug-15MAT-72006 AADS, Fall 2015 53
The merge sort algorithm
• Divide: Divide the -element sequence intotwo subsequences of /2 elements each
• Conquer: Sort the two subsequencesrecursively using merge sort
• Combine: Merge the two sortedsubsequences to produce the sorted answer– Recursion “bottoms out” when the sequence
to be sorted has length 1: a sequence oflength 1 is already in sorted order
27-Aug-15MAT-72006 AADS, Fall 2015 54
8/27/2015
28
• The key operation is the merging of two sortedsequences in the “combine” step
• We call auxiliary procedure MERGE ),where is an array and , , and are indicessuch that
• The procedure assumes that the subarrays. . and + 1. . ] are in sorted order and
• merges them to form a single sorted subarraythat replaces the current subarray . . ]
27-Aug-15MAT-72006 AADS, Fall 2015 55
MERGE )1. 1 – + 12. 2 –3. Let [1. . 1 + 1] and
[1. . 2 + 1] be newarrays
4. for 1 to 1
5. – 1]6. for 1 to 2
7.
8. 1 + 1]9. 2 + 1]10. 111. 112. for to13. if14. ]15. + 116. else ]17. + 1
27-Aug-15MAT-72006 AADS, Fall 2015 56
8/27/2015
29
• Line 1 computes the length 1 of the subarray. . ]; similarly for 2 and + 1. . ] on line 2
• Line 3 creates arrays (left) and (right), oflengths 1 + 1 and 2 + 1, respectively– the extra position will hold the sentinel
• The for loop of lines 4–5 copies . . into[1. . ];
• Lines 6–7 copy + 1. . into [1. . 2]• Lines 8–9 put the sentinels at the ends of and
27-Aug-15MAT-72006 AADS, Fall 2015 57
• Lines 10–17 perform the + 1 basic stepsby maintaining the following loop invariant:– At the start of each iteration of the for loop of
lines 12–17, . . – 1] contains the –smallest elements of [1. . 1 + 1] and
[1. . 2 + 1], in sorted order– Moreover, ] and ] are the smallest
elements of their arrays that have not beencopied back into
27-Aug-15MAT-72006 AADS, Fall 2015 58
8/27/2015
30
1 2 3 4 5
1 2 3 6
1 2 3 4 5
2 4 5 7
27-Aug-15MAT-72006 AADS, Fall 2015 59
8 9 10 11 12 13 14 15 16 17
… 2 4 5 7 1 2 3 6 …
MERGE( , 9, 12, 16)
1 2 3 4 5
2 4 5 7
1 2 3 4 5
1 2 3 6
8 9 10 11 12 13 14 15 16 17
… 1 4 5 7 1 2 3 6 …
1 2 3 4 5
1 2 3 6
1 2 3 4 5
2 4 5 7
27-Aug-15MAT-72006 AADS, Fall 2015 60
8 9 10 11 12 13 14 15 16 17
… 1 2 2 7 1 2 3 6 …
8 9 10 11 12 13 14 15 16 17
… 1 2 5 7 1 2 3 6 …
1 2 3 4 5
2 4 5 7
1 2 3 4 5
1 2 3 6
8/27/2015
31
1 2 3 4 5
1 2 3 6
27-Aug-15MAT-72006 AADS, Fall 2015 61
8 9 10 11 12 13 14 15 16 17
… 1 2 2 3 4 2 3 6 …
1 2 3 4 5
2 4 5 7
8 9 10 11 12 13 14 15 16 17
… 1 2 2 3 1 2 3 6 …
1 2 3 4 5
2 4 5 7
i
1 2 3 4 5
1 2 3 6
1 2 3 4 5
1 2 3 6
27-Aug-15MAT-72006 AADS, Fall 2015 62
8 9 10 11 12 13 14 15 16 17
… 1 2 2 3 4 5 6 6 …
1 2 3 4 5
2 4 5 7
8 9 10 11 12 13 14 15 16 17
… 1 2 2 3 4 5 3 6 …
1 2 3 4 5
2 4 5 7
1 2 3 4 5
1 2 3 6
8/27/2015
32
27-Aug-15MAT-72006 AADS, Fall 2015 63
8 9 10 11 12 13 14 15 16 17
… 1 2 2 3 4 5 6 7 …
1 2 3 4 5
2 4 5 7
1 2 3 4 5
1 2 3 6
• The needed – + 1 iterations of the lastfor loop have been executed:• [9. . 16] is sorted, and• the two sentinels in and are the only
two elements in these arrays that havenot been copied into
• MERGE procedure runs in ) time, where– + 1
– each of lines 1–3 and 8–11 takes constanttime
– the for loops of lines 4–7 take 1 2 =) time
– there are iterations of the for loop of lines12–17, each of which takes constant time
27-Aug-15MAT-72006 AADS, Fall 2015 64
8/27/2015
33
Merge sort
• The procedure MERGE-SORT sorts theelements in . .
• If , the subarray has at most oneelement and is therefore already sorted
• Otherwise, the divide step computes an indexthat partitions . . into two subarrays:
– . . ], containing /2 elements– + 1. . ], containing /2 elements
27-Aug-15MAT-72006 AADS, Fall 2015 65
MERGE-SORT )1. if2. )/23. MERGE-SORT )4. MERGE-SORT + 1, )5. MERGE )
27-Aug-15MAT-72006 AADS, Fall 2015 66
8/27/2015
34
Analysis of merge sort
• Our analysis assumes that the originalproblem size is a power of 2
• Each divide step then yields twosubsequences of size exactly /2
• We set up the recurrence for ), the worst-case running time of merge sort onnumbers
• Merge sort on just one element takesconstant time
27-Aug-15MAT-72006 AADS, Fall 2015 67
• When we have > 1 elements, we breakdown the running time as follows:– Divide: The step just computes the middle of
the subarray, which takes constant time:) = (1)
– Conquer: We recursively solve twosubproblems, each of size /2, whichcontributes /2) to the running time
– Combine: the MERGE procedure on an -element array takes time ), and so
) = )
27-Aug-15MAT-72006 AADS, Fall 2015 68
8/27/2015
35
• When we add the ) and ), we areadding functions that are ) and (1)
• This sum is a linear function of• Adding it to the /2) term from the
“conquer” step gives the recurrence for ):
=(1) if = 1
2) + ) if > 1
27-Aug-15MAT-72006 AADS, Fall 2015 69
• To intuitively see that the solution to therecurrence is ) = lg ), where lgstands for log2 , let us rewrite it as
= if = 12) + if > 1
where constant represents time required tosolve problems of size 1 and that per arrayelement of the divide and combine steps
27-Aug-15MAT-72006 AADS, Fall 2015 70