Data Structures and Algorithms Course slides: Radix Search, Radix sort, Bucket sort, Huffman compression

Upload: tasya

Post on 24-Feb-2016


Data Structures and Algorithms
Course slides: Radix Search, Radix sort, Bucket sort, Huffman compression

Lecture 10: Searching -- Radix Searching
- For many applications, keys can be thought of as numbers.
- Searching methods that take advantage of digital properties of these keys are called radix searches.
- Radix searches treat keys as numbers in base M (the radix) and work with individual digits.

Radix Searching
- Provides reasonable worst-case performance without the complication of balanced trees.
- Provides a way to handle variable-length keys.
- Biased data can lead to degenerate data structures with bad performance.

The Simplest Radix Search
- Digital Search Trees: like BSTs, but branch according to the key's bits.
- Key comparison is replaced by a function that accesses the key's next bit.

Digital Search Example
Keys: A 00001, S 10011, E 00101, R 10010, C 00011, H 01000
(tree figure omitted)

Digital Search Trees
- Consider BST search for key K. For each node T in the tree we have 4 possible results:
  - T is empty (or a sentinel node), indicating item not found;
  - K matches T.key and the item is found;
  - K < T.key and we go to the left child;
  - K > T.key and we go to the right child.
- Now consider the same basic technique, but proceeding left or right based on the current bit within the key.

Digital Search Trees
- Call this tree a Digital Search Tree (DST). DST search for key K: for each node T in the tree we have 4 possible results:
  - T is empty (or a sentinel node), indicating item not found;
  - K matches T.key and the item is found;
  - the current bit of K is a 0 and we go to the left child;
  - the current bit of K is a 1 and we go to the right child.
- Look at the example on the board.

Digital Search Trees
- Run-times? Given N random keys, the height of a DST should average O(log2 N).
- Think of it this way: if the keys are random, at each branch it should be equally likely that a key will have a 0 bit or a 1 bit, so the tree should be well balanced.
- In the worst case, we are bound by the number of bits in the key (say it is b). So in a sense we can say that this tree has a constant run-time, if the number of bits in the key is a constant. This is an improvement over the BST.

Digital Search Trees
- But DSTs have drawbacks:
  - Bitwise operations are not always easy. Some languages do not provide for them at all, and for others they are costly.
  - Handling duplicates is problematic. Where would we put a duplicate object? Follow bits to a new position? That will work, but Find will always find the first one. (Actually this problem exists with BSTs as well.) We could have nodes store a collection of objects rather than a single object.
  - A similar problem arises with keys of different lengths: what if a key is a prefix of another key that is already present?
  - The data is not sorted. If we want sorted data, we would need to extract all of the data from the tree and sort it.
  - We may do b comparisons (of the entire key) to find a key. If a key is long and comparisons are costly, this can be inefficient.

Digital Search
- Requires O(log N) comparisons on average.
- Requires b comparisons in the worst case for a tree built with N random b-bit keys.
- Problem: at each node we make a full key comparison; this may be expensive, e.g. for very long keys.
- Solution: store keys only at the leaves, and use radix expansion to do intermediate key comparisons.
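The DST search and insert described above can be sketched in Python. This is a minimal illustration, not code from the slides: the names (DSTNode, dst_insert, dst_search, bit) and the fixed key width B = 5 are assumptions chosen to match the 5-bit example keys.

```python
B = 5  # key width in bits (the slide example uses 5-bit codes)

class DSTNode:
    def __init__(self, key):
        self.key = key
        self.left = None   # taken when the current bit of the search key is 0
        self.right = None  # taken when the current bit is 1

def bit(key, i):
    """Return bit i of key, counting from the most significant bit."""
    return (key >> (B - 1 - i)) & 1

def dst_insert(root, key):
    if root is None:
        return DSTNode(key)
    node, i = root, 0
    while True:
        if key == node.key:              # duplicate: ignore
            return root
        if bit(key, i) == 0:
            if node.left is None:
                node.left = DSTNode(key)
                return root
            node = node.left
        else:
            if node.right is None:
                node.right = DSTNode(key)
                return root
            node = node.right
        i += 1

def dst_search(root, key):
    node, i = root, 0
    while node is not None:
        if key == node.key:              # full key comparison at each node
            return True
        node = node.left if bit(key, i) == 0 else node.right
        i += 1
    return False

# Build the tree from the example keys A S E R C H:
root = None
for k in [0b00001, 0b10011, 0b00101, 0b10010, 0b00011, 0b01000]:
    root = dst_insert(root, k)
```

Note that, as the slides point out, every visited node costs a comparison of the entire key; only the branching decision uses a single bit.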

Radix Tries
- Used for retrieval (the name "trie" comes from "retrieval").
- Internal nodes are used for branching; external nodes are used for the final key comparison and to store data.

Radix Trie Example
Keys: A 00001, S 10011, E 00101, R 10010, C 00011, H 01000
(tree figure omitted)

Radix Tries
- The left subtree has all keys which have 0 for the leading bit; the right subtree has all keys which have 1 for the leading bit.
- An insert or search requires O(log N) bit comparisons in the average case, and b bit comparisons in the worst case.
- Problem: lots of extra nodes for keys that differ only in low-order bits (see the R and S nodes in the example above).
- This is addressed by Patricia trees, which allow lookahead to the next relevant bit: Practical Algorithm To Retrieve Information Coded In Alphanumeric (Patricia). In the slides that follow, the entire alphabet would be included in the indexes.

Radix Search Tries
- Benefit of simple Radix Search Tries: fewer comparisons of the entire key than in DSTs.
- Drawbacks:
  - The tree will have more overall nodes than a DST: each external node with a key needs a unique bit-path to it.
  - Internal and external nodes are of different types.
  - Insert is somewhat more complicated. Some insert situations require new internal as well as external nodes to be created: we need to create new internal nodes to ensure that each object has a unique path to it. See the example.
- Run-time is similar to the DST. Since the tree is binary, the average tree height for N keys is O(log2 N); however, paths for nodes with many bits in common will tend to be longer. The worst-case path length is again b.
- However, now at worst b bit comparisons are required, and we need only one comparison of the entire key. So, again, the benefit of the RST is that the entire key must be compared only one time.

Improving Tries
- How can we improve tries?
  - Can we reduce the heights somehow? The average height now is O(log2 N).
  - Can we simplify the data structures needed (so different node types are not required)?
  - Can we simplify the insert?
- We will examine a couple of variations that improve over the basic trie.

Bucket-Sort and Radix-Sort: Bucket-Sort
- Let S be a sequence of n (key, element) entries with keys in the range [0, N - 1].
- Bucket-sort uses the keys as indices into an auxiliary array B of sequences (buckets).
  - Phase 1: Empty sequence S by moving each entry (k, o) into its bucket B[k].
  - Phase 2: For i = 0, ..., N - 1, move the entries of bucket B[i] to the end of sequence S.
- Analysis: Phase 1 takes O(n) time; Phase 2 takes O(n + N) time. Bucket-sort takes O(n + N) time.

Algorithm bucketSort(S, N)
  Input: sequence S of (key, element) items with keys in the range [0, N - 1]
  Output: sequence S sorted by increasing keys
  B <- array of N empty sequences
  while !S.isEmpty()
    f <- S.first()
    (k, o) <- S.remove(f)
    B[k].insertLast((k, o))
  for i <- 0 to N - 1
    while !B[i].isEmpty()
      f <- B[i].first()
      (k, o) <- B[i].remove(f)
      S.insertLast((k, o))
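The bucketSort pseudocode above can be transcribed directly into Python. This is a sketch, not the slides' own code; it uses plain lists for the sequences and the (key, element) sample data from the slides' [0, 9] example.

```python
def bucket_sort(S, N):
    """Sort a list S of (key, element) pairs with keys in [0, N-1], in place."""
    B = [[] for _ in range(N)]        # N empty buckets
    # Phase 1: empty S, moving each entry (k, o) into bucket B[k]
    while S:
        k, o = S.pop(0)
        B[k].append((k, o))
    # Phase 2: move the entries of each bucket back to the end of S
    for i in range(N):
        while B[i]:
            S.append(B[i].pop(0))
    return S

# The slide example with key range [0, 9]:
S = [(7, 'd'), (1, 'c'), (3, 'a'), (7, 'g'), (3, 'b'), (7, 'e')]
bucket_sort(S, 10)
# S is now [(1,'c'), (3,'a'), (3,'b'), (7,'d'), (7,'g'), (7,'e')] -- stable:
# the three entries with key 7 keep their original d, g, e order.
```

Because each bucket is filled and drained in FIFO order, equal keys keep their relative order, which is the stability property the later slides rely on.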

Bucket Sort
- Each element of the array is put in one of the N buckets.
- Now, pull the elements from the buckets back into the array.
- At last, the sorted array (sorted in a stable way).

Bucket-Sort and Radix-Sort: Example
Sorting a sequence of 4-bit integers, one bit per pass, least significant bit first:

  initial:      1001 0010 1101 0001 1110
  after bit 0:  0010 1110 1001 1101 0001
  after bit 1:  1001 1101 0001 0010 1110
  after bit 2:  1001 0001 0010 1101 1110
  after bit 3:  0001 0010 1001 1101 1110
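The 4-bit example can be reproduced with a tiny stable sort on one bit per pass (a bucket sort with N = 2); the helper name sort_by_bit is illustrative, not from the slides.

```python
def sort_by_bit(seq, i):
    """Stable sort of integers by bit i (0 = least significant)."""
    zeros = [x for x in seq if not (x >> i) & 1]   # bucket B[0]
    ones = [x for x in seq if (x >> i) & 1]        # bucket B[1]
    return zeros + ones                            # concatenation keeps order

seq = [0b1001, 0b0010, 0b1101, 0b0001, 0b1110]
for i in range(4):
    seq = sort_by_bit(seq, i)
    print([format(x, "04b") for x in seq])
# last pass prints ['0001', '0010', '1001', '1101', '1110'],
# matching the final row of the example above
```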

Bucket-Sort and Radix-Sort: Example
Key range [0, 9]; input: 7,d  1,c  3,a  7,g  3,b  7,e
Phase 1 (buckets): B[1]: 1,c   B[3]: 3,a 3,b   B[7]: 7,d 7,g 7,e
Phase 2 (output): 1,c  3,a  3,b  7,d  7,g  7,e

Bucket-Sort and Radix-Sort: Properties and Extensions
- Key-type property: the keys are used as indices into an array and cannot be arbitrary objects; there is no external comparator.
- Stable-sort property: the relative order of any two items with the same key is preserved after the execution of the algorithm.

Extensions
- Integer keys in the range [a, b]: put entry (k, o) into bucket B[k - a].
- String keys from a set D of possible strings, where D has constant size (e.g., names of the 50 U.S. states): sort D and compute the rank r(k) of each string k of D in the sorted sequence, then put entry (k, o) into bucket B[r(k)].
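The string-key extension can be sketched as follows; the sample state names and variable names are illustrative assumptions, not from the slides.

```python
# Constant-size key set D: sort it once and map each key to its rank r(k).
D = sorted(["Ohio", "Alabama", "Texas", "Iowa"])
rank = {k: r for r, k in enumerate(D)}           # r(k) for each k in D

entries = [("Texas", 1), ("Iowa", 2), ("Alabama", 3), ("Texas", 4)]
B = [[] for _ in D]
for k, o in entries:
    B[rank[k]].append((k, o))                    # put (k, o) into bucket B[r(k)]
out = [e for bucket in B for e in bucket]
# out == [('Alabama', 3), ('Iowa', 2), ('Texas', 1), ('Texas', 4)],
# with the two Texas entries still in their original order
```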

Bucket-Sort and Radix-Sort: Lexicographic Order
- A d-tuple is a sequence of d keys (k1, k2, ..., kd), where key ki is said to be the i-th dimension of the tuple.
- Example: the Cartesian coordinates of a point in space are a 3-tuple.
- The lexicographic order of two d-tuples is recursively defined as follows:
    (x1, x2, ..., xd) < (y1, y2, ..., yd)
      iff  x1 < y1  or  (x1 = y1 and (x2, ..., xd) < (y2, ..., yd))
- I.e., the tuples are compared by the first dimension, then by the second dimension, etc.
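As a quick check of the definition: Python tuples compare in exactly this recursive order, so the examples can be tested directly.

```python
# First dimensions equal, so (1, 4) < (4, 6) decides the comparison:
assert (2, 1, 4) < (2, 4, 6)
# Decided immediately by the first dimension (3 < 5):
assert (3, 2, 4) < (5, 1, 5)
# A tuple is never strictly less than itself:
assert not (7, 4, 6) < (7, 4, 6)
```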

Bucket-Sort and Radix-Sort: Lexicographic-Sort
- Let Ci be the comparator that compares two tuples by their i-th dimension.
- Let stableSort(S, C) be a stable sorting algorithm that uses comparator C.
- Lexicographic-sort sorts a sequence of d-tuples in lexicographic order by executing algorithm stableSort d times, once per dimension.
- Lexicographic-sort runs in O(d T(n)) time, where T(n) is the running time of stableSort.

Algorithm lexicographicSort(S)
  Input: sequence S of d-tuples
  Output: sequence S sorted in lexicographic order
  for i <- d downto 1
    stableSort(S, Ci)

Example:
  (7,4,6) (5,1,5) (2,4,6) (2,1,4) (3,2,4)
  (2,1,4) (3,2,4) (5,1,5) (7,4,6) (2,4,6)   -- after sorting by the 3rd dimension
  (2,1,4) (5,1,5) (3,2,4) (7,4,6) (2,4,6)   -- after sorting by the 2nd dimension
  (2,1,4) (2,4,6) (3,2,4) (5,1,5) (7,4,6)   -- after sorting by the 1st dimension

Bucket-Sort and Radix-Sort: Radix-Sort
- Radix-sort is a specialization of lexicographic-sort that uses bucket-sort as the stable sorting algorithm in each dimension.
- Radix-sort is applicable to tuples where the keys in each dimension i are integers in the range [0, N - 1].
- Radix-sort runs in time O(d(n + N)).

Algorithm radixSort(S, N)
  Input: sequence S of d-tuples such that (0, ..., 0) <= (x1, ..., xd) and (x1, ..., xd) <= (N - 1, ..., N - 1) for each tuple (x1, ..., xd) in S
  Output: sequence S sorted in lexicographic order
  for i <- d downto 1
    bucketSort(S, N)

Radix-Sort for Binary Numbers
- Consider a sequence of n b-bit integers x = x(b-1) ... x1 x0.
- We represent each element as a b-tuple of integers in the range [0, 1] and apply radix-sort with N = 2.
- This application of the radix-sort algorithm runs in O(bn) time. For example, we can sort a sequence of 32-bit integers in linear time.

Algorithm binaryRadixSort(S)
  Input: sequence S of b-bit integers
  Output: sequence S sorted
  replace each element x of S with the item (0, x)
  for i <- 0 to b - 1
    replace the key k of each item (k, x) of S with bit xi of x
    bucketSort(S, 2)
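The lexicographicSort loop above can be sketched in Python, using the built-in sorted() (which is guaranteed stable) in place of stableSort(S, Ci); the function name is illustrative.

```python
def lexicographic_sort(S, d):
    """Sort a list of d-tuples lexicographically via d stable sorts."""
    for i in reversed(range(d)):           # for i <- d downto 1
        S = sorted(S, key=lambda t: t[i])  # stable sort by dimension i
    return S

S = [(7, 4, 6), (5, 1, 5), (2, 4, 6), (2, 1, 4), (3, 2, 4)]
lexicographic_sort(S, 3)
# -> [(2,1,4), (2,4,6), (3,2,4), (5,1,5), (7,4,6)], matching the example
```

Replacing sorted() with the bucket_sort sketched earlier (keying on dimension i) turns this into radixSort(S, N) exactly as the pseudocode describes.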

Does it Work for Real Numbers?
- What if keys are not integers? Assumption: the input is n reals from [0, 1).
- Basic idea: create N linked lists (buckets) to divide the interval [0, 1) into subintervals of size 1/N. Add each input element to the appropriate bucket and sort the buckets with insertion sort.
- With a uniform input distribution, the expected bucket size is O(1), so the expected total time is O(n).
- Is the distribution of keys across buckets similar for other input distributions?

Radix Sort
- What sort will we use to sort on digits? Bucket sort is a good choice: it sorts n numbers on digits that range from 1..N in time O(n + N).
- Each pass over n numbers with d digits takes time O(n + k), so the total time is O(dn + dk). When d is constant and k = O(n), radix sort takes O(n) time.

Radix Sort Example
- Problem: sort 1 million 64-bit numbers.
- Treat them as four-digit radix-2^16 numbers; then we can sort in just four passes with radix sort!
- Running time: 4 * (1 million + 2^16), roughly 4 million operations.
- Compare with a typical O(n lg n) comparison sort: approximately lg n = 20 operations per number being sorted, for a total running time of about 20 million operations.

Radix Sort
- In general, radix sort based on bucket sort is asymptotically fast (i.e., O(n)), simple to code, and a good choice.
- Can radix sort be used on floating-point numbers?

Summary: Radix Sort
- Assumption: the input has d digits, each ranging from 0 to k.
- Basic idea: sort elements by digit, starting with the least significant, using a stable sort (like bucket sort) for each stage.
- Each pass over n numbers with 1 digit takes time O(n + k), so the total time is O(dn + dk). When d is constant and k = O(n), this is O(n) time.
- Fast, stable, simple -- but doesn't sort in place.

Multiway Tries
- The RST we have seen considers the key 1 bit at a time. This causes a maximum tree height of up to b, and gives an average height of O(log2 N) for N keys.
- If we considered m bits at a time, then we could reduce the worst and average heights. The maximum height is now b/m, since m bits are consumed at each level.
- Let M = 2^m. The average height for N keys is now O(logM N), since we branch in M directions at each node.
- Example: consider 2^20 (1 meg) keys of length 32 bits.
  - A simple RST will have: worst-case height = 32; average-case height = O(log2(2^20)), about 20.
  - A multiway trie using 8 bits per level would have: worst-case height = 32/8 = 4; average-case height = O(log256(2^20)), about 2.5.
- This is a considerable improvement. Let's look at an example using character data: we will consider a single character (8 bits) at each level. Go over on board.

Multiway Tries
- So what is the catch (or cost)? Memory: multiway tries use considerably more memory than simple tries.
- Each node in the multiway trie contains M pointers/references; in the example with ASCII characters, M = 256.
- Many of these are unused, especially:
  - during common paths (prefixes), where there is no branching (or "one-way" branching), e.g. "through" and "throughout";
  - at the lower levels of the tree, where previous branching has likely separated the keys already.

Patricia Trees
- Idea: save memory and height by eliminating all nodes in which no branching occurs. See the example on the board.
- Note that since some nodes are now missing, level i does not necessarily correspond to bit (or character) i. So to do a search we need to store in each node which bit (character) the node corresponds to.
- However, the savings from the removed nodes is still considerable.

Patricia Trees
- Keep in mind that a key can match at every character that is checked, but still not actually be in the tree.
- Example for the tree on the board: if we search for TWEEDLE, we will only compare the T**E**E. However, the next node after the E is at index 8, which is past the end of TWEEDLE, so it is not found.
- Run-time? Similar to those of the RST and the multiway trie, depending on how many bits are used per node.

Patricia Trees
- So Patricia trees reduce tree height by removing "one-way" branching nodes.
- The text also shows how "upwards" links enable us to use only one node type: the text version makes the nodes homogeneous by storing keys within the nodes and using "upwards" links from the leaves to access the nodes, so every node contains a valid key.
- However, the keys are not checked on the way "down" the tree, only after an upwards link is followed.
- Thus Patricia saves memory but makes the insert rather tricky, since new nodes may have to be inserted between other nodes. See the text.

PATRICIA TREE
- A particular type of trie.
- Example: a trie and a PATRICIA TREE with content 010, 011, and 101.
(figure omitted)
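The character-level multiway trie discussed above can be sketched as follows. This is an illustrative sketch, not the slides' code: each node is a Python dict rather than an array of M = 256 child pointers, and the names (END, trie_insert, trie_search) are assumptions.

```python
END = ""  # marker key: "a stored key terminates at this node"

def trie_insert(root, key):
    node = root
    for ch in key:                    # consume one character per level
        node = node.setdefault(ch, {})
    node[END] = True

def trie_search(root, key):
    node = root
    for ch in key:
        if ch not in node:
            return False
        node = node[ch]
    return END in node                # prefix alone is not a match

root = {}
for w in ["through", "throughout"]:
    trie_insert(root, w)
# "through" and "throughout" share a long one-way prefix path --
# exactly the kind of non-branching chain that Patricia trees remove.
```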

PATRICIA TREE
- A PATRICIA TREE therefore has the following attributes in its internal nodes:
  - an index bit (check bit);
  - child pointers (each internal node must contain exactly 2 children).
- Leaf nodes, on the other hand, must store the actual content for the final comparison.

SISTRING
- "Sistring" is the short form of semi-infinite string.
- A string, whatever it actually represents, is a form of binary bit pattern (e.g. 11001). One of the sistrings in this example is 11001000...; there are 5 sistrings in total.
- Sistrings are theoretically of infinite length:
    11001000...
    1001000...
    001000...
    01000...
    1000...
- Practically, we cannot store them as infinite strings. For the above example, we only need to store each sistring up to 5 bits long; that is descriptive enough to distinguish each from one another.

SISTRING
- The bit level is too abstract and depends on the application; we rarely apply this at the bit level. The character level is a better idea. E.g. for CUHK the corresponding sistrings would be:
    CUHK000...
    UHK000...
    HK000...
    K000...
- We require that each be at least 4 characters long. (Why do we pad 0/NULL at the end of a sistring?)

SISTRING (USAGE)
- Sistrings are efficient for storing substring information.
- A string with n characters has n(n+1)/2 substrings, the longest of size n, so the storage requirement for all substrings would be O(n^3).
- E.g. CUHK is 4 characters long and has 4(5)/2 = 10 different substrings: C, U, ..., CU, UK, ..., CUH, UHK, CUHK. Storage: O(n^2) substrings times max length O(n) -> O(n^3).
- We may instead store the sistrings of CUHK, which requires only O(n^2) storage.

Huffman Compression
Building the Huffman tree from a forest F of weighted trees:
1) Find the two trees, T1 and T2, with the smallest weights.
2) Create a new tree, T, whose weight is the sum of T1 and T2.
3) Remove T1 and T2 from F, and add them as left and right children of T.
4) Add T to F.
See the example on the board.

Huffman Compression
- Huffman issues:
  - Is the code correct? Does it satisfy the prefix property?
  - Does it give good compression?
  - How to decode? How to encode?
  - How to determine weights/frequencies?

Huffman Compression
- Is the code correct? Based on the way the tree is formed, it is clear that the codewords are valid. The prefix property is assured, since each codeword ends at a leaf: all original nodes corresponding to the characters end up as leaves.
- Does it give good compression? For a block code of N different characters, log2 N bits are needed per character; thus for a file containing M ASCII characters, 8M bits are needed.
- Given Huffman codes {C0, C1, ..., C(N-1)} for the N characters in the alphabet, each of length |Ci|, and given frequencies {F0, F1, ..., F(N-1)} in the file (where the sum of all frequencies = M), the total bits required for the file is:
    sum over i = 0 to N-1 of (|Ci| * Fi)
- The overall total depends on the differences in frequencies: the more extreme the differences, the better the compression. If the frequencies are all the same, there is no compression. See the example from the board.

Huffman Compression
- How to decode? This is fairly straightforward, given that we have the Huffman tree available:

    start at root of tree and first bit of file
    while not at end of file
      if current bit is a 0, go left in tree
      else go right in tree        // bit is a 1
      if we are at a leaf
        output character
        go to root
      read next bit of file

- Each character is a path from the root to a leaf. If we are not at the root when the end of file is reached, there was an error in the file.

Huffman Compression
- How to encode? This is trickier, since we are starting with characters and outputting codewords.
- Using the tree, we would have to start at a leaf (first finding the correct leaf), move up to the root, and finally reverse the resulting bit pattern. Instead, let's process the tree once (using a traversal) to build an encoding TABLE. Demonstrate inorder traversal on board.

Huffman Compression
- How to determine weights/frequencies? A 2-pass algorithm: process the original file once to count the frequencies, then build the tree/code and process the file again, this time compressing.
- This ensures that each Huffman tree will be optimal for each file. However, to decode, the tree/frequency information must be stored in the file, likely at the front, so that decompression first reads the tree info, then uses it to decompress the rest of the file. This adds extra space to the file, reducing the overall compression quality.
- The overhead especially reduces quality for smaller files, since the tree/frequency info may add a significant percentage to the file size. Thus larger files have a higher potential for compression with Huffman than do smaller ones. However, just because a file is large does NOT mean it will compress well: the most important factor in the compression remains the relative frequencies of the characters.
- Using a static Huffman tree: process a lot of "sample" files, and build a single tree that will be used for all files. This saves the overhead of tree information, but generally is NOT a very good approach: there are many different file types that have very different frequency characteristics.
  - Ex: a .cpp file vs. a .txt file containing an English essay. The .cpp file will have many ;, {, }, (, ); the .txt file will have many a, e, i, o, u, ., etc. A tree that works well for one file may work poorly for another (perhaps even expanding it).
- Adaptive single-pass algorithm: builds the tree as it encodes the file, thereby not requiring tree information to be separately stored, and processes the file only one time. We will not look at the details of this algorithm, but the LZW algorithm and the self-organizing search algorithm we will discuss next are also adaptive.

Huffman Shortcomings
- What is Huffman missing? Although OPTIMAL for single-character (word) compression, Huffman does not take into account patterns / repeated sequences in a file.
- Ex: a file with 1000 As followed by 1000 Bs, etc. for every ASCII character will not compress AT ALL with Huffman, yet it seems like this file should be compressible.
- We can use run-length encoding in this case (see text). However, run-length encoding is very specific and not generally effective for most files (since they do not typically have long runs of each character).
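The pieces above -- the forest-merging construction, the table built by a tree traversal, and the bit-by-bit decode -- can be sketched together in a compact 2-pass encoder. This is an illustrative sketch, not the course's code: it uses Python's heapq to find the two smallest-weight trees, represents a leaf as a character and an internal node as a (left, right) pair, and all names are assumptions.

```python
import heapq
from collections import Counter

def build_tree(freqs):
    # Forest F as a heap of (weight, tiebreak, tree) triples.
    heap = [(w, i, ch) for i, (ch, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    i = len(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)   # the two smallest-weight trees
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, i, (t1, t2)))  # merged tree T back into F
        i += 1
    return heap[0][2]

def code_table(tree, prefix=""):
    """Traverse the tree once to build the encoding table."""
    if isinstance(tree, str):             # leaf: its codeword is the path so far
        return {tree: prefix or "0"}
    table = {}
    table.update(code_table(tree[0], prefix + "0"))   # left edge emits 0
    table.update(code_table(tree[1], prefix + "1"))   # right edge emits 1
    return table

def decode(bits, tree):
    out, node = [], tree
    for b in bits:
        node = node[0] if b == "0" else node[1]
        if isinstance(node, str):         # reached a leaf: one character done
            out.append(node)
            node = tree                   # go back to the root
    return "".join(out)

# Pass 1: count frequencies; pass 2: encode with the derived table.
text = "aaaabbc"
tree = build_tree(Counter(text))
table = code_table(tree)
bits = "".join(table[ch] for ch in text)
assert decode(bits, tree) == text
```

Note how the prefix property falls out of the construction: every character sits at a leaf, so no codeword can be a prefix of another, which is what lets decode emit a character the moment it hits a leaf.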
