


Online List Accessing Algorithms and Their Applications: Recent Empirical Evidence

Ran Bachrach and Ran El-Yaniv*

Abstract

This paper reports the results of an empirical test of the performance of a large set of online list accessing algorithms. The algorithms' access cost performances were tested with respect to request sequences generated from the Calgary Corpus. In addition to testing access costs within the traditional dynamic list accessing model, we tested all algorithms' relative performances as data compressors via the compression scheme of Bentley et al. Some of the results are quite surprising and stand in contrast to some competitive analysis theoretical results. For example, the randomized algorithms that were tested, all attaining a competitive ratio less than 2, consistently performed worse than quite a few deterministic algorithms, which obtained the best performance. In many instances the best performance was obtained by deterministic algorithms that are either not competitive or not optimal.

1 Introduction

The list accessing problem has been studied for more than 30 years [15]. Indeed, plenty of interesting theoretical results have been proposed. Nevertheless, this topic of research suffers from a lack of experimental feedback that could lead to better and more realistic theory. In this paper we report the results of a quite extensive empirical study testing the access cost performance and compression performance of more than 40 list accessing algorithms. These include many well known as well as new algorithms.

In the rest of this introduction we describe the list accessing problem and the competitive ratio measure that has been used for many theoretical analyses, and then outline the contents of this extended abstract.

The list accessing problem. The list accessing problem revolves around the question of how to maintain and organize a dictionary as a linear list. In particular, in this problem an online algorithm maintains a set of items as an unsorted list. A sequence of requests for items is given and the algorithm must serve these requests in the order of their arrival. The requests are either for an insertion of an item, an access to an item, or a

*Institute of Computer Science, The Hebrew University of Jerusalem. Email: {ranb,raniy}@math.huji.ac.il

deletion of an item. Requests for an access or deletion of an item x positioned i-th from the front cost the algorithm i, the number of comparisons it must make in order to locate x while starting a (sequential) search from the front. A request for insertion of an item into a list currently holding ℓ items costs the algorithm ℓ + 1. In order to expedite future requests, a list accessing algorithm may reorganize its list periodically while attempting to place items that are expected to be accessed frequently closer to the front. Reorganizations are made via a sequence of transpositions of consecutive items. After an item x is accessed (inserted), x may be moved free of charge to any position closer to the front. Such transpositions that bring x forward after an access (insertion) are called free. Any other transpositions cost 1 each and are called paid.
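To make the cost model concrete, the following sketch (ours, for illustration only; the function names are not from the paper) serves a single request on a plain Python list and charges costs exactly as defined above: an access or deletion of the item at position i costs i, and an insertion into a list currently holding ℓ items costs ℓ + 1.

    def serve(lst, request):
        """Serve one request on `lst` and return its cost under the standard model.
        A request is a pair (op, item) with op in {'access', 'insert', 'delete'}.
        No reorganization is performed here; the algorithms below add that on top."""
        op, x = request
        if op == 'insert':
            cost = len(lst) + 1          # inserting into a list of l items costs l + 1
            lst.append(x)                # new items enter at the back of the list
            return cost
        cost = lst.index(x) + 1          # sequential search from the front
        if op == 'delete':
            lst.remove(x)
        return cost

    if __name__ == '__main__':
        lst = []
        requests = [('insert', 'a'), ('insert', 'b'), ('access', 'b'), ('delete', 'a')]
        print(sum(serve(lst, r) for r in requests))   # 1 + 2 + 2 + 1 = 6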

The importance of the list accessing problem arises from the fact that online list accessing algorithms are often used by practitioners. Although organizations of dictionaries as linear lists are often inefficient, there are various situations in which a linear list is the implementation of choice. For instance, when the dictionary is small (say for organizing the list of identifiers maintained by a compiler, or for organizing collisions in a hash table), or when there is no space to implement efficient but space consuming data structures, etc. [7, 13]. Other uses of list accessing algorithms are made by algorithms for computing point maxima and convex hulls [6]. Finally, any list accessing algorithm can be used as the "heart" of a data compression algorithm [8, 2]. Due to its great relevance the list accessing problem has been extensively studied since 1965 (see e.g. [15, 18, 7, 19, 16, 3]).

Competitive list accessing algorithms. Let ALG be any list accessing algorithm. For any sequence of requests σ we denote by ALG(σ) the total cost incurred by ALG to serve σ. One recently popular performance measure of online algorithms is the competitive ratio [19], defined as follows. An algorithm ALG attains a competitive ratio c (equivalently, ALG is c-competitive) if there exists a constant α such that for all request sequences σ, ALG(σ) − c · OPT(σ) ≤ α, where OPT is an optimal offline list accessing algorithm. If ALG is randomized, it is c-competitive against an oblivious adversary if for every request sequence σ, E[ALG(σ)] − c · OPT(σ) ≤ α, where α is a constant independent of σ and E[·] is the mathematical expectation taken with respect to the random choices made by ALG [9].

The best known competitive algorithms. In their seminal paper Sleator and Tarjan [19] showed that the well-known algorithm MOVE-TO-FRONT (MTF), a deterministic algorithm that moves each requested item to the front, is 2-competitive. Raghavan and Karp (reported in [14]) proved a lower bound of 2 − 2/(ℓ + 1) on the competitive ratio of any deterministic algorithm maintaining a list of size ℓ. Irani [14] gave a matching upper bound for MTF, showing that MTF is a strictly optimal online algorithm judged from the perspective of competitive analysis. Albers [1] discovered another deterministic algorithm called TIMESTAMP and proved that it is 2-competitive. In [11] an infinite family of optimal (2-competitive) deterministic algorithms is presented. This family of algorithms includes TIMESTAMP and MTF as special cases.¹ As for randomized algorithms, so far the best known upper bound is obtained by an algorithm called COMB due to Albers, von Stengel and Werchner [3]. The bound obtained is 1.6. The best known randomized lower bound, of 1.5, is due to Teia [20].

Contents and paper organization. This abstract outlines the results of a quite extensive study of many list accessing algorithms. In Section 2 we describe all the algorithms tested. These include representatives of 18 different families and more than 40 algorithms in all. Whenever known we specify bounds on the competitive ratios of the algorithms described. We have performed two experiments. The first attempts to rank the various algorithms in terms of their access cost performance, and the second, in terms of their performance as data compressors. Section 3 gives the details of these experiments. In Section 3.4 we briefly describe the results of other empirical studies relevant to accessing cost and compression. Sections 4 and 5 outline the main conclusions and some observations obtained by the first and second experiments, respectively. Finally, in Section 6 we draw our conclusions. To support our conclusions we adjoined to this extended abstract six appendices that include details of some key concepts, tables, graphs and some raw data obtained in our experiments.

¹Any deterministic list accessing algorithm is termed here optimal if it attains a competitive ratio of 2. An algorithm which attains the lower bound of 2 − 2/(ℓ + 1) is termed strictly optimal.

2 The algorithms tested

In this section we describe the algorithms we tested. In the description of most algorithms we specify the algorithm's action only with respect to an ACCESS request. Unless otherwise specified, it is implicitly assumed that an INSERT request for an item x places x at the back of the list and then x is treated as if it was accessed. With respect to each algorithm we also specify bounds on its competitive ratio whenever they are known. Throughout this paper ℓ denotes the total number of elements inserted into the list.

2.1 Deterministic algorithms. Algorithm MOVE-TO-FRONT (MTF) is one of the most well known and used algorithms. This algorithm attains an optimal competitive ratio of 2 − 2/(ℓ + 1) [19, 14].

Algorithm MTF: Upon an access to an item x, move x to the front.
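As a minimal illustration (ours, not taken from the paper), MTF can be realized on a Python list as follows; the returned value is the access cost.

    def mtf_access(lst, x):
        """Serve an access to x under MOVE-TO-FRONT and return the access cost."""
        i = lst.index(x)          # sequential search: the cost is i + 1
        lst.pop(i)
        lst.insert(0, x)          # move the requested item to the front (free exchanges)
        return i + 1

    if __name__ == '__main__':
        lst = ['a', 'b', 'c', 'd']
        print(mtf_access(lst, 'c'), lst)   # prints: 3 ['c', 'a', 'b', 'd']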

Algorithm TRANSPOSE (TRANS) is perhaps the most "conservative" algorithm presented here and is an extreme opposite to MTF. The competitive ratio of TRANS is bounded below by 2ℓ/3 [19].

Algorithm TRANS: Upon an access to an item x, transpose x with the immediately preceding item.

Algorithm FREQUENCY-COUNT (FC) attempts to adapt its list to the empirical distribution of requests observed so far. A lower bound of (ℓ + 1)/2 on its competitive ratio is known [19].

Algorithm FC: Maintain a frequency counter for each item. Upon inserting an item, initialize its counter to 0. After accessing an item, increment its counter by one and then reorganize the list so that the items on the list are ordered in non-increasing order of their frequencies.

In our implementation we further require that if two items have the same frequency count then the item requested more recently is positioned in front of the item requested less recently.
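A sketch of FC under the rule above, including the recency tie-break used in our implementation (the code and its names are ours and only illustrative; resorting the whole list on every access is the simplest, not the most efficient, realization).

    def fc_access(lst, freq, recency, x, now):
        """Serve an access to x under FREQUENCY-COUNT and return the access cost.
        `freq` maps items to access counts, `recency` to the time of their last request."""
        cost = lst.index(x) + 1
        freq[x] = freq.get(x, 0) + 1
        recency[x] = now
        # Reorder: non-increasing frequency, ties broken in favor of the more recent item.
        lst.sort(key=lambda y: (-freq.get(y, 0), -recency.get(y, -1)))
        return cost

    if __name__ == '__main__':
        lst, freq, rec = ['a', 'b', 'c'], {}, {}
        for t, x in enumerate(['c', 'b', 'c']):
            print(fc_access(lst, freq, rec, x, t), lst)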

The following algorithm, called MTF2, is a more "relaxed" version of MTF. MTF2 can be shown to be 2-competitive.

Algorithm MTF2: Upon the i-th request for an item x, move x to the front if and only if i is even.

Algorithm MOVE-AHEAD(k) (MHD(k)), proposed by Rivest [18], is a simple compromise between the relative extremes of TRANS and MTF.

Algorithm MHD(k): Upon a request for an item x, move x forward k positions.

Thus algorithm MHD(1) is TRANS. We have tested the algorithms MHD(k) with k = 2, 8, 32, 128.
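A sketch of MHD(k) (ours): the accessed item is moved forward k positions, or to the front when it is fewer than k positions away from it.

    def mhd_access(lst, x, k):
        """Serve an access to x under MOVE-AHEAD(k) and return the access cost."""
        i = lst.index(x)
        lst.pop(i)
        lst.insert(max(0, i - k), x)   # move x forward k positions, stopping at the front
        return i + 1

    if __name__ == '__main__':
        lst = ['a', 'b', 'c', 'd', 'e']
        print(mhd_access(lst, 'd', 2), lst)   # prints: 4 ['a', 'd', 'b', 'c', 'e']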

Algorithm MOVE-FRACTION(k) (MF(k)), proposed by Sleator and Tarjan, is a slightly more sophisticated compromise between TRANS and MTF. For each k, algorithm MF(k) is 2k-competitive [19].

Algorithm MF(k): Upon a request for an item x currently at the i-th position, move x ⌈i/k⌉ − 1 positions towards the front.


Thus, algorithm MF(1) is MTF. We have tested the algorithms MF(k) with k = 2, 8, 32, 128.
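A sketch of MF(k) following the rule above (our code); note that for k = 1 the item at position i is moved i − 1 positions forward, which recovers MTF as remarked.

    import math

    def mf_access(lst, x, k):
        """Serve an access to x under MOVE-FRACTION(k) and return the access cost.
        The item at (1-based) position i is moved ceil(i/k) - 1 positions forward."""
        i = lst.index(x) + 1                  # 1-based position of x
        steps = math.ceil(i / k) - 1
        lst.pop(i - 1)
        lst.insert(i - 1 - steps, x)
        return i

    if __name__ == '__main__':
        lst = ['a', 'b', 'c', 'd', 'e', 'f']
        print(mf_access(lst, 'f', 2), lst)    # prints: 6 ['a', 'b', 'c', 'f', 'd', 'e']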

Algorithm TIMESTAMP (TS) due to Albers [1] is an optimal algorithm attaining a competitive ratio of 2.

Algorithm TS: Upon a request for an item x, insert x in front of the first (from the front of the list) item y that precedes x on the list and was requested at most once since the last request for x. Do nothing if there is no such item y, or if x has been requested for the first time.
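One way to realize TS (our sketch, not the authors' implementation) is to record the last two request times of every item: an item y has been requested at most once since the last request for x exactly when y's second-to-last request precedes x's last request.

    def ts_access(lst, last, prev, x, now):
        """Serve an access to x under TIMESTAMP and return the access cost.
        `last[y]` and `prev[y]` hold the times of y's last and second-to-last requests."""
        cost = lst.index(x) + 1
        if x in last:                              # not the first request for x
            for j in range(lst.index(x)):
                y = lst[j]
                # y was requested at most once since the last request for x?
                if prev.get(y, -1) < last[x]:
                    lst.remove(x)
                    lst.insert(j, x)               # insert x just in front of y
                    break
        prev[x] = last.get(x, -1)                  # update the request history of x
        last[x] = now
        return cost

    if __name__ == '__main__':
        lst, last, prev = ['a', 'b', 'c'], {}, {}
        for t, x in enumerate(['c', 'a', 'c']):
            print(ts_access(lst, last, prev, x, t), lst)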

The following family of algorithms, called the PRI family, is a straightforward generalization of algorithm TS. Let m be a non-negative integer. Consider the following deterministic list accessing algorithm called PASS-RECENT-ITEM(m) (PRI(m)). It is known that for each m ≥ 2, PRI(m) is 3-competitive (MRI(m) is known to be 2-competitive, see below).

Algorithm PRI(m): Upon a request for item x, move x forward just in front of the first item z on the list that was requested at most m times since the last request for x. Do nothing if there is no such item z. If this is the first request for x, leave x in place (move x to the front).

Note that modulo the handling of first requests, algorithm PRI(1) is identical to algorithm TS. Notice that the limit element of this family, as m grows, is MTF. We have tested the algorithm PRI(m) with m = 0, 1, ..., 8 and with the insertion position being the last and the front. In the sequel each member of the PRI family, PRI(m), that inserts a new item at the first (resp. last) position is denoted PRI'(m) (resp. PRI(m)).
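A sketch of PRI(m) under the description above (ours). It keeps, for every item, the full sorted list of its request times, so that "requested at most m times since the last request for x" reduces to counting the times larger than x's last request; insertions of new items are handled outside this routine according to whether the PRI'(m) or PRI(m) variant is used.

    import bisect

    def pri_access(lst, history, x, m, now):
        """Serve an access to x under PASS-RECENT-ITEM(m) and return the access cost.
        `history[y]` is the sorted list of times at which y was requested."""
        cost = lst.index(x) + 1
        times_x = history.setdefault(x, [])
        if times_x:                                   # not the first request for x
            last_x = times_x[-1]
            for j in range(lst.index(x)):
                z = lst[j]
                hz = history.get(z, [])
                # Number of requests for z strictly after the last request for x.
                since = len(hz) - bisect.bisect_right(hz, last_x)
                if since <= m:
                    lst.remove(x)
                    lst.insert(j, x)                  # move x just in front of z
                    break
        times_x.append(now)
        return cost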

The following family of algorithms is quite similar to the PRI family but with a significant change. Let m be a non-negative integer. Consider the following deterministic list accessing algorithm called MOVE-TO-RECENT-ITEM(m) (or MRI(m) for short). For each m ≥ 1 algorithm MRI(m) is 2-competitive and is thus optimal. The competitive ratio of algorithm MRI(0) is bounded below by a quantity that grows with ℓ [11].

Algorithm MRI(m): Upon a request for item x, move x forward just after the last item z on the list that is in front of x and that was requested at least m + 1 times since the last request for x. If there is no such item z, move x to the front. If this is the first request for x, leave x in place (move x to the front).

Although not immediately apparent, algorithm MRI(1) can be shown to be equivalent to algorithm TS (modulo first requests for items). We have tested the algorithms MRI(m) with m = 0, 1, ..., 8 and with the insertion position being the last and the front. In the sequel each member of the MRI family, MRI(m), that inserts a new item at the first (resp. last) position is denoted MRI'(m) (resp. MRI(m)).

2.2 Randomized algorithms. Algorithm SPLIT (SPL) due to Irani [14] was the first randomized algorithm that was shown to attain a competitive ratio lower than the deterministic lower bound of 2 − 2/(ℓ + 1). It is known that SPL is 31/16-competitive (= 1.9375).

Algorithm SPL: For each item x on the list maintain a pointer p(x) pointing to some item on the list. Initially set p(x) = x. Upon a request for an item x, with probability 1/2 move x to the front and with probability 1/2 insert x just in front of p(x). Then set p(x) to point to the first item on the list.

Algorithm BIT due to Reingold and Westbrook [16] attains a competitive ratio of 1.75 − 1.75/ℓ.

Algorithm BIT: For each item x on the list maintain a mod-2 counter b(x), initially set to either 0 or 1 randomly, independently and uniformly. Upon a request for an item x, first complement b(x). Then, if b(x) = 0, move x to the front.
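A sketch of BIT (ours); the only randomness is in the initialization of the bits, after which the algorithm is deterministic.

    import random

    def bit_init(lst):
        """Give every item a random bit, chosen independently and uniformly."""
        return {x: random.randint(0, 1) for x in lst}

    def bit_access(lst, bits, x):
        """Serve an access to x under BIT and return the access cost."""
        cost = lst.index(x) + 1
        bits[x] ^= 1                 # first complement the item's bit
        if bits[x] == 0:             # move to the front only when the bit becomes 0
            lst.remove(x)
            lst.insert(0, x)
        return cost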

The following algorithm, called RMTF, is very similar to BIT but somewhat surprisingly its worst case performance is inferior. It can be shown that its competitive ratio is not smaller than 2 [22].

Algorithm RMTF: Upon a request for x, with probability 1/2 move x to the front, and with probability 1/2 leave x in place.

The family of algorithms COUNTER(s, S) (CTR(s, S)) due to Reingold, Westbrook and Sleator [17] is a sophisticated generalization of algorithm BIT. Let s be a positive integer and S a nonempty subset of {0, 1, ..., s − 1}.

Algorithm CTR(s, S): For each item x on the list maintain a mod-s counter c(x), initially set randomly, independently and uniformly to a number in {0, 1, ..., s − 1}. Upon a request for an item x, decrement c(x) by 1 (mod s) and then, if c(x) ∈ S, move x to the front.

Thus, CTR(2, {0}) is BIT. Reingold et al. prove that CTR(7, {0, 2, 4}) is 85/49-competitive (≈ 1.735). From this family we have tested algorithm CTR(7, {0, 2, 4}).
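A sketch of CTR(s, S) (ours); with s = 2 and S = {0} it behaves exactly like the BIT sketch above, and the usage line shows the CTR(7, {0, 2, 4}) variant we tested.

    import random

    def ctr_init(lst, s):
        """Give every item a counter drawn uniformly from {0, ..., s-1}."""
        return {x: random.randrange(s) for x in lst}

    def ctr_access(lst, counters, x, s, S):
        """Serve an access to x under COUNTER(s, S) and return the access cost."""
        cost = lst.index(x) + 1
        counters[x] = (counters[x] - 1) % s      # decrement the counter mod s
        if counters[x] in S:                     # move to the front on designated residues
            lst.remove(x)
            lst.insert(0, x)
        return cost

    if __name__ == '__main__':
        lst = list('abcde')
        ctrs = ctr_init(lst, 7)
        print(ctr_access(lst, ctrs, 'd', 7, {0, 2, 4}), lst)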

The family of algorithms RANDOM-RESET(s, D) (RST(s, D)) due to Reingold et al. [17] is a variation on the COUNTER algorithms. Let s be a positive integer and D a probability distribution on the set S = {0, 1, ..., s − 1} such that for i ∈ S, D(i) is the probability of i.

Algorithm RST(s, D): For each item x on the list maintain a counter c(x), initially set randomly to a number i ∈ {0, 1, ..., s − 1} with probability D(i). Upon a request for an item x, decrement c(x) by 1. If c(x) = 0 then move x to the front and then randomly reset c(x) using D.

The best RST(s, D) algorithm, in terms of the competitive ratio, is obtained with s = 3 and D such that D(2) = (√3 − 1)/2 and D(1) = (3 − √3)/2. The competitive ratio attained in this case is √3 ≈ 1.732. In this family we have only tested this algorithm.
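A sketch of RST(s, D) (ours), with the usage line instantiating the best variant described above; D is represented as a dictionary mapping counter values to probabilities.

    import math
    import random

    def rst_reset(D):
        """Draw a counter value i with probability D[i]."""
        values, weights = zip(*D.items())
        return random.choices(values, weights=weights)[0]

    def rst_access(lst, counters, x, D):
        """Serve an access to x under RANDOM-RESET and return the access cost."""
        cost = lst.index(x) + 1
        counters[x] -= 1
        if counters[x] == 0:
            lst.remove(x)
            lst.insert(0, x)                 # move x to the front when its counter hits 0
            counters[x] = rst_reset(D)       # then reset the counter randomly using D
        return cost

    if __name__ == '__main__':
        # The best variant: s = 3, D(2) = (sqrt(3) - 1) / 2, D(1) = (3 - sqrt(3)) / 2.
        D = {2: (math.sqrt(3) - 1) / 2, 1: (3 - math.sqrt(3)) / 2}
        lst = list('abcd')
        counters = {x: rst_reset(D) for x in lst}
        print(rst_access(lst, counters, 'c', D), lst)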

Let p ∈ [0, 1]. The following is a family of algorithms called TIMESTAMP(p) (TS(p)) due to Albers [1], which is a kind of randomized combination of algorithms TS and MTF. For each p, TS(p) is proven to be max{2 − p, 1 + p(2 − p)}-competitive.


Algorithm TS(p): Upon a request for an item x, with probability p execute (i): move x to the front; and with probability 1 − p execute (ii): let y be the first item on the list such that either (a) y was not requested since the last request for x; or (b) y was requested exactly once since the last request for x and that request for y was served by the algorithm using step (ii). Insert x just in front of y. If there is no such y, or if this is the first request for x, leave x in place.

The optimal choice is easily shown to be p = (3 − √5)/2, in which case the competitive ratio attained is the golden ratio φ = (1 + √5)/2 ≈ 1.62. From this family we have tested this algorithm and denote it by RTS.

In [3], Albers, von Stengel and Werchner present the following algorithm, called COMB, which is a probability mixture of BIT and TS. Algorithm COMB is shown to be 8/5-competitive. To date this bound is the best known for a list accessing algorithm.

Algorithm COMB: Before serving any request, choose algorithm BIT with probability 4/5 and algorithm TS with probability 1/5. Serve the entire request sequence with the chosen algorithm.

Although for any data set the empirical performance of COMB can be determined from the performances of BIT and TS, we have tested COMB separately, as a control for the statistical significance of our tests of randomized algorithms.

2.3 Benchmark algorithms. In order to obtain some reference results we tested the performance of the following two "bad" algorithms.

The first of the benchmark algorithms acts in a way that at the outset seems to be the worst possible.

Algorithm MTL: Upon a request for an item x, move x to the back of the list.

Although the model requires that we charge MTL for the paid exchanges it performs, we ignored these costs and only counted access costs.

The second benchmark algorithm fixes a random permutation of the items on the list.

Algorithm RAND: Upon a request for insertion of an item x, insert x at a random position chosen independently and uniformly among the ℓ + 1 possible positions (where before the insertion there are ℓ items on the list).

3 The experiments performed

Each of the above algorithms was tested with respect to a relatively large data set called the Calgary Corpus. We performed two different experiments. The goal of the first experiment was to rank the access cost performance of the various algorithms within the traditional dynamic model. In the second experiment we applied each algorithm as a data compressor and tested the overall compression ratios obtained. We start with a description of this data set and how request sequences were generated from it; then we give the details of the experiments performed.

3.1 The data set. The Calgary Corpus is a collection of (mainly) text files that serves as a popular benchmark for testing the performance of (text) compression algorithms [5]. The corpus contains nine different types of files and consists of 17 files overall. In particular, this corpus contains books, papers, numeric data, a picture, programs and object files.

We used each file in the corpus to generate two different request sequences. The first sequence was generated by parsing the file into "words" and the second, by reading the file as a sequence of bytes. With respect to the word parsing, a word is defined as a longest string of non-space characters. For some of the non-text files in the corpus (e.g. pic, geo) this parsing does not yield a meaningful request sequence and therefore we ignored the results corresponding to such sequences. Table 3 in Appendix B specifies, for each file and for each of these parsings, the length of the request sequence generated, the number of distinct requests and the ratio between these two quantities. The information in this table (in particular these ratios) can be used to identify less (and more) "significant" sequences. For example, the (word) sequences generated from obj1 and obj2 are of minor interest since on average almost every word in the request sequence is new (the ratios of total number of words to distinct words are 1.63 and 1.46, respectively). According to this measure we would expect the four word-level sequences corresponding to the files progp, progl, book2 and book1 to be of greater significance than the rest of the word-level sequences.

In general, as can be seen in the table, the two kinds of request sequences (word- and byte-based) are considerably different, with the word-based sequences resulting in very long lists and relatively short sequences (compared to the list length). On the other hand, the byte-based sequences correspond to very short lists and very long sequences.

3.2 Experiment 1: access cost performance. In this experiment each algorithm started with an empty list and served request sequences that were generated from each of the Corpus' files. Our primary concern here was to rank the algorithms according to their total access costs and to study the relationship between these rankings and the algorithms' competitive (theoretical) performance. To obtain statistically significant results, each of the randomized algorithms (except for COMB) was executed 15 times per request sequence. The results for COMB were computed as a weighted average of the costs obtained for TIMESTAMP and the (average) costs obtained for BIT (recall that COMB is a probability mixture of these two algorithms).

3.3 Experiment 2: compression performance. In this experiment we tested the performance of all algorithms as data compressors using the compression scheme of Bentley et al. [8]. Specifically, in this scheme, whenever a dictionary word is to be encoded, a binary encoding of its current position on the list is transmitted (using some variable length prefix code). After that, the list accessing algorithm rearranges the list as if the word were accessed. If the word is not on the list, the word itself must be transmitted. In this case we encoded the word as a stream of bytes using a secondary byte-level list accessing compressor that starts with all 256 possible bytes on the list. Both word and byte compressors were implemented using the same algorithm. To restore the data, the receiver performs the inverse operations. That is, the receiver can recover each word or byte whose location was transmitted. None of the algorithms was actually implemented to compress and restore data. Note that in order to do that, each of the word and byte lists must contain a special "escape" symbol that must be transmitted each time there is a transition between the word and byte lists and vice versa. In our experiment these extra charges were ignored, so the results obtained in this experiment are useful only for an initial comparison between the various algorithms. (For each of the algorithms, the absolute compression performance is in general somewhat worse than the one reported here.) Each algorithm was tested with respect to various variable length binary prefix encodings. In particular, we have tested each algorithm with 6 different encodings including the δ-encoding, Elias, ω, START-STEP-STOP and one new encoding scheme (to the best of our knowledge). Except for the new encoding scheme, all the other schemes are defined in [4]. A brief description of the encoding schemes used appears in Appendix A.
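The following sketch (our code, not part of the experimental system) illustrates the scheme with MTF as the list manager and the Elias code as the prefix encoding; as in our experiment, escape symbols and word delimiters are ignored, so the returned figure is only an estimate of the compressed size in bits.

    def elias_bits(i):
        """Length of the Elias code of i >= 1: 2*floor(log2 i) + 1 bits."""
        return 2 * (i.bit_length() - 1) + 1

    def mtf_position(lst, x):
        """Return the 1-based position of x and move it to the front (MTF list manager)."""
        i = lst.index(x)
        lst.pop(i)
        lst.insert(0, x)
        return i + 1

    def compressed_size(text, code_len=elias_bits):
        """Estimate the compressed size, in bits, of `text` under the list accessing
        compression scheme: known words are encoded by their list position; new words
        are spelled out byte by byte through a secondary 256-entry byte list."""
        word_list = []                                  # primary word-level list
        byte_list = [bytes([b]) for b in range(256)]    # secondary byte-level list
        total = 0
        for word in text.split():
            if word in word_list:
                total += code_len(mtf_position(word_list, word))
            else:
                # Transmit the new word itself, one byte at a time, then add it to the list.
                for b in word.encode('utf-8'):
                    total += code_len(mtf_position(byte_list, bytes([b])))
                word_list.insert(0, word)               # the inserted word is treated as accessed
            # (Escape symbols and end-of-word markers are ignored, as in the experiment.)
        return total

    if __name__ == '__main__':
        sample = "the cat saw the dog and the dog saw the cat"
        print(compressed_size(sample), "bits")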

3.4 Previous empirical studies. A few empirical studies of the performance of online list accessing algorithms have been conducted. Bentley and McGeoch [7] tested the performance of MTF, FC and TRANS with respect to request sequences generated from several text files (4 files containing Pascal programs and 6 other English text files). The sequences were generated from the text files by parsing the files into "words", with a word defined as an "alphanumeric string delimited by spaces or punctuation marks". It was found that FC is always superior to TRANS and that MTF is often superior to FC. Tenenbaum [21] tested the performance of various algorithms from the MOVE-AHEAD(k) family (MHD(k)) with respect to request sequences distributed by Zipf's law. A few other simulation results, testing various properties of particular algorithms with respect to sequences generated via Zipf's distribution, are summarized in a survey by Hester and Hirschberg [13].

There are also a few empirical tests of the performance of some list accessing algorithms applied to text compression. Bentley, Sleator, Tarjan and Wei [8] tested the performance of MTF-compression algorithms with various list ("cache") sizes with respect to several text files (containing several C and Pascal programs, book sections and transcripts of terminal sessions). They also compared the algorithms' performance with that of Huffman coding compression and found that for a sufficiently large cache size (e.g. 256) MTF-compression is "competitive" with Huffman coding. Albers and Mitzenmacher [2] compared the compression performance of algorithm TIMESTAMP with that of MTF with respect to the Calgary Corpus. They considered both word and byte (character) parsings. Using Elias prefix encoding (see Appendix A) they obtained the following results. TIMESTAMP-compression is significantly better than MTF-compression with respect to character (byte) encoding, but both are not "competitive" with standard Unix compression utilities. With respect to "word" encoding, TIMESTAMP is often (only marginally) better than MTF-compression. The word-based compression performance of both these algorithms is found to be close to that of standard Unix compression utilities. However, note that the results of this experiment do not count the encoding of new words that are not already on the list. A few other studies compare the performance of particular, more sophisticated list accessing compressors that alter the basic scheme. Burrows and Wheeler [10] tested the performance of an MTF compressor that operates on data that is first transformed via a "block-sorting" transformation. Grinberg et al. [12] tested the performance of an MTF compressor that uses "secondary lists". We note (based on their results and ours) that these more sophisticated schemes achieve in general better compression results than the basic scheme.

From the above, the only comparable studies are those of Bentley and McGeoch and of Albers and Mitzenmacher. First we note that our study supports the qualitative results of both these studies. However, they are not comparable quantitatively (the Bentley-McGeoch study is incomparable to ours as it used a different corpus; the quantities reported in the Albers-Mitzenmacher study are incomparable to ours as they have not measured the transmission costs of new words).

Compared to the known empirical studies, the results reported here are significantly more comprehensive and provide insights into many algorithms, among which some have never been tested before. In addition, our results test the performance of list accessing algorithms with respect to both dictionary maintenance and compression applications.

4 Experiment 1: access cost performance

Although some of the results are hard to interpret, there are certain consistent results that are quite interesting and/or surprising. First, consider Table 1 which specifies the four best algorithms with respect to each of the individual files. The table also specifies the ratio of the worst cost obtained (always by MTL) to the best. It is interesting to note that the initial hypothesis, that the more relevant (word-level) request sequences are those that exhibit large ratios of total number of requests to distinct items, is supported by the ratio of best performance to worst performance in Table 1. For example, the four most "significant" sequences according to the first measure, corresponding to the files progp, progl, book2 and book1, are exactly those that generated the largest ratios of best to worst performance.

It is most striking that in all cases the best performance is obtained by deterministic algorithms. Among the best performing algorithms with respect to word-based sequences are MF(2), FC and members of the PRI and MRI families. Among the best algorithms with respect to byte-based parsing are FC, members of the MRI and PRI families and MHD(8). Notably, MF(2) is only 4-competitive and FC is not competitive at all. The members of MRI(m), m ≥ 1, and PRI(1) (TIMESTAMP) are 2-competitive. However, algorithm MRI(0) is also not competitive.

Not only is the best performance often obtained by (deterministic) algorithms that are not optimal, the best performing deterministic algorithms consistently and significantly outperform all randomized algorithms. These results stand in contrast to what could be expected given the worst case, competitive analysis results. We note that the costs reported for the randomized algorithms are averages of 15 independent runs per request sequence, and that these averages were found to be statistically significant.

The results obtained in our experiment support the qualitative results obtained in the Bentley-McGeoch experiment. In all instances MTF outperformed TRANS. Also, in the word-based sequences, MTF often outperforms FC. With regards to the more significant word-based sequences (progp, progl, book2 and book1), the ratio of FC to MTF cost is approximately 1.13, 1.19, 0.98 and 0.91, respectively. Thus, MTF outperformed FC in the first two sequences and under-performed FC in the latter two.

The performance relation between MTF and FC is somewhat different with respect to the byte-based sequences. Here in most instances FC significantly outperforms MTF (MTF is better than FC w.r.t. obj1, obj2, paper5 and paper6). One possible hypothesis could be that in the instances where FC outperforms MTF the request sequence more closely resembles one that was generated via a probability distribution. Nevertheless, this hypothesis is not supported by the fact that in all such instances MTF outperforms TRANS.² For example, this is also the case w.r.t. the two longest byte-based request sequences, corresponding to book1 and book2 (lengths of 768,771 and 610,856 requests, respectively).

Although the randomized algorithms performed poorly relative to the best deterministic algorithms, it is still interesting to rank their relative performance. With respect to the byte-level sequences the best randomized algorithm was always COMB. The worst performance was most often obtained by RMTF and RST. (Note that we do not consider here algorithm RAND, which was always the worst randomized algorithm.) In all sequences the second best algorithm was either TS(p) or BIT. Note that COMB and TS(p) are the best algorithms in terms of their competitive ratio. In general, the ratio of worst (randomized) performance to the best one was in the range of 1.04-1.08, with an anomaly observed with respect to the files paper4, paper5 and paper6 which resulted in the ratios 1.14, 1.17 and 1.33. (Interestingly, these papers were written by the same author.) Thus, in general the randomized algorithms achieved similar performance. With respect to the four more "significant" word-level sequences (progp, progl, book2 and book1) the best randomized algorithms were BIT (progp, progl) and SPL (book1 and book2). Whenever BIT was the best, the ratio of worst to best (among the randomized algorithms) was significantly higher (1.15, 1.17) than the corresponding ratio for the cases where SPL was the best (1.07, 1.04).

Inspection of the relative performance of the MOVE-TO-FRONT based algorithms (MTF, MTF2, BIT and RMTF) yields the following conclusions. The worst performance was almost consistently obtained by RMTF. In the byte-level sequences the best algorithm was very often MTF2, then MTF, and a few times BIT. Among the more significant word-level sequences MTF was mostly the favorite.

With respect to the MHD(k) algorithms, no consistent monotonicity was observed between performance and the parameter k.

²With respect to all non-trivial distributions it is known that TRANS has better asymptotic performance than MTF [18].

Table 1: The four best algorithms for the individual runs with respect to word and byte parsing of each of the corpus files. For each file the table gives the top four algorithms under the word partition and the byte partition, together with the cost ratios 4th/1st and MTL/1st.

It is interesting to note that in various instances of the word-level sequences the MHD(k) algorithms performed significantly worse than TRANS (= MHD(1)). This fact is quite surprising given that for such sequences (e.g. book2) more "aggressive" algorithms are expected to perform better.

The MRI(m) and PRI(m) families exhibited consistent and almost perfect monotonicity in terms of their performance as a function of the parameter m. For example, perfect monotonicity was obtained with respect to the byte-level sequence generated from book2. In many cases the algorithms MRI(0) and PRI(0) achieved the best performance. With respect to the word-level sequences the performance was sometimes increasing with m and sometimes decreasing with m. Curious results were obtained for the sequence generated by book2, where the PRI(m) algorithms achieved better performance with lower values of m and the MRI(m) algorithms with larger values.

Based on these empirical results, and taking into account the time and space complexities of the better performing algorithms, one of the best algorithms in terms of access cost performance and complexity is MF(2). In fact, after observing this we performed some initial experiments with various other algorithms in the family MF(k) and found that the algorithm MF(3/2) (i.e. the algorithm that advances a requested item 2/3 of its way to the front) was a champion in most of the byte-level sequences. Hence, if we were forced to choose one single list accessing algorithm for the purpose of dictionary maintenance (based on these empirical results) it would have been MF(k) with some k ∈ [3/2, 2]. Nevertheless, it is important to note that although the members of the MRI (and PRI) families are time and space consuming, the combination of their empirical and theoretical performance is among the best.

5 Experiment 2: data compression

Table 2 summarizes the overall average compression ratios (bits/char) obtained by the algorithms with respect to the various binary encoding schemes used. The averages are calculated with respect to the entire corpus. In this table each row corresponds to one algorithm and each column corresponds to one particular binary encoding. Surprisingly, the best average compression ratios are obtained by FC and MRI(0). Similar performances were obtained by various other members of the MRI and PRI families and by the MHD(k) algorithms. These results were obtained using the RB-encoding scheme. We note that, with the exception of two of the MF(k) algorithms and PRI'(0), the RB-encoding yielded the best compression ratios for all algorithms. The worst performance (RB-encoding, and not including algorithms MTL and RAND) was obtained by two of the MF(k) algorithms and TRANS. Note that algorithms FC and MRI(0) were the best regardless of the binary encoding used.

A striking fact is that the ranking of the randomized algorithms (RB-encoding) almost matched their competitive ratios: from best to worst the performance was obtained by COMB, TS(p), BIT, CTR, RST, SPL and RMTF. Note that algorithm RST is theoretically superior to algorithm CTR. Note also that a lower bound on the competitive ratio of BIT is not known. Nevertheless, it is important to mention that the best performance of COMB was consistently due to the deterministic algorithm TIMESTAMP (PRI(1)), which consistently outperformed BIT.

Finally, we note that the results of this experiment support the qualitative results of the Albers-Mitzenmacher compression experiment [2]. Namely, the compression ratio obtained by algorithm TIMESTAMP was consistently better than that obtained by MTF-compression (regardless of the binary encoding used).

6 Concluding remarks

Some of the results reported in this paper stand in contrast to various theoretical studies of the list accessing problem. Nevertheless, it is important to keep in mind that these results are based on performance measured with respect to a particular data set that can be criticized. One major criticism is that the sequences generated belong to two rather extreme families of request sequences. In the word-based sequences the typical ratio of total number of requests to distinct requests is on the low side. On the other hand, the byte-based sequences have a small number of distinct requests, and due to the nature of the particular files in the corpus it may be the case that they do not exhibit enough locality of reference (e.g. many of the files are English text files). Note however that this particular criticism does not apply to the compression experiment.

Thus, it would be of great interest to investigate the statistical properties of these sequences in order to validate the results presented here. In particular, it would be of major importance to devise a meaningful, quantitative measure of locality of reference that could be used to classify request sequences and to further investigate the correlation between the various algorithms and their performance with respect to such sequences. To the best of our knowledge, no such measure has been studied. Also, it would be of great importance to put together an appropriate corpus that could be used to test the performance of data structures and algorithms for dictionary maintenance.

Although the results concerning compression performance do not give true compression ratios, they do give lower bounds. These results clearly indicate that the list accessing compression scheme by itself will not give compression ratios that are competitive with popular compression algorithms such as those based on Lempel-Ziv schemes. Nevertheless, our very preliminary experimentation with more sophisticated list accessing based compression algorithms suggests that significantly better results will be obtained using more sophisticated variations on the basic scheme. For example, it would be very interesting to experiment with secondary lists [12], the Burrows-Wheeler block-sorting transformation [10], two different list managers (one for new words, and one for words already on the list), and dynamic transitions between different basic list accessing algorithms in order to adapt to changing levels of locality of reference. Also, it would be important to explore various other variable length prefix encodings (such as members of the START-STEP-STOP scheme) and attempt to match them to the list manager behavior.

Acknowledgments

We thank Susanne Albers, Allan Borodin, Brenda Brown, David Johnson and Jeffery Westbrook for useful comments.

References

[1] S. Albers. Improved randomized on-line algorithms for the list update problem. In Proceedings of the 6th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 412-419, 1995.

[2] S. Albers and M. Mitzenmacher. Average case analyses of list update algorithms, with applications to data compression. Technical Report TR-95-039, International Computer Science Institute, 1995.

Page 9: Online List Accessing Algorithms and Their Applications ...rani/el-yaniv-papers/BachrachE97.pdfLet ALG be any list accessing algorithm. For any sequence of requests u we denote by

61

Table 2: Overall average results for the compression experiment. Each entry gives the average compression ratio, bits/byte, for the entire corpus. The best results appear in boldface and the three best results are marked with stars

[3] S. Albers, B. von Stengel, and R. Werchner. A combined BIT and TIMESTAMP algorithm for the list update problem. In Proceedings of the 36th Annual Symposium on Foundations of Computer Science, 1995.

[4] T. Bell, J.G. Cleary, and I.H. Witten. Text Compression. Prentice Hall, 1990.

[5] T. Bell, J.G. Cleary, and I.H. Witten. Text Compression, Appendix B. Prentice Hall, 1990. The Calgary Corpus is available at the ftp site ftp.cpsc.ucalgary.ca.

[6] J.L. Bentley, K.L. Clarkson, and D.B. Levine. Fast linear expected-time algorithms for computing maxima and convex hulls. In Proceedings of the 1st ACM-SIAM Symposium on Discrete Algorithms, pages 179-187, 1993.

[7] J.L. Bentley and C. McGeoch. Amortized analysis of self-organizing sequential search heuristics. Communications of the ACM, 28(4):404-411, 1985.

[8] J.L. Bentley, D.D. Sleator, R.E. Tarjan, and V.K. Wei. A locally adaptive data compression scheme. Communications of the ACM, 29(4):320-330, 1986.

[9] A. Borodin, N. Linial, and M. Saks. An optimal online algorithm for metrical task systems. Journal of the ACM, 39:745-763, 1992.

[10] M. Burrows and D.J. Wheeler. A block-sorting lossless data compression algorithm. Technical Report 124, Digital Systems Research Center, 1994.

[11] R. El-Yaniv. There are infinitely many competitive-optimal online list accessing algorithms. June 1996. Submitted to SODA '97.

[12] D. Grinberg, S. Rajagopalan, R. Venkatesan, and V.K. Wei. Splay trees for data compression. In Proceedings of the 6th Annual ACM-SIAM Symposium on Discrete Algorithms, 1995.

[13] J.H. Hester and D.S. Hirschberg. Self-organizing linear search. ACM Computing Surveys, 17(3):295-312, 1985.

[14] S. Irani. Two results on the list update problem. Information Processing Letters, 38(6):202-208, June 1991.

[15] J. McCabe. On serial files with relocatable records. Operations Research, 13:609-618, July 1965.

[16] N. Reingold and J. Westbrook. Randomized algorithms for the list update problem. Technical Report YALEU/DCS/TR-804, Yale University, June 1990.

[17] N. Reingold, J. Westbrook, and D. Sleator. Randomized competitive algorithms for the list update problem. Algorithmica, 11:15-32, 1994.

[18] R. Rivest. On self-organizing sequential search heuristics. Communications of the ACM, 19(2):63-67, February 1976.

[19] D.D. Sleator and R.E. Tarjan. Amortized efficiency of list update and paging rules. Communications of the ACM, 28(2):202-208, 1985.

[20] B. Teia. A lower bound for randomized list update algorithms. Information Processing Letters, 47:5-9, 1993.

[21] A. Tenenbaum. Simulations of dynamic sequential search algorithms. Communications of the ACM, 28(2):790-791, 1978.

[22] J. Westbrook. Personal communication, 1996.

A Variable length prefix free binary encodings

We have used the following six variable length prefix binary encodings. The precise definition of each of the first five codes can be found in [4], Appendix A. Here we specify the lengths obtained by some of the encodings and provide a brief description if the length is not given by a closed form formula. Also, we give the definition of the new encoding, called the RB-encoding.

• δ-encoding: encodes i with 1 + ⌊log i⌋ + 2⌊log(1 + ⌊log i⌋)⌋ bits

• Elias encoding: encodes i with 2⌊log i⌋ + 1 bits (a sketch of one standard realization of this code appears after this list)

• ω-encoding: i is encoded by writing the binary representation of i, b(i), then appending (to the left) the binary representation of |b(i)| − 1. This process is repeated recursively and halts on the left with a coding of 2 bits. A single zero is appended on the right to mark the end of the code

• J-encoding: similar to the ω-encoding, but halts with 3 bits

• START-STEP-STOP: The START-STEP-STOP family produces a great variety of codes. Each code in this family is specified by three parameters (see the exact definition in [4], Appendix A). From this family we have tested the (1,2,5)-encoding.

• The RB-encoding: This code assumes that at each time an upper bound B on the maximum possible integer to be encoded is known. The idea is straightforward. We write the length of the binary encoding concatenated with the binary encoding itself (excluding the most significant bit). In order to exploit all possible codes we use the following variation. Define c = ⌈log(⌊log B⌋ + 1)⌉ and set f = 2^c − ⌊log B⌋ − 1; that is, f is the number of unused length codes. Let 1 ≤ i ≤ B be an integer to be encoded. If i ≤ f we encode 2^c − i using c bits. Otherwise (i > f), we encode the length of the binary representation of i − f using c bits, followed by i − f in binary, excluding its most significant bit, using ⌊log(i − f)⌋ bits.
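As a concrete illustration of the simplest of these codes, the following sketch (ours) gives one standard realization of the Elias code listed above: ⌊log i⌋ zeros followed by the binary representation of i, for a total of 2⌊log i⌋ + 1 bits.

    def elias_code(i):
        """Return one standard realization of the Elias codeword of i >= 1 as a bit string:
        floor(log2 i) zeros followed by the binary representation of i."""
        assert i >= 1
        binary = bin(i)[2:]                      # binary representation of i, MSB first
        return '0' * (len(binary) - 1) + binary

    if __name__ == '__main__':
        for i in (1, 2, 5, 9):
            print(i, elias_code(i), len(elias_code(i)))   # lengths 1, 3, 5, 7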

B The Calgary Corpus

For each of the 17 files in the Calgary Corpus, Table 3 specifies its length, the number of distinct elements and the ratio between these two, according to byte parsing and word parsing. The Calgary Corpus can be downloaded from the ftp site ftp.cpsc.ucalgary.ca.