[IEEE 2010 Data Compression Conference - Snowbird, UT, USA (2010.03.24-2010.03.26)]



Optimum String Match Choices in LZSS
Graham Little† and James Diamond‡

Jodrey School of Computer Science, Acadia University

Wolfville, Nova Scotia, Canada B4P 2R6
{081219l, jdiamond}@acadiau.ca

Summary

Compression techniques in the LZ77 family operate by repeatedly searching for strings in a dictionary and then outputting a series of tokens which unambiguously define the chosen sequence of strings. The dictionary is composed of the most-recently matched N symbols, for some implementation-dependent N. The strings to be matched are the prefixes of the remaining input symbols. When a particular prefix has been matched, those symbols are moved from the beginning of the remaining symbols to the end of the dictionary; in general this will cause some symbols to be deleted from the beginning of the dictionary, in order to limit its size to N.
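A minimal sketch of this window mechanism, assuming a byte-oriented dictionary and a brute-force prefix search (the paper does not specify an implementation; `longest_match` and `slide` are illustrative names):

```python
def longest_match(window, lookahead, max_len=16):
    """Find the longest prefix of `lookahead` occurring in `window`.

    Returns (offset, length); length 0 means no match was found."""
    limit = min(max_len, len(lookahead))
    for length in range(limit, 0, -1):
        offset = bytes(window).find(bytes(lookahead[:length]))
        if offset != -1:
            return offset, length
    return 0, 0

def slide(window, matched, n=4096):
    """Append the matched symbols to the window, then trim the oldest
    symbols from the front so the window never exceeds N symbols."""
    window.extend(matched)
    del window[:-n]
    return window
```

A linear scan like this is quadratic in practice; real implementations index the window with hash chains or a suffix structure, but the update semantics are the same.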

Compression algorithms in the LZ77 family perform a greedy choice when looking for the next string of input symbols to match. That is, the longest string of symbols which is found in the current dictionary is chosen as the next match. Many variations of LZ77 have been proposed; some of these attempt to improve compression by sometimes choosing a non-maximal string, if it appears that such a choice might improve the overall compression ratio. In this paper we present an algorithm which computes a set of matches designed to minimize the number of bits output, not necessarily the number of strings matched.

In some variants of LZ77, the token stream is itself compressed using a statistical technique, which means the length of a token is not known a priori. However, other LZ77 variants code the tokens using a scheme for which the length of a given token can be computed in advance. In such a case it is computationally feasible to compute the globally optimum set of matches (we refer to this as the optimum parsing of the input).
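As an illustration of such a scheme, token lengths can be fixed in advance once the dictionary size and maximum match length are chosen. The one-bit flag and 8-bit literal layout below are assumptions for the example, not necessarily the paper's exact token format:

```python
import math

DICT_SIZE = 4096   # dictionary of 4K symbols (as in the paper's experiments)
MAX_MATCH = 16     # maximum match length (as in the paper's experiments)

# One flag bit distinguishes literal tokens from match tokens (assumed layout).
FLAG_BITS   = 1
OFFSET_BITS = math.ceil(math.log2(DICT_SIZE))   # 12 bits of offset
LENGTH_BITS = math.ceil(math.log2(MAX_MATCH))   # 4 bits of length

def token_bits(is_match):
    """A priori size in bits of one LZSS token under this layout."""
    return FLAG_BITS + (OFFSET_BITS + LENGTH_BITS if is_match else 8)
```

Under these assumptions every match token costs 17 bits and every literal costs 9 bits, so the cost of any candidate parsing can be computed exactly before any coding is done.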

The basic idea is as follows. At each step of the compression process, the number of bits required by an optimum parsing of the input ending at the current position is known. If the longest match available at this point has length m, then candidate optimum parsings for each of the next m positions can be computed by adding the number of bits required for the current position to the token lengths for each of the m possible prefixes of the longest match. These m values are compared pairwise to the current values for the next m locations, and for each improved bit count, the new value and a pointer to the current location are stored. When the end of the input is reached the pointers are traced backwards from the final input symbol to compute the optimum parsing.
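The procedure above can be sketched as a dynamic program over token costs with backward pointers. This is an illustrative reconstruction, not the authors' code: the cost constants and the `longest_match_len` helper are assumptions, matches of length 1 are coded as literals for simplicity, and overlapping matches are omitted.

```python
LITERAL_BITS = 1 + 8        # flag + raw symbol (assumed coding)
MATCH_BITS   = 1 + 12 + 4   # flag + 12-bit offset + 4-bit length (assumed)

def longest_match_len(data, i, max_match):
    """Length of the longest prefix of data[i:] found entirely within the
    already-processed prefix data[:i] (overlap omitted for simplicity)."""
    limit = min(max_match, len(data) - i)
    for length in range(limit, 0, -1):
        if bytes(data).find(bytes(data[i:i + length]), 0, i) != -1:
            return length
    return 0

def optimal_parse(data, max_match=16):
    """Minimum-bit parsing with backward pointers.

    cost[i] holds the bits needed by an optimum parsing of data[:i];
    back[i] points to the start of the token that ends at position i."""
    n = len(data)
    cost = [float("inf")] * (n + 1)
    back = [0] * (n + 1)
    cost[0] = 0
    for i in range(n):
        # Candidate 1: a literal token covering data[i].
        if cost[i] + LITERAL_BITS < cost[i + 1]:
            cost[i + 1] = cost[i] + LITERAL_BITS
            back[i + 1] = i
        # Candidate 2: every prefix of the longest match starting at i,
        # compared pairwise against the current values, as in the paper.
        m = longest_match_len(data, i, max_match)
        for length in range(2, m + 1):
            if cost[i] + MATCH_BITS < cost[i + length]:
                cost[i + length] = cost[i] + MATCH_BITS
                back[i + length] = i
    # Trace the pointers backwards from the end to recover token boundaries.
    cuts, i = [], n
    while i > 0:
        cuts.append(i)
        i = back[i]
    return cost[n], list(reversed(cuts))
```

For example, `optimal_parse(b"abababab")` parses the input as two literals followed by two matches, for 52 bits in total. Because each position is relaxed at most `max_match` times, the whole pass is linear in the input length apart from the match search itself.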

The Calgary Corpus was used as the test data. An implementation of LZSS which has a maximum match length of 16, a dictionary of 4K symbols and token sizes known a priori was used as the base algorithm. Our algorithm reduced the average compression ratio from 45.28% to 42.64%, a (relative) improvement of better than 5.8%.

† Author’s current address is 10 Wilson Blvd, Halifax, NS, B3M 3E4.
‡ This work was partially supported by the Natural Sciences and Engineering Research Council.

2010 Data Compression Conference

1068-0314/10 $26.00 © 2010 IEEE

DOI 10.1109/DCC.2010.67
