IEEE 2010 Data Compression Conference, Snowbird, UT, USA, March 24-26, 2010


Optimum String Match Choices in LZSS

Graham Little† and James Diamond‡

Jodrey School of Computer Science
Acadia University
Wolfville, Nova Scotia, Canada B4P 2R6
{081219l, jdiamond}@acadiau.ca

Summary

Compression techniques in the LZ77 family operate by repeatedly searching for strings in a dictionary and then outputting a series of tokens which unambiguously define the chosen sequence of strings. The dictionary is composed of the most-recently matched N symbols, for some implementation-dependent N. The strings to be matched are the prefixes of the remaining input symbols. When a particular prefix has been matched, those symbols are moved from the beginning of the remaining symbols to the end of the dictionary; in general this will cause some symbols to be deleted from the beginning of the dictionary, in order to limit its size to N.

Compression algorithms in the LZ77 family perform a greedy choice when looking for the next string of input symbols to match. That is, the longest string of symbols which is found in the current dictionary is chosen as the next match. Many variations of LZ77 have been proposed; some of these attempt to improve compression by sometimes choosing a non-maximal string, if it appears that such a choice might improve the overall compression ratio. In this paper we present an algorithm which computes a set of matches designed to minimize the number of bits output, not necessarily the number of strings matched.

In some variants of LZ77, the token stream is itself compressed using a statistical technique, which means the length of a token is not known a priori. However, other LZ77 variants code the tokens using a scheme for which the length of a given token can be computed in advance. In such a case it is computationally feasible to compute the globally optimum set of matches (we refer to this as the optimum parsing of the input).

The basic idea is as follows. At each step of the compression process, the number of bits required by an optimum parsing of the input ending at the current position is known. If the longest match available at this point has length m, then candidate optimum parsings for each of the next m positions can be computed by adding the number of bits required for the current position to the token lengths for each of the m possible prefixes of the longest match. These m values are compared pairwise to the current values for the next m locations, and for each improved bit count, the new value and a pointer to the current location are stored. When the end of the input is reached the pointers are traced backwards from the final input symbol to compute the optimum parsing.
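The relaxation scheme above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the fixed token sizes (9 bits per literal, 17 bits per match), the minimum match length of 2, and the naive `longest_match_len` helper are assumed values chosen for the example, not taken from the paper.

```python
def longest_match_len(data, i, window, max_len):
    """Naive scan: length of the longest prefix of data[i:] that also
    starts somewhere in the sliding window (overlapping matches allowed)."""
    best = 0
    for s in range(max(0, i - window), i):
        k = 0
        while k < max_len and i + k < len(data) and data[s + k] == data[i + k]:
            k += 1
        best = max(best, k)
    return best

def optimal_parse(data, window=4096, max_len=16, min_len=2,
                  literal_bits=9, match_bits=17):
    """Optimum parsing by relaxation: cost[i] is the fewest bits needed
    to encode data[:i]; back[i] points to where that parse came from."""
    n = len(data)
    INF = float("inf")
    cost = [0] + [INF] * n
    back = [0] * (n + 1)
    for i in range(n):
        if cost[i] == INF:
            continue
        # Candidate 1: emit data[i] as a literal token.
        if cost[i] + literal_bits < cost[i + 1]:
            cost[i + 1], back[i + 1] = cost[i] + literal_bits, i
        # Candidate 2: every prefix (lengths min_len..m) of the longest match,
        # i.e. the m candidate parsings described above.
        m = longest_match_len(data, i, window, max_len)
        for length in range(min_len, m + 1):
            j = i + length
            if cost[i] + match_bits < cost[j]:
                cost[j], back[j] = cost[i] + match_bits, i
    # Trace the stored pointers backwards from the final input symbol.
    parse, j = [], n
    while j > 0:
        parse.append((back[j], j))
        j = back[j]
    parse.reverse()
    return parse, cost[n]
```

On the input `b"abababababab"` this yields two literal tokens followed by one overlapping match of length 10, for 9 + 9 + 17 = 35 bits, whereas parsing each token greedily in isolation can do no better.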

The Calgary Corpus was used as the test data. An implementation of LZSS which has a maximum match length of 16, a dictionary of 4K symbols and token sizes known a priori was used as the base algorithm. Our algorithm reduced the average compression ratio from 45.28% to 42.64%, a (relative) improvement of better than 5.8%.
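The relative-improvement figure follows directly from the two average ratios quoted above:

```python
base, improved = 45.28, 42.64        # average compression ratios (%)
relative = (base - improved) / base * 100
print(round(relative, 2))            # 5.83, i.e. better than 5.8%
```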

† Author's current address is 10 Wilson Blvd, Halifax, NS, B3M 3E4.
‡ This work was partially supported by the Natural Sciences and Engineering Research Council.

2010 Data Compression Conference

1068-0314/10 $26.00 © 2010 IEEE

DOI 10.1109/DCC.2010.67

