[ieee 2010 data compression conference - snowbird, ut, usa (2010.03.24-2010.03.26)] 2010 data...

High-Order Text Compression on Hierarchical Edge-Guided*

Miguel A. Martinez-Prieto, Joaquin Adiego, Pablo de la Fuente and Javier D. Fernandez

Depto. de Informatica, Universidad de Valladolid, Spain. {migumar2,jadiego,pfuente,jfergar}@infor.uva.es

The hierarchical Edge-Guided techniques (called E-Gk) enhance the original E-G approach [1] to support high-order text statistics. They use the same graph-based model to represent an extended input alphabet obtained by using a variant of the Re-Pair [2] algorithm. E-Gk adapts the previous coding scheme to exploit the features of the bit-oriented canonical Huffman code chosen as output alphabet.

The base case E-G1 (established for k = 1) regards a text as a sequence of words. Words related to more than α different words are considered stopwords. A new heuristic is proposed to save re-ordering operations on these stopwords: when the outdegree of a vertex is greater than α, no re-ordering operations are performed and new transitions are sequentially appended to the last positions of the vocabulary. In addition, E-G1 performs dynamic model encoding: the encoding of NEWVERTEX and NEWEDGE is adapted to their current statistics in the context vertex. E-G1 speeds up the compression/decompression processes of the original E-G by up to 2 times with similar effectiveness, achieving compression ratios between 20.5% and 22% on medium-large text collections.
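The stopword heuristic can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class name Vertex, the threshold value ALPHA = 3, and the re-ordering policy (moving a transition ahead of less frequent ones) are assumptions made only for illustration, since the abstract gives neither the threshold value nor the exact re-ordering rule.

```python
from collections import defaultdict

ALPHA = 3  # hypothetical outdegree threshold; the abstract does not give its value


class Vertex:
    """A word vertex whose outgoing transitions are kept in rank order."""

    def __init__(self):
        self.followers = []             # target words; list position = rank
        self.counts = defaultdict(int)  # frequency of each transition

    def is_stopword(self):
        # A vertex related to more than ALPHA different words is a stopword.
        return len(self.followers) > ALPHA

    def follow(self, word):
        """Record a transition to `word`.

        Returns the rank the transition had when it was found,
        or None if the transition is new.
        """
        if word not in self.counts:
            self.counts[word] = 1
            self.followers.append(word)  # new transitions go to the last position
            return None
        found = self.followers.index(word)
        self.counts[word] += 1
        if not self.is_stopword():
            # Assumed re-ordering policy: move the transition ahead of
            # any less frequent ones. Stopwords skip this step entirely.
            rank = found
            while rank > 0 and self.counts[self.followers[rank - 1]] < self.counts[word]:
                self.followers[rank - 1], self.followers[rank] = (
                    self.followers[rank],
                    self.followers[rank - 1],
                )
                rank -= 1
        return found
```

In this sketch a vertex with few followers keeps them sorted as frequencies change, while a vertex that has crossed the threshold only ever appends, which is where the saving in re-ordering operations would come from.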

The general E-Gk applies the E-G1 improvements to an extended input alphabet which also comprises significant word sequences. A variant of Re-Pair is designed to achieve a hierarchical representation of these sequences. It is based on a redefinition of the active pair concept. Let e and t be the number of different and total transitions between words, and w = t/e the average number of times that a transition is found in the text. We only consider for replacement those pairs which appear at least w times. This process delivers a set of high-order contexts and the binary hierarchy which generates them. These are integrated into the original graph by linking the k-order context with its right component in level k - 1. That is, if C → AB, a special edge from C to B is used to represent this hierarchical relationship, which is encoded with a NEWCONTEXT escape symbol. E-Gk (2 ≤ k ≤ 5) yields a competitive space/efficiency trade-off, with compression ratios ranging between 18% and 20% for medium-large collections and decompression times better than those achieved by other high-order character-oriented compressors (e.g. PPM).
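The redefined active pair test can be sketched in Python as follows. This is an illustrative reading under stated assumptions, not the authors' Re-Pair variant: the function names active_pairs and replace_pair are hypothetical, w is taken as total transitions divided by different transitions (the average the text describes), and overlapping occurrences of a pair are counted naively.

```python
from collections import Counter


def active_pairs(words):
    """Return (pairs occurring at least w times, w) for a word sequence."""
    pairs = Counter(zip(words, words[1:]))  # adjacent word transitions
    e = len(pairs)                          # number of different transitions
    t = sum(pairs.values())                 # total number of transitions
    if e == 0:
        return {}, 0.0
    w = t / e                               # average occurrences per transition
    return {p: c for p, c in pairs.items() if c >= w}, w


def replace_pair(words, pair, symbol):
    """Replace each left-to-right occurrence of `pair` with `symbol`."""
    out, i = [], 0
    while i < len(words):
        if i + 1 < len(words) and (words[i], words[i + 1]) == pair:
            out.append(symbol)
            i += 2
        else:
            out.append(words[i])
            i += 1
    return out


words = "to be or not to be".split()
active, w = active_pairs(words)
# Rule C -> "to be": the hierarchical edge would link C to its right component "be".
reduced = replace_pair(words, ("to", "be"), "C")
```

Here the text has 5 transitions of 4 different kinds, so w = 1.25 and only the pair ("to", "be"), which occurs twice, qualifies for replacement. Repeating the process on the reduced sequence is what would build the binary hierarchy of higher-order contexts.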

References

[1] J. Adiego, M.A. Martinez-Prieto, and P. de la Fuente. Edge-Guided Natural Language Text Compression. In SPIRE 2007, LNCS, pages 14-25, 2007.

[2] N.J. Larsson and A. Moffat. Offline Dictionary-Based Compression. Proceedings of the IEEE, 88(11):1722-1732, 2000.

* This work is partially funded by MICINN (grant TIN2009-14009-C02-02), and by a fellowship granted by the Regional Government of Castilla y Leon and the European Social Fund (first and fourth authors).

2010 Data Compression Conference

1068-0314/10 $26.00 © 2010 IEEE

DOI 10.1109/DCC.2010.72

