


[IEEE 2010 Data Compression Conference - Snowbird, UT, USA (2010.03.24-2010.03.26)] 2010 Data Compression Conference - High-Order Text Compression on Hierarchical Edge-Guided

High-Order Text Compression on Hierarchical Edge-Guided*

Miguel A. Martinez-Prieto, Joaquin Adiego, Pablo de la Fuente and Javier D. Fernandez

Depto. de Informatica, Universidad de Valladolid, Spain.
{migumar2,jadiego,pfuente,jfergar}@infor.uva.es

The hierarchical Edge-Guided techniques (called E-Gk) enhance the original E-G approach [1] to support high-order text statistics. They use the same graph-based model to represent an extended input alphabet obtained with a variant of the Re-Pair [2] algorithm. E-Gk adapts the previous coding scheme to exploit the features of the bit-oriented canonical Huffman code chosen as output alphabet.
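For readers unfamiliar with the output alphabet mentioned above, the following is a minimal sketch of how a bit-oriented canonical Huffman code can be built. It is generic textbook material, not the E-Gk coder itself; the function names `huffman_lengths` and `canonical_codes` and the toy frequencies are assumptions made for illustration.

```python
import heapq
from itertools import count


def huffman_lengths(freqs):
    """Return {symbol: code length in bits} for a Huffman code over `freqs`."""
    if len(freqs) == 1:
        return {next(iter(freqs)): 1}
    tie = count()  # tie-breaker so the heap never compares symbol lists
    heap = [(f, next(tie), [s]) for s, f in freqs.items()]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freqs, 0)
    while len(heap) > 1:
        f1, _, s1 = heapq.heappop(heap)
        f2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:          # every merged symbol gets one bit deeper
            lengths[s] += 1
        heapq.heappush(heap, (f1 + f2, next(tie), s1 + s2))
    return lengths


def canonical_codes(lengths):
    """Assign canonical codewords: sort by (length, symbol), count upwards."""
    codes = {}
    code, prev_len = 0, 0
    for sym, length in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        code <<= length - prev_len          # pad with zeros when the length grows
        codes[sym] = format(code, "0{}b".format(length))
        code += 1
        prev_len = length
    return codes


# Toy frequencies (hypothetical, not taken from the paper).
lengths = huffman_lengths({"a": 5, "b": 2, "c": 1, "d": 1})
codes = canonical_codes(lengths)
```

Because canonical codewords are fully determined by the code lengths, a decoder only needs the lengths to rebuild the code, which is one of the features a coding scheme can take advantage of.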

The base case E-G1 (established for k = 1) regards a text as a sequence of words. Words related to more than α different words are considered stopwords. A new heuristic is proposed to save re-ordering operations on these stopwords: when the outdegree of a vertex is greater than α, no re-ordering operations are performed and new transitions are appended sequentially at the last positions of the vocabulary. In addition, E-G1 performs dynamic model encoding: the encoding of NEWVERTEX and NEWEDGE is adapted to their current statistics in the context vertex. E-G1 improves the original E-G compression and decompression processes by up to a factor of 2, with similar effectiveness: compression ratios between 20.5% and 22% for medium-to-large text collections.
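The stopword heuristic can be illustrated with a small sketch. It assumes that each vertex keeps its outgoing transitions in a list ordered by frequency, that the rank of a transition in that list is what gets encoded, and that the threshold on the outdegree is a parameter called `alpha` here. The class `Vertex`, its method names and the threshold value are hypothetical, and the NEWVERTEX and NEWEDGE escapes are not modelled.

```python
class Vertex:
    """One word of the graph together with its outgoing transitions."""

    def __init__(self, alpha):
        self.alpha = alpha      # stopword threshold on the outdegree
        self.successors = []    # followers, kept most frequent first while re-ordering is on
        self.counts = {}        # follower -> number of times the transition was seen

    def is_stopword(self):
        return len(self.successors) > self.alpha

    def follow(self, word):
        """Record a transition to `word`.

        Returns the rank of the edge before the update, or None for a new edge.
        """
        if word not in self.counts:
            self.counts[word] = 1
            self.successors.append(word)    # new edges go to the last position
            return None
        rank = self.successors.index(word)
        self.counts[word] += 1
        if not self.is_stopword():
            # Bubble the follower up past less frequent ones.
            i = rank
            while i > 0 and self.counts[self.successors[i - 1]] < self.counts[word]:
                self.successors[i - 1], self.successors[i] = (
                    self.successors[i],
                    self.successors[i - 1],
                )
                i -= 1
        return rank


v = Vertex(alpha=2)
ranks = [v.follow(w) for w in ["a", "b", "b", "c", "c", "c"]]
```

In this run the second "b" is moved ahead of "a" while the vertex is still below the threshold. Once "c" raises the outdegree to 3, the vertex counts as a stopword and later occurrences of "c" no longer trigger any re-ordering, which is the saving the heuristic aims for.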

The general E-Gk applies the E-G1 improvements to an extended input alphabet which also comprises significant word sequences. A variant of Re-Pair is designed to obtain a hierarchical representation of these sequences. It is based on a redefinition of the active pair concept. Let e and t be the number of different and total transitions between words, respectively, and w = t/e the average number of times that a transition is found in the text. We only consider for replacement those pairs which appear at least w times. This process delivers a set of high-order contexts and the binary hierarchy which generates them. These are integrated into the original graph by linking the k-order context with its right component in level k - 1. That is, if C → AB, a special edge from C to B is used to represent this hierarchical relationship, which is encoded with a NEWCONTEXT escape symbol. E-Gk (2 ≤ k ≤ 5) yields a competitive space/efficiency trade-off, with compression ratios ranging between 18% and 20% for medium-to-large collections and decompression times better than those achieved by other high-order character-oriented compressors (e.g., PPM).
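The thresholded pair replacement can be sketched as follows. This is a naive illustration under several assumptions, not the authors' implementation: w is taken as the total number of transitions divided by the number of different ones, it is computed once on the original word sequence, new contexts are represented by integers, and the limit on the order k is not modelled. The names `repair_variant`, `expand` and `context_edges` are hypothetical.

```python
from collections import Counter


def repair_variant(words):
    """Replace frequent adjacent pairs by new symbols (ints) while freq >= w."""
    seq = list(words)
    pairs = Counter(zip(seq, seq[1:]))
    e, t = len(pairs), sum(pairs.values())   # different and total transitions
    rules = {}                               # new symbol -> (left, right)
    if e == 0:
        return seq, rules, 0.0
    w = t / e                                # average occurrences per transition
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), freq = max(pairs.items(), key=lambda kv: kv[1])
        if freq < w or freq < 2:             # only pairs seen at least w times
            break
        new = len(rules)
        rules[new] = (a, b)
        out, i = [], 0
        while i < len(seq):                  # left-to-right, non-overlapping
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                out.append(new)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules, w


def expand(symbol, rules):
    """Unfold a symbol back into the words it stands for."""
    if symbol in rules:
        left, right = rules[symbol]
        return expand(left, rules) + expand(right, rules)
    return [symbol]


words = "a b c a b c a b d".split()
seq, rules, w = repair_variant(words)
# For a rule C -> AB, the special hierarchical edge goes from C to B.
context_edges = {c: right for c, (left, right) in rules.items()}
```

On this toy input the pairs (a, b) and then (0, c) reach the threshold, giving a two-level binary hierarchy. Recounting all pairs in every round keeps the sketch short but costs quadratic time; an efficient implementation would maintain pair counts incrementally.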

References

[1] J. Adiego, M.A. Martinez-Prieto, and P. de la Fuente. Edge-Guided Natural Language Text Compression. In SPIRE 2007, LNCS, pages 14-25, 2007.

[2] N.J. Larsson and A. Moffat. Offline Dictionary-Based Compression. Proceedings of the IEEE, 88(11):1722-1732, 2000.

* This work is partially funded by MICINN (grant TIN2009-14009-C02-02), and by a fellowship granted by the Regional Government of Castilla y Leon and the European Social Fund (first and fourth authors).


2010 Data Compression Conference

1068-0314/10 $26.00 © 2010 IEEE

DOI 10.1109/DCC.2010.72

543