[ieee 2010 data compression conference - snowbird, ut, usa (2010.03.24-2010.03.26)] 2010 data...
LOCAL MODELING FOR WEBGRAPH COMPRESSION
Vo Ngoc Anh and Alistair Moffat, The University of Melbourne, Australia
ABSTRACT: We describe a simple hierarchical scheme for webgraph compression, which supports efficient in-memory and from-disk decoding of page neighborhoods, whether those neighborhoods are defined by incoming or by outgoing links. The scheme is highly competitive in terms of both compression effectiveness and decoding speed.
The webgraph of n web pages, where the pages are ordered by their URL strings and then identified as integers from 1 to n, is considered as n adjacency lists. A practical de facto standard for compressing such graphs is the BV method [Boldi and Vigna, 2004], which exploits the similarity, locality, and consecutiveness properties of the adjacency lists.
Here we present a simple technique that is also based on these properties. We first partition the lists into groups of h consecutive lists. A model for a group is built as the union of the group's lists, and is then further reduced by replacing each consecutive sequence (a, a + 1, ..., b) that appears in all h lists by a single new symbol. The integers in each of the h lists in the group are then replaced by ordinal integers pertaining to the model's alphabet. In addition, a list can be represented relative to any of the preceding lists in the same group, by means of an XOR operation, if that representation is smaller than the list itself. This process, with one hierarchical level of modeling, is denoted as s = 1. An s = 2 scheme results if the same process is applied to the set of lists representing the ⌈n/h⌉ models. Finally, all of the lists, including one model list per group, are coded using a clustered code such as interpolative or Exp-Golomb [Witten et al., 1999], or ζ-3 [Boldi and Vigna, 2004].
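The one-level (s = 1) modeling step described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, ordinals are taken to be 0-based positions in the model, the XOR is realized as a symmetric difference of ordinal sets, and "smaller" is judged by the number of integers rather than by coded bits. The final coding stage (interpolative, Exp-Golomb, or ζ-3) and the s = 2 level are omitted.

```python
from typing import List, Optional, Tuple

Symbol = Tuple[int, int]  # inclusive range (a, b); a == b for a single page


def partition(lists: List[List[int]], h: int) -> List[List[List[int]]]:
    """Split n adjacency lists into ceil(n/h) groups of h consecutive lists."""
    return [lists[i:i + h] for i in range(0, len(lists), h)]


def build_model(group: List[List[int]]) -> List[Symbol]:
    """Union of the group's lists; runs of consecutive integers that occur
    in every list of the group are collapsed into one (a, b) symbol."""
    union = sorted(set().union(*group))
    common = set(group[0]).intersection(*group[1:])
    model: List[Symbol] = []
    i = 0
    while i < len(union):
        a = b = union[i]
        if a in common:
            # Extend the run while the next value is consecutive and shared.
            while (i + 1 < len(union) and union[i + 1] == b + 1
                   and union[i + 1] in common):
                i += 1
                b = union[i]
        model.append((a, b))
        i += 1
    return model


def to_ordinals(lst: List[int], model: List[Symbol]) -> List[int]:
    """Replace page numbers by positions in the model's alphabet."""
    members = set(lst)
    return [k for k, (a, _) in enumerate(model) if a in members]


def encode_group(group: List[List[int]]):
    """Return the model plus, per list, (reference index or None, integers)."""
    model = build_model(group)
    ordinals = [to_ordinals(lst, model) for lst in group]
    coded: List[Tuple[Optional[int], List[int]]] = []
    for j, cur in enumerate(ordinals):
        best_ref, best = None, cur
        for r in range(j):  # try every preceding list in the group
            diff = sorted(set(cur) ^ set(ordinals[r]))
            if len(diff) < len(best):  # assumption: size = integer count
                best_ref, best = r, diff
        coded.append((best_ref, best))
    return model, coded


def decode_group(model: List[Symbol], coded) -> List[List[int]]:
    """Invert encode_group: undo the XOR, then expand symbols to page numbers."""
    ordinals: List[List[int]] = []
    lists: List[List[int]] = []
    for ref, body in coded:
        ords = body if ref is None else sorted(set(body) ^ set(ordinals[ref]))
        ordinals.append(ords)
        lists.append([v for k in ords
                      for v in range(model[k][0], model[k][1] + 1)])
    return lists


# Toy group of h = 3 lists; the run 2..4 appears in all of them.
group = [[2, 3, 4, 9], [2, 3, 4, 9, 12], [2, 3, 4, 7, 9]]
model, coded = encode_group(group)
print(model)
print(coded)
```

In this toy group the shared run 2, 3, 4 becomes a single model symbol, and the second and third lists are each stored as a one-integer difference against the first list, which is the kind of saving the scheme relies on.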
Data set        Nodes       Links       Average bits per link
                (millions)  (millions)  BV (∞)  BV (7, 3)  s = 1  s = 2
eu-2005           0.86          19      4.38    5.17       3.81   3.55
sk-2005          50.64       1,949      2.86    3.88       2.61   2.26
webbase-2001    118.14       1,020      3.08    3.74       3.23   2.95
We measured performance using datasets from http://law.dsi.unimi.it. Results for h = 32 are presented, against BV (with R = ∞) as a baseline for compression effectiveness, and against BV (with w = 7, R = 3) as a baseline for decoding speed. The new method provides excellent compression effectiveness relative to both baselines, and supports both memory- and disk-based random decoding. Extended experiments not documented here show that the method is also comparable to BV (7, 3) in terms of decoding speed, and is suitable when the transposed webgraph (or both forms together) is to be stored.
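To make the comparison with the baselines concrete, the relative reduction in bits per link can be computed directly from the figures in the results table. The short sketch below does only that arithmetic; the dictionary layout and the function name `saving` are illustrative choices, and the numbers are copied from the table, not remeasured.

```python
# Average bits per link for h = 32, copied from the results table.
results = {
    "eu-2005":      {"BV(inf)": 4.38, "BV(7,3)": 5.17, "s=1": 3.81, "s=2": 3.55},
    "sk-2005":      {"BV(inf)": 2.86, "BV(7,3)": 3.88, "s=1": 2.61, "s=2": 2.26},
    "webbase-2001": {"BV(inf)": 3.08, "BV(7,3)": 3.74, "s=1": 3.23, "s=2": 2.95},
}


def saving(new: float, base: float) -> float:
    """Relative reduction in bits per link, as a percentage of the baseline.
    A negative value means the new figure is larger than the baseline."""
    return 100.0 * (base - new) / base


for name, row in results.items():
    for scheme in ("s=1", "s=2"):
        for base in ("BV(inf)", "BV(7,3)"):
            pct = saving(row[scheme], row[base])
            print(f"{name:13s} {scheme} vs {base:8s}: {pct:6.1f}%")
```

Read this way, s = 2 uses fewer bits per link than both baselines on all three datasets, while s = 1 comes out slightly above BV (∞) on webbase-2001 and below it on the other two.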
REFERENCES
P. Boldi and S. Vigna. The webgraph framework I: Compression techniques. In Proc. 13th Int. Conf. World Wide Web, pages 595–602, 2004.
I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann, San Francisco, second edition, 1999.
2010 Data Compression Conference
1068-0314/10 $26.00 © 2010 IEEE
DOI 10.1109/DCC.2010.59