[IEEE 2010 Data Compression Conference, Snowbird, UT, USA, March 24-26, 2010]



LOCAL MODELING FOR WEBGRAPH COMPRESSION

Vo Ngoc Anh and Alistair Moffat, The University of Melbourne, Australia

ABSTRACT: We describe a simple hierarchical scheme for webgraph compression, which supports efficient in-memory and from-disk decoding of page neighborhoods, for neighborhoods defined for both incoming and outgoing links. The scheme is highly competitive in terms of both compression effectiveness and decoding speed.

The webgraph of n web pages, where the pages are ordered by their URL strings and then identified as integers from 1 to n, is considered as n adjacency lists. A practical de-facto standard for compression of such a graph is the BV method [Boldi and Vigna, 2004], which exploits the similarity, locality, and consecutiveness properties of the adjacency lists.

Here we present a simple technique that is also based on these properties. We first partition the lists into groups of h consecutive lists. A model for a group is built as the union of the group's lists, and is then further reduced by replacing with a new symbol each consecutive sequence (a, a + 1, . . . , b) that appears in all h lists. Integers in each of the h lists in the group are then replaced by ordinal integers pertaining to the model's alphabet. As well, lists can be represented relative to any of the preceding lists in the same group, by means of an XOR operation, if such a representation is smaller than the list itself. This process, with one hierarchical level of modeling, is denoted as s = 1. An s = 2 scheme results if the same process is applied to the set of lists representing the ⌈n/h⌉ models. Finally, all of the lists, including one model list per group, are coded using a clustered code such as interpolative or Exp-Golomb [Witten et al., 1999], or ζ-3 [Boldi and Vigna, 2004].
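The group-modeling step above can be sketched as follows. This is a minimal Python illustration of one reading of the description; the function names (build_group_model, remap, relative_rep) and data layout are our own assumptions for exposition, not the authors' implementation.

```python
def build_group_model(lists):
    """Build the model for one group of h sorted adjacency lists: the
    union of the lists, with each run a, a+1, ..., b that appears in
    every list collapsed into a single composite (a, b) symbol."""
    union = sorted(set().union(*lists))
    common = set(lists[0]).intersection(*map(set, lists[1:]))
    model = []  # each entry is a plain integer or an (a, b) run symbol
    i = 0
    while i < len(union):
        j = i
        # extend the run only while successive values are consecutive
        # and each of them occurs in all h lists
        while (j + 1 < len(union) and union[j + 1] == union[j] + 1
               and union[j] in common and union[j + 1] in common):
            j += 1
        if j > i:
            model.append((union[i], union[j]))  # composite run symbol
            i = j + 1
        else:
            model.append(union[i])
            i += 1
    return model

def remap(lst, model):
    """Rewrite one sorted list as ordinal positions in the model's
    alphabet; a run symbol absorbs all of its covered values."""
    index = {}
    for pos, sym in enumerate(model):
        if isinstance(sym, tuple):
            for v in range(sym[0], sym[1] + 1):
                index[v] = pos
        else:
            index[sym] = pos
    out = []
    for v in lst:
        p = index[v]
        if not out or out[-1] != p:  # collapse values inside one run
            out.append(p)
    return out

def relative_rep(cur, prev):
    """XOR-style relative form: the symmetric difference of two ordinal
    lists, used only when it is shorter than the list itself."""
    return sorted(set(cur) ^ set(prev))
```

For example, with h = 3 and the group [1,2,3,10], [1,2,3,12], [1,2,3,10,12], the run 1,2,3 occurs in all three lists and becomes one model symbol, so the lists remap to [0,1], [0,2], and [0,1,2] respectively.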

Data set        Nodes       Links        Average bits per link
                (millions)  (millions)   BV (∞)  BV (7, 3)  s = 1  s = 2
eu-2005              0.86          19      4.38      5.17    3.81   3.55
sk-2005             50.64       1,949      2.86      3.88    2.61   2.26
webbase-2001       118.14       1,020      3.08      3.74    3.23   2.95

We measured performance using datasets from http://law.dsi.unimi.it. Results for h = 32 are presented, against BV (with R = ∞) as a baseline for compression effectiveness, and against BV (with w = 7, R = 3) as a baseline for decoding speed. The new method provides excellent compression effectiveness relative to both baselines, and supports both memory- and disk-based random decoding. Extended experiments not documented here show that the method is also comparable to BV (7, 3) in terms of decoding speed, and suitable when the transposed webgraph (or both forms together) are to be stored.

REFERENCES

P. Boldi and S. Vigna. The webgraph framework I: Compression techniques. In Proc. 13th Int. Conf. World Wide Web, pages 595-602, 2004.

I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann, San Francisco, second edition, 1999.

2010 Data Compression Conference

1068-0314/10 $26.00 © 2010 IEEE

DOI 10.1109/DCC.2010.59
