[IEEE 2010 Data Compression Conference, Snowbird, UT, USA, March 24-26, 2010]



LOCAL MODELING FOR WEBGRAPH COMPRESSION

Vo Ngoc Anh and Alistair Moffat, The University of Melbourne, Australia

ABSTRACT: We describe a simple hierarchical scheme for webgraph compression, which supports efficient in-memory and from-disk decoding of page neighborhoods, for neighborhoods defined for both incoming and outgoing links. The scheme is highly competitive in terms of both compression effectiveness and decoding speed.

The webgraph of n web pages, where the pages are ordered by their URL strings and then identified as integers from 1 to n, is considered as n adjacency lists. A practical de-facto standard for compression of such a graph is the BV method [Boldi and Vigna, 2004], which exploits the similarity, locality, and consecutiveness properties of the adjacency lists.

Here we present a simple technique that is also based on these properties. We first partition the lists into groups of h consecutive lists. A model for a group is built as the union of the group's lists, and is then further reduced by replacing with a new symbol each consecutive sequence (a, a + 1, . . . , b) that appears in all h lists. Integers in each of the h lists in the group are then replaced by ordinal integers pertaining to the model's alphabet. As well, lists can be represented relative to any of the preceding lists in the same group, by means of an XOR operation, if such a representation is smaller than the list itself. This process, with one hierarchical level of modeling, is denoted as s = 1. An s = 2 scheme results if the same process is applied to the set of lists representing the ⌈n/h⌉ models. Finally, all of the lists, including one model list per group, are coded using a clustered code such as interpolative or Exp-Golomb [Witten et al., 1999], or ζ-3 [Boldi and Vigna, 2004].
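The group-modeling step above can be sketched as follows. This is a minimal Python illustration of one reading of the description; the function names (build_group_model, remap, relative_rep) and data layout are our own assumptions for exposition, not the authors' implementation.

```python
def build_group_model(lists):
    """Build the model for one group of h sorted adjacency lists: the
    union of the lists, with each run a, a+1, ..., b that appears in
    every list collapsed into a single composite (a, b) symbol."""
    union = sorted(set().union(*lists))
    common = set(lists[0]).intersection(*map(set, lists[1:]))
    model = []  # each entry is a plain integer or an (a, b) run symbol
    i = 0
    while i < len(union):
        j = i
        # extend the run only while successive values are consecutive
        # and each of them occurs in all h lists
        while (j + 1 < len(union) and union[j + 1] == union[j] + 1
               and union[j] in common and union[j + 1] in common):
            j += 1
        if j > i:
            model.append((union[i], union[j]))  # composite run symbol
            i = j + 1
        else:
            model.append(union[i])
            i += 1
    return model

def remap(lst, model):
    """Rewrite one sorted list as ordinal positions in the model's
    alphabet; a run symbol absorbs all of its covered values."""
    index = {}
    for pos, sym in enumerate(model):
        if isinstance(sym, tuple):
            for v in range(sym[0], sym[1] + 1):
                index[v] = pos
        else:
            index[sym] = pos
    out = []
    for v in lst:
        p = index[v]
        if not out or out[-1] != p:  # collapse values inside one run
            out.append(p)
    return out

def relative_rep(cur, prev):
    """XOR-style relative form: the symmetric difference of two ordinal
    lists, used only when it is shorter than the list itself."""
    return sorted(set(cur) ^ set(prev))
```

For example, with h = 3 and the group [1,2,3,10], [1,2,3,12], [1,2,3,10,12], the run 1,2,3 occurs in all three lists and becomes one model symbol, so the lists remap to [0,1], [0,2], and [0,1,2] respectively.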

Data set        Nodes       Links        Average bits per link
                (millions)  (millions)   BV (∞)  BV (7, 3)  s = 1  s = 2
eu-2005              0.86          19      4.38      5.17    3.81   3.55
sk-2005             50.64       1,949      2.86      3.88    2.61   2.26
webbase-2001       118.14       1,020      3.08      3.74    3.23   2.95

We measured performance using datasets from http://law.dsi.unimi.it. Results for h = 32 are presented, against BV (with R = ∞) as a baseline for compression effectiveness, and against BV (with w = 7, R = 3) as a baseline for decoding speed. The new method provides excellent compression effectiveness relative to both baselines, and supports both memory- and disk-based random decoding. Extended experiments not documented here show that the method is also comparable to BV (7, 3) in terms of decoding speed, and suitable when the transposed webgraph (or both forms together) are to be stored.

REFERENCES

P. Boldi and S. Vigna. The webgraph framework I: Compression techniques. In Proc. 13th Int. Conf. World Wide Web, pages 595-602, 2004.

I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann, San Francisco, second edition, 1999.

2010 Data Compression Conference

1068-0314/10 $26.00 © 2010 IEEE

DOI 10.1109/DCC.2010.59
