XML Compression Techniques: Survey and Comparison Angela McCarthy CP5080, SP1 2010

Page 1: Angela McCarthy, CP5080, SP1 2010. Received: 14 August 2008. Revised: 13 November 2008. Written by Sherif Sakr of University of New South Wales, Australia

XML Compression Techniques: Survey and Comparison

Angela McCarthy

CP5080, SP1 2010

Received: 14 August 2008. Revised: 13 November 2008
Written by Sherif Sakr of University of New South Wales, Australia
eXtensible Markup Language (XML) is a standard for data representation over the World Wide Web
XML documents tend to be large, so compression was introduced to deal with the resulting issues
The paper provides a survey of XML compression techniques

Overview

The author examines XML compression techniques and launches a study
◦ Surveys each of the different compression techniques and compares the advantages and disadvantages of each
Data transmitted online is rather large
◦ XML usage is growing, so there is demand for efficient XML compression tools

Introduction

Contributions made:
◦ Comprehensive survey of XML compression techniques
◦ A rich XML corpus collected and constructed
  ◦ Contains a wide variety of XML data sources, natures and document sizes
◦ Detailed results examining performance and characteristics
◦ Work is repeatable
  ◦ Webpage of the study provides access to the test files, the examined XML compressors and the detailed results of the study

Introduction

Each section of the paper covers one of the classifications of compressors
General Text Compressors
◦ Treat XML as plain text and use traditional text compression techniques
XML Conscious Compressors
◦ Take advantage of their awareness of the XML format
◦ Use document structure to achieve better compression rates

Classifications
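As a rough illustration of the general text compressor class, the sketch below feeds an XML document to gzip and bzip2 through Python's standard library. The sample catalog is hypothetical and built only for this example; PPM is left out because the standard library has no PPM implementation. The sizes it prints say nothing about the paper's results.

```python
import bz2
import gzip

# Hypothetical sample document, built only for this illustration.
records = "".join(
    f'<book id="{i}"><title>Title {i}</title><year>{2000 + i % 10}</year></book>'
    for i in range(500)
)
xml_bytes = f'<?xml version="1.0"?><catalog>{records}</catalog>'.encode("utf-8")

# A general text compressor sees only a byte stream; it has no notion of tags.
compressed = {
    "gzip": gzip.compress(xml_bytes),
    "bzip2": bz2.compress(xml_bytes),
}

for name, blob in compressed.items():
    print(f"{name}: {len(xml_bytes)} -> {len(blob)} bytes")

# Round trip: decompression restores the original document exactly.
restored_gzip = gzip.decompress(compressed["gzip"])
restored_bzip2 = bz2.decompress(compressed["bzip2"])
```

Neither call parses the markup, which is the defining trait of this class; an XML conscious compressor would inspect the document structure before encoding.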

Non-Queriable (Archival) XML Compressors
◦ No queries can be processed over the compressed format
◦ Focus is on achieving the highest compression ratio
Queriable XML Compressors
◦ Queries can be processed over the compressed format
◦ Compression ratio is worse than that of archival XML compressors
◦ Focus is on avoiding full document decompression during query execution

Classifications
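The trade-off between the two classes can be sketched with a toy example. This is not the algorithm of any compressor covered by the survey: it simply assumes that an archival scheme compresses the whole document at once, while a queriable scheme keeps independently compressed pieces so that a lookup touches only one of them. The records and the helper names `lookup_archival` and `lookup_chunked` are hypothetical.

```python
import zlib

# Hypothetical records; each is one small XML fragment.
records = [
    f'<item id="{i}"><name>Item {i}</name><price>{i}.50</price></item>'.encode("utf-8")
    for i in range(200)
]

# Archival style: compress the whole document as a single unit.
whole_document = b"<items>" + b"".join(records) + b"</items>"
archival_blob = zlib.compress(whole_document)

# Toy queriable style: compress each record separately so one can be read alone.
chunk_blobs = [zlib.compress(record) for record in records]


def lookup_archival(index: int) -> bytes:
    """Answer a lookup by decompressing the entire document first."""
    document = zlib.decompress(archival_blob)
    body = document[len(b"<items>"):-len(b"</items>")]
    return body.split(b"</item>")[index] + b"</item>"


def lookup_chunked(index: int) -> bytes:
    """Answer a lookup by decompressing only the requested chunk."""
    return zlib.decompress(chunk_blobs[index])


archival_size = len(archival_blob)
chunked_size = sum(len(blob) for blob in chunk_blobs)
print(f"archival: {archival_size} bytes, chunked: {chunked_size} bytes")
```

In this toy setup the chunked form is larger because redundancy across records is no longer shared, but a lookup avoids decompressing the full document.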

Compressor Characteristics

XML Data Sets

Large variety of data sets (see previous slide)
◦ From 0.5MB to 1.3GB
◦ Four categories
  ◦ Structural Documents
  ◦ Textual Documents
  ◦ Regular Documents
  ◦ Irregular Documents
Testing Environments
◦ To ensure consistency, two different environments were used: high vs. low computing resources

XML Testing Corpus

Performance metrics measured and compared
◦ Compression Ratio
  ◦ Ratio between the sizes of the compressed and uncompressed documents
  ◦ Compression Ratio = (Compressed Size) / (Uncompressed Size)
◦ Compression Time
  ◦ Elapsed time during the compression process
◦ Decompression Time
  ◦ Elapsed time during the decompression process
The lower the metric value, the better the compressor

Performance Metrics
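A minimal sketch of how the three metrics could be collected for one compressor, assuming Python's standard `gzip` and `bz2` modules stand in for the evaluated tools. The helper `measure` and the sample document are hypothetical, and timings from a single run like this are only indicative.

```python
import bz2
import gzip
import time
from typing import Callable, Dict


def measure(data: bytes,
            compress: Callable[[bytes], bytes],
            decompress: Callable[[bytes], bytes]) -> Dict[str, float]:
    """Return compression ratio (CR), compression time (CT) and decompression time (DCT)."""
    start = time.perf_counter()
    compressed = compress(data)
    compression_time = time.perf_counter() - start

    start = time.perf_counter()
    restored = decompress(compressed)
    decompression_time = time.perf_counter() - start

    if restored != data:
        raise ValueError("round trip did not restore the original data")

    return {
        "CR": len(compressed) / len(data),  # compressed size / uncompressed size
        "CT": compression_time,             # seconds
        "DCT": decompression_time,          # seconds
    }


# Hypothetical test document, not taken from the paper's corpus.
sample = ("<log>" + "<entry level='info'>service started</entry>" * 2000 + "</log>").encode("utf-8")

results = {
    "gzip": measure(sample, gzip.compress, gzip.decompress),
    "bzip2": measure(sample, bz2.compress, bz2.decompress),
}
for name, metrics in results.items():
    print(name, metrics)
```

With the ratio defined as compressed size over uncompressed size, all three values follow the same rule: lower is better.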

11 XML compressors evaluated
◦ Three general purpose text compressors
  ◦ Gzip, bzip2, PPM
◦ Eight XML conscious compressors
  ◦ XMillGzip, XMillBzip, XMillPPM, XMLPPM, SCMPPM, XWRT, AXECHOP
◦ Compressors evaluated under default settings
◦ Additional experiments run with parameters tuned for the highest level of compression
◦ In total, 16 compressor variants

Framework

Ideally, the study would provide a global ranking of XML compression tools
Results show there is no clear winner
◦ The ranking depends on the weight given to each metric
Three ranking functions
◦ WF1 = (1/3 ∗ CR) + (1/3 ∗ CT) + (1/3 ∗ DCT)
◦ WF2 = (1/2 ∗ CR) + (1/4 ∗ CT) + (1/4 ∗ DCT)
◦ WF3 = (3/5 ∗ CR) + (1/5 ∗ CT) + (1/5 ∗ DCT)
CR represents the compression ratio metric, CT represents the compression time metric and DCT represents the decompression time metric

Results
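The three ranking functions can be worked through on invented numbers to show why the weighting matters. In the sketch below the metric values are made up and are not results from the paper, and they are assumed to be already normalised to a comparable 0 to 1 scale, since the slides do not say how the metrics were scaled. The names `compressor_a` and `compressor_b` are placeholders.

```python
from fractions import Fraction
from typing import Dict, List, Tuple

# Weights (CR, CT, DCT) copied from the three ranking functions.
WEIGHTS: Dict[str, Tuple[Fraction, Fraction, Fraction]] = {
    "WF1": (Fraction(1, 3), Fraction(1, 3), Fraction(1, 3)),
    "WF2": (Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)),
    "WF3": (Fraction(3, 5), Fraction(1, 5), Fraction(1, 5)),
}

# Made-up metric values (NOT results from the paper).
# Assumption: each metric is already normalised to a comparable 0..1 scale.
metrics: Dict[str, Dict[str, float]] = {
    "compressor_a": {"CR": 0.10, "CT": 0.50, "DCT": 0.50},  # small output, slow
    "compressor_b": {"CR": 0.40, "CT": 0.30, "DCT": 0.30},  # larger output, fast
}


def score(values: Dict[str, float], weights: Tuple[Fraction, Fraction, Fraction]) -> float:
    """Weighted sum of the three metrics; lower is better."""
    w_cr, w_ct, w_dct = (float(w) for w in weights)
    return w_cr * values["CR"] + w_ct * values["CT"] + w_dct * values["DCT"]


def rank(function_name: str) -> List[str]:
    """Order compressors from best (lowest score) to worst."""
    weights = WEIGHTS[function_name]
    return sorted(metrics, key=lambda name: score(metrics[name], weights))


rankings = {name: rank(name) for name in WEIGHTS}
for name, order in rankings.items():
    print(name, order)
```

With these invented values the equal-weight function WF1 puts `compressor_b` first, while WF2 and WF3, which weight compression ratio more heavily, put `compressor_a` first.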

Compression Ratio

Compression Time

Decompression Time

Paper surveyed state-of-the-art XML compression techniques
Reported the behaviour of various XML compressors using a large corpus of XML documents
Paper could be valuable for
◦ Developers of new XML compression tools
◦ Users deciding on the most suitable compressor for their requirements
Fig. 7 shows that none of the XML conscious compressors achieved an outstanding compression ratio

Conclusions

Average Compression Ratios

The author plans to continue maintaining and updating the webpage of the study with further evaluations
The webpage is to enable visitors to perform online experiments using the set of available compressors and their own XML documents

Future Work

Large number of references
◦ Due to the different compression techniques used
Large amount of data
Thorough in research methods
◦ Large amount of data tested
◦ Tested on different systems
◦ Tested using different techniques
Abbreviations/acronyms given
◦ Designed for a specific audience
Paper seems to be a reference tool
◦ Users can read it to help decide which compression tool to use

Metadata

Thanks for listening!

Questions?