TREECHOP: A Tree-based Query-able Compressor for XML

Gregory Leighton, Tomasz Müldner, James Diamond

Acadia University

June 6, 2005

Outline

XML

TREECHOP
    Compression Strategy
    Decompression Strategy
    Querying Strategy

Experimental Results

Conclusions

Extensible Markup Language (XML)

What is it?
    A standard for semi-structured data representation introduced in 1998
    Data is surrounded by markup tokens (elements and attributes) used to indicate semantic meaning

Characteristics?
    Verbose (often 5 – 10 times larger than alternative formats like CSV)
    Lots of repetition… plenty of opportunities for data compression

Example XML Document

    <?xml version="1.0" encoding="UTF-8"?>
    <!-- start of PO -->
    <PurchaseOrder no="1456">
      <Date>06/05/05</Date>
      <CustomerID>765345</CustomerID>
      <Order>
        <Item>
          <ProductNo>P-4534</ProductNo>
          <Quantity>2</Quantity>
        </Item>
        <Item>
          <ProductNo>P-9182</ProductNo>
          <Quantity>1</Quantity>
        </Item>
      </Order>
    </PurchaseOrder>
    <!-- end of PO -->

Callouts on the slide: root element, attribute, data value, comment

TREECHOP: Compression Strategy

Parsing splits the document into three segments:
    Prologue: stores text occurring before the document’s root element
    Document Tree: contains all document contents between and including the root element start and end tags
    Epilogue: stores text occurring after the document’s root element




[Figure: the example document divided into Prologue, Document Tree, and Epilogue]
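The three-way split can be sketched as follows. This is a minimal illustration, not the authors' parser: it assumes Python's `xml.parsers.expat` module, and the function name `split_document` is hypothetical. It relies on expat reporting the byte offset of the token that triggered each callback, and it assumes the root element has a separate end tag.

```python
import xml.parsers.expat


def split_document(data: bytes):
    """Return (prologue, document_tree, epilogue) as byte strings."""
    parser = xml.parsers.expat.ParserCreate()
    state = {"depth": 0, "start": None, "end": None}

    def on_start(name, attrs):
        if state["depth"] == 0 and state["start"] is None:
            # Byte offset of the '<' that opens the root start tag.
            state["start"] = parser.CurrentByteIndex
        state["depth"] += 1

    def on_end(name):
        state["depth"] -= 1
        if state["depth"] == 0:
            # Offset of the root end tag; the tag finishes at the next '>'.
            state["end"] = data.index(b">", parser.CurrentByteIndex) + 1

    parser.StartElementHandler = on_start
    parser.EndElementHandler = on_end
    parser.Parse(data, True)
    start, end = state["start"], state["end"]
    return data[:start], data[start:end], data[end:]
```

Concatenating the three returned segments reproduces the input byte for byte, which is what lets the prologue and epilogue be stored as plain text alongside the encoded tree.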

Document Tree

Root node corresponds to document’s root element

Character data segments are represented using leaf nodes

XML markup is represented using non-leaf nodes; 5 types of non-leaf nodes: element, attribute, CDATA, comment, processing instruction

Document Tree Generation

1. Get next token from XML parser

2. Construct tree node from token

3. Write tree node to compression stream

(The three steps repeat for each token.)

Document Tree Nodes

Each node in the tree has an associated label value, L:
    Element node: name of the element
    Attribute node: ‘@’ + name of the attribute
    Comment, CDATA, processing instruction nodes: all text between delimiting section markers

The path for a node v_n is /L_1/L_2/…/L_n, where the route connecting the root node v_1 with v_n consists of nodes v_1, v_2, …, v_n and L_i is the label for node v_i
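As an illustration of labels and paths, the sketch below walks a document with a streaming parser and keeps a stack of labels from the root to the current element. It is an assumption-laden sketch rather than the TREECHOP implementation: it uses Python's `xml.parsers.expat`, the name `node_paths` is hypothetical, and only element, attribute, and comment nodes are handled.

```python
import xml.parsers.expat


def node_paths(xml_text: str):
    """Return (path, node_type) pairs for non-leaf nodes in document order."""
    parser = xml.parsers.expat.ParserCreate()
    stack = []  # labels on the route from the root to the current element
    out = []

    def path_of(label):
        return "/" + "/".join(stack + [label])

    def on_start(name, attrs):
        out.append((path_of(name), "element"))
        stack.append(name)
        for attr_name in attrs:
            # Attribute label is '@' + attribute name, as on the slide.
            out.append((path_of("@" + attr_name), "attribute"))

    def on_end(name):
        stack.pop()

    def on_comment(text):
        if stack:  # comments outside the root belong to the prologue/epilogue
            out.append((path_of(text), "comment"))

    parser.StartElementHandler = on_start
    parser.EndElementHandler = on_end
    parser.CommentHandler = on_comment
    parser.Parse(xml_text, True)
    return out
```

Because each node is reported as soon as its token arrives, nothing beyond the current root-to-node stack has to be held in memory.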

Codeword Generation

A binary codeword is assigned to each non-leaf node, based on the node path
    Multiple nodes with an identical path are assigned the same codeword

Codeword is used during decompression and querying operations to identify the value and type of each node

Codeword Generation

The codeword C(v) assigned to a non-leaf node v with parent node p is formed by the concatenation of three codes:
    C(p): the codeword assigned to p
    G(v): Golomb code assigned to v based on its ordering relative to p
    T(v): a sequence of 3 bits used to indicate node type




Example Document Tree

[Figure: document tree for the example purchase order]

Node Path                              C(v)
/PurchaseOrder                         00000
/PurchaseOrder/@no                     0000000001
/PurchaseOrder/Date                    00000010000
/PurchaseOrder/CustomerID              00000011000
/PurchaseOrder/Order                   00000100000
/PurchaseOrder/Order/Item              0000010000000000
/PurchaseOrder/Order/Item/ProductNo    000001000000000000000
/PurchaseOrder/Order/Item/Quantity     0000010000000000010000

Codeword Assignment

Each codeword is shown in three portions:
    C(p) – portion inherited from parent node
    G(v) – portion assigned based on Golomb code
    T(v) – portion used to indicate node type
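The concatenation C(v) = C(p) + G(v) + T(v) can be sketched as below. Several details are assumptions inferred from the example codeword table, not statements from the slides: a Golomb parameter of 3, child ordering by first appearance of each distinct path (counting from 0), and the type bits 000 for elements and 001 for attributes. The bit patterns for the other three node types cannot be recovered from the slides and are left out. The names `golomb` and `CodewordTable` are hypothetical.

```python
TYPE_BITS = {"element": "000", "attribute": "001"}  # inferred from the example table


def golomb(n: int, m: int = 3) -> str:
    """Golomb code of n >= 0 for m >= 2: unary quotient, truncated-binary remainder."""
    q, r = divmod(n, m)
    b = (m - 1).bit_length()  # ceil(log2(m))
    cutoff = (1 << b) - m
    if r < cutoff:
        rem = format(r, "b").zfill(b - 1)
    else:
        rem = format(r + cutoff, "b").zfill(b)
    return "1" * q + "0" + rem


class CodewordTable:
    def __init__(self):
        self.codes = {}     # path -> codeword
        self.children = {}  # parent path -> number of distinct child paths seen

    def codeword(self, path: str, node_type: str) -> str:
        if path in self.codes:  # identical paths share one codeword
            return self.codes[path]
        parent = path.rsplit("/", 1)[0]  # "" for the root
        index = self.children.get(parent, 0)
        self.children[parent] = index + 1
        code = self.codes.get(parent, "") + golomb(index) + TYPE_BITS[node_type]
        self.codes[path] = code
        return code
```

Under these assumptions, feeding the paths of the example purchase order in document order yields the codewords listed in the example table, for instance `00000` for `/PurchaseOrder` and `0000000001` for `/PurchaseOrder/@no`.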

TREECHOP: Writing the Tree

Encoded tree is written to compression stream in depth-first order; gzip is applied to further compress the encoded tree

Non-leaf nodes: written as a 3-tuple (L, C, D)
    L is a byte giving the bit length of the codeword (a length here, not the node label L defined earlier)
    C is a sequence of ⌈L / 8⌉ bytes containing the codeword
    D is the node’s label (e.g. element/attribute name); reserved byte values are used to signal the beginning/end of a sequence of raw character data

TREECHOP: Writing the Tree

On second and subsequent occurrences of a particular codeword, only the 2-tuple (L, C) is written (decoder is able to infer associated D)

Leaf nodes are transmitted in same manner as D value for non-leaf nodes

Each node encoding is transmitted immediately after node construction – avoids necessity of building entire tree in memory
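A minimal sketch of the tuple layout follows. The reserved byte values `0xFE` and `0xFF` are hypothetical (the slides do not name them); they are chosen here because neither byte can occur in UTF-8 text. Right-padding the codeword with zero bits is also an assumption, as are the names `pack_bits` and `NodeWriter`.

```python
import gzip
import io

START, END = 0xFE, 0xFF  # hypothetical reserved bytes marking raw character data


def pack_bits(code: str) -> bytes:
    """Pack a non-empty bit string into ceil(len/8) bytes, zero-padded on the right."""
    padded = code.ljust((len(code) + 7) // 8 * 8, "0")
    return int(padded, 2).to_bytes(len(padded) // 8, "big")


def raw(text: str) -> bytes:
    return bytes([START]) + text.encode("utf-8") + bytes([END])


class NodeWriter:
    def __init__(self, out):
        self.out = out
        self.seen = set()  # codewords whose label D was already written

    def write_node(self, code: str, label: str):
        if not 0 < len(code) < START:
            raise ValueError("codeword length must fit below the reserved bytes")
        self.out.write(bytes([len(code)]) + pack_bits(code))  # (L, C)
        if code not in self.seen:  # first occurrence: append D
            self.seen.add(code)
            self.out.write(raw(label))

    def write_leaf(self, text: str):
        self.out.write(raw(text))  # same framing as a D value


buf = io.BytesIO()
writer = NodeWriter(buf)
writer.write_node("0000010000000000", "Item")  # written as (L, C, D)
writer.write_node("0000010000000000", "Item")  # written as (L, C) only
compressed = gzip.compress(buf.getvalue())     # gzip pass over the encoded tree
```

Each call writes its bytes immediately, so the writer holds only the set of codewords already seen rather than the tree itself.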

TREECHOP: Decompression Strategy

Decoder operates by reading node data from the compression stream. For each non-leaf node:

1. Determine D value

2. Determine node type

3. Surround D with XML syntax appropriate to the node type and immediately emit to the decompression stream
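The three decoder steps might look like the sketch below. It is a simplification under stated assumptions: the type bits (000 for element, 001 for attribute) are inferred from the example codeword table, the bit patterns for the other node types are unknown and so are not mapped, and the element template leaves the start tag open so that attribute nodes can follow. The names `LabelTable` and `open_markup` are hypothetical.

```python
TYPE_OF_BITS = {"000": "element", "001": "attribute"}  # inferred, incomplete

TEMPLATES = {
    "element": "<{d}",            # '>' is emitted later, after any attributes
    "attribute": ' {d}="',        # closing quote follows the attribute value
    "comment": "<!--{d}-->",
    "cdata": "<![CDATA[{d}]]>",
    "pi": "<?{d}?>",
}


class LabelTable:
    def __init__(self):
        self.labels = {}  # codeword -> D

    def resolve(self, code, label=None):
        """Step 1: take D from the stream on first sight, else from memory."""
        if label is not None:
            self.labels[code] = label
        return self.labels[code]


def node_type(code: str) -> str:
    """Step 2: the last 3 bits of the codeword give the node type."""
    return TYPE_OF_BITS[code[-3:]]


def open_markup(table: LabelTable, code: str, label=None) -> str:
    """Step 3: wrap D in the syntax for its node type."""
    d = table.resolve(code, label)
    kind = node_type(code)
    if kind == "attribute":
        d = d[1:]  # drop the leading '@' of the attribute label
    return TEMPLATES[kind].format(d=d)
```

For example, the first `PurchaseOrder` node yields `<PurchaseOrder`, and a later node carrying only the codeword `00000` yields the same text because its D value is looked up rather than re-read.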

TREECHOP: Querying Strategy

An individual query handler is registered with the decoder for each query

Single scan of compression stream is carried out, using a stack to keep track of current path

When query predicate path is matched, the current codeword is recorded and remainder of compression stream is scanned for future occurrences

Each time a query match is encountered, the associated D value is extracted from the compression stream and passed to the query handler for processing
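The single-scan strategy can be sketched as follows, assuming the stream has already been parsed into node events (codeword plus optional label) and leaf events (text). The class name `QueryScanner` and its methods are hypothetical. The stack is maintained from codewords alone, on the assumption that a stacked codeword is an ancestor of the new node exactly when it is a proper prefix of the new codeword, which follows from C(v) beginning with C(p) and Golomb codes being prefix-free.

```python
class QueryScanner:
    def __init__(self):
        self.by_path = {}     # query path -> handler
        self.by_code = {}     # codeword -> handler, recorded once a path matches
        self.checked = set()  # codewords already compared against the queries
        self.labels = {}      # codeword -> label D
        self.stack = []       # codewords from the root to the current node

    def register(self, path, handler):
        self.by_path[path] = handler

    def node(self, code, label=None):
        if label is not None:
            self.labels[code] = label
        # Pop until the top of the stack is a proper prefix (an ancestor).
        while self.stack and not (
            len(self.stack[-1]) < len(code) and code.startswith(self.stack[-1])
        ):
            self.stack.pop()
        self.stack.append(code)
        if code not in self.checked:  # build the path only once per codeword
            self.checked.add(code)
            path = "/" + "/".join(self.labels[c] for c in self.stack)
            if path in self.by_path:
                self.by_code[code] = self.by_path[path]

    def leaf(self, text):
        handler = self.by_code.get(self.stack[-1]) if self.stack else None
        if handler is not None:
            handler(text)


hits = []
scanner = QueryScanner()
scanner.register("/PurchaseOrder/Order/Item/ProductNo", hits.append)
scanner.node("00000", "PurchaseOrder")
scanner.node("00000100000", "Order")
scanner.node("0000010000000000", "Item")
scanner.node("000001000000000000000", "ProductNo")
scanner.leaf("P-4534")
scanner.node("0000010000000000")       # second Item: codeword only
scanner.node("000001000000000000000")  # second ProductNo: matched by codeword
scanner.leaf("P-9182")
```

After the first match, later occurrences are recognised by a dictionary lookup on the codeword, so the path string is never rebuilt for a codeword that has already been checked.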

Experimental Results: Compression Rates

[Bar chart: compression rate (bpc, axis 0 to 4) for documents A, B, C, and D; series: TREECHOP, gzip, XGRIND]

Document         File Size (KB)   Elements   Attributes   Data
(A) Baseball     788              27080      0            230970
(B) Macbeth      175              3975       0            97625
(C) 150emp       26               901        150          8277
(D) 100000emp    16831            600001     100000       5534311

Experimental Results: Compression/Decompression Speed

[Line chart: transmission time (msec, axis 0 to 25000) against document size (KB, 2 to 1000); series: Raw XML, TREECHOP, GZIP]

Distance between sender/receiver: 20 km / 12 miles

Experimental Results: Querying

[Line chart: query execution time (msec, axis 0 to 30000) against XML document size (KB, 2 to 1000); series: GZIP/XSLT, TREECHOP, Raw XML/XSLT]

Distance between sender/receiver: 20 km / 12 miles

Conclusions

TREECHOP compresses at rates comparable to gzip, while also providing query-friendly annotations to the compression stream

Using TREECHOP querying in place of alternative methods like XSLT yields a significant performance advantage on medium- to large-sized XML documents; advantage increases with document size
