TREECHOP: A Tree-based Query-able Compressor for XML
Gregory Leighton, Tomasz Müldner, James Diamond
Acadia University
June 6, 2005
Outline
XML
TREECHOP
  Compression Strategy
  Decompression Strategy
  Querying Strategy
Experimental Results
Conclusions
Extensible Markup Language (XML)
What is it?
  A standard for semi-structured data representation, introduced in 1998
  Data is surrounded by markup tokens (elements and attributes) used to indicate semantic meaning
Characteristics?
  Verbose (often 5–10 times larger than alternative formats like CSV)
  Lots of repetition… plenty of opportunities for data compression
Example XML Document
<?xml version="1.0" encoding="UTF-8"?>
<!-- start of PO -->
<PurchaseOrder no="1456">
  <Date>06/05/05</Date>
  <CustomerID>765345</CustomerID>
  <Order>
    <Item>
      <ProductNo>P-4534</ProductNo>
      <Quantity>2</Quantity>
    </Item>
    <Item>
      <ProductNo>P-9182</ProductNo>
      <Quantity>1</Quantity>
    </Item>
  </Order>
</PurchaseOrder>
<!-- end of PO -->
(Slide callouts mark the root element, an attribute, a data value, and a comment.)
TREECHOP: Compression Strategy
Parsing splits document into three segments:
  Prologue: stores text occurring before document’s root element
  Document Tree: contains all document contents between and including root element start and end tags
  Epilogue: stores text occurring after document’s root element
Example XML Document (same purchase order document as before, with its three segments marked)
  Prologue: the XML declaration and the "start of PO" comment
  Document Tree: everything from the PurchaseOrder start tag through the PurchaseOrder end tag
  Epilogue: the "end of PO" comment
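A minimal sketch of the three-way split, in Python. The function name split_document and the regular expression are illustrative assumptions, not part of TREECHOP. The sketch also assumes the root element is not self-closing and that no comment in the prologue contains a '<' followed by a letter.

```python
import re

def split_document(xml_text):
    """Split XML text into (prologue, document_tree, epilogue).

    Simplified illustration: the root is taken to be the first '<' that is
    followed by a name character, which skips '<?...?>' and '<!...>' markup.
    """
    m = re.search(r"<([A-Za-z_][\w.\-:]*)", xml_text)
    if m is None:
        raise ValueError("no root element found")
    root_name = m.group(1)
    # The last end tag with the root's name closes the document tree.
    close_at = xml_text.rfind("</" + root_name)
    if close_at == -1:
        raise ValueError("root end tag not found")
    end = xml_text.index(">", close_at) + 1
    return xml_text[:m.start()], xml_text[m.start():end], xml_text[end:]

doc = ('<?xml version="1.0" encoding="UTF-8"?>\n'
       '<!-- start of PO -->\n'
       '<PurchaseOrder no="1456"><Date>06/05/05</Date></PurchaseOrder>\n'
       '<!-- end of PO -->')
prologue, tree, epilogue = split_document(doc)
```

Concatenating the three parts gives back the original text, so nothing outside the root element is lost.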
Document Tree
Root node corresponds to document’s root element
Character data segments are represented using leaf nodes
XML markup is represented using non-leaf nodes; 5 types of non-leaf nodes: element, attribute, CDATA, comment, processing instruction
Document Tree Generation
1. Get next token from XML parser
2. Construct tree node from token
3. Write tree node to compression stream
Document Tree Nodes
Each node in the tree has an associated label value, L
  Element node: name of the element
  Attribute node: ‘@’ + name of the attribute
  Comment, CDATA, processing instruction nodes: all text between delimiting section markers
The path for a node vn is /L1/L2…/Ln, where the route connecting the root node v1 with vn consists of nodes v1, v2, …, vn and Li is the label for node vi
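The token loop and the path definition can be sketched with Python's standard SAX parser, keeping a stack of labels for the open elements. The class PathHandler and function node_paths are hypothetical names. The sketch covers element and attribute nodes only and collects paths in a list, where a real encoder would write each node out immediately.

```python
import xml.sax

class PathHandler(xml.sax.ContentHandler):
    """Records the path /L1/L2.../Ln of each element and attribute node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # labels of the currently open elements
        self.paths = []   # node paths in document (depth-first) order

    def startElement(self, name, attrs):
        self.stack.append(name)               # element label: its name
        here = "/" + "/".join(self.stack)
        self.paths.append(here)
        for attr_name in attrs.getNames():    # attribute label: '@' + name
            self.paths.append(here + "/@" + attr_name)

    def endElement(self, name):
        self.stack.pop()

def node_paths(xml_text):
    handler = PathHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.paths

paths = node_paths(
    '<PurchaseOrder no="1456"><Date>06/05/05</Date>'
    '<Order><Item><ProductNo>P-4534</ProductNo></Item></Order>'
    '</PurchaseOrder>'
)
```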
Codeword Generation
A binary codeword is assigned to each non-leaf node, based on node path
  Multiple nodes with identical path are assigned the same codeword
Codeword is used during decompression and querying operations to identify the value and type of each node
Codeword Generation
The codeword C(v) assigned to a non-leaf node v with parent node p is formed by the concatenation of three codes:
  C(p): the codeword assigned to p
  G(v): Golomb code assigned to v based on its ordering relative to p
  T(v): a sequence of 3 bits used to indicate node type
Example Document Tree
Node Path                              C(v)
/PurchaseOrder                         00000
/PurchaseOrder/@no                     0000000001
/PurchaseOrder/Date                    00000010000
/PurchaseOrder/CustomerID              00000011000
/PurchaseOrder/Order                   00000100000
/PurchaseOrder/Order/Item              0000010000000000
/PurchaseOrder/Order/Item/ProductNo    000001000000000000000
/PurchaseOrder/Order/Item/Quantity     0000010000000000010000
Codeword Assignment
C(p) – portion inherited from parent node
G(v) – portion assigned based on Golomb code
T(v) – portion used to indicate node type
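The codewords in the example can be reproduced with the short Python sketch below. Several details are assumptions inferred from the example values rather than stated on the slides: a Golomb parameter of m = 3, child positions numbered from 0 in order of first appearance (with attributes counted alongside child elements), and the type bits 000 for elements and 001 for attributes. Type bits for the other node types are not shown in the source and are left out.

```python
def golomb(n, m=3):
    """Golomb code of a non-negative integer n as a bit string.

    m = 3 is an assumption inferred from the example codewords.
    The quotient is written in unary (ones ended by a zero), the
    remainder in truncated binary.
    """
    if m < 2:
        raise ValueError("this sketch requires m >= 2")
    q, r = divmod(n, m)
    unary = "1" * q + "0"
    b = (m - 1).bit_length()          # ceil(log2(m)) for m >= 2
    cutoff = (1 << b) - m             # remainders below cutoff use b - 1 bits
    if r < cutoff:
        rem = format(r, "0{}b".format(b - 1))
    else:
        rem = format(r + cutoff, "0{}b".format(b))
    return unary + rem

# Only these two type codes can be read off the example table.
TYPE_BITS = {"element": "000", "attribute": "001"}

def codeword(parent_code, index, node_type):
    """C(v) = C(p) + G(v) + T(v); index is v's position under p (assumed 0-based)."""
    return parent_code + golomb(index) + TYPE_BITS[node_type]

root = codeword("", 0, "element")          # /PurchaseOrder
no = codeword(root, 0, "attribute")        # /PurchaseOrder/@no
date = codeword(root, 1, "element")        # /PurchaseOrder/Date
customer = codeword(root, 2, "element")    # /PurchaseOrder/CustomerID
order = codeword(root, 3, "element")       # /PurchaseOrder/Order
item = codeword(order, 0, "element")       # /PurchaseOrder/Order/Item
product = codeword(item, 0, "element")     # .../Item/ProductNo
quantity = codeword(item, 1, "element")    # .../Item/Quantity
```

Because a codeword depends only on the node path, the second Item element and its children receive the same codewords as the first.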
TREECHOP: Writing the Tree
Encoded tree is written to compression stream in depth-first order; gzip is applied to further compress the encoded tree
Non-leaf nodes: written as 3-tuple (L, C, D)
  L is a byte indicating the bit length of the codeword
  C is a sequence of ⌈L / 8⌉ bytes containing the codeword
  D is the node’s label (e.g. element/attribute name); reserved byte values are used to signal the beginning/end of a sequence of raw character data
TREECHOP: Writing the Tree
On second and subsequent occurrences of a particular codeword, only the 2-tuple (L, C) is written (decoder is able to infer associated D)
Leaf nodes are transmitted in same manner as D value for non-leaf nodes
Each node encoding is transmitted immediately after node construction – avoids necessity of building entire tree in memory
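A rough Python sketch of the tuple layout follows. The slides do not give the reserved byte values, the text encoding, or how codeword bits are padded to whole bytes, so the values 0x01 and 0x02, UTF-8, and zero-padding on the right are all assumptions here. TreeWriter and its methods are hypothetical names, and the sketch buffers output in memory where a real encoder would write straight to the stream.

```python
import gzip

START, END = 0x01, 0x02   # hypothetical reserved bytes around raw character data

def pack_bits(bits):
    """Pack a bit string into bytes, zero-padded on the right (assumed)."""
    padded = bits + "0" * (-len(bits) % 8)
    return bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))

def raw(text):
    """Character data framed by the reserved start/end bytes."""
    return bytes([START]) + text.encode("utf-8") + bytes([END])

class TreeWriter:
    def __init__(self):
        self.seen = set()        # codewords whose label D was already written
        self.out = bytearray()

    def write_node(self, code, label):
        """Write (L, C, D) on first occurrence of a codeword, else (L, C)."""
        if len(code) > 255:
            raise ValueError("L must fit in one byte")
        self.out.append(len(code))       # L: bit length of the codeword
        self.out += pack_bits(code)      # C: ceil(L / 8) bytes
        if code not in self.seen:
            self.seen.add(code)
            self.out += raw(label)       # D: the node's label

    def write_leaf(self, text):
        """Leaf nodes are written the same way as a D value."""
        self.out += raw(text)

w = TreeWriter()
w.write_node("00000", "PurchaseOrder")
w.write_node("0000010000000000", "Item")
w.write_node("0000010000000000", "Item")   # repeat: only (L, C) is written
compressed = gzip.compress(bytes(w.out))    # gzip applied to the encoded tree
```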
TREECHOP: Decompression Strategy
Decoder operates by reading node data from the compression stream. For each non-leaf node:
1. Determine D value
2. Determine node type
3. Surround D with XML syntax appropriate to the node type and immediately emit to the decompression stream
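Steps 2 and 3 might look like the Python sketch below, which reads the node type from the last three bits of the codeword and wraps D accordingly. Only the element (000) and attribute (001) type bits appear in the example codewords; the codes used here for comment, CDATA, and processing instruction nodes are placeholders chosen for illustration. The function names are hypothetical, and closing of elements is not shown.

```python
from xml.sax.saxutils import quoteattr

NODE_TYPES = {
    "000": "element",      # from the example codewords
    "001": "attribute",    # from the example codewords
    "010": "comment",      # placeholder, not given in the slides
    "011": "cdata",        # placeholder, not given in the slides
    "100": "pi",           # placeholder, not given in the slides
}

def node_type(code):
    """T(v) is the final 3 bits of the codeword."""
    return NODE_TYPES[code[-3:]]

def markup(code, d, value=""):
    """Surround D with the XML syntax for its node type."""
    kind = node_type(code)
    if kind == "element":
        return "<" + d + ">"
    if kind == "attribute":
        # The label is '@' + name; value is the attribute's character data.
        return " " + d[1:] + "=" + quoteattr(value)
    if kind == "comment":
        return "<!--" + d + "-->"
    if kind == "cdata":
        return "<![CDATA[" + d + "]]>"
    return "<?" + d + "?>"

start_tag = markup("00000", "PurchaseOrder")
attribute = markup("0000000001", "@no", "1456")
comment = markup("00010", " start of PO ")   # uses the placeholder type code
```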
TREECHOP: Querying Strategy
An individual query handler is registered with the decoder for each query
Single scan of compression stream is carried out, using a stack to keep track of current path
When query predicate path is matched, the current codeword is recorded and remainder of compression stream is scanned for future occurrences
Each time a query match is encountered, the associated D value is extracted from the compression stream and passed to the query handler for processing
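The single-scan strategy can be sketched in Python over already-decoded records. The record layout (code, label, value) and the name run_query are assumptions for illustration. So is the way the stack is maintained: since C(v) begins with C(p), the sketch pops the stack until the top codeword is a proper prefix of the incoming one. The slides say only that a stack tracks the current path.

```python
def run_query(records, query_path, handler):
    """Scan records once, passing the data of each matching node to handler.

    records: (code, label, value) tuples for non-leaf nodes in depth-first
    order; label is None on repeat occurrences of a codeword, value is the
    node's character data or None. This layout is an assumption.
    """
    labels = {}      # codeword -> label, learned at first occurrence
    stack = []       # codewords of the currently open ancestors
    target = None    # codeword recorded once the query path is matched
    for code, label, value in records:
        if label is not None:
            labels[code] = label
        # Pop until the top of the stack is a proper prefix (an ancestor).
        while stack and not (len(stack[-1]) < len(code)
                             and code.startswith(stack[-1])):
            stack.pop()
        stack.append(code)
        if target is None:
            path = "/" + "/".join(labels[c] for c in stack)
            if path == query_path:
                target = code          # later matches need only this codeword
        if code == target:
            handler(value)

records = [
    ("00000", "PurchaseOrder", None),
    ("0000000001", "@no", "1456"),
    ("00000010000", "Date", "06/05/05"),
    ("00000011000", "CustomerID", "765345"),
    ("00000100000", "Order", None),
    ("0000010000000000", "Item", None),
    ("000001000000000000000", "ProductNo", "P-4534"),
    ("0000010000000000010000", "Quantity", "2"),
    ("0000010000000000", None, None),
    ("000001000000000000000", None, "P-9182"),
    ("0000010000000000010000", None, "1"),
]
found = []
run_query(records, "/PurchaseOrder/Order/Item/ProductNo", found.append)
```

After the first match, each later occurrence is detected by comparing codewords alone, with no need to rebuild the path string.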
Experimental Results: Compression Rates
[Figure: compression rate (bpc, axis from 0 to 4) for documents A, B, C, and D, comparing TREECHOP, gzip, and XGRIND]
File             Size (KB)   Elements   Attributes   Data
(A) Baseball     788         27080      0            230970
(B) Macbeth      175         3975       0            97625
(C) 150emp       26          901        150          8277
(D) 100000emp    16831       600001     100000       5534311
Experimental Results: Compression/Decompression Speed
[Figure: transmission time (msec, axis from 0 to 25000) versus document size (KB, from 2 to 1000) for Raw XML, TREECHOP, and GZIP]
Distance between sender/receiver: 20 km / 12 miles
Experimental Results: Querying
[Figure: query execution time (msec, axis from 0 to 30000) versus XML document size (KB, from 2 to 1000) for GZIP/XSLT, TREECHOP, and Raw XML/XSLT]
Distance between sender/receiver: 20 km / 12 miles
Conclusions
TREECHOP compresses at rates comparable to gzip, while also providing query-friendly annotations to the compression stream
Using TREECHOP querying in place of alternative methods like XSLT yields a significant performance advantage on medium- to large-sized XML documents; advantage increases with document size