xml::parent - yet another way to store xml files

XML::XParent: Another way to store XML elements... Marco Masetti (grubert)

Upload: marco-masetti

Post on 04-Jul-2015


DESCRIPTION

XParent is a simple SQL schema to store XML elements. XML::XParent is a Perl module that provides an API to store XML files in an XParent data store and to retrieve XML elements from it.

TRANSCRIPT

Page 1: Xml::parent - Yet another way to store XML files

XML::XParent
Another way to store XML elements...

Marco Masetti (grubert)

Page 2: Xml::parent - Yet another way to store XML files

Ways of storing XML files

• Plain files, simple scripts to perform XPath queries
  – Trivial; very limited scalability, search and element handling
• DBMS as BLOBs (text)
  – Limited search features, performance and scalability. No inherent element handling.
• DBMS with XML support
  – Document oriented. Not supported by all DBMSs. Different features provided.
• Native XML databases (Tamino, BaseX, eXist, ...)
  – OK... but then I need something else to talk about...
• Custom DBMS schemas
  – Data oriented, element handling is trivial, scale very well

Page 3: Xml::parent - Yet another way to store XML files

Custom DBMS schemas

• Structure mapping:
  – The design of the database schema is based on the understanding of XML Schemas or DTDs
• Model mapping:
  – A fixed database schema for all XML documents, without the assistance of DTDs or XML Schemas

Page 4: Xml::parent - Yet another way to store XML files

Structure-mapping schema: XML::RDB!

• Perl module to convert XML files into RDB schemas, and to populate and unpopulate them. You end up with one table per XML element type.
• Pros:
  ● Does what it means
  ● Quite fast
  ● Works with XML Schemas too
  ● Could eventually treat value types properly
• Cons:
  ● Inherent hierarchical structure is lost
  ● Not good if the XML files belong to different schemas
  ● Does only what it means...
  ● Not very well maintained...
  ● SQL schemas can easily become unreadable...

Page 5: Xml::parent - Yet another way to store XML files

Model-mapping schema: XParent!

• XParent is a very simple DBMS schema that can be used to store XML elements
• Does not require the XML schema (schema-oblivious)
• Highly normalized
• Cons:
  ● Values are stored as text

Page 6: Xml::parent - Yet another way to store XML files

<?xml version="1.0" encoding="ISO-8859-1"?>
<Mpeg7 xmlns="http://www.mpeg7.org/2001/MPEG7_Schema"
       xmlns:xsi="http://www.w3.org/2000/10/XMLSchema-instance">
  <DescriptionUnit xsi:type="DescriptorCollectionType">
    <Descriptor size="5" xsi:type="DominantColorType">
      <ColorSpace type="HSV" colorReferenceFlag="false"/>
      <SpatialCoherency>0</SpatialCoherency>
      <Values>
        <Percentage>2</Percentage>
        <Index>10 6 0</Index>
      </Values>
      <Values>
        <Percentage>15</Percentage>
        <Index>6 16 9</Index>
      </Values>
      <Values>
        <Percentage>3</Percentage>
        <Index>7 18 4</Index>
      </Values>
    </Descriptor>
  </DescriptionUnit>
</Mpeg7>

XParent: how it works...

The schema has four tables:

Table LabelPath:  id | len | path
Table Element:    did | pathid | ordinal
Table Data:       did | pathid | ordinal | value
Table DataPath:   pid | cid

After storing the <ColorSpace> element:

Table LabelPath
 id | len | path
----+-----+----------------------------------------------
  1 |   4 | /Mpeg7/DescriptionUnit/Descriptor/ColorSpace

Table Element
 did | pathid | ordinal
-----+--------+---------
   1 |      1 |       1

After storing its two attributes:

Table LabelPath
 id | len | path
----+-----+------------------------------------------------------------------
  1 |   4 | /Mpeg7/DescriptionUnit/Descriptor/ColorSpace
  2 |   5 | /Mpeg7/DescriptionUnit/Descriptor/ColorSpace/@colorReferenceFlag
  3 |   5 | /Mpeg7/DescriptionUnit/Descriptor/ColorSpace/@type

Table Element
 did | pathid | ordinal
-----+--------+---------
   1 |      1 |       1
   2 |      2 |       1
   3 |      3 |       2

Table Data
 did | pathid | ordinal | value
-----+--------+---------+-------
   2 |      2 |       1 | false
   3 |      3 |       2 | HSV

Table DataPath
 pid | cid
-----+-----
   1 |   2
   1 |   3
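The mapping shown in the tables above can be reproduced with a short script. This is a minimal sketch in Python (standard-library sqlite3 and xml.etree), not the XML::XParent Perl code. The column types, the helper names path_id, store and load, the alphabetical ordering of attributes and the way ordinals are assigned are all assumptions, chosen so that the result matches the tables above.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Assumed column types; the slides only give the column names.
SCHEMA = """
CREATE TABLE LabelPath (id INTEGER PRIMARY KEY, len INTEGER, path TEXT UNIQUE);
CREATE TABLE Element   (did INTEGER PRIMARY KEY, pathid INTEGER, ordinal INTEGER);
CREATE TABLE Data      (did INTEGER, pathid INTEGER, ordinal INTEGER, value TEXT);
CREATE TABLE DataPath  (pid INTEGER, cid INTEGER);
"""

def path_id(db, path):
    """Return the LabelPath id for a path, inserting the path if it is new."""
    row = db.execute("SELECT id FROM LabelPath WHERE path = ?", (path,)).fetchone()
    if row:
        return row[0]
    cur = db.execute("INSERT INTO LabelPath (len, path) VALUES (?, ?)",
                     (path.count("/"), path))  # len = number of path steps
    return cur.lastrowid

def store(db, path, ordinal, value=None, parent=None):
    """Store one node (element or attribute) and return its did."""
    pid = path_id(db, path)
    did = db.execute("INSERT INTO Element (pathid, ordinal) VALUES (?, ?)",
                     (pid, ordinal)).lastrowid
    if value is not None:            # values are always stored as text
        db.execute("INSERT INTO Data VALUES (?, ?, ?, ?)", (did, pid, ordinal, value))
    if parent is not None:           # parent/child link
        db.execute("INSERT INTO DataPath VALUES (?, ?)", (parent, did))
    return did

def load(db, elem, path, ordinal=1, parent=None):
    """Recursively store an element, its attributes and its children."""
    text = (elem.text or "").strip() or None
    did = store(db, path, ordinal, text, parent)
    for pos, name in enumerate(sorted(elem.attrib), start=1):
        store(db, f"{path}/@{name}", pos, elem.attrib[name], did)
    for pos, child in enumerate(elem, start=1):
        load(db, child, f"{path}/{child.tag}", pos, did)
    return did

db = sqlite3.connect(":memory:")
db.executescript(SCHEMA)
color_space = ET.fromstring('<ColorSpace type="HSV" colorReferenceFlag="false"/>')
load(db, color_space, "/Mpeg7/DescriptionUnit/Descriptor/ColorSpace")

label_paths = db.execute("SELECT id, len, path FROM LabelPath ORDER BY id").fetchall()
elements = db.execute("SELECT did, pathid, ordinal FROM Element ORDER BY did").fetchall()
data = db.execute("SELECT did, pathid, ordinal, value FROM Data ORDER BY did").fetchall()
data_path = db.execute("SELECT pid, cid FROM DataPath ORDER BY cid").fetchall()
```

Because every document goes through the same four tables, no XML Schema or DTD is needed to load a new file.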

Page 7: Xml::parent - Yet another way to store XML files

The XML::XParent module

• Perl module to handle XML documents on an XParent schema
• Can load any XML file into the same SQL schema
• Plugins can be registered for custom logic on elements
• Provides utilities to:
  ● Create the XParent schema for SQLite and PostgreSQL
  ● Parse and load an XML file (xparent-parse.pl)
  ● Query the XParent schema (xparent-search.pl)
• Classes:
  ● XML::XParent::Parser: XML parser based on XML::Twig
  ● XML::XParent::Parser::Plugin: base interface class to be implemented by any plugin
  ● XML::XParent::Schema: base class (interface) to the XParent schema
  ● XML::XParent::Elem: class that describes an XML element

Page 8: Xml::parent - Yet another way to store XML files

XML::XParent::Schema drivers

• The XML::XParent::Schema class implements the Driver/Interface pattern, so that custom drivers can be implemented for specific data stores
• Two generic drivers implemented so far:
  ● XML::XParent::Schema::DBIx: driver implementation based on DBIx::Class
    ● All the advantages of an ORM (but who cares?)
    ● Quite slow!
  ● XML::XParent::Schema::DBI: driver implementation based on DBI
    ● Direct integration with the data store
    ● Much faster...

Page 9: Xml::parent - Yet another way to store XML files

The quest for speed...

● Tests performed on my laptop:
  ● CPU0: Intel(R) Core(TM) i5 CPU M 540 @ 2.53GHz stepping 05
  ● CPU1: Intel(R) Core(TM) i5 CPU M 540 @ 2.53GHz stepping 05
● Reference XML file:
  ● Size: 45 MB
  ● XML elements: ~600,000
● Reference DBMS: PostgreSQL 8.4.13
● Parsing of the reference file with the DBIx driver:
  ● perl xparent-parse.pl -i <ref.xml> --driver DBIx
  ● Execution time: > 3000 mins!!!
● Parsing of the reference file with the DBI driver:
  ● perl xparent-parse.pl -i <ref.xml> --driver DBI
  ● Execution time: ~400 mins.

Page 10: Xml::parent - Yet another way to store XML files

...But then...

● I realized loading times were divergent!

● I realized there was a stupid error in the implementation of the algorithm...

[Chart: execution time in minutes, log scale. Reference implementation: DBIx 3000, DBI 400. Algorithm patched: DBIx 177, DBI 28.]

Page 11: Xml::parent - Yet another way to store XML files

...But then...

● I realized that records in the Data and DataPath tables are not referenced by any other table...
● They do not need to be inserted one at a time...
● => Bulk loading!!!
● ...given N elements, how many records do we have in the DataPath table?

Page 12: Xml::parent - Yet another way to store XML files

Bulk Loading

• Saves a lot of time storing data. DBI, bulk loading of 1,000,000 records:

  All at once:      50.462398 wallclock seconds
  Chunks of 1000:   31.157044 wallclock seconds
  Chunks of 2000:   27.747248 wallclock seconds
  Chunks of 5000:   28.209256 wallclock seconds
  Chunks of 10000:  26.334099 wallclock seconds

• Distinct inserts of 1,000,000 records: 250.563282 wallclock seconds
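The chunked variant measured above can be sketched as follows. This is an illustration in Python with the standard-library sqlite3 module, not the DBI code behind the timings; the slides do not say which bulk mechanism was used with PostgreSQL, so executemany() with one commit per chunk is an assumption, and the names chunked and bulk_load are hypothetical.

```python
import sqlite3
from itertools import islice

def chunked(rows, size):
    """Yield lists of at most `size` rows from any iterable."""
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

def bulk_load(db, rows, chunk_size=1000):
    """Insert (pid, cid) pairs into DataPath, one executemany() per chunk."""
    total = 0
    for batch in chunked(rows, chunk_size):
        with db:  # commit once per chunk instead of once per row
            db.executemany("INSERT INTO DataPath (pid, cid) VALUES (?, ?)", batch)
        total += len(batch)
    return total

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE DataPath (pid INTEGER, cid INTEGER)")

# 5000 hypothetical parent/child links, all under parent 1.
pairs = ((1, cid) for cid in range(2, 5002))
loaded = bulk_load(db, pairs, chunk_size=1000)
stored = db.execute("SELECT COUNT(*) FROM DataPath").fetchone()[0]
```

This only works for rows whose keys nobody needs to read back, which is the point made about Data and DataPath: no other table refers to their records.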

[Chart: execution time in minutes, log scale. Reference implementation: DBIx 3000, DBI 400. Algorithm patched: DBIx 177, DBI 28. Bulk loading: DBIx 98, DBI 16.]

Page 13: Xml::parent - Yet another way to store XML files

...But then...

• For each element we have to check whether its path already exists...

• Much better to cache paths in a hash than to go back and forth to the DB...
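A path cache of this kind might look like the sketch below. It is written in Python (a dict plays the role of the Perl hash) against an in-memory SQLite table; the class name PathCache, the db_lookups counter and the table definition are assumptions made for the illustration, not part of XML::XParent.

```python
import sqlite3

class PathCache:
    """Remember path -> id so each distinct path hits the database only once."""

    def __init__(self, db):
        self.db = db
        self.ids = {}
        self.db_lookups = 0  # counts round trips, for demonstration only

    def get_id(self, path):
        if path in self.ids:               # cache hit: no database access
            return self.ids[path]
        self.db_lookups += 1
        row = self.db.execute(
            "SELECT id FROM LabelPath WHERE path = ?", (path,)).fetchone()
        if row is None:                    # first time this path is seen
            cur = self.db.execute(
                "INSERT INTO LabelPath (len, path) VALUES (?, ?)",
                (path.count("/"), path))
            path_id = cur.lastrowid
        else:
            path_id = row[0]
        self.ids[path] = path_id
        return path_id

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE LabelPath (id INTEGER PRIMARY KEY, len INTEGER, path TEXT)")
cache = PathCache(db)

# The three <Values> blocks of the example document share the same three paths.
base = "/Mpeg7/DescriptionUnit/Descriptor/Values"
paths = [base, base + "/Percentage", base + "/Index"] * 3
ids = [cache.get_id(p) for p in paths]
```

Documents with many repeated structures have far fewer distinct paths than elements, so most lookups are served from memory.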

[Chart: execution time in minutes, log scale. Reference implementation: DBIx 3000, DBI 400. Algorithm patched: DBIx 177, DBI 28. Bulk loading: DBIx 98, DBI 16. Cached paths: DBIx 41, DBI 12.]

Page 14: Xml::parent - Yet another way to store XML files

...But then...

• Added some indexes:
  CREATE INDEX LabelPath_Path ON LabelPath (Path);
  CREATE INDEX Element_PathID ON Element (PathID);
  CREATE INDEX DataPath_Cid ON DataPath (Cid);
  CREATE INDEX DataPath_Pid ON DataPath (Pid);
  CREATE INDEX Data_Did ON Data (Did);
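To see what such an index changes, one can ask the database for its query plan. The sketch below does this in Python with an in-memory SQLite database; the table definition is an assumption, and the measurements in this talk were taken on PostgreSQL, whose planner output looks different.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE LabelPath (id INTEGER PRIMARY KEY, len INTEGER, path TEXT);
CREATE INDEX LabelPath_Path ON LabelPath (path);
""")

def plan(sql, params=()):
    """Return SQLite's query plan for a statement as one line of text."""
    rows = db.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column holds the detail

# Lookup by path: the planner can use the LabelPath_Path index.
lookup_plan = plan("SELECT id FROM LabelPath WHERE path = ?", ("/Mpeg7",))

# Lookup by len: no index exists on that column, so the table is scanned.
scan_plan = plan("SELECT id FROM LabelPath WHERE len = ?", (4,))
```

The path lookup is the one executed for every element during loading, which is why an index on it pays off.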

[Chart: execution time in minutes, log scale. Reference implementation: DBIx 3000, DBI 400. Algorithm patched: DBIx 177, DBI 28. Bulk loading: DBIx 98, DBI 16. Cached paths: DBIx 41, DBI 12. Indexes: DBIx 29, DBI 8.]

Page 15: Xml::parent - Yet another way to store XML files

...But then...

• Realized I could "compact" records...
• Saves another 20%-30%...
• Needs some logic at query time (experimental)...

<?xml version="1.0" encoding="ISO-8859-1"?>
<Mpeg7 xmlns="http://www.mpeg7.org/2001/MPEG7_Schema"
       xmlns:xsi="http://www.w3.org/2000/10/XMLSchema-instance">
  <DescriptionUnit xsi:type="DescriptorCollectionType">
    <Descriptor size="5" xsi:type="DominantColorType">
      <ColorSpace type="HSV" colorReferenceFlag="false"/>
      <SpatialCoherency>0</SpatialCoherency>
      <Values>
        <Percentage>2</Percentage>
        <Index>10 6 0</Index>
      </Values>
      <Values>
        <Percentage>15</Percentage>
        <Index>6 16 9</Index>
      </Values>
      <Values>
        <Percentage>3</Percentage>
        <Index>7 18 4</Index>
      </Values>
    </Descriptor>
  </DescriptionUnit>
</Mpeg7>

Page 16: Xml::parent - Yet another way to store XML files

To cut a very long story short...

Time (mins) to load ~600,000 XML elements:

      | Reference | Algo patched | Bulk loading | Cached paths | Indexes | Compact
------+-----------+--------------+--------------+--------------+---------+---------
 DBIx |    > 3000 |          177 |           98 |           41 |      29 |      22
 DBI  |      ~400 |           28 |           16 |           12 |       8 |       6

● ...and we still have to do:
  ● Code profiling...
  ● Specific DBMS techniques...
  ● Use MapReduce to split jobs among several workers...

Page 17: Xml::parent - Yet another way to store XML files

About retrieval...

• At first I tried implementing an XPath-to-SQL translator
• Found it very, very hard...
• ...and almost useless
• ...use the power of SQL to express what you want!
• XML::XParent provides an API (get_elem) to query for a set of elements whose paths match a given SQL regex. The API returns a set of XML::XParent::Elem objects.
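The kind of SQL that such a path-regex query boils down to can be sketched as follows. This is not the implementation of get_elem, whose internals are not shown in the slides; it is a Python illustration on SQLite, where the REGEXP operator has to be supplied by the application, and the function name find_by_path is hypothetical. The rows are the ones from the ColorSpace example on page 6.

```python
import re
import sqlite3

def regexp(pattern, value):
    """Backs SQLite's `value REGEXP pattern` operator with Python's re module."""
    return int(value is not None and re.search(pattern, value) is not None)

db = sqlite3.connect(":memory:")
db.create_function("REGEXP", 2, regexp)
db.executescript("""
CREATE TABLE LabelPath (id INTEGER PRIMARY KEY, len INTEGER, path TEXT);
CREATE TABLE Element   (did INTEGER PRIMARY KEY, pathid INTEGER, ordinal INTEGER);
CREATE TABLE Data      (did INTEGER, pathid INTEGER, ordinal INTEGER, value TEXT);
INSERT INTO LabelPath VALUES
  (1, 4, '/Mpeg7/DescriptionUnit/Descriptor/ColorSpace'),
  (2, 5, '/Mpeg7/DescriptionUnit/Descriptor/ColorSpace/@colorReferenceFlag'),
  (3, 5, '/Mpeg7/DescriptionUnit/Descriptor/ColorSpace/@type');
INSERT INTO Element VALUES (1, 1, 1), (2, 2, 1), (3, 3, 2);
INSERT INTO Data VALUES (2, 2, 1, 'false'), (3, 3, 2, 'HSV');
""")

def find_by_path(db, pattern):
    """Return (did, path, value) for every stored node whose path matches."""
    return db.execute("""
        SELECT e.did, lp.path, d.value
        FROM Element e
        JOIN LabelPath lp ON lp.id = e.pathid
        LEFT JOIN Data d ON d.did = e.did
        WHERE lp.path REGEXP ?
        ORDER BY e.did
    """, (pattern,)).fetchall()

attributes = find_by_path(db, r"/ColorSpace/@")
everything = find_by_path(db, r"^/Mpeg7/")
```

On PostgreSQL the same join could use the built-in regular expression operators instead of a user-defined function.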

Page 18: Xml::parent - Yet another way to store XML files

XML::XParent utilities: how to use them

• Configure parameters in the xparent.yml file:

schema_params:
    - 'dbi:Pg:dbname=xparent'
#   - 'dbi:SQLite:xparent.db'
    - grubert
    - grubert
    - AutoCommit: 1
#plugins:
#    'SLMS::Redis::ParserPlugin':
#        'tag': 'MovingRegion'

• To load an XML file:

perl xparent-parse.pl
    -i <input file>
    --driver <the Schema driver to use>
    [--config_file <the config file>]
    [--verbose]
    [--clean]
    [--compact]

• To query the XParent data store:

perl xparent-search.pl
    --path <path regex>
    --driver <the Schema driver to use>
    [--config_file <the config file>]

• To clean the data store:

perl xparent-clean.pl
    --driver <the Schema driver to use>
    [--config_file <the config file>]

Page 19: Xml::parent - Yet another way to store XML files

Contribute!

https://github.com/grubert65/XParent-Perl.git