Querying Tree-Structured Data Using Dimension Graphs

Dimitri Theodoratos (New Jersey Institute of Technology, USA)

Theodore Dalamagas (National Techn. University of Athens, Greece)

-2-

Tree-structured Data Management

Tree structures: a means to organize the information on the Web.

Examples: taxonomies, thematic categories, concept hierarchies, product catalogs, etc.

Organizing data in tree structures (tree-structured data) has become widespread due to the popularity of the XML language.

XML language (W3C): the standard data exchange format on the Web

Data is either stored natively in tree structures, or made publicly available in tree structures to enable its automatic processing by programs, scripts, and agents.

-3-

Tree-structured Data Management

Querying tree-structured data is based on path expression queries.

Popular query languages for tree-structured data: XPath and XQuery (W3C), e.g.:

FOR $i IN /brand/type[price<900] RETURN {$i/id, $i/condition, $i/price} (find products cheaper than 900, and display their id, condition, and price)

Querying tree-structured data faces two major obstacles: the semistructured nature of the data and the lack of semantics.

This is actually the penalty one has to pay for the flexibility offered by XML technologies.

...<brand> Sony <type> laptop <id> 1 </id> <condition> used </condition> <price> 800 </price> </type></brand>...

-4-

Semistructured Nature of Tree-structured Data

Due to the first obstacle (i.e., the semistructured nature of the data), querying tree-structured data requires resolving structural differences and inconsistencies.

The reason: the same information can be organized in tree structures in different ways. Examples:

Structural differences: certain ‘nodes’ (i.e., categories, elements, etc.) exist in one tree-structured data source but not in another.

Structural inconsistencies: variations in ‘node’ sequences (even within a single tree-structured data source).

-5-

[Figure: Product Catalog A and Product Catalog B, two trees rooted at r. Node labels include Notebooks, Desktops, Servers, PDAs, Multimedia, Custom, Ultralight, 10'', 8'', the brands Mac, HP, Sony, IBM, Dell, and the conditions New, Used.]

Structural difference: Product catalog A has a finer categorization of notebooks than catalog B, e.g., Custom/Ultralight, and 10''/8'' for the ultralight ones.

-6-

[Figure: Product Catalog A and Product Catalog B, two trees rooted at r. Node labels include Notebooks, Desktops, Servers, PDAs, Multimedia, Custom, Ultralight, 10'', 8'', the brands Mac, HP, Sony, IBM, Dell, and the conditions New, Used.]

Structural inconsistency: Product catalog A classifies notebooks by brand and then by condition, while catalog B does it the other way around (Sony/Used vs. Used/Sony).

-7-


...<brand> Sony <type> laptop <condition> used </condition> <price> 800 </price> </type></brand>...

...<type> laptop <condition> used <brand> Sony </brand> </condition> </type>...

Element sequences: brand, type, condition versus type, condition, brand.

Structural inconsistency (cont.): One XML document includes the element sequence brand, type, condition, while another one (for the same data) includes type, condition, brand. Such inconsistencies are observed even within the tree-structured data of a single data source.

-8-


How do structural differences and inconsistencies affect the querying of tree-structured data?

The user should explicitly specify them as part of the query.

This is extremely cumbersome. E.g., the user has to explicitly specify disjunctions of possible alternative node sequences: /brand/type[price<900] OR /type/condition[price<900] OR /condition/type[price<900] ...

...<brand> Sony <type> laptop <condition> used </condition>...

...<type> laptop <condition> used <brand> Sony </brand>...
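To get a feel for how quickly such disjunctions of alternative node sequences grow, the following is a minimal Python sketch. It assumes, for illustration only, that every ordering of the given element names is a possible node sequence; the helper name `alternative_paths` is hypothetical and not part of any XPath or XQuery library.

```python
from itertools import permutations


def alternative_paths(elements, predicate):
    """Build one path expression per ordering of the element names.

    Assumption: any ordering may occur in the data, so n elements
    lead to n! alternatives that the user would have to write by hand.
    """
    return ["/" + "/".join(order) + predicate for order in permutations(elements)]


paths = alternative_paths(["brand", "type", "condition"], "[price<900]")
query = " OR ".join(paths)
print(len(paths))  # number of alternatives for three elements
print(query)
```

With three element names the user already has to write six alternatives; the number grows factorially with each additional element.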

-9-


However, the need to specify alternative node sequences does not always stem from structural differences and inconsistencies.

Users should be able to pose queries even if they do not know (or do not care about) the exact structure of tree-structured data sources.

e.g. find products cheaper than 900, and display their id, condition, and price

...but I do not know (or I do not care!) whether condition is before brand and type!

Currently, query formulation on tree-structured data is strictly dependent on the structure of data.

Only the ancestor/descendant relationship allows relaxed path expressions (e.g., brand//type).

-10-

Lack of Semantics in Tree-structured Data

Reminder: Querying tree-structured data faces two major obstacles: the semistructured nature of the data (just explained) and the lack of semantics.

Tree-structured data provides mainly syntactic and not semantic information.

However, there are inherent semantics in tree-structured data.

Sets of nodes in a catalog are usually related under a semantic interpretation, e.g., Mac, HP, Sony refer to a brand name.

Such information can be exploited to become part of query formulation and to support query optimization.

Currently, query formulation on tree-structured data ignores this issue.

-11-

Our Approach

We introduce the notion of dimension graphs to capture semantic information in tree-structured data.

We design a query language for tree-structured data.

Queries are not cast on the structure of tree-structured data.

Queries can handle structural differences and inconsistencies effectively.

We discuss query evaluation issues.

We show how dimension graphs can be used to query multiple tree-structured data sources.

-12-

Data Model

We use value trees to represent tree-structured data.

Values (i.e., nodes) in value trees are grouped to form dimensions.

A dimension is a set of semantically related nodes (i.e., values) in the value tree.

The semantic interpretation is given by the user.

Two nodes in the same path cannot belong to the same dimension.
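The rule that two nodes on the same path cannot belong to the same dimension can be checked mechanically. The sketch below is one possible way to do it in Python, under stated assumptions: the `Node` class, the function name, and the mapping from values to dimension names are all hypothetical, and the small tree is illustrative rather than the catalog from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A node of a value tree: a value plus its child nodes."""
    value: str
    children: list = field(default_factory=list)


def violates_dimension_rule(root, dimension_of):
    """Return the first root-to-node path that contains two values of the
    same dimension, or None if the assignment respects the rule."""
    def walk(node, path, seen):
        path = path + [node.value]
        dim = dimension_of.get(node.value)  # None: value belongs to no dimension
        if dim is not None:
            if dim in seen:
                return path
            seen = seen | {dim}
        for child in node.children:
            bad = walk(child, path, seen)
            if bad:
                return bad
        return None

    return walk(root, [], frozenset())


# Illustrative data (assumed): brand and condition appear in both orders.
dimension_of = {"Notebooks": "pc_type", "Sony": "brand", "IBM": "brand",
                "Used": "condition"}
good_tree = Node("r", [Node("Notebooks", [Node("Sony", [Node("Used")]),
                                          Node("Used", [Node("IBM")])])])
bad_tree = Node("r", [Node("Sony", [Node("IBM")])])

print(violates_dimension_rule(good_tree, dimension_of))
print(violates_dimension_rule(bad_tree, dimension_of))
```

The first tree respects the rule even though brand and condition occur in both orders; the second one places two brand values on one path and is rejected.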

-13-

Data Model

[Figure: value tree rooted at r, with its nodes grouped into the dimensions R, pc_type, pc_category, brand, and condition.]

E.g., dimensions pc_type = {Notebooks, Desktops, PDAs}, pc_category = {Servers, Multimedia}, brand = {Mac, Sony, HP, IBM, Dell}, etc.

-14-

Data Model

We use dimension graphs to capture relationships between dimensions.

The nodes of a dimension graph represent dimensions.

There is an edge from dimension D1 to D2 if a value of D1 is the parent of some value in D2.
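The edge rule translates almost directly into code. The following Python sketch derives the edges of a dimension graph from a value tree; the tuple-based tree encoding, the function name `dimension_graph`, and the assumption that the root value r belongs to a dimension R are illustrative choices, and the tree is a small made-up example.

```python
def dimension_graph(tree, dimension_of):
    """Return the set of edges (D1, D2) such that some value of D1 is the
    parent of some value of D2.

    tree is encoded as (value, [subtrees]); dimension_of maps each value
    to the name of its dimension.
    """
    edges = set()

    def walk(node):
        value, children = node
        for child in children:
            edges.add((dimension_of[value], dimension_of[child[0]]))
            walk(child)

    walk(tree)
    return edges


# Illustrative value tree (assumed): brand/condition occur in both orders.
T = ("r", [
    ("Notebooks", [("Sony", [("Used", [])]),
                   ("Used", [("IBM", [])])]),
    ("Servers", [("HP", [])]),
])
dimension_of = {"r": "R", "Notebooks": "pc_type", "Servers": "pc_category",
                "Sony": "brand", "IBM": "brand", "HP": "brand",
                "Used": "condition"}

for edge in sorted(dimension_graph(T, dimension_of)):
    print(edge)
```

Because brand and condition appear in both orders in the tree, the resulting graph contains both the edge brand to condition and the edge condition to brand.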

-15-

Data Model

[Figure: Value Tree T (rooted at r, nodes grouped into dimensions) next to the Dimension Graph of T, whose nodes are R, pc_type, pc_category, brand, and condition.]

-16-

Data Model

[Figure: Value Tree T and the Dimension Graph of T, as on the previous slide.]

-17-

Data Model

A dimension graph...

- can be automatically extracted from a value tree, given the dimensions,
- provides an abstraction of the structural information of value trees,
- provides semantic query guidance to pose queries on tree-structured data, in the presence of structural differences and inconsistencies,
- supports query evaluation and optimization.

...will be explained soon.

-18-

Querying Tree-structured Data

Queries are defined on dimension graphs and not directly on value trees.

The user annotates some dimensions. She may also leave the parent-child and ancestor-descendant relationships between the annotated dimensions in a query unspecified, or specify them only partially.

Our system identifies possible ‘valid’ orderings of dimensions exploiting the dimension graph.

These orderings are used as patterns for constructing a set of path expressions to be sent directly to the value trees.

-19-

Querying

[Figure: Value Tree T next to a query on the Dimension Graph of T. The query graph has the dimensions R, pc_type, pc_category, brand, and condition, with the annotations condition = {used}, pc_type = ?, brand = {Sony, IBM}.]

Legend for annotated dimensions:

- = ? : the dimension can have any value
- = { ... } : the dimension should have specific values

-20-

Querying

[Figure: Value Tree T next to the query with annotations condition = {used}, pc_type = ?, brand = {Sony, IBM}.]

‘Find all Sony, IBM used products’, i.e., find paths in T from r to a leaf node that contain

- any of the values of dimension pc_type,
- the value ‘used’ of dimension condition,
- either value ‘Sony’ or ‘IBM’ of dimension brand.

-22-

Querying

[Figure: Value Tree T next to the query with annotations condition = {used}, pc_type = ?, brand = {Sony, IBM}.]

Notice how the query handles the structural inconsistencies!

-23-

Querying

[Figure: part of Value Tree T next to the query with annotations condition = {used}, pc_type = ?, brand = {Sony, IBM}.]

‘Find all Sony, IBM used products. However, the nodes referring to brand name should be after the node ‘used’.’, i.e., find paths in T from r to a leaf node that contain

- any of the values of dimension pc_type,
- the value ‘used’ of dimension condition,
- either value ‘Sony’ or ‘IBM’ of dimension brand.

However: values of condition should be parents of values of brand.

-24-

Querying

[Figure: Value Tree T next to the query with annotations condition = {used}, pc_type = ?, brand = {Sony, IBM}.]

Find paths in T from r to a leaf node that contain

- any of the values of dimension pc_type,
- the value ‘used’ of dimension condition,
- either value ‘Sony’ or ‘IBM’ of dimension brand.

However: values of condition should be parents of values of brand.

-25-

Query Evaluation

Query evaluation exploits dimension graphs to detect answer paths.

An answer path is a path in a dimension graph that starts from R, includes all annotated dimensions, and ends on an annotated dimension.


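[Figure: query on a dimension graph with the dimensions R, pc_type, mobile_type, pc_category, brand, and condition, annotated with condition = {used}, pc_type = ?, brand = {Sony, IBM}.]

Examples of answer paths: /R/pc_type/condition/brand, /R/pc_type/pc_category/brand/condition, ....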
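The definition of an answer path suggests a simple enumeration over the dimension graph. The sketch below is a minimal Python illustration, not the authors' algorithm: it assumes answer paths are simple paths (no dimension visited twice), and the edge set of the example graph is a guess that is merely consistent with the two answer paths listed above.

```python
def answer_paths(graph, annotated, root="R"):
    """Enumerate simple paths that start at the root, include every
    annotated dimension, and end on an annotated dimension."""
    found = []

    def extend(path):
        last = path[-1]
        if last in annotated and annotated <= set(path):
            found.append(tuple(path))
        for nxt in graph.get(last, ()):
            if nxt not in path:  # keep the path simple (assumption)
                extend(path + [nxt])

    extend([root])
    return found


# Hypothetical dimension graph, given as adjacency lists.
G = {
    "R": ["pc_type", "pc_category"],
    "pc_type": ["pc_category", "brand", "condition"],
    "pc_category": ["brand"],
    "brand": ["condition"],
    "condition": ["brand"],
}
annotated = {"pc_type", "condition", "brand"}

for path in answer_paths(G, annotated):
    print("/" + "/".join(path))
```

The search only touches the dimension graph, so its cost depends on the number of dimensions rather than on the size of the value tree.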

-26-

Query Evaluation

[Figure: Value Tree T next to the query with annotations condition = {used}, pc_type = ?, brand = {Sony, IBM}.]

Answer paths are used to generate path expressions that can be passed to, e.g., an XQuery engine to retrieve the answers from a value tree. E.g., /R/pc_type/condition/brand gives /r/(Notebooks|Desktops)/Used/(Sony|IBM)
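The step from an answer path to a path expression can be sketched as a substitution of dimension names by their admissible values. The Python below is a hedged illustration: the function name and data layout are hypothetical, and it expands a ‘?’ annotation to every value of the dimension. The slide's example lists only Notebooks and Desktops for pc_type; the rule that narrows the candidates is not stated in the slides, so it is not reproduced here.

```python
def path_expression(answer_path, values, annotations, root_value="r"):
    """Turn an answer path (a sequence of dimension names) into a path
    expression over values.

    values maps a dimension to its ordered list of values; annotations
    maps a dimension to a set of required values, or to None for '?'.
    """
    steps = []
    for dim in answer_path:
        if dim == "R":  # assumption: the root dimension R holds only r
            steps.append(root_value)
            continue
        allowed = annotations.get(dim)  # None: any value of the dimension
        candidates = [v for v in values[dim] if allowed is None or v in allowed]
        if len(candidates) == 1:
            steps.append(candidates[0])
        else:
            steps.append("(" + "|".join(candidates) + ")")
    return "/" + "/".join(steps)


values = {
    "pc_type": ["Notebooks", "Desktops", "PDAs"],
    "condition": ["New", "Used"],
    "brand": ["Mac", "Sony", "HP", "IBM", "Dell"],
}
annotations = {"pc_type": None, "condition": {"Used"}, "brand": {"Sony", "IBM"}}

expr = path_expression(["R", "pc_type", "condition", "brand"], values, annotations)
print(expr)
```

One expression is produced per answer path, so the set of expressions sent to the value tree mirrors the set of answer paths found in the dimension graph.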

-27-

Query Evaluation

The answer paths help to detect ordering of values that can possibly exist in a value tree.

Only these value orderings will be used to compute the answer of a query on the value tree.

This is performed before query evaluation reaches the value tree.

Detecting answer paths in a dimension graph is not a costly task, since dimension graphs are much smaller than value trees.

-28-

Query Evaluation

Query evaluation exploits dimension graphs to detect unsatisfiable queries (i.e. queries with empty answers in the value tree).

Examples of unsatisfiable queries:

[Figure: three queries on a dimension graph over the dimensions R, pc_type, mobile_type, pc_category, brand, and condition, each annotating a different subset of dimensions with ‘= ?’. Callouts in the figure: ‘No answer paths!’, ‘Two children have the same parent!’, ‘No path from condition to mobile_type!’]
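Unsatisfiability can be detected on the dimension graph alone: if no answer path exists, the query has an empty answer on the value tree. The Python sketch below illustrates this under assumptions: the graph is hypothetical (the figure's exact edges are not recoverable), answer paths are taken to be simple paths, and a required parent-child relationship is modeled as two dimensions appearing consecutively on the path.

```python
def is_satisfiable(graph, annotated, parent_child=(), root="R"):
    """Return True if some simple path from the root includes all annotated
    dimensions, ends on one of them, and contains every required
    (parent, child) pair as consecutive dimensions."""
    def consecutive(path, parent, child):
        return any(a == parent and b == child for a, b in zip(path, path[1:]))

    def search(path):
        last = path[-1]
        if (last in annotated and annotated <= set(path)
                and all(consecutive(path, p, c) for p, c in parent_child)):
            return True
        return any(search(path + [n])
                   for n in graph.get(last, ()) if n not in path)

    return search([root])


# Hypothetical dimension graph: pc_type and mobile_type are both children of R.
G = {
    "R": ["pc_type", "mobile_type"],
    "pc_type": ["brand"],
    "mobile_type": ["brand"],
    "brand": ["condition"],
}

# pc_type and mobile_type never lie on one path: unsatisfiable.
print(is_satisfiable(G, {"pc_type", "mobile_type", "brand"}))
# No edge from condition to mobile_type: unsatisfiable.
print(is_satisfiable(G, {"mobile_type", "condition"},
                     parent_child=[("condition", "mobile_type")]))
# A path R/pc_type/brand exists: satisfiable.
print(is_satisfiable(G, {"pc_type", "brand"}))
```

Such queries can be rejected without ever touching the value tree.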

-29-

Query Evaluation

Dimension graphs can be used to query multiple value trees.

Consider value trees T1, T2, ..., Tn over a dimension set D.

Let G1, G2, ..., Gn be their dimension graphs.

Construct a global dimension graph G by merging G1, G2, ..., Gn.

Queries are formed on G.

The annotations are transferred to G1, G2, ..., Gn.

Query evaluation is performed as described before.
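The slides do not define the merge operation. One plausible reading, assumed here purely for illustration, is that the global graph G is the union of the nodes and edges of G1, G2, ..., Gn. A minimal Python sketch of that reading, with a hypothetical function name and made-up graphs:

```python
def merge_dimension_graphs(graphs):
    """Merge dimension graphs by taking the union of their nodes and edges.

    Each graph maps a dimension to the set of its successor dimensions.
    Assumption: 'merging' means plain union; the slides do not specify it.
    """
    merged = {}
    for graph in graphs:
        for dim, successors in graph.items():
            merged.setdefault(dim, set()).update(successors)
            for succ in successors:
                merged.setdefault(succ, set())  # make sure every node is present
    return merged


# Two hypothetical sources that order brand and condition differently.
G1 = {"R": {"pc_type"}, "pc_type": {"brand"}, "brand": {"condition"}}
G2 = {"R": {"pc_type"}, "pc_type": {"condition"}, "condition": {"brand"}}

G = merge_dimension_graphs([G1, G2])
for dim in sorted(G):
    print(dim, sorted(G[dim]))
```

Under this reading, a query formed on G sees both orderings of brand and condition, while each source graph still determines which answer paths apply to its own value tree.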

-30-

Conclusions

Querying tree-structured data using dimension graphs:

- Dimension graphs capture semantic information in tree-structured data. They are used for query formulation and evaluation.
- Queries are not cast on the structure of tree-structured data but on dimension graphs.
- Queries can handle structural differences and inconsistencies in value trees.
- Query evaluation exploits dimension graphs to generate appropriate path expressions to be evaluated on the value trees.
- Dimension graphs can also be used to query multiple value trees.
