Querying Tree-Structured Data Using Dimension Graphs
Dimitri Theodoratos (New Jersey Institute of Technology, USA)
Theodore Dalamagas (National Techn. University of Athens, Greece)
-2-
Tree-structured Data Management
Tree structures: a means to organize the information on the Web.
Examples: taxonomies, thematic categories, concept hierarchies, product catalogs, etc.
Organizing data in tree structures (tree-structured data) has become widespread due to the popularity of the XML language.
XML language (W3C): the standard data exchange format on the Web.
Data is either stored natively in tree structures, or made publicly available in tree structures to enable its automatic processing by programs, scripts, and agents.
-3-
Tree-structured Data Management
Querying tree-structured data is based on path expression queries.
Popular query languages for tree-structured data: XPath and XQuery (W3C), e.g.:
FOR $i IN /brand/type[price<900] RETURN {$i/id, $i/condition, $i/price}
(find products cheaper than 900, and display their id, condition, and price)
Querying tree-structured data faces two major obstacles: the semistructured nature of the data and the lack of semantics.
This is the price one pays for the flexibility offered by XML technologies.
...<brand> Sony <type> laptop <id> 1 </id> <condition> used </condition> <price> 800 </price> </type></brand>...
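As a rough illustration of what the query above asks for, the same selection can be sketched in Python with the standard library's xml.etree.ElementTree. This is not XQuery and not the authors' tooling; treating the slide's fragment as a complete document and the variable names are assumptions made for the sketch.

```python
import xml.etree.ElementTree as ET

# The fragment from the slide, treated here as a whole document (an assumption).
DOC = ("<brand> Sony <type> laptop <id> 1 </id> <condition> used </condition> "
       "<price> 800 </price> </type></brand>")

root = ET.fromstring(DOC)  # root element is <brand>

results = []
# Mirrors /brand/type[price<900]: every <type> child of <brand> with price below 900.
for product in root.findall("type"):
    price = float(product.findtext("price"))
    if price < 900:
        # Mirrors RETURN {$i/id, $i/condition, $i/price}.
        results.append((product.findtext("id").strip(),
                        product.findtext("condition").strip(),
                        price))

print(results)
```

The path brand/type is hard-wired into the loop, which is exactly the dependence on structure that the following slides discuss.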
-4-
Semistructured Nature of Tree-structured Data
Due to the first obstacle (i.e. the semistructured nature), querying tree-structured data requires resolving structural differences and inconsistencies.
The reason: the same information can be organized in tree structures in different ways.
Examples:
Structural differences: certain ‘nodes’ (i.e. categories, elements, etc.) exist in one tree-structured data source but not in another.
Structural inconsistencies: variations in ‘node’ sequences (even within a single tree-structured data source).
-5-
[Figure: Product Catalog A and Product Catalog B, two product trees rooted at r. Node labels include Notebooks, Desktops, Servers, PDAs, Multimedia, the brands Mac, HP, Sony, IBM, Dell, and the conditions New and Used; catalog A also has Custom, Ultralight, 10'' and 8''.]
Structural difference: Product catalog A has a finer categorization of notebooks, e.g. Custom/Ultralight and 10''/8'' (for the ultralight), compared to catalog B.
-6-
[Figure: Product Catalog A and Product Catalog B, two product trees rooted at r.]
Structural inconsistency: Product catalog A classifies notebooks by brand and then by condition, while catalog B does it the other way around (Sony/Used vs. Used/Sony).
-7-
Semistructured Nature of Tree-structured Data
...
<brand>
Sony
<type>
laptop
<condition> used </condition>
<price> 800 </price>
</type>
</brand>
...
...
<type>
laptop
<condition>
used
<brand> Sony </brand>
<price> 800 </price>
</condition>
</type>
...
[Figure: element sequences of the two fragments: brand, type, condition vs. type, condition, brand.]
Structural inconsistency (cont.): One XML document includes the element sequence brand, type, condition, while another one (for the same data) includes type, condition, brand.
Such inconsistencies are observed even within the tree-structured data of a single data source.
-8-
Semistructured Nature of Tree-structured Data
How do structural differences and inconsistencies affect the querying of tree-structured data?
The user must explicitly specify them as part of the query.
This is extremely cumbersome. E.g.: explicitly specify disjunctions of possible alternative node sequences:
/brand/type[price<900] OR /type/condition[price<900] OR /condition/type[price<900] ...
...
<brand>
Sony
<type>
laptop
<condition> used </condition>
<price> 800 </price>
...
...
<type>
laptop
<condition>
used
<brand> Sony </brand>
<price> 800 </price>
...
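To make the burden concrete, here is a small Python sketch that tries each alternative node sequence against two differently organized documents. The two documents are the fragments from the slides, closed and treated as complete documents; the helper name cheap_nodes and the list ALTERNATIVES are hypothetical names introduced only for this illustration.

```python
import xml.etree.ElementTree as ET

# Same data, two organizations (fragments from the slides, treated as whole documents).
DOC_A = ("<brand> Sony <type> laptop <condition> used </condition> "
         "<price> 800 </price> </type> </brand>")
DOC_B = ("<type> laptop <condition> used <brand> Sony </brand> "
         "<price> 800 </price> </condition> </type>")

# The disjunction the user has to spell out by hand: (top element, child element).
ALTERNATIVES = [("brand", "type"), ("type", "condition"), ("condition", "type")]

def cheap_nodes(xml_text, limit=900):
    """Return the alternative paths that select a node with price < limit."""
    root = ET.fromstring(xml_text)
    hits = []
    for top, child in ALTERNATIVES:
        if root.tag != top:
            continue  # this alternative does not apply to this document
        for node in root.findall(child):
            price = node.findtext("price")
            if price is not None and float(price) < limit:
                hits.append(f"/{top}/{child}")
    return hits

print(cheap_nodes(DOC_A))
print(cheap_nodes(DOC_B))
```

Each document is matched by a different alternative, so a query that lists only one of them silently misses the other source.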
-9-
Semistructured Nature of Tree-structured Data
However, the need for alternative node sequences does not always come from structural differences and inconsistencies.
Users should be able to pose queries even if they do not know (or do not care about) the exact structure of tree-structured data sources.
e.g. find products cheaper than 900, and display their id, condition, and price
...but I do not know (or I do not care!) whether condition is before brand and type!
Currently, query formulation on tree-structured data is strictly dependent on the structure of data.
Only the ancestor/descendant relationship allows relaxed path expressions (e.g. brand//type).
-10-
Lack of Semantics in Tree-structured Data
Reminder: Querying tree-structured data faces two major obstacles: the semistructured nature of the data (just explained) and the lack of semantics.
Tree-structured data provides mainly syntactic and not semantic information.
However, there are inherent semantics in tree-structured data.
Sets of nodes in a catalog are usually related under a semantic interpretation, e.g. Mac, HP, Sony refer to a brand name.
Such information can be exploited to become part of query formulation and to support query optimization.
Currently, query formulation on tree-structured data ignores this issue.
-11-
Our Approach
We introduce the notion of dimension graphs to capture semantic information in tree-structured data.
We design a query language for tree-structured data.
Queries are not cast on the structure of tree-structured data.
Queries can handle structural differences and inconsistencies effectively.
We discuss query evaluation issues.
We show how dimension graphs can be used to query multiple tree-structured data sources.
-12-
Data Model
We use value trees to represent tree-structured data.
Values (i.e. nodes) in value trees are grouped to form dimensions.
A dimension is a set of semantically related nodes (i.e. values) in the value tree.
The semantic interpretation is given by the user.
Two nodes in the same path cannot belong to the same dimension.
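The rule that two nodes on the same path cannot share a dimension can be checked mechanically. The sketch below assumes dimensions are given as a mapping from dimension name to a set of values; the pc_type, pc_category and brand sets come from the slides, while the condition set, the R dimension for the root, and the function names are assumptions made for the example.

```python
# Dimensions as a mapping from name to values. pc_type, pc_category and brand
# are taken from the slides; "condition" and "R" are assumed for this sketch.
DIMENSIONS = {
    "R": {"r"},
    "pc_type": {"Notebooks", "Desktops", "PDAs"},
    "pc_category": {"Servers", "Multimedia"},
    "brand": {"Mac", "Sony", "HP", "IBM", "Dell"},
    "condition": {"New", "Used"},
}

def dimension_of(value):
    """Return the name of the dimension that contains the given value."""
    for name, values in DIMENSIONS.items():
        if value in values:
            return name
    raise KeyError(value)

def violates_dimension_rule(path):
    """True if two values on the same path belong to the same dimension."""
    dims = [dimension_of(value) for value in path]
    return len(dims) != len(set(dims))

print(violates_dimension_rule(("r", "Notebooks", "Used", "Sony")))  # False
print(violates_dimension_rule(("r", "Servers", "Multimedia")))      # True
```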
-13-
Data Model
[Figure: a value tree rooted at r whose nodes are labeled with their dimensions: pc_type, pc_category, brand, condition.]
E.g. dimensions pc_type = {Notebooks, Desktops, PDAs}, pc_category = {Servers, Multimedia}, brand = {Mac, Sony, HP, IBM, Dell}, etc.
-14-
Data Model
We use dimension graphs to capture relationships between dimensions.
The nodes of a dimension graph represent dimensions.
There is an edge from dimension D1 to D2 if a value of D1 is the parent of some value in D2.
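The edge rule above translates directly into a few lines of Python. This is a minimal sketch, assuming the value tree is given as a list of root-to-leaf paths; the tree below is a small hypothetical one (not the exact tree of the slides), and the condition set and the R dimension are likewise assumptions.

```python
# Hypothetical value tree, listed as root-to-leaf paths (not the slides' exact tree).
PATHS = [
    ("r", "Notebooks", "Used", "Sony"),
    ("r", "Notebooks", "Used", "IBM"),
    ("r", "Notebooks", "New", "Mac"),
    ("r", "Desktops", "Sony", "Used"),
    ("r", "Desktops", "Dell", "New"),
    ("r", "Desktops", "Multimedia", "HP", "Used"),
    ("r", "Servers", "IBM"),
    ("r", "PDAs", "HP"),
]

# Dimensions given by the user ("condition" and "R" are assumed here).
DIMENSIONS = {
    "R": {"r"},
    "pc_type": {"Notebooks", "Desktops", "PDAs"},
    "pc_category": {"Servers", "Multimedia"},
    "brand": {"Mac", "Sony", "HP", "IBM", "Dell"},
    "condition": {"New", "Used"},
}

def dimension_of(value):
    """Return the name of the dimension that contains the given value."""
    for name, values in DIMENSIONS.items():
        if value in values:
            return name
    raise KeyError(value)

def dimension_graph(paths):
    """Edge (D1, D2) exists if some value of D1 is the parent of a value of D2."""
    edges = set()
    for path in paths:
        for parent, child in zip(path, path[1:]):
            edges.add((dimension_of(parent), dimension_of(child)))
    return edges

graph = dimension_graph(PATHS)
print(sorted(graph))
```

Because this tree places brand above condition in some branches and below it in others, the resulting graph contains both (brand, condition) and (condition, brand).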
-15-
Data Model
[Figure: Value Tree T, with its nodes labeled by dimension, next to the Dimension Graph of T, whose nodes are R, pc_type, pc_category, brand and condition.]
-16-
Data Model
[Figure: Value Tree T and the Dimension Graph of T (continued).]
-17-
Data Model
A dimension graph...
can be automatically extracted from a value tree, given the dimensions,
provides an abstraction of the structural information of value trees,
provides semantic guidance for posing queries on tree-structured data in the presence of structural differences and inconsistencies,
supports query evaluation and optimization.
...will be explained soon.
-18-
Querying Tree-structured Data
Queries are defined on dimension graphs and not directly on value trees.
The user annotates some dimensions.
She may also leave unspecified, or only partially specify, the parent-child and ancestor-descendant relationships between the annotated dimensions in a query.
Our system identifies possible ‘valid’ orderings of dimensions exploiting the dimension graph.
These orderings are used as patterns for constructing a set of path expressions to be sent directly to the value trees.
-19-
Querying
[Figure: Value Tree T and a query on the Dimension Graph of T.]
Query on Dimension Graph of T: condition = {used}, pc_type = ?, brand = {Sony, IBM}
Legend (for annotated dimensions):
= ? the dimension can have any value
= { ... } the dimension should have specific values
-20-
Querying
[Figure: Value Tree T and the query on the Dimension Graph of T: condition = {used}, pc_type = ?, brand = {Sony, IBM}.]
‘Find all Sony, IBM used products’, i.e. find paths in T from r to a leaf node that contain:
- any of the values of dimension pc_type,
- the value ‘used’ of dimension condition,
- either value ‘Sony’ or ‘IBM’ of dimension brand.
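The meaning of such a query can be sketched by filtering root-to-leaf paths directly. This shows only the semantics, not the evaluation strategy of the slides, which goes through the dimension graph. The value tree is a small hypothetical one, the query encoding (None standing for "= ?") and the function name matches are assumptions, and the value is spelled Used as in the tree.

```python
# Hypothetical value tree, listed as root-to-leaf paths (not the slides' exact tree).
PATHS = [
    ("r", "Notebooks", "Used", "Sony"),
    ("r", "Notebooks", "Used", "IBM"),
    ("r", "Notebooks", "New", "Mac"),
    ("r", "Desktops", "Sony", "Used"),
    ("r", "Desktops", "Dell", "New"),
    ("r", "Desktops", "Multimedia", "HP", "Used"),
    ("r", "Servers", "IBM"),
    ("r", "PDAs", "HP"),
]

DIMENSIONS = {
    "pc_type": {"Notebooks", "Desktops", "PDAs"},
    "pc_category": {"Servers", "Multimedia"},
    "brand": {"Mac", "Sony", "HP", "IBM", "Dell"},
    "condition": {"New", "Used"},  # assumed values
}

# Annotated dimensions; None encodes "= ?" (any value of the dimension).
QUERY = {"pc_type": None, "condition": {"Used"}, "brand": {"Sony", "IBM"}}

def matches(path, query):
    """True if the path contains an allowed value for every annotated dimension."""
    for dim, allowed in query.items():
        candidates = DIMENSIONS[dim] if allowed is None else allowed
        if not any(value in candidates for value in path):
            return False
    return True

answers = [path for path in PATHS if matches(path, QUERY)]
for path in answers:
    print("/".join(path))
```

The answers include both r/Notebooks/Used/Sony and r/Desktops/Sony/Used: the query never states whether condition comes before or after brand, so both orderings qualify.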
-21-
Querying
[Figure: Value Tree T and the same query on the Dimension Graph of T (continued).]
-22-
Querying
[Figure: Value Tree T and the same query on the Dimension Graph of T (continued).]
Notice how the query handles the structural inconsistencies!
-23-
Querying
[Figure: part of Value Tree T and the query on the Dimension Graph of T: condition = {used}, pc_type = ?, brand = {Sony, IBM}.]
‘Find all Sony, IBM used products. However, the nodes referring to brand name should be after the node ‘used’.’, i.e. find paths in T from r to a leaf node that contain:
- any of the values of dimension pc_type,
- the value ‘used’ of dimension condition,
- either value ‘Sony’ or ‘IBM’ of dimension brand.
However: values of condition should be parents of values of brand.
-24-
Querying
[Figure: Value Tree T and the query on the Dimension Graph of T: condition = {used}, pc_type = ?, brand = {Sony, IBM}.]
Find paths in T from r to a leaf node that contain:
- any of the values of dimension pc_type,
- the value ‘used’ of dimension condition,
- either value ‘Sony’ or ‘IBM’ of dimension brand.
However: values of condition should be parents of values of brand.
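The extra parent-child requirement can be sketched as a check on consecutive values of a path. The three paths below are hypothetical, and the names PC_TYPE, CONDITION, BRAND and is_answer are introduced only for this illustration.

```python
# Hypothetical root-to-leaf paths of a value tree.
PATHS = [
    ("r", "Notebooks", "Used", "Sony"),
    ("r", "Notebooks", "Used", "IBM"),
    ("r", "Desktops", "Sony", "Used"),
]

PC_TYPE = {"Notebooks", "Desktops", "PDAs"}  # pc_type = ?
CONDITION = {"Used"}                         # condition = {used}
BRAND = {"Sony", "IBM"}                      # brand = {Sony, IBM}

def is_answer(path):
    """Path has a pc_type value and a condition value that is the parent of a brand value."""
    has_type = any(value in PC_TYPE for value in path)
    # Consecutive values on a path are (parent, child) pairs.
    condition_above_brand = any(
        parent in CONDITION and child in BRAND
        for parent, child in zip(path, path[1:])
    )
    return has_type and condition_above_brand

answers = [path for path in PATHS if is_answer(path)]
for path in answers:
    print("/".join(path))
```

With the constraint in place, r/Desktops/Sony/Used no longer qualifies, because there the brand value sits above the condition value.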
-25-
Query Evaluation
Query evaluation exploits dimension graphs to detect answer paths.
An answer path is a path in a dimension graph that starts from R, includes all annotated dimensions, and ends on an annotated dimension.
[Figure: query on the Dimension Graph of T: condition = {used}, pc_type = ?, brand = {Sony, IBM}; the graph shown here also includes the dimensions pc_category and mobile_type.]
Examples of answer paths:
/R/pc_type/condition/brand, /R/pc_type/pc_category/brand/condition, ...
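The definition of an answer path can be sketched as a depth-first enumeration over the dimension graph. The edge set below is hypothetical (the slides' graph cannot be fully recovered from the transcript), and restricting the search to paths that visit each dimension at most once is an assumption, motivated by the rule that two nodes on the same path cannot share a dimension.

```python
# Hypothetical dimension graph as adjacency lists (an assumption, not the slides' graph).
EDGES = {
    "R": ["pc_type", "pc_category"],
    "pc_type": ["brand", "condition", "pc_category"],
    "pc_category": ["brand"],
    "brand": ["condition"],
    "condition": ["brand"],
}

ANNOTATED = {"pc_type", "condition", "brand"}

def answer_paths(edges, annotated, start="R"):
    """Paths from start that include all annotated dimensions and end on one of them."""
    found = []

    def walk(path):
        last = path[-1]
        if last in annotated and annotated <= set(path):
            found.append("/" + "/".join(path))
        for nxt in edges.get(last, []):
            if nxt not in path:  # assumed: each dimension appears at most once
                walk(path + [nxt])

    walk([start])
    return sorted(found)

for path in answer_paths(EDGES, ANNOTATED):
    print(path)
```

On this graph the enumeration yields /R/pc_type/brand/condition, /R/pc_type/condition/brand and /R/pc_type/pc_category/brand/condition; the last two are the examples given on the slide.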
-26-
Query Evaluation
[Figure: Value Tree T and the query on the Dimension Graph of T: condition = {used}, pc_type = ?, brand = {Sony, IBM}.]
Answer paths are used to generate path expressions to be exploited by e.g. an XQuery engine to retrieve the answers from a value tree.
E.g. /R/pc_type/condition/brand gives /r/(Notebooks|Desktops)/Used/(Sony|IBM)
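A minimal sketch of this translation step: each dimension on the answer path is replaced by its annotated values, or by all values of the dimension when the annotation is "= ?". The function name path_expression and the data layout are assumptions. The sketch does no pruning against the value tree, so it lists every pc_type value where the slide lists only Notebooks and Desktops.

```python
# Dimension values in the order given on the slides ("condition" and "R" are assumed).
DIMENSIONS = {
    "R": ["r"],
    "pc_type": ["Notebooks", "Desktops", "PDAs"],
    "pc_category": ["Servers", "Multimedia"],
    "brand": ["Mac", "Sony", "HP", "IBM", "Dell"],
    "condition": ["New", "Used"],
}

# Annotations of the query; None encodes "= ?".
QUERY = {"pc_type": None, "condition": ["Used"], "brand": ["Sony", "IBM"]}

def path_expression(answer_path, dimensions, query):
    """Turn an answer path over dimensions into a path expression over values."""
    steps = []
    for dim in answer_path.strip("/").split("/"):
        # Annotated values if given, otherwise every value of the dimension.
        values = query.get(dim) or dimensions[dim]
        if len(values) == 1:
            steps.append(values[0])
        else:
            steps.append("(" + "|".join(values) + ")")
    return "/" + "/".join(steps)

expr = path_expression("/R/pc_type/condition/brand", DIMENSIONS, QUERY)
print(expr)  # /r/(Notebooks|Desktops|PDAs)/Used/(Sony|IBM)
```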
-27-
Query Evaluation
The answer paths help to detect the orderings of values that can possibly exist in a value tree.
Only these value orderings will be used to compute the answer of a query on the value tree.
This is performed before query evaluation reaches the value tree.
Detecting answer paths in a dimension graph is not a costly task, since dimension graphs are much smaller than value trees.
-28-
Query Evaluation
Query evaluation exploits dimension graphs to detect unsatisfiable queries (i.e. queries with empty answers in the value tree).
Examples of unsatisfiable queries:
[Figure: three queries on a dimension graph with nodes R, pc_type, mobile_type, pc_category, brand and condition, each with some dimensions annotated with = ?.]
Explanations given in the figure: No answer paths! Two children have the same parent! No path from condition to mobile_type!
-29-
Query Evaluation
Dimension graphs can be used to query multiple value trees.
Consider value trees T1, T2, ..., Tn over a dimension set D.
Let G1, G2, ..., Gn be their dimension graphs.
Construct a global dimension graph G by merging G1, G2, ..., Gn.
Queries are formed on G.
The annotations are transferred to G1, G2, ..., Gn.
Query evaluation is performed as described before.
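The slides do not spell out the merge operation. One simple reading, assumed here, is the union of the nodes and edges of the individual graphs. The two graphs G1 and G2 and the function name merge are hypothetical.

```python
# Two hypothetical dimension graphs over the same dimension set, as adjacency sets.
G1 = {"R": {"pc_type"}, "pc_type": {"brand"}, "brand": {"condition"}}
G2 = {"R": {"pc_type"}, "pc_type": {"condition"}, "condition": {"brand"}}

def merge(graphs):
    """Global dimension graph as the union of nodes and edges (an assumed reading of 'merging')."""
    merged = {}
    for graph in graphs:
        for dim, children in graph.items():
            merged.setdefault(dim, set()).update(children)
            for child in children:
                merged.setdefault(child, set())  # keep leaf dimensions as nodes
    return merged

G = merge([G1, G2])
for dim in sorted(G):
    print(dim, "->", sorted(G[dim]))
```

Under this reading G offers both the brand/condition and the condition/brand ordering, so one query formed on G covers both sources once its annotations are transferred to G1 and G2.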
-30-
Conclusions
Querying tree-structured data using dimension graphs:
Dimension graphs capture semantic information in tree-structured data. They are used for query formulation and evaluation.
Queries are not cast on the structure of tree-structured data but on dimension graphs.
Queries can handle structural differences and inconsistencies in value trees.
Query evaluation exploits dimension graphs to generate appropriate path expressions to be evaluated on the value trees.
Dimension graphs can also be used to query multiple value trees.