
Typing Semistructured Data

By,

Keshava Reddy Kottapally

Goutham Chinnapolamada

Source:

Serge Abiteboul, Dan Suciu, Peter Buneman, Data on the web: From relations to semistructured data and XML, Morgan Kaufmann Series, ISBN 1-55860-622-X, 1999

Typing Semistructured Data

• Introduction: schema for semistructured data

• Motivation for typing semistructured data

• Schema formalisms:

– First-order logic

– Datalog

– Graph simulations

• Extracting schemas from data

• Inferring schemas from queries

• Path constraints

What is semistructured data..?

• Semistructured data has some structure, but is difficult to describe with a predefined, rigid schema

– Irregularity

– Continual evolution

– Structure that is implicit or unknown to the user

What is typing..?

• Typing is about finding the structure of semistructured data

• The idea of structuring semistructured data is still an area of much research activity

• Typing involves finding methods to provide schemas for semistructured data

• Typing methods for SSD differ from those for relational or object-oriented data, so SSD needs its own methods

Uses of typing SSD

• To optimize query evaluation

Example:

Original query:

select X.title

from biblio._ X

where X.*.zip = “12345”

Optimized form:

select X.title

from biblio.book X

where X.address.zip = “12345”
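As a rough illustration of why the schema helps, the sketch below evaluates both forms of the query over a small, made-up bibliography stored as nested Python dicts. The data, the dict encoding, and the helper names (children, descendants, original_query, optimized_query) are assumptions for illustration only; they are not taken from the source.

```python
# Hypothetical data: a tiny bibliography as nested dicts.
# An edge with several targets is represented by a list.
biblio = {
    "book": [
        {"title": "Book A", "address": {"city": "City A", "zip": "12345"}},
        {"title": "Book B", "address": {"city": "City B", "zip": "99999"}},
    ],
    "paper": [
        {"title": "Paper C", "journal": "Journal C", "year": "1999"},
    ],
}


def children(node):
    """Yield (label, child) for every outgoing edge; atomic values have none."""
    if isinstance(node, dict):
        for label, target in node.items():
            targets = target if isinstance(target, list) else [target]
            for child in targets:
                yield label, child


def descendants(node):
    """Yield the node itself and everything reachable from it (the path *)."""
    yield node
    for _, child in children(node):
        yield from descendants(child)


def original_query(root, zip_code):
    # from biblio._ X where X.*.zip = zip_code
    # Any first label, then an arbitrary path: the whole subtree is searched.
    result = []
    for _, x in children(root):
        found = any(
            label == "zip" and value == zip_code
            for d in descendants(x)
            for label, value in children(d)
        )
        if found:
            result.extend(t for label, t in children(x) if label == "title")
    return result


def optimized_query(root, zip_code):
    # from biblio.book X where X.address.zip = zip_code
    # Only book edges are followed, and zip is looked up under address.
    result = []
    for label, x in children(root):
        if label != "book":
            continue
        found = any(
            l1 == "address" and l2 == "zip" and value == zip_code
            for l1, addr in children(x)
            for l2, value in children(addr)
        )
        if found:
            result.extend(t for l, t in children(x) if l == "title")
    return result


print(original_query(biblio, "12345"))
print(optimized_query(biblio, "12345"))
```

Both functions return the same titles on this data, but the optimized form never visits paper objects or explores paths other than address.zip. That rewrite is only safe when the schema guarantees that zip occurs nowhere else.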

[Figure: schema graph for the bibliography example. Node and edge labels: biblio, book, paper, title, author, first name, last name, address, street, city, zip, journal, year; leaf values are of type string.]

Uses of typing continued...

• To facilitate the task of integrating several data sources

• To improve storage

– Better clustering may reduce the number of page fetches, thus improving query performance

• To construct indexes

• To describe the database content to users and facilitate query formulation

• To proscribe certain updates

Two ways of typing..

• Schema extraction

– Given one particular data instance, finding the most specific schema for it

– With semistructured data we may specify the type after the database is populated

– A data instance may have more than one type

• Schema inference

– Finding the most specific schema by analyzing the query

– This process is similar to type inference in programming languages

The problem

• Given a database and a type,

– does the database conform to this type…?

• Classification of objects

– Which objects belong to each class..?

• Typing involves description of the structure of each class and its relationships with other classes

Difference between typing SSD and Object Databases

• Classes are defined less precisely. As a consequence, objects may belong to several classes

• Some objects may not belong to any class or may have properties that do not pertain to any class

• The typing may be approximate. For example, we may accept in a class an object that does not quite conform to the specification of that class.

Schema formalisms

First-order logic

Datalog

Simulation

First-order logic

• Example: Consider three kinds of objects in the database

– Root object(s) have

• Outgoing edges labeled company to company objects and person to person objects

– Person objects have

• Outgoing edges labeled name and position to string objects

• Outgoing edges labeled worksfor to company objects

• Incoming edges labeled manager and employee from company objects

– Company objects have

• Outgoing edges labeled name and address to string objects

• Outgoing edges labeled manager and employee to person objects

• Incoming edges labeled worksfor from person objects

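• If:

– If an object has a-edges to strings and b-edges from c′ objects, then it is a c-object:

$\forall X\,\bigl(\exists Y, Z\,(\mathit{ref}(X,a,Y) \wedge \mathit{string}(Y) \wedge c'(Z) \wedge \mathit{ref}(Z,b,X)) \rightarrow c(X)\bigr)$

• Only-if:

– Any c-object has some a-edges to strings and some b-edges from c′ objects:

$\forall X\,\bigl(c(X) \rightarrow \exists Y, Z\,(\mathit{ref}(X,a,Y) \wedge \mathit{string}(Y) \wedge c'(Z) \wedge \mathit{ref}(Z,b,X))\bigr)$

• If and only if:

$\forall X\,\bigl(\exists Y, Z\,(\mathit{ref}(X,a,Y) \wedge \mathit{string}(Y) \wedge c'(Z) \wedge \mathit{ref}(Z,b,X)) \leftrightarrow c(X)\bigr)$

• Consequence:

– $c(X) \wedge \mathit{ref}(Z,b,X) \rightarrow c'(Z)$

– $c(X) \wedge \mathit{ref}(X,a,Y) \rightarrow \mathit{string}(Y)$

– $c(X) \wedge \mathit{ref}(X,L,Y) \wedge L \neq a \wedge L \neq b \rightarrow \mathit{false}$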

Problem definition with first-order logic

• The previous questions on typing can be restated in terms of first-order logic

– Does D satisfy T, noted D |= T, that is, is there a model of T that coincides with D over the extensional predicates..?

– If D |= T, what is the classification that is induced..?

• First-order logic leads to very general typings, probably too general for what is needed in semistructured data

• It could also lead to undecidability or intractability

Datalog: A rule-based language

• Datalog allows us to state that if a conjunction of facts holds, then some new fact can be derived

• Datalog rules allow us to define classes by specifying what incoming and outgoing edges are required

• Example:

– r(X) :- ref(X, person, Y), p(Y), ref(X, company, Z), c(Z)

– p(X) :- c(Y), ref(Y, manager, X), c(Z), ref(Z, employee, X), ref(X, worksfor, U), c(U), ref(X, name, N), string(N), ref(X, position, P), string(P)

– c(X) :- p(Z), ref(Z, worksfor, X), ref(X, manager, M), p(M), ref(X, employee, E), p(E), ref(X, name, N), string(N), ref(X, address, A), string(A)

Fixpoint semantics

• Least fixpoint semantics

– Starting from the empty set of typing facts, nothing can be derived, because every rule body requires some class fact to hold already. Hence the empty set of typing facts is the least fixpoint for this program, and no object gets a class

• Greatest fixpoint semantics

– Types the largest possible set of objects

• The goal is to find the greatest fixpoint for a given data graph D: the desired model is the greatest fixpoint containing D.

Consider the following data graph D:

&o1 { company: &o2 { name: &o5 “o2”,
                     address: &o6 “Versailles”,
                     manager: &o3,
                     employee: &o3,
                     employee: &o4 },
      person: &o3 { name: &o7 “Francois”,
                    position: &o8 “CEO”,
                    worksfor: &o2 },
      person: &o4 { name: &o9 “Lucien”,
                    position: &o10 “programmer”,
                    worksfor: &o2 }
}

• ref(&o1, company, &o2), ref(&o2, name, &o5), etc.

• string(&o5), string(&o6), etc.

Deriving the greatest fixpoint

• The desired model M can be derived by starting from a model containing D and all possible typing facts. Let

J0 = D ∪ { r(&o1), r(&o2), r(&o3), r(&o4), p(&o1), p(&o2), p(&o3), p(&o4), c(&o1), c(&o2), c(&o3), c(&o4) }

• Each step keeps only the typing facts that the rules can still derive from the previous set. Repeating this from J0 until a fixpoint is reached gives the desired model:

M = J2 = J1 = D ∪ { r(&o1), c(&o2), p(&o3), p(&o4) }
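The derivation can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the source's algorithm: the edges are transcribed from the data graph D, each class is encoded as a list of required outgoing and incoming edges, and the names REF, RULES, satisfies, greatest_fixpoint and least_fixpoint are hypothetical. One deliberate simplification: the p rule here requires only an incoming employee edge. With the incoming manager edge also required, as in the rule for p(X) written earlier, &o4 would be dropped, since D contains no manager edge into &o4.

```python
# Edges of the data graph D as (source, label, target) triples.
REF = {
    ("&o1", "company", "&o2"), ("&o1", "person", "&o3"), ("&o1", "person", "&o4"),
    ("&o2", "name", "&o5"), ("&o2", "address", "&o6"),
    ("&o2", "manager", "&o3"), ("&o2", "employee", "&o3"), ("&o2", "employee", "&o4"),
    ("&o3", "name", "&o7"), ("&o3", "position", "&o8"), ("&o3", "worksfor", "&o2"),
    ("&o4", "name", "&o9"), ("&o4", "position", "&o10"), ("&o4", "worksfor", "&o2"),
}
STRINGS = {"&o5", "&o6", "&o7", "&o8", "&o9", "&o10"}
COMPLEX = {"&o1", "&o2", "&o3", "&o4"}

# Each class lists required outgoing edges (label, target class) and required
# incoming edges (label, source class). "string" stands for the atomic type.
RULES = {
    "r": {"out": [("person", "p"), ("company", "c")], "in": []},
    # Assumption: only an incoming employee edge is required for p.
    "p": {"out": [("worksfor", "c"), ("name", "string"), ("position", "string")],
          "in": [("employee", "c")]},
    "c": {"out": [("manager", "p"), ("employee", "p"),
                  ("name", "string"), ("address", "string")],
          "in": [("worksfor", "p")]},
}


def holds(cls, obj, facts):
    """True if obj is a string (for the atomic type) or (cls, obj) is a current fact."""
    if cls == "string":
        return obj in STRINGS
    return (cls, obj) in facts


def satisfies(cls, x, facts):
    """True if the rule body for cls(x) is satisfied by D together with facts."""
    rule = RULES[cls]
    out_ok = all(
        any(s == x and l == label and holds(target, t, facts) for s, l, t in REF)
        for label, target in rule["out"]
    )
    in_ok = all(
        any(t == x and l == label and holds(source, s, facts) for s, l, t in REF)
        for label, source in rule["in"]
    )
    return out_ok and in_ok


def greatest_fixpoint():
    # J0: every complex object is assumed to belong to every class.
    facts = {(cls, x) for cls in RULES for x in COMPLEX}
    while True:
        kept = {(cls, x) for cls, x in facts if satisfies(cls, x, facts)}
        if kept == facts:
            return facts
        facts = kept


def least_fixpoint():
    # Start from no typing facts and add whatever the rules derive.
    facts = set()
    while True:
        derived = {(cls, x) for cls in RULES for x in COMPLEX
                   if satisfies(cls, x, facts)}
        if derived == facts:
            return facts
        facts = derived


print(sorted(greatest_fixpoint()))
print(sorted(least_fixpoint()))
```

Under these assumptions the greatest fixpoint is reached after one pruning step and contains r(&o1), c(&o2), p(&o3) and p(&o4), while the least fixpoint stays empty.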

Simulation

• The aim is to produce a schema graph for a data graph whose semantics lead to a listing of all permitted labels.

• A schema graph is similar to a data graph, with the following changes

– Labels can be alternations (like address | name | url) or underscore

– Atomic values are type names, like string, int, float, etc.

– Oids of complex objects are called classes, like Person, Company, etc.

[Figure: example data graph with root &r1, nodes &p1, &c1, &p2, &c2, &p3, string nodes &s0 to &s10, and nodes &a1 to &a7. Edge labels: person, company, manager (mgr), employee (emp), worksfor, name, position, addr, phone, url, description, procurement, salesrep, contact, task, performance, 1997, 1998. String values shown include “Smith”, “Mgr”, “Widget”, “Trent”, “Joe”.]

Schema graph

[Figure: schema graph with nodes Root, Person, Company, String and Any. Edge labels: person, company, employee, manager, worksfor, name|address|url, name|phone|position, description, and _ (underscore).]

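• Simulation is defined as follows: given graphs $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$, a relation $R \subseteq V_1 \times V_2$ is a simulation if it satisfies

$\forall l \in L\ \forall x_1, y_1 \in V_1\ \forall x_2 \in V_2\ \bigl(x_1 [l] y_1 \wedge x_1 R x_2 \rightarrow \exists y_2 \in V_2\,(y_1 R y_2 \wedge x_2 [l] y_2)\bigr)$

where $x [l] y$ denotes an edge labeled $l$ from $x$ to $y$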

• The rule says that every edge in G1 must have a “corresponding” edge in G2 under the simulation

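[Figure: simulation diagram. In G1, x1 has an edge labeled l to y1; in G2, x2 has an edge labeled l to y2; R relates x1 to x2 and y1 to y2.]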
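The simulation condition can be checked directly on small graphs. The following is a minimal sketch, assuming graphs are given as sets of (source, label, target) triples and R as a set of pairs; the function name is_simulation and the example graphs are hypothetical. Label alternations and the underscore wildcard of schema graphs are not handled here; labels must match exactly.

```python
def is_simulation(edges1, edges2, relation):
    """Return True if relation (a set of (v1, v2) pairs) is a simulation.

    edges1, edges2: sets of (source, label, target) triples for G1 and G2.
    """
    for x1, label, y1 in edges1:
        for a, x2 in relation:
            if a != x1:
                continue
            # Need some y2 with an edge x2 -[label]-> y2 and y1 R y2.
            matched = any(
                s == x2 and l == label and (y1, y2) in relation
                for s, l, y2 in edges2
            )
            if not matched:
                return False
    return True


# Hypothetical example: two books in G1, one Book class in G2.
g1 = {
    ("d1", "book", "d2"), ("d2", "title", "d3"),
    ("d1", "book", "d4"), ("d4", "title", "d5"),
}
g2 = {("Root", "book", "Book"), ("Book", "title", "String")}

good = {("d1", "Root"), ("d2", "Book"), ("d4", "Book"),
        ("d3", "String"), ("d5", "String")}
bad = good - {("d5", "String")}

print(is_simulation(g1, g2, good))
print(is_simulation(g1, g2, bad))
```

In the second call the edge from d4 to d5 has no counterpart, because d5 is no longer related to any schema node, so the relation fails the condition.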

• To define a simulation between a semistructured data instance and a schema graph, we add the following additional requirements:

– The roots must be in the simulation: r R r’

– Whenever x R y, if y is an atomic type (like string, int), then x must be an atomic node too and have a value of that type. We say the simulation is typed

Data node                 Schema node
&r1                       Root
&c1, &c2                  Company
&p1, &p2, &p3             Person
&s0, &s1, &s2, &s3, …     string
&a1, &a2, &a3, &a4, …     Any

• The relation R defined by the example data graph and the given schema graph is a simulation

Back to the typing problem….

• When does a data graph D conform to a schema graph S..?

– When there exists a rooted, typed simulation between the data and the schema

• Which objects belong to each class..?

– The principle is that oid ‘o’ should belong to class ‘c’ if o R c. In this way, a rooted simulation R will always classify all objects.

– However, the classification need not be unique, which leads us to look for the maximal simulation

[Figure: a data graph D (nodes &o, &b1, &b2) and a schema graph S. The book edges lead to nodes with edges title, author, author; title, author, publisher; and title, author, year, each ending in string.]

Maximal simulation

• $G_1 \leq_R G_2$: R is a simulation from G1 to G2

• Fact:

– If $G_1 \leq_{R_1} G_2$ and $G_1 \leq_{R_2} G_2$, then $G_1 \leq_{R_1 \cup R_2} G_2$

– For any data graph D conforming to some schema graph S, there is always a maximal simulation from D to S.

• Back to the problem: Which objects belong to each class…?

– An object ‘o’ belongs to some class ‘c’ if o R c, where R is the maximal simulation between the OEM data and the schema graph
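One common way to obtain the maximal simulation is to start from all candidate pairs and repeatedly delete pairs that violate the simulation condition. The sketch below follows that idea; it is an illustrative assumption, not necessarily the procedure used in the source. The function name maximal_simulation, the parameters atomic_data and atomic_schema, and the example graphs are hypothetical. Typing is only partly modelled: a data node may be related to an atomic schema node only if it is itself atomic, and the check that the value has the right type is omitted.

```python
def maximal_simulation(data_edges, schema_edges, atomic_data, atomic_schema):
    """Return the largest typed simulation from the data graph to the schema graph.

    data_edges, schema_edges: sets of (source, label, target) triples.
    atomic_data: atomic data nodes; atomic_schema: atomic type nodes of the schema.
    """
    data_nodes = {n for s, _, t in data_edges for n in (s, t)}
    schema_nodes = {n for s, _, t in schema_edges for n in (s, t)}

    # Start with every pair allowed by the typing requirement.
    relation = {
        (x1, x2)
        for x1 in data_nodes
        for x2 in schema_nodes
        if x2 not in atomic_schema or x1 in atomic_data
    }

    changed = True
    while changed:
        changed = False
        for x1, x2 in list(relation):
            for s1, label, y1 in data_edges:
                if s1 != x1:
                    continue
                matched = any(
                    s2 == x2 and l2 == label and (y1, y2) in relation
                    for s2, l2, y2 in schema_edges
                )
                if not matched:
                    # The edge x1 -[label]-> y1 has no counterpart from x2.
                    relation.discard((x1, x2))
                    changed = True
                    break
    return relation


# Hypothetical example: one book that fits two schema classes.
D = {("&o", "book", "&b1"), ("&b1", "title", "&t1"), ("&b1", "author", "&a1")}
S = {
    ("Root", "book", "B1"), ("Root", "book", "B2"),
    ("B1", "title", "string"), ("B1", "author", "string"),
    ("B2", "title", "string"), ("B2", "author", "string"), ("B2", "year", "string"),
}

R = maximal_simulation(D, S, atomic_data={"&t1", "&a1"}, atomic_schema={"string"})
classes_of_b1 = {c for o, c in R if o == "&b1"}
print(sorted(classes_of_b1))
print(("&o", "Root") in R)
```

Here &b1 is related to both B1 and B2, which shows that an object may belong to several classes. D conforms to S in the rooted sense because the pair (&o, Root) survives the pruning.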
