Typing Semistructured Data By, Keshava Reddy Kottapally Goutham Chinnapolamada Source: Serge Abiteboul, Dan Suciu, Peter Buneman, Data on the web: From relations to semistructured data and XML, Morgan Kaufmann Series, ISBN 1-55860-622-X, 1999

Post on 20-Dec-2015


Typing Semistructured Data

By,

Keshava Reddy Kottapally

Goutham Chinnapolamada

Source:

Serge Abiteboul, Dan Suciu, Peter Buneman, Data on the web: From relations to semistructured data and XML, Morgan Kaufmann Series, ISBN 1-55860-622-X, 1999


Typing Semistructured Data

• Introduction: schema for semistructured data

• Motivation for typing semistructured data

• Schema formalisms:

– First-order logic

– Datalog

– Graph simulations

• Extracting schemas from data

• Inferring schemas from queries

• Path constraints


What is semistructured data..?

• Semistructured data has some structure, but is difficult to describe with a predefined, rigid schema

– Irregularity

– Continual evolution

– Structure that is implicit or unknown to the user


What is typing..?

• Typing is about finding the structure of semistructured data

• The idea of structuring semistructured data is still an area of much research activity

• Typing involves finding methods to provide schemas for semistructured data

• Typing for semistructured data (SSD) differs from typing for relational or object-oriented data and hence needs separate methods


Uses of typing SSD

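• To optimize query evaluation

Example: if the schema tells us that zip occurs only under book.address, the wildcard paths in the query can be replaced by specific ones.

Original query:

select X.title

from biblio._ X

where X.*.zip = "12345"

Optimized form:

select X.title

from biblio.book X

where X.address.zip = "12345"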

[Figure: graph for the biblio example. Labels recoverable from the figure: C1, C2, C3, C4, C5; biblio, book, paper, title, author, first name, last name, address, street, city, zip, journal, year; leaves of type string.]


Uses of typing continued...

• To facilitate the task of integrating several data sources

• To improve storage

– Better clustering may reduce the number of page fetches, thus improving query performance

• To construct indexes

• To describe the database content to users and facilitate query formulation

• To proscribe certain updates


Two ways of typing..

• Schema extraction

– Given one particular data instance, finding the most specific schema for it

– With semistructured data we may specify the type after the database is populated

– A data instance may have more than one type

• Schema inference

– Finding the most specific schema by analyzing the query

– This process is similar to type inference in programming languages


The problem

• Given a database and a type

– Does the database conform to this type?

• Classification of objects

– Which objects belong to each class?

• Typing involves description of the structure of each class and its relationships with other classes


Difference between typing SSD and Object Databases

• Classes are defined less precisely. As a consequence, objects may belong to several classes

• Some objects may not belong to any class or may have properties that do not pertain to any class

• The typing may be approximate. For example, we may accept in a class an object that does not quite conform to the specification of that class.


Schema formalisms

First-order logic

Datalog

Simulation


First-order logic

• Example: Consider three kinds of objects in the database

– Root object(s) have

• Outgoing edges labeled company to company objects and person to person objects

– Person objects have

• Outgoing edges labeled name and position to string objects

• Outgoing edges labeled worksfor to company objects

• Incoming edges labeled manager and employee from company objects

– Company objects have

• Outgoing edges labeled name and address to string objects

• Outgoing edges labeled manager and employee to person objects

• Incoming edges labeled worksfor from person objects


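• If:

– If an object has an a-edge to a string and a b-edge from a c' object, then it is a c-object:

$\forall X\, \bigl( \exists Y, Z\, (\mathrm{ref}(X,a,Y) \wedge \mathrm{string}(Y) \wedge c'(Z) \wedge \mathrm{ref}(Z,b,X)) \rightarrow c(X) \bigr)$

• Only-if:

– Any c-object has some a-edge to a string and some b-edge from a c' object:

$\forall X\, \bigl( \exists Y, Z\, (\mathrm{ref}(X,a,Y) \wedge \mathrm{string}(Y) \wedge c'(Z) \wedge \mathrm{ref}(Z,b,X)) \leftarrow c(X) \bigr)$

• If and only if:

$\forall X\, \bigl( \exists Y, Z\, (\mathrm{ref}(X,a,Y) \wedge \mathrm{string}(Y) \wedge c'(Z) \wedge \mathrm{ref}(Z,b,X)) \leftrightarrow c(X) \bigr)$

• Consequence:

– $c(X) \wedge \mathrm{ref}(Z,b,X) \rightarrow c'(Z)$

– $c(X) \wedge \mathrm{ref}(X,a,Y) \rightarrow \mathrm{string}(Y)$

– $c(X) \wedge \mathrm{ref}(X,L,Y) \wedge L \neq a \wedge L \neq b \rightarrow \mathrm{false}$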
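The "if" and "only-if" directions can be checked mechanically against a finite set of facts. The sketch below is a minimal illustration, assuming the ref relation is stored as a set of (source, label, target) triples; the function names `body_holds` and `check_typing` and the tiny sample database are hypothetical.

```python
def body_holds(x, ref, strings, c_prime, a="a", b="b"):
    """True if x has an a-edge to a string and a b-edge from a c' object."""
    a_to_string = any(s == x and l == a and t in strings for (s, l, t) in ref)
    b_from_c_prime = any(t == x and l == b and s in c_prime for (s, l, t) in ref)
    return a_to_string and b_from_c_prime


def check_typing(objects, ref, strings, c_prime, c):
    """Return (if_holds, only_if_holds) for a candidate extension c of the class."""
    if_holds = all(x in c for x in objects if body_holds(x, ref, strings, c_prime))
    only_if_holds = all(body_holds(x, ref, strings, c_prime) for x in c)
    return if_holds, only_if_holds


# Hypothetical database: x1 satisfies the body, x2 lacks the incoming b-edge.
ref = {("z1", "b", "x1"), ("x1", "a", "s1"), ("x2", "a", "s2")}
strings = {"s1", "s2"}
c_prime = {"z1"}
objects = {"x1", "x2", "z1", "s1", "s2"}

print(check_typing(objects, ref, strings, c_prime, c={"x1"}))
print(check_typing(objects, ref, strings, c_prime, c={"x1", "x2"}))
print(check_typing(objects, ref, strings, c_prime, c=set()))
```

Only the extension {x1} satisfies both directions here; adding x2 breaks "only-if", and the empty extension breaks "if".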


Problem definition with first-order logic

• The previous questions on typing can be restated in terms of first-order logic, for a database D and a type specification T

– Does D satisfy T, written D |= T? That is, is there a model of T that coincides with D over the extensional predicates?

– If D |= T, what is the classification that is induced?

• First-order logic leads to very general typings, probably too general for what is needed in semistructured data

• It could also lead to undecidability or intractability


Datalog: A rule-based language

• Datalog allows us to state that if a conjunction of facts holds, then some new fact can be derived

• Datalog rules allow us to define classes by specifying what incoming and outgoing edges are required

• Example:

– r(X) :- ref(X, person, Y), p(Y), ref(X, company, Z), c(Z)

– p(X) :- c(Y), ref(Y, manager, X), c(Z), ref(Z, employee, X), ref(X, worksfor, U), c(U), ref(X, name, N), string(N), ref(X, position, P), string(P)

– c(X) :- p(Z), ref(Z, worksfor, X), ref(X, manager, M), p(M), ref(X, employee, E), p(E), ref(X, name, N), string(N), ref(X, address, A), string(A)


Fixpoint semantics

• Least fixpoint semantics

– We start with no typing facts. Every rule body requires at least one typing fact (r, p, or c), so nothing can be derived. Hence, the empty set of typing facts is the least fixpoint for this program

• Greatest fixpoint semantics

– Types the largest possible set of objects

• The goal is to find the greatest fixpoint for a given data graph D. The desired model is the greatest fixpoint containing D.


Consider the following data graph D:

&o1 { company: &o2 { name: &o5 "o2",

                     address: &o6 "Versailles",

                     manager: &o3,

                     employee: &o3,

                     employee: &o4 },

      person: &o3 { name: &o7 "Francois",

                    position: &o8 "CEO",

                    worksfor: &o2 },

      person: &o4 { name: &o9 "Lucien",

                    position: &o10 "programmer",

                    worksfor: &o2 }

}

• ref(&o1, company, &o2), ref(&o2, name, &o5), etc.

• string(&o5), string(&o6), etc.


Deriving the greatest fixpoint

• The desired model M can be derived by starting from a model containing D and all possible typing facts. Let

J0 = D ∪ { r(&o1), r(&o2), r(&o3), r(&o4), p(&o1), p(&o2), p(&o3), p(&o4), c(&o1), c(&o2), c(&o3), c(&o4) }

• Applying the rules to J0 repeatedly, until a fixpoint is reached, yields the desired model

M = J2 = J1 = D ∪ { r(&o1), c(&o2), p(&o3), p(&o4) }
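The downward iteration from J0 can be sketched in Python as follows. Two things here are assumptions made for brevity and are not taken from the slides: the rules are reduced to their outgoing-edge requirements only (the incoming-edge conditions of the Datalog program are left out), and the names `REF`, `STRINGS`, `COMPLEX`, `RULES`, and `greatest_fixpoint` are hypothetical. The sketch starts with every typing fact and discards any fact whose rule body is no longer satisfied.

```python
# Data graph D as ref(source, label, target) facts (oids written without '&').
REF = {
    ("o1", "company", "o2"), ("o1", "person", "o3"), ("o1", "person", "o4"),
    ("o2", "name", "o5"), ("o2", "address", "o6"),
    ("o2", "manager", "o3"), ("o2", "employee", "o3"), ("o2", "employee", "o4"),
    ("o3", "name", "o7"), ("o3", "position", "o8"), ("o3", "worksfor", "o2"),
    ("o4", "name", "o9"), ("o4", "position", "o10"), ("o4", "worksfor", "o2"),
}
STRINGS = {"o5", "o6", "o7", "o8", "o9", "o10"}
COMPLEX = {"o1", "o2", "o3", "o4"}

# Simplified rules (assumption): required outgoing edges as (label, target class).
RULES = {
    "r": [("person", "p"), ("company", "c")],
    "p": [("worksfor", "c"), ("name", "string"), ("position", "string")],
    "c": [("manager", "p"), ("employee", "p"), ("name", "string"), ("address", "string")],
}


def greatest_fixpoint(ref, strings, objects, rules):
    # J0: every complex object is tentatively placed in every class.
    facts = {(cls, o) for cls in rules for o in objects}

    def in_class(cls, o):
        return o in strings if cls == "string" else (cls, o) in facts

    changed = True
    while changed:
        changed = False
        for cls, o in sorted(facts):
            # The fact survives only if each required edge exists and
            # leads to an object that is still in the required class.
            supported = all(
                any(s == o and l == label and in_class(target, t) for (s, l, t) in ref)
                for label, target in rules[cls]
            )
            if not supported:
                facts.discard((cls, o))
                changed = True
    return facts


model = greatest_fixpoint(REF, STRINGS, COMPLEX, RULES)
print(sorted(model))
```

Because removing a fact can only invalidate other facts, never restore them, the loop terminates after at most as many removals as there are facts in J0.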


Simulation

• The aim is to produce a schema graph for a data graph whose semantics lead to a listing of all permitted labels.

• A schema graph is similar to a data graph with the following changes

– Labels can be alternations (like address | name | url) or the underscore wildcard

– Atomic values are type names, like string, int, float, etc.

– Oids of complex objects are called classes, like Person, Company, etc.

[Figure: example data graph. Nodes: &r1; &p1, &p2, &p3; &c1, &c2; &s0 to &s10; &a1 to &a7. Edge labels: person, company, manager, mgr, emp, name, position, addr, phone, url, worksfor, description, procurement, salesrep, contact, task, performance, 1997, 1998. String values: "Smith", "Mgr", "Widget", "Trent", "Joe".]

Schema graph

[Figure: schema graph with nodes Root, Person, Company, String, and Any. Edge labels: company, person, employee, manager, worksfor, name|address|url, name|phone|position, description, and the wildcard _.]


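• Simulation is defined as follows: given graphs G1 = (V1, E1) and G2 = (V2, E2), a relation R between V1 and V2 is a simulation if it satisfies

$\forall l \in L\ \forall x_1, y_1 \in V_1\ \forall x_2 \in V_2\ \bigl( x_1 [l] y_1 \wedge x_1 R x_2 \rightarrow \exists y_2 \in V_2\ ( y_1 R y_2 \wedge x_2 [l] y_2 ) \bigr)$

• The rule says that every edge in G1 must have a "corresponding" edge in G2 under the simulation

[Figure: an edge x1 [l] y1 in G1 and an edge x2 [l] y2 in G2, with x1 R x2 and y1 R y2.]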
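The simulation condition translates directly into a check over finite edge sets. The sketch below assumes each graph is given as a set of (source, label, target) triples and R as a set of (x1, x2) pairs; the function name `is_simulation` and the two-node sample graphs are hypothetical.

```python
def is_simulation(edges1, edges2, relation):
    """Check that every G1 edge from a related node is matched by a G2 edge."""
    for (x1, label, y1) in edges1:
        for (a, x2) in relation:
            if a != x1:
                continue
            # Need some y2 with x2 [label] y2 in G2 and y1 R y2.
            matched = any(
                s == x2 and l == label and (y1, y2) in relation
                for (s, l, y2) in edges2
            )
            if not matched:
                return False
    return True


g1 = {("x1", "l", "y1")}
g2 = {("x2", "l", "y2")}

print(is_simulation(g1, g2, {("x1", "x2"), ("y1", "y2")}))  # both pairs present
print(is_simulation(g1, g2, {("x1", "x2")}))                # y1 R y2 missing
```

Note that the empty relation passes this check trivially, which is why conformance to a schema additionally requires the roots to be related.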


• To define a simulation between a semistructured data instance and a schema graph, we add the following additional requirements:

– The roots must be in the simulation: r R r’

– Whenever x R y, if y is an atomic type (like string, int), then x must be an atomic node too and have a value of that type. We say the simulation is typed


Data node               Schema node

&r1                     Root

&c1, &c2                Company

&p1, &p2, &p3           Person

&s0, &s1, &s2, &s3, …   string

&a1, &a2, &a3, &a4, …   Any

• The relation R given by this table, between the example data graph and the given schema graph, is a simulation


Back to the typing problem….

• When does a data graph D conform to a schema graph S?

– When there exists a rooted, typed simulation between the data and the schema

• Which objects belong to each class?

– The principle is that oid o should belong to class c if o R c. In this way, a rooted simulation R classifies every object reachable from the root.

– However, the classification need not be unique, since several simulations may exist. This leads to the notion of a maximal simulation

[Figure: a data graph D and a schema graph S. Recoverable labels: &o, &b1, &b2; edge labels book, title, author, publisher, year; leaves of type string.]


Maximal simulation

• $G_1 \leq_R G_2$ : R is a simulation from G1 to G2

• Fact:

– If $G_1 \leq_{R_1} G_2$ and $G_1 \leq_{R_2} G_2$ then $G_1 \leq_{R_1 \cup R_2} G_2$

– For any data graph D conforming to some schema graph S, there is always a maximal simulation from D to S, namely the union of all simulations from D to S.

• Back to the problem: which objects belong to each class?

– An object o belongs to some class c if o R c, where R is the maximal simulation between the OEM data and the schema graph
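One way to compute the maximal simulation is to start from all type-compatible pairs and remove pairs that violate the simulation condition until nothing changes. The sketch below follows that idea under several assumptions that are not in the slides: schema edges carry a set of alternative labels (or None for the wildcard), atomic types map to Python types, and the names `maximal_simulation`, `classify`, and the small person/company instance are hypothetical.

```python
def maximal_simulation(data_edges, data_values, schema_edges, atomic_types):
    """Return the largest typed simulation from the data graph to the schema graph."""
    data_nodes = {n for (s, _, t) in data_edges for n in (s, t)} | set(data_values)
    schema_nodes = {n for (s, _, t) in schema_edges for n in (s, t)} | set(atomic_types)

    def type_ok(x, c):
        # Typed condition: an atomic schema node only matches atomic data of that type.
        if c in atomic_types:
            return x in data_values and isinstance(data_values[x], atomic_types[c])
        return True

    rel = {(x, c) for x in data_nodes for c in schema_nodes if type_ok(x, c)}
    changed = True
    while changed:
        changed = False
        for (x, c) in sorted(rel):
            for (s, label, y) in data_edges:
                if s != x:
                    continue
                matched = any(
                    sc == c and (labels is None or label in labels) and (y, tc) in rel
                    for (sc, labels, tc) in schema_edges
                )
                if not matched:
                    rel.discard((x, c))
                    changed = True
                    break
    return rel


def classify(rel, objects):
    """Map each object to the sorted list of classes it is related to."""
    return {o: sorted(c for (x, c) in rel if x == o) for o in objects}


# Hypothetical instance modeled on the person/company example.
data_edges = {
    ("r", "person", "p1"), ("r", "company", "c1"),
    ("p1", "name", "s1"), ("p1", "worksfor", "c1"),
    ("c1", "name", "s2"), ("c1", "employee", "p1"),
}
data_values = {"s1": "Smith", "s2": "Widget"}
schema_edges = [
    ("Root", {"person"}, "Person"), ("Root", {"company"}, "Company"),
    ("Person", {"name", "position"}, "string"), ("Person", {"worksfor"}, "Company"),
    ("Company", {"name", "address"}, "string"), ("Company", {"employee", "manager"}, "Person"),
]
atomic_types = {"string": str}

sim = maximal_simulation(data_edges, data_values, schema_edges, atomic_types)
conforms = ("r", "Root") in sim          # rooted condition
classes = classify(sim, {"r", "p1", "c1"})
print(conforms, classes)
```

The data conforms to the schema exactly when the root pair survives the refinement. Atomic nodes have no outgoing edges, so in this sketch they remain related to every class node as well as to their atomic type, which is why the classification above is restricted to complex objects.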