Master Informatique 10/9/2007 1

Typing semistructured data

Serge Abiteboul

2008

Typing semistructured data

Organization

• Motivations
• Automata
  – Automata on words
  – Ranked tree automata
  – Unranked tree automata
  – Automata and monadic second-order logic
  – Automata to compute
• XML typing: DTD, XML schema
• Graphs and bisimulation

Master Informatique 10/9/2007 3

Motivation

Typing semistructured data

XML typing

• Not compulsory
• Simplify writing software for XML
  – Improve interoperability between programs
• Improve storage and performance
• Ease querying: data guide
• Simplify data protection
  – Reject illegal updates, like relational dependencies

Improve storage

[Figure: a lower-bound schema with nodes Root, Company, Employee and string, and edges company, person, works-for, c.e.o., address, name and managed-by. It is mapped to two relations, Company(oid, name, address, c.e.o.) and Employee(oid, name, managed-by, works-for). The rest is stored in an overflow graph.]

Improve performance

[Figure: a schema graph rooted at Bib, with children paper and book. Below them appear the labels year (int), journal, title, author (firstname, lastname) and address (zip, city, street), with string leaves.]

Without type information:

  select X.title
  from Bib._ X
  where X.*.zip = "12345"

Knowing the schema, the query can be rewritten as:

  select X.title
  from Bib.book X
  where X.address.zip = "12345"

Type checking

• Who checks
  – XML editor: check that the data conforms to its type
  – XML exchange, e.g., with a Web service
    • Server: when delivering the data
    • Client/application: when receiving it
• Dynamic verification: after the data is produced
• Static verification: verification of the program that generates the data

Static verification

• Input: input type T and code of function f
  – f is written in XQuery, XPath, XSLT, etc.
• Verification of T’
  – Is it true that for every d ╞ T, f(d) ╞ T’?
• Type inference
  – Find the smallest T’ such that for every d ╞ T, f(d) ╞ T’
• Quickly becomes undecidable because of “joins”

Example

  for $p in doc("parts.xml")//part[color="red"]
  return <part>
           <name>{ $p/name/text() }</name>
           <desc>{ $p/desc/node() }</desc>
         </part>

Result type: (part (name (string), desc (any)))*

If the type of parts.xml//part/desc is string: (part (name (string), desc (string)))*

Difficulty

  for $X in Input, $Y in Input do { print ( <b/> ) }

Input: <a/> <a/>
Result: <b/> <b/> <b/> <b/>

Problem: the set of possible results, $\{ b^{i} \mid i = n^{2} \text{ for } n \geq 0 \}$, is not regular and cannot be described in XML schema.

There is no « best » result type, only better and better approximations:
  – $b^{*}$
  – $\varepsilon + b + b^{4} b^{*}$
  – $\varepsilon + b + b^{4} + b^{9} b^{*}$
  – …

Why tree automata?

• XML = unranked trees
• No theory for XML
• Rich theory for strings: automata
• It extends to a rich theory for ranked trees: tree automata
  – Nice algorithms
  – Nice theorems
  – Can this carry over to unranked trees and XML?
• Yes!

From strings to trees

[Figure: three structures over the labels a and b, side by side: a word, handled by finite state automata; a binary tree, handled by ranked tree automata; and an unranked tree, with no bound on the number of children, handled by unranked tree automata.]

Only unranked tree automata?

• Missing practical gadgets
• Complexity of verification
  – Goal: typing at reasonable cost
• Unranked tree automata + …

Master Informatique 10/9/2007 14

Automata

Automata on words

Typing semistructured data

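Finite state automata on words

A finite state automaton is a tuple $A = (\Sigma, Q, q_0, F, \delta)$ where

• $\Sigma$ is the alphabet
• $Q$ is the set of states
• $q_0 \in Q$ is the initial state
• $F \subseteq Q$ is the set of accepting states
• $\delta : \Sigma \times Q \to P(Q)$ gives the transitions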

Nondeterministic automaton: Example

$\Sigma = \{a, b\}$, $Q = \{q_0, q_1, q_2, q_3\}$, $F = \{q_2\}$

[Figure: the transition table, which includes the nondeterministic rule $(a, q_0) \to \{q_0, q_1\}$, and two runs starting in $q_0$ on sample words over {a, b}: one that fails (KO) and one that reaches $q_2$ and accepts (OK).]

Reminder

• Deterministic
  – No ε-transition (a move from one state to another that reads no letter)
  – No alternative transitions such as $(a, q_0) \to \{q_0, q_1\}$
• Determinization
  – It is possible to obtain an equivalent deterministic automaton
  – State of the new automaton = set of states of the original one
  – Possible exponential blow-up
• Minimization
• Limitations: cannot recognize $\{ a^{n} b^{n} \mid n \in \mathbb{N} \}$
  – Context-free languages
• Essential tool, e.g., for lexical analysis
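The determinization bullet can be illustrated with a short Python sketch. The automaton below is a hypothetical one (it accepts the words over {a, b} that end in "ab") and is not the automaton of the previous slide; the names `DELTA`, `move`, `determinize` and so on are illustrative choices.

```python
# Hypothetical NFA over {a, b} accepting the words that end in "ab".
# DELTA maps (letter, state) to the set of possible next states.
DELTA = {
    ("a", "q0"): {"q0", "q1"},  # two alternatives: this is the nondeterminism
    ("b", "q0"): {"q0"},
    ("b", "q1"): {"q2"},
}
START, FINAL = "q0", {"q2"}
ALPHABET = ("a", "b")


def move(states, letter):
    """All states reachable from `states` by reading `letter`."""
    return frozenset().union(*(DELTA.get((letter, q), set()) for q in states))


def nfa_accepts(word):
    """Simulate the NFA by tracking the set of states it may be in."""
    current = frozenset({START})
    for letter in word:
        current = move(current, letter)
    return bool(current & FINAL)


def determinize():
    """Subset construction: each DFA state is a set of NFA states."""
    start = frozenset({START})
    table, seen, todo = {}, {start}, [start]
    while todo:
        subset = todo.pop()
        for letter in ALPHABET:
            target = move(subset, letter)
            table[(subset, letter)] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    return start, table, seen


def dfa_accepts(word):
    """Run the determinized automaton: exactly one state at each step."""
    state, table, _ = determinize()
    for letter in word:
        state = table[(state, letter)]
    return bool(state & FINAL)
```

Only the reachable subsets are built, which is why the blow-up is "possible" rather than systematic: this example needs three DFA states although there are eight subsets of {q0, q1, q2}.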

Reminder (2)

• L(A) = set of words accepted by automaton A
• Regular languages = languages accepted by finite state automata
• Can be described by regular expressions, e.g. a(b+c)*d
• Closed under complement: $\Sigma^{*} - L(A)$
• Closed under union and intersection: $L(A) \cup L(B)$, $L(A) \cap L(B)$
  – Product automaton with states (s,s’) where s is from A and s’ is from B

Automata on words versus trees

[Figure: the word a b b a can be read left to right or right to left, with no difference. A tree over a and b can be processed bottom up or top down, and there the two directions differ.]

Master Informatique 10/9/2007 20

Automata

Automata on ranked trees

Typing semistructured data

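Binary tree automata

A binary tree automaton is a tuple $A = (\Sigma, Q, F, \delta)$.

• Parallel evaluation
• For leaves: $\delta : \Sigma \to P(Q)$
• For other nodes: $\delta : \Sigma \times Q \times Q \to P(Q)$

[Figure: bottom-up run on a binary tree over a and b. A node labeled b whose children are in states q and q’ moves to a state q”; states are propagated in this way up to the root.]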

Bottom-up tree automata

• Bottom-up: if a node labeled a has its children in states q, q’ then the node moves nondeterministically to state r or r’: $(a, q, q') \to r, r'$
• Accepts if the root is in some state in F
• Not deterministic if there are alternatives or ε-transitions: $(a, q, q') \to \{r, r'\}$, $(\varepsilon, r) \to r'$

Example: deterministic bottom-up

Boolean circuit evaluation, with $\Sigma = \{0, 1, \wedge, \vee\}$, $Q = \{q_0, q_1\}$ and $F = \{q_1\}$.

Leaves: $0 \to q_0$, $1 \to q_1$

Other nodes:
  $(\wedge, q_1, q_1) \to q_1$
  $(\wedge, q_0, q_1) \to q_0$
  $(\wedge, q_1, q_0) \to q_0$
  $(\wedge, q_0, q_0) \to q_0$
  $(\vee, q_1, q_1) \to q_1$
  $(\vee, q_0, q_1) \to q_1$
  $(\vee, q_1, q_0) \to q_1$
  $(\vee, q_0, q_0) \to q_0$

[Figure: bottom-up run on a Boolean circuit. Each leaf labeled 0 or 1 gets state q0 or q1, each inner node gets the state computed from the states of its two children, and the root ends in q1, so the circuit is accepted (OK).]
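The bottom-up evaluation of Boolean circuits can be sketched in a few lines of Python. The tuple encoding of trees and the names `LEAF`, `INNER`, `run` and `accepts` are assumptions made for the illustration; the rules themselves are the usual truth tables of ∧ and ∨.

```python
# A leaf is ("0",) or ("1",); an inner node is (label, left, right)
# with label "and" or "or".
LEAF = {"0": "q0", "1": "q1"}
FINAL = {"q1"}

# One rule per (label, left state, right state): the automaton is deterministic.
INNER = {}
for left in ("q0", "q1"):
    for right in ("q0", "q1"):
        INNER[("and", left, right)] = "q1" if (left, right) == ("q1", "q1") else "q0"
        INNER[("or", left, right)] = "q0" if (left, right) == ("q0", "q0") else "q1"


def run(tree):
    """State reached at the root: children first, then the node itself."""
    if len(tree) == 1:
        return LEAF[tree[0]]
    label, left, right = tree
    return INNER[(label, run(left), run(right))]


def accepts(tree):
    return run(tree) in FINAL
```

For instance, `accepts(("or", ("and", ("1",), ("0",)), ("1",)))` evaluates the circuit (1 ∧ 0) ∨ 1.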

Regular tree language = set of trees accepted by a bottom-up tree automaton

Regular tree languages

The following are equivalent
  – L is a regular tree language
  – L is accepted by a nondeterministic bottom-up automaton
  – L is accepted by a deterministic bottom-up automaton
  – L is accepted by a nondeterministic top-down automaton

Deterministic top-down is weaker

Top-down tree automata

• Top-down: if a node labeled a is in state q”, then its left child moves to state q and its right child to state q’: $(a, q'') \to (q, q')$
• Accepts if all leaves are in states in F
• Not deterministic if $(a, q'') \to \{(q, q'), (r, r')\}$

Why is deterministic top-down weaker?

• Consider the language
  – L = { f(a,b), f(b,a) }
• It can be accepted by a bottom-up TA
  – Exercise: write a BUTA A such that L = L(A)
• Suppose that B is a deterministic top-down TA with L = L(B)
  – Exercise: show that B also accepts f(a,a)
  – A contradiction

Fact: no deterministic top-down tree automaton accepts L
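A small Python sketch can make the argument concrete, under simplifying assumptions: only trees of the form f(x, y) with leaves in {a, b} are considered, and a deterministic top-down automaton is abstracted by the set of leaf labels it accepts on the left and the set it accepts on the right (it sends one fixed state to each child). All names are illustrative.

```python
from itertools import combinations, product

LABELS = ("a", "b")
TARGET = {("a", "b"), ("b", "a")}  # L = { f(a,b), f(b,a) } as pairs of leaves

# Bottom-up automaton: leaves get a state, then the root rule sees both states.
LEAF = {"a": "qa", "b": "qb"}
ROOT = {("qa", "qb"): "ok", ("qb", "qa"): "ok"}


def buta_accepts(left, right):
    return ROOT.get((LEAF[left], LEAF[right])) == "ok"


def subsets(items):
    return [set(c) for n in range(len(items) + 1) for c in combinations(items, n)]


def top_down_languages():
    """Languages of f(x, y) trees accepted by deterministic top-down automata.

    The state sent to the left child does not depend on the right child,
    so the accepted language is always a product left_ok x right_ok.
    """
    for left_ok, right_ok in product(subsets(LABELS), repeat=2):
        yield {(x, y) for x in left_ok for y in right_ok}
```

Enumerating `top_down_languages()` shows that none of the sixteen products equals L, and that every product containing both f(a,b) and f(b,a) also contains f(a,a).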

Ranked tree automata: properties

• As for words, only with higher complexity
• Determinization
• Minimization
• Closed under
  – Complement
  – Intersection
  – Union

But…

• XML documents are unranked
• The kind of things we want to do:
  book (intro, section*, conclusion)

Master Informatique 10/9/2007 31

Automata

Automata on unranked tree

Typing semistructured data

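Unranked tree automata

A node may have any number of children, so Boolean circuits with unbounded fan-in need infinitely many transitions:

  $(\mathit{And}, t) \to t$, $(\mathit{And}, tt) \to t$, $(\mathit{And}, ttt) \to t$, …
  $(\mathit{And}, f) \to f$, $(\mathit{And}, tf) \to f$, $(\mathit{And}, ft) \to f$, …
  $(\mathit{Or}, t) \to t$, $(\mathit{Or}, tf) \to t$, $(\mathit{Or}, ft) \to t$, …
  $(\mathit{Or}, f) \to f$, $(\mathit{Or}, ff) \to f$, $(\mathit{Or}, fff) \to f$, …

Issue: represent an infinite set of transitions
Solution: a regular language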

Unranked tree automata (2)

• Rule: $(a, L(Q)) \to r_1, \ldots, r_m$
• Meaning: if the states of the children of some node labeled a form a word in L(Q), this node moves to some state in {r1,…,rm}

Rules for Boolean circuits:
  $(\mathit{And}, t^{+}) \to t$
  $(\mathit{And}, (t+f)^{*}\, f\, (t+f)^{*}) \to f$
  $(\mathit{Or}, (t+f)^{*}\, t\, (t+f)^{*}) \to t$
  $(\mathit{Or}, f^{+}) \to f$
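As a rough sketch of such rules in Python, the regular languages over states can be written as regular expressions from the standard `re` module. The tree encoding `(label, [children])`, the leaf rules for the labels "0" and "1", and the names `RULES` and `run` are assumptions made for this example.

```python
import re

# Each rule is (node label, regular language over children states, new state).
# States are the single letters "t" and "f", so a word of states is a string.
RULES = [
    ("And", re.compile(r"t+"), "t"),
    ("And", re.compile(r"[tf]*f[tf]*"), "f"),
    ("Or", re.compile(r"[tf]*t[tf]*"), "t"),
    ("Or", re.compile(r"f+"), "f"),
    ("1", re.compile(r""), "t"),  # assumed leaf rules: no children, empty word
    ("0", re.compile(r""), "f"),
]


def run(tree):
    """Bottom-up run: compute the children states, then apply a rule."""
    label, children = tree
    word = "".join(run(child) for child in children)
    for rule_label, language, state in RULES:
        if rule_label == label and language.fullmatch(word):
            return state
    raise ValueError(f"no rule for {label!r} with children states {word!r}")
```

Four finite rules replace the infinite families of transitions: a node with any number of children is handled by matching the word of its children states against the expression.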

Building on ranked trees

[Figure: an unranked tree over a and b and its FirstChild-NextSibling encoding as a ranked (binary) tree: in the encoding, the left child of a node is its first child in the original tree and the right child is its next sibling.]

• F: encoding into a ranked tree
• F is a bijection
• F⁻¹: decoding
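The encoding F and its inverse can be sketched as follows. Unranked trees are assumed to be pairs `(label, [children])`, and binary nodes triples `(label, first_child, next_sibling)` with `None` for a missing child; these representations and the function names are choices made for the illustration.

```python
def encode(forest):
    """Encode a list of sibling trees as one binary node (or None)."""
    if not forest:
        return None
    (label, children), rest = forest[0], forest[1:]
    return (label, encode(children), encode(rest))


def decode(node):
    """Inverse of encode: rebuild the list of sibling trees."""
    if node is None:
        return []
    label, first_child, next_sibling = node
    return [(label, decode(first_child))] + decode(next_sibling)


def F(tree):
    """FirstChild-NextSibling encoding of a single unranked tree."""
    return encode([tree])


def F_inv(binary):
    """Decoding; the root of an encoded tree has no next sibling."""
    return decode(binary)[0]
```

Since `F_inv(F(t)) == t` for every unranked tree `t`, no information is lost, which is what makes the transfer of closure properties from ranked to unranked automata possible.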

Building on bottom-up ranked trees (2)

• For each unranked TA A, there is a ranked TA accepting F(L(A))
• For each ranked TA A, there is an unranked TA accepting F⁻¹(L(A))
• Both are easy to construct

Consequence: unranked TA are closed under union, intersection, complement

Determinization

• Determinization is always possible for bottom-up automata
• Deterministic: $\forall a \in \Sigma,\ \forall w \in Q^{*}$, there exists a unique rule $(a, L) \to r$ such that $w \in L$
• Can we use the FirstChild-NextSibling encoding?
  – No: it does not preserve determinism

Top-down?

• This is more delicate
• Transition: δ(a,q) = A(a,q), where A(a,q) is an automaton on words
  – The states of the automaton A(a,q) when reading the labels of the children of a node labeled a determine the states of the children of that node
  – Accepts if all the leaves are in an accepting state

Boolean circuit evaluation

[Figure: top-down run on a Boolean circuit with 0/1 leaves. The root starts in state q1, and states q0 and q1 are pushed down to the children until the leaves are reached.]

• The circuit is accepted
• The automaton rejects if some leaf is neither labeled 0 in state q0 nor labeled 1 in state q1

Master Informatique 10/9/2007 39

Automata

Automata and monadic second-order logic

Typing semistructured data

Monadic second-order logic

• Representation of a tree as a logical structure

  E(1,2), E(1,3), …, E(3,9)
  S(2,3), S(3,4), S(4,5), …, S(8,9)
  a(1), a(4), a(8)
  b(2), b(3), b(5), b(6), b(7), b(9)

[Figure: the corresponding tree over a and b, with nodes numbered 1 to 9.]

Monadic second-order logic

MSO syntax

  $\varphi ::= x = y \mid E(x,y) \mid S(x,y) \mid a(x) \mid \ldots$
  $x \in X$ (set variable)
  $\exists X\, \varphi(X)$ (quantification over a set variable)

Example of MSO

• Each a node has a b-descendant
• This corresponds to the formula

  $\forall x\, \big( a(x) \rightarrow \forall X\, ( \alpha \wedge \beta \rightarrow \exists y\, ( y \in X \wedge b(y) ) ) \big)$

  where
  $\alpha \equiv x \in X$
  $\beta \equiv \forall y, z\, ( E(y,z) \wedge y \in X \rightarrow z \in X )$

For each node x labeled a: each set X that (α) contains x and that (β) is closed under descendant contains some y labeled b

Bridge

Theorem: for a set L of trees, the following are equivalent

1. L = L(A) for some bottom-up tree automaton A, i.e. L is definable with bottom-up tree automata
2. L = {T | T satisfies φ} for some MSO formula φ, i.e. L is definable in MSO

Master Informatique 10/9/2007 44

XML typing

DTDs

Typing semistructured data

DTD

• Describe the children of a node with label a by a regular expression
• Bizarre syntax

  <!ELEMENT populationdata (continent*) >
  <!ELEMENT continent (name, country*) >
  <!ELEMENT country (name, province*) >
  <!ELEMENT province (name, city*) >
  <!ELEMENT city (name, pop) >
  <!ELEMENT name (#PCDATA) >
  <!ELEMENT pop (#PCDATA) >

DTD and determinism

• Regular expressions in DTD should be deterministic
  – Complicated definition
• Intuition: the corresponding automaton should be deterministic
  – (a+b)*a is not
  – When reading <a>, one cannot tell whether it is an a from (a+b) or the final a
  – (b*a)(b*a)* is an equivalent expression that is deterministic

Very efficient validation

• It suffices to verify for each node labeled a that the word formed by the labels of its children is accepted by the finite state automaton $A_a$
• Possible to type check the document while scanning it, e.g. with a SAX parser

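Very efficient validation (2)

  <!ELEMENT a ( b, c ) >
  <!ELEMENT b ( d+ ) >

Document: <a><b><d/><d/></b><c/></a>

[Figure: the tree a(b(d, d), c) and the two word automata. $A_a$ goes from s to t on b and from t to u on c; $A_b$ goes from s’ to t’ on d and loops on t’ for further d. While the document is scanned, the current state of each open element is kept on a stack, and the document is accepted when $A_a$ ends in u.]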
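A minimal sketch of validation during a SAX scan, using Python's standard `xml.sax` module. The content models are written as regular expressions over child labels; the models for c and d (no children) are assumptions, since the slide only declares a and b, and `CONTENT`, `Validator` and `validate` are illustrative names. For brevity the sketch checks the word of child labels when an element closes, rather than advancing a word automaton at each opening tag.

```python
import re
import xml.sax

# Content models as regular expressions; each child label is followed by a space.
CONTENT = {
    "a": re.compile(r"b c "),
    "b": re.compile(r"(d )+"),
    "c": re.compile(r""),  # assumption: c has no element children
    "d": re.compile(r""),  # assumption: d has no element children
}


class Validator(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.stack = [[]]  # one list of child labels per open element
        self.valid = True

    def startElement(self, name, attrs):
        self.stack[-1].append(name)  # record this element for its parent
        self.stack.append([])        # open a fresh list for its own children

    def endElement(self, name):
        word = "".join(label + " " for label in self.stack.pop())
        model = CONTENT.get(name)
        if model is None or not model.fullmatch(word):
            self.valid = False


def validate(document: bytes) -> bool:
    handler = Validator()
    xml.sax.parseString(document, handler)
    return handler.valid
```

The stack holds one entry per open element, so its height is bounded by the depth of the document and the whole check is done in a single pass.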

Warning

• The previous example can be checked with a simple automaton on words
• But not the following one: <!ELEMENT part ( part* ) >
• A stack is needed for accepting <a>…<a></a>…</a>, that is, n opening tags <a> followed by n closing tags </a>

Some bad news for DTD

• Not closed under union

  DTD1 …
  <!ELEMENT used ( ad* ) >
  <!ELEMENT ad ( year, brand ) >

  DTD2 …
  <!ELEMENT new ( ad* ) >
  <!ELEMENT ad ( brand ) >

• L(DTD1) ∪ L(DTD2) cannot be described by a DTD but can be described easily by a tree automaton
  – Problem: the type of ad depends on its parent
• Also not closed under complement
• Limited expressive power

Car example continued

• The best DTD we can choose does not distinguish between ads for used and new cars
  – <!ELEMENT ad (year?, brand) >

[Figure: a Car tree with a Used child (Brand “Renault”, Year “2008”) and a New child (Brand “BMW”).]

Decoupled types in XML schema

• Each type corresponds to a label, but not conversely

  car: [car] ( used + new )*
  used: [used] (ad1*)
  new: [new] (ad2*)
  ad1: [ad] (year, brand)
  ad2: [ad] (brand)

• Tags are written in square brackets; type names come before the colon
• Nice closure properties
• Many other « gadgets » in XML schemas
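The car example can be checked top-down with a short Python sketch. The table `TYPES` transcribes the five types above; the leaf types for year and brand, the tree encoding `(tag, [children])` and the function `conforms` are assumptions made for the illustration.

```python
import re

# type name -> (tag, content model over child *type* names, each followed by a space)
TYPES = {
    "car": ("car", r"(used |new )*"),
    "used": ("used", r"(ad1 )*"),
    "new": ("new", r"(ad2 )*"),
    "ad1": ("ad", r"year brand "),
    "ad2": ("ad", r"brand "),
    "year": ("year", r""),    # assumed leaf type
    "brand": ("brand", r""),  # assumed leaf type
}


def conforms(tree, type_name):
    """Check a tree (tag, [children]) against a type, from the root down."""
    tag, children = tree
    type_tag, model = TYPES[type_name]
    if tag != type_tag:
        return False
    # Within one content model a tag names a single type, so the type of
    # each child is determined by its tag and the type of its parent.
    by_tag = {TYPES[t][0]: t for t in re.findall(r"\w+", model)}
    child_types = [by_tag.get(child[0]) for child in children]
    if None in child_types:
        return False
    word = "".join(t + " " for t in child_types)
    return bool(re.fullmatch(model, word)) and all(
        conforms(child, t) for child, t in zip(children, child_types)
    )
```

The same tag ad is checked against ad1 under used and against ad2 under new, which is exactly what a DTD cannot express.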

Master Informatique 10/9/2007 53

XML typing

XML Schemas

Typing semistructured data

XML Schema

• Often criticized: unnecessarily complicated
• Boosted by Web services
• Richer than DTD: decoupled types
• Close to deterministic top-down tree automata
• XML schemas are extensible
• Many other useful functionalities
  – Namespaces
  – Atomic types
  – Integrity constraints, etc.

An XML schema is an XML document

• Since it uses an XML syntax, XML tools can be applied to it
  – Editor
  – Type checker
  – Etc.
• The type of all XML schemas can be described with an XML schema

<?xml version="1.0" encoding="utf-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://www.net-language.com">
  <xs:element name="book">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="title" type="xs:string"/>
        <xs:element name="author" type="xs:string"/>
        <xs:element name="character" minOccurs="0" maxOccurs="unbounded">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="name" type="xs:string"/>
              <xs:element name="friend-of" type="xs:string" minOccurs="0" maxOccurs="unbounded"/>
              <xs:element name="since" type="xs:date"/>
              <xs:element name="qualification" type="xs:string"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
      <xs:attribute name="isbn" type="xs:string"/>
    </xs:complexType>
  </xs:element>
</xs:schema>

Simple elements and atomic types

Definition: <xs:element name="xxx" type="yyy"/>
with common types: xs:string, xs:decimal, xs:integer, xs:boolean, xs:date, xs:time

Examples
  <xs:element name="lastname" type="xs:string"/>
  <xs:element name="age" type="xs:integer"/>
  <xs:element name="dateborn" type="xs:date"/>

Instances of such elements
  <lastname>Refsnes</lastname>
  <age>34</age>
  <dateborn>1968-03-27</dateborn>

Attributes

Definition: <xs:attribute name="xxx" type="yyy"/>

Example
  <xs:attribute name="lang" type="xs:string"/>

Instance of such an attribute
  <lastname lang="EN">Smith</lastname>

Complex elements

• Empty element
  <product pid="1345"/>
• Contains only other elements
  <employee>
    <firstname>John</firstname>
    <lastname>Smith</lastname>
  </employee>
• Contains only text
  <food type="dessert">Ice cream</food>
• Contains both elements and text
  <description>
    It happened on <date lang="norwegian">03.03.99</date> ....
  </description>

Restriction of simple elements

  <xs:element name="age">
    <xs:simpleType>
      <xs:restriction base="xs:integer">
        <xs:minInclusive value="0"/>
        <xs:maxInclusive value="100"/>
      </xs:restriction>
    </xs:simpleType>
  </xs:element>

Other restrictions: enumerated types, patterns, etc.

Restriction on complex elements

  <xs:element name="person">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="firstname" type="xs:string"/>
        <xs:element name="lastname" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

Possible to name a type

  <xs:element name="employee">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="firstname" type="xs:string"/>
        <xs:element name="lastname" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

Only the "employee" element can use the specified complex type
(<sequence> indicates an order on child elements)

Alternative

  <xs:element name="employee" type="personinfo"/>

  <xs:complexType name="personinfo">
    <xs:sequence>
      <xs:element name="firstname" type="xs:string"/>
      <xs:element name="lastname" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>

Other gadgets

• Import of types associated to a namespace
  – <import namespace="http:// ..." schemaLocation="http:// ..." />
• Possible to include an existing schema
  – <include schemaLocation="http:// ..."/>
• Possible to extend/redefine an existing schema
  – <redefine schemaLocation="http:// ...">
      .... Extensions ...
    </redefine>

Master Informatique 10/9/2007 64Master Informatique Typing semistructured data

Example: a DTD<!ELEMENT EMAIL (TO+, FROM, CC*, BCC*, SUBJECT?, BODY?)>

<!ATTLIST EMAIL

LANGUAGE (Western|Greek|Latin|Universal) "Western"

ENCRYPTED CDATA #IMPLIED

PRIORITY (NORMAL|LOW|HIGH) "NORMAL">

<!ELEMENT TO (#PCDATA)>

<!ELEMENT FROM (#PCDATA)>

<!ELEMENT CC (#PCDATA)>

<!ELEMENT BCC (#PCDATA)>

<!ATTLIST BCC

HIDDEN CDATA #FIXED "TRUE">

<!ELEMENT SUBJECT (#PCDATA)>

<!ELEMENT BODY (#PCDATA)>

<!ENTITY SIGNATURE "Bill">

The same in XML schema (more verbose)

Note: this example is written in Microsoft’s XML-Data Reduced dialect, a precursor of W3C XML Schema, as its urn:schemas-microsoft-com:xml-data namespace shows.

<?xml version="1.0" ?>

<Schema name="email" xmlns="urn:schemas-microsoft-com:xml-data"

xmlns:dt="urn:schemas-microsoft-com:datatypes">

<AttributeType name="language"

dt:type="enumeration" dt:values="Western Greek Latin Universal" />

<AttributeType name="encrypted" />

<AttributeType name="priority" dt:type="enumeration" dt:values="NORMAL LOW HIGH" />

<AttributeType name="hidden" default="true" />

<ElementType name="to" content="textOnly" />

<ElementType name="from" content="textOnly" />

<ElementType name="cc" content="textOnly" />

<ElementType name="bcc" content="mixed">

<attribute type="hidden" required="yes" />

</ElementType>

<ElementType name="subject" content="textOnly" />

<ElementType name="body" content="textOnly" />

<ElementType name="email" content="eltOnly">

<attribute type="language" default="Western" />

<attribute type="encrypted" />

<attribute type="priority" default="NORMAL" />

<element type="to" minOccurs="1" maxOccurs="*" />

<element type="from" minOccurs="1" maxOccurs="1" />

<element type="cc" minOccurs="0" maxOccurs="*" />

<element type="bcc" minOccurs="0" maxOccurs="*" />

<element type="subject" minOccurs="0" maxOccurs="1" />

<element type="body" minOccurs="0" maxOccurs="1" />

</ElementType>

</Schema>

Where to place XML schemas

• Some bizarre restriction
  – Inside an element, no two types with the same tag
• Closer to DTDs than to tree automata
• Efficient type validation

[Figure: nested classes of tree languages, from the largest to the smallest: tree automata, deterministic top-down tree automata, XML schema, DTD.]

Exercise: coupled vs decoupled

• Write a realistic DTD1 for new cars
  – With make, model, engine…
• Write a realistic DTD2 for used cars
  – Also year, miles, zipcode
• Write an XML schema for L(DTD1) ∪ L(DTD2)
  – Using decoupled schema

Master Informatique 10/9/2007 68

Automata

Automata to compute

Typing semistructured data

Another use of automata: XPath $x in //a/b

[Figure, animated over several slides: a tree over a and b is traversed in document order while an automaton for //a/b is run on the path from the root to the current node. The NFA is determinized into a DFA whose states are the sets (0), (01) and (02). The DFA state is pushed on a stack when a node is entered and popped when it is left; each node reached in state (02) is returned as a binding of $x.]
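The evaluation of //a/b in one traversal can be sketched in Python. The NFA assumed here has states 0 (start, looping on every label), 1 (an a was just read) and 2 (a/b was just read, a match); the tree encoding `(label, [children])` and the functions `step` and `select` are illustrative, and the recursion stack plays the role of the explicit stack of states.

```python
def step(states, label):
    """One move of the determinized automaton: a set of NFA states to a set."""
    following = {0}  # state 0 loops on every label (the // axis)
    if label == "a" and 0 in states:
        following.add(1)
    if label == "b" and 1 in states:
        following.add(2)
    return frozenset(following)


def select(tree, states=frozenset({0}), path=()):
    """Yield the positions of the nodes selected by //a/b, in document order.

    A position is the tuple of child indexes leading from the root to the node.
    """
    label, children = tree
    states = step(states, label)
    if 2 in states:
        yield path
    for index, child in enumerate(children):
        yield from select(child, states, path + (index,))
```

The reachable sets are exactly {0}, {0, 1} and {0, 2}, matching the three DFA states (0), (01) and (02).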

Determinization: exponential blow up

//a/*/*/b

Typing semistructured data

Proposal: k-pebble transducers [Milo, Suciu, Vianu]

[Figure: a k-pebble transducer and its stack.]

k-pebble transducers: result

[Figure: a tree with a root and nodes labeled a, b and c.]

Capture a core aspect of XQuery but not the data management part

Master Informatique 10/9/2007 87

Graphs and bisimulation

Typing semistructured data

Master Informatique 10/9/2007 88Master Informatique Typing semistructured data

Graph

• Graph semistructured data• Graph simulation • Graph bisimulation• Data guides

Semistructured data

• With ID-IDREF, XML is a graph model as well
• OEM = Object Exchange Model
• Labeled (rooted) graph (E,r)
  – Set N of nodes
  – A finite ternary relation $E \subseteq N \times N \times \mathit{Label}$; E(s,t,l) = there is an edge from s to t labeled l
  – Possibly a root r

[Figure: an OEM graph with a root &r, a company node &c and employee nodes &p1 to &p8. Its edges are labeled company, employee, worksfor, manages and managedby.]

Equality revisited

• {1,2,2,1,5} = {1,2,5}
  – Ignores the order and the duplicates
• For trees, if we ignore the order of siblings and use a “set” semantics, duplicate subtrees collapse

[Figure: the tree a(b(d, d), c, b(d, d)) is equal to the tree a(b(d), c).]

Simulation

A simulation of (E,r) with (E’,r’) is a relation R between the nodes of E and E’ such that

1. R(r,r’)
2. if R(s,s’) and E(s,t,l) for some l, then there exists t’ with R(t,t’) and E’(s’,t’,l)

(we simulate a move in E by a move in E’ with the same label)

Bisimulation

Given E, E’ and a relation R, R is a bisimulation if R is a simulation of E with E’ and R⁻¹ is a simulation of E’ with E

Examples

[Figure: three graphs G, G’ and G” with edges labeled a and d. They all have the same paths from the root. G and G’ are bisimilar; G’ and G” are not.]

Graph bisimulation

[Figure, animated over several slides: a data graph and a type graph, related step by step by a relation R. The data graph has nodes root, e1 to e4, p1 to p9, c1 and c2, with edges labeled employee, project, workson, leads, consults, programmer and statistician, and string values such as "exercise", "lecture", "finance", "adminstr.", "PR", "undergrad", "grad", "postgrad" and "web". The type graph has nodes t1 and t2, with edges labeled employee, projects, programmer | statistician and STRING_.]

Computing bisimulation in ptime

• Start with R = N × N’ (for N, N’ the sets of nodes)
• While there exists (x,x’) in R that violates the definition of simulation in one of the two directions, remove (x,x’) from R
• This computes the maximal bisimulation in ptime

(Note: this maximal bisimulation exists because ∅ is a bisimulation, and if R1, R2 are bisimulations, R1 ∪ R2 is also one)
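The refinement loop can be sketched in Python as follows. Graphs are assumed to be given as a set of node names and a set of `(source, target, label)` triples, and the root condition is left to the caller (check that the pair of roots is in the result); the function name and representation are illustrative.

```python
def maximal_bisimulation(nodes1, edges1, nodes2, edges2):
    """Largest relation in which related nodes can match each other's moves."""
    relation = {(x, y) for x in nodes1 for y in nodes2}

    def respects(x, y):
        # Every move of x must be matched by a move of y with the same label...
        forth = all(
            any((t, t2) in relation for (s2, t2, l2) in edges2 if s2 == y and l2 == l)
            for (s, t, l) in edges1 if s == x
        )
        # ...and every move of y by a move of x.
        back = all(
            any((t, t2) in relation for (s, t, l) in edges1 if s == x and l == l2)
            for (s2, t2, l2) in edges2 if s2 == y
        )
        return forth and back

    changed = True
    while changed:
        changed = False
        for pair in sorted(relation):  # iterate over a copy while removing
            if not respects(*pair):
                relation.discard(pair)
                changed = True
    return relation
```

Each pass removes at least one pair or stops, and there are at most |N| · |N’| pairs, so the loop runs in polynomial time.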

What does this have to do with typing?

• Take a very complex graph E
• How do you describe it?
• By a “smaller” graph T that is bisimilar to E
• There may be several such graphs, with more and more details

Rough bisimulation

[Figure: a type graph with four nodes: Root = {&r}, Company = {&c}, Bosses = {&p1, &p4, &p6} and Regulars = {&p2, &p3, &p5, &p7, &p8}. Its edges are labeled company, employee (to Bosses and to Regulars), worksfor (from Bosses and from Regulars), manages and managedby.]

More precise one

[Figure: a refined type graph with five nodes: Root = {&r}, Employees = {&p1, &p2, &p3, &p4, &p5, &p6, &p7, &p8}, Bosses = {&p1, &p4, &p6}, Regulars = {&p2, &p3, &p5, &p7, &p8} and Company = {&c}. Its edges are labeled company, employee, manages, managedby and worksfor.]

Other “typing”: data guide

• See the graph as an automaton with the root as the start state and only accepting states
• This automaton accepts all the paths from the root
• Obtain an equivalent, minimal, deterministic automaton
  – This is the data guide for the graph
  – It can be used for describing the data
  – It can be used to support graphical query interfaces
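The determinization step behind a data guide can be sketched in Python. The graph is assumed to be a set of `(source, target, label)` triples, and the sketch stops after the subset construction: minimization is not performed. The function name and the small example graph in the usage note are hypothetical.

```python
def data_guide(root, edges):
    """Determinize the data graph seen as an automaton over edge labels.

    Each state of the result is the set of data nodes reachable from the
    root by one given path, so every label path appears exactly once.
    """
    start = frozenset({root})
    transitions, seen, todo = {}, {start}, [start]
    while todo:
        group = todo.pop()
        by_label = {}
        for (source, target, label) in edges:
            if source in group:
                by_label.setdefault(label, set()).add(target)
        for label, targets in by_label.items():
            target = frozenset(targets)
            transitions[(group, label)] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    return start, transitions
```

For example, with edges root –employee→ e1, root –employee→ e2, e1 –workson→ p1, e2 –workson→ p1 and e2 –workson→ p2, the guide has a single employee edge to {e1, e2} followed by a single workson edge to {p1, p2}.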

Data guide

• Gives all the paths from the root
• Automata minimization

[Figure: the data graph of the previous slides (root, e1 to e4, p1 to p9, c1, c2) next to its data guide. The states of the data guide are sets of data nodes: {root}; {e1,e2,e3,e4}, reached by employee; {p1,p2,p3,p4,p5,p6,p7,p8,p9}, reached by project; {c1}, reached by programmer; {c2}, reached by statistician; and further sets such as {p1,p3}, {p2,p4}, {p1,p3,p5,p7}, {p4,p6}, {p4}, {e1,e2}, {e2,e3}, {p1,p3,p5,p7,p9}, {p2,p4,p6,p8} and {p4,p9}, reached by workson, leads, consults and employee edges.]

What you should remember

• Tree automata = theoretical foundation for XML
• Bottom-up tree automata are nice
• Top-down and determinism together bring limitations
• XML documents do not have to be typed
• Typing may be very useful for XML
  – In particular for software managing XML data
• DTD: simple but limited
• XML Schema: more expressive but still limited
• Graph data: bisimulation is the answer

Thank you

Bibliography

• TATA: Tree Automata Techniques and Applications, tata.gforge.inria.fr/
  – The book on the topic, and it is free
• XML schema: see http://w3.org and http://www.w3schools.com/schema/