![Page 1: Semistructured Data: The Next Stephomes.cs.washington.edu/~suciu/ssd-next.pdfa2 R. Keven 0.9 a3 Calvin C. 0.8 a4 Collins 0.2 Actor Actor.name ≈ ‘Kevin’](https://reader035.vdocument.in/reader035/viewer/2022081616/601ef75d96f3c96c7f6f8b29/html5/thumbnails/1.jpg)
Semistructured Data:The Next Step
Dan Suciu
![Page 2: Semistructured Data: The Next Stephomes.cs.washington.edu/~suciu/ssd-next.pdfa2 R. Keven 0.9 a3 Calvin C. 0.8 a4 Collins 0.2 Actor Actor.name ≈ ‘Kevin’](https://reader035.vdocument.in/reader035/viewer/2022081616/601ef75d96f3c96c7f6f8b29/html5/thumbnails/2.jpg)
2
ThesisBack then: semistructured data addressed imprecisions in structure: e.g. data integration, “new data”
Novel paradigm based on semi-structure
Today: many more variety of imprecisions found in data and schema: e.g. data integration, “new data”
Novel paradigm based on probabilities
![Page 3: Semistructured Data: The Next Stephomes.cs.washington.edu/~suciu/ssd-next.pdfa2 R. Keven 0.9 a3 Calvin C. 0.8 a4 Collins 0.2 Actor Actor.name ≈ ‘Kevin’](https://reader035.vdocument.in/reader035/viewer/2022081616/601ef75d96f3c96c7f6f8b29/html5/thumbnails/3.jpg)
3
Outline
Semistructured data: past and present
Probabilistic databases: the new paradigm
Supporting technologies and theoretical roots
Towards a theory of probabilistic queries
Conclusions
![Page 4: Semistructured Data: The Next Stephomes.cs.washington.edu/~suciu/ssd-next.pdfa2 R. Keven 0.9 a3 Calvin C. 0.8 a4 Collins 0.2 Actor Actor.name ≈ ‘Kevin’](https://reader035.vdocument.in/reader035/viewer/2022081616/601ef75d96f3c96c7f6f8b29/html5/thumbnails/4.jpg)
4
Origins 1/2
Motivation: imprecisions in the structure
data integration [Tsimmis’94-98] handling “new data” [Lorel’97], [UnQL’95]
Solution: relax the structure
![Page 5: Semistructured Data: The Next Stephomes.cs.washington.edu/~suciu/ssd-next.pdfa2 R. Keven 0.9 a3 Calvin C. 0.8 a4 Collins 0.2 Actor Actor.name ≈ ‘Kevin’](https://reader035.vdocument.in/reader035/viewer/2022081616/601ef75d96f3c96c7f6f8b29/html5/thumbnails/5.jpg)
5
Origins 2/2
Enabling technologies: OO databases non first normal form data querying SGML files text databases
Theoretical roots: trees, regular expressions
![Page 6: Semistructured Data: The Next Stephomes.cs.washington.edu/~suciu/ssd-next.pdfa2 R. Keven 0.9 a3 Calvin C. 0.8 a4 Collins 0.2 Actor Actor.name ≈ ‘Kevin’](https://reader035.vdocument.in/reader035/viewer/2022081616/601ef75d96f3c96c7f6f8b29/html5/thumbnails/6.jpg)
6
TodaySemistructured data in industry: XML support in RDBMS XML engines XML for data exchange stream XML processing
Semistructured data in research: often largest session in SIGMOD/VLDB/PODS dedicated symposiums, conferences, workshops 2nd Dagstuhl workshop !
![Page 7: Semistructured Data: The Next Stephomes.cs.washington.edu/~suciu/ssd-next.pdfa2 R. Keven 0.9 a3 Calvin C. 0.8 a4 Collins 0.2 Actor Actor.name ≈ ‘Kevin’](https://reader035.vdocument.in/reader035/viewer/2022081616/601ef75d96f3c96c7f6f8b29/html5/thumbnails/7.jpg)
7
Today:Imprecisions Everywhere
misspellings
different terminologies
object matching
schema matching
measurement errors
constraint violations
None is addressed by semistructured data
![Page 8: Semistructured Data: The Next Stephomes.cs.washington.edu/~suciu/ssd-next.pdfa2 R. Keven 0.9 a3 Calvin C. 0.8 a4 Collins 0.2 Actor Actor.name ≈ ‘Kevin’](https://reader035.vdocument.in/reader035/viewer/2022081616/601ef75d96f3c96c7f6f8b29/html5/thumbnails/8.jpg)
8
Handling Imprecisions Today
several tools:
edit distance, q-gram distance, data cleaning, ontologies (WordNet), record linkeage, schema mathings, repairs, ....
always outside the database system
Should be done inside DBMS
![Page 9: Semistructured Data: The Next Stephomes.cs.washington.edu/~suciu/ssd-next.pdfa2 R. Keven 0.9 a3 Calvin C. 0.8 a4 Collins 0.2 Actor Actor.name ≈ ‘Kevin’](https://reader035.vdocument.in/reader035/viewer/2022081616/601ef75d96f3c96c7f6f8b29/html5/thumbnails/9.jpg)
9
Summary of Past and Present
imprecisions in data structure:
lead to semistructured data model
many more imprecisions remain today
require new data model
![Page 10: Semistructured Data: The Next Stephomes.cs.washington.edu/~suciu/ssd-next.pdfa2 R. Keven 0.9 a3 Calvin C. 0.8 a4 Collins 0.2 Actor Actor.name ≈ ‘Kevin’](https://reader035.vdocument.in/reader035/viewer/2022081616/601ef75d96f3c96c7f6f8b29/html5/thumbnails/10.jpg)
10
Outline
Semistructured data: past and present
Probabilistic databases: the new paradigm
Supporting technologies and theoretical roots
Towards a theory of probabilistic
Conclusions
![Page 11: Semistructured Data: The Next Stephomes.cs.washington.edu/~suciu/ssd-next.pdfa2 R. Keven 0.9 a3 Calvin C. 0.8 a4 Collins 0.2 Actor Actor.name ≈ ‘Kevin’](https://reader035.vdocument.in/reader035/viewer/2022081616/601ef75d96f3c96c7f6f8b29/html5/thumbnails/11.jpg)
11
Probabilistic Database
DefinitionA prob DB = a database where each tuple has a probability
will make this more precise later
first, let’s see examples
![Page 12: Semistructured Data: The Next Stephomes.cs.washington.edu/~suciu/ssd-next.pdfa2 R. Keven 0.9 a3 Calvin C. 0.8 a4 Collins 0.2 Actor Actor.name ≈ ‘Kevin’](https://reader035.vdocument.in/reader035/viewer/2022081616/601ef75d96f3c96c7f6f8b29/html5/thumbnails/12.jpg)
12
Uncertain Predicates 1/2
SELECT DISTINCT A.name FROM Actor A, Casts C, Film FWHERE C.filmID = F.filmID and C.actorID = A.actorID and A.name ≈ ‘Kevin’ and F.year ≈ 1995 and F.rating ≈ ‘high’
misspellings, data translation, ontologies
Q(x) :- Actor(x,y), Casts(y,z), Film(z,u,v), x ≈ ‘Kevin’, u ≈ 1995, v ≈ ‘high’
![Page 13: Semistructured Data: The Next Stephomes.cs.washington.edu/~suciu/ssd-next.pdfa2 R. Keven 0.9 a3 Calvin C. 0.8 a4 Collins 0.2 Actor Actor.name ≈ ‘Kevin’](https://reader035.vdocument.in/reader035/viewer/2022081616/601ef75d96f3c96c7f6f8b29/html5/thumbnails/13.jpg)
13
Uncertain Predicates 2/2Assign a probabilistic score to each tuple
↓probabilistic database
filmID name rating prob. score
f1 ... 9.2 0.95
f2 ... 7.7 0.65
m3 ... 5.3 0.1
m4 ... 8.1 0.7
Film
Film.rating ≈ high
actorID name prob. score
a1 Kevin B. 1.0
a2 R. Keven 0.9a3 Calvin C. 0.8
a4 Collins 0.2
Actor
Actor.name ≈ ‘Kevin’
Now evaluate Q(x) :- Actor(x,y), Casts(y,z), Film(z,u,v)on probabilistic database
![Page 14: Semistructured Data: The Next Stephomes.cs.washington.edu/~suciu/ssd-next.pdfa2 R. Keven 0.9 a3 Calvin C. 0.8 a4 Collins 0.2 Actor Actor.name ≈ ‘Kevin’](https://reader035.vdocument.in/reader035/viewer/2022081616/601ef75d96f3c96c7f6f8b29/html5/thumbnails/14.jpg)
14
Inexact Schema Matches 1/2
S2(name, address)
S1(customerName, officeAddress, homeAddress)
?prob=0.8
prob=0.2
![Page 15: Semistructured Data: The Next Stephomes.cs.washington.edu/~suciu/ssd-next.pdfa2 R. Keven 0.9 a3 Calvin C. 0.8 a4 Collins 0.2 Actor Actor.name ≈ ‘Kevin’](https://reader035.vdocument.in/reader035/viewer/2022081616/601ef75d96f3c96c7f6f8b29/html5/thumbnails/15.jpg)
15
Inexact Schema Matches 2/2
Query translation:select *from S2where S2.name ≈ ‘Kevin’ and S2.address = ‘Seattle’
customerName officeAddress homeAddress prob. score
Kevin Spokane Seattle 0.2 * 1.00
Keven Portland Bellvue 0.0 * 0.95
Calvin Seattle Redmond 0.8 * 0.37
Collins Ballard Seattle 0.2 * 0.03
S1
Probabilities allow us to consider both mappings
“Run” this on S1
![Page 16: Semistructured Data: The Next Stephomes.cs.washington.edu/~suciu/ssd-next.pdfa2 R. Keven 0.9 a3 Calvin C. 0.8 a4 Collins 0.2 Actor Actor.name ≈ ‘Kevin’](https://reader035.vdocument.in/reader035/viewer/2022081616/601ef75d96f3c96c7f6f8b29/html5/thumbnails/16.jpg)
16
Handling Inconsistencies 1/2
Name City Email
Larry Big Seattle lb@com. . . .
.. . . . .
Name City Phone
Larry Big Bellvue . . .
. . . . .
Data Source V1 Data Source V2
Global constraint: each person, only one city
Q: Larry Big’s email ?A: no answers, due to ‘repairs’
[Chomicki]
![Page 17: Semistructured Data: The Next Stephomes.cs.washington.edu/~suciu/ssd-next.pdfa2 R. Keven 0.9 a3 Calvin C. 0.8 a4 Collins 0.2 Actor Actor.name ≈ ‘Kevin’](https://reader035.vdocument.in/reader035/viewer/2022081616/601ef75d96f3c96c7f6f8b29/html5/thumbnails/17.jpg)
17
Handling Inconsistencies 2/2
Name City Email prob
Larry Big Seattle lb@com p1. . . .
.. . . . .
Name City Phone prob
Larry Big Bellvue . . . p2
. . . . .
Data Source V1 Data Source V2
Global constraint: p1 + p2 ≤ 1
Q: Larry Big’s email ?A: lb@com, with probability p1 = 0.5
no other info ? p1=p2=0.5
Probabilities allow us to use all the data
![Page 18: Semistructured Data: The Next Stephomes.cs.washington.edu/~suciu/ssd-next.pdfa2 R. Keven 0.9 a3 Calvin C. 0.8 a4 Collins 0.2 Actor Actor.name ≈ ‘Kevin’](https://reader035.vdocument.in/reader035/viewer/2022081616/601ef75d96f3c96c7f6f8b29/html5/thumbnails/18.jpg)
18
Formal Definition
A probabilistic database = a probability distribution on all instances
Definition
Pr: Inst → [0,1], ∑I∈Inst Pr[I] = 1
Very powerful: any correlations between tuples
Notation: probabilistic database Ip
“Possible Worlds”
![Page 19: Semistructured Data: The Next Stephomes.cs.washington.edu/~suciu/ssd-next.pdfa2 R. Keven 0.9 a3 Calvin C. 0.8 a4 Collins 0.2 Actor Actor.name ≈ ‘Kevin’](https://reader035.vdocument.in/reader035/viewer/2022081616/601ef75d96f3c96c7f6f8b29/html5/thumbnails/19.jpg)
19
Query Semantics
Given Ip, query Q, tuple t:
Pr[t ∈Q] = ∑I∈Inst: t ∈Q(I) Pr[I]
Definition
Note: EVERY query has well defined semantics !
Notation: Q(Ip) = prob distribution on tuples
Return to user: ranked tuples, top k
![Page 20: Semistructured Data: The Next Stephomes.cs.washington.edu/~suciu/ssd-next.pdfa2 R. Keven 0.9 a3 Calvin C. 0.8 a4 Collins 0.2 Actor Actor.name ≈ ‘Kevin’](https://reader035.vdocument.in/reader035/viewer/2022081616/601ef75d96f3c96c7f6f8b29/html5/thumbnails/20.jpg)
20
Example
Film
What are the “possible worlds” ?How do we “evaluate” a query ?
Actor
filmID title year prob
f1 t1 1995 p1f2 t2 1982 p2m3 t3 1982 p3m4 t4 1971 p4
filmID actorName prob
f1 Kevin B. q1f1 R. Keven q2f1 Calvin C. q3f2 Collins q4f2 . . . . . .
![Page 21: Semistructured Data: The Next Stephomes.cs.washington.edu/~suciu/ssd-next.pdfa2 R. Keven 0.9 a3 Calvin C. 0.8 a4 Collins 0.2 Actor Actor.name ≈ ‘Kevin’](https://reader035.vdocument.in/reader035/viewer/2022081616/601ef75d96f3c96c7f6f8b29/html5/thumbnails/21.jpg)
21
Example: Possible WorldsfilmID title year prob
f1 T1 1995 p1f2 T2 1982 p2m3 T3 1982 p3m4 T4 1971 p4
filmID title year
m3 T3 1982
(1-p1)*(1-p2)*p3*(1-p4)
filmID title year
f1 T1 1995
f2 T2 1982
m4 T4 1971
p1*p2*(1-p3)*p4
filmID title year
f1 T1 1995
f2 T2 1982
m3 T3 1982
m4 T4 1971
p1*p2*p3*p4
. . .
. . .
Ip
![Page 22: Semistructured Data: The Next Stephomes.cs.washington.edu/~suciu/ssd-next.pdfa2 R. Keven 0.9 a3 Calvin C. 0.8 a4 Collins 0.2 Actor Actor.name ≈ ‘Kevin’](https://reader035.vdocument.in/reader035/viewer/2022081616/601ef75d96f3c96c7f6f8b29/html5/thumbnails/22.jpg)
22
Example: Query Evaluationyear filmID prob
1995 f1 p11982 f2 p21982 m3 p3
filmID actorName prob
f1 Kevin B. q1f1 R. Keven q2f1 Calvin C. q3f2 Collins q4f2 . . . q5m3 . . . q6
Prob(1995) = p1 * (1 - (1 - q1)*(1 - q2)*(1 - q3))
Prob(1982) = 1 - (1 - )* (1 - )
p2 * (1 - (1 - q4)*(1 - q5))p3 * (1 - (1 - q6))
Filmp Castp
Q(x) :- Film(x,y), Cast(y,z)
![Page 23: Semistructured Data: The Next Stephomes.cs.washington.edu/~suciu/ssd-next.pdfa2 R. Keven 0.9 a3 Calvin C. 0.8 a4 Collins 0.2 Actor Actor.name ≈ ‘Kevin’](https://reader035.vdocument.in/reader035/viewer/2022081616/601ef75d96f3c96c7f6f8b29/html5/thumbnails/23.jpg)
23
Summary of the new Paradigm
System converts all imprecisions into prob db Ip
User asks standard query Q
System computes Q(Ip), ranks tuples, returns top k
Excellent exploratory queries, not business process
Variations: Expose probabilities: Trio [Stanford] Hide probabilities: MystiQ [UW]
![Page 24: Semistructured Data: The Next Stephomes.cs.washington.edu/~suciu/ssd-next.pdfa2 R. Keven 0.9 a3 Calvin C. 0.8 a4 Collins 0.2 Actor Actor.name ≈ ‘Kevin’](https://reader035.vdocument.in/reader035/viewer/2022081616/601ef75d96f3c96c7f6f8b29/html5/thumbnails/24.jpg)
24
Outline
Semistructured data: past and present
Probabilistic databases: the new paradigm
Supporting technologies and theoretical roots
Towards a theory of probabilistic queries
Conclusions
![Page 25: Semistructured Data: The Next Stephomes.cs.washington.edu/~suciu/ssd-next.pdfa2 R. Keven 0.9 a3 Calvin C. 0.8 a4 Collins 0.2 Actor Actor.name ≈ ‘Kevin’](https://reader035.vdocument.in/reader035/viewer/2022081616/601ef75d96f3c96c7f6f8b29/html5/thumbnails/25.jpg)
25
Technology: Top-k Ranking[Chaudhuri&Gravano’99]
Data is exact, but user is willingto accept inexact matches
SELECT * FROM HousesWHERE no_bdr = 4 and no_garages = 3 and price= 300k
Rich recent literature on evaluation techniques
![Page 26: Semistructured Data: The Next Stephomes.cs.washington.edu/~suciu/ssd-next.pdfa2 R. Keven 0.9 a3 Calvin C. 0.8 a4 Collins 0.2 Actor Actor.name ≈ ‘Kevin’](https://reader035.vdocument.in/reader035/viewer/2022081616/601ef75d96f3c96c7f6f8b29/html5/thumbnails/26.jpg)
26
Technology:Optimal Aggregation
Theorem [Fagin,Lotem,Naor’01,03]There are optimal algorithms for computing top k
![Page 27: Semistructured Data: The Next Stephomes.cs.washington.edu/~suciu/ssd-next.pdfa2 R. Keven 0.9 a3 Calvin C. 0.8 a4 Collins 0.2 Actor Actor.name ≈ ‘Kevin’](https://reader035.vdocument.in/reader035/viewer/2022081616/601ef75d96f3c96c7f6f8b29/html5/thumbnails/27.jpg)
27
Technology:Ranking Query Results
Keyword searches DBExplorer [Chaudhuri’02]DISCOVER [Hristidis’02]BANKS [Bhalotia’02]
Using ontologies XXL [Theobald’02]TOSS [Hung’04]
Approximate SQL [Agrawal’03][Dalvi’04]
Data/queries are inexact
![Page 28: Semistructured Data: The Next Stephomes.cs.washington.edu/~suciu/ssd-next.pdfa2 R. Keven 0.9 a3 Calvin C. 0.8 a4 Collins 0.2 Actor Actor.name ≈ ‘Kevin’](https://reader035.vdocument.in/reader035/viewer/2022081616/601ef75d96f3c96c7f6f8b29/html5/thumbnails/28.jpg)
28
Technology:Object Matching
AKA: Object fusion, record linkage,merge/purge, de-duplication, ....
Long history, rich literature
Problem: given n objects, find duplicates assuming typos, misspellings, ...
Theorem: AI-complete
[Felligi&Sunter’69, Winkler’99]
![Page 29: Semistructured Data: The Next Stephomes.cs.washington.edu/~suciu/ssd-next.pdfa2 R. Keven 0.9 a3 Calvin C. 0.8 a4 Collins 0.2 Actor Actor.name ≈ ‘Kevin’](https://reader035.vdocument.in/reader035/viewer/2022081616/601ef75d96f3c96c7f6f8b29/html5/thumbnails/29.jpg)
29
Technology:Information Extraction
KnoItAll [Etzioni et al.’04]
Extractors defined per class. E.g. scientist
Collect large body of information by searching the Web
Albert Einstein 0.9262188000Niels Bohr 0.9207664000Enrico Fermi 0.9199148000Marie Curie 0.9156590000Max Planck 0.9126473000Ernest Rutherford 0.9108679000Michael Faraday 0.9105023000Louis Pasteur 0.8986677000Stephen Hawking 0.8986677000Henry Cavendish 0.8981776000
Lots of usefulprobabilisticdatabases !
![Page 30: Semistructured Data: The Next Stephomes.cs.washington.edu/~suciu/ssd-next.pdfa2 R. Keven 0.9 a3 Calvin C. 0.8 a4 Collins 0.2 Actor Actor.name ≈ ‘Kevin’](https://reader035.vdocument.in/reader035/viewer/2022081616/601ef75d96f3c96c7f6f8b29/html5/thumbnails/30.jpg)
30
Theory:Random Graphs (1/2)
Erdos and Reny
Independent tuple distribution: Pr[t] = p(n);
Given property Q, let µ[Q] = limn→∞ Pr[Q]
Theorem Every monotone Q has “threshod function” t(n): if p(n) << t(n) then µ[Q] = 0 if p(n) >> t(n) then µ[Q] = 1
p(n) = C/n2
![Page 31: Semistructured Data: The Next Stephomes.cs.washington.edu/~suciu/ssd-next.pdfa2 R. Keven 0.9 a3 Calvin C. 0.8 a4 Collins 0.2 Actor Actor.name ≈ ‘Kevin’](https://reader035.vdocument.in/reader035/viewer/2022081616/601ef75d96f3c96c7f6f8b29/html5/thumbnails/31.jpg)
31
Theory:Random Graphs (2/2)
Erdos and Reny
Programfind “threshold” functions p(n) for various properties Q
Important example: subgraph property: Q ⊆ GThreshold is: t(n) = 1/nh (note: >> 1/n2) where h = minH H/e(H) (inv. density)
![Page 32: Semistructured Data: The Next Stephomes.cs.washington.edu/~suciu/ssd-next.pdfa2 R. Keven 0.9 a3 Calvin C. 0.8 a4 Collins 0.2 Actor Actor.name ≈ ‘Kevin’](https://reader035.vdocument.in/reader035/viewer/2022081616/601ef75d96f3c96c7f6f8b29/html5/thumbnails/32.jpg)
32
Theory: 0/1 Laws
Shelah and Spencer
Pr[t] = 1/nß
0/1 law holds if ß is irrational
Grove, Halpern, Koller
Pr[t] = 1/2µ[Q|V] does not always exist, and it’s undecidable
Fagin’76
Pr[t] = 1/2for an Q in FO, µ[Q] = 0 or µ[Q] = 1
p(n) = C/n2
![Page 33: Semistructured Data: The Next Stephomes.cs.washington.edu/~suciu/ssd-next.pdfa2 R. Keven 0.9 a3 Calvin C. 0.8 a4 Collins 0.2 Actor Actor.name ≈ ‘Kevin’](https://reader035.vdocument.in/reader035/viewer/2022081616/601ef75d96f3c96c7f6f8b29/html5/thumbnails/33.jpg)
33
Theory: Prob DB’s (1/3)[Fuhr&Roellke]
Independent events: e1, e2, ... with probabilities: p1, p2, ...
A B Ea x e1a y e3∧e4b z e4b u e2 ∨ (¬e4∧e1)
Formalism for specifying a prob db
Intensional database:
Ip =
Extensional database: all events are independent
![Page 34: Semistructured Data: The Next Stephomes.cs.washington.edu/~suciu/ssd-next.pdfa2 R. Keven 0.9 a3 Calvin C. 0.8 a4 Collins 0.2 Actor Actor.name ≈ ‘Kevin’](https://reader035.vdocument.in/reader035/viewer/2022081616/601ef75d96f3c96c7f6f8b29/html5/thumbnails/34.jpg)
34
Theory: Prob DB’s (2/3)[Fuhr&Roellke]
given E⊆{e1, e2, ...} define I = inst(Ip,E)
Pr[I] = ∑ {Pr[E] | I = inst(Ip,E)}
Possible worlds semantics
Theorem [Dalvi&S] Every probability distribution Pr[I] comes from an intentional db
Theorem Computing Pr[t ∈ Ip] is #P complete
![Page 35: Semistructured Data: The Next Stephomes.cs.washington.edu/~suciu/ssd-next.pdfa2 R. Keven 0.9 a3 Calvin C. 0.8 a4 Collins 0.2 Actor Actor.name ≈ ‘Kevin’](https://reader035.vdocument.in/reader035/viewer/2022081616/601ef75d96f3c96c7f6f8b29/html5/thumbnails/35.jpg)
35
Theory: Prob DB’s (3/3)
A B Ea x e1a y e3∧e4b z e4b u e2 ∨ (¬e4∧e1)
Theorem [Fuhr&Roellke] Closed under FO evaluation
A Ea e1∨(e3∧e4)b e4 ∨ (e2 ∨ (¬e4∧e1) )
∃y.actor(x,y)
Intesional databases: powerful but expensive
![Page 36: Semistructured Data: The Next Stephomes.cs.washington.edu/~suciu/ssd-next.pdfa2 R. Keven 0.9 a3 Calvin C. 0.8 a4 Collins 0.2 Actor Actor.name ≈ ‘Kevin’](https://reader035.vdocument.in/reader035/viewer/2022081616/601ef75d96f3c96c7f6f8b29/html5/thumbnails/36.jpg)
36
Summary of SupportingWork
Supporting technology
a lot exists
need to understand it in order to make correct assumptions
Theoretical roots
Old and solid
Do not apply out of the box
![Page 37: Semistructured Data: The Next Stephomes.cs.washington.edu/~suciu/ssd-next.pdfa2 R. Keven 0.9 a3 Calvin C. 0.8 a4 Collins 0.2 Actor Actor.name ≈ ‘Kevin’](https://reader035.vdocument.in/reader035/viewer/2022081616/601ef75d96f3c96c7f6f8b29/html5/thumbnails/37.jpg)
37
Outline
Semistructured data: past and present
Probabilistic databases: the new paradigm
Supporting technologies and theoretical roots
Towards a theory of probabilistic queries
Conclusions
![Page 38: Semistructured Data: The Next Stephomes.cs.washington.edu/~suciu/ssd-next.pdfa2 R. Keven 0.9 a3 Calvin C. 0.8 a4 Collins 0.2 Actor Actor.name ≈ ‘Kevin’](https://reader035.vdocument.in/reader035/viewer/2022081616/601ef75d96f3c96c7f6f8b29/html5/thumbnails/38.jpg)
38
An Agenda for a Theory of Probabilistic Queries
Specification languages
Query evaluation: data complexity
Query evaluation using views
Randomized PTIME query approximation
Probabilistic XML
![Page 39: Semistructured Data: The Next Stephomes.cs.washington.edu/~suciu/ssd-next.pdfa2 R. Keven 0.9 a3 Calvin C. 0.8 a4 Collins 0.2 Actor Actor.name ≈ ‘Kevin’](https://reader035.vdocument.in/reader035/viewer/2022081616/601ef75d96f3c96c7f6f8b29/html5/thumbnails/39.jpg)
39
Specification Languages
Extensional databases
Intensional databases
Probabilistic Relational Model
Implicit probabilities
[Fefferman’99]
[Fuhr&Roellke’97]
[Erdos&Reny,Shelah&Spencer]
Agenda Item:Expressibility/complexity/conciseness tradeoffs
![Page 40: Semistructured Data: The Next Stephomes.cs.washington.edu/~suciu/ssd-next.pdfa2 R. Keven 0.9 a3 Calvin C. 0.8 a4 Collins 0.2 Actor Actor.name ≈ ‘Kevin’](https://reader035.vdocument.in/reader035/viewer/2022081616/601ef75d96f3c96c7f6f8b29/html5/thumbnails/40.jpg)
40
Query Evaluation (1/3)Given: Q, Ip
Compute: Q(Ip)
Data complexity of Q is in PTIME
Prob(1982) = 1 - (1 - )* (1 - )
p2 * (1 - (1 - q4)*(1 - q5))p3 * (1 - (1 - q6))
Q(x) :- Film(x,y), Cast(y,z)
year filmID prob
1982 f1 p21982 f2 p3
FilmfilmID actorID prob
f1 a1 q4f1 a2 q5f2 a2 q6
Cast
Variation: return top k
![Page 41: Semistructured Data: The Next Stephomes.cs.washington.edu/~suciu/ssd-next.pdfa2 R. Keven 0.9 a3 Calvin C. 0.8 a4 Collins 0.2 Actor Actor.name ≈ ‘Kevin’](https://reader035.vdocument.in/reader035/viewer/2022081616/601ef75d96f3c96c7f6f8b29/html5/thumbnails/41.jpg)
41
Query Evaluation (2/3)
Prob(1982, Kevin) = ???
Q(x,u) :- Film(x,y), Cast(y,z), Actor(z,u)
Data complexity of Q is #P-complete
actorID name prob
a1 Kevin r1a2 Kevin r2
Actor
year filmID prob
1982 f1 p21982 f2 p3
FilmfilmID actorID prob
f1 a1 q4f1 a2 q5f2 a2 q6
Cast
![Page 42: Semistructured Data: The Next Stephomes.cs.washington.edu/~suciu/ssd-next.pdfa2 R. Keven 0.9 a3 Calvin C. 0.8 a4 Collins 0.2 Actor Actor.name ≈ ‘Kevin’](https://reader035.vdocument.in/reader035/viewer/2022081616/601ef75d96f3c96c7f6f8b29/html5/thumbnails/42.jpg)
42
Query Evaluation (3/3)Theorem [Dalvi&S.’04,Graedel’98] Data complexity is:
But: restrictions on Q, Ip
Agenda item:Data complexity = f(k, specification lang, query lang)
Q Q(Ip)
..., R(x,...),S(x,y,...),T(y,...),... #P-complete
otherwise PTIME
![Page 43: Semistructured Data: The Next Stephomes.cs.washington.edu/~suciu/ssd-next.pdfa2 R. Keven 0.9 a3 Calvin C. 0.8 a4 Collins 0.2 Actor Actor.name ≈ ‘Kevin’](https://reader035.vdocument.in/reader035/viewer/2022081616/601ef75d96f3c96c7f6f8b29/html5/thumbnails/43.jpg)
43
Query Answering Using Views
Given: Q, V, Jp
Possible world: Ip s.t. V(Ip) = Jp
Compute: Q(Ip) = Answer(Q, V, Jp), top k
Theorem[Dalvi&S.’05] Answer(Q, V, Jp) computable in exptime
Agenda item:Data complexity = f(k, specification lang, query lang)
But: restrictions on Q, Jp
Corollary µ[Q | V] computable in exptime p(n) = C/n2
![Page 44: Semistructured Data: The Next Stephomes.cs.washington.edu/~suciu/ssd-next.pdfa2 R. Keven 0.9 a3 Calvin C. 0.8 a4 Collins 0.2 Actor Actor.name ≈ ‘Kevin’](https://reader035.vdocument.in/reader035/viewer/2022081616/601ef75d96f3c96c7f6f8b29/html5/thumbnails/44.jpg)
44
Query Approximation
Agenda item:Algorithm complexity = f(# calls to query engine)
Given: Q, Ip
Compute: approximate Q(Ip) in PTIME, top k
Theorem [Karp’89] Given a DNF expression E,Pr[E] can be approximated efficiently usingMonte Carlo simulation
![Page 45: Semistructured Data: The Next Stephomes.cs.washington.edu/~suciu/ssd-next.pdfa2 R. Keven 0.9 a3 Calvin C. 0.8 a4 Collins 0.2 Actor Actor.name ≈ ‘Kevin’](https://reader035.vdocument.in/reader035/viewer/2022081616/601ef75d96f3c96c7f6f8b29/html5/thumbnails/45.jpg)
45
Probabilistic XML
PXML: introduced by [Hung,Getoor,Subr.’03]
Still open:
Possible world semantics
Query evaluation
Agenda item:Theory of probabilistic XML data
![Page 46: Semistructured Data: The Next Stephomes.cs.washington.edu/~suciu/ssd-next.pdfa2 R. Keven 0.9 a3 Calvin C. 0.8 a4 Collins 0.2 Actor Actor.name ≈ ‘Kevin’](https://reader035.vdocument.in/reader035/viewer/2022081616/601ef75d96f3c96c7f6f8b29/html5/thumbnails/46.jpg)
46
Outline
Semistructured data: past and present
Probabilistic databases: the new paradigm
Supporting technologies
Theoretical roots
Towards a theory of probabilistic queries
Conclusions
![Page 47: Semistructured Data: The Next Stephomes.cs.washington.edu/~suciu/ssd-next.pdfa2 R. Keven 0.9 a3 Calvin C. 0.8 a4 Collins 0.2 Actor Actor.name ≈ ‘Kevin’](https://reader035.vdocument.in/reader035/viewer/2022081616/601ef75d96f3c96c7f6f8b29/html5/thumbnails/47.jpg)
47
Conclusions
Imprecisions in data are ubiquitous
Data management is ripe for a paradigm shift
Pre-existing technology, theory
Lots of interest from the industry
Kinds of theory: exploratory v.s. explanatory
![Page 48: Semistructured Data: The Next Stephomes.cs.washington.edu/~suciu/ssd-next.pdfa2 R. Keven 0.9 a3 Calvin C. 0.8 a4 Collins 0.2 Actor Actor.name ≈ ‘Kevin’](https://reader035.vdocument.in/reader035/viewer/2022081616/601ef75d96f3c96c7f6f8b29/html5/thumbnails/48.jpg)
Questions ?