Post on 29-Aug-2014

A DATABASE APPROACH TO MONITORING THE QUALITY OF INFORMATION IN RDF STORES

Alexandre Rademaker and Edward Hermann

Wednesday, November 30, 11

NOTES

This is not a research report; it is a research proposal!

Let us start by looking at results from database researchers.

WHAT IS (ENSURING) DATA QUALITY?

Semantic properties of databases can be represented by integrity constraints!

Integrity enforcement means maintaining the correctness of the database. Truth maintenance!

Hendrik, 2011

HENDRIK DECKER

http://web.iti.upv.es/~hendrik/

Universidad Politécnica de Valencia

EXAMPLE

A marriage is between one man and one woman only. How can we model such a constraint in a relational DB?

We are talking about more than check constraints, foreign keys and primary keys.
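
One possible reading of this slide is that such a constraint can be stated as a "denial": a query that must return no rows. The sketch below is only an illustration using Python's sqlite3 module; the `person` and `marriage` tables, their columns and the sample rows are assumptions made up for this example, not part of the talk.

```python
import sqlite3

# Hypothetical schema and data, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (
        id  INTEGER PRIMARY KEY,
        sex TEXT CHECK (sex IN ('M', 'F'))
    );
    CREATE TABLE marriage (
        spouse1 INTEGER REFERENCES person(id),
        spouse2 INTEGER REFERENCES person(id)
    );
    INSERT INTO person VALUES (1, 'M'), (2, 'F'), (3, 'F');
    INSERT INTO marriage VALUES (1, 2), (1, 3);
""")

# Denial 1: the two spouses of a marriage must not have the same sex.
SAME_SEX = """
    SELECT m.spouse1, m.spouse2
    FROM marriage m
    JOIN person a ON a.id = m.spouse1
    JOIN person b ON b.id = m.spouse2
    WHERE a.sex = b.sex
"""

# Denial 2: nobody may take part in more than one marriage.
BIGAMY = """
    SELECT p, COUNT(*) AS n
    FROM (SELECT spouse1 AS p FROM marriage
          UNION ALL
          SELECT spouse2 AS p FROM marriage)
    GROUP BY p
    HAVING COUNT(*) > 1
"""

def violations(connection):
    """Rows returned by each denial; an empty list means the constraint holds."""
    return {
        "same_sex": connection.execute(SAME_SEX).fetchall(),
        "bigamy": connection.execute(BIGAMY).fetchall(),
    }

result = violations(conn)
print(result)
```

Neither denial can be expressed with a column-level CHECK, a primary key or a foreign key alone, which is the point the slide makes.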

DB THEORY USES DATALOG

Datalog is more expressive than core (non-recursive) SQL: it can express transitive closure.

SQL is FOL (decidable over finite models).

SELECT X WHERE Y (give me the bindings that satisfy the clauses).
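
To make the transitive-closure remark concrete, here is a minimal sketch of naive bottom-up evaluation of a two-rule Datalog program, written in Python. The relation names `edge` and `path` and the sample graph are assumptions chosen for illustration.

```python
def transitive_closure(edges):
    """Naive fixpoint evaluation of the Datalog program

        path(X, Y) :- edge(X, Y).
        path(X, Z) :- path(X, Y), edge(Y, Z).
    """
    path = set(edges)
    while True:
        # Apply the recursive rule once to everything derived so far.
        derived = {(x, z) for (x, y) in path for (y2, z) in edges if y == y2}
        if derived <= path:
            return path  # nothing new: the fixpoint has been reached
        path |= derived

edges = {("a", "b"), ("b", "c"), ("c", "d")}
closure = transitive_closure(edges)
print(sorted(closure))
```

The loop stops because the set of derivable pairs over a finite database is finite, which is why such recursive queries stay decidable.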

TWO WAYS TO ENFORCE INTEGRITY

On each update, check whether any integrity constraint is violated. (This is not always checked rigorously, due to its performance penalty.)

Repair extant violations of constraints. (Accumulation of inconsistency is inevitable.)

Hendrik, 2011

INCONSISTENCY-TOLERANT METHODS

The rigorous way is to eliminate all inconsistency: repair the whole database.

Relaxation... partial (flexible) repairs!

Hendrik, 2011

Absolute consistency is out of the question due to its intractability!

FLEXIBILITY OF PARTIAL INCONSISTENCY

Flexibility is served in two ways:

Integrity enforcement is more flexible. It does not have to be done all at once. (Constraint violations can be tolerated and resolved at an appropriate moment.)

Some inconsistency may be unknown at update time. A total approach would fail in such a situation.

But...

Hendrik, 2011

PARTIAL REPAIRS

Absolute consistency is out of the question due to its intractability.

But naive inconsistency-tolerant repairs can be data-destructive.

For a rational, flexible repair strategy, one needs criteria (expressed in terms of metrics).

Only admit repairs that are integrity-preserving! That is, the total amount of integrity violation must not increase after the repair.

Hendrik, 2011

FORMAL DEFINITIONS

Hendrik, 2011

D = database
IC = integrity theory
I = constraint
U = update

D(F) = true if F evaluates to true in D

D(I) = true if I is satisfied in D

D(IC) = true if every I in IC is satisfied in D

For an update U (inserts, deletes) of a database D, we denote by D^U the updated database.

FORMAL DEFINITIONS

Hendrik, 2011

Let ≼ be an ordering that is antisymmetric, reflexive and transitive.

For two elements A and B in a lattice, A ⊕ B is their least upper bound.

FORMAL DEFINITIONS

Hendrik, 2011

We say that (µ, ≼) is an inconsistency metric if µ maps tuples (D, IC) to some lattice that is partially ordered by ≼.

A simple example of a metric β is given by β(D, IC) = D(IC), with the natural order true ≺ false of the range of β.

That is, integrity satisfaction, D(IC) = true, means lower inconsistency than integrity violation, D(IC) = false.

Non-trivial examples are given by comparing or counting violated constraints.
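
The two kinds of metric mentioned on this slide can be sketched as follows: the binary one, which simply returns D(IC), and a counting one. This is an illustration under assumptions: the database is modelled as a set of tuples, each constraint as a Python function from a database to a truth value, and all fact and constraint names are invented.

```python
# A database is a set of facts; a constraint maps a database to
# True (satisfied) or False (violated). All names are illustrative.

def no_self_marriage(db):
    return all(a != b for (rel, a, b) in db if rel == "married")

def married_are_persons(db):
    persons = {a for (rel, a, _) in db if rel == "person"}
    return all(a in persons and b in persons
               for (rel, a, b) in db if rel == "married")

IC = [no_self_marriage, married_are_persons]

def binary_metric(db, ic):
    """The binary metric: the value is D(IC), with true below false."""
    return all(constraint(db) for constraint in ic)

def count_violated(db, ic):
    """A counting metric: number of violated constraints, ordered by <=."""
    return sum(1 for constraint in ic if not constraint(db))

db = {
    ("person", "ann", None),
    ("married", "ann", "bob"),   # bob is not a known person
    ("married", "ann", "ann"),   # self-marriage
}
print(binary_metric(db, IC), count_violated(db, IC))
```

The counting metric distinguishes a database that violates one constraint from one that violates two, which the binary metric cannot do.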

INCONSISTENCY METRICS

Inconsistency metrics are used to decide whether an update preserves integrity, that is, does not create an integrity violation that did not exist before the update.

Intuitively, an update preserves integrity if it does not increase the measured inconsistency.

Hendrik, 2011

For a metric (µ, ≼), an update U of a database D with integrity theory IC is integrity-preserving with regard to (µ, ≼) if µ(D^U, IC) ≼ µ(D, IC).
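
A minimal sketch of this definition, assuming one particular metric: µ(D, IC) is taken to be the set of violated constraints, ordered by subset inclusion (a lattice whose least upper bound is set union). The facts, constraint names and helper functions below are hypothetical.

```python
def apply_update(db, inserts=(), deletes=()):
    """Return the updated database D^U without modifying D."""
    return (set(db) - set(deletes)) | set(inserts)

def violated(db, ic):
    """mu(D, IC): the names of the constraints in IC that D violates."""
    return {name for name, holds in ic.items() if not holds(db)}

def is_integrity_preserving(db, ic, inserts=(), deletes=()):
    """True if mu(D^U, IC) is below or equal to mu(D, IC) under inclusion."""
    updated = apply_update(db, inserts, deletes)
    return violated(updated, ic) <= violated(db, ic)

# Facts are (name, age) pairs; both constraints are made up.
IC = {
    "non_negative_age": lambda db: all(age >= 0 for (_, age) in db),
    "unique_name": lambda db: len({name for (name, _) in db}) == len(db),
}
D = {("ann", 30), ("bob", -1)}  # already violates non_negative_age

harmless = is_integrity_preserving(D, IC, inserts=[("carl", 20)])
harmful = is_integrity_preserving(D, IC, inserts=[("ann", 31)])
repair = is_integrity_preserving(D, IC, deletes=[("bob", -1)])
print(harmless, harmful, repair)
```

Note that the first update is accepted even though D is already inconsistent: the existing violation is tolerated because the update does not add a new one.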

AND MORE...

Inconsistency-tolerant integrity checking

Repairs

Computing and checking partial repairs

Computing integrity-preserving repairs

Hendrik, 2011

WHY ARE WE TALKING ABOUT IT?

Lattes@FGV Project (a unified KB of FGV research publications, researchers, skills etc), http://dck092.fgv.br/

The Semantic Web brings RDF, description logics, linked data, etc.

Our research topics include logics and knowledge representation.

RDF is the key concept of the Semantic Web.

A relational database has a fixed model (comparable to the TBox of an ontology).

TOPOS: THEORETICAL PART

A topos (plural topoi or toposes) is a category with a quite expressive internal logic.

The category of graphs and graph homomorphisms can be viewed as a topos.

This topos already has a Heyting algebra that is used as the truth basis of its internal logic.

A Heyting algebra is a lattice with additional properties. This topos-theoretic view of RDF stores can be investigated as a natural foundation for partial repairs in RDF stores.

Besides that, if we view traditional DBs as finite first-order logical structures, the category of (finite) first-order structures and homomorphisms between them has its own internal logic. This internal logic can also be investigated with regard to partial repairs.

Scratching the surface!

LATTES@FGV

LATTES@FGV: THE RDF KB

http://dck092.fgv.br:10035/repositories/fgv (800k triples)

LATTES@FGV

480 Lattes CVs and data collected from other sources (Qualis, Digital Library, etc.) in one triple store.

Lots of errors (inconsistencies) for different reasons: a poor user interface for data input, misinterpretation, etc.

How can we identify the errors? (In a non-ad-hoc manner.)

How can we fix what can be fixed automatically?

INTEGRITY CONSTRAINTS IN RDF

We can consider extending what was discussed so far to non-SQL stores.

A KR/DB can be viewed as a graph.

The query language of RDF-based stores, SPARQL, can be used to provide semantics to the store.

EXAMPLES

An article referenced by a CV must have the author of this CV as one of its authors!
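
As a rough illustration of this constraint, the sketch below checks it over an in-memory set of (subject, predicate, object) triples in plain Python. The predicates `ex:owner` and `ex:references` and all resource names are assumptions invented here; the actual vocabulary of the Lattes@FGV store is not shown in the slides.

```python
def cv_author_violations(triples):
    """Return (cv, article) pairs where the CV's owner is not an author.

    A CV with no recorded owner is also reported, since the constraint
    cannot be verified for it.
    """
    triples = set(triples)
    owner = {s: o for (s, p, o) in triples if p == "ex:owner"}
    return sorted(
        (cv, article)
        for (cv, p, article) in triples
        if p == "ex:references"
        and (article, "dc:creator", owner.get(cv)) not in triples
    )

# Hypothetical sample data.
kb = [
    ("cv1", "ex:owner", "ann"),
    ("cv1", "ex:references", "art1"),
    ("cv1", "ex:references", "art2"),
    ("art1", "dc:creator", "ann"),
    ("art2", "dc:creator", "bob"),
]
bad = cv_author_violations(kb)
print(bad)
```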

EXAMPLES

If two resources were identified as referring to the same article, every author of the first one should also be related to the second one!

IN THE LAST EXAMPLE

Of course, two publications cannot be considered the same by comparing only their titles!

We need entity alignment, a similarity checker...

Suppose we have identified all resources that represent the same real “entity” using owl:sameAs, then ...

ask {
  ?p1 owl:sameAs ?p2 ;
      dc:creator ?c .
  OPTIONAL {
    ?p2 ?rel ?c .
  }
  FILTER( !bound(?rel) )
}
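
To see what the ASK query tests, the same condition can be spelled out over an in-memory set of triples. This is a plain-Python sketch that does not use a SPARQL engine; the sample resources are invented, and the function lists the offending bindings where the ASK form would only answer true or false.

```python
def same_as_violations(triples):
    """Bindings (p1, p2, c) such that p1 owl:sameAs p2, p1 dc:creator c,
    and no triple at all links p2 to c."""
    triples = set(triples)
    related = {(s, o) for (s, _, o) in triples}
    return sorted(
        (p1, p2, c)
        for (p1, p, p2) in triples if p == "owl:sameAs"
        for (s, q, c) in triples if s == p1 and q == "dc:creator"
        if (p2, c) not in related
    )

# Hypothetical sample data.
kb = {
    ("pubA", "owl:sameAs", "pubB"),
    ("pubA", "dc:creator", "ann"),
    ("pubA", "dc:creator", "bob"),
    ("pubB", "dc:creator", "ann"),
}
found = same_as_violations(kb)
ask_result = bool(found)  # what the ASK query would answer
print(found, ask_result)
```

Because owl:sameAs is only followed in the direction in which it was asserted, the check is one-sided unless the symmetric triple is also present.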

A LITTLE BIT ABOUT THE IDENTIFICATION OF SIMILARITY

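;; Assert owl:sameAs for each pair; when the first element is not a
;; blank node, the pair is reversed before the triple is added.
(defun assert-same-list (list)
  (let ((new nil))
    (dolist (pair list)
      (let ((a (first pair)))
        (if (not (blank-node-p a))
            (push (reverse pair) new)
            (push pair new))))
    (dolist (pair new)
      (add-triple (first pair) !owl:sameAs (second pair)))))

;; Pair up agents that share exactly the same foaf:name.
;; (insert-same-as is the callback; its definition is not shown.)
(select0/callback (?x ?y) #'insert-same-as
  (q- ?x !rdf:type !foaf:Agent)
  (q- ?y !rdf:type !foaf:Agent)
  (q- ?x !foaf:name ?n)
  (q- ?y !foaf:name ?n)
  (lispp (upi< ?x ?y)))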
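
The pairing that this query performs can be pictured outside the triple store as follows. This Python sketch is not the authors' code: it groups nodes by name instead of joining the store with itself, and the node identifiers and names are invented. It yields each unordered pair once, mirroring the role of the `upi<` test.

```python
from collections import defaultdict
from itertools import combinations

def same_name_pairs(names):
    """names maps a node id to its name; return every pair (x, y) with
    x < y whose names are exactly equal."""
    by_name = defaultdict(list)
    for node, name in names.items():
        by_name[name].append(node)
    pairs = []
    for nodes in by_name.values():
        # n nodes with the same name give n * (n - 1) / 2 "handshakes".
        pairs.extend(combinations(sorted(nodes), 2))
    return sorted(pairs)

agents = {"a1": "Ann Lee", "a2": "Ann Lee", "a3": "Ann Lee", "b1": "Bob Ray"}
pairs = same_name_pairs(agents)
print(pairs)
```

Exact string equality is what makes the approach naive: spelling variants of the same name are missed, and distinct people who share a name are wrongly paired.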

Naive approach: Shaking hands!

A LITTLE BIT ABOUT THE IDENTIFICATION OF SIMILARITY

;; Split the vertices into groups: take the ego group of the first
;; remaining vertex, remove its members, and repeat.
(defun components (vertices n generator)
  (do ((res nil)
       (vtx vertices (set-difference vtx (car res) :test #'upi=)))
      ((null vtx) res)
    (push (ego-group (car vtx) n generator) res)))

;; Two journal nodes are linked when they share a valid ISSN and their
;; titles have a Jaro-Winkler score above 0.7.
(defsna-generator same-journal (node)
  (select0 (?j)
    (q- (?? node) !bibo:issn ?i)
    (q- ?j !bibo:issn ?i)
    (lispp (utils::check-issn (part->value ?i)))
    (lispp (upi< node ?j))
    (q- ?j !dc:title ?t2)
    (q- (?? node) !dc:title ?t1)
    (lispp (> (utils::jaro-winkler-distance (part->value ?t1)
                                            (part->value ?t2))
              0.7))))

;; Merge every group of journal nodes found above.
(let ((nodes (mapcar #'subject (get-triples-list :p !bibo:issn :limit nil))))
  (dolist (g (components nodes 2 'same-journal))
    (merge-nodes g)))

An ad-hoc solution: breadth-first search of connected components!
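
For comparison, a textbook breadth-first search for connected components might look like the sketch below. It is written in Python rather than against the triple store, and it assumes a symmetric `neighbours` function standing in for the similarity generator; the journal identifiers and the adjacency table are made up.

```python
from collections import deque

def connected_components(vertices, neighbours):
    """Group vertices into connected components by breadth-first search.

    `neighbours(v)` must return the vertices directly linked to v and is
    assumed to be symmetric.
    """
    seen = set()
    components = []
    for start in vertices:
        if start in seen:
            continue
        seen.add(start)
        component = []
        queue = deque([start])
        while queue:
            node = queue.popleft()
            component.append(node)
            for nxt in neighbours(node):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        components.append(component)
    return components

# Hypothetical similarity links between journal nodes.
links = {"j1": ["j2"], "j2": ["j1", "j3"], "j3": ["j2"], "j4": []}
groups = connected_components(["j1", "j2", "j3", "j4"], lambda v: links[v])
print(groups)
```

Each group could then be handed to a merge step, as `merge-nodes` does in the slide's code.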
