Probabilistic Models for Relational Data

Probabilistic Models for Relational Data Seminar Data Mining (SS 2005) Prof. Dr. Thomas Hofmann Dipl. Inform. Steffen Hartmann Xin Dong 05,07,2005


Page 1: Probabilistic  Models  for       Relational Data

Probabilistic Models for Relational Data

Seminar Data Mining (SS 2005)

Prof. Dr. Thomas Hofmann

Dipl. Inform. Steffen Hartmann

Xin Dong 05,07,2005

Page 2: Probabilistic  Models  for       Relational Data

History/Introduction

“flat” data

relational data

plate models and probabilistic relational models (PRMs)

  graphically quite different

  similar in how they express probabilistic relationships

probabilistic entity-relationship (PER) model

  an extension of the ER model

  enhances expressiveness

  makes relationships first-class objects

  makes relational data easy to model

directed acyclic probabilistic entity-relationship (DAPER) model

  more similar, more expressive

  the use of restricted relationships, self relationships, probabilistic relationships

Page 3: Probabilistic  Models  for       Relational Data

The Basic Ideas --- ER Model

Entity-relationship (ER) model

a commonly used abstract representation of database structure

the first step in the process of building a relational database

features of the anticipated data and how they interrelate are encoded in the model

the model is used to create a relational schema for the database, which in turn is used to build the database itself

it is a representation of a database structure, not of a particular database that contains data

Page 4: Probabilistic  Models  for       Relational Data

The Basic Ideas ---ER Model

Definitions

entity --- a thing or object that is or may be stored in a database

relationship --- a specific interaction among entities

attribute --- a variable describing some property of an entity or relationship

Page 5: Probabilistic  Models  for       Relational Data

The Basic Ideas --- ER Model

Example 1

A university database maintains records on students and their IQs, courses and their difficulty, and the courses taken by students and the grades they receive.

Distinguish between an ER diagram and an ER model:

ER diagram --- only the graph

ER model --- ER diagram + mechanism

Skeleton and instance for an ER model:

skeleton --- a collection of corresponding entity and relationship sets

instance --- skeleton + an assignment of a value to every attribute

an instance of an ER model is an actual database
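The distinction between skeleton and instance can be sketched in a few lines of Python. This is only an illustration for Example 1: the dictionary layout and the helper names `ground_attributes` and `is_instance` are assumptions made here, and the pairing of attribute values with students, courses, and enrolments is chosen for illustration.

```python
# Skeleton for Example 1: two entity sets and one relationship set.
skeleton = {
    "Student": ["john", "mary"],
    "Course": ["cs107", "stat10"],
    "Takes": [("john", "cs107"), ("mary", "cs107"), ("mary", "stat10")],
}

# Attribute classes of the ER model, keyed by entity or relationship class.
attribute_classes = {"Student": ["IQ"], "Course": ["Diff"], "Takes": ["Grade"]}


def ground_attributes(skeleton, attribute_classes):
    """Apply the ER model to a skeleton: one attribute per element and attribute class."""
    return [
        (cls, element, attr)
        for cls, attrs in attribute_classes.items()
        for element in skeleton[cls]
        for attr in attrs
    ]


def is_instance(assignment, skeleton, attribute_classes):
    """True when the assignment gives a value to exactly the ground attributes."""
    return set(assignment) == set(ground_attributes(skeleton, attribute_classes))


# Illustrative values only; the pairing of values with rows is an assumption.
instance = {
    ("Student", "john", "IQ"): 120,
    ("Student", "mary", "IQ"): 125,
    ("Course", "cs107", "Diff"): "A",
    ("Course", "stat10", "Diff"): "B",
    ("Takes", ("john", "cs107"), "Grade"): 3.0,
    ("Takes", ("mary", "cs107"), "Grade"): 2.0,
    ("Takes", ("mary", "stat10"), "Grade"): 1.0,
}

print(len(ground_attributes(skeleton, attribute_classes)))
print(is_instance(instance, skeleton, attribute_classes))
```

The skeleton fixes which objects and relationships exist; the instance adds a value for every attribute those objects and relationships carry.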

Page 6: Probabilistic  Models  for       Relational Data

[Figure: the university database of Example 1]

(a) ER model: entity class Student with attribute class IQ, entity class Course with attribute class Diff, and relationship class Takes with attribute class Grade.

(b) An example skeleton for the entity and relationship classes:

entity set Student: john, mary

entity set Course: cs107, stat10

relationship set Takes (Student, Course): (john, cs107), (mary, cs107), (mary, stat10)

(c) The attributes defined by the application of the ER model to the skeleton: john.IQ, mary.IQ, cs107.Diff, stat10.Diff, T(john,cs107).G, T(mary,cs107).G, T(mary,stat10).G.

Page 7: Probabilistic  Models  for       Relational Data

[Figure: skeleton versus instance]

Skeleton for a set of entity and relationship classes:

Student: john, mary

Course: cs107, stat10

Takes (Student, Course): (john, cs107), (mary, cs107), (mary, stat10)

Instance for an ER model: the same skeleton together with a value for every attribute. Values as listed on the slide:

Student.IQ: 120, 125

Course.Diff: A, B

Takes.Grade: 3.0, 2.0, 1.0

Page 8: Probabilistic  Models  for       Relational Data

The Basic Ideas --- DAPER Model

directed acyclic probabilistic entity relationship (DAPER) model

ER model with directed (solid) arcs and local distribution classes

arc class --- represents probabilistic dependencies among corresponding attributes

local distribution class --- defines local distributions for attributes

DAPER diagram --- graph

DAPER model --- diagram + the local distribution classes + the mechanism, by which a DAPER model defines a directed acyclic graphical (DAG) model given a skeleton.

Page 9: Probabilistic  Models  for       Relational Data

The Basic Ideas --- DAPER Model

Example 2

In the university database (Example 1), a student’s grade in a course depends both on the student’s IQ and on the difficulty of the course.

arc class

constraint

local distribution class --- a specification from which local distributions for attributes corresponding to the attribute class can be constructed when a DAPER model is expanded to a DAG model

The local distribution class for Takes.Grade, p(Takes.Grade | Student.IQ, Course.Diff), is a specification from which the local distributions for Takes(s, c).Grade, for all students s and courses c, can be constructed.
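As a worked illustration, the ground model for a given skeleton factors in the usual way for a DAG model. The marginal terms for IQ and Diff are an assumption made here (Example 2 gives these attributes no parents); the slides state only the local distribution class for Takes.Grade.

```latex
p(\text{all attributes})
  = \prod_{s \in \mathrm{Student}} p(s.\mathrm{IQ})
    \prod_{c \in \mathrm{Course}} p(c.\mathrm{Diff})
    \prod_{(s,c) \in \mathrm{Takes}}
      p\bigl(\mathrm{Takes}(s,c).\mathrm{Grade} \mid s.\mathrm{IQ},\, c.\mathrm{Diff}\bigr)
```

Every factor in the last product is built from the same local distribution class, which is what makes the DAPER model a compact description of the DAG model.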

Page 10: Probabilistic  Models  for       Relational Data

[Figure: expanding a DAPER model]

(a) DAPER model: the ER model of Example 1 with arc classes into Takes.Grade from Student.IQ and Course.Diff, carrying the constraints Course[Diff] = Course[Grade] and Student[IQ] = Student[Grade].

(b) An example skeleton for the entity and relationship classes:

Student: john, mary

Course: cs107, stat10

Takes (Student, Course): (john, cs107), (mary, cs107), (mary, stat10)

(c) Directed acyclic graphical (DAG) model defined by application of the DAPER model to the ER skeleton, with nodes john.IQ, mary.IQ, cs107.Diff, stat10.Diff, T(john,cs107).G, T(mary,cs107).G, T(mary,stat10).G.
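The expansion from DAPER model plus skeleton to ground graph can be sketched as follows. This is a minimal illustration for the university example only; the function name `expand_university_daper` and the string encoding of attributes are assumptions made here, not part of the DAPER formalism.

```python
# Skeleton from panel (b).
skeleton = {
    "Student": ["john", "mary"],
    "Course": ["cs107", "stat10"],
    "Takes": [("john", "cs107"), ("mary", "cs107"), ("mary", "stat10")],
}


def expand_university_daper(skeleton):
    """Return the ground-graph arcs as (parent, child) pairs of attribute names."""
    arcs = set()
    for s, c in skeleton["Takes"]:
        child = f"T({s},{c}).G"
        # Arc class IQ -> Grade, constraint Student[IQ] = Student[Grade].
        for student in skeleton["Student"]:
            if student == s:
                arcs.add((f"{student}.IQ", child))
        # Arc class Diff -> Grade, constraint Course[Diff] = Course[Grade].
        for course in skeleton["Course"]:
            if course == c:
                arcs.add((f"{course}.Diff", child))
    return arcs


for parent, child in sorted(expand_university_daper(skeleton)):
    print(parent, "->", child)
```

Each grade attribute receives exactly two parents, the IQ of its own student and the difficulty of its own course, because the constraints rule out every other pairing.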

Page 11: Probabilistic  Models  for       Relational Data

The Basic Ideas --- plate Model

developed as a language for compactly representing graphical models in which there are repeated measurements

there has been no formal definition of a plate model; we provide one here. This definition enhances the expressivity of such models while retaining their essence

plate and DAPER models are equivalent

Page 12: Probabilistic  Models  for       Relational Data

[Figure: plate model depicting the structure of a university database. Plates Course and Student intersect in Takes; attribute classes Diff, IQ, and Grade; arc constraints Course[Diff] = Course[Grade] and Student[IQ] = Student[Grade].]

The invertible mapping from a DAPER model to a plate model:

each entity class is drawn as a large rectangle, called a plate

the plate is labeled with the entity-class name

plates are allowed to intersect or overlap

a relationship class is drawn at the named intersection of the plates

attribute classes of an entity class are drawn as ovals inside the rectangle corresponding to the entity, but outside any intersection

attribute classes associated with a relationship class are drawn in the intersection corresponding to the relationship class

arc classes and constraints are drawn just as they are in DAPER models

in addition, local distribution classes are specified just as they are in DAPER models (not shown in the graph)

Page 13: Probabilistic  Models  for       Relational Data

The Basic Ideas --- PRMs

Probabilistic Relational Models (PRMs)

developed explicitly for the purpose of representing relational data

extend the relational model, another commonly used representation for the structure of a database

directed PRMs are equivalent to DAPER models and plate models

Page 14: Probabilistic  Models  for       Relational Data

[Figure: PRM depicting the structure of a university database. Tables Course (Diff), Student (IQ), and Takes (Course, Student, Grade); arc constraints Course[Diff] = Course[Grade] and Student[IQ] = Student[Grade].]

The invertible mapping from a DAPER model to a directed PRM:

the ER-model component of the DAPER model is mapped to a relational model in a standard way

both entity and relationship classes are represented as tables

attribute classes for entity and relationship classes are represented as attributes or columns in the corresponding tables of the relational model

the probabilistic components of the DAPER model are mapped to those of the directed PRM

arc classes and constraints are drawn just as they are in the DAPER model

Page 15: Probabilistic  Models  for       Relational Data

Probabilistic Entity-Relationship Models

Fundamentals

ground graph --- the structure of the DAG model created by the expansion of a DAPER model given a skeleton

drawing of arcs --- an important part of this expansion

mechanism --- must allow important conditional independence relations to be expressed

Page 16: Probabilistic  Models  for       Relational Data

Probabilistic Entity-Relationship Models

Example 3 A database contains diseases and symptoms for a given patient. Every disease is a potential cause of every symptom.

Example 4 Extending Example 3, suppose a physician has identified the possible causes of each symptom.

Page 17: Probabilistic  Models  for       Relational Data

[Figure: DAPER models for Examples 3 and 4]

(a) A DAPER model for a complete bipartite graph between symptoms and diseases: an arc class from Disease.Present to Symptom.Present.

(b) The ground graph (a DAG model structure) generated by applying this DAPER model to any given skeleton is a full bipartite graph, shown with nodes d1.Present, d2.Present, d3.Present and s1.Present, s2.Present, s3.Present.

(c) A DAPER model for an incomplete bipartite graph between symptoms and diseases: the relationship class Causes is added, and the arc class carries the constraint Causes(d, s).

(d) A possible skeleton for Causes (Disease, Symptom): (d1, s1), (d1, s2), (d1, s3), (d2, s2), (d3, s3).

(e) The DAG model resulting from the expansion of the DAPER model to the skeleton.
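The difference between Examples 3 and 4 comes down to the constraint on a single arc class, which can be sketched as a predicate over (disease, symptom) pairs. The helper `ground_arcs` and the use of a Python predicate to stand for a constraint are assumptions made for this illustration; the skeleton is the one in panel (d).

```python
diseases = ["d1", "d2", "d3"]
symptoms = ["s1", "s2", "s3"]

# Relationship set Causes from panel (d).
causes = {("d1", "s1"), ("d1", "s2"), ("d1", "s3"), ("d2", "s2"), ("d3", "s3")}


def ground_arcs(diseases, symptoms, constraint):
    """Draw an arc d.Present -> s.Present for every pair that satisfies the constraint."""
    return {
        (f"{d}.Present", f"{s}.Present")
        for d in diseases
        for s in symptoms
        if constraint(d, s)
    }


# Example 3: no restriction, so every disease is a parent of every symptom.
full = ground_arcs(diseases, symptoms, lambda d, s: True)

# Example 4: constraint Causes(d, s) keeps only the pairs in the relationship set.
restricted = ground_arcs(diseases, symptoms, lambda d, s: (d, s) in causes)

print(len(full), len(restricted))
```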

Page 18: Probabilistic  Models  for       Relational Data

Probabilistic Entity-Relationship Models

Example 5 Extending Example 3 in a different way, suppose the physician has identified both primary (major) and secondary (minor) causes of disease.

Example 6 Extending Example 3 in a different way, suppose that both diseases and symptoms have category labels — labels drawn from the same set of categories. The possible causes of a symptom are diseases that have at least one category in common with that symptom.

Page 19: Probabilistic  Models  for       Relational Data

[Figure: DAPER models with arc constraints]

(a) A DAPER model (Example 4): arc class from Disease.Present to Symptom.Present with constraint Causes(d, s).

(b) A DAPER model with a disjunctive constraint: 1°Causes(d, s) ∨ 2°Causes(d, s).

(c) A constraint containing the existential quantifier: ∃c R1(d, c) ∧ R2(s, c), where R1 relates Disease to Category and R2 relates Symptom to Category.
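The disjunctive and existential constraints of Examples 5 and 6 can be sketched as predicates over a skeleton. Everything below is illustrative: the function names, the relationship sets, and the category labels are hypothetical and do not come from the slides.

```python
def disjunctive(primary, secondary):
    """Constraint 1°Causes(d, s) OR 2°Causes(d, s)."""
    return lambda d, s: (d, s) in primary or (d, s) in secondary


def shares_category(r1, r2, categories):
    """Constraint: there exists a category c with R1(d, c) AND R2(s, c)."""
    return lambda d, s: any((d, c) in r1 and (s, c) in r2 for c in categories)


# Hypothetical skeleton data, chosen only to exercise the constraints.
primary = {("d1", "s1")}
secondary = {("d2", "s1")}
categories = ["cardiac", "renal"]
r1 = {("d1", "cardiac"), ("d2", "renal")}  # disease has category
r2 = {("s1", "cardiac"), ("s2", "renal")}  # symptom has category

either_cause = disjunctive(primary, secondary)
common_category = shares_category(r1, r2, categories)

print(either_cause("d2", "s1"), either_cause("d3", "s1"))
print(common_category("d1", "s1"), common_category("d1", "s2"))
```

In both cases an arc is drawn in the ground graph exactly when the predicate holds for the pair of objects involved.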

Page 20: Probabilistic  Models  for       Relational Data

Probabilistic Entity-Relationship Models

Restricted Relationships

A relationship class R in an ER (or PER) model is restricted when some skeletons for the entity and relationship classes of the ER model are prohibited.

graphical notation has been developed for common restrictions

an extremely useful tool for modeling with PER models

Page 21: Probabilistic  Models  for       Relational Data

Probabilistic Entity-Relationship Models

Example 7 A binary outcome O is measured on patients in multiple hospitals. Each patient is treated in exactly one hospital. It is believed that outcomes in any given hospital h are i.i.d. given the binomial parameter h.θ, and that these binomial parameters are themselves i.i.d. across hospitals given hyperparameters α.

Page 22: Probabilistic  Models  for       Relational Data

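[Figure: DAPER models for Example 7]

(a) A DAPER model: entity classes Hospital (attribute class θ) and Patient (attribute class O), relationship class In, the attribute α, and the constraint In(h, p).

(b) The ground graph obtained by applying the DAPER model to a skeleton containing m hospitals and n_i patients in hospital i: α, then h_1.θ, ..., h_m.θ, then p_11.O, ..., p_1n_1.O, ..., p_m1.O, ..., p_mn_m.O.

(c) A DAPER model equivalent to the one in (a), with the constraint written as h[θ] = h[O].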
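Written out, the independence assumptions of Example 7 give the following factorization for a skeleton with m hospitals and n_i patients in hospital i. The indexing h_i and p_ij is notation introduced here for illustration, and the form of p(h_i.θ | α) is left open because the example does not specify it.

```latex
p\bigl(\theta, O \mid \alpha\bigr)
  = \prod_{i=1}^{m} \Bigl[\, p(h_i.\theta \mid \alpha)
      \prod_{j=1}^{n_i} p(p_{ij}.O \mid h_i.\theta) \Bigr],
\qquad
p(p_{ij}.O = 1 \mid h_i.\theta) = h_i.\theta
```

The second equation reads the binomial parameter as the success probability of a single binary outcome.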

Page 23: Probabilistic  Models  for       Relational Data

Probabilistic Entity-Relationship Models

Self Relationships

Self relationships are relationships that relate like entities (and perhaps other entities as well). A self-relationship class is one that contains self relationships.

Page 24: Probabilistic  Models  for       Relational Data

Probabilistic Entity-Relationship Models

Example 9 In the university-database example (Example 2), a student’s grade in a course depends on whether an advisor of the student is a friend of a teacher of the course.

Page 25: Probabilistic  Models  for       Relational Data

[Figure: models for Example 9]

(a) ER model and (b) DAPER model: entity classes Course (Diff), Student (IQ), and Professor; relationship classes Takes (Grade), Teaches, Advises, and the self-relationship class F (attribute class Friend, constraint Full). Labels shown: c[D] = c[G], s[IQ] = s[G], Teaches(p, c), Advises(pf, s), F(p, pf).

(c) DAPER model in which the Professor entity class has been copied: Professor (Advisor) and Professor (Teacher).

θ --- an ordinary attribute θ corresponding to this uncertain distribution.

In (c) there are two instances of the Professor entity class, named “Professor (Teacher)” and “Professor (Advisor).” Note that copying allows us to annotate the role that each copy of the entity class plays in the self-relationship class. Models drawn with this copy convention are sometimes more transparent.

Page 26: Probabilistic  Models  for       Relational Data

F has one attribute class F.Friend, where the attribute F(p, pf).Friend is true if professor pf is a friend of professor p. Note that F has the Full constraint so that we can model whether any one professor is a friend of another. Also note that F(p1, p2).Friend may be true while F(p2, p1).Friend may be false.

Page 27: Probabilistic  Models  for       Relational Data

The constraint on the arc class from F.Friend to Takes.Grade is Teaches(p, c) ∧ Advises(pf, s). Thus, in any ground graph generated from this model, there is an arc from attribute F(p, pf).Friend to attribute Takes(s, c).Grade whenever a teacher of the course is p and an advisor of the student is pf, which is precisely the additional dependence described in the example.
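The effect of the conjunctive constraint can be sketched by enumerating the ground arcs it produces. The function name and the skeleton below (the professors smith, jones, and lee, and who teaches or advises whom) are hypothetical and serve only to illustrate the rule.

```python
def friend_to_grade_arcs(takes, teaches, advises):
    """Arcs F(p, pf).Friend -> Takes(s, c).Grade where Teaches(p, c) and Advises(pf, s)."""
    arcs = set()
    for s, c in takes:
        for p, taught in teaches:
            if taught != c:
                continue  # p does not teach this course
            for pf, advisee in advises:
                if advisee == s:  # pf advises this student
                    arcs.add((f"F({p},{pf}).Friend", f"Takes({s},{c}).Grade"))
    return arcs


# Hypothetical skeleton: (student, course), (professor, course), (professor, student).
takes = [("mary", "cs107"), ("john", "cs107")]
teaches = [("smith", "cs107"), ("jones", "stat10")]
advises = [("lee", "mary")]

for arc in sorted(friend_to_grade_arcs(takes, teaches, advises)):
    print(arc)
```

Here john has no advisor in the skeleton, so his grade gains no Friend parent, while mary's grade in cs107 depends on whether lee is a friend of smith.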

Page 28: Probabilistic  Models  for       Relational Data

Probabilistic Entity-Relationship Models

Probabilistic Relationships

Example 12 (Relationship existence) A database contains academic papers and citations for a subset of those papers. Using the citations we have, we model how the topics of two papers influence whether one paper cites the other.

Example 13 Modifying Example 12, we now know that the database was constructed such that it contains at most ten citations from the bibliography of any paper.

Page 29: Probabilistic  Models  for       Relational Data

[Figure: models for Examples 12 and 13]

(a) An ER model: entity class Paper, drawn as the two copies Paper (Citing) and Paper (Cited), with attribute class Topic, and relationship class Cites.

(b) A DAPER model for the situation where citations are uncertain: Cites is Full and has attribute class Exists. Labels shown: p[T] = pcg[E], p[T] = pcd[E], Cites(pcg, pcd).

(c) A DAPER model for the situation where citations are limited to ten per paper: the attribute class <=10 is added, with the label pcg[E] = p[<=10].

We are uncertain about the citations of papers whose citations have not been recorded. To model this uncertainty, we use a DAPER model in which Cites is a Full relationship class with attribute class Cites.Exists, where Cites(pcg, pcd).Exists is true when paper pcg cites paper pcd. In addition, to model how the topics of two papers influence this existence, we add the attribute class Paper.Topic and the arc classes.

Page 30: Probabilistic  Models  for       Relational Data

With respect to Figure (b), we have added a binary attribute class Paper.<=10. The double oval associated with this attribute class indicates that it expands to deterministic attributes in a ground graph. In particular, a ground graph attribute p.<=10 will have parents Cites(pcg, pcd).Exists with pcg = p, for all pcd, and will be true exactly when ten or fewer of these parents are true. To encode the restriction, we set p.<=10 to true for every p when performing inference in the ground graph.
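The deterministic attribute can be sketched as a plain function of its parents. The dictionary encoding of Cites(pcg, pcd).Exists and the name `at_most_ten` are assumptions made for this illustration, and the paper identifiers are hypothetical.

```python
def at_most_ten(paper, exists, limit=10):
    """Deterministic attribute p.<=10: true exactly when at most `limit` parents are true.

    `exists` maps (citing, cited) pairs to the truth value of Cites(citing, cited).Exists.
    """
    parents = [value for (citing, _cited), value in exists.items() if citing == paper]
    return sum(parents) <= limit


# Hypothetical ground attributes: paper "a" cites ten papers, paper "b" cites eleven.
exists = {("a", f"p{i}"): True for i in range(10)}
exists.update({("b", f"p{i}"): True for i in range(11)})
exists[("a", "b")] = False  # a recorded non-citation does not count

print(at_most_ten("a", exists), at_most_ten("b", exists))
```

Conditioning on this attribute being true for every paper is what rules out ground configurations with more than ten citations per paper.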

Page 31: Probabilistic  Models  for       Relational Data

Summary

ER model by example

definitions for the DAPER model, plate model and PRM

examine DAPER models in detail

  restricted relationships

  self relationships

  probabilistic relationships

Page 32: Probabilistic  Models  for       Relational Data