statistical learning from relational data

Statistical Learning from Relational Data Daphne Koller Stanford University Joint work with many many people


Post on 31-Jan-2016

TRANSCRIPT

Page 1: Statistical Learning  from Relational Data

Statistical Learning from Relational Data

Daphne Koller, Stanford University

Joint work with many many people

Page 2: Statistical Learning  from Relational Data

Relational Data is Everywhere

The web: webpages (& the entities they represent), hyperlinks
Social networks: people, institutions, friendship links
Biological data: genes, proteins, interactions, regulation
Bibliometrics: papers, authors, journals, citations
Corporate databases: customers, products, transactions

Page 3: Statistical Learning  from Relational Data

Relational Data is Different

Data instances are not independent: topics of linked webpages are correlated

Data instances are not identically distributed: heterogeneous instances (papers, authors)

No IID assumption

This is a good thing

Page 4: Statistical Learning  from Relational Data

New Learning Tasks

Collective classification of related instances: labeling an entire website of related webpages
Relational clustering: finding coherent clusters in the genome
Link prediction & classification: predicting when two people are likely to be friends
Pattern detection in network of related objects: finding groups (research groups, terrorist groups)

Page 5: Statistical Learning  from Relational Data

Probabilistic Models

Uncertainty model: a space of “possible worlds” and a probability distribution over this space.

Worlds: often defined via a set of state variables (medical diagnosis: diseases, symptoms, findings, …); each world is an assignment of values to variables.

Number of worlds is exponential in the number of variables: $2^n$ if we have $n$ binary variables.
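To make the exponential count concrete, here is a minimal sketch that enumerates every possible world for a handful of binary variables. The variable names (`flu`, `fever`, `cough`) are hypothetical, chosen only to echo the medical-diagnosis example; they are not from the slides.

```python
from itertools import product

def possible_worlds(variables):
    """Yield every assignment of False/True to the given binary variables."""
    for values in product([False, True], repeat=len(variables)):
        yield dict(zip(variables, values))

# Hypothetical variable names for a toy medical-diagnosis domain.
variables = ["flu", "fever", "cough"]
worlds = list(possible_worlds(variables))
num_worlds = len(worlds)  # grows as 2 ** len(variables)
```

Adding one more binary variable doubles `num_worlds`, which is why a distribution over worlds cannot be stored as an explicit table for realistic domains.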

Page 6: Statistical Learning  from Relational Data

Outline

Relational Bayesian networks*
Relational Markov networks
Collective Classification
Relational clustering

* with Avi Pfeffer, Nir Friedman, Lise Getoor

Page 7: Statistical Learning  from Relational Data

Bayesian Networks

nodes = variables
edges = direct influence

Graph structure encodes independence assumptions: Letter is conditionally independent of Intelligence given Grade

[Figure: Bayesian network over Difficulty, Intelligence, Grade, SAT, and Job, with the CPD P(G|D,I) drawn as a bar chart (0% to 100%) over grades A, B, C for each combination hard,high; hard,low; easy,high; easy,low]
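The independence claim can be checked numerically on a small network. The sketch below assumes the structure Difficulty → Grade ← Intelligence and Grade → Job (the leaf is named Job after the figure's node labels, where the slide text says Letter); every CPD number is a made-up illustration value, not taken from the slides.

```python
from itertools import product

# All CPD numbers below are made-up illustration values.
P_D = {"easy": 0.6, "hard": 0.4}
P_I = {"low": 0.7, "high": 0.3}
P_G = {  # P(Grade | Difficulty, Intelligence)
    ("easy", "low"): {"A": 0.3, "B": 0.4, "C": 0.3},
    ("easy", "high"): {"A": 0.9, "B": 0.08, "C": 0.02},
    ("hard", "low"): {"A": 0.05, "B": 0.25, "C": 0.7},
    ("hard", "high"): {"A": 0.5, "B": 0.3, "C": 0.2},
}
P_J = {  # P(Job | Grade)
    "A": {"yes": 0.9, "no": 0.1},
    "B": {"yes": 0.6, "no": 0.4},
    "C": {"yes": 0.1, "no": 0.9},
}

def joint(d, i, g, j):
    """Joint probability as the product of the local CPDs."""
    return P_D[d] * P_I[i] * P_G[(d, i)][g] * P_J[g][j]

def conditional_job(j, g, i):
    """P(Job=j | Grade=g, Intelligence=i), computed from the joint."""
    numerator = sum(joint(d, i, g, j) for d in P_D)
    denominator = sum(joint(d, i, g, jj) for d in P_D for jj in ("yes", "no"))
    return numerator / denominator

# Sanity check: the factorization defines a proper distribution.
total = sum(joint(d, i, g, j)
            for d, i, g, j in product(P_D, P_I, "ABC", ("yes", "no")))
```

Once Grade is fixed, `conditional_job` returns the same value for low and high Intelligence, which is the conditional independence the graph structure encodes.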

Page 8: Statistical Learning  from Relational Data

Bayesian Networks: Problem

Bayesian nets use a propositional representation
The real world has objects, related to each other

[Figure: the template Intelligence, Difficulty → Grade replicated for each student-course pair: Intell_Jane, Diffic_CS101 → Grade_Jane_CS101; Intell_George, Diffic_Geo101 → Grade_George_Geo101; Intell_George, Diffic_CS101 → Grade_George_CS101; grade values A and C are shown]

These “instances” are not independent

Page 9: Statistical Learning  from Relational Data

Relational Schema

Specifies types of objects in domain, attributes of each type of object & types of relations between objects

[Figure: schema diagram. Classes and their attributes: Student (Intelligence), Registration (Grade, Satisfaction), Course (Difficulty), Professor (Teaching-Ability). Relations: Teach, In, Take]

Page 10: Statistical Learning  from Relational Data

St. Nordaf University

[Figure: an example world. Prof. Smith and Prof. Jones (Teaching-ability) are linked by Teaches to the courses CS101 and Geo101 (Difficulty); the students George and Jane (Intelligence) are linked by Registered to three registrations (Grade, Satisfaction), each linked by In-course to a course]

Page 11: Statistical Learning  from Relational Data

Relational Bayesian Networks

Universals: probabilistic patterns hold for all objects in class
Locality: represent direct probabilistic dependencies; links define potential interactions

[Figure: template model over Student (Intelligence), Reg (Grade, Satisfaction), Course (Difficulty), and Professor (Teaching-Ability), with a CPD bar chart (0% to 100%) over grades A, B, C for hard,high; hard,low; easy,high; easy,low]

[K. & Pfeffer; Poole; Ngo & Haddawy]

Page 12: Statistical Learning  from Relational Data

RBN Semantics

Ground model:
variables: attributes of all objects
dependencies: determined by relational links & template model

[Figure: ground network for the example world: Teaching-ability of Prof. Smith and Prof. Jones; Difficulty of CS101 and Geo101; Intelligence of George and Jane; Grade and Satisfaction for each of three registrations]
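The ground model can be produced mechanically from the template and the relational links. The following is a minimal sketch of that unrolling for the single template dependency of Grade on Intelligence and Difficulty; the function name `ground_network` and the string encoding of ground variables are hypothetical, and the three registrations follow the student-course pairs named on the earlier slide.

```python
# Objects and links of a small world (registrations follow the earlier slide).
students = ["George", "Jane"]
courses = ["CS101", "Geo101"]
registrations = [("Jane", "CS101"), ("George", "Geo101"), ("George", "CS101")]

def ground_network(students, courses, registrations):
    """Return {ground variable: list of its parent ground variables}."""
    parents = {}
    for s in students:
        parents[f"Intelligence({s})"] = []
    for c in courses:
        parents[f"Difficulty({c})"] = []
    for s, c in registrations:
        # Template: Reg.Grade depends on the registered student's
        # Intelligence and the registered course's Difficulty.
        parents[f"Grade({s},{c})"] = [f"Intelligence({s})", f"Difficulty({c})"]
    return parents

ground = ground_network(students, courses, registrations)
```

Every `Grade(...)` variable shares the same template CPD, so the number of parameters stays fixed while the ground network grows with the number of objects.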

Page 13: Statistical Learning  from Relational Data

The Web of Influence

[Figure: beliefs about the difficulty (easy / hard) of CS101 and Geo101 and about student intelligence (low / high), drawn as 0% to 100% bars, with grade values A and C shown]

Page 14: Statistical Learning  from Relational Data

Outline

Relational Bayesian networks*
Relational Markov networks†
Collective Classification
Relational clustering

* with Avi Pfeffer, Nir Friedman, Lise Getoor

† with Ben Taskar, Pieter Abbeel

Page 15: Statistical Learning  from Relational Data

Why Undirected Models?

Symmetric, non-causal interactions
  E.g., web: categories of linked pages are correlated
  Cannot introduce directed edges because of cycles
Patterns involving multiple entities
  E.g., web: “triangle” patterns
  Directed edges not appropriate
“Solution”: impose arbitrary direction
  Not clear how to parameterize CPD for variables involved in multiple interactions
  Very difficult within a class-based parameterization

[Taskar, Abbeel, K. 2001]

Page 16: Statistical Learning  from Relational Data

Markov Networks

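[Figure: Markov network over five people: James (J), Kyle (K), Laura (L), Mary (M), Noah (N)]

$$P(J,K,L,M,N) = \frac{1}{Z}\,\phi(J,K)\,\phi(J,L)\,\phi(K,L)\,\phi(J,M)\,\phi(M,N)\,\phi(L,N)$$

[Figure: template potential, a bar chart (scale 0 to 2) over the value pairs AA, AB, AC, BA, BB, BC, CA, CB, CC]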
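A Markov network distribution of this kind can be computed by brute force on a tiny example. The sketch below uses six pairwise edges over J, K, L, M, N with a single shared template potential; the potential values (2.0 for agreeing neighbors, 0.5 otherwise) and the label set A, B, C are assumptions made for illustration.

```python
from itertools import product

people = ["J", "K", "L", "M", "N"]
edges = [("J", "K"), ("J", "L"), ("K", "L"), ("J", "M"), ("M", "N"), ("L", "N")]
labels = ["A", "B", "C"]

def phi(a, b):
    """Template potential shared by all edges (made-up values)."""
    return 2.0 if a == b else 0.5

def unnormalized(assignment):
    """Product of the edge potentials for one full assignment."""
    score = 1.0
    for u, v in edges:
        score *= phi(assignment[u], assignment[v])
    return score

assignments = [dict(zip(people, values))
               for values in product(labels, repeat=len(people))]
Z = sum(unnormalized(a) for a in assignments)  # partition function

def prob(assignment):
    return unnormalized(assignment) / Z
```

The enumeration over all 243 assignments is only feasible because the network is tiny; computing Z is the expensive step in larger models.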

Page 17: Statistical Learning  from Relational Data

Relational Markov Networks

Universals: probabilistic patterns hold for all groups of objects
Locality: represent local probabilistic dependencies; sets of links give us possible interactions

[Figure: Study Group template linking Student1 (Intelligence) and Student2 (Intelligence) through Reg1 (Grade) and Reg2 (Grade) in the same Course (Difficulty); the template potential is a bar chart (scale 0 to 2) over the grade pairs AA, AB, AC, BA, BB, BC, CA, CB, CC]

Page 18: Statistical Learning  from Relational Data

RMN Semantics

[Figure: ground Markov network: Difficulty of CS101 and Geo101; Intelligence of George, Jane, and Jill; one Grade per registration; grades within the Geo Study Group and the CS Study Group are connected by the template potential]

Page 19: Statistical Learning  from Relational Data

Outline

Relational Bayesian Networks
Relational Markov Networks
Collective Classification*
  Discriminative training
  Web page classification
  Link prediction
Relational clustering

* with Ben Taskar, Carlos Guestrin, Ming Fai Wong, Pieter Abbeel

Page 20: Statistical Learning  from Relational Data

Collective Classification

[Figure: pipeline. Model structure (probabilistic relational model over Course, Student, Reg) and training data feed Learning; the learned model and new data feed Inference, which produces conclusions]

Training data: features .x, labels .y*
New data: features ’.x, labels ’.y

Example:
Train on one year of student intelligence, course difficulty, and grades
Given only grades in following year, predict all students’ intelligence

Page 21: Statistical Learning  from Relational Data

Learning RMN Parameters

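[Figure: Study Group template over Student1 (Intelligence), Reg1 (Grade), Student2 (Intelligence), Reg2 (Grade), and Course (Difficulty); template potential over the grade pairs AA, AB, AC, BA, BB, BC, CA, CB, CC]

Parameterize potentials as log-linear model:

$$\phi(R_1.G, R_2.G) = \exp\left(w_{AA} f_{AA} + \cdots + w_{CC} f_{CC}\right)$$

$$P_w(.x) = \frac{1}{Z} \exp\left(w^T f(.x)\right)$$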
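As a small illustration of the log-linear form, the sketch below builds a grade-pair potential from one weight per pair and indicator features. The weight values are made up; in the setting of the slides they would be learned.

```python
import math

grades = ["A", "B", "C"]

# Made-up weights, one per grade pair (w_AA, w_AB, ..., w_CC).
w = {(g1, g2): (1.0 if g1 == g2 else -0.5) for g1 in grades for g2 in grades}

def feature(k, g1, g2):
    """Indicator feature f_k: 1 when the observed grade pair equals k."""
    return 1.0 if k == (g1, g2) else 0.0

def potential(g1, g2):
    """phi(g1, g2) = exp(sum over k of w_k * f_k(g1, g2))."""
    return math.exp(sum(w[k] * feature(k, g1, g2) for k in w))
```

With indicator features, exactly one term of the sum is active, so each potential entry is simply the exponential of its own weight, and any positive table can be represented this way.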

Page 22: Statistical Learning  from Relational Data

Max Likelihood Estimation

Estimation: maximize over w

$$\log P_w(.y^*, .x)$$

Classification: argmax over y

$$\log P_w({}'.y \mid {}'.x)$$

We don’t care about the joint distribution P(.x, .y), only about the conditional:

$$\log P_w(.y^* \mid .x)$$

Page 23: Statistical Learning  from Relational Data

Web KB

[Figure: Tom Mitchell (Professor), WebKB (Project), and Sean Slattery (Student), linked by Advisor-of, Project-of, and Member relations]

[Craven et al.]

Page 24: Statistical Learning  from Relational Data

Web Classification Experiments

WebKB dataset
  Four CS department websites
  Bag of words on each page
  Links between pages
  Anchor text for links

Experimental setup
  Trained on three universities
  Tested on fourth
  Repeated for all four combinations

Page 25: Statistical Learning  from Relational Data

Standard Classification

[Figure: a page containing words such as “professor”, “department”, “extract”, “information”, “computer”, “science”, “machine”, “learning”; for each Page, Category → Word1 … WordN]

Categories: faculty, course, project, student, other

Page 26: Statistical Learning  from Relational Data

Standard Classification

[Figure: for each Page, Category → Word1 … WordN, plus link words … LinkWordN taken from anchor text such as “working with Tom Mitchell …”]

[Figure: bar chart of test set error (axis 0 to 0.18) for Logistic]

4-fold CV: trained on 3 universities, tested on 4th

Discriminatively trained naïve Markov = Logistic Regression

Page 27: Statistical Learning  from Relational Data

Power of Context

Professor? Student? Post-doc?

Page 28: Statistical Learning  from Relational Data

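Collective Classification

[Figure: for each Link, the From-Page and To-Page models (Category → Word1 … WordN) are connected through their categories by a compatibility potential φ(From, To), a table over the category pairs CC, CF, CP, CS, FC, FF, FP, FS, PC, PF, PP, PS, SC, SF, SP, SS]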

Page 29: Statistical Learning  from Relational Data

Collective Classification

[Figure: From-Page and To-Page models (Category → Word1 … WordN) connected through each Link]

Classify all pages collectively, maximizing the joint label probability

[Figure: bar chart of test set error (axis 0 to 0.18) for Logistic and Links]

[Taskar, Abbeel, K., 2002]

Page 31: Statistical Learning  from Relational Data

More Complex Structure

[Figure: page model with category C and words W1 … Wn; a Faculty page with section variables S for its Students and Courses sections]

Page 32: Statistical Learning  from Relational Data

Collective Classification: Results

[Figure: bar chart of test set error (axis 0 to 0.18) for Logistic, Links, Section, and Link+Section] [Taskar, Abbeel, K., 2002]

35.4% error reduction over logistic

Page 33: Statistical Learning  from Relational Data

Max Conditional Likelihood

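Estimation: maximize over w

$$\log P_w(.y^* \mid .x)$$

Classification: argmax over y

$$\log P_w({}'.y \mid {}'.x), \quad \text{equivalently} \quad w^T f({}'.x, {}'.y)$$

where

$$P_w(.y \mid .x) = \frac{1}{Z_w(.x)} \exp\left(w^T f(.x, .y)\right)$$

$$\log P_w(.y \mid .x) = w^T f(.x, .y) - \log Z_w(.x)$$

We don’t care about the conditional distribution P(.y | .x)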
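For reference, the gradient of the conditional log-likelihood of a log-linear model has a standard closed form, which shows why each estimation step needs inference. The expectation notation is not on the slide; this is the textbook identity written with the slide's symbols w, f, .x, and .y*:

$$\nabla_w \log P_w(.y^* \mid .x) = f(.x, .y^*) - \mathbb{E}_{.y \sim P_w(\cdot \mid .x)}\left[f(.x, .y)\right]$$

The first term is the feature vector of the observed labeling; the second is the expected feature vector under the current model, which requires marginals over the whole network.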

Page 34: Statistical Learning  from Relational Data

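Max Margin Estimation

[Taskar, Guestrin, K., 2003] (see also [Collins, 2002; Hoffman 2003])

Estimation: maximize the margin $\gamma$ over $\|w\| = 1$, subject to

$$w^T f(.x, .y^*) - w^T f(.x, .y) \ge \gamma \, \Delta(.y^*, .y) \quad \text{for all } .y$$

where $\gamma$ is the margin and $\Delta(.y^*, .y)$ is the number of labeling mistakes in .y

Quadratic program
Exponentially many constraints

Classification: argmax over y of $w^T f({}'.x, {}'.y)$

What we really want: correct class labels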
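The margin constraints can be checked by brute force on a toy problem. The sketch below is an illustration only: the three-node chain, the feature function, the data, and the weight vector are all made-up assumptions, and the enumeration over all labelings is exactly the exponential cost that the structured formulation avoids.

```python
from itertools import product

labels = [0, 1]
y_star = (1, 1, 0)        # hypothetical true labeling of three linked nodes
x = (0.9, 0.2, -0.8)      # hypothetical one-dimensional node features
edges = [(0, 1), (1, 2)]

def features(x, y):
    """f(x, y) = (signed sum of node features, number of agreeing edges)."""
    node = sum(xi if yi == 1 else -xi for xi, yi in zip(x, y))
    agree = sum(1.0 for i, j in edges if y[i] == y[j])
    return (node, agree)

def score(w, x, y):
    return sum(wi * fi for wi, fi in zip(w, features(x, y)))

def hamming(y_true, y):
    """Number of labeling mistakes in y."""
    return sum(a != b for a, b in zip(y_true, y))

def margin(w, x, y_star):
    """Largest gamma with score(y*) - score(y) >= gamma * hamming(y*, y)."""
    return min(
        (score(w, x, y_star) - score(w, x, y)) / hamming(y_star, y)
        for y in product(labels, repeat=len(y_star))
        if y != y_star
    )

w = (0.8, 0.6)  # a unit-norm weight vector, chosen by hand
gamma = margin(w, x, y_star)
```

A positive `gamma` means the true labeling outscores every alternative, by an amount that scales with the number of mistakes in the alternative.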

Page 35: Statistical Learning  from Relational Data

Max Margin Markov Networks

We use structure of Markov network to provide equivalent formulation of QP
  Exponential only in tree width of network
  Complexity = max-likelihood classification

Can solve approximately in networks where induced width is too large
  Analogous to loopy belief propagation

Can use kernel-based features!
  SVMs meet graphical models

[Taskar, Guestrin, K., 2003]

Page 36: Statistical Learning  from Relational Data

WebKB Revisited

[Figure: bar chart of test error (axis 0 to 0.2) for Logistic, likelihood, and max margin]

16.1% reduction in error relative to cond. likelihood RMNs

Page 37: Statistical Learning  from Relational Data

Predicting Relationships

Even more interesting: relationships between objects

[Figure: Tom Mitchell (Professor), WebKB (Project), and Sean Slattery (Student), linked by Advisor-of, Member, and Member relations]

Page 38: Statistical Learning  from Relational Data

Predicting Relations

Introduce exists/type attribute for each potential link
Learn discriminative model for this attribute
Collectively predict its value in new world

[Figure: Relation model: From-Page and To-Page (Category → Word1 … WordN) joined by an Exists/Type variable with LinkWord1 … LinkWordN]

[Figure: bar chart (axis 0 to 30) comparing Flat and Collective]

72.9% error reduction over flat

[Taskar, Wong, Abbeel, K., 2003]

Page 39: Statistical Learning  from Relational Data

Outline

Relational Bayesian Networks
Relational Markov Networks
Collective Classification
Relational clustering
  Movie data*
  Biological data†

* with Ben Taskar, Eran Segal

† with Eran Segal, Nir Friedman, Aviv Regev, Dana Pe’er, Haidong Wang, Micha Shapira, David Botstein

Page 40: Statistical Learning  from Relational Data

Relational Clustering

[Figure: pipeline. Model structure (probabilistic relational model over Course, Student, Reg) and unlabeled relational data feed Learning, which produces a clustering of instances]

Example: given only students’ grades, cluster similar students

Page 41: Statistical Learning  from Relational Data

Learning w. Missing Data: EM

EM algorithm applies essentially unchanged
  E-step computes expected sufficient statistics, aggregated over all objects in class
  M-step uses ML (or MAP) parameter estimation

Key difference:
  In general, the hidden variables are not independent
  Computation of expected sufficient statistics requires inference over entire network
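The E-step and M-step can be sketched for the simplest case, a naive Bayes mixture over binary features, where the hidden cluster variables are independent given the parameters. This is a deliberate simplification of the relational setting described above: no network-wide inference is needed here. The function name, the add-one smoothing in the M-step, and the toy data are assumptions made for illustration.

```python
import random

def em_naive_bayes(data, k, iters=50, seed=0):
    """Soft-cluster binary vectors with a naive Bayes mixture fit by EM."""
    rng = random.Random(seed)
    n_feat = len(data[0])
    prior = [1.0 / k] * k
    # theta[c][j] = P(feature j = 1 | cluster c), randomly initialized.
    theta = [[rng.uniform(0.25, 0.75) for _ in range(n_feat)] for _ in range(k)]
    resp = []
    for _ in range(iters):
        # E-step: posterior responsibility of each cluster for each instance.
        resp = []
        for x in data:
            joint = []
            for c in range(k):
                p = prior[c]
                for j, xj in enumerate(x):
                    p *= theta[c][j] if xj else 1.0 - theta[c][j]
                joint.append(p)
            total = sum(joint)
            resp.append([p / total for p in joint])
        # M-step: re-estimate parameters from expected counts
        # (add-one smoothing keeps probabilities away from 0 and 1).
        for c in range(k):
            weight = sum(r[c] for r in resp)
            prior[c] = weight / len(data)
            for j in range(n_feat):
                on = sum(r[c] for r, x in zip(resp, data) if x[j])
                theta[c][j] = (on + 1.0) / (weight + 2.0)
    return prior, theta, resp

# Toy data: two clearly separated groups of binary profiles.
data = [(1, 1, 0, 0)] * 5 + [(0, 0, 1, 1)] * 5
prior, theta, resp = em_naive_bayes(data, k=2)
hard = [max(range(2), key=lambda c: r[c]) for r in resp]
```

In the relational case the responsibilities in the E-step would come from inference over the whole ground network rather than from a per-instance computation.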

Page 42: Statistical Learning  from Relational Data

Learning w. Missing Data: EM

[Figure: EM on the university example: repeated bar charts (0% to 100%) of P(Registration.Grade | Course.Difficulty, Student.Intelligence) over grades A, B, C for hard,high; hard,low; easy,high; easy,low; courses are easy / hard and students are low / high]

[Dempster et al. 77]

Page 43: Statistical Learning  from Relational Data

Movie Data

Internet Movie Database
http://www.imdb.com

Page 44: Statistical Learning  from Relational Data

Discovering Hidden Types

[Figure: schema with Actor, Director, and Movie (Genres, Rating, Year, #Votes, MPAA Rating), each with a hidden Type attribute]

Learn model using EM

[Taskar, Segal, K., 2001]

Page 45: Statistical Learning  from Relational Data

Discovering Hidden Types

Directors
  Steven Spielberg, Tim Burton, Tony Scott, James Cameron, John McTiernan, Joel Schumacher
  Alfred Hitchcock, Stanley Kubrick, David Lean, Milos Forman, Terry Gilliam, Francis Coppola

Actors
  Anthony Hopkins, Robert De Niro, Tommy Lee Jones, Harvey Keitel, Morgan Freeman, Gary Oldman
  Sylvester Stallone, Bruce Willis, Harrison Ford, Steven Seagal, Kurt Russell, Kevin Costner, Jean-Claude Van Damme, Arnold Schwarzenegger

Movies
  Wizard of Oz, Cinderella, Sound of Music, The Love Bug, Pollyanna, The Parent Trap, Mary Poppins, Swiss Family Robinson
  Terminator 2, Batman, Batman Forever, GoldenEye, Starship Troopers, Mission: Impossible, Hunt for Red October

[Taskar, Segal, K., 2001]

Page 46: Statistical Learning  from Relational Data

Biology 101: Gene Expression

[Figure: DNA with Gene 1 and Gene 2, each with a control region and a coding region; DNA → RNA → Protein; Swi5 transcription factor]

Cells express different subsets of their genes in different tissues and under different conditions

Page 47: Statistical Learning  from Relational Data

Gene Expression Microarrays

Measure mRNA level for all genes in one condition
Hundreds of experiments
Highly noisy

[Figure: expression matrix, genes by experiments; each entry is the expression of gene i in experiment j, shown as induced or repressed]

Page 48: Statistical Learning  from Relational Data

Standard Analysis

Cluster genes by similarity of expression profiles
Manually examine clusters to understand what’s common to genes in cluster

[Figure: clustering]

Page 49: Statistical Learning  from Relational Data

General Approach

Expression level is a function of gene properties and experiment properties
Learn model that best explains the data
• Observed properties: gene sequence, array condition, …
• Hidden properties: gene cluster
• Assignment to hidden variables (e.g., module assignment)
• Expression level as function of properties

[Figure: schema with Gene (Attributes: properties of gene i), Experiment (Attributes: properties of experiment j), and Expression (Level: expression level of gene i in experiment j)]

Page 50: Statistical Learning  from Relational Data

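Clustering as a PRM

[Figure: schema with Gene (Cluster), Experiment (ID), and Expression (Level); CPD P(Ei.L | g.C) tabulated for cluster values g.C = 1, 2, 3; unrolled for one gene this is naïve Bayes: g.C → g.E1, g.E2, …, g.Ek with CPD 1, CPD 2, …, CPD k]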

Page 51: Statistical Learning  from Relational Data

Modular Regulation

Learn functional modules: clusters of genes that are similarly controlled
Learn control program for modules: expression as function of control genes

[Figure: regulation tree that tests HAP4 and CMK1 (true / false)]

Page 52: Statistical Learning  from Relational Data

Module Network PRM

[Figure: schema with Gene (Cluster), Experiment (Control1, Control2, …, Controlk: activity level of control gene in experiment), and Expression (Level); regulation trees with true / false branches for Cluster 1 and Cluster 2 over control genes HAP4, CMK1, BMH1, Yer184c, GIC2, USV1, FAR1, and APG1]

[Segal, Regev, Pe’er, Koller, Friedman, 2003]

Page 53: Statistical Learning  from Relational Data

Experimental Results

Yeast Stress Data (Gasch et al.)
  2355 genes that showed activity
  173 experiments (microarrays): diverse environmental stress conditions (e.g. heat shock)

Learned module network with 50 modules:
  Cluster assignments are hidden variables
  Structure of dependency trees unknown
  Learned model using structural EM algorithm

Segal et al., Nature Genetics, 2003

Page 54: Statistical Learning  from Relational Data

Biological Evaluation

Find sets of co-regulated genes (regulatory module)

Find the regulators of each module

[Segal et al., Nature Genetics, 2003]

46/50

30/50

Page 55: Statistical Learning  from Relational Data

Experimental Results

Hypothesis: Regulator ‘X’ regulates process ‘Y’
Experiment: Knock out ‘X’ and rerun the experiment

[Figure: regulation tree over HAP4 and CMK1 (true / false), with the regulator in question marked “X?”]

[Segal et al., Nature Genetics, 2003]

Page 56: Statistical Learning  from Relational Data

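Differentially Expressed Genes

[Figure: expression time courses, wild type (wt) vs. knockout strain:
  Ypl230w: time points 0, 3, 5, 7, 9, 24 and 0, 2, 5, 7, 9, 24 hrs.; scale >16x; 341 differentially expressed genes
  Ppt1: time points 0, 7, 15, 30, 60 min. for each strain; scale >4x; 602 differentially expressed genes
  Kin82: time points 0, 5, 15, 30, 60 min. for each strain; scale >4x; 281 differentially expressed genes]

[Segal et al., Nature Genetics, 2003]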

Page 57: Statistical Learning  from Relational Data

Biological Experiments Validation

Were the differentially expressed genes predicted as targets?
Rank modules by enrichment for diff. expressed genes

Ppt1
#   Module                                 Significance
14  Ribosomal and phosphate metabolism     8/32, 9e-3
11  Amino acid and purine metabolism       11/53, 1e-2
15  mRNA, rRNA and tRNA processing         9/43, 2e-2
39  Protein folding                        6/23, 2e-2
30  Cell cycle                             7/30, 2e-2

Ypl230w
#   Module                                 Significance
39  Protein folding                        7/23, 1e-4
29  Cell differentiation                   6/41, 2e-2
5   Glycolysis and folding                 5/37, 4e-2
34  Mitochondrial and protein fate         5/37, 4e-2

Kin82
#   Module                                 Significance
3   Energy and osmotic stress I            8/31, 1e-4
2   Energy, osmolarity & cAMP signaling    9/64, 6e-3
15  mRNA, rRNA and tRNA processing         6/43, 2e-2

All regulators regulate predicted modules

[Segal et al., Nature Genetics, 2003]

Page 58: Statistical Learning  from Relational Data

Biology 102: Pathways

Pathways are sets of genes that act together to achieve a common function

Page 61: Statistical Learning  from Relational Data

Finding Pathways: Attempt I

Use protein-protein interaction data

Problems:
  Data is very noisy
  Structure is lost: large connected component in interaction graph (3527/3589 genes)
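The "large connected component" observation is the kind of thing a breadth-first search reveals. Below is a minimal sketch on a hypothetical toy graph; the gene names `g1` to `g6` and the edge list are placeholders, not data from the talk.

```python
from collections import deque

def connected_components(nodes, edges):
    """Return the connected components of an undirected graph as a list of sets."""
    adjacency = {n: set() for n in nodes}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        # Breadth-first search from an unvisited node collects one component.
        component, queue = {start}, deque([start])
        seen.add(start)
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if v not in seen:
                    seen.add(v)
                    component.add(v)
                    queue.append(v)
        components.append(component)
    return components

# Hypothetical toy interaction graph; g1..g6 are placeholder gene names.
genes = ["g1", "g2", "g3", "g4", "g5", "g6"]
interactions = [("g1", "g2"), ("g2", "g3"), ("g3", "g4"), ("g1", "g4")]
components = connected_components(genes, interactions)
largest = max(components, key=len)
```

When nearly every gene falls into `largest`, as reported for the interaction data, connectivity alone says little about which genes form a pathway.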

Page 62: Statistical Learning  from Relational Data

Finding Pathways: Attempt II

Use expression microarray clusters

[Figure: expression clusters labeled Pathway I and Pathway II]

Problems:
  Expression is only ‘weak’ indicator of interaction
  Interacting pathways are not separable

Page 63: Statistical Learning  from Relational Data

Finding Pathways: Our Approach

Use both types of data to find pathways
  Find “active” interactions using gene expression
  Find pathway-related co-expression using interactions

[Figure: Pathway I, Pathway II, Pathway III, Pathway IV]

[Segal, Wang, K., 2003]

Page 64: Statistical Learning  from Relational Data

Probabilistic Model

[Figure: Gene1 and Gene2, each with a Pathway variable and expression levels Exp1 … ExpN (expression level in N arrays); an Interacts link (protein product interaction) connects the two Pathway variables through a compatibility potential φ(g1.C, g2.C), tabulated over pathway values 1, 2, 3 for g1.C and g2.C]

Cluster all genes collectively, maximizing the joint model likelihood

[Segal, Wang, K., 2003]

Page 65: Statistical Learning  from Relational Data

Capturing Protein Complexes

Independent data set of interacting proteins

[Figure: number of complexes (0 to 400) vs. complex coverage (0% to 100%) for our method and for standard expression clustering]

124 complexes covered at 50% for our method
46 complexes covered at 50% for clustering

[Segal, Wang, K., 2003]

Page 66: Statistical Learning  from Relational Data

RNAse Complex Pathway

[Figure: interaction network among YHR081W, RRP40, RRP42, MTR3, RRP45, RRP4, RRP43, DIS3, TRM7, SKI6, RRP46, CSL4]

Includes all 10 known pathway genes
Only 5 genes found by clustering

[Segal, Wang, K., 2003]

Page 67: Statistical Learning  from Relational Data

Interaction Clustering

RNAse complex found by interaction clustering as part of cluster with 138 genes

[Segal, Wang, K., 2003]

Page 68: Statistical Learning  from Relational Data

Truth in Advertising

Huge graphical models:
  3000-50,000 hidden variables
  Hundreds of thousands of observed nodes
  Very densely connected

Learning:
  Multiple iterations of model updates
  Each requires running inference on the model

Inference:
  Exact inference is intractable
  Use belief propagation
  Single inference iteration: 1-6 hours
  Algorithmic ideas key to scaling
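For readers unfamiliar with belief propagation, here is a minimal sum-product sketch on a pairwise Markov network, with a brute-force routine for comparison. It is a generic textbook version, not the implementation used in the talk; the function names, the symmetric edge potential, and the three-node chain are assumptions. On a tree the result is exact; on graphs with cycles the same updates give only an approximation.

```python
import math
from itertools import product

def loopy_bp(nodes, edges, node_pot, edge_pot, n_states, iters=50):
    """Sum-product BP with synchronous updates; returns approximate marginals."""
    neighbors = {n: [] for n in nodes}
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    # msgs[(u, v)][s] = message from u to v about v taking state s.
    msgs = {(u, v): [1.0 / n_states] * n_states
            for u in nodes for v in neighbors[u]}
    for _ in range(iters):
        new = {}
        for (u, v) in msgs:
            out = []
            for sv in range(n_states):
                total = 0.0
                for su in range(n_states):
                    incoming = 1.0
                    for t in neighbors[u]:
                        if t != v:
                            incoming *= msgs[(t, u)][su]
                    total += node_pot[u][su] * edge_pot(su, sv) * incoming
                out.append(total)
            z = sum(out)
            new[(u, v)] = [o / z for o in out]  # normalize for stability
        msgs = new
    beliefs = {}
    for n in nodes:
        b = [node_pot[n][s] * math.prod(msgs[(t, n)][s] for t in neighbors[n])
             for s in range(n_states)]
        z = sum(b)
        beliefs[n] = [value / z for value in b]
    return beliefs

def exact_marginals(nodes, edges, node_pot, edge_pot, n_states):
    """Brute-force marginals by enumerating every joint assignment."""
    marg = {n: [0.0] * n_states for n in nodes}
    for states in product(range(n_states), repeat=len(nodes)):
        a = dict(zip(nodes, states))
        p = math.prod(node_pot[n][a[n]] for n in nodes)
        p *= math.prod(edge_pot(a[u], a[v]) for u, v in edges)
        for n in nodes:
            marg[n][a[n]] += p
    return {n: [value / sum(m) for value in m] for n, m in marg.items()}

def edge_pot(s, t):
    """Made-up symmetric potential that favors agreeing neighbors."""
    return 2.0 if s == t else 1.0

# Toy chain a - b - c with made-up node potentials.
nodes = ["a", "b", "c"]
edges = [("a", "b"), ("b", "c")]
node_pot = {"a": [0.9, 0.1], "b": [0.5, 0.5], "c": [0.3, 0.7]}
bp = loopy_bp(nodes, edges, node_pot, edge_pot, n_states=2)
exact = exact_marginals(nodes, edges, node_pot, edge_pot, n_states=2)
```

Each message update touches only a node and its neighbors, which is what makes the method usable when the enumeration in `exact_marginals` is out of reach.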

Page 69: Statistical Learning  from Relational Data

Relational Data: A New Challenge

Data consists of different types of instances

Instances are related in complex networks

Instances are not independent

New tasks for machine learning
  Collective classification
  Relational clustering
  Link prediction
  Group detection

Opportunity

Page 70: Statistical Learning  from Relational Data

http://robotics.stanford.edu/~koller/