probabilistic models of structured data - cs.stanford.edu · stanford university. bayesian networks...

58
Probabilistic Models of Structured Data Daphne Koller Stanford University

Upload: others

Post on 24-Aug-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Probabilistic Models of Structured Data - cs.stanford.edu · Stanford University. Bayesian Networks nodes = variables edges = direct influence ... Intell_George Diffic_CS101 A Grade_George_CSC

Probabilistic Models of Structured Data

Daphne KollerStanford University

Page 2: Probabilistic Models of Structured Data - cs.stanford.edu · Stanford University. Bayesian Networks nodes = variables edges = direct influence ... Intell_George Diffic_CS101 A Grade_George_CSC

Bayesian Networks

nodes = variablesedges = direct influence

Graph structure encodes independence assumptions:Job conditionally independent of Intelligence given Grade

0% 50% 100%

hard,highhard,low

easy,higheasy,low

A B CCPD P(G|D,I)

Job

Grade

SAT

IntelligenceDifficulty

Page 3: Probabilistic Models of Structured Data - cs.stanford.edu · Stanford University. Bayesian Networks nodes = variables edges = direct influence ... Intell_George Diffic_CS101 A Grade_George_CSC

Markov Networks

Laura

Noah

Mary

James

N)(L,N)(M,M)(J,L)(K,L)(J,K)(J,Z

N)M,L,K,P(J,

φφφφφφ

1=

Kyle

0 0.5 1 1.5 2

AAABACBABBBCCACBCC

Template potential

Page 4: Probabilistic Models of Structured Data - cs.stanford.edu · Stanford University. Bayesian Networks nodes = variables edges = direct influence ... Intell_George Diffic_CS101 A Grade_George_CSC

The World is Richly Structured

The webWebpages (& the entities they represent), hyperlinks

Biological dataGenes, proteins, interactions, regulation

Physical environmentsPeople, rooms, objects

Natural language

Page 5: Probabilistic Models of Structured Data - cs.stanford.edu · Stanford University. Bayesian Networks nodes = variables edges = direct influence ... Intell_George Diffic_CS101 A Grade_George_CSC

Problem

Bayesian/Markov nets use attribute representationReal world has objects, related to each other

Intelligence Difficulty

Grade

Intell_Jane Diffic_CS101

Grade_Jane_CS101

Intell_George Diffic_Geo101

Grade_George_Geo101

Intell_George Diffic_CS101

Grade_George_CS101A C

These “instances” are not independent

Page 6: Probabilistic Models of Structured Data - cs.stanford.edu · Stanford University. Bayesian Networks nodes = variables edges = direct influence ... Intell_George Diffic_CS101 A Grade_George_CSC

Welcome to

CS101

low / high

Holistic Reasoning

0% 50% 100%0% 50% 100%

Welcome to

Geo101 A

C

low high

0% 50% 100%

easy / hard

Page 7: Probabilistic Models of Structured Data - cs.stanford.edu · Stanford University. Bayesian Networks nodes = variables edges = direct influence ... Intell_George Diffic_CS101 A Grade_George_CSC

Main Application Domains

Symbolic understanding of the physical world

Understanding and reconstructing cellular processes from genomic data

Page 8: Probabilistic Models of Structured Data - cs.stanford.edu · Stanford University. Bayesian Networks nodes = variables edges = direct influence ... Intell_George Diffic_CS101 A Grade_George_CSC

Long-Term Goal: Scene Understanding

“man wearing a backpack,

smoking a cigarette, walking a dog”

Man

Dog

Backpack

Cigarette

“A cow walking through the grass

on a pasture by the sea”

Page 9: Probabilistic Models of Structured Data - cs.stanford.edu · Stanford University. Bayesian Networks nodes = variables edges = direct influence ... Intell_George Diffic_CS101 A Grade_George_CSC

Object Detection

Page 10: Probabilistic Models of Structured Data - cs.stanford.edu · Stanford University. Bayesian Networks nodes = variables edges = direct influence ... Intell_George Diffic_CS101 A Grade_George_CSC

Basic Object Detection

car person motorcycle

Page 11: Probabilistic Models of Structured Data - cs.stanford.edu · Stanford University. Bayesian Networks nodes = variables edges = direct influence ... Intell_George Diffic_CS101 A Grade_George_CSC

Scene Segmentation

‘grass’, ‘road’, ‘tree’, ‘sky’, ‘water’, ‘building’, ‘foreground’

Page 12: Probabilistic Models of Structured Data - cs.stanford.edu · Stanford University. Bayesian Networks nodes = variables edges = direct influence ... Intell_George Diffic_CS101 A Grade_George_CSC

Segmentation CRF

Singleton energy:

Mean R,G,BMean H,U,VTexture Responses…

Pairwise energy:

Delta R,G,BOffset Vector…

∏ ∏∝i ji jijiiiN ISSISSSP

, ,,...,1 ):,():()( φφ

grass, road, tree, water, building, sky, foreground

Page 13: Probabilistic Models of Structured Data - cs.stanford.edu · Stanford University. Bayesian Networks nodes = variables edges = direct influence ... Intell_George Diffic_CS101 A Grade_George_CSC

Hierarchical Scene Model

pixels

regionsPi

objectsRk

On

Sk , Gk

Variablesαi: pixel appearancePi: pixel-to-region correspondenceAk: region appearanceSk: region semantic classGk: region geometryRk: region-to-object correspondenceOn: object classvhz: location of horizon

[Gould, Gao, Koller submitted]

Energy FunctionE(P, R, S, G, A, O, vhz, K | I, θ)

Page 14: Probabilistic Models of Structured Data - cs.stanford.edu · Stanford University. Bayesian Networks nodes = variables edges = direct influence ... Intell_George Diffic_CS101 A Grade_George_CSC

Results: 21-class MSRCValidate against state-of-the-art approachesRegion/pixel class onlyGround truth labels are approximateNo geometry information

21 CLASS Mean

Shotton et al. 72.2

Gould et al. 76.5

Pixelwise 75.3

Region-based 75.4

hand labeled image pixelwise region-based

[Gould, Fulton, Koller, ICCV 2009]

Page 15: Probabilistic Models of Structured Data - cs.stanford.edu · Stanford University. Bayesian Networks nodes = variables edges = direct influence ... Intell_George Diffic_CS101 A Grade_George_CSC

High Quality Dataset

MSRC dataset is limitedpoorly labeled boundariesmany missing pixels (void)no geometry information

Collected images from MSRC, Hoiem et al., Pascal VOC

715 outdoor scenes with high-quality labels

region boundariesregion class and geometryhorizon

[Gould, Fulton, Koller, ICCV 2009]

Page 16: Probabilistic Models of Structured Data - cs.stanford.edu · Stanford University. Bayesian Networks nodes = variables edges = direct influence ... Intell_George Diffic_CS101 A Grade_George_CSC

Amazon Mechanical Turk (AMT)

$0.10 per task (regions, classes, surface types)5-10 minutes per task24-48 hour turn-around time (for 715 images)Less than 10% of tasks needed rework

Total cost for labels: under $250 (includes $40 textbook on Adobe Flash)Saving Steve from having to label images: priceless.

[Gould, Fulton, Koller, ICCV 2009]

Page 17: Probabilistic Models of Structured Data - cs.stanford.edu · Stanford University. Bayesian Networks nodes = variables edges = direct influence ... Intell_George Diffic_CS101 A Grade_George_CSC

AMT: Label Quality

You don’t always getwhat you want

Comparison with MSRC labels

Typical quality (hand labeled)

[Gould, Fulton, Koller, ICCV 2009]

Page 18: Probabilistic Models of Structured Data - cs.stanford.edu · Stanford University. Bayesian Networks nodes = variables edges = direct influence ... Intell_George Diffic_CS101 A Grade_George_CSC

Example Results

[Gould, Fulton, Koller, ICCV 2009]

Page 19: Probabilistic Models of Structured Data - cs.stanford.edu · Stanford University. Bayesian Networks nodes = variables edges = direct influence ... Intell_George Diffic_CS101 A Grade_George_CSC

More Example Results

[Gould, Fulton, Koller, ICCV 2009]

Page 20: Probabilistic Models of Structured Data - cs.stanford.edu · Stanford University. Bayesian Networks nodes = variables edges = direct influence ... Intell_George Diffic_CS101 A Grade_George_CSC

Application: 3d Reconstruction

ground plane

camera

h

image plane

image

horizon

Estimate camera tilt from location of horizonPredict region 3D position using ray projected through camera plane

[Gould, Fulton, Koller, ICCV 2009]

Page 21: Probabilistic Models of Structured Data - cs.stanford.edu · Stanford University. Bayesian Networks nodes = variables edges = direct influence ... Intell_George Diffic_CS101 A Grade_George_CSC

Example 3D Reconstructions

Page 22: Probabilistic Models of Structured Data - cs.stanford.edu · Stanford University. Bayesian Networks nodes = variables edges = direct influence ... Intell_George Diffic_CS101 A Grade_George_CSC

Object Detection Examples

Page 23: Probabilistic Models of Structured Data - cs.stanford.edu · Stanford University. Bayesian Networks nodes = variables edges = direct influence ... Intell_George Diffic_CS101 A Grade_George_CSC

Sliding-Window Failures

Sliding-window detector top results

Our region-based object detector results

Page 24: Probabilistic Models of Structured Data - cs.stanford.edu · Stanford University. Bayesian Networks nodes = variables edges = direct influence ... Intell_George Diffic_CS101 A Grade_George_CSC

Object Detection

car† pedestrian† cow*

* run on subset of 21-class MSRC datas† run on Street Scene dataset

[Gould, Gao, Koller submitted]

improved precision by only considering regions in

context

With correct regions we’d get near perfect detection, but region model still has some way

to go

Page 25: Probabilistic Models of Structured Data - cs.stanford.edu · Stanford University. Bayesian Networks nodes = variables edges = direct influence ... Intell_George Diffic_CS101 A Grade_George_CSC

Shape-Based Segmentation

Set of shape landmarksShape defined by connecting piecewise-linear contourSemantic outlining = assignment L of landmarks to pixels

[Heitz, Elidan, Packer, Koller, NIPS-08b]

Page 26: Probabilistic Models of Structured Data - cs.stanford.edu · Stanford University. Bayesian Networks nodes = variables edges = direct influence ... Intell_George Diffic_CS101 A Grade_George_CSC

{ }

{ }∏

∏Σ

=Σ=

iiii

jijijiji

IlFw

IllFwLPZ

wIOLP

);(exp

);,(exp),;(1),,,,1|(

,,,SHAPE μ

μ

The LOOPS Model

shapedistribution

state of-the-art landmark detectors

pairwise image features

L – assignment of modellandmarks to pixels

Outlining = MAP Inference over L[Heitz, Elidan, Packer, Koller, NIPS-08b]

Page 27: Probabilistic Models of Structured Data - cs.stanford.edu · Stanford University. Bayesian Networks nodes = variables edges = direct influence ... Intell_George Diffic_CS101 A Grade_George_CSC

Learning the Shape ModelProblem:

With few instances, learnedmodels aren’t robust

MEAN

PrincipalComponents

Training Set:

std +1std -1

std +1

std -1

MEAN

[UAI 2008]

Page 28: Probabilistic Models of Structured Data - cs.stanford.edu · Stanford University. Bayesian Networks nodes = variables edges = direct influence ... Intell_George Diffic_CS101 A Grade_George_CSC

Fdata: Encourage parameters to explain data

Undirected Probabilistic Model

θroot

Fdata Fdata

Divergence

β: high

θRhino

β: low

Divergence: Encourage parameters to be similar to parents

Divergence

[UAI 2008]

θElephant

Page 29: Probabilistic Models of Structured Data - cs.stanford.edu · Stanford University. Bayesian Networks nodes = variables edges = direct influence ... Intell_George Diffic_CS101 A Grade_George_CSC

Degrees of Transfer

Not all parameters deserve equal sharing

[UAI 2008]

Page 30: Probabilistic Models of Structured Data - cs.stanford.edu · Stanford University. Bayesian Networks nodes = variables edges = direct influence ... Intell_George Diffic_CS101 A Grade_George_CSC

Do Degrees of Transfer Help?

6 10 20 30-15

-10

-5

0

5

10

15

Total Number of Training Instances

Del

ta lo

g-l

oss

/ in

stan

ce

Mammal Pairs

Bison-RhinoElephant-BisonElephant-RhinoGiraffe-BisonGiraffe-ElephantGiraffe-RhinoLlama-BisonLlama-ElephantLlama-GiraffeLlama-Rhino

Regularized Max Likelihood

Bison-Rhino

Hyperprior

[UAI 2008]

Page 31: Probabilistic Models of Structured Data - cs.stanford.edu · Stanford University. Bayesian Networks nodes = variables edges = direct influence ... Intell_George Diffic_CS101 A Grade_George_CSC

Degrees of Transfer

1/λStronger transfer Weaker transfer

θroot

0 5 10 15 20 25 30 35 40 45 500

2

4

6

8

10

12

14

16

18

20

Distribution of DOT coefficients using Hyperprior

[UAI 2008]

Page 32: Probabilistic Models of Structured Data - cs.stanford.edu · Stanford University. Bayesian Networks nodes = variables edges = direct influence ... Intell_George Diffic_CS101 A Grade_George_CSC

Outlining Results: Mammals

[Heitz, Elidan, Packer, Koller, NIPS-08b]

Page 33: Probabilistic Models of Structured Data - cs.stanford.edu · Stanford University. Bayesian Networks nodes = variables edges = direct influence ... Intell_George Diffic_CS101 A Grade_George_CSC

LOOPSOutlining: Comparison

OBJ CUT [Kumar et al., CVPR 05]

kAS [Ferrari et al., CVPR 07]

[Heitz, Elidan, Packer, Koller, NIPS-08b]

Page 34: Probabilistic Models of Structured Data - cs.stanford.edu · Stanford University. Bayesian Networks nodes = variables edges = direct influence ... Intell_George Diffic_CS101 A Grade_George_CSC

LOOPS Approach

Descriptive Classification

LOOPS TRAINING DATA

… LOOPS

“UP”

Standard Approach

TRAINING DATA

(X1,-)

…Standard

Feature-based

Classifier

H(x) = +0.6“UP”

(X2,+) (XN,-)

X1 X2 XN (1,-) (2,+)

CLASSIFICATION TRAINING DATA2 << N

Page 35: Probabilistic Models of Structured Data - cs.stanford.edu · Stanford University. Bayesian Networks nodes = variables edges = direct influence ... Intell_George Diffic_CS101 A Grade_George_CSC

Mammals

Page 36: Probabilistic Models of Structured Data - cs.stanford.edu · Stanford University. Bayesian Networks nodes = variables edges = direct influence ... Intell_George Diffic_CS101 A Grade_George_CSC

Gene Regulatory Networks

http://en.wikipedia.org/wiki/Gene_regulatory_network

Controlled by diverse mechanismsModified by endogenous and

exogenous perturbations

Page 37: Probabilistic Models of Structured Data - cs.stanford.edu · Stanford University. Bayesian Networks nodes = variables edges = direct influence ... Intell_George Diffic_CS101 A Grade_George_CSC

Goals

Infer regulatory network and mechanisms that control gene expression

Identify effect of perturbations on network

Understand effect of gene regulation on phenotype

Page 38: Probabilistic Models of Structured Data - cs.stanford.edu · Stanford University. Bayesian Networks nodes = variables edges = direct influence ... Intell_George Diffic_CS101 A Grade_George_CSC

Regulatory Network ImRNA level of regulator can indicate its activity levelTarget expression is predicted by expression of its regulatorsUse expression of regulatory genes as regulators

PHO5PHM6

SPL2

PHO3PHO84

VTC3GIT1

PHO2

TEC1

GPA1

ECM18

UTH1MEC3

MFA1

SAS5SEC59

SGS1

PHO4

ASG7

RIM15

HAP1

PHO2

GPA1MFA1

SAS5PHO4

RIM15

Segal et al., Nature Genetics 2003; Lee et al., PNAS 2006

Transcription factors, signal transduction proteins, mRNA processing factors, …

Page 39: Probabilistic Models of Structured Data - cs.stanford.edu · Stanford University. Bayesian Networks nodes = variables edges = direct influence ... Intell_George Diffic_CS101 A Grade_George_CSC

Co-regulated genes have similar regulation program Exploit modularity and predict expression of entire moduleAllows uncovering complex regulatory programs

Regulatory Network II

module

PHO5PHM6

SPL2

PHO3PHO84

VTC3GIT1

PHO2

TEC1

GPA1

ECM18

UTH1MEC3

MFA1

SAS5SEC59

SGS1

PHO4

ASG7

RIM15

HAP1

PHO2

GPA1MFA1

SAS5PHO4

RIM15

Targets

Segal et al., Nature Genetics 2003; Lee et al., PNAS 2006

“Regulatory Program” ?

Page 40: Probabilistic Models of Structured Data - cs.stanford.edu · Stanford University. Bayesian Networks nodes = variables edges = direct influence ... Intell_George Diffic_CS101 A Grade_George_CSC

Regulation as Linear Regressionminimizew (Σwixi - ETargets)2

But we often have hundreds or thousands of regulators… and linear regression gives them all nonzero weight!

xN…x1 x2

w1w2 wN

ETargets

parametersw1

w2 wN

ETargets= w1 x1+…+wN xN+ε

Problem: This objective learns too many regulators

=

MFA1

Module

GPA1-3 x+

0.5 x

Page 41: Probabilistic Models of Structured Data - cs.stanford.edu · Stanford University. Bayesian Networks nodes = variables edges = direct influence ... Intell_George Diffic_CS101 A Grade_George_CSC

Lasso* (L1) Regression

minimizew (w1x1 + … wNxN - ETargets)2+ Σ C |wi|

Induces sparsity in the solution w (many wi‘s set to zero)Provably selects “right” features when many features are irrelevant

Convex optimization problemUnique global optimumEfficient optimization

But, arbitrary choice among correlated regulators

xN…x1 x2

w1w2 wN

ETargets

parametersw1

w2

x1 x2

* Tibshirani, 1996

L2 L1

Page 42: Probabilistic Models of Structured Data - cs.stanford.edu · Stanford University. Bayesian Networks nodes = variables edges = direct influence ... Intell_George Diffic_CS101 A Grade_George_CSC

Elastic Net* Regression

minimizew (w1x1 + … wNxN - ETargets)2+ Σ C |wi| + Σ D wi2

Induces sparsityBut avoids arbitrary choices among relevant features

Convex optimization problemUnique global optimumEfficient optimization algorithms

Lee et al., PLOS Genetics 2009 * Zhou & Hastie, 200

xN…x1 x2

w1w2 wN

ETargets

w1w2

x1 x2 L2 L1

Page 43: Probabilistic Models of Structured Data - cs.stanford.edu · Stanford University. Bayesian Networks nodes = variables edges = direct influence ... Intell_George Diffic_CS101 A Grade_George_CSC

Cluster genes into modulesLearn a regulatory program for each module

Learning Regulatory Network

PHO5PHM6

SPL2

PHO3PHO84

VTC3GIT1

PHO2

TEC1

GPA1

ECM18

UTH1MEC3

MFA1

SAS5SEC59

SGS1

PHO4

ASG7

RIM15

HAP1

PHO2

GPA1MFA1

SAS5PHO4

RIM15

Lee et al., PLoS Genet 200

=

MFA1

Module

GPA1-3 x+

0.5 x

This is a Bayesian network• But multiple genes share same program• Dependency model is linear regression

Page 44: Probabilistic Models of Structured Data - cs.stanford.edu · Stanford University. Bayesian Networks nodes = variables edges = direct influence ... Intell_George Diffic_CS101 A Grade_George_CSC

Genotype → phenotype

…ACTCGGTTGGCCTAAATTCGGCCCGG…

…ACCCGGTAGGCCTTAATTCGGCCCGG…

:

…ACTCGGTAGGCCTATATTCGGCCGGG…

Different sequences

Different phenotypes

??

Perturbations toregulatory network

Page 45: Probabilistic Models of Structured Data - cs.stanford.edu · Stanford University. Bayesian Networks nodes = variables edges = direct influence ... Intell_George Diffic_CS101 A Grade_George_CSC

eQTL Data [Brem et al. (2002) Science]

BY

×

RM

:

112 progeny

Mar

kers

Expression data

112 individuals

6000

gen

es

0101100100…0111011110100…0010010110000…010

:

0000010100…101

0010000000…100

Genotype data

3000

mar

kers

112 individuals

×

Page 46: Probabilistic Models of Structured Data - cs.stanford.edu · Stanford University. Bayesian Networks nodes = variables edges = direct influence ... Intell_George Diffic_CS101 A Grade_George_CSC

LirNet Regulatory networkE-regulators: Activity (expression) of regulatory genesG-regulators: Genotype of genes

Measured as values of chromosomal markers

Lee et al., PNAS 2006; Lee et al., PLoS Genetics 2009

PHO5PHM6

SPL2

PHO3PHO84

VTC3GIT1

PHO2

TEC1

GPA1

ECM18

UTH1MEC3

MFA1

SAS5SEC59

SGS1

PHO4

ASG7

RIM15

HAP1

PHO2

GPA1MFA1

SAS5PHO4

RIM15

M1

M120

M22

M1011

M321M321

marker

Page 47: Probabilistic Models of Structured Data - cs.stanford.edu · Stanford University. Bayesian Networks nodes = variables edges = direct influence ... Intell_George Diffic_CS101 A Grade_George_CSC

MotivationNot all SNPs are equally likely to be causal.

Gene

SNPs

TACGTAGGAACCTGTACCA … GGAAAATATCAAATCCAACGACGTTAGCCAATGCGATCGAATGGGAACGTA

ChrXIV: 449,639-502,316

SNP 1:Conserved residue in a gene involved in RNA degradation

SNP 2:In nonconserved intergenic region

“Regulatory features” F1. Gene region?2. Protein coding region?3. Nonsynonymous?4. Create a stop codon?5. Strong conservation?

:

Idea: Prioritize SNPs that have “good” regulatory features

But how do we weight different features?

Lee et al., PLOS Genetics 2009

Page 48: Probabilistic Models of Structured Data - cs.stanford.edu · Stanford University. Bayesian Networks nodes = variables edges = direct influence ... Intell_George Diffic_CS101 A Grade_George_CSC

Bayesian L1-Regularization

wmk

ym

Module m

xmk

Regulator k

~ P(ym|x;w) = N (Σk wmkxmk,ε2)

~ P(w)= Laplacian(0,C)

higher prior variance⇒ weight can more easily deviate from 0⇒ regulator more likely to be selected

wiwi

Priorvariance

∑−km

kmwC,

, ||∑=

M

mmmmyP

1),|(log wx

Lee et al., PLOS Genetics 2009

Page 49: Probabilistic Models of Structured Data - cs.stanford.edu · Stanford University. Bayesian Networks nodes = variables edges = direct influence ... Intell_George Diffic_CS101 A Grade_George_CSC

Metaprior Model (Hierarchical Bayes)

ym

xmk wmk ~ Laplacian (0,Cmk)

= g(ßTfmk)

:

~ P(ym|x;w) = N (Σk wmkxmk,ε2)

Cmk

fmk

Regulatoryprior ß

Module m

Regulator k

Module 1

Module M

xN…x1 x2x1 x2

Emodule 1

Potential regulators

xN…x1 x2x2

Emodule M

xN

Regulatory featuresInside a gene?Protein coding region? Strong conservation?TF binds to module genes

:

“Regulatory potential” = ß1 x Inside a gene? + ß2 x Protein coding region? + ß3 x Conserved? …

YESNO

:

YESNO

:

NONO

:

w11w12 w1N

wN1wN2 wMN

Lee et al., PLOS Genetics 2009

Page 50: Probabilistic Models of Structured Data - cs.stanford.edu · Stanford University. Bayesian Networks nodes = variables edges = direct influence ... Intell_George Diffic_CS101 A Grade_George_CSC

Metaprior Method

Regulatory potentials…x1 x2 xN

Ei

…x1 x2

Ei

xN

Module i+2

…x1 x2

Ei

xN

Module i+1

…x1 x2

Ei

xN

Module i

Regulatory programs

1. Learn regulatory programs

3. Compute regulatory potential of SNPs

0 0.1 0.2 0.3

1

2

3

4

5

6

Regulatory weights ß

Non-synonymousConservationAA small ↔ large

Cell cycle0 0.1 0.2 0.3

BY(lab) MVLT ELVQ VSDASKQLWD

RM(wild) MVLT ELVQ VSDASKQLW

× 1× 1× 1

× 0

L

D

× 1× 0× 0

× 1

: :

0.3 0.9

= =

Regulatory features F

A

G

2. Learn ß

Lee et al., PLOS Genetics 2009

Maximize P(E,ß,W|X)

Maximize P(E,ß,W|X)

Empirical hierarchical BayesUse point estimate of model parametersLearn priors from data to maximize joint posterior

Page 51: Probabilistic Models of Structured Data - cs.stanford.edu · Stanford University. Bayesian Networks nodes = variables edges = direct influence ... Intell_George Diffic_CS101 A Grade_George_CSC

Transfer LearningWhat do regulatory potentials do?

They do not change selection of “strong” regulators –those where prediction of targets is clearThey only help disambiguate between weak ones

Strong regulators help teach us what to look for in other regulators

Transfer of knowledge between different prediction tasks

Page 52: Probabilistic Models of Structured Data - cs.stanford.edu · Stanford University. Bayesian Networks nodes = variables edges = direct influence ... Intell_George Diffic_CS101 A Grade_George_CSC

Learned regulatory weights

0 0.1 0.2 0.3 0.4 0.5

Non-synonymous codingStop codon

Synonymous coding3' UTR

500 bp upstream5' UTR

500 bp downstreamConservation score

Cis-regulationChange of average mass (Da)Change of isoelectric point (pI)

Change of pK1Change of pK2

Change of hydro-phobicityChange of pKa

Change of polarityChange of pH

Change of van der WaalsTranscription regulator activity

Telomere organization andProtein folding

Glucose metabolic processRNA modification

Hexose metabolic processCell cycle checkpoint

Proteolysissame GO processsame GO functionChIP-chip binding

0 0.1 0.2 0.3 0.4 0.5

Non-synonymous codingSynonymous coding

IntronLocus region

Splice siteUTR region

Conservation scoreCis-regulation

Change of average mass (Da)Change of isoelectric point

Change of pK1Change of pKa

Change of polarityChange of pH

Change of van der Waals volumecell communicationsignal transduction

transcription

Human regulatory weights

Location

AA property change

Genefunction

Pairwise feature

Regulatory features

Lee et al., PLOS Genetics 2009

Yeast regulatory weights

Page 53: Probabilistic Models of Structured Data - cs.stanford.edu · Stanford University. Bayesian Networks nodes = variables edges = direct influence ... Intell_George Diffic_CS101 A Grade_George_CSC

How many predicted interactions have support in other data?

Deletion/ over-expression microarrays [Hughes et al. 2000; Chua et al. 2006]ChIP-chip binding experiments [Harbison et al. 2004]Transcription factor binding sites [Maclsaac et al. 2006]mRNA binding pull-down experiments [Gerber et al. 2004]Literature-curated signaling interactions

Biological evaluation I

0

10

20

30

40

50

60

70

80

90

GeronemoFlat LirnetLirnet

%interactions %interactions%modules %modules

Reg

Module

Via cascade

Lirnet without regulatory features

% S

uppo

rted

inte

ract

ions

TF

Lee et al., PLOS Genetics 2009

Page 54: Probabilistic Models of Structured Data - cs.stanford.edu · Stanford University. Bayesian Networks nodes = variables edges = direct influence ... Intell_George Diffic_CS101 A Grade_George_CSC

Biological Evaluation II

0

10

20

30

40

50

60

70

80

90

100

0 15 30 45 60 75 90

Significance (%)

% s

uppo

rted

regu

lato

ry p

redi

ctio

ns

Lee et al., PLOS Genetics 2009

LirnetZhu et al (Nature Genet 2008)Random

significance of support

Page 55: Probabilistic Models of Structured Data - cs.stanford.edu · Stanford University. Bayesian Networks nodes = variables edges = direct influence ... Intell_George Diffic_CS101 A Grade_George_CSC

Predicting Causal Regulators

Region Zhu et al [Nat Genet 08]

1 None

2TBS1, TOS1, ARA1, CSH1, SUP45, CNS1, AMN1

3 None

4 LEU2, ILV6, NFS1, CIT

5 MATALPHA16 URA37 GPA18 HAP19 YRF1-4, YRF1-5, YLR464W

10 None

11 SAL1, TOP2

12 PHM713 None

Lirnet (top 3 are considered)SEC18 RDH54 SPT7

AMN1 CNS1 TOS1

TRS20 ABD1 PRP5

2, MATALPHA1 LEU2 PGS1 ILV6MATALPHA1 MATALPHA2 RBK1

URA3 NPP2 PAC2

STP2 GPA1 NEM1

HAP1 NEJ1 GSY2

SIR3 HMG2 ECM7

ARG81 TAF13 CAC2

MKT1 TOP2 MSK1

PHM7 ATG19 BRX1

ADE2 ORT1 CAT5

Finding causal regulators for 13 “chromosomal hotspots”

8 validated regulators in 7

regions

14 validated regulators in 11

regions

Lee et al., PLOS Genetics 2009

Page 56: Probabilistic Models of Structured Data - cs.stanford.edu · Stanford University. Bayesian Networks nodes = variables edges = direct influence ... Intell_George Diffic_CS101 A Grade_George_CSC

Current & Future Directions

Understand mechanism by which individual genotype leads to changes in phenotype

Genotype & copy number changes (e.g., in cancer)First step to personalized medicine

Analysis and reconstruction of cellular pathways Understand immune response and how it is affected by aging…

Page 57: Probabilistic Models of Structured Data - cs.stanford.edu · Stanford University. Bayesian Networks nodes = variables edges = direct influence ... Intell_George Diffic_CS101 A Grade_George_CSC

The Computer Science Inside

Computational Issues: Huge graphical models require development of new algorithms

Convex optimization methods for learning network structureLearning using MAP inferenceUsing combinatorial optimization within standard inference

Statistical issues: Sparse data in high dimension“Holistic models” to exploit correlations between different labelsTransfer learning between related problemsNew algorithms for feature selection

Page 58: Probabilistic Models of Structured Data - cs.stanford.edu · Stanford University. Bayesian Networks nodes = variables edges = direct influence ... Intell_George Diffic_CS101 A Grade_George_CSC

http://ai.stanford.edu/~koller/

http://dags.stanford.edu/