Lectures 2 – Oct 3, 2011 CSE 527 Computational Biology, Fall 2011 Instructor: Su-In Lee TA: Christopher Miles Monday & Wednesday 12:00-1:20 Johnson Hall (JHN) 022 Introduction to Probabilistic Models for Computational Biology 1


Page 1

Lectures 2 – Oct 3, 2011
CSE 527 Computational Biology, Fall 2011

Instructor: Su-In Lee
TA: Christopher Miles

Monday & Wednesday 12:00-1:20
Johnson Hall (JHN) 022

Introduction to Probabilistic Models for Computational Biology

Page 2

Review: Gene Regulation

[Figure: a DNA sequence (AGATATGTGGATTGTTAGGATTTATGCGCGTCAGTGACTACGCATGTTACGCACCTACGACTAGGTAATGATTGATC) containing several genes. Each gene is transcribed into RNA (AUGUGGAUUGUU, AUGCGCGUC, AUGUUACGCACCUAC, AUGAUUGAU) and translated into protein (MWIV, MRV, MLRTY, MID). Labels: gene, transcription, translation, RNA degradation, "Gene Expression", a switch! ("transcription factor binding site"), gene regulation, genetic regulatory network.]

Genes regulate each others’ expression and activity.

Page 3

Review: Variations in the DNA

[Figure: the DNA sequence, RNA (AUGUGGAUUGUU, AUGCGCGUC, AUGUUACGCACCUAC, AUGAUUGAU), proteins (MWIV, MRV, MLRTY, MID), and genetic regulatory network from the gene regulation review, with single-nucleotide substitutions marked on the DNA ("Single nucleotide polymorphism (SNP)") and X marks on the affected parts of the RNA, protein, and network.]

Sequence variations perturb the regulatory network.

Page 4

Outline
Probabilistic models in biology
  Model selection problems
Mathematical foundations
  Bayesian networks
  (Probabilistic Graphical Models: Principles and Techniques, Koller & Friedman, The MIT Press)
Learning from data
  Maximum likelihood estimation
  Expectation and maximization

Page 5

Example 1
How are a nucleotide change in DNA, blood pressure, and heart disease related?

There can be several “models”…

[Figure: three alternative graphs over the nodes DNA alteration, Blood pressure, and Heart disease, separated by "OR". The edge directions are not recoverable from the transcript.]

Page 6

Example 2
How do genes A, B, and C regulate each other’s expression levels (mRNA levels)?

There can be several models…

[Figure: three alternative graphs over the nodes A, B, and C, separated by "OR ?". The edge directions are not recoverable from the transcript.]

Page 7

[Figure: an expression data matrix with rows Gene A, Gene B, Gene C and columns Exp 1, Exp 2, …, Exp N (N instances), next to three candidate graphs over A, B, C labeled Model I, Model II, Model III, separated by "OR ?".]

Statistical dependencies between expression levels of genes A, B, C?

Probability that model x is true given the data
Model selection: argmax_x P(model x is true | Data)

Probabilistic graphical models
A graphical representation of statistical dependencies.
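
The model selection rule above can be sketched numerically. By Bayes' rule, P(model x | Data) is proportional to P(Data | model x) P(model x). The log-likelihoods and the uniform prior below are hypothetical numbers chosen only to illustrate the argmax; the slide does not say how the likelihoods are obtained.

```python
import math

def model_posteriors(log_likelihoods, priors):
    """Return P(model | data) for each model via Bayes' rule."""
    # Unnormalized log posterior: log P(data | model) + log P(model)
    log_post = {m: log_likelihoods[m] + math.log(priors[m]) for m in priors}
    # Subtract the maximum before exponentiating for numerical stability
    top = max(log_post.values())
    weights = {m: math.exp(v - top) for m, v in log_post.items()}
    total = sum(weights.values())
    return {m: w / total for m, w in weights.items()}

# Hypothetical values, not taken from the lecture
log_lik = {"Model I": -120.0, "Model II": -118.5, "Model III": -125.0}
prior = {"Model I": 1 / 3, "Model II": 1 / 3, "Model III": 1 / 3}

post = model_posteriors(log_lik, prior)
best = max(post, key=post.get)  # argmax over models
print(best, post)
```

With a uniform prior the argmax reduces to picking the model with the highest likelihood; a non-uniform prior can change the choice.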

Page 8

Outline
Probabilistic models in biology
  Model selection problem
Mathematical foundations
  Bayesian networks
Learning from data
  Maximum likelihood estimation
  Expectation and maximization

Page 9

Probability Theory Review
Assume random variables A and B with Val(A) = {a1, a2, a3}, Val(B) = {b1, b2}

Conditional probability
Definition
Chain rule
Bayes’ rule

Probabilistic independence
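
The formulas for these items did not survive in the transcript. The standard textbook forms for two random variables A and B are given here as a reminder; the slide's own notation may have differed.

```latex
\begin{align*}
P(A \mid B) &= \frac{P(A, B)}{P(B)} && \text{(definition, for } P(B) > 0\text{)} \\
P(A, B) &= P(A)\, P(B \mid A) && \text{(chain rule)} \\
P(A \mid B) &= \frac{P(B \mid A)\, P(A)}{P(B)} && \text{(Bayes' rule)} \\
P(A, B) &= P(A)\, P(B) && \text{(independence of } A \text{ and } B\text{)}
\end{align*}
```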

Page 10

Probabilistic Representation
Joint distribution P over {x1,…,xn}
If each xi is binary: 2^n − 1 entries

If the x’s are independent: P(x1,…,xn) = P(x1) ⋯ P(xn)
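
A small sketch of the counting argument: a full joint table over n binary variables has 2^n entries, one of which is fixed by normalization, while full independence needs only one number per variable. The function names are illustrative.

```python
def joint_entries(n):
    """Independent entries in a full joint table over n binary variables."""
    return 2 ** n - 1  # one entry is fixed because probabilities sum to 1

def independent_entries(n):
    """Entries needed when all n binary variables are mutually independent."""
    return n  # one number P(x_i = 1) per variable

for n in (3, 10, 30):
    print(n, joint_entries(n), independent_entries(n))
```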

Page 11

Conditional Parameterization
The Diabetes example
Genetic risk (G), Diabetes (D)
Val(G) = {g1, g0}, Val(D) = {d1, d0}

P(G,D) = P(G) P(D|G)
P(G): Prior distribution
P(D|G): Conditional probability distribution (CPD)

[Figure: Genetic risk → Diabetes]

Page 12

Naïve Bayes Model - Example
Elaborating the diabetes example,
Genetic Risk (G), Diabetes (D), Hypertension (H)
Val(G) = {g1, g0}, Val(D) = {d1, d0}, Val(H) = {h1, h0}
The full joint table has 8 entries (7 independent, since they sum to 1)

If D and H are independent given G,
P(G,D,H) = P(G) P(D|G) P(H|G)
5 independent entries; more compact than the joint

[Figure: Genetic risk → Diabetes, Genetic risk → Hypertension]
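
As a rough check of the factorization P(G,D,H) = P(G) P(D|G) P(H|G), the sketch below builds the 8-entry joint from 5 free parameters and confirms that D and H come out conditionally independent given G. All probability values are hypothetical.

```python
from itertools import product

# Hypothetical CPD values, chosen only for illustration
p_g = {1: 0.2, 0: 0.8}                      # P(G)
p_d_given_g = {1: {1: 0.4, 0: 0.6},         # P(D | G=1)
               0: {1: 0.1, 0: 0.9}}         # P(D | G=0)
p_h_given_g = {1: {1: 0.5, 0: 0.5},         # P(H | G=1)
               0: {1: 0.2, 0: 0.8}}         # P(H | G=0)

# Build the 8-entry joint from the 5 free parameters above
joint = {(g, d, h): p_g[g] * p_d_given_g[g][d] * p_h_given_g[g][h]
         for g, d, h in product((1, 0), repeat=3)}

def cond_prob_dh_given_g(d, h, g):
    """P(D=d, H=h | G=g) computed from the joint table."""
    p_g_marginal = sum(p for (gg, _, _), p in joint.items() if gg == g)
    return joint[(g, d, h)] / p_g_marginal

total = sum(joint.values())
print(total, cond_prob_dh_given_g(1, 1, 1))
```

The conditional P(D=1, H=1 | G=1) recovered from the joint equals P(D=1 | G=1) P(H=1 | G=1), which is exactly the conditional independence the model assumes.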

Page 13

Naïve Bayes Model
A class C where Val(C) = {c1,…,ck}.
Finding variables x1,…,xn

Naïve Bayes assumption
The findings are conditionally independent given the individual’s class.
The model factorizes as: P(C, x1,…,xn) = P(C) P(x1|C) ⋯ P(xn|C)

The Diabetes example
class: Genetic risk, findings: Diabetes, Hypertension

Page 14

Naïve Bayes Model - Example
Medical diagnosis system
Class C: disease
Findings X: symptoms

Computing the confidence:

Drawbacks
Strong assumptions
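
The confidence formula on this slide is missing from the transcript. One common reading, assumed here, is the class posterior under the naive Bayes assumption: P(C | x1,…,xn) is proportional to P(C) times the product of P(xi | C). The diseases, symptoms, and probabilities below are hypothetical.

```python
# Hypothetical diseases, symptoms, and probabilities, for illustration only
prior = {"flu": 0.1, "cold": 0.9}            # P(C)
p_symptom = {                                # P(x_i = 1 | C)
    "flu":  {"fever": 0.9, "cough": 0.6},
    "cold": {"fever": 0.2, "cough": 0.7},
}

def posterior(findings):
    """P(C | findings) under the naive Bayes assumption.

    findings maps each symptom name to 1 (present) or 0 (absent).
    """
    scores = {}
    for c, p_c in prior.items():
        score = p_c
        for symptom, value in findings.items():
            p1 = p_symptom[c][symptom]
            score *= p1 if value == 1 else 1.0 - p1
        scores[c] = score
    z = sum(scores.values())  # normalizing constant P(findings)
    return {c: s / z for c, s in scores.items()}

result = posterior({"fever": 1, "cough": 1})
print(result)
```

The "strong assumptions" drawback shows up here directly: every symptom contributes an independent factor, so correlated symptoms are counted as separate evidence.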

Page 15

Bayesian Network
Directed acyclic graph (DAG)
Node: a random variable
Edge: direct influence of one node on another

The Diabetes example revisited
Genetic risk (G), Diabetes (D), Hypertension (H)
Val(G) = {g1, g0}, Val(D) = {d1, d0}, Val(H) = {h1, h0}

[Figure: Genetic risk → Diabetes, Genetic risk → Hypertension]

Page 16

Bayesian Network Semantics
A Bayesian network structure G is a directed acyclic graph whose nodes represent random variables X1,…,Xn.
PaXi: parents of Xi in G
NonDescendantsXi: variables in G that are not descendants of Xi.

G encodes the following set of conditional independence assumptions, called the local Markov assumptions, and denoted by IL(G):

For each variable Xi: (Xi ⊥ NonDescendantsXi | PaXi)

[Figure: an example DAG over the nodes x1,…,x11]
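
A minimal sketch of how the local Markov assumptions can be read off a graph, using the diabetes network (G → D, G → H) as the example. The helper names are illustrative, and the graph is stored as a hypothetical dictionary of parent lists.

```python
# Parent lists for the diabetes network: G -> D, G -> H
parents = {"G": [], "D": ["G"], "H": ["G"]}

def descendants(node):
    """All nodes reachable from `node` by following edges forward."""
    children = {n: [c for c, ps in parents.items() if n in ps] for n in parents}
    found, stack = set(), list(children[node])
    while stack:
        cur = stack.pop()
        if cur not in found:
            found.add(cur)
            stack.extend(children[cur])
    return found

def local_markov(node):
    """Return (non-descendants other than the parents, parents) for `node`."""
    pa = set(parents[node])
    non_desc = set(parents) - descendants(node) - {node}
    return non_desc - pa, pa

for x in parents:
    indep, pa = local_markov(x)
    print(f"{x} is independent of {sorted(indep)} given {sorted(pa)}")
```

For D this yields "D is independent of H given G", the same conditional independence used in the naive Bayes diabetes example.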

Page 17

The Genetics Example
Variables
B: blood type (a phenotype)
G: genotype of the gene that encodes a person’s blood type; <A,A>, <A,B>, <A,O>, <B,B>, <B,O>, <O,O>

Page 18

Bayesian Network Joint Distribution

Let G be a Bayesian network graph over the variables X1,…,Xn. We say that a distribution P factorizes according to G if P can be expressed as:

P(X1,…,Xn) = ∏_{i=1..n} P(Xi | PaXi)

A Bayesian network is a pair (G,P) where P factorizes over G, and where P is specified as a set of CPDs associated with G’s nodes.

Page 19

The Student Example
More complex scenario
Course difficulty (D), quality of the recommendation letter (L), Intelligence (I), SAT (S), Grade (G)
Val(D) = {easy, hard}, Val(L) = {strong, weak}, Val(I) = {i1, i0}, Val(S) = {s1, s0}, Val(G) = {g1, g2, g3}

Joint distribution requires 2·2·2·2·3 − 1 = 47 entries

Page 20

The Student Bayesian network
Joint distribution
P(I,D,G,S,L) =

[Figure: the student Bayesian network, from Koller & Friedman]
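
The right-hand side of the equation and the network figure are missing from the transcript. Assuming the student network has the structure usually drawn in Koller & Friedman (D → G ← I, I → S, G → L), the Bayesian network factorization gives P(I,D,G,S,L) = P(I) P(D) P(G|I,D) P(S|I) P(L|G). The sketch below evaluates that product. The CPD numbers are illustrative assumptions, and the labels d0/d1 and l0/l1 stand in for easy/hard and weak/strong.

```python
from itertools import product

# Illustrative CPD values (assumed, not taken from the transcript)
p_i = {"i0": 0.7, "i1": 0.3}
p_d = {"d0": 0.6, "d1": 0.4}
p_g = {  # P(G | I, D)
    ("i0", "d0"): {"g1": 0.30, "g2": 0.40, "g3": 0.30},
    ("i0", "d1"): {"g1": 0.05, "g2": 0.25, "g3": 0.70},
    ("i1", "d0"): {"g1": 0.90, "g2": 0.08, "g3": 0.02},
    ("i1", "d1"): {"g1": 0.50, "g2": 0.30, "g3": 0.20},
}
p_s = {"i0": {"s0": 0.95, "s1": 0.05}, "i1": {"s0": 0.2, "s1": 0.8}}  # P(S | I)
p_l = {"g1": {"l0": 0.1, "l1": 0.9}, "g2": {"l0": 0.4, "l1": 0.6},
       "g3": {"l0": 0.99, "l1": 0.01}}                                # P(L | G)

def joint(i, d, g, s, l):
    """P(I,D,G,S,L) = P(I) P(D) P(G|I,D) P(S|I) P(L|G)."""
    return p_i[i] * p_d[d] * p_g[(i, d)][g] * p_s[i][s] * p_l[g][l]

# Summing over all 48 assignments should give 1
total = sum(joint(i, d, g, s, l) for i, d, g, s, l in
            product(p_i, p_d, ("g1", "g2", "g3"), ("s0", "s1"), ("l0", "l1")))

# Free parameters: 1 + 1 + 4*2 + 2*1 + 3*1 = 15, versus 47 for the full joint
n_params = 1 + 1 + 4 * 2 + 2 * 1 + 3 * 1
print(total, n_params, joint("i1", "d0", "g2", "s1", "l0"))
```

Under this assumed structure the network needs 15 independent parameters instead of the 47 required by the unstructured joint.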

Page 21

Parameter Estimation
Assumptions
Fixed network structure
Fully observed instances of the network variables: D = {d[1],…,d[M]}
Maximum likelihood estimation (MLE)!

“Parameters” of the Bayesian network

For example, {i0, d1, g1, l0, s0}

[Figure from Koller & Friedman]
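
With a fixed structure and fully observed instances, the maximum likelihood estimate of each table CPD entry is a ratio of counts: count(parent value, child value) / count(parent value). A minimal sketch for one CPD, P(S | I), on a hypothetical data set of (I, S) pairs:

```python
from collections import Counter

# Hypothetical fully observed instances d[1..M], reduced to (I, S) pairs
data = [("i0", "s0"), ("i0", "s0"), ("i0", "s1"), ("i1", "s1"),
        ("i1", "s1"), ("i1", "s0"), ("i0", "s0"), ("i1", "s1")]

def mle_cpd(pairs):
    """MLE of P(child | parent): count(parent, child) / count(parent)."""
    pair_counts = Counter(pairs)
    parent_counts = Counter(parent for parent, _ in pairs)
    return {(parent, child): n / parent_counts[parent]
            for (parent, child), n in pair_counts.items()}

theta = mle_cpd(data)
print(theta[("i1", "s1")])
```

The same counting applies to every node given its parents, so the full network is estimated one CPD at a time.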

Page 22

Outline
Probabilistic models in biology
  Model selection problem
Mathematical foundations
  Bayesian networks
Learning from data
  Maximum likelihood estimation
  Expectation and maximization

Page 23

Acknowledgement

Profs Daphne Koller & Nir Friedman, “Probabilistic Graphical Models”