MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman www.cisreg.ca

Upload: river-hollinsworth

Post on 16-Dec-2015

Page 1: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman

MedGen 505

Gene Regulation Bioinformatics

Wyeth W. Wasserman

www.cisreg.ca

Page 2: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman

CMMT

Overview

• TFBS Prediction with Motif Models

• Improving Specificity of Predictions

• Analysis of Sets of Co-Expressed and Co-Regulated Genes

Page 3: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman

Transcription Factor Binding Sites (over-simplified for pedagogical purposes)

[Figure labels: URE, TATA, URF, Pol-II]

Page 4: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman

Teaching a computer to find TFBS…

Page 5: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman

Laboratory Discovery of TFBS

[Figure: seven LUCIFERASE reporter constructs and their measured ACTIVITY]

Page 6: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman

Representing Binding Sites for a TF

• A set of sites represented as a consensus
  – VDRTWRWWSHD (IUPAC degenerate DNA)

• A matrix describing a set of sites

A  14 16  4  0  1 19 20  1  4 13  4  4 13 12  3
C   3  0  0  0  0  0  0  0  7  3  1  0  3  1 12
G   4  3 17  0  0  2  0  0  9  1  3  0  5  2  2
T   0  2  0 21 20  0  1 20  1  4 13 17  0  6  4

• A single site
  – AAGTTAATGA

Set of binding sites

AAGTTAATGA CAGTTAATAA GAGTTAAACA CAGTTAATTA
GAGTTAATAA CAGTTATTCA GAGTTAATAA CAGTTAATCA
AGATTAAAGA AAGTTAACGA AGGTTAACGA ATGTTGATGA
AAGTTAATGA AAGTTAACGA AAATTAATGA GAGTTAATGA
AAGTTAATCA AAGTTGATGA AAATTAATGA ATGTTAATGA
AAGTAAATGA AAGTTAATGA AAGTTAATGA AAATTAATGA
AAGTTAATGA AAGTTAATGA AAGTTAATGA AAGTTAATGA
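A minimal sketch of how a count matrix can be tallied from aligned sites. It uses the first four 10-mers of the binding-site set on this slide; the function name `build_pfm` is hypothetical, and the sketch assumes all sites are pre-aligned, equal in length, and contain only A, C, G and T.

```python
def build_pfm(sites):
    """Count each base at each position of a set of aligned, equal-length sites."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    pfm = {base: [0] * width for base in "ACGT"}
    for site in sites:
        for position, base in enumerate(site.upper()):
            pfm[base][position] += 1
    return pfm

# First four sites from the set of binding sites on the slide.
sites = ["AAGTTAATGA", "CAGTTAATAA", "GAGTTAAACA", "CAGTTAATTA"]
pfm = build_pfm(sites)
for base in "ACGT":
    print(base, pfm[base])
```

Every column of the resulting matrix sums to the number of sites, which is a quick sanity check on any count matrix.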

Page 7: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman

PFMs to PWMs

Add the following features to the model:
1. Correcting for the base frequencies in DNA
2. Weighting for the confidence (depth) in the pattern
3. Convert to log-scale probability for easy arithmetic

f matrix

A  5 0 1 0 0
C  0 2 2 4 0
G  0 3 1 0 4
T  0 0 1 1 1

w matrix

A   1.6 -1.7 -0.2 -1.7 -1.7
C  -1.7  0.5  0.5  1.3 -1.7
G  -1.7  1.0 -0.2 -1.7  1.3
T  -1.7 -1.7 -0.2 -0.2 -0.2

$w(b,i) = \log\left(\frac{f(b,i) + s(n)}{p(b)}\right)$

TGCTG = -1.7 + 1.0 + 0.5 - 0.2 + 1.3 = 0.9
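The conversion can be sketched in a few lines. The slide does not spell out the correction term s(n), so the choices below are assumptions: a total pseudocount of sqrt(N) shared among the bases in proportion to the background, counts normalised to probabilities before dividing by p(b), a uniform background of 0.25, and log base 2. With these assumptions the sketch reproduces the w matrix values above to one decimal place. The names `pfm_to_pwm` and `score_site` are hypothetical.

```python
import math

def pfm_to_pwm(pfm, background=None):
    """Convert a count matrix to log-odds weights (assumed sqrt(N) pseudocount)."""
    background = background or {base: 0.25 for base in "ACGT"}
    n_sites = sum(pfm[base][0] for base in "ACGT")
    pseudo_total = math.sqrt(n_sites)  # assumption: not stated on the slide
    pwm = {}
    for base in "ACGT":
        pwm[base] = [
            math.log2(
                ((count + pseudo_total * background[base]) / (n_sites + pseudo_total))
                / background[base]
            )
            for count in pfm[base]
        ]
    return pwm

def score_site(pwm, site):
    """Sum the column weights for one candidate site."""
    return sum(pwm[base][i] for i, base in enumerate(site))

f = {"A": [5, 0, 1, 0, 0], "C": [0, 2, 2, 4, 0],
     "G": [0, 3, 1, 0, 4], "T": [0, 0, 1, 1, 1]}
w = pfm_to_pwm(f)
print(round(score_site(w, "TGCTG"), 1))  # 0.9, matching the slide
```

Because the weights are logarithms, scoring a site is a plain sum over columns, which is the "easy arithmetic" the third point refers to.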

Page 8: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman

Performance of Profiles

• 95% of predicted sites bound in vitro (Tronche 1997)

• MyoD binding sites predicted about once every 600 bp (Fickett 1995)

• The Futility Conjecture
  – Nearly 100% of predicted transcription factor binding sites have no function in vivo

Page 9: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman

JASPAR

AN OPEN-ACCESS DATABASE

OF TF BINDING PROFILES

Page 10: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman

PROBLEM: Too many spurious predictions

Actin, alpha cardiac

Page 11: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman

Terms

• Specificity – The proportion of predictions that are correct

• Sensitivity – The proportion of “positives” that are detected

• The detection of TFBS is limited by terrible specificity. Why?

Page 12: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman

Method #1: Phylogenetic Footprinting

70,000,000 years of evolution reveals most regulatory regions

Page 13: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman

Phylogenetic Footprinting

[Figure: FoxC2 conservation plot; y-axis from 0% to 100%]

Page 14: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman

Phylogenetic Footprinting to Identify Functional Segments

[Figure: % identity (y-axis) vs. 200 bp window start position in the human sequence (x-axis). Actin gene compared between human and mouse with DPB.]
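A windowed identity profile of this kind can be sketched as follows. This is an illustration, not the DPB program: it assumes the two sequences have already been aligned (gaps written as "-"), it indexes windows by alignment column rather than by human sequence coordinate, and the function name `window_identity` is hypothetical.

```python
def window_identity(aligned_a, aligned_b, window=200, step=1):
    """Return (start_column, percent_identity) for each window of an alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    profile = []
    for start in range(0, len(aligned_a) - window + 1, step):
        columns = zip(aligned_a[start:start + window],
                      aligned_b[start:start + window])
        # A gap never counts as a match, even against another gap.
        matches = sum(1 for x, y in columns if x == y and x != "-")
        profile.append((start, 100.0 * matches / window))
    return profile

# Toy alignment with a 5-column window instead of 200.
profile = window_identity("ACGTACGTAC", "ACGTTCGTAC", window=5)
print(profile[0], profile[-1])
```

Windows whose identity exceeds a chosen cutoff would then be kept as candidate functional segments for TFBS scanning.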

Page 15: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman

Phylogenetic Footprinting Dramatically Reduces Spurious Hits

Human

Mouse

Actin, alpha cardiac

Page 16: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman

Performance: Human vs. Mouse

• Testing set: 40 experimentally defined sites in 15 well studied genes (replicated with 100+ site set)

• SENSITIVITY: 75-90% of defined sites detected with conservation filter

• SELECTIVITY: only 11-16% of total predictions retained

Page 17: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman

ConSite (www.cisreg.ca)

NEW: Ortholog Sequence Retrieval Service

Page 18: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman

Emerging Issues

• Multiple sequence comparisons
  – Incorporate phylogenetic trees
  – Visualization

• Analysis of closely related species
  – Phylogenetic shadowing

• Genome rearrangements
  – Inversion compatible alignment algorithm

• Higher order models of TFBS

Page 19: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman

Online Resources for Phylogenetic Footprinting

• Linked to TFBS
  – ConSite
  – rVISTA

• Alignments
  – Blastz
  – Lagan
  – Avid
  – ORCA

• Visualization
  – Sockeye
  – Vista Browser
  – PipMaker

Page 20: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman

Method #2: Discrimination of Regulatory Modules

TFs do NOT act in isolation

Page 21: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman

Layers of Complexity in Metazoan Transcription

Page 22: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman

Diverse and non-uniform use of terms: Partial glossary for tutorial

• Promoter – Sufficient to support the initiation of transcription; orientation dependent; includes TSS

• Regulatory Regions
  – Proximal – adjacent to promoter
  – Distal – some distance away from promoter (vague)
  – May be positive (enhancing) or negative (repressing)

• TSS – transcription start site

• TFBS – single transcription factor binding site

• Modules – Sets of TFBS that function together

[Figure: schematic gene locus with labeled EXON, TFBS, TATA, TSS, Promoter Region, Proximal Regulatory Region, and Distal Regulatory Regions]

Page 23: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman

Detecting Clusters of TF Binding Sites

• Trained Methods
  – Sufficient examples of real clusters to establish weights on the relative importance of each TF

• Statistical Over-Representation of Combinations
  – Binding profiles available for a set of biologically motivated TFs

Page 24: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman

Training for the detection of liver cis-regulatory modules (CRMs)

Page 25: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman

Models for Liver TFs…

HNF1

C/EBP

HNF3

HNF4

Page 26: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman

Logistic Regression Analysis

“logit”

Optimize vector to maximize the distance between output values for positive and negative training data.

Output value is:

$p(x) = \frac{e^{\mathrm{logit}}}{1 + e^{\mathrm{logit}}}$
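The output value can be sketched as below. The slide gives only the logistic transform; treating the logit as an intercept plus a weighted sum of per-TF site scores within a window is an assumption, and the weights, intercept and scores in the example are invented for illustration rather than taken from the liver model.

```python
import math

def crm_probability(site_scores, weights, intercept):
    """Logistic output p(x) = e^logit / (1 + e^logit) for one sequence window."""
    logit = intercept + sum(w * x for w, x in zip(weights, site_scores))
    return math.exp(logit) / (1.0 + math.exp(logit))

# Hypothetical best PWM scores in a window for HNF1, C/EBP, HNF3, HNF4.
weak_window = [0.2, 0.1, 0.0, 0.3]
strong_window = [0.9, 0.8, 0.7, 0.9]
weights = [2.0, 1.5, 1.0, 1.5]   # hypothetical trained weight vector
intercept = -3.0                 # hypothetical
print(crm_probability(weak_window, weights, intercept))
print(crm_probability(strong_window, weights, intercept))
```

Training amounts to choosing the weight vector so that windows from known modules score near 1 and background windows score near 0.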

Page 27: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman

Performance of the Liver Model

• Performance
  – Sensitivity: 60% of known CRMs detected
  – Specificity: 1 prediction/35,000 bp

• Limitations
  – Applies to genes expressed late in hepatocyte differentiation
  – Requires 10-15 genes in positive training set
  – This model doesn’t account for multiple sites for the same TF
    • New methods from several groups address this limit

Page 28: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman

UGT1A1

[Figure: Liver Module Model Score (y-axis) vs. “Window” Position in Sequence (x-axis); series: Wildtype, Other]

Page 29: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman

Making better predictions

• Profiles make far too many false predictions to have predictive value in isolation

• Phylogenetic footprinting eliminates ~90% of false predictions

• Algorithms for detection of clusters of binding sites perform better, especially when it is possible to train on known examples for the target context

Page 30: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman

Method #3: Higher Order Models

Position-position dependence

Page 31: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman

What is a higher-order background model?

Zero-order: p(A)=0.29, p(C)=0.21, p(G)=0.21, p(T)=0.29

$P(\mathrm{seq}) = \prod_{i=1}^{N} P(\mathrm{nucleotide}_i)$

First-order: [Figure: base-to-base transition diagram]

m-th order: The chance of drawing base x is dependent on the identity of the previous m bases
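A short sketch of both kinds of background model. The zero-order frequencies are the ones on the slide; the m-th order estimator, its add-one pseudocount, and the names `zero_order_logp` and `train_markov` are assumptions made for illustration.

```python
import math
from collections import defaultdict

def zero_order_logp(seq, p):
    """log P(seq) when every base is drawn independently with probability p[base]."""
    return sum(math.log(p[base]) for base in seq)

def train_markov(training_seq, m):
    """Estimate P(next base | previous m bases) with an add-one pseudocount."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(m, len(training_seq)):
        counts[training_seq[i - m:i]][training_seq[i]] += 1
    model = {}
    for context, following in counts.items():
        total = sum(following.values()) + 4  # four pseudocounts, one per base
        model[context] = {b: (following.get(b, 0) + 1) / total for b in "ACGT"}
    return model

p = {"A": 0.29, "C": 0.21, "G": 0.21, "T": 0.29}
print(zero_order_logp("AT", p))
first_order = train_markov("ACACACAC", 1)
print(first_order["A"])  # after an A, C is by far the most likely next base
```

In practice the training sequence would be a large collection of promoter or intergenic DNA, and contexts never seen in training need a fallback such as a lower-order model.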

Probabilistic Methods for Pattern Discovery(7)

Page 32: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman

Linking co-expressed genes to candidate transcription factors

Page 33: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman

Deciphering Regulation of Co-Expressed Genes

Page 34: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman

oPOSSUM Procedure

1. Set of co-expressed genes
2. Automated sequence retrieval from EnsEMBL
3. Phylogenetic Footprinting (ORCA)
4. Detection of transcription factor binding sites
5. Statistical significance of binding sites
6. Putative mediating transcription factors

Page 35: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman

Statistical Methods for Identifying Over-represented TFBS

• Z scores
  – Based on the number of occurrences of the TFBS relative to background
  – Normalized for sequence length
  – Simple binomial distribution model

• Fisher exact probability scores
  – Based on the number of genes containing the TFBS relative to background
  – Hypergeometric probability distribution
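The two statistics can be sketched with the standard library alone. This is an illustration under stated assumptions, not oPOSSUM's implementation: the Z score treats each nucleotide of the co-expressed set as a binomial trial with the background hit rate, the Fisher score is a one-tailed hypergeometric probability, and all function names and counts are hypothetical.

```python
import math

def binomial_z(hits_fg, length_fg, hits_bg, length_bg):
    """Z score for site occurrences, normalised for sequence length."""
    rate = hits_bg / length_bg            # background hits per nucleotide
    expected = rate * length_fg
    sd = math.sqrt(length_fg * rate * (1.0 - rate))
    return (hits_fg - expected) / sd

def fisher_upper_tail(fg_with, fg_total, bg_with, bg_total):
    """P(at least fg_with foreground genes contain the site), hypergeometric."""
    genes = fg_total + bg_total
    with_site = fg_with + bg_with
    denominator = math.comb(genes, fg_total)
    upper = min(fg_total, with_site)
    numerator = sum(
        math.comb(with_site, k) * math.comb(genes - with_site, fg_total - k)
        for k in range(fg_with, upper + 1)
    )
    return numerator / denominator

print(binomial_z(hits_fg=20, length_fg=1000, hits_bg=100, length_bg=10000))
print(fisher_upper_tail(fg_with=3, fg_total=3, bg_with=0, bg_total=3))
```

The Z score responds to many sites concentrated in a few genes, while the Fisher score counts each gene once, which is why the two rankings in the validation tables can disagree.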

Page 36: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman

The oPOSSUM Database

• Orthologous genes:  8468

• Promoter pairs:  6911

• Promoters with TFBS:  6758

• Total # of TFBS predictions:  1638293

• Overall failure rate:  20.2%

Page 37: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman

Validation using Reference Gene Sets

TFs with experimentally-verified sites in the reference sets.

A. Muscle-specific (23 input; 16 analyzed)

TF          Rank  Z-score  Fisher
SRF         1     21.41    1.18e-02
MEF2        2     18.12    8.05e-04
c-MYB_1     3     14.41    1.25e-03
Myf         4     13.54    3.83e-03
TEF-1       5     11.22    2.87e-03
deltaEF1    6     10.88    1.09e-02
S8          7     5.874    2.93e-01
Irf-1       8     5.245    2.63e-01
Thing1-E47  9     4.485    4.97e-02
HNF-1       10    3.353    2.93e-01

B. Liver-specific (20 input; 12 analyzed)

TF          Rank  Z-score  Fisher
HNF-1       1     38.21    8.83e-08
HLF         2     11.00    9.50e-03
Sox-5       3     9.822    1.22e-01
FREAC-4     4     7.101    1.60e-01
HNF-3beta   5     4.494    4.66e-02
SOX17       6     4.229    4.20e-01
Yin-Yang    7     4.070    1.16e-01
S8          8     3.821    1.61e-02
Irf-1       9     3.477    1.69e-01
COUP-TF     10    3.286    2.97e-01

Page 38: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman

Application to Microarray Data Sets

1. NF-κB inhibition microarray study

Page 39: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman

Genes Significantly Down-regulated by the NF-κB inhibitor (326 input; 179 analyzed)

TF Class Rank Z-score Fisher No. Genes

p65 REL 1 36.57 5.66e-12 62

NF-kappaB REL 2 32.58 5.82e-11 61

c-REL REL 3 26.02 8.59e-08 63

Irf-2 TRP-CLUSTER 4 20.39 5.74e-04 6

SPI-B ETS 5 16.59 1.23e-03 135

Irf-1 TRP-CLUSTER 6 15.4 9.55e-04 23

Sox-5 HMG 7 15.38 2.56e-02 126

p50 REL 8 14.72 2.23e-03 19

Nkx HOMEO 9 13.66 2.29e-03 111

Bsap PAIRED 10 13.2 9.92e-02 1

FREAC-4 FORKHEAD 11 12.05 1.66e-03 92

n-MYC bHLH-ZIP 25 6.695 1.84e-03 102

ARNT bHLH 26 6.695 1.84e-03 102

HNF-3beta FORKHEAD 29 5.948 3.32e-03 47

SOX17 HMG 31 5.406 8.60e-03 79

Page 40: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman

C-Myc SAGE Data

• c-Myc transcription factor dimerizes with the Max protein

• Key regulator of cell proliferation, differentiation and apoptosis

• Menssen and Hermeking identified 216 different SAGE tags corresponding to unique mRNAs that were induced after adenoviral expression of c-Myc in HUVEC cells

• They then went on to confirm the induction of 53 genes using microarray analysis and RT-PCR

Page 41: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman

Induced Genes after Ectopic Expression of c-Myc (SAGE) (53 input; 36 analyzed)

TF Class Rank Z-score Fisher No. Genes

Myc-Max bHLH-ZIP 1 21.68 5.35e-03 7

Staf ZN-FINGER, C2H2 2 20.17 1.70e-02 2

Max bHLH-ZIP 3 18.32 2.16e-02 12

SAP-1 ETS 4 13.23 1.61e-04 13

USF bHLH-ZIP 5 11.90 1.84e-01 16

SP1 ZN-FINGER, C2H2 6 11.68 4.40e-02 12

n-MYC bHLH-ZIP 7 11.11 1.55e-01 20

ARNT bHLH 8 11.11 1.55e-01 20

Elk-1 ETS 9 10.92 3.88e-03 19

Ahr-ARNT bHLH 10 10.17 1.11e-01 25

Page 42: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman

C-Fos Microarray Experiment

• In a study examining the role of transcriptional repression in oncogenesis, Ordway et al. compared the gene expression profiles of fibroblasts transformed by c-fos to the parental 208F rat fibroblast cell line

• We mapped the list of 252 induced Affymetrix Rat Genome U34A GeneChip sequences to 136 human orthologs

Page 43: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman

Induced Genes after Ectopic Expression of c-Fos (Affymetrix) (136 input; 86 analyzed)

TF Class Rank Z-score Fisher No. Genes

c-FOS bZIP 1 17.53 2.60e-05 45

RREB-1 ZN-FINGER, C2H2 2 8.899 1.41e-01 1

PPARgamma-RXRal NUCLEAR RECEPTOR 3 3.991 2.98e-01 1

CREB bZIP 4 3.626 1.25e-01 10

E2F Unknown 5 2.965 7.67e-02 15

NF-kappaB REL 6 2.915 1.04e-01 17

SRF MADS 7 2.707 2.24e-01 2

MEF2 MADS 8 2.634 1.32e-01 13

c-REL REL 9 2.467 5.79e-02 22

Staf ZN-FINGER, C2H2 10 2.385 3.74e-01 1

Ahr-ARNT bHLH 15 1.716 2.57e-03 63

deltaEF1 ZN-FINGER, C2H2 23 0.271 5.39e-03 75

Elk-1 ETS 21 0.7875 8.12e-03 37

MZF_1-4 ZN-FINGER, C2H2 27 -0.2421 5.41e-03 73

n-MYC bHLH-ZIP 30 -0.8738 8.20e-03 51

ARNT bHLH 31 -0.8738 8.20e-03 51

Page 44: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman

oPOSSUM Server

Page 45: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman

http://www.cisreg.ca/cgi-bin/oPOSSUM/opossum

INPUT A LIST OF CO-EXPRESSED GENES

Page 46: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman

SELECT YOUR TFBS PROFILES

Page 47: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman

SELECT:

1. CONSERVATION
2. PSSM MATCH THRESHOLD
3. PROMOTER REGION
4. STATISTICAL MEASURE

Page 48: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman

de novo Discovery of TF Binding Sites

Page 49: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman

Pattern Discovery

Page 50: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman

de novo Pattern Discovery

• Exhaustive – e.g. YMF (Sinha & Tompa)
  – Generalization: Identify over-represented oligomers in comparison of “+” and “-” (or complete) promoter collections

• Monte Carlo/Gibbs Sampling – e.g. AnnSpec (Workman & Stormo)
  – Generalization: Identify strong patterns in “+” promoter collection vs. background model of expected sequence characteristics

Page 51: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman

Exhaustive methods

Word based methods: How likely are X words in a set of sequences, given sequence characteristics?

CCCGCCGGAATGAAATCTGATTGACATTTTCC
>EP71002 (+) Ce[IV] msp-56 B; range -100 to -75
TTCAAATTTTAACGCCGGAATAATCTCCTATT
>EP63009 (+) Ce Cuticle Col-12; range -100 to -75
TCGCTGTAACCGGAATATTTAGTCAGTTTTTG
>EP63010 (+) Ce Cuticle Col-13; range -100 to -75
TATCGTCATTCTCCGCCTCTTTTCTT
>EP11013 (+) Ce vitellogenin 2; range -100 to -75
GCTTATCAATGCGCCCGGAATAAAACGCTATA
>EP11014 (+) Ce vitellogenin 5; range -100 to -75
CATTGACTTTATCGAATAAATCTGTT
>EP11015 (-) Ce vitellogenin 4; range -100 to -75
ATCTATTTACAATGATAAAACTTCAA
>EP11016 (+) Ce vitellogenin 6; range -100 to -75
ATGGTCTCTACCGGAAAGCTACTTTCAGAATT
>EP11017 (+) Ce calmodulin cal-2; range -100 to -75
TTTCAAATCCGGAATTTCCACCCGGAATTACT
>EP63007 (-) Ce cAMP-dep. PKR P1+; range -100 to -75
TTTCCTTCTTCCCGGAATCCACTTTTTCTTCC
>EP63008 (+) Ce cAMP-dep. PKR P2; range -100 to -75
ACTGAACTTGTCTTCAAATTTCAACACCGGAA
>EP17012 (+) Ce hsp 16K-1 A; range -100 to -75
TCAATGCCGGAATTCTGAATGTGAGTCGCCCT
>EP55011 (-) Ce hsp 16K-1 B; range

Page 52: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman

Over-representation

How many words of type ’AGGAGTGA’ are found in our sequences?

$P(w \text{ begins in } i) = \prod_{j=1}^{k} p(a_j)$

$E(X_w) = (n - k + 1) \prod_{j=1}^{k} p(a_j)$

$Z_w = \frac{X_w - E(X_w)}{\sqrt{\mathrm{Var}(X_w)}}$

How likely is this result?
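The word statistics on this slide can be sketched as follows, for a word of length k in a sequence of length n under a zero-order background. The slide does not give the variance term, so the binomial variance used here, which ignores the dependence between overlapping word positions, is a simplifying assumption; the function names are hypothetical.

```python
import math

def word_probability(word, p):
    """Probability that the word starts at a given position: product of p(a_j)."""
    prob = 1.0
    for base in word:
        prob *= p[base]
    return prob

def count_word(seq, word):
    """Number of (possibly overlapping) occurrences X_w of the word."""
    k = len(word)
    return sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] == word)

def word_z_score(seq, word, p):
    """Z score of the observed count against E(X_w) = (n - k + 1) * P(word)."""
    positions = len(seq) - len(word) + 1
    pw = word_probability(word, p)
    expected = positions * pw
    variance = positions * pw * (1.0 - pw)  # assumption: overlaps ignored
    return (count_word(seq, word) - expected) / math.sqrt(variance)

uniform = {base: 0.25 for base in "ACGT"}
seq = "AGGAGTGA" + "C" * 92
print(count_word(seq, "AGGAGTGA"), word_z_score(seq, "AGGAGTGA", uniform))
```

A large positive Z score says the word occurs far more often than the background composition alone would predict.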

Exhaustive methods

Page 53: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman

Exhaustive methods

Find all words of length 7 in the yeast genome

Make a lookup table:

AAACCTTT    456
TTTTTTTT  57788
GATAGGCA    589
Etc...

GTCTTTATCTTCAAAGTTGTCTGTCCAAGATTTGGACTTGAAGGACAAGCGTGTCTTCTCAGAGTTGACTTCAACGTCCCATTGGACGGTAAGAAGATCACTTCTAACCAAAGAATTGTTGCTGCTTTGCCAACCATCAAGTACGTTTTGGAACACCACCCAAGATACGTTGTCTTGTTCTCACTTGGGTAGACCAAACGGTGAAAGAAACGAAAAATACTCTTTGGCTCCAGTTGCTAAGGAATTGCAATCATTGTTGGGTAAGGATGTCACCTTCTTGAACGACTGTGTCGGTCCAGAAGTTGAAGCCGCTGTCAAGGCTTCTGCCCCAGGTTCCGTTATTTTGTTGGAAAACTGCGTTACCACATCGAAGAAGAAGGTTCCAGAAAGGTCGATGGTCAAAAGGTCAAGGCTCAAGGAAGATGTTCAAAAGTTCAGACACGAATTGAGCTCTTTGGCTGATGTTTACATCACGATGCCTTCGGTACCGCTCACAGAGCTCACTCTTCTATGGTCGGTTTCGACTTGCCAACGTGCTGCCGGTTTCTTGTTGGAAAAGGAATTGAAGTACTTCGGTAAGGCTTTGGAGAACCCAACCAGACCATTCTTGGCCATCTTAGGTGGTGCCAAGGTTGCTGACAAGATTCAATTGATTGACAACTTGTTGGACAAGGTCGACTCTATCATCATTGGTGGTGGTATGGCTTTCCCTTCAAGAAGGTTTTGGAAAACACTGAAATCGGTGACTCCATCTTCGACAAGGCTGGTGCTGAAATCGTTCCAAAGTTGATGGAAAAGGCCAAGGCCAAGGGTGTCGAAGTCGTCTTGCAGTCGACTTCATCATTGCTGATGCTTTCTCTGCTGATGCCAACACCAAGACTGTCACTGACAAGGAAGGTATTCCAGCTGGCTGGCAAGGGTTGGACAATGGTCCAGAATCTAGAAAGTGTTTGCTGCTACTGTTGCAAAGGCTAAGACCATTGTCTGGAACGGTCCACCAGGTGTTTTCGAATTCGAAAAGTTCGCTGCTGGTACTAAGGCTTTGTTAGACGAAGTTGTCAAGAGCTCTGCTGCTGGTAACACCGTCATCATTGGTGGTGGTGACACTGCCA
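A lookup table of this kind is a single pass over the sequence. The sketch below counts overlapping words on one strand only, using the first 26 bases of the sequence above and a word length of 3 to keep the output small; `kmer_table` is a hypothetical name.

```python
from collections import Counter

def kmer_table(seq, k):
    """Count every overlapping word of length k in a sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

table = kmer_table("GTCTTTATCTTCAAAGTTGTCTGTCC", 3)
print(table.most_common(3))
```

For a whole genome the same loop applies; with longer words the table grows as 4^k entries, which is what limits exhaustive methods to short words.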

Page 54: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman

The Gibbs Sampling algorithm

Two data structures used:

1) Current pattern nucleotide frequencies $q_{i,1}, \ldots, q_{i,4}$ and corresponding background frequencies $p_{i,1}, \ldots, p_{i,4}$

2) Current positions of site startpoints in the N sequences $a_1, \ldots, a_N$, i.e. the alignment that contributes to $q_{i,j}$.

One starting point in each sequence is chosen randomly initially.

tgacttcctgatctctagacctcatgacctct

Page 55: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman

Iteration step

A. Remove one sequence z from the set. Update the current pattern according to

$q_{i,j} = \frac{c_{i,j} + b_j}{N - 1 + B}$

where $c_{i,j}$ is the count of symbol j at position i of the current alignment, $b_j$ is the pseudocount for symbol j, and $B$ is the sum of all pseudocounts in the column.

B. ’Score’ the current pattern against each possible occurrence $a_k$ in z. Draw a new $a_k$ with probabilities based on the respective score divided by the background model.

tgacttcctgatctctagacctcatgacctct
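The two steps can be put together in a compact sampler. This is a teaching sketch under several assumptions: one site per sequence, a fixed motif width, uppercase A/C/G/T input, at least two sequences, a zero-order background estimated from the input itself, and a fixed number of sweeps instead of a convergence test. The function name, the parameter defaults and the toy sequences are all hypothetical.

```python
import random

def gibbs_motif_search(seqs, width, sweeps=200, pseudocount=0.25, seed=0):
    """Return one sampled motif start position per sequence."""
    rng = random.Random(seed)
    bases = "ACGT"
    total = sum(len(s) for s in seqs)
    background = {b: (sum(s.count(b) for s in seqs) + 1) / (total + 4) for b in bases}
    pseudo_sum = pseudocount * len(bases)           # B in the update rule
    # One starting point in each sequence is chosen randomly initially.
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(sweeps):
        for z, seq_z in enumerate(seqs):
            # Step A: rebuild the pattern q from every sequence except z.
            q = []
            for i in range(width):
                column = [s[a + i] for n, (s, a) in enumerate(zip(seqs, starts)) if n != z]
                q.append({b: (column.count(b) + pseudocount) / (len(seqs) - 1 + pseudo_sum)
                          for b in bases})
            # Step B: weight each candidate start in z by pattern / background.
            weights = []
            for a in range(len(seq_z) - width + 1):
                ratio = 1.0
                for i in range(width):
                    base = seq_z[a + i]
                    ratio *= q[i][base] / background[base]
                weights.append(ratio)
            starts[z] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts

toy = ["ACGTTGACCTAA", "TTGACCTGGCAT", "GGCATTGACCTC"]
starts = gibbs_motif_search(toy, width=6, sweeps=20)
print(starts)
```

Because new positions are drawn at random rather than chosen greedily, the sampler can leave a poor local alignment; in practice several runs with different seeds are compared.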

Page 56: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman

Applied Pattern Discovery is Acutely Sensitive to Noise

[Figure: PATTERN SIMILARITY vs. TRUE MEF2 PROFILE (y-axis, 10 to 18) against SEQUENCE LENGTH (x-axis, 0 to 600); label: True Mef2 Binding Sites]

Page 57: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman

Four Approaches to Improve Sensitivity

• Better background models
  – Higher-order properties of DNA

• Phylogenetic Footprinting
  – Human:Mouse comparison eliminates ~75% of sequence

• Regulatory Modules
  – Architectural rules

• Limit the types of binding profiles allowed
  – TFBS patterns are NOT random

Page 58: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman

Information segmentation

Information content distributions of TFBS are distinctly non-random (Wasserman et al 2000)

Palindromicity, dyads (van Helden et al 2000)

Variable gaps (Hu 2003)

TFBSs are not randomly drawn

Enhancing pattern detection sensitivity

Page 59: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman

Pattern discovery methods using biochemical constraints

Page 60: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman

Some profile constraints have been explored…

• Segmentation of informative columns

• Palindromic patterns

Page 61: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman

Our Hypothesis

• Point 1: Structurally-related DNA binding domains interact with similar target sequences

• Exceptions exist (e.g. Zn-fingers)

• Point 2: There are a finite number of binding domains used in human TFs

• Approximately 20-25

• Idea: We could use the shared binding properties for each family to focus pattern detection methods

• Constrain the range of patterns sought

Page 62: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman

Comparison of profiles requires alignment and a scoring function

• Scoring function based on sum of squared differences

• Align frequency matrices with modified Needleman-Wunsch algorithm

• Calculate empirical p-values based on simulated set of matrices
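The scoring function can be sketched as below. Note the substitution: instead of the modified Needleman-Wunsch alignment named on the slide, this sketch tries every ungapped offset between the two profiles and keeps the one with the lowest mean sum of squared differences. The minimum overlap, the function names and the toy profiles are assumptions.

```python
def column_ssd(col_a, col_b):
    """Sum of squared differences between two frequency columns."""
    return sum((col_a[b] - col_b[b]) ** 2 for b in "ACGT")

def best_ungapped_ssd(profile_a, profile_b, min_overlap=4):
    """Return (mean SSD, offset) for the best ungapped alignment of two profiles."""
    best = None
    first = -(len(profile_b) - min_overlap)
    last = len(profile_a) - min_overlap
    for offset in range(first, last + 1):
        pairs = [(profile_a[i], profile_b[i - offset])
                 for i in range(len(profile_a))
                 if 0 <= i - offset < len(profile_b)]
        if len(pairs) < min_overlap:
            continue
        mean_ssd = sum(column_ssd(x, y) for x, y in pairs) / len(pairs)
        if best is None or mean_ssd < best[0]:
            best = (mean_ssd, offset)
    return best

def one_hot(base):
    """Toy frequency column in which a single base has frequency 1."""
    return {b: 1.0 if b == base else 0.0 for b in "ACGT"}

profile = [one_hot(b) for b in "ACGTA"]
print(best_ungapped_ssd(profile, profile))
```

An empirical p-value would then come from repeating the comparison against a large set of simulated matrices and asking how often a score this low arises by chance.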

[Figure: Frequency (y-axis) vs. Score (x-axis)]

Page 63: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman

Intra-family comparisons more similar than inter-family

TF Database(JASPAR)

COMPARE

Match to bHLH

Jackknife Test 87% correct

Independent Test Set 93% correct

Page 64: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman

Page 65: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman

FBPs enhance sensitivity of pattern detection

Page 66: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman
Page 67: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman

REVIEWING THE TOP POINTS

Page 68: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman

Orientation

Regulatory regions problem space

Sets of binding sites

AATCACCA AATCACCA AATCACCA AATCACCA
AATCTCCC AATCTCCG AATCACAC AATCATCA
AATCTCAC AATCTCTG AGTCCCCA AATCCCGG
AATCTGAG AATCCATA ATTCAGCC AATAACTT
GATAACCT AATTAGAC GATTACAG GATTAGCG
ATTCTTCC TATGAACA GATTAAAA AGACCCCA

Specificity profiles for binding sites

A [ -2 0 -2 -0.415 0.585 -2 -2 2.088 -2 -2 -1 0.585 ]
C [ 1 0.585 0 0 -1 -2 -2 -2 2.088 -2 0.585 0.807 ]
G [ 0.585 0.322 0.807 1.585 1 -2 2 -2 -2 2.088 -2 0 ]
T [ 0.319 0.322 1 -2 0 2.088 -1 -2 -2 -2 1.459 -0.415 ]

Clusters of binding sites

Transcription factors

Transcription factor binding sites

Regulatory nucleotide sequences

[Figure labels: URE, TATA, URF, Pol-II]

Page 69: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman

Analysis of regulatory regions with TFBS

Detecting binding sites in a single sequence

Scanning a sequence against a PWM

A [ -0.2284  0.4368 -1.5    -1.5    -1.5     0.4368 -1.5    -1.5    -0.2284  0.4368 ]
C [ -0.2284 -0.2284 -1.5    -1.5     1.5128 -1.5    -0.2284 -1.5    -0.2284 -1.5    ]
G [  1.2348  1.2348  2.1222  2.1222  0.4368  1.2348  1.5128  1.7457  1.7457 -1.5    ]
T [  0.4368 -0.2284 -1.5    -1.5    -0.2284  0.4368  0.4368  0.4368 -1.5     1.7457 ]

ACCCTCCCCAGGGGCGGGGGGCGGTGGCCAGGACGGTAGCTCC

Abs_score = 13.4 (sum of column scores)

Sp1

Calculating the relative score

Max_score = 15.2 (sum of highest column scores)

Min_score = -10.3 (sum of lowest column scores)

$\mathrm{Rel\_score} = \frac{\mathrm{Abs\_score} - \mathrm{Min\_score}}{\mathrm{Max\_score} - \mathrm{Min\_score}} \times 100\% = \frac{13.4 - (-10.3)}{15.2 - (-10.3)} \times 100\% = 93\%$

Scanning 1300 bp of human insulin receptor gene with Sp1 at rel_score threshold of 75%

Ouch.
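Scanning with a relative-score threshold can be sketched as follows. For brevity the demonstration uses the 5-column w matrix from the "PFMs to PWMs" slide rather than the Sp1 matrix; the function names, the toy sequence and the choice to scan only the forward strand are assumptions.

```python
def relative_score(abs_score, min_score, max_score):
    """Rescale an absolute PWM score to 0-100% of the attainable range."""
    return 100.0 * (abs_score - min_score) / (max_score - min_score)

def scan(seq, pwm, threshold=75.0):
    """Return (start, site, rel_score) for every window at or above the threshold."""
    width = len(pwm["A"])
    max_score = sum(max(pwm[b][i] for b in "ACGT") for i in range(width))
    min_score = sum(min(pwm[b][i] for b in "ACGT") for i in range(width))
    hits = []
    for start in range(len(seq) - width + 1):
        site = seq[start:start + width]
        abs_score = sum(pwm[base][i] for i, base in enumerate(site))
        rel = relative_score(abs_score, min_score, max_score)
        if rel >= threshold:
            hits.append((start, site, round(rel, 1)))
    return hits

# The slide's worked numbers: Abs 13.4, Min -10.3, Max 15.2.
print(round(relative_score(13.4, -10.3, 15.2)))  # 93

w = {"A": [1.6, -1.7, -0.2, -1.7, -1.7],
     "C": [-1.7, 0.5, 0.5, 1.3, -1.7],
     "G": [-1.7, 1.0, -0.2, -1.7, 1.3],
     "T": [-1.7, -1.7, -0.2, -0.2, -0.2]}
hits = scan("TTAGCCGTT", w, threshold=75.0)
print(hits)
```

Lowering the threshold admits more windows, which is the source of the flood of hits seen when a long promoter is scanned at 75%.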

Page 70: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman

Scanning a single sequence

Low specificity of profiles:
• too many hits
• great majority not biologically significant

Scanning a pair of orthologous sequences for conserved patterns in conserved sequence regions

A dramatic improvement in the percentage of biologically significant detections

Analysis of regulatory regions with TFBS

Phylogenetic Footprints

Page 71: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman

Pattern Discovery

[Figure: plot with y-axis ticks from 10 to 18 and x-axis ticks from 0 to 600]

Page 72: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman

Concluding Thoughts

• Bioinformatics is often constrained by our understanding of biochemistry rather than computational or statistical limitations

• Evolution has a powerful influence on the performance of many bioinformatics methods

• Computational predictions have value, but only if you understand the limitations of the methods