mrna protein dna activation repression translation localization stability pol ii 3’utr...

Post on 19-Dec-2015

230 Views

Category:

Documents

1 Downloads

Preview:

Click to see full reader

TRANSCRIPT

mRNA

protein

DNAActivationRepression

TranslationLocalizationStability

Pol II

3’UTR

Transcriptional and post-transcriptional regulation of gene

expression

Using gene expression to identify regulatory

elements5’ upstream

Feb 2007: ~110,000 arrays in NCBI GEO

Gene expression

Allen Institute Mouse Brain Gene expression Atlas (in

situ hybridization, ~23,000 genes)

Feb 2008: ~202,000 arrays

Roth et al., 1998, Tavazoie et al., 1999:co-expressed genes often share the same

regulatory elements

Expression5’ upstream

How do you identify regulatory elements from gene

expression ?

Motif finding programs:

- AlignACE (Hughes et al, 2000)

- MEME

- REDUCE (Bussemaker et al, 2001)

and many others …

Problems with current motif finding approaches

Different approaches for different types of expression data

Single microarray (e.g. log-ratios) REDUCE

C0

C1

C2

C3

C4

Co-expression clusters ALIGNACE

Problems with current motif finding approaches

Many unrealistic assumptions, e.g., zeroth order

background sequence model in AlignACE 1kb upstream region of Plasmodium falciparum PF11_0108

TTTAAAAAAAAAAAAAAAAAGAGAAAAACCATATTTATATGGATATAATATTTTTAAAGTATAGAAAAAATAATATATATTTATATACATTTATATTAATGAAAAAGCAAACAGCTAAATTACAAAAAAAAAAAAAAATTAGATTATCTCAATTAAAAGAACAATATATAAATAATTAATCCATGCTATTTTTTGATATATATAAGAATTTAATGCCTTATATTATAAATAGAGAAATAAATAAATAAATAAATATATAAACATATATATTATATATATATATATATATATAGTTATACATTATGATTTTGAAAAAATAGATATATACTATTAATTGTATATGTTTATACATAAAGCATATTTTTATTAATTGTAATATATAGATTTTTTATTATAATAATATTATATATATATATATATATATATATATTTTTTTTTTTTTGTTAAATAGCGAAATAAAAATACCTGACCTTTGTAATCTTTATTTGATTACTTCCTTCTTCATTCCTTCTTTGTTTGTTTGTTTGTTTCCCTTTTTTTTTTTTTTTTTTTTTTTTTTAGTTAATTCTTTTATATGTATAATAATATTATAAGACAATTGGACAATGATTACAAAAAGGTAAAAGTAATAATTTTCTAAAGTATAATATAATATTATAATAATATAATATAAATTTTTAATAAAATTTAATAAAAAAGTTTATAAATACTTATCGACCATAAGTCGTTTAAGAAAAAAAAAAAAAAGAAAAAAAAAAAAAAATATTACAAAAATATTATAGTGTATTATATTTTATCATATCATCTTTTTTTTATTTTTTTTTATATTTTTGTTACGGCACATCAAGCAACTATAAATATTTAAGATCAACCACCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACATTATTTATGGTATTTTAAA

Problems with current motif finding approaches

Elevated false positive rate

k-means clustered gene expression

randomly clustered gene expression

many motifs

AlignACE

many motifs

Re-thinking motif discovery from expression data

• One approach for all types of expression data

• Make as few a priori assumptions as possible

• Very low false positive rate

• Scale to complex metazoan and plant genomes

Noam Slonim

0

1

2

3

4

Cluster Index

Microarray Conditions

All Genes on array

Clusters of co-expressed genes

5’ upstream regions

Cluster index

1

1

1

1

1

2

2

2

2

0

0

0

0

0

2

correlation is quantified using

the mutual information

0.27 0.07 0.33

0.07 0.27 0.00Motif

Expression (Cluster Indices)

Absent

Present

0 1 2

Our approach: look for motifs whose profile of

presence/absence is informative about the

expression profile

2

1

3

1 )()(

),(log),()expression; motif(

i j jPiP

jiPjiPI

These genes belong to cluster

0

These genes belong to cluster

1

These genes belong to cluster

2

...

0.45

0.12

0.01

-0.08

-0.87

-1.56

-2.32

-2.89

-5.65

1.54

1.98

3.50

4.39

6.45

-8.90

5’ upstream region

Log-ratio

Continuous expression variables (e.g. microarray log-ratios)

3’UTRs

1

1

1

1

1

2

2

2

2

0

0

0

0

0

2

5’ upstream regions

Expression cluster index

1

1

1

1

1

2

2

2

2

0

0

0

0

0

2

Expression cluster index

DNA RNA

finding all informative DNA and RNA motifs

How do we do it ?

Possible motif representations

Search space

Accuracy

very good

good

acceptable

very large

large

smallWords (k-mers)

GCGATGAG

Weight matrices

Degenerate code[AC]CGATGAG[TC]

Motif Search Algorithmk-mer MI CTCATCG 0.0618TCATCGC 0.0485AAAATTT 0.0438GATGAGC 0.0434AAAAATT 0.0383ATGAGCT 0.0334TTGCCAC 0.0322TGCCACC 0.0298ATCTCAT 0.0265......ACGCGCG 0.0018CGACGCG 0.0012TACGCTA 0.0011ACCCCCT 0.0010CCACGGC 0.0009TTCAAAA 0.0005AGACGCG 0.0004CGAGAGC 0.0003CTTATTA 0.0002

Not informative

Highly informative

...

MI=0.081

MI=0.045

MI=0.040

Optimizing k-mers into more informative degenerate motifs

ATCCGTACA

ATCC[C/G]TACAwhich character

increases the mutual information by the largest amount ?

5’ upstream regions

Cluster Indices

1

1

1

1

1

2

2

2

2

0

0

0

0

0

2

A/G

T/GC/G A/C/G

A/T/G

C/G/T

Optimizing k-mers into more informative degenerate motifs

ATCC[C/G]TACA

5’ upstream regions

Cluster Indices

1

1

1

1

1

2

2

2

2

0

0

0

0

0

2

A/C

T/CC/G A/C/G

A/T/C

C/G/T

.

.

.

change

Motif Conservation with S. bayanus

Similarity to ChIP-chip RAP1 motif

Mutual information

RAP1 binding site (ChIP-chip)

k-mer MI CTCATCG 0.0618TCATCGC 0.0485AAAATTT 0.0438GCTCATC 0.0434AAAAATT 0.0383ATGAGCT 0.0334TTGCCAC 0.0322TGCCACC 0.0298ATCTCAT 0.0265...

Highly informative

k-mers

Only optimize k-mer if

I(k-mer;expression | motif)

is large enough

(for all motifs optimized so far)

MI=0.081

MI=0.045

Motifs optimized so far

optimize ?

Conditional mutual information I(X;Y|Z)

Each motif is subjected to a stringent statistical significance

test

Real mutual information value

Maximum of 10,000 expression-shuffled mutual information values

The regulation of gene expression is highly

combinatorial

DNA

Pol II

Expression pattern 1

Expression pattern 2

Expression pattern 3

Can we group our predicted motifs into modules of combinatorially acting regulatory elements ?

The regulation of gene expression is highly

combinatorial

Predicting combinatorial regulation using mutual

information5’ upstream region

0.43 0.07

0.07 0.43Motif 1

Motif 2

Absent

Present

Absent Present

2

1

2

1 )()(

),(log),()2 motif1; motif(

i j jPiP

jiPjiPI

Is the presence of motif 1 informative about the presence of motif 2 ?

Discovering modules of combinatorially acting

motifs

modules

YeastP. falciparum Huma

n

Results

(malaria parasite)

Yeast stress gene expression program (Gasch et al, 2000)

• 173 microarray conditions

• ~ 5,500 genes

• 80 co-expression clusters

• Runtime ~ 1h (standard PC)

Predicted Motifs

Expression Clusters

17 motifs in 5’ upstream regions 6 motifs in 3’UTRs

0 “motifs” when shuffling the gene labels of the clustering partition

1129 motifs when applying AlignACE (with default parameters) to each cluster independently880 “motifs” when applying AlignACE to the same shuffled clusters as above

Predicted Motifs

13 modules of co-occurring motifs

All 23 motifs are highly conserved with S. bayanus

PAC

RRPE

PUF4

PUF3

MSN2/4

RAP1

RPN4

REB1

MBP1

HAP4

XBP1

BAS1

CBF1

SWI4

14 previously known motifs

Expression Clusters

PAC is under-represented in cluster

13 (p<1e-5)

PAC is highly over-represented in cluster 66

(p<1e-20)

over-

rep

resen

tion

un

der-

rep

resen

tion

PAC

RRPE

Puf4

Expression ClustersMotifs

5’

5’

3’UTR

Predicted cooperation between DNA and RNA

motifs

Predicted Motifs

Expression Clusters

PAC

RRPE

PUF4

PUF3

MSN2/4

RAP1

RPN4

REB1

MBP1

HAP4

XBP1

BAS1

CBF1

SWI4

Mitochondrial ribosome, p<1e-33

Mitochondrial ribosome, p<1e-29

Puf3

Cytosolic ribosome, p<1e-18

Functional enrichments

Proteasome complex (p<1e-44)

DNA replication (p<1e-7)

Oxydative phosphorylation (p<1e-17)

Rpn4

Mbp1

Novel motif

Beer and Tavazoie, 2004; Elemento and Tavazoie, 2005

We also use mutual information to discover …

Non-random spatial

distribution

Orientation preferences

Co-localization

0

0

0

0

0

1

1

1

1

5’ upstream region

Cluster Indices

2

2

2

2

0

0

0

0

0

1

1

1

1

2

2

2

2

Cluster IndicesCl

ose

Far

Very

far

Distance to TSS is informative about

expression

Distance between two motifs is informative

about expression

Predicted Motifs

Clusters

YY

YYY

YY

Y

Y

YY

Y

> 50% of our predicted motifs have a non-random spatial distribution

Clusters where the motif is over-represented

Clusters where the motif is NOT over-represented

ATG-600bp

PAC has a non-random spatial distribution

RAP1 motif has a different kind of non-random spatial distribution

Unique cluster where the motif is over-represented

Clusters where the motif is NOT over-represented

RAP1 motif also has a strong orientation preference

Clusters where the TWO motifs are both over-represented

Clusters where the motifs are NOT over-represented

-600bp ATG

PAC and RRPE tend be co-localize on the DNA

-600bp ATGPAC and the Msn2/4 binding site tend to avoid being in the same promoters

Single array analysis

Down-regulated Up-regulatedCy3/Cy5 expression log-ratios

PAC

Rpn4

Yap1

Puf3

H2O2 treatment in ΔMsn2/ΔMsn4 background

Bozdech, Llinás, et al., PLoS Biol, 2003

P. falciparum intra-erythrocytic developmental cycle

~ 2

,70

0 p

eri

od

ically

exp

ress

ed

g

en

es

0h Time 48h

Associate a “phase” to each gene, which reflects the timing of maximal expression

-0.25

-0.12

0.01

0.08

0.34

0.67

2.32

2.89

3.01

-0.38

-1.68

-2.34

-2.56

-3.14

3.14

5’ upstream region

Phase

Discovering motifs that are informative about the expression phase

21 motifs in 5’ upstream regions 0 motifs in 3’UTRs

0 “motifs” when shuffling the gene labels of the phase profile

-π Phase +π

71% highly conserved with P. yoelli

DNA replication, p<1e-4plastid, p<0.01

ribosome, p<0.001

Independent biochemical validation

- Purified 3/26 predicted TF in P. falciparum

- Identified DNA-binding specificities using protein binding microarrays

Bulyk lab, Harvard

Llinás lab, Princeton University, submitted

Motifs Match PredictionsP

rote

in B

ind

ing

M

icro

arra

y F

IRE

P

red

icti

on

GST AP2 GST AP2 AP2 GST AP2

-π Phase +π

Independent biochemical validation for 3/21 motifs

More TFs being purified ...

bound by MAL6P1.44

bound by PF11_0404

bound by PF14_0633

Human gene expression atlas(Su et al, 2004, PNAS)

• 79 human tissues

• >17,000 genes

• 120 co-expression clusters

• Runtime ~ 24h (standard PC)

73 DNA motifs 42 RNA motifs

ELK4Sp1AhRbZIP911NF-YE2F1TCF11-MafGPax2E2Fv-MybTEADDof2CHOP-C/EBPalphaHAND1-TCF3GBPSkn-1HFH-3Sox17

miR-499/miR-505/miR-200a/miR-141miR-525/miR-518f*/miR-526c/miR-526a/miR-520a*miR-380-3p/miR-215/miR-485-3p/hsa-let-7g/miR-610/hsa-let-7i/hsa-let-7b/hsa-let-7amiR-30d/miR-30c/miR-30a-5p/miR-30e-5p/miR-30BmiR-200b/miR-429/miR-200cmiR-663

71 modules

NF-Y

novel

M phase (p<1e-43)

TCF11-MafG

novel

Olfactory receptor activity (p<1e-43)

miR-525miR-518f*miR-526c

Sp1

(NF-Y binding site)

let-7b over-expression in human fibroblasts

let-7 microRNAs are up-regulated when fibroblasts enter quiescence

Let-7 target genes ?

A. Lagesse-Miller, O. Elemento, …, and H. Coller, submitted

~14,0

00

gen

es

C1

C2

C3

C4

C1

C2

C4

C5

A. Lagesse-Miller, O. Elemento, …, and H. Coller, submitted

0h 12h 24h 36h 48h

C5

let-7b over-expression in human fibroblasts

                 UACCUC |||||| uugguguguuggaugAUGGAGu-5’  let-7b

seed

-1000bp

TSSArabidopsis thaliana

Experimental testing:

Ken Birnbaum (NYU)

Phil Benfey (Duke)

~22,300 genes on Affy chip

Biological insights• Importance of RNA motifs in

shaping transcriptomes ~30% of yeast, worm, human, arabidopsis motifs are RNA motifs

• In worm/human/mouse, many RNA motifs match miRNA targets

• “Cooperation” between DNA and RNA motifs

                UGUGAU |||||| cgaguaguuucgaccgACACUAu

Yeast Puf4 motif

Novel worm 3’UTR motif

Biological insights

• Avoidance of joint-presence for certain motifs

• Under-representation of certain motifs

works with any type of gene expression data

FIRE(Finding Informative Regulatory Elements)

Single microarray (e.g. log-ratios)

Elemento, Slonim and Tavazoie, 2007, Molecular Cell

C0

C1

C2

C3

C4

Clustered microarrays

Gene expression

phase

It looks for both DNA and RNA motifs

FIRE(Finding Informative Regulatory Elements)

5’5’

3’UTR3’UTR

5’5’5’5’

3’UTR3’UTR

5’5’5’5’

Elemento, Slonim and Tavazoie, 2007, Molecular Cell

It is fast and scales well to large metazoan and plant genomes

FIRE(Finding Informative Regulatory Elements)

Elemento, Slonim and Tavazoie, 2007, Molecular Cell

It yields few or no false positives

FIRE(Finding Informative Regulatory Elements)

Real clustered gene expression

randomly clustered gene expression

115 motifs

0 motifs

(Human tissue microarray dataset)

It automatically evaluates:

FIRE(Finding Informative Regulatory Elements)

Functional coherence

Defense response (p<1e-32)

Inter-species conservation

Spatial and orientation biases

Compare to known motifs (JASPAR)

Cooperativity and co-localization

FIRE

fire --expfile=human_clusters.txt --exptype=discrete --species=human

Expression file Expression type Species

FIRE(Finding Informative Regulatory Elements)

Usage:

http://tavazoielab.princeton.edu/FIRE/

NM_000030 0NM_000040 0NM_000042 0NM_000045 0NM_000046 1NM_000053 1NM_000065 1NM_000066 1 ...

- discrete- continuous

- human- mouse- arabidopsis- drosophila- worm- plasmodium- budding yeast- fission yeast- sea squirt

~250 downloads since Nov 2007

Bambi Tsui

http://tavazoielab.princeton.edu/FIRE/

~1500 queries since Nov 2007

Acknowledgements

Saeed Tavazoie Noam Slonim

Sasan Amini

Chang Chan

Gordon Freckleton

Hany Girgis

Yir-Chung Liu

Ilias Tagkopoulos

Tiffany Vora

Scott Breunig

Anand Dharan

Hani Goodarzi

Danny Lieber

Yael Marshall

Kellen Olszewski

Bambi Tsui

Eric Wieschaus

Manuel Llinás

Hilary Coller

Aster Lagesse-Miller

Xuemin Lu

Erandi De Silva

Collaborators at

Princeton

Additional slides

…Chan*, Elemento*, Tavazoie, PLoS Computational Biology, 2005

k-mer MI CTCATCG 0.0618TCATCGC 0.0485AAAATTT 0.0438GATGAGC 0.0434AAAAATT 0.0383ATGAGCT 0.0334TTGCCAC 0.0322TGCCACC 0.0298CATCGCA 0.0293AGATGAG 0.0288TTTTTCA 0.0280ATCTCAT 0.0265...ACGCGCG 0.0168CGACGCG 0.0167TACGCTA 0.0167ACCCCCT 0.0167CCACGGC 0.0164TTCAAAA 0.0163AGACGCG 0.0163CGAGAGC 0.0163GATAGAG 0.0155GTAGCTC 0.0143CTTATTA 0.0142...

TestPASSPASSPASSPASSPASSPASSPASSPASSPASSPASSPASSPASS

PASSDON’T PASSDON’T PASSDON’T PASSDON’T PASSDON’T PASSDON’T PASSDON’T PASSDON’T PASSDON’T PASSDON’T PASS

Most informative

Less informative

10 consecutive

“don’t pass”

Optimize “seeds” into

more degenerate

motifs

Optimizing seeds into more informative degenerate

motifsA

TCG

S=C/GM=A/CW=A/TR=A/GK=T/GY=T/C

V=A/C/GH=A/C/TD=A/G/TB=C/G/T

N=A/C/G/T

TCC[C/G]TAC matches TCCCTAC

and

TCCGTAC

Predicted Motifs

Clusters

PAC

RRPE

PUF4

PUF3

MSN2/4

RAP1

RPN4

REB1

MBP1

HAP4

XBP1

BAS1

CBF1

SWI4

5’

3’UTR

3’UTR

Another example of predicted cooperation between DNA and RNA

motifs

RAP1

Novel

Novel

A gene expression map of Arabidopsis thaliana

development (Schmid et al, 2005)

• 79 different tissue samples

• 22,300 genes

• 140 clusters

Schmid et al, 2005, Nature Genetics

-Log(p) over-rep

Log(p) under-rep

114 motifs in 5’ upstream regions 66 motifs in 3’UTRs

0 “motifs” when shuffling the gene labels of the clustering partition

-Log(p) over-rep

Log(p) under-rep

telo-box

ABRE-like

W-box

I-box

DRE-core

Few of these motifs are known

-Log(p) over-rep

Log(p) under-rep

Y

Y

Y

Y

Y

Y

Y

Y

Y

Y

Y

Y

Y

Many have a non-random spatial distribution

Clusters where the motif is over-represented

Clusters where the motif is NOT over-represented

-1000bp

TSSMotif has a non-random spatial distribution

Functional enrichments

Defense response (p<1e-32)

Ribosome (p<1e-84)

Localized to chloroplast (p<1e-35)

3’UTR

5’

3’UTR

-Log(p) over-rep

Log(p) under-rep

These two motifs are predicted to co-localize extensively

Clusters where the motifs are over-represented

Clusters where the motifs are NOT over-represented

-1000bp

TSS

These two motifs co-localize on the DNA

Examples of tissue-specific motifs

... Collaborations with Phil Benfey (Duke), Ken Birnbaum

(NYU)

Other data-types

strong binding

no bindingp-values

20 Bicoid-bound vs 100 non-bound enhancers

20 Dorsal-bound vs 100 non-bound enhancers

• ChiP-chip, e.g. HNF6 (in human islet cells)

• Enhancers (Drosophila)

No

No

No

No

No

No

No

No

No

5’ upstream region

Tissue-specific expression

Yes

Yes

Yes

Yes

Yes

Yes

All other genes

Tissue-specific genes

Binary expression variables (e.g. tissue specific expression)

top related