Introduction Prediction Validation Analysis Summary

Identification and analysis of functional transcription factor binding sites

Troy W. Whitfield and Zhiping Weng

July 20, 2010

Whitfield and Weng TFBS Function

Outline

1 Outline

2 Introduction

3 TFBS Prediction

4 Validation of TFBS

5 Further analysis

6 Summary

7 Acknowledgements

Introduction

Identify and functionally annotate transcription factor binding sites (TFBS) at base pair resolution.

Predict TF binding sites.

Mutate informative bases within the TFBS.

Measure the effect on promoter activity.

ChIP sequencing peaks

[Figure 1 plot: P(length) versus DNA fragment length (bp), x-axis ticks from 1 to 10000; legend: K562 MACS]

Figure 1: Distribution of ChIP-seq DNA fragment lengths for the GATA1 transcription factor. ENCODE consortium data were reported by the Yale/UC Davis/Harvard team in K562 cells. Peak calling was done using MACS [Zhang et al., 2008]. ChIP-seq peaks are much larger than TFBS footprint(s).

Assessing PWM predictiveness: GABP

[Figure 2 plot: true positive rate versus false positive rate, both from 0 to 1; legend: zlab_bidirect_GABP, GABP.M00341, MGGAAGTG_GABP.M1028]

Figure 2: ROC curves for the GABP transcription factor using existing motifs. ChIP-seq data were reported by the lab of Richard Myers (HudsonAlpha Institute).
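
An ROC curve of this kind can be sketched as follows. This is a minimal illustration, assuming each ChIP-seq peak and each control fragment has been reduced to a single best PWM score; the function names and the score values are hypothetical and are not taken from the slides.

```python
def roc_points(positive_scores, control_scores):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    # Sweep the threshold from the highest observed score down to the lowest.
    thresholds = sorted(set(positive_scores) | set(control_scores), reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        tpr = sum(s >= t for s in positive_scores) / len(positive_scores)
        fpr = sum(s >= t for s in control_scores) / len(control_scores)
        points.append((fpr, tpr))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical best PWM score per sequence (illustrative values only).
chip_scores = [9.1, 7.4, 6.8, 3.2, 8.5]
control_scores = [2.1, 4.0, 1.3, 6.9, 0.5]

curve = roc_points(chip_scores, control_scores)
print(auc(curve))
```

A PWM whose curve encloses more area separates ChIP-seq peaks from control fragments better, which is one way to rank candidate motifs for the same factor.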

Identifying TF binding sites

[Figure 3 plot: P(score) versus score (a.u.), x-axis from -10 to 20; legend: ChIP hits, Control]

Figure 3: Score distributions for ChIP-seq peaks (called using MACS [Zhang et al., 2008]) and control fragments. ChIP-seq data for the GABP transcription factor were reported by the lab of Richard Myers (HudsonAlpha Institute).

Identifying TF binding sites

The PWMs that best accounted for high-throughput ChIP data were used to identify TF binding sites.

The statistical significance of computed scores, S, for putative TF binding sites was calculated as $p(S) = \int_{S}^{\infty} P_c(S')\,dS'$, where $P_c(S)$ is the probability distribution for a set of control sequences. Predicted TF binding sites with small p(S) were experimentally tested.

PWM discovery and refinement will enhance our ability to accurately identify TF binding sites.
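
The tail integral p(S) can be estimated directly from a finite set of control scores. The sketch below assumes the control distribution is represented by a list of scores and replaces the integral with the fraction of controls scoring at least S; the function name and the values are hypothetical.

```python
from bisect import bisect_left


def empirical_p_value(score, control_scores_sorted):
    """Fraction of control scores >= score: a discrete estimate of the tail integral."""
    n = len(control_scores_sorted)
    # Index of the first control score that is >= the query score.
    first_ge = bisect_left(control_scores_sorted, score)
    return (n - first_ge) / n


# Hypothetical control scores (arbitrary units), sorted once up front.
control = sorted([-3.0, -1.5, 0.0, 0.5, 1.0, 2.0, 2.5, 4.0, 6.0, 9.0])

print(empirical_p_value(5.0, control))  # 2 of the 10 controls score >= 5.0
```

With a control set of size n, the smallest nonzero p-value this estimate can report is 1/n, so a large control set is needed to resolve small p(S).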

Experimental validation

Up to 5 bases in each TFBS were mutated, chosen to cause the greatest reduction in the computed score.

Luciferase reporter assays were carried out using transient transfection of promoter constructs.

Measurements were made using a total of 9 replicates per construct and were analyzed using a mixed-effects model.

From our most recent sets of ∼500 TFBS predictions, ∼350, or 70%, were experimentally validated in each of 4 human cell lines: K562, HepG2, HT-1080 and HCT-116.
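
One way to pick the mutated bases is sketched below. It assumes the computed score is an additive PWM score, so each position can be treated independently and the largest per-position losses give the largest total reduction. The slides do not describe the selection procedure in more detail; the PWM, the site and the function names are hypothetical.

```python
def pwm_score(site, pwm):
    """Additive PWM score: the sum of per-position weights for the observed bases."""
    return sum(pwm[i][base] for i, base in enumerate(site))


def most_damaging_mutant(site, pwm, max_mutations=5):
    """Mutate up to max_mutations positions to reduce the score as much as possible."""
    losses = []
    for i, base in enumerate(site):
        # The substitution with the lowest weight at this position.
        worst = min((b for b in "ACGT" if b != base), key=lambda b: pwm[i][b])
        loss = pwm[i][base] - pwm[i][worst]
        if loss > 0:
            losses.append((loss, i, worst))
    # Keep the positions where a substitution costs the most score.
    losses.sort(reverse=True)
    mutant = list(site)
    for _, i, worst in losses[:max_mutations]:
        mutant[i] = worst
    return "".join(mutant)


# Hypothetical 4-position PWM (log-odds weights) and wild-type site.
pwm = [
    {"A": 1.2, "C": -0.8, "G": -1.0, "T": -0.5},
    {"A": -1.5, "C": 0.1, "G": 1.4, "T": -0.2},
    {"A": 0.2, "C": 0.1, "G": 0.0, "T": -0.1},
    {"A": -2.0, "C": 1.0, "G": -0.3, "T": 0.4},
]
wild_type = "AGAC"
mutant = most_damaging_mutant(wild_type, pwm, max_mutations=2)
print(wild_type, pwm_score(wild_type, pwm), mutant, pwm_score(mutant, pwm))
```

Because weakly informative positions lose little score under any substitution, this selection naturally concentrates the mutations on the informative bases of the motif.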

Experimental validation

Table 1: Summary of functional tests of 650 predicted TF binding sites in K562 cells. Approximately 1/3 of functionally validated TF binding sites were shown to repress transcription.

Transcription factor    No. TFBS val./pred.    TFBS act./rep.
ATF3                    3/5                    1/2
ATF6                    5/8                    4/1
CTCF                    115/171                73/42
GABP                    20/28                  18/2
GATA1                   4/4                    4/0
GATA2                   59/82                  37/22
JunD                    2/3                    2/0
MAX                     2/3                    1/1
STAT1                   34/48                  27/7
STAT2                   18/23                  14/4
USF1                    2/2                    2/0
YY1                     89/102                 49/40
Various other           68/154                 47/21
Total                   431/650                283/148
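
The headline fractions follow from the Total row of Table 1. The short check below uses only those four totals; the variable names are illustrative.

```python
# Totals from the last row of Table 1 (K562 cells).
validated, predicted = 431, 650
activating, repressing = 283, 148

# Every validated site is classified as either activating or repressing.
assert activating + repressing == validated

validation_rate = validated / predicted
repressor_fraction = repressing / validated
print(f"{validation_rate:.1%} validated, {repressor_fraction:.1%} repressing")
```

This prints `66.3% validated, 34.3% repressing`, consistent with the caption's statement that roughly 1/3 of validated sites repress transcription.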

MAX

Validated 2 MAX binding sites, comprising 1 repressor and 1 activator.

[Figure 4 panels (luminosity, WT versus MT): C16orf35, PPP2R4]

Figure 4: Box plots for validated (p < 0.05) MAX binding sites. Gene annotations appear above the boxes.

GABP

Validated 18 GABP binding sites, comprising 2 repressors and 16 activators.

[Figure 5 panels (luminosity, WT versus MT): LILRB1, ZNF687, PISD, AVPR2, PSMB4, BUD13, CHPF, ERGIC3, C21orf59, GART, FLJ46020, LOC168850]

Figure 5: Box plots for validated (p < 0.05) GABP binding sites. Gene annotations appear above the boxes.

GABP

[Figure 6 panels (luminosity, WT versus MT): CLDN12, MEN1, ZNF259, HYPK, LENG1, C20orf44, NFS1, SYNJ1]

Figure 6: Additional box plots for validated (p < 0.05) GABP binding sites. Gene annotations appear above the boxes.

Further analysis

Can conservation distinguish functionally validated from non-validated TF binding sites?

Among functionally validated TF binding sites, can conservation be used to distinguish sites activating transcription from those that repress transcription?

What other genomic (e.g. distance from TSS, nearby enrichment of binding sites for other TFs) or epigenomic features (e.g. histone modifications) correlate with the function of TF binding sites?

How specific are functional TF binding sites to the cell lines in which they are experimentally validated?

Conservation in the binding sites of various TFs

Within the predicted binding sites of a given TF, it is difficult to distinguish experimentally validated from non-validated TFBS predictions.

[Figure 7 plot: log10(p-value PhastCons) versus log10(p-value phyloP), both axes from 0 to -1.2; legend: Vertebrates, Primate, Mammal]

Figure 7: Assessing and comparing the ability of PhastCons and PhyloP scores to distinguish experimentally validated from non-validated TFBS predictions in the following TFs: CTCF, E2F4, GABP, GATA2, STAT1, STAT2 and YY1.

Conservation in the binding sites of various TFs

Among the predicted binding sites of a given TF, conservation appears to be higher among experimentally validated than non-validated sites, although the difference is not significant due to the small number of sites.

[Figure 8 plot: phyloP score (unitless), from -1 to 4, for non-validated (N) and validated (V) sites of CTCF, E2F4, GABP, GATA2, STAT1, STAT2 and YY1]

Figure 8: Box plots of PhyloP scores for validated (p < 0.05) and non-validated binding sites in several transcription factors.

Genomic conservation in TF binding sites

Over a larger number of diverse TF binding sites, however, the enhanced genomic conservation in experimentally validated versus non-validated binding sites is highly statistically significant.

For example, aggregating the seven transcription factors displayed on the previous slide, a Kolmogorov-Smirnov test between the conservation scores of validated and non-validated TFBS predictions gives p < 0.01, even for conservation among primates.

The power of genomic conservation to distinguish validated from non-validated TFBS predictions is even greater when all 80 transcription factors for which TFBS predictions were made are considered.

Repressing and activating TF binding sites are generally equally conserved.
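
A two-sample Kolmogorov-Smirnov comparison of conservation scores can be sketched in plain Python as below. The p-value uses the standard asymptotic series with a small-sample correction to the effective sample size, which is an approximation and not necessarily the implementation used for the slides; the score values are hypothetical.

```python
import math


def ks_two_sample(a, b):
    """Two-sample KS statistic D and an asymptotic (approximate) p-value."""
    a, b = sorted(a), sorted(b)
    n, m = len(a), len(b)
    i = j = 0
    d = 0.0
    # Walk both sorted samples, tracking the largest gap between the two ECDFs.
    while i < n and j < m:
        x = min(a[i], b[j])
        while i < n and a[i] <= x:
            i += 1
        while j < m and b[j] <= x:
            j += 1
        d = max(d, abs(i / n - j / m))
    ne = n * m / (n + m)  # effective sample size
    lam = (math.sqrt(ne) + 0.12 + 0.11 / math.sqrt(ne)) * d
    if lam < 0.3:
        # The series converges slowly here, and the p-value is essentially 1.
        return d, 1.0
    p = 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                  for k in range(1, 101))
    return d, min(max(p, 0.0), 1.0)


# Hypothetical conservation scores for validated and non-validated sites.
validated_scores = [2.1, 1.8, 2.5, 3.0, 1.9, 2.7]
non_validated_scores = [0.2, 0.5, 1.0, 0.1, 0.8, 1.2]

d_stat, p_value = ks_two_sample(validated_scores, non_validated_scores)
print(d_stat, p_value)
```

Pooling sites across transcription factors increases both sample sizes, which is why a difference that is not significant for a single factor can become significant in aggregate.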

TF binding sites in relation to the transcription start site

[Figure 9 plot: $P_{|N|}$ versus |N|/bp (0 to 1400); inset: cumulative $P_M$ versus M/bp (0 to 1200); legend: Non-validated, Functionally validated]

Figure 9: Distinguishing between validated and non-validated TF binding sites from transient transfection assays. Here, $P_{|N|} = P_{-N} + P_{N}$ is the probability of finding a validated TFBS at a distance of $|N|$ base pairs from the transcription start site. Plotted in the inset is the cumulative probability, $P_M = \sum_{N=0}^{M} P_{|N|}$. The two distributions can be distinguished with $p < 1.5 \times 10^{-3}$ using a Kolmogorov-Smirnov test: validated TF binding sites tend to be closer to the TSS.
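
The distance distribution and its cumulative sum from the Figure 9 caption can be computed as in the sketch below. It assumes each binding site is described by a signed offset N from the TSS, and it folds upstream and downstream offsets together; the offsets and names are hypothetical.

```python
from collections import Counter
from itertools import accumulate


def distance_distribution(offsets, max_distance):
    """Return (p_abs, p_cum) indexed by distance in bp.

    p_abs[d] is the fraction of sites with |offset| == d;
    p_cum[M] is the fraction of sites with |offset| <= M.
    """
    counts = Counter(abs(n) for n in offsets)  # folds -N and +N together
    total = len(offsets)
    p_abs = [counts.get(d, 0) / total for d in range(max_distance + 1)]
    p_cum = list(accumulate(p_abs))
    return p_abs, p_cum


# Hypothetical signed offsets (bp) of binding sites relative to the TSS.
offsets = [-120, 35, -35, 400, -10, 10, 35, -800]

p_abs, p_cum = distance_distribution(offsets, max_distance=1000)
print(p_abs[35], p_cum[100])
```

Computing one such pair of curves per group of sites gives the two distributions that the Kolmogorov-Smirnov test then compares.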

TF binding sites in relation to the transcription start site

[Figure 10 plot: $P_{|N|}$ versus |N|/bp (0 to 1400); inset: cumulative $P_M$ versus M/bp (0 to 1200); legend: Repressors, Activators]

Figure 10: Distinguishing between activating and repressing TF binding sites, all experimentally validated by transient transfection assays. Here, $P_{|N|} = P_{-N} + P_{N}$ is the probability of finding a validated TFBS at a distance of $|N|$ base pairs from the transcription start site. Plotted in the inset is the cumulative probability, $P_M = \sum_{N=0}^{M} P_{|N|}$. The two distributions can be distinguished with $p < 8 \times 10^{-3}$ using a Kolmogorov-Smirnov test: activating TF binding sites tend to be closer than repressing TF binding sites to the TSS.

Over-representation of additional motifs on promoters with functionally validated TF binding sites: CTCF example

Table 2: Promoters with functional CTCF binding sites are enriched in a different set of motifs than promoters with non-functional binding sites. A set of ∼13000 human promoters was used as background.

Transcription factor    p-value
ELF5                    0.003
Myf                     0.016
Gfi                     0.04

Over-representation of additional motifs on promoters with functionally validated TF binding sites: YY1 example

Table 3: Over-represented motifs on promoters with functionally validated YY1 binding sites. A set of ∼13000 human promoters was used as background. Over-represented motifs present on promoters with functional YY1 binding sites were not over-represented on promoters with a predicted but non-functional CTCF binding site.

Transcription factor    p-value
SRY                     < 0.001
NFYA                    < 0.001
GABPA                   0.002
Nkx2-5                  0.006
CREB1                   0.009
SOX5                    0.01
AR                      0.011
EWSR1-FLI1              0.014
ELK4                    0.015
SP1                     0.017
FOXI1                   0.02
FOXD3                   0.022
SOX9                    0.022
Klf4                    0.025
STAT1                   0.027
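
The slides do not state which statistical test produced these p-values. As an illustration only, the sketch below assumes a one-sided hypergeometric test, a common choice for motif over-representation against a background promoter set. All counts except the background size of about 13000 are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(k, n, K, N):
    """P(X >= k): probability of seeing at least k motif-containing promoters
    when n promoters are drawn from a background of N, K of which contain the motif."""
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)


# Hypothetical counts: 100 promoters with a validated site, 20 of which carry
# the motif, against a background of 13000 promoters, 1300 of which carry it.
p_enrich = hypergeom_enrichment_p(k=20, n=100, K=1300, N=13000)
print(p_enrich)
```

Under this assumption, a small p-value means the motif occurs on the foreground promoters more often than random draws from the background would explain.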

Histone modifications and TFBS functional validation

[Figure 11 plot: KS p-value(L), from 0.1 to 1, versus L/bp, from 10 to 10^4; legend: H4K20me1, H3K9me1, H3K9ac, H3K4me3, H3K4me2, H3K4me1, H3K36me3, H3K27me3, H3K27ac]

Figure 11: Distinguishing between promoters with functionally validated and non-validated TF binding sites. The p-values are computed by applying a Kolmogorov-Smirnov (KS) test to histone modification signals, averaged over base pairs out to a distance L away from the TSS. Before applying the KS test, signals are grouped according to whether or not they have a validated (in K562 cells) TFBS. For promoters with validated TF binding sites, there are significantly (p < 0.05) higher H3K4me1 and H3K9me1 signals for 300 bp < L < 2000 bp.
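
The averaging step in the Figure 11 caption can be sketched as follows. This assumes a per-base signal track stored as a list and a window that extends L bp on both sides of the TSS; the slides do not specify whether the window is symmetric, and the signal values are hypothetical.

```python
def mean_signal_within(signal, tss_index, L):
    """Average a per-base signal over all positions within L bp of the TSS."""
    # Clip the window to the ends of the available signal track.
    start = max(0, tss_index - L)
    end = min(len(signal), tss_index + L + 1)
    window = signal[start:end]
    return sum(window) / len(window)


# Hypothetical histone modification signal around a TSS at index 5.
signal = [0, 0, 1, 2, 4, 8, 4, 2, 1, 0, 0]

for L in (2, 5):
    print(L, mean_signal_within(signal, tss_index=5, L=L))
```

Repeating this for every promoter and each value of L yields one averaged signal per promoter, and the two groups of promoters are then compared with a KS test at each L.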

Histone modifications and activation versus repression

[Figure 12 plot: KS p-value(L), from 0.1 to 1, versus L/bp, from 10 to 10^4; legend: H4K20me1, H3K9me1, H3K9ac, H3K4me3, H3K4me2, H3K4me1, H3K36me3, H3K27me3, H3K27ac]

Figure 12: Distinguishing between promoters with functionally validated TF binding sites which activate or repress transcription. The p-values are computed by applying a Kolmogorov-Smirnov (KS) test to histone modification signals, averaged over base pairs out to a distance L away from the TSS. Before applying the KS test, signals are grouped according to whether the TF binding activates or represses transcription (in K562 cells). For promoters with activating TF binding sites, there are significantly (p < 0.05) higher H3K4me1 and signals for 300 bp < L < 1000 bp.

Cell line specificity

Figure 13: Venn diagram for functionally validated (p < 0.05) TF binding sites in four different cell lines.
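
The region counts of a four-way Venn diagram like this one can be tabulated as in the sketch below, assuming each cell line is represented by the set of site identifiers validated in it. The identifiers and memberships are hypothetical and do not reproduce the figure.

```python
from collections import Counter


def venn_regions(validated_by_cell_line):
    """Count sites by the exact combination of cell lines in which they validated."""
    all_sites = set().union(*validated_by_cell_line.values())
    regions = Counter()
    for site in all_sites:
        lines = frozenset(name for name, sites in validated_by_cell_line.items()
                          if site in sites)
        regions[lines] += 1
    return regions


# Hypothetical validated site identifiers per cell line.
validated_sets = {
    "K562": {"s1", "s2", "s3"},
    "HepG2": {"s2", "s3", "s4"},
    "HT-1080": {"s3"},
    "HCT-116": {"s3", "s5"},
}

regions = venn_regions(validated_sets)
for lines, count in sorted(regions.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(lines), count)
```

Sites that fall in single-cell-line regions are the ones that make functional validation cell line specific.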

Summary

We have carried out approximately 36000 (500 × 18 × 4) functional assays on our predicted TF binding sites. In each of 4 cell lines (K562, HepG2, HT-1080 and HCT-116), approximately 70% of predicted TF binding sites were functionally validated.

Approximately 1/3 of validated TF binding sites repress transcription of the genes that they regulate.

Functional validation of predicted TF binding sites is cell line specific.

Validated TF binding sites are significantly more conserved than non-validated predictions.

Validated TF binding sites tend to be closer to the TSS than TF binding sites that were not validated.

Functionally validated TF binding sites that activate transcription tend to be closer to the TSS than those that repress transcription.

Functionally validated TF binding sites can be distinguished from non-validated sites by the statistical over-representation of additional and different TF motifs.

Histone modifications can distinguish validated from non-validated TFBS predictions and activation from repression.

Acknowledgements

Weng Lab: Jie Wang

Myers Lab: E. Christopher Partridge

SwitchGear Genomics: Patrick Collins, Nathan Trinklein

NHGRI
