genesdev.cshlp.orggenesdev.cshlp.org/.../supplementary_material.docx · web view2015/01/29 ·...

Mancino, Termanini et al. - Supplementary material

Supplemental Figures

Suppl. Fig. 1. Binding of IRF8 to multimerized IRF sites. A) Macrophages were transduced with retroviral vectors to express IRF8. The DNA immunoprecipitated in an IRF8 ChIP was detected by quantitative PCR using primers for genomic regions that show inducible IRF8 recruitment and contain multimerized IRF sites. The upper left region contains a PU.1/IRF site and was used as a control. B) In vitro pull-down assay using increasing amounts of different biotinylated IRF oligonucleotides. The pulled-down proteins were analyzed by western blot using the indicated antibodies. Images were acquired and quantified using Li-Cor.

Suppl. Fig. 2. A feed-forward loop controls Irf8 transcription. A) IRF8 protein expression after LPS stimulation in wild type and Bxh2 macrophages. B) Binding of IRF8 to the Irf8 gene promoter and effects of the Bxh2 mutation.

Suppl. Fig. 3. Effects of the Bxh2 mutation on Irf8, Pu.1 and H3K27Ac after LPS stimulation. Scatter plots indicating IRF8 and PU.1 levels in Bxh2 macrophages relative to wild type cells after LPS treatment.

Suppl. Fig. 4. Reduced expression of Egr1 and Egr3 in Bxh2 macrophages. RNA-seq snapshots for the two genes are shown.

Suppl. Fig. 5. Effects of IFN on Irf8 expression in wild type and Bxh2 macrophages. A) Western blot showing IRF8 expression and STAT1 phosphorylation in response to IFN stimulation. B) STAT1 binding to the Irf8 gene locus and Irf8 induction in wild type and Bxh2 macrophages.

Suppl. Fig. 6. Impact of the Bxh2 mutation on STAT1 genomic binding after extended IFN stimulation. The snapshot shows one genomic region where multiple STAT1 peaks were affected by the loss of IRF8 activity.

Suppl. Fig. 7. Impact of the Bxh2 mutation on STAT1 recruitment in response to IFN stimulation. Macrophages were stimulated as indicated and STAT1 ChIP-seq carried out at 1, 2 and 4h after treatment. A) Two PWMs were retrieved from STAT1 peaks, the first one corresponding to a canonical GAS site and the second one to a dimeric IRF site. B) A representative genomic region showing the limited impact of IRF8 loss on STAT1 recruitment.

Supplemental Table legends

Suppl. Table 1. IRF8 genomic occupancy in basal and LPS-stimulated macrophages.

Suppl. Table 2. GREAT analysis on different classes of IRF genomic binding sites.

Suppl. Table 3. IRF1 ChIP-seq peaks in untreated and LPS-stimulated macrophages.

Suppl. Table 4. RNA-seq analysis in LPS-stimulated wild type and Bxh2 macrophages.

Suppl. Table 5. IRF8 binding and expression of purinergic receptor genes in wild type and Bxh2 macrophages.

Suppl. Table 6. RNA-seq analysis in IFN-stimulated wild type and Bxh2 macrophages.

Suppl. Table 7. Summary of the ChIP-seq and RNA-seq data sets in this study.

Supplemental computational methods

ChIP-seq data analysis. Short reads obtained from Illumina HiSeq 2000 were quality filtered according to the Illumina pipeline. Analysis of the datasets was automated using the Fish the ChIPs pipeline (Barozzi et al. 2011), which includes the alignment to the mm9 reference mouse genome using Bowtie v0.12.7 (Langmead et al. 2009). All the reads with a unique match to the genome and with two or fewer mismatches (-m 1 -v 2) were retained. Peak calling was performed using MACS v1.4 (Zhang et al. 2008) with default parameters (gsize=2.72e9, tsize=36). MACS compares the distribution of reads in each ChIP to a control sample, looking at local biases in the nearby region. MACS is also capable of dealing with possible PCR biases introduced during the preparation of the samples, removing duplicate tags in excess of what is expected given the sequencing depth. Each ChIP was compared to input DNA derived from mouse BMDM (GEO accession: GSM499415). We used a threshold of p=1e-10 for peak calling of the ChIP versus the input DNA. Samples were collected in two separate batches, one for LPS and one for IFN. For this reason, two different untreated samples were generated and used separately for the analyses. Gene Interval Notator (GIN), a tool included in the CARPET suite (Cesaroni et al. 2008), was then used to annotate all regions over mm9 RefSeq genes extracted from the UCSC genome browser (Karolchik et al. 2014). GIN was run with priority set to “gene” and “-20000” as promoter definition. In order to visualize the raw profiles on the UCSC Genome Browser, wiggle files were generated with MACS and converted to bigwig file format (Kent et al. 2010). Tracks were linearly re-scaled to the same sequencing depth.

ChIP-seq enriched regions (peaks) classification. Applying the same threshold of 1e-10, MACS was run to compare the untreated (UT) to the treated (TR) samples in both the LPS and IFN batches. The analysis was performed both ways (UT vs. TR and TR vs. UT). Regions enriched in TR versus UT were defined as induced, while regions enriched in UT versus TR were defined as repressed. These sets of differentially enriched regions were subsequently filtered, keeping only the peaks found to overlap a region enriched against the input. The fraction of peaks that remained unchanged between these two conditions was considered as invariant or constitutive. Finally, induced regions that do not overlap with UT regions enriched versus input were defined as new. The same approach was followed in order to define subsets of constitutive, inducible and repressed peaks between wild type and Bxh2 paired samples. Regions enriched in Bxh2 versus wt were defined as induced in Bxh2, while regions enriched in wt versus Bxh2 were defined as repressed in Bxh2. These sets were filtered against the input. The fraction of unchanged peaks between these two conditions was considered as constitutively present (invariant) between wt and Bxh2 macrophages.
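The classification of regions enriched in TR versus UT can be illustrated with a minimal Python sketch. It assumes that peaks are available as (chromosome, start, end) tuples and that any overlap of at least one base counts; the function names, the labels and the overlap criterion are hypothetical and are not taken from the original pipeline.

```python
from typing import Dict, List, Tuple

# (chromosome, start, end), half-open coordinates; an assumed representation
Interval = Tuple[str, int, int]


def overlaps(a: Interval, b: Interval) -> bool:
    """True if the two intervals share at least one base on the same chromosome."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def overlaps_any(region: Interval, others: List[Interval]) -> bool:
    return any(overlaps(region, other) for other in others)


def classify_induced(
    tr_vs_ut: List[Interval],     # regions enriched in treated versus untreated
    tr_vs_input: List[Interval],  # treated peaks enriched versus input
    ut_vs_input: List[Interval],  # untreated peaks enriched versus input
) -> Dict[Interval, str]:
    """Label TR-enriched regions as 'induced' or 'induced_new'.

    Regions that do not overlap a peak enriched against the input are dropped.
    Induced regions with no overlapping untreated peak are flagged as new.
    """
    labels: Dict[Interval, str] = {}
    for region in tr_vs_ut:
        if not overlaps_any(region, tr_vs_input):
            continue  # filtered out: not supported by enrichment versus input
        if overlaps_any(region, ut_vs_input):
            labels[region] = "induced"
        else:
            labels[region] = "induced_new"
    return labels
```

Repressed regions would follow from the same logic with the roles of UT and TR swapped.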

De novo motif discovery. We ran MEME v4.6.1 (Bailey et al. 2009) considering a window of +/- 100 nucleotides around the peak summit of the top 1000 peaks (as determined by MACS p-value). The following parameters were used: -dna -mod zoops -evt 0.01 -nmotifs 10 -minw 6 -maxw 16 -revcomp.
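As an illustration of how the input windows for motif discovery could be derived, the sketch below ranks peaks and takes +/- 100 nucleotides around each summit. The dictionary keys (chrom, summit, pvalue) are hypothetical, and the code assumes raw p-values where smaller means more significant; MACS output would need to be converted to this form first.

```python
from typing import Dict, List, Tuple


def summit_windows(
    peaks: List[Dict], top_n: int = 1000, flank: int = 100
) -> List[Tuple[str, int, int]]:
    """Return (chrom, start, end) windows around the summits of the best peaks.

    Peaks are ranked by ascending p-value (assumed raw p-values) and the
    window start is clipped at zero.
    """
    ranked = sorted(peaks, key=lambda peak: peak["pvalue"])[:top_n]
    return [
        (peak["chrom"], max(0, peak["summit"] - flank), peak["summit"] + flank)
        for peak in ranked
    ]
```

The sequences of these windows would then be extracted from the genome and passed to MEME with the parameters listed above.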

Supervised learning using Support Vector Machines. Support vector machines (SVMs) (Cortes and Vapnik 1995) are supervised learning models used to learn patterns useful for classification and regression analysis (Drucker et al. 1997). Given a set of training examples (each belonging to one and only one category), an SVM learning algorithm builds a model that can be used to categorize new examples. In this specific case, the libSVM implementation (Chang and Lin 2011) coupled with a feature selection procedure (Guyon and Elisseeff 2003) was used to identify the most relevant sequence features able to discriminate IRF8-bound inducible from IRF8-bound constitutive regions. Using 50% of the total instances, ten forward feature selection runs were performed, randomizing training and validation sets (50% each). The features selected in at least one out of ten randomizations were then pooled and used to train the machine on the entire 50% and test on the remaining 50%. Train and test sets were also randomized ten times, for a total of 100 partially independent feature selection runs. For the cutoffs applied during feature selection and for any other detail, refer to the Supplementary Methods of Barozzi et al. (2014). Performance was assessed as overall accuracy, defined as the fraction of instances correctly predicted and calculated as (TP+TN) / (TP+FP+TN+FN), with inducible regions as the positive set and constitutive regions as the negative set (TP = true positive, FP = false positive, TN = true negative, FN = false negative).
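The overall accuracy measure can be written out as a short Python sketch. The label strings and the function name are illustrative assumptions; only the formula (TP+TN) / (TP+FP+TN+FN) and the choice of inducible regions as the positive class come from the text.

```python
from typing import Sequence


def overall_accuracy(
    true_labels: Sequence[str],
    predicted_labels: Sequence[str],
    positive: str = "inducible",
) -> float:
    """Fraction of correctly predicted instances: (TP+TN) / (TP+FP+TN+FN)."""
    tp = fp = tn = fn = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        if predicted == positive:
            if truth == positive:
                tp += 1  # inducible region predicted as inducible
            else:
                fp += 1  # constitutive region predicted as inducible
        else:
            if truth == positive:
                fn += 1  # inducible region predicted as constitutive
            else:
                tn += 1  # constitutive region predicted as constitutive
    return (tp + tn) / (tp + fp + tn + fn)
```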

Measuring features in the DNA sequence of IRF8-bound regions. Features were assessed in a 200 bp window centered on the summit of the ChIP-seq peaks. Position weight matrices (PWMs) for SVM3 were previously collected (Barozzi et al. 2014). Analysis was limited to PWMs of TFs showing mRNA expression (FPKM>=1) across RNA-seq samples either in basal or LPS-stimulated conditions from WT or BXH2 mice. FIMO (version included in MEME 4.6.1) (Grant et al. 2011) was used to scan the regions of interest. Results were then summarized at the level of subfamily of transcription factors using the annotation available in TFClass (Wingender et al. 2013), as described previously (Barozzi et al. 2014). Pattern matching of DNA strings was performed with PatMatch 1.2 (zero mismatches, -n, -c) (Yan et al. 2005).

Scatterplot of ChIP-Seq regions. The number of reads for each region was normalized based on the sequencing depth of the smallest sample. Counts were then normalized to the size of the region in kbp and log2 transformed. Each dot in Fig. 2 was colored according to the enrichment between the two samples (MACS 1e-10) of the corresponding region (Red: enriched in wt; Blue: enriched in Bxh2; Grey: no enrichment).
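The normalization used for the scatterplots can be sketched in Python as follows. The data layout and the pseudocount added before the log2 transformation are assumptions made for illustration, since the text does not state how regions with zero reads were handled.

```python
import math
from typing import Dict, List


def normalize_counts(
    counts: Dict[str, List[float]],  # sample -> read count per region
    depths: Dict[str, int],          # sample -> total number of reads
    lengths_bp: List[int],           # region lengths in bp
    pseudocount: float = 1.0,        # assumption: avoids log2(0)
) -> Dict[str, List[float]]:
    """Scale to the smallest sample, divide by region size in kbp, then log2."""
    min_depth = min(depths.values())
    normalized = {}
    for sample, values in counts.items():
        scale = min_depth / depths[sample]  # linear rescaling to the smallest sample
        normalized[sample] = [
            math.log2(count * scale / (length / 1000.0) + pseudocount)
            for count, length in zip(values, lengths_bp)
        ]
    return normalized
```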

Heatmaps of ChIP-Seq regions. Considering each antibody independently, the number of reads for each region was normalized to the sequencing depth of the smallest sample. In addition, because these regions were not homogeneous in length, counts were normalized to the size of the region in kbp. To avoid any bias due to outliers, a saturation procedure was performed: considering each antibody independently, counts exceeding a given value were set to this value (Fig. 1B: 90th percentile; Fig. 4A: 95th percentile). Values were then rescaled to the range 0-1, still considering each antibody independently. Finally, regions were sorted according to their chromosome and start.
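A minimal Python sketch of the saturation and rescaling steps is given below. The linear-interpolation percentile and the min-max rescaling to 0-1 are assumptions; the text specifies only the percentile cutoffs and the target range.

```python
from typing import List


def percentile(values: List[float], q: float) -> float:
    """Percentile with linear interpolation between ranks (q between 0 and 100)."""
    ordered = sorted(values)
    if len(ordered) == 1:
        return ordered[0]
    position = (len(ordered) - 1) * q / 100.0
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (position - lower)


def saturate_and_rescale(values: List[float], q: float = 90.0) -> List[float]:
    """Cap values at the q-th percentile, then min-max rescale to the range 0-1."""
    cap = percentile(values, q)
    capped = [min(value, cap) for value in values]
    low, high = min(capped), max(capped)
    if high == low:
        return [0.0] * len(capped)  # constant input: nothing to rescale
    return [(value - low) / (high - low) for value in capped]
```

Applying the function separately to the normalized counts of each antibody mirrors the per-antibody treatment described above.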

Functional enrichment analysis of ChIP-Seq enriched regions using GREAT. For each list of ChIP-Seq peaks of interest, GREAT 2.0.2 (McLean et al. 2010) was used with default parameters and selecting the whole mm9 genome as background.

RNA-seq analysis. After quality filtering according to the Illumina pipeline, paired-end reads were aligned to the mm9 mouse reference genome and to the Mus musculus transcriptome (Ensembl build 63) (Flicek et al. 2012) using TopHat 1.3.1 (Trapnell et al. 2009). We allowed up to two mismatches and specified a mean distance between pairs (-r) of 120 bp. Transcript abundances and differentially expressed genes were quantified using Cufflinks 1.2.1 (Trapnell et al. 2010). During transcript quantification we used the options -N (which enables upper-quartile normalization) and -u (which allows a better weighting of the multi-mapping reads). For subsequent analyses we considered the information at the level of genes. Differentially expressed genes were defined by minimum FPKM (fragments per kilobase of exon per million fragments mapped) in at least one experimental condition, p-value and fold-change (FC) (Fig. 3A: FPKM=2, p=0.01, FC=2; Fig. 5A: FPKM=0.5, p=0.05, FC=2). Tracks for the UCSC genome browser (Fujita et al. 2011) were generated with samtools 0.1.18 (Li et al. 2009) and bedtools 2.16.1 (Quinlan and Hall 2010) using the uniquely aligned reads. Tracks were linearly re-scaled to the same sequencing depth.
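The three criteria used to call differentially expressed genes can be combined as in the following Python sketch. Whether the thresholds are inclusive, and how a fold-change is defined when one FPKM value is zero, are assumptions made here for illustration; the function names are hypothetical.

```python
def fold_change(fpkm_a: float, fpkm_b: float) -> float:
    """Symmetric fold-change (always >= 1); infinite if only one value is zero."""
    high, low = max(fpkm_a, fpkm_b), min(fpkm_a, fpkm_b)
    if low == 0:
        return float("inf") if high > 0 else 1.0
    return high / low


def is_differentially_expressed(
    fpkm_a: float,
    fpkm_b: float,
    p_value: float,
    min_fpkm: float = 2.0,  # Fig. 3A thresholds as defaults
    max_p: float = 0.01,
    min_fc: float = 2.0,
) -> bool:
    """Apply the minimum FPKM, p-value and fold-change filters together."""
    expressed = max(fpkm_a, fpkm_b) >= min_fpkm  # in at least one condition
    return expressed and p_value <= max_p and fold_change(fpkm_a, fpkm_b) >= min_fc
```

The Fig. 5A settings would correspond to calling the function with min_fpkm=0.5 and max_p=0.05.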

Heatmaps of differentially expressed genes. For each gene, FPKM values in each sample were log2 transformed and reported as the fold-change (FC) relative to the untreated wt sample. To avoid any bias due to outliers, a saturation procedure was performed: values lower than -5 were set to -5 and values higher than 5 were set to 5. In Fig. 3A, before saturation, genes were subjected to unsupervised hierarchical clustering (method=average; distance=Pearson correlation). In order to identify relevant clusters, we cut the dendrogram with the cutreeDynamic function of the Dynamic Tree Cut R package v1.60-1 (Langfelder et al. 2008) (method="hybrid", cutHeight=NULL, minClusterSize=40, deepSplit=0, minGap=0.15).

Gene Ontology (GO) enrichment analysis of differentially expressed genes. Ingenuity Pathway Analysis software (IPA) (Qiagen, Redwood City, California; www.ingenuity.com) was used with default parameters.
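The log2 fold-change and saturation steps described for the expression heatmaps can be sketched in Python as follows. The pseudocount added to both FPKM values is an assumption introduced to keep the ratio defined when a gene is not detected; the text does not specify how such cases were treated.

```python
import math


def clamped_log2_fc(
    fpkm: float,
    reference_fpkm: float,     # FPKM in the untreated wt sample
    pseudocount: float = 1.0,  # assumption: avoids division by zero and log2(0)
    limit: float = 5.0,
) -> float:
    """log2 fold-change versus the reference, saturated to [-limit, +limit]."""
    log_fc = math.log2((fpkm + pseudocount) / (reference_fpkm + pseudocount))
    return max(-limit, min(limit, log_fc))
```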

Gene Set Enrichment Analysis (GSEA). Gene set enrichment analysis (Subramanian et al. 2005) tests whether a defined set of genes shows concordant and significant differences between two biological states. Transcripts were ranked based on their difference in expression between wt and Bxh2 macrophages. Using previously defined IFN-regulated and IRF3-dependent sets of genes (Chen et al. 2012), an enrichment score reflecting the degree of over-representation at the extremes (top or bottom) of the ranked list was computed. A p-value for the score was estimated using an empirical distribution of scores built upon GSEA run over 1,000 random datasets (obtained by permuting the labels of the genes in the original dataset). Finally, the significance level was adjusted for multiple hypothesis testing. The GSEA Java implementation was used to perform these analyses.
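The principle behind the enrichment score and its permutation-based p-value can be illustrated with a simplified Python sketch. It uses an unweighted, Kolmogorov-Smirnov-like running sum and a fixed random seed, both of which are assumptions; the GSEA Java implementation used for the analyses applies a weighted statistic and additional normalization that are not reproduced here.

```python
import random
from typing import List, Set


def enrichment_score(ranked_genes: List[str], gene_set: Set[str]) -> float:
    """Maximum deviation from zero of an unweighted running-sum statistic."""
    hits = [gene in gene_set for gene in ranked_genes]
    n_hits = sum(hits)
    n_misses = len(ranked_genes) - n_hits
    if n_hits == 0 or n_misses == 0:
        return 0.0
    running, best = 0.0, 0.0
    for is_hit in hits:
        # step up at gene-set members, step down otherwise
        running += 1.0 / n_hits if is_hit else -1.0 / n_misses
        if abs(running) > abs(best):
            best = running
    return best


def permutation_p_value(
    ranked_genes: List[str], gene_set: Set[str], n_perm: int = 1000, seed: int = 0
) -> float:
    """Empirical two-sided p-value obtained by permuting the gene labels."""
    rng = random.Random(seed)
    observed = abs(enrichment_score(ranked_genes, gene_set))
    shuffled = list(ranked_genes)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(enrichment_score(shuffled, gene_set)) >= observed:
            extreme += 1
    return (extreme + 1) / (n_perm + 1)  # add-one correction avoids p = 0
```

A positive score indicates that the gene set is concentrated at the top of the ranked list, a negative score that it is concentrated at the bottom.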

Statistics and plots. R software 2.15.1 was used to compute statistics and generate plots.

Accession numbers. Raw datasets are available in the Gene Expression Omnibus (GEO) database (http://www.ncbi.nlm.nih.gov/geo) under the accession GSE56123, which comprises ChIP-seq data (GSE56121) and expression data (GSE56122).

Supplemental References

Bailey TL, Boden M, Buske FA, Frith M, Grant CE, Clementi L, Ren J, Li WW, Noble WS. 2009. MEME SUITE: tools for motif discovery and searching. Nucleic acids research 37: W202-208.

Barozzi I, Simonatto M, Bonifacio S, Yang L, Rohs R, Ghisletti S, Natoli G. 2014. Coregulation of transcription factor binding and nucleosome occupancy through DNA features of mammalian enhancers. Molecular cell 54: 844-857.

Barozzi I, Termanini A, Minucci S, Natoli G. 2011. Fish the ChIPs: a pipeline for automated genomic annotation of ChIP-Seq data. Biol Direct 6: 51.

Cesaroni M, Cittaro D, Brozzi A, Pelicci PG, Luzi L. 2008. CARPET: a web-based package for the analysis of ChIP-chip and expression tiling data. Bioinformatics 24: 2918-2920.

Chang C-C, Lin C-J. 2011. LIBSVM: A library for support vector machines. ACM Trans Intell Syst Technol 2: 1-27.

Chen X, Barozzi I, Termanini A, Prosperini E, Recchiuti A, Dalli J, Mietton F, Matteoli G, Hiebert S, Natoli G. 2012. Requirement for the histone deacetylase Hdac3 for the inflammatory gene expression program in macrophages. Proceedings of the National Academy of Sciences of the United States of America 109: E2865-2874.

Cortes C, Vapnik V. 1995. Support-vector networks. Mach Learn 20: 273-297.

Drucker H, Burges CJ, Kaufman L, Smola A, Vapnik V. 1997. Support vector regression machines. Advances in neural information processing systems: 155-161.

Flicek P, Amode MR, Barrell D, Beal K, Brent S, Carvalho-Silva D, Clapham P, Coates G, Fairley S, Fitzgerald S et al. 2012. Ensembl 2012. Nucleic Acids Res 40: D84-90.

Fujita PA, Rhead B, Zweig AS, Hinrichs AS, Karolchik D, Cline MS, Goldman M, Barber GP, Clawson H, Coelho A et al. 2011. The UCSC Genome Browser database: update 2011. Nucleic acids research 39: D876-882.

Grant CE, Bailey TL, Noble WS. 2011. FIMO: scanning for occurrences of a given motif. Bioinformatics 27: 1017-1018.

Guyon I, Elisseeff A. 2003. An introduction to variable and feature selection. J Mach Learn Res 3: 1157-1182.

Karolchik D, Barber GP, Casper J, Clawson H, Cline MS, Diekhans M, Dreszer TR, Fujita PA, Guruvadoo L, Haeussler M et al. 2014. The UCSC Genome Browser database: 2014 update. Nucleic acids research 42: D764-770.

Kent WJ, Zweig AS, Barber G, Hinrichs AS, Karolchik D. 2010. BigWig and BigBed: enabling browsing of large distributed datasets. Bioinformatics 26: 2204-2207.

Langfelder P, Zhang B, Horvath S. 2008. Defining clusters from a hierarchical cluster tree: the Dynamic Tree Cut package for R. Bioinformatics 24: 719-720.

Langmead B, Trapnell C, Pop M, Salzberg SL. 2009. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10: R25.

Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R. 2009. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25: 2078-2079.

McLean CY, Bristor D, Hiller M, Clarke SL, Schaar BT, Lowe CB, Wenger AM, Bejerano G. 2010. GREAT improves functional interpretation of cis-regulatory regions. Nature biotechnology 28: 495-501.

Quinlan AR, Hall IM. 2010. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26: 841-842.

Trapnell C, Pachter L, Salzberg SL. 2009. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 25: 1105-1111.

Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg SL, Wold BJ, Pachter L. 2010. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature biotechnology 28: 511-515.

Wingender E, Schoeps T, Donitz J. 2013. TFClass: an expandable hierarchical classification of human transcription factors. Nucleic acids research 41: D165-170.

Yan T, Yoo D, Berardini TZ, Mueller LA, Weems DC, Weng S, Cherry JM, Rhee SY. 2005. PatMatch: a program for finding patterns in peptide and nucleotide sequences. Nucleic acids research 33: W262-266.

Zhang Y, Liu T, Meyer CA, Eeckhoute J, Johnson DS, Bernstein BE, Nussbaum C, Myers RM, Brown M, Li W et al. 2008. Model-based analysis of ChIP-Seq (MACS). Genome Biol 9: R137.
