EDACC Quality Characterization for Various Epigenetic Assays
Cristian Coarfa
Bioinformatics Research Laboratory, Molecular and Human Genetics


Data Types Submitted To EDACC

• ChIP-Seq

• Methyl-C

• RRBS

• MRE-Seq

• MeDIP-Seq

• Chromatin Accessibility

• small RNA-Seq

• mRNA-Seq

Quality Characterization

• How to measure the quality of mapped reads?
– Note: this is not the quality of sequencing; statistics on that are provided by the sequencer
• Most labs do some sort of visual inspection
• Metrics for characterizing level 2 data quality
• Apply them to the various data types submitted to EDACC

Enrichment Based Protocols

ChIP-Seq, MeDIP-Seq, Chromatin Accessibility

• Methods implemented
– PTIH (percent tags in hotspots)
– iROC (integral of ROC)
– Percent tags in peaks (FindPeaks)
– Poisson enrichment metric
• Implemented in EDACC pipeline
– Metrics computed on all submitted data

PTIH (percent tags in hotspots)

• Detect enriched regions using “hotspot” algorithm
• PTIH = percentage of all tags that fall in hotspots
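As a minimal sketch of the PTIH definition, the fraction of tags inside hotspots can be computed with a sorted interval lookup. The function name, the half-open `(start, end)` interval convention, and the toy coordinates are assumptions made for illustration; the hotspot calls themselves would come from the hotspot algorithm. The same counting would apply to reads falling in FindPeaks peaks.

```python
from bisect import bisect_right


def percent_tags_in_regions(tag_positions, regions):
    """Fraction of tags whose position falls inside any region.

    tag_positions: iterable of integer coordinates on one chromosome.
    regions: list of (start, end) half-open intervals, assumed non-overlapping.
    """
    regions = sorted(regions)
    starts = [start for start, _ in regions]
    inside = 0
    total = 0
    for pos in tag_positions:
        total += 1
        # Index of the last region starting at or before this tag.
        i = bisect_right(starts, pos) - 1
        if i >= 0 and pos < regions[i][1]:
            inside += 1
    return inside / total if total else 0.0


# Toy data (hypothetical): two hotspots and eight tags on one chromosome.
hotspots = [(100, 350), (1000, 1250)]
tags = [120, 340, 350, 999, 1000, 1100, 5000, 7000]
ptih = percent_tags_in_regions(tags, hotspots)
```

The result is reported here as a fraction between 0 and 1, matching the way PTIH values are quoted in these slides.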

Hotspot algorithm

Scan statistic gauging enrichment with a z-score based on the binomial distribution.

[Figure: a 250 bp window containing n tags, inside a surrounding 50 kb window containing N tags]

Binomial distribution gives probability of seeing n tags in the small window given N tags total in the large window. This adjusts for local background fluctuations (due to CNV, for instance).
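One way to write the z-score described above is sketched here. It assumes that, under the null, each of the N tags in the large window lands in the small window with probability p = 250 bp / 50 kb; that choice of p and the function name are assumptions, and the actual hotspot implementation may differ in its details.

```python
import math


def hotspot_z_score(n, N, small_bp=250, large_bp=50_000):
    """Binomial z-score for n tags in the small window given N in the large one.

    Assumes a tag in the large window falls in the small window with
    probability p = small_bp / large_bp (uniform local background).
    """
    p = small_bp / large_bp
    expected = N * p
    sd = math.sqrt(N * p * (1 - p))
    if sd == 0:
        return 0.0
    return (n - expected) / sd


# Hypothetical counts: 12 tags in the 250 bp window, 400 in the 50 kb window.
z = hotspot_z_score(n=12, N=400)
```

Because the expectation is taken from the surrounding 50 kb window rather than from the whole genome, a region with a locally elevated background (for example a copy-number gain) needs proportionally more tags to reach the same z-score.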

PTIH values

[Figure: example data sets with PTIH values of 0.48, 0.48, 0.72, and 0.19]

Ratio of Tags in Peaks

• Determine uniquely mapping reads
• Use FindPeaks to call peaks
• Count reads mapping into peaks
– percentage of total mapped reads

Poisson Based Enrichment Method

• Determine uniquely mapping reads
• Remove duplicate reads
• Bin the reads into 1kb windows
• Infer parameters of a simple Poisson distribution
• Filter enriched windows
– p-value < 0.01
• Count reads mapping into enriched windows
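The steps above can be sketched for a single chromosome as follows. This is an illustration under stated assumptions, not the EDACC pipeline code: reads are represented only by their start coordinate, duplicates are collapsed by identical start (strand is ignored), the Poisson rate is estimated as the mean count per window, and the function names are hypothetical.

```python
import math
from collections import Counter


def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), computed as 1 - CDF(k - 1)."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)
    cdf = term
    for i in range(1, k):
        term *= lam / i
        cdf += term
    return max(0.0, 1.0 - cdf)


def poisson_enrichment(read_starts, chrom_length, bin_size=1000, alpha=0.01):
    """Fraction of de-duplicated reads that fall in enriched windows."""
    unique_starts = set(read_starts)  # remove duplicate reads
    if not unique_starts:
        return 0.0
    n_bins = math.ceil(chrom_length / bin_size)
    counts = Counter(pos // bin_size for pos in unique_starts)
    lam = len(unique_starts) / n_bins  # mean reads per window
    enriched = [c for c in counts.values() if poisson_sf(c, lam) < alpha]
    return sum(enriched) / len(unique_starts)


# Toy data (hypothetical): 20 reads piled up in one window, 30 background
# reads spread one per window, and one exact duplicate that gets collapsed.
reads = list(range(5000, 5020)) + [5000] + [b * 1000 for b in range(10, 40)]
fraction = poisson_enrichment(reads, chrom_length=100_000)
```

In this toy case only the window holding the 20-read pile-up passes the p-value < 0.01 filter, so 20 of the 50 unique reads count as enriched.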

Next Step – Metrics Evaluation

• Metrics probe different features of data
• Use visual inspection to ascertain which (one or more) of the proposed methods capture useful aspects of data quality

ChIP-Seq/Chromatin Accessibility/FindPeaks QC Metrics

• Collaborative efforts between centers

• ~330 lanes of verified ChIP-Seq, MeDIP-Seq, and Chromatin Accessibility data

• Accessible in Epigenome Atlas

Going forward

• EDACC will run the QC metrics continuously on all submitted data
– Option to automatically flag data that fall below specified thresholds
– For most data types we need further experience on what thresholds make sense
• Include QC metrics in metadata
– Provide downstream users with this information
• Note that we are breaking new ground
– Uniform quality scoring is not being performed by other major consortia (ENCODE, modENCODE)
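A minimal sketch of the automatic flagging option might look like the following. The threshold values are placeholders invented for illustration (the slides state that sensible thresholds are still to be determined), and the function and metric names are hypothetical.

```python
def flag_low_quality(metrics, thresholds):
    """Return the names of metrics that fall below their minimum threshold."""
    return sorted(
        name
        for name, minimum in thresholds.items()
        if name in metrics and metrics[name] < minimum
    )


# Placeholder thresholds, NOT recommended values.
thresholds = {"ptih": 0.30, "reads_in_peaks": 0.20}

# QC metrics for one hypothetical lane, stored alongside its metadata.
lane = {"ptih": 0.19, "reads_in_peaks": 0.35}
flags = flag_low_quality(lane, thresholds)
```

Keeping the thresholds in a separate mapping makes it easy to adjust them per data type as experience accumulates.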

Pearson correlation for ChIP-Seq Histone Modification

• Using raw density maps at 10kb resolution
• Process
– Select uniquely mapping reads
– Extend 200bp in mapping strand direction
– Remove monoclonal reads
– Build density map
– Pearson correlation with other submitted marks

• Ideally: a mark correlates best with other experiments for the same assay

• How well does Pearson correlation work?
– It helped us identify 5 bad lanes; the REMCs retracted the data
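The density-map and correlation steps can be sketched as below. Several details are assumptions for illustration: a read is given as `(five_prime_position, strand)`, "monoclonal" reads are taken to mean reads with identical position and strand, and density is measured as covered base pairs per 10 kb bin. The function names are hypothetical.

```python
import math


def density_map(reads, chrom_length, bin_size=10_000, fragment_len=200):
    """Covered base pairs per bin after extending reads to fragment_len."""
    n_bins = -(-chrom_length // bin_size)  # ceiling division
    density = [0] * n_bins
    for pos, strand in set(reads):  # drop monoclonal (identical) reads
        if strand == "+":
            start, end = pos, pos + fragment_len
        else:
            start, end = pos - fragment_len + 1, pos + 1
        start, end = max(start, 0), min(end, chrom_length)
        b = start // bin_size
        while start < end:  # split the fragment across bin boundaries
            bin_end = min((b + 1) * bin_size, end)
            density[b] += bin_end - start
            start = bin_end
            b += 1
    return density


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (nan if constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


# Toy lanes (hypothetical) on a 30 kb chromosome.
lane_a = density_map([(9900, "+"), (9900, "+"), (25000, "-")], 30_000)
lane_b = density_map([(9950, "+"), (25100, "-")], 30_000)
r = pearson(lane_a, lane_b)
```

Computing `pearson` for every pair of submitted lanes gives the correlation matrix used to check whether a lane groups with other experiments for the same mark.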

PCA Analysis

• 10kb windows on chr20

• PCA using Pearson correlation metric

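Pearson correlation metric

[Figure: PCA plot using the Pearson correlation metric. Labeled groups: H3K27me3; H3K36me3; H3K9me3; Input; H3K4me3 and H3K9ac; acetylation marks (H2AK5ac, H2BK120ac, H2BK12ac, H2BK15ac, H2BK20ac, H3K14ac, H3K18ac, H3K23ac, H3K27ac, H3K4ac, H3K56ac, H4K5ac, H4K8ac, H4K91ac); H3K79me1; H3K20me1. Axis label: PCA 53.8%]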

MRE-Seq

• Reads are mapped onto reference genome
• Uniquely mapping reads are kept
• Build the fragment map of expected mapping locations
– based on the enzyme cocktail used
• Count reads mapping within the expected digest fragments
• 76-99% of reads map within expected fragments
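Building the expected fragment map amounts to an in-silico digest of the reference. The sketch below is an assumption-laden illustration: the recognition sequences `CCGG` and `GCGC` are example inputs only (substitute the cocktail actually used), the cut position is simplified to the start of each recognition site, and the size limits and function names are hypothetical.

```python
import re


def digest_fragments(sequence, sites=("CCGG", "GCGC"), min_len=1, max_len=10**9):
    """Fragments between consecutive recognition sites, as (start, end) pairs."""
    seq = sequence.upper()
    # Lookahead so that overlapping occurrences of a site are all found.
    cuts = sorted(
        {m.start() for site in sites for m in re.finditer(f"(?={site})", seq)}
    )
    return [(a, b) for a, b in zip(cuts, cuts[1:]) if min_len <= b - a <= max_len]


def fraction_within(reads, fragments):
    """Fraction of (start, end) reads fully contained in some fragment."""
    if not reads:
        return 0.0
    hits = sum(
        any(fs <= rs and re_ <= fe for fs, fe in fragments) for rs, re_ in reads
    )
    return hits / len(reads)


# Toy reference and reads (hypothetical).
reference = "AACCGGTTTTGCGCAAAACCGGTT"
fragments = digest_fragments(reference)
reads = [(3, 8), (11, 17), (0, 5), (19, 23)]
within = fraction_within(reads, fragments)
```

For a real genome the linear scan in `fraction_within` would be replaced by a sorted or indexed lookup, but the counting logic stays the same.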

mRNA-Seq

• Reads are mapped onto reference genome
• Uniquely mapping reads are kept
• Count reads mapping within UCSC gene exons
• 70-90% of reads map within gene exons
– UCSC known genes
– Entrez genes

Small RNA-Seq

• Trim adaptors
• Reads are mapped onto reference genome
• Reads mapping up to 100 locations are kept
• Count reads overlapping with known small RNAs
– miRNAs, piRNAs, sno/scaRNAs, repeat RNAs
• At least 30% of reads overlap with known small RNAs

Bisulfite Sequencing

• Map using Pash
• Methyl-C
– Genome wide
– QC
  • C->T conversion rates; typically 99%
• RRBS
– Enzyme cocktail
– QC
  • Map within expected cut sites
  • Ratio varies 40%-90%
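The slides do not say how the C->T conversion rate is computed, so the following is only one plausible sketch. It assumes ungapped forward-strand alignments given as `(reference, read)` string pairs of equal length, and it uses non-CpG cytosines as a proxy for unmethylated positions, which is itself an assumption that does not hold in every cell type. The function name is hypothetical.

```python
def conversion_rate(pairs):
    """Fraction of non-CpG reference cytosines that are read as T."""
    converted = unconverted = 0
    for ref, read in pairs:
        ref, read = ref.upper(), read.upper()
        for i, (r, q) in enumerate(zip(ref, read)):
            if r != "C":
                continue
            # Skip CpG sites and the final base, whose context is unknown.
            if i + 1 >= len(ref) or ref[i + 1] == "G":
                continue
            if q == "T":
                converted += 1
            elif q == "C":
                unconverted += 1
    total = converted + unconverted
    return converted / total if total else float("nan")


# Toy alignments (hypothetical): four converted and one unconverted cytosine.
alignments = [
    ("ACCTGCATCGAC", "ATTTGTATCGAT"),
    ("CCAT", "CTAT"),
]
rate = conversion_rate(alignments)
```

A rate well below the typical 99% would suggest incomplete bisulfite conversion, under the stated assumption about non-CpG sites.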

QC for MeDIP-Seq Data Using Galaxy

Exercise

• Download the input MeDIP-Seq file from the workshop wiki

• Determine the ratio of reads in peaks using FindPeaks in Galaxy
