examining gene expression and methylation with next gen sequencing

GENETIC ANALYSIS of Complex Human Diseases

Examining Gene Expression and Methylation with Next-Gen Sequencing

Stephen Turner, Ph.D. Bioinformatics Core Director bioinformatics.virginia.edu

University of Virginia


Gene expression pre-2008

- PCR
- Microarrays


Advantages of RNA-Seq

- No reference necessary
- Low background (no cross-hybridization)
- Unlimited dynamic range (FC 9000, Science 320:1344)
- Direct counting (microarrays: indirect, via hybridization)
- Can characterize full transcriptome
  - mRNA and ncRNA (miRNA, lncRNA, snoRNA, etc.)
  - Differential gene expression
  - Differential coding output
  - Differential TSS usage
  - Differential isoform expression


Isoform level data


Differential splicing & TSS use


Is it accurate?

- Marioni et al. RNA-seq: An assessment of technical reproducibility and comparison with gene expression arrays. Genome Research 2008 18:1509.


RNA-Seq Challenges

- Library construction
  - Size selection (messenger, small)
  - Strand specificity?
- Bioinformatic challenges
  - Spliced alignment
  - Transcript deconvolution
- Statistical challenges
  - Highly variable abundance
  - Sample size: never, ever, plan n=1
  - Normalization (RPKM)
    - More reads from longer transcripts, higher sequencing depth
    - Want to compare features of different lengths
    - Want to compare conditions with different total sequence depth
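The RPKM normalization named above (reads per kilobase of feature per million mapped reads) can be sketched as follows. This is a minimal illustration, not the code of any particular tool; the function name and the read counts are made up for the example.

```python
def rpkm(read_count, feature_length_bp, total_mapped_reads):
    """Reads per kilobase of feature per million mapped reads."""
    if feature_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("feature length and library size must be positive")
    # Divide by length in kb and by library size in millions: factor 1e3 * 1e6.
    return read_count * 1e9 / (feature_length_bp * total_mapped_reads)


# Hypothetical numbers: two genes with the same raw count but different lengths,
# in a library of 20 million mapped reads.
long_gene = rpkm(400, 2000, 20_000_000)
short_gene = rpkm(400, 1000, 20_000_000)
print(long_gene, short_gene)
```

With equal raw counts, the shorter gene receives the higher RPKM, which is the length correction the bullets describe; dividing by library size handles the depth difference between conditions.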


RNA-Seq Overview

[Figure: samples of interest, Condition 1 (normal colon) and Condition 2 (colon tumor) -> mRNA with poly-A tail (AAAAA) paired with TTTTT -> library -> sequence reads in FASTQ format]

@HWUSI-EAS100R:6:73:941:1973#0/1
GATTTGGGGTTCAAAGCAGTATCGATCAAATA
+HWUSI-EAS100R:6:73:941:1973#0/1
!''*((((***+))%%%++)(%%%%).1***-
@HWUSI-EAS100R:6:73:941:1973#0/1
CATCGACGTAGATCGACTACATGAACTGCTCG
+HWUSI-EAS100R:6:73:941:1973#0/1
!''*+(*+!+(*!+*(((***!%%%%!%%(+-


Common question #1: Depth

- Question: How much sequence do I need?
- Answer: It's complicated.
- Oversimplified answer: 20-50 million PE reads / sample (mouse/human).
- Depends on:
  - Size & complexity of transcriptome
  - Application: differential gene expression, transcript discovery
  - Tissue type, RNA quality, library preparation
  - Sequencing type: length, paired-end vs single-end, etc.
- Find a publication in your field with similar goals.
- Good news: ¼ HiSeq lane usually sufficient.


Common question #2: Sample Size

- Question: How many samples should I sequence?
- Oversimplified answer: At least 3 biological replicates per condition.
- Depends on:
  - Sequencing depth
  - Application
  - Goals (prioritization, biomarker discovery, etc.)
  - Effect size, desired power, statistical significance
- Find a publication with similar goals


Common question #3: Workflow

- How do I analyze the data?
- No standards!
  - Unspliced aligners: BWA, Bowtie, Bowtie2, MANY others!
  - Spliced aligners: STAR, RUM, TopHat, TopHat2-Bowtie1, TopHat2-Bowtie2, GSNAP, MANY others.
  - Reference builds & annotations: UCSC, Entrez, Ensembl
  - Assembly: Cufflinks, Scripture, Trinity, G.Mor.Se, Velvet, TransABySS
  - Quantification: Cufflinks, RSEM, eXpress, MISO, etc.
  - Differential expression: Cuffdiff, Cuffdiff2, DEGseq, DESeq, edgeR, Myrna
- Like early microarray days: lots of excitement, lots of tools, little knowledge of how to integrate the tools into a pipeline!
- Benchmarks
  - Microarray: Spike-ins (Irizarry)
  - RNA-Seq: ???, simulation, ???


Common question #3: Workflow

Eyras et al. Methods to Study Splicing from RNA-Seq. http://dx.doi.org/10.6084/m9.figshare.679993

Turner SD. RNA-seq Workflows and Tools. http://dx.doi.org/10.6084/m9.figshare.662782


Phases of NGS Analysis

- Primary
  - Conversion of raw machine signal into sequence and qualities
- Secondary
  - Alignment of reads to reference genome or transcriptome
  - or de novo assembly of reads into contigs
- Tertiary
  - SNP discovery/genotyping
  - Peak discovery/quantification (ChIP, MeDIP)
  - Transcript assembly/quantification (RNA-seq)
- Quaternary
  - Differential expression
  - Enrichment, pathways, correlation, clustering, visualization, etc.
  - http://gettinggeneticsdone.blogspot.com/2012/03/pathway-analysis-for-high-throughput.html
  - http://www.slideshare.net/turnersd/pathway-analysis-2012-17947529


Primary Analysis: Get FASTQ file

@HWUSI-EAS100R:6:73:941:1973#0/1
GATTTGGGGTTCAAAGCAGTATCGATCAAATA
+HWUSI-EAS100R:6:73:941:1973#0/1
!''*((((***+))%%%++)(%%%%).1***-
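A FASTQ record is four lines: an @-prefixed name, the sequence, a +-prefixed separator, and one quality character per base. The reader below is a minimal sketch that assumes strict four-line records (no wrapped sequences); `read_fastq` is a name chosen for this example, not a function from any library.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality) tuples from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record: " + header)
        if len(seq) != len(qual):
            raise ValueError("sequence/quality length mismatch: " + header)
        yield header[1:], seq, qual


# The record shown on the slide, supplied as an in-memory file.
example = io.StringIO(
    "@HWUSI-EAS100R:6:73:941:1973#0/1\n"
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATA\n"
    "+HWUSI-EAS100R:6:73:941:1973#0/1\n"
    "!''*((((***+))%%%++)(%%%%).1***-\n"
)
records = list(read_fastq(example))
for name, seq, qual in records:
    print(name, len(seq), len(qual))
```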


"Phred-scaled" base qualities

# $p is probability base is erroneous
$Q = -10 * log($p) / log(10);        # Phred Q
$q = chr(($Q<=40? $Q : 40) + 33);    # FASTQ quality character
$Q = ord($q) - 33;                   # 33 offset

SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS.....................................................
...............................IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII......................
..........................XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
|                         |    |        |                              |                     |
33                        59   64       73                             104                   126

S - Sanger         Phred+33,  41 values (0, 40)
I - Illumina 1.3   Phred+64,  41 values (0, 40)
X - Solexa         Solexa+64, 68 values (-5, 62)
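The Phred relationship above, Q = -10 log10(p), and the character offsets can be checked with a short sketch. The function names are chosen for this example; unlike the Perl snippet, which relies on chr() truncating, the encoder here rounds Q to the nearest integer.

```python
import math


def phred_from_error_prob(p):
    """Phred quality Q = -10 * log10(p) for error probability p."""
    return -10.0 * math.log10(p)


def error_prob(q):
    """Inverse mapping: error probability for Phred quality q."""
    return 10 ** (-q / 10.0)


def encode_quality(q, offset=33, cap=40):
    """Quality character for Phred score q (Sanger offset 33 by default)."""
    return chr(min(int(round(q)), cap) + offset)


def decode_quality(qual_string, offset=33):
    """Per-base Phred scores for a FASTQ quality string."""
    return [ord(c) - offset for c in qual_string]


# Quality string from the slide's example read, decoded as Sanger Phred+33.
quals = decode_quality("!''*((((***+))%%%++)(%%%%).1***-")
print(quals[:4])
print(encode_quality(phred_from_error_prob(0.001)))
# The same character means a different score under Illumina 1.3 (Phred+64).
print(decode_quality("h", offset=64))
```

Decoding with the wrong offset shifts every score by 31, so it is worth confirming which encoding a file uses before filtering on quality.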


Secondary analysis

- Alignment back to the reference
  - Computationally demanding: can't use BLAST
  - Many algorithms (Maq, BWA, Bowtie, Bowtie2, Mosaik, NovoAlign, SOAP2, SSAHA, …)
  - http://en.wikipedia.org/wiki/List_of_sequence_alignment_software
  - Sensitivity to sequencing errors, polymorphisms, indels, rearrangements
  - Tradeoffs in time vs. memory vs. performance


RNA-Seq Workflow 1: Differential Gene Expression


RNA-Seq Workflow 2: Differential Isoform Expression, Exon Usage


Download data & software

- Public data from GEO. E.g. GSE32038
  - http://www.ncbi.nlm.nih.gov/projects/geo/query/acc.cgi?acc=GSE32038
  - Trapnell et al. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nature Protocols 2012: 7:562.
- Sequence, annotation, indexes (Ensembl)
  - iGenomes: http://tophat.cbcb.umd.edu/igenomes.html
  - Genes: /Annotation/Genes/genes.gtf
  - Indexes: /Sequence/BowtieIndex/genome.*
- Software:
  - Samtools: http://samtools.sourceforge.net/
  - FastQC: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
  - Bowtie: http://bowtie-bio.sourceforge.net/index.shtml
  - Tophat: http://tophat.cbcb.umd.edu/
  - HTSeq: http://www-huber.embl.de/users/anders/HTSeq/doc/overview.html
  - R: http://www.r-project.org/
  - DESeq2: http://www.bioconductor.org/packages/2.12/bioc/html/DESeq2.html
  - Cufflinks: http://cufflinks.cbcb.umd.edu/
  - cummeRbund: http://compbio.mit.edu/cummeRbund/


Do some quality assessment

Software:
- Picard: picard.sourceforge.net
- FastQC: bioinformatics.bbsrc.ac.uk/projects/fastqc
- RSeQC: code.google.com/p/rseqc
- FastX Toolkit: hannonlab.cshl.edu/fastx_toolkit
- R/ShortRead: bioconductor.org/packages/bioc/html/ShortRead.html


Mapping across splice junctions: tophat

1. Map reads to genome
2. Collect unmappable reads
3. Break reads into segments. Small segments often align independently. If segments of one read align 100 bp to several kb apart, infer a splice junction.

tophat -G genes.gtf -o C1_R1_tophatout /path/bowtieindex/genome C1_R1_1.fq C1_R1_2.fq
tophat -G genes.gtf -o C1_R2_tophatout /path/bowtieindex/genome C1_R2_1.fq C1_R2_2.fq
tophat -G genes.gtf -o C1_R3_tophatout /path/bowtieindex/genome C1_R3_1.fq C1_R3_2.fq
tophat -G genes.gtf -o C2_R1_tophatout /path/bowtieindex/genome C2_R1_1.fq C2_R1_2.fq
tophat -G genes.gtf -o C2_R2_tophatout /path/bowtieindex/genome C2_R2_1.fq C2_R2_2.fq
tophat -G genes.gtf -o C2_R3_tophatout /path/bowtieindex/genome C2_R3_1.fq C2_R3_2.fq

Arguments, in order: gene annotation (-G), output directory (-o), Bowtie index, read 1, read 2.
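The six tophat invocations differ only in the sample name, so they can be generated rather than typed. The following is a sketch that only builds and prints the command lines; the helper name `tophat_commands` is hypothetical, and the paths are the placeholders used on the slide.

```python
import shlex


def tophat_commands(conditions, replicates, gtf, index):
    """Build one tophat argument list per sample (condition x replicate)."""
    commands = []
    for condition in conditions:
        for replicate in replicates:
            sample = f"{condition}_{replicate}"
            commands.append([
                "tophat",
                "-G", gtf,                      # gene annotation
                "-o", f"{sample}_tophatout",    # output directory
                index,                          # Bowtie index prefix
                f"{sample}_1.fq",               # read 1
                f"{sample}_2.fq",               # read 2
            ])
    return commands


commands = tophat_commands(["C1", "C2"], ["R1", "R2", "R3"],
                           "genes.gtf", "/path/bowtieindex/genome")
for cmd in commands:
    print(shlex.join(cmd))  # shlex.join requires Python 3.8+
```

Each argument list could be passed to `subprocess.run(cmd, check=True)` to execute it; printing first makes it easy to review the commands before launching long alignment jobs.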


Workflow 1: Differential Gene Expression

Step 1: Align to genome
Step 2: Count reads overlapping genes
Step 3: Differential expression




Software: HTSeq http://www-huber.embl.de/users/anders/HTSeq

First convert the binary .bam file to a text .sam file using samtools:

samtools view accepted_hits.bam > C1_R1.sam

Then run htseq-count on each of the alignments:

htseq-count <sam_file> <gtf_file>




Software: DESeq2 http://www.bioconductor.org/packages/2.12/bioc/html/DESeq2.html

library(DESeq2)

# One htseq-count output file per sample; the condition (C1/C2) is the filename prefix
sampleFiles <- c("C1_R1.counts.txt", "C1_R2.counts.txt", "C1_R3.counts.txt",
                 "C2_R1.counts.txt", "C2_R2.counts.txt", "C2_R3.counts.txt")
sampleCondition <- factor(substr(sampleFiles, 1, 2))
sampleTable <- data.frame(sampleName=sampleFiles, fileName=sampleFiles, condition=sampleCondition)
sampleTable
#         sampleName         fileName condition
# 1 C1_R1.counts.txt C1_R1.counts.txt        C1
# 2 C1_R2.counts.txt C1_R2.counts.txt        C1
# 3 C1_R3.counts.txt C1_R3.counts.txt        C1
# 4 C2_R1.counts.txt C2_R1.counts.txt        C2
# 5 C2_R2.counts.txt C2_R2.counts.txt        C2
# 6 C2_R3.counts.txt C2_R3.counts.txt        C2

# Build the data set, fit the model, and extract the results table
dds <- DESeqDataSetFromHTSeqCount(sampleTable=sampleTable, directory=".", design=~condition)
dds <- DESeq(dds)
res <- results(dds)
res <- res[order(res$padj), ]   # sort by adjusted p-value (FDR)
plotMA(dds)
...

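DESeq2 reports adjusted p-values in the padj column, using the Benjamini-Hochberg procedure by default. The sketch below shows only that adjustment step on made-up p-values; it is not DESeq2's code and leaves out anything else the package does before adjusting (such as filtering low-count genes).

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that each adjusted
    # value is the minimum of p * m / rank over all equal-or-larger ranks.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four genes.
raw = [0.01, 0.04, 0.03, 0.005]
padj = benjamini_hochberg(raw)
for p, q in zip(raw, padj):
    print(p, round(q, 4))
```

Sorting genes by the adjusted value and keeping those below a chosen threshold controls the expected fraction of false discoveries among the reported genes.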

RNA-Seq Workflow 2: Differential Isoform Expression, Exon Usage


A change in fragment count for a gene does not necessarily mean a change in expression.

Trapnell, Cole, et al. "Differential analysis of gene regulation at transcript resolution with RNA-seq." Nature biotechnology 31.1 (2012): 46-53.
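A small numeric sketch of why gene-level fragment counts can mislead: longer isoforms yield more fragments per transcript copy, so switching between isoforms of different length changes the count even when the number of transcripts is unchanged. The isoform lengths, copy numbers, and the simple "fragments proportional to length times copies" model are assumptions made for illustration.

```python
def expected_fragments(isoform_copies, isoform_lengths, fragments_per_kb_per_copy=1.0):
    """Expected fragments for a gene, assuming fragments scale with length x copies."""
    return sum(
        copies * isoform_lengths[name] / 1000.0 * fragments_per_kb_per_copy
        for name, copies in isoform_copies.items()
    )


# Hypothetical gene with a 4 kb isoform and a 1 kb isoform.
lengths = {"long": 4000, "short": 1000}

# Both conditions express 100 transcript copies of the gene in total,
# but condition 2 has switched entirely to the short isoform.
cond1 = {"long": 100, "short": 0}
cond2 = {"long": 0, "short": 100}

f1 = expected_fragments(cond1, lengths)
f2 = expected_fragments(cond2, lengths)
print(f1, f2, f1 / f2)
```

Here the gene-level count drops four-fold with no change in the total number of transcripts, which is the case that isoform-level methods are meant to detect.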


Workflow 2a: Assemble transcripts for each sample: cufflinks

- Cufflinks
  - Identifies mutually incompatible fragments
  - Identifies the minimal set of transcripts that explains all the fragments.

cufflinks -o C1_R1_cufflinksout C1_R1_tophatout/accepted_hits.bam
cufflinks -o C1_R2_cufflinksout C1_R2_tophatout/accepted_hits.bam
cufflinks -o C1_R3_cufflinksout C1_R3_tophatout/accepted_hits.bam
cufflinks -o C2_R1_cufflinksout C2_R1_tophatout/accepted_hits.bam
cufflinks -o C2_R2_cufflinksout C2_R2_tophatout/accepted_hits.bam
cufflinks -o C2_R3_cufflinksout C2_R3_tophatout/accepted_hits.bam

Arguments, in order: output directory (-o), path to alignment.


Merge assemblies: cuffmerge

- Merge assemblies to create a single merged transcriptome annotation.
  - Option 1: Pool alignments and assemble all at once.
    - Computationally demanding
    - Assembler is faced with a complex mixture of isoforms -> more error
  - Option 2: Assemble alignments individually, merge resulting assemblies
    - Cuffmerge: meta-assembler using parsimony.
    - Genes with low expression -> insufficient coverage for reconstruction.
    - Merging often recovers complete gene.
    - Newly discovered isoforms integrated w/ known ones (RABT).


Merge assemblies: cuffmerge

- Create "manifest" of location of all assemblies (assemblies.txt: one transcripts.gtf path per line), e.g.:

./C1_R1_cufflinksout/transcripts.gtf

- Run Cuffmerge on assemblies using RABT

cuffmerge -g /path/to/annotation/genes.gtf -s /path/to/refgenome/genome.fa assemblies.txt

Arguments, in order: reference gene annotation (-g), reference genome sequence (-s), manifest from above (assemblies.txt).
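The manifest is a plain text file with one assembly path per line. A minimal sketch for generating it, assuming the Cufflinks output directories follow the `<sample>_cufflinksout` naming used in the cufflinks commands; `manifest_text` is a hypothetical helper name.

```python
from pathlib import Path


def manifest_text(samples):
    """One ./<sample>_cufflinksout/transcripts.gtf path per line."""
    return "".join(f"./{s}_cufflinksout/transcripts.gtf\n" for s in samples)


# Sample names assumed from the naming scheme used throughout: C1_R1 ... C2_R3.
samples = [f"C{c}_R{r}" for c in (1, 2) for r in (1, 2, 3)]
text = manifest_text(samples)
print(text, end="")

if __name__ == "__main__":
    # Write the manifest next to the Cufflinks output directories.
    Path("assemblies.txt").write_text(text)
```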


Differential expression: cuffdiff

- Identify differentially expressed genes & transcripts

cuffdiff -o cuffdiff_out -b genome.fa -u merged.gtf \
  ./C1_R1_tophatout/accepted_hits.bam,./C1_R2_tophatout/accepted_hits.bam,./C1_R3_tophatout/accepted_hits.bam \
  ./C2_R1_tophatout/accepted_hits.bam,./C2_R2_tophatout/accepted_hits.bam,./C2_R3_tophatout/accepted_hits.bam

Arguments: output directory (-o), reference sequence (-b), merged assembly (merged.gtf), location of alignments (replicates of one condition separated by commas, conditions separated by a space).

Example locus in the figure: 1 gene, 2 TSS, 2 CDS, 3 isoforms.


Downstream analysis & visualization


Visualization with cummeRbund

- Install cummeRbund:
  - Install from BioConductor:
    - source("http://bioconductor.org/biocLite.R")
    - biocLite("cummeRbund")
  - Download and install latest version from http://compbio.mit.edu/cummeRbund/
- Load the package
  - library(cummeRbund)
- Read in the data
  - cuff <- readCufflinks("/path/to/cuffdiff/output")


Visualization with cummeRbund

csDensity(genes(cuff))
csBoxplot(genes(cuff))
csScatter(genes(cuff), "C1", "C2", smooth=T)
csVolcano(genes(cuff), "C1", "C2")


Visualization with cummeRbund

mygene2 <- getGene(cuff, "Rala")
expressionBarplot(mygene2)
expressionBarplot(isoforms(mygene2))


DEXSeq

- Differential Gene Expression (e.g. DESeq)
- Differential Isoform Expression (e.g. Cufflinks)
- Differential Exon Usage
- What's different about DEXSeq?
  - Doesn't do full transcript assembly (Cufflinks)
  - Doesn't count fragments mapping to genes (DESeq)
  - Avoids assembly and looks for differences in reads mapping to individual exons.
  - Uses counts (negative binomial)


Using DEXSeq: Installation

- Installation & load:
  - source("http://bioconductor.org/biocLite.R")
  - biocLite("DEXSeq")
  - library(DEXSeq)
- Installation comes bundled with useful Python scripts in the python_scripts directory of the library. Put these in your PATH.


Using DEXSeq: Data preparation

- First, prepare "flattened" GFF (the script comes with DEXSeq; arguments are the reference annotation and the output file):

dexseq_prepare_annotation.py input.gtf exons.gff

- Create sorted SAM files:

samtools view C1_R1_tophatout/accepted_hits.bam | sort -k 1,1 -k2,2n > C1_R1.sam
samtools view C1_R2_tophatout/accepted_hits.bam | sort -k 1,1 -k2,2n > C1_R2.sam
samtools view C1_R3_tophatout/accepted_hits.bam | sort -k 1,1 -k2,2n > C1_R3.sam
samtools view C2_R1_tophatout/accepted_hits.bam | sort -k 1,1 -k2,2n > C2_R1.sam
samtools view C2_R2_tophatout/accepted_hits.bam | sort -k 1,1 -k2,2n > C2_R2.sam
samtools view C2_R3_tophatout/accepted_hits.bam | sort -k 1,1 -k2,2n > C2_R3.sam

- Count reads overlapping counting bins (the script comes with DEXSeq; arguments are the flattened annotation, the alignment, and the output file):

dexseq_count.py -p no -s no exons.gff C1_R1.sam C1_R1.counts.txt
dexseq_count.py -p no -s no exons.gff C1_R2.sam C1_R2.counts.txt
dexseq_count.py -p no -s no exons.gff C1_R3.sam C1_R3.counts.txt
dexseq_count.py -p no -s no exons.gff C2_R1.sam C2_R1.counts.txt
dexseq_count.py -p no -s no exons.gff C2_R2.sam C2_R2.counts.txt
dexseq_count.py -p no -s no exons.gff C2_R3.sam C2_R3.counts.txt


Using DEXSeq: Data import

- The pasilla package vignette gives detailed instructions on how to do this: http://www.bioconductor.org/packages/release/data/experiment/html/pasilla.html

# Sample table: two conditions, three replicates each
design <- data.frame(condition=c(rep("C1",3), rep("C2",3)), replicate=rep(1:3,2))
rownames(design) <- with(design, paste(condition, "_R", replicate, sep=""))
design
#       condition replicate
# C1_R1        C1         1
# C1_R2        C1         2
# C1_R3        C1         3
# C2_R1        C2         1
# C2_R2        C2         2
# C2_R3        C2         3

# Count files produced by dexseq_count.py, one per sample
countfiles <- file.path(".", paste(rownames(design), ".counts.txt", sep=""))
countfiles
# [1] "./C1_R1.counts.txt" "./C1_R2.counts.txt" "./C1_R3.counts.txt" "./C2_R1.counts.txt"
# [5] "./C2_R2.counts.txt" "./C2_R3.counts.txt"

# Flattened annotation produced by dexseq_prepare_annotation.py
flattenedfile <- "/Users/sdt5z/smb/u/genomes/dexseq/exons_dme_ens_bdgp525.gff"
exons <- read.HTSeqCounts(countfiles=countfiles, design=design, flattenedfile=flattenedfile)
sampleNames(exons) <- rownames(design)


Using DEXSeq: Data Analysis

# Estimate size factors (normalizes for sequencing depth)
exons <- estimateSizeFactors(exons)
sizeFactors(exons)

# Estimate dispersion
exons <- estimateDispersions(exons)
exons <- fitDispersionFunction(exons)

# Test for differential exon usage
exons <- testForDEU(exons)
exons <- estimatelog2FoldChanges(exons)
result <- DEUresultTable(exons)

# How many are significant at FDR 0.0001?
table(result$padjust < 0.0001)

# M vs A plot
plot(result$meanBase, result[, "log2fold(C2/C1)"], log="x")
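Size-factor estimation corrects for sequencing depth. The sketch below implements the median-of-ratios idea described by Anders and Huber for DESeq, on which DEXSeq builds: each sample's factor is the median, over features, of its count divided by the feature's geometric mean across samples. It is a pure-Python illustration on made-up counts, not the package's code.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: list of rows (one per feature), each a list of per-sample counts.
    Features with a zero count in any sample are skipped, because their
    geometric mean is zero.
    """
    n_samples = len(counts[0])
    kept_rows, log_geo_means = [], []
    for row in counts:
        if all(c > 0 for c in row):
            kept_rows.append(row)
            log_geo_means.append(sum(math.log(c) for c in row) / n_samples)
    if not kept_rows:
        raise ValueError("no feature has nonzero counts in every sample")
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(row[j]) - lg
                      for row, lg in zip(kept_rows, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors


# Hypothetical counts: sample 2 was sequenced twice as deeply as sample 1.
counts = [[10, 20], [100, 200], [5, 10], [30, 60], [0, 7]]
sf = size_factors(counts)
print(sf)
normalized = [[c / f for c, f in zip(row, sf)] for row in counts]
```

Using the median of the ratios rather than the total read count keeps a few very highly expressed features from dominating the normalization.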


Using DEXSeq: visualization

plotDEXSeq(exons, "FBgn0030362", cex.axis=1.2, cex=1.3, lwd=2, legend=T, displayTranscripts=T)


Using DEXSeq: HTML Report

library(biomaRt)
mart <- useMart("ensembl", dataset="dmelanogaster_gene_ensembl")
listAttributes(mart)[1:25,]
attributes <- c("ensembl_gene_id", "external_gene_id", "description")
DEXSeqHTML(exons, FDR=0.0001, mart=mart, filter="ensembl_gene_id", attributes=attributes)


Downstream analysis

- Now you have a list of:
  - Genes
  - Isoforms (genes)
  - Exons (genes)
- How to place in functional context?
- Pathway / functional analysis!
  - Gene Ontology over-representation
  - Gene Set Enrichment Analysis
  - Signaling Pathway Impact Analysis
  - Many more…
- Resources:
  - http://gettinggeneticsdone.blogspot.com/2012/03/pathway-analysis-for-high-throughput.html
  - http://www.slideshare.net/turnersd/pathway-analysis-2012-17947529


Workflow Management: Galaxy

- http://usegalaxy.org


Workflow Management: Taverna

- Taverna: http://www.taverna.org.uk/
- TavernaPBS: http://sourceforge.net/projects/tavernapbs/


Further Reading

- RNA-Seq:
  - Garber, M., Grabherr, M. G., Guttman, M., & Trapnell, C. (2011). Computational methods for transcriptome annotation and quantification using RNA-seq. Nature methods, 8(6), 469-77.
  - Marioni, J. C., Mason, C. E., Mane, S. M., Stephens, M., & Gilad, Y. (2008). RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome research, 18(9), 1509-17.
  - Mortazavi, A., Williams, B. A., McCue, K., Schaeffer, L., & Wold, B. (2008). Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature methods, 5(7), 621-8.
  - Ozsolak, F., & Milos, P. M. (2011). RNA sequencing: advances, challenges and opportunities. Nature reviews. Genetics, 12(2), 87-98.
  - Toung, J. M., Morley, M., Li, M., & Cheung, V. G. (2011). RNA-sequence analysis of human B-cells. Genome research, 991-998.
  - Wang, Z., Gerstein, M., & Snyder, M. (2009). RNA-Seq: a revolutionary tool for transcriptomics. Nature reviews. Genetics, 10(1), 57-63.
- Bowtie/Tophat:
  - Langmead, B., Trapnell, C., Pop, M., & Salzberg, S. L. (2009). Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome biology, 10(3), R25.
  - Trapnell, C., Pachter, L., & Salzberg, S. L. (2009). TopHat: discovering splice junctions with RNA-Seq. Bioinformatics (Oxford, England), 25(9), 1105-11.
- Cufflinks:
  - Roberts, A., Pimentel, H., Trapnell, C., & Pachter, L. (2011). Identification of novel transcripts in annotated genomes using RNA-Seq. Bioinformatics (Oxford, England), 27(17), 2325-9.
  - Trapnell, C., Roberts, A., Goff, L., Pertea, G., Kim, D., Kelley, D. R., Pimentel, H., et al. (2012). Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nature Protocols, 7(3), 562-578.
  - Trapnell, C., Williams, B. A., Pertea, G., Mortazavi, A., Kwan, G., van Baren, M. J., Salzberg, S. L., et al. (2010). Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature biotechnology, 28(5), 511-5.
- DEXSeq:
  - Vignette: http://watson.nci.nih.gov/bioc_mirror/packages/2.9/bioc/html/DEXSeq.html.
  - Pre-pub manuscript: Anders, S., Reyes, A., Huber, W. (2012). Detecting differential usage of exons from RNA-Seq data. Nature Precedings, DOI: 10.1038/npre.2012.6837.2.


Online Community Forum and Discussion

- SEQanswers:
  - http://SEQanswers.com
  - Format: Forum
  - Li et al. SEQanswers: An open access community for collaboratively decoding genomes. Bioinformatics (2012).
- BioStar:
  - http://biostar.stackexchange.com
  - Format: Q&A
  - Parnell et al. BioStar: an online question & answer resource for the bioinformatics community. PLoS Comp Bio (2011).
- Other Bioinformatics Resources: stephenturner.us/p/edu


DNA Methylation: Importance

- Occurs most frequently at CpG sites
- High methylation at promoters ≈ silencing
- Methylation perturbed in cancer
- Methylation associated with many other complex diseases: neural, autoimmune, response to env.
- Mapping DNA methylation -> new disease genes & drug targets.


DNA Methylation: Challenges

- Dynamic and tissue-specific
- DNA -> collection of cells which vary in 5meC patterns -> 5meC pattern is complex.
- Further, uneven distribution of CpG targets
- Multiple classes of methods:
  - Bisulfite, sequence-based: Assay methylated target sequences across individual DNAs.
  - Affinity enrichment, count-based: Assay methylation level across many genomic loci.


DNA Methylation: Mapping

DNA Methylation

BS-Seq          Whole-genome bisulfite sequencing
RRBS-Seq        Reduced representation bisulfite sequencing
BC-Seq          Bisulfite capture sequencing
BSPP            Bisulfite specific padlock probes
Methyl-Seq      Restriction enzyme based methyl-seq
MSCC            Methyl sensitive cut counting
HELP-Seq        HpaII fragment enrichment by ligation PCR
MCA-Seq         Methylated CpG island amplification
MeDIP-Seq       Methylated DNA immunoprecipitation
MBP-Seq         Methyl-binding protein sequencing
MethylCap-seq   Methylated DNA capture by affinity purification
MIRA-Seq        Methylated CpG island recovery assay

Gene Expression

RNA-Seq         High-throughput cDNA sequencing


Methylation: REs and PCR

- Restriction enzyme digest
  - Isoschizomers HpaII and MspI both recognize the same sequence: 5'-CCGG-3'
  - MspI digests regardless of methylation
  - HpaII only digests at unmethylated sites
- PCR -> gel electrophoresis -> Southern blot
- Pros: Highly sensitive
- Cons: Low-throughput, high false positive rate because of incomplete digestion (for reasons other than methylation).
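The HpaII/MspI comparison can be sketched in code: both enzymes share the CCGG recognition site, and a site cut by MspI but not by HpaII is inferred to be methylated. The function names, the toy sequence, and the simplification that methylation is recorded per site (rather than per cytosine) are assumptions of this example.

```python
def ccgg_sites(seq):
    """0-based start positions of every 5'-CCGG-3' site in seq."""
    seq = seq.upper()
    return [i for i in range(len(seq) - 3) if seq[i:i + 4] == "CCGG"]


def cut_sites(seq, methylated_sites, enzyme):
    """Sites cut by the enzyme, following the rules on the slide:
    MspI cuts every CCGG; HpaII cuts only unmethylated CCGG."""
    sites = ccgg_sites(seq)
    if enzyme == "MspI":
        return sites
    if enzyme == "HpaII":
        blocked = set(methylated_sites)
        return [s for s in sites if s not in blocked]
    raise ValueError("unknown enzyme: " + enzyme)


# Toy sequence with two CCGG sites; the second one is (hypothetically) methylated.
seq = "AACCGGTTTCCGGA"
methylated = [9]
mspi = cut_sites(seq, methylated, "MspI")
hpaii = cut_sites(seq, methylated, "HpaII")
inferred_methylated = sorted(set(mspi) - set(hpaii))
print(mspi, hpaii, inferred_methylated)
```

The "Cons" bullet shows up directly in this logic: any site HpaII fails to cut for a reason other than methylation would be reported as methylated.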


Bisulfite sequencing

- Sodium bisulfite converts unmethylated (but not methylated) C's into U's.
- This introduces a methylation-specific "SNP".
- RRBS: library enriched for CpG-dense regions by digesting with MspI.
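The methylation-specific "SNP" can be illustrated with an in-silico conversion: unmethylated C is converted to U and read as T after amplification, while methylated C is still read as C. The sketch below handles a single strand only and uses a made-up sequence and made-up methylated positions; the function names are chosen for this example.

```python
def bisulfite_convert(seq, methylated_positions):
    """Sequence read out after bisulfite treatment and PCR (one strand only).

    Unmethylated C -> U, read as T; methylated C stays C.
    """
    methylated = set(methylated_positions)
    converted = []
    for i, base in enumerate(seq.upper()):
        if base == "C" and i not in methylated:
            converted.append("T")
        else:
            converted.append(base)
    return "".join(converted)


def call_methylation(reference, converted):
    """Positions where a reference C is still read as C (inferred methylated)."""
    return [i for i, (r, c) in enumerate(zip(reference.upper(), converted))
            if r == "C" and c == "C"]


# Hypothetical reference with methylated cytosines at 0-based positions 1 and 5.
ref = "ACGTCCGGACG"
converted = bisulfite_convert(ref, methylated_positions=[1, 5])
print(converted)
print(call_methylation(ref, converted))
```

Comparing the converted read with the reference recovers the methylated positions, which is why bisulfite data are analyzed with mismatch-aware aligners rather than treated as ordinary resequencing.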


MeDIP-Seq

- MeDIP-Seq = Methylated DNA immunoprecipitation
- Uses antibody against 5-methylcytosine to retrieve methylated fragments from sonicated DNA.
- Enrichment method = count number of reads


MethylCap-Seq

- Uses methyl-binding domain (MBD) protein to obtain DNA with similar methylation levels.
- Also a counting method.


Methylation: Accuracy

- Bock et al. Quantitative comparison of genome-wide DNA methylation mapping technologies. Nature biotechnology, 28(10), 1106-14.
- MeDIP, MethylCap, RRBS largely concordant with Illumina Infinium assay


Methylation methods: coverage

- Coverage varies among different methods


Methylation: Features & Biases


Methylation: Bioinformatics Resources

Resource | Purpose | URL
Batman | MeDIP DNA methylation analysis tool | http://td-blade.gurdon.cam.ac.uk/software/batman
BDPC | DNA methylation analysis platform | http://biochem.jacobs-university.de/BDPC
BSMAP | Whole-genome bisulphite sequence mapping | http://code.google.com/p/bsmap
CpG Analyzer | Windows-based program for bisulphite DNA | -
CpGcluster | CpG island identification | http://bioinfo2.ugr.es/CpGcluster
CpGFinder | Online program for CpG island identification | http://linux1.softberry.com
CpG Island Explorer | Online program for CpG island identification | http://bioinfo.hku.hk/cpgieintro.html
CpG Island Searcher | Online program for CpG island identification | http://cpgislands.usc.edu
CpG PatternFinder | Windows-based program for bisulphite DNA | -
CpG Promoter | Large-scale promoter mapping using CpG islands | http://www.cshl.edu/OTT/html/cpg_promoter.html
CpG ratio and GC content Plotter | Online program for plotting the observed:expected ratio of CpG | http://mwsross.bms.ed.ac.uk/public/cgi-bin/cpg.pl
CpGviewer | Bisulphite DNA sequencing viewer | http://dna.leeds.ac.uk/cpgviewer
CyMATE | Bisulphite-based analysis of plant genomic DNA | http://www.gmi.oeaw.ac.at/en/cymate-index/
EMBOSS CpGPlot/CpGReport | Online program for plotting CpG-rich regions | http://www.ebi.ac.uk/Tools/emboss/cpgplot/index.html
Epigenomics Roadmap | NIH Epigenomics Roadmap Initiative homepage | http://nihroadmap.nih.gov/epigenomics
Epinexus | DNA methylation analysis tools | http://epinexus.net/home.html
MEDME | Software package (using R) for modelling MeDIP experimental data | http://espresso.med.yale.edu/medme
methBLAST | Similarity search program for bisulphite-modified DNA | http://medgen.ugent.be/methBLAST
MethDB | Database for DNA methylation data | http://www.methdb.de
MethPrimer | Primer design for bisulphite PCR | http://www.urogene.org/methprimer
methPrimerDB | PCR primers for DNA methylation analysis | http://medgen.ugent.be/methprimerdb
MethTools | Bisulphite sequence data analysis tool | http://www.methdb.de
MethyCancer Database | Database of cancer DNA methylation data | http://methycancer.psych.ac.cn
Methyl Primer Express | Primer design for bisulphite PCR | http://www.appliedbiosystems.com/
Methylumi | Bioconductor pkg for DNA methylation data from Illumina | http://www.bioconductor.org/packages/bioc/html/
Methylyzer | Bisulphite DNA sequence visualization tool | http://ubio.bioinfo.cnio.es/Methylyzer/main/index.html
mPod | DNA methylation viewer integrated w/ Ensembl genome browser | http://www.compbio.group.cam.ac.uk/Projects/
PubMeth | Database of DNA methylation literature | http://www.pubmeth.org
QUMA | Quantification tool for methylation analysis | http://quma.cdb.riken.jp
TCGA Data Portal | Database of TCGA DNA methylation data | http://cancergenome.nih.gov/dataportal


Methylation: Further Reading

- Bock, C., Tomazou, E. M., Brinkman, A. B., Müller, F., Simmer, F., Gu, H., Jäger, N., et al. (2010). Quantitative comparison of genome-wide DNA methylation mapping technologies. Nature biotechnology, 28(10), 1106-14.
- Brinkman, A. B., Simmer, F., Ma, K., Kaan, A., Zhu, J., & Stunnenberg, H. G. (2010). Whole-genome DNA methylation profiling using MethylCap-seq. Methods (San Diego, Calif.), 52(3), 232-6.
- Brunner, A. L., Johnson, D. S., Kim, S. W., Valouev, A., Reddy, T. E., et al. (2009). Distinct DNA methylation patterns characterize differentiated human embryonic stem cells and developing human fetal liver, 1044-1056.
- Gu, H., Bock, C., Mikkelsen, T. S., Jäger, N., Smith, Z. D., Tomazou, E., Gnirke, A., et al. (2010). Genome-scale DNA methylation mapping of clinical samples at single-nucleotide resolution. Nature methods, 7(2), 133-6.
- Harris, R. A., Wang, T., Coarfa, C., Nagarajan, R. P., Hong, C., Downey, S. L., Johnson, B. E., et al. (2010). Comparison of sequencing-based methods to profile DNA methylation and identification of monoallelic epigenetic modifications. Nature biotechnology, 28(10), 1097-105.
- Kerick, M., Fischer, A., & Schweiger, M.-R. (2012). Bioinformatics for High Throughput Sequencing. (N. Rodríguez-Ezpeleta, M. Hackenberg, & A. M. Aransay, Eds.), 151-167. New York, NY: Springer New York.
- Laird, P. W. (2010). Principles and challenges of genomewide DNA methylation analysis. Nature reviews. Genetics, 11(3), 191-203.
- Weber, M., Davies, J. J., Wittig, D., Oakeley, E. J., Haase, M., Lam, W. L., & Schübeler, D. (2005). Chromosome-wide and promoter-specific analyses identify sites of differential DNA methylation in normal and transformed human cells. Nature genetics, 37(8), 853-62. doi:10.1038/ng1598


Thank you

Web: bioinformatics.virginia.edu

E-mail: [email protected]

Blog: www.GettingGeneticsDone.com

Twitter: twitter.com/genetics_blog
