examining gene expression and methylation with next gen sequencing
DESCRIPTION
Slides on RNA-seq and methylation studies using next-gen sequencing given at the University of Miami Hussman Institute for Human Genomics "Genetic Analysis of Complex Human Diseases" course in 2012 (http://hihg.med.miami.edu/educational-programs/analysis-of-complex-human-diseases/genetic-analysis-of-complex-human-diseases/)
TRANSCRIPT
GENETIC ANALYSIS of Complex Human Diseases
Examining Gene Expression and Methylation with Next-Gen Sequencing
Stephen Turner, Ph.D. Bioinformatics Core Director bioinformatics.virginia.edu
University of Virginia
Gene expression pre-2008
• PCR
• Microarrays
Advantages of RNA-Seq
• No reference necessary
• Low background (no cross-hybridization)
• Unlimited dynamic range (FC 9000, Science 320:1344)
• Direct counting (microarrays: indirect, via hybridization)
• Can characterize full transcriptome
  - mRNA and ncRNA (miRNA, lncRNA, snoRNA, etc.)
  - Differential gene expression
  - Differential coding output
  - Differential TSS usage
  - Differential isoform expression
Isoform level data
Differential splicing & TSS use
Is it accurate?
• Marioni et al. RNA-seq: An assessment of technical reproducibility and comparison with gene expression arrays. Genome Research 2008 18:1509.
RNA-Seq Challenges
• Library construction
  - Size selection (messenger, small)
  - Strand specificity?
• Bioinformatic challenges
  - Spliced alignment
  - Transcript deconvolution
• Statistical challenges
  - Highly variable abundance
  - Sample size: never, ever, plan n=1
  - Normalization (RPKM)
    ► More reads from longer transcripts, higher sequencing depth
    ► Want to compare features of different lengths
    ► Want to compare conditions with different total sequence depth
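A minimal sketch of the RPKM normalization named above (reads per kilobase of transcript per million mapped reads), assuming the usual definition. The function name and the example numbers are hypothetical and chosen only for illustration.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads.

    Dividing by length (in kb) corrects for longer transcripts yielding
    more reads; dividing by total mapped reads (in millions) corrects
    for differences in sequencing depth between samples.
    """
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)


# Hypothetical example: two transcripts in the same sample (20 M mapped reads).
# The second is twice as long and has twice the reads, so RPKM is identical.
short_tx = rpkm(read_count=500, transcript_length_bp=2000, total_mapped_reads=20_000_000)
long_tx = rpkm(read_count=1000, transcript_length_bp=4000, total_mapped_reads=20_000_000)
print(short_tx, long_tx)
```

Raw counts of 500 and 1000 would suggest a two-fold difference; after length normalization the two transcripts look equally abundant.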
RNA-Seq Overview
[Figure: samples of interest (Condition 1: normal colon; Condition 2: colon tumor) → mRNA with poly-A tails (AAAAA, TTTTT) → library → FASTQ reads]
@HWUSI-EAS100R:6:73:941:1973#0/1
GATTTGGGGTTCAAAGCAGTATCGATCAAATA
+HWUSI-EAS100R:6:73:941:1973#0/1
!''*((((***+))%%%++)(%%%%).1***-
@HWUSI-EAS100R:6:73:941:1973#0/1
CATCGACGTAGATCGACTACATGAACTGCTCG
+HWUSI-EAS100R:6:73:941:1973#0/1
!''*+(*+!+(*!+*(((***!%%%%!%%(+-
Common question #1: Depth
• Question: how much sequence do I need?
• Answer: it's complicated.
• Oversimplified answer: 20-50 million PE reads / sample (mouse/human).
• Depends on:
  - Size & complexity of transcriptome
  - Application: differential gene expression, transcript discovery
  - Tissue type, RNA quality, library preparation
  - Sequencing type: length, paired-end vs single-end, etc.
• Find a publication in your field with similar goals.
• Good news: ¼ HiSeq lane usually sufficient.
Common question #2: Sample Size
• Question: How many samples should I sequence?
• Oversimplified answer: At least 3 biological replicates per condition.
• Depends on:
  - Sequencing depth
  - Application
  - Goals (prioritization, biomarker discovery, etc.)
  - Effect size, desired power, statistical significance
• Find a publication with similar goals
Common question #3: Workflow
• How do I analyze the data?
• No standards!
  - Unspliced aligners: BWA, Bowtie, Bowtie2, MANY others!
  - Spliced aligners: STAR, RUM, TopHat, TopHat2-Bowtie1, TopHat2-Bowtie2, GSNAP, MANY others.
  - Reference builds & annotations: UCSC, Entrez, Ensembl
  - Assembly: Cufflinks, Scripture, Trinity, G.Mor.Se, Velvet, Trans-ABySS
  - Quantification: Cufflinks, RSEM, eXpress, MISO, etc.
  - Differential expression: Cuffdiff, Cuffdiff2, DEGseq, DESeq, edgeR, Myrna
• Like early microarray days: lots of excitement, lots of tools, little knowledge of how to integrate tools into a pipeline!
• Benchmarks
  - Microarray: Spike-ins (Irizarry)
  - RNA-Seq: ???, simulation, ???
Common question #3: Workflow
Eyras et al. Methods to Study Splicing from RNA-Seq. http://dx.doi.org/10.6084/m9.figshare.679993
Turner SD. RNA-seq Workflows and Tools. http://dx.doi.org/10.6084/m9.figshare.662782
Phases of NGS Analysis
• Primary
  - Conversion of raw machine signal into sequence and qualities
• Secondary
  - Alignment of reads to reference genome or transcriptome
  - or de novo assembly of reads into contigs
• Tertiary
  - SNP discovery/genotyping
  - Peak discovery/quantification (ChIP, MeDIP)
  - Transcript assembly/quantification (RNA-seq)
• Quaternary
  - Differential expression
  - Enrichment, pathways, correlation, clustering, visualization, etc.
  - http://gettinggeneticsdone.blogspot.com/2012/03/pathway-analysis-for-high-throughput.html
  - http://www.slideshare.net/turnersd/pathway-analysis-2012-17947529
Primary Analysis: Get FASTQ file
@HWUSI-EAS100R:6:73:941:1973#0/1
GATTTGGGGTTCAAAGCAGTATCGATCAAATA
+HWUSI-EAS100R:6:73:941:1973#0/1
!''*((((***+))%%%++)(%%%%).1***-
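A FASTQ record is four lines: an @-prefixed read name, the sequence, a +-prefixed separator, and one quality character per base. A minimal reader might look like the sketch below; it assumes unwrapped four-line records, and the function name `read_fastq` is hypothetical.

```python
from io import StringIO


def read_fastq(handle):
    """Yield (name, sequence, quality) tuples from a four-line-per-record FASTQ."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored here
        quality = handle.readline().rstrip("\n")
        yield header[1:], sequence, quality  # drop the leading '@'


# The record shown on the slide, held in memory instead of a file
example = StringIO(
    "@HWUSI-EAS100R:6:73:941:1973#0/1\n"
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATA\n"
    "+HWUSI-EAS100R:6:73:941:1973#0/1\n"
    "!''*((((***+))%%%++)(%%%%).1***-\n"
)
records = list(read_fastq(example))
for name, sequence, quality in records:
    print(name, len(sequence), len(quality))
```

For real data, an established parser (for example in Biopython or HTSeq) is a safer choice than a hand-rolled reader.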
"Phred-scaled" base qualities
# $p is probability base is erroneous
$Q = -10 * log($p) / log(10);        # Phred Q
$q = chr(($Q<=40? $Q : 40) + 33);    # FASTQ quality character
$Q = ord($q) - 33;                   # 33 offset
SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS.....................................................
...............................IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII......................
..........................XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
|                         |    |        |                              |                     |
33                        59   64       73                             104                   126
S - Sanger Phred+33, 41 values (0, 40)
I - Illumina 1.3 Phred+64, 41 values (0, 40)
X - Solexa Solexa+64, 68 values (-5, 62)
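The same Phred arithmetic in Python, as a sketch that assumes the Sanger Phred+33 encoding (pass offset=64 for Illumina 1.3). The function names are hypothetical.

```python
import math


def error_prob_to_char(p, offset=33, max_q=40):
    """Convert a base-call error probability to a FASTQ quality character."""
    q = min(int(round(-10 * math.log10(p))), max_q)  # Phred Q, capped at max_q
    return chr(q + offset)


def char_to_phred(c, offset=33):
    """Convert a FASTQ quality character back to a Phred score."""
    return ord(c) - offset


def phred_to_error_prob(q):
    """Probability that a base with Phred score q is wrong."""
    return 10 ** (-q / 10)


# Decode the quality string from the example read
quals = [char_to_phred(c) for c in "!''*((((***+))%%%++)(%%%%).1***-"]
print(quals[:4], max(quals))
print(error_prob_to_char(0.001), phred_to_error_prob(20))
```

Q20 corresponds to a 1-in-100 chance that the base call is wrong, and Q30 to 1 in 1000.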
Secondary analysis
• Alignment back to the reference
  - Computationally demanding: can't use BLAST
  - Many algorithms (Maq, BWA, bowtie, bowtie2, Mosaik, NovoAlign, SOAP2, SSAHA, …)
  - http://en.wikipedia.org/wiki/List_of_sequence_alignment_software
  - Sensitivity to sequencing errors, polymorphisms, indels, rearrangements
  - Tradeoffs in time vs. memory vs. performance
RNA-Seq Workflow 1: Differential Gene Expression
RNA-Seq Workflow 2: Differential Isoform Expression, Exon Usage
Download data & software
• Public data from GEO, e.g. GSE32038
  - http://www.ncbi.nlm.nih.gov/projects/geo/query/acc.cgi?acc=GSE32038
  - Trapnell et al. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nature Protocols 2012; 7:562.
• Sequence, annotation, indexes (Ensembl)
  - iGenomes: http://tophat.cbcb.umd.edu/igenomes.html
  - Genes: /Annotation/Genes/genes.gtf
  - Indexes: /Sequence/BowtieIndex/genome.*
• Software:
  - Samtools: http://samtools.sourceforge.net/
  - FastQC: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
  - Bowtie: http://bowtie-bio.sourceforge.net/index.shtml
  - Tophat: http://tophat.cbcb.umd.edu/
  - HTSeq: http://www-huber.embl.de/users/anders/HTSeq/doc/overview.html
  - R: http://www.r-project.org/
  - DESeq2: http://www.bioconductor.org/packages/2.12/bioc/html/DESeq2.html
  - Cufflinks: http://cufflinks.cbcb.umd.edu/
  - cummeRbund: http://compbio.mit.edu/cummeRbund/
Do some quality assessment
Software:
  - Picard: picard.sourceforge.net
  - FastQC: bioinformatics.bbsrc.ac.uk/projects/fastqc
  - RSeQC: code.google.com/p/rseqc
  - FastX Toolkit: hannonlab.cshl.edu/fastx_toolkit
  - R/ShortRead: bioconductor.org/packages/bioc/html/ShortRead.html
Mapping across splice junctions: tophat
1. Map reads to genome
2. Collect unmappable reads
3. Break reads into segments. Small segments often align independently. If segments align 100 bp to several kb apart, infer a splice junction.
tophat -G genes.gtf -o C1_R1_tophatout /path/bowtieindex/genome C1_R1_1.fq C1_R1_2.fq
tophat -G genes.gtf -o C1_R2_tophatout /path/bowtieindex/genome C1_R2_1.fq C1_R2_2.fq
tophat -G genes.gtf -o C1_R3_tophatout /path/bowtieindex/genome C1_R3_1.fq C1_R3_2.fq
tophat -G genes.gtf -o C2_R1_tophatout /path/bowtieindex/genome C2_R1_1.fq C2_R1_2.fq
tophat -G genes.gtf -o C2_R2_tophatout /path/bowtieindex/genome C2_R2_1.fq C2_R2_2.fq
tophat -G genes.gtf -o C2_R3_tophatout /path/bowtieindex/genome C2_R3_1.fq C2_R3_2.fq
Arguments, in order: -G gene annotation, -o output directory, Bowtie index, read 1, read 2
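The six tophat calls differ only in the sample name, so they can be generated rather than typed. The sketch below only builds and prints the command lines; the helper name `tophat_command` is hypothetical, and the paths are the placeholders used on the slide.

```python
from itertools import product


def tophat_command(sample, gtf="genes.gtf", index="/path/bowtieindex/genome"):
    """Build the tophat argument list for one paired-end sample."""
    return [
        "tophat",
        "-G", gtf,                    # gene annotation
        "-o", f"{sample}_tophatout",  # output directory
        index,                        # Bowtie index prefix
        f"{sample}_1.fq",             # read 1
        f"{sample}_2.fq",             # read 2
    ]


# Two conditions (C1, C2) x three replicates (R1..R3)
samples = [f"C{c}_R{r}" for c, r in product((1, 2), (1, 2, 3))]
commands = [tophat_command(s) for s in samples]
for cmd in commands:
    print(" ".join(cmd))
```

To actually run the alignments, each list could be passed to `subprocess.run(cmd, check=True)`, assuming tophat is installed and on the PATH.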
Workflow 1: Differential Gene Expression
Step 1: Align to genome
Step 2: Count reads overlapping genes
Step 3: Differential expression
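Step 2 (count reads overlapping genes) software: HTSeq http://www-huber.embl.de/users/anders/HTSeq
First convert the binary .bam file to a text .sam file using samtools:
samtools view C1_R1_tophatout/accepted_hits.bam > C1_R1.sam
Then run htseq-count on each of the alignments (counts are written to standard output):
htseq-count <sam_file> <gtf_file> > C1_R1.counts.txt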
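To make "count reads overlapping genes" concrete, here is a toy sketch of the counting idea: a read is assigned to a gene only if it overlaps exactly one gene, otherwise it is tallied as ambiguous or as hitting no feature. This is a simplification for illustration, not htseq-count's actual implementation. It ignores exons, strand and multiple chromosomes, and the gene coordinates are made up.

```python
from collections import Counter


def count_reads(reads, genes):
    """Count reads per gene on a single chromosome.

    reads: list of (start, end) half-open intervals
    genes: dict mapping gene name -> (start, end) half-open interval
    """
    counts = Counter({name: 0 for name in genes})
    for r_start, r_end in reads:
        hits = [
            name
            for name, (g_start, g_end) in genes.items()
            if r_start < g_end and g_start < r_end  # intervals overlap
        ]
        if len(hits) == 1:
            counts[hits[0]] += 1
        elif not hits:
            counts["__no_feature"] += 1
        else:
            counts["__ambiguous"] += 1
    return counts


# Hypothetical annotation with two overlapping genes
genes = {"geneA": (100, 500), "geneB": (450, 900)}
reads = [(120, 170), (460, 510), (600, 650), (1000, 1050), (300, 350)]
counts = count_reads(reads, genes)
print(dict(counts))
```

Discarding ambiguous reads is conservative: it loses some data, but avoids crediting a read to the wrong gene.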
Step 3 (differential expression) software: DESeq2 http://www.bioconductor.org/packages/2.12/bioc/html/DESeq2.html
> library(DESeq2)
> sampleFiles <- c("C1_R1.counts.txt", "C1_R2.counts.txt", "C1_R3.counts.txt", "C2_R1.counts.txt", "C2_R2.counts.txt", "C2_R3.counts.txt")
> sampleCondition <- factor(substr(sampleFiles, 1, 2))
> sampleTable <- data.frame(sampleName=sampleFiles, fileName=sampleFiles, condition=sampleCondition)
> sampleTable
        sampleName         fileName condition
1 C1_R1.counts.txt C1_R1.counts.txt        C1
2 C1_R2.counts.txt C1_R2.counts.txt        C1
3 C1_R3.counts.txt C1_R3.counts.txt        C1
4 C2_R1.counts.txt C2_R1.counts.txt        C2
5 C2_R2.counts.txt C2_R2.counts.txt        C2
6 C2_R3.counts.txt C2_R3.counts.txt        C2
> dds <- DESeqDataSetFromHTSeqCount(sampleTable=sampleTable, directory=".", design=~condition)
> dds <- DESeq(dds)
> res <- results(dds)
> # sort by adjusted p-value (the DESeq2 results column is named padj)
> res <- res[order(res$padj), ]
> plotMA(dds)
...
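The results table is ordered by a multiple-testing-adjusted p-value. As a rough illustration of what that adjustment does, here is a sketch of the Benjamini-Hochberg procedure, which is DESeq2's default adjustment method. The p-values are made up, and this is not DESeq2's own code, which also applies independent filtering.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices, ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four genes
pvals = [0.01, 0.04, 0.03, 0.005]
padj = benjamini_hochberg(pvals)
print(padj)
```

Calling genes with an adjusted p-value below 0.05 significant aims to keep the expected fraction of false discoveries among the calls at or below 5%.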
RNA-Seq Workflow 2: Differential Isoform Expression, Exon Usage
A change in fragment count for a gene does not necessarily equal a change in expression.
Trapnell, Cole, et al. "Differential analysis of gene regulation at transcript resolution with RNA-seq." Nature biotechnology 31.1 (2012): 46-53.
Workflow 2a: Assemble transcripts for each sample: cufflinks
• Cufflinks
  - Identifies mutually incompatible fragments
  - Identifies the minimal set of transcripts that explains all the fragments
cufflinks -o C1_R1_cufflinksout C1_R1_tophatout/accepted_hits.bam
cufflinks -o C1_R2_cufflinksout C1_R2_tophatout/accepted_hits.bam
cufflinks -o C1_R3_cufflinksout C1_R3_tophatout/accepted_hits.bam
cufflinks -o C2_R1_cufflinksout C2_R1_tophatout/accepted_hits.bam
cufflinks -o C2_R2_cufflinksout C2_R2_tophatout/accepted_hits.bam
cufflinks -o C2_R3_cufflinksout C2_R3_tophatout/accepted_hits.bam
Arguments, in order: -o output directory, path to alignment
Merge assemblies: cuffmerge
• Merge assemblies to create a single merged transcriptome annotation.
  - Option 1: Pool alignments and assemble all at once.
    ► Computationally demanding
    ► Assembler will be faced with a complex mixture of isoforms → more errors
  - Option 2: Assemble alignments individually, merge resulting assemblies
    ► Cuffmerge: meta-assembler using parsimony.
    ► Genes with low expression → insufficient coverage for reconstruction.
    ► Merging often recovers the complete gene.
    ► Newly discovered isoforms integrated w/ known ones (RABT).
Merge assemblies: cuffmerge
• Create a "manifest" listing the location of all assemblies (assemblies.txt):
./C1_R1_cufflinksout/transcripts.gtf
./C1_R2_cufflinksout/transcripts.gtf
./C1_R3_cufflinksout/transcripts.gtf
./C2_R1_cufflinksout/transcripts.gtf
./C2_R2_cufflinksout/transcripts.gtf
./C2_R3_cufflinksout/transcripts.gtf
• Run Cuffmerge on assemblies using RABT:
cuffmerge -g /path/to/annotation/genes.gtf -s /path/to/refgenome/genome.fa assemblies.txt
Arguments, in order: -g reference gene annotation, -s reference genome sequence, manifest from above
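The manifest is just a text file with one transcripts.gtf path per line, so it can be written programmatically. A small sketch, assuming the output directory naming used in the cufflinks step; the helper name `write_manifest` is hypothetical, and a temporary directory stands in for the project directory.

```python
import tempfile
from pathlib import Path


def write_manifest(samples, manifest_path):
    """Write one Cufflinks transcripts.gtf path per line and return the lines."""
    lines = [f"./{sample}_cufflinksout/transcripts.gtf" for sample in samples]
    Path(manifest_path).write_text("\n".join(lines) + "\n")
    return lines


# Two conditions x three replicates, as elsewhere in this workflow
samples = [f"C{c}_R{r}" for c in (1, 2) for r in (1, 2, 3)]
manifest = Path(tempfile.mkdtemp()) / "assemblies.txt"
lines = write_manifest(samples, manifest)
print(manifest.read_text())
```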
Differential expression: cuffdiff
• Identify differentially expressed genes & transcripts
cuffdiff -o cuffdiff_out -b genome.fa -u merged.gtf \
./C1_R1_tophatout/accepted_hits.bam,\
./C1_R2_tophatout/accepted_hits.bam,\
./C1_R3_tophatout/accepted_hits.bam \
./C2_R1_tophatout/accepted_hits.bam,\
./C2_R2_tophatout/accepted_hits.bam,\
./C2_R3_tophatout/accepted_hits.bam
Labels: -o output directory; -b reference sequence; merged.gtf is the merged assembly; the comma-separated BAM lists give the location of alignments (one list per condition)
[Figure example: 1 gene, 2 TSS, 2 CDS, 3 isoforms]
Downstream analysis & visualization
Visualization with cummeRbund
• Install cummeRbund:
  - Install from Bioconductor:
    ► source("http://bioconductor.org/biocLite.R")
    ► biocLite("cummeRbund")
  - Or download and install the latest version from http://compbio.mit.edu/cummeRbund/
• Load the package
  - library(cummeRbund)
• Read in the data
  - cuff <- readCufflinks("/path/to/cuffdiff/output")
csDensity(genes(cuff))
csBoxplot(genes(cuff))
csScatter(genes(cuff), "C1", "C2", smooth=T)
csVolcano(genes(cuff), "C1", "C2")
mygene2 <- getGene(cuff, "Rala")
expressionBarplot(mygene2)
expressionBarplot(isoforms(mygene2))
DEXSeq
• Differential gene expression (e.g. DESeq)
• Differential isoform expression (e.g. Cufflinks)
• Differential exon usage
• What's different about DEXSeq?
  - Doesn't do full transcript assembly (unlike Cufflinks)
  - Doesn't count fragments mapping to whole genes (unlike DESeq)
  - Avoids assembly and looks for differences in reads mapping to individual exons.
  - Uses counts (negative binomial)
Using DEXSeq: Installation
• Installation & load:
  - source("http://bioconductor.org/biocLite.R")
  - biocLite("DEXSeq")
  - library(DEXSeq)
• Installation comes bundled with useful Python scripts in the python_scripts directory of the library. Put these in your PATH.
Using DEXSeq: Data preparation
• First, prepare the "flattened" GFF (script comes with DEXSeq; arguments: reference annotation, flattened annotation output):
dexseq_prepare_annotation.py input.gtf exons.gff
• Create sorted SAM files:
samtools view C1_R1_tophatout/accepted_hits.bam | sort -k 1,1 -k2,2n > C1_R1.sam
samtools view C1_R2_tophatout/accepted_hits.bam | sort -k 1,1 -k2,2n > C1_R2.sam
samtools view C1_R3_tophatout/accepted_hits.bam | sort -k 1,1 -k2,2n > C1_R3.sam
samtools view C2_R1_tophatout/accepted_hits.bam | sort -k 1,1 -k2,2n > C2_R1.sam
samtools view C2_R2_tophatout/accepted_hits.bam | sort -k 1,1 -k2,2n > C2_R2.sam
samtools view C2_R3_tophatout/accepted_hits.bam | sort -k 1,1 -k2,2n > C2_R3.sam
• Count reads overlapping counting bins (script comes with DEXSeq; arguments: flattened annotation, alignment, output file):
dexseq_count.py -p no -s no exons.gff C1_R1.sam C1_R1.counts.txt
dexseq_count.py -p no -s no exons.gff C1_R2.sam C1_R2.counts.txt
dexseq_count.py -p no -s no exons.gff C1_R3.sam C1_R3.counts.txt
dexseq_count.py -p no -s no exons.gff C2_R1.sam C2_R1.counts.txt
dexseq_count.py -p no -s no exons.gff C2_R2.sam C2_R2.counts.txt
dexseq_count.py -p no -s no exons.gff C2_R3.sam C2_R3.counts.txt
Using DEXSeq: Data import
• The pasilla package vignette gives detailed instructions on how to do this: http://www.bioconductor.org/packages/release/data/experiment/html/pasilla.html
> design <- data.frame(condition=c(rep("C1",3), rep("C2",3)), replicate=rep(1:3,2))
> rownames(design) <- with(design, paste(condition, "_R", replicate, sep=""))
> design
      condition replicate
C1_R1        C1         1
C1_R2        C1         2
C1_R3        C1         3
C2_R1        C2         1
C2_R2        C2         2
C2_R3        C2         3
> countfiles <- file.path(".", paste(rownames(design), ".counts.txt", sep=""))
> countfiles
[1] "./C1_R1.counts.txt" "./C1_R2.counts.txt" "./C1_R3.counts.txt" "./C2_R1.counts.txt"
[5] "./C2_R2.counts.txt" "./C2_R3.counts.txt"
> flattenedfile <- "/Users/sdt5z/smb/u/genomes/dexseq/exons_dme_ens_bdgp525.gff"
> exons <- read.HTSeqCounts(countfiles=countfiles, design=design, flattenedfile=flattenedfile)
> sampleNames(exons) <- rownames(design)
Using DEXSeq: Data Analysis
# Estimate size factors (normalizes for sequencing depth)
exons <- estimateSizeFactors(exons)
sizeFactors(exons)
# Estimate dispersion
exons <- estimateDispersions(exons)
exons <- fitDispersionFunction(exons)
# Test for Differential Exon Usage
exons <- testForDEU(exons)
exons <- estimatelog2FoldChanges(exons)
result <- DEUresultTable(exons)
# How many are significant at FDR 0.0001?
table(result$padjust < 0.0001)
# M vs A plot
plot(result$meanBase, result[, "log2fold(C2/C1)"], log="x")
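Size factors rescale each sample so that counts are comparable across different sequencing depths. The sketch below illustrates the median-of-ratios idea described by Anders and Huber for DESeq; it is assumed here that DEXSeq's estimateSizeFactors follows the same approach. This is an illustration, not the package's code, and the count matrix is made up.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: list of rows (one per feature), each a list of per-sample counts.
    Features with a zero count in any sample are skipped.
    """
    n_samples = len(counts[0])
    usable = [row for row in counts if all(c > 0 for c in row)]
    # Log geometric mean of each usable feature across samples
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(row[j]) - lgm for row, lgm in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors


# Hypothetical counts: sample 2 was sequenced twice as deeply as sample 1
counts = [[10, 20], [100, 200], [5, 10], [0, 3]]
sf = size_factors(counts)
print(sf)
```

Dividing each sample's counts by its size factor puts the two samples on a common scale. Using the median makes the estimate robust to a few strongly differential features.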
Using DEXSeq: visualization
plotDEXSeq(exons, "FBgn0030362", cex.axis=1.2, cex=1.3, lwd=2, legend=T, displayTranscripts=T)
Using DEXSeq: HTML Report
library(biomaRt)
mart <- useMart("ensembl", dataset="dmelanogaster_gene_ensembl")
listAttributes(mart)[1:25,]
attributes <- c("ensembl_gene_id", "external_gene_id", "description")
DEXSeqHTML(exons, FDR=0.0001, mart=mart, filter="ensembl_gene_id", attributes=attributes)
Downstream analysis
• Now you have a list of:
  - Genes
  - Isoforms (genes)
  - Exons (genes)
• How to place in functional context?
• Pathway / functional analysis!
  - Gene Ontology over-representation
  - Gene Set Enrichment Analysis
  - Signaling Pathway Impact Analysis
  - Many more…
• Resources:
  - http://gettinggeneticsdone.blogspot.com/2012/03/pathway-analysis-for-high-throughput.html
  - http://www.slideshare.net/turnersd/pathway-analysis-2012-17947529
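Over-representation analysis usually comes down to a hypergeometric (one-sided Fisher) test: given how many genes are in a category and how many genes were called differentially expressed, how surprising is the overlap? A minimal sketch with made-up numbers; the function name is hypothetical, and real tools also correct for testing many categories.

```python
from math import comb


def hypergeom_enrichment_p(total_genes, genes_in_set, de_genes, de_in_set):
    """P(overlap >= de_in_set) when de_genes are drawn at random from
    total_genes, of which genes_in_set belong to the category."""
    upper = min(genes_in_set, de_genes)
    favorable = sum(
        comb(genes_in_set, k) * comb(total_genes - genes_in_set, de_genes - k)
        for k in range(de_in_set, upper + 1)
    )
    return favorable / comb(total_genes, de_genes)


# Hypothetical: 20 genes measured, 5 in the category, 4 DE genes, 3 of them in the category
p = hypergeom_enrichment_p(total_genes=20, genes_in_set=5, de_genes=4, de_in_set=3)
print(round(p, 4))
```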
Workflow Management: Galaxy
• http://usegalaxy.org
Workflow Management: Taverna
• Taverna: http://www.taverna.org.uk/
• TavernaPBS: http://sourceforge.net/projects/tavernapbs/
Further Reading
• RNA-Seq:
  - Garber, M., Grabherr, M. G., Guttman, M., & Trapnell, C. (2011). Computational methods for transcriptome annotation and quantification using RNA-seq. Nature methods, 8(6), 469-77.
  - Marioni, J. C., Mason, C. E., Mane, S. M., Stephens, M., & Gilad, Y. (2008). RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome research, 18(9), 1509-17.
  - Mortazavi, A., Williams, B. A., McCue, K., Schaeffer, L., & Wold, B. (2008). Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature methods, 5(7), 621-8.
  - Ozsolak, F., & Milos, P. M. (2011). RNA sequencing: advances, challenges and opportunities. Nature reviews. Genetics, 12(2), 87-98.
  - Toung, J. M., Morley, M., Li, M., & Cheung, V. G. (2011). RNA-sequence analysis of human B-cells. Genome research, 991-998.
  - Wang, Z., Gerstein, M., & Snyder, M. (2009). RNA-Seq: a revolutionary tool for transcriptomics. Nature reviews. Genetics, 10(1), 57-63.
• Bowtie/Tophat:
  - Langmead, B., Trapnell, C., Pop, M., & Salzberg, S. L. (2009). Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome biology, 10(3), R25.
  - Trapnell, C., Pachter, L., & Salzberg, S. L. (2009). TopHat: discovering splice junctions with RNA-Seq. Bioinformatics (Oxford, England), 25(9), 1105-11.
• Cufflinks:
  - Roberts, A., Pimentel, H., Trapnell, C., & Pachter, L. (2011). Identification of novel transcripts in annotated genomes using RNA-Seq. Bioinformatics (Oxford, England), 27(17), 2325-9.
  - Trapnell, C., Roberts, A., Goff, L., Pertea, G., Kim, D., Kelley, D. R., Pimentel, H., et al. (2012). Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nature Protocols, 7(3), 562-578.
  - Trapnell, C., Williams, B. A., Pertea, G., Mortazavi, A., Kwan, G., van Baren, M. J., Salzberg, S. L., et al. (2010). Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature biotechnology, 28(5), 511-5.
• DEXSeq:
  - Vignette: http://watson.nci.nih.gov/bioc_mirror/packages/2.9/bioc/html/DEXSeq.html
  - Pre-pub manuscript: Anders, S., Reyes, A., Huber, W. (2012). Detecting differential usage of exons from RNA-Seq data. Nature Precedings, DOI: 10.1038/npre.2012.6837.2.
Online Community Forum and Discussion
• SEQanswers:
  - http://SEQanswers.com
  - Format: Forum
  - Li et al. SEQanswers: An open access community for collaboratively decoding genomes. Bioinformatics (2012).
• BioStar:
  - http://biostar.stackexchange.com
  - Format: Q&A
  - Parnell et al. BioStar: an online question & answer resource for the bioinformatics community. PLoS Comp Bio (2011).
• Other Bioinformatics Resources: stephenturner.us/p/edu
DNA Methylation: Importance
• Occurs most frequently at CpG sites
• High methylation at promoters ≈ silencing
• Methylation perturbed in cancer
• Methylation associated with many other complex diseases: neural, autoimmune, response to environment
• Mapping DNA methylation → new disease genes & drug targets
DNA Methylation: Challenges
• Dynamic and tissue-specific
• DNA → collection of cells which vary in 5meC patterns → 5meC pattern is complex
• Further, uneven distribution of CpG targets
• Multiple classes of methods:
  - Bisulfite, sequence-based: Assay methylated target sequences across individual DNAs.
  - Affinity enrichment, count-based: Assay methylation level across many genomic loci.
DNA Methylation: Mapping
DNA Methylation
  BS-Seq          Whole-genome bisulfite sequencing
  RRBS-Seq        Reduced representation bisulfite sequencing
  BC-Seq          Bisulfite capture sequencing
  BSPP            Bisulfite specific padlock probes
  Methyl-Seq      Restriction enzyme based methyl-seq
  MSCC            Methyl sensitive cut counting
  HELP-Seq        HpaII fragment enrichment by ligation PCR
  MCA-Seq         Methylated CpG island amplification
  MeDIP-Seq       Methylated DNA immunoprecipitation
  MBP-Seq         Methyl-binding protein sequencing
  MethylCap-seq   Methylated DNA capture by affinity purification
  MIRA-Seq        Methylated CpG island recovery assay
Gene Expression
  RNA-Seq         High-throughput cDNA sequencing
Methylation: REs and PCR
• Restriction enzyme digest
  - Isoschizomers HpaII and MspI both recognize the same sequence: 5'-CCGG-3'
  - MspI digests regardless of methylation
  - HpaII only digests at unmethylated sites
• PCR → gel electrophoresis → Southern blot
• Pros: Highly sensitive
• Cons: Low-throughput, high false positive rate because of incomplete digestion (for reasons other than methylation).
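The HpaII/MspI comparison can be illustrated in silico. The sketch below finds CCGG sites and reports which ones each enzyme would cut, under the simplifying assumption that only methylation of the internal (CpG) cytosine matters and that it blocks HpaII only. The sequence, methylated positions and function names are made up.

```python
import re


def ccgg_sites(seq):
    """0-based start positions of every 5'-CCGG-3' site."""
    return [m.start() for m in re.finditer("CCGG", seq.upper())]


def digestible_sites(seq, methylated, enzyme):
    """Sites an enzyme would cut.

    methylated: set of 0-based positions of methylated cytosines.
    MspI cuts every CCGG site; HpaII cuts only sites whose internal C
    (site start + 1) is unmethylated.
    """
    sites = ccgg_sites(seq)
    if enzyme == "MspI":
        return sites
    if enzyme == "HpaII":
        return [s for s in sites if (s + 1) not in methylated]
    raise ValueError(f"unknown enzyme: {enzyme}")


# Hypothetical fragment with two CCGG sites; the second site is methylated
seq = "AACCGGTTTTCCGGAA"
methylated = {11}
print(digestible_sites(seq, methylated, "MspI"))
print(digestible_sites(seq, methylated, "HpaII"))
```

Sites cut by MspI but not by HpaII are the candidates for methylated CpGs.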
Bisulfite sequencing
• Sodium bisulfite converts unmethylated (but not methylated) C's into U's.
• This introduces a methylation-specific "SNP".
• RRBS: library enriched for CpG-dense regions by digesting with MspI.
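A toy illustration of the methylation-specific "SNP": after bisulfite treatment and PCR, unmethylated C's are read as T while methylated C's are still read as C, so comparing a read against the reference reveals methylation state. The sketch assumes a perfectly converted, error-free read from the forward strand; the sequence and function names are made up.

```python
def bisulfite_convert(seq, methylated):
    """Simulate bisulfite treatment plus PCR: unmethylated C is read as T.

    methylated: set of 0-based positions of methylated cytosines.
    """
    return "".join(
        "T" if base == "C" and i not in methylated else base
        for i, base in enumerate(seq)
    )


def call_methylation(reference, converted_read):
    """For each reference C, report True if it is still read as C (methylated)."""
    return {
        i: read_base == "C"
        for i, (ref_base, read_base) in enumerate(zip(reference, converted_read))
        if ref_base == "C"
    }


# Hypothetical reference with three cytosines; only position 1 is methylated
ref = "ACGTCCGA"
read = bisulfite_convert(ref, methylated={1})
calls = call_methylation(ref, read)
print(read, calls)
```

Real bisulfite aligners additionally handle the reverse strand (where the change appears as G to A), incomplete conversion and sequencing errors.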
MeDIP-Seq
• MeDIP-Seq = Methylated DNA immunoprecipitation
• Uses antibody against 5-methylcytosine to retrieve methylated fragments from sonicated DNA.
• Enrichment method = count number of reads
MethylCap-Seq
• Uses methyl-binding domain (MBD) protein to obtain DNA with similar methylation levels.
• Also a counting method.
Methylation: Accuracy
• Bock et al. Quantitative comparison of genome-wide DNA methylation mapping technologies. Nature biotechnology, 28(10), 1106-14.
• MeDIP, MethylCap, RRBS largely concordant with Illumina Infinium assay
Methylation methods: coverage
• Coverage varies among different methods
Methylation: Features & Biases
Methylation: Bioinformatics Resources
Resource | Purpose | URL
Batman | MeDIP DNA methylation analysis tool | http://td-blade.gurdon.cam.ac.uk/software/batman
BDPC | DNA methylation analysis platform | http://biochem.jacobs-university.de/BDPC
BSMAP | Whole-genome bisulphite sequence mapping | http://code.google.com/p/bsmap
CpG Analyzer | Windows-based program for bisulphite DNA | -
CpGcluster | CpG island identification | http://bioinfo2.ugr.es/CpGcluster
CpGFinder | Online program for CpG island identification | http://linux1.softberry.com
CpG Island Explorer | Online program for CpG island identification | http://bioinfo.hku.hk/cpgieintro.html
CpG Island Searcher | Online program for CpG island identification | http://cpgislands.usc.edu
CpG PatternFinder | Windows-based program for bisulphite DNA | -
CpG Promoter | Large-scale promoter mapping using CpG islands | http://www.cshl.edu/OTT/html/cpg_promoter.html
CpG ratio and GC content Plotter | Online program for plotting the observed:expected ratio of CpG | http://mwsross.bms.ed.ac.uk/public/cgi-bin/cpg.pl
CpGviewer | Bisulphite DNA sequencing viewer | http://dna.leeds.ac.uk/cpgviewer
CyMATE | Bisulphite-based analysis of plant genomic DNA | http://www.gmi.oeaw.ac.at/en/cymate-index/
EMBOSS CpGPlot/CpGReport | Online program for plotting CpG-rich regions | http://www.ebi.ac.uk/Tools/emboss/cpgplot/index.html
Epigenomics Roadmap | NIH Epigenomics Roadmap Initiative homepage | http://nihroadmap.nih.gov/epigenomics
Epinexus | DNA methylation analysis tools | http://epinexus.net/home.html
MEDME | Software package (using R) for modelling MeDIP experimental data | http://espresso.med.yale.edu/medme
methBLAST | Similarity search program for bisulphite-modified DNA | http://medgen.ugent.be/methBLAST
MethDB | Database for DNA methylation data | http://www.methdb.de
MethPrimer | Primer design for bisulphite PCR | http://www.urogene.org/methprimer
methPrimerDB | PCR primers for DNA methylation analysis | http://medgen.ugent.be/methprimerdb
MethTools | Bisulphite sequence data analysis tool | http://www.methdb.de
MethyCancer Database | Database of cancer DNA methylation data | http://methycancer.psych.ac.cn
Methyl Primer Express | Primer design for bisulphite PCR | http://www.appliedbiosystems.com/
Methylumi | Bioconductor pkg for DNA methylation data from Illumina | http://www.bioconductor.org/packages/bioc/html/
Methylyzer | Bisulphite DNA sequence visualization tool | http://ubio.bioinfo.cnio.es/Methylyzer/main/index.html
mPod | DNA methylation viewer integrated w/ Ensembl genome browser | http://www.compbio.group.cam.ac.uk/Projects/
PubMeth | Database of DNA methylation literature | http://www.pubmeth.org
QUMA | Quantification tool for methylation analysis | http://quma.cdb.riken.jp
TCGA Data Portal | Database of TCGA DNA methylation data | http://cancergenome.nih.gov/dataportal
Methylation: Further Reading
• Bock, C., Tomazou, E. M., Brinkman, A. B., Müller, F., Simmer, F., Gu, H., Jäger, N., et al. (2010). Quantitative comparison of genome-wide DNA methylation mapping technologies. Nature biotechnology, 28(10), 1106-14.
• Brinkman, A. B., Simmer, F., Ma, K., Kaan, A., Zhu, J., & Stunnenberg, H. G. (2010). Whole-genome DNA methylation profiling using MethylCap-seq. Methods (San Diego, Calif.), 52(3), 232-6.
• Brunner, A. L., Johnson, D. S., Kim, S. W., Valouev, A., Reddy, T. E., et al. (2009). Distinct DNA methylation patterns characterize differentiated human embryonic stem cells and developing human fetal liver, 1044-1056.
• Gu, H., Bock, C., Mikkelsen, T. S., Jäger, N., Smith, Z. D., Tomazou, E., Gnirke, A., et al. (2010). Genome-scale DNA methylation mapping of clinical samples at single-nucleotide resolution. Nature methods, 7(2), 133-6.
• Harris, R. A., Wang, T., Coarfa, C., Nagarajan, R. P., Hong, C., Downey, S. L., Johnson, B. E., et al. (2010). Comparison of sequencing-based methods to profile DNA methylation and identification of monoallelic epigenetic modifications. Nature biotechnology, 28(10), 1097-105.
• Kerick, M., Fischer, A., & Schweiger, M.-R. (2012). Bioinformatics for High Throughput Sequencing. (N. Rodríguez-Ezpeleta, M. Hackenberg, & A. M. Aransay, Eds.), 151-167. New York, NY: Springer New York.
• Laird, P. W. (2010). Principles and challenges of genomewide DNA methylation analysis. Nature reviews. Genetics, 11(3), 191-203.
• Weber, M., Davies, J. J., Wittig, D., Oakeley, E. J., Haase, M., Lam, W. L., & Schübeler, D. (2005). Chromosome-wide and promoter-specific analyses identify sites of differential DNA methylation in normal and transformed human cells. Nature genetics, 37(8), 853-62. doi:10.1038/ng1598
Thank you
Web: bioinformatics.virginia.edu
E-mail: [email protected]
Blog: www.GettingGeneticsDone.com
Twitter: twitter.com/genetics_blog