examining gene expression and methylation with next gen sequencing

GENETIC ANALYSIS of Complex Human Diseases Examining Gene Expression and Methylation with Next-Gen Sequencing Stephen Turner, Ph.D. Bioinformatics Core Director bioinformatics.virginia.edu University of Virginia

Upload: stephen-turner

Post on 08-May-2015


DESCRIPTION

Slides on RNA-seq and methylation studies using next-gen sequencing given at the University of Miami Hussman Institute for Human Genomics "Genetic Analysis of Complex Human Diseases" course in 2012 (http://hihg.med.miami.edu/educational-programs/analysis-of-complex-human-diseases/genetic-analysis-of-complex-human-diseases/)

TRANSCRIPT

Page 1: Examining gene expression and methylation with next gen sequencing

GENETIC ANALYSIS of Complex Human Diseases

Examining Gene Expression and Methylation with Next-Gen Sequencing

Stephen Turner, Ph.D. Bioinformatics Core Director bioinformatics.virginia.edu

University of Virginia

Page 2: Examining gene expression and methylation with next gen sequencing

Gene expression pre-2008

- PCR
- Microarrays

Page 3: Examining gene expression and methylation with next gen sequencing


Page 4: Examining gene expression and methylation with next gen sequencing

Advantages of RNA-Seq

- No reference necessary
- Low background (no cross-hybridization)
- Unlimited dynamic range (FC 9000, Science 320:1344)
- Direct counting (microarrays: indirect, via hybridization)
- Can characterize full transcriptome
  - mRNA and ncRNA (miRNA, lncRNA, snoRNA, etc.)
  - Differential gene expression
  - Differential coding output
  - Differential TSS usage
  - Differential isoform expression

Page 5: Examining gene expression and methylation with next gen sequencing


Isoform level data

Page 6: Examining gene expression and methylation with next gen sequencing


Isoform level data

Page 7: Examining gene expression and methylation with next gen sequencing


Differential splicing & TSS use

Page 8: Examining gene expression and methylation with next gen sequencing

Is it accurate?

- Marioni et al. RNA-seq: An assessment of technical reproducibility and comparison with gene expression arrays. Genome Research 2008 18:1509.

Page 9: Examining gene expression and methylation with next gen sequencing

RNA-Seq Challenges

- Library construction
  - Size selection (messenger, small)
  - Strand specificity?
- Bioinformatic challenges
  - Spliced alignment
  - Transcript deconvolution
- Statistical challenges
  - Highly variable abundance
  - Sample size: never, ever, plan n=1
  - Normalization (RPKM)
    - More reads come from longer transcripts and from higher sequencing depth
    - Want to compare features of different lengths
    - Want to compare conditions with different total sequence depth
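The RPKM normalization named above (reads per kilobase of feature per million mapped reads) can be sketched in a few lines of Python. This is a minimal illustration only: the function name and the example counts are assumptions made for the sketch, not values from the slides.

```python
def rpkm(read_count, feature_length_bp, total_mapped_reads):
    """Reads Per Kilobase of feature per Million mapped reads.

    Dividing by feature length corrects for longer transcripts yielding
    more reads; dividing by library size corrects for sequencing depth.
    """
    if feature_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("feature length and library size must be positive")
    return read_count * 1e9 / (feature_length_bp * total_mapped_reads)


# Hypothetical example: the same 2 kb gene sequenced at two depths.
# Twice the depth gives twice the raw count, but the same RPKM.
shallow = rpkm(read_count=500, feature_length_bp=2000, total_mapped_reads=10_000_000)
deep = rpkm(read_count=1000, feature_length_bp=2000, total_mapped_reads=20_000_000)
print(shallow, deep)
```

Raw counts differ by a factor of two between the two hypothetical libraries, while the normalized values agree, which is the point of the third bullet under Normalization.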

Page 10: Examining gene expression and methylation with next gen sequencing

RNA-Seq Overview

Samples of interest:

- Condition 1 (normal colon)
- Condition 2 (colon tumor)

Figure labels: mRNA with AAAAA tail (one per condition), TTTTT, Library.

Reads from the library (FASTQ):

@HWUSI-EAS100R:6:73:941:1973#0/1
GATTTGGGGTTCAAAGCAGTATCGATCAAATA
+HWUSI-EAS100R:6:73:941:1973#0/1
!''*((((***+))%%%++)(%%%%).1***-
@HWUSI-EAS100R:6:73:941:1973#0/1
CATCGACGTAGATCGACTACATGAACTGCTCG
+HWUSI-EAS100R:6:73:941:1973#0/1
!''*+(*+!+(*!+*(((***!%%%%!%%(+-

Page 11: Examining gene expression and methylation with next gen sequencing

Common question #1: Depth

- Question: how much sequence do I need?
- Answer: it’s complicated.
- Oversimplified answer: 20-50 million PE reads / sample (mouse/human).
- Depends on:
  - Size & complexity of transcriptome
  - Application: differential gene expression, transcript discovery
  - Tissue type, RNA quality, library preparation
  - Sequencing type: length, paired-end vs single-end, etc.
- Find a publication in your field with similar goals.
- Good news: ¼ HiSeq lane usually sufficient.

Page 12: Examining gene expression and methylation with next gen sequencing

Common question #2: Sample Size

- Question: How many samples should I sequence?
- Oversimplified answer: At least 3 biological replicates per condition.
- Depends on:
  - Sequencing depth
  - Application
  - Goals (prioritization, biomarker discovery, etc.)
  - Effect size, desired power, statistical significance
- Find a publication with similar goals

Page 13: Examining gene expression and methylation with next gen sequencing

Common question #3: Workflow

- How do I analyze the data?
- No standards!
  - Unspliced aligners: BWA, Bowtie, Bowtie2, MANY others!
  - Spliced aligners: STAR, RUM, TopHat, TopHat2-Bowtie1, TopHat2-Bowtie2, GSNAP, MANY others.
  - Reference builds & annotations: UCSC, Entrez, Ensembl
  - Assembly: Cufflinks, Scripture, Trinity, G.Mor.Se, Velvet, Trans-ABySS
  - Quantification: Cufflinks, RSEM, eXpress, MISO, etc.
  - Differential expression: Cuffdiff, Cuffdiff2, DEGseq, DESeq, edgeR, Myrna
- Like early microarray days: lots of excitement, lots of tools, little knowledge of how to integrate the tools into a pipeline!
- Benchmarks
  - Microarray: Spike-ins (Irizarry)
  - RNA-Seq: ???, simulation, ???

Page 14: Examining gene expression and methylation with next gen sequencing


Common question #3: Workflow

Eyras et al. Methods to Study Splicing from RNA-Seq. http://dx.doi.org/10.6084/m9.figshare.679993

Turner SD. RNA-seq Workflows and Tools. http://dx.doi.org/10.6084/m9.figshare.662782

Page 15: Examining gene expression and methylation with next gen sequencing

Phases of NGS Analysis

- Primary
  - Conversion of raw machine signal into sequence and qualities
- Secondary
  - Alignment of reads to reference genome or transcriptome
  - or de novo assembly of reads into contigs
- Tertiary
  - SNP discovery/genotyping
  - Peak discovery/quantification (ChIP, MeDIP)
  - Transcript assembly/quantification (RNA-seq)
- Quaternary
  - Differential expression
  - Enrichment, pathways, correlation, clustering, visualization, etc.
  - http://gettinggeneticsdone.blogspot.com/2012/03/pathway-analysis-for-high-throughput.html
  - http://www.slideshare.net/turnersd/pathway-analysis-2012-17947529

Page 16: Examining gene expression and methylation with next gen sequencing

Primary Analysis: Get FASTQ file

@HWUSI-EAS100R:6:73:941:1973#0/1
GATTTGGGGTTCAAAGCAGTATCGATCAAATA
+HWUSI-EAS100R:6:73:941:1973#0/1
!''*((((***+))%%%++)(%%%%).1***-

Page 17: Examining gene expression and methylation with next gen sequencing

“Phred-scaled” base qualities

# $p is probability base is erroneous
$Q = -10 * log($p) / log(10);        # Phred Q
$q = chr(($Q<=40? $Q : 40) + 33);    # FASTQ quality character
$Q = ord($q) - 33;                   # 33 offset

SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS.....................................................
...............................IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII......................
..........................XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
|                         |    |        |                              |                     |
33                        59   64       73                            104                   126

S - Sanger        Phred+33,  41 values (0, 40)
I - Illumina 1.3  Phred+64,  41 values (0, 40)
X - Solexa        Solexa+64, 68 values (-5, 62)
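The same conversion can be sketched in Python, going from a FASTQ quality string back to Phred scores and error probabilities. The function names are assumptions made for this sketch; the offset of 33 is the Sanger encoding from the chart above, and the quality string is the one from the example read.

```python
def decode_qualities(quality_string, offset=33):
    """Convert FASTQ quality characters to Phred scores (Sanger: offset 33)."""
    return [ord(ch) - offset for ch in quality_string]


def phred_to_error_prob(q):
    """Invert Q = -10 * log10(p) to recover the error probability p."""
    return 10 ** (-q / 10)


quals = decode_qualities("!''*((((***+))%%%++)(%%%%).1***-")
print(quals)
print([round(phred_to_error_prob(q), 3) for q in quals[:4]])
```

For Illumina 1.3 data the same function would be called with `offset=64`. Picking the wrong offset shifts every score by 31, so it is worth checking the range of characters in a file before decoding.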

Page 18: Examining gene expression and methylation with next gen sequencing

Secondary analysis

- Alignment back to the reference
  - Computationally demanding (can’t use BLAST)
  - Many algorithms (Maq, BWA, bowtie, bowtie2, Mosaik, NovoAlign, SOAP2, SSAHA, …)
  - http://en.wikipedia.org/wiki/List_of_sequence_alignment_software
  - Sensitivity to sequencing errors, polymorphisms, indels, rearrangements
  - Tradeoffs in time vs. memory vs. performance

Page 19: Examining gene expression and methylation with next gen sequencing


RNA-Seq Workflow 1: Differential Gene Expression

Page 20: Examining gene expression and methylation with next gen sequencing


RNA-Seq Workflow 2: Differential Isoform Expression, Exon Usage

Page 21: Examining gene expression and methylation with next gen sequencing

Download data & software

- Public data from GEO. E.g. GSE32038
  - http://www.ncbi.nlm.nih.gov/projects/geo/query/acc.cgi?acc=GSE32038
  - Trapnell et al. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nature Protocols 2012: 7:562.
- Sequence, annotation, indexes (Ensembl)
  - iGenomes: http://tophat.cbcb.umd.edu/igenomes.html
  - Genes: /Annotation/Genes/genes.gtf
  - Indexes: /Sequence/BowtieIndex/genome.*
- Software:
  - Samtools: http://samtools.sourceforge.net/
  - FastQC: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
  - Bowtie: http://bowtie-bio.sourceforge.net/index.shtml
  - Tophat: http://tophat.cbcb.umd.edu/
  - HTSeq: http://www-huber.embl.de/users/anders/HTSeq/doc/overview.html
  - R: http://www.r-project.org/
  - DESeq2: http://www.bioconductor.org/packages/2.12/bioc/html/DESeq2.html
  - Cufflinks: http://cufflinks.cbcb.umd.edu/
  - cummeRbund: http://compbio.mit.edu/cummeRbund/

Page 22: Examining gene expression and methylation with next gen sequencing

Do some quality assessment

Software:

- Picard: picard.sourceforge.net
- FastQC: bioinformatics.bbsrc.ac.uk/projects/fastqc
- RSeQC: code.google.com/p/rseqc
- FastX Toolkit: hannonlab.cshl.edu/fastx_toolkit
- R/ShortRead: bioconductor.org/packages/bioc/html/ShortRead.html

Page 23: Examining gene expression and methylation with next gen sequencing

Mapping across splice junctions: tophat

1. Map reads to genome
2. Collect unmappable reads
3. Break reads into segments. Small segments often align independently. If segments align 100 bp to several kb apart, infer a splice.

tophat -G genes.gtf -o C1_R1_tophatout /path/bowtieindex/genome C1_R1_1.fq C1_R1_2.fq
tophat -G genes.gtf -o C1_R2_tophatout /path/bowtieindex/genome C1_R2_1.fq C1_R2_2.fq
tophat -G genes.gtf -o C1_R3_tophatout /path/bowtieindex/genome C1_R3_1.fq C1_R3_2.fq
tophat -G genes.gtf -o C2_R1_tophatout /path/bowtieindex/genome C2_R1_1.fq C2_R1_2.fq
tophat -G genes.gtf -o C2_R2_tophatout /path/bowtieindex/genome C2_R2_1.fq C2_R2_2.fq
tophat -G genes.gtf -o C2_R3_tophatout /path/bowtieindex/genome C2_R3_1.fq C2_R3_2.fq

Arguments, in order: gene annotation (-G), output directory (-o), Bowtie index, read 1, read 2.
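The six commands differ only in the sample name, so they can be generated rather than typed. The sketch below, in Python, builds the same argument lists from the sample naming scheme used on this slide; the helper name `tophat_command` is an assumption, and the paths are the placeholders from the slide. It only prints the commands and does not run TopHat.

```python
import shlex


def tophat_command(sample, annotation="genes.gtf", index="/path/bowtieindex/genome"):
    """Build the argument list for one paired-end sample, as on the slide."""
    return [
        "tophat",
        "-G", annotation,              # gene annotation
        "-o", f"{sample}_tophatout",   # output directory
        index,                         # Bowtie index prefix
        f"{sample}_1.fq",              # read 1
        f"{sample}_2.fq",              # read 2
    ]


# Two conditions (C1, C2) with three replicates each (R1..R3).
samples = [f"C{c}_R{r}" for c in (1, 2) for r in (1, 2, 3)]
commands = [tophat_command(s) for s in samples]

for cmd in commands:
    print(shlex.join(cmd))
```

To execute rather than print, each list could be passed to `subprocess.run(cmd, check=True)`, assuming `tophat` is on the PATH.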

Page 24: Examining gene expression and methylation with next gen sequencing

Workflow 1: Differential Gene Expression

Step 1: Align to Genome
Step 2: Count Reads overlapping genes
Step 3: Differential expression

Page 25: Examining gene expression and methylation with next gen sequencing

Workflow 1: Differential Gene Expression

Step 1: Align to Genome
Step 2: Count Reads overlapping genes
Step 3: Differential expression

Software: HTSeq http://www-huber.embl.de/users/anders/HTSeq

First convert binary .bam file to text .sam file using samtools:
samtools view accepted_hits.bam > C1_R1.sam

Run htseq-count on each of the alignments:
htseq-count <sam_file> <gtf_file>
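What "count reads overlapping genes" means can be illustrated with a toy Python sketch. It is a deliberately simplified stand-in, not htseq-count itself: it ignores strand, spliced reads and mate pairs, and the gene coordinates, read coordinates and the `no_feature` / `ambiguous` labels are all made up for the example.

```python
from collections import Counter


def count_reads(genes, reads):
    """Count reads per gene.

    genes: {gene_id: (chrom, start, end)}, 1-based inclusive coordinates.
    reads: iterable of (chrom, start, end).
    A read overlapping exactly one gene is counted for that gene; reads
    overlapping no gene or several genes are tallied separately.
    """
    counts = Counter({gene_id: 0 for gene_id in genes})
    for chrom, start, end in reads:
        hits = [
            gene_id
            for gene_id, (g_chrom, g_start, g_end) in genes.items()
            if g_chrom == chrom and start <= g_end and end >= g_start
        ]
        if len(hits) == 1:
            counts[hits[0]] += 1
        elif not hits:
            counts["no_feature"] += 1
        else:
            counts["ambiguous"] += 1
    return counts


# Hypothetical annotation: two genes that overlap between 450 and 500.
genes = {"geneA": ("chr1", 100, 500), "geneB": ("chr1", 450, 900)}
reads = [
    ("chr1", 120, 170),   # inside geneA only
    ("chr1", 460, 510),   # overlaps both genes
    ("chr1", 950, 1000),  # overlaps nothing
    ("chr1", 600, 650),   # inside geneB only
]
counts = count_reads(genes, reads)
print(dict(counts))
```

The per-gene totals from a tool like this, one table per sample, are the input to the differential expression step.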

Page 26: Examining gene expression and methylation with next gen sequencing

Workflow 1: Differential Gene Expression

Step 1: Align to Genome
Step 2: Count Reads overlapping genes
Step 3: Differential expression

Software: DESeq2 http://www.bioconductor.org/packages/2.12/bioc/html/DESeq2.html

> library(DESeq2)
> sampleFiles <- c("C1_R1.counts.txt", "C1_R2.counts.txt", "C1_R3.counts.txt",
                   "C2_R1.counts.txt", "C2_R2.counts.txt", "C2_R3.counts.txt")
> sampleCondition <- factor(substr(sampleFiles, 1, 2))
> sampleTable <- data.frame(sampleName=sampleFiles, fileName=sampleFiles, condition=sampleCondition)
> sampleTable
        sampleName         fileName condition
1 C1_R1.counts.txt C1_R1.counts.txt        C1
2 C1_R2.counts.txt C1_R2.counts.txt        C1
3 C1_R3.counts.txt C1_R3.counts.txt        C1
4 C2_R1.counts.txt C2_R1.counts.txt        C2
5 C2_R2.counts.txt C2_R2.counts.txt        C2
6 C2_R3.counts.txt C2_R3.counts.txt        C2
> dds <- DESeqDataSetFromHTSeqCount(sampleTable=sampleTable, directory=".", design=~condition)
> dds <- DESeq(dds)
> res <- results(dds)
> res <- res[order(res$padj), ]   # sort by adjusted p-value (FDR)
> plotMA(dds)
...

Page 27: Examining gene expression and methylation with next gen sequencing


RNA-Seq Workflow 2: Differential Isoform Expression, Exon Usage

Page 28: Examining gene expression and methylation with next gen sequencing

A change in fragment count for a gene does not necessarily equal a change in expression.

Trapnell, Cole, et al. "Differential analysis of gene regulation at transcript resolution with RNA-seq." Nature biotechnology 31.1 (2012): 46-53.

Page 29: Examining gene expression and methylation with next gen sequencing

Workflow 2a: Assemble transcripts for each sample: cufflinks

- Cufflinks
  - Identifies mutually incompatible fragments
  - Identifies the minimal set of transcripts that explains all the fragments.

cufflinks -o C1_R1_cufflinksout C1_R1_tophatout/accepted_hits.bam
cufflinks -o C1_R2_cufflinksout C1_R2_tophatout/accepted_hits.bam
cufflinks -o C1_R3_cufflinksout C1_R3_tophatout/accepted_hits.bam
cufflinks -o C2_R1_cufflinksout C2_R1_tophatout/accepted_hits.bam
cufflinks -o C2_R2_cufflinksout C2_R2_tophatout/accepted_hits.bam
cufflinks -o C2_R3_cufflinksout C2_R3_tophatout/accepted_hits.bam

Arguments, in order: output directory (-o), path to alignment.

Page 30: Examining gene expression and methylation with next gen sequencing

Merge assemblies: cuffmerge

- Merge assemblies to create a single merged transcriptome annotation.
  - Option 1: Pool alignments and assemble all at once.
    - Computationally demanding
    - Assembler will be faced with a complex mixture of isoforms → more error
  - Option 2: Assemble alignments individually, merge resulting assemblies
    - Cuffmerge: meta-assembler using parsimony.
    - Genes with low expression → insufficient coverage for reconstruction.
    - Merging often recovers complete gene.
    - Newly discovered isoforms integrated w/ known ones (RABT).

Page 31: Examining gene expression and methylation with next gen sequencing

Merge assemblies: cuffmerge

- Create “manifest” of location of all assemblies (assemblies.txt):

./C1_R1_cufflinksout/transcripts.gtf
./C1_R2_cufflinksout/transcripts.gtf
./C1_R3_cufflinksout/transcripts.gtf
./C2_R1_cufflinksout/transcripts.gtf
./C2_R2_cufflinksout/transcripts.gtf
./C2_R3_cufflinksout/transcripts.gtf

- Run Cuffmerge on assemblies using RABT:

cuffmerge -g /path/to/annotation/genes.gtf -s /path/to/refgenome/genome.fa assemblies.txt

Arguments, in order: reference gene annotation (-g), reference genome sequence (-s), manifest from above.

Page 32: Examining gene expression and methylation with next gen sequencing

Differential expression: cuffdiff

- Identify differentially expressed genes & transcripts

cuffdiff -o cuffdiff_out -b genome.fa -u merged.gtf \
./C1_R1_tophatout/accepted_hits.bam,\
./C1_R2_tophatout/accepted_hits.bam,\
./C1_R3_tophatout/accepted_hits.bam \
./C2_R1_tophatout/accepted_hits.bam,\
./C2_R2_tophatout/accepted_hits.bam,\
./C2_R3_tophatout/accepted_hits.bam

Arguments: output directory (-o), reference sequence (-b), merged assembly (merged.gtf), then the location of the alignments (replicates separated by commas, conditions separated by a space).

Example locus in figure: 1 gene, 2 TSS, 2 CDS, 3 isoforms.

Page 33: Examining gene expression and methylation with next gen sequencing


Downstream analysis & visualization

Page 34: Examining gene expression and methylation with next gen sequencing

Visualization with cummeRbund

- Install cummeRbund:
  - Install from Bioconductor:
    source("http://bioconductor.org/biocLite.R")
    biocLite("cummeRbund")
  - Or download and install the latest version from http://compbio.mit.edu/cummeRbund/
- Load the package:
    library(cummeRbund)
- Read in the data:
    cuff <- readCufflinks("/path/to/cuffdiff/output")

Page 35: Examining gene expression and methylation with next gen sequencing

Visualization with cummeRbund

csDensity(genes(cuff))
csBoxplot(genes(cuff))
csScatter(genes(cuff), "C1", "C2", smooth=T)
csVolcano(genes(cuff), "C1", "C2")

Page 36: Examining gene expression and methylation with next gen sequencing

Visualization with cummeRbund

mygene2 <- getGene(cuff, "Rala")
expressionBarplot(mygene2)
expressionBarplot(isoforms(mygene2))

Page 37: Examining gene expression and methylation with next gen sequencing

DEXSeq

- Differential Gene Expression (e.g. DESeq)
- Differential Isoform Expression (e.g. Cufflinks)
- Differential Exon Usage
- What’s different about DEXSeq?
  - Doesn’t do full transcript assembly (Cufflinks)
  - Doesn’t count fragments mapping to genes (DESeq)
  - Avoids assembly and looks for differences in reads mapping to individual exons.
  - Uses counts (negative binomial)

Page 38: Examining gene expression and methylation with next gen sequencing

Using DEXSeq: Installation

- Installation & load:
    source("http://bioconductor.org/biocLite.R")
    biocLite("DEXSeq")
    library(DEXSeq)
- Installation comes bundled with useful python scripts in the python_scripts directory of the library. Put these in your PATH.

Page 39: Examining gene expression and methylation with next gen sequencing

Using DEXSeq: Data preparation

- First, prepare “flattened” GFF. The script comes with DEXSeq; input.gtf is the reference annotation:

dexseq_prepare_annotation.py input.gtf exons.gff

- Create sorted SAM files:

samtools view C1_R1-tophat-out/accepted_hits.bam | sort -k 1,1 -k2,2n > C1_R1.sam
samtools view C1_R2-tophat-out/accepted_hits.bam | sort -k 1,1 -k2,2n > C1_R2.sam
samtools view C1_R3-tophat-out/accepted_hits.bam | sort -k 1,1 -k2,2n > C1_R3.sam
samtools view C2_R1-tophat-out/accepted_hits.bam | sort -k 1,1 -k2,2n > C2_R1.sam
samtools view C2_R2-tophat-out/accepted_hits.bam | sort -k 1,1 -k2,2n > C2_R2.sam
samtools view C2_R3-tophat-out/accepted_hits.bam | sort -k 1,1 -k2,2n > C2_R3.sam

- Count reads overlapping counting bins. The script comes with DEXSeq; arguments after the options are, in order: flattened annotation, alignment, output file:

dexseq_count.py -p no -s no exons.gff C1_R1.sam C1_R1.counts.txt
dexseq_count.py -p no -s no exons.gff C1_R2.sam C1_R2.counts.txt
dexseq_count.py -p no -s no exons.gff C1_R3.sam C1_R3.counts.txt
dexseq_count.py -p no -s no exons.gff C2_R1.sam C2_R1.counts.txt
dexseq_count.py -p no -s no exons.gff C2_R2.sam C2_R2.counts.txt
dexseq_count.py -p no -s no exons.gff C2_R3.sam C2_R3.counts.txt

Page 40: Examining gene expression and methylation with next gen sequencing

Using DEXSeq: Data import

- The pasilla package vignette gives detailed instructions on how to do this: http://www.bioconductor.org/packages/release/data/experiment/html/pasilla.html

> design <- data.frame(condition=c(rep("C1",3), rep("C2",3)), replicate=rep(1:3,2))
> rownames(design) <- with(design, paste(condition, "_R", replicate, sep=""))
> design
      condition replicate
C1_R1        C1         1
C1_R2        C1         2
C1_R3        C1         3
C2_R1        C2         1
C2_R2        C2         2
C2_R3        C2         3
> countfiles <- file.path(".", paste(rownames(design), ".counts.txt", sep=""))
> countfiles
[1] "./C1_R1.counts.txt" "./C1_R2.counts.txt" "./C1_R3.counts.txt" "./C2_R1.counts.txt"
[5] "./C2_R2.counts.txt" "./C2_R3.counts.txt"
> flattenedfile <- "/Users/sdt5z/smb/u/genomes/dexseq/exons_dme_ens_bdgp525.gff"
> exons <- read.HTSeqCounts(countfiles=countfiles, design=design,
                            flattenedfile=flattenedfile)
> sampleNames(exons) <- rownames(design)

Page 41: Examining gene expression and methylation with next gen sequencing

Using DEXSeq: Data Analysis

# Estimate size factors (normalizes for sequencing depth)
exons <- estimateSizeFactors(exons)
sizeFactors(exons)

# Estimate dispersion
exons <- estimateDispersions(exons)
exons <- fitDispersionFunction(exons)

# Test for Differential Exon Usage
exons <- testForDEU(exons)
exons <- estimatelog2FoldChanges(exons)
result <- DEUresultTable(exons)

# How many are significant at FDR 0.0001?
table(result$padjust<0.0001)

# M vs A plot
plot(result$meanBase, result[, "log2fold(C2/C1)"], log="x")
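The FDR threshold above is applied to adjusted p-values. As a rough illustration of what such an adjustment does, here is a Python sketch of the Benjamini-Hochberg procedure. That the `padjust` column is a Benjamini-Hochberg adjustment is an assumption of this sketch (the slides do not say which method is used), and the p-values are invented.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four exons.
pvals = [0.01, 0.04, 0.03, 0.005]
padj = benjamini_hochberg(pvals)
n_significant = sum(p < 0.03 for p in padj)
print(padj, n_significant)
```

Counting how many adjusted values fall below a cutoff, as in the last line, mirrors the `table(...)` call on the slide.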

Page 42: Examining gene expression and methylation with next gen sequencing

Using DEXSeq: visualization

plotDEXSeq(exons, "FBgn0030362", cex.axis=1.2, cex=1.3, lwd=2, legend=T, displayTranscripts=T)

Page 43: Examining gene expression and methylation with next gen sequencing

Using DEXSeq: HTML Report

library(biomaRt)
mart <- useMart("ensembl", dataset="dmelanogaster_gene_ensembl")
listAttributes(mart)[1:25,]
attributes <- c("ensembl_gene_id", "external_gene_id", "description")
DEXSeqHTML(exons, FDR=0.0001, mart=mart, filter="ensembl_gene_id", attributes=attributes)

Page 44: Examining gene expression and methylation with next gen sequencing

Downstream analysis

- Now you have a list of:
  - Genes
  - Isoforms (genes)
  - Exons (genes)
- How to place in functional context?
- Pathway / functional analysis!
  - Gene Ontology over-representation
  - Gene Set Enrichment Analysis
  - Signaling Pathway Impact Analysis
  - Many more…
- Resources:
  - http://gettinggeneticsdone.blogspot.com/2012/03/pathway-analysis-for-high-throughput.html
  - http://www.slideshare.net/turnersd/pathway-analysis-2012-17947529
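Gene Ontology over-representation is commonly tested with a one-sided hypergeometric test: given a gene universe, how surprising is the overlap between the significant genes and the genes annotated to a term? The Python sketch below shows that calculation with the standard library only. The function name and all counts are hypothetical, and real tools add multiple-testing correction on top.

```python
from math import comb


def overrepresentation_pvalue(universe, annotated, selected, overlap):
    """P(X >= overlap) for a hypergeometric draw.

    universe:  number of genes tested
    annotated: genes in the universe annotated to the term
    selected:  genes called significant
    overlap:   significant genes that are annotated to the term
    """
    upper = min(annotated, selected)
    total = comb(universe, selected)
    tail = sum(
        comb(annotated, k) * comb(universe - annotated, selected - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 20 genes tested, 5 annotated to the term,
# 4 called significant, 3 of those annotated to the term.
p = overrepresentation_pvalue(universe=20, annotated=5, selected=4, overlap=3)
print(p)
```

A small p-value here says the term is hit more often than a random gene list of the same size would hit it.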

Page 45: Examining gene expression and methylation with next gen sequencing

Workflow Management: Galaxy

- http://usegalaxy.org

Page 46: Examining gene expression and methylation with next gen sequencing

Workflow Management: Taverna

- Taverna: http://www.taverna.org.uk/
- TavernaPBS: http://sourceforge.net/projects/tavernapbs/

Page 47: Examining gene expression and methylation with next gen sequencing

Further Reading

- RNA-Seq:
  - Garber, M., Grabherr, M. G., Guttman, M., & Trapnell, C. (2011). Computational methods for transcriptome annotation and quantification using RNA-seq. Nature methods, 8(6), 469-77.
  - Marioni, J. C., Mason, C. E., Mane, S. M., Stephens, M., & Gilad, Y. (2008). RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome research, 18(9), 1509-17.
  - Mortazavi, A., Williams, B. A., McCue, K., Schaeffer, L., & Wold, B. (2008). Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature methods, 5(7), 621-8.
  - Ozsolak, F., & Milos, P. M. (2011). RNA sequencing: advances, challenges and opportunities. Nature reviews. Genetics, 12(2), 87-98.
  - Toung, J. M., Morley, M., Li, M., & Cheung, V. G. (2011). RNA-sequence analysis of human B-cells. Genome research, 991-998.
  - Wang, Z., Gerstein, M., & Snyder, M. (2009). RNA-Seq: a revolutionary tool for transcriptomics. Nature reviews. Genetics, 10(1), 57-63.
- Bowtie/Tophat:
  - Langmead, B., Trapnell, C., Pop, M., & Salzberg, S. L. (2009). Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome biology, 10(3), R25.
  - Trapnell, C., Pachter, L., & Salzberg, S. L. (2009). TopHat: discovering splice junctions with RNA-Seq. Bioinformatics (Oxford, England), 25(9), 1105-11.
- Cufflinks:
  - Roberts, A., Pimentel, H., Trapnell, C., & Pachter, L. (2011). Identification of novel transcripts in annotated genomes using RNA-Seq. Bioinformatics (Oxford, England), 27(17), 2325-9.
  - Trapnell, C., Roberts, A., Goff, L., Pertea, G., Kim, D., Kelley, D. R., Pimentel, H., et al. (2012). Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nature Protocols, 7(3), 562-578.
  - Trapnell, C., Williams, B. A., Pertea, G., Mortazavi, A., Kwan, G., van Baren, M. J., Salzberg, S. L., et al. (2010). Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature biotechnology, 28(5), 511-5.
- DEXSeq:
  - Vignette: http://watson.nci.nih.gov/bioc_mirror/packages/2.9/bioc/html/DEXSeq.html.
  - Pre-pub manuscript: Anders, S., Reyes, A., Huber, W. (2012). Detecting differential usage of exons from RNA-Seq data. Nature Precedings, DOI: 10.1038/npre.2012.6837.2.

Page 48: Examining gene expression and methylation with next gen sequencing

Online Community Forum and Discussion

- SEQanswers:
  - http://SEQanswers.com
  - Format: Forum
  - Li et al. SEQanswers: An open access community for collaboratively decoding genomes. Bioinformatics (2012).
- BioStar:
  - http://biostar.stackexchange.com
  - Format: Q&A
  - Parnell et al. BioStar: an online question & answer resource for the bioinformatics community. PLoS Comp Bio (2011).
- Other Bioinformatics Resources: stephenturner.us/p/edu

Page 49: Examining gene expression and methylation with next gen sequencing

DNA Methylation: Importance

- Occurs most frequently at CpG sites
- High methylation at promoters ≈ silencing
- Methylation perturbed in cancer
- Methylation associated with many other complex diseases: neural, autoimmune, response to env.
- Mapping DNA methylation → new disease genes & drug targets.

Page 50: Examining gene expression and methylation with next gen sequencing

DNA Methylation: Challenges

- Dynamic and tissue-specific
- DNA → Collection of cells which vary in 5meC patterns → 5meC pattern is complex.
- Further, uneven distribution of CpG targets
- Multiple classes of methods:
  - Bisulfite, sequence-based: Assay methylated target sequences across individual DNAs.
  - Affinity enrichment, count-based: Assay methylation level across many genomic loci.

Page 51: Examining gene expression and methylation with next gen sequencing

DNA Methylation: Mapping

DNA Methylation

- BS-Seq: Whole-genome bisulfite sequencing
- RRBS-Seq: Reduced representation bisulfite sequencing
- BC-Seq: Bisulfite capture sequencing
- BSPP: Bisulfite specific padlock probes
- Methyl-Seq: Restriction enzyme based methyl-seq
- MSCC: Methyl sensitive cut counting
- HELP-Seq: HpaII fragment enrichment by ligation PCR
- MCA-Seq: Methylated CpG island amplification
- MeDIP-Seq: Methylated DNA immunoprecipitation
- MBP-Seq: Methyl-binding protein sequencing
- MethylCap-seq: Methylated DNA capture by affinity purification
- MIRA-Seq: Methylated CpG island recovery assay

Gene Expression

- RNA-Seq: High-throughput cDNA sequencing

Page 52: Examining gene expression and methylation with next gen sequencing

Methylation: REs and PCR

- Restriction enzyme digest
  - Isoschizomers HpaII and MspI both recognize the same sequence: 5’-CCGG-3’
  - MspI digests regardless of methylation
  - HpaII only digests at unmethylated sites
- PCR → gel electrophoresis → Southern blot
- Pros: Highly sensitive
- Cons: Low-throughput, high false positive rate because of incomplete digestion (for reasons other than methylation).
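The HpaII/MspI comparison can be mimicked in silico. The Python sketch below finds CCGG sites in a sequence and predicts which ones each enzyme would cut, using the simplified rule from this slide (MspI cuts every site, HpaII cuts only unmethylated sites). The function names, the toy sequence and the set of methylated site positions are all assumptions made for illustration.

```python
def find_ccgg_sites(sequence):
    """Return 0-based start positions of every CCGG recognition site."""
    seq = sequence.upper()
    sites = []
    pos = seq.find("CCGG")
    while pos != -1:
        sites.append(pos)
        pos = seq.find("CCGG", pos + 1)
    return sites


def predicted_cuts(sequence, methylated_sites):
    """Predict cut sites per enzyme under the slide's simplified rule."""
    sites = find_ccgg_sites(sequence)
    return {
        "MspI": sites,  # digests regardless of methylation
        "HpaII": [s for s in sites if s not in methylated_sites],
    }


# Toy sequence with two CCGG sites; the second one is treated as methylated.
dna = "ATCCGGTTACCGGA"
cuts = predicted_cuts(dna, methylated_sites={9})
print(cuts)
```

Sites cut by MspI but not by HpaII are the ones inferred to be methylated, which is what the downstream PCR and gel compare.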

Page 53: Examining gene expression and methylation with next gen sequencing

Bisulfite sequencing

- Sodium bisulfite converts unmethylated (but not methylated) C’s into U’s.
- This introduces a methylation-specific “SNP”.
- RRBS: library enriched for CpG-dense regions by digesting with MspI.
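The methylation-specific "SNP" idea can be shown with a small Python simulation: convert unmethylated C's (read as T after amplification), then compare the converted read with the reference to call each cytosine. This is a forward-strand-only sketch that assumes the read aligns to the reference without gaps or offset; the function names, sequence and methylated positions are hypothetical.

```python
def bisulfite_convert(sequence, methylated_positions):
    """Simulate bisulfite treatment of the forward strand.

    Unmethylated C becomes U and is read as T after PCR;
    methylated C is protected and stays C.
    """
    converted = []
    for i, base in enumerate(sequence.upper()):
        if base == "C" and i not in methylated_positions:
            converted.append("T")
        else:
            converted.append(base)
    return "".join(converted)


def call_methylation(reference, read):
    """Call each reference C as methylated (True) or unmethylated (False)."""
    calls = {}
    for i, (ref_base, read_base) in enumerate(zip(reference.upper(), read.upper())):
        if ref_base != "C":
            continue
        if read_base == "C":
            calls[i] = True    # protected from conversion
        elif read_base == "T":
            calls[i] = False   # converted
    return calls


reference = "ACGTCCGA"
read = bisulfite_convert(reference, methylated_positions={1})
calls = call_methylation(reference, read)
print(read, calls)
```

Real bisulfite aligners also have to handle the reverse strand (G to A changes) and true C/T polymorphisms, which this sketch ignores.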

Page 54: Examining gene expression and methylation with next gen sequencing

MeDIP-Seq

- MeDIP-Seq = Methylated DNA immunoprecipitation
- Uses antibody against 5-methylcytosine to retrieve methylated fragments from sonicated DNA.
- Enrichment method = count number of reads

Page 55: Examining gene expression and methylation with next gen sequencing

MethylCap-Seq

- Uses methyl-binding domain (MBD) protein to obtain DNA with similar methylation levels.
- Also a counting method.

Page 56: Examining gene expression and methylation with next gen sequencing

Methylation: Accuracy

- Bock et al. Quantitative comparison of genome-wide DNA methylation mapping technologies. Nature biotechnology, 28(10), 1106-14.
- MeDIP, MethylCap, RRBS largely concordant with Illumina Infinium assay

Page 57: Examining gene expression and methylation with next gen sequencing

Methylation methods: coverage

- Coverage varies among different methods

Page 58: Examining gene expression and methylation with next gen sequencing


Methylation: Features & Biases

Page 59: Examining gene expression and methylation with next gen sequencing

Methylation: Bioinformatics Resources

Resource | Purpose | URL
Batman | MeDIP DNA methylation analysis tool | http://td-blade.gurdon.cam.ac.uk/software/batman
BDPC | DNA methylation analysis platform | http://biochem.jacobs-university.de/BDPC
BSMAP | Whole-genome bisulphite sequence mapping | http://code.google.com/p/bsmap
CpG Analyzer | Windows-based program for bisulphite DNA | -
CpGcluster | CpG island identification | http://bioinfo2.ugr.es/CpGcluster
CpGFinder | Online program for CpG island identification | http://linux1.softberry.com
CpG Island Explorer | Online program for CpG Island identification | http://bioinfo.hku.hk/cpgieintro.html
CpG Island Searcher | Online program for CpG Island identification | http://cpgislands.usc.edu
CpG PatternFinder | Windows-based program for bisulphite DNA | -
CpG Promoter | Large-scale promoter mapping using CpG islands | http://www.cshl.edu/OTT/html/cpg_promoter.html
CpG ratio and GC content Plotter | Online program for plotting the observed:expected ratio of CpG | http://mwsross.bms.ed.ac.uk/public/cgi-bin/cpg.pl
CpGviewer | Bisulphite DNA sequencing viewer | http://dna.leeds.ac.uk/cpgviewer
CyMATE | Bisulphite-based analysis of plant genomic DNA | http://www.gmi.oeaw.ac.at/en/cymate-index/
EMBOSS CpGPlot/CpGReport | Online program for plotting CpG-rich regions | http://www.ebi.ac.uk/Tools/emboss/cpgplot/index.html
Epigenomics Roadmap | NIH Epigenomics Roadmap Initiative homepage | http://nihroadmap.nih.gov/epigenomics
Epinexus | DNA methylation analysis tools | http://epinexus.net/home.html
MEDME | Software package (using R) for modelling MeDIP experimental data | http://espresso.med.yale.edu/medme
methBLAST | Similarity search program for bisulphite-modified DNA | http://medgen.ugent.be/methBLAST
MethDB | Database for DNA methylation data | http://www.methdb.de
MethPrimer | Primer design for bisulphite PCR | http://www.urogene.org/methprimer
methPrimerDB | PCR primers for DNA methylation analysis | http://medgen.ugent.be/methprimerdb
MethTools | Bisulphite sequence data analysis tool | http://www.methdb.de
MethyCancer Database | Database of cancer DNA methylation data | http://methycancer.psych.ac.cn
Methyl Primer Express | Primer design for bisulphite PCR | http://www.appliedbiosystems.com/
Methylumi | Bioconductor pkg for DNA methylation data from Illumina | http://www.bioconductor.org/packages/bioc/html/
Methylyzer | Bisulphite DNA sequence visualization tool | http://ubio.bioinfo.cnio.es/Methylyzer/main/index.html
mPod | DNA methylation viewer integrated w/ Ensembl genome browser | http://www.compbio.group.cam.ac.uk/Projects/
PubMeth | Database of DNA methylation literature | http://www.pubmeth.org
QUMA | Quantification tool for methylation analysis | http://quma.cdb.riken.jp
TCGA Data Portal | Database of TCGA DNA methylation data | http://cancergenome.nih.gov/dataportal

Page 60: Examining gene expression and methylation with next gen sequencing

Methylation: Further Reading

- Bock, C., Tomazou, E. M., Brinkman, A. B., Müller, F., Simmer, F., Gu, H., Jäger, N., et al. (2010). Quantitative comparison of genome-wide DNA methylation mapping technologies. Nature biotechnology, 28(10), 1106-14.
- Brinkman, A. B., Simmer, F., Ma, K., Kaan, A., Zhu, J., & Stunnenberg, H. G. (2010). Whole-genome DNA methylation profiling using MethylCap-seq. Methods (San Diego, Calif.), 52(3), 232-6.
- Brunner, A. L., Johnson, D. S., Kim, S. W., Valouev, A., Reddy, T. E., et al. (2009). Distinct DNA methylation patterns characterize differentiated human embryonic stem cells and developing human fetal liver, 1044-1056.
- Gu, H., Bock, C., Mikkelsen, T. S., Jäger, N., Smith, Z. D., Tomazou, E., Gnirke, A., et al. (2010). Genome-scale DNA methylation mapping of clinical samples at single-nucleotide resolution. Nature methods, 7(2), 133-6.
- Harris, R. A., Wang, T., Coarfa, C., Nagarajan, R. P., Hong, C., Downey, S. L., Johnson, B. E., et al. (2010). Comparison of sequencing-based methods to profile DNA methylation and identification of monoallelic epigenetic modifications. Nature biotechnology, 28(10), 1097-105.
- Kerick, M., Fischer, A., & Schweiger, M.-R. (2012). Bioinformatics for High Throughput Sequencing. (N. Rodríguez-Ezpeleta, M. Hackenberg, & A. M. Aransay, Eds.), 151-167. New York, NY: Springer New York.
- Laird, P. W. (2010). Principles and challenges of genomewide DNA methylation analysis. Nature reviews. Genetics, 11(3), 191-203.
- Weber, M., Davies, J. J., Wittig, D., Oakeley, E. J., Haase, M., Lam, W. L., & Schübeler, D. (2005). Chromosome-wide and promoter-specific analyses identify sites of differential DNA methylation in normal and transformed human cells. Nature genetics, 37(8), 853-62. doi:10.1038/ng1598

Page 61: Examining gene expression and methylation with next gen sequencing


Thank you

Web: bioinformatics.virginia.edu

E-mail: [email protected]

Blog: www.GettingGeneticsDone.com

Twitter: twitter.com/genetics_blog