High-throughput sequence analysis with R and Bioconductor
Marc Carlson, Valerie Obenchain, Herve Pages, Paul Shannon, Daniel Tenenbaum, Martin Morgan
June 2012
Contents
1 Introduction
    1.1 This workshop
    1.2 Bioconductor
    1.3 High-throughput sequence analysis
    1.4 Statistical programming
    1.5 Bioconductor for high-throughput sequence analysis
    1.6 Resources
2 R
    2.1 R data types
    2.2 Useful functions
    2.3 Packages
    2.4 Help
    2.5 Efficient scripts
    2.6 Warnings, errors, and debugging
3 Ranges and Strings
    3.1 Genomic ranges
    3.2 Working with strings
    3.3 Resources
4 Reads and Alignments
    4.1 The pasilla data set
    4.2 Reads and the ShortRead package
    4.3 Alignments and the Rsamtools package
5 RNA-seq
    5.1 Varieties of RNA-seq
    5.2 Differential expression with the edgeR package
    5.3 Additional steps in RNA-seq work flows
    5.4 Resources
mcarlson,vobencha,hpages,pshannon,dtenenba,[email protected]
6 ChIP-seq
    6.1 Varieties of ChIP-seq
    6.2 Initial Work Flow
        6.2.1 Peak calling with R / Bioconductor (advanced)
    6.3 Comparison of multiple experiments: DiffBind
    6.4 Working with called peaks
7 Annotation
    7.1 Gene-centric annotations with AnnotationDbi
    7.2 Genome-centric annotations with GenomicFeatures
    7.3 Using biomaRt
8 Annotation of Variants
    8.1 Variant call format (VCF) files
    8.2 Coding consequences
A Appendix: data retrieval
    A.1 RNA-seq data retrieval
    A.2 ChIP-seq data retrieval and MACS analysis
Table 1: Tentative schedule.
R / Bioconductor for Sequence Analysis: R data types & functions; help; objects; essential packages; efficient programming (Section 2). Working with ranges, strings, reads and alignments. Quality assessment (Sections 3, 4).
RNA-Seq: Differential representation, gene set enrichment, annotation, exon use (Sections 5, 7).
ChIP-Seq: Peak calling (3rd party); collated experiments; motifs; annotation (Sections 6, 7).
Variant Annotation: Common work flows; variants in and around genes; amino acid and coding consequences (Section 8).
1 Introduction
1.1 This workshop
This workshop introduces use of R and Bioconductor for analysis of high-throughput sequence data. The workshop is structured as a series of short remarks followed by group exercises. The exercises explore the diversity of tasks for which R / Bioconductor are appropriate, but are far from comprehensive.
The goals of the workshop are to: (1) develop familiarity with R / Bioconductor software for high-throughput analysis; (2) expose key statistical issues in the analysis of sequence data; and (3) provide inspiration and a framework for further independent exploration. An approximate schedule is shown in Table 1.
1.2 Bioconductor
Bioconductor is a collection of R packages for the analysis and comprehension of high-throughput genomic data. Bioconductor started more than 10 years ago. It gained credibility for its statistically rigorous approach to microarray preprocessing and analysis of designed experiments, and for its integrative and reproducible approaches to bioinformatic tasks. There are now more than 500 Bioconductor packages for expression and other microarrays, sequence analysis, flow cytometry, imaging, and other domains. The Bioconductor web site provides installation, package repository, help, and other documentation.
The Bioconductor web site is at bioconductor.org. Features include:
Introductory work flows.
A manifest of Bioconductor packages arranged in BiocViews.
Annotation (data bases of relevant genomic information, e.g., Entrez gene ids in model organisms, KEGG pathways) and experiment data (containing relatively comprehensive data sets and their analysis) packages.
Mailing lists, including searchable archives, as the primary source of help.
Course and conference information, including extensive reference material.
General information about the project.
Package developer resources, including guidelines for creating and submitting new packages.
Exercise 1
Scavenger hunt. Spend five minutes tracking down the following information.
a. From the Bioconductor web site, instructions for installing or updating Bioconductor packages.
b. A list of all packages in the current release of Bioconductor.
c. The URL of the Bioconductor mailing list subscription page.
Solution: Possible solutions from the Bioconductor web site are, e.g., http://bioconductor.org/install/ (installation instructions), http://bioconductor.org/packages/release/bioc/ (current software packages), and http://bioconductor.org/help/mailing-list/ (mailing lists).
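Once the installation page has been located, the instructions can be applied from within R. The following is a minimal sketch only, assuming a current R installation in which Bioconductor packages are installed through the BiocManager package from CRAN; that mechanism is an assumption of this example and is not described in the workshop text, so the installation page remains the authoritative source. The helper name install_if_missing is hypothetical; the package names are ones used later in the workshop.

```r
# Hypothetical helper (not part of Bioconductor): install any of 'pkgs'
# that are not yet available, and return the names that were missing.
install_if_missing <- function(pkgs) {
    available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
    missing <- pkgs[!available]
    if (length(missing) > 0) {
        # Assumption: BiocManager (from CRAN) is the installation front end
        if (!requireNamespace("BiocManager", quietly = TRUE))
            install.packages("BiocManager")
        BiocManager::install(missing)
    }
    invisible(missing)
}

# Example call (requires network access, so it is left commented out):
# install_if_missing(c("ShortRead", "Rsamtools", "edgeR"))
```

Checking for missing packages first avoids re-installing packages that are already present.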
1.3 High-throughput sequence analysis
Recent technological developments have introduced high-throughput sequencing approaches. A variety of experimental protocols and analysis work flows address gene expression, regulation, and encoding of genetic variants. Experimental protocols produce a large number (millions per sample) of short (e.g., 35-100, single or paired-end) nucleotide sequences. These are aligned to a reference or other genome. Analysis work flows use the alignments to infer levels of gene expression (RNA-seq), binding of regulatory elements to genomic locations (ChIP-seq), or prevalence of structural variants (e.g., SNPs, short indels, large-scale genomic rearrangements). Sample sizes range from minimal replication (e.g., 2 samples per treatment group) to thousands of individuals.
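To make the step from alignments to expression levels concrete, the sketch below counts aligned reads that overlap each gene. It is an illustration only: the gene coordinates, read positions, and read length are invented, and base R is used so that the example is self-contained. A real analysis would more likely use the range-based classes covered in the Ranges and Strings section.

```r
# Hypothetical gene coordinates on a single chromosome
genes <- data.frame(id = c("geneA", "geneB", "geneC"),
                    start = c(100, 500, 900),
                    end = c(300, 700, 1200))

# Hypothetical alignment start positions for reads of fixed length
read_start <- c(90, 120, 250, 310, 480, 650, 650, 950, 1300)
read_len <- 35
read_end <- read_start + read_len - 1

# A read overlaps a gene when it starts at or before the gene's end
# and ends at or after the gene's start
counts <- vapply(seq_len(nrow(genes)), function(i) {
    sum(read_start <= genes$end[i] & read_end >= genes$start[i])
}, integer(1))
names(counts) <- genes$id
counts
```

The resulting vector holds one count per gene; counts of this kind, collected across samples, are the starting point for the statistical analyses discussed later.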
1.4 Statistical programming
Many academic and commercial software products are available; why would one use R and Bioconductor? One answer is to ask about the demands high-throughput genomic data places on effective computational biology software.
Effective computational biology software. High-throughput questions make use of large data sets. This applies both to the primary data (microarray expression values, sequenced reads, etc.) and also to the annotations on those data (coordinates of genes and features such as exons or regulatory regions; participation in biological pathways, etc.). Large data sets place demands on our tools that preclude some standard approaches, such as spread sheets. Likewise, intricate relationships between data and annotation, and the diversity of research questions, require flexibility typical of a programming language rather than a narrowly-enabled graphical user interface.
Analysis of high-throughput data is necessarily statistical. The volume of data requires that it be appropriately summarized before any sort of comprehension is possible. The data are produced by advanced technologies, and these introduce artifacts (e.g., probe-specific bias in microarrays; sequence or base calling bias in RNA-seq experiments) that need to be accommodated to avoid incorrect or inefficient inference. Data sets typically derive from designed experiments, requiring a statistical approach both to account for the design and to correctly address the large number of observed values (e.g., gene expression or sequence tag counts) and small number of samples accessible in typical experiments.
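The consequence of many observed values and few samples can be illustrated with a small simulation. This is a sketch under stated assumptions, not a recommended analysis: the data are simulated normal values with no true group differences, a per-gene t-test stands in for the count-based methods used with sequence data, and the numbers of genes and samples are arbitrary.

```r
set.seed(123)
n_genes <- 5000      # arbitrary number of simulated genes
n_per_group <- 3     # small number of samples per treatment group

# Simulated expression matrix: rows are genes, columns are samples,
# with no true difference between the two groups
x <- matrix(rnorm(n_genes * 2 * n_per_group), nrow = n_genes)
grp1 <- seq_len(n_per_group)
grp2 <- n_per_group + seq_len(n_per_group)

# One t-test per gene
p <- apply(x, 1, function(row) t.test(row[grp1], row[grp2])$p.value)

# Benjamini-Hochberg adjustment for multiple testing
padj <- p.adjust(p, method = "BH")

n_raw <- sum(p < 0.05)      # genes called without adjustment
n_adj <- sum(padj < 0.05)   # genes called after adjustment
c(raw = n_raw, adjusted = n_adj)
```

Because every gene is simulated without a true difference, any gene passing the unadjusted threshold is a false positive. Adjusted p-values are never smaller than the raw ones, so the adjusted count cannot exceed the raw count.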
Research needs to be reproducible. Reproducibility is both an ideal of t
