Workshop Practice 1: Reading and Manipulating Short Reads
Supat Thongjuea, Jan Christian Bryne, Vedran Franke, Christopher Previti and Boris Lenhard∗
30 June - 2 July, 2010
Contents
1 Goals
2 Introduction
3 Aligned read input
  3.1 Navigating Solexa output
  3.2 readAligned and the AlignedRead class
4 Data Processing
  4.1 Filtering
  4.2 Normalize library size from ChIP and Control
  4.3 Fragment size estimation
  4.4 Coverage
  4.5 Display Coverage
  4.6 Island
5 Assign Enriched Region, Calculate p-value, and Downstream Analysis
6 Session information
1 Goals
To provide participants with the basic concepts and skills necessary for browsing and basic analysis of high-throughput sequencing data.
∗Bergen Center for Computational Science, Bergen, Norway
2 Introduction
In the following exercise, we’ll start by reading in read files and manipulating the resulting short read objects in R. The samples we’ll use are from an experiment looking for transcription factor binding sites in the mouse genome. In this example, we have both the treatment and control samples and are restricting ourselves to chromosome X. We’ll use the ShortRead package to input the data files, and then demonstrate the sequence manipulations available. Activities during the workshop are posed as exercises. So, as a first exercise:
Exercise 1
Obtain and load the EuTRACC2010 package. From within R, use:
> install.packages("path/to/EuTRACC2010_1.0.tar.gz", repos=NULL)
Let us know if you have problems installing the package. Alternatively, type the following command at your operating system's command prompt (not inside R):
$ R CMD INSTALL EuTRACC2010_1.0.tar.gz
Exercise 2
The following commands load the EuTRACC2010 package and show that it has been installed.
> library(EuTRACC2010)
> ?EuTRACC2010
(?EuTRACC2010 displays the help page of the EuTRACC2010 package.)
3 Aligned read input
This section illustrates how to input aligned reads. It focuses on aligned reads produced by the Solexa Genome Analyzer ELAND software; reading data produced by software such as MAQ or Bowtie, or in BAM format, is described in the ShortRead ‘Overview’ vignette and on the readAligned help page.
3.1 Navigating Solexa output
Vendor and third-party software usually handles raw image processing and base calling (Rolexa is a Bioconductor package providing alternative base calling). Bioconductor packages typically enter the work flow after the alignment step. Our workflow starts with reads aligned using ELAND, the alignment program supplied with the Illumina Genome Analyzer II (GAII).
We’ll start by creating a variable, dataDir, containing the directory holding the sample GAII data, and check that we have defined the right path.
> dataDir <- system.file("extdata", package="EuTRACC2010")
> file.exists(dataDir)
[1] TRUE
> stopifnot(file.exists(dataDir))
The function list.files lists the files found in a directory.
Exercise 3
Use list.files to display all files in the dataDir directory.
> list.files(dataDir)
[1] "jaspar2010.txt" "jaspar2010_PCC_SWU.scores"
[3] "mel.in.1.map.gz" "mel.in.2.map.gz"
[5] "mel.ni.1.map.gz" "mel.ni.2.map.gz"
[7] "sample.chip.txt.gz" "sample.control.txt.gz"
[9] "total.chip.peak.list" "transcript.len.txt"
Several files are listed in the directory. We focus on the following two:
1. sample.chip.txt.gz, the ChIP sample, contains the reads aligned to chromosome X plus a subset of reads that did not align to the reference genome.
2. sample.control.txt.gz, the IgG Control sample, contains the reads aligned to chromosome X plus a subset of reads that did not align to the reference genome.
These files were produced by the ELAND software run in eland-extended mode. This mode produces one file per lane that summarizes diverse features of all reads, and is a very convenient starting point for analysis; other input formats (e.g., MAQ, Bowtie, BWA, or BAM files) are also supported.
Exercise 4
Before reading these two files into R objects, assign the file names to the variables 'chip' and 'control'.
> chip <- "sample.chip.txt.gz"
> control<-"sample.control.txt.gz"
3.2 readAligned and the AlignedRead class
The readAligned function can be used to input aligned reads. The first argument is a directory path where the alignment files are to be found. The second argument is a regular expression selecting the files to be read. An optional third argument allows the user to specify the file type.
Exercise 5
> chip.aln <- readAligned(dataDir, chip, "SolexaExport")
> control.aln <- readAligned(dataDir, control, "SolexaExport")
Reading the aligned read files will take a while. See the help page for readAligned for additional details about supported file types.
What does readAligned input? It inputs the short read sequences and base call qualities, and the chromosome, position, and strand information associated with short read alignments. This information is expected to be provided by every short read alignment program.
Exercise 6
Display the object we have read in.
> chip.aln
class: AlignedRead
length: 573289 reads; width: 36 cycles
chromosome: chrX chrX ... 0:4:4 0:0:1
position: 58358278 66300396 ... NA NA
strand: - + ... NA NA
alignQuality: NumericQuality
alignData varLabels: run lane ... filtering contig
> head(sread(chip.aln))
A DNAStringSet instance of length 6
width seq
[1] 36 CAGACACAAAATGACATGCATGGTATATACTCATTA
[2] 36 AATCATAATTGCTGAGTTCATATGAACAGAATACAC
[3] 36 ACTACCCTCTGTGTTTTTAGCTCATTTTAAAGAATA
[4] 36 AATAAATAAAAACTCATTGAAAAACTGCTAGGAAAT
[5] 36 CATCTAACGAGATGATCATCTTTGAGTTTGTTTATA
[6] 36 TGGTGTCCTGGAACTTACTGTGTAGATCAGACTGGA
There are 573289 reads in the object, each read consisting of 36 nucleotides.
Exercise 7
Check the library size of both ChIP and Control.
> length(chip.aln)
[1] 573289
> length(control.aln)
[1] 952900
The library sizes of ChIP and Control are 573289 and 952900 reads, respectively.
How many reads aligned to the reference genome? Two functions help to check this: strand and position.
Exercise 8
> table(strand(chip.aln),useNA="ifany")
+ - * <NA>
276817 276472 0 20000
> sum(is.na(position(chip.aln)))
[1] 20000
What are all the NA values returned by strand and position? These correspond to reads that did not align to the reference genome used by ELAND. The strand function returns a factor with three levels. The first two describe reads aligned to the plus and minus strands; the third (*) is available for successful alignments where strand information is irrelevant.
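The role of useNA can be illustrated with a minimal base-R sketch. The vector toy_strand is a hypothetical stand-in for strand(chip.aln), not workshop data; it only shows that table() drops NA values unless told otherwise, so unaligned reads would otherwise vanish from the tally.

```r
# Toy strand calls for six reads; NA marks reads that did not align.
# (Hypothetical data, used only to illustrate the tally.)
toy_strand <- factor(c("+", "-", "+", NA, "-", NA),
                     levels = c("+", "-", "*"))

default_tally <- table(toy_strand)                 # NA values are dropped
full_tally <- table(toy_strand, useNA = "ifany")   # NA gets its own column
n_unaligned <- sum(is.na(toy_strand))              # same count, taken directly

print(default_tally)
print(full_tally)
```

The third level (*) appears with a count of zero in both tables, just as in the ChIP output above.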
Aligned reads contain several different kinds of information about ‘quality’. Individual bases are assessed for quality during base calling. These ‘raw’ base qualities are ‘calibrated’ during ELAND alignment; details of calibration are to be found in Illumina documentation. The alignments themselves also have qualities associated with them, with the details of alignment quality differing between algorithms.
Exercise 9
Retrieve calibrated base quality from chip.aln.
> head(quality(chip.aln))
class: SFastqQuality
quality:
A BStringSet instance of length 6
width seq
[1] 36 bbbbbbbbbbbbbbbbbbbbbbbabbbbbbbbbbbb
[2] 36 bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
[3] 36 bbbbbbabbbbbbbbbbaabbbbbbbbbbbbbabbb
[4] 36 bb_aba^bb]]]]bbbbbbbbabbb^bbbbbabbb_
[5] 36 bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbaabb
[6] 36 `b]bbbbbabaaaaY^^^^^Z]Z]X][]]]``Y_``
These qualities are string-encoded −10 log10 probabilities. The encoding in this case follows a convention established by Illumina. The details of the encoding can be obtained by querying quality(chip.aln) for its alphabet; the letter A corresponds to a −10 log10 score of 1.
Numeric values are readily retrieved as a matrix, with rows corresponding to reads and columns to cycles. Computations can then be performed on them, e.g., to determine average calibrated quality scores as a function of cycle.
> alf <- alphabet(quality(chip.aln))
> alf
[1] " " "!" "\"" "#" "$" "%" "&" "'" "(" ")" "*"
[12] "+" "," "-" "." "/" "0" "1" "2" "3" "4" "5"
[23] "6" "7" "8" "9" ":" ";" "<" "=" ">" "?" "@"
[34] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K"
[45] "L" "M" "N" "O" "P" "Q" "R" "S" "T" "U" "V"
[56] "W" "X" "Y" "Z" "[" "\\" "]" "^" "_" "`" "a"
[67] "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l"
[78] "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w"
[89] "x" "y" "z" "{" "|" "}"
> m <- as(quality(chip.aln), "matrix")
> colMeans(m)
[1] 32.83358 32.75103 32.73937 32.72723 32.72865 32.77423
[7] 32.71444 32.69251 32.67952 32.64014 32.52894 32.47822
[13] 32.43063 32.42744 32.41413 32.36669 32.34187 32.31671
[19] 32.28197 32.23646 32.16516 32.11523 32.09137 32.01706
[25] 31.95273 31.80540 31.75207 31.72279 31.67447 31.61476
[31] 31.47764 31.43199 31.37439 31.34117 31.27768 31.06493
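The decoding step behind the matrix conversion can be sketched in base R. This is an illustration only, not the ShortRead implementation: the function decode_quality and the two 4-cycle quality strings are hypothetical, and the offset of 64 is an assumption inferred from the statement that the letter A (ASCII 65) corresponds to a score of 1.

```r
# Decode string-encoded qualities into numeric scores.
# ASSUMPTION: score = ASCII code - 64, inferred from "A" scoring 1.
decode_quality <- function(qual, offset = 64L) {
  lapply(qual, function(s) utf8ToInt(s) - offset)
}

toy_qual <- c("bbba", "A`b_")            # hypothetical 4-cycle reads
scores <- decode_quality(toy_qual)

# Rows correspond to reads, columns to cycles.
score_matrix <- do.call(rbind, scores)
cycle_means <- colMeans(score_matrix)

# A score Q maps back to an error probability p = 10^(-Q / 10).
error_prob <- 10^(-score_matrix / 10)

print(score_matrix)
print(cycle_means)
```

Under this assumption the letter b decodes to 34, which is in line with the per-cycle means of roughly 31 to 33 reported above for reads made up mostly of b.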
Alignment qualities are accessible with alignQuality. This returns an object that can contain quality scores in different formats; to extract the actual quality scores, use quality. Reads that fail to align, or that align to multiple locations, have an alignment quality of 0.
Exercise 10
Retrieve the alignment quality scores, determine how many reads align poorly, and visualize the distribution of scores (Figure 1).
> alignQuality(chip.aln)
class: NumericQuality
quality: 54 119 ... 0 0 (573289 total)
> q <- quality(alignQuality(chip.aln))
> sum(q==0)
[1] 20000
> print(densityplot(q[q>1], plot.points=FALSE,
+ xlab="Alignment quality"))
Alignment algorithms produce information in addition to basic data about chromosome, position, and strand alignment. The exact content varies between algorithms, and is available with alignData. alignData returns an AlignedDataFrame object that contains these data and a metadata description of them. For instance, ELAND includes information about whether the read passed a base-calling filter (based on strength and consistency of early bases in the read), in addition to the lane, tile, x and y coordinate of each read.
Figure 1: Alignment quality (density plot; x-axis: Alignment quality, y-axis: Density)
Exercise 11
Use the alignData function to extract the additional information in the ELAND alignment file. The underlying data in this object can be accessed as though it were a data frame, for instance to tally the number of reads passing the Illumina base calling filter.
> alignData(chip.aln)
An object of class "AlignedDataFrame"
readName: 1, 2, ..., 573289 (573289 total)
varLabels and varMetadata description:
run: Analysis pipeline run
lane: Flow cell lane
...: ...
contig: Contig
(7 total)
> table(alignData(chip.aln)$filtering)
Y N
546752 26537
4 Data Processing
In this section we will demonstrate how to process ChIP-seq data from both ChIP and Control.
4.1 Filtering
In this section we will demonstrate how to filter the reads using the following restrictions:
• Select just the aligned reads passing Illumina filtering, and aligning to the reference genome
• Select aligned reads with an alignment quality score ≥ 10
• No duplicates of {chromosome, strand, position} combinations (PCR bias correction)
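The three restrictions above can be sketched on a toy data frame in base R. This is not the ShortRead filter machinery; the data frame reads and all of its values are hypothetical, and the sketch only shows the logic of combining the conditions.

```r
# Toy alignment table; all values are hypothetical.
reads <- data.frame(
  chromosome = c("chrX", "chrX", "chrX", NA, "chrX", "chrX"),
  position   = c(100L, 100L, 250L, NA, 400L, 100L),
  strand     = c("+", "+", "-", NA, "+", "-"),
  quality    = c(54, 60, 5, 0, 119, 30),
  filtering  = c("Y", "Y", "Y", "N", "Y", "Y"),
  stringsAsFactors = FALSE
)

passed  <- reads$filtering == "Y"                  # Illumina base-calling filter
aligned <- !is.na(reads$chromosome) &
  grepl("chr[0-9XYM]", reads$chromosome)           # aligned to a chromosome
good    <- reads$quality >= 10                     # alignment quality threshold
# Keep only the first read for each {chromosome, strand, position} combination.
first   <- !duplicated(reads[, c("chromosome", "strand", "position")])

kept <- reads[passed & aligned & good & first, ]
print(kept)
```

Note that the last toy read shares its position with the first one but lies on the opposite strand, so it is not treated as a duplicate.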
Exercise 12
> filt1<-alignDataFilter(expression(filtering=="Y"))
> filt2<-chromosomeFilter("chr[0-9XYM]")
> filt3<-alignQualityFilter(10)
> filt4<-occurrenceFilter(withSread=FALSE)
> filt<-compose(filt1,filt2,filt3,filt4)
> chip.aln.filtered<-chip.aln[filt(chip.aln)]
> control.aln.filtered<-control.aln[filt(control.aln)]
> chip.aln.filtered
class: AlignedRead
length: 452971 reads; width: 36 cycles
chromosome: chrX chrX ... chrX chrX
position: 58358278 66300396 ... 130300614 151247704
strand: - + ... - +
alignQuality: NumericQuality
alignData varLabels: run lane ... filtering contig
> control.aln.filtered
class: AlignedRead
length: 499486 reads; width: 36 cycles
chromosome: chrX chrX ... chrX chrX
position: 160657944 9179764 ... 35935208 162968642
strand: + + ... - +
alignQuality: NumericQuality
alignData varLabels: run lane ... filtering contig
4.2 Normalize library size from ChIP and Control
Since the library sizes of ChIP and Control differ, a simple way to make them comparable is to give them the same number of reads. We’ll take the number of reads in the smaller data set and sample that many reads from the larger data set. Keep in mind that this method throws away reads from the larger data set. The development of a good normalization method for ChIP-seq is a research topic in itself, and many approaches have been described in recent papers. The method shown in this section is the most basic one for these two data sets.
Exercise 13
> chip.lib.size<- length(chip.aln.filtered)
> control.lib.size<- length(control.aln.filtered)
> smallest.lib.size<- min(chip.lib.size,control.lib.size)
> chip.norm.aln<-sample(chip.aln.filtered,smallest.lib.size,replace=FALSE)
> control.norm.aln<-sample(control.aln.filtered,smallest.lib.size,replace=FALSE)
> chip.norm.aln
class: AlignedRead
length: 452971 reads; width: 36 cycles
chromosome: chrX chrX ... chrX chrX
position: 72862481 104703430 ... 45293208 22187014
strand: + - ... + -
alignQuality: NumericQuality
alignData varLabels: run lane ... filtering contig
> control.norm.aln
class: AlignedRead
length: 452971 reads; width: 36 cycles
chromosome: chrX chrX ... chrX chrX
position: 135727227 9539127 ... 147026092 157102271
strand: + + ... - -
alignQuality: NumericQuality
alignData varLabels: run lane ... filtering contig
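Because sample draws reads at random, repeating the exercise gives a different subset each time. A minimal base-R sketch of the same down-sampling idea with a fixed random seed follows; the seed value and the two integer vectors standing in for the filtered read sets are hypothetical.

```r
set.seed(2010)  # hypothetical seed; any fixed value makes the draw repeatable

chip_ids    <- seq_len(8)    # stand-in for 8 filtered ChIP reads
control_ids <- seq_len(12)   # stand-in for 12 filtered Control reads

target <- min(length(chip_ids), length(control_ids))

# Draw indices without replacement, so the same idea applies to any object
# that supports length() and "[".
chip_norm    <- chip_ids[sample.int(length(chip_ids), target)]
control_norm <- control_ids[sample.int(length(control_ids), target)]

print(length(chip_norm))
print(length(control_norm))
```

The smaller set is only shuffled, while the larger set loses the reads that were not drawn.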
4.3 Fragment size estimation
To calculate the coverage for both ChIP and Control, we’ll first estimate the mean fragment size of each library. The function estimate.mean.fraglen from the chipseq package implements three methods for estimating mean fragment length. For more details, see:
> ?estimate.mean.fraglen
Exercise 14
> chip.fragment.size=round(estimate.mean.fraglen(chip.norm.aln,
+ method = "SISSR"))
> control.fragment.size=round(estimate.mean.fraglen(control.norm.aln,
+ method = "SISSR"))
> chip.fragment.size
chrX
209
> control.fragment.size
chrX
206
4.4 Coverage
To calculate the depth to which individual nucleotides in the reference sequence are covered by reads, we use the function coverage. To compute the coverage, we need to know the chromosome sizes, which the BSgenome.Mmusculus.UCSC.mm9 package provides.
Exercise 15
> library(BSgenome.Mmusculus.UCSC.mm9)
> seqlengths(Mmusculus)
chr1 chr2 chr3 chr4
197195432 181748087 159599783 155630120
chr5 chr6 chr7 chr8
152537259 149517037 152524553 131738871
chr9 chr10 chr11 chr12
124076172 129993255 121843856 121257530
chr13 chr14 chr15 chr16
120284312 125194864 103494974 98319150
chr17 chr18 chr19 chrX
95272651 90772031 61342430 166650296
chrY chrM chr1_random chr3_random
15902555 16299 1231697 41899
chr4_random chr5_random chr7_random chr8_random
160594 357350 362490 849593
chr9_random chr13_random chr16_random chr17_random
449403 400311 3994 628739
chrX_random chrY_random chrUn_random
1785075 58682461 5900358
We’ll calculate the coverage using the 'extend' parameter, which extends each read to the estimated fragment length. Note that the read length (36 bp) is subtracted from the estimated fragment length to obtain the extension, e.g. 209 − 36 = 173 bp for the ChIP sample (see Figure 2).
Figure 2: Read length extension
Exercise 16
> chip.extend.length=chip.fragment.size-width(chip.norm.aln[1])
> control.extend.length=control.fragment.size-width(control.norm.aln[1])
> chip.cov <- coverage(chip.norm.aln,width=seqlengths(Mmusculus),
+ extend=as.integer(chip.extend.length))
> control.cov <- coverage(control.norm.aln,width=seqlengths(Mmusculus),
+ extend=as.integer(control.extend.length))
> chip.cov
SimpleRleList of length 1
$chrX
'integer' Rle of length 166650296 with 893704 runs
Lengths: 3000054 209 3326 ... 209 27688
Values : 0 1 0 ... 1 0
> control.cov
SimpleRleList of length 1
$chrX
'integer' Rle of length 166650296 with 898211 runs
Lengths: 3001162 206 812 ... 206 21898
Values : 0 1 0 ... 1 0
The coverage function returns a list-like structure, where each element of the list represents a chromosome. The data are represented as a ‘run length encoded’ (Rle) instance. Rle instances are readily interrogated for a variety of useful insights.
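Read extension and run-length encoding can be sketched together in base R. This is a toy illustration, not the coverage method used in the exercise: the function toy_coverage, the 30 bp chromosome, the 4 bp reads and the 6 bp fragment length are all hypothetical, and the sketch assumes that the reported position is the leftmost aligned base, so that minus-strand reads are extended towards lower coordinates.

```r
# Per-base coverage after extending each read to the fragment length.
# ASSUMPTION: 'pos' is the leftmost aligned base of the read on either strand.
toy_coverage <- function(pos, strand, read_width, fragment_len, chrom_len) {
  start <- ifelse(strand == "+", pos, pos + read_width - fragment_len)
  end <- start + fragment_len - 1L
  start <- pmax(start, 1L)           # clip fragments at the chromosome ends
  end <- pmin(end, chrom_len)
  delta <- integer(chrom_len + 1L)   # +1 where a fragment starts, -1 after it ends
  for (i in seq_along(start)) {
    delta[start[i]] <- delta[start[i]] + 1L
    delta[end[i] + 1L] <- delta[end[i] + 1L] - 1L
  }
  cumsum(delta)[seq_len(chrom_len)]
}

# One plus-strand read at position 3 and one minus-strand read at position 10.
cov <- toy_coverage(pos = c(3L, 10L), strand = c("+", "-"),
                    read_width = 4L, fragment_len = 6L, chrom_len = 30L)
runs <- rle(cov)   # base-R analogue of the Rle representation
print(runs)
```

The two extended fragments overlap at a single base, which shows up as a run of length 1 with value 2 between two runs of value 1.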
> plotChIP.Coverage<-function(x,xlab="Position",ylab="Coverage",
+ main="ChIP Coverage")
+ {
+ plot(c(start(x),length(x)),c(runValue(x),1),type="l",
+ col="blue",xlab=xlab,ylab=ylab,main=main)
+ }
> plotControl.Coverage<-function(x,xlab="Position",ylab="Coverage",
+ main="Control Coverage")
+ {
+ plot(c(start(x),length(x)),c(runValue(x),1),type="l",
+ col="red",xlab=xlab,ylab=ylab,main=main)
+ }
> plotChIP.Coverage(chip.cov[[1]])
> plotControl.Coverage(control.cov[[1]])
Figure 3: The coverage of ChIP
Figure 4: The coverage of Control
4.5 Display Coverage
We’ll use the UCSC genome browser to display the coverage from both ChIP and Control. The UCSC genome browser requires specific input formats (http://genome.ucsc.edu/FAQ/FAQformat.html). This exercise uses the rtracklayer package, which can export Rle and other objects to the appropriate 'bedGraph' format. Regions with zero coverage are removed before export.
Exercise 17
> library(rtracklayer)
> chip.cov.Track<-as(chip.cov,"RangedData")
> chip.cov.no.zero<-subset(chip.cov.Track,score >0)
> control.cov.Track<-as(control.cov,"RangedData")
> control.cov.no.zero<-subset(control.cov.Track,score >0)
> export(chip.cov.no.zero,"chip.cov.txt","bedGraph")
> export(control.cov.no.zero,"control.cov.txt","bedGraph")
Then we’ll upload the tracks to the UCSC genome browser. To add the custom tracks, click the "manage custom tracks" button and then upload these two tracks. You may want to change the track name from "R Track" to ChIP or Control.
We can also upload the track directly from R:
> library(rtracklayer)
> chip.tmp<-tempfile()
> export(chip.cov,chip.tmp,"bedGraph")
> restored.chip.track <- import(chip.tmp,"bedGraph",genome = "mm9")
> session <- browserSession("UCSC")
> track(session, "target") <- restored.chip.track
> browserView(session,range(restored.chip.track))
UCSCView of chr12:57795963-57815592
trackNames(13): 'R Track' ... 'RepeatMasker'
Figure 5: Manage your custom tracks
Instead of using the rtracklayer export function, we can write a script to convert the coverage to bedGraph format:
> out.file <-"chip.bedGraph.txt"
> nz <- runValue(chip.cov[[1]]) > 0
> # bedGraph intervals are 0-based and half-open: [start - 1, end)
> bed <- data.frame(chrom='chrX',
+ start=format(start(chip.cov[[1]])[nz]-1, scientific=FALSE, trim=TRUE),
+ end=format(end(chip.cov[[1]])[nz], scientific=FALSE, trim=TRUE),
+ score=runValue(chip.cov[[1]])[nz])
> write.table(bed, file=out.file, sep="\t", quote=FALSE,
+ row.names=FALSE, col.names=FALSE)
We might want to edit the file "chip.bedGraph.txt" to add a header line like the following:
track name="chip" type=bedGraph
Then, we’ll upload the track to the UCSC genome browser.
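The coordinate convention of bedGraph can be sketched in base R. UCSC bedGraph intervals are 0-based with an exclusive end, whereas run starts and ends in R are 1-based and inclusive. The function coverage_to_bedgraph, the toy coverage vector and the track name "toy" are hypothetical and only illustrate the shift.

```r
# Convert a per-base coverage vector into bedGraph rows (0-based, half-open).
coverage_to_bedgraph <- function(cov, chrom) {
  runs <- rle(cov)
  end1 <- cumsum(runs$lengths)          # 1-based inclusive run ends
  start1 <- end1 - runs$lengths + 1L    # 1-based inclusive run starts
  nz <- runs$values > 0                 # drop runs with zero coverage
  data.frame(chrom = chrom,
             start = as.integer(start1[nz] - 1L),  # shift to 0-based
             end = as.integer(end1[nz]),           # exclusive end
             score = as.integer(runs$values[nz]),
             stringsAsFactors = FALSE)
}

toy_cov <- c(0, 0, 1, 1, 2, 1, 0, 0, 0, 1, 1, 0)  # hypothetical coverage
bed <- coverage_to_bedgraph(toy_cov, "chrX")
bed_lines <- sprintf("%s\t%d\t%d\t%d", bed$chrom, bed$start, bed$end, bed$score)

# Print a header line followed by the data lines.
writeLines(c('track type=bedGraph name="toy"', bed_lines))
```

Building the lines with sprintf and integer columns avoids both scientific notation and padded numbers in the output.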
4.6 Island
The regions of interest are contiguous segments of non-zero coverage, also known as islands. We’ll select the islands where there is at least 1 read. The function slice can be used to identify those regions of interest.
Exercise 18
> chip.islands<-slice(chip.cov,lower=1)
> chip.islands
SimpleRleViewsList of length 1
$chrX
Views on a 166650296-length Rle subject
views:
start end width
[1] 3000055 3000263 209 [1 1 1 1 1 1 1 1 1 ...]
[2] 3003590 3003798 209 [1 1 1 1 1 1 1 1 1 ...]
[3] 3007874 3008082 209 [1 1 1 1 1 1 1 1 1 ...]
[4] 3009187 3009395 209 [1 1 1 1 1 1 1 1 1 ...]
[5] 3010563 3010771 209 [1 1 1 1 1 1 1 1 1 ...]
[6] 3012647 3012855 209 [1 1 1 1 1 1 1 1 1 ...]
[7] 3014879 3015087 209 [1 1 1 1 1 1 1 1 1 ...]
[8] 3017730 3017938 209 [1 1 1 1 1 1 1 1 1 ...]
[9] 3017982 3018530 549 [1 1 1 1 1 1 1 1 1 ...]
... ... ... ... ...
[182483] 166584318 166584632 315 [1 1 1 1 1 1 1 1 1 ...]
[182484] 166584735 166585074 340 [1 1 1 1 1 1 1 1 1 ...]
[182485] 166585107 166585315 209 [1 1 1 1 1 1 1 1 1 ...]
[182486] 166585387 166585723 337 [1 1 1 1 1 1 1 1 1 ...]
[182487] 166606183 166606391 209 [1 1 1 1 1 1 1 1 1 ...]
[182488] 166610145 166610747 603 [1 1 1 1 1 1 1 1 1 ...]
[182489] 166610881 166611089 209 [1 1 1 1 1 1 1 1 1 ...]
[182490] 166611709 166612421 713 [1 1 1 1 1 1 1 1 1 ...]
[182491] 166622400 166622608 209 [1 1 1 1 1 1 1 1 1 ...]
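For each island, we can compute the summed coverage, which equals the number of reads in the island multiplied by the extended fragment length (here 209, so a sum of 836 corresponds to 4 reads), and the maximum coverage depth within that island.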
> viewSums(head(chip.islands))
SimpleIntegerList of length 1
[["chrX"]] 209 209 209 209 209 209 ... 209 836 209 1254 209
> viewMaxs(head(chip.islands))
SimpleIntegerList of length 1
[["chrX"]] 1 1 1 1 1 1 1 1 2 2 1 ... 1 1 2 2 1 2 1 2 1 3 1
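The island summaries can be sketched in base R on a toy coverage vector. This is an illustration of the idea, not the IRanges implementation of slice, viewSums and viewMaxs; the function find_islands and the coverage values are hypothetical.

```r
# Find contiguous stretches with coverage >= lower and summarize each one.
find_islands <- function(cov, lower = 1) {
  runs <- rle(cov >= lower)
  ends <- cumsum(runs$lengths)
  starts <- ends - runs$lengths + 1L
  s <- starts[runs$values]           # keep only runs that are islands
  e <- ends[runs$values]
  idx <- seq_along(s)
  data.frame(
    start = s,
    end = e,
    width = e - s + 1L,
    sum = vapply(idx, function(i) sum(cov[s[i]:e[i]]), numeric(1)),
    max = vapply(idx, function(i) max(cov[s[i]:e[i]]), numeric(1))
  )
}

toy_cov <- c(0, 0, 1, 1, 2, 1, 0, 0, 0, 1, 1, 0)  # hypothetical coverage
islands <- find_islands(toy_cov, lower = 1)
print(islands)
```

Raising lower selects only regions that reach a higher depth, in the same way as the lower argument of slice.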
5 Assign Enriched Region, Calculate p-value, and Downstream Analysis
We now have the coverage for both ChIP and Control. For further analysis, please see Workshop Practice 2.
Before moving on to Workshop Practice 2, we have to save the coverage of both ChIP and Control into R object files.
Exercise 19
> save(chip.cov,file="chip.cov.RData")
> save(control.cov,file="control.cov.RData")
6 Session information
• R version 2.11.1 (2010-05-31), x86_64-apple-darwin9.8.0
• Locale: C
• Base packages: base, datasets, grDevices, graphics, grid, methods, stats, utils
• Other packages: BSgenome 1.16.1, BSgenome.Mmusculus.UCSC.mm9 1.3.16, Biostrings 2.16.5, EuTRACC2010 1.0, GenomicFeatures 1.0.0, GenomicRanges 1.0.3, IRanges 1.6.6, MotIV 1.1.3, RCurl 1.4-2, Rsamtools 1.0.1, ShortRead 1.6.2, bitops 1.0-4.1, chipseq 0.4.0, lattice 0.18-8, rGADEM 1.0.0, rtracklayer 1.8.1, seqLogo 1.14.0
• Loaded via a namespace (and not attached): Biobase 2.8.0, DBI 0.2-5, RSQLite 0.9-0, XML 3.1-0, biomaRt 2.4.0, hwriter 1.2, tools 2.11.1