research open access investigating bisulfite short-read ...research open access investigating...

9
RESEARCH Open Access Investigating bisulfite short-read mapping failure with hairpin bisulfite sequencing data Jacob Porter 1 , Ming-an Sun 2 , Hehuang Xie 2,3* , Liqing Zhang 1 From Fourth IEEE International Conference on Computational Advances in Bio and medical Sciences (ICCABS 2014) Miami Beach, FL, USA. 2-4 June 2014 Abstract Background: DNA methylation is an important epigenetic mark relevant to normal development and disease genesis. A common approach to characterizing genome-wide DNA methylation is using Next Generation Sequencing technology to sequence bisulfite treated DNA. The short sequence reads are mapped to the reference genome to determine the methylation statuses of Cs. However, despite intense effort, a much smaller proportion of the reads derived from bisulfite treated DNA (usually about 40-80%) can be mapped than regular short reads mapping (> 90%), and it is unclear what factors lead to this low mapping efficiency. Results: To address this issue, we used the hairpin bisulfite sequencing technology to determine sequences of both DNA double strands simultaneously. This enabled the recovery of the original non-bisulfite-converted sequences. We used Bismark for bisulfite read mapping and Bowtie2 for recovered read mapping. We found that recovering the reads improved unique mapping efficiency by 9-10% compared to the bisulfite reads. Such improvement in mapping efficiency is related to sequence entropy. Conclusions: The hairpin recovery technique improves mapping efficiency, and sequence entropy relates to mapping efficiency. Background DNA methylation is an epigenetic phenomenon that adds a methyl group to the nucleic acid cytosine in DNA. Methylation impacts evolution and inheritance [1], embryonic development [2,3], and cancer [4,5]. Discover- ing where DNA methylation occurs can be accomplished with the sequencing of bisulfite treated DNA. Bisulfite transforms unmethylated cytosine into uracil. The methy- lated cytosine sterically hinders the sulfite ion preventing transformation of methylated cytosine into uracil. The DNA is amplified with polymerase chain reaction (PCR) where the uracil is converted to thymine. Differences between a reference genome and the bisulfite treated DNA can be used to search for methylated cytosines. To identify such differences, the most important step is to map the bisulfite treated short reads to the reference genome. Mapping software takes a DNA sequence and finds a matching position for it in a reference genome. Many mapping programs suitable for bisulfite treated short reads exist. These include Bismark [6], BSMap [7], BiSS from NGM [8], BatMeth [9], and BS-Seeker2 [10]. These programs work by first finding a potential location for the read using hashing or a string index. The next step is to compute a score for the read. This can be done by counting mismatches and indels. Some programs use the Smith-Waterman algorithm [11] or the Needleman-Wunsch algorithm [12]. The best scoring read location is reported as the location for that read. Despite continued and intense effort in creating new or improving existing bisulfite read mappers, the proportion of mapped reads remains low, commonly around 40% [6,13], and sometimes as high as 80% espe- cially with read trimming [9,13], which is much lower than regular short reads mapping (e.g., data for the 1000 Genomes project has mapping efficiency > 90% [14]). * Correspondence: [email protected] 2 Virginia Bioinformatics Institute, Virginia Tech, 24061 Blacksburg, VA, USA Full list of author information is available at the end of the article Porter et al. BMC Genomics 2015, 16(Suppl 11):S2 http://www.biomedcentral.com/1471-2164/16/S11/S2 © 2015 Porter et al.; This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http:// creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/ zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Upload: others

Post on 09-Mar-2020

12 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: RESEARCH Open Access Investigating bisulfite short-read ...RESEARCH Open Access Investigating bisulfite short-read mapping failure with hairpin bisulfite sequencing data Jacob Porter1,

RESEARCH Open Access

Investigating bisulfite short-read mapping failurewith hairpin bisulfite sequencing dataJacob Porter1, Ming-an Sun2, Hehuang Xie2,3*, Liqing Zhang1

From Fourth IEEE International Conference on Computational Advances in Bio and medical Sciences(ICCABS 2014)Miami Beach, FL, USA. 2-4 June 2014

Abstract

Background: DNA methylation is an important epigenetic mark relevant to normal development and diseasegenesis. A common approach to characterizing genome-wide DNA methylation is using Next GenerationSequencing technology to sequence bisulfite treated DNA. The short sequence reads are mapped to the referencegenome to determine the methylation statuses of Cs. However, despite intense effort, a much smaller proportionof the reads derived from bisulfite treated DNA (usually about 40-80%) can be mapped than regular short readsmapping (> 90%), and it is unclear what factors lead to this low mapping efficiency.

Results: To address this issue, we used the hairpin bisulfite sequencing technology to determine sequences ofboth DNA double strands simultaneously. This enabled the recovery of the original non-bisulfite-convertedsequences. We used Bismark for bisulfite read mapping and Bowtie2 for recovered read mapping. We found thatrecovering the reads improved unique mapping efficiency by 9-10% compared to the bisulfite reads. Suchimprovement in mapping efficiency is related to sequence entropy.

Conclusions: The hairpin recovery technique improves mapping efficiency, and sequence entropy relates tomapping efficiency.

BackgroundDNA methylation is an epigenetic phenomenon that addsa methyl group to the nucleic acid cytosine in DNA.Methylation impacts evolution and inheritance [1],embryonic development [2,3], and cancer [4,5]. Discover-ing where DNA methylation occurs can be accomplishedwith the sequencing of bisulfite treated DNA. Bisulfitetransforms unmethylated cytosine into uracil. The methy-lated cytosine sterically hinders the sulfite ion preventingtransformation of methylated cytosine into uracil. TheDNA is amplified with polymerase chain reaction (PCR)where the uracil is converted to thymine. Differencesbetween a reference genome and the bisulfite treatedDNA can be used to search for methylated cytosines.To identify such differences, the most important step is

to map the bisulfite treated short reads to the reference

genome. Mapping software takes a DNA sequence andfinds a matching position for it in a reference genome.Many mapping programs suitable for bisulfite treatedshort reads exist. These include Bismark [6], BSMap [7],BiSS from NGM [8], BatMeth [9], and BS-Seeker2 [10].These programs work by first finding a potential locationfor the read using hashing or a string index. The nextstep is to compute a score for the read.This can be done by counting mismatches and indels.

Some programs use the Smith-Waterman algorithm [11]or the Needleman-Wunsch algorithm [12]. The bestscoring read location is reported as the location for thatread. Despite continued and intense effort in creatingnew or improving existing bisulfite read mappers, theproportion of mapped reads remains low, commonlyaround 40% [6,13], and sometimes as high as 80% espe-cially with read trimming [9,13], which is much lowerthan regular short reads mapping (e.g., data for the 1000Genomes project has mapping efficiency > 90% [14]).

* Correspondence: [email protected] Bioinformatics Institute, Virginia Tech, 24061 Blacksburg, VA, USAFull list of author information is available at the end of the article

Porter et al. BMC Genomics 2015, 16(Suppl 11):S2http://www.biomedcentral.com/1471-2164/16/S11/S2

© 2015 Porter et al.; This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided theoriginal work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Page 2: RESEARCH Open Access Investigating bisulfite short-read ...RESEARCH Open Access Investigating bisulfite short-read mapping failure with hairpin bisulfite sequencing data Jacob Porter1,

It is unclear what leads to the reduced efficiency in bisul-fite reads mapping.In this work, we use a sequencing strategy called gen-

ome-wide hairpin sequencing of bisulfite-treated shortreads to investigate factors that may adversely influence

mapping efficiency of bisulfite reads [15]. Unlike previousbisulfite sequencing methods that destroy knowledge ofthe original (non-bisulfite treated) sequence, the hairpintechnology (see Figure 1) allows for the recovery of theoriginal sequences by putting a connector between the

Figure 1 Example of bisulfite-treated hairpin PCR sequencing with recovery of the original sequence.

Porter et al. BMC Genomics 2015, 16(Suppl 11):S2http://www.biomedcentral.com/1471-2164/16/S11/S2

Page 2 of 9

Page 3: RESEARCH Open Access Investigating bisulfite short-read ...RESEARCH Open Access Investigating bisulfite short-read mapping failure with hairpin bisulfite sequencing data Jacob Porter1,

Watson and Crick strands and then using PCR andpaired end technology to sequence short reads [16]. Theresulting sequences give paired strands that can bemapped to recover the original untreated read and todetect sequencing error.This study focuses on the read mapper Bismark, which

is popular, easy to use, and reasonably accurate [13]. Itconverts all Cs to Ts in the read and the reference gen-ome, and then calls Bowtie2 as a sub-process to do themapping. Bowtie2 uses an FM-Index, a compressed stringindex, of the reference genome to find read positions onthe reference genome. Bowtie2’s scoring function weightsmismatches, indel starts, and indel extensions. It canreport the score for the best mapping as well as the nextbest mapping [6,17]. Because the hairpin method enablesrecovery of the original sequences from bisulfite convertedreads, the effects of bisulfite treatment on a mapping pro-gram can be studied by mapping bisulfite converted readswith Bismark and recovered (original) reads with Bowtie2.This isolates bisulfite treatment as the only variable affect-ing mapping quality. We also used the bisulfite mapperBSMap, one of the first bisulfite mapper programs, forcomparison. It creates a hash table of the reference gen-ome of all possible C to T conversions in a sliding windowof 16 base pairs in the reference genome [7]. The followinghave been published by the same authors in an extendedabstract: some results on original sequence recovery andentropy, most of the methods section, and Figure 1 [18].

MethodsGenome-wide mouse hairpin bisulfite-sequencing datacreationWe first generated genome-wide hairpin bisulfite-sequen-cing data for mouse ES cells (E14TG2a) using the technol-ogy developed by our lab (GEO accession numberGSE48229) [15]. To induce differentiation, the mouse EScells were cultured in ES cell culturing media for six dayswithout LIF (Leukemia inhibitory factor). Genomic DNAisolated from the undifferentiated (E14-d0) and differentiat-ing (E14-d6) states of mES cells were used for library con-struction. Briefly, after DNA sonication, end repair, and dAtailing, the DNA fragments were further ligated to Biotin-modified hairpin adapter and Illumina TruSeq adapters.Adapter-ligated DNA was digested with MseI and MluCI(NEB) to enrich regions with high CpG density, and thenpulled down using Dynabeads® MyOneTM StreptavidinC1 beads (Invitrogen). After bisulfite conversion and PCR,size selection of 400-600 bp fragments was conducted toyield longer sequences that are more amenable for unam-biguous mapping to the reference sequence.

Unconverted sequence recoveryFigure 1 shows how the bisulfite treated hairpin PCR sequen-cing technology works and how the original non-bisulfite

sequence can be recovered. In step 0, the hairpin connectoris attached to opposing Watson and Crick strands. Theopposing strands are denatured in step 1 and then treatedwith bisulfite in step 2. Bisulfite treatment convertsun-methylated cytosine to uracil. Step 3 involves PCR ampli-fication, which copies the forward and reverse sequences andconverts uracil into thymine. In step 4, paired-end sequen-cing technology is used to sequence from the 5’ ends. Thisgives a sequence pair. Notice that one strand is enriched forTs since the PCR and bisulfite treatment converts unmethy-lated Cs to Ts. The other strand is enriched for As since theCrick strand’s un-methylated Cs become Ts, which are com-plemented with As. These As are then paired with Gs in theopposing strand. Finally, step 5 recovers the originalsequence. When the T-enriched strand has a T and theA-enriched strand has a C, this maps to a C, and when theT-enriched strand has a G and the A-enriched strand has anA, this maps to a G. All other mismatches are due to errorsin PCR or sequencing since the two strands should match upperfectly as they come from opposing sequences in the DNA.

Software installation and useAll software development and analysis was performedon the bioinformatics clusters hosted by the VirginiaBioinformatics Institute and the Virginia Tech CSdepartment. Two of the machines consist of two quadcore processors (Intel(R) Xeon(R) CPU E5-2407 0 @2.20 GHz) each with 128 gigabytes of memory. Twoother machines consisted of three eight core processors(Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20 GHz) each,where one machine has 128 GB of memory and theother has 192 GB of memory. Bowtie2 (version 2.1.0)was compiled with the Gnu Compiler Collection. Bis-mark (version 0.7.12) was run with Perl (version 5.10.1)with the non-directional flag and default settings. Blastn(version 2.2.28+) was run with default settings on aC-to-T converted reference genome to map unmappedBismark reads, where all of the reads had Cs convertedto Ts. Bismark reports unique mappings, which aremappings that have a unique best score, ambiguousmappings, which are reads with locations that have mul-tiple best scores, and unmapped reads, which had scoresthat were too low or reads where a location couldn’t beassigned.

File processing and statistical calculationCustom Python 2.7 [19] scripts using BioPython [20]were created to process files and calculate statistics suchas sequence entropy and sequence length. File manipu-lation included randomly sampling from FASTA files.Sequence entropy was calculated for each sequence in

each of the mapping categories (unique, ambiguous, andunmapped). Entropy for a sequence, s, is calculated bytaking the negative sum of the frequency of each base,

Porter et al. BMC Genomics 2015, 16(Suppl 11):S2http://www.biomedcentral.com/1471-2164/16/S11/S2

Page 3 of 9

Page 4: RESEARCH Open Access Investigating bisulfite short-read ...RESEARCH Open Access Investigating bisulfite short-read mapping failure with hairpin bisulfite sequencing data Jacob Porter1,

fb, times the logarithm of the frequency of each base.More formally,

Ent (s) = − ∑

b∈{A,C,T,G}fb log

(fb

)

Entropy gives a measure of sequence randomness [21].This measure of sequence complexity was chosen sinceit is simple and fast to compute. A sequence of all Tswill have an entropy of 0, but a sequence with 25 per-cent of each base will have an entropy of 2. Wehypothesized that entropy would affect mapping quality.Highly random-seeming sequences should uniquely mapwhile non-random sequences, for example, a sequencewith mostly Ts, will map ambiguously or be discardedsince the string of Ts could map to many portions ofthe genome that have Cs at indeterminate locations inthe string of Ts. Bisulfite treatment tends to reducesentropy since most C’s will be converted to T’s sincecytosine methylation is usually rare.For comparing the unique mapping efficiency for con-

verted and unconverted reads, 50 bootstrap replicateswith replacement were created from all of the mousedata (both differentiating and undifferentiated). Eachreplicate had 6.46 million reads. Each replicate had theT-enriched pair and the A-enriched pair so that theuntreated original sequence could be recovered. Readsthat aligned exactly (that is, there was no sequencingerror) were extracted and mapped with Bismark ondefault settings for both the T-enriched and the A-enriched reads. About 68% (4.4 million) of the reads ineach replicate aligned exactly. The original read wasrecovered and mapped with Bowtie2 using the same set-tings as Bismark. Bismark reports a uniquely mappedpercent, and the recovered sequence mapping categoriesfrom Bowtie2 was used to calculate a uniquely mappedpercent in the same way as Bismark. Read mappingswere done for all 50 bootstrap replicates so that a p-valuecould be computed.For a single bootstrap replicate for the day 0 data,

reads were extracted that had 0, 1, 2, and 3 mismatchessomewhere in the read. Mapping using Bismark wasdone for all of them. Bismark mappings for reads withmismatches only in the first 25 bases were comparedwith reads with mismatches only in the latter right-mostpart. This comparison was done for 1, 2, and 3mismatches.With Illumina pair-end sequencing technology, a total

of 8 “lanes” of data was produced. Lane 8 data was usedthroughout except in the two cases mentioned above thatused bootstrap replicates drawn from all of the data.Both T-enriched and A-enriched strands were used foranalysis, and the day 0 and day 6 data was used. The day0 lane 8 data has 32,434,798 reads and the day 6 lane 8data has 58,034,817 reads.

To understand what the best possible bisulfite map-ping efficiency was, a simulation was performed wheresimulated bisulfite reads were sampled from the mousereference genome at random and read sequencing errorwas introduced. This simulation represented the bestpossible mapping scenario since the reads did notinclude natural variation. The simulation sampledsequences of length 100, 75, and 50 bp in the same pro-portions that were in the lane 8 data and comparedmapping efficiency to the real hairpin data withsequences of only 100, 75, and 50 bases. Simulationswith sequencing error of 1% and 10% were created.(Sequencing error set to less than 1% had very little dif-ference in the mapping efficiencies of 1% error.) Theprogram Sherman from Babraham Bioinformatics wasused to do the simulation with methylation rates set toCpG = 80% and CPH = 2%. These rates were chosensince they were consistent with the literature [15]. BothBSMap and Bismark were used to do the mapping.To investigate the benefit of using different algo-

rithms, Blast was run on default settings on 64 percentof the bisulfite-treated reads unmapped by Bismark forthe T-enriched day 0 lane 8 data. The reads and thereference genome had all C’s converted to T’s. Only64 percent of the reads were used since Blast took along time to complete and produced a lot of data.

Results and discussionBisulfite treated reads have low mapping efficiencyBismark uniquely mapped 47.5-52 percent of the bisulfiteconverted reads in the mouse data, ambiguously mapped27-34 percent, leaving 16-20 percent unmapped. There-fore, around half of the reads don’t give a strong signalfor their position, rendering them less biologically useful.

Sequence entropy relates to mapping efficiencyWe examined the average sequence entropy by mappingcategory for both day 0 and day 6 data when mappedwith Bismark or BSMap (Figure 2). Both mappers show atrend of increasing average entropy when going fromunmapped to ambiguous to unique mappings.Figure 3 shows a histogram of sequence entropy by

mapping category for the hairpin data on day 0 mappedwith Bismark. There is a spiking phenomenon around 1.5that may correspond with reduced entropy due to bisulfitetreatment. During bisulfite conversion, most of the Cswere converted to Ts and thus the frequency of the Cs isgreatly reduced. In this distribution, the maximum differ-ences between the mapping categories were determined asthe following: unmap-ambig: 0.05, unmap- unique: 0.12,unique-ambig: 0.17. The Kolmogorov-Smirnoff two-sample tests were performed and the differences betweenthese distributions are with p-values less than 0.01.Because uniquely mapped reads tend to have higher

Porter et al. BMC Genomics 2015, 16(Suppl 11):S2http://www.biomedcentral.com/1471-2164/16/S11/S2

Page 4 of 9

Page 5: RESEARCH Open Access Investigating bisulfite short-read ...RESEARCH Open Access Investigating bisulfite short-read mapping failure with hairpin bisulfite sequencing data Jacob Porter1,

Figure 2 Bismark and BSMap average sequence entropy by mapping category for both day 0 and day 6 data.

Figure 3 Bismark entropy distribution on day 0 data by mapping category: ambiguous, unique, unmapped. The x-axis is entropy times10, and each integer point × is a bucket of sequences where × represents sequences with entropy values between x/10 and (x-1)/10. The y-value is the proportion of the total number of sequences in the mapping category.

Porter et al. BMC Genomics 2015, 16(Suppl 11):S2http://www.biomedcentral.com/1471-2164/16/S11/S2

Page 5 of 9

Page 6: RESEARCH Open Access Investigating bisulfite short-read ...RESEARCH Open Access Investigating bisulfite short-read mapping failure with hairpin bisulfite sequencing data Jacob Porter1,

entropy, entropy positively correlates with improved map-ping efficiency. This suggests that reduced entropy frombisulfite treatment contributes to worsened mappingeffectiveness.

Recovering the original sequence improves mappingefficiencyWe hypothesized that the hairpin sequencing technologycan improve the mapping efficiency by recovering the ori-ginal sequences, which have higher entropy. For the undif-ferentiated data, recovering the untreated sequenceimproved unique mapping efficiency by an average of9.15% (p-value = 0, variance≈ 10−8) compared to the aver-age unique mapping efficiency of the T-enriched and theA-enriched reads. For the differentiating data, the uniquemapping efficiency was improved by 10.59% (p-value = 0,variance ≈ 10−8). This shows the benefit of recovering theinformation lost from bisulfite treatment and is consistentwith the notion that more entropic reads map better.

Different hairpin sequences may map differentlyWe next checked whether mapping efficiency can beimproved without sequence recovery but with mapping

information from additional sequence read, since onesequence from the hairpin sequences may uniquely mapwhile the other doesn’t. Bismark was run on the lane 8data for both day 0 and day 6. Both the T-enriched and theA-enriched sequences were compared to see if onesequence uniquely mapped while the other didn’t. If onesequence uniquely maps, then the other sequence can beconsidered to uniquely map as well. This improves theoverall mapping efficiency. For the day 0 data, about 19percent of the T-enriched sequences that uniquely mappeddid not uniquely map in the A-enriched sequences, andvice-versa. For the day 6 data, about 22 percent of the T-enriched sequences that uniquely mapped did not uniquelymap in the A-enriched sequences, and vice-versa. This sug-gests that the hairpin sequencing strategy can improvemapping efficiencies by uniquely mapping one sequence tofind the position of the other sequence.

Sequencing error adversely affects mapping efficiencyThe number of mismatches in reads could be deter-mined by mapping two pair-end sequences together. Weobserved a downward trend in mapping efficiency withincreasing mismatch count (Figure 4). Figure 4 shows

Figure 4 Number of mismatches on the left or the right side of the read versus alignment efficiency with day 0 data with Bismarkunique alignment efficiency. The left side is the first 25 (5’ end) bases of the read. The right side is the other bases.

Porter et al. BMC Genomics 2015, 16(Suppl 11):S2http://www.biomedcentral.com/1471-2164/16/S11/S2

Page 6 of 9

Page 7: RESEARCH Open Access Investigating bisulfite short-read ...RESEARCH Open Access Investigating bisulfite short-read mapping failure with hairpin bisulfite sequencing data Jacob Porter1,

the results of mapping reads with mismatches in thefirst 25 bases compared with mapping reads with mis-matches only in the other bases. The unique mappingefficiency is consistently lower when mismatches are inthe first 25 bases, and it is worsened with increasedmismatch count. Thus, mismatches correlate with lowmapping efficiency, especially in the 5’ end. The hairpinsequencing technology can be used to identify mismatchareas of a read for removal so that unique mapping effi-ciency can be increased.

Bismark and BSMap mappings are highly overlappingThe T-enriched sequences for the day 0 data weremapped with Bismark and BSMap. Combining theuniquely mapped sequences of both mappers (an ORoperation) resulted in only about a 2 percent increaseover BSMap alone. Other work has shown greater differ-ences between these programs for other data [13].Therefore using multiple mappers may improve map-ping efficiencies. Using multiple mappers increases theconfidence that a read is mapped correctly when themappers map to the same location.

Longer sequences tend to map uniquelyReads in the mouse hairpin data ranged from 40 to 101bases. Day 0 lane 8 data had 63.4 percent reads withlengths either 100 or 101, and day 6 lane 8 data had69.34 percent reads with 100-101 bases. We suspectedthat longer nucleotide sequences would have higher

unique mapping efficiency since the probability that alonger sequence is repeated in the genome is low. Forexample, the sequence consisting of only “A” is oftenfound, but the sequence “ATCGGTGCCAT” will befound less often.The programs BSMap and Bismark were used to do

mapping, and then the percentage of reads of a givenlength that belong to each mapping category was calcu-lated as can be seen in Figure 5 for Bismark. For Bismark,there is a generally upward trend with longer sequencesmore often mapping uniquely and longer sequences lessoften unmapped. There are noticeable spikes at lengths60 and 90. However, BSMap’s mapping efficiencyremains about flat (no spiking) until longer reads aremapped where there is a slight increase in mapping effi-ciency. These results confirm the hypothesis that longerreads tend to map uniquely.

Sequencing error simulation suggests an upper bound onmapping performanceFigure 6 shows that with one percent sequencing errorand with reads drawn randomly from the mouse refer-ence genome, Bismark uniquely maps 86.1 percent ofthe reads and ambiguously maps 12.7 percent of thereads. The best possible mapper probably won’t be ableto do much better than this. This compares to 50.4 per-cent unique mapping efficiency by Bismark on bisulfitetreated data according to Figure 6. This means thatread mappers could improve by over 30 percent.

Figure 5 Bismark mapping categories by sequence length on day 0 data. This shows mapping efficiency by length for the day 0 lane 8data for each mapping category: unique, ambiguous, and unmapped.

Porter et al. BMC Genomics 2015, 16(Suppl 11):S2http://www.biomedcentral.com/1471-2164/16/S11/S2

Page 7 of 9

Page 8: RESEARCH Open Access Investigating bisulfite short-read ...RESEARCH Open Access Investigating bisulfite short-read mapping failure with hairpin bisulfite sequencing data Jacob Porter1,

A high sequencing error of 10% causes an almost com-plete failure of Bismark, and BSMap worsens by 22%.

Blasting unmapped reads produces hitsLast, we investigated the benefit of using differentalgorithms for mapping by using Blast to map readsunmapped by Bismark. Blast significantly mappedaround 30 percent of these unmapped reads. Blast isnoticeably slower and its output is not in a convenientformat for variant calling; nonetheless, this resultshows that more computationally intense methodsimprove mapping efficiency.

ConclusionsIn this study, we observed low sequence entropy co-occurs with low unique mapping efficiency, and highentropy co-occurs with high unique mapping efficiency.This suggests that lost entropy from bisulfite treatmentcauses worse mapping efficiency. In addition, mismatchesin the seed region for a mapper correlate with worsenedperformance. The hairpin sequencing strategy is useful inimproving mapping performance of bisulfite treatedreads. With sequence information from two DNAstrands, the original sequence can be recovered and PCRor sequencing error mismatches can be identified. Hair-pin data can be used to ameliorate lost entropy by reco-vering the original sequence and improving uniquemapping efficiencies.Additional coping strategies involve using multiple read

mappers or more computationally intense software suchas Blast to improve mapping results since they can beused to map reads that are unmapped by less computa-tionally intense software such as Bismark and Bowtie2.

Competing interestsThe authors declare that they have no competing interests.

Authors’ contributionsJP conceived and carried out most of the data analysis and wrote thescripts, and he wrote much of the manuscript. MS prepared the DNAbisulfite hairpin data and contributed to the writing pertaining to datapreparation. HX supervised the data preparation and contributed to themanuscript. LZ contributed to the manuscript and the conception of thedata analysis.

AcknowledgementsThe work was supported by the VBI new faculty startup fund for HehuangXie and by NSF Award No. OCI-1124123 to Liqing Zhang.

DeclarationsPublication of this article was funded by the VBI new faculty startup fund forHX and Virginia Tech’s Open Access Subvention Fund.This article has been published as part of BMC Genomics Volume 16Supplement 11, 2015: Selected articles from the Fourth IEEE InternationalConference on Computational Advances in Bio and medical Sciences(ICCABS 2014): Genomics. The full contents of the supplement are availableonline at http://www.biomedcentral.com/bmcgenomics/supplements/16/S11.

Authors’ details1Department of Computer Science, Virginia Tech, Blacksburg, 24061, VA, USA.2Virginia Bioinformatics Institute, Virginia Tech, 24061 Blacksburg, VA, USA.3Department of Biological Sciences, Virginia Tech, 24061 Blacksburg, VA, USA.

Published: 10 November 2015

References1. Allis CD, Jenuwein T, Reinberg D, Caparros ML: Epigenetics 2007.2. Mann MR, Bartolomei MS: Epigenetic reprogramming in the mammalian

embryo: struggle of the clones. Genome Biol 2002, 3(2):1003-1.3. Woroniecki R, Gaikwad AB, Susztak K: Fetal environment, epigenetics, and

pediatric renal disease. Pediatric Nephrology 2011, 26(5):705-711.4. Baylln SB, Herman JG, Graff JR, Vertino PM, Issa JP: Alterations in dna

methylation: a fundamental aspect of neoplasia. Advances in cancerresearch 1997, 72:141-196.

5. Chen RZ, Pettersson U, Beard C, Jackson-Grusby L, Jaenisch R: Dnahypomethylation leads to elevated mutation rates. Nature 1998,395(6697):89-93.

Figure 6 Mapping results for both simulated and real data on sequences of length 100, 75, and 50 distributed in the same way as thehairpin data set. Simulation on bisulfite treated at 1% and 10% error rate. The mapping was done with BSMap and Bismark.

Porter et al. BMC Genomics 2015, 16(Suppl 11):S2http://www.biomedcentral.com/1471-2164/16/S11/S2

Page 8 of 9

Page 9: RESEARCH Open Access Investigating bisulfite short-read ...RESEARCH Open Access Investigating bisulfite short-read mapping failure with hairpin bisulfite sequencing data Jacob Porter1,

6. Krueger F, Andrews SR: Bismark: a flexible aligner and methylation callerfor bisulfite-seq applications. Bioinformatics 2011, 27(11):1571-1572.

7. Xi Y, Li W: Bsmap: whole genome bisulfite sequence mapping program.BMC bioinformatics 2009, 10(1):232.

8. Dinh HQ, Dubin M, Sedlazeck FJ, Lettner N, Scheid OM, von Haeseler A:Advanced methylome analysis after bisulfite deep sequencing: anexample in arabidopsis. PloS one 2012, 7(7):41528.

9. Lim JQ, Tennakoon C, Li G, Wong E, Ruan Y, Wei CL, Sung WK: Batmeth:improved mapper for bisulfite sequencing reads on dna methylation.Genome Biol 2012, 13(10):82.

10. Guo W, Fiziev P, Yan W, Cokus S, Sun X, Zhang MQ, Chen PY, Pellegrini M:Bs-seeker2: a versatile aligning pipeline for bisulfite sequencing data.BMC genomics 2013, 14(1):774.

11. Smith TF, Waterman MS: Identification of common molecularsubsequences. Journal of molecular biology 1981, 147(1):195-197.

12. Needleman SB, Wunsch CD: A general method applicable to the searchfor similarities in the amino acid sequence of two proteins. Journal ofmolecular biology 1970, 48(3):443-453.

13. Tran H, Porter J, Sun Ma, Xie H, Zhang L: Objective and comprehensiveevaluation of bisulfite short read mapping tools. Advances inbioinformatics 2014, 2014.

14. Siva N: 1000 genomes project. Nature biotechnology 2008, 26(3):256-256.15. Zhao L, Sun Ma, Li Z, Bai X, Yu M, Wang M, Liang L, Shao X, Arnovitz S,

Wang Q, et al: The dynamics of dna methylation fidelity during mouseembryonic stem cell self-renewal and differentiation. Genome research2014, 24(8):1296-1307.

16. Laird CD, Pleasant ND, Clark AD, Sneeden JL, Hassan KA, Manley NC,Vary JC, Morgan T, Hansen RS, Stoger R: Hairpin-bisulfite pcr: assessingepigenetic methylation patterns on complementary strands of individualdna molecules. Proceedings of the National Academy of Sciences 2004,101(1):204-209.

17. Langmead B, Salzberg SL: Fast gapped-read alignment with bowtie 2.Nature methods 2012, 9(4):357-359.

18. Porter J, Sun Ma, Xie H, Zhang L: Improving bisulfite short-read mappingefficiency with hairpin-bisulfite data. Computational Advances in Bio andMedical Sciences (ICCABS), 2014 IEEE 4th International Conference IEEE; 2014,1-2.

19. Martelli A: Python in a Nutshell O’Reilly Media, Inc; 2006.20. Cock PJ, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, Friedberg I,

Hamelryck T, Kauff F, Wilczynski B, et al: Biopython: freely available pythontools for computational molecular biology and bioinformatics.Bioinformatics 2009, 25(11):1422-1423.

21. Shannon CE: The mathematical theory of communication. 1963. MDcomputing: computers in medical practice 1996, 14(4):306-317.

doi:10.1186/1471-2164-16-S11-S2Cite this article as: Porter et al.: Investigating bisulfite short-readmapping failure with hairpin bisulfite sequencing data. BMC Genomics2015 16(Suppl 11):S2.

Submit your next manuscript to BioMed Centraland take full advantage of:

• Convenient online submission

• Thorough peer review

• No space constraints or color figure charges

• Immediate publication on acceptance

• Inclusion in PubMed, CAS, Scopus and Google Scholar

• Research which is freely available for redistribution

Submit your manuscript at www.biomedcentral.com/submit

Porter et al. BMC Genomics 2015, 16(Suppl 11):S2http://www.biomedcentral.com/1471-2164/16/S11/S2

Page 9 of 9