
Bisulfite Sequencing Plugin

USER MANUAL

User manual for Bisulfite Sequencing 2.1

Windows, Mac OS X and Linux

November 7, 2017

This software is for research purposes only.

QIAGEN Aarhus, Silkeborgvej 2, Prismet, DK-8000 Aarhus C, Denmark

Contents

1 Introduction to Bisulfite Sequencing plugin
  1.1 Background
    1.1.1 DNA-Methylation
    1.1.2 Detecting DNA methylation by bisulfite sequencing
  1.2 Map Bisulfite Reads to Reference
    1.2.1 Selecting reads and directionality
    1.2.2 Selecting the reference
    1.2.3 Including or excluding regions (masking)
    1.2.4 Mapping parameters
    1.2.5 Mapping paired reads
    1.2.6 Non-specific matches
    1.2.7 Gap placement
    1.2.8 Bisulfite read mapping result handling
  1.3 Call Methylation Levels
    1.3.1 Statistical tests and thresholds settings
    1.3.2 Outputs
  1.4 Create RRBS-fragment Track
2 Install and uninstall plugins
  2.1 Install
  2.2 Uninstall
Bibliography
Index


Chapter 1

Introduction to Bisulfite Sequencing plugin

The Bisulfite Sequencing plugin consists of three new tools that you can find in the Epigenomics Analysis folder in the Toolbox (see figure 1.1).

Figure 1.1: The Toolbox with the Bisulfite sequencing tools

Before going into details about each individual tool and how to combine them, we will review some background on DNA-Methylation patterns and the Bisulfite Sequencing approach to detecting these methylation patterns.

1.1 Background

1.1.1 DNA-Methylation

DNA-Methylation is one of the most significant epigenetic mechanisms for cell-programming. DNA-methylation alters the gene expression pattern such that cells can recall their cell type, essentially removing the necessity for continuous external signalling or stimulation. Moreover, DNA-methylation is retained throughout the cell-cycle and thus inherited through cell division. DNA-Methylation involves the addition of a methyl group to the 5-position of the cytosine pyrimidine ring or the number 6 nitrogen of the adenine purine ring. DNA-methylation at the 5-position of cytosines typically occurs in a CpG dinucleotide context. CpG dinucleotides are often grouped in clusters called CpG islands, which are present in the 5’ regulatory regions of many genes.

A large body of evidence has demonstrated that aberrant DNA-methylation is associated with unscheduled gene silencing, resulting in a broad range of human malignancies. Aberrant DNA-methylation manifests itself in two distinct forms: hypomethylation (a loss/reduction of methylation) and hypermethylation (an increase in methylation) compared to normal tissue.


Hypermethylation has been found to play significant roles in carcinogenesis by repressing transcription of tumor suppressor genes, while hypomethylation is implicated in the development and the progression of cancer.

1.1.2 Detecting DNA methylation by bisulfite sequencing

DNA methylation detection methods comprise various techniques. Early methods were based on restriction enzymes, cleaving either methylated or unmethylated CpG dinucleotides. These were laborious but suitable for interrogating the DNA-methylation status of individual DNA sites. It was later discovered that bisulfite treatment of DNA turns unmethylated cytosines into uracils while leaving methylated cytosines unaffected (figure 1.2). This discovery provided a more effective screening method that, in conjunction with whole genome shotgun sequencing of bisulfite-converted DNA, opened up a broad field of genome-wide DNA methylation studies. Ever since, bisulfite shotgun sequencing (BS-seq) has been considered a gold-standard technology for genome-wide methylation profiling at base-pair resolution.

Figure 1.2: Outline of bisulfite conversion of a sample sequence of genomic DNA. Nucleotides in blue are unmethylated cytosines converted to uracils by bisulfite, while red nucleotides are 5-methylcytosines resistant to conversion. Source: https://en.wikipedia.org/wiki/Bisulfite_sequencing

Figure 1.3 depicts the conceptual steps of bisulfite sequencing:

1. Genomic DNA: Genomic DNA is extracted from cells, sheared to fragments, end-repaired, size-selected (around 400 base pairs depending on targeted read length) and ligated with Illumina methylated sequencing adapters. End-repair involves either methylated or unmethylated cytosines, possibly skewing true methylation levels. Therefore, 3’- and 5’-ends of sequenced fragments should be soft-clipped prior to assessing methylation levels.

2. Denaturation: Fragments must be denatured (and kept denatured during bisulfite conversion), because bisulfite can only convert single-stranded DNA.

3. Bisulfite conversion: Bisulfite converts unmethylated cytosines into uracils, but leaves methylated cytosines unchanged. Because bisulfite conversion has degrading effects on the sample DNA, the conversion duration is kept as short as possible, sometimes resulting in incomplete conversions (i.e. not all unmethylated cytosines are converted).

Figure 1.3: Individual steps of the BS-seq workflow include denaturation of fragmented sample DNA, bisulfite conversion, subsequent amplification, sequencing and mapping of the resulting DNA fragments (see text for explanations). Methylated cytosines are drawn in red, unmethylated cytosines and respective uracils/thymidines in blue. DNA nucleotides that are converted in silico (during read mapping) are given in green.

4. PCR amplification: PCR amplification reconstructs the complementary strands of the converted single-stranded fragments and turns uracils into thymines.

5. Strand discordance: Not an actual step of the workflow, but an illustration that bisulfite-converted single-stranded fragments are no longer reverse-complementary after conversion.

6. Paired-end sequencing: Directional paired-end sequencing yields read pairs from both strands of the original sample DNA. The first read of a pair is known to be sequenced either from the original-top (OT) or the original-bottom (OB) strand. The second read of a pair is sequenced from a complementary strand, either ctOT or ctOB. It is a common misunderstanding that the first read of a pair yields methylation information for the top strand and the second read for the bottom strand (or vice versa). Rather, both reads of a read pair report methylation for the same strand of sample DNA, either the top or the bottom strand. Individual read pairs can of course arise from both the top and the bottom strand, eventually yielding information for both strands of the sample DNA.

7. In silico read-conversion: The only bias-free mapping approach for BS-seq reads involves in silico conversion of all reads. All cytosines within all first reads of a pair are converted to thymines and all guanines in all second reads of a pair are converted to adenines (complementary to C-T conversion).

8. Mapping to CT- or GA-conversion of reference genome: The reference genome is also converted into two different in silico versions. In the first conversion all cytosines are replaced by thymines and, in the second conversion, all guanines are converted to adenines. The in silico converted read pairs are then independently mapped to the two conversions of the reference genome and the better of the two mappings is reported as the final mapping result (see green checkboxes).

Note: with non-directional BS-seq, no assumptions regarding the strand origins of either of the reads of a pair can be made (see step 6). Therefore, two different conversions of the read pair need to be created: the first converted pair consists of the CT-conversion of read 1 and the GA-conversion of read 2, and the second converted pair consists of the GA-conversion of read 1 and the CT-conversion of read 2. Both converted read pairs are subsequently mapped to the two conversions of the reference genome. The best of the four resulting mappings is then reported as the final mapping result.
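The in silico conversions described in steps 7 and 8 and in the note above can be sketched in a few lines of Python. This is only an illustration of the idea, not the plugin's implementation; the function names and the example sequences are hypothetical.

```python
def ct_convert(seq: str) -> str:
    """Replace every cytosine with thymine (C->T conversion)."""
    return seq.upper().replace("C", "T")


def ga_convert(seq: str) -> str:
    """Replace every guanine with adenine (G->A conversion)."""
    return seq.upper().replace("G", "A")


def converted_pairs(read1: str, read2: str, directional: bool = True):
    """Return the converted read pairs that would be handed to a mapper."""
    # Directional protocols: read 1 is C->T converted, read 2 is G->A converted.
    pairs = [(ct_convert(read1), ga_convert(read2))]
    if not directional:
        # Non-directional protocols also need the opposite combination.
        pairs.append((ga_convert(read1), ct_convert(read2)))
    return pairs


# Hypothetical toy reference; both converted versions are needed for mapping.
reference = "ACGTTCGGA"
references = {"CT": ct_convert(reference), "GA": ga_convert(reference)}

# Non-directional case: 2 converted pairs x 2 converted references = 4 mappings.
pairs = converted_pairs("ACGT", "CCGA", directional=False)
n_mappings = len(pairs) * len(references)
```

In the directional case only one converted pair is produced, so two candidate mappings are compared instead of four.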

1.2 Map Bisulfite Reads to Reference

Setting up the Map Bisulfite Reads to Reference tool is very similar to the usual read mapping, with a few differences. In particular,

• Some options such as ’Global alignment’ are either not available or preset.

• Only a track version of mappings is produced in output.

• The bisulfite mappings in CLC Workbenches have a special ’invisible’ property set for them, to inform the downstream tool Call Methylation Levels (see section 1.3) about the correct type of input.

Please note that, because two versions of the reference sequence (C->T and G->A converted) have to be indexed and used simultaneously for each read, the RAM requirements for the bisulfite mapper are double those needed for a regular mapping against a reference sequence of the same size. To learn more about the computational requirements of the read mapper, see http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Computational_requirements.html




1.2.1 Selecting reads and directionality

To start the read mapping:

Toolbox | Epigenomics Analysis | Bisulfite Sequencing | Map Bisulfite Reads to Reference

In the first dialog, select the sequences or sequence lists containing the sequencing data (figure 1.4). Please note that reads longer than 100,000 bases are not supported.

Figure 1.4: Specifying the reads as input. You can also choose to work in batch.

When the sequences are selected, click Next, and you will see the dialog shown in figure 1.5.

Figure 1.5: Specifying the directionality of the protocol.

Bisulfite sequencing protocols may be directional or non-directional, and here you can specify what type of protocol you have used.

A directional protocol yields reads from both strands of the original sample DNA. The first read of a pair (or every read for single-end sequencing) is known to be sequenced either from the original-top (OT) or the original-bottom (OB) strand. The second read of a pair is sequenced from a complementary strand, either ctOT or ctOB. At the time of writing, examples of directional protocols include:

• Illumina TruSeq DNA Methylation Kit (formerly EpiGnome)

• Kits from the NuGen Ovation family of products

• Swift Accel-NGS Methyl-seq DNA Library Kit

• Libraries prepared by the ’Lister’ method

In a non-directional protocol, the first read of a pair may come from any of the four strands: OT, OB, ctOT or ctOB. Examples include:


• Zymo Pico Methyl-Seq Library Kit

• Bioo Scientific (Perkin Elmer) NEXTflex Bisulfite-Seq Kit

• Libraries prepared by the ’Cokus’ method

Note: Previous versions of the Bisulfite Sequencing plugin did not support the use of non-directional bisulfite sequencing protocols.

If you are uncertain about the directionality of your protocol, contact the protocol vendor¹.

1.2.2 Selecting the reference

When the sequences and directionality are selected, click Next, and you will see the dialog shown in figure 1.6.

Figure 1.6: Specifying the reference sequences and masking.

At the top you select one or more reference sequences by clicking the Browse and select element button. You can select either single sequences, a list of sequences or a sequence track as reference. Note the following constraints:

• Single reference sequences longer than 2 Gbp (2 · 10^9 bases) are not supported.

• A maximum of 120 input items (sequence lists or sequence elements) can be used as input to a single read mapping run.

1.2.3 Including or excluding regions (masking)

The next part of the dialog shown in figure 1.6 lets you mask the reference sequences. Masking refers to a mechanism where parts of the reference sequence are not considered in the mapping. This can be useful, for example, when the data being mapped is captured from specific regions (e.g. for amplicon resequencing). The read mapping will still base its output on the full reference; it is only the core read mapping that ignores the masked regions.

¹ It is sometimes possible to infer the directionality by looking at the reads. In the absence of methylation, a directional protocol will have few or no Cs in the first read of each pair. We do not recommend using this approach.


Masking is performed by discarding the masked-out nucleotides. As a result the reference is split into separate sequences, which are positioned according to the original unmasked reference sequence.

Note that you should be careful that your data is indeed only sequenced from the target regions. If not, some of the reads that would have matched a masked-out region perfectly may be placed wrongly at another position with a less-perfect match and lead to wrong results for subsequent variant calling. For resequencing purposes, we recommend testing whether masking is appropriate by running the same data set through two rounds of read mapping and variant calling: one with masking and one without. At the end, comparing the results will reveal if any off-target sequences cause problems in the variant calling.

Masking out repeats or using other masks with many regions is not recommended. Repeats are handled well and do not cause any slowdown. On the contrary, masking repeats is likely to cause a dramatic slowdown in speed, increase memory requirements and lead to incorrect read placement.

To mask a reference sequence, first click the Include or Exclude option, and then click the Browse button to select a track to use for masking.

1.2.4 Mapping parameters

Clicking Next leads to the parameters for the read mapping (see figure 1.7).

Figure 1.7: Setting parameters for the bisulfite read mapping.

The first parameters allow the match score and the mismatch cost to be adjusted:

• Match score: The positive score for a match between the read and the reference sequence. It is set by default to 1 but can be adjusted up to 10.

• Mismatch cost: The cost of a mismatch between the read and the reference sequence. Ambiguous nucleotides such as "N", "R" or "Y" in read or reference sequences are treated as mismatches and any column with one of these symbols will therefore be penalized with the mismatch cost.

After setting the mismatch cost you need to choose between linear gap cost and affine gap cost, and depending on the model you choose, you need to set two different sets of parameters that control how gaps in the read mapping are penalized.

• Linear gap cost: The cost of a gap is computed directly from the length of the gap and the insertion or deletion cost. This model often favors small, fragmented gaps over long contiguous gaps. If you choose linear gap cost, you must set the insertion cost and the deletion cost:

Insertion cost: The cost of an insertion in the read (a gap in the reference sequence). The cost of an insertion of length $\ell$ will be

$$\ell \cdot \text{Insertion cost}, \tag{1.1}$$

Deletion cost: The cost of a deletion in the read (a gap in the read sequence). The cost of a deletion of length $\ell$ will be

$$\ell \cdot \text{Deletion cost}. \tag{1.2}$$

• Affine gap cost: An extra cost associated with opening a gap is introduced such that long contiguous gaps are favored over short gaps. If you choose affine gap cost, you must also set the cost of opening an insertion or a deletion:

Insertion open cost: The cost of opening an insertion in the read (a gap in the reference sequence).

Insertion extend cost: The cost of extending an insertion in the read (a gap in the reference sequence) by one column.

Deletion open cost: The cost of opening a deletion in the read (a gap in the read sequence).

Deletion extend cost: The cost of extending a deletion in the read (a gap in the read sequence) by one column.

Using affine gap cost, an insertion of length $\ell$ is penalized by

$$\text{Insertion open cost} + \ell \cdot \text{Insertion extend cost}, \tag{1.3}$$

and a deletion of length $\ell$ is penalized by

$$\text{Deletion open cost} + \ell \cdot \text{Deletion extend cost}. \tag{1.4}$$

In this way long consecutive gaps get a lower cost per column than small fragmented gaps and they are therefore favored.
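To make the difference between the two gap models concrete, the following minimal Python sketch evaluates formulas (1.1) to (1.4) for one long gap and for several short gaps covering the same number of columns. The cost values used here (3 for the linear model, open cost 6 and extend cost 1 for the affine model) are chosen purely for illustration and are not claimed to be the tool's defaults.

```python
def linear_gap_cost(length: int, cost_per_column: int) -> int:
    """Formulas (1.1)/(1.2): cost grows in proportion to the gap length."""
    return length * cost_per_column


def affine_gap_cost(length: int, open_cost: int, extend_cost: int) -> int:
    """Formulas (1.3)/(1.4): a one-off opening cost plus a per-column cost."""
    return open_cost + length * extend_cost


# Illustrative (hypothetical) parameter values.
LINEAR_COST, OPEN_COST, EXTEND_COST = 3, 6, 1

# One contiguous gap of 4 columns versus four separate gaps of 1 column each.
linear_one_long = linear_gap_cost(4, LINEAR_COST)
linear_four_short = 4 * linear_gap_cost(1, LINEAR_COST)
affine_one_long = affine_gap_cost(4, OPEN_COST, EXTEND_COST)
affine_four_short = 4 * affine_gap_cost(1, OPEN_COST, EXTEND_COST)
```

With these values the linear model charges 12 in both cases, whereas the affine model charges 10 for the single long gap and 28 for the four short gaps, which is why the affine model favors contiguous gaps.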

The score of a match between the read and the reference is usually set to 1. Adjusting the cost parameters above can improve the mapping quality, e.g. when the read error rate is high or the reference is expected to differ significantly from the sequenced organism. For example, if the reads are expected to contain many insertions and/or deletions, it can be a good idea to lower the insertion and deletion costs to allow more of such errors. However, one should also consider the possible drawbacks when adjusting these settings. For example, reducing the insertion and deletion cost increases the risk of mapping reads to the wrong positions in the reference.

Figure 1.8: An alignment of a read where a region of 35 bp at the start of the read is unaligned while the remaining 57 nucleotides match the reference.

Figure 1.8 shows an example using linear gap cost where the read mapper is unable to map a region in a read due to insertions in the read and mismatches between the read and the reference. The aligned region of the read has a total of 57 matching nucleotides, which results in an alignment score of 57. This is optimal when using the default costs for mismatches and insertions/deletions (2 and 3, respectively). If the mapper had aligned the remaining 35 bp of the read as shown in figure 1.9 using the default scoring scheme, the score would become:

$$(26 + 1 + 3 + 57) \cdot 1 - 5 \cdot 2 - 8 \cdot 3 = 53 \tag{1.5}$$

In this case, the alignment shown in figure 1.8 is optimal since it has the higher score. However, if the cost of either deletions or mismatches were reduced by one, the score of the alignment shown in figure 1.9 would become 61 or 58, respectively, making it the optimal alignment.
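The arithmetic of this example can be reproduced with a short Python sketch. The function name and its parameters are hypothetical; the numbers (87 matching columns, 5 mismatches, 8 gap columns, and the costs 1, 2 and 3) are taken from equation (1.5) and the surrounding text.

```python
def alignment_score(matches, mismatches, gap_columns,
                    match_score=1, mismatch_cost=2, gap_cost=3):
    """Score of an alignment under a linear gap cost model."""
    return (matches * match_score
            - mismatches * mismatch_cost
            - gap_columns * gap_cost)


# Figure 1.8: only the 57 matching nucleotides are aligned.
partial = alignment_score(57, 0, 0)

# Figure 1.9: full-length alignment with 5 mismatches and 8 gap columns.
full_matches = 26 + 1 + 3 + 57
full_default = alignment_score(full_matches, 5, 8)
full_cheaper_gaps = alignment_score(full_matches, 5, 8, gap_cost=2)
full_cheaper_mismatch = alignment_score(full_matches, 5, 8, mismatch_cost=1)
```

Under the default costs the partial alignment wins (57 versus 53); lowering either cost by one raises the full-length score to 61 or 58, both above 57.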

Figure 1.9: An alignment of a read containing a region with several mismatches and deletions. By reducing the default cost of either mismatches or deletions the read mapper can make an alignment that spans the full length of the read.

Once the optimal alignment of the read is found, based on the cost parameters described above, a filtering process determines whether this match is good enough for the read to be included in the output. The filtering threshold is determined by two factors:

• Length fraction: The minimum percentage of the total alignment length that must match the reference sequence at the selected similarity fraction. A fraction of 0.5 means that at least half of the alignment must match the reference sequence before the read is included in the mapping (if the similarity fraction is set to 1). Note that the minimal seed (word) size for read mapping is 15 bp, so reads shorter than this will not be mapped.

• Similarity fraction: The minimum percentage identity between the aligned region of the read and the reference sequence. For example, if the identity should be at least 80% for the read to be included in the mapping, set this value to 0.8. Note that the similarity fraction relates to the length fraction, i.e. when the length fraction is set to 50% then at least 50% of the alignment must have at least 80% identity (see figure 1.10).
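One way to read the interaction of the two thresholds is sketched below in Python: the similarity is taken as the best identity found in any contiguous part of the alignment that covers at least the length fraction of the total alignment, following the description in the caption of figure 1.10. This is an interpretation for illustration only; the function names, the per-column boolean representation and the toy alignment are assumptions, and the actual filter in the workbench may differ in detail.

```python
import math


def best_window_similarity(columns, length_fraction):
    """Highest identity over any contiguous window that covers at least
    length_fraction of the alignment. `columns` holds True for each
    alignment column where read and reference match."""
    total = len(columns)
    min_len = max(1, math.ceil(length_fraction * total))
    # Prefix sums give the number of matches in any window in O(1).
    prefix = [0]
    for is_match in columns:
        prefix.append(prefix[-1] + int(is_match))
    best = 0.0
    for start in range(total):
        for end in range(start + min_len, total + 1):
            identity = (prefix[end] - prefix[start]) / (end - start)
            best = max(best, identity)
    return best


def passes_filter(columns, length_fraction=0.5, similarity_fraction=0.8):
    """True if some sufficiently long part of the alignment is similar enough."""
    return best_window_similarity(columns, length_fraction) >= similarity_fraction


# Toy alignment of 60 columns: a 30-column region with 25 matches (0.83),
# followed by a poorly matching region.
toy_alignment = [True] * 25 + [False] * 5 + [True, False] * 15
```

With a length fraction of 0.5 and a similarity fraction of 0.8, `passes_filter(toy_alignment)` is true because the first 30 columns reach an identity of about 0.83; raising the similarity fraction to 0.9 would reject the same alignment.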

1.2.5 Mapping paired reads

• Auto-detect paired distances: At the bottom of the dialog shown in figure ?? you can specify how Paired reads should be handled. If the sequence list used as input contains paired reads, this option will automatically be enabled; if it contains single reads, this option will not be applicable.

Figure 1.10: A read containing 59 nucleotides where the total alignment length is 60. The part of the alignment that gave rise to the optimal score has length 58, which excludes 2 bases at the left end of the read. The length fraction of the matching region in this example is therefore 58/60 = 0.97. Given a minimum length fraction of 0.5, the similarity fraction of the alignment is computed as the maximum similarity fraction of any part of the alignment which constitutes at least 50% of the total alignment. In this example the marked region in the alignment with length 30 (50% of the alignment length) has a similarity fraction of 0.83, which will satisfy the default minimum similarity fraction requirement of 0.8.

The CLC Workbench offers as the default choice to automatically calculate the distance between the pairs. If this is selected, the distance is estimated in the following way:

1. A sample of 100,000 reads is extracted randomly from the full data set and mapped against the reference using a very wide distance interval.

2. The distribution of distances between the paired reads is analyzed, and an appropriate distance interval is selected:

• If less than 10,000 reads map, a simple calculation is used where the minimum distance is one standard deviation below the average distance, and the maximum distance is one standard deviation above the average distance.

• If more than 10,000 reads map, a more sophisticated method is used which investigates the shape of the distribution and finds the boundaries of the peak.

3. The full sample is mapped using this distance interval.

4. The history of the result records the distance interval used.

The above procedure will be run for each sequence list used as input, assuming that they do not necessarily share the same library preparation and could have different distributions of paired distances. Figure 1.11 shows an example of the distribution of intervals with and without automatic pair distance interval estimation.

Sometimes the automatic estimation of the distance between the pairs may return a warning "multiple intervals detected". This may happen if the reads derive from multiple libraries or from certain types of amplicon sequencing protocols. In this case, the estimates may still be correct, but, if in doubt, the user may want to disable the option to automatically estimate paired distances and instead manually specify minimum and maximum distances between pairs on the input sequence list.
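The simple calculation used when fewer than 10,000 reads map (mean distance plus or minus one standard deviation) can be sketched as follows. The function name is hypothetical, and the use of the sample standard deviation is an assumption, since the manual does not state which estimator is used. The more sophisticated peak-finding method is not described in enough detail to be sketched here.

```python
import statistics


def simple_distance_interval(distances):
    """Return (minimum, maximum) paired distance as mean -/+ one
    standard deviation of the observed pair distances."""
    mean = statistics.mean(distances)
    # Assumption: sample standard deviation (n - 1 in the denominator).
    sd = statistics.stdev(distances)
    return mean - sd, mean + sd


# Toy example with three observed pair distances.
low, high = simple_distance_interval([180, 200, 220])
```

For the toy distances the mean is 200 and the sample standard deviation is 20, giving an interval from 180 to 220.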


Figure 1.11: To the left: mapping with a narrower distance interval estimated by the workbench. To the right: mapping with a large paired distance interval (note the large right tail of the distribution).

If the automatic detection of paired distances is not checked, the mapper will use the information about minimum and maximum distance recorded on the input sequence lists.

The ’automatic detection of paired distance’ option when mapping should be used with caution. It is possible that the estimated distance setting is too narrow and consequently many read pairs will be flagged broken. For more information, see the Read mapping section of the workbench you are working with (CLC Genomics or Biomedical Genomics).

When a paired distance interval is set, the following approach is used for determining the placement of read pairs:

• First, all the optimal placements for the two individual reads are found.

• Then, the allowed placements according to the paired distance interval are found.

• If both reads can be placed independently but no pair satisfies the paired criteria, the reads are treated as independent and marked as a broken pair.

• If only one pair of placements satisfies the criteria, the reads are placed accordingly and marked as uniquely placed even if either read may have multiple optimal placements.

• If several placements satisfy the paired criteria, the pair is treated as a non-specific match (see section 1.2.6 for more information).

• If one read is uniquely mapped but the other read has several placements that are valid given the distance interval, the mapper chooses the location that is closest to the first read.
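The placement rules above can be summarised in a small decision function. This is a simplified Python sketch under several assumptions: placements are represented by single integer positions, the pair distance is taken as the absolute difference between the two positions, and the labels returned are hypothetical names rather than terms used by the workbench.

```python
def classify_pair(placements1, placements2, min_dist, max_dist):
    """Decide how a read pair is placed, given the optimal placements
    (positions) of each read and the allowed paired distance interval."""
    if not placements1 or not placements2:
        raise ValueError("both reads need at least one placement")
    # All combinations of placements that respect the distance interval.
    valid = [(p1, p2) for p1 in placements1 for p2 in placements2
             if min_dist <= abs(p2 - p1) <= max_dist]
    if not valid:
        return "broken", None
    if len(valid) == 1:
        return "unique", valid[0]
    if len(placements1) == 1 or len(placements2) == 1:
        # One read is uniquely mapped: pick the mate placement closest to it.
        return "closest", min(valid, key=lambda pair: abs(pair[1] - pair[0]))
    return "non-specific", None
```

For example, `classify_pair([100], [280, 320], 150, 250)` selects the mate at position 280 because it lies closest to the uniquely mapped read, while `classify_pair([100, 1000], [300, 1200], 150, 250)` is reported as non-specific because two different pair placements satisfy the interval.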

1.2.6 Non-specific matches

At the bottom of the dialog, you can specify how Non-specific matches should be treated. The concept of Non-specific matches refers to a situation where a read aligns at more than one position with an equally good score. In this case you have two options:

• Random. This will place the read in one of the positions randomly.

• Ignore. This will not include the read in the final mapping.


Note that a read is only considered non-specific when the read matches equally well at several alignment positions. If there are e.g. two possible alignment positions and one of them is a perfect match and the other involves a mismatch, the read is placed at the position with the perfect match and it is not marked as a non-specific match.

For paired data, reads are only considered non-specific matches if the entire pair could be mapped elsewhere with equal scores for both reads, or if the pair is broken, in which case a read can be categorized as non-specific in the same way as single reads (see section 1.2.5).
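For single reads, the two options can be illustrated with a short Python sketch. The function name, the `(position, score)` representation of candidate placements and the optional random generator argument are assumptions made for this example only.

```python
import random


def place_read(candidates, mode="random", rng=None):
    """Choose a placement from (position, score) candidates.

    A read is non-specific only if several candidates share the best score.
    Returns the chosen position, or None if the read is left out."""
    if not candidates:
        return None
    best_score = max(score for _, score in candidates)
    best = [pos for pos, score in candidates if score == best_score]
    if len(best) == 1:
        # A single best-scoring position: not a non-specific match.
        return best[0]
    if mode == "ignore":
        return None
    if mode == "random":
        return (rng or random).choice(best)
    raise ValueError(f"unknown mode: {mode}")
```

A read with candidates `[(10, 50), (90, 49)]` is always placed at position 10, whereas a read with candidates `[(10, 50), (90, 50)]` is either dropped (`"ignore"`) or placed at one of the two positions at random (`"random"`).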

When looking at the mapping, the default color for non-specific matches is yellow.

1.2.7 Gap placement

In the case of insertions or deletions in homopolymeric or repetitive regions, the precise placement of the insertion or deletion cannot be determined from the data. An example is shown in figure 1.12.

Figure 1.12: Three A’s in the reference (top) have been replaced by two A’s in the reads (shown in red). The gap is placed towards the 5’ end, but could have been placed towards the 3’ end with an equally good mapping score for the read.

In this example, three A’s in the reference (top) have been replaced by two A’s in the reads (shown in red). The gap is placed towards the 5’ end (left side), but could have been placed towards the 3’ end with an equally good mapping score for the read as shown in figure 1.13.

Figure 1.13: Three A’s in the reference (top) have been replaced by two A’s in the reads (shown in red). The gap is placed towards the 3’ end, but could have been placed towards the 5’ end with an equally good mapping score for the read.

Since either way of placing the gap is arbitrary, the goal of the mapper is to place the gaps consistently at the same side for all reads.

1.2.8 Bisulfite read mapping result handling

Clicking Next lets you choose how the output of the mapping should be reported (see figure 1.14).


Figure 1.14: Mapping output options.

There are two independent output options, each of which can be activated or deactivated:

• Create report. This will generate a summary report as described in http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Summary_mapping_report.html.

• Collect un-mapped reads. This will collect all the reads that could not be mapped to the reference into a sequence list (there will be one list of unmapped reads per sample, and for paired reads, there will be one list for intact pairs and one for single reads where the mate could be mapped).

However, the main output is a reads track:

Reads track: A reads track is very "lean" (i.e. with respect to memory requirements) since it only contains the reads themselves. Additional information about the reference, consensus sequence or annotations can be added and viewed alongside in the context of a Track List later (by adding, for example, a reference and/or annotation track, respectively). This kind of output is useful when working with tracks in general and is especially recommended for resequencing purposes.

Note that the tool will output an empty read mapping and report when nothing mapped, and empty unmapped reads if everything mapped.

Figure 1.15 illustrates the view of a typical directional shotgun BS-seq mapping. As with any read mapping view, the colour of the reads follows the usual CLC convention, that is green/red for forward/reverse reads, and dark/pale blue for paired reads. Independent of this orientation property, each read or read pair has an ’invisible’ property indicating if it came from the original top (OT), or original bottom (OB) strand. However, if the BS-sequencing protocol is truly 100% directional, the orientation in the mapping and the OT/OB origin of reads/read pairs will be concordant. In this figure, blocks of reads from the original top strand are marked with curly brackets on the left, while the rest are from the original bottom strand. In a mapping, they can be distinguished by a pattern of highlighted mismatches to the reference. OT reads will have mismatches predominantly on ’T’, due to C->T conversion; OB reads will have a pattern of mismatches on ’A’ symbols, corresponding to G->A conversion as seen on the reverse-complementary strand.




Figure 1.15: A typical directional shotgun BS-seq mapping, together with the base-level methylation calling feature track on top.

When methyl-C occurs in a sample, there will be a match in the reads, instead of an expected mismatch.

In this figure, there were two positions where such events occurred, both on the original bottom strand, and both supported by a single read only. ’G’ symbols in those reads are shown in red boxes. The reverse direction of an arrowhead on a base-level methylation track also reflects the OB-position of a methylation event.

Note also that it appears that there may be a G/T heterozygous SNP (C/A in the OB strand) in the second position. While such occurrences may lead to underestimation of true methylation levels of cytosine alleles in heterozygous SNPs, our current tool does not attempt to compensate for such eventualities.

To understand and interpret BS-sequencing and mapping better, it may be helpful to examine the position (marked with a red asterisk) in between the two detected methylation events. There appears to be an additional A/C heterozygous SNP with C’s in reads from the OT strand fully converted to T’s, i.e., showing no evidence of methylation at that heterozygous position.

1.3 Call Methylation Levels

The tool takes as input one or more read mappings created by the Map Bisulfite Reads to Reference tool (see section 1.2). If more than one mapping is used as input, various statistical options to detect differential methylation become available.

The tool will accept a regular mapping as input, but will warn about possibly inconsistent interpretation of results. Mapping done in a ’normal’, non-bisulfite mode is likely to result in sub-optimal placement of reads due to a large number of C/T mismatches between bisulfite-converted reads and a reference. Also, this tool will consequently interpret the majority of cytosines in a reference as methylated, creating possibly very large and misleading output files. The invisible ’bisulfite’ property of a mapping may be erased if the original mapping is manipulated in the workbench with other tools, such as the Merge Read Mappings tool, in which case the warning should be ignored.

After selecting the relevant mapping(s), the wizard offers to set the parameters for base-level methylation calling, as shown in figure 1.16:

Figure 1.16: Methylation call settings.

• The first three check boxes (Ignore non-specific matches, Ignore duplicate matches, Ignore broken pairs) enable control over whether or not certain reads will be included in base-level methylation calling and subsequent statistical analysis. The recommended option is to have them turned on.

– Ignore non-specific matches: Reads or matches mapped ambiguously will be ignored.

– Ignore duplicate matches: Multiple reads with identical mapping coordinates will be counted once only.

– Ignore broken pairs: Paired reads mapped as broken pairs will be ignored.

• Read 1 soft clip, Read 2 soft clip: set the number of bases at the 5’-end of a read that will be ignored in methylation calling. It is common for bisulfite data to have a technical bias in amplification and sequencing, making a small number of bases at the beginning of a read (usually not more than 5) unreliable for calling. Setting the parameter to 0 (default) and inspecting the graph in the report may help determine the appropriate number for a given dataset, if a bias is suspected.

• Methylation context group popup menu controls in which context the calls will be made.

– Standard contexts include:
    CpG: Detects 5-methylated cytosines in CpG contexts
    CHG: Detects 5-methylated cytosines in CHG contexts (H = A/C/T)
    CHH: Detects 5-methylated cytosines in CHH contexts

– NOMe-seq contexts [Kelly et al., 2012] include:
    GCH: Detects enzymatic methylation in GCH contexts
    HCG: Detects endogenous methylation in HCG contexts
    GCG: Detects ambiguous methylation in GCG contexts

– Exhaustive: Detects 5-methylated cytosines independently of their nucleotide context

• Confirm methylation contexts in reads checkbox controls whether the selected context(s) must be present in the read itself, and not just in the reference sequence, before a call is made. This is useful if a sample has a variant that can affect the context of a call: when the box is checked, reads representing a variant allele that breaks the context are excluded.

• Minimum strand-specific coverage sets a lower limit of coverage for the top or the bottom strand, to filter out positions with low coverage.

• Restrict calling to target regions enables selection of a feature track to limit calling to defined regions. In addition to genes, CDSs and other annotation tracks that can be generated or imported into the workbench, the tool Create RRBS-fragment Track (see section 1.4) can be used to generate fragments of pre-selected size predicted for restriction digest of a reference genome with commonly used frequent cutters that target common methylation contexts, such as MspI.
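The standard contexts listed above can be illustrated with a short Python sketch. It is not the plugin's implementation; the function name `classify_context` and the 0-based coordinates are assumptions made for illustration, and only top-strand cytosines are handled (bottom-strand cytosines would be classified on the reverse complement).

```python
from typing import Optional


def classify_context(reference: str, pos: int) -> Optional[str]:
    """Classify a top-strand cytosine at 0-based `pos` as CpG, CHG or CHH.

    Returns None if the base is not C, or if the downstream bases needed
    to decide the context are missing or ambiguous (for example N).
    Hypothetical helper for illustration only.
    """
    seq = reference.upper()
    if seq[pos] != "C":
        return None
    downstream = seq[pos + 1 : pos + 3]
    # C followed directly by G is a CpG, whatever comes next.
    if len(downstream) >= 1 and downstream[0] == "G":
        return "CpG"
    # CHG and CHH need two unambiguous downstream bases.
    if len(downstream) < 2 or any(base not in "ACGT" for base in downstream):
        return None
    # At this point the first downstream base is H (A, C or T).
    return "CHG" if downstream[1] == "G" else "CHH"


reference = "ACGTCAGTCTTAC"
for position in (1, 4, 8, 12):
    print(position, classify_context(reference, position))
```

In this toy reference the cytosines at positions 1, 4 and 8 fall in CpG, CHG and CHH contexts respectively, while the final cytosine has no downstream bases and is left unclassified.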

1.3.1 Statistical tests and thresholds settings

The next set of parameters depends on the experimental setup and the number of samples in the input, as shown in figure 1.17.

Statistical test

• Statistic mode pop-up menu offers the following choices.

• Fisher exact: Compares methylation levels of a case/control sample pair; multiple case/control samples are merged before pair-wise comparison. Context-specific coverage and methylated coverage are summed separately for case and control mappings in a window of pre-set size. If more than one case or control sample is provided, the values within each set are simply added. The contingency table is used to evaluate the hypergeometric cumulative distribution probability for methylated coverage in case sample(s) to be equal to or greater than in controls, given the context-specific coverage in a window. Therefore, this test reports statistically significant HYPER-methylation in cases, compared to controls, in a given window. To identify regions that are hypo-methylated compared to controls with this test, simply reverse case and control when specifying the inputs.

• Chi-squared: Analyses the inter-individual methylation-level variability across a cohort of samples; no controls are supported. A contingency table is constructed for a window, where each row corresponds to an input sample, with coverage counted for methylated and unmethylated cytosines within strand/context. Expected values for those, given sample coverage in a window, are calculated from the aggregate for all samples, and deviation is evaluated with a Chi-squared test. This statistic tells whether an input group of samples shows methylation heterogeneity in a given window.

Figure 1.17: The statistical tests and thresholds settings

• ANOVA: Assesses differential methylation by comparing a case-sample group versus a control-sample group; requires at least two case samples and two control samples. It tests if variability in methylation levels within each group is less than between groups, in a window of interest.

• No test: No test will be performed and only methylation levels will be produced for each input sample; the remaining options on that screen will be grayed out.

• Maximum p-value sets the limit of probability calculated in the statistical test of choice, at which a window will be accepted as significant and included in the output.

• Control samples menu is used to select the bisulfite mappings that serve as controls, as required by the Fisher exact and ANOVA statistics.
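As an illustration of the Chi-squared option described above, the following Python sketch builds the per-window contingency table (one row per sample, with methylated and unmethylated coverage) and computes the test statistic from expected values derived from the aggregate. It is a minimal sketch under stated assumptions, not the plugin's code: the function name and input layout are hypothetical, samples without coverage are simply dropped, and the conversion of the statistic to a p-value (from the Chi-squared distribution with the returned degrees of freedom) is left out.

```python
def chi_squared_statistic(samples):
    """Chi-squared statistic for methylation heterogeneity in one window.

    `samples` is a list of (methylated_coverage, unmethylated_coverage)
    pairs, one per input sample (hypothetical input layout).
    Returns (statistic, degrees_of_freedom).
    """
    # Assumption: samples with no coverage in the window are ignored.
    rows = [(meth, unmeth) for meth, unmeth in samples if meth + unmeth > 0]
    total_meth = sum(meth for meth, _ in rows)
    total_unmeth = sum(unmeth for _, unmeth in rows)
    grand_total = total_meth + total_unmeth
    if len(rows) < 2 or total_meth == 0 or total_unmeth == 0:
        raise ValueError("need two or more covered samples and both outcomes")

    statistic = 0.0
    for meth, unmeth in rows:
        row_total = meth + unmeth
        for observed, column_total in ((meth, total_meth), (unmeth, total_unmeth)):
            # Expected count given this sample's coverage and the pooled rate.
            expected = row_total * column_total / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic, len(rows) - 1


print(chi_squared_statistic([(10, 10), (5, 5), (20, 20)]))  # identical levels
print(chi_squared_statistic([(20, 0), (0, 20)]))            # opposite levels
```

Samples with identical methylation levels give a statistic of zero, whereas strongly diverging samples give a large statistic and hence a small p-value.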

Window thresholds

• Window length Used when no window track was chosen in the previous step to focus the analysis; differential methylation is then examined in windows of this fixed size. Defines the size of the window along the genome within which methylation levels in case and control samples are compared, and the statistical significance of any difference is calculated and reported. Windows are evaluated sequentially along the reference.

• Minimum number of samples A window will be skipped if fewer than this number of samples in a group have coverage at or above the Minimum strand-specific coverage in a minimum number of sites, as defined below.

Sample thresholds

• Minimum high-confidence site-coverage A site with at least this coverage is considered a high-confidence site.

• Minimum high-confidence site-count Exclude a sample from the current window if it has fewer than this number of high-confidence methylation sites.

• Maximum mean site coverage Exclude a sample from the current window if it has a higher mean site coverage. The default "0.0" setting does not filter any samples.
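The interplay of the window and sample thresholds can be sketched in Python as follows. This is one possible reading of the descriptions above, not the plugin's implementation: the function and parameter names are hypothetical, and the way mean site coverage is computed (a plain average over the sites in the window) is an assumption.

```python
def sample_passes(site_coverages, min_site_coverage, min_site_count,
                  max_mean_coverage=0.0):
    """Decide whether one sample is kept for the current window.

    `site_coverages` lists the coverage of each methylation site of the
    sample in the window (hypothetical input layout).
    """
    high_confidence = [c for c in site_coverages if c >= min_site_coverage]
    if len(high_confidence) < min_site_count:
        return False
    # A maximum of 0.0 disables the mean-coverage filter.
    if max_mean_coverage > 0 and site_coverages:
        mean_coverage = sum(site_coverages) / len(site_coverages)
        if mean_coverage > max_mean_coverage:
            return False
    return True


def window_is_tested(group, min_samples, **thresholds):
    """A window is skipped unless enough samples in the group pass."""
    passing = sum(sample_passes(sample, **thresholds) for sample in group)
    return passing >= min_samples


group = [[12, 15, 9, 20], [3, 4, 2, 30], [11, 10, 10, 10]]
print(window_is_tested(group, min_samples=2,
                       min_site_coverage=10, min_site_count=3))
```

With these toy values, two of the three samples have at least three high-confidence sites, so the window is evaluated when the minimum number of samples is 2 and skipped when it is 3.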

1.3.2 Outputs

The tool produces a number of feature tracks and reports. Select the outputs you are interested in during the last wizard step of the tool (figure 1.18).

Figure 1.18: Result handling options

The Create track of methylated cytosines option is chosen by default. It will provide a base-level methylation track for each read mapping supplied, i.e., case or control (see figure 1.19 for a table view of the track).

In the table, each row corresponds to a cytosine that conforms to a context (such as 'CpG' in this example) and which has non-zero methylated coverage.


Figure 1.19: Output table

The columns of the methylation levels track table view indicate:

• Chromosome chromosome in which the methylated cytosine is found

• Region position of the mapping where the methylated cytosine is found. Rows with 'Region' values that start with 'complement' represent methylated Cs in reads that come from the original bottom strand of the reference.

• Name of the context in which methylation is detected (see the tooltip of the wizard for the names and definitions of the various contexts available).

• Total coverage total read coverage of the position. May be calculated after filtering for non-specific, broken, and duplicate reads if these options are enabled.

• Strand coverage of the total coverage, how many reads are in the same direction as the strand in which the methylated C is detected (original top, or original bottom)

• Context coverage of the strand coverage, how many reads conform to the selected methylation context

• Methylated coverage how many reads support evidence of methylation in this position, i.e., retained Cs instead of conversion to Ts

• Methylation level "Methylated coverage" divided by "Context coverage"
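The relation between the last three columns can be illustrated with a small Python sketch. It assumes, for illustration only, that the input is the list of read bases observed over a single reference cytosine, restricted to context-conforming reads from the original top strand; the function name is hypothetical.

```python
def methylation_call(read_bases):
    """Summarise bisulfite read bases observed over one reference cytosine.

    Retained Cs count as methylated and Ts as converted (unmethylated);
    any other base is ignored in this simplified sketch.
    """
    methylated = read_bases.count("C")
    converted = read_bases.count("T")
    context_coverage = methylated + converted
    level = methylated / context_coverage if context_coverage else None
    return {
        "context_coverage": context_coverage,
        "methylated_coverage": methylated,
        "methylation_level": level,
    }


print(methylation_call("CCTCTTCC"))
```

Here five of the eight reads retain a C, giving a methylation level of 5/8 = 0.625.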

For each mapping, you can also generate an optional summary report by selecting the Create methylation reports option. This report includes statistics of direction of mapping of reads/read pairs, chosen contexts, and useful graphs. The graphs can help detect any bias in called methylation levels that commonly occurs at the start of BS-seq reads due to end-repair of DNA fragments in library preparation. This facilitates setting the correct trimming parameters for Read 1 soft clip and Read 2 soft clip.

Note that positions where no methylation was detected are filtered from the final output and are not reported in the 'Methylation levels' feature track. However, they are included in the intermediate calculations for differential methylation detection.

When the statistical test is performed, a feature track is produced. If more than one methylation context is chosen, a separate feature track is produced for each individual context, i.e., for CpG, CHH, etc. The table view of such a track for the Fisher exact test is shown in figure 1.20.

The columns of the differential methylation feature track table indicate:

• Name column not used


Figure 1.20: Example of table with statistical test output

• Cytosines total number of cytosines in the region

• Case samples number of samples in the case group

• Case coverage sum of "Total coverage" values of the region in the case group

• Case coverage mean sum of "Context coverage" in the region divided by the number of covered Cs in context in the region in the case group

• Case methylated sum of "Methylated coverage" in the region in the case group

• Case methylation level "Case methylated" divided by "Case coverage mean"

• Control samples number of samples in the control group

• Control coverage sum of "Total coverage" values of the region in the control group

• Control coverage mean sum of "Context coverage" in the region divided by the number of covered Cs in context in the region in the control group

• Control methylated sum of "Methylated coverage" in the region in the control group

• Control methylation level "Control methylated" divided by "Control coverage mean"

• p-value probability of no difference in methylation levels between case and control in the region, given the data and the statistical test applied

For the highlighted window region 833001..834000, the relevant values used in the hypergeometric test are 6 (the number of methylated cytosines in the case sample) out of 7 (total number of cytosines), while the control sample had 11 covered context-conforming cytosines in the region, of which only 2 were methylated. If there is no case/control difference in methylation, the probability (p-value) of such hypermethylation in the case sample is calculated as 9.05 × 10⁻³, below the threshold.
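The worked example above can be checked with a short Python sketch of the one-sided hypergeometric test. This is an illustration using only the standard library, not the plugin's code; the function name is hypothetical. It sums the hypergeometric probabilities of observing at least the case methylated count, given the margins of the 2×2 table.

```python
from math import comb


def hypergeometric_upper_tail(case_meth, case_total, control_meth, control_total):
    """One-sided p-value for hypermethylation in case versus control.

    Probability of seeing `case_meth` or more methylated counts among
    `case_total` case counts, given the pooled numbers of methylated and
    total counts (one-sided Fisher exact test).
    """
    total = case_total + control_total
    total_meth = case_meth + control_meth
    denominator = comb(total, case_total)
    upper = min(case_total, total_meth)
    numerator = sum(
        comb(total_meth, k) * comb(total - total_meth, case_total - k)
        for k in range(case_meth, upper + 1)
    )
    return numerator / denominator


# Case: 6 methylated out of 7; control: 2 methylated out of 11.
p_value = hypergeometric_upper_tail(6, 7, 2, 11)
print(f"{p_value:.2e}")  # prints 9.05e-03
```

Swapping the case and control arguments gives the corresponding test for hypo-methylation in the original case sample.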

1.4 Create RRBS-fragment Track

Create RRBS-fragment Track can be used to generate fragments of pre-selected size based on the restriction digest of a reference genome with commonly used frequent cutters targeting methylation contexts, such as MspI. If such a track is provided to the Call Methylation Levels tool, its features are used to calculate statistical tests instead of consecutive windows of pre-set size. The input object is a reference genome of interest, in a track format.

Figure 1.21 shows the options for generating a track with predicted and selected features corresponding to the restriction digest of the selected genome.


Figure 1.21: Create RRBS-fragment track settings

Restriction enzymes The enzymes used in reduced-representation bisulfite sequencing can be selected here, with the most common one for CpG islands, MspI, pre-selected by default.

Parameters Minimum fragment length and Maximum fragment length define the acceptable size range: of all the fragments predicted after a full digest of the reference DNA, only those within this range will be included in the output of the tool.

The output of the tool is a feature track.
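The idea behind the fragment track can be sketched in Python as an in-silico digest followed by a size filter. This is a simplified illustration, not the plugin's implementation: the function name and 0-based half-open coordinates are assumptions, and the terminal pieces of the sequence (which are bounded by only one cut site) are treated like any other fragment. MspI recognises CCGG and cuts after the first C (C^CGG).

```python
def digest_fragments(sequence, min_length, max_length,
                     site="CCGG", cut_offset=1):
    """Predict restriction fragments and keep those in the size range.

    Returns a list of (start, end) tuples, 0-based and end-exclusive.
    Defaults model MspI (C^CGG); the site is palindromic, so scanning
    one strand is sufficient.
    """
    seq = sequence.upper()
    cuts = []
    hit = seq.find(site)
    while hit != -1:
        cuts.append(hit + cut_offset)
        hit = seq.find(site, hit + 1)

    # Fragment boundaries are the sequence ends plus every cut position.
    boundaries = [0] + cuts + [len(seq)]
    fragments = []
    for start, end in zip(boundaries, boundaries[1:]):
        if min_length <= end - start <= max_length:
            fragments.append((start, end))
    return fragments


toy_genome = "AACCGGTTTTCCGGAAACCGGA"
print(digest_fragments(toy_genome, min_length=5, max_length=10))
```

For this toy sequence the digest yields fragments of 3, 8, 7 and 4 bases, of which only the two middle ones fall within the 5 to 10 base range.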

Chapter 2

Install and uninstall plugins

Bisulfite Sequencing plugin is installed as a plugin.

Note: In order to install plugins and modules, the Workbench must be run in administrator mode. On Linux and Mac, it means you must be logged in as an administrator. On Windows, you can do this by right-clicking the program shortcut and choosing "Run as Administrator".

Plugins are installed and uninstalled using the plugin manager.

Help in the Menu Bar | Plugins... or Plugins in the Toolbar

The plugin manager has two tabs at the top:

• Manage Plugins. This is an overview of plugins that are installed.

• Download Plugins. This is an overview of available plugins on QIAGEN Aarhus server.

2.1 Install

To install a plugin, click the Download Plugins tab. This will display an overview of the plugins that are available for download and installation (see figure 2.1).

Figure 2.1: The plugins that are available for download.

Select Bisulfite Sequencing plugin to display additional information about the plugin on the right side of the dialog. Click Download and Install to add the plugin functionalities to your workbench.

Accepting the license agreement

Part of the installation involves checking and accepting the end user license agreement (EULA) as seen in figure 2.2.

Figure 2.2: Read the license agreement carefully.

Please read the EULA text carefully before clicking in the box next to the text I accept these terms to accept. If requested, fill in your personal information before clicking Finish.

If Bisulfite Sequencing plugin is not shown on the server but you have the installer file on your computer (for example if you have downloaded it from our website), you can install the plugin by clicking the Install from File button at the bottom of the dialog and specifying the plugin *.cpa file saved on your computer.

When you close the dialog, you will be asked whether you wish to restart the workbench. The plugin will not be ready for use until you have restarted.

2.2 Uninstall

Plugins are uninstalled using the plugin manager:

Help in the Menu Bar | Plugins... or Plugins in the Toolbar

This will open the dialog shown in figure 2.3.

The installed plugins are shown in the Manage Plugins tab of the plugin manager. To uninstall, select Bisulfite Sequencing plugin and click Uninstall.

If you do not wish to completely uninstall the plugin, but you do not want it to be used next time you start the Workbench, click the Disable button.


Figure 2.3: The plugin manager with plugins installed.

When you close the dialog, you will be asked whether you wish to restart the workbench. The plugin will not be uninstalled until the workbench is restarted.

Bibliography

[Kelly et al., 2012] Kelly, T. K., Liu, Y., Lay, F. D., Liang, G., Berman, B. P., and Jones, P. A. (2012). Genome-wide mapping of nucleosome positioning and DNA methylation within individual DNA molecules. Genome Res., 22(12):2497-2506.


