ddRADseqTools: Software package for
in silico simulation and testing of
double digest RADseq experiments
STUDENT: Fernando Mora Márquez
MASTER'S DEGREE IN BIOINFORMATICS AND COMPUTATIONAL BIOLOGY
ESCUELA NACIONAL DE SALUD - INSTITUTO DE SALUD CARLOS III
2014-2015
CENTER WHERE THE INTERNSHIP WAS CARRIED OUT: Grupo de investigación Genética,
Fisiología e Historia Forestal - E.T.S. Ingenieros de Montes - UPM
THESIS SUPERVISOR: Professor Dr. Unai López de Heredia Larrea
DATE: February 2016
I thank Professor Dr. Unai López de Heredia Larrea for his teaching
and for his dedication in supervising this thesis.
I also thank Dr. Brent Emerson of the Instituto de
Productos Naturales y Agrobiología (IPNA-CSIC) for his explanations
of ddRADseq, and Víctor García Olivares, a member of Dr. Emerson's
team, for the tests he carried out with ddRADseqTools.
To Henar, Teresa and María, for all your love.
Table of contents
Abstract
Introduction
Double Digest RAD Sequencing
Potential sources of error in ddRADseq experiments
PCR duplicates
Allele dropout
Technical replicates
Available ddRADseq simulation tools
Aims
Materials and Methods
Design of ddRADseqTools
Description of data files and programs
Ends file
Individuals file
Restriction sites file
Program rsitesearch.py
Program fragsgeneration.py
Program simddradseq.py
Program pcrdupremoval.py
Program indsdemultiplexing.py
Program readstrim.py
Program seqlocation.py
Methodology for ddRADseqTools validation
Benchmark reference genomes
Validation experiments
Results and Discussion
Analysis of fragments generation
ddRADseq simulations
Analysis of PCR duplicates
Analysis of the effect of the GC content
Analysis of the mutation patterns
Pipeline for the alignment of simulated reads
Performance of ddRADseqTools
Comparative between ddRADseqTools and other ddRADseq simulation tools
Limitations and Future Prospects
Conclusions
References
Abstract
Double digest RADseq (ddRADseq) is a next generation sequencing strategy that generates
reads from thousands of loci targeted by restriction enzyme cut sites, across multiple
individuals. To be statistically sound and economically affordable, a ddRADseq experiment needs
a preliminary design stage that considers the selection of the enzyme pair, the
particularities of the genome of the subject species, modifications of the
library construction, the coverage needed to avoid missing data, and the potential sources of
error that affect coverage.
In this Master's Thesis, we present ddRADseqTools, a software package that performs in silico
ddRADseq simulations to help design a ddRADseq experiment by testing hypotheses
related to inherent sources of bias. It covers the in silico generation of fragments, either at random
or from a reference genome; the construction of modified ddRADseq libraries using adapters
with either one or two indexes and Degenerate Base Regions (DBRs) for quantification of PCR
duplicates; and the initial steps of the bioinformatics pre-processing of reads (quantification and
removal of PCR duplicates, demultiplexing of individuals and trimming of adapters from raw
reads). ddRADseqTools generates single-end (SE) or paired-end (PE) reads that may show
three types of mutations: SNPs, indels and mutations at the enzyme's recognition motif (i.e.
allele dropout). The resulting output files can be submitted to alignment and
variant/genotype calling pipelines to allow fine tuning of parameters before in vitro data are
obtained from the laboratory.
We validated ddRADseqTools with specific tests that accounted for double digested fragment
selection, generation of SE and PE reads with varying degree of polymorphism, and
implementation, quantification and removal of PCR duplicates. To validate the processes, we
used three benchmark genomes from species with contrasting characteristics (Saccharomyces
cerevisiae, Homo sapiens and Pinus taeda).
ddRADseqTools is cost-efficient in terms of execution time, and can be run on computers
with a standard CPU and RAM configuration.
Aims: 1) To develop a software package to perform in silico ddRADseq simulations in order to
help in the design of a ddRADseq experiment by testing hypotheses related to inherent
sources of bias.
2) To validate the software package to verify its proper design and functionality for
diverse genomes and under several scenarios.
3) To evaluate the cost-efficiency of the software package in terms of CPU and RAM
usage.
4) To compare the software package with other ddRADseq simulation tools.
Keywords: allele dropout, coverage, ddRADseq, genotyping, in silico simulation, PCR
duplicates.
Introduction
Double Digest RAD Sequencing
A restriction endonuclease, or restriction enzyme, is an enzyme that recognizes specific base
sequences in DNA, breaks phosphodiester bonds, and cleaves the double helix at specific
sites. These sequences or motifs are known as restriction sites and contain four to eight
nucleotides. A characteristic of most restriction sites is that they have two-fold rotational
symmetry, i.e. they are palindromic sequences. Figure 1 shows the restriction sites of four
commonly used enzymes.
Figure 1. Restriction sites of EcoRI, PstI, SbfI and MseI. Circles represent the symmetry axes and arrows
indicate the cleavage sites.
For example, EcoRI recognizes the sequence GAATTC and cuts the DNA between the G and
the A. Figure 2 shows how EcoRI cleaves DNA and the resulting fragments.
Figure 2. Action of EcoRI on the Chromosome IX (368744-368769) of S. cerevisiae.
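The cleavage described above can be illustrated with a short Python sketch. It is not code from ddRADseqTools: the function name `digest` and the toy sequence are assumptions made for illustration, and only one strand is considered.

```python
# Illustrative sketch only; not code from ddRADseqTools.
ECORI_MOTIF = "GAATTC"   # EcoRI recognition sequence
ECORI_CUT_OFFSET = 1     # the cut falls between the G and the A


def digest(sequence, motif=ECORI_MOTIF, cut_offset=ECORI_CUT_OFFSET):
    """Cut one strand at every occurrence of the motif and return the pieces."""
    fragments = []
    start = 0
    position = sequence.find(motif)
    while position != -1:
        cut = position + cut_offset
        fragments.append(sequence[start:cut])
        start = cut
        position = sequence.find(motif, position + 1)
    fragments.append(sequence[start:])
    return fragments


# Hypothetical toy sequence (not the S. cerevisiae region of Figure 2).
toy = "TTACGGAATTCCATGGAATTCAA"
print(digest(toy))  # ['TTACGG', 'AATTCCATGG', 'AATTCAA']
```

Joining the returned pieces gives back the original sequence, so the sketch only marks where the strand is cut.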
The capacity of restriction endonucleases to cleave DNA is the basis of Restriction-site
Associated DNA (RAD) markers, which are genetic markers identified by the sequence that is
recognized by a restriction endonuclease. Initially, RAD markers were used in microarrays
(Miller et al. 2007), and then in high-throughput sequencing technologies (Illumina) (Baird et
al. 2008).
RADseq (Baird et al. 2008; Peterson et al. 2012) is an NGS methodology that uses only one
restriction endonuclease. The resulting fragments are then sheared randomly. Fragments are
obtained on both sides of the cut site, and therefore reads cover both sides. RADseq uses
an enzyme whose motif is rare, so the number of fragments is not very high and remains
tractable. RADseq technology has made it possible to genotype large numbers of
individuals (from a few to thousands), and hence to obtain polymorphisms (mainly
SNPs but also indels) across the full genome (Baird et al. 2008; Davey & Blaxter 2010; Etter et
al. 2011; Davey et al. 2011; Davey et al. 2013; Mastretta-Yanes et al. 2014); therefore, it has
extraordinary potential for genetic mapping and population genetics studies in
non-model species when a reference genome is not available.
Several modifications of the RADseq technology exist. The most popular is Double Digest
Restriction Associated DNA (ddRAD) Sequencing, or ddRADseq, which uses two restriction
endonucleases (Peterson et al. 2012). Usually, one of the enzymes has a rare motif while the
other enzyme has a common motif, in order to obtain a manageable number of fragments. The
enzyme combination of choice depends on the size and structure of the
genome of the subject organism. The fragments produced by ddRADseq are
flanked by the cut sites of both enzymes. Frequently, only the fragments of a specific size are
selected to be sequenced. For instance, fragments between 200 and 300 nucleotides are
purified in agarose gels after digestion, and then subjected to adapter ligation and PCR prior to Illumina
sequencing.
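The selection rule just described (fragments flanked by one cut of each enzyme and falling within a size window) can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation used by ddRADseqTools: the function name, the cut coordinates and the inclusive size bounds are all hypothetical.

```python
# Illustrative sketch only; names, coordinates and inclusive bounds are assumptions.
def select_fragments(cuts_enzyme1, cuts_enzyme2, min_size=200, max_size=300):
    """Return (start, end) pairs flanked by one cut of each enzyme and within the size range."""
    # Label every cut position with the enzyme that produced it and sort along the genome.
    cuts = sorted([(p, 1) for p in cuts_enzyme1] + [(p, 2) for p in cuts_enzyme2])
    selected = []
    # Walk over consecutive cuts; each consecutive pair delimits one fragment.
    for (start, enzyme_a), (end, enzyme_b) in zip(cuts, cuts[1:]):
        size = end - start
        if enzyme_a != enzyme_b and min_size <= size <= max_size:
            selected.append((start, end))
    return selected


# Hypothetical cut coordinates for a rare cutter and a common cutter.
rare_cuts = [1000, 5000]
common_cuts = [400, 1250, 1400, 4950, 5600]
print(select_fragments(rare_cuts, common_cuts))  # [(1000, 1250)]
```

In this toy example the fragment between 4950 and 5000 is flanked by both enzymes but is too short, and the fragment between 1250 and 1400 is flanked by the same enzyme on both sides, so neither is selected.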
A comparison of RADseq and ddRADseq technologies (Peterson et al. 2012) is shown in Figure
3.
Figure 3. RADseq vs ddRADseq methodologies (Peterson et al. 2012). A) RAD sequencing; the library is
built from the fragments (in red) created by digestion of the genome with a restriction endonuclease coupled
with random shearing; reads cover both sides of the cut site. B) ddRADseq; the fragments are generated
by digestion of the genome with two restriction endonucleases; selected fragments are those that are
flanked by the cut sites of both enzymes, and that have a suitable size to build the library (in red). In this
figure, the fragments a and b are not selected because a is too short and b is too long. Two individuals are
represented in both examples.
Using ddRADseq, there is no need to perform whole genome sequencing, and thousands of
markers from many individuals are obtained by studying only a portion of the genome. Therefore,
it is an affordable sequencing technique for studying non-model species. ddRADseq technology
allows the optimization of the number of genetic markers that can be obtained, and has
expanded the potential field of application of RADseq (Peterson et al. 2012). Thus, ddRADseq
can be used to perform pedigree or quantitative trait locus (QTL) mapping, population
structure assessment or phylogenetic studies (Figure 4).
Figure 4. Potential fields of application of RADseq vs ddRADseq methodologies (Peterson et al. 2012).
The fragments sequenced by ddRADseq consist of a genome insert between both restriction
sites, and two ends that include an adapter and a primer. A particular short sequence (index) is
attached to one or both ends to identify individuals, i.e. an index is a barcode that
distinguishes the fragments that belong to a particular individual from the fragments of other
individuals (see Figure 5). If a second index is attached to the other end, the potential number of
individuals to sample increases considerably.
Figure 5. ddRADseq fragment. The fragment has a genome insert (in blue) between the restriction site of
the first enzyme (rse 1 in yellow) and the other restriction site (rse 2 in yellow). The ends have an adapter
and a primer (in grey). An index in the end where Adapter 1 is, and optionally another one in the second
end, are used to identify individuals.
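The gain obtained from a second index is simple arithmetic: if every (index1, index2) pair identifies one individual, n1 first indexes and n2 second indexes can label up to n1 × n2 individuals, against n1 with a single index. A minimal sketch, with hypothetical index sequences:

```python
# Illustrative sketch only; the index sequences are hypothetical.
from itertools import product

index1_set = ["GGTCTT", "CTGGTT", "AAGATA"]   # candidate index1 sequences
index2_set = ["ATCACG", "CGATGT"]             # candidate index2 sequences

# With one index, each index1 sequence labels one individual.
individuals_one_index = len(index1_set)

# With two indexes, each (index1, index2) pair labels one individual.
index_pairs = list(product(index1_set, index2_set))
individuals_two_indexes = len(index_pairs)

print(individuals_one_index, individuals_two_indexes)  # 3 6
```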
Potential sources of error in ddRADseq experiments
Many sources of error have been identified in RADseq experiments. Mastretta-Yanes et al.
(2015) classified these sources of error as technical, human, wet-laboratory, inherent to the
high-throughput sequencing technique, and bioinformatic. In the present software
we have considered two of the main potential sources of error in ddRADseq
experiments that strongly reduce coverage, and have parameterized them
in the simulation of ddRADseq read files: PCR duplicates and allele dropout. Other
sources of error, such as the presence of paralogous sequences, occur implicitly when
the genomes of the subject species show high repetitive content.
In addition, a way to improve the accuracy of a ddRADseq experiment is to include technical
replicates. Therefore, we have also taken this point into consideration in the simulation of
ddRADseq read files.
PCR duplicates
PCR duplicates are identical copies of the same template fragment that arise during the
PCR stage of NGS experiments. Amplification increases the number of
molecules available for sequencing but changes the representation of the template molecules in the
amplified product and introduces random errors (Casbon et al. 2011). In the case of ddRADseq,
this type of error originates in the PCR amplification of the fragments plus adapters prior to
Illumina sequencing (Schweyen et al. 2014; Tin et al. 2015). The presence of PCR duplicates
implies a loss of coverage and can lead to genotype/variant calling errors. When possible,
PCR duplicates must be eliminated in the analysis of an experiment.
In order to detect PCR duplicates, a recently developed technique (Schweyen et al. 2014) uses
a degenerate base region (DBR) in one of the two ends (see Figure 6). The DBR is a short
sequence of a few nucleotides ligated during library construction that operates as a molecular
counter to estimate the number of template molecules in the PCR associated with each
variant. After ligation, each fragment in the library incorporates a particular sequence chosen
from all the possible DBR sequences. The counter can be used to determine whether a
putative variant is associated with a single template molecule or, alternatively, with multiple
template molecules, and hence the probability that it derives from a polymerase error or from a true
variant (Casbon et al. 2011). From a bioinformatics point of view, the duplicate reads can be
identified and removed, because they bear the same sequence, including the same DBR.
Figure 6. DBR addition in an end is used to identify PCR duplicates.
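The idea of using the DBR as a molecular counter can be sketched as follows. This is a simplified illustration, not the algorithm of pcrdupremoval.py: reads are assumed to be already split into a DBR and an insert sequence, sequencing errors are ignored, and the function name is hypothetical.

```python
# Illustrative sketch only; not the algorithm implemented in pcrdupremoval.py.
def remove_pcr_duplicates(reads):
    """Keep the first read of each (DBR, insert) pair and count the removed copies.

    reads: iterable of (dbr, insert_sequence) tuples.
    """
    seen = set()
    unique = []
    removed = 0
    for dbr, insert in reads:
        key = (dbr, insert)
        if key in seen:
            removed += 1          # same template molecule seen before: PCR duplicate
        else:
            seen.add(key)
            unique.append(key)
    return unique, removed


# Hypothetical reads: the second one repeats both the DBR and the insert of the first.
reads = [("ACGT", "TTGACC"), ("ACGT", "TTGACC"), ("GGCA", "TTGACC"), ("ACGT", "CCATGA")]
unique, removed = remove_pcr_duplicates(reads)
print(len(unique), removed)  # 3 1
```

The third read has the same insert as the first but a different DBR, so it is treated as coming from a different template molecule and is kept. A DBR of n unconstrained nucleotides allows 4^n distinct counters (256 for n = 4).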
Allele dropout
There is allele dropout in a locus when one or more alleles are not present in the reads of a
sequencing run. This problem can be caused by (Mastretta-Yanes et al. 2015): 1) a mutation in a
restriction enzyme recognition site; 2) wet laboratory errors, e.g. an exposure to UV light; 3)
bioinformatics issues, e.g. the removal of low-coverage reads in samples with a wide range of
coverage per locus. Allele dropout decreases the accuracy of genotyping because the affected
allele is not detected.
Technical replicates
Replicates are used to detect and identify sources of variation in measurements, and to limit the
effect of spurious variation on hypothesis testing and parameter estimation (Blainey et al.
2014). Replicates can be: 1) technical replicates, in which the biological material is
the same in each replicate; they can be used to calculate the variability of measurements
and to detect technical errors, and are recommended in ddRADseq experiments; 2) biological
replicates, in which the biological material comes from different samples; they
can be used to estimate the variability between individuals of a population. In ddRADseq
experiments, biological replicates correspond to individuals.
Available ddRADseq simulation tools
Few software tools for in silico simulation of ddRADseq are currently available. Some of
them are listed below.
SimRAD (Lepais 2014; https://cran.r-project.org/web/packages/SimRAD/):
SimRAD is an R package that provides functions to simulate restriction endonuclease digestion
and fragment selection for a ddRADseq experiment. A reference genome or randomly
generated DNA sequences can be the input to the digestion process. This utility does not
consider the generation of reads.
BU-RAD-seq (DaCosta & Sorenson 2014; https://github.com/BU-RAD-seq):
BU-RAD-seq has two utilities. One utility is Digital_RADs, a Python 3 program that performs the
digestion of a genome with one or two enzymes. It requires the motifs and the length of the
down/upstream sequence (one enzyme) or the lower and upper size of the fragment (two
enzymes). This utility does not consider individual identification, PCR duplicates or mutations.
ddRAD-seq-Pipeline, the other utility, is a set of Python 3 programs that processes double digest RAD
sequences in order to genotype the samples. Individual identification, PCR duplicates and
mutations are not considered.
simRRLs (http://dereneaton.com/software/simRRLs/):
simRRLs is a Python 2 program that can be used to randomly simulate RADseq-like sequence
data on a fixed species tree topology under a coalescent model. It is not possible to use a
reference genome. It supports various types of RADseq: SE RAD, SE ddRAD, PE ddRAD, PE
ddRAD w/ merged reads, etc. An index is used to identify individuals and the DBR sequence is
not considered. It accepts various arguments: length of simulated sequences, number of loci to
sample, individuals from each taxon, restriction sites, mutation rate, indel rate, existence of
allele dropout, etc. It has been used to test PyRAD, a ddRADseq pipeline (Eaton 2014).
Aims
The importance of a robust design of ddRADseq experiments to save time and money, and the
lack of tools for in silico testing of hypotheses related to inherent sources of bias (PCR
duplicates and allele dropout), led to the definition of the following specific aims for this
Master's Thesis:
Aim 1.
To develop a software package to perform in silico ddRADseq simulations in order to help in
the design of a ddRADseq experiment by testing hypotheses related to inherent sources of
bias.
Aim 2.
To validate the software package to verify its proper design and functionality for diverse
genomes and under several scenarios. The validations must ensure that the output data are
reliable, according to the corresponding design of the programs included in the software
package.
Aim 3.
To evaluate the cost-efficiency of the software package in terms of CPU and RAM usage.
Aim 4.
To compare the software package with other ddRADseq simulation tools.
Materials and Methods
Design of ddRADseqTools
ddRADseqTools is a set of programs, data files and configuration files for the design and in
silico testing of ddRADseq experiments (Aim 1). The programs that form part of
ddRADseqTools aim to cover the following scopes:
Scope A.
Simulation of fragments produced by a double digest with a given pair of restriction
endonucleases. The fragments can be obtained from a reference genome if available; or they
can be randomly generated. Each fragment corresponds to a locus.
Scope B.
Generation of mutations within fragments of each individual, including SNPs and indels, and
the possibility of allele dropout. The number and type of mutations across the simulated
reads are determined according to user-defined probabilities. The maximum number of
mutated positions in one fragment is defined by the user. The location of the mutations in the
fragment is assigned at random, but is conserved across individuals. At present, only the
Jukes-Cantor model of sequence evolution is implemented.
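Under the Jukes-Cantor model, every base is replaced by any of the other three bases with equal probability. The sketch below illustrates SNP generation at positions that are drawn once and then reused, in the spirit of the description above. It is not the code of simddradseq.py: the function name, the fixed seed and the handling of upper-case A, C, G and T only are assumptions, and indels are not covered.

```python
# Illustrative sketch only; not the mutation routine of simddradseq.py.
import random

BASES = "ACGT"


def mutate_snps(sequence, positions, rng):
    """Replace the base at each position by one of the other three bases, equally likely."""
    mutated = list(sequence)
    for position in positions:
        alternatives = [base for base in BASES if base != mutated[position]]
        mutated[position] = rng.choice(alternatives)
    return "".join(mutated)


rng = random.Random(42)            # fixed seed so the example is reproducible
fragment = "ACGTACGTACGT"          # hypothetical locus sequence
max_mutations = 2                  # user-defined maximum of mutated positions

# Positions are drawn once per locus, so they can be reused for every individual.
positions = rng.sample(range(len(fragment)), max_mutations)
allele = mutate_snps(fragment, positions, rng)
print(positions, allele)
```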
Scope C.
Simulation of single-end (SE) or paired-end (PE) reads. Reads are simulated according to the
following points:
- The number of reads per locus is calculated by dividing the total number of reads to
generate by the number of loci to sample.
- Individuals have two fragment sequences per locus, i.e. two alleles. They are assigned
randomly and can be mutated or non-mutated depending on a probability. Allele
dropout at a locus is determined at random, independently for each allele.
For the affected loci and individuals, no reads are generated.
- The GC ratio of each fragment is considered as a factor that controls the probability
of producing PCR duplicates. Digested fragments with a higher GC ratio have a higher
probability of producing PCR duplicates than those with a lower GC ratio.
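Two of the points above lend themselves to a small numerical sketch. The text does not give the exact rule that links the GC ratio to the PCR duplicate probability, so the linear rule below, its parameters (`base_probability`, `gc_factor`) and the use of integer division are purely hypothetical choices made to show the direction of the effect.

```python
# Illustrative sketch only; the GC weighting rule is a hypothetical assumption.
def reads_per_locus(total_reads, loci_number):
    """Total number of reads divided evenly among loci (integer division assumed)."""
    return total_reads // loci_number


def gc_ratio(sequence):
    """Fraction of G and C bases in a fragment."""
    sequence = sequence.upper()
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


def duplicate_probability(sequence, base_probability=0.2, gc_factor=0.5):
    """Hypothetical linear rule: the probability grows with the GC ratio, capped at 1."""
    return min(1.0, base_probability * (1 + gc_factor * gc_ratio(sequence)))


print(reads_per_locus(300000, 3000))                     # 100
print(gc_ratio("GGCCAATT"))                              # 0.5
print(duplicate_probability("GGCC") > duplicate_probability("AATT"))  # True
```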
Scope D.
Flexibility to configure raw read ends: user-defined adapters, Illumina or ad hoc PCR primers,
indexes at both ends of the read, and DBRs, according to the needs of the
experiment and the sequencing platform. As several modifications of the ddRADseq library
construction methodology exist (Peterson et al. 2012; Mastretta-Yanes et al. 2015; Schweyen
et al. 2014; Tin et al. 2015), this version of ddRADseqTools implements four of these
techniques:
- Only one index is used to identify the individuals: this is the original ddRADseq
methodology (Peterson et al. 2012). The sequence of the end corresponding to
Adapter 1 includes a single index1 sequence. The index2
sequence and the DBR sequence are not considered.
- One index is used to identify individuals, and DBRs are used to quantify PCR
duplicates: the sequence of the end corresponding to Adapter 1 includes an
index1 sequence and a DBR sequence (Schweyen et al. 2014; Tin et al. 2015).
- Two indexes are used to identify the individuals: the sequence of the end
corresponding to Adapter 1 includes an index1 sequence, and the sequence of the
end corresponding to Adapter 2 includes an index2 sequence (Peterson et al.
2012; Mastretta-Yanes et al. 2015). The DBR sequence is not considered.
- Two indexes are used to identify individuals and DBRs are used to quantify PCR
duplicates: the sequence of the end corresponding to Adapter 1 includes an
index1 sequence; the sequence of the end corresponding to Adapter 2 includes an
index2 sequence; and a DBR sequence is included at the end of either Adapter 1 or
Adapter 2 (Schweyen et al. 2014; Tin et al. 2015).
The indexes and the DBR can have any size and be located at any position of the adapters.
Scope E.
Quantification and removal of PCR duplicates. PCR duplicates can strongly
decrease coverage and may inflate the percentage of missing data. When using the DBR
strategy, PCR duplicates can be quantified.
Scope F.
Demultiplexing of reads by individual. Reads need to be separated by individual, in order to
build the individuals' genotypes, and to check for the presence of paralogous sequences (see
Mastretta-Yanes et al. 2015).
Scope G.
Trimming of reads. The adapters, primers, indexes and DBRs must be removed from raw reads
so that the trimmed reads can be used for alignment and variant calling.
ddRADseqTools was programmed in Python 3 (version 3.4 or higher is required), and runs on
any computer with an OS that supports Python 3: Linux/Unix, Microsoft Windows and Mac OS X,
among others. The only dependencies required to run this software package are the NumPy
and matplotlib libraries. The software package, version 0.36, is attached to this document
(ddRADseqTools-0.36.zip). Within the ddRADseqTools-0.36.zip file, there is a manual that describes
how to install ddRADseqTools and how to operate each program. Appendix A (see file
supplementary.pdf) contains the complete list of the files of ddRADseqTools.
A flow chart of the programs contained in ddRADseqTools is shown in Figure 7. The work-flow
has the usual three steps of an NGS experiment:
1) Library construction/in silico fragments generation: A file of fragments is generated
from a reference genome by rsitesearch.py; or fragment sequences are simulated
randomly with fragsgeneration.py. If the genome-guided version of the
software is used (rsitesearch.py), a particular pair of restriction endonucleases has to
be specified and their action on the genome is simulated (Scope A).
2) High-Throughput Sequencing / Generation of reads: Raw reads are generated by
simddradseq.py. This program handles a wide range of parameters, such as
single-end (SE) or paired-end (PE) read types, configuration of the ends of the raw
reads, number of reads, size of the fragments, mutation probability, or PCR duplicates
probability (Scopes B, C and D). The software was designed with Illumina's SE and PE
read files in mind, but it can be used to simulate read files from other sequencing
platforms (Roche 454, PacBio, Helicos, etc.).
3) Bioinformatics pre-processing of reads: This step can be split into three sub-steps:
3.1) Quantification and removal of PCR duplicates: The PCR duplicates of the
raw reads are quantified and some statistics are computed with
pcrdupremoval.py. Also, this program generates read files without the
duplicated reads (Scope E).
3.2) Demultiplexing of individuals: Joint raw reads are demultiplexed by
indsdemultiplexing.py to obtain separate individual read files (Scope F).
3.3) Trimming raw reads: The program readstrim.py removes the adapters and
other sequences from raw reads so that the alignment of
reads and the variant calling step can be performed correctly (Scope G).
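Sub-steps 3.2 and 3.3 can be illustrated with a minimal sketch. It does not reproduce indsdemultiplexing.py or readstrim.py: it assumes that index1 sits at the very start of each read and that all indexes have the same length, whereas ddRADseqTools allows indexes at any position of the adapters. Function names and sequences are hypothetical.

```python
# Illustrative sketch only; assumes index1 is the first bases of every read.
from collections import defaultdict


def demultiplex(reads, index_to_individual):
    """Group reads by the index found at the start of each read."""
    index_length = len(next(iter(index_to_individual)))
    by_individual = defaultdict(list)
    for read in reads:
        individual = index_to_individual.get(read[:index_length], "unassigned")
        by_individual[individual].append(read)
    return dict(by_individual)


def trim(read, left, right=0):
    """Drop `left` bases from the 5' end and `right` bases from the 3' end."""
    return read[left:len(read) - right]


# Hypothetical indexes and raw reads.
index_to_individual = {"AACC": "ind0101", "GGTT": "ind0102"}
reads = ["AACCTGCAGTTT", "GGTTTGCAGAAA", "AACCTGCAGCCC", "TTTTTGCAGCCC"]

groups = demultiplex(reads, index_to_individual)
print({individual: len(group) for individual, group in groups.items()})
print(trim("AACCTGCAGTTT", left=4))  # TGCAGTTT
```

Reads whose leading bases match no known index are collected under "unassigned" rather than discarded, which makes it easy to count them.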
Table 1. Parallelism between in vitro and in silico experiments and ddRADseqTools programs.
In vitro experiments                   | In silico experiments                        | ddRADseqTools program
Library construction                   | In silico fragments generation               | rsitesearch.py (w/genome); fragsgeneration.py (random)
High-Throughput Sequencing             | Generation of reads                          | simddradseq.py
Bioinformatics pre-processing of reads | Quantification and removal of PCR duplicates | pcrdupremoval.py
                                       | Demultiplexing of individuals                | indsdemultiplexing.py
                                       | Trimming of raw reads                        | readstrim.py
The output files of this work-flow are ready to be submitted to alignment utilities, such as BWA
(Li & Durbin 2009), or to ddRADseq analysis pipelines, such as Rainbow (Chong et al. 2012),
STACKS (Catchen et al. 2013), PyRAD (Eaton 2014) or AftrRAD (Sovic et al. 2015). Doing so
allows in silico tuning of the parameters used by the pipeline before the in vitro data are
obtained.
Three data files are required: 1) an ends file that is used to design the sequence ends, with the
corresponding adapters, primers, indexes and DBRs; 2) a file of individuals that contains the
sequences that identify the individuals; and 3) a file of restriction sites that holds the
recognition motifs of the restriction sites and indicates their cut sites.
The default parameters to run each program are stored in specific configuration files. These
options can be modified by simply editing the configuration file, or on the command line.
Figure 7. Flow-chart of ddRADseqTools.
Description of data files and programs
A detailed description of the data files and programs included in ddRADseqTools is shown
below.
Ends file
The file ends.txt contains the end sequences of the raw reads (5'->3'), as defined by the user.
The end sequences are made up of adapters, primers, indexes and DBRs. A read has two
ends: one for Adapter 1, and a second one for Adapter 2.
The record format of the ends file has two fields (Figure 8): 1) end identification; and 2) end
sequence. Both fields must be separated by a semicolon.
Figure 8. Example of the content of an ends file. 1 represents a nucleotide of index1; 2 represents a nucleotide
of index2; 3 represents a nucleotide of the DBR.
Figure 9 shows an example of how a simulated NGS fragment can be assembled using the data
of the ends file.
Figure 9. Example of fragment assembly. The sequence is formed by end 1 (ID end21 in
Figure 8), which contains a first index and a DBR; end 2 (ID end22 in
Figure 8), which bears a second index; the cuts performed by EcoRI and MseI; and the genome insert, which is
represented by question marks.
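The placeholder convention of the ends file (1 for index1, 2 for index2, 3 for the DBR) can be illustrated with a short sketch that fills an end template. It is not code from ddRADseqTools: the template, the index sequence and the function name are hypothetical, and the DBR is drawn uniformly from A, C, G and T, which is an assumption.

```python
# Illustrative sketch only; template and index values are hypothetical.
import random


def fill_end(template, index1="", index2="", rng=None):
    """Replace placeholder digits: 1 -> index1 bases, 2 -> index2 bases, 3 -> random DBR bases."""
    rng = rng or random.Random()
    index1_bases = iter(index1)
    index2_bases = iter(index2)
    filled = []
    for symbol in template:
        if symbol == "1":
            filled.append(next(index1_bases))   # the index must be as long as its placeholder run
        elif symbol == "2":
            filled.append(next(index2_bases))
        elif symbol == "3":
            filled.append(rng.choice("ACGT"))   # assumption: unconstrained DBR base
        else:
            filled.append(symbol)               # adapter or primer base, kept as is
    return "".join(filled)


# Hypothetical end template with a 6-base index1 and a 4-base DBR.
template = "ACACTC111111TT3333GA"
end1 = fill_end(template, index1="GGTCTT", rng=random.Random(1))
print(end1)
```

Each call draws a new DBR, so two fragments of the same individual share the index but usually differ in the DBR.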
Individuals file
The file individuals.txt is the individuals file. It contains the sequences that identify each
individual in the experiment. Either one or two indexes can be used to identify the individual.
This file can be easily written with a text editor or spreadsheet. The record format of the
individuals file has five fields (Figure 10): 1) individual ID; 2) replicated individual ID (if technical
replicates are included in the experiment) or NONE (if no technical replicate is considered); 3)
population ID (this is not operative in the present version of the program); 4) sequence of
index1 corresponding to Adapter 1; and 5) sequence of index2 corresponding to Adapter 2
(optional). The fields must be separated by a semicolon.
Figure 10. Example of an individuals file. Only some individuals are displayed. The individual identified by
ind0206 is a technical replicate of ind0201.
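The five-field record described above can be read with a few lines of Python. The sketch below is only an illustration under stated assumptions: the names Individual and parse_individuals are hypothetical, the index sequences and population IDs in the usage example are invented, and skipping lines that start with # is an assumption rather than documented behaviour.

```python
from collections import namedtuple

Individual = namedtuple(
    "Individual", "ind_id replicated_id population_id index1 index2")


def parse_individuals(lines):
    """Parse 'id;replicated_id;population_id;index1;index2' records."""
    individuals = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):  # assumption: '#' marks a comment
            continue
        fields = [field.strip() for field in line.split(";")]
        fields += [""] * (5 - len(fields))    # index2 is optional
        ind_id, replicated, population, index1, index2 = fields[:5]
        # NONE means that the individual is not a technical replicate.
        replicated_id = None if replicated.upper() == "NONE" else replicated
        individuals.append(Individual(
            ind_id, replicated_id, population, index1.upper(), index2.upper()))
    return individuals


if __name__ == "__main__":
    records = ["ind0201;NONE;pop01;GGTCTT;ATCACG",
               "ind0206;ind0201;pop01;CTGGTT;ATCACG"]
    for individual in parse_individuals(records):
        print(individual)
```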
Restriction sites file
The file restrictionsites.txt contains the restriction sites recognition motifs and identifies their
cut sites.
The record format of the restriction sites file has two fields (Figure 11): 1) the ID of the restriction
endonuclease; and 2) the sequence of the restriction site. Both fields must be separated by a
semicolon.
The file restrictionsites.txt included in the present version of the software contains more than
60 widely used restriction endonucleases. However, the researcher can add new enzymes
at the end of the file by simply editing it with a text editor.
Figure 11. Example of a restriction sites file. A cut site is represented by an asterisk in the sequence.
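A minimal Python sketch of how such records could be interpreted is given below. The function names are hypothetical and the exact record layout of restrictionsites.txt is assumed from the description above (enzyme ID, semicolon, recognition sequence with one asterisk at the cut site). The function resolve_enzyme mirrors the documented behaviour of the options enzyme1 and enzyme2, which accept either the enzyme name or the restriction site sequence.

```python
def parse_restriction_sites(lines):
    """Return {enzyme_id: (motif, cut_offset)} from 'id;sequence' records."""
    sites = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):  # assumption: '#' marks a comment
            continue
        enzyme_id, marked = (field.strip() for field in line.split(";")[:2])
        if marked.count("*") != 1:
            raise ValueError("exactly one cut site expected: " + marked)
        cut_offset = marked.index("*")        # bases located before the cut
        sites[enzyme_id] = (marked.replace("*", "").upper(), cut_offset)
    return sites


def resolve_enzyme(sites, name_or_motif):
    """Accept an enzyme name (EcoRI) or its recognition sequence (GAATTC)."""
    if name_or_motif in sites:
        return sites[name_or_motif]
    for motif, cut_offset in sites.values():
        if motif == name_or_motif.upper():
            return motif, cut_offset
    raise KeyError(name_or_motif)


if __name__ == "__main__":
    sites = parse_restriction_sites(
        ["EcoRI;G*AATTC", "MseI;T*TAA", "SbfI;CCTGCA*GG", "PstI;CTGCA*G"])
    print(resolve_enzyme(sites, "EcoRI"), resolve_enzyme(sites, "TTAA"))
```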
Program rsitesearch.py
This program locates the restriction site motifs and performs an in silico double digestion of a
genome. It first simulates the digest of the reference genome sequence as it appears in the
genome file (Watson strand); then the reverse complementary sequence (Crick strand) is obtained,
and a second digest simulation is performed on the latter sequence.
The output is a FASTA file with the resulting fragments (Figure 13). The header of each FASTA
record will show the following information: fragment number, length of the fragment, GC rate,
strand, start position in the locus, end position in the locus, and description of chromosome or
scaffold.
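The two-pass digestion described above can be sketched in Python as follows. This is a simplified illustration, not the implementation of rsitesearch.py. In particular, it assumes that a fragment is retained when it starts at an enzyme1 cut and ends at the next enzyme2 cut with no other cut in between, which the text does not state explicitly. All function names are hypothetical.

```python
import re

COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the Crick strand (5'->3') of a Watson strand sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def cut_positions(seq, motif, cut_offset):
    """Cut coordinates of one enzyme; the lookahead finds overlapping sites."""
    pattern = "(?=%s)" % re.escape(motif)
    return [m.start() + cut_offset for m in re.finditer(pattern, seq)]


def double_digest(seq, enzyme1, enzyme2, min_size, max_size):
    """Fragments running from an enzyme1 cut to the next enzyme2 cut.

    enzyme1 and enzyme2 are (motif, cut_offset) tuples.
    """
    cuts = sorted([(p, 1) for p in cut_positions(seq, *enzyme1)] +
                  [(p, 2) for p in cut_positions(seq, *enzyme2)])
    fragments = []
    for (start, first), (end, second) in zip(cuts, cuts[1:]):
        if first == 1 and second == 2 and min_size <= end - start <= max_size:
            fragments.append((start, end, seq[start:end]))
    return fragments


def gc_rate(fragment):
    """Fraction of G and C nucleotides in a fragment."""
    return sum(fragment.upper().count(base) for base in "GC") / len(fragment)


if __name__ == "__main__":
    ecori, msei = ("GAATTC", 1), ("TTAA", 1)
    watson = "AAGAATTCCCCCGGGGTTAAGG"   # toy sequence, not a real genome
    for strand in (watson, reverse_complement(watson)):
        print(double_digest(strand, ecori, msei, 10, 300))
```

Running the same function on the reverse complement recovers the fragments whose enzyme1 end lies on the opposite strand, which is one plausible reason for the second digest pass.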
The program also provides statistics on the fragments generated, summarized by fragment size
intervals, and a graph showing the distribution of fragments by size interval
(see chapter Results and Discussion). For each interval, the number of fragments, the
percentage of fragments relative to the total, and the number of fragments that contain
undetermined nucleotides are calculated (see Figure 12 for an example of such statistics).
The statistics output is also generated in CSV format, which can be easily imported into other
general purpose programs, such as R, LibreOffice Calc or Microsoft Excel.
Figure 12. Example of statistics generated by the program rsitesearch.py. The table shows the output results
for the first sixteen size intervals of the fragments produced by a double digest of P. taeda with SbfI and MseI.
The data shown for each interval are the number of fragments, the percentage of fragments relative to the
total, and the number of fragments that contain undetermined nucleotides.
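The interval statistics described above can be reproduced with a short Python sketch. The function name interval_statistics and the choice of interval boundaries (1-25, 26-50, ..., 201-225 for an interval length of 25) are assumptions made for the illustration; the boundaries used by rsitesearch.py may differ.

```python
def interval_statistics(fragments, interval=25):
    """Summarise fragments by size interval.

    Returns a list of tuples (lower size, upper size, number of fragments,
    percentage over the total, fragments with undetermined nucleotides).
    """
    stats = {}
    for fragment in fragments:
        # Lower bound of the interval that contains this fragment size.
        low = (len(fragment) - 1) // interval * interval + 1
        count, with_n = stats.get(low, (0, 0))
        stats[low] = (count + 1, with_n + int("N" in fragment.upper()))
    total = len(fragments)
    return [(low, low + interval - 1, count, 100.0 * count / total, with_n)
            for low, (count, with_n) in sorted(stats.items())]


if __name__ == "__main__":
    toy = ["A" * 201, "A" * 225, "ACGN" + "A" * 222, "A" * 10]
    for row in interval_statistics(toy):
        print(";".join(str(value) for value in row))
```

Printing the rows with a semicolon separator gives a CSV-like output that can be opened directly in a spreadsheet or in R.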
The input and output files of rsitesearch.py are shown in Figure 7, as well as the position of this
program within the processes flow of ddRADseqTools.
The options of the program are detailed in Table 2.
Table 2. rsitesearch.py options.
Option | Default value | Comment
genfile | ./genome.fna | Path of the reference genome file in FASTA format or .gz format (compressed).
fragsfile | ./fragments.fasta | Path of the output fragments file.
rsfile | ./restrictionsites.txt | Path of the input restriction sites file.
enzyme1 | EcoRI | Name of the first restriction enzyme used in rsfile. Instead of the name, the restriction site sequence is allowed, e.g. EcoRI and GAATTC are equivalent.
enzyme2 | MseI | Name of the second restriction enzyme used in rsfile. Instead of the name, the restriction site sequence is allowed, e.g. MseI and TTAA are equivalent.
minfragsize (*) | 201 | Lower fragment loci size.
maxfragsize (*) | 300 | Upper fragment loci size.
fragstfile | ./genome-statistics.txt | Output statistics file.
fragstinterval | 25 | Interval length of fragment size for the output statistics.
(*) During library construction in ddRADseq experiments it is very common to select only the fragments
within a particular size range (usually 100-400 bp). This size range can be set here.
Program fragsgeneration.py
This program generates random fragments simulating a double digestion of a genome, and
writes them to a FASTA file. It is useful when rsitesearch.py cannot be used because there is no
reference genome.
The output of the program has the same format as that provided by rsitesearch.py.
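One simple way to generate such random fragments is sketched below in Python. It is an assumption-laden illustration and not the algorithm of fragsgeneration.py: the function name is hypothetical, fragment lengths are drawn uniformly between minfragsize and maxfragsize, and each fragment is made to start and end with the remains of the enzyme1 and enzyme2 restriction sites after the cut.

```python
import random


def random_fragments(fragsnum, min_size, max_size, enzyme1, enzyme2, rng=None):
    """Generate random fragments flanked by the remains of two cut sites.

    enzyme1 and enzyme2 are (motif, cut_offset) tuples.
    """
    rng = rng or random.Random()
    motif1, cut1 = enzyme1
    motif2, cut2 = enzyme2
    head = motif1[cut1:]   # part of the enzyme1 site that stays after the cut
    tail = motif2[:cut2]   # part of the enzyme2 site that stays before the cut
    fragments = []
    for _ in range(fragsnum):
        size = rng.randint(min_size, max_size)
        # The random body may contain restriction sites by chance; a stricter
        # simulation would reject such fragments.
        body_len = size - len(head) - len(tail)
        body = "".join(rng.choice("ACGT") for _ in range(body_len))
        fragments.append(head + body + tail)
    return fragments


if __name__ == "__main__":
    rng = random.Random(2016)
    for fragment in random_fragments(3, 201, 300, ("GAATTC", 1), ("TTAA", 1), rng):
        print(len(fragment), fragment[:20] + "...")
```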
The input and output files of fragsgeneration.py are shown in Figure 7 as well as the position
of this program within the processes flow of ddRADseqTools.
The options of the program are detailed in Table 3.
Table 3. fragsgeneration.py options.
Option | Default value | Comment
fragsfile | ./fragments.fasta | Path of the output fragments file.
rsfile | ./restrictionsites.txt | Path of the input restriction sites file.
enzyme1 | EcoRI | Name of the first restriction enzyme used in rsfile. Instead of the name, the restriction site sequence is allowed, e.g. EcoRI and GAATTC are equivalent.
enzyme2 | MseI | Name of the second restriction enzyme used in rsfile. Instead of the name, the restriction site sequence is allowed, e.g. MseI and TTAA are equivalent.
fragsnum | 10000 | Number of fragments to generate.
minfragsize (*) | 201 | Lower fragment loci size.
maxfragsize (*) | 300 | Upper fragment loci size.
fragstfile | ./genome-statistics.txt | Output statistics file.
fragstinterval | 25 | Interval length of fragment size for the output statistics.
(*) During library construction in ddRADseq experiments it is very common to select only the fragments
within a particular size range (usually 100-400 bp). This size range can be set here.
Program simddradseq.py
This program builds simulated SE or PE read files, in FASTQ or FASTA format, from a virtual
library of a ddRADseq experiment.
The input fragments can be obtained in two ways:
1) Using a reference genome via rsitesearch.py.
2) Randomly via fragsgeneration.py.
Read mutations are generated probabilistically according to the following steps:
1) Each fragment resulting from a double digest simulation is considered to be a locus.
Fragments are picked randomly.
2) The presence of mutations in a locus is assessed according to a probability. If
mutations exist, a mutated sequence of the fragment is generated. The number of
mutations within a fragment is randomly chosen from the interval between 1 and a
maximum number of mutations defined by the user. The type of polymorphism (SNP
or indel) is also determined from a user defined probability of a mutation being an
indel. The indels have a user defined upper boundary size.
3) Next, a database of individuals is created. One fragment sequence, mutated or
not, is assigned randomly to each chromosome of the individual, i.e. the individuals are
assumed to be diploid. Also, a mark indicates whether the individual shows allele dropout in
the locus corresponding to the fragment, according to a user defined probability. The
technical replicates have the same fragment sequences as the
individuals that they are replicating.
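The steps above can be sketched in Python as follows. This is a minimal illustration, not the code of simddradseq.py. The function names are hypothetical, insertions and deletions are assumed to be equally likely, and technical replicates are assumed to inherit the allele dropout mark together with the fragment sequences. The parameter names follow the options mutprob, locusmaxmut, indelprob, maxindelsize and dropout.

```python
import random


def mutate_fragment(seq, mutprob, locusmaxmut, indelprob, maxindelsize, rng):
    """Return (sequence, trace); trace lists (type, position, size) tuples."""
    if rng.random() >= mutprob:
        return seq, []                          # locus without mutations
    bases, trace = list(seq), []
    for _ in range(rng.randint(1, locusmaxmut)):
        pos = rng.randrange(len(bases))
        if rng.random() < indelprob:
            size = rng.randint(1, maxindelsize)
            if rng.random() < 0.5:              # assumption: 50 % insertions
                bases[pos:pos] = [rng.choice("ACGT") for _ in range(size)]
                trace.append(("insertion", pos, size))
            else:
                size = min(size, len(bases) - 1)    # keep at least one base
                pos = min(pos, len(bases) - size)
                del bases[pos:pos + size]
                trace.append(("deletion", pos, size))
        else:
            bases[pos] = rng.choice([b for b in "ACGT" if b != bases[pos]])
            trace.append(("SNP", pos, 1))
    return "".join(bases), trace


def assign_alleles(individuals, reference, mutated, dropout, rng):
    """individuals: list of (ind_id, replicated_id or None) tuples.

    Returns {ind_id: ((allele1, allele2), has_allele_dropout)}.
    """
    genotypes = {}
    for ind_id, replicated_id in individuals:
        if replicated_id is None:               # diploid: two random alleles
            alleles = (rng.choice((reference, mutated)),
                       rng.choice((reference, mutated)))
            genotypes[ind_id] = (alleles, rng.random() < dropout)
    for ind_id, replicated_id in individuals:
        if replicated_id is not None:           # replicate copies its source
            genotypes[ind_id] = genotypes[replicated_id]
    return genotypes


if __name__ == "__main__":
    rng = random.Random(42)
    locus = "ACGT" * 10
    mutated, trace = mutate_fragment(locus, 0.2, 1, 0.4, 3, rng)
    print(trace)
    print(assign_alleles([("ind0201", None), ("ind0206", "ind0201")],
                         locus, mutated, 0.0, rng))
```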
This program generates SE or PE reads of user defined length in FASTQ or FASTA format. Figure
14 shows an example of PE reads in FASTQ format. The header of each FASTA/FASTQ record
shows the following information: read number, fragment number, read number in the fragment,
trace of a mutation in the read sequence, individual ID, index1 sequence, and index2 sequence.
The theoretical number of reads per locus is calculated by dividing the number of reads to
generate by the number of loci to sample. The actual number of reads of each locus is drawn
randomly from a range that contains the theoretical number.
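A minimal Python sketch of this calculation is shown below. It assumes that the range is obtained by multiplying the theoretical number by the options minreadvar and maxreadvar and that the draw is uniform; the function name is hypothetical and the actual procedure in simddradseq.py may differ.

```python
import random


def reads_per_locus(readsnum, locinum, minreadvar, maxreadvar, rng):
    """Draw the number of reads of each locus around the theoretical value."""
    theoretical = readsnum / locinum
    low = int(round(minreadvar * theoretical))
    high = int(round(maxreadvar * theoretical))
    return [rng.randint(low, high) for _ in range(locinum)]


if __name__ == "__main__":
    counts = reads_per_locus(10000, 100, 0.8, 1.2, random.Random(7))
    print(min(counts), max(counts), sum(counts))
```

With the default values (readsnum 10000, locinum 100, minreadvar 0.8 and maxreadvar 1.2) each locus receives between 80 and 120 reads. Because every locus is drawn independently, the total is close to, but not necessarily equal to, readsnum.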
The reads of a locus are generated in a loop in which the individual and its corresponding
fragment sequence are chosen randomly for each read. Individuals that have the
allele dropout mark in this fragment are not considered.
The generation of reads in a locus also takes into account whether the locus presents PCR duplicates,
according to a user defined probability, which is weighted by the GC factor of the fragment.
When a random draw for the fragment falls below this weighted
probability, the number of duplicates per read is sampled from a distribution whose
probability decays monotonically from one replicate up to an upper boundary for the number
of PCR duplicates. That is, few PCR duplicates are more likely than many.
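The sampling of the number of duplicates can be illustrated with the Python sketch below. The source only states that the probability decays monotonically; the weights 1/k used here are an assumption chosen for simplicity, as are the function name and the parameter max_dups. The weighted probability is taken as an input because the exact GC weighting formula is not given in the text.

```python
import random


def duplicates_for_read(weighted_prob, max_dups, rng):
    """Number of PCR duplicates of a read (0 when the read has none).

    weighted_prob: probability of PCR duplicates already weighted by GC.
    max_dups: upper boundary for the number of PCR duplicates.
    """
    if rng.random() >= weighted_prob:
        return 0
    counts = list(range(1, max_dups + 1))
    weights = [1.0 / k for k in counts]   # assumption: decay proportional to 1/k
    return rng.choices(counts, weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(3)
    print([duplicates_for_read(0.4, 5, rng) for _ in range(20)])
```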
The coverage is controlled by setting the number of loci, the number of individuals (controlled
by the individuals file), and the number of reads of the library. Coverage may be unequal
among loci and individuals, ranging between two user defined values (options minreadvar and
maxreadvar). If uniform coverage is desired, both options should be set to 1.
Four techniques of building libraries are implemented:
1) IND1: An index sequence is inserted in the end where Adapter 1 is (Peterson et al.
2012).
2) IND1_DBR: An index sequence and a DBR are inserted in the end where Adapter 1 is
(Schweyen et al. 2014; Tin et al. 2015).
3) IND1_IND2: In addition to the index sequence in Adapter 1, another index sequence is
inserted in the end where Adapter 2 is (Peterson et al. 2012; Mastretta-Yanes et al.
2015).
4) IND1_IND2_DBR: This technique uses DBRs in addition to the index sequences. The
DBR sequence is generated randomly (Schweyen et al. 2014; Tin et al. 2015).
Figure 7 shows the input and output files of simddradseq.py, as well as the position of this
program within the processes flow of ddRADseqTools.
The options of the program are detailed in Table 4.
Table 4. simddradseq.py options.
Option | Default value | Comment
fragsfile | ./fragments.fasta | Path of the input fragments file.
technique | IND1_IND2_DBR | Four methodologies are available: IND1 (an index sequence in Adapter 1), IND1_DBR (an index sequence + a DBR in Adapter 1), IND1_IND2 (an index sequence in Adapter 1 + an index sequence in Adapter 2) and IND1_IND2_DBR (an index sequence in Adapter 1 + an index sequence in Adapter 2 + a DBR).
format | FASTQ | Format of the output file: FASTQ or FASTA.
readsfile | ./reads | Path of the output read file (without extension).
readtype | PE | Read type: SE or PE.
rsfile | ./restrictionsites.txt | Path of the input restriction sites file.
enzyme1 | EcoRI | Name of the first restriction enzyme used in rsfile. Instead of the name, the restriction site sequence is allowed, e.g. EcoRI and GAATTC are equivalent.
enzyme2 | MseI | Name of the second restriction enzyme used in rsfile. Instead of the name, the restriction site sequence is allowed, e.g. MseI and TTAA are equivalent.
endsfile | ./ends.txt | Path of the input end sequences file.
index1len | 6 | Index sequence length in Adapter 1.
index2len | 6 | Index sequence length in Adapter 2 (it must be 0 when the technique is IND1 or IND1_DBR).
dbrlen | 4 | DBR length (it must be 0 when the technique is IND1 or IND1_IND2).
wend | end01 | Code used in endsfile corresponding to the end where Adapter 1 is.
cend | end02 | Code used in endsfile corresponding to the end where Adapter 2 is.
individualsfile | ./individuals.txt | Path of the input individuals file.
locinum | 100 | Number of loci to sample.
readsnum | 10000 | Number of reads to generate.
minreadvar | 0.8 | Lower parameter value of the interval to control variation of the number of reads per locus (0.5 <= minreadvar <= 1.0).
maxreadvar | 1.2 | Upper parameter value of the interval to control variation of the number of reads per locus (1.0 <= maxreadvar <= 1.5).
insertlen | 180 | Insert length, i.e. genome sequence length inserted in the reads.
mutprob | 0.2 | Mutation probability (0.0 <= mutprob < 1.0).
locusmaxmut | 1 | Maximum mutation number by locus (1 <= locusmaxmut <= 5).
indelprob | 0.4 | Indel probability (0.0 <= indelprob < 1.0). This is the probability of a mutation being an indel (otherwise, it will be a substitution).
maxindelsize | 3 | Maximum size of the generated indels (1 <= maxindelsize < 20).
dropout | 0 | Probability of mutation at the enzyme recognition sites (0.0 <= dropout < 1.0).
pcrdupprob | 0 | Probability of loci bearing PCR duplicates (0.0 <= pcrdupprob < 1.0).
gcfactor | 0 | Weight factor of GC ratio in a locus with PCR duplicates (0.0 <= gcfactor < 1.0).
Figure 13. Example of fragments file. A portion of the fragments originated by simulating a double digest of S. cerevisiae by EcoRI and MseI.
Figure 14. Example of raw reads files. A portion of FASTQ PE read files where two reads corresponding to fragment 1066 shown in Figure 13 are displayed. The reads of the
reads file with end 1 (it corresponds to ID end21 displayed in Figure 8) are on top and the reads with end 2 (it corresponds to ID end22 displayed in Figure 8) are at the bottom. One record
has a mutation, while the other one is a non-mutated sequence.
Program pcrdupremoval.py
This program quantifies and removes the PCR duplicates detected in ddRADseq experiments
that use DBRs embedded in the adapters, in addition to the index sequences.
The input read file(s) have been generated by simddradseq.py.
The determination of PCR duplicates is performed in the following steps:
1) Reads are sorted by sequence in the SE read file, or by both sequences in the PE read
files.
2) As reads are raw, (i.e. they can include adapters, primers, indexes and DBRs in addition
to the genome insert), reads with equal sequence(s) imply PCR duplicates, and only
one of them is saved in the output file.
As reads have been generated in silico, mismatches are not considered.
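The two steps above can be sketched in Python as follows. The sketch is not the code of pcrdupremoval.py: reads are represented as simple (header, sequence) tuples, the function name is hypothetical, and keeping the first record of each group of identical sequences in the original input order is a design choice made for the example.

```python
def remove_pcr_duplicates(reads1, reads2=None):
    """Keep one record per distinct sequence (SE) or sequence pair (PE).

    reads1, reads2: lists of (header, sequence) tuples in the same order.
    Returns (cleared reads 1, cleared reads 2 or None, removed reads).
    """
    if reads2 is None:
        keys = [seq for _, seq in reads1]
    else:
        keys = [(seq1, seq2) for (_, seq1), (_, seq2) in zip(reads1, reads2)]
    # Step 1: sort the read positions by sequence (or by both sequences).
    order = sorted(range(len(keys)), key=lambda i: keys[i])
    # Step 2: equal neighbours are PCR duplicates; keep only the first one.
    kept, previous = [], None
    for i in order:
        if keys[i] != previous:
            kept.append(i)
            previous = keys[i]
    kept.sort()                      # restore the input order
    cleared1 = [reads1[i] for i in kept]
    cleared2 = None if reads2 is None else [reads2[i] for i in kept]
    return cleared1, cleared2, len(keys) - len(kept)


if __name__ == "__main__":
    r1 = [("r1", "AAA"), ("r2", "CCC"), ("r3", "AAA"), ("r4", "AAA")]
    r2 = [("r1", "GGG"), ("r2", "TTT"), ("r3", "GGG"), ("r4", "GGT")]
    print(remove_pcr_duplicates(r1, r2))
```

Because the DBR is part of the raw read, two reads of the same locus and individual that derive from different template molecules normally differ in their DBR and are therefore not collapsed.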
The output file(s) have the same format as the input file(s).
This program calculates statistics regarding the number of removed and total reads per locus
and individual (see Figure 15). The output also indicates if a locus has PCR duplicates or not.
Figure 15. Example of the statistics output generated by the program pcrdupremoval.py. Only ten loci and five
individuals are displayed. In each cell, the first value is the removed number of reads for a given locus /
individual, and the second value is the corresponding total number of reads.
It is possible for a locus to have no reads in an individual. If this occurs extensively, it
suggests that the coverage must be optimized. The output file is also generated in CSV format
for further processing with spreadsheets, such as LibreOffice Calc or Microsoft Excel, or
statistics programs, such as R.
Figure 7 shows the input and output files of pcrdupremoval.py,
as well as the position of this program within the processes flow of ddRADseqTools. The
options of the program are detailed in Table 5.
Table 5. pcrdupremoval.py options.
Option | Default value | Comment
format | FASTQ | Format of the output file: FASTQ or FASTA.
readtype | PE | Read type: SE or PE.
readsfile1 | ./reads_1.fastq | Path of the file for SE reads or the reads file where Adapter 1 is for PE reads.
readsfile2 | ./reads_2.fastq | Path of reads file where Adapter 2 is for PE reads or NONE for SE reads.
clearfile | ./reads_cleared | Path of the output file with removed PCR duplicates (without extension).
dupstfile | ./pcrduplicates_stats.txt | Path of the PCR duplicates statistics file.
Program indsdemultiplexing.py
This program demultiplexes one file (SE) or two files (PE) with reads of n individuals into n files
(SE) or 2n files (PE), each containing the reads of one individual.
The input reads have been generated by simddradseq.py or they have been the result of the
removal of PCR duplicates performed with pcrdupremoval.py.
At this point, the reads are raw (i.e. they include the indexes, adapters, DBRs, etc.). Therefore,
they have one index or two indexes to identify each individual. The number ID selected to
identify the index, and the index position are given by the end identifiers of the ends file. As
reads have been generated in silico, mismatches are not considered.
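The idea can be sketched in Python as shown below. The sketch assumes that each raw read starts with its end sequence (read 1 with end 1, read 2 with end 2), that the placeholder digits of an index form one contiguous run in the end record, and that index matching is exact. The function names and the sequences of the example are hypothetical, and the sketch is not the code of indsdemultiplexing.py.

```python
def index_slice(end_template, digit):
    """Slice covering the contiguous run of a placeholder digit in an end."""
    start = end_template.index(digit)
    return slice(start, start + end_template.count(digit))


def demultiplex(read_pairs, individuals, end1_template, end2_template):
    """Group PE reads by individual using exact index matches.

    read_pairs: iterable of (sequence 1, sequence 2) raw reads.
    individuals: {(index1, index2): individual ID}.
    """
    slice1 = index_slice(end1_template, "1")
    slice2 = index_slice(end2_template, "2")
    by_individual = {}
    for seq1, seq2 in read_pairs:
        ind_id = individuals.get((seq1[slice1], seq2[slice2]))
        if ind_id is not None:       # reads with unknown indexes are skipped
            by_individual.setdefault(ind_id, []).append((seq1, seq2))
    return by_individual


if __name__ == "__main__":
    individuals = {("GGTCTT", "ATCACG"): "ind0201",
                   ("CTGGTT", "ATCACG"): "ind0202"}
    pairs = [("ACGTGGTCTTACGCAATTCAAAA", "TTGATCACGATAACCC"),
             ("ACGTCTGGTTTTACAATTCGGGG", "TTGATCACGATAAGGG")]
    print(demultiplex(pairs, individuals, "ACGT111111333C", "TTG222222A"))
```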
The input and output files of indsdemultiplexing.py are shown in Figure 7 as well as the
position of this program within the processes flow of ddRADseqTools.
The options of the program are detailed in Table 6.
Table 6. indsdemultiplexing.py options.
Option | Default value | Comment
technique | IND1_IND2_DBR | Four methodologies are available: IND1 (an index sequence in Adapter 1), IND1_DBR (an index sequence + a DBR in Adapter 1), IND1_IND2 (an index sequence in Adapter 1 + an index sequence in Adapter 2) and IND1_IND2_DBR (an index sequence in Adapter 1 + an index sequence in Adapter 2 + a DBR).
format | FASTQ | Format of the output file: FASTQ or FASTA.
readtype | PE | Read type: SE or PE.
endsfile | ./ends.txt | Path of the input end sequences file.
index1len | 6 | Index sequence length in Adapter 1.
index2len | 6 | Index sequence length in Adapter 2 (it must be 0 when technique is IND1 or IND1_DBR).
dbrlen | 4 | DBR length (it must be 0 when technique is IND1 or IND1_IND2).
wend | end01 | Code used in endsfile corresponding to the end where Adapter 1 is.
cend | end02 | Code used in endsfile corresponding to the end where Adapter 2 is.
individualsfile | ./individuals.txt | Path of the input individuals file.
readsfile1 | ./reads_1.fastq | Path of the file for SE reads or the reads file where Adapter 1 is for PE reads.
readsfile2 | ./reads_2.fastq | Path of reads file where Adapter 2 is for PE reads or NONE for SE reads.
Program readstrim.py
This program trims the ends of one file (SE) or two files (PE) of raw reads, i.e. it removes the adapters,
primers, indexes and DBRs. The end identifiers determine the two ends of the raw reads and,
therefore, the sequences that must be trimmed (see Figure 16).
Figure 16. Example of trimmed reads file. A portion of FASTQ PE read files where two reads are shown without
the two ends.
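A minimal Python sketch of the trimming step is given below. It assumes that each raw read begins with its end sequence, so that trimming amounts to dropping as many leading bases (and quality symbols) as the end record is long. FASTQ records are represented as four-element tuples, the function names are hypothetical, and the behaviour of readstrim.py may be more elaborate.

```python
def trim_record(record, end_template):
    """Remove the end sequence from the start of a FASTQ record.

    record: (header, sequence, separator, quality) tuple.
    """
    header, sequence, separator, quality = record
    length = len(end_template)
    return header, sequence[length:], separator, quality[length:]


def trim_pair(record1, record2, end1_template, end2_template):
    """Trim end 1 from read 1 and end 2 from read 2 of a PE read pair."""
    return (trim_record(record1, end1_template),
            trim_record(record2, end2_template))


if __name__ == "__main__":
    read1 = ("@read1", "ACGTGGTCTTACGCAATTCAAAA", "+", "I" * 23)
    read2 = ("@read1", "TTGATCACGATAACCC", "+", "I" * 16)
    print(trim_pair(read1, read2, "ACGT111111333C", "TTG222222A"))
```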
Figure 7 shows the input and output files of readstrim.py, as well as the position of this
program within the processes flow of ddRADseqTools.
The options of the program are detailed in Table 7.
Table 7. readstrim.py options.
Option | Default value | Comment
technique | IND1_IND2_DBR | Four methodologies are available: IND1 (an index sequence in Adapter 1), IND1_DBR (an index sequence + a DBR in Adapter 1), IND1_IND2 (an index sequence in Adapter 1 + an index sequence in Adapter 2) and IND1_IND2_DBR (an index sequence in Adapter 1 + an index sequence in Adapter 2 + a DBR).
format | FASTQ | Format of the output file: FASTQ or FASTA.
readtype | PE | Read type: SE or PE.
endsfile | ./ends.txt | Path of the input end sequences file.
index1len | 6 | Index sequence length in Adapter 1.
index2len | 6 | Index sequence length in Adapter 2 (it must be 0 when technique is IND1 or IND1_DBR).
dbrlen | 4 | DBR length (it must be 0 when technique is IND1 or IND1_IND2).
wend | end01 | Code used in endsfile corresponding to the end where Adapter 1 is.
cend | end02 | Code used in endsfile corresponding to the end where Adapter 2 is.
readsfile1 | ./reads_1.fastq | Path of the file for SE reads or the reads file where Adapter 1 is for PE reads.
readsfile2 | ./reads_2.fastq | Path of reads file where Adapter 2 is for PE reads or NONE for SE reads.
trimfile | ./reads_cleared | Path of the output file with trimmed reads (without extension).
Program seqlocation.py
This program locates a sequence in the genome, and shows the start and end positions, as
well as the reverse complementary sequence. No mismatches are allowed in this version of the
program.
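An exact search of this kind can be sketched in a few lines of Python. The function names, the 1-based inclusive coordinates and the additional search of the reverse complementary sequence are assumptions made for the illustration. The query used in the example is the default value of the option seq, while the toy genome is invented.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse complementary sequence in upper case."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def locate(genome, seq):
    """Exact matches of seq and of its reverse complement in genome.

    Returns (strand, start, end) tuples with 1-based inclusive positions.
    """
    genome = genome.upper()
    hits = []
    for strand, query in (("+", seq.upper()), ("-", reverse_complement(seq))):
        pos = genome.find(query)
        while pos != -1:
            hits.append((strand, pos + 1, pos + len(query)))
            pos = genome.find(query, pos + 1)   # allow overlapping matches
    return hits


if __name__ == "__main__":
    toy_genome = "TTACG" + "TGGAGGTGGGG" + "ATT" + "CCCCACCTCCA" + "AT"
    print(reverse_complement("TGGAGGTGGGG"))
    print(locate(toy_genome, "TGGAGGTGGGG"))
```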
The options of the program are detailed in Table 8.
Table 8. seqlocation.py options.
Option | Default value | Comment
genfile | ./genome.fna | File of the reference genome in FASTA format. The file can be compressed.
seq | TGGAGGTGGGG | The sequence to be located in the genome.
Methodology for ddRADseqTools validation
Validation of the methods implemented in ddRADseqTools is necessary to verify its proper
design and operability (Aim 2). Other tests were also implemented to study the performance
of ddRADseqTools (Aim 3), and to compare it with the performance of other ddRADseq
simulation tools (Aim 4). Below, we describe the benchmark data and the processes tested
under different scenarios to validate the software.
Benchmark reference genomes
Three reference genomes have been used to validate ddRADseqTools:
Saccharomyces cerevisiae genome: It was downloaded from NCBI:
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF_000146045.2_R64/GCF_000146045.2_R64_genomic.fna.gz
This genome is small, about 12 Mbp, and has 16 chromosomes (Engel et al. 2014).
Homo sapiens genome: It was downloaded from NCBI:
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF_000001405.29_GRCh38.p3/GCF_000001405.29_GRCh38.p3_genomic.fna.gz
The human genome has 23 chromosomes and approximately 3 Gbp (haploid, 1N) (Venter
et al. 2001).
Pinus taeda genome: It was downloaded from Dendrome, a forest tree genome
database:
http://dendrome.ucdavis.edu/ftp/Genome_Data/genome/pinerefseq/Pita/v1.01/ptaeda.v1.01.scaffolds.fasta.gz
This genome is among the largest genomes in living organisms: more than 20 Gbp
distributed in 12 chromosomes (haploid, 1N), with high complexity (Neale et al. 2014; Zimin
et al. 2014).
These genomes represent very different organisms (Fungi, Animalia and Plantae), and their sizes
are small, large and very large respectively, which allows the performance of ddRADseqTools
to be studied over a wide range of genome sizes.
Validation experiments
The following validation experiments have been conducted in order to test the operability of
ddRADseqTools programs, and the reliability of the results:
A) Analysis of fragments generation: The program rsitesearch.py analyses how several
pairs of restriction endonucleases perform a double digest of the reference genomes
above to obtain files of double digested fragments (loci).
The Bash script simulation-genome.sh included in the software package contains all the
instructions used to perform this test. Table 9 shows the values of the main options set
in the runs of rsitesearch.py.
Table 9. Values of the main options set in the runs of rsitesearch.py in simulation-genome.sh.
Options | rsitesearch.py values
enzyme1 | EcoRI, SbfI & PstI
enzyme2 | MseI
fragstinterval | 25
genfile | 3 files indicated in section Benchmark reference genomes
minfragsize | 101 (S. cerevisiae); 201 (H. sapiens and P. taeda)
maxfragsize | 300
B) ddRADseq simulation: The program simddradseq.py simulated the reads of a ddRADseq
experiment while varying two options: 1) the number of reads to generate; and 2) the
probability of loci bearing PCR duplicates. The program pcrdupremoval.py quantified
and removed the PCR duplicates.
The Bash script simulation-ddradseq.sh included in the software package contains all the
instructions used to perform this test. Table 10 shows the values of the main options
set in the runs of each ddRADseq program.
Table 10. Values of the main options set in the runs of each ddRADseq program in simulation-ddradseq.sh.
Options | simddradseq.py values | pcrdupremoval.py values
dropout | 0.0 | n.a.
enzyme1 | enzyme selected in Test A for each reference genome | n.a.
enzyme2 | MseI | n.a.
format | FASTQ | n.a.
fragsfile | the three files generated in Test A, corresponding to the enzyme combination selected for each reference genome | n.a.
gcfactor | 0.2 | n.a.
indelprob | 0.1 | n.a.
insertlen | 100 | n.a.
individualsfile | a file with 48 individuals | n.a.
locinum | number of loci assessed in Test A for each reference genome | n.a.
locusmaxmut | 1 | n.a.
maxindelsize | 10 | n.a.
maxreadvar | 1.2 | n.a.
minreadvar | 0.8 | n.a.
mutprob | 0.2 | n.a.
pcrdupprob | 0.2, 0.4 & 0.6 | n.a.
readsfile1 | n.a. | the file 1 generated by simddradseq.py
readsfile2 | n.a. | the file 2 generated by simddradseq.py
readsnum | number of reads assessed in Test A for each reference genome and each coverage | n.a.
readtype | PE | n.a.
technique | IND1_IND2_DBR | n.a.
n.a.: option not available in the program
C) Analysis of PCR duplicates: This test analysed the effect of the probability of PCR
duplicates on the number of reads to generate in the S. cerevisiae genome. The program
simddradseq.py generated reads for a wide range of values of the probability of loci bearing
PCR duplicates, and the program pcrdupremoval.py quantified and removed the PCR
duplicates.
The Bash script simulation-pcrdupprob.sh included in the software package contains all the
instructions used to perform this test. Table 11 shows the values of the main options
set in the runs of each ddRADseq program.
Table 11. Values of the main options set in the runs of each ddRADseq program in simulation-pcrdupprob.sh.
Options | simddradseq.py values | pcrdupremoval.py values
dropout | 0.0 | n.a.
enzyme1 | enzyme selected in test A for S. cerevisiae genome | n.a.
enzyme2 | MseI | n.a.
format | FASTQ | n.a.
fragsfile | file generated in test A corresponding to S. cerevisiae genome | n.a.
gcfactor | 0.2 | n.a.
indelprob | 0.1 | n.a.
insertlen | 100 | n.a.
individualsfile | file with 48 individuals | n.a.
locinum | number of loci assessed in test A for S. cerevisiae genome | n.a.
locusmaxmut | 1 | n.a.
maxindelsize | 10 | n.a.
maxreadvar | 1.2 | n.a.
minreadvar | 0.8 | n.a.
mutprob | 0.2 | n.a.
pcrdupprob | 0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8 & 0.9 | n.a.
readsfile1 | n.a. | the file 1 generated by simddradseq.py
readsfile2 | n.a. | the file 2 generated by simddradseq.py
readsnum | number of reads assessed in test A for S. cerevisiae genome and each coverage | n.a.
readtype | PE | n.a.
technique | IND1_IND2_DBR | n.a.
n.a.: option not available in the program
D) Analysis of GC factor: This test analysed the effect of the GC content of fragments on
the number of reads and the probability of loci bearing PCR duplicates in the S.
cerevisiae genome. The program simddradseq.py generated reads for a range of values of both
the probability of loci bearing PCR duplicates and the GC factor. The program
pcrdupremoval.py quantified and removed the PCR duplicates.
The Bash script simulation-gcfactor.sh included in the software package contains all the
instructions used to perform this test. Table 12 shows the values of the main options
set in the runs of each ddRADseq program.
Table 12. Values of the main options set in the runs of each ddRADseq program in simulation-gcfactor.sh.
Options | simddradseq.py values | pcrdupremoval.py values
dropout | 0.0 | n.a.
enzyme1 | enzyme selected in test A for S. cerevisiae genome | n.a.
enzyme2 | MseI | n.a.
format | FASTQ | n.a.
fragsfile | file generated in test A corresponding to S. cerevisiae genome | n.a.
gcfactor | 0.0, 0.1, 0.2, 0.3, 0.4 & 0.5 | n.a.
indelprob | 0.1 | n.a.
insertlen | 100 | n.a.
individualsfile | file with 48 individuals | n.a.
locinum | number of loci assessed in test A for S. cerevisiae genome | n.a.
locusmaxmut | 1 | n.a.
maxindelsize | 10 | n.a.
maxreadvar | 1.2 | n.a.
minreadvar | 0.8 | n.a.
mutprob | 0.2 | n.a.
pcrdupprob | 0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8 & 0.9 | n.a.
readsfile1 | n.a. | the file 1 generated by simddradseq.py
readsfile2 | n.a. | the file 2 generated by simddradseq.py
readsnum | number of reads assessed in test A for S. cerevisiae genome and x4 and x8 coverage | n.a.
readtype | PE | n.a.
technique | IND1_IND2_DBR | n.a.
n.a.: option not available in the program
E) Checking of the pattern of mutations: In this test the programs simddradseq.py,
pcrdupremoval.py, and indsdemultiplexing.py were run. Statistics of mutated and
non-mutated fragments of each individual were calculated for a range of values of the
probability of mutation, based on the header information of the reads, and stored in a CSV
file.
The Bash script simulation-mutations.sh included in the software package contains all the
instructions used to perform this test. Table 13 shows the values of the main options
set in the runs of each ddRADseq program.
Table 13. Values of the main options set in the runs of each ddRADseq program in simulation-mutations.sh.
Options | simddradseq.py values | pcrdupremoval.py values | indsdemultiplexing.py values
dropout | 0.0 | n.a. | n.a.
enzyme1 | enzyme selected in test A for S. cerevisiae genome | n.a. | n.a.
enzyme2 | MseI | n.a. | n.a.
format | FASTQ | n.a. | FASTQ
fragsfile | file generated by rsitesearch.py | n.a. | n.a.
fragstinterval | n.a. | n.a. | n.a.
gcfactor | 0.2 | n.a. | n.a.
indelprob | 0.1 | n.a. | n.a.
insertlen | 100 | n.a. | n.a.
individualsfile | file with 48 individuals | n.a. | file with 48 individuals
locinum | number of loci assessed in test A for S. cerevisiae genome | n.a. | n.a.
locusmaxmut | 1 | n.a. | n.a.
maxfragsize | n.a. | n.a. | n.a.
maxindelsize | 10 | n.a. | n.a.
maxreadvar | 1.2 | n.a. | n.a.
minfragsize | n.a. | n.a. | n.a.
minreadvar | 0.8 | n.a. | n.a.
mutprob | 0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8 & 0.9 | n.a. | n.a.
pcrdupprob | 0.2 | n.a. | n.a.
readsfile1 | n.a. | the file 1 generated by simddradseq.py | the file 1 generated by pcrdupremoval.py
readsfile2 | n.a. | the file 2 generated by simddradseq.py | the file 2 generated by pcrdupremoval.py
readsnum | number of reads assessed in test A for S. cerevisiae genome and x2 coverage | n.a. | n.a.
readtype | PE | n.a. | PE
technique | IND1_IND2_DBR | n.a. | IND1_IND2_DBR
n.a.: option not available in the program
F) Pipeline of alignment: Full test of the ddRADseq programs. It generated fragments
from the S. cerevisiae genome through a simulated double digest, generated their reads, removed
PCR duplicates, demultiplexed the individuals, and trimmed the adapters and other
specific sequences. It also aligned the reads, and produced SAM, BAM, BED and VCF
format files to study and visualize alignments.
The Bash script simulation-pipeline.sh included in the software package contains all the
instructions used to perform this test. It is prepared to perform the test with other
genomes. Table 14 shows the values of the main options set in the runs of each
ddRADseq program.
Table 14. Values of the main options set in the runs of each ddRADseq program in simulation-pipeline.sh.
Options | rsitesearch.py values | simddradseq.py values | pcrdupremoval.py values | indsdemultiplexing.py values | readstrim.py values
dropout | n.a. | 0.0 | n.a. | n.a. | n.a.
enzyme1 | enzyme selected in test A for S. cerevisiae genome | enzyme selected in test A for S. cerevisiae genome | n.a. | n.a. | n.a.
enzyme2 | MseI | MseI | n.a. | n.a. | n.a.
format | n.a. | FASTQ | n.a. | FASTQ | FASTQ
fragsfile | n.a. | file generated by rsitesearch.py | n.a. | n.a. | n.a.
fragstinterval | 25 | n.a. | n.a. | n.a. | n.a.
gcfactor | n.a. | 0.0, 0.1, 0.2, 0.3, 0.4 & 0.5 | n.a. | n.a. | n.a.
indelprob | n.a. | 0.1 | n.a. | n.a. | n.a.
insertlen | n.a. | 100 | n.a. | n.a. | n.a.
individualsfile | n.a. | file with 48 individuals | n.a. | file with 48 individuals | file with 48 individuals
locinum | n.a. | number of loci assessed in test A for S. cerevisiae genome | n.a. | n.a. | n.a.
locusmaxmut | n.a. | 1 | n.a. | n.a. | n.a.
maxfragsize | 300 | n.a. | n.a. | n.a. | n.a.
maxindelsize | n.a. | 10 | n.a. | n.a. | n.a.
maxreadvar | n.a. | 1.2 | n.a. | n.a. | n.a.
minfragsize | 101 | n.a. | n.a. | n.a. | n.a.
minreadvar | n.a. | 0.8 | n.a. | n.a. | n.a.
mutprob | n.a. | 0.2 | n.a. | n.a. | n.a.
pcrdupprob | n.a. | 0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8 & 0.9 | n.a. | n.a. | n.a.
readsfile1 | n.a. | n.a. | the file 1 generated by simddradseq.py | the file 1 generated by pcrdupremoval.py | each individual file 1 generated by indsdemultiplexing.py
readsfile2 | n.a. | n.a. | the file 2 generated by simddradseq.py | the file 2 generated by pcrdupremoval.py | each individual file 2 generated by indsdemultiplexing.py
readsnum | n.a. | number of reads assessed in test A for S. cerevisiae genome and x2 coverage | n.a. | n.a. | n.a.
readtype | n.a. | PE | n.a. | PE | PE
technique | n.a. | IND1_IND2_DBR | n.a. | IND1_IND2_DBR | IND1_IND2_DBR
n.a.: option not available in the program
The reference genome was indexed, and the trimmed files obtained by readstrim.py
were aligned to the genome with the Burrows-Wheeler Aligner (BWA) (Li & Durbin
2009; http://sourceforge.net/projects/bio-bwa/). The SAM files were converted to
BAM and VCF formats using SAMtools (https://github.com/samtools/samtools; Li et al.
2009). The BED files were generated from the BAM files using BEDtools (Quinlan & Hall
2010; https://github.com/arq5x/bedtools2).
The BED and VCF files were displayed in Integrative Genomics Viewer (IGV) (Robinson
et al. 2011; Thorvaldsdóttir et al. 2013; http://www.broadinstitute.org/igv/).
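The alignment steps described above can be sketched as a sequence of command lines. This is only an illustrative sketch and not the content of simulation-pipeline.sh: the file names are hypothetical, the choice of `bwa mem` is an assumption (the text only names BWA), the `samtools sort -o` form assumes a recent SAMtools release, and the VCF step is omitted because its exact invocation depends on the SAMtools version.

```python
import shlex


def alignment_commands(genome, reads1, reads2, prefix):
    """Build shell commands to align the trimmed reads of one individual.

    All file names are hypothetical. The reference index only needs to be
    built once, so the first command can be skipped for later individuals.
    """
    q = shlex.quote
    sam = f"{prefix}.sam"
    bam = f"{prefix}.sorted.bam"
    bed = f"{prefix}.bed"
    return [
        f"bwa index {q(genome)}",                                    # index the reference genome
        f"bwa mem {q(genome)} {q(reads1)} {q(reads2)} > {q(sam)}",   # paired-end alignment to SAM
        f"samtools sort -o {q(bam)} {q(sam)}",                       # SAM to coordinate-sorted BAM
        f"samtools index {q(bam)}",                                  # BAM index, needed by IGV
        f"bedtools bamtobed -i {q(bam)} > {q(bed)}",                 # BAM to BED
    ]


if __name__ == "__main__":
    for command in alignment_commands(
        "genome.fasta", "ind0101-1.fastq", "ind0101-2.fastq", "ind0101"
    ):
        print(command)
```

The function only prints the commands, so they can be reviewed or adapted before being run in a shell.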
The script simulation-pipeline.sh is prepared to perform the test with other genomes.
G) Analysis of the software performance: In this test the programs rsitesearch.py,
simddradseq.py, pcrdupremoval.py, indsdemultiplexing.py, and readstrim.py were run
repeatedly in order to measure the elapsed real time used by the program, the total
number of CPU-seconds used by the system on behalf of the process, the total number
of CPU-seconds that the process used directly, and the maximum resident set size of
the process during its lifetime.
The Bash script simulation-performance.sh included in the software package has all
the instructions that performed this test. Table 15 shows the values of the main
options set in the runs of each ddRADseq program.
Table 15. Values of the main options set in the runs of each ddRADseq program in simulation-performance.sh.
options | rsitesearch.py values | simddradseq.py values | pcrdupremoval.py values | indsdemultiplexing.py values | readstrim.py values
dropout | n.a. | 0.0 | n.a. | n.a. | n.a.
enzyme1 | EcoRI, SbfI & PstI | enzyme selected in test A for each reference genome | n.a. | n.a. | n.a.
enzyme2 | MseI | MseI | n.a. | n.a. | n.a.
format | n.a. | FASTQ | n.a. | FASTQ | FASTQ
fragsfile | n.a. | file generated by rsitesearch.py | n.a. | n.a. | n.a.
fragsinterval | 25 | n.a. | n.a. | n.a. | n.a.
gcfactor | n.a. | 0.2 | n.a. | n.a. | n.a.
indelprob | n.a. | 0.1 | n.a. | n.a. | n.a.
insertlen | n.a. | 100 | n.a. | n.a. | n.a.
individualsfile | n.a. | file with 48 individuals | n.a. | file with 48 individuals | file with 48 individuals
locinum | n.a. | number of loci assessed in test A for each reference genome | n.a. | n.a. | n.a.
locusmaxmut | n.a. | 1 | n.a. | n.a. | n.a.
maxfragsize | 300 | n.a. | n.a. | n.a. | n.a.
maxindelsize | n.a. | 10 | n.a. | n.a. | n.a.
maxreadvar | n.a. | 1.2 | n.a. | n.a. | n.a.
minfragsize | 101 (S. cerevisiae), 201 (H. sapiens and P. taeda) | n.a. | n.a. | n.a. | n.a.
minreadvar | n.a. | 0.8 | n.a. | n.a. | n.a.
mutprob | n.a. | 0.2 | n.a. | n.a. | n.a.
pcrdupprob | n.a. | 0.2 & 0.6 | n.a. | n.a. | n.a.
readsfile1 | n.a. | n.a. | the file 1 generated by simddradseq.py | the file 1 generated by pcrdupremoval.py | each individual file 1 generated by indsdemultiplexing.py
readsfile2 | n.a. | n.a. | the file 2 generated by simddradseq.py | the file 2 generated by pcrdupremoval.py | each individual file 2 generated by indsdemultiplexing.py
readsnum | n.a. | number of reads assessed in test A for each reference genome and x2 & x16 coverage | n.a. | n.a. | n.a.
readtype | n.a. | PE | n.a. | PE | PE
technique | n.a. | IND1_IND2_DBR | n.a. | IND1_IND2_DBR | IND1_IND2_DBR
n.a.: option not available in the program
The analysis was run on a computer with Bio-Linux 8 installed. The main features of the
computer were: Intel Core i5-4200U 1.6 GHz with Turbo Boost up to 2.6 GHz; RAM 8
GiB; 5400 rpm disk.
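The four quantities measured in test G (elapsed real time, CPU time in kernel and user mode, and maximum resident set size) can be collected for any command with a few lines of Python. The following is a minimal sketch for Unix-like systems, not the method used by simulation-performance.sh; the function name `measure` and the example command are hypothetical.

```python
import resource
import subprocess
import sys
import time


def measure(command):
    """Run a command and report its elapsed time, CPU times and peak memory."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.perf_counter()
    subprocess.run(command, check=True)
    elapsed = time.perf_counter() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    return {
        "elapsed_s": elapsed,
        "user_cpu_s": after.ru_utime - before.ru_utime,
        "system_cpu_s": after.ru_stime - before.ru_stime,
        # Peak over every child process waited for so far, not only this one;
        # the unit is kilobytes on Linux and bytes on macOS.
        "max_rss": after.ru_maxrss,
    }


if __name__ == "__main__":
    print(measure([sys.executable, "-c", "sum(range(10**6))"]))
```

Because `ru_maxrss` is a running maximum over all children, each program should be measured in a fresh Python process when peak memory values are to be compared.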
H) Comparison between ddRADseqTools and other ddRADseq simulation tools: The
program rsitesearch.py was compared with the R package SimRAD and the Python
program Digital_RADs.py of the software package BU-RAD-seq. The test was designed to
generate fragments of a double digest of the benchmark reference genomes with the
following restriction endonuclease pairs: EcoRI/MseI, PstI/MseI and SbfI/MseI,
selecting fragments of 101-300 nt for S. cerevisiae and of 201-300 nt for H. sapiens
and P. taeda. We also analysed the performance of the three tools as in test G.
A comparison between the program simddradseq.py and simRRLs was not
considered because simRRLs does not accept reference genomes.
The attached file other_scripts.zip contains the scripts needed to run the SimRAD and
Digital_RADs.py tests.
Results and Discussion
The results of the tests performed to validate the programs are detailed in the next
sub-chapters. The complete results of a run of the scripts that correspond to each test are
available in the attached file simulations.zip.
Analysis of fragments generation
The script simulation-genome.sh provides statistics regarding the abundance of fragments in
intervals of 25 nucleotides, obtained from reference genomes when EcoRI, PstI and SbfI act as
the first enzyme, and MseI acts as the second enzyme. Figures B-1, B-2 and B-3 from Appendix
B (see file supplementary.pdf) represent the distribution of fragments for these enzymes and
genomes. They have been drawn by rsitesearch.py.
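The principle of an in silico double digest can be illustrated with a short sketch. It is a simplified illustration written for this text and does not reproduce the implementation of rsitesearch.py: it scans a single strand, treats the sequence as linear, and keeps only the fragments flanked by one cut of each enzyme. The function names are hypothetical; the recognition sites and cut positions are those of the four enzymes used in the tests.

```python
import re

# Recognition site and cut offset within the site (top strand).
ENZYMES = {
    "EcoRI": ("GAATTC", 1),    # G^AATTC
    "PstI": ("CTGCAG", 5),     # CTGCA^G
    "SbfI": ("CCTGCAGG", 6),   # CCTGCA^GG
    "MseI": ("TTAA", 1),       # T^TAA
}


def cut_positions(seq, site, offset):
    """Return every cut coordinate of one enzyme, including overlapping sites."""
    return [m.start() + offset for m in re.finditer(f"(?={site})", seq)]


def double_digest(seq, enzyme1, enzyme2, min_size, max_size):
    """Return (start, end, fragment) for fragments cut by both enzymes."""
    seq = seq.upper()
    cuts = []
    for name in (enzyme1, enzyme2):
        site, offset = ENZYMES[name]
        cuts += [(pos, name) for pos in cut_positions(seq, site, offset)]
    cuts.sort()
    fragments = []
    # Consecutive cuts delimit a fragment; keep it only when its two ends
    # come from different enzymes and its size lies in the selected range.
    for (start, left), (end, right) in zip(cuts, cuts[1:]):
        if left != right and min_size <= end - start <= max_size:
            fragments.append((start, end, seq[start:end]))
    return fragments


if __name__ == "__main__":
    toy = "AAAGAATTC" + "C" * 10 + "TTAACCC"
    print(double_digest(toy, "EcoRI", "MseI", 10, 300))
```

Counting the returned fragments per size interval (25 nt in the tests) would give a distribution comparable in spirit to the ones drawn by rsitesearch.py.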
The summary statistics in Table 16 show the total number of fragments, and the number of
fragments whose size is between 201 and 300 nt, which is the usual range in ddRADseq
experiments. For S. cerevisiae, however, due to the small size of its genome, the range
specified is between 101 and 300 nt.
Table 16. Fragments generated by restriction endonucleases for three reference genomes (S.
cerevisiae, H. sapiens, and P. taeda). The optimal enzyme combination inferred from the
number of fragments generated for the selected size interval is indicated in bold.
S. cerevisiae
Enzymes Total fragments Fragments w/ size 101-300 nt
EcoRI - MseI 8,176 3,103
PstI - MseI 4,623 1,853
SbfI - MseI 188 70
H. sapiens
Enzymes Total fragments Fragments w/ size 201-300 nt
EcoRI - MseI 1,629,978 203,735
PstI - MseI 2,236,406 331,344
SbfI - MseI 156,140 21,016
P. taeda
Enzymes Total fragments Fragments w/ size 201-300 nt
EcoRI - MseI 11,459,733 1,353,309
PstI - MseI 4,784,215 621,933
SbfI - MseI 215,211 26,532
The effect of the double digestion with EcoRI, PstI and SbfI was different for each genome. The
success and cost-efficiency of a ddRADseq experiment largely depend on the selection of
the correct enzyme pair combination. Since the number of reads is equal to the number of
fragments multiplied by the coverage and the number of individuals, the enzyme pair chosen
in a ddRADseq experiment must provide a tractable number of fragments; that is, there must
be a balance between the number of fragments, the total number of reads and the number of
individuals to get an optimal coverage and a low percentage of missing data.
The restriction endonucleases marked in bold in Table 16 were selected to perform the
subsequent validation tests. The number of reads used in these validation tests is shown in
Table 17. As 48 individuals were used in all tests, a rounded number of reads was used to
reach approximately 2x, 4x, 8x and 16x coverage of the number of fragments marked in bold
in Table 16.
Table 17. Number of reads for 48 individuals and the coverage of 2x, 4x, 8x and 16x used in the validation tests.
organism | enzymes | number of reads 2x | number of reads 4x | number of reads 8x | number of reads 16x
S. cerevisiae | EcoRI - MseI | 300,000 | 600,000 | 1,200,000 | 2,400,000
H. sapiens | SbfI - MseI | 2,000,000 | 4,000,000 | 8,100,000 | 16,100,000
P. taeda | SbfI - MseI | 2,500,000 | 5,100,000 | 10,200,000 | 20,400,000
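The relationship stated above (number of reads = fragments x coverage x individuals) can be checked against Tables 16 and 17 with a short calculation. In this sketch the function names are hypothetical, and rounding to the nearest 100,000 reads is an assumption; it happens to reproduce the values of Table 17, but the text only says that the numbers were rounded.

```python
def required_reads(fragments, coverage, individuals):
    """Exact number of reads needed for the requested depth coverage."""
    return fragments * coverage * individuals


def rounded_reads(fragments, coverage, individuals, step=100_000):
    """Round the exact number of reads to the nearest multiple of `step`."""
    return round(required_reads(fragments, coverage, individuals) / step) * step


if __name__ == "__main__":
    # Fragments in the selected size range (Table 16) for the enzyme pairs of Table 17.
    selected = {"S. cerevisiae": 3_103, "H. sapiens": 21_016, "P. taeda": 26_532}
    for organism, fragments in selected.items():
        row = [rounded_reads(fragments, coverage, 48) for coverage in (2, 4, 8, 16)]
        print(organism, row)
```

For example, S. cerevisiae with EcoRI - MseI gives 3,103 x 2 x 48 = 297,888 reads at 2x, which rounds to the 300,000 reads listed in Table 17.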
ddRADseq simulations
The script simulation-ddRADseq.sh checks the relationship of the number of reads and the
probability of loci bearing PCR duplicates, with the percentage of removed reads and the
subsequent deviation in the depth coverage for three reference genomes. The summary of a
run of this script is found in Appendix C (see file supplementary.pdf): Table C-1, S. cerevisiae;
Table C-2, H. sapiens; and Table C-3, P. taeda.
Table 18 shows the percentage of removed reads and the coverage deviation. The latter
statistic is computed by comparing the theoretical probability of loci bearing PCR duplicates
(value passed to the program pcrdupremoval.py), and the actual proportion of loci bearing PCR
duplicates. The term "depth coverage" is used instead of "number of reads" to standardize the
comparisons between organisms.
Table 18. Percentage of removed reads and coverage deviation collected from Appendix C data (see file supplementary.pdf).
organism | pcrdupprob | 2x %r.reads | 2x cov.dev. | 4x %r.reads | 4x cov.dev. | 8x %r.reads | 8x cov.dev. | 16x %r.reads | 16x cov.dev. | %r.reads mean | %r.reads s.d. | cov.dev. mean | cov.dev. s.d.
S. cerevisiae | 0.2 | 15.37 | -0.3166 | 16.77 | -0.6955 | 17.48 | -1.4543 | 18.39 | -3.0511 | 17.00 | 1.2744 | -1.38 | 1.2107
S. cerevisiae | 0.4 | 31.16 | -0.6516 | 30.6 | -1.2565 | 32.74 | -2.7541 | 32.29 | -5.3893 | 31.70 | 0.9885 | -2.51 | 2.1115
S. cerevisiae | 0.6 | 46.83 | -0.9764 | 46.45 | -1.9375 | 48.26 | -4.0376 | 48.07 | -8.0143 | 47.40 | 0.8974 | -3.74 | 3.1222
H. sapiens | 0.2 | 15.59 | -0.317 | 16.41 | -0.6564 | 17.22 | -1.3987 | 18.54 | -3.0078 | 16.94 | 1.2572 | -1.34 | 1.1970
H. sapiens | 0.4 | 31.26 | -0.6325 | 31.48 | -1.2625 | 32.18 | -2.6238 | 33.01 | -5.356 | 31.98 | 0.7894 | -2.47 | 2.0966
H. sapiens | 0.6 | 46.21 | -0.9303 | 46.4 | -1.8666 | 47.48 | -3.867 | 48.03 | -7.7916 | 47.03 | 0.8702 | -3.61 | 3.0426
P. taeda | 0.2 | 15.9 | -0.3217 | 16.24 | -0.9321 | 17.4 | -1.4188 | 18 | -2.9382 | 16.89 | 0.9823 | -1.40 | 1.1177
P. taeda | 0.4 | 31.44 | -0.6327 | 31.69 | -1.2969 | 32.23 | -2.6339 | 33.15 | -5.4098 | 32.13 | 0.7572 | -2.49 | 2.1149
P. taeda | 0.6 | 46.46 | -0.9321 | 46.76 | -1.9117 | 46.95 | -3.8366 | 47.63 | -7.7748 | 46.95 | 0.4962 | -3.61 | 3.0250
pcrdupprob: theoretical probability of loci bearing PCR duplicates; %r.reads: percentage of removed reads;
cov.dev.: coverage deviation
The behaviour of the data was very similar for all three genomes. We observed that:
- For a given theoretical probability of loci bearing PCR duplicates, the percentage of
  removed reads was very similar for all depth coverages.
- For a given theoretical probability of loci bearing PCR duplicates, the coverage deviation
  grew linearly with the depth coverage, irrespective of the genome size.
- The percentage of removed reads and the coverage deviation grew linearly with the
  theoretical probability of loci bearing PCR duplicates.
A graphical interpretation of Table 18 can be found in Appendix C (see supplementary.pdf file):
Figure C-1, S. cerevisiae; Figure C-2, H. sapiens; and Figure C-3, P. taeda. Table C-1, Table C-2
and Table C-3 show that the loci bearing PCR duplicates generated by the program
simddradseq.py have only small deviations. Table 18 shows the mean and standard deviation for
each probability of loci bearing PCR duplicates per reference genome.
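The summary columns of Table 18 can be recomputed from the per-coverage values. The sketch below does so for the S. cerevisiae row with a probability of 0.2, using the standard library; using the sample standard deviation (n - 1 in the denominator) is an assumption, chosen because it reproduces the values reported in the table (17.00 and 1.2744 for the removed reads, -1.38 and 1.2107 for the coverage deviation).

```python
from statistics import mean, stdev

# Values at 2x, 4x, 8x and 16x for S. cerevisiae, pcrdupprob = 0.2 (Table 18).
removed_reads = [15.37, 16.77, 17.48, 18.39]
coverage_deviation = [-0.3166, -0.6955, -1.4543, -3.0511]

if __name__ == "__main__":
    for label, values in (("%r.reads", removed_reads), ("cov.dev.", coverage_deviation)):
        print(label, round(mean(values), 2), round(stdev(values), 4))
```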
Analysis of PCR duplicates
The in-depth analysis of the effect of the probability of loci bearing PCR duplicates on the
number of reads was carried out by the script simulation-pcrdupprob.sh. In one run of this
script, the program simddradseq.py generated reads for a wide range of probabilities of PCR
duplicates. The program pcrdupremoval.py quantified and removed the PCR duplicates. The
results of this run are summarized in Table D-1 (Appendix D, file supplementary.pdf). Figure 17
shows graphically the percentage of removed reads, and the coverage deviation for each PCR
duplicates probability and depth coverage in S. cerevisiae.
Figure 17. Percentage of removed reads and coverage deviation for values of the probability of loci
bearing PCR duplicates between 0.0 and 0.9 in S. cerevisiae at 2x, 4x, 8x and 16x coverage.
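The quantification and removal of PCR duplicates can be pictured with the following sketch. It is an illustration of the general DBR-based idea (reads of the same individual and locus that carry the same DBR are treated as copies of one template molecule), not the code of pcrdupremoval.py; the record layout and the function name are hypothetical.

```python
def remove_pcr_duplicates(reads):
    """Keep the first read of every (individual, locus sequence, DBR) combination."""
    seen = set()
    kept = []
    for read in reads:
        if read not in seen:
            seen.add(read)
            kept.append(read)
    return kept


reads = [
    ("ind0101", "AATTCGGA", "ACGT"),
    ("ind0101", "AATTCGGA", "ACGT"),  # same locus and same DBR: counted as a duplicate
    ("ind0101", "AATTCGGA", "TTGA"),  # same locus, different DBR: kept
    ("ind0102", "AATTCGGA", "ACGT"),  # different individual: kept
]
kept = remove_pcr_duplicates(reads)
removed_percentage = 100 * (len(reads) - len(kept)) / len(reads)

if __name__ == "__main__":
    print(len(kept), removed_percentage)
```

The percentage of removed reads computed in this way corresponds to the quantity plotted in Figure 17.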
The results showed that:
- The number of removed reads, i.e. the number of duplicate reads, was proportional to
  the probability of loci bearing PCR duplicates, and the values were independent of the
  depth coverage.
- The coverage deviation was proportional to both the probability of loci bearing PCR
  duplicates and the depth coverage.
- A small, nearly negligible number of duplicate reads was produced when the
  probability of loci bearing PCR duplicates was 0.0. This is an artifact of the random
  generation of the DBR sequences. Such duplicate reads also occurred when the
  probability of loci bearing PCR duplicates was > 0.0, and there is no way
  to distinguish real duplicates from artifacts. In any case, the number of
  duplicate reads generated at random was negligible when the PCR duplicates
  probability was > 0.0.
- The decrease in coverage became larger as the number of PCR duplicates grew. The
  researcher can set the value of the option pcrdupprob that best suits the
  experiment being designed.
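The artifact mentioned above, where independent reads of one locus receive the same DBR by chance, can be estimated with a simple occupancy calculation. This is a rough sketch under stated assumptions: every DBR is drawn uniformly from `n_dbr` possible sequences, every locus of an individual receives exactly `depth` reads, and the value 256 used in the example is hypothetical rather than the DBR design of ddRADseqTools.

```python
def chance_duplicate_fraction(depth, n_dbr):
    """Expected fraction of reads flagged as duplicates purely by chance.

    With `depth` independent reads and `n_dbr` equally likely DBR sequences,
    the expected number of distinct DBRs is n_dbr * (1 - (1 - 1/n_dbr)**depth);
    every read beyond the distinct ones would be removed as a duplicate.
    """
    expected_distinct = n_dbr * (1 - (1 - 1 / n_dbr) ** depth)
    return 1 - expected_distinct / depth


if __name__ == "__main__":
    for depth in (2, 4, 8, 16):
        print(depth, round(100 * chance_duplicate_fraction(depth, 256), 2))
```

Under these assumptions the chance fraction grows with the depth coverage, which is the same direction seen in the pcrdupprob 0.0 row of Table 19 (0.81 % at 4x and 1.60 % at 8x).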
Analysis of the effect of the GC content
The script simulation-gcfactor.sh analyses the effect of the GC content on the number of reads
to generate, and on the probability of loci bearing PCR duplicates in the S. cerevisiae genome.
In this script the program simddradseq.py generated reads for wide ranges of both the
probability of loci bearing PCR duplicates and the GC factor. The program pcrdupremoval.py
quantified and removed the PCR duplicates.
Table E-1 (Appendix E, file supplementary.pdf) summarizes the data of a run of
simulation-gcfactor.sh. Table 19 shows the percentage of removed reads and the
coverage deviation for a range of values (0.0-0.9) of the theoretical probability of loci bearing PCR
duplicates, and for depth coverages of 4x and 8x.
Table 19. Summary of the percentage of removed reads and the coverage deviation
for various values of the GC factor, grouped by the probability of loci bearing PCR duplicates,
collected from Appendix E data (see file supplementary.pdf).
theoretical pcrdupprob | 4x percentage removed reads mean | 4x percentage removed reads s.d. | 4x coverage deviation mean | 4x coverage deviation s.d. | 8x percentage removed reads mean | 8x percentage removed reads s.d. | 8x coverage deviation mean | 8x coverage deviation s.d.
0.0 | 0.81 | 0.0071 | -0.03 | 0.0070 | 1.60 | 0.0096 | -0.14 | 0.0166
0.1 | 9.41 | 1.0235 | -0.39 | 0.0435 | 10.31 | 1.1093 | -0.87 | 0.0909
0.2 | 16.97 | 0.5428 | -0.71 | 0.0239 | 17.29 | 0.4387 | -1.44 | 0.0454
0.3 | 23.78 | 0.7591 | -1.00 | 0.0351 | 24.71 | 0.5526 | -2.05 | 0.0462
0.4 | 31.70 | 0.2604 | -1.32 | 0.0131 | 31.80 | 0.1894 | -2.65 | 0.0162
0.5 | 38.90 | 0.7139 | -1.62 | 0.0266 | 39.63 | 1.0265 | -3.31 | 0.0880
0.6 | 46.68 | 0.1937 | -1.94 | 0.0084 | 47.62 | 0.7847 | -3.97 | 0.0610
0.7 | 54.31 | 0.5059 | -2.26 | 0.0226 | 54.32 | 0.7413 | -4.52 | 0.0628
0.8 | 62.14 | 0.4659 | -2.59 | 0.0198 | 62.30 | 0.7972 | -5.19 | 0.0700
0.9 | 69.16 | 0.4306 | -2.88 | 0.0196 | 69.19 | 0.3822 | -5.77 | 0.0340
Figure 18 represents the percentage of removed reads against the probability of loci bearing
PCR duplicates for several GC factor values.
Figure 18. Percentage of removed reads vs. the probability of loci bearing PCR duplicates for several
values of GC factor in S. cerevisiae, and for 4x and 8x coverage.
Figure 19 shows the coverage deviation against the probability of loci bearing PCR duplicates
for several GC factor values.
Figure 19. Coverage deviation against the probability of loci bearing PCR duplicates for several values of
GC factor in S. cerevisiae, and for 4x and 8x coverage.
In light of these results, we can conclude that the GC factor had no major influence on the
generation of PCR duplicates or on the coverage deviation. In addition, this confirmed the
results from the test of the section Analysis of PCR duplicates:
- The number of duplicate reads was proportional to the probability of loci bearing PCR
  duplicates and was independent of the coverage.
- The coverage deviation was proportional to both the probability of loci bearing PCR
  duplicates and the coverage.
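The linear trend described in these conclusions can be checked with an ordinary least-squares fit of the 4x mean values of Table 19. The sketch below is only a verification aid written for this text; the helper name `least_squares` is hypothetical and no fit of this kind is reported in the original analysis.

```python
def least_squares(xs, ys):
    """Return slope and intercept of the ordinary least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Mean percentage of removed reads at 4x coverage (Table 19).
pcrdupprob = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
removed_4x = [0.81, 9.41, 16.97, 23.78, 31.70, 38.90, 46.68, 54.31, 62.14, 69.16]

slope, intercept = least_squares(pcrdupprob, removed_4x)
largest_residual = max(
    abs(y - (slope * x + intercept)) for x, y in zip(pcrdupprob, removed_4x)
)

if __name__ == "__main__":
    print(round(slope, 2), round(intercept, 2), round(largest_residual, 2))
```

A slope of roughly 75 percentage points per unit of pcrdupprob, with residuals below one percentage point, is consistent with the near-linear relationship stated above.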
Analysis of the mutation patterns
The script simulation-mutations.sh calculated statistics for the mutated and not-mutated reads
of each individual, generated after PCR duplicates removal and demultiplexing. Detailed results per
individual for a run of this script are found in Table F-1 (Appendix F, file supplementary.pdf) and
a summary is shown in Table 20. We can observe that the percentage of mutated reads is
approximately the value expected from the corresponding mutation probability (mutprob).
Table 20. Summary of the percentage of mutated reads for each mutation probability (mutprob),
collected from Appendix F data (see file supplementary.pdf).
mutprob | not-mutated reads | mutated reads | total reads | percentage of mutated reads
0.0 | 251,348 | 0 | 251,348 | 0.00
0.1 | 227,302 | 24,912 | 252,214 | 9.88
0.2 | 203,360 | 49,959 | 253,319 | 19.72
0.3 | 174,697 | 75,128 | 249,825 | 30.07
0.4 | 152,020 | 101,928 | 253,948 | 40.14
0.5 | 125,015 | 125,350 | 250,365 | 50.07
0.6 | 99,570 | 150,164 | 249,734 | 60.13
0.7 | 74,655 | 174,346 | 249,001 | 70.02
0.8 | 50,267 | 201,499 | 251,766 | 80.03
0.9 | 25,432 | 228,608 | 254,040 | 89.99
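The last column of Table 20 follows directly from the read counts, and the agreement with the mutprob value can be verified with a few lines. The sketch uses three rows of the table; the function name and the 0.5 percentage-point tolerance are illustrative choices, not part of the original analysis.

```python
def mutated_percentage(not_mutated, mutated):
    """Percentage of mutated reads among all reads."""
    return 100 * mutated / (not_mutated + mutated)


# mutprob -> (not-mutated reads, mutated reads), taken from Table 20.
table_20 = {
    0.1: (227_302, 24_912),
    0.5: (125_015, 125_350),
    0.9: (25_432, 228_608),
}

if __name__ == "__main__":
    for mutprob, (not_mutated, mutated) in table_20.items():
        observed = mutated_percentage(not_mutated, mutated)
        print(mutprob, round(observed, 2), abs(observed - 100 * mutprob) < 0.5)
```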
Pipeline for the alignment of simulated reads
The script simulation-pipeline.sh performed a complete test of the ddRADseq programs for S.
cerevisiae. The resulting reads were aligned, and SAM, BAM, BED and VCF format files were
generated. The alignments were visualized with IGV.
Figure 20 displays the results of loading the reference genome of S. cerevisiae, and the BED
files of the reads resulting from the simulations. We can observe that the collapsed reads of
each individual covered all chromosomes uniformly.
Figure 20. Reads generated by simulation-pipeline.sh for S. cerevisiae visualised along its whole genome. Each
row corresponds to the collapsed reads of a single individual.
Figure 21 shows the expanded reads at single chromosome level. The number of reads varied
from one individual to another and, in certain cases, an individual did not show reads in a locus.
Figure 21. Reads generated by simulation-pipeline.sh for chromosome NC_001139.9 of S. cerevisiae.
Each row corresponds to the squished reads of one individual.
IGV allows a fragment in the visualization to be traced to the detail of its corresponding reads in
the browser, and vice versa. Fragments carry information about the chromosome or scaffold and
the strand to which they belong, and about their start and end positions. Reads carry information
about the fragment from which they derive. Files in VCF format allow the extent of
mutations (SNPs or indels) to be quantified by chromosome or scaffold, and by their coordinates in
the genome. IGV displays the alignment of reads to the genome. Figure 22 and Figure 23 show
two examples of fragment – reads – VCF – alignment traceability. They provide evidence of the
correct functioning of the ddRADseqTools programs.
Figure 22. Example of fragment – reads – VCF – alignment traceability visualized with IGV in the case of an indel. The fragment shown corresponds to positions 759937-760168 of the + strand
of chromosome XIII (id NC_001145.3). An indel occurs in one of the reads of individual ind0101 at positions 759942-759945: a GCCC sequence is changed to GCC. This indel was
recorded in the VCF file and can be visualized with IGV.
Figure 23. Example of fragment – reads – VCF – alignment traceability visualized with IGV in the case of a SNP. The fragment shown corresponds to positions 139675-139506 of the - strand of
chromosome I (id NC_001133.9). A SNP occurs in one of the reads of individual ind0101 at position 139669: an A (T in the + strand) is changed to C (G in the + strand). This SNP was recorded in
the VCF file and can be visualized with IGV (genome nucleotides of the - strand are shown in this alignment).
Performance of ddRADseqTools
The results of the performance analysis of ddRADseqTools are shown in Table G-1 (Appendix G, file
supplementary.pdf).
rsitesearch.py was the program that needed the largest amount of memory: approximately 61 MiB
were required for S. cerevisiae; above 4 GiB for H. sapiens; and below 220 MiB for P.
taeda. Although the P. taeda genome is larger than the H. sapiens genome, its memory
requirements were lower because its genome file contains scaffolds rather than whole
chromosomes, unlike the H. sapiens genome file. The elapsed time depended both on the genome size and on the
number of fragments obtained (see also Table 15). The maximum elapsed time recorded was
2,812.50 s for P. taeda with EcoRI / MseI as the restriction endonuclease pair.
The program simddradseq.py had very low memory requirements: below 23 MiB for the three
reference genomes analysed. The elapsed time was proportional to the number of reads. The
maximum elapsed time recorded was 2,337.57 s (below 39 min) for P. taeda with 20,400,000
reads.
The elapsed time of the program pcrdupremoval.py depended on the number of records in the input and
output files: for the same number of input records (readsnum column), the time was lower when
fewer output records were produced (i.e. when the pcrdupprob column value was greater). The
maximum elapsed time recorded was 10,606.18 s (approximately 2 h 57 min) for P. taeda
with 20,400,000 reads and a probability of loci bearing PCR duplicates of 0.2.
The program indsdemultiplexing.py always had a memory requirement of approximately 10
MiB. Its elapsed time depended on the number of records in the input file (for the same readsnum value,
a greater pcrdupprob value implied fewer reads). The maximum elapsed time
recorded was 1,270.93 s (above 21 min) for P. taeda with 20,400,000 reads and a
probability of loci bearing PCR duplicates of 0.2.
The maximum resident set size of the program readstrim.py was 9 MB, and its maximum elapsed
real time was 311.41 s (above 5 min) for P. taeda with 20,400,000 reads and a probability
of loci bearing PCR duplicates of 0.2.
Comparison between ddRADseqTools and other ddRADseq simulation tools
Table 21 shows the comparison between rsitesearch.py, the R package SimRAD and
Digital_RADs.py of BU-RAD-seq. Both SimRAD and Digital_RADs.py required a preliminary step
to decompress the genome file. Digital_RADs.py also required a further preliminary step to convert
the nucleotide symbols to upper case.
SimRAD did not work with the H. sapiens and P. taeda genomes. Digital_RADs.py obtained fragments
of the benchmark reference genomes, but it was not efficient for very large genomes.
rsitesearch.py of ddRADseqTools was the only tool that could obtain fragments of all
benchmark reference genomes with good performance.
Table 21. Comparison between rsitesearch.py, the R package SimRAD and Digital_RADs.py of BU-RAD-seq.
S. cerevisiae
enzymes | rsitesearch.py total fragments | rsitesearch.py fragments w/ size 101-300 nt | rsitesearch.py elapsed real time (s) | rsitesearch.py CPU time in kernel mode (s) | rsitesearch.py CPU time in user mode (s) | SimRAD (*) total fragments | SimRAD (*) fragments w/ size 101-300 nt | SimRAD (*) elapsed real time (s) | SimRAD (*) CPU time in kernel mode (s) | SimRAD (*) CPU time in user mode (s) | Digital_RADs.py (*) (**) (***) fragments w/ size 1-1,000 nt | Digital_RADs.py fragments w/ size 101-300 nt | Digital_RADs.py elapsed real time (s) | Digital_RADs.py CPU time in kernel mode (s) | Digital_RADs.py CPU time in user mode (s)
EcoRI - MseI | 8,176 | 3,103 | 4.99 | 0.09 | 2.38 | 8,176 | 3,048 | 21.14 | 0.21 | 17.94 | 8,139 | 3,191 | 1.30 | 0.03 | 0.41
PstI - MseI | 4,623 | 1,853 | 2.34 | 0.01 | 2.19 | 4,628 | 1,866 | 18.32 | 0.21 | 18.07 | 4,590 | 1,934 | 0.38 | 0.02 | 0.36
SbfI - MseI | 188 | 70 | 2.01 | 0.05 | 1.84 | 188 | 70 | 17.89 | 0.20 | 17.66 | 186 | 73 | 0.35 | 0.01 | 0.34
H. sapiens
enzymes | ddRADseqTools total fragments | ddRADseqTools fragments w/ size 201-300 nt | ddRADseqTools elapsed real time (s) | ddRADseqTools CPU time in kernel mode (s) | ddRADseqTools CPU time in user mode (s) | SimRAD (*) total fragments | SimRAD (*) fragments w/ size 201-300 nt | SimRAD (*) elapsed real time (s) | SimRAD (*) CPU time in kernel mode (s) | SimRAD (*) CPU time in user mode (s) | Digital_RADs.py (*) (**) (***) fragments w/ size 1-1,000 nt | Digital_RADs.py fragments w/ size 201-300 nt | Digital_RADs.py elapsed real time (s) | Digital_RADs.py CPU time in kernel mode (s) | Digital_RADs.py CPU time in user mode (s)
EcoRI - MseI | 1,629,978 | 203,735 | 421.39 | 11.90 | 399.33 | (***) | (***) | (****) | (****) | (****) | 1,604,730 | 208,238 | 233.08 | 8.89 | 96.03
PstI - MseI | 2,236,406 | 331,344 | 469.76 | 10.03 | 457.63 | (***) | (***) | (****) | (****) | (****) | 2,195,695 | 343,793 | 180.33 | 5.66 | 87.17
SbfI - MseI | 156,140 | 21,016 | 324.16 | 8.18 | 314.54 | (***) | (***) | (****) | (****) | (****) | 141,656 | 21,660 | 175.42 | 5.37 | 84.21
P. taeda
enzymes | ddRADseqTools total fragments | ddRADseqTools fragments w/ size 201-300 nt | ddRADseqTools elapsed real time (s) | ddRADseqTools CPU time in kernel mode (s) | ddRADseqTools CPU time in user mode (s) | SimRAD (*) total fragments | SimRAD (*) fragments w/ size 201-300 nt | SimRAD (*) elapsed real time (s) | SimRAD (*) CPU time in kernel mode (s) | SimRAD (*) CPU time in user mode (s) | Digital_RADs.py (*) (**) (***) fragments w/ size 1-1,000 nt | Digital_RADs.py fragments w/ size 201-300 nt | Digital_RADs.py elapsed real time (s) | Digital_RADs.py CPU time in kernel mode (s) | Digital_RADs.py CPU time in user mode (s)
EcoRI - MseI | 11,459,733 | 1,353,309 | 2,812.50 | 19.27 | 2,773.89 | (****) | (****) | (*****) | (*****) | (*****) | 11,181,647 | 1,377,129 | 26,937.80 | 872.16 | 4,062.31
PstI - MseI | 4,784,215 | 621,933 | 2,429.52 | 15.42 | 2,402.45 | (****) | (****) | (*****) | (*****) | (*****) | 4,590,018 | 643,991 | 34,287.74 | 902.78 | 4,141.07
SbfI - MseI | 215,211 | 26,532 | 2,005.67 | 11.48 | 1,985.56 | (****) | (****) | (*****) | (*****) | (*****) | 204,438 | 27,408 | 68,824.32 | 955.48 | 4,336.22
(*) It was necessary to decompress the genome file in a preliminary stage. Elapsed real time: S. cerevisiae, 0.14 s; H. sapiens, 59.96 s; P. taeda, 443.30 s.
(**) It was necessary to convert the genome file content to upper case beforehand. Elapsed real time: S. cerevisiae, 0.14 s; H. sapiens, 100.81 s; P. taeda, 829.36 s.
(***) Further, it was necessary to delete temporary files. For P. taeda, 14,412,988 temporary files were generated and their deletion took several hours.
(****) Error in ref.DNAseq (result would exceed 2^31-1 bytes).
(*****) Computer crashed.
Limitations and Future Prospects
The current version of the ddRADseqTools software package has the following limitations:
- Only the Jukes-Cantor model of sequence evolution is implemented, and the
  phylogenetic relationships between individuals or groups of individuals cannot be
  simulated.
- The individuals are assumed to be diploid.
- Mismatches are not admitted in the demultiplexing process. This is not important when
  the reads are generated in silico, as in the present project, but the current version of
  pcrdupremoval.py and indsdemultiplexing.py cannot be used with experimental
  ddRADseq data.
- Paralogous sequences are not parameterized. If organisms with genomes with a high
  content of repetitive regions (e.g. P. taeda) are used as a reference, some paralogous
  fragments will be generated, but this feature is not controlled by the user. However,
  paralogous sequences can be identified following Mastretta-Yanes et al. (2015). When
  reads are generated at random with fragsgeneration.py, paralogous sequences will not
  be generated.
- Statistics for the amount of missing data are not automatically generated. However,
  they can be easily calculated from the existing output with general-purpose software
  (R, Excel, etc.).
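As an illustration of the last point, the proportion of missing data per individual can be obtained with a few lines once the loci recovered for each individual are known. The sketch below assumes a hypothetical input (a mapping from each individual to the set of loci with at least one read) that would have to be derived from the demultiplexed files; it does not rely on any particular output format of ddRADseqTools.

```python
def missing_data_fraction(loci_by_individual, total_loci):
    """Return, for each individual, the fraction of loci without any read."""
    return {
        individual: 1 - len(loci) / total_loci
        for individual, loci in loci_by_individual.items()
    }


if __name__ == "__main__":
    # Hypothetical example with four loci and two individuals.
    recovered = {"ind0101": {1, 2, 3}, "ind0102": {1}}
    print(missing_data_fraction(recovered, total_loci=4))
```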
Conclusions
In the project corresponding to this Master Thesis:
1. We have developed ddRADseqTools, a software package that provides tools to design in
silico ddRADseq experiments with the following characteristics:
   - Selection of the restriction endonuclease pair best suited to the reference genome.
   - Fragments generation (library construction).
   - Generation of reads simulating high-throughput sequencing, with the possibility of
     including variation (both SNPs and indels), allele dropout and technical replicates.
   - Use of one or two indexes to identify individuals.
   - Use of DBR to identify PCR duplicates.
   - Quantification and removal of PCR duplicates.
   - Demultiplexing of reads by individual.
   - Trimming of reads.
2. We have validated the software by performing tests in order to ensure that the output
data are reliable, according to the corresponding design of the programs included in the
software package. The tests comprised ddRADseq simulations, and analyses of
fragments generation, PCR duplicates, the effect of the GC content and mutation patterns. We
have also written a pipeline for the alignment of simulated reads so they can be displayed
in a genome browser like IGV; and we have assessed the performance of ddRADseqTools.
3. The software package is efficient in terms of CPU and RAM usage. We ran the tests on
a computer whose main features were: Intel Core i5-4200U 1.6 GHz with Turbo Boost up
to 2.6 GHz; RAM 8 GiB; 5400 rpm disk. Therefore, ddRADseqTools can run on computers
with a standard CPU and RAM configuration.
4. Unlike other ddRADseq simulation tools, ddRADseqTools can process genomes of any size,
with both chromosome and scaffold sequences, and with good performance.
References
Baird NA, Etter PD, Atwood TS et al. (2008) Rapid SNP Discovery and Genetic Mapping Using
Sequenced RAD Markers. PLoS ONE, 3, e3376.
Blainey P, Krzywinski M, Altman N (2014) Points of Significance: Replication. Nature Methods,
11, 879-880.
Casbon JA, Osborne RJ, Brenner S, Lichtenstein CP (2011) A method for counting PCR template
molecules with application to next-generation sequencing. Nucleic Acids Research, 39, e81.
Catchen J, Hohenlohe PA, Bassham S, Amores A, Cresko WA (2013) Stacks: an analysis tool set
for population genomics. Molecular Ecology, 22, 3124-3140.
Chong Z, Ruan J, Wu CI (2012) Rainbow: an integrated tool for efficient clustering and
assembling RAD-seq reads. Bioinformatics, 28, 2732-2737.
DaCosta JM, Sorenson MD (2014) Amplification Biases and Consistent Recovery of Loci in a
Double-Digest RAD-seq Protocol. PLoS ONE, 9, e106713.
Davey JW, Blaxter ML (2010) RADSeq: next-generation population genetics. Briefings in
Functional Genomics, 9, 416-423.
Davey JW, Cezard T, Fuentes-Utrilla P (2013) Special features of RAD Sequencing data:
implications for genotyping. Molecular Ecology, 22, 3151-3164.
Davey JW, Hohenlohe PA, Etter PD et al. (2011) Genome-wide genetic marker discovery and
genotyping using next-generation sequencing. Nature Reviews Genetics, 12, 499-510.
Eaton DAR (2014) PyRAD: assembly of de novo RADseq loci for phylogenetic analyses.
Bioinformatics, 30, 1844-1849.
Etter PD, Bassham S, Hohenlohe PA, Johnson EA, Cresko WA (2011) SNP Discovery and
Genotyping for Evolutionary Genetics Using RAD Sequencing. Methods in Molecular Biology,
772, 157-178.
Engel SR, Dietrich FS, Fisk DG et al. (2014) The Reference Genome Sequence of
Saccharomyces cerevisiae: Then and Now. G3: Genes, Genomes, Genetics, 4, 389-398.
Lepais O, Weir JT (2014) SimRAD: an R package for simulation-based prediction of the number
of loci expected in RADseq and similar genotyping by sequencing approaches. Molecular
Ecology Resources, 14, 1314-1321.
Li H, Durbin R (2009) Fast and accurate short read alignment with Burrows–Wheeler
Transform. Bioinformatics, 25, 1754-1760.
Li H, Handsaker B, Wysoker A et al. (2009) The Sequence Alignment/Map format and
SAMtools. Bioinformatics, 25, 2078-2079.
Mastretta-Yanes A, Arrigo N, Alvarez N et al. (2015) Restriction site-associated DNA
sequencing, genotyping error estimation and de novo assembly optimization for population
genetic inference. Molecular Ecology Resources, 15, 28-41.
Mastretta-Yanes A, Zamudio S, Jorgensen TH et al. (2014) Gene Duplication, Population
Genomics, and Species-Level Differentiation within a Tropical Mountain Shrub. Genome
Biology and Evolution, 6, 2611-2624.
Miller MR, Dunham JP, Amores A, Cresko WA, Johnson EA (2007) Rapid and cost-effective
polymorphism identification and genotyping using restriction site associated DNA (RAD)
markers. Genome Research, 17, 240-248.
Neale DB, Wegrzyn JL, Stevens KA et al. (2014) Decoding the massive genome of loblolly pine
using haploid DNA and novel assembly strategies. Genome Biology, 15, R59.
Peterson BK, Weber JN, Kay EH, Fisher HS, Hoekstra HE (2012) Double Digest RADseq: An
Inexpensive Method for De Novo SNP Discovery and Genotyping in Model and Non-Model
Species. PLoS ONE, 7, e37135.
Quinlan AR, Hall IM (2010) BEDTools: a flexible suite of utilities for comparing genomic
features. Bioinformatics, 26, 841-842.
Robinson JT, Thorvaldsdóttir H, Winckler W et al. (2011) Integrative genomics viewer. Nature
Biotechnology, 29, 24-26.
Schweyen H, Rozenberg A, Leese F (2014) Detection and Removal of PCR Duplicates in
Population Genomic ddRAD Studies by Addition of a Degenerate Base Region (DBR) in
Sequencing Adapters. The Biological Bulletin, 227, 146-160.
Sovic MG, Fries AC, Gibbs HL (2015) AftrRAD: a pipeline for accurate and efficient de novo
assembly of RADseq data. Molecular Ecology Resources, 15, 1163-1171.
Thorvaldsdóttir H, Robinson JT, Mesirov JP (2013) Integrative Genomics Viewer (IGV): high-performance
genomics data visualization and exploration. Briefings in Bioinformatics, 14, 178-192.
Tin MMY, Rheindt FE, Cros E, Mikheyev AS (2015) Degenerate adaptor sequences for detecting
PCR duplicates in reduced representation sequencing data improve genotype calling accuracy.
Molecular Ecology Resources, 15, 329-336.
Venter JC, Adams MD, Myers EW et al. (2001) The Sequence of the Human Genome. Science,
291, 1304-1351.
Zimin A, Stevens KA, Crepeau MW et al. (2014) Sequencing and Assembly of the 22-Gb
Loblolly Pine Genome. Genetics, 196, 875-890.