
ddRADseqTools: Software package for

in silico simulation and testing of

double digest RADseq experiments

STUDENT: Fernando Mora Márquez

MASTER'S DEGREE IN BIOINFORMATICS AND COMPUTATIONAL BIOLOGY

ESCUELA NACIONAL DE SALUD - INSTITUTO DE SALUD CARLOS III

2014-2015

CENTER WHERE THE INTERNSHIP WAS CARRIED OUT: Grupo de investigación Genética,

Fisiología e Historia Forestal - E.T.S. Ingenieros de Montes - UPM

THESIS SUPERVISOR: Professor Dr. Unai López de Heredia Larrea

DATE: February 2016


I thank Professor Dr. Unai López de Heredia Larrea for his teaching

and his dedication in supervising this thesis.

I also thank Dr. Brent Emerson of the Instituto de

Productos Naturales y Agrobiología (IPNA-CSIC) for his explanations

of ddRADseq, and Víctor García Olivares, a member of Dr.

Emerson's team, for the tests carried out with ddRADseqTools.

To Henar, Teresa and María, for all your love.



Table of contents

Abstract
Introduction
Double Digest RAD Sequencing
Potential sources of error in ddRADseq experiments
PCR duplicates
Allele dropout
Technical replicates
Available ddRADseq simulation tools
Aims
Materials and Methods
Design of ddRADseqTools
Description of data files and programs
Ends file
Individuals file
Restriction sites file
Program rsitesearch.py
Program fragsgeneration.py
Program simddradseq.py
Program pcrdupremoval.py
Program indsdemultiplexing.py
Program readstrim.py
Program seqlocation.py
Methodology for ddRADSeqTools validation
Benchmark reference genomes
Validation experiments
Results and Discussion
Analysis of fragments generation
ddRADseq simulations
Analysis of PCR duplicates
Analysis of the effect of the GC content
Analysis of the mutation patterns
Pipeline for the alignment of simulated reads
Performance of ddRADseqTools
Comparative between ddRADseqTools and other ddRADseq simulation tools
Limitations and Future Prospects
Conclusions
References



Abstract

Double digest RADseq (ddRADseq) is a next-generation sequencing strategy that generates

reads from thousands of loci targeted by restriction enzyme cut sites, across multiple

individuals. To be statistically sound and economically affordable, a ddRADseq experiment requires

a preliminary design stage that needs to consider issues related to the selection of the enzyme

pair combination, particularities of the genome of the subject species, modifications of the

library construction, the coverage needed to avoid missing data, and the potential sources of

error that affect coverage.

In this Master Thesis, we present ddRADseqTools, a software package that performs in silico

ddRADseq simulations to help design ddRADseq experiments by testing hypotheses

related to inherent sources of bias. It covers in silico fragment generation, either at random

or from a reference genome; the construction of modified ddRADseq libraries using adapters

with either one or two indexes and Degenerate Base Regions (DBRs) for quantification of PCR

duplicates; and initial steps of the bioinformatics pre-processing of reads (quantification and

removal of PCR duplicates, demultiplexing of individuals and trimming of adapters from raw

reads). ddRADseqTools generates single-end (SE) or paired-end (PE) reads that may show

three types of mutations: SNPs, indels and mutations at the enzyme's recognition motif (i.e.

allele dropout). The resulting output files can be submitted to pipelines of alignment and

variant/genotype calling in order to allow a fine tuning of parameters, before in vitro data are

obtained from the laboratory of reference.

We validated ddRADseqTools with specific tests that accounted for double digested fragment

selection, generation of SE and PE reads with varying degree of polymorphism, and

implementation, quantification and removal of PCR duplicates. To validate the processes, we

used three benchmark genomes from species with contrasting characteristics (Saccharomyces

cerevisiae, Homo sapiens and Pinus taeda).

ddRADseqTools is efficient in terms of execution time, and can be run on computers

with standard CPU and RAM configurations.

Aims: 1) To develop a software package to perform in silico ddRADseq simulations in order to

help in the design of a ddRADseq experiment by testing hypotheses related to inherent

sources of bias.

2) To validate the software package to verify its proper design and functionality for

diverse genomes and under several scenarios.

3) To evaluate the cost-efficiency of the software package in terms of CPU and RAM

usage.

4) To compare the software package with other ddRADseq simulation tools.

Keywords: allele dropout, coverage, ddRADseq, genotyping, in silico simulation, PCR

duplicates.



Introduction

Double Digest RAD Sequencing

A restriction endonuclease, or restriction enzyme, is an enzyme that recognizes specific base

sequences in DNA, breaks down phosphodiester bonds, and cleaves the double helix at specific

sites. These sequences or motifs are known as restriction sites and contain four to eight

nucleotides. A characteristic of most restriction sites is that they have two-fold rotational

symmetry, i.e. they are palindromic sequences. Figure 1 shows the restriction sites of four

commonly used enzymes.

Figure 1. Restriction sites of EcoRI, PstI, SbfI and MseI. Circles represent the symmetry axes and arrows

indicate the cleavage sites.

For example, EcoRI recognizes the sequence GAATTC and cuts the DNA between the G and

the A. Figure 2 shows how EcoRI cleaves DNA, and the resulting fragments.

Figure 2. Action of EcoRI on the Chromosome IX (368744-368769) of S. cerevisiae.
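
The cleavage described above can be mimicked on a character string. The following is a minimal sketch and not the code of ddRADseqTools: the function name digest and the toy sequence are assumptions made for illustration, and only the strand written in the string is considered.

```python
def digest(sequence, motif, cut_offset):
    """Cut `sequence` at every occurrence of `motif`.

    `cut_offset` is the number of motif bases left of the cut
    (1 for EcoRI, which cuts GAATTC between the G and the first A).
    """
    fragments = []
    start = 0
    pos = sequence.find(motif)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(sequence[start:cut])
        start = cut
        pos = sequence.find(motif, pos + 1)
    fragments.append(sequence[start:])
    return fragments


# Toy sequence invented for this example (not the S. cerevisiae region of Figure 2).
toy = "TTACGGAATTCCGATGAATTCAA"
print(digest(toy, "GAATTC", 1))  # ['TTACGG', 'AATTCCGATG', 'AATTCAA']
```

Joining the fragments in order gives back the original sequence, which is a simple way to check that no base is lost at the cut sites.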

The capacity of restriction endonucleases to cleave DNA is the basis of Restriction-site

Associated DNA (RAD) markers, which are genetic markers identified by the sequence that is

recognized by a restriction endonuclease. Initially, RAD markers were used in microarrays



(Miller et al. 2007), and then in high-throughput sequencing technologies (Illumina) (Baird et

al. 2008).

RADseq (Baird et al. 2008; Peterson et al. 2012) is an NGS methodology that uses only one

restriction endonuclease. The resulting fragments are then sheared randomly. Fragments are

obtained on both sides of the cut site, and therefore reads will cover both sides. RADseq uses

an enzyme whose motif is rare, thus the number of fragments will not be very high, and will

therefore be tractable. RADseq technology has made it possible to genotype large numbers of

individuals (from a few to thousands), thereby detecting polymorphisms (mainly

SNPs but also indels) across the full genome (Baird et al. 2008; Davey & Blaxter 2010; Etter et

al. 2011; Davey et al. 2011; Davey et al. 2013; Mastretta-Yanes et al. 2014); therefore, it has

extraordinary potential for genetic mapping and population genetics studies in non-model

species when a reference genome is not available.

Several modifications of the RADseq technology exist. The most popular is Double Digest

Restriction-site Associated DNA (ddRAD) sequencing, or ddRADseq, which uses two restriction

endonucleases (Peterson et al. 2012). Usually, one of the enzymes has a rare motif while the

other enzyme has a common motif, in order to obtain a manageable number of fragments. The

enzyme combination of choice will be different depending on the size and structure of the

genome of the subject organism. The fragments produced by the ddRADseq platform are

flanked by the cut site of both enzymes. Frequently, only the fragments of a specific size are

selected to be sequenced. For instance, those fragments between 200 and 300 nucleotides are

purified in agarose gels after digestion, and then subjected to adapter ligation and PCR prior to Illumina

sequencing.
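
The selection logic of a double digest can be sketched as follows. This is an illustrative sketch under simplifying assumptions, not the implementation of ddRADseqTools: the function names, the toy sequence and the 10-20 nucleotide size window are invented for the example, and only one strand is digested. A fragment is kept when its two flanking cuts come from different enzymes and its length lies within the size window.

```python
def find_cuts(sequence, motif, offset, enzyme):
    """List (cut position, enzyme name) for every occurrence of `motif`."""
    cuts = []
    pos = sequence.find(motif)
    while pos != -1:
        cuts.append((pos + offset, enzyme))
        pos = sequence.find(motif, pos + 1)
    return cuts


def double_digest(sequence, enzymes, min_len, max_len):
    """Keep fragments flanked by one cut of each enzyme and sized within [min_len, max_len].

    `enzymes` is a list of (name, motif, cut offset) tuples.
    """
    cuts = []
    for enzyme, motif, offset in enzymes:
        cuts.extend(find_cuts(sequence, motif, offset, enzyme))
    cuts.sort()
    selected = []
    # Consecutive cuts delimit one fragment each.
    for (start, left), (end, right) in zip(cuts, cuts[1:]):
        if left != right and min_len <= end - start <= max_len:
            selected.append(sequence[start:end])
    return selected


# EcoRI (G^AATTC) as the rare cutter and MseI (T^TAA) as the common cutter.
enzymes = [("EcoRI", "GAATTC", 1), ("MseI", "TTAA", 1)]
toy = "AAAAGAATTCACGTACGTTTAACCCCTTAAGG"  # invented toy sequence
fragments = double_digest(toy, enzymes, 10, 20)
print(fragments)  # ['AATTCACGTACGTT']
```

The fragment between the two MseI sites is discarded because both of its ends were produced by the same enzyme.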

A comparison of RADseq and ddRADseq technologies (Peterson et al. 2012) is shown in Figure

3.

Figure 3. RADseq vs ddRADseq methodologies (Peterson et al. 2012). A) RAD sequencing; the library is

built from the fragments (in red) created by digestion of the genome with a restriction endonuclease coupled

with a random shearing; reads cover both sides of the cut site. B) ddRADseq; the fragments are generated

by digestion of the genome with two restriction endonucleases; selected fragments are those that are

flanked by the cut site of both enzymes, and that have a suitable size to build the library (in red). In this

figure, the fragments a and b are not selected because a is too short and b is too long. Two individuals are

represented in both examples.



Using ddRADseq, there is no need to perform whole genome sequencing, and thousands of

markers of many individuals are obtained by studying only a portion of the genome. Therefore,

it is an affordable sequencing technique to study non-model species. ddRADseq technology

allows the optimization of the number of genetic markers that can be obtained, and has

expanded the potential field of application of RADseq (Peterson et al. 2012). Thus, ddRADseq

can be used to perform pedigree or quantitative trait locus (QTL) mapping, population

structure assessment or phylogenetic studies (Figure 4).

Figure 4. Potential fields of application of RADseq vs ddRADseq methodologies (Peterson et al. 2012).

The fragments sequenced by ddRADseq consist of a genome insert between both restriction

sites, and two ends that include an adapter and a primer. A particular short sequence (index) is

attached to one or both ends to identify individuals, i.e. an index is a barcode that

distinguishes the fragment that belongs to a particular individual from the fragments of other

individuals (see Figure 5). If a second index is attached to the other end, the potential number of

individuals to sample increases considerably.

Figure 5. ddRADseq fragment. The fragment has a genome insert (in blue) between the restriction site of

the first enzyme (rse 1 in yellow) and the other restriction site (rse 2 in yellow). The ends have an adapter

and a primer (in grey). An index at the end bearing Adapter 1, and optionally another one at the second

end, are used to identify individuals.
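
The gain obtained with a second index is combinatorial: n sequences of index1 and m sequences of index2 can label up to n x m individuals. The sketch below illustrates this; the index sequences and the function name are hypothetical and were chosen only for the example.

```python
from itertools import product


def barcode_combinations(index1_list, index2_list=None):
    """Return every barcode (index1 alone, or an index1/index2 pair) available to label individuals."""
    if not index2_list:
        return [(index1,) for index1 in index1_list]
    return list(product(index1_list, index2_list))


# Hypothetical index sequences, invented for this example.
index1_list = ["GGTCTT", "CTGATT", "AAGACG"]
index2_list = ["ATCACG", "CGATGT"]

print(len(barcode_combinations(index1_list)))               # 3
print(len(barcode_combinations(index1_list, index2_list)))  # 6
```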



Potential sources of error in ddRADseq experiments

Many sources of error have been identified in RADseq experiments. Mastretta-Yanes et al.

(2015) classified these sources of error as technical, human, wet-laboratory, inherent to the

high-throughput sequencing technique, or bioinformatic. In the present software

we have considered two of the main potential sources of error in ddRADseq

experiments that strongly reduce coverage, and have parameterized them

for the simulation of ddRADseq read files: PCR duplicates and allele dropout. Other

sources of error, such as the presence of paralogous sequences, arise implicitly when

the genomes of the subject species show high repetitive content.

In addition, one way to improve the accuracy of a ddRADseq experiment is to include technical

replicates. Therefore, we have also taken this point into consideration when simulating

ddRADseq read files.

PCR duplicates

PCR duplicates are identical copies of the same template fragment that arise during the PCR

stage of NGS experiments. Amplification increases the number of available

molecules for sequencing but changes the representation of the template molecules in the

amplified product and introduces random errors (Casbon et al. 2011). In the case of ddRADseq,

this type of error originates during the PCR amplification of the adapter-ligated fragments prior to

Illumina sequencing (Schweyen et al. 2014; Tin et al. 2015). The presence of PCR duplicates

implies a loss of coverage and can lead to genotype/variant calling errors. Whenever possible,

PCR duplicates should be removed during the analysis of an experiment.

In order to detect PCR duplicates, a recently developed technique (Schweyen et al. 2014) uses

a degenerate base region (DBR) in one of the two ends (see Figure 6). The DBR is a short

sequence of a few nucleotides ligated during library construction that operates as a molecular

counter to estimate the number of template molecules in the PCR associated with each

variant. After ligation, each fragment in the library incorporates a particular sequence chosen

from all the possible DBR sequences. The counter can be used to determine whether a

putative variant is associated with a single template molecule or, alternatively, multiple

template molecules and hence the probability that it derives from a polymerase error or true

variant (Casbon et al. 2011). From a bioinformatics point of view, duplicate reads can be

identified and removed because they bear the same insert sequence and the same DBR.

Figure 6. The addition of a DBR at one end is used to identify PCR duplicates.
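
The principle can be sketched in a few lines. The code below is a simplified illustration and not the algorithm of pcrdupremoval.py: reads are represented as hypothetical (DBR, insert) pairs, and two reads are treated as PCR duplicates when both elements are identical.

```python
def remove_pcr_duplicates(reads):
    """Keep the first read of every (DBR, insert) pair and count the discarded copies."""
    seen = set()
    unique_reads = []
    for dbr, insert in reads:
        if (dbr, insert) not in seen:
            seen.add((dbr, insert))
            unique_reads.append((dbr, insert))
    return unique_reads, len(reads) - len(unique_reads)


# Hypothetical reads as (DBR, genome insert) pairs.
reads = [
    ("ACGT", "AATTCGGCTA"),
    ("ACGT", "AATTCGGCTA"),  # same DBR and same insert: PCR duplicate
    ("TTGA", "AATTCGGCTA"),  # same insert, different DBR: distinct template molecule
    ("ACGT", "AATTCTTTAG"),
]
unique_reads, duplicates = remove_pcr_duplicates(reads)
print(len(unique_reads), duplicates)  # 3 1

# A DBR with k fully degenerate positions can take 4 ** k different values.
print(4 ** 4)  # 256
```

The number of possible DBR sequences bounds how many template molecules of the same locus can be told apart, which is why the length of the DBR matters when the expected coverage per locus is high.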



Allele dropout

Allele dropout occurs at a locus when one or more alleles are not present in the

sequencing reads. This problem can be caused by (Mastretta-Yanes et al. 2015): 1) a mutation in a

restriction enzyme recognition site; 2) wet laboratory errors, e.g. an exposure to UV light; 3)

bioinformatics issues, e.g. removal of low-coverage reads in samples with a wide range of

coverage per locus. Allele dropout decreases the accuracy of genotyping because the affected

allele is not detected.

Technical replicates

Replicates are used to detect and identify sources of variation in measurements, and limit the

effect of spurious variation on hypothesis testing and parameter estimation (Blainey et al.

2014). Replicates can be: 1) technical replicates, in which the biological material is

the same in each replicate; they can be used to calculate the variability of measurements

and to detect technical errors, and are recommended in ddRADseq experiments; 2) biological

replicates, whose biological material comes from different samples; they

can be used to estimate the variability between individuals of a population. In ddRADseq

experiments, biological replicates correspond to individuals.

Available ddRADseq simulation tools

Few software tools related to ddRADseq in silico simulation are currently available. Some of

them are listed below.

SimRAD (Lepais 2014; https://cran.r-project.org/web/packages/SimRAD/):

SimRAD is an R package that provides functions to simulate restriction endonuclease digestion

and fragment selection for a ddRADseq experiment. A reference genome or randomly

generated DNA sequences can be the input to the digestion process. This utility does not

consider read generation.

BU-RAD-seq (DaCosta & Sorenson 2014; https://github.com/BU-RAD-seq):

BU-RAD-seq has two utilities. One utility is Digital_RADs, a Python 3 program that performs the

digestion of a genome with one or two enzymes. It requires the motifs and the length of the

down/upstream sequence (one enzyme) or the lower and upper sizes of the fragment (two

enzymes). This utility does not consider individual identification, PCR duplicates or mutations.

ddRAD-seq-Pipeline, the other utility, is a set of Python 3 programs that processes double digest RAD

sequences in order to genotype the samples. Individual identification, PCR duplicates and

mutations are not considered.



simRRLs (http://dereneaton.com/software/simRRLs/):

simRRLs is a Python 2 program that can be used to randomly simulate RADseq-like sequence

data on a fixed species tree topology under a coalescent model. It is not possible to use a

reference genome. It supports various types of RADseq: SE RAD, SE ddRAD, PE ddRAD, PE

ddRAD w/ merged reads, etc. An index is used to identify individuals and the DBR sequence is

not considered. It accepts various arguments: length of simulated sequences, number of loci to

sample, individuals from each taxon, restriction sites, mutation rate, indel rate, existence of

allele dropout, etc. It has been used to test PyRAD, a ddRADseq pipeline (Eaton 2014).



Aims

The importance of a robust design of ddRADseq experiments to save time and money, and the

lack of tools for in silico testing of hypotheses related to inherent sources of bias (PCR

duplicates and allele dropout), led to the definition of the following specific aims for this

Master Thesis:

Aim 1.

To develop a software package to perform in silico ddRADseq simulations in order to help in

the design of a ddRADseq experiment by testing hypotheses related to inherent sources of

bias.

Aim 2.

To validate the software package to verify its proper design and functionality for diverse

genomes and under several scenarios. The validations must ensure that the output data are

reliable, according to the corresponding design of the programs included in the software

package.

Aim 3.

To evaluate the cost-efficiency of the software package in terms of CPU and RAM usage.

Aim 4.

To compare the software package with other ddRADseq simulation tools.



Materials and Methods

Design of ddRADseqTools

ddRADseqTools is a set of programs, data files and configuration files for the design and in

silico testing of ddRADseq experiments (Aim 1). The programs that form part of

ddRADseqTools aim to meet the following scopes:

Scope A.

Simulation of fragments produced by a double digest with a given pair of restriction

endonucleases. The fragments can be obtained from a reference genome if available; or they

can be randomly generated. Each fragment corresponds to a locus.

Scope B.

Generation of mutations within fragments of each individual, including SNPs and indels, and

the possibility of allele dropout. The number and type of mutations across the simulated

reads are determined according to user-defined probabilities. The maximum number of

mutated positions in one fragment is defined by the user. The location of the mutations in the

fragment is assigned at random, but is conserved across individuals. At present, only the Jukes-

Cantor model of sequence evolution is implemented.
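
Under the Jukes-Cantor model, every base is replaced by any of the other three with the same probability. The following sketch illustrates how a SNP could be drawn under this assumption, with the mutated positions of a locus chosen once so that they are shared by all individuals. It is not the code of simddradseq.py; the function names, the toy locus and the seed are assumptions made for the example.

```python
import random

BASES = "ACGT"


def choose_mutation_positions(fragment_length, max_mutations, rng):
    """Draw the mutated positions of a locus once, so that all individuals share them."""
    count = rng.randint(1, max_mutations)
    return sorted(rng.sample(range(fragment_length), count))


def mutate_snp(sequence, position, rng):
    """Substitute the base at `position` by one of the other three, each equally likely."""
    alternatives = [base for base in BASES if base != sequence[position]]
    return sequence[:position] + rng.choice(alternatives) + sequence[position + 1:]


rng = random.Random(2016)  # fixed seed so that the example is repeatable
locus = "AATTCGGCTAGGCATT"  # invented toy locus
positions = choose_mutation_positions(len(locus), 3, rng)
allele = locus
for position in positions:
    allele = mutate_snp(allele, position, rng)
print(positions, allele)
```

Indels and mutations in the restriction site would need additional functions; they are left out to keep the sketch short.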

Scope C.

Simulation of single-end (SE) or paired-end (PE) reads. Reads are simulated according to the

following points:

• The number of reads per locus is calculated by dividing the total number of reads to

generate by the number of loci to sample.

• Individuals have two fragment sequences per locus, i.e. two alleles. They are assigned

randomly and can be mutated or non-mutated depending on a probability. Allele

dropout at a locus is decided randomly and independently for each allele.

For the affected loci and individuals, no reads are generated.

• The GC ratio of each fragment is considered as a factor that controls the probability

of producing PCR duplicates. Digested fragments with higher GC ratio have a higher

probability of producing PCR duplicates than those with lower GC ratio.
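
Two of the quantities mentioned in the points above are simple to compute. The sketch below shows the number of reads per locus and the GC ratio of a fragment; the use of integer division and the function names are assumptions of this example, and the way in which the GC ratio is converted into a probability of PCR duplication is deliberately not reproduced here.

```python
def reads_per_locus(total_reads, loci):
    """Number of reads assigned to each locus (integer division is an assumption)."""
    return total_reads // loci


def gc_ratio(fragment):
    """Fraction of G and C nucleotides in a fragment."""
    fragment = fragment.upper()
    return (fragment.count("G") + fragment.count("C")) / len(fragment)


print(reads_per_locus(300000, 3000))  # 100
print(gc_ratio("GGCCAATT"))           # 0.5
```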

Scope D.

Flexibility to configure raw read ends: user defined adapters, Illumina or ad hoc PCR primers,

indexes at both ends of the read, and DBRs, according to the needs of the

experiment and the sequencing platform. As several modifications of the ddRADseq library

construction methodology exist (Peterson et al. 2012; Mastretta-Yanes et al. 2015; Schweyen

et al. 2014; Tin et al. 2015), this version of ddRADseqTools implements four of these

techniques:



• Only one index is used to identify the individuals: this is the original ddRADseq

methodology (Peterson et al. 2012). The sequence of the end corresponding to

Adapter 1 includes a single index1 sequence (Peterson et al. 2012). The index2

sequence and the DBR sequence are not considered.

• One index is used to identify individuals, and DBRs are used to quantify PCR

duplicates: the sequence of the end corresponding to Adapter 1 includes an

index1 sequence and a DBR sequence (Schweyen et al. 2014; Tin et al. 2015).

• Two indexes are used to identify the individuals: the sequence of the end

corresponding to Adapter 1 includes an index1 sequence, and the sequence of the

end corresponding to Adapter 2 includes an index2 sequence (Peterson et al.

2012; Mastretta-Yanes et al. 2015). The DBR sequence is not considered.

• Two indexes are used to identify individuals and DBRs are used to quantify PCR

duplicates: the sequence of the end corresponding to Adapter 1 includes an

index1 sequence; the sequence of the end corresponding to Adapter 2 includes an

index2 sequence; and a DBR sequence is included at the end of either Adapter 1 or

Adapter 2 (Schweyen et al. 2014; Tin et al. 2015).

The indexes and DBR can have any size and be located in any position of the adapters.

Scope E.

Quantification and removal of PCR duplicates. PCR duplicates can strongly reduce

coverage and may inflate the percentage of missing data. When using the DBR

strategy, PCR duplicates can be quantified.

Scope F.

Demultiplexing of reads by individual. Reads need to be separated by individual, in order to

build the individuals' genotypes, and to check for the presence of paralogous sequences (see

Mastretta-Yanes et al. 2015).

Scope G.

Trimming of reads. The adapters, primers, indexes and DBRs must be removed from raw reads

in order to use trimmed reads for alignment and variant calling.

ddRADseqTools was programmed in Python 3 (version 3.4 or higher is required), and runs on

any computer with an OS that supports Python 3: Linux/Unix, Microsoft Windows, Mac OS X,

among others. The only dependencies required to run this software package are the NumPy

and matplotlib libraries. The software package, version 0.36, is attached to this document

(ddRADseqTools-0.36.zip). Within the ddRADseqTools.zip file, there is a manual that describes

how to install ddRADseqTools, and the way to operate with each program. Appendix A (see file

supplementary.pdf) contains the complete list of the files of ddRADseqTools.

A flow chart of the programs contained in ddRADseqTools is shown in Figure 7. The work-flow

has the usual three steps in an NGS experiment:

1) Library construction/in silico fragments generation: A file of fragments is generated

from a reference genome by rsitesearch.py; or fragment sequences are simulated



randomly with fragsgeneration.py. If the genome-guided version of the

software is used (rsitesearch.py), a particular pair of restriction endonucleases has to

be specified and their action within the genome is simulated (Scope A).

2) High Throughput Sequencing / Generation of reads: Raw reads are generated by

simddradseq.py. This program handles a wide range of parameters, such as

single-end (SE) or paired-end (PE) read types, configuration of the ends of the raw

reads, number of reads, size of the fragments, mutation probability, or PCR duplicates

probability (Scopes B, C and D). The software was designed having Illumina's SE and PE

read files in mind, but it can be used to simulate read files from other sequencing

platforms (Roche 454, PacBio, Helicos, etc.).

3) Bioinformatics pre-processing of reads: This step can be split in three sub-steps:

3.1) Quantification and removal of PCR duplicates: The PCR duplicates of the

raw reads are quantified and some statistics are computed with

pcrdupremoval.py. Also, this program generates read files without the

duplicated reads (Scope E).

3.2) Demultiplexing of individuals: Joint raw reads are demultiplexed by

indsdemultiplexing.py to obtain separate individual read files (Scope F).

3.3) Trimming raw reads: The program readstrim.py removes the adapters and

other sequences from raw reads so that the alignment of reads and the variant

calling step can be performed correctly (Scope G).

Table 1. Parallelism between in vitro and in silico experiments and ddRADseqTools programs.

In vitro experiments | In silico experiments | ddRADseqTools program
Library construction | In silico fragments generation | rsitesearch.py (w/genome), fragsgeneration.py (random)
High-Throughput Sequencing | Generation of reads | simddradseq.py
Bioinformatics pre-processing of reads | Quantification and removal of PCR duplicates | pcrdupremoval.py
Bioinformatics pre-processing of reads | Demultiplexing of individuals | indsdemultiplexing.py
Bioinformatics pre-processing of reads | Trimming of raw reads | readstrim.py

The output files of this work-flow are ready to be submitted to alignment utilities, such as BWA

(Li & Durbin 2009), or to ddRADseq analysis pipelines, such as Rainbow (Chong et al. 2012),

STACKS (Catchen et al. 2013), PyRAD (Eaton 2014) or AftrRAD (Sovic et al. 2015). Doing so

allows in silico tuning of the parameters used by the pipeline before the in vitro data are

obtained.

Three data files are required: 1) an ends file that is used to design the sequence ends, with the

corresponding adapters, primers, indexes and DBRs; 2) a file of individuals that contains the

sequences that identify the individuals; and 3) a file of restriction sites that holds the

restriction sites recognition motifs and indicates their cut sites.

The default parameters to run each program are stored in specific configuration files. These

options can be modified by simply editing the configuration file, or on the command line.



Figure 7. Flow-chart of ddRADseqTools.



Description of data files and programs

A detailed description of the data files and programs included in ddRADseqTools is shown

below.

Ends file

The file ends.txt contains the end sequences of the raw reads (5'->3'), as defined by the user.

The end sequences are composed of adapters, primers, indexes and DBRs. A read has two

ends: one for Adapter 1, and a second one for Adapter 2.

The record format of the ends file has two fields (Figure 8): 1) end identification; and 2) end

sequence. Both fields must be separated by a semicolon.

Figure 8. Example of the content of an ends file. 1 represents a nucleotide of index1; 2 represents a nucleotide

of index2; 3 represents a nucleotide of the DBR.
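
The record format described above (end identification and end sequence separated by a semicolon, with the digits 1, 2 and 3 acting as placeholders) can be read with a few lines of code. This is a hedged sketch, not the parser of ddRADseqTools: the function names are invented, and the two records are hypothetical sequences that only reuse the identifiers end21 and end22.

```python
def parse_ends(lines):
    """Read 'end_id;sequence' records into a dictionary, skipping blank lines."""
    ends = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        end_id, sequence = (field.strip() for field in line.split(";", 1))
        ends[end_id] = sequence
    return ends


def placeholder_lengths(sequence):
    """Count the placeholder positions: 1 = index1, 2 = index2, 3 = DBR."""
    return {
        "index1": sequence.count("1"),
        "index2": sequence.count("2"),
        "dbr": sequence.count("3"),
    }


# Hypothetical records: the nucleotides are invented, only the layout matters.
records = ["end21;AAGGTC111111CC3333", "", "end22;TTCCGA222222GG"]
ends = parse_ends(records)
print(placeholder_lengths(ends["end21"]))  # {'index1': 6, 'index2': 0, 'dbr': 4}
```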

Figure 9 shows an example of how a simulated NGS fragment can be assembled using the data

of ends file.

Figure 9. Example of fragment assembly. The sequence is formed by end 1 (which corresponds to the ID end21

displayed in Figure 8), which contains a first index and a DBR; end 2 (which corresponds to the ID end22 displayed in

Figure 8), which bears a second index; the cuts performed by EcoRI and MseI; and the genome insert, which is

represented by question marks.

Individuals file

The file individuals.txt is the individuals file. It contains the sequences that identify each

individual in the experiment. Either one or two indexes can be used to identify the individual.



This file can be easily written with a text editor or spreadsheet. The record format of

the individuals file has five fields (Figure 10): 1) individual ID; 2) replicated individual ID (if technical

replicates are included in the experiment) or NONE (if no technical replicate is considered); 3)

population ID (this is not operative in the present version of the program); 4) sequence of

index1 corresponding to Adapter 1; and 5) sequence of index2 corresponding to Adapter 2

(optional). The fields must be separated by a semicolon.

Figure 10. Example of an individuals file. Only some individuals are displayed. The individual identified by

ind0206 is a technical replicate of ind0201.
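
The five-field record of the individuals file lends itself to a small parser that also builds the lookup table needed to assign a read to an individual from its index sequences. The sketch below is illustrative only: the function names, the population ID and the index sequences are hypothetical, while the identifiers ind0201 and ind0206 are taken from Figure 10.

```python
from collections import namedtuple

Individual = namedtuple(
    "Individual", "ind_id replicated_id population_id index1 index2"
)


def parse_individuals(lines):
    """Read semicolon-separated records; index2 is optional and defaults to ''."""
    individuals = []
    for line in lines:
        fields = [field.strip() for field in line.strip().split(";")]
        if len(fields) < 4 or not fields[0]:
            continue  # skip blank or incomplete records
        index2 = fields[4] if len(fields) > 4 else ""
        individuals.append(Individual(fields[0], fields[1], fields[2], fields[3], index2))
    return individuals


def barcode_lookup(individuals):
    """Map each (index1, index2) pair to the ID of the individual it identifies."""
    return {(ind.index1, ind.index2): ind.ind_id for ind in individuals}


# Hypothetical records; ind0206 is declared as a technical replicate of ind0201.
records = [
    "ind0201;NONE;pop01;GGTCTT;ATCACG",
    "ind0206;ind0201;pop01;CTGATT;ATCACG",
]
individuals = parse_individuals(records)
lookup = barcode_lookup(individuals)
print(lookup[("CTGATT", "ATCACG")])  # ind0206
```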

Restriction sites file

The file restrictionsites.txt contains the restriction sites recognition motifs and identifies their

cut sites.

The record format of the restriction sites file has two fields (Figure 11): 1) the ID of the restriction

endonuclease; and 2) the sequence of the restriction site. Both fields must be separated by a

semicolon.

The file restrictionsites.txt included in the present version of the software contains more than

60 widely used restriction endonucleases. However, the user can add data for new enzymes

at the end of the file, by simply editing it with a text editor.

Figure 11. Example of a restriction sites file. A cut site is represented by an asterisk in the sequence.
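
The record format can be illustrated with a short Python sketch that splits a record into the enzyme ID, the recognition motif and the cut offset marked by the asterisk. The function name parse_restriction_site is hypothetical, and the example records are written from the well-known cut positions of EcoRI (G^AATTC) and MseI (T^TAA), not copied from the file shipped with the package.

```python
def parse_restriction_site(record):
    """Return (enzyme ID, motif without asterisk, cut offset in the motif)."""
    enzyme_id, site = (field.strip() for field in record.strip().split(";"))
    cut_offset = site.index("*")          # raises ValueError if no cut is marked
    motif = site.replace("*", "").upper()
    return enzyme_id, motif, cut_offset


ecori = parse_restriction_site("EcoRI;G*AATTC")
msei = parse_restriction_site("MseI;T*TAA")
```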



Program rsitesearch.py

This program locates the motifs of the restriction sites and performs an in silico double digestion of a genome. It first simulates the digest of the reference genome sequence as it appears in the genome file (Watson strand). Then the reverse complement (Crick strand) is obtained, and a second digest simulation is performed on it.

The output is a FASTA file with the resulting fragments (Figure 13). The header of each FASTA record shows the following information: fragment number, length of the fragment, GC rate, strand, start position in the locus, end position in the locus, and description of the chromosome or scaffold.
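
The two-pass digest described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the implementation of rsitesearch.py: all function names are hypothetical, each site is given as a (motif, cut offset) pair, and a fragment is kept only when it starts at a cut of the first enzyme and ends at the next cut, provided that this cut belongs to the second enzyme. The actual selection criterion of the program may differ.

```python
import re

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def cut_positions(seq, motif, offset):
    """Positions where the enzyme cuts: motif start plus cut offset."""
    # The lookahead also finds overlapping occurrences of the motif.
    pattern = "(?=" + re.escape(motif) + ")"
    return [match.start() + offset for match in re.finditer(pattern, seq)]


def double_digest(seq, site1, site2, min_size, max_size):
    """Fragments between an enzyme 1 cut and an adjacent enzyme 2 cut
    (assumed criterion), filtered by size."""
    cuts = sorted(
        [(pos, 1) for pos in cut_positions(seq, *site1)]
        + [(pos, 2) for pos in cut_positions(seq, *site2)]
    )
    fragments = []
    for (start, enz_a), (end, enz_b) in zip(cuts, cuts[1:]):
        if (enz_a, enz_b) == (1, 2) and min_size <= end - start <= max_size:
            fragments.append((start, end, seq[start:end]))
    return fragments


def digest_both_strands(seq, site1, site2, min_size, max_size):
    """Digest the Watson strand, then the Crick strand (reverse complement)."""
    watson = double_digest(seq, site1, site2, min_size, max_size)
    crick = double_digest(reverse_complement(seq), site1, site2,
                          min_size, max_size)
    return [("+",) + f for f in watson] + [("-",) + f for f in crick]


# Toy sequence: one EcoRI site followed by one MseI site.
toy = "CCGAATTCACGTACGTTTAAGG"
fragments = digest_both_strands(toy, ("GAATTC", 1), ("TTAA", 1), 1, 100)
```

Under this criterion, the second pass on the Crick strand recovers the fragments in which the second enzyme site precedes the first one on the Watson strand, which would otherwise be missed.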

The program also provides statistics on the generated fragments, summarized by fragment size intervals, and a plot showing the distribution of fragments by size interval (see chapter Results and Discussion). For each interval, the number of fragments, the percentage of fragments relative to the total, and the number of fragments that contain undetermined nucleotides are calculated (see Figure 12 for an example of such statistics).

The statistics output is also generated in CSV format, which can be easily imported into general purpose programs such as R, LibreOffice Calc or Microsoft Excel.

Figure 12. Example of statistics generated by the program rsitesearch.py. The table shows the output results

for the first sixteen size intervals of the fragments produced by a double digest of P. taeda with SbfI and MseI.

The data shown for each interval are the number of fragments, the percentage of fragments relative to the

total, and the number of fragments that contain undetermined nucleotides.
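
The three values reported per interval can be computed with a short Python sketch such as the following. It is an illustration only: the function name interval_statistics is hypothetical, and the interval boundaries (1-25, 26-50 and so on for an interval length of 25) are an assumption that may not match the boundaries used by rsitesearch.py.

```python
from collections import Counter


def interval_statistics(fragments, interval=25):
    """Return (low, high, count, percentage, count with N) per size interval."""
    totals, with_n = Counter(), Counter()
    for seq in fragments:
        # Assumed boundaries: 1-25, 26-50, ... for interval=25.
        low = (len(seq) - 1) // interval * interval + 1
        totals[low] += 1
        if "N" in seq.upper():
            with_n[low] += 1
    total = len(fragments)
    return [
        (low, low + interval - 1, totals[low],
         100.0 * totals[low] / total, with_n[low])
        for low in sorted(totals)
    ]


toy_fragments = ["A" * 25, "A" * 26, "AN" * 20, "C" * 50]
stats = interval_statistics(toy_fragments)
```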



The input and output files of rsitesearch.py are shown in Figure 7, as well as the position of this

program within the processes flow of ddRADseqTools.

The options of the program are detailed in Table 2.

Table 2. rsitesearch.py options.

Option Default value Comment

genfile ./genome.fna Path of the reference genome file in FASTA format or .gz format (compressed).

fragsfile ./fragments.fasta Path of the output fragments file.

rsfile ./restrictionsites.txt Path of the input restriction sites file.

enzyme1 EcoRI Name of the first restriction enzyme used in rsfile. Instead of the name, the restriction site sequence is allowed, e.g. EcoRI and GAATTC are equivalent.

enzyme2 MseI Name of the second restriction enzyme used in rsfile. Instead of the name, the restriction site sequence is allowed, e.g. MseI and TTAA are equivalent.

minfragsize (*) 201 Lower fragment loci size.

maxfragsize (*) 300 Upper fragment loci size.

fragstfile ./genome-statistics.txt Output statistics file.

fragstinterval 25 Interval length of fragment size for the output statistics.

(*) During library construction in ddRADseq experiments it is very common to select only the fragments within a particular size range (usually 100-400 bp). This size interval can be set here.

Program fragsgeneration.py

This program generates random fragments simulating a double digestion of a genome, and

writes them to a FASTA file. It is useful when rsitesearch.py cannot be used because there is no

reference genome.

The output of the program has the same format as that provided by rsitesearch.py.

The input and output files of fragsgeneration.py are shown in Figure 7 as well as the position

of this program within the processes flow of ddRADseqTools.


Table 3. fragsgeneration.py options.


fragsfile ./fragments.fasta Path of the output fragments file.


enzyme1 EcoRI




enzyme2 MseI


fragsnum 10000 Number of fragments to generate.

minfragsize (*) 201 Lower fragment loci size.

maxfragsize (*) 300 Upper fragment loci size.

fragstfile ./genome-statistics.txt Output statistics file.

fragstinterval 25 Interval length of fragment size for the output statistics.

(*) During library construction in ddRADseq experiments it is very common to select only the fragments within a particular size range (usually 100-400 bp). This size interval can be set here.

Program simddradseq.py

This program builds SE or PE simulated read files from a virtual library of a ddRADseq

experiment in FASTQ/FASTA format.

The input fragments can be obtained in two ways:

1) Using a reference genome via rsitesearch.py.

2) Randomly via fragsgeneration.py.

Read mutations are generated probabilistically in the following steps:

1) Each fragment resulting from a double digest simulation is considered to be a locus. Fragments are picked at random.

2) The presence of mutations in a locus is decided according to a user defined probability. If the locus is mutated, a mutated sequence of the fragment is generated. The number of mutations within a fragment is randomly chosen between 1 and a maximum number of mutations defined by the user. The type of each polymorphism (SNP or indel) is determined by a user defined probability of a mutation being an indel. Indels have a user defined maximum size.

3) Next, an individuals database is created. One fragment sequence, mutated or not, is randomly assigned to each chromosome of each individual, i.e. the individuals are assumed to be diploid. In addition, a mark indicates whether the individual shows allele dropout in the locus corresponding to the fragment, according to a user defined probability. Technical replicates have the same fragment sequences as the individuals that they replicate.
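
Step 2 can be sketched in Python as shown below. This is a hedged illustration rather than the code of simddradseq.py: the function name mutate_fragment is hypothetical, the parameter names mirror the options mutprob, locusmaxmut, indelprob and maxindelsize, and the even split between insertions and deletions is an assumption that the text does not specify.

```python
import random


def mutate_fragment(seq, mutprob, locusmaxmut, indelprob, maxindelsize, rng):
    """Return the fragment, mutated with probability mutprob."""
    if rng.random() >= mutprob:
        return seq                        # the locus is not mutated
    mutated = seq
    for _ in range(rng.randint(1, locusmaxmut)):
        if not mutated:                   # nothing left to mutate
            break
        pos = rng.randrange(len(mutated))
        if rng.random() < indelprob:
            size = rng.randint(1, maxindelsize)
            if rng.random() < 0.5:        # assumption: insertion half the time
                insert = "".join(rng.choice("ACGT") for _ in range(size))
                mutated = mutated[:pos] + insert + mutated[pos:]
            else:                         # deletion
                mutated = mutated[:pos] + mutated[pos + size:]
        else:                             # SNP: substitute a different base
            base = rng.choice([b for b in "ACGT" if b != mutated[pos]])
            mutated = mutated[:pos] + base + mutated[pos + 1:]
    return mutated


rng = random.Random(2016)
locus = "ACGT" * 10
snp_allele = mutate_fragment(locus, 1.0, 1, 0.0, 3, rng)
```

Passing the random generator as an argument keeps the simulation reproducible when a seed is fixed.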

This program generates SE or PE reads of user defined length in FASTQ or FASTA format. Figure 14 shows an example of PE reads in FASTQ format. The header of each FASTA/FASTQ record shows the following information: read number, fragment number, read number in the fragment, trace of a mutation in the read sequence, individual ID, index1 sequence, and index2 sequence.

The theoretical number of reads per locus is calculated by dividing the number of reads to generate by the number of loci to sample. The actual number of reads of each locus is drawn at random from a range that contains the theoretical number.
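
A possible reading of this calculation is sketched below in Python. It assumes that the range is obtained by scaling the theoretical number with a factor drawn uniformly between the options minreadvar and maxreadvar; the text does not state the distribution, so both the uniform draw and the function name reads_per_locus are assumptions.

```python
import random


def reads_per_locus(readsnum, locinum, minreadvar, maxreadvar, rng):
    """Draw a read count for each locus around the theoretical number."""
    theoretical = readsnum / locinum
    return [
        round(theoretical * rng.uniform(minreadvar, maxreadvar))
        for _ in range(locinum)
    ]


rng = random.Random(48)
variable = reads_per_locus(10000, 100, 0.8, 1.2, rng)   # default option values
uniform = reads_per_locus(10000, 100, 1.0, 1.0, rng)    # same count everywhere
```

With the default values of Table 4 (readsnum 10000, locinum 100), every locus receives between 80 and 120 reads under this assumption.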



The reads of a locus are generated in a loop in which the individual and the corresponding fragment sequence of each read are chosen at random. Individuals that have the allele dropout mark in this fragment are not considered.

The generation of reads in a locus also takes into account whether the locus presents PCR duplicates, according to a user defined probability that is weighted by the GC factor of the fragment. When the probability drawn for the fragment is below this weighted theoretical probability, the number of duplicates per read is sampled from a distribution whose probability decays monotonically from one duplicate up to an upper boundary for the number of PCR duplicates. That is, few PCR duplicates are more likely than many.
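
The sampling of the number of duplicates can be illustrated with the following Python sketch. The text only states that the probability decays monotonically, so the 1/k weights used here are an assumption, as are the names sample_duplicates and max_dups. The GC weighting is not reproduced: the already weighted probability is taken as an input.

```python
import random


def sample_duplicates(weighted_prob, max_dups, rng):
    """Return 0, or a number of PCR duplicates between 1 and max_dups."""
    if rng.random() >= weighted_prob:
        return 0                                   # no PCR duplicates
    counts = list(range(1, max_dups + 1))
    weights = [1.0 / k for k in counts]            # assumed decay: 1, 1/2, 1/3...
    return rng.choices(counts, weights=weights, k=1)[0]


rng = random.Random(7)
draws = [sample_duplicates(1.0, 5, rng) for _ in range(5000)]
```

Any other monotonically decreasing weights (for example geometric) would fit the description equally well.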

The coverage is controlled by setting the number of loci, the number of individuals (given by the individuals file), and the number of reads of the library. Coverage may be unequal among loci and individuals, ranging between two user defined values (options minreadvar and maxreadvar). If uniform coverage is desired, both options should be set to 1.

Four techniques of building libraries are implemented:

1) IND1: An index sequence is inserted in the end where Adapter 1 is (Peterson et al.

2012).

2) IND1_DBR: An index sequence and a DBR are inserted in the end where Adapter 1 is (Schweyen et al. 2014; Tin et al. 2015).

3) IND1_IND2: In addition to the index sequence in Adapter 1, another index sequence is

inserted in the end where Adapter 2 is (Peterson et al. 2012; Mastretta-Yanes et al.

2015).

4) IND1_IND2_DBR: This technique uses DBRs in addition to the index sequences. The

DBR sequence is generated randomly (Schweyen et al. 2014; Tin et al. 2015).

Figure 7 shows the input and output files of simddradseq.py, as well as the position of this program within the processes flow of ddRADseqTools.



Table 4. simddradseq.py options.


fragsfile ./fragments.fasta Path of the input fragments file.

technique IND1_IND2_DBR Four methodologies are available: IND1 (an index sequence in Adapter 1), IND1_DBR (an index sequence + a DBR in Adapter 1), IND1_IND2 (an index sequence in Adapter 1 + an index sequence in Adapter 2) and IND1_IND2_DBR (an index sequence in Adapter 1 + an index sequence in Adapter 2 + a DBR).

format FASTQ Format of the output file: FASTQ or FASTA.

readsfile ./reads Path of the output read file (without extension).

readtype PE Read type: SE or PE.




enzyme1 EcoRI


enzyme2 MseI


endsfile ./ends.txt Path of the input end sequences file.

index1len 6 Index sequence length in Adapter 1.

index2len 6 Index sequence length in Adapter 2 (it must be 0 when the technique is IND1 or IND1_DBR).

dbrlen 4 DBR length (it must be 0 when the technique is IND1 or IND1_IND2).

wend end01 Code used in endsfile corresponding to the end where adapter 1 is.

cend end02 Code used in endsfile corresponding to the end where adapter 2 is.

individualsfile ./individuals.txt Path of the input individuals file.

locinum 100 Number of loci to sample.

readsnum 10000 Number of reads to generate.

minreadvar 0.8 Lower parameter value of the interval to control variation of the number of reads per locus (0.5 <= minreadvar <= 1.0).

maxreadvar 1.2 Upper parameter value of the interval to control variation of the number of reads per locus (1.0 <= maxreadvar <= 1.5).

insertlen 180 Insert length, i.e. genome sequence length inserted in the reads.

mutprob 0.2 Mutation probability (0.0 <= mutprob < 1.0).

locusmaxmut 1 Maximum mutation number by locus (1 <= locusmaxmut <= 5).

indelprob 0.4 Indel probability (0.0 <= indelprob < 1.0). This is the probability of a mutation being an indel (otherwise, it will be a substitution).

maxindelsize 3 Maximum size of the generated indels (1 <= maxindelsize < 20).

dropout 0 Probability of mutation at the enzyme recognition sites (0.0 <= dropout < 1.0).

pcrdupprob 0 Probability of loci bearing PCR duplicates (0.0 <= pcrdupprob < 1.0).

gcfactor 0 Weight factor of GC ratio in a locus with PCR duplicates (0.0 <= gcfactor < 1.0)



Figure 13. Example of fragments file. A portion of the fragments obtained by simulating a double digest of S. cerevisiae with EcoRI and MseI.

Figure 14. Example of raw reads files. A portion of FASTQ PE read files showing two reads that correspond to fragment 1066 of Figure 13. The reads of the file with end 1 (corresponding to the ID end21 displayed in Figure 8) are on top, and the reads with end 2 (corresponding to the ID end22 displayed in Figure 8) are at the bottom. One record has a mutation, while the other one is a non-mutated sequence.



Program pcrdupremoval.py

This program quantifies and removes the PCR duplicates detected in ddRADseq experiments

that use DBRs embedded in the adapters, in addition to the index sequences.

The input read file(s) have been generated by simddradseq.py.

The determination of PCR duplicates is performed in the following steps:

1) Reads are sorted by sequence in the SE read file, or by both sequences in the PE read files.

2) As reads are raw (i.e. they can include adapters, primers, indexes and DBRs in addition to the genome insert), reads with equal sequence(s) are considered PCR duplicates, and only one of them is saved in the output file.

As reads have been generated in silico, mismatches are not considered.
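
The two steps above amount to sorting the reads and keeping one read per group of identical sequences, which can be sketched in Python as follows. This is an in-memory illustration with a hypothetical function name and read representation (header, sequence 1, sequence 2); it does not reproduce how pcrdupremoval.py handles large files.

```python
from itertools import groupby


def remove_pcr_duplicates(read_pairs):
    """Keep one read per identical (seq1, seq2) pair; exact matches only.

    read_pairs holds (header, seq1, seq2) tuples; seq2 is "" for SE reads.
    Returns the kept reads and the number of removed reads.
    """
    def sequences(read):
        return (read[1], read[2])

    ordered = sorted(read_pairs, key=sequences)           # step 1: sort
    kept = [next(group) for _, group in groupby(ordered, key=sequences)]
    return kept, len(read_pairs) - len(kept)              # step 2: keep one


raw_reads = [
    ("r1", "ACGT", "TTAA"),
    ("r2", "ACGT", "TTAA"),   # same sequences as r1: PCR duplicate
    ("r3", "ACGT", "TTAG"),
    ("r4", "CCCC", "GGGG"),
]
kept, removed = remove_pcr_duplicates(raw_reads)
```

Because the raw reads still carry the randomly generated DBR, two reads from the same locus and individual are only identical when they also share the DBR, which is why identical raw sequences are treated as duplicates.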

The output file(s) have the same format as the input file(s).

This program calculates statistics regarding the number of removed and total reads per locus

and individual (see Figure 15). The output also indicates if a locus has PCR duplicates or not.

Figure 15. Example of the statistics output generated by the program pcrdupremoval.py. Only ten loci and five

individuals are displayed. In each cell, the first value is the removed number of reads for a given locus /

individual, and the second value is the corresponding total number of reads.

It is possible for a locus to have no reads in an individual. If this occurs extensively, it suggests that the coverage must be optimized. The output file is also generated in CSV format for further processing with spreadsheets such as LibreOffice Calc or Microsoft Excel, or with statistics programs such as R.

Figure 7 shows the input and output files of pcrdupremoval.py, as well as the position of this program within the processes flow of ddRADseqTools. The options of the program are detailed in Table 5.



Table 5. pcrdupremoval.py options.




readsfile1 ./reads_1.fastq Path of the file for SE reads or the reads file where Adapter 1 is for PE reads.

readsfile2 ./reads_2.fastq Path of reads file where Adapter 2 is for PE reads or NONE for SE reads.

clearfile ./reads_cleared Path of the output file with removed PCR duplicates (without extension).

dupstfile ./pcrduplicates_stats.txt Path of the PCR duplicates statistics file.

Program indsdemultiplexing.py

This program demultiplexes one file (SE) or two files (PE) containing the reads of n individuals into n files (SE) or 2n files (PE), each containing the reads of one individual.

The input reads must have been generated by simddradseq.py, or be the result of the removal of PCR duplicates performed with pcrdupremoval.py.

At this point, the reads are raw (i.e. they include the indexes, adapters, DBRs, etc.). Therefore, they have one or two indexes that identify each individual. The number used to identify the index and the index position are given by the end identifiers of the ends file. As reads have been generated in silico, mismatches are not considered.
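
The assignment of raw reads to individuals by exact index match can be sketched in Python as follows. The sketch assumes that the positions of index1 in read 1 and of index2 in read 2 are already known and passed as slices (in ddRADseqTools they are derived from the ends file); the function name, the read representation and the toy sequences are hypothetical.

```python
def demultiplex(read_pairs, individuals, index1_slice, index2_slice):
    """Group (header, seq1, seq2) reads by individual; exact matches only.

    individuals maps (index1, index2) to an individual ID. Reads whose
    indexes match no individual are collected under the key None.
    """
    bins = {}
    for header, seq1, seq2 in read_pairs:
        key = (seq1[index1_slice], seq2[index2_slice])
        individual = individuals.get(key)
        bins.setdefault(individual, []).append((header, seq1, seq2))
    return bins


# Toy data: 6-nt indexes at the start of each read (assumed positions).
individuals = {("AAAAAA", "CCCCCC"): "ind01", ("GGGGGG", "CCCCCC"): "ind02"}
reads = [
    ("r1", "AAAAAA" + "ACGT", "CCCCCC" + "TT"),
    ("r2", "GGGGGG" + "ACGT", "CCCCCC" + "TT"),
    ("r3", "TTTTTT" + "ACGT", "CCCCCC" + "TT"),
]
bins = demultiplex(reads, individuals, slice(0, 6), slice(0, 6))
```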

The input and output files of indsdemultiplexing.py are shown in Figure 7 as well as the

position of this program within the processes flow of ddRADseqTools.


Table 6. indsdemultiplexing.py options.



Four methodologies are available: IND1 (an index sequence in Adapter 1), IND1_DBR (an index sequence + a DBR in Adapter 1), IND1_IND2 (an index sequence in Adapter 1 + an index sequence in Adapter 2) and IND1_IND2_DBR (an index sequence in Adapter 1 + an index sequence in Adapter 2 + a DBR).





index2len 6 Index sequence length in Adapter 2 (it must be 0 when technique is IND1 or IND1_DBR).

dbrlen 4 DBR length (it must be 0 when technique is IND1 or IND1_IND2).

wend end01 Code used in endsfile corresponding to the end where Adapter 1 is.

cend end02 Code used in endsfile corresponding to the end where Adapter 2 is.



individualsfile ./individuals.txt Path of the input individuals file.



Program readstrim.py

This program trims the ends of one file (SE) or two files (PE) of raw reads, i.e. it cuts the adapters, primers, indexes and DBRs. The end identifiers determine the two ends of the raw reads and, therefore, the sequences that must be trimmed (see Figure 16).

Figure 16. Example of trimmed reads file. A portion of FASTQ PE read files where two reads are shown without

the two ends.

Figure 7 shows the input and output files of readstrim.py, as well as the position of this program within the processes flow of ddRADseqTools.





Table 7. readstrim.py options.



Four methodologies are available: IND1 (an index sequence in Adapter 1), IND1_DBR (an index sequence + a DBR in Adapter 1), IND1_IND2 (an index sequence in Adapter 1 + an index sequence in Adapter 2) and IND1_IND2_DBR (an index sequence in Adapter 1 + an index sequence in Adapter 2 + a DBR).





index2len 6 Index sequence length in Adapter 2 (it must be 0 when technique is IND1 or IND1_DBR).

dbrlen 4 DBR length (it must be 0 when technique is IND1 or IND1_IND2).

wend end01 Code used in endsfile corresponding to the end where Adapter 1 is.

cend end02 Code used in endsfile corresponding to the end where Adapter 2 is.



trimfile ./reads_cleared Path of the output file with trimmed reads (without extension).

Program seqlocation.py

This program locates a sequence in the genome, and shows its start and end positions, as well as its reverse complement. No mismatches are allowed in this version of the program.
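
An exact-match search of this kind can be sketched in Python as follows. The function names and the use of 1-based inclusive positions are assumptions, and the sketch searches only the sequence as given; whether seqlocation.py also searches the reverse complement is not stated in the text. The query is the default value of the option seq in Table 8, and the toy genome is invented.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def locate_sequence(genome, seq):
    """Return the reverse complement of seq and its exact-match positions.

    Positions are reported as (start, end), 1-based and inclusive.
    """
    hits = []
    start = genome.find(seq)
    while start != -1:
        hits.append((start + 1, start + len(seq)))
        start = genome.find(seq, start + 1)   # also finds overlapping hits
    return reverse_complement(seq), hits


toy_genome = "AAA" + "TGGAGGTGGGG" + "CC"
revcomp, hits = locate_sequence(toy_genome, "TGGAGGTGGGG")
```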


Table 8. seqlocation.py options.


genfile ./genome.fna File of the reference genome in FASTA format. The file can be compressed.

seq TGGAGGTGGGG The sequence to be located into the genome.

Methodology for ddRADseqTools validation

Validation of the methods implemented in ddRADseqTools is necessary to verify its proper design and operability (Aim 2). Other tests were also implemented to study the performance of ddRADseqTools (Aim 3), and to compare it with the performance of other ddRADseq simulation tools (Aim 4). Below, we describe the benchmark data and the processes tested under different scenarios to validate the software.



Benchmark reference genomes

Three reference genomes have been used to validate ddRADseqTools:

Saccharomyces cerevisiae genome: It was downloaded from NCBI:

ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF_000146045.2_R64/GCF_000146045.2_R64_genomic.fna.gz

This genome is small, about 12 Mbp, and has 14 chromosomes (Engel et al. 2014).

Homo sapiens genome: It was downloaded from NCBI:

ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF_000001405.29_GRCh38.p3/GCF_000001405.29_GRCh38.p3_genomic.fna.gz

The human genome has 23 chromosomes and approximately 3 Gbp, 1N data (Venter

et al. 2001).

Pinus taeda genome: It was downloaded from Dendrome, a forest tree genome

database:

http://dendrome.ucdavis.edu/ftp/Genome_Data/genome/pinerefseq/Pita/v1.01/ptaeda.v1.01.scaffolds.fasta.gz

This is one of the largest genomes known in living organisms: more than 20 Gbp (1N data) distributed in 12 chromosomes, with high complexity (Neale et al. 2014; Zimin et al. 2014).

These genomes cover very different organisms (Fungi, Animalia and Plantae), and their sizes are small, large and very large, which allows the study of the performance of ddRADseqTools.

Validation experiments

The following validation experiments have been conducted in order to test the operability of

ddRADseqTools programs, and the reliability of the results:

A) Analysis of fragments generation: The program rsitesearch.py analyses how several

restriction endonucleases pairs perform a double digest of the reference genomes

above to obtain double digested fragments files (loci).

The Bash script simulation-genome.sh included in the software package has all the

instructions that performed this test. Table 9 shows the values of the main options set

in the runs of rsitesearch.py.

Table 9. Values of the main options set in the runs of rsitesearch.py in simulation-genome.sh.

options rsitesearch.py values

enzyme1 EcoRI, SbfI & PstI

enzyme2 MseI

fragstinterval 25

genfile 3 files indicated in section Benchmark reference genomes


minfragsize 101 (S. cerevisiae) 201 (H. sapiens and P. taeda)

maxfragsize 300

B) ddRADseq simulation: The program simddradseq.py simulated reads of a ddRADseq

sequencing varying two options: 1) the number of reads to generate; and 2) the

probability of loci bearing PCR duplicates. The program pcrdupremoval.py quantified

and removed the PCR duplicates.

The Bash script simulation-ddradseq.sh included in the software package has all the

instructions that performed this test. Table 10 shows the values of the main options

set in the runs of each ddRADseq program.

Table 10. Values of the main options set in the runs of each ddRADseq program in simulation-ddradseq.sh.

Options simddradseq.py values pcrdupremoval.py values
dropout 0.0 n.a.
enzyme1 enzyme selected in Test A for each reference genome n.a.
enzyme2 MseI n.a.
format FASTQ n.a.
fragsfile the three files generated in Test A, corresponding to the enzyme combination selected for each reference genome n.a.
gcfactor 0.2 n.a.
indelprob 0.1 n.a.
insertlen 100 n.a.
individualsfile a file with 48 individuals n.a.
locinum number of loci assessed in Test A for each reference genome n.a.
locusmaxmut 1 n.a.
maxindelsize 10 n.a.
maxreadvar 1.2 n.a.
minreadvar 0.8 n.a.
mutprob 0.2 n.a.
pcrdupprob 0.2, 0.4 & 0.6 n.a.
readsfile1 n.a. the file 1 generated by simddradseq.py
readsnum number of reads assessed in Test A for each reference genome and each coverage n.a.
readtype PE n.a.
technique IND1_IND2_DBR n.a.

n.a.: option not available in the program



C) Analysis of PCR duplicates: This test analysed the effect of the probability of PCR duplicates on the number of reads to generate in the S. cerevisiae genome. The program simddradseq.py generated reads for a wide range of values of the probability of loci bearing PCR duplicates, and the program pcrdupremoval.py quantified and removed the PCR duplicates.

The Bash script simulation-pcrdupprob.sh included in the software package has all the instructions that performed this test. Table 11 shows the values of the main options set in the runs of each ddRADseq program.



Table 11. Values of the main options set in the runs of each ddRADseq program in simulation-pcrdupprob.sh.

options simddradseq.py values pcrdupremoval.py values
dropout 0.0 n.a.
enzyme1 enzyme selected in test A for S. cerevisiae genome n.a.
enzyme2 MseI n.a.
format FASTQ n.a.
fragsfile file generated in test A corresponding to S. cerevisiae genome n.a.
gcfactor 0.2 n.a.
indelprob 0.1 n.a.
insertlen 100 n.a.
individualsfile file with 48 individuals n.a.
locinum number of loci assessed in test A for S. cerevisiae genome n.a.
locusmaxmut 1 n.a.
maxreadvar 1.2 n.a.
minreadvar 0.8 n.a.
mutprob 0.2 n.a.
pcrdupprob 0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8 & 0.9 n.a.
readsnum number of reads assessed in test A for S. cerevisiae genome and each coverage n.a.
readtype PE n.a.



D) Analysis of GC factor: This test analysed the effect of the GC content of fragments on the number of reads and the probability of loci bearing PCR duplicates in the S. cerevisiae genome. The program simddradseq.py generated reads for a range of values of both the probability of loci bearing PCR duplicates and the GC factor. The program pcrdupremoval.py quantified and removed the PCR duplicates.

The Bash script simulation-gcfactor.sh included in the software package has all the instructions that performed this test. Table 12 shows the values of the main options set in the runs of each ddRADseq program.



Table 12. Values of the main options set in the runs of each ddRADseq program in simulation-gcfactor.sh.

options simddradseq.py values pcrdupremoval.py values

dropout 0.0 n.a.


n.a.

enzyme2 MseI n.a.
format FASTQ n.a.
fragsfile file generated in test A corresponding to S. cerevisiae genome n.a.
gcfactor 0.0, 0.1, 0.2, 0.3, 0.4 & 0.5 n.a.
indelprob 0.1 n.a.
insertlen 100 n.a.
individualsfile file with 48 individuals n.a.
locinum number of loci assessed in test A for S. cerevisiae genome n.a.
locusmaxmut 1 n.a.
maxreadvar 1.2 n.a.
minreadvar 0.8 n.a.
mutprob 0.2 n.a.
pcrdupprob 0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8 & 0.9 n.a.
readsnum number of reads assessed in test A for S. cerevisiae genome and x4 and x8 coverage n.a.
readtype PE n.a.



E) Checking of the pattern of mutations: In this test the programs simddradseq.py, pcrdupremoval.py, and indsdemultiplexing.py were run. Statistics of mutated and non-mutated fragments of each individual were calculated for a range of values of the probability of mutation, based on the header information of the reads, and stored in a CSV file.

The Bash script simulation-mutations.sh included in the software package has all the instructions that performed this test. Table 13 shows the values of the main options set in the runs of each ddRADseq program.





Table 13. Values of the main options set in the runs of each ddRADseq program in

simulation-mutations.sh.

options simddradseq.py values pcrdupremoval.py values indsdemultiplexing.py values

dropout 0.0 n.a. n.a.


n.a. n.a.

enzyme2 MseI n.a. n.a.

format FASTQ n.a. FASTQ

fragsfile file generated by rsitesearch.py

n.a. n.a.

fragsinterval n.a. n.a. n.a.

gcfactor 0.2 n.a. n.a.

indelprob 0.1 n.a. n.a.

insertlen 100 n.a. n.a.

individualsfile file with 48 individuals

n.a. file with 48 individuals

locinum

number of loci assessed in test A for S. cerevisiae genome

n.a. n.a.

locusmaxmut 1 n.a. n.a.

maxfragsize n.a. n.a. n.a.

maxindelsize 10 n.a. n.a.

maxreadvar 1.2 n.a. n.a.

minfragsize n.a. n.a. n.a.

minreadvar 0.8 n.a. n.a.

mutprob 0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8 & 0.9

n.a. n.a.

pcrdupprob 0.2 n.a. n.a.


the file 1 generated by pcrdupremoval.py



readsnum

number of reads assessed in test A for S. cerevisiae genome and x2 coverage

n.a. n.a.

readtype PE n.a. PE

technique IND1_IND2_DBR n.a. IND1_IND2_DBR


F) Pipeline of alignment: Full test of the ddRADseq programs. It generated fragments from the S. cerevisiae genome by simulating a double digest, generated their reads, removed PCR duplicates, demultiplexed the individuals, and trimmed the adapters and other specific sequences. It also aligned the reads, and produced SAM, BAM, BED and VCF format files to study and visualize the alignments.

The Bash script simulation-pipeline.sh included in the software package has all the instructions that performed this test. It is prepared to perform the test with other genomes. Table 14 shows the values of the main options set in the runs of each ddRADseq program.

Table 14. Values of the main options set in the runs of each ddRADseq program in simulation-pipeline.sh.

options rsitesearch.py values simddradseq.py values pcrdupremoval.py values indsdemultiplexing.py values readstrim.py values

dropout n.a. 0.0 n.a. n.a. n.a.


enzyme selected in test A for S. cerevisiae genome

n.a. n.a. n.a.

enzyme2 MseI MseI n.a. n.a. n.a.

format n.a. FASTQ n.a. FASTQ FASTQ

fragsfile n.a. file generated by rsitesearch.py

n.a. n.a. n.a.

fragsinterval 25 n.a. n.a. n.a. n.a.

gcfactor n.a. 0.0, 0.1, 0.2, 0.3, 0.4 & 0.5

n.a. n.a. n.a.

indelprob n.a. 0.1 n.a. n.a. n.a.

insertlen n.a. 100 n.a. n.a. n.a.

individualsfile n.a. file with 48 individuals


file with 48 individuals

locinum n.a.

number of loci assessed in test A for S. cerevisiae genome

n.a. n.a. n.a.

locusmaxmut n.a. 1 n.a. n.a. n.a.

maxfragsize 300 n.a. n.a. n.a. n.a.

maxindelsize n.a. 10 n.a. n.a. n.a.

maxreadvar n.a. 1.2 n.a. n.a. n.a.

minfragsize 101 n.a. n.a. n.a. n.a.

minreadvar n.a. 0.8 n.a. n.a. n.a.

mutprob n.a. 0.2 n.a. n.a. n.a.

pcrdupprob n.a. 0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8 & 0.9

n.a. n.a. n.a.

readsfile1 n.a. n.a. the file 1 generated by simddradseq.py


each individual file 1 generated by indsdemultiplexing.py




readsnum n.a.

number of reads assessed in test A for S. cerevisiae genome and x2 coverage

n.a. n.a. n.a.



readtype n.a. PE n.a. PE PE

technique n.a. IND1_IND2_DBR n.a. IND1_IND2_DBR IND1_IND2_DBR


The reference genome was indexed, and the trimmed files obtained by readstrim.py were aligned to the genome with the Burrows-Wheeler Aligner (BWA) (Li & Durbin 2009; http://sourceforge.net/projects/bio-bwa/). The SAM files were converted to BAM and VCF formats using SAMtools (https://github.com/samtools/samtools; Li et al. 2009). The BED files were generated from the BAM files using BEDtools (Quinlan & Hall 2010; https://github.com/arq5x/bedtools2).

The BED and VCF files were displayed in Integrative Genomics Viewer (IGV) (Robinson et al. 2011; Thorvaldsdóttir et al. 2013; http://www.broadinstitute.org/igv/).

The script simulation-pipeline.sh is prepared to perform the test with other genomes.

G) Analysis of the software performance: In this test the programs rsitesearch.py,

simddradseq.py, pcrdupremoval.py, indsdemultiplexing.py, and readstrim.py were run

repeatedly in order to measure the elapsed real time used by the program, the total

number of CPU-seconds used by the system on behalf of the process, the total number

of CPU-seconds that the process used directly, and the maximum resident set size of

the process during its lifetime.

The Bash script simulation-performance.sh included in the software package has all

the instructions that performed this test. Table 15 shows the values of the main

options set in the runs of each ddRADseq program.

Table 15. Values of the main options set in the runs of each ddRADseq program in simulation-performance.sh.

options rsitesearch.py values simddradseq.py values pcrdupremoval.py values indsdemultiplexing.py values readstrim.py values

dropout n.a. 0.0 n.a. n.a. n.a.

enzyme1 EcoRI, SbfI & PstI enzyme selected in Test A for each reference genome

n.a. n.a. n.a.

enzyme2 MseI MseI n.a. n.a. n.a.

format n.a. FASTQ n.a. FASTQ FASTQ

fragsfile n.a. file generated by rsitesearch.py

n.a. n.a. n.a.

fragsinterval 25 n.a. n.a. n.a. n.a.

gcfactor n.a. 0.2 n.a. n.a. n.a.

indelprob n.a. 0.1 n.a. n.a. n.a.

insertlen n.a. 100 n.a. n.a. n.a.

individualsfile n.a. file with 48 individuals


file with 48 individuals

locinum n.a.

number of loci assessed in test A for each reference genome

n.a. n.a. n.a.


locusmaxmut n.a. 1 n.a. n.a. n.a.

maxfragsize 300 n.a. n.a. n.a. n.a.

maxindelsize n.a. 10 n.a. n.a. n.a.

maxreadvar n.a. 1.2 n.a. n.a. n.a.

minfragsize 101 (S. cerevisiae) 201 (H. sapiens and P. taeda)

n.a. n.a. n.a. n.a.

minreadvar n.a. 0.8 n.a. n.a. n.a.

mutprob n.a. 0.2 n.a. n.a. n.a.

pcrdupprob n.a. 0.2 & 0.6 n.a. n.a. n.a.







readsnum n.a.

number of reads assessed in test A for each reference genome and x2 & x16 coverage

n.a. n.a. n.a.

readtype n.a. PE n.a. PE PE

technique n.a. IND1_IND2_DBR n.a. IND1_IND2_DBR IND1_IND2_DBR


The analysis was run on a computer with Bio-Linux 8 installed. The main features of the computer were: Intel Core i5-4200U 1.6 GHz with Turbo Boost up to 2.6 GHz; 8 GiB RAM; 5400 rpm disk.
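
The four quantities measured in test G (elapsed real time, system CPU seconds, user CPU seconds and maximum resident set size) can also be collected from Python with the standard resource module, as in the sketch below. It is offered as an illustration and is not the method used in simulation-performance.sh; the function name measure_command is hypothetical, the resource module is only available on Unix-like systems, and the unit of the maximum resident set size depends on the platform (kilobytes on Linux).

```python
import resource
import subprocess
import sys
import time


def measure_command(command):
    """Run a command and return its time and memory usage."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.perf_counter()
    subprocess.run(command, check=True)
    elapsed = time.perf_counter() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    return {
        "elapsed_s": elapsed,
        "user_cpu_s": after.ru_utime - before.ru_utime,
        "system_cpu_s": after.ru_stime - before.ru_stime,
        # Largest value among all child processes waited for so far, so it
        # is only specific to this command when it is the first one measured.
        "max_rss": after.ru_maxrss,
    }


# Usage example with a trivial command instead of a ddRADseqTools program.
usage = measure_command([sys.executable, "-c", "pass"])
```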

H) Comparison between ddRADseqTools and other ddRADseq simulation tools: The program rsitesearch.py was compared with the R package SimRAD and the Python program Digital_RADs.py of the software package BU-RAD-seq. The test was designed to generate fragments of a double digest of the benchmark reference genomes with the following restriction endonuclease pairs: EcoRI/MseI, PstI/MseI and SbfI/MseI, selecting fragments between 101 and 300 nt for the EcoRI/MseI pair, and between 201 and 300 nt otherwise. We also analysed the performance of the three tools as in test G.

The comparison between the program simddradseq.py and simRRLs was not considered because the latter does not accept reference genomes.

The attached file other_scripts.zip contains the scripts necessary to run the SimRAD and Digital_RADs.py tests.



Results and Discussion

The results of the tests performed to validate the programs are detailed in the next sub-

chapters. The complete results of a run of the scripts that correspond to each test are available

in the attached file simulations.zip.

Analysis of fragments generation

The script simulation-genome.sh provides statistics regarding the abundance of fragments in

intervals of 25 nucleotides, obtained from reference genomes when EcoRI, PstI and SbfI act as

the first enzyme, and MseI acts as the second enzyme. Figures B-1, B-2 and B-3 from Appendix

B (see file supplementary.pdf) represent the distribution of fragments for these enzymes and

genomes. They have been drawn by rsitesearch.py.

The summary statistics in Table 16 show the total number of fragments, and the number of fragments whose size is between 201 and 300 nt, which is the usual range in ddRADseq experiments. For S. cerevisiae, however, the range specified is between 101 and 300 nt because of the small size of its genome.

Table 16. Fragments generated by restriction endonucleases for three reference genomes (S. cerevisiae, H. sapiens, and P. taeda). The optimal enzyme combination inferred from the number of fragments generated for the selected size interval is indicated in bold.

S. cerevisiae
Enzymes Total fragments Fragments w/ size 101-300 nt
EcoRI - MseI 8,176 3,103
PstI - MseI 4,623 1,853
SbfI - MseI 188 70

H. sapiens
Enzymes Total fragments Fragments w/ size 201-300 nt
EcoRI - MseI 1,629,978 203,735
PstI - MseI 2,236,406 331,344
SbfI - MseI 156,140 21,016

P. taeda
Enzymes Total fragments Fragments w/ size 201-300 nt
EcoRI - MseI 11,459,733 1,353,309
PstI - MseI 4,784,215 621,933
SbfI - MseI 215,211 26,532

The effect of the double digestion with EcoRI, PstI and SbfI was different for each genome. The success and cost-efficiency of a ddRADseq experiment largely depend on selecting the correct enzyme pair. Since the number of reads is equal to the number of fragments multiplied by the coverage and the number of individuals, the enzyme pair chosen in a ddRADseq experiment must provide a tractable number of fragments; that is, there must be a balance between the number of fragments, the total number of reads and the number of individuals to obtain an optimal coverage and a low percentage of missing data.



The restriction endonucleases marked in bold in Table 16 were selected to perform the subsequent validation tests. The number of reads used in those tests is shown in Table 17. As 48 individuals were used in all tests, the number of reads was rounded so as to reach approximately 2x, 4x, 8x and 16x coverage of the number of fragments marked in bold in Table 16.

Table 17. Number of reads for 48 individuals at 2x, 4x, 8x and 16x coverage used in the validation tests.

                                number of reads
organism        enzymes         2x           4x           8x            16x
S. cerevisiae   EcoRI - MseI    300,000      600,000      1,200,000     2,400,000
H. sapiens      SbfI - MseI     2,000,000    4,000,000    8,100,000     16,100,000
P. taeda        SbfI - MseI     2,500,000    5,100,000    10,200,000    20,400,000
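The read numbers in Table 17 follow from the relation stated above (reads = fragments x coverage x individuals). The short sketch below reproduces that arithmetic; the function names are hypothetical, and rounding to the nearest 100,000 reads is an assumption that happens to match the values in the table.

```python
def reads_required(fragments, coverage, individuals):
    """Reads needed so that every fragment of every individual reaches the coverage."""
    return fragments * coverage * individuals

def round_to(value, step=100_000):
    """Round to the nearest multiple of step (assumed rounding rule)."""
    return round(value / step) * step

# Fragments in the selected size interval (Table 16) and 48 individuals.
selected_fragments = {"S. cerevisiae": 3_103, "H. sapiens": 21_016, "P. taeda": 26_532}

for organism, fragments in selected_fragments.items():
    reads = [round_to(reads_required(fragments, cov, 48)) for cov in (2, 4, 8, 16)]
    print(organism, reads)
```

For example, S. cerevisiae at 2x needs 3,103 x 2 x 48 = 297,888 reads, which is rounded to 300,000.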

ddRADseq simulations

The script simulation-ddRADseq.sh examines how the number of reads and the probability of loci bearing PCR duplicates relate to the percentage of removed reads and the resulting deviation in depth coverage for the three reference genomes. The summary of a run of this script is found in Appendix C (see file supplementary.pdf): Table C-1, S. cerevisiae; Table C-2, H. sapiens; and Table C-3, P. taeda.

Table 18 shows the percentage of removed reads and the coverage deviation. The latter statistic is computed by comparing the theoretical probability of loci bearing PCR duplicates (the value passed to the program pcrdupremoval.py) with the actual proportion of loci bearing PCR duplicates. The term "depth coverage" is used instead of "number of reads" to standardize the comparisons between organisms.

Table 18. Percentage of removed reads and coverage deviation collected from Appendix C data (see file supplementary.pdf).

                             2x                   4x                   8x                   16x                  %r.reads           cov.dev.
organism       pcrdupprob    %r.reads  cov.dev.   %r.reads  cov.dev.   %r.reads  cov.dev.   %r.reads  cov.dev.   mean    s.d.       mean    s.d.
S. cerevisiae  0.2           15.37     -0.3166    16.77     -0.6955    17.48     -1.4543    18.39     -3.0511    17.00   1.2744     -1.38   1.2107
               0.4           31.16     -0.6516    30.6      -1.2565    32.74     -2.7541    32.29     -5.3893    31.70   0.9885     -2.51   2.1115
               0.6           46.83     -0.9764    46.45     -1.9375    48.26     -4.0376    48.07     -8.0143    47.40   0.8974     -3.74   3.1222
H. sapiens     0.2           15.59     -0.317     16.41     -0.6564    17.22     -1.3987    18.54     -3.0078    16.94   1.2572     -1.34   1.1970
               0.4           31.26     -0.6325    31.48     -1.2625    32.18     -2.6238    33.01     -5.356     31.98   0.7894     -2.47   2.0966
               0.6           46.21     -0.9303    46.4      -1.8666    47.48     -3.867     48.03     -7.7916    47.03   0.8702     -3.61   3.0426
P. taeda       0.2           15.9      -0.3217    16.24     -0.9321    17.4      -1.4188    18        -2.9382    16.89   0.9823     -1.40   1.1177
               0.4           31.44     -0.6327    31.69     -1.2969    32.23     -2.6339    33.15     -5.4098    32.13   0.7572     -2.49   2.1149
               0.6           46.46     -0.9321    46.76     -1.9117    46.95     -3.8366    47.63     -7.7748    46.95   0.4962     -3.61   3.0250

pcrdupprob: theoretical probability of loci bearing PCR duplicates; %r.reads: percentage of removed reads; cov.dev.: coverage deviation



The behaviour of the data was very similar for all three genomes. We observed that:

For a given theoretical probability of loci bearing PCR duplicates, the percentage of removed reads was very similar for all depth coverages.

For a given theoretical probability of loci bearing PCR duplicates, the coverage deviation grew linearly with the depth coverage, irrespective of the genome size.

The percentage of removed reads and the coverage deviation grew linearly with the theoretical probability of loci bearing PCR duplicates.

A graphical interpretation of Table 18 can be found in Appendix C (see file supplementary.pdf): Figure C-1, S. cerevisiae; Figure C-2, H. sapiens; and Figure C-3, P. taeda. Table C-1, Table C-2 and Table C-3 show that the loci bearing PCR duplicates generated by the program simddradseq.py have only small deviations. Table 18 shows the mean and standard deviation for each probability of loci bearing PCR duplicates per reference genome.
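As a check on how the summary columns of Table 18 can be derived, the sketch below recomputes the mean and standard deviation across the four coverages for one row (S. cerevisiae, pcrdupprob 0.2). It assumes the s.d. column is the sample standard deviation (n - 1 denominator), which is consistent with the values in the table; the variable names are illustrative only.

```python
from statistics import mean, stdev

# Row "S. cerevisiae, pcrdupprob 0.2" of Table 18, at 2x, 4x, 8x and 16x.
removed_reads_pct = [15.37, 16.77, 17.48, 18.39]
coverage_deviation = [-0.3166, -0.6955, -1.4543, -3.0511]

# stdev() is the sample standard deviation (assumption about the table).
print(f"%r.reads: mean={mean(removed_reads_pct):.2f} s.d.={stdev(removed_reads_pct):.4f}")
print(f"cov.dev.: mean={mean(coverage_deviation):.2f} s.d.={stdev(coverage_deviation):.4f}")
```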

Analysis of PCR duplicates

The in-depth analysis of the effect of the probability of loci bearing PCR duplicates on the

number of reads was carried out by the script simulation-pcrdupprob.sh. In one run of this

script, the program simddradseq.py generated reads for a wide range of probabilities of PCR

duplicates. The program pcrdupremoval.py quantified and removed the PCR duplicates. The

results of this run are summarized in Table D-1 (Appendix D, file supplementary.pdf). Figure 17

shows graphically the percentage of removed reads, and the coverage deviation for each PCR

duplicates probability and depth coverage in S. cerevisiae.
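The quantification step can be pictured with the following minimal sketch. It is not the pcrdupremoval.py code: it assumes, for illustration, that a read is a dictionary with hypothetical keys `individual`, `sequence` and `dbr`, and that two reads are PCR duplicates when all three values coincide.

```python
def remove_pcr_duplicates(reads):
    """Keep the first read of each (individual, sequence, DBR) combination.

    Returns the retained reads and the percentage of removed reads.
    """
    seen = set()
    kept = []
    for read in reads:
        key = (read["individual"], read["sequence"], read["dbr"])
        if key not in seen:
            seen.add(key)
            kept.append(read)
    removed_pct = 100 * (len(reads) - len(kept)) / len(reads) if reads else 0.0
    return kept, removed_pct

# Toy example: the second read repeats the first one, so 1 of 4 reads is removed.
toy_reads = [
    {"individual": "ind0101", "sequence": "AATTCGGA", "dbr": "ACGT"},
    {"individual": "ind0101", "sequence": "AATTCGGA", "dbr": "ACGT"},
    {"individual": "ind0101", "sequence": "AATTCGGA", "dbr": "TTGA"},
    {"individual": "ind0102", "sequence": "AATTCGGA", "dbr": "ACGT"},
]
kept_reads, removed_pct = remove_pcr_duplicates(toy_reads)
```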



Figure 17. Percentage of removed reads and coverage deviation for values of the probability of loci bearing PCR duplicates between 0.0 and 0.9 in S. cerevisiae, at 2x, 4x, 8x and 16x coverage.

The results showed that:

The number of removed reads, i.e. the number of duplicate reads, was proportional to

the probability of loci bearing PCR duplicates, and the values were independent of the

depth coverage.

The coverage deviation was proportional to both the probability of loci bearing PCR

duplicates and the depth coverage.

A small, nearly negligible number of duplicate reads was produced even when the probability of loci bearing PCR duplicates was 0.0. This is an artifact of the random generation of the DBR sequences. Such duplicate reads also occurred when the probability of loci bearing PCR duplicates was > 0.0, and there is no way to distinguish real duplicates from artifacts. In any case, the number of randomly generated duplicate reads was negligible when the PCR duplicates probability was > 0.0.

The decrease in coverage became more pronounced as the proportion of PCR duplicates grew. The researcher can set the value of the option pcrdupprob that best suits the experiment being designed.
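The artifact seen at a probability of 0.0 can be understood as a chance coincidence of DBR sequences. The sketch below estimates how many reads of one locus and individual would be flagged as duplicates purely by chance. It rests on simplifying assumptions that are not taken from ddRADseqTools: every DBR position is drawn uniformly and independently from four bases, and the function name and parameters are hypothetical.

```python
def expected_chance_duplicates(reads_per_locus, dbr_length, alphabet_size=4):
    """Expected number of reads that repeat an already drawn DBR by chance.

    With m equally likely DBR sequences, the expected number of distinct
    sequences among n draws is m * (1 - (1 - 1/m) ** n); the remaining
    reads coincide with an earlier one.
    """
    m = alphabet_size ** dbr_length
    expected_distinct = m * (1 - (1 - 1 / m) ** reads_per_locus)
    return reads_per_locus - expected_distinct

# Chance duplicates become more frequent with more reads per locus and
# rarer with longer DBRs.
for n in (2, 4, 8, 16):
    print(n, round(expected_chance_duplicates(n, dbr_length=4), 4))
```

Under these assumptions the expected number of chance duplicates stays well below one read per locus for the coverages used here, which agrees qualitatively with the small percentages observed.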

Analysis of the effect of the GC content

The script simulation-gcfactor.sh analyses the effect of the GC content on the number of reads to generate, and on the probability of loci bearing PCR duplicates, in the S. cerevisiae genome. In this script the program simddradseq.py generated reads for wide ranges of values of both the probability of loci bearing PCR duplicates and the GC factor. The program pcrdupremoval.py quantified and removed the PCR duplicates.

Table E-1 (Appendix E, file supplementary.pdf) summarizes the data of a run of simulation-gcfactor.sh. Table 19 shows the percentage of removed reads and the coverage deviation for a range of values (0.0-0.9) of the theoretical probability of loci bearing PCR duplicates, and for a depth coverage of 4x and 8x.

Table 19. Summary of percentage of removed reads and coverage deviation corresponding to various values of the GC factor, grouped by the probability of loci bearing PCR duplicates, collected from Appendix E data (see file supplementary.pdf).

              4x                                                 8x
theoretical   percentage removed reads   coverage deviation      percentage removed reads   coverage deviation
pcrdupprob    mean      s.d.             mean      s.d.          mean      s.d.             mean      s.d.
0.0           0.81      0.0071           -0.03     0.0070        1.60      0.0096           -0.14     0.0166
0.1           9.41      1.0235           -0.39     0.0435        10.31     1.1093           -0.87     0.0909
0.2           16.97     0.5428           -0.71     0.0239        17.29     0.4387           -1.44     0.0454
0.3           23.78     0.7591           -1.00     0.0351        24.71     0.5526           -2.05     0.0462
0.4           31.70     0.2604           -1.32     0.0131        31.80     0.1894           -2.65     0.0162
0.5           38.90     0.7139           -1.62     0.0266        39.63     1.0265           -3.31     0.0880
0.6           46.68     0.1937           -1.94     0.0084        47.62     0.7847           -3.97     0.0610
0.7           54.31     0.5059           -2.26     0.0226        54.32     0.7413           -4.52     0.0628
0.8           62.14     0.4659           -2.59     0.0198        62.30     0.7972           -5.19     0.0700
0.9           69.16     0.4306           -2.88     0.0196        69.19     0.3822           -5.77     0.0340

Figure 18 represents the percentage of removed reads against the probability of loci bearing

PCR duplicates for several GC factor values.



Figure 18. Percentage of removed reads vs. the probability of loci bearing PCR duplicates for several

values of GC factor in S. cerevisiae, and for 4x and 8x coverage.

Figure 19 shows the coverage deviation against the probability of loci bearing PCR duplicates

for several GC factor values.

Figure 19. Coverage deviation against the probability of loci bearing PCR duplicates for several values of

GC factor in S. cerevisiae, and for 4x and 8x coverage.

In light of these results, we can conclude that the GC factor had no major influence on the generation of PCR duplicates or on the coverage deviation. In addition, this confirmed the results of the test described in the section Analysis of PCR duplicates:

The number of duplicate reads was proportional to the probability of loci bearing PCR duplicates and was independent of the coverage.

The coverage deviation was proportional to both the probability of loci bearing PCR duplicates and the coverage.
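The proportionality stated above can be checked numerically against the mean values of Table 19. The sketch below fits an ordinary least-squares line to the mean percentage of removed reads at 4x coverage; the helper function is hypothetical and uses only the standard library.

```python
def least_squares(xs, ys):
    """Return slope, intercept and R^2 of the least-squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot

# Table 19: theoretical pcrdupprob vs. mean percentage of removed reads at 4x.
PCRDUPPROB = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
REMOVED_4X = [0.81, 9.41, 16.97, 23.78, 31.70, 38.90, 46.68, 54.31, 62.14, 69.16]

slope, intercept, r_squared = least_squares(PCRDUPPROB, REMOVED_4X)
print(f"slope={slope:.2f} intercept={intercept:.2f} R^2={r_squared:.4f}")
```

A coefficient of determination close to 1 supports the linear relationship; the small positive intercept corresponds to the chance duplicates observed at a probability of 0.0.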

Analysis of the mutation patterns

The script simulation-mutations.sh calculated statistics for the mutated and non-mutated reads of each individual, generated after PCR duplicates removal and demultiplexing. Detailed results per individual for a run of this script are found in Table F-1 (Appendix F, file supplementary.pdf) and a summary is shown in Table 20. We can observe that the percentage of mutated reads is approximately the value expected from the corresponding mutation probability.

Table 20. Summary of the percentage of mutated reads for each mutation probability (mutprob), collected from Appendix F data (see file supplementary.pdf).

mutprob   not-mutated reads   mutated reads   total reads   percentage of mutated reads
0.0       251,348             0               251,348       0.00
0.1       227,302             24,912          252,214       9.88
0.2       203,360             49,959          253,319       19.72
0.3       174,697             75,128          249,825       30.07
0.4       152,020             101,928         253,948       40.14
0.5       125,015             125,350         250,365       50.07
0.6       99,570              150,164         249,734       60.13
0.7       74,655              174,346         249,001       70.02
0.8       50,267              201,499         251,766       80.03
0.9       25,432              228,608         254,040       89.99
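The last column of Table 20 can be recomputed from the read counts, which also makes the comparison with the expected value (100 x mutprob) explicit. The sketch below uses the counts of the table; the variable and function names are illustrative.

```python
# (mutprob, not-mutated reads, mutated reads) taken from Table 20.
TABLE_20 = [
    (0.0, 251_348, 0),
    (0.1, 227_302, 24_912),
    (0.2, 203_360, 49_959),
    (0.3, 174_697, 75_128),
    (0.4, 152_020, 101_928),
    (0.5, 125_015, 125_350),
    (0.6, 99_570, 150_164),
    (0.7, 74_655, 174_346),
    (0.8, 50_267, 201_499),
    (0.9, 25_432, 228_608),
]

def mutated_percentage(not_mutated, mutated):
    """Percentage of mutated reads over the total number of reads."""
    return 100 * mutated / (not_mutated + mutated)

for mutprob, not_mutated, mutated in TABLE_20:
    observed = mutated_percentage(not_mutated, mutated)
    expected = 100 * mutprob
    print(f"mutprob={mutprob:.1f} observed={observed:.2f}% deviation={observed - expected:+.2f}")
```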

Pipeline for the alignment of simulated reads

The script simulation-pipeline.sh performed a complete test of the ddRADseq programs for S.

cerevisiae. The resulting reads were aligned, and SAM, BAM, BED and VCF format files were

generated. The alignments were visualized with IGV.

Figure 20 displays the results of loading the reference genome of S. cerevisiae, and the BED

files of the reads resulting from the simulations. We can observe that the collapsed reads of

each individual covered all chromosomes uniformly.



Figure 20. Reads generated by simulation-pipeline.sh for S. cerevisiae visualised along all its genome. Each

row corresponds to the collapsed reads of a single individual.

Figure 21 shows the expanded reads at single chromosome level. The number of reads varied from one individual to another and, in certain cases, an individual did not show reads in a locus.

Figure 21. Reads generated by simulation-pipeline.sh for chromosome NC_001139.9 of S. cerevisiae. Each row corresponds to the squished reads of one individual.



IGV allows the user to trace a fragment from its visualization to the detail of the corresponding reads in the browser, or vice versa. Fragments carry information about the chromosome or scaffold and the strand to which they belong, and about their start and end positions. Reads carry information about the fragment from which they derive. Files in VCF format allow the extent of mutations (SNPs or indels) to be quantified by chromosome or scaffold and by their coordinates in the genome. IGV displays the alignment of reads to the genome. Figure 22 and Figure 23 show two examples of fragment – reads – VCF – alignment traceability. They provide evidence of the correct functioning of the ddRADseqTools programs.



Figure 22. Example of fragment – reads – VCF – alignment traceability visualized with IGV in the case of an indel. The fragment shown corresponds to positions 759937-760168 of strand + of chromosome XIII (id NC_001145.3). In one of the reads of individual ind0101 an indel occurs at positions 759942-759945: the sequence GCCC is changed to GCC. This indel was recorded in the VCF file and can be visualized with IGV.



Figure 23. Example of fragment – reads – VCF – alignment traceability visualized with IGV in the case of an SNP. The fragment shown corresponds to positions 139675-139506 of strand - of chromosome I (id NC_001133.9). In one of the reads of individual ind0101 an SNP occurs at position 139669: an A (T in strand +) is changed to C (G in strand +). This SNP was recorded in the VCF file and can be visualized with IGV (genome nucleotides of strand - are shown in this alignment).



Performance of ddRADseqTools

The results of the performance analysis of the ddRADseqTools programs are shown in Table G-1 (Appendix G, file supplementary.pdf).

rsitesearch.py was the program that needed the largest amount of memory: approximately 61 MiB were required for S. cerevisiae, above 4 GiB for H. sapiens, and below 220 MiB for P. taeda. Although the P. taeda genome is larger than the H. sapiens genome, its memory requirements were lower because its genome file contains scaffolds rather than chromosomes, as the H. sapiens genome file does. The elapsed time depended both on the genome size and on the number of fragments obtained (see also Table 15). The maximum elapsed time recorded was 2,812.50 s for P. taeda with EcoRI / MseI as the restriction endonuclease pair.
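The memory behaviour described above is what one would expect from a program that processes the genome file one sequence at a time: peak memory is then bounded by the longest single sequence (a whole chromosome in H. sapiens, a scaffold in P. taeda) rather than by the genome size. The following sketch illustrates that reading strategy. It is an assumption about the general approach, not the rsitesearch.py code, and the function names are hypothetical.

```python
import gzip

def open_genome(path):
    """Open a FASTA file as text, transparently handling gzip compression."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")

def read_fasta(lines):
    """Yield (header, sequence) pairs one at a time from an iterable of lines.

    Only the current sequence is held in memory; nucleotides are converted
    to upper case so that soft-masked regions are searched as well.
    """
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)
```

A caller would iterate with `for header, seq in read_fasta(open_genome(path)):` and search each sequence for restriction sites before moving to the next one.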

The program simddradseq.py had very low memory requirements: below 23 MiB for the three reference genomes analysed. The elapsed time was proportional to the number of reads. The maximum elapsed time recorded was 2,337.57 s (below 39 min) for P. taeda with 20,400,000 reads.

The elapsed time of the program pcrdupremoval.py depended on the number of records in the input and output files: for the same number of input records (readsnum column), the time was lower when fewer output records were produced (i.e. when the pcrdupprob column value was greater). The maximum elapsed time recorded was 10,606.18 s (approximately 2 h 57 min) for P. taeda with 20,400,000 reads and a probability of loci bearing PCR duplicates of 0.2.

The program indsdemultiplexing.py always had a memory requirement of approximately 10 MiB. Its elapsed time depended on the number of records in the input file (for the same readsnum value, a greater pcrdupprob value implied fewer reads). The maximum elapsed time recorded was 1,270.93 s (above 21 min) for P. taeda with 20,400,000 reads and a probability of loci bearing PCR duplicates of 0.2.

The maximum resident set size of the program readstrim.py was 9 MB, and the maximum elapsed real time was 311.41 s (above 5 min) for P. taeda with 20,400,000 reads and a probability of loci bearing PCR duplicates of 0.2.

Comparison between ddRADseqTools and other ddRADseq simulation tools

Table 21 shows the comparison between rsitesearch.py, the R package SimRAD and Digital_RADs.py of BU-RAD-seq. Both SimRAD and Digital_RADs.py needed a previous step to decompress the genome file. Digital_RADs.py also needed another previous step to convert the nucleotide symbols to upper case.

SimRAD did not work with the H. sapiens and P. taeda genomes. Digital_RADs.py obtained fragments from all the benchmark reference genomes, but it was not efficient for very large genomes. rsitesearch.py of ddRADseqTools was the only tool that could obtain fragments from all benchmark reference genomes with good performance.



Table 21. Comparison between rsitesearch.py, the R package SimRAD and Digital_RADs.py of BU-RAD-seq.

S. cerevisiae

ddRADseqTools - rsitesearch.py
enzymes        total fragments   fragments w/ size 101-300 nt   elapsed real time (s)   CPU time in kernel mode (s)   CPU time in user mode (s)
EcoRI - MseI   8,176             3,103                          4.99                    0.09                          2.38
PstI - MseI    4,623             1,853                          2.34                    0.01                          2.19
SbfI - MseI    188               70                             2.01                    0.05                          1.84

SimRAD (*)
enzymes        total fragments   fragments w/ size 101-300 nt   elapsed real time (s)   CPU time in kernel mode (s)   CPU time in user mode (s)
EcoRI - MseI   8,176             3,048                          21.14                   0.21                          17.94
PstI - MseI    4,628             1,866                          18.32                   0.21                          18.07
SbfI - MseI    188               70                             17.89                   0.20                          17.66

BU-RAD-seq - Digital_RADs.py (*) (**) (***)
enzymes        fragments w/ size 1-1,000 nt   fragments w/ size 101-300 nt   elapsed real time (s)   CPU time in kernel mode (s)   CPU time in user mode (s)
EcoRI - MseI   8,139                          3,191                          1.30                    0.03                          0.41
PstI - MseI    4,590                          1,934                          0.38                    0.02                          0.36
SbfI - MseI    186                            73                             0.35                    0.01                          0.34

H. sapiens

ddRADseqTools
enzymes        total fragments   fragments w/ size 201-300 nt   elapsed real time (s)   CPU time in kernel mode (s)   CPU time in user mode (s)
EcoRI - MseI   1,629,978         203,735                        421.39                  11.90                         399.33
PstI - MseI    2,236,406         331,344                        469.76                  10.03                         457.63
SbfI - MseI    156,140           21,016                         324.16                  8.18                          314.54

SimRAD (*)
enzymes        total fragments   fragments w/ size 201-300 nt   elapsed real time (s)   CPU time in kernel mode (s)   CPU time in user mode (s)
EcoRI - MseI   (***)             (***)                          (****)                  (****)                        (****)
PstI - MseI    (***)             (***)                          (****)                  (****)                        (****)
SbfI - MseI    (***)             (***)                          (****)                  (****)                        (****)

BU-RAD-seq - Digital_RADs.py (*) (**) (***)
enzymes        fragments w/ size 1-1,000 nt   fragments w/ size 201-300 nt   elapsed real time (s)   CPU time in kernel mode (s)   CPU time in user mode (s)
EcoRI - MseI   1,604,730                      208,238                        233.08                  8.89                          96.03
PstI - MseI    2,195,695                      343,793                        180.33                  5.66                          87.17
SbfI - MseI    141,656                        21,660                         175.42                  5.37                          84.21

P. taeda

ddRADseqTools
enzymes        total fragments   fragments w/ size 201-300 nt   elapsed real time (s)   CPU time in kernel mode (s)   CPU time in user mode (s)
EcoRI - MseI   11,459,733        1,353,309                      2,812.50                19.27                         2,773.89
PstI - MseI    4,784,215         621,933                        2,429.52                15.42                         2,402.45
SbfI - MseI    215,211           26,532                         2,005.67                11.48                         1,985.56

SimRAD (*)
enzymes        total fragments   fragments w/ size 201-300 nt   elapsed real time (s)   CPU time in kernel mode (s)   CPU time in user mode (s)
EcoRI - MseI   (****)            (****)                         (*****)                 (*****)                       (*****)
PstI - MseI    (****)            (****)                         (*****)                 (*****)                       (*****)
SbfI - MseI    (****)            (****)                         (*****)                 (*****)                       (*****)

BU-RAD-seq - Digital_RADs.py (*) (**) (***)
enzymes        fragments w/ size 1-1,000 nt   fragments w/ size 201-300 nt   elapsed real time (s)   CPU time in kernel mode (s)   CPU time in user mode (s)
EcoRI - MseI   11,181,647                     1,377,129                      26,937.80               872.16                        4,062.31
PstI - MseI    4,590,018                      643,991                        34,287.74               902.78                        4,141.07
SbfI - MseI    204,438                        27,408                         68,824.32               955.48                        4,336.22

(*) It was necessary to decompress the genome file in a preliminary stage. Elapsed real time: S. cerevisiae, 0.14 s; H. sapiens, 59.96 s; P. taeda, 443.30 s.
(**) It was necessary to convert the genome file content to upper case beforehand. Elapsed real time: S. cerevisiae, 0.14 s; H. sapiens, 100.81 s; P. taeda, 829.36 s.
(***) In addition, it was necessary to delete temporary files. For P. taeda, 14,412,988 temporary files were generated and their deletion took several hours.
(****) Error in ref.DNAseq (result would exceed 2^31-1 bytes).
(*****) Computer crashed.



Limitations and Future Prospects

The current version of the ddRADseqTools software package has the following limitations:

Only the Jukes-Cantor model of sequence evolution is implemented, and the

phylogenetic relationships between individuals or groups of individuals cannot be

simulated.

The individuals are assumed to be diploid.

Mismatches are not allowed in the demultiplexing process. This is not important when the reads are generated in silico, as in the present project, but it means that the current versions of pcrdupremoval.py and indsdemultiplexing.py cannot be used with experimental ddRADseq data.

Paralogous sequences are not parameterized. If organisms whose genomes have a high content of repetitive regions (e.g. P. taeda) are used as a reference, some paralogous fragments will be generated, but this feature is not controlled by the user. However, paralogous sequences can be identified following Mastretta-Yanes et al. (2015). When reads are generated at random with fragsgeneration.py, paralogous sequences will not be generated.

Statistics for the amount of missing data are not automatically generated. However, they can be easily calculated from the existing output with general-purpose software (R, Excel, etc.).
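As an illustration of the last point, the sketch below computes the percentage of missing loci per individual. The input structure is hypothetical: it assumes that the set of loci with at least one read has already been extracted for each individual from the demultiplexed output, which is not a format defined by ddRADseqTools.

```python
def missing_data_percentage(loci_by_individual, all_loci):
    """Return, per individual, the percentage of loci without any read."""
    all_loci = set(all_loci)
    total = len(all_loci)
    return {
        individual: 100 * (total - len(set(loci) & all_loci)) / total
        for individual, loci in loci_by_individual.items()
    }

# Toy example with four loci (fragment identifiers are made up).
observed = {
    "ind0101": {"fragment1", "fragment2"},
    "ind0102": {"fragment1"},
    "ind0103": {"fragment1", "fragment2", "fragment3", "fragment4"},
}
missing = missing_data_percentage(
    observed, ["fragment1", "fragment2", "fragment3", "fragment4"]
)
```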

Conclusions

In the project corresponding to this Master Thesis:

1. We have developed ddRADseqTools, a software package that provides tools to design in

silico ddRADseq experiments with the following characteristics:

Study of the restriction endonuclease pair most suitable for the reference genome.

Fragments generation (library construction).

Generation of reads simulating high-throughput sequencing, with the possibility of including variation (both SNPs and indels), allele dropout and technical replicates.

Use of one or two indexes to identify individuals.

Use of DBR to identify PCR duplicates.

Quantification and removal of PCR duplicates.

Demultiplexing of reads by individual.

Trimming of reads.



2. We have validated the software by performing tests in order to ensure that the output

data are reliable, according to the corresponding design of the programs included in the

software package. The tests have performed ddRADseq simulations, and analysis of

fragments generation, PCR duplicates, effect of the GC content and mutation patterns. We

have also written a pipeline for the alignment of simulated reads so they can be displayed

in a genome browser like IGV; and we have assessed the performance of ddRADseqTools.

3. The software package is efficient in terms of CPU and RAM usage. We have run the tests on a computer whose main features were: Intel Core i5-4200U 1.6 GHz with Turbo Boost up to 2.6 GHz; RAM 8 GiB; 5400 rpm disk. Therefore, ddRADseqTools can run on computers with a standard CPU and RAM configuration.

4. Unlike other ddRADseq simulation tools, ddRADseqTools can process genomes of any size,

with both chromosomes and scaffolds sequences, and with a good performance.



References

Baird NA, Etter PD, Atwood TS et al. (2008) Rapid SNP Discovery and Genetic Mapping Using Sequenced RAD Markers. PLoS ONE, 3, e3376.

Blainey P, Krzywinski M, Altman N (2014) Points of Significance: Replication. Nature Methods,

11, 879-880.

Casbon JA, Osborne RJ, Brenner S, Lichtenstein CP (2011) A method for counting PCR template

molecules with application to next-generation sequencing. Nucleic Acids Research, 39, e81.

Catchen J, Hohenlohe PA, Bassham S, Amores A, Cresko WA (2013) Stacks: an analysis tool set for population genomics. Molecular Ecology, 22, 3124-3140.

Chong Z, Ruan J, Wu CI (2012) Rainbow: an integrated tool for efficient clustering and

assembling RAD-seq reads. Bioinformatics, 28, 2732-2737.

DaCosta JM, Sorenson MD (2014) Amplification Biases and Consistent Recovery of Loci in a

Double-Digest RAD-seq Protocol. PLoS ONE, 9, e106713.

Davey JW, Blaxter ML (2010) RADSeq: next-generation population genetics. Briefings in Functional Genomics, 9, 416-423.

Davey JW, Cezard T, Fuentes-Utrilla P (2013) Special features of RAD Sequencing data:

implications for genotyping. Molecular Ecology, 22, 3151-3164.

Davey JW, Hohenlohe PA, Etter PD et al. (2011) Genome-wide genetic marker discovery and

genotyping using next-generation sequencing. Nature Reviews Genetics, 12, 499-510.

Eaton DAR (2014) PyRAD: assembly of de novo RADseq loci for phylogenetic analyses.

Bioinformatics, 30, 1844-1849.

Etter PD, Bassham S, Hohenlohe PA, Johnson EA, Cresko WA (2011) SNP Discovery and

Genotyping for Evolutionary Genetics Using RAD Sequencing. Methods in Molecular Biology,

772, 157-178.

Engel SR, Dietrich FS, Fisk DG et al. (2014) The Reference Genome Sequence of Saccharomyces cerevisiae: Then and Now. G3: Genes, Genomes, Genetics, 4, 389-398.

Lepais O, Weir JT (2014) SimRAD: an R package for simulation-based prediction of the number

of loci expected in RADseq and similar genotyping by sequencing approaches. Molecular

Ecology Resources, 14, 1314-1321.

Li H, Durbin R (2009) Fast and accurate short read alignment with Burrows–Wheeler

Transform. Bioinformatics, 25, 1754-1760.

Li H, Handsaker B, Wysoker A, et al. (2009) The Sequence Alignment/Map format and

SAMtools. Bioinformatics, 25, 2078-2079.

Mastretta-Yanes A, Arrigo N, Alvarez N, et al. (2015) Restriction site-associated DNA

sequencing, genotyping error estimation and de novo assembly optimization for population

genetic inference. Molecular Ecology Resources, 15, 28-41.



Mastretta-Yanes A, Zamudio S, Jorgensen TH et al. (2014) Gene Duplication, Population Genomics, and Species-Level Differentiation within a Tropical Mountain Shrub. Genome Biology and Evolution, 6, 2611-2624.

Miller MR, Dunham JP, Amores A, Cresko WA, Johnson EA (2007) Rapid and cost-effective polymorphism identification and genotyping using restriction site associated DNA (RAD) markers. Genome Research, 17, 240-248.

Neale DB, Wegrzyn JL, Stevens KA et al (2014) Decoding the massive genome of loblolly pine

using haploid DNA and novel assembly strategies. Genome Biology, 15, R59.

Peterson BK, Weber JN, Kay EH, Fisher HS, Hoekstra HE (2012) Double Digest RADseq: An

Inexpensive Method for De Novo SNP Discovery and Genotyping in Model and Non-Model

Species. PLoS ONE, 7, e37135.

Quinlan AR, Hall IM (2010) BEDTools: a flexible suite of utilities for comparing genomic

features. Bioinformatics, 26, 841-842.

Robinson JT, Thorvaldsdóttir H, Winckler W, et al. (2011) Integrative genomics viewer. Nature

Biotechnology, 29, 24-26.

Schweyen H, Rozenberg A, Leese F (2014) Detection and Removal of PCR Duplicates in

Population Genomic ddRAD Studies by Addition of a Degenerate Base Region (DBR) in

Sequencing Adapters. The Biological Bulletin, 227, 146-160.

Sovic MG, Fries AC, Gibbs HL (2015) AftrRAD: a pipeline for accurate and efficient de novo

assembly of RADseq data. Molecular Ecology Resources, 15, 1163-1171.

Thorvaldsdóttir H, Robinson JT, Mesirov JP (2013) Integrative Genomics Viewer (IGV): high-

performance genomics data visualization and exploration. Briefings in Bioinformatics, 14, 178-

192.

Tin MMY, Rheindt FE, Cros E, Mikheyev AS (2015) Degenerate adaptor sequences for detecting

PCR duplicates in reduced representation sequencing data improve genotype calling accuracy.

Molecular Ecology Resources, 15, 329-336.

Venter JC, Adams MD, Myers EW et al (2001) The Sequence of the Human Genome. Science,

291, 1304-1351.

Zimin A, Stevens KA, Crepeau MW, et al. (2014) Sequencing and Assembly of the 22-Gb

Loblolly Pine Genome. Genetics, 196, 875-890.
