
Algorithms for Aligning Genetic Sequences to Reference Genomes

Jens-Ola Ekstrom

VT 2017
Bachelor thesis, 15 credits
Supervisor: Lars-Erik Janlert
Examiner: Pedher Johansson
Bachelor of Science Programme in Computing Science, 180 credits

Abstract

The technologies for sequencing genetic material have improved vastly during the last fifteen years. Sequences can now be determined at affordable cost, and more genetic sequences are therefore available than ever before. The bottleneck is no longer obtaining genetic sequences but rather analyzing all the sequence data.

A primary step in sequence analysis is to determine a sequence fragment’s position in the genome, a process called aligning. From a Computing Science point of view, this is essentially text matching. It is, however, a much more complex task than searching for strings in ordinary text documents. First, there is a large amount of data. An ordinary sequencing experiment can generate more than 100 million sequences. Each sequence should be matched to a reference genome sequence, which is some billions of characters in size. Second, the obtained sequences may differ from the reference genome sequence. The algorithms are thus searching not only for exact matches, but for the best approximate matches.

In this work I review the algorithms behind modern sequence alignment software. I also propose evaluating the fast Fourier transform for the task.


Contents

1 Introduction

2 Background

2.1 Basics in molecular biology

2.2 Next Generation Sequencing

3 Algorithms for sequence alignments

3.1 Exact matching algorithms

3.1.1 Hash tables

3.1.2 Tree structures

3.1.3 The Burrows-Wheeler transform

3.1.4 The fast Fourier transform

3.2 Approximate matching algorithms

3.2.1 The Smith-Waterman and Needleman-Wunsch algorithms

4 Discussion

References

Appendix: Calculations


1 Introduction

The term Next Generation Sequencing (NGS) was coined to describe the last fifteen years of development of new technologies for determining the sequences of genetic material. NGS is essentially a massive parallelization of sequencing reactions. Sequencing of genetic material is now performed at affordable cost, which has led to an explosion in the amount of genetic data. The bottleneck is no longer obtaining genetic data but analyzing all the sequences.

With the sequencing technologies available today, each DNA molecule reveals only a short fragment of its genetic sequence. A sequence of between 30 and a few hundred nucleotides may be determined for each DNA molecule. These short sequences are simply called reads. By determining many reads it is possible, and today also routine, to cover the entire genetic material of an organism. Genome size differs between organisms; the human genome has about 3.3 × 10^9 nucleotides. Plants may have up to 400-fold larger genomes, whereas bacteria have about 1/1000 the size of the human genome.

The main challenge is perhaps that sequencing with NGS produces a considerable amount of data. A regular small-scale experiment may generate more than 100 million sequence fragments. That is 25-50 GB of compressed text data from a single experiment (including identity and quality scores). It is truly a challenge for Computer Scientists and Programmers to construct software that is practically useful for the common Biologist or Clinician, who are the end users of this software. The computational resources in Biology research environments are commonly standard desktop computers. In this case the amount of accessible memory may be of concern. Clinical laboratories have better economic possibilities to set up dedicated servers with increased memory capacity, but in this case the time to analyze each sample may be of the essence. Overall, NGS provides exciting challenges for Computer Scientists, Biologists and Clinicians alike, in their respective professions.

The primary step in analyzing NGS data is to identify the position of each sequence fragment in the organism’s reference genome sequence. This process is called aligning; the result is an alignment. From a Computing Science point of view, aligning DNA is essentially text searching. DNA sequences can be described as words over a finite alphabet of four characters. A hundred million or so short pieces of text (each 30-200 characters) should be placed in the right position relative to the corresponding reference text sequence (some billion characters). The challenge increases as biological reality also includes some degree of variation and deviation from previously known sequences. The task is thus not only to find all positions of exact matches, but to find all positions of the best approximate matches.

In this work I review the algorithms behind modern NGS alignment software. I also argue that an alternative algorithm, the fast Fourier transform (FFT), could be well suited to the task, although it is not currently used in NGS software.


2 Background

2.1 Basics in molecular biology

A cell keeps all the rules for its survival and reproduction in the deoxyribonucleic acid (DNA) archive. DNA is composed of a backbone of sugar molecules, called deoxyribose, chained together in such a way that they make a strand. Each deoxyribose is linked to one of four different nitrogenous bases, called adenine (A), cytosine (C), guanine (G) and thymine (T). A unit of deoxyribose and any of the nitrogenous bases is called a nucleotide. The order of the nucleotides in the DNA strand makes up the genetic code. A DNA strand binds to another DNA strand, and together they make the well-known double helix [11]. Taken together, all pieces of DNA in a human cell amount to about 3.3 × 10^9 nucleotides.

DNA is used in mainly two ways. First, it can be copied to an identical¹ copy upon cell division. Second, selected portions of DNA can be transcribed into a ribonucleic acid (RNA) copy. RNA is chemically almost identical to DNA but is much less stable and breaks down within minutes or hours. The regions of DNA that are transcribed into RNA are called genes. RNA has a number of different roles in the cell, one being that it contains instructions for how amino acids should be assembled into proteins.

2.2 Next Generation Sequencing

The main difference between next generation sequencing (NGS) and previous sequencing technologies is the massive parallelization of the chemical reactions. Earlier, DNA sequences were determined in solution, with all molecules floating around in the test tube. Only one particular kind of DNA molecule could be studied in the entire test tube. In NGS the DNA molecules are attached to a surface, which makes it possible to observe single DNA or RNA molecules during the chemical reactions. Analyzing single DNA molecules is possible due to the development of new chemistry and improved detection methods. Many different DNA molecules can therefore be studied in parallel in a single sample [59].

Determining the sequence of single RNA molecules serves mainly two different purposes. One is counting the number of RNA copies of each gene. This is based on the assumption that the RNA copy number is proportional to the gene’s biological activity in the cell. For example, many genes are active only during fetal development and are not expressed at all for the rest of the organism’s life. In this case, sequencing of RNA serves to identify which gene, i.e. which position in the genome, was used to produce this piece of RNA. This is very useful data in basic research in Biology and Medicine, but also for diagnosing tumors.

¹ Changes may occur, sometimes by mistake but sometimes also deliberately, for example as a part of the immune defense’s adaptation to fight various infectious agents. Changes in genetic sequences are commonly called mutations. These may cause tumors or other diseases but they are also crucial in the evolution of species.


A second purpose is mapping of mutations, for example to diagnose inherited diseases. The nucleotide sequence may have been changed by mutations such that a gene has lost its biological function. The instructions for protein assembly are simply erroneous and the produced proteins may therefore malfunction. As NGS is nowadays a relatively cheap technology, it may be economically advantageous to analyze this by sequencing all RNA molecules in the cell rather than analyzing single genes by other means.


3 Algorithms for sequence alignments

Most existing sequence mapping software uses a seed-and-extend technique. For each sequence fragment obtained by NGS, an exact match method determines candidate positions in the reference sequence. These candidate positions are called seeds. As there might be mismatches between the read sequence and the reference sequence, exact match searches may be performed on only a section of the read sequence. For example, Bowtie2 [28] takes the first half, the second half, and the middle half (excluding the outer 1/4 on each side) in three separate searches for every read.
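The three read sections described above can be sketched as follows. This is only an illustration of the description in the text, not code taken from Bowtie2; the function name `seed_sections` is hypothetical.

```python
def seed_sections(read):
    """Return the first half, the second half and the middle half of a read.

    Hypothetical helper illustrating the three sections described in the
    text; it is not how any particular aligner is implemented.
    """
    n = len(read)
    half = n // 2
    quarter = n // 4
    first = read[:half]
    second = read[half:]
    middle = read[quarter:n - quarter]  # drop the outer quarter on each side
    return first, second, middle


print(seed_sections("ACGTACGT"))
```

Each of the three sections would then be looked up separately with one of the exact matching methods described below.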

In a second step, the seed positions are extended to examine how well the entire query sequence matches the reference sequence. Approximate matching methods are used for this. Many NGS aligners use additional processing to further adjust the read sequence fragment to the reference genome, taking biological models into consideration.

Only the first two steps are reviewed in this work. Additional steps, involving biological models, concern Biology more than Computing Science and are therefore left out of this Computing Science thesis. Most software uses one of the algorithms described below, whereas adaptation to biological models has not yet converged on a standard. The latter may contribute to the large number of software packages available for aligning genetic data. Examples of NGS sequence aligners published in the last ten years may be found in Table 1.

3.1 Exact matching algorithms

Exact matching algorithms are used in the first step of the analysis as they are faster than approximate matching methods. However, considering the size of the reference genome sequences, linear search methods such as the popular Boyer-Moore algorithm [2] are not practically useful; instead, methods with a more advantageous time complexity are used.

3.1.1 Hash tables

One way to determine sequence positions is to use hash tables. For each position in the reference genome, k characters are used as a key in the hash table. The data that corresponds to each unique key could be a pointer to a list that stores the genome positions. Each key needs to have the capacity to link to several genome positions, as the key’s sequence may be present in many different positions in the genome sequence. The keys’ sequences should together cover the entire genome sequence, for example:

GCAAAAAGGCCCCTGGGGGGGGGTTAATGAGTACTGGAAAAAGAAGCGCGAGATACCACT...

GCAAAAAGGC  index 0     GTTAATGAGT  index 22    GCGCGAGATA  index 44
CAAAAAGGCC  index 1     TTAATGAGTA  index 23    CGCGAGATAC  index 45
AAAAAGGCCC  index 2     TAATGAGTAC  index 24    GCGAGATACC  index 46
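A minimal sketch of such a k-mer hash table, using a Python dictionary that maps each key to the list of positions where it occurs. The names `build_kmer_index` and `lookup` are hypothetical, and a real aligner would use a far more compact representation than Python strings and lists.

```python
from collections import defaultdict


def build_kmer_index(genome, k):
    """Map every k-character key to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(genome) - k + 1):
        index[genome[pos:pos + k]].append(pos)
    return index


def lookup(index, kmer):
    """Return all genome positions for a key, or an empty list if absent."""
    return index.get(kmer, [])


# The first twelve characters of the example sequence above, with k = 10.
index = build_kmer_index("GCAAAAAGGCCC", 10)
print(lookup(index, "CAAAAAGGCC"))

# A key that occurs more than once links to several positions.
print(lookup(build_kmer_index("ACGACG", 3), "ACG"))
```

The second call shows why each key must be able to hold several positions.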


Table 1: Alignment software and the method each uses to determine seed positions.

Hash table          Tree structures     Burrows-Wheeler
ERNE2 [45]          Masai [52]          Hpg-Aligner [41]
rHAT [37]           STAR [12]           CUSHAW3 [36]
MOSAIK [29]         YAHA [14]           mrsFAST-Ultra [18]
Hobbes2 [25]        Segemehl [20]       CRAC [44]
NextGenMap [51]     PatMan [46]         GEM [40]
SubRead [33]                            BatMeth [34]
SRmapper [17]                           BLASR [5]
SeqAlto [42]                            Batmis [55]
OSA [22]                                BarraCUDA [26]
SplazerS [13]                           Bowtie2 [28]
Stampy [38]                             BWA [30]
GASSST [48]                             SOAP2 [32]
GNUMAP [8]
BFAST [21]
PerM [7]
RazerS [56]
SHRiMP [49]
CloudBurst [50]
PASS [4]
Slider [39]
MAQ [31]
SeqMap [23]
ZOOM [35]
RMAP [53]

Hash tables are the fastest of the methods described here, but also the most memory-demanding [60]. Representing the entire human genome in this way would take about 26 GB of memory for the pointers (calculations may be found in the Appendix). Storing the human genome positions takes an additional 13 GB, list structures excluded. Thus, at least 40 GB would be required to construct a hash table for the complete human genome. Memory demands could be relaxed by dividing the genome into sections and analyzing one section at a time (a bucketization approach). However, this comes at the cost that the entire set of sequence fragments needs to be analyzed for each section of the genome. Another strategy might be to keep only every second key. In this case the cost is that each sequence fragment needs to be analyzed twice, excluding the first character the second time.

3.1.2 Tree structures

A 16-mer DNA fragment can encode 4.3 × 10^9 unique combinations. A number of these combinations will likely remain unused. For comparison, the human genome is about 3.3 × 10^9 nucleotides, and the pigeonhole principle gives that at least about a billion combinations will remain unused in human sequence analyses. In addition, most eukaryotic genomes contain repeated sequences¹, which further reduces the number of unique combinations that are actually used. Representing only those keys that are present in the genome sequence in a tree structure saves memory in comparison to hash tables, as significantly fewer pointers to genome position lists need to be allocated. For example, the PatMan program [46] generates a directed acyclic graph (DAG) of the reference sequence and transforms it into a non-deterministic automaton that is used in the sequence position match analyses.
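As a rough illustration of storing only the keys that actually occur, the sketch below builds a simple trie of k-mers from nested dictionaries. This is a deliberately simplified stand-in, not the DAG or automaton construction used by PatMan; the names `build_kmer_trie` and `trie_lookup` are hypothetical.

```python
def build_kmer_trie(genome, k):
    """Insert every k-mer of the genome into a trie of nested dictionaries.

    Only k-mers that occur in the genome get a node, and keys with a
    common prefix share the nodes of that prefix.
    """
    root = {}
    for pos in range(len(genome) - k + 1):
        node = root
        for ch in genome[pos:pos + k]:
            node = node.setdefault(ch, {})
        # The multi-character key "positions" cannot collide with a nucleotide.
        node.setdefault("positions", []).append(pos)
    return root


def trie_lookup(root, kmer):
    """Return the genome positions of a k-mer, or an empty list if absent."""
    node = root
    for ch in kmer:
        node = node.get(ch)
        if node is None:
            return []
    return node.get("positions", [])


trie = build_kmer_trie("GATTACAGATTACA", 4)
print(trie_lookup(trie, "TTAC"))
print(trie_lookup(trie, "CCCC"))
```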

3.1.3 The Burrows-Wheeler transform

The Burrows-Wheeler (BW) transform [3] was first reported as an algorithm to improve compression of text data. Strings transformed in this way tend to group identical characters together, which is advantageous in entropy-based compression algorithms [3]. The transform also turned out to be useful for exact text matching when extended with so-called FM indexes [16]. The main advantage is a small memory footprint relative to other matching algorithms.

To produce a Burrows-Wheeler transformed string, a unique character is added to the end of the original string. Next, all rotations² of the string are sorted lexicographically. The last characters of the sorted rotations make up the BW-transformed string. Here is an example using the word ’BANANA’ ($ is the added string end character, here sorted after all letters):

All rotations:   Sorted:    Last character:
BANANA$          ANANA$B    B
$BANANA          ANA$BAN    N
A$BANAN          A$BANAN    N         BW-transformed string:
NA$BANA          BANANA$    $         BNN$AAA
ANA$BAN          NANA$BA    A
NANA$BA          NA$BANA    A
ANANA$B          $BANANA    A
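The forward transform in the example can be sketched in a few lines. This naive version materializes and sorts all rotations, which is only practical for short strings. The sort key makes the end character sort after all letters, as in the example; that ordering is a choice made here to match the table, and many implementations sort the end character first instead.

```python
END = "$"


def _sort_key(rotation):
    # Each character becomes (is_end_character, character), so that the
    # end character compares greater than any ordinary character.
    return [(ch == END, ch) for ch in rotation]


def bwt(text):
    """Naive Burrows-Wheeler transform by sorting all rotations."""
    s = text + END
    rotations = [s[i:] + s[:i] for i in range(len(s))]
    rotations.sort(key=_sort_key)
    return "".join(rotation[-1] for rotation in rotations)


print(bwt("BANANA"))
```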

Inverse-transforming a BW-transformed string takes a few more steps. The BW-transformed string is copied and all characters in the copy are sorted lexicographically. The characters of the transformed string are arranged in one column and the sorted copy in the next column. The string end character is found in the sorted column and the character in the corresponding position in the unsorted column is read. This is the last character of the inverse-transformed string. The corresponding occurrence of that character is then located in the sorted column (the k-th occurrence of a character in the unsorted column corresponds to its k-th occurrence in the sorted column), and the second-last character can be read in the unsorted column. The entire original string is reconstructed by iterating these steps. See the example:

¹ Over two-thirds of the human genome consists of repeated genetic elements [27].

² Take a copy of the string and move its first character to the last position. Iterate on the last produced string until every character in the original string has been seen in the first position.


Unsorted:  Sorted:  Step:                                      Inverse-transformed word:
B          A        1. Sorted $ -> third unsorted A            A$
N          A        2. Third sorted A -> second unsorted N     NA$
N          A        3. Second sorted N -> second unsorted A    ANA$
$          B        4. Second sorted A -> first unsorted N     NANA$
A          N        5. First sorted N -> first unsorted A      ANANA$
A          N        6. First sorted A -> unsorted B            BANANA$
A          $        7. Sorted B -> unsorted $; finished.       BANANA$

All rotations are represented in the transformed string, which also means that every substring that exists in the original string can be found at the beginning of some rotation. All rotations are also lexicographically sorted by the BW transformation. Of crucial importance, this enables binary search for substrings; searches can be made in logarithmic time.
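The binary search idea can be illustrated with an explicit list of sorted suffix start positions (a suffix array). This is a simplification chosen for readability: an FM index performs the equivalent search on the transformed string itself without storing all suffixes. The function names are hypothetical.

```python
def build_suffix_array(text):
    """Start positions of all suffixes, in lexicographic order (naive)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Binary search for all suffixes that begin with `pattern`."""
    m = len(pattern)

    def prefix(rank):
        start = suffix_array[rank]
        return text[start:start + m]

    # Lower bound: first suffix whose prefix is not smaller than the pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose prefix is greater than the pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])


text = "BANANA"
sa = build_suffix_array(text)
print(find_occurrences(text, sa, "ANA"))
```

Each of the two searches takes a logarithmic number of comparisons in the length of the text.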

It could, however, be a laborious task to identify a substring’s position in the original string. The position becomes known only when the string end character is seen. The FM index extension of the BW transform addresses this problem [16]. Each character of a certain type, say every T, is tagged with its position in the original string. To save memory, all but, for example, every fourth T in the transformed string could drop their position assignment. To determine its position, the substring then needs to be back-transformed only until a T character that has a position linked to it is found.

Look-ups in the sorted transformed string should cause no serious problem, as the characters are lexicographically sorted. An algorithm could simply sum the total counts of all characters that lexicographically come before the character of interest. Look-ups in the unsorted transformed string take more effort and are again addressed by the FM index approach. The position of, for example, every 16th character of each type is kept in an array, which allows fast look-up of the approximate position. The exact position is determined in a subsequent linear search.
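The sampled look-up in the unsorted string can be sketched as below. The position of every `step`-th occurrence of each character is stored, and the remaining distance is covered by a short linear scan. The names `build_samples` and `find_occurrence` are hypothetical, and the code assumes that the requested occurrence exists.

```python
def build_samples(transformed, step=16):
    """Store the position of every `step`-th occurrence of each character."""
    samples = {}
    seen = {}
    for pos, ch in enumerate(transformed):
        count = seen.get(ch, 0)
        if count % step == 0:
            samples.setdefault(ch, []).append(pos)
        seen[ch] = count + 1
    return samples


def find_occurrence(transformed, samples, ch, k, step=16):
    """Position of the k-th (0-based) occurrence of `ch`.

    Assumes the occurrence exists; `step` must match build_samples.
    """
    pos = samples[ch][k // step]      # approximate position from the array
    remaining = k % step
    while remaining:                  # subsequent linear search
        pos += 1
        if transformed[pos] == ch:
            remaining -= 1
    return pos


transformed = "BNN$AAA"
samples = build_samples(transformed, step=2)
print(find_occurrence(transformed, samples, "A", 1, step=2))
```

A larger `step` saves memory at the price of a longer linear scan, which is the trade-off discussed in the next paragraph.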

The memory requirement for using the BW transform as sketched above, applied to the human genome, would be about 2.5 GB in total (see the Appendix for calculations). Memory requirements can be increased or decreased by changing the sizes of the FM indexes. This affects how frequently position information is available to the algorithm, which in turn impacts the look-up times in an inverse linear relation. The BW transform combined with FM indexes has m × log(n) time complexity, where m is the number of sequence fragments to align and n is the size of the reference sequence. It also has significantly lower memory requirements than the other algorithms discussed in this work.

3.1.4 The fast Fourier transform

During this work it has struck me that the Fourier transform [57] is not used in modern NGS alignment software, and I think it would be an interesting option to explore. The Fourier transform converts a function from its time domain to its frequency domain (Figure 1), and it is used in many image and sound compression programs. A special case is the fast Fourier transform (FFT) [58], which has the attractive n × log(n) time complexity and can be applied to data sizes that are powers of two (2^n values).


The idea is to represent, for example, every A as the value -1, C as 0, G as 1 and T as 2, as sample points in a waveform, which gives a function for the sequence. The Fourier transform would reveal this function’s frequency components. If all frequencies of the candidate sequence are present in the reference sequence, and the phase angles also have identical differences at some point, the position in the reference sequence can be determined. The Fourier transform has been used in a number of classical sequence alignment programs [1, 6, 15, 24, 47], providing proof of concept.

Figure 1: A. The Fourier transform divides a complex wave (blue) into its frequency components (green, yellow and red). B. The transform returns the amplitude and phase angle for each frequency component.
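One well-known way to use the FFT for sequence matching is cross-correlation of indicator vectors, sketched below. Note that this is a different encoding from the single numeric waveform proposed above: each nucleotide gets its own 0/1 vector, and the four correlations are summed to give the number of matching characters at every offset. The small recursive FFT and the names `fft` and `match_counts` are illustrative assumptions, not an optimized implementation.

```python
import cmath


def fft(values, invert=False):
    """Recursive radix-2 FFT; len(values) must be a power of two.

    With invert=True the unnormalized inverse transform is returned.
    """
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], invert)
    odd = fft(values[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def match_counts(reference, read):
    """Number of matching characters for every offset of read in reference."""
    size = 1
    while size < len(reference) + len(read):
        size *= 2                     # zero-pad to a power of two
    totals = [0.0] * size
    for base in "ACGT":
        ref_vec = [1.0 if c == base else 0.0 for c in reference]
        ref_vec += [0.0] * (size - len(ref_vec))
        # Reversing the read turns the convolution into a correlation.
        read_vec = [1.0 if c == base else 0.0 for c in reversed(read)]
        read_vec += [0.0] * (size - len(read_vec))
        product = [a * b for a, b in zip(fft(ref_vec), fft(read_vec))]
        for i, value in enumerate(fft(product, invert=True)):
            totals[i] += value.real / size
    m = len(read)
    return [round(totals[i + m - 1]) for i in range(len(reference) - m + 1)]


counts = match_counts("GATTACAGATTACA", "TTAC")
print([i for i, c in enumerate(counts) if c == 4])
```

Offsets whose count equals the read length are exact matches, and offsets with slightly lower counts are candidates for approximate matches without gaps.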

3.2 Approximate matching algorithms

The biological reality is that there are numerous differences between an individual’s genome and the reference genome sequence. Alignment software therefore also needs to search for the best approximate matches. There seems to be a consensus across different NGS aligners to use either the Smith-Waterman [54] or the Needleman-Wunsch [43] algorithm, which have obvious similarities.

One way to manage approximate matching in sequence analyses is so-called edit-distance arrays [43, 54]. The simplest form is a linear search procedure to find the best match between two strings. Each character in the substring is compared to the character in the corresponding position in the other string. In case the characters differ, the substring receives a penalty score. In the next round, the substring is moved one character position and a new penalty score is determined. The best match is naturally the position that has the lowest penalty score. Allowing gaps in the sequences could produce lower penalty scores, for example:

Without gaps:
GCGTATGCGGCTAAGCG        Six penalties
   ||||||||||
 GCTATGCGGCTATAGCG

With gaps:
GCGTATGCGGCTA-AGCG       Two penalties
|| |||||||||| ||||
GC-TATGCGGCTATAGCG
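The gap-free comparison can be sketched as a count of differing characters in the overlapping columns for a given shift. The function name `mismatches` and the convention that only overlapping columns are penalized are assumptions made for this illustration.

```python
def mismatches(upper, lower, shift):
    """Count differing characters when `lower` is moved `shift` columns right.

    Only columns where both strings have a character are compared.
    """
    penalties = 0
    for i, ch in enumerate(lower):
        column = i + shift
        if 0 <= column < len(upper) and upper[column] != ch:
            penalties += 1
    return penalties


upper = "GCGTATGCGGCTAAGCG"
lower = "GCTATGCGGCTATAGCG"
for shift in (0, 1):
    print(shift, mismatches(upper, lower, shift))
```

With the two example strings, a shift of one column gives the six penalties shown above, whereas no shift gives ten.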

In Biology, mutations are not completely random; a nucleotide mutates more often to one specific other nucleotide. Transition-type mutations (A ↔ G or C ↔ T) can be observed in about 1/3000 nucleotides, transversions (A ↔ C, A ↔ T, C ↔ G or G ↔ T) in about 1/6000 nucleotides, and insertions or deletions in roughly 1/12000 nucleotides. It could also be discussed whether a gap that extends over more than one nucleotide should be considered a single mutation event or one mutation event for each nucleotide it covers. For efficient computation, a penalty matrix could be created that corresponds to the observed mutation frequencies (og = opening gap, eg = extending gap):

      A   C   G   T  og  eg
A     0   2   1   2   4   1
C     2   0   2   1   4   1
G     1   2   0   2   4   1
T     2   1   2   0   4   1
og    4   4   4   4   0   1
eg    1   1   1   1   1   0
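The nucleotide part of the penalty matrix follows directly from the transition and transversion classes, so it can be expressed as a small function instead of a table. The names below are hypothetical; the penalty values are the ones in the matrix above.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

GAP_OPEN = 4      # "og" in the matrix
GAP_EXTEND = 1    # "eg" in the matrix


def substitution_penalty(a, b):
    """Penalty for aligning nucleotide a with nucleotide b."""
    if a == b:
        return 0
    pair = {a, b}
    if pair <= PURINES or pair <= PYRIMIDINES:
        return 1  # transition: A <-> G or C <-> T
    return 2      # transversion


for a in "ACGT":
    print(a, [substitution_penalty(a, b) for b in "ACGT"])
```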

Allowing an unlimited number of substitutions and gaps increases the number of calculations dramatically; more than 5^n different combinations could be tested. This is not practical for efficient computing. Looking closer at such an algorithm, it becomes apparent that many of the tested candidate strings have partly identical sequences, and these subsequences are therefore tested numerous times. The standard technique to avoid repeated calculation of identical values is dynamic programming, i.e. storing the result of each calculation for later use.

3.2.1 The Smith-Waterman and Needleman-Wunsch algorithms

The Needleman-Wunsch [43] and Smith-Waterman [54] algorithms have obvious similarities. The main difference between the two algorithms is the penalty matrix. The Needleman-Wunsch algorithm is described here. An array is constructed where one string is aligned with the rows and the other with the columns (in the example below, the shorter string labels the rows). The first row is assigned zeros and the first column is assigned increasing integers. The value of each remaining element is calculated by dynamic programming such that the least value of three separate calculations from previous iterations is selected. These calculations are: 1) the penalty score for comparing the characters in the two strings, plus the value in the diagonally up-left element; 2) the penalty to introduce or extend a gap, plus the value in the element above; 3) the penalty to introduce or extend a gap, plus the value in the element to the left. This could look like:


   -  T  A  T  T  G  G  C  T  A  T  A  C  G  G  T  T
-  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
G  1  2  1  2  2  0  0  2  2  1  2  1  2  0  0  2  2
C  2  2  4  2  3  4  2  0  3  4  2  4  1  4  2  1  3
G  3  4  4  6  4  3  4  4  2  4  6  4  5  1  4  4  3
T  4  3  6  4  6  6  3  5  4  4  4  8  5  5  3  4  4
A  5  6  3  7  6  8  7  5  7  4  6  4  8  7  7  5  6
T  6  5  7  3  7  8 10  8  5  8  4  8  5 10  9  7  5
G  7  8  7  7  5  7  8 12  9  7  8  6  9  5  9 11  9
C  8  8 10  8  8  7  9  8 12 11  8 10  6  9  7 10 12

To select the best match, the element with the least penalty score in the last row is identified and the path of least values through the array is followed back. This path is indicated by dots in the example below. Matched positions give a path diagonally up-left, and gaps are indicated by a vertical or horizontal path.

   -  T  A  T  T  G  G  C  T  A  T  A  C  G  G  T  T
-  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
G  1  2  1  2  2  0  .  2  2  1  2  1  2  0  0  2  2
C  2  2  4  2  3  4  2  .  3  4  2  4  1  4  2  1  3
G  3  4  4  6  4  3  4  4  .  4  6  4  5  1  4  4  3
T  4  3  6  4  6  6  3  5  .  4  4  8  5  5  3  4  4
A  5  6  3  7  6  8  7  5  7  .  6  4  8  7  7  5  6
T  6  5  7  3  7  8 10  8  5  8  .  8  5 10  9  7  5
G  7  8  7  7  5  7  8 12  9  7  8  .  9  5  9 11  9
C  8  8 10  8  8  7  9  8 12 11  8 10  .  9  7 10 12

The best match in this example would be:

TATTGGC-TATACGGTT
     || ||| |
     GCGTATGC

A path could branch into different matches with equal penalty scores. Edit-distance algorithms may therefore require some delicate decision-making to select which branch to report.
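The table-filling step described above can be sketched as follows. To keep the example short it uses unit penalties (1 for a mismatch and 1 for each gap position) rather than the penalty matrix with separate gap opening and extension scores, so its numbers are not meant to reproduce the table above. The function name `fill_table` is hypothetical.

```python
def fill_table(query, reference, gap=1):
    """Fill the dynamic programming table with the query along the rows.

    The first row is all zeros, so a match may start anywhere in the
    reference, and the first column holds increasing integers.
    """
    table = [[0] * (len(reference) + 1)]
    for i, q in enumerate(query, start=1):
        row = [i] + [0] * len(reference)
        above = table[i - 1]
        for j, r in enumerate(reference, start=1):
            row[j] = min(
                above[j - 1] + (q != r),  # compare the two characters
                above[j] + gap,           # gap: move down from the element above
                row[j - 1] + gap,         # gap: move right from the element to the left
            )
        table.append(row)
    return table


table = fill_table("GCGTATGC", "TATTGGCTATACGGTT")
last_row = table[-1]
best = min(last_row)
print(best, [j for j, value in enumerate(last_row) if value == best])
```

The smallest value in the last row is the penalty of the best match, and the columns where it occurs are the candidate end positions from which a path would be traced back.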

The Smith-Waterman algorithm is almost identical to the Needleman-Wunsch algorithm, the main difference being the penalty score matrix. In Smith-Waterman, matching characters get a positive score whereas mismatches receive a negative score. All scores in the array are, however, kept non-negative. In this case, the best path is the one with the highest scores. The starting point of this path can be found not only in the last row but anywhere in the matrix.

This seemingly subtle difference between the Smith-Waterman and Needleman-Wunsch algorithms has a big impact on how to interpret the results. Needleman-Wunsch returns the best position for the entire query sequence, whereas Smith-Waterman gives the best position for a part of the query sequence, while at the same time another part may match poorly. Smith-Waterman is often referred to as local alignment. The Smith-Waterman and Needleman-Wunsch algorithms have m × n complexity in both time and space.
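A minimal sketch of the Smith-Waterman scoring rule, assuming illustrative scores of +1 for a match and -1 for a mismatch or gap position (these values are not taken from any particular aligner). It returns only the best local score, so it keeps just two rows in memory; a traceback would need the full m × n table.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best local alignment score between strings a and b."""
    best = 0
    previous = [0] * (len(b) + 1)
    for x in a:
        current = [0]
        for j, y in enumerate(b, start=1):
            score = max(
                0,                                            # never go negative
                previous[j - 1] + (match if x == y else mismatch),
                previous[j] + gap,
                current[j - 1] + gap,
            )
            current.append(score)
            best = max(best, score)   # the best cell may be anywhere
        previous = current
    return best


print(smith_waterman_score("ACGT", "TTACGTTT"))
```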

To improve execution time, dynamic programming is an obvious target for parallel processing. However, both the Smith-Waterman and Needleman-Wunsch algorithms have dependencies on previous iterations such that a diagonal wavefront of data across the array is processed. This is not optimal for parallelization. Dependencies on data both above and to the left in the array are due to the possibility of gaps in both sequences. Gaps are, however, a relatively uncommon feature. A number of implementations therefore process an entire row or an entire column in each round of parallel processing. In case gap scores need to be taken into consideration, the entire row or column is processed repeatedly until all scores are set.
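The wavefront dependency can be made visible by filling the table one anti-diagonal at a time. The sketch below computes a plain edit distance with unit penalties, a simplified stand-in for the algorithms above. It runs sequentially, but all cells on the same anti-diagonal depend only on earlier anti-diagonals, so the inner loop is the part a parallel implementation could distribute. The function name is hypothetical.

```python
def edit_distance_wavefront(a, b):
    """Edit distance with unit penalties, filled one anti-diagonal at a time."""
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        table[i][0] = i
    for j in range(cols):
        table[0][j] = j

    # Cells with the same i + j form one anti-diagonal (the wavefront).
    for diagonal in range(2, rows + cols - 1):
        first_row = max(1, diagonal - cols + 1)
        last_row = min(rows - 1, diagonal - 1)
        for i in range(first_row, last_row + 1):   # independent of each other
            j = diagonal - i
            table[i][j] = min(
                table[i - 1][j - 1] + (a[i - 1] != b[j - 1]),
                table[i - 1][j] + 1,
                table[i][j - 1] + 1,
            )
    return table[-1][-1]


print(edit_distance_wavefront("kitten", "sitting"))
```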


4 Discussion

Next Generation Sequencing has opened up a previously hidden dimension of information in basic Biology and Biomedical research. For funding reasons, large-scale sequencing of genetic material was previously only possible for large consortia specializing in a few model organisms. Since the first complete genomes were published there have been strong expectations and pressure to develop affordable sequencing technologies. Big leaps have been taken in the last fifteen years in the chemical processing used to determine sequences. These technologies are nowadays affordable for essentially any laboratory. The bottleneck is now instead to analyze the enormous amount of data they generate.

Most software uses the seed-and-extend technique, i.e. it finds a piece of the candidate sequence that matches the reference genome using exact match algorithms, and then attempts to extend this match to cover the entire candidate sequence. Traditional linear text searches are simply not suitable due to the large reference genome sequences, which may have billions (for some plants even more than a trillion) of nucleotides. Instead, the algorithms are based on data structures that allow fast look-ups (hash tables, tree structures) or on organizing the reference sequence in a way that allows binary searches (the Burrows-Wheeler transform). Further optimizations and improved parallelization can surely be made to most software. I believe, however, that radically better performance would require someone to think outside the box and come up with a fresh idea.

During this work it struck me that modern NGS alignment software does not utilize the Fourier transform to identify sequence fragment positions. This transform has been used in signal analysis for decades and has been shown to work in the sequence alignment context as well. Whether the Fourier transform is not used because it has been outcompeted by the other algorithms or because it has simply been overlooked remains an open question to me. Sometimes researchers tend to just do what everybody else does. Perhaps it would be worth the effort to evaluate the FFT algorithm in the Next Generation Sequencing context as well.

Affordable access to genetic sequences has certainly opened new and exciting possibilities for Biologists, Biomedicine researchers and Clinicians in their respective disciplines. Besides improved research opportunities and analyses in Biology and Medicine, this technology also poses interesting questions for Computing Scientists.


References

[1] D.C. Benson. Fourier methods for biosequence analysis. Nucleic Acids Research 18:6305-6310, 1990.

[2] R.S. Boyer, J.S. Moore. A fast string searching algorithm. Communications of the ACM 20:762-772, 1977.

[3] M. Burrows, D.J. Wheeler. A block-sorting lossless data compression algorithm. Digital Systems Research Center SRC-RR-124, 1994.

[4] D. Campagna, A. Albiero, A. Bilardi, E. Caniato, C. Forcato, S. Manavski, N. Vitulo, G. Valle. PASS: a program to align short sequences. Bioinformatics 25:967-968, 2009.

[5] M.J. Chaisson, G. Tesler. Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory. BMC Bioinformatics 13:238, 2012.

[6] E.A. Cheever, G.C. Overton, D.B. Searls. Fast Fourier transform-based correlation of DNA sequences using complex plane encoding. Computational Applications in Bioscience 7:143-154, 1991.

[7] Y. Chen, T. Souaiaia, T. Chen. PerM: efficient mapping of short sequencing reads with periodic full sensitive spaced seeds. Bioinformatics 25:2514-2521, 2009.

[8] N.L. Clement, Q. Snell, M.J. Clement, P.C. Hollenhorst, J. Purwar, B.J. Graves, B.R. Cairns, W.E. Johnson. The GNUMAP algorithm: unbiased probabilistic mapping of oligonucleotides from next-generation sequencing. Bioinformatics 26:38-45.

[9] R. Cole. Tight bounds on the complexity of the Boyer-Moore string matching algorithm. Proceedings of the second annual ACM-SIAM symposium on Discrete algorithms 222-233, 1991.

[10] A. Cornish-Bowden. Nomenclature for incompletely specified bases in nucleic acid sequences: Recommendations 1984. Nucleic Acids Research 13:3021-3030, 1985.

[11] F. Crick. Central dogma of molecular biology. Nature 227:561-563, 1970.

[12] A. Dobin, C.A. Davis, F. Schlesinger, J. Drenkow, C. Zaleski, S. Jha, P. Batut, M. Chaisson, T.R. Gingeras. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29:15-21, 2013.

[13] A.K. Emde, M.H. Schulz, D. Weese, R. Sun, M. Vingron, V.M. Kalscheuer, S.A. Haas, K. Reinert. Detecting genomic indel variants with exact breakpoints in single- and paired-end sequencing data using SplazerS. Bioinformatics 28:619-27, 2012.


[14] G.G. Faust, I.M. Hall. YAHA: fast and flexible long-read alignment with optimalbreakpoint detection. Bioinformatics 28: 2417–2424, 2012.

[15] J Felsenstein, S Sawyer, R Kochin. An efficient method for matching nucleic acidsequences. Nucleic Acids Research 10:133-139, 1982.

[16] P. Ferragina, G. Manzini. Opportunistic data structures with applications. 41st AnnualSymposium on foundations of computer science. 2000.

[17] P.M. Gontarz, J. Berger, C.F. Wong. SRmapper: a fast and sensitive genome-hashingalignment tool. Bioinformatics 29:316-21, 2013.

[18] F. Hach, I. Sarrafi, F. Hormozdiari, C. Alkan, E.E. Eichler, S.C. Sahinalp mrsFAST-Ultra: a compact, SNP-aware mapper for high performance sequencing applicationsNucleic Acids Research 42:W494–W500, 2014.

[19] M.T. Heideman, D.H. Johnson, C.S. Burrus. Gauss and the History of the Fast FourierTransform. IEEE ASSP Magazine 1:14-21, 1984.

[20] S. Hoffmann, C. Otto, S. Kurtz, C.M. Sharma, P. Khaitovich, J. Vogel, P.F. Stadler,J. Hackermuller. Fast mapping of short sequences with mismatches, insertions anddeletions using index structures. PLoS Computational Biology 5:e1000502, 2009.

[21] N. Homer, B. Merriman, S.F. Nelson. BFAST: an alignment tool for large scalegenome resequencing. PLoS One 4:e7767, 2009.

[22] J Hu, H. Ge, M. Newman, K. Liu. OSA: a fast and accurate alignment tool for RNA-Seq. Bioinformatics 28:1933-1934, 2012.

[23] H. Jiang, W.H. Wong. SeqMap: mapping massive amount of oligonucleotides to the genome. Bioinformatics 24:2395–2396, 2008.

[24] K. Katoh, K. Misawa, K. Kuma, T. Miyata. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Research 30:3059–3066, 2002.

[25] J. Kim, C. Li, X. Xie. Improving read mapping using additional prefix grams. BMC Bioinformatics 15:42, 2014.

[26] P. Klus, S. Lam, D. Lyberg, M.S. Cheung, G. Pullan, I. McFarlane, G.S. Yeo, B.Y. Lam. BarraCUDA - a fast short read sequence aligner using graphics processing units. BMC Research Notes 5:27, 2012.

[27] A.P.J. de Koning, W. Gu, T.A. Castoe, M.A. Batzer, D.D. Pollock. Repetitive Elements May Comprise Over Two-Thirds of the Human Genome. PLoS Genetics 7:e1002384, 2011.

[28] B. Langmead, C. Trapnell, M. Pop, S.L. Salzberg. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology 10:R25, 2009.

[29] W.-P. Lee, M.P. Stromberg, A. Ward, C. Stewart, E.P. Garrison, G.T. Marth. MOSAIK: A Hash-Based Algorithm for Accurate Next-Generation Sequencing Short-Read Mapping. PLoS One 9:e90581, 2014.


[30] H. Li, R. Durbin. Fast and accurate short read alignment with Burrows–Wheeler Transform. Bioinformatics 25:1754–1760, 2009.

[31] H. Li, J. Ruan, R. Durbin. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Research 18:1851-1858, 2008.

[32] R. Li, C. Yu, Y. Li, T.-W. Lam, S.-M. Yiu, K. Kristiansen, J. Wang. SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics 25:1966-1967, 2009.

[33] Y. Liao, G.K. Smyth, W. Shi. The Subread aligner: fast, accurate and scalable read mapping by seed-and-vote. Nucleic Acids Research 41:e108, 2013.

[34] J.-Q. Lim, C. Tennakoon, G. Li, E. Wong, Y. Ruan, C.-L. Wei, W.-K. Sung. BatMeth: improved mapper for bisulfite sequencing reads on DNA methylation. Genome Biology 13:R82, 2012.

[35] H. Lin, Z. Zhang, M.Q. Zhang, B. Ma, M. Li. ZOOM! Zillions of oligos mapped. Bioinformatics 24:2431-7, 2008.

[36] Y. Liu, B. Popp, B. Schmidt. CUSHAW3: sensitive and accurate base-space and color-space short-read alignment with hybrid seeding. PLoS One 9:e86869, 2014.

[37] B. Liu, D. Guan, M. Teng, Y. Wang. rHAT: fast alignment of noisy long reads with regional hashing. Bioinformatics 32:1625-1631, 2016.

[38] G. Lunter, M. Goodson. Stampy: A statistical algorithm for sensitive and fast mapping of Illumina sequence reads. Genome Research 21:936–939, 2011.

[39] N. Malhis, Y.S. Butterfield, M. Ester, S.J. Jones. Slider - Maximum use of probability information for alignment of short sequence reads and SNP detection. Bioinformatics 25:6-13, 2009.

[40] S. Marco-Sola, M. Sammeth, R. Guigó, P. Ribeca. The GEM mapper: fast, accurate and versatile alignment by filtration. Nature Methods 9:1185–1188, 2012.

[41] I. Medina, J. Tárraga, H. Martínez, S. Barrachina, M.I. Castillo, J. Paschall, J. Salavert-Torres, I. Blanquer-Espert, V. Hernández-García, E.S. Quintana-Ortí, J. Dopazo. Highly sensitive and ultrafast read mapping for RNA-seq analysis. DNA Research 23:93–100, 2016.

[42] J.C. Mu, H. Jiang, A. Kiani, M. Mohiyuddin, N.B. Asadi, W.H. Wong. Fast and accurate read alignment for resequencing. Bioinformatics 28:2366–2373, 2012.

[43] S.B. Needleman, C.D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology 48:443–53, 1970.

[44] N. Philippe, M. Salson, T. Commes, E. Rivals. CRAC: an integrated approach to the analysis of RNA-seq reads. Genome Biology 14:R30, 2013.

[45] N. Prezza, F. Vezzi, M. Käller, A. Policriti. Fast, accurate, and lightweight analysis of BS-treated reads with ERNE 2. BMC Bioinformatics 17:69, 2016.


[46] K. Prüfer, U. Stenzel, M. Dannemann, R.E. Green, M. Lachmann, J. Kelso. PatMaN: rapid alignment of short sequences to large databases. Bioinformatics 24:1530-1531, 2008.

[47] S. Rajasekaran, X. Jin, J.L. Spouge. The efficient computation of position-specific match scores with the Fast Fourier Transform. Journal of Computational Biology 9:23-33, 2004.

[48] G. Rizk, D. Lavenier. GASSST: global alignment short sequence search tool. Bioinformatics 26:2534-2540, 2010.

[49] S.M. Rumble, P. Lacroute, A.V. Dalca, M. Fiume, A. Sidow, M. Brudno. SHRiMP: Accurate Mapping of Short Color-space Reads. PLoS Computational Biology 5:e1000386, 2009.

[50] M.C. Schatz. CloudBurst: highly sensitive read mapping with MapReduce. Bioinformatics 25:1363-1369, 2009.

[51] F.J. Sedlazeck, P. Rescheneder, A. von Haeseler. NextGenMap: fast and accurate read mapping in highly polymorphic genomes. Bioinformatics 29:2790-2791, 2013.

[52] E. Siragusa, D. Weese, K. Reinert. Fast and accurate read mapping with approximate seeds and multiple backtracking. Nucleic Acids Research 41:e78, 2013.

[53] A.D. Smith, Z. Xuan, M.Q. Zhang. Using quality scores and longer reads improves accuracy of Solexa read mapping. BMC Bioinformatics 9:128, 2008.

[54] T.F. Smith, M.S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology 147:195–197, 1981.

[55] C. Tennakoon, R.W. Purbojati, W.K. Sung. BatMis: a fast algorithm for k-mismatch mapping. Bioinformatics 28:2122-2128, 2012.

[56] D. Weese, A.K. Emde, T. Rausch, A. Döring, K. Reinert. RazerS - fast read mapping with sensitivity control. Genome Research 19:1646-1654, 2009.

[57] https://en.wikipedia.org/wiki/Fourier_transform. Retrieved June 3, 2017.

[58] https://en.wikipedia.org/wiki/Fast_Fourier_transform. Retrieved June 3, 2017.

[59] https://en.wikipedia.org/wiki/DNA_sequencing. Retrieved June 3, 2017.

[60] https://en.wikipedia.org/wiki/Hash_table. Retrieved June 3, 2017.


Appendix: Calculations

This appendix describes calculations for some of the data stated in the text.

In section 3.1.1 Hash tables

These calculations are based on array sizes rather than on actual hash tables. The size of the hash table is calculated from a worst-case scenario in which every possible combination of k nucleotides occurs, which is unlikely in a real-life experiment. On the other hand, hash tables need considerable overhead to function well.

The text states that it takes 26 GB of memory to keep the pointers and 13 GB to store all positions in the human genome for the hash table. The human genome contains $3.3 \times 10^{9}$ nucleotides, which gives $3.3 \times 10^{9}$ key-value pairs if the offset between consecutive k-mers is one nucleotide. Assuming 64-bit pointers (8 bytes per pointer) and 32-bit integers (4 bytes per integer) for the genome positions, this gives:

$3.3 \times 10^{9} \times 8 \text{ bytes} = 26.4 \times 10^{9} \text{ bytes} \approx 26 \text{ GB}$

$3.3 \times 10^{9} \times 4 \text{ bytes} = 13.2 \times 10^{9} \text{ bytes} \approx 13 \text{ GB}$
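The arithmetic above can be reproduced with a short Python sketch. The function name `hash_table_memory` and the constant names are hypothetical and chosen only for illustration; the genome size, pointer width and integer width are the assumptions stated in the text.

```python
# Assumptions taken from the text: 3.3e9 nucleotides, one k-mer per offset,
# 64-bit pointers and 32-bit integers for genome positions.
GENOME_SIZE = 3_300_000_000
POINTER_BYTES = 8   # 64-bit pointer
POSITION_BYTES = 4  # 32-bit integer


def hash_table_memory(genome_size=GENOME_SIZE):
    """Return (bytes for pointers, bytes for positions) in the worst case."""
    n_entries = genome_size  # one key-value pair per nucleotide offset
    return n_entries * POINTER_BYTES, n_entries * POSITION_BYTES


pointers, positions = hash_table_memory()
print(f"pointers:  {pointers / 1e9:.1f} GB")
print(f"positions: {positions / 1e9:.1f} GB")
```

With these assumptions the sketch gives 26.4 GB and 13.2 GB, using 1 GB = $10^{9}$ bytes as in the calculation above.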

In section 3.1.3 The Burrows-Wheeler transform

As genetic sequences are represented by an alphabet of four characters, each character can be represented by two bits, which gives four nucleotides per byte. The size of the human genome, $3.3 \times 10^{9}$ nucleotides, divided by four gives approximately 825 MB of memory to keep the human genome sequence.
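A minimal Python sketch of such a two-bit encoding is given below. The mapping A=0, C=1, G=2, T=3 and the function names `pack` and `unpack` are assumptions made for illustration; any fixed assignment of the four two-bit codes would work equally well.

```python
# Hypothetical two-bit code for the four nucleotides (an assumption).
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq):
    """Pack a nucleotide string into bytes, four nucleotides per byte."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        # Each nucleotide occupies two bits at offset 0, 2, 4 or 6 in its byte.
        out[i // 4] |= CODES[base] << (2 * (i % 4))
    return bytes(out)


def unpack(packed, length):
    """Recover the first `length` nucleotides from a packed byte string."""
    return "".join(
        BASES[(packed[i // 4] >> (2 * (i % 4))) & 3] for i in range(length)
    )


packed = pack("GATTACA")
print(len(packed))        # number of bytes used for 7 nucleotides
print(unpack(packed, 7))  # round trip back to the original string
```

The sequence length has to be stored separately, since the last byte may be only partly filled.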

For the FM indexes, to calculate the memory needed for tagging every fourth T character, I first assume that the four nucleotides occur in equal numbers on average in the reference sequence. There are thus approximately $8.25 \times 10^{8}$ T characters in the reference sequence in total; one fourth of these is about $2.1 \times 10^{8}$. Assuming each position is stored as a 32-bit integer (4 bytes), the multiplication by four cancels the last division and we are back at $8.25 \times 10^{8}$ bytes, or 825 MB, of storage space.

For the second part of the FM indexes, storing the position of every 16th character of each kind, the number of entries is simply the whole genome size divided by 16. Multiplying this by the storage space for each position (here 32 bits, or four bytes) gives:

$3.3 \times 10^{9} \times 4 \text{ bytes} / 16 = 825 \text{ MB}$

In total, memory space is needed for the BW transformed sequence and the two parts of the FM indexes. This makes $825 \text{ MB} \times 3 \approx 2.5 \text{ GB}$.
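The three contributions can be added up in a short Python sketch. The function name `bwt_index_memory` and its parameter names are hypothetical; the intervals (every fourth T, every 16th character), the equal nucleotide frequencies and the 4-byte positions are the assumptions stated in the text.

```python
GENOME_SIZE = 3_300_000_000  # nucleotides, as stated in the text


def bwt_index_memory(genome_size=GENOME_SIZE, position_bytes=4,
                     tag_interval=4, checkpoint_interval=16):
    """Return bytes for (BW sequence, tagged T positions, checkpoints)."""
    # BW transformed sequence: two bits per nucleotide, four per byte.
    bwt = genome_size // 4
    # Assume the four nucleotides are equally frequent, so one fourth are T.
    t_count = genome_size // 4
    tagged = (t_count // tag_interval) * position_bytes
    # One stored position for every 16th character of the sequence.
    checkpoints = (genome_size // checkpoint_interval) * position_bytes
    return bwt, tagged, checkpoints


parts = bwt_index_memory()
for name, size in zip(("BW sequence", "tagged T", "checkpoints"), parts):
    print(f"{name}: {size / 1e6:.0f} MB")
print(f"total: {sum(parts) / 1e9:.3f} GB")
```

Under these assumptions each part comes to 825 MB, and the total of 2.475 GB rounds to the 2.5 GB stated above.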

