

  • This thesis received the 2018 IEEE Turkey Doctoral Dissertation Award

    **This copy was updated on October 2019**

    ACCELERATING THE UNDERSTANDING OF LIFE’S CODE THROUGH BETTER

    ALGORITHMS AND HARDWARE DESIGN

    a dissertation submitted to

    the graduate school of engineering and science

    of bilkent university

    in partial fulfillment of the requirements for

    the degree of

    doctor of philosophy

    in

    computer engineering

    By

    Mohammed H. K. Alser

    June 2018

    arXiv:1910.03936v1 [q-bio.GN] 8 Oct 2019

  • ACCELERATING THE UNDERSTANDING OF LIFE’S CODE

    THROUGH BETTER ALGORITHMS AND HARDWARE DESIGN

    By Mohammed H. K. Alser

    June 2018

    We certify that we have read this dissertation and that in our opinion it is fully

    adequate, in scope and in quality, as a dissertation for the degree of Doctor of

    Philosophy.

    Can Alkan (Advisor)

    Onur Mutlu (Co-Advisor)

    Ergin Atalar

    Mehmet Koyutürk

    Nurcan Tunçbağ

    Oğuz Ergin

    Approved for the Graduate School of Engineering and Science:

    Ezhan Karaşan, Director of the Graduate School


  • ABSTRACT

    ACCELERATING THE UNDERSTANDING OF LIFE’S CODE THROUGH BETTER ALGORITHMS AND

    HARDWARE DESIGN

    Mohammed H. K. Alser

    Ph.D. in Computer Engineering

    Advisor: Can Alkan

    Co-Advisor: Onur Mutlu

    June 2018

    Our understanding of human genomes today is affected by the ability of modern computing technology to quickly and accurately determine an individual’s entire genome. Over the past decade, high throughput sequencing (HTS) technologies have opened the door to remarkable biomedical discoveries through their ability to generate hundreds of millions to billions of DNA segments per run, along with a substantial reduction in time and cost. However, this flood of sequencing data continues to overwhelm the processing capacity of existing algorithms and hardware. To analyze a patient’s genome, each of these segments (called reads) must be mapped to a reference genome based on the similarity between the read and “candidate” locations in that reference genome. The similarity measurement, called alignment and formulated as an approximate string matching problem, is the computational bottleneck because: (1) it is implemented using quadratic-time dynamic programming algorithms, and (2) the majority of candidate locations in the reference genome do not align with a given read due to high dissimilarity. Calculating the alignment of such incorrect candidate locations consumes an overwhelming majority of a modern read mapper’s execution time. Therefore, it is crucial to develop a fast and effective filter that can detect incorrect candidate locations and eliminate them before invoking computationally costly alignment algorithms.
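The quadratic-time dynamic programming mentioned above is the textbook edit (Levenshtein) distance recurrence. A minimal reference implementation is sketched below to illustrate why the cost grows with the product of the two sequence lengths; this is an illustration only, not the banded or bit-parallel variants that production aligners use:

```python
def edit_distance(read, ref):
    """Classic Wagner-Fischer DP: O(len(read) * len(ref)) time and space."""
    m, n = len(read), len(ref)
    # dp[i][j] = minimum number of edits to turn read[:i] into ref[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # i deletions
    for j in range(n + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if read[i - 1] == ref[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # match / substitution
    return dp[m][n]

print(edit_distance("kitten", "sitting"))   # → 3
print(edit_distance("GATTACA", "GATACA"))   # → 1 (one deletion)
```

Every cell of the m-by-n table must be filled, which is exactly the quadratic cost that makes running this computation on millions of dissimilar candidate locations wasteful.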

    In this thesis, we introduce four new algorithms that function as a pre-alignment step and aim to filter out most incorrect candidate locations. We call our algorithms GateKeeper, Shouji, MAGNET, and SneakySnake. The first key idea of our proposed pre-alignment filters is to provide high filtering accuracy by correctly detecting all similar segments shared between two sequences.


    The second key idea is to exploit the massively parallel architecture of modern FPGAs to accelerate our four proposed filtering algorithms. We also develop an efficient CPU implementation of the SneakySnake algorithm for commodity desktops and servers, which are widely available to bioinformaticians without the hassle of handling hardware complexity. We evaluate the benefits and downsides of our pre-alignment filtering approach in detail using 12 real datasets across different read lengths and edit distance thresholds. In our evaluation, we demonstrate that our hardware pre-alignment filters provide a speedup of two to three orders of magnitude over their equivalent CPU implementations. We also demonstrate that integrating our hardware pre-alignment filters with state-of-the-art read aligners reduces the aligner’s execution time by up to 21.5x. Finally, we show that an efficient CPU implementation of pre-alignment filtering still provides significant benefits: SneakySnake reduces the execution time of the best performing CPU-based read aligners, Edlib and Parasail, by up to 43x and 57.9x, respectively. The key conclusion of this thesis is that developing a fast and efficient filtering heuristic, together with a better understanding of its accuracy, leads to a significant reduction in read alignment’s execution time without sacrificing any of the aligner’s capabilities. We hope and believe that our new architectures and algorithms will catalyze their adoption in existing and future genome analysis pipelines.

    Keywords: Read Mapping, Approximate String Matching, Read Alignment, Levenshtein Distance, String Algorithms, Edit Distance, fast pre-alignment filter, field-programmable gate arrays (FPGA).
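To make the pre-alignment idea concrete, the sketch below shows a deliberately simple stand-in filter; it is not GateKeeper, Shouji, MAGNET, or SneakySnake, but a q-gram-counting heuristic (with an illustrative choice of q=4) showing how a cheap check can reject a candidate location before the expensive dynamic programming step:

```python
from collections import Counter

def qgram_prefilter(read, ref, e, q=4):
    """Cheap pre-alignment check based on the q-gram lemma: if two
    equal-length sequences are within edit distance e, they must share
    at least (len(read) - q + 1) - e*q q-grams, counted with
    multiplicity. Returning False means the pair can be safely rejected
    without running the quadratic alignment; True only means the pair
    *may* align (false accepts are possible, false rejects are not)."""
    def qgrams(s):
        return Counter(s[i:i + q] for i in range(len(s) - q + 1))
    threshold = (len(read) - q + 1) - e * q
    if threshold <= 0:
        return True  # the bound is vacuous; nothing can be rejected
    shared = sum((qgrams(read) & qgrams(ref)).values())  # multiset overlap
    return shared >= threshold

read  = "ACGTTGCAAGGTCCATGGAT"
close = "ACGTTACAAGGTCCATGGAT"   # one substitution away from read
far   = "GGGGGGGGGGGGGGGGGGGG"  # highly dissimilar
print(qgram_prefilter(read, close, e=2))  # True: pass on to the aligner
print(qgram_prefilter(read, far, e=2))    # False: rejected, DP skipped
```

As in the thesis's filters, the asymmetry matters: a pre-alignment filter must never reject a truly similar pair, while occasionally accepting a dissimilar one only costs a wasted alignment, not a wrong answer.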

  • ÖZET

    ACCELERATING THE UNDERSTANDING OF LIFE’S CODE THROUGH BETTER ALGORITHMS AND HARDWARE DESIGN

    Mohammed H. K. Alser

    Ph.D. in Computer Engineering

    Advisor: Can Alkan

    Co-Advisor: Onur Mutlu

    June 2018

    Our understanding of human genomes today is affected by modern computing technology’s ability to determine an individual’s entire genome quickly and accurately. Over the past decade, high throughput sequencing (HTS) technologies have opened the door to remarkable biomedical discoveries thanks to their ability to generate from hundreds of millions to billions of DNA segments in a single run, together with a substantial reduction in time and cost. However, this abundance of sequencing data continues to push the limits of the processing capacity of existing algorithms and hardware. To analyze a patient’s genome, each of these segments, called "reads", is placed onto the reference genome according to its similarity to candidate regions in that reference genome. The similarity computation, formulated as an approximate string matching problem and called alignment, is a computational bottleneck because: (1) it is computed using quadratic-time dynamic programming algorithms, and (2) a large portion of the candidate regions in the reference genome cannot be aligned with a given read, since they differ from it to a high degree. Computing the alignability of such incorrectly identified candidate regions constitutes the bulk of the execution time of today’s read mapping algorithms. Therefore, it is crucial to develop a fast and effective filter that can detect incorrect candidate regions and remove them from candidacy before running these computationally costly alignment algorithms.

    In this thesis, we present four new algorithms that function as a pre-alignment step and aim to filter out most incorrect candidate locations. We name our algorithms GateKeeper, Shouji, MAGNET, and SneakySnake.

    The first key idea of the proposed pre-alignment filters is to provide high filtering accuracy by correctly detecting all similar segments shared between two sequences. The second key idea is to use the massively parallel architecture of modern FPGAs to accelerate our four proposed filtering algorithms. We developed SneakySnake primarily so that bioinformaticians can use it on the commodity desktops and servers readily available to them, without having to deal with hardware complexity. We evaluated the advantages and disadvantages of our pre-alignment filtering approach in detail using 12 real datasets with different read lengths and edit distance thresholds. In our evaluation, we show that our hardware pre-alignment filters are two to three orders of magnitude faster than their equivalent CPU implementations. We show that integrating our hardware pre-alignment filters with state-of-the-art read aligners reduces the aligner’s execution time by up to 21.5x, depending on the edit distance threshold. Finally, we show that an efficient CPU implementation of the pre-alignment filters still provides significant benefits. We show that SneakySnake reduces the execution times of the best performing CPU-based read aligners, Edlib and Parasail, by up to 43x and 57.9x, respectively. The main conclusion of this thesis is that developing a fast and efficient filtering mechanism, together with a better understanding of its accuracy, leads to a significant reduction in read alignment’s execution time without sacrificing any of the aligners’ capabilities. We hope and believe that our new architectures and algorithms will catalyze their adoption in existing and future genome analysis pipelines.

    Keywords: Read Mapping, Approximate String Matching, Read Alignment, Levenshtein Distance, String Algorithms, Edit Distance, fast pre-alignment filter, field-programmable gate arrays (FPGA).

  • Acknowledgement

    The last nearly four years at Bilkent University have been the most exciting and fruitful time of my life, not only in the academic arena, but also on a personal level.

    For that, I would like to seize this opportunity to acknowledge all people who

    have supported and helped me to become who I am today. First and foremost,

    the greatest of all my appreciation goes to the almighty Allah for granting me

    countless blessings and helping me stay strong and focused to accomplish this

    work.

    I would like to extend my sincere thanks to my advisor Can Alkan, who introduced me to the realm of bioinformatics. I feel very privileged to be part of his

    lab and I am proud that I am going to be his first PhD graduate. Can generously

    provided the resources and the opportunities that enabled me to publish and collaborate with researchers from other institutions. I appreciate all his support and

    guidance at key moments in my studies while allowing me to work independently

    with my own ideas. I truly enjoyed the memorable time we spent together in

    Ankara, Izmir, Istanbul, Vienna, and Los Angeles. Thanks to him, I got addicted to dark chocolate with almonds and sea salt.

    I am also very thankful to my co-advisor Onur Mutlu for pushing my academic

    writing to a new level. It is an honor that he also gave me the opportunity to join

    his lab at ETH Zürich as a postdoctoral researcher. His dedication to research is

    contagious and inspirational.

    I am grateful to the members of my thesis committee: Ergin Atalar, Mehmet

    Koyutürk, Nurcan Tunçbağ, Oğuz Ergin, and Ozcan Oztürk for their valuable

    comments and discussions.

    I would also like to thank Eleazar Eskin and Jason Cong for giving me the

    opportunity to join UCLA during the summer of 2017 as a staff research associate.

    I would like to extend my thanks to Serghei Mangul, David Koslicki, Farhad

    Hormozdiari, and Nathan LaPierre who provided the guidance to make my visit

    fruitful and enjoyable. I am also thankful to Akash Kumar for giving me the

    chance to join the Chair for Processor Design during the winter of 2017-2018 as

    a visiting researcher at TU Dresden.


    I would like to acknowledge Nordin Zakaria and Izzatdin B A Aziz for providing

    me the opportunity to continue carrying out PETRONAS projects while I was in

    Turkey. I am grateful to Nor Hisham Hamid for believing in me, hiring me as

    a consultant for one of his government-sponsored projects, and inviting me to

    Malaysia. It was my great fortune to work with them.

    I want to thank all my co-authors and colleagues: Eleazar Eskin, Erman Ayday,

    Hongyi Xin, Hasan Hassan, Serghei Mangul, Oğuz Ergin, Akash Kumar, Nour

    Almadhoun, Jeremie Kim, and Azita Nouri.

    I would like to acknowledge all past and current members of our research

    group for being both great friends and colleagues to me. Special thanks to Fatma

    Kahveci for being a great labmate who taught me her native language and tolerated me for many years. Thanks to Gülfem Demir for all her valuable support

    and encouragement. Thanks to Shatlyk Ashyralyyev for his great help when I

    prepared for my qualifying examination and for many valuable discussions on

    research and life. Thanks to Handan Kulan for her friendly nature and priceless

    support. Thanks to Can Firtina and Damla Şenol Cali (CMU) for all the help in

    translating the abstract of this thesis to Turkish. I am grateful to other members

    for their companionship: Marzieh Eslami Rasekh, Azita Nouri, Elif Dal, Zülal

    Bingöl, Ezgi Ebren, Arda Söylev, Halil İbrahim Özercan, Fatih Karaoğlanoğlu,

    Tuğba Doğan, and Balanur İçen.

    I would like to express my deepest gratitude to Ahmed Nemer Almadhoun,

    Khaldoun Elbatsh (and his wonderful family), Hamzeh Ahangari, Soe Thane,

    Mohamed Meselhy Eltoukhy (and his kind family), Abdel-Haleem Abdel-Aty,

    Heba Kadry, and Ibrahima Faye for all the good times and endless generous

    support in the ups and downs of life. I feel very grateful to all my inspiring

    friends who made Bilkent a happy home for me, especially Tamer Alloh, Basil

    Heriz, Kadir Akbudak, Amirah Ahmed, Zahra Ghanem, Muaz Draz, Nabeel Abu

    Baker, Maha Sharei, Salman Dar, Elif Doğan Dar, Ahmed Ouf, Shady Zahed,

    Mohammed Tareq, Abdelrahman Teskyeh, Obada and many others.

    My graduate study is an enjoyable journey with my lovely wife,

    Nour Almadhoun, and my sons, Hasan and Adam. Their love and never-ending support always give me the internal strength to move on

    whenever I feel like giving up or whenever things become too hard.


    Pursuing a PhD together with Nour in the same department is our most precious memory in a lifetime. We could not have completed any of our PhD work without continuously supporting and encouraging each other.

    My family has been a pillar of support, encouragement, and comfort all throughout my journey. Thanks to my parents, Hasan Alser and Itaf Abdelhadi, who

    raised me with a love of science and supported me in all my pursuits. Thanks to

    my lovely sister Deema and my brothers Ayman (and his lovely family), Hani, Sohaib, Osaid, Moath, and Ayham for always believing in me and being by my side.

    Thanks to my parents-in-law, Mohammed Almadhoun and Nariman Almadhoun,

    my brothers-in-law, Ahmed Nemer, Ezz Eldeen, Moatasem, and Mahmoud, and my sisters-in-law, Narmeen, Rania, and Eman, and their families for all their understanding,

    great support, and love.

    Finally, I gratefully acknowledge the funding sources that made my PhD work

    possible. I was honored to be a TÜBİTAK Fellow for four years, offered by the Scientific and Technological Research Council of Turkey under the 2215 program. In 2017,

    I was also honored to receive the HiPEAC Collaboration Grant. I am grateful

    to Bilkent University for providing generous financial support and funding my

    academic visits. This thesis was supported by NIH Grant (No. HG006004) to

    Onur Mutlu and Can Alkan and a Marie Curie Career Integration Grant (No.

    PCIG-2011-303772) to Can Alkan under the Seventh Framework Programme.

    Mohammed H. K. Alser

    June 2018, Ankara, Turkey

  • Contents

    1 Introduction 1

    1.1 Research Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

    1.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

    1.3 Thesis Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

    1.4 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

    1.5 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

    2 Background 14

    2.1 Overview of Read Mapper . . . . . . . . . . . . . . . . . . . . . . 14

    2.2 Acceleration of Read Mappers . . . . . . . . . . . . . . . . . . . . 17

    2.2.1 Seed Filtering . . . . . . . . . . . . . . . . . . . . . . . . . 17

    2.2.2 Accelerating Read Alignment . . . . . . . . . . . . . . . . 19

    2.2.3 False Mapping Filtering . . . . . . . . . . . . . . . . . . . 22

    2.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

    3 Understanding and Improving Pre-alignment Filtering Accuracy 25

    3.1 Pre-alignment Filter Performance Metrics . . . . . . . . . . . . . 25

    3.1.1 Filtering Speed . . . . . . . . . . . . . . . . . . . . . . . . 26

    3.1.2 Filtering Accuracy . . . . . . . . . . . . . . . . . . . . . . 26

    3.1.3 End-to-End Alignment Speed . . . . . . . . . . . . . . . . 27

    3.2 On the False Filtering of SHD algorithm . . . . . . . . . . . . . . 28

    3.2.1 Random Zeros . . . . . . . . . . . . . . . . . . . . . . . . . 28

    3.2.2 Conservative Counting . . . . . . . . . . . . . . . . . . . . 30

    3.2.3 Leading and Trailing Zeros . . . . . . . . . . . . . . . . . . 31

    3.2.4 Lack of Backtracking . . . . . . . . . . . . . . . . . . . . . 32


    3.3 Discussion on Improving the Filtering Accuracy . . . . . . . . . . 34

    3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

    4 The First Hardware Accelerator for Pre-Alignment Filtering 36

    4.1 FPGA as Acceleration Platform . . . . . . . . . . . . . . . . . . . 36

    4.2 Overview of Our Accelerator Architecture . . . . . . . . . . . . . 37

    4.2.1 Read Controller . . . . . . . . . . . . . . . . . . . . . . . . 38

    4.2.2 Result Controller . . . . . . . . . . . . . . . . . . . . . . . 38

    4.3 Parallelization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

    4.4 Hardware Implementation . . . . . . . . . . . . . . . . . . . . . . 40

    4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

    5 GateKeeper: Fast Hardware Pre-Alignment Filter 41

    5.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

    5.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

    5.2.1 Method 1: Fast Approximate String Matching . . . . . . . 42

    5.2.2 Method 2: Insertion and Deletion (Indel) Detection . . . . 43

    5.2.3 Method 3: Minimizing Hardware Resource Overheads . . . 47

    5.3 Analysis of GateKeeper Algorithm . . . . . . . . . . . . . . . . . 48

    5.4 Discussion and Novelty . . . . . . . . . . . . . . . . . . . . . . . . 50

    5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

    6 Shouji: Fast and Accurate Hardware Pre-Alignment Filter 52

    6.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

    6.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

    6.2.1 Method 1: Building the Neighborhood Map . . . . . . . . 54

    6.2.2 Method 2: Identifying the Diagonally-Consecutive Matches 54

    6.2.3 Method 3: Filtering Out Dissimilar Sequences . . . . . . . 58

    6.3 Analysis of Shouji algorithm . . . . . . . . . . . . . . . . . . . . . 59

    6.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

    6.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

    7 MAGNET: Accurate Hardware Pre-Alignment Filter 64

    7.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

    7.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65


    7.2.1 Method 1: Building the Neighborhood Map . . . . . . . . 66

    7.2.2 Method 2: Identifying the E+1 Identical Subsequences . . 66

    7.2.3 Method 3: Filtering Out Dissimilar Sequences . . . . . . . 68

    7.3 Analysis of MAGNET algorithm . . . . . . . . . . . . . . . . . . . 69

    7.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

    7.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

    8 SneakySnake: Fast, Accurate, and Cost-Effective Filter 74

    8.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

    8.2 The Sneaky Snake Problem . . . . . . . . . . . . . . . . . . . . . 75

    8.3 Weighted Maze Methods . . . . . . . . . . . . . . . . . . . . . . . 76

    8.3.1 Method 1: Building the Weighted Neighborhood Maze . . 77

    8.3.2 Method 2: Finding the Optimal Travel Path . . . . . . . . 77

    8.3.3 Method 3: Examining the Snake Survival . . . . . . . . . . 78

    8.4 Unweighted Maze Methods . . . . . . . . . . . . . . . . . . . . . . 79

    8.4.1 Method 1: Building the Unweighted Neighborhood Maze . 79

    8.4.2 Method 2: Finding the Optimal Travel Path . . . . . . . . 80

    8.4.3 Method 3: Examining the Snake Survival . . . . . . . . . . 81

    8.5 Analysis of SneakySnake Algorithm . . . . . . . . . . . . . . . . . 81

    8.6 SneakySnake on an FPGA . . . . . . . . . . . . . . . . . . . . . . 83

    8.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88

    8.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

    9 Evaluation 90

    9.1 Dataset Description . . . . . . . . . . . . . . . . . . . . . . . . . . 90

    9.2 Resource Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

    9.3 Filtering Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . 93

    9.3.1 Partitioning the Search Space of Approximate and Exact

    Edit Distance . . . . . . . . . . . . . . . . . . . . . . . . . 94

    9.3.2 False Accept Rate . . . . . . . . . . . . . . . . . . . . . . . 98

    9.3.3 False Reject Rate . . . . . . . . . . . . . . . . . . . . . . . 101

    9.4 Effects of Hardware Pre-Alignment Filtering on Read Alignment . 105

    9.5 Effects of CPU Pre-Alignment Filtering on Read Alignment . . . 108

    9.6 Memory Utilization . . . . . . . . . . . . . . . . . . . . . . . . . . 114


    10 Conclusions and Future Directions 118

    10.1 Future Research Directions . . . . . . . . . . . . . . . . . . . . . . 119

    11 Other Works of This Author 122

    A Data 141

  • List of Figures

    1.1 Rate of verified and unverified read-reference pairs that are generated by the mrFAST mapper and fed into its read alignment algorithm. We set the threshold to 2, 3, and 5 edits. . . . 3

    1.2 The flow chart of seed-and-extend based mappers, which includes five main steps. (1) A typical mapper starts with indexing the reference genome using a scheme based on a hash table or BWT-FM. (2) It extracts a number of seeds from each read produced by the sequencing machine. (3) It obtains a list of possible locations within the reference genome that could result in a match with the extracted seeds. (4) It applies fast heuristics to examine the alignment of the entire read with its corresponding reference segment of the same length. (5) It uses an expensive and accurate alignment algorithm to determine the similarity between the read and the reference sequences. . . . 7

    1.3 End-to-end execution time (in seconds) for read alignment, with (blue plot) and without (green plot) a pre-alignment filter. Our goal is to significantly reduce the alignment time spent on verifying incorrect mappings (highlighted in yellow). We sweep the percentage of rejected mappings and the filtering speed relative to the alignment algorithm on the x-axis. . . . 8


    2.1 Timeline of read mappers. CPU-based mappers are plotted in black, GPU-accelerated mappers in red, FPGA-accelerated mappers in blue, and SSE-based mappers in green. Grey dotted lines connect related mappers (extensions or new versions). The names in the timeline are exactly as they appear in publications, except: SOAP3-FPGA [103], BWA-MEM-FPGA [55], BFAST-Olson [104], BFAST-Yus [105], BWA-Waidyasooriya [58], and BWA-W [68]. . . . 18

    3.1 An example of a mapping with all its generated masks, where the edit distance threshold (E) is set to 3. The green highlighted subsequences are part of the correct alignment. The red highlighted bit in the final bit-vector is a wrong alignment provided by SHD. The correct alignment (highlighted in yellow) shows that there are three substitutions and a single deletion, while SHD detects only two substitutions and a single deletion. . . . 29

    3.2 Examples of an incorrect mapping that passes the SHD filter due to the random zeros. While the edit distance threshold is 3, a mapping of 4 edits (as examined at the end of the figure by the Needleman-Wunsch algorithm) passes as a falsely accepted mapping. . . . 30

    3.3 An example of an incorrect mapping that passes the SHD filter

    due to conservative counting of the short streak of ‘1’s in the final

    bit-vector. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

    3.4 Examples of an invalid mapping that passes the SHD filter due to

    the leading and trailing zeros. We use an edit distance threshold of

    3 and an SRS threshold of 2. While the regions that are highlighted

    in green are part of the correct alignment, the wrong alignment

    provided by SHD is highlighted in red. The yellow highlighted bits

    indicate a source of falsely accepted mapping. . . . . . . . . . . . 32

    3.5 Examples of incorrect mappings that pass the SHD filter due to

    (a) overlapping identical subsequences, and (b) lack of backtracking. 33

    4.1 Overview of our accelerator architecture. . . . . . . . . . . . . . . 38

    4.2 Xilinx Virtex-7 FPGA VC709 Connectivity Kit and chip layout of

    our implemented accelerator. . . . . . . . . . . . . . . . . . . . . . 39


    5.1 A flowchart representation of the GateKeeper algorithm. . . . . . 44

    5.2 An example showing how various types of edits affect the alignment of two reads. In (a) the upper read exactly matches the lower read and thus each base exactly matches the corresponding base in the target read. (b) shows base-substitutions that only affect the alignment at their positions. (c) and (d) demonstrate insertions and deletions, respectively. Each edit has an influence on the alignment of all the subsequent bases. . . . 45

    5.3 Workflow of the proposed architecture for the parallel amendment

    operations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

    5.4 An example of applying our solution for reducing the number of

    bits of each Hamming mask by half. We use a modified Hamming

    mask to store the result of applying the bitwise OR operation to

    each two bits of the Hamming mask. The modified mask maintains

    the same meaning of the original Hamming mask. . . . . . . . . . 48

    6.1 Random edit distribution in a read sequence. The edits (e1, e2,

    . . . , eE) act as dividers resulting in several identical subsequences

    (m1, m2, . . . , mE+1) between the read and the reference. . . . . . 53

    6.2 Neighborhood map (N ) and the Shouji bit-vector, for text T =

    GGTGCAGAGCTC, and pattern P = GGTGAGAGTTGT for

    E=3. The three common subsequences (i.e., GGTG, AGAG, and

    T) are highlighted in yellow. We use a search window of size 4

    columns (two examples of which are highlighted in red) with a

    step size of a single column. Shouji searches diagonally within each

    search window for the 4-bit vector that has the largest number

    of zeros. Once found, Shouji examines if the found 4-bit vector

    maximizes the number of zeros at the corresponding location of

    the 4-bit vector in the Shouji bit-vector. If so, then Shouji stores

    this 4-bit vector in the Shouji bit-vector at its corresponding location. 55


    6.3 The effect of the window size on the rate of the falsely accepted sequences (i.e., dissimilar sequences that are considered as similar ones by the Shouji filter). We observe that a window width of 4 columns provides the highest accuracy. We also observe that as the window size increases beyond 4 columns, more similar sequences are rejected by Shouji, which should be avoided. . . . 58

    6.4 An example of applying Shouji filtering algorithm to a sequence

    pair, where the edit distance threshold is set to 4. We present the

    content of the neighborhood map along with the Shouji bit-vector.

    We apply the Shouji algorithm starting from the leftmost column

    towards the rightmost column. . . . . . . . . . . . . . . . . . . . . 59

    6.5 A block diagram of the sliding window scheme implemented in

    FPGA for a single filtering unit. . . . . . . . . . . . . . . . . . . . 62

    7.1 The false accept rate of MAGNET using two different algorithms for identifying the identical subsequences. We observe that finding the E+1 non-overlapping identical subsequences leads to a significant reduction in the incorrectly accepted sequences compared to finding the top E+1 longest identical subsequences. . . . 68

    7.2 An example of applying MAGNET filtering algorithm with an

    edit distance threshold of 4. MAGNET finds all the longest non-

    overlapping subsequences of consecutive zeros in the descending

    order of their length (as numbered in yellow). . . . . . . . . . . . 69

    8.1 Weighted Neighborhood maze (WN ), for reference sequence A =

    GGTGGAGAGATC, and read sequence B = GGTGAGAGTTGT

    for E=3. The snake traverses the path that is highlighted in yellow

    which includes three obstacles and three pipes, representing

    three identical subsequences: GGTG, AGAG, and T . . . . . . . . 78

    8.2 Unweighted neighborhood maze (UN ), for reference sequence A =

    GGTGGAGAGATC, and read sequence B = GGTGAGAGTTGT

    for E=3. The snake traverses the path that is highlighted in yellow

    which includes three obstacles and three pipes, representing

    three identical subsequences: GGTG, AGAG, and T. . . . . . . . 80


    8.3 Proposed 4-bit LZC comparator. . . . . . . . . . . . . . . . . . . 85

    8.4 Block diagram of the 11 LZCs and the hierarchical LZC comparator tree for computing the largest number of leading zeros in 11 rows. . . . 86

    9.1 The effects of column-wise partitioning of the search space of the SneakySnake algorithm on the average false accept rate ((a), (c), and (e)) and the average execution time ((b), (d), and (f)) of examining set 1 to set 4 in (a) and (b), set 5 to set 8 in (c) and (d), and set 9 to set 12 in (e) and (f). Besides the default size (equal to the read length) of SneakySnake's unweighted neighborhood maze, we choose partition sizes (the number of grid columns included in each partition) of 5, 10, 25, and 50 columns. . . . 96

    9.2 The effects of the number of replications of the hardware imple-

    mentation of SneakySnake algorithm on its filtering accuracy (false

    accept rate). We use a wide range of edit distance thresholds (0

    bp-10 bp for a read length of 100 bp) and four datasets: (a) set 1,

    (b) set 2, (c) set 3, and (d) set 4. . . . . . . . . . . . . . . . . . . 97

    9.3 The effects of computing the prefix edit distance on the overall

    execution time of the edit distance calculations compared to the

    original Edlib (exact edit distance) and our partitioned implemen-

    tation of SneakySnake algorithm. We present the average time

    spent in examining set 1 to set 4 in (a), set 5 to set 8 in (b), and

    set 9 to set 12 in (c). We choose initial sub-matrix sizes of 5, 10,

    25, and 50 columns. We mark the intersection of SneakySnake-5

    and Edlib-50 plots with a dashed vertical line. . . . . . . . . . . . 99

    9.4 The false accept rate produced by our pre-alignment filters, Gate-

    Keeper, Shouji, MAGNET, and SneakySnake, compared to the

    best performing filter, SHD [65]. We use a wide range of edit dis-

    tance thresholds (0-10 edits for a read length of 100 bp) and four

    datasets: (a) set 1, (b) set 2, (c) set 3, and (d) set 4. . . . . . . . 101


9.5 The false accept rate produced by our pre-alignment filters, GateKeeper, Shouji, MAGNET, and SneakySnake, compared to the best performing existing filter, SHD [65]. We use a wide range of edit distance thresholds (0-15 edits for a read length of 150 bp) and four datasets: (a) set 5, (b) set 6, (c) set 7, and (d) set 8 . . . 102

9.6 The false accept rate produced by our pre-alignment filters, GateKeeper, Shouji, MAGNET, and SneakySnake, compared to the best performing existing filter, SHD [65]. We use a wide range of edit distance thresholds (0-25 edits for a read length of 250 bp) and four datasets: (a) set 9, (b) set 10, (c) set 11, and (d) set 12 . . . 103

9.7 An example of a falsely-rejected mapping using the MAGNET algorithm for an edit distance threshold of 6. The random zeros (highlighted in red) confuse the MAGNET filter, causing it to select shorter segments of random zeros instead of a longer identical subsequence (highlighted in blue) . . . 104

9.8 End-to-end execution time (in seconds) for Edlib [78] (full read aligner), with and without pre-alignment filters. We use four datasets ((a) set 1, (b) set 2, (c) set 3, and (d) set 4) across different edit distance thresholds. We highlight with a dashed vertical line the edit distance threshold where Edlib starts to outperform our SneakySnake-5 algorithm . . . 110

9.9 End-to-end execution time (in seconds) for Edlib [78] (full read aligner), with and without pre-alignment filters. We use four datasets ((a) set 5, (b) set 6, (c) set 7, and (d) set 8) across different edit distance thresholds. We highlight with a dashed vertical line the edit distance threshold where Edlib starts to outperform our SneakySnake-5 algorithm . . . 111

9.10 End-to-end execution time (in seconds) for Edlib [78] (full read aligner), with and without pre-alignment filters. We use four datasets ((a) set 9, (b) set 10, (c) set 11, and (d) set 12) across different edit distance thresholds. We highlight with a dashed vertical line the edit distance threshold where Edlib starts to outperform our SneakySnake-5 algorithm . . . 112


9.11 End-to-end execution time (in seconds) for Parasail [54] (full read aligner), with and without pre-alignment filters. We use four datasets ((a) set 1, (b) set 2, (c) set 3, and (d) set 4) across different edit distance thresholds . . . 113

9.12 End-to-end execution time (in seconds) for Parasail [54] (full read aligner), with and without pre-alignment filters. We use four datasets ((a) set 5, (b) set 6, (c) set 7, and (d) set 8) across different edit distance thresholds . . . 114

9.13 End-to-end execution time (in seconds) for Parasail [54] (full read aligner), with and without pre-alignment filters. We use four datasets ((a) set 9, (b) set 10, (c) set 11, and (d) set 12) across different edit distance thresholds . . . 115

9.14 Memory utilization of the Edlib (path) read aligner while evaluating set 12 for an edit distance threshold of 25 . . . 116

9.15 Memory utilization of the exact edit distance algorithm (Edlib ED) combined with the Edlib (path) read aligner while evaluating set 12 for an edit distance threshold of 25 . . . 116

9.16 Memory utilization of SneakySnake-5 combined with the Edlib (path) read aligner while evaluating set 12 for an edit distance threshold of 25 . . . 117

List of Tables

5.1 Overall benefits of GateKeeper over SHD in terms of the number of operations performed . . . 51

9.1 Benchmark Illumina-like read sets of the whole human genome, obtained from EMBL-ENA . . . 91

9.2 Benchmark Illumina-like datasets (read-reference pairs). We map each read set, described in Table 9.1, to the human reference genome in order to generate four datasets using different mapper edit distance thresholds (using the -e parameter) . . . 91

9.3 FPGA resource usage for a single filtering unit of GateKeeper, Shouji, MAGNET, and SneakySnake for a sequence length of 100 and under different edit distance thresholds (E) . . . 93

9.4 The execution time (in seconds) of GateKeeper, MAGNET, Shouji, and SneakySnake under different edit distance thresholds. We use set 1 to set 4 with a read length of 100. We provide the performance results for the CPU implementations and the hardware accelerators with the maximum number of filtering units . . . 106

9.5 End-to-end execution time (in seconds) for several state-of-the-art sequence alignment algorithms, with and without pre-alignment filters (SneakySnake, Shouji, MAGNET, GateKeeper, and SHD), across different edit distance thresholds. We use four datasets (set 1, set 2, set 3, and set 4) . . . 108


A.1 Details of our first four datasets (set 1, set 2, set 3, and set 4). We use Edlib [78] (edit distance mode) to benchmark the accepted and the rejected pairs for edit distance thresholds of 0 up to 10 edits . . . 142

A.2 Details of our second four datasets (set 5, set 6, set 7, and set 8). We use Edlib [78] (edit distance mode) to benchmark the accepted and the rejected pairs for edit distance thresholds of 0 up to 15 edits . . . 143

A.3 Details of our last four datasets (set 9, set 10, set 11, and set 12). We use Edlib [78] (edit distance mode) to benchmark the accepted and the rejected pairs for edit distance thresholds of 0 up to 25 edits . . . 143

A.4 Details of evaluating the feasibility of reducing the search space for SneakySnake and Edlib, evaluated using the set 1, set 2, set 3, and set 4 datasets . . . 144

A.5 Details of evaluating the feasibility of reducing the search space for SneakySnake and Edlib, evaluated using the set 5, set 6, set 7, and set 8 datasets . . . 144

A.6 Details of evaluating the feasibility of reducing the search space for SneakySnake and Edlib, evaluated using the set 9, set 10, set 11, and set 12 datasets . . . 145

Chapter 1

Introduction

The genome is the code of life: it contains the set of instructions for making everything from humans to elephants, bananas, and yeast. Analyzing life's code helps, for example, to determine the differences in genomes from human to human that are passed from one generation to the next and may cause diseases or different traits [1, 2, 3, 4, 5, 6]. One benefit of knowing these genetic variations is a better understanding and diagnosis of diseases such as cancer and autism [7, 8, 9, 10], as well as the development of efficient drugs [11, 12, 13, 14, 15]. The first step in genome analysis is to reveal the entire content of the subject genome, a process known as DNA sequencing [16]. To this day, it remains challenging to sequence the entire DNA molecule as a whole. As a workaround, high throughput DNA sequencing (HTS) technologies are used to sequence random fragments of copies of the original molecule. These fragments, called short reads, are 75-300 basepairs (bp) long. The resulting reads lack information about their order and origin (i.e., which part of the subject genome they originate from). Hence, the main challenge in genome analysis is to construct the donor's complete genome with respect to a reference genome.


During a process called read mapping, each read is mapped onto one or more possible locations in the reference genome based on the similarity between the read and the reference sequence segment at that location (much like solving a jigsaw puzzle). The similarity measurement is referred to as optimal read alignment (i.e., verification) and can be calculated using the Smith-Waterman local alignment algorithm [17]. It calculates the alignment, which is an ordered list of characters representing the possible edit operations and matches required to change one of the two given sequences into the other. Commonly allowed edit operations include deletion, insertion, and substitution of characters in one or both sequences. As any two sequences can have several different arrangements of the edit operations and matches (and hence different alignments), the alignment algorithm usually involves a backtracking step. This step finds the alignment that has the highest alignment score (called the optimal alignment). The alignment score is the sum of the scores of all edits and matches along the alignment, as implied by a user-defined scoring function. However, this approach is infeasible as it requires O(mn) running time, where m is the read length (a few hundred bp) and n is the reference length (∼3.2 billion bp for the human genome), for each read in the dataset (hundreds of millions to billions of reads).

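As a concrete illustration of this dynamic programming formulation, the following sketch computes the edit distance between two sequences in O(mn) time and uses it to verify a mapping against a threshold. The function names (`edit_distance`, `aligns_within`) are illustrative, not taken from any tool discussed in this thesis:

```python
def edit_distance(a: str, b: str) -> int:
    """Classic O(len(a) * len(b)) dynamic programming edit distance."""
    m, n = len(a), len(b)
    # prev[j] holds the edit distance between a[:i-1] and b[:j].
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1  # match or substitution
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # match / substitution
        prev = curr
    return prev[n]

def aligns_within(read: str, ref_segment: str, threshold: int) -> bool:
    """Accept the mapping only if the pair has at most `threshold` edits."""
    return edit_distance(read, ref_segment) <= threshold
```

The doubly nested loop is exactly the quadratic cost that makes this approach infeasible at the scale of billions of reads.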
To accelerate the read mapping process and reduce the search space, state-of-the-art mappers employ a strategy called seed-and-extend. In this strategy, a mapper applies heuristics to first find candidate map locations (seed locations) of subsequences of the reads using hash tables (BitMapper [18], mrFAST with FastHASH [19], mrsFAST [20]) or BWT-FM indices (BWA-MEM [21], Bowtie 2 [22], SOAP3-dp [23]). It then aligns the read in full only to those seed locations. Although the strategies for finding seed locations vary among different read mapping algorithms, seed location identification is typically followed by an alignment step. The general goal of this step is to compare the read to the reference segment at the seed location to check whether the read aligns to that location in the genome with fewer differences (called edits) than a threshold [24].

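The hash-table-based seeding described above can be sketched as follows. This is a deliberately minimal toy (exact-matching, non-overlapping seeds, no seed filtering), not the indexing scheme of any specific mapper; all names are illustrative:

```python
from collections import defaultdict

def build_index(reference: str, k: int) -> dict:
    """Hash-table index: each k-mer of the reference -> list of its locations."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index

def seed_locations(read: str, index: dict, k: int) -> set:
    """Candidate map locations for a read: for every seed extracted from the
    read, each exact hit is shifted back by the seed's offset in the read."""
    candidates = set()
    for offset in range(0, len(read) - k + 1, k):  # non-overlapping seeds
        for hit in index.get(read[offset:offset + k], []):
            start = hit - offset
            if start >= 0:
                candidates.add(start)
    return candidates
```

Every candidate location returned this way still has to be compared against the full read, which is precisely where the expensive alignment step comes into play.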

Figure 1.1: Rate of verified and unverified read-reference pairs that are generated by the mrFAST mapper and are fed into its read alignment algorithm. We set the threshold to 2, 3, and 5 edits.

1.1 Research Problem

The alignment step is the performance bottleneck of today's read mappers, taking 70% to 90% of the total running time [25, 18, 19]. We pinpoint three specific problems that cause, affect, or exacerbate the long execution time of alignment.

(1) We find that across our datasets (see Chapter 9), an overwhelming majority (more than 90%, as we present in Figure 1.1) of the seed locations that are generated by a state-of-the-art read mapper, mrFAST with FastHASH [19], exhibit more edits than the allowed threshold. These particular seed locations impose a large computational burden, as verifying these incorrect mappings wastes 90% of the alignment's execution time. This observation is also in line with similar results for other read mappers [19, 25, 18, 26].

(2) A typical read alignment algorithm also needs to tolerate sequencing errors [27, 28] as well as genetic variations [29]. Read alignment is a non-additive measure [30]. This means that if we divide a sequence pair into two consecutive subsequence pairs, the edit distance of the entire sequence pair is not necessarily equal to the sum of the edit distances of the shorter pairs. Instead, we need to examine all possible prefixes of the two input sequences and keep track of the pairs of prefixes that provide an optimal solution. Therefore, the read alignment step is implemented using dynamic programming algorithms to avoid re-examining the same prefixes many times. These include Levenshtein distance [31], Smith-Waterman [17], Needleman-Wunsch [32], and their improved implementations. These algorithms are inefficient as their running time is quadratic in the read length m (i.e., O(m²)).

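A small example makes the non-additivity concrete. Splitting the pair (ABC, BCA) after the first character and summing the halves' edit distances overestimates the true distance, because a single leading deletion shifts all later characters. The sketch below (with an illustrative `edit_distance` helper, not from any tool in this thesis) demonstrates this:

```python
def edit_distance(a: str, b: str) -> int:
    # Straightforward O(mn) dynamic programming, as described in the text.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

# Non-additivity: deleting the leading 'A' of "ABC" and appending an 'A'
# transforms it into "BCA" with only 2 edits, while summing the edit
# distances of the two halves ("A","B") and ("BC","CA") reports 3.
whole = edit_distance("ABC", "BCA")                           # 2 edits
split = edit_distance("A", "B") + edit_distance("BC", "CA")   # 1 + 2 = 3
```

This is why a fixed split point cannot be trusted, and why the dynamic programming must consider all pairs of prefixes.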
(3) This computational burden is further aggravated by the unprecedented flood of sequencing data, which continues to overwhelm the processing capacity of existing algorithms and compute infrastructures [33]. While today's HTS machines (e.g., Illumina HiSeq4000) can generate more than 300 million bases per minute, a state-of-the-art read mapper can only map 1% of these bases per minute [34]. The situation gets even worse when one tries to understand a complex disease (e.g., autism and cancer) [8, 9, 35, 36, 37, 38, 39] or profile a metagenomics sample [40, 41, 42, 43], which requires sequencing hundreds of thousands of genomes. The long execution time of modern-day read alignment can severely hinder such studies. There is also an urgent need to rapidly incorporate clinical sequencing into clinical practice for the diagnosis of genetic disorders in critically ill infants [44, 45, 46, 47]. While early diagnosis in such infants shortens the clinical course and enables optimal outcomes [48, 49, 50], it remains challenging to deliver efficient clinical sequencing to the tens to hundreds of thousands of hospitalized infants each year [51].

Tackling these challenges and bridging the widening gap between the execution time of read alignment and the increasing amount of sequencing data necessitate the development of fundamentally new, fast, and efficient read alignment algorithms. In the next sections, we provide the motivation behind our proposed work to considerably boost the performance of read alignment. We provide further background information and a literature survey in Chapter 2.


1.2 Motivation

We present in Figure 1.2 the flow chart of a typical seed-and-extend based mapper during the mapping stage. The mapper follows five main steps to map a read set to the reference genome sequence. (1) In step 1, a mapper typically first constructs a fast indexing data structure (e.g., a large hash table) for short segments (called seeds, k-mers, or q-grams) of the reference sequence. (2) In step 2, the mapper extracts short subsequences from a read and uses them to query the hash table. (3) In step 3, the hash table returns all the occurrence hits of each seed in the reference genome. Modern mappers employ a seed filtering mechanism to reduce the number of false seed locations that lead to incorrect mappings. (4) In step 4, for each possible location in the list, the mapper retrieves the corresponding reference segment from the reference genome based on the seed location. The mapper can then examine the alignment of the entire read with the reference segment using fast filtering heuristics that reduce the need for the dynamic programming algorithms. It rejects the mapping if the read and the reference are obviously dissimilar.

Otherwise, the mapper proceeds to the next step. (5) In step 5, the mapper calculates the optimal alignment between the read sequence and the reference sequence using a computationally costly sequence alignment (i.e., verification) algorithm to determine the similarity between the two sequences. Many attempts have been made to tackle the computationally very expensive alignment problem. Most existing works follow one of three key directions: (1) accelerating the dynamic programming algorithms [52, 53, 54, 55, 56, 57, 58, 23, 59, 60], (2) developing seed filters that aim to reduce the number of false seed locations [18, 19, 26, 61, 62, 63, 64], and (3) developing pre-alignment filtering heuristics [65].

The first direction takes advantage of the parallelism capabilities of high performance computing platforms such as central processing units (CPUs) [54, 56], graphics processing units (GPUs) [57, 23, 59], and field-programmable gate arrays (FPGAs) [52, 53, 66, 67, 55, 58, 60]. Among these computing platforms, FPGA accelerators seem to yield the highest performance gain [53, 68]. However, many of these efforts either simplify the scoring function, or only accelerate the computation of the dynamic programming matrix without providing the optimal alignment (i.e., backtracking), as in [66, 67]. Different scoring functions are typically needed to better quantify the similarity between the read and the reference sequence segment [69]. The backtracking step required for optimal alignment computation involves unpredictable and irregular memory access patterns, which poses a difficult challenge for efficient hardware implementation.

The second direction to accelerate read alignment is to use filtering heuristics to reduce the size of the seed location list. This is the basic principle of nearly all mappers that employ the seed-and-extend approach. A seed filter applies heuristics to reduce the location list, which stores all the occurrence locations of each seed in the reference genome. The returned location list can be tremendously large, as a mapper searches for exact matches of a short segment (typically 10 bp-13 bp for hash-based mappers) between two very long homologous genomes [70]. Filters in this category suffer from low filtering accuracy, as they can only look for exact matches with the help of the hash table. Thus, they query a small number of seeds per read (e.g., in Bowtie 2 [22], it is 3 fixed-length seeds at fixed locations) to maintain edit tolerance. mrFAST [19] uses another approach to increase the seed filtering accuracy by querying the seed and its shifted copies. This idea is based on the observation that indels cause the trailing characters to be shifted in one direction. If the seed generated from the read sequence, or one of its shifted copies, matches the corresponding seed from the reference, then this seed has zero edits. Otherwise, this approach counts the number of edits in this seed as a single edit (which can be a single indel or a single substitution). Therefore, this approach fails to detect the correct number of edits in some cases, for example, more than one substitution in the same seed, a substitution and an indel in the same seed, or more than one indel in the last seed. Seed filtering succeeds in eliminating some of the incorrect locations, but it is still unable to eliminate a sufficiently large portion of the false seed locations, as we present in Figure 1.1.

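The shifted-copies idea described above can be sketched as follows. This is a toy re-creation of the observation, not mrFAST's actual implementation; the function name and the `max_shift` parameter are illustrative:

```python
def seed_edits_estimate(read: str, reference: str, read_off: int,
                        ref_off: int, k: int, max_shift: int = 1) -> int:
    """Estimate the edits in one seed using the shifted-copies observation:
    if the seed, or a copy of it shifted by up to max_shift characters,
    exactly matches the reference seed, report zero edits; otherwise charge
    a single edit (a single indel or a single substitution)."""
    target = reference[ref_off:ref_off + k]
    for d in range(-max_shift, max_shift + 1):
        start = read_off + d
        if start < 0 or start + k > len(read):
            continue  # shifted copy falls outside the read
        if read[start:start + k] == target:
            return 0
    # Charged as one edit even if the seed actually hides several edits,
    # e.g., two substitutions in the same seed -- the under-counting case.
    return 1
```

The final `return 1` is exactly the failure mode described above: a seed hiding two substitutions is still charged only a single edit.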

Figure 1.2: The flow chart of seed-and-extend based mappers, which includes five main steps. (1) A typical mapper starts by indexing the reference genome using a scheme based on a hash table or BWT-FM. (2) It extracts a number of seeds from each read produced by the sequencing machine. (3) It obtains a list of possible locations within the reference genome that could result in a match with the extracted seeds. (4) It applies fast heuristics to examine the alignment of the entire read with its corresponding reference segment of the same length. (5) It uses an expensive and accurate alignment algorithm to determine the similarity between the read and the reference sequences.

Figure 1.3: End-to-end execution time (in seconds) for read alignment, with (blue plot) and without (green plot) a pre-alignment filter. Our goal is to significantly reduce the alignment time spent on verifying incorrect mappings (highlighted in yellow). We sweep the percentage of rejected mappings and the filtering speed compared to the alignment algorithm on the x-axis.

The third direction to accelerate read alignment is to minimize the number of incorrect mappings on which alignment is performed by incorporating filtering heuristics. This is the last line of defense before invoking the computationally expensive read alignment. Such filters come into play before read alignment (and are hence called pre-alignment filters), discarding incorrect mappings that alignment would deem a poor match. Though seed filtering and pre-alignment filtering have the same goal, they are fundamentally different problems. In the pre-alignment filtering approach, a filter needs to examine the entire mapping. It estimates a lower bound on the number of edits between a read sequence and a reference segment. If the lower bound exceeds a certain number of edits, indicating that the read and the reference segment do not align, the mapping is eliminated such that no alignment is performed. Unfortunately, the best performing existing pre-alignment filter, shifted Hamming distance (SHD), is slow, and its mechanism unnecessarily introduces filtering inaccuracy, as we show in our study in Chapter 3 and in our experimental evaluation in Chapter 9.

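To make the pre-alignment filtering contract concrete, the following sketch shows a deliberately simple filter based on character counts; it is not SHD (nor any of the filters proposed in this thesis), merely an illustration of the accept/reject logic. Since every single edit changes the character-count profiles of the two sequences by at most two, half of the L1 distance between the profiles is a valid lower bound on the edit distance:

```python
from collections import Counter
from math import ceil

def counting_filter(read: str, ref_segment: str, threshold: int) -> bool:
    """Return True (pass the pair on to the aligner) unless a cheap lower
    bound already proves the pair has more than `threshold` edits. A
    substitution changes two character counts by one each, and an indel
    changes one count by one, so ceil(L1 / 2) never exceeds the true
    edit distance."""
    a, b = Counter(read), Counter(ref_segment)
    l1 = sum(abs(a[c] - b[c]) for c in set(a) | set(b))
    return ceil(l1 / 2) <= threshold
```

A pair that passes still goes to full alignment; only pairs that provably exceed the threshold are discarded, so a sound lower bound never eliminates a correct mapping. The cost of such a simple bound is a high false accept rate (e.g., any permutation of the same characters passes), which is why the filters studied in this thesis use far more discriminating tests.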

Pre-alignment filtering enables the acceleration of read alignment while offering the ability to make the best use of existing read alignment algorithms.

These benefits come without sacrificing any capabilities of these algorithms, as a pre-alignment filter does not modify or replace the alignment step. This motivates us to focus our improvement and acceleration efforts on pre-alignment filtering. We analyze in Figure 1.3 the effect of adding a pre-alignment filtering step after generating the seed locations and before calculating the optimal alignment. We make two key observations. (1) The reduction in the end-to-end processing time of the alignment step largely depends on the accuracy and the speed of the pre-alignment filter. (2) Pre-alignment filtering can provide unsatisfactory performance (as highlighted in red) if it cannot reject more than about 30% of the potential mappings while being only 2x-4x faster than the read alignment step.

We conclude that it is important to understand what makes a pre-alignment filter inefficient, so that we can devise a new filtering technique that is much faster than read alignment and yet maintains high filtering accuracy.

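Observation (2) follows from a simple back-of-the-envelope model of the filtered pipeline, sketched below with illustrative parameter names: every mapping pays the filtering cost, but only the accepted fraction still pays the alignment cost:

```python
def end_to_end_time(align_time: float, filter_speedup: float,
                    reject_rate: float) -> float:
    """Filtered pipeline time, normalized per mapping: every mapping pays
    the filter (align_time / filter_speedup), and only the accepted
    fraction (1 - reject_rate) still pays the full alignment."""
    return align_time / filter_speedup + (1.0 - reject_rate) * align_time

baseline = end_to_end_time(1.0, float("inf"), 0.0)  # plain alignment: 1.0
weak = end_to_end_time(1.0, 2.0, 0.3)    # 2x-faster filter, 30% rejected
strong = end_to_end_time(1.0, 100.0, 0.9)  # fast and accurate filter
```

With a filter that is only 2x faster than alignment and rejects 30% of the mappings, the filtered pipeline takes 1.2x the unfiltered time, i.e., it is a net slowdown; this corresponds to the red region in Figure 1.3.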
1.3 Thesis Statement

Our goal in this thesis is to significantly reduce the time spent on calculating the optimal alignment in genome analysis from hours to mere seconds, given limited computational resources (i.e., a personal computer or small hardware). This would make it feasible to analyze DNA routinely in the clinic for personalized health applications. Towards this end, we analyze the mappings that are provided to the read alignment algorithm and explore the causes of filtering inaccuracy. Our thesis statement is:


Read alignment can be substantially accelerated using computationally inexpensive and accurate pre-alignment filtering algorithms designed for specialized hardware.

An accurate filter designed on a specialized hardware platform can drastically expedite read alignment by reducing the number of locations that must be verified via dynamic programming. To this end, we (1) develop four hardware-acceleration-friendly filtering algorithms and highly-parallel hardware accelerator designs that greatly reduce the need for alignment verification in DNA read mapping, (2) introduce a fast and accurate pre-alignment filter for general-purpose processors, and (3) develop a better understanding of filtering inaccuracy and explore speed/accuracy trade-offs.

1.4 Contributions

The overarching contribution of this thesis is the new algorithms and architectures that reduce the execution time of read alignment in read mapping. More specifically, this thesis makes the following main contributions:

1. We provide a detailed investigation and analysis of four potential causes of filtering inaccuracy in the state-of-the-art alignment filter, SHD [65]. We also provide our recommendations for eliminating these causes and improving the overall filtering accuracy.

2. GateKeeper. We introduce the first hardware pre-alignment filter, GateKeeper, which substantially reduces the need for alignment verification in DNA read mapping. GateKeeper is highly parallel and relies heavily on bitwise operations such as look-up tables, shifts, XOR, and AND. GateKeeper can examine up to 16 mappings in parallel on a single FPGA chip, with a logic utilization of less than 1% for a single filtering unit. It provides two orders of magnitude speedup over the state-of-the-art pre-alignment filter, SHD. It also provides up to 13.9x speedup to the state-of-the-art aligners. GateKeeper is published in Bioinformatics [71] and is also available on arXiv [72].

3. Shouji. We introduce Shouji, a highly accurate and parallel pre-alignment filter that uses a sliding window approach to quickly identify dissimilar sequences without the need for computationally expensive alignment algorithms. Shouji can examine up to 16 mappings in parallel on a single FPGA chip, with a logic utilization of up to 2% for a single filtering unit. Thanks to its high accuracy, it provides, on average, 1.2x to 1.4x more speedup to the state-of-the-art read aligners than GateKeeper provides. Shouji is 2.9x to 155x more accurate than GateKeeper. Shouji is published in Bioinformatics [73] and is also available on arXiv [74].

4. MAGNET. We introduce MAGNET, a highly accurate pre-alignment filter that employs a greedy divide-and-conquer approach to identify all non-overlapping long matches between two sequences. MAGNET can examine 2 or 8 mappings in parallel, depending on the edit distance threshold, on a single FPGA chip, with a logic utilization of up to 37.8% for a single filtering unit. MAGNET is, on average, two to four orders of magnitude more accurate than both Shouji and GateKeeper. This comes at the expense of its filtering speed, as it is up to 8x slower than Shouji and GateKeeper. It still provides up to 16.6x speedup to the state-of-the-art read aligners. MAGNET is published in IPSI [75], available on arXiv [76], and was presented at AACBB 2018 [77].


  • 5. SneakySnake. We introduce SneakySnake, the fastest and the most ac-

    curate pre-alignment filter. SneakySnake reduces the problem of finding

    the optimal alignment to finding a snake’s optimal path (with the least

    number of obstacles) with linear time complexity in the read length. We

    provide a cost-effective CPU implementation of our SneakySnake algorithm

    that accelerates the state-of-the-art read aligners, Edlib [78] and Parasail

    [54], by up to 43x and 57.9x, respectively, without the need for hardware

    accelerators. We also provide a scalable hardware architecture and hard-

    ware design optimization for the SneakySnake algorithm in order to further

    boost its speed. The hardware implementation of SneakySnake accelerates

    the existing state-of-the-art aligners by up to 21.5x when it is combined

    with the aligner. SneakySnake is up to one, four, and five orders of mag-

    nitude more accurate than MAGNET, Shouji, and GateKeeper, respectively,

    while preserving all correct mappings. SneakySnake also reduces the mem-

    ory footprint of the Edlib aligner by 50%. This work is yet to

    be published.

    6. We provide a comprehensive analysis of the asymptotic run time and space

    complexity of our four pre-alignment filtering algorithms. We perform a

    detailed experimental evaluation of our proposed algorithms using 12 real

    datasets across three different read lengths (100 bp, 150 bp, and 250 bp) and an

    edit distance threshold of 0% to 10% of the read length. We explore different

    implementations for the edit distance problem in order to compare the

    performance of SneakySnake, which calculates an approximate edit distance, with

    that of the efficient implementation of exact edit distance. This also helps

    us to develop a deep understanding of the trade-off between the accuracy

    and speed of pre-alignment filtering.

    Overall, we show in this thesis that developing a hardware-based alignment fil-

    tering algorithm and architecture together is both feasible and effective by build-

    ing our hardware accelerator on a modern FPGA system. We also demonstrate

    that our pre-alignment filters are more effective in boosting the overall perfor-

    mance of the alignment step than approaches that only accelerate the dynamic

    programming algorithms by one to two orders of magnitude.


    This thesis provides a foundation for developing fast and accurate pre-alignment

    filters for accelerating existing and future read mappers.

    1.5 Outline

    This thesis is organized into 11 chapters. Chapter 2 describes the necessary

    background on read mappers and related prior works on accelerating their com-

    putations. Chapter 3 explores the potential causes of filtering inaccuracy and

    provides recommendations on tackling them. Chapter 4 presents the architecture

    and implementation details of our hardware accelerator that we use for boosting

    the speed of our proposed pre-alignment filters. Chapter 5 presents the GateKeeper

    algorithm and architecture. Chapter 6 presents the Shouji algorithm and its hard-

    ware architecture. Chapter 7 presents the MAGNET algorithm and its hardware

    architecture. Chapter 8 presents the SneakySnake algorithm. Chapter 9 presents the

    detailed experimental evaluation for all our proposed pre-alignment filters along

    with comprehensive comparison with the state-of-the-art existing works. Chapter

    10 presents conclusions and future research directions that are enabled by this

    thesis. Finally, Appendix A extends Chapter 9 with more detailed information

    and additional experimental data/results.


    Chapter 2

    Background

    In this chapter, we provide the necessary background on two key read map-

    ping methods. We highlight the strengths and weaknesses of each method. We

    then provide an extensive literature review on the prior, existing, and recent ap-

    proaches for accelerating the operations of read mappers. We restrict the

    background material to approaches that reduce the read mapper's execution time.

    2.1 Overview of Read Mapper

    Given a reference genome, read mappers maintain a large index database

    (∼3 GB to 20 GB for the human genome) for the reference genome. This facilitates

    querying the whole reference sequence quickly and efficiently. Read

    mappers can use one of the following indexing techniques: suffix trees [79], suffix

    arrays [80], Burrows-Wheeler transformation [81] followed by Ferragina-Manzini

    index [82] (BWT-FM), and hash tables [19, 83, 62]. The choice of the index

    affects the query size, querying speed, and memory footprint of the read mapper,

    and even its memory access patterns.
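To make the hash-table option concrete, a minimal k-mer index can be sketched as follows. This is an illustrative toy, not code from any of the cited mappers; real indexes use compact 2-bit base encodings and gigabyte-scale references.

```python
from collections import defaultdict

def build_kmer_index(reference, k):
    """Map every k-mer of the reference to the list of its occurrence positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index

idx = build_kmer_index("ACGTACGTGA", 4)  # "ACGT" occurs at positions 0 and 4
```

A mapper then queries this table with seeds extracted from each read to obtain candidate locations in constant expected time per seed.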


    Unlike hash tables, suffix arrays and suffix trees can answer queries on variable-

    length sequences. Based on the indexing technique used, short read mappers

    typically fall into one of two main categories [33]: (1) Burrows-Wheeler Transfor-

    mation [81] and Ferragina-Manzini index [82] (BWT-FM)-based methods and (2)

    Seed-and-extend based methods. Both types have different strengths and weak-

    nesses. The first approach (implemented by BWA [84], BWT-SW [85], Bowtie

    [86], SOAP2 [87], and SOAP3 [88]) uses aggressive algorithms to optimize the

    candidate location pools to find closest matches, and therefore may not find

    many potentially-correct mappings [89]. Their performance degrades as either

    the sequencing error rate increases or the genetic differences between the subject

    and the reference genome are more likely to occur [90, 84]. To allow mismatches,

    BWT-FM mappers exhaustively traverse the data structure and match the seed

    to each possible path. Thus, Bowtie [86], for example, performs a depth-first

    search (DFS) on the prefix trie and stops when the first hit (within a

    threshold of less than 4) is found. Next, we explain SOAP2, SOAP3, and Bowtie

    as examples of this category.

    SOAP2 [87] improves the execution time and the memory utilization of SOAP

    [91] by replacing its hash index technique with the BWT index. SOAP2 divides

    the read into non-overlapping (i.e., consecutive) seeds based on the number of

    allowed edits (default five). To tolerate two edits, SOAP2 splits a read into

    three consecutive seeds, guaranteeing that at least one seed matches exactly

    when up to two mismatches are allowed. SOAP3 [88] is the first read mapper

    to leverage a graphics

    processing unit (GPU) to facilitate parallel calculations, as the authors claim in

    [88]. It speeds up the mapping process of SOAP2 [87] using a reference sequence

    that is indexed by the combination of the BWT index and the hash table. The

    purpose of this combination is to address the issue of random memory access

    while searching the BWT index, which is challenging for a GPU implementation.

    Both SOAP2 [87] and SOAP3 [88] can support alignment with an edit distance

    threshold of up to four bp.
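The pigeonhole reasoning behind SOAP2's seed splitting can be sketched as follows. This is a simplified illustration with a hypothetical function name; SOAP2's actual implementation differs.

```python
def split_into_seeds(read, num_edits):
    """Split a read into (num_edits + 1) non-overlapping, consecutive seeds.
    By the pigeonhole principle, num_edits edits cannot touch every seed,
    so at least one seed must match the reference exactly."""
    n = num_edits + 1
    size = len(read) // n
    seeds = [read[i * size:(i + 1) * size] for i in range(n - 1)]
    seeds.append(read[(n - 1) * size:])  # last seed absorbs the remainder
    return seeds

seeds = split_into_seeds("ACGTACGTACGT", 2)  # 3 seeds, tolerating 2 edits
```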


    Bowtie [86] follows the same concept as SOAP2. However, it also provides

    a backtracking step that favors high-quality alignments. It also uses a ’double

    BWT indexing’ approach to avoid excessive backtracking by indexing the refer-

    ence genome and its reversed version. Bowtie fails to align reads to a reference

    for an edit distance threshold of more than three bp.

    The second category uses a hash table to index short seeds presented in ei-

    ther the read set (as in SHRiMP [83], Maq [92], RMAP [93], and ZOOM [94])

    or the reference (as in most of the other modern mappers in this category). The

    idea of hash table indexing can be traced back to BLAST [95]. Examples of

    this category include BFAST [96], BitMapper [18], mrFAST with FastHASH [19],

    mrsFAST [20], SHRiMP [83], SHRiMP2 [97], RazerS [64], Maq [92], Hobbes [62],

    drFAST [98], MOSAIK [99], SOAP [91], Saruman [100] (GPU), ZOOM [94], and

    RMAP [93]. Hash-based mappers build a very comprehensive but overly large

    candidate location pool and rely on seed filters and local alignment techniques

    to remove incorrect mappings from consideration in the verification step. Map-

    pers in this category are able to find all correct mappings of a read, but waste

    computational resources for identifying and rejecting incorrect mappings. As a

    result, they are slower than BWT-FM-based mappers. Next, we explain mrFAST

    mapper as an example of this category.

    mrFAST (> version 2.5) [19] first builds a hash table to index fixed-length

    seeds (typically 10-13 bp) from the reference genome. It then applies a seed

    location filtering mechanism, called Adjacency Filter, on the hash table to reduce

    the false seed locations. It divides each query read into smaller fixed-length seeds

    to query the hash table for their associated seed locations. Given an edit distance

    threshold, the Adjacency Filter requires N-E seeds to exactly match adjacent

    locations, where N is the number of the seeds and E is the edit distance threshold.

    Finally, mrFAST tries to extend the read at each of the seed locations by align-

    ing the read to the reference fragment at the seed location via Levenshtein edit

    distance [31] with Ukkonen’s banded algorithm [101]. One drawback of this seed

    filtering is that the presence of one or more substitutions in any seed is counted

    by the Adjacency Filter as a single mismatch. The effectiveness of the Adjacency

    Filter for substitutions and indels diminishes when E becomes larger than 3 edits.
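One way to read the Adjacency Filter's check is sketched below. This is our own simplified interpretation, not mrFAST source code: each seed hit votes for an implied read start position, and a candidate mapping survives only if at least N-E of the N seeds agree on the same start.

```python
def passes_adjacency_filter(seed_location_lists, seed_len, num_edits):
    """Each hit of seed i at reference position loc votes for an implied
    read start of (loc - i * seed_len); a candidate mapping survives only
    if at least N - E of the N seeds vote for the same start position."""
    n = len(seed_location_lists)
    votes = {}
    for i, locations in enumerate(seed_location_lists):
        for loc in locations:
            start = loc - i * seed_len  # implied read start in the reference
            votes[start] = votes.get(start, 0) + 1
    return any(count >= n - num_edits for count in votes.values())
```

For example, three seeds of length 4 hitting positions 100, 104, and 999 imply starts 100, 100, and 991; with E = 1 the candidate at start 100 survives, while with E = 0 it is filtered out.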


    A recent work in [102] shows that by removing redundancies in the reference

    genome and also across the reads, seed-and-extend mappers can be faster

    than BWT-FM-based mappers. This space-efficient approach uses an idea similar

    to that presented in [94]. A hybrid method that incorporates the advantages of

    each approach, such as BWA-MEM [21], can also be employed.

    2.2 Acceleration of Read Mappers

    A majority of read mappers are developed for machines equipped with general-

    purpose central processing units (CPUs). As the gap between CPU computing

    speed and the ever-growing amount of sequencing data widens, CPU-based

    mappers become less favorable due to their limitations in accessing data

    [52, 53, 54, 55, 56, 57, 58, 23, 59, 60]. To tackle this challenge, many attempts

    were made to accelerate the operations of read mapping. We survey in Figure

    2.1 the existing read mappers implemented on various acceleration platforms.

    FPGA-based read mappers often demonstrate one to two orders of magnitude

    speedup over their GPU-based counterparts [53, 68]. Most of the existing

    works used hardware platforms to only accelerate the dynamic programming al-

    gorithms (e.g., Smith-Waterman algorithm [17]), as these algorithms contributed

    significantly to the overall running time of read mappers. Most existing works can

    be divided into three main approaches: (1) Developing seed filtering mechanism

    to reduce the seed location list, (2) Accelerating the computationally expensive

    alignment algorithms using algorithmic development or hardware accelerators,

    and (3) Developing pre-alignment filtering heuristics to reduce the number of in-

    correct mappings before being examined by read alignment. We describe next

    each of these three acceleration efforts in detail.

    2.2.1 Seed Filtering

    The first approach to accelerate today's read mappers is to filter the seed location

    list before performing read alignment.


    [Figure 2.1 appears here: a timeline (1990-2016) of read mappers by acceleration platform.]

    Figure 2.1: Timeline of read mappers. CPU-based mappers are plotted in black,
    GPU-accelerated mappers in red, FPGA-accelerated mappers in blue, and SSE-
    based mappers in green. Grey dotted lines connect related mappers (extensions
    or new versions). The names in the timeline are exactly as they appear in pub-
    lications, except: SOAP3-FPGA [103], BWA-MEM-FPGA [55], BFAST-Olson
    [104], BFAST-Yus [105], BWA-Waidyasooriya [58], and BWA-W [68].


    This is the basic principle of nearly all seed-and-extend mappers. Seed filtering

    is based on the observation that if two sequences are potentially similar, then they

    share a certain number of seeds. Seeds (sometimes called q-grams or k-mers) are

    short subsequences that are used as indices into the reference genome to reduce

    the search space and speed up the mapping process. Modern mappers extract

    short subsequences from each read and use them as a key to query the previously

    built large reference index database. The database returns the location lists for

    each seed. The location list stores all the occurrence locations of each seed in the

    reference genome. The mapper then examines the optimal alignment between the

    read and the reference segment at each of these seed locations. The performance

    and accuracy of seed-and-extend mappers depend on how the seeds are selected

    in the first stage. Mappers should select a large number of non-overlapping seeds

    while keeping each seed as infrequent as possible for full sensitivity [70, 106, 19].

    There is also a significant advantage to selecting seeds with unequal lengths, as

    seeds of equal length can have drastically different frequencies.

    Finding the optimal set of seeds from read sequences is challenging and complex,

    primarily because the associated search space is large and it grows exponentially

    as the number of seeds increases. There are other variants of seed filtering based

    on the pigeonhole principle [18, 61], non-overlapping seeds [19], gapped seeds

    [107, 63], variable-length seeds [70], random permutation of subsequences [108],

    or full permutation of all possible subsequences [25, 109, 110].
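Putting the pieces together, the basic seed lookup described above can be sketched end-to-end. This is illustrative only; real mappers cap location-list sizes, use far larger k, and apply the seed filters cited above before verification.

```python
from collections import defaultdict

def candidate_locations(read, reference, k):
    """Extract non-overlapping k-mer seeds from the read, look each seed
    up in a hash-table index of the reference, and return the implied
    mapping start positions that read alignment would then verify."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    candidates = set()
    for offset in range(0, len(read) - k + 1, k):  # non-overlapping seeds
        for loc in index[read[offset:offset + k]]:
            if loc >= offset:                      # keep starts inside the reference
                candidates.add(loc - offset)       # implied read start position
    return sorted(candidates)
```

Every returned position is a candidate mapping that the verification (alignment) step must still examine, which is exactly why shrinking this list cheaply is so valuable.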

    2.2.2 Accelerating Read Alignment

    The second approach to boost the performance of read mappers is to accelerate

    read alignment step. One of the most fundamental computational steps in most

    bioinformatics analyses is the detection of the differences/similarities between two

    genomic sequences. Edit distance and pairwise alignment are two approaches to

    achieve this step, formulated as approximate string matching [24]. Edit distance

    approach is a measure of how much the sequences differ. It calculates the mini-

    mum number of edits needed to convert one sequence into the other.


    The higher the distance, the more different the sequences are from one another.

    Commonly allowed edit operations include deletion, insertion, and substitution

    of characters in one or both sequences. Pairwise alignment is a way to identify

    regions of high similarity between sequences. Each application employs a dif-

    ferent edit model (called scoring function), which is then used to generate an

    alignment score. The latter is a measure of how much the sequences are alike.

    Any two sequences have a single edit distance value but they can have several

    different alignments (i.e., ordered lists of possible edit operations and matches)

    with different alignment scores. Thus, alignment algorithms usually involve a

    backtracking step for providing the optimal alignment (i.e., the best arrangement

    of the possible edit operations and matches) that has the highest alignment score.

    Depending on the demand, pairwise alignment can be performed as global align-

    ment, where two sequences of the same length are aligned end-to-end, or local

    alignment, where subsequences of the two given sequences are aligned. It can

    also be performed as semi-global alignment (called glocal), where the entirety of

    one sequence is aligned towards one of the ends of the other sequence.
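As a concrete illustration of alignment with backtracking, a textbook global (Needleman-Wunsch) aligner can be sketched as follows. The scoring values are arbitrary illustrative choices, not a scheme used in this thesis.

```python
def global_alignment(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment with a simple linear gap penalty.
    Returns the optimal score and one optimal alignment via backtracking."""
    n, m = len(a), len(b)
    S = [[0] * (m + 1) for _ in range(n + 1)]  # DP score matrix
    for i in range(1, n + 1):
        S[i][0] = i * gap
    for j in range(1, m + 1):
        S[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = S[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            S[i][j] = max(diag, S[i - 1][j] + gap, S[i][j - 1] + gap)
    # Backtracking: rebuild one optimal alignment from the bottom-right cell.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                S[i][j] == S[i - 1][j - 1]
                + (match if a[i - 1] == b[j - 1] else mismatch)):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and S[i][j] == S[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append('-'); i -= 1
        else:
            out_a.append('-'); out_b.append(b[j - 1]); j -= 1
    return S[n][m], ''.join(reversed(out_a)), ''.join(reversed(out_b))
```

Note how the backtracking step walks the matrix with data-dependent, irregular accesses; this is the part that is hard to implement efficiently in hardware, as discussed later in this chapter.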

    The edit distance and pairwise alignment approaches are non-additive mea-

    sures [30]. This means that if we divide the sequence pair into two consecutive

    subsequence pairs, the edit distance of the entire sequence pair is not necessarily

    equivalent to the sum of the edit distances of the shorter pairs. Instead, we need

    to examine all possible prefixes of the two input sequences and keep track of

    the pairs of prefixes that provide an optimal solution. Enumerating all possible

    prefixes is necessary for tolerating edits that result from both sequencing errors

    [28] and genetic variations [29]. Therefore, they are typically implemented as

    dynamic programming algorithms to avoid re-computing the edit distance of the

    prefixes many times. These implementations, such as Levenshtein distance [31],

    Smith-Waterman [17], and Needleman-Wunsch [32], are still inefficient as they

    have quadratic time and space complexity (i.e., O(m^2) for a sequence length of

    m). Many attempts were made to boost the performance of existing sequence

    aligners. Despite more than three decades of attempts, the fastest known edit

    distance algorithm [111] has a running time of O(m^2/log^2 m) for sequences of

    length m, which is still nearly quadratic [112].
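The classic quadratic dynamic-programming formulation referred to above can be written compactly in a row-by-row form that needs only O(m) space (a standard textbook implementation, not code from this thesis):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance: O(len(a) * len(b)) time."""
    prev = list(range(len(b) + 1))  # distances against the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

# levenshtein("ACGT", "AGT") == 1 (a single deletion)
```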


    Therefore, more recent works tend to follow one of two key new directions

    to boost the performance of sequence alignment and edit distance implementa-

    tions: (1) Accelerating the dynamic programming algorithms using hardware

    accelerators. (2) Developing filtering heuristics that reduce the need for the

    dynamic programming algorithms, given an edit distance threshold. Hard-

    ware accelerators are becoming increasingly popular for speeding up

    the computationally-expensive alignment and edit distance algorithms

    [113, 114, 115, 116]. Hardware accelerators include multi-core and SIMD (sin-

    gle instruction multiple data) capable central processing units (CPUs), graphics

    processing units (GPUs), and field-programmable gate arrays (FPGAs). The

    classical dynamic programming algorithms are typically accelerated by comput-

    ing only the necessary regions (i.e., diagonal vectors) of the dynamic program-

    ming matrix rather than the entire matrix, as proposed in Ukkonen’s banded

    algorithm [101]. The number of diagonal bands required for computing the

    dynamic programming matrix is 2E+1, where E is a user-defined edit distance

    threshold. The banded algorithm is still beneficial even with its recent sequential

    implementations as in Edlib [78]. The Edlib algorithm is implemented in C for

    standard CPUs and it calculates the banded Levenshtein distance. Parasail [54]

    exploits both Ukkonen’s banded algorithm and SIMD-capable CPUs to compute

    a banded alignment for a sequence pair with user-defined scoring matrix and

    affine gap penalty. SIMD instructions offer significant parallelism to the matrix

    computation by executing the same vector operation on multiple operands at

    once.
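Ukkonen's banding idea can be sketched as follows: only cells within E of the main diagonal (2E+1 diagonals) are filled, so the work drops from O(m^2) to O(E*m). This is a simplified scalar version of the idea; Edlib and Parasail additionally use bit-parallelism and SIMD.

```python
def banded_levenshtein(a, b, e):
    """Banded edit distance: fill only cells within e of the main diagonal.
    Returns the distance if it is <= e, otherwise None (filtered out).
    Out-of-band cells are capped at e + 1, which acts as infinity here."""
    INF = e + 1
    n, m = len(a), len(b)
    if abs(n - m) > e:  # length difference alone already exceeds the budget
        return None
    prev = [j if j <= e else INF for j in range(m + 1)]
    for i in range(1, n + 1):
        curr = [i if i <= e else INF] + [INF] * m
        lo, hi = max(1, i - e), min(m, i + e)  # band limits for this row
        for j in range(lo, hi + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,          # deletion
                          curr[j - 1] + 1,      # insertion
                          prev[j - 1] + cost)   # substitution or match
        prev = curr
    return prev[m] if prev[m] <= e else None
```

Capping out-of-band values at e + 1 is safe for the accept/reject decision: any path through a capped cell can only yield a final value above the threshold.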

    Multi-core architecture of CPUs and GPUs provides the ability to compute

    alignments of many sequence pairs independently and concurrently [56, 57].

    GSWABE [57] exploits GPUs (Tesla K40) for a highly-parallel computation of

    global alignment with affine gap penalty. CUDASW++ 3.0 [59] exploits the

    SIMD capability of both CPUs and GPUs (GTX690) to accelerate the computa-

    tion of the Smith-Waterman algorithm with affine gap penalty. CUDASW++ 3.0

    provides only the optimal score, not the optimal alignment (i.e., no backtracking

    step).


    Other designs, for instance FPGASW [53], exploit the very large number of

    hardware execution units in FPGAs (Xilinx VC707) to form a linear systolic array

    [117]. Each execution unit in the systolic array is responsible for computing the

    value of a single entry of the dynamic programming matrix. The systolic array

    computes a single vector of the matrix at a time. The data dependencies between

    the entries restrict the systolic array to computing the vectors sequentially (e.g.,

    top-to-bottom, left-to-right, or in an anti-diagonal manner). FPGA acceleration

    platform can also provide more speedup to big-data computing frameworks, such

    as Apache Spark, for accelerating BWA-MEM [21]. With this integration, Chen

    et al. [118] achieve 2.6x speedup over the same cloud-based implementation but

    without FPGA acceleration [119]. FPGA accelerators seem to yield the highest

    performance gain compared to the other hardware accelerators [52, 53, 68, 58].

    However, many of these efforts either simplify the scoring function, or only take

    into account accelerating the computation of the dynamic programming matrix

    without providing the optimal alignment as in [66, 67, 59]. Different scoring

    functions are typically needed to better quantify the similarity between two se-

    quences [120, 121]. The backtracking step required for the optimal alignment

    computation involves unpredictable and irregular memory access patterns, which

    poses a difficult challenge for efficient hardware implementation. Comprehen-

    sive surveys on hardware acceleration for computational genomics appeared in

    [113, 114, 115, 116, 33].

    2.2.3 False Mapping Filtering

    The third approach to accelerate read mapping is to incorporate a pre-alignment

    filtering technique within the read mapper, before read alignment step. This filter

    is responsible for quickly excluding incorrect mappings in an early stage (i.e., as

    a pre-alignment step) to reduce the number of false mappings (i.e., mappings

    that have more edits than the user-defined threshold) that must be verified via

    dynamic programming. Existing filtering techniques include the so-called shifted

    Hamming distance (SHD) [65], which we explain next.


    2.2.3.1 Shifted Hamming Distance (SHD)

    SHD enables pre-alignment filtering in the presence of indels and substitutions.

    Instead of building a single bit-vector using a pairwise comparison as Hamming

    distance does, SHD builds 2E+1 bit-vectors, where E is the user-defined edit

    distance threshold. This is similar to Ukkonen's banded algorithm [101]. Each

    bit-vector is built by gradually shifting the read sequence and then performing

    a pairwise comparison. The shifting process is inevitable in order to skip the

    deleted (or inserted) character and examine the subsequent matches. SHD merges

    all masks using a bitwise AND operation. Due to the AND operation, a zero

    (i.e., pairwise match) at any position in the 2E+1 masks leads to a ‘0’ in the

    resulting output of the AND operation at the same position. The last step is to

    count the positions that have a value other than ‘0’. SHD decides if the mapping is

    correct based on