spring: a next-generation compressor for fastq data · Łukaszroguski, idoiaochoa, mikel hernaez,...

78
SPRING: a next-generation compressor for FASTQ data Shubham Chandak Stanford University ISMB/ECCB 2019

Upload: others

Post on 14-Aug-2020

1 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: SPRING: a next-generation compressor for FASTQ data · ŁukaszRoguski, IdoiaOchoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data,Bioinformatics,

SPRING: a next-generation compressor for FASTQ data

Shubham ChandakStanford University

ISMB/ECCB 2019

Page 2: SPRING: a next-generation compressor for FASTQ data · ŁukaszRoguski, IdoiaOchoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data,Bioinformatics,

Joint work with

• Kedar Tatwawadi, Stanford University• Idoia Ochoa, UIUC• Mikel Hernaez, UIUC• Tsachy Weissman, Stanford University

Page 3: SPRING: a next-generation compressor for FASTQ data · ŁukaszRoguski, IdoiaOchoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data,Bioinformatics,

Outline

• Introduction and motivation • FASTQ format and compression results• Algorithms - SPRING and others• SPRING as a practical tool• Next steps

Page 4: SPRING: a next-generation compressor for FASTQ data · ŁukaszRoguski, IdoiaOchoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data,Bioinformatics,

Genome sequencing

• Genome: long string of bases {A, C, G, T}• Sequenced as noisy paired substrings (reads):

~ 300 – 500 bases ~ 100 –150 bases

Genome ~ 3 billion basesAACGATGTCGTATATCGTAGTAGCTCTATGTTCTCATTAGCTCGCTAGTAGCTATGCTCTAATGCTAT

Coverage/Depth:

~30x-60x

Page 5: SPRING: a next-generation compressor for FASTQ data · ŁukaszRoguski, IdoiaOchoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data,Bioinformatics,

Typical workflows

Page 6: SPRING: a next-generation compressor for FASTQ data · ŁukaszRoguski, IdoiaOchoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data,Bioinformatics,

Typical workflows

Sequencing Raw readsAlignment

to reference

Aligned reads

Variant calling w.r.t. reference

VCF (tabular

data)

Page 7: SPRING: a next-generation compressor for FASTQ data · ŁukaszRoguski, IdoiaOchoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data,Bioinformatics,

Typical workflows

Sequencing Raw readsAlignment

to reference

Aligned reads

Variant calling w.r.t. reference

VCF (tabular

data)

Sequencing Raw reads Assembly Assembled genome

Page 8: SPRING: a next-generation compressor for FASTQ data · ŁukaszRoguski, IdoiaOchoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data,Bioinformatics,

Why store raw reads?

Page 9: SPRING: a next-generation compressor for FASTQ data · ŁukaszRoguski, IdoiaOchoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data,Bioinformatics,

Why store raw reads?

• Pipelines improve with time - need raw data for reanalysis

Page 10: SPRING: a next-generation compressor for FASTQ data · ŁukaszRoguski, IdoiaOchoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data,Bioinformatics,

Why store raw reads?

• Pipelines improve with time - need raw data for reanalysis• For temporary storage - alignment and assembly time-consuming

Page 11: SPRING: a next-generation compressor for FASTQ data · ŁukaszRoguski, IdoiaOchoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data,Bioinformatics,

Why store raw reads?

• Pipelines improve with time - need raw data for reanalysis• For temporary storage - alignment and assembly time-consuming• Can’t perform alignment when reference genome not available – e.g.,

de novo assembly or metagenomics

Page 12: SPRING: a next-generation compressor for FASTQ data · ŁukaszRoguski, IdoiaOchoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data,Bioinformatics,

Why store raw reads?

• Pipelines improve with time - need raw data for reanalysis• For temporary storage - alignment and assembly time-consuming• Can’t perform alignment when reference genome not available – e.g.,

de novo assembly or metagenomics• Can get better compression than aligned data compression if

significant variation from reference (more on this later)!

Page 13: SPRING: a next-generation compressor for FASTQ data · ŁukaszRoguski, IdoiaOchoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data,Bioinformatics,

FASTQ format

Page 14: SPRING: a next-generation compressor for FASTQ data · ŁukaszRoguski, IdoiaOchoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data,Bioinformatics,

FASTQ format

We’ll mostly focus on reads in this talk.

Page 15: SPRING: a next-generation compressor for FASTQ data · ŁukaszRoguski, IdoiaOchoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data,Bioinformatics,

Read compression

Page 16: SPRING: a next-generation compressor for FASTQ data · ŁukaszRoguski, IdoiaOchoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data,Bioinformatics,

Read compression

• For a typical 25x human dataset:• Uncompressed: 79 GB (1 byte/base)

Page 17: SPRING: a next-generation compressor for FASTQ data · ŁukaszRoguski, IdoiaOchoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data,Bioinformatics,

Read compression

• For a typical 25x human dataset:• Uncompressed: 79 GB (1 byte/base)• Gzip: ~20 GB (2 bits/base) – still far from optimal

Page 18: SPRING: a next-generation compressor for FASTQ data · ŁukaszRoguski, IdoiaOchoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data,Bioinformatics,

Read compression

• For a typical 25x human dataset:• Uncompressed: 79 GB (1 byte/base)• Gzip: ~20 GB (2 bits/base) – still far from optimal

• Order of read pairs in FASTQ irrelevant – can this help?

Page 19: SPRING: a next-generation compressor for FASTQ data · ŁukaszRoguski, IdoiaOchoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data,Bioinformatics,

Read compression results

Compressor 25x human

Uncompressed 79 GB

Gzip ~20 GB

Page 20: SPRING: a next-generation compressor for FASTQ data · ŁukaszRoguski, IdoiaOchoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data,Bioinformatics,

Read compression results

Compressor 25x human

Uncompressed 79 GB

Gzip ~20 GB

FaStore(allow reordering) 6 GB

Łukasz Roguski, Idoia Ochoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data, Bioinformatics, Volume 34, Issue 16, 15 August 2018, Pages 2748–2756

Page 21: SPRING: a next-generation compressor for FASTQ data · ŁukaszRoguski, IdoiaOchoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data,Bioinformatics,

Read compression results

Compressor 25x human

Uncompressed 79 GB

Gzip ~20 GB

FaStore(allow reordering) 6 GB

SPRING(no reordering) 3 GB

SPRING(allow reordering) 2 GB

Łukasz Roguski, Idoia Ochoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data, Bioinformatics, Volume 34, Issue 16, 15 August 2018, Pages 2748–2756

Page 22: SPRING: a next-generation compressor for FASTQ data · ŁukaszRoguski, IdoiaOchoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data,Bioinformatics,

Read compression results

Compressor 25x human 100x human

Uncompressed 79 GB 319 GB

Gzip ~20 GB ~80 GB

FaStore(allow reordering) 6 GB 13.7 GB

SPRING(no reordering) 3 GB 10 GB

SPRING(allow reordering) 2 GB 5.7 GB

Łukasz Roguski, Idoia Ochoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data, Bioinformatics, Volume 34, Issue 16, 15 August 2018, Pages 2748–2756

Page 23: SPRING: a next-generation compressor for FASTQ data · ŁukaszRoguski, IdoiaOchoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data,Bioinformatics,

Key idea

• Storing reads equivalent to

AACGATGTCGTATATCGTAGTAGCTCTATGTTCTCATTAGCTCGCTAGTAGCTATGCTCTAATGCTAT

Page 24: SPRING: a next-generation compressor for FASTQ data · ŁukaszRoguski, IdoiaOchoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data,Bioinformatics,

Key idea

• Storing reads equivalent to• Store genome

AACGATGTCGTATATCGTAGTAGCTCTATGTTCTCATTAGCTCGCTAGTAGCTATGCTCTAATGCTAT

Page 25: SPRING: a next-generation compressor for FASTQ data · ŁukaszRoguski, IdoiaOchoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data,Bioinformatics,

Key idea

• Storing reads equivalent to• Store genome• Store read positions in genome (+ gap between paired reads)

AACGATGTCGTATATCGTAGTAGCTCTATGTTCTCATTAGCTCGCTAGTAGCTATGCTCTAATGCTAT

Page 26: SPRING: a next-generation compressor for FASTQ data · ŁukaszRoguski, IdoiaOchoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data,Bioinformatics,

Key idea

• Storing reads equivalent to• Store genome• Store read positions in genome (+ gap between paired reads)• Store noise in reads

AACGATGTCGTATATCGTAGTAGCTCTATGTTCTCATTAGCTCGCTAGTAGCTATGCTCTAATGCTAT

Page 27: SPRING: a next-generation compressor for FASTQ data · ŁukaszRoguski, IdoiaOchoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data,Bioinformatics,

Key idea

• Storing reads equivalent to• Store genome• Store read positions in genome (+ gap between paired reads)• Store noise in reads

• Entropy calculations show this outperforms previous compressors

AACGATGTCGTATATCGTAGTAGCTCTATGTTCTCATTAGCTCGCTAGTAGCTATGCTCTAATGCTAT

Page 28: SPRING: a next-generation compressor for FASTQ data · ŁukaszRoguski, IdoiaOchoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data,Bioinformatics,

Key idea

• But... How to get the genome from the reads?

Page 29: SPRING: a next-generation compressor for FASTQ data · ŁukaszRoguski, IdoiaOchoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data,Bioinformatics,

Key idea

• But... How to get the genome from the reads?• Genome assembly too expensive - big challenges:• resolve repeats• get very long pieces of genome from shorter assemblies

Page 30: SPRING: a next-generation compressor for FASTQ data · ŁukaszRoguski, IdoiaOchoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data,Bioinformatics,

Key idea

• But... How to get the genome from the reads?• Genome assembly too expensive - big challenges:• resolve repeats• get very long pieces of genome from shorter assemblies

• Solution: Don’t need perfect assembly for compression!

Page 31: SPRING: a next-generation compressor for FASTQ data · ŁukaszRoguski, IdoiaOchoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data,Bioinformatics,

SPRING workflow

Raw reads

Page 32: SPRING: a next-generation compressor for FASTQ data · ŁukaszRoguski, IdoiaOchoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data,Bioinformatics,

SPRING workflow

Approximate assembly

Raw reads

Contigs

Page 33: SPRING: a next-generation compressor for FASTQ data · ŁukaszRoguski, IdoiaOchoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data,Bioinformatics,

SPRING workflow

Approximate assembly

Raw reads

Encode

• Assembled sequence• Read position in

assembled sequence• Gap b/w paired reads• Noisy bases + positions• Etc.

Contigs

Page 34: SPRING: a next-generation compressor for FASTQ data · ŁukaszRoguski, IdoiaOchoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data,Bioinformatics,

SPRING workflow

Approximate assembly

Raw reads

Encode

• Assembled sequence• Read position in

assembled sequence• Gap b/w paired reads• Noisy bases + positions• Etc.

BSC

Compressed file

https://github.com/IlyaGrebnov/libbsc

Contigs

Page 35: SPRING: a next-generation compressor for FASTQ data · ŁukaszRoguski, IdoiaOchoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data,Bioinformatics,

SPRING workflow

Approximate assembly

Raw reads

Encode

• Assembled sequence• Read position in

assembled sequence• Gap b/w paired reads• Noisy bases + positions• Etc.

BSC

Compressed fileIn “allow reordering” mode: reorder by position in approximate assembly

https://github.com/IlyaGrebnov/libbsc

Contigs

Page 36: SPRING: a next-generation compressor for FASTQ data · ŁukaszRoguski, IdoiaOchoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data,Bioinformatics,

Approx. assembly/reordering step (simplified)

Page 37: SPRING: a next-generation compressor for FASTQ data · ŁukaszRoguski, IdoiaOchoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data,Bioinformatics,

Approx. assembly/reordering step (simplified)

• Index reads by specific substrings using hash tables

Page 38: SPRING: a next-generation compressor for FASTQ data · ŁukaszRoguski, IdoiaOchoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data,Bioinformatics,

Approx. assembly/reordering step (simplified)

• Index reads by specific substrings using hash tables• For the current read, try to find an overlapping read within small

Hamming distance

Page 39: SPRING: a next-generation compressor for FASTQ data · ŁukaszRoguski, IdoiaOchoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data,Bioinformatics,

Approx. assembly/reordering step (simplified)

• Index reads by specific substrings using hash tables• For the current read, try to find an overlapping read within small

Hamming distance• Example (reads indexed by prefix for simplicity):• ACGATCGTACGTACGATCGTCAG (current read)

Page 40: SPRING: a next-generation compressor for FASTQ data · ŁukaszRoguski, IdoiaOchoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data,Bioinformatics,

Approx. assembly/reordering step (simplified)

• Index reads by specific substrings using hash tables• For the current read, try to find an overlapping read within small

Hamming distance• Example (reads indexed by prefix for simplicity):• ACGATCGTACGTACGATCGTCAG (current read)• ACGATCGTACGTATACGGGTACG (candidate next read)

Page 41: SPRING: a next-generation compressor for FASTQ data · ŁukaszRoguski, IdoiaOchoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data,Bioinformatics,

Approx. assembly/reordering step (simplified)

• Index reads by specific substrings using hash tables• For the current read, try to find an overlapping read within small

Hamming distance• Example (reads indexed by prefix for simplicity):• ACGATCGTACGTACGATCGTCAG (current read)• ACGATCGTACGTATACGGGTACG (candidate next read)• Index match found but Hamming distance too large → shift search substring

by one

Page 42: SPRING: a next-generation compressor for FASTQ data · ŁukaszRoguski, IdoiaOchoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data,Bioinformatics,

Approx. assembly/reordering step (simplified)

• Index reads by specific substrings using hash tables• For the current read, try to find an overlapping read within small

Hamming distance• Example (reads indexed by prefix for simplicity):• ACGATCGTACGTACGATCGTCAG (current read)

Page 43: SPRING: a next-generation compressor for FASTQ data · ŁukaszRoguski, IdoiaOchoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data,Bioinformatics,

Approx. assembly/reordering step (simplified)

• Index reads by specific substrings using hash tables• For the current read, try to find an overlapping read within small

Hamming distance• Example (reads indexed by prefix for simplicity):• ACGATCGTACGTACGATCGTCAG (current read)

• No index match found → shift search substring by one

Page 44: SPRING: a next-generation compressor for FASTQ data · ŁukaszRoguski, IdoiaOchoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data,Bioinformatics,

Approx. assembly/reordering step (simplified)

• Index reads by specific substrings using hash tables• For the current read, try to find an overlapping read within small

Hamming distance• Example (reads indexed by prefix for simplicity):• ACGATCGTACGTACGATCGTCAG (current read)• GATCGTACGTATGATGGTCATTA (candidate next read)

Page 45: SPRING: a next-generation compressor for FASTQ data · ŁukaszRoguski, IdoiaOchoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data,Bioinformatics,

Approx. assembly/reordering step (simplified)

• Index reads by specific substrings using hash tables• For the current read, try to find an overlapping read within small

Hamming distance• Example (reads indexed by prefix for simplicity):• ACGATCGTACGTACGATCGTCAG (current read)• GATCGTACGTATGATGGTCATTA (candidate next read)• Next read found!

• Repeat process with the new read

Page 46: SPRING: a next-generation compressor for FASTQ data · ŁukaszRoguski, IdoiaOchoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data,Bioinformatics,

Approx. assembly/reordering step (simplified)

• Index reads by specific substrings using hash tables• For the current read, try to find an overlapping read within small

Hamming distance• Example (reads indexed by prefix for simplicity):• ACGATCGTACGTACGATCGTCAG (current read)• GATCGTACGTATGATGGTCATTA (candidate next read)• Next read found!

• Repeat process with the new read. • If no match found at any shift, pick arbitrary remaining read & start

new contig

Page 47: SPRING: a next-generation compressor for FASTQ data · ŁukaszRoguski, IdoiaOchoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data,Bioinformatics,

Quality and read identifier compression

Page 48: SPRING: a next-generation compressor for FASTQ data · ŁukaszRoguski, IdoiaOchoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data,Bioinformatics,

Quality and read identifier compression

• Quality – use general purpose compressor BSC (optionally apply quantization)• Read identifier – split into tokens and use arithmetic coding [1]

1. Bonfield, James K., and Matthew V. Mahoney. "Compression of FASTQ and SAM format sequencing data." PloS one 8.3 (2013): e59190.

Page 49: SPRING: a next-generation compressor for FASTQ data · ŁukaszRoguski, IdoiaOchoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data,Bioinformatics,

Quality and read identifier compression

• Quality – use general purpose compressor BSC (optionally apply quantization)• Read identifier – split into tokens and use arithmetic coding [1]

Dataset Reads Quality Read identifier

Hiseq 2000 28x, 100 bp x 2 4.3 23.8 0.9

Novaseq 25x, 150 bp x 2 3.0 3.6 0.3

All human datasets. Sizes in GB.

1. Bonfield, James K., and Matthew V. Mahoney. "Compression of FASTQ and SAM format sequencing data." PloS one 8.3 (2013): e59190.

Page 50: SPRING: a next-generation compressor for FASTQ data · ŁukaszRoguski, IdoiaOchoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data,Bioinformatics,

Quality and read identifier compression

• Quality – use general purpose compressor BSC (optionally apply quantization)• Read identifier – split into tokens and use arithmetic coding [1]

Dataset Reads Quality Read identifier

Hiseq 2000 28x, 100 bp x 2 4.3 23.8 0.9

Novaseq 25x, 150 bp x 2 3.0 3.6 0.3

Novaseq 25x, 150 bp x 2(allow reordering)

2.0 3.6 1.4

All human datasets. Sizes in GB.

1. Bonfield, James K., and Matthew V. Mahoney. "Compression of FASTQ and SAM format sequencing data." PloS one 8.3 (2013): e59190.

Page 51: SPRING: a next-generation compressor for FASTQ data · ŁukaszRoguski, IdoiaOchoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data,Bioinformatics,

SPRING vs. reference-based compression195 GB

25x human FASTQ

NovaSeq

Page 52: SPRING: a next-generation compressor for FASTQ data · ŁukaszRoguski, IdoiaOchoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data,Bioinformatics,

SPRING vs. reference-based compression195 GB

25x human FASTQ

NovaSeq

SPRING2 hours

7 GB losslessSPRINGarchive

Page 53: SPRING: a next-generation compressor for FASTQ data · ŁukaszRoguski, IdoiaOchoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data,Bioinformatics,

SPRING vs. reference-based compression195 GB

25x human FASTQ

NovaSeq

SPRING2 hours

7 GB losslessSPRINGarchive

BWA-MEMalignment

(hg19)8 hours

SAM fileRemove

irrelevant fields(sorting)

Page 54: SPRING: a next-generation compressor for FASTQ data · ŁukaszRoguski, IdoiaOchoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data,Bioinformatics,

SPRING vs. reference-based compression195 GB

25x human FASTQ

NovaSeq

SPRING2 hours

7 GB losslessSPRINGarchive

BWA-MEMalignment

(hg19)8 hours

SAM fileRemove

irrelevant fieldsCRAM v325 min

(sorting)

Unsorted: 7.6 GBSorted: 7.8 GBSorted (+ embedded reference): 8.5 GB

*partly due to quality compression

improvements in SPRING

Page 55: SPRING: a next-generation compressor for FASTQ data · ŁukaszRoguski, IdoiaOchoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data,Bioinformatics,

SPRING vs. reference-based compression195 GB

25x human FASTQ

NovaSeq

SPRING2 hours

7 GB losslessSPRINGarchive

BWA-MEMalignment

(hg19)8 hours

SAM fileRemove

irrelevant fieldsCRAM v325 min

(sorting)

Advantage can be even greater in case of large variations between

reference genome & FASTQ genome.

Unsorted: 7.6 GBSorted: 7.8 GBSorted (+ embedded reference): 8.5 GB

*partly due to quality compression

improvements in SPRING

Page 56: SPRING: a next-generation compressor for FASTQ data · ŁukaszRoguski, IdoiaOchoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data,Bioinformatics,

Other approaches for FASTQ compression

Numanagić, Ibrahim, et al. "Comparison of high-throughput sequencing data compression tools." Nature Methods 13.12 (2016): 1005.Hernaez, Mikel, et al. "Genomic Data Compression." Annual Review of Biomedical Data Science 2 (2019).

Page 57: SPRING: a next-generation compressor for FASTQ data · ŁukaszRoguski, IdoiaOchoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data,Bioinformatics,

Other approaches for FASTQ compression

• gzip/bzip2

Numanagić, Ibrahim, et al. "Comparison of high-throughput sequencing data compression tools." Nature Methods 13.12 (2016): 1005.Hernaez, Mikel, et al. "Genomic Data Compression." Annual Review of Biomedical Data Science 2 (2019).

Page 58: SPRING: a next-generation compressor for FASTQ data · ŁukaszRoguski, IdoiaOchoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data,Bioinformatics,

Other approaches for FASTQ compression

• gzip/bzip2• Context-based arithmetic coding: DSRC 2, Fqzcomp, Quip

Numanagić, Ibrahim, et al. "Comparison of high-throughput sequencing data compression tools." Nature Methods 13.12 (2016): 1005.Hernaez, Mikel, et al. "Genomic Data Compression." Annual Review of Biomedical Data Science 2 (2019).

Page 59: SPRING: a next-generation compressor for FASTQ data · ŁukaszRoguski, IdoiaOchoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data,Bioinformatics,

Other approaches for FASTQ compression

• gzip/bzip2• Context-based arithmetic coding: DSRC 2, Fqzcomp, Quip• Assembly based: Leon, Quip, Assembletrie

Numanagić, Ibrahim, et al. "Comparison of high-throughput sequencing data compression tools." Nature Methods 13.12 (2016): 1005.Hernaez, Mikel, et al. "Genomic Data Compression." Annual Review of Biomedical Data Science 2 (2019).

Page 60: SPRING: a next-generation compressor for FASTQ data · ŁukaszRoguski, IdoiaOchoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data,Bioinformatics,

Other approaches for FASTQ compression

• gzip/bzip2• Context-based arithmetic coding: DSRC 2, Fqzcomp, Quip• Assembly based: Leon, Quip, Assembletrie• Reordering based:• Reordering based on substrings/minimizers: Orcom, Mince, FaStore, SCALCE• BWT-based reordering: BEETL

Numanagić, Ibrahim, et al. "Comparison of high-throughput sequencing data compression tools." Nature Methods 13.12 (2016): 1005.Hernaez, Mikel, et al. "Genomic Data Compression." Annual Review of Biomedical Data Science 2 (2019).

Page 61: SPRING: a next-generation compressor for FASTQ data · ŁukaszRoguski, IdoiaOchoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data,Bioinformatics,

Recent FASTQ compressors

Page 62: SPRING: a next-generation compressor for FASTQ data · ŁukaszRoguski, IdoiaOchoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data,Bioinformatics,

Recent FASTQ compressors

• minicom [1]: Use minimizers to construct large contigs (assemblies)• Slight improvement (5-10%) over SPRING on RNA-seq reads

1. Yuansheng Liu, Zuguo Yu, Marcel E Dinger, Jinyan Li, Index suffix–prefix overlaps by (w, k)-minimizer to generate long contigs for reads compression, Bioinformatics, Volume 35, Issue 12, June 2019, Pages 2066–2074.

Page 63: SPRING: a next-generation compressor for FASTQ data · ŁukaszRoguski, IdoiaOchoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data,Bioinformatics,

Recent FASTQ compressors

• minicom [1]: Use minimizers to construct large contigs (assemblies)• Slight improvement (5-10%) over SPRING on RNA-seq reads

• FQSqueezer [2]: Adapt general-purpose compressors such as prediction by partial matchting (PPM) and dynamic Markov coding (DMC) to read compression• 10-30% improvement over SPRING for bacterial datasets

1. Yuansheng Liu, Zuguo Yu, Marcel E Dinger, Jinyan Li, Index suffix–prefix overlaps by (w, k)-minimizer to generate long contigs for reads compression, Bioinformatics, Volume 35, Issue 12, June 2019, Pages 2066–2074.

2. Deorowicz, Sebastian. "FQSqueezer: k-mer-based compression of sequencing data." bioRxiv (2019): 559807.

Page 64: SPRING: a next-generation compressor for FASTQ data · ŁukaszRoguski, IdoiaOchoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data,Bioinformatics,

Recent FASTQ compressors

• minicom [1]: Use minimizers to construct large contigs (assemblies)• Slight improvement (5-10%) over SPRING on RNA-seq reads

• FQSqueezer [2]: Adapt general-purpose compressors such as prediction by partial matchting (PPM) and dynamic Markov coding (DMC) to read compression• 10-30% improvement over SPRING for bacterial datasets

• Both require significantly more time and memory than SPRING• Not tested on moderate to high coverage human datasets

1. Yuansheng Liu, Zuguo Yu, Marcel E Dinger, Jinyan Li, Index suffix–prefix overlaps by (w, k)-minimizer to generate long contigs for reads compression, Bioinformatics, Volume 35, Issue 12, June 2019, Pages 2066–2074.

2. Deorowicz, Sebastian. "FQSqueezer: k-mer-based compression of sequencing data." bioRxiv (2019): 559807.

Page 65: SPRING: a next-generation compressor for FASTQ data · ŁukaszRoguski, IdoiaOchoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data,Bioinformatics,

SPRING as a practical tool

Page 66: SPRING: a next-generation compressor for FASTQ data · ŁukaszRoguski, IdoiaOchoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data,Bioinformatics,

SPRING as a practical tool

195 GB25x human

FASTQ

2 hours32 GB RAM8 threads

7 GBSPRINGarchive

26 minutes6 GB RAM8 threads

OriginalFASTQ

Page 67: SPRING: a next-generation compressor for FASTQ data · ŁukaszRoguski, IdoiaOchoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data,Bioinformatics,

SPRING as a practical tool

• ~1.6x better compression than FaStore with similar time/memory

195 GB25x human

FASTQ

2 hours32 GB RAM8 threads

7 GBSPRINGarchive

26 minutes6 GB RAM8 threads

OriginalFASTQ

Page 68: SPRING: a next-generation compressor for FASTQ data · ŁukaszRoguski, IdoiaOchoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data,Bioinformatics,

SPRING as a practical tool

• ~1.6x better compression than FaStore with similar time/memory• Easy to use with support for:• Lossless and lossy modes• Variable length reads, long reads, etc.• Random access• Scalable to large datasets

195 GB25x human

FASTQ

2 hours32 GB RAM8 threads

7 GBSPRINGarchive

26 minutes6 GB RAM8 threads

OriginalFASTQ

Page 69: SPRING: a next-generation compressor for FASTQ data · ŁukaszRoguski, IdoiaOchoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data,Bioinformatics,

SPRING as a practical tool

• ~1.6x better compression than FaStore with similar time/memory• Easy to use with support for:• Lossless and lossy modes• Variable length reads, long reads, etc.• Random access• Scalable to large datasets

• Github: https://github.com/shubhamchandak94/SPRING/

195 GB25x human

FASTQ

2 hours32 GB RAM8 threads

7 GBSPRINGarchive

26 minutes6 GB RAM8 threads

OriginalFASTQ

Page 70: SPRING: a next-generation compressor for FASTQ data · ŁukaszRoguski, IdoiaOchoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data,Bioinformatics,

Next steps

• Currently integrating SPRING with genie, an upcoming open source MPEG-G codec

Page 71: SPRING: a next-generation compressor for FASTQ data · ŁukaszRoguski, IdoiaOchoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data,Bioinformatics,

Next steps

• Currently integrating SPRING with genie, an upcoming open source MPEG-G codec• Third generation sequencing technologies (e.g., nanopore):

Page 72: SPRING: a next-generation compressor for FASTQ data · ŁukaszRoguski, IdoiaOchoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data,Bioinformatics,

Next steps

• Currently integrating SPRING with genie, an upcoming open source MPEG-G codec• Third generation sequencing technologies (e.g., nanopore):• Long reads, lots of insertions and deletion errors

• Hash based approximate assembly doesn’t extend immediately

Page 73: SPRING: a next-generation compressor for FASTQ data · ŁukaszRoguski, IdoiaOchoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data,Bioinformatics,

Next steps

• Currently integrating SPRING with genie, an upcoming open source MPEG-G codec• Third generation sequencing technologies (e.g., nanopore):• Long reads, lots of insertions and deletion errors

• Hash based approximate assembly doesn’t extend immediately• New types of raw data – e.g., raw current signal for nanopore sequencing

• Need huge amounts of space and typically retained for further analysis

Page 74: SPRING: a next-generation compressor for FASTQ data · ŁukaszRoguski, IdoiaOchoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data,Bioinformatics,

Next steps

• Currently integrating SPRING with genie, an upcoming open source MPEG-G codec• Third generation sequencing technologies (e.g., nanopore):• Long reads, lots of insertions and deletion errors

• Hash based approximate assembly doesn’t extend immediately• New types of raw data – e.g., raw current signal for nanopore sequencing

• Need huge amounts of space and typically retained for further analysis

• Time and memory efficient tool with compression close to SPRING:

Page 75: SPRING: a next-generation compressor for FASTQ data · ŁukaszRoguski, IdoiaOchoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data,Bioinformatics,

Next steps

• Currently integrating SPRING with genie, an upcoming open source MPEG-G codec• Third generation sequencing technologies (e.g., nanopore):• Long reads, lots of insertions and deletion errors

• Hash based approximate assembly doesn’t extend immediately• New types of raw data – e.g., raw current signal for nanopore sequencing

• Need huge amounts of space and typically retained for further analysis

• Time and memory efficient tool with compression close to SPRING:• Disk based strategies (like Orcom/FaStore)

Page 76: SPRING: a next-generation compressor for FASTQ data · ŁukaszRoguski, IdoiaOchoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data,Bioinformatics,

Next steps

• Currently integrating SPRING with genie, an upcoming open source MPEG-G codec• Third generation sequencing technologies (e.g., nanopore):• Long reads, lots of insertions and deletion errors

• Hash based approximate assembly doesn’t extend immediately• New types of raw data – e.g., raw current signal for nanopore sequencing

• Need huge amounts of space and typically retained for further analysis

• Time and memory efficient tool with compression close to SPRING:• Disk based strategies (like Orcom/FaStore)• When reference is available, can do fast and approximate alignment

Page 77: SPRING: a next-generation compressor for FASTQ data · ŁukaszRoguski, IdoiaOchoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data,Bioinformatics,

Thank you!

Page 78: SPRING: a next-generation compressor for FASTQ data · ŁukaszRoguski, IdoiaOchoa, Mikel Hernaez, Sebastian Deorowicz; FaStore: a space-saving solution for raw sequencing data,Bioinformatics,

References

• Shubham Chandak, Kedar Tatwawadi, Tsachy Weissman; Compression of genomic sequencing reads via hash-based reordering: algorithm and analysis, Bioinformatics, Volume 34, Issue 4, 15 February 2018, Pages 558–567

• Shubham Chandak, Kedar Tatwawadi, Idoia Ochoa, Mikel Hernaez, Tsachy Weissman; SPRING: a next-generation compressor for FASTQ data, Bioinformatics, bty1015

• SPRING download: https://github.com/shubhamchandak94/Spring

• genie (open source MPEG-G codec – under development): https://github.com/mitogen/genie