EBI is an Outstation of the European Molecular Biology Laboratory. CRAM: reference-based compression format developed by Vadim Zalunin

Post on 16-Dec-2015

Data horror

EMBL-EBI: 10 petabytes

SRA: ~1 petabyte

Over 2 million DVDs, or 2.5 km

Complete Genomics: 0.5 TB for a single file

The need for compression

Red alert

Compression, what is it?

LOSSLESS: BMP, 190 kb; PNG, 100 kb

LOSSY: JPG, 21 kb; JPG, 4 kb

Compression, when we know what to expect.

LOSSLESS: BMP, 145 kb; PNG, 2 kb

LOSSY: JPG, 6 kb; JPG, 3 kb
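One way to see why knowing what to expect helps is to give a general-purpose compressor some shared prior knowledge. The sketch below uses the preset-dictionary feature of Python's standard `zlib` module; the `shared` text is an assumption made up for illustration, and this is an analogy, not how the image formats on the slide work.

```python
import zlib

message = b"Five little ducks went swimming one day"

# Assumption for illustration: compressor and decompressor both already
# hold this "expected" text before any data is exchanged.
shared = b"Five little ducks went swimming one day, over the hill and far away"

# Ordinary lossless compression, with no prior knowledge.
plain = zlib.compress(message, 9)

# The same compressor, primed with the shared text as a preset dictionary.
compressor = zlib.compressobj(level=9, zdict=shared)
primed = compressor.compress(message) + compressor.flush()

# Decompression only works when the same dictionary is supplied again.
decompressor = zlib.decompressobj(zdict=shared)
restored = decompressor.decompress(primed)

print(len(message), len(plain), len(primed))
```

The primed output is shorter because most of the message can be expressed as a pointer into text both sides already have.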

But the actual message is only 40 characters (bytes) long!

Compression at its best

IMAGE, 145 kb

"Five little ducks went swimming one day"

TEXT, 40 b IMAGE, 145 kb

~3500 times more efficient

compress uncompress
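The efficiency figure follows directly from the two sizes on the slide. A quick check, assuming 1 kb here means 1000 bytes:

```python
image_bytes = 145 * 1000  # "IMAGE, 145 kb", assuming 1 kb = 1000 bytes
text_bytes = 40           # "TEXT, 40 b"

ratio = image_bytes / text_bytes
print(f"{ratio:.0f}x smaller")  # prints "3625x smaller", in line with "~3500 times"
```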

What are we talking about

sample

sequencing machines

bug

bunch of huge files

The bug’s DNA is hidden somewhere

Looking closer at the data

bunch of huge files

It boils down to a long list of reads:

read 1
read 2
read 3
…
read bazillion

Each read represents a short nucleotide sequence from the genome.

Additional information may be attached to it, for example error estimates.

What is a Read?

@SRR081241.20758946
CCAGATCCTGGCCCTAAACAGGTGGTAAGGAAGGAGAGAGTG…
+
IDCEFFGGHHGGGHIGIHGFEFCFFDDGFFGIIHHIGIHHFI…

An excerpt from a FASTQ file.

Line 1: read name
Line 2: read bases
Line 4: read quality scores

Bases: ACGTN

Quality scores: from ‘!’ (ASCII 33) to ‘~’ (ASCII 126)
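The record layout above can be read with a few lines of code. This is a minimal sketch that assumes every record occupies exactly four lines (no wrapped sequences); the `Read` type, the `parse_fastq` helper and the two toy records are made up for illustration and are not part of any CRAM tool.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class Read(NamedTuple):
    name: str       # header line without the leading '@'
    bases: str      # nucleotides: A, C, G, T or N
    qualities: str  # one ASCII character per base


def parse_fastq(handle: TextIO) -> Iterator[Read]:
    """Yield reads from a FASTQ stream with strict four-line records."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        bases = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line, ignored here
        qualities = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(bases) != len(qualities):
            raise ValueError(f"malformed record near {header!r}")
        yield Read(header[1:], bases, qualities)


# Toy input, not real sequencing data.
toy = "@toy.1\nACGTN\n+\nIIII!\n@toy.2\nGGCA\n+\nFFD#\n"
reads = list(parse_fastq(io.StringIO(toy)))
for read in reads:
    print(read.name, read.bases, read.qualities)
```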

What is a quality score?

The quality score is a Phred quality score encoded as ASCII symbols 33-126.

Basically: higher scores are better, so ‘!’ is bad, ‘I’ is good.
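To make the encoding concrete, here is a small sketch that turns quality characters back into numbers. It assumes the offset of 33 given above and the standard Phred definition, Q = -10 log10(P), where P is the estimated error probability; the function names are illustrative.

```python
from typing import List


def phred_scores(quality_string: str, offset: int = 33) -> List[int]:
    """Decode ASCII quality characters into integer Phred scores."""
    return [ord(symbol) - offset for symbol in quality_string]


def error_probability(score: int) -> float:
    """Estimated probability that a base with this Phred score is wrong."""
    return 10 ** (-score / 10)


scores = phred_scores("!I")
print(scores)                        # '!' decodes to 0, 'I' decodes to 40
print(error_probability(scores[1]))  # roughly 1 error in 10,000
```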

Reference based encoding

Reference sequence  T G A G C T C T A A G T A C C C G C G G T C T G T C C G
read 1              T G A G C T C T T A G T A G C
read 2                    G C T C T A A G T A G C C G C
read 3                      C T C T A A G T A G C C G C G
read 4                                  G T A G C C G C G G A C T G T
read 5                                                C G G T C T G T C C G

Read start position Read end position

Reference based encoding

Reference sequence  T G A G C T C T A A G T A C C C G C G G T C T G T C C G
read 1              . . . . . . . . T . . . . . .
read 2                    . . . . . . . . . . . . . . .
read 3                      . . . . . . . . . . . . . . .
read 4                                  . . . . . . . . . . A . . . .
read 5                                                . . . . . . . . . . .

Mismatching bases
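The idea on these slides can be sketched in code: store where a read starts, how long it is, and only the bases that differ from the reference. This is a simplified illustration that handles substitutions only (no insertions, deletions or clipping) and is not the actual CRAM record layout. The names `EncodedRead`, `encode` and `decode` and the toy sequences are assumptions made for the example.

```python
from typing import List, NamedTuple, Tuple


class EncodedRead(NamedTuple):
    start: int                          # 0-based start position on the reference
    length: int                         # number of bases in the read
    mismatches: List[Tuple[int, str]]   # (offset within the read, read base)


def encode(reference: str, read: str, start: int) -> EncodedRead:
    """Describe a read as its position plus the bases that differ."""
    segment = reference[start:start + len(read)]
    if len(segment) != len(read):
        raise ValueError("read extends past the end of the reference")
    mismatches = [
        (offset, base)
        for offset, (ref_base, base) in enumerate(zip(segment, read))
        if ref_base != base
    ]
    return EncodedRead(start, len(read), mismatches)


def decode(reference: str, encoded: EncodedRead) -> str:
    """Rebuild the read from the reference and the stored differences."""
    bases = list(reference[encoded.start:encoded.start + encoded.length])
    for offset, base in encoded.mismatches:
        bases[offset] = base
    return "".join(bases)


# Toy sequences, not the ones shown on the slide.
reference = "ACGTACGTTGCA"
read = "GTACCTT"  # differs from reference[2:9] == "GTACGTT" at offset 4

encoded = encode(reference, read, start=2)
print(encoded)
print(decode(reference, encoded) == read)
```

A perfectly matching read reduces to a start position and a length, which is where most of the saving comes from.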

Lossy quality scores

Approach 1: Quality scores are usually values from 0 to 39.

Let’s shrink them, so that they are from 0 to 7 now.
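A minimal sketch of such a shrinking step, assuming uniform bins of width 5; the slides do not say which binning scheme is actually used, so `bin_quality` and its `bin_width` parameter are illustrative only.

```python
def bin_quality(score: int, bin_width: int = 5) -> int:
    """Map a quality score in 0..39 onto a coarser level in 0..7."""
    if not 0 <= score <= 39:
        raise ValueError("expected a quality score between 0 and 39")
    return score // bin_width


original = [0, 4, 5, 17, 38, 39]
binned = [bin_quality(score) for score in original]
print(binned)
```

Eight levels fit in 3 bits, whereas 40 distinct values need 6, and long runs of identical binned values compress well. The step is lossy: the exact original score cannot be recovered from its bin.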

Approach 2: Let’s treat quality scores using alignment information.

For example: preserve only quality scores for mismatching bases.
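The example above can be sketched as a filter over an aligned read. This assumes the read is already aligned without gaps to a reference segment of the same length; the function name and the toy data are made up for illustration.

```python
from typing import Dict


def keep_mismatch_qualities(
    reference_segment: str, read_bases: str, qualities: str
) -> Dict[int, str]:
    """Keep quality characters only at positions where the read differs."""
    if not (len(reference_segment) == len(read_bases) == len(qualities)):
        raise ValueError("segment, bases and qualities must have equal length")
    return {
        offset: quality
        for offset, (ref_base, base, quality) in enumerate(
            zip(reference_segment, read_bases, qualities)
        )
        if ref_base != base
    }


# Toy data: one mismatch at offset 4 (reference G, read C).
kept = keep_mismatch_qualities("GTACGTT", "GTACCTT", "IIHG#FI")
print(kept)
```

All other quality scores are discarded, so this approach is lossy by design.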

horizontal

vertical

Comparison study: 1K Genomes exomes

BAM → compress → CRAM → uncompress → BAM

Original BAM → some analysis pipeline → original SNPs

Restored BAM → some analysis pipeline → restored SNPs

Comparison study: 1K Genomes exomes

CRAM NGS data compression

Chart: bits/base, from (bad) to (good), for untreated data (do nothing), CRAM lossless, CRAM lossy and CRAM very lossy. Untreated and CRAM lossless are lossless; CRAM lossy and CRAM very lossy are lossy.
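Bits/base is the compressed size divided by the number of sequenced bases. A small sketch of the arithmetic, with made-up numbers that are not taken from the study:

```python
def bits_per_base(file_size_bytes: int, total_bases: int) -> float:
    """Average number of stored bits per sequenced base."""
    if total_bases <= 0:
        raise ValueError("total_bases must be positive")
    return file_size_bytes * 8 / total_bases


# Hypothetical file: 1,000,000 bytes holding 4,000,000 bases.
example = bits_per_base(1_000_000, 4_000_000)
print(example)  # 2.0 bits per base
```

Lower values mean better compression.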

Progressive application of compression

Axes: sample value (low to high) and sample accessibility (easy to hard).

Compression levels: lossless, 2-fold, 20-fold, 200-fold.

References

More information:

http://www.ebi.ac.uk/ena/about/cram_toolkit

Mailing list:

http://listserver.ebi.ac.uk/mailman/listinfo/cram-dev

Publications:

Fritz, M.H., Leinonen, R., et al. (2011) Efficient storage of high throughput DNA sequencing data using reference-based compression. Genome Res. 21(5), 734-40.

Cochrane, G., Cook, C.E. and Birney, E. (2012) The future of DNA sequence archiving. GigaScience 1.