
The eXtensible SeQuence (XSQ) file format to support the new 5500 Series SOLiD™ Sequencers

Daryl J. Thomas, Ph.D. Associate Director, Bioinformatics Standards Feb 15, 2011


Why? Another new data format??

Main Drivers

•  Existing file formats do not support multiple calls per read
   −  Existing format cannot support ECC

•  Existing file formats are not space efficient
   −  FASTQ and CSFASTQ are not binary => XSQ reduces file size

Additional Benefits

•  Improved access for mapping and pairing

•  Simplified index (barcode) reassignment

•  File support for other sequence types

Converters will be available to enable use of alternate formats

•  Compatibility with current pipelines (XSQ => CSFASTA+QUAL, FASTQ)

•  Integration with legacy data (CSFASTA+QUAL => XSQ)
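
To illustrate the space-efficiency driver, the sketch below compares a text encoding of a few calls (color digits on one line, decimal quality values on another) with packing each call and its quality value into a single byte. The one-byte layout used here (quality value in the upper six bits, color call in the lower two) and all function names are assumptions made for illustration; the XSQ format specification is the authority on the actual on-disk encoding.

```python
from typing import Tuple


def pack_call(color: int, qv: int) -> int:
    """Pack a color call (0-3) and a quality value (0-63) into one byte.

    Assumed layout, for illustration only: QV in the upper six bits,
    color call in the lower two bits.
    """
    if not 0 <= color <= 3:
        raise ValueError("color call must be in 0..3")
    if not 0 <= qv <= 63:
        raise ValueError("quality value must be in 0..63")
    return (qv << 2) | color


def unpack_call(byte: int) -> Tuple[int, int]:
    """Return (color, qv) from a byte produced by pack_call."""
    return byte & 0b11, byte >> 2


# Hypothetical calls and quality values for one short read.
colors = [3, 1, 0, 2, 2]
qvs = [30, 28, 31, 12, 25]

# Binary form: one byte per call.
packed = bytes(pack_call(c, q) for c, q in zip(colors, qvs))

# Text form: color digits on one line, space-separated QVs on another.
text = "".join(map(str, colors)) + "\n" + " ".join(map(str, qvs)) + "\n"

print(len(packed), "bytes packed vs", len(text.encode("ascii")), "bytes as text")
```

The text form spends most of its bytes on decimal digits and separators for the quality values, which is where a packed binary representation saves space.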


Exact Call Chemistry (ECC)

•  ECC makes the best use of redundancy to protect discrete information

•  Precedent is ubiquitous in communication and data storage systems

•  ECC produces orthogonal data with a different colorspace encoding

   −  If the number of errors stays below a specific threshold, errors can be corrected

   −  Even when there are no errors, confidence in the answer is significantly increased
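
The threshold behavior described above is the same one seen in classic error-correcting codes used in communication and storage. As a generic illustration only (this is not the ECC chemistry, whose encoding is not described in these slides), the sketch below uses a three-fold repetition code with majority voting: one error per symbol is corrected, while two errors in the same symbol exceed the threshold. All names are hypothetical.

```python
from collections import Counter


def encode(symbols, copies=3):
    """Repeat every symbol `copies` times (a simple repetition code)."""
    return [[s] * copies for s in symbols]


def decode(groups):
    """Recover each symbol by majority vote over its copies."""
    return [Counter(group).most_common(1)[0][0] for group in groups]


message = list("ACGT")
sent = encode(message)

# One corrupted copy of the second symbol: below the threshold, so corrected.
one_error = [group[:] for group in sent]
one_error[1][0] = "T"
corrected = "".join(decode(one_error))

# Two corrupted copies of the third symbol: above the threshold, so miscalled.
two_errors = [group[:] for group in sent]
two_errors[2][0] = "A"
two_errors[2][1] = "A"
miscalled = "".join(decode(two_errors))

print(corrected, miscalled)
```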


Improved Accuracy with ECC

[Figure: rows of example sequence, labeled (1,1) and (1,3,0,3)]

(1,1) represents traditional SOLiD™ two base color encoding.

(1,3,0,3) represents the additional ECC primer for four base color encoding, which does not fit in a normal CSFASTA file but is required for base space decoding on the instrument.
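
For readers unfamiliar with two base color encoding, the sketch below shows the standard SOLiD scheme, in which each color (0 to 3) describes the transition between two adjacent bases. It covers only the traditional two-base encoding; the four base ECC encoding is not reproduced because its details are not given in these slides. Function names are illustrative.

```python
# Two-bit codes chosen so that color = code(base1) XOR code(base2).
BASE_TO_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_TO_BASE = {code: base for base, code in BASE_TO_CODE.items()}


def to_colors(bases):
    """Encode a base sequence as the list of colors between adjacent bases."""
    codes = [BASE_TO_CODE[b] for b in bases]
    return [left ^ right for left, right in zip(codes, codes[1:])]


def to_bases(first_base, colors):
    """Decode colors back to bases, given the known first base."""
    current = BASE_TO_CODE[first_base]
    decoded = [first_base]
    for color in colors:
        current ^= color  # each color moves from the previous base to the next
        decoded.append(CODE_TO_BASE[current])
    return "".join(decoded)


colors = to_colors("ATGGA")
round_trip = to_bases("A", colors)
print(colors, round_trip)
```

Because each decoded base depends on every color before it, a single wrong color changes all downstream bases when translating to base space, which is one reason additional redundancy is valuable.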


The data footprint of the 5500 is smaller

Example for 1 billion reads of 50 x 50 bp mate pair run

Instrument     File Size (GB)   Transfer Time (minutes at 30 MB/s)   File Extension
SOLiD 4        212.21           121                                  *.csfasta & *.qual
5500 Series    49.55            28                                   *.xsq

ONE .xsq file per lane

XSQ file size is ~75% smaller than CSFASTA+QUAL
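
The figures above can be checked with a few lines of arithmetic. The sketch below assumes 1 GB = 1024 MB, an assumption chosen because it reproduces the transfer times on the slide; the function name is illustrative.

```python
MB_PER_GB = 1024  # assumption: binary gigabytes, which matches the slide's minutes


def transfer_minutes(size_gb, rate_mb_per_s=30.0):
    """Minutes needed to move `size_gb` at a sustained rate in MB/s."""
    return size_gb * MB_PER_GB / rate_mb_per_s / 60


solid4_gb = 212.21  # *.csfasta & *.qual, from the table above
xsq_gb = 49.55      # *.xsq, from the table above

reduction_pct = (1 - xsq_gb / solid4_gb) * 100

print(f"SOLiD 4:     {transfer_minutes(solid4_gb):.1f} min")
print(f"5500 Series: {transfer_minutes(xsq_gb):.1f} min")
print(f"Size reduction: {reduction_pct:.1f}%")
```

The computed reduction falls between 76% and 77%, consistent with the rounded "~75% smaller" statement.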


What data is in an XSQ file?

One lane generates one .xsq file

•  Sequence: calls are generated in colorspace (non-ECC), basespace (ECC), or both

•  Quality Value

•  Metadata: instrument / sample / run metadata guides data organization and tracking

•  Filtering: filtered data may be included or excluded from mapping and analysis

•  Indexing: indexes (barcodes) may be re-assigned to correct user input errors


How is it organized? Hierarchically!

•  We chose the HDF5 file format due to its maturity and flexibility in data representation.

•  HDF5 is a data model, library, and file format for storing and managing data.

•  It supports an unlimited variety of datatypes, and is designed for flexible and efficient I/O and for high volume and complex data.

•  HDF5 is portable and is extensible, allowing applications to evolve in their use of HDF5.

•  The HDF5 Technology suite includes tools and applications for managing, manipulating, viewing, and analyzing data in the HDF5 format.


XSQ File Hierarchy (without indexing)

[Figure: hierarchy diagram. Node labels: File Root; RunMetadata; TagDetails; Library; numbered groups 0001, 0002, … 0660; Fragments; F3; R3; Base or Color, with the note "ECC run only"]


XSQ File Hierarchy (with indexing)

[Figure: hierarchy diagram. Node labels: File Root; RunMetadata; TagDetails; Indexing; Unclassified; Library 1, Library 2, … Library 96; numbered groups 0001, 0002, … 0660; Fragments; F3; R3; Base or Color, with the note "ECC run only"]
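
The slides do not spell out the exact nesting, so the sketch below assumes one plausible reading of the diagram: libraries at the top level, numbered groups (treated here as panels) beneath each library, and tag groups (F3, R3) beneath each panel. That nesting, the dict-based stand-in for a file, and the function name are all assumptions for illustration; real XSQ files are HDF5 and should be read with an HDF5 library against the XSQ specification. The point is only that a hierarchical layout yields independent units of work that can be partitioned by library, panel, and tag.

```python
from typing import Dict, Iterator, List

# Hypothetical in-memory stand-in for an indexed XSQ file:
# library -> panel -> tags. Not read from a real file.
layout: Dict[str, Dict[str, List[str]]] = {
    "Library 1": {"0001": ["F3", "R3"], "0002": ["F3", "R3"]},
    "Library 2": {"0001": ["F3", "R3"]},
    "Unclassified": {"0001": ["F3", "R3"]},
}


def work_units(tree, skip=()) -> Iterator[str]:
    """Yield one path per (library, panel, tag), skipping named libraries."""
    for library, panels in tree.items():
        if library in skip:
            continue
        for panel, tags in panels.items():
            for tag in tags:
                yield f"/{library}/{panel}/{tag}"


# Each path could be handed to a separate worker; unclassified reads are left out.
units = list(work_units(layout, skip=("Unclassified",)))
for path in units:
    print(path)
```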


Old and new analysis workflow comparison

Primary analysis on instrument => mapping and tertiary analysis off instrument

•  SOLiD 4 => *.csfasta & *.qual (1 pair per library) => BioScope™ 1.3 => BAM

•  5500 Series => *.xsq (1 file per lane) => LifeScope™ 2.0 => BAM


Data format conversion options

[Figure: conversion diagram. Node labels: SOLiD 4; 5500 Series; *.csfasta & *.qual (1 pair per library); *.xsq (1 file per lane); XSQ Splitter; ECC Basecalls; FASTQ; BioScope™ 1.3; LifeScope™ 2.0; Legacy Data; User Pipelines; 3rd party tools]


Life Tech provides data format converters

Q: Who accepts .xsq format?
A: LifeScope™ 2.0 supports the new format. We are working with 3rd party developers to adapt their workflows.

Q: What if I have pipelines that take .csfasta/.qual?
A: Life Technologies will provide tools on the SOLiD™ Developers Website to convert .xsq files into .csfasta/.qual.

Q: What if I have pipelines that take .fastq?
A: When the ECC module is used, base space data will be available in the .xsq file and can be exported into .fastq.

Q: Can I use data from both the SOLiD™ 4 and the 5500 Series SOLiD™ System for data analysis?
A: Yes, LifeScope™ 2.0 supports .xsq and allows combined analysis with .csfasta/.qual data via conversion to *.xsq.

Q: Is there a converter to go from .csfasta/.qual to .xsq?
A: Standalone converters will be provided.

Q: How does XSQ handle multiplexing?
A: An XSQ splitter provides an option to separate libraries into separate *.xsq files when necessary.

Q: Are APIs available for accessing XSQ data?
A: Example code for accessing XSQ data and metadata will be provided.

[Figure labels: .xsq, .fastq, .csfasta/.qual]
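
As a rough picture of what an XSQ => CSFASTA+QUAL conversion has to emit, the sketch below formats one read held in memory into a .csfasta record and a matching .qual record. It does not read XSQ files and is not the Life Technologies converter; the read identifier, primer base, values, and function names are made up for illustration.

```python
def csfasta_record(read_id, primer_base, colors):
    """Format one read as a .csfasta record: header, then primer base + colors."""
    color_string = "".join(str(c) for c in colors)
    return f">{read_id}\n{primer_base}{color_string}\n"


def qual_record(read_id, qvs):
    """Format the matching .qual record: header, then space-separated QVs."""
    qv_string = " ".join(str(q) for q in qvs)
    return f">{read_id}\n{qv_string}\n"


# Hypothetical read: identifier, primer base, color calls, quality values.
read_id = "1_2_3_F3"
csfasta_text = csfasta_record(read_id, "T", [3, 1, 0, 2])
qual_text = qual_record(read_id, [30, 28, 31, 12])

print(csfasta_text, end="")
print(qual_text, end="")
```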


XSQ file supports different data formats

Contents of the .xsq file:

Without ECC    Color sequence and QV
With ECC       Color sequence and QV, and Base sequence

Base sequence is available only when the ECC module is used
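
One practical consequence is that a FASTQ export only makes sense for reads that carry base-space calls. The sketch below formats a FASTQ record and refuses to do so when no base sequence is present. The function name, the use of None to mean "no base calls", and the Phred+33 quality encoding are assumptions for illustration, not a description of the official converters.

```python
from typing import Optional, Sequence


def fastq_record(read_id: str, bases: Optional[str], qvs: Sequence[int]) -> str:
    """Format a FASTQ record; requires base-space calls (ECC runs only)."""
    if bases is None:
        raise ValueError("no base sequence: the read was produced without ECC")
    if len(bases) != len(qvs):
        raise ValueError("bases and quality values must have the same length")
    # Assumed quality encoding: Phred+33 (one printable character per base).
    quality = "".join(chr(q + 33) for q in qvs)
    return f"@{read_id}\n{bases}\n+\n{quality}\n"


# Hypothetical ECC read with base calls and quality values.
record = fastq_record("r1", "ACGT", [30, 28, 31, 12])
print(record, end="")
```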


Each lane may generate different data

Lane #   ECC Module?   Default .xsq output   Optional .xsq output
1        Yes           Base + QV             Base + QV & 6 primers Color Seq
2        No            Color + QV            N/A
4        No            Color + QV            N/A

[The values of the Library Type column are not recoverable from the slide.]


XSQ Release Package

Resources available now

•  XSQ webinar slides http://solidsoftwaretools.com/gf/project/xsq/

•  XSQ format specification http://solidsoftwaretools.com/gf/project/xsq/

•  Example XSQ files http://solidsoftwaretools.com/gf/project/xsq/

•  HDFView: http://www.hdfgroup.org/hdf-java-html/hdfview/

•  HDF5 APIs: http://www.hdfgroup.org/HDF5/release/obtain5.html

•  ECC: Exact Call Chemistry whitepaper: http://www.appliedbiosystems.com/absite/us/en/home/applications-technologies/solid-next-generation-sequencing/publications-literature.html

Additional resources coming out in early March 2011

•  XSQ file format converters

•  XSQ documentation
   −  File format description, with example HDF5 API code for XSQ access
   −  File format converters and usage


Why? An opportunity

Main Drivers

•  Open format for sequence data exchange and storage

•  Extensibility means that multiple data types, including ECC, are supported in a common format.

•  Improved encoding reduces file size

•  Metadata in file retains experiment context and traceability

•  Hierarchical format supports partitioning and parallelization, leading to simplified index reassignment and elimination of read alignment pairing

To ease the transition, converters will be available

•  Compatibility with current pipelines (XSQ => CSFASTA+QUAL, FASTQ)

•  Integration with legacy data (CSFASTA+QUAL => XSQ)


© 2011 Life Technologies Corporation. All rights reserved.

The trademarks mentioned herein are the property of Life Technologies Corporation or their respective owners