1

The eXtensible SeQuence (XSQ) file format to support the new 5500 Series SOLiD™ Sequencers

Daryl J. Thomas, Ph.D.
Associate Director, Bioinformatics Standards
Feb 15, 2011

2

Why? Another new data format??

Main Drivers

•  Existing file formats do not support multiple calls per read
−  Existing formats cannot support ECC

•  Existing file formats are not space efficient
−  FASTQ and CSFASTQ are not binary => XSQ reduces file size
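To make the space argument concrete, here is a minimal sketch comparing a text encoding of calls and quality values with a binary one. The packing shown (a 2-bit call plus a 6-bit quality value in one byte) is an assumption chosen for illustration; the slides do not describe the actual on-disk encoding, and the helper names are hypothetical.

```python
from typing import Tuple

def pack_call_qv(call: int, qv: int) -> int:
    """Pack a 2-bit call (0-3) and a 6-bit quality value (0-63) into one byte.

    Assumed layout for illustration only: QV in the upper six bits,
    call in the lower two bits.
    """
    if not 0 <= call <= 3 or not 0 <= qv <= 63:
        raise ValueError("call must be 0-3 and qv must be 0-63")
    return (qv << 2) | call

def unpack_call_qv(byte: int) -> Tuple[int, int]:
    """Recover (call, qv) from a byte produced by pack_call_qv."""
    return byte & 0b11, byte >> 2

calls = [0, 1, 2, 3, 1, 0]
qvs = [30, 28, 25, 12, 33, 31]

# Binary form: one byte per position.
packed = bytes(pack_call_qv(c, q) for c, q in zip(calls, qvs))

# Text form: one character per call, plus space-separated decimal QVs.
text_calls = "".join(str(c) for c in calls)
text_qvs = " ".join(str(q) for q in qvs)
text_size = len(text_calls) + len(text_qvs)

print(f"binary: {len(packed)} bytes, text: {text_size} bytes")
```

Even before any compression, the text form spends several characters per position where the packed form spends one byte.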

Additional Benefits

•  Improved access for mapping and pairing

•  Simplified index (barcode) reassignment

•  File support for other sequence types

Converters will be available to enable use of alternate formats

•  Compatibility with current pipelines (XSQ => CSFASTA+QUAL, FASTQ)

•  Integration with legacy data (CSFASTA+QUAL => XSQ)

3

Exact Call Chemistry (ECC)

•  ECC makes the best use of redundancy to protect discrete information

•  Precedent is ubiquitous in communication and data storage systems

•  ECC produces orthogonal data with a different colorspace encoding

−  If the number of errors stays below a specific threshold, errors can be corrected

−  Even when there are no errors, confidence in the answer is significantly increased
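The threshold behavior described above can be illustrated with the simplest error-correcting scheme from communication systems, a repetition code with majority voting. This is a generic sketch of the principle and not the code used by ECC, which the slides do not specify; the function names are hypothetical.

```python
from collections import Counter

def repeat_encode(symbols, copies=3):
    """Store each symbol `copies` times (the added redundancy)."""
    return [[s] * copies for s in symbols]

def majority_decode(groups):
    """Recover each symbol as the most common value among its copies."""
    return [Counter(group).most_common(1)[0][0] for group in groups]

message = [1, 3, 0, 3]

# One corrupted copy of a symbol: below the threshold, so it is corrected.
received = repeat_encode(message)
received[1][0] = 2
corrected = majority_decode(received)

# Two corrupted copies of the same symbol: above the threshold, so decoding fails.
received[2][0] = 1
received[2][1] = 1
failed = majority_decode(received)

print(corrected == message, failed == message)
```

With three copies per symbol, one error per symbol is always recoverable, while two identical errors outvote the true value.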

4

Improved Accuracy with ECC

[Figure: example read sequences shown under the (1,1) and (1,3,0,3) encodings.]

(1,1) represents the traditional SOLiD™ two-base color encoding. (1,3,0,3) represents the additional ECC primer for four-base color encoding, which does not fit in a normal CSFASTA file but is required for base space decoding on the instrument.
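For readers unfamiliar with two-base color encoding, the sketch below shows the standard mapping: with A=0, C=1, G=2, T=3, the color for a pair of adjacent bases is the XOR of their codes. The slides do not spell this mapping out, and the function names are illustrative. The sketch covers only the traditional two-base encoding, not the four-base ECC primer.

```python
BASES = "ACGT"  # A=0, C=1, G=2, T=3

def encode_colors(primer_base: str, seq: str) -> str:
    """Encode a base sequence as a color-space read (primer base + colors)."""
    prev = BASES.index(primer_base)
    colors = []
    for base in seq:
        cur = BASES.index(base)
        colors.append(prev ^ cur)  # color = XOR of adjacent base codes
        prev = cur
    return primer_base + "".join(str(c) for c in colors)

def decode_colors(cs_read: str) -> str:
    """Decode a color-space read back to bases, starting from the primer base."""
    prev = BASES.index(cs_read[0])
    bases = []
    for ch in cs_read[1:]:
        prev ^= int(ch)  # each color tells how to step from the previous base
        bases.append(BASES[prev])
    return "".join(bases)

read = encode_colors("T", "ACGG")
print(read, decode_colors(read))

# A single wrong color changes every base decoded after it.
print(decode_colors("T3030"))
```

Because decoding chains through every preceding color, one miscalled color corrupts the rest of the decoded read, which is why independent, redundant information is valuable for base space decoding.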

5

The data footprint of the 5500 is smaller

Example for 1 billion reads of a 50 x 50 bp mate pair run. ONE .xsq file per lane.

Instrument    File Size (GB)   Transfer Time (30 MB/s; minutes)   File Extension
SOLiD 4       212.21           121                                *.csfasta & *.qual
5500 Series   49.55            28                                 *.xsq

XSQ file size is ~75% smaller than CSFASTA+QUAL
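The figures in the table can be checked with a few lines of arithmetic. The sketch assumes 1 GB = 1024 MB, which is an assumption on my part; it is the convention that reproduces the transfer times shown.

```python
SOLID4_GB = 212.21   # *.csfasta & *.qual
XSQ_GB = 49.55       # *.xsq
RATE_MB_PER_S = 30

def transfer_minutes(size_gb: float, rate_mb_per_s: float, mb_per_gb: int = 1024) -> float:
    """Minutes needed to move size_gb at rate_mb_per_s (assumes 1 GB = 1024 MB)."""
    return size_gb * mb_per_gb / rate_mb_per_s / 60

solid4_minutes = transfer_minutes(SOLID4_GB, RATE_MB_PER_S)
xsq_minutes = transfer_minutes(XSQ_GB, RATE_MB_PER_S)
reduction = 1 - XSQ_GB / SOLID4_GB

print(f"SOLiD 4: {solid4_minutes:.1f} min, 5500 Series: {xsq_minutes:.1f} min")
print(f"size reduction: {reduction:.1%}")
```

The computed reduction comes out slightly above three quarters, consistent with the "~75% smaller" summary.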

6

What data is in an XSQ file?

One lane generates one .xsq file, which holds:

•  Sequence: calls are generated in colorspace (non-ECC), basespace (ECC), or both

•  Quality Value

•  Metadata: instrument / sample / run metadata guides data organization and tracking

•  Filtering: filtered data may be included or excluded from mapping and analysis

•  Indexing: indexes (barcodes) may be re-assigned to correct user input errors

7

How is it organized? Hierarchically!

•  We chose the HDF5 file format due to its maturity and flexibility in data representation.

•  HDF5 is a data model, library, and file format for storing and managing data.

•  It supports an unlimited variety of datatypes, and is designed for flexible and efficient I/O and for high volume and complex data.

•  HDF5 is portable and is extensible, allowing applications to evolve in their use of HDF5.

•  The HDF5 Technology suite includes tools and applications for managing, manipulating, viewing, and analyzing data in the HDF5 format.

8

XSQ File Hierarchy (without indexing)

File Root
    RunMetadata
        TagDetails
    Library
        0001, 0002, 0660, …
            Fragments
            F3, R3
                Color or Base (Base for ECC runs only)

9

XSQ File Hierarchy (with indexing)

File Root
    RunMetadata
        TagDetails
    Indexing
    Library 1, Library 2, …, Library 96, Unclassified
        0001, 0002, 0660, …
            Fragments
            F3, R3
                Color or Base (Base for ECC runs only)
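The hierarchy can be pictured as a tree of named groups, which is what makes per-library partitioning straightforward. The sketch below models a lane as nested dictionaries using the group names from the slide; the exact nesting is my reading of the diagram, the leaf contents are empty placeholders, and the helper functions are hypothetical. It does not read a real .xsq file.

```python
from typing import Dict, Iterator

def dataset_paths(node: dict, prefix: str = "") -> Iterator[str]:
    """Yield an HDF5-style path for every leaf in a nested-dict tree."""
    for name, child in node.items():
        path = f"{prefix}/{name}"
        if isinstance(child, dict):
            yield from dataset_paths(child, path)
        else:
            yield path

def make_panel() -> dict:
    """One numbered group holding Fragments plus F3 and R3 color calls."""
    return {"Fragments": [], "F3": {"Color": []}, "R3": {"Color": []}}

# A toy indexed lane: run metadata plus one group per library.
lane = {
    "RunMetadata": {"TagDetails": []},
    "Library 1": {"0001": make_panel(), "0002": make_panel()},
    "Library 2": {"0001": make_panel()},
    "Unclassified": {"0001": make_panel()},
}

def split_by_library(lane_tree: dict) -> Dict[str, dict]:
    """Return one tree per library, each carrying the shared run metadata."""
    metadata = lane_tree["RunMetadata"]
    return {
        name: {"RunMetadata": metadata, name: group}
        for name, group in lane_tree.items()
        if name != "RunMetadata"
    }

for path in dataset_paths(lane):
    print(path)

per_library = split_by_library(lane)
print(sorted(per_library))
```

Because each library is a self-contained subtree, separating or reassigning a library is a matter of moving or renaming one group rather than rewriting every read.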

10

Old and new analysis workflow comparison

Primary analysis runs on the instrument; mapping and tertiary analysis run off instrument.

SOLiD 4:      *.csfasta & *.qual (1 pair per library)  =>  BioScope™ 1.3   =>  BAM
5500 Series:  *.xsq (1 file per lane)                  =>  LifeScope™ 2.0  =>  BAM

11

Data format conversion options

[Figure: conversion paths among SOLiD 4 output (*.csfasta & *.qual, 1 pair per library), 5500 Series output (*.xsq, 1 file per lane), legacy data, ECC basecalls, and FASTQ, feeding BioScope™ 1.3, LifeScope™ 2.0, user pipelines, and 3rd party tools. An XSQ Splitter is shown for *.xsq files.]

12

Life Tech provides data format converters

Q: Who accepts the .xsq format?
A: LifeScope™ 2.0 supports the new format. We are working with 3rd party developers to adapt their workflows.

Q: What if I have pipelines that take .csfasta/.qual?
A: Life Technologies will provide tools on the SOLiD™ Developers Website to convert .xsq files into .csfasta/.qual.

Q: What if I have pipelines that take .fastq?
A: When the ECC module is used, base space data will be available in the .xsq file and can be exported into .fastq.

Q: Can I combine data from SOLiD™ 4 and the 5500 Series SOLiD™ System for data analysis?
A: Yes, LifeScope™ 2.0 supports .xsq and allows combined analysis with .csfasta/.qual data via conversion to *.xsq.

Q: Is there a converter to go from .csfasta/.qual to .xsq?
A: Standalone converters will be provided.

Q: How does XSQ handle multiplexing?
A: An XSQ splitter provides the option to separate libraries into separate *.xsq files when necessary.

Q: Are APIs available for accessing XSQ data?
A: Example code for accessing XSQ data and metadata will be provided.
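As a rough picture of what an XSQ => CSFASTA+QUAL conversion has to produce, the sketch below formats one read, already extracted into plain Python lists, as a .csfasta record and a matching .qual record. It is not the Life Technologies converter; the function name and the read identifier are hypothetical, and reading the values out of the .xsq file is left out.

```python
from typing import List, Tuple

def to_csfasta_qual(read_id: str, primer_base: str,
                    colors: List[int], qvs: List[int]) -> Tuple[str, str]:
    """Format one read as a (.csfasta record, .qual record) pair.

    The .csfasta record is a header line followed by the primer base and
    the color calls; the .qual record repeats the header followed by
    space-separated quality values, one per color call.
    """
    if len(colors) != len(qvs):
        raise ValueError("need exactly one quality value per color call")
    header = f">{read_id}"
    csfasta = f"{header}\n{primer_base}{''.join(str(c) for c in colors)}\n"
    qual = f"{header}\n{' '.join(str(q) for q in qvs)}\n"
    return csfasta, qual

# Hypothetical read identifier and values, for illustration only.
csfasta_record, qual_record = to_csfasta_qual(
    "1_14_27_F3", "T", [3, 1, 3, 0], [30, 28, 25, 12]
)
print(csfasta_record, end="")
print(qual_record, end="")
```

The two records share a header so that downstream tools can pair each color call with its quality value.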

13

XSQ file supports different data formats

Without ECC:  Color sequence and QV
With ECC:     Color sequence and QV, and Base sequence

Base sequence is available only when the ECC module is used

14

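Each lane may generate different data

[Table: per-lane .xsq output for lanes 1, 2, and 4, with columns Lane #, Library Type, ECC Module?, Default .xsq output, and Optional. Entries for the two lanes without the ECC module: Color + QV, with N/A in the remaining output column. Entries for the lane with the ECC module: Base + QV, and Base + QV & 6 primers Color Seq.]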

15

XSQ Release Package

Resources available now

•  XSQ webinar slides: http://solidsoftwaretools.com/gf/project/xsq/

•  XSQ format specification: http://solidsoftwaretools.com/gf/project/xsq/

•  Example XSQ files: http://solidsoftwaretools.com/gf/project/xsq/

•  HDFView: http://www.hdfgroup.org/hdf-java-html/hdfview/

•  HDF5 APIs: http://www.hdfgroup.org/HDF5/release/obtain5.html

•  ECC: Exact Call Chemistry whitepaper: http://www.appliedbiosystems.com/absite/us/en/home/applications-technologies/solid-next-generation-sequencing/publications-literature.html

Additional resources coming out in early March 2011

•  XSQ file format converters

•  XSQ documentation
−  File format description, with example HDF5 API code for XSQ access
−  File format converters and usage

16

Why? An opportunity

Main Drivers

•  Open format for sequence data exchange and storage

•  Extensibility means that multiple data types, including ECC, are supported in a common format.

•  Improved encoding reduces file size

•  Metadata in file retains experiment context and traceability

•  Hierarchical format supports partitioning and parallelization, leading to simplified index reassignment and elimination of read alignment pairing

To ease the transition, converters will be available

•  Compatibility with current pipelines (XSQ => CSFASTA+QUAL, FASTQ)

•  Integration with legacy data (CSFASTA+QUAL => XSQ)

17

© 2011 Life Technologies Corporation. All rights reserved.

The trademarks mentioned herein are the property of Life Technologies Corporation or their respective owners
