1
The eXtensible SeQuence (XSQ) file format to support the new 5500 Series SOLiD™ Sequencers
Daryl J. Thomas, Ph.D.
Associate Director, Bioinformatics Standards
Feb 15, 2011
2
Why? Another new data format??
Main Drivers
• Existing file formats do not support multiple calls per read
  − Existing format cannot support ECC
• Existing file formats are not space efficient
  − FASTQ and CSFASTQ are not binary => XSQ reduces file size
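To illustrate why a binary encoding can be smaller than text, here is a minimal sketch that packs one call and its quality value into a single byte. The bit layout (call in the two low bits, quality value in the six high bits) and the function names are assumptions made for illustration; the slides do not describe the actual XSQ layout.

```python
from typing import Tuple


def pack_call(call: int, qv: int) -> int:
    """Pack a 2-bit call (0-3) and a 6-bit quality value (0-63) into one byte.

    Assumed layout for illustration only: qv in the high 6 bits, call in the low 2.
    """
    if not 0 <= call <= 3:
        raise ValueError("call must be in 0..3")
    if not 0 <= qv <= 63:
        raise ValueError("quality value must be in 0..63")
    return (qv << 2) | call


def unpack_call(byte: int) -> Tuple[int, int]:
    """Return (call, qv) from a byte produced by pack_call."""
    return byte & 0b11, byte >> 2


calls = [3, 1, 0, 3, 1]
qvs = [30, 28, 33, 12, 40]

# Binary form: one byte per position.
packed = bytes(pack_call(c, q) for c, q in zip(calls, qvs))

# Text form: one character per call plus space-separated decimal quality values.
text = "".join(map(str, calls)) + "\n" + " ".join(map(str, qvs)) + "\n"

print(len(packed), "bytes packed vs", len(text), "bytes as text")
```

Under this assumed layout each position costs exactly one byte, whereas the text form spends one character on the call and two or three more on the quality value and its separator.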
Additional Benefits
• Improved access for mapping and pairing
• Simplified index (barcode) reassignment
• File support for other sequence types
Converters will be available to enable use of alternate formats
• Compatibility with current pipelines (XSQ => CSFASTA+QUAL, FASTQ)
• Integration with legacy data (CSFASTA+QUAL => XSQ)
3
Exact Call Chemistry (ECC)
• ECC makes the best use of redundancy to protect discrete information
• Precedent is ubiquitous in communication and data storage systems
• ECC produces orthogonal data with a different colorspace encoding
  − If the number of errors stays below a specific threshold, errors can be corrected
  − Even when there are no errors, confidence in the answer is significantly increased
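The general principle, redundancy that corrects errors up to a threshold, can be sketched with a simple repetition code of the kind used in communication systems. This is a generic illustration only; it is not the encoding that ECC uses, and the function names are hypothetical.

```python
def encode(bits, n=3):
    """Repeat every bit n times (a repetition code, NOT the actual ECC scheme)."""
    return [b for b in bits for _ in range(n)]


def decode(received, n=3):
    """Recover each bit by majority vote over its block of n copies."""
    blocks = [received[i:i + n] for i in range(0, len(received), n)]
    return [1 if sum(block) > n // 2 else 0 for block in blocks]


message = [1, 0, 1, 1]
sent = encode(message)

# One flipped bit per block stays below the threshold and is corrected.
corrupted = sent.copy()
corrupted[1] ^= 1
corrupted[5] ^= 1
recovered = decode(corrupted)

# Two flipped bits in the same block exceed the threshold and are miscorrected.
too_noisy = sent.copy()
too_noisy[0] ^= 1
too_noisy[1] ^= 1
failed = decode(too_noisy)

print(recovered == message, failed == message)
```

With three copies per bit, any single error within a block is corrected, while two errors in the same block flip the decoded bit.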
4
Improved Accuracy with ECC
[Figure: example sequence with rows of calls, labeled (1,1) and (1,3,0,3)]
(1,1) represents the traditional SOLiD™ two-base color encoding.
(1,3,0,3) represents the additional ECC primer for four-base color encoding, which does not fit in a normal CSFASTA file but is required for base space decoding on the instrument.
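As a minimal sketch of the traditional two-base color encoding only: each color is determined by a pair of adjacent bases, which can be written as the XOR of 2-bit base codes (A=0, C=1, G=2, T=3). The function names and the example sequence are illustrative and not taken from the slide; the four-base ECC encoding is not covered here.

```python
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"


def encode_colors(bases):
    """Translate a base sequence into colors, one color per adjacent base pair."""
    codes = [BASE_TO_BITS[b] for b in bases]
    return "".join(str(a ^ b) for a, b in zip(codes, codes[1:]))


def decode_colors(first_base, colors):
    """Rebuild the base sequence from a known first base and the color string."""
    code = BASE_TO_BITS[first_base]
    bases = [first_base]
    for color in colors:
        code ^= int(color)  # each color says how to move from one base to the next
        bases.append(BITS_TO_BASE[code])
    return "".join(bases)


colors = encode_colors("ATGGCA")
print(colors)                      # 31031
print(decode_colors("A", colors))  # ATGGCA
```

Decoding needs a known starting base, and a single wrong color changes every base after it, which is one reason additional redundancy is useful for base space decoding.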
5
The data footprint of the 5500 is smaller
Example for 1 billion reads of 50 x 50 bp mate pair run
ONE .xsq file per lane
System       File Size (GB)  Transfer Time (30 MB/s; minutes)  File Extension
SOLiD 4      212.21          121                               *.csfasta & *.qual
5500 Series  49.55           28                                *.xsq
XSQ file size is ~75% smaller than CSFASTA+QUAL
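The figures above can be checked with a few lines of arithmetic. This sketch assumes 1 GB = 1024 MB, which is the convention that reproduces the transfer times on the slide; the function name is hypothetical.

```python
def transfer_minutes(size_gb, rate_mb_per_s=30.0, mb_per_gb=1024):
    """Minutes needed to move size_gb at rate_mb_per_s (assumes 1 GB = 1024 MB)."""
    return size_gb * mb_per_gb / rate_mb_per_s / 60


csfasta_qual_gb = 212.21  # SOLiD 4, *.csfasta & *.qual
xsq_gb = 49.55            # 5500 Series, *.xsq

reduction = 1 - xsq_gb / csfasta_qual_gb

print(round(transfer_minutes(csfasta_qual_gb)), "minutes for CSFASTA+QUAL")
print(round(transfer_minutes(xsq_gb)), "minutes for XSQ")
print(f"{reduction:.1%} smaller")
```

The computed reduction is a little under 77%, consistent with the slide's rounded "~75%".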
6
What data is in an XSQ file?
One lane generates one .xsq file
• Sequence: calls are generated in colorspace (non-ECC), basespace (ECC), or both
• Quality Value
• Metadata: instrument / sample / run metadata guides data organization and tracking
• Filtering: filtered data may be included or excluded from mapping and analysis
• Indexing: indexes (barcodes) may be re-assigned to correct user input errors
7
How is it organized? Hierarchically!
• We chose the HDF5 file format due to its maturity and flexibility in data representation.
• HDF5 is a data model, library, and file format for storing and managing data.
• It supports an unlimited variety of datatypes, and is designed for flexible and efficient I/O and for high volume and complex data.
• HDF5 is portable and is extensible, allowing applications to evolve in their use of HDF5.
• The HDF5 Technology suite includes tools and applications for managing, manipulating, viewing, and analyzing data in the HDF5 format.
8
XSQ File Hierarchy (without indexing)
[Diagram nodes: File Root; RunMetadata; TagDetails; Library; 0001, 0002, 0660, …; Fragments; F3, R3; Base or Color; annotation "ECC run only"]
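A hierarchy like this can be listed with a short recursive walk. The sketch below uses a nested dictionary as a stand-in for the file; the nesting shown (TagDetails under RunMetadata, tiles under Library, a "Color" dataset under F3) is a guess from the flattened diagram, not the XSQ specification. Group objects in the h5py library expose the same `.items()` mapping interface, so the same function should apply to a file opened with `h5py.File(path, "r")`, though that is not exercised here.

```python
def walk(node, path=""):
    """Yield (path, is_group) for every entry below a mapping-like node."""
    for name, child in node.items():
        child_path = f"{path}/{name}"
        is_group = hasattr(child, "items")  # groups contain entries, datasets do not
        yield child_path, is_group
        if is_group:
            yield from walk(child, child_path)


# Stand-in layout: ASSUMED nesting for illustration, None marks a dataset.
example = {
    "RunMetadata": {"TagDetails": {}},
    "Library": {
        "0001": {"Fragments": {}, "F3": {"Color": None}, "R3": {"Color": None}},
        "0002": {"Fragments": {}, "F3": {"Color": None}, "R3": {"Color": None}},
    },
}

for entry_path, is_group in walk(example):
    print(entry_path, "(group)" if is_group else "(dataset)")
```

Because each library and tile sits in its own group, a reader can visit only the branches it needs, which fits the partitioning and parallelization benefit the slides describe.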
9
XSQ File Hierarchy (with indexing)
[Diagram nodes: File Root; RunMetadata; TagDetails; Indexing; Library 1, Library 2, …, Library 96, Unclassified; 0001, 0002, 0660, …; Fragments; F3, R3; Base or Color; annotation "ECC run only"]
10
Old and new analysis workflow comparison
Primary analysis on instrument; mapping and tertiary analysis off instrument
SOLiD 4: *.csfasta & *.qual (1 pair per library) => BioScope™ 1.3 => BAM
5500 Series: *.xsq (1 file per lane) => LifeScope™ 2.0 => BAM
11
Data format conversion options
[Diagram labels: SOLiD 4; 5500 Series; Legacy Data; *.csfasta & *.qual (1 pair per library); *.xsq (1 file per lane); XSQ Splitter; ECC Basecalls; FASTQ; BioScope™ 1.3; LifeScope™ 2.0; 3rd party tools; User Pipelines]
12
Life Tech provides data format converters
Questions and Answers
Q: Who accepts .xsq format?
A: LifeScope™ 2.0 supports the new format. We are working with 3rd party developers to adapt their workflows.
Q: What if I have pipelines that take .csfasta/.qual?
A: Life Technologies will provide tools on the SOLiD™ Developers Website to convert .xsq files into .csfasta/.qual.
Q: What if I have pipelines that take .fastq?
A: When the ECC module is used, base space data will be available in the .xsq file and can be exported into .fastq.
Q: Can I use data from SOLiD™ 4 and 5500 Series SOLiD™ System for data analysis?
A: Yes, LifeScope™ 2.0 supports .xsq and allows combined analysis with .csfasta/.qual data via conversion to *.xsq.
Q: Is there a converter to go from .csfasta/.qual to .xsq?
A: Standalone converters will be provided.
Q: How does XSQ handle multiplexing?
A: An XSQ splitter provides the option to separate libraries into separate *.xsq files when necessary.
Q: Are APIs available for accessing XSQ data?
A: Example code for accessing XSQ data and metadata will be provided.
13
XSQ file supports different data formats
Without ECC: Color sequence and QV
With ECC: Color sequence and QV, and Base sequence
Base sequence is available only when the ECC module is used
.xsq file
14
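Each lane may generate different data
[Table: columns are Lane #, Library Type, ECC Module?, Default .xsq output, Optional. Rows are shown for lanes 1, 2, and 4. ECC Module? values: Yes, No, No. Output cells: Color + QV (twice), N/A (twice), Base + QV, and Base + QV & 6 primers Color Seq.]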
15
XSQ Release Package
Resources available now
• XSQ webinar slides: http://solidsoftwaretools.com/gf/project/xsq/
• XSQ format specification: http://solidsoftwaretools.com/gf/project/xsq/
• Example XSQ files: http://solidsoftwaretools.com/gf/project/xsq/
• HDFView: http://www.hdfgroup.org/hdf-java-html/hdfview/
• HDF5 APIs: http://www.hdfgroup.org/HDF5/release/obtain5.html
• ECC: Exact Call Chemistry whitepaper: http://www.appliedbiosystems.com/absite/us/en/home/applications-technologies/solid-next-generation-sequencing/publications-literature.html
Additional resources coming out in early March 2011
• XSQ file format converters
• XSQ documentation
  − File format description, with example HDF5 API code for XSQ access
  − File format converters and usage
16
Why? An opportunity
Main Drivers
• Open format for sequence data exchange and storage
• Extensibility means that multiple data types, including ECC, are supported in a common format
• Improved encoding reduces file size
• Metadata in file retains experiment context and traceability
• Hierarchical format supports partitioning and parallelization, leading to simplified index reassignment and elimination of read alignment pairing
To ease the transition, converters will be available
• Compatibility with current pipelines (XSQ => CSFASTA+QUAL, FASTQ)
• Integration with legacy data (CSFASTA+QUAL => XSQ)
17
© 2011 Life Technologies Corporation. All rights reserved.
The trademarks mentioned herein are the property of Life Technologies Corporation or their respective owners