IMGS 2012 Bioinformatics Workshop:

File Formats for Next Gen Sequence Analysis

[Figure: sequencing throughput (gigabases) and cost per Kb, 1990-2009. Source: Lucinda Fulton, The Genome Center at Washington University]

Sequencing Technologies

http://www.geospiza.com/finchtalk/uploaded_images/plates-and-slides-718301.png

Sequence “Space”

• Roche 454 – Flow space
  – Measures the pyrophosphate released when a nucleotide is added to a growing DNA chain
  – Flow space describes sequence in terms of these base incorporations
  – http://www.youtube.com/watch?v=bFNjxKHP8Jc

• AB SOLiD – Color space
  – Sequencing by DNA ligation via synthetic DNA molecules that contain two nested known bases with a fluorescent dye
  – Each base is sequenced twice
  – http://www.youtube.com/watch?v=nlvyF8bFDwM&feature=related

• Illumina/Solexa – Base space
  – Single base extensions of fluorescent-labeled nucleotides with protected 3' OH groups
  – Sequencing via cycles of base addition/detection followed by deprotection of the 3' OH
  – http://www.youtube.com/watch?v=77r5p8IBwJk&feature=related

• GenomeTV – Next Generation Sequencing (lecture)
  – http://www.youtube.com/watch?v=g0vGrNjpyA8&feature=related

http://finchtalk.geospiza.com/2008/03/color-space-flow-space-sequence-space_23.html


Flexible
  – Good: with rapidly changing data/tech
  – Poor: validation

Human Readable
  – Convenient for debugging
  – Computer doesn’t care!

Sequences: FASTA, FASTQ, SAM/BAM

Alignments: SAM/BAM, MAF

Annotations: BED, GTF, GFF3, GVF, VCF

http://genome.ucsc.edu/FAQ/FAQformat.html

http://www.sequenceontology.org/

FASTQ

FASTA

FASTQ: Data Format

• FASTQ
  – Text based
  – Encodes sequence calls and quality scores with ASCII characters
  – Stores minimal information about the sequence read
  – 4 lines per sequence
    • Line 1: begins with “@”, followed by the sequence identifier and an optional description
    • Line 2: the sequence
    • Line 3: begins with “+”, followed by the sequence identifier and description (both are optional)
    • Line 4: encoding of the quality scores for the sequence in line 2

• References/Documentation
  – http://maq.sourceforge.net/fastq.shtml
  – Cock et al. (2009). Nuc Acids Res 38:1767-1771.
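The four-line layout described above can be sketched as a small reader. This is a minimal illustration, not a reference parser: it assumes each record occupies exactly four unwrapped lines, and the names `FastqRecord` and `read_fastq` as well as the example read are hypothetical.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    identifier: str  # line 1 without the leading "@"
    sequence: str    # line 2
    quality: str     # line 4, one ASCII character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ file with exactly 4 lines per record."""
    while True:
        header = handle.readline()
        if not header:            # end of file
            return
        if not header.strip():    # tolerate blank lines between records
            continue
        header = header.rstrip("\n")
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@"):
            raise ValueError(f"expected '@' at start of record: {header!r}")
        if not plus.startswith("+"):
            raise ValueError(f"expected '+' on line 3: {plus!r}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield FastqRecord(header[1:], sequence, quality)


# Made-up record, for illustration only.
example = io.StringIO("@read1 example\nGATTACA\n+\nIIIIIII\n")
for record in read_fastq(example):
    print(record.identifier, record.sequence, record.quality)
```

Files that wrap the sequence or quality over several lines would need a more careful parser, because "@" is also a valid quality character.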

Sequence data format

http://maq.sourceforge.net/fastq.shtml

FASTQ Example

FASTQ example from: Cock et al. (2009). Nuc Acids Res 38:1767-1771.

For analysis, it may be necessary to convert to the Sanger form of FASTQ.

FASTQ: Details

Phred Quality Score   Probability of incorrect base call   Base call accuracy
10                    1 in 10                              90 %
20                    1 in 100                             99 %
30                    1 in 1000                            99.9 %
40                    1 in 10000                           99.99 %
50                    1 in 100000                          99.999 %

Q = -10 log10(P), where Q = Phred quality score and P = base-calling error probability
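The relationship between Q and P in the table can be checked with a few lines of code. This is a sketch of the standard Phred definition, Q = -10 log10(P); the function names are illustrative.

```python
import math


def phred_to_error_probability(q: float) -> float:
    """P = 10^(-Q/10): probability that the base call is wrong."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Q = -10 * log10(P)."""
    return -10 * math.log10(p)


# Reproduce the rows of the table for Q = 10 ... 50.
for q in (10, 20, 30, 40, 50):
    p = phred_to_error_probability(q)
    print(f"Q{q}: 1 in {round(1 / p)} incorrect, accuracy {100 * (1 - p):.3f} %")
```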

Quality scores

!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
|                         |    |        |                              |                     |
33                        59   64       73                             104                   126

S - Sanger         Phred+33,  raw reads typically (0, 40)
X - Solexa         Solexa+64, raw reads typically (-5, 40)
I - Illumina 1.3+  Phred+64,  raw reads typically (0, 40)
J - Illumina 1.5+  Phred+64,  raw reads typically (3, 40), with 0=unused, 1=unused, 2=Read Segment Quality Control Indicator
L - Illumina 1.8+  Phred+33,  raw reads typically (0, 41)

Format/Platform   Quality Score Type   ASCII encoding
Sanger            Phred: 0 to 93       33-126
Solexa            Solexa: -5 to 62     64-126
Illumina 1.3      Phred: 0 to 62       64-126
Illumina 1.5      Phred: 0 to 62       64-126
Illumina 1.8      Phred: 0 to 93       33-126   *** Sanger format!

Quality score encodings differ among the platforms

Most analysis tools require Sanger fastq quality score encoding
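Converting between encodings is a matter of shifting the ASCII offset. The sketch below re-encodes an Illumina 1.3+/1.5+ quality string (Phred+64) in Sanger form (Phred+33). The function name is hypothetical. Solexa+64 strings are deliberately not handled: Solexa scores are not Phred scores and need a numeric conversion, not just an offset change.

```python
def illumina64_to_sanger(quality: str) -> str:
    """Re-encode a Phred+64 quality string as Phred+33 (Sanger)."""
    converted = []
    for ch in quality:
        q = ord(ch) - 64  # decode with the Illumina 1.3+ offset
        if q < 0:
            raise ValueError(
                f"character {ch!r} is below the Phred+64 range; "
                "the input may already be Sanger encoded"
            )
        converted.append(chr(q + 33))  # encode with the Sanger offset
    return "".join(converted)


# 'h' is ASCII 104, i.e. Q40 in Phred+64; Q40 in Phred+33 is 'I' (ASCII 73).
print(illumina64_to_sanger("hhhh"))  # IIII
```

In practice the encoding of a file has to be known or guessed from the observed character range before converting.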

http://main.g2.bx.psu.edu/

SAM (Sequence Alignment/Map)

• SAM is the output of aligners that map reads to a reference genome
  – Tab delimited, with a header section and an alignment section
    • Header lines begin with @ (the header section is optional)
    • Alignment section has 11 mandatory fields
  – BAM is the binary format of SAM

http://samtools.sourceforge.net/

Alignment data format

http://samtools.sourceforge.net/SAM1.pdf

Mandatory Alignment Fields

http://samtools.sourceforge.net/SAM1.pdf
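A minimal sketch of splitting one alignment line into the 11 mandatory fields. The field names follow the current SAM specification (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL); older versions of the specification use different names for some of them. The `SamAlignment` type, the `parse_sam_line` function, and the example line are hypothetical.

```python
from typing import List, NamedTuple


class SamAlignment(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flag
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # CIGAR string
    rnext: str   # reference name of the mate/next read
    pnext: int   # position of the mate/next read
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # Phred+33 encoded base qualities
    optional: List[str]  # any TAG:TYPE:VALUE fields after column 11


def parse_sam_line(line: str) -> SamAlignment:
    """Parse one tab-delimited line from the alignment section."""
    if line.startswith("@"):
        raise ValueError("header line, not an alignment line")
    f = line.rstrip("\n").split("\t")
    if len(f) < 11:
        raise ValueError(f"expected at least 11 fields, found {len(f)}")
    return SamAlignment(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5],
                        f[6], int(f[7]), int(f[8]), f[9], f[10], f[11:])


# Made-up alignment line, for illustration only.
line = "read1\t0\tchr1\t100\t60\t8M2I4M1D3M\t*\t0\t0\tACGTACGTACGTACGTA\tIIIIIIIIIIIIIIIII\tNM:i:3"
aln = parse_sam_line(line)
print(aln.qname, aln.rname, aln.pos, aln.cigar)
```

For real data, an established library that reads both SAM and BAM is usually the better choice; the sketch only shows the column layout.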

Alignment Examples

Alignments in SAM format

CIGAR string -> 8M2I4M1D3M
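The CIGAR string above reads as 8 aligned bases, a 2-base insertion, 4 aligned bases, a 1-base deletion, and 3 aligned bases. A short sketch of decoding it follows; the function names are illustrative, and the operation letters are those defined in the SAM specification.

```python
import re
from typing import List, Tuple

_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

QUERY_CONSUMING = set("MIS=X")      # operations that advance along the read
REFERENCE_CONSUMING = set("MDN=X")  # operations that advance along the reference


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a CIGAR string into (length, operation) pairs."""
    if not re.fullmatch(r"(\d+[MIDNSHP=X])+", cigar):
        raise ValueError(f"not a valid CIGAR string: {cigar!r}")
    return [(int(n), op) for n, op in _CIGAR_OP.findall(cigar)]


def query_length(cigar: str) -> int:
    """Number of read bases covered by the CIGAR string."""
    return sum(n for n, op in parse_cigar(cigar) if op in QUERY_CONSUMING)


def reference_length(cigar: str) -> int:
    """Number of reference bases spanned by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in REFERENCE_CONSUMING)


print(parse_cigar("8M2I4M1D3M"))       # [(8, 'M'), (2, 'I'), (4, 'M'), (1, 'D'), (3, 'M')]
print(query_length("8M2I4M1D3M"))      # 17 = 8 + 2 + 4 + 3
print(reference_length("8M2I4M1D3M"))  # 16 = 8 + 4 + 1 + 3
```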

Annotation Formats

• Mostly tab delimited files that describe the location of genome features (e.g., genes)
• Also used for displaying annotations on standard genome browsers
• Important for associating alignments with specific genome features and descriptions
• Knowing format details can be important to translating results!
  – BED is zero based
  – GTF/GFF are one based

GTF

http://useast.ensembl.org/info/website/upload/gff.html

Annotation data format

chr1   86114265  86116346  nsv433165
chr2   1841774   1846089   nsv433166
chr16  2950446   2955264   nsv433167
chr17  14350387  14351933  nsv433168
chr17  32831694  32832761  nsv433169
chr17  32831694  32832761  nsv433170
chr18  61880550  61881930  nsv433171

chr1  16759829  16778548  chr1:21667704   270866  -
chr1  16763194  16784844  chr1:146691804  407277  +
chr1  16763194  16784844  chr1:144004664  408925  -
chr1  16763194  16779513  chr1:142857141  291416  -
chr1  16763194  16779513  chr1:143522082  293473  -
chr1  16763194  16778548  chr1:146844175  284555  -
chr1  16763194  16778548  chr1:147006260  284948  -
chr1  16763411  16784844  chr1:144747517  405362  +

BED format

Annotation data format

BED: zero based, start inclusive, stop exclusive
Length = stop - start

GTF/GFF: one based, inclusive
Length = stop - start + 1
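The two coordinate conventions differ only in the start position, which is where off-by-one errors creep in when translating results. A minimal sketch of the conversion follows; the function names are illustrative, and the example interval is the first row of the BED data shown earlier.

```python
from typing import Tuple


def bed_to_gff(start: int, end: int) -> Tuple[int, int]:
    """Zero-based, half-open (BED) -> one-based, inclusive (GTF/GFF)."""
    return start + 1, end


def gff_to_bed(start: int, end: int) -> Tuple[int, int]:
    """One-based, inclusive (GTF/GFF) -> zero-based, half-open (BED)."""
    return start - 1, end


def bed_length(start: int, end: int) -> int:
    return end - start        # Length = stop - start


def gff_length(start: int, end: int) -> int:
    return end - start + 1    # Length = stop - start + 1


bed = (86114265, 86116346)            # chr1 interval from the BED rows above
gff = bed_to_gff(*bed)
print(gff)                            # (86114266, 86116346)
print(bed_length(*bed), gff_length(*gff))  # 2081 2081
```

The feature length is the same in both systems; only the way the start is written changes.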

GRCh37

NCBI36
