IMGS 2012 Bioinformatics Workshop: File Formats for Next Gen Sequence Analysis
![Page 1: IMGS 2012 Bioinformatics Workshop: File Formats for Next Gen Sequence Analysis](https://reader036.vdocument.in/reader036/viewer/2022062310/56816571550346895dd806a4/html5/thumbnails/1.jpg)
IMGS 2012 Bioinformatics Workshop:
File Formats for Next Gen Sequence Analysis
![Page 2: IMGS 2012 Bioinformatics Workshop: File Formats for Next Gen Sequence Analysis](https://reader036.vdocument.in/reader036/viewer/2022062310/56816571550346895dd806a4/html5/thumbnails/2.jpg)
Cost / Throughput
[Figure: sequencing throughput in gigabases (axis 0 to 70,000) and cost per Kb (axis $0.00 to $140.00) by year, 1990 to 2009]
Lucinda Fulton, The Genome Center at Washington University
![Page 3: IMGS 2012 Bioinformatics Workshop: File Formats for Next Gen Sequence Analysis](https://reader036.vdocument.in/reader036/viewer/2022062310/56816571550346895dd806a4/html5/thumbnails/3.jpg)
Sequencing Technologies
http://www.geospiza.com/finchtalk/uploaded_images/plates-and-slides-718301.png
![Page 4: IMGS 2012 Bioinformatics Workshop: File Formats for Next Gen Sequence Analysis](https://reader036.vdocument.in/reader036/viewer/2022062310/56816571550346895dd806a4/html5/thumbnails/4.jpg)
Sequence “Space”
• Roche 454 – Flow space
– Measures the pyrophosphate released when a nucleotide is added to a growing DNA chain
– Flow space describes sequence in terms of these base incorporations
– http://www.youtube.com/watch?v=bFNjxKHP8Jc
• AB SOLiD – Color space
– Sequencing by DNA ligation via synthetic DNA molecules that contain two nested known bases with a fluorescent dye
– Each base sequenced twice
– http://www.youtube.com/watch?v=nlvyF8bFDwM&feature=related
• Illumina/Solexa – Base space
– Single base extensions of fluorescent-labeled nucleotides with protected 3' OH groups
– Sequencing via cycles of base addition/detection followed by deprotection of the 3' OH
– http://www.youtube.com/watch?v=77r5p8IBwJk&feature=related
• GenomeTV – Next Generation Sequencing (lecture)
– http://www.youtube.com/watch?v=g0vGrNjpyA8&feature=related
http://finchtalk.geospiza.com/2008/03/color-space-flow-space-sequence-space_23.html
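The color-space idea listed for AB SOLiD can be illustrated with a small sketch. It assumes the commonly described two-base encoding, in which each color (0 to 3) labels a pair of adjacent bases and a known leading base anchors the decoding. The function names and the example read are hypothetical and do not come from the slides.

```python
# 2-bit codes for the bases; under the assumed two-base encoding, the color
# of a dinucleotide is the XOR of the codes of its two bases.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = {code: base for base, code in BASE_CODE.items()}


def encode_color_space(seq):
    """Return the leading base followed by one color digit per adjacent base pair."""
    colors = [BASE_CODE[a] ^ BASE_CODE[b] for a, b in zip(seq, seq[1:])]
    return seq[0] + "".join(str(c) for c in colors)


def decode_color_space(cs):
    """Rebuild the base sequence from a leading base plus color digits."""
    bases = [cs[0]]
    for digit in cs[1:]:
        # XOR is its own inverse, so the previous base and the color give the next base.
        bases.append(CODE_BASE[BASE_CODE[bases[-1]] ^ int(digit)])
    return "".join(bases)


if __name__ == "__main__":
    read = "ATGGCA"  # hypothetical read, for illustration only
    encoded = encode_color_space(read)
    print(encoded)
    print(decode_color_space(encoded))
```

Because every base after the first is inferred from the previous base and a color, one wrong color call changes all downstream bases when decoding naively. This is one reason color-space reads are usually aligned in color space rather than translated first.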
![Page 5: IMGS 2012 Bioinformatics Workshop: File Formats for Next Gen Sequence Analysis](https://reader036.vdocument.in/reader036/viewer/2022062310/56816571550346895dd806a4/html5/thumbnails/5.jpg)
![Page 6: IMGS 2012 Bioinformatics Workshop: File Formats for Next Gen Sequence Analysis](https://reader036.vdocument.in/reader036/viewer/2022062310/56816571550346895dd806a4/html5/thumbnails/6.jpg)
![Page 7: IMGS 2012 Bioinformatics Workshop: File Formats for Next Gen Sequence Analysis](https://reader036.vdocument.in/reader036/viewer/2022062310/56816571550346895dd806a4/html5/thumbnails/7.jpg)
• Flexible
– Good: with rapidly changing data/tech
– Poor: validation
• Human Readable
– Convenient for de-bugging
– Computer doesn’t care!
![Page 8: IMGS 2012 Bioinformatics Workshop: File Formats for Next Gen Sequence Analysis](https://reader036.vdocument.in/reader036/viewer/2022062310/56816571550346895dd806a4/html5/thumbnails/8.jpg)
• Sequences: FASTA, FASTQ, SAM/BAM
• Alignments: SAM/BAM, MAF
• Annotations: BED, GTF, GFF3, GVF, VCF
http://genome.ucsc.edu/FAQ/FAQformat.html
http://www.sequenceontology.org/
![Page 9: IMGS 2012 Bioinformatics Workshop: File Formats for Next Gen Sequence Analysis](https://reader036.vdocument.in/reader036/viewer/2022062310/56816571550346895dd806a4/html5/thumbnails/9.jpg)
FASTQ
FASTA
![Page 10: IMGS 2012 Bioinformatics Workshop: File Formats for Next Gen Sequence Analysis](https://reader036.vdocument.in/reader036/viewer/2022062310/56816571550346895dd806a4/html5/thumbnails/10.jpg)
FASTQ: Data Format
• FASTQ
– Text based
– Encodes sequence calls and quality scores with ASCII characters
– Stores minimal information about the sequence read
– 4 lines per sequence
    • Line 1: begins with @, followed by the sequence identifier and an optional description
    • Line 2: the sequence
    • Line 3: begins with “+”, optionally followed by the sequence identifier and description
    • Line 4: encoding of quality scores for the sequence in line 2
• References/Documentation
– http://maq.sourceforge.net/fastq.shtml
– Cock et al. (2009). Nuc Acids Res 38:1767-1771.
Sequence data format
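The four-line record layout described above can be sketched as a small reader. This is a minimal illustration, assuming unwrapped records (exactly four lines each). The function name `read_fastq` and the example record are hypothetical. For real data an established parser is the safer choice.

```python
import io


def read_fastq(handle):
    """Yield (identifier, sequence, quality) tuples from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of input (a blank line also stops reading in this sketch)
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@"):
            raise ValueError("line 1 of a record must begin with '@'")
        if not plus.startswith("+"):
            raise ValueError("line 3 of a record must begin with '+'")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality strings must have equal length")
        yield header[1:], seq, qual


if __name__ == "__main__":
    # Hypothetical record, for illustration only.
    example = "@read1 example description\nGATTACA\n+\nIIIIIII\n"
    for name, seq, qual in read_fastq(io.StringIO(example)):
        print(name, seq, qual)
```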
![Page 11: IMGS 2012 Bioinformatics Workshop: File Formats for Next Gen Sequence Analysis](https://reader036.vdocument.in/reader036/viewer/2022062310/56816571550346895dd806a4/html5/thumbnails/11.jpg)
FASTQ Example
FASTQ example from: Cock et al. (2009). Nuc Acids Res 38:1767-1771.
For analysis, it may be necessary to convert to the Sanger form of FASTQ.
![Page 12: IMGS 2012 Bioinformatics Workshop: File Formats for Next Gen Sequence Analysis](https://reader036.vdocument.in/reader036/viewer/2022062310/56816571550346895dd806a4/html5/thumbnails/12.jpg)
![Page 13: IMGS 2012 Bioinformatics Workshop: File Formats for Next Gen Sequence Analysis](https://reader036.vdocument.in/reader036/viewer/2022062310/56816571550346895dd806a4/html5/thumbnails/13.jpg)
| Phred Quality Score | Probability of incorrect base call | Base call accuracy |
|---|---|---|
| 10 | 1 in 10 | 90 % |
| 20 | 1 in 100 | 99 % |
| 30 | 1 in 1000 | 99.9 % |
| 40 | 1 in 10000 | 99.99 % |
| 50 | 1 in 100000 | 99.999 % |

Q = Phred Quality Scores
P = Base-calling error probabilities
Quality scores
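The table follows the standard Phred relationship Q = -10 · log10(P), which the extracted slide text does not spell out. A minimal sketch of the conversion in both directions, with hypothetical function names:

```python
import math


def phred_to_error_prob(q):
    """Error probability P for a Phred quality score Q, using P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred quality score Q for an error probability P, using Q = -10 * log10(P)."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    # Reproduce the rows of the table above.
    for q in (10, 20, 30, 40, 50):
        p = phred_to_error_prob(q)
        print(f"Q={q}  P={p:g}  accuracy={(1 - p) * 100:g} %")
```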
![Page 14: IMGS 2012 Bioinformatics Workshop: File Formats for Next Gen Sequence Analysis](https://reader036.vdocument.in/reader036/viewer/2022062310/56816571550346895dd806a4/html5/thumbnails/14.jpg)
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
ASCII codes marked on the slide: 33 (!), 59 (;), 64 (@), 73 (I), 104 (h), 126 (~)

S - Sanger: Phred+33, raw reads typically (0, 40)
X - Solexa: Solexa+64, raw reads typically (-5, 40)
I - Illumina 1.3+: Phred+64, raw reads typically (0, 40)
J - Illumina 1.5+: Phred+64, raw reads typically (3, 40), with 0=unused, 1=unused, 2=Read Segment Quality Control Indicator
L - Illumina 1.8+: Phred+33, raw reads typically (0, 41)

| Format/Platform | Quality score type | ASCII encoding |
|---|---|---|
| Sanger | Phred: 0-93 | 33-126 |
| Solexa | Solexa: -5-62 | 64-126 |
| Illumina 1.3 | Phred: 0-62 | 64-126 |
| Illumina 1.5 | Phred: 0-62 | 64-126 |
| Illumina 1.8 | Phred: 0-62 | 33-126 *** Sanger format! |
Quality score encodings differ among the platforms
Most analysis tools require Sanger fastq quality score encoding
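A conversion from the Phred+64 encoding (Illumina 1.3+/1.5+) to the Sanger Phred+33 encoding amounts to shifting each quality character by a fixed offset. The sketch below assumes the input really is Phred+64. It does not handle Solexa+64 scores, which are on a different scale and need more than an offset shift. Function names are hypothetical.

```python
def decode_qualities(qual, offset=33):
    """Return the numeric quality scores for an ASCII-encoded quality string."""
    return [ord(ch) - offset for ch in qual]


def phred64_to_phred33(qual):
    """Re-encode a Phred+64 quality string as Sanger Phred+33."""
    converted = []
    for ch in qual:
        score = ord(ch) - 64
        if score < 0:
            # Characters below '@' (ASCII 64) cannot occur in a Phred+64 string.
            raise ValueError(f"character {ch!r} is not valid in a Phred+64 string")
        converted.append(chr(score + 33))
    return "".join(converted)


if __name__ == "__main__":
    illumina_qual = "hhhh@"  # hypothetical Phred+64 string: scores 40, 40, 40, 40, 0
    sanger_qual = phred64_to_phred33(illumina_qual)
    print(sanger_qual)
    print(decode_qualities(sanger_qual, offset=33))
```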
![Page 15: IMGS 2012 Bioinformatics Workshop: File Formats for Next Gen Sequence Analysis](https://reader036.vdocument.in/reader036/viewer/2022062310/56816571550346895dd806a4/html5/thumbnails/15.jpg)
http://main.g2.bx.psu.edu/
![Page 16: IMGS 2012 Bioinformatics Workshop: File Formats for Next Gen Sequence Analysis](https://reader036.vdocument.in/reader036/viewer/2022062310/56816571550346895dd806a4/html5/thumbnails/16.jpg)
![Page 17: IMGS 2012 Bioinformatics Workshop: File Formats for Next Gen Sequence Analysis](https://reader036.vdocument.in/reader036/viewer/2022062310/56816571550346895dd806a4/html5/thumbnails/17.jpg)
SAM (Sequence Alignment/Map)
• SAM is the output of aligners that map reads to a reference genome
– Tab delimited, with a header section and an alignment section
    • Header lines begin with @ (the header section is optional)
    • Alignment section has 11 mandatory fields
– BAM is the binary format of SAM
http://samtools.sourceforge.net/
Alignment data format
![Page 18: IMGS 2012 Bioinformatics Workshop: File Formats for Next Gen Sequence Analysis](https://reader036.vdocument.in/reader036/viewer/2022062310/56816571550346895dd806a4/html5/thumbnails/18.jpg)
http://samtools.sourceforge.net/SAM1.pdf
Mandatory Alignment Fields
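The slide's field table did not survive extraction. As a sketch, the 11 mandatory fields can be read from a tab-delimited alignment line as follows. The field names follow the current SAM specification (older versions of the document use different names for fields 7 to 9). The record is an illustrative line modeled on the specification's example, and `parse_sam_line` is a hypothetical helper.

```python
from collections import namedtuple

# The 11 mandatory SAM fields, in column order.
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
SamRecord = namedtuple("SamRecord", SAM_FIELDS + ["tags"])


def parse_sam_line(line):
    """Split one alignment line into its 11 mandatory fields plus optional tags."""
    if line.startswith("@"):
        raise ValueError("header lines begin with '@' and are not alignments")
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 11:
        raise ValueError("an alignment line needs at least 11 tab-delimited fields")
    values = cols[:11]
    for i in (1, 3, 4, 7, 8):  # FLAG, POS, MAPQ, PNEXT, TLEN are integers
        values[i] = int(values[i])
    return SamRecord(*values, tags=cols[11:])


if __name__ == "__main__":
    # Illustrative record, for demonstration only.
    line = "r001\t163\tref\t7\t30\t8M2I4M1D3M\t=\t37\t39\tTTAGATAAAGGATACTG\t*"
    record = parse_sam_line(line)
    print(record.QNAME, record.RNAME, record.POS, record.CIGAR)
```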
![Page 19: IMGS 2012 Bioinformatics Workshop: File Formats for Next Gen Sequence Analysis](https://reader036.vdocument.in/reader036/viewer/2022062310/56816571550346895dd806a4/html5/thumbnails/19.jpg)
http://samtools.sourceforge.net/SAM1.pdf
Alignment Examples
Alignments in SAM format
CIGAR string -> 8M2I4M1D3M
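The CIGAR string above reads as 8 matches/mismatches, a 2-base insertion, 4 matches, a 1-base deletion, and 3 matches. A minimal sketch of parsing it and computing how many read bases and reference bases the alignment covers (the helper names are hypothetical):

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Return a list of (length, operation) tuples for a CIGAR string."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Re-assemble the string to make sure nothing was skipped by the regex.
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return ops


def cigar_lengths(cigar):
    """Return (read bases consumed, reference bases consumed)."""
    ops = parse_cigar(cigar)
    read_len = sum(n for n, op in ops if op in "MIS=X")  # operations that consume the read
    ref_len = sum(n for n, op in ops if op in "MDN=X")   # operations that consume the reference
    return read_len, ref_len


if __name__ == "__main__":
    print(parse_cigar("8M2I4M1D3M"))
    print(cigar_lengths("8M2I4M1D3M"))
```

For `8M2I4M1D3M` the read contributes 8 + 2 + 4 + 3 = 17 bases while the alignment spans 8 + 4 + 1 + 3 = 16 reference positions.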
![Page 20: IMGS 2012 Bioinformatics Workshop: File Formats for Next Gen Sequence Analysis](https://reader036.vdocument.in/reader036/viewer/2022062310/56816571550346895dd806a4/html5/thumbnails/20.jpg)
Annotation Formats
• Mostly tab delimited files that describe the location of genome features (e.g., genes)
• Also used for displaying annotations on standard genome browsers
• Important for associating alignments with specific genome features
• descriptions
• Knowing format details can be important to translating results!
– BED is zero based
– GTF/GFF are one based
![Page 21: IMGS 2012 Bioinformatics Workshop: File Formats for Next Gen Sequence Analysis](https://reader036.vdocument.in/reader036/viewer/2022062310/56816571550346895dd806a4/html5/thumbnails/21.jpg)
GTF
http://useast.ensembl.org/info/website/upload/gff.html
Annotation data format
![Page 22: IMGS 2012 Bioinformatics Workshop: File Formats for Next Gen Sequence Analysis](https://reader036.vdocument.in/reader036/viewer/2022062310/56816571550346895dd806a4/html5/thumbnails/22.jpg)
chr1 86114265 86116346 nsv433165
chr2 1841774 1846089 nsv433166
chr16 2950446 2955264 nsv433167
chr17 14350387 14351933 nsv433168
chr17 32831694 32832761 nsv433169
chr17 32831694 32832761 nsv433170
chr18 61880550 61881930 nsv433171

chr1 16759829 16778548 chr1:21667704 270866 -
chr1 16763194 16784844 chr1:146691804 407277 +
chr1 16763194 16784844 chr1:144004664 408925 -
chr1 16763194 16779513 chr1:142857141 291416 -
chr1 16763194 16779513 chr1:143522082 293473 -
chr1 16763194 16778548 chr1:146844175 284555 -
chr1 16763194 16778548 chr1:147006260 284948 -
chr1 16763411 16784844 chr1:144747517 405362 +

BED format
Annotation data format
![Page 23: IMGS 2012 Bioinformatics Workshop: File Formats for Next Gen Sequence Analysis](https://reader036.vdocument.in/reader036/viewer/2022062310/56816571550346895dd806a4/html5/thumbnails/23.jpg)
BED: zero based, start inclusive, stop exclusive
Length = stop-start
GTF/GFF: one based, start and stop inclusive
Length = stop-start+1
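The two coordinate conventions can be sketched as a pair of conversion helpers. The function names are hypothetical, and the interval used is the first row of the BED example shown earlier in the deck.

```python
def bed_to_gff(start, end):
    """Convert a zero-based, half-open BED interval to one-based, inclusive GTF/GFF."""
    return start + 1, end


def gff_to_bed(start, end):
    """Convert a one-based, inclusive GTF/GFF interval to zero-based, half-open BED."""
    return start - 1, end


def bed_length(start, end):
    """Feature length under BED coordinates: stop - start."""
    return end - start


def gff_length(start, end):
    """Feature length under GTF/GFF coordinates: stop - start + 1."""
    return end - start + 1


if __name__ == "__main__":
    bed_start, bed_end = 86114265, 86116346  # chr1 interval from the BED example
    gff_start, gff_end = bed_to_gff(bed_start, bed_end)
    print(gff_start, gff_end)
    # Both conventions describe the same feature, so the lengths agree.
    print(bed_length(bed_start, bed_end), gff_length(gff_start, gff_end))
```

Only the start coordinate changes between the two formats. Forgetting the shift produces off-by-one errors when intersecting alignments with annotations.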
![Page 24: IMGS 2012 Bioinformatics Workshop: File Formats for Next Gen Sequence Analysis](https://reader036.vdocument.in/reader036/viewer/2022062310/56816571550346895dd806a4/html5/thumbnails/24.jpg)
GRCh37
NCBI36