Lecture 7 - Pennsylvania State University · The structure of the FASTQ file ...
2012 - BMMB 597D: Analyzing Next Generation Sequencing Data

Week 4, Lecture 7
István Albert

Biochemistry and Molecular Biology
and Bioinformatics Consulting Center

Penn State
Sequencing Technologies - perspective
1st generation: Frederick Sanger develops DNA sequencing technology. Latest versions: 3 million bases/day, 1500 bp long reads
2nd generation: (next-gen) sequencing started in 2005 with the release of the 454 sequencing platform. 600 billion bases/week, 150 bp long reads

3rd generation: single molecule (no DNA amplification required). These are not replacing but augmenting 2nd generation systems: longer reads, shorter turnarounds

NGS sequencing read formats
Reads: short sequences produced by the instrument

Illumina → FastQ format (.fastq or .fq)
SOLiD → colorspace fasta (.xsq or .csfasta + .qual)
454 → standard flowgram format (.sff)
Encoding
One character → one byte of space
ABCa = 4 bytes long
65 66 67 97 = 11 bytes long
Good: the space savings are about 3 times
Bad: not readable, hinders understanding, different decoding options
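The byte counts on this slide can be checked with a short Python sketch. It assumes plain ASCII text and writes the numeric form as space-separated decimal codes; the variable names are made up for illustration.

```python
# Compare two ways of storing the same four values.
text = "ABCa"

# One character per value: each ASCII character occupies one byte.
codes = [ord(ch) for ch in text]

# The same values written out as space-separated decimal numbers.
as_numbers = " ".join(str(code) for code in codes)

compact_size = len(text.encode("ascii"))        # characters only
verbose_size = len(as_numbers.encode("ascii"))  # digits plus separators

print(codes)         # [65, 66, 67, 97]
print(compact_size)  # 4
print(verbose_size)  # 11
```

The ratio 11/4 is where the "about 3 times" saving comes from. The cost is that the compact form cannot be read without knowing how it was encoded.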
Quality Scores
• A quality score is a number that usually has limits, from a low (say 0) to a high (say 40)
• A quality score represents an error probability.
• It characterizes a single step of the process and NOT the entire experimental procedure
• Quality scores are used to represent base calling accuracy, alignment accuracy and other probabilities
Remapping an encoding
• Only some types of characters can be printed.
• So the encoding must start at a character that can be printed, but we also want that value to be the low end of the scale = 0

• Say character “A” has a code of 65. If we choose “A” as the minimum of our scale then “A” stands for 0, “B” (code 66) for 1, and so on
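The remapping idea can be sketched in a few lines of Python. Using "A" as the base is only the illustrative choice from this slide, not a real sequencing encoding, and the helper names `encode` and `decode` are hypothetical.

```python
BASE = ord("A")  # 65, chosen here as the low end of the scale (value 0)

def encode(values):
    """Map small non-negative integers to printable characters starting at 'A'."""
    return "".join(chr(BASE + value) for value in values)

def decode(symbols):
    """Recover the integers by subtracting the base character code."""
    return [ord(ch) - BASE for ch in symbols]

print(encode([0, 1, 2, 25]))  # ABCZ
print(decode("ABCZ"))         # [0, 1, 2, 25]
```

Real encodings work the same way and differ only in which character is picked as the base.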
PHRED Quality Scores
For a quality score Q the error probability is

P = 10^{-Q/10}
Examples:

Q = 10 → P = 10^{-1} = 1/10 = 0.1 = 10%
Q = 40 → P = 10^{-4} = 1/10000 = 0.0001 = 0.01%

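The Phred relation between Q and P can be tried out with a minimal Python sketch. The function names are illustrative, and the inverse formula Q = -10 log10(P) follows from rearranging the definition.

```python
import math

def phred_to_prob(q):
    """Error probability for a Phred quality score: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def prob_to_phred(p):
    """Inverse relation: Q = -10 * log10(P)."""
    return -10 * math.log10(p)

for q in (10, 20, 30, 40):
    print(q, phred_to_prob(q))
```

Every 10 points of quality correspond to a tenfold drop in error probability, so Q = 20 is a 1% error and Q = 30 is a 0.1% error.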
There are multiple encodings
• Illumina used to switch around the encoding every once in a while.
• Finally they settled on the Sanger encoding with Phred quality representation.
• There are datasets/tools out there with different encodings!
Sanger Encoding (+33)
• Quality values range between 0 and 93
• Start the scale at character 33
• End the scale at character 33 + 93 = 126
(currently most instruments only produce qualities in the range 0 to 40)
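As a small illustration of the +33 offset, the Python sketch below converts between quality strings and lists of scores. The function names are made up for this example, and the range check uses the 0 to 93 limits from this slide.

```python
SANGER_OFFSET = 33  # character code that represents quality 0

def decode_sanger(quality_string):
    """Turn a Sanger-encoded quality string into a list of integer scores."""
    return [ord(ch) - SANGER_OFFSET for ch in quality_string]

def encode_sanger(scores):
    """Turn integer scores (0 to 93) into a Sanger-encoded quality string."""
    for score in scores:
        if not 0 <= score <= 93:
            raise ValueError(f"score {score} is outside the Sanger range 0-93")
    return "".join(chr(score + SANGER_OFFSET) for score in scores)

print(decode_sanger("!5I~"))          # [0, 20, 40, 93]
print(encode_sanger([0, 20, 40, 93]))  # !5I~
```

The character "!" (code 33) marks the lowest quality and "~" (code 126) the highest. "I" (code 73) corresponds to Q = 40, the top of the range most instruments produce.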
Illumina 1.3 encoding (+64) (obsolete but still often observed in the wild)
• Quality range between 0 and 62
• Start scale at character 64
• End scale at character 64 + 62 = 126
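Because the two encodings differ only in their offset, an Illumina 1.3 quality string can be re-encoded as Sanger by shifting every character down by 64 - 33 = 31. The sketch below is a hedged illustration with hypothetical names, not the method of any particular tool. It rejects characters below code 64, since those cannot occur in the +64 encoding.

```python
SANGER_OFFSET = 33
ILLUMINA13_OFFSET = 64

def illumina13_to_sanger(quality_string):
    """Re-encode a +64 quality string as a +33 (Sanger) quality string."""
    converted = []
    for ch in quality_string:
        score = ord(ch) - ILLUMINA13_OFFSET
        if score < 0:
            # A character below code 64 suggests the input is not +64 encoded.
            raise ValueError(f"character {ch!r} is not valid in the +64 encoding")
        converted.append(chr(score + SANGER_OFFSET))
    return "".join(converted)

print(illumina13_to_sanger("@Th"))  # !5I
```

Here "@", "T" and "h" stand for Q = 0, 20 and 40 in the +64 encoding, and the output holds the same scores in the +33 encoding. Decoding a file with the wrong offset shifts every score by 31, so it is worth confirming the encoding before any quality filtering.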
FASTQ format
The first column indicates the record type
De facto standard for processing sequencing reads. Download the lecture-6.zip data
The structure of the FASTQ file
Four lines per FASTQ record
1. @ indicates the sequence id (above is longer than the sequence itself!)
2. the sequence content of the read
3. + optionally repeat the sequence id (often left empty)
4. quality string
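The four-line layout can be read with a minimal Python sketch. It assumes every record occupies exactly four lines (no wrapped sequences); the function name `read_fastq` and the example record are made up for illustration.

```python
import io

def read_fastq(handle):
    """Yield (sequence_id, sequence, quality) tuples from a four-line FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        header = header.rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence and quality lengths differ in {header!r}")
        yield header[1:], sequence, quality

# A made-up record, used only to exercise the parser.
example = io.StringIO("@read1 made-up example\nGATTACA\n+\nIIII5!!\n")
records = list(read_fastq(example))
print(records)  # [('read1 made-up example', 'GATTACA', 'IIII5!!')]
```

Checking that the sequence and quality lines have equal length is a cheap way to catch truncated or corrupted files early.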
Paper: The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants - Nucl. Acids Res. (2010) 38 (6): 1767-1771.
Other formats
• Some instruments generate files in different formats. Occasionally two files:
1. A sequence file in FASTA format
2. A FASTA-like quality file that lists numerical qualities
Convert them to FASTQ
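One possible way to do this conversion is sketched below in Python. It assumes that both files list the same records in the same order, that the quality file holds space-separated integers, and that the output should use the Sanger (+33) offset. All function names and the example data are hypothetical.

```python
import io

def read_fasta_like(handle):
    """Yield (name, body_lines) for each '>' record in a FASTA-style file."""
    name, lines = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, lines
            name, lines = line[1:], []
        else:
            lines.append(line)
    if name is not None:
        yield name, lines

def fasta_qual_to_fastq(fasta_handle, qual_handle, offset=33):
    """Pair up sequence and quality records and yield FASTQ-formatted text."""
    pairs = zip(read_fasta_like(fasta_handle), read_fasta_like(qual_handle))
    for (name, seq_lines), (qual_name, qual_lines) in pairs:
        if name != qual_name:
            raise ValueError(f"record names differ: {name!r} vs {qual_name!r}")
        sequence = "".join(seq_lines)
        scores = [int(token) for token in " ".join(qual_lines).split()]
        if len(scores) != len(sequence):
            raise ValueError(f"length mismatch in record {name!r}")
        quality = "".join(chr(score + offset) for score in scores)
        yield f"@{name}\n{sequence}\n+\n{quality}\n"

# Made-up input data, used only to show the conversion.
fasta = io.StringIO(">read1\nGATT\n")
qual = io.StringIO(">read1\n40 40 20 0\n")
fastq_text = "".join(fasta_qual_to_fastq(fasta, qual))
print(fastq_text)
```

For the example input this prints a single record with sequence GATT and quality string II5!.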