Cleaning Illumina reads
Torsten Seemann
VLSCI :: LSCC - Lab Meeting - Fri 23 Nov 2012
Motivation
Illumina reads
● Usually 100 or 150 bp○ 250bp rolling out now, 400bp next year
● Indel errors rare● Homopolymer errors rare● Substitution errors < 1%
○ Error rate higher at 3' end● Adaptor issues
○ rare in HiSeq (TruSeq prep)○ more common in MiSeq (Nextera prep)
● Very high quality overall
Illumina libraries● "single end"
○ just a shotgun read sequenced from one end
● "paired end"○~500bp fragments sequenced at both ends○very reliable
● "mate pair"○circularized 2-10 kbp fragments sequencing○ then "paired end" protocol○ reliability varies
● garbage reads○ instrument weirdness
● duplicate reads○ low complexity library, PCR duplicate
● adaptor read-through○ fragment too short
● indel errors○ skipping bases, inserting extra bases
● uncalled base○ couldn't reliably estimate, replace with "N"
● substitution errors○ reading wrong base
Sequences have errors
More common
Less common
Why clean reads?
● Erroneous data may cause software to:○ run more slowly○ use more RAM○ produce poor / biased / incorrect results
● Cleaning can:○ improve overall average quality of the reads
■ hopefully giving a reliable result○ reduce the volume of reads
■ some algorithms are O(N.logN) or O(N2)■ enable processing when otherwise couldn't
● (some software does handle them appropriately)
The Q in FASTQ
DNA sequence quality
● DNA sequences often have a quality value associated with each nucleotide
● A measure of reliability for each base○ as it is derived from physical process
■ chromatogram (Sanger sequencing)■ pH reading (Ion Torrent sequencing)
● Formalised by the Phred software for the Human Genome Project
Phred qualities
● Q = -10 log10 P <=> P = 10- Q / 10
○ Q = Phred quality score○ P = probability of base call being incorrect
Quality value Chance it is wrong Accuracy
10 1 in 10 90%
20 1 in 100 99%
30 1 in 1000 99.9%
40 1 in 10,000 99.99%
50 1 in 100,000 99.999%
Illumina quality plot
Y-axis is "Phred" quality values (higher is better)
FASTQ format
Anatomy of a FASTQ entry
@read00179AGTCTGATATGCTGTACCTATTATAATTCTAGGCGCTCAT
GCCCGCGGATATCGTAGCTATATGCTTCA
+
8;ACCCD?DD???@B9<9<CAC@=AAAA8A;B<A@882,+
495;;3990,02..-&-&-*,,,,(0**#
Start symbol Sequence ID Sequence
Separator line
Encoded quality values, one symbol per nucleotide
FASTQ quality encoding
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJ| | | | |
Q0 Q10 Q20 Q30 Q40
bad maybe ok good excellent
Uses letters/symbols to represent numbers:
Mnemonic:"swear" words!
Mnemonic:Q20 = '2' and '0'
Mnemonic:'HI' = high Q
Spoiled for choiceSolexa !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefgh ..........................XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX | | | | | 33 59 64 73 104
Illumina 1.3 !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefgh ...............................IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII | | | | | 33 59 64 73 104
Illumina 1.5 !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefgh .................................JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ | | | | | 33 59 64 73 104
Illumina 1.8 (Sanger) !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefgh SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS............................... | | | | | 33 59 64 73 104
What can we clean?
Ambiguous bases
● If there is ambiguity in the base call, an "N" is used
@ILLUMINA:6:1:964:115#GATCAG/1 GGACCTGAGAGTGTGCATGAAGAGGGCAGCGCGCACNGCATCA + HHHGFGEEECDEBA@BBBA<=<;:98743720&,+**_%$#"!
● Possible software responses:○ Crash!○ Ignore it○ Silently convert to fixed or random base (Velvet > 'A')○ Handle it appropriately (Bowtie2)
● Small proportion overall, safer to discard
Homopolymers
● A read consisting of all the same base
@ILLUMINA:6:1:964:115#GATCAG/1 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA + HHHGFGEEECDEBA@BBBA<=<;:98743720&,+**_%$#"!
● Often occur from clusters at edge of flowcell lane● Early Illumina software called '?' as 'A' rather than N● Unlikely to be present in real DNA● Best to discard
● Less common with newer Illumina OLB software
Quality trimming
● Remove low quality sequence○ Q=13 corresponds to 5% error (p=0.05)○ Q=0..13 encoded by: !"#@%&'()*+,-/
@ILLUMINA:6:1:9646:1115#GATCAG/1 GGACCTGAGAGTGTGCATGAAGAGGGCAGCCCCGCACTGCATG + HHHGFGEEECDEBA@BBBA<=<;:98743720&,+**_%$#"!
● Can trim per○ each base○ window moving average eg. 3 base mean○ minimum % good per window eg. need 4 of 5
Illumina Adaptors
● Used in the sequencing chemistry● Can appear at ends of read sequences● Worse for mate-pair than for paired-end reads
● PCR PrimerCAAGCAGAAGACGGCATACGAGCTCTTCCGATCT
● Genomic DNA Sequencing Primer CACTCTTTCCCTACACGACGCTCTTCCGATCT
● All other TruSeq & Nextera AdaptorsNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
Adaptor clipping
● Method○ Align 3' and 5' read end against all adaptor sequences○ If there is an anchored "match", trim the read
● Minimum length of match?○ want to remove adaptor, but not real sequence [10 bp]
● Allow substitutions in match?○ as reads have errors, need some tolerance [1 sub]
● Allow gaps/indels in match?○ indels are unlikely in Illumina reads [no]
● Slow to perform compared to other preprocessing steps
Decloning
● Illumina "mate pair" sequencing○ Requires a lot of starting DNA○ Challenging protocol to implement reliably○ Not enough final DNA leads to PCR duplicates○ Coverage is highly non-uniform and sporadic○ Causes bias in analyses
● Decloning○ Replace clones with a single representative○ Choose representative with highest quality○ Helps salvage usable information content
Read length
● Enforce a minimum read length L
● k-mer based tools○ break reads into k-mers, so L < k is pointless○ Velvet, Trinity, Gossamer, ...
● Uniqueness○ desire reasonable uniqueness of sequence,
otherwise multiple mapping will take forever!○ L=24 is bare minimum (I reckon)○ BWA, Bowtie, Shrimp, ...
Other strategies
●Digital normalisation○ remove low frequency k-mers○ remove reads w/ too many low freq k-mers
●Error correction○ replace low frequency k-mers with their
"closest" high-frequency k-mer○other methods I don't understand yet
●Just use local alignment to "solve" the problem
Examples
Walk-through1.Original read + quality =43bp
GTTAGCGCGCTGACCATGATTCAAGGAACTGGCCCCATTNATAhhhhghfeefaa^a^[[[^X[[XX^^^^`SSTQPZZBBBBBBB
2.Homopolymer? NoGTTAGCGCGCTGACCATGATTCAAGGAACTGGCCCCATTNATA
3.Ambiguious N bases? Yes, 1GTTAGCGCGCTGACCATGATTCAAGGAACTGGCCCCATTNATA
4.Quality < 20 ? Yes, at 3' endGTTAGCGCGCTGACCATGATTCAAGGAACTGGCCCCATTNATA
5.Adaptor sequences > 8bp ? Yes, 9 bp at 5' endGTTAGCGCGCTGACCATGATTCAAGGAACTGGCCCCATTNATA
6.Combine all masks Logical intersectionGTTAGCGCGCTGACCATGATTCAAGGAACTGGCCCCATTNATA
7.Extract longest sub-sequence =19bp TGACCATGATTCAAGGAAC
The coral genome● Raw data (A.millepora Illumina)
○ 9 libraries - 3 x PE, 6 x MP - 200bp to 10kbp○ 92.0 Gbp, 943M reads, average length 98bp
● Method○ Decloned all MP libs, disallow Ns, reject homopolymers,
trim Q < 20 + clip adaptors, minimum length 55bp● Cleaned data
○ 42.5 Gbp, 478M reads, average length 88bp
● Effect○ Good - de novo Velvet assembly improved overall
■ validated by RNA-Seq, ESTs○ Bad - lower coverage
■ coral is very heterozygous > 10%
Per library yields (Gbp)
●Library Raw Cleaned %Kept●pe_193 9.55 6.71 70●pe_463 19.19 13.89 72●pe_580 4.87 3.18 65●mp_2200 18.48 8.48 46●mp_2820 13.54 1.54 11 *●mp_4628 12.95 0.85 6 *●mp_5000 6.92 2.98 43●mp_8000 4.33 1.64 38●mp_10000 2.15 0.25 11 *●single n/a 3.00 n/a●TOTAL 92.00 42.50
The GKP FFPE BWA PE mystery
●FFPE sample○ formalin-fixed, paraffin embedded○ long term tissue archival storage method
●Sequencing○HiSeq-2000○22 million x 100bp paired-end reads○quality looks good ... but not mapping!
FastQC for Read 1
FastQC is giving us some hints...
nesoni clip:
Nesoni
● Implemented primarily by Paul Harrison @ VBC○swiss army knife○snp calling, phylogenetics, DGE, ....○extendible pipeline system (Python)
●We used the "nesoni clip:" module○adaptor clipping = on○min quality = Q10○allow Ns = no○min length = 2
Nesoni command linenesoni clip: --quality 10 --length 24 --out-separate yes SM_AdMP1_ID_07B26948_L004_CLIPPED pairs: SM_AdMP1_ID_07B26948_L004_R1.fastq.gz SM_AdMP1_ID_07B26948_L004_R2.fastq.gz
SM_AdMP1_ID_07B26948_L004_CLIPPED_R1.fq.gz
SM_AdMP1_ID_07B26948_L004_CLIPPED_R2.fq.gz
SM_AdMP1_ID_07B26948_L004_CLIPPED_single.fq.gz
Orphans
Nesoni clip: R1 @ 5' end(> SM_AdMP1_ID_07B26948_L004_CLIPPED 161,553 read-1 adaptors clipped at start 10 bases: 11757 avg errors: 0.95 3948xPE_Read_2_Sequencing_Primer 1260xMultiplexing_Adapters_1 ... 11 bases: 3212 avg errors: 0.74 758xMultiplexing_PCR_Primer_2.0 269xPE_Read_2_Sequencing_Primer ... 12 bases: 1739 avg errors: 0.62 545xTruSeq_Adapter_Index_1 453xMultiplexing_PCR_Primer_2.0 ... 13 bases: 3054 avg errors: 0.38 2377xTruSeq_Universal_Adapter 429xMultiplexing_PCR_Primer_2.0 ... 14 bases: 760 avg errors: 0.34 357xMultiplexing_PCR_Primer_2.0 163xMultiplexing_Adapters_1 ... 15 bases: 2062 avg errors: 0.13 1558xMultiplexing_PCR_Primer_2.0 261xMultiplexing_Adapters_1 ... 16 bases: 967 avg errors: 0.15 575xMultiplexing_PCR_Primer_2.0 267xMultiplexing_Adapters_1 ... 17 bases: 1271 avg errors: 0.15 603xMultiplexing_PCR_Primer_2.0 419xMultiplexing_Adapters_1 ... 18 bases: 2026 avg errors: 0.23 1095xMultiplexing_PCR_Primer_2.0 483xMultiplexing_Adapters_1 ... 19 bases: 2494 avg errors: 0.07 1956xMultiplexing_PCR_Primer_2.0 422xMultiplexing_Adapters_1 ... 20 bases: 2791 avg errors: 0.07 1481xMultiplexing_Adapters_1 1286xMultiplexing_PCR_Primer_2.0 ... 21 bases: 3946 avg errors: 0.05 3814xMultiplexing_PCR_Primer_2.0 123x3p_RNA_Adapter ... 22 bases: 2238 avg errors: 0.16 1742xMultiplexing_PCR_Primer_2.0 318x3p_RNA_Adapter ... 23 bases: 4207 avg errors: 0.02 4162xMultiplexing_PCR_Primer_2.0 43xTruSeq_Adapter_Index_1 ... 24 bases: 2127 avg errors: 0.03 2086xMultiplexing_PCR_Primer_2.0 33xTruSeq_Adapter_Index_1 ... 25 bases: 1630 avg errors: 0.04 1612xMultiplexing_PCR_Primer_2.0 18xTruSeq_Adapter_Index_4 26 bases: 4031 avg errors: 0.02 4000xMultiplexing_PCR_Primer_2.0 28xTruSeq_Adapter_Index_4 ... 27 bases: 1736 avg errors: 0.04 1626xMultiplexing_PCR_Primer_2.0 110xTruSeq_Adapter_Index_4 28 bases: 3140 avg errors: 0.04 3059xMultiplexing_PCR_Primer_2.0 80xTruSeq_Adapter_Index_4 ... 29 bases: 3277 avg errors: 0.02 3228xMultiplexing_PCR_Primer_2.0 49xTruSeq_Adapter_Index_4 30 bases: 4578 avg errors: 0.03 4542xMultiplexing_PCR_Primer_2.0 36xTruSeq_Adapter_Index_4 31 bases: 4459 avg errors: 0.04 4309xMultiplexing_PCR_Primer_2.0 148xTruSeq_Adapter_Index_4 ... 32 bases: 2753 avg errors: 0.07 2577xMultiplexing_PCR_Primer_2.0 174xTruSeq_Adapter_Index_4 ... 33 bases: 9794 avg errors: 0.05 9255xMultiplexing_PCR_Primer_2.0 536xTruSeq_Adapter_Index_4 ... 34 bases: 63909 avg errors: 0.03 63779xMultiplexing_PCR_Primer_2.0 127xTruSeq_Adapter_Index_4 ...<snip>
Nesoni clip: R1 @ 3' end(> SM_AdMP1_ID_07B26948_L004_CLIPPED 9,934,544 read-1 adaptors clipped at end 10 bases: 243222 avg errors: 0.04 224121xTruSeq_Universal_Adapter 11677xPCR_Primer_Index_1 ... 11 bases: 209946 avg errors: 0.02 198663xTruSeq_Universal_Adapter 7666xPCR_Primer_Index_1 ... 12 bases: 287146 avg errors: 0.02 277756xTruSeq_Universal_Adapter 6368xPCR_Primer_Index_1 ... 13 bases: 233067 avg errors: 0.02 222693xTruSeq_Universal_Adapter 7252xPCR_Primer_Index_1 ... 14 bases: 305000 avg errors: 0.02 295013xMultiplexing_PCR_Primer_2.0 7554xPCR_Primer_Index_3 ... 15 bases: 338908 avg errors: 0.02 328225xMultiplexing_PCR_Primer_2.0 8308xPCR_Primer_Index_4 ... 16 bases: 183745 avg errors: 0.02 174275xMultiplexing_PCR_Primer_2.0 7233xPCR_Primer_Index_4 ... 17 bases: 169356 avg errors: 0.02 160221xMultiplexing_PCR_Primer_2.0 6860xPCR_Primer_Index_4 ... 18 bases: 209386 avg errors: 0.02 197727xMultiplexing_PCR_Primer_2.0 7498xPCR_Primer_Index_4 ... 19 bases: 171360 avg errors: 0.02 157072xMultiplexing_PCR_Primer_2.0 8753xPCR_Primer_Index_4 ... 20 bases: 190699 avg errors: 0.02 169311xMultiplexing_PCR_Primer_2.0 18186xPCR_Primer_Index_4 ... 21 bases: 195454 avg errors: 0.02 169850xMultiplexing_PCR_Primer_2.0 22795xPCR_Primer_Index_4 ... 22 bases: 252124 avg errors: 0.21 194698xMultiplexing_PCR_Primer_2.0 47983x3p_RNA_Adapter ... 23 bases: 194228 avg errors: 0.02 183170xMultiplexing_PCR_Primer_2.0 5698xv1.5_Small_RNA_3p_Adapter ... 24 bases: 189333 avg errors: 0.02 178797xMultiplexing_PCR_Primer_2.0 7050xPCR_Primer_Index_4 ... 25 bases: 199061 avg errors: 0.02 190980xMultiplexing_PCR_Primer_2.0 7932xPCR_Primer_Index_4 ... 26 bases: 241032 avg errors: 0.02 235358xMultiplexing_PCR_Primer_2.0 5504xPCR_Primer_Index_4 ... 27 bases: 212056 avg errors: 0.02 206622xMultiplexing_PCR_Primer_2.0 5226xPCR_Primer_Index_4 ... 28 bases: 210022 avg errors: 0.03 203762xMultiplexing_PCR_Primer_2.0 6081xPCR_Primer_Index_4 ... 29 bases: 188424 avg errors: 0.02 183273xMultiplexing_PCR_Primer_2.0 4954xPCR_Primer_Index_4 ... 30 bases: 344312 avg errors: 0.02 339452xMultiplexing_PCR_Primer_2.0 4700xPCR_Primer_Index_4 ... 31 bases: 193023 avg errors: 0.02 187439xMultiplexing_PCR_Primer_2.0 5426xPCR_Primer_Index_4 ... 32 bases: 187459 avg errors: 0.03 180062xMultiplexing_PCR_Primer_2.0 7275xPCR_Primer_Index_4 ... 33 bases: 184063 avg errors: 0.02 180029xMultiplexing_PCR_Primer_2.0 3864xPCR_Primer_Index_4 ... 34 bases: 379388 avg errors: 0.02 209558xTruSeq_Adapter_Index_3 165637xMultiplexing_PCR_Primer_2.0 ... 35 bases: 217332 avg errors: 0.02 212927xTruSeq_Adapter_Index_4 3930xPCR_Primer_Index_4 ... 36 bases: 171253 avg errors: 0.02 166773xTruSeq_Adapter_Index_4 4119xPCR_Primer_Index_4 ... 37 bases: 170924 avg errors: 0.02 165925xTruSeq_Adapter_Index_4 4781xPCR_Primer_Index_4 ... 38 bases: 192662 avg errors: 0.03 187054xTruSeq_Adapter_Index_4 5435xPCR_Primer_Index_4 ... 39 bases: 219074 avg errors: 0.03 215639xTruSeq_Adapter_Index_4 3308xPCR_Primer_Index_4 ... 40 bases: 324519 avg errors: 0.03 321459xTruSeq_Adapter_Index_4 2845xPCR_Primer_Index_4 ... 41 bases: 425086 avg errors: 0.03 422160xTruSeq_Adapter_Index_4 2732xPCR_Primer_Index_4 ... <snip>
Nesoni clip: R2 @ 5' end(> SM_AdMP1_ID_07B26948_L004_CLIPPED 142,354 read-2 adaptors clipped at start 10 bases: 12898 avg errors: 0.97 3080xPE_Read_2_Sequencing_Primer 1837x3p_RNA_Adapter ... 11 bases: 4329 avg errors: 0.94 1123xTruSeq_Adapter_Index_1 940xTruSeq_Universal_Adapter ... 12 bases: 1813 avg errors: 0.42 1045xOligonucleotide_sequences_for_Genomic_DNA_Adapters_2 158xTruSeq_Universal_Adapter 13 bases: 3153 avg errors: 0.11 2431xOligonucleotide_sequences_for_Genomic_DNA_Adapters_2 569xTruSeq_Universal_Adapte 14 bases: 739 avg errors: 0.25 403xOligonucleotide_sequences_for_Genomic_DNA_Adapters_2 213xMultiplexing_PCR_Primer 15 bases: 3652 avg errors: 0.09 2645xOligonucleotide_sequences_for_Genomic_DNA_Adapters_2 861xMultiplexing_PCR_Primer 16 bases: 465 avg errors: 0.20 295xOligonucleotide_sequences_for_Genomic_DNA_Adapters_2 82xMultiplexing_PCR_Primer_ 17 bases: 436 avg errors: 0.21 235xOligonucleotide_sequences_for_Genomic_DNA_Adapters_2 118xMultiplexing_PCR_Primer 18 bases: 2689 avg errors: 0.15 2280xOligonucleotide_sequences_for_Genomic_DNA_Adapters_2 237xMultiplexing_PCR_ 19 bases: 584 avg errors: 0.13 402xOligonucleotide_sequences_for_Genomic_DNA_Adapters_2 124xTruSeq_Universal_Adapter 20 bases: 982 avg errors: 0.06 716xOligonucleotide_sequences_for_Genomic_DNA_Adapters_2 245xTruSeq_Universal_Adapter 21 bases: 1033 avg errors: 0.06 633xOligonucleotide_sequences_for_Genomic_DNA_Adapters_2 375xTruSeq_Universal_Adapter. 22 bases: 1530 avg errors: 0.04 1215xOligonucleotide_sequences_for_Genomic_DNA_Adapters_2 312xTruSeq_Universal_Adaptr 23 bases: 1112 avg errors: 0.05 752xOligonucleotide_sequences_for_Genomic_DNA_Adapters_2 348xTruSeq_Universal_Adapter 24 bases: 501 avg errors: 0.04 302xOligonucleotide_sequences_for_Genomic_DNA_Adapters_2 198xTruSeq_Universal_Adapter 25 bases: 3437 avg errors: 0.02 2140xTruSeq_Universal_Adapter 1297xOligonucleotide_sequences_for_Genomic_DNA_Adapte 26 bases: 1473 avg errors: 0.04 856xOligonucleotide_sequences_for_Genomic_DNA_Adapters_2 617xTruSeq_Universal_Adapter 27 bases: 12247 avg errors: 0.01 11399xTruSeq_Universal_Adapter 48xOligonucleotide_sequences_for_Genomic_DNA_Adapters_2 28 bases: 1592 avg errors: 0.04 1282xOligonucleotide_sequences_for_Genomic_DNA_Adapters_2 310xTruSeq_Universal_Adapter 29 bases: 3129 avg errors: 0.03 1833xOligonucleotide_sequences_for_Genomic_DNA_Adapters_21296xTruSeq_Universal_Adap 30 bases: 2842 avg errors: 0.04 2584xOligonucleotide_sequences_for_Genomic_DNA_Adapters_2 258xTruSeq_Universal_Adapter 31 bases: 1533 avg errors: 0.04 1082xOligonucleotide_sequences_for_Genomic_DNA_Adapters_2 448xTruSeq_Universal_Adapt 32 bases: 2822 avg errors: 0.04 2398xOligonucleotide_sequences_for_Genomic_DNA_Adapters_2 419xTruSeq_Universal_Adapt 33 bases: 12425 avg errors: 0.03 10171xOligonucleotide_sequences_for_Genomic_DNA_Adapters_2254xTruSeq_Universal_Adapter 34 bases: 277 avg errors: 0.03 277xTruSeq_Universal_Adapter 35 bases: 191 avg errors: 0.04 191xTruSeq_Universal_Adapter 36 bases: 347 avg errors: 0.04 347xTruSeq_Universal_Adapter 37 bases: 1696 avg errors: 0.03 1696xTruSeq_Universal_Adapter 38 bases: 5037 avg errors: 0.02 5037xTruSeq_Universal_Adapter 39 bases: 441 avg errors: 0.05 441xTruSeq_Universal_Adapter 40 bases: 4079 avg errors: 0.02 4079xTruSeq_Universal_Adapter
Nesoni clip: R2 @ 3' end(> SM_AdMP1_ID_07B26948_L004_CLIPPED 9,331,287 read-2 adaptors clipped at end 10 bases: 204601 avg errors: 0.03 200835xTruSeq_Universal_Adapter 556xOligonucleotide_sequences_for_G 11 bases: 200731 avg errors: 0.02 199519xTruSeq_Universal_Adapter 414xOligonucleotide_sequences_for_G 12 bases: 212465 avg errors: 0.01 212195xTruSeq_Universal_Adapter 45xTruSeq_Adapter_Index_1 ... 13 bases: 227861 avg errors: 0.01 227757xTruSeq_Universal_Adapter 31xPE_Adapters_1 ... 14 bases: 313533 avg errors: 0.01 312402xTruSeq_Universal_Adapter 830xPCR_Primers_2 ... 15 bases: 334268 avg errors: 0.01 332651xTruSeq_Universal_Adapter 1178xPCR_Primers_2 ... 16 bases: 188961 avg errors: 0.02 187641xTruSeq_Universal_Adapter 1134xPCR_Primers_2 ... 17 bases: 236185 avg errors: 0.01 235350xTruSeq_Universal_Adapter 692xPCR_Primers_2 ... 18 bases: 173398 avg errors: 0.02 173005xTruSeq_Universal_Adapter 342xPCR_Primers_2 ... 19 bases: 227216 avg errors: 0.02 226929xTruSeq_Universal_Adapter 283xPCR_Primers_2 ... 20 bases: 195904 avg errors: 0.01 195904xTruSeq_Universal_Adapter 21 bases: 171553 avg errors: 0.01 171553xTruSeq_Universal_Adapter 22 bases: 176894 avg errors: 0.01 176894xTruSeq_Universal_Adapter 23 bases: 192393 avg errors: 0.02 192393xTruSeq_Universal_Adapter 24 bases: 291821 avg errors: 0.02 291821xTruSeq_Universal_Adapter 25 bases: 201236 avg errors: 0.02 201236xTruSeq_Universal_Adapter 26 bases: 251535 avg errors: 0.02 251535xTruSeq_Universal_Adapter 27 bases: 215207 avg errors: 0.02 215207xTruSeq_Universal_Adapter 28 bases: 216714 avg errors: 0.02 216714xTruSeq_Universal_Adapter 29 bases: 203192 avg errors: 0.03 203192xTruSeq_Universal_Adapter 30 bases: 820871 avg errors: 0.02 820871xTruSeq_Universal_Adapter 31 bases: 158860 avg errors: 0.02 158860xTruSeq_Universal_Adapter 32 bases: 191083 avg errors: 0.02 191083xTruSeq_Universal_Adapter 33 bases: 159133 avg errors: 0.02 159133xTruSeq_Universal_Adapter 34 bases: 146912 avg errors: 0.02 146912xTruSeq_Universal_Adapter<snip>
Nesoni clip: summarynesoni 0.92
FASTQ offset seems to be 33
(> SM_AdMP1_ID_07B26948_L004_CLIPPED 22,611,994 read-pairs
(> SM_AdMP1_ID_07B26948_L004_CLIPPED 121,236 read-1 too short after quality clip(> SM_AdMP1_ID_07B26948_L004_CLIPPED 483,729 read-1 too short after adaptor clip(> SM_AdMP1_ID_07B26948_L004_CLIPPED 22,007,029 read-1 kept(> SM_AdMP1_ID_07B26948_L004_CLIPPED 100.000 read-1 average input length(> SM_AdMP1_ID_07B26948_L004_CLIPPED 75.426 read-1 average output length
(> SM_AdMP1_ID_07B26948_L004_CLIPPED 673,175 read-2 too short after quality clip(> SM_AdMP1_ID_07B26948_L004_CLIPPED 419,343 read-2 too short after adaptor clip(> SM_AdMP1_ID_07B26948_L004_CLIPPED 21,519,476 read-2 kept(> SM_AdMP1_ID_07B26948_L004_CLIPPED 100.000 read-2 average input length(> SM_AdMP1_ID_07B26948_L004_CLIPPED 77.201 read-2 average output length
(> SM_AdMP1_ID_07B26948_L004_CLIPPED 21,147,301 pairs kept after clipping(> SM_AdMP1_ID_07B26948_L004_CLIPPED 1,231,903 reads kept after clipping
started 21 November 2012 11:02 AMfinished 21 November 2012 11:43 AMrun time 0:40:06
Summary
GARBAGE IN,GARBAGE OUT