
Page 1: GBS Project Analysis Report - n-genetics.com · Click to Top 3 Mapping Statistics The effective sequencing data was aligned with the reference sequence through BWA[1] software (parameters:

GBS Project Analysis Report

30-Jul-2017

Library Preparation and Sequencing
    DNA Qualification
    Library Construction
    Library Quality Control
    High-throughput DNA Sequencing
Bioinformatics Analysis Procedures
Results of Analyses
    Raw Data
    Quality Control of Sequencing Data
        Sequencing Quality Distribution
        Distribution of Sequencing Errors
        Sequencing Data Filtration
        Statistics of Sequencing Data
        Sequencing Evaluation Summary
        Sequencing Evaluation Q&A
    Mapping Statistics
        Statistics of Reference Genome
        Mapping Statistics with Reference Genome
        Mapping Summary
        Mapping Q&A
    SNP Detection & Annotation
        Statistics of SNP Detection & Annotation
        SNP Quality Distribution
        SNP Mutation Frequency
        SNP Detection & Annotation Q&A
References
Appendix
    List of Result Folders
    List of Software
    Methods
    Technical Instructions


Ⅰ Library Preparation and Sequencing

Throughout the whole sequencing process, from the DNA sample to the final data, each step, including sample testing, library preparation, and sequencing, influences the quality of the data produced, and data quality in turn directly affects the analysis results. To guarantee the accuracy and reliability of the sequencing data, our pipeline uses stringent quality control (QC) procedures and adheres to a high standard at every step. The workflow is as follows:

Figure 1.1 Workflow of library preparation

1 DNA Qualification

Our pipeline utilizes three major QC methods for DNA sample qualification:

(1) Agarose gel electrophoresis analysis for DNA purity and integrity;

(2) NanoDrop® 2000 spectrophotometer measurement for DNA purity by assessing the OD260/OD280 ratio;

(3) Qubit® 2.0 fluorometer quantitation for accurate measurement of DNA concentration.

Sample DNA with an OD260/OD280 ratio of 1.8 to 2.0 and a total amount of more than 1.5 μg was qualified for library construction.

2 Library Construction

The genomic DNA of each sample was digested with restriction enzymes selected on the basis of the in silico evaluation results. The resulting fragments were ligated with two barcoded adapters, each carrying a sticky end compatible with the primary digestion enzyme and the Illumina P5 or P7 universal sequence. After several rounds of PCR amplification, all samples were pooled and size-selected for the required fragments to complete library construction. The experimental procedures are as follows:

(1) In silico and experimental digestion evaluation: For genome-sequenced species and those with closely related sequenced species, the appropriate genome assembly was subjected to in silico digestion analysis to help optimize the enzyme set and the fragment size, taking into account data production, genome coverage and evenness, avoidance of repeated regions, and the enzymatic features. Combining the experimental digestion assay with the computational approach ensures highly reliable and reproducible data production across a broad spectrum of species;

(2) Restriction enzyme digestion: 0.3~0.6 μg of genomic DNA was digested completely with the optimized restriction enzyme set in order to obtain a suitable marker density;

(3) Ligation of P1 and P2 adapters: the two ends of each digested fragment were ligated with the P1 and P2 barcoded adapters, respectively (with sticky ends complementary to the digested DNA);

(4) PCR enrichment and fragment selection: tags containing both P1 and P2 adapters were amplified by PCR. After the DNA fragments of different samples were pooled, the desired fragments were recovered by gel electrophoresis;

(5) High-throughput DNA sequencing: after cluster preparation, high-throughput DNA sequencing was performed on the Illumina HiSeq platform.

The experimental procedures of DNA library preparation are shown in Figure 1.2.


Figure 1.2 Experimental procedures of library preparation

3 Library Quality Control

To check the prepared DNA libraries, a Qubit® 2.0 fluorometer was first used to determine the concentration of each library. After dilution to 1 ng/μl, an Agilent® 2100 bioanalyzer was used to assess the insert size. Finally, quantitative real-time PCR (qPCR) was performed to determine the effective concentration of each library. A library with an appropriate insert size and an effective concentration of more than 2 nM is qualified and ready for Illumina® high-throughput sequencing.

4 High-throughput DNA Sequencing

The qualified DNA libraries were pooled according to their effective concentrations and the expected data production. Paired-end sequencing was performed on the Illumina® HiSeq platform, with a read length of 144 bp at each end.


Ⅱ Bioinformatics Analysis Procedures

The bioinformatic analysis procedures are as follows:

(1) Quality control of raw sequencing data for clean data filtration;

(2) Mapping clean reads to reference genome;

(3) SNP and InDel detection and annotation according to the reference genome mapping results.


Ⅲ Results of Analyses

1 Raw Data

The original sequencing data acquired by high-throughput sequencing platforms (e.g. Illumina HiSeq™ 2500/MiSeq™) are recorded in image files and are first transformed into sequence reads by base calling with the CASAVA software. The sequences and the corresponding sequencing quality information are stored in a FASTQ file.

Every read in FASTQ format is stored in four lines as follows:

@K00124:82:H2MH5BBXX:1:1101:31389:1158 2:N:0:0

TAGCCACATAGAAACCAACAGCCATATAACTGGTAGCTTTAAGCGGCTCACCTTTAGCATCAACAGGCCACAACCAACCAGAACGTGAAAAAGCGTCCTGCGTGTAGCGAACTGCGATGGGCATACAGATCGGAAGAGCGTCGTGTAGGG

+

AAFFFKKKKKKKKFKKKFFKKAAFKKKKKFKKKKFKKA,FKKKKKKKKKAKKFKKKKKKKAKKKKKKFFKKKKF<FFKKKKKKKKKKKKKFKKFKKF7FFFFFKFKKKFKKKKKKKKF<FFKKKKFKKKKKFKFKFKKFK<<F,A7,AFK

Line 1 begins with an '@' character, followed by the Illumina sequence identifier and an optional description (like a FASTA title line).

Line 2 is the sequence of the read.

Line 3 begins with a '+' character and is optionally followed by the same sequence identifier and description.

Line 4 encodes the quality values for the sequence in Line 2 and must contain the same number of characters as there are bases in the sequence. The per-base sequencing quality score is the ASCII value of each character in Line 4 minus the constant 33.

The fields of an Illumina sequence identifier, illustrated with the identifier @EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG, are as follows:

Field     Description
EAS139    Unique instrument name
136       Run ID
FC706VJ   Flowcell ID
2         Flowcell lane
2104      Tile number within the flowcell lane
15343     'x'-coordinate of the cluster within the tile
197393    'y'-coordinate of the cluster within the tile
1         Member of a pair, 1 or 2 (paired-end or mate-pair reads only)
Y         Y if the read fails filter (read is bad), N otherwise
18        0 when none of the control bits are on, otherwise it is an even number
ATCACG    Index sequence
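The record layout described above can be sketched in a few lines of Python. This is a minimal illustration rather than the pipeline's own code: the function names and dictionary keys are made up for the example, and it assumes a Casava 1.8-style identifier with exactly seven colon-separated fields before the space and four after it.

```python
from typing import Dict, List

PHRED_OFFSET = 33  # constant subtracted from the ASCII value, as stated in the text


def parse_identifier(header: str) -> Dict[str, str]:
    """Split a Casava 1.8-style '@' line into named fields (names are illustrative)."""
    left, right = header.lstrip("@").split(" ", 1)
    instrument, run_id, flowcell, lane, tile, x, y = left.split(":")
    pair_member, fails_filter, control_bits, index = right.split(":")
    return {
        "instrument": instrument,
        "run_id": run_id,
        "flowcell": flowcell,
        "lane": lane,
        "tile": tile,
        "x": x,
        "y": y,
        "pair_member": pair_member,
        "fails_filter": fails_filter,
        "control_bits": control_bits,
        "index": index,
    }


def decode_qualities(quality_line: str) -> List[int]:
    """Convert Line 4 of a FASTQ record into per-base Phred scores."""
    return [ord(ch) - PHRED_OFFSET for ch in quality_line]


fields = parse_identifier("@K00124:82:H2MH5BBXX:1:1101:31389:1158 2:N:0:0")
print(fields["flowcell"], fields["pair_member"])
print(decode_qualities("AF,K7<"))
```

Applied to the example read shown earlier, the parser reports flowcell H2MH5BBXX and read 2 of the pair, and the characters 'A', 'F', ',', 'K', '7' and '<' decode to Phred scores 32, 37, 11, 42, 22 and 27.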


2 Quality Control of Sequencing Data

2.1 Sequencing Quality Distribution

If the sequencing error rate is represented by e and the Illumina HiSeq™ 2500/MiSeq™ sequencing quality by Qphred, the quality score of a base (Phred score) is calculated by the following equation: Qphred = -10 × log10(e). The correspondence between Illumina sequencing quality and Phred score in base calling by CASAVA version 1.8 is as follows:

Phred score  Error rate  Correct rate  Q-score
10           1/10        90%           Q10
20           1/100       99%           Q20
30           1/1000      99.9%         Q30
40           1/10000     99.99%        Q40
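The relationship Qphred = -10 × log10(e) and its inverse can be checked numerically. The following is a small sketch with illustrative function names; it reproduces the rows of the table above.

```python
import math


def error_to_phred(error_rate: float) -> float:
    """Qphred = -10 * log10(e)."""
    return -10.0 * math.log10(error_rate)


def phred_to_error(phred: float) -> float:
    """Inverse relation: e = 10 ** (-Qphred / 10)."""
    return 10.0 ** (-phred / 10.0)


for q in (10, 20, 30, 40):
    e = phred_to_error(q)
    # Print the Phred score, the error rate, and the correct rate in percent.
    print(q, e, round(100.0 * (1.0 - e), 2))
```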

For next-generation sequencing (NGS), the sequencing platform, chemical reactants, and sample quality can influence sequencing quality and base error rate. Sequencing quality distribution is examined over the full length of all sequences to detect any sites (base positions) with an unusually low sequencing quality, where incorrect bases may be incorporated at abnormally high levels. For the detailed sequencing quality distribution, please refer to Figure 2.1.

Figure 2.1 Distribution of sequencing quality

The x-axis shows the base position within a sequencing read, and the y-axis shows the average Phred score of all reads at each position. (Paired-end sequencing data are plotted together, with the first 144 bp representing read 1 and the following 144 bp read 2.)


2.2 Distribution of Sequencing Errors

The sequencing error rate is related to the base quality of the obtained sequence. The sequencing platform, chemical reactants, and sample quality can all influence the sequencing error rate and hence the base quality. For next-generation sequencing (NGS) with a sequencing-by-synthesis strategy, the distribution of the sequencing error rate shows two common features:

(1) The error rate increases along the length of the sequencing read, owing to the consumption of chemical reagents, damage to the DNA template by laser irradiation, and possible accumulation of errors over the sequencing cycles. All Illumina high-throughput sequencing platforms show this feature.

(2) The sequencing error rate is higher for the first several bases than at other positions, which is likely the result of reading errors during the first few cycles after calibration of the optical instruments.

Sequencing error rate distribution is examined over the full length of all sequences to detect any sites (base positions) with an unusually high error rate, where incorrect bases may be incorporated at abnormally high levels. For the detailed sequencing error distribution, please refer to Figure 2.2.

Figure 2.2 Distribution of sequencing errors

The x-axis shows the base position within a sequencing read, and the y-axis shows the average error rate of all reads at each position. (Paired-end sequencing data are plotted together, with the first 144 bp representing read 1 and the following 144 bp read 2.)


2.3 Sequencing Data Filtration

Raw data obtained from sequencing contain adapter contamination and low-quality reads. These sequencing artifacts may complicate downstream analyses, so we use quality control steps to remove them. Consequently, all downstream analyses are based on the clean reads. The quality control steps are as follows:

(1) Discard the paired reads when either read contains adapter contamination;

(2) Discard the paired reads when uncertain nucleotides (N) constitute more than 10 percent of either read;

(3) Discard the paired reads when low-quality nucleotides (base quality of 5 or lower, Q ≤ 5) constitute more than 50 percent of either read.
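The three filtering rules can be expressed as a small predicate over a read pair. This is a hedged sketch, not the pipeline's implementation: adapter detection is simplified to an exact substring match, the adapter sequence is supplied by the caller, low quality is read as Q ≤ 5 with a Phred+33 encoding, and all function names are illustrative.

```python
from typing import Iterable, Tuple

Read = Tuple[str, str]  # (sequence, quality string in Phred+33 encoding)


def n_fraction(seq: str) -> float:
    """Fraction of uncertain nucleotides (N) in a read."""
    return seq.upper().count("N") / len(seq) if seq else 0.0


def low_quality_fraction(qual: str, threshold: int = 5, offset: int = 33) -> float:
    """Fraction of bases whose Phred score is at or below the threshold."""
    if not qual:
        return 0.0
    low = sum(1 for ch in qual if ord(ch) - offset <= threshold)
    return low / len(qual)


def keep_pair(read1: Read, read2: Read, adapters: Iterable[str] = ()) -> bool:
    """Return True if the pair passes all three rules; both reads are checked."""
    adapters = tuple(adapters)
    for seq, qual in (read1, read2):
        if any(adapter in seq for adapter in adapters):  # rule (1), simplified
            return False
        if n_fraction(seq) > 0.10:                        # rule (2)
            return False
        if low_quality_fraction(qual) > 0.50:             # rule (3)
            return False
    return True


good = ("ACGTACGTAC", "IIIIIIIIII")
n_rich = ("ACGTNNACGT", "IIIIIIIIII")
print(keep_pair(good, good), keep_pair(good, n_rich))
```

Because the rules apply to "either read", a single failing mate is enough to discard the whole pair, which keeps the two output files synchronized.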

Figure 2.3 Classification of the sequenced reads

(1) Adapter related: The proportion of reads filtered out for containing adapters, relative to total reads.
(2) Containing N: The proportion of reads filtered out for containing more than 10% Ns, relative to total reads.
(3) Low quality: The proportion of reads filtered out for low quality, relative to total reads.
(4) Clean reads: The proportion of clean reads in raw reads.


2.4 Statistics of Sequencing Data

Statistics of sequencing data are listed in Table 2.4.

Table 2.4 Statistics of Sequencing Data

Sample  Raw Base(bp)  Clean Base(bp)  Effective Rate(%)  Error Rate(%)  Q20(%)  Q30(%)  GC Content(%)
S5      611,505,504   611,412,768     99.98              0.04           94.41   86.44   42.57
S10     817,272,000   817,245,792     100.00             0.04           94.57   86.86   42.93
S9      584,451,360   584,445,024     100.00             0.04           93.77   85.14   42.16
S15     766,285,344   766,275,264     100.00             0.03           95.87   89.73   41.98
S11     619,743,168   615,223,008     99.27              0.04           93.84   85.00   42.80
S8      641,968,992   641,954,880     100.00             0.04           94.43   86.60   42.88
S6      534,930,624   534,843,648     99.98              0.04           94.66   87.16   42.32
S3      569,230,272   569,142,720     99.98              0.04           94.01   85.60   42.55
S16     711,508,032   711,508,032     100.00             0.04           95.06   87.75   41.98
S2      674,776,224   674,680,032     99.99              0.04           94.28   86.21   42.71
S14     838,484,352   838,473,984     100.00             0.03           95.88   89.64   42.19
S4      700,677,504   700,560,576     99.98              0.04           94.48   86.66   42.86
S13     600,174,720   600,166,368     100.00             0.04           94.28   86.24   42.65
S1      620,999,424   620,889,120     99.98              0.04           94.72   87.25   42.50
S7      696,225,600   696,204,000     100.00             0.04           94.59   86.98   42.62
S12     661,790,592   661,773,024     100.00             0.04           94.94   87.94   42.48

The details for the sequencing data statistics are as follows:

(1) Sample: Sample name.
(2) Raw Base(bp): The output of raw data, calculated from the number and length of sequences (in bp).
(3) Clean Base(bp): The valid data output (in bp) after filtering out low-quality reads, calculated from the number and length of sequences in the clean data.
(4) Effective Rate(%): The ratio of clean data to raw data.
(5) Error Rate(%): Overall base error rate.
(6) Q20 and Q30(%): The percentage of bases with a Phred score higher than 20 and 30, respectively, in total bases.
(7) GC Content(%): The percentage of G and C in total bases.
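The column definitions above translate directly into simple ratios. The sketch below uses illustrative function names; the Q-score helper follows the wording "higher than" literally (a strict comparison) and assumes Phred+33 quality strings, both of which are assumptions of the example rather than documented pipeline behaviour.

```python
from typing import Iterable


def effective_rate(raw_bases: int, clean_bases: int) -> float:
    """Effective Rate(%): clean data as a percentage of raw data."""
    return 100.0 * clean_bases / raw_bases


def gc_content(sequences: Iterable[str]) -> float:
    """GC Content(%): percentage of G and C among all bases."""
    total = gc = 0
    for seq in sequences:
        seq = seq.upper()
        total += len(seq)
        gc += seq.count("G") + seq.count("C")
    return 100.0 * gc / total


def q_percent(quality_lines: Iterable[str], cutoff: int, offset: int = 33) -> float:
    """Percentage of bases with a Phred score strictly above the cutoff."""
    total = above = 0
    for line in quality_lines:
        total += len(line)
        above += sum(1 for ch in line if ord(ch) - offset > cutoff)
    return 100.0 * above / total


# Sample S11 from Table 2.4: 615,223,008 clean bases out of 619,743,168 raw bases.
print(round(effective_rate(619_743_168, 615_223_008), 2))
```

For S11 the computed effective rate rounds to 99.27, the value listed in Table 2.4.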

2.5 Sequencing Evaluation Summary

In total, 10.65 G of raw data were generated for the 16 samples in this run, with 10.645 G of clean data remaining after low-quality data were filtered out. The raw data production per sample ranged from 534.931 M to 838.484 M, indicating a sufficient amount of data. As Q20 and Q30 were at least 93.77% and 85.00%, respectively, in every sample, the sequencing quality meets the requirements for analysis. The GC content of 41.98% to 42.93% is also within the normal range, fulfilling the quality standard.

In conclusion, the library construction and sequencing procedures are successful and highly reliable.


2.6 Sequencing Evaluation Q&A

Q.: As the sequencing error increases with the read length, what is the acceptable range of sequencing error rate?

A.: Currently for Illumina sequencers, the per base error rate is generally lower than 1%, and the highest acceptable threshold is 6%.

Q.: What are the data filtering criteria at your company? Are they strictly adhered to?

A.: Our pipeline uses stringent data quality control procedures and adheres to a high standard to guarantee the accuracy and reliability of the sequencing data. The detailed data quality control criteria are as follows:

(1) Discard the adapter-containing reads;

(2) Discard the paired reads when uncertain nucleotides (N) constitute more than 10 percent of either read;

(3) Discard the paired reads when low-quality nucleotides (base quality of 5 or lower, Q ≤ 5) constitute more than 50 percent of either read.

List of Related Terms:

adapter: the oligonucleotides ligated to sample DNA during library preparation, which allow attachment to the flow cell via base pairing during DNA sequencing.

index: the unique sequence tag that distinguishes each individual sample in a multiplexed run.

Q20, Q30: the proportion of bases with a Phred score higher than 20 or 30. The Phred score, which is negatively correlated with the probability of incorrect base calling, is calculated from the sequencing error rate (e) by the equation Qphred = -10 × log10(e) and indicates the sequencing quality.

raw data/raw reads: the original sequence data output by the instrument for a specific sample.

clean data/clean reads: the data remaining after quality control filtering of the raw data, which are used for further analysis.


3 Mapping Statistics

The effective sequencing data were aligned to the reference sequence with the BWA[1] software (parameters: mem -t 4 -k 32 -M), and the mapping rate and coverage were calculated from the alignment results (see Table 3.2). The BAM files were handled with SAMtools[2].
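For readers who want to reproduce the alignment step, the stated parameters correspond to a command line like the one assembled below. This is only a sketch: the file names are hypothetical placeholders, the reference is assumed to have been indexed with `bwa index` beforehand, and the command is built and printed rather than executed. How the report's pipeline converts and sorts the output with SAMtools is not described in the report, so it is not shown.

```python
import shlex
from typing import List


def bwa_mem_command(reference: str, read1: str, read2: str,
                    threads: int = 4, min_seed_length: int = 32) -> List[str]:
    """Build the argument list for 'bwa mem -t 4 -k 32 -M' on one read pair."""
    return [
        "bwa", "mem",
        "-t", str(threads),          # number of threads
        "-k", str(min_seed_length),  # minimum seed length
        "-M",                        # mark shorter split hits as secondary
        reference, read1, read2,
    ]


# Hypothetical file names, used only for illustration.
cmd = bwa_mem_command("reference.fa", "S1_1.clean.fq.gz", "S1_2.clean.fq.gz")
print(shlex.join(cmd))  # bwa mem writes SAM to standard output when run
```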

3.1 Statistics of Reference Genome

The reference genome was downloaded from: https://www.ncbi.nlm.nih.gov/

Table 3.1 Statistics of Reference Genome

Seq number  Total length   GC content(%)  Gap rate(%)  N50 length  N90 length
78595       2,365,766,571  44.19          8.14         7,072,151   482,319

The details of the reference genome statistics are as follows:

(1) Seq number: the total number of assembled genomic sequences.
(2) Total length: the total length of the assembled genomic sequence.
(3) GC content: the GC content of the reference genome.
(4) Gap rate: the proportion of unknown sequence (N) in the reference genome assembly.
(5) N50 length: the scaffold N50, i.e. the length such that scaffolds of this length or longer contain at least 50% of the total assembly length.
(6) N90 length: the scaffold N90, i.e. the length such that scaffolds of this length or longer contain at least 90% of the total assembly length.
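N50 and N90 follow the usual assembly-statistics definition: sort the scaffolds from longest to shortest and report the length at which the running total first reaches the given fraction of the assembly. A minimal sketch with an illustrative function name and toy scaffold lengths:

```python
from typing import Sequence


def nxx(lengths: Sequence[int], fraction: float) -> int:
    """Length L such that scaffolds of length >= L hold at least `fraction` of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= fraction * total:
            return length
    raise ValueError("lengths must be a non-empty sequence of positive integers")


toy_scaffolds = [80, 70, 50, 40, 30, 20]  # toy data, total length 290
print(nxx(toy_scaffolds, 0.5), nxx(toy_scaffolds, 0.9))
```

For the toy assembly, the two longest scaffolds already cover 150 of 290 bases, so N50 is 70; reaching 90% requires the five longest scaffolds, so N90 is 30.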


3.2 Mapping Statistics with Reference Genome and Tag Summary

The mapping rates of the samples reflect the similarity between each sample and the reference genome. The depth and coverage are indicators of the evenness of sequencing and of homology with the reference genome. With GBS sequencing, tag-related statistics are also calculated.

Table 3.2 Statistics of mapping rate, depth and coverage, as well as tag-related statistics

Sample  Mapped reads  Total reads  Tag number  Tag4 number  Mapping rate(%)  Average depth(X)  Coverage at least 1X(%)  Coverage at least 4X(%)
S5      3962428       3982154      686597      331328       99.50            6.98              3.71                     2.16
S10     5324735       5351994      814515      376426       99.49            8.12              4.28                     2.43
S9      3618747       3638908      678602      309679       99.45            6.44              3.67                     2.03
S15     4823289       4844132      725619      346378       99.57            8.10              3.89                     2.25
S11     3923265       3943704      743164      327057       99.48            6.44              3.98                     2.14
S8      4161110       4180770      755638      336912       99.53            6.73              4.04                     2.20
S6      3373273       3389334      643331      302637       99.53            6.23              3.53                     1.98
S3      3564725       3581556      671801      313629       99.53            6.38              3.65                     2.05
S16     4449757       4471858      710697      336032       99.51            7.63              3.82                     2.19
S2      4329139       4349140      725479      343200       99.54            7.29              3.88                     2.23
S14     5330895       5354688      778982      363809       99.56            8.38              4.15                     2.36
S4      4515071       4542800      736476      353067       99.39            7.51              3.93                     2.29
S13     3815975       3836220      700853      318408       99.47            6.57              3.79                     2.08
S1      3896875       3914874      687083      327298       99.54            6.83              3.73                     2.13
S7      4342978       4363758      743134      341066       99.52            7.12              3.99                     2.22
S12     4136392       4159116      727466      331351       99.45            6.90              3.92                     2.16

The details for mapping statistics are as follows:

(1) Sample: Sample name.
(2) Mapped reads: The number of clean reads mapped to the reference assembly, including both single-end reads and reads in pairs.
(3) Total reads: Total number of effective reads in the clean data.
(4) Tag number: The number of unique tags (enzyme-cut fragments).
(5) Tag4 number: Number of tags with depth larger than 4.
(6) Mapping rate: The ratio of reads mapped to the reference genome assembly to the total sequenced clean reads.
(7) Average depth: The average depth of mapped reads at each site, calculated as the total number of bases in the mapped reads divided by the size of the covered genome.
(8) Coverage at least 1X: The percentage of the assembled genome covered by at least one read at each site.
(9) Coverage at least 4X: The percentage of the assembled genome with ≥4X coverage at each site.
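The mapping rate and coverage definitions can be illustrated as follows. The per-site depth list is a toy input invented for the example (a real pipeline would obtain depths from the BAM file), and the function names are illustrative.

```python
from typing import Sequence


def mapping_rate(mapped_reads: int, total_reads: int) -> float:
    """Mapping rate(%): mapped clean reads as a percentage of all clean reads."""
    return 100.0 * mapped_reads / total_reads


def coverage_at_least(depths: Sequence[int], min_depth: int) -> float:
    """Percentage of genome sites whose depth is at least `min_depth`."""
    covered = sum(1 for depth in depths if depth >= min_depth)
    return 100.0 * covered / len(depths)


# Sample S5 from Table 3.2.
print(round(mapping_rate(3_962_428, 3_982_154), 2))

# Toy depth profile over eight sites; GBS covers only a small part of the genome.
toy_depths = [0, 1, 4, 5, 0, 0, 2, 0]
print(coverage_at_least(toy_depths, 1), coverage_at_least(toy_depths, 4))
```

The low 1X and 4X coverage percentages in Table 3.2 are expected for GBS, since only fragments flanking restriction sites are sequenced.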

3.3 Mapping Summary

For the current 2,365,766,571 bp reference genome, the mapping rate of the samples ranges from 99.39% to 99.57%. The average depth on the reference genome (excluding Ns) ranges from 6.23X to 8.38X, and the coverage at 1X or more is at least 3.53% in every sample. These results are within the normal range, and the data can be used for the subsequent variant detection and related analyses.


3.4 Enzymatic Digestion Summary

Among the pairs of clean reads, those containing the exact conserved sequence of the first restriction enzyme at the start of both Read1 and Read2 are considered successfully enzyme-captured reads (enzymeCatch in Table 3.4), while those containing no recognition sequence of the primary or additional restriction enzyme(s) are considered completely cut reads. In this project, the ratio of enzyme-captured reads ranges from 98.9% to 99.4%, while the enzyme-digestion ratio ranges from 90.4% to 95.0%. All related statistics are shown in Table 3.4.

Table 3.4 Enzymatic Digestion Summary

sample  total_PE_cleanReads  total_PE_enzymeCatchReads  total_PE_enzymeCutCompletelyReads  enzymeCatchRatio(%)  enzymeCutCompletelyRatio(%)
S1      2,155,865            2,138,326                  1,957,437                          99.2                 91.5
S10     2,837,659            2,817,058                  2,675,997                          99.3                 95.0
S11     2,136,191            2,122,083                  1,971,852                          99.3                 92.9
S12     2,297,823            2,280,725                  2,079,558                          99.3                 91.2
S13     2,083,911            2,068,059                  1,918,110                          99.2                 92.7
S14     2,911,368            2,888,840                  2,677,344                          99.2                 92.7
S15     2,660,678            2,640,409                  2,422,066                          99.2                 91.7
S16     2,470,514            2,449,873                  2,235,929                          99.2                 91.3
S2      2,342,639            2,317,304                  2,174,570                          98.9                 93.8
S3      1,976,190            1,958,898                  1,790,778                          99.1                 91.4
S4      2,432,502            2,412,364                  2,271,400                          99.2                 94.2
S5      2,122,961            2,102,526                  1,991,077                          99.0                 94.7
S6      1,857,096            1,840,492                  1,694,667                          99.1                 92.1
S7      2,417,375            2,403,882                  2,181,879                          99.4                 90.8
S8      2,229,010            2,209,581                  2,090,385                          99.1                 94.6
S9      2,029,323            2,012,657                  1,819,454                          99.2                 90.4

The details of the Enzymatic Digestion Summary are as follows:

(1) sample: Sample name;
(2) total_PE_cleanReads: Total number of paired-end clean reads;
(3) total_PE_enzymeCatchReads: Number of enzyme-captured reads;
(4) total_PE_enzymeCutCompletelyReads: Number of completely cut reads;
(5) enzymeCatchRatio(%): Ratio of enzyme-captured reads to total clean reads;
(6) enzymeCutCompletelyRatio(%): Ratio of completely cut reads to enzyme-captured reads (total_PE_enzymeCutCompletelyReads / total_PE_enzymeCatchReads, as the values in Table 3.4 show).
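The two ratios in Table 3.4 can be recomputed from the read counts. In this sketch the denominator of the cut-completely ratio is the number of enzymeCatch reads; that choice is an inference from the tabulated values (for S1, 1,957,437 / 2,138,326 ≈ 91.5%, whereas dividing by the 2,155,865 clean reads would give about 90.8%), not something the report states explicitly. Function names are illustrative.

```python
def enzyme_catch_ratio(clean_pairs: int, catch_pairs: int) -> float:
    """enzymeCatchRatio(%): enzymeCatch read pairs relative to all clean read pairs."""
    return 100.0 * catch_pairs / clean_pairs


def cut_completely_ratio(catch_pairs: int, cut_pairs: int) -> float:
    """enzymeCutCompletelyRatio(%), using enzymeCatch pairs as the (inferred) denominator."""
    return 100.0 * cut_pairs / catch_pairs


# Sample S1 from Table 3.4.
clean, catch, cut = 2_155_865, 2_138_326, 1_957_437
print(round(enzyme_catch_ratio(clean, catch), 1))
print(round(cut_completely_ratio(catch, cut), 1))
```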


3.5 Insert Size Summary

Table 3.5 Insert Size Summary

sample  tagNum  average depth  tagNum_4  peak insert size (bp)  insert size (±13 bp)  ratio(%)
S10     814515  20.327         376,426   278.00                 265_291               0.005
S11     743164  15.029         327,057   278.00                 265_291               0.005
S12     727466  16.293         331,351   278.00                 265_291               0.005
S13     700853  15.071         318,408   278.00                 265_291               0.005
S14     778982  20.963         363,809   278.00                 265_291               0.005
S15     725619  19.393         346,378   278.00                 265_291               0.005
S16     710697  17.840         336,032   278.00                 265_291               0.005
S1      687083  15.251         327,298   278.00                 265_291               0.005
S2      725479  16.815         343,200   278.00                 265_291               0.005
S3      671801  13.926         313,629   278.00                 265_291               0.005
S4      736476  17.313         353,067   278.00                 265_291               0.005
S5      686597  15.427         331,328   278.00                 265_291               0.005
S6      643331  13.397         302,637   278.00                 265_291               0.005
S7      743134  16.938         341,066   278.00                 265_291               0.005
S8      755638  15.945         336,912   278.00                 265_291               0.005
S9      678602  14.376         309,679   278.00                 265_291               0.005

The details of insert size summary are as follows:

(1) sample: Sample name;
(2) tagNum: Total tag number;
(3) average depth: Average depth for tags within the specified range (265~291), showing only the 150000 tags with the highest depths;
(4) tagNum_4: Number of tags with at least 4X coverage;
(5) peak insert size: Expected insert size;
(6) insert size: Range of insert size;
(7) ratio (%): Ratio of tags whose length is within the range specified above to total tags.


3.6 Mapping Q&A

Q.: Which files are required for reference genome mapping?

A.: For whole genome sequencing (WGS), only the corresponding reference genome file in FASTA format is required, while whole exome sequencing (WES) also needs the target region file in BED format.

Q.: What are the potential causes of a low mapping rate?

A.: The three major causes are: (1) a poorly assembled reference genome or a relatively distant genetic relationship between the reference and the sample; (2) treatments of the DNA that may change the sequence (e.g. bisulfite treatment) and thereby affect the mapping rate; and (3) contamination by other species.

Q.: Can the full-length reads be mapped to the reference, or should we trim the sequence ends?

A.: According to the standard Illumina paired-end (PE) sequencing protocol, DNA fragments are ligated with adapters at both ends. The adapter harbors universal sequences for flow-cell binding and DNA sequencing, as well as a unique index located upstream of the sequencing primer for multiplexing. For a GBS library, the 144 bp reads sequenced from either end are adapter-free and can be subjected directly to quality control to filter out low-quality reads. The retained 144 bp sequences (namely clean data) are qualified for mapping to the reference genome.


4 SNP Detection & Annotation

Single nucleotide polymorphism (SNP) refers to a variation in a single nucleotide at a specific position in the genome, including transitions and transversions. We detected SNPs in each individual using SAMtools[2] with the following parameters: 'mpileup -m 2 -F 0.002 -d 1000'.

To reduce the error rate in SNP detection, we filtered the results with the following criteria:

(1) The number of supporting reads for each SNP should be more than 4;

(2) The mapping quality (MQ) of each SNP should be higher than 20.

4.1 Statistics of SNP Detection & Annotation

ANNOVAR[3] is a widely used variant annotation tool with multiple capabilities, including gene-based annotation, region-based annotation, filter-based annotation, and other functionalities. Our pipeline uses ANNOVAR to annotate the detected SNPs.

Table 4.1 Statistics of SNP detection and annotation

                  -------------------- Exonic --------------------
Sample  Upstream  Stop gain  Stop loss  Synonymous  Non-synonymous  Intronic  Splicing  Downstream  Up-/Downstream  Intergenic  ts       tv       ts/tv  Het rate(‰)  Total
S5      2,523     6          4          1,320       1,006           130,631   7         3,116       71              156,848     197,094  107,834  1.827  0.073        304,928
S10     2,805     7          2          1,538       1,168           149,408   4         3,695       76              178,903     225,116  123,298  1.825  0.083        348,414
S9      2,214     4          1          1,259       902             121,922   6         2,909       59              144,020     181,663  100,466  1.808  0.068        282,129
S15     2,566     8          1          1,375       975             136,221   6         3,322       60              162,457     204,348  112,360  1.818  0.075        316,708
S11     2,246     4          1          1,322       951             127,781   6         2,982       53              155,665     193,841  106,397  1.821  0.071        300,238
S8      2,536     3          3          1,400       1,058           133,426   6         3,111       64              160,810     201,547  110,442  1.824  0.074        311,989
S6      2,267     5          3          1,270       868             118,968   8         2,681       43              142,164     179,162  97,858   1.830  0.066        277,020
S3      2,346     4          2          1,225       898             124,220   4         2,870       59              147,334     185,876  102,082  1.820  0.069        287,958
S16     2,496     4          1          1,338       914             132,777   8         3,104       57              158,927     199,647  109,718  1.819  0.073        309,365
S2      2,574     5          2          1,355       960             134,795   8         3,189       60              162,088     203,263  111,497  1.823  0.075        314,760
S14     2,677     7          1          1,508       1,085           143,780   6         3,256       62              172,115     216,908  118,029  1.837  0.080        334,937
S4      2,619     9          4          1,444       1,020           140,311   6         3,369       80              167,310     210,775  115,605  1.823  0.078        326,380
S13     2,305     7          3          1,323       956             125,350   3         3,003       55              149,449     188,318  103,280  1.823  0.069        291,598
S1      2,430     7          1          1,327       933             129,527   5         3,131       56              154,241     194,326  106,691  1.821  0.072        301,017
S7      2,460     3          3          1,413       1,052           133,711   11        3,382       65              161,061     202,166  110,748  1.825  0.075        312,914
S12     2,460     7          2          1,276       949             129,738   7         3,069       55              154,810     194,754  106,956  1.820  0.071        301,710

The details for SNP detection and annotation statistics are as follows:

(1) Sample: Sample name.
(2) Upstream: SNPs located within 1 kb upstream (away from the transcription start site) of a gene.
(3) Exonic: SNPs located in exonic regions. Non-synonymous: a single nucleotide mutation that changes the amino acid sequence; Stop gain/loss: a non-synonymous SNP that leads to the introduction/removal of a stop codon at the variant site; Synonymous: a single nucleotide mutation that does not change the amino acid sequence.
(4) Intronic: SNPs located in intronic regions.
(5) Splicing: SNPs located in a splicing site (within 2 bp of an intron/exon boundary).
(6) Downstream: SNPs located within 1 kb downstream (away from the transcription termination site) of a gene region.
(7) Upstream/Downstream: SNPs located in an intergenic region shorter than 2 kb, i.e. within 1 kb downstream of one gene and 1 kb upstream of another.
(8) Intergenic: SNPs located in intergenic regions larger than 2 kb.
(9) ts: Transitions, point mutations that change a purine nucleotide to another purine (A ↔ G) or a pyrimidine nucleotide to another pyrimidine (C ↔ T). Approximately two out of three SNPs are transitions.
(10) tv: Transversions, the substitution of a (two-ring) purine for a (one-ring) pyrimidine or vice versa.
(11) ts/tv: The ratio of transitions to transversions.
(12) Het rate: Genome-wide heterozygosity rate, calculated as the ratio of heterozygous SNPs to the total number of genome bases.
(13) Total: The total number of SNPs.
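The transition/transversion classification described in items (9) to (11) can be sketched as below. The function names and the (ref, alt) tuple representation are illustrative choices for the example.

```python
from typing import Iterable, Tuple

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref: str, alt: str) -> bool:
    """True for purine<->purine or pyrimidine<->pyrimidine substitutions."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt:
        raise ValueError("ref and alt must differ")
    return {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES


def ts_tv_ratio(snps: Iterable[Tuple[str, str]]) -> float:
    """Ratio of transitions to transversions over (ref, alt) pairs."""
    ts = tv = 0
    for ref, alt in snps:
        if is_transition(ref, alt):
            ts += 1
        else:
            tv += 1
    return ts / tv


print(ts_tv_ratio([("A", "G"), ("C", "T"), ("A", "C")]))
```

As a consistency check on Table 4.1, every SNP is either a transition or a transversion, so ts + tv equals the Total column; for S5, 197,094 + 107,834 = 304,928.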


4.2 SNP Quality Distribution

To assess the credibility of the detected SNPs, we checked the distribution of the number of supporting reads, the SNP quality, and the distance between adjacent SNPs. The results are shown in Figure 4.2.

Figure 4.2 Cumulative distribution of SNP quality

Note: These figures show the quality distribution of SNPs by, from top to bottom, the distribution of the number of SNP supporting reads, the distribution of distances between adjacent SNPs, and the cumulative distribution of SNP quality.

4.3 SNP Mutation Frequency

Take the T:A>C:G mutations as an example: this category includes mutations from T to C and from A to G. When a T>C mutation appears on either strand of the double helix, an A>G mutation is found at the same position on the other strand. Therefore, the T>C and A>G mutations are classified into one category. Accordingly, whole-genome SNP mutations can be classified into six categories. The frequency of each type is shown in Figure 4.3.

Figure 4.3 Frequency of SNP mutations

The x-axis represents the number of the SNPs, and y-axis indicates the mutation types.
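The grouping of strand-complementary mutations into six categories can be sketched in Python as follows. This is a minimal illustration, not the report's implementation: the function names are hypothetical, and the label format for the five categories other than T:A>C:G is assumed by analogy with that one example.

```python
# Illustrative sketch only; function names and most label strings are assumptions.
from collections import Counter

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def mutation_category(ref: str, alt: str) -> str:
    """Collapse a substitution and its opposite-strand equivalent into one label."""
    ref, alt = ref.upper(), alt.upper()
    if ref not in COMPLEMENT or alt not in COMPLEMENT or ref == alt:
        raise ValueError(f"not a single-nucleotide substitution: {ref}>{alt}")
    # Use the strand on which the reference base is a pyrimidine (C or T),
    # so that A>G is reported under the same label as T>C.
    if ref in ("A", "G"):
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{ref}:{COMPLEMENT[ref]}>{alt}:{COMPLEMENT[alt]}"


def count_categories(snps) -> Counter:
    """Count SNPs per category for an iterable of (ref, alt) pairs."""
    return Counter(mutation_category(ref, alt) for ref, alt in snps)
```

With this convention, the twelve possible substitutions fall into exactly six labels, and both `mutation_category("T", "C")` and `mutation_category("A", "G")` return `"T:A>C:G"`.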

4.4 SNP Detection & Annotation Q&A

Q.: What is the MQ quality for a SNP?

A.: MQ is the mapping quality of the reads covering the SNP, calculated as the root mean square of the mapping qualities of the supporting reads.
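The root-mean-square calculation described in this answer can be sketched as follows. The function name `rms_mapping_quality` is hypothetical, and the input is assumed to be a plain list of per-read mapping qualities.

```python
# Illustrative sketch only; the function name and input format are assumptions.
import math


def rms_mapping_quality(mapqs: list) -> float:
    """Root mean square of the mapping qualities of the supporting reads."""
    if not mapqs:
        raise ValueError("at least one supporting read is required")
    return math.sqrt(sum(q * q for q in mapqs) / len(mapqs))
```

Because the values are squared before averaging, a few poorly mapped reads lower the result less than they would lower a plain arithmetic mean.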

Q.: What is QUAL value for a SNP?

A.: The QUAL value is a Phred-scaled quality score that represents the probability (p) that a SNP truly exists at a given position. The higher the QUAL value, the more likely it is that the SNP exists. The relationship between QUAL and p is QUAL = -10 * log10(1 - p). Therefore, a QUAL value of 20 means that the probability of the existence of this SNP is 99%.
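The relationship QUAL = -10 * log10(1 - p) can be checked numerically with a short Python sketch. The helper names `qual_to_prob` and `prob_to_qual` are hypothetical.

```python
# Illustrative sketch only; helper names are hypothetical.
import math


def qual_to_prob(qual: float) -> float:
    """Probability p that the SNP truly exists, given its Phred-scaled QUAL."""
    return 1 - 10 ** (-qual / 10)


def prob_to_qual(p: float) -> float:
    """Phred-scaled QUAL for a probability p (0 <= p < 1) that the SNP exists."""
    return -10 * math.log10(1 - p)
```

Under this formula, QUAL values of 10, 20 and 30 correspond to probabilities of 90%, 99% and 99.9%, respectively.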

Q.: What are transitions and transversions?

A.: Transitions are changes between A and G, which are both purines, or between C and T, which are both pyrimidines, while transversions are changes between a purine and a pyrimidine, such as between A and T.

Q.: What is the heterozygous SNP?

A.: Heterozygous SNPs are those called with both the REF allele (the same as the reference) and the ALT allele (different from the reference) in a diploid species.

Q.: How to verify the SNP genotypes?

A.: The "gold standard" for SNP verification is PCR amplification followed by Sanger sequencing.

Q.: If the PCR-sequencing method fails to verify a detected SNP, does this mean that the NGS SNP calling is not reliable?

A.: SNP calling in NGS is based on the supporting reads and depends on sufficient coverage depth, which ensures the accuracy of most, but not all, of the detected SNPs. We recommend double-checking the PCR results first, then using a genome browser such as Savant or IGV to manually check the mapped reads of the NGS result.

Ⅳ References

[1] Li H, Durbin R: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 2009, 25(14):1754-1760.

[2] Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R: The Sequence Alignment/Map format and SAMtools. Bioinformatics 2009, 25(16):2078-2079.

[3] Wang K, Li M, Hakonarson H: ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Research 2010, 38(16):e164.

V Appendix

1 List of Result Folders

Results are listed here.

2 List of Softwares

This chapter lists the software used in the bioinformatics analysis pipeline for your reference.

List of software in WGS analyses

Analysis | Software | Usage | Version
Mapping | BWA | Mapping clean reads to the reference genome and generation of bam result files. | 0.7.8-r455
Mapping | SAMtools | Sorting the bam files. | 0.1.19-44428cd
SNP/InDel Detection | SAMtools | Detection and filtration of SNPs and InDels. | 0.1.19-44428cd
Variation Annotation | ANNOVAR | Annotation of the detected variations. | 2013Aug23

3 Methods

This chapter provides the detailed analysis methods for your reference.

Methods:PDF

4 Technical Instructions

Guidelines and instructions for viewing results

*.fasta: Plain-text file containing gene or genome sequences stored in fasta format, usually hard to open due to its large size. Please check the sampled*.fasta file, as it is much smaller.

For Unix/Linux/Mac users, use the less or more command to view these files.

For Windows users, use professional text editors like Editplus or Notepad++ to view these files.

*.fq/fastq: Plain-text file containing sequenced reads stored in fastq format, usually hard to open due to its large size. Please check the sampled*.fastq file, as it is much smaller.

For Unix/Linux/Mac users, use the less or more command to view these files.

For Windows users, use professional text editors like Editplus or Notepad++ to view these files.

*.xls, *.txt: Tab-separated (tab as field separator) plain-text result files.

For Unix/Linux/Mac users, use the less or more command to view these files.

For Windows users, use professional text editors like Editplus or Notepad++, or Microsoft Excel, to view these files.

*.png: Lossless image files.

For Unix/Linux/Mac users, use the display command to view these files.

For Windows users, use an image viewer such as Windows Photos or a professional editor such as Adobe Photoshop to view these files.

*.pdf: Vector graphic result files, which can be zoomed in and out without distortion, allowing users to view, edit, and process them. The files are in a standard vector format compatible with Adobe Illustrator and PDF viewers.

For Unix/Linux users, use the evince command to view these files.

For Windows/Mac users, use professional editors like Adobe Illustrator or PDF viewers such as Adobe Reader or Foxit Reader, or open these files with your web browser.
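As an alternative to a text editor or Excel, the tab-separated result files (*.xls, *.txt) described above can also be read programmatically. The following is a minimal Python sketch using the standard csv module; it assumes that the first line of the file is a header row, and the function name `read_tsv_rows` and the file name in the usage comment are hypothetical.

```python
# Illustrative sketch only; assumes the first line of the file is a header row.
import csv


def read_tsv_rows(lines) -> list:
    """Parse tab-separated lines into a list of dicts keyed by the header row."""
    reader = csv.DictReader(lines, delimiter="\t")
    return [dict(row) for row in reader]


# Usage with a file on disk (hypothetical file name):
# with open("snp_statistics.xls", newline="") as handle:
#     rows = read_tsv_rows(handle)
```

All values are returned as strings, so numeric columns need to be converted explicitly before any calculation.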