sofia 19.06.2011 bio math
Estimation of sequencing error rates present in genome databases
Valeriya Simeonova¹, Ivan Popov², Dimitar Vassilev¹*
1 - Faculty of Mathematics and Informatics, Sofia University "St. Kliment Ohridski", Sofia, Bulgaria
2 - Agro Bio Institute, Bioinformatics group, Sofia, Bulgaria
1* - corresponding author: [email protected]
Abstract
Next-generation sequencing
Validation of sequences
Donor/Acceptor sites - GT/AG
NCBI as primary DB for information scanning
Introduction
To measure the quality of sequencing, one needs a stretch of DNA/RNA with high conservation, in which it is statistically very unlikely to find a variation. Such sequences, found in all eukaryotes, are the donor and acceptor pairs of the splice sites.
Donor - Acceptor Pairs:
if not reverse complement then: GT or GC (donor) vs. AG (acceptor)
if reverse complement then: CT (acceptor) vs. AC or GC (donor)
Counting of reverse complement sites:
as CT is the RC of AG, it will be counted as AG
as AC is the RC of GT, it will be counted as GT
GC is its own RC, so it is counted as GC
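The counting rule above can be sketched in Python. This is a minimal illustration, not the authors' code: the names `reverse_complement` and `normalize_site` are hypothetical, and the sketch assumes each dinucleotide is read from the forward strand of the chromosome.

```python
# Complement of each DNA base.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement (RC) of a DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def normalize_site(dinucleotide: str, on_reverse_strand: bool) -> str:
    """Map a site read from the forward strand to its gene-strand form.

    Sites of genes on the reverse strand are reverse-complemented, so that
    CT is counted as AG and AC is counted as GT; GC stays GC.
    """
    site = dinucleotide.upper()
    return reverse_complement(site) if on_reverse_strand else site


for observed in ("CT", "AC", "GC"):
    print(observed, "->", normalize_site(observed, on_reverse_strand=True))
```

After this normalization every site can be compared against the same canonical set (GT/GC for donors, AG for acceptors), regardless of strand.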
Materials and methods
The NCBI Genome entries for the Oryza sativa chromosomes were used to estimate the sequencing error in the splicing donor/acceptor sites. The classical form of the splicing sites (GT/GC - AG) was used for the analysis. Only variations in this conserved sequence were considered, and any rare splicing sites (AT/AC) [2] found were not taken into account.
An alternative sequence of the rice genome was obtained from the Plant Genome Database (PlantGDB) [1]. It was used to verify the splicing errors in the NCBI sequence. The positions of the intron-exon boundaries were taken from the annotation of the NCBI Nucleotide entries of the chromosomes. The respective boundaries in the PlantGDB genome were selected by local pairwise alignment (BLAST) of the chromosomes of the two retrieved genomes. The fragments that did not enter the best BLAST result were ignored. We estimate the sequencing errors by calculating the frequency of appearance of sites that do not match the canonical form.
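The frequency estimate described above could be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: `mismatch_frequency` and its `skip_rare` flag are hypothetical names, the input is assumed to be (donor, acceptor) dinucleotide pairs already normalized to the gene strand, and skipping AT/AC introns entirely is only one possible reading of "not taken into account".

```python
CANONICAL_DONORS = {"GT", "GC"}   # classical donor forms
CANONICAL_ACCEPTOR = "AG"         # classical acceptor form
RARE_PAIR = ("AT", "AC")          # rare splice-site class [2]


def mismatch_frequency(introns, skip_rare=True):
    """Fraction of checked sites that do not match the canonical form.

    introns: iterable of (donor, acceptor) dinucleotide pairs.
    Each intron contributes two sites (one donor, one acceptor).
    """
    checked = 0
    mismatches = 0
    for donor, acceptor in introns:
        donor, acceptor = donor.upper(), acceptor.upper()
        if skip_rare and (donor, acceptor) == RARE_PAIR:
            continue  # assumption: rare AT/AC introns are left out entirely
        checked += 2
        mismatches += donor not in CANONICAL_DONORS
        mismatches += acceptor != CANONICAL_ACCEPTOR
    return mismatches / checked if checked else 0.0


# Hypothetical toy input: three ordinary introns and one rare AT/AC intron.
toy = [("GT", "AG"), ("GC", "AG"), ("GA", "AG"), ("AT", "AC")]
print(mismatch_frequency(toy))
```

In the toy input only the donor "GA" deviates, so one of the six checked sites counts as a mismatch.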
Results and Discussion
12 chromosomes, 225 981 donor-acceptor sites checked, 3385 differences from the classical form were found.
This leads to an error rate of 1.50 x 10^-2. This is three orders of magnitude higher than the error rate estimated by Wesche et al. [3] for the reference mouse genome (whole genome shotgun sequence of the C57BL/6J line), and one order of magnitude higher than the estimated error for coding sequences in the GenBank records of mouse genes.
Chart 1: Error rate by chromosome
Results based only on NCBI data
These results differ slightly from the previous ones, but they give us some insight into the genome. We could analyze how the errors differ between genomes and predict what error to expect when sequencing another organism classified in a certain group (plants, animal groups, etc.). The same approach could be used for examining (verifying) results from NGS.
We analyzed 12 chromosomes and discovered 3684 differences among 226 270 sites.
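As a quick arithmetic check, the two genome-wide rates can be recomputed from the reported counts. The helper name `error_rate` is illustrative; the counts are the ones given in the text (3385 of 225 981 sites with PlantGDB verification, 3684 of 226 270 sites with NCBI data only).

```python
def error_rate(differences: int, sites: int) -> float:
    """Fraction of checked sites that differ from the classical form."""
    return differences / sites


verified = error_rate(3385, 225_981)   # verified against PlantGDB
ncbi_only = error_rate(3684, 226_270)  # NCBI annotation and sequence only

print(f"verified against PlantGDB: {verified:.2e}")
print(f"NCBI data only:            {ncbi_only:.2e}")
```

The first value reproduces the reported 1.50 x 10^-2; the NCBI-only counts give about 1.63 x 10^-2, which is consistent with the statement that the two results differ only slightly.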
Chart 2: Statistics about error by chromosome if the error in the genome for every site group is 100%
Assuming: each site group (GT/GC and AG) has its own genome-wide error, and this is taken as 100%.
The two groups do not have the same trend lines¹.
At the same time: the bigger the chromosome, the more errors it contains.
This means that Chromosome 1 accounts for 15.08% of the genome-wide error in the GT/GC group.
1 - Trend line type is Moving Average
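The normalization behind Chart 2 can be sketched as below. The per-chromosome counts are hypothetical placeholders (the transcript does not list them), and `chromosome_shares` is an illustrative name.

```python
def chromosome_shares(errors_by_chromosome):
    """Share (in %) of one site group's genome-wide errors per chromosome.

    The genome-wide error count of the group is taken as 100%.
    """
    total = sum(errors_by_chromosome.values())
    return {chrom: 100.0 * count / total
            for chrom, count in errors_by_chromosome.items()}


# HYPOTHETICAL error counts for one site group, not data from the study.
gt_gc_errors = {"chr1": 30, "chr2": 50, "chr3": 20}
for chrom, share in chromosome_shares(gt_gc_errors).items():
    print(f"{chrom}: {share:.2f}% of the group's genome-wide errors")
```

Because the shares are relative to the group total, larger chromosomes with more annotated sites will tend to show larger shares.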
Chart 3: Stats about the error for each site in each chromosome
Assuming: each site group (GT/GC and AG) has its own error within each chromosome, and this is taken as 100%. So here we show the error for each site group in each chromosome.
The two groups have similar trend lines¹.
At the same time: it is evident that the error level in AG is higher than the error level in GT/GC in relative terms.
This means that in Chromosome 1, about 16 of every 1000 GT/GC sites are wrong.
1 - Trend line type is Polynomial
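A per-group rate of this kind ("about 16 of every 1000 sites") is a simple rescaling of the mismatch fraction. The sketch below uses hypothetical counts and an illustrative helper name, `errors_per`.

```python
def errors_per(scale: int, wrong_sites: int, total_sites: int) -> float:
    """Expected number of wrong sites per `scale` sites of a group."""
    return scale * wrong_sites / total_sites


# HYPOTHETICAL counts for one chromosome, not data from the study:
# (wrong sites, total sites) for each site group.
groups = {"GT/GC": (40, 2500), "AG": (50, 2500)}
for name, (wrong, total) in groups.items():
    print(f"{name}: {errors_per(1000, wrong, total):.0f} wrong sites per 1000")
```

Here the rate is relative to the number of sites of the same group in the same chromosome, so chromosomes of different sizes can be compared directly.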
Chart 4: Statistics about error in chromosomes if each chromosome is 100%
Assuming: both site groups (GT/GC and AG) together give the error level of each chromosome, and this is taken as 100%.
The trend line¹ of the error level and the trends in the chart show which site group produces more errors than the other in each chromosome.
At the same time: the number of bp in the chromosome does not matter.
This means that in Chromosome 1, about 176 of every 10000 sites are wrong.
This chart also shows how much these results differ from the analysis verified against the PlantGDB genome. This is important when examining data sequenced and assembled by different methods.
1 - Trend line type is Moving Average
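For reference, a moving-average trend line smooths a series by averaging each point with its predecessors. The sketch below is a generic simple moving average; the window of 2 and the sample values are assumptions, since the transcript does not state the period used in the charts.

```python
def moving_average(values, window=2):
    """Simple moving average; returns len(values) - window + 1 points."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


# HYPOTHETICAL per-chromosome error levels, not data from the study.
levels = [1.0, 3.0, 5.0, 7.0]
print(moving_average(levels, window=2))
```

A larger window gives a smoother line but hides more of the chromosome-to-chromosome variation.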
Charts 5: Stats about the error if the whole genome is 100% - error and no error occurrence
Assuming: the whole genome is 100%. Shown here are the two groups NE (errors) and EQ (no errors) for each chromosome, so their sum is 100%.
The two groups have similar trend lines¹.
At the same time: the bigger the chromosome, the higher the rates.
This means that for Chromosome 1 the error is 0.26% of all sites in the whole genome, including the error-free sites.
1 - Trend line type is Moving Average
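The whole-genome normalization can be sketched as follows. The counts are hypothetical placeholders chosen only to illustrate the calculation, and `genome_percentages` is an illustrative name.

```python
def genome_percentages(counts):
    """Per-chromosome NE (error) and EQ (no error) shares of all sites.

    counts: {chromosome: (error_sites, total_sites)}
    All sites in the genome together are taken as 100%.
    """
    genome_total = sum(total for _errors, total in counts.values())
    result = {}
    for chrom, (errors, total) in counts.items():
        ne = 100.0 * errors / genome_total
        eq = 100.0 * (total - errors) / genome_total
        result[chrom] = (ne, eq)
    return result


# HYPOTHETICAL counts, not data from the study.
counts = {"chr1": (26, 6000), "chr2": (14, 4000)}
percentages = genome_percentages(counts)
for chrom, (ne, eq) in percentages.items():
    print(f"{chrom}: NE {ne:.2f}%  EQ {eq:.2f}%")
```

Summing NE and EQ over all chromosomes gives 100%, which is why both groups grow with chromosome size in this view.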
References
[1] Duvick, J., Fu, A., Muppirala, U., Sabharwal, M., Wilkerson, M.D., Lawrence, C.J., Lushbough, C. & Brendel, V. (2008) PlantGDB: a resource for comparative plant genomics. Nucl. Acids Res. 36, D959-D965.
[2] Hall, S.L., Padgett, R.A. (1994) Conserved sequences in a class of rare eukaryotic nuclear introns with non-consensus splice sites. J. Mol. Biol. 239 (3): 357-65.
[3] Wesche, P.L., Gaffney, D.J., Keightley, P.D. (2004) DNA sequence error rates in Genbank records estimated using the mouse genome as reference. DNA Sequence 15 (5/6): 362-64.
Thank You
Presented by: Valeriya Simeonova