sofia 19.06.2011 bio math

Estimation of sequencing error rates present in genome databases

Valeriya Simeonova¹, Ivan Popov², Dimitar Vassilev¹*

1 - Faculty of Mathematics and Informatics, Sofia University "St. Kliment Ohridski", Sofia, Bulgaria

2 - Agro Bio Institute, Bioinformatics group, Sofia, Bulgaria

1* - corresponding author: [email protected]

Abstract

Next-generation sequencing

Validation of sequences

Donor/Acceptor sites - GT/AG

NCBI as primary DB for information scanning

Introduction

To measure the quality of sequencing, one needs a stretch of DNA/RNA with high conservation, in which it is statistically very unlikely to find a variation. Such sequences, found in all eukaryotes, are the donor and acceptor pairs of the splicing sites.

Donor - Acceptor Pairs:

if not reverse complement: GT or GC vs. AG

if reverse complement: CT vs. AC or GC

Counting of reverse complement sites:

CT is the RC of AG, so it is counted as AG

AC is the RC of GT, so it is counted as GT

GC is its own RC, so it is counted as GC
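The counting rules above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical and the slides do not show the code that was actually used. It assumes the two intron-end dinucleotides are read from the forward strand of the database sequence.

```python
# Illustrative sketch; helper names are hypothetical, not the authors' code.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

CANONICAL_DONORS = {"GT", "GC"}
CANONICAL_ACCEPTORS = {"AG"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def normalize_intron_ends(first2: str, last2: str, is_reverse_complement: bool):
    """Map the first and last dinucleotide of an intron, as read on the
    forward strand, to a (donor, acceptor) pair on the gene's own strand."""
    if not is_reverse_complement:
        return first2.upper(), last2.upper()
    # On the reverse strand the forward-strand start is RC(acceptor)
    # and the forward-strand end is RC(donor).
    return reverse_complement(last2), reverse_complement(first2)


def is_canonical(donor: str, acceptor: str) -> bool:
    """True when the pair has the classical form GT/GC - AG."""
    return donor in CANONICAL_DONORS and acceptor in CANONICAL_ACCEPTORS


donor, acceptor = normalize_intron_ends("CT", "AC", is_reverse_complement=True)
print(donor, acceptor, is_canonical(donor, acceptor))
```

With this normalization a reverse-strand intron that reads CT ... AC on the forward strand is counted in the AG and GT groups, as described above.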

Materials and methods

The NCBI Genome entries for the Oryza sativa chromosomes were used to estimate the sequencing error in the splicing donor/acceptor sites. The classical form of the splicing sites (GT/GC - AG) was used for the analysis. Only variations in this conserved sequence were considered; any rare splicing sites (AT/AC) [2] that were found were not taken into account.

An alternative sequence of the rice genome was obtained from the Plant Genome Database (PlantGDB) [1]. It was used to verify the splicing errors in the NCBI sequence. The positions of the intron-exon boundaries were taken from the annotation of the NCBI Nucleotide entries of the chromosomes. The respective boundaries in the PlantGDB genome were selected by local pairwise alignment (BLAST) of the chromosomes of the two retrieved genomes. Fragments that did not enter the best BLAST result were ignored. We estimated the sequencing errors by calculating the frequency of sites that do not match the canonical form.
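A minimal sketch of the frequency calculation, under stated assumptions: intron coordinates are given as 0-based half-open intervals with a strand flag, donor and acceptor sites are counted separately, and the function names are hypothetical. The parsing of NCBI annotations and the BLAST step are not shown.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_splice_sites(sequence: str, introns):
    """Count checked and non-canonical sites.

    introns: iterable of (start, end, on_reverse_strand) tuples,
    0-based half-open coordinates on the forward strand (an assumption).
    """
    counts = Counter()
    seq = sequence.upper()
    for start, end, on_reverse_strand in introns:
        intron = seq[start:end]
        if on_reverse_strand:
            intron = reverse_complement(intron)
        donor, acceptor = intron[:2], intron[-2:]
        counts["donor_total"] += 1
        counts["acceptor_total"] += 1
        if donor not in ("GT", "GC"):
            counts["donor_mismatch"] += 1
        if acceptor != "AG":
            counts["acceptor_mismatch"] += 1
    return counts


def error_rate(counts) -> float:
    """Fraction of checked sites that do not match the canonical form."""
    total = counts["donor_total"] + counts["acceptor_total"]
    mismatches = counts["donor_mismatch"] + counts["acceptor_mismatch"]
    return mismatches / total if total else 0.0


# Toy sequence for illustration only: three introns, one with a wrong acceptor.
toy = "AA" + "GTCCCCAG" + "TT" + "CTAAAAAC" + "GG" + "GTCCCCAA"
toy_introns = [(2, 10, False), (12, 20, True), (22, 30, False)]
toy_counts = count_splice_sites(toy, toy_introns)
print(toy_counts, error_rate(toy_counts))
```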

Results and Discussion

12 chromosomes, 225 981 donor-acceptor sites checked; 3385 differences from the classical form were found.

This leads to an error rate of 1.50 x 10^-2. This is three orders of magnitude higher than the error rate estimated by Wesche et al. [3] for the reference mouse genome (whole genome shotgun sequence of the C57BL/6J line), and one order of magnitude higher than the estimated error for coding sequences in the GenBank records of mouse genes.

Chart 1: Error rate by chromosome

Results based only on NCBI data

These results differ slightly from the previous ones, but they give some insight into the genome. We could analyze how errors differ between genomes and predict what error to expect when sequencing another organism from a given group (plants, animal groups, etc.). The same approach could be used for examining (verifying) results from NGS.

We analyzed 12 chromosomes and found 3684 differences among 226 270 sites.
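As a quick check of the arithmetic, the two genome-wide rates can be recomputed from the counts reported on the slides. The variable names are illustrative; only the four counts come from the presentation.

```python
def error_rate(mismatches: int, sites: int) -> float:
    """Fraction of checked sites that differ from the classical form."""
    return mismatches / sites


# Counts as reported on the slides.
verified_with_plantgdb = error_rate(3385, 225_981)
ncbi_only = error_rate(3684, 226_270)

print(f"verified against PlantGDB: {verified_with_plantgdb:.2e}")  # 1.50e-02
print(f"NCBI data only:            {ncbi_only:.2e}")               # 1.63e-02
```

Both values are of the order of 10^-2, so the NCBI-only analysis changes the estimate only slightly.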

Chart 2: Statistics about error by chromosome if the error in the genome for every site group is 100%

Assuming: the genome-wide error of each site group (GT/GC and AG) is taken as 100%.

The two groups do not have the same trend lines¹.

At the same time: the bigger the chromosome, the more errors it contains.

For example, Chromosome 1 accounts for 15.08% of the genome-wide errors in the GT/GC (GTC) site group.

1 - The trend line type is Moving Average
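The normalization behind Chart 2 can be sketched as follows. The counts are made up for illustration and are not the study's data. The function name is hypothetical.

```python
def share_of_group_errors(errors_by_chromosome: dict) -> dict:
    """Express each chromosome's error count as a percentage of the
    genome-wide error count of one site group (the group total is 100%)."""
    total = sum(errors_by_chromosome.values())
    if total == 0:
        return {chrom: 0.0 for chrom in errors_by_chromosome}
    return {chrom: 100.0 * n / total for chrom, n in errors_by_chromosome.items()}


# Made-up error counts for one site group, for illustration only.
example_errors = {"chr1": 30, "chr2": 50, "chr3": 20}
example_shares = share_of_group_errors(example_errors)
print(example_shares)
```

Because the shares are relative to the group total, larger chromosomes with more sites naturally tend to take a larger share.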

Chart 3: Statistics about the error for each site group in each chromosome

Assuming: all sites of a site group (GT/GC or AG) within a chromosome are taken as 100%. The chart therefore shows the error of each site group in each chromosome.

The two groups have similar trend lines¹.

At the same time: it is evident that the error level in AG is higher than the error level in GT/GC in relative terms.

For example, in Chromosome 1 about 16 of every 1000 GT/GC (GTC) sites are wrong.

1 - The trend line type is Polynomial
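A short sketch of the per-chromosome, per-group rate used in Chart 3, expressed per 1000 sites. The counts below are invented so that the GT/GC value comes out at 16 per 1000. They are not the study's counts, and the names are hypothetical.

```python
def errors_per_thousand(mismatches: int, sites: int) -> float:
    """Number of wrong sites per 1000 sites of one group in one chromosome."""
    return 1000.0 * mismatches / sites


# Invented (mismatches, sites) pairs for one chromosome, for illustration only.
example_chromosome = {"GT/GC": (8, 500), "AG": (11, 500)}
example_rates = {
    group: errors_per_thousand(mismatches, sites)
    for group, (mismatches, sites) in example_chromosome.items()
}
print(example_rates)
```

Because each group is divided by its own number of sites, the two groups can be compared in relative terms regardless of chromosome size.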

Chart 4: Statistics about error in chromosomes if each chromosome is 100%

Assuming: all sites of both site groups (GT/GC and AG) within a chromosome are taken as 100%.

The trend line¹ of the error level and the trends in the chart show which site group produces more errors than the other in each chromosome.

At the same time: the number of base pairs in the chromosome does not matter.

For example, in Chromosome 1 about 176 of every 10 000 sites are wrong.

This chart also shows how much these results differ from the analysis verified against the PlantGDB genome. This is important when examining data sequenced and assembled by different methods.

1 - The trend line type is Moving Average
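A moving-average trend line of the kind named in the footnote can be sketched as below. The window length of 2 is an assumption, since the slides do not state the period, and the input values are invented.

```python
def moving_average(values, window: int = 2):
    """Trailing moving average: each point is the mean of the current
    value and the (window - 1) values before it."""
    if window < 1:
        raise ValueError("window must be at least 1")
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


# Invented per-chromosome values (errors per 10 000 sites), illustration only.
example_values = [100.0, 300.0, 500.0, 700.0]
example_trend = moving_average(example_values, window=2)
print(example_trend)
```

The trend has `window - 1` fewer points than the input, which is why such a line starts at the second chromosome when the window is 2.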

Chart 5: Statistics about the error if the whole genome is 100% - error and no-error occurrence

Assuming: all sites in the whole genome are taken as 100%. The chart shows the two groups NE (errors) and EQ (no errors) for each chromosome, so their sum over all chromosomes is 100%.

The two groups have similar trend lines¹.

At the same time: the bigger the chromosome, the higher both rates are.

For example, the error for Chromosome 1 is 0.26% of all sites in the whole genome, including the sites without errors.

1 - The trend line type is Moving Average
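The whole-genome normalization of Chart 5 can be sketched as follows. The NE/EQ counts are invented for illustration and the function name is hypothetical.

```python
def genome_shares(counts_by_chromosome: dict) -> dict:
    """Express each chromosome's NE (error) and EQ (no error) counts as
    percentages of all sites in the genome."""
    total = sum(ne + eq for ne, eq in counts_by_chromosome.values())
    return {
        chrom: (100.0 * ne / total, 100.0 * eq / total)
        for chrom, (ne, eq) in counts_by_chromosome.items()
    }


# Invented (NE, EQ) counts, for illustration only.
example_counts = {"chr1": (2, 98), "chr2": (1, 99)}
example_genome_shares = genome_shares(example_counts)
print(example_genome_shares)
```

With this basis both the NE and the EQ percentage of a chromosome grow with its number of sites, which matches the observation that both rates rise for bigger chromosomes.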

References

[1] Duvick, J., Fu, A., Muppirala, U., Sabharwal, M., Wilkerson, M.D., Lawrence, C.J., Lushbough, C., Brendel, V. (2008). PlantGDB: a resource for comparative plant genomics. Nucl. Acids Res. 36, D959-D965.

[2] Hall, S.L., Padgett, R.A. (1994). Conserved sequences in a class of rare eukaryotic nuclear introns with non-consensus splice sites. J. Mol. Biol. 239(3): 357-365.

[3] Wesche, P.L., Gaffney, D.J., Keightley, P.D. (2004). DNA sequence error rates in Genbank records estimated using the mouse genome as reference. DNA Sequence 15(5/6): 362-364.

Thank You

Presented by: Valeriya Simeonova
