assessment and interpretation of quality metrics
TRANSCRIPT
National Center for Emerging and Zoonotic Infectious Diseases
Assessment and Interpretation of Quality Metrics
PulseNet Eastern Regional Meeting
January 2019
Lavin Joseph, M.S.
Microbiologist/Campylobacter Database Coordinator
Overview
Reference ID Database: Viewing Basic Quality Metrics
Organism-specific Database: Viewing Expanded Quality Metrics
Quality Metrics Summary Table
Sequence Quality Communications
Select entries in the Reference ID database.
Examination of Sequence Quality-Reference ID Database
Click on WGS tools and select Quality assessment
Examination of Sequence Quality-Reference ID Database
Color codes:
– Green = Pass QC and PulseNet organism (entries remain checked)
– Yellow = Pass QC and not a PulseNet organism
– Red = Failed QC or not a PulseNet organism
Select entry and click on Details to see specific metrics that pass/fail for each entry
Warning Not a PulseNet Organism
Examination of Sequence Quality-Reference ID Database
Sequence failed quality due to sequence length.
Escherichia sequence length: > 4.2 Mbp
Sample sequence length: 3 Mbp
Sequence passed quality; not PN organism.
Species Identified: Vibrio furnissii
PN Vibrio organisms: Vibrio cholerae, Vibrio parahaemolyticus, Vibrio vulnificus
Genus/species Check
Organism must be one of the PulseNet-monitored genera/species to be transferred to the correct organism-specific database:
– Campylobacter jejuni
– Campylobacter coli
– Campylobacter lari
– Campylobacter upsaliensis
– Campylobacter fetus
– Escherichia coli/Shigella species
– Salmonella bongori
– Salmonella enterica
– Vibrio cholerae
– Vibrio parahaemolyticus
– Vibrio vulnificus
– Listeria monocytogenes
Genus/species not identified by ANI or identified as a non-PulseNet monitored organism will not be automatically selected to transfer to the organism-specific databases.
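The genus/species check can be pictured as a simple lookup. The sketch below is illustrative only: the species list and database names come from these slides, but the function name, the dictionary, and the use of None for "not automatically transferred" are assumptions, not how the Reference ID database works internally.

```python
# Species and database names are taken from the slides; everything else
# (names, return convention) is a hypothetical illustration.
PULSENET_SPECIES = {
    "Campylobacter jejuni": "Campylobacter wgMLST Database",
    "Campylobacter coli": "Campylobacter wgMLST Database",
    "Campylobacter lari": "Campylobacter wgMLST Database",
    "Campylobacter upsaliensis": "Campylobacter wgMLST Database",
    "Campylobacter fetus": "Campylobacter wgMLST Database",
    "Escherichia coli": "Escherichia wgMLST Database",
    "Salmonella bongori": "Salmonella wgMLST Database",
    "Salmonella enterica": "Salmonella wgMLST Database",
    "Vibrio cholerae": "Vibrio wgMLST Database",
    "Vibrio parahaemolyticus": "Vibrio wgMLST Database",
    "Vibrio vulnificus": "Vibrio wgMLST Database",
    "Listeria monocytogenes": "Listeria wgMLST Database",
}


def target_database(ani_species):
    """Return the organism-specific database for an ANI result, or None.

    None means the entry would not be selected automatically for transfer
    (species not identified, or not a PulseNet-monitored organism).
    """
    name = (ani_species or "").strip()
    if not name:
        return None
    if name in PULSENET_SPECIES:
        return PULSENET_SPECIES[name]
    # "Escherichia coli/Shigella species": any Shigella species goes to
    # the Escherichia database.
    if name.split()[0] == "Shigella":
        return "Escherichia wgMLST Database"
    return None


print(target_database("Vibrio furnissii"))  # not a PulseNet Vibrio species
print(target_database("Shigella sonnei"))
```

Vibrio furnissii, the example shown earlier, returns None here because only V. cholerae, V. parahaemolyticus, and V. vulnificus are monitored.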
Campylobacter wgMLST Database
Escherichia wgMLST Database
Listeria wgMLST Database
Salmonella wgMLST Database
Vibrio wgMLST Database
De novo Assembly Quality Metrics
Average quality (1st end) and Average read quality (2nd end) (“Q-score”): base call accuracy for read 1 (1st end) and read 2 (2nd end).
– Q-scores must be greater than 30. A Q-score above 30 means less than a 1/1000 probability of an incorrect base call.
Average de novo Coverage: the general coverage across the entire genome
– We absolutely do not accept data below the minimum coverage thresholds.
• >20X for Listeria and Campylobacter, 30X for Salmonella, 40X for Escherichia and Vibrio
• Low coverage results in poor assemblies and can lead to missing critical genes, such as serotyping and toxin genes.
– If the Q-score is slightly below 30 we may be able to use the data as long as you have plenty of coverage (i.e., 10-20X above the minimum required).
Sequence length (Length): the estimated genome size of the sequenced isolate
– Varies by species or serotype of the isolate and presence or absence of plasmids/phage sequences
– If length is larger or smaller than the expected range, you may have an isolate mix-up or contamination
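For reference, the link between a Q-score and the probability of an incorrect base call follows the standard Phred relationship, P = 10^(-Q/10). The short sketch below shows that conversion only; the function name is illustrative and is not part of BioNumerics.

```python
def error_probability(q_score):
    """Probability of an incorrect base call for a Phred-scaled Q-score."""
    return 10 ** (-q_score / 10)


# Q30 is 1 error in 1,000 bases; Q20 is 1 in 100; Q40 is 1 in 10,000.
for q in (20, 30, 40):
    print(q, error_probability(q))
```

Because the scale is logarithmic, an average Q-score of 29 instead of 30 means roughly 26% more expected errors per base, which is why extra coverage can compensate for a small shortfall.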
De novo Assembly Quality Metrics (Continued)
Examination of Sequence Quality-Organism-Specific Databases
A green dot will be present for the quality experiment type
Select the entries to view the quality data
Create a comparison by clicking on the green + sign in the comparison window (or use Alt + C)
Viewing the Quality Metrics
Click on the eye next to the quality experiment in the Experiments panel
The Aspect should remain <All Characters> to view all of the quality metrics at the same time
Click on the first 123 icon (Show values) and change “Character name” to “Character description”
Now you can see the quality metrics for the entries selected
Viewing the Quality Metrics
Drag the horizontal and vertical bars to see all of the QC metrics on the screen
Use the top scroll bar to zoom in/out and view the values in each field
Exporting Quality Metrics into Excel
Click on Characters and select “Export character table”
Now you can paste the quality metrics for your sequences into Excel for easier viewing
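If you would rather screen the exported metrics with a script than in Excel, a sketch like the one below may help. It assumes the exported character table is tab-delimited text with one header row and one row per entry, which is an assumption about the export rather than something stated in these slides. The "Key" column name and the sample values are made up for illustration; the other column names reuse the character names listed in this presentation.

```python
import csv
import io


def load_metrics(tsv_text, id_column="Key"):
    """Parse tab-delimited text into {entry id: {metric name: value}}.

    Assumes every non-empty metric cell is numeric.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    table = {}
    for row in reader:
        entry_id = row.pop(id_column)
        table[entry_id] = {
            name: float(value)
            for name, value in row.items()
            if value not in (None, "")
        }
    return table


# Made-up example row, for illustration only.
example = (
    "Key\tAvgQuality\tAvgReadCoverage\tLength\tNrContigs\n"
    "isolate_A\t35.2\t62.0\t2950000\t18\n"
)
metrics = load_metrics(example)
print(metrics["isolate_A"]["AvgReadCoverage"])
```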
Exporting Quality Metrics into Excel
Raw Data Statistics
Average read quality (AvgQuality) (“Q-score”): base call accuracy
– Q-score must be greater than 30. A Q-score above 30 means less than a 1/1000 probability of an incorrect base call.
Expected coverage (AvgReadCoverage): the general coverage across the entire genome
– We absolutely do not accept data below the minimum coverage thresholds.
• >20X for Listeria and Campylobacter, 30X for Salmonella, 40X for Escherichia and Vibrio
• Low coverage results in poor assemblies and can lead to missing critical genes, such as serotyping and toxin genes.
– If the Q-score is slightly below 30 we may be able to use the data as long as you have plenty of coverage (i.e., 10-20X above the minimum required).
Q30 (SrsQ30): Total number of bases present in the (paired end) data files that have a quality score of 30 or higher
Q30 1st end (SrsQ30_1): Number of bases present in the first end reads that have a quality score of 30 or higher
Q30 2nd end (SrsQ30_2): Number of bases present in the second end reads that have a quality score of 30 or higher
Q30 frequency (SrsQ30Freq): Number of bases that have a quality score of 30 or higher, expressed as a percentage of the total number of bases present in the (paired end) data files
Q30 frequency 1st end (SrsQ30Freq_1): Number of bases present in the first end reads that have a quality score of 30 or higher, expressed as a percentage of the total number of bases present in the first end reads
Q30 frequency 2nd end (SrsQ30Freq_2): Number of bases present in the second end reads that have a quality score of 30 or higher, expressed as a percentage of the total number of bases present in the second end reads
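The six Q30 values above are related by simple counting. The sketch below works through that arithmetic on per-base quality scores. The function and its input format (a list of reads, each a list of integer quality scores) are illustrative assumptions; only the returned key names mirror the character names on the slide.

```python
def q30_stats(first_end, second_end, threshold=30):
    """Count bases at or above the threshold in each end and overall."""

    def count(reads):
        total = sum(len(read) for read in reads)
        passing = sum(1 for read in reads for q in read if q >= threshold)
        return passing, total

    def percent(passing, total):
        return 100.0 * passing / total if total else 0.0

    q1, n1 = count(first_end)
    q2, n2 = count(second_end)
    return {
        "SrsQ30": q1 + q2,
        "SrsQ30_1": q1,
        "SrsQ30_2": q2,
        "SrsQ30Freq": percent(q1 + q2, n1 + n2),
        "SrsQ30Freq_1": percent(q1, n1),
        "SrsQ30Freq_2": percent(q2, n2),
    }


# One tiny made-up read per end, for illustration only.
stats = q30_stats([[30, 35, 20, 40]], [[10, 30, 29, 31]])
print(stats)
```

In this toy example three of four first-end bases and two of four second-end bases reach Q30, so the frequencies are 75%, 50%, and 62.5% overall.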
Raw Data Statistics
De novo Assembly
Contigs (NrContigs): number of contiguous DNA segments assembled from smaller DNA fragments
N50: the contig length such that contigs of that length or longer contain at least half of the total assembled bases (a length-weighted median, not the plain median contig length)
Bases non-ACTG (NrNonACTG): number of N or ambiguous base calls
Sequence length (Length): the estimated genome size of the sequenced isolate
– Varies by species or serotype of the isolate and presence or absence of plasmids/phage sequences
– If length is larger or smaller than the expected range, you may have an isolate mix-up or contamination
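N50 is commonly computed by sorting contigs from longest to shortest and reporting the length of the contig at which the running total first reaches half of the assembly length. The sketch below follows that common definition; whether BioNumerics computes the value in exactly this way is an assumption.

```python
def n50(contig_lengths):
    """Length of the contig at which the cumulative sum, longest first,
    reaches at least half of the total assembly length."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contigs supplied")


# Made-up contig lengths: total is 300, so half is 150.
# 100 + 80 = 180 reaches the halfway point, giving an N50 of 80.
print(n50([100, 80, 60, 40, 20]))
```

Note that the plain median of these five lengths is 60, so the two statistics can differ noticeably when an assembly contains many short contigs.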
Assembly-free Allele Calls
Average coverage at identified loci (KeywordCov)
Multiple alleles (NrAFMultiple): number of loci with multiple alleles identified
Perfect matches (NrAFPerfect): number of times a locus identified in the raw sequence reads is a perfect match with an allele that already exists in the allele database
Present alleles (NrAFPresent): number of loci that had at least one perfect or closely matching allele in the allele database
– Number present should be close to the number expected for that species.
– If significantly higher or lower, you may have contamination or an isolate mix-up.
Quality check: If a new allele is only identified by assembly-free calling but not by assembly-based, it will not be added to the allele database
Assembly-based Allele Calls
Multiple alleles (NrBAFMultiple): number of loci with multiple alleles identified
– The value should be 0, indicating one allele was identified for a specific locus
– If greater than 0, you may have contamination
Perfect matches (NrBAFPerfect): number of loci that have an allele which is identical to an allele already present in the allele database
Alleles to submit (NrToBeSubmitted): number of loci that have newly identified alleles
Alleles assigned on assembled reads through assembly-based BLAST methods
Assembly-based Allele Calls
Submitted alleles (NrAlreadySubmitted): number of loci that have already been submitted to the allele database
Present alleles (NrBAFPresent): total number of loci that have allele calls
– Number present should be close to the number expected for that species
– If significantly higher or lower, you may have contamination or an isolate mix-up
Average locus coverage (AvgLocusCover): the average coverage of all loci that have allele calls
Summary Allele Calls
Unknown alleles (NrConsensusUnknown): an allele identified by the assembly-free method but not a 100% match to the prototype allele.
Multiple alleles (NrConsensusMultiple): number of loci with multiple alleles identified
– The value should be 0, indicating one allele was identified for a specific locus
– If greater than 0, you may have contamination
Discrepant alleles (NrDifferent): number of loci where the assembly-free and assembly-based allele calls identified a different allele number for the same locus
– The value should be <2
– If greater than 2, you may have contamination
Combines assembly-free and assembly-based allele calls
Summary Allele Calls
Present alleles (NrConsensus): total number of loci that have allele calls
– Number present should be close to the number expected for that species.
– If significantly higher or lower, you may have contamination or an isolate mix-up.
% core present (CorePercent): percent of core loci that have allele calls. Core loci used in the BioNumerics databases are the genes found in 95-100% of publicly available reference sequences.
– Only validated for Listeria database at this point; should be >95%
– If core genes are not detected as expected, you may have contamination or an isolate mix-up.
QC Metrics Summary Table (PHLs)
• ¥ Sequences may be usable if the Q-scores are between 29.00-29.99 or 28.00-28.99 as long as the coverage is 10X or 20X higher than shown in the table, respectively. If the Q-score falls below 28, then the sequence fails QC regardless of coverage.
• Sequences must pass QC metrics shown in red prior to submission to the PulseNet National Databases.
• Sequences that do not pass the other QC metrics (shown in black) may be submitted to the National Databases, but we may reject these sequences upon further QC review.
• * QC metric thresholds for Escherichia and Vibrio sequences have been determined using only the 2 x 250bp chemistry. QC metric thresholds for the other organism sequences have been determined using the 2 x 250bp and 2 x 150bp chemistries.
• Metrics in bold are found in the organism-specific databases only; other metrics can be found in the Reference ID Database.
Updated: January 14, 2019. Subject to change.
Quality Metric | Listeria | Campylobacter | *Escherichia | Salmonella | *Vibrio
¥ Raw data statistics: R1 & R2 Q-scores | ≥ 30 | ≥ 30 | ≥ 30 | ≥ 30 | ≥ 30
¥ De novo assembly: Average Coverage | ≥ 20X | ≥ 20X | ≥ 40X | ≥ 30X | ≥ 40X
De novo assembly: Sequence Length (Mbp) | ~2.8 – 3.1 | ~1.4 – 2.2 | ~4.2 – 5.9; ~4.2 – 4.9 (Shigella, rare spp.); ~4.9 – 5.9 (most serotypes) | ~4.4 – 5.6 (common serotypes usually ~5.0) | ~3.8 – 4.3 (Vc); ~4.9 – 5.5 (Vp); ~4.7 – 5.3 (Vv)
De novo assembly: Contigs | ≤ 100 | ≤ 200 (usually 10-50) | ≤ 600 (usually 100-500) | ≤ 400 (usually 50-200) | ≤ 200 (usually 50-100)
Summary Calls: Present Alleles | ≥ 2700 | ≥ 1350 (C. jejuni) (usually 1500-1700) | ≥ 3200 (usually 4400-5900) | ≥ 2500 (common serotypes usually ≥ 4200) | NA
Percent Core Present | ≥ 95% | ≥ 95% (C. jejuni) | ≥ 95% | ≥ 95% | NA
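The Q-score/coverage rule (¥) and a few of the table thresholds can be combined into a quick pre-submission screen. The sketch below is an illustration, not an official PulseNet tool: the function names and data layout are assumptions, the length ranges are the approximate values from the table, the species-specific Vibrio length ranges are left out, and it does not distinguish the red (mandatory) metrics from the black ones.

```python
# Thresholds copied from the summary table (updated January 14, 2019).
THRESHOLDS = {
    "Listeria": {"min_cov": 20, "length_mbp": (2.8, 3.1), "max_contigs": 100},
    "Campylobacter": {"min_cov": 20, "length_mbp": (1.4, 2.2), "max_contigs": 200},
    "Escherichia": {"min_cov": 40, "length_mbp": (4.2, 5.9), "max_contigs": 600},
    "Salmonella": {"min_cov": 30, "length_mbp": (4.4, 5.6), "max_contigs": 400},
    # Vibrio length ranges differ by species (Vc, Vp, Vv); not modelled here.
    "Vibrio": {"min_cov": 40, "length_mbp": None, "max_contigs": 200},
}


def required_coverage(organism, q_score):
    """Coverage needed for a given Q-score, or None if the Q-score fails."""
    base = THRESHOLDS[organism]["min_cov"]
    if q_score >= 30:
        return base
    if q_score >= 29:
        return base + 10  # Q 29.00-29.99: 10X above the table value
    if q_score >= 28:
        return base + 20  # Q 28.00-28.99: 20X above the table value
    return None  # below 28 fails regardless of coverage


def qc_failures(organism, q_r1, q_r2, coverage, length_mbp, contigs):
    """Return a list of failed checks; an empty list means none failed."""
    limits = THRESHOLDS[organism]
    failures = []
    needed = required_coverage(organism, min(q_r1, q_r2))
    if needed is None:
        failures.append("Q-score below 28")
    elif coverage < needed:
        failures.append(f"coverage {coverage}X below required {needed}X")
    if limits["length_mbp"] is not None:
        low, high = limits["length_mbp"]
        if not low <= length_mbp <= high:
            failures.append("sequence length outside expected range")
    if contigs > limits["max_contigs"]:
        failures.append("too many contigs")
    return failures


# The 3 Mbp Escherichia example from earlier in the presentation
# (Q-scores, coverage, and contig count here are made up).
print(qc_failures("Escherichia", 32, 31, 60, 3.0, 150))
```

The lower of the R1 and R2 Q-scores drives the coverage requirement here, which is a conservative reading of the ¥ footnote.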
Sequence Quality Communications
Contact [email protected] and [email protected] for troubleshooting help.
– Provide WGS ID# and all of the quality metrics for the isolate
E-mails from PulseNet DBMs requesting repeat sequencing based on QC review at CDC.
For more information, contact CDC
1-800-CDC-INFO (232-4636)
TTY: 1-888-232-6348
www.cdc.gov
The findings and conclusions in this report are those of the authors and do not necessarily represent the official position of the Centers for Disease Control and Prevention.
Questions?
Telephone: 404-639-4558 E-mail: [email protected] Web: www.cdc.gov/pulsenet #PulseNet