assessment and interpretation of quality metrics

National Center for Emerging and Zoonotic Infectious Diseases

Assessment and Interpretation of Quality Metrics

PulseNet Eastern Regional Meeting

January 2019

Lavin Joseph, M.S.
Microbiologist/Campylobacter Database Coordinator

Overview

• Reference ID Database: Viewing Basic Quality Metrics
• Organism-specific Database: Viewing Expanded Quality Metrics
• Quality Metrics Summary Table
• Sequence Quality Communications

Examination of Sequence Quality-Reference ID Database

• Select entries in the Reference ID database.
• Click on WGS tools and select Quality assessment.



Color codes:
• Green = Pass QC and PulseNet organism (entries remain checked)
• Yellow = Pass QC and not a PulseNet organism
• Red = Failed QC or not a PulseNet organism

Select entry and click on Details to see specific metrics that pass/fail for each entry

Warning Not a PulseNet Organism


Sequence failed quality due to sequence length.

Escherichia sequence length: > 4.2 Mbp

Sample sequence length: 3 Mbp

Sequence passed quality; not PN organism.

Species Identified: Vibrio furnissii

PN Vibrio organisms: Vibrio cholerae, Vibrio parahaemolyticus, Vibrio vulnificus


Genus/species Check

Organism must be one of the PulseNet-monitored genus/species to be transferred to the correct organism-specific database:
– Campylobacter jejuni
– Campylobacter coli
– Campylobacter lari
– Campylobacter upsaliensis
– Campylobacter fetus
– Escherichia coli/Shigella species
– Salmonella bongori
– Salmonella enterica
– Vibrio cholerae
– Vibrio parahaemolyticus
– Vibrio vulnificus
– Listeria monocytogenes

Entries whose genus/species is not identified by ANI, or is identified as a non-PulseNet-monitored organism, will not be automatically selected for transfer to the organism-specific databases.
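The genus/species check can be pictured as a simple lookup. The following is a minimal Python sketch, not the BioNumerics implementation: the function name `is_pulsenet_organism` and the idea of passing the ANI result as a plain species string are assumptions made for illustration. The organism list is taken from the slide.

```python
# PulseNet-monitored species, as listed on the slide.
PULSENET_SPECIES = {
    "Campylobacter jejuni",
    "Campylobacter coli",
    "Campylobacter lari",
    "Campylobacter upsaliensis",
    "Campylobacter fetus",
    "Escherichia coli",
    "Salmonella bongori",
    "Salmonella enterica",
    "Vibrio cholerae",
    "Vibrio parahaemolyticus",
    "Vibrio vulnificus",
    "Listeria monocytogenes",
}


def is_pulsenet_organism(ani_species):
    """Return True if the ANI result names a PulseNet-monitored organism.

    `ani_species` is a hypothetical input: the species string reported by
    ANI, or None/empty when no species was identified.
    """
    if not ani_species or not ani_species.strip():
        return False  # not identified by ANI: never auto-selected
    name = ani_species.strip()
    # The slide lists "Escherichia coli/Shigella species", so any Shigella
    # species is treated as monitored (an interpretation of that entry).
    if name.split()[0] == "Shigella":
        return True
    return name in PULSENET_SPECIES
```

With this sketch, `is_pulsenet_organism("Vibrio furnissii")` is False, which matches the example in which the sequence passed quality but was not a PulseNet organism.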

Campylobacter wgMLST Database

Escherichia wgMLST Database

Listeria wgMLST Database

Salmonella wgMLST Database

Vibrio wgMLST Database

De novo Assembly Quality Metrics

• Average quality (1st end) and Average read quality (2nd end) (“Q-score”): base call accuracy for read 1 (1st end) and read 2 (2nd end).
– Q-scores must be 30 or higher. Q30 corresponds to a 1 in 1,000 probability of an incorrect base call.
• Average de novo Coverage: the general coverage across the entire genome.
– We absolutely do not accept data below the minimum coverage thresholds: ≥20X for Listeria and Campylobacter, ≥30X for Salmonella, ≥40X for Escherichia and Vibrio.
– Low coverage results in poor assemblies and can lead to missing critical genes, such as serotyping and toxin genes.
• If the Q-score is slightly below 30, we may be able to use the data as long as you have plenty of coverage (i.e., 10-20X above the minimum required).
• Sequence length (Length): the estimated genome size of the sequenced isolate.
– Varies by species or serotype of the isolate and presence or absence of plasmids/phage sequences.
– If length is larger or smaller than the expected range, you may have an isolate mix-up or contamination.
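The link between a Q-score and the probability of an incorrect base call follows the standard Phred relationship, Q = -10 × log10(P). The short Python sketch below shows the conversion in both directions; the function names are chosen here for illustration only.

```python
import math


def error_probability(q_score):
    """Probability of an incorrect base call for a given Phred Q-score."""
    return 10 ** (-q_score / 10)


def q_score_from_probability(error_prob):
    """Phred Q-score for a given probability of an incorrect base call."""
    return -10 * math.log10(error_prob)


# Q30 is roughly 1 error in 1,000 bases; Q20 is roughly 1 in 100.
for q in (20, 28, 29, 30):
    print(f"Q{q}: about 1 error in {round(1 / error_probability(q)):,} bases")
```

Because the scale is logarithmic, a drop from Q30 to Q28 raises the error probability by about 60%, which is one way to see why additional coverage is requested when Q-scores fall slightly below 30.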

Examination of Sequence Quality-Organism-Specific Databases

A green dot will be present for the quality experiment type

Select the entries to view the quality data

Create a comparison by clicking on the green + sign in the comparison window (or use Alt + C)

Viewing the Quality Metrics

• Click on the eye next to the quality experiment in the Experiments panel.
• The Aspect should remain <All Characters> to view all of the quality metrics at the same time.
• Click on the first 123 icon (Show values) and change “Character name” to “Character description”.
• Now you can see the quality metrics for the entries selected.


• Drag the horizontal and vertical bars to see all of the QC metrics on the screen.
• Use the top scroll bar to zoom in/out and view the values in each field.

Exporting Quality Metrics into Excel

• Click on Characters and select “Export character table”.
• Now you can paste the quality metrics for your sequences into Excel for easier viewing.
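For labs that prefer to review the exported metrics in a script rather than in Excel, the table could also be read with Python. The sketch below assumes the exported character table is tab-separated text with a header row; that layout, the `Key` column, and the sample values are assumptions for illustration and should be checked against an actual export.

```python
import csv
import io


def parse_character_table(text):
    """Parse tab-separated text with a header row into a list of dicts."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [dict(row) for row in reader]


# Hypothetical export: one row per entry, one column per quality metric.
sample = (
    "Key\tAvgQuality\tAvgReadCoverage\n"
    "ISOLATE_001\t34.2\t55.1\n"
    "ISOLATE_002\t29.4\t38.0\n"
)

rows = parse_character_table(sample)
for row in rows:
    print(row["Key"], float(row["AvgQuality"]), float(row["AvgReadCoverage"]))
```

Values arrive as strings, so they need to be converted with `float()` before being compared against any threshold.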


Raw Data Statistics

• Average read quality (AvgQuality) (“Q-score”): base call accuracy.
– Q-score must be 30 or higher. Q30 corresponds to a 1 in 1,000 probability of an incorrect base call.
• Expected coverage (AvgReadCoverage): the general coverage across the entire genome.
– We absolutely do not accept data below the minimum coverage thresholds: ≥20X for Listeria and Campylobacter, ≥30X for Salmonella, ≥40X for Escherichia and Vibrio.
– Low coverage results in poor assemblies and can lead to missing critical genes, such as serotyping and toxin genes.
• If the Q-score is slightly below 30, we may be able to use the data as long as you have plenty of coverage (i.e., 10-20X above the minimum required).
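As a rough guide to what a coverage value means, expected coverage is commonly estimated as the total number of sequenced bases divided by the genome length. The Python sketch below shows that back-of-the-envelope estimate; BioNumerics may compute AvgReadCoverage differently, and the function name and example numbers are illustrative assumptions.

```python
def expected_coverage(total_bases, genome_length_bp):
    """Estimate coverage as sequenced bases divided by genome length."""
    if genome_length_bp <= 0:
        raise ValueError("genome_length_bp must be positive")
    return total_bases / genome_length_bp


# Hypothetical run: 1,000,000 read pairs of 2 x 250 bp on a 5.0 Mbp genome.
read_pairs = 1_000_000
bases_per_pair = 2 * 250
coverage = expected_coverage(read_pairs * bases_per_pair, 5_000_000)
print(f"Estimated coverage: {coverage:.0f}X")
```

The same number of reads gives lower coverage on a larger genome, so coverage has to be judged against the organism being sequenced.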

Q30 (SrsQ30): Total number of bases present in the (paired end) data files that have a quality score of 30 or higher

Q30 1st end (SrsQ30_1): Number of bases present in the first end reads that have a quality score of 30 or higher

Q30 2nd end (SrsQ30_2): Number of bases present in the second end reads that have a quality score of 30 or higher

Q30 frequency (SrsQ30Freq): Number of bases that have a quality score of 30 or higher, expressed as a percentage of the total number of bases present in the (paired end) data files

Q30 frequency 1st end (SrsQ30Freq_1): Number of bases present in the first end reads that have a quality score of 30 or higher, expressed as a percentage of the total number of bases present in the first end reads

Q30 frequency 2nd end (SrsQ30Freq_2): Number of bases present in the second end reads that have a quality score of 30 or higher, expressed as a percentage of the total number of bases present in the second end reads
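The six Q30 values are all derived from the same per-base quality scores. The Python sketch below follows the definitions above, assuming the quality scores for each end are available as flat lists of integers; the function names and input format are assumptions for illustration, while the dictionary keys reuse the character names from the slide.

```python
Q30_THRESHOLD = 30


def count_q30(quals):
    """Number of bases with a quality score of 30 or higher."""
    return sum(1 for q in quals if q >= Q30_THRESHOLD)


def percent(part, total):
    """Express part as a percentage of total (0.0 when total is zero)."""
    return 100.0 * part / total if total else 0.0


def q30_stats(read1_quals, read2_quals):
    """Compute the Q30 counts and frequencies for paired-end data."""
    q30_1 = count_q30(read1_quals)
    q30_2 = count_q30(read2_quals)
    total_1 = len(read1_quals)
    total_2 = len(read2_quals)
    return {
        "SrsQ30": q30_1 + q30_2,
        "SrsQ30_1": q30_1,
        "SrsQ30_2": q30_2,
        "SrsQ30Freq": percent(q30_1 + q30_2, total_1 + total_2),
        "SrsQ30Freq_1": percent(q30_1, total_1),
        "SrsQ30Freq_2": percent(q30_2, total_2),
    }
```

Note that the overall frequency is weighted by the number of bases in each end, so it is not simply the mean of the two per-end frequencies when the ends differ in size.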


De novo Assembly

• Contigs (NrContigs): number of contiguous DNA segments assembled from smaller DNA fragments
• N50: the contig length at which contigs of that length or longer contain at least half of the assembled bases (a length-weighted median contig length)
• Bases non-ACTG (NrNonACTG): number of N or ambiguous base calls
• Sequence length (Length): the estimated genome size of the sequenced isolate
– Varies by species or serotype of the isolate and presence or absence of plasmids/phage sequences
– If length is larger or smaller than the expected range, you may have an isolate mix-up or contamination
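N50 is conventionally computed as a length-weighted median rather than a plain median of contig lengths: contigs are sorted from longest to shortest and their lengths are summed until half of the total assembly length is reached. A minimal Python sketch of that conventional calculation follows; the function name and the example contig lengths are illustrative.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly given its contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        # The first contig that brings the running sum to at least half
        # of the assembly length defines N50.
        if running >= half_total:
            return length
    raise ValueError("contig_lengths must not be empty")


# Hypothetical assembly of six contigs totalling 290 kb (lengths in kb).
example = [80, 70, 50, 40, 30, 20]
print(n50(example))
```

For this example the running sum reaches 150 of 290 after the second contig, so N50 is 70, whereas the plain median of the six lengths would be 45.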

Assembly-free Allele Calls

• Average coverage at identified loci (KeywordCov)
• Multiple alleles (NrAFMultiple): number of loci with multiple alleles identified
• Perfect matches (NrAFPerfect): number of times a locus identified in the raw sequence reads is a perfect match with an allele that already exists in the allele database
• Present alleles (NrAFPresent): number of loci that had at least one perfect or closely matching allele in the allele database
– Number present should be close to the number expected for that species.
– If significantly higher or lower, you may have contamination or an isolate mix-up.

Quality check: If a new allele is only identified by assembly-free calling but not by assembly-based, it will not be added to the allele database

Assembly-based Allele Calls

Multiple alleles (NrBAFMultiple): number of loci with multiple alleles identified
– The value should be 0, indicating one allele was identified for a specific locus
– If greater than 0, you may have contamination

Perfect matches (NrBAFPerfect): number of loci that have an allele which is identical to an allele already present in the allele database

Alleles to submit (NrToBeSubmitted): number of loci that have newly identified alleles

Alleles assigned on assembled reads through assembly-based BLAST methods


Submitted alleles (NrAlreadySubmitted): number of loci with alleles that have already been submitted to the allele database

Present alleles (NrBAFPresent): total number of loci that have allele calls
– Number present should be close to the number expected for that species
– If significantly higher or lower, you may have contamination or an isolate mix-up

Average locus coverage (AvgLocusCover): the average coverage of all loci that have allele calls


Summary Allele Calls

Unknown alleles (NrConsensusUnknown): an allele identified by the assembly-free method but not a 100% match to the prototype allele.

Multiple alleles (NrConsensusMultiple): number of loci with multiple alleles identified
– The value should be 0, indicating one allele was identified for a specific locus
– If greater than 0, you may have contamination

Discrepant alleles (NrDifferent): number of loci where the assembly-free and assembly-based allele calls identified a different allele number for the same locus
– The value should be <2
– If greater than 2, you may have contamination

Combines assembly-free and assembly-based allele calls


Present alleles (NrConsensus): total number of loci that have allele calls
– Number present should be close to the number expected for that species.
– If significantly higher or lower, you may have contamination or an isolate mix-up.

% core present (CorePercent): percent of core loci that have allele calls. Core loci used in the BioNumerics databases are the genes found in 95-100% of publicly available reference sequences.
– Only validated for Listeria database at this point; should be >95%
– If core genes are not detected as expected, you may have contamination or an isolate mix-up.
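The summary allele call checks lend themselves to a simple automated screen. The Python sketch below is an illustration only, assuming the metrics for one entry are held in a dictionary keyed by the character names on these slides; the function name and dictionary layout are hypothetical. It flags discrepant alleles whenever the value is not below 2, following the stated target of <2, and it treats a core percentage below 95 as a flag.

```python
def allele_call_warnings(metrics):
    """Return a list of warning strings for one entry's summary calls.

    `metrics` is a hypothetical dict such as
    {"NrConsensusMultiple": 0, "NrDifferent": 1, "CorePercent": 98.5}.
    """
    warnings = []
    # Multiple alleles at a locus should not occur in a pure culture.
    if metrics.get("NrConsensusMultiple", 0) > 0:
        warnings.append("Multiple alleles > 0: possible contamination")
    # The slide states the value should be < 2 (assumption: flag 2 and up).
    if metrics.get("NrDifferent", 0) >= 2:
        warnings.append("Discrepant alleles not < 2: possible contamination")
    # Core percent is only checked when it is reported for the entry.
    core = metrics.get("CorePercent")
    if core is not None and core < 95:
        warnings.append(
            "Core percent below 95: possible contamination or isolate mix-up"
        )
    return warnings
```

An empty list means none of these three checks raised a flag; it does not replace the comparison of present alleles against the number expected for the species.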


QC Metrics Summary Table (PHLs)

• ¥ Sequences may be usable if the Q-scores are between 29.00-29.99 or 28.00-28.99, as long as the coverage is 10X or 20X higher than shown in the table, respectively. If the Q-score falls below 28, then the sequence fails QC regardless of coverage.
• Sequences must pass QC metrics shown in red prior to submission to the PulseNet National Databases.
• Sequences that do not pass the other QC metrics (shown in black) may be submitted to the National Databases, but we may reject these sequences upon further QC review.
• * QC metric thresholds for Escherichia and Vibrio sequences have been determined using the 2 x 250bp chemistry only. QC metric thresholds for the other organism sequences have been determined using the 2 x 250bp and 2 x 150bp chemistries.
• Metrics in bold are found in the organism-specific databases only; other metrics can be found in the Reference ID Database.

Updated: January 14, 2019. Subject to change.

| Quality Metric | Listeria | Campylobacter | *Escherichia | Salmonella | *Vibrio |
|---|---|---|---|---|---|
| ¥ Raw data statistics: R1 & R2 Q-scores | ≥ 30 | ≥ 30 | ≥ 30 | ≥ 30 | ≥ 30 |
| ¥ De novo assembly: Average Coverage | ≥ 20X | ≥ 20X | ≥ 40X | ≥ 30X | ≥ 40X |
| De novo assembly: Sequence Length (Mbp) | ~2.8 – 3.1 | ~1.4 – 2.2 | ~4.2 – 5.9; ~4.2 – 4.9 (Shigella, rare spp.); ~4.9 – 5.9 (most serotypes) | ~4.4 – 5.6 (common serotypes usually ~5.0) | ~3.8 – 4.3 (Vc); ~4.9 – 5.5 (Vp); ~4.7 – 5.3 (Vv) |
| De novo assembly: Contigs | ≤ 100 | ≤ 200 (usually 10-50) | ≤ 600 (usually 100-500) | ≤ 400 (usually 50-200) | ≤ 200 (usually 50-100) |
| Summary Calls: Present Alleles | ≥ 2700 | ≥ 1350 (C. jejuni) (usually 1500-1700) | ≥ 3200 (usually 4400-5900) | ≥ 2500 (common serotypes usually ≥ 4200) | NA |
| Percent Core Present | ≥ 95% | ≥ 95% (C. jejuni) | ≥ 95% | ≥ 95% | NA |
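The interaction between the Q-score and coverage thresholds (the ¥ note) can be worked through in code. The Python sketch below encodes the minimum coverage values from the table and the sliding rule from the note. It is an illustration rather than the official QC logic: the function name is hypothetical, and applying the rule to the lower of the R1 and R2 Q-scores is an assumption, since the note does not say how differing R1 and R2 values are handled.

```python
# Minimum average coverage (X) per organism, from the summary table.
MIN_COVERAGE = {
    "Listeria": 20,
    "Campylobacter": 20,
    "Escherichia": 40,
    "Salmonella": 30,
    "Vibrio": 40,
}


def passes_q_and_coverage(organism, q_r1, q_r2, coverage):
    """Apply the Q-score and coverage rule from the ¥ note."""
    base = MIN_COVERAGE[organism]
    lowest_q = min(q_r1, q_r2)  # assumption: the weaker read decides
    if lowest_q >= 30:
        extra = 0    # standard threshold applies
    elif lowest_q >= 29:
        extra = 10   # Q 29.00-29.99 needs 10X above the table value
    elif lowest_q >= 28:
        extra = 20   # Q 28.00-28.99 needs 20X above the table value
    else:
        return False  # below Q28 fails regardless of coverage
    return coverage >= base + extra
```

For example, a Salmonella sequence with Q-scores of 29.5 and 31 would need at least 40X coverage under this reading, rather than the usual 30X.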

Sequence Quality Communications

Contact [email protected] and [email protected] for troubleshooting help.
– Provide WGS ID# and all of the quality metrics for the isolate

E-mails from PulseNet DBMs requesting repeat sequencing based on QC review at CDC.

For more information, contact CDC
1-800-CDC-INFO (232-4636)
TTY: 1-888-232-6348
www.cdc.gov

The findings and conclusions in this report are those of the authors and do not necessarily represent the official position of the Centers for Disease Control and Prevention.

Questions?

#PulseNet

Telephone: 404-639-4558
E-mail: [email protected]
Web: www.cdc.gov/pulsenet
