buscomp: busco compilation and comparison for assessing...

1
Richard J Edwards School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, Australia; [email protected] BUSCOMP: BUSCO Compilation and Comparison for Assessing Completeness in Multiple Genome Assemblies What makes a good genome assembly? E-mail: [email protected] https://www.slimsuite.unsw.edu.au https://edwardslab.blogspot.com.au @slimsuite BUSCO vs BUSCOMP comparisons were performed by manipulating a real set of BUSCO-containing contigs from a de novo tiger snake “pseudodiploid” assembly of 10x Chromium linked reads. § Coverage (not “X” depth) § What proportion of the genome is covered by the assembly? § Accuracy § How many base errors and indels are there? § How much is collapsed/duplicated/junk/misassembled? § Contiguity § How fragmented is the genome? Ø Number and length of pieces (e.g. N(G)50 , L(G)50) § “Completeness” § How much of the genome is usefully represented in the assembly? Ø Coverage + Accuracy + Contiguity BUSCO Assessment of genome completeness § Benchmarking Universal Single-Copy Orthologues § Genes found (once) in most organisms in taxonomic group Ø e.g. 3,950 BUSCO genes in tetrapods “Although this gene prediction approach may have its limitations and biases, they are consistent across different species, making for fair comparisons.” Breadth of coverage Depth of coverage § Easy automated comparison of BUSCO results across assemblies § Compilation of a maximal set of best complete BUSCOs across assemblies § Useful gene predictions for phylogenomics etc. § Combined BUSCO ratings for groups of assemblies § Robust benchmarking of “completeness” § Consistent, predictable, intuitive results § Robust to accuracy and gene prediction § Complementary: NOT trying to replace BUSCO! Complete BUSCOs Polished Assembly Assembly Polished Assembly Assembly Polished Assembly Assembly Polished Assembly Assembly BUSCOMP Workflow & Aims DNA Organism Polished Assembly Assembly Sequence Assemble Assembly Assemble Assembly Assemble 1 Summarise Assemblies Assemblies BUSCO Complete Genes BUSCO Summaries Compile BUSCOs 2 Compare BUSCO results 3 Compile best BUSCO gene predictions 4 Map genes onto assemblies with Minimap2 5 BUSCOMP ratings 6 BUSCOMP report 7 https://github.com/slimsuite/buscomp Error correct BUSCOMP output: fasta, tables, Rmarkdown & HTML Conclusions & Availability § BUSCOMP is a useful complement to BUSCO: § Estimating total BUSCO completeness § Compiling “Best” BUSCO set for phylogenomics § Comparing multiple assemblies of the same genome § BUSCOMP is freely available on GitHub: § https://github.com/slimsuite/buscomp Comparison data – 10x pseudodiploid tiger snake assembly Adding or removing contigs changes BUSCO scores for other contigs BUSCOMP scores robust to adding or removing other contigs Re-running BUSCO on same data generates slightly different results Re-running BUSCOMP on same data gives same results Reverse-complement gives different BUSCO score Reverse-complement gives more similar BUSCOMP scores Duplicated contigs are not 100% Duplicate in BUSCO Duplicated contigs are 100% Duplicate in BUSCOMP Random sequences = 0 BUSCO Random sequences = 0 BUSCOMP Adding random sequences changes BUSCO completeness Adding random sequences does not change BUSCOMP completeness Case study : cane toad assemblies v2.2 v2.0 PacBio v2.1 WTDBG2 Illumina DBG2OLC WTDBG2 Hybrid Arrow Pilon Transcriptomes BUSCO RATINGS Complete: Single-copy hits where "BUSCO matches have scored within the expected range of scores and within the expected range of length alignments to the BUSCO profile." Duplicated: As Complete but 2+ copies. Fragmented: "BUSCO matches ... within the range of scores but not within the range of length alignments to the BUSCO profile." Missing: "Either no significant matches at all, or the BUSCO matches scored below the range of scores for the BUSCO profile." BUSCOMP: BUSCO Compilation and Comparison BUSCOMP RATINGS Complete: 95%+ Coverage in a single contig/scaffold. (Note: unlike BUSCO, accuracy/identity is not considered. This will make it more prone to misidentification of closely related paralogues.) Duplicated: 95%+ Coverage in 2+ contigs/scaffolds. Fragmented: 95%+ combined coverage but not in any single contig/scaffold. (Note: as with BUSCOMP Complete ratings, these might include part of closely related paralogues.) Partial: 40-95% combined coverage. (Note: these might be marked as "Fragmented" by BUSCO, which does not discriminate between "split" full-length hits, and single partial hits.) Ghost: Hits meet local length & identity cut-offs but <40% combined coverage. Missing: No hits meeting local length & identity cut-offs. BUSCO vs BUSCOMP: Robustness and Reproducibility Primary Assembly Supernova Pseudodiploid (NR) Alternative Assembly BUSCO Contigs Primary BUSCO Contigs dipnr dipnr.pri dipnr.alt dipbusco phapbusco Repeat BUSCO x 10 Repeat BUSCO x 10 Repeat BUSCO x 10 Repeat BUSCO x 10 Repeat BUSCO x 10 Repeat BUSCO x 10 Repeat BUSCO x 10 Repeat BUSCO x 10 Repeat BUSCO x 10 Repeat BUSCO x 10 dipnr.rep1 dipnr.rep2 dipnr.rep10 Copy Reverse complement phapbusco.copy phapbusco.revcomp Shuffle Shuffle 2 Shuffle 3 phapbusco.shuffle1 phapbusco.shuffle2 phapbusco.shuffle3 Duplicated Triplicated phapbusco.trip phapbusco.dup Real + Random Real + 2 x Random Real + 3 x Random phapbusco.2n phapbusco.3n phapbusco.4n 1 2 3 4 1 - REPRODUCIBILITY 2 - PSEUDODIPLOIDY 3 - DUPLICATION 4 - RANDOM Primary and/or Alternative assemblies versus BUSCO-containing contigs only Repeated analysis of the same input assembly Duplicating and/or reverse-complementing complete sets of contigs Adding increasing numbers of randomly shuffled contig sequences Repeat Filter Random Duplicate BUSCO completeness is highly dependent on sequence quality BUSCOMP can help identify (lack of) deficiencies in new assemblies without polishing WTDBG2 v2.2 v2.0 v2.1 Transcriptome Transcriptome v2.2 transcripts BUSCOMP finds more genes in transcripts but fewer are rated complete § BUSCOMP is robust to contig filtering and sequence quality § BUSCOMP reveals that WTDBG2 assembly is more complete than draft genome, despite poorer BUSCO scores. (No need to polish first.) Pure (~20X) PacBio cane toad assembly compared to published draft genome. BUSCO and BUSCOMP run on each assembly.

Upload: others

Post on 25-Jan-2021

0 views

Category:

Documents


0 download

TRANSCRIPT

  • Richard J EdwardsSchool of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, Australia; [email protected]

    BUSCOMP: BUSCO Compilation and Comparison for Assessing Completeness in Multiple Genome Assemblies

    What makes a good genome assembly?

    E-mail: [email protected]://www.slimsuite.unsw.edu.auhttps://edwardslab.blogspot.com.au @slimsuite

    BUSCO vs BUSCOMP comparisons were performed by manipulating a real set of BUSCO-containing contigs from a de novo tiger snake “pseudodiploid” assembly of 10x Chromium linked reads.

    § Coverage (not “X” depth)§ What proportion of the genome is covered by the assembly?

    § Accuracy§ How many base errors and indels are there?§ How much is collapsed/duplicated/junk/misassembled?

    § Contiguity§ How fragmented is the genome?

    Ø Number and length of pieces (e.g. N(G)50 , L(G)50)

    § “Completeness”§ How much of the genome is usefully represented in the assembly?

    Ø Coverage + Accuracy + Contiguity

    BUSCO Assessment of genome completeness§ Benchmarking Universal Single-Copy Orthologues

    § Genes found (once) in most organisms in taxonomic groupØ e.g. 3,950 BUSCO genes in tetrapods

    “Although this gene prediction approach may have its limitations and biases, they are consistent across different species,

    making for fair comparisons.”

    Breadth of coverage

    Depthof

    coverage

    § Easy automated comparison of BUSCO results across assemblies

    § Compilation of a maximal set of best complete BUSCOs across assemblies§ Useful gene predictions for phylogenomics etc.§ Combined BUSCO ratings for groups of assemblies

    § Robust benchmarking of “completeness”§ Consistent, predictable, intuitive results§ Robust to accuracy and gene prediction

    § Complementary: NOT trying to replace BUSCO!

    CompleteBUSCOs

    PolishedAssemblyAssembly PolishedAssemblyAssembly PolishedAssemblyAssembly PolishedAssemblyAssembly

    BUSCOMP Workflow & Aims

    DNAOrganism

    PolishedAssemblyAssembly

    Sequence

    Assemble Assembly

    Assemble Assembly

    Assemble

    1 SummariseAssemblies

    Assemblies

    BUSCO Complete Genes

    BUSCO Summaries

    CompileBUSCOs2

    CompareBUSCO results

    3

    Compile bestBUSCO gene predictions

    4

    Map genes onto assemblies with Minimap25

    BUSCOMP ratings6

    BUSCOMP report7

    https://github.com/slimsuite/buscomp

    Error correct

    BUSCOMP output: fasta, tables, Rmarkdown & HTML

    Conclusions & Availability

    § BUSCOMP is a useful complement to BUSCO:§ Estimating total BUSCO completeness§ Compiling “Best” BUSCO set for phylogenomics§ Comparing multiple assemblies of the same genome

    § BUSCOMP is freely available on GitHub:§ https://github.com/slimsuite/buscomp

    Comparison data – 10x pseudodiploid tiger snake assembly

    Adding or removingcontigs changes

    BUSCO scores forother contigs

    BUSCOMP scoresrobust to adding

    or removing other contigs

    Re-running BUSCO on same data

    generates slightly different results

    Re-running BUSCOMP on

    same data gives same results

    Reverse-complement gives different BUSCO score

    Reverse-complement gives more similar BUSCOMP scores

    Duplicated contigs are not 100% Duplicate in BUSCO

    Duplicated contigs are 100% Duplicate in BUSCOMP

    Random sequences = 0 BUSCO Random sequences = 0 BUSCOMP

    Adding random sequences changes

    BUSCO completeness

    Adding random sequences does not change

    BUSCOMP completeness

    Case study : cane toad assemblies

    v2.2

    v2.0

    PacBio

    v2.1

    WTDBG2

    Illumina

    DBG2OLC

    WTDBG2

    Hybrid

    Arrow

    Pilon

    Transcriptomes

    BUSCO RATINGSComplete: Single-copy hits where "BUSCO matches have scored within the expected range of scores and within the expected range of length alignments to the BUSCO profile."

    Duplicated: As Complete but 2+ copies.

    Fragmented: "BUSCO matches ... within the range of scores but not within the range of length alignments to the BUSCO profile."Missing: "Either no significant matches at all, or the BUSCO matches scored below the range of scores for the BUSCO profile."

    BUSCOMP: BUSCO Compilation and Comparison

    BUSCOMP RATINGSComplete: 95%+ Coverage in a single contig/scaffold. (Note: unlike BUSCO, accuracy/identity is not considered. This will make it more prone to misidentification of closely related paralogues.)Duplicated: 95%+ Coverage in 2+ contigs/scaffolds.Fragmented: 95%+ combined coverage but not in any single contig/scaffold. (Note: as with BUSCOMP Complete ratings, these might include part of closely related paralogues.)Partial: 40-95% combined coverage. (Note: these might be marked as "Fragmented" by BUSCO, which does not discriminate between "split" full-length hits, and single partial hits.)Ghost: Hits meet local length & identity cut-offs but