the ncbi eukaryotic genome annotation pipeline and alternate genomic sequences

26
GRC Assembly Analysis Workshop At Genome Informatics September 21, 2014 The NCBI Eukaryotic Genome Annotation Pipeline And Alternate Genomic Sequences Paul Kitts NCBI National Center for Biotechnology Information

Upload: genome-reference-consortium

Post on 24-Jan-2015

220 views

Category:

Science


2 download

DESCRIPTION

GRC Workshop at Churchill College on Sep 21, 2014. This is Paul Kitt's talk describing the NCBI approach to annotation the full human reference assembly.

TRANSCRIPT

Page 1: The NCBI Eukaryotic Genome Annotation Pipeline and Alternate Genomic Sequences

GRC Assembly Analysis Workshop At Genome InformaticsSeptember 21, 2014

The NCBI Eukaryotic Genome Annotation Pipeline And Alternate Genomic Sequences

Paul KittsNCBI

National Center for Biotechnology Information

Page 2: The NCBI Eukaryotic Genome Annotation Pipeline and Alternate Genomic Sequences

Genomes Annotated By NCBI

Human GRCh382014-02-03

Zebrafish GRCz10in progress

Mouse GRCm38.p22013-12-27

Page 3: The NCBI Eukaryotic Genome Annotation Pipeline and Alternate Genomic Sequences

Outline

• Overview of the NCBI Eukaryotic Genome Annotation Pipeline• What to do with alternate loci & patch scaffolds?• How we use the alt/patch/PAR alignments to inform our annotation• Examples:

– Annotation only on alternate loci– Different alleles annotated on primary assembly and alternate loci– Annotation improved by patches– Pseudoautosomal Regions annotated consistently on X & Y

• Recent enhancements:– Using RNA-Seq evidence for gene prediction– Gap-filling gene models using transcript sequences– Annotation reports

Page 4: The NCBI Eukaryotic Genome Annotation Pipeline and Alternate Genomic Sequences

Eukaryotic Genome Annotation Pipeline Overview

Page 5: The NCBI Eukaryotic Genome Annotation Pipeline and Alternate Genomic Sequences

Ranking Alignments

• Rank alignments for each query sequence– using a quality score that combines identity & coverage– Rank-1 > Rank-2 > Rank-3…

• Conflicting alignments cannot have same rank– alignments of the same query sequence to an assembly

conflict if they have significant overlap (>= 30%)– Insignificant

– Significant

• A subset of rank-1 alignments is used for annotation

Span in alignment B

Span in alignment A

Span in alignment B

Span in alignment A

Page 6: The NCBI Eukaryotic Genome Annotation Pipeline and Alternate Genomic Sequences

mRNA-F1

Annotation Of A Simple Assembly Using Ranked Alignments

mRNA-F1

mRNA-F2

Input mRNAsGenes in the assembly

mRNA-F2

Unplaced scaffold1

mRNA-F1

Filter out alignments that are not rank-1

GeneF1 GeneF2Chr1

GeneF1Chr1

Resulting annotation

GeneF2 Unplaced scaffold1

mRNA-F2 mRNA-F1* * **

* * *mRNA-F1mRNA-F2* *

Rank alignments

Unplaced scaffold1

GeneF2Chr1 GeneF1

Rank-1

Rank-2

Rank-3Rank-1

Rank-2

Align mRNAs

Unplaced scaffold1GeneF1 GeneF2Chr1

Page 7: The NCBI Eukaryotic Genome Annotation Pipeline and Alternate Genomic Sequences

What to do with alternate loci & patch scaffolds?

1. Omit the alternate loci & patch scaffolds2. Include the alternate loci & patch scaffolds;

no special treatment3. Include the alternate loci & patch scaffolds;

use known relationships to primary assembly

Page 8: The NCBI Eukaryotic Genome Annotation Pipeline and Alternate Genomic Sequences

Gene1/A G2-Allele-APrimary Chr1

Resulting annotation

Gene2

mRNA-3A* * *

Annotation Omitting Alt-scaffolds

mRNA-1A

mRNA-1B

mRNA-2A

Input mRNAs

Gene3

Primary Chr1

Alt-scaffold1

Genes/Alleles represented in the assembly

Gene1/A Gene2

Gene1/B

Alt-scaffold2 mRNA-3A

Gene3

Scenario 1: no annotation for Gene3 no annotation for Gene1/Allele-B

mRNA-1A

Rank-1 mRNA alignments

Gene1/A Gene2Primary Chr1

mRNA-2A

✗✔

Scenario 2: Gene3 annotated at the wrong location no annotation for Gene1/Allele-B

Page 9: The NCBI Eukaryotic Genome Annotation Pipeline and Alternate Genomic Sequences

Gene1/A G2-Allele-A

Gene3

Gene4

Primary Chr1

Alt-scaffold2

Alt-scaffold1

Resulting annotation

Gene2

Annotation Using Alt-scaffolds Without Alt-to-primary Alignments

mRNA-1A

mRNA-1B

mRNA-2A

Input mRNAs

Gene3

Primary Chr1

Alt-scaffold1

Genes/Alleles represented in the assembly

Gene1/A Gene2

Gene1/B

Alt-scaffold2

✔✗

mRNA-3A

mRNA-1A

mRNA-1B

mRNA-3A

Rank-1 mRNA alignments

Gene1/A Gene2

Gene3

Gene1/B

Primary Chr1

Alt-scaffold2

Alt-scaffold1

mRNA-2A

Page 10: The NCBI Eukaryotic Genome Annotation Pipeline and Alternate Genomic Sequences

Gene1/A G2-Allele-A

Gene3

Gene1/B

Primary Chr1

Alt-scaffold2

Alt-scaffold1

Resulting annotation

Gene2

Annotation Using Alt-scaffolds & Alt-to-primary Alignments

mRNA-1A

mRNA-1B

mRNA-2A

Input mRNAs

Gene3

Primary Chr1

Alt-scaffold1

Genes/Alleles represented in the assembly

Gene1/A Gene2

Gene1/B

Alt-scaffold2

alt-to-primary alignment

✔✔

mRNA-3A

mRNA-1A

mRNA-1B

mRNA-3A

Rank-1 mRNA alignments

Gene1/A Gene2

Gene3

Gene1/B

Primary Chr1

Alt-scaffold2

Alt-scaffold1

mRNA-2A

Page 11: The NCBI Eukaryotic Genome Annotation Pipeline and Alternate Genomic Sequences

Pros & cons of different choices for dealing with alternate loci & patch scaffolds

1. Omit the alternate loci & patch scaffoldsPros: Easy to implementCons: No representation for genes or alleles only on alts. Incorrect models for genes that have been patched.

2. Include the alternate loci & patch scaffolds; no special treatmentPros: Easy to implementCons: Incorrectly annotate genes that have alternate alleles or patches as if they were paralogs. Wrongly penalize sequences for having multiple or ambiguous placements.

3. Include the alternate loci & patch scaffolds;use known relationships to primary assemblyPros: Genes only on alts are annotated. Correctly annotate genes with alternate alleles. Correctly annotate patched genes Cons: Requires software and pipelines changes

Page 12: The NCBI Eukaryotic Genome Annotation Pipeline and Alternate Genomic Sequences

Eukaryotic Genome Annotation Pipeline: Steps using alt-to-primary alignments

Alt-to-primaryalignments

Curated genelocalization

Page 13: The NCBI Eukaryotic Genome Annotation Pipeline and Alternate Genomic Sequences

Ranking Alignments Across Assembly Units

• Create graph of related alignments– Alignments that are collocated or mappable– Transcript/protein to genomic– Alt or patch scaffold to primary assembly

• Partition graph into clusters– Each alignment in the cluster is related to at least one other

alignment in the same cluster– No alignment is related to any alignment in another cluster– Split conflicting alignments within a cluster into separate groups– Merge non-conflicting clusters into groups

• Evaluate groups, sort and assign ranks– All alignments in a group get the same rank

Page 14: The NCBI Eukaryotic Genome Annotation Pipeline and Alternate Genomic Sequences

Ranked Alignment Groups Across Assembly Units

Assembly unitAssembly alignmentmRNA1 alignmentmRNA2 alignmentClusterRank group

Assembly1-Primary

Assembly1-Alt1

Assembly1-Alt2

Rank-1

Rank-2

Page 15: The NCBI Eukaryotic Genome Annotation Pipeline and Alternate Genomic Sequences

Ranked Alignment Groups Across Assemblies

Assembly unitAssembly alignmentmRNA1 alignmentmRNA2 alignmentClusterRank group

Assembly1-Primary

Assembly1-Alt1

Assembly1-Alt2Rank-1

Rank-2

Assembly2-Primary

Assembly3-Primary

Page 16: The NCBI Eukaryotic Genome Annotation Pipeline and Alternate Genomic Sequences

Ranked Alignment Groups Across Pseudoautosomal Regions (PARs)

ChromosomePAR alignmentmRNA1 alignmentmRNA2 alignmentClusterRank group

Chromosome Y

Chromosome X

Rank-1

PAR#1 PAR#2

Page 17: The NCBI Eukaryotic Genome Annotation Pipeline and Alternate Genomic Sequences

NCBI> Gene> “Homo sapiens”[orgn] AND "only annotated on alternate loci in reference assembly"[Text Word] AND gene_nucleotide_pos[filter]

Genes Only Annotated On GRCh38 Alternate Loci

NCBI> Gene> “Homo sapiens”[orgn] AND "only annotated on alternate loci in reference assembly"[Text Word] AND gene_nucleotide_pos[filter] AND “genetype protein coding”[prop] AND srcdb_refseq_known[prop]

Num. Gene Type20 Protein Coding40 Protein Coding (model)21 Pseudogene32 Pseudogene (model)32 ncRNA (model)

5 Other 3 Other (model)

Page 18: The NCBI Eukaryotic Genome Annotation Pipeline and Alternate Genomic Sequences

Different Alleles Annotated On GRCh38 Primary Assembly And Alternate Loci

ALT_REF_LOCI_2

ALT_REF_LOCI_7

NM_001243042.1 comment: This variant represents the C*07:01:01:01 allele of the HLA-C gene.

NM_002117.5 comment: This variant represents the C*07:02:01 allele of the HLA-C gene.

Page 19: The NCBI Eukaryotic Genome Annotation Pipeline and Alternate Genomic Sequences

Annotation Of GRCh37 Improved By Patch Scaffold

EPPK1 gene on primary assembly chromosome 8 has an internal deletion.EPPK1 gene on patch scaffold is complete.

Primary Assembly chromosome 8

Patch scaffold HG104_HG975_PATCH

Page 20: The NCBI Eukaryotic Genome Annotation Pipeline and Alternate Genomic Sequences

Pseudoautosomal Regions Annotated Consistently on GRCh38 chromosomes X & Y

Page 21: The NCBI Eukaryotic Genome Annotation Pipeline and Alternate Genomic Sequences

Recent Enhancements To The Genome Annotation Pipeline:#1 Using RNA-Seq Evidence For Gene Prediction

0

10000

20000

30000

40000

50000

60000

70000

80000

Number of coding transcripts predicted +/- RNA-Seq

Chicken

CowHorse

Human

Mouse Pig Rat

Soybean

Zebrafish0

10000

20000

30000

40000

50000

60000

Number of genes predicted +/- RNA-Seq

Without RNA-Seq

With RNA-Seq

75 organisms annotated with RNA-Seq data

Page 22: The NCBI Eukaryotic Genome Annotation Pipeline and Alternate Genomic Sequences

Example Of Tracks Made Using RNA-Seq Data

NCBI > GENE > Xenopus (Silurana) tropicalis nbr1 [neighbor of BRCA1 gene 1]

Page 23: The NCBI Eukaryotic Genome Annotation Pipeline and Alternate Genomic Sequences

Recent Enhancements To The Genome Annotation Pipeline:#2 Gap-filling Gene Models Using Transcript Sequences

Genomic sequence

Transcript alignment 1 32 4

RefSeq model

Gap

How gap-filling works

Reporting of gap-filled regions

Page 24: The NCBI Eukaryotic Genome Annotation Pipeline and Alternate Genomic Sequences

Recent Enhancements To The Genome Annotation Pipeline:#3 Annotation Reports

RNA-Seq

Page 25: The NCBI Eukaryotic Genome Annotation Pipeline and Alternate Genomic Sequences

Summary

Including the alternate loci & patch scaffolds and using their known relationships to the primary assembly significantly improves the annotation of GRC assemblies.

It is worth the extra effort!

Page 26: The NCBI Eukaryotic Genome Annotation Pipeline and Alternate Genomic Sequences

CREDITSGenome pipeline infrastructureAlex AstashynNathan BoukRob CohenMike DicuccioEric EngelsonOlga ErmoloevaWratko HlavinaLucian IonAvi KimchiBoris KiryutinDavid ManagadzeEyal MozesTerence MurphyDaniel RauschRobert SmithSasha SouvorovCraig WallinAlex Zasypkin

Eukaryotic annotation setup & execution

Françoise Thibaud-NissenJinna ChoiPatrick MastersonKim Pruitt and the “genome champions”

from the RefSeq group

Genomic Collections DBAvi KimchiVictor SapojnikovCharlie XiangAndrey Zherikov

Genome assemblies with alt/patch to primary alignmentsGenome Reference Consortium

The Wellcome Trust Sanger InstituteThe Genome Institute at Washington UniversityThe European Bioinformatics InstituteThe National Center for Biotechnology Information

Eukaryotic Genome Annotation at NCBI: www.ncbi.nlm.nih.gov/genome/annotation_euk/