Upload: eugenia-parsons

Post on 02-Jan-2016

Page 1: NCBI’s Genome Annotation: Overview

NCBI’s Genome Annotation: Overview

• Incremental processing
• Re-annotation (batch)
• Post-annotation review
• Case studies

NOTE: limiting discussion to annotation of genes and pseudogenes

Donna Maglott, Ph.D.
for the RefSeq and Annotation groups

Page 2: NCBI’s Genome Annotation: Overview

NCBI: Incremental processing

• Maintain gene/sequence relationship
  – Data sources
    • MGI’s ftp site (names, MGI ids, sequence accessions)
    • Sequences and annotation from INSD (DDBJ, EMBL, GenBank)
    • UniProt (names, sequence)
    • CCDS Collaboration (CDS definition; Kim will expand on this later.)
    • Gene family-specific databases
    • HomoloGene
    • UniGene
    • Scientific community
  – Actions
    • Create or update data via Entrez Gene
    • Create or update RefSeq sequences
    • Identify conflicts and discuss with stakeholders
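The "identify conflicts" action can be illustrated with a minimal sketch. It assumes each data source has been reduced to a mapping from gene symbol to a set of sequence accessions, and it flags any accession that different sources (or the same source) tie to more than one gene. The function name, the data layout, and the placeholder symbols and accessions are all hypothetical; this is not NCBI's actual pipeline.

```python
from collections import defaultdict


def find_conflicts(sources):
    """Flag accessions associated with more than one gene.

    sources: {source_name: {gene_symbol: set_of_accessions}}  (assumed layout)
    Returns {accession: {gene_symbol: [source_names]}} for conflicting accessions.
    """
    # accession -> gene symbol -> set of sources asserting that association
    seen = defaultdict(lambda: defaultdict(set))
    for source, genes in sources.items():
        for gene, accessions in genes.items():
            for acc in accessions:
                seen[acc][gene].add(source)

    # Keep only accessions linked to two or more distinct genes
    return {
        acc: {gene: sorted(srcs) for gene, srcs in genes.items()}
        for acc, genes in seen.items()
        if len(genes) > 1
    }


# Placeholder data, not real records
example_sources = {
    "MGI": {"GeneA": {"ACC0001", "ACC0002"}},
    "UniProt": {"GeneA": {"ACC0001"}, "GeneB": {"ACC0002"}},
}
conflicts = find_conflicts(example_sources)
```

Each flagged accession would then be a candidate for discussion with the stakeholders listed above, rather than something resolved automatically.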

Page 4: NCBI’s Genome Annotation: Overview

Changed annotation after release of Mm37.1

Annotated as version 3, overlapping model at 5’ end

Page 5: NCBI’s Genome Annotation: Overview
Page 6: NCBI’s Genome Annotation: Overview

Hi MGI,

I believe Tnrc18 (geneID: 381742, MGI:3648294) and Zfp469 (geneID: 231861, MGI:2448535) should be merged. This region of chr 5 has major assembly problems in the reference assembly, but the Celera assembly appears to accurately represent the structure as compared to transcript data and the orthologous regions of the human and rat genomes. In the reference assembly, Zfp469 and Tnrc18 are on separate scaffolds… there are multiple mouse and human transcripts spanning both loci…

Zfp469 is currently represented as NM_178242.2 (based on BC049818.1), and this appears to be a valid transcript variant that uses a well-supported early polyA signal/site. However, mouse AK141849.1, CB249799.1, AK173280.1, and human AB058759.1 all extend past this early polyA signal/site to include an additional 13 exons that overlap with Tnrc18 and potentially encode an additional 1125 aa at the C-terminus…

There is also an issue of nomenclature. I cannot find any evidence of the transcripts associated with Zfp469 encoding a zinc finger protein…the long variant does contain a trinucleotide repeat so the Tnrc18 name may be more appropriate, although the repeat is not found in the shorter NM_178242.2 variant. Also be aware that we had mis-associated the human TNRC18 nomenclature (HGNC: 11962)..

--------
Terence Murphy

Identify conflicts and discuss with stakeholders

Page 7: NCBI’s Genome Annotation: Overview

One Gene or Two?

BLAST alignment of human RefSeq NM_001080495.2 to Mm37.1

Page 8: NCBI’s Genome Annotation: Overview

Exon coverage better in Celera assembly

BLAST alignment of human RefSeq NM_001080495.2 to Celera

Page 9: NCBI’s Genome Annotation: Overview

NCBI: Incremental processing

• Maintain gene/sequence relationship (cont’d)

  – Products
    • RefSeq sequences via
      – Entrez Nucleotide
      – Entrez Protein
      – ftp
    • Gene-specific data via
      – Entrez Gene
      – ftp
    • Nomenclature propagated to UniGene and HomoloGene

Page 10: NCBI’s Genome Annotation: Overview

NCBI: Re-annotation of genes

• Timing
  – Always with a new assembly
  – May occur without a re-assembly
• Evidence used
  – cDNAs aligned to the genome by Splign
  – Proteins aligned to the genome by ProSplign
  – Ab initio predictions (Gnomon)
  – Annotated RefSeq genomic sequences

Page 11: NCBI’s Genome Annotation: Overview

NCBI: Re-annotation

• Tracking/identification of annotation (decreasing weight)
  – Best RefSeq placement (Splign/global alignment)
  – Comparison to previous annotation
    • Assembly to assembly
    • Clone to clone
    • ‘product’ to ‘product’
  – Best GenBank placement
• Products of annotation
  – If tracked, reassign GeneID and RefSeq model accession(s)
  – If novel and transcribed, assign new GeneID and RefSeq model accession(s)
  – If novel pseudogene, assign new GeneID and annotate on the genomic sequence without assigning a model RefSeq accession
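The tracking order and the three product rules above can be read as a small decision procedure. The sketch below is one such reading, with assumptions stated plainly: "decreasing weight" is treated as a strict priority order (the slide does not say whether weights are combined), a pseudogene flag is checked before the transcribed flag, and every class, field, and function name is hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Model:
    """Hypothetical summary of the tracking evidence for one annotated model."""
    refseq_gene_id: Optional[str] = None    # from best RefSeq placement
    previous_gene_id: Optional[str] = None  # from comparison to previous annotation
    genbank_gene_id: Optional[str] = None   # from best GenBank placement
    transcribed: bool = False
    pseudogene: bool = False


def track(model: Model) -> Tuple[Optional[str], Optional[str]]:
    """Return (GeneID, evidence) from the highest-priority evidence available."""
    ranked = (
        ("best RefSeq placement", model.refseq_gene_id),
        ("comparison to previous annotation", model.previous_gene_id),
        ("best GenBank placement", model.genbank_gene_id),
    )
    for evidence, gene_id in ranked:
        if gene_id is not None:
            return gene_id, evidence
    return None, None


def assign(model: Model, new_gene_id: str) -> Optional[dict]:
    """Apply the product rules; new_gene_id is used only for novel loci."""
    gene_id, evidence = track(model)
    if gene_id is not None:
        # Tracked: reassign the existing GeneID and model accession(s)
        return {"gene_id": gene_id, "model_accession": True, "basis": evidence}
    if model.pseudogene:
        # Novel pseudogene: new GeneID, annotated on the genome, no model accession
        return {"gene_id": new_gene_id, "model_accession": False,
                "basis": "novel pseudogene"}
    if model.transcribed:
        # Novel and transcribed: new GeneID and model accession(s)
        return {"gene_id": new_gene_id, "model_accession": True,
                "basis": "novel, transcribed"}
    return None  # case not covered by the rules on the slide
```

A model with no tracking evidence and neither flag set falls through to `None`, since the slide does not describe that case.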

Page 12: NCBI’s Genome Annotation: Overview

One Gene or Two?

BLAST alignment of human RefSeq NM_001080495.2 to Mm37.1

Page 13: NCBI’s Genome Annotation: Overview

NCBI: Post annotation review

• Data reviewed
  – GeneIDs with sequence data, not annotated
  – GeneIDs annotated previously, not in the current annotation
  – CDS features NOT included in the CCDS set
• Actions taken
  – Create new RefSeqs if necessary
  – Update existing RefSeqs if necessary
  – Provide annotation comment to explain cases currently under review
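The three categories of reviewed data amount to set differences. As a rough sketch, assuming the relevant identifiers are already available as Python sets (the function and parameter names are hypothetical, and identifiers would come from whatever exports a reviewer has at hand):

```python
def review_queues(with_sequence, previous, current, cds_ids, ccds_ids):
    """Build the three review lists from sets of identifiers.

    with_sequence: GeneIDs that have associated sequence data
    previous:      GeneIDs annotated in the previous annotation run
    current:       GeneIDs annotated in the current run
    cds_ids:       identifiers of annotated CDS features
    ccds_ids:      identifiers of CDS features in the CCDS set
    """
    return {
        # GeneIDs with sequence data, not annotated
        "not_annotated": with_sequence - current,
        # GeneIDs annotated previously, not in the current annotation
        "dropped": previous - current,
        # CDS features not included in the CCDS set
        "cds_not_in_ccds": cds_ids - ccds_ids,
    }


# Placeholder identifiers, not real GeneIDs
queues = review_queues(
    with_sequence={"1", "2", "3"},
    previous={"2", "4"},
    current={"2", "3"},
    cds_ids={"cdsA", "cdsB"},
    ccds_ids={"cdsA"},
)
```

Each resulting set is a worklist for the curation actions listed above, not a decision in itself.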

Page 14: NCBI’s Genome Annotation: Overview

NCBI: Post annotation review

Page 15: NCBI’s Genome Annotation: Overview

NCBI: Representative review cases

• Under-representation of ncRNAs in the RefSeq set
  – May result in failure to annotate ncRNAs that overlap a protein-coding gene
  – More RefSeqs for this category are being generated
• Management of ‘read-through’ transcripts
  – Generate RefSeq if multiple lines of evidence
  – Discuss with all nomenclature groups

Page 16: NCBI’s Genome Annotation: Overview

NCBI: Representative review cases

• Adjudication of the name to be assigned to a given genomic location
  – Evaluate conserved synteny
  – Discuss with all nomenclature groups

Page 17: NCBI’s Genome Annotation: Overview

A case history: Arhgap27 and 5730442P18Rik

VEGA: One Gene
NCBI/MGI: Two Genes

Page 18: NCBI’s Genome Annotation: Overview

A case history: Arhgap27 and 5730442P18Rik

Contributing factors

Computation:
• Model prediction suggests one gene, but independent RefSeq mRNAs force two
• Only one gene is in the CCDS set

Curation:
• NCBI merged in 2005
• Reversed the merge in 2006 in response to request from MGI
• UniProt treats as one gene, Arhgap27
• Evidence: one cDNA in rat, one cDNA from mouse, Arhgap12 all consistent with the one-gene model

Page 20: NCBI’s Genome Annotation: Overview

A case history: A (vertical) read-through locus

http://tinyurl.com/2oy36j

Page 21: NCBI’s Genome Annotation: Overview

A case history: olfactory receptors

Action: Merge the Gene records in collaboration with MGI