The Consensus CoDing Sequence (CCDS) Database

Kim D. Pruitt, National Center for Biotechnology Information
Mouse Genome Annotation Summit Meeting, March 12-13, 2008

Page 1: The Consensus CoDing Sequence (CCDS) Database

National Center for Biotechnology Information

Kim D. Pruitt

Mouse Genome Annotation Summit Meeting

March 12-13, 2008

Page 2: Why is the CCDS project needed?

The Problem: Annotation of the genome sequence is essential, but beware of different interpretations!

• The availability of the human and mouse genome sequences has had a significant impact on disease and health research.

• Most scientists rely on annotation information when designing experiments and when interpreting and evaluating research results.

• Inconsistencies in annotation results among the main public resources hamper use of this important data.

• Researchers may not realize that a different annotation result is available elsewhere, possibly leading to erroneous or incomplete interpretations.

Page 3: CCDS - A collaborative project

• Initiated by the main public annotation/browser groups to address concerns in the scientific community about inconsistencies in the human and mouse genome annotation.

• Built by consensus among the collaborating members, which include:
– European Bioinformatics Institute (EBI)
– National Center for Biotechnology Information (NCBI)
– University of California, Santa Cruz (UCSC)
– Wellcome Trust Sanger Institute (WTSI)

Page 4: What is the CCDS project?

• Project goals
– Identify a core set of protein-coding genes that are consistently annotated and of high quality
– Support convergence toward a standard set of gene annotations

• Scope
– Human and mouse protein-coding regions

• Update frequency
– Variable
– Depends on the frequency of genome annotation updates

Page 5: Process flow – calculating updates

[Flow diagram] Inputs: Ensembl merged annotation (Havana manual + Ensembl computational) and NCBI annotation (NCBI computational + RefSeq manual). The CDSs from the two sources are compared (annotation + sequence), and the results then pass through QA.

Compare CDS result   Identical     Similar        Novel
Existing CCDS        Retain        Retain         Lost
New match            New CCDS ID   Out of scope   Out of scope
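The decision table on this slide can be sketched in a few lines of Python. This is only an illustration: the slide does not define "identical", "similar" or "novel", so the coordinate-based definitions in `compare_cds` are assumptions, and the function and dictionary names are hypothetical. The sketch also compares exon coordinates only, whereas the slide compares annotation plus sequence.

```python
from typing import List, Tuple

Exons = List[Tuple[int, int]]  # (start, end) pairs, inclusive coordinates

# Outcome table copied from the slide: (CCDS state, comparison) -> action.
OUTCOMES = {
    ("existing", "identical"): "retain",
    ("existing", "similar"): "retain",
    ("existing", "novel"): "lost",
    ("new", "identical"): "new CCDS ID",
    ("new", "similar"): "out of scope",
    ("new", "novel"): "out of scope",
}


def compare_cds(a: Exons, b: Exons) -> str:
    """Classify two CDS exon lists as 'identical', 'similar' or 'novel'.

    Assumption (not from the slide): 'similar' means the two CDSs overlap
    but differ, and 'novel' means they share no coding bases at all.
    """
    if sorted(a) == sorted(b):
        return "identical"
    overlaps = any(s1 <= e2 and s2 <= e1 for s1, e1 in a for s2, e2 in b)
    return "similar" if overlaps else "novel"


def outcome(has_ccds_id: bool, comparison: str) -> str:
    """Look up the action for an existing CCDS or a new match."""
    return OUTCOMES[("existing" if has_ccds_id else "new", comparison)]
```

For example, `outcome(False, compare_cds(x, y))` gives "new CCDS ID" only when the two groups annotate exactly the same exon coordinates.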

Page 6: Assessing quality

Quality assessment tests include:
– Consensus splice sites ("GY..AG" or "AT..AC")
– Valid start and stop codons with no internal stops
– NMD (nonsense-mediated decay)
– Low complexity
– Repeat-containing
– Insufficient protein homology
– Genome conservation
– Putative pseudogene

QA test results are reviewed by curators. Overrides are set to retain supported CDSs.

CCDS status is conservatively applied:
• Annotated CDS coordinates are identical
• Annotation is of high quality and passes QA tests or curator review
• Existing CCDS proteins can be flagged for review by the collaborating members
• Updates and removals are by consensus agreement
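Two of the QA tests listed on this slide are simple enough to sketch in Python. This is a minimal illustration, not the CCDS pipeline's code: the function names are hypothetical, the start codon is assumed to be ATG, and the stop codons are those of the standard genetic code.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code


def has_consensus_splice_sites(intron: str) -> bool:
    """True if the intron has GY..AG or AT..AC boundaries (Y = C or T)."""
    intron = intron.upper()
    if len(intron) < 4:
        return False
    donor, acceptor = intron[:2], intron[-2:]
    gy_ag = donor in ("GT", "GC") and acceptor == "AG"
    at_ac = donor == "AT" and acceptor == "AC"
    return gy_ag or at_ac


def has_valid_start_and_stop(cds: str) -> bool:
    """True if the CDS starts with ATG, ends with a stop, and has no internal stop."""
    cds = cds.upper()
    if len(cds) < 6 or len(cds) % 3 != 0:
        return False
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    if codons[0] != "ATG" or codons[-1] not in STOP_CODONS:
        return False
    return not any(codon in STOP_CODONS for codon in codons[1:-1])
```

A CDS that fails a check like these would not be rejected automatically; per the slide, curators review QA results and can set an override to retain a supported CDS.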

Page 7: CCDS counts

Date     Build    CCDS IDs   GeneIDs
Mar-05   Hs35.1   14,795     13,142
Feb-07   Hs36.2   18,290     16,008
Oct-06   Mm36.1   13,374     13,014
Nov-07   Mm37.1   17,707     16,893

Step                       Source    Genes   Proteins
Annotation                 NCBI      24765   26851
Annotation                 Ensembl   27209   39941
Matching CDS                         18185   19048
QA & curation rejections             1331    1350
Accepted rejections                  1292    1341
Final CCDS ID                        16893   17707
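The step table appears to be internally consistent under one reading, which is an assumption rather than something the slide states: the final CCDS count is the number of matching CDSs minus the accepted rejections. The final values also equal the Mm37.1 row of the build table. A quick check in Python, with hypothetical variable names:

```python
# Counts copied from the Step / Source / Genes / Proteins table.
matching_cds = {"genes": 18185, "proteins": 19048}
accepted_rejections = {"genes": 1292, "proteins": 1341}

# Assumption: final CCDS count = matching CDS count - accepted rejections.
final_ccds = {
    key: matching_cds[key] - accepted_rejections[key] for key in matching_cds
}

print(final_ccds)  # {'genes': 16893, 'proteins': 17707}
```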

Page 8: Curation – how are updates curated and coordinated?

CCDS curation is fully integrated with RefSeq curation.

• Any member of the collaboration can flag a CCDS for review in order to:
– Update the CDS definition (alter the N-terminus extent or an internal splice site)
– Withdraw the CCDS ID (insufficiently supported, or non-protein coding)

• NCBI provides a collaboration web site to coordinate this review

• All collaborators must agree with a change to finalize a decision

• Withdrawal of a CCDS may happen between genome annotation updates

• An update to a CCDS is indicated by:
– Status change: a status of ‘pending update’ is reported when there is collaborative agreement that a change is needed
– Version change: the CCDS version number is incremented once the change is reflected in public annotation. This occurs only after a genome annotation update and CCDS analysis have taken place.
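The two-stage update signal described on this slide (status first, version later) can be modelled roughly as below. This is a sketch under stated assumptions: the class, method names and the "public" status label are hypothetical, and the CCDS number in the usage note is made up. Only the ‘pending update’ status and the ID.version format (as in CCDS27750.1) come from the slides.

```python
from dataclasses import dataclass


@dataclass
class CcdsRecord:
    number: int
    version: int = 1
    status: str = "public"  # hypothetical label for an unflagged record

    @property
    def accession(self) -> str:
        """CCDS ID in the ID.version form used on the slides."""
        return f"CCDS{self.number}.{self.version}"

    def agree_update(self) -> None:
        """All collaborators agree a change is needed: status changes only."""
        self.status = "pending update"

    def apply_annotation_update(self) -> None:
        """A genome annotation update reflects the change: version increments."""
        if self.status == "pending update":
            self.version += 1
            self.status = "public"
```

With a made-up record `CcdsRecord(99999)`, the accession stays at `CCDS99999.1` after `agree_update()` and becomes `CCDS99999.2` only after `apply_annotation_update()`.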

Page 9: CCDS update & curation stats

Curation-based changes:

name    action     status    count
human   update     pending   366
human   update     agreed    557
human   withdraw   pending   189
human   withdraw   agreed    519
mouse   update     pending   185
mouse   update     agreed    57
mouse   withdraw   pending   16
mouse   withdraw   agreed    8

Totals shown beside the table, one per name/action pair in the order above: 923, 709, 242, 24

Annotation pipeline-based changes:

name    build   status                               count
human   35.1    Withdrawn, inconsistent annotation   133
human   36.2    Withdrawn, inconsistent annotation   29
mouse   36.1    Withdrawn, inconsistent annotation   29
mouse   37.1    Withdrawn, inconsistent annotation   4

Mouse: ~5200 curated CCDS genes

Page 10: Curation considerations

• Alignments
• Track low quality sequences (‘kill list’)
• Protein conservation
• Publications
• Personal communications
• QA measures

Page 11: Access – How do I know if an annotation has a CCDS ID?

• Genome browser displays
– NCBI
– UCSC

• Gene reports
– Ensembl
– NCBI
– UCSC
– Vega

• Other
– RefSeq annotation (NCBI)
– CCDS web site
– FTP

http://www.ncbi.nlm.nih.gov/CCDS/

Page 12: NCBI Map Viewer (chr.5)

[Screenshot] Link to CCDS Browser

Page 13: UCSC Browser

[Screenshot] chr5:30270000-30650000

Page 14: UCSC Browser – Tyms gene

[Screenshot] CCDS Browser

Page 15: Access of CCDS data at NCBI

• CCDS Database & Browser interface
• Project description
• Query support
• Reports attributes of the CCDS
– Location data
– Sequence members
– Status
• FTP reports

Page 16: CCDS Browser

History

Entrez Gene View CCDS Details

Find all CCDSs for the Gene

Page 17: CCDS Browser

• Mouse-over highlights codon
• Click to highlight codon and corresponding amino acid

Page 18: Biology is complex – some CCDS curation examples

• 1 vs. 2 vs. ‘n’ genes

• Translation start site

Page 19: 1 vs. 2 vs. ‘n’ genes

Curation considerations:
– Nomenclature
– History (scientific use, publications, etc.)
– Different (but similar) products vs. distinct products
– Shared promoters

Page 20

[Figure labels] carnitine palmitoyltransferase 1b, choline kinase beta

Page 21: 1 vs. 2 vs. ‘n’ genes

Current RefSeq representation of the region:

- two protein-coding loci: Chkb (CCDS27750.1) and Cpt1b (CCDS27749.1)

- one non-coding locus for the non-coding transcript product (a read-through transcript): Chkb-cpt1b (PMID:12761301)

Page 22: Translation start site

• Curation considerations
– Publication reports (CDS begins at ‘n’)
– Other cDNA sequencing reveals the ORF can be extended further upstream
– Evaluate:
  • Genome conservation
  • Literature reports for the protein
  • Putative Kozak signals
  • Presence of in-frame upstream stop codon
  • INSDC submissions from an experimental lab source that do have the longer ORF extent annotated
  • Consult with an expert
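One of the checks listed here, the presence of an in-frame upstream stop codon, lends itself to a short Python sketch. The function name and return convention are hypothetical, only ATG is treated as a start codon, and the other criteria on the slide (conservation, Kozak signals, literature) are not modelled.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code


def upstream_extension(transcript: str, start: int):
    """Scan in-frame codons upstream of the annotated start codon.

    Returns (has_upstream_stop, candidate_start). candidate_start is the
    0-based index of the most upstream in-frame ATG that is not separated
    from the annotated start by a stop codon, or `start` if none is found.
    """
    seq = transcript.upper()
    candidate = start
    pos = start - 3
    while pos >= 0:
        codon = seq[pos:pos + 3]
        if codon in STOP_CODONS:
            return True, candidate
        if codon == "ATG":
            candidate = pos
        pos -= 3
    return False, candidate
```

If no in-frame stop is found upstream, the reading frame stays open to the 5' end of the transcript, so the transcript may be 5'-incomplete and the true start could lie further upstream. That is the kind of case the slide suggests weighing against the other evidence.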

Page 23: Internal CCDS browser (restricted access)

Jmjd2d jumonji domain containing 2D (chr 19)

Page 24

Update is agreed on by all parties, resulting in a 258 aa N-terminal extension

Page 25: Examples – no CCDS ID

EBI+WTSI and NCBI transcript annotation may differ even though the gene includes annotations with CCDS IDs

Page 26: Examples – no CCDS ID

Reasons:
• Not found by one group
• Different CDS length
• Different splice sites
• Different internal exon
• Curation removal

[Four example panels, each comparing EBI/WTSI and NCBI annotation]

Page 27: Acknowledgements

Donna Maglott, Josh Cherry, Keith Oxenride, Craig Wallin, Andrei Shkeda

RefSeq Curators
NCBI Genome Annotation Group
NCBI Map Viewer Group

Collaborators at Ensembl, UCSC, Vega
Jen Ashurst & Vega curator group
Rachel Harte
Mark Diekhans
Steve Searle

Page 28: Ensembl – Tyms gene

Page 29: Vega browser – Tyms gene (chromosome 5: 30388989-30404404)