Church gmod2012 pt2
@deannachurch
Navigating Genome Resources at NCBI
Deanna M. Church, NCBI
The Evolution of the Reference Human Genome
Part 2
GenBank
Data Archives
• Data in a common format
• Data in a single location (and mirrored)
• Most quality checked prior to deposition
• Robust data tracking mechanism (accession.version)
• Data owned by submitter
Data tracking
ABC14-1065514J1    Gaps    Phase    Length    Date
FP565796.1    1    1    21-Oct-2009
FP565796.2    1    0    14-Oct-2010
FP565796.3    3    0    07-Nov-2010
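The accession.version identifiers listed above lend themselves to simple programmatic tracking. The following is a minimal sketch, assuming identifiers always take the form `ACCESSION.VERSION` with an integer version; the function names `parse_accession_version` and `latest_versions` are hypothetical and not part of any NCBI tool.

```python
from typing import NamedTuple


class AccessionVersion(NamedTuple):
    accession: str
    version: int


def parse_accession_version(identifier: str) -> AccessionVersion:
    """Split an identifier such as 'FP565796.3' into accession and version."""
    accession, sep, version = identifier.rpartition(".")
    if not sep or not accession or not version.isdigit():
        raise ValueError(f"not an accession.version identifier: {identifier!r}")
    return AccessionVersion(accession, int(version))


def latest_versions(identifiers):
    """Return the highest version seen for each accession."""
    latest = {}
    for identifier in identifiers:
        record = parse_accession_version(identifier)
        if record.version > latest.get(record.accession, 0):
            latest[record.accession] = record.version
    return latest


# The three versions of the clone record shown on the slide.
records = ["FP565796.1", "FP565796.2", "FP565796.3"]
print(latest_versions(records))
```

Keeping the version as an integer, rather than comparing strings, avoids ordering version 10 before version 2.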
Mouse chrX: 34,800,000-34,890,000
NC_000086 (versions .1 .2 .3 .4 .5 .6 .7)    CM001013 (versions .1 .2)
Mouse chrX: 35,000,000-36,000,000
X
MGSCv3 MGSCv36
hg19 GRCh37
mm8 MGSCv37
NCBIM37
danRer5 Zv7
What’s in a name?
By any other name…
chr21:8,913,216-9,246,964
Zv7 chr21:8,913,216-9,246,964 X Mouse Build 36 chrX
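The point of these slides is that a coordinate string means nothing without the assembly it refers to. As a minimal sketch, one way to enforce this in code is to make the assembly name a required part of every region. The `Region` class and `parse_region` function are hypothetical, and the second assembly name below is an arbitrary choice for illustration only.

```python
import re
from dataclasses import dataclass

# Matches strings such as "chr21:8,913,216-9,246,964" (commas optional).
REGION_PATTERN = re.compile(r"^(?P<chrom>[^:\s]+):(?P<start>[\d,]+)-(?P<end>[\d,]+)$")


@dataclass(frozen=True)
class Region:
    assembly: str
    chrom: str
    start: int
    end: int


def parse_region(assembly: str, text: str) -> Region:
    """Parse a coordinate string and bind it to a named assembly."""
    match = REGION_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized region: {text!r}")
    start = int(match["start"].replace(",", ""))
    end = int(match["end"].replace(",", ""))
    if start > end:
        raise ValueError(f"start is after end in {text!r}")
    return Region(assembly, match["chrom"], start, end)


# Same coordinate string, two different assemblies: not the same region.
zv7 = parse_region("Zv7", "chr21:8,913,216-9,246,964")
other = parse_region("GRCh37", "chr21:8,913,216-9,246,964")
print(zv7 == other)  # False
```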
Genome Browser Agreement
Submitter deposits assembly to GenBank/EMBL/DDBJ
Assembly QA
Submitter updates assembly based on QA results
Browsers pick up assembly from GenBank/EMBL/DDBJ
Assemblies must be in GenBank/EMBL/DDBJ
http://www.ncbi.nlm.nih.gov/genome/assembly
GRCh37 hg19
Assembly (e.g. GRCh37.p5): GCA_000001405.6 / GCF_000001405.17
Primary Assembly: GCA_000001305.1 / GCF_000001305.13
ALT 1: GCA_000001315.1 / GCF_000001315.1
ALT 2: GCA_000001325.1 / GCF_000001325.2
ALT 3: GCA_000001335.1 / GCF_000001335.1
ALT 4: GCA_000001345.1 / GCF_000001345.1
ALT 5: GCA_000001355.1 / GCF_000001355.1
ALT 6: GCA_000001365.1 / GCF_000001365.2
ALT 7: GCA_000001375.1 / GCF_000001375.1
ALT 8: GCA_000001385.1 / GCF_000001385.1
ALT 9: GCA_000001395.1 / GCF_000001395.1
Patches: GCA_000005045.5 / GCF_000005045.4
Non-nuclear assembly unit (e.g. MT): GCA_000006015.1 / GCF_000006015.1
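Each assembly unit above carries a GenBank (GCA_) and a RefSeq (GCF_) accession, each with its own version. A minimal sketch of how these could be parsed and paired follows. It assumes the nine-digit numeric part seen in the examples above, and it treats a shared numeric part as the sign that a GCA/GCF pair describes the same unit, which holds for every pair listed here but is an assumption rather than a documented rule. The names `parse_assembly_accession` and `is_paired` are hypothetical.

```python
import re
from typing import NamedTuple

# Prefix, nine-digit number, integer version, e.g. "GCF_000001405.17".
ASSEMBLY_ACCESSION = re.compile(r"^(GCA|GCF)_(\d{9})\.(\d+)$")


class AssemblyAccession(NamedTuple):
    source: str   # "GenBank" for GCA_, "RefSeq" for GCF_
    number: str   # numeric part, kept as text to preserve leading zeros
    version: int


def parse_assembly_accession(text: str) -> AssemblyAccession:
    """Split an assembly accession into source, number, and version."""
    match = ASSEMBLY_ACCESSION.match(text.strip())
    if match is None:
        raise ValueError(f"not an assembly accession: {text!r}")
    prefix, number, version = match.groups()
    source = "GenBank" if prefix == "GCA" else "RefSeq"
    return AssemblyAccession(source, number, int(version))


def is_paired(first: str, second: str) -> bool:
    """True if one accession is GenBank, the other RefSeq, and the numbers match."""
    a = parse_assembly_accession(first)
    b = parse_assembly_accession(second)
    return a.source != b.source and a.number == b.number


print(parse_assembly_accession("GCA_000001405.6"))
print(is_paired("GCA_000001405.6", "GCF_000001405.17"))
```

Note that the two versions in a pair are independent: GRCh37.p5 is version 6 on the GenBank side and version 17 on the RefSeq side.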
GenBank vs RefSeq
GenBank            RefSeq
Submitter owned    RefSeq owned
Redundancy         Non-redundant
Updated rarely     Curated
INSDC              Not INSDC
BRCA1 (GenBank): 83 genomic records, 31 mRNA records, 27 protein records
BRCA1 (RefSeq): 3 genomic records, 5 mRNA records, 1 RNA record, 5 protein records
RefSeq for Assemblies
Typical assembly edits
Addition of non-nuclear (e.g. MT) assembly units
Removal of contamination
Drop unlocalized/unplaced scaffolds
Mask contamination that is placed on chromosome
http://www.ncbi.nlm.nih.gov/genome
Understanding relationships between assemblies using alignments
First Pass
Second Pass
Reciprocal best hit
Non-reciprocal, duplicative hits
No second pass alignments in GRCh37.p5
NCBI36
GRCh37.p5
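The two-pass scheme above separates reciprocal best hits from the remaining non-reciprocal or duplicative hits. The slides do not describe how the NCBI alignment pipeline implements this, so the following is only a generic sketch of the reciprocal-best-hit idea on made-up data. The region names, scores, and function names are all hypothetical.

```python
def best_hits(hits, key_index, other_index):
    """For each value at key_index, keep the partner with the highest score."""
    best = {}
    for hit in hits:
        key, other, score = hit[key_index], hit[other_index], hit[2]
        if key not in best or score > best[key][1]:
            best[key] = (other, score)
    return {key: other for key, (other, _) in best.items()}


def split_reciprocal(hits):
    """Split (query, target, score) hits into reciprocal best hits and the rest."""
    forward = best_hits(hits, 0, 1)  # best target for each query
    reverse = best_hits(hits, 1, 0)  # best query for each target
    reciprocal, remaining = [], []
    for query, target, score in hits:
        if forward[query] == target and reverse[target] == query:
            reciprocal.append((query, target, score))
        else:
            remaining.append((query, target, score))
    return reciprocal, remaining


# Toy alignments between regions of an old and a new assembly.
hits = [
    ("old_region_A", "new_region_A", 98.0),
    ("old_region_A", "new_region_B", 91.0),
    ("old_region_B", "new_region_B", 97.0),
    ("old_region_C", "new_region_B", 90.0),
]
reciprocal, remaining = split_reciprocal(hits)
print(reciprocal)
print(remaining)
```

In this toy input, the first-pass set contains the A-to-A and B-to-B pairs, while the two weaker hits onto `new_region_B` fall into the second-pass set.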
http://www.ncbi.nlm.nih.gov/tools/gbench/
Assemblies Transcripts Proteins
Set of genes
Other decoration
Annotation pipeline
Francoise Thibaud-Nissen
Content of the final annotation product
Columns: Description, In sequence database, In a BLAST database, On FTP site
• Chromosomes (NC_ or AC_)
• Scaffolds (NW_ or NT_)
• Curated transcripts/proteins (NM_, NR_/NP_)
• Predicted transcripts/proteins, fully or partially supported (XM_, XR_/XP_)
• Non-transcribed pseudogenes
• tRNA (annotated with tRNAscan-SE)
• Ab initio Gnomon models
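The accession prefixes listed above identify what kind of record a RefSeq accession refers to. A minimal lookup sketch follows, covering only the prefixes named on this slide. The function name `classify_refseq` is hypothetical, and the example accessions other than NC_000086 are made-up placeholders.

```python
# Prefix categories as listed on the slide; other RefSeq prefixes exist
# but are deliberately left out of this sketch.
REFSEQ_PREFIXES = {
    "NC_": "chromosome",
    "AC_": "chromosome",
    "NW_": "scaffold",
    "NT_": "scaffold",
    "NM_": "curated transcript",
    "NR_": "curated transcript",
    "NP_": "curated protein",
    "XM_": "predicted transcript",
    "XR_": "predicted transcript",
    "XP_": "predicted protein",
}


def classify_refseq(accession: str) -> str:
    """Return the record category for a RefSeq accession, or 'unknown'."""
    return REFSEQ_PREFIXES.get(accession[:3].upper(), "unknown")


# NC_000086 appears earlier in the slides; the rest are placeholders.
for accession in ["NC_000086.6", "NM_000000.1", "XP_000000.1", "FP565796.3"]:
    print(accession, "->", classify_refseq(accession))
```

The last example is a GenBank accession, so it falls through to "unknown", which is a quick way to tell RefSeq records apart from submitter-owned ones.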
Annotation Pipeline RefSeq
Where to find the annotation products?
• Nucleotide/Protein databases
• Gene
• Map Viewer
• BLAST databases
• FTP site
http://www.ncbi.nlm.nih.gov/gene
http://www.ncbi.nlm.nih.gov/mapview
Annotating multiple assemblies
Group 1
Transcript
• Consistent placement of transcripts
• Consistent labelling of the genes
• Consistent annotation on all assemblies
Assembly 1
Assembly 2
• Assembly-assembly alignments
Available at http://www.ncbi.nlm.nih.gov/genome/tools/remap
Group 2
Annotating multiple assemblies (2)
Btau_4.6.1
UMD_3.1
Same Gene symbol
Interacting with the community
FlyBase GenBank RefSeq
Thanks!
For slides: Francoise Thibaud-Nissen, Evan Eichler, Steve Sherry
The Genome Reference Consortium: The Genome Center at Washington University, The Wellcome Trust Sanger Institute, The European Bioinformatics Institute, The National Center for Biotechnology Information
Church group at NCBI: Valerie Schneider, Nathan Bouk, Hsiu-Chuan Chen, Peter Meric, Victor Ananiev, Chao Chen, John Lopez, John Garner, Tim Hefferon, Cliff Clausen
NCBI