Genome Related Biological Databases

Post on 20-Dec-2015


Content

• DNA Sequence databases
• Protein databases
• Gene prediction
• Accession numbers
• NCBI website
• Ensembl website

Nucleotide databases

• GenBank: housed at NCBI (National Center for Biotechnology Information)
  www.ncbi.nlm.nih.gov/Genbank/
• EMBL: housed at EBI (European Bioinformatics Institute)
  www.ebi.ac.uk/embl/
• DDBJ: housed in Japan
  www.ddbj.nig.ac.jp/Welcome-e.html

The underlying raw DNA sequences are identical in all three databases
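One way to check that two records carry the same underlying sequence is to ignore the header and line wrapping and compare a digest of the normalised residues. The sketch below assumes single-record FASTA text. The two records, their headers and the sequence are invented for illustration and do not come from any database.

```python
import hashlib


def fasta_sequence(record: str) -> str:
    """Return the sequence of a single-record FASTA string, uppercased and unwrapped."""
    lines = record.strip().splitlines()
    body = [line.strip() for line in lines if line and not line.startswith(">")]
    return "".join(body).upper()


def sequence_digest(record: str) -> str:
    """SHA-256 digest of the normalised sequence; headers and wrapping do not affect it."""
    return hashlib.sha256(fasta_sequence(record).encode("ascii")).hexdigest()


# Toy records (hypothetical): same residues, different header, case and wrapping.
record_a = ">db_a|TOY0001 example entry\nATGGCCAAGT\nTTGACC\n"
record_b = ">db_b|TOY0001 same entry, other formatting\natggccaagtttgacc\n"

same = sequence_digest(record_a) == sequence_digest(record_b)
print(same)  # True
```

Comparing digests rather than raw text keeps the check independent of how each database formats its flat files.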

>100,000 species are represented in GenBank

• All species: 196,538
• Viruses: 5,214
• Bacteria: 14,258
• Archaea: 500
• Eukaryota: 171,843

NCBI nucleotide databases

• GenBank
  • Individual submissions
  • Bulk submissions (Genome centers)
    • High throughput sequencing (DNA)
    • Expressed Sequence Tags (mRNA)
• RefSeq
  • Curated subset of GenBank
  • “Reference” sequence
  • Single sequence per locus / molecule

Protein databases

• NCBI
  • RefSeq and Protein
• EBI
  • Swiss-Prot, PIR and TrEMBL → UniProt
    • Translated from nucleotide sequence
    • Curated
    • Combined

UniProt versus GenBank and RefSeq

UniProt
• Produced by SIB, EBI & Georgetown U.
• Protein data only
• Curated in Swiss-Prot, not in TrEMBL

GenBank/RefSeq
• Produced by INSDC and NCBI
• Protein and nucleotide data
• Curated in RefSeq, not in GenBank

Accession numbers

A label that unambiguously identifies a sequence

Examples (all for retinol-binding protein, RBP4):

DNA
• X02775: GenBank genomic DNA sequence
• NT_030059: Genomic contig
• rs7079946: dbSNP (single nucleotide polymorphism)
• RBP4: HUGO gene name

RNA
• N91759.1: An expressed sequence tag (1 of 170)
• NM_006744: RefSeq DNA sequence (from a transcript)

Protein
• NP_007635: RefSeq protein
• AAC02945: GenBank protein
• Q28369: UniProt protein
• 1KT7: Protein Data Bank structure record
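The identifiers above follow recognisable formats, so a rough guess at the source database can be made from the string alone. The sketch below is a heuristic, not an official validator: the patterns cover only the common formats (two-letter prefix plus underscore for RefSeq, `rs` plus digits for dbSNP, a digit plus three alphanumerics for PDB, and so on), several formats overlap, and the function name `classify_accession` is an assumption made for this example.

```python
import re

# Patterns are tried in order. Some formats overlap (Q28369 fits both the
# UniProt pattern and the one-letter GenBank pattern), so the order is a heuristic.
ACCESSION_PATTERNS = [
    ("RefSeq", re.compile(r"^[A-Z]{2}_\d+(\.\d+)?$")),
    ("dbSNP", re.compile(r"^rs\d+$", re.IGNORECASE)),
    ("PDB", re.compile(r"^\d[A-Za-z0-9]{3}$")),
    ("UniProt", re.compile(
        r"^([OPQ]\d[A-Z0-9]{3}\d|[A-NR-Z]\d([A-Z][A-Z0-9]{2}\d){1,2})$")),
    ("GenBank protein", re.compile(r"^[A-Z]{3}\d{5}(\.\d+)?$")),
    ("GenBank nucleotide", re.compile(r"^([A-Z]\d{5}|[A-Z]{2}\d{6})(\.\d+)?$")),
]


def classify_accession(identifier: str) -> str:
    """Return the first database whose pattern matches, or 'unrecognized'."""
    ident = identifier.strip()
    for name, pattern in ACCESSION_PATTERNS:
        if pattern.match(ident):
            return name
    return "unrecognized"


for example in ["X02775", "NT_030059", "rs7079946", "N91759.1",
                "NM_006744", "NP_007635", "AAC02945", "Q28369", "1KT7", "RBP4"]:
    print(example, "->", classify_accession(example))
```

A gene symbol such as RBP4 is a name rather than an accession number, so it falls through to `unrecognized`.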

From Sequence to Genes

• Gene prediction
  • Extrinsic
    • Search for genes based on observed mRNA / protein sequences
    • UniGene
  • Ab initio
    • Predict genes based on genomic sequence alone
    • Promoter sequence
    • Poly(A) tail binding sites, CpG islands, splicing sites
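As an illustration of one ab initio signal named above, CpG islands can be flagged from sequence alone by combining GC content with the observed/expected CpG ratio. The sketch below uses the commonly cited thresholds (length of at least 200 bp, GC content above 50%, observed/expected ratio above 0.6). The function names are assumptions made for this example, and real gene finders combine many more signals than this.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def cpg_obs_exp(seq: str) -> float:
    """Observed/expected CpG ratio: (#CG * length) / (#C * #G)."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c * g)


def looks_like_cpg_island(seq: str, min_len: int = 200,
                          min_gc: float = 0.5, min_ratio: float = 0.6) -> bool:
    """Apply the three threshold tests to one candidate window."""
    return (len(seq) >= min_len
            and gc_fraction(seq) > min_gc
            and cpg_obs_exp(seq) > min_ratio)


# Synthetic windows, chosen only to exercise both outcomes.
print(looks_like_cpg_island("CG" * 100))  # True
print(looks_like_cpg_island("AT" * 100))  # False
```

In practice the test would be applied to a sliding window along a genomic sequence rather than to a single fixed string.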

UniGene

• Predict genes based on ESTs
• EST:
  • DNA sequence corresponding to mRNA from expressed gene
  • ~500 base pairs long
  • Sequenced from a cDNA library
• Cluster ESTs from many cDNA libraries to predict distinct genes
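The clustering idea can be sketched as single-linkage grouping: two ESTs join the same cluster when they share enough short subsequences (k-mers), and each resulting cluster stands for one putative gene. This is a toy stand-in and not the actual UniGene pipeline. The sequences, the parameters `k` and `min_shared`, and the function names are all assumptions made for this example.

```python
from itertools import combinations


def kmers(seq: str, k: int) -> set:
    """All overlapping substrings of length k."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def cluster_ests(ests: dict, k: int = 8, min_shared: int = 3) -> list:
    """Single-linkage clustering: link two ESTs that share >= min_shared k-mers."""
    parent = {name: name for name in ests}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    profiles = {name: kmers(seq, k) for name, seq in ests.items()}
    for a, b in combinations(ests, 2):
        if len(profiles[a] & profiles[b]) >= min_shared:
            parent[find(a)] = find(b)

    clusters = {}
    for name in ests:
        clusters.setdefault(find(name), []).append(name)
    # Largest clusters first, then alphabetical.
    return sorted((sorted(m) for m in clusters.values()), key=lambda m: (-len(m), m))


# Invented transcript; est1 and est2 are overlapping reads of it, est3 is unrelated.
toy_transcript = "ATGGCGTACGTTAGCCTAGGATCCGATTACGGCATTAGC"
toy_ests = {
    "est1": toy_transcript[:25],
    "est2": toy_transcript[10:],
    "est3": "TTTTTTTTTTCCCCCCCCCCAAAAAAAAAA",
}
print(cluster_ests(toy_ests))  # [['est1', 'est2'], ['est3']]
```

Here est1 and est2 overlap by 15 bases, which gives 8 shared 8-mers, so they form a cluster of size 2 while est3 remains a singleton.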

EST clusters

This is a gene with 1 EST associated; the cluster size is 1

This is a gene with 10 ESTs associated; the cluster size is 10

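[Figure: UniGene clusters. Histogram of number of clusters (y-axis) against cluster size (x-axis), with bins 1, 2, 3-4, 5-8, 9-16, 17-32, 33-64, 65-128, 129-256, 257-512, 513-1024, 1025-2048, 2049-4096, 4097-8192, 8193-16384, 16385-32768 and 32769-65536. Bar labels, tallest to shortest: 40986, 18424, 17855, 13411, 8288, 5332, 4607, 4075, 4052, 3958, 1902, 710, 210, 57, 17, 6, 1. Annotation: "Likely to be real genes".]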
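The cluster-size bins on the histogram axis (1, 2, 3-4, 5-8, 9-16, ...) double in width at each step, so the upper edge of every bin is a power of two. A minimal sketch of how a cluster size maps to its bin label follows. The function names and the list of example sizes are assumptions made for illustration, not UniGene data.

```python
from collections import Counter


def size_bin(cluster_size: int) -> str:
    """Label of the power-of-two bin that contains cluster_size."""
    if cluster_size < 1:
        raise ValueError("cluster size must be at least 1")
    # Smallest power of two that is >= cluster_size.
    upper = 1 << (cluster_size - 1).bit_length()
    if upper <= 2:
        return str(upper)
    return f"{upper // 2 + 1}-{upper}"


def bin_counts(cluster_sizes) -> Counter:
    """Number of clusters falling into each bin."""
    return Counter(size_bin(size) for size in cluster_sizes)


# Hypothetical cluster sizes, for illustration only.
print(size_bin(1))    # 1
print(size_bin(10))   # 9-16
print(bin_counts([1, 1, 2, 3, 10, 16]))
```

A cluster of 10 ESTs therefore lands in the 9-16 bin, and a singleton in bin 1.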

Gene databases

• Ensembl (EBI)
  • Automatic annotation: mRNA and protein sequence
  • Curated annotation: Vega project
• Entrez Gene (NCBI)
  • Links RefSeq sequences to external annotations

Web sites for biological databases

• NCBI www.ncbi.nlm.nih.gov

• EBI www.ebi.ac.uk

• Ensembl www.ensembl.org (hosted at EBI)

NCBI website


PubMed

Ensembl website

Ensembl structure

• Gene: ENSG…
• Transcript: ENST…
• Protein: ENSP…
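The prefix convention above makes it possible to tell the feature type from an Ensembl stable ID without a lookup. The sketch below assumes the usual layout of `ENS`, an optional species code (absent for human), one letter for the feature type, eleven digits and an optional version suffix. It handles only the three types listed here, and the IDs in the example are format placeholders, not real records.

```python
import re

ENSEMBL_ID = re.compile(
    r"^ENS(?P<species>[A-Z]{3,4})?(?P<kind>[GTP])(?P<number>\d{11})(\.\d+)?$"
)
KINDS = {"G": "gene", "T": "transcript", "P": "protein"}


def ensembl_feature_type(stable_id: str):
    """Return 'gene', 'transcript' or 'protein', or None if the ID does not fit."""
    match = ENSEMBL_ID.match(stable_id.strip())
    return KINDS[match.group("kind")] if match else None


# Placeholder IDs (hypothetical numbers) that only illustrate the format.
for placeholder in ["ENSG00000000001", "ENST00000000001.2",
                    "ENSP00000000001", "ENSMUSG00000000001", "RBP4"]:
    print(placeholder, "->", ensembl_feature_type(placeholder))
```

Identifiers from other sources, such as gene symbols, return `None` instead of raising an error.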

Ensembl search

OTTHUMGXXX (Curated)

ENSGXXXX (Predicted)

Vega gene page

Ensembl gene page

Ensembl transcript page

Ensembl protein page