Discussion Practical 1. Features of major databases (PubMed and NCBI Protein Db)


Upload: alannah-wood

Post on 18-Jan-2016


Page 1: 1 Discussion Practical 1. Features of major databases (PubMed and NCBI Protein Db) 2

Discussion: Practical 1

Page 2:

Features of major databases (PubMed and NCBI Protein Db)

Page 3:

Anatomy of PubMed Db

Page 4:

Epub ahead of print and journal impact factor

How to get the impact factor of any journal:

1) Direct source: the Web of Science database

2) Indirect sources, e.g. blogs and other websites (do a Google search)

Adapted from: http://admin-apps.isiknowledge.com/JCR/JCR?RQ=LIST_SUMMARY_JOURNAL

Page 5:

Anatomy of a PubMed record

Page 6:

Demo on downloading articles

Page 7:

Anatomy of a Protein Db

Page 8:

Accession numbers and GenInfo Identifiers

gi|numeric identifier|source|alphanumeric identifier

Human p53 RefSeq mRNA record as an example:

gi|120407067|ref|NM_000546.3

GI (or GenInfo Identifier): 120407067
Source: RefSeq database
Accession: NM_000546
Version: NM_000546.3

Other popular sources:
• dbj – DDBJ (DNA Data Bank of Japan database)
• emb – The European Molecular Biology Laboratory (EMBL) database
• prf – Protein Research Foundation database
• sp – Swiss-Prot
• gb – GenBank
• pir – Protein Information Resource
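The identifier layout on this page can be split mechanically. The following is a minimal Python sketch, not an NCBI tool: the names `SeqId`, `SOURCES` and `parse_gi_identifier` are made up for illustration, and the source table only covers the tags listed on the slide.

```python
from typing import NamedTuple

# Source tags listed on the slide; illustrative, not an exhaustive list.
SOURCES = {
    "ref": "RefSeq",
    "gb": "GenBank",
    "emb": "EMBL",
    "dbj": "DDBJ",
    "sp": "Swiss-Prot",
    "pir": "Protein Information Resource",
    "prf": "Protein Research Foundation",
}


class SeqId(NamedTuple):
    gi: int          # GenInfo Identifier
    source: str      # database the record comes from
    accession: str   # stable part, e.g. NM_000546
    version: int     # number after the dot, e.g. 3


def parse_gi_identifier(text: str) -> SeqId:
    """Split 'gi|<number>|<source>|<accession.version>' into its parts."""
    fields = text.strip().lstrip(">").split("|")
    if len(fields) < 4 or fields[0] != "gi":
        raise ValueError(f"not a gi-style identifier: {text!r}")
    accession, _, version = fields[3].partition(".")
    if not version.isdigit():
        raise ValueError(f"missing version number in {fields[3]!r}")
    # Unknown source tags are passed through unchanged.
    source = SOURCES.get(fields[2], fields[2])
    return SeqId(int(fields[1]), source, accession, int(version))


if __name__ == "__main__":
    print(parse_gi_identifier("gi|120407067|ref|NM_000546.3"))
```

Keeping the accession and the version as separate fields makes it easy to compare two identifiers for "same record, different revision".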

Page 9:

Why do we need accession number and GI for one record?

1) What is the difference between accession and GI?

2) Why do we need these two when both seem to be accession numbers?

Page 10:

Why do we need accession number and GI for one record?

Q1) Which revision will NCBI show if you were to search by the accession only, without the version number?

ACCESSION    GI           VERSION
NM_000546    4507636      NM_000546.1
NM_000546    8400737      NM_000546.2
NM_000546    120407067    NM_000546.3

[Figure: the accession NM_000546 stays the same for Sequence_v1, Sequence_v2 and Sequence_v3; each sequence update produces a new GI and a new version.]
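The relationship between the three columns on this page can be modelled as a small lookup. This is a rough Python sketch using only the three revisions of NM_000546 listed here; the function name `resolve` is hypothetical, and the rule that a bare accession resolves to the highest version is an assumption made for the sketch, not something stated on the slide.

```python
from typing import Dict, List, Tuple

# Revision history of NM_000546 as listed on this page: (GI, version).
REVISIONS: Dict[str, List[Tuple[int, int]]] = {
    "NM_000546": [(4507636, 1), (8400737, 2), (120407067, 3)],
}


def resolve(query: str) -> Tuple[int, str]:
    """Return (GI, accession.version) for 'ACC' or 'ACC.N'."""
    accession, _, version = query.partition(".")
    history = REVISIONS[accession]
    if version:
        # A versioned query names exactly one revision.
        for gi, number in history:
            if number == int(version):
                return gi, f"{accession}.{number}"
        raise KeyError(query)
    # Assumption: a bare accession resolves to the highest version.
    gi, number = max(history, key=lambda revision: revision[1])
    return gi, f"{accession}.{number}"


if __name__ == "__main__":
    print(resolve("NM_000546.1"))
    print(resolve("NM_000546"))
```

The sketch shows why both identifiers are useful: the accession names the record as a whole, while the GI (or accession.version) pins down one specific sequence.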

Page 11:

Accession numbers

- The unique identifier for a sequence record.

- An accession number applies to the complete record.

- Accession numbers do not change, even if information in the record is changed at the author's request.

- Sometimes, however, an original accession number might become secondary to a newer accession number, if the authors make a new submission that combines previous sequences, or if for some reason a new submission supersedes an earlier record.

Page 12:

GenInfo Identifiers

- GenInfo Identifier: sequence identification number

- If a sequence changes in any way, a new GI number will be assigned

- A separate GI number is also assigned to each protein translation within a nucleotide sequence record

- A new GI is assigned if the protein translation changes in any way

- GI sequence identifiers run parallel to the new accession.version system of sequence identifiers

Page 13:

Version

- A nucleotide sequence identification number that represents a single, specific sequence in the GenBank database.

- If there is any change to the sequence data (even a single base), the version number will be increased, e.g., U12345.1 → U12345.2, but the accession portion will remain stable.

- The accession.version system of sequence identifiers runs parallel to the GI number system, i.e., when any change is made to a sequence, it receives a new GI number AND an increase to its version number.

- A Sequence Revision History tool (http://www.ncbi.nlm.nih.gov/entrez/sutils/girevhist.cgi) is available to track the various GI numbers, version numbers, and update dates for sequences that appeared in a specific GenBank record.
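The rules on the last three pages (stable accession, version and GI changing together whenever the sequence changes) can be written down as a toy model. The sketch below is illustrative only: `Record`, `update` and the GI counter are hypothetical names, and real GI numbers are assigned by NCBI rather than computed.

```python
from dataclasses import dataclass, replace
from itertools import count
from typing import Optional

# Hypothetical GI allocator; real GI numbers are assigned by NCBI.
_next_gi = count(1000)


@dataclass(frozen=True)
class Record:
    accession: str
    version: int
    gi: int
    sequence: str
    definition: str

    @property
    def accession_version(self) -> str:
        return f"{self.accession}.{self.version}"


def update(record: Record, *, sequence: Optional[str] = None,
           definition: Optional[str] = None) -> Record:
    """Return the record after an update, following the rules above."""
    new_definition = record.definition if definition is None else definition
    if sequence is None or sequence == record.sequence:
        # Annotation-only change: accession, version and GI stay the same.
        return replace(record, definition=new_definition)
    # Any sequence change (even one base): version + 1 and a new GI,
    # while the accession portion remains stable.
    return replace(record, sequence=sequence, definition=new_definition,
                   version=record.version + 1, gi=next(_next_gi))


if __name__ == "__main__":
    v1 = Record("U12345", 1, 1, "ACGT", "example record")
    v2 = update(v1, sequence="ACGA")
    print(v1.accession_version, "->", v2.accession_version)
```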

Page 14:

Anatomy of a Protein Db record

Page 15:

FASTA sequence

Page 16:

FASTA format

• Text-based format for representing nucleic acid sequences or peptide sequences (single-letter codes).

• Easy for programs to manipulate and parse.

>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL
MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
>SEQUENCE_2
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH

In each record, the line starting with ">" is the description line/row; the lines that follow it are the sequence data line(s).

Page 17:

FASTA format (cont.)

• Begins with a single-line description, followed by lines of sequence data.

• Description line
– Distinguished from the sequence data by a greater-than (">") symbol.
– The word following the ">" symbol in the same row is the identifier of the sequence.
– There should be no space between the ">" and the first letter of the identifier.
– Keep the identifier short and clear; some old programs accept identifiers of at most 10 characters. For example: >gi|5524211|Human or >HumanP53

• Sequence line(s)
– Ensure that the sequence data starts in the row following the description row (be careful with the word-wrap feature).
– The sequence ends if another line starting with a ">" appears; this indicates the start of another sequence.

>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL
MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
>SEQUENCE_2
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH

In each record, the line starting with ">" is the description line/row; the lines that follow it are the sequence data line(s).
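The FASTA rules above translate almost line for line into a parser. Below is a minimal Python sketch under those rules; `parse_fasta` is an illustrative name, and the decision to reject a space after ">" (rather than tolerate it) is a choice made here to mirror the slide.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each sequence identifier to its concatenated sequence data."""
    records: Dict[str, str] = {}
    identifier = None
    for raw in lines:
        line = raw.rstrip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Description line: the first word after ">" is the identifier.
            if len(line) == 1 or line[1].isspace():
                raise ValueError(f"no identifier directly after '>': {line!r}")
            identifier = line[1:].split()[0]
            if identifier in records:
                raise ValueError(f"duplicate identifier: {identifier}")
            records[identifier] = ""
        elif identifier is None:
            raise ValueError("sequence data before the first description line")
        else:
            # Sequence data may be wrapped over several lines; join them.
            records[identifier] += line.strip()
    return records


if __name__ == "__main__":
    example = """>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
>SEQUENCE_2
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
"""
    for name, sequence in parse_fasta(example.splitlines()).items():
        print(name, len(sequence))
```

For real work an established library such as Biopython (`Bio.SeqIO.parse(handle, "fasta")`) is usually a better choice than a hand-written parser.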

Page 18:

Amino acids & Nucleotides

Page 20:

IUPAC One Letter Amino Acid Code

Code  Amino acid                          Mnemonic
A     Alanine
B     Asx (asparagine or aspartic acid)
C     Cysteine
D     Aspartic acid                       Aspar(D)ic acid
E     Glutamic acid
F     Phenylalanine                       (F)enylalanine
G     Glycine
H     Histidine
I     Isoleucine
J     Xle (leucine or isoleucine)
K     Lysine
L     Leucine
M     Methionine
N     Asparagine                          Asparagi(N)e
O     Pyrrolysine (22nd, Pyl)             Pyrr(O)lysine
P     Proline
Q     Glutamine                           (Q)lutamine
R     Arginine                            (R)ginine
S     Serine
T     Threonine
U     Selenocysteine (21st, Sec)
V     Valine
W     Tryptophan                          T(W)ptophan
X     Xaa (unspecified or unknown)
Y     Tyrosine                            T(Y)rosine
Z     Glx (glutamine or glutamic acid)

Page 21:

Note

Amino acid                           Three-letter code   Single-letter code
Asparagine or aspartic acid          Asx                 B
Glutamine or glutamic acid           Glx                 Z
Leucine or isoleucine                Xle                 J
Unspecified or unknown amino acid    Xaa                 X
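The one-letter codes from the previous page and the ambiguity codes in this note fit naturally into a lookup table. The following Python sketch is illustrative; the names `ONE_TO_THREE`, `to_three_letter` and `ambiguous_positions` are not from any particular library.

```python
from typing import List, Tuple

# One-letter to three-letter amino acid codes (20 standard residues,
# Sec and Pyl, and the four ambiguity codes from the note above).
ONE_TO_THREE = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
    "U": "Sec", "O": "Pyl",
    "B": "Asx", "Z": "Glx", "J": "Xle", "X": "Xaa",
}

AMBIGUITY_CODES = "BZJX"


def to_three_letter(sequence: str) -> str:
    """Convert a one-letter peptide string to hyphenated three-letter codes."""
    return "-".join(ONE_TO_THREE[residue] for residue in sequence.upper())


def ambiguous_positions(sequence: str) -> List[Tuple[int, str]]:
    """Return 1-based positions of residues given as ambiguity codes."""
    return [(position, residue)
            for position, residue in enumerate(sequence.upper(), start=1)
            if residue in AMBIGUITY_CODES]


if __name__ == "__main__":
    print(to_three_letter("MTEITAAMVK"))
    print(ambiguous_positions("MTBITXAMVK"))
```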

Page 22:

IUPAC Nucleotide Code

The standard IUPAC nucleotide code is used to describe ambiguous sites in a given DNA sequence motif, where a single character may represent more than one nucleotide.

[Table: IUPAC nucleotide code, not reproduced in this transcript. Source: http://www.yeastract.com/help/help_iupac.php]
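The code table itself did not survive in this transcript. As a rough illustration, the sketch below uses the standard IUPAC nucleotide ambiguity codes to turn a motif into a regular expression; the function names and the example motif are made up for this sketch.

```python
import re
from typing import List

# Standard IUPAC nucleotide codes: each symbol stands for one or more bases.
IUPAC_NUCLEOTIDES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def motif_to_regex(motif: str) -> str:
    """Translate an IUPAC motif into an equivalent regular expression."""
    parts = []
    for symbol in motif.upper():
        bases = IUPAC_NUCLEOTIDES[symbol]
        parts.append(bases if len(bases) == 1 else f"[{bases}]")
    return "".join(parts)


def find_motif(motif: str, sequence: str) -> List[int]:
    """Return 0-based start positions of all (overlapping) motif matches."""
    # A lookahead is used so that overlapping matches are also reported.
    pattern = re.compile(f"(?={motif_to_regex(motif)})")
    return [match.start() for match in pattern.finditer(sequence.upper())]


if __name__ == "__main__":
    print(motif_to_regex("TATAWR"))
    print(find_motif("GAY", "AGATTGACGAA"))
```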

Page 23:

Advice

• We highly recommend that you memorize the amino acid codes and their structures.

• Memorizing the codes, and in particular the structures, will be very useful for this module and other modules, especially for research purposes.

• It is not compulsory that you memorize these for this module.

Page 24:

Features of major databases (Gene Db)

Page 25:

Anatomy of Gene Db

Page 26:

Anatomy of a Gene Db record

Page 27:

A section of a Gene Db record: Reference Sequences

[Figure labels: mRNA accession number; protein accession number]
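RefSeq accessions carry a prefix that indicates the molecule type, which is how an mRNA accession can be told apart from a protein accession in the Reference Sequences section. The sketch below lists common RefSeq prefixes; the table comes from general RefSeq accession conventions rather than from the slide, is not exhaustive, and `molecule_type` is a hypothetical helper name.

```python
# Common RefSeq accession prefixes (not exhaustive).
REFSEQ_PREFIXES = {
    "NM_": "mRNA",
    "NR_": "non-coding RNA",
    "NP_": "protein",
    "XM_": "predicted mRNA",
    "XR_": "predicted non-coding RNA",
    "XP_": "predicted protein",
    "NC_": "genomic (complete molecule)",
    "NG_": "genomic region",
}


def molecule_type(accession: str) -> str:
    """Guess the molecule type of a RefSeq accession from its prefix."""
    return REFSEQ_PREFIXES.get(accession[:3].upper(), "unknown")


if __name__ == "__main__":
    # NM_000546.3 is the human p53 mRNA record used earlier in the slides.
    print(molecule_type("NM_000546.3"))
```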

Page 28:

Nucleic Acid Databases

Entrez nucleotide database (nt)

• GenBank
• DDBJ
• EMBL
• RefSeq_genomic

Page 29:

Amino Acid Databases

1) Sequence repositories
• GenPept (redundant; translation of GenBank; minimal annotation)
• Entrez Protein (redundant or NR)
– translated DDBJ/EMBL/GenBank (i.e. GenPept)
– Swiss-Prot, PIR, RefSeq_protein and PDB
• RefSeq (non-redundant; reference sequences; minimal manual curation; limited species)

2) Universal curated databases
• PIR-PSD (non-redundant; focus on protein family classification)
• Swiss-Prot (non-redundant; manually annotated)
• TrEMBL (non-redundant; extensively computer-annotated)

3) Next generation of protein sequence databases
• UniProtKB (Swiss-Prot, TrEMBL and PIR-PSD integrated; less redundant than UniProt NREF)
• UniParc (like Entrez Protein but more comprehensive)
• UniProt NREF (like RefSeq but more comprehensive and rich with annotation)

Read more: http://www.ebi.ac.uk/panda/pdf/apweiler_bairoch_2004.pdf

Page 30:

The RefSeq Project

• Goal: a “comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products, for major research organisms.”

http://www.ncbi.nlm.nih.gov/RefSeq/index.html

• Designed to reduce duplication by selecting one representative sequence for each locus, except when there are naturally occurring paralogs and splice variants.

• Info from:
– Predictions from genomic sequence
– Analysis of GenBank records
– Collaborating databases

Page 31:

GenBank versus RefSeq

http://www.ncbi.nlm.nih.gov/books/NBK21105/#ch1.Appendix_GenBank_RefSeq_TPA_and_UniP

Page 32:

Choice of databases for genomic/proteomic data

[Figure: genome architecture, showing a promoter, an enhancer and a gene with segments labelled E, I and U.]

Databases to house genomic/proteomic data

Database         Contents
Nucleotide       All of the above in multiple records
Protein          All real/reliably predicted proteins in multiple records
RefSeq_genome    Reference ones only
RefSeq_Protein   Reference proteins only
Gene             Gene record with all related information included (mRNA, protein, promoter, enhancer)

Page 33:

Database searching can help answer questions like

• What is the sequence of human IL-10?
• What is the gene coding for human IL-10?
• Is the function of human IL-10 known? What is it?
• Are there any variants of human IL-10?
• Who sequenced this gene?
• What are the differences between IL-10 in human and in other species?
• Which species are known to have IL-10?
• Is the structure of IL-10 known?
• What are the structural and functional domains of IL-10?
• Are there any motifs in the sequence that explain its properties?
• What is the upstream region of IL-10 containing transcriptional regulation sites?

IL10 = X?

Page 34:

Take home messages for databases

• Bioinformatics = databases + tools

• General databases versus specialized databases

• Databases come and go (especially the small ones)

• Database redundancy: many databases cover the same topic (use the most comprehensive one; failing that, use all of them for completeness)

• Database accuracy: published ones are more reliable; nevertheless, they are still prone to errors; it is always good to spend some time assessing the reliability of your data of interest by cross-referencing to the literature or other databases

• Fortunately, most databases are cross-referenced

• Unfortunately, there is no common standard format; you need to spend some time familiarizing yourself with each one; it becomes easy after some practice

• Finding databases relevant to you

– NAR Database catalogue

– PubMed

– Google

• 2 main methods for searching databases (each with its own pros and cons)

– 1. Keyword search (covered today)

– 2. Sequence search (day 2)
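A keyword search can also be issued programmatically. The following minimal Python sketch only builds a query URL for the NCBI E-utilities `esearch` endpoint and does not contact the server; the helper name `keyword_search_url` and the example query term are illustrative assumptions, not part of the practical.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def keyword_search_url(db: str, term: str, retmax: int = 20) -> str:
    """Build an esearch URL for a keyword query against one Entrez database."""
    params = {
        "db": db,            # e.g. "pubmed", "protein" or "gene"
        "term": term,        # the keyword query, Entrez field tags allowed
        "retmax": retmax,    # maximum number of record IDs to return
        "retmode": "json",   # ask for JSON instead of the default XML
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


if __name__ == "__main__":
    # Illustrative query: human IL10 records in the Protein database.
    print(keyword_search_url("protein",
                             "IL10[Gene Name] AND Homo sapiens[Organism]"))
```

Opening the printed URL in a browser returns the matching record IDs; the same query string can be typed into the Entrez search box on the NCBI website.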
