
Post on 20-Dec-2015


Sequence Databases

June 21, 2005

Learning objectives:

Understand how information is stored in GenBank.

Learn how to read a GenBank flat file.

Learn how to search GenBank for information.

Understand the difference between the header, features and sequence.

Distinguish between a primary database and a secondary database.

Introduce the ENTREZ platform for biological data analysis.


BIOSEQs

Biological sequence (Bioseq): the central element in the NCBI data model.

Comprises a single continuous molecule of nucleic acid or protein.

Must have at least one sequence identifier (Seq-id).

Carries information on the physical type of molecule (DNA, RNA or protein).

Annotations: refer to specific locations within the Bioseq.

Descriptors: describe the entire Bioseq.
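The Bioseq described on this slide can be pictured as a small data structure. The Python sketch below is only an illustration: the class and field names (Bioseq, Annotation, seq_ids, mol_type) are hypothetical and do not reproduce the actual NCBI data model definitions.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """Information tied to a specific location within the Bioseq."""
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive
    label: str


@dataclass
class Bioseq:
    """Hypothetical model of one continuous nucleic acid or protein molecule."""
    seq_ids: list[str]   # must contain at least one Seq-id
    mol_type: str        # "DNA", "RNA" or "protein"
    descriptors: dict[str, str] = field(default_factory=dict)    # describe the entire Bioseq
    annotations: list[Annotation] = field(default_factory=list)  # refer to specific locations

    def __post_init__(self) -> None:
        if not self.seq_ids:
            raise ValueError("a Bioseq needs at least one sequence identifier")
        if self.mol_type not in {"DNA", "RNA", "protein"}:
            raise ValueError(f"unknown molecule type: {self.mol_type}")


record = Bioseq(
    seq_ids=["U49845.1"],
    mol_type="DNA",
    descriptors={"organism": "Saccharomyces cerevisiae"},
    annotations=[Annotation(687, 3158, "AXL2 coding region")],
)
print(record.seq_ids[0], record.mol_type, len(record.annotations))
```

The validation in `__post_init__` mirrors the two requirements on the slide: at least one Seq-id and a known molecule type.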


What is GenBank?

Gene sequence database

Annotated records that each represent a single contiguous stretch of DNA or RNA; a record may have more than one coding region (limit 350 kb).

Records are generated from the authors' direct submissions to the DNA sequence databases.

Part of the International Nucleotide Sequence Database Collaboration.


General Comments on the GenBank Flat File (GBFF)

Three sections:

1) Header: information about the whole record

2) Features: description of the annotations, each represented by a key

3) Nucleotide sequence: each record ends with // on its last line

DNA-centered

Translated sequence is a feature
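The three-section layout can be sketched in Python. This is a minimal illustration under stated assumptions, not a full parser: it assumes the feature table begins at a line starting with FEATURES, the sequence section at BASE COUNT or ORIGIN, and the record ends at //. The function name `split_flat_file` is hypothetical, and the example record is abridged from the U49845 record shown later in these slides.

```python
def split_flat_file(record: str) -> dict[str, list[str]]:
    """Split one flat-file record into header, features and sequence lines."""
    sections: dict[str, list[str]] = {"header": [], "features": [], "sequence": []}
    current = "header"
    for line in record.splitlines():
        if line.startswith("//"):          # terminator: last line of the record
            break
        if line.startswith("FEATURES"):    # start of the feature table
            current = "features"
        elif line.startswith(("BASE COUNT", "ORIGIN")):  # start of the sequence section
            current = "sequence"
        sections[current].append(line)
    return sections


example = """\
LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999
ACCESSION   U49845
FEATURES             Location/Qualifiers
     source          1..5028
     CDS             <1..206
ORIGIN
        1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg
//
"""

parts = split_flat_file(example)
for name, lines in parts.items():
    print(name, len(lines))
```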


Feature Keys

Purpose:

1) Indicates the biological nature of a sequence

2) Supplies information about changes to sequences

Feature Key     Description
conflict        Separate determinations of the same sequence differ
rep_origin      Origin of replication
protein_bind    Protein binding site on DNA
CDS             Protein coding sequence


Feature Keys-Terminology

Feature Key     Location/Qualifiers
CDS             23..400
                /product="alcohol dehydrogenase"
                /gene="adhI"

The feature CDS is a coding sequence beginning at base 23 and ending at base 400, has a product called “alcohol dehydrogenase” and corresponds to the gene called “adhI”.
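A feature entry like this one can be read programmatically. The Python sketch below assumes the simplest case only: key and location on the first line, and one /qualifier per following line with its value on that same line. The function name `parse_feature` is hypothetical.

```python
def parse_feature(lines: list[str]) -> dict:
    """Parse one feature entry: the first line holds the key and location,
    the following lines hold /qualifier=value pairs."""
    key, location = lines[0].split()
    qualifiers = {}
    for line in lines[1:]:
        text = line.strip().lstrip("/")
        name, _, value = text.partition("=")
        # Free-text values are quoted; flag qualifiers such as /partial carry no value.
        qualifiers[name] = value.strip('"') if value else True
    return {"key": key, "location": location, "qualifiers": qualifiers}


feature = parse_feature([
    "CDS             23..400",
    '                /product="alcohol dehydrogenase"',
    '                /gene="adhI"',
])
start, end = (int(n) for n in feature["location"].split(".."))
print(feature["key"], start, end, feature["qualifiers"])
```

Real records wrap long qualifier values over several lines, which this sketch does not handle.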


Feature Keys-Terminology (Cont.)

Feature Key     Location/Qualifiers
CDS             join(544..589,688..1032)
                /product="T-cell receptor beta-chain"
                /partial

The feature CDS is a partial coding sequence formed by joining the indicated elements to form one contiguous sequence encoding a product called T-cell receptor beta-chain.
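The join operator can be illustrated with a short Python sketch. It handles only simple and join(...) locations built from start..end ranges, and it ignores the < and > partial markers; the function names and the toy sequence are hypothetical.

```python
import re


def parse_location(location: str) -> list[tuple[int, int]]:
    """Return the (start, end) segments named in a simple or join(...) location."""
    # "<" and ">" mark partial ends in flat-file locations; they are ignored here.
    pairs = re.findall(r"[<>]?(\d+)\.\.[<>]?(\d+)", location)
    return [(int(a), int(b)) for a, b in pairs]


def extract(sequence: str, location: str) -> str:
    """Join the listed segments into one contiguous sequence (1-based, inclusive)."""
    return "".join(sequence[start - 1:end] for start, end in parse_location(location))


segments = parse_location("join(544..589,688..1032)")
print(segments)                                           # [(544, 589), (688, 1032)]
print(sum(end - start + 1 for start, end in segments))    # 391 bases in the joined CDS

toy = "aaacccgggttt"                                       # hypothetical 12-base sequence
print(extract(toy, "join(1..3,10..12)"))                   # aaattt
```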


Record from GenBank

LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999
DEFINITION  Saccharomyces cerevisiae TCP1-beta gene, partial cds, and
            Axl2p (AXL2) and Rev7p (REV7) genes, complete cds.
ACCESSION   U49845
VERSION     U49845.1  GI:1293613
KEYWORDS    .
SOURCE      baker's yeast.
  ORGANISM  Saccharomyces cerevisiae
            Eukaryota; Fungi; Ascomycota; Hemiascomycetes; Saccharomycetales;
            Saccharomycetaceae; Saccharomyces.

LOCUS: locus name (SCU49845), GenBank division (PLN: plant, fungal and algal) and modification date (21-JUN-1999).

DEFINITION: "cds" stands for coding region.

ACCESSION: unique identifier (never changes).

VERSION: nucleotide sequence identifier in accession.version form; it changes when there is a change in the sequence.

GI: GeneInfo identifier; it changes whenever there is a change.

KEYWORDS: word or phrase describing the sequence (not based on a controlled vocabulary). Not used in newer records.

SOURCE: common name for the organism.

ORGANISM: formal scientific name for the source organism and its lineage, based on the NCBI Taxonomy Database.
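The header fields can be pulled apart with a few lines of Python. This sketch assumes the whitespace-separated layout of the LOCUS and VERSION lines in this particular record (other records may carry extra tokens), and the function names are hypothetical.

```python
def parse_locus(line: str) -> dict:
    """Split a LOCUS line laid out as: LOCUS name length bp molecule division date."""
    tokens = line.split()
    if len(tokens) != 7 or tokens[0] != "LOCUS" or tokens[3] != "bp":
        raise ValueError("unexpected LOCUS layout")
    _, name, length, _, molecule, division, date = tokens
    return {"name": name, "length": int(length), "molecule": molecule,
            "division": division, "date": date}


def parse_version(line: str) -> dict:
    """Split a VERSION line into accession, version number and GI."""
    _, acc_ver, gi = line.split()
    accession, version = acc_ver.split(".")
    return {"accession": accession, "version": int(version),
            "gi": gi.split(":", 1)[1]}


locus = parse_locus("LOCUS SCU49845 5028 bp DNA PLN 21-JUN-1999")
version = parse_version("VERSION U49845.1 GI:1293613")
print(locus["name"], locus["length"], locus["division"], locus["date"])
print(version["accession"], version["version"], version["gi"])
```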


Record from GenBank (cont.1)

REFERENCE   1  (bases 1 to 5028)
  AUTHORS   Torpey,L.E., Gibbs,P.E., Nelson,J. and Lawrence,C.W.
  TITLE     Cloning and sequence of REV7, a gene whose function is required
            for DNA damage-induced mutagenesis in Saccharomyces cerevisiae
  JOURNAL   Yeast 10 (11), 1503-1509 (1994)
  MEDLINE   95176709
REFERENCE   2  (bases 1 to 5028)
  AUTHORS   Roemer,T., Madden,K., Chang,J. and Snyder,M.
  TITLE     Selection of axial growth sites in yeast requires Axl2p, a
            novel plasma membrane glycoprotein
  JOURNAL   Genes Dev. 10 (7), 777-793 (1996)
  MEDLINE   96194260
REFERENCE   3  (bases 1 to 5028)
  AUTHORS   Roemer,T.
  TITLE     Direct Submission
  JOURNAL   Submitted (22-FEB-1996) Terry Roemer, Biology, Yale University,
            New Haven, CT, USA

REFERENCE: oldest reference first.

MEDLINE: Medline UID.

Reference 3: submitter of the sequence (always the last reference).


Record from GenBank (cont.2)

FEATURES             Location/Qualifiers
     source          1..5028
                     /organism="Saccharomyces cerevisiae"
                     /db_xref="taxon:4932"
                     /chromosome="IX"
                     /map="9"
     CDS             <1..206
                     /codon_start=3
                     /product="TCP1-beta"
                     /protein_id="AAA98665.1"
                     /db_xref="GI:1293614"
                     /translation="SSIYNGISTSGLDLNNGTIADMRQLGIVESYKLKRAVVSSASEA
                     AEVLLRVDNIIRARPRTANRQHM"

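Each feature has three parts: a key (a keyword that indicates the functional group), a location (instructions for finding the feature), and qualifiers (auxiliary information about the feature).

Keys: source, CDS

Location: 1..5028, <1..206

Qualifiers: /organism, /db_xref, /codon_start, /product and so on

Values: the text after the "=" sign. Descriptive free text must be in quotation marks.

<1..206: the sequence is partial on the 5’ end. The 3’ end is complete.

/codon_start: start of the open reading frame

/db_xref: database cross-references

/protein_id: protein sequence ID number

Note: the TCP1-beta CDS is only a partial sequence.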
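The effect of /codon_start=3 can be checked against this record. The Python sketch below translates the first 120 bases of U49845 (the two ORIGIN lines shown in these slides) starting at the third base, using the standard genetic code, and compares the result with the start of the /translation qualifier. The function name `translate` is hypothetical, and the sketch ignores a trailing partial codon.

```python
BASES = "tcag"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built by enumerating codons in t, c, a, g order.
CODONS = [x + y + z for x in BASES for y in BASES for z in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO_ACIDS))


def translate(dna: str, codon_start: int = 1) -> str:
    """Translate DNA beginning at base `codon_start` (1, 2 or 3)."""
    dna = dna.lower()
    offset = codon_start - 1
    return "".join(
        CODON_TABLE[dna[i:i + 3]] for i in range(offset, len(dna) - 2, 3)
    )


# First 120 bases of the record, copied from its ORIGIN section.
first_120 = (
    "gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg"
    "ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca gtagtcagct"
).replace(" ", "")

# /translation value of the TCP1-beta CDS, as given in the record.
annotated = "SSIYNGISTSGLDLNNGTIADMRQLGIVESYKLKRAVVSSASEAAEVLLRVDNIIRARPRTANRQHM"

protein = translate(first_120, codon_start=3)
print(protein)                        # SSIYNGISTSGLDLNNGTIADMRQLGIVESYKLKRAVVS
print(annotated.startswith(protein))  # True
```

With codon_start=1 the reading frame would be shifted and the output would not match the annotated protein.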


Record from GenBank (cont.3)

     gene            687..3158
                     /gene="AXL2"
     CDS             687..3158
                     /gene="AXL2"
                     /note="plasma membrane glycoprotein"
                     /codon_start=1
                     /function="required for axial budding pattern of S. cerevisiae"
                     /product="Axl2p"
                     /protein_id="AAA98666.1"
                     /db_xref="GI:1293615"
                     /translation="MTQLQISLLLTATISLLHLVVATPYEAYPIGKQYPPVARVN ..."
     gene            complement(3300..4037)
                     /gene="REV7"
     CDS             complement(3300..4037)
                     /gene="REV7"
                     /codon_start=1
                     /product="Rev7p"
                     /protein_id="AAA98667.1"
                     /db_xref="GI:1293616"
                     /translation="MNRWVEKWLRVYLKCYINLILFYRNVYPPQSFDYTTYQSFNLPQ ..."

Cutoff: both /translation values are truncated on the slide.

Another location: the AXL2 entries (687..3158) and the REV7 entries (complement(3300..4037)) each give a further location within the same record.
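A complement(...) location names bases on the strand shown in the record, but the feature itself is read from the opposite strand, so the extracted bases have to be reverse-complemented. The Python sketch below shows this on a hypothetical toy sequence, since the full 5028-base record is not reproduced in these slides; the function names are illustrative.

```python
COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def reverse_complement(dna: str) -> str:
    """Complement every base, then reverse the order."""
    return dna.translate(COMPLEMENT)[::-1]


def extract_complement(sequence: str, start: int, end: int) -> str:
    """Return bases start..end (1-based, inclusive) as read on the complementary strand."""
    return reverse_complement(sequence[start - 1:end])


toy = "aaatggccctaa"                      # hypothetical 12-base sequence
print(extract_complement(toy, 4, 9))      # gggcca

# Length of the REV7 feature at complement(3300..4037):
print(4037 - 3300 + 1)                    # 738 bases
```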


Record from GenBank (cont.4)

BASE COUNT     1510 a   1074 c    835 g   1609 t
ORIGIN
        1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg
       61 ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca gtagtcagct
 . . .
//
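The BASE COUNT line can be cross-checked against the LOCUS length, and the ORIGIN lines can be counted directly. This is a minimal Python sketch with hypothetical function names; it counts only the two sequence lines reproduced on the slide, not the full record.

```python
from collections import Counter


def parse_base_count(line: str) -> dict[str, int]:
    """Turn 'BASE COUNT 1510 a 1074 c ...' into {'a': 1510, 'c': 1074, ...}."""
    tokens = line.split()[2:]  # drop the words BASE and COUNT
    return {base: int(n) for n, base in zip(tokens[0::2], tokens[1::2])}


def count_bases(origin_lines: list[str]) -> Counter:
    """Count bases in ORIGIN lines, skipping position numbers and spaces."""
    counts: Counter = Counter()
    for line in origin_lines:
        counts.update(ch for ch in line if ch.isalpha())
    return counts


declared = parse_base_count("BASE COUNT 1510 a 1074 c 835 g 1609 t")
print(sum(declared.values()))  # 5028, the length given on the LOCUS line

shown = count_bases([
    "1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg",
    "61 ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca gtagtcagct",
])
print(sum(shown.values()))     # 120 bases on the two lines shown
```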


EBI Sequence Retrieval System (SRS)

EMBL records are split into: SRS-author, SRS-accession number, SRS-title, SRS-reference, SRS-organism

Parts of the record are parsed into separate database files.
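The idea of parsing parts of a record into separate files can be pictured as one index per field. The Python sketch below is a hypothetical illustration of that idea, not the actual SRS implementation: `build_indexes` and the second record (accession HYPO0001) are invented for the example, while the first record reuses fields from U49845.

```python
from collections import defaultdict

records = [
    {"accession": "U49845",
     "author": ["Torpey,L.E.", "Roemer,T."],
     "organism": "Saccharomyces cerevisiae"},
    {"accession": "HYPO0001",                 # hypothetical record for illustration
     "author": ["Roemer,T."],
     "organism": "Homo sapiens"},
]


def build_indexes(records: list[dict], fields: list[str]) -> dict:
    """Build one index per field, mapping each field value to accession numbers."""
    indexes = {f: defaultdict(set) for f in fields}
    for rec in records:
        for f in fields:
            values = rec[f] if isinstance(rec[f], list) else [rec[f]]
            for value in values:
                indexes[f][value].add(rec["accession"])
    return indexes


indexes = build_indexes(records, ["author", "organism"])
print(sorted(indexes["author"]["Roemer,T."]))
print(sorted(indexes["organism"]["Saccharomyces cerevisiae"]))
```

Keeping each field in its own index lets a query on one field, such as author, run without scanning whole records.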


Primary databases vs. Secondary databases

A primary database holds information submitted directly by the experimenter. It is called an archival database.

A secondary database derives its information from primary databases. It is a curated database.


Types of primary databases carrying biological information

GenBank/EMBL/DDBJ

PDB: three-dimensional structure coordinates of biological molecules

PROSITE: database of protein domain/function relationships


Types of secondary databases carrying biological information

dbSTS: non-redundant database of sequence-tagged sites (useful for physical mapping)

Genome databases: there are over 20 genome databases that can be searched

EPD: Eukaryotic Promoter Database

NR: non-redundant GenBank+EMBL+DDBJ+PDB. Entries with 100% sequence identity are merged into one.

Vector: a subset of GenBank containing vector DNA

ProDom

PRINTS

BLOCKS
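The merging rule described for NR above (entries with 100% sequence identity become one entry) can be sketched in Python. This is only an illustration of the idea, not the procedure NCBI uses; the function name and the entry identifiers are hypothetical.

```python
def merge_identical(entries: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Group entries whose sequences are 100% identical (case-insensitive)."""
    merged: dict[str, list[str]] = {}
    for identifier, sequence in entries:
        merged.setdefault(sequence.upper(), []).append(identifier)
    return merged


# Hypothetical identifiers and sequences for illustration.
entries = [
    ("genbank|A1", "ATGGCC"),
    ("embl|B7", "atggcc"),    # identical to the first entry
    ("ddbj|C3", "ATGGCA"),    # differs in the last base
]

nr = merge_identical(entries)
for sequence, identifiers in nr.items():
    print(sequence, identifiers)
```

A single mismatch keeps two entries separate, so near-duplicates remain in such a set.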


[Figure: DNA, RNA (cDNA) and protein, each linked to the databases derived from GenBank]

DNA databases derived from GenBank containing data for a single gene
•Non-redundant (nr)
•dbGSS
•dbHTGS
•dbSTS
•LocusLink

RNA (cDNA) databases derived from GenBank containing data for a single gene
•dbEST
•UniGene
•LocusLink

Protein databases derived from GenBank containing data for a single gene
•Non-redundant (nr)
•Swissprot
•PIR
•LocusLink
