Lecture 2.2 1
Retrieving Information: Using Entrez
Lecture 2.2 2
Retrieving information: how it works:
• Servers have the records you want.
• You need to understand the data they have, and how it is organized.
• There are often many ways to get to an answer.
• The route to get there is not always obvious, but you need to think of alternatives and traps.
• Use some query language – each system has its own.
• Retrieve data in a specified format.
• Save it in a way that will be useful to you.
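The "query language" and "specified format" steps above can be sketched with the NCBI E-utilities, the programmatic interface to Entrez. This is a minimal sketch, not part of the original slides: the helper names `esearch_url` and `efetch_url` are hypothetical, and the database names and accessions are placeholders. It only builds the request URLs and does not contact the server.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities (programmatic access to Entrez).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """Build an ESearch URL: ask one Entrez database which records match a query."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def efetch_url(db, ids, rettype, retmode="text"):
    """Build an EFetch URL: retrieve the listed records in a specified format."""
    params = {"db": db, "id": ",".join(ids), "rettype": rettype, "retmode": retmode}
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


# Step 1: query (hypothetical helper, placeholder search term).
print(esearch_url("pubmed", '"hieter p"[au] AND cdc16p'))
# Step 2: retrieve in a chosen format (FASTA here); the accession is a placeholder.
print(efetch_url("nucleotide", ["L12345"], rettype="fasta"))
```

Saving the result "in a way that will be useful to you" would then be a matter of writing the response body to a file named after the query or the accession.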
Lecture 2.2 3
What you may be looking for:
• You did a BLAST search and need more information about some of the proteins it found similarities to.
• You heard about a disease gene that was recently discovered and want to know more about it.
• You want to build a dataset for local BLAST searches.
• A colleague wants you to do an alignment of all sequences from a given protein family.
Lecture 2.2 4
What you are looking for:
• PubMed paper from author X
• Sequence from gene X in organism Y
• All information about organelle W in model organism Y
• All information about disease X in human
• Orthologs of that disease gene in other model organisms
Lecture 2.2 5
Central Dogma: NCBI version
DNA → RNA → protein → Write a paper about it
Lecture 2.2 6
Entrez: Pathway to Discovery
[Diagram, 1993]
Databases:
• MEDLINE abstracts
• Nucleotide sequences
• Protein sequences
Links between them:
• Term frequency statistics
• Nucleotide sequence similarity
• Amino acid sequence similarity
• Coding region features
• Literature citations in sequence databases
Lecture 2.2 7
Related Articles
Type in your last name and find a paper from one of your teammates.
Lecture 2.2 8
Hard link DNA to protein
L12345
Lecture 2.2 9
From Fig. 1 of “Entrez search and retrieval system”, Jim Ostell, Chapter 14, the NCBI Handbook.
2003
Lecture 2.2 13
Ctrl-F
Lecture 2.2 15
Getting started in Entrez
Lecture 2.2 16
“ouellette bf” [au] AND yeast
Lecture 2.2 20
MeSH: Medical Subject Heading
Lecture 2.2 21
A query
• Word <free text>: too many hits
– More words (the Boolean ‘AND’ is the default)
– Limit query to specified field
– Limit query in time
– Do Boolean on queries
• #1 AND #2
• #3 NOT #5
• #7 OR #8
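The ways of narrowing a query listed above (field limits, time limits, Boolean combinations) can be sketched as simple string building. This is an illustrative sketch only: the helper names `field`, `date_range` and `combine` are hypothetical, and the date syntax assumes the PubMed publication-date tag `[dp]` with `YYYY/MM/DD` dates.

```python
def field(term, tag=None):
    """Return a term, optionally limited to one field, e.g. "hieter p"[au]."""
    quoted = f'"{term}"' if " " in term else term
    return f"{quoted}[{tag}]" if tag else quoted


def date_range(start, end, tag="dp"):
    """Limit a query in time; dates are YYYY/MM/DD strings (assumed PubMed syntax)."""
    return f"{start}:{end}[{tag}]"


def combine(op, *parts):
    """Join sub-queries with AND, OR or NOT, parenthesised so precedence is explicit."""
    if op not in {"AND", "OR", "NOT"}:
        raise ValueError(f"unsupported operator: {op}")
    return "(" + f" {op} ".join(parts) + ")"


# Author field limit combined with a one-year time limit.
query = combine("AND", field("hieter p", "au"), date_range("1993/01/01", "1993/12/31"))
print(query)

# History numbers (#3, #5) can be combined the same way.
print(combine("NOT", "#3", "#5"))
```

Wrapping every combination in parentheses is a deliberate choice here: it avoids relying on the search engine's own left-to-right evaluation order when several operators are mixed.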
Lecture 2.2 22
hieter p [au]
Lecture 2.2 23
Limit in Time: 1993-01-01 to 1993-12-31
Lecture 2.2 25
No abstract
With abstract
Full Text on-line
Full Text in PubMed Central
Lecture 2.2 26
boguski m [au] 99
boguski ms [au] 80
Lecture 2.2 27
#24 NOT #23 19
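The arithmetic behind this NOT step can be checked with sets. This sketch assumes that #24 is the broader search (boguski m [au], 99 records), that #23 is the narrower one (boguski ms [au], 80 records), and that every record in the narrower search is also in the broader one. The record IDs are made-up integers, not real PubMed IDs.

```python
# Hypothetical record IDs standing in for the two author searches.
boguski_m = set(range(1, 100))   # 99 records for: boguski m [au]
boguski_ms = set(range(1, 81))   # 80 records for: boguski ms [au]

# "#24 NOT #23" keeps records found by the first search but not by the second.
only_m = boguski_m - boguski_ms

print(len(boguski_m), len(boguski_ms), len(only_m))
```

Under those assumptions the difference is 99 - 80 = 19 records, which matches the count on the slide. If the second set were not fully contained in the first, the result would be larger than 19.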
Lecture 2.2 29
Other types of links in Entrez
• The next slides explore other kinds of things linked into Entrez records.
Lecture 2.2 30
“hieter p” [au] cdc16p
Lecture 2.2 39
“Books”
Lecture 2.2 40
(2)
Lecture 2.2 46
Link to Genome View of Chromosome I
Lecture 2.2 49
RefSeq
• RefSeq is the NCBI-curated set of “reference sequences” for all ‘worked’ genomes.
• Historically, these were referred to as “GenBank-Gold”.
• RefSeq records are genomic, mRNA, or protein sequences.
• Not all sequences are in RefSeq.
• All RefSeq sequences are assembled or taken from records in GenBank.
Lecture 2.2 50
Some of the features of RefSeq:
• non-redundancy
• explicitly linked nucleotide and protein sequences
• updates to reflect current knowledge of sequence data and biology
• data validation and format consistency
• distinct accession series
• ongoing curation by NCBI staff and collaborators, with review status indicated on each record
Lecture 2.2 51
Accession number space
• GenBank:
– 1+5 (L12345, U00001)
– 2+6 (AF000001, AC000003)
– 4+2+6 (WGS)
• All have accession.version
• Protein:
– 1+5 (SwissProt/UniProt)
– 3+5 (GenPept)
• All have accession.version
• RefSeq:
– N*_12345
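The letter+digit layouts above can be recognised with a few regular expressions. This is a rough sketch covering only the layouts listed on this slide; the names `PATTERNS`, `split_version` and `layout` are hypothetical. Note that the 1+5 layout alone cannot tell a GenBank nucleotide accession from a SwissProt/UniProt protein accession.

```python
import re

# Layouts from the slide; RefSeq is recognised by its two-letter prefix and underscore.
PATTERNS = [
    ("RefSeq", re.compile(r"^[A-Z]{2}_[A-Z]{0,4}\d+$")),
    ("1+5", re.compile(r"^[A-Z]\d{5}$")),
    ("2+6", re.compile(r"^[A-Z]{2}\d{6}$")),
    ("3+5", re.compile(r"^[A-Z]{3}\d{5}$")),
    ("4+2+6 (WGS)", re.compile(r"^[A-Z]{4}\d{2}\d{6}$")),
]


def split_version(identifier):
    """Split 'accession.version' into (accession, version); version is None if absent."""
    accession, _, version = identifier.partition(".")
    return accession, int(version) if version else None


def layout(identifier):
    """Name the layout of an accession, ignoring any trailing .version."""
    accession, _ = split_version(identifier.strip().upper())
    for name, pattern in PATTERNS:
        if pattern.match(accession):
            return name
    return "unknown"


for example in ["L12345", "AF000001.1", "AAA12345", "NM_123456"]:
    print(example, "->", layout(example))
```

Checking the RefSeq pattern first keeps underscore-prefixed accessions from being tested against the plain GenBank layouts.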
Lecture 2.2 52
RefSeq Accession Number Space
NC_123456     Genomic   Complete genomic molecules including genomes, chromosomes, organelles, plasmids.
NG_123456     Genomic   Incomplete genomic region; supplied to support the NCBI Genome Annotation pipeline.
NM_123456     mRNA
NR_123456     RNA       Non-coding transcripts including structural RNAs, transcribed pseudogenes, and others.
NP_123456     Protein
NP_12345678   Protein   Planned expansion of accession series.
Lecture 2.2 53
Automated Assemblies
NT_123456     Genomic   Intermediate genomic assemblies of BAC sequence data.
NW_123456     Genomic   Intermediate genomic assemblies of Whole Genome Shotgun sequence data.
Lecture 2.2 54
Model RefSeq records
XM_123456     mRNA      Model mRNA provided by the Genome Annotation process; sequence corresponds to the genomic contig.
XR_123456     RNA       Model non-coding transcripts provided by the Genome Annotation process; sequence corresponds to the genomic contig.
XP_123456     Protein   Model proteins provided by the Genome Annotation process; sequence corresponds to the genomic contig.
Lecture 2.2 55
WGS special case
NZ_ABCD12345678   Genomic   A collection of whole genome shotgun sequence data for a project. Accessions are not tracked between releases. The first four characters following the underscore (e.g. 'ABCD') identify a genome project.
ZP_12345678       Protein   Proteins annotated on NZ_ accessions (often via computational methods).
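The RefSeq prefixes from the preceding slides can be collected into one lookup table. This is a sketch only: the table `REFSEQ_PREFIXES`, the function `describe_refseq`, and the group labels (taken from the slide titles) are illustrative and not an official NCBI data structure. It covers only the prefixes shown in these slides.

```python
# Prefix -> (molecule type, slide group), as listed on the slides.
REFSEQ_PREFIXES = {
    "NC": ("Genomic", "RefSeq"),
    "NG": ("Genomic", "RefSeq"),
    "NM": ("mRNA", "RefSeq"),
    "NR": ("RNA", "RefSeq"),
    "NP": ("Protein", "RefSeq"),
    "NT": ("Genomic", "Automated assembly"),
    "NW": ("Genomic", "Automated assembly"),
    "XM": ("mRNA", "Model"),
    "XR": ("RNA", "Model"),
    "XP": ("Protein", "Model"),
    "NZ": ("Genomic", "WGS"),
    "ZP": ("Protein", "WGS"),
}


def describe_refseq(accession):
    """Return prefix, molecule type and group for a RefSeq accession, else None."""
    prefix, underscore, _ = accession.strip().upper().partition("_")
    if not underscore or prefix not in REFSEQ_PREFIXES:
        return None
    molecule, group = REFSEQ_PREFIXES[prefix]
    return {"prefix": prefix + "_", "molecule": molecule, "group": group}


print(describe_refseq("XP_123456"))
print(describe_refseq("NZ_ABCD12345678"))
print(describe_refseq("L12345"))  # not a RefSeq accession
```

A lookup like this is handy when filtering search results, for example to keep only the NM_ and NP_ records and set aside the model (XM_, XP_) records.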
Lecture 2.2 56
Download all the data
Entrez and RefSeq
Lecture 2.2 60
LocusLink
Lecture 2.2 61
Things to watch out for: