
UNIT 1.3 Searching the NCBI Databases Using Entrez

One of the most widely used interfaces for the retrieval of information from biological databases is the NCBI Entrez system. Entrez capitalizes on the fact that there are pre-existing, logical relationships between the individual entries found in numerous public databases. For example, a paper in MEDLINE (or, more properly, PubMed) may describe the sequencing of a gene whose sequence appears in GenBank. The nucleotide sequence, in turn, may code for a protein product whose sequence is stored in the protein databases. The three-dimensional structure of that protein may be known, and the coordinates for that structure may appear in the structure database. Finally, the gene may have been mapped to a specific region of a given chromosome, with that information being stored in a mapping database. The existence of such natural connections, mostly biological in nature, argued for the development of a method through which all the information about a particular biological entity could be found without having to sequentially visit and query disparate databases.

Basic Protocols 1 and 2 describe simple, text-based searches, illustrating the types of information that can be retrieved through the Entrez system; Basic Protocol 2 also illustrates the use of Cn3D, a viewer that is used to visualize three-dimensional structures. The Alternate Protocol builds upon Basic Protocol 1, using additional, built-in features of the Entrez system, as well as alternative ways of issuing the initial query. The Support Protocol reviews how to save frequently issued queries.

BASIC PROTOCOL 1

QUERYING ENTREZ

The Entrez Web interface is located at http://www.ncbi.nlm.nih.gov/Entrez. Most of the Web pages at the NCBI Web site provide a direct link to Entrez, either in a blue bar running across the top of the page or in the left-hand sidebar. Entrez queries can also be issued from the NCBI home page (http://www.ncbi.nlm.nih.gov). The best way to illustrate the integrated nature of the Entrez system and to drive home the power of neighboring (see Commentary) is by considering three biological examples, described in Basic Protocols 1 and 2 and the Alternate Protocol.

Necessary Resources

Software

An up-to-date Web browser, such as Netscape Communicator, MS Internet Explorer, Apple Safari, or Mozilla Firefox

Select and search an Entrez database

1. Begin at the Entrez home page (http://www.ncbi.nlm.nih.gov/Entrez).

2. In the “Search across databases” text box, enter the following:

DCC AND "Vogelstein B"

Using Boolean operators such as AND, OR, and NOT is the simplest way to query the Entrez system. Note that all Boolean operators must be capitalized for the query to return the expected results.

Contributed by Andreas D. Baxevanis
Current Protocols in Bioinformatics (2006) 1.3.1-1.3.24, Supplement 13
Copyright © 2006 by John Wiley & Sons, Inc.

Table 1.3.1 Entrez Boolean Search Statements (a, b)

Tag               Significance/usage
----------------  ------------------------------------------------------------
[ACCN]            Accession
[AD]              Affiliation
[ALL]             All fields
[AU] or [AUTH]    Author name
                    O’Brien J [AUTH] yields all of O’Brien JA, O’Brien JB, etc.
                    "O’Brien J" [AUTH] yields only O’Brien J
[ECNO]            Enzyme Commission or Chemical Abstract Service numbers
[FKEY]            Feature key (nucleotide only)
[GENE]            Gene name
[ISS]             Issue of journal
[JOUR]            Journal title, official abbreviation, or ISSN number
                    Journal of Biological Chemistry
                    J Biol Chem
                    0021-9258
[LA]              Language
[MAJR]            MeSH Major Topic
                    One of the major topics discussed in the article
[MDAT]            Date of most recent modification to Entrez
                    YYYY/MM/DD, YYYY/MM, or YYYY
[MOLWT]           Molecular weight of a protein, in Daltons
[MH]              MeSH Terms
                    Controlled vocabulary of biomedical terms (subject)
[ORGN]            Organism
[PS]              Personal name as subject
                    Use when name is subject of article, e.g., Varmus H [PS]
[PDAT]            Publication date
                    YYYY/MM/DD, YYYY/MM, or YYYY
[PROT]            Protein name (not available in Structure database)
[PT]              Publication type
                    Review
                    Clinical Trial
                    Lectures
                    Letter
                    Technical Publication
[SH]              Subheading
                    Used to modify MeSH terms
                    hypertension [MH] AND toxicity [SH]
[SUBS]            Substance name
                    Name of chemical discussed in article
[WORD]            Text words
                    All words and numbers in the title and abstract, MeSH terms,
                    subheadings, chemical substance names, personal name as
                    subject, and MEDLINE secondary sources
[TITL]            Title word
                    Only words in the definition line (not available in
                    Structure database)
[UID]             Unique Identifiers (PMID/MEDLINE numbers)
[VOL]             Volume of journal

a General syntax is as follows: search term [tag] [Boolean operator] search term [tag] [Boolean operator] . . .
b [Boolean operator] = AND, OR, or NOT. [tag] represents a tag chosen from the left column of the table above.

If the [AU] qualifier were included after the search term “Vogelstein B”, the search would instruct Entrez to look only at the author field of entries for the occurrence of “Vogelstein B”. In this example, which omits the qualifier, the query will return all available information on the DCC gene in which the term “Vogelstein B” is also found. A list of this and other available search qualifiers is given in Table 1.3.1. Note that when using qualifiers, the square brackets ([]) are required. See Table 1.3.1 for examples of how to specify author names and the use of wildcards with author names.
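
The general syntax given with Table 1.3.1 (search term [tag] [Boolean operator] search term [tag]) can be assembled programmatically. The Python sketch below is only an illustration: the helper names `tagged` and `combine` are hypothetical and are not part of Entrez. It builds the string a user would type into the search box and rejects operators that are not capitalized, since Boolean operators must be capitalized for a query to return the expected results.

```python
# Illustrative helpers (hypothetical names) for composing Entrez query strings.
BOOLEAN_OPERATORS = {"AND", "OR", "NOT"}


def tagged(term, tag=None, exact=False):
    """Return a search term, optionally quoted and followed by a [tag] qualifier."""
    text = f'"{term}"' if exact else term
    return f"{text} [{tag}]" if tag else text


def combine(first, *rest):
    """Join terms as: term OP term OP term ...

    `rest` must alternate operator, term. Operators must be capitalized.
    """
    if len(rest) % 2 != 0:
        raise ValueError("expected alternating operator/term pairs")
    parts = [first]
    for operator, term in zip(rest[0::2], rest[1::2]):
        if operator not in BOOLEAN_OPERATORS:
            raise ValueError(f"operator must be AND, OR, or NOT (capitalized): {operator!r}")
        parts.extend([operator, term])
    return " ".join(parts)


query = combine(tagged("DCC"), "AND", tagged("Vogelstein B", exact=True))
author_only = combine(tagged("DCC"), "AND", tagged("Vogelstein B", tag="AU", exact=True))
print(query)        # DCC AND "Vogelstein B"
print(author_only)  # DCC AND "Vogelstein B" [AU]
```

The first string reproduces the query entered in step 2; the second shows where the [AU] qualifier would go if the search were to be restricted to the author field.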

3. Click the Go button to submit the query. Running the query in September 2005 returned 14 papers (indicated by the “14” next to the PubMed entry at the top of the left-hand column of the list of databases), 16 nucleotide entries (indicated by the “16” next to “Nucleotide” in the list), and 15 protein entries (indicated by the “15” next to “Protein” in the list); see Figure 1.3.1.

Entrez queries all of the available databases, and the number of hits to each database is shown to the left of the name of the database. The user can further narrow down the query by adding additional terms, if interested in a more specific aspect of this gene or if there are simply too many entries returned by the initial query. The reader is encouraged to carefully look at Figure 1.3.1 to see the broad array of biological resources that are available through the Entrez system; the name of each component database is followed by a brief description of its contents.
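
For readers who prefer scripting to the Web interface, the same query string can in principle be sent to individual Entrez databases through NCBI's Entrez Programming Utilities (E-utilities). E-utilities are not covered in this protocol, so the following is a hedged sketch only: it assumes the ESearch endpoint and its `db`, `term`, and `retmax` parameters, assumes the database names shown, and merely builds the URLs without contacting NCBI. The function name `esearch_url` is hypothetical.

```python
from urllib.parse import urlencode

# Assumption: the E-utilities ESearch endpoint; it is not part of this protocol.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(database, term, retmax=20):
    """Build (but do not send) an ESearch URL for one Entrez database."""
    params = {"db": database, "term": term, "retmax": retmax}
    return f"{ESEARCH_URL}?{urlencode(params)}"


# One URL per database of interest, mirroring the per-database hit counts
# shown on the unified results page.
term = 'DCC AND "Vogelstein B"'
urls = {db: esearch_url(db, term) for db in ("pubmed", "nucleotide", "protein")}
for db, url in urls.items():
    print(db, url)
```

Building the URL separately from sending it keeps the example runnable offline and makes the encoded query easy to inspect.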

Viewing individual database entries

4. Click on the “14” next to “PubMed.” This will produce the view seen in Figure 1.3.2.

For each of the found papers in PubMed, the user is presented with the authors on the paper, the name of the paper, and the citation.

5. Clicking on any of the hyperlinked author lists will take the user to the Abstract view of the selected paper, which presents the name of the paper, the list of authors, their institutional affiliation(s), and the abstract itself, in standard (“Abstract”) format. Here, click on the list of authors of the fourth reference in the list, by Kathy Cho and colleagues (Cho et al., 1994). This will produce the view seen in Figure 1.3.3.




Figure 1.3.1 The Entrez unified results page, showing the number of hits to each of Entrez’s component databases fitting the query. Clicking on any of the numbers to the left of the database name takes the user to the results found in that particular database.

Figure 1.3.2 Results of a text-based Entrez query using Boolean operators against PubMed. The initial query (from Fig. 1.3.1) is shown in the search box near the top of the window. Each entry gives the names of the authors, the title of the paper, and the citation information. An individual record can be viewed by clicking on the author list for that paper.




Figure 1.3.3 An example of a PubMed record in Abstract format, as returned through Entrez. This Abstract view is for the fourth reference shown in Figure 1.3.2. This view provides connections to related articles, sequence information, and the actual, full-text journal article. See text for details.

6. To change the display, select Citation from the Display drop-down menu (near bottom of page). Switching to this format produces a similar-looking entry; however, the cataloging information, such as the MeSH terms and indexed substances relating to the entry, is now displayed below the abstract.

To see an additional display option, select MEDLINE from the same drop-down menu and click the Display button. This selection produces the MEDLINE/MEDLARS layout, with two-letter codes corresponding to the contents of each field going down the left-hand side of the entry (e.g., the author field is denoted by the code AU). Entries in this format can be saved and easily imported into third-party bibliography-management programs, such as EndNote and Reference Manager.
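
Because each field in the MEDLINE layout is introduced by a short code, a saved entry is straightforward to process with a script. The Python sketch below is a minimal illustration under stated assumptions: it assumes each field starts with a code padded to four characters followed by "- ", and that wrapped lines are indented. The sample record is an invented placeholder, not a real PubMed entry, and the function name `parse_medline` is hypothetical.

```python
from collections import defaultdict

# Placeholder record in MEDLINE-style tagged layout; the values are invented
# for illustration and do not correspond to a real PubMed entry.
SAMPLE = """\
PMID- 0000000
TI  - A placeholder title that is long enough to wrap onto
      a second line.
AU  - Author A
AU  - Author B
TA  - J Placeholder
"""


def parse_medline(text):
    """Group field values by their code (e.g., AU for author)."""
    fields = defaultdict(list)
    tag = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[:4].strip() and line[4:6] == "- ":
            # New field: code in the first four columns, value after "- ".
            tag = line[:4].strip()
            fields[tag].append(line[6:].strip())
        elif tag is not None:
            # Continuation line: append to the most recent value.
            fields[tag][-1] += " " + line.strip()
    return dict(fields)


record = parse_medline(SAMPLE)
print(record["AU"])  # ['Author A', 'Author B']
```

Values are stored as lists because codes such as AU can occur more than once in a single entry.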

7. Select Abstract from the drop-down menu and click Display to return to the Abstract view.

8. To view the full text of an article, select the Full-Text Article hyperlink located under the journal citation (shown in Fig. 1.3.3). With the proper individual or institutional privileges, the user can view the entire text of the paper, including all figures and tables.

Finding related material

9. Click on the Related Article links in the upper right-hand corner of the view shown in Figure 1.3.3. Entrez will return a list of 101 references of similar subject matter (again, as of September 2005); the first eight of these are shown in Figure 1.3.4.




Figure 1.3.4 Neighbors to an entry found in PubMed. The original entry from Figure 1.3.3 (Cho et al., 1994) is at the top of the list, indicating that this is the parent entry. Clicking on the Links button to the left of any of the entries produces a pop-up menu, providing links to related entries outside PubMed. See text for details.

Figure 1.3.5 The Entrez Gene page for the DCC (deleted in colorectal carcinoma) gene. The screen shows that this is a protein-coding gene at map location 18q21.3, and information on the genomic context of DCC, as well as alternative gene names and information on the encoded protein, is provided.




The first paper in the list is the same Cho et al. paper, since, by definition, it is most related to itself (the “parent” entry). The order in which the related papers follow is based on statistical similarity. Thus, the entry closest to the parent is deemed to be the closest in subject matter to the parent. By scanning the titles, the user can easily find related information on other studies, as well as quickly amass a bibliography of relevant references. This can be a useful and time-saving function if the user is writing grants or papers, because abstracts can easily be scanned and papers of real interest identified before the user heads off for the library stacks.

10. Click on the browser’s Back button to return to the Abstract view.

11. Click on the Links hyperlink in the upper right-hand corner of the Abstract view. A pop-up menu will appear (shown on the right-hand side of Fig. 1.3.4). The pop-up menu provides a new series of links, representing hard-link connections to other databases within the Entrez system. Select the Gene link and then click on “DCC” to obtain the page shown in Figure 1.3.5.

This page is from Entrez Gene, a new feature of Entrez that provides summary information about the gene in question. Entrez Gene replaces LocusLink, which was a gene-centric system for retrieving similar information on a given genetic locus. The data are gathered from a variety of sources, including NCBI’s RefSeq. The screen shows that DCC is a protein-coding gene at map location 18q21.3, and information on the genomic context of DCC, as well as alternative gene names and information on the encoded protein, is provided. Additional content not shown in the figure includes information on protein-protein interactions, Gene Ontology assignments, and homologies to selected organisms.

Figure 1.3.6 The dbSNP GeneView page for the DCC gene. The information on individual SNPs is shown in the table towards the bottom of the screen. Each SNP occupies two lines of the table, with one line showing the “contig reference” (the more-common allele) and the other showing the SNP (the less-common allele). For example, the first two rows of the table show a contig reference T for which there is a documented SNP, changing the T to a C. At the protein level, this changes the amino acid at position 23 of the DCC protein from phenylalanine to leucine. The rows are colored red since this is a “nonsynonymous SNP”; that is, the SNP produces a discrete change at the amino acid level. The next two rows of the table, in which the contig reference and the dbSNP allele ultimately produce the same amino acid, are colored green, to indicate that this is a “synonymous SNP”. For the color version of this figure go to http://www.currentprotocols.com.




12. To view documented single-nucleotide polymorphisms (SNPs) in this gene, click on the “SNP: GeneView” link located in the right sidebar. This will produce the view seen in Figure 1.3.6.

The information on this page is derived from the Database of Single Nucleotide Polymorphisms (dbSNP; Mullikin and Sherry, 2005). The individual SNPs that occur within the DCC gene are shown in the table at the bottom of the figure; each SNP occupies two lines of the table, with one line showing the “contig reference” (the more common allele) and the other showing the SNP (the less common allele). Details are given in the figure legend.
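
The distinction between synonymous and nonsynonymous SNPs drawn in the legend to Figure 1.3.6 can be made concrete by translating the codon before and after the change. The Python sketch below uses the standard genetic code; the function name `classify_snp` is hypothetical, and the codon TTT is an assumption chosen for illustration, because the actual codon at position 23 of DCC is not given here.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_snp(reference_codon, position, snp_base):
    """Classify a single-base change within a codon.

    position is 0, 1, or 2 (index within the codon).
    Returns (reference_aa, variant_aa, "synonymous" or "nonsynonymous").
    """
    variant = reference_codon[:position] + snp_base + reference_codon[position + 1:]
    ref_aa = CODON_TABLE[reference_codon]
    var_aa = CODON_TABLE[variant]
    kind = "synonymous" if ref_aa == var_aa else "nonsynonymous"
    return ref_aa, var_aa, kind


# Assumed codon: TTT (Phe). A T-to-C change at the first position gives CTT (Leu).
print(classify_snp("TTT", 0, "C"))  # ('F', 'L', 'nonsynonymous')
# A T-to-C change at the third position gives TTC, which is still Phe.
print(classify_snp("TTT", 2, "C"))  # ('F', 'F', 'synonymous')
```

The same T-to-C substitution is therefore nonsynonymous or synonymous depending on where in the codon it falls, which is why dbSNP reports the amino acid consequence alongside each allele.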

13. Return to the Abstract view, click on the Links hyperlink in the upper right-hand corner of the Abstract view, and this time select Protein.

The Protein link produces the view shown in Figure 1.3.7. Clicking on any of the hyperlinked accession numbers would take the user to the actual entry for that protein, which includes the protein’s sequence.

14. Return to the Abstract view, click on the Links hyperlink in the upper right-hand corner of the Abstract view, and this time select OMIM.

The OMIM link produces the view shown in Figure 1.3.8, the entry for DCC in the Online Mendelian Inheritance in Man (OMIM) database (McKusick, 1998; Hamosh et al., 2002; UNIT 1.2). OMIM provides concise textual information from the published literature on most human conditions having a genetic basis, as well as pictures illustrating the condition or disorder (where appropriate) and full citation information. Each entry includes information such as the gene symbol, alternate names for the disease, a description of the disease, a clinical synopsis, and references.

15. Click on the Allelic Variants link in the left sidebar. This will jump the user down to a section on the same Web page labeled Allelic Variants (Fig. 1.3.9).

Figure 1.3.7 Entries in the protein databases corresponding to the original Cho et al. (1994) entry shown in Figure 1.3.2. Entries can be accessed by clicking on any of the accession numbers. See text for details.




Figure 1.3.8 The OMIM entry for the DCC gene. Each entry includes information such as the gene symbol, alternate names for the disease, a description of the disease, a clinical synopsis, and references.

Figure 1.3.9 An example of a list of allelic variants that can be obtained through OMIM. The figure shows the list of allelic variants for the DCC gene. The description under each allelic variant provides information specific to that particular mutation.




A particularly useful feature of OMIM is the list of allelic variants that appears in many OMIM entries. A short description is given, after each allelic variant, of the clinical or biochemical outcome of that particular mutation. There are currently more than 1000 gene entries containing at least one allelic variant that either causes or is associated with a discrete phenotype in humans. Note that the two allelic variants shown in Figure 1.3.9 produce significantly different clinical outcomes (colorectal cancer in one case, esophageal carcinoma in the other).

16. Return to the Abstract view, click on the Links hyperlink in the upper right-hand corner of the Abstract view, and this time select GEO Profiles.

The Gene Expression Omnibus (GEO) at NCBI serves as a public repository for array-based data, including microarray and chromatin immunoprecipitation (ChIP-chip) data (Barrett et al., 2005). The GEO records shown in Figure 1.3.10 all have accession numbers beginning with the code GDS, indicating that they are “GEO Data Sets,” i.e., curated sets of GEO data. The GPL numbers refer to the “GEO Platforms,” the actual list of elements on the array (e.g., cDNAs or oligonucleotides). The reader is strongly encouraged to read the online GEO Overview, which provides descriptions of the various GEO elements and information on how to browse GEO data.
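
Since the type of a GEO record is encoded in the prefix of its accession number, a short lookup is enough to tell the record types apart. The Python sketch below covers only the two prefixes named above; the function name `describe_geo_accession` is hypothetical, and the accession numbers used are placeholders rather than records cited in this unit.

```python
# Prefixes named in the text above; other GEO record types are not covered here.
GEO_PREFIXES = {
    "GDS": "GEO DataSet (curated set of GEO data)",
    "GPL": "GEO Platform (list of elements on the array)",
}


def describe_geo_accession(accession):
    """Return a description for a GEO accession based on its prefix."""
    prefix = accession[:3].upper()
    number = accession[3:]
    if prefix not in GEO_PREFIXES or not number.isdigit():
        return "unrecognized in this sketch"
    return GEO_PREFIXES[prefix]


# The accession numbers below are placeholders, not records cited in this unit.
print(describe_geo_accession("GDS0001"))
print(describe_geo_accession("GPL0001"))
```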

17. Return to the Abstract view, click on the Links hyperlink in the upper right-hand corner of the Abstract view, and this time select Books.

The Books link produces a heavily hyperlinked version of the original citation. The highlighted words in this view correspond to keywords that can take the user to full-text books that are available through NCBI. Select the hyperlink for “cell adhesion molecules.”

18. Select the hyperlink marked “16 items” next to the thumbnail of the cover of Molecular Biology of the Cell (Alberts et al., 2002). This will produce the page shown in Figure 1.3.11. From this page, click on the fourth link in the list (Integrins).

Figure 1.3.10 Gene Expression Omnibus (GEO) DataSets for the DCC gene. For each DataSet, a brief description of the experiment is provided, as well as a schematic of the gene expression profile derived in the study.




Figure 1.3.11 Entries relating to the original Cho et al. (1994) paper in Molecular Biology of the Cell (Alberts et al., 2002). This is part of the NCBI Bookshelf, which offers electronic versions of many commonly used textbooks. See text for details.

Figure 1.3.12 Specific text for one of the entries shown in Figure 1.3.11, on integrins. The left sidebar can be used to freely navigate through the entire contents of Alberts et al. (2002), as well as view relevant figures and tables. See text for details.




The user has now been taken to the relevant part of the textbook, a section devoted to the biology of integrins (Fig. 1.3.12). From this page, the user can navigate through this particular section of the book, gathering more general information on cell adherence and anchoring junctions.

19. Return to the Abstract view, click on the Links hyperlink in the upper left-hand corner of the Abstract view, and this time select LinkOut.

LinkOut provides a list of third-party Web sites and resources related to the Entrez query being viewed, such as full text of articles that can be displayed directly through the Web browser, or the ability to order the document through online services. Through LinkOut, users can also obtain information that is particularly useful to patients and clinicians, as will be demonstrated in the next step.

20. From the LinkOut page, click on the MEDLINEplus link. This generates a page devoted to information for both laymen and health professionals on DCC and disorders related to DCC (Fig. 1.3.13).

The information available through this page is often appropriate to provide to patients, since the level of writing is geared towards nonprofessionals. There are also interactive tutorials for various procedures related to DCC along the right-hand side of the page.

21. Click on the “ClinicalTrials.gov: Colorectal Neoplasms” link, found in the Clinical Trials section of the MEDLINEplus page. This takes the user to NIH’s central information source for clinical trials, aptly named clinicaltrials.gov (Fig. 1.3.14).

The listing shown in Figure 1.3.14 gives the first twelve of the 124 clinical trials actively recruiting for patients with colorectal neoplasms at the time of this writing. Clicking on the name of the protocol gives information regarding the study, including the principal investigator’s name and contact information. While scientists tend to focus on the kind of biological data discussed throughout this unit, the clinical trials site is, unarguably, the most important of the sites covered here, since it provides a means through which patients suffering from a given genetic or metabolic disorder can receive the latest, cutting-edge treatment, which may make a substantial difference in their quality of life.

Figure 1.3.13 The MEDLINEplus page devoted to information for both laymen and physicians on DCC and disorders related to DCC. The information available through this page is often much more appropriate to provide to patients, since the level of writing is geared towards nonprofessionals; there are also interactive tutorials for various procedures related to DCC along the right-hand side of the page.

Figure 1.3.14 The ClinicalTrials.gov page showing all actively recruiting clinical trials relating to colorectal neoplasms. Information on each trial, including the principal investigator of the trial and qualification criteria, can be found by clicking on the name of the trial.

SUPPORT PROTOCOL

USING “MY NCBI” TO SAVE SEARCHES AND RESULTS

A storage service called My NCBI (formerly Cubby) is provided to save searches and their corresponding results. The advantage of the My NCBI system is that it can recall the searches that were saved and update them with a click of a mouse, eliminating the need to re-enter the query each time the user wishes to view the most recent results. The system is also capable of E-mailing users updated search results at specified intervals and filtering results based on user-specified criteria. Each user may store up to 100 searches.

Necessary Resources

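Software

An up-to-date Web browser, such as Netscape Communicator, MS Internet Explorer, Apple Safari, or Mozilla Firefox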


Register and log in

1. After executing an Entrez search such as the one described in Basic Protocol 1, click the Save Search link that appears next to the Go and Clear buttons at the top of the results page (Fig. 1.3.2).

For purposes of this example, return to the view shown in Figure 1.3.2, which shows the original set of papers found through the initial Entrez search.

2a. If not registered with My NCBI: Click Register, then follow the instructions presented on the screen. Once registered, the user is automatically logged into My NCBI.




2b. If registered, but not already logged in: Enter a valid username-password pair and click the Sign In button. The login will remain active for the length of the current session (as long as the browser window is open), unless the “Keep me signed in unless I sign out” box is checked.

Store a My NCBI search

3. A new Save Search window will pop up, prompting the user for a name for the search and whether E-mail updates are desired. If yes, the window will expand, prompting the user for an E-mail address, the frequency with which to receive updated results, the desired E-mail format, and the number of entries that should be sent (Fig. 1.3.15).

4. Click OK to complete the process. A confirmatory E-mail will be sent to the E-mail address specified in the Save Search window, asking the user to verify having originated the request and whether to still send the updates via E-mail.

Retrieve and update a My NCBI stored search

5. Click the My NCBI link located on the left sidebar of most NCBI pages (see Fig. 1.3.15). There is also a My NCBI box at the upper right of most NCBI pages.

The user will be taken to a page showing the saved searches (Fig. 1.3.16). In this case, the one search that was stored in step 4 is shown, along with the date it was last updated and the update frequency. The last set of results for this search can be displayed by clicking on the search name, and the update frequency can be changed by clicking on the displayed frequency (here, monthly).

6. Select the search by clicking on the check box to the left of the stored search name, then click the “What’s New for Selected” button. This will re-run the search and return a count of how many new papers have been added since the last time the query was issued. If no new papers have been found, the message “No new items” will be shown.

Figure 1.3.15 The My NCBI save search page, which enables users to store queries and have them updated automatically at user-defined intervals. See text for details.




Figure 1.3.16 Searches saved through My NCBI can be recalled, viewed, and updated through the My Saved Searches screen. See text for details.

7. To view the new papers found, select the link that gives the number of new items (e.g., “1 new”). The date and time of the query are now updated to reflect the current date and time. If this link is not selected, the date and time of the query will not be updated.
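
Conceptually, the “What’s New” count in steps 6 and 7 is the difference between the results of re-running a search and the results seen at the last update. How My NCBI computes this internally is not described here, so the Python sketch below only illustrates the idea; the function name `whats_new` and the identifiers are hypothetical placeholders.

```python
def whats_new(previous_ids, current_ids):
    """Return identifiers present now that were absent at the last update."""
    return sorted(set(current_ids) - set(previous_ids))


# Placeholder identifiers standing in for PubMed IDs.
last_update = ["id-001", "id-002", "id-003"]
rerun_today = ["id-001", "id-002", "id-003", "id-004"]

new_items = whats_new(last_update, rerun_today)
print(f"{len(new_items)} new" if new_items else "No new items")  # 1 new
```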

ALTERNATE PROTOCOL

COMBINING ENTREZ QUERIES

There is another way to perform an Entrez query, involving some built-in features of the system. Consider an example in which one is attempting to find all genes coding for DNA-binding proteins in Methanothermobacter. Although this example concentrates on nucleotide sequences, the general strategy works equally well across any of Entrez’s component databases.

Necessary Resources

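Software

An up-to-date Web browser, such as Netscape Communicator, MS Internet Explorer, Apple Safari, or Mozilla Firefox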


Execute multiple queries

1. Go to the NCBI home page (http://www.ncbi.nlm.nih.gov). Select Nucleotide from the pull-down menu and enter the term DNA-binding in the “for” text box. Click Go.

In September 2005, the query returned 36,124 entries (Fig. 1.3.17).

2. To narrow the query, select the Limits tab, which is located directly below the search text box on the upper left side of the page.

This brings the user to a new page (Fig. 1.3.18) that allows the search to be refined or “limited,” as implied by the name of the tab.




Figure 1.3.17 Formulating a search against the nucleotide portion of Entrez. The initial query is shown in the text box near the top of the window (DNA-binding), and the nucleotide entries matching the query are displayed below. See text for details.

Figure 1.3.18 Using the Limits feature of Entrez to limit a search to a particular organism. See text for details.




Figure 1.3.19 Combining individual queries using the History feature of Entrez. Each search performed in the last hour is given a number, and the searches can be combined using the search numbers and the Boolean operators AND, OR, or NOT. See text for details.

Figure 1.3.20 Entries resulting from the combination of two individual Entrez queries. The command producing the results is shown in the text box near the top of the window (#18 AND #19). The numbers correspond to those assigned to the previously performed searches listed in Figure 1.3.19. See text for details.




3. To limit the search by organism, select Organism from the “Limited to” drop-down menu.

4. Enter methanothermobacter in the text box near the upper left of the screen (Fig. 1.3.18), then select Go.

In September 2005, the query returned 291 entries (Fig. 1.3.19).

Combine the selected queries

5. Click on the History hyperlink, located below the text box. The History page displays the user’s most recent queries (Fig. 1.3.19).

The list shows the individual queries, whether those queries were field-limited, the time at which the query was performed, and how many entries that query returned.

6. To combine the two queries into one, use their query numbers (in this particular case, the queries are numbered 18 and 19; readers may see different numbers on their screens). Enter #18 AND #19 in the text box on the upper left of the screen. Click Preview to regenerate a table showing the new, combined query as #20, containing three entries. Clicking on the 3 in the Result column takes the user to the three entries common to the two original queries, as shown in Figure 1.3.20.
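
Combining numbered History entries with Boolean operators behaves like set arithmetic on the records each query returned. The Python sketch below models that idea only; it is not how Entrez is implemented, the record identifiers are invented placeholders, and the numbers 18 and 19 simply follow the example above.

```python
# Each stored query is modelled as a set of record identifiers.
# The identifiers are placeholders; the history numbers follow the example above.
history = {
    18: {"rec-a", "rec-b", "rec-c", "rec-d"},
    19: {"rec-c", "rec-d", "rec-e"},
}

both = history[18] & history[19]        # #18 AND #19: records found by both queries
either = history[18] | history[19]      # #18 OR #19: records found by either query
only_first = history[18] - history[19]  # #18 NOT #19: records unique to query 18

print(sorted(both))        # ['rec-c', 'rec-d']
print(sorted(only_first))  # ['rec-a', 'rec-b']
```

Seen this way, AND can only shrink a result list and OR can only grow it, which is a quick sanity check on the counts shown in the History table.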

BASIC PROTOCOL 2

EXAMINING STRUCTURES IN ENTREZ

Structure queries can be accomplished simply by selecting Structure from the Search drop-down menu on the NCBI home page. For the example below, assume that the user is trying to find information regarding the structure of HMG-box B from rat, whose PDB accession number is 1HMF.

Necessary Resources

Software

An up-to-date Web browser, such as Netscape Communicator, Internet Explorer, Apple Safari, or Mozilla Firefox

1. Go to the NCBI home page (http://www.ncbi.nlm.nih.gov) and select Structure from the Search drop-down menu. Enter 1HMF in the “for” text box. Click Go.

2. Click the 1HMF hyperlink. The structure summary page displays, and the user will immediately note the format, which is decidedly different from any of the pages displayed so far (Fig. 1.3.21).

This page shows the definition line from the source Molecular Modeling Database (MMDB) document (which is derived from PDB), as well as links to PubMed and to the taxonomy of the source organism. The graphic below the header schematically illustrates the protein as a bar of length 77 (meaning 77 amino acids), below which is a bar showing the position of a defined domain within the protein (here, the HMG box, a DNA-binding domain).

3. Click on the upper bar corresponding to the full-length protein. This displays a table of four neighbors, as assessed by the Vector Alignment Search Tool (VAST; see Background Information).

4. To glean initial impressions about the shape of the protein, download Cn3D by clicking Get Cn3D 4.1. The application will walk the user through the installation program.

More information on Cn3D is available through the online Cn3D documentation. In addition, the user can save coordinate information to a file and view the data using third-party applications such as Kinemage (Richardson and Richardson, 1992) and RasMol (Sayle and Milner-White, 1995).




Figure 1.3.21 The structure summary for 1HMF, resulting from a direct query of the structures accessible through the Entrez system. The entry shows header information from the corresponding MMDB entry, links to PubMed, and links to the taxonomy of the source organism. Structure neighbors, as assessed by VAST, can be found by clicking on the long bar (purple on screen) next to the Protein key. The structure itself can be viewed by clicking on the View 3D Structure button, thereby spawning the Cn3D viewer.

5. Once installed, use the Web browser’s Back button to return to the 1HMF structure summary page (Fig. 1.3.21). Click on View 3D Structure. This will launch the Cn3D viewer once the three-dimensional coordinates of 1HMF have been downloaded from the NCBI server.

6. Cn3D will produce two windows, one showing the structure of 1HMF, the other showing the sequence (Fig. 1.3.22A). The user can highlight any part of the sequence shown in the sequence window, and the corresponding part of the structure will appear in yellow. In addition, the user can search the sequence for a specific sequence pattern. Click anywhere in the Sequence/Alignment Viewer window, then select View > Find Pattern. Type PKRP into the search box, then click OK. The portion of the backbone corresponding to these four residues (Pro 7-Pro 10) will be shown in yellow in the structure window.
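
The pattern search in step 6 amounts to locating a short motif in the sequence and reporting its residue numbers. The Python sketch below treats the pattern as a literal string, which is an assumption (Cn3D's Find Pattern may accept richer pattern syntax). The function name `find_pattern` is hypothetical, and the sequence is a placeholder, not the real 1HMF sequence.

```python
import re


def find_pattern(sequence, pattern):
    """Return (start, end) residue numbers, 1-based and inclusive, for each match."""
    return [(m.start() + 1, m.end()) for m in re.finditer(re.escape(pattern), sequence)]


# Placeholder sequence, NOT the real 1HMF sequence; it is constructed so that
# PKRP falls at residues 7-10, as in the example above.
sequence = "AAAAAAPKRPAAAA"
print(find_pattern(sequence, "PKRP"))  # [(7, 10)]
```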

7. The user can also adjust the display of the structure by selecting options in the Style > Rendering Shortcuts and Style > Coloring Shortcuts submenus. Figure 1.3.22B shows the structure of 1HMF with the Rendering Shortcut set to Spacefill and the Coloring Shortcut set to Charge.

The view given in Figure 1.3.22B shows the overall C-shape of the protein, which binds to DNA. The blue patches, representing positive charges, indicate the residues that may be responsible for DNA binding through minor-groove interactions. Negative charges are shown in red.
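
Coloring by charge can be thought of as a simple mapping from residue type to color. The Python sketch below is an approximation under a stated assumption: Lys and Arg are treated as positive, Asp and Glu as negative, and all other residues (including His) as uncharged. The exact rules used by Cn3D may differ, and the function name `charge_color` is hypothetical.

```python
# Assumption: Lys and Arg are treated as positive, Asp and Glu as negative,
# and everything else (including His) as uncharged. Cn3D's own rules may differ.
POSITIVE = set("KR")
NEGATIVE = set("DE")


def charge_color(residue):
    """Map a one-letter amino acid code to a display color by charge."""
    if residue in POSITIVE:
        return "blue"
    if residue in NEGATIVE:
        return "red"
    return "neutral"


colors = [(res, charge_color(res)) for res in "PKRP"]
print(colors)
# [('P', 'neutral'), ('K', 'blue'), ('R', 'blue'), ('P', 'neutral')]
```

Under this assumption, the Lys and Arg residues of the PKRP motif would contribute to the blue patches, while the flanking prolines would not.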

8. It is also possible to change the rendering and coloring of defined parts of the molecule.




Figure 1.3.22 The structure of 1HMF rendered using Cn3D version 4.1, an interactivemolecular viewer. Cn3D can be used as a helper application to any Web browser or as a stand-alone application. In panel A, the backbone of the structure is shown as a worm, with the col-oring indicating secondary structural regions; in this case, there are three α-helices, shown ingreen, with a “crayon” indicating the length and directionality of the helix. Four residues havebeen highlighted in the sequence window, and those residues are shown in yellow in thestructure window. In panel B, the rendering of the structure has been changed,showing the structure in space-filling style, with the coloring being done by charge (red, nega-tive; blue, positive). For both panels, the coloring shown in the structure window is mirrored inthe sequence window below. See text for details. For the color version of this figure go tohttp://www.currentprotocols.com.

a. Reset the Rendering Shortcut to Worms and the Coloring Shortcut to Secondary Structure, restoring the view to that shown in Figure 1.3.22A.

b. Highlight the same PKRP residues as before by clicking in the Sequence/Alignment Viewer and selecting the four residues (Pro 7–Pro 10). Click anywhere in the structure window, then select Style > Annotate.

c. In the new User Annotations window, click New. This will produce a new window, titled Edit Annotation.

d. Give the new annotation a name (e.g., PKRP), then click Edit Style. This generates the Style Options window, where settings can be selected for how the PKRP residues should be displayed in the structure window (Fig. 1.3.23, left).

e. Here, change the Protein Backbone rendering to Ball and Stick and the color scheme to Charge. Next, click the box next to Protein Sidechains, change the rendering to Ball and Stick, and the color scheme to Charge. When done, click Done.

The side chains of the four selected amino acids are now shown in yellow in the structure window (Fig. 1.3.23, right).

f. Click Done in the User Annotations window, then click anywhere in the Sequence/Alignment Viewer to clear the yellow highlighting. The colors will change to correspond to the charges of the individual side chains.

9. Rotate the structure by moving the mouse while holding down the mouse button. To zoom in or out, hold down the Command (Apple) key on a Mac or the Control key on a PC while dragging the mouse.


Figure 1.3.23 Changing the rendering and coloring of selected parts of a structure. The Style Options window also allows individual residues to be numbered and the dimensions of side chains and other features to be changed. See text for details. For the color version of this figure go to http://www.currentprotocols.com.

COMMENTARY

Background Information

Entrez, to be clear, is not a database itself but rather the interface through which all of its component databases can be accessed and traversed. The Entrez information space includes PubMed records, nucleotide and protein sequence data, three-dimensional structure information, and mapping information. The strength of Entrez lies in the fact that all of this information can be accessed by issuing one and only one query. Entrez is able to offer integrated information retrieval through the use of two types of connections between database entries: neighboring and hard links.

Relationships between database entries: Neighboring

The concept of neighboring allows for entries within a given database to be connected to one another. If a user is looking at a particular PubMed entry, the user can ask Entrez to find all of the other papers in PubMed that are similar in subject matter to the original paper (see Basic Protocol 1). Similarly, if a user is looking at a sequence entry, Entrez can return a list of all other sequences that bear similarity to the original sequence. While the term “neighboring” is traditionally used to describe these connections, the Entrez Web site describes neighbors as “related papers,” “related sequences,” and so forth. Neighboring relationships within a database are established using statistical measures of similarity, as follows.

BLAST. Sequence data are compared to one another using the Basic Local Alignment Search Tool, or BLAST (Altschul et al., 1990). This algorithm attempts to find “high-scoring segment pairs” (HSPs), which are pairs of sequences that can be aligned with one another and, when aligned, meet certain scoring and statistical criteria. UNITS 3.3 & 3.4 discuss the family of BLAST algorithms and their application at length.
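The idea of a high-scoring segment pair can be illustrated with a toy calculation. The sketch below is not the BLAST algorithm itself: it assumes two sequences of equal length that are already lined up position by position, allows no gaps, and uses illustrative match and mismatch scores (+1 and -1) chosen only for this example. The function name `best_segment_pair` is hypothetical. It simply reports the stretch with the highest running score.

```python
def best_segment_pair(seq_a, seq_b, match=1, mismatch=-1):
    """Return (score, start, end) for the best ungapped segment.

    Assumes seq_a and seq_b are compared position by position; the
    scoring values are illustrative, not BLAST defaults.
    """
    best = (0, 0, 0)          # (score, start index, end index exclusive)
    running, start = 0, 0
    for i, (a, b) in enumerate(zip(seq_a, seq_b)):
        if running <= 0:
            # A non-positive running score cannot help; start a new segment.
            running, start = 0, i
        running += match if a == b else mismatch
        if running > best[0]:
            best = (running, start, i + 1)
    return best


score, start, end = best_segment_pair("ACGTTGCA", "TCGTTGAA")
print(score, "ACGTTGCA"[start:end])
```

A real search must also decide whether such a score is statistically meaningful, which is the part of the method that the "statistical criteria" above refer to.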

VAST. Sets of coordinate data are compared using a vector-based method known as the Vector Alignment Search Tool (VAST; Madej et al., 1995; Gibrat et al., 1996). There are three major steps that take place in the course of a VAST comparison.

First, based on known three-dimensional coordinate data, all of the α-helices and β-sheets that comprise the core of the protein are identified. Straight-line vectors are then calculated based on the position of these secondary-structure elements. VAST keeps track of how one vector is connected to the next (that is, how the C-terminal end of one vector connects to the N-terminal end of the next vector), as well as whether a particular vector represents an α-helix or a β-sheet. Subsequent steps use only these vectors in making comparisons to other proteins. In effect, most of the coordinate data are discarded at this step. The reason for this apparent oversimplification is simply the scale of the problem at hand; with more than 32,000 structures in PDB that need to be considered, the time that it would take to do an in-depth comparison of each and every structure to all of the other structures in the database would make the calculations both impractical and intractable. The user should keep this simplification in mind when making biological inferences based on the results presented in a VAST table.

Next, the algorithm attempts to optimally align these sets of vectors, looking for pairs of structural elements that are of the same type and relative orientation, with consistent connectivity between the individual elements. The object is to identify highly similar “core substructures,” i.e., pairs that represent a statistically significant match above that which would be obtained by comparing randomly chosen proteins to one another.

Finally, a refinement is done using Monte Carlo methods at each residue position in an attempt to optimize the structural alignment.

Through this method, it is possible to find structural (and, presumably, functional) relationships between proteins in cases that may lack overt sequence similarity. The resultant alignment need not be global; matches may be between individual domains of different proteins.
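The reduction of secondary-structure elements to vectors can be sketched in a few lines. This is an illustration of the idea only, not the VAST implementation: the coordinates are invented, each element is represented by a single vector from its first to its last point, and the 30-degree tolerance and the function names are assumptions made for this example. Two pairs of elements are treated as comparable when the element types agree and the angle between the two vectors is similar in both proteins.

```python
import math


def element_vector(start, end):
    """Straight-line vector from the start to the end of one element."""
    return tuple(e - s for s, e in zip(start, end))


def angle_between(u, v):
    """Angle between two vectors, in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cosine))


def make_element(kind, start, end):
    """Keep only the element type and its vector; other data are dropped."""
    return {"type": kind, "vector": element_vector(start, end)}


def pairs_match(a1, a2, b1, b2, tolerance_deg=30.0):
    """True if (a1, a2) and (b1, b2) agree in type and relative orientation.

    tolerance_deg is an illustrative threshold, not a VAST parameter.
    """
    same_types = a1["type"] == b1["type"] and a2["type"] == b2["type"]
    angle_a = angle_between(a1["vector"], a2["vector"])
    angle_b = angle_between(b1["vector"], b2["vector"])
    return same_types and abs(angle_a - angle_b) <= tolerance_deg


# Invented coordinates for two consecutive elements in each of two proteins.
protein_a = [make_element("helix", (0, 0, 0), (10, 0, 0)),
             make_element("sheet", (10, 0, 0), (10, 5, 0))]
protein_b = [make_element("helix", (1, 1, 1), (1, 1, 9)),
             make_element("sheet", (1, 1, 9), (5, 1, 9))]

print(pairs_match(protein_a[0], protein_a[1], protein_b[0], protein_b[1]))
```

Because only one vector per element survives, side-chain positions and other atomic detail play no part in the comparison, which is the simplification the text cautions about.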

It is important to note here that VAST is not the best method for determining structural similarities. More robust methods, such as homology model building, provide much greater resolving power in determining such relationships, since the raw information within the three-dimensional coordinate file is used to perform more advanced calculations regarding the positions of side chains and the thermodynamic nature of the interactions between side chains. Reducing a structure to a series of vectors necessarily results in a loss of information. However, considering the magnitude of the problem here (again, the number of pairwise comparisons that need to be made) and both the computing power and time needed to employ any of the more advanced methods, VAST provides a simple and fast first answer to the question of structural similarity. More information on other structure prediction methods based on X-ray or NMR coordinate data can be found in Chapter 5.

Weighted key terms. The problem of comparing sequence data somewhat pales next to that of comparing PubMed entries, free text whose rules of syntax are not necessarily fixed. Given that no two people’s writing styles are exactly the same, finding a way to compare seemingly disparate blocks of text poses a substantial problem. Entrez employs a method known as the relevance pairs model of retrieval to make such comparisons, relying on what are known as weighted key terms (Wilbur and Coffee, 1994; Wilbur and Yang, 1996). This concept is best described by example. Consider two manuscripts with the following titles: “BRCA1 as a Genetic Marker for Breast Cancer” and “Genetic Factors in the Familial Transmission of the Breast Cancer BRCA1 Gene.” Both titles contain the terms BRCA1, Breast, and Cancer, and the presence of these common terms may indicate that the manuscripts are similar in their subject matter. The proximity between the words is also taken into account, so that words common to two records that are closer together are scored higher than common words that are further apart. In the current example, the terms Breast and Cancer would score higher based on proximity than either of those words would against BRCA1, since the words are next to each other. Common words found in a title are scored higher than those found in an abstract, since title words are presumed to be “more important” than those found in the body of an abstract. Overall weighting depends on the frequency of a given word among all the entries in PubMed, with words that occur infrequently in the database as a whole carrying a higher weight.
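The frequency-based part of this weighting can be sketched as follows. This is a simplified illustration and not the formula Entrez uses: it assumes an inverse-document-frequency weight (the logarithm of the database size divided by the number of records containing the term), a fixed boost for title words, and it ignores word proximity entirely. The document frequencies, the database size, the boost value, and the function names are all invented for the example.

```python
import math


def shared_terms(text_a, text_b):
    """Terms (lowercased, split on whitespace) present in both texts."""
    return set(text_a.lower().split()) & set(text_b.lower().split())


def term_weight(term, doc_freq, n_docs):
    """Rarer terms receive a larger weight (inverse document frequency)."""
    return math.log(n_docs / doc_freq[term])


def title_similarity(title_a, title_b, doc_freq, n_docs, title_boost=2.0):
    """Sum the boosted weights of the terms the two titles share.

    title_boost is an illustrative factor for title words; terms with
    no recorded frequency are skipped.
    """
    return sum(
        title_boost * term_weight(term, doc_freq, n_docs)
        for term in shared_terms(title_a, title_b)
        if term in doc_freq
    )


TITLE_A = "BRCA1 as a Genetic Marker for Breast Cancer"
TITLE_B = ("Genetic Factors in the Familial Transmission "
           "of the Breast Cancer BRCA1 Gene")

# Invented counts: number of records (out of N_DOCS) containing each term.
DOC_FREQ = {"brca1": 10, "breast": 100, "cancer": 1000, "genetic": 1000}
N_DOCS = 10000

print(sorted(shared_terms(TITLE_A, TITLE_B)))
print(title_similarity(TITLE_A, TITLE_B, DOC_FREQ, N_DOCS))
```

With these made-up counts, the rare term BRCA1 contributes three times the weight of the common term Cancer, which mirrors the statement that infrequent words carry a higher weight.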

Hard links

The concept of hard links is much simpler than that of neighboring. Hard links are applied between entries in different databases and exist everywhere there is a logical connection between entries. For instance, if a PubMed entry is about the sequencing of a cosmid, a hard link is established between the PubMed entry and the corresponding nucleotide entry. If an open reading frame in that cosmid codes for a known protein, a hard link is established between the nucleotide entry and the protein entry. If, by sheer luck, the protein entry has an experimentally deduced structure, a hard link would be placed between the protein entry and the structural entry. The hard link relationships between databases are illustrated in Figure 1.3.24.

Searches can, in essence, begin anywhere within Entrez; the user has no constraints with respect to where the foray into this information space must begin. However, depending on which database is chosen as the starting point, different fields are available for searching. This stands to reason, inasmuch as the entries in different databases are necessarily organized differently, reflecting the biological nature of the entity that each database is trying to catalog.
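Both kinds of connection described above can also be followed programmatically. The sketch below assumes NCBI's E-utilities ELink service, which is not covered in this unit: a request whose source and target databases are the same asks for neighbors, while a request between two different databases asks for hard links. The helper name `elink_url` and the record identifier are hypothetical; the sketch only builds the request URL and does not contact NCBI.

```python
from urllib.parse import urlencode

# Base address of the E-utilities (an assumption outside this unit's text).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def elink_url(dbfrom, db, uid, linkname=None):
    """Build an ELink request URL for one record identifier.

    dbfrom   -- database that holds the starting record
    db       -- database in which linked records are sought
    linkname -- optional name restricting the request to one link type
    """
    params = {"dbfrom": dbfrom, "db": db, "id": uid}
    if linkname:
        params["linkname"] = linkname
    return f"{EUTILS_BASE}/elink.fcgi?{urlencode(params)}"


PMID = "12345678"  # placeholder identifier, for illustration only

# Neighboring: other PubMed records related to this PubMed record.
print(elink_url("pubmed", "pubmed", PMID, linkname="pubmed_pubmed"))

# Hard link: nucleotide records connected to this PubMed record.
print(elink_url("pubmed", "nuccore", PMID))
```

Fetching either URL (for example with `urllib.request.urlopen`) should return an XML document listing the linked identifiers; parsing that response is left out of the sketch.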


Figure 1.3.24 An overview of the relationships in the Entrez integrated information retrieval system. Each node represents one of the elements that can be accessed through Entrez, and the lines represent how each component database connects to the others. Entrez is under continuous evolution, with new components being added and the interrelationships between the elements changing dynamically (from The Entrez Search and Retrieval System, The NCBI Handbook; see Internet Resources).

Critical Parameters and Troubleshooting

Since a significant portion of this unit deals with searching PubMed, it is important for the reader to understand the distinction between PubMed and MEDLINE. MEDLINE is the National Library of Medicine’s database of journal citations from 1966 to the present; updates to MEDLINE are done on a weekly basis. The scope of journals included in MEDLINE roughly covers the general areas of biomedicine and health, encompassing most (but not all) journals that those working in the biomedical sciences routinely access. While this broad definition brings the contents of ∼4600 journals into MEDLINE, the majority of records are from English-language publications; the inclusion of non-English publications is usually limited to journals providing abstracts translated into English. While the MEDLINE database as a whole contains entries going back to 1966, a given journal’s entries will only be present from the point in time when that journal was chosen for inclusion in MEDLINE; this means that a user cannot necessarily be assured that a MEDLINE search will actually return the most complete set of results.

PubMed, the resource available through Entrez, attempts to slightly broaden the scope of MEDLINE and address some of its shortfalls by including life-science citations from general science and chemistry journals, adding roughly one million more entries to the MEDLINE set. PubMed also attempts to completely index journals back to 1966, regardless of the date of the journal’s inclusion in MEDLINE. For papers published prior to 1966, the user will need to access OLD-MEDLINE, which must be done through a different search engine, called the NLM Gateway (http://gateway.nlm.nih.gov). This means that, in essence, two searches will need to be performed to assure that the complete literature fitting a particular search has been obtained. In most cases, however, users are looking for more “recent” literature, so a search of OLD-MEDLINE is seldom necessary. Both MEDLINE and OLD-MEDLINE can be searched simultaneously using the NLM Gateway, but its Web-based interface is not nearly as user-friendly or powerful as that of Entrez, nor do Gateway results pages provide the myriad of links to other databases and resources that Entrez provides.

One final, yet important, historical distinction lies in the number of authors actually indexed for any given paper. From 1966 to 1983, any author whose name was listed on the title page of the article was indexed. From 1984 to 1995, the author list was limited to the first ten authors, possibly omitting the senior author from the list. From 1996 to 1999, the author list was expanded to include the first 25 authors; in the case of large, consortium-based studies, the senior author(s) might still have been omitted, even with the longer list. Finally, from 2000 on, all authors’ names were indexed, regardless of the length of the author list.
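When an author search unexpectedly misses a paper, it can help to check whether that author would have been indexed at all. The sketch below encodes only the year ranges stated above; the function names are hypothetical, and the check assumes that the senior author is the last name on the author list.

```python
def indexed_author_count(year, n_authors):
    """Number of authors indexed for a paper, per the rules stated above."""
    if year < 1966:
        # Earlier papers fall outside the period described in this unit.
        raise ValueError("indexing rules are described only for 1966 onward")
    if 1984 <= year <= 1995:
        return min(n_authors, 10)   # first ten authors only
    if 1996 <= year <= 1999:
        return min(n_authors, 25)   # first 25 authors only
    return n_authors                # 1966-1983 and 2000 on: all authors


def senior_author_indexed(year, n_authors):
    """True if the last-listed author falls within the indexed portion."""
    return indexed_author_count(year, n_authors) == n_authors


print(indexed_author_count(1990, 14), senior_author_indexed(1990, 14))
print(indexed_author_count(1998, 40), senior_author_indexed(1998, 40))
```

For a 1990 paper with 14 authors, for instance, only the first ten names are searchable, so a search on the senior author's name would not retrieve it.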

Disclaimer

This unit was written by Dr. Andreas D. Baxevanis in his private capacity. No official support or endorsement by the National Institutes of Health or the United States Department of Health and Human Services is intended or should be inferred.

Literature Cited

Alberts, B., Johnson, A., Lewis, J., Raff, M., Roberts, K., and Walter, P. 2002. Molecular Biology of the Cell, Fourth Edition. Garland Publishing, New York.

Altschul, S., Gish, W., Miller, W., Myers, E., and Lipman, D. 1990. Basic local alignment search tool. J. Mol. Biol. 215:403-410.

Barrett, T., Suzek, T.O., Troup, D.B., Wilhite, S.E., Ngau, W.C., Ledoux, P., Rudnev, D., Lash, A.E., Fujibuchi, W., and Edgar, R. 2005. NCBI GEO: Mining millions of expression profiles--database and tools. Nucl. Acids Res. 33:D562-D566.

Cho, K.R., Oliner, J.D., Simons, J.W., Hedrick, L., Fearon, E.R., Preisinger, A.C., Hedge, P., Silverman, G.A., and Vogelstein, B. 1994. The DCC gene: Structural analysis and mutations in colorectal carcinomas. Genomics 19:525-531.

Gibrat, J.-F., Madej, T., and Bryant, S. 1996. Surprising similarities in structure comparison. Curr. Opin. Struct. Biol. 6:377-385.

Hamosh, A., Scott, A.F., Amberger, J., Bocchini, C., Valle, D., and McKusick, V.A. 2002. Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucl. Acids Res. 30:52-55.

Madej, T., Gibrat, J.-F., and Bryant, S. 1995. Threading a database of protein cores. Proteins 23:356-369.

McKusick, V.A. 1998. Online Mendelian Inheritance in Man: A Catalog of Human Genes and Genetic Disorders, 12th Edition. The Johns Hopkins University Press, Baltimore, Md.

Mullikin, J.C. and Sherry, S.T. 2005. Sequence Polymorphisms. In Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, 3rd Edition (A.D. Baxevanis and B.F.F. Ouellette, eds.) pp. 171-193. John Wiley and Sons, Hoboken, N.J.

Richardson, D. and Richardson, J. 1992. The kinemage: A tool for scientific communication. Protein Sci. 1:3-9.

Sayle, R. and Milner-White, E. 1995. RasMol: Biomolecular graphics for all. Trends Biochem. Sci. 20:374-376.

Wilbur, W. and Coffee, L. 1994. The effectiveness of document neighboring in search enhancement. Inf. Process Manage. 30:253-266.

Wilbur, W. and Yang, Y. 1996. An analysis of statistical term strength and its use in the indexing and retrieval of molecular biology texts. Comput. Biol. Med. 26:209-222.

Internet Resources

http://www.ncbi.nlm.nih.gov

NCBI Home page.

http://www.ncbi.nlm.nih.gov/Entrez

NCBI Entrez Web page.

http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=handbook.chapter.588

Ostell, J. 2003. The Entrez Search and Retrieval System. The NCBI Handbook, Chapter 15. National Center for Biotechnology Information, Bethesda, Md.

http://www.ncbi.nlm.nih.gov/projects/geo/info/overview.html

NCBI GEO Overview.

Contributed by Andreas D. Baxevanis
Bethesda, Maryland
