ncbi news - national center for biotechnology information

12
A gene-based view of annotated genomes is essential to capitalize on the increase in the sequencing and analysis of model genomes. The Entrez Gene database has been developed to supply key connections between maps, sequences, expression profiles, structure, function, homolo- gy data, and the scientific literature. Unique identifiers are assigned to genes with defining sequence, genes with known map positions, and genes inferred from phenotypic information. These gene identifiers are tracked, and functional informa- tion is added when available. Access Entrez Gene from the Entrez Home Page or directly at: www.ncbi.nlm.nih.gov/entrez/ query.fcgi?db=gene The Entrez Gene help document provides tips to ease the transition for LocusLink users to the current Entrez Gene database. The default display format for Entrez Gene is the graphics display shown in Figure 1 for BMP7, which resembles the traditional view of a LocusLink record. The array of col- ored boxes at the head of LocusLink reports that provide links to gene- related resources is replaced by the “Links” menu in Gene, which includes additional links, such as those to Books, GEO, UniSTS, and Taxonomy. The Gene Transcripts and Products section is provided when a gene has been annotated on a genomic Reference Sequence Three databases, the NCI/NCBI SKY (Spectral Karyotyping)/M- FISH (Multiplex-FISH) and CGH (Comparative Genomic Hybridization) Database, the NCI Mitelman Database of Chromosome Aberrations in Cancer, and the NCI Recurrent Chromosome Aberrations in Cancer databases are now integrat- ed into NCBI’s Entrez system as the “Cancer Chromosomes” database. Cancer Chromosomes supports searches for cytogenetic, clinical, or reference information using the flexi- ble Entrez search and retrieval sys- continued on page 6 Cancer Chromosomes: a New Entrez Database continued on page 3 Spring 2004 Transitioning from LocusLink to Entrez Gene In this issue 1 Transitioning from LocusLink to Entrez Gene 1 Cancer Chromosomes: a New Entrez Database 2 HomoloGene 4 BLAST Link (BLink) 5 Debut of HCT Database 7 350Kb Sequence Length Limit Removed 7 New Eukaryotic Genomes in Map Viewer 8 Environmental Samples from the Sargasso Sea 8 HIV Protein Interaction Database 9 Perform Reverse ePCR 9 New Organisms in UniGene 9 Rat Gets NP_999999 10 RefSeq Release 6 10 Entrez Tools new “Hotspot” 11 BLAST Lab 12 Entrez Quiz NCBI News National Center for Biotechnology Information National Library of Medicine National Institutes of Health Department of Health and Human Services Figure 1. Entrez Gene display for human BMP7 showing links to over 20 related resources in the “Links” pulldown menu.

Upload: others

Post on 03-Feb-2022

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: NCBI News - National Center for Biotechnology Information

A gene-based view of annotatedgenomes is essential to capitalize onthe increase in the sequencing andanalysis of model genomes. TheEntrez Gene database has beendeveloped to supply key connectionsbetween maps, sequences, expressionprofiles, structure, function, homolo-gy data, and the scientific literature.Unique identifiers are assigned togenes with defining sequence, geneswith known map positions, andgenes inferred from phenotypicinformation. These gene identifiersare tracked, and functional informa-tion is added when available. AccessEntrez Gene from the Entrez HomePage or directly at:

www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene

The Entrez Gene help documentprovides tips to ease the transitionfor LocusLink users to the currentEntrez Gene database.

The default display format forEntrez Gene is the graphics displayshown in Figure 1 for BMP7, whichresembles the traditional view of aLocusLink record. The array of col-ored boxes at the head of LocusLinkreports that provide links to gene-related resources is replaced by the“Links” menu in Gene, whichincludes additional links, such asthose to Books, GEO, UniSTS, andTaxonomy. The Gene Transcriptsand Products section is providedwhen a gene has been annotated ona genomic Reference Sequence

Three databases, the NCI/NCBISKY (Spectral Karyotyping)/M-FISH (Multiplex-FISH) and CGH(Comparative GenomicHybridization) Database, the NCIMitelman Database of ChromosomeAberrations in Cancer, and the NCIRecurrent Chromosome Aberrationsin Cancer databases are now integrat-ed into NCBI’s Entrez system as the“Cancer Chromosomes” database.Cancer Chromosomes supportssearches for cytogenetic, clinical, orreference information using the flexi-ble Entrez search and retrieval sys-

continued on page 6

Cancer Chromosomes: aNew Entrez Database

continued on page 3

Spring 2004

Transitioning from LocusLink to Entrez Gene

In this issue1 Transitioning from LocusLink to

Entrez Gene

1 Cancer Chromosomes: a New Entrez Database

2 HomoloGene

4 BLAST Link (BLink)

5 Debut of HCT Database

7 350Kb Sequence Length Limit Removed

7 New Eukaryotic Genomes in Map Viewer

8 Environmental Samples from the Sargasso Sea

8 HIV Protein Interaction Database

9 Perform Reverse ePCR

9 New Organisms in UniGene

9 Rat Gets NP_999999

10 RefSeq Release 6

10 Entrez Tools new “Hotspot”

11 BLAST Lab

12 Entrez Quiz

NCBI NewsNational Center for Biotechnology InformationNational Library of MedicineNational Institutes of HealthDepartment of Health and Human Services

Figure 1. Entrez Gene display for human BMP7 showing links to over 20 related resources in the “Links”pulldown menu.

Page 2: NCBI News - National Center for Biotechnology Information

HomoloGene is a system for auto-mated detection of homologs amongthe annotated genes of several com-pletely sequenced eukaryoticgenomes. The genomes representedin the recent Build 36 ofHomoloGene include H. sapiens,M.musculus, R.norvegicus , D.melanogaster , A. gambiae, C. elegans , S.pombe, S. cerevisiae , N. crassa, M. grisea,A. thaliana, and P. falciparum.

NCBI has adopted a new Homolo-Gene build procedure which is guid-ed by the taxonomic tree, relies onconserved gene order and measuresof DNA similarity among closelyrelated species, while making use ofprotein similarity for more distantlyrelated organisms. The new compu-tational procedure greatly increasesthe reliability of the computedhomologous gene sets and the result-ing HomoloGene entries nowinclude paralogs in addition toorthologs. For more details or tosearch the database, see theHomologene home page at:

www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=homologene

New Search Strategies Supported

Because HomoloGene is now anEntrez database, it can be queriedusing an assortment of fielded termscombined with boolean operators.Among the fields unique to Homolo-Gene is the “Ancestor” field whichrefers to the taxonomic group of thelast common ancestor of the speciesrepresented in a HomoloGene entry.Using the “Ancestor” field it is possi-ble to limit a search to genes con-served in one of 9 ancestral groups:Sordariomycetes (147,550 entries),Eukaryota (2,759 entries),Fungi/Metazoa (33,154 entries),Bilateria (33,213 entries), Coelomata(33,316 entries), Mammalia (9,172entries), Ascomycota (1,083 entries),Insecta (1,689 entries), Rodentia (1,587entries).

homologene.xml.gz

homologene.xml.gz is a compressed filethat contains a complete XML version of theHomoloGene build and includes the infor-mation available on the public webpage.The Homologene XML DTD is available inthe archive "homologene.dtd.tar" at the toplevel of the ftp site.

New Views of the Data

HomoloGene reports includehomology and phenotype informa-tion drawn from Online MendelianInheritance in Man (OMIM), MouseGenome Informatics (MGI),Zebrafish Information Network(ZFIN), Saccharomyces GenomeDatabase (SGD), Clusters ofOrthologous Groups (COG), andFlyBase. A “Pairwise Scores” displaygives a table of pairwise statistics formembers of a Homologene groupthat includes percent amino acid andnucleotide identities, the Jukes-Cantor genetic distance parameter,D, the ratio of non-synonymous tosynonymous amino acid substitutions(Ka/Ks) for predicted proteins, andthe ratio of nucleotide identitieswithin non-coding regions of thetranscript to those within codingregions (Knr/Knc).

—DW

NCBI News

NCBI News is distributed four times a year. We welcome communication from users of NCBI databases and software and invite suggestions for articles in future issues. Send corre-spondence to NCBI News at the address below. To subscribe to NCBINews, send your name and address toeither the street or E-mail address below.

NCBI NewsNational Library of MedicineBldg. 38A, Room 3S-3088600 Rockville PikeBethesda, MD 20894Phone: (301) 496-2475Fax: (301) 480-9241E-mail: [email protected]

EditorsDennis BensonDavid Wheeler

ContributorsSusan DombrowskiScott McGinnisTao Tao

WritersVyvy PhamDavid Wheeler

Editing and ProductionRobert Yates

Graphic DesignRobert Yates

In 1988, Congress established the National Center for Biotechnology Information as part of the National Libraryof Medicine; its charge is to create information systems for molecular biology and genetics data and perform research in computational molecular biology.

The contents of this newsletter may be reprinted without permission. The mention of trade names, com-mercial products, or organizations does not imply endorsement by NCBI, NIH, or the U.S. Government.

NIH Publication No. 04-3272

ISSN 1060-8788ISSN 1098-8408 (Online Version)

2Spring 2004 NCBI News

HomoloGene: An Entrez Database with a New Look

New HomoloGene FTP File Formats

The old HomoloGene FTP files of the formatsused in "hmlg.ftp" and "hmlg.trip.ftp" will be dis-continued after a transition period. During thetransition, a new set of codes, reflecting thenew build procedure, will be used in these filesto indicate the nature of the evidence forhomology: b - reciprocal best, B - reciprocalbest in a self-consistent triplet, m - similaritybetween sequences that do not give reciprocalbest hits.

The Homologene data is available by FTP wherethe data for each build is contained in two files;"homologene.data" and "homologene.xml.gz".Follow the "FTP site" link in the sidebar on theHomologene home page to download the files.

homologene.datahomologene.data is a tab delimited file con-taining, from left to right:

•HomoloGene group id •Taxonomy ID•gene ID •gene symbol •geninfo identifier(gi) of the protein product of the gene•accession number of the protein productof the gene

Page 3: NCBI News - National Center for Biotechnology Information

tem. Search tips are provided in theHelp document at:

www.ncbi.nlm.nih.gov/entrez/query/SkyCgh/help.html

Search “Cancer Chromosomes”from the database pulldown menuon the NCBI home page or navigateto the “Cancer Chromosomes”page for advanced searches via thelink on the Entrez home page at:

www.ncbi.nlm.nih.gov/Entrez

Three search formats areoffered on the EntrezChromosomes home page:a conventional EntrezQuery, a Quick/SimpleSearch, and an AdvancedSearch. The Entrez Queryis performed using thesearch box at the top ofthe page, and, as withother Entrez databases,searches may be combinedusing term limits andBoolean expressions. TheSimple Search, available viaa link in the sidebar on theCancer Chromosomespage, offers a set ofmenus from which onemay select search terms toindicate a disease site ordiagnosis. These terms canbe combined with specifi-cations for a particularchromosomal location andanomaly. The AdvancedSearch form, also linkedfrom the sidebar, isarranged similarly. Thisform contains three mainsections, labeledCytogenetic, Clinical, andReference, which offer acombination of forms andmenus of search terms to help in theconstruction of complex queries.Diagnostic terms vary among data-bases; the SKY/M-FISH databaseuses ICD-3-O terms, whereas the

3 Spring 2004 NCBI News

Mitelman and Recurrent databasesuse a different system. The menusinclude all ICD-O-3 terms enteredinto the database to date and allterms used in the Mitelman andRecurrent databases. Descriptions ofthe sections and terms indexed aregiven in the Help document.

Searches based on case information,such as diagnosis and disease site,return a “case-based report” that listsall cases matching the query terms.Searches based on underlying cytoge-netic features are displayed as a“clone/cell report” in which eachclone or cell-line is listed separately.

The number of cases or cells/clonesfound for each search is displayed atthe top of the results page, brokendown into three totals: the totalmatches found in the SKY/M-FISH

& CGH Database, the total matchesfound in the Mitelman database, andthe total matches from the MitelmanRecurrent Database.

From the results list, users can accessthe pull-down menu and display avariety of features, including the cor-responding literature from PubMed,the results as a list of UI (uniqueidentifier) numbers, or view relatedreports based on common cytoge-netic or diagnostic features. Userscan also view Similarity reports,which show terms common to agroup or records within several termcategories such as diagnosis/site and

cytogenetic abnormalities(including CGH) among theselected cases or clones/cells.Term co-occurences are listedat several levels: common toall cases, common to 50%-90% of cases, and commonto less than 50% of cases.The common term or abnor-mality is shown in the left col-umn and the number ofaffected cases is shown in theright column. The cytogenet-ic abnormalities are shown atall levels of resolution. Select‘Similarity Report (HighResolution Only)’ to see simi-larities at a high level of reso-lution such as chromosomeband.

The results of a sample queryfor records dealing withbreakpoints of the chromoso-mal band 8q is shown inFigure 1.

Links in the search resultssummary lead to full reportssuch as the Case report #1437shown in Figure 2. Displaybuttons provide access toadditional views of the data

such as chromosomal dia-grams or a table view. Userscan also access the case

details or link to related resourcesfrom the original search summary byway of the Entrez Links menu.

—VP

Cancer Chromosomescontinued from page 1

Figure 1. Results of an Entrez Cancer Chromosomes search for records using thequery "8q". The number of cases from each database is given at the top of the resultsummary. Clicking on the Details link shows how the query was interpreted by Entrez.

Figure 2. Detailed display of Case report #1437, with written karyotype, case summary,graphic display of colored ideogram, and chromosome abnormalities in tabular format.

Page 4: NCBI News - National Center for Biotechnology Information

4Spring 2004 NCBI News

Pre-computed sequence alignments,generated from routine all-against-allBLAST comparisons performed atNCBI, are available for each proteinrecord in Entrez. The best 200 ofthese alignments can be displayed byclicking on the “BLink” link in theupper right-hand corner of Entrezprotein reports. The BLink reportfor human MLH1 is shown in Fig-ure 1. The report begins with adescription of the query sequence,the sequence IDs of other entries inEntrez with identical sequences, anda set of controls, described below,used to customize the display. Thealignments presented in the lowersection of the report are depictedgraphically and color-coded on thebasis of the taxonomic origin of thealigned sequence. Each alignment isfollowed by its BLAST score, linkedto a detailed alignment view, theaccession number of the alignedsequence, linked to Entrez, and theGI number of the aligned sequence,linked to its own BLink report.

Customizing the Display

BLink reports can be customizedusing a combination of format but-tons, taxonomic restriction controls,and a source database pulldownmenu. Taxonomic and source data-base restrictions take effect when the“Select” button is pressed.Alignments may be sorted on thebasis of sequence similarity to thequery, the default, or by the taxo-nomic proximity of the sourceorganisms using a link in the format-ting area.

Six format buttons are used to selectthe display mode of the BLinkreport. The ‘Best Hits’ button dis-plays a single line for each organismrepresented in the results, showingthe alignment of the best hit in theorganism group and a link, in the“N” column, to a Blink report limit-ed to the group. A portion of a

“Best Hits” display is shown inFigure 2.

The “Common Tree” button allowsfor the selective display of align-ments to sequences arising from spe-cific branches of the taxonomic tree.The related “Taxonomy Report” but-ton lists the BLink results as aBLAST Taxonomy Report.

To limit the view to those sequencesderived from structure records, pressthe “3-D Structures” button. In the“3-D Structures” display, shown inFigure 3, the dots are links to Con-served Domain and Cn3D structuredisplays.

The “CDD Search” button does notformat the BLink report, but insteadlinks to a precomputed conserveddomain display for the querysequence.

The “GI” button links tothe Entrez display of theprotein sequences whosealignments are shown inthe BLink report.

Taxonomic and DatabaseRestrictions

Sequences derived from ataxonomic group may beselectively removed fromthe display by clicking onany of the color-codedtaxon links; an “X” acrossthe sequence count forthe group indicates thatthe group will be removedfrom the display when the“Select” button is pressed.The BLink report can alsobe customized using the“Keep only” pulldownmenu to limit the displayto entries included indatabases such as RefSeq,Protein Data Bank,SwissProt, COGs, andNCBI CompleteGenomes.

Use BLink for Quick Insights intoProtein Function

Because BLink reports are pre-com-puted it is possible to rapidly view aBLAST alignment without having togenerate it. The graphical display ofthe aligned sequences provides aclear view of the distribution of con-served sequence blocks across taxo-nomic groups as an aid to under-standing evolutionary and functionalrelationships. Added insight into pro-tein function is provided by theCDD display of multiple sequencealignments for functional domainsallowing one to evaluate positionspecific sequence conservation in thecontext of the biological function ofthe query protein. The “3DStructure” display provides a quickway to determine the availability of3D structures to serve as modelingtemplates to use for further study.

—TT

BLAST Link (BLink) to Protein Alignments and Structures

Figure 2. "Best Hits" Blink report format indicating two alignments to asequence from a synthetic construct and six alignments to sequencesfrom Homo sapiens. A graphical depiction of the best alignment in eachorganism group is shown on the left.

Figure 3. BLink report limited to sequences derived from structuresusing the "3-D Structures" button.

Figure 1. BLink report for the human MLH1protein. Format controls arelocated at the top of the report with alignments, colored by the taxonomicorigin of the sequence match, given in the lower section.

Page 5: NCBI News - National Center for Biotechnology Information

is the Antrhopology/AlleleFrequencies databank, created in aneffort to determine HLA class I andclass II allele and haplotype frequen-cies in various human populations.Studies of allelic diversity in differentpopulations can shed light on theevolution of HLA polymorphism aswell as on the evolution and migra-tion of human populations. In a clin-ical context, knowledge of the allelefrequency distributions in variouspopulations is critical to the strategyof establishing and searching bonemarrow donor registries as well as

for studies of HLA-associated diseasesusceptibility.

Users of the resourcecan choose to viewallelic frequenciesfound in individualsfrom certain regionsof the world, or viewdata submitted by aparticular group.Users can also speci-fy the loci to be dis-played in the outputtable. Additionalinformation about

the project from thelinks to project

overview and data contributors isfound at:

www.ncbi.nlm.nih.gov/mhc/ihwg.fcgi?ID=9&cmd=PRJOV

The HCT and Anthropology/AlleleFrequencies resources are availablefrom the dbMHC home page vialinks from the blue sidebar menu.

Questions and comments can beaddressed to the NCBI Service Deskat:

[email protected]

—VP

A powerful way to view the data indbMHC is to compute Kaplan-Meiersurvival plots using the HCT onlineplotting tool. Many default parame-ters for the plots can be adjusted inorder to customize the data dis-played. Help and FAQ documentsare available for most plots. As anexample, consider the Kaplan-Meiersurvival plot shown in Figure 1. Theplot uses tick marks for censoreddata and shading to indicate the con-fidence intervals. To produce theplot, the Advanced Query Form isinitialized with default parameters

and one curve is plotted whichallows no mismatches at HLA A, B,C, DRB1 and DQB1 loci and ignoresHLA DPB1 locus. The parameterswhich affect the entire plot can bespecified and a summary is displayedfor each curve. A user may add acurve, modify a curve, delete a curve,view the plot, view plot data, viewindividual data, save the curveparameters, or restore saved parame-ters. When adding or modifying acurve, another form is displayed forentering the parameters for the curveand upon completion the user willreturn to the original page to viewthe updated selections.

Another important resource stem-ming from the efforts of the IHWG

The International HistocompatibilityWorking Group (IHWG) inHematopoietic Cell Transplantation(HCT) is a worldwide scientific col-laborative effort to support the useof information on the properties ofthe HLA barrier to allogeneic trans-plantation to improve the safety, effi-cacy and availability of HCT. TheIHWG HCT studies are designed todetermine whether complete allelematching for HLA-A, B, C DRB1,DQB1 and DPB1 is necessary forsuccessful transplantation. Data gen-erated from the study are anticipated

to offer new approaches to the selec-tion of suitable donors for HCT.

The IHWG and the NCBI at theNational Institutes of Health, havecollaborated to create a public data-base, dbMHC, to store genotype andclinical data, including up-to-dateinformation on matching and trans-plant outcomes, and to provideonline tools for data analysis. Thenew database contains anonymousdata for selected unrelated donortransplants performed worldwide forthe treatment of both malignant andnon-malignant blood disorders.More information and a link to a listof contributors to dbMHC is foundon the home page at:

www.ncbi.nih.gov/mhc/ihwg.fcgi?cmd=page&page=HCTintro

5 Spring 2004 NCBI News

Debut of the HCT Database and Anthropology/Allele Frequencies in dbMHC

Figure 1. Kaplan-Meier survival plot generated on the web using analysis tools available from dbMHC. A summary of the parametersused to create the plot is displayed to the right. The plot reveals a significant effect on transplant survival times of a one allele mismatch.

Page 6: NCBI News - National Center for Biotechnology Information

(human [organism] OR mouse [organism] OR rat[organism]) AND bmp7

human [organism] AND (bmp7 OR bmp3)

human [orgn] AND (bmp7 [title] OR bmp3 [title])

Like other Entrez databases, Geneoffers a number of display formatsbeyond the default “Graphical” for-mat. Additional formats include anXML format and a “Gene Table”view, providing access to thesequences of each of the gene’sexons and introns.

Entrez Gene is also accessible usingthe Entrez Programming Utilities (E-utilities), that provide access toEntrez from application programsand scripts.

Users interested in subscribing toemail announcements of new EntrezGene features are welcome to jointhe Gene-announce mailing list at:

www.ncbi.nlm.nih.gov/mailman/listinfo/gene-announce

—VP

6Spring 2004 NCBI News

(RefSeq) and intron, exon, and cod-ing region information is availablewith genomic coordinates. Eachaccession given in this section is alink to a menu allowing the displayof the sequence in several formats.Protein accessions provide menuoptions to navigate to BLink, CDD,or COG displays. This section isequivalent to the RNA-Genomicalignment available from the graphicat the top of a LocusLink entry. Inthe case of the Gene record forBMP7, NC_000020 is the accessionnumber of the genomic contig thatcontains the gene. Clicking on the“NC_000020” link brings up a menuused to select one of several displaysof the contig within the genomicrange of the gene BMP7.

The Entrez Gene record’s “GeneralGene Information” section summa-rizes information contained inLocusLink’s “Function”,“Relationships” and “MapInformation” sections. This sectionincludes several categories of infor-mation, such as Gene Ontology(GO), Homology, Phenotypes,Markers, Pathways and Relationships.

The remaining sections of an EntrezGene record-”NCBI ReferenceSequences”, “Related Sequences”,and “Additional Links”-are equiva-lent to the corresponding entries inthe LocusLink report. The first sec-tion lists gene-specific NCBIRefSeqs, provides links to the appro-priate Entrez sequence database, andgives descriptions of each transcriptvariant, the accession numbers ofsequences used to support theRefSeqs, and a listing of conserveddomains found in the encoded pro-teins. The “Related Sequences” sec-tion lists the nucleotide and proteinaccessions of sequences that arerelated to the gene, and provideslinks to the sequence records inEntrez. The “Additional Links” sec-tion provides a printable view of a

subset of links to information bothwithin and external to NCBI. Someof these links overlap those includedin the Links menu. The intent of thissection is to provide a printablereport of, for example, MIM num-bers, UniGene cluster numbers, andfamily-specific Web sites.

Entrez Gene can be considered asthe successor to LocusLink, butGene improves on LocusLink byproviding coverage of more NCBIreference genomes, by providingadditional display formats, and by itsintegration with other databaseswithin NCBI’s Entrez system. Userscan query Gene via the powerfulquery features of Entrez, usingBoolean operators, filters, and fieldlimiters, such as accession number,gene name, protein name,disease/phenotype, and map loca-tion. Users can search for records inEntrez Gene using any of the searchstrategies in the shaded box.

LocusLink to Entrez Genecontinued from page 1

Table 1. LocusLink to Gene feature transition chart. The help documentation covers the conversion of themaster LocusLink FTP file, “LL_tmpl”, to the Entrezgene.asn format. The Entrezgene.asn data will beavailable on the Gene FTP site in the near future.

Page 7: NCBI News - National Center for Biotechnology Information

missions from large scale sequencingprojects of draft sequences, such asphase 1 and phase 2 high-throughputgenomic sequences (HTGS), thatwere longer than 350 KB. To avoidbreaking a huge amount of draftsequence into 350 KB chunks, thedatabase collaborators agreed torelax the 350 KB limit in these cases.The 350 KB limit was also relaxedfor assemblies of Whole GenomeShotgun (WGS) project data and forlarge eukaryotic genes.

Removal of 350 kb Limit

In 2003, the Database Collaboratorsagreed to remove the 350 KB limitfor all sequences as of June 2004,since the increased ability of molecu-lar biology software to analyze longsequences quickly has rendered thelimit on sequence length unnecessary.To help software developers preparefor the change, some sample recordswith large sequences have been madeavailable for testing:

ftp.ncbi.nih.gov/genbank/LargeSeqs

An example of the effect of theremoval of the 350 KB limit onGenBank records may be seen in thecase of accession U00096, the

Escherichia coli K-12 MG1655 com-plete genome sequence. Under the350 KB limit, this accession numberrefered to a contig record giving alist of short sequences that can beassembled to create the completegenome. With the removal of the350 KB limit the accession nowrefers to the complete contiguoussequence for Escherichia coli K-12.The accessions for all 400 parts willappear as secondary accessions. TheCON division will remain as aGenBank division to representsequences which by their nature areassembles; Ex. genome scaffoldrecords.The effect of the changes on theNCBI GenBank FTP files and theBLAST database files available fordownload is expected to be minimal.As sequences become secondary toprimary records, the overall size ofthe databases should not changedrastically. However, the number ofmegabased-sized records will in-crease, therefore NCBI recommendsthat software be tested with theexample large sequence records,mentioned above.

—SM

7 Spring 2004 NCBI News

In 1995, the International NucleotideSequence Database Collaborators(GenBank, DDBJ, and EMBL)agreed to a 350 KB limit on the sizeof database sequence records inorder to maintain compatibility withexisting molecular biology softwarethat was not able to work with largesequences.

At this time, a new GenBank divi-sion was created called "CON" forcontig. The records in the CON divi-sion contain the instructions for theassembly of full-length contigs fromthe sequence data of multipleGenBank records. Although CONdivision records contain no sequencedata, the assembly information theyprovide makes it possible for NCBI'sEntrez search and retrieval system toshow complete genomic sequencesby dynamically assembling the datafor display. Using the information inCON division records, FTP files arealso regularly created by NCBI fordownload that contain megabase-scale genomic sequences as singleFASTA files.

By 1998, GenBank, DDBJ, andEMBL were routinely accepting sub-

350 kb Sequence Length Limit Removed by Sequence Database Collaboration

NCBI has created new Map Viewerdisplays for several organisms, in-cluding the honey bee, cat, and thefungi Eremothecium gossypii andEncephalitozoon cuniculi. In addition,new genome guides are available forthe honey bee, cat, chicken, frog andsea urchin.

The Map Viewer display for Apis mel-lifera, the honey bee, includes a vari-ety of sequence-based maps, such asmaps for contigs, UniGene clusters,genes, and transcripts, as well as theSolignac microsatellite-based linkagemap. The current honey bee genomebuild, Amel_1.1, is a composite ofwhole genome shotgun sequence andBAC sequence from clones isolatedby a clone-array pooled strategy.

New Eukaryotic Genomes at NCBIThe Apis mellifera whole genomeshotgun (WGS) project has beenassigned the project accessionAADG00000000 and the Amel_1.1assembly displayed and annotated inMap Viewer, is comprised of acces-sions AADG02000001-AADG020-30074.

The NCBI Map Viewer for Felis catuspresents an RH map of the catgenome. The map contains 1,126markers, including microsatellite-based markers and coding loci. Anintegrated genetic map is also includ-ed.1,2

The sequencing, annotation, andanalysis of the Eremothecium gossypiigenome is described in Dietrich etal.3 Its chromosomes have been

assigned RefSeq accessionsNC_005782 to NC_005789. Thesequenced-based maps- contig, gene,and transcript maps- are provided inthe Map Viewer for this filamentousfungus.

The sequencing, annotation, andanalysis of the Encephalitozoon cuniculigenome is described in Katinka etal.4 Its chromosomes have beenassigned RefSeq accessionsNC_003242, and NC_003229 toNC_003238. The contig and genemaps are available in the Map Viewerfor this microsporidium.

New Genome Guide pages, createdby NCBI in cooperation with thegenomic research communities toprovide links to an array of genome-continued on page 11

Page 8: NCBI News - National Center for Biotechnology Information

Documenting the interaction ofhuman immunodeficiency virus type1 (HIV-1) proteins with those of thehost cell is crucial to our understand-ing of the processes of HIV-1 repli-cation and pathogenesis. To meetthis need, the Division of AcquiredImmunodeficiency Syndrome(DAIDS) of the National Institute ofAllergy and Infectious Diseases(NIAID), in collaboration with theSouthern Research Institute andNCBI, has begun compiling a com-prehensive “HIV Protein-InteractionDatabase” to provide a concise sum-mary of documented interactionsbetween HIV-1 proteins and host cellproteins, other HIV-1 proteins, orproteins from disease organismsassociated with HIV or AIDS. Foreach documented protein-proteininteraction the following informationis collected, if available:

Protein Reference Sequence acces-sion numbers.

Entrez Gene ID numbers.

The Amino acids from each proteinthat are known to be involved in theinteraction.

A Brief description of the protein-pro-tein interaction.

Keywords to support searching forinteractions.

PubMed identification numbers for alljournal articles describing the interac-tion.The HIV Protein-Interaction Data-base may be searched through anonline interface at:

www.ncbi.nlm.nih.gov/RefSeq/HIVInteractions/index.html

Clicking on a link for an HIV-1 pro-tein in the “Reports and downloads”

The technology of Whole GenomeShotgun (WGS) sequencing is nowbeing applied to quickly assemblelarge sets of genomic sequencestaken from organisms inhabiting aparticular ecological niche. Sequencedata collected in this manner pro-vides a snapshot of the geneticdiversity existing at a particular localeand is especially important in provid-ing data for organisms which are dif-ficult or impossible to culture in thelaboratory. Recently, The Institutefor Biological Energy Alternativessampled water from the SargassoSea, one of the most well-character-ized regions of the world's oceans.1The larger of two sets of samplescollected produced over 1.3 gigabas-es of sequence in the form 1.66 mil-lion WGS reads. These reads wereassembled into contigs containingabout 1 gigabase of non-redundantsequence. In addition, over 1 millionprotein sequences were derived fromthe annotation of open readingframes on the genomic sequences.Contigs constructed from the WGSreads and the remaining single readshave been deposited in the WGSdivision of GenBank, under theproject accession numberAACY01000000. Scaffolds assem-bled from these contigs are availablewithin the accession rangesCH004436-CH004736, andCH004737-CH236877. The rawsequencing data is available in theTrace Archive.

The Sargasso Sea dataset along withother environmental sample datasets,such as sequences from an acid minedrainage biofilm submitted by theDOE Joint Genome Institute,2 canbe queried using the new “Environ-mental Samples” BLAST page at:

www.ncbi.nlm.nih.gov/BLAST/Genome/EnvirSamplesBlast.html

8Spring 2004 NCBI News

New HIV Protein-Interaction Database

section of the page displays its inter-action report.

Interaction reports for HIV-1 pro-teins are displayed in a 4-column for-mat beginning from the left with theHIV-1 protein name, linked to theLocusLink report for the gene, andcontinuing with phrases taken fromthe literature indicating, in the sec-ond column, a type of interaction,and, in the third column, a descrip-tion of the interaction partner. Thefourth column displays links toLocusLink and Entrez Gene reportsfor the interaction partner.Interaction reports can be filtered ina number of ways by phrases appear-ing in the reports using pull downphrase lists. Full or filtered reportsmay be downloaded as text in a tab-delimited format that includes fieldsfor the name of the subject HIV-1protein, its RefSeq accession number,the interaction phrase, the descrip-tion of the interaction partner, thepartner’s RefSeq accession number,the title of the publication reportingthe interaction, and its PubMed iden-tifier. For example, the first line inthe tab-delimited file produced by fil-tering the interaction report for the“tat” protein by the interaction“upregulates” is shown in Figure 1.

Interaction reports are currentlyavailable for 7 of the 9 proteins pro-duced by HIV-1 including, gag, pol,rev, tat, vif, vpr, and vpu. Reportsfor the remaining proteins nef andenv, will be completed soon. Allprotein-protein interactions docu-mented in the HIV Protein-Interaction Database are listed inEntrez Gene reports in the “HIV-1protein interactions” section.

—DW

Tat, p14 NP_057853 upregulates B-cell lymphoma 6 protein NP_001697 HIV-1

Tat upregulates the expression of BCL-6 in Kaposi's sarcoma cells 11994280

Environmental SamplesMake Big Splash

continued on page 12

Figure 1. First line in the tab-delimited file produced by filtering the interaction report for the “tat” proteinby the interaction “upregulates”.

Page 9: NCBI News - National Center for Biotechnology Information

user-supplied primer pairs. TheReverse e-PCR query page canaccept up to 20 primer pairs or STSidentifiers as input for searchesagainst organism-specific databases.Primer pairs or STS identifiers canbe searched against the genomic andtranscript databases of Anopheles gam-biae, Arabidopsis thaliana, Caenorhabditiselegans, Drosophila melanogaster, Homosapiens, Mus musculus and Rattusnorvegicus.

Interested in seeing how the forwardand reverse e-PCR tools work? Tryyour hand at some of the e-PCRexamples that are provided on thetwo search pages found at:

www.ncbi.nlm.nih.gov/sutils/e-pcr

For those who need to perform largebatches of searches or who need tosearch a custom database, a stand-alone version of e-PCR has beendeveloped for the Windows, Linuxand Unix operating systems. Thesebinaries, along with the source codefor compiling e-PCR on other oper-ating systems, are available via FTPat:

ftp.ncbi.nlm.nih.gov/pub/schuler/e-PCR

—SD

STS is found using the online ver-sion of Forward e-PCR, the chromo-somal location and a link to UniSTSare provided. Mapped markers canbe displayed from UniSTS reportsusing the MapViewer and viewed inthe context of the other availablegenomic data.

To improve the sensitivity of aForward e-PCR search, there is nowan option to search using discontigu-ous, or imperfect, matches betweenthe query sequence and the STSs inUniSTS. To increase the probabilityof finding STSs that may have beenmissed with the contiguous wordoption, the size of the segment to bematched, called the word size, num-ber of allowable mismatches, andnumber of permissible gaps can beadjusted. The size of the STS canalso be adjusted in order to allow fordeviations which may arise from theamplification of a region in thegenome that shows length polymor-phism.

Reverse e-PCR can be used to esti-mate the genomic binding site,amplicon size and specificity for

9 Spring 2004 NCBI News

e-PCR and Reverse e-PCR: Greater Sensitivity, MoreOptionsElectronic PCR (e-PCR) is used toidentify sequence landmarks calledSequence Tagged Sites (STSs) withina nucleotide sequence. Electronic-PCR works by looking for matchesto STS primer pairs with the orienta-tion and spacing required to producean amplicon of the expected size.

Two types of e-PCR can now beperformed from the e-PCR homepage: the original, Forward e-PCR,and a new application, Reverse e-PCR. Forward e-PCR is used todetermine if a user-suppliednucleotide sequence contains anyknown STS. Queries are madeagainst the markers in NCBIsUniSTS database, a public collectionof PCR primer pairs used in map-ping and other types of genomeanalysis for a wide range of organ-isms. UniSTS contains over 270,000markers and includes data from theSTS division of GenBank, TheRadiation Hybrid Database, TheGenome Database, Mouse GenomeInformatics, the Rat GenomeDatabase, Zebrafish InformationNetwork and PubMed Central. If an

New Organisms in UniGene

UniGene now covers 45 animals andplants and can be searched using theEntrez search system where it islinked to nucleotide records. Recentadditions to UniGene include:

Canis familiaris (dog) with 15,281transcript sequences in 4,544 clusters,Helianthus annuus (sunflower), with

21,155 transcript sequences in 1,904clusters, Salmo salar (Atlantic salmon),with 23,111 sequences in 1,050 clus-ters, Bombyx mori (domestic silk-worm), with 60,481 sequences in2,050 clusters, Apis mellifera (honeybee), with 14,386 transcripts in 5,190clusters, Lotus corniculatus (Birdsfoottrefoil), with 59,678 transcripts in8,158 clusters, Physcomitrella patens(physcomitrella moss) with 83,948

transcripts in 6,950 clusters, Lactucasativa (garden lettuce) with 53,347transcripts in 9,656 clusters, Malus xdomestica (Apple) with 34,733 tran-scripts in 3,996 clusters, Hydra magni-papillata with 69,049 transcripts in5,362 clusters, Populus tremula xPopulus tremuloides (aspen) with 20,820transcripts in 2,707 clusters, and Ovisaries (sheep) with 2,852 transcripts in1,164 clusters.

RefSeq Accession Numbers Get Longeras Rat Gets Last 6-digit Accession

Rattus norvegicus has scurried to the end of the possibilitiesoffered by the six digit RefSeq accession format by makingoff with NCBI protein RefSeq accession NP_999999,required for its olfactory receptor Olr1386. To compen-sate, RefSeq accessions have now been extended to a

length of 9 digits, e.g., NP_123456789. Preexisting acces-sions will not be changed and accession numbers in boththe old and new extended formats, such as NM_000100and NM_01000000, will coexist.

Are you wondering which organism got the first 9-digitRefSeq protein accession? That would be the energeticand inquisitive Rattus norvegicus again, for another of itsvery important olfactory receptors!

Page 10: NCBI News - National Center for Biotechnology Information

Indiana University mirror:

bio-mirror.net/biomirror/genbank

Uncompressed, the Release 142.0 flat-files require approximately 136 giga-bytes while the more compact ASN.1version requires 119 gigabytes.

The FGPlus provides more detailed information, practicaltips, and more extensive hands-on practice than the stan-dard course. In addition, the FGPlus offers completeBLAST and structure modules, as well as a Mini-Courseon disease genes. Following the success of the Spring edition,NCBI will again offer the FGPlus course on August 24 and 25th,2004 at the National Library of Medicine's Lister Hill auditori-um. For more details and registration information pleasevisit the FGPlus page:

www.ncbi.nlm.nih.gov/Class/FieldGuide/FGPlus

or write to Peter Cooper ([email protected])

Primary FTP site at NCBI:

ftp.ncbi.nih.gov/genbank

San Diego SuperComputer Centermirror:

genbank.sdsc.edu/pub

the scope and content of the release are provided at:

ftp.ncbi.nih.gov/refseq/release/release-notes

For more information, visit the NCBI RefSeq Web Site at:

www.ncbi.nih.gov/RefSeq

RefSeq Release 6 is now available by anonymous FTP at:

ftp.ncbi.nih.gov/refseq/release

Release 6 includes genomic, transcript, and proteinsequences available as of July 5, 2004 from 2,467 organ-isms. The number of RefSeq accessions in Release 6 andtheir combined lengths is given in the shaded box.

RefSeq releases are posted bimonthly and the next releaseis scheduled for September. Release notes documenting

10Spring 2004 NCBI News

RefSeq Release 6 on FTP Site

GenomicRNAProtein

68,592247,639

1,050,975

8,263,102,565433,269,151365,446,682

# Accessions # Basepairs/Residues

Exponential Growth of GenBank Continues withRelease 142Over the past decade, the growth ofGenBank has followed an exponen-tial curve with a doubling time ofbetween 12 and 15 months. Asshown in Figure 1, the trend contin-ues with release 142 for which close-of-data was June 16. In the eightweek period between the close datesfor GenBank releases 141.0 and142.0, the non-WGS portion ofGenBank grew by 1,335,978,783base pairs and by 1,855,785 sequencerecords. The number of base pairsof sequence in release 142 for sever-al organisms of interest is shown inFigure 2.

GenBank release 142 is available onthe NCBI FTP site and at two mir-ror sites.

Entrez Tools is a ‘Hot Spot’Look for a new “Hot Spot” on the NCBI homepage called “Entrez Tools”. Entrez Tools provides single-page access to agroup of specialized Entrez resources. These resources include Batch Entrez, the Entrez Cubby, help with advanced Entrezsearches, documentation for the Entrez utilities, and a guide to creating links to the interactive Entrez pages.

The popularity of the free NCBI training course for lifescientists, A Field Guide to NCBI Resources, continues togrow, with over 40 courses presented throughout theUnited States in the first half of 2004. Portions of theField Guide have also been presented as more detailedmodules on specific tools and databases, including a 3Dstructures course, a gene expression course, and a BLASTcourse. In addition to the 2-day Field Guide, NCBI offersseveral 2-hour problem-centered Mini-Courses. ThisSpring, the NCBI conducted the first enhanced FieldGuide, the FGPlus, at the National Library of Medicine,that combines the best of the standard Field Guide withthe modular Field Guide courses and the Mini-Courses.

Slots available for FieldGuidePlus Training Course Onsite at NCBI

Figure 2. Billions of base pairs of sequence in GenBankrelease 142 for selected organisms.

Figure 1. Growth of GenBank in billions of base pairsfrom release 3 in April of 1994 to the current release,142.

Page 11: NCBI News - National Center for Biotechnology Information

specific resources, are available for honey bee, cat andchicken. These pages can be found at:

www.ncbi.nlm.nih.gov/genome/guide/bee

www.ncbi.nlm.nih.gov/genome/guide/cat

www.ncbi.nlm.nih.gov/genome/guide/chicken

A Map Viewer display for the chicken genome, Gallus gallus,will be available soon, but the sequences deposited intoGenBank under the whole genome shotgun (WGS)sequencing project accession AADN00000000 (accessionsAADN01000001-AADN01111864), are available now inEntrez and on the GenBank FTP site within the “WGS”directory.

—VP1Menotti-Raymond et al, 2003a, PMID 149707162Menotti-Raymond et al, 2003b, PMID 12692169 3Dietrich et al. Science 304: 304-307, 2004 PMID 150017154Katinka et al. Nature 414:450-453, 2001 PMID 11719806

The sequences in "infile" will be clustered and the resultswill be written to "outfile". The input sequences are identi-fied as nucleotide (-p F); "-p T", or protein, is the default.To register a pairwise match two sequences will need to be95% identical (-S 95) over an area covering 90% of thelength (-L .9) of each sequence (-b T) . Using "-b F"instead of "-b T" would enforce the alignment lengththreshold on only one member of a sequence pair. Theparameter "S", used here to specify the percent identity, canalso be used to specify, instead, a "score density." The lat-ter is equivalent to the BLAST score divided by the align-ment length. If "S" is given as a number between 0 and 3,it is interpreted as a score density threshold; otherwise it isinterpreted as a percent identity threshold.

To create a stringent non-redundant protein sequence set,use the following command line:

blastclust -i infile -o outfile -p T -L 1 -b T -S 100

In this case, only sequences which are identical will be clus-tered together. The “blastclust.txt” file in the standaloneBLAST package details the full range of BLASTClustparameters.

—DW

In the simplest case, BLASTClust takes as input a file con-taining catenated FASTA-format sequences, each with aunique identifier at the start of the definition line.BLASTClust formats the input sequence to produce a tem-porary BLAST database, performs the clustering, andremoves the database at completion. Hence, there is noneed to run formatdb in advance to use BLASTClust. Theoutput of BLASTClust consists of a file, one cluster to aline, of sequence identifiers separated by spaces. The clus-ters are sorted from the largest cluster to the smallest.

BLASTClust accepts a number of parameters that can beused to control the stringency of clustering includingthresholds for score density, percent identity, and alignmentlength. The BLASTClust program has a number of appli-cations, the simplest of which is to create a non-redundantset of sequences from a source database. As an example,one might have a library of a few thousand shortnucleotide sequence reads and wish to replace these with anon-redundant set. To produce the non-redundant set, onemight use:

blastclust -i infile -o outfile -p F -L .9 -b T -S 95

11 Spring 2004 NCBI News

Using BLASTClust to Make Non-redundant Sequence SetsBLASTClust is a program within the standalone BLAST package used to cluster either protein or nucleotidesequences. The program begins with pairwise matches and places a sequence in a cluster if the sequence matchesat least one sequence already in the cluster. In the case of proteins, the blastp algorithm is used to compute thepairwise matches; in the case of nucleotide sequences, the Megablast algorithm is used.

New Eurkaryotic Genomescontinued from page 7

New Microbial Genomes in GenBank

Mycobacterium avium subsp. paratuberculosis str. k10

Bdellovibrio bacteriovorus

Treponema denticola ATCC 35405

Lactobacillus johnsonii NCC 533

Wolbachia endosymbiont of Drosophila melanogaster

Wolbachia endosymbiont of Drosophila melanogaster

Mycoplasma mobile 163K

Bacillus anthracis strain Ames 0581

Picrophilus torridus

AE016958 | NC_002944

BX842601 | NC_005363

AE017226 | NC_002967

AE017098 | NC_005362

AE017196 | NC_002978

pending | pending

AE017308 | NC_006908

AE017261 | NC_005877

Organism GenBank | RefSeqAccession Numbers

For more detailed information, see the online version of the Spring2004 NCBI News, or use the GenBank or RefSeq Accession Numberto query the Entrez “Genome” database using the query box on theNCBI Home Page.

Page 12: NCBI News - National Center for Biotechnology Information

What is the total number of records in each of the 23 Entrez databases? Ifyou've ever asked yourself this question, try the following query in the Entrezglobal search:

All[filter]

Each of the Entrez databases supports the “all” filter term which returns thetotal number of records indexed in a database. You can also collect this infor-mation using an E-utilities URL, e.g.:

eutils.ncbi.nlm.nih.gov/entrez/eutils/egquery.fcgi?term=all[filter]The results of such a query as of mid-June are plotted in the Figure 1.

Entrez Quiz

Department of Health and Human ServicesPublic Health Service, National Institutes of HealthNational Library of MedicineNational Center for Biotechnology InformationBldg. 38A, Room 3S3088600 Rockville PikeBethesda, Maryland 20894

Official BusinessPenalty for Private Use $300

FIRST CLASS MAILPOSTAGE & FEES PAID

DHHS/NIH/NLMBETHESDA, MD

PERMIT NO. G-816

Spring 2004NATIONAL INSTITUTES OF HEALTH ■ National Library of Medicine

NCBI News

Figure 1. Number of records inthe 23 Entrez databases as ofmid June 2004. Numbers areplotted on a logarithmic scale.

Environmental Samplescontinued from page 8

Environmental sample data can alsobe searched using two newly-createdstandard BLAST databases, “env_nt”or “env_nr” for nucleotide and pro-tein sequences respectively. Theenvironmental sample data containedwithin these two new databases is nolonger contained within the “nt” or“nr” BLAST databases.

—VP

1 Venter, J.C., et.al., Environmental Genome ShotgunSequencing of the Sargasso Sea, Science, 2004 Apr2;304(5667):66-74.

2 Tyson, G.W., et.al., Community structure and metabo-lism through reconstruction of microbial genomes fromthe environment, Nature, 2004 Mar 4;428(6978):37-43.