

The 250th ACS National Meeting (Boston, MA)

Programmatic Access to Chemical Information in PubChem

Sunghwan Kim, Paul Thiessen, Evan E. Bolton and Stephen H. Bryant

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health


Acknowledgements

Stephen Bryant, Evan E. Bolton, Lewis Geer, Yanli Wang, Asta Gindulyte, Lianyi Han, Bo Yu

Paul Thiessen*, Jian Zhang, Jiyao Wang, Renata Geer, Ben Shoemaker, Jane He, Jie Chen

Tiejun Cheng, Gang Fu, Leonid Zaslavsky, Takako Takeda, Ming Hao, Amrita Roy Choudhury

The PubChem Team

PubChem depositors, users, and collaborators

Funded by the National Library of Medicine


PubChem (https://pubchem.ncbi.nlm.nih.gov)

A “public” repository of information on small molecules and their biological activities, developed and maintained by the U.S. National Institutes of Health (NIH).

Launched in 2004 as part of the NIH Molecular Libraries Roadmap Initiative.

A key chemical information resource for researchers in cheminformatics, chemical biology, medicinal chemistry, and many other fields.


PubChem (http://pubchem.ncbi.nlm.nih.gov)

PubChem contains:

• >157 million substance descriptions,
• >60 million unique chemical structures,
• >229 million biological test results,
• >1 million biological assays, covering ~10,000 unique protein sequence targets.

(Arguably) the largest corpus of publicly available chemical information, from more than 340 data contributors.

Programmatic Access to PubChem

• Entrez Utilities (E-Utils)
• Power User Gateway (PUG)
• PUG-SOAP
• PUG-REST
• PubChemRDF REST interface

Reference

PUG-SOAP and PUG-REST: web services for programmatic access to chemical information in PubChem. S. Kim, P.A. Thiessen, E.E. Bolton, and S.H. Bryant. Nucleic Acids Res. 2015, 43(W1):W605-11.

Entrez Utilities (E-Utils)

Entrez

NCBI’s database search and retrieval system.

Integrates NCBI’s >40 databases into a tightly interlinked system.

Provides an integrated view of biomedical data and their relationships.


Entrez Utilities (E-Utilities or E-Utils)

A suite of tools that provides access to nearly all Entrez functionality, primarily through an XML-over-HTTP interface.

Not developed or maintained by PubChem.

http://www.ncbi.nlm.nih.gov/books/NBK25497/#_chapter2_The_Nine_Eutilities_in_Brief_

• EInfo: database statistics
• ESearch: text searches
• EPost: UID uploads
• ESummary: docsum download
• EFetch: data record download
• ELink: Entrez links
• EGQuery: global query
• ESpell: spelling suggestions
• ECitMatch: batch citation search
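
As an illustration of the XML-over-HTTP style, here is a minimal Python sketch of an ESearch text search. The base URL, the `pccompound` database name, and the `IdList/Id` layout of the reply are recalled from the E-utilities documentation rather than taken from these slides, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed E-utilities base URL (not shown on the slides).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """Build an ESearch URL for a text search in one Entrez database."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS_BASE}/esearch.fcgi?{query}"


def parse_id_list(xml_text):
    """Extract the UIDs from an ESearch XML reply (assumed <IdList><Id> layout)."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.findall("./IdList/Id")]


def search(db, term, retmax=20):
    """Run the search over HTTP and return the matching UIDs as strings."""
    with urlopen(esearch_url(db, term, retmax)) as resp:
        return parse_id_list(resp.read().decode("utf-8"))
```

Calling `search("pccompound", "aspirin")` would return compound UIDs as text; structures and bioactivity tables need the PUG interfaces described next.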

Entrez Utilities (E-Utilities or E-Utils)

Suited for accessing text or numeric-fielded data.

No ability to handle complex data types specific to PubChem:
• Chemical structures
• Tabular bioactivity data


Power User Gateway (PUG)

Provides programmatic access to PubChem Services via a single common gateway interface (CGI), available at the URL:

http://pubchem.ncbi.nlm.nih.gov/pug/pug.cgi

Exchanges data through a relatively complex XML schema, over the Hypertext Transfer Protocol (HTTP).

Examples of PUG-enabled services are:
• Substance/Compound download
• BioAssay data download
• Structure standardization service
• Chemical structure search
• Score matrix service
• Identifier exchange service


Power User Gateway (PUG)

Each PUG-supported service has its own input/output.

Different XML for different services, defined in:
• http://pubchem.ncbi.nlm.nih.gov/pug/pug.dtd
• http://pubchem.ncbi.nlm.nih.gov/pug/pug.xsd

PUG XML can also be used to import and export queries within supported service web pages.


Power User Gateway (PUG)

Most PUG requests are queued, allowing users to submit a number of long-running tasks.

The initial PUG response contains a 64-bit identifier for the requested task, which must be used in any further communication with PUG concerning that task.

Upon request, PUG will check the status of a task given its identifier, returning the results of the task if completed or the status of the task if not completed.
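
The submit-then-poll pattern described above can be sketched in Python as follows. This is a generic illustration, not PUG's actual API: `check` is a hypothetical caller-supplied function that sends the status request for a task identifier and returns the result once the task is complete, or `None` while it is still queued or running.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def wait_for_task(
    task_id: str,
    check: Callable[[str], Optional[T]],
    interval: float = 2.0,
    max_checks: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Poll check(task_id) until it returns something other than None."""
    for attempt in range(max_checks):
        result = check(task_id)
        if result is not None:
            return result  # task completed; hand back its results
        if attempt < max_checks - 1:
            sleep(interval)  # pause before asking again
    raise TimeoutError(f"task {task_id} not finished after {max_checks} checks")
```

The polling interval and the cap on the number of checks are arbitrary choices for the sketch; a real client would build `check` around the PUG XML status request.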

PUG-SOAP

Uses the Simple Object Access Protocol (SOAP).

Offers much of the same functionality as PUG, broken down into simpler functions defined via the Web Services Description Language (WSDL; http://www.w3.org/TR/wsdl), using SOAP-formatted message envelopes for information exchange.

The WSDL for PUG-SOAP can be found at: http://pubchem.ncbi.nlm.nih.gov/pug_soap/pug_soap.cgi?wsdl

Most suitable for SOAP-aware GUI workflow applications (Taverna and Pipeline Pilot) and programming languages (C, C++, C#, .NET, Perl, Python, and Java).

“Keys” in PUG-SOAP

Simple strings that store data objects:
• Structure keys for a single chemical structure
• List keys for a set of identifiers (SIDs, CIDs, AIDs)
• Assay keys for a set of rows and columns from an assay table
• Download keys for a download URL (usually FTP)

Used for exchanging data between the PUG-SOAP server and the client application:
• Avoids sending/receiving intermediate results
• Allows one to readily chain queries between different PubChem services
• Reduces bandwidth requirements

“Functions” in PUG-SOAP

Input functions
• Specify the input structure and ID list
• Synchronous

Processing functions
• Perform a supported operation on the input
• Asynchronous

Output functions
• Retrieve the results
• Synchronous

Asynchronous functions in PUG-SOAP

Workflow diagram (reconstructed from the slide):
1. InputStructure(): SMILES (c1ccccc1) → Structure key (733…801)
2. IdentitySearch(): Structure key → List key (457…843), checked with GetOperationStatus()
3. Download(): List key → Download key (397…976), checked with GetOperationStatus()
4. GetDownloadURL(): Download key → FTP URL (ftp://......)

PUG-REST

Representational State Transfer (REST)-style interface.

Simplified access route without the overhead of XML or SOAP envelopes.

Access to data that are not accessible through other PUG Services.

Intended to handle short, synchronous requests (<30 seconds).

Conceptual Framework of a PUG-REST request

User’s computer ↔ PUG-REST server at PubChem

① INPUT: identifiers (CIDs, SIDs, AIDs)
② OPERATION: what to do with the identifiers
③ OUTPUT: results in a desired format

All necessary information is encoded into a one-line URL.

URL construction for a PUG-REST request

http://pubchem.ncbi.nlm.nih.gov/rest/pug/<INPUT>/<OPERATION>/<OUTPUT>[?OPTIONS]

• Prolog (http://pubchem.ncbi.nlm.nih.gov/rest/pug): common to all PUG-REST requests
• <INPUT>: specifies identifiers of interest (by identifiers, by chemical name, by chemical structure search, by cross reference, by listkey, ......)
• <OPERATION>: specifies what to do with the input (get full records, get molecular properties, get synonyms or images, get cross references, many other operations)
• <OUTPUT>: specifies the desired output format (XML, JSON, JSONP, ASNB, ASNT, PNG, SDF, CSV, TXT)
• [?OPTIONS]: options specific to some operations

The three parts are (mostly) independent of each other.
→ Many combinations are possible in a PUG-REST request.
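
The URL layout above can be assembled mechanically. The following Python sketch is one way to do it; the function name `pug_rest_url` and the choice to percent-encode each path segment (keeping commas) are assumptions of this sketch, and the prolog is written with https.

```python
from urllib.parse import quote, urlencode

# Prolog common to all PUG-REST requests (https variant assumed here).
PROLOG = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def pug_rest_url(input_spec, operation, output, **options):
    """Join <INPUT>/<OPERATION>/<OUTPUT> segments and append ?OPTIONS if any."""
    segments = list(input_spec) + list(operation) + [output]
    # Percent-encode each segment; commas stay literal for identifier lists.
    path = "/".join(quote(str(seg), safe=",") for seg in segments)
    url = f"{PROLOG}/{path}"
    if options:
        url += "?" + urlencode(options)
    return url


# Full 3-D records for CIDs 2244 and 1983 in XML.
print(pug_rest_url(["compound", "cid", "2244,1983"], ["record"], "XML",
                   record_type="3d"))
```

Inputs containing a forward slash do not fit this scheme well; the slides recommend HTTP POST for those.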


① http://...... /compound/cid/2244,1983/record/XML?record_type=3d

→ Retrieve in XML full records for CIDs 2244 and 1983, including 3-D structure description.


② http://....../compound/smiles/C(C(=O)O)N/property/TPSA,XLogP/CSV

→ Retrieve in CSV the TPSA and XLogP values for compounds whose SMILES string is “C(C(=O)O)N”.

TPSA: topological polar surface area


③ http://....../compound/name/lipitor/record/PNG?record_type=2d&image_size=large

→ Download the large image of the 2-D structure of Lipitor in PNG.

PNG: portable network graphics


④ http://....../substance/xref/PatentID/US6127355/sids/TXT

→ Retrieve in TXT the SIDs of substances mentioned in U.S. Patent US6127355.


⑤ http://....../assay/aid/640/cids/XML?cids_type=active

→ Retrieve in XML the CIDs of compounds that tested active in AID 640.


⑥ http://……/assay/aid/490,1000/targets/ProteinName,GeneSymbol/XML

→ Retrieve in XML the protein names and gene symbols of the targets of AIDs 490 and 1000.


⑦ http://……/assay/aid/504526/doseresponse/CSV?sid=104169547

→ Retrieve in CSV dose-response data for SID 104169547 tested in AID 504526.


⑧ http://....../assay/target/gi/66528677/concise/CSV

→ Get in CSV a concise view of assays targeting protein GI 66528677 (glucocorticoid receptor isoform γ).

Search by chemical name

Exact match (compound whose name is aspirin)

→ Identical to “aspirin[CompleteSynonym]” in Entrez

http://....../compound/name/aspirin/cids/TXT?name_type=complete

Partial match (Compounds whose name contains aspirin)

→ Identical to “aspirin[Synonym]” in Entrez

http://....../compound/name/aspirin/cids/TXT?name_type=word

Requests that conflict with URL syntax

Multi-line SDF files

Chemical names, SMILES, and InChI strings with special characters reserved in the URL syntax, e.g. a forward slash (“/”):
• C(=C/F)\F
• InChI=1S/C2H2F2/c3-1-2-4/h1-2H/b2-1+

Very long lists of identifiers in the URL

→ Use HTTP POST

> curl -H 'Content-Type: application/x-www-form-urlencoded' -d "inchi=InChI=1S/C3H8/c1-3-2/h3H2,1-2H3" http://PROLOG/rest/pug/compound/inchi/cids/TXT
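
A rough Python counterpart of the curl command, using only the standard library. The helper name `inchi_to_cids_request` is hypothetical and the prolog is assumed to be the https form of the one on the slides. Unlike `curl -d`, this sketch percent-encodes the form value, which is the usual convention for `application/x-www-form-urlencoded` bodies.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

PROLOG = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"  # assumed https prolog


def inchi_to_cids_request(inchi):
    """Build a POST request that carries the InChI in the body, not the URL."""
    body = urlencode({"inchi": inchi}).encode("ascii")
    return Request(
        f"{PROLOG}/compound/inchi/cids/TXT",
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def inchi_to_cids(inchi):
    """Send the request and return the reply lines (one CID per line in TXT)."""
    with urlopen(inchi_to_cids_request(inchi)) as resp:
        return resp.read().decode("utf-8").split()
```

Because the InChI travels in the request body, its slashes never have to be squeezed into the URL path.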

Asynchronous jobs

A standard time limit of 30 seconds applies per web service request.

Some tasks may take longer than 30 seconds, e.g. chemical structure searches, including:
• identity search
• substructure search
• superstructure search
• similarity search
• molecular formula search

→ These searches have used an asynchronous approach based on list keys.

“Synchronous” alternatives are now available.

Asynchronous jobs

The result of any operation that yields a list of SIDs/CIDs/AIDs can be stored in a list key on the server side.

A list key can be retrieved by subsequent requests.

Helpful when chaining requests.

A list key expires after 8 hours of inactivity.
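
Chaining through a list key might look like the Python sketch below. Two things here are assumptions recalled from the PUG-REST documentation rather than shown on the slides: that a pending job replies in JSON with a `Waiting` object containing a `ListKey`, and that the key is then used as input in the form `compound/listkey/<key>`. The function names are hypothetical.

```python
from urllib.parse import quote

PROLOG = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"  # assumed https prolog


def extract_list_key(reply):
    """Return the list key from a parsed JSON reply, or None if there is none.

    Assumes a reply shaped like {"Waiting": {"ListKey": "..."}}.
    """
    key = reply.get("Waiting", {}).get("ListKey")
    return None if key is None else str(key)


def listkey_url(list_key, operation="cids", output="JSON"):
    """Build the follow-up request that uses a stored list key as <INPUT>."""
    key = quote(str(list_key), safe="")
    return f"{PROLOG}/compound/listkey/{key}/{operation}/{output}"
```

The follow-up request can be repeated until the results are ready, and the same key can feed a different operation without resending the identifier list.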


Request volume limitations

PUG-REST is NOT designed for very large volumes of requests (e.g., millions of requests).

No script or application should make more than five requests per second, to avoid overloading the PubChem servers.

If you have a large data set to process, please contact us for help on optimizing your task.
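
One simple way for a script to stay within five requests per second is to enforce a minimum gap between consecutive requests. The `Throttle` class below is a hypothetical client-side helper written for this sketch, not part of any PubChem library; the clock and sleep functions are injectable so the behaviour can be checked without waiting.

```python
import time


class Throttle:
    """Space calls at least 1/max_per_second seconds apart."""

    def __init__(self, max_per_second=5, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

A script would call `throttle.wait()` immediately before each PUG-REST request. Even spacing is stricter than the stated limit requires, which errs on the safe side.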

Summary

• Entrez Utilities: for accessing textual/numeric data.
• Power User Gateway: pure XML-based interface; uses a complex PubChem-specific XML schema.
• PUG-SOAP: uses the Simple Object Access Protocol; good for scripting/programming languages with a SOAP interface.
• PUG-REST: Representational State Transfer-style interface; the simplest and easiest to use and learn.


Thank you!

Questions or Comments?