8/8/2019 John M. Barnar IRFS 040610
1/27
www.digitalchemistry.co.uk
Searching the Atoms and Bonds in Chemical Patents
Presented at IRF Symposium
Vienna, Austria
4 June 2010
Dr John M. Barnard
Scientific Director
Digital Chemistry Ltd., UK
-
Outline
- Chemical structures in patents
- Principles of searching for chemical structures
- History of chemical structure searching in patents
- Current developments
  - specific structures vs. Markush structures
  - automatic analysis vs. manual curation
  - online systems vs. in-house systems
- Retrieval performance evaluation
-
Chemical structures in patents
The most important information in a chemical patent is often the chemical structure disclosed or claimed.
- specifies atoms and bonds present and the way they are connected
- integrated mixture of text and structure diagrams
-
Markush structures
[Structure diagram: generic structure with variable groups R1, R2 and R3]
R1 = phenyl / cyclohexyl / ...
R2 = H / methyl / ...
R3 = H / Cl / NO2 / ...
Dr Eugene Markush (1887-1968)
Patents may include both Markush structure claim and exemplified specific structures.
- Classes of molecules with common structural features
- may cover millions (or infinite numbers) of specific structures
- allow protection of related molecules with common properties
- named after inventor involved in US legal case in 1924
-
Chemical structures in patents
[Figure labels: Markush structure; specific structure; name]
-
Markush structures
Specific structures can be generated by combinatorial assembly of alternatives for each R-group.
[Figure labels: variable-position attachment; variable multiplicity; generic groups; specific groups; non-structural description]
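The combinatorial assembly described on this slide can be sketched in a few lines of Python. This is only an illustration: the R-group table is hypothetical, loosely based on the example definitions shown earlier in the talk, and the names `R_GROUPS`, `enumerate_markush` and `count_specifics` are invented for the sketch. Open-ended alternatives ("...") and generic groups are left out, since they cannot be enumerated this way.

```python
from itertools import product

# Hypothetical R-group table (illustrative only). Open-ended "..." alternatives
# are deliberately omitted because they cannot be enumerated.
R_GROUPS = {
    "R1": ["phenyl", "cyclohexyl"],
    "R2": ["H", "methyl"],
    "R3": ["H", "Cl", "NO2"],
}


def enumerate_markush(r_groups):
    """Yield one dict per specific structure: one choice for every R-group."""
    names = sorted(r_groups)
    for combo in product(*(r_groups[name] for name in names)):
        yield dict(zip(names, combo))


def count_specifics(r_groups):
    """Count the specific structures without generating them."""
    total = 1
    for alternatives in r_groups.values():
        total *= len(alternatives)
    return total


specifics = list(enumerate_markush(R_GROUPS))
print(len(specifics))  # 2 * 2 * 3 = 12
print(specifics[0])
```

The count grows as the product of the number of alternatives per R-group, which is why realistic Markush structures can cover millions of specific structures.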
-
Substructure search
Search a database of chemical structures for all those containing a specified pattern of atoms and bonds (substructure).
[Structure diagrams: a query substructure and three database molecules, two retrieved and one not retrieved]
-
Substructure search
Originally applied to databases of specific structures (single, fully-defined molecules)
Exact and deterministic search algorithms based on topological graph theory
- 100% recall
- 100% precision
Search retrieves all database molecules that contain the query substructure and none of those that don't.
Substructure search also possible for Markush structures, but more complicated.
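As a rough illustration of the graph-theoretic idea, the sketch below treats molecules as graphs with element-labelled nodes and bond-order-labelled edges, and uses simple backtracking to look for the query inside each database molecule. It is a minimal sketch under stated assumptions, not how any production system works: hydrogens are suppressed, aromaticity and stereochemistry are ignored, there is no screening step, and the helper names (`make_mol`, `has_substructure`) and the tiny database are hypothetical.

```python
def make_mol(atoms, bonds):
    """atoms: list of element symbols; bonds: list of (i, j, order) tuples."""
    adj = {i: {} for i in range(len(atoms))}
    for i, j, order in bonds:
        adj[i][j] = order
        adj[j][i] = order
    return {"atoms": atoms, "adj": adj}


def has_substructure(mol, query):
    """True if all query atoms and bonds map onto distinct atoms of mol."""
    q_atoms, q_adj = query["atoms"], query["adj"]
    m_atoms, m_adj = mol["atoms"], mol["adj"]

    def extend(mapping):
        q = len(mapping)  # query atoms are placed in index order
        if q == len(q_atoms):
            return True
        for m in range(len(m_atoms)):
            if m in mapping.values() or m_atoms[m] != q_atoms[q]:
                continue
            # Each bond from q to an already-mapped query atom must exist
            # in the molecule with the same bond order.
            if all(m_adj[m].get(mapping[n]) == order
                   for n, order in q_adj[q].items() if n in mapping):
                mapping[q] = m
                if extend(mapping):
                    return True
                del mapping[q]  # backtrack
        return False

    return extend({})


# Query: amide group N-C(=O)
query = make_mol(["N", "C", "O"], [(0, 1, 1), (1, 2, 2)])

# Hypothetical three-molecule database (hydrogen-suppressed graphs)
database = {
    "acetamide": make_mol(["C", "C", "O", "N"], [(0, 1, 1), (1, 2, 2), (1, 3, 1)]),
    "N-methylacetamide": make_mol(
        ["C", "C", "O", "N", "C"], [(0, 1, 1), (1, 2, 2), (1, 3, 1), (3, 4, 1)]
    ),
    "ethylamine": make_mol(["C", "C", "N"], [(0, 1, 1), (1, 2, 1)]),
}

hits = [name for name, mol in database.items() if has_substructure(mol, query)]
print(hits)
```

Because the match is decided purely by the presence or absence of the pattern, the answer for each molecule is unambiguous, which is the sense in which such searches give 100% recall and 100% precision.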
-
Patent searching before 1980
Chemical Fragmentation Codes
- substructure fragments used as index terms
- manually assigned by expert coders
- applied both to specific and Markush structures
- search uses Boolean logic for required combinations
Fragment codes were originally designed for punched cards.
Connectivity / alternativeness relationships between fragments usually lost
- Poor precision
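A small sketch can show why losing connectivity hurts precision. The fragment codes, patents and the `boolean_search` helper below are all hypothetical; real fragmentation code systems use their own code vocabularies, which are not reproduced here.

```python
# Hypothetical index: each patent is reduced to a set of fragment codes.
INDEX = {
    "patent A": {"PYRIDINE_RING", "CARBOXAMIDE", "METHYL"},  # amide attached to the ring
    "patent B": {"PYRIDINE_RING", "CARBOXAMIDE", "PHENYL"},  # amide on a remote side chain
    "patent C": {"PYRIDINE_RING", "METHYL"},
}


def boolean_search(index, required, excluded=frozenset()):
    """AND over the required codes, NOT over the excluded codes."""
    return sorted(
        doc for doc, codes in index.items()
        if required <= codes and not (excluded & codes)
    )


# The searcher wants a carboxamide attached directly to a pyridine ring,
# but the codes record only that both fragments occur somewhere.
hits = boolean_search(INDEX, {"PYRIDINE_RING", "CARBOXAMIDE"})
print(hits)

relevant = {"patent A"}
precision = len(relevant & set(hits)) / len(hits)
print(precision)
```

Patent B is retrieved even though its fragments are not connected in the required way, so half of the hits in this toy example are false positives.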
-
Patent searching in the 1980s
"Topological" / graphical systems introduced
- with display of structure diagrams
Specific Structures
- Initial work with non-patent databases
  - journal literature
  - "in-house" structures
- Commercial systems operational by start of decade
  - "public" databases: CAS Online, Système DARC
  - "in-house" data: MDL MACCS etc.
Markush Structures
- Sheffield University academic research on patent Markush storage and retrieval
- Commercial systems and databases launched at end of decade
  - Markush DARC (Derwent / Questel / INPI)
  - MARPAT (Chemical Abstracts)
-
Patent searching since 1990
Commercial search systems
- Little change
- still only available online with proprietary databases
- showing their age with clunky interfaces
- fragment code systems still widely used
Databases
- New databases of specific structures from patents
  - Reaxys (formerly MDL/Elsevier Chemical Patent Database)
  - SURECHEM
- Machine-readable patent documents available direct from patent offices
- Some automation in database creation
-
Related developments
Markush applications outside patent field
- informatics for "combinatorial libraries"
- specific structure enumeration
- physicochemical property calculation
- Markush searching
- Data mining
Chemical data extraction from free text and diagrams
- structure diagram "OCR"
- chemical nomenclature translation
Research work on capture of Markush structures from free-text patents
New "in-house" systems for patent Markush search under development
-
Which way forward?
Markush structures cover the scope of the patent more comprehensively (better recall), but are more complicated to search, and can lead to poor retrieval precision.
Which structures to index and search?
- Exemplified / enumerated specific structures, or
- Markush structure
How to build the databases?
- Manual input and curation, or
- Automatic analysis of full text patent
At least at present, searchers regard curated databases as the "gold standard" for retrieval performance.
-
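Different approaches
[Diagram: systems arranged by structure type (specific structures vs. Markush structures) and by creation method (manually curated vs. automatically extracted)]
Legend: databases; data-mining software
Systems shown: MMS, MARPAT, SureChem, CA Registry, Derwent Chemistry Resource, Reaxys, IBM, CLiDE, chemoCR, DecrIPt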
-
Using specific structures
Conventional approach
- Extract specific structures from patent
  - manual curation: CA Registry, Derwent Chemistry Resource
  - automatic extraction: SureChem, IBM
  - combination of both: Reaxys
- Search using standard substructure search software
Issues
- Selection of compounds
  - exemplified
  - "prophetic"
  - anything with a name
- Effectiveness of automatic nomenclature identification and translation
- Correctness of systematic names in patent document
-
Using specific structures
Other "text analytics" approaches
Accelrys/Notiora Work
- Automatic chemical name to structure conversion
- Structure "fingerprints" for each molecule based on substructure fragments
- Logical "OR" of fingerprints for whole patent
- Structural similarity search based on logical "OR" fingerprints and maximum common substructure
IBM Work
- Automatic chemical name to structure conversion
- Vector representation derived from IUPAC International Chemical Identifier (InChI)
- Structural similarity search based on comparison of vector representations
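The idea of a logical "OR" fingerprint for a whole patent can be sketched as follows. This is an assumption-laden illustration, not the Accelrys/Notiora or IBM method: the fragment dictionary is hypothetical, the Tanimoto coefficient is swapped in as a commonly used similarity measure (the slide does not name one), and the maximum common substructure step is omitted.

```python
# Hypothetical fragment dictionary: one bit position per substructure fragment.
FRAGMENT_BITS = {"pyridine": 0, "amide": 1, "phenyl": 2, "nitro": 3, "chloro": 4}


def fingerprint(fragments):
    """Bit-string fingerprint (as an int) for one molecule's fragment set."""
    fp = 0
    for frag in fragments:
        fp |= 1 << FRAGMENT_BITS[frag]
    return fp


def patent_fingerprint(molecules):
    """Logical OR of the fingerprints of all molecules found in one patent."""
    fp = 0
    for fragments in molecules:
        fp |= fingerprint(fragments)
    return fp


def tanimoto(a, b):
    """Bits in common divided by bits set in either fingerprint."""
    union = bin(a | b).count("1")
    return bin(a & b).count("1") / union if union else 0.0


# Hypothetical patents, each a list of per-molecule fragment sets.
patents = {
    "patent X": [{"pyridine", "amide"}, {"pyridine", "chloro"}],
    "patent Y": [{"phenyl", "nitro"}],
}
query_fp = fingerprint({"pyridine", "amide", "phenyl"})

ranked = sorted(
    ((tanimoto(query_fp, patent_fingerprint(mols)), name)
     for name, mols in patents.items()),
    reverse=True,
)
for score, name in ranked:
    print(name, score)
```

Unlike substructure search, this produces a ranked list rather than a yes/no answer, so where to cut off the list becomes a precision/recall decision.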
-
Using Markush structures
Existing systems
Two online systems/databases available since late 1980s
- Merged Markush Service (ThomsonReuters / Markush DARC)
- MARPAT (Chemical Abstracts Service/STN)
Problems
- Excessively broad Markushes defy existing systems, and give poor recall / precision
- Searchers often faced with manually sifting 1000+ hits to find 5 or 6 relevant patents
"R1 is a substituted or unsubstituted, mono-, di- or polycyclic, aromatic or non-aromatic, carbocyclic or heterocyclic ring system, or ..."
-
Searchers' comments
Discussion in "breakout group" at International Patent Information Conference (IPI-Confex), Venice, Mar 2009
- Multiple search tools are needed for comprehensive retrieval
- Search strategies need to focus on the core structure of interest and put up with poor precision
- Current systems based on automatic extraction and analysis of nomenclature have limited usefulness
- Suggestions for improvement:
  - ranking of search output
  - more comprehensive indexing of specific structures
-
In-house Markush systems?
Advantages over existing online systems
Informatics support for drug discovery
- Integration of patent data with other chemical databases
- end-user chemist access to patent data
- Adding patentability criteria to drug design
- Adjunct or preliminary to existing systems
- confidentiality advantages
Possible use of structural similarity and cluster analysis techniques
Data mining
- Structure activity analysis
- Physico-chemical property calculation
- Competitive intelligence
- Identification of unpatented "gaps" in chemical space
-
In-house Markush systems?
Prospects
Software
- New Markush search systems under development
  - Digital Chemistry Ltd.
  - ChemAxon
- Also work on selective enumeration of specific structures from Markush
  - DecrIPt Inc.
Databases
- Existing curated databases
  - ThomsonReuters have expressed interest in making MMS data available
  - MARPAT database another obvious possibility
- "Home-grown" databases for specialist purposes
  - input software needed
- Automatic extraction from patent documents
-
Automatic Markush extraction
Currently a "hot area" for research, after a fallow period
- complex combined issues of text and image processing, nomenclature translation and semantic analysis
Sheffield University
- 3 publications (1992-97), initially analysing Derwent patent abstracts.
CLiDE Pro (KeyModule Ltd.)
- Work by A.P. Johnson (2009) extending earlier chemical OCR software.
Cambridge University, Unilever Centre for Molecular Informatics
- Ongoing work by Murray-Rust group on analysis of full-text patents, extending OPSIN nomenclature translation program.
chemoCR (Fraunhofer SCAI)
- Recent work on prototype software for Markush "reconstruction" from patent text, with limited success.
ChemProspector (InfoChem)
- Ongoing research into extraction of Markush structures from patents.
Commercially-viable operational systems probably still some way off.
-
Precision and recall
Patent databases with specific structures
Substructure search finds all molecules in database that contain query substructure
- 100% precision
- 100% recall
For retrieval of relevant patents, however, the database contents limit performance:
- Poor precision: database contains irrelevant or trivial molecules from patent text
- Poor recall: database omits molecules covered by Markush structure
- Database contains incorrect molecules (errors in nomenclature identification / translation)
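For reference, the two measures can be computed from a set of retrieved patents and a set of relevant patents as sketched below. The patent identifiers and counts are hypothetical, chosen only to show a search with many hits and few relevant documents.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical search: 1000 patents retrieved, 6 relevant patents exist,
# and 5 of those 6 are among the hits.
retrieved = {f"P{i}" for i in range(1000)}
relevant = {"P1", "P2", "P3", "P4", "P5", "Q1"}

precision, recall = precision_recall(retrieved, relevant)
print(f"precision = {precision:.3f}, recall = {recall:.3f}")
```

Precision falls when irrelevant documents are retrieved; recall falls when relevant documents are missing from the hit list.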
-
Precision and recall
Patent databases with Markush structures
Poor precision
- Query substructure matches highly generic description in unimportant part of Markush
Poor recall
- Using system search options to avoid matches with highly generic descriptions
  - "broad/narrow translation" (DARC)
  - "match level" (MARPAT)
[Structure diagram: query substructure] matches "R84 is a substituted or unsubstituted, mono-, di- or polycyclic, aromatic or non-aromatic, carbocyclic or heterocyclic ring system, or ..."
-
Patent search evaluation
Chemical substructure search systems usually give 100% precision and 100% recall
- retrieval performance evaluation not important
Not really true for chemical patent searches
- much more room for argument about whether or not a hit is relevant
Designers of chemical patent search systems may need to pay more attention to performance evaluation
- Precision / Recall trade-off
- Evaluation of hit relevance in context of type of query
- Consideration of the relative importance of different parts of Markush (what the patent "teaches")
-
TREC-CHEM
- Multi-year evaluation project under auspices of long-running Text Retrieval Conferences (TREC)
- Uses chemical patent data and queries with relevance judgements
- Used to compare retrieval experiments performed by different groups
- Results from first year (2009) presented elsewhere
  - most search approaches based on automatic analysis of patent text, nomenclature extraction etc.
  - some issues identified concerning automated relevance judgements based on cited prior art documents
-
TREC-CHEM
TREC-CHEM is not using data from curated databases
- these are the current "industry standard" against which new approaches will ultimately be judged
- TREC rules do not allow commercial databases to be included
It would be valuable to apply TREC-type evaluation to search systems for patent chemistry that are based on commercial databases
- MARPAT vs. MMS/Markush DARC
- existing systems and databases vs. new ones using new techniques, automated data extraction etc.
There are many potential benefits to practising patent searchers in involving cheminformaticians in the IR-IP debate for which the IRF provides a forum.
-
Contact details
Dr John M. Barnard
Scientific Director, Digital Chemistry Ltd.
46 Uppergate Road, Sheffield S6 6BX, UK
[email protected]
+44 (0)114 233 3170