![Page 1: GRISP: A Massive Multilingual Terminological Database for Scientific and Technical Domains Patrice Lopez and Laurent Romary INRIA & HUB – IDSL patrice_lopez@hotmail.compatrice_lopez@hotmail.com](https://reader035.vdocument.in/reader035/viewer/2022081603/56649ddf5503460f94ad8c28/html5/thumbnails/1.jpg)
GRISP: A Massive Multilingual Terminological Database
for Scientific and Technical Domains
Patrice Lopez and Laurent Romary
INRIA & HUB – IDSL
Overview
• GRISP (Generic Research Insight in Scientific and technical Publications)
– Multiple scientific and technical fields
– Multilingual (en, fr, de)
– Built from the compilation of open resources
• Sound conceptual model
• Mapping across a variety of domains
• Use of structural constraints
• Machine learning techniques for controlling the fusion process
– Our sources: MeSH, UMLS, Specialist Lexicon, Gene Ontology, ChEBI, WordNet, WOLF, SUMO, IPC, Wikipedia
– Result: several million terms, concepts, semantic relations and definitions.
Why are we doing all this?
• Terminology is the main vehicle by which technical and scientific units of knowledge are represented and conveyed (30-80%; Ahmad, 1996)
• Application to a large collection of multilingual and multi-domain patent documents
• Two underlying considerations:
– Cost of manually maintained terminological resources
• Cf. Biosis, IATE, TermScience
– Khayari et al., 2006: Modeling the heterogeneity of resources
– A lot of available resources online, based on heterogeneous organizational principles
• Underlying vision: Integrating knowledge engineering into current state-of-the-art information retrieval and classification systems
Merging terminological resources
• Related to the fusion of ontologies
– Ontologies are usually relatively small in size
• Semi-automatic methods: McGuinness et al., 2000
• Fully automatic methods
– Madhavan et al., 2001: exploit structural and linguistic matching
– Doan et al., 2001: machine learning techniques (concepts and properties)
– Gal et al., 2005: fuzzy logic methods
• Existing work on merging classification systems
– Wang et al., 2008: merging of subject headers in digital libraries
• Automatic merging techniques for heterogeneous terminologies have not yet been investigated
– Much richer linguistic content
– No formal organization of concepts
• Do not model facts or assertions
A quick reminder
• Terminological resources
– Approximation of lexical semantics in specialized fields
– Based on a concept to term (onomasiological) model
– Naturally multilingual (term grouping according to languages)
– Existing standards
• ISO 704: editorial principles for building up a terminological resource
• ISO 16642: Abstract model for representing terminological databases
– Romary, 2001
• ISO 30042: A concrete XML syntax (TBX)
– Note: terminology standards do not standardize terminologies!
Target terminological model
• Multiple languages
• Multiple terms
– Variants, abbreviations, inflections
• Multiple descriptions
– E.g. multiple definitions, complementing each other
– Additional information: illustrations, formulae, etc.
• Basic conceptual relations
• Local metadata
– Provides management information attached to the various terminological description levels (e.g. origin, validation level, register)
– Allows the creation of views (e.g. all MeSH entries; cf. Khayari et al., 2006)
• And yes, ISO 16642 (TMF) can do all this!
– Main issue: identifying the relevant data category in the various source terminologies
Merging terminologies, merging models
[Diagram: source TMF models (TMF model 1, TMF model 2, ...) and the target model]
TMF in a nutshell
[Diagram: the nested levels of a terminological database in TMF, with the information attached at each level]
Terminological Data Collection (TDC): metadata (sources, revisions)
– Terminological Entry (several per TDC): ontological relations, definition
– Language Section (several per entry): dialectal information, definition
– Term Section (several per language section): grammatical information, register, …, definition
+ any kind of local metadata (origin, certainty, accessibility)
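The nesting of levels can be sketched as plain Python data classes. This is only an illustration: the class names, the `info` dictionaries and the `add_term` helper are assumptions made for the example, not the ISO 16642 serialization and not the GRISP implementation.

```python
from dataclasses import dataclass, field


@dataclass
class TermSection:
    term: str
    # Term-level information, e.g. grammatical information or register.
    info: dict = field(default_factory=dict)


@dataclass
class LanguageSection:
    lang: str
    terms: list = field(default_factory=list)
    # Language-level information, e.g. a definition in that language.
    info: dict = field(default_factory=dict)


@dataclass
class TerminologicalEntry:
    concept_id: str
    languages: dict = field(default_factory=dict)  # language code -> LanguageSection
    # Entry-level information, e.g. relations, definition, local metadata.
    info: dict = field(default_factory=dict)

    def add_term(self, lang, term, **info):
        """Attach a term to the language section for `lang`, creating it if needed."""
        section = self.languages.setdefault(lang, LanguageSection(lang))
        section.terms.append(TermSection(term, dict(info)))
        return section


# One concept (hypothetical identifier) with terms in three languages.
entry = TerminologicalEntry("C1", info={"origin": "MeSH"})
entry.add_term("en", "heart", register="neutral")
entry.add_term("fr", "coeur")
entry.add_term("de", "Herz")
```

Keeping one entry per concept and grouping terms by language mirrors the concept to term (onomasiological) model: the concept is the unit, and terms are attached to it.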
Merging terminologies, merging models
[Diagram: source TMF models (TMF model 1, TMF model 2, ...) and the target model, linked by a data category mapping, e.g. /definition/ in a source model to /definition/ in the target model]
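A data category mapping of this kind can be sketched as a lookup table from source-specific fields to target data categories. The field names below (`ScopeNote`, `gloss`, and so on) and the `to_target` helper are illustrative assumptions, not the mapping actually used in GRISP.

```python
# Hypothetical mapping: (source, source-specific field) -> target data category.
CATEGORY_MAP = {
    ("MeSH", "DescriptorName"): "/term/",
    ("MeSH", "ScopeNote"): "/definition/",
    ("WordNet", "lemma"): "/term/",
    ("WordNet", "gloss"): "/definition/",
}


def to_target(source, record):
    """Rename the fields of one source record to target data categories."""
    target = {}
    for field_name, value in record.items():
        category = CATEGORY_MAP.get((source, field_name))
        if category is None:
            continue  # unmapped fields are simply dropped in this sketch
        target.setdefault(category, []).append(value)
    return target


# Placeholder values; not actual content of either resource.
from_mesh = to_target("MeSH", {"DescriptorName": "Heart", "ScopeNote": "definition text A"})
from_wordnet = to_target("WordNet", {"lemma": "heart", "gloss": "definition text B", "pos": "n"})
```

Both records end up with the same keys, so definitions coming from different sources can later be stored side by side under one concept.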
Identifying domains
• Theoretical background
– Non-ambiguity of a term within a domain
– E.g. 129 domains in MeSH
• GRISP
– Set of 76 reference domains (see table 1)
• Scientific and technical domains of WordNet Domains (Magnini and Cavaglià, 2000)
• Organised as a hierarchy
– Manual mapping from resource-specific domains to our reference set
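As a rough sketch, the manual mapping and the domain hierarchy can be held in two small tables. The domain names, the source-specific labels and the `reference_domains` helper are hypothetical examples; the actual 76 reference domains and their mapping are not listed in these slides.

```python
# Hypothetical fragment of the reference domain hierarchy (child -> parent).
PARENT = {
    "biochemistry": "chemistry",
    "chemistry": "pure_science",
    "medicine": "applied_science",
}

# Hypothetical manual mapping: (source, source-specific domain) -> reference domain.
DOMAIN_MAP = {
    ("MeSH", "Chemicals and Drugs"): "chemistry",
    ("IPC", "A61"): "medicine",
}


def reference_domains(source, source_domain):
    """Return the mapped reference domain followed by its ancestors, or [] if unmapped."""
    domain = DOMAIN_MAP.get((source, source_domain))
    chain = []
    while domain is not None:
        chain.append(domain)
        domain = PARENT.get(domain)
    return chain


chain = reference_domains("IPC", "A61")
```

Returning the ancestors as well lets two concepts be compared at a coarser level when their sources disagree on the fine-grained domain.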
Merging concepts
• Identification of common concepts across terminological sources – core principles
– Baseline: same term + same domain = same concept
– Difficulties: conflicting domain mapping, high polysemy of term variants and incorrectly positioned concepts (e.g. Wikipedia)
• Wrongly merged concepts
• Loss of precision in concept description
– Revised: same preferred term + same domain = same concept
– Source conformance rule: separated concepts in a given source cannot be further merged (by transitivity)
• Not applied to WordNet, IPC and Wikipedia
– Smoothing down the rules: using machine learning techniques
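The two rules and the source conformance rule can be sketched as follows. Several points are assumptions made for the example: "same domain" is read as "at least one shared reference domain", terms are compared as already normalized strings, clustering is greedy, and the three toy concepts are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    source: str
    preferred: str
    terms: frozenset
    domains: frozenset


def rule1(a, b):
    """Baseline: at least one shared term and one shared domain."""
    return bool(a.terms & b.terms) and bool(a.domains & b.domains)


def rule2(a, b):
    """Revised: identical preferred term and one shared domain."""
    return a.preferred == b.preferred and bool(a.domains & b.domains)


# Sources to which the source conformance rule is not applied.
EXEMPT = {"WordNet", "IPC", "Wikipedia"}


def merge_all(concepts, rule):
    """Greedy clustering that never puts two concepts of one non-exempt source together."""
    clusters = []
    for c in concepts:
        for cluster in clusters:
            conflict = c.source not in EXEMPT and any(m.source == c.source for m in cluster)
            if not conflict and any(rule(c, m) for m in cluster):
                cluster.append(c)
                break
        else:
            clusters.append([c])
    return clusters


med = frozenset({"medicine"})
mesh_cold = Concept("MeSH", "common cold", frozenset({"common cold", "cold"}), med)
mesh_temp = Concept("MeSH", "cold temperature", frozenset({"cold temperature", "cold"}), med)
wiki_cold = Concept("Wikipedia", "common cold", frozenset({"common cold", "cold"}), med)

clusters = merge_all([mesh_cold, mesh_temp, wiki_cold], rule1)
```

In this toy case the polysemous variant "cold" makes the baseline rule accept the pair (`wiki_cold`, `mesh_temp`), which the revised rule rejects; the conformance check keeps the two MeSH concepts in separate clusters either way.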
Concept merging as a machine learning process
[Diagram: concepts from the concept pool → features → merging decision]
SVM (Support Vector Machine) and MLP (Multi-Layer Perceptron) binary classification models
Training process
• Training features
• (f1-2) sources (e.g. S1=“MeSH”, S2=“Wikipedia”)
• (f3) Number of common domains between the two concepts
• (f4) Number of same source-specific categorizations
• (f5) Boolean indicating if both preferred terms are identical
• (f6) Boolean indicating if both preferred terms are identical after stemming
• (f7) Ratio of identical terms given all terms
• (f8) Similarity measure of the definition texts, after stemming and based on negative KL divergence
• (f9) Number of domains of the merged concept
• (f10) Number of words of the longest common terms
• Training data
– Wikipedia – MeSH mapping
– Pascal database (INIST)
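A few of these features can be sketched for a pair of concepts as below. The details are assumptions, since the slides do not spell them out: f7 is computed as shared terms over the union of terms, f8 uses add-one smoothed unigram distributions without stemming, and the features that need a stemmer or source categorizations (f4, f6, f9) are left out. The two toy concepts are invented.

```python
import math
from collections import Counter


def neg_kl(text_a, text_b):
    """Negative KL divergence between add-one smoothed unigram distributions."""
    ca, cb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    vocab = set(ca) | set(cb)
    if not vocab:
        return 0.0
    na = sum(ca.values()) + len(vocab)
    nb = sum(cb.values()) + len(vocab)
    kl = 0.0
    for w in vocab:
        p, q = (ca[w] + 1) / na, (cb[w] + 1) / nb
        kl += p * math.log(p / q)
    return -kl


def features(a, b):
    """Feature dictionary for one candidate pair of concepts (plain dicts here)."""
    common = a["terms"] & b["terms"]
    union = a["terms"] | b["terms"]
    return {
        "f1": a["source"],
        "f2": b["source"],
        "f3": len(a["domains"] & b["domains"]),
        "f5": a["preferred"] == b["preferred"],
        "f7": len(common) / len(union) if union else 0.0,
        "f8": neg_kl(a["definition"], b["definition"]),
        "f10": max((len(t.split()) for t in common), default=0),
    }


a = {"source": "MeSH", "preferred": "myocardial infarction",
     "terms": {"myocardial infarction", "heart attack"},
     "domains": {"medicine"}, "definition": "necrosis of heart muscle"}
b = {"source": "Wikipedia", "preferred": "heart attack",
     "terms": {"heart attack", "infarct"},
     "domains": {"medicine", "biology"}, "definition": "damage to heart muscle"}
f = features(a, b)
```

The resulting dictionary is what a binary classifier would receive for each candidate pair, with the merging decision as the label.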
Result overview
Merger         Concepts    Terms       Sem. Rel.
Aggregation    1,503,818   3,140,726   970,864
Merg. Rule 1   1,457,538   3,157,179   1,022,303
Merg. Rule 2   1,476,508   3,114,711   971,218
SVM            1,450,688   3,195,118   1,088,446
MLP            1,451,710   3,192,325   1,081,955
• Observations:
– Small number of actual merges (cf. product names, chemical and medical entities)
– Merging relevant for frequently used concepts
Overall content:
• 596,865 definitions
• 1,321,988 source-specific categorizations of concepts
• 20,000 acronyms
• 14,268 chemical formulas and
• 12,375 chemical structure identifiers.
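The "small number of actual merges" can be checked directly from the concept counts in the result table. The short script below only does that arithmetic; the dictionary and variable names are ours.

```python
# Concept counts copied from the result table.
CONCEPTS = {
    "Aggregation": 1_503_818,
    "Merg. Rule 1": 1_457_538,
    "Merg. Rule 2": 1_476_508,
    "SVM": 1_450_688,
    "MLP": 1_451_710,
}

baseline = CONCEPTS["Aggregation"]

# How many fewer concepts each merger leaves compared with plain aggregation.
reduction = {
    name: baseline - count
    for name, count in CONCEPTS.items()
    if name != "Aggregation"
}

for name, removed in reduction.items():
    print(f"{name}: {removed:,} fewer concepts ({removed / baseline:.2%} of the aggregated pool)")
```

Every merger reduces the aggregated pool by less than 4%, and the two machine learning models remove the most concepts.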
Evaluation
Merger           Wiki/MeSH                  PASCAL
Merging Rule 1   cov. 0.6464, acc. 0.9497   cov. 0.5358, acc. 0.9371
Merging Rule 2   cov. 0.3607, acc. 0.9949   cov. 0.2735, acc. 0.9916
SVM              cov. 0.8642, acc. 0.9698   cov. 0.6203, acc. 0.9522
MLP              cov. 0.8607, acc. 0.9748   cov. 0.6178, acc. 0.9515
• Random subset of 10% of the merging examples extracted from Wikipedia/MeSH mappings and from the PASCAL terminology
• Merging Rule 2 produces almost perfect merging but with a very low coverage
• Rule 1 extends the coverage at the price of a relatively high rate of merging error
• Machine Learning approaches further extend the coverage while maintaining a high precision
GRISP browser: radial engine
[Figure: browser rendering]
Application: Patatras
• PATATRAS (PATent and Article Tracking, Retrieval and AnalysiS)
• Context: CLEF-IP competition
– Prior art search task (EPO documents)
– 1.9 million documents in English, French and German (more than 3 billion words)
– Ranked first for all subtasks of the evaluation track among 14 participants (Roda et al., 2009)
• Conceptual indexing of the CLEF-IP corpus
– Development of a term annotator based on GRISP
• Term variant matching after POS + lemmatization
• Concept disambiguation based on IPC classes of the documents
• 1.1 million different terms identified
• 176 million annotations
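The annotation step can be sketched as a dictionary lookup over lemmas followed by an IPC-based choice among candidate concepts. Everything specific here is an assumption for illustration: the tiny lexicon, the concept identifiers, the greedy longest-match strategy and the overlap-count disambiguation. Input is assumed to be already POS-tagged and lemmatized.

```python
# Hypothetical lexicon: lemmatized term -> candidate concepts with their IPC classes.
LEXICON = {
    ("cell",): [
        {"id": "bio:cell", "ipc": {"C12"}},
        {"id": "elec:cell", "ipc": {"H01"}},
    ],
    ("fuel", "cell"): [{"id": "elec:fuel_cell", "ipc": {"H01"}}],
}
MAX_LEN = max(len(key) for key in LEXICON)


def annotate(lemmas, doc_ipc):
    """Greedy longest-match annotation; returns (start, end, concept id) triples."""
    annotations, i = [], 0
    while i < len(lemmas):
        for n in range(min(MAX_LEN, len(lemmas) - i), 0, -1):
            candidates = LEXICON.get(tuple(lemmas[i:i + n]))
            if candidates:
                # Keep the candidate sharing the most IPC classes with the document.
                best = max(candidates, key=lambda c: len(c["ipc"] & doc_ipc))
                annotations.append((i, i + n, best["id"]))
                i += n
                break
        else:
            i += 1  # no term starts at this position
    return annotations


electrical = annotate(["the", "fuel", "cell", "stack"], {"H01"})
biological = annotate(["cell", "culture"], {"C12"})
```

The same lemma "cell" resolves to a different concept depending on the IPC classes of the document, which is the disambiguation idea listed above.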
Results: Patatras
• Significant accuracy improvements for CLEF-IP
– Combination of word-based and concept-based ranked results with a regression model
Based on 10,000 queries
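The slides do not say which regression model was used. As one possible reading, the sketch below fits a two-weight linear least-squares model on (word score, concept score, relevance) triples and uses it to merge two ranked lists. The function names and the toy numbers are assumptions.

```python
def fit_weights(samples):
    """Least-squares weights (w_word, w_concept) for relevance ~ w1*word + w2*concept."""
    sww = sum(w * w for w, c, y in samples)
    scc = sum(c * c for w, c, y in samples)
    swc = sum(w * c for w, c, y in samples)
    swy = sum(w * y for w, c, y in samples)
    scy = sum(c * y for w, c, y in samples)
    det = sww * scc - swc * swc
    if det == 0:
        raise ValueError("scores are collinear; cannot fit two weights")
    # Solve the 2x2 normal equations by Cramer's rule.
    return (swy * scc - scy * swc) / det, (scy * sww - swy * swc) / det


def combine(word_scores, concept_scores, weights):
    """Rank documents by the weighted sum of their two retrieval scores."""
    w1, w2 = weights
    docs = set(word_scores) | set(concept_scores)
    scored = {
        d: w1 * word_scores.get(d, 0.0) + w2 * concept_scores.get(d, 0.0)
        for d in docs
    }
    return sorted(scored, key=scored.get, reverse=True)


# Toy training triples: (word-based score, concept-based score, relevance).
weights = fit_weights([(1.0, 0.0, 0.5), (0.0, 1.0, 0.25), (1.0, 1.0, 0.75)])
ranking = combine({"d1": 0.9, "d2": 0.2}, {"d2": 0.9, "d3": 0.8}, weights)
```

A document missing from one of the two result lists simply contributes a zero score for that list, so both retrieval runs can propose candidates.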
Epilogue
• Online tool
– Contact: [email protected]
• Free resource
– Based on the freely available subset of resources
• Constant evolution
– Maintenance according to evolution of our sources
– Addition of further sources