Semantic Representation of Events in the Pharmaceutical Industry
Martin Romacker, Samuel Läubli & Marc Bux
NIBR-IT / Text Mining Services
24-Feb-2011 CSHALS


Introduction

Generation of a Comprehensive Terminology for Companies (data capture, unique reference across NIBR).

Important Events related to Companies, Drugs, Indications and Geographical Locations.

Data from Content Providers as XML feeds (Prous Integrity, TPP, Adis R&D, TR PDI)

Lack of semantics at all levels:
• No definitions for concepts
• Unspecific relations like parentCompany or relatedCompany
• Important events locked in Natural Language statements

Usage of Semantic Web Approach

2 | Events in Pharmaceutical Industry | Martin Romacker, Samuel Läubli & Marc Bux | CSHALS Feb 2011, Boston, Ma

Generation of Terminology for Companies

We try to automatize the production of our terminologies and corresponding pointers as much as possible
• Thorough analysis of the input sources and relations between them

Company terminology is also intellectually curated

[Figure: curation versus automation; axes: time spent on task, probability of errors]

Find optimum with high automation and few errors

Canonical Representation for Terminology

Preferred Term / Concept Unique Identifier
mandatory label to be used to name an object (Controlled Vocabulary); the unique identifier represents the concept.

Synonyms
a set of semantically equivalent Synonyms

Pointer / Cross-Reference
a referential link between a Preferred Term and a data repository using the appropriate value for access
• Example:
Novartis AG (Preferred Term), Prous Integrity (data repository), 16964 (value for access)

Creation of a referential MetaData Layer which is shared by all scientific NIBR applications
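The canonical representation above can be pictured as a small record type. This is only a sketch: the class name `Concept`, the field names, the identifier `C0001` and the synonym `Novartis` are assumptions made for illustration. Only the Novartis AG / Prous Integrity / 16964 pointer comes from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """One terminology entry: Preferred Term, identifier, synonyms, pointers."""
    concept_id: str        # unique identifier (format is hypothetical)
    preferred_term: str    # mandatory label from the controlled vocabulary
    synonyms: set = field(default_factory=set)
    pointers: dict = field(default_factory=dict)  # data repository -> value for access

    def matches(self, name: str) -> bool:
        """True if name is the Preferred Term or one of its synonyms."""
        return name == self.preferred_term or name in self.synonyms


novartis = Concept(
    concept_id="C0001",                      # made-up identifier
    preferred_term="Novartis AG",
    synonyms={"Novartis"},                   # illustrative synonym
    pointers={"Prous Integrity": "16964"},   # example from the slide
)
```

An application holding such a record can resolve any synonym to the same concept and follow the pointer into the source repository.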

Design Strategy

Multiple Usage / Reusability
• Example: Company terminology used for
  - Text Mining
  - FAST Enterprise Search
  - Suggest / Autocompletion
  - Ultralink Application

Semantic Interoperability

Compatibility with public domain knowledge repositories

Focus on coverage of terms relevant to Novartis (Top 50 competitors, strategic alliances etc.)


Company Terminology – Content (Example)

Company Terminology – Input Sources

A set of commercial feeds from Competitive Intelligence and scientific content providers:
• Prous Integrity (now known as Thomson Reuters Integrity)
• TPP – Thomson Pharma Partnering (formerly known as IdDB)
• Adis Insight R&D
• TR PDI – Thomson Reuters Pipeline Data Integration

Raw input feeds are preprocessed:
• extract any valuable information and filter noise
• convert to well-defined, distinct format

The Algorithm is implemented in Pipeline Pilot

Company Terminology – Input Sources, Workflow

Company Terminology – Merger

Data from the input feeds is merged in 4 steps

Step                        Example
(input)                     Sandoz → Novartis AG
                            Sandoz Technology Ltd → Sandoz AG
1. Normalization            Sandoz → Novartis
                            Sandoz Technology → Sandoz
2. Transitivity Resolution  Sandoz Technology → Novartis
3. Denormalization          Sandoz Technology Ltd → Novartis AG
4. Synonym Expansion        Sandoz Technology Limited → Novartis AG
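The four merge steps can be sketched in a few lines of Python. This is not the Pipeline Pilot implementation from the talk. The suffix table `SUFFIXES`, the function names and the rule that the first full name seen is kept for denormalization are all assumptions made for illustration.

```python
# Hypothetical legal-form suffixes and their written variants.
SUFFIXES = {"AG": ["AG"], "Ltd": ["Ltd", "Limited"]}


def normalize(name):
    """Step 1: strip a trailing legal-form suffix."""
    parts = name.split()
    if len(parts) > 1 and parts[-1] in SUFFIXES:
        parts = parts[:-1]
    return " ".join(parts)


def expand(name):
    """Step 4: generate suffix variants of a full company name."""
    parts = name.split()
    if len(parts) > 1 and parts[-1] in SUFFIXES:
        stem = " ".join(parts[:-1])
        return [f"{stem} {variant}" for variant in SUFFIXES[parts[-1]]]
    return [name]


def merge(facts):
    """facts: (company, parent) pairs; returns synonym -> top-level parent."""
    original = {}   # normalized name -> first full name seen (for step 3)
    parent = {}     # normalized company -> normalized parent (step 1)
    for child, par in facts:
        for name in (child, par):
            original.setdefault(normalize(name), name)
        parent[normalize(child)] = normalize(par)

    def root(name):
        """Step 2: follow parent links transitively; stop if a cycle is hit."""
        seen = set()
        while name in parent and name not in seen:
            seen.add(name)
            name = parent[name]
        return name

    result = {}
    for child in parent:
        top = original[root(child)]                # step 3: denormalization
        for synonym in expand(original[child]):    # step 4: synonym expansion
            result[synonym] = top
    return result


merged = merge([("Sandoz", "Novartis AG"), ("Sandoz Technology Ltd", "Sandoz AG")])
```

With the two input facts from the table, `merged` maps both `Sandoz Technology Ltd` and `Sandoz Technology Limited` to `Novartis AG`.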

Company Terminology – Merger, Workflow

Company Terminology – Filter / Clean

An English natural language dictionary is used to remove false positives from company synonyms
• The natural language dictionary contains lemmatized versions of common American and British English words
• It contains neither abbreviations nor names

Example of filtered false positives due to normalization:
• University → University of Calgary
• Phase → pHase Pharmaceuticals LLC

After the filtering, synonyms and pointers are dumped to text files and ready to be uploaded into the Metastore
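A minimal sketch of the dictionary filter follows. It assumes a tiny word set, `COMMON_WORDS`, standing in for the real English dictionary, and uses a plain lower-case comparison instead of lemmatization. The function name and the extra `Novartis` entry are illustrative.

```python
# Hypothetical stand-in for the lemmatized English dictionary.
COMMON_WORDS = {"university", "phase"}


def filter_synonyms(synonyms, dictionary=COMMON_WORDS):
    """Drop synonyms that are ordinary English words (likely false positives)."""
    kept = {}
    for synonym, company in synonyms.items():
        if synonym.lower() in dictionary:
            continue   # e.g. 'University' would match far too much text
        kept[synonym] = company
    return kept


candidates = {
    "University": "University of Calgary",
    "Phase": "pHase Pharmaceuticals LLC",
    "Novartis": "Novartis AG",
}
clean = filter_synonyms(candidates)
```

Only `Novartis` survives here, because the other two normalized synonyms collide with common words.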

Company Terminology – Filter / Clean, Workflow

Company Terminology – Challenges

Different state and quality of input sources lead to
• Contradictions
  - Merck → Merck & Co Inc.
  - Merck → Merck KGaA
• Misleading facts
  - Roche Consumer Health AG → Bayer AG (Acquisition)
  - Roche → Roche Consumer Health AG (Abbreviation)
• Cycles
• Integration of outdated facts
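Two of these problems, contradictions and cycles, can be detected mechanically once the feed facts are available as (company, target) pairs. The sketch below is one possible check, not the method used in the talk. The function names and the `A`/`B` cycle data are made up for illustration.

```python
from collections import defaultdict


def find_contradictions(facts):
    """Return companies that are mapped to more than one target."""
    targets = defaultdict(set)
    for child, parent in facts:
        targets[child].add(parent)
    return {child: found for child, found in targets.items() if len(found) > 1}


def find_cycles(facts):
    """Return companies that can reach themselves by following the facts."""
    graph = defaultdict(set)
    for child, parent in facts:
        graph[child].add(parent)
    cyclic = set()
    for start in list(graph):
        stack, seen = list(graph[start]), set()
        while stack:
            node = stack.pop()
            if node == start:
                cyclic.add(start)
                break
            if node not in seen:
                seen.add(node)
                stack.extend(graph.get(node, ()))
    return cyclic


contradictions = find_contradictions(
    [("Merck", "Merck & Co Inc."), ("Merck", "Merck KGaA")]
)
cycles = find_cycles([("A", "B"), ("B", "A"), ("C", "A")])  # made-up names
```

Facts flagged this way are candidates for manual review rather than automatic repair.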

Company Terminology – Modfile

Various options to alter the data flow:
• Add intellectually maintained facts about companies
• Prevent normalization or remove facts from input in order to break up cycles, resolve contradictions and enforce synonym relations
• Manually remove noise from output
• Redesignate Preferred Terms of companies
• Prevent natural language filtering of selected synonyms
• Specify suffixes and special characters for term normalization and expansion

Manual assertions always override facts in the input feed
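The override rule can be illustrated with a dictionary merge. The Modfile format itself is not shown in the slides, so the function `apply_modfile`, its parameters and the curator decisions in the example are purely hypothetical.

```python
def apply_modfile(feed_facts, manual_facts, removed=()):
    """Combine feed facts with manual assertions; manual entries win on conflict."""
    merged = {child: parent for child, parent in feed_facts.items()
              if child not in removed}   # facts removed by the curator
    merged.update(manual_facts)          # manual assertions override the feed
    return merged


feed = {"Merck": "Merck KGaA", "Roche": "Roche Consumer Health AG"}
manual = {"Merck": "Merck & Co Inc."}   # illustrative curator decision
result = apply_modfile(feed, manual, removed={"Roche"})
```

Applying manual facts last is what makes them take precedence over anything the feeds assert.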

Company Terminology – Modfile, Workflow

Company Terminology – Diff-Report

Company Terminology – Reporting, Workflow

Company Terminology – Workflow

Application of Company Terminology on Adis R&D

Adis R&D Insight is a drug pipeline database that tracks and evaluates drugs worldwide through the entire development process, from discovery, through pre-clinical and clinical studies to launch.

Updates on drug development processes are distributed as XML feed basically consisting of
• Short news lines (text strings; «shouts»)
• Structuring information
• Basic meta-information

Semantic Representation of Events – Why?

Adis R&D feeds contain a huge number of important statements on events in very short, stereotypical sentences.

Bevacizumab has been licensed to Chugai in Japan

This knowledge is locked in natural language ...

Introduction | Semantics – Why?

Chugai has acquired licensing rights to Bevacizumab in all countries except USA
Bevacizumab has been licensed to Chugai in Japan

Computers cannot «automatically» map «has been licensed» and «acquired licensing rights to» to a common semantic concept

Resulting problem: how to formulate queries on pure natural language text?
• e.g. «list all countries where Chugai holds a license for Bevacizumab»

Semantic Model for Events

OWL Ontology including semantic roles

Semantic Model for Licensing Event

Frame: LicensingEvent

Bevacizumab has been licensed to Chugai in Japan

Text         Type         Role
Bevacizumab  PRODUCTS     LicensingSubject
Chugai       COMPANIES    Licensee
Japan        TERRITORIES  ValidTerritory

Semantic annotation and normalization of news entries makes information explicit and thus machine-readable

Queries can now be formulated on frames, types and roles rather than textual surface

Data Processing | Overall Pipeline

Pre-Processing / Annotation
1. Tokenization
2. Part-of-Speech-Tagging
3. Chunking
4. Named Entity Recognition
   Using Novartis Metastore (SOAP-WebService)

Evaluation
5. Rule Evaluation

Output
6. Result Representation (Triples)

Parallel: Ontology Development & Refinement

Data Processing | Pre-Processing / Annotation

Tokenization → PoS-Tagging → Chunking → Named Entity Recognition

Bevacizumab has been licensed to Chugai in Japan.
'Bevacizumab','has','been','licensed','to','Chugai','in','Japan','.'
NNP           VBZ   VBN    VBN        TO   NNP      IN   NNP
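The first and last annotation steps can be approximated with the standard library alone. In this sketch a small dictionary, `GAZETTEER`, stands in for the Metastore web service, and PoS tagging and chunking are left out. The names and the regular expression are assumptions, not the components used in the talk.

```python
import re

# Hypothetical stand-in for the Metastore lookup: surface form -> concept type.
GAZETTEER = {
    "bevacizumab": "PRODUCTS",
    "chugai": "COMPANIES",
    "japan": "TERRITORIES",
}


def tokenize(sentence):
    """Split a sentence into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def annotate(tokens):
    """Pair each token with its entity type, or None if it is not a known term."""
    return [(token, GAZETTEER.get(token.lower())) for token in tokens]


tokens = tokenize("Bevacizumab has been licensed to Chugai in Japan.")
annotated = annotate(tokens)
```

For the example sentence, `tokens` reproduces the nine-token list shown on the slide, and the three entities receive their concept types.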

Example of a Semantic Rule

Result of Pre-Processing / Annotation:
(S
  (NP bevacizumab/ER:PRODUCTS)
  has/VBZ
  been/VBN
  licensed/VBN
  (PP to/TO (NP Roche/ER:COMPANIES))
  (PP in/IN (NP Japan/ER:TERRITORIES))
  ./.)

Result of Rule Evaluation:
EventType:        LicensingStatusEvent
LicensingSubject: bevacizumab
Licensee:         roche
ValidTerritory:   japan
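A rule of this kind can be sketched as a function over the annotated tokens. The version below works on a flat token list rather than the chunk tree and hard-codes one pattern. The rule syntax actually used in the talk is not shown, so the function name and matching logic are assumptions.

```python
ANNOTATED = [
    ("bevacizumab", "ER:PRODUCTS"), ("has", "VBZ"), ("been", "VBN"),
    ("licensed", "VBN"), ("to", "TO"), ("Roche", "ER:COMPANIES"),
    ("in", "IN"), ("Japan", "ER:TERRITORIES"), (".", "."),
]


def licensing_rule(annotated):
    """Return a frame dict if the sentence matches the licensing pattern, else None."""
    words = [word.lower() for word, _ in annotated]
    if "licensed" not in words:
        return None
    trigger = words.index("licensed")
    frame = {"EventType": "LicensingStatusEvent"}
    for position, (word, tag) in enumerate(annotated):
        previous = words[position - 1] if position > 0 else ""
        if tag == "ER:PRODUCTS" and position < trigger:
            frame["LicensingSubject"] = word.lower()   # product before the verb
        elif tag == "ER:COMPANIES" and previous == "to":
            frame["Licensee"] = word.lower()           # company after 'to'
        elif tag == "ER:TERRITORIES" and previous == "in":
            frame["ValidTerritory"] = word.lower()     # territory after 'in'
    return frame if len(frame) == 4 else None          # all three roles filled


frame = licensing_rule(ANNOTATED)
```

On the annotated example this fills the same three roles as the rule evaluation result above.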

Semantic Representation of Events as Graphs

Subject      Predicate    Object
bevacizumab  hasLicensee  chugai

[http://usecases.novartis.intra/ci.owl#bevacizumab] [http://usecases.novartis.intra/ci.owl#hasLicensee] [http://usecases.novartis.intra/ci.owl#chug...]

Each triple corresponds to one statement consisting of a subject, a predicate and an object.

Entities are identified by a URI (Uniform Resource Identifier)

New resources can easily be «attached» in order to create big networks of relations (hence «linked data»)

Integration of other Concept Types from the MetaStore

[Figure: graph with nodes bevacizumab, chugai, roche, company and 216974-75-3; edge labels hasLicensee, hasAcquired, hasCASNumber, hasType]
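As a rough illustration, a frame can be turned into triples by prefixing every name with the ontology namespace. The namespace string is taken from the slide. The event name `event1`, the function name and the convention of building predicates as `has` plus the role name are assumptions.

```python
CI = "http://usecases.novartis.intra/ci.owl#"   # namespace shown on the slide
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def frame_to_triples(event_name, frame):
    """Turn a frame (role -> value) into (subject, predicate, object) triples."""
    event = CI + event_name
    triples = [(event, RDF_TYPE, CI + frame["EventType"])]
    for role, value in frame.items():
        if role != "EventType":
            triples.append((event, CI + "has" + role, CI + value))
    return triples


triples = frame_to_triples("event1", {   # 'event1' is a made-up identifier
    "EventType": "LicensingStatusEvent",
    "LicensingSubject": "bevacizumab",
    "Licensee": "roche",
    "ValidTerritory": "japan",
})
```

Because every node is a URI, further triples about the same company or product can be attached later without changing the existing ones.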

Querying and Exploring the Triple Store

Example query for an event using SPARQL

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ci:  <http://usecases.novartis.intra/ci.owl#>
SELECT ?country
WHERE {
  ?event rdf:type ci:LicensingStatusEvent .
  ?event ci:hasLicensee ci:roche .
  ?event ci:hasLicensingSubject ci:bevacizumab .
  ?event ci:hasValidTerritory ?country .
}
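What such a query does can be shown without a triple store: each pattern is matched against the triples while variable bindings are carried along. The sketch below is a toy matcher in plain Python, not a SPARQL engine. The sample triples and event names are made up for illustration.

```python
# Sample data (illustrative); prefixed names are kept as plain strings.
TRIPLES = [
    ("event1", "rdf:type", "ci:LicensingStatusEvent"),
    ("event1", "ci:hasLicensee", "ci:roche"),
    ("event1", "ci:hasLicensingSubject", "ci:bevacizumab"),
    ("event1", "ci:hasValidTerritory", "ci:japan"),
    ("event2", "rdf:type", "ci:LicensingStatusEvent"),
    ("event2", "ci:hasLicensee", "ci:chugai"),
    ("event2", "ci:hasLicensingSubject", "ci:bevacizumab"),
    ("event2", "ci:hasValidTerritory", "ci:japan"),
]


def match(patterns, triples, binding=None):
    """Yield variable bindings that satisfy all patterns (a basic graph pattern)."""
    binding = binding or {}
    if not patterns:
        yield binding
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        candidate = dict(binding)
        for term, value in zip(first, triple):
            if term.startswith("?"):
                # A variable must keep the value it was bound to earlier.
                if candidate.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield from match(rest, triples, candidate)


QUERY = [
    ("?event", "rdf:type", "ci:LicensingStatusEvent"),
    ("?event", "ci:hasLicensee", "ci:roche"),
    ("?event", "ci:hasLicensingSubject", "ci:bevacizumab"),
    ("?event", "ci:hasValidTerritory", "?country"),
]
countries = [result["?country"] for result in match(QUERY, TRIPLES)]
```

Only `event1` satisfies all four patterns, so `countries` contains a single entry for Japan.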

Storage & Representation | Querying & Exploring

Meaning Expansion: Ontologies

New Triples can be inferred using ontology definitions:

[Figure: graph with nodes AcquisitionEvent1, chugai, AcquisitionEvent, Company and AcquiredCompany; edge labels isAcquisitionObject, hasType]
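The idea of inferring new triples can be sketched with two simple rule tables. The ontology definitions are not spelled out on the slide, so both tables and the direction of `isAcquisitionObject` (from company to event) are assumptions made for illustration.

```python
# Assumed ontology definitions (not given on the slide).
DOMAIN = {"isAcquisitionObject": "AcquiredCompany"}   # subject type of a property
SUBCLASS_OF = {"AcquiredCompany": "Company"}          # class hierarchy


def infer(triples):
    """Apply the domain and subclass rules repeatedly until nothing new is added."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        for subject, predicate, obj in list(inferred):
            new = set()
            if predicate in DOMAIN:
                new.add((subject, "hasType", DOMAIN[predicate]))
            if predicate == "hasType" and obj in SUBCLASS_OF:
                new.add((subject, "hasType", SUBCLASS_OF[obj]))
            if not new <= inferred:
                inferred |= new
                changed = True
    return inferred


facts = {
    ("chugai", "isAcquisitionObject", "AcquisitionEvent1"),
    ("AcquisitionEvent1", "hasType", "AcquisitionEvent"),
}
closure = infer(facts)
```

Under these assumptions, stating that chugai is the object of an acquisition is enough to derive that it is an AcquiredCompany and therefore a Company.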

Conclusions

Data feeds from commercial content providers lack semantics (XML=Syntax, statements in Natural Language)

Data feeds from commercial content providers contain inconsistencies and outdated facts (need for consolidation)

Transformation of content (entities and events) into a Semantic Web representation eases knowledge integration and exploration of data (graph navigation)

Content providers should shift towards a meaningful, well-defined and explorable Semantic Web representation of their data.