Computer Curation of Patents & Scientific Literature


Post on 14-Mar-2020


Page 1: Computer Curation of Patents & Scientific Literature. Information Analytics: Transforming Information Into Value

Source – J Kreulen

Page 2

Computer Curation of Patents & Scientific Literature Information Analytics Transforming Information Into Value

Stephen K. Boyer, Ph.D. [email protected]

408-858-5544

Moneyball

Medicine

Page 3

The Problem

All content and no discovery?

Page 4

The Question

Can we use computers to “read” documents, identify critical entities, and perform meaningful associations that can help us with our work?

Page 5

For example: patents and scientific papers contain molecular data in many different forms:

- As text: chemical names found in the text of documents
- As bitmap images: pictures of chemicals found in the document images

Page 6

Massive Computing Environment

Chemical & biological information derived from text analytics

- Find and compute the 3D structures
- Identify every protein
- Identify every disease
- Identify every Medline MeSH code
- Identify occurrence of every biomarker
- Compute properties & find relationships
- Data warehouse

Equivalent to 240K simultaneous Google searches

Page 7

Computer Curation Process Overview

Data Sources:
- U.S. Patents (1976-2009)
- U.S. Pre-Grants (All)
- PCT & EPO Apps
- Medline Abstracts (>18 M)
- Selected Internet Content
- In-House Content

Annotation Factory:
- Parse & Extract data
- Annotator 1
- Annotator 2
- Classifier & Other Data Associations
- Computational Analytics (Semantic Associations)

Databases:
- Database + computed Meta Data
- IP Database (e.g. DB2)
- ChemVerse db
- ADU* (* ADU = Automated Data Update)

User Applications:
- View selected Documents & Reports
- Knime or Pipeline Pilot
- BIW
- SIMPLE
- ChemAxon Search
- Cognos/DDQB/Other Apps

ChemVerse

Services Hosted at IBM Almaden

Page 8

Current Activities …

Page 9

Current activity: “in-line” entity tagging & classification

Flow: Text → Annotated Text → dB and SOLR index

Legend for annotated text: Chemical, Target, Disease, Assay data

Steps:
- Identify chemical names
- Convert chemical names to chemical structures (SMILES), then convert these into InChI & InChIKeys
- Replace all chemical names with the term “inchikey_” followed by the unique InChIKey for that chemical
- Re-index InChIKeys with SOLR

Example:
- aspirin = CC(=O)OC1=CC=CC=C1C(=O)O (SMILES)
- aspirin = BSYNRYMUTXBXSQ-UHFFFAOYSA-N (InChIKey)
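The name-replacement step on this slide can be illustrated with a minimal sketch. It assumes a small hand-built lookup table instead of a real name-to-structure converter; only the aspirin entry comes from the slide. The function name `tag_chemicals` and the exact token format `inchikey_<KEY>` are assumptions made for illustration, not the team's implementation.

```python
import re

# Hypothetical lookup table: chemical name -> InChIKey.
# Only the aspirin entry is taken from the slide. A real pipeline would derive
# the keys from structures (name -> SMILES -> InChI -> InChIKey).
NAME_TO_INCHIKEY = {
    "aspirin": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",
}


def tag_chemicals(text, name_to_key=NAME_TO_INCHIKEY):
    """Replace each known chemical name with an 'inchikey_<KEY>' token."""
    if not name_to_key:
        return text
    # Longest names first, so multi-word names win over shorter substrings.
    names = sorted(name_to_key, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(n) for n in names) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(
        lambda m: "inchikey_" + name_to_key[m.group(1).lower()], text
    )


tagged = tag_chemicals("Aspirin inhibits COX-1; aspirin is also an analgesic.")
print(tagged)
```

Because every synonym of a compound collapses to the same token, a text index built over the tagged documents can be searched by structure identity rather than by spelling.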

Page 10

Current activity: “in-line” entity tagging & classification

In-line text tagging (classification) coupled with computational & experimental data

Flow: Text → dB

Legend for annotated text:
- Chemical, tagged as Chemical_“inchikey BSYNRYMUTXBXSQ-UHFFFAOYSA-N”
- Target, tagged as [target_gene name]
- Disease
- Assay data

Diagram: a compound linked to Target 1, Target 2, Target 3, Target 4 and Target 5

Compound-target associations:
- Known from the literature
- Known from the SEA computations
- Known from NIH experimental HTS (NIH HTS assay data)
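One way to picture how the three kinds of compound-target evidence on this slide could be combined is sketched below. The record layout, the function name `merge_associations`, and the placeholder target names are all assumptions; no actual association data is implied.

```python
from collections import defaultdict

ASPIRIN = "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"  # InChIKey shown on the slide

# Hypothetical association records: (InChIKey, target, evidence source).
# Target names are placeholders, as on the slide; these are not real findings.
records = [
    (ASPIRIN, "Target 1", "literature"),
    (ASPIRIN, "Target 2", "literature"),
    (ASPIRIN, "Target 2", "SEA"),
    (ASPIRIN, "Target 4", "SEA"),
    (ASPIRIN, "Target 1", "NIH HTS"),
]


def merge_associations(rows):
    """Group evidence sources by (compound, target) pair."""
    merged = defaultdict(set)
    for inchikey, target, source in rows:
        merged[(inchikey, target)].add(source)
    return merged


merged = merge_associations(records)
for (inchikey, target), sources in sorted(merged.items()):
    print(f"inchikey_{inchikey} -> {target}: {sorted(sources)}")
```

Keying on the InChIKey means text-derived, computed, and experimental evidence for the same structure land in the same bucket, even when the sources name the compound differently.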

Page 11

Medline co-occurrence of statin structures vs. MeSH Signs & Symptoms (C23)

Chemical Structures vs. Signs and Symptoms
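A co-occurrence table of this kind can be sketched as a simple count over annotated abstracts. The data below is entirely made up for illustration (`statin_A`, `symptom_X` and so on are placeholders), and the function name `cooccurrence` is an assumption.

```python
from collections import Counter
from itertools import product

# Hypothetical annotated abstracts: each lists the compounds and the MeSH
# Signs & Symptoms (C23) terms tagged in it. All values are illustrative.
abstracts = [
    {"compounds": {"statin_A"}, "symptoms": {"symptom_X"}},
    {"compounds": {"statin_A", "statin_B"}, "symptoms": {"symptom_X", "symptom_Y"}},
    {"compounds": {"statin_B"}, "symptoms": {"symptom_Y"}},
]


def cooccurrence(docs):
    """Count the abstracts in which each (compound, symptom) pair co-occurs."""
    counts = Counter()
    for doc in docs:
        counts.update(product(sorted(doc["compounds"]), sorted(doc["symptoms"])))
    return counts


matrix = cooccurrence(abstracts)
for (compound, symptom), n in sorted(matrix.items()):
    print(compound, symptom, n)
```

Raw counts like these only show that two terms appear together; they say nothing about whether the association is causal or even statistically meaningful.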

Page 12

Search

- Chemical search using ChemAxon with DB2
- Proximal search
- Nearest-neighbor search

Page 13

BioTerm Analysis

Clustering Claims Originality

Discovery

Page 14

Landscape Analysis

Visualization

Networks

Page 15

Computer Curation Process Overview (same diagram as Page 7)

Page 16

backup

Page 17

Molecular Attributes

Molecules have various attributes (from different sources)

- IP Attributes (Orange Book): legal status, assignee, foreign filings, expiration date
- Spectral Attributes (NIST dB): IR spectra, NMR, mass spec, etc.
- Physical Attributes (computational): MW, MF, Bp, Mp, etc.
- Drug and Screening Attributes:
  - Drugbank: activity, pharm data, protein binding, half life
  - WomBat: activity, pharm data, target data for SRA, literature references
  - PubChem: activity, pharm data, target data for SRA, literature references
- Toxicity Attributes (EPA databases): toxicity studies, LD50, literature references

Attributes derived from different sources

Page 18

The Tank

Cross mapping attributes from different sources

Semantic association of attributes

Data Sources (Internet): Orange Book, PubChem, Drugbank, FDA, others

Each data source has its own schema and attributes:
- Data Source 1: Schema 1
- Data Source 2: Schema 2

Databases:
- Structure (trusted database)
- Database A (Medline)
- Database C (Tox)

Attributes: Location, SMILES, InChi_id, Binding site, Code name, Target, Activity, app_id, Trade Name, Geo, Country, Pathway, Tox, IP status, Certifications, Licensing

Queries:
- Input list of attributes → output file list of SMILES
- Input list of SMILES → output file list of attributes
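The two query directions on this slide (attributes in, SMILES out; SMILES in, attributes out) can be sketched as a join across sources that use different schemas. Everything below is a hypothetical illustration: the field names, the attribute values, and the functions `build_index`, `smiles_for_attributes` and `attributes_for_smiles` are assumptions. Joining on raw SMILES strings is also a simplification, since the same structure can be written as more than one SMILES string; a canonical identifier such as the InChIKey would be more robust.

```python
ASPIRIN_SMILES = "CC(=O)OC1=CC=CC=C1C(=O)O"  # SMILES shown earlier in the deck

# Hypothetical records from two sources whose schemas name the structure
# field differently. Attribute values are placeholders, not real data.
source_1 = [{"smiles": ASPIRIN_SMILES, "trade_name": "example-trade-name"}]
source_2 = [{"structure": ASPIRIN_SMILES, "ip_status": "example-status"}]


def build_index(sources):
    """sources: list of (records, structure_field). Returns SMILES -> attributes."""
    index = {}
    for records, key_field in sources:
        for rec in records:
            attrs = index.setdefault(rec[key_field], {})
            attrs.update({k: v for k, v in rec.items() if k != key_field})
    return index


def smiles_for_attributes(index, wanted):
    """Input: attribute/value pairs. Output: sorted list of matching SMILES."""
    return sorted(
        s for s, attrs in index.items()
        if all(attrs.get(k) == v for k, v in wanted.items())
    )


def attributes_for_smiles(index, smiles_list):
    """Input: list of SMILES. Output: SMILES -> merged attributes."""
    return {s: index.get(s, {}) for s in smiles_list}


index = build_index([(source_1, "smiles"), (source_2, "structure")])
print(smiles_for_attributes(index, {"ip_status": "example-status"}))
print(attributes_for_smiles(index, [ASPIRIN_SMILES]))
```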

Page 19

Watson

Page 20

IBM’s Massively Parallel Probabilistic Architecture

Pipeline stages (from the architecture diagram):
- Question
- Question/Topic Analysis
- Query Decomposition
- Hypothesis Generation: Primary Search and Candidate Answer Generation, drawing on answer sources (A. Sources)
- Soft Filtering
- Hypothesis & Evidence Scoring: Supporting Evidence Retrieval, Deep Evidence Scoring and Answer Scoring, drawing on evidence sources (E. Sources)
- Synthesis
- Final Merging & Ranking, using Trained Models
- Answer, Confidence

The diagram repeats the Hypothesis Generation, Soft Filtering, and Hypothesis & Evidence Scoring boxes, indicating parallel branches.

Watson generates and scores many hypotheses using an extensible collection of Natural Language Processing, Machine Learning and Reasoning Algorithms. These gather and weigh evidence over both unstructured and structured content to determine the answer with the best confidence.

Source – J Kreulen
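The generate, filter, score and rank flow described on this slide can be illustrated with a toy sketch. This is not DeepQA code: the candidate answers, the two scorer names, the weights, and the logistic combination are all assumptions chosen to show the idea of turning several evidence scores into one ranked confidence.

```python
import math

# Hypothetical candidate answers with per-scorer evidence scores in [0, 1].
candidates = {
    "answer_a": {"text_match": 0.9, "type_match": 0.8},
    "answer_b": {"text_match": 0.4, "type_match": 0.9},
    "answer_c": {"text_match": 0.2, "type_match": 0.1},
}

# Hypothetical weights standing in for a trained model.
WEIGHTS = {"text_match": 2.0, "type_match": 1.0}
BIAS = -1.5


def confidence(scores):
    """Combine evidence scores into a confidence with a logistic function."""
    z = BIAS + sum(WEIGHTS[name] * value for name, value in scores.items())
    return 1.0 / (1.0 + math.exp(-z))


def rank(cands, soft_threshold=0.3):
    """Drop weak candidates on a cheap score, then rank the rest by confidence."""
    survivors = {n: s for n, s in cands.items() if s["text_match"] >= soft_threshold}
    scored = [(name, confidence(scores)) for name, scores in survivors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranking = rank(candidates)
best_answer, best_confidence = ranking[0]
print(best_answer, round(best_confidence, 3))
```

The threshold filter stands in for the soft-filtering stage: it prunes candidates cheaply so that the more expensive evidence scoring only runs on plausible answers.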

Page 21

DeepQA Application (Java/C++)

Watson Infrastructure

• 90 Power 750 servers
• Each server: 3.5 GHz POWER7 8-core processors with 4 threads/core
• Total: 2880 POWER7 cores with 16 TB RAM
• Processing speed: 500 GB/sec; 80 TeraFLOPS
• 94th on Top 500 Supercomputers
• Note: This hardware is for Jeopardy. Any other application of Watson will require appropriate sizing and optimization for purpose.

SUSE Linux Enterprise Server 11

Apache Hadoop + Apache UIMA
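The infrastructure figures can be cross-checked with a little arithmetic. The sketch below uses only numbers stated on the slide; the derived values (cores per server, processors per server, total hardware threads) are inferences from those numbers, not figures quoted by the author.

```python
# Figures stated on the slide.
servers = 90
total_cores = 2880
cores_per_processor = 8
threads_per_core = 4

# Derived values (inferred, not stated on the slide).
cores_per_server = total_cores // servers                         # 32
processors_per_server = cores_per_server // cores_per_processor   # 4
total_threads = total_cores * threads_per_core                    # 11520

print(cores_per_server, processors_per_server, total_threads)
```

The totals are consistent only if each server carries several 8-core processors rather than a single one.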

Technical Issues to consider when applying QA systems like Watson

Nature of Domain: Open vs. Closed

Closed domain implies all knowledge is contained within a specific domain characterized by ontologies, and there is no need to go outside the domain. Jeopardy is an open-domain example, where the questions draw on general knowledge.

Knowledge/Data Sources: Availability

QA systems are natural language search engines. Watson goes beyond NL search. If knowledge sources are incomplete, unavailable, insufficient or inadequate, then it is not possible for the system to provide an answer. In some cases one would need to envisage interactive QA that requires human interaction to guide the search. Another very important consideration is the availability of sufficient sample data for training (i.e. a training corpus).

Need for multi-modality

Is there a need for transcription from speech to text before a question is answered? This would require integration of speech-to-text capabilities that are not really ready for real-time applications.

Latency

Watson is capable of processing 500 GB of information per second with a 3 sec response to questions, and it kept most of its knowledge sources in memory (as opposed to on disk) for speed. What is the latency requirement for the application?

Multi-Lingual or Cross-Lingual Support

Watson can support only English at this time; with language-specific parsers other languages can be supported. If knowledge sources or QA are required in multiple languages, then the application would not be a good candidate. Additionally, if cultural context has to be accommodated in the answer, then it would not be prudent to deploy QA systems directly interacting with users.

Question Type

Decomposition and classification of the question is critical to how QA systems work. The bulk of the question types in Jeopardy were factoid questions. Watson did not include 2 question categories: one is audio/video questions that require looking at a video to answer, and the other is questions that require special instructions (e.g. verbal instructions to explain a question).

Answer Types

Watson is not designed to curate a task-oriented system. It can handle temporal and geo-spatial reasoning in its answers. As it stands it cannot handle business-process type reasoning (to do task B, tasks A and C must be completed, etc.).

Page 22

I would like to acknowledge the IBM Almaden Research team:

Jeff Kreulen

Ying Chen

Scott Spangler

Alfredo Alba

Tom Griffin

Eric Louie

Su Yan

Issic Cheng

Prasad Ramachandran

Bin He

Ana Lelescu

Brian Langston

Qi He

Linda Kato

Brad Wade

John Colino

Meenakshi Nagarajan

Timothy J Bethea

German Attanasio

Cassidy Kelly

Jack Labrie

Fredrick Eduardo

Ionia Stanoi

+ a host of folks from IBM China Labs