Retrieving Information Across Multiple, Related Domains Based on User Query and Feedback: Application to Patent Laws and Regulations

Hang Yu, University of Illinois at Urbana-Champaign
Siddharth Taduri, Stanford University
Jay Kesan, University of Illinois at Urbana-Champaign
Gloria Lau, Stanford University
Kincho H. Law, Stanford University

27th October 2010
International Conference on Theory and Practice of Electronic Governance (ICEGOV), Beijing, China.


Motivation

PROBLEM STATEMENT

How can one develop a comprehensive knowledge of patents in a particular technological space?

This task involves extensive study of patent documents, scientific publications, and other government agency and court documents

Technology Firms’ Concerns
• Can I get patent protection for my innovation?
• Do I build or do I buy related technologies?
• What are my competitors doing?
• How strong are their patents?
• Am I perhaps infringing on someone else’s patents?
• If so, are those patents valid?
• Have they been enforced in court?
• Has their validity been challenged in court?

PROBLEM STATEMENT

Patent validity and enforcement questions involve analysis of documents in various domains: world-wide patents, PTO file wrappers, scientific publications, and court documents

These domains are incompatible with each other, and each needs a different approach

Goal: Provide a single framework and interface to collect a comprehensive set of related documents from each of these incompatible domains

[Figure: the document domains: COURT CASES, PTO FILE WRAPPERS, PUBLICATIONS, LAWS & REGULATIONS, PATENTS]

BACKGROUND

Many patent documents and research tools/resources are available online, both free and paid (Google Patents, Espacenet, USPTO, WIPO, Delphion, MicroPatent, …)

Many resources are available for scientific publications/journals (PubMed, MedLine, IEEE, Google Scholar, etc.)

Thomson Reuters/Innovation brings together the Derwent Patent Index, Web of Science for publications, and Inspec, a bibliographic tool

Dialog LLC is an online information retrieval system for patents, medical databases, news, and other technical journals

Fewer resources are available to access PTO file wrappers, court documents, and laws and regulations

Challenges

PATENTS

Over 7 million U.S. patents

In 2009, 485,312 patent applications were filed

Foreign patents (DWPI, European, German, Japanese, etc.)

Patent sources: USPTO, Delphion, WIPO, Derwent Patent Index, Google Patents, …

Keyword-based search results are imprecise and low in recall

[Figure: chart of patent applications and granted patents by year (2004 to 2008); vertical axis from 100,000 to 500,000]

IP LITIGATION

Court cases are important: a patent that has been litigated is valuable

94 District Courts and one Court of Appeals (CAFC)

PACER: an electronic system to access databases for U.S. courts

PACER requires one to know the party/assignee name, case number/type, etc.

Other options: Google Scholar

Keyword-based search may not be effective because of information overload and lack of context

USPTO PROCEEDINGS: FILE WRAPPERS

Patent file wrappers contain information about the scope of protection: application/patent data, prosecution history, application history, and other examination information

Available on PAIR (Patent Application Information Retrieval)

Public PAIR: displays issued or published application status

Private PAIR: real-time current patent application status

Some file wrappers are available only as images, so text cannot be automatically extracted

SCIENTIFIC PUBLICATIONS

A very broad set of topics needs to be searched

Many databases must be searched

Current options include PubMed, MedLine, Google Scholar, etc.

PubMed contains articles from over 300 research journals

Can we determine the state of the art at the time a patent application was filed?

PROPOSED FRAMEWORK

Step 1: Expand Keywords

Step 2: Independently search domains

Step 3: Combine Results + Rank

Step 4: Consider User Feedback

STEP 1: EXPAND KEY WORDS

Goal: Expand the user query using ontologies/taxonomies (BioPortal, GeneCards, MedTerms)

Simple example:
Doc A: The car has a 3.5l V6 engine
Doc B: The vehicle has a 3.5l V6 engine

A keyword search for “car” will return only Doc A. An ontology that describes the term “vehicle” as a synonym or a parent of “car” will internally expand the query to return both Doc A and Doc B

Challenges:
• Picking the right ontology (an imprecise ontology may result in irrelevant keywords)
• Combining terms from various ontologies
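The car/vehicle example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `TOY_ONTOLOGY` dictionary, the extra synonym "automobile", and the helper names are hypothetical and do not come from BioPortal, GeneCards, or MedTerms.

```python
# Toy ontology (hypothetical): each term lists its synonyms and parent terms.
TOY_ONTOLOGY = {
    "car": {"synonyms": {"automobile"}, "parents": {"vehicle"}},
}

# The two documents from the slide's example.
DOCS = {
    "Doc A": "The car has a 3.5l V6 engine",
    "Doc B": "The vehicle has a 3.5l V6 engine",
}


def expand_query(term, ontology):
    """Return the term plus any synonyms and parents the ontology lists."""
    entry = ontology.get(term, {})
    return {term} | entry.get("synonyms", set()) | entry.get("parents", set())


def keyword_search(terms, docs):
    """Return the names of documents containing at least one of the terms."""
    wanted = {t.lower() for t in terms}
    hits = []
    for name, text in docs.items():
        if wanted & set(text.lower().split()):
            hits.append(name)
    return hits


plain_hits = keyword_search({"car"}, DOCS)                              # query as typed
expanded_hits = keyword_search(expand_query("car", TOY_ONTOLOGY), DOCS)  # expanded query
```

With a broader or less precise ontology, `expand_query` would also pull in unrelated terms, which is the first challenge listed on the slide.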

STEP 2: INDEPENDENTLY SEARCH DATABASES

Goal: Find relevant documents in a database of homogeneous documents (e.g., patents or publications)

Challenges:
• Patents: Appropriate weighting of various features such as patent assignee, inventor, forward and backward citations, …
• Cases: How can we obtain data in a searchable format? PACER does not provide a keyword-based interface
• File Wrappers: Automatic text extraction can be hard because some documents are scanned as images
• Adapting the search to the user's preference between Type-I and Type-II errors (false positives vs. false negatives)

STEP 3: COMBINE RESULTS FROM THE FOUR DIFFERENT DOMAINS

Goal: (1) Cross-reference results from other domains (2) Rank results

Challenges:
• Establishing links between various domains
• Improving the quality of search in one domain using results from another
• Feature extraction
• Ranking documents requires combining many features with an appropriate weighting function
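One simple form such a weighting function could take is a linear combination of per-document features. The sketch below is an assumption for illustration only: the slides do not specify the features, the weights, or the functional form, and every name and number here is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    """A search result from one domain with its extracted features (hypothetical)."""
    doc_id: str
    domain: str
    features: dict = field(default_factory=dict)


# Hypothetical feature weights; choosing them well is the challenge named on the slide.
WEIGHTS = {
    "keyword_score": 0.5,          # relevance score from the per-domain search
    "cited_by_core_patent": 0.3,   # 1.0 if a core patent cites the document
    "cross_domain_links": 0.2,     # normalized count of links to other domains
}


def combined_score(candidate, weights):
    """Weighted sum of the candidate's features; missing features count as 0."""
    return sum(w * candidate.features.get(name, 0.0) for name, w in weights.items())


def rank(candidates, weights):
    """Sort candidates from all domains by combined score, best first."""
    return sorted(candidates, key=lambda c: combined_score(c, weights), reverse=True)


candidates = [
    Candidate("patent-1", "patents", {"keyword_score": 0.8, "cross_domain_links": 0.5}),
    Candidate("paper-1", "publications",
              {"keyword_score": 0.4, "cited_by_core_patent": 1.0, "cross_domain_links": 1.0}),
    Candidate("case-1", "court cases", {"keyword_score": 0.1}),
]
ranking = [c.doc_id for c in rank(candidates, WEIGHTS)]
```

Here the publication outranks the patent despite a lower keyword score because the cross-reference features contribute to its total, which is the intent of goal (1).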

STEP 4: CONSIDER USER FEEDBACK

Goal: Consider user feedback from domain experts

Challenges:
• What format or scale should the feedback be taken in? (yes/no, paragraph)
• How should the feedback be integrated with the system?
• How can we resolve conflicting feedback?

Use Case: EPO

EXPERIMENTATION/METHODOLOGY

Build a use case to implement the functional requirements

It will provide a basis for experimentation

Chosen use case: “EPO/Erythropoietin”

Erythropoietin is a hormone that regulates the production of red blood cells

Synthetic production of this hormone is significant in the treatment of many diseases, such as anemia

USE CASE: EPO/ERYTHROPOIETIN

Why does this make a good use case?

Core patents: U.S. Patents 5,621,080, 5,756,349, 5,955,422, 5,547,933, 5,618,698

135 directly related patents and over 3000 related publications

Around 20 court cases; patent litigation involving major companies including Amgen, Hoechst Marion Roussel, Inc., and Transkaryotic Therapies, Inc.

Several available ontologies: Gene Ontology, National Cancer Institute Thesaurus, …

This corpus forms a good experimental platform to test the overall effectiveness of the framework

PATENTS

Documents are indexed for search using Apache Lucene

Rank computation is based on the general idea that a term occurring frequently across many documents (e.g., “the”) is less informative than a term (e.g., “EPO”) that occurs frequently in fewer documents

Returns over 7000 documents from over 7 million documents in the USPTO database

Returns ~90 of the 135 related patents

U.S. Patent No. 6,204,247 is relevant but does not contain the term erythropoietin

Search results for “erythropoietin” amongst the 135 closely related patents:

Patent Number  Rank
5955422        0.109
6204247        0.000
6245740        0.018
6270989        0.000
6280977        0.027
6340742        0.113
6420339        0.000
6420340        0.000
6524818        0.009

Q: How can this be made better?
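The ranking idea described above is commonly realized as TF-IDF weighting. The following is a minimal sketch of plain TF-IDF for a single query term, not Lucene's actual scoring formula, which adds normalization and other factors; the three toy documents are invented for illustration.

```python
import math
from collections import Counter


def tf_idf_scores(query_term, docs):
    """Score each document for one query term with plain TF-IDF.

    tf  = occurrences of the term / number of words in the document
    idf = log(number of documents / number of documents containing the term)
    """
    tokenized = {name: text.lower().split() for name, text in docs.items()}
    doc_freq = sum(1 for words in tokenized.values() if query_term in words)
    if doc_freq == 0:
        return {name: 0.0 for name in tokenized}
    idf = math.log(len(tokenized) / doc_freq)
    scores = {}
    for name, words in tokenized.items():
        tf = Counter(words)[query_term] / len(words) if words else 0.0
        scores[name] = tf * idf
    return scores


# Invented toy corpus: "the" occurs everywhere, "epo" in one document only.
TOY_DOCS = {
    "d1": "the epo gene encodes the epo hormone",
    "d2": "the hormone regulates the red blood cells",
    "d3": "the patent claims the process",
}

the_scores = tf_idf_scores("the", TOY_DOCS)
epo_scores = tf_idf_scores("epo", TOY_DOCS)
```

Because “the” appears in every toy document, its inverse document frequency is log(1) = 0 and it contributes nothing to the ranking, while “epo” separates `d1` from the rest. A document that never mentions the query term, like U.S. Patent No. 6,204,247 in the table, scores zero regardless of its relevance.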

ONTOLOGY

BioPortal: Web-based application for accessing and sharing biomedical ontologies, developed at the National Center for Biomedical Ontology (NCBO)

Gene Ontology (GO): GO uses three organizing principles: cellular component, biological process, and molecular function. This ontology represents “erythropoietin receptor binding” as a molecular function.

National Cancer Institute (NCI) Thesaurus: Provides reference terminology and vocabulary for clinical care, translational and basic research, and public information and administrative activities

[Figure: (a) Gene Ontology; (b) NCI Thesaurus]

Expanded term base: “Erythropoietin”, “Erythropoietin Receptor Binding”, “Colony Stimulating Factor”, “Cytokine”, …

RESULTS AFTER USING EXPANDED TERM BASE

Improved results: more relevant documents are identified

Computed rank is the average of document ranks for each individual keyword

The 5 core patents have a relatively high rank

Returns a large set of documents when searched in USPTO (185,126 documents contain “protein”; 23,759 contain “cytokine”, …)

Patent Number  Score
5955422        0.050
6204247        0.028
6245740        0.038
6270989        0.005
6280977        0.008
6340742        0.049
6420339        0.026
6420340        0.028
6524818        0.015
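The averaging step can be sketched as follows. The per-keyword scores and document names below are hypothetical placeholders, not the values behind the table, and treating a missing score as 0 is an assumption.

```python
def average_score(per_keyword_scores, doc_id):
    """Average a document's score over every keyword in the expanded term base.

    A keyword with no score for the document is assumed to contribute 0.
    """
    values = [scores.get(doc_id, 0.0) for scores in per_keyword_scores.values()]
    return sum(values) / len(values) if values else 0.0


# Hypothetical per-keyword scores for two documents.
PER_KEYWORD = {
    "erythropoietin": {"doc_a": 0.10, "doc_b": 0.00},
    "cytokine": {"doc_a": 0.02, "doc_b": 0.06},
}

avg_a = average_score(PER_KEYWORD, "doc_a")
avg_b = average_score(PER_KEYWORD, "doc_b")
```

In this toy case `doc_b` never mentions the original query term yet still receives a nonzero combined score through an expanded term, which mirrors how patent 6204247 moves from 0.000 to 0.028 between the two tables.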

ENTREZ: CROSS-DATABASE SEARCH FOR THE LIFE SCIENCES

Entrez is a global query system that links to multiple databases, such as PubMed and PubMed Central for articles, and databases for nucleotide sequences, gene expression, etc.

Provides Web services and utilities such as ESearch and EFetch to search and retrieve articles from PubMed

Also provides a SOAP interface for the Entrez Utilities

We have written a script around the utilities to automatically download information
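The authors' download script is not shown in the slides. As a rough sketch of how ESearch and EFetch requests are formed over the current HTTP interface of the Entrez Utilities, the helpers below only build request URLs; the function names and the default `retmax` value are assumptions, and the sketch sends no requests.

```python
from urllib.parse import urlencode

# Base address of the NCBI Entrez Utilities HTTP interface.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term, retmax=20):
    """Build an ESearch URL that looks up PubMed records matching a term."""
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def efetch_url(pmids):
    """Build an EFetch URL that retrieves plain-text abstracts for PubMed ids."""
    params = {
        "db": "pubmed",
        "id": ",".join(pmids),
        "rettype": "abstract",
        "retmode": "text",
    }
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


search_url = esearch_url("erythropoietin")
fetch_url = efetch_url(["6713094", "2813359"])
```

A real script would fetch these URLs (for example with `urllib.request.urlopen`), parse the id list returned by ESearch, and pass the ids to EFetch in batches while respecting NCBI's usage limits.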

CROSS REFERENCING SCIENTIFIC PUBLICATIONS WITH CORE PATENTS

We select the PUBMED database as our main corpus of scientific publications

Each scientific publication (document) in PUBMED has a unique Paper Id and a “true” score as a measure of its relevance

We define this score as the number of the 5 core patents that cite the document (called RefScore)

Each expanded term has a word frequency that measures how often it appears in the abstract of the paper

Paper Id   RefScore  Erythropoietin  EPO   Protein
6713094    5         0.446           7.59  0
2813359    5         1.093           8.74  0
18202227   5         0.565           3.96  0.565
3680293    4         0.467           3.74  1.402
3624248    3         3.265           0     1.224
232226     2         0               0     0
14025852   1         0               0     0

Table: Example of some selected papers with their RefScore and the rank (word frequency) of some expanded terms
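The two quantities in the table can be sketched as below. This assumes that RefScore counts how many core patents cite a paper and that word frequency is the percentage of abstract words equal to the term; the patent keys, paper ids, and abstract text are hypothetical placeholders, not data from the study.

```python
def ref_score(paper_id, patent_citations):
    """Count how many core patents include paper_id in their cited references."""
    return sum(1 for cited in patent_citations.values() if paper_id in cited)


def word_frequency_percent(term, abstract):
    """Percentage of words in the abstract that equal the term (case-insensitive)."""
    words = abstract.lower().split()
    if not words:
        return 0.0
    return 100.0 * words.count(term.lower()) / len(words)


# Hypothetical reference lists for three core patents.
PATENT_CITATIONS = {
    "core_patent_1": {"paper_x", "paper_y"},
    "core_patent_2": {"paper_x"},
    "core_patent_3": {"paper_x", "paper_z"},
}

score_x = ref_score("paper_x", PATENT_CITATIONS)
score_z = ref_score("paper_z", PATENT_CITATIONS)
freq = word_frequency_percent(
    "EPO", "EPO stimulates red cell production and EPO levels rise"
)
```

A production version would need proper tokenization (punctuation, hyphenation such as "erythro-poietin", and multi-word terms), which this whitespace split ignores.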

CORRELATION BETWEEN EXPANDED TERMS AND RefScore

For each expanded term (keyword), we can calculate the correlation between its word frequency and RefScore

If we plot each document’s RefScore against its word frequency for the term “EPO”, we can see that the two are positively correlated

If the frequency of “EPO” in a document is larger than 4%, the document is cited by no fewer than 4 of the 5 core patents, with only 1 exception

Most keywords show positive correlations, except for “protein”, which is a general term that appears in many other papers

This justifies the use of word frequency as a measure to cross-reference between patents and scientific publications

Keyword          Correlation
Erythropoietin   0.089
Epo              0.08
Iron             0.065
Erythropoietin   0.035
Cytokines        0.035
Dexamethasone    0.035
hydroxyurea      0.035
Protein          -0.002

[Figure: scatter plot of RefScore against word frequency (%) for the term “EPO”]
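The slides do not say which correlation coefficient was used. Assuming Pearson's coefficient, the calculation can be sketched on the seven example papers from the earlier table (the EPO word frequency column and RefScore). The reported correlations were presumably computed over the full corpus, so the value from these seven rows is not comparable to the table above.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant sequence; report 0
    return cov / math.sqrt(var_x * var_y)


# EPO word frequency (%) and RefScore for the seven example papers.
EPO_FREQUENCY = [7.59, 8.74, 3.96, 3.74, 0, 0, 0]
REF_SCORE = [5, 5, 5, 4, 3, 2, 1]

r_epo = pearson(EPO_FREQUENCY, REF_SCORE)
```

On this small sample the coefficient is positive, consistent with the direction of the relationship described on the slide.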

USE CITATIONS AS USER FEEDBACK to IMPROVE

User feedback can take multiple forms: citations, direct feedback via a Web user interface, …

A user feedback score (Rufs) can be defined as the number of times a particular document is cited in the core patents divided by the number of times it is cited in all PUBMED documents

We also need to know this ratio for an average document (Aufs)

We normalize each document's feedback score by the average user feedback score (giving Fufs) and use a minimum threshold to select desirable documents (1.50 as an example)

Therefore, document 2813359 will always come into the query results

Paper      RefScore  #Citation  Rufs(%)  Fufs
6713094    5         219        2.28     0.94
2813359    5         134        3.73     1.54
18202227   5         260        1.92     0.79
3680293    4         119        3.36     1.38
362424     3         98         3.06     1.26
232226     2         103        1.94     0.80
14205852   1         98         1.02     0.42
Total      25        1031       2.42     --
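The table's arithmetic can be reproduced with a short sketch. It assumes that Rufs = RefScore / #Citation, that Aufs is the ratio in the Total row (25 / 1031), and that Fufs = Rufs / Aufs; these readings match the tabulated values up to rounding in the last digit. The function and variable names are illustrative.

```python
# (paper id, RefScore, total citation count) copied from the table above.
PAPERS = [
    ("6713094", 5, 219),
    ("2813359", 5, 134),
    ("18202227", 5, 260),
    ("3680293", 4, 119),
    ("362424", 3, 98),
    ("232226", 2, 103),
    ("14205852", 1, 98),
]
THRESHOLD = 1.50  # example minimum Fufs from the slide


def feedback_scores(papers):
    """Return Aufs and a mapping of paper id to (Rufs, Fufs)."""
    total_ref = sum(ref for _, ref, _ in papers)
    total_cit = sum(cit for _, _, cit in papers)
    aufs = total_ref / total_cit           # ratio for an "average" document
    scores = {}
    for paper_id, ref, cit in papers:
        rufs = ref / cit                   # core-patent citations / all citations
        scores[paper_id] = (rufs, rufs / aufs)
    return aufs, scores


aufs, scores = feedback_scores(PAPERS)
selected = [pid for pid, (_, fufs) in scores.items() if fufs >= THRESHOLD]
```

With the 1.50 threshold, only paper 2813359 is selected, which agrees with the slide's statement that this document will always come into the query results.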

OTHER ISSUES AND CHALLENGES

USPTO disallows crawling. Currently, we need to know specific details about the documents we intend to download automatically, such as the patent numbers or inventors.

PubMed is a leading database for biomedical journals. However, many publications, especially older ones, are not indexed in the database

Full text of many articles in the PubMed database is still unavailable

PACER is a good source for litigation documents, but all court pleadings are scanned as electronic images; are they machine readable?

Since PACER does not provide keyword-based search, it is difficult to manually scan 94 judicial districts

PAIR enforces CAPTCHA verification, hindering automatic downloading of PTO file wrappers

CURRENT STATUS & FUTURE WORK

Future Work

Implement the proposed framework

Make the information and relevance feedback techniques available via a web interface

Expand the scope to other domains: court cases, file wrappers, regulations, etc.

Drive towards structured information integration. Develop an ontology or a formal representation of the document domains

USEFUL LINKS

Patents
USPTO – http://www.uspto.gov/
Delphion – http://www.delphion.com/
Google Patents – http://www.google.com/patents/

File Wrappers
PAIR – http://portal.uspto.gov/external/portal/pair/

Court Cases
PACER – http://pacer.psc.uscourts.gov/

Publications
Pubmed – http://www.ncbi.nlm.nih.gov/pubmed/
Medline – http://www.nlm.nih.gov/medlineplus/
Google Scholar – http://scholar.google.com/

Ontology/Taxonomy
BioPortal – http://bioportal.bioontology.com/
Genecards – http://www.genecards.org/
MedTerms – http://www.medterms.com/

Miscellaneous
Thomson Reuters – http://www.thomsoninnovation.com/
Dialog – http://www.dialog.com/

ACKNOWLEDGEMENT

This research is partially supported by NSF Grant Number 0811975 awarded to the University of Illinois and NSF Grant Number 0811460 to Stanford University. Any opinions and findings are those of the authors, and do not necessarily reflect the views of the National Science Foundation.

DISCUSSION