Retrieving Information Across Multiple, Related Domains Based on User Query and Feedback: Application to Patent Laws and Regulations
Hang Yu, University of Illinois at Urbana-Champaign
Siddharth Taduri, Stanford University
Jay Kesan, University of Illinois at Urbana-Champaign
Gloria Lau, Stanford University
Kincho H. Law, Stanford University
27th October 2010
International Conference on Theory and Practice of Electronic Governance (ICEGOV), Beijing, China.
PROBLEM STATEMENT
How to develop a comprehensive knowledge of patents in a particular technological space?
This task involves extensive study of patent documents, scientific publications, and other government agency and court documents
Motivation
Technology Firms’ Concerns
• Can I get patent protection for my innovation?
• Do I build or do I buy related technologies?
• What are my competitors doing?
• How strong are their patents?
• Am I perhaps infringing on someone else’s patents?
• If so, are those patents valid?
• Have they been enforced in court?
• Has their validity been challenged in court?
10/27/2010
Patent validity and enforcement questions involve analysis of documents in various domains: worldwide patents, PTO file wrappers, scientific publications, and court documents
These domains are incompatible with each other, and each needs a different approach
Goal: Provide a single framework and interface to collect a comprehensive set of related documents from each of these incompatible domains
Diagram of the document domains: COURT CASES, PTO FILE WRAPPERS, PUBLICATIONS, LAWS & REGULATIONS, PATENTS
BACKGROUND
Many patent documents and research tools/resources are available online (free and paid – Google Patents, Espacenet, USPTO, WIPO, Delphion, MicroPatent, …)
Many resources are available for scientific publications/journals (PubMed, MedLine, IEEE, Google Scholar, etc.)
Thomson Reuters/Innovation brings together the Derwent Patent index, Web of Science for publications, and Inspec, a bibliographic tool
Dialog LLC is an online information retrieval system for patents, medical databases, news, and other technical journals
Fewer resources are available to access PTO file wrappers, court documents, and laws and regulations
Challenges
PATENTS
Over 7 million U.S. patents
In 2009, 485,312 patent applications were filed
Foreign patents (DWPI, European, German, Japanese, etc.)
Patent Sources: USPTO, Delphion, WIPO, Derwent Patent Index, Google Patents …
Keyword based search results are imprecise and low in recall
Chart: Patent Applications and Granted Patents by year (x-axis: 2004 to 2008; y-axis: 100,000 to 500,000)
IP LITIGATION
Court cases are important - a patent that has been litigated is valuable
94 District Courts & one Court of Appeals (CAFC)
PACER – an electronic system to access databases for U.S. Courts
PACER requires one to know party/assignee name, case number/type, etc.
Other options – Google Scholar
Keyword based search may not be effective because of information overload and lack of context
USPTO PROCEEDINGS: FILE WRAPPERS
Patent file wrappers contain information about scope of protection; application/patent data, prosecution history, application history, and other examination information
Available on PAIR (Patent Application Information Retrieval)
Public PAIR – Displays issued or published application status
Private PAIR – Real-time current patent application status
Some file wrappers are only available as images and text cannot be automatically extracted
SCIENTIFIC PUBLICATIONS
A very broad set of topics needs to be searched
Many databases must be searched
Current options include PubMed, MedLine, Google Scholar, etc.
PubMed contains articles from over 300 research journals
Can we determine the state-of-the-art at the time of filing of a patent application?
PROPOSED FRAMEWORK
Step 1: Expand Keywords
Step 2: Independently search domains
Step 3: Combine Results + Rank
Step 4: Consider User Feedback
STEP 1: EXPAND KEY WORDS
Goal: Expand the user query using ontologies/taxonomies (BioPortal, GeneCards, MedTerms)
Simple Example:
Doc A: The car has a 3.5l V6 engine
Doc B: The vehicle has a 3.5l V6 engine
Keyword search for “car” will return only Doc A. An ontology that describes the term “vehicle” as a synonym or a parent of “car” will internally expand the query to return both Doc A and Doc B
Challenges:
Picking the right ontology (an imprecise ontology may result in irrelevant keywords)
Combining terms from various ontologies
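The car/vehicle example can be sketched in a few lines of Python. This is only an illustration of query expansion, not the framework's implementation: the `ONTOLOGY` dictionary, the helper names `expand_query` and `keyword_search`, and the extra synonym "automobile" are all assumptions made for the sketch.

```python
# Toy ontology (hypothetical): each term maps to its synonyms or parent terms.
ONTOLOGY = {
    "car": {"vehicle", "automobile"},
}

# The two documents from the slide's example.
DOCS = {
    "Doc A": "The car has a 3.5l V6 engine",
    "Doc B": "The vehicle has a 3.5l V6 engine",
}


def expand_query(term, ontology):
    """Return the term together with its related ontology terms."""
    return {term} | ontology.get(term, set())


def keyword_search(terms, docs):
    """Return the names of documents containing at least one of the terms."""
    wanted = {t.lower() for t in terms}
    hits = []
    for name, text in docs.items():
        words = set(text.lower().split())
        if words & wanted:
            hits.append(name)
    return sorted(hits)


plain = keyword_search({"car"}, DOCS)
expanded = keyword_search(expand_query("car", ONTOLOGY), DOCS)
print(plain)     # ['Doc A']
print(expanded)  # ['Doc A', 'Doc B']
```

An imprecise ontology would show up here as extra entries in the expanded term set, which is why picking the ontology is listed as a challenge.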
STEP 2: INDEPENDENTLY SEARCH DATABASES
Goal: Find relevant documents in a database of homogeneous documents (e.g., patents or publications)
Challenges:
Patents: Appropriate weighting of various features such as patent assignee, inventor, forward and backward citations, …
Cases: How can we obtain data in a searchable format? PACER does not provide a keyword based interface
File Wrappers: Automatic text extraction can be hard as some documents are scanned as images
Adapting search to the user’s preference between Type-I and Type-II errors
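The trade-off between Type-I and Type-II errors can be illustrated with a score threshold. The sketch below assumes the usual reading in retrieval, where a Type-I error is an irrelevant document that is returned and a Type-II error is a relevant document that is missed; the scores, document names, and the function `error_counts` are made up for illustration.

```python
def error_counts(scored_docs, relevant, threshold):
    """Count (Type-I, Type-II) errors when returning docs scoring >= threshold."""
    returned = {doc for doc, score in scored_docs.items() if score >= threshold}
    type_1 = len(returned - relevant)   # irrelevant documents returned
    type_2 = len(relevant - returned)   # relevant documents missed
    return type_1, type_2


# Hypothetical relevance scores and ground truth.
scores = {"d1": 0.9, "d2": 0.6, "d3": 0.4, "d4": 0.2}
relevant = {"d1", "d3"}

strict = error_counts(scores, relevant, threshold=0.8)
lenient = error_counts(scores, relevant, threshold=0.3)
print(strict)   # (0, 1): nothing irrelevant returned, but d3 is missed
print(lenient)  # (1, 0): d3 is found, at the cost of returning d2
```

A user who cannot afford to miss a relevant patent would prefer the lenient setting; a user who wants a short, clean result list would prefer the strict one.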
STEP 3: COMBINE RESULTS FROM THE FOUR DIFFERENT DOMAINS
Goal: (1) Cross-reference results from other domains (2) Rank results
Challenges:
Establishing links between various domains
Improving the quality of search in one domain using results from another
Feature extraction
Ranking documents requires combining many features with an appropriate weighting function
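One simple form of weighting function is a linear combination of features. The sketch below is an assumption for illustration only: the feature names, the weights, and the documents are hypothetical, and the slides do not specify which function the framework uses.

```python
# Hypothetical feature weights; choosing them well is the open challenge.
WEIGHTS = {
    "text_score": 0.6,               # keyword relevance within the document's own domain
    "cited_by_core_patent": 0.3,     # cross-domain link: cited by a core patent
    "mentioned_in_court_case": 0.1,  # cross-domain link: appears in litigation
}


def combined_score(features, weights):
    """Weighted sum of features; a missing feature counts as 0."""
    return sum(weight * features.get(name, 0.0) for name, weight in weights.items())


def rank(docs, weights):
    """Return document names ordered from highest to lowest combined score."""
    return sorted(docs, key=lambda name: combined_score(docs[name], weights), reverse=True)


docs = {
    "patent X": {"text_score": 0.2, "cited_by_core_patent": 1.0, "mentioned_in_court_case": 1.0},
    "paper Y": {"text_score": 0.5, "cited_by_core_patent": 0.0},
    "patent Z": {"text_score": 0.1},
}

ranking = rank(docs, WEIGHTS)
print(ranking)  # ['patent X', 'paper Y', 'patent Z']
```

In this toy data, "patent X" outranks "paper Y" despite a lower text score because the cross-domain links contribute to its total.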
STEP 4: CONSIDER USER FEEDBACK
Goal: Consider user feedback from domain experts
Challenges:
What format or scale should the feedback be taken in? (yes/no, paragraph)
How should the feedback be integrated with the system?
How can we resolve conflicting opinions?
Use Case: EPO
EXPERIMENTATION/METHODOLOGY
Build a Use Case to implement the functional requirements
It will provide a basis for experimentation
Chosen Use Case: “EPO/Erythropoietin”
Erythropoietin is a hormone that regulates the production of red blood cells
Synthetic production of this hormone holds significance in treatment of many diseases such as Anemia
USE CASE: EPO/ERYTHROPOIETIN
Why does this make a good use case?
Core patents – U.S. Patents 5,621,080, 5,756,349, 5,955,422, 5,547,933, 5,618,698
135 directly related patents and over 3000 related publications
Around 20 court cases of patent litigation involving major companies, including Amgen, Hoechst Marion Roussel, Inc., and Transkaryotic Therapies, Inc.
Several available ontologies: Gene Ontology, National Cancer Institute Thesaurus, …
This corpus forms a good experimental platform to test the overall effectiveness of the framework
PATENTS
Documents are indexed for search using Apache Lucene
Rank computation is based on the general idea that a term occurring frequently across many documents (e.g., “the”) is less informative than a term (e.g., “EPO”) that occurs frequently in fewer documents
The query returns over 7000 documents from over 7 million documents in the USPTO database
It returns ~90 of the 135 related patents
U.S. Patent No. 6,204,247 is relevant but does not contain the term “erythropoietin”
Q: How can this be made better?
Search results for “erythropoietin” amongst the 135 closely related patents:
Patent Number   Rank
5955422         0.109
6204247         0.000
6245740         0.018
6270989         0.000
6280977         0.027
6340742         0.113
6420339         0.000
6420340         0.000
6524818         0.009
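The ranking idea described on this slide is the familiar term frequency / inverse document frequency (tf-idf) weighting. The sketch below is a textbook tf-idf in plain Python, assumed here for illustration; it is not Apache Lucene's actual scoring formula, and the three-document corpus is made up.

```python
import math


def tf_idf(term, doc_tokens, corpus):
    """Textbook tf-idf: term frequency in the document times log inverse document frequency."""
    tf = doc_tokens.count(term) / len(doc_tokens)
    df = sum(1 for tokens in corpus if term in tokens)
    if df == 0:
        return 0.0  # the term does not occur anywhere in the corpus
    idf = math.log(len(corpus) / df)
    return tf * idf


# Tiny hypothetical corpus, already lowercased and tokenized.
corpus = [
    "the epo hormone regulates the red blood cells".split(),
    "the engine of the car".split(),
    "the court ruled on the patent".split(),
]

score_the = tf_idf("the", corpus[0], corpus)  # "the" is in every document, so idf = log(1) = 0
score_epo = tf_idf("epo", corpus[0], corpus)  # "epo" is in one document, so idf = log(3)
print(score_the, score_epo)
```

A document that never mentions the query term gets a score of zero under this scheme, which matches the 0.000 rank of Patent 6,204,247 for “erythropoietin”.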
ONTOLOGY
BioPortal: Web-based application for accessing and sharing biomedical ontologies, developed at the National Center for Biomedical Ontology (NCBO)
Gene Ontology (GO): GO uses three organizing principles – Cellular component, Biological process and Molecular function. This ontology represents “erythropoietin receptor binding” as a molecular function.
National Cancer Institute (NCI) Thesaurus: Provides reference terminology, vocabulary for clinical care, translational and basic research, and public information and administrative activities
Figure: (a) Gene Ontology (b) NCI Thesaurus
Expanded Term Base: “Erythropoietin”, “Erythropoietin Receptor Binding”, “Colony Stimulating Factor”, “Cytokine” …
RESULTS AFTER USING EXPANDED TERM BASE
Improved results: more relevant documents are identified
Computed rank is the average of document ranks for each individual keyword
The 5 core patents have a relatively high rank
Returns a large set of documents when searched in USPTO (185,126 documents contain “protein”; 23,759 contain “cytokine”…)
Patent Number   Score
5955422         0.050
6204247         0.028
6245740         0.038
6270989         0.005
6280977         0.008
6340742         0.049
6420339         0.026
6420340         0.028
6524818         0.015
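The averaging step can be sketched as follows. The per-keyword scores and document names below are hypothetical, chosen only to show how a document that scores zero for one keyword can still receive a non-zero averaged rank once other expanded terms are included; `average_rank` is an illustrative helper, not a function from the authors' system.

```python
def average_rank(per_keyword_scores, doc_id):
    """Average a document's score over all keywords; a missing score counts as 0."""
    scores = [kw_scores.get(doc_id, 0.0) for kw_scores in per_keyword_scores.values()]
    return sum(scores) / len(scores)


# Hypothetical per-keyword scores for two documents.
per_keyword_scores = {
    "erythropoietin": {"doc A": 0.10, "doc B": 0.00},
    "cytokine": {"doc A": 0.02, "doc B": 0.06},
}

avg_a = average_rank(per_keyword_scores, "doc A")
avg_b = average_rank(per_keyword_scores, "doc B")
print(round(avg_a, 3), round(avg_b, 3))  # 0.06 0.03
```

This is the same effect seen for Patent 6,204,247, whose score is non-zero after expansion even though it does not contain the term “erythropoietin”.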
ENTREZ: CROSS-DATABASE SEARCH FOR THE LIFE SCIENCES
Entrez is a global query system which links to multiple databases such as PubMed and PubMed Central for articles, databases for nucleotide sequences, gene expressions etc.
Provides Web Services and utilities such as ESearch and EFetch to search and retrieve articles from PubMed
Also provides a SOAP interface for the Entrez Utilities
We have written a script around the utilities to automatically download information
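A download script around ESearch and EFetch might look roughly like the sketch below. It is not the authors' script: it uses the public E-utilities URL interface rather than the SOAP interface, and the helper names `esearch_url`, `efetch_url`, and `download` are assumptions for illustration. The `download` function is defined but not called here, so the example runs without network access.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL of the NCBI Entrez Programming Utilities (E-utilities).
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def esearch_url(term, retmax=20):
    """Build an ESearch URL that looks up PubMed IDs matching a term."""
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    return EUTILS + "esearch.fcgi?" + urlencode(params)


def efetch_url(pmids):
    """Build an EFetch URL that retrieves plain-text abstracts for PubMed IDs."""
    params = {"db": "pubmed", "id": ",".join(pmids), "rettype": "abstract", "retmode": "text"}
    return EUTILS + "efetch.fcgi?" + urlencode(params)


def download(url):
    """Fetch a URL and return the response body as text (requires network access)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


search_url = esearch_url("erythropoietin")
fetch_url = efetch_url(["6713094", "2813359"])
print(search_url)
print(fetch_url)
```

In practice, `download(search_url)` would return an XML list of PubMed IDs, which could then be passed to `efetch_url`; NCBI's usage guidelines on request rates should be respected when automating this.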
CROSS REFERENCING SCIENTIFIC PUBLICATIONS WITH CORE PATENTS
We select the PubMed database as our main corpus of scientific publications
Each scientific publication (document) in PubMed has a unique Paper Id and a “true” score as a measure of its relevance
We define this score as the number of the 5 core patents that cite the document (called RefScore)
Each expanded term has a word frequency that measures how often it appears in the abstract of the paper
Paper Id    RefScore   Erythropoietin   EPO     Protein
6713094     5          0.446            7.59    0
2813359     5          1.093            8.74    0
18202227    5          0.565            3.96    0.565
3680293     4          0.467            3.74    1.402
3624248     3          3.265            0       1.224
232226      2          0                0       0
14025852    1          0                0       0
Table: Examples of selected papers with their RefScore and the rank (word frequency) of some expanded terms
CORRELATION BETWEEN EXPANDED TERMS AND RefScore
For each expanded term (keyword), we can calculate the correlation between its word frequency and RefScore
If we plot each document’s RefScore against its word frequency for the term “EPO”, we can see that they are positively correlated
If the frequency of “EPO” in a document is larger than 4%, the document is cited by no fewer than 4 of the 5 core patents, with only 1 exception
Most keywords show positive correlations, except for “protein”, which is a general term that appears in many other papers
This justifies the use of word frequency as a measure to cross-reference between patents and scientific publications.
Keyword          Correlation
Erythropoietin   0.089
Epo              0.08
Iron             0.065
Erythropoietin   0.035
Cytokines        0.035
Desamethasone    0.035
hydroxyurea      0.035
Protein          -0.002
Figure: RefScore versus Word Frequency (%)
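The correlation between a term's word frequency and RefScore can be computed as a Pearson coefficient. The sketch below assumes Pearson correlation (the slides do not name the measure) and uses only the seven example rows for the term “EPO” from the selected-papers table, so its result is not expected to reproduce the corpus-wide values reported on the slide.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# RefScore and word frequency (%) of "EPO" for the seven example papers.
ref_scores = [5, 5, 5, 4, 3, 2, 1]
epo_freq = [7.59, 8.74, 3.96, 3.74, 0, 0, 0]

r_epo = pearson(epo_freq, ref_scores)
print(r_epo > 0)  # True: frequency of "EPO" rises with RefScore in this sample
```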
USE CITATIONS AS USER FEEDBACK to IMPROVE
User feedback can take multiple forms: citations, direct feedback via a Web user interface, …
A user feedback score (Rufs) can be defined as the number of times a particular document is cited in the core patents divided by the number of times it is cited in all PubMed documents
We also need to know this ratio for an average document (Aufs)
We normalize each document’s feedback score by the average user feedback score (Fufs = Rufs / Aufs) and use a minimum threshold to select desirable documents (1.50 as an example)
Therefore, document 2813359 will always be included in the query results
Paper      RefScore   #Citation   Rufs(%)   Fufs
6713094    5          219         2.28      0.94
2813359    5          134         3.73      1.54
18202227   5          260         1.92      0.79
3680293    4          119         3.36      1.38
362424     3          98          3.06      1.26
232226     2          103         1.94      0.80
14205852   1          98          1.02      0.42
Total      25         1031        2.42      --
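The feedback scores in the table can be recomputed from the RefScore and #Citation columns. The sketch below assumes Rufs = RefScore / #Citation, Aufs = total RefScore / total #Citation, and Fufs = Rufs / Aufs, which is consistent with the table's values up to rounding in the last digit; the variable and function names are illustrative.

```python
# (RefScore, #Citation) per paper, copied from the table.
papers = {
    "6713094": (5, 219),
    "2813359": (5, 134),
    "18202227": (5, 260),
    "3680293": (4, 119),
    "362424": (3, 98),
    "232226": (2, 103),
    "14205852": (1, 98),
}

total_ref = sum(ref for ref, _ in papers.values())   # 25
total_cit = sum(cit for _, cit in papers.values())   # 1031
aufs = total_ref / total_cit                          # average user feedback score


def fufs(ref_score, citations):
    """Normalized feedback score: the document's ratio divided by the average ratio."""
    return (ref_score / citations) / aufs


THRESHOLD = 1.50  # example threshold from the slide
selected = [pid for pid, (ref, cit) in papers.items() if fufs(ref, cit) >= THRESHOLD]

print(round(aufs * 100, 2))  # 2.42
print(selected)              # ['2813359']
```

With the example threshold of 1.50, only document 2813359 passes, which is why it is always included in the query results.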
OTHER ISSUES AND CHALLENGES
USPTO disallows crawling. Currently, we need to know specific details about the documents we intend to automatically download, such as the patent numbers, or inventors.
PubMed is a leading database for biomedical journals. However, many publications, especially older ones, are not indexed in the database
Full text of many articles in the PubMed database is still unavailable
PACER is a good source for litigation documents, but all court pleadings are scanned as electronic images. Are they machine readable?
Since PACER does not provide keyword based search, it is difficult to manually search all 94 judicial districts
PAIR enforces CAPTCHA verification, hindering automatic downloading of PTO file wrappers
CURRENT STATUS & FUTURE WORK
Future Work
Implement the proposed framework
Make the information and relevance feedback techniques available via a web interface
Expand the scope to other domains – court cases, file wrappers, regulations, etc.
Drive towards structured information integration. Develop an ontology or a formal representation of the document domains
USEFUL LINKS
Patents
USPTO – http://www.uspto.gov/
Delphion – http://www.delphion.com/
Google Patents – http://www.google.com/patents/
File Wrappers
PAIR – http://portal.uspto.gov/external/portal/pair/
Court Cases
PACER – http://pacer.psc.uscourts.gov/
Publications
Pubmed – http://www.ncbi.nlm.nih.gov/pubmed/
Medline – http://www.nlm.nih.gov/medlineplus/
Google Scholar – http://scholar.google.com/
Ontology/Taxonomy
BioPortal – http://bioportal.bioontology.com/
Genecards – http://www.genecards.org/
MedTerms – http://www.medterms.com/
Miscellaneous
Thomson Reuters – http://www.thomsoninnovation.com/
Dialog – http://www.dialog.com/
ACKNOWLEDGEMENT
This research is partially supported by NSF Grant Number 0811975 awarded to the University of Illinois and NSF Grant Number 0811460 to Stanford University. Any opinions and findings are those of the authors, and do not necessarily reflect the views of the National Science Foundation.
DISCUSSION