
USPTO Patent Data Source and Data Extraction

Mandy Dang

MIS 580

University of Arizona

02-06-2008

Outline

• Patent

• USPTO

• Search USPTO Patents

• Data Extraction: Case Study of NSE Patents

Patent

• A "patent" usually refers to a right granted to anyone who invents or discovers any new and useful process, machine, article of manufacture, or composition of matter, or any new and useful improvement thereof.

– A patent is not a right to practice or use the invention. Rather, it provides the right to exclude others from making, using, selling, or offering for sale the invention, usually for 20 years from the filing date.

– It is a limited property right that the government offers to inventors in exchange for their agreement to share the details of their inventions with the public.

• A patent is also a special type of technical document that records many important innovations and technological advances.

USPTO

• The United States Patent and Trademark Office (USPTO) is an agency in the United States Department of Commerce that provides patent protection to inventors and businesses for their inventions, and trademark registration for product and intellectual property identification.

• Each year, the USPTO issues thousands of patents to companies and individuals worldwide. As of March 2006, the USPTO had issued over 7 million patents, with 3,500 to 4,500 newly granted patents each week.

• The USPTO provides online full-text access to patents issued since 1976.

• URLs:

– USPTO Official Website: http://www.uspto.gov/

– USPTO Patent Search: http://www.uspto.gov/main/search.html

Search USPTO Patents

http://www.uspto.gov/main/search.html

Data Extraction: Case Study of NSE Patents

• Nanoscale Science and Engineering (NSE) field

– Fundamental technology that is critical for a nation’s technological competence.

– Revolutionizes a wide range of application domains.

• Nanotechnology

– Is an applied science/technology field that is multidisciplinary and encompasses engineering and other work taking place at the nanoscale.

– Critical for a nation’s technological competence.

– R&D status attracts various communities’ interest.

Data Extraction Procedure

• The goal is to gather all related patents from the USPTO Web site as free-text HTML pages, parse them into structured data, and store the data in a database.

• Procedure for extracting NSE patents from the USPTO:

1. Spider search results (summary pages)

2. Spider individual patent documents (detailed pages)

3. Noise filtering

4. Parsing

1. Spider search results (summary pages)

• A list of keywords can be used to search for patents related to the NSE domain. The keywords were provided by domain experts.

• A spider program written in Perl was used to spider the search result pages.

Keywords:

atomic force microscope
atomic force microscopic
atomic force microscopy
atomic-force-microscope
atomic-force-microscopy
atomistic simulation
biomotor
molecular device
molecular electronics
molecular modeling
molecular motor
molecular sensor
molecular simulation
nano*
quantum computing
quantum dot*
quantum effect*
scanning tunneling microscope
scanning tunneling microscopic
scanning tunneling microscopy
scanning-tunneling-microscope
scanning-tunneling-microscopy
self assembled
self assembling
self assembly
selfassembl*
self-assembled
self-assembling
self-assembly

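Example code (Perl):

    use strict;
    use LWP;
    use HTML::TokeParser;
    use URI::Escape;

    sub query
    {
        … … … …
        # Get keywords: read one keyword per line from the file named on the command line
        open(my $in, '<', $ARGV[0]) or die "Cannot open $ARGV[0]: $!";
        my @keywords = <$in>;
        close($in);
        … … … …
        # Download search pages: $kw is the keyword, $pno the result page number,
        # and $start/$end the issue date (ISD) range
        my $query_url = "http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=$pno&u=%2Fnetahtml%2Fsearch-bool.html&r=0&f=S&l=50&TERM1=$kw&FIELD1=&co1=AND&TERM2=$start%3E$end&FIELD2=ISD&d=ptx";
        my $response = $browser->get($query_url);
        my $result = $response->content();
        # Save the result page without changing the default output handle
        open(my $out, '>', "$fpage-$pno.html") or die "Cannot write $fpage-$pno.html: $!";
        print $out $result;
        close($out);
    }

    # Set up time range
    query('1/1/2007', '12/31/2007');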
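Keywords read from a file keep their trailing newline and may contain spaces, so they probably need to be trimmed and URL-encoded before they are placed in the query string (the Perl example imports URI::Escape, presumably for this purpose). The following is a minimal Java sketch of that step only. The class and method names are hypothetical; the parameter names (TERM1, FIELD1, co1, TERM2, FIELD2) and the ">" separator between the two dates are copied from the query URL in the Perl example.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class QueryTermsSketch {
    /** Encodes one value for use in a URL query string (spaces become '+'). */
    static String enc(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    /** Builds the keyword and issue-date part of the query string. */
    static String termParams(String keyword, String start, String end) {
        return "TERM1=" + enc(keyword.trim())      // trim() drops the newline left by file reading
             + "&FIELD1=&co1=AND"
             + "&TERM2=" + enc(start + ">" + end)  // date range, separator as in the Perl example
             + "&FIELD2=ISD";
    }

    public static void main(String[] args) {
        System.out.println(termParams("quantum dot*\n", "1/1/2007", "12/31/2007"));
    }
}
```

Encoding each value separately keeps the "&" and "=" separators of the query string intact while still escaping the slashes in the dates.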

Search result page example (screenshot with callout: Patent IDs)

2. Spider individual patent documents (detailed pages)

• In this step, we need to:

– 1st, collect all the patent IDs;

– 2nd, download all the patents based on the patent IDs by using proxies.

• The data set is often very large, so using proxies can save a lot of time.

Download detailed patent documents

Create several files, each of which contains a fixed number of patent IDs (e.g., 300 patent IDs).
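Splitting the ID list into fixed-size batches can be sketched as follows. This is an illustrative Java example, not the code used in the case study; the class and method names are hypothetical, and the batches are kept in memory rather than written to files.

```java
import java.util.ArrayList;
import java.util.List;

class IdBatcher {
    /** Splits ids into consecutive batches of at most batchSize elements. */
    static List<List<String>> split(List<String> ids, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<String>> batches = new ArrayList<>();
        for (int from = 0; from < ids.size(); from += batchSize) {
            int to = Math.min(from + batchSize, ids.size());
            // Copy the view so each batch is independent of the source list
            batches.add(new ArrayList<>(ids.subList(from, to)));
        }
        return batches;
    }
}
```

With a batch size of 300, a list of 650 IDs yields two full batches and a final batch of 50. Each batch could then be written to its own file and handed to one client.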

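Server: send different patent ID files to different client threads.

    … … … …
    open(my $in, '<', $ARGV[0]) or die "Cannot open $ARGV[0]: $!";
    my @theids = <$in>;
    close($in);

    foreach my $theid (@theids) {
        my $new_sock = $sock->accept();     # wait for the next client to connect
        my $buf = <$new_sock>;              # read the client's request line
        print $new_sock $theid . "\n";      # send this entry to the client
        print $buf . " " . $theid . "\n";   # log which client received which entry
        close $new_sock;
    … … … …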

Client: use a proxy to download the patents whose IDs are in the file sent from the server.

    … … … …
    do {
        $response = $browser->get($pat_url);
        if (!$response->is_success()) {
            # Report the failure, then wait a random 1 to 7 seconds before retrying
            print STDOUT $response->status_line, "\n\n";
            sleep(rand(7) + 1);
        }
    } while (!$response->is_success());
    … … … …
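The retry pattern in the client can be sketched in Java as shown here. This is a hedged illustration rather than a translation: unlike the Perl loop, which retries until the download succeeds, the sketch gives up after a fixed number of attempts. The names RetrySketch, retry, maxAttempts and delayUnitMillis are hypothetical, and a null result stands in for a failed download.

```java
import java.util.Random;
import java.util.function.Supplier;

class RetrySketch {
    /**
     * Calls attempt until it returns a non-null value or maxAttempts is reached.
     * Between failures it sleeps a random 1 to 7 units of delayUnitMillis
     * (use 1000 for seconds). Returns null if every attempt fails.
     */
    static <T> T retry(Supplier<T> attempt, int maxAttempts, long delayUnitMillis, Random rng) {
        for (int i = 1; i <= maxAttempts; i++) {
            T result = attempt.get();
            if (result != null) {
                return result;
            }
            if (i < maxAttempts) {
                try {
                    Thread.sleep((1 + rng.nextInt(7)) * delayUnitMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve the interrupt and stop retrying
                    return null;
                }
            }
        }
        return null;
    }
}
```

The random delay spreads requests out so that several clients do not retry at the same moment; the attempt limit is an addition that keeps a permanently failing URL from blocking a client forever.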

Patent document example (screenshot)

3. Noise filtering

• Some of the patents we gathered may match only noise keywords, that is, terms that match the nano* wildcard but are unrelated to NSE; some may even contain no NSE keywords at all.

– Such patents need to be filtered out.

• Noise keywords include:

– nanosecond

– nanoliter

– nano$

– nano-second

– nano-liter

– nano.sub

– nano [space]

– nano2
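One way to apply such a noise list is sketched below in Java. It is not the filter used in the case study, and it covers only terms matched by the nano* wildcard; patents found through other keywords would need a separate check. The reading of the list entries is an assumption: "nano$" and "nano [space]" are taken to mean the bare word "nano", "nano.sub" a token starting with "nano.sub", and "nano2" the word "nano" followed by a digit. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NanoNoiseFilter {
    // Any token starting with "nano", e.g. "nanotube", "nano-second", "nano.sub.2"
    private static final Pattern NANO_TOKEN =
        Pattern.compile("\\bnano[\\w.-]*", Pattern.CASE_INSENSITIVE);

    // Assumed reading of the noise list: unit words (with optional hyphen and plural),
    // the bare word "nano", "nano.sub...", and "nano" followed by a digit.
    // Trailing '.' or '-' picked up by NANO_TOKEN is tolerated.
    private static final Pattern NOISE =
        Pattern.compile("nano(-?(second|liter)s?|\\.sub.*|\\d.*)?[.-]*", Pattern.CASE_INSENSITIVE);

    /** True if the whole token is one of the assumed noise forms. */
    static boolean isNoise(String token) {
        return NOISE.matcher(token).matches();
    }

    /** True if the text contains at least one nano* token that is not noise. */
    static boolean hasGenuineNanoTerm(String text) {
        Matcher m = NANO_TOKEN.matcher(text);
        while (m.find()) {
            if (!isNoise(m.group())) {
                return true;
            }
        }
        return false;
    }
}
```

A patent whose only nano* matches are noise tokens, and which matches no other NSE keyword, would be a candidate for removal.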

4. Parsing

• Extract the different data fields from the HTML patent documents and load them into the database.

Database Design (USPTO)

• USP_Patent (PK: patentId): issueDate, title, appSerialNumber, appDate, appType, attorneyAgent, primaryExaminer, assistantExaminer

• USP_inventor (PK: inventorId): iLname, iMname, iFname, iCity, iState, iCoutnry

• USP_Patent_Inventor (PK: patentId, inventorId)

• USP_Assignee (PK: AssigneeId): aName, aCity, aState, aCountry

• USP_Patent_Assignee (PK: patentId, assgneeId)

• USP_OtherRef (PK: refenceId): CitingPatentId, reference

• USP_usClass (PK: patentId): us_Class1, us_Class2, major

• USP_Countryname (PK: Country_lable): Country_fullname

• USP_Patent_Content (PK: patentId): Abstract, Title, Claim

• USP_Patent_Citation (PK: CitingPatentId, CitedPatentId): CitedPatentDate

• USP_Foreignref (PK: CitingPatentId, CitedPatentID): CitedPatentDate, CitedPatentSource

• USP_Int_Class (PK: patentId): section, class, subclass, maingroup, subgroup

Parsing example: parsing assignee data

    public static void processAssignees( ) throws IOException {
        … … … …
        String[] assignees = assigneeString.split("<BR>");
        for (int i = 0; i < assignees.length; i++) {
            currentassignee = assignees[i].trim();
            if (currentassignee.length() == 0)
                continue;
            currentassignee = currentassignee.replaceAll("\r\n", "");

            // Process the name: the text inside the first <B>...</B> pair
            name = findBetween(currentassignee, 0, "<B>", "</B>");
            currPosition = currentassignee.indexOf("</B>") + "</B>".length();

            // Process the address: the parenthesized text after the name
            address = findBetween(currentassignee, currPosition, "(", ")");
            if (address == null) {
                System.err.println("wrong address: " + patentId);
                continue; // nothing to parse; avoids a NullPointerException below
            }
            int startIndex = 0, endIndex = 0;
            if ((endIndex = address.lastIndexOf(',')) >= 0) {
                city = address.substring(0, endIndex);
                if (city.lastIndexOf(',') >= 0) {
                    city = city.substring(city.lastIndexOf(',') + 1);
                    // Strings are immutable, so the result must be assigned
                    city = city.replaceAll("[^a-zA-Z]", "");
                }
                startIndex = endIndex + 1;
            } else
                city = "-";
            address = address.substring(startIndex);
            // A bold country code marks a non-US address; otherwise the remainder is the state
            country = findBetween(address, 0, "<B>", "</B>");
            if (country == null) {
                country = "US";
                state = address.trim();
            } else
                state = "-";
            name = name.trim();
            city = city.trim();
            state = state.trim();
            rank++; // keep the ranking order
        }
    }
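The parsing example calls a helper named findBetween that is not shown on the slide. The sketch below is one plausible implementation, written as an assumption about its behavior: it returns the text between the first occurrence of an opening marker at or after a start position and the next closing marker, or null when either marker is missing (the parsing example does check for a null result). The enclosing class name is hypothetical.

```java
class FindBetweenSketch {
    /**
     * Returns the text between the first 'open' found at or after 'from'
     * and the next 'close' after it, or null if either marker is missing.
     */
    static String findBetween(String text, int from, String open, String close) {
        int start = text.indexOf(open, from);
        if (start < 0) {
            return null;
        }
        start += open.length();
        int end = text.indexOf(close, start);
        if (end < 0) {
            return null;
        }
        return text.substring(start, end);
    }

    public static void main(String[] args) {
        String entry = "<B>IBM</B> (Armonk, NY)";
        System.out.println(findBetween(entry, 0, "<B>", "</B>")); // name
        System.out.println(findBetween(entry, 10, "(", ")"));     // address
    }
}
```

For the sample entry, the first call yields the name "IBM" and the second, starting just after the closing </B>, yields the address "Armonk, NY".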

Data Analysis Examples

• Bibliographic analysis

– Top 50 countries

    select c.countryName, count(distinct b.patentId)
    from usp_assignee a, usp_patentAssignee b, usp_countryName c
    where a.assigneeId = b.assigneeId
      and a.aCountry not in ('unknown', '')
      and a.aCountry = c.countryCode
    group by c.countryName
    order by count(distinct b.patentId) desc

Rank  Assignee Country             Number of Patents
1     United States                13,506
2     Japan                        2,653
3     Federal Republic of Germany  836
4     France                       534
5     China (Taiwan)               428
6     Republic of Korea            406
7     Canada                       333
8     Netherlands                  325
9     Australia                    276
10    United Kingdom               258
11    Switzerland                  193
12    Israel                       163
13    Sweden                       108
14    Belgium                      106
15    Italy                        82
16    Singapore                    70
17    China                        66
18    Denmark                      56
19    Finland                      51
20    India                        39
21    Hong Kong                    33
22    Bermuda                      28
23    Ireland                      26
24    Austria                      24
25    Norway                       23
26    Spain                        15
27    Liechtenstein                13
28    Barbados                     13
29    British Virgin Islands       7
30    New Zealand                  7

Citation Network Analysis

Software used: Graphviz (http://www.pixelglow.com/graphviz/download/)
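Graphviz reads graphs written in its DOT language, so citation pairs such as the (CitingPatentId, CitedPatentId) rows of USP_Patent_Citation can be turned into a drawable network with a few lines of code. The Java sketch below is illustrative only; the class name is hypothetical and the patent IDs in the usage example are made up.

```java
import java.util.List;

class CitationDot {
    /** Renders citing -> cited pairs as a Graphviz DOT directed graph. */
    static String toDot(List<String[]> citations) {
        StringBuilder sb = new StringBuilder("digraph citations {\n");
        for (String[] pair : citations) {
            // pair[0] is the citing patent, pair[1] the cited patent
            sb.append("  \"").append(pair[0]).append("\" -> \"")
              .append(pair[1]).append("\";\n");
        }
        return sb.append("}\n").toString();
    }

    public static void main(String[] args) {
        // Hypothetical patent IDs, for illustration only
        System.out.print(toDot(List.of(
            new String[] {"7000001", "6000002"},
            new String[] {"7000001", "6000003"})));
    }
}
```

Saving the output as citations.dot and running `dot -Tpng citations.dot -o citations.png` produces an image of the network.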

Content Map Analysis

Software used: a multi-level self-organizing map algorithm developed by the AI Lab at the University of Arizona


Thanks!