using patstat in universities evaluation procedures

Using patstat in national evaluation procedures a joint exercise with the Italian Ministry of Education, University and Research. CRIOS Center for Research on Innovation, Organization and Strategy 1 Gianluca Tarasconi, Crios DBA rawpatentdata.blogspot.com PATSTAT user day, Tokyo 19/11/2014

Upload: gianluca-tarasconi

Post on 11-Jul-2015


TRANSCRIPT

Page 1: Using patstat in universities evaluation procedures

Using PATSTAT in national evaluation procedures: a joint exercise with the Italian Ministry of Education, University and Research.

CRIOS

Center for Research on Innovation, Organization and Strategy

Gianluca Tarasconi, CRIOS DBA, rawpatentdata.blogspot.com

PATSTAT user day, Tokyo 19/11/2014

Page 2: Using patstat in universities evaluation procedures

In short

• This is joint work with Francesco Lissoni: a methodology (derived from the APE-Inv project) used to match PATSTAT inventor names to a full list of researchers working in Italian universities.

• The goal is to have higher recall, leaving institutions/researchers to validate the data.

• The focus is not on results (the evaluation is still in progress) but on data processing, selection and the match algorithm, highlighting some difficulties and the related workarounds.

Page 3: Using patstat in universities evaluation procedures

Starting point 1: researchers

Data have been provided by ANVUR (University Ministry).

Researchers:

85.955 distinct researchers (with institution and SDS)

83.568 distinct names

Filter on scientific disciplinary sector (SDS):

Researchers in sectors with low patenting propensity were dropped (we lose some potential matches, who would be garage inventors rather than academic ones).

After filter (population to match):

57.473 distinct researchers

56.282 distinct names

Page 4: Using patstat in universities evaluation procedures

Starting point 2: inventors

Documents: patents of invention from EPO, UIBM (=IT), USPTO, IB + PCT (EP/US/IT), with first publication year between 2011 and 2013.

PATSTAT, October 2013 edition (we lose part of the 2013 documents).

The object of the evaluation is the number of inventions; thus only one application (the oldest) per docdb family is retained.

Page 5: Using patstat in universities evaluation procedures

Starting point 2b: docdb families

Since docdb_family_id is not stable, it could create problems in incremental updates across years.

A stable version of the family ID has been introduced, using for each family the appln_id of its oldest member; where two members were filed on the same day, the smaller appln_id is taken.

The idea underlying this algorithm is that docdb families can only grow by appending younger applications. (This is correct in principle, unless, for example, older back files are loaded or corrections are made. Even in that case, one could easily identify these cases and update the table/algorithm.)
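The stable family ID described above can be sketched in a few lines of Python. This is only an illustration, assuming the stable ID is the appln_id of the earliest-filed member (the reading consistent with families growing only through younger applications). The function name `stable_family_ids`, the tuple layout and the sample values are hypothetical; only appln_id and docdb_family_id come from the slides.

```python
from datetime import date


def stable_family_ids(applications):
    """Map each docdb_family_id to the appln_id of its earliest-filed member.

    `applications` is an iterable of (appln_id, docdb_family_id, filing_date)
    tuples; ties on the filing date are broken by the smaller appln_id.
    """
    best = {}  # family id -> (filing_date, appln_id) of the chosen member
    for appln_id, family_id, filing_date in applications:
        key = (filing_date, appln_id)
        if family_id not in best or key < best[family_id]:
            best[family_id] = key
    return {family_id: key[1] for family_id, key in best.items()}


# Hypothetical sample data, not taken from PATSTAT
sample = [
    (300, 1, date(2011, 5, 2)),
    (120, 1, date(2010, 3, 1)),
    (110, 1, date(2010, 3, 1)),  # same day as 120: the smaller appln_id wins
    (500, 2, date(2012, 7, 9)),
]
stable = stable_family_ids(sample)

# Appending a younger application must leave the stable ID unchanged
stable_after_update = stable_family_ids(sample + [(900, 1, date(2013, 1, 1))])
```

Under these assumptions the ID only changes when an older member is loaded later, which is exactly the back-file case the slide mentions.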

Page 6: Using patstat in universities evaluation procedures

Starting point 2c: counts

Number of published patents and inventors in match population:

# of published patents

      2011      2012      2013
EP    71.381    60.451    30.894
IT    9.386     8.824     4.857
US    198.443   203.430   120.977
WO    56.133    59.685    37.703

# of inventors (person_ids)

US       764.241
IT+IB    198.085
EPO      401.534

We observe a drop in the number of patents in 2013, due to the use of the PATSTAT October edition.

Seeking 56.282 names among 1.363.860 inventors.

Page 7: Using patstat in universities evaluation procedures

Adding Missing Patents using OPS (I)

Universities must be able to add missing patents; validation could be a problem.

Solution: input data (publication number) are validated using EPO OPS services. In return we also get title, inventors and application id.

An OPS query is easy to build (see example: http://ops.epo.org/3.1/rest-services/published-data/publication/epodoc/WO2012139653/biblio).

The response is also easy to parse for information.
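As an illustration of how simple the query is to build, here is a minimal Python sketch that assembles the URL from the two user inputs. The URL pattern is copied from the example on the slide; the function name `build_ops_biblio_url` and the constant `OPS_BASE` are hypothetical, and the sketch does not send the request (the service version and access rules may differ today).

```python
# Base path taken from the example URL on the slide (OPS version 3.1)
OPS_BASE = "http://ops.epo.org/3.1/rest-services/published-data/publication/epodoc"


def build_ops_biblio_url(publn_auth: str, publn_nr: str) -> str:
    """Build the bibliographic-data URL for a publication in epodoc format."""
    # epodoc numbers concatenate the authority code and the publication number
    number = publn_auth.strip().upper() + publn_nr.strip()
    return f"{OPS_BASE}/{number}/biblio"


url = build_ops_biblio_url("WO", "2012139653")
print(url)
```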

Page 8: Using patstat in universities evaluation procedures

Adding Missing Patents using OPS (II)

Flowchart (steps as recovered from the slide):

• User inputs publn_auth and publn_nr

• PHP creates the OPS URL and queries it

• Valid answer? If no: "No valid auth/number", re-enter data

• If yes: format publication date, appln_id, title, inventors

• Publication date in range and appln_id not yet extracted? If no: "Error: reenter data"

• If yes: display title and inventors; the researcher gives the ok

• Store data

• Ready for new input
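The flowchart above can be sketched as a single function. The original tool was written in PHP; this Python version is only an illustration. The names `process_submission`, `fetch_biblio`, `confirm` and `stored` are hypothetical, the OPS call is replaced by an injected function so the sketch runs offline, and the "rejected by researcher" status is an assumption (the slide does not show that branch). The 2011 to 2013 date range is the publication window stated earlier in the slides.

```python
from datetime import date


def process_submission(publn_auth, publn_nr, fetch_biblio, stored, confirm,
                       first_day=date(2011, 1, 1), last_day=date(2013, 12, 31)):
    """Run one pass of the flowchart and return a status string.

    fetch_biblio(auth, nr) returns a dict with publn_date, appln_id, title and
    inventors, or None when OPS gives no valid answer.
    stored maps appln_id -> record for patents already extracted or added.
    confirm(record) returns True when the researcher approves what is shown.
    """
    record = fetch_biblio(publn_auth, publn_nr)
    if record is None:
        return "no valid auth/number"
    in_range = first_day <= record["publn_date"] <= last_day
    if not in_range or record["appln_id"] in stored:
        return "error: reenter data"
    if not confirm(record):  # assumption: a refusal simply stores nothing
        return "rejected by researcher"
    stored[record["appln_id"]] = record
    return "stored"


def fake_fetch(auth, nr):
    """Stand-in for the OPS call; date, id, title and inventors are made up."""
    if (auth, nr) == ("WO", "2012139653"):
        return {"publn_date": date(2012, 10, 18), "appln_id": 1,
                "title": "Example title", "inventors": ["A B"]}
    return None


stored = {}
first = process_submission("WO", "2012139653", fake_fetch, stored, lambda r: True)
second = process_submission("WO", "2012139653", fake_fetch, stored, lambda r: True)
invalid = process_submission("XX", "0000", fake_fetch, stored, lambda r: True)
```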

Page 9: Using patstat in universities evaluation procedures

Match criteria (I)

Preprocessing: charset normalization.

The match technique is based on token similarity, improved with clustering using bigrams (see Pezzoni Lissoni Tarasconi 2014).

Researcher addresses are not provided; applicants do not contribute to recall either (academic patents are often sold at an early stage, so we would lose individual patents).
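To make the ingredients concrete, the following Python sketch shows charset normalization, a token-based similarity and a bigram-based similarity. It is not the algorithm of Pezzoni, Lissoni and Tarasconi (2014): the Jaccard and Dice measures are stand-ins chosen for brevity, and all function names are hypothetical.

```python
import unicodedata


def normalize_name(name: str) -> str:
    """Strip accents, upper-case, and turn punctuation into single spaces."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = "".join(ch if ch.isalnum() else " " for ch in no_accents.upper())
    return " ".join(cleaned.split())


def token_similarity(a: str, b: str) -> float:
    """Jaccard similarity between the sets of name tokens (order-insensitive)."""
    ta, tb = set(normalize_name(a).split()), set(normalize_name(b).split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def bigrams(text: str) -> set:
    """Set of two-character substrings of the normalized name, spaces removed."""
    s = normalize_name(text).replace(" ", "")
    return {s[i:i + 2] for i in range(len(s) - 1)}


def bigram_similarity(a: str, b: str) -> float:
    """Dice coefficient on character bigrams; tolerant of small misspellings."""
    ba, bb = bigrams(a), bigrams(b)
    if not ba or not bb:
        return 0.0
    return 2 * len(ba & bb) / (len(ba) + len(bb))


print(normalize_name("Niccolò De-Rossi"))
print(token_similarity("Rossi, Mario", "MARIO ROSSI"))
print(bigram_similarity("Tarasconi", "Tarascon"))
```

Token similarity handles reordered name parts, while the bigram measure catches the one-letter misspellings that would otherwise split the same inventor into separate clusters.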

Page 10: Using patstat in universities evaluation procedures

Match criteria (II)

The atomic entity chosen in PATSTAT for data validation is the pair person_id / docdb_family_id (a person_id identifies a unique combination of country/name/address, so a single person_id may cover 2 homonyms, especially where address or country is empty or not completely filled in).

Page 11: Using patstat in universities evaluation procedures

Which inventor to match? (I)

• Within the same docdb family, two applications may have different inventor person_ids.

• By just appending all the inventors we may get a lot of duplicates (due to misspellings/change of address).

• The final choice was to include for the match only the inventors of the first publication (in chronological order) in the family.
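A minimal Python sketch of this selection rule follows. The data layout and the name `inventors_of_first_publication` are hypothetical, and breaking ties on the publication date by the smaller appln_id is an assumption not stated on the slide.

```python
from datetime import date


def inventors_of_first_publication(publications, inventors_by_appln):
    """Return {docdb_family_id: set of person_ids} for the earliest publication.

    publications: iterable of (docdb_family_id, appln_id, publn_date) tuples.
    inventors_by_appln: dict mapping appln_id to an iterable of person_ids.
    """
    first = {}  # family id -> (publn_date, appln_id) of the earliest publication
    for family_id, appln_id, publn_date in publications:
        key = (publn_date, appln_id)
        if family_id not in first or key < first[family_id]:
            first[family_id] = key
    return {family_id: set(inventors_by_appln.get(key[1], ()))
            for family_id, key in first.items()}


# Hypothetical sample: family 1 has two applications with different inventors
publications = [
    (1, 10, date(2011, 3, 1)),
    (1, 11, date(2012, 6, 1)),
    (2, 20, date(2013, 1, 15)),
]
inventors_by_appln = {10: [101, 102], 11: [101, 103, 104], 20: [201]}
to_match = inventors_of_first_publication(publications, inventors_by_appln)
```

Here person_ids 103 and 104 are ignored because they appear only on the later application of family 1.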

Page 12: Using patstat in universities evaluation procedures

Which inventor to match? (II)

• Evidence supporting this choice:

• 95% of families have applications with the same number of inventors

• 84.9% of families have the same inventor person_ids across all applications

• (suggestions are welcome)

Difference in the number of inventors across applications of the same family:

delta    n families    %
0        36.048.365    95,523%
1        859.567       2,278%
2        413.670       1,096%
3        206.235       0,546%
4        101.529       0,269%
5        48.545        0,129%
6        25.400        0,067%
7        13.372        0,035%
8        7.775         0,021%
9        4.432         0,012%
10       2.972         0,008%
11       1.697         0,004%

Page 13: Using patstat in universities evaluation procedures

Results

A preliminary version (called max recall) was produced.

At a glance, this version showed some problems:

• The patenting profile was very different from the expected one

• The most productive researchers had Asian names

Page 14: Using patstat in universities evaluation procedures

Maximum Recall (I)

12.122 distinct couples person_id/family

4.681 researchers

10.014 patent families

7.569 inventors


Page 15: Using patstat in universities evaluation procedures

Maximum Recall (II)

ISI-OST-INPI 30 classes IPC reclassification (note 1): ranking

Class    Description                                   N
4        Information technology                        1498
1        Electrical engineering                        1021
3        Telecommunications                            946
7        Technologies for Control/Measures/Analysis    929
5        Semiconductors                                734
16       Pharmaceuticals; Cosmetics                    646
8        Medical engineering                           576
29       Consumer goods                                543
2        Audiovisual technology                        470
15       Biotechnologies                               450
19       Handling; Printing                            418
10       Organic chemistry                             409
27       Transport technology                          378
20       Materials processing                          342
6        Optics                                        341

1: see Schmoch et al, Linking Technology Areas to Industrial Sectors, 2003

Page 16: Using patstat in universities evaluation procedures

Maximum Recall (III): The Asian paradox

Name Surname    n pat    Acad qualif
Y C             776      PA
W C             735      AS
Y W             403      DO
W W             321      AS
M W L           174      DO
Y Z             171      AS
Y L             148      DO
J C             144      DO
H L             141      DO
M L             124      DO
Q Z             93       DO
S C             76       DO
Y S             68       DO
Y Y             64       DO
B W             63       DO
Z L             58       DO
B Z             57       DO
AJ A            50       AS
T J S           50       RU
Y Y P           43       AS
F L             41       AS
Z C             40       DO
A B             40       RU

The 2 researchers in the first two positions of the ranking share the same Asian surname and together account for over 1500 patents in 3 years.

First non-Asian name in 20th position.

First Italian name in 25th position.

Researchers' names are anonymized for privacy reasons (at ANVUR's request).

Page 17: Using patstat in universities evaluation procedures

Geographic based filter (I)

             Max recall             Geo based prec.
Appl auth    N families    %        N families    %
IT           3114          31       3114          70,8
EP           1293          13       590           13,4
US           4980          50       395           9,0
WO           627           6,3      299           6,8

For EP, WO and US, only inventors with an address based in IT are matched.

All inventors in UIBM are matched (address missing).

We lose part of the possible matches (international collaborations).
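The filter rule above can be sketched as a small Python predicate. The function name `passes_geo_filter` and the tuple layout are hypothetical; the authority and country codes are the two-letter codes used in the slides.

```python
def passes_geo_filter(appln_auth: str, inventor_ctry: str) -> bool:
    """Keep every UIBM (IT) inventor; elsewhere require an Italian address."""
    if appln_auth == "IT":
        return True  # UIBM records carry no address, so no filter can apply
    return inventor_ctry == "IT"


# Hypothetical (application authority, inventor country) pairs
candidates = [("IT", ""), ("EP", "IT"), ("US", "CN"), ("WO", "IT"), ("US", "")]
kept = [c for c in candidates if passes_geo_filter(*c)]
```

Note that an EP, WO or US inventor with an empty country code is dropped, which is one way the filter trades recall for precision.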

Page 18: Using patstat in universities evaluation procedures

Geographic based filter (II)

Matches compared to maximum recall

                                     Geo based    Max recall
distinct pairs person_id/family      6360         12122
researchers                          4107         4681
patent families                      4398         10014
inventors                            4314         7569

The number of inventors matched decreases by about 43% (from 7569 to 4314), but researcher recall is still very high.
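The comparison can be made explicit with a few lines of Python that compute, for each row, the share retained after the geographic filter. The counts are the ones on the slide; the dictionary layout and variable names are just for illustration.

```python
# (geo-based filter, maximum recall) counts taken from the slide
counts = {
    "pairs person_id/family": (6360, 12122),
    "researchers": (4107, 4681),
    "patent families": (4398, 10014),
    "inventors": (4314, 7569),
}

# Percentage of the maximum-recall matches that survive the filter
retained = {name: 100 * geo / full for name, (geo, full) in counts.items()}

for name, share in retained.items():
    print(f"{name}: {share:.1f}% retained, {100 - share:.1f}% dropped")
```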

Page 19: Using patstat in universities evaluation procedures

Geographic based filter (III)

Max recall (top classes, for comparison)

4     Information technology                        1498
1     Electrical engineering                        1021
3     Telecommunications                            946
7     Technologies for Control/Measures/Analysis    929
5     Semiconductors                                734
16    Pharmaceuticals; Cosmetics                    646

OST30 class ranking

7     Technologies for Control/Measures/Analysis              497    8,45%
16    Pharmaceuticals; Cosmetics                              408    6,94%
29    Consumer goods                                          364    6,19%
8     Medical engineering                                     335    5,70%
19    Handling; Printing                                      297    5,05%
3     Telecommunications                                      274    4,66%
1     Electrical engineering                                  268    4,56%
27    Transport technology                                    261    4,44%
15    Biotechnologies                                         252    4,29%
30    Civil engineering                                       231    3,93%
10    Organic chemistry                                       229    3,89%
20    Materials processing                                    228    3,88%
4     Information technology                                  224    3,81%
11    Macromolecular chemistry                                160    2,72%
18    Technical processes (chemical, physical, mechanical)    159    2,70%

More in line with the Italian tech profile.

Page 20: Using patstat in universities evaluation procedures

Geographic based filter (IV):

Name Surname    npat    Acad title    Sector        Institution
A B             38      PO            MED/38        Università degli Studi di MILANO-BICOCCA
A P             25      AS            CHIM/06       Università degli Studi del PIEMONTE ORIENTALE "Amedeo Avogadro"-Vercelli
P B             25      RU            ING-INF/05    Politecnico di TORINO
P B             25      PO            MED/04        Università degli Studi di PADOVA
F C             24      RD            FIS/03        Università degli Studi di GENOVA
M M             23      PO            BIO/17        Università degli Studi di UDINE
R L             19      DO            ING-INF/05    Università degli Studi di FERRARA
P D             18      PO            ING-IND/34    Scuola Superiore di Studi Universitari e Perfezionamento Sant'Anna
M M             17      PA            ING-IND/20    Politecnico di MILANO
A B             17      PA            MED/04        Università degli Studi di PERUGIA
P F             16      RD            ICAR/13       Università IUAV di VENEZIA
P F             16      RU            ING-INF/07    Università degli Studi di BRESCIA
R A P           15      PA            AGR/11        Università degli Studi di SASSARI
G F Z           15      PO            MED/20        Università degli Studi di PADOVA
A M             15      PA            ING-IND/34    Scuola Superiore di Studi Universitari e Perfezionamento Sant'Anna
M P             4       DO            ING-IND/32    Università degli Studi di PADOVA

The top 15 have Italian names; the first foreign name ranks 16th.