Using PATSTAT in universities evaluation procedures
Using PATSTAT in national evaluation procedures
A joint exercise with the Italian Ministry of Education,
University and Research.
CRIOS
Center for Research on Innovation, Organization and Strategy
• Gianluca Tarasconi, CRIOS DBA, rawpatentdata.blogspot.com
PATSTAT user day, Tokyo 19/11/2014
In short
• This is joint work with Francesco Lissoni: a
methodology (derived from the APE-INV project)
used to match PATSTAT inventor names to a
full list of researchers working in Italian
universities.
• The goal is higher recall, leaving
institutions/researchers to validate the data.
• The focus is not on results (evaluation still in
progress) but on data processing, selection
and the matching algorithm, highlighting some
difficulties and the related workarounds.
Starting point 1: researchers
Data have been provided by ANVUR (University Ministry)
Researchers:
85.955 distinct researchers (with institution and SDS)
83.568 distinct names
Filter on scientific disciplinary sector (SDS):
Researchers in sectors with low patenting propensity were dropped (we lose some potential matches, who are garage inventors rather than academic ones)
After filter (population to match):
57.473 distinct researchers
56.282 distinct names
Starting point 2: inventors
Documents: patents of invention from EPO,
UIBM (=IT), USPTO, IB + PCT
(EP/US/IT) with first publication year
between 2011 and 2013.
Source: PATSTAT, October 2013 edition (we lose part of
the 2013 documents).
The object of the evaluation is the number of
inventions; thus only one application
(the oldest) per docdb family is retained.
Starting point 2b: docdb families
Since docdb_family_id is not stable, it could
create problems in incremental updates
across years.
A stable version of the family ID has been introduced,
using for each family the appln_id of its earliest
member; where two members were filed on the same day,
the smaller appln_id is taken.
The idea underlying this algorithm is that docdb
families can only be extended with younger
applications. (This is correct in principle, except when, for example, older
back files are loaded or corrections are made. But even in that
case, one could easily identify these cases and update the
table/algorithm.)
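The stable family ID rule can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the anchor is the earliest-filed member of each family (as the rationale about younger applications suggests), and the function name and the tuple layout `(appln_id, docdb_family_id, filing_date)` are hypothetical.

```python
from datetime import date

def stable_family_ids(applications):
    """Map each docdb_family_id to the appln_id of its earliest-filed member.

    `applications` is an iterable of (appln_id, docdb_family_id, filing_date)
    tuples (a hypothetical layout). Ties on the filing date are broken by
    taking the smaller appln_id.
    """
    best = {}
    for appln_id, family_id, filing_date in applications:
        key = (filing_date, appln_id)
        if family_id not in best or key < best[family_id]:
            best[family_id] = key
    return {family_id: key[1] for family_id, key in best.items()}

# Made-up sample: family 10 has two members filed on the same day.
sample = [
    (300, 10, date(2011, 5, 2)),
    (120, 10, date(2010, 3, 1)),
    (115, 10, date(2010, 3, 1)),
    (900, 20, date(2012, 7, 9)),
]
stable = stable_family_ids(sample)

# Appending a younger application to family 10 leaves its stable ID unchanged.
stable_after_update = stable_family_ids(sample + [(999, 10, date(2013, 1, 1))])
```

Under this reading, the ID only changes if an older member is loaded later, which is the back-file case mentioned in the slide.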
Starting point 2c: counts
Number of published patents and
inventors in the match population:

# of published patents
      2011      2012      2013
EP    71.381    60.451    30.894
IT     9.386     8.824     4.857
US   198.443   203.430   120.977
WO    56.133    59.685    37.703

# of inventors (person_ids)
US       764.241
IT+IB    198.085
EPO      401.534

We observe a drop in the
number of patents in 2013,
due to the use of the PATSTAT
October edition.
Seeking 56.282 names among
1.363.860 inventors
Adding Missing Patents using
OPS (I)
Universities must be able to add missing
patents; validation could be a problem.
Solution: input data (publication number)
are validated using EPO OPS services.
In return we also get title, inventors and
application id.
An OPS query is easy to build (see
example: http://ops.epo.org/3.1/rest-services/published-data/publication/epodoc/WO2012139653/biblio )
The response is also easy to parse for information.
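Building the query URL from the two user inputs could look like the sketch below. The original tool was written in PHP; this Python version only mirrors the URL pattern shown in the example, and the function name and the input checks are assumptions. No request is sent here.

```python
OPS_BASE = "http://ops.epo.org/3.1/rest-services/published-data/publication/epodoc"

def build_ops_biblio_url(publn_auth, publn_nr):
    """Build the OPS bibliographic-data URL for one publication.

    The sanity checks below are illustrative only; real validation
    happens when OPS answers (or fails to answer) the query.
    """
    auth = publn_auth.strip().upper()
    number = publn_nr.strip()
    if not (len(auth) == 2 and auth.isascii() and auth.isalpha()):
        raise ValueError("publn_auth must be a two-letter code, e.g. WO")
    if not (number.isascii() and number.isalnum()):
        raise ValueError("publn_nr must be alphanumeric")
    return f"{OPS_BASE}/{auth}{number}/biblio"

url = build_ops_biblio_url("wo", "2012139653")
```

With the inputs above, the function reproduces the example URL from the slide.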
Adding Missing Patents using
OPS (II)
Flow:
1. User inputs publn_auth and publn_nr.
2. PHP creates the OPS URL and queries OPS.
3. Valid answer? If no (no valid auth/number): error, re-enter data.
4. Format publ date, appln_id, title, inventors.
5. Publn date in range and appln_id not yet extracted? If no, the patent is not added.
6. Display title and inventors; researcher gives ok.
7. Store data.
8. Ready for new input.
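The two checks in this flow can be sketched as a single function. This is only an illustration in Python (the original was PHP): the record layout, the function name and the returned messages are hypothetical, and the 2011-2013 window is the publication range stated earlier in the slides.

```python
from datetime import date

FIRST_YEAR, LAST_YEAR = 2011, 2013  # first-publication window used in the exercise

def check_record(record, extracted_appln_ids):
    """Return (accepted, message) for one parsed OPS answer.

    `record` is None when OPS gave no valid answer; otherwise it is a
    dict with at least 'publn_date' (a date) and 'appln_id' (hypothetical layout).
    """
    if record is None:
        return False, "no valid auth/number: re-enter data"
    if not FIRST_YEAR <= record["publn_date"].year <= LAST_YEAR:
        return False, "publication date out of range"
    if record["appln_id"] in extracted_appln_ids:
        return False, "appln_id already extracted"
    return True, "display title and inventors for confirmation"

# Made-up values, for illustration only.
rec = {"publn_date": date(2012, 10, 18), "appln_id": 42}
accepted_new = check_record(rec, set())
accepted_dup = check_record(rec, {42})
```

Only records that pass both checks reach the confirmation step, after which the researcher's ok triggers storage.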
Match criteria (I)
Preprocessing: charset normalization.
The matching technique is based on token
similarity, improved with clustering using
bigrams (see Pezzoni, Lissoni, Tarasconi 2014).
Researcher addresses are not
provided; applicants also do not
contribute to recall (academic patents are often sold at
an early stage, so we would lose individual patents).
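To make the ingredients concrete, here is a small sketch of charset normalization, token comparison and bigram similarity. It is not the Pezzoni, Lissoni, Tarasconi algorithm: the Dice coefficient, the 0.8 threshold and all function names are assumptions chosen for illustration.

```python
import unicodedata

def normalize(name):
    """Strip accents, uppercase, and turn punctuation into single spaces."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    cleaned = "".join(c if c.isalnum() else " " for c in no_accents.upper())
    return " ".join(cleaned.split())

def tokens(name):
    """Order-independent set of name tokens."""
    return frozenset(normalize(name).split())

def bigrams(name):
    """Set of character bigrams of the normalized name, spaces removed."""
    s = normalize(name).replace(" ", "")
    return {s[i:i + 2] for i in range(len(s) - 1)}

def dice(a, b):
    """Dice coefficient on character bigrams (0.0 when either side is empty)."""
    x, y = bigrams(a), bigrams(b)
    if not x or not y:
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))

def candidate_match(a, b, threshold=0.8):
    """Same token set, or bigram similarity above an assumed threshold."""
    return tokens(a) == tokens(b) or dice(a, b) >= threshold
```

Token sets make "Rossi, Mario" and "MARIO ROSSI" identical, while the bigram score lets a misspelling such as "Rosi Mario" still surface as a candidate for validation.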
Match criteria (II)
The atomic entity chosen in PATSTAT for
data validation is the pair person_id /
docdb_family_id (since a person_id identifies a unique
country/name/address combination, it may cover 2 homonyms, especially where address
or country is empty or not completely filled in).
Which inventor to match? (I)
• Within the same docdb family, two
applications may have different
inventor person_ids.
• Just appending all the inventors, we may
get a lot of duplicates (due to
misspellings/changes of address).
• The final choice was to include in the match
only the inventors of the first publication (in
chronological order) in the family.
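The selection rule can be sketched as below. The tuple layout, the function name and the tie-break on appln_id are assumptions; dates are ISO strings so that they sort chronologically.

```python
def inventors_of_first_publication(publications):
    """Return the set of (person_id, docdb_family_id) pairs to match.

    `publications` is an iterable of
    (docdb_family_id, publn_date, appln_id, person_ids) tuples (hypothetical layout).
    For each family only the inventors of the earliest publication are kept.
    """
    first = {}
    for family_id, publn_date, appln_id, person_ids in publications:
        key = (publn_date, appln_id)
        if family_id not in first or key < first[family_id][0]:
            first[family_id] = (key, tuple(person_ids))
    return {
        (person_id, family_id)
        for family_id, (_, persons) in first.items()
        for person_id in persons
    }

# Made-up sample: the later application of family 10 lists an extra person_id.
pubs = [
    (10, "2011-03-01", 120, [1, 2]),
    (10, "2012-01-15", 300, [1, 2, 3]),
    (20, "2012-07-09", 900, [4]),
]
pairs = inventors_of_first_publication(pubs)
```

The output is already expressed as person_id / docdb_family_id pairs, the unit used for validation.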
Which inventor to match? (II)
• Evidence supporting this choice:
• 95% of families have applications
with the same number of
inventors
• 84.9% of families have the
same inventor person_ids
across all applications
• (suggestions are welcome)

delta   n families    %
0       36.048.365    95,523%
1          859.567     2,278%
2          413.670     1,096%
3          206.235     0,546%
4          101.529     0,269%
5           48.545     0,129%
6           25.400     0,067%
7           13.372     0,035%
8            7.775     0,021%
9            4.432     0,012%
10           2.972     0,008%
11           1.697     0,004%
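A distribution like the one in this table could be computed as sketched below. The slide does not define delta; the sketch assumes it is the difference between the largest and smallest inventor count among the applications of a family, and the input layout is hypothetical.

```python
from collections import Counter

def inventor_count_delta(applications):
    """Count families by delta = max - min number of inventors per application.

    `applications` is an iterable of (docdb_family_id, n_inventors) tuples.
    """
    lowest, highest = {}, {}
    for family_id, n_inventors in applications:
        lowest[family_id] = min(n_inventors, lowest.get(family_id, n_inventors))
        highest[family_id] = max(n_inventors, highest.get(family_id, n_inventors))
    return Counter(highest[f] - lowest[f] for f in lowest)

# Made-up sample: three families, one of which gains two inventors.
sample_apps = [(1, 3), (1, 3), (2, 2), (2, 4), (3, 1)]
dist = inventor_count_delta(sample_apps)
share_unchanged = dist[0] / sum(dist.values())
```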
Results
A preliminary version (called
max recall) was produced.
This version showed at a glance
some problematic issues:
The patenting profile was very different from
the expected one.
The most productive researchers had Asian
names.
Maximum Recall (I)
12.122 distinct pairs person_id/family
4.681 researchers
10.014 patent families
7.569 inventors
Maximum Recall (II)
ISI-OST-INPI 30 classes IPC reclassification (1): ranking

Class  Technology field                              N
4      Information technology                        1498
1      Electrical engineering                        1021
3      Telecommunications                             946
7      Technologies for Control/Measures/Analysis     929
5      Semiconductors                                 734
16     Pharmaceuticals; Cosmetics                     646
8      Medical engineering                            576
29     Consumer goods                                 543
2      Audiovisual technology                         470
15     Biotechnologies                                450
19     Handling; Printing                             418
10     Organic chemistry                              409
27     Transport technology                           378
20     Materials processing                           342
6      Optics                                         341

(1) See Schmoch et al., Linking Technology Areas to Industrial Sectors, 2003
Maximum Recall (III): The Asian paradox

Name Surname   n pat   Acad qualif
Y C            776     PA
W C            735     AS
Y W            403     DO
W W            321     AS
M W L          174     DO
Y Z            171     AS
Y L            148     DO
J C            144     DO
H L            141     DO
M L            124     DO
Q Z             93     DO
S C             76     DO
Y S             68     DO
Y Y             64     DO
B W             63     DO
Z L             58     DO
B Z             57     DO
AJ A            50     AS
T J S           50     RU
Y Y P           43     AS
F L             41     AS
Z C             40     DO
A B             40     RU

Two researchers with the
same Asian surname hold
the first two positions in the
ranking and account for over
1500 patents in 3 years.
First non-Asian name in
20th position.
First Italian name in 25th position.
Researcher names
anonymized for privacy
reasons (at ANVUR's
request).
Geographic based filter (I)

            Max recall           Geo based prec.
Appl auth   N families   %       N families   %
IT          3114         31      3114         70,8
EP          1293         13       590         13,4
US          4980         50       395          9,0
WO           627         6,3      299          6,8

Only inventors with an address in IT are matched for EP, WO and US.
All inventors in UIBM are matched (address missing).
We lose part of the possible matches (international collaborations).
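The rule amounts to a simple predicate on the application authority and the inventor's country. A minimal sketch, with hypothetical names and made-up sample data:

```python
def passes_geo_filter(appln_auth, inventor_ctry):
    """Apply the geographic rule to one inventor/application pair."""
    if appln_auth == "IT":
        # UIBM records carry no inventor address, so all of them are kept.
        return True
    # EP, WO and US: keep only inventors with an address in Italy.
    return inventor_ctry == "IT"

# (appln_auth, inventor country) pairs, for illustration only.
candidates = [
    ("IT", ""),
    ("EP", "IT"),
    ("US", "CN"),
    ("WO", "IT"),
    ("US", "IT"),
    ("EP", "DE"),
]
kept = [c for c in candidates if passes_geo_filter(*c)]
```

The trade-off is the one noted above: an Italian academic listed with a foreign address on an EP, WO or US application is dropped.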
Geographic based filter (II)
Matches compared to maximum recall

                                   Geo based   Max recall
distinct pairs person_id/family    6360        12122
researchers                        4107        4681
patent families                    4398        10014
inventors                          4314        7569

The number of inventors matched decreases by
about 43% (7569 to 4314), but researcher recall is still very high
(4107 of 4681, about 88%).
Geographic based filter (III)
OST30 class ranking

Max recall (top 6):
Class  Technology field                              N
4      Information technology                        1498
1      Electrical engineering                        1021
3      Telecommunications                             946
7      Technologies for Control/Measures/Analysis     929
5      Semiconductors                                 734
16     Pharmaceuticals; Cosmetics                     646

Geographic based filter (top 15):
Class  Technology field                                       N     %
7      Technologies for Control/Measures/Analysis             497   8,45%
16     Pharmaceuticals; Cosmetics                             408   6,94%
29     Consumer goods                                         364   6,19%
8      Medical engineering                                    335   5,70%
19     Handling; Printing                                     297   5,05%
3      Telecommunications                                     274   4,66%
1      Electrical engineering                                 268   4,56%
27     Transport technology                                   261   4,44%
15     Biotechnologies                                        252   4,29%
30     Civil engineering                                      231   3,93%
10     Organic chemistry                                      229   3,89%
20     Materials processing                                   228   3,88%
4      Information technology                                 224   3,81%
11     Macromolecular chemistry                               160   2,72%
18     Technical processes (chemical, physical, mechanical)   159   2,70%

More in line with the Italian tech profile.
Geographic based filter (IV)

Name Surname  npat  Acad title  Sector      Institution
A B           38    PO          MED/38      Università degli Studi di MILANO-BICOCCA
A P           25    AS          CHIM/06     Università degli Studi del PIEMONTE ORIENTALE "Amedeo Avogadro"-Vercelli
P B           25    RU          ING-INF/05  Politecnico di TORINO
P B           25    PO          MED/04      Università degli Studi di PADOVA
F C           24    RD          FIS/03      Università degli Studi di GENOVA
M M           23    PO          BIO/17      Università degli Studi di UDINE
R L           19    DO          ING-INF/05  Università degli Studi di FERRARA
P D           18    PO          ING-IND/34  Scuola Superiore di Studi Universitari e Perfezionamento Sant'Anna
M M           17    PA          ING-IND/20  Politecnico di MILANO
A B           17    PA          MED/04      Università degli Studi di PERUGIA
P F           16    RD          ICAR/13     Università IUAV di VENEZIA
P F           16    RU          ING-INF/07  Università degli Studi di BRESCIA
R A P         15    PA          AGR/11      Università degli Studi di SASSARI
G F Z         15    PO          MED/20      Università degli Studi di PADOVA
A M           15    PA          ING-IND/34  Scuola Superiore di Studi Universitari e Perfezionamento Sant'Anna
M P            4    DO          ING-IND/32  Università degli Studi di PADOVA

The top 15 have Italian names;
the first foreign name ranks 16th.