PATSTAT and PATSTAT-related resources for patent data analysis
About the speaker
Background in Management Engineering @ Politecnico of Milan
Database Architect @ KITeS (previously CESPRI) since 2002
Project manager for data production in EU projects STI-NET, TENIA, AEGIS and EU tenders ICT network impact, INNOVA, Highly Cited Patents, Measurement and analysis of knowledge and R&D exploitation flows, assessed by patent and licensing data
Collaborations on database projects with: MIT, LSE, Danish Board of Technology, Bonn Graduate School of Economics, Universität Mainz, BETA …
Editor of the blog rawpatentdata.blogspot.com
What is PATSTAT
PATSTAT is a snapshot of the EPO database covering about 70 million applications from more than 80 application authorities, containing bibliographic data, citations and family links. The data must be loaded into the customer's own database.
+ low cost of ownership
- costs of implementation
Data Sources for PATSTAT
Source for EP data is DOCDB (EPO master documentation database)
Sources for other offices are files provided by other patent authorities
+ Good coverage for US, EU states, JP, EPO, WIPO
- For other authorities, gaps and leaks are not easy to identify
Implementing the DB (I)
Over 20 tables in a relational DB, with the application ID as the main primary key
EPO adds / improves data with each edition
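A minimal sketch of what "application ID as the main key" means in practice: most tables are joined back to the application table through that ID. The table and column names below (tls201_appln, tls206_person, tls207_pers_appln, appln_id, person_id, applt_seq_nr) are assumptions loosely modelled on the PATSTAT schema and should be checked against the documentation of the edition in use; the rows are invented toy data, and SQLite is used only to keep the example self-contained.

```python
import sqlite3

# Toy in-memory database; names are assumptions, rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tls201_appln (appln_id INTEGER PRIMARY KEY, appln_auth TEXT);
CREATE TABLE tls206_person (person_id INTEGER PRIMARY KEY, person_name TEXT);
CREATE TABLE tls207_pers_appln (person_id INTEGER, appln_id INTEGER, applt_seq_nr INTEGER);
INSERT INTO tls201_appln VALUES (1, 'EP'), (2, 'US');
INSERT INTO tls206_person VALUES (10, 'ACME SPA'), (11, 'EXAMPLE GMBH');
INSERT INTO tls207_pers_appln VALUES (10, 1, 1), (11, 1, 2), (10, 2, 1);
""")

def applicants_of(appln_id):
    """Return applicant names of one application, in applicant sequence order."""
    rows = conn.execute(
        """
        SELECT p.person_name
        FROM tls201_appln a
        JOIN tls207_pers_appln pa ON pa.appln_id = a.appln_id
        JOIN tls206_person p ON p.person_id = pa.person_id
        WHERE a.appln_id = ? AND pa.applt_seq_nr > 0
        ORDER BY pa.applt_seq_nr
        """,
        (appln_id,),
    ).fetchall()
    return [name for (name,) in rows]

print(applicants_of(1))
```

The same join pattern would apply to other tables that carry the application ID.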
Implementing the DB (II)
+ standard scripts, a growing community to exchange procedures etc. (example)
- need a person who has both DB and patent data knowledge
Plug & play extensions
Datasets that can be added with no effort:
Regpat: OECD dataset giving the NUTS3 region for each applicant / inventor (EP only)
Han: OECD Harmonized Applicants' Names dataset (EP only)
eee_ppat: KUL/Eurostat standard names and sector allocation (all PATSTAT)
TLS221: EPO legal data table, which allows including changes of ownership, oppositions... (example)
ape-inv: inventor disambiguation tools and academic inventors
Note: all tables except TLS221 are free of cost
Some papers using KITeS-PATSTAT DB
Lissoni, F., Llerena, P., McKelvey, M., and B. Sanditov, "Academic Patenting in Europe: New Evidence from the KEINS Database," Research Evaluation, 17(2): 87-102.
Bacchiocchi E., Montobbio F. (2009); Knowledge Diffusion from University and Public Research. A Comparison between US Japan and Europe using Patent Citations. Journal of Technology Transfer, vol.34 (2), pp.169-181.
Breschi S., Lissoni F., Montobbio F. (2008). University patenting and scientific productivity. A quantitative study of Italian academic inventors. European Management Review. The Journal of the European Academy of Management 5(2): 91-109
Corrocher N., Malerba F., Montobbio F. (2007); Schumpeterian Patterns of Innovative Activity in the ICT Field. Research Policy. vol. 36, pp. 418-432
Breschi S., Lissoni F., Montobbio F. (2007). The Scientific Productivity Of Academic Inventors: New Evidence From Italian Data. Economics of Innovation and New Technology, Vol. 16, Issue 2, pp. 101-118
Della Malva A., Breschi S., Lissoni F., Montobbio F. (2007). L'attivita' brevettuale dei docenti universitari: L'Italia in un confronto internazionale. Economia e Politica Industriale, v.2, pp. 43-70.
Montobbio F. (2008); Patenting Activity in Latin American and Caribbean Countries. In World Intellectual Property Organization (WIPO) - Economic Commission for Latin America and the Caribbean (ECLAC) - "Study on Intellectual Property Management in Open Economies: A Strategic Vision for Latin America". Forthcoming
Frazzoni S., Mancusi M., Rotondi Z., Sobrero M., Vezzulli A., (2011), “Relationship with banks and access to credit for innovation and internationalization in SMEs”, L’EUROPA E OLTRE. Banche e imprese nella nuova globalizzazione, XVI Rapporto sul sistema finanziario italiano, Edibank, 2011. ISBN 978-88-449-0495-1.
V. Sterzi: Patent quality and ownership: An analysis of UK faculty patenting, Research Policy, 2012 (forthcoming)
Some advanced applications
OST patent applicants data quality procedure and match with ORBIS
OST common identifier among PATSTAT, WoS and Framework programs DBs
Applicants data quality procedure and Match with ORBIS (I)
Goal of the procedure is to clean and standardize (C&S) patent applicants' names (i.e. removing type of company, common misspellings etc.)
After names C&S, a procedure has been developed that applies 5 different match algorithms in order to obtain the best matches with ORBIS company names.
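As an illustration of the clean-and-standardize step followed by matching, here is a minimal sketch. It is not the KITeS procedure: the legal-form list, the misspelling table and the two match strategies (exact match on the cleaned name, then a `difflib` similarity fallback) are assumptions standing in for the 5 algorithms, which the slides do not describe.

```python
import difflib
import re

# Hypothetical lookup tables; a real procedure would use much longer lists.
LEGAL_FORMS = {"INC", "LTD", "LIMITED", "GMBH", "AG", "SPA", "SRL", "SA",
               "CORP", "CORPORATION", "CO", "LLC", "BV", "NV"}
MISSPELLINGS = {"CORPORTION": "CORPORATION", "LIMTED": "LIMITED"}

def clean_name(raw):
    """Upper-case, drop punctuation, fix known misspellings, strip trailing legal forms."""
    name = raw.upper().replace(".", "")        # "A.G." -> "AG"
    name = re.sub(r"[^A-Z0-9 ]+", " ", name)   # other symbols become spaces (ASCII only)
    tokens = [MISSPELLINGS.get(t, t) for t in name.split()]
    while len(tokens) > 1 and tokens[-1] in LEGAL_FORMS:
        tokens.pop()
    return " ".join(tokens)

def match(applicant, company_names, cutoff=0.9):
    """Return (matched company name or None, strategy used)."""
    key = clean_name(applicant)
    index = {clean_name(n): n for n in company_names}
    if key in index:
        return index[key], "exact"
    close = difflib.get_close_matches(key, list(index), n=1, cutoff=cutoff)
    if close:
        return index[close[0]], "fuzzy"
    return None, "none"

companies = ["ACME WIDGETS INC", "OTHER SA"]
print(match("Acme Widgets Co., Ltd.", companies))
print(match("Acme Widgts Ltd", companies))
```

Running cheap exact strategies before fuzzy ones keeps the expensive comparisons for the names that really need them.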
Applicants data quality procedure and Match with ORBIS (II)
Data quality procedure developed using portable queries and tables (see Tarasconi, "Sharing names/address cleaning patterns for Patstat", from PATSTAT users day 2011)
Match procedure developed aiming to be multipurpose (e.g. it has already been used to match TM vs patent applicants @ KITeS)
Code and tables available for MySQL and Oracle: http://documents.epo.org/projects/babylon/eponet.nsf/0/92ab5eb34ff406d1c125795d0050bbcc/$FILE/PATSTAT_user_day_2011_presentations.zip
Applicants data quality procedure and Match with ORBIS (III)
C&S step results: from 12.280.000 patent applicants to about 3.800.000 companies
Match against: 353.294 ORBIS companies in NACE 2540, 2630, 2651, 2910, 3030, 3011, 8422 (defense)
Results: 94.726 patent applicants matched against 66.256 ORBIS companies
Benchmark: validation against a 1% sample returned a precision rate of 91% and a recall of 95%
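For readers less familiar with the two benchmark measures, the sketch below shows how precision and recall are usually computed from a manually validated sample. The pairs are invented toy identifiers, not data from the actual validation.

```python
def precision_recall(proposed, truth):
    """Precision = correct proposed matches / all proposed matches;
    recall = correct proposed matches / all true matches."""
    proposed, truth = set(proposed), set(truth)
    correct = len(proposed & truth)
    precision = correct / len(proposed) if proposed else 0.0
    recall = correct / len(truth) if truth else 0.0
    return precision, recall

# Hypothetical (applicant id, company id) pairs
proposed_pairs = [("a1", "o1"), ("a2", "o2"), ("a3", "o9")]   # output of the matcher
validated_pairs = [("a1", "o1"), ("a2", "o2"), ("a4", "o4")]  # manually checked sample

precision, recall = precision_recall(proposed_pairs, validated_pairs)
print(f"precision={precision:.2f} recall={recall:.2f}")
```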
OST Common identifier (I)
Data categories existing across patent, scientific publications and Framework programs data:
                     PATSTAT                          FPS                      WOS
Geographic data      inventors/applicants addresses   participants addresses   affiliations addresses
Individuals          inventors, applicants            contacts                 authors
Companies            applicants                       participants             affiliations
Sci/tech taxonomies  IPC                              TPs                      subject categories
OST Common identifier (II)
1) DEFINE ATOMIC ENTITIES AND NON-AMBIGUOUS JOINS
Even when they describe similar entities, the datasets differ in the granularity of their data.
(e.g. in WOS affiliations may be by lab / dept while patents may be by IP office: different size)
The bridge dataset should use an entity size that allows a unique data match across the different sets. This might also require some changes in existing databases.
The bridge dataset should also allow a hierarchical structure of entities, so that joins to the main datasets are possible at different levels.
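A hierarchical bridge of this kind can be sketched as a parent-pointer table, where each dataset joins at the level of detail it actually records. The entities and identifiers below are hypothetical and only illustrate the idea of rolling a fine-grained entity up to a coarser one.

```python
# Hypothetical bridge entities: id -> (name, parent id or None for the root)
ENTITIES = {
    1: ("CNRS", None),
    2: ("CNRS Bordeaux", 1),
    3: ("CNRS Bordeaux, Lab X", 2),
}

def ancestors(entity_id):
    """Return the chain of ids from the entity itself up to the root."""
    chain = []
    while entity_id is not None:
        chain.append(entity_id)
        entity_id = ENTITIES[entity_id][1]
    return chain

def roll_up(entity_id, level):
    """Return the ancestor at the given depth (0 = root).
    If the entity is shallower than the requested level, return the entity itself."""
    chain = ancestors(entity_id)[::-1]  # root first
    return chain[min(level, len(chain) - 1)]

# A lab-level affiliation joined to data that only records the organisation
print(ENTITIES[roll_up(3, 0)][0])
```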
OST Common identifier (IV)
2) TIMESERIES
2a) DATASET ASYNCHRONIES
Data may enter the database on a different time frame depending on the dataset.
(e.g. PATSTAT is a full update, so a snapshot at the moment of data creation, while WOS is an incremental update; so name changes / M&A could make the same entity differ between the 2 datasets; note also that geographic entities change with time: counties, countries…)
Bridge tables must have a time-related dimension.
2b) DATA TRANSFORMATIONS
Data change over time.
(e.g. companies may merge, split [most critical case], change name, change owner…)
Bridge tables must have a continuation dimension that allows following the transformation of entities.
OST Common identifier (V)
Timeseries example: Sarajevo changes from YU to BS in 1992
        CITY      COUNTRY  DATEFROM  DATETO
BEFORE  Sarajevo  YU BS
AFTER   Sarajevo  YU       1800      1991
        Sarajevo  BS       1992      9999
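The "AFTER" layout can be queried with a simple validity-interval lookup. The sketch below uses the rows and country codes exactly as they appear on the slide; the function name and tuple layout are illustrative assumptions.

```python
# (city, country, date_from, date_to); 9999 marks "still valid", as on the slide
CITY_COUNTRY = [
    ("Sarajevo", "YU", 1800, 1991),
    ("Sarajevo", "BS", 1992, 9999),
]

def country_at(city, year):
    """Return the country code valid for the city in the given year, or None."""
    for row_city, country, date_from, date_to in CITY_COUNTRY:
        if row_city == city and date_from <= year <= date_to:
            return country
    return None

print(country_at("Sarajevo", 1985), country_at("Sarajevo", 2005))
```

Because the intervals do not overlap, each (city, year) pair resolves to at most one country, which is what makes the join non-ambiguous.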
OST Common identifier (V)
OBJECT / PROPERTIES DATA STRUCTURE
The proposed data structure should be a TEMPORAL DATABASE (1), able to store PROPERTIES/STATUS/EVENTS, so for instance it would contain the following fields:
PROPERTYNAME (e.g. ownership, affiliation…)
PROPERTYVALUE (e.g. new owner, new affiliation)
DATEFROM
DATETO
CHGREASON (if blank, the record is still valid)
VALUE1…N (e.g. type of acquisition, % ownership…)
Along with the properties, it must also be defined how properties are inherited among entities
(e.g. CNRS Bordeaux inherits ownership from CNRS, probably sector of activity… )
(1) See Richard T. Snodgrass, "TSQL2 Temporal Query Language", www.cs.arizona.edu, Computer Science Department of the University of Arizona
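One possible reading of this structure, including inheritance, is sketched below. The field names follow the list above; the class, the parent table and the sample record (a hypothetical SECTOR property for CNRS) are assumptions made for illustration only.

```python
from dataclasses import dataclass

@dataclass
class PropertyRecord:
    entity: str
    property_name: str
    property_value: str
    date_from: int
    date_to: int = 9999      # 9999 = open-ended
    chg_reason: str = ""     # blank = record still valid

    def valid_at(self, year):
        return self.date_from <= year <= self.date_to

# Hypothetical inheritance link and sample data
PARENT = {"CNRS Bordeaux": "CNRS"}
RECORDS = [PropertyRecord("CNRS", "SECTOR", "PUBLIC RESEARCH", 1939)]

def get_property(entity, property_name, year):
    """Look up a property on the entity, falling back to its parents."""
    current = entity
    while current is not None:
        for rec in RECORDS:
            if (rec.entity == current and rec.property_name == property_name
                    and rec.valid_at(year)):
                return rec.property_value
        current = PARENT.get(current)  # inherit from the parent, if any
    return None

print(get_property("CNRS Bordeaux", "SECTOR", 2000))
```

A value stored directly on the child entity would take precedence, since the child is checked before its parent.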
APPENDIX: Temporal database Example (I)
NOVARTIS
Novartis Pharma originated from the merger of CIBA (1884), GEIGY (1758) and Sandoz (1876)
Until 1970 they are 3 separate entities
LEGPCODE  LEGPNAME
1         CIBA
2         GEIGY
3         SANDOZ
4         CIBA SUB 1…N
5         GEIGY SUB 1…N
6         SANDOZ SUB 1…N
LEGPCODE  PROPNAME   PROPVALUE  STATUSCODE2  STATUSTEXT  STATUSPERC  DATEFROM  DATETO  CHGREASON
1         OWNERSHIP  FULLOWN    1                        100         1884      9999
2         OWNERSHIP  FULLOWN    2                        100         1758      9999
3         OWNERSHIP  FULLOWN    3                        100         1876      9999
4         OWNERSHIP  FULLOWN    1                        100         1884      9999
5         OWNERSHIP  FULLOWN    2                        100         1758      9999
6         OWNERSHIP  FULLOWN    3                        100         1876      9999
Temporal database: Example (II)
NOVARTIS
1970 first merge: CIBA + GEIGY = CIBA GEIGY LTD.
LEGPCODE  LEGPNAME
1         CIBA
2         GEIGY
3         SANDOZ
4         CIBA SUB 1…N
5         GEIGY SUB 1…N
6         SANDOZ SUB 1…N
7         CIBA GEIGY LTD.
LEGPCODE  PROPNAME        PROPVALUE  STATUSCODE2  STATUSTEXT  STATUSPERC  DATEFROM  DATETO  CHGREASON
1         OWNERSHIP       FULLOWN    1                        100         1884      1969    MERGE
2         OWNERSHIP       FULLOWN    2                        100         1758      1969    MERGE
3         OWNERSHIP       FULLOWN    3                        100         1876      9999
4         OWNERSHIP       FULLOWN    1                        100         1884      1969    MERGE
5         OWNERSHIP       FULLOWN    2                        100         1758      1969    MERGE
6         OWNERSHIP       FULLOWN    3                        100         1876      9999
1         TRANSFORMATION  MERGE      7                        50          1970      1970
2         TRANSFORMATION  MERGE      7                        50          1970      1970
7         OWNERSHIP       FULLOWN    7                        100         1970      9999
4         OWNERSHIP       FULLOWN    7                        100         1970      9999
5         OWNERSHIP       FULLOWN    7                        100         1970      9999
Temporal database: Example (III)
NOVARTIS
1996 second merge: CIBA GEIGY + Sandoz = Novartis
LEGPCODE  PROPNAME        PROPVALUE  STATUSCODE2  STATUSTEXT  STATUSPERC  DATEFROM  DATETO  CHGREASON
3         OWNERSHIP       FULLOWN    3                        100         1876      1995    MERGE
4         OWNERSHIP       FULLOWN    1                        100         1884      1969    MERGE
5         OWNERSHIP       FULLOWN    2                        100         1758      1969    MERGE
6         OWNERSHIP       FULLOWN    3                        100         1876      1995    MERGE
7         OWNERSHIP       FULLOWN    7                        100         1970      1995    MERGE
4         OWNERSHIP       FULLOWN    7                        100         1970      1995    MERGE
5         OWNERSHIP       FULLOWN    7                        100         1970      1995    MERGE
3         TRANSFORMATION  MERGE      8                        50          1996      9999
7         TRANSFORMATION  MERGE      8                        50          1996      9999
8         OWNERSHIP       FULLOWN    8                        100         1996      9999    MERGE
4         OWNERSHIP       FULLOWN    8                        100         1996      9999    MERGE
5         OWNERSHIP       FULLOWN    8                        100         1996      9999    MERGE
6         OWNERSHIP       FULLOWN    8                        100         1996      9999    MERGE
LEGPCODE  LEGPNAME
3         SANDOZ
4         CIBA SUB 1…N
5         GEIGY SUB 1…N
6         SANDOZ SUB 1…N
7         CIBA GEIGY LTD.
8         NOVARTIS
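The TRANSFORMATION rows are what lets a query follow an entity through its successors. The sketch below encodes only the merge links from the Novartis example (legal entity codes 1, 2, 3, 7, 8) and resolves which entity a given code corresponds to in a given year; the function and variable names are illustrative assumptions, not part of the proposed schema.

```python
NAMES = {1: "CIBA", 2: "GEIGY", 3: "SANDOZ", 7: "CIBA GEIGY LTD.", 8: "NOVARTIS"}

# (old LEGPCODE, successor LEGPCODE, year of the merge), from the TRANSFORMATION rows
MERGES = [(1, 7, 1970), (2, 7, 1970), (3, 8, 1996), (7, 8, 1996)]

def entity_at(code, year):
    """Follow merge links that took effect on or before `year`."""
    changed = True
    while changed:
        changed = False
        for old, new, merge_year in MERGES:
            if old == code and merge_year <= year:
                code = new
                changed = True
    return code

for year in (1960, 1980, 2000):
    print(year, NAMES[entity_at(1, year)])
```

A split, which the slides flag as the most critical case, would need one predecessor to map to several successors and is not covered by this sketch.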