PATSTAT and PATSTAT-related resources for patent data analysis

By Gianluca Tarasconi – Kites Univ. Bocconi / O.S.T.

About the speaker

Background in Management Engineering @ Politecnico of Milan

Database Architect @ KITeS (previously CESPRI) since 2002

Project manager for data production in the EU projects STI-NET, TENIA and AEGIS, and in the EU tenders ICT network impact, INNOVA, Highly Cited Patents, and Measurement and analysis of knowledge and R&D exploitation flows, assessed by patent and licensing data

Collaborations on database projects with: MIT, LSE, Danish Board of Technology, Bonn Graduate School of Economics, Universität Mainz, BETA …

Editor of the blog rawpatentdata.blogspot.com

What is PATSTAT

PATSTAT is a snapshot of the EPO database covering about 70 million applications from more than 80 application authorities. It contains bibliographic data, citations and family links. The data must be loaded into the customer's own database.

+ low cost of ownership

- costs of implementation

Data Sources for PATSTAT

The source for EP data is DOCDB (the EPO master documentation database)

The sources for other offices are files provided by the other patent authorities

+ Good coverage for US, EU states, JP, EPO, WIPO

- For other authorities, gaps and missing data are not easy to identify

Implementing the DB (I)

Over 20 tables in a relational DB, with the application ID as the main primary key

EPO adds / improves data with each edition
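To illustrate how the application ID ties the tables together, here is a minimal sketch that builds three miniature tables in an in-memory SQLite database and joins them. The table and column names (tls201_appln, tls206_person, tls207_pers_appln, appln_id, person_id, applt_seq_nr) mimic PATSTAT naming and are assumptions to be checked against the data catalog of the edition in use; all rows are invented.

```python
import sqlite3


def build_toy_patstat():
    """Create a tiny in-memory database shaped like a PATSTAT subset (invented rows)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE tls201_appln (appln_id INTEGER PRIMARY KEY, appln_auth TEXT);
        CREATE TABLE tls206_person (person_id INTEGER PRIMARY KEY, person_name TEXT);
        CREATE TABLE tls207_pers_appln (
            person_id INTEGER, appln_id INTEGER,
            applt_seq_nr INTEGER, invt_seq_nr INTEGER);

        INSERT INTO tls201_appln VALUES (1, 'EP'), (2, 'US');
        INSERT INTO tls206_person VALUES
            (10, 'ACME SPA'), (11, 'ROSSI, MARIO'), (12, 'EXAMPLE GMBH');
        -- applt_seq_nr > 0 marks an applicant, invt_seq_nr > 0 an inventor
        INSERT INTO tls207_pers_appln VALUES
            (10, 1, 1, 0), (11, 1, 0, 1), (12, 2, 1, 0), (10, 2, 2, 0);
        """
    )
    return conn


def applicants_per_application(conn):
    """Return (appln_id, appln_auth, person_name) for every applicant, joined on appln_id."""
    sql = """
        SELECT a.appln_id, a.appln_auth, p.person_name
        FROM tls201_appln AS a
        JOIN tls207_pers_appln AS pa ON pa.appln_id = a.appln_id
        JOIN tls206_person AS p ON p.person_id = pa.person_id
        WHERE pa.applt_seq_nr > 0
        ORDER BY a.appln_id, pa.applt_seq_nr
    """
    return conn.execute(sql).fetchall()


if __name__ == "__main__":
    for row in applicants_per_application(build_toy_patstat()):
        print(row)
```

The same join pattern should carry over to MySQL or Oracle with the real tables, since every table in the sketch is reached through appln_id.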

Implementing the DB (II)

+ standard scripts and a growing community to exchange procedures etc. (example: http://rawpatentdata.blogspot.fr/search/label/mysql)

- need a person who has both DB and patent data knowledge

Plug & play extensions

Datasets that can be added with no effort:

Regpat: OECD dataset giving the NUTS3 region for each applicant / inventor (EP only)

Han: OECD Harmonized Applicant Names dataset (EP only)

http://www.oecd.org/science/innovationinsciencetechnologyandindustry/oecdpatentdatabases.htm

eee_ppat: KUL/Eurostat standard names and sector allocation (all PATSTAT)

http://www.ecoom.be/nl/eee-ppat

TLS221: EPO legal data table, which makes it possible to include changes of ownership, oppositions... (example: http://rawpatentdata.blogspot.fr/2011/07/how-to-track-transfer-of-patents.html)

http://www.epo.org/searching/subscription/raw/product-14-11.html

ape-inv: inventor disambiguation tools and academic inventors

http://www.esf-ape-inv.eu/index.php?page=12

Note: all tables except TLS221 are free of charge



Some papers using Kites-Patstat

Lissoni, F., Llerena, P., McKelvey, M., and B. Sanditov "Academic Patenting in Europe: New Evidence from the KEINS Database," Research Evaluation, 17(2): 87-102.

Bacchiocchi E., Montobbio F. (2009); Knowledge Diffusion from University and Public Research. A Comparison between US Japan and Europe using Patent Citations. Journal of Technology Transfer, vol.34 (2), pp.169-181.

Breschi S., Lissoni F., Montobbio F. (2008). University patenting and scientific productivity. A quantitative study of Italian academic inventors. European Management Review. The Journal of the European Academy of Management 5(2): 91-109

Corrocher N., Malerba F., Montobbio F. (2007); Schumpeterian Patterns of Innovative Activity in the ICT Field. Research Policy. vol. 36, pp. 418-432

Breschi S., Lissoni F., Montobbio F. (2007). The Scientific Productivity Of Academic Inventors: New Evidence From Italian Data. Economics of Innovation and New Technology, Vol. 16, Issue 2, pp. 101-118

Della Malva A., Breschi S., Lissoni F., Montobbio F. (2007). L'attivita' brevettuale dei docenti universitari: L'Italia in un confronto internazionale. Economia e Politica Industriale, v. 2, pp. 43-70.

Montobbio F. (2008); Patenting Activity in Latin American and Caribbean Countries. In World Intellectual Property Organization (WIPO) - Economic Commission for Latin America and the Caribbean (ECLAC), "Study on Intellectual Property Management in Open Economies: A Strategic Vision for Latin America". Forthcoming

Frazzoni S., Mancusi M., Rotondi Z., Sobrero M., Vezzulli A., (2011), “Relationship with banks and access to credit for innovation and internationalization in SMEs”, L’EUROPA E OLTRE. Banche e imprese nella nuova globalizzazione, XVI Rapporto sul sistema finanziario italiano, Edibank, 2011. ISBN 978-88-449-0495-1.

V. Sterzi: Patent quality and ownership: An analysis of UK faculty patenting, Research Policy, 2012 (forthcoming)

Some advanced applications

OST patent applicants data quality procedure and match with ORBIS

OST common identifier among PATSTAT, WoS and Framework Programme DBs

Applicants data quality procedure and Match with ORBIS (I)

The goal of the procedure is to clean and standardize patent applicant names (e.g. removing the type of company, common misspellings etc.)

After names cleaning and standardization (C&S), a procedure was developed that applies 5 different match algorithms in order to obtain the best matches with ORBIS company names.
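The cleaning and standardization step can be pictured with a minimal sketch. It is not the procedure used at KITeS or OST: the legal-form list is a short hypothetical sample, the function names are invented, and exact matching on the cleaned name stands in for the 5 match algorithms, which the slides do not detail.

```python
import re
import unicodedata

# Hypothetical sample of legal-form tokens; a production list would be far longer.
LEGAL_FORMS = {"SPA", "SRL", "GMBH", "AG", "SA", "LTD", "LIMITED",
               "INC", "CORP", "CO", "PLC", "BV", "NV"}


def clean_name(raw):
    """Upper-case, strip accents and punctuation, and drop legal-form tokens."""
    text = unicodedata.normalize("NFKD", raw)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.upper().replace(".", "")      # "S.p.A." becomes "SPA"
    text = re.sub(r"[^A-Z0-9 ]", " ", text)   # other punctuation becomes a space
    tokens = [tok for tok in text.split() if tok not in LEGAL_FORMS]
    return " ".join(tokens)


def exact_match(applicants, companies):
    """Map each applicant name to the company names sharing its cleaned form."""
    index = {}
    for company in companies:
        key = clean_name(company)
        if key:                                # skip names that clean to nothing
            index.setdefault(key, []).append(company)
    matches = {}
    for applicant in applicants:
        key = clean_name(applicant)
        if key in index:
            matches[applicant] = index[key]
    return matches


if __name__ == "__main__":
    print(clean_name("Société Générale S.A."))
    print(exact_match(["Fiat S.p.A."], ["FIAT SPA", "Siemens AG"]))
```

Fuzzier comparisons (token reordering, edit distance and so on) would slot in where the exact key lookup is, at the cost of needing a validation sample to tune them.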



Applicants data quality procedure and Match with ORBIS (II)

Data quality procedure developed using portable queries and tables (see Tarasconi, "Sharing names/address cleaning patterns for Patstat", PATSTAT users day 2011)

Match procedure developed with the aim of being multipurpose (e.g. it has already been used to match trademark (TM) vs patent applicants @ KITeS)

Code and tables available for MySQL and Oracle. http://documents.epo.org/projects/babylon/eponet.nsf/0/92ab5eb34ff406d1c125795d0050bbcc/$FILE/PATSTAT_user_day_2011_presentations.zip



Applicants data quality procedure and Match with ORBIS (III)

C&S step results: from 12,280,000 patent applicants to about 3,800,000 companies

Match against: 353,294 ORBIS companies in NACE 2540, 2630, 2651, 2910, 3030, 3011, 8422 (defense)

Results: 94,726 patent applicants matched against 66,256 ORBIS companies

Benchmark: validation against a 1% sample returned a precision of 91% and a recall of 95%
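For readers less familiar with the two benchmark figures, the sketch below shows how precision and recall are computed from a manually validated sample. The counts in the usage example are hypothetical, chosen only so that the output lands near the reported 91% and 95%; the slides do not give the underlying counts.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Precision = share of proposed matches that are right.
    Recall = share of true matches that were found."""
    proposed = true_pos + false_pos
    existing = true_pos + false_neg
    precision = true_pos / proposed if proposed else 0.0
    recall = true_pos / existing if existing else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical validation counts, not taken from the benchmark itself.
    p, r = precision_recall(true_pos=910, false_pos=90, false_neg=48)
    print(f"precision={p:.2f} recall={r:.2f}")
```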

OST Common identifier (I)

Data categories existing across patent, scientific publication and Framework Programme data:

                      PATSTAT                          FPS                      WOS
Geographic data       inventors/applicants addresses   participants addresses   affiliations addresses
Individuals           inventors, applicants            contacts                 authors
Companies             applicants                       participants             affiliations
Sci/tech taxonomies   IPC                              TPs                      subject categories

OST Common identifier (II)

1) DEFINE ATOMIC ENTITIES AND UNAMBIGUOUS JOINS

Even when they refer to similar entities, the datasets differ in the granularity of their data.

(e.g. in WOS affiliations may be recorded by lab / dept, while patents may be filed by the IP office: different size)

The bridge dataset should use an entity size that allows a unique data match across the different sets. This might also require some changes in the existing databases.

The bridge dataset should also support a hierarchical structure of entities, allowing joins to the main datasets at different levels.

OST Common identifier (III)

Example

OST Common identifier (IV)

2) TIME SERIES

2a) DATASET ASYNCHRONIES

Data may enter the database on a different time frame depending on the dataset.

(e.g. PATSTAT is a full update, so a snapshot at the moment of data creation, while WOS is an incremental update; name changes / M&A could therefore make the same entity differ between the 2 datasets; note also that geographic entities change over time: counties, countries…)

Bridge tables must have a time-related dimension.

2b) DATA TRANSFORMATIONS

Data change over time.

(e.g. companies may merge, split [most critical case], change name, change owner…)

Bridge tables must have a continuation dimension that allows following the transformation of entities.

OST Common identifier (V)

Time series example

Sarajevo changes from YU to BS in 1992

BEFORE
Sarajevo   YU   BS

AFTER
Sarajevo   YU   1800   1991
Sarajevo   BS   1992   9999
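A minimal sketch of how the AFTER layout can be queried: each row carries a validity interval, and a lookup returns the country that applies in a given year. The row values and country codes are copied from the slide; the function name and tuple layout are assumptions for illustration.

```python
# (city, country code, first valid year, last valid year); 9999 means "still valid"
GEO_ROWS = [
    ("Sarajevo", "YU", 1800, 1991),
    ("Sarajevo", "BS", 1992, 9999),
]


def country_in_year(city, year, rows=GEO_ROWS):
    """Return the country code valid for `city` in `year`, or None if no row covers it."""
    for name, country, date_from, date_to in rows:
        if name == city and date_from <= year <= date_to:
            return country
    return None


if __name__ == "__main__":
    print(country_in_year("Sarajevo", 1985))
    print(country_in_year("Sarajevo", 2005))
```

With the BEFORE layout the same question has two candidate answers and no way to choose between them, which is the point of adding the two date columns.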

OST Common identifier (VI)

OBJECT / PROPERTIES DATA STRUCTURE

The proposed data structure should be a TEMPORAL DATABASE (1), able to store PROPERTIES / STATUS / EVENTS, and so, for instance, contain the following fields:

PROPERTYNAME (e.g. ownership, affiliation…)

PROPERTYVALUE (e.g. new owner, new affiliation)

DATEFROM

DATETO

CHGREASON (if blank, the record is still valid)

VALUE1…N (e.g. type of acquisition, % ownership…)

Along with the properties, it must also be defined how properties are inherited among entities

(e.g. CNRS Bordeaux inherits ownership, and probably sector of activity, from CNRS…)

(1) See Richard T. Snodgrass, "TSQL2 Temporal Query Language", www.cs.arizona.edu, Computer Science Department of the University of Arizona
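One possible shape for such a property record, with inheritance resolved by walking up to the parent entity, is sketched below. The field names follow the list on the slide; the parent mapping, the SECTOR property, its value and its start year are all invented for illustration and say nothing about how OST stores the data.

```python
from dataclasses import dataclass


@dataclass
class PropertyRecord:
    entity: str
    property_name: str       # e.g. ownership, affiliation
    property_value: str
    date_from: int
    date_to: int = 9999      # 9999 means the record is still valid
    chg_reason: str = ""     # blank while the record is still valid


# Hypothetical hierarchy and records.
PARENT = {"CNRS Bordeaux": "CNRS"}
RECORDS = [PropertyRecord("CNRS", "SECTOR", "PUBLIC RESEARCH", 1939)]


def lookup(entity, property_name, year):
    """Return the property value valid in `year`, inheriting from parents if needed."""
    current = entity
    while current is not None:
        for rec in RECORDS:
            if (rec.entity == current and rec.property_name == property_name
                    and rec.date_from <= year <= rec.date_to):
                return rec.property_value
        current = PARENT.get(current)   # no own record: try the parent entity
    return None


if __name__ == "__main__":
    print(lookup("CNRS Bordeaux", "SECTOR", 2000))
```

A record placed directly on the child entity would be found first and so override the inherited value, which is one simple way to express exceptions.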

APPENDIX: Temporal database: Example (I)

NOVARTIS

Novartis Pharma originated from the merger of CIBA (1884), GEIGY (1758) and Sandoz (1876)

Until 1970 they are 3 separate entities

LEGPCODE  LEGPNAME
1         CIBA
2         GEIGY
3         SANDOZ
4         CIBA SUB 1…N
5         GEIGY SUB 1…N
6         SANDOZ SUB 1…N

LEGPCODE  PROPNAME   PROPVALUE  STATUSCODE2  STATUSTEXT  STATUSPERC  DATEFROM  DATETO  CHGREASON
1         OWNERSHIP  FULLOWN    1                        100         1884      9999

Temporal database: Example (II)

NOVARTIS

1970 first merger: CIBA + GEIGY = CIBA GEIGY LTD.

LEGPCODE  LEGPNAME
1         CIBA
2         GEIGY
3         SANDOZ
4         CIBA SUB 1…N
5         GEIGY SUB 1…N
6         SANDOZ SUB 1…N
7         CIBA GEIGY LTD.

LEGPCODE  PROPNAME        PROPVALUE  STATUSCODE2  STATUSTEXT  STATUSPERC  DATEFROM  DATETO  CHGREASON
1         OWNERSHIP       FULLOWN    1                        100         1884      1969    MERGE
3         OWNERSHIP       FULLOWN    3                        100         1876      9999
1         TRANSFORMATION  MERGE      7                        50          1970      1970
2         TRANSFORMATION  MERGE      7                        50          1970      1970
5         OWNERSHIP       FULLOWN    7                        100         1970      9999

Temporal database: Example (III)

NOVARTIS

1996 second merger: CIBA GEIGY + Sandoz = Novartis

LEGPCODE  LEGPNAME
3         SANDOZ
4         CIBA SUB 1…N
5         GEIGY SUB 1…N
6         SANDOZ SUB 1…N
7         CIBA GEIGY LTD.
8         NOVARTIS

LEGPCODE  PROPNAME        PROPVALUE  STATUSCODE2  STATUSTEXT  STATUSPERC  DATEFROM  DATETO  CHGREASON
3         OWNERSHIP       FULLOWN    3                        100         1876      1995    MERGE
3         TRANSFORMATION  MERGE      8                        50          1996      9999
7         TRANSFORMATION  MERGE      8                        50          1996      9999
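The continuation dimension in the three Novartis slides can be exercised with a short sketch that follows the MERGE records forward to find which entity a legal person has become by a given year. The codes, names and years are taken from the examples above; the simplified tuple layout and function name are assumptions, and the sketch handles merges only, not splits.

```python
LEGAL_PERSONS = {1: "CIBA", 2: "GEIGY", 3: "SANDOZ", 7: "CIBA GEIGY LTD.", 8: "NOVARTIS"}

# (source LEGPCODE, target LEGPCODE, year the merge takes effect)
MERGES = [(1, 7, 1970), (2, 7, 1970), (3, 8, 1996), (7, 8, 1996)]


def successor_in_year(code, year):
    """Follow every merge effective by `year` and return the resulting LEGPCODE."""
    current = code
    moved = True
    while moved:
        moved = False
        for source, target, start in MERGES:
            if source == current and start <= year:
                current = target       # step to the merged entity and look again
                moved = True
                break
    return current


if __name__ == "__main__":
    for year in (1960, 1980, 2000):
        print(year, LEGAL_PERSONS[successor_in_year(1, year)])
```

Splits would need the lookup to return several successors with their shares, which is where a percentage column such as STATUSPERC would come into play.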
