20180423 ipccat icsdv 2018.ppt2018/04/23 · losses due to translation are more visible for each...
TRANSCRIPT
![Page 1: 20180423 IPCCAT ICSDV 2018.ppt2018/04/23 · Losses due to translation are more visible for each language Needs to be evaluated for all languages 0.0% 20.0% 40.0% 60.0% 80.0% 100.0%](https://reader036.vdocument.in/reader036/viewer/2022070818/5f16671d3af36f52a95d8e4a/html5/thumbnails/1.jpg)
NiceApril 23, 2018
Patrick FIÉVET & Jacques GUYOT
IC-SDV 2018Automatic Text categorization in the International Patent Classification
IPCCAT-Neural
![Page 2: 20180423 IPCCAT ICSDV 2018.ppt2018/04/23 · Losses due to translation are more visible for each language Needs to be evaluated for all languages 0.0% 20.0% 40.0% 60.0% 80.0% 100.0%](https://reader036.vdocument.in/reader036/viewer/2022070818/5f16671d3af36f52a95d8e4a/html5/thumbnails/2.jpg)
IPCCAT-neural : automatic text categorization in the IPC
What is it about?
Patent Classifications : IPC (and CPC)
Automatic text CATegorization in the specific context of patent documents
Artificial Intelligence (AI) to mimic patent classification practices by human beings
2
![Page 3: 20180423 IPCCAT ICSDV 2018.ppt2018/04/23 · Losses due to translation are more visible for each language Needs to be evaluated for all languages 0.0% 20.0% 40.0% 60.0% 80.0% 100.0%](https://reader036.vdocument.in/reader036/viewer/2022070818/5f16671d3af36f52a95d8e4a/html5/thumbnails/3.jpg)
IPCCAT-neural : automatic text categorization in the IPC
3
![Page 4: 20180423 IPCCAT ICSDV 2018.ppt2018/04/23 · Losses due to translation are more visible for each language Needs to be evaluated for all languages 0.0% 20.0% 40.0% 60.0% 80.0% 100.0%](https://reader036.vdocument.in/reader036/viewer/2022070818/5f16671d3af36f52a95d8e4a/html5/thumbnails/4.jpg)
IPCCAT-neural : automatic text categorization in the IPC
Initial problems to be solved:
IPCs allotment in small Patent Offices
Automatic routing of patent/technical documents according to their technical domains
4
![Page 5: 20180423 IPCCAT ICSDV 2018.ppt2018/04/23 · Losses due to translation are more visible for each language Needs to be evaluated for all languages 0.0% 20.0% 40.0% 60.0% 80.0% 100.0%](https://reader036.vdocument.in/reader036/viewer/2022070818/5f16671d3af36f52a95d8e4a/html5/thumbnails/5.jpg)
IPCCAT-neural : Principles
Large collection (millions) of patent documents already classified, preferably with good-reputation practices
Minimum number of documents for each IPC symbol
Use of the CPC (converted into IPC through concordance)
Complement with IPC symbols
Training /Testing phase : 80% / 20%
Precision measure: Three-guesses evaluation on millions of test cases
5
![Page 6: 20180423 IPCCAT ICSDV 2018.ppt2018/04/23 · Losses due to translation are more visible for each language Needs to be evaluated for all languages 0.0% 20.0% 40.0% 60.0% 80.0% 100.0%](https://reader036.vdocument.in/reader036/viewer/2022070818/5f16671d3af36f52a95d8e4a/html5/thumbnails/6.jpg)
IPCCAT-neural : Principles
Production use: 100% of the collection
Web service: returns IPCs with a numerical confidence for each
User interface through IPC publication platform (IPCPUB) 5 confidence levels
6
![Page 7: 20180423 IPCCAT ICSDV 2018.ppt2018/04/23 · Losses due to translation are more visible for each language Needs to be evaluated for all languages 0.0% 20.0% 40.0% 60.0% 80.0% 100.0%](https://reader036.vdocument.in/reader036/viewer/2022070818/5f16671d3af36f52a95d8e4a/html5/thumbnails/7.jpg)
IPCCAT-neural : user interface (through IPC Publication platform)
7
![Page 8: 20180423 IPCCAT ICSDV 2018.ppt2018/04/23 · Losses due to translation are more visible for each language Needs to be evaluated for all languages 0.0% 20.0% 40.0% 60.0% 80.0% 100.0%](https://reader036.vdocument.in/reader036/viewer/2022070818/5f16671d3af36f52a95d8e4a/html5/thumbnails/8.jpg)
IPCCAT-neural : automatic text categorization in the IPC
Challenges?
IPC coverage Vs. availability of training collections
Precision Vs. Recall (e.g. for Prior art Search)
Absolute Vs. relative quality of IPCCAT-Neural
8
![Page 9: 20180423 IPCCAT ICSDV 2018.ppt2018/04/23 · Losses due to translation are more visible for each language Needs to be evaluated for all languages 0.0% 20.0% 40.0% 60.0% 80.0% 100.0%](https://reader036.vdocument.in/reader036/viewer/2022070818/5f16671d3af36f52a95d8e4a/html5/thumbnails/9.jpg)
IPCCAT-neural : recall challenges
Recall: “First” IPC does not mean “Main” IPC
One IPC is usually not enough for patent document classification but is enough for their routing
Automatic Prediction (or guess) of the most appropriate IPC symbols on the basis on a text input (e.g. patent abstract) with a confidence levelassociated to each of these predictions
9
![Page 10: 20180423 IPCCAT ICSDV 2018.ppt2018/04/23 · Losses due to translation are more visible for each language Needs to be evaluated for all languages 0.0% 20.0% 40.0% 60.0% 80.0% 100.0%](https://reader036.vdocument.in/reader036/viewer/2022070818/5f16671d3af36f52a95d8e4a/html5/thumbnails/10.jpg)
IPCCAT-neural : precision challenges
Precision: “Mycat” based on classic neural networks
Trained system based on (many) neural networks
No evidence that more recent technology e.g. Deep Learning would perform better than “classic” neural networks (because patent classification is not hidden information in the training collection)
Use of N-grams (terms with several words)
10
![Page 11: 20180423 IPCCAT ICSDV 2018.ppt2018/04/23 · Losses due to translation are more visible for each language Needs to be evaluated for all languages 0.0% 20.0% 40.0% 60.0% 80.0% 100.0%](https://reader036.vdocument.in/reader036/viewer/2022070818/5f16671d3af36f52a95d8e4a/html5/thumbnails/11.jpg)
IPCCAT-neural : quality challenges
IPCCAT quality is relative to IPC quality in its
training collection:
IPCCAT Imitates human practices (good and bad ones)
Limited by patent documents fragments used during its training
IPCCAT offers consistent and repeatable predictions
That human beings are usually not be able to achieve
11
![Page 12: 20180423 IPCCAT ICSDV 2018.ppt2018/04/23 · Losses due to translation are more visible for each language Needs to be evaluated for all languages 0.0% 20.0% 40.0% 60.0% 80.0% 100.0%](https://reader036.vdocument.in/reader036/viewer/2022070818/5f16671d3af36f52a95d8e4a/html5/thumbnails/12.jpg)
IPCCAT-neural 2018
12
Where are we today?
![Page 13: 20180423 IPCCAT ICSDV 2018.ppt2018/04/23 · Losses due to translation are more visible for each language Needs to be evaluated for all languages 0.0% 20.0% 40.0% 60.0% 80.0% 100.0%](https://reader036.vdocument.in/reader036/viewer/2022070818/5f16671d3af36f52a95d8e4a/html5/thumbnails/13.jpg)
IPCCAT-neural 2018: text categorization in the IPC at subgroup level !
• Automatic prediction in 99% of the IPC i.e. among 72,137 categories
• Top-three guess precision > 80%
13
![Page 14: 20180423 IPCCAT ICSDV 2018.ppt2018/04/23 · Losses due to translation are more visible for each language Needs to be evaluated for all languages 0.0% 20.0% 40.0% 60.0% 80.0% 100.0%](https://reader036.vdocument.in/reader036/viewer/2022070818/5f16671d3af36f52a95d8e4a/html5/thumbnails/14.jpg)
IPCCAT-neural 2018: text categorization in the IPC at subgroup level
Training collection:
27.7 million in EN
IPC coverage(using IPC and CPC through concordance):
99% at subgroup level (EN)
Top three guess evaluation of its Precision:
82.5% based on 1.5 million of test cases (EN)
14
![Page 15: 20180423 IPCCAT ICSDV 2018.ppt2018/04/23 · Losses due to translation are more visible for each language Needs to be evaluated for all languages 0.0% 20.0% 40.0% 60.0% 80.0% 100.0%](https://reader036.vdocument.in/reader036/viewer/2022070818/5f16671d3af36f52a95d8e4a/html5/thumbnails/15.jpg)
IPCCAT-neural text categorization in the IPC at subgroup level
Why was It actually doable?
Recent evolution of the IPCCAT classifier available on-demand as open source by the Olanto foundation see http://olanto.org/foundation
Added value in data processing:
Training based on patent documents computed from DOCDB XML excerpts (title + abstract)
Computation of both IPC and CPC classifications
Progress in computing power opens new R&D horizons e.g. GPU, text processing,L
15
![Page 16: 20180423 IPCCAT ICSDV 2018.ppt2018/04/23 · Losses due to translation are more visible for each language Needs to be evaluated for all languages 0.0% 20.0% 40.0% 60.0% 80.0% 100.0%](https://reader036.vdocument.in/reader036/viewer/2022070818/5f16671d3af36f52a95d8e4a/html5/thumbnails/16.jpg)
Evolution of IPCCAT R&D over years
2017
2003-2008:IPC Main Group level(~7,000 categories)
16
2018: IPC
Group level
~73,000
categories
![Page 17: 20180423 IPCCAT ICSDV 2018.ppt2018/04/23 · Losses due to translation are more visible for each language Needs to be evaluated for all languages 0.0% 20.0% 40.0% 60.0% 80.0% 100.0%](https://reader036.vdocument.in/reader036/viewer/2022070818/5f16671d3af36f52a95d8e4a/html5/thumbnails/17.jpg)
IPCCAT-neural 2018
17
Potential use of IPCCAT technology
![Page 18: 20180423 IPCCAT ICSDV 2018.ppt2018/04/23 · Losses due to translation are more visible for each language Needs to be evaluated for all languages 0.0% 20.0% 40.0% 60.0% 80.0% 100.0%](https://reader036.vdocument.in/reader036/viewer/2022070818/5f16671d3af36f52a95d8e4a/html5/thumbnails/18.jpg)
IPCCAT-neural practical use
18
What it could be for?
Improvement of the consistency in patent classification
Reduction of the backlog of IPC reclassification through automation of the residual IPC reclassification of patent documents after some years: Potential alternative to IPC reclassification Default
transfer
![Page 19: 20180423 IPCCAT ICSDV 2018.ppt2018/04/23 · Losses due to translation are more visible for each language Needs to be evaluated for all languages 0.0% 20.0% 40.0% 60.0% 80.0% 100.0%](https://reader036.vdocument.in/reader036/viewer/2022070818/5f16671d3af36f52a95d8e4a/html5/thumbnails/19.jpg)
IPCCAT-neural for IPC reclassification
19
Additional Challenges:
non-EN language:
Large training collection, with good IPC coverage
Consistency in legacy classification practices
Preserve possibility of language-dedicated solution
Costs containment for WIPO
Streamline production of training collection(s)
Streamline IPCCAT yearly retraining (new vocabulary)
![Page 20: 20180423 IPCCAT ICSDV 2018.ppt2018/04/23 · Losses due to translation are more visible for each language Needs to be evaluated for all languages 0.0% 20.0% 40.0% 60.0% 80.0% 100.0%](https://reader036.vdocument.in/reader036/viewer/2022070818/5f16671d3af36f52a95d8e4a/html5/thumbnails/20.jpg)
IPCCAT-neural cross lingual
IPCCAT EN
WIPO translate: XX
into EN
WIPODELTA ENText in XX
DOCDB XML
+Full Text?
IPC guess for text in XX
![Page 21: 20180423 IPCCAT ICSDV 2018.ppt2018/04/23 · Losses due to translation are more visible for each language Needs to be evaluated for all languages 0.0% 20.0% 40.0% 60.0% 80.0% 100.0%](https://reader036.vdocument.in/reader036/viewer/2022070818/5f16671d3af36f52a95d8e4a/html5/thumbnails/21.jpg)
Cross-lingual text categorization to
assist IPC reclassification
Chronology:
1. Evidence that text categorization works at IPC subgroup level with an acceptable level of precision: Done
2. Integration of IPCCAT neural at sub-group level into IPCPUB v 7.6 Done
3. Confirmation that Cross-lingual text categorization can assist in other languages than EN, even in absence of large training collections: Done
21
![Page 22: 20180423 IPCCAT ICSDV 2018.ppt2018/04/23 · Losses due to translation are more visible for each language Needs to be evaluated for all languages 0.0% 20.0% 40.0% 60.0% 80.0% 100.0%](https://reader036.vdocument.in/reader036/viewer/2022070818/5f16671d3af36f52a95d8e4a/html5/thumbnails/22.jpg)
IPCCAT-neural cross lingual prototype
60.0%
65.0%
70.0%
75.0%
80.0%
85.0%
90.0%
95.0%
100.0%
100 800 6400 51200
Precision (%)
Test with 1000 randomly selected patents in AR, DE, ES, FR,
JA, RU, ZH
Difficult to compare, not the same distribution of patents
Number of Categories
EN-NATIVE
AVERAGE-TRANSLATIONAR-TRANSLATION
DE-TRANSLATION
ES-TRANSLATION
FR-TRANSLATION
JA-TRANSLATION
RU-TRANSLATION
ZH-TRANSLATION
![Page 23: 20180423 IPCCAT ICSDV 2018.ppt2018/04/23 · Losses due to translation are more visible for each language Needs to be evaluated for all languages 0.0% 20.0% 40.0% 60.0% 80.0% 100.0%](https://reader036.vdocument.in/reader036/viewer/2022070818/5f16671d3af36f52a95d8e4a/html5/thumbnails/23.jpg)
IPCCAT-neural cross lingual prototypeTest with 500 randomly selected patents from subclass G06F
Losses due to translation are more visible for each language
Needs to be evaluated for all languages
0.0%
20.0%
40.0%
60.0%
80.0%
100.0%
100 1600 25600
Precision (%)
Number of categories
EN-NATIVE Average Dist
DE-TRANS-G06F2
JA-TRANS-G06F2
ZH-TRANS-G06F2
![Page 24: 20180423 IPCCAT ICSDV 2018.ppt2018/04/23 · Losses due to translation are more visible for each language Needs to be evaluated for all languages 0.0% 20.0% 40.0% 60.0% 80.0% 100.0%](https://reader036.vdocument.in/reader036/viewer/2022070818/5f16671d3af36f52a95d8e4a/html5/thumbnails/24.jpg)
Cross-lingual text categorization to
assist IPC reclassification
Chronology: (Still a long way to go)
4. Incentives for R&D in automated text categorization: WIPO DELTA training collection: Done
5. Propose alternatives to Default Transfer e.g. more than one symbol based on IPCCAT guesses and confidence levels for decisions, resource planning, etcL: 2019
6. Development of the production-scale solution integrating cross-lingual text categorization and WIPO translate: 2019-
2020?
7. Integration in IPC reclassification system (IPCWLMS)
2020?
24
![Page 25: 20180423 IPCCAT ICSDV 2018.ppt2018/04/23 · Losses due to translation are more visible for each language Needs to be evaluated for all languages 0.0% 20.0% 40.0% 60.0% 80.0% 100.0%](https://reader036.vdocument.in/reader036/viewer/2022070818/5f16671d3af36f52a95d8e4a/html5/thumbnails/25.jpg)
Incentive to R&D in text categorization: WIPO-Delta training collection
Incentives for research and development institutes interested
in automatic text categorization :
WIPO DELTA 2018 EN dataset available upon request
Fully specified XML format
~50 million excepts of patent documents classified in the
IPC
Complement the public WIPO-ALPHA & GAMMA datasets
See http://www.wipo.int/classifications/ipc/en/ITsupport/Categori
zation/dataset/index.html
25
![Page 26: 20180423 IPCCAT ICSDV 2018.ppt2018/04/23 · Losses due to translation are more visible for each language Needs to be evaluated for all languages 0.0% 20.0% 40.0% 60.0% 80.0% 100.0%](https://reader036.vdocument.in/reader036/viewer/2022070818/5f16671d3af36f52a95d8e4a/html5/thumbnails/26.jpg)
Thank you for your attention!
26