impact of automated translation on mining knowledge from text data 19. 11. 2015, brno luděk svozil
Impact of automated translation on mining knowledge from text data
19. 11. 2015, Brno
Luděk Svozil
Introduction
• Statistical and hybrid machine translation systems are gaining more attention
• Apart from commercial services like Google Translate and Bing, there are a number of projects aiming to bring the benefits of big-data knowledge to end users
Chapter 1
EU projects on the horizon
• Modern MT – aims to bring a powerful, ready-to-use MT system to desktop users
http://www.modernmt.eu/
• LTI cloud – gathers language technology components for easy use in information systems
http://www.ltinnovate.org/lticloud
• If machine translation is part of preprocessing, does it benefit the text-mining process? And how?
• Earlier experiments have shown that when scarce data are combined across different languages, MT greatly simplifies the problem
Test data and experiment
• 20,000 reviews in 5 languages from booking.com were subjected to Google machine translation and stemming; a C5.0 decision tree was then trained on each variant and evaluated using cross-validation
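The evaluation step can be sketched as follows. This is a minimal illustration, not the setup used in the experiment: the slides do not state the number of folds (k=10 is an assumption), and since C5.0 is not part of the Python standard library, a toy word-voting classifier (`train_word_vote_classifier`, a hypothetical name) stands in for the decision tree.

```python
from collections import Counter

def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size (assumes n >= k)."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_val_error(docs, labels, train_fn, k=10):
    """Average classification error over k folds; k=10 is an assumed default."""
    errors = []
    for test_idx in k_fold_indices(len(docs), k):
        held_out = set(test_idx)
        train_docs = [d for i, d in enumerate(docs) if i not in held_out]
        train_labels = [l for i, l in enumerate(labels) if i not in held_out]
        predict = train_fn(train_docs, train_labels)
        wrong = sum(predict(docs[i]) != labels[i] for i in test_idx)
        errors.append(wrong / len(test_idx))
    return sum(errors) / len(errors)

def train_word_vote_classifier(docs, labels):
    """Toy stand-in for C5.0: every known word votes for the labels it was seen with."""
    word_label_counts = {}
    for doc, label in zip(docs, labels):
        for word in set(doc.lower().split()):
            word_label_counts.setdefault(word, Counter())[label] += 1
    default = Counter(labels).most_common(1)[0][0]

    def predict(doc):
        votes = Counter()
        for word in set(doc.lower().split()):
            if word in word_label_counts:
                votes.update(word_label_counts[word])
        return votes.most_common(1)[0][0] if votes else default

    return predict

# Tiny, perfectly separable toy corpus (invented for illustration).
docs = ["good", "bad"] * 10
labels = ["pos", "neg"] * 10
error = cross_val_error(docs, labels, train_word_vote_classifier, k=5)
```

Any real classifier can be plugged in through `train_fn`, as long as it returns a function that maps a document to a predicted label.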
Results – % decrease in attributes count
| Preprocessing | ES | FR | PL | CS | DE |
|---|---|---|---|---|---|
| Translation | 24% | 17% | 42% | 40% | 29% |
| Stemming | 37% | 31% | 20% | 33% | 16% |
| Translation and stemming | 41% | 35% | 56% | 53% | 44% |
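How such a percentage could be computed is sketched below, assuming that "attributes" means the distinct terms in a bag-of-words vocabulary (the slides do not define this precisely). The `toy_stem` function is a crude, hypothetical suffix stripper used only for illustration; it is not the stemmer used in the experiment.

```python
def vocabulary(docs):
    """Distinct lowercase whitespace-separated tokens across all documents."""
    return {word for doc in docs for word in doc.lower().split()}

def toy_stem(word):
    """Crude suffix stripper for illustration only; not a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def percent_decrease(before, after):
    """Relative reduction in attribute count, in percent."""
    return 100.0 * (before - after) / before

# Invented example documents.
docs = ["rooms cleaned daily", "clean room", "cleaning the rooms"]
original = vocabulary(docs)                  # 7 distinct terms
stemmed = {toy_stem(w) for w in original}    # 4 distinct stems
print(f"{percent_decrease(len(original), len(stemmed)):.0f}% fewer attributes")
# prints "43% fewer attributes"
```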
Results – avg. classification error
| Preprocessing | ES | FR | PL | CS | DE |
|---|---|---|---|---|---|
| Original | 14.10% | 14.10% | 12.40% | 14.60% | 12.70% |
| Translated | 14.10% | 13.30% | 11.30% | 12.70% | 12.00% |
| Stemmed | 15.30% | 14.00% | 11.90% | 11.80% | 13.50% |
| Translated and stemmed | 15.50% | 15.50% | 12.80% | 13.70% | 14.10% |
• To observe how well translated data combine with native English data, a second experiment was run
• 10,000 English documents were combined with another 10,000 documents in a different language; the non-English documents were then translated with Google Translate
Results – avg. classification error
| Data | EN+FR | EN+PL | EN+DE | EN+ES |
|---|---|---|---|---|
| Original | 16.10% | 14.80% | 14.60% | 17.30% |
| Non-English language translated | 33.50% | 33.90% | 37.70% | 36.10% |
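To make the size of the gap explicit, the short sketch below computes the increase in average classification error for each language pair. The figures are copied from the slide; the variable names are chosen here for illustration.

```python
# Average classification error in percent, as reported on the slide.
original = {"EN+FR": 16.10, "EN+PL": 14.80, "EN+DE": 14.60, "EN+ES": 17.30}
translated = {"EN+FR": 33.50, "EN+PL": 33.90, "EN+DE": 37.70, "EN+ES": 36.10}

# Absolute increase in percentage points after translating the non-English half.
increase = {pair: round(translated[pair] - original[pair], 2) for pair in original}

for pair, delta in increase.items():
    print(f"{pair}: +{delta:.1f} percentage points")
```

In every pair the error roughly doubles once the non-English half is machine-translated.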
Conclusions
• MT simplifies the problem (it reduces the dictionary) without increasing classification error
• Care must be taken when combining native and translated documents
• Further details, tests, and a comparison of rule-based and MT translators can be found in my bachelor's thesis „Dolování znalostí z vícejazyčných textových dat“ ("Mining Knowledge from Multilingual Text Data"), which will be available during January to February 2016