Impact of automated translation on mining knowledge from text data
19. 11. 2015, Brno
Luděk Svozil

Introduction

• Statistical and hybrid machine translation systems are gaining more attention

• Apart from commercial services like Google Translate and Bing, there are a number of projects aiming to bring the benefits of big data knowledge to end users

EU projects on the horizon

• Modern MT – aims to bring a powerful, ready-to-use MT system to desktop users

http://www.modernmt.eu/

• LTI cloud – gathers language technology components for easy use in information systems

http://www.ltinnovate.org/lticloud

• If machine translation is part of preprocessing, does it benefit the text-mining process, and how?

• Earlier experiments have shown that, when scarce data are combined across different languages, MT greatly simplifies the problem

Test data and experiment

• 20,000 reviews in 5 languages from booking.com were translated with Google's machine translation and then stemmed; a C5.0 decision tree was trained on the result and evaluated using cross-validation
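The procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the setup used in the experiment: the `translate` argument is a hypothetical placeholder for a machine translation call, `toy_stem` is a crude stand-in for a real stemmer, and a one-level decision tree (a stump) replaces C5.0. Only the cross-validation loop mirrors the evaluation step directly.

```python
import random
from collections import Counter

def toy_stem(token):
    """Crude suffix stripper; a placeholder for a real stemmer."""
    for suffix in ("ing", "ed", "ly", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text, translate=None, stem=False):
    """Optionally translate (hypothetical callable), lowercase, tokenize, optionally stem."""
    if translate is not None:
        text = translate(text)
    tokens = text.lower().split()
    return [toy_stem(t) for t in tokens] if stem else tokens

def train_stump(docs, labels):
    """One-level decision tree: split on the word that misclassifies the fewest training docs."""
    majority = Counter(labels).most_common(1)[0][0]
    model = (None, majority, majority)  # (word, label if present, label if absent)
    best_errors = sum(label != majority for label in labels)
    for word in sorted({w for doc in docs for w in doc}):
        present = Counter(l for d, l in zip(docs, labels) if word in d)
        absent = Counter(l for d, l in zip(docs, labels) if word not in d)
        if_present = present.most_common(1)[0][0]
        if_absent = absent.most_common(1)[0][0] if absent else majority
        errors = len(docs) - present[if_present] - absent[if_absent]
        if errors < best_errors:
            model, best_errors = (word, if_present, if_absent), errors
    return model

def predict(model, doc):
    word, if_present, if_absent = model
    return if_present if word in doc else if_absent

def cross_validated_error(docs, labels, k=5, seed=0):
    """Average classification error over k folds (docs are token lists)."""
    if k < 2 or len(docs) < k:
        raise ValueError("need k >= 2 and at least k documents")
    idx = list(range(len(docs)))
    random.Random(seed).shuffle(idx)
    fold_errors = []
    for fold in (idx[i::k] for i in range(k)):
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        model = train_stump([docs[i] for i in train], [labels[i] for i in train])
        wrong = sum(predict(model, docs[i]) != labels[i] for i in fold)
        fold_errors.append(wrong / len(fold))
    return sum(fold_errors) / k
```

Running `cross_validated_error` once on documents passed through `preprocess` with no options and once with `translate` and/or `stem` enabled gives the kind of comparison reported in the result tables.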

Results – % decrease in attribute count

Preprocessing             ES    FR    PL    CS    DE
translation               24%   17%   42%   40%   29%
stemming                  37%   31%   20%   33%   16%
translation and stemming  41%   35%   56%   53%   44%
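As a minimal sketch of how such a figure can be computed, assuming that "attributes" means the distinct tokens in the bag-of-words dictionary: count the distinct tokens before and after a preprocessing step and take the relative difference. The function names and the sample documents in the note below are hypothetical.

```python
def attribute_count(documents):
    """Number of distinct lowercase tokens (dictionary size) across all documents."""
    return len({token for doc in documents for token in doc.lower().split()})

def percent_decrease(before, after):
    """Relative reduction in attribute count, in percent."""
    if before <= 0:
        raise ValueError("'before' must be positive")
    return 100.0 * (before - after) / before
```

For example, the hypothetical documents "clean rooms", "cleaned room" and "cleaning rooms daily" contain 6 distinct tokens; if stemming maps them to "clean room", "clean room" and "clean room daily", 3 remain, and `percent_decrease(6, 3)` gives 50.0. In the table, the combined reduction is smaller than the sum of the two individual ones (for ES, 41% against 24% and 37%), which is consistent with the two steps partly removing the same attributes.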

Results – avg. classification error

Preprocessing            ES      FR      PL      CS      DE
Original                 14.10%  14.10%  12.40%  14.60%  12.70%
Translated               14.10%  13.30%  11.30%  12.70%  12.00%
Stemmed                  15.30%  14.00%  11.90%  11.80%  13.50%
Translated and stemmed   15.50%  15.50%  12.80%  13.70%  14.10%
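To make the table easier to read, the change against the original data can be expressed in percentage points. The snippet below uses only the values from the table; the names `LANGS`, `ERROR` and `change_vs_original` are introduced here for illustration.

```python
LANGS = ["ES", "FR", "PL", "CS", "DE"]

# Average classification error in percent, copied from the table above.
ERROR = {
    "Original": [14.1, 14.1, 12.4, 14.6, 12.7],
    "Translated": [14.1, 13.3, 11.3, 12.7, 12.0],
    "Stemmed": [15.3, 14.0, 11.9, 11.8, 13.5],
    "Translated and stemmed": [15.5, 15.5, 12.8, 13.7, 14.1],
}

def change_vs_original(variant):
    """Error change in percentage points per language; negative means lower error."""
    return {
        lang: round(err - base, 1)
        for lang, err, base in zip(LANGS, ERROR[variant], ERROR["Original"])
    }
```

`change_vs_original("Translated")` is zero or negative for every language (the largest drop is 1.9 points for CS), while `change_vs_original("Translated and stemmed")` is positive for four of the five languages.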

• To observe how well translated data combine with native English data, a second experiment was run

• 10,000 English documents were combined with another 10,000 documents in a different language; the non-English documents were then translated with Google's machine translation

Results – avg. classification error

Data                             EN+FR   EN+PL   EN+DE   EN+ES
original                         16.10%  14.80%  14.60%  17.30%
non-English language translated  33.50%  33.90%  37.70%  36.10%
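The size of the effect is easier to judge as a ratio. The short sketch below divides the error after translation by the error on the original mixed data, using only the values from the table; `MIXED_ERROR` and `error_ratio` are illustrative names.

```python
# (original, non-English part translated) average error in percent, from the table above.
MIXED_ERROR = {
    "EN+FR": (16.1, 33.5),
    "EN+PL": (14.8, 33.9),
    "EN+DE": (14.6, 37.7),
    "EN+ES": (17.3, 36.1),
}

def error_ratio(pair):
    """How many times higher the error is once the non-English half is translated."""
    original, translated = MIXED_ERROR[pair]
    return round(translated / original, 2)
```

The ratios come out at 2.08 (EN+FR), 2.29 (EN+PL), 2.58 (EN+DE) and 2.09 (EN+ES), so the error at least doubles in every language pair.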

Conclusions

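• MT simplifies the problem (it reduces the dictionary) without increasing classification error

• Care must be taken when combining native and translated documents: in the mixed-language experiment, the average classification error rose from between 14.6% and 17.3% to between 33.5% and 37.7% once the non-English half was translated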

• Further details, tests, and a comparison of rule-based and MT translators can be found in my bachelor's thesis „Dolování znalostí z vícejazyčných textových dat“ ("Mining Knowledge from Multilingual Text Data"), which will be available during January and February 2016