Near Language Identification Using NooJ Božo Bekavac, Kristina Kocijan, Marko Tadić Faculty of Humanities and Social Sciences University of Zagreb, Croatia NooJ 2014 Sassari 2014-06-04



Near Language Identification Using NooJ

Božo Bekavac, Kristina Kocijan, Marko Tadić

Faculty of Humanities and Social Sciences
University of Zagreb, Croatia

NooJ 2014, Sassari

2014-06-04


Introduction

It is not hard to distinguish very different languages automatically, but similar languages, such as Czech and Slovak, Indonesian and Malay, or Brazilian and European Portuguese, are very hard to tell apart. Even state-of-the-art statistical tools often confuse them.

We use NooJ as the core of a system designed for automatic identification of the near languages Croatian and Serbian.


Differences: Croatian vs. Serbian

Lexical level (some differences)
  Reflex of the Proto-Slavic vowel jat: ije/je (hr) vs. e (sr), e.g. milk (en): mlijeko (hr) vs. mleko (sr)
  Verbs ending in -irati (hr) vs. -ovati (sr), e.g. to employ (en): angažirati (hr) vs. angažovati (sr)

Construction of the future tense
  analytic in hr, e.g. pitat ću (I will ask)
  synthetic in sr, e.g. pitaću (I will ask)

Structures typical of each language
  Croatian: modal verb + infinitive, e.g. hoću raditi
  Serbian: modal verb + da + present, e.g. hoću da radim


Formalizing differences

We used only Croatian language resources and designed morphological grammars to recognize unknown tokens in Serbian.
  Some words specific to Serbian are left unknown, e.g. bread (en): kruh (hr) vs. hleb (sr), but this had no impact on the efficiency of the system.

Syntactic and lexical grammars focus on formalizing the differences between the languages. Examples follow…


Lexical grammars (1)

E.g. president (en): predsjednik (hr) vs. predsednik (sr)
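
The idea of recognizing Serbian tokens using only Croatian resources can be sketched outside NooJ as follows. This is a minimal illustration, not the authors' grammar: the lexicon `CROATIAN_LEXICON` and the function `croatian_counterpart` are hypothetical names, the lexicon is a four-word toy, and only the two differences named earlier (jat reflex, -ovati vs. -irati) are tried.

```python
# Toy Croatian lexicon (assumption); a real system would use full dictionaries.
CROATIAN_LEXICON = {"mlijeko", "predsjednik", "angažirati", "kruh"}


def croatian_counterpart(token):
    """Return the Croatian form an unknown token maps to, or None."""
    token = token.lower()
    if token in CROATIAN_LEXICON:
        return None  # already a known Croatian word, nothing to recover
    # Jat reflex: try replacing each "e" with "je" or "ije".
    for i, ch in enumerate(token):
        if ch == "e":
            for reflex in ("je", "ije"):
                candidate = token[:i] + reflex + token[i + 1:]
                if candidate in CROATIAN_LEXICON:
                    return candidate
    # Verb suffix: try -ovati -> -irati.
    if token.endswith("ovati"):
        candidate = token[:-len("ovati")] + "irati"
        if candidate in CROATIAN_LEXICON:
            return candidate
    return None  # stays unknown, like hleb in the example above


print(croatian_counterpart("predsednik"))
print(croatian_counterpart("hleb"))
```

A token that maps back to a Croatian entry this way would count as a Serbian lexical unit; a word such as hleb, which has no such mapping, simply stays unknown.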


Lexical grammars (2)

E.g. to meet (en): sastati (hr) vs. sastaću (sr)
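
A rough sketch of how the fused (synthetic) future could be told apart from the analytic one with a dictionary lookup. Everything here is an assumption for illustration: the toy set `INFINITIVES`, the function `future_forms`, and the simplification that a fused form is an infinitive stem plus a future clitic. The lookup is what keeps ordinary words ending in -ću or -će, such as hoću, from being counted.

```python
import re

# Toy infinitive lexicon (assumption); a real system would use a dictionary.
INFINITIVES = {"pitati", "raditi", "sastati"}
# Future clitics of the auxiliary htjeti/hteti.
CLITICS = ("ću", "ćeš", "ćemo", "ćete", "će")


def future_forms(text):
    """Return (analytic_count, synthetic_count) of future forms in text."""
    tokens = re.findall(r"\w+", text.lower())
    analytic = synthetic = 0
    for i, tok in enumerate(tokens):
        # Analytic: truncated infinitive ("pitat") followed by a clitic.
        if (tok.endswith("t") and tok + "i" in INFINITIVES
                and i + 1 < len(tokens) and tokens[i + 1] in CLITICS):
            analytic += 1
            continue
        # Synthetic: infinitive stem fused with a clitic ("pitaću").
        for clitic in CLITICS:
            if tok.endswith(clitic) and tok[:-len(clitic)] + "ti" in INFINITIVES:
                synthetic += 1
                break
    return analytic, synthetic


print(future_forms("Pitat ću ga sutra."))
print(future_forms("Sastaću se s njim, hoću."))
```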


Syntactic grammars (sr)

E.g. should do (en): treba da uradi (sr)


Syntactic grammars (hr)

E.g. should do (en): treba uraditi (hr)
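
The two syntactic patterns can be approximated in a few lines of Python. This is a crude sketch under stated assumptions, not the NooJ grammars themselves: `MODALS` is a hypothetical toy list, an infinitive is guessed from its ending (-ti or -ći), and any non-infinitive word after da is taken to be a present-tense verb.

```python
import re

# Hypothetical toy list of modal verb forms (assumption).
MODALS = {"hoću", "treba", "mora", "može"}


def looks_like_infinitive(token):
    # Heuristic only: infinitives end in -ti or -ći.
    return token.endswith(("ti", "ći"))


def count_constructions(text):
    """Count 'V da V' and 'V Vinf' patterns that follow a modal verb."""
    tokens = re.findall(r"\w+", text.lower())
    v_da_v = v_vinf = 0
    for i, tok in enumerate(tokens):
        if tok not in MODALS:
            continue
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        nxt2 = tokens[i + 2] if i + 2 < len(tokens) else ""
        if nxt == "da" and nxt2 and not looks_like_infinitive(nxt2):
            v_da_v += 1   # modal + da + (assumed) present
        elif looks_like_infinitive(nxt):
            v_vinf += 1   # modal + infinitive
    return {"V da V": v_da_v, "V Vinf": v_vinf}


print(count_constructions("Treba da uradi. Hoću da radim."))
print(count_constructions("Treba uraditi. Hoću raditi."))
```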


Implementation

Instead of NoojApply, we used a fully automated process driven by AutoHotkey (http://www.autohotkey.com/).

AutoHotkey is a scripting language for desktop automation (Max suggested):
  it emulates clicking on desktop applications
  it provides scripting language capabilities

Pros and cons are discussed in the conclusion.


System description

Open the text
Apply Croatian linguistic analysis
Count
  number of tokens
  number of Serbian lexical units
  number of syntactic constructions V da V
  number of syntactic constructions V Vinf
Make a decision based on the percentages of occurrences obtained in the steps above
Write statistics and results
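
The decision step can be sketched as below. The slides do not state the actual rule or thresholds, so the rule here (either kind of Serbian evidence is enough), the 1% lexical threshold, and the names `Counts` and `classify` are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    tokens: int                 # number of tokens in the text
    serbian_lexical_units: int  # matches of the Serbian lexical grammars
    v_da_v: int                 # modal verb + da + present
    v_vinf: int                 # modal verb + infinitive


def classify(counts, lexical_threshold_pct=1.0):
    """Return (label, stats); the rule and threshold are illustrative only."""
    if counts.tokens <= 0:
        raise ValueError("text has no tokens")
    lexical_pct = 100.0 * counts.serbian_lexical_units / counts.tokens
    constructions = counts.v_da_v + counts.v_vinf
    da_pct = 100.0 * counts.v_da_v / constructions if constructions else 0.0
    # Assumed rule: either kind of evidence is enough to label the text Serbian.
    is_serbian = lexical_pct >= lexical_threshold_pct or da_pct > 50.0
    stats = {"lexical_pct": lexical_pct, "v_da_v_pct": da_pct}
    return ("sr" if is_serbian else "hr"), stats


label, stats = classify(Counts(tokens=200, serbian_lexical_units=6,
                               v_da_v=3, v_vinf=0))
print(label, stats)
```

A text with few or no matches on either criterion falls through to "hr" under this rule, which is the kind of case where a Serbian text could end up labeled Croatian.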


Output of processing

Demo


Results

Testing was performed on 2500 articles from the SETimes corpus (http://www.setimes.com/)
  texts in Serbian and Croatian
  short news items translated from English

The system achieved a precision of 99.82%, outperforming all known systems in this task.

3 texts in Serbian were misclassified as Croatian: texts with low recall on the criteria considered.


Conclusion & future work

NooJ and AutoHotkey in combination are sufficient even for performing very complex tasks.

The system is completely automated.

Disadvantage: AutoHotkey is very dependent on computer screen resolution (automatic clicking).

Future work: there is room for improvement of the system
  to take unknown words into account
  to tune system voting
  to create lists of "forbidden" words


Thank you for your attention!