Mutual bilingual terminology extraction

Le An Ha*, Gabriela Fernandez**, Ruslan Mitkov*, Gloria Corpas***

* University of Wolverhampton
** Universidad de Sevilla
*** Universidad de Malaga

E-mail: *{l.a.ha,r.mitkov}@wlv.ac.uk, **[email protected], ***[email protected]

Introduction

• Terms and terminology
– Terms: linguistic units which have specialised use.
– Terminology: the system of terms in a subject field.
– Terminology is vital for specialised communication, in both monolingual and multilingual contexts.

Monolingual and multilingual terminology processing

• Monolingual terminology processing
– Three steps: extraction, validation, and organisation.
– Automatic extraction approaches: linguistic (may produce noise), statistical (may overlook important but low-frequency terms), and hybrid.

• Bilingual/multilingual term extraction
– The same three steps as in monolingual terminology processing: extraction, validation, and organisation.
– Relies on parallel corpora aligned at a certain level.
– Different models are used to align term candidates.
– Alignment is treated as an independent step.

Our approach: mutual bilingual term extraction

• Alignment plays an active role in term extraction.

• Automatic alignment is used to propagate the strengths of terminology extraction from one language into another.

• The approach relies on the availability of parallel corpora aligned at sentence level.

Mutual term extraction: three steps

• 1: lists of term candidates are extracted for the source and target languages;

• 2: term candidates from the target language are aligned to those in the source language;

• 3: if a term candidate in the target language is aligned to a term candidate in the source language, its term score is increased, i.e. the candidate is promoted.

• Steps 1-3 can be repeated many times.

Monolingual term extraction

• Lexical-syntactic-statistical approach
– Lexical-syntactic POS patterns
  • English: [AN]*(NP)?[AN]*N
  • Spanish: N[NA]*(PN)?[NA]*
– Statistical measures
  • Different measures tested
  • Frequency is chosen
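The two POS patterns can be read as regular expressions over a string of one-letter tags. The sketch below is illustrative only: it assumes A = adjective, N = noun and P = preposition (the slides do not spell out the tag alphabet), assumes the text is already POS-tagged, and uses hypothetical helper names (`candidates`, `term_frequencies`).

```python
import re
from collections import Counter

# Assumed tag alphabet: A = adjective, N = noun, P = preposition.
ENGLISH = re.compile(r"[AN]*(NP)?[AN]*N")
SPANISH = re.compile(r"N[NA]*(PN)?[NA]*")

def candidates(tagged, pattern):
    """Return token spans whose tag sequence matches `pattern`.

    tagged: list of (token, tag) pairs; any tag other than A, N or P
    is mapped to "x" so that it can never be part of a match.
    """
    tags = "".join(tag if tag in ("A", "N", "P") else "x" for _, tag in tagged)
    return [
        " ".join(token for token, _ in tagged[m.start():m.end()])
        for m in pattern.finditer(tags)
    ]

def term_frequencies(tagged_sentences, pattern):
    """Score each candidate by raw frequency, the measure chosen on the slide."""
    counts = Counter()
    for sentence in tagged_sentences:
        counts.update(candidates(sentence, pattern))
    return counts

english = [("the", "D"), ("enlarged", "A"), ("lymph", "N"), ("node", "N"),
           ("was", "V"), ("removed", "V")]
spanish = [("el", "D"), ("ganglio", "N"), ("linfático", "A"), ("de", "P"),
           ("la", "D"), ("axila", "N")]

print(candidates(english, ENGLISH))
print(candidates(spanish, SPANISH))
```

Because `finditer` returns the leftmost, greedy, non-overlapping matches, nested sub-terms (for example "lymph node" inside "enlarged lymph node") are not listed separately in this sketch.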

Term alignment

• Contingency table-based method: log-likelihood is used to estimate how likely it is that a term candidate in the source language is translated as a given term candidate in the target language.

• The table is built using a parallel corpus aligned at sentence level

Contingency table for “lymph node” and “ganglio linfático”

                     lymph node   ¬lymph node   total
ganglio linfático    18 (c12)     7             25 (c2)
¬ganglio linfático   4            1865          1869
total                22 (c1)      1872          1894 (N)
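As a rough illustration of how such a table could be built and scored, the sketch below counts co-occurrences over sentence-aligned segments and computes a log-likelihood ratio statistic (G²). The slides do not give the exact formula, so the G² form used here is an assumption, and the function names and the substring-based matching are hypothetical simplifications.

```python
import math

def contingency(segments, src_term, tgt_term):
    """Count segment pairs by presence of each term.

    segments: list of (source_sentence, target_sentence) pairs.
    Returns (both, target_only, source_only, neither), matching the
    table layout above: rows are the target term, columns the source term.
    """
    both = tgt_only = src_only = neither = 0
    for src, tgt in segments:
        in_src, in_tgt = src_term in src, tgt_term in tgt
        if in_src and in_tgt:
            both += 1
        elif in_tgt:
            tgt_only += 1
        elif in_src:
            src_only += 1
        else:
            neither += 1
    return both, tgt_only, src_only, neither

def log_likelihood(a, b, c, d):
    """G^2 = 2 * sum(O * ln(O / E)) over the four cells of a 2x2 table."""
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    observed = ((a, b), (c, d))
    total = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # an empty cell contributes nothing to the sum
                expected = rows[i] * cols[j] / n
                total += o * math.log(o / expected)
    return 2.0 * total

# Cell counts taken from the "lymph node" / "ganglio linfático" table.
score = log_likelihood(18, 7, 4, 1865)
print(score)
```

A table whose cells are exactly what independence would predict scores zero, while the strongly associated pair in the table above scores far higher.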

Boosting algorithms

• Hypothesis: the term score of a term candidate in one language can be used to improve the term score of its aligned candidate in the other language, and vice versa via boosting processes

• Given that:

AL(T1,T2): alignment score of the two term candidates T1 and T2.

TCs[T]: term score of the candidate T in the source language.

TCt[T]: term score of the candidate T in the target language.

BT(TC1,TC2): boosting function, i.e. how the term score of the aligned term affects the target term score. Example: simple addition, BT(TC1,TC2)=TC1+TC2.

Boosting algorithms (cont.)

• Single boosting: the boosting process is performed on the target language only.

Foreach term candidate Tt in the target language
    Ts=argmax(AL(Tt,Ti));
    TCt[Tt]=BT(TCs[Ts],TCt[Tt]);

• Double boosting: the boosting process is performed on both the source and target languages.

Foreach term candidate Ts in the source language
    Tt=argmax(AL(Ts,Ti));
    TCs[Ts]=BT(TCs[Ts],TCt[Tt]);

Foreach term candidate Tt in the target language
    Ts=argmax(AL(Tt,Ti));
    TCt[Tt]=BT(TCs[Ts],TCt[Tt]);

• Recursive boosting: the boosting process is repeated for both languages until the term candidate lists stabilise.
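The three variants can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: alignment scores are held in a dictionary keyed by (source, target), a candidate is boosted only when its best alignment score is positive, "stabilised" is interpreted as the ranking no longer changing, and `max_rounds` is a hypothetical safety limit.

```python
def add(aligned_score, own_score):
    """Simple addition, the boosting function BT named on the slides."""
    return aligned_score + own_score

def boost(own, other, align, bt=add):
    """One boosting pass over `own`, using the scores in `other`.

    align(term, candidate) returns the alignment score AL.
    Candidates with no positive alignment keep their score (assumption).
    """
    boosted = {}
    for term, score in own.items():
        best = max(other, key=lambda o: align(term, o), default=None)
        if best is not None and align(term, best) > 0:
            boosted[term] = bt(other[best], score)
        else:
            boosted[term] = score
    return boosted

def single_boost(tcs, tct, al):
    """Boost target-language scores only."""
    return boost(tct, tcs, lambda t, s: al.get((s, t), 0.0))

def double_boost(tcs, tct, al):
    """Boost source scores, then target scores using the boosted source."""
    new_tcs = boost(tcs, tct, lambda s, t: al.get((s, t), 0.0))
    new_tct = boost(tct, new_tcs, lambda t, s: al.get((s, t), 0.0))
    return new_tcs, new_tct

def ranking(scores):
    """Candidates ordered by descending score, ties broken alphabetically."""
    return sorted(scores, key=lambda t: (-scores[t], t))

def recursive_boost(tcs, tct, al, max_rounds=20):
    """Repeat double boosting until both rankings stop changing."""
    for _ in range(max_rounds):
        new_tcs, new_tct = double_boost(tcs, tct, al)
        stable = (ranking(new_tcs) == ranking(tcs)
                  and ranking(new_tct) == ranking(tct))
        tcs, tct = new_tcs, new_tct
        if stable:
            break
    return tcs, tct

# Toy scores and alignments, invented for illustration only.
tcs = {"lymph node": 22, "cancer": 40, "the doctor": 30}
tct = {"ganglio linfático": 25, "cáncer": 38, "el lunes": 26}
al = {("lymph node", "ganglio linfático"): 150.0, ("cancer", "cáncer"): 200.0}

print(ranking(single_boost(tcs, tct, al)))
```

In this toy example the unaligned candidate "el lunes" starts above "ganglio linfático" and drops below it after a single boosting pass, which is the promotion effect described in step 3.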

Parameters

• Factors affecting the outcome of the proposed algorithms: the alignment function AL, the mechanism to calculate the initial term scores TCs and TCt, and the boosting function BT.

• Different combinations of these functions have been experimented with.

• The best term score function is frequency, and the best boosting function is simple addition.
– In future research, we propose several probabilistic models which provide a better probabilistic foundation for the boosting function.

Evaluation: data, gold standard, and evaluation metrics

• Data
– MedlinePlus parallel texts (English/Spanish) on the topic of cancer
  • 9,250 segments for each language
  • 31,498 English words, 30,344 Spanish words
  • Aligned with Trados WinAlign, manually corrected

• Gold standard
– 389 English terms, 442 Spanish terms, and 357 term pairs have been validated and used as a gold standard.

• Evaluation metrics
– F-measure
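For reference, the F-measure of a ranked candidate list against a gold standard could be computed as below. The slides do not say which variant is used, so the balanced F1 (harmonic mean of precision and recall) and the top-n cut-off are assumptions, and the function names are hypothetical.

```python
def f_measure(candidates, gold):
    """Balanced F-measure of a candidate set against a gold-standard set."""
    cand, gold = set(candidates), set(gold)
    true_positives = len(cand & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(cand)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)

def f_at_n(ranked, gold, n):
    """F-measure of the top-n candidates of a ranked list."""
    return f_measure(ranked[:n], gold)

# Toy data, invented for illustration only.
ranked = ["cancer", "lymph node", "the doctor", "biopsy", "monday"]
gold = {"cancer", "lymph node", "tumour"}
print(f_at_n(ranked, gold, 4))
```

Evaluating at several cut-offs gives a curve of F-measure against the number of candidates.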

Evaluation: results

• Alignment accuracy
– In total, the algorithm suggests 472 translation pairs, of which 374 are confirmed as correct translations. This gives an alignment accuracy of approximately 0.8 (374/472 ≈ 0.79).

• Term extraction performance: improved by 10 to 25%

Results (cont.)

[Figure: F-measure (y-axis, 0.5 to 0.75) against number of candidates (x-axis, 400 to 800). Series: English TF; Spanish TF; English TF (Boosted); Spanish TF (Boosted); English converge boosted; Spanish converge boosted.]

Conclusion and future directions

• A promising approach, but

• More research will be needed

• A better mathematical foundation:
– Probabilistic models
– More experiments

• Other domains and language pairs
– Legal
– English-Hindi

Thank you very much

Questions? Comments? Criticisms?
