extraction of domain-specific bilingual lexicon from comparable corpora: compositional translation...

Post on 10-May-2015

DESCRIPTION

Material presented at the 24th International Conference on Computational Linguistics (COLING 2012), Mumbai, India. Paper download at http://hal.archives-ouvertes.fr/hal-00743807. Institutions: Laboratoire d'Informatique de Nantes Atlantique (LINA), Lingua et Machina, Gremuts.

TRANSCRIPT

Extraction of domain-specific bilingual lexicon from comparable corpora

compositional translation and ranking

Estelle Delpech¹, Béatrice Daille¹, Emmanuel Morin¹, Claire Lemaire²,³

¹LINA, Université de Nantes  ²GREMUTS, Université de Grenoble  ³Lingua et Machina

COLING’12 10/12/12 Mumbai, India

Outline

1 Context

2 Translation method

3 Ranking method

4 Results of experiments

5 Future work


Context: comparable corpora for Computer-Aided Translation

Aim: provide domain-specific bilingual lexicons to translators when no parallel data is available

⇒ Comparable corpora:

- Set of texts in languages L1 and L2 which are not translations of each other but which deal with the same subject matter, so that it is still possible to extract translation pairs


Motivations for compositional translation

Usual context-based methods [Fung, 1997]:

- 51% to 88% precision on top 20 candidates with specialized corpora [Daille and Morin, 2005]

⇒ lexicons difficult to use for translators [Delpech, 2011]

Compositional translation:

- 81% to 94% precision on Top 1 [Robitaille et al., 2006, Cartoni, 2009, Morin and Daille, 2009]

- More than 60% of terms in technical and scientific domains are morphologically complex [Namer and Baud, 2007]

- Outperforms context-based approaches for the translation of terms with compositional meaning [Morin and Daille, 2009]


Compositional translation

Compositionality

“the meaning of the whole is a function of the meaning of the parts” [Keenan and Faltz, 1985, 24-25]

Input: "ab"

Decompose: {a, b}
Translate: {α, β}
Reorder: {αβ, βα}
Select: αβ

Output: "αβ"


Related work

Applied to phrases, decomposed into words [Robitaille et al., 2006, Morin and Daille, 2009]

- rate of evaporation → taux d’évaporation

Applied to words, decomposed into morphemes [Cartoni, 2009, Harastani et al., 2012]

- cardiology → cardiologie
- ricostruire → rebuild

⇒ No approach links bound morphemes to words:

- -cyto- → cellule 'cell'
- cytotoxic → toxique pour les cellules 'toxic to the cells'


Selection and ranking methods

Select translations that occur in target texts / Web [Morin and Daille, 2009]

Select most frequent translation [Grefenstette, 1999]

Compare contexts [Garera and Yarowsky, 2008]

ML : Binary classifier [Baldwin and Tanaka, 2004]

⇒ Combination of criteria

⇒ ML : Learning-to-rank algorithms (IR)


Translation process overview

Input: "non-cytotoxic"

Decompose: {non, cyto, toxic}
Concatenate: {non, cyto, toxic}, {noncyto, toxic}, {non, cytotoxic}, {noncytotoxic}
Translate: {non, cellule, toxique}, {non, cyto, toxique}, {non, cellule, toxicité}, {non, cyto, toxicité}
Reorder: {non, toxique, cellule}, {non, cellule, toxique}, {cellule, toxique, non}
Concatenate: {non, toxique, cellule}, {nontoxique, cellule}, {non, toxiquecellule}, {nontoxiquecellule}
Match: {non, toxique, cellule}

Output: "non toxique pour les cellules" 'non toxic to the cells'


Decomposition

non-cytotoxic → {non, cyto, toxic}

Split source term into minimal components with heuristic rules:

- split on hyphens
- match substrings of the source term with:
    - a list of morphemes
    - a list of lexical items
- respect some length constraints on the substrings
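
The heuristic rules above could be sketched as follows. This is only an illustration: the morpheme and lexicon lists, the minimum length of 3 characters, and the greedy longest-match strategy are all assumptions, not the rule set used by the authors.

```python
import re

# Hypothetical resources, for illustration only.
MORPHEMES = {"non", "cyto", "cardio"}
LEXICON = {"toxic", "logy"}
MIN_LEN = 3  # assumed length constraint on substrings


def split_chunk(chunk, known, min_len=MIN_LEN):
    """Greedy longest-match segmentation of one hyphen-free chunk.

    Returns the chunk unsplit if it cannot be fully covered by known items.
    """
    parts, i = [], 0
    while i < len(chunk):
        # Try the longest substring starting at i first, down to min_len.
        for j in range(len(chunk), i + min_len - 1, -1):
            if chunk[i:j] in known:
                parts.append(chunk[i:j])
                i = j
                break
        else:
            return [chunk]  # no known substring starts here: keep chunk whole
    return parts


def decompose(term):
    """Split on hyphens, then segment each chunk into known components."""
    known = MORPHEMES | LEXICON
    components = []
    for chunk in re.split(r"-+", term.lower()):
        if chunk:
            components.extend(split_chunk(chunk, known))
    return components


print(decompose("non-cytotoxic"))
```

With these toy lists, "non-cytotoxic" yields the three components shown on the slide; a chunk that cannot be covered by known items is left whole.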


Concatenation

Generate all possible concatenations of the minimal components

Increases the chances of matching the components with entries of the dictionaries

{non, cyto, toxic} → {non, cyto, ∅}
{non, cytotoxic} → {non, cytotoxique}
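
One way to enumerate these concatenations, as a sketch: each boundary between two adjacent components is either kept or removed, which gives 2^(n-1) groupings for n components. The function name `concatenations` is hypothetical.

```python
from itertools import product


def concatenations(components):
    """Return every grouping obtained by merging adjacent components.

    For n components there are 2 ** (n - 1) groupings: each of the n - 1
    boundaries is either kept (split) or removed (merged).
    """
    if not components:
        return []
    results = []
    for cuts in product([True, False], repeat=len(components) - 1):
        grouping = [components[0]]
        for keep_boundary, comp in zip(cuts, components[1:]):
            if keep_boundary:
                grouping.append(comp)   # start a new unit
            else:
                grouping[-1] += comp    # glue onto the previous unit
        results.append(grouping)
    return results


for grouping in concatenations(["non", "cyto", "toxic"]):
    print(grouping)
```

For {non, cyto, toxic} this produces the four groupings listed in the process overview, including {non, cytotoxic}.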


Translation with direct dictionary look-up

Bilingual dictionary for lexical items:

- toxic → toxique

Morpheme translation table for bound morphemes:

- allow bound to free morpheme translation equivalence
- -cyto- → -cyto-, cellule

{-cyto-, toxic} → {-cyto-, toxique}, {cellule, toxique}
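
A minimal sketch of this look-up step, assuming two toy resources (`BILINGUAL_DICT` and `MORPHEME_TABLE` are hypothetical names and contents). Discarding a grouping as soon as one component has no translation is also an assumption.

```python
from itertools import product

# Hypothetical toy resources; real dictionaries would be far larger.
BILINGUAL_DICT = {"toxic": ["toxique"], "non": ["non"]}
MORPHEME_TABLE = {"cyto": ["cyto", "cellule"]}  # bound -> bound or free


def translate(components):
    """Return all target component lists, or [] if a component is unknown."""
    options = []
    for comp in components:
        targets = BILINGUAL_DICT.get(comp, []) + MORPHEME_TABLE.get(comp, [])
        if not targets:
            return []  # one untranslatable component invalidates the grouping
        options.append(targets)
    # Cartesian product: one translation choice per component.
    return [list(combo) for combo in product(*options)]


print(translate(["cyto", "toxic"]))
```

The morpheme table is what lets the bound morpheme -cyto- map to the free word "cellule", so both candidate lists from the slide are generated.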


Translation with variation

Morphological lexicon

- toxic → toxique → toxicité 'toxicity'

Synonyms

- toxic → toxique → vénéneux 'poisonous'

{-cyto-, toxic} → {-cyto-, toxicité}, {-cyto-, vénéneux}, {cellule, toxicité}, {cellule, vénéneux}
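
The variation step can be pictured as expanding each directly translated component with its morphological relatives and synonyms. The sketch below assumes two toy look-up tables keyed by target-language word; their names and contents are hypothetical.

```python
from itertools import product

# Hypothetical toy resources keyed by target-language word.
MORPHOLOGICAL_VARIANTS = {"toxique": ["toxicité"]}
SYNONYMS = {"toxique": ["vénéneux"]}


def variants(word):
    """Return the word's morphological relatives and synonyms (not itself)."""
    return MORPHOLOGICAL_VARIANTS.get(word, []) + SYNONYMS.get(word, [])


def expand(translation):
    """Generate variant translations that differ from the direct one."""
    options = [[w] + variants(w) for w in translation]
    return [list(c) for c in product(*options) if list(c) != translation]


for direct in (["cyto", "toxique"], ["cellule", "toxique"]):
    print(expand(direct))
```

Applied to the two direct translations of {-cyto-, toxic}, this yields the four variant candidates shown on the slide.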


Reordering

No translation patterns or reordering rules

Permute the translated components:

{cellule, toxique} → {cellule, toxique}, {toxique, cellule}
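
Since no reordering rules are used, this step amounts to enumerating permutations. A minimal sketch (the function name `reorder` is hypothetical):

```python
from itertools import permutations


def reorder(components):
    """Return all distinct orderings of the translated components."""
    seen, orders = set(), []
    for perm in permutations(components):
        if perm not in seen:  # skip duplicates caused by repeated components
            seen.add(perm)
            orders.append(list(perm))
    return orders


print(reorder(["cellule", "toxique"]))
```

The number of orderings grows as n!, which stays manageable only because terms have few components.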


Concatenation

Recreate target words by generating all possible concatenations of the components:

{toxique, cellule} → {toxique cellule}, {toxiquecellule}
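
A sketch of this target-side concatenation, written so that the output is a list of candidate strings: adjacent components are joined either with a space (separate words) or with nothing (a single word). The function name `target_candidates` is hypothetical.

```python
def target_candidates(components):
    """Join adjacent components with or without a space, in every combination."""
    if not components:
        return []
    candidates = [components[0]]
    for comp in components[1:]:
        # Each existing prefix branches into a spaced and a glued variant.
        candidates = [prefix + sep + comp
                      for prefix in candidates
                      for sep in (" ", "")]
    return candidates


print(target_candidates(["toxique", "cellule"]))
```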


Selection

Match target words with the words of the target corpus

Allow at most 3 stop words between two words

{toxique cellule} → "toxique pour les cellules" 'toxic to the cells'
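
The matching step could be approximated with a regular expression, as sketched below. The stop-word list is a hypothetical sample, and the optional plural "s" is a crude stand-in for whatever lemmatization the real system applies; neither is taken from the paper.

```python
import re

# Hypothetical French stop-word list, for illustration only.
STOP_WORDS = ["pour", "les", "la", "le", "de", "des", "aux", "à"]


def build_pattern(candidate, max_gap=3):
    """Regex matching the candidate's words with up to max_gap stop words between."""
    stop = "|".join(re.escape(w) for w in STOP_WORDS)
    gap = r"\s+(?:(?:%s)\s+){0,%d}" % (stop, max_gap)
    # Optional plural "s": an assumed shortcut replacing real lemmatization.
    words = [re.escape(w) + r"s?" for w in candidate.split()]
    return re.compile(r"\b" + gap.join(words) + r"\b", re.IGNORECASE)


def find_in_corpus(candidate, corpus_text):
    """Return the first attested form of the candidate, or None."""
    match = build_pattern(candidate).search(corpus_text)
    return match.group(0) if match else None


corpus = "Ce composé est toxique pour les cellules du foie."
print(find_in_corpus("toxique cellule", corpus))
```

Candidates that never occur in the target corpus, such as "toxiquecellule", return None and are discarded.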


Target term frequency

Number of occurrences of the target term divided by the total number of occurrences in the target texts

\[ \mathrm{Freq}(t) = \frac{\mathrm{occ}(t)}{N} \]
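
As a small worked example of this ratio, the sketch below counts occurrences in a hypothetical list of terms extracted from the target texts. Taking N as the total number of term occurrences in that list is an assumption; the slide does not say exactly what is counted.

```python
from collections import Counter


def freq(term, occurrences):
    """Freq(t) = occ(t) / N, where N is the total number of counted occurrences."""
    counts = Counter(occurrences)
    total = sum(counts.values())
    return counts[term] / total if total else 0.0


# Hypothetical list of term occurrences extracted from the target texts.
observed = ["cellule", "toxique pour les cellules", "cellule", "tumeur"]
print(freq("cellule", observed))  # 2 occurrences out of 4
```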


Context similarity measure

Corresponds to context-based approaches

Collect words co-occurring with the source and target terms in a window of 5 words

Normalize co-occurrences with the log-likelihood ratio

Compare contexts with the weighted Jaccard index

$$\mathrm{Cont}(s, t) = \frac{\sum_{w \in s \cap t} \min(c(s, w), c(t, w))}{\sum_{w \in s \cup t} \max(c(s, w), c(t, w))}$$
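The weighted Jaccard comparison can be sketched as below. The sketch assumes that both context vectors are dictionaries of non-negative weights keyed by words of the same language (that is, the source context has already been mapped into the target language, as is usual in context-based approaches). `weighted_jaccard` is a hypothetical name and the weights are made up.

```python
def weighted_jaccard(ctx_s, ctx_t):
    """Sum of minima over shared words divided by sum of maxima over all words."""
    shared = ctx_s.keys() & ctx_t.keys()
    union = ctx_s.keys() | ctx_t.keys()
    numerator = sum(min(ctx_s[w], ctx_t[w]) for w in shared)
    # A word missing from one context counts with weight 0 on that side.
    denominator = sum(max(ctx_s.get(w, 0.0), ctx_t.get(w, 0.0)) for w in union)
    return numerator / denominator if denominator else 0.0

ctx_source = {"tumeur": 4.0, "traitement": 2.0}  # made-up weights
ctx_target = {"tumeur": 3.0, "dose": 1.0}        # made-up weights
print(weighted_jaccard(ctx_source, ctx_target))  # 3 / (4 + 2 + 1)
```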

Part-of-speech translation probability

Probability that a source term with part of speech A translates to a target term with part of speech B

$$\mathrm{Pos}(s, t) = P(\mathrm{pos}(t) \mid \mathrm{pos}(s)) = P(B \mid A)$$

Acquired from POS-tagged parallel corpora [Tiedemann, 2009] with the word alignment software AnyMalign [Lardrilleux, 2008]
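One plausible way to obtain P(B|A) from word-aligned, POS-tagged data is a relative-frequency estimate. The slide does not say how the probabilities were estimated, so this is an assumption; `pos_translation_probs` and the tag pairs are illustrative.

```python
from collections import Counter, defaultdict

def pos_translation_probs(aligned_pos_pairs):
    """Estimate P(target POS | source POS) by relative frequency.

    aligned_pos_pairs: list of (source_pos, target_pos), one per aligned word pair.
    """
    pair_counts = Counter(aligned_pos_pairs)
    source_counts = Counter(src for src, _ in aligned_pos_pairs)
    probs = defaultdict(dict)
    for (src, tgt), n in pair_counts.items():
        probs[src][tgt] = n / source_counts[src]
    return dict(probs)

pairs = [("ADJ", "ADJ"), ("ADJ", "ADJ"), ("ADJ", "NOUN"), ("NOUN", "NOUN")]
probs = pos_translation_probs(pairs)
print(probs["ADJ"])  # ADJ -> ADJ gets 2/3, ADJ -> NOUN gets 1/3
```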

Resources reliability score

Some translation resources might give more reliable translations than others

- e.g. bilingual dictionary > synonyms
- score = mean of the reliability of the resources used for translating the components

$$\mathrm{Reso}(t = \{c_1, \ldots, c_n\}) = \frac{\sum_{i=1}^{n} \mathrm{resource\_reliability}(c_i)}{n}$$

Tuned on training data
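The mean-reliability score is simple to sketch. The reliability values below are invented for illustration (the slides only state that a bilingual dictionary is more reliable than synonyms and that values are tuned on training data); `RELIABILITY` and `reso` are hypothetical names.

```python
# Made-up reliability values; the real ones were tuned on training data.
RELIABILITY = {"dictionary": 1.0, "synonyms": 0.5}

def reso(resources_used, reliability=RELIABILITY):
    """Mean reliability of the resources used to translate each component."""
    return sum(reliability[r] for r in resources_used) / len(resources_used)

# One component translated with the dictionary, the other with synonyms.
print(reso(["dictionary", "synonyms"]))  # prints 0.75
```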

Combination

Linear combination of the four criteria: Frequency, Context, Part-of-speech translation probability and Resources reliability

$$\mathrm{Combi}(t, s) = \mathrm{Freq}(t) + \mathrm{Cont}(s, t) + \mathrm{Pos}(s, t) + \mathrm{Reso}(t)$$
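Ranking with the combined score can be sketched as an unweighted sum followed by a sort. The candidate translations come from the slides, but the score values are invented for illustration; `combi` and `rank` are hypothetical names.

```python
def combi(scores):
    """Unweighted sum of the four criteria for one candidate translation."""
    return scores["freq"] + scores["cont"] + scores["pos"] + scores["reso"]

def rank(candidates):
    """Sort (translation, scores) pairs from best to worst combined score."""
    return sorted(candidates, key=lambda item: combi(item[1]), reverse=True)

candidates = [  # made-up scores
    ("cardiaque toxicité", {"freq": 0.001, "cont": 0.10, "pos": 0.20, "reso": 0.75}),
    ("toxicité cardiaque", {"freq": 0.004, "cont": 0.35, "pos": 0.60, "reso": 0.75}),
]
print([translation for translation, _ in rank(candidates)])
# prints ['toxicité cardiaque', 'cardiaque toxicité']
```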

Machine learning

Learning-to-rank algorithms used in IR for ranking documents

Tried 3 algorithms implemented in the RankLib software¹

- AdaRank [Li and Xu, 2007]
- Coordinate Ascent [Metzler and Croft, 2000]
- LambdaMart [Wu et al., 2010]

Features: Freq, Cont, Pos, Reso

¹ http://people.cs.umass.edu/~vdang/ranklib.html
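Learning-to-rank tools such as RankLib read training data in the LETOR-style text format, one candidate per line: a relevance label, a query id, then numbered features. The sketch below shows how the four features could be serialized, assuming each source term acts as one query and the manual score is the label. The helper `to_letor_line` and the sample values are hypothetical.

```python
def to_letor_line(label, qid, features, comment=""):
    """Format one candidate as '<label> qid:<qid> 1:<f1> 2:<f2> ... # <comment>'."""
    feats = " ".join(f"{i}:{value}" for i, value in enumerate(features, start=1))
    line = f"{label} qid:{qid} {feats}"
    return f"{line} # {comment}" if comment else line

# Features in the order Freq, Cont, Pos, Reso (made-up values).
print(to_letor_line(2, 7, [0.004, 0.35, 0.6, 0.75], "toxicité cardiaque"))
# prints 2 qid:7 1:0.004 2:0.35 3:0.6 4:0.75 # toxicité cardiaque
```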

4 Results of experiments

Corpora

Language pairs: English → French, English → German

Domain: breast cancer

Size: ≈ 400k words per language

Lexicons

Morpheme translation table (hand-crafted)

General language dictionary (Xelda)

Synonyms (Xelda)

Domain-specific dictionary: cognates extracted from the corpus [Hauer and Kondrak, 2011]

Morphological families [Porter, 1980]

Training and evaluation datasets

EVALUATION: ≈ 100 source terms

- source terms in the UMLS meta-thesaurus with translation(s) in the target texts

TRAINING: ≈ 600 source terms

- source terms for which a translation could be generated and whose translation(s) are in the target texts
- generated translations were scored manually

⇒ evaluation and training datasets are disjoint

⇒ source terms are morphologically complex words with no translation in the dictionary

Results for translation generation

                                      EN → FR     EN → DE
# source terms                        126         90
# at least 1 translation              86 (68%)    56 (62%)

Among source terms with at least 1 translation:

                                      EN → FR     EN → DE
# at least 1 translation              86          56
1 trans. in UMLS                      68 (79%)    40 (71%)
1 trans. in UMLS or judged correct    81 (94%)    51 (91%)

Results for translation ranking

                 EN → FR   EN → DE   Average
Random           .83       .80       .815
Freq             .92       .84       .88
Cont             .90       .82       .86
Pos              .88       .91       .895
Reso             .92       .82       .87
Combination      .93       .89       .91
ML AdaRank       .90       .84       .87
ML CoordAsc      .93       .89       .91
ML LambdaMart    .86       .88       .87

Table: Top1 translation in UMLS or judged correct
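The figures in this table are top-1 precision values: the share of source terms whose best-ranked translation is accepted. A minimal sketch of that computation follows; the acceptance test is passed in as a function because the slides combine UMLS lookup with manual judgment. `top1_precision` and the toy gold set are hypothetical.

```python
def top1_precision(ranked, is_correct):
    """Share of source terms whose top-ranked candidate is accepted.

    ranked: dict mapping each source term to its candidates, best first.
    is_correct: function (source, translation) -> bool.
    """
    if not ranked:
        return 0.0
    hits = sum(
        1 for source, candidates in ranked.items()
        if candidates and is_correct(source, candidates[0])
    )
    return hits / len(ranked)

# Toy gold standard, for illustration only.
gold = {"cardiotoxicity": {"toxicité cardiaque"}, "treatable": {"traitable"}}
ranked = {
    "cardiotoxicity": ["toxicité cardiaque", "cardiaque toxicité"],
    "treatable": ["pouvoir le traiter"],
}
print(top1_precision(ranked, lambda s, t: t in gold[s]))  # prints 0.5
```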

Silence analysis

Missing translation in resources (≈30%)

Target term is not compositional (≈30%)
- breastfeeding → allaitement (FR), stillen (DE)

Lexical divergence (≈20%)
- radiosensitivity → Strahlentoleranz, sensitivity ≠ Toleranz

Additional elements (≈13%)
- postpartum → postpartalperiod

Error analysis

Problems in word reordering
- self-examination → untersuchung selbst ‘examination self’

Wrong or inappropriate translations
- in-patient → pas malade ‘not ill’
  - in → “inside” → inside patient
  - in → “inverse” → not a patient

Impact of fertile translations

                      EN → FR   EN → DE
exact translations    21%       10%
wrong translations    50%       80%

Table: % of fertile translations

German (Germanic language): tendency to agglutination
- oestrogen-independent → Östrogen-unabhängige

French (Romance language): creates phrases more easily
- oestrogen-independent → indépendant des œstrogènes

Future work

Improve quality of linguistic resources
- morphological derivation rules instead of stemming
- use of a thesaurus

Try translation patterns on top of permutations

Try learning morpheme translation equivalences from
- cognates
- bilingual dictionaries
- out-of-domain parallel data

Thank you for your attention.

estelle.delpech@univ-nantes.fr
beatrice.daille@univ-nantes.fr
emmanuel.morin@univ-nantes.fr
cl@lingua-et-machina.com

ADDITIONAL SLIDES

Exact translations

Non-fertile:
- pathophysiological → physiopathologique
- overactive → überaktiv

Fertile:
- cardiotoxicity → toxicité cardiaque ‘cardiac toxicity’
- mastectomy → ablation der brust ‘ablation of the breast’

Morphological variants

Non-fertile:
- dosimetry → dosimétrique ‘dosimetric’
- radiosensitivity → strahlenempfindlich ‘radiosensitive’

Fertile:
- milk-producing → production de lait ‘production of milk’
- selfexamination → selbst untersuchen ‘self examine’

Inexact but semantically related

Non-fertile:
- oncogene → oncogenèse ‘oncogenesis’
- breakthrough → durchbrechen ‘break’

Fertile:
- chemoradiotherapy → chemotherapie oder strahlen ‘chemotherapy or radiation’
- treatable → pouvoir le traiter ‘can treat it’

Wrong translations

Non-fertile:
- immunoscore → immunomarquer ‘immunostain’
- check-in → unkontrollieren ‘uncontrolled’

Fertile:
- bloodstream → fliessen mehr blut ‘more blood flow’
- risk-reducing → risque de réduire ‘risk of reducing’

References I

Baldwin, T. and Tanaka, T. (2004). Translation by machine of complex nominals. In Proceedings of the ACL 2004 Workshop on Multiword expressions: Integrating Processing, pages 24–31, Barcelona, Spain.

Bo, L. and Gaussier, E. (2010). Improving corpus comparability for bilingual lexicon extraction from comparable corpora. In 23rd International Conference on Computational Linguistics, pages 23–27, Beijing, China.

Cartoni, B. (2009). Lexical morphology in machine translation: A feasibility study. In Proceedings of the 12th Conference of the European Chapter of the ACL, pages 130–138, Athens, Greece.

Daille, B. and Morin, E. (2005). French-English terminology extraction from comparable corpora. In Proceedings, 2nd International Joint Conference on Natural Language Processing, volume 3651 of Lecture Notes in Computer Science, pages 707–718, Jeju Island, Korea. Springer.

Delpech, E. (2011). Evaluation of terminologies acquired from comparable corpora: an application perspective. In Proceedings of the 18th Nordic Conference of Computational Linguistics (NODALIDA 2011), volume 11 of NEALT Proceedings Series, pages 66–73, Riga, Latvia. Pedersen B.S., Nespore G., Skadina I.

Fung, P. (1997). Finding terminology translations from non-parallel corpora. Pages 192–202, Hong Kong.

Garera, N. and Yarowsky, D. (2008). Translating compounds by learning component gloss translation via multiple languages. In Proceedings of the 3rd International Joint Conference on Natural Language Processing, volume 1, pages 403–410, Hyderabad, India.

References II

Grefenstette, G. (1999). The world wide web as a resource for example-based machine translation tasks. ASLIB’99 Translating and the computer, 21.

Harastani, R., Daille, B., and Morin, E. (2012). Neoclassical compound alignments from comparable corpora. In Proceedings of the 13th International Conference on Computational Linguistics and Intelligent Text Processing, volume 2, pages 72–82, New Delhi, India.

Hauer, B. and Kondrak, G. (2011). Clustering semantically equivalent words into cognate sets in multilingual lists. In Proceedings of the 5th International Joint Conference on Natural Language Processing, pages 865–873, Chiang Mai, Thailand.

Keenan, E. L. and Faltz, L. M. (1985). Boolean semantics for natural language. D. Reidel, Dordrecht, Holland.

Lardrilleux, A. (2008). A truly multilingual, high coverage, accurate, yet simple, sub-sentential alignment method.

Li, H. and Xu, J. (2007). Adarank: A boosting algorithm for information retrieval. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pages 391–398, Amsterdam, The Netherlands.

Metzler, D. and Croft, W. B. (2000). Linear feature-based models for information retrieval. Information Retrieval, 10(3):257–274.

References III

Morin, E. and Daille, B. (2009). Compositionality and lexical alignment of multi-word terms. In Language Resources and Evaluation (LRE), volume 44 of Multiword expression: hard going or plain sailing, pages 79–95. P. Rayson, S. Piao, S. Sharoff, S. Evert, B. Villada Moiron, Springer Netherlands edition.

Morin, E. and Daille, B. (2010). Compositionality and lexical alignment of multi-word terms. In Rayson, P., Piao, S., Sharoff, S., Evert, S., and B., V. M., editors, Language Resources and Evaluation (LRE), volume 44 of Multiword expression: hard going or plain sailing, pages 79–95. Springer Netherlands.

Namer, F. and Baud, R. (2007). Defining and relating biomedical terms: Towards a cross-language morphosemantics-based system. International Journal of Medical Informatics, 76(2-3):226–33.

Porter, M. F. (1980). An algorithm for suffix stripping. Program, 14(3):130–137.

Robitaille, X., Sasaki, X., Tonoike, M., Sato, S., and Utsuro, S. (2006). Compiling French-Japanese terminologies from the web. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics, pages 225–232, Trento, Italy.

Tiedemann, J. (2009). News from opus - a collection of multilingual parallel corpora with tools and interfaces.

Wu, Q., Burges, J. C., Svore, K., and Gao, J. (2010). Adapting boosting for information retrieval measures. Journal of Information Retrieval, 13(3):254–270.
