yuliya morozova institute for informatics problems of the russian academy of sciences, moscow

EXTRACTION OF TRANSLATION CORRESPONDENCES FROM A PARALLEL CORPUS USING METHODS OF DISTRIBUTIONAL SEMANTICS

Yuliya Morozova

Institute for Informatics Problems of the Russian Academy of Sciences, Moscow

Distributional semantics

new area of linguistic research

inferring semantic properties of linguistic units from corpora

Theoretical foundations: distributional methodology by Z. Harris, F. de Saussure, L. Wittgenstein.

Distributional hypothesis: semantically similar words occur in similar contexts.

J. R. Firth: “You shall know a word by the company it keeps”.

Vector space

drink coffee – occurred 1 time

drink tea – occurred 2 times

Cosine measure of vector similarity

\[
\cos(x, y) = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}
\]
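The cosine measure can be sketched in a few lines of Python. This is a minimal illustration, not the program used in the research: the function name `cosine`, the two contexts ("drink", "read") and the counts for `book` are assumptions made for the example, and returning 0.0 for an all-zero vector is a convention chosen here.

```python
import math

def cosine(x, y):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # convention assumed here for all-zero vectors
    return dot / (norm_x * norm_y)

# Hypothetical co-occurrence counts over the contexts ("drink", "read").
coffee = [1, 0]  # "drink coffee" occurred 1 time
tea = [2, 0]     # "drink tea" occurred 2 times
book = [0, 3]    # invented counts for a word seen only with "read"

print(cosine(coffee, tea))   # same direction, so the cosine is 1.0
print(cosine(coffee, book))  # no shared context, so the cosine is 0.0
```

Because the measure depends only on the direction of the vectors, "coffee" and "tea" come out as maximally similar even though their raw counts differ.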

Main application areas

lexical ambiguity resolution

information retrieval

dictionaries of semantic relations

multilingual dictionaries

semantic maps of different domains

modelling of synonymy

document topic detection

sentiment analysis

The present research

Goal: to apply distributional semantics models to extraction of translation correspondences from a parallel corpus.

Vector space model + test corpus

Test corpus

Patent texts in French translated into Russian

Texts split into sentences

Alignment at the sentence level – manually verified (in the visual editor MakeBilingua)

Uploaded to the Sketch Engine corpus manager

Preprocessing

Lemmatization

Frequent words removed (prepositions, conjunctions, etc.)

Punctuation marks removed
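The three preprocessing steps might look roughly like this in Python. Everything named here is an assumption for illustration: `LEMMAS` and `STOP_WORDS` are toy stand-ins for a real lemmatizer and frequent-word list, the input sentence is invented, and the slides do not say which tools were actually used.

```python
import string

# Toy resources, assumed for illustration only; a real pipeline would use a
# proper lemmatizer and frequent-word list for French and for Russian.
LEMMAS = {"la": "le", "présente": "présent", "concerne": "concerner"}
STOP_WORDS = {"le", "un", "de", "et"}

def preprocess(sentence, lemmas=LEMMAS, stop_words=STOP_WORDS):
    """Strip punctuation, lemmatize, and drop frequent words."""
    punctuation = string.punctuation + "«»…"
    result = []
    for raw in sentence.lower().split():
        word = raw.strip(punctuation)       # punctuation marks removed
        if not word:
            continue
        lemma = lemmas.get(word, word)      # lemmatization (toy lookup)
        if lemma in stop_words:             # frequent words removed
            continue
        result.append(lemma)
    return result

# Invented input sentence, not quoted from the corpus.
print(preprocess("La présente invention concerne un liant minéral, "
                 "notamment hydraulique."))
```

Splitting on whitespace is a simplification; it does not handle French elision such as "l'invention", which a real tokenizer would need to cover.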

Vector space model

type of linguistic units: single words;

type of context: aligned regions;

frequency measure: Boolean frequency (equal either to 1 or 0);

method used to compute the distance between vectors: cosine measure.
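With Boolean frequencies the cosine measure takes a simpler form, which the short derivation below spells out. The notation R_x (the set of aligned regions in which word x occurs) is introduced here for convenience and does not appear in the slides; the counts in the worked example are hypothetical.

```latex
% x and y are Boolean vectors over n aligned regions, so x_i^2 = x_i.
\[
\cos(x, y)
  = \frac{\sum_{i=1}^{n} x_i y_i}
         {\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}
  = \frac{|R_x \cap R_y|}{\sqrt{|R_x|\,|R_y|}}
\]
% Hypothetical counts: |R_x| = 4, |R_y| = 5, |R_x \cap R_y| = 4.
\[
\cos(x, y) = \frac{4}{\sqrt{4 \cdot 5}} = \frac{4}{\sqrt{20}} \approx 0.894
\]
```

In other words, two words score highly when they tend to appear in the same aligned regions, relative to how many regions each of them appears in overall.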

Example (aligned region as a context)

Aligned region #1

présent invention concerner liant minéral notamment hydraulique

настоящий изобретение касаться неорганический связующий частность гидравлический связующий

Example (vector space)

             Aligned region
             #1   #2   #3
présent      1    …    …
invention    1    …    …
concerner    1    …    …
настоящий    1    …    …
изобретение  1    …    …
касаться     1    …    …
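A vector space of this kind could be built as in the following Python sketch. The function name `build_boolean_vectors` is hypothetical; aligned region #1 is the example given earlier, while regions #2 and #3 are invented placeholders, since the slides leave those columns unspecified.

```python
def build_boolean_vectors(aligned_regions):
    """Map each lemma to a 0/1 vector with one component per aligned region."""
    n = len(aligned_regions)
    vectors = {}
    for idx, (source_lemmas, target_lemmas) in enumerate(aligned_regions):
        # A set is used because the frequency is Boolean: a repeated lemma
        # still contributes a single 1 for this region.
        for lemma in set(source_lemmas) | set(target_lemmas):
            vectors.setdefault(lemma, [0] * n)[idx] = 1
    return vectors

regions = [
    # Aligned region #1, as in the example of an aligned region.
    ("présent invention concerner liant minéral notamment hydraulique".split(),
     "настоящий изобретение касаться неорганический связующий частность "
     "гидравлический связующий".split()),
    # Regions #2 and #3 are invented placeholders, not corpus data.
    ("liant hydraulique".split(), "гидравлический связующий".split()),
    ("présent invention".split(), "настоящий изобретение".split()),
]

vectors = build_boolean_vectors(regions)
print(vectors["présent"])    # [1, 0, 1]
print(vectors["связующий"])  # [1, 1, 0]
```

Note that "связующий" occurs twice in region #1 but its component is still 1, which is what distinguishes Boolean frequency from raw counts.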

Results

A list of translation correspondences.

Linguistic filter: the same part of speech.

Precision: 78%.
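One possible reading of this step is sketched below in Python: for every source word, keep the most similar target word among those with the same part of speech, then compute precision against manual judgments. The function names, the POS tags, the vectors and the judgments are all invented for illustration; in particular, the numbers do not reproduce the 78% reported here, and the slides do not state how the filter was applied in the actual program.

```python
import math

def cosine(x, y):
    """Cosine measure between two equal-length vectors (0.0 if one is all zeros)."""
    dot = sum(a * b for a, b in zip(x, y))
    norms = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norms if norms else 0.0

def extract_correspondences(src_vectors, tgt_vectors, pos):
    """For each source word, pick the most similar target word with the same POS."""
    pairs = {}
    for src, sv in src_vectors.items():
        candidates = [(cosine(sv, tv), tgt) for tgt, tv in tgt_vectors.items()
                      if pos[tgt] == pos[src]]          # linguistic filter
        if candidates:
            score, best = max(candidates)
            if score > 0:                               # assumed cut-off
                pairs[src] = best
    return pairs

def precision(pairs, judged_correct):
    """Share of extracted pairs that were judged correct."""
    if not pairs:
        return 0.0
    return sum(1 for p in pairs.items() if p in judged_correct) / len(pairs)

# Invented Boolean vectors over three aligned regions.
src_vectors = {"invention": [1, 0, 1], "liant": [1, 1, 0], "minéral": [1, 0, 0]}
tgt_vectors = {"изобретение": [1, 0, 1], "связующий": [1, 1, 0],
               "гидравлический": [1, 0, 0]}
pos = {"invention": "NOUN", "liant": "NOUN", "minéral": "ADJ",
       "изобретение": "NOUN", "связующий": "NOUN", "гидравлический": "ADJ"}

pairs = extract_correspondences(src_vectors, tgt_vectors, pos)
# Hypothetical manual judgments: "minéral" paired with "гидравлический"
# ("hydraulic") is a wrong correspondence.
judged_correct = {("invention", "изобретение"), ("liant", "связующий")}
print(pairs)
print(precision(pairs, judged_correct))  # 2 of 3 pairs are correct
```

Applying the filter before choosing the best candidate means a word with no same-POS counterpart simply yields no pair, which is one reason correspondences with different parts of speech need separate treatment.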

Correspondences with different POS

Syntactic transformations

verbal infinitive (French) → noun (Russian)

traiter (“to process”) → обработка (“processing”)

noun (French) → adjective (Russian)

crochet (“hook”) → крюкообразный (“hook-shaped”)

verbal infinitive (French) → adjective (Russian)

connaître (“to know”) → известный (“well-known”)

Correspondences with different POS

Parts of multi-word expressions

au moins (“at least”) → по меньшей мере (“at least”)

The output of the program: moins → мера

Evaluation

Eduardo Cendejas, Grettel Barceló, Alexander Gelbukh, Grigori Sidorov. Incorporating Linguistic Information to Statistical Word-Level Alignment // Proceedings of the 14th Iberoamerican Conference on Pattern Recognition, CIARP 2009, Guadalajara, Jalisco, Mexico, November 15-18, 2009.

Vector space model + similarity measures PMI, T-score, Log-likelihood ratio and Dice coefficient.

Precision – 53%.

Conclusion

Distributional semantics methodology can be used to extract translation correspondences from a parallel corpus with a high level of precision (78% on the test corpus).

It can be used to study productive syntactic transformations occurring in translation.

The present vector space model needs to be enhanced to take into account multi-word expressions.

Thank you!
