EXTRACTION OF TRANSLATION CORRESPONDENCES FROM A PARALLEL CORPUS
USING METHODS OF DISTRIBUTIONAL SEMANTICS
Yuliya Morozova
Institute for Informatics Problems of the Russian Academy of Sciences, Moscow
Distributional semantics
new area of linguistic research
inferring semantic properties of linguistic units from corpora
Theoretical foundations: distributional methodology by Z. Harris, F. de Saussure, L. Wittgenstein.
Distributional hypothesis: semantically similar words occur in similar contexts.
J. R. Firth: “You shall know a word by the company it keeps.”
Vector space
drink coffee – occurred 1 time
drink tea – occurred 2 times
Cosine measure of vector similarity

\[
\cos(x, y) = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^{2}}\,\sqrt{\sum_{i=1}^{n} y_i^{2}}}
\]
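As a minimal sketch, the cosine measure can be computed directly from co-occurrence counts. The "drink" counts for coffee (1) and tea (2) come from the slide; the other two context dimensions and the word "car" are assumptions added only to make the vectors non-trivial.

```python
import math

def cosine(x, y):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_x * norm_y)

# Context dimensions: (drink, pour, park).
# Only the "drink" counts are from the slide; the rest are invented.
coffee = [1, 1, 0]
tea = [2, 1, 0]
car = [0, 0, 3]

print(cosine(coffee, tea))  # close to 1: similar contexts
print(cosine(coffee, car))  # 0: no shared contexts
```

Words that share contexts get a value near 1, and words with no shared contexts get 0, which is the behaviour the distributional hypothesis relies on.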
Main application areas
lexical ambiguity resolution
information retrieval
dictionaries of semantic relations
multilingual dictionaries
semantic maps of different domains
modelling of synonymy
document topic detection
sentiment analysis
The present research
Goal: to apply distributional semantics models to the extraction of translation correspondences from a parallel corpus.
Vector space model + test corpus
Test corpus
Patent texts in French translated into Russian
Texts split into sentences
Alignment at the sentence level, manually verified (in the visual editor MakeBilingua)
Uploaded to the Sketch Engine corpus manager
Preprocessing
Lemmatization
Frequent words removed (prepositions, conjunctions, etc.)
Punctuation marks removed
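The last two preprocessing steps could look roughly like the sketch below. It assumes the text is already lemmatized and tokenized; the stop list `STOP_WORDS` and the function name `preprocess` are hypothetical and not taken from the talk.

```python
import string

# Hypothetical, deliberately tiny stop list; a real one would be much longer.
STOP_WORDS = {"le", "un", "de", "à", "et", "en", "в", "и", "к", "на"}

# ASCII punctuation plus a few marks common in French and Russian text.
PUNCTUATION = set(string.punctuation) | {"«", "»", "…", "–"}

def preprocess(lemmas):
    """Drop punctuation tokens and frequent function words from a lemma list."""
    return [
        tok
        for tok in (t.lower() for t in lemmas)
        if tok not in STOP_WORDS and not all(ch in PUNCTUATION for ch in tok)
    ]

lemmas = ["le", "présent", "invention", "concerner", "un", "liant",
          "minéral", ",", "notamment", "hydraulique", "."]
cleaned = preprocess(lemmas)
print(cleaned)
```

Only content-bearing lemmas remain, so the later vectors are not dominated by words that occur in nearly every sentence.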
Vector space model
type of linguistic units: single words;
type of context: aligned regions;
frequency measure: Boolean frequency (equal to either 1 or 0);
method used to compute the distance between vectors: cosine measure.
Example (aligned region as a context)
Aligned region #1
présent invention concerner liant minéral notamment hydraulique
настоящий изобретение касаться неорганический связующий частность гидравлический связующий
Example (vector space)

                Aligned region
Word            #1    #2    #3
présent         1     …     …
invention       1     …     …
concerner       1     …     …
настоящий       1     …     …
изобретение     1     …     …
касаться        1     …     …
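The model described above (single words, aligned regions as contexts, Boolean frequency, cosine measure) can be sketched as follows. This is an illustration under stated assumptions, not the author's program: the four toy regions are invented, and the helper names `build_vectors` and `best_matches` are hypothetical.

```python
import math
from collections import defaultdict

def build_vectors(regions):
    """Map each lemma to the set of aligned-region indices it occurs in.

    With Boolean frequencies, a set of region indices is equivalent to a
    0/1 vector over all regions.
    """
    vectors = defaultdict(set)
    for idx, lemmas in enumerate(regions):
        for lemma in lemmas:
            vectors[lemma].add(idx)
    return vectors

def cosine(a, b):
    """Cosine for Boolean vectors stored as sets: |A & B| / sqrt(|A| * |B|)."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

def best_matches(src_regions, tgt_regions):
    """For every source lemma, return the target lemma with the highest cosine."""
    src = build_vectors(src_regions)
    tgt = build_vectors(tgt_regions)
    result = {}
    for s_word, s_vec in src.items():
        t_word = max(tgt, key=lambda t: cosine(s_vec, tgt[t]))
        result[s_word] = (t_word, cosine(s_vec, tgt[t_word]))
    return result

# Invented toy corpus: region i in French is aligned with region i in Russian.
fr = [
    ["présent", "invention", "concerner", "liant"],
    ["invention", "procédé"],
    ["liant", "hydraulique"],
    ["procédé", "concerner", "hydraulique"],
]
ru = [
    ["настоящий", "изобретение", "касаться", "связующий"],
    ["изобретение", "способ"],
    ["связующий", "гидравлический"],
    ["способ", "касаться", "гидравлический"],
]

matches = best_matches(fr, ru)
for word, (target, score) in matches.items():
    print(word, "->", target, round(score, 2))
```

Because French and Russian lemmas share the same region indices, their vectors live in one space and can be compared directly. Taking only the single best target per source word is a simplification made for this sketch.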
Results
A list of translation correspondences.
Linguistic filter: the same part of speech.
Precision: 78%.
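A same-part-of-speech filter and a precision computation could be sketched as below. The candidate pairs, POS tags, and gold set are invented for illustration and do not reproduce the reported figure; `filter_same_pos` and `precision` are hypothetical names.

```python
def filter_same_pos(pairs, pos_src, pos_tgt):
    """Keep only pairs whose source and target lemmas carry the same POS tag."""
    return [
        (s, t)
        for s, t in pairs
        if pos_src.get(s) is not None and pos_src.get(s) == pos_tgt.get(t)
    ]

def precision(proposed, gold):
    """Share of proposed pairs that are judged correct."""
    if not proposed:
        return 0.0
    return sum(1 for pair in proposed if pair in gold) / len(proposed)

# Invented candidate list and tags (assumptions, not data from the talk).
candidates = [
    ("invention", "изобретение"),
    ("concerner", "касаться"),
    ("traiter", "обработка"),
    ("présent", "гидравлический"),
]
pos_fr = {"invention": "NOUN", "concerner": "VERB", "traiter": "VERB", "présent": "ADJ"}
pos_ru = {"изобретение": "NOUN", "касаться": "VERB", "обработка": "NOUN", "гидравлический": "ADJ"}
gold = {("invention", "изобретение"), ("concerner", "касаться"), ("traiter", "обработка")}

kept = filter_same_pos(candidates, pos_fr, pos_ru)
print(kept)
print(precision(kept, gold))
```

In this toy run the filter discards traiter → обработка, a valid correspondence with different parts of speech, which is the kind of case the following slides discuss.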
Correspondences with different POS
Syntactic transformations
verbal infinitive (French) → noun (Russian)
traiter (“to process”) → обработка (“processing”)
noun (French) → adjective (Russian)
crochet (“hook”) → крюкообразный (“hook-shaped”)
verbal infinitive (French) → adjective (Russian)
connaître (“to know”) → известный (“well-known”)
Correspondences with different POS
Parts of multi-word expressions
au moins (“at least”) → по меньшей мере (“at least”)
The output of the program: moins → мера
Evaluation
Eduardo Cendejas, Grettel Barceló, Alexander Gelbukh, Grigori Sidorov. Incorporating Linguistic Information to Statistical Word-Level Alignment // Proceedings of the 14th Iberoamerican Conference on Pattern Recognition, CIARP 2009, Guadalajara, Jalisco, Mexico, November 15-18, 2009.
Vector space model + similarity measures: PMI, T-score, log-likelihood ratio and Dice coefficient.
Precision: 53%.
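For orientation, two of the association measures named above can be written out from region counts. These are the textbook definitions of the Dice coefficient and PMI, given here as an assumption; the exact formulation used in the cited paper is not stated in the slides, and T-score and log-likelihood ratio are left out of this sketch.

```python
import math

def dice(n_x, n_y, n_xy):
    """Dice coefficient: 2 * joint count / (count of x + count of y)."""
    if n_x + n_y == 0:
        return 0.0
    return 2 * n_xy / (n_x + n_y)

def pmi(n_x, n_y, n_xy, n_regions):
    """Pointwise mutual information in bits: log2(p(x, y) / (p(x) * p(y)))."""
    if n_xy == 0:
        return float("-inf")  # the pair never co-occurs
    return math.log2(n_xy * n_regions / (n_x * n_y))

# Invented counts: x and y each occur in 2 of 4 aligned regions.
print(dice(2, 2, 2), pmi(2, 2, 2, 4))  # always together
print(dice(2, 2, 1), pmi(2, 2, 1, 4))  # together once
```

Here `n_x` and `n_y` are the numbers of aligned regions containing each word, `n_xy` the number containing both, and `n_regions` the corpus size in regions.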
Conclusion
Distributional semantics methodology can be used to extract translation correspondences from a parallel corpus with a high level of precision.
It can be used to study productive syntactic transformations occurring in translation.
The present vector space model needs to be enhanced to take into account multi-word expressions.
Thank you!