EXTRACTION OF TRANSLATION CORRESPONDENCES FROM A PARALLEL CORPUS USING METHODS OF DISTRIBUTIONAL SEMANTICS Yuliya Morozova Institute for Informatics Problems of the Russian Academy of Sciences, Moscow

Page 1

EXTRACTION OF TRANSLATION CORRESPONDENCES FROM A PARALLEL CORPUS USING METHODS OF DISTRIBUTIONAL SEMANTICS

Yuliya Morozova
Institute for Informatics Problems of the Russian Academy of Sciences, Moscow

Page 2

Distributional semantics

A new area of linguistic research
Inferring semantic properties of linguistic units from corpora
Theoretical foundations: distributional methodology by Z. Harris, F. de Saussure, L. Wittgenstein.
Distributional hypothesis: semantically similar words occur in similar contexts.
J. R. Firth: “You shall know a word by the company it keeps”.

Page 3

Vector space

drink coffee – occurred 1 time
drink tea – occurred 2 times
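The two counts on this slide can be read as the coordinates of a context vector for the word "drink". Below is a minimal sketch, assuming a toy list of observed (word, context word) pairs that reproduces the slide's counts. The pair list and the helper name `context_vector` are illustrative and do not come from the talk.

```python
from collections import Counter

# Toy observations reproducing the slide's counts (illustrative data only).
pairs = [("drink", "coffee"), ("drink", "tea"), ("drink", "tea")]

def context_vector(word, observations, dimensions):
    """Return co-occurrence counts of `word` along the given context dimensions."""
    counts = Counter(ctx for w, ctx in observations if w == word)
    # Counter returns 0 for contexts that were never observed with `word`.
    return [counts[d] for d in dimensions]

drink = context_vector("drink", pairs, ["coffee", "tea"])
print(drink)  # [1, 2]
```

Each context word is one dimension of the space, so "drink" is the point (1, 2) along the axes "coffee" and "tea".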

Page 4

Cosine measure of vector similarity

$$\cos(x, y) = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}$$
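The cosine measure named on this slide is the dot product of two vectors divided by the product of their Euclidean norms. A minimal sketch in plain Python follows. The function name `cosine` and the choice to return 0.0 for an all-zero vector are assumptions made here, not details from the talk.

```python
import math

def cosine(x, y):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # convention assumed here for all-zero vectors
    return dot / (norm_x * norm_y)

print(cosine([1, 2], [2, 4]))  # parallel vectors, value close to 1
print(cosine([1, 0], [0, 1]))  # orthogonal vectors, value 0
```

Because the measure depends only on the angle between the vectors, a word that is simply more frequent than another can still receive a similarity close to 1.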

Page 5

Main application areas

lexical ambiguity resolution
information retrieval
dictionaries of semantic relations
multilingual dictionaries
semantic maps of different domains
modelling of synonymy
document topic detection
sentiment analysis

Page 6

The present research

Goal: to apply distributional semantics models to extraction of translation correspondences from a parallel corpus.

Vector space model + test corpus

Page 7

Test corpus

Patent texts in French translated into Russian
Texts split into sentences
Alignment at the sentence level – manually verified (in the visual editor MakeBilingua)
Uploaded to the Sketch Engine corpus manager

Page 8

Preprocessing

Lemmatization
Frequent words removed (prepositions, conjunctions, etc.)
Punctuation marks removed
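The three preprocessing steps can be sketched as a small pipeline. The talk does not name the lemmatizer or the stop list, so everything below is an assumption: the lookup table `LEMMAS`, the set `STOP_WORDS`, the function `preprocess`, and the surface sentence itself are hypothetical stand-ins. A real system would use a morphological analyzer instead of a hand-written table.

```python
import re

# Hypothetical lemma table and stop list; a real pipeline would use a
# morphological analyzer and a frequency-based list of function words.
LEMMAS = {"présente": "présent", "concerne": "concerner"}
STOP_WORDS = {"la", "un", "de", "et"}

def preprocess(sentence):
    """Tokenize, drop punctuation, lemmatize, and remove frequent words."""
    # \w+ keeps runs of letters and digits, so punctuation marks are dropped.
    tokens = re.findall(r"\w+", sentence.lower())
    lemmas = [LEMMAS.get(token, token) for token in tokens]
    return [lemma for lemma in lemmas if lemma not in STOP_WORDS]

# Assumed surface form of a patent sentence (not given in the talk).
sentence = "La présente invention concerne un liant minéral, notamment hydraulique."
print(preprocess(sentence))
```

Removing frequent function words matters for the later vector comparison: a word that occurs in almost every aligned region would look similar to almost every word on the other side.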

Page 9

Vector space model

Type of linguistic units: single words
Type of context: aligned regions
Frequency measure: Boolean frequency (equal to either 1 or 0)
Method used to compute the distance between vectors: cosine measure

Page 10

Example (aligned region as a context)

Aligned region #1

présent invention concerner liant minéral notamment hydraulique

настоящий изобретение касаться неорганический связующий частность гидравлический связующий

Page 11

Example (vector space)

              Aligned region
              #1   #2   #3
présent       1    …    …
invention     1    …    …
concerner     1    …    …
настоящий     1    …    …
изобретение   1    …    …
касаться      1    …    …
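The table can be built mechanically: every aligned region is one dimension, and a lemma gets 1 in that dimension if it occurs in the region. The sketch below assumes that region #1 is the example from the talk, while regions #2 and #3 are invented here purely so that the vectors differ. The helper names `boolean_vectors` and `best_match` are also illustrative.

```python
import math

# Region #1 is the example from the talk; #2 and #3 are invented for illustration.
regions = [
    (["présent", "invention", "concerner", "liant", "minéral", "notamment", "hydraulique"],
     ["настоящий", "изобретение", "касаться", "неорганический", "связующий",
      "частность", "гидравлический", "связующий"]),
    (["invention", "procédé"], ["изобретение", "способ"]),
    (["liant", "ciment"], ["связующий", "цемент"]),
]

def boolean_vectors(regions, side):
    """Map each lemma to a 0/1 vector with one coordinate per aligned region."""
    vocab = sorted({w for region in regions for w in region[side]})
    return {w: [1 if w in region[side] else 0 for region in regions] for w in vocab}

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0

def best_match(word, source_vectors, target_vectors):
    """Return the target lemma whose vector has the highest cosine with `word`."""
    return max(target_vectors,
               key=lambda t: cosine(source_vectors[word], target_vectors[t]))

fr = boolean_vectors(regions, 0)  # French side
ru = boolean_vectors(regions, 1)  # Russian side
for word in ("invention", "liant", "procédé"):
    print(word, "->", best_match(word, fr, ru))
```

Note that "связующий" occurs twice in region #1 but still contributes 1, which is what the Boolean frequency means. Words seen only in region #1, such as "présent", have identical vectors on both sides and cannot be told apart in this toy data; separating them requires more aligned regions.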

Page 12

Results

A list of translation correspondences.
Linguistic filter: the same part of speech.
Precision: 78%.
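One way to read this slide is as a two-step procedure: keep only candidate pairs whose two lemmas share a part of speech, then measure precision as the share of kept pairs judged correct. The sketch below uses toy data. The POS tag names, the `correct` judgements, and the helper names are assumptions, and the resulting number is unrelated to the 78% reported in the talk.

```python
from typing import NamedTuple

class Candidate(NamedTuple):
    source: str       # French lemma
    source_pos: str   # hypothetical POS tag
    target: str       # Russian lemma
    target_pos: str   # hypothetical POS tag
    correct: bool     # hypothetical manual judgement

# Toy candidate list (illustrative only).
candidates = [
    Candidate("invention", "NOUN", "изобретение", "NOUN", True),
    Candidate("concerner", "VERB", "касаться", "VERB", True),
    Candidate("traiter", "VERB", "обработка", "NOUN", True),
    Candidate("moins", "ADV", "мера", "NOUN", False),
    Candidate("liant", "NOUN", "способ", "NOUN", False),
]

def same_pos(pairs):
    """Keep only pairs whose source and target lemmas share a POS tag."""
    return [p for p in pairs if p.source_pos == p.target_pos]

def precision(pairs):
    """Share of pairs judged correct; 0.0 for an empty list."""
    if not pairs:
        return 0.0
    return sum(1 for p in pairs if p.correct) / len(pairs)

kept = same_pos(candidates)
print(len(kept), round(precision(kept), 2))
```

In this toy list the filter discards the incorrect pair moins → мера but also the valid pair traiter → обработка, which illustrates the trade-off a same-POS filter makes.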

Page 13

Correspondences with different POS

Syntactic transformations

verbal infinitive (French) → noun (Russian)
traiter (“to process”) → обработка (“processing”)

noun (French) → adjective (Russian)
crochet (“hook”) → крюкообразный (“hook-shaped”)

verbal infinitive (French) → adjective (Russian)
connaître (“to know”) → известный (“well-known”)

Page 14

Correspondences with different POS

Parts of multi-word expressions

au moins (“at least”) → по меньшей мере (“at least”)

The output of the program: moins → мера

Page 15

Evaluation

Eduardo Cendejas, Grettel Barceló, Alexander Gelbukh, Grigori Sidorov. Incorporating Linguistic Information to Statistical Word-Level Alignment // Proceedings of the 14th Iberoamerican Conference on Pattern Recognition, CIARP 2009, Guadalajara, Jalisco, Mexico, November 15-18, 2009.

Vector space model + similarity measures: PMI, T-score, log-likelihood ratio and Dice coefficient.

Precision: 53%.

Page 16

Conclusion

Distributional semantics methodology can be used to extract translation correspondences from a parallel corpus with a high level of precision.

It can be used to study productive syntactic transformations occurring in translation.

The present vector space model needs to be enhanced to take into account multi-word expressions.

Page 17

Thank you!