Post on 01-Apr-2015

Creating a Term Base to Customize an MT System:

Reusability of Resources and Tools from the Translator’s Point of View

Natalie Kübler

Intercultural Centre for Studies in Lexicology

Objectives
Introducing available resources, tools, and MT in translation training
Testing customisable MT as a time-saving tool for « industrial » translation
Using simple tools and immediately available resources to improve MT translation results

Translation training
Post-graduate students in language industry (LI) and specialised translation (ST):
  Translation, linguistics, localisation, technical writing
  Dreamweaver, Catalyst, HTML, XML, SQL, UNIX, translation memory, etc.
Semi-professional: every other week with a private company in translation or language industry
Corpus linguistics and applications to terminology and translation => project in ST (HOWTO) + LI (analysis + feedback to Systran)

Experiment
Translating some as-yet-untranslated Linux HOWTOs, using an MT system
  Subdomain of computing
  Highly specialised texts written by computer experts, not technical writers, for computer experts
  Translated by French-speaking computer experts
+ Translating computing dictionary entries

Systranet
Systran's on-line customisable service
Domain-specific dictionaries
User dictionaries:
  Mono- or multitarget
  « advanced » linguistic information
On-line source and target text alignment
  Words not in any (Systran's or user's) dictionary
  Words in the user's dictionary

Resources + Tools
Headwords + equivalents + linguistic information
  On-line technical bilingual glossaries
  On-line term bases
  Comparable and translation technical corpora
  The Web as a corpus
Term extraction (Terminology Extractor)

Methodology
Step-one dictionary: extracting term candidates from text
Creating and coding the step-one dictionary
First translation using the dictionary
Step-two dictionary: changing and/or adding linguistic information using Systranet's alignment and color features + linguistic analysis (feedback)
Step two is repeated until the dictionary is saturated

Web-based HOWTO glossary
Several French equivalents
  boot, root disk = disquettes (d') amorce ou de démarrage, racine
  browser = butineur, navigateur, arpenteur
  buffer = tampon
  to build = bâtir
  currently = actuellement
  feedback = comment contacter l'auteur, retour d'information
  A.D.S.L. (noun) = raccordement numérique asymétrique

Step 1: Terminology Extractor
French and English dictionaries
Morphological analysis
Stop words
Collocations: sequence of 2 to 10 words repeated at least once
Non-words
Concordances
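The collocation step can be illustrated with a minimal Python sketch. It is not Terminology Extractor's actual algorithm: the stop-word list, the tokenizer and the function names are assumptions made for illustration, and "repeated at least once" is read here as "occurring at least twice".

```python
import re
from collections import Counter

# Hypothetical stop-word list; the real tool ships its own French and English lists.
STOP_WORDS = {"the", "a", "an", "of", "to", "on", "in", "is", "for", "and", "at"}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+(?:[-'][a-z0-9]+)*", text.lower())

def repeated_sequences(text, min_len=2, max_len=10, min_count=2):
    """Count sequences of min_len..max_len tokens occurring at least min_count times.

    Sequences that start or end with a stop word are skipped.
    """
    tokens = tokenize(text)
    counts = Counter()
    for n in range(min_len, max_len + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if gram[0] in STOP_WORDS or gram[-1] in STOP_WORDS:
                continue
            counts[" ".join(gram)] += 1
    return {seq: c for seq, c in counts.items() if c >= min_count}

# Made-up sample text, for illustration only.
sample = ("The DHCP server assigns IP addresses. Restart the DHCP server "
          "after editing the interface configuration file. The interface "
          "configuration file lives on the Linux box.")
candidates = repeated_sequences(sample)
for sequence, count in sorted(candidates.items()):
    print(sequence, count)
```

The output is a list of term candidates with frequencies. A translator still has to decide which candidates are real terms before coding them in the dictionary.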

TE non-words
Debian        Netscape    accellerate
Permedia      Dennis      XFCE
RedHat        Dialogs     Corel
RgbPath       FAQs        anoying
ServerFlags   Howto       Microdoft
ServerLayour  README      Linux
XkbLayout     XkbModel    RealAudio
Solaris       ISA         degredation
UI            KDE         GUI
USB           LeftOf      IRQs
WindowMaker   ModulePath  NFS
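A non-word list of this kind can be approximated by checking each token against a word list. The sketch below is an assumption about the general technique, not a description of how Terminology Extractor works. The tiny lexicon and the sample sentence are invented for illustration.

```python
import re

# Hypothetical mini-lexicon; a real run would load full English and French word lists.
LEXICON = {"the", "server", "may", "cause", "of", "performance", "on", "box", "a"}

def find_non_words(text, lexicon):
    """Return tokens absent from the lexicon, in first-seen order, without duplicates."""
    found = []
    for token in re.findall(r"[A-Za-z]+", text):
        if token.lower() not in lexicon and token not in found:
            found.append(token)
    return found

sample = "The Debian server may cause degredation of performance on a RedHat box"
unknown = find_non_words(sample, LEXICON)
print(unknown)
```

As in the list above, the result mixes product names with source-text misspellings, so it has to be sorted by hand before anything is added to the user dictionary.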

TE collocations
Internet Gateway 3
{ Looking look } at the Network 3
IP aliasing 3
name server 4
ISA { card cards } 3
Network { Device devices } 4
latest version 3
Linux computer 3
DHCP Server 15
IP { addresses address } 16
Linux gateway 3
Linux box 16
modules file 3
card on the Linux box 4
scripts / ifcfg 3
DNS { Server servers } 17
server will start 3
interface configuration file 3
{ Network networking } { Card Cards } 12

« Le grand dictionnaire terminologique »
Looking for French equivalents

ENGLISH               FRENCH
buffer                mémoire tampon n. f.
Syn.                  Syn.
buffer storage        tampon n. m.
buffer memory         mémoire intermédiaire n. f.
intermediate memory   zone tampon n. f.

HOWTO translation corpus
English source, French translation
WALL: Web-based environment
  Concordances with Perl-like regexp
  Paragraph alignment
French equivalents
Lexicogrammatical information
Semantic classes
« Statistical » information in the domain
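A regexp concordance query of the kind mentioned above can be sketched in a few lines of Python. This is not the WALL environment itself: the function name, the context width and the query are assumptions for illustration. The sample sentences are HOWTO fragments quoted in this talk.

```python
import re

def concordance(text, pattern, width=30):
    """Return one keyword-in-context line per regex match in text."""
    lines = []
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}}[{m.group(0)}]{right}")
    return lines

corpus = ("can I run 32-bit video games under dosemu. "
          "used to run Linux on a 386/16 MHz. "
          "The static SLIP server will answer your modem call.")

# Any form of "run" followed by its next word.
hits = concordance(corpus, r"\brun\w*\s+\S+")
for line in hits:
    print(line)
```

Reading the right-hand contexts side by side shows which kinds of objects a verb takes in the domain, which is the information needed to choose an equivalent.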

HOWTOs: equivalents
The daemon […] listens to all messages on each network device
  Le démon […] écoute tous les messages sur chacun des périphériques réseau
All the Digital cards will autoprobe for their media
  Toutes les cartes Digital effectueront la détection automatique du média
The latest source distribution can be FTPed from the directory ftp…or Mosaiced from http…
  On peut charger la dernière version sur ftp…et sous Mosaic depuis http…
Called by the kernel when the card posts an interrupt.
  Appelé par le noyau quand la carte déclenche une interruption

HOWTOs « semantic classes »

can I run 32-bit video games under dosemu

used to run Linux on a 386/16 MHz (

unless you want your modem to answer the phone

The static SLIP server will answer your modem call

WebCorp
The web as a corpus
Concordances: buffer, run* * * on
Updated information
More elements

buffer
  me des débordements de buffer (tampon en français). Pour
  com/advisories/bufero.html . Writing buffer overflow exploits – a tutorial for
  de NOP . débordement de buffer dans le tas (heap buffer overflow)
  (buffer overflow) . débordement de buffer sous windows (et oui ;-)) --[

Customized dictionary
« Advanced » linguistic information, such as:
Part-of-speech information
  noun, proper noun (product name, country, etc.), verb, adjective, sentence
Morphological information
  URL (noun) (plural:URLs) / cache (noun)(masculine)
Lexicogrammatical information
  access (verb)(noprep)=accéder (verb)(prep:à)
Basic semantic information
  to run (verb)(context:OS)
  Unix (noun) (SEMCAT:OS)
Idioms
  Your mileage may vary (sentence)
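The entry notation above (headword, parenthesised codes, `=`, equivalent, parenthesised codes) is regular enough to be read by a short script, for instance to check coding before uploading a dictionary. The sketch below only follows the notation as shown on this slide. It is not Systranet's own parser or file format, and the class and function names are hypothetical.

```python
import re
from dataclasses import dataclass, field

# A code is anything between one pair of parentheses, e.g. (verb) or (prep:à).
TAG = re.compile(r"\(([^()]*)\)")

@dataclass
class Side:
    text: str
    tags: list = field(default_factory=list)

def parse_side(raw):
    """Split one side of an entry into its wording and its list of codes."""
    tags = [t.strip() for t in TAG.findall(raw)]
    text = " ".join(TAG.sub(" ", raw).split())
    return Side(text, tags)

def parse_entry(line):
    """Return (source, target); target is None when the entry has no '=' part."""
    source, sep, target = line.partition("=")
    return parse_side(source), (parse_side(target) if sep else None)

src, tgt = parse_entry("access (verb)(noprep)=accéder (verb)(prep:à)")
print(src.text, src.tags, "->", tgt.text, tgt.tags)

mono, none_target = parse_entry("Unix (noun) (SEMCAT:OS)")
print(mono.text, mono.tags, none_target)
```

This simple approach treats every parenthesised group as a code, so an equivalent that contains literal parentheses would need special handling.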

Dictionary Sample
"AT&T" (company name)
auto-dial (noun)=numérotation automatique (noun)
automatic number identification (noun)=identification de l'appelant (noun)
based (adjective)(noprep)=architecturé (adjective) (prep:autour)
basic language constructs (noun) (plural)=base de construction du langage (noun) (singular)
to log in (verb)=se loger (verb)
to introduce (verb) (context:extensions)=introduire
to carry (verb)(context:digital data)=transmettre (verb)

With Step-one dictionary

This page contains a simple cookbook for setting up Red Hat 6.X as an internet gateway for a home network or small office network.

Cette page contient un cookbook simple pour le chapeau rouge 6.X d'établissement en tant que Gateway d'Internet pour un réseau à la maison ou le petit réseau de bureau.

With Step-two dictionary

This page contains a simple cookbook for setting up Red Hat 6.X as an internet gateway for a home network or small office network.

Cette page contient des recettes simples pour installer Red Hat 6.X en tant que passerelle Internet pour un réseau domestique ou un petit réseau de bureau.

Error typology
Morphosyntax: subject-verb or noun-adjective agreement
Syntax:
  POS ambiguity
  NP: determiners, NP coordination
  transformations/ellipsis/cleft sentences/PP attachment
Metacharacters
« Bugs »

Error examples (1)
I am not going
  *je n'vais pas => je ne vais pas
the phase of the light through it
  *la phase du dépassement léger par lui => la phase de la lumière qui les traverse.
decoded by specific individuals.
  *décodée par les individus spécifiques. => décodée par des individus spécifiques.
A cable or ADSL connection
  *un câble ou une connexion d’AADSL => Une connexion par câble ou ADSL

Error examples (2)

When a user picks or is assigned a password, it is encoded with a randomly generated value called the salt.

=> *Quand un utilisateur sélectionne ou est généré un mot de passe, il est codé avec une valeur aléatoirement produite appelée le sel.

Conclusion
Translation results can be significantly improved by creating customised dictionaries
The tools mentioned here are user-friendly
But this implies much work in the beginning + translators must have training in linguistics and basic NLP
Change of attitude towards MT + various tools, especially in the language-industry-oriented option

More things to be done...
Merging all dictionaries together into a « Systranet term base »
Translating more HOWTOs
Project with Systran: improve user coding…