Creating a Term Base to Customize an MT System:
Reusability of Resources and Tools from the Translator’s Point of View
Natalie Kübler
Intercultural Centre for Studies in Lexicology
Objectives
Introducing available resources, tools, and MT in translation training
Testing customisable MT as a time-saving tool for « industrial » translation
Using simple tools and immediately available resources to improve MT translation results
Translation training
Post-graduate students in language industry (LI) and specialised translation (ST):
  Translation, linguistics, localisation, technical writing
  Dreamweaver, Catalyst, HTML, XML, SQL, UNIX, translation memory, etc.
Semi-professional: every other week with a private company in translation or language industry
Corpus linguistics and applications to terminology and translation => project in ST (HOWTO) + LI (analysis + feedback to Systran)
Experiment
Translating some as yet untranslated Linux HOWTOs, using an MT system
  Subdomain of computing
  Highly specialised texts written by computer experts, not technical writers, for computer experts
  Translated by French-speaking computer experts
+ Translating computing dictionary entries
Systranet
Systran’s on-line customisable service
Domain-specific dictionaries
User dictionaries:
  Mono- or multitarget
  « Advanced » linguistic information
On-line source and target text alignment
  Words not in any (Systran’s or user’s) dictionary
  Words in the user’s dictionary
Resources + Tools
Headwords + equivalents + linguistic information
  On-line technical bilingual glossaries
  On-line term bases
  Comparable and translation technical corpora
  The Web as a corpus
Term extraction (Terminology Extractor)
Methodology
Step-one dictionary: extracting term candidates from text
Creating and coding the step-one dictionary
First translation using the dictionary
Step-two dictionary: changing and/or adding linguistic information using Systranet’s alignment and color features + linguistic analysis (feedback)
Repeating step two until the dictionary is saturated
Web-based HOWTO glossary
Several French equivalents
  boot, root disk = disquettes (d') amorce ou de démarrage, racine
  browser = butineur, navigateur, arpenteur
  buffer = tampon
  to build = bâtir
  currently = actuellement
  feedback = comment contacter l'auteur, retour d'information
  A.D.S.L. (noun) = raccordement numérique asymétrique
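Glossary lines of the form "headword = equivalent, equivalent" can be turned into a lookup table before coding the step-one dictionary. The following is a minimal sketch, assuming a comma separates alternative equivalents; the function name `parse_glossary_line` and the variable names are illustrative, not part of any tool mentioned in the slides.

```python
def parse_glossary_line(line):
    """Split 'headword = eq1, eq2' into (headword, [eq1, eq2])."""
    headword, _, equivalents = line.partition("=")
    return headword.strip(), [e.strip() for e in equivalents.split(",") if e.strip()]


# Three entries copied from the glossary slide.
glossary_lines = [
    "browser= butineur, navigateur, arpenteur",
    "buffer=tampon",
    "to build= bâtir",
]

glossary = dict(parse_glossary_line(line) for line in glossary_lines)

# Headwords with several equivalents need a human decision
# before they are coded with a single target term.
ambiguous = {h: eqs for h, eqs in glossary.items() if len(eqs) > 1}
print(ambiguous)
```

Entries such as "browser" surface as ambiguous, which is where the translator has to choose among the competing French equivalents.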
Step 1: Terminology Extractor
French and English dictionaries
Morphological analysis
Stop words
Collocations: sequence of 2 to 10 words repeated at least once
Non-words
Concordances
TE non-words
Debian        Netscape     accellerate
Permedia      Dennis       XFCE
RedHat        Dialogs      Corel
RgbPath       FAQs         anoying
ServerFlags   Howto        Microdoft
ServerLayour  README       Linux
XkbLayout     XkbModel     RealAudio
Solaris       ISA          degredation
UI            KDE          GUI
USB           LeftOf       IRQs
WindowMaker   ModulePath   NFS
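The non-word list mixes product names, configuration keywords, and plain misspellings from the source texts. The idea behind such a list can be sketched as a lookup against a word list; this is only an illustration, not Terminology Extractor's actual method, and the word list `KNOWN` and the sample sentence are made up for the example.

```python
import re


def find_non_words(text, known_words):
    """Return tokens, in order of first appearance, that are not in known_words."""
    found = []
    for token in re.findall(r"[A-Za-z]+", text):
        if token.lower() not in known_words and token not in found:
            found.append(token)
    return found


# Toy word list standing in for a full English dictionary (assumption).
KNOWN = {"the", "driver", "can", "graphics", "on", "but", "is", "it"}

sample = "The Permedia driver can accellerate graphics on Debian but it is anoying"
print(find_non_words(sample, KNOWN))
```

The output still has to be sorted by hand: names like Permedia or Debian are candidates for dictionary entries, while forms like "accellerate" are source-text typos.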
TE collocations (with frequency)
Internet Gateway 3
{ Looking look } at the Network 3
IP aliasing 3
name server 4
ISA { card cards } 3
Network { Device devices } 4
latest version 3
Linux computer 3
DHCP Server 15
IP { addresses address } 16
Linux gateway 3
Linux box 16
modules file 3
card on the Linux box 4
scripts / ifcfg 3
DNS { Server servers } 17
server will start 3
interface configuration file 3
{ Network networking } { Card Cards } 12
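The collocation criterion stated for Terminology Extractor (a sequence of 2 to 10 words repeated at least once) can be approximated by counting word n-grams. The sketch below is an assumption about the general technique, not the tool's implementation: the stop-word list is a toy one, sequences that start or end with a stop word are dropped, and inflected variants such as { card cards } are not merged.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "on", "to", "is", "will"}  # illustrative only


def candidate_collocations(text, min_len=2, max_len=10, min_count=2):
    """Count word sequences of min_len..max_len tokens seen at least min_count times."""
    tokens = re.findall(r"[A-Za-z]+", text.lower())
    counts = Counter()
    for n in range(min_len, max_len + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            # Skip sequences that begin or end with a stop word.
            if gram[0] in STOP_WORDS or gram[-1] in STOP_WORDS:
                continue
            counts[" ".join(gram)] += 1
    return {gram: c for gram, c in counts.items() if c >= min_count}


sample = (
    "The DHCP server runs on the Linux box. "
    "Restart the DHCP server after you edit the file on the Linux box."
)
print(candidate_collocations(sample))
```

On this two-sentence sample the surviving candidates are "dhcp server" and "linux box", each seen twice. Note that the sketch ignores sentence boundaries, so a real run would produce noise that the translator filters out when selecting term candidates.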
« Le grand dictionnaire terminologique »
Looking for French equivalents
ENGLISH                 FRENCH
buffer                  mémoire tampon n. f.
Syn.                    Syn.
buffer storage          tampon n. m.
buffer memory           mémoire intermédiaire n. f.
intermediate memory     zone tampon n. f.
HOWTO translation corpus
English source – French translation
WALL: Web-based environment
  Concordances with perl-like regexp
  Paragraph alignment
French equivalents
Lexicogrammatical information
Semantic classes
« Statistical » information in the domain
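A concordance driven by a Perl-like regular expression can be sketched in a few lines. This does not reproduce WALL's interface, which the slides do not describe; the function `concordance`, its `width` parameter, and the three-sentence corpus are assumptions made for the example (the first sentence is adapted from the HOWTO concordance lines).

```python
import re


def concordance(text, pattern, width=30):
    """Return keyword-in-context lines for every regex match in text."""
    lines = []
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{m.group(0)}] {right}")
    return lines


corpus = (
    "Can I run 32-bit video games under dosemu? "
    "The daemon is running on the gateway. "
    "It runs at boot time."
)
for line in concordance(corpus, r"\brun\w*"):
    print(line)
```

A pattern such as `\brun\w*` retrieves run, running, and runs together, which makes it easier to see what kinds of objects the verb takes in the domain.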
HOWTOs: equivalents
The daemon […] listens to all messages on each network device
Le démon […] écoute tous les messages sur chacun des périphériques réseau
All the Digital cards will autoprobe for their media
Toutes les cartes Digital effectueront la détection automatique du média
The latest source distribution can be FTPed from the directory ftp…or Mosaiced from http…
On peut charger la dernière version sur ftp…et sous Mosaic depuis http…
Called by the kernel when the card posts an interrupt.
Appelé par le noyau quand la carte déclenche une interruption
HOWTOs « semantic classes »
can I run 32-bit video games under dosemu
used to run Linux on a 386/16 MHz (
unless you want your modem to answer the phone
The static SLIP server will answer your modem call
WebCorp
The web as a corpus
Concordances: buffer, run* * * on
  Updated information
  More elements
buffer
me des débordements de buffer (tampon en français). Pour
com/advisories/bufero.html . Writing buffer overflow exploits – a tutorial for
de NOP . débordement de buffer dans le tas (heap buffer overflow)
(buffer overflow) . débordement de buffer sous windows (et oui ;-)) --[
Customized dictionary
« Advanced » linguistic information, such as:
Part-of-speech information
  noun, proper noun (product name, country, etc.), verb, adjective, sentence
Morphological information
  URL (noun) (plural:URLs) / cache (noun)(masculine)
Lexicogrammatical information
  access (verb)(noprep)=accéder (verb)(prep:à)
Basic semantic information
  to run (verb)(context:OS)
  Unix (noun) (SEMCAT:OS)
Idioms
  Your mileage may vary (sentence)
Dictionary Sample
"AT&T" (company name)
auto-dial (noun)=numérotation automatique (noun)
automatic number identification (noun)=identification de l'appelant (noun)
based (adjective)(noprep)=architecturé (adjective) (prep:autour)
basic language constructs (noun) (plural)=base de construction du langage (noun) (singular)
to log in (verb)=se loger (verb)
to introduce (verb) (context:extensions)=introduire
to carry (verb)(context:digital data)=transmettre (verb)
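The coded entries in the sample follow a visible pattern: a term, parenthesised codes, an optional "=" and a target side coded the same way. A parser for that pattern might look like the sketch below. The syntax is inferred from the sample entries only, not from Systranet documentation, and the field names "term", "flags", and "attributes" are hypothetical.

```python
import re

TAG = re.compile(r"\(([^()]*)\)")


def parse_side(side):
    """Split 'term (tag)(key:value)' into the term and its codes."""
    first_paren = side.find("(")
    term = side if first_paren == -1 else side[:first_paren]
    flags, attributes = [], {}
    for tag in TAG.findall(side):
        key, sep, value = tag.partition(":")
        if sep:
            attributes[key.strip()] = value.strip()  # e.g. prep:à, context:OS
        else:
            flags.append(tag.strip())                # e.g. verb, noprep
    return {"term": term.strip(), "flags": flags, "attributes": attributes}


def parse_entry(line):
    """Parse 'source=target'; target is None when the entry has no equivalent."""
    source, sep, target = line.partition("=")
    return parse_side(source), (parse_side(target) if sep else None)


source, target = parse_entry("access (verb)(noprep)=accéder (verb)(prep:à)")
print(source)
print(target)
```

Parsing entries into this shape makes it possible to check a dictionary for gaps, for instance target sides with no part-of-speech code, before submitting it for a new translation pass.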
With Step-one dictionary
This page contains a simple cookbook for setting up Red Hat 6.X as an internet gateway for a home network or small office network.
Cette page contient un cookbook simple pour le chapeau rouge 6.X d'établissement en tant que Gateway d'Internet pour un réseau à la maison ou le petit réseau de bureau.
With Step-two dictionary
This page contains a simple cookbook for setting up Red Hat 6.X as an internet gateway for a home network or small office network.
Cette page contient des recettes simples pour installer Red Hat 6.X en tant que passerelle Internet pour un réseau domestique ou un petit réseau de bureau.
Error typology
Morphosyntax: subject-verb or noun-adjective agreement
Syntax:
  POS ambiguity
  NP: determiners, NP coordination
  Transformations/ellipsis/cleft sentences/PP attachment
Metacharacters
« Bugs »
Error examples (1)
I am not going
  *je n'vais pas => je ne vais pas
the phase of the light through it
  *la phase du dépassement léger par lui => la phase de la lumière qui les traverse.
decoded by specific individuals.
  *décodée par les individus spécifiques. => décodée par des individus spécifiques.
A cable or ADSL connection
  *un câble ou une connexion d’AADSL => Une connexion par câble ou ADSL
Error examples (2)
When a user picks or is assigned a password, it is encoded with a randomly generated value called the salt.
=> *Quand un utilisateur sélectionne ou est généré un mot de passe, il est codé avec une valeur aléatoirement produite appelée le sel.
Conclusion
Translation results can be significantly improved by creating customised dictionaries
The tools mentioned here are user-friendly
But this implies a lot of work at the beginning, and translators must have training in linguistics and basic NLP
Change of attitude towards MT + various tools, especially in the language-industry-oriented option
More things to be done...
Merging all dictionaries together into a « Systranet term base »
Translating more HOWTOs
Project with Systran: improve user coding…