using theinternet forspecialisedtranslation
Post on 02-Aug-2022
2 Views
Preview:
TRANSCRIPT
USING THEINTERNET
FOR SPECIALISEDTRANSLATION
1
TranslationTechnology
“muchtranslationworkiscarried out in a computer-assisted translation (CAT)environment, which may vary from a standard desktopequippedwithwordprocessing softwareanda browsertoa full-blowntranslatorworkstation consistingofa multiplicityoftools specificallycreatedfortranslators oftechnicaltextsand localizers."
“Translation agencies organize their workflow around project
managementsystemsthatdistribute translationtasks,memoriesand
terminologiestoand around individual translators.”
(F.Zanettin 2014, “Corpora inTranslation”)
Translation technologies
• Electronic dictionariesand terminologicaldatabases,
the arrivalof the Internetwith its numerous
possibilities for research, documentation and
communication, and the emergence of computer-
assistedtranslationtools.
Alcina A. (2008) «Translation technologies - Scope, tools and resources».Target20:1,79–102
Degrees of Translationautomation
• Theterm traditional humantranslation is understood to refertotranslation
without any kind ofautomation
• Fullyautomatichigh qualitytranslation(FAHQT)meanstranslationthat is performedwholly by the computer,withoutany kindof
human involvement,and is of “highquality”
• Human-aidedmachinetranslation (HAMT)refersto systems inwhichthe
translation is essentiallycarriedout by theprogram itself, but aidrequired from humans
• Machine-aided human translation (MAHT) comprises any process or degree of automation in the translationprocess,provided that the mechanical intervention provides somekind of linguisticsupport.
Degrees of Translationautomation
Toolsvs.Resources
• The word tool refers to computer programs that enable translators to carry
out a series of functions or tasks with a set of data that they have prepared
and, at the same time, allows a particular kind of results tobe obtained.
• Internet search engines
• Word processor
• Trados, Wordfast, Déjà Vu, Across, OmegaT,…
• Antconc, Wordsmith…
• By resourceswe refer to all sets of previously gathered linguistic datawhich
are organized in a particular manner and made available in some electronic
format so that they can be used or looked up or usedby translators used in
the course of some phase of processing. Terminological databases (e.g.
IATE), glossaries,…
• (online) dictionaries
• British National Corpus, …
why and how can we mine the web?
• the study of words “by presenting them in thecompany they usually keep - that is to say, an element of their meaning is indicated when their habitual word accompaniments are shown”
• “Extended units of meaning” at work in language
(Sinclair, 1996)
Extended units of meaning
Words must be studied in context rather than in
isolation
• collocation• colligation• semantic preference• semantic prosody
Extended units of meaning
Words must be studied in context rather than in
isolation
• Differences in Italian between (from Taylor,1998: 61):
◦ “pressione alta” = “high (blood) pressure”[medical]
◦ “alta pressione” = “(banks of) high pressure” [meteorological]
• collocation
• “Tendency of certain words to co-occur regularly in a given language” (Mona Baker,1992: 47)
• As observed in actual texts (vs. intuition)
• Key features of collocations
o language-specific (collocations vary from language tolanguage)
• Collocations are not stable or fixed
o they may change diachronically (over time) in generallanguageo they may change in LSP vs. generallanguageo they may change across LSPdomains
11
Collocation
• “Provide” tends to occur with “information”, “service(s)”, “support”,“help”, “money”, “protection”,“food”,“care”
• “Cause” tends to occur with “pain”, “damage”,“harm”
•“Aconsistentaura of meaning with which a form is imbued by its collcates” (Louw1993)
• “Feeling” or “aura” that is evoked by using certain words (reinforced by collocates, due to co-selectional implications and restrictions)
•Usually this feeling is “positive or negative”
• “Provide” tends to occur with wordsdenoting things which are desirable, necessary or good, such as “information”, “service(s)”, “support”, “help”,“money”, “protection”,“food”,“care”
• cf. Italian “fornire” and“elargire”
• “Cause” tends to occur with words denoting negative repercussions/consequences, such as “pain”, “damage”,“harm”
• cf. Italian “causare”
Semantic prosody
14
• “Commit” is used with e.g. “murder”, “crime”,“suicide”
•“Revoke” is used with e.g.
“licence”,“permit”,“authorization”
15
•Relation between a lemma and a set of semantically related words (Stubbs, 2001: 65)
• Lemma: dictionary entry of a word
• “Commit” is used with a group of semantically similar words, e.g. “murder”, “crime”,“suicide” (cf. Italian “commettere”)
•“Revoke” is used with e.g. “licence”, “permit”,“authorization”
•Semantic prosody→ positive/negativeevaluation
•Semantic preference→ relation to words belonging to a particular, definable semantic field
Semantic preference
16
• We heard the visitorsleave/leaving.
• We noticed him walk away/walking away.
• We heard Pavarotti sing/singing.
• We saw it fall/falling.
Colligation
Colligation
hear, notice, see, watch enters into colligation with the sequence of object + either the bare infinitive or the -ing form; e.g.
• We heard the visitorsleave/leaving.
• We noticed him walk away/walking away.
• We heard Pavarotti sing/singing.
• We saw it fall/falling.
•Relation between a pair of grammatical categories or a pairing of lexis and grammar (Stubbs, 2001:65)
17
Conclusionon using theWeb
for specialised translation – Main advantages
• massive amount of texts and multi-sourceinformation
can besearched
• content is constantly “refreshed” (i.e. updatedand extended)
How to friend and unfriend someone on Facebook - ComputerHope 1.https://www.computerhope.com › ... › Facebook Help24 gen 2018 - Before you can connect with another person on
Facebook and view their full profile, you must first become friends.
Below are the steps on how to find new friends on Facebook, add
friends, and how to unfriend any of your current friends. How tofind
friends on Facebook; How to friend someone on ...
Conclusionon using theWeb
for specialised translation – Main advantages
• a lot of sources,text types anddomains/topics are
represented
• manylanguages (Englishis dominant, goodpresenceof
Italian)
• replicable searchtechniques across(your
working/target) languages
• it is availableat anytime, at virtuallyno cost!
Main disadvantages andproblems
o need to differentiate good/reliablesources from
questionable information
▪for facts (limited control over user-generatedcontent
likeWikipedia)
▪for linguistic usage (badly translated, non-native texts,
poorauthors)
▪it may be difficult to identify differences between
expert/non-expertuse
o data/resultsstill need to be interpreted
Main disadvantages and problems
o Googlefocuseson content/information,rather than linguistic
forms
• the rankingand sortingof results are performed according
to criteria like
• “popularity” of the websites, orgeographic relevance
• the same search can yield different numbers of hits,
depending on unpredictable and uncontrollable factors as
the time of the day,or the location from which the query
is made -- wordcountsare not reliable+it is difficult to
compare frequencies to verify translationhypothes
• data on which searches areperformed isunstable/changes
Main disadvantages and problems
Particularly relevant to linguists/translators:
▪ no possible/meaningfulsortingofhits/results(esp.L/R-handcollocates)
- e.g.alphabetical sorting of collocates, from least to most frequent,etc.
- think of e.g. the “a * range/array of”, “on the vergeof” exercises
▪ punctuationanduppercase(capitals)areignored,e.g.“aids” vs.“AIDS”
▪ impossibleto search partsofwords,e.g.start with “geo…”,end in “-itis”
▪ no lemmatised searches
- hard to calculatefrequenciesof specificwordcombinations
- e.g.to calculatehowfrequent isthe combination “tirare l’acquaalproprio
mulino”,all inflected formsmustbesearchedfor
▪ no POS-sensitivesearches
- e.g. to searchfor ‘spot’as anoun vs.adverb
▪ no possibility to specify the spano ccuringbetweentwosearchterms
- i.e. the * wildcard can include zero to nwords
«Googleology is bad science» (Kilgarriff 2007)
MACHINE TRANSLATION
(MT)
1
24Machine translation (MT):
definition and key terms
• Definition of machine translation:
“computerised systems responsible for the production of translations
from one natural language into another, with or without human
assistance” (Hutchins & Somers, 1992: 3)
• Some key terms:
o MT system / engine / service = the software that produces the translation
o input = the source text
o [raw] output = [unedited] target text
“L'inglese di Expo non sembra Google Translate,
è Google Translate”
From http://www.linkiesta.it/it/blog-post/2015/02/12/linglese-di-expo-non-sembra-
google-translate-e-google-translate/22476/
MT – popular conceptions
Probably the translation technology that
attracts the most public attention, esp.
among non-translators.
28
Texts in SL Texts in TL
Parallel corpora: a collection of original texts in language L1 and their translations
into a give L2
Machine translation (MT): main architectures of MT systems
29
Machine translation (MT):
why is MT so difficult? Or why is translation difficult for computers?
• So why is translation difficultfor computers?
o Some blame the computer’s lack of “real-world knowledge”
legno bosco foresta IT
wood forest EN
bois forêt FR
30Machine translation (MT):
why is MT so difficult? Or why is translation difficult for computers?
• Partly because the translation often depends on the context / situation,
which the computer is not able to take into account
“The ball is in your court”
“Il pallone è nella vostra metà campo”
(the manager to the players)
“Il ballo è nella vostra corte”
(the chamberlain to the king)
31Machine translation (MT):why is MT so difficult? Or why is translation difficult for computers?
• Lexical ambiguities (gramm. category <-> meaning <-> translation)
for example, in EN:round
j) My team was eliminated in the first round
k) The cowboy started to round up the cattle
l) We can use the round table for dinner
m) Maggie is going on a cruise round the world
•
(Noun: girone)
(Verb: radunare)
(Adjective: rotondo)
(Preposition: intornoal)
32
2) The chimp eats the banana because it is ripe.
3) The chimp eats the banana because it is lunchtime.
?
Machine translation (MT):
some linguistic phenomena that are particularly difficult forMT
1) The chimp eats the banana because it is greedy.
• The case / example of pronominal anaphora (resolution), difficult for MT
33
MT post-editing
34
• The aim of post-editing is to make the revised output usable or
understandable
• The priority is to save time andmoney
• The extent and the accuracy of post-editing are negotiated/specified
on a case by case basis, depending on the needs and requirements
• Different “types” and levels of post-editing (in companies, organisations):
• no post-editing
• internal circulation, almost never external publication
• minimum post-editing
• internal circulation, rarely external publication
• full/complete post-editing (but… is it worth it?)
• very rarely internal circulation, mostly external publication
MT post-editing
35
• new skill that is acquired with experience, different from translation
• in this scenario one has to balance and optimise quality-speed-cost,
in relation to the intended use/durationof the translation
• length ofuse of
the document
• needs and
expectations of
the enduser(s)
• ability of the
readers/addressees to
makeuse of the doc.
• type, length and
“visibility” ofthe
document
• available and
viable options
MT post-editing: introduction
36
• (minimum/full-complete) are decided specifically
• Factors to be considered(prioritised)
• save time and money (quality is less relevant)
• understandability and correctness of general meaning are key
• Factors to be ignored (irrelevant in PE)
• any detail or nuance
•elegance, fluency, naturalness of expression, etc.
on average PE is paid roughly 50%of the “real/proper”
translation
Aims and level of PE (vs.translation/proofreading!)
37
MT pre-editing
38
•There are two possibilities to limit the texts / language in / for MT:
• adopt a controlled language (restricted input)
• use the sublanguageapproach
• Common aims with both options (to the advantage of MT):
• limited vocabulary
• more certainty on interpretation
• reduce syntacticvariation
Limit input domain / topic
39
• Prescriptive rules aimed at normalising the style of the input (ST), e.g.
• do not write sentences with more than 20 words (general, language-neutral)
• avoid passive constructions, use only active verb forms
• avoid anaphoras, make all subjects and pronominal references explicit
• in EN: do not omit “that” in relative clauses (language-specific)
• in IT: do not use “solo” as an adverb, but use “soltanto/solamente”
• in IT: use the word “minuto” only as a noun (i.e. to mean 60 seconds);
for the adjectival meaning, use only “piccolo”
Etc……
The result of controlled language is restricted input
Controlled language
40
• Natural/normal behaviour of language within a well-defined domain
(~ LSP, specialised language, jargon, etc.)
• “sub-” in the mathematical sense as in “subset”, not derogatory!
• referred to very well-defined, enclosed, limited domains and texts
• A sublanguage exists and is used regardless of MT, but one can design an MT system that takes advantage of this sublanguage
• vocabulary
• limited (relatively few concepts to be covered/expressed)
• finite/closed (innovation/deviation tend to be avoided)
• a few homographs, in general limited use of synonyms and coreferences
• syntax
• limited range of structures and constructions (regularity + repetitiveness)
• usually sublanguages are very similar cross-linguistically between SL/TL(s)
Sublanguage (1/2)
41
• Input must be in (or converted into) electronic format
• Correct formatting and layout of the input are very important
o the word “e r r o r” (spaced letters) would not be recognised / translated
o spelling and typos are crucial: THEYBOOKS AROOM …
(anybody would understand banal mistakes, but not an MT system!)
• Limited availability of language combinations (improving with SMT)
o coverage mostly limited to “usual” big languages with commercial interest
Machine translation (MT):
restrictions to the use of MT
COMPUTER-ASSISTED
TRANSLATION
(CAT) TOOLS
1
• Computer-assisted translation or computer(machine)-aided translation (CAT) refers to a variety of tools, a family of software products designed to support professional translators in their work.
• CAT is a “recent” development, derived from MT over the last 20 years
• The actual development of commercial CAT tools started in the 1990’s–the so-called“translator’s workstation / workbench”, which includes
• terminology managementpackages
• translation memory (TM) software (+ text alignment software, etc.)
• CAT tools are pieces of software designed to enhance the work of translators:
• maximise speed→ higher productivity
• improve coherence and precision→ higher quality
43
Computer-assisted translation (CAT)tools
• Used to create, store, retrieve and manipulate bi-/multilingual termbases/glossaries
• As searching for terminology can be highly time-consuming (even up to 75% of translators’ time), setting up a database which gathers the terminology you come across is vital.
• Lists in word processors / spreadsheets (e.g. Excel)→limited options for presenting and sorting data
• The terminology covered is usually that of a given (sub-)discipline or the terms needed for a specific translation project.
• Terminology records consist of a number of flexible fields
44
CAT tools, example 1:
terminology management packages
46
• Translation memory (TM):
“multilingual text archive containing […]
multilingual texts, allowing storage and retrieval of
aligned text segments against various search conditions”
(EAGLES* 1995)* Evaluation of Natural Language Processing Systems
•This roughly means: a “filing cabinet” (i.e. a database) of old translations whose bits can be retrieved and used when / as needed by the translator
• essentially a textual database that can be searched• pairs of source-text and target-text segments
CAT tools, example 2:
translation memory (TM) software
Note: Translation memory indicates both the software tool and the contents
of the database, i.e. the whole set of aligned text segments that it includes
Translation memory (TM) software
• Key idea: recycle similar past translations, never translate the same (or a similar) text twice
• How it works:• TM tools divide the source text – which must be in (or turned
into, e.g. with OCR) electronic/digital format –into segments, which translators can translate one-by-one in the traditional way.
• These segments (usually sentences, or even phrases) are then
sent to a built-in database. When there is a new source segmentequal or similar to one already translated, the memory retrievesthe previous translation from the database.
• When is this most useful:• for the translation of any text that has a high degree of repeated
terms and phrases which must be translated consistently, as is the case with e.g. user manuals, computer products and subsequent versions of the same document (e.g. website updates).
• mostly relevant to technical/specialised translation (notl4i7terature)
48
• Scenario
◦you have to translate the user manual of a printer (new model) from English into Italian
◦ a lot of repetition within the documentitself
◦overlap and repetitions across updated (old-new) versions of the documentation
◦ you have a relevant TM (similar topic / domain / texts / clients)
◦ you translated the previous manual(s)
◦ TM provided by client / translation agency / colleague
Using translation memory (TM) software
Source text (in language A)
ST: There are 4 ways to change print settings for this printer
Exact/Perfect match (everything in the segment is exactly the same)
A: There are 4 ways to change print settings for this printer
B: Ci sono 4 modi per cambiare le impostazioni di stampa di questa stampante
Full match (only figures, dates and similar small details are different)
A: There are 2 ways to change print settings for thisprinter
B:49 Ci sono 2 modi per cambiare le impostazioni di stampa di questa stampante
Using translation memory (TM) software
• Translation of a printer manual English (A) → Italian (B)
Source text (in language A)
ST: “There are 4 ways to change print settings for thisprinter”
Fuzzy match 85% similar (a few words in translation unit are different)
A: “There are theseveral ways to change print settings for printer”
B: “Ci sono vari modi per cambiare le impostazioni di stampa alla stampante”
Fuzzy match 60% similar (some words in translation unit are different)
A: “There are several ways to modify of yourthe default setting printer”
B: “Ci sono vari modi per modificare l’impostazione standard della tua stampante”
• With the acceptibility threshold of the TM tool set at 75%, no
candidate translation unit under that level of similarity is retrieved
a50nd shown to thetranslator!!
Using translation memory (TM) software
• CAT tools - Advantages
• can speed up the translation process and increase productivity
•can improve translation quality (by enhancing terminological and phraseological coherence)
• can help translators provide quotations
• allow for collaboration over large projects
• TMs/termbases can be shared by several translators and updated in realtime
• Useless for some text types (e.g. literature)
• Essential for many specialized/technical domains
• Translation agencies require translators to use (specific types of) CAT tools
• Technical / practical issues
• different approaches: some CAT tools have a proprietary, stand-alone text editor, others are «integrated» (e.g. to Word processor), some recent ones are fully online
• proprietary vs. interchange formats
• no matches calculated below sentence-level (e.g. at phrase level)
• but Concordance function is becoming standard
• criteria used to define similarity / matches• matching is calculated not on the basis of sentence or
word meaning, but on the basis of character-stringsimilarity
TP: I bambini giocano in gruppo con il pallone
FM1: I pampini giovano il grullo con il tallone (94% match)FM2: I bimbi si divertono giocando a calcio insieme (42% match)
16Some issues about TMs
• Language / translation issues
• segmentation implies that overall perception of the ST/TT is lost→ ST structure tends to be reproduced in TT
• cross-linguisticdifferences in e.g. cohesive patternsmight be overlooked
• using TMs limits the translator’s creativity, as s/he is usually expected to use the terminology and phraseology included in the TM
• TMs can sometimes be reversed, as if translation direction did not matter…
• need to control the reliability of translations within TM
16Some issues about TMs
CORPORA AND TRANSLATION
1
•“a collection of naturally-occurringlanguage text, chosen to characterize a
state or variety of a language” (Sinclair, 1991:171)
•“a collection of texts assumed to be representative of a given language, dialect, or other subset of a language, to be used for linguistic analysis” (Francis, 1992:7)
• “a closed set of texts in machine-readable form establishedfor general or specific purposes by previously defined criteria” (Engwall, 1992:167)
•“a finite-sized body of machine-readable text, sampled in order to bemaximally representative of the language variety under consideration”(McEnery & Wilson, 1996:23)
•“a collection of (1) machine-readable (2) authentic texts […] which is (3) sampled to be (4) representative of a particular language or language variety” (McEnery et al., 2006:5)
What is a corpus?
Some (authoritative) definitions
What is / is not a corpus…?
A newspaper archive on CD-ROM? An online glossary?A digital library (e.g. Project Gutenberg)?All RAI 1 programmes (e.g. for spoken TV
language)
The answer is always “NO”
(see definition)
Corpora vs. web•Corpora:
– Usually stable•searches can be replicated
– Control overcontents•we can select the texts to be included, or have control over selection strategies
– Ad-hoc linguistically-awaresoftware to investigate them•concordancers can sort / organise concordance lines
• Web (as accessed via Google or other search engines):– Very unstable
•results can change at any time for reasons beyond our control
– No control over contents•what/how many texts are indexed by Google’s robots?
– Limited control oversearch results•cannot sort or organise hits meaningfully; they are presented randomly
Click here for another corpus vs. Google comparison
What types of corpora exist?Abrief overview
• A corpus is a principled collection of naturally occurring electronic
texts designed to be a representative sample of language in actual use
• Some of the main features and criteria used to describe and classify
corpora:general
specialised
written
spoken (transcribed)
multimodal (audio/video)
balanced (sample)
opportunistic
synchronic
diachronic
static
dynamic
closed / finite
open-ended (monitor)
raw (pre-corpus)
marked-up (augmented)
POS-tagged (augmented)
annotated (augmented)
monolingual
bi- / multilingual
parallel
comparable
An example of planned balance:
the British National Corpus
100 m words of contemporary spoken and writtenBritish
English Representative of British English “as a whole”
Designed to be appropriate for a variety of uses:
lexicography, education, research, commercial
applications (computational tools)
Balanced with regard to genre, subject matter andstyle
Sampling and representativeness very difficult toensure
Dynamic (Monitor) vs static (Finite)
A static corpus will give a snapshot of language use at a given time
Easier to control balance of contentMay limit usefulness, esp. as time passes
A dynamic corpus is ever-changingCalled “monitor” corpus because allows us to monitor language change over time
Concordance for nodeword “eyes” (sorted 1L) generated
from theBNC
Parallel (translational) corpora
•contain translationally “equivalent” texts: STs and their corresponding TTs
•need to be aligned, usually at the sentence level, i.e. SL sentence X matched to TL sentence X’
•context is provided to account for “equivalence” and “translation shifts” between ST and TT
•translation direction needs tobe clear, i.e. which are SL and TLcomponents of the corpus
Comparable corpora
•texts originally produced (not translated) in the respective languages
•consist of independent texts which are “similar” according to some pre-determined criteria
•the various language components share a set of common features, e.g. text type, genre, publication span, domain, topic
•parameters definingthis similarity vary widely
63
Parallel vs. comparable multilingual corpora
Bilingual parallel corpora on the web
64
• OPUS corpus, opus.lingfil.uu.se
• A variety of multilingual parallel corpora• European Parliament debates (EuroParl corpus)
• European Central Bank corpus• UN documents• Subtitles (open subtitle project)• Software manuals (PHP, OO)
• …
Query
Sort +
Launch the queryChooseTL(s)
help
http://opus.lingfil.uu.se/ → EuroParl v7 search interface
Other usefulfunctions
Choose SL
66
Comparable Eng/Ita corpus on botany
Summing up: corpus use in translationMain uses:
Test/generate hypotheses as to interpretation of the source text, and as to appropriate translations
helpful when you’re dealing with little known text-types / domains helpful when you’re dealing with a little knownlanguage
Improve quality – capture subtleties of source text, produce translations which read like native speaker texts
More precisely,
Reference corpora provide insights onphraseological regularities indiscourse
Comparable corpora (automatic and manual) can be used for (contrastive) specialised/genre-controlled text analysis
Parallel corpora provide equivalents in context/evidence of translation strategies (and are more versatile than TMs)
Discuss the most important milestones in the historyof
Machine Translation, also with regard to the evolution
of MT approaches.
Question on MT
8
9
Sample answer on MT
• Important points, to mention: ☺
– “Birth” of MT, idea by W. Weaver and Memorandum (1949)
– Innovative idea of using computersto translate
– First demonstrationGeorgetown-IBM(1954)
– US government funding, 1st generation direct rule-based approaches
– FAHQMT ideal (with short explanation)
– ALPAC report (1966): gist of contentsand negativerepercussionsin the
US
• Important points, to mention: [continued]: ☺
– ’80s: PCs more and more widespread, commercial systems in US and
Europe
– 2nd generation, rule-based transferapproach
– rule-based interlinguaapproach (complex,disappointing)
– Excessive ideal of FAHQMT gradually replaced by more realistic
alternatives
– Statistical approach replaces rule-based approach (advantages)
– Increase in the number of texts in digital format to be translated
(Internet,email)
– End ’90s: TA systems availableonline (even for free)
– MT integration with CAT tools
– L1a0test developments: neural MT
Sample answer onMT
11
Sample answer onMT
• Less relevant, non essential points:
– It is not necessary to provide definitions (for MT, input, output, etc.),
you can take them for granted
– Civil applications of MT (e.g. for warcryptography)
– Individuals (apart from W. Weaver)
– Conferences and journals, specific research centres
– Not important to differentiate between countries (apart from US
beginning)
– Not necessary to mention single MT tools or firms
12
Sample answer onMT
• FOR THIS QUESTION,avoid :
– Explain pre-editingvs post-editing
– Detailed explanation of sublanguageand controlled language
– Detailed explanation of different rule-based approaches
– Detailed explanation of statistical MT approaches, the resources needed
to developthem and their advantages
– Details on the ALPAC committee
– Reviewof available online MT systems
– Detailed explanation of how MT is integrated into CAT tools
top related