8/11/2019 Corpora for Translators
http://slidepdf.com/reader/full/corpora-for-translators 1/12
Corpora for Translators
Jarmila Fictumová
Corpora
Monolingual: foreign-language, Czech, general, specialized (based on genre or field, ad hoc)
Bi- and Multilingual: parallel (= translation corpora); comparable (e.g. for searching technical terminology)
Learner corpora: monolingual; parallel
CORPUS LINGUISTICS TERMINOLOGY
BASICS.
Tagging (mark-up; annotation): assigning explicit linguistic
information to a text (parts of speech & semantic annotation)
TARGET LANGUAGE (TL): the language into which we translate
CQL: Contextual Query Language (also Corpus Query
Language)
DIACHRONIC CORPUS: a corpus that captures language development
over an extended time period
CONCORDANCE: the immediate context of a lexical unit
CORPUS LINGUISTICS TERMINOLOGY
BASICS.
CORPUS: an extensive collection of authentic electronic texts (written or
speech transcripts) collected according to specific criteria.
Corpus manager: software that searches for concordances of specified
terms; it finds all the instances of a term within a given corpus.
KWIC: displaying the key word in context, usually aligned in the centre of the screen.
LEMMA: a word form chosen as the representative (a headword) of a group of related word forms.
Lexeme: each unique word within a corpus (e.g.: He lived by the forest down by the river)
OPEN CORPUS: a corpus to which new content is added regularly.
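The KWIC display described above can be illustrated with a short Python sketch. It is only a minimal illustration, not how any particular corpus manager works: the function name `kwic`, the `width` parameter and the sample text are all hypothetical, and matching is done with a plain regular expression rather than a real corpus index.

```python
import re

def kwic(text, keyword, width=30):
    """Return one line per occurrence of keyword (whole word, case-insensitive)."""
    lines = []
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-justify the left context so every key word starts in the same column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right}")
    return lines

# Hypothetical sample text, used only to show the layout.
sample = (
    "A corpus is a collection of authentic texts. "
    "A corpus manager searches the corpus for a given term."
)
for line in kwic(sample, "corpus"):
    print(line)
```

Each printed line is one concordance line: the key word sits in a fixed column with up to `width` characters of context on either side.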
CORPUS LINGUISTICS TERMINOLOGY
BASICS.
PARALLEL CORPUS: texts in different languages that are translations of one
another, aligned sentence by sentence, not unlike translation memories.
COMPARABLE CORPUS: texts in different languages that are not translations, but do have some features in common.
SYNCHRONIC CORPUS: a corpus that does not track changes resulting from
language development.
Tokens: all words, regardless of form, contained within a corpus.
Source Language (SL): the language from which we translate.
Alignment/Pairing: finding and matching the corresponding
segments in different language versions of a text.
Find more at Overview of the basic corpus linguistics terms
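The difference between tokens (all running words) and unique words can be checked on the example sentence from the terminology slides. The sketch below is a simplification: it assumes that a token is any unbroken run of letters and that upper and lower case are treated alike, which real corpus tools do not necessarily do.

```python
import re
from collections import Counter

sentence = "He lived by the forest down by the river"

# Simplifying assumption: a token is a maximal run of letters, lowercased.
tokens = re.findall(r"[a-z]+", sentence.lower())
counts = Counter(tokens)

print(len(tokens))   # number of tokens (running words): 9
print(len(counts))   # number of unique words: 7
print([word for word, n in counts.items() if n > 1])  # repeated words: ['by', 'the']
```

The sentence has nine tokens but only seven unique words, because "by" and "the" each occur twice.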
Corpus Tools
Information taken from the article "Corpus Linguistics
Help with Text Writing" (muni.cz 14.1.2014)
by Zuzana Nevěřilová, researcher at the Natural
Language Processing Centre at the Faculty of
Informatics at MU and teacher at the Centre for
Computer Linguistics at the Faculty of Arts at
MU.
The Sketch Engine has been developed for over ten years by Lexical Computing Ltd. in cooperation with the Natural Language Processing Centre at Masaryk University. All of the university students and employees have free access to this corpus-based program. …
... The Sketch Engine computes a word sketch showing which partner words the keyword co-occurs with and also how often and in what context this happens. … The Sketch Engine can then use the word sketches to compute suitable word partners on larger units (phrases). The output of this process is a Thesaurus that helps us find words related in meaning. …
However, the software also contains a number of advanced functions for working with user-generated corpora (automatic keyword extraction, sub-corpora based on document length or author attributes) or multilingual (parallel) corpora. The Sketch Engine currently provides access to more than 400 corpora in 70 languages. All of the functions are described in the documentation.
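To make the idea of "partner words" that co-occur with a keyword more concrete, here is a rough Python sketch that counts the words found within a small window around each occurrence of a keyword. This is not the Sketch Engine's method: its word sketches are organised by grammatical relation, whereas the sketch below uses plain window counting. The function name `cooccurrences`, the `window` parameter and the sample text are assumptions made for illustration.

```python
import re
from collections import Counter

def cooccurrences(text, keyword, window=2):
    """Count words appearing within `window` tokens of each occurrence of keyword."""
    # Simplifying assumption: a token is a maximal run of letters, lowercased.
    tokens = re.findall(r"[a-z]+", text.lower())
    partners = Counter()
    for i, token in enumerate(tokens):
        if token != keyword:
            continue
        start = max(0, i - window)
        # Words before and after the keyword, excluding the keyword itself.
        neighbours = tokens[start:i] + tokens[i + 1:i + 1 + window]
        partners.update(neighbours)
    return partners

# Hypothetical sample text.
sample = (
    "Strong tea and strong coffee are common. "
    "She drinks strong tea every morning, never weak tea."
)
for word, count in cooccurrences(sample, "tea").most_common(3):
    print(word, count)
```

In this sample "strong" is the most frequent partner of "tea". A real tool would also weigh such counts against how common each word is in the whole corpus.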
WEB-BASED CORPORA
(MORE INFORMATION in the article about a lecture by ING. VLADIMÍR BENKO)
ENGLISH-LANGUAGE CORPORA
Araneum Anglicum Maius (En Web 14.04) 1,20 G
enTenTen12
New Model Corpus
ukWaC
Times…
CZECH-LANGUAGE CORPORA
czTenTen12 [v. 7]
OPUS2 Czech
CzechParl 2012
Bruna Bohemica Minor (czes 14.04) 121 M
CNK (Czech National Corpus)
CORPUS INTERFACE USER GUIDE
FOREIGN-LANGUAGE CORPORA
Mark Davies: Professor, Corpus Linguistics, Brigham Young University
corpus.byu.edu
Further options
University of Leeds: Tutorial in English
PILOT RUN
A TOOL FOR CREATING ERROR-TAGGED MONO- AND BILINGUAL PARALLEL
OR LEARNER CORPORA.