Corpora for Translators

Jarmila Fictumová

Post on 02-Jun-2018

Page 1: Corpora for Translators

8/11/2019 Corpora for Translators

http://slidepdf.com/reader/full/corpora-for-translators 1/12

Corpora for Translators

Jarmila Fictumová

Corpora

Monolingual: foreign-language, Czech, general, specialized (based on genre or field, ad hoc)

Bi- and multilingual: parallel (= translation corpora); comparable (e.g. for searching technical terminology)

Learner corpora: monolingual; parallel

CORPUS LINGUISTICS TERMINOLOGY: BASICS

TAGGING (mark-up; annotation): assigning explicit linguistic information to a text (parts of speech and semantic annotation).

TARGET LANGUAGE (TL): the language into which we translate.

CQL: Corpus Query Language, the notation used to search a tagged corpus (the abbreviation is also expanded as Contextual Query Language).

DIACHRONIC CORPUS: a corpus that documents language development over an extended time period.

CONCORDANCE: the occurrences of a lexical unit, each shown in its immediate context.
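
To make tagging and CQL more concrete, here is a minimal sketch in Python. It is not a CQL parser: the query is written as a plain Python list, and the sample sentence, its hand-assigned tags (loosely modelled on Penn Treebank labels) and the function name `find_matches` are all assumptions made for this illustration. In a real corpus manager the attribute names and the tagset depend on the corpus.

```python
import re

# Hand-tagged tokens: (word, lemma, tag). The tags are illustrative only;
# a real corpus is tagged automatically and may use a different tagset.
TAGGED = [
    ("He", "he", "PRP"), ("lived", "live", "VBD"), ("by", "by", "IN"),
    ("the", "the", "DT"), ("forest", "forest", "NN"), ("down", "down", "RB"),
    ("by", "by", "IN"), ("the", "the", "DT"), ("river", "river", "NN"),
]
ATTRS = ("word", "lemma", "tag")


def find_matches(tokens, query):
    """Return every token sequence that satisfies the query.

    `query` is a list of dicts, one per token position. Each dict maps an
    attribute name to a regular expression that must match the whole value.
    """
    rows = [dict(zip(ATTRS, t)) for t in tokens]
    hits = []
    for start in range(len(rows) - len(query) + 1):
        window = rows[start:start + len(query)]
        if all(re.fullmatch(pattern, row[attr])
               for row, cond in zip(window, query)
               for attr, pattern in cond.items()):
            hits.append(" ".join(row["word"] for row in window))
    return hits


# Roughly what a query such as [tag="IN"] [tag="DT"] [tag="NN.*"] asks for:
# a preposition, then a determiner, then a noun.
query = [{"tag": "IN"}, {"tag": "DT"}, {"tag": "NN.*"}]
print(find_matches(TAGGED, query))
```

Because the tags are explicit, the search can ask for a grammatical pattern rather than for particular words; searching by lemma, e.g. `[{"lemma": "live"}]`, finds the form "lived".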

CORPUS LINGUISTICS TERMINOLOGY: BASICS

CORPUS: an extensive collection of authentic electronic texts (written texts or speech transcripts) collected according to specific criteria.

CORPUS MANAGER: software that searches for concordances of specified terms; it finds all the instances of a term within a given corpus.

KWIC: key word in context; a display in which the key word is shown with its context, usually aligned in the centre of the screen.

LEMMA: a word form chosen as the representative (headword) of a group of related word forms.

LEXEME: each unique word within a corpus (e.g. in "He lived by the forest down by the river", "by" and "the" each occur twice but count once).

OPEN CORPUS: a corpus to which new content is added regularly.
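
What a corpus manager does when it produces a KWIC display can be sketched in a few lines of Python. This is a simplified illustration, not the way any particular corpus manager is implemented; the function names `kwic` and `show` and the default context width are assumptions chosen for the example.

```python
def kwic(tokens, keyword, width=3):
    """Return one (left, key, right) triple per hit of `keyword`.

    `width` is the number of context words kept on each side.
    """
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


def show(lines):
    """Print the hits with the key word aligned in one central column."""
    pad = max((len(left) for left, _, _ in lines), default=0)
    for left, key, right in lines:
        print(f"{left:>{pad}}  [{key}]  {right}")


text = "He lived by the forest down by the river"
hits = kwic(text.split(), "by")
show(hits)
```

Right-aligning the left context is what places every occurrence of the key word in the same column, which makes recurring patterns on either side easy to spot.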

CORPUS LINGUISTICS TERMINOLOGY: BASICS

PARALLEL CORPUS: texts in different languages that are translations of one another, aligned sentence by sentence, not unlike translation memories.

COMPARABLE CORPUS: texts in different languages that are not translations, but do have some features in common.

SYNCHRONIC CORPUS: a corpus that represents a language at one point in time; it does not study changes resulting from language development.

TOKENS: all running words contained within a corpus, counted every time they occur, regardless of form.

SOURCE LANGUAGE (SL): the language from which we translate.

ALIGNMENT/PAIRING: finding and matching the corresponding segments in different language versions of a text.

Find more at Overview of the basic corpus linguistics terms 
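
The difference between tokens, unique words and lemmas is easiest to see by counting them. The sketch below uses the example sentence from the terminology slides; the one-entry lemma table is hand-made for this sentence and is an assumption of the example (a real corpus would be lemmatised automatically).

```python
from collections import Counter

sentence = "He lived by the forest down by the river"

# Tokens: every running word, repeats included.
tokens = sentence.lower().split()

# Unique word forms (the slides call these lexemes): each form counted once.
unique_forms = Counter(tokens)

# Hand-made lemma table for this sentence only; forms that are missing
# from it are treated as their own lemma.
LEMMAS = {"lived": "live"}
lemmas = Counter(LEMMAS.get(t, t) for t in tokens)

print(len(tokens))                  # number of tokens
print(len(unique_forms))            # number of unique word forms
print(unique_forms.most_common(2))  # the two repeated forms
print(lemmas["live"])               # "lived" is counted under its lemma
```

The sentence has nine tokens but only seven unique forms, because "by" and "the" each occur twice.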

Corpus Tools

Information taken from the article "Corpus Linguistics Help with Text Writing" (muni.cz, 14.1.2014) by Zuzana Nevěřilová, researcher at the Natural Language Processing Centre at the Faculty of Informatics at MU and teacher at the Centre for Computer Linguistics at the Faculty of Arts at MU.

The Sketch Engine has been developed for over ten years by Lexical Computing Ltd. in cooperation with the Natural Language Processing Centre at Masaryk University. All of the university students and employees have free access to this corpus-based program. …

... The Sketch Engine computes a word sketch showing which partner words the keyword co-occurs with and also how often and in what context this happens. … The Sketch Engine can then use the word sketches to compute suitable word partners for larger units (phrases). The output of this process is a Thesaurus that helps us find words related in meaning. …
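
The idea of "partner words" can be illustrated with a much simpler technique than a word sketch: counting which words occur within a few positions of the keyword. The Python sketch below does only that. It is not the Sketch Engine's method, which the article does not describe in detail; the function name `partner_words`, the window size and the three-sentence corpus are made up for the illustration.

```python
from collections import Counter


def partner_words(sentences, keyword, window=2):
    """Count the words found within `window` positions of each keyword hit."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            if token == keyword:
                before = tokens[max(0, i - window):i]
                after = tokens[i + 1:i + 1 + window]
                counts.update(before + after)
    return counts


# A made-up corpus, for illustration only.
corpus = [
    "she drinks strong tea every morning",
    "he prefers strong tea with milk",
    "they sell green tea and black tea",
]
partners = partner_words(corpus, "tea")
print(partners.most_common(3))
```

Even on this tiny sample "strong" turns up next to "tea" more than once, which is the kind of regularity a translator looks for when checking whether a word combination sounds natural.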

However, the software also contains a number of advanced functions for working with user-generated corpora (automatic keyword extraction, sub-corpora based on document length or author attributes) or multilingual (parallel) corpora. The Sketch Engine currently provides access to more than 400 corpora in 70 languages. All of the functions are described in the documentation.
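
Automatic keyword extraction is mentioned above without further detail. One common approach, sketched here in Python, is to compare how frequent each word is in the user's own corpus with how frequent it is in a general reference corpus. This is an assumption about the general technique, not a description of the Sketch Engine's implementation; the function name `keywords`, the smoothing value and both sample texts are hypothetical.

```python
from collections import Counter


def keywords(focus_tokens, reference_tokens, smoothing=1.0, top=3):
    """Rank words by how much more frequent they are in the focus corpus.

    Frequencies are normalised per million tokens. `smoothing` is added to
    both values so that words absent from the reference corpus do not
    cause a division by zero.
    """
    focus = Counter(focus_tokens)
    reference = Counter(reference_tokens)

    def per_million(count, total):
        return count * 1_000_000 / max(total, 1)

    scores = {
        word: (per_million(count, len(focus_tokens)) + smoothing)
        / (per_million(reference[word], len(reference_tokens)) + smoothing)
        for word, count in focus.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top]


# Two made-up mini-corpora: a specialised text and a general one.
focus = "the contract binds the parties and the contract ends".split()
reference = "the cat sat on the mat and the dog slept".split()
print(keywords(focus, reference))
```

Words such as "the" and "and" are frequent in both texts and score close to 1, so they drop out, while "contract" rises to the top. For a translator this is a quick way to pull candidate terminology out of an ad hoc specialised corpus.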

WEB-BASED CORPORA

(MORE INFORMATION in the article about a lecture by ING. VLADIMÍR BENKO)

ENGLISH-LANGUAGE CORPORA

Araneum Anglicum Maius (En Web 14.04), 1.20 G
enTenTen12
New Model Corpus
ukWaC
Times…

CZECH-LANGUAGE CORPORA

czTenTen12 [v. 7]
OPUS2 Czech
CzechParl 2012
Bruna Bohemica Minor (czes 14.04), 121 M

FOREIGN-LANGUAGE CORPORA 

Mark Davies: Professor, Corpus Linguistics, Brigham Young University

corpus.byu.edu 

Further options

University of Leeds: Tutorial in English 

PILOT RUN 

A TOOL FOR CREATING ERROR-TAGGED MONO- AND BILINGUAL PARALLEL OR LEARNER CORPORA.