Corpus Linguistics Notes
A series of notes on corpus linguistics.
CASS: Corpus Approaches to Social Science
Using comparable and parallel corpora in contrastive and translation studies
Richard Xiao, Lancaster University
Outline of the session
• Types of corpora used in translation and contrastive studies
• Paradigmatic shift in contrastive and translation studies
• A model of Contrastive Corpus Linguistics
• Alignment and parallel concordancing
• Corpus resources and tools
Types of corpora: Some distinctions
• Monolingual versus multilingual corpora
• Parallel versus comparable corpora
• Comparable versus comparative corpora
Monolingual vs. multilingual corpora
• Monolingual corpora
  • A corpus that only involves one language
• Multilingual corpora
  • A corpus that contains texts of more than one language
  • A corpus covering two languages is conventionally known as ‘bilingual’
  • Multilingual corpora, in a narrow sense, must involve more than two languages
  • ‘Multilingual’ and ‘bilingual’ are often used interchangeably
• Parallel and comparable corpora
Parallel vs. comparable corpora
• Terminological confusion centres around the terms
• For some scholars (e.g. Aijmer & Altenberg 1996; Granger 1996: 38)
  • Corpora composed of source texts in one language and their translations in another language (or other languages) are ‘translation corpora’, while those comprising different components sampled from different native languages using comparable sampling techniques are called ‘parallel corpora’
• For many others (e.g. Baker 1993: 248, 1995, 1999; Barlow 1995, 2000: 110; Hunston 2002: 15; McEnery and Wilson 1996: 57; McEnery, Xiao & Tono 2006)
  • Corpora of the first type are labelled ‘parallel corpora’ while those of the latter type are ‘comparable corpora’
Parallel vs. comparable corpora
• Consistent and logical ways of doing things…
  • We can say a corpus is a translation or a non-translation corpus if the criterion of corpus content is used
  • But if we choose to define corpus types by the criterion of corpus form, we must use the criterion consistently
  • We can say a corpus is parallel if the corpus contains source texts and translations in parallel, or it is a comparable corpus if its components or subcorpora are comparable by applying the same sampling techniques and representing similar balance
  • It is simply inconsistent and illogical to refer to corpora of the first type as ‘translation corpora’ by the criterion of content while referring to corpora of the latter type as ‘comparable corpora’ by the criterion of form!
Multilingual vs. monolingual comparable corpora
• A common practice in Translation Studies (TS) is to compare a corpus of translated texts (‘translational corpus’) with a corpus comprising comparably sampled non-translated native texts in the target language
  • The ZJU Corpus of Translational Chinese (ZCTC) vs. the Lancaster Corpus of Mandarin Chinese (LCMC)
• The two sub-corpora form monolingual comparable corpora, as opposed to multilingual comparable corpora composed of comparable texts in different languages (e.g. LCMC vs. FLOB)
Comparative corpora
• Corpora containing different varieties of the same language are not comparable corpora
  • e.g. the International Corpus of English (ICE); the Brown family of corpora
• All corpora, as a resource for linguistic research, are well suited for comparative studies, in either intralingual or interlingual research
• Corpora of this kind are comparative corpora
Use of parallel & comparable corpora
• Parallel and comparable corpora “offer specific uses and possibilities” for contrastive and translation studies (Aijmer & Altenberg 1996: 12)
  • Giving new insights into the languages compared, insights that are not likely to be gained from the study of monolingual corpora
  • Used for a range of comparative purposes and increasing our knowledge of language-specific, typological and cultural differences, as well as of universal features
  • Illuminating differences between source texts and translations, and between native and non-native texts
  • Used for a number of practical applications, e.g. in lexicography, language teaching and translation
Use of parallel & comparable corpora
• Used primarily for translation and contrastive studies
• The two types of corpora have their own characteristics, and serve different purposes
  • Parallel corpora: useful in translation studies, but they alone serve as a poor basis for cross-linguistic contrast, because translations cannot avoid the effect of translationese
  • Comparable corpora: well suited for contrastive research, but less useful in translation studies, e.g. in studying translation equivalents
Using corpora in translation studies
• Translational corpora
  • Used in combination with a comparable TL corpus to provide primary evidence in product-oriented Translation Studies, and in studies of “translation universals”
  • If corpora of this kind are encoded with sociolinguistic and cultural parameters, they can also be used to study the sociocultural environment of translations
• Monolingual SL and TL corpora
  • Can raise the translator’s linguistic and cultural awareness in general
  • A useful and effective reference tool for translators
  • Used in combination with a parallel corpus to form a so-called ‘translation evaluation corpus’: helping translator trainers or critics to evaluate translations more effectively and objectively
Using corpora in translation studies
• Parallel corpora
  • Useful in exploring how an idea in one language is conveyed in another language, thus providing indirect evidence for the study of the translation process
  • Indispensable for building statistical or example-based machine translation (EBMT) systems, and for the development of bilingual lexicons and translation memories
  • Parallel concordancing is a useful tool for translators
• Comparable corpora of SL and TL
  • Useful in improving the translator’s understanding of the subject field and improving the quality of translation in terms of fluency, correct term choice and idiomatic expressions in the chosen field
  • Can also be used to build terminology banks
Corpora in contrastive linguistics
• Contrastive analysis
  • Became an important part of foreign language teaching (FLT) methodology after WWII and remained dominant throughout the 1960s
  • Lost ground to more learner-oriented approaches, e.g. error analysis, performance analysis, and interlanguage analysis
  • Revived in the 1990s
• The rapid development of corpus linguistics has been recognized as a principal reason for its revival (cf. Salkie 2002; Xiao & McEnery 2010; Xiao 2011)
Corpora in contrastive linguistics
• The marriage of corpus linguistics and contrastive analysis is an entirely natural one
  • Corpus linguistics is inherently comparative in nature
  • The combination of corpus analysis and contrastive analysis can produce, and has produced, a synergy that benefits both corpus linguistics and contrastive analysis
• Corpora have “always been pre-eminently suited for comparative studies” (Aarts 1998: ix)
  • Corpora of the Brown family (Lancaster 1931, LOB, FLOB, BE2006; B-Brown, Brown, Frown, AE2006)
  • Even the BNC, which is designed as a balanced corpus representing modern British English in general, provides a useful basis for various intra-lingual comparisons
Corpora in contrastive linguistics
• Corpus analysis techniques are also intrinsically comparative
  • keyword analysis
  • collocation analysis
  • interlanguage analysis
• Corpus-based contrastive linguistics has emerged with a wealth of methodologies, addressing a wide spectrum of cross-linguistic issues (cf. Altenberg & Granger 2002; Granger 2003)
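To see why keyword analysis is comparative by design, consider how a keyness score is computed: it always needs a study corpus and a reference corpus. The sketch below uses the log-likelihood statistic, one common keyness measure chosen here purely for illustration (the slides do not name a statistic). The function name and the counts are made up for the example.

```python
import math

def log_likelihood(freq_study, freq_ref, size_study, size_ref):
    """Log-likelihood keyness of one word across a study and a reference corpus."""
    total = size_study + size_ref
    # Expected frequencies if the word were equally common in both corpora.
    expected_study = size_study * (freq_study + freq_ref) / total
    expected_ref = size_ref * (freq_study + freq_ref) / total
    ll = 0.0
    # A zero observed frequency contributes nothing to the sum.
    if freq_study > 0:
        ll += freq_study * math.log(freq_study / expected_study)
    if freq_ref > 0:
        ll += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * ll

# Made-up counts: a word occurring 20 times in one 1,000-word corpus
# and 5 times in another 1,000-word corpus.
score = log_likelihood(20, 5, 1000, 1000)
print(round(score, 2))
```

A word with the same relative frequency in both corpora scores zero; the larger the score, the more strongly the word distinguishes one corpus from the other.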
Corpus-based Translation Studies
• Laviosa (1998a): “the corpus-based approach is evolving, through theoretical elaboration and empirical realisation, into a coherent, composite and rich paradigm that addresses a variety of issues pertaining to theory, description, and the practice of translation.”
• Hypothesis that translation universals can be identified and tested by using corpus data (Baker 1993, 1995)
• Rapid development of corpus linguistics, especially multilingual corpus research in the early 1990s
• Increasing interest in Descriptive Translation Studies (Toury 1995)
Corpus-based Translation Studies
• Tymoczko (1998): “Corpus Translation Studies is central to the way that Translation Studies as a discipline will remain vital and move forward.”
• Meta 43/4 (1998); Kenny (2001); Bowker (2002); Laviosa (2002); Granger et al (2003); Teich (2003); Zanettin et al (2003); Mauranen et al (2004); Olohan (2004); Santos (2004); Rogers & Anderman (2007); Beeby et al (2009); Saldanha (2009); Hruzov (2010); Izwaini (2010); Tengku et al (2010); Véronis (2010); Xiao (2010, 2011, 2012); Hu (2011); Kruger et al (2011); Wang (2012)
• Corpus-based Translation Studies book series (Shanghai Jiao Tong University Press / Springer)
The Holmes-Toury map
• Applied Translation Studies
• Descriptive Translation Studies
• Theoretical Translation Studies
Applied Translation Studies
• Three major contributions of corpora
• Corpus-assisted translating
  • Bowker (1998: 631): “corpus-assisted translations are of a higher quality with respect to subject field understanding, correct term choice and idiomatic expressions.”
• Corpus-aided translation teaching and training
  • Bernardini (1997): ‘large corpora concordancing’ (LCC) can help students to develop ‘awareness’, ‘reflectiveness’ and ‘resourcefulness’, the skills that distinguish a professional translator from unskilled amateurs
• Development of translation tools
  • Corpora, and especially aligned parallel corpora, are essential for the development of translation technology such as machine translation (MT) systems, computer-aided translation (CAT) tools and translation memories (TM)
Descriptive Translation Studies
• Characterized by its emphasis on the study of translation per se, aiming to answer the question of “why a translator translates in this way” instead of “how to translate”
• Baker (1993) predicted that the availability of large corpora of both source and translated texts, together with the development of the corpus-based approach, would enable translation scholars to uncover the nature of translation as a mediated communicative event
Descriptive Translation Studies
• Three focuses (Holmes 1972/1988)
• The function of translation
  • Concerned with the study of contexts rather than texts: e.g. function or impact of a translation work
  • Relatively few function-oriented studies are corpus-based
• Translation as a process
  • Aiming to reveal the thought processes that take place in the mind of translators while they are translating
  • One possible way for corpus-based DTS is to analyse off-line the written transcripts of recorded Think-Aloud Protocols (TAPs)
  • Research on translation as a product can also provide indirect evidence for translation as a process (product vs. process)
• Translation as a product
  • Concerned with describing translation as a product by comparing comparable corpora of translated and non-translated texts in the TL
  • Attempting to uncover evidence to validate or invalidate the so-called translation universal hypotheses
Descriptive Translation Studies
• Core patterns of lexical use in translated texts (Laviosa 1998b)
  • A relatively low proportion of lexical words to function words
  • A relatively high proportion of high-frequency words to low-frequency words
  • Relatively greater repetition of the most frequent words
  • Less variety in the most frequently used words
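Patterns of this kind can be approximated with simple corpus statistics. The sketch below is a minimal illustration, not Laviosa's own procedure: the function name, the tiny function-word list, and the choice of the three most frequent types are all assumptions made for the example. In practice the same measures would be computed for a translated corpus and a comparable non-translated corpus, and then compared.

```python
from collections import Counter
import re

# Assumption: a tiny illustrative function-word list; a real study would use
# a full list or a POS tagger to separate lexical from function words.
FUNCTION_WORDS = {"the", "a", "an", "of", "in", "on", "to", "and",
                  "is", "was", "it", "that"}

def lexical_profile(text, top_n=3):
    """Return simple measures related to the core patterns of lexical use."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        raise ValueError("text contains no word tokens")
    counts = Counter(tokens)
    lexical = [t for t in tokens if t not in FUNCTION_WORDS]
    top = counts.most_common(top_n)
    return {
        "tokens": len(tokens),
        "types": len(counts),
        # Proportion of lexical (content) words among all running words.
        "lexical_density": len(lexical) / len(tokens),
        # Share of all running words taken up by the top_n most frequent types.
        "top_n_coverage": sum(c for _, c in top) / len(tokens),
        # Type/token ratio as a rough indicator of lexical variety.
        "type_token_ratio": len(counts) / len(tokens),
    }

profile = lexical_profile("The cat sat on the mat and the dog sat on the rug")
for name, value in profile.items():
    print(name, value)
```

Type/token ratio is sensitive to text length, so comparisons are only meaningful between samples of similar size.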
Descriptive Translation Studies
• Beyond the lexical level
• Simplification: “tendency to simplify the language used in translation” (Baker 1996: 181-182)
• Normalisation: “tendency to exaggerate features of the target language and to conform to its typical patterns” (Baker 1996: 183)
• Explicitation: translations tend to “spell things out rather than leave them implicit” (Baker 1996: 180)
• Sanitisation: translated texts are “somewhat ‘sanitised’ versions of the original” (Kenny 1998: 515)
• Leveling out (convergence): “tendency of translated text to gravitate towards the centre of a continuum” (Baker 1996: 184)
Theoretical Translation Studies
• Aims “to establish general principles by means of which these phenomena can be explained and predicted” (Holmes 1988: 71)
• Closely related to, and often reliant on, the empirical findings produced by Descriptive Translation Studies
• A good testing ground for using DTS findings to pursue a general theory of translation is the hypothesis of so-called “translation universals” (TUs), the inherent common features of translational language
  • An important area of corpus-based TS over the past decade
Contrastive Corpus Linguistics
• Bringing together the strengths of contrastive analysis and corpus analysis
• This synergy has not only revived contrastive analysis but has also expanded the fields of corpus linguistics, translation studies, and SLA research
• A new model of Contrastive Corpus Linguistics (Xiao & McEnery 2010) to demonstrate the promise and potential value of the corpus-based approach to contrastive and translation studies
• Common platform for research areas including corpus linguistics, contrastive linguistics, translation studies, and SLA
Corpus alignment
• We have so far assumed that ‘parallel corpora’ means aligned parallel corpora
• Alignment is an essential step in the construction and exploitation of parallel corpora
• Without alignment, we cannot easily determine which sentences in TL are translations of which in SL
• Corpus alignment makes explicit the information regarding the translation in a parallel corpus, with the aim of finding translation equivalents at different levels (sentence, phrase, word) between the SL and TL texts in a parallel corpus
• Most multilingual corpus tools only take pre-aligned parallel texts as input in parallel concordancing
Corpus alignment
• Levels of alignment
  • Document level
  • Paragraph
  • Sentence
  • Phrase (multi-word unit)
  • Word
• Sentence alignment is generally the first step to phrase and word alignment
Corpus alignment
• Combined vs. stand-alone format
  • Combined/embedded: the source and translated texts are stored in a single file
  • Stand-alone: the texts are stored in separate files, with the SL and TL segments of each translation unit linked by a unique identifier or pointer
• Conversion between the two formats is possible
• Different parallel concordancers may have different requirements
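The conversion between the two formats can be sketched in a few lines. The layouts used here are assumptions for illustration only: the combined format is taken to be one tab-separated SL/TL pair per line, and the stand-alone format is represented as two lists of (identifier, segment) tuples with a hypothetical `s1`, `s2`, … identifier scheme. Real concordancers define their own file layouts.

```python
def combined_to_standalone(combined_lines, sep="\t"):
    """Split 'SL<sep>TL' lines into two ID-linked segment lists."""
    source, target = [], []
    for n, line in enumerate(combined_lines, start=1):
        sl, tl = line.rstrip("\n").split(sep, 1)
        seg_id = f"s{n}"  # hypothetical identifier scheme
        source.append((seg_id, sl))
        target.append((seg_id, tl))
    return source, target

def standalone_to_combined(source, target, sep="\t"):
    """Rejoin two ID-linked segment lists into 'SL<sep>TL' lines."""
    target_by_id = dict(target)
    return [f"{sl}{sep}{target_by_id[seg_id]}" for seg_id, sl in source]

# Invented toy data: two aligned English-Chinese sentence pairs.
combined = ["Good morning.\t早上好。", "Thank you.\t谢谢。"]
src, tgt = combined_to_standalone(combined)
roundtrip = standalone_to_combined(src, tgt)
```

Because the link is carried by the identifier rather than by position, the target list can be stored or reordered independently without losing the alignment.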
Corpus alignment
• Statistical (probabilistic) approach to sentence alignment
  • Usually based on sentence length in terms of words or characters
• Linguistic (knowledge/rule-based) approach
  • Using morpho-syntactic information to explore similarities between languages
  • Punctuation and “anchor points”
  • Achieving more accurate alignment, but necessarily slow
• Hybrid approach
  • Most widely used approach to sentence alignment
  • Integrating linguistic knowledge into a probabilistic algorithm to achieve improved accuracy
  • Making use of anchor points
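The length-based idea can be illustrated with a small dynamic-programming aligner. This is a deliberately simplified sketch, not any published algorithm: it replaces a probabilistic cost model with the absolute difference in character length plus fixed penalties, and the function name, the penalty values and the `ratio` parameter are all assumptions chosen for the example.

```python
def align_by_length(src, tgt, ratio=1.0, skip_penalty=50, merge_penalty=10):
    """Align sentences by dynamic programming over character lengths.

    Returns a list of (source_indices, target_indices) alignment units.
    `ratio` is the assumed target/source length ratio.
    """
    # Allowed unit types: (source sentences, target sentences, penalty).
    moves = [(1, 1, 0), (1, 0, skip_penalty), (0, 1, skip_penalty),
             (2, 1, merge_penalty), (1, 2, merge_penalty)]
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj, penalty in moves:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                len_s = sum(len(s) for s in src[i:ni])
                len_t = sum(len(t) for t in tgt[j:nj])
                c = cost[i][j] + abs(len_s * ratio - len_t) + penalty
                if c < cost[ni][nj]:
                    cost[ni][nj] = c
                    back[ni][nj] = (i, j)
    # Trace the cheapest path back from the end of both texts.
    units = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        units.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    units.reverse()
    return units

# Invented toy data: the second English sentence is rendered as two sentences.
english = ["Hello.", "I arrived late because the train was delayed.", "Sorry."]
french = ["Bonjour.", "Je suis venu en retard.", "Le train avait du retard.", "Pardon."]
alignment = align_by_length(english, french)
print(alignment)
```

A hybrid aligner in the sense described above would add linguistic evidence, for example lowering the cost of a unit when both sides share an anchor point such as a number or a proper name.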
Corpus alignment
• Research on alignment has focused on European language pairs
• Sentence alignment among closely related European language pairs has achieved a very high accuracy rate (98%+)
• But less accurate for typologically different languages such as English and Chinese (ca. 80%+), typically requiring human intervention or post-editing
Corpus alignment
• InterText Editor (with automatic Hunalign)
  • Supporting different operating systems
  • Local and networked server
  • http://wanthalf.saga.cz/intertext
• WinAlign in SDL-Trados
  • Commercial CAT software tool
• Uplug corpus tools
  • http://sourceforge.net/projects/uplug/?source=dlp
Parallel concordancing
• ParaConc
  • Commercial software (US$89): http://www.paraconc.com/
  • Unicode compliant
  • Semi-automatic alignment
  • Computing and highlighting collocation
  • Supporting 2-4 aligned parallel texts stored in separate files
Parallel concordancing
• CUC_Paraconc
  • Freeware tool
  • Supporting up to 16 parallel texts stored either in one file or in different files
  • Unicode compliant
  • Supporting Regular Expression search
  • Displaying results in KWIC format, and saving results either in a single text file or in different files
  • www.fass.lancs.ac.uk/projects/corpus/data/CUC_Paraconc.zip
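What a parallel concordancer does with pre-aligned input can be sketched as follows. This is a generic illustration and does not reproduce the behaviour of ParaConc or CUC_Paraconc: the function name, the context width and the sentence pairs are all invented for the example. It runs a regular expression over the source side and returns each hit in KWIC form together with the aligned target segment.

```python
import re

def parallel_kwic(pairs, pattern, width=20):
    """Return (left, match, right, translation) for each regex hit in the source side."""
    regex = re.compile(pattern, re.IGNORECASE)
    hits = []
    for source, target in pairs:
        for m in regex.finditer(source):
            left = source[max(0, m.start() - width):m.start()]
            right = source[m.end():m.end() + width]
            hits.append((left, m.group(0), right, target))
    return hits

# Invented toy data: pre-aligned English-Chinese sentence pairs.
pairs = [
    ("The corpus contains translated texts.", "该语料库包含翻译文本。"),
    ("We aligned the texts by sentence.", "我们按句子对齐了文本。"),
    ("Translation is a mediated event.", "翻译是一种中介性事件。"),
]
hits = parallel_kwic(pairs, r"\btranslat\w*")
for left, keyword, right, translation in hits:
    # Pad the left and right context so the keyword lines up in a column.
    print(f"{left:>20} [{keyword}] {right:<20} | {translation}")
```

The search only works because each source segment arrives already paired with its translation, which is why most tools expect pre-aligned texts.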
Parallel concordancing
• Terminology in multilingual corpus linguistics
• Types of corpora used in contrastive and translation studies
• Relationship between corpus linguistics and contrastive analysis
• Corpus-based translation studies
• Corpus alignment and parallel concordancing
• Well known and influential corpora
  • www.fass.lancs.ac.uk/projects/corpus/cbls/corpus_survey.pdf
UCCTS conferences
• International conferences on Using Corpora in Contrastive and Translation Studies
• UCCTS1: China
  • www.lancs.ac.uk/fass/projects/corpus/UCCTS2008Proceedings
• UCCTS2: UK
  • www.lancs.ac.uk/fass/projects/corpus/UCCTS2010Proceedings
• UCCTS3 (jointly with ICLC7): Belgium
  • http://www.iclc7-uccts3.ugent.be/
• UCCTS4: July 2014, Lancaster
  • http://ucrel.lancs.ac.uk/uccts4/