BORJA ALBI, ANABEL (2007). “Corpora for Translators in Spain. The CDJ-GITRAD Corpus and the GENTT Project” in Gunilla Anderman and Margaret Rogers (eds.) Incorporating Corpora - The Linguist and the Translator, Series Translating Europe, 243-265. Multilingual Matters, Clevedon. ISBN: 978-1-85359-985-9
Corpora for Translators in Spain. The CDJ-GITRAD Corpus and the GENTT Project
Anabel Borja (Universitat Jaume I)
Introduction
The use of electronic language resources (LRs), such as corpora, lexicons, grammar
descriptions and ontologies, is currently the subject of numerous research projects in
Spain, and the results obtained in the future could completely change the way translators
work today. Research projects on the development of LRs conducted in recent years
have shown the potential of these tools and the phenomenon of the translator’s turn in
this field. Not only have developments in technology influenced translation research, but
translation scholars and professionals have in turn helped to determine the direction in
which these corpus technologies have developed. The influence of translation in the
development of corpora is evident since corpora are compiled with the needs of their
potential users in mind, and translators and the creators of tools for automating
translation - such as translation memories (TM), Computer Assisted Translation (CAT)
or Automatic Translation (AT) programmes - form one of the most important groups of
potential users of corpora.
This chapter aims to provide an overview of the corpora available in Spain that could
prove useful for translators and translation researchers. The goal is to help translators
working with Spanish (as a source or target language) to become familiar with readily
available and efficiently exploitable computer tools of this kind. It also discusses the
possibilities of exploiting these resources and the Internet in general as a dynamic, open,
accessible and continually evolving corpus. With this in mind, a brief introduction is
given describing the evolution of the corpus concept and its applications in translating.
Then the key concepts related to corpora are briefly defined and the various types and
designs of corpora are discussed in order to agree a common terminology that will allow
us to refer consistently to the various types of corpora that are available.
In the next section we present some of the corpora containing texts available in
Spanish. This list includes all the corpus resources we have identified as useful tools for
professional translators and translation researchers. The study ends with a description of
the CDJ-GITRAD corpus, a Multilingual Corpus of Legal Documents, and the GENTT
project, which is currently compiling a multilingual encyclopaedia of specialised texts
for translators. Both projects have been developed by translation research teams that
have adopted a corpus design organised around the concept of genre, and are specifically
intended for medical, technical and legal translators working in Spanish.
Evolution of the concept of corpus
In recent times, the use traditionally made of corpora in the field of literature,
linguistics and lexicography has been extended, paving the way for their application in
the teaching of languages, technical writing and translation. From the first corpora,
designed exclusively for the use of experts researching the syntactical or morphological
aspects of language (for instance, see Biber, Conrad and Reppen (1998), Kennedy (1998)
and McEnery and Wilson (1996)) we have now reached the point where these tools are
in general use for the day-to-day work of language teachers and students, journalists,
writers of all kinds of texts and, of course, translators, who find in this resource an
excellent tool for observing language in use.
The advent of the Internet is perhaps the most revolutionary aspect of the evolution of
corpora. The Internet has made it possible to gain access to thousands of on-line
documents, find highly specialised texts, consult texts on the same subject in several
languages, locate translated texts, rapidly build corpora and perform linguistic and
statistical analysis on corpora using web browsing interfaces. In fact, the Internet itself
can be considered a corpus, the largest one in existence and the broadest in scope.
However, the Internet has not only made it possible to access corpora and their tools,
but has also proved an enormous spur to their development. In a work on the processing
of bilingual corpora Abaitua (2002) explores how the Internet boom “has greatly
increased the demand for technologies with a capacity for multilingual processing: smart
browsers, indexing and cataloguing systems, extractors of information, knowledge
managers, text generators, summary generators, etc”. All these applications use corpora
as their raw material, and researchers in the field of artificial intelligence, computational
linguistics and voice recognition who conduct their experiments using corpora can no
longer conduct or conceive of research without this most valuable tool. The technology
available to these researchers has enabled them to create an infinite variety of types and
designs of corpora to test their computer tools. If they need a corpus that does not exist,
they will compile it themselves, since they have very powerful scanners and OCR
software as well as programs with highly sophisticated text recognition and analysis
features.
To end this brief look at the evolution of corpora and their application to translation, I
would say we are very close to the philosophy of the DIY corpus. Commercial software
programs are available for translators that allow the user (for example an English-
Spanish technical translator) to extract texts on a particular subject (for example,
dermatology) from the web (medicine portals, on-line dermatology journals in Spanish
and English), organise and store them on the basis of pre-established criteria and even
automatically align translated texts and extract the specialized terminology. Who could
ask for more? As a result of these developments, the concept of corpus has undergone a
revolution and corpora are no longer the sacred (and expensive) tools they were some
years ago. Today they are instantly available for translators and translation researchers to
study and analyse: we can adapt them to our needs and make them our own. With this
philosophy in mind, let us look again at the possibilities of designing corpora and what
they have to offer translators of various kinds.
Towards an ideal Corpus Design for Translators
Although the concept of corpus is very simple – it is no more than an extensive
collection of texts in electronic format – there are many variables relating to the design
and applicability of corpora that require a more painstaking analysis. There is an infinite
number of ways the texts in a corpus can be selected and compiled, and those chosen
will depend on the type of corpus to be created and the resources available (in terms of
money, time, knowledge, etc.). The needs of the professional translators or researchers
for whom the corpus is intended play an important role in the choice of these variables.
When planning the design of corpora for translators the profile and objectives of those
who will use the corpus must be accurately defined. Various authors (Abaitua 2002;
Atkins et al. 1992; McEnery and Wilson 1996) emphasise the influence that these
criteria will have on the resulting corpus.
A number of binary elements are involved in the design of a corpus (dynamic/static,
monolingual/multilingual, oral/written, general texts/specialised texts,
synchronic/diachronic, full texts/fragments of texts, tagged/untagged, …). The
combination of these elements with each other generates an almost infinite number of
potential designs. Let us think of some examples: a) synchronic monolingual corpus of
18th century Spanish literary texts, full text, no tagging; b) diachronic corpus (1900-
2000) of legally drafted wills, not translations but comparable originals in each language,
bilingual Spanish-English, full text, with morphological tagging; c) multilingual
English/Spanish/French/German corpus of medical texts on oncology from 1950
onwards that are not translations of each other, with annotation of identifying
information (date, author, source…); d) bilingual English-Spanish corpus of translated
aligned texts on mixed mobile telephony, with no tagging.
In order to create a corpus we will have to decide between the following dichotomies,
amongst others:
- Printed / Electronic
- On line / Off line
- Oral / Written
- Full texts / Fragments of text
- Static / Dynamic
- Synchronic / Diachronic
- Large / Small
- General / Specialised
- Literary sources / Non-literary sources
- Representative (of a specialised language, variety of language, genre, etc) /
Non-representative (sample corpus)
- Organised around a genre / Not organised around a genre
- Monolingual / Multilingual
- Parallel multilingual / Comparable multilingual
- Aligned parallel multilingual / Non-aligned parallel multilingual
- Plain / Annotated
- For Professional Use / For Research Purposes
- For General Research Purposes / For Specific Research Purposes
- For Professional Use in General Translation / For Professional Use in
Specialized Translation
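The dichotomies listed above can be read as independent switches whose combination defines a corpus design. The following is a minimal sketch in Python, assuming a hypothetical CorpusDesign record whose field names are purely illustrative and do not belong to any existing corpus tool. It encodes example (d) above, the bilingual English-Spanish corpus of translated aligned texts with no tagging; since that example does not say whether the corpus is synchronic or diachronic, the sketch assumes synchronic.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class CorpusDesign:
    """Hypothetical record of design choices; field names are illustrative."""
    languages: Tuple[str, ...]
    diachronic: bool = False    # False = synchronic
    specialised: bool = False   # False = general
    parallel: bool = False      # True = translated texts, False = comparable originals
    aligned: bool = False       # only meaningful for parallel corpora
    annotated: bool = False     # False = plain

    def __post_init__(self):
        # Reject combinations that the dichotomies rule out.
        if self.parallel and len(self.languages) < 2:
            raise ValueError("a parallel corpus needs at least two languages")
        if self.aligned and not self.parallel:
            raise ValueError("only parallel corpora can be aligned")

    def describe(self) -> str:
        n = len(self.languages)
        labels = {1: "monolingual", 2: "bilingual", 3: "trilingual"}
        parts = [labels.get(n, "multilingual")]
        if n > 1:
            parts.append("parallel" if self.parallel else "comparable")
            if self.parallel:
                parts.append("aligned" if self.aligned else "non-aligned")
        parts.append("diachronic" if self.diachronic else "synchronic")
        parts.append("specialised" if self.specialised else "general")
        parts.append("annotated" if self.annotated else "plain")
        return ", ".join(parts)


# Example (d): translated, aligned English-Spanish texts, no tagging.
example_d = CorpusDesign(languages=("en", "es"), specialised=True,
                         parallel=True, aligned=True)
print(example_d.describe())
```

Only a handful of the dichotomies are modelled here; the remaining ones (oral/written, full texts/fragments, static/dynamic, and so on) could be added as further boolean fields in the same way.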
Not too long ago, corpora were collections of printed texts, but now when we talk
about corpora it is implicitly understood that we are referring to texts in electronic
format, normally with annotation, or tagging, of some kind. While the first (electronic)
corpora were very costly tools to which only specialists and researchers had access
(many were subscription only), today the fundamental characteristic of corpora is their
accessibility. Although some can only be obtained on CD, most can be consulted on line,
either by subscription or free of charge. Anyone can access them, at any time
and from anywhere via the Internet. The spread of on-line corpora has meant that the
proportion of dynamic corpora has increased. These are corpora that are not considered
complete at any given time, but are always open to the incorporation of new elements.
In general terms, we could say that translators are only interested in written corpora,
but the needs of interpreting and the translation of audiovisual media should not be
overlooked, since they may also benefit from the contribution of spoken language
corpora. Opting for a synchronic or a diachronic design will depend on whether we want
to observe a language at a particular point in time or in the process of evolving.
One aspect in which considerable development has been observed is the size and
degree of specialisation of the corpora being compiled. Today, technological advances
mean that corpora that previously took years of work to complete can now be completed
in a matter of weeks. Initially corpora consisted mainly of literary texts, but now we find
comprehensive corpora on all imaginable subjects (sociology, cardiology, textiles,
pottery, law, etc.), consisting of texts of all types and genres (newspaper articles,
contracts, technical instructions, children’s writing, and so on). The degree of
specialisation may vary from corpora that are entirely general, such as the CREA
(literature, press, science, law, etc.) to corpora devoted entirely to a single field of
specialisation, such as the CDJ-GITRAD for legal texts or TURICOR for tourism
(these three corpora are described later in this chapter).
Another fundamental decision to be made when defining the design of a multilingual
corpus is its size and the ratio between its size and its representativeness. How
representative a corpus is will depend on its intended use, the degree of specialisation of
the texts, the number of languages, etc. Obviously, a million-word corpus of Spanish
weather reports will be far more representative of its genre than a corpus of Spanish
literary texts of the same size, because the language of weather reports is far more restricted.
As Abaitua (2002:96) points out: ‘Another criterion for classification concerns the
genre and the type of text. This criterion mainly affects the representative character of
the corpus. The selection criteria will be very different depending on whether the aim is
to design a specialised corpus, such as the Aarhus Corpus – on European contract law –
or create a reference corpus that covers a wide range of styles and registers.’
Corpora organized around the concept of genre constitute a new category that has
very useful applications for translation, which we will discuss in greater detail when we
talk about two multilingual corpora we are currently developing, the CDJ-GITRAD
corpus and the GENTT encyclopaedia of genres.
Within this process of change and improvement, the move from monolingual corpora
to multilingual corpora should also be mentioned, and within multilingual corpora, the
appearance of parallel aligned multilingual corpora. With rare exceptions most of the
corpora that have been compiled in the world have, until recently, been monolingual.
Multilingual corpora have appeared in a wide variety of forms, since the “language”
variable itself generates many possible combinations. In the case of multilingual corpora
the only obligatory requirement is the presence of texts in various languages, but the
type of link between them in the different languages can be manifold:
- Monolingual / Bilingual / Trilingual / Multilingual
- Comparable corpora (untranslated texts) / Parallel corpora (translated texts)
- Comparable corpora of texts with a particular feature in common / Comparable
corpora of texts with no features in common
- Parallel non-aligned corpora / Parallel aligned corpora
- Spontaneous parallel corpora / Artificially compiled parallel corpora
- Unidirectional parallel corpora / Multidirectional parallel corpora
- Multilingual texts by native speakers / Multilingual texts by non-native speakers
Comparable corpora are made up of texts in different languages which may be related
in various ways, but are not translations of each other. They may have nothing in
common at all, or be on the same subject, of the same genre, or from the same
chronological period, etc. Comparable corpora containing texts that are not on the same
subject are used for studying general linguistic phenomena and are mainly used by
linguists, although they may also help translators to define patterns of language and
compare texts. Some examples are the CDJ-GITRAD Corpus of Legal texts in Spanish,
Catalan and English, the Aarhus Corpus of Contract Law texts in Danish, French and
English and the bilingual English-Chinese corpus compiled by Fung in 1995, used to
generate bilingual dictionaries.
Comparable corpora of texts on the same subject matter are one of the most valuable
tools for professional translation. They are much easier to compile than parallel corpora,
provide substantial information in terms of terminology, field, and context and are used
mainly to observe the use of specialised language in very restricted fields such as hedge
funds, statistics, musical composition, etc. Comparable corpora of texts written at
different periods of time are used to observe linguistic changes over the course of time,
and are of great interest to the translator, particularly the literary translator of historical
texts.
Parallel corpora can develop naturally or as the result of a specific translation
commission or project. Multilingual international organisations are the most important
source of parallel corpora that originate spontaneously. This would be the case of the
European Union’s databases of documents, the Hansard database of legislation in French
and English in Canada, or the Hong Kong legislation database made up of documents in
English and Chinese. These naturally produced parallel corpora are the result of a set of
particular social circumstances and constitute one of the most important resources for the
specialised translator in certain areas such as law, the discipline in which the author of
this chapter is working. Their use (by alignment or in some other way) is one of the areas
of research with the greatest potential. In fact, these organizations are currently trying to
exploit their resources by putting together AT systems based on vast translation
memories and grammar parsers (the EU Euramis Project is a good example).
On the other hand, parallel corpora can also be compiled from existing translations.
One of the main problems of parallel corpora compiled in this way might be the selection
criteria for the translations to be included and the criteria for validating their quality.
These considerations are relevant for corpora intended for professional translators who
will use them as reference tools and need to be sure that they meet the necessary quality
criteria. However, this should not be considered a problem for translation researchers
interested in analysing existing translations as they stand.
Parallel aligned corpora are the most useful tool for translators but pose important
problems that have not yet been fully resolved. As Lawler points out in his review of
Botley, McEnery and Wilson (2000), “Text alignment, it quickly becomes clear, is the
outstanding problem in research on multilingual corpora, and thus -- to the extent that
progress has been made in its solution -- its outstanding success story. The problems that
arise in alignment research reprise practically every issue in Natural Language
Processing (NLP) and Automatic Translation, (e.g., sentence division, anaphor tracking,
ambiguity resolution), and the peculiar limitations of the alignment task make the
application of alignment strategies to these broader problems surprisingly productive”.
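To give a concrete idea of what sentence alignment involves, the following is a minimal sketch in Python of a length-based aligner. It is not a technique described in this chapter and not the method of any tool cited here: it simply assumes that a sentence and its translation tend to have similar lengths in characters, and uses dynamic programming to choose between 1-1, 1-0, 0-1, 2-1 and 1-2 pairings. The penalty values and the sample sentences are invented for illustration.

```python
SKIP_PENALTY = 10   # assumed cost of leaving a sentence unmatched (1-0 or 0-1)
MERGE_PENALTY = 5   # assumed cost of pairing two sentences with one (2-1 or 1-2)

# (source sentences consumed, target sentences consumed, penalty)
BEADS = [(1, 1, 0), (1, 0, SKIP_PENALTY), (0, 1, SKIP_PENALTY),
         (2, 1, MERGE_PENALTY), (1, 2, MERGE_PENALTY)]


def align(src, tgt):
    """Pair up two lists of sentences using character lengths only."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    # cost[i][j] = cheapest way of aligning src[:i] with tgt[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj, penalty in BEADS:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                len_src = sum(len(s) for s in src[i:ni])
                len_tgt = sum(len(t) for t in tgt[j:nj])
                candidate = cost[i][j] + abs(len_src - len_tgt) + penalty
                if candidate < cost[ni][nj]:
                    cost[ni][nj] = candidate
                    back[ni][nj] = (i, j)
    # Walk back from the end to recover the chosen pairings.
    pairs = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    pairs.reverse()
    return pairs


english = ["This agreement is governed by Spanish law.",
           "The buyer pays the price.",
           "The seller delivers the goods."]
spanish = ["El presente contrato se rige por la legislación española.",
           "El comprador paga el precio y el vendedor entrega los bienes."]

for source_group, target_group in align(english, spanish):
    print(source_group, "<->", target_group)
```

In this invented example the two short English sentences are paired with the single Spanish sentence that translates them both. Real texts raise exactly the issues Lawler lists (sentence division, ambiguity, and so on), which a length heuristic alone cannot resolve.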
In unidirectional parallel corpora we always find the originals in the same language
whereas in multidirectional parallel corpora the originals may be in any language of the
corpus. Unidirectional parallel corpora are typically used for studying the literary works
of a writer and their translations into one or more languages.
Apart from the pure text, a corpus can also be provided with additional linguistic
information, called ‘annotation’ or ‘tagging’. If we look at the type of annotation, within
the category of multilingual corpora we find everything from highly structured corpora
with very specific tagging aimed at a specific aspect of research (such as an English-
Spanish corpus of research articles on sustainable tourism with tagging aimed at
comparing the use of certain syntactic structures) to enormous collections of texts in
several languages compiled with no particular selection criterion and no annotation.
Annotation may be of different kinds, such as prosodic, semantic or historical.
Grammatically tagged corpora constitute the most common form of annotated corpora. In
a grammatically tagged corpus, the words have been assigned a word class label (part-of-
speech tag).
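As an illustration of what a grammatically tagged corpus looks like in practice, the following minimal sketch in Python stores each word together with its word class label. The sentence has been tagged by hand and the tag names (DET, NOUN, VERB, ADJ) are assumptions made for this example; real corpora use their own, usually much finer-grained, tagsets.

```python
from collections import Counter

# Hand-tagged toy sentence; the tag labels are illustrative assumptions.
tagged_sentence = [
    ("El", "DET"), ("traductor", "NOUN"), ("consulta", "VERB"),
    ("el", "DET"), ("corpus", "NOUN"), ("jurídico", "ADJ"),
]


def tag_frequencies(tagged):
    """Count how often each word class label occurs."""
    return Counter(tag for _, tag in tagged)


def words_with_tag(tagged, tag):
    """Return the words that carry a given label, in text order."""
    return [word for word, word_tag in tagged if word_tag == tag]


print(tag_frequencies(tagged_sentence))
print(words_with_tag(tagged_sentence, "NOUN"))
```

Once the labels are in place, a query such as "all nouns" or "every adjective following a noun" becomes a simple filter over the pairs, which is what makes tagged corpora useful for comparing syntactic structures.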
The potential uses of corpora depend on how well they have been designed to meet
the needs of research or professional use. However, in order to take full advantage of all
the information a corpus contains, it is essential to have proper annotation tools that are
suited to each particular corpus. Whether a corpus is annotated and the type of
annotation used will depend on the purpose for which it will be used. In the past
annotated corpora were used for specific aspects of linguistic research, particularly
computational linguistics, but today practically all corpora have some kind of annotation
thanks to the new automatic annotation tools. The type of annotation may vary to a great
extent. Ball (1997) proposes the following classification of corpora:
- perfectly plain: e.g. Project Gutenberg texts, produced by scanning; no information
about text (usually, not even edition)
- marked up for formatting attributes: e.g. page breaks, paragraphs, font sizes, italics,
etc.
- annotated with identifying information, e.g. edition date, author, genre, register, etc.
- annotated for part of speech, syntactic structure, discourse information, etc.
In order to annotate a corpus it is no longer necessary to use expensive software
programs that are difficult to use and designed to carry out very specific tasks. Today
anyone can obtain very valuable textual information by applying very simple search
systems. The ever-increasing diversity of annotation possibilities is also observed in
modern corpora, depending on the purpose for which they are designed, and the
annotation of corpora is becoming increasingly automatic and standardised (with
standards such as the TEI Standard, the Text Encoding Initiative).
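To show how identifying information and text can be kept together in a standardised form, the following minimal sketch in Python reads a small TEI-style document with the standard library. The sample document is invented for this illustration and has not been validated against the TEI schema; it only assumes the TEI namespace and a few common elements (teiHeader, fileDesc, titleStmt, text, body, p, s).

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Invented sample: a header with identifying information, then the text itself.
sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Sample will (hypothetical)</title></titleStmt>
      <publicationStmt><p>Unpublished illustration</p></publicationStmt>
      <sourceDesc><p>Invented for this example</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p><s>I revoke all former wills.</s> <s>I appoint my sister as executor.</s></p>
    </body>
  </text>
</TEI>"""

root = ET.fromstring(sample)
ns = {"tei": TEI_NS}

# Identifying information lives in the header ...
title = root.find("tei:teiHeader/tei:fileDesc/tei:titleStmt/tei:title", ns).text
# ... while the body holds the text, here already divided into sentences.
sentences = [s.text for s in root.iterfind(".//tei:s", ns)]

print(title)
for sentence in sentences:
    print("-", sentence)
```

The header corresponds to the third of Ball's levels (annotation with identifying information), while the sentence markup in the body is a first step towards the fourth.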
Applications of corpora in Translation Studies and Professional Practice
This initial distinction between corpora for research and corpora for translating is viewed
as relevant because it results in very different corpus designs.
(1) Corpora for Translation Studies
The use of corpora for research into translation (also referred to as CTS, Corpus
Translation Studies) is a relatively new activity and is still in its early stages. The first
initiative was the project started by Mona Baker in Manchester in 1993 with the TEC
corpus. Although at that time professional translators were already using these tools to
carry out documentation tasks, they had not yet been used in Translation Studies. Today
there are still few examples of studies of this kind, and they are mostly restricted to
academic circles (PhD dissertations and research projects), but as Malmkjaer (2003:119)
points out: ‘The use in translation studies of methodologies inspired by corpus
linguistics has proved to be one of the most important gate-openers to progress in the
discipline since Toury’s re-thinking of the concept of equivalence’.
Because of the nature of their work, researchers will in principle require extensive
reference corpora from which they can extract conclusions applicable to general
phenomena. However, corpus design for Translation Studies will once again depend on
the particular aspect of the translation phenomenon to be analyzed: translation strategies,
translation norms, translation equivalence, features of translated texts (Baker 1993,
1999; Laviosa, 1998), shifts in translations of the same originals in the course of time or
changes in other parameters (Munday, 1998).
Corpora in translation studies are products of human minds, of actual human beings, and,
thus, inevitably reflect the views, presuppositions, and limitations of those human
beings. Moreover, the scholars designing studies utilizing corpora are people operating
in a particular time and place, working within a specific ideological and intellectual
context. Thus, as with any scientific or humanistic area of research, the questions asked
in CTS will inevitably determine the results obtained and the structure of the databases
will determine what conclusions can be drawn.
Of the research studies carried out to date using corpora, three major categories may
be mentioned:
- Those that examine contrastive aspects, whether at a linguistic or cross-
cultural level, using multilingual parallel corpora. Here we would find the
studies that look at the various ways of solving a particular translation
problem. Kenny (2001), for example, monitors the translation of creative
source-text word forms and collocations uncovered in a specially constructed
German-English parallel corpus of literary texts.
- Those that examine the characteristics of translated texts without taking the
originals into account, using monolingual corpora of translated texts in the
target language. This is the case of the studies initiated by Mona Baker with
her TEC corpora and continued by authors such as Sara Laviosa. These lines
of research suggest that translated corpora-based research may fruitfully be
used to discover the patterning specific to translational language and the
extent of the influence of variables such as source language, text genre or
translation mode on translational language. Laviosa’s 1996 study The English
Comparable Corpus (ECC): A Resource and a Methodology for the Empirical
Study of Translation, for example, sets out to develop a viable descriptive and
target-oriented corpus-based methodology for the systematic study of the
nature of translated text.
- Those that examine the characteristics of texts in the target language using
monolingual corpora of texts originally written in the target language.
Through an exhaustive analysis of vast quantities of computerised text,
translators can obtain invaluable information on grammar, semantic
relationships, the acceptability of certain usages, new or obsolete usage of
words, neologisms, and even pragmatic aspects of the target language (see
Hanks 1996; Moon 1998) which enable the translator to make decisions
consistent with those uses.
(2) Corpora for Professional Practice
The translation of specialised texts requires from the translator a range of knowledge
that generally exceeds the scope of the translator’s academic and personal training. The
notion of the translator as a scholar with an encyclopaedic knowledge of the world has
been a constant throughout history. The need for a profound understanding that does not
miss the nuances of the original is a requirement common to all specialist areas of
translation.
The increasing volume of translations and their degree of specialisation have seen
professional translators’ need for electronic LRs grow exponentially. In just a few years,
the Internet has become the main resource for terminological and conceptual
documentation for professional translators, and today it can be said that 95% of the
material they need can be found on the web. This has triggered an escalating demand for
on-line LRs and tools to enable the information to be searched for and exploited on line
too. In Spain a large number of projects have been set up to create LRs for translators,
and there are now many resources of this kind available to professional translators.
Monolingual corpora in the target language have proved to be an outstanding
terminological tool for specialised translation (Bowker 1998) and for the training of
professional translators (Borja 2005; Borja and Monzó 2001). Drawing on the data on
professional behaviour obtained in Monzó (2002) and on interviews conducted with 200 legal
translators over the last year, the GITRAD Research Team is currently preparing a
study on professional habits. Although the results have not been published as yet we can
advance some data here. Today more than half the professional translators of legal texts
in Spain use corpora and the Internet (as an on-line corpus) instead of traditional printed
dictionaries and when asked which they would give up if they had to choose between the
two, many say that they would keep the Internet. In fact they point out that corpora and
the Internet make it possible to carry out more intelligent context-dependent searches by
locating collocations and obtain results that a dictionary could never offer.
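The kind of collocation search the translators describe can be sketched in a few lines. The following minimal Python example counts the words that occur within a fixed window around a search word; the sample text, the window size and the function name are all assumptions made for this illustration, and real collocation tools add statistical measures that are not shown here.

```python
import re
from collections import Counter


def collocates(text, node, window=2):
    """Count the words found within `window` tokens either side of `node`."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            counts.update(left + right)
    return counts


# Invented sample text standing in for a small legal corpus.
sample_text = ("The parties shall enter into a contract. "
               "Either party may terminate the contract. "
               "Breach of contract entitles the other party "
               "to terminate the contract.")

for word, frequency in collocates(sample_text, "contract").most_common(5):
    print(word, frequency)
```

Even on this tiny sample the counts bring out "terminate" and "breach" as neighbours of "contract", which is the sort of context-dependent information a dictionary entry does not provide.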
At a textual level monolingual corpora in the target language and bilingual parallel
corpora are cited as a powerful textual resource for translators who can check the
appropriateness of their syntax and cohesive devices in original texts on the same subject
matter, period of time, level of formality, etc. From the point of view of conceptual
documentation that any translator of specialised texts needs to carry out, multilingual
comparable corpora are the resource of choice. In some very specialised areas where the
translator needs to acquire a certain level of understanding of the field before being able
to translate a text relating to a particular field, the use of multilingual parallel aligned
corpora enables the professional translator to look for translated words, expressions,
sentences, formulae, culture-marked terms, geographical names, etc. In the field of legal
translation, the availability of parallel corpora is greater than in other fields because of
the multilingual databases of international organisations, as indicated above. Legal
translators also mention the importance of self-made corpora of translated
documents as one of the main sources of information for their work but regret the fact
that they have not incorporated their translations into translation memories and find it
too demanding and time-consuming a task to undertake now. In fact, many translators still
work with original texts on paper. Only younger translators have worked with translation
memories from the start of their careers and can apply efficient automatic retrieval
systems and translation management methodologies.
Corpora Developers in Spain
This section covers the various resources in the form of electronic corpora that may be
used by professional translators or translation researchers working with Spanish,
including monolingual corpora, multilingual comparable corpora and multilingual
parallel aligned corpora. They all have much to offer translators.
Although the information is abundant, we are aware that many very interesting
projects will not be mentioned and that much of the information given here will have to
be updated within a short space of time because of the rapid progress being made in this
field. However, many of the works cited are long-term projects that will remain relevant
for a considerable time, since they are sponsored by government institutions or
universities. Another aspect worth mentioning about this list of resources is the insight it
may provide for translators creating their own resources or researching whether on-line
resources exist for their particular field of interest.
For the preparation of this section we have personally consulted leaders of groups
currently researching computational linguistics and corpora: Joseba Abaitua (DELI
Research Group); Alberto Alvarez Lugris (CLUVI Research Group); Gloria Corpas
(TURICOR project); Isabel García Izquierdo (GENTT Research Group); Esther Monzó
(GITRAD Research Group), among others. The information obtained from the web sites
of Manuel Barberá (http://www.bmanuel.org/), Joaquim Llisterri
(http://liceu.uab.es/~joaquim/) and David Lee (http://devoted.to/corpora) has also been
very useful. Finally, it will be seen that the subject resources are biased in favour of legal
and economic topics, which reflects the fact that I am a translator specialising in these
disciplines. I am sure that readers will be able to compensate for this imbalance in their
own particular areas of interest.
The resources identified belong to three categories: monolingual corpora, multilingual
comparable corpora and multilingual parallel corpora. Together with the reference and a
brief description they are defined by using the binary terminology established in
previous sections.
Before proceeding to describe the corpora that currently exist in Spain it is
illuminating to explore who is currently developing them: this means identifying the
most important sources of corpora so that the reader can periodically check the new
advances in the field. As we shall see, these are not translators or academic institutions
devoted to translation,
Corpora Compiled by Institutions Devoted to the Promotion and Diffusion of
Spanish
The most prestigious, extensive and representative monolingual corpora of Spanish
texts are those of the Real Academia de la Lengua Española and the Instituto Cervantes,
institutions devoted to the promotion and diffusion of Spanish (language, literature and
culture) throughout the world. No translator should neglect to visit the Royal Academy’s
web site and browse the inspiring resources it has to offer.
The monolingual Corpus Diacrónico del Español (CORDE),
corpus.rae.es/cordenet.html, compiled by the Instituto de Lexicografía de la Real
Academia de la Lengua Española is made up of texts from three historical periods, Edad
Media, Siglos de Oro and Época Contemporánea, and aims to be representative of the
Spanish language chronologically. It contains 125 million words, both fiction (verse and
prose) and non-fiction texts (scientific, social, press and advertising, religious, historical
and legal documents).
The monolingual Corpus de Referencia del Español Actual (CREA),
http://corpus.rae.es/creanet.html, also compiled by the Instituto de Lexicografía de la
Real Academia de la Lengua Española, covers twenty-five years, from 1975 to 1999. It is
still being compiled and will contain literary, journalistic, scientific and technical texts
as well as transcripts of oral language and media records.
The Miguel de Cervantes Virtual Library, http://www.cervantesvirtual.com, is another
treasure for the translator working with Spanish. It is a digital edition project of the
bibliographic heritage of Spanish and Latin American culture promoted by the
University of Alicante and the Banco Santander Central Hispano, with the collaboration
of the Marcelino Botín Foundation. It is a vast bibliographic and documentary repository
that can be freely accessed via the Internet. The Linguistic Tools section provides a
collection of tools specifically designed for analysing and exploiting digital texts. It has
an advanced text search engine that allows words to be searched for within texts, and a
concordance tool, which makes it possible to search for words in context.
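As a rough illustration of what a concordance tool does, the following minimal sketch in Python prints each occurrence of a search word with a fixed amount of context on either side (a keyword-in-context, or KWIC, display). It is not the tool offered by the Virtual Library; the function name kwic, the width parameter and the sample sentence (the opening of Don Quijote) are assumptions chosen for illustration.

```python
import re


def kwic(text, keyword, width=30):
    """Return one keyword-in-context line per whole-word match of `keyword`.

    `width` is the number of characters of context kept on each side.
    Matching is case-insensitive. (Hypothetical helper, for illustration.)
    """
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right}")
    return lines


# Sample text used only to demonstrate the function.
sample = (
    "En un lugar de la Mancha, de cuyo nombre no quiero acordarme, "
    "no ha mucho tiempo que vivía un hidalgo."
)

for line in kwic(sample, "no"):
    print(line)
```

Because the pattern matches whole words only, a search for "no" does not return "nombre"; a real concordancer would typically add sorting by left or right context and links back to the full document.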
A recent and extremely interesting resource also offered by the Cervantes Institute is
the Grammatical Archive of the Spanish Language, AGLE. It is not exactly a corpus of
texts, but rather a corpus of grammatical records compiled by the Spanish grammarian
Salvador Fernández Ramírez (1896-1983) which brings together an entire set of
language phenomena that he intended to use as a corpus for producing the grammar he
was planning. The result is a wide-ranging sample of the Spanish language from the
Poema del Cid to selections from the contemporary press, from pieces of Latin American
literature to a comment heard on the bus, selected by our greatest grammarian. It can be
accessed by grammatical categories, author, work, word or expression. While complex to
begin with, it is extremely interesting once its mechanism has been mastered.
http://cvc.cervantes.es/obref/agle/
Corpora Compiled by Publishing Houses
Next in order of importance are the monolingual corpora compiled by major
publishing houses in order to extract information for lexicons and dictionaries of all
kinds. The objective of these projects is to use corpora for extracting terminology and
creating ontologies. Outstanding in this section are the corpora of the publishing houses
SGEL and VOX. The Corpus CUMBRE offers a set of representative linguistic data for
the use of contemporary Spanish, compiled by the publisher SGEL, S.A. under the
supervision of a research team of the University of Murcia; although its purposes are
grammatical and lexicographic, it can be classified as a general purpose corpus, in view
of the diversity of the materials it contains. A free sample of this corpus may be obtained
with the purchase of the dictionary Gran diccionario de uso del español actual, Sánchez,
A.; Cantos, P. and Simón, J. (2001). The publishing house SGEL can be contacted at
http://www.sgel.es/espanyol/ieindex.htm.
The second publisher cited, VOX-Bibliograf, has compiled its own corpus for the
development of dictionaries. The Corpus Textual VOX-Bibliograf includes 10,352,337
words: literary texts (9.5%), journalistic (32.5%), scientific (4.5%), technical (29.5%),
academic (15.5%), advertising (0.5%), spoken language transcripts (3%), media
broadcasts (5%). It currently participates in the EuroWordNet Project. The
EuroWordNet project aims to develop a multilingual database with basic semantic
relationships between words for several European languages (Dutch, Italian and
Spanish). More information may be obtained through: http://www.vox.es. We have tried
to access their corpus but with no success. It seems it will be available through the
Spanish EuroWordNet database interface but currently the links to the corpus have
restricted access (http://www.illc.uva.nl/EuroWordNet).
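The composition figures given above for the Corpus Textual VOX-Bibliograf can be turned into approximate word counts per text type with simple arithmetic. The sketch below, in Python, assumes that the percentages apply to the stated total of 10,352,337 words; the variable and function names are illustrative only, and the resulting counts are estimates, not figures published by VOX.

```python
# Total size and percentage shares as reported in the text above.
TOTAL_WORDS = 10_352_337
SHARES = {
    "literary": 9.5,
    "journalistic": 32.5,
    "scientific": 4.5,
    "technical": 29.5,
    "academic": 15.5,
    "advertising": 0.5,
    "spoken language transcripts": 3.0,
    "media broadcasts": 5.0,
}


def words_per_category(total, shares):
    """Estimate the number of words per category from percentage shares."""
    return {name: round(total * pct / 100) for name, pct in shares.items()}


counts = words_per_category(TOTAL_WORDS, SHARES)

# Sanity check: the reported shares should cover the whole corpus.
print("Sum of shares:", sum(SHARES.values()), "%")
for name, n in sorted(counts.items(), key=lambda item: -item[1]):
    print(f"{name:<28} {n:>10,}")
```

Estimates of this kind help a translator judge whether a corpus is large enough in a given text type (here, for instance, journalistic and technical texts dominate, while advertising is marginal).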
The publishing house Diccionarios SM has also compiled a corpus of literary,
journalistic, scientific and technical texts for the development of lexicographic work of
60,000 words. Although we have not been able to search this corpus, the company may
be contacted at http://www.grupo-sm.com/ . However, through its website this publisher
offers an interesting resource for translators: the contents of the books that they publish
for secondary education (http://www.librosvivos.net).
Corpora Compiled by Linguistic Engineering and Computational Linguistics
Groups
Other important sources of corpora are the departments of linguistic engineering and
computational linguistics devoted to developing computer systems capable of
recognising, understanding, interpreting and generating human language in all its forms.
Research in this field includes the development of linguistic resources (morphological
grammars, formal and computational grammars, electronic lexicons with information in
conventional formats such as EAGLES), Computer Assisted Translation and Automatic
Translation programs, the development of person-machine interfaces and tools for
analysing and using corpora. The vast majority of language processing systems, if not all,
operate with monolingual or bilingual corpora of texts to which various
linguistic processors are applied (at the phonological, phonetic, textual, morphological,
lexical, syntactical, logical, semantic and pragmatic level).
In these contexts, the object of creating very extensive corpora is to provide adequate
bases for creating Example Based Machine Translation Systems (EBMT) or Machine
Generated Text, both monolingual and multilingual (automatic production of
multilingual technical documentation). Projects of this kind have generated numerous
multilingual corpora and are expected to generate many more in the near future.
Researchers believe that this could mean moving from CAT to AT in fields where
significant bilingual corpora exist. In many cases these are multilingual projects that
combine a specific economic activity (for example tourism) with the development of
linguistic technologies (AT; text generation, identification of types of texts or
information on Internet). Many of them receive European funding through actions such
as e-Content. The Council has recently approved the e-Contentplus programme. The 4-
year programme (2005-08), proposed by the European Commission, will have a budget
of € 149 million to tackle the fragmentation of the European digital content market and
improve the accessibility and usability of geographical information, cultural content and
educational material. EU-funded projects such as ELSNET and EUROMAP are behind
many regional subprojects. EUROMAP ("Facilitating the path to market for language and
speech technologies in Europe") - aims to provide awareness, bridge-building and market-
enabling services for accelerating the rate of technology transfer and market take-up of the
results of European HLT RTD projects. ELSNET ("The European Network of Excellence in
Human Language Technologies") - aims to bring together the key players in language and
speech technology, both in industry and in academia, and to encourage interdisciplinary co-
operation through a variety of events and services.
Mention should also be made of GilCub (Grup d’Investigació en Lingüística
Computacional) of the Universitat de Barcelona which concentrates on researching
systems for processing natural language, particularly areas related to AT. EUROTRA,
TRADE and MULTEXT are important projects developed by this group. MULTEXT is
of particular interest for translators because it has “developed a set of generally usable
software tools to manipulate and analyse text corpora, together with lexicons and
multilingual corpora in seven European languages. It has established conventions for the
encoding of corpora and harmonised specifications for computational lexicons, building
on and contributing to the preliminary recommendations of the relevant international and
European standardisation initiatives. All project results are freely and publicly
available”. On their webpage they also state that they have developed “The first freely
available large-scale multilingual text corpus for the seven languages”. At the time of
writing, the corpus was not available through the Internet but the group can be contacted
at http://www.ub.es/gilcub/ingles/projects/european/multext.html.
Another group working in this field is CLiC, Centre de Llenguatge i Computació, from the
same university, Universitat de Barcelona (http://clic.fil.ub.es/). CLiC has compiled a
corpus of 6,000,000 words with morphological and syntactical annotation, the Lexesp,
Léxico informatizado del español, containing journalistic texts, literary texts and
semi-specialized scientific articles. CLiC has also compiled a reference corpus of 1,000,000 words
with morphological and syntactical annotation, manually validated, which is the result of the
fusion of two important lexical resources: 500,000 words from the corpus Lexesp and
500,000 words from the electronic corpus of the newspaper La Vanguardia. Besides these
corpora, CLiC has produced bilingual lexicons connected to the lexico-semantic web
EuroWordNet: bilingual lexicon English-Spanish (more than 120,000 entries); bilingual
lexicon Catalan-Spanish (more than 32,000 entries); bilingual lexicon Catalan-English (in
progress). These resources can be searched at http://clic.fil.ub.es/demos/corpus/corpus.shtml
and are also available on CD-ROM (Sebastián et al. 2000).
The DELI Research Group of the University of Deusto,
http://www.deli.deusto.es/Resources/LEGE-Bi, was created in 1998, promoted by
professors of the Faculty of Language and ESIDE who were interested in digital edition
and linguistic engineering (http://www.deli.deusto.es/AboutUs ). Some of the projects
that are currently being developed by this group and concerned with electronic corpora
are: XTRA-Bi: Extracción automática de unidades bitextuales para memorias de
traducción (2000-2001). Main Researcher: Inés Jacob; LEGEBiDUNA: Textos paralelos
bilingües en euskara y castellano de las administraciones vascas con etiquetado
SGML/TEI-P3 (1995-2000). Universidad de Deusto. Main Researcher: Joseba Abaitua.
They also form part of the CORDE project. The bilingual Spanish-Basque LEGEBiDUNA
corpus - approximately 7 million words in each language, extracted from the bilingual
official bulletins of the Basque Administration: (BOA 1990-92), (BOB 1989-95) and
(BOPV 1995) - is a very useful resource for translators of legal and institutional
terminology. The research group is developing tools for the automatic drafting and
translation of administrative documents based on this bilingual corpus.
The SLI (Computational Linguistics Group of the University of Vigo) has developed
the corpus CLUVI (Linguistic Corpus of the University of Vigo). According to the
information provided on their webpage (http://sli.uvigo.es/CLUVI/info_en.html), “The
CLUVI is an open textual corpus of specialised registers of contemporary oral and
written Galician language developed by the University of Vigo. Since September 2003,
the SLI offers the possibility of searching and browsing the main sections of the CLUVI
Parallel Corpora (8 million words), that is, the TECTRA Corpus of English-Galician
literary texts, the FEGA Corpus of French-Galician literary texts, the LEGA Corpus of
Galician-Spanish legal texts, and the UNESCO Corpus of English-Galician-French-
Spanish scientific-technical divulgation texts. The public searching and browsing tool
designed by the SLI is available at http://sli.uvigo.es/CLUVI”.
The number of aligned works and language pairs available at this website increases
regularly, and with great vitality since the CLUVI is an academic research project in
progress. At the moment, the CLUVI Parallel Corpus webpage permits the search of four
major corpora -TECTRA, FEGA, LEGA and The Unesco Courier (of more than one
million words each)- as well as other minor parallel corpora now in progress. It should
be pointed out that the CLUVI interface also permits browsing of the Legebiduna
Corpus of Basque-Spanish administrative texts developed by the DELI group at the
University of Deusto.
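The kind of query that a parallel corpus interface supports can be sketched as follows: each entry pairs a source segment with its aligned translation, and a search on one side returns both sides. This is a minimal Python sketch, not the CLUVI search tool; the class AlignedPair, the function search_parallel and the English-Spanish segments are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignedPair:
    """One translation unit: a source segment and its aligned target segment."""
    source: str
    target: str


def search_parallel(pairs, term, side="source"):
    """Return every aligned pair whose chosen side contains `term`.

    The match is a simple case-insensitive substring test; `side` is
    either "source" or "target". (Hypothetical helper, for illustration.)
    """
    if side not in ("source", "target"):
        raise ValueError("side must be 'source' or 'target'")
    needle = term.lower()
    return [p for p in pairs if needle in getattr(p, side).lower()]


# Invented segments, used only to demonstrate the search.
pairs = [
    AlignedPair("The tenant shall pay the rent monthly.",
                "El arrendatario pagará la renta mensualmente."),
    AlignedPair("The landlord may terminate the lease.",
                "El arrendador podrá resolver el arrendamiento."),
    AlignedPair("This agreement is governed by Spanish law.",
                "El presente contrato se rige por la legislación española."),
]

for hit in search_parallel(pairs, "lease"):
    print(hit.source, "|", hit.target)
```

Searching the target side instead (for example search_parallel(pairs, "arrenda", side="target")) shows how a translator can move from a target-language stem back to the source expressions it renders.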
The cooperative effort between Lancaster University and the Universidad
Autónoma de Madrid has made possible the development of the CRATER Corpus. The
CRATER project, Corpus Resources and Terminology Extraction, involves three
languages: English, French and Spanish. The corpus developed consists entirely of
technical texts from the International Telecommunications Union (ITU). The CRATER
corpus has now been completed and consists of 5.5 million words. The texts are tagged
with part-of-speech and morphological annotation. This multilingual annotated corpus
can be searched at http://www.comp.lancs.ac.uk/linguistics/crater/corpus.html.
The Institut Universitari de Lingüística Aplicada (IULA),
http://www.iula.upf.es/corpus/corpus.htm, is the centre for research and postgraduate
studies of the Universitat Pompeu Fabra. They are currently compiling a multilingual
corpus (Catalan, Castilian, English, French and German) of texts belonging to the areas
of economy, law, environment, medicine and information technology. The texts are
selected by experts in each area, then classified according to their subject
field and annotated according to the SGML standard and the “Corpus Encoding Standard”
(CES) developed by EAGLES. The corpus can be searched at
http://brangaene.upf.es/cgi-bin/bwananet/seldocsCTplus.pl.
VISL, which stands for "Visual Interactive Syntax Learning", is a research and
development project at the Institute of Language and Communication (ISK), University
of Southern Denmark (SDU) - Odense Campus. Since September 1996, staff and
students at ISK have been designing and implementing Internet-based grammar tools for
education and research using corpora. In Spanish, the ECI-ES2 and the Europarl-es
annotated corpora can be searched at http://corp.hum.sdu.dk/cqp.es.html. The Europarl-es
corpus contains 29 million words of parliamentary debates and can be searched
without password. The ECI-ES2 corpus contains 14 million words from the newspaper
“El Diario” and requires a password.
Corpora Compiled by University Translation or Linguistics Research Teams
As data-intensive methods became more affordable in the 1990s, thanks to advances
in computing as well as data collection efforts, empiricist methods have become the method
of choice for many translation researchers at Spanish universities. The growing interest
in the development of multilingual and bilingual corpora at the Faculties
of Translation and Linguistics can be observed in the shift in orientation of research projects
and doctoral theses. Many Corpus-based Translation Studies (CTS) projects are under way, many
doctoral theses have incorporated corpora to perform various kinds of experiments and many
teachers are coordinating the wealth of texts of all types that they can gather with the help of
their students. However, for reasons of space, in this chapter we will only briefly cite the main
projects receiving institutional funding and will describe in detail two Genre Comparable
Multilingual Corpora, the CDJ-GITRAD Corpus and the GENTT Project. The main
advantage of the CDJ-GITRAD and the GENTT corpora is that they will allow
documents to be downloaded in full text, thus preserving genre conventions and making them evident.
A very useful resource is the 100-million-word monolingual corpus of Spanish texts
Corpus del español (http://www.corpusdelespanol.org), compiled by Professor Mark
Davies of the Department of Linguistics and English Language at Brigham Young
University in Provo, Utah, whose areas of activity and research include corpus and
computational linguistics and design and optimization of linguistic databases. The corpus
contains 100 million words of text: 20 million from the 1200s-1400s, 40 million from the
1500s-1700s, 40 million from the 1800s-1900s. The 20,000,000 words from the 1900s
are divided equally among literary, spoken texts, and newspapers/encyclopedias. As
Davies points out on his website, in addition to being very fast, the search engine allows
a wider range of searches than almost any other large corpus in existence. The database
can be queried very quickly -- usually just a few seconds for even the most complicated
queries. It permits simple exact word searches as well as much more interesting searches
that are the real strength of this corpus: word patterns, words in context, collocations,
synonyms, customized lists, etc. We strongly recommend that readers take a look at this
powerful resource.
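To give an idea of what a collocation search involves, the following Python sketch counts the words that occur within a small window around a chosen node word. It is only a simplified illustration, not the method used by the Corpus del Español search engine; the function name collocates, the window parameter and the sample sentence are assumptions.

```python
import re
from collections import Counter


def collocates(text, node, window=2):
    """Count the words found within `window` tokens of each occurrence of `node`.

    Tokenisation is a naive split on word characters and the comparison is
    case-insensitive. (Hypothetical helper, for illustration.)
    """
    tokens = re.findall(r"\w+", text.lower())
    node = node.lower()
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            start = max(0, i - window)
            # Words to the left and to the right of the node, node excluded.
            counts.update(tokens[start:i] + tokens[i + 1:i + 1 + window])
    return counts


# Invented sample sentence, used only to demonstrate the function.
sample = (
    "El contrato de arrendamiento y el contrato de compraventa "
    "son tipos de contrato"
)

for word, freq in collocates(sample, "contrato", window=1).most_common():
    print(word, freq)
```

Raw co-occurrence counts like these are dominated by function words; real corpus tools usually rank collocates with an association measure and allow filtering by part of speech.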
The interdisciplinary Research Group Oncoterm (terminologists, translators and
physicians from the University of Granada) has developed an information system for the
medical subdomain of oncology. This information system is intended for health
practitioners, relatives and friends of patients with cancer, translators and journalists. A
vast corpus of documents on cancer in English and Spanish has been compiled together
with a terminological database. The corpus cannot be searched on line for copyright
reasons but queries can be sent to http://www.ugr.es/~oncoterm/intro.html. However, the
terminological database is available at http://www.ugr.es/~oncoterm/alpha-index.html.
A multilingual corpus of tourism contracts (German, Spanish, English, Italian) for
automatic text generation and legal translation is the objective of a research group based
at the University of Malaga, with its main researcher Gloria Corpas. According to the
information given on their website, a multilingual corpus (both parallel and comparable) will
be compiled from tourism and law websites on the Internet. A protocol will be laid out
for searching the WWW, and retrieving, encoding and storing (hyper)texts.
http://www.turicor.org
The CDJ-GITRAD Corpus and the GENTT Project
In 1999, the GITRAD research group of the Universitat Jaume I (Castellón) began to
compile an English-Spanish corpus of legal documents starting from a modest collection
of Spanish-English legal texts compiled by the author for her PhD dissertation (1998),
then extended in a subsequent research project in which Esther Monzó, Steve Jennings
and the author took part.
The idea of compiling this corpus arose from the observation that the comparison of
legal texts in the source and target languages is something that specialised translators are
always doing, and legal translators need to be conversant with the textual typology used
in their field of specialisation to ensure that they are observing the necessary textual,
social, and in our case legal, conventions (Borja 2000). Moreover, the information it
offers for each legal genre helps the translator respect the conventions of genre, so
important in this field of knowledge. For example, when translating a will, the translator
should be familiar with the formal aspects of this genre in the target language. In fact,
the conservatism of law is reflected in the repetitive and fossilised character of its textual
structures, its phraseology and its specialised lexicon, in such a way that the legal text
constitutes a paradigm of stereotyped text susceptible to generic description.
It would therefore seem advisable to have systems of classifying documents for each
area of specialisation and, particularly in the case of legal translation, it would be useful
to have a taxonomy of texts in the source and the target language that would enable
translators to compare terminology, usage and the practical application of the law. The
translator should always try to fit the text he or she is going to translate into a
conventional textual category that speakers of a particular language will recognise.
Legal texts constitute instruments that have a certain form and function in each
culture and sometimes there is no equivalence between languages due to the lack of
uniformity between different legal systems. Is it a will or a fragment of a text book on
law? Is it a section of an Act, a writ of summons or a judgment? It is obvious that the
translator’s solutions will not be the same in all cases. In order to facilitate research into
legal texts and the characterisation of their discursive features, the structural element
proposed for the corpus is the cultural and anthropological concept of genre as proposed
by authors such as Monzó and Borja (2000). The advantages of a corpus of this kind for
translation have already been demonstrated in previous studies which have explored its
research and training applications as well as its descriptive and ontological potential
(Borja 2000, 2003, 2005; Monzó 2001, 2003; Borja and Monzó 2001).
One of the most direct applications of the corpus developed is its potential for
teaching purposes. In fact, from the early days of the project, it has been used in legal
translation classes at Universitat Jaume I, where the researchers of the group are engaged
as teachers. The teaching experience has shown us that our description of the legal
genres is a didactic tool of undoubted value.
The corpus is complemented by a system developed by Steve Jennings (2003) for
managing relational databases, which enables multiple searches to be made in an
extremely functional and efficient way. The user interface is fast and easy to use, and
allows the documents to be recovered in text format. The CDJ-GITRAD corpus of legal
documents (Spanish-Catalan-English) can be searched on-line: http://www.cdj.uji.es.
The documents can be downloaded in full text for research purposes after applying for a
password through the same web address.
In 2000, the CDJ-GITRAD corpus merged with the GENTT project to create an
encyclopaedia of original texts, comparable texts (and at a later stage, also translated
aligned texts) in the legal, scientific and technical field, classified by genre.
The intercultural approach to translation adopted by the GENTT research group
assumes that the specialised translator needs information of three kinds: thematic, textual
and linguistic (García and Monzó, 2004). When in possession of this information, the
translator can improve his/her knowledge, both linguistic and extralinguistic, using a
self-taught process. The original organisation of this corpus for the legal division (Legal
system ⇒ Branch of Law ⇒ Subbranch ⇒ Genre ⇒ Subgenre) generates a classification
that is extremely useful for the translator, who can easily place the text on which they
are working in the taxonomy and compare it with the equivalent genre in the legal
system of the target language. This methodology is applied to the other two divisions of
the corpus: the medical and the technical divisions. Moreover, this classification is
complemented by a system of crossed searches that combines amongst other data the
original language, the status of the text (original or translation), date of creation and the
source. More information can be found at http://www.gentt.uji.es.
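The classification and the crossed searches described above can be pictured as a record with one field per level of the taxonomy plus the descriptive data, filtered on any combination of fields. The following Python sketch is only an assumed model, not the actual GENTT or CDJ-GITRAD database design; the class LegalText, the function cross_search and the two sample records are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LegalText:
    """A document classified by the five taxonomy levels plus search metadata."""
    title: str
    legal_system: str
    branch: str
    subbranch: str
    genre: str
    subgenre: str
    language: str
    status: str  # "original" or "translation"
    year: int
    source: str

    def path(self):
        """Return the position of the text in the taxonomy as a single string."""
        levels = [self.legal_system, self.branch, self.subbranch,
                  self.genre, self.subgenre]
        return " => ".join(levels)


def cross_search(texts, **criteria):
    """Return the texts whose fields equal every given criterion."""
    return [t for t in texts
            if all(getattr(t, field) == value
                   for field, value in criteria.items())]


# Hypothetical records, invented to demonstrate the model.
texts = [
    LegalText("Sample will", "England and Wales", "Private law", "Succession",
              "Will", "Simple will", "en", "original", 2001, "hypothetical"),
    LegalText("Testamento de muestra", "Spain", "Derecho civil", "Sucesiones",
              "Testamento", "Testamento abierto", "es", "original", 2002,
              "hypothetical"),
]

for text in cross_search(texts, language="es", status="original"):
    print(text.title, "->", text.path())
```

Keeping the taxonomy levels as separate fields means that the same filter serves both purposes: locating a text in its own legal system and retrieving the comparable genre in the target system.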
Conclusions
The empiricist trend is rapidly gaining ground in Translation Studies as new methods and
research resources become available. Corpora which until now have mainly been used for
analysing aspects of cross-cultural linguistics are unquestionably useful for Translation
Studies. Translation researchers can obtain very useful information from raw (untagged)
monolingual corpora, but multilingual corpora containing text files in several languages
(segmented, aligned, parsed and classified) allowing storage and retrieval of aligned
multilingual texts against various search conditions have proved to be an invaluable tool. On
the other hand, the process of professionalisation and specialisation that translation has
undergone since the mid-20th century has resulted in something akin to an identity crisis
in the translator that has become more acute in recent decades with the rapid
development of information technologies. The new technologies have facilitated the
appearance of expert systems for organising knowledge that render obsolete the more
traditional modus operandi of the scientific and professional community. The high
degree of specialisation required by certain types of translation today (such as legal or
medical translation) makes it necessary to find new systems for obtaining and recovering
knowledge and data which even the best and most encyclopaedic human mind cannot
match. Today a mixed system of knowledge management is needed that enables
translators to integrate their skills with electronic information management and recovery
systems. Corpora resources are the basis of any such system and will very soon replace
the traditional dictionaries and encyclopaedias.
There is a vast variety of corpus designs and as a result when discussing the design of
corpora for translators we need to define the profile and objectives of the corpus users
with great precision. Nevertheless, since there is no single translator profile or a single
profile of Translation Studies approach it is impossible to talk about a single ideal corpus
design for translators, but rather of specific designs for specific translation or research
purposes. Translators can exploit and apply corpora in many ways (performing
terminological or statistical searches, collocations, looking for functional equivalences,
observing the texture of genres, investigating phraseology, etc.) and it is possible for all
these tasks to be performed on the Internet with no need to download programs or
databases. The interests are very varied, showing that a number of different working
possibilities are opening up for translation researchers and practitioners.
The corpora resources available in Spanish identified in this contribution are only a
small sample of what we can expect to find in the near future. Unfortunately for
translators, most of these resources are only available through Internet browsing
tools which permit terminological and collocation searches but do not, generally, allow
the reading of full texts due to copyright restrictions. Corpora based on genre provide the
end user with full texts showing genre conventions and structure. A good example of
what the future might bring in this field is WebCorp, a suite of tools available at
http://www.webcorp.org.uk, which allows free access to the World Wide Web as a
corpus, one of the most exciting tools that I have found.
References
Abaitua, J. (2002) Tratamiento de corpora bilingües. In M. A. Martí and J. Llisterri (eds)
Tratamiento del lenguaje natural (pp. 61-90). Barcelona, Spain: Edicions Universitat de
Barcelona (electronic version:
http://paginaspersonales.deusto.es/abaitua/konzeptu/ta/soria00.htm).
Atkins, S.; Clear, J. and Ostler, N. (1992). Corpus design criteria. Literary and Linguistic
Computing 7-1: 1-16.
Baker, M. (1993) Corpus linguistics and translation studies: Implications and
applications. In M. Baker, G. Francis and E. Tognini-Bonelli (eds) Text and
Technology: In Honour of John Sinclair (pp. 233-250). Amsterdam and Philadelphia: John
Benjamins.
Baker, M. (1999) The role of corpora in investigating the linguistic behaviour of professional
translators. International Journal of Corpus Linguistics 4-2: 1-18.
Ball, C.N. (1997) Concordances and Corpora. On-line Tutorial.
http://www.georgetown.edu/faculty/ballc/corpora/tutorial.html
Biber, D., Conrad, S. and Reppen, R. (1998) Corpus Linguistics: Investigating Language
Structure and Use. Cambridge: Cambridge University Press.
Borja Albi, A. (2000) El texto jurídico inglés y su traducción al español. Barcelona: Ariel.
– (2003) La investigación en traducción jurídica. In García Peinado and Ortega Arjonilla (eds)
Panorama actual de la investigación en traducción e interpretación. Granada: Atrio.
– (2005) Organización del conocimiento para la traducción jurídica a través de sistemas
expertos basados en el concepto de género textual. In Isabel García Izquierdo (ed.) El
género textual y la traducción. Reflexiones teóricas y aplicaciones pedagógicas. Bern,
Peter Lang.
– (2007) Los géneros jurídicos. In Enrique Alcaraz (ed.) Las lenguas profesionales y académicas. Barcelona, Ariel.
Borja Albi, A. and Monzó Nebot, E. (2001) Aplicación de los métodos de aprendizaje
cooperativo a la enseñanza de la traducción jurídica: cuaderno de bitácora. In F.
Martínez Sánchez (ed.) (2001): EDUTEC '01. Congreso Internacional de Tecnología,
Educación y Desarrollo Sostenible, Murcia, Edutec, CAM.
Botley, S., McEnery, A. and Wilson, A. (eds) (2000) Multilingual Corpora in Teaching
and Research. Amsterdam: Rodopi.
Bowker, L. (1998) Using Specialized Monolingual Native-Language Corpora as a
Translation Resource: A Pilot Study. Meta XLIII, 4, 1998.
García, I. and Monzó, E. (2004). Traducir con corpus de géneros, Revista de la Facultad de
Lenguas Modernas, (pp. 45-59). Universidad Ricardo Palma, Lima, Peru.
Hanks, P. (1996). Contextual Dependency and Lexical Sets. International Journal of Corpus
Linguistics. Vol. 1 (1): 75-98.
Jennings, S. (2003) The Development of Software Tools for Corpus-Based Genre Research
in Translation Studies: A Practical Application in the Context of the Gentt Project
[research project], Castelló de la Plana, Departament de Traducció i Comunicació,
Universitat Jaume I.
Kennedy, G. (1998) An Introduction to Corpus Linguistics. Amsterdam: Rodopi.
Kenny, D. (2001) Lexis and Creativity in Translation. Manchester: St Jerome.
Laviosa-Braithwaite, S. (1996) The English Comparable Corpus (ECC): A Resource and
Methodology for the Empirical Study of Translation. PhD Thesis, Manchester, UK,
UMIST.
– (1998) The English Comparable Corpus: A resource and a methodology. In Lynne
Bowker, Michael Cronin, Dorothy Kenny and Jennifer Pearson (eds) Unity in
Diversity? Current Trends in Translation Studies. St. Jerome Publishing.
Malmkjaer, K. (2003) On a pseudo-subversive use of corpora in translation training. In F.
Zanettin, S. Bernardini and D. Stewart (eds) Corpora in Translator Education (pp. 119-34).
Manchester: St. Jerome.
McEnery, T. and Wilson, A. (1996) Corpus Linguistics. Edinburgh University Press.
Monzó Nebot, E. (2001) El género textual: un concepto clave en la enculturación del
traductor. In A. Barr et al. (eds) (2001) Últimas corrientes teóricas en los estudios
de traducción y sus aplicaciones. Salamanca: Ediciones Universidad de Salamanca.
– (2003) Corpus-based Teaching: The Use of Original and Translated
Texts in the training of legal translators. Translation Journal 7 (4)
[online journal: http://accurapid.com/journal/26edu.htm].
Monzó Nebot, E. and A. Borja Albi (2000) Organització de corpus. L'estructura d'una base
de dades documental aplicada a la traducció jurídica, Revista de Llengua i Dret, 34, p. 9-
21.
Moon, R. (1998). Fixed Expressions and Idioms in English: A Corpus-based Approach.
Oxford Studies in Lexicography and Lexicology. Oxford: Oxford University Press.
Munday, J. (1998) A Computer-assisted Approach to the Analysis of Translation Shifts.
Meta XLIII, 4, 1998 142-56.
Olohan, M. (2004) Introducing Corpora in Translation Studies. London: Routledge.
Sánchez, A., Cantos, P. and Simón, J. (2001) Gran Diccionario de Uso del Español Actual.
Corpus CUMBRE del español contemporáneo de España e Hispanoamérica. Extracto
de dos millones de palabras. Editorial SGEL.
Sebastián, N., Cuetos, F., Martí, M.A. & Carreiras, M.F. (2000) LEXESP: Léxico
informatizado del español. Edición en CD-ROM. Barcelona: Edicions de la Universitat
de Barcelona (Col.leccions Vàries, 14).