

BORJA ALBI, ANABEL (2007). “Corpora for Translators in Spain. The CDJ-GITRAD Corpus and the GENTT Project” in Gunilla Anderman and Margaret Rogers (eds.) Incorporating Corpora - The Linguist and the Translator, Series Translating Europe, 243-265. Multilingual Matters, Clevedon. ISBN: 978-1-85359-985-9

Corpora for Translators in Spain. The CDJ-GITRAD Corpus and the GENTT Project

Anabel Borja (Universitat Jaume I)

Introduction

The use of electronic language resources (LRs), such as corpora, lexicons, grammar

descriptions and ontologies, is currently the subject of numerous research projects in

Spain, and the results obtained in the future could completely change the way translators

work today. Research projects on the development of LRs conducted in recent years

have shown the potential of these tools and the phenomenon of the translator’s turn in

this field. Not only have developments in technology influenced translation research, but

translation scholars and professionals have in turn helped to determine the direction in

which these corpus technologies have developed. The influence of translation in the

development of corpora is evident since corpora are compiled with the needs of their

potential users in mind, and translators and the creators of tools for automating

translation - such as translation memories (TM), Computer Assisted Translation (CAT)

or Automatic Translation (AT) programmes - form one of the most important groups of

potential users of corpora.

This chapter aims to provide an overview of the corpora available in Spain that could

prove useful for translators and translation researchers. The goal is to help translators


working with Spanish (as a source or target language) to become familiar with readily

available and efficiently exploitable computer tools of this kind. It also discusses the

possibilities of exploiting these resources and the Internet in general as a dynamic, open,

accessible and continually evolving corpus. With this in mind, a brief introduction is

given describing the evolution of the corpus concept and its applications in translating.

Then the key concepts related to corpora are briefly defined and the various types and

designs of corpora are discussed in order to agree a common terminology that will allow

us to refer consistently to the various types of corpora that are available.

In the next section we present some of the corpora containing texts available in

Spanish. This list includes all the corpus resources we have identified as useful tools for

professional translators and translation researchers. The study ends with a description of

the CDJ-GITRAD corpus, a Multilingual Corpus of Legal Documents, and the GENTT

project, which is currently compiling a multilingual encyclopaedia of specialised texts

for translators. Both projects have been developed by translation research teams that

have adopted a corpus design organised around the concept of genre, and are specifically

intended for medical, technical and legal translators working in Spanish.

Evolution of the concept of corpus

In recent times, the use traditionally made of corpora in the field of literature,

linguistics and lexicography has been extended, paving the way for their application in

the teaching of languages, technical writing and translation. From the first corpora,

designed exclusively for the use of experts researching the syntactical or morphological

aspects of language (for instance, see Biber, Conrad and Reppen (1998), Kennedy (1998)

and McEnery and Wilson (1996)) we have now reached the point where these tools are

in general use for the day-to-day work of language teachers and students, journalists,


writers of all kinds of texts and, of course, translators, who find in this resource an

excellent tool for observing language in use.

The advent of the Internet is perhaps the most revolutionary aspect of the evolution of

corpora. The Internet has made it possible to gain access to thousands of on-line

documents, find highly specialised texts, consult texts on the same subject in several

languages, locate translated texts, rapidly build corpora and perform linguistic and

statistical analysis on corpora using web browsing interfaces. In fact, the Internet itself

can be considered a corpus, the largest one in existence and the broadest in scope.

However, the Internet has not only made it possible to access corpora and their tools,

but has also proved an enormous spur to their development. In a work on the processing

of bilingual corpora Abaitua (2002) explores how the Internet boom “has greatly

increased the demand for technologies with a capacity for multilingual processing: smart

browsers, indexing and cataloguing systems, extractors of information, knowledge

managers, text generators, summary generators, etc”. All these applications use corpora

as their raw material, and researchers in the field of artificial intelligence, computational

linguistics and voice recognition who conduct their experiments using corpora can no

longer conduct or conceive of research without this most valuable tool. The technology

available to these researchers has enabled them to create an infinite variety of types and

designs of corpora to test their computer tools. If they need a corpus that does not exist,

they will compile it themselves, since they have very powerful scanners and OCR

software as well as programs with highly sophisticated text recognition and analysis

features.

To end this brief look at the evolution of corpora and their application to translation, I

would say we are very close to the philosophy of the DIY corpus. Commercial software

programs are available for translators that allow the user (for example an English-

Spanish technical translator) to extract texts on a particular subject (for example,


dermatology) from the web (medicine portals, on-line dermatology journals in Spanish

and English), organise and store them on the basis of pre-established criteria and even

automatically align translated texts and extract the specialized terminology. Who could

ask for more? As a result of these developments, the concept of corpus has undergone a

revolution and corpora are no longer the sacred (and expensive) tools they were some

years ago. Today they are instantly available for translators and translation researchers to

study and analyse: we can adapt them to our needs and make them our own. With this

philosophy in mind, let us look again at the possibilities of designing corpora and what

they have to offer translators of various kinds.

Towards an ideal Corpus Design for Translators

Although the concept of corpus is very simple – it is no more than an extensive

collection of texts in electronic format – there are many variables relating to the design

and applicability of corpora that require a more painstaking analysis. There is an infinite

number of ways the texts in a corpus can be selected and compiled, and those chosen

will depend on the type of corpus to be created and the resources available (in terms of

money, time, knowledge, etc.). The needs of the professional translators or researchers

for whom the corpus is intended play an important role in the choice of these variables.

When planning the design of corpora for translators the profile and objectives of those

who will use the corpus must be accurately defined. Various authors (Abaitua 2002;

Atkins et al. 1992; McEnery and Wilson 1996) emphasise the influence that these

criteria will have on the resulting corpus.

A number of binary elements are involved in the design of a corpus (dynamic/static,

monolingual/multilingual, oral/written, general texts/specialised texts,

synchronic/diachronic, full texts/fragments of texts, tagged/untagged, …). The

combination of these elements with each other generates an almost infinite number of


potential designs. Let us think of some examples: a) synchronic monolingual corpus of

18th century Spanish literary texts, full text, no tagging; b) diachronic corpus (1900-

2000) of legally drafted wills, not translations but comparable originals in each language,

bilingual Spanish-English, full text, with morphological tagging; c) multilingual

English/Spanish/French/German corpus of medical texts on oncology from 1950

onwards that are not translations of each other, with annotation of identifying

information (date, author, source…); d) bilingual English-Spanish corpus of translated

aligned texts on mixed mobile telephony, with no tagging.
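The way these binary elements combine can be made concrete with a short sketch. The axis names below (update, languages, medium and so on) are our own labels for the seven pairs listed above, not terminology from the literature, and the sketch deliberately ignores non-binary parameters such as the languages involved, the subject field or the period covered, each of which multiplies the number of possible designs further.

```python
from itertools import product

# The seven binary design choices named in the text above.
# The axis names are illustrative labels, not established terminology.
DESIGN_AXES = {
    "update": ("dynamic", "static"),
    "languages": ("monolingual", "multilingual"),
    "medium": ("oral", "written"),
    "domain": ("general", "specialised"),
    "time": ("synchronic", "diachronic"),
    "extent": ("full texts", "fragments"),
    "annotation": ("tagged", "untagged"),
}


def enumerate_designs(axes):
    """Yield every combination of design choices as a dict."""
    names = list(axes)
    for values in product(*(axes[name] for name in names)):
        yield dict(zip(names, values))


designs = list(enumerate_designs(DESIGN_AXES))
print(len(designs))  # 2 ** 7 = 128 combinations

# Example (a) only fixes four of the seven choices, so several
# designs remain compatible with its description.
stated_in_example_a = {
    "time": "synchronic",
    "languages": "monolingual",
    "extent": "full texts",
    "annotation": "untagged",
}
matches = [
    d for d in designs
    if all(d[key] == value for key, value in stated_in_example_a.items())
]
print(len(matches))  # 2 ** 3 = 8 designs fit the description
```

Even this reduced model yields 128 designs from seven choices alone, which gives some idea of how quickly the space grows once languages, subject fields and periods are added.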

In order to create a corpus we will have to decide between the following dichotomies,

amongst others:

- Printed / Electronic

- On line / Off line

- Oral / Written

- Full texts / Fragments of text

- Static / Dynamic

- Synchronic / Diachronic

- Large / Small

- General / Specialised

- Literary sources / Non-literary sources

- Representative (of a specialised language, variety of language, genre, etc) /

Non-representative (sample corpus)

- Organised around a genre / Not organised around a genre

- Monolingual / Multilingual

- Parallel multilingual / Comparable multilingual

- Aligned parallel multilingual / Non-aligned parallel multilingual


- Plain / Annotated

- For Professional Use / For Research Purposes

- For General Research Purposes / For Specific Research Purposes

- For Professional Use in General Translation / For Professional Use in

Specialized Translation

Not too long ago, corpora were collections of printed texts, but now when we talk

about corpora it is implicitly understood that we are referring to texts in electronic

format, normally with annotation, or tagging, of some kind. While the first (electronic)

corpora were very costly tools to which only specialists and researchers had access

(many were subscription only), today the fundamental characteristic of corpora is their

accessibility. Although some can only be obtained on CD, most can be consulted on line,

either by subscription or free of charge. Anyone can access them, at any time

and from anywhere via the Internet. The spread of on-line corpora has meant that the

proportion of dynamic corpora has increased. These are corpora that are not considered

complete at any given time, but are always open to the incorporation of new elements.

In general terms, we could say that translators are only interested in written corpora,

but the needs of interpreting and the translation of audiovisual media should not be

overlooked, since they may also benefit from the contribution of spoken language

corpora. Opting for a synchronic or a diachronic design will depend on whether we want

to observe a language at a particular point in time or in the process of evolving.

One aspect in which considerable development has been observed is in the size and

degree of specialisation of the corpora being compiled. Today, technological advances

mean that corpora that previously took years of work to complete can now be completed

in a matter of weeks. Initially corpora consisted mainly of literary texts, but now we find

comprehensive corpora on all imaginable subjects (sociology, cardiology, textiles,


pottery, law, etc.), consisting of texts of all types and genres (newspaper articles,

contracts, technical instructions, children’s writing, and so on). The degree of

specialisation may vary from corpora that are entirely general, such as the CREA

(literature, press, science, law, etc.) to corpora devoted entirely to a single field of

specialisation, such as the CDJ-GITRAD for legal texts or TURICOR for tourism

(these three corpora are described later in this chapter).

Another fundamental decision to be made when defining the design of a multilingual

corpus is its size and the ratio between its size and its representativeness. How

representative a corpus is will depend on its intended use, the degree of specialisation of

the texts, the number of languages, etc. Obviously, a million-word corpus of Spanish

weather reports will be far more representative than a corpus of Spanish literary

texts of the same size.

As Abaitua (2002:96) points out: ‘Another criterion for classification concerns the

genre and the type of text. This criterion mainly affects the representative character of

the corpus. The selection criteria will be very different depending on whether the aim is

to design a specialised corpus, such as the Aarhus Corpus – on European contract law –

or create a reference corpus that covers a wide range of styles and registers.’

Corpora organized around the concept of genre constitute a new category that has

very useful applications for translation which we will discuss in greater detail when we

talk about two multilingual corpora we are currently developing, the CDJ-GITRAD

corpus and the GENTT encyclopaedia of genres.

Within this process of change and improvement, the move from monolingual corpora

to multilingual corpora should also be mentioned, and within multilingual corpora, the

appearance of parallel aligned multilingual corpora. With rare exceptions most of the

corpora that have been compiled in the world have, until recently, been monolingual.

Multilingual corpora have appeared in a wide variety of forms, since the “language”


variable itself generates many possible combinations. In the case of multilingual corpora

the only obligatory requirement is the presence of texts in various languages, but the

type of link between them in the different languages can be manifold:

- Monolingual / Bilingual / Trilingual / Multilingual

- Comparable corpora (untranslated texts) / Parallel corpora (translated texts)

- Comparable corpora of texts with a particular feature in common / Comparable

corpora of texts with no features in common

- Parallel non-aligned corpora / Parallel aligned corpora

- Spontaneous parallel corpora / Artificially compiled parallel corpora

- Unidirectional parallel corpora / Multidirectional parallel corpora

- Multilingual texts by native speakers / Multilingual texts by non-native speakers...

Comparable corpora are made up of texts in different languages which may be related

in various ways, but are not translations of each other. They may have nothing in

common at all, or be on the same subject, of the same genre, or from the same

chronological period, etc. Comparable corpora containing texts that are not on the same

subject are used for studying general linguistic phenomena and are mainly used by

linguists, although they may also help translators to define patterns of language and

compare texts. Some examples are the CDJ-GITRAD Corpus of Legal texts in Spanish,

Catalan and English, the Aarhus Corpus of Contract Law texts in Danish, French and

English and the bilingual English-Chinese corpus compiled by Fung in 1995, used to

generate bilingual dictionaries.

Comparable corpora of texts on the same subject matter are one of the most valuable

tools for professional translation. They are much easier to compile than parallel corpora,

provide substantial information in terms of terminology, field, and context and are used


mainly to observe the use of specialised language in very restricted fields such as hedge

funds, statistics, musical composition, etc. Comparable corpora of texts written at

different periods of time are used to observe linguistic changes over the course of time,

and are of great interest to the translator, particularly the literary translator of historical

texts.

Parallel corpora can develop naturally or as the result of a specific translation

commission or project. Multilingual international organisations are the most important

source of parallel corpora that originate spontaneously. This would be the case of the

European Union’s databases of documents, the Hansard record of parliamentary debates in French

and English in Canada, or the Hong Kong legislation database made up of documents in

English and Chinese. These naturally produced parallel corpora are the result of a set of

particular social circumstances and constitute one of the most important resources for the

specialised translator in certain areas such as law, the discipline in which the author of

this chapter is working. Their use (by alignment or in some other way) is one of the areas

of research with the greatest potential. In fact, these organizations are currently trying to

exploit their resources by putting together AT systems based on vast translation

memories and grammar parsers (the EU Euramis Project is a good example).

On the other hand, parallel corpora can also be compiled from existing translations.

One of the main problems of parallel corpora compiled in this way might be the selection

criteria for the translations to be included and the criteria for validating their quality.

These considerations are relevant for corpora intended for professional translators who

will use them as reference tools and need to be sure that they meet the necessary quality

criteria. However, this should not be considered a problem for translation researchers

interested in analysing existing translations as they stand.

Parallel aligned corpora are the most useful tool for translators but pose important

problems that have not as yet been addressed. As Lawler points out in his review of


Botley, McEnery and Wilson (2000), “Text alignment, it quickly becomes clear, is the

outstanding problem in research on multilingual corpora, and thus -- to the extent that

progress has been made in its solution -- its outstanding success story. The problems that

arise in alignment research reprise practically every issue in Natural Language

Processing (NLP) and Automatic Translation, (e.g., sentence division, anaphor tracking,

ambiguity resolution), and the peculiar limitations of the alignment task make the

application of alignment strategies to these broader problems surprisingly productive”.

In unidirectional parallel corpora we always find the originals in the same language

whereas in multidirectional parallel corpora the originals may be in any language of the

corpus. Unidirectional parallel corpora are typically used for studying the literary works

of a writer and their translations into one or more languages.

Apart from the pure text, a corpus can also be provided with additional linguistic

information, called ‘annotation’ or ‘tagging’. If we look at the type of annotation, within

the category of multilingual corpora we find everything from highly structured corpora

with very specific tagging aimed at a specific aspect of research (such as an English-

Spanish corpus of research articles on sustainable tourism with tagging aimed at

comparing the use of certain syntactic structures) to enormous collections of texts in

several languages compiled with no particular selection criterion and no annotation. The

annotation may be of a different nature, such as prosodic, semantic or historical. The

grammatically tagged corpora constitute the most common form of annotated corpora. In

a grammatically tagged corpus, the words have been assigned a word class label (part-of-

speech tag).
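To illustrate what grammatical tagging looks like in practice, the following minimal sketch stores each word with its word class label and retrieves all the words of a given class. The sample sentence, the "word_TAG" storage format and the tag names (DET, NOUN, VERB, ADJ, ADP) are assumptions made for the example; real corpora use their own tagsets and encoding conventions.

```python
from collections import Counter

# A tiny hand-tagged sample in "word_TAG" form. The sentence and the tag
# names are illustrative assumptions, not taken from any actual corpus.
TAGGED_TEXT = (
    "The_DET translator_NOUN checks_VERB the_DET legal_ADJ "
    "term_NOUN in_ADP the_DET corpus_NOUN"
)


def parse_tagged(text):
    """Split 'word_TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # rpartition splits on the last underscore, so words that
        # themselves contain underscores are still handled.
        word, _, tag = token.rpartition("_")
        pairs.append((word, tag))
    return pairs


def words_with_tag(pairs, tag):
    """Return every word carrying the given part-of-speech tag."""
    return [word for word, t in pairs if t == tag]


pairs = parse_tagged(TAGGED_TEXT)
tag_counts = Counter(tag for _, tag in pairs)

print(words_with_tag(pairs, "NOUN"))  # ['translator', 'term', 'corpus']
print(tag_counts["DET"])              # 3
```

Once the labels are in place, questions such as "which nouns follow this adjective?" become simple filters over the pairs, which is what makes tagged corpora more productive to query than plain text.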

The potential uses of corpora depend on how well they have been designed to meet

the needs of research or professional use. However, in order to take full advantage of all

the information a corpus contains, it is essential to have proper annotation tools that are

suited to each particular corpus. Whether a corpus is annotated and the type of


annotation used will depend on the purpose for which it will be used. In the past

annotated corpora were used for specific aspects of linguistic research, particularly

computational linguistics, but today practically all corpora have some kind of annotation

thanks to the new automatic annotation tools. The type of annotation may vary to a great

extent. Ball (1997) proposes the following classification of corpora:

- perfectly plain: e.g. Project Gutenberg texts, produced by scanning; no information

about text (usually, not even edition)

- marked up for formatting attributes: e.g. page breaks, paragraphs, font sizes, italics,

etc.

- annotated with identifying information, e.g. edition date, author, genre, register, etc.

- annotated for part of speech, syntactic structure, discourse information, etc.

In order to annotate a corpus it is no longer necessary to use expensive software

programs that are difficult to use and designed to carry out very specific tasks. Today

anyone can obtain very valuable textual information by applying very simple search

systems. The ever-increasing diversity of annotation possibilities is also observed in

modern corpora, depending on the purpose for which they are designed, and the

annotation of corpora is becoming increasingly automatic and standardised (with

standards such as the TEI Standard, the Text Encoding Initiative).

Applications of corpora in Translation Studies and Professional Practice

This initial distinction between corpora for research and corpora for translating is viewed

as relevant because it results in very different corpus designs.


(1) Corpora for Translation Studies

The use of corpora for research into translation (also referred to as CTS, Corpus

Translation Studies) is a relatively new activity and is still in its early stages. The first

initiative was the project started by Mona Baker in Manchester in 1993 with the TEC

corpus. Although at that time professional translators were already using these tools to

carry out documentation tasks, they had not yet been used in Translation Studies. Today

there are still few examples of studies of this kind, and they are mostly restricted to

academic circles (PhD dissertations and research projects), but as Malmkjaer (2003:119)

points out: ‘The use in translation studies of methodologies inspired by corpus

linguistics has proved to be one of the most important gate-openers to progress in the

discipline since Toury’s re-thinking of the concept of equivalence’.

Because of the nature of their work, researchers will in principle require extensive

reference corpora from which they can extract conclusions applicable to general

phenomena. However, corpus design for Translation Studies will once again depend on

the particular aspect of the translation phenomenon to be analyzed: translation strategies,

translation norms, translation equivalence, features of translated texts (Baker 1993,

1999; Laviosa, 1998), shifts in translations of the same originals in the course of time or

changes in other parameters (Munday, 1998).

Corpora in translation studies are products of human minds, of actual human beings, and,

thus, inevitably reflect the views, presuppositions, and limitations of those human

beings. Moreover, the scholars designing studies utilizing corpora are people operating

in a particular time and place, working within a specific ideological and intellectual

context. Thus, as with any scientific or humanistic area of research, the questions asked

in CTS will inevitably determine the results obtained and the structure of the databases

will determine what conclusions can be drawn.


Of the research studies carried out to date using corpora, three major categories may

be mentioned:

- Those that examine contrastive aspects, whether at a linguistic or cross-

cultural level, using multilingual parallel corpora. Here we would find the

studies that look at the various ways of solving a particular translation

problem. Kenny (2001), for example, monitors the translation of creative

source-text word forms and collocations uncovered in a specially constructed

German-English parallel corpus of literary texts.

- Those that examine the characteristics of translated texts without taking the

originals into account, using monolingual corpora of translated texts in the

target language. This is the case of the studies initiated by Mona Baker with

her TEC corpora and continued by authors such as Sara Laviosa. These lines

of research suggest that translated corpora-based research may fruitfully be

used to discover the patterning specific to translational language and the

extent of the influence of variables such as source language, text genre or

translation mode on translational language. Laviosa’s 1996 study The English

Comparable Corpus (ECC): A Resource and a Methodology for the Empirical

Study of Translation, for example, sets out to develop a viable descriptive and

target-oriented corpus-based methodology for the systematic study of the

nature of translated text.

- Those that examine the characteristics of texts in the target language using

monolingual corpora of texts originally written in the target language.

Through an exhaustive analysis of vast quantities of computerised text,


translators can obtain invaluable information on grammar, semantic

relationships, the acceptability of certain usages, new or obsolete usage of

words, neologisms, and even pragmatic aspects of the target language (see

Hanks 1996; Moon 1998) which enable the translator to make decisions

consistent with those uses.

(2) Corpora for Professional Practice

The translation of specialised texts requires from the translator a range of knowledge

that generally exceeds the scope of the translator’s academic and personal training. The

notion of the translator as a scholar with an encyclopaedic knowledge of the world has

been a constant throughout history. The need for a profound understanding that does not

miss the nuances of the original is a requirement common to all specialist areas of

translation.

The increasing volume of translations and their degree of specialisation have seen

professional translators’ need for electronic LRs grow exponentially. In just a few years,

the Internet has become the main resource for terminological and conceptual

documentation for professional translators, and today it can be said that 95% of the

material they need can be found on the web. This has triggered an escalating demand for

on-line LRs and tools to enable the information to be searched for and exploited on line

too. In Spain a large number of projects have been set up to create LRs for translators,

and there are now many resources of this kind available to professional translators.

Monolingual corpora in the target language have proved to be an outstanding

terminological tool for specialised translation (Bowker 1998) and for the training of

professional translators (Borja 2005; Borja and Monzó 2001). From the data about

professional behaviour obtained in Monzó 2002 and the interviews conducted with 200 legal

translators during the last year, the GITRAD Research Team is currently preparing a


study on professional habits. Although the results have not been published as yet we can

advance some data here. Today more than half the professional translators of legal texts

in Spain use corpora and the Internet (as an on-line corpus) instead of traditional printed

dictionaries and when asked which they would give up if they had to choose between the

two, many say that they would keep the Internet. In fact they point out that corpora and

the Internet make it possible to carry out more intelligent context-dependent searches by

locating collocations and obtain results that a dictionary could never offer.
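The kind of collocation search these translators describe can be approximated in a few lines. The sketch below is a simplified illustration, not the method used by any particular tool: it counts the words found within a fixed window around a search term. The function name, the window size and the three sample sentences are all assumptions made for the example.

```python
import re
from collections import Counter


def collocates(text, node, window=2):
    """Count words occurring within `window` tokens of each hit for `node`.

    The text is lower-cased and treated as one token stream, so the
    window may cross sentence boundaries in this simplified version.
    """
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        start = max(0, i - window)
        neighbours = tokens[start:i] + tokens[i + 1:i + 1 + window]
        counts.update(neighbours)
    return counts


# Invented sample text, used only to exercise the function.
SAMPLE = (
    "The parties shall enter into a contract. "
    "Either party may terminate the contract by written notice. "
    "A breach of contract entitles the other party to damages."
)

contract_collocates = collocates(SAMPLE, "contract")
print(contract_collocates["terminate"])  # 1
print(contract_collocates["breach"])     # 1
```

On a corpus of realistic size, the most frequent neighbours of a term (after filtering out function words) point to the verbs and prepositions it habitually combines with, which is precisely the information a dictionary entry tends to omit.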

At a textual level monolingual corpora in the target language and bilingual parallel

corpora are cited as a powerful textual resource for translators who can check the

appropriateness of their syntax and cohesive devices in original texts on the same subject

matter, period of time, level of formality, etc. From the point of view of conceptual

documentation that any translator of specialised texts needs to carry out, multilingual

comparable corpora are the resource of choice. In some very specialised areas where the

translator needs to acquire a certain level of understanding of the field before being able

to translate a text relating to a particular field, the use of multilingual parallel aligned

corpora enables the professional translator to look for translated words, expressions,

sentences, formulae, culture-marked terms, geographical names, etc. In the field of legal

translation, the availability of parallel corpora is greater than in other fields because of

the existence of international organisations’ multilingual databases, as has been indicated

before. Legal translators also mention the importance of self-made corpora of translated

documents as one of the main sources of information for their work but regret the fact

that they have not incorporated their translations into translation memories and find it

too demanding and time-consuming a task to do so now. In fact, many translators still

work with original texts on paper. Only young translators have started working with

translation memories from the time they start and can apply efficient automatic retrieval

systems and translation management methodologies.


Corpora Developers in Spain

This section covers the various resources in the form of electronic corpora that may be

used by professional translators or translation researchers working with Spanish,

including monolingual corpora, multilingual comparable corpora and multilingual

parallel aligned corpora. They all have much to offer translators.

Although the information is abundant, we are aware that many very interesting

projects will not be mentioned and that much of the information given here will have to

be updated within a short space of time because of the rapid progress being made in this

field. However, many of the works cited are long-term projects that will remain relevant

for a considerable time, since they are sponsored by government institutions or

universities. Another aspect worth mentioning about this list of resources is the insight it

may provide for translators creating their own resources or researching whether on-line

resources exist for their particular field of interest.

For the preparation of this section we have personally consulted leaders of groups

currently researching computational linguistics and corpora: Joseba Abaitua (DELI

Research Group); Alberto Alvarez Lugris (CLUVI Research Group); Gloria Corpas

(TURICOR project); Isabel García Izquierdo (GENTT Research Group); Esther Monzó

(GITRAD Research Group), among others. The information obtained from the web sites

of Manuel Barberá (http://www.bmanuel.org/), Joaquim Llisterri

(http://liceu.uab.es/~joaquim/) and David Lee (http://devoted.to/corpora) has also been

very useful. Finally, it will be seen that the subject resources are biased in favour of legal

and economic topics, which reflects the fact that I am a translator specialising in these

disciplines. I am sure that readers will be able to compensate for this imbalance in their

own particular areas of interest.

The resources identified belong to three categories: monolingual corpora, multilingual

comparable corpora and multilingual parallel corpora. Together with the reference and a


brief description they are defined by using the binary terminology established in

previous sections.

Before proceeding to describe the corpora that currently exist in Spain it is

illuminating to explore who is currently developing them: this means identifying the

most important sources of corpora so that the reader can periodically check the new

advances in the field. As we shall see, these are not translators or academic institutions

devoted to translation.

Corpora Compiled by Institutions Devoted to the Promotion and Diffusion of

Spanish

The most prestigious, extensive and representative monolingual corpora of Spanish

texts are those of the Real Academia de la Lengua Española and the Instituto Cervantes,

institutions devoted to the promotion and diffusion of Spanish (language, literature and

culture) throughout the world. No translator should neglect to visit the Royal Academy’s

web site and browse the inspiring resources it has to offer.

The monolingual Corpus Diacrónico del Español (CORDE),

corpus.rae.es/cordenet.html, compiled by the Instituto de Lexicografía de la Real

Academia de la Lengua Española, is made up of texts from three historical periods (Edad

Media, Siglos de Oro and Época Contemporánea) and aims to be representative of the

Spanish language chronologically. It contains 125 million words, both fiction (verse and

prose) and non-fiction texts (scientific, social, press and advertising, religious, historical

and legal documents).

The monolingual Corpus de Referencia del Español Actual (CREA),

http://corpus.rae.es/creanet.html, also compiled by the Instituto de Lexicografía de la

Real Academia de la Lengua Española, covers twenty-five years, from 1975 to 1999. It is

still being compiled and will contain literary, journalistic, scientific and technical texts

as well as transcripts of oral language and media records.

The Miguel de Cervantes Virtual Library, http://www.cervantesvirtual.com, is another

treasure for the translator working with Spanish. It is a digital edition project of the

bibliographic heritage of Spanish and Latin American culture promoted by the

University of Alicante and the Banco Santander Central Hispano, with the collaboration

of the Marcelino Botín Foundation. It is a vast bibliographic and documentary repository

that can be freely accessed via the Internet. The Linguistic Tools section provides a

collection of tools specifically designed for analysing and exploiting digital texts. It has

an advanced text search engine that allows words to be searched for within texts, and a

concordance tool, which makes it possible to search for words in context.
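
By way of illustration, the idea behind a concordance tool of this kind can be sketched in a few lines of Python. This is only a minimal keyword-in-context (KWIC) sketch under our own assumptions: the function name kwic, the width parameter and the sample sentence are ours, and nothing here reflects how the Virtual Library's own tools are implemented.

```python
import re


def kwic(text, keyword, width=30):
    """Return one keyword-in-context line per whole-word match of `keyword`."""
    # \b restricts the search to whole words; the match is case-insensitive.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        # Take up to `width` characters of context on each side of the match.
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so that the keywords line up in a column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right}")
    return lines


# Sample text chosen by us for illustration (opening of Don Quijote).
sample = (
    "En un lugar de la Mancha, de cuyo nombre no quiero acordarme, "
    "no ha mucho tiempo que vivía un hidalgo de los de lanza en astillero."
)

for line in kwic(sample, "de", width=20):
    print(line)
```

Each printed line shows one occurrence of the search word with a fixed amount of text on either side, which is the format most concordancers present to the translator.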

A recent and extremely interesting resource also offered by the Cervantes Institute is

the Grammatical Archive of the Spanish Language, AGLE. It is not exactly a corpus of

texts, but rather a corpus of grammatical records compiled by the Spanish grammarian

Salvador Fernández Ramírez (1896-1983) which brings together an entire set of

language phenomena that he intended to use as a corpus for producing the grammar he

was planning. The result is a wide-ranging sample of the Spanish language from the

Poema del Cid to selections from the contemporary press, from pieces of Latin American

literature to a comment heard on the bus, selected by our greatest grammarian. It can be

accessed by grammatical categories, author, work, word or expression. While complex to

begin with, it is extremely interesting once its mechanism has been mastered.

http://cvc.cervantes.es/obref/agle/

Corpora Compiled by Publishing Houses

Next in order of importance are the monolingual corpora compiled by major

publishing houses in order to extract information for lexicons and dictionaries of all

kinds. The objective of these projects is to use corpora for extracting terminology and

creating ontologies. Outstanding in this section are the corpora of the publishing houses

SGEL and VOX. The Corpus CUMBRE offers a set of representative linguistic data on

the use of contemporary Spanish, compiled by the publisher SGEL, S.A. under the

supervision of a research team of the University of Murcia; although its purposes are

grammatical and lexicographic, it can be classified as a general purpose corpus, in view

of the diversity of the materials it contains. A free sample of this corpus may be obtained

with the purchase of the dictionary Gran diccionario de uso del español actual, Sánchez,

A.; Cantos, P. and Simón, J. (2001). The publishing house SGEL can be contacted at

http://www.sgel.es/espanyol/ieindex.htm.

The second publisher cited, VOX-Bibliograf, has compiled its own corpus for the

development of dictionaries. The Corpus Textual VOX-Bibliograf includes 10,352,337

words: literary texts (9.5%), journalistic (32.5%), scientific (4.5%), technical (29.5%),

academic (15.5%), advertising (0.5%), spoken language transcripts (3%), media

broadcasts (5%). The publisher currently participates in the EuroWordNet Project. The

EuroWordNet project aims to develop a multilingual database with basic semantic

relationships between words for several European languages (Dutch, Italian and

Spanish). More information may be obtained at http://www.vox.es. We tried

to access the corpus, but without success. It seems it will be available through the

Spanish EuroWordNet database interface but currently the links to the corpus have

restricted access (http://www.illc.uva.nl/EuroWordNet).
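
To give a rough sense of scale, the percentage breakdown quoted above can be converted into approximate word counts per text type. The short Python sketch below does only this arithmetic; the total and the percentages are those given in the text, while the variable and function names are our own, and the resulting counts are rounded estimates rather than figures published by VOX-Bibliograf.

```python
# Total size and percentage breakdown as quoted in the text above.
TOTAL_WORDS = 10_352_337

shares = {
    "literary": 9.5,
    "journalistic": 32.5,
    "scientific": 4.5,
    "technical": 29.5,
    "academic": 15.5,
    "advertising": 0.5,
    "spoken language transcripts": 3.0,
    "media broadcasts": 5.0,
}


def approximate_counts(total, percentages):
    """Convert percentage shares into rounded word counts (estimates only)."""
    return {name: round(total * pct / 100) for name, pct in percentages.items()}


counts = approximate_counts(TOTAL_WORDS, shares)

# Print the text types from largest to smallest estimated size.
for name, words in sorted(counts.items(), key=lambda item: -item[1]):
    print(f"{name:<28} {words:>10,}")
```

The calculation makes it easy to see that journalistic and technical texts together account for well over half of the corpus, while advertising represents only a very small sample.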

The publishing house Diccionarios SM has also compiled a corpus of literary,

journalistic, scientific and technical texts for the development of lexicographic work of

60,000 words. Although we have not been able to search this corpus, the company may

be contacted at http://www.grupo-sm.com/ . However, through its website this publisher

offers an interesting resource for translators: the contents of the books that they publish

for secondary education (http://www.librosvivos.net).

Corpora Compiled by Linguistic Engineering and Computational Linguistics

Groups

Further important sources of corpora are the departments of linguistic engineering and

computational linguistics devoted to developing computer systems capable of

recognising, understanding, interpreting and generating human language in all its forms.

Research in this field includes the development of linguistic resources (morphological

grammars, formal and computational grammars, electronic lexicons with information in

conventional formats such as EAGLES), Computer Assisted Translation and Automatic

Translation programs, the development of person-machine interfaces and tools for

analysing and using corpora. The vast majority, if not all, of the systems of language

processing operate with monolingual or bilingual corpora of texts to which various

linguistic processors are applied (at the phonological, phonetic, textual, morphological,

lexical, syntactical, logical, semantic and pragmatic level).

In these contexts, the object of creating very extensive corpora is to provide adequate

bases for creating Example Based Machine Translation Systems (EBMT) or Machine

Generated Text, both monolingual and multilingual (automatic production of

multilingual technical documentation). Projects of this kind have generated numerous

multilingual corpora and are expected to generate many more in the near future.

Researchers believe that this could mean moving from CAT to AT in fields where

significant bilingual corpora exist. In many cases these are multilingual projects that

combine a specific economic activity (for example tourism) with the development of

linguistic technologies (AT; text generation, identification of types of texts or

information on the Internet). Many of them receive European funding through actions such

as e-Content. The Council has recently approved the e-Contentplus programme. The 4-

year programme (2005-08), proposed by the European Commission, will have a budget

of € 149 million to tackle the fragmentation of the European digital content market and

improve the accessibility and usability of geographical information, cultural content and

educational material. EU-funded projects such as ELSNET and EUROMAP are behind

many regional subprojects. EUROMAP ("Facilitating the path to market for language and

speech technologies in Europe") aims to provide awareness, bridge-building and

market-enabling services for accelerating the rate of technology transfer and market take-up

of the results of European HLT RTD projects. ELSNET ("The European Network of Excellence

in Human Language Technologies") aims to bring together the key players in language and

speech technology, both in industry and in academia, and to encourage interdisciplinary

co-operation through a variety of events and services.

Mention should also be made of GilCub (Grup d’Investigació en Lingüística

Computacional) of the Universitat de Barcelona which concentrates on researching

systems for processing natural language, particularly areas related to AT. EUROTRA,

TRADE and MULTEXT are important projects developed by this group. MULTEXT is

of particular interest for translators because it has “developed a set of generally usable

software tools to manipulate and analyse text corpora, together with lexicons and

multilingual corpora in seven European languages. It has established conventions for the

encoding of corpora and harmonised specifications for computational lexicons, building

on and contributing to the preliminary recommendations of the relevant international and

European standardisation initiatives. All project results are freely and publicly

available”. On their webpage they also state that they have developed “The first freely

available large-scale multilingual text corpus for the seven languages”. At the time of

writing, the corpus was not available through the Internet but the group can be contacted

at http://www.ub.es/gilcub/ingles/projects/european/multext.html.

Another group working in this field is CLiC, Centre de Llenguatge i Computació, from the

same university, the Universitat de Barcelona (http://clic.fil.ub.es/). CLiC has compiled

Lexesp (Léxico informatizado del español), a corpus of 6,000,000 words with morphological

and syntactical annotation, containing journalistic texts, literary texts and semi-specialized

scientific articles. CLiC has also compiled a reference corpus of 1,000,000 words with

manually validated morphological and syntactical annotation, which is the result of the

fusion of two important lexical resources: 500,000 words from the corpus Lexesp and

500,000 words from the electronic corpus of the newspaper La Vanguardia. Besides these

corpora, CLiC has produced bilingual lexicons connected to the lexico-semantic web

EuroWordNet: a bilingual English-Spanish lexicon (more than 120,000 entries); a bilingual

Catalan-Spanish lexicon (more than 32,000 entries); and a bilingual Catalan-English lexicon

(in progress). These resources can be searched at http://clic.fil.ub.es/demos/corpus/corpus.shtml

and are also available on CD-ROM (Sebastián et al. 2000).

The DELI Research Group of the University of Deusto,

http://www.deli.deusto.es/Resources/LEGE-Bi, was created in 1998, promoted by

professors of the Faculty of Language and ESIDE who were interested in digital edition

and linguistic engineering (http://www.deli.deusto.es/AboutUs ). Some of the projects

that are currently being developed by this group and concerned with electronic corpora

are: XTRA-Bi: Extracción automática de unidades bitextuales para memorias de

traducción (2000-2001). Main Researcher: Inés Jacob; LEGEBiDUNA: Textos paralelos

bilingües en euskara y castellano de las administraciones vascas con etiquetado

SGML/TEI-P3 (1995-2000). Universidad de Deusto. Main Researcher: Joseba Abaitua.

They also form part of the CORDE project. The bilingual Spanish-Basque LEGEBiDUNA

corpus - approximately 7 million words in each language, extracted from the bilingual

official bulletins of the Basque Administration: (BOA 1990-92), (BOB 1989-95) and

(BOPV 1995) - is a very useful resource for translators of legal and institutional

terminology. The research group is developing tools for the automatic drafting and

translation of administrative documents based on this bilingual corpus.

The SLI (Computational Linguistics Group of the University of Vigo) has developed

the corpus CLUVI (Linguistic Corpus of the University of Vigo). According to the

information provided on their webpage (http://sli.uvigo.es/CLUVI/info_en.html), “The

CLUVI is an open textual corpus of specialised registers of contemporary oral and

written Galician language developed by the University of Vigo. Since September 2003,

the SLI offers the possibility of searching and browsing the main sections of the CLUVI

Parallel Corpora (8 million words), that is, the TECTRA Corpus of English-Galician

literary texts, the FEGA Corpus of French-Galician literary texts, the LEGA Corpus of

Galician-Spanish legal texts, and the UNESCO Corpus of English-Galician-French-

Spanish scientific-technical divulgation texts. The public searching and browsing tool

designed by the SLI is available at http://sli.uvigo.es/CLUVI”.

The number of aligned works and language pairs available at this website increases

regularly, and with great vitality since the CLUVI is an academic research project in

progress. At the moment, the CLUVI Parallel Corpus webpage permits the search of four

major corpora -TECTRA, FEGA, LEGA and The Unesco Courier (of more than one

million words each)- as well as other minor parallel corpora now in progress. It should

be pointed out that the CLUVI interface also permits browsing of the LEGEBiDUNA

Corpus of Basque-Spanish administrative texts developed by the DELI group at the

University of Deusto.
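
The kind of search that an aligned parallel corpus makes possible can be sketched as follows. This Python fragment is our own illustration, not the CLUVI search tool: the AlignedPair class, the search_bitext function and the three English-Spanish sentence pairs are invented for the example, and we assume a corpus that has already been segmented and aligned sentence by sentence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignedPair:
    """One aligned translation unit: a source segment and its target segment."""
    source: str
    target: str


def search_bitext(pairs, term):
    """Return the aligned pairs whose source segment contains `term`."""
    needle = term.casefold()  # case-insensitive comparison
    return [pair for pair in pairs if needle in pair.source.casefold()]


# Invented sample data; a real parallel corpus would hold many thousands of pairs.
bitext = [
    AlignedPair("The contract shall be governed by Spanish law.",
                "El contrato se regirá por la ley española."),
    AlignedPair("The parties agree to the following terms.",
                "Las partes acuerdan las siguientes condiciones."),
    AlignedPair("This contract enters into force on signature.",
                "El presente contrato entra en vigor en el momento de su firma."),
]

for pair in search_bitext(bitext, "contract"):
    print(pair.source, "=>", pair.target)
```

Searching the source side and displaying the aligned target side is what allows the translator to see how a term has previously been rendered in context.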

The cooperative effort between Lancaster University and the Universidad

Autónoma de Madrid has made possible the development of the CRATER Corpus. The

CRATER project, Corpus Resources and Terminology Extraction, involves three

languages: English, French and Spanish. The corpus developed consists entirely of

technical texts from the International Telecommunications Union (ITU). The CRATER

corpus has now been completed and consists of 5.5 million words. The texts are tagged

with part-of-speech and morphological annotation. This multilingual annotated corpus

can be searched at http://www.comp.lancs.ac.uk/linguistics/crater/corpus.html.

The Institut Universitari de Lingüística Aplicada (IULA),

http://www.iula.upf.es/corpus/corpus.htm, is the centre for research and postgraduate

studies of the Universitat Pompeu Fabra. They are currently compiling a multilingual

corpus (Catalan, Castilian, English, French and German) of texts belonging to the areas

of economy, law, environment, medicine and information technology. The selection of

texts is made by experts in each area; the texts are then classified according to their subject

field and annotated according to the SGML standard and the "Corpus Encoding Standard"

(CES) developed by EAGLES. The corpus can be searched at

http://brangaene.upf.es/cgi-bin/bwananet/seldocsCTplus.pl.

VISL, which stands for "Visual Interactive Syntax Learning", is a research and

development project at the Institute of Language and Communication (ISK), University

of Southern Denmark (SDU) - Odense Campus. Since September 1996, staff and

students at ISK have been designing and implementing Internet-based grammar tools for

education and research using corpora. In Spanish, the ECI-ES2 and the Europarl-es

annotated corpora can be searched at http://corp.hum.sdu.dk/cqp.es.html. The Europarl-es

corpus contains 29 million words of parliamentary debates and can be searched

without a password. The ECI-ES2 corpus contains 14 million words from the newspaper

“El Diario” and requires a password.

Corpora Compiled by University Translation or Linguistics Research Teams

As data-intensive methods became more affordable in the 1990s, thanks to advances

in computing as well as data collection efforts, empiricist methods became the method

of choice for many translation researchers at Spanish universities. The growing interest

in the development of multilingual and bilingual corpora at the Faculties

of Translation and Linguistics can be observed in the shift in orientation of research projects

and doctoral theses. Many CTS (Corpus-based Translation Studies) projects are under way, many

doctoral theses have incorporated corpora to perform various kinds of experiments and many

teachers are coordinating the wealth of texts of all types that they can gather with the help of

their students. However, for reasons of space, in this chapter we will only cite briefly the main

projects receiving institutional funding and will describe in detail two Genre Comparable

Multilingual Corpora, the CDJ-GITRAD Corpus and the GENTT Project. The main

advantage of the CDJ-GITRAD and the GENTT corpora is that they will allow documents

to be downloaded in full text, which preserves genre conventions and makes them evident.

A very useful resource is the 100 million word monolingual corpus of Spanish texts

Corpus del español (http://www.corpusdelespanol.org) compiled by Professor Mark

Davies of the Department of Linguistics and English Language at Brigham Young

University in Provo, Utah, whose areas of activity and research include corpus and

computational linguistics and design and optimization of linguistic databases. The corpus

contains 100 million words of text: 20 million from the 1200s-1400s, 40 million from the

1500s-1700s, 40 million from the 1800s-1900s. The 20,000,000 words from the 1900s

are divided equally among literary, spoken texts, and newspapers/encyclopedias. As

Davies points out in his website, in addition to being very fast, the search engine allows

a wider range of searches than almost any other large corpus in existence. The database

can be queried very quickly -- usually just a few seconds for even the most complicated

queries. It permits simple exact word searches as well as much more interesting searches

that are the real strength of this corpus: word patterns, words in context, collocations,

synonyms, customized lists, etc. We recommend that readers have a look at this

powerful resource.
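
To make the notion of a collocation search more concrete, the following Python sketch counts the words that occur within a small window around a chosen node word. It is a simplified illustration under our own assumptions (the function name collocates, the window size and the three sample sentences are ours) and says nothing about how the Corpus del español search engine itself works.

```python
import re
from collections import Counter


def collocates(text, node, window=2):
    """Count the words found within `window` tokens either side of `node`."""
    tokens = re.findall(r"\w+", text.lower())
    node = node.lower()
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            start = max(0, i - window)
            # Neighbours to the left and right, excluding the node itself.
            counts.update(tokens[start:i] + tokens[i + 1:i + 1 + window])
    return counts


# Invented sample sentences, used only to demonstrate the counting.
sample = (
    "El tribunal dictó sentencia firme. La sentencia firme no admite recurso. "
    "El juez dictó sentencia ayer."
)

for word, freq in collocates(sample, "sentencia").most_common(3):
    print(word, freq)
```

On a large corpus, raw counts of this kind would normally be complemented by a statistical association measure, but even the simple frequency list shows which words habitually accompany a legal term.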

The interdisciplinary Research Group Oncoterm (terminologists, translators and

physicians from the University of Granada) has developed an information system for the

medical subdomain of oncology. This information system is intended for health

practitioners, relatives and friends of patients with cancer, translators and journalists. A

vast corpus of documents on cancer in English and Spanish has been compiled together

with a terminological database. The corpus cannot be searched on line for copyright

reasons but queries can be sent to http://www.ugr.es/~oncoterm/intro.html. However, the

terminological database is available at http://www.ugr.es/~oncoterm/alpha-index.html.

A multilingual corpus of tourism contracts (German, Spanish, English, Italian) for

automatic text generation and legal translation is the objective of a research group based

at the University of Malaga, with its main researcher Gloria Corpas. According to the

information given on their website, a multilingual corpus (both parallel and comparable) will

be compiled from tourism and law websites on the Internet. A protocol will be laid out

for searching the WWW, and retrieving, encoding and storing (hyper)texts.

http://www.turicor.org

The CDJ-GITRAD Corpus and the GENTT Project

In 1999, the GITRAD research group of the Universitat Jaume I (Castellón) began to

compile an English-Spanish corpus of legal documents starting from a modest collection

of Spanish-English legal texts compiled by the author for her PhD dissertation (1998),

then extended in a subsequent research project in which Esther Monzó, Steve Jennings

and the author took part.

The idea of compiling this corpus arose from the observation that the comparison of

legal texts in the source and target languages is something that specialised translators are

always doing, and legal translators need to be conversant with the textual typology used

in their field of specialisation to ensure that they are observing the necessary textual,

social, and in our case legal, conventions (Borja 2000). Moreover, the information it

offers for each legal genre helps the translator respect the conventions of genre, so

important in this field of knowledge. For example, when translating a will, the translator

should be familiar with the formal aspects of this genre in the target language. In fact,

the conservatism of law is reflected in the repetitive and fossilised character of its textual

structures, its phraseology and its specialised lexicon, in such a way that the legal text

constitutes a paradigm of stereotyped text susceptible to generic description.

It would therefore seem advisable to have systems of classifying documents for each

area of specialisation and, particularly in the case of legal translation, it would be useful

to have a taxonomy of texts in the source and the target language that would enable

translators to compare terminology, usage and the practical application of the law. The

translator should always try to fit the text he or she is going to translate into a

conventional textual category that speakers of a particular language will recognise.

Legal texts constitute instruments that have a certain form and function in each

culture and sometimes there is no equivalence between languages due to the lack of

uniformity between different legal systems. Is it a will or a fragment of a text book on

law? Is it a section of an Act, a writ of summons or a judgment? It is obvious that the

translator’s solutions will not be the same in all cases. In order to facilitate research into

legal texts and the characterisation of their discursive features, the structural element

proposed for the corpus is the cultural and anthropological concept of genre as proposed

by authors such as Monzó and Borja (2000). The advantages of a corpus of this kind for

translation have already been demonstrated in previous studies which have explored its

research and training applications as well as its descriptive and ontological potential

(Borja 2000, 2003, 2005; Monzó 2001, 2003; Borja and Monzó 2001).

One of the most direct applications of the corpus developed is its potential for

teaching purposes. In fact, from the early days of the project, it has been used in legal

translation classes at Universitat Jaume I, where the researchers of the group are engaged

as teachers. The teaching experience has shown us that our description of the legal

genres is a didactic tool of undoubted value.

The corpus is complemented by a system developed by Steve Jennings (2003) for

managing relational databases, which enables multiple searches to be made in an

extremely functional and efficient way. The user interface is fast and easy to use, and

allows the documents to be retrieved in text format. The CDJ-GITRAD corpus of legal

documents (Spanish-Catalan-English) can be searched on-line: http://www.cdj.uji.es.

The documents can be downloaded in full text for research purposes after applying for a

password through the same web address.

In 2000, the CDJ-GITRAD corpus merged with the GENTT project to create an

encyclopaedia of original texts, comparable texts (and at a later stage, also translated

aligned texts) in the legal, scientific and technical field, classified by genre.

The intercultural approach to translation adopted by the GENTT research group

assumes that the specialised translator needs information of three kinds: thematic, textual

and linguistic (García and Monzó, 2004). When in possession of this information, the

translator can improve his/her knowledge, both linguistic and extralinguistic, using a

self-taught process. The original organisation of this corpus for the legal division (Legal

system ⇒ Branch of Law ⇒ Subbranch ⇒ Genre ⇒ Subgenre) generates a classification

that is extremely useful for the translator, who can easily place the text on which they

are working in the taxonomy and compare it with the equivalent genre in the legal

system of the target language. This methodology is applied to the other two divisions of

the corpus: the medical and the technical divisions. Moreover, this classification is

complemented by a system of crossed searches that combines amongst other data the

original language, the status of the text (original or translation), date of creation and the

source. More information can be found at http://www.gentt.uji.es.
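
The classification and the crossed searches described above can be pictured with a small data model. The Python sketch below is purely illustrative and is not the GENTT or CDJ-GITRAD software: the CorpusText class, its field names, the crossed_search function and the three sample records are hypothetical, and we assume only what the text states, namely a five-level taxonomy plus metadata for language, status, date of creation and source.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CorpusText:
    """One corpus document with its taxonomy levels and search metadata."""
    title: str
    legal_system: str
    branch: str
    subbranch: str
    genre: str
    subgenre: str
    language: str
    status: str   # "original" or "translation"
    created: date
    source: str

    @property
    def taxonomy_path(self):
        # Legal system => Branch of Law => Subbranch => Genre => Subgenre
        return (self.legal_system, self.branch, self.subbranch,
                self.genre, self.subgenre)


def crossed_search(texts, **criteria):
    """Return the texts whose fields match every given criterion exactly."""
    return [t for t in texts
            if all(getattr(t, field) == value for field, value in criteria.items())]


# Hypothetical records, invented for this illustration only.
texts = [
    CorpusText("Sample will A", "England and Wales", "Private law", "Succession",
               "Will", "Simple will", "en", "original", date(1998, 5, 12), "hypothetical"),
    CorpusText("Sample will B", "Spain", "Private law", "Succession",
               "Will", "Open will", "es", "original", date(2001, 3, 2), "hypothetical"),
    CorpusText("Sample will A (translated)", "England and Wales", "Private law",
               "Succession", "Will", "Simple will", "es", "translation",
               date(2002, 9, 30), "hypothetical"),
]

for text in crossed_search(texts, genre="Will", status="original"):
    print(" => ".join(text.taxonomy_path), "|", text.language)
```

Filtering on the genre while leaving the legal system open is what lets the translator set a source-language text beside the comparable genre in the target legal system.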

Conclusions

The empiricist trend is rapidly gaining ground in Translation Studies as new methods and

research resources become available. Corpora which until now have mainly been used for

analysing aspects of cross-cultural linguistics are unquestionably useful for Translation

Studies. Translation researchers can obtain very useful information from raw (untagged)

monolingual corpora, but multilingual corpora containing text files in several languages

(segmented, aligned, parsed and classified) allowing storage and retrieval of aligned

multilingual texts against various search conditions have proved to be an invaluable tool. At

the same time, the process of professionalisation and specialisation that translation has

undergone since the mid-20th century has resulted in something akin to an identity crisis

in the translator that has become more acute in recent decades with the rapid

development of information technologies. The new technologies have facilitated the

appearance of expert systems for organising knowledge that render obsolete the more

traditional modus operandi of the scientific and professional community. The high

degree of specialisation required by certain types of translation today (such as legal or

medical translation) makes it necessary to find new systems for obtaining and recovering

knowledge and data which even the best and most encyclopaedic human mind cannot

match. Today a mixed system of knowledge management is needed that enables

translators to integrate their skills with electronic information management and recovery

systems. Corpora resources are the basis of any such system and will very soon replace

the traditional dictionaries and encyclopaedias.

There is a vast variety of corpus designs and as a result when discussing the design of

corpora for translators we need to define the profile and objectives of the corpus users

with great precision. Nevertheless, since there is no single translator profile or a single

profile of Translation Studies approach it is impossible to talk about a single ideal corpus

design for translators, but rather of specific designs for specific translation or research

purposes. Translators can exploit and apply corpora in many ways (performing

terminological or statistical searches, finding collocations, looking for functional equivalences,

observing the texture of genres, investigating phraseology, etc.) and it is possible for all

these tasks to be performed on the Internet with no need to download programs or

databases. The interests are very varied, showing that a number of different working

possibilities are opening up for translation researchers and practitioners.

The corpora resources available in Spanish identified in this contribution are only a

small sample of what we can expect to find in the near future. Unfortunately for

translators, most of these resources are only available through Internet browsing

tools which permit terminological and collocation searches but do not, generally, allow

the reading of full texts due to copyright restrictions. Corpora based on genre provide the

end user with full texts showing genre conventions and structure. A good example of

what the future might bring in this field is WebCorp, a suite of tools available at

http://www.webcorp.org.uk, which allows free access to the World Wide Web as a

corpus, one of the most exciting tools that I have found.

References

Abaitua, J. (2002) Tratamiento de corpora bilingües. In M. A. Martí and J. Llisterri (eds)

Tratamiento del lenguaje natural (pp. 61-90). Barcelona, Spain: Edicions Universitat de

Barcelona (electronic version:

http://paginaspersonales.deusto.es/abaitua/konzeptu/ta/soria00.htm).

Atkins, S.; Clear, J. and Ostler, N. (1992). Corpus design criteria. Literary and Linguistic

Computing 7-1: 1-16.

Baker, M. (1993): Corpus Linguistics and Translation Studies: Implications and

Applications, Mona Baker, Gill Francis, and Elena Tognini-Bonelli (Eds), Text and

Technology: in Honour of John Sinclair, Amsterdam and Philadelphia, John Benjamins,

pp. 233-250.

Baker, M. (1999) The role of corpora in investigating the linguistic behaviour of professional

translators. International Journal of Corpus Linguistics 4-2: 1-18.

Ball, C.N. (1997) Concordances and Corpora. On-line Tutorial.

http://www.georgetown.edu/faculty/ballc/corpora/tutorial.html

Biber, D., Conrad, S. and Reppen, R. (1998) Corpus Linguistics Investigating Language

Structure and Use. Cambridge: Cambridge University Press.

Borja Albi, A. (2000) El texto jurídico inglés y su traducción al español. Barcelona: Ariel.

– (2003) La investigación en traducción jurídica. In García Peinado and Ortega Arjonilla (eds)

Panorama actual de la investigación en traducción e interpretación. Granada: Atrio.

– (2005) Organización del conocimiento para la traducción jurídica a través de sistemas

expertos basados en el concepto de género textual. In Isabel García Izquierdo (ed.) El

género textual y la traducción. Reflexiones teóricas y aplicaciones pedagógicas. Bern,

Peter Lang.

– (2007) Los géneros jurídicos. In Enrique Alcaraz (ed.) Las lenguas profesionales y académicas. Barcelona, Ariel.

Borja Albi, A. and Monzó Nebot, E. (2001) Aplicación de los métodos de aprendizaje

cooperativo a la enseñanza de la traducción jurídica: cuaderno de bitácora. In F.

Martínez Sánchez (ed.) (2001): EDUTEC '01. Congreso Internacional de Tecnología,

Educación y Desarrollo Sostenible, Murcia, Edutec, CAM.

Botley, S, McEnery, A. and Andrew Wilson, Eds. (2000) Multilingual Corpora in Teaching

and Research. Amsterdam: Rodopi.

Bowker, L. (1998) Using Specialized Monolingual Native-Language Corpora as a

Translation Resource: A Pilot Study. Meta XLIII, 4, 1998.

García, I. and Monzó, E. (2004). Traducir con corpus de géneros, Revista de la Facultad de

Lenguas Modernas, (pp. 45-59). Universidad Ricardo Palma, Lima, Peru.

Hanks, P. (1996). Contextual Dependency and Lexical Sets. International Journal of Corpus

Linguistics. Vol. 1 (1): 75-98.

Jennings, S. (2003) The Development of Software Tools for Corpus-Based Genre Research

in Translation Studies: A Practical Application in the Context of the Gentt Project

[research dissertation], Castelló de la Plana, Departament de Traducció i Comunicació,

Universitat Jaume I.

Kennedy, G. (1998) An Introduction to Corpus Linguistics. Amsterdam: Rodopi.

Kenny, D. (2001) Lexis and Creativity in Translation. Manchester: St Jerome.

Laviosa-Braithwaite, S. (1996) The English Comparable Corpus (ECC): A Resource and

Methodology for the Empirical Study of Translation. PhD Thesis, Manchester, UK,

UMIST.

— (1998) The English Comparable Corpus: A resource and a methodology, in Lynne

Bowker, Michael Cronin, Dorothy Kenny and Jennifer Pearson (eds) Unity in

Diversity? Current Trends in Translation Studies. St. Jerome Publishing.

Malmkjaer, K. (2003) On a pseudo-subversive use of corpora in translation training, in F.

Zanettin, S. Bernardini and D. Stewart (eds.) Corpora in Translator Education (pp. 119-34).

Manchester: St. Jerome.

McEnery, T. and Wilson, A. (1996) Corpus Linguistics. Edinburgh University Press.

Monzó Nebot, E. (2001) El género textual: un concepto clave en la enculturación del

traductor. In A. Barr et al. (eds) Últimas corrientes teóricas en los estudios

de traducción y sus aplicaciones. Salamanca: Ediciones Universidad de Salamanca.

— (2003) Corpus-based Teaching: The Use of Original and Translated

Texts in the training of legal translators, Translation Journal, 7 (4),

[electronic journal: http://accurapid.com/journal/26edu.htm]

Monzó Nebot, E. and A. Borja Albi (2000) Organització de corpus. L'estructura d'una base

de dades documental aplicada a la traducció jurídica, Revista de Llengua i Dret, 34, p. 9-

21.

Moon, R. (1998). Fixed Expressions and Idioms in English: A Corpus-based Approach.

Oxford Studies in Lexicography and Lexicology. Oxford: Oxford University Press.

Munday, J. (1998) A Computer-assisted Approach to the Analysis of Translation Shifts.

Meta XLIII, 4, 1998 142-56.

Olohan, M. (2004) Introducing Corpora in Translation Studies. London: Routledge.

Sánchez, A., Cantos, P. and Simón, J. (2001) Gran Diccionario de Uso del Español Actual.

Corpus CUMBRE del español contemporáneo de España e Hispanoamérica. Extracto

de dos millones de palabras. Editorial SGEL.

Sebastián, N., Cuetos, F., Martí, M.A. & Carreiras, M.F. (2000) LEXESP: Léxico

informatizado del español. Edición en CD-ROM. Barcelona: Edicions de la Universitat

de Barcelona (Col.leccions Vàries, 14).