Post on 19-Dec-2015

Language resources and their commercial applications

Kara Warburton [email protected]

ISO/TC 37 Terminology and other language and content resources

My aim

Demonstrate the value of language resources for commercial applications

Discuss why standards for language resources are important

Present TC37 as a standards-developing organization

Warning – slight terminology bias!

Managing language resources

A language resource is information expressed in a natural language, or information that supports the interpretation of natural language.

Language resources can enhance business processes, if properly deployed.

This requires interoperability, which in turn requires standards.

Why me?

Implemented terminological resources, lexical resources, and standards for content interoperability in business environments - Terminologist for IBM, LISA contributor, business consultant

Developed standards and best practices for language resources: ISO TC37, LISA

Practical experience as a technical writer and translator – using language resources in increasingly technical environments

The cold reality

The computer age has generated exponential growth of information and knowledge.

Even with the aid of computers, we can’t manage this volume of information. Why? Computers can’t understand “natural” language. They only understand “1” and “0”.

Natural language is largely unstructured; even many structured language resources are “unpredictably” structured.

This environment demands increasing volumes of structured language resources to enable next-generation computing

Business scenarios for managing language resources

Translation memories

Terminologies and lexical resources for enhancing NLP applications

Content management and retrieval

Content repurposing

Content classification

Normalized language

Keyword management

Example: term extraction tool – use of “layered” lexical resources; grammatical rules; ranking algorithms
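
The term extraction example can be sketched roughly as follows. This is a minimal illustration, not the tool referred to in the talk: the two word lists stand in for "layered" lexical resources, the rule that a candidate may not contain a function word stands in for real grammatical rules, and the frequency-times-length score is an assumed ranking. All names are hypothetical.

```python
import re
from collections import Counter

# Layer 1 (assumed): function words that may not appear inside a candidate.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "for", "on"}
# Layer 2 (assumed): general-language words that are not terms on their own.
GENERAL_LEXICON = {"page", "button", "open", "click", "user"}


def extract_terms(text, max_len=3):
    """Return candidate terms, best first, ranked by frequency x word count."""
    counts = Counter()
    # Work sentence by sentence so candidates never span a sentence boundary.
    for sentence in re.split(r"[.!?]+", text.lower()):
        tokens = re.findall(r"[a-z]+", sentence)
        for n in range(1, max_len + 1):
            for i in range(len(tokens) - n + 1):
                gram = tokens[i:i + n]
                # Simplified grammatical rule: no function words in a term.
                if any(tok in STOPWORDS for tok in gram):
                    continue
                # Single general-language words are filtered out.
                if n == 1 and gram[0] in GENERAL_LEXICON:
                    continue
                counts[" ".join(gram)] += 1
    # Assumed ranking: favour frequent, longer candidates; ties sort alphabetically.
    ranked = sorted(counts.items(),
                    key=lambda kv: (-kv[1] * len(kv[0].split()), kv[0]))
    return [term for term, _ in ranked]


sample = ("Open the administration console. "
          "The administration console login page is shown. "
          "Click the button on the login page.")
print(extract_terms(sample)[:2])
```

On this sample the two top-ranked candidates are "administration console" and "login page". A production tool would replace the stopword rule with part-of-speech patterns and the word lists with proper lexicons.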

Managing terminology supports both social and commercial interests

Economic/commercial:

Control terminology to ensure quality and minimize production costs.

Build terminological resources that are repurposable across the content management chain.

Increase competitiveness in local and global markets.

Social/geopolitical:

Strengthen and protect minority languages.

Support cultural diversity.

Increase global presence and visibility.

Is “managed” terminology really important for a business?

In the automotive industry, almost 50% of translation errors are “wrong term” (Woyde)

40% of time required for text production is terminology work (Stellbrink)

Between 30% and 70% of errors in technical documentation are terminology errors (Schutz, and MULTIDOC)

Terminology work is necessary for between 4% and 6% of all words in a text (Champagne)

Return on investment: 10% ($100 investment yields $110 return) (Champagne)

Outsourced translations may be 50% more expensive if source terminology is inconsistent (Kjeldgaard)

Need more proof?

Terminology tools increase productivity by approx. 20% (Champagne)

Without a central reference, each needless search can take 20 to 30 minutes (Champagne)

It costs 10 times more to fix a term at the end of the production cycle than at the beginning (Xerox, JDEdwards)

Inconsistent or inaccurate terminology raises service costs

Terminology mistakes can lead to lawsuits for copyright or trademark infringement, or for damages due to defective products or incorrect user documentation.

IBM scenario…

“Terminology work is necessary for between 4% and 6% of all words in a text (Champagne)”

429 million words are translated per year in IBM. Thus over 21 million words require attention.

In 2009, IBM “processed” over 160,000 terms as part of the “content conveyor belt”, in nearly 3,000 specialized “dictionaries”.

Very small staff

High degree of automation
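
The "over 21 million" figure follows from applying Champagne's 4% to 6% range to the 429 million words translated per year. A quick check, taking the 5% midpoint as the assumed working rate (the slide does not say which rate was used):

```python
WORDS_PER_YEAR = 429_000_000  # annual translation volume quoted on the slide


def words_needing_attention(percent):
    """Words requiring terminology work at a given whole-number percentage."""
    return WORDS_PER_YEAR * percent // 100


low = words_needing_attention(4)    # 17,160,000
mid = words_needing_attention(5)    # 21,450,000
high = words_needing_attention(6)   # 25,740,000
print(f"{low:,} to {high:,} words; midpoint {mid:,}")
```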

What “measures” need to be taken?

Deploy a terminology database that serves multiple purposes

Integrate the database into all content environments to ensure a “push” mechanism

Respect data management principles, such as data granularity, elementarity, etc.

Adopt best practices for terminology, such as term autonomy and concept orientation

Allow for extensibility for features such as morphology as needed for future applications
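
A minimal sketch of what these measures can look like in a data model. The class and field names are hypothetical and do not follow TBX or any particular product schema. One entry per concept illustrates concept orientation, a full set of fields on every term illustrates term autonomy, and an open `extra` field leaves room for later additions such as morphology.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """A term carrying its own complete set of fields (term autonomy)."""
    text: str
    language: str
    term_type: str = "fullForm"   # e.g. fullForm, acronym, variant
    status: str = "preferred"     # e.g. preferred, admitted, deprecated
    extra: dict = field(default_factory=dict)  # extensibility, e.g. morphology


@dataclass
class ConceptEntry:
    """One entry per concept; synonyms and translations all live inside it."""
    concept_id: str
    definition: str
    terms: list = field(default_factory=list)

    def preferred(self, language):
        """Return the preferred term for a language, or None if there is none."""
        for term in self.terms:
            if term.language == language and term.status == "preferred":
                return term.text
        return None


entry = ConceptEntry(
    concept_id="ci",
    definition=("An entity in a configuration that satisfies an end-use "
                "function and can be uniquely identified."),
    terms=[
        Term("configuration item", "en",
             extra={"plural": "configuration items"}),
        Term("CI", "en", term_type="acronym", status="admitted"),
    ],
)
print(entry.preferred("en"))
```

Because each piece of information sits in its own field (granularity and elementarity), the same entry can feed a glossary, an authoring checker, or a search index without being reparsed.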

Basic example – repurpose information

<h1>CI revision conflicts</h1>
<p>When revising your <term keyref="ci">CI</term>, to avoid conflicts...</p>

<glossentry id="ci">
  <glossterm>configuration item</glossterm>
  <glossdef>An entity in a configuration that satisfies an end-use function and can be uniquely identified.</glossdef>
  <glossBody>
    <glossAlt>
      <glossAcronym>CI</glossAcronym>
    </glossAlt>
  </glossBody>
</glossentry>
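
To illustrate how the glossary entry can be repurposed, the sketch below spells out the keyed term the first time it appears in the paragraph. It is an assumption-laden simplification: it matches `keyref` directly against the `id` of the `glossentry`, whereas a real DITA build resolves keys through a map, and the function names are invented for the example.

```python
import xml.etree.ElementTree as ET

GLOSSARY = """<glossentry id="ci">
  <glossterm>configuration item</glossterm>
  <glossdef>An entity in a configuration that satisfies an end-use function and can be uniquely identified.</glossdef>
  <glossBody><glossAlt><glossAcronym>CI</glossAcronym></glossAlt></glossBody>
</glossentry>"""

TOPIC = '<p>When revising your <term keyref="ci">CI</term>, to avoid conflicts...</p>'


def load_entry(xml_text):
    """Parse one glossentry into (id, fields)."""
    entry = ET.fromstring(xml_text)
    return entry.get("id"), {
        "term": entry.findtext("glossterm"),
        "definition": entry.findtext("glossdef"),
        "acronym": entry.findtext("glossBody/glossAlt/glossAcronym"),
    }


def expand_terms(topic_xml, glossary):
    """Return the paragraph as plain text with keyed terms spelled out."""
    para = ET.fromstring(topic_xml)
    parts = [para.text or ""]
    for child in para:
        entry = glossary.get(child.get("keyref")) if child.tag == "term" else None
        if entry and entry["acronym"]:
            parts.append(f'{entry["term"]} ({entry["acronym"]})')
        elif entry:
            parts.append(entry["term"])
        else:
            # Unknown key or other inline element: keep the text as written.
            parts.append("".join(child.itertext()))
        parts.append(child.tail or "")
    return "".join(parts)


key, fields = load_entry(GLOSSARY)
glossary = {key: fields}
expanded = expand_terms(TOPIC, glossary)
print(expanded)
# When revising your configuration item (CI), to avoid conflicts...
```

The same parsed entry could equally produce a glossary line, a hover definition, or a translator's term list, which is the point of keeping the term data separate from the running text.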

Controlled authoring

Controlled translation

Search – Query expansion

Source data…

Search – Query correction

Synonyms/inconsistencies multiply in the target language – this is bad for business

automatic memory reclamation: remise en état automatique du mémoire; récupération automatique de mémoire

automatic storage reclamation: remise en état automatique de l’archivage; remise en état automatique du stockage; récupération automatique de l’archivage; récupération automatique du stockage

garbage collection: récupération de place; vidage de la corbeille; récupération de place en mémoire; récupération de positions inutilisées; récupération de l’espace mémoire

“Cosmetic” differences can become more than cosmetic in the target language

English:

1. administration console

2. admin console

3. administrative console

French:

1. pupitre d’administration

2. console d’administration

3. pupitre admin

4. console admin

5. pupitre administratif

6. console administrative

Explosion of affected compounds…

administrative console application / administration console application

administrative console button / administration console button

administrative console login page / administration console login page

core administrative console / core administration console

....
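
A rough sketch of how one inconsistent base term spreads into its compounds, and how a termbase lookup can collapse the variants again. Which form is preferred is an assumption made for the example; the slides do not say.

```python
import re

PREFERRED = "administrative console"   # assumed preferred form
VARIANTS = ["administration console", "admin console"]

# Match any non-preferred variant as a whole phrase, wherever it occurs.
_variant_re = re.compile(
    r"\b(" + "|".join(re.escape(v) for v in VARIANTS) + r")\b",
    re.IGNORECASE,
)


def normalize(term):
    """Replace any non-preferred variant inside a compound with the preferred form."""
    return _variant_re.sub(PREFERRED, term)


compounds = [
    "administrative console application", "administration console application",
    "administrative console button", "administration console button",
    "administrative console login page", "administration console login page",
    "core administrative console", "core administration console",
]
normalized = sorted({normalize(c) for c in compounds})
print(len(compounds), "forms reduce to", len(normalized))
for term in normalized:
    print(term)
```

Eight forms reduce to four here. In the source language this is a simple lookup; the following slide shows why the same substitution is harder in the target language.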

Fixing the problem isn’t easy…

Change “pupitre” to “console”…..

Le pupitre administratif est ouvert. Vous devez le fermer.

La console administrative est ouverte. Vous devez la fermer.
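
The reason is grammatical gender: "pupitre" is masculine and "console" is feminine, so the article, the adjective, the participle and the pronoun all have to change with the noun. A minimal sketch of what a plain search-and-replace produces, compared with the correct sentence from the slide (the helper function is only for illustration):

```python
source = "Le pupitre administratif est ouvert. Vous devez le fermer."
correct = "La console administrative est ouverte. Vous devez la fermer."

# Naive fix: swap the noun only.
naive = source.replace("pupitre", "console")


def differing_words(a, b):
    """Pair up words position by position and keep the pairs that differ."""
    return [(x, y) for x, y in zip(a.split(), b.split()) if x != y]


print(naive)
# Le console administratif est ouvert. Vous devez le fermer.
for wrong, right in differing_words(naive, correct):
    print(f"{wrong!r} should be {right!r}")
```

The noun swap leaves four agreement errors behind, which is why correcting a term after translation costs far more than choosing it correctly at the start.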

Development of terminology resources is also key for language planning

Prescriptive terminology approach – just like in enterprise environments

The Canadian experience: Termium, the BTQ

Other examples: Danterm, Korterm, Eurotermbank

Termbases feed into widely distributed bulletins and other distribution media to support adoption and language reinforcement

As an educational resource

For social and political policies

As an aid to commerce

Effective management of language resources requires adherence to standards and best practices

Interoperability requires adherence to standards

Interoperability between tools and applications: CAT tools vs controlled authoring, Web interfaces, GMS, ECM, search engines…

Interoperability between users – writers, translators, publicists

For delivering derivative products – glossaries, Web sites, etc.

For different purposes – learning, commercialization, government, social services, language planning, tourism, etc.

For different media – online vs paper, hand-helds, transport interfaces, broadcasting media, marketing collateral, etc.

Standards at various levels

Data transfer

File format

File structure (data model)

Encoding

Markup

Syntax

Semantics

ISO TC37 – Terminology and other language and content resources

Standardization of principles, methods and applications relating to terminology and other language and content resources in the contexts of multilingual communication and cultural diversity.

Web site…

TC37 Current focus areas

Word segmentation

Language annotations to facilitate machine processing

Terminology policies

Translation quality

Simultaneous interpretation

Data categories - www.isocat.org

XML representation and exchange formats

Persistent identifiers in multilingual environments

Key standards and best practices – ISO TC37

ISO 30042 – TBX

ISO 16642 – TMF

ISO 12620 – new ISO TC37 Data Category Registry

ISO Concept Database

ISO 704 – Terminology work: Principles and methods

ISO 12616 – Translation-oriented terminography

ISO 26162 – Design, implementation and maintenance of terminology management systems

Annotation schemes and frameworks (SC4)

Training professionals in language resource management – an opportunity!

Lack of university training programs

Lack of competency in existing fragmented university courses

Increasing demand for qualified professionals

For example:

LISA offered 6 workshops (there was a demand for more) - 73 companies attended.

TermNet summer school – attendance grows each year

Thank you!