Language resources and their commercial applications
Kara Warburton
Post on 19-Dec-2015
ISO/TC 37 Terminology and other language and content resources
My aim
Demonstrate the value of language resources for commercial applications
Discuss why standards for language resources are important
Present TC37 as a standards-developing organization
Warning – slight terminology bias!
Managing language resources
A language resource is:
  Information expressed in a natural language
  Information that supports the interpretation of natural language
Language resources can enhance business processes
  If properly deployed
  Requires interoperability, which in turn requires standards.
Why me?
Implemented terminological resources, lexical resources, and standards for content interoperability in business environments - Terminologist for IBM, LISA contributor, business consultant
Developed standards and best practices for language resources: ISO TC37, LISA
Practical experience as a technical writer and translator – using language resources in increasingly technical environments
The cold reality
The computer age has generated exponential growth of information and knowledge.
Even with the aid of computers, we can’t manage this volume of information. Why? Computers can’t understand “natural” language. They only understand “1” and “0”.
Natural language is largely unstructured; even many structured language resources are “unpredictably” structured.
This environment demands increasing volumes of structured language resources to enable next-generation computing
Business scenarios for managing language resources
Translation memories
Terminologies and lexical resources for enhancing NLP applications
Content management and retrieval
Content repurposing
Content classification
Normalized language
Keyword management
Example: term extraction tool – use of “layered” lexical resources; grammatical rules; ranking algorithms
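The term extraction example can be sketched roughly as follows. This is a minimal illustration, not the tool the slide refers to: the two word lists stand in for "layered" lexical resources, splitting candidates at stopwords is a crude substitute for real grammatical rules, and the scoring formula (frequency times length in words) is an assumption chosen for brevity.

```python
import re
from collections import Counter

# Layer 1 (hypothetical list): function words that can never be part of a term.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "is", "and", "your", "that"}
# Layer 2 (hypothetical list): general-language words that are not terms on their own.
GENERAL_LEXICON = {"open", "close", "use", "page", "button"}


def extract_candidates(text, max_len=3):
    """Count n-grams (up to max_len words) inside runs of non-stopword tokens."""
    counts = Counter()
    # Split on punctuation first so candidates never cross a clause boundary.
    for segment in re.split(r"[.!?;:,]", text.lower()):
        run = []
        # The trailing None flushes the last run of the segment.
        for tok in re.findall(r"[a-z]+", segment) + [None]:
            if tok is not None and tok not in STOPWORDS:
                run.append(tok)
                continue
            for i in range(len(run)):
                for j in range(i + 1, min(i + max_len, len(run)) + 1):
                    counts[" ".join(run[i:j])] += 1
            run = []
    return counts


def rank(counts):
    """Rank candidates by frequency x length; drop lone general-language words."""
    scored = []
    for candidate, freq in counts.items():
        words = candidate.split()
        if len(words) == 1 and words[0] in GENERAL_LEXICON:
            continue
        scored.append((freq * len(words), candidate))
    # Highest score first; ties are broken alphabetically.
    return [c for _, c in sorted(scored, key=lambda p: (-p[0], p[1]))]


text = "Open the administration console. Use the administration console to close the page."
print(rank(extract_candidates(text)))
```

A production tool would replace the stopword split with part-of-speech patterns and would weigh candidates against a reference corpus, but the layering idea is the same: each resource filters or re-scores what the previous one let through.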
Managing terminology supports both social and commercial interests
Economic/commercial:
  Control terminology to ensure quality and minimize production costs.
  Build terminological resources that are repurposable across the content management chain.
  Increase competitiveness in local and global markets.
Social/geopolitical:
  Strengthen and protect minority languages.
  Support cultural diversity.
  Increase global presence and visibility.
Is “managed” terminology really important for a business?
In the automotive industry, almost 50% of translation errors are “wrong term” (Woyde)
40% of time required for text production is terminology work (Stellbrink)
Between 30% and 70% of errors in technical documentation are terminology errors (Schutz, and MULTIDOC)
Terminology work is necessary for between 4% and 6% of all words in a text (Champagne)
Return on investment: 10% ($100 investment yields $110 return) (Champagne)
Outsourced translations may be 50% more expensive if source terminology is inconsistent (Kjeldgaard)
Need more proof?
Terminology tools increase productivity by approx. 20% (Champagne)
Without a central reference, each needless search can take 20 to 30 minutes (Champagne)
It costs 10 times more to fix a term at the end of the production cycle than at the beginning (Xerox, JDEdwards)
Inconsistent or inaccurate terminology raises service costs
Terminology mistakes can lead to lawsuits for copyright or trademark infringement, or for damages due to defective products or incorrect user documentation.
IBM scenario…
“Terminology work is necessary for between 4% and 6% of all words in a text (Champagne)”
429 million words are translated per year in IBM. Thus over 21 million words require attention.
In 2009, IBM “processed” over 160,000 terms as part of the “content conveyor belt”, in nearly 3,000 specialized “dictionaries”.
Very small staff
High degree of automation
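The arithmetic behind the "over 21 million words" figure can be checked with a few lines. This sketch assumes the slide applied the midpoint (5%) of the 4% to 6% range quoted from Champagne; the function name is illustrative only.

```python
WORDS_TRANSLATED_PER_YEAR = 429_000_000  # figure quoted on the slide


def words_needing_terminology_work(total_words, percent):
    """Whole number of words that need terminology work at a given percentage."""
    return total_words * percent // 100


low = words_needing_terminology_work(WORDS_TRANSLATED_PER_YEAR, 4)
mid = words_needing_terminology_work(WORDS_TRANSLATED_PER_YEAR, 5)
high = words_needing_terminology_work(WORDS_TRANSLATED_PER_YEAR, 6)

print(f"{low:,} to {high:,} words (midpoint {mid:,})")
# prints: 17,160,000 to 25,740,000 words (midpoint 21,450,000)
```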
What “measures” need to be taken?
Deploy a terminology database that serves multiple purposes
Integrate the database into all content environments to ensure a “push” mechanism
Respect data management principles, such as data granularity, elementarity, etc.
Adopt best practices for terminology, such as term autonomy and concept orientation
Allow for extensibility for features such as morphology as needed for future applications
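The data-modelling principles above can be sketched as a small data structure. This is an illustrative model under stated assumptions, not a schema from any standard or product: the class and field names are hypothetical, the terms are borrowed from the console example elsewhere in this deck, and the usage statuses assigned to them are invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """Term autonomy: every term carries its own full set of descriptive data."""
    text: str
    language: str
    term_type: str = "fullForm"    # granularity: one field holds one kind of data
    status: str = "preferred"      # e.g. preferred, admitted, deprecated
    part_of_speech: str = "noun"
    extra: dict = field(default_factory=dict)  # extensibility, e.g. morphology


@dataclass
class ConceptEntry:
    """Concept orientation: one entry per concept, holding all its terms."""
    concept_id: str
    definition: str
    terms: list = field(default_factory=list)

    def preferred(self, language):
        """Return the preferred term(s) for one language."""
        return [t.text for t in self.terms
                if t.language == language and t.status == "preferred"]


entry = ConceptEntry(
    concept_id="c001",
    definition="(hypothetical definition for illustration)",
    terms=[
        Term("administration console", "en"),
        Term("admin console", "en", term_type="shortForm", status="deprecated"),
        Term("console d'administration", "fr", extra={"gender": "feminine"}),
        Term("pupitre d'administration", "fr", status="deprecated",
             extra={"gender": "masculine"}),
    ],
)

print(entry.preferred("fr"))
```

Because synonyms and translations hang off one concept rather than living in separate headword entries, an authoring checker, a translation tool, and a glossary generator can each pull the slice they need from the same record.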
Basic example – repurpose information
<h1>CI revision conflicts</h1>
<p>When revising your <term keyref="ci">CI</term>, to avoid conflicts...</p>
<glossentry id="ci">
  <glossterm>configuration item</glossterm>
  <glossdef>An entity in a configuration that satisfies an end-use function and can be uniquely identified.</glossdef>
  <glossBody>
    <glossAlt>
      <glossAcronym>CI</glossAcronym>
    </glossAlt>
  </glossBody>
</glossentry>
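How the marked-up term and the glossary entry connect can be sketched with a short script. This is a simplification under stated assumptions: it treats the `keyref` value as if it were the `id` of the glossary entry, whereas a real DITA build resolves keys through definitions in a map. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

GLOSSARY = """<glossentry id="ci">
  <glossterm>configuration item</glossterm>
  <glossdef>An entity in a configuration that satisfies an end-use function and can be uniquely identified.</glossdef>
  <glossBody><glossAlt><glossAcronym>CI</glossAcronym></glossAlt></glossBody>
</glossentry>"""

TOPIC = '<p>When revising your <term keyref="ci">CI</term>, to avoid conflicts...</p>'


def build_index(glossary_xml):
    """Index one glossary entry by its id (assumed here to equal the key name)."""
    entry = ET.fromstring(glossary_xml)
    return {
        entry.get("id"): {
            "term": entry.findtext("glossterm"),
            "definition": entry.findtext("glossdef"),
            "acronym": entry.findtext(".//glossAcronym"),
        }
    }


def expand_terms(topic_xml, index):
    """Return (surface form, full term, definition) for each <term keyref=...>."""
    root = ET.fromstring(topic_xml)
    found = []
    for element in root.iter("term"):
        entry = index.get(element.get("keyref"))
        if entry is not None:
            found.append((element.text, entry["term"], entry["definition"]))
    return found


for surface, full_term, definition in expand_terms(TOPIC, build_index(GLOSSARY)):
    print(f"{surface} = {full_term}: {definition}")
```

The same lookup could feed a hover tooltip, an expanded first mention, or a generated glossary page, which is the sense in which one entry is repurposed.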
Synonyms/inconsistencies multiply in the target language – this is bad for business
automatic memory reclamation
  remise en état automatique du mémoire
  récupération automatique de mémoire
automatic storage reclamation
  remise en état automatique de l’archivage
  remise en état automatique du stockage
  récupération automatique de l’archivage
  récupération automatique du stockage
garbage collection
  récupération de place
  vidage de la corbeille
  récupération de place en mémoire
  récupération de positions inutilisées
  récupération de l’espace mémoire
“Cosmetic” differences can become more than cosmetic in the target language
1. pupitre d’administration
2. console d’administration
3. pupitre admin
4. console admin
5. pupitre administratif
6. console administrative
1. administration console
2. admin console
3. administrative console
Explosion of affected compounds…
administrative console application / administration console application
administrative console button / administration console button
administrative console login page / administration console login page
core administrative console / core administration console
....
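In the source language, collapsing such variants is mostly a matter of pattern replacement. The sketch below is illustrative only: which variant counts as preferred is an assumption made for the example, and the matching is case-sensitive for brevity.

```python
import re

# Assumption for this example: "administration console" is the preferred form.
PREFERRED = "administration console"
VARIANTS = ["administrative console", "admin console"]

_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(v) for v in VARIANTS) + r")\b"
)


def normalize(text):
    """Return (normalized text, number of variants replaced)."""
    return _PATTERN.subn(PREFERRED, text)


COMPOUNDS = [
    "administrative console application",
    "administration console button",
    "administrative console login page",
    "core administrative console",
]

for compound in COMPOUNDS:
    print(normalize(compound))
```

Each compound built on a variant inherits the inconsistency, so one replacement rule for the base term repairs every compound that contains it.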
Fixing the problem isn’t easy…
Change “pupitre” to “console”…
Le pupitre administratif est ouvert. Vous devez le fermer.
La console administrative est ouverte. Vous devez la fermer.
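A short experiment shows why a plain search-and-replace is not enough here. The helper below is a hypothetical word-by-word comparison written for this illustration; it relies on the two sentences having the same number of words.

```python
source = "Le pupitre administratif est ouvert. Vous devez le fermer."
correct = "La console administrative est ouverte. Vous devez la fermer."

# Naive fix: swap the noun and nothing else.
naive = source.replace("pupitre", "console")
print(naive)
# prints: Le console administratif est ouvert. Vous devez le fermer.


def changed_words(before, after):
    """Pair up words position by position and keep the pairs that differ."""
    return [(b, a) for b, a in zip(before.split(), after.split()) if b != a]


# Everything that has to change to reach the correct sentence.
print(changed_words(source, correct))
# What the naive replacement still leaves wrong.
print(changed_words(naive, correct))
```

Replacing the masculine noun with a feminine one forces the article, the adjective, the participle, and the pronoun in the following sentence to change as well, so five words differ where the naive replacement touched only one.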
Development of terminology resources is also key for language planning
Prescriptive terminology approach – just like in enterprise environments
The Canadian experience: Termium, the BTQ
Other examples: Danterm, Korterm, Eurotermbank
Termbases feed into widely-distributed bulletins and other distribution media to support adoption and language reinforcement
As an educational resource
For social and political policies
As an aid to commerce
Effective management of language resources requires adherence to standards and best practices
Interoperability requires adherence to standards
Interoperability between tools and applications: CAT tools vs controlled authoring, Web interfaces, GMS, ECM, search engines…
Interoperability between users – writers, translators, publicists
For delivering derivative products – glossaries, Web sites, etc.
For different purposes – learning, commercialization, government, social services, language planning, tourism, etc.
For different media – online vs paper, hand-helds, transport interfaces, broadcasting media, marketing collateral, etc.
Standards at various levels
Data transfer
File format
File structure (data model)
Encoding
Markup
Syntax
Semantics
ISO TC37 – Terminology and other language and content resources
Standardization of principles, methods and applications relating to terminology and other language and content resources in the contexts of multilingual communication and cultural diversity.
Web site…
TC37 Current focus areas
Word segmentation
Language annotations to facilitate machine processing
Terminology policies
Translation quality
Simultaneous interpretation
Data categories - www.isocat.org
XML representation and exchange formats
Persistent identifiers in multilingual environments
Key standards and best practices – ISO TC37
ISO 30042 – TBX
ISO 16642 – TMF
ISO 12620 – new ISO TC37 Data Category Registry
ISO Concept Database
ISO 704 – Terminology work: Principles and methods
ISO 12616 – Translation-oriented terminography
ISO 26162 – Design, implementation and maintenance of terminology management systems
Annotation schemes and frameworks (SC4)
Training professionals in language resource management – an opportunity!
Lack of university training programs
Lack of competency in existing fragmented university courses
Increasing demand for qualified professionals
For example:
LISA offered 6 workshops (there was a demand for more) - 73 companies attended.
TermNet summer school – attendance grows each year