
© 2009 Moravia IT a.s. and Angelika Zerfass

A Guide to Open Standards and Open Source

A Conceptual Case Study

Angelika Zerfass
zerfass@zaac.de

David Filip, Ph.D.
davidf@moraviaworldwide.com

Agenda

1 Polling Questions

2 Definitions

------------------------

3 Architecture considerations

4 Strategy

5 Open Standards

6 Talking Legalese

7 Open Tools

8 Usage Cases

1 Polling Questions

• (A) What is your level of experience with localization industry open standards (such as the XML-based TMX, TBX, SRX, and XLIFF standards)?

I know these standards well and see them regularly in the work done at my organization.

I have a basic understanding of localization industry open standards.

I'm new to localization industry open standards and want to learn more.

1 Polling Questions

• (B) How familiar are you with open source applications used in the localization industry (such as OmegaT, Okapi Framework, Sun XLIFF Translation Editor)?

I'm familiar with these tools and use them (or tools like them) regularly.

I have a basic understanding of these applications but don't really use them.

I'm new to the idea of open source tools for the localization industry and want to learn more.

1 Polling Questions

• (C) Which are important to you?

Learning about the differences between open standards and open source

Learning about actual use of open standards and commonly used tools

Learning about licensing and patent issues

Learning about the open Translation Management Systems in use or development today

Definitions: The magic quadrant

                    Closed Source                   Open Source
Open Standards      Q1: Open-Closed (Good)          Q2: Open-Open (Good)
Proprietary ways    Q3: Proprietary-Closed (Bad)    Q4: Proprietary-Open (Wild)

2 Definitions

• TMS, GMS
  ETMS – Enterprise TMS “from cradle to the grave”
  Computer Aided L10N Project Management System (CALPMS)

• Open Standards: XLIFF, TMX, TBX, SRX, etc.

• OSS: Open Source, Free Software vs. Freeware

• Open Source (Copy-left) Licensing vs. Permissive Licensing

3 Architecture

4 Strategy

[Diagram: new business needs and the change enabler, leading to a win-win-win]

New business needs
• Exponential growth of content
• Changing balance between published and user generated content
• Need for Continuous Translation
• Community Translation
• Shared language data
• Massive online collaboration
• Translation automation

Change enabler
• TinyTM, OmegaT, Open ACS, OKAPI framework, etc.

Win – Win – Win
• Translator
• LSP of any size
• Enterprise

What is an open standard?

World Wide Web Consortium's definition:

• Transparency (design/due process is public, and all technical discussions, meeting minutes, are archived and referenceable in decision making)

• Relevance (new standardization is started upon due analysis of the market needs, including requirements phase, e.g. accessibility, multi-linguism)

• Openness (anyone can participate: industry, individual, public, government bodies, academia, on a worldwide scale)

• Impartiality and consensus (neutral org leading it, with equal weight for each participant)

• Availability (free access to the standard text, both during development and at final stage, translations, and clear IPR rules for implementation, allowing open source development in the case of Web technologies)

• Support (multiple implementations, ongoing process for testing, errata, revision, permanent access)

(Wikipedia, 2009)

Goal of open standards

• Interoperability of tools
  • Vendors can concentrate on innovation in fields other than their proprietary formats

• Standardization of processes (translation of just one file format like XLIFF instead of DOC, HTML, InDesign, FM…)

Success of open standards

• Depends on the commercial usability
  • TMX – widespread
  • XLIFF – coming on strong
  • SRX – not widely used
  • TBX – slow
  • others – in the making (TBX Basic, GMX…)

5 Open Standards

• Why Open Standards in Open Source?

• Implementing open standards seems an obvious success scenario for OSS development

• XLIFF and TMX are open standards co-developed by our clients

• A minimalist open standards implementation ensures the desired functionality and is also legally safe

• LISA OSCAR: TMX 1.4b, 1.5, 2.0

• OASIS: XLIFF 1.1, 1.2, 1.2.1, 2.0

Open Standards: OAXAL

[Diagram of the OAXAL architecture. © Andrzej Zydron, OASIS OAXAL TC]

TMX: Translation Memory Exchange

• From the TMX specification:

• “…The purpose of the TMX format is to provide a standard method to describe translation memory data that is being exchanged among tools and/or translation vendors, while introducing little or no loss of critical data during the process…”

What is TMX?

• It is an XML representation of translation memory data

• Header

• Body

<header
  creationtool="Déjà Vu"
  creationtoolversion="4"
  datatype="PlainText"
  segtype="sentence"
  adminlang="en-us"
  srclang="en-us"
  o-tmf="DVMDB"
>

creationtool: Déjà Vu, Transit, Trados, MemoQ
creationtoolversion: version/build number of the tool
datatype: HTML, SGML, RTF, Interleaf, Java…
segtype: basic segmentation
adminlang: default language for elements like <note>
srclang: source text language
o-tmf: original translation memory format (DVMDB – Déjà Vu database…)

What is TMX?

• Body

<body>
  <tu creationdate="20030915T153704Z" creationid="USER">
    <tuv lang="EN-US">
      <seg>This is the first sentence</seg>
    </tuv>
    <tuv lang="DE-DE">
      <seg>Dies ist der erste Satz</seg>
    </tuv>
  </tu>
</body>

tu = translation unit
tuv lang = translation unit variant (language)
seg = segment
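To make the tu / tuv / seg nesting concrete, here is a minimal sketch that reads translation units with Python's standard library. The <tmx> wrapper, the function name read_units, and the lower-casing of language codes are assumptions made for this illustration; the unit itself mirrors the slide.

```python
import xml.etree.ElementTree as ET

# Sample modeled on the slide; the <tmx> wrapper is added so it parses as one document.
TMX_SAMPLE = """<tmx version="1.4">
  <header creationtool="Déjà Vu" creationtoolversion="4"
          datatype="PlainText" segtype="sentence"
          adminlang="en-us" srclang="en-us" o-tmf="DVMDB"/>
  <body>
    <tu creationdate="20030915T153704Z" creationid="USER">
      <tuv lang="EN-US"><seg>This is the first sentence</seg></tuv>
      <tuv lang="DE-DE"><seg>Dies ist der erste Satz</seg></tuv>
    </tu>
  </body>
</tmx>"""

# ElementTree exposes xml:lang under this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_units(tmx_text):
    """Return one {language: segment text} dict per <tu> element."""
    root = ET.fromstring(tmx_text)
    units = []
    for tu in root.iter("tu"):
        variants = {}
        for tuv in tu.findall("tuv"):
            # Accept both the slide's `lang` attribute and the `xml:lang` form.
            lang = tuv.get(XML_LANG) or tuv.get("lang")
            seg = tuv.find("seg")
            variants[lang.lower()] = "".join(seg.itertext())
        units.append(variants)
    return units


units = read_units(TMX_SAMPLE)
print(units)
```

Each translation unit becomes a dictionary keyed by language, which is roughly how a tool would hold the data before matching.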

What is TMX?

• Depending on the tool that created the TMX file, it can be bilingual or multilingual

• Importing a multilingual TMX file into a bilingual project will only import the relevant languages
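The import behaviour described here can be sketched as a simple filter. This is an illustration only, not any specific tool's import routine; the sample data and the name import_pair are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical multilingual TMX content (three languages, one unit lacks German).
MULTILINGUAL_TMX = """<tmx version="1.4"><body>
  <tu>
    <tuv lang="EN-US"><seg>Cancel</seg></tuv>
    <tuv lang="DE-DE"><seg>Abbrechen</seg></tuv>
    <tuv lang="FR-FR"><seg>Annuler</seg></tuv>
  </tu>
  <tu>
    <tuv lang="EN-US"><seg>Save</seg></tuv>
    <tuv lang="FR-FR"><seg>Enregistrer</seg></tuv>
  </tu>
</body></tmx>"""


def import_pair(tmx_text, source_lang, target_lang):
    """Return (source, target) pairs for units that contain both languages."""
    pairs = []
    for tu in ET.fromstring(tmx_text).iter("tu"):
        by_lang = {
            tuv.get("lang", "").lower(): "".join(tuv.find("seg").itertext())
            for tuv in tu.findall("tuv")
        }
        # Units missing either language are skipped, other languages are ignored.
        if source_lang in by_lang and target_lang in by_lang:
            pairs.append((by_lang[source_lang], by_lang[target_lang]))
    return pairs


pairs = import_pair(MULTILINGUAL_TMX, "en-us", "de-de")
print(pairs)
```

The French variants never reach the English-German project, and the second unit is dropped because it has no German variant.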

Levels of TMX

• Level 1
  • Plain text only (sufficient for data coming from software localization tools)

• Level 2
  • Text plus formatting (data coming from translation memory tools used for translation of documentation)

To move formatting and text from one tool to the other, both tools need to be Level 2 compliant.

Level 1

• Formatting that is applied to the source and target text of a translation unit is not exported to the TMX file, only pure text

• Original:
  • This sentence has some formatting

• In TMX:
  • This sentence has some formatting

Level 2

• Formatting that is applied to the source and target text of a translation unit is exported to the TMX file

• Different tools use different ways of encoding that information (placeholders or actual formatting information)

Level 2

MemoQ – Word DOC with formatting

<seg>This is the <bpt i="1" type="bold"></bpt>first<ept i="1"></ept> sentence this is <bpt i="2" type="ulined"></bpt>another<ept i="2"></ept> sentence</seg>

Trados 2009 – Word DOC with formatting

<seg>This is the <bpt i="1" type="Bold" />first<ept i="1" /> sentence this is<bpt i="2" type="Underline" />another<ept i="2" /> sentence</seg>

Trados 2007, 8.2, 8.3 – Word DOC with formatting

<seg>This is the <bpt i="1">&lt;cf bold=&quot;on&quot;&gt;</bpt>first<ept i="1">&lt;/cf&gt;</ept> sentence this is <bpt i="2">&lt;cf underlinestyle=&quot;single&quot;&gt;</bpt>another<ept i="2">&lt;/cf&gt;</ept>sentence</seg>

Level 2

MemoQ – HTML file with link

<seg>Text with a link to <bpt i="1" type="link">&lt;a href = &quot;http://www.samplehtml.com/page1.htm&quot;&gt;</bpt>another page<ept i="1">&lt;/a&gt;</ept></seg>

Trados 2009 – HTML file with link

<seg>Text with a link to <bpt i="1" type="19" x="1" />another page<ept i="1" /></seg>

Trados 2007, 8.2, 8.3 – HTML file with link

<seg>Text with a link to <bpt i="1">&lt;a href=&quot;http://www.samplehtml.com/page1.htm&quot;&gt;</bpt>another page<ept i="1">&lt;/a&gt;</ept></seg>

OmegaT – HTML file with link

OmegaT internal format:
<seg>Text with a link to &lt;a0&gt;another page&lt;/a0&gt;</seg>

TMX Level 2 format:
<seg>Text with a link to <bpt i="0" x="0">&lt;a0&gt;</bpt>another page<ept i="0">&lt;/a0&gt;</ept></seg>

Level 2

MemoQ – InDesign

<seg>InDesign Text with <bpt i="1" type="bold"></bpt>formtatting in bold<ept i="1"></ept></seg>

Trados 2009 – InDesign

<seg>InDesign text with <bpt i="1" type="pt16" x="1" />formatting in bold<ept i="1" /></seg>

Trados 2007, 8.2, 8.3 – InDesign

<seg>InDesign text with <bpt i="1">&lt;cf ptfs=&quot;c_Bold&quot;&gt;</bpt>formatting in bold<ept i="1">&lt;/cf&gt;</ept></seg>

Implications of different tags for formatting

• Tools that use placeholder tags do not include the actual formatting information in the TMX file

• Other tools might only be able to re-use the text, especially if the formatting is only applied in the target segment but not in the source

• The result of the exchange would then be the same as with TMX Level 1 (text only)

• TMX files which carry the actual formatting information will yield better matches in other tools that can read this information
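A short sketch of the "text only" fallback: a receiving tool that cannot interpret another tool's tags can still drop them and keep the plain text. The function name plain_text and the two sample segments (shortened from the Word examples) are assumptions for illustration, not any tool's actual import logic.

```python
import xml.etree.ElementTree as ET

# TMX inline elements that carry native formatting codes.
INLINE_TAGS = {"bpt", "ept", "ph", "it", "ut"}


def plain_text(seg_xml):
    """Return the text of a <seg>, skipping the content of inline code elements."""
    seg = ET.fromstring(seg_xml)
    parts = [seg.text or ""]
    for child in seg:
        if child.tag not in INLINE_TAGS:
            # Keep the text of other children such as <hi>.
            parts.append("".join(child.itertext()))
        # Text following a child element belongs to the segment itself.
        parts.append(child.tail or "")
    return "".join(parts)


# One export carries the native formatting code, the other only a placeholder.
with_codes = ('<seg>This is the <bpt i="1">&lt;cf bold=&quot;on&quot;&gt;</bpt>first'
              '<ept i="1">&lt;/cf&gt;</ept> sentence</seg>')
with_placeholders = ('<seg>This is the <bpt i="1" type="Bold" />first'
                     '<ept i="1" /> sentence</seg>')

print(plain_text(with_codes))
print(plain_text(with_placeholders))
```

Both exports reduce to the same plain text, which is the Level 1 result mentioned above; the formatting information is what gets lost.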

Where do you use TMX?

• Transferring data between different translation memory tools

• Checking tools, QA tools

• TM maintenance tools

• Basis for bilingual term extraction

Reusing TMX data

• Although translation memory tools share the same basic idea (storing source-target language pairs and recycling translations), this has been realized in different ways

• Exchange with TMX works, but there is an issue that can lower the match rates nonetheless… the segmentation rules

SRX – Segmentation Rules Exchange

• From the SRX specification:

• “…The purpose of the SRX format is to provide a standard method to describe segmentation rules that are being exchanged among tools and/or translation vendors…”

• “…is intended to enhance the TMX standard…”

Why SRX?

• Tool A
  • Semicolon is end of segment
  • This is a sentence; this is another sentence.
  • TM system sees two separate segments

• Tool B
  • Semicolon is NOT end of segment
  • This is a sentence; this is another sentence.
  • TM system sees one segment
  • No match from the TMX data
  • Match rate around 50%, usual setting around 70%
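The Tool A / Tool B mismatch can be reproduced with a small sketch. The regex-based segmenter and the use of difflib's SequenceMatcher as a similarity score are stand-ins chosen for this illustration; real TM tools use their own rules and match algorithms, so the percentage will differ from tool to tool.

```python
import re
from difflib import SequenceMatcher


def segment(text, break_chars):
    """Split after any of break_chars when it is followed by whitespace."""
    pattern = r"(?<=[" + re.escape(break_chars) + r"])\s+"
    return [s for s in re.split(pattern, text.strip()) if s]


def best_match(segment_text, memory):
    """Return the highest similarity (0-100) against the stored source segments."""
    return max(
        round(100 * SequenceMatcher(None, segment_text, stored).ratio())
        for stored in memory
    )


text = "This is a sentence; this is another sentence."
tool_a = segment(text, ".!?;")  # semicolon ends a segment
tool_b = segment(text, ".!?")   # semicolon does not end a segment

# Tool B looks up its single long segment in the memory exported by Tool A.
score = best_match(tool_b[0], tool_a)
print(tool_a, tool_b, score)
```

Tool A stores two segments, Tool B asks for one long segment, and the lookup can only return a partial match. Whether that partial match is shown at all depends on the fuzzy threshold configured in the receiving tool.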

Segmentation rules

• Rules that the tool applies to the text to translate, to split it up into segments
  • paragraph
  • sentence
  • phrase
  • incomplete sentences in bulleted lists
  • single words (headings, “Note”, “Attention”)

Segmentation rules

• End of segment rules (common to the default settings of all tools)
  • Dot at the end of a sentence (not after known abbreviations)
  • Question mark, exclamation mark
  • Paragraph mark
  • Colon

• End of segment rules (different for different tools)
  • Semicolon
  • Tab character
  • Sub segments (index entries, footnotes, graphics)

Comparison of default rules

              Workbench   Transit   DV                           SDLX                         Across
Colon         end         end       end                          no end                       no end
Semicolon     no end      end       end                          no end                       no end
Tab           end         no end    no end                       no end                       no end
Soft return   no end      no end    end in Word, no end in PPT   end in Word, no end in PPT   no end

What can SRX do, and what not?

• It can only show the segmentation rule settings at the time of export

• It cannot show any changes that have been applied to the segmentation rules during the use of the TM

• Sometimes a rule from system 1 cannot be re-created in system 2; then the rule will be ignored

TBX – TermBase Exchange

• From the working draft of the TBX specification:
  • TBX is an open, XML-based standard format for terminological data

• TBX is designed to support the analysis, representation, dissemination, and exchange of information from human-oriented terminological databases (termbases)

• TBX is built on the basis of ISO 12620 (data categories) and ISO 12200 (MARTIF core structure)

TMX TBX

[Figure: annotated TBX term entry. Callouts: global information in entry head; language ID (one per language section); administrative data of this language; information on term level; term in English; term in French]
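The entry layout named in the callouts (entry head, one language section per language ID, terms inside) can be sketched as follows. The fragment is a simplified, hypothetical example that omits the TBX header and most data categories, and read_terms is an illustrative name.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Simplified TBX-style fragment (assumption: header and admin data left out).
TBX_SAMPLE = """<martif type="TBX" xml:lang="en">
  <text><body>
    <termEntry id="c1">
      <descrip type="subjectField">localization</descrip>
      <langSet xml:lang="en">
        <tig><term>translation memory</term></tig>
      </langSet>
      <langSet xml:lang="fr">
        <tig><term>mémoire de traduction</term></tig>
      </langSet>
    </termEntry>
  </body></text>
</martif>"""


def read_terms(tbx_text):
    """Map each entry id to {language ID: [terms]}."""
    entries = {}
    for entry in ET.fromstring(tbx_text).iter("termEntry"):
        terms = {}
        for lang_set in entry.findall("langSet"):
            lang = lang_set.get(XML_LANG)
            terms[lang] = [t.text for t in lang_set.iter("term")]
        entries[entry.get("id")] = terms
    return entries


terms = read_terms(TBX_SAMPLE)
print(terms)
```

Unlike a TMX translation unit, the entry is organized around one concept, with any number of terms per language underneath it.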

Where could you use TBX?

• Exchange of terminological data
  • From term base to source language checking tools
  • Between term bases of translation memory tools
  • From term base to terminology checking tools
  • From term base to terminology extraction tools (as stop word lists)
  • From term base to dictionary of a machine translation tool

• For indexing keywords in document management systems, content management systems, knowledge management systems

• Publishing terminological data on the intranet or Internet

• Optimization of search engines and text mining by searching for synonyms automatically

XLIFF – XML Localization Interchange File Format

• The XLIFF format aims to
  • separate localizable text from formatting (deal with one file format in translation instead of different processes to extract, filter, and convert text from different file formats)
  • enable multiple tools to work on the source text
  • store information that is helpful in supporting a localization process (like metadata on versions of source and target segments)

• An XLIFF file is bilingual and can be the container for a number of individual files

• Each file element in an XLIFF file contains a header (project data such as contact information, project phases, pointers to reference material, and information on the skeleton file) and a body section with the actual text segments

XLIFF

• XLIFF can carry several translation matches

• Additional fields can contain context, author, creation tool, history…

<trans-unit id="n1">
  <source>This is a sentence</source>
  <target xml:lang="de">Dies ist ein Satz</target>
  <alt-trans match-quality="100" tool="TM_System">
    <source>This is a sentence</source>
    <target xml:lang="de">Das ist ein Satz</target>
  </alt-trans>
  <alt-trans match-quality="70" tool="TM_System">
    <source>This is a short sentence</source>
    <target xml:lang="de">Dies ist ein kurzer Satz</target>
  </alt-trans>
</trans-unit>

<trans-unit id="1" resname="IDCANCEL" restype="button" coord="885014">
  <source xml:lang="en">Cancel</source>
</trans-unit>

• Information on the name of a button and its coordinates (for visual representation of the button in a localization tool)
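As a sketch of how a tool might use the alternative matches, the snippet below ranks the alt-trans entries of one trans-unit by match-quality. It works on a bare fragment without the XLIFF namespace that a complete file would declare, and ranked_matches is an illustrative name, not part of any XLIFF library.

```python
import xml.etree.ElementTree as ET

# Fragment modeled on the slide; a full XLIFF file would wrap this in
# <xliff>/<file>/<body> and declare the XLIFF namespace.
TRANS_UNIT = """<trans-unit id="n1">
  <source>This is a sentence</source>
  <target xml:lang="de">Dies ist ein Satz</target>
  <alt-trans match-quality="100" tool="TM_System">
    <source>This is a sentence</source>
    <target xml:lang="de">Das ist ein Satz</target>
  </alt-trans>
  <alt-trans match-quality="70" tool="TM_System">
    <source>This is a short sentence</source>
    <target xml:lang="de">Dies ist ein kurzer Satz</target>
  </alt-trans>
</trans-unit>"""


def ranked_matches(trans_unit_xml):
    """Return (quality, source, target) tuples, best match first."""
    unit = ET.fromstring(trans_unit_xml)
    matches = []
    for alt in unit.findall("alt-trans"):
        # Some tools write values such as "100%", so strip a trailing percent sign.
        quality = float(alt.get("match-quality", "0").rstrip("%"))
        matches.append((quality, alt.findtext("source"), alt.findtext("target")))
    return sorted(matches, key=lambda m: m[0], reverse=True)


matches = ranked_matches(TRANS_UNIT)
print(matches[0])
```

The translator's own target stays in the trans-unit, while the ranked alternatives serve as reference material.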

Where is XLIFF useful?

• Where experience with XML exists

• Projects contain many different file formats
  • All formats are converted to XLIFF for translation

• Different tools need to be used during localization

• Different translations (alt-trans) or languages are needed as reference

Any idea why XLIFF should NOT be the cure for everything?

• Instead of developing parsers for different file formats (to read the file into a translation tool), developers now need to create parsers to convert those file formats to XLIFF

• Some file formats can already be dealt with (Office, HTML, XML…) – why should a new parser be created for those?

• XLIFF has its variants, like all XML-based files – how can you make sure that each tool can process any of the possible XLIFF flavours?

The magic quadrant again, just to remember the distinction

                    Closed Source                   Open Source
Open Standards      Q1: Open-Closed (Good)          Q2: Open-Open (Good)
Proprietary ways    Q3: Proprietary-Closed (Bad)    Q4: Proprietary-Open (Wild)

6 Talking Legalese in OSS

OSS (Open Source Software)

• Copyleft and Permissive Licensing
  • GPL (General Public License by FSF)
  • BSD (Berkeley Software Distribution)
  • Apache License 2.0

• MS open source licensing
  • Ms-PL (Public License)
  • Ms-RL (Reciprocal License)

Derived/derivative works vs. dynamic linkage

• Paradigmatic case is an app running on an OS

Contributions

Distribution

6 Talking Legalese in standards

RF (Royalty Free) vs. RAND (Reasonable and Non-Discriminatory)

Standards can be licensed under different terms:

• RF

• RAND – there is no general consensus to date on whether RAND is Open or Proprietary

• Proprietary

By open standard we mean just RF standards.

The opposite of Open Standard is any proprietary way of doing things, even if it is standardized.

7 Open Tools

• Infrastructure baseline
  • PostgreSQL, MySQL, Open ACS, Petri Nets, Alfresco, jBoss

• TM Server
  • TinyTM

• Workflow capabilities through any CALPMS
  • Open: ]po[, .LRN. Closed source: AIT Projetex, LTC Worx, Plunet Business Manager, etc.

• CAT Clients
  • OmegaT, MetaTexis, TinyTM Word macro, Multilizer, and whoever wants to implement the protocol

• Filters
  • OKAPI framework, mainstream XLIFF generators such as Sun or Adobe

8 Use Cases - odt + OmegaT

• Level 2 TMX support
• Has TinyTM connector in alpha
• Has plaintext glossary support

OpenOffice.org (.odt)

8 Use Cases - TinyTM

• TinyTM protocol is licensed under the “LGPL V2.1 or higher”

• Rest of the TinyTM code is licensed under the “GPL V2.0 or higher”

• Translation clients can be licensed under any license

• TinyTM documentation is licensed under the Creative Commons Attribution License

8 Use Cases - TinyTM - Status

• Alpha functionality

• Java TMX importer

• Mark up parsing logic

-------------------------

• Protocol freeze in public discussion on SourceForge

• Protocol features
  • Conservative extension with many important enhancements

8 Use Cases - TinyTM – new protocol features

• Industry and domain taxonomy
  • open to accommodate future TDA taxonomy research

• Inline markup handling
  1. stripped plain version
  2. normalized version with formatting placeholders
  3. TMX version with full TMX markup

• Advanced leverage functions
  • hash based context storage
  • hash and trigram enhanced fuzzy search
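For readers unfamiliar with the last two items, the following sketch shows the general idea of trigram-based fuzzy search and hash-based context keys. It is not TinyTM's implementation; the padding convention, the Jaccard score, the SHA-1 choice, and all function names are assumptions made for illustration.

```python
import hashlib


def trigrams(text):
    """Return the set of character trigrams of a whitespace-normalized string."""
    padded = "  " + " ".join(text.lower().split()) + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a, b):
    """Jaccard overlap of the two trigram sets, between 0.0 and 1.0."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def context_hash(previous_segment, next_segment):
    """Hash the neighbouring segments so the context can be stored compactly."""
    joined = previous_segment + "\x1f" + next_segment
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


memory = ["This is a sentence", "This is a short sentence", "Cancel"]
query = "This is a short sentence"

# Rank stored segments by trigram overlap with the query, best first.
ranked = sorted(memory, key=lambda s: trigram_similarity(query, s), reverse=True)
print(ranked)
print(context_hash("Previous segment", "Next segment"))
```

Trigram sets can be indexed by a database, so candidate segments are found quickly before a more precise score is computed. The context hash lets a server tell apart identical segments that appeared between different neighbours.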

9 Discussion

Thanks for your attention!

10 References

Open Standards (selective)

• XLIFF 1.2: http://docs.oasis-open.org/xliff/v1.2/os/xliff-core.html

• TMX 1.4b: http://www.lisa.org/fileadmin/standards/tmx1.4/tmx.htm

• SRX 2.0: http://www.lisa.org/fileadmin/standards/srx20.html

Open Tools (selective)

• OKAPI Framework: http://okapi.sourceforge.net

• OmegaT: http://sourceforge.net/projects/omegat

• OmegaT+: http://omegatplus.sourceforge.net

• Open ACS: http://openacs.org

• PostgreSQL: http://www.postgresql.org

• TinyTM: http://tinytm.sourceforge.net

10 References

Tools continued

• XML to XLIFF to XML: https://sourceforge.net/projects/xliffroundtrip

• TMX Compliance Kit (test for TMX certification): http://www.lisa.org/Translation-Memory-e.34.0.html

• TBX Checker: http://www.lisa.org/TBX-Resources.65.0.html

Standards continued

• Unicode Standard Annex 29: http://www.unicode.org/reports/tr29

• W3C ITS: http://www.w3.org/TR/its

Page 2: A Guide to Open Standards and Open SourceOAXAL C. TMX Translation Memory Exchange •From the TMX specification: •…The purpose of the TMX format is to provide a standard method

Agenda

1 Polling Questions

2 Definitions

------------------------

3 Architecture considerations

4 Strategy

5 Open Standards

6 Talking Legalese

7 Open Tools

8 Usage Cases

1 Polling Questions

bull (A) What is your level of experience with localization industry open standards (such as the XML-based TMX TBX SRX and XLIFF standards)

I know these standards well and see them regularly in the work done at my organization

I have a basic understanding of localization industry open standards

Im new to localization industry open standards and want to learn more

1 Polling Questions

bull (B) How familiar are you with open source applications used in the localization industry (such as OmegaT Okapi Framework Sun XLIFF Translation Editor)

Im familiar with these tools and use them (or tools like them) regularly

I have a basic understanding of these applications but dont really use them

Im new to the idea of open source tools for the localization industry and want to learn more

1 Polling Questions

bull (C) Which are important to you

Learning about the differences between open standards and open source

Learning about actual use open standards and commonly used tools

Learning about licensing and patent issues

Learning about the open Translation Management Systems in use or development today

Q1

Open-Closed

Good

Q2

Open-Open

Good

Q3

Proprietary-Closed

Bad

Q4

Proprietary-Open

Wild

DefinitionsThe magic quadrant

Open Standards

Open Source

Closed Source

Proprietary ways

2 Definitions

bull TMS GMS

ETMS ndash Enterprise TMSldquofrom cradle to the graverdquo

Computer Aided L10N Project Management System (CALPMS)

bull Open Standards XLIFF TMX TBX SRX etc

bull OSS Open Source Free Software vs Freeware

bull Open Source (Copy-left) Licensing vs Permissive Licensing

3 Architecture

4 Strategy

New business

Needs

Change Enabler

Win

Win

Win

Translator

LSP of any size

Enterprise

TinyTM OmegaT Open ACS OKAPI framework Etc

Exponential growth of content

Changing balance between published and user generated content

Need for Continuous Translation

Community Translation Shared language

data Massive online

collaboration Translation automation

What is an open standard

World Wide Web Consortiums definitionbull Transparency (designdue process is public and all technical

discussions meeting minutes are archived and referencablein decision making)

bull Relevance (new standardization is started upon due analysis of the market needs including requirements phase eg accessibility multi-linguism)

bull Openness (anyone can participate industry individual public government bodies academia on a worldwide scale)

bull Impartiality and consensus (neutral org leading it with equal weight for each participant)

bull Availability (free access to the standard text both during development and at final stage translations and clear IPR rules for implementation allowing open source development in the case of Web technologies)

bull Support (multiple implementations ongoing process for testing errata revision permanent access)

Wikipedia 2009

Goal of open standards

bull Interoperability of toolsbull Vendors can concentrate on innovation in other fields than their proprietary formats

bull Standardization of processes (translation of just one file format like XLIFF instead of DOC HTML InDesign FMhellip)

Success of open standards

bull Depends on the commercial usabilitybull TMX ndash widespread XLIFF ndash coming on

strong SRX ndash not widely used TBX ndash slow others ndash in the making (TBX Basic GMXhellip)

5 Open Standards

bull Why Open Standards in Open Source

bull Implementing open standards seems obvious success scenario for OSS development

bull XLIFF and TMX are open standards co-developed by our clients

bull Minimalist open standards implementation ensures desired functionality and is also legally safe

bull LISA OSCAR TMX 14b 15 20

bull OASIS XLIFF 11 12 121 20

Open Standards OAXAL

copy A

ndrz

ej Z

ydro

n O

ASIS

OAXAL T

C

TMXTranslation Memory Exchange

bull From the TMX specification

bull hellipThe purpose of the TMX format is to provide a standard method to describe translation memory data that is being exchanged among tools andor translation vendors while introducing little or no loss of critical data during the processhellip

What is TMX

bull It is an XML representation of translation memory data

bull Header

bull Body

ltheadercreationtool=ldquoDeacutejagrave Vu creationtoolversion=ldquo4datatype=PlainTextrdquosegtype=sentenceadminlang=en-ussrclang=en-uso-tmf=DVMDB

gt

Deacutejagrave Vu Transit Trados MemoQ

Version build number of the tool

HTML SGML RTF Interleaf Javahellip

Basic segmentation

Default language for elements like ltnotegt

Source text language

Original translation memory format (DVMDB ndash Deacutejagrave Vu databasehellip)

What is TMX

bull Body

ltbodygtlttu creationdate=20030915T153704Z creationid=USERgt

lttuv lang=EN-USgtltseggtThis is the first sentenceltseggt

lttuvgtlttuv lang=DE-DEgt

ltseggtDies ist der erste Satzltseggtlttuvgt

lttugtltbodygt

tu = Translation Unittuv lang = translation unit variant (language) seg = segment

What is TMX

bull Depending on the tool that created the TMX file it can be bilingual or multilingual

bull Importing multilingual TMX file into a bilingual project will only import the relevant languages

Levels of TMX

bull Level 1bull Plain text only (sufficient for data coming from software localization tools)

bull Level 2bull Text plus formatting (data coming from translation memory tools used for translation of documentation)

To move formatting and text from one tool to the other both tools need to be level2 compliant

Level 1

bull Formatting that is applied to the source and target text of a translation unit is not exported to the TMX file only pure text

bull Original

bull This sentence has some formatting

bull In TMX

bull This sentence has some formatting

Level 2

bull Formatting that is applied to the source and target text of a translation unit is exported to the TMX file

bull Different tools use different ways of encoding that information (placeholders or actual formatting information)

Level 2

seggtThis is the ltbpt i=1 type=boldgtltbptgtfirstltept i=1gtlteptgt sentence this is ltbpt i=2 type=ulinedgtltbptgtanotherltept i=2gtlteptgt sentenceltseggt

MemoQ ndash Word DOC with formatting

ltseggtThis is the ltbpt i=1 type=Bold gtfirstltept i=1 gt sentence this isltbpt i=2 type=Underline gtanotherltept i=2 gt sentenceltseggt

Trados 2009 ndash Word DOC with formatting

ltseggtThis is the ltbpt i=1gtampltcf bold=ampquotonampquotampgtltbptgtfirstltept i=1gtampltcfampgtlteptgt sentence this is ltbpt i=2gtampltcf underlinestyle=ampquotsingleampquotampgtltbptgtanotherltept i=2gtampltcfampgtlteptgtsentenceltseggt

Trados 2007 82 83 ndash Word DOC with formatting

Level 2

MemoQ ndash HTML file with link

Trados 2009 ndash HTML file with link

Trados 2007 82 83 ndash HTML file with link

ltseggtText with a link to ltbpt i=1gtamplta href=ampquothttpwwwsamplehtmlcompage1htmampquotampgtltbptgtanother pageltept i=1gtampltaampgtlteptgtltseggt

ltseggtText with a link to ltbpt i=1 type=19 x=1 gtanother pageltept i=1 gtltseggt

ltseggtText with a link to ltbpt i=1 type=linkgtamplta href = ampquothttpwwwsamplehtmlcompage1htmampquotampgtltbptgtanother pageltept i=1gtampltaampgtlteptgtltseggt

OmegaT internal format ltseggtText with a link to amplta0ampgtanother pageamplta0ampgtltseggt

TMX Level 2 format ltseggtText with a link to ltbpt i=0 x=0gtamplta0ampgtltbptgtanother pageltept i=0gtamplta0ampgtlteptgtltseggt

OmegaT - HTML file with link

Level 2

MemoQ ndash InDesign

Trados 2009 ndash InDesign

Trados 2007 82 83 ndash InDesign

ltseggtInDesign text with ltbpt i=1gtampltcf ptfs=ampquotc_Boldampquotampgtltbptgtformatting in boldltept i=1gtampltcfampgtlteptgtltseggt

ltseggtInDesign text with ltbpt i=1 type=pt16 x=1 gtformatting in boldltept i=1 gtltseggt

ltseggtInDesign Text with ltbpt i=1 type=boldgtltbptgtformtatting in boldltept i=1gtlteptgtltseggt

Implications of different tags for formatting

bull Tools that use placeholder tags do not include the actual formatting information in the TMX file

bull Other tools might only be able to re-use the text especially if the formatting is only applied in the target segment but not in the source

bull The result of the exchange would then be the same as with TMX level 1 (text only)

bull TMX files which carry the actual formatting information will yield better matches in other tools that can read this information

Where do you use TMX

bull Transfering data between different translation memory tools

bull Checking tools QA tools

bull TM maintenance tools

bull Basis for bilingual term extraxtion

Reusing TMX data

bull Although Translation Memory Tools have the same basic idea (storing source-target language pairs and recycling translations) this has been realized in different ways

bull Exchange with TMX works but there is an issue that can lower the match rates nonethelesshellip the segmentation rules

SRX ndash Segmentation Rules Exchange

bull From the SRX specification

bull hellipThe purpose of the SRX format is to provide a standard method to describe segmentation rules that are being exchanged among tools andor translation vendors

bull hellipis intended to enhance the TMX standardhellip

Why SRX

bull Tool Abull Semicolon is end of segment

bull This is a sentence this is another sentence

bull TM system sees two separate segments

bull Tool Bbull Semicolon is NOT end of segment

bull This is a sentence this is another sentence

bull TM system sees one segmentbull No match from the TMX data

bull Match rate around 50 usual setting around 70

Segmentation rules

bull Rules that the tool applies to the text to translate to split it up into segments

bull paragraph

bull sentence

bull phrase

bull incomplete sentences in bulleted lists

bull single words (headings ldquoNoterdquo ldquoAttentionrdquo)

Segmentation rules

bull End of segment rules (common to the default settings of all tools)bull Dot at the end of a sentence (not after known

abbreviations)bull Question mark exclamation markbull Paragraph markbull Colon

bull End of segment rules (different for different tools)bull Semicolonbull Tab characterbull Sub segments (index entries footnotes

graphics)

Comparison of default rules

Workbench Transit DV SDLX Across

Colon end end end no end no end

Semi-

colon

no end end end no end no end

Tab end no end no end no end no end

Soft

return

no end no end end in

Word no

end in

PPT

end in

Word no

end in

PPT

no end

What can SRX do and what not

bull It can only show the segmentation rule settings at the time of export

bull It cannot show any changes that have been applied in the segmentation rules during the use of the TM

bull Sometimes the rules from system 1 cannot be re-created in system 2 then the rule will be ignored

TBX ndash TermBase Exchange

bull From the working draft of the TBX specificationbull TBX is an open XML-based standard format for terminological data

bull TBX is designed to support the analysis representation dissemination and exchange of information from human-oriented terminological databases (termbases)

bull TBX is built on the basis of ISO 12620 (data categories) and ISO 12200 (MARTIF core structure)

TMX TBX

Zerfasszaacde 35

Term in English

Term in French

Global information in entry head

Information on term level

Administrative data of this language

Language ID

Language ID

Where could you use TBX

bull Exchange terminological databull From term base to source language checking toolsbull Between term bases of translation memory toolsbull From term base to terminology checking toolsbull From term base to terminology extraction tools (as stop word lists)bull From term base to dictionary of a machine translation tool

bull For indexing keywords in document management systems content management systems knowledge management systems

bull Publishing terminological data on the Intranet Internet

bull Optimization of search enginges text mining by searching for synonyms automatically

XLIFF – XML Localization Interchange File Format

• The XLIFF format aims to
  • separate localizable text from formatting (deal with one file format in translation instead of different processes to extract, filter and convert text from different file formats)
  • enable multiple tools to work on the source text
  • store information that is helpful in supporting a localization process (like metadata on versions of source and target segments)
• An XLIFF file is bilingual and can be the container for a number of individual files
• Each file element in an XLIFF file contains a header (project data such as contact information, project phases, pointers to reference material and information on the skeleton file) and a body section with the actual text segments

XLIFF

• XLIFF can carry several translation matches
• Additional fields can contain context, author, creation tool, history…

<trans-unit id="n1">
  <source>This is a sentence</source>
  <target xml:lang="de">Dies ist ein Satz</target>
  <alt-trans match-quality="100" tool="TM_System">
    <source>This is a sentence</source>
    <target xml:lang="de">Das ist ein Satz</target>
  </alt-trans>
  <alt-trans match-quality="70" tool="TM_System">
    <source>This is a short sentence</source>
    <target xml:lang="de">Dies ist ein kurzer Satz</target>
  </alt-trans>
</trans-unit>

<trans-unit id="1" resname="IDCANCEL" restype="button" coord="8;8;50;14">
  <source xml:lang="en">Cancel</source>
</trans-unit>

• Information on the name of a button and its coordinates (for visual representation of the button in a localization tool)
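
As a rough illustration of how a tool might read the alternative matches from the trans-unit shown above, the following Python sketch parses that fragment with the standard library and ranks the alt-trans entries by match-quality. The function name `alternatives` is made up for this example. The fragment carries no XLIFF namespace (real XLIFF 1.2 files declare one, so the element lookups would need it), and treating match-quality as a number is an assumption, since the attribute is a free-form string.

```python
import xml.etree.ElementTree as ET

# The trans-unit from the slide, reconstructed as well-formed XML.
TRANS_UNIT = """<trans-unit id="n1">
  <source>This is a sentence</source>
  <target xml:lang="de">Dies ist ein Satz</target>
  <alt-trans match-quality="100" tool="TM_System">
    <source>This is a sentence</source>
    <target xml:lang="de">Das ist ein Satz</target>
  </alt-trans>
  <alt-trans match-quality="70" tool="TM_System">
    <source>This is a short sentence</source>
    <target xml:lang="de">Dies ist ein kurzer Satz</target>
  </alt-trans>
</trans-unit>"""


def alternatives(trans_unit_xml):
    """Return (quality, source, target) tuples, best match first.

    Hypothetical helper: assumes match-quality holds a number,
    optionally followed by a percent sign.
    """
    unit = ET.fromstring(trans_unit_xml)
    found = []
    for alt in unit.findall("alt-trans"):
        quality = float(alt.get("match-quality", "0").rstrip("%"))
        found.append((quality, alt.findtext("source"), alt.findtext("target")))
    return sorted(found, key=lambda item: item[0], reverse=True)


if __name__ == "__main__":
    for quality, source, target in alternatives(TRANS_UNIT):
        print(f"{quality:.0f}%  {source!r} -> {target!r}")
```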

Where is XLIFF useful?

• Where experience with XML exists
• Projects contain many different file formats
• All formats are converted to XLIFF for translation
• Different tools need to be used during localization
• Different translations (alt-trans) or languages needed as reference

Any idea why XLIFF should NOT be the cure for everything?

• Instead of developing parsers for different file formats (to read the file into a translation tool), developers now need to create parsers to convert those file formats to XLIFF
• Some file formats can already be dealt with (Office, HTML, XML…) – why should a new parser be created for those?
• XLIFF has its variants, like all XML-based files – how can you make sure that each tool can process any of the possible XLIFF flavours?

The magic quadrant again, just to remember the distinction

                   | Closed Source                 | Open Source
Open Standards     | Q1: Open-Closed (Good)        | Q2: Open-Open (Good)
Proprietary ways   | Q3: Proprietary-Closed (Bad)  | Q4: Proprietary-Open (Wild)

6 Talking Legalese in OSS

OSS (Open Source Software)

• Copyleft and Permissive Licensing
  • GPL (General Public License by FSF)
  • BSD (Berkeley Software Distribution)
  • Apache License 2.0
• MS open source licensing
  • Ms-PL (Public License)
  • Ms-RL (Reciprocal License)

Derived/Derivative works vs dynamic linkage

• Paradigmatic case is an app running on an OS

Contributions

Distribution

6 Talking Legalese in standards

RF (Royalty Free) vs RAND (Reasonable and Non-Discriminatory)

Standards can be licensed under different terms

• RF
• RAND – to date there is no general consensus on whether RAND is Open or Proprietary
• Proprietary

By open standard we mean just RF standards

The opposite of an Open Standard is any proprietary way of doing things, even a standardized one

7 Open Tools

• Infrastructure baseline
  • PostgreSQL, MySQL, Open ACS, Petri Nets, Alfresco, jBoss
• TM Server
  • TinyTM
• Workflow capabilities through any CALPMS
  • Open: ]po[, .LRN; closed source: AIT Projetex, LTC Worx, Plunet Business Manager etc.
• CAT Clients
  • OmegaT, Metatexis, TinyTM Word macro, Multilizer and whoever wants to implement the protocol
• Filters
  • OKAPI framework, mainstream XLIFF generators such as Sun or Adobe

8 Use Cases - odt + OmegaT

• Level 2 TMX support
• Has TinyTM connector in alpha
• Has plaintext glossary support

OpenOffice.org (.odt)

8 Use Cases - TinyTM

• The TinyTM protocol is licensed under the "LGPL V2.1 or higher"
• The rest of the TinyTM code is licensed under the "GPL V2.0 or higher"
• Translation clients can be licensed under any license
• The TinyTM documentation is licensed under the Creative Commons Attribution License

8 Use Cases - TinyTM - Status

• Alpha functionality
• Java TMX importer
• Mark up parsing logic
-------------------------
• Protocol freeze in public discussion on Sourceforge
• Protocol features
  • Conservative extension with many important enhancements

8 Use Cases - TinyTM – new protocol features

• Industry and domain taxonomy
  • open to accommodate future TDA taxonomy research
• Inline markup handling
  1. stripped plain version
  2. normalized version with formatting placeholders
  3. TMX version with full TMX markup
• Advanced leverage functions
  • hash based context storage
  • hash and trigram enhanced fuzzy search
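
To make the last two points more concrete, here is a minimal Python sketch of what "hash based context storage" and "trigram enhanced fuzzy search" could look like in general. It is not TinyTM's implementation: the function names, the choice of SHA-1, the padding scheme and the Jaccard-style score are all assumptions made for illustration.

```python
import hashlib


def trigrams(text):
    """Return the set of character trigrams of a lower-cased, padded string."""
    padded = f"  {text.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a, b):
    """Shared trigrams divided by all trigrams (0.0 to 1.0). Assumed metric."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def context_hash(previous_segment, next_segment):
    """Hash the neighbouring segments so that a match can be stored and
    looked up together with the context it was translated in."""
    data = f"{previous_segment}\x1f{next_segment}".encode("utf-8")
    return hashlib.sha1(data).hexdigest()


if __name__ == "__main__":
    query = "This is a sentence"
    for candidate in ("This is a short sentence", "Dies ist ein Satz"):
        print(candidate, round(trigram_similarity(query, candidate), 2))
    print(context_hash("Previous segment.", "Next segment."))
```

A trigram index lets a server narrow down fuzzy-match candidates cheaply before a more expensive comparison, and a context hash lets it prefer matches that were stored with the same neighbouring segments.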

9 Discussion

Thanks for your attention




Definitions: The magic quadrant

                   | Closed Source                 | Open Source
Open Standards     | Q1: Open-Closed (Good)        | Q2: Open-Open (Good)
Proprietary ways   | Q3: Proprietary-Closed (Bad)  | Q4: Proprietary-Open (Wild)

2 Definitions

• TMS, GMS
  • ETMS – Enterprise TMS, "from cradle to the grave"
  • Computer Aided L10N Project Management System (CALPMS)
• Open Standards: XLIFF, TMX, TBX, SRX etc.
• OSS: Open Source, Free Software vs Freeware
• Open Source (Copy-left) Licensing vs Permissive Licensing

3 Architecture

4 Strategy

[Diagram labels: New business needs; Change enabler; Win / Win / Win for the translator, the LSP of any size and the enterprise; TinyTM, OmegaT, Open ACS, OKAPI framework etc.; Exponential growth of content; Changing balance between published and user generated content; Need for continuous translation; Community translation; Shared language data; Massive online collaboration; Translation automation]

What is an open standard?

World Wide Web Consortium's definition:

• Transparency (design/due process is public, and all technical discussions and meeting minutes are archived and referenceable in decision making)
• Relevance (new standardization is started upon due analysis of the market needs, including a requirements phase, e.g. accessibility, multi-linguism)
• Openness (anyone can participate: industry, individual, public, government bodies, academia, on a worldwide scale)
• Impartiality and consensus (neutral org leading it, with equal weight for each participant)
• Availability (free access to the standard text, both during development and at final stage, translations, and clear IPR rules for implementation, allowing open source development in the case of Web technologies)
• Support (multiple implementations, ongoing process for testing, errata, revision, permanent access)

(Source: Wikipedia, 2009)

Goal of open standards

• Interoperability of tools
  • Vendors can concentrate on innovation in fields other than their proprietary formats
• Standardization of processes (translation of just one file format like XLIFF instead of DOC, HTML, InDesign, FM…)

Success of open standards

• Depends on the commercial usability
  • TMX – widespread; XLIFF – coming on strong; SRX – not widely used; TBX – slow; others – in the making (TBX Basic, GMX…)

5 Open Standards

• Why Open Standards in Open Source?
  • Implementing open standards seems an obvious success scenario for OSS development
  • XLIFF and TMX are open standards co-developed by our clients
  • Minimalist open standards implementation ensures desired functionality and is also legally safe
• LISA OSCAR TMX 1.4b, 1.5, 2.0
• OASIS XLIFF 1.1, 1.2, 1.2.1, 2.0

Open Standards: OAXAL

[Diagram: OAXAL. © Andrzej Zydron, OASIS OAXAL TC]

TMX: Translation Memory Exchange

• From the TMX specification:
  • "…The purpose of the TMX format is to provide a standard method to describe translation memory data that is being exchanged among tools and/or translation vendors, while introducing little or no loss of critical data during the process…"

What is TMX?

• It is an XML representation of translation memory data
  • Header
  • Body

<header
  creationtool="Déjà Vu"
  creationtoolversion="4"
  datatype="PlainText"
  segtype="sentence"
  adminlang="en-us"
  srclang="en-us"
  o-tmf="DVMDB"
>

creationtool: Déjà Vu, Transit, Trados, MemoQ
creationtoolversion: version / build number of the tool
datatype: HTML, SGML, RTF, Interleaf, Java…
segtype: basic segmentation
adminlang: default language for elements like <note>
srclang: source text language
o-tmf: original translation memory format (DVMDB – Déjà Vu database…)

What is TMX?

• Body

<body>
  <tu creationdate="20030915T153704Z" creationid="USER">
    <tuv lang="EN-US">
      <seg>This is the first sentence</seg>
    </tuv>
    <tuv lang="DE-DE">
      <seg>Dies ist der erste Satz</seg>
    </tuv>
  </tu>
</body>

tu = translation unit
tuv lang = translation unit variant (language)
seg = segment
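
The header and body shown above can be read with a few lines of standard-library Python. This is only a sketch: the surrounding `<tmx version="1.4">` root element is added here so that the fragment parses, the function names `read_header` and `read_units` are invented for the example, and the code accepts both the `lang` attribute used on the slide and the `xml:lang` attribute found in newer TMX files.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Header and body from the slides, wrapped in a <tmx> root (assumption).
SAMPLE = """<tmx version="1.4">
  <header creationtool="Déjà Vu" creationtoolversion="4"
          datatype="PlainText" segtype="sentence"
          adminlang="en-us" srclang="en-us" o-tmf="DVMDB"/>
  <body>
    <tu creationdate="20030915T153704Z" creationid="USER">
      <tuv lang="EN-US"><seg>This is the first sentence</seg></tuv>
      <tuv lang="DE-DE"><seg>Dies ist der erste Satz</seg></tuv>
    </tu>
  </body>
</tmx>"""


def read_header(tmx_text):
    """Return the header attributes as a plain dictionary."""
    return dict(ET.fromstring(tmx_text).find("header").attrib)


def read_units(tmx_text):
    """Return one {language: segment text} dictionary per translation unit."""
    units = []
    for tu in ET.fromstring(tmx_text).iter("tu"):
        variants = {}
        for tuv in tu.findall("tuv"):
            lang = tuv.get(XML_LANG) or tuv.get("lang")
            # itertext() also returns the content of inline elements,
            # which is fine for plain-text (Level 1) segments like these.
            variants[lang.lower()] = "".join(tuv.find("seg").itertext())
        units.append(variants)
    return units


if __name__ == "__main__":
    print(read_header(SAMPLE)["creationtool"])
    for unit in read_units(SAMPLE):
        print(unit)
```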

What is TMX?

• Depending on the tool that created the TMX file, it can be bilingual or multilingual
• Importing a multilingual TMX file into a bilingual project will only import the relevant languages
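
A bilingual import from a multilingual TMX file could be sketched as follows. The sample data, the `<tmx>` wrapper and the function name `import_pairs` are invented for this illustration; real tools differ in how they match language codes (for example whether "EN" also matches "EN-US").

```python
import xml.etree.ElementTree as ET

# Invented multilingual sample: the second unit has no German variant.
MULTILINGUAL = """<tmx version="1.4">
  <header creationtool="Example" creationtoolversion="1"
          datatype="PlainText" segtype="sentence"
          adminlang="en-us" srclang="en-us" o-tmf="none"/>
  <body>
    <tu>
      <tuv lang="EN-US"><seg>Cancel</seg></tuv>
      <tuv lang="DE-DE"><seg>Abbrechen</seg></tuv>
      <tuv lang="FR-FR"><seg>Annuler</seg></tuv>
    </tu>
    <tu>
      <tuv lang="EN-US"><seg>Help</seg></tuv>
      <tuv lang="FR-FR"><seg>Aide</seg></tuv>
    </tu>
  </body>
</tmx>"""


def import_pairs(tmx_text, source_lang, target_lang):
    """Keep only units that contain both project languages."""
    wanted = (source_lang.lower(), target_lang.lower())
    pairs = []
    for tu in ET.fromstring(tmx_text).iter("tu"):
        by_lang = {
            tuv.get("lang", "").lower(): "".join(tuv.find("seg").itertext())
            for tuv in tu.findall("tuv")
        }
        if all(lang in by_lang for lang in wanted):
            pairs.append((by_lang[wanted[0]], by_lang[wanted[1]]))
    return pairs


if __name__ == "__main__":
    print(import_pairs(MULTILINGUAL, "en-us", "de-de"))
```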

Levels of TMX

• Level 1
  • Plain text only (sufficient for data coming from software localization tools)
• Level 2
  • Text plus formatting (data coming from translation memory tools used for translation of documentation)

To move formatting and text from one tool to the other, both tools need to be Level 2 compliant

Level 1

• Formatting that is applied to the source and target text of a translation unit is not exported to the TMX file, only pure text
• Original
  • This sentence has some formatting
• In TMX
  • This sentence has some formatting

Level 2

• Formatting that is applied to the source and target text of a translation unit is exported to the TMX file
• Different tools use different ways of encoding that information (placeholders or actual formatting information)

Level 2

MemoQ – Word DOC with formatting:
<seg>This is the <bpt i="1" type="bold"></bpt>first<ept i="1"></ept> sentence this is <bpt i="2" type="ulined"></bpt>another<ept i="2"></ept> sentence</seg>

Trados 2009 – Word DOC with formatting:
<seg>This is the <bpt i="1" type="Bold" />first<ept i="1" /> sentence this is <bpt i="2" type="Underline" />another<ept i="2" /> sentence</seg>

Trados 2007 (8.2, 8.3) – Word DOC with formatting:
<seg>This is the <bpt i="1">&lt;cf bold=&quot;on&quot;&gt;</bpt>first<ept i="1">&lt;/cf&gt;</ept> sentence this is <bpt i="2">&lt;cf underlinestyle=&quot;single&quot;&gt;</bpt>another<ept i="2">&lt;/cf&gt;</ept> sentence</seg>
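
Whatever the flavour of inline markup, a receiving tool can always fall back to the plain text of a segment. The sketch below (standard-library Python, with the hypothetical helper name `plain_text`) drops the TMX inline code elements and keeps the translatable text, which is roughly what a Level 1 exchange would deliver. The three test strings are shortened versions of the Word examples above.

```python
import xml.etree.ElementTree as ET

# TMX inline elements whose content is formatting code, not translatable text.
CODE_TAGS = {"bpt", "ept", "it", "ph", "ut"}


def plain_text(seg_xml):
    """Return the text of a <seg> with all inline codes removed."""
    out = []

    def collect(elem):
        out.append(elem.text or "")
        for child in elem:
            if child.tag not in CODE_TAGS:
                collect(child)  # e.g. <hi>: keep its visible text
            out.append(child.tail or "")

    collect(ET.fromstring(seg_xml))
    return "".join(out)


MEMOQ = '<seg>This is the <bpt i="1" type="bold"></bpt>first<ept i="1"></ept> sentence</seg>'
TRADOS_2009 = '<seg>This is the <bpt i="1" type="Bold" />first<ept i="1" /> sentence</seg>'
TRADOS_2007 = ('<seg>This is the <bpt i="1">&lt;cf bold=&quot;on&quot;&gt;</bpt>first'
               '<ept i="1">&lt;/cf&gt;</ept> sentence</seg>')

if __name__ == "__main__":
    for sample in (MEMOQ, TRADOS_2009, TRADOS_2007):
        print(plain_text(sample))
```

All three encodings reduce to the same plain sentence, which is why text can usually be reused across tools even when the formatting cannot.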

Level 2

Trados 2007 (8.2, 8.3) – HTML file with link:
<seg>Text with a link to <bpt i="1">&lt;a href=&quot;http://www.samplehtml.com/page1.htm&quot;&gt;</bpt>another page<ept i="1">&lt;/a&gt;</ept></seg>

Trados 2009 – HTML file with link:
<seg>Text with a link to <bpt i="1" type="19" x="1" />another page<ept i="1" /></seg>

MemoQ – HTML file with link:
<seg>Text with a link to <bpt i="1" type="link">&lt;a href = &quot;http://www.samplehtml.com/page1.htm&quot;&gt;</bpt>another page<ept i="1">&lt;/a&gt;</ept></seg>

OmegaT – HTML file with link:
OmegaT internal format: <seg>Text with a link to &lt;a0&gt;another page&lt;/a0&gt;</seg>
TMX Level 2 format: <seg>Text with a link to <bpt i="0" x="0">&lt;a0&gt;</bpt>another page<ept i="0">&lt;/a0&gt;</ept></seg>

Level 2

Trados 2007 (8.2, 8.3) – InDesign:
<seg>InDesign text with <bpt i="1">&lt;cf ptfs=&quot;c_Bold&quot;&gt;</bpt>formatting in bold<ept i="1">&lt;/cf&gt;</ept></seg>

Trados 2009 – InDesign:
<seg>InDesign text with <bpt i="1" type="pt16" x="1" />formatting in bold<ept i="1" /></seg>

MemoQ – InDesign:
<seg>InDesign Text with <bpt i="1" type="bold"></bpt>formatting in bold<ept i="1"></ept></seg>

Implications of different tags for formatting

• Tools that use placeholder tags do not include the actual formatting information in the TMX file
• Other tools might only be able to re-use the text, especially if the formatting is only applied in the target segment but not in the source
• The result of the exchange would then be the same as with TMX Level 1 (text only)
• TMX files which carry the actual formatting information will yield better matches in other tools that can read this information

Where do you use TMX?

• Transferring data between different translation memory tools
• Checking tools / QA tools
• TM maintenance tools
• Basis for bilingual term extraction

Reusing TMX data

• Although translation memory tools share the same basic idea (storing source-target language pairs and recycling translations), this has been realized in different ways
• Exchange with TMX works, but there is an issue that can lower the match rates nonetheless… the segmentation rules

SRX – Segmentation Rules Exchange

• From the SRX specification:
  • "…The purpose of the SRX format is to provide a standard method to describe segmentation rules that are being exchanged among tools and/or translation vendors"
  • "…is intended to enhance the TMX standard…"

Why SRX?

• Tool A
  • Semicolon is end of segment
  • "This is a sentence; this is another sentence."
  • TM system sees two separate segments
• Tool B
  • Semicolon is NOT end of segment
  • "This is a sentence; this is another sentence."
  • TM system sees one segment
  • No match from the TMX data
  • Match rate around 50%, usual setting around 70%
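
The effect can be reproduced with a small Python sketch. The two segmenters below differ only in whether a semicolon ends a segment. The scoring function is a deliberately crude word-overlap measure invented for this example and is not the fuzzy-match algorithm of any actual TM tool; it merely shows how a match rate of about 50% comes about.

```python
import re
from difflib import SequenceMatcher

TEXT = "This is a sentence; this is another sentence."


def segment(text, break_on_semicolon):
    """Split after sentence-final punctuation, optionally also after ';'."""
    pattern = r"(?<=[.!?;])\s+" if break_on_semicolon else r"(?<=[.!?])\s+"
    return [s for s in re.split(pattern, text.strip()) if s]


def words(segment_text):
    return re.findall(r"\w+", segment_text.lower())


def crude_match_rate(a, b):
    """Words in common (in order) divided by the length of the longer segment."""
    wa, wb = words(a), words(b)
    blocks = SequenceMatcher(None, wa, wb).get_matching_blocks()
    return sum(block.size for block in blocks) / max(len(wa), len(wb))


tm_segments = segment(TEXT, break_on_semicolon=True)    # Tool A filled the TM
new_segments = segment(TEXT, break_on_semicolon=False)  # Tool B translates

if __name__ == "__main__":
    print("TM (Tool A):", tm_segments)
    print("New text (Tool B):", new_segments)
    best = max(crude_match_rate(new_segments[0], s) for s in tm_segments)
    print(f"Best match rate: {best:.0%} (typical threshold: 70%)")
```

Because Tool B looks up one long segment while the TM holds two short ones, neither stored segment covers more than half of the query, so nothing is offered to the translator.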

Segmentation rules

• Rules that the tool applies to the text to translate to split it up into segments
  • paragraph
  • sentence
  • phrase
  • incomplete sentences in bulleted lists
  • single words (headings, "Note", "Attention")

Segmentation rules

• End of segment rules (common to the default settings of all tools)
  • Dot at the end of a sentence (not after known abbreviations)
  • Question mark, exclamation mark
  • Paragraph mark
  • Colon
• End of segment rules (different for different tools)
  • Semicolon
  • Tab character
  • Sub segments (index entries, footnotes, graphics)

Comparison of default rules

              | Workbench | Transit | DV                         | SDLX                       | Across
Colon         | end       | end     | end                        | no end                     | no end
Semicolon     | no end    | end     | end                        | no end                     | no end
Tab           | end       | no end  | no end                     | no end                     | no end
Soft return   | no end    | no end  | end in Word, no end in PPT | end in Word, no end in PPT | no end


Page 4: A Guide to Open Standards and Open SourceOAXAL C. TMX Translation Memory Exchange •From the TMX specification: •…The purpose of the TMX format is to provide a standard method

1 Polling Questions

bull (B) How familiar are you with open source applications used in the localization industry (such as OmegaT Okapi Framework Sun XLIFF Translation Editor)

Im familiar with these tools and use them (or tools like them) regularly

I have a basic understanding of these applications but dont really use them

Im new to the idea of open source tools for the localization industry and want to learn more

1 Polling Questions

bull (C) Which are important to you

Learning about the differences between open standards and open source

Learning about actual use open standards and commonly used tools

Learning about licensing and patent issues

Learning about the open Translation Management Systems in use or development today

Q1

Open-Closed

Good

Q2

Open-Open

Good

Q3

Proprietary-Closed

Bad

Q4

Proprietary-Open

Wild

DefinitionsThe magic quadrant

Open Standards

Open Source

Closed Source

Proprietary ways

2 Definitions

bull TMS GMS

ETMS ndash Enterprise TMSldquofrom cradle to the graverdquo

Computer Aided L10N Project Management System (CALPMS)

bull Open Standards XLIFF TMX TBX SRX etc

bull OSS Open Source Free Software vs Freeware

bull Open Source (Copy-left) Licensing vs Permissive Licensing

3 Architecture

4 Strategy

New business

Needs

Change Enabler

Win

Win

Win

Translator

LSP of any size

Enterprise

TinyTM OmegaT Open ACS OKAPI framework Etc

Exponential growth of content

Changing balance between published and user generated content

Need for Continuous Translation

Community Translation Shared language

data Massive online

collaboration Translation automation

What is an open standard

World Wide Web Consortiums definitionbull Transparency (designdue process is public and all technical

discussions meeting minutes are archived and referencablein decision making)

bull Relevance (new standardization is started upon due analysis of the market needs including requirements phase eg accessibility multi-linguism)

bull Openness (anyone can participate industry individual public government bodies academia on a worldwide scale)

bull Impartiality and consensus (neutral org leading it with equal weight for each participant)

bull Availability (free access to the standard text both during development and at final stage translations and clear IPR rules for implementation allowing open source development in the case of Web technologies)

bull Support (multiple implementations ongoing process for testing errata revision permanent access)

Wikipedia 2009

Goal of open standards

bull Interoperability of toolsbull Vendors can concentrate on innovation in other fields than their proprietary formats

bull Standardization of processes (translation of just one file format like XLIFF instead of DOC HTML InDesign FMhellip)

Success of open standards

bull Depends on the commercial usabilitybull TMX ndash widespread XLIFF ndash coming on

strong SRX ndash not widely used TBX ndash slow others ndash in the making (TBX Basic GMXhellip)

5 Open Standards

bull Why Open Standards in Open Source

bull Implementing open standards seems obvious success scenario for OSS development

bull XLIFF and TMX are open standards co-developed by our clients

bull Minimalist open standards implementation ensures desired functionality and is also legally safe

bull LISA OSCAR TMX 14b 15 20

bull OASIS XLIFF 11 12 121 20

Open Standards OAXAL

copy A

ndrz

ej Z

ydro

n O

ASIS

OAXAL T

C

TMXTranslation Memory Exchange

bull From the TMX specification

bull hellipThe purpose of the TMX format is to provide a standard method to describe translation memory data that is being exchanged among tools andor translation vendors while introducing little or no loss of critical data during the processhellip

What is TMX

bull It is an XML representation of translation memory data

bull Header

bull Body

ltheadercreationtool=ldquoDeacutejagrave Vu creationtoolversion=ldquo4datatype=PlainTextrdquosegtype=sentenceadminlang=en-ussrclang=en-uso-tmf=DVMDB

gt

Deacutejagrave Vu Transit Trados MemoQ

Version build number of the tool

HTML SGML RTF Interleaf Javahellip

Basic segmentation

Default language for elements like ltnotegt

Source text language

Original translation memory format (DVMDB ndash Deacutejagrave Vu databasehellip)

What is TMX

bull Body

ltbodygtlttu creationdate=20030915T153704Z creationid=USERgt

lttuv lang=EN-USgtltseggtThis is the first sentenceltseggt

lttuvgtlttuv lang=DE-DEgt

ltseggtDies ist der erste Satzltseggtlttuvgt

lttugtltbodygt

tu = Translation Unittuv lang = translation unit variant (language) seg = segment

What is TMX

bull Depending on the tool that created the TMX file it can be bilingual or multilingual

bull Importing multilingual TMX file into a bilingual project will only import the relevant languages

Levels of TMX

bull Level 1bull Plain text only (sufficient for data coming from software localization tools)

bull Level 2bull Text plus formatting (data coming from translation memory tools used for translation of documentation)

To move formatting and text from one tool to the other both tools need to be level2 compliant

Level 1

bull Formatting that is applied to the source and target text of a translation unit is not exported to the TMX file only pure text

bull Original

bull This sentence has some formatting

bull In TMX

bull This sentence has some formatting

Level 2

bull Formatting that is applied to the source and target text of a translation unit is exported to the TMX file

bull Different tools use different ways of encoding that information (placeholders or actual formatting information)

Level 2

seggtThis is the ltbpt i=1 type=boldgtltbptgtfirstltept i=1gtlteptgt sentence this is ltbpt i=2 type=ulinedgtltbptgtanotherltept i=2gtlteptgt sentenceltseggt

MemoQ ndash Word DOC with formatting

ltseggtThis is the ltbpt i=1 type=Bold gtfirstltept i=1 gt sentence this isltbpt i=2 type=Underline gtanotherltept i=2 gt sentenceltseggt

Trados 2009 ndash Word DOC with formatting

ltseggtThis is the ltbpt i=1gtampltcf bold=ampquotonampquotampgtltbptgtfirstltept i=1gtampltcfampgtlteptgt sentence this is ltbpt i=2gtampltcf underlinestyle=ampquotsingleampquotampgtltbptgtanotherltept i=2gtampltcfampgtlteptgtsentenceltseggt

Trados 2007 82 83 ndash Word DOC with formatting

Level 2

MemoQ ndash HTML file with link

Trados 2009 ndash HTML file with link

Trados 2007 82 83 ndash HTML file with link

ltseggtText with a link to ltbpt i=1gtamplta href=ampquothttpwwwsamplehtmlcompage1htmampquotampgtltbptgtanother pageltept i=1gtampltaampgtlteptgtltseggt

ltseggtText with a link to ltbpt i=1 type=19 x=1 gtanother pageltept i=1 gtltseggt

ltseggtText with a link to ltbpt i=1 type=linkgtamplta href = ampquothttpwwwsamplehtmlcompage1htmampquotampgtltbptgtanother pageltept i=1gtampltaampgtlteptgtltseggt

OmegaT internal format ltseggtText with a link to amplta0ampgtanother pageamplta0ampgtltseggt

TMX Level 2 format ltseggtText with a link to ltbpt i=0 x=0gtamplta0ampgtltbptgtanother pageltept i=0gtamplta0ampgtlteptgtltseggt

OmegaT - HTML file with link

Level 2

MemoQ ndash InDesign

Trados 2009 ndash InDesign

Trados 2007 82 83 ndash InDesign

ltseggtInDesign text with ltbpt i=1gtampltcf ptfs=ampquotc_Boldampquotampgtltbptgtformatting in boldltept i=1gtampltcfampgtlteptgtltseggt

ltseggtInDesign text with ltbpt i=1 type=pt16 x=1 gtformatting in boldltept i=1 gtltseggt

ltseggtInDesign Text with ltbpt i=1 type=boldgtltbptgtformtatting in boldltept i=1gtlteptgtltseggt

Implications of different tags for formatting

bull Tools that use placeholder tags do not include the actual formatting information in the TMX file

bull Other tools might only be able to re-use the text especially if the formatting is only applied in the target segment but not in the source

bull The result of the exchange would then be the same as with TMX level 1 (text only)

bull TMX files which carry the actual formatting information will yield better matches in other tools that can read this information

Where do you use TMX?

• Transferring data between different translation memory tools
• Checking tools / QA tools
• TM maintenance tools
• Basis for bilingual term extraction

Reusing TMX data

• Although translation memory tools share the same basic idea (storing source-target language pairs and recycling translations), this has been realized in different ways
• Exchange with TMX works, but there is an issue that can lower the match rates nonetheless… the segmentation rules

SRX – Segmentation Rules Exchange

• From the SRX specification:
• …The purpose of the SRX format is to provide a standard method to describe segmentation rules that are being exchanged among tools and/or translation vendors…
• …is intended to enhance the TMX standard…

Why SRX?

• Tool A
  • Semicolon is end of segment
  • This is a sentence; this is another sentence
  • TM system sees two separate segments
• Tool B
  • Semicolon is NOT end of segment
  • This is a sentence; this is another sentence
  • TM system sees one segment
  • No match from the TMX data
  • Match rate is around 50%, below the usual threshold setting of around 70%
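The Tool A / Tool B mismatch can be reproduced with a small Python sketch. The splitting function and the word-level similarity score are assumptions made for illustration; real TM tools use their own segmentation engines and fuzzy-match algorithms, so the exact percentages will differ.

```python
import re
from difflib import SequenceMatcher

TEXT = "This is a sentence; this is another sentence."


def segment(text, break_chars):
    """Split after any character in break_chars that is followed by whitespace."""
    pattern = r"(?<=[" + re.escape(break_chars) + r"])\s+"
    return re.split(pattern, text)


def word_similarity(a, b):
    """Word-level similarity in [0, 1]; a stand-in for a tool's fuzzy-match score."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()


tool_a = segment(TEXT, ".;")  # Tool A: a semicolon ends a segment
tool_b = segment(TEXT, ".")   # Tool B: a semicolon does not end a segment

if __name__ == "__main__":
    # Segments stored by Tool A, looked up against the one segment Tool B sees.
    for stored in tool_a:
        print(f"{word_similarity(stored, tool_b[0]):.2f}  {stored!r}")
```

Neither of Tool A's stored segments reaches a 70% threshold against Tool B's single segment in this sketch, so the translator would see no match even though both halves were translated before.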

Segmentation rules

• Rules that the tool applies to the text to translate, to split it up into segments
  • paragraph
  • sentence
  • phrase
  • incomplete sentences in bulleted lists
  • single words (headings, “Note”, “Attention”)

Segmentation rules

• End of segment rules (common to the default settings of all tools)
  • Dot at the end of a sentence (not after known abbreviations)
  • Question mark, exclamation mark
  • Paragraph mark
  • Colon
• End of segment rules (different for different tools)
  • Semicolon
  • Tab character
  • Sub segments (index entries, footnotes, graphics)
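SRX describes such rules as ordered break / no-break decisions, each with a pattern for the text before and after the candidate position, and the first rule that matches wins. The following Python sketch mimics that model without reading real SRX files. The rule sets, the abbreviation list and the function name `apply_rules` are assumptions chosen for illustration.

```python
import re

# Each rule: (is_break, pattern before the position, pattern after the position).
NO_BREAK_AFTER_ABBREVIATION = (False, r"\b(?:e\.g|etc|Dr)\.", r"\s")

TOOL_A_RULES = [NO_BREAK_AFTER_ABBREVIATION, (True, r"[.?!:;]", r"\s")]  # semicolon breaks
TOOL_B_RULES = [NO_BREAK_AFTER_ABBREVIATION, (True, r"[.?!:]", r"\s")]   # semicolon does not


def apply_rules(text, rules):
    """Segment text; at every position the first matching rule decides."""
    compiled = [(is_break, re.compile(before + r"$"), re.compile(after))
                for is_break, before, after in rules]
    segments, start = [], 0
    for pos in range(1, len(text)):
        for is_break, before, after in compiled:
            if before.search(text[:pos]) and after.match(text, pos):
                if is_break:
                    segments.append(text[start:pos].strip())
                    start = pos
                break  # first matching rule wins, whether break or no-break
    tail = text[start:].strip()
    if tail:
        segments.append(tail)
    return segments


if __name__ == "__main__":
    sample = "See Dr. Smith for details. First part; second part."
    print(apply_rules(sample, TOOL_A_RULES))
    print(apply_rules(sample, TOOL_B_RULES))
```

Placing the no-break rule for abbreviations ahead of the general break rule is what keeps "Dr." from ending a segment; changing only the character class reproduces the tool-specific behaviour for the semicolon.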

Comparison of default rules

              Workbench   Transit   DV                           SDLX                         Across
Colon         end         end       end                          no end                       no end
Semicolon     no end      end       end                          no end                       no end
Tab           end         no end    no end                       no end                       no end
Soft return   no end      no end    end in Word, no end in PPT   end in Word, no end in PPT   no end

What can SRX do, and what not?

• It can only show the segmentation rule settings at the time of export
• It cannot show any changes that were applied to the segmentation rules during the use of the TM
• Sometimes a rule from system 1 cannot be re-created in system 2; that rule is then ignored

TBX – TermBase Exchange

• From the working draft of the TBX specification:
  • TBX is an open, XML-based standard format for terminological data
  • TBX is designed to support the analysis, representation, dissemination and exchange of information from human-oriented terminological databases (termbases)
  • TBX is built on the basis of ISO 12620 (data categories) and ISO 12200 (MARTIF core structure)

TMX TBX

[Figure: annotated sample entry. Callouts: global information in entry head; language ID (one per language); term in English; term in French; information on term level; administrative data of this language.]
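To make the callouts concrete, here is a minimal sketch of a concept-oriented entry in the style of TBX, read with Python's standard library. The entry is heavily simplified and its content is invented for illustration. A real TBX file wraps entries in a full document structure with a header, and the function name `terms_by_language` is an assumption.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Simplified, illustrative entry: entry-level information first,
# then one language section per language, each holding its terms.
TBX_ENTRY = """
<termEntry id="c1">
  <descrip type="subjectField">translation technology</descrip>
  <langSet xml:lang="en">
    <tig>
      <term>translation memory</term>
      <termNote type="partOfSpeech">noun</termNote>
    </tig>
  </langSet>
  <langSet xml:lang="fr">
    <tig>
      <term>mémoire de traduction</term>
      <termNote type="partOfSpeech">noun</termNote>
    </tig>
  </langSet>
</termEntry>
"""


def terms_by_language(entry_xml):
    """Return {language ID: [terms]} for one terminological entry."""
    entry = ET.fromstring(entry_xml)
    result = {}
    for lang_set in entry.findall("langSet"):
        result[lang_set.get(XML_LANG)] = [term.text for term in lang_set.iter("term")]
    return result


if __name__ == "__main__":
    print(terms_by_language(TBX_ENTRY))
```

Unlike a TMX translation unit, which pairs whole segments, the entry groups everything known about one concept, which is why information can sit at entry, language or term level.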

Where could you use TBX?

• Exchange terminological data
  • From term base to source language checking tools
  • Between term bases of translation memory tools
  • From term base to terminology checking tools
  • From term base to terminology extraction tools (as stop word lists)
  • From term base to dictionary of a machine translation tool
• For indexing keywords in document management systems, content management systems, knowledge management systems
• Publishing terminological data on the Intranet / Internet
• Optimization of search engines / text mining by searching for synonyms automatically

XLIFF – XML Localization Interchange File Format

• The XLIFF format aims to
  • separate localizable text from formatting (deal with one file format in translation instead of different processes to extract / filter / convert text from different file formats)
  • enable multiple tools to work on the source text
  • store information that is helpful in supporting a localization process (like metadata on versions of source and target segments)
• An XLIFF file is bilingual and can be the container for a number of individual files
• Each file element in an XLIFF file contains a header (project data such as contact information, project phases, pointers to reference material and information on the skeleton file) and a body section with the actual text segments

XLIFF

• XLIFF can carry several translation matches
• Additional fields can contain context, author, creation tool, history…

<trans-unit id="n1">
  <source>This is a sentence</source>
  <target xml:lang="de">Dies ist ein Satz</target>
  <alt-trans match-quality="100" tool="TM_System">
    <source>This is a sentence</source>
    <target xml:lang="de">Das ist ein Satz</target>
  </alt-trans>
  <alt-trans match-quality="70" tool="TM_System">
    <source>This is a short sentence</source>
    <target xml:lang="de">Dies ist ein kurzer Satz</target>
  </alt-trans>
</trans-unit>

<trans-unit id="1" resname="IDCANCEL" restype="button" coord="8;8;50;14">
  <source xml:lang="en">Cancel</source>
</trans-unit>

• Information on the name of a button and its coordinates (for visual representation of the button in a localization tool)
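A consuming tool might read the alternative translations and offer the best one first. The following Python sketch does this for a trans-unit like the one shown on the slide. It assumes a fragment without namespace declarations and numeric match-quality values; real XLIFF 1.2 files declare a namespace on the root element, and match-quality may be written differently, so production code would need to handle both. The function name `ranked_matches` is an assumption.

```python
import xml.etree.ElementTree as ET

TRANS_UNIT = """
<trans-unit id="n1">
  <source>This is a sentence</source>
  <target xml:lang="de">Dies ist ein Satz</target>
  <alt-trans match-quality="100" tool="TM_System">
    <source>This is a sentence</source>
    <target xml:lang="de">Das ist ein Satz</target>
  </alt-trans>
  <alt-trans match-quality="70" tool="TM_System">
    <source>This is a short sentence</source>
    <target xml:lang="de">Dies ist ein kurzer Satz</target>
  </alt-trans>
</trans-unit>
"""


def ranked_matches(trans_unit_xml):
    """Return (quality, source, target) tuples for all alt-trans, best first."""
    unit = ET.fromstring(trans_unit_xml)
    matches = []
    for alt in unit.findall("alt-trans"):
        quality = int(alt.get("match-quality", "0"))  # assumes plain integers
        matches.append((quality, alt.findtext("source"), alt.findtext("target")))
    return sorted(matches, key=lambda match: match[0], reverse=True)


if __name__ == "__main__":
    for quality, source, target in ranked_matches(TRANS_UNIT):
        print(quality, source, "->", target)
```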

Where is XLIFF useful?

• Where experience with XML exists
• Projects contain many different file formats
• All formats are converted to XLIFF for translation
• Different tools need to be used during localization
• Different translations (alt-trans) or languages needed as reference

Any idea why XLIFF should NOT be the cure for everything?

• Instead of developing parsers for different file formats (to read the file into a translation tool), developers now need to create parsers to convert those file formats to XLIFF
• Some file formats can already be dealt with (Office, HTML, XML…) – why should a new parser be created for those?
• XLIFF has its variants, like all XML-based files – how can you make sure that each tool can process any of the possible XLIFF flavours?

The magic quadrant again, just to remember the distinction

                    Closed Source                   Open Source
Open Standards      Q1: Open-Closed (Good)          Q2: Open-Open (Good)
Proprietary ways    Q3: Proprietary-Closed (Bad)    Q4: Proprietary-Open (Wild)

6 Talking Legalese in OSS

OSS (Open Source Software)

• Copyleft and Permissive Licensing
• GPL (General Public License by FSF)
• BSD (Berkeley Software Distribution)
• Apache License 2.0
• MS open source licensing
• Ms-PL (Public License)
• Ms-RL (Reciprocal License)

Derived/Derivative works vs. dynamic linkage

• Paradigmatic case is app running on OS

Contributions

Distribution

6 Talking Legalese in standards

RF (Royalty Free) vs. RAND (Reasonable and Non-Discriminatory)

Standards can be licensed under different terms

• RF
• RAND – there is no general consensus to date on whether RAND is Open or Proprietary
• Proprietary

By open standard we mean just RF standards

The opposite of an Open Standard is any proprietary way of doing things, even if it is standardized

7 Open Tools

• Infrastructure baseline
  • PostgreSQL, MySQL, Open ACS, Petri Nets, Alfresco, jBoss
• TM Server
  • TinyTM
• Workflow capabilities through any CALPMS
  • Open: ]po[, .LRN; Closed source: AIT Projetex, LTC Worx, Plunet Business Manager, etc.
• CAT Clients
  • OmegaT, Metatexis, TinyTM Word macro, Multilizer, and whoever wants to implement the protocol
• Filters
  • OKAPI framework, mainstream XLIFF generators such as Sun or Adobe

8 Use Cases - odt + OmegaT

• Level 2 TMX support
• Has TinyTM connector in alpha
• Has plaintext glossary support

OpenOffice.org .odt

8 Use Cases - TinyTM

• TinyTM protocol is licensed under the “LGPL V2.1 or higher”
• Rest of the TinyTM code is licensed under the “GPL V2.0 or higher”
• Translation clients can be licensed under any license
• TinyTM documentation is licensed under the Creative Commons Attribution License

8 Use Cases - TinyTM - Status

• Alpha functionality
• Java TMX importer
• Markup parsing logic
-------------------------
• Protocol freeze in public discussion on Sourceforge
• Protocol features
  • Conservative extension with many important enhancements

8 Use Cases - TinyTM – new protocol features

• Industry and domain taxonomy
  • open to accommodate future TDA taxonomy research
• Inline markup handling
  1. stripped plain version
  2. normalized version with formatting placeholders
  3. TMX version with full TMX markup
• Advanced leverage functions
  • hash based context storage
  • hash and trigram enhanced fuzzy search
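The idea behind hash and trigram enhanced fuzzy search can be sketched as follows: a hash of the segment gives a fast exact lookup, and overlapping character trigrams give a similarity score for near matches. This is not TinyTM's implementation. The class `SketchMemory`, the SHA-1 hash, the Jaccard score and the default threshold are all assumptions chosen for illustration.

```python
import hashlib


def trigrams(text):
    """Set of character trigrams of the lower-cased, padded text."""
    padded = f"  {text.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a, b):
    """Jaccard overlap of the two trigram sets, between 0.0 and 1.0."""
    ta, tb = trigrams(a), trigrams(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def segment_hash(text):
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


class SketchMemory:
    """Hypothetical in-memory TM: hash for exact hits, trigrams for fuzzy hits."""

    def __init__(self):
        self.by_hash = {}

    def add(self, source, target):
        self.by_hash[segment_hash(source)] = (source, target)

    def lookup(self, query, threshold=0.5):
        exact = self.by_hash.get(segment_hash(query))
        if exact:
            return [(1.0, *exact)]
        scored = [(trigram_similarity(query, source), source, target)
                  for source, target in self.by_hash.values()]
        return sorted((hit for hit in scored if hit[0] >= threshold), reverse=True)


if __name__ == "__main__":
    tm = SketchMemory()
    tm.add("This is a sentence", "Dies ist ein Satz")
    print(tm.lookup("This is a sentence"))
    print(tm.lookup("This is a short sentence", threshold=0.0))
```

A server-side implementation would keep the hashes and trigrams in database indexes instead of scanning every stored segment, but the division of labour between exact and fuzzy lookup stays the same.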

9 Discussion

Thanks for your attention


10 References

Open Standards (selective)

• XLIFF 1.2: http://docs.oasis-open.org/xliff/v1.2/os/xliff-core.html
• TMX 1.4b: http://www.lisa.org/fileadmin/standards/tmx1.4/tmx.htm
• SRX 2.0: http://www.lisa.org/fileadmin/standards/srx20.html

Open Tools (selective)

• OKAPI Framework: http://okapi.sourceforge.net
• OmegaT: http://sourceforge.net/projects/omegat
• OmegaT+: http://omegatplus.sourceforge.net
• Open ACS: http://openacs.org
• PostgreSQL: http://www.postgresql.org
• TinyTM: http://tinytm.sourceforge.net

10 References

Tools continued

• XML to XLIFF to XML: https://sourceforge.net/projects/xliffroundtrip
• TMX Compliance Kit (test for TMX certification): http://www.lisa.org/Translation-Memory-e.34.0.html
• TBX Checker: http://www.lisa.org/TBX-Resources.650.0.html

Standards continued

• Unicode Standard Annex #29: http://www.unicode.org/reports/tr29/
• W3C ITS: http://www.w3.org/TR/its/

Page 6: A Guide to Open Standards and Open SourceOAXAL C. TMX Translation Memory Exchange •From the TMX specification: •…The purpose of the TMX format is to provide a standard method

Q1

Open-Closed

Good

Q2

Open-Open

Good

Q3

Proprietary-Closed

Bad

Q4

Proprietary-Open

Wild

DefinitionsThe magic quadrant

Open Standards

Open Source

Closed Source

Proprietary ways

2 Definitions

bull TMS GMS

ETMS ndash Enterprise TMSldquofrom cradle to the graverdquo

Computer Aided L10N Project Management System (CALPMS)

bull Open Standards XLIFF TMX TBX SRX etc

bull OSS Open Source Free Software vs Freeware

bull Open Source (Copy-left) Licensing vs Permissive Licensing

3 Architecture

4 Strategy

New business

Needs

Change Enabler

Win

Win

Win

Translator

LSP of any size

Enterprise

TinyTM OmegaT Open ACS OKAPI framework Etc

Exponential growth of content

Changing balance between published and user generated content

Need for Continuous Translation

Community Translation Shared language

data Massive online

collaboration Translation automation

What is an open standard

World Wide Web Consortiums definitionbull Transparency (designdue process is public and all technical

discussions meeting minutes are archived and referencablein decision making)

bull Relevance (new standardization is started upon due analysis of the market needs including requirements phase eg accessibility multi-linguism)

bull Openness (anyone can participate industry individual public government bodies academia on a worldwide scale)

bull Impartiality and consensus (neutral org leading it with equal weight for each participant)

bull Availability (free access to the standard text both during development and at final stage translations and clear IPR rules for implementation allowing open source development in the case of Web technologies)

bull Support (multiple implementations ongoing process for testing errata revision permanent access)

Wikipedia 2009

Goal of open standards

bull Interoperability of toolsbull Vendors can concentrate on innovation in other fields than their proprietary formats

bull Standardization of processes (translation of just one file format like XLIFF instead of DOC HTML InDesign FMhellip)

Success of open standards

bull Depends on the commercial usabilitybull TMX ndash widespread XLIFF ndash coming on

strong SRX ndash not widely used TBX ndash slow others ndash in the making (TBX Basic GMXhellip)

5 Open Standards

bull Why Open Standards in Open Source

bull Implementing open standards seems obvious success scenario for OSS development

bull XLIFF and TMX are open standards co-developed by our clients

bull Minimalist open standards implementation ensures desired functionality and is also legally safe

bull LISA OSCAR TMX 14b 15 20

bull OASIS XLIFF 11 12 121 20

Open Standards OAXAL

copy A

ndrz

ej Z

ydro

n O

ASIS

OAXAL T

C

TMXTranslation Memory Exchange

bull From the TMX specification

bull hellipThe purpose of the TMX format is to provide a standard method to describe translation memory data that is being exchanged among tools andor translation vendors while introducing little or no loss of critical data during the processhellip

What is TMX

bull It is an XML representation of translation memory data

bull Header

bull Body

ltheadercreationtool=ldquoDeacutejagrave Vu creationtoolversion=ldquo4datatype=PlainTextrdquosegtype=sentenceadminlang=en-ussrclang=en-uso-tmf=DVMDB

gt

Deacutejagrave Vu Transit Trados MemoQ

Version build number of the tool

HTML SGML RTF Interleaf Javahellip

Basic segmentation

Default language for elements like ltnotegt

Source text language

Original translation memory format (DVMDB ndash Deacutejagrave Vu databasehellip)

What is TMX

bull Body

ltbodygtlttu creationdate=20030915T153704Z creationid=USERgt

lttuv lang=EN-USgtltseggtThis is the first sentenceltseggt

lttuvgtlttuv lang=DE-DEgt

ltseggtDies ist der erste Satzltseggtlttuvgt

lttugtltbodygt

tu = Translation Unittuv lang = translation unit variant (language) seg = segment

What is TMX

bull Depending on the tool that created the TMX file it can be bilingual or multilingual

bull Importing multilingual TMX file into a bilingual project will only import the relevant languages

Levels of TMX

bull Level 1bull Plain text only (sufficient for data coming from software localization tools)

bull Level 2bull Text plus formatting (data coming from translation memory tools used for translation of documentation)

To move formatting and text from one tool to the other both tools need to be level2 compliant

Level 1

bull Formatting that is applied to the source and target text of a translation unit is not exported to the TMX file only pure text

bull Original

bull This sentence has some formatting

bull In TMX

bull This sentence has some formatting

Level 2

• Formatting that is applied to the source and target text of a translation unit is exported to the TMX file

• Different tools use different ways of encoding that information (placeholders or actual formatting information)

Level 2

MemoQ – Word DOC with formatting

<seg>This is the <bpt i="1" type="bold"></bpt>first<ept i="1"></ept> sentence this is <bpt i="2" type="ulined"></bpt>another<ept i="2"></ept> sentence</seg>

Trados 2009 – Word DOC with formatting

<seg>This is the <bpt i="1" type="Bold" />first<ept i="1" /> sentence this is <bpt i="2" type="Underline" />another<ept i="2" /> sentence</seg>

Trados 2007, 8.2, 8.3 – Word DOC with formatting

<seg>This is the <bpt i="1">&lt;cf bold=&quot;on&quot;&gt;</bpt>first<ept i="1">&lt;/cf&gt;</ept> sentence this is <bpt i="2">&lt;cf underlinestyle=&quot;single&quot;&gt;</bpt>another<ept i="2">&lt;/cf&gt;</ept> sentence</seg>

Level 2

MemoQ – HTML file with link

<seg>Text with a link to <bpt i="1" type="link">&lt;a href = &quot;http://www.samplehtml.com/page1.htm&quot;&gt;</bpt>another page<ept i="1">&lt;/a&gt;</ept></seg>

Trados 2009 – HTML file with link

<seg>Text with a link to <bpt i="1" type="19" x="1" />another page<ept i="1" /></seg>

Trados 2007, 8.2, 8.3 – HTML file with link

<seg>Text with a link to <bpt i="1">&lt;a href=&quot;http://www.samplehtml.com/page1.htm&quot;&gt;</bpt>another page<ept i="1">&lt;/a&gt;</ept></seg>

OmegaT – HTML file with link

OmegaT internal format: <seg>Text with a link to &lt;a0&gt;another page&lt;/a0&gt;</seg>

TMX Level 2 format: <seg>Text with a link to <bpt i="0" x="0">&lt;a0&gt;</bpt>another page<ept i="0">&lt;/a0&gt;</ept></seg>

Level 2

MemoQ – InDesign

<seg>InDesign Text with <bpt i="1" type="bold"></bpt>formatting in bold<ept i="1"></ept></seg>

Trados 2009 – InDesign

<seg>InDesign text with <bpt i="1" type="pt16" x="1" />formatting in bold<ept i="1" /></seg>

Trados 2007, 8.2, 8.3 – InDesign

<seg>InDesign text with <bpt i="1">&lt;cf ptfs=&quot;c_Bold&quot;&gt;</bpt>formatting in bold<ept i="1">&lt;/cf&gt;</ept></seg>

Implications of different tags for formatting

• Tools that use placeholder tags do not include the actual formatting information in the TMX file

• Other tools might only be able to re-use the text, especially if the formatting is only applied in the target segment but not in the source

• The result of the exchange would then be the same as with TMX Level 1 (text only)

• TMX files which carry the actual formatting information will yield better matches in other tools that can read this information
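When a receiving tool cannot interpret another tool's inline codes, it can still fall back to the text, which is effectively a Level 1 import. The sketch below (Python, standard library only) shows one way to reduce a Level 2 segment to plain text; the function name plain_text and the sample segments are assumptions for illustration, not the behaviour of any particular tool.

```python
import xml.etree.ElementTree as ET

# Inline elements whose content is native formatting code, not translatable text.
CODE_ELEMENTS = {"bpt", "ept", "it", "ph", "ut"}


def plain_text(element):
    """Collect the text of an element, skipping the content of inline code elements."""
    parts = [element.text or ""]
    for child in element:
        if child.tag not in CODE_ELEMENTS:
            parts.append(plain_text(child))  # other inline elements may wrap real text
        parts.append(child.tail or "")       # text after the inline element is kept
    return "".join(parts)


placeholder_style = ('<seg>This is the <bpt i="1" type="bold"></bpt>first'
                     '<ept i="1"></ept> sentence</seg>')
native_code_style = ('<seg>This is the <bpt i="1">&lt;cf bold=&quot;on&quot;&gt;</bpt>first'
                     '<ept i="1">&lt;/cf&gt;</ept> sentence</seg>')

print(plain_text(ET.fromstring(placeholder_style)))
print(plain_text(ET.fromstring(native_code_style)))
```

Both encodings reduce to the same plain sentence, which is why text-only reuse works across tools even when the formatting codes differ.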

Where do you use TMX?

• Transferring data between different translation memory tools

• Checking tools / QA tools

• TM maintenance tools

• Basis for bilingual term extraction

Reusing TMX data

• Although translation memory tools share the same basic idea (storing source-target language pairs and recycling translations), this has been realized in different ways

• Exchange with TMX works, but there is an issue that can lower the match rates nonetheless… the segmentation rules

SRX – Segmentation Rules Exchange

• From the SRX specification:

  • "…The purpose of the SRX format is to provide a standard method to describe segmentation rules that are being exchanged among tools and/or translation vendors"

  • "…is intended to enhance the TMX standard…"

Why SRX?

• Tool A

  • Semicolon is end of segment

  • This is a sentence; this is another sentence.

  • TM system sees two separate segments

• Tool B

  • Semicolon is NOT end of segment

  • This is a sentence; this is another sentence.

  • TM system sees one segment

  • No match from the TMX data

  • Match rate around 50%, usual setting around 70%
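The effect can be reproduced with a toy example. In the Python sketch below, Tool A's segments act as the translation memory and Tool B looks up its single, longer segment. The word-overlap measure word_match_rate is a deliberately simple assumption for illustration; commercial tools use their own fuzzy-matching algorithms.

```python
import difflib
import re

TEXT = "This is a sentence; this is another sentence."


def segment(text, break_on_semicolon):
    """Split after sentence-final punctuation, optionally also after semicolons."""
    pattern = r"(?<=[.!?;])\s+" if break_on_semicolon else r"(?<=[.!?])\s+"
    return [part for part in re.split(pattern, text) if part]


def word_match_rate(a, b):
    """Share of the longer segment's words that are matched, in order, by the other."""
    words_a, words_b = a.lower().split(), b.lower().split()
    blocks = difflib.SequenceMatcher(None, words_a, words_b).get_matching_blocks()
    matched = sum(block.size for block in blocks)
    return matched / max(len(words_a), len(words_b))


tool_a_memory = segment(TEXT, break_on_semicolon=True)       # two segments in the TMX
tool_b_segment = segment(TEXT, break_on_semicolon=False)[0]  # one segment to look up

best = max(word_match_rate(tool_b_segment, stored) for stored in tool_a_memory)
print(tool_a_memory, best)
```

With this measure the best candidate covers four of the eight words, so it stays below a 70% fuzzy threshold and the translator sees no match.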

Segmentation rules

• Rules that the tool applies to the text to be translated, to split it up into segments

  • paragraph

  • sentence

  • phrase

  • incomplete sentences in bulleted lists

  • single words (headings, "Note", "Attention")

Segmentation rules

• End of segment rules (common to the default settings of all tools)

  • Dot at the end of a sentence (not after known abbreviations)

  • Question mark, exclamation mark

  • Paragraph mark

  • Colon

• End of segment rules (different for different tools)

  • Semicolon

  • Tab character

  • Sub segments (index entries, footnotes, graphics)
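Rules of this kind are commonly written as ordered break / no-break rules, each with a pattern for the text before a candidate position and one for the text after it, which is also the model SRX uses. The Python sketch below is an illustration under that assumption; the rule list, the abbreviation set and the function name split_segments are hypothetical and far smaller than any real rule set.

```python
import re

# Each rule: (is_break, pattern for text before the position, pattern for text after it).
# Rules are tried in order; the first rule that matches a position decides.
DEFAULT_RULES = [
    (False, r"\b(?:e\.g|i\.e|etc|Dr|Mr)\.$", r"^\s"),  # no break after known abbreviations
    (True, r"[.?!:]$", r"^\s"),                        # break after . ? ! :
]
SEMICOLON_RULE = (True, r";$", r"^\s")                 # tool-specific extra rule


def split_segments(text, rules):
    """Apply ordered break / no-break rules at every position of the text."""
    segments, start = [], 0
    for pos in range(1, len(text)):
        before, after = text[:pos], text[pos:]
        for is_break, before_pattern, after_pattern in rules:
            if re.search(before_pattern, before) and re.search(after_pattern, after):
                if is_break:
                    segments.append(text[start:pos].strip())
                    start = pos
                break  # first matching rule wins, whether it breaks or not
    tail = text[start:].strip()
    if tail:
        segments.append(tail)
    return segments


SAMPLE = "Use a TM, e.g. TinyTM. It stores segments; it recycles them."
without_semicolon = split_segments(SAMPLE, DEFAULT_RULES)
with_semicolon = split_segments(SAMPLE, DEFAULT_RULES + [SEMICOLON_RULE])
print(without_semicolon)
print(with_semicolon)
```

Adding or removing a single rule changes the number of segments, which is exactly the difference between tools that the comparison of default rules illustrates.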

Comparison of default rules

             Workbench  Transit  DV                          SDLX                        Across
Colon        end        end      end                         no end                      no end
Semicolon    no end     end      end                         no end                      no end
Tab          end        no end   no end                      no end                      no end
Soft return  no end     no end   end in Word, no end in PPT  end in Word, no end in PPT  no end

What can SRX do, and what not?

• It can only show the segmentation rule settings at the time of export

• It cannot show any changes that have been applied to the segmentation rules during the use of the TM

• Sometimes a rule from system 1 cannot be re-created in system 2; the rule will then be ignored

TBX – TermBase Exchange

• From the working draft of the TBX specification:

  • TBX is an open, XML-based standard format for terminological data

  • TBX is designed to support the analysis, representation, dissemination and exchange of information from human-oriented terminological databases (termbases)

  • TBX is built on the basis of ISO 12620 (data categories) and ISO 12200 (MARTIF core structure)

TMX TBX

[Figure: annotated TBX entry. Callouts: global information in entry head; language ID (one per language section); administrative data of this language; information on term level; term in English; term in French]
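The entry / language / term levels named in the callouts map onto nested XML elements. The sketch below (Python, standard library) reads terms per language from a small sample that assumes the TBX core element names termEntry, langSet, tig and term; the sample entry, its id and the function name read_terms are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical TBX sample: one concept entry with an English and a French term.
SAMPLE_TBX = """<martif type="TBX" xml:lang="en">
  <text><body>
    <termEntry id="c1">
      <descrip type="subjectField">localization</descrip>
      <langSet xml:lang="en">
        <tig><term>translation memory</term></tig>
      </langSet>
      <langSet xml:lang="fr">
        <tig><term>mémoire de traduction</term></tig>
      </langSet>
    </termEntry>
  </body></text>
</martif>"""

# ElementTree exposes xml:lang under this namespace-qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_terms(tbx_text):
    """Return {entry id: {language: [terms]}} for every <termEntry>."""
    root = ET.fromstring(tbx_text)
    entries = {}
    for entry in root.iter("termEntry"):
        languages = {}
        for lang_set in entry.findall("langSet"):
            terms = [term.text for term in lang_set.iter("term")]
            languages[lang_set.get(XML_LANG)] = terms
        entries[entry.get("id")] = languages
    return entries


terms = read_terms(SAMPLE_TBX)
print(terms)
```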

Where could you use TBX?

• Exchange terminological data

  • From term base to source language checking tools

  • Between term bases of translation memory tools

  • From term base to terminology checking tools

  • From term base to terminology extraction tools (as stop word lists)

  • From term base to dictionary of a machine translation tool

• For indexing keywords in document management systems, content management systems, knowledge management systems

• Publishing terminological data on the intranet / Internet

• Optimization of search engines / text mining by searching for synonyms automatically

XLIFF – XML Localization Interchange File Format

• The XLIFF format aims to

  • separate localizable text from formatting (deal with one file format in translation instead of different processes to extract / filter / convert text from different file formats)

  • enable multiple tools to work on the source text

  • store information that is helpful in supporting a localization process (like metadata on versions of source and target segments)

• An XLIFF file is bilingual and can be the container for a number of individual files

• Each file element in an XLIFF file contains a header (project data such as contact information, project phases, pointers to reference material and information on the skeleton file) and a body section with the actual text segments

XLIFF

• XLIFF can carry several translation matches

• Additional fields can contain context, author, creation tool, history…

<trans-unit id="n1">
  <source>This is a sentence</source>
  <target xml:lang="de">Dies ist ein Satz</target>
  <alt-trans match-quality="100" tool="TM_System">
    <source>This is a sentence</source>
    <target xml:lang="de">Das ist ein Satz</target>
  </alt-trans>
  <alt-trans match-quality="70" tool="TM_System">
    <source>This is a short sentence</source>
    <target xml:lang="de">Dies ist ein kurzer Satz</target>
  </alt-trans>
</trans-unit>

<trans-unit id="1" resname="IDCANCEL" restype="button" coord="885014">
  <source xml:lang="en">Cancel</source>
</trans-unit>

• Information on the name of a button and its coordinates (for visual representation of the button in a localization tool)
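A tool that receives such a file can rank the alternative matches before offering them to the translator. The Python sketch below is a minimal illustration: it parses a bare trans-unit without the XLIFF namespace that a complete file would declare, and it assumes match-quality holds a number, optionally followed by a percent sign. The function name ranked_alternatives is hypothetical.

```python
import xml.etree.ElementTree as ET

TRANS_UNIT = """<trans-unit id="n1">
  <source>This is a sentence</source>
  <target xml:lang="de">Dies ist ein Satz</target>
  <alt-trans match-quality="100" tool="TM_System">
    <source>This is a sentence</source>
    <target xml:lang="de">Das ist ein Satz</target>
  </alt-trans>
  <alt-trans match-quality="70" tool="TM_System">
    <source>This is a short sentence</source>
    <target xml:lang="de">Dies ist ein kurzer Satz</target>
  </alt-trans>
</trans-unit>"""


def ranked_alternatives(unit):
    """Return (quality, source, target) for each <alt-trans>, best match first."""
    ranked = []
    for alt in unit.findall("alt-trans"):
        # Assumption: the attribute is numeric, optionally with a trailing "%".
        quality = float(alt.get("match-quality", "0").rstrip("%"))
        ranked.append((quality, alt.findtext("source"), alt.findtext("target")))
    return sorted(ranked, key=lambda item: item[0], reverse=True)


alternatives = ranked_alternatives(ET.fromstring(TRANS_UNIT))
for quality, source, target in alternatives:
    print(quality, source, "->", target)
```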

Where is XLIFF useful?

• Where experience with XML exists

• Projects contain many different file formats

  • All formats are converted to XLIFF for translation

• Different tools need to be used during localization

• Different translations (alt-trans) or languages needed as reference

Any idea why XLIFF should NOT be the cure for everything?

• Instead of developing parsers for different file formats (to read the file into a translation tool), developers now need to create parsers to convert those file formats to XLIFF

• Some file formats can already be dealt with (Office, HTML, XML…) – why should a new parser be created for those?

• XLIFF has its variants, like all XML-based files – how can you make sure that each tool can process any of the possible XLIFF flavours?

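The magic quadrant again, just to remember the distinction

[Figure: quadrant with the axes Open Standards vs. Proprietary ways and Open Source vs. Closed Source]

• Q1: Open-Closed (open standards, closed source) – Good

• Q2: Open-Open (open standards, open source) – Good

• Q3: Proprietary-Closed (proprietary ways, closed source) – Bad

• Q4: Proprietary-Open (proprietary ways, open source) – Wild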

6 Talking Legalese in OSS

OSS (Open Source Software)

• Copyleft and Permissive Licensing

  • GPL (General Public License by FSF)

  • BSD (Berkeley Software Distribution)

  • Apache License 2.0

• MS open source licensing

  • Ms-PL (Public License)

  • Ms-RL (Reciprocal License)

Derived/Derivative works vs. dynamic linkage

• Paradigmatic case is an app running on an OS

Contributions

Distribution

6 Talking Legalese in standards

RF (Royalty Free) vs. RAND (Reasonable and Non-Discriminatory)

Standards can be licensed under different terms

• RF

• RAND – to date there is no general consensus on whether RAND is Open or Proprietary

• Proprietary

By open standard we mean just RF standards

The opposite of Open Standard is any proprietary way of doing things, be it standardized

7 Open Tools

• Infrastructure baseline

  • PostgreSQL, MySQL, Open ACS, Petri Nets, Alfresco, jBoss

• TM Server

  • TinyTM

• Workflow capabilities through any CALPMS

  • Open: ]po[, .LRN. Closed source: AIT Projetex, LTC Worx, Plunet Business Manager, etc.

• CAT Clients

  • OmegaT, Metatexis, TinyTM Word macro, Multilizer, and whoever wants to implement the protocol

• Filters

  • OKAPI framework, mainstream XLIFF generators such as Sun or Adobe

8 Use Cases - odt + OmegaT

• Level 2 TMX support

• Has TinyTM connector in alpha

• Has plaintext glossary support

OpenOffice.org .odt

8 Use Cases - TinyTM

• TinyTM protocol is licensed under the "LGPL V2.1 or higher"

• Rest of the TinyTM code is licensed under the "GPL V2.0 or higher"

• Translation clients can be licensed under any license

• TinyTM documentation is licensed under the Creative Commons Attribution License

8 Use Cases - TinyTM - Status

• Alpha functionality

• Java TMX importer

• Mark up parsing logic

-------------------------

• Protocol freeze in public discussion on SourceForge

• Protocol features

  • Conservative extension with many important enhancements

8 Use Cases - TinyTM – new protocol features

• Industry and domain taxonomy

  • open to accommodate future TDA taxonomy research

• Inline markup handling

  1. stripped plain version

  2. normalized version with formatting placeholders

  3. TMX version with full TMX markup

• Advanced leverage functions

  • hash based context storage

  • hash and trigram enhanced fuzzy search
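To give an idea of what hash-based context storage and trigram-enhanced fuzzy search can look like, here is a small Python sketch. It is not the TinyTM implementation: the Jaccard overlap of character trigrams, the SHA-1 hash of the neighbouring segments, the in-memory MEMORY list and all function names are assumptions chosen for illustration.

```python
import hashlib


def trigrams(text):
    """Return the set of character trigrams of a lower-cased, padded string."""
    padded = "  " + text.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a, b):
    """Jaccard overlap of the two trigram sets: 1.0 for identical strings."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def context_hash(previous_segment, next_segment):
    """Hash the neighbouring segments so context can be stored and compared cheaply."""
    data = (previous_segment + "\x1f" + next_segment).encode("utf-8")
    return hashlib.sha1(data).hexdigest()


# Hypothetical in-memory TM: (source, target, hash of the context it was stored with).
MEMORY = [
    ("This is a sentence", "Dies ist ein Satz", context_hash("Title", "Next line")),
    ("This is a short sentence", "Dies ist ein kurzer Satz", context_hash("", "")),
]


def lookup(source, previous_segment, next_segment, threshold=0.5):
    """Return hits above the threshold, context matches first, then by similarity."""
    wanted = context_hash(previous_segment, next_segment)
    hits = []
    for stored_source, target, stored_context in MEMORY:
        score = trigram_similarity(source, stored_source)
        if score >= threshold:
            hits.append((stored_context == wanted, score, target))
    return sorted(hits, reverse=True)


results = lookup("This is a sentence", "Title", "Next line")
for same_context, score, target in results:
    print(same_context, round(score, 2), target)
```

Storing only a hash of the context keeps the memory small while still letting an exact match in the same context rank above an exact match found elsewhere.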

9 Discussion

Thanks for your attention


10 References

Open Standards (selective)

• XLIFF 1.2: http://docs.oasis-open.org/xliff/v1.2/os/xliff-core.html

• TMX 1.4b: http://www.lisa.org/fileadmin/standards/tmx1.4/tmx.htm

• SRX 2.0: http://www.lisa.org/fileadmin/standards/srx20.html

Open Tools (selective)

• OKAPI Framework: http://okapi.sourceforge.net

• OmegaT: http://sourceforge.net/projects/omegat

• OmegaT+: http://omegatplus.sourceforge.net

• Open ACS: http://openacs.org

• PostgreSQL: http://www.postgresql.org

• TinyTM: http://tinytm.sourceforge.net

10 References

Tools continued

• XML to XLIFF to XML: https://sourceforge.net/projects/xliffroundtrip

• TMX Compliance Kit (test for TMX certification): http://www.lisa.org/Translation-Memory-e.34.0.html

bull TBX CheckerhttpwwwlisaorgTBX-Resources6500html

Standards continued

• Unicode Standard Annex #29: http://www.unicode.org/reports/tr29/

• W3C ITS: http://www.w3.org/TR/its/


2 Definitions

• TMS / GMS

  ETMS – Enterprise TMS "from cradle to the grave"

  Computer Aided L10N Project Management System (CALPMS)

• Open Standards: XLIFF, TMX, TBX, SRX, etc.

• OSS: Open Source, Free Software vs. Freeware

• Open Source (Copy-left) Licensing vs. Permissive Licensing

3 Architecture

4 Strategy

[Figure: strategy diagram. Labels:]

• New business needs

• Change enabler

• Win – Win – Win: Translator, LSP of any size, Enterprise

• TinyTM, OmegaT, Open ACS, OKAPI framework, etc.

• Exponential growth of content

• Changing balance between published and user generated content

• Need for Continuous Translation

• Community translation, shared language data, massive online collaboration, translation automation

What is an open standard?

World Wide Web Consortium's definition:

• Transparency (design/due process is public, and all technical discussions and meeting minutes are archived and referenceable in decision making)

• Relevance (new standardization is started upon due analysis of the market needs, including a requirements phase, e.g. accessibility, multilingualism)

• Openness (anyone can participate: industry, individuals, public and government bodies, academia, on a worldwide scale)

• Impartiality and consensus (a neutral organization leading it, with equal weight for each participant)

• Availability (free access to the standard text, both during development and at final stage, translations, and clear IPR rules for implementation, allowing open source development in the case of Web technologies)

• Support (multiple implementations, ongoing process for testing, errata, revision, permanent access)

(Wikipedia, 2009)

Goal of open standards

• Interoperability of tools

• Vendors can concentrate on innovation in fields other than their proprietary formats

• Standardization of processes (translation of just one file format like XLIFF instead of DOC, HTML, InDesign, FM…)

Success of open standards

• Depends on the commercial usability

  • TMX – widespread; XLIFF – coming on strong; SRX – not widely used; TBX – slow; others – in the making (TBX Basic, GMX…)

5 Open Standards

• Why Open Standards in Open Source?

  • Implementing open standards seems an obvious success scenario for OSS development

  • XLIFF and TMX are open standards co-developed by our clients

  • Minimalist open standards implementation ensures desired functionality and is also legally safe

• LISA OSCAR TMX 1.4b, 1.5, 2.0

• OASIS XLIFF 1.1, 1.2, 1.2.1, 2.0


Page 8: A Guide to Open Standards and Open SourceOAXAL C. TMX Translation Memory Exchange •From the TMX specification: •…The purpose of the TMX format is to provide a standard method

3 Architecture

4 Strategy

New business

Needs

Change Enabler

Win

Win

Win

Translator

LSP of any size

Enterprise

TinyTM OmegaT Open ACS OKAPI framework Etc

Exponential growth of content

Changing balance between published and user generated content

Need for Continuous Translation

Community Translation Shared language

data Massive online

collaboration Translation automation

What is an open standard

World Wide Web Consortiums definitionbull Transparency (designdue process is public and all technical

discussions meeting minutes are archived and referencablein decision making)

bull Relevance (new standardization is started upon due analysis of the market needs including requirements phase eg accessibility multi-linguism)

bull Openness (anyone can participate industry individual public government bodies academia on a worldwide scale)

bull Impartiality and consensus (neutral org leading it with equal weight for each participant)

bull Availability (free access to the standard text both during development and at final stage translations and clear IPR rules for implementation allowing open source development in the case of Web technologies)

bull Support (multiple implementations ongoing process for testing errata revision permanent access)

Wikipedia 2009

Goal of open standards

bull Interoperability of toolsbull Vendors can concentrate on innovation in other fields than their proprietary formats

bull Standardization of processes (translation of just one file format like XLIFF instead of DOC HTML InDesign FMhellip)

Success of open standards

bull Depends on the commercial usabilitybull TMX ndash widespread XLIFF ndash coming on

strong SRX ndash not widely used TBX ndash slow others ndash in the making (TBX Basic GMXhellip)

5 Open Standards

bull Why Open Standards in Open Source

bull Implementing open standards seems obvious success scenario for OSS development

bull XLIFF and TMX are open standards co-developed by our clients

bull Minimalist open standards implementation ensures desired functionality and is also legally safe

bull LISA OSCAR TMX 14b 15 20

bull OASIS XLIFF 11 12 121 20

Open Standards OAXAL

copy A

ndrz

ej Z

ydro

n O

ASIS

OAXAL T

C

TMXTranslation Memory Exchange

bull From the TMX specification

bull hellipThe purpose of the TMX format is to provide a standard method to describe translation memory data that is being exchanged among tools andor translation vendors while introducing little or no loss of critical data during the processhellip

What is TMX

• It is an XML representation of translation memory data

• Header

• Body

<header
  creationtool="Déjà Vu"
  creationtoolversion="4"
  datatype="PlainText"
  segtype="sentence"
  adminlang="en-us"
  srclang="en-us"
  o-tmf="DVMDB"
>

creationtool: Déjà Vu, Transit, Trados, MemoQ

creationtoolversion: version / build number of the tool

datatype: HTML, SGML, RTF, Interleaf, Java…

segtype: basic segmentation

adminlang: default language for elements like <note>

srclang: source text language

o-tmf: original translation memory format (DVMDB – Déjà Vu database…)

What is TMX

• Body

<body>
  <tu creationdate="20030915T153704Z" creationid="USER">
    <tuv lang="EN-US">
      <seg>This is the first sentence</seg>
    </tuv>
    <tuv lang="DE-DE">
      <seg>Dies ist der erste Satz</seg>
    </tuv>
  </tu>
</body>

tu = translation unit; tuv lang = translation unit variant (language); seg = segment
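As a rough illustration of how little is needed to read such a file, the following Python sketch parses a small TMX document with the standard library. The enclosing tmx element and the sample string are assumptions added so that the example is complete; the function name read_tmx is hypothetical. The sketch accepts both the older lang attribute shown on the slide and xml:lang.

```python
import xml.etree.ElementTree as ET

# Expanded name of the xml:lang attribute; the slide uses plain "lang".
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Assumed sample file: the slide's header and body wrapped in a <tmx> root.
SAMPLE_TMX = """<tmx version="1.4">
  <header creationtool="Déjà Vu" creationtoolversion="4" datatype="PlainText"
          segtype="sentence" adminlang="en-us" srclang="en-us" o-tmf="DVMDB"/>
  <body>
    <tu creationdate="20030915T153704Z" creationid="USER">
      <tuv lang="EN-US"><seg>This is the first sentence</seg></tuv>
      <tuv lang="DE-DE"><seg>Dies ist der erste Satz</seg></tuv>
    </tu>
  </body>
</tmx>"""


def read_tmx(tmx_text):
    """Return (header attributes, list of {language: segment text})."""
    root = ET.fromstring(tmx_text)
    header = dict(root.find("header").attrib)
    units = []
    for tu in root.iter("tu"):
        variants = {}
        for tuv in tu.findall("tuv"):
            lang = tuv.get(XML_LANG) or tuv.get("lang")
            # itertext() also collects text nested inside inline elements
            variants[lang] = "".join(tuv.find("seg").itertext())
        units.append(variants)
    return header, units


header, units = read_tmx(SAMPLE_TMX)
print(header["creationtool"], units)
```

Each translation unit comes back as a dictionary keyed by language code, which mirrors the tu / tuv / seg nesting.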

What is TMX

• Depending on the tool that created the TMX file, it can be bilingual or multilingual

• Importing a multilingual TMX file into a bilingual project will only import the relevant languages
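The import behaviour described above can be sketched as follows. This is a minimal illustration, not how any particular tool does it: the sample file, the function name import_pair and the case-insensitive language comparison are all assumptions, and the header is reduced to a single attribute for brevity.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Assumed multilingual sample; the second unit has no German variant.
MULTILINGUAL_TMX = """<tmx version="1.4"><header srclang="en-us"/><body>
  <tu>
    <tuv xml:lang="EN-US"><seg>Cancel</seg></tuv>
    <tuv xml:lang="DE-DE"><seg>Abbrechen</seg></tuv>
    <tuv xml:lang="FR-FR"><seg>Annuler</seg></tuv>
  </tu>
  <tu>
    <tuv xml:lang="EN-US"><seg>Save</seg></tuv>
    <tuv xml:lang="FR-FR"><seg>Enregistrer</seg></tuv>
  </tu>
</body></tmx>"""


def import_pair(tmx_text, source_lang, target_lang):
    """Keep only units that contain both project languages."""
    pairs = []
    for tu in ET.fromstring(tmx_text).iter("tu"):
        by_lang = {}
        for tuv in tu.findall("tuv"):
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            by_lang[lang] = "".join(tuv.find("seg").itertext())
        source = by_lang.get(source_lang.lower())
        target = by_lang.get(target_lang.lower())
        if source is not None and target is not None:
            pairs.append((source, target))
    return pairs


print(import_pair(MULTILINGUAL_TMX, "en-us", "de-de"))
```

Variants in other languages are simply skipped, and units that lack one of the two project languages are not imported at all.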

Levels of TMX

• Level 1
  • Plain text only (sufficient for data coming from software localization tools)

• Level 2
  • Text plus formatting (data coming from translation memory tools used for translation of documentation)

To move formatting and text from one tool to the other, both tools need to be Level 2 compliant

Level 1

• Formatting that is applied to the source and target text of a translation unit is not exported to the TMX file, only pure text

• Original:

  • This sentence has some formatting

• In TMX:

  • This sentence has some formatting

Level 2

• Formatting that is applied to the source and target text of a translation unit is exported to the TMX file

• Different tools use different ways of encoding that information (placeholders or actual formatting information)

Level 2

MemoQ – Word DOC with formatting

<seg>This is the <bpt i="1" type="bold"></bpt>first<ept i="1"></ept> sentence this is <bpt i="2" type="ulined"></bpt>another<ept i="2"></ept> sentence</seg>

Trados 2009 – Word DOC with formatting

<seg>This is the <bpt i="1" type="Bold" />first<ept i="1" /> sentence this is <bpt i="2" type="Underline" />another<ept i="2" /> sentence</seg>

Trados 2007 / 8.2 / 8.3 – Word DOC with formatting

<seg>This is the <bpt i="1">&lt;cf bold=&quot;on&quot;&gt;</bpt>first<ept i="1">&lt;/cf&gt;</ept> sentence this is <bpt i="2">&lt;cf underlinestyle=&quot;single&quot;&gt;</bpt>another<ept i="2">&lt;/cf&gt;</ept> sentence</seg>

Level 2

MemoQ – HTML file with link

<seg>Text with a link to <bpt i="1" type="link">&lt;a href = &quot;http://www.samplehtml.com/page1.htm&quot;&gt;</bpt>another page<ept i="1">&lt;/a&gt;</ept></seg>

Trados 2009 – HTML file with link

<seg>Text with a link to <bpt i="1" type="19" x="1" />another page<ept i="1" /></seg>

Trados 2007 / 8.2 / 8.3 – HTML file with link

<seg>Text with a link to <bpt i="1">&lt;a href=&quot;http://www.samplehtml.com/page1.htm&quot;&gt;</bpt>another page<ept i="1">&lt;/a&gt;</ept></seg>

OmegaT – HTML file with link

OmegaT internal format: <seg>Text with a link to &lt;a0&gt;another page&lt;/a0&gt;</seg>

TMX Level 2 format: <seg>Text with a link to <bpt i="0" x="0">&lt;a0&gt;</bpt>another page<ept i="0">&lt;/a0&gt;</ept></seg>
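The relationship between the two OmegaT lines can be sketched in a few lines of Python. This only mimics the input and output shown on the slide; it is not OmegaT's actual code. The function name to_level2 and the tag pattern (lowercase letters followed by a number) are assumptions, and the sketch expects the text around the tags to contain no characters that need XML escaping.

```python
import re
from html import escape

# Assumed shape of an internal placeholder tag: <a0>, </a0>, <f12>, ...
TAG = re.compile(r"<(/?)([a-z]+)(\d+)>")


def to_level2(segment):
    """Wrap placeholder tags in TMX Level 2 <bpt>/<ept> elements."""
    def wrap(match):
        closing, _name, number = match.groups()
        inner = escape(match.group(0))  # <a0> becomes &lt;a0&gt;
        if closing:
            return '<ept i="%s">%s</ept>' % (number, inner)
        return '<bpt i="%s" x="%s">%s</bpt>' % (number, number, inner)
    return TAG.sub(wrap, segment)


print(to_level2("Text with a link to <a0>another page</a0>"))
```

Because only a numbered placeholder is stored, the target tool learns that a paired tag exists at this position but not that it was a link.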

Level 2

MemoQ – InDesign

<seg>InDesign text with <bpt i="1" type="bold"></bpt>formatting in bold<ept i="1"></ept></seg>

Trados 2009 – InDesign

<seg>InDesign text with <bpt i="1" type="pt16" x="1" />formatting in bold<ept i="1" /></seg>

Trados 2007 / 8.2 / 8.3 – InDesign

<seg>InDesign text with <bpt i="1">&lt;cf ptfs=&quot;c_Bold&quot;&gt;</bpt>formatting in bold<ept i="1">&lt;/cf&gt;</ept></seg>

Implications of different tags for formatting

• Tools that use placeholder tags do not include the actual formatting information in the TMX file

• Other tools might only be able to re-use the text, especially if the formatting is only applied in the target segment but not in the source

• The result of the exchange would then be the same as with TMX Level 1 (text only)

• TMX files which carry the actual formatting information will yield better matches in other tools that can read this information
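The fallback to text-only re-use can be illustrated with a short Python sketch that drops the inline elements of a Level 2 segment. The function name plain_text is hypothetical, and the set of element names treated as formatting codes (bpt, ept, ph, it, ut) is an assumption based on the TMX inline elements; any other child element is assumed to wrap translatable text.

```python
import xml.etree.ElementTree as ET

# Inline elements whose content is formatting code, not translatable text.
INLINE_CODES = {"bpt", "ept", "ph", "it", "ut"}


def plain_text(seg_xml):
    """Reduce a Level 2 <seg> to the text a Level 1 import would keep."""
    seg = ET.fromstring(seg_xml)
    parts = [seg.text or ""]
    for child in seg:
        if child.tag not in INLINE_CODES:
            # other inline elements are assumed to carry translatable text
            parts.append("".join(child.itertext()))
        parts.append(child.tail or "")
    return "".join(parts)


placeholder_style = ('<seg>This is the <bpt i="1" type="bold"></bpt>first'
                     '<ept i="1"></ept> sentence</seg>')
full_markup_style = ('<seg>This is the <bpt i="1">&lt;cf bold=&quot;on&quot;&gt;</bpt>first'
                     '<ept i="1">&lt;/cf&gt;</ept> sentence</seg>')
print(plain_text(placeholder_style) == plain_text(full_markup_style))
```

Both encodings collapse to the same plain sentence, which is why a tool that cannot interpret the other tool's tags still gets a usable, if unformatted, match.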

Where do you use TMX?

• Transferring data between different translation memory tools

• Checking tools, QA tools

• TM maintenance tools

• Basis for bilingual term extraction

Reusing TMX data

• Although translation memory tools share the same basic idea (storing source-target language pairs and recycling translations), they have realized it in different ways

• Exchange with TMX works, but there is an issue that can lower the match rates nonetheless… the segmentation rules

SRX – Segmentation Rules Exchange

• From the SRX specification:

• …The purpose of the SRX format is to provide a standard method to describe segmentation rules that are being exchanged among tools and/or translation vendors

• …is intended to enhance the TMX standard…

Why SRX?

• Tool A
  • Semicolon is end of segment
  • This is a sentence; this is another sentence.
  • TM system sees two separate segments

• Tool B
  • Semicolon is NOT end of segment
  • This is a sentence; this is another sentence.
  • TM system sees one segment
  • No match from the TMX data
  • Match rate around 50%, usual setting around 70%
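The effect can be reproduced with a small Python sketch. The segment function is a deliberately simplified stand-in for a tool's segmentation settings, and difflib's similarity ratio is only an assumed substitute for a real fuzzy-match score; each TM tool computes match rates in its own way, so the number printed here will not equal any tool's figure.

```python
import re
from difflib import SequenceMatcher


def segment(text, break_chars):
    """Split after any of break_chars when whitespace follows."""
    pattern = "(?<=[" + re.escape(break_chars) + "])\\s+"
    return [part for part in re.split(pattern, text) if part]


def best_match(segment_text, memory):
    """Return (similarity, stored segment) for the closest stored segment."""
    return max((SequenceMatcher(None, segment_text, stored).ratio(), stored)
               for stored in memory)


text = "This is a sentence; this is another sentence."
tool_a_memory = segment(text, ".;")    # Tool A: semicolon ends a segment
tool_b_segments = segment(text, ".")   # Tool B: semicolon does not
score, closest = best_match(tool_b_segments[0], tool_a_memory)
print(len(tool_a_memory), len(tool_b_segments), round(score, 2))
```

Tool B looks up one long segment in a memory that holds two short ones, so the best it can find is a partial match, even though every word was translated before.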

Segmentation rules

• Rules that the tool applies to the text to translate, to split it up into segments

  • paragraph

  • sentence

  • phrase

  • incomplete sentences in bulleted lists

  • single words (headings, “Note”, “Attention”)

Segmentation rules

• End of segment rules (common to the default settings of all tools)
  • Dot at the end of a sentence (not after known abbreviations)
  • Question mark, exclamation mark
  • Paragraph mark
  • Colon

• End of segment rules (different for different tools)
  • Semicolon
  • Tab character
  • Sub segments (index entries, footnotes, graphics)
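A rule set of this kind, including the exception for known abbreviations, can be sketched in Python as an ordered list of break and no-break rules, each with a pattern for the text before and after a candidate position. This follows the general idea behind SRX rules, where the first matching rule decides; the concrete patterns, the abbreviation list and the function name are assumptions for illustration.

```python
import re

# (is_break, pattern before the position, pattern after the position)
# Rules are tried in order; the first one that matches decides.
RULES = [
    (False, r"\b(?:e\.g|i\.e|etc)\.", r"\s"),  # no break after known abbreviations
    (True, r"[.?!:]", r"\s"),                  # break after dot, ?, ! and colon
]


def segment(text, rules=RULES):
    segments, start = [], 0
    for pos in range(1, len(text)):
        for is_break, before, after in rules:
            if (re.search("(?:" + before + r")\Z", text[:pos])
                    and re.match(after, text[pos:])):
                if is_break:
                    segments.append(text[start:pos].strip())
                    start = pos
                break  # first matching rule wins, later rules are skipped
    tail = text[start:].strip()
    if tail:
        segments.append(tail)
    return segments


print(segment("See e.g. the manual. Is it clear? Yes: it is."))
```

Moving a character such as the semicolon into or out of the break rule is all it takes to reproduce the differences between tools.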

Comparison of default rules

             Workbench  Transit  DV                          SDLX                        Across

Colon        end        end      end                         no end                      no end

Semicolon    no end     end      end                         no end                      no end

Tab          end        no end   no end                      no end                      no end

Soft return  no end     no end   end in Word, no end in PPT  end in Word, no end in PPT  no end

What can SRX do and what not?

• It can only show the segmentation rule settings at the time of export

• It cannot show any changes that have been applied to the segmentation rules during the use of the TM

• Sometimes a rule from system 1 cannot be re-created in system 2; that rule will then be ignored

TBX – TermBase Exchange

• From the working draft of the TBX specification:
  • TBX is an open, XML-based standard format for terminological data

• TBX is designed to support the analysis, representation, dissemination and exchange of information from human-oriented terminological databases (termbases)

• TBX is built on the basis of ISO 12620 (data categories) and ISO 12200 (MARTIF core structure)

TMX TBX

[Figure: annotated terminology entry. Callouts: global information in entry head; language ID (one per language); term in English; term in French; information on term level; administrative data of this language]

Where could you use TBX?

• Exchange terminological data
  • From term base to source language checking tools
  • Between term bases of translation memory tools
  • From term base to terminology checking tools
  • From term base to terminology extraction tools (as stop word lists)
  • From term base to dictionary of a machine translation tool

• For indexing keywords in document management systems, content management systems, knowledge management systems

• Publishing terminological data on the Intranet / Internet

• Optimization of search engines / text mining by searching for synonyms automatically

XLIFF – XML Localization Interchange File Format

• The XLIFF format aims to
  • separate localizable text from formatting (deal with one file format in translation instead of different processes to extract, filter, convert text from different file formats)
  • enable multiple tools to work on the source text
  • store information that is helpful in supporting a localization process (like metadata on versions of source and target segments)

• An XLIFF file is bilingual and can be the container for a number of individual files

• Each file element in an XLIFF file contains a header (project data such as contact information, project phases, pointers to reference material and information on the skeleton file) and a body section with the actual text segments

XLIFF

• XLIFF can carry several translation matches

• Additional fields can contain context, author, creation tool, history…

<trans-unit id="n1">
  <source>This is a sentence</source>
  <target xml:lang="de">Dies ist ein Satz</target>
  <alt-trans match-quality="100" tool="TM_System">
    <source>This is a sentence</source>
    <target xml:lang="de">Das ist ein Satz</target>
  </alt-trans>
  <alt-trans match-quality="70" tool="TM_System">
    <source>This is a short sentence</source>
    <target xml:lang="de">Dies ist ein kurzer Satz</target>
  </alt-trans>
</trans-unit>

<trans-unit id="1" resname="IDCANCEL" restype="button" coord="885014">
  <source xml:lang="en">Cancel</source>
</trans-unit>

• Information on the name of a button and its coordinates (for visual representation of the button in a localization tool)
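How a tool might read the alternative matches of a translation unit can be sketched in Python. The function name ranked_matches is hypothetical, and the snippet is parsed without a namespace, as on the slide; elements in a complete XLIFF 1.2 file live in the namespace urn:oasis:names:tc:xliff:document:1.2, so their names would need to be qualified accordingly.

```python
import xml.etree.ElementTree as ET

TRANS_UNIT = """<trans-unit id="n1">
  <source>This is a sentence</source>
  <target xml:lang="de">Dies ist ein Satz</target>
  <alt-trans match-quality="100" tool="TM_System">
    <source>This is a sentence</source>
    <target xml:lang="de">Das ist ein Satz</target>
  </alt-trans>
  <alt-trans match-quality="70" tool="TM_System">
    <source>This is a short sentence</source>
    <target xml:lang="de">Dies ist ein kurzer Satz</target>
  </alt-trans>
</trans-unit>"""


def ranked_matches(unit_xml):
    """Return the alt-trans proposals as (quality, source, target), best first."""
    unit = ET.fromstring(unit_xml)
    matches = []
    for alt in unit.findall("alt-trans"):
        # match-quality is free text; a trailing percent sign is tolerated here
        quality = float(alt.get("match-quality", "0").rstrip("%"))
        matches.append((quality, alt.findtext("source"), alt.findtext("target")))
    return sorted(matches, reverse=True)


for quality, source, target in ranked_matches(TRANS_UNIT):
    print(quality, source, "->", target)
```

The current target of the unit stays untouched; the alt-trans entries are reference material that travels with the file from tool to tool.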

Where is XLIFF useful?

• Where experience with XML exists

• Projects contain many different file formats

• All formats are converted to XLIFF for translation

• Different tools need to be used during localization

• Different translations (alt-trans) or languages are needed as reference

Any idea why XLIFF should NOT be the cure for everything?

• Instead of developing parsers for different file formats (to read the file into a translation tool), developers now need to create parsers to convert those file formats to XLIFF

• Some file formats can already be dealt with (Office, HTML, XML…) – why should a new parser be created for those?

• XLIFF has its variants, like all XML-based files – how can you make sure that each tool can process any of the possible XLIFF flavours?

The magic quadrant again, just to remember the distinction

Q1: Open-Closed (Open Standards, Closed Source) – Good

Q2: Open-Open (Open Standards, Open Source) – Good

Q3: Proprietary-Closed (Proprietary ways, Closed Source) – Bad

Q4: Proprietary-Open (Proprietary ways, Open Source) – Wild

6 Talking Legalese in OSS

OSS (Open Source Software)

• Copyleft and Permissive Licensing

• GPL (General Public License by FSF)

• BSD (Berkeley Software Distribution)

• Apache License 2.0

• MS open source licensing

• Ms-PL (Public License)

• Ms-RL (Reciprocal License)

Derived/Derivative works vs. dynamic linkage

• Paradigmatic case is an app running on an OS

Contributions

Distribution

6 Talking Legalese in standards

RF (Royalty Free) vs. RAND (Reasonable and Non-Discriminatory)

Standards can be licensed under different terms:

• RF

• RAND – there is no general consensus to date on whether RAND is Open or Proprietary

• Proprietary

By open standard we mean just RF standards

The opposite of Open Standard is any proprietary way of doing things, be it standardized

7 Open Tools

• Infrastructure baseline
  • PostgreSQL, MySQL, Open ACS, Petri Nets, Alfresco, jBoss

• TM Server
  • TinyTM

• Workflow capabilities through any CALPMS
  • Open: ]po[, .LRN; closed source: AIT Projetex, LTC Worx, Plunet Business Manager, etc.

• CAT Clients
  • OmegaT, Metatexis, TinyTM Word macro, Multilizer, and whoever wants to implement the protocol

• Filters
  • OKAPI framework, mainstream XLIFF generators such as Sun or Adobe

8 Use Cases - odt + OmegaT

• Level 2 TMX support

• Has TinyTM connector in alpha

• Has plaintext glossary support

OpenOffice.org .odt

8 Use cases - TinyTM

• TinyTM protocol is licensed under the “LGPL V2.1 or higher”

• Rest of the TinyTM code is licensed under the “GPL V2.0 or higher”

• Translation clients can be licensed under any license

• TinyTM documentation is licensed under the Creative Commons Attribution License

8 Use Cases - TinyTM - Status

• Alpha functionality

  • Java TMX importer

  • Mark up parsing logic

-------------------------

• Protocol freeze in public discussion on Sourceforge

• Protocol features
  • Conservative extension with many important enhancements

8 Use Cases - TinyTM – new protocol features

• Industry and domain taxonomy
  • open to accommodate future TDA taxonomy research

• Inline markup handling
  1. stripped plain version
  2. normalized version with formatting placeholders
  3. TMX version with full TMX markup

• Advanced leverage functions
  • hash based context storage
  • hash and trigram enhanced fuzzy search
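The two leverage functions can be illustrated with generic techniques in Python. This is not TinyTM's implementation, which the slides do not describe: the trigram similarity (shared character trigrams divided by all trigrams), the SHA-1 context key built from the neighbouring segments, and all function names are assumptions chosen for the sketch.

```python
import hashlib


def trigrams(text):
    """Set of character trigrams of the lower-cased, padded text."""
    padded = "  " + text.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def similarity(a, b):
    """Share of trigrams that two strings have in common (0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def context_hash(previous_segment, next_segment):
    """Compact key for the surroundings of a segment."""
    joined = previous_segment + "\x1f" + next_segment
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


def best_match(query, memory):
    """Source segment in the memory that is most similar to the query."""
    return max(memory, key=lambda source: similarity(query, source))


memory = {
    "This is a sentence": "Dies ist ein Satz",
    "Completely different text": "Ganz anderer Text",
}
closest = best_match("This is a short sentence", memory)
print(closest, "->", memory[closest], round(similarity("This is a short sentence", closest), 2))
```

Storing a hash of the context instead of the context itself keeps the record small, while two identical source segments in different surroundings can still be told apart.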

9 Discussion

Thanks for your attention


10 References

Open Standards (selective)

• XLIFF 1.2: http://docs.oasis-open.org/xliff/v1.2/os/xliff-core.html

• TMX 1.4b: http://www.lisa.org/fileadmin/standards/tmx1.4/tmx.htm

• SRX 2.0: http://www.lisa.org/fileadmin/standards/srx20.html

Open Tools (selective)

• OKAPI Framework: http://okapi.sourceforge.net

• OmegaT: http://sourceforge.net/projects/omegat

• OmegaT+: http://omegatplus.sourceforge.net

• Open ACS: http://openacs.org

• PostgreSQL: http://www.postgresql.org

• TinyTM: http://tinytm.sourceforge.net

10 References

Tools continued

• XML to XLIFF to XML: https://sourceforge.net/projects/xliffroundtrip

• TMX Compliance Kit (test for TMX certification): http://www.lisa.org/Translation-Memory-e.34.0.html

• TBX Checker: http://www.lisa.org/TBX-Resources.650.0.html

Standards continued

• Unicode Standard Annex #29: http://www.unicode.org/reports/tr29

• W3C ITS: http://www.w3.org/TR/its


Page 10: A Guide to Open Standards and Open SourceOAXAL C. TMX Translation Memory Exchange •From the TMX specification: •…The purpose of the TMX format is to provide a standard method

What is an open standard

World Wide Web Consortiums definitionbull Transparency (designdue process is public and all technical

discussions meeting minutes are archived and referencablein decision making)

bull Relevance (new standardization is started upon due analysis of the market needs including requirements phase eg accessibility multi-linguism)

bull Openness (anyone can participate industry individual public government bodies academia on a worldwide scale)

bull Impartiality and consensus (neutral org leading it with equal weight for each participant)

bull Availability (free access to the standard text both during development and at final stage translations and clear IPR rules for implementation allowing open source development in the case of Web technologies)

bull Support (multiple implementations ongoing process for testing errata revision permanent access)

Wikipedia 2009

Goal of open standards

• Interoperability of tools
• Vendors can concentrate on innovation in fields other than their proprietary formats
• Standardization of processes (translation of just one file format, like XLIFF, instead of DOC, HTML, InDesign, FM…)

Success of open standards

• Depends on the commercial usability
• TMX – widespread; XLIFF – coming on strong; SRX – not widely used; TBX – slow; others – in the making (TBX Basic, GMX…)

5 Open Standards

• Why Open Standards in Open Source?
  • Implementing open standards seems an obvious success scenario for OSS development
  • XLIFF and TMX are open standards co-developed by our clients
  • A minimalist open standards implementation ensures the desired functionality and is also legally safe
• LISA OSCAR TMX 1.4b, 1.5, 2.0
• OASIS XLIFF 1.1, 1.2, 1.2.1, 2.0

Open Standards: OAXAL

(Diagram © Andrzej Zydroń, OASIS OAXAL TC)

TMX – Translation Memory Exchange

• From the TMX specification:
  • "…The purpose of the TMX format is to provide a standard method to describe translation memory data that is being exchanged among tools and/or translation vendors, while introducing little or no loss of critical data during the process…"

What is TMX?

• It is an XML representation of translation memory data
  • Header
  • Body

<header
  creationtool="Déjà Vu"         (Déjà Vu, Transit, Trados, MemoQ)
  creationtoolversion="4"        (version / build number of the tool)
  datatype="PlainText"           (HTML, SGML, RTF, Interleaf, Java…)
  segtype="sentence"             (basic segmentation)
  adminlang="en-us"              (default language for elements like <note>)
  srclang="en-us"                (source text language)
  o-tmf="DVMDB"                  (original translation memory format: DVMDB – Déjà Vu database…)
>

What is TMX?

• Body

<body>
  <tu creationdate="20030915T153704Z" creationid="USER">
    <tuv lang="EN-US">
      <seg>This is the first sentence</seg>
    </tuv>
    <tuv lang="DE-DE">
      <seg>Dies ist der erste Satz</seg>
    </tuv>
  </tu>
</body>

tu = translation unit; tuv lang = translation unit variant (language); seg = segment

What is TMX?

• Depending on the tool that created the TMX file, it can be bilingual or multilingual
• Importing a multilingual TMX file into a bilingual project will only import the relevant languages
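The header/body structure and the "only the relevant languages are imported" behaviour can be illustrated with a minimal Python sketch. It is not how any particular tool imports TMX: the function name `language_pairs`, the wrapping `<tmx>` element and the French variant are assumptions added for illustration, and both the `lang` attribute shown on the slide and `xml:lang` are accepted.

```python
import xml.etree.ElementTree as ET

# Key under which ElementTree exposes the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Sample document modelled on the slide; the FR-FR variant is an added assumption.
SAMPLE = """<tmx version="1.4">
  <header creationtool="Déjà Vu" creationtoolversion="4" datatype="PlainText"
          segtype="sentence" adminlang="en-us" srclang="en-us" o-tmf="DVMDB"/>
  <body>
    <tu creationdate="20030915T153704Z" creationid="USER">
      <tuv lang="EN-US"><seg>This is the first sentence</seg></tuv>
      <tuv lang="DE-DE"><seg>Dies ist der erste Satz</seg></tuv>
      <tuv lang="FR-FR"><seg>Ceci est la première phrase</seg></tuv>
    </tu>
  </body>
</tmx>"""


def language_pairs(tmx_text, source_lang, target_lang):
    """Return (source, target) text pairs for one language pair only."""
    root = ET.fromstring(tmx_text)
    src, tgt = source_lang.lower(), target_lang.lower()
    pairs = []
    for tu in root.iter("tu"):
        by_lang = {}
        for tuv in tu.findall("tuv"):
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() flattens the segment, inline elements included.
                by_lang[lang] = "".join(seg.itertext())
        # Units lacking either language are skipped; other languages are ignored.
        if src in by_lang and tgt in by_lang:
            pairs.append((by_lang[src], by_lang[tgt]))
    return pairs


header = ET.fromstring(SAMPLE).find("header")
print(header.get("creationtool"), header.get("srclang"))
print(language_pairs(SAMPLE, "en-us", "de-de"))
```

Asking for `en-us`/`de-de` leaves the French variant untouched, which mirrors a bilingual project importing from a multilingual file.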

Levels of TMX

• Level 1
  • Plain text only (sufficient for data coming from software localization tools)
• Level 2
  • Text plus formatting (data coming from translation memory tools used for translation of documentation)

To move formatting and text from one tool to the other, both tools need to be Level 2 compliant

Level 1

• Formatting that is applied to the source and target text of a translation unit is not exported to the TMX file, only pure text
• Original
  • This sentence has some formatting
• In TMX
  • This sentence has some formatting

Level 2

• Formatting that is applied to the source and target text of a translation unit is exported to the TMX file
• Different tools use different ways of encoding that information (placeholders or actual formatting information)

Level 2

MemoQ – Word DOC with formatting
<seg>This is the <bpt i="1" type="bold"></bpt>first<ept i="1"></ept> sentence this is <bpt i="2" type="ulined"></bpt>another<ept i="2"></ept> sentence</seg>

Trados 2009 – Word DOC with formatting
<seg>This is the <bpt i="1" type="Bold" />first<ept i="1" /> sentence this is <bpt i="2" type="Underline" />another<ept i="2" /> sentence</seg>

Trados 2007 (8.2, 8.3) – Word DOC with formatting
<seg>This is the <bpt i="1">&lt;cf bold=&quot;on&quot;&gt;</bpt>first<ept i="1">&lt;/cf&gt;</ept> sentence this is <bpt i="2">&lt;cf underlinestyle=&quot;single&quot;&gt;</bpt>another<ept i="2">&lt;/cf&gt;</ept> sentence</seg>

Level 2

MemoQ – HTML file with link
<seg>Text with a link to <bpt i="1" type="link">&lt;a href = &quot;http://www.samplehtml.com/page1.htm&quot;&gt;</bpt>another page<ept i="1">&lt;/a&gt;</ept></seg>

Trados 2009 – HTML file with link
<seg>Text with a link to <bpt i="1" type="19" x="1" />another page<ept i="1" /></seg>

Trados 2007 (8.2, 8.3) – HTML file with link
<seg>Text with a link to <bpt i="1">&lt;a href=&quot;http://www.samplehtml.com/page1.htm&quot;&gt;</bpt>another page<ept i="1">&lt;/a&gt;</ept></seg>

OmegaT – HTML file with link
OmegaT internal format: <seg>Text with a link to &lt;a0&gt;another page&lt;/a0&gt;</seg>
TMX Level 2 format: <seg>Text with a link to <bpt i="0" x="0">&lt;a0&gt;</bpt>another page<ept i="0">&lt;/a0&gt;</ept></seg>

Level 2

MemoQ – InDesign
<seg>InDesign Text with <bpt i="1" type="bold"></bpt>formatting in bold<ept i="1"></ept></seg>

Trados 2009 – InDesign
<seg>InDesign text with <bpt i="1" type="pt16" x="1" />formatting in bold<ept i="1" /></seg>

Trados 2007 (8.2, 8.3) – InDesign
<seg>InDesign text with <bpt i="1">&lt;cf ptfs=&quot;c_Bold&quot;&gt;</bpt>formatting in bold<ept i="1">&lt;/cf&gt;</ept></seg>

Implications of different tags for formatting

• Tools that use placeholder tags do not include the actual formatting information in the TMX file
• Other tools might only be able to re-use the text, especially if the formatting is only applied in the target segment but not in the source
• The result of the exchange would then be the same as with TMX Level 1 (text only)
• TMX files which carry the actual formatting information will yield better matches in other tools that can read this information
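The fallback to "text only" can be sketched in a few lines of Python: when a receiving tool cannot interpret another tool's inline codes, what remains is the segment text with the code-carrying elements dropped. This is an illustrative sketch, not the behaviour of any specific tool; the function name `plain_text` is hypothetical, and the set of elements treated as code carriers (`bpt`, `ept`, `it`, `ph`, `ut`) follows the TMX 1.4b inline markup.

```python
import xml.etree.ElementTree as ET

# Inline elements whose content is native formatting code, not translatable text.
CODE_ELEMENTS = {"bpt", "ept", "it", "ph", "ut"}


def plain_text(seg):
    """Return the Level 1 style text of a <seg>, dropping inline codes."""
    parts = [seg.text or ""]
    for child in seg:
        if child.tag not in CODE_ELEMENTS:
            # Elements such as <hi> wrap real text, so descend into them.
            parts.append(plain_text(child))
        # Text following the inline element always belongs to the segment.
        parts.append(child.tail or "")
    return "".join(parts)


# Two encodings of the same bold word, modelled on the slide examples.
typed = ET.fromstring(
    '<seg>This is the <bpt i="1" type="bold"></bpt>first<ept i="1"></ept> sentence</seg>'
)
native = ET.fromstring(
    '<seg>This is the <bpt i="1">&lt;cf bold=&quot;on&quot;&gt;</bpt>first'
    '<ept i="1">&lt;/cf&gt;</ept> sentence</seg>'
)

print(plain_text(typed))
print(plain_text(native))
```

Both encodings reduce to the same plain text, which is why text can still be leveraged across tools even when the formatting cannot.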

Where do you use TMX?

• Transferring data between different translation memory tools
• Checking tools, QA tools
• TM maintenance tools
• Basis for bilingual term extraction

Reusing TMX data

• Although translation memory tools share the same basic idea (storing source-target language pairs and recycling translations), they have implemented it in different ways
• Exchange with TMX works, but there is an issue that can lower the match rates nonetheless… the segmentation rules

SRX – Segmentation Rules Exchange

• From the SRX specification:
  • "…The purpose of the SRX format is to provide a standard method to describe segmentation rules that are being exchanged among tools and/or translation vendors"
  • "…is intended to enhance the TMX standard…"

Why SRX?

• Tool A
  • Semicolon is end of segment
  • This is a sentence; this is another sentence.
  • TM system sees two separate segments
• Tool B
  • Semicolon is NOT end of segment
  • This is a sentence; this is another sentence.
  • TM system sees one segment
  • No match from the TMX data
  • Match rate around 50%, usual setting around 70%
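The effect of differing segmentation rules can be reproduced with a small Python sketch. It assumes a simplified segmenter that breaks after a configurable set of characters, and it uses `difflib.SequenceMatcher` purely as a stand-in for a tool's fuzzy match algorithm, so the scores will not equal the percentages a real TM tool reports. The names `segment` and `best_match` are hypothetical.

```python
import re
from difflib import SequenceMatcher


def segment(text, end_chars):
    """Split text after any character in end_chars that is followed by whitespace."""
    pattern = r"(?<=[" + re.escape(end_chars) + r"])\s+"
    return [s for s in re.split(pattern, text.strip()) if s]


def best_match(segment_text, memory):
    """Highest similarity (0.0 to 1.0) between a segment and any stored segment."""
    return max(SequenceMatcher(None, segment_text, stored).ratio() for stored in memory)


TEXT = "This is a sentence; this is another sentence."

# Tool A treats the semicolon as end of segment and fills the memory.
tool_a_memory = segment(TEXT, ".;")

# Tool B does not, so the same text arrives as a single longer segment.
tool_b_segments = segment(TEXT, ".")

scores = [best_match(s, tool_a_memory) for s in tool_b_segments]
print(tool_a_memory)
print(tool_b_segments)
print(scores)
```

Although the text is identical, Tool B never finds an exact match, because it compares one long segment against two shorter stored ones.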

Segmentation rules

• Rules that the tool applies to the text to translate, to split it up into segments
  • paragraph
  • sentence
  • phrase
  • incomplete sentences in bulleted lists
  • single words (headings, "Note", "Attention")

Segmentation rules

• End of segment rules (common to the default settings of all tools)
  • Dot at the end of a sentence (not after known abbreviations)
  • Question mark, exclamation mark
  • Paragraph mark
  • Colon
• End of segment rules (different for different tools)
  • Semicolon
  • Tab character
  • Sub segments (index entries, footnotes, graphics)

Comparison of default rules

              Workbench   Transit   DV                           SDLX                         Across
Colon         end         end       end                          no end                       no end
Semicolon     no end      end       end                          no end                       no end
Tab           end         no end    no end                       no end                       no end
Soft return   no end      no end    end in Word, no end in PPT   end in Word, no end in PPT   no end

What can SRX do and what not?

• It can only show the segmentation rule settings at the time of export
• It cannot show any changes that were made to the segmentation rules during the use of the TM
• Sometimes a rule from system 1 cannot be re-created in system 2; that rule will then be ignored
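The kind of information SRX carries can be pictured as an ordered list of break and no-break rules, each with a pattern for the text before and after a candidate position, where the first matching rule decides. The Python sketch below evaluates such a list; it does not read SRX files, and the rule tuples, the abbreviation list and the name `split_segments` are illustrative assumptions.

```python
import re

# (is_break, pattern before the position, pattern after the position).
# Order matters: the first rule that matches at a position decides.
RULES = [
    (False, r"\b(?:Dr|etc|e\.g)\.", r"\s"),  # no break after known abbreviations
    (True, r"[.?!]", r"\s"),                 # break after sentence-final punctuation
]


def split_segments(text, rules=RULES):
    """Split text into segments using ordered break / no-break rules."""
    segments, start = [], 0
    for pos in range(1, len(text)):
        for is_break, before, after in rules:
            # The 'before' pattern must end exactly at pos, 'after' must start there.
            if re.search(f"(?:{before})$", text[:pos]) and re.match(after, text[pos:]):
                if is_break:
                    segments.append(text[start:pos].strip())
                    start = pos
                break  # first matching rule wins, later rules are not consulted
    tail = text[start:].strip()
    if tail:
        segments.append(tail)
    return segments


print(split_segments("Dr. Smith arrived. He was late."))
```

A receiving tool that cannot express the abbreviation exception would drop it and split after "Dr.", which is the situation the last bullet above describes.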

TBX – TermBase Exchange

• From the working draft of the TBX specification:
  • TBX is an open, XML-based standard format for terminological data
  • TBX is designed to support the analysis, representation, dissemination and exchange of information from human-oriented terminological databases (termbases)
  • TBX is built on the basis of ISO 12620 (data categories) and ISO 12200 (MARTIF core structure)

TMX TBX

(Figure: annotated TBX entry. Callouts: global information in entry head; language ID; administrative data of this language; information on term level; term in English; language ID; term in French)

Where could you use TBX?

• Exchange of terminological data
  • From term base to source language checking tools
  • Between term bases of translation memory tools
  • From term base to terminology checking tools
  • From term base to terminology extraction tools (as stop word lists)
  • From term base to dictionary of a machine translation tool
• For indexing keywords in document management systems, content management systems, knowledge management systems
• Publishing terminological data on the Intranet / Internet
• Optimization of search engines / text mining by searching for synonyms automatically

XLIFF – XML Localization Interchange File Format

• The XLIFF format aims to
  • separate localizable text from formatting (deal with one file format in translation instead of different processes to extract / filter / convert text from different file formats)
  • enable multiple tools to work on the source text
  • store information that is helpful in supporting a localization process (like metadata on versions of source and target segments)
• An XLIFF file is bilingual and can be the container for a number of individual files
• Each file element in an XLIFF file contains a header (project data such as contact information, project phases, pointers to reference material and information on the skeleton file) and a body section with the actual text segments

XLIFF

• XLIFF can carry several translation matches
• Additional fields can contain context, author, creation tool, history…

<trans-unit id="n1">
  <source>This is a sentence</source>
  <target xml:lang="de">Dies ist ein Satz</target>
  <alt-trans match-quality="100" tool="TM_System">
    <source>This is a sentence</source>
    <target xml:lang="de">Das ist ein Satz</target>
  </alt-trans>
  <alt-trans match-quality="70" tool="TM_System">
    <source>This is a short sentence</source>
    <target xml:lang="de">Dies ist ein kurzer Satz</target>
  </alt-trans>
</trans-unit>

<trans-unit id="1" resname="IDCANCEL" restype="button" coord="885014">
  <source xml:lang="en">Cancel</source>
</trans-unit>

• Information on the name of a button and its coordinates (for visual representation of the button in a localization tool)
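How a tool might read the alternative matches carried in a trans-unit can be sketched in Python as follows. The sketch works on a bare fragment without the XLIFF namespace and assumes that `match-quality` holds a number, optionally followed by a percent sign; real files may use other values there. The function name `best_alternative` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Fragment modelled on the slide example; the XLIFF namespace is omitted for brevity.
SAMPLE = """<trans-unit id="n1">
  <source>This is a sentence</source>
  <target xml:lang="de">Dies ist ein Satz</target>
  <alt-trans match-quality="100" tool="TM_System">
    <source>This is a sentence</source>
    <target xml:lang="de">Das ist ein Satz</target>
  </alt-trans>
  <alt-trans match-quality="70" tool="TM_System">
    <source>This is a short sentence</source>
    <target xml:lang="de">Dies ist ein kurzer Satz</target>
  </alt-trans>
</trans-unit>"""


def best_alternative(trans_unit_xml):
    """Return (quality, target text) of the highest rated alt-trans, or None."""
    unit = ET.fromstring(trans_unit_xml)
    best = None
    for alt in unit.findall("alt-trans"):
        # Assumption: the attribute is numeric, e.g. "70" or "70%".
        quality = float(alt.get("match-quality", "0").rstrip("%"))
        target = alt.findtext("target")
        if best is None or quality > best[0]:
            best = (quality, target)
    return best


print(best_alternative(SAMPLE))
```

The current translation in `<target>` stays untouched; the alternatives are only offered as references, here ranked by their recorded match quality.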

Where is XLIFF useful?

• Where experience with XML exists
• Projects contain many different file formats
  • All formats are converted to XLIFF for translation
• Different tools need to be used during localization
• Different translations (alt-trans) or languages are needed as reference

Any idea why XLIFF should NOT be the cure for everything?

• Instead of developing parsers for different file formats (to read the file into a translation tool), developers now need to create parsers to convert those file formats to XLIFF
• Some file formats can already be dealt with (Office, HTML, XML…) – why should a new parser be created for those?
• XLIFF has its variants, like all XML-based files – how can you make sure that each tool can process any of the possible XLIFF flavours?

The magic quadrant again, just to remember the distinction

                    Closed Source                   Open Source
Open Standards      Q1: Open-Closed (Good)          Q2: Open-Open (Good)
Proprietary ways    Q3: Proprietary-Closed (Bad)    Q4: Proprietary-Open (Wild)

6 Talking Legalese in OSS

OSS (Open Source Software)

• Copyleft and Permissive Licensing
  • GPL (General Public License by FSF)
  • BSD (Berkeley Software Distribution)
  • Apache License 2.0
• MS open source licensing
  • Ms-PL (Public License)
  • Ms-RL (Reciprocal License)

Derived/Derivative works vs. dynamic linkage

• Paradigmatic case is an app running on an OS

Contributions

Distribution

6 Talking Legalese in standards

RF (Royalty Free) vs. RAND (Reasonable and Non-Discriminatory)

Standards can be licensed under different terms

• RF
• RAND – there is no general consensus to date on whether RAND is Open or Proprietary
• Proprietary

By open standard we mean just RF standards

The opposite of Open Standard is any proprietary way of doing things, be it standardized

7 Open Tools

• Infrastructure baseline
  • PostgreSQL, MySQL, Open ACS, Petri Nets, Alfresco, jBoss
• TM Server
  • TinyTM
• Workflow capabilities through any CALPMS
  • Open: ]po[, .LRN; Closed source: AIT Projetex, LTC Worx, Plunet Business Manager, etc.
• CAT Clients
  • OmegaT, Metatexis, TinyTM Word macro, Multilizer, and whoever wants to implement the protocol
• Filters
  • OKAPI framework, mainstream XLIFF generators such as Sun or Adobe

8 Use Cases - odt + OmegaT

• Level 2 TMX support
• Has TinyTM connector in alpha
• Has plaintext glossary support

OpenOffice.org .odt

8 Use Cases - TinyTM

• TinyTM protocol is licensed under the "LGPL V2.1 or higher"
• Rest of the TinyTM code is licensed under the "GPL V2.0 or higher"
• Translation clients can be licensed under any license
• TinyTM documentation is licensed under the Creative Commons Attribution License

8 Use Cases - TinyTM - Status

• Alpha functionality
  • Java TMX importer
  • Markup parsing logic

-------------------------

• Protocol freeze in public discussion on Sourceforge
• Protocol features
  • Conservative extension with many important enhancements

8 Use Cases - TinyTM – new protocol features

• Industry and domain taxonomy
  • open to accommodate future TDA taxonomy research
• Inline markup handling
  1. stripped plain version
  2. normalized version with formatting placeholders
  3. TMX version with full TMX markup
• Advanced leverage functions
  • hash based context storage
  • hash and trigram enhanced fuzzy search
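What a trigram based fuzzy search does can be sketched in Python. This is a generic illustration of the technique, not TinyTM's actual implementation or protocol: the padding scheme, the Jaccard similarity measure, the 0.7 threshold and the names `trigrams`, `trigram_similarity` and `fuzzy_lookup` are all assumptions.

```python
def trigrams(text):
    """Set of three-character windows over the lower-cased, padded text."""
    padded = "  " + text.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a, b):
    """Jaccard similarity of the two trigram sets (0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def fuzzy_lookup(query, memory, threshold=0.7):
    """Return (score, source, target) entries at or above the threshold, best first."""
    scored = [(trigram_similarity(query, src), src, tgt) for src, tgt in memory]
    return sorted((hit for hit in scored if hit[0] >= threshold), reverse=True)


# Tiny in-memory translation memory, modelled on examples from the slides.
MEMORY = [
    ("This is a sentence", "Dies ist ein Satz"),
    ("This is a short sentence", "Dies ist ein kurzer Satz"),
    ("Cancel", "Abbrechen"),
]

for score, source, target in fuzzy_lookup("This is a sentence", MEMORY, threshold=0.5):
    print(round(score, 2), source, "->", target)
```

Trigram sets can be indexed in a database, which is one reason the technique is attractive for a TM server; a hash of the full segment would cover the exact match case.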

9 Discussion

Thanks for your attention



Page 12: A Guide to Open Standards and Open SourceOAXAL C. TMX Translation Memory Exchange •From the TMX specification: •…The purpose of the TMX format is to provide a standard method

5 Open Standards

bull Why Open Standards in Open Source

bull Implementing open standards seems obvious success scenario for OSS development

bull XLIFF and TMX are open standards co-developed by our clients

bull Minimalist open standards implementation ensures desired functionality and is also legally safe

bull LISA OSCAR TMX 14b 15 20

bull OASIS XLIFF 11 12 121 20

Open Standards OAXAL

copy A

ndrz

ej Z

ydro

n O

ASIS

OAXAL T

C

TMXTranslation Memory Exchange

bull From the TMX specification

bull hellipThe purpose of the TMX format is to provide a standard method to describe translation memory data that is being exchanged among tools andor translation vendors while introducing little or no loss of critical data during the processhellip

What is TMX

bull It is an XML representation of translation memory data

bull Header

bull Body

ltheadercreationtool=ldquoDeacutejagrave Vu creationtoolversion=ldquo4datatype=PlainTextrdquosegtype=sentenceadminlang=en-ussrclang=en-uso-tmf=DVMDB

gt

Deacutejagrave Vu Transit Trados MemoQ

Version build number of the tool

HTML SGML RTF Interleaf Javahellip

Basic segmentation

Default language for elements like ltnotegt

Source text language

Original translation memory format (DVMDB ndash Deacutejagrave Vu databasehellip)

What is TMX

• Body

<body>
<tu creationdate="20030915T153704Z" creationid="USER">
<tuv lang="EN-US"><seg>This is the first sentence</seg></tuv>
<tuv lang="DE-DE"><seg>Dies ist der erste Satz</seg></tuv>
</tu>
</body>

tu = translation unit; tuv lang = translation unit variant (language); seg = segment
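The header and body structure shown on these slides can be read with a few lines of Python. This is a minimal sketch, not a full TMX reader: the `<tmx>` wrapper element, the `read_tmx` helper and the `TMX_SAMPLE` string are assumptions added for illustration, and only the standard library parser is used.

```python
import xml.etree.ElementTree as ET

# Sample modeled on the slides; the <tmx> wrapper element is an assumption.
TMX_SAMPLE = """<tmx version="1.4">
  <header creationtool="Déjà Vu" creationtoolversion="4" datatype="PlainText"
          segtype="sentence" adminlang="en-us" srclang="en-us" o-tmf="DVMDB"/>
  <body>
    <tu creationdate="20030915T153704Z" creationid="USER">
      <tuv lang="EN-US"><seg>This is the first sentence</seg></tuv>
      <tuv lang="DE-DE"><seg>Dies ist der erste Satz</seg></tuv>
    </tu>
  </body>
</tmx>"""

# ElementTree expands the xml: prefix to this namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx(text):
    """Return (header attributes, list of {language: segment text} dicts)."""
    root = ET.fromstring(text)
    header = dict(root.find("header").attrib)
    units = []
    for tu in root.iter("tu"):
        variants = {}
        for tuv in tu.findall("tuv"):
            # Accept both xml:lang and the plain lang attribute used on the slide.
            lang = tuv.get(XML_LANG) or tuv.get("lang")
            seg = tuv.find("seg")
            # itertext() also collects text that sits around inline elements.
            variants[lang.lower()] = "".join(seg.itertext())
        units.append(variants)
    return header, units


header, units = read_tmx(TMX_SAMPLE)
print(header["srclang"], "->", units[0]["de-de"])
```

Each translation unit becomes one dictionary keyed by language code, which makes it easy to see that a bilingual import only needs to pick two keys out of a multilingual unit.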

What is TMX

• Depending on the tool that created the TMX file, it can be bilingual or multilingual

• Importing a multilingual TMX file into a bilingual project will only import the relevant languages

Levels of TMX

• Level 1
  • Plain text only (sufficient for data coming from software localization tools)

• Level 2
  • Text plus formatting (data coming from translation memory tools used for translation of documentation)

To move formatting and text from one tool to the other, both tools need to be Level 2 compliant

Level 1

• Formatting that is applied to the source and target text of a translation unit is not exported to the TMX file, only pure text

• Original:

• This sentence has some formatting

• In TMX:

• This sentence has some formatting

Level 2

• Formatting that is applied to the source and target text of a translation unit is exported to the TMX file

• Different tools use different ways of encoding that information (placeholders or actual formatting information)

Level 2

MemoQ – Word DOC with formatting

<seg>This is the <bpt i="1" type="bold"></bpt>first<ept i="1"></ept> sentence this is <bpt i="2" type="ulined"></bpt>another<ept i="2"></ept> sentence</seg>

Trados 2009 – Word DOC with formatting

<seg>This is the <bpt i="1" type="Bold" />first<ept i="1" /> sentence this is <bpt i="2" type="Underline" />another<ept i="2" /> sentence</seg>

Trados 2007 (8.2, 8.3) – Word DOC with formatting

<seg>This is the <bpt i="1">&lt;cf bold=&quot;on&quot;&gt;</bpt>first<ept i="1">&lt;/cf&gt;</ept> sentence this is <bpt i="2">&lt;cf underlinestyle=&quot;single&quot;&gt;</bpt>another<ept i="2">&lt;/cf&gt;</ept> sentence</seg>

Level 2

MemoQ – HTML file with link

<seg>Text with a link to <bpt i="1" type="link">&lt;a href = &quot;http://www.samplehtml.com/page1.htm&quot;&gt;</bpt>another page<ept i="1">&lt;/a&gt;</ept></seg>

Trados 2009 – HTML file with link

<seg>Text with a link to <bpt i="1" type="19" x="1" />another page<ept i="1" /></seg>

Trados 2007 (8.2, 8.3) – HTML file with link

<seg>Text with a link to <bpt i="1">&lt;a href=&quot;http://www.samplehtml.com/page1.htm&quot;&gt;</bpt>another page<ept i="1">&lt;/a&gt;</ept></seg>

OmegaT – HTML file with link

OmegaT internal format: <seg>Text with a link to &lt;a0&gt;another page&lt;/a0&gt;</seg>

TMX Level 2 format: <seg>Text with a link to <bpt i="0" x="0">&lt;a0&gt;</bpt>another page<ept i="0">&lt;/a0&gt;</ept></seg>

Level 2

MemoQ – InDesign

<seg>InDesign Text with <bpt i="1" type="bold"></bpt>formatting in bold<ept i="1"></ept></seg>

Trados 2009 – InDesign

<seg>InDesign text with <bpt i="1" type="pt16" x="1" />formatting in bold<ept i="1" /></seg>

Trados 2007 (8.2, 8.3) – InDesign

<seg>InDesign text with <bpt i="1">&lt;cf ptfs=&quot;c_Bold&quot;&gt;</bpt>formatting in bold<ept i="1">&lt;/cf&gt;</ept></seg>

Implications of different tags for formatting

• Tools that use placeholder tags do not include the actual formatting information in the TMX file

• Other tools might only be able to re-use the text, especially if the formatting is only applied in the target segment but not in the source

• The result of the exchange would then be the same as with TMX Level 1 (text only)

• TMX files which carry the actual formatting information will yield better matches in other tools that can read this information
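The fallback to "text only" can be illustrated by stripping the inline elements from a Level 2 segment. The sketch below assumes the segment is available as a well-formed XML string; `plain_text` and `seg_to_level1` are hypothetical helper names, and the set of inline element names is taken from TMX 1.4 (`bpt`, `ept`, `ph`, `it`, `ut`).

```python
import xml.etree.ElementTree as ET

# TMX inline elements whose content is native formatting code, not translatable text.
INLINE_CODE_TAGS = {"bpt", "ept", "ph", "it", "ut"}


def plain_text(elem):
    """Collect the text of an element while skipping inline formatting codes."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag not in INLINE_CODE_TAGS:
            # Elements such as <hi> wrap translatable text, so keep their content.
            parts.append(plain_text(child))
        # Text that follows the child element belongs to the segment itself.
        parts.append(child.tail or "")
    return "".join(parts)


def seg_to_level1(seg_xml):
    """Reduce a Level 2 <seg> string to its Level 1 (text only) content."""
    return plain_text(ET.fromstring(seg_xml))


with_codes = ('<seg>This is the <bpt i="1">&lt;cf bold=&quot;on&quot;&gt;</bpt>first'
              '<ept i="1">&lt;/cf&gt;</ept> sentence</seg>')
with_placeholders = '<seg>This is the <bpt i="1" type="Bold" />first<ept i="1" /> sentence</seg>'

print(seg_to_level1(with_codes))
print(seg_to_level1(with_placeholders))
```

Both encodings collapse to the same plain text, which is what a receiving tool falls back to when it cannot interpret the other tool's formatting codes.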

Where do you use TMX

• Transferring data between different translation memory tools

• Checking tools, QA tools

• TM maintenance tools

• Basis for bilingual term extraction

Reusing TMX data

• Although translation memory tools share the same basic idea (storing source-target language pairs and recycling translations), this has been realized in different ways

• Exchange with TMX works, but there is an issue that can lower the match rates nonetheless… the segmentation rules

SRX – Segmentation Rules Exchange

• From the SRX specification:

• "…The purpose of the SRX format is to provide a standard method to describe segmentation rules that are being exchanged among tools and/or translation vendors"

• "…is intended to enhance the TMX standard…"

Why SRX

• Tool A
  • Semicolon is end of segment
  • This is a sentence; this is another sentence
  • TM system sees two separate segments

• Tool B
  • Semicolon is NOT end of segment
  • This is a sentence; this is another sentence
  • TM system sees one segment
  • No match from the TMX data
  • Match rate is around 50%, while the usual minimum match setting is around 70%
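The effect can be reproduced with a toy segmenter. This is only a sketch: the `segment` function uses a deliberately crude rule set, and the word-based ratio from Python's `difflib` stands in for a real fuzzy-match score, which every TM tool computes in its own way.

```python
import re
from difflib import SequenceMatcher


def segment(text, break_on_semicolon):
    """Split text after end-of-segment punctuation (a deliberately crude rule set)."""
    enders = ".!?;" if break_on_semicolon else ".!?"
    pieces = re.split(r"(?<=[" + re.escape(enders) + r"])\s+", text)
    return [piece for piece in pieces if piece]


def similarity(a, b):
    """Word-based similarity between 0 and 1; a stand-in for a tool's match rate."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()


text = "This is a sentence; this is another sentence."
tool_a = segment(text, break_on_semicolon=True)    # segments stored in the TM
tool_b = segment(text, break_on_semicolon=False)   # segment looked up by tool B

best = max(similarity(tool_b[0], stored) for stored in tool_a)
print(len(tool_a), len(tool_b), round(best, 2))
```

With a minimum match threshold of 0.7, the long segment from tool B finds no usable match, although every word of it is present in the translation memory.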

Segmentation rules

• Rules that the tool applies to the text to translate to split it up into segments

• paragraph

• sentence

• phrase

• incomplete sentences in bulleted lists

• single words (headings, "Note", "Attention")

Segmentation rules

• End-of-segment rules (common to the default settings of all tools)
  • Dot at the end of a sentence (not after known abbreviations)
  • Question mark, exclamation mark
  • Paragraph mark
  • Colon

• End-of-segment rules (different for different tools)
  • Semicolon
  • Tab character
  • Sub-segments (index entries, footnotes, graphics)

Comparison of default rules

             Workbench   Transit   DV                           SDLX                         Across
Colon        end         end       end                          no end                       no end
Semicolon    no end      end       end                          no end                       no end
Tab          end         no end    no end                       no end                       no end
Soft return  no end      no end    end in Word, no end in PPT   end in Word, no end in PPT   no end
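The comparison can also be kept as a small lookup table, which makes it easy to check where two tools will segment the same text differently. In this sketch the `DEFAULTS` dictionary and `conflicting_rules` function are illustrative names; the values are transcribed from the comparison of default rules, and the soft return row is left out because it depends on the file type for some tools.

```python
# Default end-of-segment behavior per tool: True = ends a segment, False = does not.
# Soft return is omitted because its behavior depends on the file type.
DEFAULTS = {
    "Workbench": {"colon": True,  "semicolon": False, "tab": True},
    "Transit":   {"colon": True,  "semicolon": True,  "tab": False},
    "DV":        {"colon": True,  "semicolon": True,  "tab": False},
    "SDLX":      {"colon": False, "semicolon": False, "tab": False},
    "Across":    {"colon": False, "semicolon": False, "tab": False},
}


def conflicting_rules(tool_a, tool_b):
    """List the characters that the two tools treat differently by default."""
    a, b = DEFAULTS[tool_a], DEFAULTS[tool_b]
    return sorted(name for name in a if a[name] != b[name])


print(conflicting_rules("Workbench", "Transit"))
print(conflicting_rules("SDLX", "Across"))
```

An empty list means the two tools agree on these defaults; any entry in the list is a candidate for lower match rates after a TMX exchange.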

What can SRX do, and what not

• It can only show the segmentation rule settings at the time of export

• It cannot show any changes that have been applied in the segmentation rules during the use of the TM

• Sometimes the rules from system 1 cannot be re-created in system 2; the rule will then be ignored

TBX – TermBase Exchange

• From the working draft of the TBX specification:
  • TBX is an open, XML-based standard format for terminological data

• TBX is designed to support the analysis, representation, dissemination and exchange of information from human-oriented terminological databases (termbases)

• TBX is built on the basis of ISO 12620 (data categories) and ISO 12200 (MARTIF core structure)

TMX TBX

(Figure: annotated sample entry. Callouts: global information in entry head; language ID; administrative data of this language; information on term level; term in English; term in French.)

Where could you use TBX

• Exchange of terminological data
  • From term base to source language checking tools
  • Between term bases of translation memory tools
  • From term base to terminology checking tools
  • From term base to terminology extraction tools (as stop word lists)
  • From term base to dictionary of a machine translation tool

• For indexing keywords in document management systems, content management systems, knowledge management systems

• Publishing terminological data on the intranet / Internet

• Optimization of search engines / text mining by searching for synonyms automatically

XLIFF – XML Localization Interchange File Format

• The XLIFF format aims to
  • separate localizable text from formatting (deal with one file format in translation instead of different processes to extract, filter and convert text from different file formats)
  • enable multiple tools to work on the source text
  • store information that is helpful in supporting a localization process (like metadata on versions of source and target segments)

• An XLIFF file is bilingual and can be the container for a number of individual files

• Each file element in an XLIFF file contains a header (project data such as contact information, project phases, pointers to reference material and information on the skeleton file) and a body section with the actual text segments

XLIFF

• XLIFF can carry several translation matches

• Additional fields can contain context, author, creation tool, history…

<trans-unit id="n1">
<source>This is a sentence</source>
<target xml:lang="de">Dies ist ein Satz</target>
<alt-trans match-quality="100" tool="TM_System">
<source>This is a sentence</source>
<target xml:lang="de">Das ist ein Satz</target>
</alt-trans>
<alt-trans match-quality="70" tool="TM_System">
<source>This is a short sentence</source>
<target xml:lang="de">Dies ist ein kurzer Satz</target>
</alt-trans>
</trans-unit>

<trans-unit id="1" resname="IDCANCEL" restype="button" coord="885014">
<source xml:lang="en">Cancel</source>
</trans-unit>

• Information on the name of a button and its coordinates (for visual representation of the button in a localization tool)
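Reading the alternative matches out of a trans-unit takes only a few lines. The sketch below parses the fragment from the slide as a standalone string, so no XLIFF namespace is involved; a complete XLIFF 1.2 file declares a namespace that the element lookups would have to include. The `alternatives` helper is a hypothetical name, and it assumes that match-quality holds a plain number.

```python
import xml.etree.ElementTree as ET

TRANS_UNIT = """<trans-unit id="n1">
  <source>This is a sentence</source>
  <target xml:lang="de">Dies ist ein Satz</target>
  <alt-trans match-quality="100" tool="TM_System">
    <source>This is a sentence</source>
    <target xml:lang="de">Das ist ein Satz</target>
  </alt-trans>
  <alt-trans match-quality="70" tool="TM_System">
    <source>This is a short sentence</source>
    <target xml:lang="de">Dies ist ein kurzer Satz</target>
  </alt-trans>
</trans-unit>"""


def alternatives(trans_unit_xml):
    """Return (match quality, source, target) tuples, best match first."""
    unit = ET.fromstring(trans_unit_xml)
    found = []
    for alt in unit.findall("alt-trans"):
        # Assumption: match-quality is a number, optionally followed by a percent sign.
        quality = float(alt.get("match-quality", "0").rstrip("%"))
        found.append((quality, alt.findtext("source"), alt.findtext("target")))
    return sorted(found, key=lambda item: item[0], reverse=True)


matches = alternatives(TRANS_UNIT)
for quality, source, target in matches:
    print(quality, source, "->", target)
```

Sorting by match quality puts the 100% proposal first, which is how a translation editor would typically offer the alternatives to the translator.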

Where is XLIFF useful

• Where experience with XML exists

• Projects contain many different file formats

• All formats are converted to XLIFF for translation

• Different tools need to be used during localization

• Different translations (alt-trans) or languages needed as reference

Any idea why XLIFF should NOT be the cure for everything?

• Instead of developing parsers for different file formats (to read the file into a translation tool), developers now need to create parsers to convert those file formats to XLIFF

• Some file formats can already be dealt with (Office, HTML, XML…) – why should a new parser be created for those?

• XLIFF has its variants, like all XML-based files – how can you make sure that each tool can process any of the possible XLIFF flavours?

The magic quadrant again, just to remember the distinction

                   Closed Source                  Open Source
Open Standards     Q1 Open-Closed: Good           Q2 Open-Open: Good
Proprietary ways   Q3 Proprietary-Closed: Bad     Q4 Proprietary-Open: Wild

6 Talking Legalese in OSS

OSS (Open Source Software)

• Copyleft and Permissive Licensing

• GPL (General Public License by FSF)

• BSD (Berkeley Software Distribution)

• Apache License 2.0

• MS open source licensing

• Ms-PL (Public License)

• Ms-RL (Reciprocal License)

Derived/Derivative works vs. dynamic linkage

• Paradigmatic case is an app running on an OS

Contributions

Distribution

6 Talking Legalese in standards

RF (Royalty Free) vs. RAND (Reasonable and Non-Discriminatory)

Standards can be licensed under different terms

• RF

• RAND – there is no general consensus to date on whether RAND is Open or Proprietary

• Proprietary

By open standard we mean just RF standards

The opposite of Open Standard is any proprietary way of doing things, be it standardized or not

7 Open Tools

• Infrastructure baseline
  • PostgreSQL, MySQL, Open ACS, Petri Nets, Alfresco, jBoss

• TM Server
  • TinyTM

• Workflow capabilities through any CALPMS
  • Open: ]po[, .LRN. Closed source: AIT Projetex, LTC Worx, Plunet Business Manager, etc.

• CAT Clients
  • OmegaT, Metatexis, TinyTM Word macro, Multilizer, and whoever wants to implement the protocol

• Filters
  • OKAPI framework, mainstream XLIFF generators such as Sun or Adobe

8 Use Cases - odt + OmegaT

• Level 2 TMX support

• Has TinyTM connector in alpha

• Has plaintext glossary support

OpenOffice.org .odt

8 Use Cases - TinyTM

• TinyTM protocol is licensed under the "LGPL V2.1 or higher"

• Rest of the TinyTM code is licensed under the "GPL V2.0 or higher"

• Translation clients can be licensed under any license

• TinyTM documentation is licensed under the Creative Commons Attribution License

8 Use Cases - TinyTM - Status

• Alpha functionality

• Java TMX importer

• Markup parsing logic

-------------------------

• Protocol freeze in public discussion on Sourceforge

• Protocol features
  • Conservative extension with many important enhancements

8 Use Cases - TinyTM – new protocol features

• Industry and domain taxonomy
  • open to accommodate future TDA taxonomy research

• Inline markup handling
  1. stripped plain version
  2. normalized version with formatting placeholders
  3. TMX version with full TMX markup

• Advanced leverage functions
  • hash based context storage
  • hash and trigram enhanced fuzzy search
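The two leverage ideas named on this slide can be sketched generically. The code below is not TinyTM's implementation or protocol: the padding scheme, the Jaccard measure, the choice of SHA-1 and the function names are all assumptions chosen to illustrate trigram-based fuzzy matching and a hash key that stores a segment together with its context.

```python
import hashlib


def trigrams(text):
    """Set of character trigrams of the lower-cased, space-padded text."""
    padded = "  " + text.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a, b):
    """Jaccard overlap of the trigram sets: 1.0 for identical sets, 0.0 for disjoint ones."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def context_key(previous_segment, segment):
    """Hash of a segment plus its predecessor, usable as an exact in-context lookup key."""
    data = (previous_segment + "\x1f" + segment).encode("utf-8")
    return hashlib.sha1(data).hexdigest()


stored = "This is a sentence"
query = "This is a short sentence"
print(round(trigram_similarity(stored, query), 2))
print(context_key("Previous segment", stored) == context_key("Other segment", stored))
```

A server can index the trigram sets to shortlist candidates cheaply and run a more expensive comparison only on those, while the context hash distinguishes an exact match that appears in the same surroundings from one that merely has the same text.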

9 Discussion

Thanks for your attention

Page 14: A Guide to Open Standards and Open SourceOAXAL C. TMX Translation Memory Exchange •From the TMX specification: •…The purpose of the TMX format is to provide a standard method

TMXTranslation Memory Exchange

bull From the TMX specification

bull hellipThe purpose of the TMX format is to provide a standard method to describe translation memory data that is being exchanged among tools andor translation vendors while introducing little or no loss of critical data during the processhellip

What is TMX

bull It is an XML representation of translation memory data

bull Header

bull Body

ltheadercreationtool=ldquoDeacutejagrave Vu creationtoolversion=ldquo4datatype=PlainTextrdquosegtype=sentenceadminlang=en-ussrclang=en-uso-tmf=DVMDB

gt

Deacutejagrave Vu Transit Trados MemoQ

Version build number of the tool

HTML SGML RTF Interleaf Javahellip

Basic segmentation

Default language for elements like ltnotegt

Source text language

Original translation memory format (DVMDB ndash Deacutejagrave Vu databasehellip)

What is TMX

bull Body

ltbodygtlttu creationdate=20030915T153704Z creationid=USERgt

lttuv lang=EN-USgtltseggtThis is the first sentenceltseggt

lttuvgtlttuv lang=DE-DEgt

ltseggtDies ist der erste Satzltseggtlttuvgt

lttugtltbodygt

tu = Translation Unittuv lang = translation unit variant (language) seg = segment

What is TMX

bull Depending on the tool that created the TMX file it can be bilingual or multilingual

bull Importing multilingual TMX file into a bilingual project will only import the relevant languages

Levels of TMX

bull Level 1bull Plain text only (sufficient for data coming from software localization tools)

bull Level 2bull Text plus formatting (data coming from translation memory tools used for translation of documentation)

To move formatting and text from one tool to the other both tools need to be level2 compliant

Level 1

bull Formatting that is applied to the source and target text of a translation unit is not exported to the TMX file only pure text

bull Original

bull This sentence has some formatting

bull In TMX

bull This sentence has some formatting

Level 2

bull Formatting that is applied to the source and target text of a translation unit is exported to the TMX file

bull Different tools use different ways of encoding that information (placeholders or actual formatting information)

Level 2

seggtThis is the ltbpt i=1 type=boldgtltbptgtfirstltept i=1gtlteptgt sentence this is ltbpt i=2 type=ulinedgtltbptgtanotherltept i=2gtlteptgt sentenceltseggt

MemoQ ndash Word DOC with formatting

ltseggtThis is the ltbpt i=1 type=Bold gtfirstltept i=1 gt sentence this isltbpt i=2 type=Underline gtanotherltept i=2 gt sentenceltseggt

Trados 2009 ndash Word DOC with formatting

ltseggtThis is the ltbpt i=1gtampltcf bold=ampquotonampquotampgtltbptgtfirstltept i=1gtampltcfampgtlteptgt sentence this is ltbpt i=2gtampltcf underlinestyle=ampquotsingleampquotampgtltbptgtanotherltept i=2gtampltcfampgtlteptgtsentenceltseggt

Trados 2007 82 83 ndash Word DOC with formatting

Level 2

MemoQ ndash HTML file with link

Trados 2009 ndash HTML file with link

Trados 2007 82 83 ndash HTML file with link

ltseggtText with a link to ltbpt i=1gtamplta href=ampquothttpwwwsamplehtmlcompage1htmampquotampgtltbptgtanother pageltept i=1gtampltaampgtlteptgtltseggt

ltseggtText with a link to ltbpt i=1 type=19 x=1 gtanother pageltept i=1 gtltseggt

ltseggtText with a link to ltbpt i=1 type=linkgtamplta href = ampquothttpwwwsamplehtmlcompage1htmampquotampgtltbptgtanother pageltept i=1gtampltaampgtlteptgtltseggt

OmegaT internal format ltseggtText with a link to amplta0ampgtanother pageamplta0ampgtltseggt

TMX Level 2 format ltseggtText with a link to ltbpt i=0 x=0gtamplta0ampgtltbptgtanother pageltept i=0gtamplta0ampgtlteptgtltseggt

OmegaT - HTML file with link

Level 2

MemoQ ndash InDesign

Trados 2009 ndash InDesign

Trados 2007 82 83 ndash InDesign

ltseggtInDesign text with ltbpt i=1gtampltcf ptfs=ampquotc_Boldampquotampgtltbptgtformatting in boldltept i=1gtampltcfampgtlteptgtltseggt

ltseggtInDesign text with ltbpt i=1 type=pt16 x=1 gtformatting in boldltept i=1 gtltseggt

ltseggtInDesign Text with ltbpt i=1 type=boldgtltbptgtformtatting in boldltept i=1gtlteptgtltseggt

Implications of different tags for formatting

• Tools that use placeholder tags do not include the actual formatting information in the TMX file

• Other tools might only be able to re-use the text, especially if the formatting is applied only in the target segment but not in the source

• The result of the exchange would then be the same as with TMX Level 1 (text only)

• TMX files that carry the actual formatting information will yield better matches in other tools that can read this information
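The fallback to "text only" described above can be sketched in a few lines of Python. This is a minimal illustration, not how any of the named tools work internally: the function name `plain_text` and the sample strings are assumptions, and the samples are modelled on the InDesign segments shown on the earlier slide. The sketch simply drops the TMX inline code elements and keeps the surrounding text, which is what a Level 1 style exchange amounts to.

```python
import xml.etree.ElementTree as ET

# TMX inline code elements whose content is formatting code, not translatable text.
# (<hi> and <sub> are left out of this simplified sketch.)
INLINE_CODES = {"bpt", "ept", "it", "ph", "ut"}

def plain_text(seg_xml):
    """Return the text of one <seg> element with inline formatting codes removed."""
    seg = ET.fromstring(seg_xml)
    parts = [seg.text or ""]
    for child in seg:
        if child.tag not in INLINE_CODES:
            # Keep the text of any other child element.
            parts.append("".join(child.itertext()))
        # Text that follows the child element always belongs to the segment.
        parts.append(child.tail or "")
    return "".join(parts)

# Hypothetical samples in the three encodings shown on the slides.
EMPTY_PAIRED = '<seg>InDesign text with <bpt i="1" type="bold"></bpt>formatting in bold<ept i="1"></ept></seg>'
PLACEHOLDER = '<seg>InDesign text with <bpt i="1" type="pt16" x="1" />formatting in bold<ept i="1" /></seg>'
FULL_CODE = ('<seg>InDesign text with <bpt i="1">&lt;cf ptfs=&quot;c_Bold&quot;&gt;</bpt>'
             'formatting in bold<ept i="1">&lt;/cf&gt;</ept></seg>')

for sample in (EMPTY_PAIRED, PLACEHOLDER, FULL_CODE):
    print(plain_text(sample))
```

All three encodings reduce to the same plain string, which is why text can always be re-used even when the formatting codes of another tool cannot be interpreted.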

Where do you use TMX?

• Transferring data between different translation memory tools

• Checking tools, QA tools

• TM maintenance tools

• Basis for bilingual term extraction

Reusing TMX data

• Although translation memory tools share the same basic idea (storing source-target language pairs and recycling translations), they have implemented it in different ways

• Exchange with TMX works, but one issue can lower the match rates nonetheless: the segmentation rules

SRX – Segmentation Rules Exchange

• From the SRX specification:

• "…The purpose of the SRX format is to provide a standard method to describe segmentation rules that are being exchanged among tools and/or translation vendors"

• "…is intended to enhance the TMX standard…"

Why SRX?

• Tool A
  • Semicolon is end of segment
  • This is a sentence; this is another sentence
  • TM system sees two separate segments

• Tool B
  • Semicolon is NOT end of segment
  • This is a sentence; this is another sentence
  • TM system sees one segment
  • No match from the TMX data
  • Match rate is around 50%, while the usual minimum match setting is around 70%
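The effect described above can be reproduced with a small Python sketch. It rests on several assumptions that are not part of the slides: the splitter is a toy rule set, and the match score is a plain word-level edit distance, whereas commercial TM tools use their own scoring. The names `segment`, `edit_distance` and `match_score` are hypothetical.

```python
import re

TEXT = "This is a sentence; this is another sentence"

def segment(text, semicolon_breaks):
    """Split after sentence-final punctuation; optionally after semicolons too."""
    enders = ".!?;" if semicolon_breaks else ".!?"
    pattern = "(?<=[" + re.escape(enders) + r"])\s+"
    return [s for s in re.split(pattern, text) if s]

def edit_distance(a, b):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def match_score(a, b):
    """Word-level similarity between 0 and 1 (an assumed, simplified metric)."""
    ta, tb = a.split(), b.split()
    return 1 - edit_distance(ta, tb) / max(len(ta), len(tb), 1)

tool_a = segment(TEXT, semicolon_breaks=True)    # segments stored in the exported TM
tool_b = segment(TEXT, semicolon_breaks=False)   # segment seen by the importing tool
scores = [match_score(tool_b[0], stored) for stored in tool_a]
print(tool_a, tool_b, scores)
```

Under these assumptions each of the two stored segments scores 0.5 against Tool B's single segment, so neither reaches a 0.7 threshold and the translator is offered no match.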

Segmentation rules

• Rules that the tool applies to the text to be translated in order to split it up into segments
  • paragraph
  • sentence
  • phrase
  • incomplete sentences in bulleted lists
  • single words (headings, "Note", "Attention")

Segmentation rules

• End-of-segment rules (common to the default settings of all tools)
  • Dot at the end of a sentence (not after known abbreviations)
  • Question mark, exclamation mark
  • Paragraph mark
  • Colon

• End-of-segment rules (different for different tools)
  • Semicolon
  • Tab character
  • Sub-segments (index entries, footnotes, graphics)

Comparison of default rules

              Workbench  Transit  DV                          SDLX                        Across
Colon         end        end      end                         no end                      no end
Semicolon     no end     end      end                         no end                      no end
Tab           end        no end   no end                      no end                      no end
Soft return   no end     no end   end in Word, no end in PPT  end in Word, no end in PPT  no end

What can SRX do, and what can it not do?

• It can only show the segmentation rule settings at the time of export

• It cannot show any changes that were applied to the segmentation rules while the TM was in use

• Sometimes a rule from system 1 cannot be re-created in system 2; that rule is then ignored
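The following Python sketch illustrates what happens when one rule is ignored on import. It is loosely modelled on the SRX idea of break and no-break rules, each with a pattern for the text before and after a candidate position; the rule list, the tuple layout, and the function name `segment` are assumptions made for illustration and do not reproduce the SRX file format or any tool's rule engine.

```python
import re

# (is_break, pattern_before_position, pattern_after_position)
# Hypothetical rules; the first rule that matches at a position decides.
RULES = [
    (False, r"\b(?:e\.g|i\.e|etc)\.", r"\s"),  # exception: no break after known abbreviations
    (True,  r"[.?!]",                 r"\s"),  # break after sentence-final punctuation
    (True,  r";",                     r"\s"),  # tool-specific: break after a semicolon
]

def segment(text, rules):
    """Apply ordered break / no-break rules at every position in the text."""
    segments, start = [], 0
    for pos in range(1, len(text)):
        for is_break, before, after in rules:
            if re.search("(?:" + before + r")\Z", text[:pos]) and re.match(after, text[pos:]):
                if is_break:
                    segments.append(text[start:pos].strip())
                    start = pos
                break  # first matching rule wins, whether it breaks or not
    segments.append(text[start:].strip())
    return [s for s in segments if s]

sample = "Use a TM, e.g. OmegaT; it is free. Try it!"
print(segment(sample, RULES))      # system 1: all three rules
print(segment(sample, RULES[:2]))  # system 2: the semicolon rule was ignored
```

With the semicolon rule dropped, the first two segments merge into one, so the segments stored by system 1 no longer line up with what system 2 produces from the same text.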

TBX – TermBase Exchange

• From the working draft of the TBX specification:
  • TBX is an open, XML-based standard format for terminological data
  • TBX is designed to support the analysis, representation, dissemination, and exchange of information from human-oriented terminological databases (termbases)
  • TBX is built on the basis of ISO 12620 (data categories) and ISO 12200 (MARTIF core structure)

TMX TBX

[Figure: annotated TBX entry. Callouts: global information in entry head; language ID (for each language section); administrative data of this language; information on term level; term in English; term in French]
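Since the annotated entry itself did not survive in this transcript, the sketch below uses a small, hypothetical TBX-style entry to show the layers named in the callouts: entry-level information, one language section per language ID, and term-level information. The sample content, the entry id, and the function name `terms_by_language` are assumptions; the fragment is simplified and is not a complete, valid TBX file.

```python
import xml.etree.ElementTree as ET

# Attribute key that ElementTree uses for xml:lang.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical, simplified TBX-style entry (no header, no surrounding document).
SAMPLE_ENTRY = """
<termEntry id="c1">
  <descrip type="subjectField">localization</descrip>
  <langSet xml:lang="en">
    <tig>
      <term>translation memory</term>
      <termNote type="partOfSpeech">noun</termNote>
    </tig>
  </langSet>
  <langSet xml:lang="fr">
    <tig>
      <term>mémoire de traduction</term>
      <termNote type="partOfSpeech">noun</termNote>
    </tig>
  </langSet>
</termEntry>
"""

def terms_by_language(entry_xml):
    """Map each language ID of one entry to the list of its terms."""
    entry = ET.fromstring(entry_xml)
    result = {}
    for lang_set in entry.findall("langSet"):
        lang = lang_set.get(XML_LANG)
        result[lang] = [term.text for term in lang_set.iter("term")]
    return result

print(terms_by_language(SAMPLE_ENTRY))
```

A consumer such as a terminology checker would typically read only the language sections it needs and ignore data categories it does not understand.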

Where could you use TBX?

• Exchange of terminological data
  • From term base to source language checking tools
  • Between term bases of translation memory tools
  • From term base to terminology checking tools
  • From term base to terminology extraction tools (as stop word lists)
  • From term base to the dictionary of a machine translation tool

• For indexing keywords in document management systems, content management systems, and knowledge management systems

• Publishing terminological data on the intranet or Internet

• Optimization of search engines and text mining by searching for synonyms automatically

XLIFF – XML Localization Interchange File Format

• The XLIFF format aims to
  • separate localizable text from formatting (deal with one file format in translation instead of different processes to extract, filter, and convert text from different file formats)
  • enable multiple tools to work on the source text
  • store information that is helpful in supporting a localization process (such as metadata on versions of source and target segments)

• An XLIFF file is bilingual and can be the container for a number of individual files

• Each file element in an XLIFF file contains a header (project data such as contact information, project phases, pointers to reference material, and information on the skeleton file) and a body section with the actual text segments

XLIFF

• XLIFF can carry several translation matches

• Additional fields can contain context, author, creation tool, history…

<trans-unit id="n1">
  <source>This is a sentence</source>
  <target xml:lang="de">Dies ist ein Satz</target>
  <alt-trans match-quality="100" tool="TM_System">
    <source>This is a sentence</source>
    <target xml:lang="de">Das ist ein Satz</target>
  </alt-trans>
  <alt-trans match-quality="70" tool="TM_System">
    <source>This is a short sentence</source>
    <target xml:lang="de">Dies ist ein kurzer Satz</target>
  </alt-trans>
</trans-unit>

<trans-unit id="1" resname="IDCANCEL" restype="button" coord="8;8;50;14">
  <source xml:lang="en">Cancel</source>
</trans-unit>

• Information on the name of a button and its coordinates (for visual representation of the button in a localization tool)
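As a rough illustration of how a tool might use the alternative matches carried in a trans-unit, the Python sketch below picks the proposal with the highest match-quality value. It assumes a namespace-free fragment like the one on the slide (real XLIFF 1.2 files declare a namespace, which the lookups would have to include), and the function name `best_alt_trans` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Sample fragment modelled on the slide; namespace omitted for brevity.
UNIT = """
<trans-unit id="n1">
  <source>This is a sentence</source>
  <target xml:lang="de">Dies ist ein Satz</target>
  <alt-trans match-quality="100" tool="TM_System">
    <source>This is a sentence</source>
    <target xml:lang="de">Das ist ein Satz</target>
  </alt-trans>
  <alt-trans match-quality="70" tool="TM_System">
    <source>This is a short sentence</source>
    <target xml:lang="de">Dies ist ein kurzer Satz</target>
  </alt-trans>
</trans-unit>
"""

def best_alt_trans(unit_xml):
    """Return (match quality, target text) of the best alternative, or None."""
    unit = ET.fromstring(unit_xml)
    candidates = []
    for alt in unit.findall("alt-trans"):
        # Some tools write the value with a percent sign, so strip it if present.
        quality = float(alt.get("match-quality", "0").rstrip("%"))
        candidates.append((quality, alt.findtext("target")))
    return max(candidates, default=None)

print(best_alt_trans(UNIT))
```

Because the alternatives travel inside the file, a translation editor can show them without access to the translation memory that produced them.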

Where is XLIFF useful?

• Where experience with XML exists

• Projects contain many different file formats

• All formats are converted to XLIFF for translation

• Different tools need to be used during localization

• Different translations (alt-trans) or languages are needed as reference

Any idea why XLIFF should NOT be the cure for everything?

• Instead of developing parsers for different file formats (to read the file into a translation tool), developers now need to create parsers to convert those file formats to XLIFF

• Some file formats can already be dealt with (Office, HTML, XML…), so why should a new parser be created for those?

• XLIFF has its variants, like all XML-based files. How can you make sure that each tool can process any of the possible XLIFF flavours?

The magic quadrant again, just to remember the distinction

Q1: Open Standards, Closed Source – Good

Q2: Open Standards, Open Source – Good

Q3: Proprietary ways, Closed Source – Bad

Q4: Proprietary ways, Open Source – Wild

6 Talking Legalese in OSS

OSS (Open Source Software)

• Copyleft and Permissive Licensing
  • GPL (General Public License, by the FSF)
  • BSD (Berkeley Software Distribution)
  • Apache License 2.0

• MS open source licensing
  • Ms-PL (Public License)
  • Ms-RL (Reciprocal License)

Derived/Derivative works vs. dynamic linkage

• Paradigmatic case is an app running on an OS

Contributions

Distribution

6 Talking Legalese in standards

RF (Royalty Free) vs. RAND (Reasonable and Non-Discriminatory)

Standards can be licensed under different terms

• RF

• RAND – there is no general consensus to date on whether RAND is Open or Proprietary

• Proprietary

By open standard we mean just RF standards

The opposite of an Open Standard is any proprietary way of doing things, be it standardized or not

7 Open Tools

• Infrastructure baseline
  • PostgreSQL, MySQL, Open ACS, Petri Nets, Alfresco, jBoss

• TM Server
  • TinyTM

• Workflow capabilities through any CALPMS
  • Open: ]po[, .LRN; closed source: AIT Projetex, LTC Worx, Plunet Business Manager, etc.

• CAT Clients
  • OmegaT, MetaTexis, TinyTM Word macro, Multilizer, and whoever wants to implement the protocol

• Filters
  • OKAPI Framework, mainstream XLIFF generators such as Sun or Adobe

8 Use Cases - .odt + OmegaT

• Level 2 TMX support
• Has TinyTM connector in alpha
• Has plaintext glossary support

OpenOffice.org (.odt)

8 Use Cases - TinyTM

• The TinyTM protocol is licensed under the "LGPL V2.1 or higher"

• The rest of the TinyTM code is licensed under the "GPL V2.0 or higher"

• Translation clients can be licensed under any license

• TinyTM documentation is licensed under the Creative Commons Attribution License

8 Use Cases - TinyTM - Status

• Alpha functionality

• Java TMX importer

• Markup parsing logic

-------------------------

• Protocol freeze in public discussion on SourceForge

• Protocol features
  • Conservative extension with many important enhancements

8 Use Cases - TinyTM – new protocol features

• Industry and domain taxonomy
  • open to accommodate future TDA taxonomy research

• Inline markup handling
  1. stripped plain version
  2. normalized version with formatting placeholders
  3. TMX version with full TMX markup

• Advanced leverage functions
  • hash-based context storage
  • hash- and trigram-enhanced fuzzy search
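To give a feel for what hash-based lookup combined with trigram-based fuzzy search can mean in general, here is a minimal Python sketch. It is not the TinyTM protocol or its implementation: the in-memory dictionary, the normalization, the Jaccard similarity over character trigrams, and every function name are assumptions chosen for illustration.

```python
import hashlib

def normalize(text):
    """Lowercase and collapse whitespace so that trivial differences do not matter."""
    return " ".join(text.lower().split())

def segment_hash(text):
    """Stable key for exact lookups."""
    return hashlib.sha1(normalize(text).encode("utf-8")).hexdigest()

def trigrams(text):
    """Set of character trigrams of the padded, normalized text."""
    padded = "  " + normalize(text) + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def trigram_similarity(a, b):
    """Jaccard similarity of the two trigram sets (1.0 means identical sets)."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

def add(memory, source, target):
    memory[segment_hash(source)] = (source, target)

def lookup(memory, query, threshold=0.7):
    """Try the exact hash first, then fall back to the best fuzzy candidate."""
    exact = memory.get(segment_hash(query))
    if exact is not None:
        return 1.0, exact[1]
    best = max(memory.values(), key=lambda e: trigram_similarity(query, e[0]), default=None)
    if best is None:
        return None
    score = trigram_similarity(query, best[0])
    return (score, best[1]) if score >= threshold else None

memory = {}
add(memory, "This is a sentence", "Dies ist ein Satz")
print(lookup(memory, "this  is a SENTENCE"))
print(lookup(memory, "This is a short sentence", threshold=0.5))
```

The exact lookup costs a single dictionary access, while the fuzzy fallback here scans every entry; a server would normally use an index over the trigrams instead of a linear scan.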

9 Discussion

Thanks for your attention

10 References

Open Standards (selective)

• XLIFF 1.2: http://docs.oasis-open.org/xliff/v1.2/os/xliff-core.html

• TMX 1.4b: http://www.lisa.org/fileadmin/standards/tmx1.4/tmx.htm

• SRX 2.0: http://www.lisa.org/fileadmin/standards/srx20.html

Open Tools (selective)

• OKAPI Framework: http://okapi.sourceforge.net

• OmegaT: http://sourceforge.net/projects/omegat

• OmegaT+: http://omegatplus.sourceforge.net

• Open ACS: http://openacs.org

• PostgreSQL: http://www.postgresql.org

• TinyTM: http://tinytm.sourceforge.net

10 References

Tools continued

• XML to XLIFF to XML: https://sourceforge.net/projects/xliffroundtrip

• TMX Compliance Kit (test for TMX certification): http://www.lisa.org/Translation-Memory-e.34.0.html

• TBX Checker: http://www.lisa.org/TBX-Resources6500.html

Standards continued

• Unicode Standard Annex 29: http://www.unicode.org/reports/tr29

• W3C ITS: http://www.w3.org/TR/its

What is TMX

• It is an XML representation of translation memory data

• Header

• Body

<header
  creationtool="Déjà Vu"
  creationtoolversion="4"
  datatype="PlainText"
  segtype="sentence"
  adminlang="en-us"
  srclang="en-us"
  o-tmf="DVMDB"
>

creationtool: Déjà Vu, Transit, Trados, MemoQ

creationtoolversion: version/build number of the tool

datatype: HTML, SGML, RTF, Interleaf, Java…

segtype: basic segmentation

adminlang: default language for elements like <note>

srclang: source text language

o-tmf: original translation memory format (DVMDB – Déjà Vu database…)

What is TMX

• Body

<body>
  <tu creationdate="20030915T153704Z" creationid="USER">
    <tuv lang="EN-US">
      <seg>This is the first sentence</seg>
    </tuv>
    <tuv lang="DE-DE">
      <seg>Dies ist der erste Satz</seg>
    </tuv>
  </tu>
</body>

tu = translation unit
tuv lang = translation unit variant (language)
seg = segment

What is TMX

• Depending on the tool that created the TMX file, it can be bilingual or multilingual

• Importing a multilingual TMX file into a bilingual project will only import the relevant languages
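The body structure and the bilingual import behaviour described above can be sketched in Python as follows. The sample adds a hypothetical French variant to the slide's example to make the file multilingual; the function names `translation_units` and `language_pair` are assumptions, and a real importer would also read the header and handle inline codes.

```python
import xml.etree.ElementTree as ET

# Attribute key that ElementTree uses for xml:lang (newer TMX versions use xml:lang).
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Body modelled on the slide, extended with an invented FR-FR variant.
TMX_BODY = """
<body>
  <tu creationdate="20030915T153704Z" creationid="USER">
    <tuv lang="EN-US"><seg>This is the first sentence</seg></tuv>
    <tuv lang="DE-DE"><seg>Dies ist der erste Satz</seg></tuv>
    <tuv lang="FR-FR"><seg>Ceci est la première phrase</seg></tuv>
  </tu>
</body>
"""

def translation_units(body_xml):
    """Return one dict per <tu>, mapping language code to segment text."""
    body = ET.fromstring(body_xml)
    units = []
    for tu in body.iter("tu"):
        unit = {}
        for tuv in tu.findall("tuv"):
            lang = tuv.get(XML_LANG) or tuv.get("lang")
            unit[lang] = "".join(tuv.find("seg").itertext())
        units.append(unit)
    return units

def language_pair(units, source_lang, target_lang):
    """Keep only the units that contain both languages of a bilingual project."""
    return [(u[source_lang], u[target_lang])
            for u in units if source_lang in u and target_lang in u]

units = translation_units(TMX_BODY)
print(language_pair(units, "EN-US", "DE-DE"))
```

The French variant is simply skipped when the project is English to German, which mirrors the "only the relevant languages" behaviour.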

Levels of TMX

• Level 1
  • Plain text only (sufficient for data coming from software localization tools)

• Level 2
  • Text plus formatting (data coming from translation memory tools used for translation of documentation)

To move formatting and text from one tool to the other, both tools need to be Level 2 compliant

Level 1

• Formatting that is applied to the source and target text of a translation unit is not exported to the TMX file; only the pure text is

• Original
  • This sentence has some formatting

• In TMX
  • This sentence has some formatting

Level 2

• Formatting that is applied to the source and target text of a translation unit is exported to the TMX file

• Different tools encode that information in different ways (placeholders or actual formatting information)

Page 16: A Guide to Open Standards and Open SourceOAXAL C. TMX Translation Memory Exchange •From the TMX specification: •…The purpose of the TMX format is to provide a standard method

What is TMX

bull Body

ltbodygtlttu creationdate=20030915T153704Z creationid=USERgt

lttuv lang=EN-USgtltseggtThis is the first sentenceltseggt

lttuvgtlttuv lang=DE-DEgt

ltseggtDies ist der erste Satzltseggtlttuvgt

lttugtltbodygt

tu = Translation Unittuv lang = translation unit variant (language) seg = segment

What is TMX

bull Depending on the tool that created the TMX file it can be bilingual or multilingual

bull Importing multilingual TMX file into a bilingual project will only import the relevant languages

Levels of TMX

bull Level 1bull Plain text only (sufficient for data coming from software localization tools)

bull Level 2bull Text plus formatting (data coming from translation memory tools used for translation of documentation)

To move formatting and text from one tool to the other both tools need to be level2 compliant

Level 1

bull Formatting that is applied to the source and target text of a translation unit is not exported to the TMX file only pure text

bull Original

bull This sentence has some formatting

bull In TMX

bull This sentence has some formatting

Level 2

bull Formatting that is applied to the source and target text of a translation unit is exported to the TMX file

bull Different tools use different ways of encoding that information (placeholders or actual formatting information)

Level 2

seggtThis is the ltbpt i=1 type=boldgtltbptgtfirstltept i=1gtlteptgt sentence this is ltbpt i=2 type=ulinedgtltbptgtanotherltept i=2gtlteptgt sentenceltseggt

MemoQ ndash Word DOC with formatting

ltseggtThis is the ltbpt i=1 type=Bold gtfirstltept i=1 gt sentence this isltbpt i=2 type=Underline gtanotherltept i=2 gt sentenceltseggt

Trados 2009 ndash Word DOC with formatting

ltseggtThis is the ltbpt i=1gtampltcf bold=ampquotonampquotampgtltbptgtfirstltept i=1gtampltcfampgtlteptgt sentence this is ltbpt i=2gtampltcf underlinestyle=ampquotsingleampquotampgtltbptgtanotherltept i=2gtampltcfampgtlteptgtsentenceltseggt

Trados 2007 82 83 ndash Word DOC with formatting

Level 2

MemoQ ndash HTML file with link

Trados 2009 ndash HTML file with link

Trados 2007 82 83 ndash HTML file with link

ltseggtText with a link to ltbpt i=1gtamplta href=ampquothttpwwwsamplehtmlcompage1htmampquotampgtltbptgtanother pageltept i=1gtampltaampgtlteptgtltseggt

ltseggtText with a link to ltbpt i=1 type=19 x=1 gtanother pageltept i=1 gtltseggt

ltseggtText with a link to ltbpt i=1 type=linkgtamplta href = ampquothttpwwwsamplehtmlcompage1htmampquotampgtltbptgtanother pageltept i=1gtampltaampgtlteptgtltseggt

OmegaT internal format ltseggtText with a link to amplta0ampgtanother pageamplta0ampgtltseggt

TMX Level 2 format ltseggtText with a link to ltbpt i=0 x=0gtamplta0ampgtltbptgtanother pageltept i=0gtamplta0ampgtlteptgtltseggt

OmegaT - HTML file with link

Level 2

MemoQ ndash InDesign

Trados 2009 ndash InDesign

Trados 2007 82 83 ndash InDesign

ltseggtInDesign text with ltbpt i=1gtampltcf ptfs=ampquotc_Boldampquotampgtltbptgtformatting in boldltept i=1gtampltcfampgtlteptgtltseggt

ltseggtInDesign text with ltbpt i=1 type=pt16 x=1 gtformatting in boldltept i=1 gtltseggt

ltseggtInDesign Text with ltbpt i=1 type=boldgtltbptgtformtatting in boldltept i=1gtlteptgtltseggt

Implications of different tags for formatting

bull Tools that use placeholder tags do not include the actual formatting information in the TMX file

bull Other tools might only be able to re-use the text especially if the formatting is only applied in the target segment but not in the source

bull The result of the exchange would then be the same as with TMX level 1 (text only)

bull TMX files which carry the actual formatting information will yield better matches in other tools that can read this information

Where do you use TMX

bull Transfering data between different translation memory tools

bull Checking tools QA tools

bull TM maintenance tools

bull Basis for bilingual term extraxtion

Reusing TMX data

bull Although Translation Memory Tools have the same basic idea (storing source-target language pairs and recycling translations) this has been realized in different ways

bull Exchange with TMX works but there is an issue that can lower the match rates nonethelesshellip the segmentation rules

SRX ndash Segmentation Rules Exchange

bull From the SRX specification

bull hellipThe purpose of the SRX format is to provide a standard method to describe segmentation rules that are being exchanged among tools andor translation vendors

bull hellipis intended to enhance the TMX standardhellip

Why SRX

bull Tool Abull Semicolon is end of segment

bull This is a sentence this is another sentence

bull TM system sees two separate segments

bull Tool Bbull Semicolon is NOT end of segment

bull This is a sentence this is another sentence

bull TM system sees one segmentbull No match from the TMX data

bull Match rate around 50 usual setting around 70

Segmentation rules

bull Rules that the tool applies to the text to translate to split it up into segments

bull paragraph

bull sentence

bull phrase

bull incomplete sentences in bulleted lists

bull single words (headings ldquoNoterdquo ldquoAttentionrdquo)

Segmentation rules

bull End of segment rules (common to the default settings of all tools)bull Dot at the end of a sentence (not after known

abbreviations)bull Question mark exclamation markbull Paragraph markbull Colon

bull End of segment rules (different for different tools)bull Semicolonbull Tab characterbull Sub segments (index entries footnotes

graphics)

Comparison of default rules

Workbench Transit DV SDLX Across

Colon end end end no end no end

Semi-

colon

no end end end no end no end

Tab end no end no end no end no end

Soft

return

no end no end end in

Word no

end in

PPT

end in

Word no

end in

PPT

no end

What can SRX do and what not

bull It can only show the segmentation rule settings at the time of export

bull It cannot show any changes that have been applied in the segmentation rules during the use of the TM

bull Sometimes the rules from system 1 cannot be re-created in system 2 then the rule will be ignored

TBX ndash TermBase Exchange

bull From the working draft of the TBX specificationbull TBX is an open XML-based standard format for terminological data

bull TBX is designed to support the analysis representation dissemination and exchange of information from human-oriented terminological databases (termbases)

bull TBX is built on the basis of ISO 12620 (data categories) and ISO 12200 (MARTIF core structure)

TMX TBX

Zerfasszaacde 35

Term in English

Term in French

Global information in entry head

Information on term level

Administrative data of this language

Language ID

Language ID

Where could you use TBX

bull Exchange terminological databull From term base to source language checking toolsbull Between term bases of translation memory toolsbull From term base to terminology checking toolsbull From term base to terminology extraction tools (as stop word lists)bull From term base to dictionary of a machine translation tool

bull For indexing keywords in document management systems content management systems knowledge management systems

bull Publishing terminological data on the Intranet Internet

bull Optimization of search enginges text mining by searching for synonyms automatically

XLIFF – XML Localization Interchange File Format

• The XLIFF format aims to
  • separate localizable text from formatting (deal with one file format in translation instead of different processes to extract / filter / convert text from different file formats)
  • enable multiple tools to work on the source text
  • store information that is helpful in supporting a localization process (like metadata on versions of source and target segments)

• An XLIFF file is bilingual and can be the container for a number of individual files

• Each file element in an XLIFF file contains a header (project data such as contact information, project phases, pointers to reference material and information on the skeleton file) and a body section with the actual text segments

XLIFF

• XLIFF can carry several translation matches

• Additional fields can contain context, author, creation tool, history…

<trans-unit id="n1">
  <source>This is a sentence</source>
  <target xml:lang="de">Dies ist ein Satz</target>
  <alt-trans match-quality="100" tool="TM_System">
    <source>This is a sentence</source>
    <target xml:lang="de">Das ist ein Satz</target>
  </alt-trans>
  <alt-trans match-quality="70" tool="TM_System">
    <source>This is a short sentence</source>
    <target xml:lang="de">Dies ist ein kurzer Satz</target>
  </alt-trans>
</trans-unit>

<trans-unit id="1" resname="IDCANCEL" restype="button" coord="885014">
  <source xml:lang="en">Cancel</source>
</trans-unit>

• Information on name of a button and its coordinates (for visual representation of the button in a localization tool)
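As a sketch of how a tool might use the translation matches carried in a trans-unit, the snippet below picks the best `alt-trans` proposal at or above a threshold. It assumes a namespace-free fragment like the one on the slide (real XLIFF 1.2 files declare a namespace, which would have to be added to the element names), and the function name `best_alt_trans` and the default threshold are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace-free sample modelled on the slide's example.
TRANS_UNIT = """
<trans-unit id="n1">
  <source>This is a sentence</source>
  <target xml:lang="de">Dies ist ein Satz</target>
  <alt-trans match-quality="100" tool="TM_System">
    <source>This is a sentence</source>
    <target xml:lang="de">Das ist ein Satz</target>
  </alt-trans>
  <alt-trans match-quality="70" tool="TM_System">
    <source>This is a short sentence</source>
    <target xml:lang="de">Dies ist ein kurzer Satz</target>
  </alt-trans>
</trans-unit>
"""


def best_alt_trans(unit_xml, threshold=75.0):
    """Return (quality, target text) of the best proposal at or above threshold, else None."""
    unit = ET.fromstring(unit_xml)
    candidates = []
    for alt in unit.findall("alt-trans"):
        # match-quality is a free-form string; tolerate a trailing percent sign.
        quality = float(alt.get("match-quality", "0").rstrip("%"))
        if quality >= threshold:
            candidates.append((quality, alt.findtext("target")))
    return max(candidates, default=None)


print(best_alt_trans(TRANS_UNIT))
```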

Where is XLIFF useful?

• Where experience with XML exists

• Projects contain many different file formats

• All formats are converted to XLIFF for translation

• Different tools need to be used during localization

• Different translations (alt-trans) or languages needed as reference

Any idea why XLIFF should NOT be the cure for everything?

• Instead of developing parsers for different file formats (to read the file into a translation tool), developers now need to create parsers to convert those file formats to XLIFF

• Some file formats can already be dealt with (Office, HTML, XML…) – why should a new parser be created for those?

• XLIFF has its variants, like all XML-based files – how can you make sure that each tool can process any of the possible XLIFF flavours?

The magic quadrant again, just to remember the distinction

      Standards         Source         Verdict
Q1    Open Standards    Closed Source  Good
Q2    Open Standards    Open Source    Good
Q3    Proprietary ways  Closed Source  Bad
Q4    Proprietary ways  Open Source    Wild

6 Talking Legalese in OSS

OSS (Open Source Software)

• Copyleft and Permissive Licensing
  • GPL (General Public License by FSF)
  • BSD (Berkeley Software Distribution)
  • Apache License 2.0

• MS open source licensing
  • Ms-PL (Public License)
  • Ms-RL (Reciprocal License)

Derived/Derivative works vs. dynamic linkage

• Paradigmatic case is app running on OS

Contributions

Distribution

6 Talking Legalese in standards

RF (Royalty Free) vs. RAND (Reasonable and Non-Discriminatory)

Standards can be licensed under different terms

• RF

• RAND – there is no general consensus to date on whether RAND is Open or Proprietary

• Proprietary

By open standard we mean just RF standards

The opposite of Open Standard is any proprietary way of doing things, be it standardized

7 Open Tools

• Infrastructure baseline
  • PostgreSQL, MySQL, Open ACS, Petri Nets, Alfresco, jBoss

• TM Server
  • TinyTM

• Workflow capabilities through any CALPMS
  • Open: ]po[, .LRN. Closed source: AIT Projetex, LTC Worx, Plunet Business Manager, etc.

• CAT Clients
  • OmegaT, Metatexis, TinyTM Word macro, Multilizer and whoever wants to implement the protocol

• Filters
  • OKAPI framework, mainstream XLIFF generators such as Sun or Adobe

8 Use Cases - odt + OmegaT

• Level 2 TMX support
• Has TinyTM connector in alpha
• Has plaintext glossary support

OpenOffice.org .odt

8 Use Cases - TinyTM

• TinyTM protocol is licensed under the "LGPL V2.1 or higher"

• Rest of the TinyTM code is licensed under the "GPL V2.0 or higher"

• Translation clients can be licensed under any license

• TinyTM documentation is licensed under the Creative Commons Attribution License

8 Use Cases - TinyTM - Status

• Alpha functionality

• Java TMX importer

• Mark up parsing logic

-------------------------

• Protocol freeze in public discussion on Sourceforge

• Protocol features
  • Conservative extension with many important enhancements

8 Use Cases - TinyTM – new protocol features

• Industry and domain taxonomy
  • open to accommodate future TDA taxonomy research

• Inline markup handling
  1 stripped plain version
  2 normalized version with formatting placeholders
  3 TMX version with full TMX markup

• Advanced leverage functions
  • hash based context storage
  • hash and trigram enhanced fuzzy search
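The slides do not say how TinyTM implements these leverage functions, so the following is only a generic sketch of the two ideas named: a character-trigram similarity score for fuzzy search and a hash over the neighbouring segments as a context key. The padding scheme, the Jaccard-style score, the use of SHA-1 and all function names are assumptions made for illustration.

```python
import hashlib


def trigrams(text):
    """Set of character trigrams of the lower-cased text, padded with spaces."""
    padded = "  " + text.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a, b):
    """Shared trigrams divided by all distinct trigrams (0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def context_hash(previous_segment, next_segment):
    """Stable key for the surrounding segments, usable to tell context matches apart."""
    joined = previous_segment + "\x1f" + next_segment
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


query = "This is a short sentence"
memory = ["This is a sentence", "Completely unrelated text"]
ranked = sorted(memory, key=lambda entry: trigram_similarity(query, entry), reverse=True)
print(ranked[0])
print(context_hash("Previous segment.", "Next segment."))
```

Storing a fixed-length hash rather than the neighbouring text keeps the context lookup a simple equality test, while the trigram score tolerates small edits inside the segment itself.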

9 Discussion

Thanks for your attention


10 References

Open Standards (selective)

• XLIFF 1.2: http://docs.oasis-open.org/xliff/v1.2/os/xliff-core.html

• TMX 1.4b: http://www.lisa.org/fileadmin/standards/tmx1.4/tmx.htm

• SRX 2.0: http://www.lisa.org/fileadmin/standards/srx20.html

Open Tools (selective)

• OKAPI Framework: http://okapi.sourceforge.net

• OmegaT: http://sourceforge.net/projects/omegat

• OmegaT+: http://omegatplus.sourceforge.net

• Open ACS: http://openacs.org

• PostgreSQL: http://www.postgresql.org

• TinyTM: http://tinytm.sourceforge.net

10 References

Tools continued

• XML to XLIFF to XML: https://sourceforge.net/projects/xliffroundtrip

• TMX Compliance Kit (test for TMX certification): http://www.lisa.org/Translation-Memory-e.34.0.html

• TBX Checker: http://www.lisa.org/TBX-Resources.650.0.html

Standards continued

• Unicode Standard Annex 29: http://www.unicode.org/reports/tr29/

• W3C ITS: http://www.w3.org/TR/its/

What is TMX?

• Depending on the tool that created the TMX file, it can be bilingual or multilingual

• Importing a multilingual TMX file into a bilingual project will only import the relevant languages
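The import behaviour described above can be sketched as follows: each translation unit (`tu`) holds one variant (`tuv`) per language, and a bilingual import keeps only the units that have both project languages. The sample data, the header values and the function name `import_pairs` are invented for this example; real tools also apply looser language matching (for instance en versus en-US), which is not modelled here.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Illustrative multilingual TMX; header values are placeholders.
SAMPLE_TMX = """<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1.0" segtype="sentence"
          o-tmf="example" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Cancel</seg></tuv>
      <tuv xml:lang="de"><seg>Abbrechen</seg></tuv>
      <tuv xml:lang="fr"><seg>Annuler</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Save</seg></tuv>
      <tuv xml:lang="fr"><seg>Enregistrer</seg></tuv>
    </tu>
  </body>
</tmx>"""


def import_pairs(tmx_xml, source_lang, target_lang):
    """Return (source, target) text pairs for units that contain both languages."""
    root = ET.fromstring(tmx_xml)
    pairs = []
    for tu in root.iter("tu"):
        segs = {
            tuv.get(XML_LANG): "".join(tuv.find("seg").itertext())
            for tuv in tu.findall("tuv")
        }
        if source_lang in segs and target_lang in segs:
            pairs.append((segs[source_lang], segs[target_lang]))
    return pairs


print(import_pairs(SAMPLE_TMX, "en", "de"))
print(import_pairs(SAMPLE_TMX, "en", "fr"))
```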

Levels of TMX

• Level 1
  • Plain text only (sufficient for data coming from software localization tools)

• Level 2
  • Text plus formatting (data coming from translation memory tools used for translation of documentation)

To move formatting and text from one tool to the other, both tools need to be Level 2 compliant

Level 1

• Formatting that is applied to the source and target text of a translation unit is not exported to the TMX file, only pure text

• Original
  • This sentence has some formatting

• In TMX
  • This sentence has some formatting

Level 2

• Formatting that is applied to the source and target text of a translation unit is exported to the TMX file

• Different tools use different ways of encoding that information (placeholders or actual formatting information)

Level 2

MemoQ – Word DOC with formatting

<seg>This is the <bpt i="1" type="bold"></bpt>first<ept i="1"></ept> sentence this is <bpt i="2" type="ulined"></bpt>another<ept i="2"></ept> sentence</seg>

Trados 2009 – Word DOC with formatting

<seg>This is the <bpt i="1" type="Bold" />first<ept i="1" /> sentence this is <bpt i="2" type="Underline" />another<ept i="2" /> sentence</seg>

Trados 2007 / 8.2 / 8.3 – Word DOC with formatting

<seg>This is the <bpt i="1">&lt;cf bold=&quot;on&quot;&gt;</bpt>first<ept i="1">&lt;/cf&gt;</ept> sentence this is <bpt i="2">&lt;cf underlinestyle=&quot;single&quot;&gt;</bpt>another<ept i="2">&lt;/cf&gt;</ept> sentence</seg>

Level 2

MemoQ – HTML file with link

<seg>Text with a link to <bpt i="1" type="link">&lt;a href = &quot;http://www.samplehtml.com/page1.htm&quot;&gt;</bpt>another page<ept i="1">&lt;/a&gt;</ept></seg>

Trados 2009 – HTML file with link

<seg>Text with a link to <bpt i="1" type="19" x="1" />another page<ept i="1" /></seg>

Trados 2007 / 8.2 / 8.3 – HTML file with link

<seg>Text with a link to <bpt i="1">&lt;a href=&quot;http://www.samplehtml.com/page1.htm&quot;&gt;</bpt>another page<ept i="1">&lt;/a&gt;</ept></seg>

OmegaT - HTML file with link

OmegaT internal format: <seg>Text with a link to &lt;a0&gt;another page&lt;/a0&gt;</seg>

TMX Level 2 format: <seg>Text with a link to <bpt i="0" x="0">&lt;a0&gt;</bpt>another page<ept i="0">&lt;/a0&gt;</ept></seg>

Level 2

MemoQ – InDesign

<seg>InDesign Text with <bpt i="1" type="bold"></bpt>formatting in bold<ept i="1"></ept></seg>

Trados 2009 – InDesign

<seg>InDesign text with <bpt i="1" type="pt16" x="1" />formatting in bold<ept i="1" /></seg>

Trados 2007 / 8.2 / 8.3 – InDesign

<seg>InDesign text with <bpt i="1">&lt;cf ptfs=&quot;c_Bold&quot;&gt;</bpt>formatting in bold<ept i="1">&lt;/cf&gt;</ept></seg>

Implications of different tags for formatting

• Tools that use placeholder tags do not include the actual formatting information in the TMX file

• Other tools might only be able to re-use the text, especially if the formatting is only applied in the target segment but not in the source

• The result of the exchange would then be the same as with TMX Level 1 (text only)

• TMX files which carry the actual formatting information will yield better matches in other tools that can read this information
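The "same as Level 1" outcome can be sketched in a few lines: when a receiving tool cannot interpret another tool's inline codes, what it can still reuse is the segment text with the code elements removed. The function name `plain_text` and the two sample segments (one carrying the formatting code, one using empty placeholders) are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# TMX inline elements whose content is native formatting code, not translatable text.
CODE_ELEMENTS = {"bpt", "ept", "it", "ph", "ut"}


def plain_text(elem):
    """Return the text of a <seg> with inline code elements dropped."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag not in CODE_ELEMENTS:
            parts.append(plain_text(child))  # e.g. <hi> wraps ordinary text
        parts.append(child.tail or "")       # text following the inline element
    return "".join(parts)


WITH_CODE = ET.fromstring(
    '<seg>This is the <bpt i="1">&lt;cf bold=&quot;on&quot;&gt;</bpt>first'
    '<ept i="1">&lt;/cf&gt;</ept> sentence</seg>'
)
WITH_PLACEHOLDER = ET.fromstring(
    '<seg>This is the <bpt i="1" type="bold"></bpt>first<ept i="1"></ept> sentence</seg>'
)

print(plain_text(WITH_CODE))
print(plain_text(WITH_PLACEHOLDER))
```

Both variants reduce to the same plain string, which is why an exchange between tools with incompatible tag conventions tends to degrade to text-only reuse.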

Where do you use TMX?

• Transferring data between different translation memory tools

• Checking tools / QA tools

• TM maintenance tools

• Basis for bilingual term extraction

Reusing TMX data

• Although translation memory tools share the same basic idea (storing source-target language pairs and recycling translations), this has been realized in different ways

• Exchange with TMX works, but there is an issue that can lower the match rates nonetheless… the segmentation rules

SRX – Segmentation Rules Exchange

• From the SRX specification:
  • …The purpose of the SRX format is to provide a standard method to describe segmentation rules that are being exchanged among tools and/or translation vendors
  • …is intended to enhance the TMX standard…

Why SRX?

• Tool A
  • Semicolon is end of segment
  • This is a sentence; this is another sentence.
  • TM system sees two separate segments

• Tool B
  • Semicolon is NOT end of segment
  • This is a sentence; this is another sentence.
  • TM system sees one segment
  • No match from the TMX data
  • Match rate around 50% (the usual minimum match setting is around 70%)
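The effect can be reproduced with a small sketch: the same text is segmented once with and once without the semicolon rule, and the single long segment is then scored against the two shorter TM entries. The word-level score from Python's `difflib` merely stands in for a TM tool's fuzzy match algorithm (each tool computes this differently), and the function names and the 0.7 threshold are assumptions for this example.

```python
import re
from difflib import SequenceMatcher

MIN_MATCH = 0.7  # assumed minimum match value, mirroring the "around 70" setting


def segment(text, semicolon_ends_segment):
    """Split after a full stop, and optionally after a semicolon, followed by whitespace."""
    pattern = r"(?<=[.;])\s+" if semicolon_ends_segment else r"(?<=\.)\s+"
    return [piece for piece in re.split(pattern, text.strip()) if piece]


def best_match(new_segment, tm_entries):
    """Highest word-level similarity between the new segment and any TM entry."""
    words = new_segment.split()
    return max(SequenceMatcher(None, words, entry.split()).ratio() for entry in tm_entries)


TEXT = "This is a sentence; this is another sentence."
tm_from_tool_a = segment(TEXT, semicolon_ends_segment=True)    # TM content exported via TMX
segments_in_tool_b = segment(TEXT, semicolon_ends_segment=False)

for seg in segments_in_tool_b:
    score = best_match(seg, tm_from_tool_a)
    print(seg, round(score, 2), "usable" if score >= MIN_MATCH else "below threshold")
```

Although every word of the new segment is already in the TM, no single entry covers enough of it to reach the threshold, so the translator gets no proposal.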

Segmentation rules

• Rules that the tool applies to the text to translate to split it up into segments
  • paragraph
  • sentence
  • phrase
  • incomplete sentences in bulleted lists
  • single words (headings, "Note", "Attention")

Segmentation rules

• End of segment rules (common to the default settings of all tools)
  • Dot at the end of a sentence (not after known abbreviations)
  • Question mark, exclamation mark
  • Paragraph mark
  • Colon



Page 19: A Guide to Open Standards and Open SourceOAXAL C. TMX Translation Memory Exchange •From the TMX specification: •…The purpose of the TMX format is to provide a standard method

Level 1

• Formatting that is applied to the source and target text of a translation unit is not exported to the TMX file; only the pure text is.

• Original: This sentence has some formatting

• In TMX: This sentence has some formatting

Level 2

• Formatting that is applied to the source and target text of a translation unit is exported to the TMX file.

• Different tools use different ways of encoding that information (placeholders or actual formatting information).
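The difference between the two levels can be sketched in a few lines of Python. This is only an illustration, not how any particular tool works: the function name `level1_text` and the sample segment are assumptions, and the set of inline code elements follows TMX 1.4b. The sketch drops the inline codes of a Level 2 segment and keeps only the text, which is roughly what a Level 1 export contains.

```python
import xml.etree.ElementTree as ET

# Inline code elements of TMX 1.4b whose content is markup, not translatable text.
INLINE_CODE_TAGS = {"bpt", "ept", "ph", "it", "ut"}

def level1_text(seg_xml: str) -> str:
    """Return the plain text of a <seg>, dropping inline formatting codes."""
    seg = ET.fromstring(seg_xml)
    parts = [seg.text or ""]
    for child in seg:
        if child.tag not in INLINE_CODE_TAGS:
            # e.g. <hi>: keep the text it wraps
            parts.append("".join(child.itertext()))
        parts.append(child.tail or "")
    return "".join(parts)

# Hypothetical Level 2 segment with one bold run.
level2_seg = ('<seg>This is the <bpt i="1" type="bold"></bpt>first'
              '<ept i="1"></ept> sentence</seg>')
print(level1_text(level2_seg))
```

Going the other way is not possible: once the codes are dropped, the formatting cannot be recovered from the Level 1 text.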

Level 2

MemoQ – Word DOC with formatting:
<seg>This is the <bpt i="1" type="bold"></bpt>first<ept i="1"></ept> sentence this is <bpt i="2" type="ulined"></bpt>another<ept i="2"></ept> sentence</seg>

Trados 2009 – Word DOC with formatting:
<seg>This is the <bpt i="1" type="Bold" />first<ept i="1" /> sentence this is <bpt i="2" type="Underline" />another<ept i="2" /> sentence</seg>

Trados 2007 / 8.2 / 8.3 – Word DOC with formatting:
<seg>This is the <bpt i="1">&lt;cf bold=&quot;on&quot;&gt;</bpt>first<ept i="1">&lt;/cf&gt;</ept> sentence this is <bpt i="2">&lt;cf underlinestyle=&quot;single&quot;&gt;</bpt>another<ept i="2">&lt;/cf&gt;</ept> sentence</seg>

Level 2

MemoQ – HTML file with link:
<seg>Text with a link to <bpt i="1" type="link">&lt;a href = &quot;http://www.samplehtml.com/page1.htm&quot;&gt;</bpt>another page<ept i="1">&lt;/a&gt;</ept></seg>

Trados 2009 – HTML file with link:
<seg>Text with a link to <bpt i="1" type="19" x="1" />another page<ept i="1" /></seg>

Trados 2007 / 8.2 / 8.3 – HTML file with link:
<seg>Text with a link to <bpt i="1">&lt;a href=&quot;http://www.samplehtml.com/page1.htm&quot;&gt;</bpt>another page<ept i="1">&lt;/a&gt;</ept></seg>

OmegaT – HTML file with link:
OmegaT internal format: <seg>Text with a link to &lt;a0&gt;another page&lt;/a0&gt;</seg>
TMX Level 2 format: <seg>Text with a link to <bpt i="0" x="0">&lt;a0&gt;</bpt>another page<ept i="0">&lt;/a0&gt;</ept></seg>

Level 2

MemoQ – InDesign:
<seg>InDesign Text with <bpt i="1" type="bold"></bpt>formatting in bold<ept i="1"></ept></seg>

Trados 2009 – InDesign:
<seg>InDesign text with <bpt i="1" type="pt16" x="1" />formatting in bold<ept i="1" /></seg>

Trados 2007 / 8.2 / 8.3 – InDesign:
<seg>InDesign text with <bpt i="1">&lt;cf ptfs=&quot;c_Bold&quot;&gt;</bpt>formatting in bold<ept i="1">&lt;/cf&gt;</ept></seg>

Implications of different tags for formatting

• Tools that use placeholder tags do not include the actual formatting information in the TMX file.

• Other tools might only be able to re-use the text, especially if the formatting is only applied in the target segment but not in the source.

• The result of the exchange would then be the same as with TMX Level 1 (text only).

• TMX files which carry the actual formatting information will yield better matches in other tools that can read this information.
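A receiving tool could check which kind of tags a segment carries before deciding how much of the formatting it can reuse. The following is a rough heuristic sketch, not a documented feature of any tool; the function name `carries_native_codes` and both sample segments (with a shortened link target) are illustrative assumptions. A segment counts as carrying actual formatting information if any inline code element has content.

```python
import xml.etree.ElementTree as ET

def carries_native_codes(seg_xml: str) -> bool:
    """True if any inline code element of the <seg> has non-empty content."""
    seg = ET.fromstring(seg_xml)
    for element in seg.iter():
        if element.tag in ("bpt", "ept", "ph", "it") and (element.text or "").strip():
            return True
    return False

# Inline codes that carry the original HTML markup.
with_codes = ('<seg>Text with a link to <bpt i="1">&lt;a href=&quot;page1.htm&quot;&gt;</bpt>'
              'another page<ept i="1">&lt;/a&gt;</ept></seg>')
# Placeholder tags only: the markup itself is not in the file.
placeholders = ('<seg>Text with a link to <bpt i="1" type="19" x="1" />'
                'another page<ept i="1" /></seg>')

print(carries_native_codes(with_codes))    # True
print(carries_native_codes(placeholders))  # False
```

Empty tags that only carry a `type` attribute also come out as False here, although the attribute still gives the receiving tool a hint about the kind of formatting.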

Where do you use TMX?

• Transferring data between different translation memory tools

• Checking tools, QA tools

• TM maintenance tools

• Basis for bilingual term extraction

Reusing TMX data

• Although translation memory tools share the same basic idea (storing source-target language pairs and recycling translations), they have realized it in different ways.

• Exchange with TMX works, but one issue can lower the match rates nonetheless: the segmentation rules.

SRX – Segmentation Rules Exchange

• From the SRX specification:

  • "…The purpose of the SRX format is to provide a standard method to describe segmentation rules that are being exchanged among tools and/or translation vendors."

  • "…is intended to enhance the TMX standard…"

Why SRX?

• Tool A: semicolon is end of segment

  • This is a sentence; this is another sentence.

  • TM system sees two separate segments

• Tool B: semicolon is NOT end of segment

  • This is a sentence; this is another sentence.

  • TM system sees one segment

  • No match from the TMX data

  • Match rate is around 50%, while the usual minimum match setting is around 70%
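The effect can be reproduced with a small Python sketch. The `segment` function is a deliberately simple stand-in for a tool's segmentation rules, and `difflib.SequenceMatcher` is used here only as a rough similarity measure; real TM systems compute their match rates differently, so the number is indicative at best.

```python
import re
from difflib import SequenceMatcher

def segment(text: str, break_on_semicolon: bool) -> list[str]:
    """Split text after sentence-final punctuation, optionally also after ';'."""
    pattern = r"(?<=[.!?;])\s+" if break_on_semicolon else r"(?<=[.!?])\s+"
    return [part for part in re.split(pattern, text.strip()) if part]

def best_match(segment_text: str, memory: list[str]) -> float:
    """Highest similarity (0.0 to 1.0) between a segment and the stored segments."""
    return max(SequenceMatcher(None, segment_text, stored).ratio() for stored in memory)

text = "This is a sentence; this is another sentence."
tool_a = segment(text, break_on_semicolon=True)    # two segments stored in the TM
tool_b = segment(text, break_on_semicolon=False)   # one segment to look up

print(tool_a)
print(tool_b)
print(round(best_match(tool_b[0], tool_a), 2))     # clearly below a perfect match
```

The same text that would be a 100% match inside Tool A only produces a fuzzy match in Tool B, purely because of the different segmentation.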

Segmentation rules

• Rules that the tool applies to the text to be translated in order to split it up into segments:

  • paragraph

  • sentence

  • phrase

  • incomplete sentences in bulleted lists

  • single words (headings, “Note”, “Attention”)

Segmentation rules

• End of segment rules (common to the default settings of all tools)
  • Dot at the end of a sentence (not after known abbreviations)
  • Question mark, exclamation mark
  • Paragraph mark
  • Colon

• End of segment rules (different for different tools)
  • Semicolon
  • Tab character
  • Sub segments (index entries, footnotes, graphics)

Comparison of default rules

              Workbench   Transit   DV                           SDLX                         Across
Colon         end         end       end                          no end                       no end
Semicolon     no end      end       end                          no end                       no end
Tab           end         no end    no end                       no end                       no end
Soft return   no end      no end    end in Word, no end in PPT   end in Word, no end in PPT   no end

What can SRX do and what not?

• It can only show the segmentation rule settings at the time of export.

• It cannot show any changes that have been applied to the segmentation rules during the use of the TM.

• Sometimes a rule from system 1 cannot be re-created in system 2; the rule will then be ignored.
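To make the idea of exchangeable rules concrete, here is a minimal Python sketch that reads rules in the style of SRX 2.0 (`rule` elements with `break`, `beforebreak` and `afterbreak`) and applies them to a text. The fragment is simplified and hypothetical: a real SRX file wraps the rules in further elements and uses a namespace, and the regular expression dialect of the exporting tool may differ from Python's, which is one reason a rule sometimes cannot be re-created in another system.

```python
import re
import xml.etree.ElementTree as ET

# Simplified, hypothetical fragment in the style of SRX 2.0 (no namespace, no header).
SRX_FRAGMENT = r"""
<languagerule languagerulename="Default">
  <rule break="no"><beforebreak>\betc\.</beforebreak><afterbreak>\s</afterbreak></rule>
  <rule break="yes"><beforebreak>[.?!;]</beforebreak><afterbreak>\s</afterbreak></rule>
</languagerule>
"""

def load_rules(xml_text):
    """Return (is_break, before_regex, after_regex) tuples in document order."""
    rules = []
    for rule in ET.fromstring(xml_text.strip()).iter("rule"):
        before = rule.findtext("beforebreak", default="")
        after = rule.findtext("afterbreak", default="")
        rules.append((rule.get("break", "yes") == "yes",
                      re.compile("(?:" + before + ")$"),
                      re.compile(after)))
    return rules

def apply_rules(text, rules):
    """Split text at every position where the first matching rule says break."""
    segments, start = [], 0
    for pos in range(1, len(text)):
        for is_break, before, after in rules:
            if before.search(text[:pos]) and after.match(text, pos):
                if is_break:
                    segments.append(text[start:pos])
                    start = pos
                break  # the first matching rule decides
    segments.append(text[start:])
    return [s.strip() for s in segments if s.strip()]

rules = load_rules(SRX_FRAGMENT)
print(apply_rules("Use TMX, SRX etc. for exchange; rules differ. Really!", rules))
```

Because the exception rule for "etc." comes first, no break is made after the abbreviation, while the semicolon rule behaves like Tool A in the earlier example.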

TBX – TermBase Exchange

• From the working draft of the TBX specification:

  • TBX is an open, XML-based standard format for terminological data.

  • TBX is designed to support the analysis, representation, dissemination and exchange of information from human-oriented terminological databases (termbases).

  • TBX is built on the basis of ISO 12620 (data categories) and ISO 12200 (MARTIF core structure).

TMX TBX

[Figure: annotated terminology entry. Callouts: global information in entry head; language ID (one per language section); administrative data of this language; term in English; term in French; information on term level]
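The entry structure named in the callouts can be sketched with a small Python example. The element names (`termEntry`, `langSet`, `tig`, `term`, `descrip`, `termNote`) are those commonly used in TBX, but the sample entry, its data categories and the function name `terms_by_language` are illustrative assumptions, not taken from the slides.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical entry: global information first, then one section per language.
TERM_ENTRY = """
<termEntry id="c1">
  <descrip type="subjectField">user interface</descrip>
  <langSet xml:lang="en">
    <tig><term>button</term><termNote type="partOfSpeech">noun</termNote></tig>
  </langSet>
  <langSet xml:lang="fr">
    <tig><term>bouton</term><termNote type="partOfSpeech">noun</termNote></tig>
  </langSet>
</termEntry>
"""

def terms_by_language(entry_xml):
    """Map each language ID of the entry to the terms stored for it."""
    entry = ET.fromstring(entry_xml.strip())
    return {lang_set.get(XML_LANG): [term.text for term in lang_set.iter("term")]
            for lang_set in entry.iter("langSet")}

print(terms_by_language(TERM_ENTRY))
```

Unlike a TMX translation unit, the entry is organised around a concept: information can sit at entry level, language level or term level.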

Where could you use TBX?

• Exchange of terminological data
  • From term base to source language checking tools
  • Between term bases of translation memory tools
  • From term base to terminology checking tools
  • From term base to terminology extraction tools (as stop word lists)
  • From term base to the dictionary of a machine translation tool

• For indexing keywords in document management systems, content management systems, knowledge management systems

• Publishing terminological data on the intranet / Internet

• Optimization of search engines / text mining by searching for synonyms automatically

XLIFF – XML Localization Interchange File Format

• The XLIFF format aims to
  • separate localizable text from formatting (deal with one file format in translation instead of different processes to extract, filter and convert text from different file formats)
  • enable multiple tools to work on the source text
  • store information that is helpful in supporting a localization process (like metadata on versions of source and target segments)

• An XLIFF file is bilingual and can be the container for a number of individual files.

• Each file element in an XLIFF file contains a header (project data such as contact information, project phases, pointers to reference material and information on the skeleton file) and a body section with the actual text segments.

XLIFF

• XLIFF can carry several translation matches.

• Additional fields can contain context, author, creation tool, history…

<trans-unit id="n1">
  <source>This is a sentence</source>
  <target xml:lang="de">Dies ist ein Satz</target>
  <alt-trans match-quality="100" tool="TM_System">
    <source>This is a sentence</source>
    <target xml:lang="de">Das ist ein Satz</target>
  </alt-trans>
  <alt-trans match-quality="70" tool="TM_System">
    <source>This is a short sentence</source>
    <target xml:lang="de">Dies ist ein kurzer Satz</target>
  </alt-trans>
</trans-unit>

<trans-unit id="1" resname="IDCANCEL" restype="button" coord="885014">
  <source xml:lang="en">Cancel</source>
</trans-unit>

• Information on the name of a button and its coordinates (for visual representation of the button in a localization tool)
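A tool that receives such a translation unit can rank the alternative matches it carries. The sketch below reads the `alt-trans` elements of a unit like the first example and picks the one with the highest `match-quality`. It is a simplified illustration: the function name `best_alt_trans` is an assumption, and a real XLIFF 1.2 file declares a namespace on its root element that this fragment leaves out.

```python
import xml.etree.ElementTree as ET

TRANS_UNIT = """
<trans-unit id="n1">
  <source>This is a sentence</source>
  <target xml:lang="de">Dies ist ein Satz</target>
  <alt-trans match-quality="100" tool="TM_System">
    <source>This is a sentence</source>
    <target xml:lang="de">Das ist ein Satz</target>
  </alt-trans>
  <alt-trans match-quality="70" tool="TM_System">
    <source>This is a short sentence</source>
    <target xml:lang="de">Dies ist ein kurzer Satz</target>
  </alt-trans>
</trans-unit>
"""

def best_alt_trans(unit_xml):
    """Return (match quality, target text) of the best alt-trans, or None."""
    unit = ET.fromstring(unit_xml.strip())
    candidates = []
    for alt in unit.findall("alt-trans"):
        # match-quality is free text in practice; a trailing % sign is tolerated here.
        quality = float(alt.get("match-quality", "0").rstrip("%"))
        candidates.append((quality, alt.findtext("target")))
    return max(candidates, default=None)

print(best_alt_trans(TRANS_UNIT))
```

The proposed matches travel with the segment, so the translating tool does not need access to the translation memory that produced them.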

Where is XLIFF useful?

• Where experience with XML exists

• Projects contain many different file formats

• All formats are converted to XLIFF for translation

• Different tools need to be used during localization

• Different translations (alt-trans) or languages are needed as reference

Any idea why XLIFF should NOT be the cure for everything?

• Instead of developing parsers for different file formats (to read the file into a translation tool), developers now need to create parsers to convert those file formats to XLIFF.

• Some file formats can already be dealt with (Office, HTML, XML…) – why should a new parser be created for those?

• XLIFF has its variants, like all XML-based files – how can you make sure that each tool can process any of the possible XLIFF flavours?

The magic quadrant again, just to remember the distinction

                   Closed Source                   Open Source
Open Standards     Q1 Open-Closed: Good            Q2 Open-Open: Good
Proprietary ways   Q3 Proprietary-Closed: Bad      Q4 Proprietary-Open: Wild

6 Talking Legalese in OSS

OSS (Open Source Software)

• Copyleft and Permissive Licensing

• GPL (General Public License by FSF)

• BSD (Berkeley Software Distribution)

• Apache License 2.0

• MS open source licensing

• Ms-PL (Public License)

• Ms-RL (Reciprocal License)

Derived/derivative works vs. dynamic linkage

• Paradigmatic case is an app running on an OS

Contributions

Distribution

6 Talking Legalese in standards

RF (Royalty Free) vs. RAND (Reasonable and Non-Discriminatory)

Standards can be licensed under different terms:

• RF

• RAND – there is no general consensus to date on whether RAND is open or proprietary

• Proprietary

By open standard we mean RF standards only.

The opposite of an open standard is any proprietary way of doing things, be it standardized or not.

7 Open Tools

• Infrastructure baseline
  • PostgreSQL, MySQL, Open ACS, Petri Nets, Alfresco, jBoss

• TM Server
  • TinyTM

• Workflow capabilities through any CALPMS
  • Open: ]po[, .LRN; closed source: AIT Projetex, LTC Worx, Plunet Business Manager, etc.

• CAT Clients
  • OmegaT, Metatexis, TinyTM Word macro, Multilizer, and whoever wants to implement the protocol

• Filters
  • OKAPI framework, mainstream XLIFF generators such as Sun or Adobe

8 Use Cases - odt + OmegaT

• Level 2 TMX support

• Has TinyTM connector in alpha

• Has plaintext glossary support

OpenOffice.org (.odt)

8 Use Cases - TinyTM

• The TinyTM protocol is licensed under the "LGPL V2.1 or higher".

• The rest of the TinyTM code is licensed under the "GPL V2.0 or higher".

• Translation clients can be licensed under any license.

• TinyTM documentation is licensed under the Creative Commons Attribution License.

8 Use Cases - TinyTM - Status

• Alpha functionality

• Java TMX importer

• Markup parsing logic

-------------------------

• Protocol freeze in public discussion on SourceForge

• Protocol features
  • Conservative extension with many important enhancements

8 Use Cases - TinyTM – new protocol features

• Industry and domain taxonomy
  • open to accommodate future TDA taxonomy research

• Inline markup handling
  1. stripped plain version
  2. normalized version with formatting placeholders i
  3. TMX version with full TMX markup

• Advanced leverage functions
  • hash based context storage
  • hash and trigram enhanced fuzzy search
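The general idea behind a trigram-based fuzzy search can be sketched as follows. This is not TinyTM's actual implementation, which the slides do not show: it is a generic illustration that compares the sets of character trigrams of two segments, and the function names and padding scheme are assumptions.

```python
def trigrams(text: str) -> set[str]:
    """Set of lower-cased character trigrams, padded so short strings still yield some."""
    padded = "  " + text.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def trigram_similarity(a: str, b: str) -> float:
    """Share of trigrams the two strings have in common (Jaccard index, 0.0 to 1.0)."""
    set_a, set_b = trigrams(a), trigrams(b)
    return len(set_a & set_b) / len(set_a | set_b)

query = "This is a sentence"
memory = ["This is a short sentence", "Cancel", "This is a sentence"]

# Rank the stored segments by similarity to the query, best first.
for stored in sorted(memory, key=lambda s: trigram_similarity(query, s), reverse=True):
    print(round(trigram_similarity(query, stored), 2), stored)
```

A hash of the stripped plain version would find exact matches cheaply, and a trigram comparison of this kind could then rank the remaining fuzzy candidates.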

9 Discussion

Thanks for your attention


10 References

Open Standards (selective)

• XLIFF 1.2: http://docs.oasis-open.org/xliff/v1.2/os/xliff-core.html

• TMX 1.4b: http://www.lisa.org/fileadmin/standards/tmx1.4/tmx.htm

• SRX 2.0: http://www.lisa.org/fileadmin/standards/srx20.html

Open Tools (selective)

• OKAPI Framework: http://okapi.sourceforge.net

• OmegaT: http://sourceforge.net/projects/omegat

• OmegaT+: http://omegatplus.sourceforge.net

• Open ACS: http://openacs.org

• PostgreSQL: http://www.postgresql.org

• TinyTM: http://tinytm.sourceforge.net

10 References

Tools continued

• XML to XLIFF to XML: https://sourceforge.net/projects/xliffroundtrip

• TMX Compliance Kit (test for TMX certification): http://www.lisa.org/Translation-Memory-e.34.0.html

• TBX Checker: http://www.lisa.org/TBX-Resources.650.0.html

Standards continued

• Unicode Standard Annex #29: http://www.unicode.org/reports/tr29/

• W3C ITS: http://www.w3.org/TR/its/

Page 21: A Guide to Open Standards and Open SourceOAXAL C. TMX Translation Memory Exchange •From the TMX specification: •…The purpose of the TMX format is to provide a standard method

Level 2

seggtThis is the ltbpt i=1 type=boldgtltbptgtfirstltept i=1gtlteptgt sentence this is ltbpt i=2 type=ulinedgtltbptgtanotherltept i=2gtlteptgt sentenceltseggt

MemoQ ndash Word DOC with formatting

ltseggtThis is the ltbpt i=1 type=Bold gtfirstltept i=1 gt sentence this isltbpt i=2 type=Underline gtanotherltept i=2 gt sentenceltseggt

Trados 2009 ndash Word DOC with formatting

ltseggtThis is the ltbpt i=1gtampltcf bold=ampquotonampquotampgtltbptgtfirstltept i=1gtampltcfampgtlteptgt sentence this is ltbpt i=2gtampltcf underlinestyle=ampquotsingleampquotampgtltbptgtanotherltept i=2gtampltcfampgtlteptgtsentenceltseggt

Trados 2007 82 83 ndash Word DOC with formatting

Level 2

MemoQ ndash HTML file with link

Trados 2009 ndash HTML file with link

Trados 2007 82 83 ndash HTML file with link

ltseggtText with a link to ltbpt i=1gtamplta href=ampquothttpwwwsamplehtmlcompage1htmampquotampgtltbptgtanother pageltept i=1gtampltaampgtlteptgtltseggt

ltseggtText with a link to ltbpt i=1 type=19 x=1 gtanother pageltept i=1 gtltseggt

ltseggtText with a link to ltbpt i=1 type=linkgtamplta href = ampquothttpwwwsamplehtmlcompage1htmampquotampgtltbptgtanother pageltept i=1gtampltaampgtlteptgtltseggt

OmegaT internal format ltseggtText with a link to amplta0ampgtanother pageamplta0ampgtltseggt

TMX Level 2 format ltseggtText with a link to ltbpt i=0 x=0gtamplta0ampgtltbptgtanother pageltept i=0gtamplta0ampgtlteptgtltseggt

OmegaT - HTML file with link

Level 2

MemoQ ndash InDesign

Trados 2009 ndash InDesign

Trados 2007 82 83 ndash InDesign

ltseggtInDesign text with ltbpt i=1gtampltcf ptfs=ampquotc_Boldampquotampgtltbptgtformatting in boldltept i=1gtampltcfampgtlteptgtltseggt

ltseggtInDesign text with ltbpt i=1 type=pt16 x=1 gtformatting in boldltept i=1 gtltseggt

ltseggtInDesign Text with ltbpt i=1 type=boldgtltbptgtformtatting in boldltept i=1gtlteptgtltseggt

Implications of different tags for formatting

bull Tools that use placeholder tags do not include the actual formatting information in the TMX file

bull Other tools might only be able to re-use the text especially if the formatting is only applied in the target segment but not in the source

bull The result of the exchange would then be the same as with TMX level 1 (text only)

bull TMX files which carry the actual formatting information will yield better matches in other tools that can read this information

Where do you use TMX

bull Transfering data between different translation memory tools

bull Checking tools QA tools

bull TM maintenance tools

bull Basis for bilingual term extraxtion

Reusing TMX data

bull Although Translation Memory Tools have the same basic idea (storing source-target language pairs and recycling translations) this has been realized in different ways

bull Exchange with TMX works but there is an issue that can lower the match rates nonethelesshellip the segmentation rules

SRX ndash Segmentation Rules Exchange

bull From the SRX specification

bull hellipThe purpose of the SRX format is to provide a standard method to describe segmentation rules that are being exchanged among tools andor translation vendors

bull hellipis intended to enhance the TMX standardhellip

Why SRX

bull Tool Abull Semicolon is end of segment

bull This is a sentence this is another sentence

bull TM system sees two separate segments

bull Tool Bbull Semicolon is NOT end of segment

bull This is a sentence this is another sentence

bull TM system sees one segmentbull No match from the TMX data

bull Match rate around 50 usual setting around 70

Segmentation rules

bull Rules that the tool applies to the text to translate to split it up into segments

bull paragraph

bull sentence

bull phrase

bull incomplete sentences in bulleted lists

bull single words (headings ldquoNoterdquo ldquoAttentionrdquo)

Segmentation rules

bull End of segment rules (common to the default settings of all tools)bull Dot at the end of a sentence (not after known

abbreviations)bull Question mark exclamation markbull Paragraph markbull Colon

bull End of segment rules (different for different tools)bull Semicolonbull Tab characterbull Sub segments (index entries footnotes

graphics)

Comparison of default rules

Workbench Transit DV SDLX Across

Colon end end end no end no end

Semi-

colon

no end end end no end no end

Tab end no end no end no end no end

Soft

return

no end no end end in

Word no

end in

PPT

end in

Word no

end in

PPT

no end

What can SRX do and what not

bull It can only show the segmentation rule settings at the time of export

bull It cannot show any changes that have been applied in the segmentation rules during the use of the TM

bull Sometimes the rules from system 1 cannot be re-created in system 2 then the rule will be ignored

TBX ndash TermBase Exchange

bull From the working draft of the TBX specificationbull TBX is an open XML-based standard format for terminological data

bull TBX is designed to support the analysis representation dissemination and exchange of information from human-oriented terminological databases (termbases)

bull TBX is built on the basis of ISO 12620 (data categories) and ISO 12200 (MARTIF core structure)

TMX TBX

Zerfasszaacde 35

Term in English

Term in French

Global information in entry head

Information on term level

Administrative data of this language

Language ID

Language ID

Where could you use TBX

bull Exchange terminological databull From term base to source language checking toolsbull Between term bases of translation memory toolsbull From term base to terminology checking toolsbull From term base to terminology extraction tools (as stop word lists)bull From term base to dictionary of a machine translation tool

bull For indexing keywords in document management systems content management systems knowledge management systems

bull Publishing terminological data on the Intranet Internet

bull Optimization of search enginges text mining by searching for synonyms automatically

XLIFF – XML Localization Interchange File Format

• The XLIFF format aims to
  • separate localizable text from formatting (deal with one file format in translation instead of different processes to extract, filter, and convert text from different file formats)
  • enable multiple tools to work on the source text
  • store information that is helpful in supporting a localization process (like metadata on versions of source and target segments)
• An XLIFF file is bilingual and can be the container for a number of individual files
• Each file element in an XLIFF file contains a header (project data such as contact information, project phases, pointers to reference material, and information on the skeleton file) and a body section with the actual text segments
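The container layout described above (one XLIFF document, several file elements, each with a header and a body) can be sketched as follows. This is a simplified illustration: the function name `build_xliff`, the sample file names, and the note text are assumptions, and the XLIFF namespace declaration is left out to keep the sketch short.

```python
import xml.etree.ElementTree as ET


def build_xliff(files, source_lang="en", target_lang="de"):
    """Build a bare-bones XLIFF 1.2-style tree.

    files maps an original file name to a list of (unit id, source text).
    """
    xliff = ET.Element("xliff", {"version": "1.2"})
    for original, units in files.items():
        file_el = ET.SubElement(xliff, "file", {
            "original": original,
            "source-language": source_lang,
            "target-language": target_lang,
            "datatype": "plaintext",
        })
        # Header: project-level data (here just a placeholder note)
        header = ET.SubElement(file_el, "header")
        ET.SubElement(header, "note").text = "project data would go here"
        # Body: the actual text segments
        body = ET.SubElement(file_el, "body")
        for unit_id, text in units:
            unit = ET.SubElement(body, "trans-unit", {"id": unit_id})
            ET.SubElement(unit, "source").text = text
    return xliff


doc = build_xliff({
    "menu.txt": [("1", "Cancel")],
    "help.txt": [("1", "This is a sentence")],
})
print(ET.tostring(doc, encoding="unicode"))
```

Each original file keeps its own header and body, so one XLIFF file can travel through the process as a single bilingual package.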

XLIFF

• XLIFF can carry several translation matches
• Additional fields can contain context, author, creation tool, history…

<trans-unit id="n1">
  <source>This is a sentence</source>
  <target xml:lang="de">Dies ist ein Satz</target>
  <alt-trans match-quality="100" tool="TM_System">
    <source>This is a sentence</source>
    <target xml:lang="de">Das ist ein Satz</target>
  </alt-trans>
  <alt-trans match-quality="70" tool="TM_System">
    <source>This is a short sentence</source>
    <target xml:lang="de">Dies ist ein kurzer Satz</target>
  </alt-trans>
</trans-unit>

<trans-unit id="1" resname="IDCANCEL" restype="button" coord="8;8;50;14">
  <source xml:lang="en">Cancel</source>
</trans-unit>

• Information on the name of a button and its coordinates (for visual representation of the button in a localization tool)
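A tool that receives such a trans-unit can rank the alternative matches it carries. The following is a minimal sketch, assuming the trans-unit from the slide as input; the function name `ranked_matches` is hypothetical, and the namespace handling a real XLIFF file needs is omitted.

```python
import xml.etree.ElementTree as ET

TRANS_UNIT = """<trans-unit id="n1">
  <source>This is a sentence</source>
  <target xml:lang="de">Dies ist ein Satz</target>
  <alt-trans match-quality="100" tool="TM_System">
    <source>This is a sentence</source>
    <target xml:lang="de">Das ist ein Satz</target>
  </alt-trans>
  <alt-trans match-quality="70" tool="TM_System">
    <source>This is a short sentence</source>
    <target xml:lang="de">Dies ist ein kurzer Satz</target>
  </alt-trans>
</trans-unit>"""


def ranked_matches(trans_unit_xml):
    """Return (quality, source, target) for each alt-trans, best match first."""
    unit = ET.fromstring(trans_unit_xml)
    matches = []
    for alt in unit.findall("alt-trans"):
        # Some tools write the value with a trailing percent sign
        quality = float(alt.get("match-quality", "0").rstrip("%"))
        matches.append((quality, alt.findtext("source"), alt.findtext("target")))
    return sorted(matches, key=lambda m: m[0], reverse=True)


for quality, source, target in ranked_matches(TRANS_UNIT):
    print(quality, source, "->", target)
```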

Where is XLIFF useful?

• Where experience with XML exists
• Projects contain many different file formats
• All formats are converted to XLIFF for translation
• Different tools need to be used during localization
• Different translations (alt-trans) or languages needed as reference

Any idea why XLIFF should NOT be the cure for everything?

• Instead of developing parsers for different file formats (to read the file into a translation tool), developers now need to create parsers to convert those file formats to XLIFF
• Some file formats can already be dealt with (Office, HTML, XML…) – why should a new parser be created for those?
• XLIFF has its variants, like all XML-based files – how can you make sure that each tool can process any of the possible XLIFF flavours?

The magic quadrant again, just to remember the distinction

                    Closed Source                    Open Source
Open Standards      Q1: Open-Closed (Good)           Q2: Open-Open (Good)
Proprietary ways    Q3: Proprietary-Closed (Bad)     Q4: Proprietary-Open (Wild)

6 Talking Legalese in OSS

OSS (Open Source Software)

• Copyleft and permissive licensing
  • GPL (General Public License, by the FSF)
  • BSD (Berkeley Software Distribution)
  • Apache License 2.0
• MS open source licensing
  • Ms-PL (Public License)
  • Ms-RL (Reciprocal License)

Derived/derivative works vs. dynamic linkage

• Paradigmatic case is an app running on an OS

Contributions

Distribution

6 Talking Legalese in standards

RF (Royalty Free) vs. RAND (Reasonable and Non-Discriminatory)

Standards can be licensed under different terms

• RF
• RAND – there is no general consensus to date on whether RAND is Open or Proprietary
• Proprietary

By open standard we mean just RF standards

The opposite of an open standard is any proprietary way of doing things, even if it is standardized

7 Open Tools

• Infrastructure baseline
  • PostgreSQL, MySQL, Open ACS, Petri Nets, Alfresco, jBoss
• TM Server
  • TinyTM
• Workflow capabilities through any CALPMS
  • Open: ]po[, .LRN; Closed source: AIT Projetex, LTC Worx, Plunet Business Manager, etc.
• CAT Clients
  • OmegaT, MetaTexis, TinyTM Word macro, Multilizer, and whoever wants to implement the protocol
• Filters
  • Okapi Framework, mainstream XLIFF generators such as Sun or Adobe

8 Use Cases - .odt + OmegaT

• Level 2 TMX support
• Has TinyTM connector in alpha
• Has plaintext glossary support

OpenOffice.org .odt

8 Use Cases - TinyTM

• The TinyTM protocol is licensed under the "LGPL V2.1 or higher"
• The rest of the TinyTM code is licensed under the "GPL V2.0 or higher"
• Translation clients can be licensed under any license
• TinyTM documentation is licensed under the Creative Commons Attribution License

8 Use Cases - TinyTM - Status

• Alpha functionality
  • Java TMX importer
  • Markup parsing logic

-------------------------

• Protocol freeze in public discussion on SourceForge
• Protocol features
  • Conservative extension with many important enhancements

8 Use Cases - TinyTM – new protocol features

• Industry and domain taxonomy
  • open to accommodate future TDA taxonomy research
• Inline markup handling
  1. stripped plain version
  2. normalized version with formatting placeholders i
  3. TMX version with full TMX markup
• Advanced leverage functions
  • hash-based context storage
  • hash- and trigram-enhanced fuzzy search
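To make "hash-based context storage" and "trigram enhanced fuzzy search" more concrete, here is a minimal sketch of the general techniques. It is not the TinyTM protocol or its implementation: the function names, the padding scheme, the Jaccard scoring, the use of SHA-1, and the 0.7 threshold are all assumptions chosen for illustration.

```python
import hashlib


def trigrams(text):
    """Set of lower-cased character trigrams, padded so word edges count."""
    padded = f"  {text.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a, b):
    """Jaccard overlap of the two trigram sets, between 0.0 and 1.0."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def context_hash(previous_segment, next_segment):
    """Compact fingerprint of the neighbouring segments."""
    data = f"{previous_segment}\x1f{next_segment}".encode("utf-8")
    return hashlib.sha1(data).hexdigest()


def lookup(memory, segment, context, threshold=0.7):
    """Best entry at or above the threshold; a context match breaks ties."""
    scored = []
    for entry in memory:
        score = trigram_similarity(segment, entry["source"])
        if score >= threshold:
            scored.append((score, entry["context"] == context, entry))
    if not scored:
        return None
    return max(scored, key=lambda item: (item[0], item[1]))[2]


memory = [
    {"source": "This is a sentence", "target": "Das ist ein Satz",
     "context": context_hash("Heading", "Next line")},
    {"source": "This is a sentence", "target": "Dies ist ein Satz",
     "context": context_hash("Intro", "Outro")},
]
hit = lookup(memory, "This is a sentence", context_hash("Intro", "Outro"))
print(hit["target"])
```

Storing only the hash of the surrounding segments keeps the memory small while still letting the server prefer a translation that was made in the same context.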

9 Discussion

Thanks for your attention

© 2009 Moravia IT a.s. and Angelika Zerfass

A Guide to Open Standards and Open Source

A Conceptual Case Study

Angelika Zerfass, zerfass@zaac.de
David Filip, Ph.D., davidf@moraviaworldwide.com

10 References

Open Standards (selective)

• XLIFF 1.2: http://docs.oasis-open.org/xliff/v1.2/os/xliff-core.html
• TMX 1.4b: http://www.lisa.org/fileadmin/standards/tmx1.4/tmx.htm
• SRX 2.0: http://www.lisa.org/fileadmin/standards/srx20.html

Open Tools (selective)

• Okapi Framework: http://okapi.sourceforge.net
• OmegaT: http://sourceforge.net/projects/omegat
• OmegaT+: http://omegatplus.sourceforge.net
• Open ACS: http://openacs.org
• PostgreSQL: http://www.postgresql.org
• TinyTM: http://tinytm.sourceforge.net

10 References (continued)

Tools continued

• XML to XLIFF to XML: https://sourceforge.net/projects/xliffroundtrip
• TMX Compliance Kit (test for TMX certification): http://www.lisa.org/Translation-Memory-e.34.0.html
• TBX Checker: http://www.lisa.org/TBX-Resources.650.0.html

Standards continued

• Unicode Standard Annex 29: http://www.unicode.org/reports/tr29
• W3C ITS: http://www.w3.org/TR/its


Level 2

MemoQ – HTML file with link
<seg>Text with a link to <bpt i="1">&lt;a href=&quot;http://www.samplehtml.com/page1.htm&quot;&gt;</bpt>another page<ept i="1">&lt;/a&gt;</ept></seg>

Trados 2009 – HTML file with link
<seg>Text with a link to <bpt i="1" type="19" x="1" />another page<ept i="1" /></seg>

Trados 2007 (8.2, 8.3) – HTML file with link
<seg>Text with a link to <bpt i="1" type="link">&lt;a href = &quot;http://www.samplehtml.com/page1.htm&quot;&gt;</bpt>another page<ept i="1">&lt;/a&gt;</ept></seg>

OmegaT – HTML file with link
OmegaT internal format: <seg>Text with a link to &lt;a0&gt;another page&lt;/a0&gt;</seg>
TMX Level 2 format: <seg>Text with a link to <bpt i="0" x="0">&lt;a0&gt;</bpt>another page<ept i="0">&lt;/a0&gt;</ept></seg>

Level 2

MemoQ – InDesign
<seg>InDesign text with <bpt i="1">&lt;cf ptfs=&quot;c_Bold&quot;&gt;</bpt>formatting in bold<ept i="1">&lt;/cf&gt;</ept></seg>

Trados 2009 – InDesign
<seg>InDesign text with <bpt i="1" type="pt16" x="1" />formatting in bold<ept i="1" /></seg>

Trados 2007 (8.2, 8.3) – InDesign
<seg>InDesign Text with <bpt i="1" type="bold"></bpt>formatting in bold<ept i="1"></ept></seg>

Implications of different tags for formatting

• Tools that use placeholder tags do not include the actual formatting information in the TMX file
• Other tools might only be able to re-use the text, especially if the formatting is applied only in the target segment but not in the source
• The result of the exchange would then be the same as with TMX Level 1 (text only)
• TMX files which carry the actual formatting information will yield better matches in other tools that can read this information
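The difference between the two tagging styles can be checked mechanically. The sketch below, which assumes the two HTML-link segments shown earlier as input (the function names `plain_text` and `native_codes` are hypothetical), reduces a Level 2 segment to its Level 1 text and lists whatever native formatting codes the inline tags actually carry.

```python
import xml.etree.ElementTree as ET

# Segment whose inline tags carry the real HTML code
SEG_FULL = ('<seg>Text with a link to <bpt i="1">&lt;a href=&quot;'
            'http://www.samplehtml.com/page1.htm&quot;&gt;</bpt>another page'
            '<ept i="1">&lt;/a&gt;</ept></seg>')
# Segment whose inline tags are empty placeholders
SEG_PLACEHOLDER = ('<seg>Text with a link to <bpt i="1" type="19" x="1" />'
                   'another page<ept i="1" /></seg>')


def plain_text(element):
    """Text of the segment with all inline codes dropped (TMX Level 1 view)."""
    parts = [element.text or ""]
    for child in element:
        if child.tag == "hi":
            # <hi> wraps translatable text, so keep its content
            parts.append(plain_text(child))
        # content of bpt/ept/ph/it/ut is native code and is skipped
        parts.append(child.tail or "")
    return "".join(parts)


def native_codes(element):
    """Formatting codes stored inside the paired inline tags, if any."""
    return [child.text for child in element.iter()
            if child.tag in ("bpt", "ept") and child.text]


for seg in (SEG_FULL, SEG_PLACEHOLDER):
    root = ET.fromstring(seg)
    print(plain_text(root), native_codes(root))
```

Both segments reduce to the same text, but only the first one still tells the importing tool that the formatting was an HTML link.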

Where do you use TMX?

• Transferring data between different translation memory tools
• Checking tools / QA tools
• TM maintenance tools
• Basis for bilingual term extraction

Reusing TMX data

• Although translation memory tools share the same basic idea (storing source-target language pairs and recycling translations), each tool has realized it in a different way
• Exchange with TMX works, but there is an issue that can lower the match rates nonetheless… the segmentation rules

SRX – Segmentation Rules eXchange

• From the SRX specification:
  • …The purpose of the SRX format is to provide a standard method to describe segmentation rules that are being exchanged among tools and/or translation vendors
  • …is intended to enhance the TMX standard…

Why SRX?

• Tool A
  • Semicolon is end of segment
  • This is a sentence; this is another sentence.
  • TM system sees two separate segments
• Tool B
  • Semicolon is NOT end of segment
  • This is a sentence; this is another sentence.
  • TM system sees one segment
  • No match from the TMX data
    • Match rate around 50%; the usual minimum match setting is around 70%
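The Tool A / Tool B scenario can be reproduced in a few lines. This is only a sketch: the regular expressions stand in for each tool's real segmentation engine, and `difflib.SequenceMatcher` is a generic similarity measure used here as an assumption, so its score will not equal the match rate any particular TM system reports.

```python
import re
from difflib import SequenceMatcher

TEXT = "This is a sentence; this is another sentence."


def segment(text, break_on_semicolon):
    """Split after a full stop, and optionally after a semicolon."""
    pattern = r"(?<=[.;])\s+" if break_on_semicolon else r"(?<=\.)\s+"
    return [s for s in re.split(pattern, text) if s]


def best_match(segment_text, memory):
    """Highest similarity of the new segment against the stored segments."""
    return max(SequenceMatcher(None, segment_text, m).ratio() for m in memory)


tool_a_memory = segment(TEXT, break_on_semicolon=True)   # exported as TMX
tool_b_segments = segment(TEXT, break_on_semicolon=False)

print(tool_a_memory)     # two segments stored by Tool A
print(tool_b_segments)   # one segment looked up by Tool B
print(best_match(tool_b_segments[0], tool_a_memory))  # below 1.0: fuzzy only
```

The text is identical in both tools, yet Tool B never gets an exact match because its lookup unit is longer than either unit Tool A stored.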

Segmentation rules

• Rules that the tool applies to the text to be translated in order to split it up into segments
  • paragraph
  • sentence
  • phrase
  • incomplete sentences in bulleted lists
  • single words (headings, "Note", "Attention")

Segmentation rules

• End-of-segment rules (common to the default settings of all tools)
  • Dot at the end of a sentence (not after known abbreviations)
  • Question mark, exclamation mark
  • Paragraph mark
  • Colon
• End-of-segment rules (different for different tools)
  • Semicolon
  • Tab character
  • Sub segments (index entries, footnotes, graphics)
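Rules like these are typically written as ordered break / no-break pairs of patterns, which is also how SRX describes them: at each position the first rule whose "before" and "after" patterns both match decides whether the text is split there. The sketch below illustrates that evaluation order; the rule constants, the abbreviation list, and the `segment` function are assumptions made for this example, not the defaults of any tool.

```python
import re

# (break?, pattern before the position, pattern after the position)
NO_BREAK_AFTER_ABBREVIATION = (False, r"\b(?:Dr|Mr|etc|e\.g|i\.e)\.", r"\s")
BREAK_AFTER_SENTENCE_MARK = (True, r"[.?!:]", r"\s")
BREAK_AFTER_SEMICOLON = (True, r";", r"\s")

COMMON_RULES = [NO_BREAK_AFTER_ABBREVIATION, BREAK_AFTER_SENTENCE_MARK]


def segment(text, rules):
    """Apply ordered rules; the first rule matching a position decides it."""
    compiled = [(brk, re.compile(before + r"\Z"), re.compile(after))
                for brk, before, after in rules]
    segments, start = [], 0
    for pos in range(1, len(text)):
        for brk, before, after in compiled:
            if before.search(text[:pos]) and after.match(text, pos):
                if brk:
                    segments.append(text[start:pos].strip())
                    start = pos
                break  # later rules are not consulted for this position
    tail = text[start:].strip()
    if tail:
        segments.append(tail)
    return segments


TEXT = "See Dr. Smith; he knows. Really? Yes."
print(segment(TEXT, COMMON_RULES))
print(segment(TEXT, COMMON_RULES + [BREAK_AFTER_SEMICOLON]))
```

Putting the no-break exception first is what keeps "Dr." from ending a segment, and adding or removing the semicolon rule is the single difference between the two outputs.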


Page 23: A Guide to Open Standards and Open SourceOAXAL C. TMX Translation Memory Exchange •From the TMX specification: •…The purpose of the TMX format is to provide a standard method

Level 2

MemoQ ndash InDesign

Trados 2009 ndash InDesign

Trados 2007 82 83 ndash InDesign

ltseggtInDesign text with ltbpt i=1gtampltcf ptfs=ampquotc_Boldampquotampgtltbptgtformatting in boldltept i=1gtampltcfampgtlteptgtltseggt

ltseggtInDesign text with ltbpt i=1 type=pt16 x=1 gtformatting in boldltept i=1 gtltseggt

ltseggtInDesign Text with ltbpt i=1 type=boldgtltbptgtformtatting in boldltept i=1gtlteptgtltseggt

Implications of different tags for formatting

bull Tools that use placeholder tags do not include the actual formatting information in the TMX file

bull Other tools might only be able to re-use the text especially if the formatting is only applied in the target segment but not in the source

bull The result of the exchange would then be the same as with TMX level 1 (text only)

bull TMX files which carry the actual formatting information will yield better matches in other tools that can read this information

Where do you use TMX

bull Transfering data between different translation memory tools

bull Checking tools QA tools

bull TM maintenance tools

bull Basis for bilingual term extraxtion

Reusing TMX data

bull Although Translation Memory Tools have the same basic idea (storing source-target language pairs and recycling translations) this has been realized in different ways

bull Exchange with TMX works but there is an issue that can lower the match rates nonethelesshellip the segmentation rules

SRX ndash Segmentation Rules Exchange

bull From the SRX specification

bull hellipThe purpose of the SRX format is to provide a standard method to describe segmentation rules that are being exchanged among tools andor translation vendors

bull hellipis intended to enhance the TMX standardhellip

Why SRX

bull Tool Abull Semicolon is end of segment

bull This is a sentence this is another sentence

bull TM system sees two separate segments

bull Tool Bbull Semicolon is NOT end of segment

bull This is a sentence this is another sentence

bull TM system sees one segmentbull No match from the TMX data

bull Match rate around 50 usual setting around 70

Segmentation rules

bull Rules that the tool applies to the text to translate to split it up into segments

bull paragraph

bull sentence

bull phrase

bull incomplete sentences in bulleted lists

bull single words (headings ldquoNoterdquo ldquoAttentionrdquo)

Segmentation rules

bull End of segment rules (common to the default settings of all tools)bull Dot at the end of a sentence (not after known

abbreviations)bull Question mark exclamation markbull Paragraph markbull Colon

bull End of segment rules (different for different tools)bull Semicolonbull Tab characterbull Sub segments (index entries footnotes

graphics)

Comparison of default rules

Workbench Transit DV SDLX Across

Colon end end end no end no end

Semi-

colon

no end end end no end no end

Tab end no end no end no end no end

Soft

return

no end no end end in

Word no

end in

PPT

end in

Word no

end in

PPT

no end

What can SRX do and what not

bull It can only show the segmentation rule settings at the time of export

bull It cannot show any changes that have been applied in the segmentation rules during the use of the TM

bull Sometimes the rules from system 1 cannot be re-created in system 2 then the rule will be ignored

TBX ndash TermBase Exchange

bull From the working draft of the TBX specificationbull TBX is an open XML-based standard format for terminological data

bull TBX is designed to support the analysis representation dissemination and exchange of information from human-oriented terminological databases (termbases)

bull TBX is built on the basis of ISO 12620 (data categories) and ISO 12200 (MARTIF core structure)

TMX TBX

Zerfasszaacde 35

Term in English

Term in French

Global information in entry head

Information on term level

Administrative data of this language

Language ID

Language ID

Where could you use TBX

bull Exchange terminological databull From term base to source language checking toolsbull Between term bases of translation memory toolsbull From term base to terminology checking toolsbull From term base to terminology extraction tools (as stop word lists)bull From term base to dictionary of a machine translation tool

bull For indexing keywords in document management systems content management systems knowledge management systems

bull Publishing terminological data on the Intranet Internet

bull Optimization of search enginges text mining by searching for synonyms automatically

XLIFF ndash XML Localization Interchange File Format

bull The XLIFF format aims to bull separate localizable text from formatting (deal with one file

format in translation instead of different processes to extract filter convert text from different file formats)

bull enable multiple tools to work on the source textbull store information that is helpful in supporting a localization

process (like meta data on versions of source and target segemtns)

bull An XLIFF file is bilingual and can be the container for a number of individual files

bull Each file element in an XLIFF file contains a header (project data such as contact information project phases pointers to reference material and information on the skeleton file) and a body section with the actual text segments

XLIFF

bull XLIFF can carry several translation matches

bull Additional fields can contain context author creation tool historyhellip

lttrans-unit id=n1gtltsourcegtThis is a sentenceltsourcegtlttarget xmllang=degtDies ist ein

Satzlttargetgtltalt-trans match-

quality=100 tool=TM_SystemgtltsourcegtThis is a sentenceltsourcegtlttarget xmllang=degtDas ist ein

Satzlttargetgtltalt-transgtltalt-trans match-

quality=70 tool=TM_SystemgtltsourcegtThis is a short sentenceltsourcegtlttarget xmllang=degtDies ist ein kurzer

Satzlttargetgtltalt-transgtlttrans-unitgt

lttrans-unit id=1 resname=IDCANCEL restype=button coord=885014gt

ltsource xmllang=engtCancelltsourcegtlttrans-unitgt

bull Information on name of a button and its coordinates (for visual representation of the button in a localization tool)

Where is XLIFF useful

bull Where experience with XML exists

bull Projects contain many different file formats

bull All formats are converted to XLIFF for translation

bull Different tools need to be used during localization

bull Different translations (alt-trans) or languages needed as reference

Any idea why XLIFF should NOT be the cure for everything

bull Instead of developing parsers for different file

formats (to read in the file into a translation tool)

developers now need to create parsers to convert

those file formats to XLIFF

bull Some file formats already can be dealt with

(Office HTML XMLhellip) ndash why should a new parser

be created for those

bull XLIFF has its variants like all XML-based files ndashhow can you make sure that each tool can process any of the possible XLIFF flavours

Q1

Open-Closed

Good

Q2

Open-Open

Good

Q3

Proprietary-Closed

Bad

Q4

Proprietary-Open

Wild

The magic quadrant again justto remember the distinction

Open Standards

Open Source

Closed Source

Proprietary ways

6 Talking Legalesein OSS

OSS (Open Source Software)

bull Copyleft and Permissive Licensing

bull GPL (General Public License by FSF)

bull BSD (Berkeley Software Distribution)

bull Apache License 20

bull MS open source licensing

bull Ms-PL (Public License)

bull Ms-RL (Reciprocal License)

DerivedDerivative works vs dynamic linkage

bull Paradigmatic case is app running on OS

Contributions

Distribution

6 Talking Legalesein standards

RF (Royalty Free) vs RAND (Reasonable and Non-Discriminatory)

Standards can be licensed under different terms

bull RF

bull RAND ndash there is no general consensus up to date on whether RAND is Open or Proprietary

bull Proprietary

By open standard we mean just RF Standards

The opposite of Open Standard is any proprietary way of doing things be it standardized

7 Open Tools

bull Infrastructure baseline

bull PostgreSQL MySQL Open ACS Petri Nets Alfresco jBoss

bull TM Server

bull TinyTM

bull Workflow capabilities through any CALPMS

bull Open ]po[ LRN Closed source AIT Projetex LTC Worx Plunet Business Manager etc

bull CAT Clients

bull OmegaT Metatexis TinyTM Word macro Multilizer andwhoever wants to implement the protocol

bull Filters

bull OKAPI framework mainstream XLIFF generators such as Sun or Adobe

8 Use Cases - odt + OmegaT

bull Level 2 TMX supportbull Has TinyTM connector in alphabull Has plaintext glossary support

OpenOfficeorgodt

8 Use cases - TinyTM

bull TinyTM protocol is licensed under the rdquoLGPL V21or higherldquo

bull Rest of the TinyTM code is licensed under therdquoGPL V20 or higherldquo

bull Translation clients can be licensed underany license

bull TinyTM documentation is licensed under the Creative Commons Attribution License

8 Use Cases - TinyTM - Status

bull Alpha functionality

bull Java TMX importer

bull Mark up parsing logic

-------------------------bull Protocol freeze

in public discussionon Sourceforge

bull Protocol featuresbull Conservative extension with many important enhancements

8 Use CasesTinyTM ndash new protocol features

bull Industry and domain taxonomybull open to accommodate future TDA taxonomy

research

bull Inline markup handling1 stripped plain version2 normalized version with formatting

placeholders i3 TMX version with full TMX markup

bull Advanced leverage functionsbull hash based context storagebull hash and trigram enhanced fuzzy search

9 Discussion

Thanks for your attention

copy 2009 Moravia IT as and Angelika Zerfass

A Guide to Open Standards and Open Source

A Conceptual Case Study

Angelika ZerfasszerfasszaacdeDavid Filip PhD

davidfmoraviaworldwidecom

10 References

Open Standards (selective)

bull XLIFF 12 httpdocsoasis-openorgxliffv12osxliff-corehtml

bull TMX 14b httpwwwlisaorgfileadminstandardstmx14tmxhtm

bull SRX 20 httpwwwlisaorgfileadminstandardssrx20html

Open Tools (selective)

bull OKAPI Framework httpokapisourceforgenet

bull OmegaT httpsourceforgenetprojectsomegat

bull OmegaT+ httpomegatplussourceforgenet

bull Open ACS httpopenacsorg

bull PostgreSQL httpwwwpostgresqlorg

bull TinyTM httptinytmsourceforgenet

10 References

Tools continued

bull XML to XLIFF to XML httpssourceforgenetprojectsxliffroundtrip

bull TMX Complicance Kit (test for TMX certification)httpwwwlisaorgTranslation-Memory-e340html

bull TBX CheckerhttpwwwlisaorgTBX-Resources6500html

Standards continued

bull Unicode Standard Annex 29 httpwwwunicodeorgreportstr29

bull W3C ITS httpwwww3orgTRits

Page 24: A Guide to Open Standards and Open SourceOAXAL C. TMX Translation Memory Exchange •From the TMX specification: •…The purpose of the TMX format is to provide a standard method

Implications of different tags for formatting

bull Tools that use placeholder tags do not include the actual formatting information in the TMX file

bull Other tools might only be able to re-use the text especially if the formatting is only applied in the target segment but not in the source

bull The result of the exchange would then be the same as with TMX level 1 (text only)

bull TMX files which carry the actual formatting information will yield better matches in other tools that can read this information

Where do you use TMX

bull Transfering data between different translation memory tools

bull Checking tools QA tools

bull TM maintenance tools

bull Basis for bilingual term extraxtion

Reusing TMX data

bull Although Translation Memory Tools have the same basic idea (storing source-target language pairs and recycling translations) this has been realized in different ways

bull Exchange with TMX works but there is an issue that can lower the match rates nonethelesshellip the segmentation rules

SRX ndash Segmentation Rules Exchange

bull From the SRX specification

bull hellipThe purpose of the SRX format is to provide a standard method to describe segmentation rules that are being exchanged among tools andor translation vendors

bull hellipis intended to enhance the TMX standardhellip

Why SRX

bull Tool Abull Semicolon is end of segment

bull This is a sentence this is another sentence

bull TM system sees two separate segments

bull Tool Bbull Semicolon is NOT end of segment

bull This is a sentence this is another sentence

bull TM system sees one segmentbull No match from the TMX data

bull Match rate around 50 usual setting around 70

Segmentation rules

bull Rules that the tool applies to the text to translate to split it up into segments

bull paragraph

bull sentence

bull phrase

bull incomplete sentences in bulleted lists

bull single words (headings ldquoNoterdquo ldquoAttentionrdquo)

Segmentation rules

bull End of segment rules (common to the default settings of all tools)bull Dot at the end of a sentence (not after known

abbreviations)bull Question mark exclamation markbull Paragraph markbull Colon

bull End of segment rules (different for different tools)bull Semicolonbull Tab characterbull Sub segments (index entries footnotes

graphics)

Comparison of default rules

Workbench Transit DV SDLX Across

Colon end end end no end no end

Semi-

colon

no end end end no end no end

Tab end no end no end no end no end

Soft

return

no end no end end in

Word no

end in

PPT

end in

Word no

end in

PPT

no end

What can SRX do and what not

bull It can only show the segmentation rule settings at the time of export

bull It cannot show any changes that have been applied in the segmentation rules during the use of the TM

bull Sometimes the rules from system 1 cannot be re-created in system 2 then the rule will be ignored

TBX ndash TermBase Exchange

bull From the working draft of the TBX specificationbull TBX is an open XML-based standard format for terminological data

bull TBX is designed to support the analysis representation dissemination and exchange of information from human-oriented terminological databases (termbases)

bull TBX is built on the basis of ISO 12620 (data categories) and ISO 12200 (MARTIF core structure)

TMX TBX

Zerfasszaacde 35

Term in English

Term in French

Global information in entry head

Information on term level

Administrative data of this language

Language ID

Language ID

Where could you use TBX?

• Exchange terminological data
  • From term base to source language checking tools
  • Between term bases of translation memory tools
  • From term base to terminology checking tools
  • From term base to terminology extraction tools (as stop word lists)
  • From term base to dictionary of a machine translation tool

• For indexing keywords in document management systems, content management systems, knowledge management systems

• Publishing terminological data on the intranet / Internet

• Optimization of search engines / text mining by searching for synonyms automatically

XLIFF – XML Localization Interchange File Format

• The XLIFF format aims to:
  • separate localizable text from formatting (deal with one file format in translation instead of different processes to extract, filter, and convert text from different file formats)
  • enable multiple tools to work on the source text
  • store information that is helpful in supporting a localization process (like metadata on versions of source and target segments)

• An XLIFF file is bilingual and can be the container for a number of individual files

• Each file element in an XLIFF file contains a header (project data such as contact information, project phases, pointers to reference material, and information on the skeleton file) and a body section with the actual text segments

XLIFF

• XLIFF can carry several translation matches

• Additional fields can contain context, author, creation tool, history…

<trans-unit id="n1">
  <source>This is a sentence</source>
  <target xml:lang="de">Dies ist ein Satz</target>
  <alt-trans match-quality="100" tool="TM_System">
    <source>This is a sentence</source>
    <target xml:lang="de">Das ist ein Satz</target>
  </alt-trans>
  <alt-trans match-quality="70" tool="TM_System">
    <source>This is a short sentence</source>
    <target xml:lang="de">Dies ist ein kurzer Satz</target>
  </alt-trans>
</trans-unit>

<trans-unit id="1" resname="IDCANCEL" restype="button" coord="885014">
  <source xml:lang="en">Cancel</source>
</trans-unit>

• Information on the name of a button and its coordinates (for visual representation of the button in a localization tool)
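How a tool might read the alternative matches from such a trans-unit can be sketched as follows. This is a minimal illustration using Python's standard `xml.etree.ElementTree`; the fragment omits the XLIFF namespace and the surrounding file, header and body elements, and the `list_matches` helper is a hypothetical name, not part of any XLIFF tool.

```python
import xml.etree.ElementTree as ET

# The first trans-unit from the slide, without namespace or enclosing elements.
XLIFF_UNIT = """
<trans-unit id="n1">
  <source>This is a sentence</source>
  <target xml:lang="de">Dies ist ein Satz</target>
  <alt-trans match-quality="100" tool="TM_System">
    <source>This is a sentence</source>
    <target xml:lang="de">Das ist ein Satz</target>
  </alt-trans>
  <alt-trans match-quality="70" tool="TM_System">
    <source>This is a short sentence</source>
    <target xml:lang="de">Dies ist ein kurzer Satz</target>
  </alt-trans>
</trans-unit>
"""


def list_matches(unit_xml):
    """Return (quality, source, target) for each alt-trans, best match first."""
    unit = ET.fromstring(unit_xml)
    matches = []
    for alt in unit.findall("alt-trans"):
        # Assumption: match-quality is a number, optionally followed by "%".
        quality = int(alt.get("match-quality", "0").rstrip("%"))
        matches.append((quality, alt.findtext("source"), alt.findtext("target")))
    return sorted(matches, key=lambda m: m[0], reverse=True)


for quality, source, target in list_matches(XLIFF_UNIT):
    print(quality, source, "->", target)
```

Because the matches travel inside the file, a second tool can offer them to the translator without access to the translation memory that produced them.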

Where is XLIFF useful?

• Where experience with XML exists

• Projects contain many different file formats

• All formats are converted to XLIFF for translation

• Different tools need to be used during localization

• Different translations (alt-trans) or languages are needed as reference

Any idea why XLIFF should NOT be the cure for everything?

• Instead of developing parsers for different file formats (to read the file into a translation tool), developers now need to create parsers to convert those file formats to XLIFF

• Some file formats can already be dealt with (Office, HTML, XML…) – why should a new parser be created for those?

• XLIFF has its variants, like all XML-based files – how can you make sure that each tool can process any of the possible XLIFF flavours?

The magic quadrant again, just to remember the distinction

                   Closed Source                  Open Source
Open Standards     Q1: Open-Closed (Good)         Q2: Open-Open (Good)
Proprietary ways   Q3: Proprietary-Closed (Bad)   Q4: Proprietary-Open (Wild)

6 Talking Legalese in OSS

OSS (Open Source Software)

• Copyleft and Permissive Licensing
  • GPL (General Public License, by FSF)
  • BSD (Berkeley Software Distribution)
  • Apache License 2.0

• MS open source licensing
  • Ms-PL (Public License)
  • Ms-RL (Reciprocal License)

Derived/Derivative works vs. dynamic linkage

• Paradigmatic case is an app running on an OS

Contributions

Distribution

6 Talking Legalese in standards

RF (Royalty Free) vs. RAND (Reasonable and Non-Discriminatory)

Standards can be licensed under different terms:

• RF

• RAND – there is no general consensus to date on whether RAND is Open or Proprietary

• Proprietary

By "open standard" we mean just RF standards

The opposite of an Open Standard is any proprietary way of doing things, be it standardized

7 Open Tools

• Infrastructure baseline
  • PostgreSQL, MySQL, Open ACS, Petri Nets, Alfresco, jBoss

• TM Server
  • TinyTM

• Workflow capabilities through any CALPMS
  • Open: ]po[, LRN; closed source: AIT Projetex, LTC Worx, Plunet Business Manager, etc.

• CAT Clients
  • OmegaT, Metatexis, TinyTM Word macro, Multilizer, and whoever wants to implement the protocol

• Filters
  • OKAPI framework, mainstream XLIFF generators such as Sun or Adobe

8 Use Cases - odt + OmegaT

• Level 2 TMX support
• Has TinyTM connector in alpha
• Has plaintext glossary support

OpenOffice.org / odt

8 Use Cases - TinyTM

• TinyTM protocol is licensed under the "LGPL V2.1 or higher"

• Rest of the TinyTM code is licensed under the "GPL V2.0 or higher"

• Translation clients can be licensed under any license

• TinyTM documentation is licensed under the Creative Commons Attribution License

8 Use Cases - TinyTM - Status

• Alpha functionality
  • Java TMX importer
  • Markup parsing logic

• Protocol freeze in public discussion on Sourceforge

• Protocol features
  • Conservative extension with many important enhancements

8 Use Cases - TinyTM – new protocol features

• Industry and domain taxonomy
  • open to accommodate future TDA taxonomy research

• Inline markup handling
  1. stripped plain version
  2. normalized version with formatting placeholders
  3. TMX version with full TMX markup

• Advanced leverage functions
  • hash-based context storage
  • hash- and trigram-enhanced fuzzy search
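The idea behind combining hashes with trigrams can be sketched as follows. This is only an illustration under stated assumptions, not TinyTM's implementation: the hash is used here for exact-match lookup only (context storage is not shown), similarity is the Jaccard overlap of character trigrams, and `SketchMemory`, `trigrams`, `similarity` and `segment_hash` are hypothetical names.

```python
import hashlib


def trigrams(text):
    """Set of character trigrams of the lower-cased, padded text."""
    padded = f"  {text.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def similarity(a, b):
    """Jaccard overlap of the trigram sets, between 0.0 and 1.0."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def segment_hash(text):
    """Stable key for exact-match lookup."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


class SketchMemory:
    """Toy translation memory: hash for exact hits, trigrams for fuzzy hits."""

    def __init__(self):
        self.by_hash = {}

    def add(self, source, target):
        self.by_hash[segment_hash(source)] = (source, target)

    def lookup(self, query, threshold=0.5):
        exact = self.by_hash.get(segment_hash(query))
        if exact:
            return [(1.0, exact[0], exact[1])]
        scored = [(similarity(query, s), s, t) for s, t in self.by_hash.values()]
        return sorted((m for m in scored if m[0] >= threshold), reverse=True)


tm = SketchMemory()
tm.add("This is a sentence", "Dies ist ein Satz")
print(tm.lookup("This is a sentence"))        # exact hit via the hash
print(tm.lookup("This is a short sentence"))  # fuzzy hit via trigram overlap
```

The hash lookup answers exact matches without comparing strings, so the more expensive trigram comparison runs only when no exact hit exists.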

9 Discussion

Thanks for your attention


10 References

Open Standards (selective)

• XLIFF 1.2: http://docs.oasis-open.org/xliff/v1.2/os/xliff-core.html

• TMX 1.4b: http://www.lisa.org/fileadmin/standards/tmx1.4/tmx.htm

• SRX 2.0: http://www.lisa.org/fileadmin/standards/srx20.html

Open Tools (selective)

• OKAPI Framework: http://okapi.sourceforge.net

• OmegaT: http://sourceforge.net/projects/omegat

• OmegaT+: http://omegatplus.sourceforge.net

• Open ACS: http://openacs.org

• PostgreSQL: http://www.postgresql.org

• TinyTM: http://tinytm.sourceforge.net

10 References

Tools continued

• XML to XLIFF to XML: https://sourceforge.net/projects/xliffroundtrip

• TMX Compliance Kit (test for TMX certification): http://www.lisa.org/Translation-Memory-e.34.0.html

• TBX Checker: http://www.lisa.org/TBX-Resources6500html

Standards continued

• Unicode Standard Annex #29: http://www.unicode.org/reports/tr29

• W3C ITS: http://www.w3.org/TR/its

Page 27: A Guide to Open Standards and Open SourceOAXAL C. TMX Translation Memory Exchange •From the TMX specification: •…The purpose of the TMX format is to provide a standard method

SRX ndash Segmentation Rules Exchange

bull From the SRX specification

bull hellipThe purpose of the SRX format is to provide a standard method to describe segmentation rules that are being exchanged among tools andor translation vendors

bull hellipis intended to enhance the TMX standardhellip

Why SRX

bull Tool Abull Semicolon is end of segment

bull This is a sentence this is another sentence

bull TM system sees two separate segments

bull Tool Bbull Semicolon is NOT end of segment

bull This is a sentence this is another sentence

bull TM system sees one segmentbull No match from the TMX data

bull Match rate around 50 usual setting around 70

Segmentation rules

bull Rules that the tool applies to the text to translate to split it up into segments

bull paragraph

bull sentence

bull phrase

bull incomplete sentences in bulleted lists

bull single words (headings ldquoNoterdquo ldquoAttentionrdquo)

Segmentation rules

bull End of segment rules (common to the default settings of all tools)bull Dot at the end of a sentence (not after known

abbreviations)bull Question mark exclamation markbull Paragraph markbull Colon

bull End of segment rules (different for different tools)bull Semicolonbull Tab characterbull Sub segments (index entries footnotes

graphics)

Comparison of default rules

Workbench Transit DV SDLX Across

Colon end end end no end no end

Semi-

colon

no end end end no end no end

Tab end no end no end no end no end

Soft

return

no end no end end in

Word no

end in

PPT

end in

Word no

end in

PPT

no end

What can SRX do and what not

bull It can only show the segmentation rule settings at the time of export

bull It cannot show any changes that have been applied in the segmentation rules during the use of the TM

bull Sometimes the rules from system 1 cannot be re-created in system 2 then the rule will be ignored

TBX ndash TermBase Exchange

bull From the working draft of the TBX specificationbull TBX is an open XML-based standard format for terminological data

bull TBX is designed to support the analysis representation dissemination and exchange of information from human-oriented terminological databases (termbases)

bull TBX is built on the basis of ISO 12620 (data categories) and ISO 12200 (MARTIF core structure)

TMX TBX

Zerfasszaacde 35

Term in English

Term in French

Global information in entry head

Information on term level

Administrative data of this language

Language ID

Language ID

Where could you use TBX

bull Exchange terminological databull From term base to source language checking toolsbull Between term bases of translation memory toolsbull From term base to terminology checking toolsbull From term base to terminology extraction tools (as stop word lists)bull From term base to dictionary of a machine translation tool

bull For indexing keywords in document management systems content management systems knowledge management systems

bull Publishing terminological data on the Intranet Internet

bull Optimization of search enginges text mining by searching for synonyms automatically

XLIFF ndash XML Localization Interchange File Format

bull The XLIFF format aims to bull separate localizable text from formatting (deal with one file

format in translation instead of different processes to extract filter convert text from different file formats)

bull enable multiple tools to work on the source textbull store information that is helpful in supporting a localization

process (like meta data on versions of source and target segemtns)

bull An XLIFF file is bilingual and can be the container for a number of individual files

bull Each file element in an XLIFF file contains a header (project data such as contact information project phases pointers to reference material and information on the skeleton file) and a body section with the actual text segments

XLIFF

bull XLIFF can carry several translation matches

bull Additional fields can contain context author creation tool historyhellip

lttrans-unit id=n1gtltsourcegtThis is a sentenceltsourcegtlttarget xmllang=degtDies ist ein

Satzlttargetgtltalt-trans match-

quality=100 tool=TM_SystemgtltsourcegtThis is a sentenceltsourcegtlttarget xmllang=degtDas ist ein

Satzlttargetgtltalt-transgtltalt-trans match-

quality=70 tool=TM_SystemgtltsourcegtThis is a short sentenceltsourcegtlttarget xmllang=degtDies ist ein kurzer

Satzlttargetgtltalt-transgtlttrans-unitgt

lttrans-unit id=1 resname=IDCANCEL restype=button coord=885014gt

ltsource xmllang=engtCancelltsourcegtlttrans-unitgt

bull Information on name of a button and its coordinates (for visual representation of the button in a localization tool)

Where is XLIFF useful

bull Where experience with XML exists

bull Projects contain many different file formats

bull All formats are converted to XLIFF for translation

bull Different tools need to be used during localization

bull Different translations (alt-trans) or languages needed as reference

Any idea why XLIFF should NOT be the cure for everything

bull Instead of developing parsers for different file

formats (to read in the file into a translation tool)

developers now need to create parsers to convert

those file formats to XLIFF

bull Some file formats already can be dealt with

(Office HTML XMLhellip) ndash why should a new parser

be created for those

bull XLIFF has its variants like all XML-based files ndashhow can you make sure that each tool can process any of the possible XLIFF flavours

Q1

Open-Closed

Good

Q2

Open-Open

Good

Q3

Proprietary-Closed

Bad

Q4

Proprietary-Open

Wild

The magic quadrant again justto remember the distinction

Open Standards

Open Source

Closed Source

Proprietary ways

6 Talking Legalese: in OSS

OSS (Open Source Software)

• Copyleft and Permissive Licensing
  • GPL (General Public License by FSF)
  • BSD (Berkeley Software Distribution)
  • Apache License 2.0
• MS open source licensing
  • Ms-PL (Public License)
  • Ms-RL (Reciprocal License)

Derived/Derivative works vs. dynamic linkage

• Paradigmatic case is an app running on an OS

Contributions

Distribution

6 Talking Legalese: in standards

RF (Royalty Free) vs. RAND (Reasonable and Non-Discriminatory)

Standards can be licensed under different terms

• RF
• RAND - to date there is no general consensus on whether RAND is open or proprietary
• Proprietary

By open standard we mean just RF standards

The opposite of an open standard is any proprietary way of doing things, be it standardized or not

7 Open Tools

• Infrastructure baseline
  • PostgreSQL, MySQL, OpenACS, Petri Nets, Alfresco, JBoss
• TM Server
  • TinyTM
• Workflow capabilities through any CALPMS
  • Open: ]po[, .LRN; closed source: AIT Projetex, LTC Worx, Plunet Business Manager, etc.
• CAT Clients
  • OmegaT, MetaTexis, TinyTM Word macro, Multilizer, and whoever wants to implement the protocol
• Filters
  • Okapi Framework, mainstream XLIFF generators such as Sun or Adobe

8 Use Cases - odt + OmegaT

• Level 2 TMX support
• Has TinyTM connector in alpha
• Has plaintext glossary support

OpenOffice.org .odt

8 Use Cases - TinyTM

• TinyTM protocol is licensed under the "LGPL V2.1 or higher"
• Rest of the TinyTM code is licensed under the "GPL V2.0 or higher"
• Translation clients can be licensed under any license
• TinyTM documentation is licensed under the Creative Commons Attribution License

8 Use Cases - TinyTM - Status

• Alpha functionality
• Java TMX importer
• Mark up parsing logic
-------------------------
• Protocol freeze in public discussion on SourceForge
• Protocol features
  • Conservative extension with many important enhancements

8 Use Cases - TinyTM - new protocol features

• Industry and domain taxonomy
  • open to accommodate future TDA taxonomy research
• Inline markup handling
  1. stripped plain version
  2. normalized version with formatting placeholders
  3. TMX version with full TMX markup
• Advanced leverage functions
  • hash based context storage
  • hash and trigram enhanced fuzzy search
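As a rough illustration of what "hash and trigram enhanced fuzzy search" can mean, the Python sketch below looks up exact matches by hash and falls back to a trigram overlap score for fuzzy matches. This is not TinyTM's actual implementation or protocol: the class `ToyMemory`, the Jaccard-style score, the SHA-1 hash and the 0.5 threshold are all assumptions chosen for the example.

```python
import hashlib


def trigrams(text):
    """Set of lower-cased character trigrams, padded so word edges count too."""
    padded = "  " + text.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def similarity(a, b):
    """Jaccard overlap of the two trigram sets, between 0.0 and 1.0."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def segment_hash(text):
    """Stable key for exact lookups (hash choice is an assumption)."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


class ToyMemory:
    """Hypothetical in-memory translation memory, for illustration only."""

    def __init__(self):
        self.by_hash = {}

    def add(self, source, target):
        self.by_hash[segment_hash(source)] = (source, target)

    def lookup(self, query, threshold=0.5):
        # Exact match: a single dictionary access via the hash.
        exact = self.by_hash.get(segment_hash(query))
        if exact:
            return [(1.0, exact[0], exact[1])]
        # Fuzzy match: score every stored segment by trigram overlap.
        scored = [(similarity(query, src), src, tgt)
                  for src, tgt in self.by_hash.values()]
        hits = [match for match in scored if match[0] >= threshold]
        return sorted(hits, key=lambda match: match[0], reverse=True)


if __name__ == "__main__":
    memory = ToyMemory()
    memory.add("This is a sentence", "Dies ist ein Satz")
    print(memory.lookup("This is a sentence"))
    print(memory.lookup("This is a short sentence"))
```

A real server would index the trigrams in the database instead of scanning every segment, which is where a trigram index earns its keep.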

9 Discussion

Thanks for your attention

© 2009 Moravia IT a.s. and Angelika Zerfass

A Guide to Open Standards and Open Source

A Conceptual Case Study

Angelika Zerfass, zerfass@zaac.de
David Filip, Ph.D., davidf@moraviaworldwide.com

10 References

Open Standards (selective)

• XLIFF 1.2: http://docs.oasis-open.org/xliff/v1.2/os/xliff-core.html
• TMX 1.4b: http://www.lisa.org/fileadmin/standards/tmx1.4/tmx.htm
• SRX 2.0: http://www.lisa.org/fileadmin/standards/srx20.html

Open Tools (selective)

• Okapi Framework: http://okapi.sourceforge.net
• OmegaT: http://sourceforge.net/projects/omegat
• OmegaT+: http://omegatplus.sourceforge.net
• OpenACS: http://openacs.org
• PostgreSQL: http://www.postgresql.org
• TinyTM: http://tinytm.sourceforge.net

Tools continued

• XML to XLIFF to XML: https://sourceforge.net/projects/xliffroundtrip
• TMX Compliance Kit (test for TMX certification): http://www.lisa.org/Translation-Memory-e.34.0.html
• TBX Checker: http://www.lisa.org/TBX-Resources.650.0.html

Standards continued

• Unicode Standard Annex #29: http://www.unicode.org/reports/tr29
• W3C ITS: http://www.w3.org/TR/its


Why SRX?

• Tool A
  • Semicolon is end of segment
  • This is a sentence; this is another sentence.
  • TM system sees two separate segments
• Tool B
  • Semicolon is NOT end of segment
  • This is a sentence; this is another sentence.
  • TM system sees one segment
  • No match from the TMX data
  • Match rate is around 50%, while the usual minimum match setting is around 70%
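The Tool A / Tool B scenario can be reproduced in a few lines of Python. This is a sketch under stated assumptions: the splitting regexes are simplified stand-ins for real tool settings, and `difflib.SequenceMatcher` is only a substitute for a TM system's fuzzy match score, so its numbers will not equal the percentages quoted on the slide.

```python
import re
from difflib import SequenceMatcher

TEXT = "This is a sentence; this is another sentence."


def segment(text, semicolon_ends_segment):
    """Split after a full stop, and after a semicolon only if the tool says so."""
    pattern = r"(?<=[.;])\s+" if semicolon_ends_segment else r"(?<=\.)\s+"
    return [part for part in re.split(pattern, text) if part]


def best_match_rate(query, memory):
    """Highest similarity between the query and any stored segment (0.0 to 1.0)."""
    return max(SequenceMatcher(None, query.lower(), stored.lower()).ratio()
               for stored in memory)


# Tool A filled the TM: semicolon ends a segment, so two segments were stored.
tool_a_segments = segment(TEXT, semicolon_ends_segment=True)

# Tool B reuses that TM but sees the same text as one long segment.
tool_b_segments = segment(TEXT, semicolon_ends_segment=False)

if __name__ == "__main__":
    print(tool_a_segments)
    print(tool_b_segments)
    print(tool_b_segments[0] in tool_a_segments)  # exact match available?
    print(round(best_match_rate(tool_b_segments[0], tool_a_segments), 2))
```

Because Tool B's single segment never equals either stored segment, only a fuzzy match remains, and whether it is offered depends on the minimum match value set in the tool.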

Segmentation rules

• Rules that the tool applies to the text to be translated in order to split it up into segments
  • paragraph
  • sentence
  • phrase
  • incomplete sentences in bulleted lists
  • single words (headings, "Note", "Attention")

Segmentation rules

• End of segment rules (common to the default settings of all tools)
  • Dot at the end of a sentence (not after known abbreviations)
  • Question mark, exclamation mark
  • Paragraph mark
  • Colon
• End of segment rules (different for different tools)
  • Semicolon
  • Tab character
  • Sub segments (index entries, footnotes, graphics)

Comparison of default rules

            | Workbench | Transit | DV                         | SDLX                       | Across
Colon       | end       | end     | end                        | no end                     | no end
Semicolon   | no end    | end     | end                        | no end                     | no end
Tab         | end       | no end  | no end                     | no end                     | no end
Soft return | no end    | no end  | end in Word, no end in PPT | end in Word, no end in PPT | no end

What can SRX do and what not?

• It can only show the segmentation rule settings at the time of export
• It cannot show any changes that have been applied to the segmentation rules during the use of the TM
• Sometimes a rule from system 1 cannot be re-created in system 2; that rule will then be ignored
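To show how exported rule settings drive segmentation, here is a small Python sketch of SRX-style rules: each rule says "break" or "no break" and pairs a pattern for the text before a position with a pattern for the text after it, and the first rule that matches at a position decides. The concrete patterns, the abbreviation list and the names `BASE_RULES`, `SEMICOLON_RULE` and `segment` are assumptions for illustration, not rules taken from any tool or from an SRX file.

```python
import re

# Each rule: (is_break, regex for the text before the position, regex after it).
# Rules are tried in order and the first one that matches decides.
BASE_RULES = [
    (False, r"\b(?:e\.g|i\.e|etc)\.", r"\s"),  # no break after known abbreviations
    (True, r"[.?!:]", r"\s"),                  # dot, question mark, exclamation mark, colon
]
SEMICOLON_RULE = (True, r";", r"\s")           # a rule only some tools apply by default


def segment(text, rules):
    """Split text at every position where the first matching rule is a break rule."""
    compiled = [(is_break, re.compile("(?:" + before + ")$"), re.compile(after))
                for is_break, before, after in rules]
    segments, start = [], 0
    for pos in range(1, len(text)):
        for is_break, before, after in compiled:
            if before.search(text[:pos]) and after.match(text, pos):
                if is_break:
                    segments.append(text[start:pos].strip())
                    start = pos
                break  # first matching rule wins, later rules are not consulted
    tail = text[start:].strip()
    if tail:
        segments.append(tail)
    return segments


if __name__ == "__main__":
    sample = "See e.g. the manual. This is a sentence; this is another sentence."
    print(segment(sample, BASE_RULES))
    print(segment(sample, BASE_RULES + [SEMICOLON_RULE]))
```

A system that cannot express one of the rules would simply run with a shorter rule list, which is how an ignored rule leads to different segments and fewer matches.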

TBX - TermBase Exchange

• From the working draft of the TBX specification:
  • TBX is an open, XML-based standard format for terminological data
  • TBX is designed to support the analysis, representation, dissemination and exchange of information from human-oriented terminological databases (termbases)
  • TBX is built on the basis of ISO 12620 (data categories) and ISO 12200 (MARTIF core structure)

TMX TBX

[Figure: annotated TBX entry, with callouts for global information in the entry head, the language ID of each language section, administrative data of a language, information on term level, the term in English and the term in French]

Where could you use TBX?

• Exchange of terminological data
  • From term base to source language checking tools
  • Between term bases of translation memory tools
  • From term base to terminology checking tools
  • From term base to terminology extraction tools (as stop word lists)
  • From term base to dictionary of a machine translation tool
• For indexing keywords in document management systems, content management systems, knowledge management systems
• Publishing terminological data on the Intranet / Internet
• Optimization of search engines / text mining by searching for synonyms automatically

XLIFF - XML Localization Interchange File Format

• The XLIFF format aims to
  • separate localizable text from formatting (deal with one file format in translation instead of different processes to extract, filter, convert text from different file formats)
  • enable multiple tools to work on the source text
  • store information that is helpful in supporting a localization process (like metadata on versions of source and target segments)
• An XLIFF file is bilingual and can be the container for a number of individual files
• Each file element in an XLIFF file contains a header (project data such as contact information, project phases, pointers to reference material, and information on the skeleton file) and a body section with the actual text segments
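The container structure described above can be sketched with Python's standard library. This is a simplified illustration, not a complete XLIFF writer: the function name `build_xliff` and the file names are hypothetical, the header is left empty, and the XLIFF namespace declaration is omitted for brevity.

```python
import xml.etree.ElementTree as ET


def build_xliff(files, source_lang="en", target_lang="de"):
    """Build one bilingual <xliff> tree holding one <file> element per input file.

    `files` maps an original file name to a list of (source, target) pairs.
    """
    root = ET.Element("xliff", version="1.2")
    for original, segments in files.items():
        file_el = ET.SubElement(root, "file", {
            "original": original,
            "source-language": source_lang,
            "target-language": target_lang,
            "datatype": "plaintext",
        })
        ET.SubElement(file_el, "header")  # project data and skeleton info would go here
        body = ET.SubElement(file_el, "body")
        for number, (source, target) in enumerate(segments, start=1):
            unit = ET.SubElement(body, "trans-unit", id=str(number))
            ET.SubElement(unit, "source").text = source
            ET.SubElement(unit, "target").text = target
    return root


if __name__ == "__main__":
    # Hypothetical input files, used only for this example.
    tree = build_xliff({
        "manual.txt": [("This is a sentence", "Dies ist ein Satz")],
        "dialog.txt": [("Cancel", "Abbrechen")],
    })
    print(ET.tostring(tree, encoding="unicode"))
```

Each input file keeps its own header and body, so a tool can merge the translated segments back into the right original file afterwards.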




Page 31: A Guide to Open Standards and Open SourceOAXAL C. TMX Translation Memory Exchange •From the TMX specification: •…The purpose of the TMX format is to provide a standard method

Comparison of default rules

Workbench Transit DV SDLX Across

Colon end end end no end no end

Semi-

colon

no end end end no end no end

Tab end no end no end no end no end

Soft

return

no end no end end in

Word no

end in

PPT

end in

Word no

end in

PPT

no end

What can SRX do and what not

bull It can only show the segmentation rule settings at the time of export

bull It cannot show any changes that have been applied in the segmentation rules during the use of the TM

bull Sometimes the rules from system 1 cannot be re-created in system 2 then the rule will be ignored

TBX ndash TermBase Exchange

bull From the working draft of the TBX specificationbull TBX is an open XML-based standard format for terminological data

bull TBX is designed to support the analysis representation dissemination and exchange of information from human-oriented terminological databases (termbases)

bull TBX is built on the basis of ISO 12620 (data categories) and ISO 12200 (MARTIF core structure)

TMX TBX

Zerfasszaacde 35

Term in English

Term in French

Global information in entry head

Information on term level

Administrative data of this language

Language ID

Language ID

Where could you use TBX

bull Exchange terminological databull From term base to source language checking toolsbull Between term bases of translation memory toolsbull From term base to terminology checking toolsbull From term base to terminology extraction tools (as stop word lists)bull From term base to dictionary of a machine translation tool

bull For indexing keywords in document management systems content management systems knowledge management systems

bull Publishing terminological data on the Intranet Internet

bull Optimization of search enginges text mining by searching for synonyms automatically

XLIFF – XML Localization Interchange File Format

• The XLIFF format aims to:
  • separate localizable text from formatting (deal with one file format in translation instead of different processes to extract, filter and convert text from different file formats)
  • enable multiple tools to work on the source text
  • store information that is helpful in supporting a localization process (such as metadata on versions of source and target segments)

• An XLIFF file is bilingual and can be the container for a number of individual files

• Each file element in an XLIFF file contains a header (project data such as contact information, project phases, pointers to reference material and information on the skeleton file) and a body section with the actual text segments
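The container structure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the document uses the XLIFF 1.2 namespace, and the file names, skeleton reference and segment texts are invented for the example.

```python
import xml.etree.ElementTree as ET

NS = {"x": "urn:oasis:names:tc:xliff:document:1.2"}

# Hypothetical two-file XLIFF document; names and texts are illustrative.
SAMPLE_XLIFF = """<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <file original="dialog.rc" source-language="en" target-language="de" datatype="winres">
    <header>
      <skl><external-file href="dialog.rc.skl"/></skl>
    </header>
    <body>
      <trans-unit id="1"><source>Cancel</source><target>Abbrechen</target></trans-unit>
    </body>
  </file>
  <file original="help.html" source-language="en" target-language="de" datatype="html">
    <header/>
    <body>
      <trans-unit id="1"><source>Welcome</source></trans-unit>
      <trans-unit id="2"><source>Goodbye</source></trans-unit>
    </body>
  </file>
</xliff>
"""


def summarize_files(xliff_xml: str) -> list[dict]:
    """List each file element with its language pair, skeleton and unit count."""
    root = ET.fromstring(xliff_xml)
    summary = []
    for file_el in root.findall("x:file", NS):
        # The header may point to a skeleton file holding the non-translatable parts.
        skeleton = file_el.find("x:header/x:skl/x:external-file", NS)
        summary.append({
            "original": file_el.get("original"),
            "source": file_el.get("source-language"),
            "target": file_el.get("target-language"),
            "skeleton": skeleton.get("href") if skeleton is not None else None,
            "units": len(file_el.findall("x:body/x:trans-unit", NS)),
        })
    return summary


if __name__ == "__main__":
    for entry in summarize_files(SAMPLE_XLIFF):
        print(entry)
```

Each file element carries exactly one source and one target language, which is what makes the format bilingual even when it bundles several original files.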

XLIFF

• XLIFF can carry several translation matches

• Additional fields can contain context, author, creation tool, history…

<trans-unit id="n1">
  <source>This is a sentence</source>
  <target xml:lang="de">Dies ist ein Satz</target>
  <alt-trans match-quality="100" tool="TM_System">
    <source>This is a sentence</source>
    <target xml:lang="de">Das ist ein Satz</target>
  </alt-trans>
  <alt-trans match-quality="70" tool="TM_System">
    <source>This is a short sentence</source>
    <target xml:lang="de">Dies ist ein kurzer Satz</target>
  </alt-trans>
</trans-unit>

<trans-unit id="1" resname="IDCANCEL" restype="button" coord="885014">
  <source xml:lang="en">Cancel</source>
</trans-unit>

• Information on the name of a button and its coordinates (for visual representation of the button in a localization tool)
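As a rough illustration of how a tool might use the alt-trans proposals, the following Python sketch reads a trans-unit like the one above and ranks the proposals by match-quality. The function name is hypothetical, the proposals are deliberately listed out of order to show the sorting, and treating match-quality as a number (with an optional percent sign) is an assumption made for this example.

```python
import xml.etree.ElementTree as ET

# Same content as the slide example, with the two proposals swapped.
TRANS_UNIT = """
<trans-unit id="n1">
  <source>This is a sentence</source>
  <target xml:lang="de">Dies ist ein Satz</target>
  <alt-trans match-quality="70" tool="TM_System">
    <source>This is a short sentence</source>
    <target xml:lang="de">Dies ist ein kurzer Satz</target>
  </alt-trans>
  <alt-trans match-quality="100" tool="TM_System">
    <source>This is a sentence</source>
    <target xml:lang="de">Das ist ein Satz</target>
  </alt-trans>
</trans-unit>
"""


def ranked_matches(trans_unit_xml: str) -> list[tuple[float, str, str]]:
    """Return (quality, source, target) tuples, best match first."""
    unit = ET.fromstring(trans_unit_xml)
    matches = []
    for alt in unit.findall("alt-trans"):
        # Assumption: the value is numeric, optionally followed by '%'.
        quality = float(alt.get("match-quality", "0").rstrip("%"))
        matches.append((quality, alt.findtext("source", ""), alt.findtext("target", "")))
    return sorted(matches, key=lambda m: m[0], reverse=True)


if __name__ == "__main__":
    for quality, source, target in ranked_matches(TRANS_UNIT):
        print(f"{quality:5.1f}  {source!r} -> {target!r}")
```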

Where is XLIFF useful?

• Where experience with XML exists

• Where projects contain many different file formats

• Where all formats are converted to XLIFF for translation

• Where different tools need to be used during localization

• Where different translations (alt-trans) or languages are needed as reference

Any idea why XLIFF should NOT be the cure for everything?

• Instead of developing parsers for different file formats (to read the file into a translation tool), developers now need to create parsers to convert those file formats to XLIFF

• Some file formats can already be dealt with (Office, HTML, XML…) – why should a new parser be created for those?

• XLIFF has its variants, like all XML-based formats – how can you make sure that each tool can process any of the possible XLIFF flavours?
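One defensive approach to the flavour problem is to inspect a file before processing it. The Python sketch below is an assumption-laden illustration, not a complete validator: it checks the root namespace against the XLIFF 1.1 and 1.2 namespaces, reads the declared version, and lists any extension namespaces used by elements. The function names and the urn:example:acme extension are hypothetical, and extension attributes are not examined.

```python
import xml.etree.ElementTree as ET

# Namespaces of the XLIFF versions this sketch knows about.
XLIFF_NAMESPACES = {
    "urn:oasis:names:tc:xliff:document:1.1": "1.1",
    "urn:oasis:names:tc:xliff:document:1.2": "1.2",
}


def split_tag(tag: str) -> tuple[str, str]:
    """Split ElementTree's '{namespace}local' form into (namespace, local)."""
    if tag.startswith("{"):
        namespace, local = tag[1:].split("}", 1)
        return namespace, local
    return "", tag


def inspect_flavour(xliff_xml: str) -> dict:
    """Report declared version, namespace version and extension namespaces."""
    root = ET.fromstring(xliff_xml)
    namespace, local = split_tag(root.tag)
    if local != "xliff":
        raise ValueError("root element is not <xliff>")
    # Any element outside the XLIFF namespace signals a tool-specific extension.
    foreign = sorted({
        split_tag(el.tag)[0]
        for el in root.iter()
        if split_tag(el.tag)[0] not in ("", namespace)
    })
    return {
        "declared_version": root.get("version"),
        "namespace_version": XLIFF_NAMESPACES.get(namespace),
        "extension_namespaces": foreign,
    }


if __name__ == "__main__":
    sample = (
        '<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2" '
        'xmlns:acme="urn:example:acme">'
        '<file original="a.txt" source-language="en" datatype="plaintext"><body>'
        '<trans-unit id="1"><source>Hi</source><acme:info>x</acme:info></trans-unit>'
        '</body></file></xliff>'
    )
    print(inspect_flavour(sample))
```

A tool could refuse, warn or fall back to a plain core-only reading when the namespace version is unknown or unexpected extension namespaces appear.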

The magic quadrant again, just to remember the distinction

• Q1: Open Standards, Closed Source – Good

• Q2: Open Standards, Open Source – Good

• Q3: Proprietary ways, Closed Source – Bad

• Q4: Proprietary ways, Open Source – Wild

6 Talking Legalese in OSS

OSS (Open Source Software)

• Copyleft and Permissive Licensing
  • GPL (General Public License by FSF)
  • BSD (Berkeley Software Distribution)
  • Apache License 2.0

• MS open source licensing
  • Ms-PL (Public License)
  • Ms-RL (Reciprocal License)

Derived/Derivative works vs. dynamic linkage

• Paradigmatic case is an app running on an OS

Contributions

Distribution

6 Talking Legalese in standards

RF (Royalty Free) vs. RAND (Reasonable and Non-Discriminatory)

Standards can be licensed under different terms:

• RF

• RAND – there is no general consensus to date on whether RAND is Open or Proprietary

• Proprietary

By open standard we mean RF standards only

The opposite of an Open Standard is any proprietary way of doing things, even a standardized one

7 Open Tools

• Infrastructure baseline
  • PostgreSQL, MySQL, Open ACS, Petri Nets, Alfresco, jBoss

• TM Server
  • TinyTM

• Workflow capabilities through any CALPMS
  • Open: ]po[, .LRN; Closed source: AIT Projetex, LTC Worx, Plunet Business Manager, etc.

• CAT Clients
  • OmegaT, Metatexis, TinyTM Word macro, Multilizer and whoever wants to implement the protocol

• Filters
  • OKAPI framework, mainstream XLIFF generators such as Sun or Adobe

8 Use Cases - odt + OmegaT

• Level 2 TMX support

• Has TinyTM connector in alpha

• Has plaintext glossary support

OpenOffice.org .odt

8 Use Cases - TinyTM

• The TinyTM protocol is licensed under the "LGPL V2.1 or higher"

• The rest of the TinyTM code is licensed under the "GPL V2.0 or higher"

• Translation clients can be licensed under any license

• The TinyTM documentation is licensed under the Creative Commons Attribution License

8 Use Cases - TinyTM - Status

• Alpha functionality
  • Java TMX importer
  • Markup parsing logic

-------------------------

• Protocol freeze in public discussion on SourceForge

• Protocol features
  • Conservative extension with many important enhancements

8 Use Cases - TinyTM - new protocol features

• Industry and domain taxonomy
  • open to accommodate future TDA taxonomy research

• Inline markup handling
  1. stripped plain version
  2. normalized version with formatting placeholders
  3. TMX version with full TMX markup

• Advanced leverage functions
  • hash-based context storage
  • hash- and trigram-enhanced fuzzy search
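The slides do not show how TinyTM implements these features, so the following Python sketch is only an illustration of the general ideas, not TinyTM's protocol or code. It derives a stripped plain version and a placeholder-normalized version from a TMX segment with inline markup, hashes the neighbouring segments as a context key, and scores fuzzy matches by trigram overlap. All function names, the {n} placeholder style, the SHA-1 choice and the Jaccard measure are assumptions made for the example.

```python
import hashlib
import xml.etree.ElementTree as ET

# TMX inline elements treated as pure formatting codes in this sketch.
PLACEHOLDER_TAGS = {"bpt", "ept", "ph", "it"}


def plain_and_normalized(seg_xml: str) -> tuple[str, str]:
    """Return (stripped plain text, text with {n} formatting placeholders)."""
    seg = ET.fromstring(seg_xml)
    plain = [seg.text or ""]
    normalized = [seg.text or ""]
    for index, child in enumerate(seg, start=1):
        if child.tag in PLACEHOLDER_TAGS:
            # Drop the native code entirely, or replace it by a numbered slot.
            normalized.append("{%d}" % index)
        else:
            # Other inline elements (for example hi) wrap real text: keep it.
            text = "".join(child.itertext())
            plain.append(text)
            normalized.append(text)
        plain.append(child.tail or "")
        normalized.append(child.tail or "")
    return "".join(plain), "".join(normalized)


def context_hash(previous: str, following: str) -> str:
    """Hash the neighbouring segments so context can be stored as a short key."""
    data = (previous + "\x1f" + following).encode("utf-8")
    return hashlib.sha1(data).hexdigest()


def trigrams(text: str) -> set[str]:
    """Character trigrams of the lowercased, space-padded text."""
    padded = "  " + text.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a: str, b: str) -> float:
    """Jaccard overlap of trigram sets, between 0.0 and 1.0."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


if __name__ == "__main__":
    seg = ('<seg>Press the <bpt i="1">&lt;b&gt;</bpt>Cancel'
           '<ept i="1">&lt;/b&gt;</ept> button</seg>')
    print(plain_and_normalized(seg))
    print(context_hash("Previous sentence", "Next sentence"))
    print(trigram_similarity("This is a sentence", "This is a short sentence"))
```

The third variant from the list, the TMX version with full markup, is simply the unmodified segment. Storing all three lets a server match on plain text while still returning formatting to clients that can handle it.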

9 Discussion

Thanks for your attention

© 2009 Moravia IT a.s. and Angelika Zerfass

A Guide to Open Standards and Open Source

A Conceptual Case Study

Angelika Zerfass, zerfass@zaac.de
David Filip, Ph.D., davidf@moraviaworldwide.com

10 References

Open Standards (selective)

• XLIFF 1.2: http://docs.oasis-open.org/xliff/v1.2/os/xliff-core.html

• TMX 1.4b: http://www.lisa.org/fileadmin/standards/tmx1.4/tmx.htm

• SRX 2.0: http://www.lisa.org/fileadmin/standards/srx20.html

Open Tools (selective)

• OKAPI Framework: http://okapi.sourceforge.net

• OmegaT: http://sourceforge.net/projects/omegat

• OmegaT+: http://omegatplus.sourceforge.net

• Open ACS: http://openacs.org

• PostgreSQL: http://www.postgresql.org

• TinyTM: http://tinytm.sourceforge.net

10 References

Tools continued

• XML to XLIFF to XML: https://sourceforge.net/projects/xliffroundtrip

• TMX Compliance Kit (test for TMX certification): http://www.lisa.org/Translation-Memory-e.34.0.html

• TBX Checker: http://www.lisa.org/TBX-Resources.650.0.html

Standards continued

• Unicode Standard Annex #29: http://www.unicode.org/reports/tr29

• W3C ITS: http://www.w3.org/TR/its

Page 35: A Guide to Open Standards and Open SourceOAXAL C. TMX Translation Memory Exchange •From the TMX specification: •…The purpose of the TMX format is to provide a standard method

Zerfasszaacde 35

Term in English

Term in French

Global information in entry head

Information on term level

Administrative data of this language

Language ID

Language ID

Where could you use TBX

bull Exchange terminological databull From term base to source language checking toolsbull Between term bases of translation memory toolsbull From term base to terminology checking toolsbull From term base to terminology extraction tools (as stop word lists)bull From term base to dictionary of a machine translation tool

bull For indexing keywords in document management systems content management systems knowledge management systems

bull Publishing terminological data on the Intranet Internet

bull Optimization of search enginges text mining by searching for synonyms automatically

XLIFF ndash XML Localization Interchange File Format

bull The XLIFF format aims to bull separate localizable text from formatting (deal with one file

format in translation instead of different processes to extract filter convert text from different file formats)

bull enable multiple tools to work on the source textbull store information that is helpful in supporting a localization

process (like meta data on versions of source and target segemtns)

bull An XLIFF file is bilingual and can be the container for a number of individual files

bull Each file element in an XLIFF file contains a header (project data such as contact information project phases pointers to reference material and information on the skeleton file) and a body section with the actual text segments

XLIFF

bull XLIFF can carry several translation matches

bull Additional fields can contain context author creation tool historyhellip

lttrans-unit id=n1gtltsourcegtThis is a sentenceltsourcegtlttarget xmllang=degtDies ist ein

Satzlttargetgtltalt-trans match-

quality=100 tool=TM_SystemgtltsourcegtThis is a sentenceltsourcegtlttarget xmllang=degtDas ist ein

Satzlttargetgtltalt-transgtltalt-trans match-

quality=70 tool=TM_SystemgtltsourcegtThis is a short sentenceltsourcegtlttarget xmllang=degtDies ist ein kurzer

Satzlttargetgtltalt-transgtlttrans-unitgt



Where could you use TBX?

• Exchanging terminological data
  • From term base to source language checking tools
  • Between term bases of translation memory tools
  • From term base to terminology checking tools
  • From term base to terminology extraction tools (as stop word lists)
  • From term base to the dictionary of a machine translation tool

• Indexing keywords in document management systems, content management systems, and knowledge management systems

• Publishing terminological data on the intranet or Internet

• Optimizing search engines and text mining by searching for synonyms automatically

XLIFF – XML Localization Interchange File Format

• The XLIFF format aims to
  • separate localizable text from formatting (deal with one file format in translation instead of different processes to extract, filter, and convert text from different file formats)
  • enable multiple tools to work on the source text
  • store information that is helpful in supporting a localization process (like metadata on versions of source and target segments)

• An XLIFF file is bilingual and can be the container for a number of individual files

• Each file element in an XLIFF file contains a header (project data such as contact information, project phases, pointers to reference material, and information on the skeleton file) and a body section with the actual text segments

XLIFF

• XLIFF can carry several translation matches

• Additional fields can contain context, author, creation tool, history…

<trans-unit id="n1">
  <source>This is a sentence</source>
  <target xml:lang="de">Dies ist ein Satz</target>
  <alt-trans match-quality="100" tool="TM_System">
    <source>This is a sentence</source>
    <target xml:lang="de">Das ist ein Satz</target>
  </alt-trans>
  <alt-trans match-quality="70" tool="TM_System">
    <source>This is a short sentence</source>
    <target xml:lang="de">Dies ist ein kurzer Satz</target>
  </alt-trans>
</trans-unit>

<trans-unit id="1" resname="IDCANCEL" restype="button" coord="885014">
  <source xml:lang="en">Cancel</source>
</trans-unit>

• Information on the name of a button and its coordinates (for visual representation of the button in a localization tool)
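
As a rough illustration of how a tool might read the translation matches carried in a trans-unit, here is a minimal Python sketch using only the standard library. It assumes the snippet is not namespaced (a real XLIFF 1.2 file declares a namespace on its root element) and that match-quality is a plain integer; the function name list_matches is made up for this example.

```python
import xml.etree.ElementTree as ET

# The first trans-unit from the slide, with two alternative matches.
SNIPPET = """
<trans-unit id="n1">
  <source>This is a sentence</source>
  <target xml:lang="de">Dies ist ein Satz</target>
  <alt-trans match-quality="100" tool="TM_System">
    <source>This is a sentence</source>
    <target xml:lang="de">Das ist ein Satz</target>
  </alt-trans>
  <alt-trans match-quality="70" tool="TM_System">
    <source>This is a short sentence</source>
    <target xml:lang="de">Dies ist ein kurzer Satz</target>
  </alt-trans>
</trans-unit>
"""


def list_matches(trans_unit_xml):
    """Return the alt-trans matches of one trans-unit, best match first."""
    unit = ET.fromstring(trans_unit_xml.strip())
    matches = []
    for alt in unit.findall("alt-trans"):
        matches.append({
            # Assumption: the quality is written as a bare integer.
            "quality": int(alt.get("match-quality", "0")),
            "tool": alt.get("tool"),
            "source": alt.findtext("source"),
            "target": alt.findtext("target"),
        })
    matches.sort(key=lambda m: m["quality"], reverse=True)
    return matches


if __name__ == "__main__":
    for match in list_matches(SNIPPET):
        print(match["quality"], match["source"], "->", match["target"])
```

A translation editor could show such a list next to the segment so the translator can pick or adapt one of the proposals.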

Where is XLIFF useful?

• Where experience with XML exists

• Projects contain many different file formats

• All formats are converted to XLIFF for translation

• Different tools need to be used during localization

• Different translations (alt-trans) or languages needed as reference
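
To make the "all formats are converted to XLIFF" idea concrete, here is a minimal Python sketch that wraps a set of extracted strings in an XLIFF-style file element with a header and a body. It is an illustration under simplifying assumptions: the XLIFF 1.2 namespace declaration is left out, and the function name strings_to_xliff, the file name dialog.rc, and the default languages are invented for the example.

```python
import xml.etree.ElementTree as ET


def strings_to_xliff(strings, original, source_lang="en", target_lang="de"):
    """Wrap id -> text pairs from one source file in a bare XLIFF skeleton."""
    xliff = ET.Element("xliff", {"version": "1.2"})
    file_el = ET.SubElement(xliff, "file", {
        "original": original,            # name of the file the text came from
        "source-language": source_lang,
        "target-language": target_lang,
        "datatype": "plaintext",
    })
    ET.SubElement(file_el, "header")     # project data would go here
    body = ET.SubElement(file_el, "body")
    for unit_id, text in strings.items():
        unit = ET.SubElement(body, "trans-unit", {"id": unit_id})
        ET.SubElement(unit, "source").text = text
    return ET.tostring(xliff, encoding="unicode")


if __name__ == "__main__":
    print(strings_to_xliff({"IDCANCEL": "Cancel", "IDOK": "OK"}, "dialog.rc"))
```

Each source format still needs its own extraction step to produce the id and text pairs; only the translation side then deals with a single format.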

Any idea why XLIFF should NOT be the cure for everything?

• Instead of developing parsers for different file formats (to read the file into a translation tool), developers now need to create parsers to convert those file formats to XLIFF

• Some file formats can already be dealt with (Office, HTML, XML…) – why should a new parser be created for those?

• XLIFF has its variants, like all XML-based files – how can you make sure that each tool can process any of the possible XLIFF flavours?

The magic quadrant again, just to remember the distinction

Q1: Open Standards, Closed Source (Good)

Q2: Open Standards, Open Source (Good)

Q3: Proprietary ways, Closed Source (Bad)

Q4: Proprietary ways, Open Source (Wild)

6 Talking Legalese in OSS

OSS (Open Source Software)

• Copyleft and Permissive Licensing
  • GPL (General Public License by FSF)
  • BSD (Berkeley Software Distribution)
  • Apache License 2.0

• MS open source licensing
  • Ms-PL (Public License)
  • Ms-RL (Reciprocal License)

Derived/Derivative works vs. dynamic linkage

• Paradigmatic case is an app running on an OS

Contributions

Distribution

6 Talking Legalese in standards

RF (Royalty Free) vs. RAND (Reasonable and Non-Discriminatory)

Standards can be licensed under different terms

• RF

• RAND – there is no general consensus to date on whether RAND is Open or Proprietary

• Proprietary

By open standard we mean just RF standards

The opposite of an Open Standard is any proprietary way of doing things, even if it is standardized

7 Open Tools

• Infrastructure baseline
  • PostgreSQL, MySQL, Open ACS, Petri Nets, Alfresco, jBoss

• TM Server
  • TinyTM

• Workflow capabilities through any CALPMS
  • Open: ]po[, .LRN; closed source: AIT Projetex, LTC Worx, Plunet Business Manager, etc.

• CAT Clients
  • OmegaT, Metatexis, TinyTM Word macro, Multilizer, and whoever wants to implement the protocol

• Filters
  • OKAPI framework, mainstream XLIFF generators such as Sun or Adobe

8 Use Cases - odt + OmegaT

• Level 2 TMX support
• Has TinyTM connector in alpha
• Has plaintext glossary support

OpenOffice.org .odt

8 Use Cases - TinyTM

• TinyTM protocol is licensed under the "LGPL V2.1 or higher"

• Rest of the TinyTM code is licensed under the "GPL V2.0 or higher"

• Translation clients can be licensed under any license

• TinyTM documentation is licensed under the Creative Commons Attribution License

8 Use Cases - TinyTM - Status

• Alpha functionality

• Java TMX importer

• Markup parsing logic

• Protocol freeze in public discussion on SourceForge

• Protocol features
  • Conservative extension with many important enhancements

8 Use Cases - TinyTM – new protocol features

• Industry and domain taxonomy
  • open to accommodate future TDA taxonomy research

• Inline markup handling
  1. stripped plain version
  2. normalized version with formatting placeholders
  3. TMX version with full TMX markup

• Advanced leverage functions
  • hash based context storage
  • hash and trigram enhanced fuzzy search
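
The slides do not show how TinyTM implements its leverage functions, so the following Python sketch only illustrates the general idea of combining a hash lookup with trigram similarity. Everything in it is an assumption made for illustration: the function names, the SHA-1 hash, the Jaccard score, and the 0.5 threshold are not taken from the TinyTM protocol.

```python
import hashlib


def trigrams(text):
    """Set of character trigrams of the lowercased, space-padded text."""
    padded = "  " + text.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def similarity(a, b):
    """Jaccard similarity of the two trigram sets, between 0.0 and 1.0."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def segment_hash(text):
    """Stable key for exact-match lookup (hypothetical choice of SHA-1)."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def lookup(query, memory, threshold=0.5):
    """Return (score, source, target) tuples for a source -> target memory.

    An exact hit is found through the hash; otherwise every stored
    segment is scored by trigram similarity and filtered by threshold.
    """
    by_hash = {segment_hash(source): source for source in memory}
    exact = by_hash.get(segment_hash(query))
    if exact is not None:
        return [(1.0, exact, memory[exact])]
    scored = [(similarity(query, s), s, t) for s, t in memory.items()]
    return sorted((m for m in scored if m[0] >= threshold), reverse=True)


if __name__ == "__main__":
    tm = {"This is a sentence": "Dies ist ein Satz", "Cancel": "Abbrechen"}
    print(lookup("This is a short sentence", tm))
```

A real server would keep the hashes and trigrams in database indexes instead of recomputing them for every query; the sketch keeps everything in memory for readability.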

9 Discussion

Thanks for your attention


10 References

Open Standards (selective)

• XLIFF 1.2: http://docs.oasis-open.org/xliff/v1.2/os/xliff-core.html

• TMX 1.4b: http://www.lisa.org/fileadmin/standards/tmx1.4/tmx.htm

• SRX 2.0: http://www.lisa.org/fileadmin/standards/srx20.html

Open Tools (selective)

• OKAPI Framework: http://okapi.sourceforge.net

• OmegaT: http://sourceforge.net/projects/omegat

• OmegaT+: http://omegatplus.sourceforge.net

• Open ACS: http://openacs.org

• PostgreSQL: http://www.postgresql.org

• TinyTM: http://tinytm.sourceforge.net

10 References

Tools continued

• XML to XLIFF to XML: https://sourceforge.net/projects/xliffroundtrip

• TMX Compliance Kit (test for TMX certification): http://www.lisa.org/Translation-Memory-e.34.0.html

• TBX Checker: http://www.lisa.org/TBX-Resources.650.0.html

Standards continued

• Unicode Standard Annex 29: http://www.unicode.org/reports/tr29

• W3C ITS: http://www.w3.org/TR/its





Page 41: A Guide to Open Standards and Open SourceOAXAL C. TMX Translation Memory Exchange •From the TMX specification: •…The purpose of the TMX format is to provide a standard method

Q1

Open-Closed

Good

Q2

Open-Open

Good

Q3

Proprietary-Closed

Bad

Q4

Proprietary-Open

Wild

The magic quadrant again justto remember the distinction

Open Standards

Open Source

Closed Source

Proprietary ways

6 Talking Legalesein OSS

OSS (Open Source Software)

bull Copyleft and Permissive Licensing

bull GPL (General Public License by FSF)

bull BSD (Berkeley Software Distribution)

bull Apache License 20

bull MS open source licensing

bull Ms-PL (Public License)

bull Ms-RL (Reciprocal License)

DerivedDerivative works vs dynamic linkage

bull Paradigmatic case is app running on OS

Contributions

Distribution

6 Talking Legalesein standards

RF (Royalty Free) vs RAND (Reasonable and Non-Discriminatory)

Standards can be licensed under different terms

bull RF

bull RAND ndash there is no general consensus up to date on whether RAND is Open or Proprietary

bull Proprietary

By open standard we mean just RF Standards

The opposite of Open Standard is any proprietary way of doing things be it standardized

7 Open Tools

• Infrastructure baseline

  • PostgreSQL, MySQL, Open ACS, Petri Nets, Alfresco, jBoss

• TM Server

  • TinyTM

• Workflow capabilities through any CALPMS

  • Open: ]po[, .LRN. Closed source: AIT Projetex, LTC Worx, Plunet Business Manager, etc.

• CAT Clients

  • OmegaT, Metatexis, TinyTM Word macro, Multilizer, and whoever wants to implement the protocol

• Filters

  • OKAPI framework, mainstream XLIFF generators such as Sun or Adobe

8 Use Cases - odt + OmegaT

• Level 2 TMX support

• Has TinyTM connector in alpha

• Has plaintext glossary support

OpenOffice.org .odt
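As a rough illustration of what TMX support means for a client, the sketch below reads a small TMX 1.4 document with Python's standard library and returns the plain text of each translation unit. It drops the native formatting codes that Level 2 TMX carries in inline elements such as bpt and ept. The sample segment, the helper names plain_text and read_units, and the use of Python are assumptions made for this example; it does not describe how OmegaT itself is implemented.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the segment text and language pair are invented for illustration.
SAMPLE_TMX = """<tmx version="1.4">
  <header creationtool="example" creationtoolversion="0.1" segtype="sentence"
          o-tmf="none" adminlang="en" srclang="en" datatype="xml"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Click <bpt i="1">&lt;b&gt;</bpt>Save<ept i="1">&lt;/b&gt;</ept>.</seg></tuv>
      <tuv xml:lang="de"><seg>Klicken Sie auf <bpt i="1">&lt;b&gt;</bpt>Speichern<ept i="1">&lt;/b&gt;</ept>.</seg></tuv>
    </tu>
  </body>
</tmx>
"""

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Inline elements whose content is native formatting code, not translatable text.
CODE_TAGS = {"bpt", "ept", "ph", "it", "ut"}


def plain_text(element):
    """Return the translatable text of a <seg>, skipping the content of code elements."""
    parts = [element.text or ""]
    for child in element:
        if child.tag not in CODE_TAGS:
            # Elements such as <hi> wrap translatable text, so descend into them.
            parts.append(plain_text(child))
        # Text following an inline element always belongs to the segment.
        parts.append(child.tail or "")
    return "".join(parts)


def read_units(tmx_string):
    """Parse a TMX string into a list of {language: plain text} dictionaries."""
    root = ET.fromstring(tmx_string)
    units = []
    for tu in root.iter("tu"):
        unit = {}
        for tuv in tu.findall("tuv"):
            unit[tuv.get(XML_LANG)] = plain_text(tuv.find("seg"))
        units.append(unit)
    return units


if __name__ == "__main__":
    for unit in read_units(SAMPLE_TMX):
        print(unit)
```

A client with only Level 1 support would exchange just this plain text; a Level 2 client also has to carry the inline elements through to the target segment.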

8 Use Cases - TinyTM

• TinyTM protocol is licensed under the "LGPL V2.1 or higher"

• Rest of the TinyTM code is licensed under the "GPL V2.0 or higher"

• Translation clients can be licensed under any license

• TinyTM documentation is licensed under the Creative Commons Attribution License

8 Use Cases - TinyTM - Status

• Alpha functionality

• Java TMX importer

• Mark-up parsing logic

-------------------------

• Protocol freeze in public discussion on SourceForge

• Protocol features

  • Conservative extension with many important enhancements

8 Use Cases - TinyTM - new protocol features

• Industry and domain taxonomy

  • open to accommodate future TDA taxonomy research

• Inline markup handling

  1. stripped plain version

  2. normalized version with formatting placeholders

  3. TMX version with full TMX markup

• Advanced leverage functions

  • hash-based context storage

  • hash- and trigram-enhanced fuzzy search
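The ideas on this slide can be sketched in a few lines of Python. The example below is not the TinyTM protocol or its code. It is a toy in-memory translation memory, written under several assumptions: inline markup is represented by HTML-like tags (standing in for the full-markup version), SHA-1 is the hash, and fuzzy matching uses the Jaccard overlap of character trigrams. All names (ToyMemory, stripped, normalized, segment_hash, similarity) are hypothetical.

```python
import hashlib
import re

# Assumption: inline markup looks like HTML/XML tags; the tagged segment itself
# stands in for version 3, the segment with full markup.
TAG = re.compile(r"<[^>]+>")


def stripped(segment):
    """Version 1: plain text with all inline markup removed."""
    return TAG.sub("", segment)


def normalized(segment):
    """Version 2: every inline tag replaced by a numbered placeholder."""
    counter = 0

    def replace(match):
        nonlocal counter
        counter += 1
        return "{%d}" % counter

    return TAG.sub(replace, segment)


def segment_hash(text):
    """Hash of the case-folded, whitespace-normalized text, used as an exact-match key."""
    canonical = " ".join(text.casefold().split())
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


def trigrams(text):
    """Set of character trigrams; padding guarantees at least one trigram."""
    padded = "  " + " ".join(text.casefold().split()) + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def similarity(a, b):
    """Jaccard overlap of the two trigram sets, between 0 and 1."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


class ToyMemory:
    def __init__(self):
        self.by_hash = {}     # segment hash -> target
        self.by_context = {}  # (previous segment hash, segment hash) -> target
        self.sources = []     # (stripped source, target) pairs for fuzzy search

    def add(self, source, target, previous=""):
        plain = stripped(source)
        key = segment_hash(plain)
        self.by_hash[key] = target
        # Context is stored as a hash of the preceding segment, not as text.
        self.by_context[(segment_hash(stripped(previous)), key)] = target
        self.sources.append((plain, target))

    def lookup(self, source, previous="", threshold=0.6):
        """Return (match kind, score, target) or None if nothing reaches the threshold."""
        plain = stripped(source)
        key = segment_hash(plain)
        context = (segment_hash(stripped(previous)), key)
        if context in self.by_context:
            return ("context", 1.0, self.by_context[context])
        if key in self.by_hash:
            return ("exact", 1.0, self.by_hash[key])
        best = max(self.sources, key=lambda pair: similarity(plain, pair[0]), default=None)
        if best is not None:
            score = similarity(plain, best[0])
            if score >= threshold:
                return ("fuzzy", score, best[1])
        return None
```

In this sketch a lookup first tries a context match (same segment after the same preceding segment), then an exact match on the stripped text, and only then falls back to the linear trigram comparison. A real server would index the trigrams rather than scan every entry.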

9 Discussion

Thanks for your attention

10 References

Open Standards (selective)

• XLIFF 1.2: http://docs.oasis-open.org/xliff/v1.2/os/xliff-core.html

• TMX 1.4b: http://www.lisa.org/fileadmin/standards/tmx1.4/tmx.htm

• SRX 2.0: http://www.lisa.org/fileadmin/standards/srx20.html

Open Tools (selective)

• OKAPI Framework: http://okapi.sourceforge.net

• OmegaT: http://sourceforge.net/projects/omegat

• OmegaT+: http://omegatplus.sourceforge.net

• Open ACS: http://openacs.org

• PostgreSQL: http://www.postgresql.org

• TinyTM: http://tinytm.sourceforge.net

Tools continued

• XML to XLIFF to XML: https://sourceforge.net/projects/xliffroundtrip

• TMX Compliance Kit (test for TMX certification): http://www.lisa.org/Translation-Memory-e.34.0.html

• TBX Checker: http://www.lisa.org/TBX-Resources.650.0.html

Standards continued

• Unicode Standard Annex 29: http://www.unicode.org/reports/tr29

• W3C ITS: http://www.w3.org/TR/its
