© 2009 Moravia IT a.s. and Angelika Zerfass
A Guide to Open Standards and Open Source
A Conceptual Case Study
Angelika Zerfass, zerfass@zaac.de
David Filip, PhD, davidf@moraviaworldwide.com
Agenda
1 Polling Questions
2 Definitions
------------------------
3 Architecture considerations
4 Strategy
5 Open Standards
6 Talking Legalese
7 Open Tools
8 Usage Cases
1 Polling Questions
• (A) What is your level of experience with localization industry open standards (such as the XML-based TMX, TBX, SRX, and XLIFF standards)?
  • I know these standards well and see them regularly in the work done at my organization
  • I have a basic understanding of localization industry open standards
  • I'm new to localization industry open standards and want to learn more
1 Polling Questions
• (B) How familiar are you with open source applications used in the localization industry (such as OmegaT, Okapi Framework, Sun XLIFF Translation Editor)?
  • I'm familiar with these tools and use them (or tools like them) regularly
  • I have a basic understanding of these applications but don't really use them
  • I'm new to the idea of open source tools for the localization industry and want to learn more
1 Polling Questions
• (C) Which are important to you?
  • Learning about the differences between open standards and open source
  • Learning about actual use of open standards and commonly used tools
  • Learning about licensing and patent issues
  • Learning about the open Translation Management Systems in use or development today
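Definitions: The magic quadrant
Axes: Open Standards vs. proprietary ways; Open Source vs. Closed Source
• Q1: Open-Closed (open standards, closed source): Good
• Q2: Open-Open (open standards, open source): Good
• Q3: Proprietary-Closed (proprietary ways, closed source): Bad
• Q4: Proprietary-Open (proprietary ways, open source): Wild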
2 Definitions
• TMS, GMS
  • ETMS – Enterprise TMS, “from cradle to the grave”
  • Computer Aided L10N Project Management System (CALPMS)
• Open Standards: XLIFF, TMX, TBX, SRX, etc.
• OSS: Open Source, Free Software vs. Freeware
• Open Source (Copy-left) Licensing vs. Permissive Licensing
3 Architecture
4 Strategy
New business needs
Change enabler
Win-win-win: Translator, LSP of any size, Enterprise
TinyTM, OmegaT, Open ACS, OKAPI framework, etc.
• Exponential growth of content
• Changing balance between published and user-generated content
• Need for continuous translation
• Community translation
• Shared language data
• Massive online collaboration
• Translation automation
What is an open standard?
World Wide Web Consortium's definition:
• Transparency (due process is public, and all technical discussions and meeting minutes are archived and referenceable in decision making)
• Relevance (new standardization is started upon due analysis of the market needs, including a requirements phase, e.g. accessibility, multilingualism)
• Openness (anyone can participate: industry, individuals, the public, government bodies, academia, on a worldwide scale)
• Impartiality and consensus (a neutral organization leading it, with equal weight for each participant)
• Availability (free access to the standard text, both during development and at the final stage, translations, and clear IPR rules for implementation, allowing open source development in the case of Web technologies)
• Support (multiple implementations, ongoing process for testing, errata, revision, permanent access)
Source: Wikipedia, 2009
Goal of open standards
• Interoperability of tools
• Vendors can concentrate on innovation in fields other than their proprietary formats
• Standardization of processes (translation of just one file format like XLIFF instead of DOC, HTML, InDesign, FM…)
Success of open standards
• Depends on the commercial usability
• TMX – widespread; XLIFF – coming on strong; SRX – not widely used; TBX – slow; others – in the making (TBX Basic, GMX…)
5 Open Standards
• Why Open Standards in Open Source?
• Implementing open standards seems an obvious success scenario for OSS development
• XLIFF and TMX are open standards co-developed by our clients
• A minimalist open standards implementation ensures the desired functionality and is also legally safe
• LISA OSCAR: TMX 1.4b, 1.5, 2.0
• OASIS: XLIFF 1.1, 1.2, 1.2.1, 2.0
Open Standards: OAXAL
[Figure: OAXAL diagram. © Andrzej Zydron, OASIS OAXAL TC]
TMX: Translation Memory Exchange
• From the TMX specification:
  • “…The purpose of the TMX format is to provide a standard method to describe translation memory data that is being exchanged among tools and/or translation vendors, while introducing little or no loss of critical data during the process…”
What is TMX?
• It is an XML representation of translation memory data
  • Header
  • Body
  <header
    creationtool="Déjà Vu"
    creationtoolversion="4"
    datatype="PlainText"
    segtype="sentence"
    adminlang="en-us"
    srclang="en-us"
    o-tmf="DVMDB"
  >
• creationtool: Déjà Vu, Transit, Trados, MemoQ
• creationtoolversion: version / build number of the tool
• datatype: HTML, SGML, RTF, Interleaf, Java…
• segtype: basic segmentation
• adminlang: default language for elements like <note>
• srclang: source text language
• o-tmf: original translation memory format (DVMDB – Déjà Vu database…)
What is TMX?
• Body
  <body>
    <tu creationdate="20030915T153704Z" creationid="USER">
      <tuv lang="EN-US">
        <seg>This is the first sentence</seg>
      </tuv>
      <tuv lang="DE-DE">
        <seg>Dies ist der erste Satz</seg>
      </tuv>
    </tu>
  </body>
• tu = translation unit
• tuv lang = translation unit variant (language)
• seg = segment
What is TMX?
• Depending on the tool that created the TMX file, it can be bilingual or multilingual
• Importing a multilingual TMX file into a bilingual project will only import the relevant languages
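The header/body structure and the bilingual import behaviour described above can be sketched in a few lines of Python. This is only an illustration, not how any particular TM tool does it: the sample file is modelled on the slide's example with a French variant added as an assumption, and `read_pairs` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample modelled on the slide's example; the FR-FR variant is added
# here only to make the file multilingual.
TMX_SAMPLE = """<tmx version="1.4">
  <header creationtool="Déjà Vu" creationtoolversion="4" datatype="PlainText"
          segtype="sentence" adminlang="en-us" srclang="en-us" o-tmf="DVMDB"/>
  <body>
    <tu creationdate="20030915T153704Z" creationid="USER">
      <tuv lang="EN-US"><seg>This is the first sentence</seg></tuv>
      <tuv lang="DE-DE"><seg>Dies ist der erste Satz</seg></tuv>
      <tuv lang="FR-FR"><seg>Ceci est la première phrase</seg></tuv>
    </tu>
  </body>
</tmx>"""

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_pairs(tmx_text, source_lang, target_lang):
    """Return (source, target) text pairs for one language pair, ignoring other variants."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        variants = {}
        for tuv in tu.iter("tuv"):
            # Accept either xml:lang or the plain lang attribute used in the slide's sample.
            lang = tuv.get(XML_LANG) or tuv.get("lang") or ""
            seg = tuv.find("seg")
            if seg is not None:
                variants[lang.lower()] = "".join(seg.itertext())
        src, tgt = source_lang.lower(), target_lang.lower()
        if src in variants and tgt in variants:
            pairs.append((variants[src], variants[tgt]))
    return pairs


pairs_de = read_pairs(TMX_SAMPLE, "en-us", "de-de")
pairs_fr = read_pairs(TMX_SAMPLE, "en-us", "fr-fr")
print(pairs_de)
```

Asking for English-German returns only that pair; the French variant in the same translation unit is simply skipped, which mirrors what a bilingual project does on import.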
Levels of TMX
• Level 1
  • Plain text only (sufficient for data coming from software localization tools)
• Level 2
  • Text plus formatting (data coming from translation memory tools used for translation of documentation)
To move formatting and text from one tool to the other, both tools need to be Level 2 compliant.
Level 1
• Formatting that is applied to the source and target text of a translation unit is not exported to the TMX file, only pure text
• Original: This sentence has some formatting
• In TMX: This sentence has some formatting
Level 2
• Formatting that is applied to the source and target text of a translation unit is exported to the TMX file
• Different tools use different ways of encoding that information (placeholders or actual formatting information)
Level 2: Word DOC with formatting
MemoQ:
  <seg>This is the <bpt i="1" type="bold"></bpt>first<ept i="1"></ept> sentence this is <bpt i="2" type="ulined"></bpt>another<ept i="2"></ept> sentence</seg>
Trados 2009:
  <seg>This is the <bpt i="1" type="Bold" />first<ept i="1" /> sentence this is <bpt i="2" type="Underline" />another<ept i="2" /> sentence</seg>
Trados 2007 (8.2, 8.3):
  <seg>This is the <bpt i="1">&lt;cf bold=&quot;on&quot;&gt;</bpt>first<ept i="1">&lt;/cf&gt;</ept> sentence this is <bpt i="2">&lt;cf underlinestyle=&quot;single&quot;&gt;</bpt>another<ept i="2">&lt;/cf&gt;</ept> sentence</seg>
Level 2: HTML file with link
MemoQ:
  <seg>Text with a link to <bpt i="1" type="link">&lt;a href = &quot;http://www.samplehtml.com/page1.htm&quot;&gt;</bpt>another page<ept i="1">&lt;/a&gt;</ept></seg>
Trados 2009:
  <seg>Text with a link to <bpt i="1" type="19" x="1" />another page<ept i="1" /></seg>
Trados 2007 (8.2, 8.3):
  <seg>Text with a link to <bpt i="1">&lt;a href=&quot;http://www.samplehtml.com/page1.htm&quot;&gt;</bpt>another page<ept i="1">&lt;/a&gt;</ept></seg>
OmegaT:
  OmegaT internal format: <seg>Text with a link to &lt;a0&gt;another page&lt;/a0&gt;</seg>
  TMX Level 2 format: <seg>Text with a link to <bpt i="0" x="0">&lt;a0&gt;</bpt>another page<ept i="0">&lt;/a0&gt;</ept></seg>
Level 2: InDesign
MemoQ:
  <seg>InDesign Text with <bpt i="1" type="bold"></bpt>formatting in bold<ept i="1"></ept></seg>
Trados 2009:
  <seg>InDesign text with <bpt i="1" type="pt16" x="1" />formatting in bold<ept i="1" /></seg>
Trados 2007 (8.2, 8.3):
  <seg>InDesign text with <bpt i="1">&lt;cf ptfs=&quot;c_Bold&quot;&gt;</bpt>formatting in bold<ept i="1">&lt;/cf&gt;</ept></seg>
Implications of different tags for formatting
• Tools that use placeholder tags do not include the actual formatting information in the TMX file
• Other tools might only be able to re-use the text, especially if the formatting is applied only in the target segment but not in the source
• The result of the exchange would then be the same as with TMX Level 1 (text only)
• TMX files that carry the actual formatting information will yield better matches in other tools that can read this information
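The fallback to "text only" can be illustrated with a short Python sketch. It assumes segments are available as well-formed `<seg>` fragments; `plain_text` is a hypothetical helper, and the two sample segments are shortened versions of the placeholder-style and native-code-style examples from the slides.

```python
import xml.etree.ElementTree as ET

# TMX inline elements that hold native formatting codes rather than translatable text.
INLINE_CODE_TAGS = {"bpt", "ept", "ph", "it", "ut"}


def plain_text(seg_xml):
    """Return the translatable text of a <seg>, dropping inline formatting codes."""
    seg = ET.fromstring(seg_xml)
    parts = [seg.text or ""]
    for child in seg:
        if child.tag not in INLINE_CODE_TAGS:
            # Other children (for example <hi>) wrap real text, so keep it.
            # Simplification: codes nested inside such elements are not filtered.
            parts.append("".join(child.itertext()))
        parts.append(child.tail or "")
    return "".join(parts)


# Placeholder style: the tags carry only a type, no native code.
placeholder_style = '<seg>This is the <bpt i="1" type="Bold"/>first<ept i="1"/> sentence</seg>'
# Native-code style: the tags carry the original formatting code as escaped text.
native_code_style = (
    '<seg>This is the <bpt i="1">&lt;cf bold=&quot;on&quot;&gt;</bpt>'
    'first<ept i="1">&lt;/cf&gt;</ept> sentence</seg>'
)

print(plain_text(placeholder_style))
print(plain_text(native_code_style))
```

Both variants reduce to the same plain sentence, which is all a receiving tool can rely on when it cannot interpret the other tool's tags.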
Where do you use TMX?
• Transferring data between different translation memory tools
• Checking tools, QA tools
• TM maintenance tools
• Basis for bilingual term extraction
Reusing TMX data
• Although translation memory tools share the same basic idea (storing source-target language pairs and recycling translations), this has been realized in different ways
• Exchange with TMX works, but there is an issue that can lower the match rates nonetheless… the segmentation rules
SRX – Segmentation Rules Exchange
• From the SRX specification:
  • “…The purpose of the SRX format is to provide a standard method to describe segmentation rules that are being exchanged among tools and/or translation vendors”
  • “…is intended to enhance the TMX standard…”
Why SRX?
• Tool A
  • Semicolon is end of segment
  • This is a sentence; this is another sentence.
  • TM system sees two separate segments
• Tool B
  • Semicolon is NOT end of segment
  • This is a sentence; this is another sentence.
  • TM system sees one segment
  • No match from the TMX data
  • Match rate around 50%; the usual threshold setting is around 70%
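The Tool A / Tool B mismatch can be reproduced with a small Python sketch. The `segment` and `best_match` helpers are hypothetical, and the similarity score comes from Python's `difflib`, which is only a stand-in: real TM systems use their own fuzzy-match formulas, so the exact figure will differ from the rates quoted on the slide.

```python
import re
from difflib import SequenceMatcher


def segment(text, break_chars):
    """Split text after any break character that is followed by whitespace."""
    pattern = "(?<=[" + re.escape(break_chars) + r"])\s+"
    return [part for part in re.split(pattern, text.strip()) if part]


def best_match(segment_text, memory):
    """Return the highest similarity (0.0 to 1.0) between a segment and stored segments."""
    return max(SequenceMatcher(None, segment_text, stored).ratio() for stored in memory)


TEXT = "This is a sentence; this is another sentence."

# Tool A treats the semicolon as an end-of-segment marker; Tool B does not.
tool_a_segments = segment(TEXT, ".?!;")
tool_b_segments = segment(TEXT, ".?!")

# Tool B looks up its single long segment in a memory exported from Tool A.
score = best_match(tool_b_segments[0], tool_a_segments)

print(tool_a_segments)
print(tool_b_segments)
print(round(score, 2))
```

Tool A stores two short segments, Tool B looks for one long one, and the lookup can never reach a 100% match even though every word was translated before.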
Segmentation rules
• Rules that the tool applies to the text to be translated to split it up into segments
  • paragraph
  • sentence
  • phrase
  • incomplete sentences in bulleted lists
  • single words (headings, “Note”, “Attention”)
Segmentation rules
• End of segment rules (common to the default settings of all tools)
  • Dot at the end of a sentence (not after known abbreviations)
  • Question mark, exclamation mark
  • Paragraph mark
  • Colon
• End of segment rules (different for different tools)
  • Semicolon
  • Tab character
  • Sub-segments (index entries, footnotes, graphics)
Comparison of default rules
              Workbench   Transit   DV                           SDLX                         Across
Colon         end         end       end                          no end                       no end
Semicolon     no end      end       end                          no end                       no end
Tab           end         no end    no end                       no end                       no end
Soft return   no end      no end    end in Word, no end in PPT   end in Word, no end in PPT   no end
What can SRX do, and what not?
• It can only show the segmentation rule settings at the time of export
• It cannot show any changes that were applied to the segmentation rules while the TM was in use
• Sometimes the rules from system 1 cannot be re-created in system 2; then the rule will be ignored
TBX – TermBase Exchange
• From the working draft of the TBX specification:
  • TBX is an open, XML-based standard format for terminological data
  • TBX is designed to support the analysis, representation, dissemination, and exchange of information from human-oriented terminological databases (termbases)
  • TBX is built on the basis of ISO 12620 (data categories) and ISO 12200 (MARTIF, core structure)
TMX TBX
[Figure: annotated termbase entry. Callouts: global information in entry head; language ID and administrative data for each language; information on term level; term in English; term in French]
Where could you use TBX?
• Exchange of terminological data
  • From term base to source language checking tools
  • Between term bases of translation memory tools
  • From term base to terminology checking tools
  • From term base to terminology extraction tools (as stop word lists)
  • From term base to the dictionary of a machine translation tool
• For indexing keywords in document management systems, content management systems, knowledge management systems
• Publishing terminological data on the intranet / Internet
• Optimization of search engines and text mining by searching for synonyms automatically
XLIFF – XML Localization Interchange File Format
• The XLIFF format aims to
  • separate localizable text from formatting (deal with one file format in translation instead of different processes to extract, filter, and convert text from different file formats)
  • enable multiple tools to work on the source text
  • store information that is helpful in supporting a localization process (like metadata on versions of source and target segments)
• An XLIFF file is bilingual and can be the container for a number of individual files
• Each file element in an XLIFF file contains a header (project data such as contact information, project phases, pointers to reference material, and information on the skeleton file) and a body section with the actual text segments
XLIFF
• XLIFF can carry several translation matches
• Additional fields can contain context, author, creation tool, history…
  <trans-unit id="n1">
    <source>This is a sentence</source>
    <target xml:lang="de">Dies ist ein Satz</target>
    <alt-trans match-quality="100" tool="TM_System">
      <source>This is a sentence</source>
      <target xml:lang="de">Das ist ein Satz</target>
    </alt-trans>
    <alt-trans match-quality="70" tool="TM_System">
      <source>This is a short sentence</source>
      <target xml:lang="de">Dies ist ein kurzer Satz</target>
    </alt-trans>
  </trans-unit>

  <trans-unit id="1" resname="IDCANCEL" restype="button" coord="8;8;50;14">
    <source xml:lang="en">Cancel</source>
  </trans-unit>
• Information on the name of a button and its coordinates (for visual representation of the button in a localization tool)
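To show how a tool might read the several translation matches carried by one translation unit, here is a minimal Python sketch. It assumes a stand-alone `<trans-unit>` fragment without the XLIFF namespace, as in the slide's example; a complete XLIFF file declares a namespace that the lookups would have to include. `alternatives` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Fragment modelled on the slide's example (no XLIFF namespace declared, by assumption).
TRANS_UNIT = """<trans-unit id="n1">
  <source>This is a sentence</source>
  <target xml:lang="de">Dies ist ein Satz</target>
  <alt-trans match-quality="100" tool="TM_System">
    <source>This is a sentence</source>
    <target xml:lang="de">Das ist ein Satz</target>
  </alt-trans>
  <alt-trans match-quality="70" tool="TM_System">
    <source>This is a short sentence</source>
    <target xml:lang="de">Dies ist ein kurzer Satz</target>
  </alt-trans>
</trans-unit>"""


def alternatives(trans_unit_xml):
    """Return (match quality, target text, tool) tuples, best match first."""
    unit = ET.fromstring(trans_unit_xml)
    matches = []
    for alt in unit.findall("alt-trans"):
        # match-quality is free text in XLIFF; tolerate values such as "70%".
        quality = float(alt.get("match-quality", "0").rstrip("%"))
        matches.append((quality, alt.findtext("target"), alt.get("tool")))
    return sorted(matches, key=lambda match: match[0], reverse=True)


ranked = alternatives(TRANS_UNIT)
for quality, target, tool in ranked:
    print(quality, target, tool)
```

Sorting by match quality lets a translation editor present the 100% match first and keep the lower fuzzy match available as a reference.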
Where is XLIFF useful?
• Where experience with XML exists
• Where projects contain many different file formats
  • All formats are converted to XLIFF for translation
• Where different tools need to be used during localization
• Where different translations (alt-trans) or languages are needed as reference
Any idea why XLIFF should NOT be the cure for everything?
• Instead of developing parsers for different file formats (to read the file into a translation tool), developers now need to create parsers to convert those file formats to XLIFF
• Some file formats can already be dealt with (Office, HTML, XML…) – why should a new parser be created for those?
• XLIFF has its variants, like all XML-based files – how can you make sure that each tool can process any of the possible XLIFF flavours?
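The magic quadrant again, just to remember the distinction
Axes: Open Standards vs. proprietary ways; Open Source vs. Closed Source
• Q1: Open-Closed (open standards, closed source): Good
• Q2: Open-Open (open standards, open source): Good
• Q3: Proprietary-Closed (proprietary ways, closed source): Bad
• Q4: Proprietary-Open (proprietary ways, open source): Wild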
6 Talking Legalese: in OSS
OSS (Open Source Software)
• Copyleft and Permissive Licensing
  • GPL (General Public License by FSF)
  • BSD (Berkeley Software Distribution)
  • Apache License 2.0
• MS open source licensing
  • Ms-PL (Public License)
  • Ms-RL (Reciprocal License)
Derived/Derivative works vs. dynamic linkage
• Paradigmatic case is an app running on an OS
Contributions
Distribution
6 Talking Legalese: in standards
RF (Royalty Free) vs. RAND (Reasonable and Non-Discriminatory)
Standards can be licensed under different terms:
• RF
• RAND – there is no general consensus to date on whether RAND is Open or Proprietary
• Proprietary
By open standard we mean just RF standards.
The opposite of an Open Standard is any proprietary way of doing things, be it standardized or not.
7 Open Tools
• Infrastructure baseline
  • PostgreSQL, MySQL, Open ACS, Petri Nets, Alfresco, jBoss
• TM Server
  • TinyTM
• Workflow capabilities through any CALPMS
  • Open: ]po[, .LRN. Closed source: AIT Projetex, LTC Worx, Plunet Business Manager, etc.
• CAT Clients
  • OmegaT, Metatexis, TinyTM Word macro, Multilizer, and whoever wants to implement the protocol
• Filters
  • OKAPI framework, mainstream XLIFF generators such as Sun or Adobe
8 Use Cases - .odt + OmegaT
• Level 2 TMX support
• Has TinyTM connector in alpha
• Has plaintext glossary support
OpenOffice.org .odt
8 Use Cases - TinyTM
• The TinyTM protocol is licensed under the “LGPL V2.1 or higher”
• The rest of the TinyTM code is licensed under the “GPL V2.0 or higher”
• Translation clients can be licensed under any license
• TinyTM documentation is licensed under the Creative Commons Attribution License
8 Use Cases - TinyTM - Status
• Alpha functionality
• Java TMX importer
• Markup parsing logic
-------------------------
• Protocol freeze in public discussion on SourceForge
• Protocol features
  • Conservative extension with many important enhancements
8 Use Cases - TinyTM – new protocol features
• Industry and domain taxonomy
  • open to accommodate future TDA taxonomy research
• Inline markup handling
  1. stripped plain version
  2. normalized version with formatting placeholders
  3. TMX version with full TMX markup
• Advanced leverage functions
  • hash-based context storage
  • hash and trigram enhanced fuzzy search
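The slides do not say how TinyTM implements these leverage functions. As a generic illustration of the two ideas only, the Python sketch below computes a character-trigram similarity between two segments and derives a hash key from a segment's neighbours. The function names, the Jaccard scoring, and the use of SHA-1 are all assumptions made for this example, not TinyTM's protocol.

```python
import hashlib


def trigrams(text):
    """Return the set of character trigrams of a lower-cased, space-padded string."""
    padded = "  " + text.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a, b):
    """Jaccard similarity of the two trigram sets, from 0.0 to 1.0."""
    set_a, set_b = trigrams(a), trigrams(b)
    return len(set_a & set_b) / len(set_a | set_b)


def context_key(previous_segment, next_segment):
    """Hash the neighbouring segments so a context match is a simple key comparison."""
    data = (previous_segment + "\x00" + next_segment).encode("utf-8")
    return hashlib.sha1(data).hexdigest()


exact = trigram_similarity("This is a sentence", "This is a sentence")
fuzzy = trigram_similarity("This is a sentence", "This is a short sentence")
key_one = context_key("Previous segment", "Next segment")
key_two = context_key("Previous segment", "A different next segment")

print(exact, round(fuzzy, 2), key_one == key_two)
```

Trigram sets are cheap to index in a database, which is why they are a common way to narrow down fuzzy-match candidates; a stored context hash lets a server check whether a segment appeared between the same neighbours without storing the neighbouring text itself.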
9 Discussion
Thanks for your attention
10 References
Open Standards (selective)
• XLIFF 1.2: http://docs.oasis-open.org/xliff/v1.2/os/xliff-core.html
• TMX 1.4b: http://www.lisa.org/fileadmin/standards/tmx1.4/tmx.htm
• SRX 2.0: http://www.lisa.org/fileadmin/standards/srx20.html
Open Tools (selective)
• OKAPI Framework: http://okapi.sourceforge.net
• OmegaT: http://sourceforge.net/projects/omegat
• OmegaT+: http://omegatplus.sourceforge.net
• Open ACS: http://openacs.org
• PostgreSQL: http://www.postgresql.org
• TinyTM: http://tinytm.sourceforge.net
10 References
Tools continued
• XML to XLIFF to XML: https://sourceforge.net/projects/xliffroundtrip
• TMX Compliance Kit (test for TMX certification): http://www.lisa.org/Translation-Memory-e.34.0.html
• TBX Checker: http://www.lisa.org/TBX-Resources.650.0.html
Standards continued
• Unicode Standard Annex #29: http://www.unicode.org/reports/tr29
• W3C ITS: http://www.w3.org/TR/its
Agenda
1 Polling Questions
2 Definitions
------------------------
3 Architecture considerations
4 Strategy
5 Open Standards
6 Talking Legalese
7 Open Tools
8 Usage Cases
1 Polling Questions
bull (A) What is your level of experience with localization industry open standards (such as the XML-based TMX TBX SRX and XLIFF standards)
I know these standards well and see them regularly in the work done at my organization
I have a basic understanding of localization industry open standards
Im new to localization industry open standards and want to learn more
1 Polling Questions
bull (B) How familiar are you with open source applications used in the localization industry (such as OmegaT Okapi Framework Sun XLIFF Translation Editor)
Im familiar with these tools and use them (or tools like them) regularly
I have a basic understanding of these applications but dont really use them
Im new to the idea of open source tools for the localization industry and want to learn more
1 Polling Questions
bull (C) Which are important to you
Learning about the differences between open standards and open source
Learning about actual use open standards and commonly used tools
Learning about licensing and patent issues
Learning about the open Translation Management Systems in use or development today
Q1
Open-Closed
Good
Q2
Open-Open
Good
Q3
Proprietary-Closed
Bad
Q4
Proprietary-Open
Wild
DefinitionsThe magic quadrant
Open Standards
Open Source
Closed Source
Proprietary ways
2 Definitions
bull TMS GMS
ETMS ndash Enterprise TMSldquofrom cradle to the graverdquo
Computer Aided L10N Project Management System (CALPMS)
bull Open Standards XLIFF TMX TBX SRX etc
bull OSS Open Source Free Software vs Freeware
bull Open Source (Copy-left) Licensing vs Permissive Licensing
3 Architecture
4 Strategy
New business
Needs
Change Enabler
Win
Win
Win
Translator
LSP of any size
Enterprise
TinyTM OmegaT Open ACS OKAPI framework Etc
Exponential growth of content
Changing balance between published and user generated content
Need for Continuous Translation
Community Translation Shared language
data Massive online
collaboration Translation automation
What is an open standard
World Wide Web Consortiums definitionbull Transparency (designdue process is public and all technical
discussions meeting minutes are archived and referencablein decision making)
bull Relevance (new standardization is started upon due analysis of the market needs including requirements phase eg accessibility multi-linguism)
bull Openness (anyone can participate industry individual public government bodies academia on a worldwide scale)
bull Impartiality and consensus (neutral org leading it with equal weight for each participant)
bull Availability (free access to the standard text both during development and at final stage translations and clear IPR rules for implementation allowing open source development in the case of Web technologies)
bull Support (multiple implementations ongoing process for testing errata revision permanent access)
Wikipedia 2009
Goal of open standards
bull Interoperability of toolsbull Vendors can concentrate on innovation in other fields than their proprietary formats
bull Standardization of processes (translation of just one file format like XLIFF instead of DOC HTML InDesign FMhellip)
Success of open standards
bull Depends on the commercial usabilitybull TMX ndash widespread XLIFF ndash coming on
strong SRX ndash not widely used TBX ndash slow others ndash in the making (TBX Basic GMXhellip)
5 Open Standards
bull Why Open Standards in Open Source
bull Implementing open standards seems obvious success scenario for OSS development
bull XLIFF and TMX are open standards co-developed by our clients
bull Minimalist open standards implementation ensures desired functionality and is also legally safe
bull LISA OSCAR TMX 14b 15 20
bull OASIS XLIFF 11 12 121 20
Open Standards OAXAL
copy A
ndrz
ej Z
ydro
n O
ASIS
OAXAL T
C
TMXTranslation Memory Exchange
bull From the TMX specification
bull hellipThe purpose of the TMX format is to provide a standard method to describe translation memory data that is being exchanged among tools andor translation vendors while introducing little or no loss of critical data during the processhellip
What is TMX
bull It is an XML representation of translation memory data
bull Header
bull Body
ltheadercreationtool=ldquoDeacutejagrave Vu creationtoolversion=ldquo4datatype=PlainTextrdquosegtype=sentenceadminlang=en-ussrclang=en-uso-tmf=DVMDB
gt
Deacutejagrave Vu Transit Trados MemoQ
Version build number of the tool
HTML SGML RTF Interleaf Javahellip
Basic segmentation
Default language for elements like ltnotegt
Source text language
Original translation memory format (DVMDB ndash Deacutejagrave Vu databasehellip)
What is TMX
bull Body
ltbodygtlttu creationdate=20030915T153704Z creationid=USERgt
lttuv lang=EN-USgtltseggtThis is the first sentenceltseggt
lttuvgtlttuv lang=DE-DEgt
ltseggtDies ist der erste Satzltseggtlttuvgt
lttugtltbodygt
tu = Translation Unittuv lang = translation unit variant (language) seg = segment
What is TMX
bull Depending on the tool that created the TMX file it can be bilingual or multilingual
bull Importing multilingual TMX file into a bilingual project will only import the relevant languages
Levels of TMX
bull Level 1bull Plain text only (sufficient for data coming from software localization tools)
bull Level 2bull Text plus formatting (data coming from translation memory tools used for translation of documentation)
To move formatting and text from one tool to the other both tools need to be level2 compliant
Level 1
bull Formatting that is applied to the source and target text of a translation unit is not exported to the TMX file only pure text
bull Original
bull This sentence has some formatting
bull In TMX
bull This sentence has some formatting
Level 2
bull Formatting that is applied to the source and target text of a translation unit is exported to the TMX file
bull Different tools use different ways of encoding that information (placeholders or actual formatting information)
Level 2
seggtThis is the ltbpt i=1 type=boldgtltbptgtfirstltept i=1gtlteptgt sentence this is ltbpt i=2 type=ulinedgtltbptgtanotherltept i=2gtlteptgt sentenceltseggt
MemoQ ndash Word DOC with formatting
ltseggtThis is the ltbpt i=1 type=Bold gtfirstltept i=1 gt sentence this isltbpt i=2 type=Underline gtanotherltept i=2 gt sentenceltseggt
Trados 2009 ndash Word DOC with formatting
ltseggtThis is the ltbpt i=1gtampltcf bold=ampquotonampquotampgtltbptgtfirstltept i=1gtampltcfampgtlteptgt sentence this is ltbpt i=2gtampltcf underlinestyle=ampquotsingleampquotampgtltbptgtanotherltept i=2gtampltcfampgtlteptgtsentenceltseggt
Trados 2007 82 83 ndash Word DOC with formatting
Level 2
MemoQ ndash HTML file with link
Trados 2009 ndash HTML file with link
Trados 2007 82 83 ndash HTML file with link
ltseggtText with a link to ltbpt i=1gtamplta href=ampquothttpwwwsamplehtmlcompage1htmampquotampgtltbptgtanother pageltept i=1gtampltaampgtlteptgtltseggt
ltseggtText with a link to ltbpt i=1 type=19 x=1 gtanother pageltept i=1 gtltseggt
ltseggtText with a link to ltbpt i=1 type=linkgtamplta href = ampquothttpwwwsamplehtmlcompage1htmampquotampgtltbptgtanother pageltept i=1gtampltaampgtlteptgtltseggt
OmegaT internal format ltseggtText with a link to amplta0ampgtanother pageamplta0ampgtltseggt
TMX Level 2 format ltseggtText with a link to ltbpt i=0 x=0gtamplta0ampgtltbptgtanother pageltept i=0gtamplta0ampgtlteptgtltseggt
OmegaT - HTML file with link
Level 2
MemoQ ndash InDesign
Trados 2009 ndash InDesign
Trados 2007 82 83 ndash InDesign
ltseggtInDesign text with ltbpt i=1gtampltcf ptfs=ampquotc_Boldampquotampgtltbptgtformatting in boldltept i=1gtampltcfampgtlteptgtltseggt
ltseggtInDesign text with ltbpt i=1 type=pt16 x=1 gtformatting in boldltept i=1 gtltseggt
ltseggtInDesign Text with ltbpt i=1 type=boldgtltbptgtformtatting in boldltept i=1gtlteptgtltseggt
Implications of different tags for formatting
bull Tools that use placeholder tags do not include the actual formatting information in the TMX file
bull Other tools might only be able to re-use the text especially if the formatting is only applied in the target segment but not in the source
bull The result of the exchange would then be the same as with TMX level 1 (text only)
bull TMX files which carry the actual formatting information will yield better matches in other tools that can read this information
Where do you use TMX
bull Transfering data between different translation memory tools
bull Checking tools QA tools
bull TM maintenance tools
bull Basis for bilingual term extraxtion
Reusing TMX data
bull Although Translation Memory Tools have the same basic idea (storing source-target language pairs and recycling translations) this has been realized in different ways
bull Exchange with TMX works but there is an issue that can lower the match rates nonethelesshellip the segmentation rules
SRX ndash Segmentation Rules Exchange
bull From the SRX specification
bull hellipThe purpose of the SRX format is to provide a standard method to describe segmentation rules that are being exchanged among tools andor translation vendors
bull hellipis intended to enhance the TMX standardhellip
Why SRX
bull Tool Abull Semicolon is end of segment
bull This is a sentence this is another sentence
bull TM system sees two separate segments
bull Tool Bbull Semicolon is NOT end of segment
bull This is a sentence this is another sentence
bull TM system sees one segmentbull No match from the TMX data
bull Match rate around 50 usual setting around 70
Segmentation rules
bull Rules that the tool applies to the text to translate to split it up into segments
bull paragraph
bull sentence
bull phrase
bull incomplete sentences in bulleted lists
bull single words (headings ldquoNoterdquo ldquoAttentionrdquo)
Segmentation rules
bull End of segment rules (common to the default settings of all tools)bull Dot at the end of a sentence (not after known
abbreviations)bull Question mark exclamation markbull Paragraph markbull Colon
bull End of segment rules (different for different tools)bull Semicolonbull Tab characterbull Sub segments (index entries footnotes
graphics)
Comparison of default rules
Workbench Transit DV SDLX Across
Colon end end end no end no end
Semi-
colon
no end end end no end no end
Tab end no end no end no end no end
Soft
return
no end no end end in
Word no
end in
PPT
end in
Word no
end in
PPT
no end
What can SRX do and what not
bull It can only show the segmentation rule settings at the time of export
bull It cannot show any changes that have been applied in the segmentation rules during the use of the TM
bull Sometimes the rules from system 1 cannot be re-created in system 2 then the rule will be ignored
TBX ndash TermBase Exchange
bull From the working draft of the TBX specificationbull TBX is an open XML-based standard format for terminological data
bull TBX is designed to support the analysis representation dissemination and exchange of information from human-oriented terminological databases (termbases)
bull TBX is built on the basis of ISO 12620 (data categories) and ISO 12200 (MARTIF core structure)
TMX TBX
Zerfasszaacde 35
Term in English
Term in French
Global information in entry head
Information on term level
Administrative data of this language
Language ID
Language ID
Where could you use TBX
bull Exchange terminological databull From term base to source language checking toolsbull Between term bases of translation memory toolsbull From term base to terminology checking toolsbull From term base to terminology extraction tools (as stop word lists)bull From term base to dictionary of a machine translation tool
bull For indexing keywords in document management systems content management systems knowledge management systems
bull Publishing terminological data on the Intranet Internet
bull Optimization of search enginges text mining by searching for synonyms automatically
XLIFF ndash XML Localization Interchange File Format
bull The XLIFF format aims to bull separate localizable text from formatting (deal with one file
format in translation instead of different processes to extract filter convert text from different file formats)
bull enable multiple tools to work on the source textbull store information that is helpful in supporting a localization
process (like meta data on versions of source and target segemtns)
bull An XLIFF file is bilingual and can be the container for a number of individual files
bull Each file element in an XLIFF file contains a header (project data such as contact information project phases pointers to reference material and information on the skeleton file) and a body section with the actual text segments
XLIFF
bull XLIFF can carry several translation matches
bull Additional fields can contain context author creation tool historyhellip
lttrans-unit id=n1gtltsourcegtThis is a sentenceltsourcegtlttarget xmllang=degtDies ist ein
Satzlttargetgtltalt-trans match-
quality=100 tool=TM_SystemgtltsourcegtThis is a sentenceltsourcegtlttarget xmllang=degtDas ist ein
Satzlttargetgtltalt-transgtltalt-trans match-
quality=70 tool=TM_SystemgtltsourcegtThis is a short sentenceltsourcegtlttarget xmllang=degtDies ist ein kurzer
Satzlttargetgtltalt-transgtlttrans-unitgt
lttrans-unit id=1 resname=IDCANCEL restype=button coord=885014gt
ltsource xmllang=engtCancelltsourcegtlttrans-unitgt
bull Information on name of a button and its coordinates (for visual representation of the button in a localization tool)
Where is XLIFF useful
bull Where experience with XML exists
bull Projects contain many different file formats
bull All formats are converted to XLIFF for translation
bull Different tools need to be used during localization
bull Different translations (alt-trans) or languages needed as reference
Any idea why XLIFF should NOT be the cure for everything?
• Instead of developing parsers for different file formats (to read the file into a translation tool), developers now need to create parsers to convert those file formats to XLIFF
• Some file formats can already be dealt with (Office, HTML, XML…) – why should a new parser be created for those?
• XLIFF has its variants, like all XML-based files – how can you make sure that each tool can process any of the possible XLIFF flavours?
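To make the first point concrete, here is a rough sketch of what such a converter involves, even for a trivial format. It turns simple key=value resource lines into an XLIFF 1.2 document with one trans-unit per entry. The input format, the function name key_value_to_xliff, and the default file name are assumptions made for this illustration. A real filter would also have to produce a skeleton file and handle escapes and inline markup.

```python
import xml.etree.ElementTree as ET

XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2"


def key_value_to_xliff(text, original="ui.properties", source_lang="en"):
    """Convert key=value lines into a minimal XLIFF 1.2 document (as a string)."""
    ET.register_namespace("", XLIFF_NS)  # serialize with a default namespace
    root = ET.Element(f"{{{XLIFF_NS}}}xliff", {"version": "1.2"})
    file_el = ET.SubElement(
        root,
        f"{{{XLIFF_NS}}}file",
        {"original": original, "source-language": source_lang, "datatype": "plaintext"},
    )
    body = ET.SubElement(file_el, f"{{{XLIFF_NS}}}body")
    for number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        # Skip blank lines, comments, and anything that is not key=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        unit = ET.SubElement(
            body,
            f"{{{XLIFF_NS}}}trans-unit",
            {"id": str(number), "resname": key.strip()},
        )
        ET.SubElement(unit, f"{{{XLIFF_NS}}}source").text = value.strip()
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(key_value_to_xliff("# buttons\nIDOK=OK\nIDCANCEL=Cancel\n"))
```

Every additional source format needs its own variant of this extraction step, plus the reverse merge, which is the cost the slide points at.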
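The magic quadrant again, just to remember the distinction
(Figure: standards axis Open Standards vs. Proprietary ways; code axis Open Source vs. Closed Source)
• Q1 Open-Closed (open standards, closed source): Good
• Q2 Open-Open (open standards, open source): Good
• Q3 Proprietary-Closed (proprietary ways, closed source): Bad
• Q4 Proprietary-Open (proprietary ways, open source): Wild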
6 Talking Legalese in OSS
OSS (Open Source Software)
• Copyleft and Permissive Licensing
  • GPL (General Public License by FSF)
  • BSD (Berkeley Software Distribution)
  • Apache License 2.0
• MS open source licensing
  • Ms-PL (Public License)
  • Ms-RL (Reciprocal License)
Derived/Derivative works vs. dynamic linkage
• Paradigmatic case is an app running on an OS
Contributions
Distribution
6 Talking Legalese in standards
RF (Royalty Free) vs. RAND (Reasonable and Non-Discriminatory)
Standards can be licensed under different terms
• RF
• RAND – there is no general consensus to date on whether RAND is Open or Proprietary
• Proprietary
By open standard we mean just RF standards
The opposite of Open Standard is any proprietary way of doing things, be it standardized or not
7 Open Tools
• Infrastructure baseline
  • PostgreSQL, MySQL, Open ACS, Petri Nets, Alfresco, jBoss
• TM Server
  • TinyTM
• Workflow capabilities through any CALPMS
  • Open: ]po[, .LRN; Closed source: AIT Projetex, LTC Worx, Plunet Business Manager, etc.
• CAT Clients
  • OmegaT, Metatexis, TinyTM Word macro, Multilizer, and whoever wants to implement the protocol
• Filters
  • OKAPI framework, mainstream XLIFF generators such as Sun or Adobe
8 Use Cases - .odt + OmegaT
• Level 2 TMX support
• Has TinyTM connector in alpha
• Has plaintext glossary support
(Figure: OpenOffice.org .odt)
8 Use Cases - TinyTM
• TinyTM protocol is licensed under the "LGPL V2.1 or higher"
• Rest of the TinyTM code is licensed under the "GPL V2.0 or higher"
• Translation clients can be licensed under any license
• TinyTM documentation is licensed under the Creative Commons Attribution License
8 Use Cases - TinyTM - Status
• Alpha functionality
  • Java TMX importer
  • Mark-up parsing logic
• Protocol freeze in public discussion on SourceForge
• Protocol features
  • Conservative extension with many important enhancements
8 Use Cases - TinyTM – new protocol features
• Industry and domain taxonomy
  • open to accommodate future TDA taxonomy research
• Inline markup handling
  1. stripped plain version
  2. normalized version with formatting placeholders
  3. TMX version with full TMX markup
• Advanced leverage functions
  • hash-based context storage
  • hash and trigram enhanced fuzzy search
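The ideas behind these features can be illustrated with a short, generic Python sketch. It is not TinyTM's implementation or protocol. All function names, the {n} placeholder syntax, the choice of SHA-1, and the Jaccard-style trigram score are assumptions chosen for the example.

```python
import hashlib
import itertools
import re

TAG = re.compile(r"<[^>]+>")  # naive inline-tag pattern, good enough for a sketch


def strip_markup(segment):
    """Variant 1: plain text with all inline tags removed."""
    return TAG.sub("", segment)


def normalize_markup(segment):
    """Variant 2: replace each inline tag with a numbered placeholder."""
    counter = itertools.count(1)
    return TAG.sub(lambda _match: "{%d}" % next(counter), segment)


def segment_hash(text):
    """Stable hash of a segment, usable as an exact-match lookup key."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def context_key(previous, current, following):
    """Key combining a segment with its neighbours, for context-aware matches."""
    return tuple(segment_hash(s) for s in (previous, current, following))


def trigrams(text):
    """Set of character trigrams of the lower-cased, padded text."""
    padded = "  " + text.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a, b):
    """Share of trigrams the two strings have in common (0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


if __name__ == "__main__":
    seg = "This is the <b>first</b> sentence"
    print(strip_markup(seg))
    print(normalize_markup(seg))
    print(trigram_similarity("This is a sentence", "This is a short sentence"))
```

Hash keys answer exact and in-context lookups cheaply, while the trigram score ranks fuzzy candidates. A server could use the stripped variant for searching and the richer variants for returning formatted matches.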
9 Discussion
Thanks for your attention
10 References
Open Standards (selective)
• XLIFF 1.2: http://docs.oasis-open.org/xliff/v1.2/os/xliff-core.html
• TMX 1.4b: http://www.lisa.org/fileadmin/standards/tmx1.4/tmx.htm
• SRX 2.0: http://www.lisa.org/fileadmin/standards/srx20.html
Open Tools (selective)
• OKAPI Framework: http://okapi.sourceforge.net
• OmegaT: http://sourceforge.net/projects/omegat
• OmegaT+: http://omegatplus.sourceforge.net
• Open ACS: http://openacs.org
• PostgreSQL: http://www.postgresql.org
• TinyTM: http://tinytm.sourceforge.net
10 References
Tools continued
• XML to XLIFF to XML: https://sourceforge.net/projects/xliffroundtrip
• TMX Compliance Kit (test for TMX certification): http://www.lisa.org/Translation-Memory-e.34.0.html
• TBX Checker: http://www.lisa.org/TBX-Resources.65.0.html
Standards continued
• Unicode Standard Annex #29: http://www.unicode.org/reports/tr29
• W3C ITS: http://www.w3.org/TR/its
1 Polling Questions
bull (B) How familiar are you with open source applications used in the localization industry (such as OmegaT Okapi Framework Sun XLIFF Translation Editor)
Im familiar with these tools and use them (or tools like them) regularly
I have a basic understanding of these applications but dont really use them
Im new to the idea of open source tools for the localization industry and want to learn more
1 Polling Questions
bull (C) Which are important to you
Learning about the differences between open standards and open source
Learning about actual use open standards and commonly used tools
Learning about licensing and patent issues
Learning about the open Translation Management Systems in use or development today
Q1
Open-Closed
Good
Q2
Open-Open
Good
Q3
Proprietary-Closed
Bad
Q4
Proprietary-Open
Wild
DefinitionsThe magic quadrant
Open Standards
Open Source
Closed Source
Proprietary ways
2 Definitions
bull TMS GMS
ETMS ndash Enterprise TMSldquofrom cradle to the graverdquo
Computer Aided L10N Project Management System (CALPMS)
bull Open Standards XLIFF TMX TBX SRX etc
bull OSS Open Source Free Software vs Freeware
bull Open Source (Copy-left) Licensing vs Permissive Licensing
3 Architecture
4 Strategy
New business
Needs
Change Enabler
Win
Win
Win
Translator
LSP of any size
Enterprise
TinyTM OmegaT Open ACS OKAPI framework Etc
Exponential growth of content
Changing balance between published and user generated content
Need for Continuous Translation
Community Translation Shared language
data Massive online
collaboration Translation automation
What is an open standard
World Wide Web Consortiums definitionbull Transparency (designdue process is public and all technical
discussions meeting minutes are archived and referencablein decision making)
bull Relevance (new standardization is started upon due analysis of the market needs including requirements phase eg accessibility multi-linguism)
bull Openness (anyone can participate industry individual public government bodies academia on a worldwide scale)
bull Impartiality and consensus (neutral org leading it with equal weight for each participant)
bull Availability (free access to the standard text both during development and at final stage translations and clear IPR rules for implementation allowing open source development in the case of Web technologies)
bull Support (multiple implementations ongoing process for testing errata revision permanent access)
Wikipedia 2009
Goal of open standards
bull Interoperability of toolsbull Vendors can concentrate on innovation in other fields than their proprietary formats
bull Standardization of processes (translation of just one file format like XLIFF instead of DOC HTML InDesign FMhellip)
Success of open standards
bull Depends on the commercial usabilitybull TMX ndash widespread XLIFF ndash coming on
strong SRX ndash not widely used TBX ndash slow others ndash in the making (TBX Basic GMXhellip)
5 Open Standards
bull Why Open Standards in Open Source
bull Implementing open standards seems obvious success scenario for OSS development
bull XLIFF and TMX are open standards co-developed by our clients
bull Minimalist open standards implementation ensures desired functionality and is also legally safe
bull LISA OSCAR TMX 14b 15 20
bull OASIS XLIFF 11 12 121 20
Open Standards OAXAL
copy A
ndrz
ej Z
ydro
n O
ASIS
OAXAL T
C
TMXTranslation Memory Exchange
bull From the TMX specification
bull hellipThe purpose of the TMX format is to provide a standard method to describe translation memory data that is being exchanged among tools andor translation vendors while introducing little or no loss of critical data during the processhellip
What is TMX
bull It is an XML representation of translation memory data
bull Header
bull Body
ltheadercreationtool=ldquoDeacutejagrave Vu creationtoolversion=ldquo4datatype=PlainTextrdquosegtype=sentenceadminlang=en-ussrclang=en-uso-tmf=DVMDB
gt
Deacutejagrave Vu Transit Trados MemoQ
Version build number of the tool
HTML SGML RTF Interleaf Javahellip
Basic segmentation
Default language for elements like ltnotegt
Source text language
Original translation memory format (DVMDB ndash Deacutejagrave Vu databasehellip)
What is TMX
bull Body
ltbodygtlttu creationdate=20030915T153704Z creationid=USERgt
lttuv lang=EN-USgtltseggtThis is the first sentenceltseggt
lttuvgtlttuv lang=DE-DEgt
ltseggtDies ist der erste Satzltseggtlttuvgt
lttugtltbodygt
tu = Translation Unittuv lang = translation unit variant (language) seg = segment
What is TMX
bull Depending on the tool that created the TMX file it can be bilingual or multilingual
bull Importing multilingual TMX file into a bilingual project will only import the relevant languages
Levels of TMX
bull Level 1bull Plain text only (sufficient for data coming from software localization tools)
bull Level 2bull Text plus formatting (data coming from translation memory tools used for translation of documentation)
To move formatting and text from one tool to the other both tools need to be level2 compliant
Level 1
bull Formatting that is applied to the source and target text of a translation unit is not exported to the TMX file only pure text
bull Original
bull This sentence has some formatting
bull In TMX
bull This sentence has some formatting
Level 2
bull Formatting that is applied to the source and target text of a translation unit is exported to the TMX file
bull Different tools use different ways of encoding that information (placeholders or actual formatting information)
Level 2
seggtThis is the ltbpt i=1 type=boldgtltbptgtfirstltept i=1gtlteptgt sentence this is ltbpt i=2 type=ulinedgtltbptgtanotherltept i=2gtlteptgt sentenceltseggt
MemoQ ndash Word DOC with formatting
ltseggtThis is the ltbpt i=1 type=Bold gtfirstltept i=1 gt sentence this isltbpt i=2 type=Underline gtanotherltept i=2 gt sentenceltseggt
Trados 2009 ndash Word DOC with formatting
ltseggtThis is the ltbpt i=1gtampltcf bold=ampquotonampquotampgtltbptgtfirstltept i=1gtampltcfampgtlteptgt sentence this is ltbpt i=2gtampltcf underlinestyle=ampquotsingleampquotampgtltbptgtanotherltept i=2gtampltcfampgtlteptgtsentenceltseggt
Trados 2007 82 83 ndash Word DOC with formatting
Level 2
MemoQ ndash HTML file with link
Trados 2009 ndash HTML file with link
Trados 2007 82 83 ndash HTML file with link
ltseggtText with a link to ltbpt i=1gtamplta href=ampquothttpwwwsamplehtmlcompage1htmampquotampgtltbptgtanother pageltept i=1gtampltaampgtlteptgtltseggt
ltseggtText with a link to ltbpt i=1 type=19 x=1 gtanother pageltept i=1 gtltseggt
ltseggtText with a link to ltbpt i=1 type=linkgtamplta href = ampquothttpwwwsamplehtmlcompage1htmampquotampgtltbptgtanother pageltept i=1gtampltaampgtlteptgtltseggt
OmegaT internal format ltseggtText with a link to amplta0ampgtanother pageamplta0ampgtltseggt
TMX Level 2 format ltseggtText with a link to ltbpt i=0 x=0gtamplta0ampgtltbptgtanother pageltept i=0gtamplta0ampgtlteptgtltseggt
OmegaT - HTML file with link
Level 2
MemoQ ndash InDesign
Trados 2009 ndash InDesign
Trados 2007 82 83 ndash InDesign
ltseggtInDesign text with ltbpt i=1gtampltcf ptfs=ampquotc_Boldampquotampgtltbptgtformatting in boldltept i=1gtampltcfampgtlteptgtltseggt
ltseggtInDesign text with ltbpt i=1 type=pt16 x=1 gtformatting in boldltept i=1 gtltseggt
ltseggtInDesign Text with ltbpt i=1 type=boldgtltbptgtformtatting in boldltept i=1gtlteptgtltseggt
Implications of different tags for formatting
bull Tools that use placeholder tags do not include the actual formatting information in the TMX file
bull Other tools might only be able to re-use the text especially if the formatting is only applied in the target segment but not in the source
bull The result of the exchange would then be the same as with TMX level 1 (text only)
bull TMX files which carry the actual formatting information will yield better matches in other tools that can read this information
Where do you use TMX
bull Transfering data between different translation memory tools
bull Checking tools QA tools
bull TM maintenance tools
bull Basis for bilingual term extraxtion
Reusing TMX data
bull Although Translation Memory Tools have the same basic idea (storing source-target language pairs and recycling translations) this has been realized in different ways
bull Exchange with TMX works but there is an issue that can lower the match rates nonethelesshellip the segmentation rules
SRX ndash Segmentation Rules Exchange
bull From the SRX specification
bull hellipThe purpose of the SRX format is to provide a standard method to describe segmentation rules that are being exchanged among tools andor translation vendors
bull hellipis intended to enhance the TMX standardhellip
Why SRX
bull Tool Abull Semicolon is end of segment
bull This is a sentence this is another sentence
bull TM system sees two separate segments
bull Tool Bbull Semicolon is NOT end of segment
bull This is a sentence this is another sentence
bull TM system sees one segmentbull No match from the TMX data
bull Match rate around 50 usual setting around 70
Segmentation rules
bull Rules that the tool applies to the text to translate to split it up into segments
bull paragraph
bull sentence
bull phrase
bull incomplete sentences in bulleted lists
bull single words (headings ldquoNoterdquo ldquoAttentionrdquo)
Segmentation rules
bull End of segment rules (common to the default settings of all tools)bull Dot at the end of a sentence (not after known
abbreviations)bull Question mark exclamation markbull Paragraph markbull Colon
bull End of segment rules (different for different tools)bull Semicolonbull Tab characterbull Sub segments (index entries footnotes
graphics)
Comparison of default rules
Workbench Transit DV SDLX Across
Colon end end end no end no end
Semi-
colon
no end end end no end no end
Tab end no end no end no end no end
Soft
return
no end no end end in
Word no
end in
PPT
end in
Word no
end in
PPT
no end
What can SRX do and what not
bull It can only show the segmentation rule settings at the time of export
bull It cannot show any changes that have been applied in the segmentation rules during the use of the TM
bull Sometimes the rules from system 1 cannot be re-created in system 2 then the rule will be ignored
TBX - TermBase Exchange
• From the working draft of the TBX specification
  • TBX is an open, XML-based standard format for terminological data
  • TBX is designed to support the analysis, representation, dissemination and exchange of information from human-oriented terminological databases (termbases)
  • TBX is built on the basis of ISO 12620 (data categories) and ISO 12200 (MARTIF core structure)
TMX / TBX
[Figure: annotated sample entry with callouts for the term in English, the term in French, global information in the entry head, information on term level, administrative data of this language, and the language IDs]
Where could you use TBX?
• Exchange terminological data
  • From term base to source language checking tools
  • Between term bases of translation memory tools
  • From term base to terminology checking tools
  • From term base to terminology extraction tools (as stop word lists)
  • From term base to dictionary of a machine translation tool
• For indexing keywords in document management systems, content management systems, knowledge management systems
• Publishing terminological data on the intranet / Internet
• Optimization of search engines / text mining by searching for synonyms automatically
XLIFF - XML Localization Interchange File Format
• The XLIFF format aims to
  • separate localizable text from formatting (deal with one file format in translation instead of different processes to extract, filter, convert text from different file formats)
  • enable multiple tools to work on the source text
  • store information that is helpful in supporting a localization process (like metadata on versions of source and target segments)
• An XLIFF file is bilingual and can be the container for a number of individual files
• Each file element in an XLIFF file contains a header (project data such as contact information, project phases, pointers to reference material and information on the skeleton file) and a body section with the actual text segments
XLIFF
• XLIFF can carry several translation matches
• Additional fields can contain context, author, creation tool, history…
<trans-unit id="n1">
  <source>This is a sentence</source>
  <target xml:lang="de">Dies ist ein Satz</target>
  <alt-trans match-quality="100" tool="TM_System">
    <source>This is a sentence</source>
    <target xml:lang="de">Das ist ein Satz</target>
  </alt-trans>
  <alt-trans match-quality="70" tool="TM_System">
    <source>This is a short sentence</source>
    <target xml:lang="de">Dies ist ein kurzer Satz</target>
  </alt-trans>
</trans-unit>
<trans-unit id="1" resname="IDCANCEL" restype="button" coord="8;8;50;14">
  <source xml:lang="en">Cancel</source>
</trans-unit>
• Information on the name of a button and its coordinates (for visual representation of the button in a localization tool)
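As a rough illustration of how a tool could read the translation matches carried in a trans-unit, the sketch below parses the sample unit and lists its alt-trans proposals, best match first. The helper name `matches` is made up, and the fragment is parsed without the XLIFF namespace that a complete XLIFF file would declare, so treat it as a simplification rather than a full XLIFF reader.

```python
import xml.etree.ElementTree as ET

XLIFF_UNIT = """<trans-unit id="n1">
  <source>This is a sentence</source>
  <target xml:lang="de">Dies ist ein Satz</target>
  <alt-trans match-quality="100" tool="TM_System">
    <source>This is a sentence</source>
    <target xml:lang="de">Das ist ein Satz</target>
  </alt-trans>
  <alt-trans match-quality="70" tool="TM_System">
    <source>This is a short sentence</source>
    <target xml:lang="de">Dies ist ein kurzer Satz</target>
  </alt-trans>
</trans-unit>"""


def matches(unit_xml):
    """Return (quality, source, target) tuples for every alt-trans, best first."""
    unit = ET.fromstring(unit_xml)
    found = []
    for alt in unit.findall("alt-trans"):
        # match-quality may be written with or without a percent sign
        quality = int(alt.get("match-quality", "0").rstrip("%"))
        found.append((quality, alt.findtext("source"), alt.findtext("target")))
    return sorted(found, reverse=True)


for quality, source, target in matches(XLIFF_UNIT):
    print(quality, "|", source, "|", target)
```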
Where is XLIFF useful?
• Where experience with XML exists
• Projects contain many different file formats
  • All formats are converted to XLIFF for translation
• Different tools need to be used during localization
• Different translations (alt-trans) or languages needed as reference
Any idea why XLIFF should NOT be the cure for everything?
• Instead of developing parsers for different file formats (to read the file into a translation tool), developers now need to create parsers to convert those file formats to XLIFF
• Some file formats can already be dealt with (Office, HTML, XML…) - why should a new parser be created for those?
• XLIFF has its variants, like all XML-based files - how can you make sure that each tool can process any of the possible XLIFF flavours?
The magic quadrant again, just to remember the distinction
                   Closed Source                  Open Source
Open Standards     Q1: Open-Closed (Good)         Q2: Open-Open (Good)
Proprietary ways   Q3: Proprietary-Closed (Bad)   Q4: Proprietary-Open (Wild)
6 Talking Legalese in OSS
OSS (Open Source Software)
• Copyleft and Permissive Licensing
  • GPL (General Public License, by FSF)
  • BSD (Berkeley Software Distribution)
  • Apache License 2.0
• MS open source licensing
  • Ms-PL (Public License)
  • Ms-RL (Reciprocal License)
Derived/Derivative works vs. dynamic linkage
• Paradigmatic case is an app running on an OS
Contributions
Distribution
6 Talking Legalese in standards
RF (Royalty Free) vs. RAND (Reasonable and Non-Discriminatory)
Standards can be licensed under different terms
• RF
• RAND - there is no general consensus to date on whether RAND is Open or Proprietary
• Proprietary
By open standard we mean just RF standards
The opposite of Open Standard is any proprietary way of doing things, even if it is standardized
7 Open Tools
• Infrastructure baseline
  • PostgreSQL, MySQL, Open ACS, Petri Nets, Alfresco, jBoss
• TM Server
  • TinyTM
• Workflow capabilities through any CALPMS
  • Open: ]po[, .LRN; closed source: AIT Projetex, LTC Worx, Plunet Business Manager, etc.
• CAT Clients
  • OmegaT, Metatexis, TinyTM Word macro, Multilizer, and whoever wants to implement the protocol
• Filters
  • OKAPI framework, mainstream XLIFF generators such as Sun or Adobe
8 Use Cases - .odt + OmegaT
• Level 2 TMX support
• Has TinyTM connector in alpha
• Has plaintext glossary support
OpenOffice.org / .odt
8 Use Cases - TinyTM
• TinyTM protocol is licensed under the "LGPL V2.1 or higher"
• Rest of the TinyTM code is licensed under the "GPL V2.0 or higher"
• Translation clients can be licensed under any license
• TinyTM documentation is licensed under the Creative Commons Attribution License
8 Use Cases - TinyTM - Status
• Alpha functionality
• Java TMX importer
• Mark-up parsing logic
-------------------------
• Protocol freeze in public discussion on Sourceforge
• Protocol features
  • Conservative extension with many important enhancements
8 Use Cases - TinyTM - new protocol features
• Industry and domain taxonomy
  • open to accommodate future TDA taxonomy research
• Inline markup handling
  1. stripped plain version
  2. normalized version with formatting placeholders i
  3. TMX version with full TMX markup
• Advanced leverage functions
  • hash-based context storage
  • hash and trigram enhanced fuzzy search
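The two leverage ideas can be illustrated with a generic sketch. This is not TinyTM's implementation: the padding scheme, the Jaccard score, the SHA-1 key, the 0.5 threshold and all function names are assumptions chosen only to show what "trigram fuzzy search" and "hash based context storage" could mean in practice.

```python
import hashlib


def trigrams(text):
    """Set of lower-cased character trigrams, padded so word edges count."""
    padded = "  " + text.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def similarity(a, b):
    """Jaccard overlap of the two trigram sets, between 0.0 and 1.0."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def context_key(previous_segment, segment, next_segment):
    """Hash of a segment with its neighbours, usable as an exact-context key."""
    joined = "\x1f".join((previous_segment, segment, next_segment))
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


def lookup(query, memory, threshold=0.5):
    """Return (score, source, target) for stored sources above the threshold."""
    scored = [(similarity(query, src), src, tgt) for src, tgt in memory.items()]
    return sorted((hit for hit in scored if hit[0] >= threshold), reverse=True)


memory = {"This is a sentence": "Dies ist ein Satz", "Cancel": "Abbrechen"}
for score, source, target in lookup("This is a short sentence", memory):
    print(round(score, 2), "|", source, "|", target)
```

A hash key only finds identical segments in an identical context, while the trigram score still finds near matches; using both is one plausible way to combine context matches with fuzzy search.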
9 Discussion
Thanks for your attention
3 Architecture
4 Strategy
New business needs
• Exponential growth of content
• Changing balance between published and user generated content
• Need for continuous translation
• Community translation, shared language data, massive online collaboration, translation automation
Change enabler
• TinyTM, OmegaT, Open ACS, OKAPI framework, etc.
Win - Win - Win
• Translator
• LSP of any size
• Enterprise
What is an open standard?
World Wide Web Consortium's definition
• Transparency (design/due process is public, and all technical discussions and meeting minutes are archived and referenceable in decision making)
• Relevance (new standardization is started upon due analysis of the market needs, including requirements phase, e.g. accessibility, multi-linguism)
• Openness (anyone can participate: industry, individual, public, government bodies, academia, on a worldwide scale)
• Impartiality and consensus (neutral org leading it, with equal weight for each participant)
• Availability (free access to the standard text, both during development and at final stage, translations, and clear IPR rules for implementation, allowing open source development in the case of Web technologies)
• Support (multiple implementations, ongoing process for testing, errata, revision, permanent access)
(Source: Wikipedia, 2009)
Goal of open standards
• Interoperability of tools
  • Vendors can concentrate on innovation in fields other than their proprietary formats
• Standardization of processes (translation of just one file format like XLIFF instead of DOC, HTML, InDesign, FM…)
Success of open standards
• Depends on the commercial usability
  • TMX - widespread
  • XLIFF - coming on strong
  • SRX - not widely used
  • TBX - slow
  • others - in the making (TBX Basic, GMX…)
5 Open Standards
• Why Open Standards in Open Source?
  • Implementing open standards seems an obvious success scenario for OSS development
  • XLIFF and TMX are open standards co-developed by our clients
  • Minimalist open standards implementation ensures desired functionality and is also legally safe
• LISA OSCAR TMX 1.4b, 1.5, 2.0
• OASIS XLIFF 1.1, 1.2, 1.2.1, 2.0
Open Standards: OAXAL
[Figure: OAXAL diagram, © Andrzej Zydron, OASIS OAXAL TC]
TMX - Translation Memory Exchange
• From the TMX specification
  • …The purpose of the TMX format is to provide a standard method to describe translation memory data that is being exchanged among tools and/or translation vendors, while introducing little or no loss of critical data during the process…
What is TMX?
• It is an XML representation of translation memory data
  • Header
  • Body
<header
  creationtool="Déjà Vu"
  creationtoolversion="4"
  datatype="PlainText"
  segtype="sentence"
  adminlang="en-us"
  srclang="en-us"
  o-tmf="DVMDB"
>
creationtool = Déjà Vu, Transit, Trados, MemoQ
creationtoolversion = version / build number of the tool
datatype = HTML, SGML, RTF, Interleaf, Java…
segtype = basic segmentation
adminlang = default language for elements like <note>
srclang = source text language
o-tmf = original translation memory format (DVMDB - Déjà Vu database…)
What is TMX?
• Body
<body>
  <tu creationdate="20030915T153704Z" creationid="USER">
    <tuv lang="EN-US">
      <seg>This is the first sentence</seg>
    </tuv>
    <tuv lang="DE-DE">
      <seg>Dies ist der erste Satz</seg>
    </tuv>
  </tu>
</body>
tu = translation unit
tuv lang = translation unit variant (language)
seg = segment
What is TMX?
• Depending on the tool that created the TMX file, it can be bilingual or multilingual
• Importing a multilingual TMX file into a bilingual project will only import the relevant languages
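The tu / tuv / seg structure, and the idea of importing only the relevant languages, can be sketched in a few lines. This is an illustration only: `read_pairs` is a made-up helper, the sample file is trimmed to a single unit, and the code accepts both the `lang` attribute used in the sample and the `xml:lang` form, since either may turn up in files from different tools.

```python
import xml.etree.ElementTree as ET

TMX = """<tmx version="1.4">
  <header creationtool="Déjà Vu" creationtoolversion="4" datatype="PlainText"
          segtype="sentence" adminlang="en-us" srclang="en-us" o-tmf="DVMDB"/>
  <body>
    <tu creationdate="20030915T153704Z" creationid="USER">
      <tuv lang="EN-US"><seg>This is the first sentence</seg></tuv>
      <tuv lang="DE-DE"><seg>Dies ist der erste Satz</seg></tuv>
      <tuv xml:lang="FR-FR"><seg>Ceci est la première phrase</seg></tuv>
    </tu>
  </body>
</tmx>"""

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_pairs(tmx_text, source_lang, target_lang):
    """Return (source, target) text pairs for one language pair only."""
    root = ET.fromstring(tmx_text)
    wanted_source, wanted_target = source_lang.lower(), target_lang.lower()
    pairs = []
    for tu in root.iter("tu"):
        by_lang = {}
        for tuv in tu.findall("tuv"):
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                by_lang[lang] = "".join(seg.itertext())
        if wanted_source in by_lang and wanted_target in by_lang:
            pairs.append((by_lang[wanted_source], by_lang[wanted_target]))
    return pairs


print(read_pairs(TMX, "en-us", "de-de"))
```

The French variant in the sample is simply skipped when the English-German pair is requested, which mirrors what a bilingual project does with a multilingual TMX file.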
Levels of TMX
• Level 1
  • Plain text only (sufficient for data coming from software localization tools)
• Level 2
  • Text plus formatting (data coming from translation memory tools used for translation of documentation)
To move formatting and text from one tool to the other, both tools need to be Level 2 compliant
Level 1
• Formatting that is applied to the source and target text of a translation unit is not exported to the TMX file, only pure text
• Original
  • This sentence has some formatting
• In TMX
  • This sentence has some formatting
Level 2
• Formatting that is applied to the source and target text of a translation unit is exported to the TMX file
• Different tools use different ways of encoding that information (placeholders or actual formatting information)
Level 2
<seg>This is the <bpt i="1" type="bold"></bpt>first<ept i="1"></ept> sentence this is <bpt i="2" type="ulined"></bpt>another<ept i="2"></ept> sentence</seg>
MemoQ - Word DOC with formatting
<seg>This is the <bpt i="1" type="Bold" />first<ept i="1" /> sentence this is <bpt i="2" type="Underline" />another<ept i="2" /> sentence</seg>
Trados 2009 - Word DOC with formatting
<seg>This is the <bpt i="1">&lt;cf bold=&quot;on&quot;&gt;</bpt>first<ept i="1">&lt;/cf&gt;</ept> sentence this is <bpt i="2">&lt;cf underlinestyle=&quot;single&quot;&gt;</bpt>another<ept i="2">&lt;/cf&gt;</ept> sentence</seg>
Trados 2007 (8.2, 8.3) - Word DOC with formatting
Level 2
<seg>Text with a link to <bpt i="1" type="link">&lt;a href = &quot;http://www.samplehtml.com/page1.htm&quot;&gt;</bpt>another page<ept i="1">&lt;/a&gt;</ept></seg>
MemoQ - HTML file with link
<seg>Text with a link to <bpt i="1" type="19" x="1" />another page<ept i="1" /></seg>
Trados 2009 - HTML file with link
<seg>Text with a link to <bpt i="1">&lt;a href=&quot;http://www.samplehtml.com/page1.htm&quot;&gt;</bpt>another page<ept i="1">&lt;/a&gt;</ept></seg>
Trados 2007 (8.2, 8.3) - HTML file with link
OmegaT internal format: <seg>Text with a link to &lt;a0&gt;another page&lt;/a0&gt;</seg>
TMX Level 2 format: <seg>Text with a link to <bpt i="0" x="0">&lt;a0&gt;</bpt>another page<ept i="0">&lt;/a0&gt;</ept></seg>
OmegaT - HTML file with link
Level 2
<seg>InDesign Text with <bpt i="1" type="bold"></bpt>formatting in bold<ept i="1"></ept></seg>
MemoQ - InDesign
<seg>InDesign text with <bpt i="1" type="pt16" x="1" />formatting in bold<ept i="1" /></seg>
Trados 2009 - InDesign
<seg>InDesign text with <bpt i="1">&lt;cf ptfs=&quot;c_Bold&quot;&gt;</bpt>formatting in bold<ept i="1">&lt;/cf&gt;</ept></seg>
Trados 2007 (8.2, 8.3) - InDesign
Implications of different tags for formatting
• Tools that use placeholder tags do not include the actual formatting information in the TMX file
• Other tools might only be able to re-use the text, especially if the formatting is only applied in the target segment but not in the source
• The result of the exchange would then be the same as with TMX Level 1 (text only)
• TMX files which carry the actual formatting information will yield better matches in other tools that can read this information
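What "only able to re-use the text" amounts to can be shown with a short sketch that reduces a Level 2 segment to its Level 1 text. The helper name `plain_text` is invented, the list of inline elements is our reading of the TMX inline markup, and nested cases (for example codes inside a highlight element) are deliberately not handled.

```python
import xml.etree.ElementTree as ET

# Inline elements whose content is formatting code rather than translatable text
INLINE_CODE_TAGS = {"bpt", "ept", "ph", "it", "ut"}

WITH_CODES = (
    '<seg>Text with a link to <bpt i="1" type="link">'
    "&lt;a href=&quot;http://www.samplehtml.com/page1.htm&quot;&gt;</bpt>"
    'another page<ept i="1">&lt;/a&gt;</ept></seg>'
)
WITH_PLACEHOLDERS = (
    '<seg>Text with a link to <bpt i="1" type="19" x="1" />'
    'another page<ept i="1" /></seg>'
)


def plain_text(seg_xml):
    """Drop inline formatting codes and keep only the translatable text."""
    seg = ET.fromstring(seg_xml)
    parts = [seg.text or ""]
    for child in seg:
        if child.tag not in INLINE_CODE_TAGS:
            # Other inline elements wrap real text, so their text is kept
            parts.append("".join(child.itertext()))
        parts.append(child.tail or "")
    return "".join(parts)


print(plain_text(WITH_CODES))
print(plain_text(WITH_PLACEHOLDERS))
```

Both variants collapse to the same plain sentence, so the text can still be matched across tools, while the formatting itself survives only where the receiving tool understands the codes.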
Where do you use TMX?
• Transferring data between different translation memory tools
• Checking tools / QA tools
• TM maintenance tools
• Basis for bilingual term extraction
Q1
Open-Closed
Good
Q2
Open-Open
Good
Q3
Proprietary-Closed
Bad
Q4
Proprietary-Open
Wild
DefinitionsThe magic quadrant
Open Standards
Open Source
Closed Source
Proprietary ways
2 Definitions
bull TMS GMS
ETMS ndash Enterprise TMSldquofrom cradle to the graverdquo
Computer Aided L10N Project Management System (CALPMS)
bull Open Standards XLIFF TMX TBX SRX etc
bull OSS Open Source Free Software vs Freeware
bull Open Source (Copy-left) Licensing vs Permissive Licensing
3 Architecture
4 Strategy
New business
Needs
Change Enabler
Win
Win
Win
Translator
LSP of any size
Enterprise
TinyTM OmegaT Open ACS OKAPI framework Etc
Exponential growth of content
Changing balance between published and user generated content
Need for Continuous Translation
Community Translation Shared language
data Massive online
collaboration Translation automation
What is an open standard
World Wide Web Consortiums definitionbull Transparency (designdue process is public and all technical
discussions meeting minutes are archived and referencablein decision making)
bull Relevance (new standardization is started upon due analysis of the market needs including requirements phase eg accessibility multi-linguism)
bull Openness (anyone can participate industry individual public government bodies academia on a worldwide scale)
bull Impartiality and consensus (neutral org leading it with equal weight for each participant)
bull Availability (free access to the standard text both during development and at final stage translations and clear IPR rules for implementation allowing open source development in the case of Web technologies)
bull Support (multiple implementations ongoing process for testing errata revision permanent access)
Wikipedia 2009
Goal of open standards
bull Interoperability of toolsbull Vendors can concentrate on innovation in other fields than their proprietary formats
bull Standardization of processes (translation of just one file format like XLIFF instead of DOC HTML InDesign FMhellip)
Success of open standards
bull Depends on the commercial usabilitybull TMX ndash widespread XLIFF ndash coming on
strong SRX ndash not widely used TBX ndash slow others ndash in the making (TBX Basic GMXhellip)
5 Open Standards
bull Why Open Standards in Open Source
bull Implementing open standards seems obvious success scenario for OSS development
bull XLIFF and TMX are open standards co-developed by our clients
bull Minimalist open standards implementation ensures desired functionality and is also legally safe
bull LISA OSCAR TMX 14b 15 20
bull OASIS XLIFF 11 12 121 20
Open Standards OAXAL
copy A
ndrz
ej Z
ydro
n O
ASIS
OAXAL T
C
TMXTranslation Memory Exchange
bull From the TMX specification
bull hellipThe purpose of the TMX format is to provide a standard method to describe translation memory data that is being exchanged among tools andor translation vendors while introducing little or no loss of critical data during the processhellip
What is TMX
bull It is an XML representation of translation memory data
bull Header
bull Body
ltheadercreationtool=ldquoDeacutejagrave Vu creationtoolversion=ldquo4datatype=PlainTextrdquosegtype=sentenceadminlang=en-ussrclang=en-uso-tmf=DVMDB
gt
Deacutejagrave Vu Transit Trados MemoQ
Version build number of the tool
HTML SGML RTF Interleaf Javahellip
Basic segmentation
Default language for elements like ltnotegt
Source text language
Original translation memory format (DVMDB ndash Deacutejagrave Vu databasehellip)
What is TMX
bull Body
ltbodygtlttu creationdate=20030915T153704Z creationid=USERgt
lttuv lang=EN-USgtltseggtThis is the first sentenceltseggt
lttuvgtlttuv lang=DE-DEgt
ltseggtDies ist der erste Satzltseggtlttuvgt
lttugtltbodygt
tu = Translation Unittuv lang = translation unit variant (language) seg = segment
What is TMX
bull Depending on the tool that created the TMX file it can be bilingual or multilingual
bull Importing multilingual TMX file into a bilingual project will only import the relevant languages
Levels of TMX
bull Level 1bull Plain text only (sufficient for data coming from software localization tools)
bull Level 2bull Text plus formatting (data coming from translation memory tools used for translation of documentation)
To move formatting and text from one tool to the other both tools need to be level2 compliant
Level 1
bull Formatting that is applied to the source and target text of a translation unit is not exported to the TMX file only pure text
bull Original
bull This sentence has some formatting
bull In TMX
bull This sentence has some formatting
Level 2
bull Formatting that is applied to the source and target text of a translation unit is exported to the TMX file
bull Different tools use different ways of encoding that information (placeholders or actual formatting information)
Level 2
seggtThis is the ltbpt i=1 type=boldgtltbptgtfirstltept i=1gtlteptgt sentence this is ltbpt i=2 type=ulinedgtltbptgtanotherltept i=2gtlteptgt sentenceltseggt
MemoQ ndash Word DOC with formatting
ltseggtThis is the ltbpt i=1 type=Bold gtfirstltept i=1 gt sentence this isltbpt i=2 type=Underline gtanotherltept i=2 gt sentenceltseggt
Trados 2009 ndash Word DOC with formatting
ltseggtThis is the ltbpt i=1gtampltcf bold=ampquotonampquotampgtltbptgtfirstltept i=1gtampltcfampgtlteptgt sentence this is ltbpt i=2gtampltcf underlinestyle=ampquotsingleampquotampgtltbptgtanotherltept i=2gtampltcfampgtlteptgtsentenceltseggt
Trados 2007 82 83 ndash Word DOC with formatting
Level 2
MemoQ ndash HTML file with link
Trados 2009 ndash HTML file with link
Trados 2007 82 83 ndash HTML file with link
ltseggtText with a link to ltbpt i=1gtamplta href=ampquothttpwwwsamplehtmlcompage1htmampquotampgtltbptgtanother pageltept i=1gtampltaampgtlteptgtltseggt
ltseggtText with a link to ltbpt i=1 type=19 x=1 gtanother pageltept i=1 gtltseggt
ltseggtText with a link to ltbpt i=1 type=linkgtamplta href = ampquothttpwwwsamplehtmlcompage1htmampquotampgtltbptgtanother pageltept i=1gtampltaampgtlteptgtltseggt
OmegaT internal format ltseggtText with a link to amplta0ampgtanother pageamplta0ampgtltseggt
TMX Level 2 format ltseggtText with a link to ltbpt i=0 x=0gtamplta0ampgtltbptgtanother pageltept i=0gtamplta0ampgtlteptgtltseggt
OmegaT - HTML file with link
Level 2
MemoQ ndash InDesign
Trados 2009 ndash InDesign
Trados 2007 82 83 ndash InDesign
ltseggtInDesign text with ltbpt i=1gtampltcf ptfs=ampquotc_Boldampquotampgtltbptgtformatting in boldltept i=1gtampltcfampgtlteptgtltseggt
ltseggtInDesign text with ltbpt i=1 type=pt16 x=1 gtformatting in boldltept i=1 gtltseggt
ltseggtInDesign Text with ltbpt i=1 type=boldgtltbptgtformtatting in boldltept i=1gtlteptgtltseggt
Implications of different tags for formatting
bull Tools that use placeholder tags do not include the actual formatting information in the TMX file
bull Other tools might only be able to re-use the text especially if the formatting is only applied in the target segment but not in the source
bull The result of the exchange would then be the same as with TMX level 1 (text only)
bull TMX files which carry the actual formatting information will yield better matches in other tools that can read this information
Where do you use TMX
bull Transfering data between different translation memory tools
bull Checking tools QA tools
bull TM maintenance tools
bull Basis for bilingual term extraxtion
Reusing TMX data
bull Although Translation Memory Tools have the same basic idea (storing source-target language pairs and recycling translations) this has been realized in different ways
bull Exchange with TMX works but there is an issue that can lower the match rates nonethelesshellip the segmentation rules
SRX – Segmentation Rules Exchange
• From the SRX specification:
  • "…The purpose of the SRX format is to provide a standard method to describe segmentation rules that are being exchanged among tools and/or translation vendors."
  • "…is intended to enhance the TMX standard…"
Why SRX?
• Tool A
  • Semicolon is end of segment
  • "This is a sentence; this is another sentence."
  • TM system sees two separate segments
• Tool B
  • Semicolon is NOT end of segment
  • "This is a sentence; this is another sentence."
  • TM system sees one segment
  • No match from the TMX data
  • Match rate around 50%; usual minimum match setting around 70%
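The Tool A / Tool B situation can be reproduced with a small Python sketch. The segment function and its break_chars parameter are illustrative assumptions (real tools use richer rule sets), and the translation memory is modelled as a plain dictionary keyed by source segment.

```python
import re

def segment(text, break_chars):
    """Split text after any character in break_chars that is followed by whitespace."""
    pattern = r"(?<=[" + re.escape(break_chars) + r"])\s+"
    return [s for s in re.split(pattern, text.strip()) if s]

text = "This is a sentence; this is another sentence."

tool_a = segment(text, ".?!;")   # Tool A: semicolon ends a segment
tool_b = segment(text, ".?!")    # Tool B: semicolon does not end a segment

# A translation memory filled by Tool A, keyed by source segment.
tm = {seg: "<translation of: " + seg + ">" for seg in tool_a}

# Tool B looks up its own segments in Tool A's data.
for seg in tool_b:
    print(repr(seg), "->", tm.get(seg, "no exact match"))
```

Tool B's single long segment is not stored anywhere in Tool A's data, so the lookup fails even though every word has already been translated.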
Segmentation rules
• Rules that the tool applies to the text to be translated in order to split it into segments
  • paragraph
  • sentence
  • phrase
  • incomplete sentences in bulleted lists
  • single words (headings, "Note", "Attention")
Segmentation rules
• End-of-segment rules (common to the default settings of all tools)
  • Dot at the end of a sentence (not after known abbreviations)
  • Question mark, exclamation mark
  • Paragraph mark
  • Colon
• End-of-segment rules (different for different tools)
  • Semicolon
  • Tab character
  • Subsegments (index entries, footnotes, graphics)
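As a rough illustration of how such rules can be expressed, the following Python sketch models each rule the way SRX does: a break / no-break flag plus one regular expression for the text before the position and one for the text after it, with the first matching rule deciding. The rule list, the abbreviation list, and the function name segment are assumptions made for this example; it does not read or write actual SRX files.

```python
import re

# (is_break, before_break_regex, after_break_regex); the first matching rule wins.
RULES = [
    (False, r"\b(?:e\.g|i\.e|etc|Dr|Mr)\.", r"\s"),  # exception: known abbreviations
    (True,  r"[.?!:]",                      r"\s"),  # dot, question/exclamation mark, colon
]

def segment(text, rules=RULES):
    """Split text into segments by testing every position against the rules."""
    compiled = [(brk, re.compile(before + r"$"), re.compile(after))
                for brk, before, after in rules]
    segments, start = [], 0
    for pos in range(1, len(text)):
        for is_break, before, after in compiled:
            # The "before" pattern must end exactly at pos, the "after" pattern must start there.
            if before.search(text[:pos]) and after.match(text, pos):
                if is_break:
                    segments.append(text[start:pos].strip())
                    start = pos
                break  # first matching rule decides, whether it breaks or not
    tail = text[start:].strip()
    if tail:
        segments.append(tail)
    return segments

print(segment("See Dr. Smith. He is in: room 5? Yes!"))
```

Because the abbreviation exception is listed before the general break rule, "Dr." does not end a segment. Adding or removing a rule for the semicolon or the tab character is all it takes to reproduce the differences between tools.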
Comparison of default rules
            | Workbench | Transit | DV                         | SDLX                       | Across
Colon       | end       | end     | end                        | no end                     | no end
Semicolon   | no end    | end     | end                        | no end                     | no end
Tab         | end       | no end  | no end                     | no end                     | no end
Soft return | no end    | no end  | end in Word, no end in PPT | end in Word, no end in PPT | no end
What can SRX do and what not?
• It can only show the segmentation rule settings at the time of export
• It cannot show any changes that were made to the segmentation rules while the TM was in use
• Sometimes a rule from system 1 cannot be re-created in system 2; that rule will then be ignored
TBX – TermBase Exchange
• From the working draft of the TBX specification:
  • TBX is an open, XML-based standard format for terminological data
• TBX is designed to support the analysis, representation, dissemination, and exchange of information from human-oriented terminological databases (termbases)
• TBX is built on the basis of ISO 12620 (data categories) and ISO 12200 (MARTIF core structure)
TMX / TBX
[Figure: annotated sample term entry. Callouts: global information in entry head; language ID (one per language section); administrative data of this language; information on term level; term in English; term in French]
Where could you use TBX?
• Exchange terminological data
  • From term base to source language checking tools
  • Between term bases of translation memory tools
  • From term base to terminology checking tools
  • From term base to terminology extraction tools (as stop word lists)
  • From term base to dictionary of a machine translation tool
• For indexing keywords in document management systems, content management systems, knowledge management systems
• Publishing terminological data on the intranet / Internet
• Optimization of search engines / text mining by searching for synonyms automatically
XLIFF – XML Localization Interchange File Format
• The XLIFF format aims to
  • separate localizable text from formatting (deal with one file format in translation instead of different processes to extract, filter, and convert text from different file formats)
  • enable multiple tools to work on the source text
  • store information that is helpful in supporting a localization process (like metadata on versions of source and target segments)
• An XLIFF file is bilingual and can be the container for a number of individual files
• Each file element in an XLIFF file contains a header (project data such as contact information, project phases, pointers to reference material, and information on the skeleton file) and a body section with the actual text segments
XLIFF
• XLIFF can carry several translation matches
• Additional fields can contain context, author, creation tool, history…
<trans-unit id="n1">
  <source>This is a sentence</source>
  <target xml:lang="de">Dies ist ein Satz</target>
  <alt-trans match-quality="100" tool="TM_System">
    <source>This is a sentence</source>
    <target xml:lang="de">Das ist ein Satz</target>
  </alt-trans>
  <alt-trans match-quality="70" tool="TM_System">
    <source>This is a short sentence</source>
    <target xml:lang="de">Dies ist ein kurzer Satz</target>
  </alt-trans>
</trans-unit>
<trans-unit id="1" resname="IDCANCEL" restype="button" coord="8;8;50;14">
  <source xml:lang="en">Cancel</source>
</trans-unit>
• Information on the name of a button and its coordinates (for visual representation of the button in a localization tool)
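A receiving tool can read the alternative matches with any XML parser. The Python sketch below uses the standard library and a fragment modelled on the first trans-unit above; the function name alternatives is an assumption for illustration. Note that complete XLIFF 1.2 files declare the namespace urn:oasis:names:tc:xliff:document:1.2, which this bare fragment omits, so the element lookups would need that namespace when reading a real file.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<trans-unit id="n1">
  <source>This is a sentence</source>
  <target xml:lang="de">Dies ist ein Satz</target>
  <alt-trans match-quality="100" tool="TM_System">
    <source>This is a sentence</source>
    <target xml:lang="de">Das ist ein Satz</target>
  </alt-trans>
  <alt-trans match-quality="70" tool="TM_System">
    <source>This is a short sentence</source>
    <target xml:lang="de">Dies ist ein kurzer Satz</target>
  </alt-trans>
</trans-unit>
"""

def alternatives(unit_xml):
    """Return (quality, source, target) tuples for every alt-trans, best match first."""
    unit = ET.fromstring(unit_xml.strip())
    found = []
    for alt in unit.findall("alt-trans"):
        # match-quality is a free-text attribute; tolerate values such as "70%".
        quality = float(alt.get("match-quality", "0").rstrip("%"))
        found.append((quality, alt.findtext("source"), alt.findtext("target")))
    return sorted(found, key=lambda item: item[0], reverse=True)

for quality, source, target in alternatives(SAMPLE):
    print(quality, "|", source, "|", target)
```

Sorting by match quality is only one possible policy; a tool could just as well filter by the tool attribute or show all alternatives to the translator.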
Where is XLIFF useful?
• Where experience with XML exists
• Projects contain many different file formats
• All formats are converted to XLIFF for translation
• Different tools need to be used during localization
• Different translations (alt-trans) or languages needed as reference
Any idea why XLIFF should NOT be the cure for everything?
• Instead of developing parsers for different file formats (to read the file into a translation tool), developers now need to create parsers to convert those file formats to XLIFF
• Some file formats can already be dealt with (Office, HTML, XML…) – why should a new parser be created for those?
• XLIFF has its variants, like all XML-based files – how can you make sure that each tool can process any of the possible XLIFF flavours?
The magic quadrant again, just to remember the distinction
Axes: Open Standards vs. Proprietary ways; Open Source vs. Closed Source
Q1 Open-Closed: Good
Q2 Open-Open: Good
Q3 Proprietary-Closed: Bad
Q4 Proprietary-Open: Wild
6 Talking Legalese in OSS
OSS (Open Source Software)
• Copyleft and permissive licensing
  • GPL (General Public License by FSF)
  • BSD (Berkeley Software Distribution)
  • Apache License 2.0
• MS open source licensing
  • Ms-PL (Public License)
  • Ms-RL (Reciprocal License)
Derived/derivative works vs. dynamic linkage
• Paradigmatic case is an app running on an OS
Contributions
Distribution
6 Talking Legalese in standards
RF (Royalty Free) vs. RAND (Reasonable and Non-Discriminatory)
Standards can be licensed under different terms
• RF
• RAND – to date there is no general consensus on whether RAND is open or proprietary
• Proprietary
By open standard we mean just RF standards
The opposite of an open standard is any proprietary way of doing things, be it standardized or not
7 Open Tools
• Infrastructure baseline
  • PostgreSQL, MySQL, Open ACS, Petri Nets, Alfresco, JBoss
• TM Server
  • TinyTM
• Workflow capabilities through any CALPMS
  • Open: ]po[, .LRN. Closed source: AIT Projetex, LTC Worx, Plunet Business Manager, etc.
• CAT Clients
  • OmegaT, MetaTexis, TinyTM Word macro, Multilizer, and whoever wants to implement the protocol
• Filters
  • OKAPI framework, mainstream XLIFF generators such as Sun or Adobe
8 Use Cases - odt + OmegaT
• Level 2 TMX support
• Has TinyTM connector in alpha
• Has plain-text glossary support
OpenOffice.org (.odt)
8 Use Cases - TinyTM
• The TinyTM protocol is licensed under the "LGPL V2.1 or higher"
• The rest of the TinyTM code is licensed under the "GPL V2.0 or higher"
• Translation clients can be licensed under any license
• TinyTM documentation is licensed under the Creative Commons Attribution License
8 Use Cases - TinyTM - Status
• Alpha functionality
• Java TMX importer
• Markup parsing logic
• Protocol freeze in public discussion on SourceForge
• Protocol features
  • Conservative extension with many important enhancements
8 Use Cases - TinyTM – new protocol features
• Industry and domain taxonomy
  • open to accommodate future TDA taxonomy research
• Inline markup handling
  1. stripped plain version
  2. normalized version with formatting placeholders
  3. TMX version with full TMX markup
• Advanced leverage functions
  • hash-based context storage
  • hash- and trigram-enhanced fuzzy search
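The idea behind a hash lookup combined with trigram-based fuzzy search can be sketched in a few lines of Python. This is not TinyTM's actual implementation or protocol: the class SketchMemory, the choice of SHA-1, the Jaccard score, and the 0.7 threshold are all assumptions made for this illustration.

```python
import hashlib

def trigrams(text):
    """Set of character trigrams of the lower-cased, space-padded text."""
    padded = "  " + text.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def similarity(a, b):
    """Jaccard overlap of the two trigram sets, between 0.0 and 1.0."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

def segment_hash(text):
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

class SketchMemory:
    """Toy translation memory: exact lookup by hash, fuzzy lookup by trigrams."""

    def __init__(self):
        self.by_hash = {}

    def add(self, source, target):
        self.by_hash[segment_hash(source)] = (source, target)

    def lookup(self, query, threshold=0.7):
        exact = self.by_hash.get(segment_hash(query))
        if exact:
            return [(1.0, *exact)]              # hash hit: no scoring needed
        scored = [(similarity(query, s), s, t) for s, t in self.by_hash.values()]
        return sorted((m for m in scored if m[0] >= threshold), reverse=True)

memory = SketchMemory()
memory.add("This is a sentence", "Dies ist ein Satz")
print(memory.lookup("This is a sentence"))
print(memory.lookup("This is a short sentence"))
```

The hash gives a cheap exact-match test, while the trigram score ranks near matches. A production server would index the trigrams instead of scanning every stored segment.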
9 Discussion
Thanks for your attention
10 References
Open Standards (selective)
• XLIFF 1.2: http://docs.oasis-open.org/xliff/v1.2/os/xliff-core.html
• TMX 1.4b: http://www.lisa.org/fileadmin/standards/tmx1.4/tmx.htm
• SRX 2.0: http://www.lisa.org/fileadmin/standards/srx20.html
Open Tools (selective)
• OKAPI Framework: http://okapi.sourceforge.net
• OmegaT: http://sourceforge.net/projects/omegat
• OmegaT+: http://omegatplus.sourceforge.net
• Open ACS: http://openacs.org
• PostgreSQL: http://www.postgresql.org
• TinyTM: http://tinytm.sourceforge.net
Tools continued
• XML to XLIFF to XML: https://sourceforge.net/projects/xliffroundtrip
• TMX Compliance Kit (test for TMX certification): http://www.lisa.org/Translation-Memory-e.34.0.html
• TBX Checker: http://www.lisa.org/TBX-Resources.650.0.html
Standards continued
• Unicode Standard Annex #29: http://www.unicode.org/reports/tr29
• W3C ITS: http://www.w3.org/TR/its
3 Architecture
4 Strategy
[Diagram labels: New business needs; Change enabler; Win / Win / Win (Translator, LSP of any size, Enterprise); TinyTM, OmegaT, Open ACS, OKAPI framework, etc.]
• Exponential growth of content
• Changing balance between published and user-generated content
• Need for continuous translation
• Community translation
• Shared language data
• Massive online collaboration
• Translation automation
What is an open standard?
World Wide Web Consortium's definition
• Transparency (design/due process is public, and all technical discussions, meeting minutes are archived and referencable in decision making)
• Relevance (new standardization is started upon due analysis of the market needs, including requirements phase, e.g. accessibility, multi-linguism)
• Openness (anyone can participate: industry, individual, public, government bodies, academia, on a worldwide scale)
• Impartiality and consensus (neutral org leading it, with equal weight for each participant)
• Availability (free access to the standard text, both during development and at final stage, translations, and clear IPR rules for implementation, allowing open source development in the case of Web technologies)
• Support (multiple implementations, ongoing process for testing, errata, revision, permanent access)
Source: Wikipedia, 2009
Goal of open standards
• Interoperability of tools
  • Vendors can concentrate on innovation in fields other than their proprietary formats
• Standardization of processes (translation of just one file format like XLIFF instead of DOC, HTML, InDesign, FM…)
Success of open standards
• Depends on the commercial usability
  • TMX – widespread
  • XLIFF – coming on strong
  • SRX – not widely used
  • TBX – slow
  • others – in the making (TBX Basic, GMX…)
5 Open Standards
• Why Open Standards in Open Source?
• Implementing open standards seems an obvious success scenario for OSS development
• XLIFF and TMX are open standards co-developed by our clients
• A minimalist open standards implementation ensures the desired functionality and is also legally safe
• LISA OSCAR: TMX 1.4b, 1.5, 2.0
• OASIS: XLIFF 1.1, 1.2, 1.2.1, 2.0
Open Standards: OAXAL
[Figure: OAXAL diagram. © Andrzej Zydroń, OASIS OAXAL TC]
TMX – Translation Memory Exchange
• From the TMX specification:
  • "…The purpose of the TMX format is to provide a standard method to describe translation memory data that is being exchanged among tools and/or translation vendors, while introducing little or no loss of critical data during the process…"
What is TMX?
• It is an XML representation of translation memory data
  • Header
  • Body
<header
  creationtool="Déjà Vu"
  creationtoolversion="4"
  datatype="PlainText"
  segtype="sentence"
  adminlang="en-us"
  srclang="en-us"
  o-tmf="DVMDB"
>
creationtool: Déjà Vu, Transit, Trados, MemoQ
creationtoolversion: version / build number of the tool
datatype: HTML, SGML, RTF, Interleaf, Java…
segtype: basic segmentation
adminlang: default language for elements like <note>
srclang: source text language
o-tmf: original translation memory format (DVMDB – Déjà Vu database…)
What is TMX?
• Body
<body>
  <tu creationdate="20030915T153704Z" creationid="USER">
    <tuv lang="EN-US">
      <seg>This is the first sentence</seg>
    </tuv>
    <tuv lang="DE-DE">
      <seg>Dies ist der erste Satz</seg>
    </tuv>
  </tu>
</body>
tu = translation unit; tuv lang = translation unit variant (language); seg = segment
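Reading this structure takes only a few lines with a standard XML parser. The Python sketch below is a minimal illustration, not a complete TMX reader: the function names read_units and pairs are assumptions, the header values in the sample are placeholders, and inline Level 2 codes are not handled separately. It accepts both the plain lang attribute shown above and the xml:lang attribute used by TMX 1.4.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

SAMPLE = """<tmx version="1.4">
  <header creationtool="Example" creationtoolversion="1" datatype="PlainText"
          segtype="sentence" adminlang="en-us" srclang="en-us" o-tmf="Example"/>
  <body>
    <tu creationdate="20030915T153704Z" creationid="USER">
      <tuv xml:lang="EN-US"><seg>This is the first sentence</seg></tuv>
      <tuv xml:lang="DE-DE"><seg>Dies ist der erste Satz</seg></tuv>
    </tu>
  </body>
</tmx>"""

def read_units(tmx_xml):
    """Yield one {language code: segment text} dictionary per <tu>."""
    root = ET.fromstring(tmx_xml)
    for tu in root.iter("tu"):
        unit = {}
        for tuv in tu.findall("tuv"):
            lang = tuv.get(XML_LANG) or tuv.get("lang")  # newer or older attribute
            seg = tuv.find("seg")
            if lang and seg is not None:
                unit[lang.lower()] = "".join(seg.itertext())
        yield unit

def pairs(tmx_xml, source, target):
    """Keep only units that contain both requested languages."""
    return [(u[source], u[target]) for u in read_units(tmx_xml)
            if source in u and target in u]

print(pairs(SAMPLE, "en-us", "de-de"))
```

Filtering by language pair in this way is also what lets a bilingual project pick just the relevant languages out of a multilingual TMX file.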
What is TMX?
• Depending on the tool that created the TMX file, it can be bilingual or multilingual
• Importing a multilingual TMX file into a bilingual project will only import the relevant languages
Levels of TMX
• Level 1
  • Plain text only (sufficient for data coming from software localization tools)
• Level 2
  • Text plus formatting (data coming from translation memory tools used for translation of documentation)
To move formatting and text from one tool to the other, both tools need to be Level 2 compliant
Level 1
• Formatting that is applied to the source and target text of a translation unit is not exported to the TMX file, only pure text
• Original
  • This sentence has some formatting
• In TMX
  • This sentence has some formatting
Level 2
• Formatting that is applied to the source and target text of a translation unit is exported to the TMX file
• Different tools use different ways of encoding that information (placeholders or actual formatting information)
Level 2
<seg>This is the <bpt i="1" type="bold"></bpt>first<ept i="1"></ept> sentence this is <bpt i="2" type="ulined"></bpt>another<ept i="2"></ept> sentence</seg>
MemoQ – Word DOC with formatting
<seg>This is the <bpt i="1" type="Bold" />first<ept i="1" /> sentence this is <bpt i="2" type="Underline" />another<ept i="2" /> sentence</seg>
Trados 2009 – Word DOC with formatting
<seg>This is the <bpt i="1">&lt;cf bold=&quot;on&quot;&gt;</bpt>first<ept i="1">&lt;/cf&gt;</ept> sentence this is <bpt i="2">&lt;cf underlinestyle=&quot;single&quot;&gt;</bpt>another<ept i="2">&lt;/cf&gt;</ept> sentence</seg>
Trados 2007 (8.2, 8.3) – Word DOC with formatting
Level 2
MemoQ – HTML file with link
Trados 2009 – HTML file with link
Trados 2007 (8.2, 8.3) – HTML file with link
<seg>Text with a link to <bpt i="1">&lt;a href=&quot;http://www.samplehtml.com/page1.htm&quot;&gt;</bpt>another page<ept i="1">&lt;/a&gt;</ept></seg>
<seg>Text with a link to <bpt i="1" type="19" x="1" />another page<ept i="1" /></seg>
<seg>Text with a link to <bpt i="1" type="link">&lt;a href = &quot;http://www.samplehtml.com/page1.htm&quot;&gt;</bpt>another page<ept i="1">&lt;/a&gt;</ept></seg>
OmegaT internal format: <seg>Text with a link to &lt;a0&gt;another page&lt;/a0&gt;</seg>
TMX Level 2 format: <seg>Text with a link to <bpt i="0" x="0">&lt;a0&gt;</bpt>another page<ept i="0">&lt;/a0&gt;</ept></seg>
OmegaT – HTML file with link
Level 2
MemoQ ndash InDesign
Trados 2009 ndash InDesign
Trados 2007 82 83 ndash InDesign
ltseggtInDesign text with ltbpt i=1gtampltcf ptfs=ampquotc_Boldampquotampgtltbptgtformatting in boldltept i=1gtampltcfampgtlteptgtltseggt
ltseggtInDesign text with ltbpt i=1 type=pt16 x=1 gtformatting in boldltept i=1 gtltseggt
ltseggtInDesign Text with ltbpt i=1 type=boldgtltbptgtformtatting in boldltept i=1gtlteptgtltseggt
Implications of different tags for formatting
bull Tools that use placeholder tags do not include the actual formatting information in the TMX file
bull Other tools might only be able to re-use the text especially if the formatting is only applied in the target segment but not in the source
bull The result of the exchange would then be the same as with TMX level 1 (text only)
bull TMX files which carry the actual formatting information will yield better matches in other tools that can read this information
Where do you use TMX
bull Transfering data between different translation memory tools
bull Checking tools QA tools
bull TM maintenance tools
bull Basis for bilingual term extraxtion
Reusing TMX data
bull Although Translation Memory Tools have the same basic idea (storing source-target language pairs and recycling translations) this has been realized in different ways
bull Exchange with TMX works but there is an issue that can lower the match rates nonethelesshellip the segmentation rules
SRX ndash Segmentation Rules Exchange
bull From the SRX specification
bull hellipThe purpose of the SRX format is to provide a standard method to describe segmentation rules that are being exchanged among tools andor translation vendors
bull hellipis intended to enhance the TMX standardhellip
Why SRX
bull Tool Abull Semicolon is end of segment
bull This is a sentence this is another sentence
bull TM system sees two separate segments
bull Tool Bbull Semicolon is NOT end of segment
bull This is a sentence this is another sentence
bull TM system sees one segmentbull No match from the TMX data
bull Match rate around 50 usual setting around 70
Segmentation rules
bull Rules that the tool applies to the text to translate to split it up into segments
bull paragraph
bull sentence
bull phrase
bull incomplete sentences in bulleted lists
bull single words (headings ldquoNoterdquo ldquoAttentionrdquo)
Segmentation rules
bull End of segment rules (common to the default settings of all tools)bull Dot at the end of a sentence (not after known
abbreviations)bull Question mark exclamation markbull Paragraph markbull Colon
bull End of segment rules (different for different tools)bull Semicolonbull Tab characterbull Sub segments (index entries footnotes
graphics)
Comparison of default rules
Workbench Transit DV SDLX Across
Colon end end end no end no end
Semi-
colon
no end end end no end no end
Tab end no end no end no end no end
Soft
return
no end no end end in
Word no
end in
PPT
end in
Word no
end in
PPT
no end
What can SRX do and what not
bull It can only show the segmentation rule settings at the time of export
bull It cannot show any changes that have been applied in the segmentation rules during the use of the TM
bull Sometimes the rules from system 1 cannot be re-created in system 2 then the rule will be ignored
TBX ndash TermBase Exchange
bull From the working draft of the TBX specificationbull TBX is an open XML-based standard format for terminological data
bull TBX is designed to support the analysis representation dissemination and exchange of information from human-oriented terminological databases (termbases)
bull TBX is built on the basis of ISO 12620 (data categories) and ISO 12200 (MARTIF core structure)
TMX TBX
Zerfasszaacde 35
Term in English
Term in French
Global information in entry head
Information on term level
Administrative data of this language
Language ID
Language ID
Where could you use TBX
bull Exchange terminological databull From term base to source language checking toolsbull Between term bases of translation memory toolsbull From term base to terminology checking toolsbull From term base to terminology extraction tools (as stop word lists)bull From term base to dictionary of a machine translation tool
bull For indexing keywords in document management systems content management systems knowledge management systems
bull Publishing terminological data on the Intranet Internet
bull Optimization of search enginges text mining by searching for synonyms automatically
XLIFF ndash XML Localization Interchange File Format
bull The XLIFF format aims to bull separate localizable text from formatting (deal with one file
format in translation instead of different processes to extract filter convert text from different file formats)
bull enable multiple tools to work on the source textbull store information that is helpful in supporting a localization
process (like meta data on versions of source and target segemtns)
bull An XLIFF file is bilingual and can be the container for a number of individual files
bull Each file element in an XLIFF file contains a header (project data such as contact information project phases pointers to reference material and information on the skeleton file) and a body section with the actual text segments
XLIFF
bull XLIFF can carry several translation matches
bull Additional fields can contain context author creation tool historyhellip
lttrans-unit id=n1gtltsourcegtThis is a sentenceltsourcegtlttarget xmllang=degtDies ist ein
Satzlttargetgtltalt-trans match-
quality=100 tool=TM_SystemgtltsourcegtThis is a sentenceltsourcegtlttarget xmllang=degtDas ist ein
Satzlttargetgtltalt-transgtltalt-trans match-
quality=70 tool=TM_SystemgtltsourcegtThis is a short sentenceltsourcegtlttarget xmllang=degtDies ist ein kurzer
Satzlttargetgtltalt-transgtlttrans-unitgt
lttrans-unit id=1 resname=IDCANCEL restype=button coord=885014gt
ltsource xmllang=engtCancelltsourcegtlttrans-unitgt
bull Information on name of a button and its coordinates (for visual representation of the button in a localization tool)
Where is XLIFF useful
bull Where experience with XML exists
bull Projects contain many different file formats
bull All formats are converted to XLIFF for translation
bull Different tools need to be used during localization
bull Different translations (alt-trans) or languages needed as reference
Any idea why XLIFF should NOT be the cure for everything
bull Instead of developing parsers for different file
formats (to read in the file into a translation tool)
developers now need to create parsers to convert
those file formats to XLIFF
bull Some file formats already can be dealt with
(Office HTML XMLhellip) ndash why should a new parser
be created for those
bull XLIFF has its variants like all XML-based files ndashhow can you make sure that each tool can process any of the possible XLIFF flavours
Q1
Open-Closed
Good
Q2
Open-Open
Good
Q3
Proprietary-Closed
Bad
Q4
Proprietary-Open
Wild
The magic quadrant again justto remember the distinction
Open Standards
Open Source
Closed Source
Proprietary ways
6 Talking Legalesein OSS
OSS (Open Source Software)
bull Copyleft and Permissive Licensing
bull GPL (General Public License by FSF)
bull BSD (Berkeley Software Distribution)
bull Apache License 20
bull MS open source licensing
bull Ms-PL (Public License)
bull Ms-RL (Reciprocal License)
DerivedDerivative works vs dynamic linkage
bull Paradigmatic case is app running on OS
Contributions
Distribution
6 Talking Legalesein standards
RF (Royalty Free) vs RAND (Reasonable and Non-Discriminatory)
Standards can be licensed under different terms
bull RF
bull RAND ndash there is no general consensus up to date on whether RAND is Open or Proprietary
bull Proprietary
By open standard we mean just RF Standards
The opposite of Open Standard is any proprietary way of doing things be it standardized
7 Open Tools
bull Infrastructure baseline
bull PostgreSQL MySQL Open ACS Petri Nets Alfresco jBoss
bull TM Server
bull TinyTM
bull Workflow capabilities through any CALPMS
bull Open ]po[ LRN Closed source AIT Projetex LTC Worx Plunet Business Manager etc
bull CAT Clients
bull OmegaT Metatexis TinyTM Word macro Multilizer andwhoever wants to implement the protocol
bull Filters
bull OKAPI framework mainstream XLIFF generators such as Sun or Adobe
8 Use Cases - odt + OmegaT
bull Level 2 TMX supportbull Has TinyTM connector in alphabull Has plaintext glossary support
OpenOfficeorgodt
8 Use cases - TinyTM
bull TinyTM protocol is licensed under the rdquoLGPL V21or higherldquo
bull Rest of the TinyTM code is licensed under therdquoGPL V20 or higherldquo
bull Translation clients can be licensed underany license
bull TinyTM documentation is licensed under the Creative Commons Attribution License
8 Use Cases - TinyTM - Status
bull Alpha functionality
bull Java TMX importer
bull Mark up parsing logic
-------------------------bull Protocol freeze
in public discussionon Sourceforge
bull Protocol featuresbull Conservative extension with many important enhancements
8 Use CasesTinyTM ndash new protocol features
bull Industry and domain taxonomybull open to accommodate future TDA taxonomy
research
bull Inline markup handling1 stripped plain version2 normalized version with formatting
placeholders i3 TMX version with full TMX markup
bull Advanced leverage functionsbull hash based context storagebull hash and trigram enhanced fuzzy search
9 Discussion
Thanks for your attention
copy 2009 Moravia IT as and Angelika Zerfass
A Guide to Open Standards and Open Source
A Conceptual Case Study
Angelika ZerfasszerfasszaacdeDavid Filip PhD
davidfmoraviaworldwidecom
10 References
Open Standards (selective)
bull XLIFF 12 httpdocsoasis-openorgxliffv12osxliff-corehtml
bull TMX 14b httpwwwlisaorgfileadminstandardstmx14tmxhtm
bull SRX 20 httpwwwlisaorgfileadminstandardssrx20html
Open Tools (selective)
bull OKAPI Framework httpokapisourceforgenet
bull OmegaT httpsourceforgenetprojectsomegat
bull OmegaT+ httpomegatplussourceforgenet
bull Open ACS httpopenacsorg
bull PostgreSQL httpwwwpostgresqlorg
bull TinyTM httptinytmsourceforgenet
10 References
Tools continued
bull XML to XLIFF to XML httpssourceforgenetprojectsxliffroundtrip
bull TMX Complicance Kit (test for TMX certification)httpwwwlisaorgTranslation-Memory-e340html
bull TBX CheckerhttpwwwlisaorgTBX-Resources6500html
Standards continued
bull Unicode Standard Annex 29 httpwwwunicodeorgreportstr29
bull W3C ITS httpwwww3orgTRits
3 Architecture
4 Strategy
New business
Needs
Change Enabler
Win
Win
Win
Translator
LSP of any size
Enterprise
TinyTM OmegaT Open ACS OKAPI framework Etc
Exponential growth of content
Changing balance between published and user generated content
Need for Continuous Translation
Community Translation Shared language
data Massive online
collaboration Translation automation
What is an open standard
World Wide Web Consortiums definitionbull Transparency (designdue process is public and all technical
discussions meeting minutes are archived and referencablein decision making)
bull Relevance (new standardization is started upon due analysis of the market needs including requirements phase eg accessibility multi-linguism)
bull Openness (anyone can participate industry individual public government bodies academia on a worldwide scale)
bull Impartiality and consensus (neutral org leading it with equal weight for each participant)
bull Availability (free access to the standard text both during development and at final stage translations and clear IPR rules for implementation allowing open source development in the case of Web technologies)
bull Support (multiple implementations ongoing process for testing errata revision permanent access)
Wikipedia 2009
Goal of open standards
bull Interoperability of toolsbull Vendors can concentrate on innovation in other fields than their proprietary formats
bull Standardization of processes (translation of just one file format like XLIFF instead of DOC HTML InDesign FMhellip)
Success of open standards
bull Depends on the commercial usabilitybull TMX ndash widespread XLIFF ndash coming on
strong SRX ndash not widely used TBX ndash slow others ndash in the making (TBX Basic GMXhellip)
5 Open Standards
bull Why Open Standards in Open Source
bull Implementing open standards seems obvious success scenario for OSS development
bull XLIFF and TMX are open standards co-developed by our clients
bull Minimalist open standards implementation ensures desired functionality and is also legally safe
bull LISA OSCAR TMX 14b 15 20
bull OASIS XLIFF 11 12 121 20
Open Standards OAXAL
copy A
ndrz
ej Z
ydro
n O
ASIS
OAXAL T
C
TMXTranslation Memory Exchange
bull From the TMX specification
bull hellipThe purpose of the TMX format is to provide a standard method to describe translation memory data that is being exchanged among tools andor translation vendors while introducing little or no loss of critical data during the processhellip
What is TMX
bull It is an XML representation of translation memory data
bull Header
bull Body
ltheadercreationtool=ldquoDeacutejagrave Vu creationtoolversion=ldquo4datatype=PlainTextrdquosegtype=sentenceadminlang=en-ussrclang=en-uso-tmf=DVMDB
gt
Deacutejagrave Vu Transit Trados MemoQ
Version build number of the tool
HTML SGML RTF Interleaf Javahellip
Basic segmentation
Default language for elements like ltnotegt
Source text language
Original translation memory format (DVMDB ndash Deacutejagrave Vu databasehellip)
What is TMX
bull Body
ltbodygtlttu creationdate=20030915T153704Z creationid=USERgt
lttuv lang=EN-USgtltseggtThis is the first sentenceltseggt
lttuvgtlttuv lang=DE-DEgt
ltseggtDies ist der erste Satzltseggtlttuvgt
lttugtltbodygt
tu = Translation Unittuv lang = translation unit variant (language) seg = segment
What is TMX
bull Depending on the tool that created the TMX file it can be bilingual or multilingual
bull Importing multilingual TMX file into a bilingual project will only import the relevant languages
Levels of TMX
bull Level 1bull Plain text only (sufficient for data coming from software localization tools)
bull Level 2bull Text plus formatting (data coming from translation memory tools used for translation of documentation)
To move formatting and text from one tool to the other both tools need to be level2 compliant
Level 1
bull Formatting that is applied to the source and target text of a translation unit is not exported to the TMX file only pure text
bull Original
bull This sentence has some formatting
bull In TMX
bull This sentence has some formatting
Level 2
bull Formatting that is applied to the source and target text of a translation unit is exported to the TMX file
bull Different tools use different ways of encoding that information (placeholders or actual formatting information)
Level 2
seggtThis is the ltbpt i=1 type=boldgtltbptgtfirstltept i=1gtlteptgt sentence this is ltbpt i=2 type=ulinedgtltbptgtanotherltept i=2gtlteptgt sentenceltseggt
MemoQ ndash Word DOC with formatting
ltseggtThis is the ltbpt i=1 type=Bold gtfirstltept i=1 gt sentence this isltbpt i=2 type=Underline gtanotherltept i=2 gt sentenceltseggt
Trados 2009 ndash Word DOC with formatting
ltseggtThis is the ltbpt i=1gtampltcf bold=ampquotonampquotampgtltbptgtfirstltept i=1gtampltcfampgtlteptgt sentence this is ltbpt i=2gtampltcf underlinestyle=ampquotsingleampquotampgtltbptgtanotherltept i=2gtampltcfampgtlteptgtsentenceltseggt
Trados 2007 82 83 ndash Word DOC with formatting
Level 2
MemoQ ndash HTML file with link
Trados 2009 ndash HTML file with link
Trados 2007 82 83 ndash HTML file with link
ltseggtText with a link to ltbpt i=1gtamplta href=ampquothttpwwwsamplehtmlcompage1htmampquotampgtltbptgtanother pageltept i=1gtampltaampgtlteptgtltseggt
ltseggtText with a link to ltbpt i=1 type=19 x=1 gtanother pageltept i=1 gtltseggt
ltseggtText with a link to ltbpt i=1 type=linkgtamplta href = ampquothttpwwwsamplehtmlcompage1htmampquotampgtltbptgtanother pageltept i=1gtampltaampgtlteptgtltseggt
OmegaT internal format ltseggtText with a link to amplta0ampgtanother pageamplta0ampgtltseggt
TMX Level 2 format ltseggtText with a link to ltbpt i=0 x=0gtamplta0ampgtltbptgtanother pageltept i=0gtamplta0ampgtlteptgtltseggt
OmegaT - HTML file with link
Level 2
MemoQ ndash InDesign
Trados 2009 ndash InDesign
Trados 2007 82 83 ndash InDesign
ltseggtInDesign text with ltbpt i=1gtampltcf ptfs=ampquotc_Boldampquotampgtltbptgtformatting in boldltept i=1gtampltcfampgtlteptgtltseggt
ltseggtInDesign text with ltbpt i=1 type=pt16 x=1 gtformatting in boldltept i=1 gtltseggt
ltseggtInDesign Text with ltbpt i=1 type=boldgtltbptgtformtatting in boldltept i=1gtlteptgtltseggt
Implications of different tags for formatting
bull Tools that use placeholder tags do not include the actual formatting information in the TMX file
bull Other tools might only be able to re-use the text especially if the formatting is only applied in the target segment but not in the source
bull The result of the exchange would then be the same as with TMX level 1 (text only)
bull TMX files which carry the actual formatting information will yield better matches in other tools that can read this information
Where do you use TMX?
• Transferring data between different translation memory tools
• Checking tools / QA tools
• TM maintenance tools
• Basis for bilingual term extraction
Reusing TMX data
• Although translation memory tools share the same basic idea (storing source-target language pairs and recycling translations), they have implemented it in different ways
• Exchange with TMX works, but one issue can lower the match rates nonetheless: the segmentation rules
SRX - Segmentation Rules Exchange
• From the SRX specification:
  • "…The purpose of the SRX format is to provide a standard method to describe segmentation rules that are being exchanged among tools and/or translation vendors"
  • "…is intended to enhance the TMX standard…"
Why SRX?
• Tool A
  • Semicolon is end of segment
  • "This is a sentence; this is another sentence."
  • TM system sees two separate segments
• Tool B
  • Semicolon is NOT end of segment
  • "This is a sentence; this is another sentence."
  • TM system sees one segment
  • No match from the TMX data
  • Match rate is around 50%, while the usual minimum match setting is around 70%
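The Tool A / Tool B mismatch can be reproduced with a small Python sketch. The segment function and its single switch are hypothetical simplifications, and difflib's ratio is only a stand-in for a TM tool's fuzzy-match score, so the printed number will not equal what any real tool reports.

```python
import re
from difflib import SequenceMatcher


def segment(text, semicolon_ends_segment):
    """Split text after end-of-segment characters (simplified, hypothetical rules)."""
    enders = ".!?;" if semicolon_ends_segment else ".!?"
    # Split at whitespace that directly follows an end-of-segment character.
    pieces = re.split(rf"(?<=[{re.escape(enders)}])\s+", text.strip())
    return [p for p in pieces if p]


def best_match(query, memory):
    """Highest similarity between the query and any stored segment."""
    return max(SequenceMatcher(None, query, stored).ratio() for stored in memory)


text = "This is a sentence; this is another sentence."
tm_segments = segment(text, semicolon_ends_segment=True)    # Tool A filled the TM
new_segments = segment(text, semicolon_ends_segment=False)  # Tool B looks it up

print(tm_segments)   # two segments
print(new_segments)  # one segment
print(round(best_match(new_segments[0], tm_segments), 2))
```

The same text yields two stored segments but one lookup segment, so the lookup can only ever produce a partial match.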
Segmentation rules
• Rules that the tool applies to the text to be translated in order to split it into segments
  • paragraph
  • sentence
  • phrase
  • incomplete sentences in bulleted lists
  • single words (headings, "Note", "Attention")
Segmentation rules
• End-of-segment rules (common to the default settings of all tools)
  • Dot at the end of a sentence (not after known abbreviations)
  • Question mark, exclamation mark
  • Paragraph mark
  • Colon
• End-of-segment rules (different for different tools)
  • Semicolon
  • Tab character
  • Sub-segments (index entries, footnotes, graphics)
Comparison of default rules
              Workbench   Transit   DV                           SDLX                         Across
Colon         end         end       end                          no end                       no end
Semicolon     no end      end       end                          no end                       no end
Tab           end         no end    no end                       no end                       no end
Soft return   no end      no end    end in Word, no end in PPT   end in Word, no end in PPT   no end
What can SRX do, and what not?
• It can only show the segmentation rule settings at the time of export
• It cannot show any changes that were applied to the segmentation rules while the TM was in use
• Sometimes the rules from system 1 cannot be re-created in system 2; the rule will then be ignored
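The idea behind exchanging segmentation rules can be sketched in Python. The sketch models each rule as a break / no-break flag plus a "before break" and an "after break" regular expression, which is how SRX describes rules; the concrete rules, the RULES list and the segment function are illustrative assumptions, not an SRX parser and not the default rules of any tool.

```python
import re

# (is_break, before-break regex, after-break regex), in priority order.
# Hypothetical rules for illustration only.
RULES = [
    (False, r"\b(?:Mr|Dr|etc)\.", r"\s"),  # exception: no break after known abbreviations
    (True, r"[.?!]", r"\s"),               # break after sentence-final punctuation
]


def segment(text, rules=RULES):
    """Split text at every position where the first matching rule is a break rule."""
    segments, start = [], 0
    for pos in range(1, len(text)):
        for is_break, before, after in rules:
            ends_with_before = re.search(rf"(?:{before})\Z", text[:pos])
            if ends_with_before and re.match(after, text[pos:]):
                if is_break:
                    segments.append(text[start:pos].strip())
                    start = pos
                break  # the first matching rule decides; later rules are ignored
    tail = text[start:].strip()
    if tail:
        segments.append(tail)
    return segments


print(segment("Mr. Smith arrived. He sat down."))

# A second tool that also treats the semicolon as end of segment:
with_semicolon = RULES + [(True, r";", r"\s")]
print(segment("This is a sentence; this is another sentence.", with_semicolon))
```

If both tools load the same rule list, they cut the text in the same places; a rule that the receiving tool cannot express is the case where the exchange breaks down.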
TBX - TermBase Exchange
• From the working draft of the TBX specification:
  • TBX is an open, XML-based standard format for terminological data
  • TBX is designed to support the analysis, representation, dissemination and exchange of information from human-oriented terminological databases (termbases)
  • TBX is built on the basis of ISO 12620 (data categories) and ISO 12200 (MARTIF core structure)
TMX TBX
[Figure: sample entry with callouts: global information in entry head; language ID (one per language); administrative data of this language; information on term level; term in English; term in French]
Where could you use TBX?
• Exchange of terminological data
  • From term base to source language checking tools
  • Between term bases of translation memory tools
  • From term base to terminology checking tools
  • From term base to terminology extraction tools (as stop word lists)
  • From term base to dictionary of a machine translation tool
• For indexing keywords in document management systems, content management systems, knowledge management systems
• Publishing terminological data on the intranet / Internet
• Optimization of search engines / text mining by searching for synonyms automatically
XLIFF - XML Localization Interchange File Format
• The XLIFF format aims to
  • separate localizable text from formatting (deal with one file format in translation instead of different processes to extract, filter and convert text from different file formats)
  • enable multiple tools to work on the source text
  • store information that is helpful in supporting a localization process (like metadata on versions of source and target segments)
• An XLIFF file is bilingual and can be the container for a number of individual files
• Each <file> element in an XLIFF file contains a header (project data such as contact information, project phases, pointers to reference material and information on the skeleton file) and a body section with the actual text segments
XLIFF
• XLIFF can carry several translation matches
• Additional fields can contain context, author, creation tool, history…
<trans-unit id="n1">
  <source>This is a sentence</source>
  <target xml:lang="de">Dies ist ein Satz</target>
  <alt-trans match-quality="100" tool="TM_System">
    <source>This is a sentence</source>
    <target xml:lang="de">Das ist ein Satz</target>
  </alt-trans>
  <alt-trans match-quality="70" tool="TM_System">
    <source>This is a short sentence</source>
    <target xml:lang="de">Dies ist ein kurzer Satz</target>
  </alt-trans>
</trans-unit>
<trans-unit id="1" resname="IDCANCEL" restype="button" coord=885014>
  <source xml:lang="en">Cancel</source>
</trans-unit>
• Information on the name of a button and its coordinates (for visual representation of the button in a localization tool)
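How a tool might read the several matches carried in one trans-unit can be sketched in Python. The fragment repeats the alt-trans example from the slide without an XLIFF namespace declaration (a real XLIFF 1.2 file declares one, which would change the element lookups); the function name ranked_matches and the minimum parameter are assumptions made for this sketch.

```python
import xml.etree.ElementTree as ET

# The slide's example, kept namespace-free for brevity (an assumption of this sketch).
FRAGMENT = """
<trans-unit id="n1">
  <source>This is a sentence</source>
  <target xml:lang="de">Dies ist ein Satz</target>
  <alt-trans match-quality="100" tool="TM_System">
    <source>This is a sentence</source>
    <target xml:lang="de">Das ist ein Satz</target>
  </alt-trans>
  <alt-trans match-quality="70" tool="TM_System">
    <source>This is a short sentence</source>
    <target xml:lang="de">Dies ist ein kurzer Satz</target>
  </alt-trans>
</trans-unit>
"""


def ranked_matches(trans_unit_xml, minimum=0.0):
    """Return (quality, source, target) for each alt-trans, best match first."""
    unit = ET.fromstring(trans_unit_xml)
    matches = []
    for alt in unit.findall("alt-trans"):
        # match-quality is free text in practice; tolerate a trailing percent sign.
        quality = float(alt.get("match-quality", "0").rstrip("%"))
        if quality >= minimum:
            matches.append((quality, alt.findtext("source"), alt.findtext("target")))
    return sorted(matches, key=lambda match: match[0], reverse=True)


for quality, source, target in ranked_matches(FRAGMENT):
    print(quality, source, "->", target)
```

Passing minimum=75, for example, would keep only the 100 match and drop the 70 match, mirroring a match threshold in a TM tool.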
Where is XLIFF useful?
• Where experience with XML exists
• Projects contain many different file formats
  • All formats are converted to XLIFF for translation
• Different tools need to be used during localization
• Different translations (alt-trans) or languages are needed as reference
Any idea why XLIFF should NOT be the cure for everything?
• Instead of developing parsers for different file formats (to read the file into a translation tool), developers now need to create parsers to convert those file formats to XLIFF
• Some file formats can already be dealt with (Office, HTML, XML…) - why should a new parser be created for those?
• XLIFF has its variants, like all XML-based files - how can you make sure that each tool can process any of the possible XLIFF flavours?
The magic quadrant again, just to remember the distinction
Axes: Open Standards vs Proprietary ways; Open Source vs Closed Source
• Q1: Open-Closed - Good
• Q2: Open-Open - Good
• Q3: Proprietary-Closed - Bad
• Q4: Proprietary-Open - Wild
6 Talking Legalese in OSS
OSS (Open Source Software)
• Copyleft and permissive licensing
  • GPL (General Public License by FSF)
  • BSD (Berkeley Software Distribution)
  • Apache License 2.0
• MS open source licensing
  • Ms-PL (Public License)
  • Ms-RL (Reciprocal License)
Derived/derivative works vs dynamic linkage
• Paradigmatic case is an app running on an OS
Contributions
Distribution
6 Talking Legalese in standards
RF (Royalty Free) vs RAND (Reasonable and Non-Discriminatory)
Standards can be licensed under different terms
• RF
• RAND - there is no general consensus to date on whether RAND is open or proprietary
• Proprietary
By open standard we mean just RF standards
The opposite of an open standard is any proprietary way of doing things, whether it is standardized or not
7 Open Tools
• Infrastructure baseline
  • PostgreSQL, MySQL, Open ACS, Petri Nets, Alfresco, jBoss
• TM Server
  • TinyTM
• Workflow capabilities through any CALPMS
  • Open: ]po[, .LRN; closed source: AIT Projetex, LTC Worx, Plunet Business Manager, etc.
• CAT Clients
  • OmegaT, Metatexis, TinyTM Word macro, Multilizer and whoever wants to implement the protocol
• Filters
  • OKAPI framework, mainstream XLIFF generators such as Sun or Adobe
8 Use Cases - .odt + OmegaT
• Level 2 TMX support
• Has TinyTM connector in alpha
• Has plaintext glossary support
OpenOffice.org .odt
8 Use Cases - TinyTM
• TinyTM protocol is licensed under the "LGPL V2.1 or higher"
• Rest of the TinyTM code is licensed under the "GPL V2.0 or higher"
• Translation clients can be licensed under any license
• TinyTM documentation is licensed under the Creative Commons Attribution License
8 Use Cases - TinyTM - Status
• Alpha functionality
• Java TMX importer
• Markup parsing logic
-------------------------
• Protocol freeze in public discussion on SourceForge
• Protocol features
  • Conservative extension with many important enhancements
8 Use Cases - TinyTM - new protocol features
• Industry and domain taxonomy
  • open to accommodate future TDA taxonomy research
• Inline markup handling
  1. stripped plain version
  2. normalized version with formatting placeholders
  3. TMX version with full TMX markup
• Advanced leverage functions
  • hash-based context storage
  • hash- and trigram-enhanced fuzzy search
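What "hash and trigram enhanced fuzzy search" can mean in general is sketched below in Python. This is not TinyTM's implementation or protocol: the SHA-1 key, the Jaccard score over character trigrams, the in-memory dictionary and all function names are assumptions chosen to show the two ideas side by side (a hash for exact hits, trigram overlap for fuzzy hits).

```python
import hashlib


def trigrams(text):
    """Set of character trigrams of the lower-cased, padded text."""
    padded = f"  {text.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def similarity(a, b):
    """Jaccard overlap of the two trigram sets, between 0.0 and 1.0."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def segment_hash(text):
    """Stable key for exact lookups (hypothetical choice of SHA-1)."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


# Tiny in-memory translation memory keyed by the hash of the source segment.
memory = {
    segment_hash(source): (source, target)
    for source, target in [
        ("This is a sentence", "Dies ist ein Satz"),
        ("This is a short sentence", "Dies ist ein kurzer Satz"),
    ]
}


def lookup(query, threshold=0.5):
    """Exact hit via the hash first, otherwise trigram-scored fuzzy candidates."""
    exact = memory.get(segment_hash(query))
    if exact:
        return [(1.0, *exact)]
    scored = [(similarity(query, s), s, t) for s, t in memory.values()]
    return sorted((m for m in scored if m[0] >= threshold), reverse=True)


print(lookup("This is a sentence"))
print(lookup("This is a long sentence", threshold=0.0))
```

The hash lookup costs one dictionary access regardless of TM size, while the trigram comparison is the part a server would typically index; that split is the design point of the sketch.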
9 Discussion
Thanks for your attention
10 References
Open Standards (selective)
• XLIFF 1.2: http://docs.oasis-open.org/xliff/v1.2/os/xliff-core.html
• TMX 1.4b: http://www.lisa.org/fileadmin/standards/tmx1.4/tmx.htm
• SRX 2.0: http://www.lisa.org/fileadmin/standards/srx20.html
Open Tools (selective)
• OKAPI Framework: http://okapi.sourceforge.net
• OmegaT: http://sourceforge.net/projects/omegat
• OmegaT+: http://omegatplus.sourceforge.net
• Open ACS: http://openacs.org
• PostgreSQL: http://www.postgresql.org
• TinyTM: http://tinytm.sourceforge.net
10 References
Tools continued
• XML to XLIFF to XML: https://sourceforge.net/projects/xliffroundtrip
• TMX Compliance Kit (test for TMX certification): http://www.lisa.org/Translation-Memory-e.34.0.html
• TBX Checker: http://www.lisa.org/TBX-Resources.650.0.html
Standards continued
• Unicode Standard Annex #29: http://www.unicode.org/reports/tr29/
• W3C ITS: http://www.w3.org/TR/its/
4 Strategy
• New business needs
  • Exponential growth of content
  • Changing balance between published and user-generated content
  • Need for continuous translation
  • Community translation, shared language data, massive online collaboration, translation automation
• Change enabler: TinyTM, OmegaT, Open ACS, OKAPI framework, etc.
• Win-win-win: translator, LSP of any size, enterprise
What is an open standard?
World Wide Web Consortium's definition
• Transparency (design/due process is public, and all technical discussions and meeting minutes are archived and referencable in decision making)
• Relevance (new standardization is started upon due analysis of the market needs, including a requirements phase, e.g. accessibility, multi-linguism)
• Openness (anyone can participate: industry, individual, public, government bodies, academia, on a worldwide scale)
• Impartiality and consensus (neutral org leading it, with equal weight for each participant)
• Availability (free access to the standard text, both during development and at final stage, translations, and clear IPR rules for implementation, allowing open source development in the case of Web technologies)
• Support (multiple implementations, ongoing process for testing, errata, revision, permanent access)
(Wikipedia, 2009)
Goal of open standards
• Interoperability of tools
  • Vendors can concentrate on innovation in fields other than their proprietary formats
• Standardization of processes (translation of just one file format like XLIFF instead of DOC, HTML, InDesign, FM…)
Success of open standards
• Depends on the commercial usability
  • TMX - widespread
  • XLIFF - coming on strong
  • SRX - not widely used
  • TBX - slow
  • others - in the making (TBX Basic, GMX…)
5 Open Standards
• Why Open Standards in Open Source?
  • Implementing open standards seems an obvious success scenario for OSS development
  • XLIFF and TMX are open standards co-developed by our clients
  • Minimalist open standards implementation ensures desired functionality and is also legally safe
• LISA OSCAR TMX 1.4b, 1.5, 2.0
• OASIS XLIFF 1.1, 1.2, 1.2.1, 2.0
Open Standards: OAXAL
[Figure: OAXAL diagram. © Andrzej Zydron, OASIS OAXAL TC]
TMX - Translation Memory Exchange
• From the TMX specification:
  • "…The purpose of the TMX format is to provide a standard method to describe translation memory data that is being exchanged among tools and/or translation vendors, while introducing little or no loss of critical data during the process…"
What is TMX?
• It is an XML representation of translation memory data
  • Header
  • Body
<header
  creationtool="Déjà Vu"
  creationtoolversion="4"
  datatype="PlainText"
  segtype="sentence"
  adminlang="en-us"
  srclang="en-us"
  o-tmf="DVMDB"
>
• creationtool: Déjà Vu, Transit, Trados, MemoQ
• creationtoolversion: version / build number of the tool
• datatype: HTML, SGML, RTF, Interleaf, Java…
• segtype: basic segmentation
• adminlang: default language for elements like <note>
• srclang: source text language
• o-tmf: original translation memory format (DVMDB - Déjà Vu database…)
What is TMX?
• Body
<body>
  <tu creationdate="20030915T153704Z" creationid="USER">
    <tuv lang="EN-US">
      <seg>This is the first sentence</seg>
    </tuv>
    <tuv lang="DE-DE">
      <seg>Dies ist der erste Satz</seg>
    </tuv>
  </tu>
</body>
tu = translation unit; tuv lang = translation unit variant (language); seg = segment
What is TMX?
• Depending on the tool that created the TMX file, it can be bilingual or multilingual
• Importing a multilingual TMX file into a bilingual project will only import the relevant languages
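Importing only the relevant languages from a multilingual TMX body can be sketched in Python as follows. The sample body extends the slide's example with a French variant that is not in the original deck, and the function name language_pairs is hypothetical; real import filters in TM tools will differ.

```python
import xml.etree.ElementTree as ET

# TMX 1.4 uses xml:lang on <tuv>; older files use a plain lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# The slide's example plus a hypothetical FR-FR variant to make it multilingual.
TMX_BODY = """
<body>
  <tu creationdate="20030915T153704Z" creationid="USER">
    <tuv lang="EN-US"><seg>This is the first sentence</seg></tuv>
    <tuv lang="DE-DE"><seg>Dies ist der erste Satz</seg></tuv>
    <tuv lang="FR-FR"><seg>Ceci est la première phrase</seg></tuv>
  </tu>
</body>
"""


def language_pairs(body_xml, source, target):
    """Return (source text, target text) for every tu that has both languages."""
    wanted_source, wanted_target = source.lower(), target.lower()
    pairs = []
    for tu in ET.fromstring(body_xml).iter("tu"):
        variants = {}
        for tuv in tu.findall("tuv"):
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also returns the content of inline tags, if any.
                variants[lang] = "".join(seg.itertext())
        if wanted_source in variants and wanted_target in variants:
            pairs.append((variants[wanted_source], variants[wanted_target]))
    return pairs


print(language_pairs(TMX_BODY, "EN-US", "DE-DE"))
```

The French variant is simply skipped when the project is EN-US to DE-DE, and a translation unit lacking either language produces no pair at all.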
Levels of TMX
• Level 1
  • Plain text only (sufficient for data coming from software localization tools)
• Level 2
  • Text plus formatting (data coming from translation memory tools used for translation of documentation)
To move formatting and text from one tool to the other, both tools need to be Level 2 compliant
Level 1
• Formatting that is applied to the source and target text of a translation unit is not exported to the TMX file, only pure text
• Original
  • This sentence has some formatting
• In TMX
  • This sentence has some formatting
Level 2
• Formatting that is applied to the source and target text of a translation unit is exported to the TMX file
• Different tools use different ways of encoding that information (placeholders or actual formatting information)
Level 2
<seg>This is the <bpt i="1" type="bold"></bpt>first<ept i="1"></ept> sentence this is <bpt i="2" type="ulined"></bpt>another<ept i="2"></ept> sentence</seg>
MemoQ - Word DOC with formatting
<seg>This is the <bpt i="1" type="Bold" />first<ept i="1" /> sentence this is <bpt i="2" type="Underline" />another<ept i="2" /> sentence</seg>
Trados 2009 - Word DOC with formatting
<seg>This is the <bpt i="1">&lt;cf bold=&quot;on&quot;&gt;</bpt>first<ept i="1">&lt;/cf&gt;</ept> sentence this is <bpt i="2">&lt;cf underlinestyle=&quot;single&quot;&gt;</bpt>another<ept i="2">&lt;/cf&gt;</ept> sentence</seg>
Trados 2007 (8.2, 8.3) - Word DOC with formatting
What is an open standard
World Wide Web Consortiums definitionbull Transparency (designdue process is public and all technical
discussions meeting minutes are archived and referencablein decision making)
bull Relevance (new standardization is started upon due analysis of the market needs including requirements phase eg accessibility multi-linguism)
bull Openness (anyone can participate industry individual public government bodies academia on a worldwide scale)
bull Impartiality and consensus (neutral org leading it with equal weight for each participant)
bull Availability (free access to the standard text both during development and at final stage translations and clear IPR rules for implementation allowing open source development in the case of Web technologies)
bull Support (multiple implementations ongoing process for testing errata revision permanent access)
Wikipedia 2009
Goal of open standards
bull Interoperability of toolsbull Vendors can concentrate on innovation in other fields than their proprietary formats
bull Standardization of processes (translation of just one file format like XLIFF instead of DOC HTML InDesign FMhellip)
Success of open standards
bull Depends on the commercial usabilitybull TMX ndash widespread XLIFF ndash coming on
strong SRX ndash not widely used TBX ndash slow others ndash in the making (TBX Basic GMXhellip)
5 Open Standards
bull Why Open Standards in Open Source
bull Implementing open standards seems obvious success scenario for OSS development
bull XLIFF and TMX are open standards co-developed by our clients
bull Minimalist open standards implementation ensures desired functionality and is also legally safe
bull LISA OSCAR TMX 14b 15 20
bull OASIS XLIFF 11 12 121 20
Open Standards OAXAL
copy A
ndrz
ej Z
ydro
n O
ASIS
OAXAL T
C
TMXTranslation Memory Exchange
bull From the TMX specification
bull hellipThe purpose of the TMX format is to provide a standard method to describe translation memory data that is being exchanged among tools andor translation vendors while introducing little or no loss of critical data during the processhellip
What is TMX
bull It is an XML representation of translation memory data
bull Header
bull Body
ltheadercreationtool=ldquoDeacutejagrave Vu creationtoolversion=ldquo4datatype=PlainTextrdquosegtype=sentenceadminlang=en-ussrclang=en-uso-tmf=DVMDB
gt
Deacutejagrave Vu Transit Trados MemoQ
Version build number of the tool
HTML SGML RTF Interleaf Javahellip
Basic segmentation
Default language for elements like ltnotegt
Source text language
Original translation memory format (DVMDB ndash Deacutejagrave Vu databasehellip)
What is TMX
bull Body
ltbodygtlttu creationdate=20030915T153704Z creationid=USERgt
lttuv lang=EN-USgtltseggtThis is the first sentenceltseggt
lttuvgtlttuv lang=DE-DEgt
ltseggtDies ist der erste Satzltseggtlttuvgt
lttugtltbodygt
tu = Translation Unittuv lang = translation unit variant (language) seg = segment
What is TMX
bull Depending on the tool that created the TMX file it can be bilingual or multilingual
bull Importing multilingual TMX file into a bilingual project will only import the relevant languages
Levels of TMX
bull Level 1bull Plain text only (sufficient for data coming from software localization tools)
bull Level 2bull Text plus formatting (data coming from translation memory tools used for translation of documentation)
To move formatting and text from one tool to the other both tools need to be level2 compliant
Level 1
bull Formatting that is applied to the source and target text of a translation unit is not exported to the TMX file only pure text
bull Original
bull This sentence has some formatting
bull In TMX
bull This sentence has some formatting
Level 2
bull Formatting that is applied to the source and target text of a translation unit is exported to the TMX file
bull Different tools use different ways of encoding that information (placeholders or actual formatting information)
Level 2
seggtThis is the ltbpt i=1 type=boldgtltbptgtfirstltept i=1gtlteptgt sentence this is ltbpt i=2 type=ulinedgtltbptgtanotherltept i=2gtlteptgt sentenceltseggt
MemoQ ndash Word DOC with formatting
ltseggtThis is the ltbpt i=1 type=Bold gtfirstltept i=1 gt sentence this isltbpt i=2 type=Underline gtanotherltept i=2 gt sentenceltseggt
Trados 2009 ndash Word DOC with formatting
ltseggtThis is the ltbpt i=1gtampltcf bold=ampquotonampquotampgtltbptgtfirstltept i=1gtampltcfampgtlteptgt sentence this is ltbpt i=2gtampltcf underlinestyle=ampquotsingleampquotampgtltbptgtanotherltept i=2gtampltcfampgtlteptgtsentenceltseggt
Trados 2007 82 83 ndash Word DOC with formatting
Level 2
MemoQ ndash HTML file with link
Trados 2009 ndash HTML file with link
Trados 2007 82 83 ndash HTML file with link
ltseggtText with a link to ltbpt i=1gtamplta href=ampquothttpwwwsamplehtmlcompage1htmampquotampgtltbptgtanother pageltept i=1gtampltaampgtlteptgtltseggt
ltseggtText with a link to ltbpt i=1 type=19 x=1 gtanother pageltept i=1 gtltseggt
ltseggtText with a link to ltbpt i=1 type=linkgtamplta href = ampquothttpwwwsamplehtmlcompage1htmampquotampgtltbptgtanother pageltept i=1gtampltaampgtlteptgtltseggt
OmegaT internal format ltseggtText with a link to amplta0ampgtanother pageamplta0ampgtltseggt
TMX Level 2 format ltseggtText with a link to ltbpt i=0 x=0gtamplta0ampgtltbptgtanother pageltept i=0gtamplta0ampgtlteptgtltseggt
OmegaT - HTML file with link
Level 2
MemoQ ndash InDesign
Trados 2009 ndash InDesign
Trados 2007 82 83 ndash InDesign
ltseggtInDesign text with ltbpt i=1gtampltcf ptfs=ampquotc_Boldampquotampgtltbptgtformatting in boldltept i=1gtampltcfampgtlteptgtltseggt
ltseggtInDesign text with ltbpt i=1 type=pt16 x=1 gtformatting in boldltept i=1 gtltseggt
ltseggtInDesign Text with ltbpt i=1 type=boldgtltbptgtformtatting in boldltept i=1gtlteptgtltseggt
Implications of different tags for formatting
bull Tools that use placeholder tags do not include the actual formatting information in the TMX file
bull Other tools might only be able to re-use the text especially if the formatting is only applied in the target segment but not in the source
bull The result of the exchange would then be the same as with TMX level 1 (text only)
bull TMX files which carry the actual formatting information will yield better matches in other tools that can read this information
Where do you use TMX?
• Transferring data between different translation memory tools
• Checking tools, QA tools
• TM maintenance tools
• Basis for bilingual term extraction
Reusing TMX data
• Although translation memory tools share the same basic idea (storing source-target language pairs and recycling translations), this has been realized in different ways
• Exchange with TMX works, but there is an issue that can lower the match rates nonetheless… the segmentation rules
SRX – Segmentation Rules Exchange
• From the SRX specification:
  • "…The purpose of the SRX format is to provide a standard method to describe segmentation rules that are being exchanged among tools and/or translation vendors…"
  • "…is intended to enhance the TMX standard…"
Why SRX?
• Tool A
  • Semicolon is end of segment
  • "This is a sentence; this is another sentence."
  • TM system sees two separate segments
• Tool B
  • Semicolon is NOT end of segment
  • "This is a sentence; this is another sentence."
  • TM system sees one segment
  • No match from the TMX data
  • Match rate around 50%, while the usual minimum match setting is around 70%
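The effect can be reproduced with a short Python sketch. The helper names are assumptions for this illustration, and `difflib`'s ratio is only a stand-in for a match rate: real TM tools compute fuzzy matches in their own ways, so the printed figures will not equal the percentages quoted above.

```python
import re
from difflib import SequenceMatcher

TEXT = "This is a sentence; this is another sentence."


def split_segments(text, end_chars):
    """Split after any character in end_chars that is followed by whitespace."""
    pattern = r"(?<=[" + re.escape(end_chars) + r"])\s+"
    return re.split(pattern, text)


def similarity(a, b):
    """Rough similarity in the range 0..1 (stand-in for a TM match rate)."""
    return SequenceMatcher(None, a, b).ratio()


# Tool A treats the semicolon as an end-of-segment marker, Tool B does not.
tool_a_segments = split_segments(TEXT, ".;?!")
tool_b_segments = split_segments(TEXT, ".?!")

print(tool_a_segments)
print(tool_b_segments)

# A TM filled by Tool A holds two short segments; Tool B looks up one long one.
for stored in tool_a_segments:
    print(f"{similarity(stored, tool_b_segments[0]):.2f}  {stored!r}")
```

Neither stored segment is an exact match for the longer segment Tool B produces, so the translation is only offered as a fuzzy match, if at all.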
Segmentation rules
• Rules that the tool applies to the text to translate, to split it up into segments
  • paragraph
  • sentence
  • phrase
  • incomplete sentences in bulleted lists
  • single words (headings, "Note", "Attention")
Segmentation rules
• End-of-segment rules (common to the default settings of all tools)
  • Dot at the end of a sentence (not after known abbreviations)
  • Question mark, exclamation mark
  • Paragraph mark
  • Colon
• End-of-segment rules (different for different tools)
  • Semicolon
  • Tab character
  • Sub-segments (index entries, footnotes, graphics)
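SRX describes such rules as pairs of regular expressions (text before and after a candidate position), each flagged as a break or a no-break exception. The Python sketch below imitates that model; the rule list, the abbreviation list and the function name `segment` are assumptions for illustration, and it is not a conforming SRX processor.

```python
import re

# (is_break, before_break_regex, after_break_regex). The first rule that matches
# a position decides, so exceptions are listed ahead of the general break rules.
RULES = [
    (False, r"\b(?:Dr|Mr|Mrs|etc|e\.g|i\.e)\.", r"\s"),  # known abbreviations
    (True, r"[.?!:]", r"\s"),                            # dot, ?, !, colon
]


def segment(text, rules=RULES):
    """Split text at every position where a break rule is the first rule to match."""
    segments, start = [], 0
    for pos in range(1, len(text)):
        before_text, after_text = text[:pos], text[pos:]
        for is_break, before, after in rules:
            if re.search(f"(?:{before})$", before_text) and re.match(after, after_text):
                if is_break:
                    segments.append(text[start:pos].strip())
                    start = pos
                break  # first matching rule wins, whether break or exception
    tail = text[start:].strip()
    if tail:
        segments.append(tail)
    return segments


print(segment("Ask Dr. Smith. Note: done!"))
```

Adding `;` to the character class in the second rule turns this into a "semicolon is end of segment" tool, which is exactly the kind of difference the next slide compares.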
Comparison of default rules
              Workbench   Transit   DV                           SDLX                         Across
Colon         end         end       end                          no end                       no end
Semicolon     no end      end       end                          no end                       no end
Tab           end         no end    no end                       no end                       no end
Soft return   no end      no end    end in Word, no end in PPT   end in Word, no end in PPT   no end
What can SRX do, and what not?
• It can only show the segmentation rule settings at the time of export
• It cannot show any changes that were applied to the segmentation rules while the TM was in use
• Sometimes a rule from system 1 cannot be re-created in system 2; that rule will then be ignored
TBX – TermBase Exchange
• From the working draft of the TBX specification:
  • TBX is an open, XML-based standard format for terminological data
  • TBX is designed to support the analysis, representation, dissemination, and exchange of information from human-oriented terminological databases (termbases)
  • TBX is built on the basis of ISO 12620 (data categories) and ISO 12200 (MARTIF core structure)
TMX TBX
[Figure: annotated TBX term entry. Callouts: global information in entry head; language ID (one per language); administrative data of this language; information on term level; term in English; term in French]
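The entry structure named in the callouts (entry-level information, one language section per language ID, terms inside each language section) can be read with a few lines of Python. The sample entry below is a simplified, hand-written assumption rather than a validated TBX file, and `read_entries` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Simplified sample entry (an assumption for illustration only).
SAMPLE_TBX = """<martif type="TBX" xml:lang="en">
  <text><body>
    <termEntry id="e1">
      <descrip type="subjectField">software</descrip>
      <langSet xml:lang="en"><tig><term>button</term></tig></langSet>
      <langSet xml:lang="fr"><tig><term>bouton</term></tig></langSet>
    </termEntry>
  </body></text>
</martif>"""


def read_entries(tbx_xml):
    """Return one dict per term entry: id, entry-level subject field, terms by language."""
    root = ET.fromstring(tbx_xml)
    entries = []
    for entry in root.iter("termEntry"):
        terms = {}
        for lang_set in entry.findall("langSet"):
            lang = lang_set.get(XML_LANG)
            terms[lang] = [term.text for term in lang_set.iter("term")]
        entries.append({
            "id": entry.get("id"),
            "subject": entry.findtext("descrip[@type='subjectField']"),
            "terms": terms,
        })
    return entries


for item in read_entries(SAMPLE_TBX):
    print(item)
```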
Where could you use TBX?
• Exchange of terminological data
  • From term base to source language checking tools
  • Between term bases of translation memory tools
  • From term base to terminology checking tools
  • From term base to terminology extraction tools (as stop word lists)
  • From term base to dictionary of a machine translation tool
• For indexing keywords in document management systems, content management systems, knowledge management systems
• Publishing terminological data on the intranet or Internet
• Optimization of search engines and text mining by searching for synonyms automatically
XLIFF – XML Localization Interchange File Format
• The XLIFF format aims to
  • separate localizable text from formatting (deal with one file format in translation instead of different processes to extract, filter, and convert text from different file formats)
  • enable multiple tools to work on the source text
  • store information that is helpful in supporting a localization process (like metadata on versions of source and target segments)
• An XLIFF file is bilingual and can be the container for a number of individual files
• Each file element in an XLIFF file contains a header (project data such as contact information, project phases, pointers to reference material, and information on the skeleton file) and a body section with the actual text segments
XLIFF
• XLIFF can carry several translation matches
• Additional fields can contain context, author, creation tool, history…
<trans-unit id="n1">
  <source>This is a sentence</source>
  <target xml:lang="de">Dies ist ein Satz</target>
  <alt-trans match-quality="100" tool="TM_System">
    <source>This is a sentence</source>
    <target xml:lang="de">Das ist ein Satz</target>
  </alt-trans>
  <alt-trans match-quality="70" tool="TM_System">
    <source>This is a short sentence</source>
    <target xml:lang="de">Dies ist ein kurzer Satz</target>
  </alt-trans>
</trans-unit>
• Information on the name of a button and its coordinates (for visual representation of the button in a localization tool):
<trans-unit id="1" resname="IDCANCEL" restype="button" coord="8;8;50;14">
  <source xml:lang="en">Cancel</source>
</trans-unit>
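A tool reading such a unit might pick the best stored match roughly as follows. This is a sketch under stated assumptions: the sample omits the XLIFF namespace for brevity, and the helper name `best_alt_trans` and the 70 threshold are illustrative choices, not part of the standard.

```python
import xml.etree.ElementTree as ET

# The trans-unit from the slide, without the XLIFF namespace (simplification).
SAMPLE_UNIT = """<trans-unit id="n1">
  <source>This is a sentence</source>
  <target xml:lang="de">Dies ist ein Satz</target>
  <alt-trans match-quality="100" tool="TM_System">
    <source>This is a sentence</source>
    <target xml:lang="de">Das ist ein Satz</target>
  </alt-trans>
  <alt-trans match-quality="70" tool="TM_System">
    <source>This is a short sentence</source>
    <target xml:lang="de">Dies ist ein kurzer Satz</target>
  </alt-trans>
</trans-unit>"""


def best_alt_trans(trans_unit_xml, threshold=70.0):
    """Return (match quality, target text) of the best alt-trans, or None."""
    unit = ET.fromstring(trans_unit_xml)
    candidates = []
    for alt in unit.findall("alt-trans"):
        # match-quality is free text in XLIFF 1.x; tolerate a trailing percent sign.
        quality = float(alt.get("match-quality", "0").rstrip("%"))
        if quality >= threshold:
            candidates.append((quality, alt.findtext("target")))
    return max(candidates, key=lambda c: c[0], default=None)


print(best_alt_trans(SAMPLE_UNIT))
```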
Where is XLIFF useful?
• Where experience with XML exists
• Projects contain many different file formats
• All formats are converted to XLIFF for translation
• Different tools need to be used during localization
• Different translations (alt-trans) or languages needed as reference
Any idea why XLIFF should NOT be the cure for everything?
• Instead of developing parsers for different file formats (to read the file into a translation tool), developers now need to create parsers to convert those file formats to XLIFF
• Some file formats can already be dealt with (Office, HTML, XML…) – why should a new parser be created for those?
• XLIFF has its variants, like all XML-based files – how can you make sure that each tool can process any of the possible XLIFF flavours?
The magic quadrant again, just to remember the distinction
[Figure: quadrant diagram. Axes: Open Standards vs. Proprietary ways; Open Source vs. Closed Source]
• Q1: Open-Closed (Good)
• Q2: Open-Open (Good)
• Q3: Proprietary-Closed (Bad)
• Q4: Proprietary-Open (Wild)
6 Talking Legalese in OSS
OSS (Open Source Software)
• Copyleft and Permissive Licensing
  • GPL (General Public License by FSF)
  • BSD (Berkeley Software Distribution)
  • Apache License 2.0
• MS open source licensing
  • Ms-PL (Public License)
  • Ms-RL (Reciprocal License)
Derived/Derivative works vs. dynamic linkage
• Paradigmatic case is an app running on an OS
Contributions
Distribution
6 Talking Legalese in standards
RF (Royalty Free) vs. RAND (Reasonable and Non-Discriminatory)
Standards can be licensed under different terms:
• RF
• RAND – to date there is no general consensus on whether RAND is Open or Proprietary
• Proprietary
By open standard we mean RF standards only
The opposite of an open standard is any proprietary way of doing things, even if it is standardized
7 Open Tools
• Infrastructure baseline
  • PostgreSQL, MySQL, Open ACS, Petri Nets, Alfresco, JBoss
• TM Server
  • TinyTM
• Workflow capabilities through any CALPMS
  • Open: ]po[, .LRN. Closed source: AIT Projetex, LTC Worx, Plunet Business Manager, etc.
• CAT Clients
  • OmegaT, MetaTexis, TinyTM Word macro, Multilizer, and whoever wants to implement the protocol
• Filters
  • OKAPI framework, mainstream XLIFF generators such as Sun or Adobe
8 Use Cases - odt + OmegaT
• Level 2 TMX support
• Has TinyTM connector in alpha
• Has plaintext glossary support
OpenOffice.org .odt
8 Use Cases - TinyTM
• The TinyTM protocol is licensed under the "LGPL V2.1 or higher"
• The rest of the TinyTM code is licensed under the "GPL V2.0 or higher"
• Translation clients can be licensed under any license
• TinyTM documentation is licensed under the Creative Commons Attribution License
8 Use Cases - TinyTM - Status
• Alpha functionality
  • Java TMX importer
  • Markup parsing logic
• Protocol freeze in public discussion on SourceForge
• Protocol features
  • Conservative extension with many important enhancements
8 Use Cases - TinyTM – new protocol features
• Industry and domain taxonomy
  • open to accommodate future TDA taxonomy research
• Inline markup handling
  1. stripped plain version
  2. normalized version with formatting placeholders
  3. TMX version with full TMX markup
• Advanced leverage functions
  • hash-based context storage
  • hash- and trigram-enhanced fuzzy search
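The general idea behind hash and trigram lookups can be sketched as follows. This is not TinyTM's implementation or protocol: the function names, the SHA-1 hash, the Jaccard measure over character trigrams, and the 0.7 threshold are all assumptions chosen to illustrate the technique.

```python
import hashlib


def normalize(segment):
    """Lower-case and collapse whitespace so trivial differences do not matter."""
    return " ".join(segment.lower().split())


def segment_hash(segment):
    """Hash of the normalized segment, used for cheap exact-match lookups."""
    return hashlib.sha1(normalize(segment).encode("utf-8")).hexdigest()


def trigrams(segment):
    """Set of character trigrams of the padded, normalized segment."""
    padded = f"  {normalize(segment)} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a, b):
    """Jaccard similarity of the two trigram sets, in the range 0..1."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def build_index(pairs):
    """Map segment hash -> (source, target) for a list of translation pairs."""
    return {segment_hash(src): (src, tgt) for src, tgt in pairs}


def lookup(index, query, threshold=0.7):
    """Exact match via hash first, otherwise trigram-ranked fuzzy matches."""
    exact = index.get(segment_hash(query))
    if exact:
        return [(1.0, *exact)]
    scored = [(trigram_similarity(query, src), src, tgt) for src, tgt in index.values()]
    return sorted((m for m in scored if m[0] >= threshold), reverse=True)


tm_index = build_index([("This is a sentence", "Dies ist ein Satz"),
                        ("Cancel", "Abbrechen")])
print(lookup(tm_index, "this is  a SENTENCE"))
print(lookup(tm_index, "This is a short sentence"))
```

The hash answers the exact-match case without comparing strings, while the trigram sets give a score that can be ranked and cut off at a threshold.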
9 Discussion
Thanks for your attention
10 References
Open Standards (selective)
• XLIFF 1.2: http://docs.oasis-open.org/xliff/v1.2/os/xliff-core.html
• TMX 1.4b: http://www.lisa.org/fileadmin/standards/tmx1.4/tmx.htm
• SRX 2.0: http://www.lisa.org/fileadmin/standards/srx20.html
Open Tools (selective)
• OKAPI Framework: http://okapi.sourceforge.net
• OmegaT: http://sourceforge.net/projects/omegat
• OmegaT+: http://omegatplus.sourceforge.net
• Open ACS: http://openacs.org
• PostgreSQL: http://www.postgresql.org
• TinyTM: http://tinytm.sourceforge.net
Tools continued
• XML to XLIFF to XML: https://sourceforge.net/projects/xliffroundtrip
• TMX Compliance Kit (test for TMX certification): http://www.lisa.org/Translation-Memory-e.34.0.html
• TBX Checker: http://www.lisa.org/TBX-Resources6500html
Standards continued
• Unicode Standard Annex 29: http://www.unicode.org/reports/tr29
• W3C ITS: http://www.w3.org/TR/its
Goal of open standards
• Interoperability of tools
  • Vendors can concentrate on innovation in fields other than their proprietary formats
• Standardization of processes (translation of just one file format like XLIFF instead of DOC, HTML, InDesign, FM…)
Success of open standards
• Depends on the commercial usability
  • TMX – widespread; XLIFF – coming on strong; SRX – not widely used; TBX – slow; others – in the making (TBX Basic, GMX…)
5 Open Standards
• Why Open Standards in Open Source?
• Implementing open standards seems an obvious success scenario for OSS development
• XLIFF and TMX are open standards co-developed by our clients
• A minimalist open standards implementation ensures the desired functionality and is also legally safe
• LISA OSCAR: TMX 1.4b, 1.5, 2.0
• OASIS: XLIFF 1.1, 1.2, 1.2.1, 2.0
Open Standards: OAXAL
[Figure: OAXAL diagram. © Andrzej Zydron, OASIS OAXAL TC]
TMX – Translation Memory Exchange
• From the TMX specification:
  • "…The purpose of the TMX format is to provide a standard method to describe translation memory data that is being exchanged among tools and/or translation vendors, while introducing little or no loss of critical data during the process…"
What is TMX?
• It is an XML representation of translation memory data
  • Header
  • Body
<header
  creationtool="Déjà Vu"
  creationtoolversion="4"
  datatype="PlainText"
  segtype="sentence"
  adminlang="en-us"
  srclang="en-us"
  o-tmf="DVMDB"
>
Header attributes:
• creationtool: Déjà Vu, Transit, Trados, MemoQ
• creationtoolversion: version / build number of the tool
• datatype: HTML, SGML, RTF, Interleaf, Java…
• segtype: basic segmentation
• adminlang: default language for elements like <note>
• srclang: source text language
• o-tmf: original translation memory format (DVMDB – Déjà Vu database…)
What is TMX?
• Body
<body>
  <tu creationdate="20030915T153704Z" creationid="USER">
    <tuv lang="EN-US">
      <seg>This is the first sentence</seg>
    </tuv>
    <tuv lang="DE-DE">
      <seg>Dies ist der erste Satz</seg>
    </tuv>
  </tu>
</body>
tu = translation unit; tuv lang = translation unit variant (language); seg = segment
What is TMX?
• Depending on the tool that created the TMX file, it can be bilingual or multilingual
• Importing a multilingual TMX file into a bilingual project will import only the relevant languages
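Picking one language pair out of a multilingual TMX file could look like the Python sketch below. The sample file, the tool name in its header, and the function name `language_pairs` are assumptions for illustration; segments are treated as plain text (Level 1), so any inline codes would simply be concatenated into the text.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hand-written trilingual sample (an assumption, not output of a real tool).
SAMPLE_TMX = """<tmx version="1.4">
  <header creationtool="ExampleTool" creationtoolversion="1" datatype="PlainText"
          segtype="sentence" adminlang="en-us" srclang="en-us" o-tmf="Example"/>
  <body>
    <tu>
      <tuv xml:lang="EN-US"><seg>This is the first sentence</seg></tuv>
      <tuv xml:lang="DE-DE"><seg>Dies ist der erste Satz</seg></tuv>
      <tuv xml:lang="FR-FR"><seg>Ceci est la première phrase</seg></tuv>
    </tu>
  </body>
</tmx>"""


def language_pairs(tmx_xml, source_lang, target_lang):
    """Return (source, target) texts for every tu that has both languages."""
    source_lang, target_lang = source_lang.lower(), target_lang.lower()
    pairs = []
    for tu in ET.fromstring(tmx_xml).iter("tu"):
        by_lang = {}
        for tuv in tu.findall("tuv"):
            # TMX 1.4 uses xml:lang; older files use a plain lang attribute.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                by_lang[lang] = "".join(seg.itertext())
        if source_lang in by_lang and target_lang in by_lang:
            pairs.append((by_lang[source_lang], by_lang[target_lang]))
    return pairs


print(language_pairs(SAMPLE_TMX, "EN-US", "DE-DE"))
```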
Levels of TMX
• Level 1
  • Plain text only (sufficient for data coming from software localization tools)
• Level 2
  • Text plus formatting (data coming from translation memory tools used for translation of documentation)
To move formatting and text from one tool to another, both tools need to be Level 2 compliant
Level 1
• Formatting that is applied to the source and target text of a translation unit is not exported to the TMX file, only pure text
• Original
  • This sentence has some formatting
• In TMX
  • This sentence has some formatting
Level 2
• Formatting that is applied to the source and target text of a translation unit is exported to the TMX file
• Different tools use different ways of encoding that information (placeholders or actual formatting information)
Level 2
MemoQ – Word DOC with formatting:
<seg>This is the <bpt i="1" type="bold"></bpt>first<ept i="1"></ept> sentence this is <bpt i="2" type="ulined"></bpt>another<ept i="2"></ept> sentence</seg>
Trados 2009 – Word DOC with formatting:
<seg>This is the <bpt i="1" type="Bold" />first<ept i="1" /> sentence this is <bpt i="2" type="Underline" />another<ept i="2" /> sentence</seg>
Trados 2007 (8.2, 8.3) – Word DOC with formatting:
<seg>This is the <bpt i="1">&lt;cf bold=&quot;on&quot;&gt;</bpt>first<ept i="1">&lt;/cf&gt;</ept> sentence this is <bpt i="2">&lt;cf underlinestyle=&quot;single&quot;&gt;</bpt>another<ept i="2">&lt;/cf&gt;</ept> sentence</seg>
5 Open Standards
bull Why Open Standards in Open Source
bull Implementing open standards seems obvious success scenario for OSS development
bull XLIFF and TMX are open standards co-developed by our clients
bull Minimalist open standards implementation ensures desired functionality and is also legally safe
bull LISA OSCAR TMX 14b 15 20
bull OASIS XLIFF 11 12 121 20
Open Standards OAXAL
copy A
ndrz
ej Z
ydro
n O
ASIS
OAXAL T
C
TMXTranslation Memory Exchange
bull From the TMX specification
bull hellipThe purpose of the TMX format is to provide a standard method to describe translation memory data that is being exchanged among tools andor translation vendors while introducing little or no loss of critical data during the processhellip
What is TMX
bull It is an XML representation of translation memory data
bull Header
bull Body
ltheadercreationtool=ldquoDeacutejagrave Vu creationtoolversion=ldquo4datatype=PlainTextrdquosegtype=sentenceadminlang=en-ussrclang=en-uso-tmf=DVMDB
gt
Deacutejagrave Vu Transit Trados MemoQ
Version build number of the tool
HTML SGML RTF Interleaf Javahellip
Basic segmentation
Default language for elements like ltnotegt
Source text language
Original translation memory format (DVMDB ndash Deacutejagrave Vu databasehellip)
What is TMX
bull Body
ltbodygtlttu creationdate=20030915T153704Z creationid=USERgt
lttuv lang=EN-USgtltseggtThis is the first sentenceltseggt
lttuvgtlttuv lang=DE-DEgt
ltseggtDies ist der erste Satzltseggtlttuvgt
lttugtltbodygt
tu = Translation Unittuv lang = translation unit variant (language) seg = segment
What is TMX
bull Depending on the tool that created the TMX file it can be bilingual or multilingual
bull Importing multilingual TMX file into a bilingual project will only import the relevant languages
Levels of TMX
bull Level 1bull Plain text only (sufficient for data coming from software localization tools)
bull Level 2bull Text plus formatting (data coming from translation memory tools used for translation of documentation)
To move formatting and text from one tool to the other both tools need to be level2 compliant
Level 1
bull Formatting that is applied to the source and target text of a translation unit is not exported to the TMX file only pure text
bull Original
bull This sentence has some formatting
bull In TMX
bull This sentence has some formatting
Level 2
bull Formatting that is applied to the source and target text of a translation unit is exported to the TMX file
bull Different tools use different ways of encoding that information (placeholders or actual formatting information)
Level 2
seggtThis is the ltbpt i=1 type=boldgtltbptgtfirstltept i=1gtlteptgt sentence this is ltbpt i=2 type=ulinedgtltbptgtanotherltept i=2gtlteptgt sentenceltseggt
MemoQ ndash Word DOC with formatting
ltseggtThis is the ltbpt i=1 type=Bold gtfirstltept i=1 gt sentence this isltbpt i=2 type=Underline gtanotherltept i=2 gt sentenceltseggt
Trados 2009 ndash Word DOC with formatting
ltseggtThis is the ltbpt i=1gtampltcf bold=ampquotonampquotampgtltbptgtfirstltept i=1gtampltcfampgtlteptgt sentence this is ltbpt i=2gtampltcf underlinestyle=ampquotsingleampquotampgtltbptgtanotherltept i=2gtampltcfampgtlteptgtsentenceltseggt
Trados 2007 82 83 ndash Word DOC with formatting
Level 2
MemoQ ndash HTML file with link
Trados 2009 ndash HTML file with link
Trados 2007 82 83 ndash HTML file with link
ltseggtText with a link to ltbpt i=1gtamplta href=ampquothttpwwwsamplehtmlcompage1htmampquotampgtltbptgtanother pageltept i=1gtampltaampgtlteptgtltseggt
ltseggtText with a link to ltbpt i=1 type=19 x=1 gtanother pageltept i=1 gtltseggt
ltseggtText with a link to ltbpt i=1 type=linkgtamplta href = ampquothttpwwwsamplehtmlcompage1htmampquotampgtltbptgtanother pageltept i=1gtampltaampgtlteptgtltseggt
OmegaT internal format ltseggtText with a link to amplta0ampgtanother pageamplta0ampgtltseggt
TMX Level 2 format ltseggtText with a link to ltbpt i=0 x=0gtamplta0ampgtltbptgtanother pageltept i=0gtamplta0ampgtlteptgtltseggt
OmegaT - HTML file with link
Level 2
MemoQ ndash InDesign
Trados 2009 ndash InDesign
Trados 2007 82 83 ndash InDesign
ltseggtInDesign text with ltbpt i=1gtampltcf ptfs=ampquotc_Boldampquotampgtltbptgtformatting in boldltept i=1gtampltcfampgtlteptgtltseggt
ltseggtInDesign text with ltbpt i=1 type=pt16 x=1 gtformatting in boldltept i=1 gtltseggt
ltseggtInDesign Text with ltbpt i=1 type=boldgtltbptgtformtatting in boldltept i=1gtlteptgtltseggt
Implications of different tags for formatting
bull Tools that use placeholder tags do not include the actual formatting information in the TMX file
bull Other tools might only be able to re-use the text especially if the formatting is only applied in the target segment but not in the source
bull The result of the exchange would then be the same as with TMX level 1 (text only)
bull TMX files which carry the actual formatting information will yield better matches in other tools that can read this information
Where do you use TMX
bull Transfering data between different translation memory tools
bull Checking tools QA tools
bull TM maintenance tools
bull Basis for bilingual term extraxtion
Reusing TMX data
bull Although Translation Memory Tools have the same basic idea (storing source-target language pairs and recycling translations) this has been realized in different ways
bull Exchange with TMX works but there is an issue that can lower the match rates nonethelesshellip the segmentation rules
SRX ndash Segmentation Rules Exchange
bull From the SRX specification
bull hellipThe purpose of the SRX format is to provide a standard method to describe segmentation rules that are being exchanged among tools andor translation vendors
bull hellipis intended to enhance the TMX standardhellip
Why SRX
bull Tool Abull Semicolon is end of segment
bull This is a sentence this is another sentence
bull TM system sees two separate segments
bull Tool Bbull Semicolon is NOT end of segment
bull This is a sentence this is another sentence
bull TM system sees one segmentbull No match from the TMX data
bull Match rate around 50 usual setting around 70
Segmentation rules
bull Rules that the tool applies to the text to translate to split it up into segments
bull paragraph
bull sentence
bull phrase
bull incomplete sentences in bulleted lists
bull single words (headings ldquoNoterdquo ldquoAttentionrdquo)
Segmentation rules
bull End of segment rules (common to the default settings of all tools)bull Dot at the end of a sentence (not after known
abbreviations)bull Question mark exclamation markbull Paragraph markbull Colon
bull End of segment rules (different for different tools)bull Semicolonbull Tab characterbull Sub segments (index entries footnotes
graphics)
Comparison of default rules
Workbench Transit DV SDLX Across
Colon end end end no end no end
Semi-
colon
no end end end no end no end
Tab end no end no end no end no end
Soft
return
no end no end end in
Word no
end in
PPT
end in
Word no
end in
PPT
no end
What can SRX do and what not
bull It can only show the segmentation rule settings at the time of export
bull It cannot show any changes that have been applied in the segmentation rules during the use of the TM
bull Sometimes the rules from system 1 cannot be re-created in system 2 then the rule will be ignored
TBX ndash TermBase Exchange
bull From the working draft of the TBX specificationbull TBX is an open XML-based standard format for terminological data
bull TBX is designed to support the analysis representation dissemination and exchange of information from human-oriented terminological databases (termbases)
bull TBX is built on the basis of ISO 12620 (data categories) and ISO 12200 (MARTIF core structure)
TMX TBX
[Figure: annotated TBX entry. Callouts: global information in entry head; language ID (one per language); administrative data of this language; information on term level; term in English; term in French]
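The layers named in the figure (entry, language, term) can be sketched with a small MARTIF-style sample. The entry content and the `read_terms` helper are invented for illustration; a real termbase export will carry more data categories than shown here.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predefined xml: prefix to this namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical one-entry termbase.
TBX_SAMPLE = """<martif type="TBX" xml:lang="en">
  <martifHeader>
    <fileDesc><sourceDesc><p>Hypothetical sample termbase</p></sourceDesc></fileDesc>
  </martifHeader>
  <text><body>
    <termEntry id="c1">
      <descrip type="subjectField">translation technology</descrip>
      <langSet xml:lang="en">
        <tig>
          <term>translation memory</term>
          <termNote type="partOfSpeech">noun</termNote>
        </tig>
      </langSet>
      <langSet xml:lang="fr">
        <tig><term>mémoire de traduction</term></tig>
      </langSet>
    </termEntry>
  </body></text>
</martif>"""

def read_terms(tbx_text):
    """Return {entry id: {language: [terms]}} for every termEntry."""
    root = ET.fromstring(tbx_text)
    entries = {}
    for entry in root.iter("termEntry"):
        languages = {}
        for lang_set in entry.iter("langSet"):
            languages[lang_set.get(XML_LANG)] = [t.text for t in lang_set.iter("term")]
        entries[entry.get("id")] = languages
    return entries

print(read_terms(TBX_SAMPLE))
```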
Where could you use TBX?
• Exchange of terminological data
  • From term base to source language checking tools
  • Between term bases of translation memory tools
  • From term base to terminology checking tools
  • From term base to terminology extraction tools (as stop word lists)
  • From term base to the dictionary of a machine translation tool
• For indexing keywords in document management systems, content management systems, knowledge management systems
• Publishing terminological data on the intranet or Internet
• Optimization of search engines and text mining by searching for synonyms automatically
XLIFF – XML Localization Interchange File Format
• The XLIFF format aims to
  • separate localizable text from formatting (deal with one file format in translation instead of different processes to extract, filter, and convert text from different file formats)
  • enable multiple tools to work on the source text
  • store information that is helpful in supporting a localization process (like metadata on versions of source and target segments)
• An XLIFF file is bilingual and can be the container for a number of individual files
• Each <file> element in an XLIFF file contains a header (project data such as contact information, project phases, pointers to reference material, and information on the skeleton file) and a body section with the actual text segments
XLIFF
• XLIFF can carry several translation matches
• Additional fields can contain context, author, creation tool, history…

<trans-unit id="n1">
  <source>This is a sentence</source>
  <target xml:lang="de">Dies ist ein Satz</target>
  <alt-trans match-quality="100" tool="TM_System">
    <source>This is a sentence</source>
    <target xml:lang="de">Das ist ein Satz</target>
  </alt-trans>
  <alt-trans match-quality="70" tool="TM_System">
    <source>This is a short sentence</source>
    <target xml:lang="de">Dies ist ein kurzer Satz</target>
  </alt-trans>
</trans-unit>

<trans-unit id="1" resname="IDCANCEL" restype="button" coord="885014">
  <source xml:lang="en">Cancel</source>
</trans-unit>

• Information on the name of a button and its coordinates (for visual representation of the button in a localization tool)
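A tool that receives such a file can rank the proposals it carries. The sketch below assumes XLIFF 1.2 and a numeric `match-quality` value (the attribute is free text in the standard, so real files may need more defensive parsing); the sample document and the `best_matches` helper are hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {"x": "urn:oasis:names:tc:xliff:document:1.2"}

# Hypothetical XLIFF 1.2 file with two TM proposals for one unit.
XLIFF_SAMPLE = """<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <file original="sample.txt" source-language="en" target-language="de" datatype="plaintext">
    <body>
      <trans-unit id="n1">
        <source>This is a sentence</source>
        <alt-trans match-quality="70" tool="TM_System">
          <source>This is a short sentence</source>
          <target xml:lang="de">Dies ist ein kurzer Satz</target>
        </alt-trans>
        <alt-trans match-quality="100" tool="TM_System">
          <source>This is a sentence</source>
          <target xml:lang="de">Das ist ein Satz</target>
        </alt-trans>
      </trans-unit>
    </body>
  </file>
</xliff>"""

def best_matches(xliff_text):
    """Map each trans-unit id to (match quality, target text) of its best alt-trans."""
    root = ET.fromstring(xliff_text)
    result = {}
    for unit in root.iterfind(".//x:trans-unit", NS):
        candidates = []
        for alt in unit.iterfind("x:alt-trans", NS):
            # Assumption: the value is a number, optionally followed by '%'
            quality = float(alt.get("match-quality", "0").rstrip("%"))
            target = alt.findtext("x:target", default="", namespaces=NS)
            candidates.append((quality, target))
        if candidates:
            result[unit.get("id")] = max(candidates, key=lambda c: c[0])
    return result

print(best_matches(XLIFF_SAMPLE))
```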
Where is XLIFF useful?
• Where experience with XML exists
• Projects contain many different file formats
• All formats are converted to XLIFF for translation
• Different tools need to be used during localization
• Different translations (alt-trans) or languages needed as reference
Any idea why XLIFF should NOT be the cure for everything?
• Instead of developing parsers for different file formats (to read the file into a translation tool), developers now need to create parsers to convert those file formats to XLIFF
• Some file formats can already be dealt with (Office, HTML, XML…) – why should a new parser be created for those?
• XLIFF has its variants, like all XML-based files – how can you make sure that each tool can process any of the possible XLIFF flavours?
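The magic quadrant again, just to remember the distinction
(standards axis: Open Standards vs Proprietary ways; source axis: Open Source vs Closed Source)
• Q1: Open Standards, Closed Source: Good
• Q2: Open Standards, Open Source: Good
• Q3: Proprietary ways, Closed Source: Bad
• Q4: Proprietary ways, Open Source: Wild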
6 Talking Legalese in OSS
OSS (Open Source Software)
• Copyleft and Permissive Licensing
  • GPL (General Public License, by the FSF)
  • BSD (Berkeley Software Distribution)
  • Apache License 2.0
• MS open source licensing
  • Ms-PL (Public License)
  • Ms-RL (Reciprocal License)
Derived/Derivative works vs dynamic linkage
• Paradigmatic case is an app running on an OS
Contributions
Distribution
6 Talking Legalese in standards
RF (Royalty Free) vs RAND (Reasonable and Non-Discriminatory)
Standards can be licensed under different terms
• RF
• RAND – there is no general consensus to date on whether RAND is Open or Proprietary
• Proprietary
By open standard we mean just RF standards
The opposite of Open Standard is any proprietary way of doing things, be it standardized or not
7 Open Tools
• Infrastructure baseline
  • PostgreSQL, MySQL, Open ACS, Petri Nets, Alfresco, jBoss
• TM Server
  • TinyTM
• Workflow capabilities through any CALPMS
  • Open: ]po[, .LRN. Closed source: AIT Projetex, LTC Worx, Plunet Business Manager, etc.
• CAT Clients
  • OmegaT, Metatexis, TinyTM Word macro, Multilizer, and whoever wants to implement the protocol
• Filters
  • OKAPI framework, mainstream XLIFF generators such as Sun or Adobe
8 Use Cases - odt + OmegaT
• Level 2 TMX support
• Has TinyTM connector in alpha
• Has plaintext glossary support
OpenOffice.org / odt
8 Use Cases - TinyTM
• The TinyTM protocol is licensed under the "LGPL V2.1 or higher"
• The rest of the TinyTM code is licensed under the "GPL V2.0 or higher"
• Translation clients can be licensed under any license
• TinyTM documentation is licensed under the Creative Commons Attribution License
8 Use Cases - TinyTM - Status
• Alpha functionality
• Java TMX importer
• Markup parsing logic
-------------------------
• Protocol freeze in public discussion on SourceForge
• Protocol features
  • Conservative extension with many important enhancements
8 Use Cases - TinyTM – new protocol features
• Industry and domain taxonomy
  • open to accommodate future TDA taxonomy research
• Inline markup handling
  1 stripped plain version
  2 normalized version with formatting placeholders
  3 TMX version with full TMX markup
• Advanced leverage functions
  • hash based context storage
  • hash and trigram enhanced fuzzy search
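To give an idea of what "hash and trigram enhanced fuzzy search" can mean in general, here is a generic sketch: exact matches are found through a hash of the segment, everything else is ranked by overlap of character trigrams. This is not TinyTM code or its protocol; the `SketchMemory` class, the SHA-1 choice, the Jaccard measure, and the 0.7 threshold are all assumptions made for the example.

```python
import hashlib

def trigrams(text):
    """Set of character trigrams of the lower-cased, space-padded text."""
    padded = "  " + text.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def similarity(a, b):
    """Jaccard overlap of the two trigram sets, between 0.0 and 1.0."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

def segment_hash(text):
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

class SketchMemory:
    """Hypothetical in-memory TM: hash lookup first, trigram ranking second."""

    def __init__(self):
        self.by_hash = {}

    def add(self, source, target):
        self.by_hash[segment_hash(source)] = (source, target)

    def lookup(self, query, threshold=0.7):
        exact = self.by_hash.get(segment_hash(query))
        if exact:
            return 1.0, exact[1]
        scored = [(similarity(query, src), tgt) for src, tgt in self.by_hash.values()]
        best = max(scored, default=(0.0, None), key=lambda c: c[0])
        return best if best[0] >= threshold else (0.0, None)

memory = SketchMemory()
memory.add("This is a sentence", "Dies ist ein Satz")
print(memory.lookup("This is a sentence"))
print(memory.lookup("This is a short sentence", threshold=0.5))
```

A database-backed server would normally push both steps into indexed queries rather than scanning all entries as this sketch does.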
9 Discussion
Thanks for your attention
10 References
Open Standards (selective)
• XLIFF 1.2: http://docs.oasis-open.org/xliff/v1.2/os/xliff-core.html
• TMX 1.4b: http://www.lisa.org/fileadmin/standards/tmx1.4/tmx.htm
• SRX 2.0: http://www.lisa.org/fileadmin/standards/srx20.html
Open Tools (selective)
• OKAPI Framework: http://okapi.sourceforge.net
• OmegaT: http://sourceforge.net/projects/omegat
• OmegaT+: http://omegatplus.sourceforge.net
• Open ACS: http://openacs.org
• PostgreSQL: http://www.postgresql.org
• TinyTM: http://tinytm.sourceforge.net
10 References
Tools continued
• XML to XLIFF to XML: https://sourceforge.net/projects/xliffroundtrip
• TMX Compliance Kit (test for TMX certification): http://www.lisa.org/Translation-Memory-e.34.0.html
• TBX Checker: httpwwwlisaorgTBX-Resources6500html
Standards continued
• Unicode Standard Annex #29: http://www.unicode.org/reports/tr29
• W3C ITS: http://www.w3.org/TR/its
Open Standards: OAXAL
[Figure: OAXAL diagram, © Andrzej Zydron, OASIS OAXAL TC]
TMX – Translation Memory Exchange
• From the TMX specification
  • "…The purpose of the TMX format is to provide a standard method to describe translation memory data that is being exchanged among tools and/or translation vendors, while introducing little or no loss of critical data during the process…"
What is TMX?
• It is an XML representation of translation memory data
  • Header
  • Body

<header
  creationtool="Déjà Vu"
  creationtoolversion="4"
  datatype="PlainText"
  segtype="sentence"
  adminlang="en-us"
  srclang="en-us"
  o-tmf="DVMDB"
>

creationtool: Déjà Vu, Transit, Trados, MemoQ
creationtoolversion: version / build number of the tool
datatype: HTML, SGML, RTF, Interleaf, Java…
segtype: basic segmentation
adminlang: default language for elements like <note>
srclang: source text language
o-tmf: original translation memory format (DVMDB – Déjà Vu database…)
What is TMX?
• Body

<body>
  <tu creationdate="20030915T153704Z" creationid="USER">
    <tuv lang="EN-US">
      <seg>This is the first sentence</seg>
    </tuv>
    <tuv lang="DE-DE">
      <seg>Dies ist der erste Satz</seg>
    </tuv>
  </tu>
</body>

tu = translation unit
tuv lang = translation unit variant (language)
seg = segment
What is TMX?
• Depending on the tool that created the TMX file, it can be bilingual or multilingual
• Importing a multilingual TMX file into a bilingual project will only import the relevant languages
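The import behaviour described above can be sketched as follows. The sample TMX document and the `read_pairs` helper are hypothetical; the sketch accepts both the `xml:lang` attribute and the plain `lang` attribute, since the examples in this deck use the latter.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predefined xml: prefix to this namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical multilingual TMX file with one translation unit.
TMX_SAMPLE = """<tmx version="1.4">
  <header creationtool="SampleTool" creationtoolversion="1" datatype="PlainText"
          segtype="sentence" adminlang="en-us" srclang="en-us" o-tmf="none"/>
  <body>
    <tu creationdate="20030915T153704Z" creationid="USER">
      <tuv xml:lang="EN-US"><seg>This is the first sentence</seg></tuv>
      <tuv xml:lang="DE-DE"><seg>Dies ist der erste Satz</seg></tuv>
      <tuv xml:lang="FR-FR"><seg>Ceci est la première phrase</seg></tuv>
    </tu>
  </body>
</tmx>"""

def read_pairs(tmx_text, source_lang, target_lang):
    """Return (source, target) pairs for one language pair; other variants are skipped."""
    source_lang, target_lang = source_lang.lower(), target_lang.lower()
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        variants = {}
        for tuv in tu.iter("tuv"):
            lang = tuv.get(XML_LANG) or tuv.get("lang") or ""
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also returns the content of inline elements, if any
                variants[lang.lower()] = "".join(seg.itertext())
        if source_lang in variants and target_lang in variants:
            pairs.append((variants[source_lang], variants[target_lang]))
    return pairs

print(read_pairs(TMX_SAMPLE, "en-US", "de-DE"))
```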
Levels of TMX
• Level 1
  • Plain text only (sufficient for data coming from software localization tools)
• Level 2
  • Text plus formatting (data coming from translation memory tools used for translation of documentation)
To move formatting and text from one tool to the other, both tools need to be Level 2 compliant
Level 1
• Formatting that is applied to the source and target text of a translation unit is not exported to the TMX file, only pure text
• Original
  • This sentence has some formatting
• In TMX
  • This sentence has some formatting
Level 2
• Formatting that is applied to the source and target text of a translation unit is exported to the TMX file
• Different tools use different ways of encoding that information (placeholders or actual formatting information)
Level 2
MemoQ – Word DOC with formatting
<seg>This is the <bpt i="1" type="bold"></bpt>first<ept i="1"></ept> sentence this is <bpt i="2" type="ulined"></bpt>another<ept i="2"></ept> sentence</seg>
Trados 2009 – Word DOC with formatting
<seg>This is the <bpt i="1" type="Bold" />first<ept i="1" /> sentence this is<bpt i="2" type="Underline" />another<ept i="2" /> sentence</seg>
Trados 2007 (8.2, 8.3) – Word DOC with formatting
<seg>This is the <bpt i="1">&lt;cf bold=&quot;on&quot;&gt;</bpt>first<ept i="1">&lt;/cf&gt;</ept> sentence this is <bpt i="2">&lt;cf underlinestyle=&quot;single&quot;&gt;</bpt>another<ept i="2">&lt;/cf&gt;</ept>sentence</seg>
Level 2
Trados 2007 (8.2, 8.3) – HTML file with link
<seg>Text with a link to <bpt i="1">&lt;a href=&quot;http://www.samplehtml.com/page1.htm&quot;&gt;</bpt>another page<ept i="1">&lt;/a&gt;</ept></seg>
Trados 2009 – HTML file with link
<seg>Text with a link to <bpt i="1" type="19" x="1" />another page<ept i="1" /></seg>
MemoQ – HTML file with link
<seg>Text with a link to <bpt i="1" type="link">&lt;a href = &quot;http://www.samplehtml.com/page1.htm&quot;&gt;</bpt>another page<ept i="1">&lt;/a&gt;</ept></seg>
OmegaT – HTML file with link
OmegaT internal format: <seg>Text with a link to &lt;a0&gt;another page&lt;/a0&gt;</seg>
TMX Level 2 format: <seg>Text with a link to <bpt i="0" x="0">&lt;a0&gt;</bpt>another page<ept i="0">&lt;/a0&gt;</ept></seg>
Level 2
Trados 2007 (8.2, 8.3) – InDesign
<seg>InDesign text with <bpt i="1">&lt;cf ptfs=&quot;c_Bold&quot;&gt;</bpt>formatting in bold<ept i="1">&lt;/cf&gt;</ept></seg>
Trados 2009 – InDesign
<seg>InDesign text with <bpt i="1" type="pt16" x="1" />formatting in bold<ept i="1" /></seg>
MemoQ – InDesign
<seg>InDesign Text with <bpt i="1" type="bold"></bpt>formatting in bold<ept i="1"></ept></seg>
Implications of different tags for formatting
• Tools that use placeholder tags do not include the actual formatting information in the TMX file
• Other tools might only be able to re-use the text, especially if the formatting is only applied in the target segment but not in the source
• The result of the exchange would then be the same as with TMX Level 1 (text only)
• TMX files which carry the actual formatting information will yield better matches in other tools that can read this information
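The fallback to "text only" can be sketched as stripping the inline elements from a Level 2 segment. The `plain_text` helper below is a hypothetical illustration, not part of any tool; it drops the content of the code-carrying inline elements and keeps everything else.

```python
import xml.etree.ElementTree as ET

# TMX inline elements whose content is native formatting code, not translatable text
CODE_ELEMENTS = {"bpt", "ept", "ph", "it", "ut"}

def plain_text(seg_xml):
    """Reduce a Level 2 <seg> to the plain text a Level 1 exchange would carry."""
    seg = ET.fromstring(seg_xml)
    parts = [seg.text or ""]
    for child in seg:
        if child.tag not in CODE_ELEMENTS:
            parts.append("".join(child.itertext()))  # e.g. <hi> wraps real text
        parts.append(child.tail or "")
    return "".join(parts)

# Same sentence, once with placeholder-style tags and once with native codes
placeholder_style = '<seg>This is the <bpt i="1" type="bold"></bpt>first<ept i="1"></ept> sentence</seg>'
native_code_style = ('<seg>This is the <bpt i="1">&lt;cf bold=&quot;on&quot;&gt;</bpt>first'
                     '<ept i="1">&lt;/cf&gt;</ept> sentence</seg>')

print(plain_text(placeholder_style))
print(plain_text(native_code_style))
```

Both encodings reduce to the same plain text, which is why the text itself can still be leveraged even when the formatting cannot.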
Where do you use TMX?
• Transferring data between different translation memory tools
• Checking tools, QA tools
• TM maintenance tools
• Basis for bilingual term extraction
Reusing TMX data
• Although translation memory tools share the same basic idea (storing source-target language pairs and recycling translations), it has been realized in different ways
• Exchange with TMX works, but there is an issue that can lower the match rates nonetheless… the segmentation rules
SRX – Segmentation Rules Exchange
• From the SRX specification
  • "…The purpose of the SRX format is to provide a standard method to describe segmentation rules that are being exchanged among tools and/or translation vendors…"
  • "…is intended to enhance the TMX standard…"
TMXTranslation Memory Exchange
bull From the TMX specification
bull hellipThe purpose of the TMX format is to provide a standard method to describe translation memory data that is being exchanged among tools andor translation vendors while introducing little or no loss of critical data during the processhellip
What is TMX
bull It is an XML representation of translation memory data
bull Header
bull Body
ltheadercreationtool=ldquoDeacutejagrave Vu creationtoolversion=ldquo4datatype=PlainTextrdquosegtype=sentenceadminlang=en-ussrclang=en-uso-tmf=DVMDB
gt
Deacutejagrave Vu Transit Trados MemoQ
Version build number of the tool
HTML SGML RTF Interleaf Javahellip
Basic segmentation
Default language for elements like ltnotegt
Source text language
Original translation memory format (DVMDB ndash Deacutejagrave Vu databasehellip)
What is TMX
bull Body
ltbodygtlttu creationdate=20030915T153704Z creationid=USERgt
lttuv lang=EN-USgtltseggtThis is the first sentenceltseggt
lttuvgtlttuv lang=DE-DEgt
ltseggtDies ist der erste Satzltseggtlttuvgt
lttugtltbodygt
tu = Translation Unittuv lang = translation unit variant (language) seg = segment
What is TMX
bull Depending on the tool that created the TMX file it can be bilingual or multilingual
bull Importing multilingual TMX file into a bilingual project will only import the relevant languages
Levels of TMX
bull Level 1bull Plain text only (sufficient for data coming from software localization tools)
bull Level 2bull Text plus formatting (data coming from translation memory tools used for translation of documentation)
To move formatting and text from one tool to the other both tools need to be level2 compliant
Level 1
bull Formatting that is applied to the source and target text of a translation unit is not exported to the TMX file only pure text
bull Original
bull This sentence has some formatting
bull In TMX
bull This sentence has some formatting
Level 2
bull Formatting that is applied to the source and target text of a translation unit is exported to the TMX file
bull Different tools use different ways of encoding that information (placeholders or actual formatting information)
Level 2
seggtThis is the ltbpt i=1 type=boldgtltbptgtfirstltept i=1gtlteptgt sentence this is ltbpt i=2 type=ulinedgtltbptgtanotherltept i=2gtlteptgt sentenceltseggt
MemoQ ndash Word DOC with formatting
ltseggtThis is the ltbpt i=1 type=Bold gtfirstltept i=1 gt sentence this isltbpt i=2 type=Underline gtanotherltept i=2 gt sentenceltseggt
Trados 2009 ndash Word DOC with formatting
ltseggtThis is the ltbpt i=1gtampltcf bold=ampquotonampquotampgtltbptgtfirstltept i=1gtampltcfampgtlteptgt sentence this is ltbpt i=2gtampltcf underlinestyle=ampquotsingleampquotampgtltbptgtanotherltept i=2gtampltcfampgtlteptgtsentenceltseggt
Trados 2007 82 83 ndash Word DOC with formatting
Level 2
MemoQ ndash HTML file with link
Trados 2009 ndash HTML file with link
Trados 2007 82 83 ndash HTML file with link
ltseggtText with a link to ltbpt i=1gtamplta href=ampquothttpwwwsamplehtmlcompage1htmampquotampgtltbptgtanother pageltept i=1gtampltaampgtlteptgtltseggt
ltseggtText with a link to ltbpt i=1 type=19 x=1 gtanother pageltept i=1 gtltseggt
ltseggtText with a link to ltbpt i=1 type=linkgtamplta href = ampquothttpwwwsamplehtmlcompage1htmampquotampgtltbptgtanother pageltept i=1gtampltaampgtlteptgtltseggt
OmegaT internal format ltseggtText with a link to amplta0ampgtanother pageamplta0ampgtltseggt
TMX Level 2 format ltseggtText with a link to ltbpt i=0 x=0gtamplta0ampgtltbptgtanother pageltept i=0gtamplta0ampgtlteptgtltseggt
OmegaT - HTML file with link
Level 2
MemoQ ndash InDesign
Trados 2009 ndash InDesign
Trados 2007 82 83 ndash InDesign
ltseggtInDesign text with ltbpt i=1gtampltcf ptfs=ampquotc_Boldampquotampgtltbptgtformatting in boldltept i=1gtampltcfampgtlteptgtltseggt
ltseggtInDesign text with ltbpt i=1 type=pt16 x=1 gtformatting in boldltept i=1 gtltseggt
ltseggtInDesign Text with ltbpt i=1 type=boldgtltbptgtformtatting in boldltept i=1gtlteptgtltseggt
Implications of different tags for formatting
bull Tools that use placeholder tags do not include the actual formatting information in the TMX file
bull Other tools might only be able to re-use the text especially if the formatting is only applied in the target segment but not in the source
bull The result of the exchange would then be the same as with TMX level 1 (text only)
bull TMX files which carry the actual formatting information will yield better matches in other tools that can read this information
Where do you use TMX
bull Transfering data between different translation memory tools
bull Checking tools QA tools
bull TM maintenance tools
bull Basis for bilingual term extraxtion
Reusing TMX data
bull Although Translation Memory Tools have the same basic idea (storing source-target language pairs and recycling translations) this has been realized in different ways
bull Exchange with TMX works but there is an issue that can lower the match rates nonethelesshellip the segmentation rules
SRX ndash Segmentation Rules Exchange
bull From the SRX specification
bull hellipThe purpose of the SRX format is to provide a standard method to describe segmentation rules that are being exchanged among tools andor translation vendors
bull hellipis intended to enhance the TMX standardhellip
Why SRX
bull Tool Abull Semicolon is end of segment
bull This is a sentence this is another sentence
bull TM system sees two separate segments
bull Tool Bbull Semicolon is NOT end of segment
bull This is a sentence this is another sentence
bull TM system sees one segmentbull No match from the TMX data
bull Match rate around 50 usual setting around 70
Segmentation rules
bull Rules that the tool applies to the text to translate to split it up into segments
bull paragraph
bull sentence
bull phrase
bull incomplete sentences in bulleted lists
bull single words (headings ldquoNoterdquo ldquoAttentionrdquo)
Segmentation rules
bull End of segment rules (common to the default settings of all tools)bull Dot at the end of a sentence (not after known
abbreviations)bull Question mark exclamation markbull Paragraph markbull Colon
bull End of segment rules (different for different tools)bull Semicolonbull Tab characterbull Sub segments (index entries footnotes
graphics)
Comparison of default rules
Workbench Transit DV SDLX Across
Colon end end end no end no end
Semi-
colon
no end end end no end no end
Tab end no end no end no end no end
Soft
return
no end no end end in
Word no
end in
PPT
end in
Word no
end in
PPT
no end
What can SRX do and what not
bull It can only show the segmentation rule settings at the time of export
bull It cannot show any changes that have been applied in the segmentation rules during the use of the TM
bull Sometimes the rules from system 1 cannot be re-created in system 2 then the rule will be ignored
TBX ndash TermBase Exchange
bull From the working draft of the TBX specificationbull TBX is an open XML-based standard format for terminological data
bull TBX is designed to support the analysis representation dissemination and exchange of information from human-oriented terminological databases (termbases)
bull TBX is built on the basis of ISO 12620 (data categories) and ISO 12200 (MARTIF core structure)
TMX TBX
Zerfasszaacde 35
Term in English
Term in French
Global information in entry head
Information on term level
Administrative data of this language
Language ID
Language ID
Where could you use TBX
bull Exchange terminological databull From term base to source language checking toolsbull Between term bases of translation memory toolsbull From term base to terminology checking toolsbull From term base to terminology extraction tools (as stop word lists)bull From term base to dictionary of a machine translation tool
bull For indexing keywords in document management systems content management systems knowledge management systems
bull Publishing terminological data on the Intranet Internet
bull Optimization of search enginges text mining by searching for synonyms automatically
XLIFF ndash XML Localization Interchange File Format
bull The XLIFF format aims to bull separate localizable text from formatting (deal with one file
format in translation instead of different processes to extract filter convert text from different file formats)
bull enable multiple tools to work on the source textbull store information that is helpful in supporting a localization
process (like meta data on versions of source and target segemtns)
bull An XLIFF file is bilingual and can be the container for a number of individual files
bull Each file element in an XLIFF file contains a header (project data such as contact information project phases pointers to reference material and information on the skeleton file) and a body section with the actual text segments
XLIFF
bull XLIFF can carry several translation matches
bull Additional fields can contain context author creation tool historyhellip
lttrans-unit id=n1gtltsourcegtThis is a sentenceltsourcegtlttarget xmllang=degtDies ist ein
Satzlttargetgtltalt-trans match-
quality=100 tool=TM_SystemgtltsourcegtThis is a sentenceltsourcegtlttarget xmllang=degtDas ist ein
Satzlttargetgtltalt-transgtltalt-trans match-
quality=70 tool=TM_SystemgtltsourcegtThis is a short sentenceltsourcegtlttarget xmllang=degtDies ist ein kurzer
Satzlttargetgtltalt-transgtlttrans-unitgt
lttrans-unit id=1 resname=IDCANCEL restype=button coord=885014gt
ltsource xmllang=engtCancelltsourcegtlttrans-unitgt
bull Information on name of a button and its coordinates (for visual representation of the button in a localization tool)
Where is XLIFF useful
bull Where experience with XML exists
bull Projects contain many different file formats
bull All formats are converted to XLIFF for translation
bull Different tools need to be used during localization
bull Different translations (alt-trans) or languages needed as reference
Any idea why XLIFF should NOT be the cure for everything
bull Instead of developing parsers for different file
formats (to read in the file into a translation tool)
developers now need to create parsers to convert
those file formats to XLIFF
bull Some file formats already can be dealt with
(Office HTML XMLhellip) ndash why should a new parser
be created for those
bull XLIFF has its variants like all XML-based files ndashhow can you make sure that each tool can process any of the possible XLIFF flavours
Q1
Open-Closed
Good
Q2
Open-Open
Good
Q3
Proprietary-Closed
Bad
Q4
Proprietary-Open
Wild
The magic quadrant again justto remember the distinction
Open Standards
Open Source
Closed Source
Proprietary ways
6 Talking Legalesein OSS
OSS (Open Source Software)
bull Copyleft and Permissive Licensing
bull GPL (General Public License by FSF)
bull BSD (Berkeley Software Distribution)
bull Apache License 20
bull MS open source licensing
bull Ms-PL (Public License)
bull Ms-RL (Reciprocal License)
DerivedDerivative works vs dynamic linkage
bull Paradigmatic case is app running on OS
Contributions
Distribution
6 Talking Legalese in standards
RF (Royalty Free) vs. RAND (Reasonable and Non-Discriminatory)
Standards can be licensed under different terms
• RF
• RAND – there is no general consensus to date on whether RAND is Open or Proprietary
• Proprietary
By open standard we mean just RF standards
The opposite of an Open Standard is any proprietary way of doing things, whether standardized or not
7 Open Tools
• Infrastructure baseline
  • PostgreSQL, MySQL, Open ACS, Petri Nets, Alfresco, jBoss
• TM Server
  • TinyTM
• Workflow capabilities through any CALPMS
  • Open: ]po[, .LRN; closed source: AIT Projetex, LTC Worx, Plunet Business Manager, etc.
• CAT Clients
  • OmegaT, Metatexis, TinyTM Word macro, Multilizer, and whoever wants to implement the protocol
• Filters
  • OKAPI framework, mainstream XLIFF generators such as Sun or Adobe
8 Use Cases - odt + OmegaT
• Level 2 TMX support
• Has TinyTM connector in alpha
• Has plaintext glossary support
OpenOffice.org .odt
8 Use Cases - TinyTM
• The TinyTM protocol is licensed under the "LGPL V2.1 or higher"
• The rest of the TinyTM code is licensed under the "GPL V2.0 or higher"
• Translation clients can be licensed under any license
• TinyTM documentation is licensed under the Creative Commons Attribution License
8 Use Cases - TinyTM - Status
• Alpha functionality
• Java TMX importer
• Mark up parsing logic
-------------------------
• Protocol freeze in public discussion on Sourceforge
• Protocol features
  • Conservative extension with many important enhancements
8 Use Cases - TinyTM – new protocol features
• Industry and domain taxonomy
  • open to accommodate future TDA taxonomy research
• Inline markup handling
  1. stripped plain version
  2. normalized version with formatting placeholders
  3. TMX version with full TMX markup
• Advanced leverage functions
  • hash based context storage
  • hash and trigram enhanced fuzzy search
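To make the idea of hash lookup plus trigram-based fuzzy search more concrete, here is a minimal Python sketch. It is not TinyTM's implementation: the function names, the normalization, the use of SHA-1 and the Jaccard measure are all assumptions chosen for illustration only.

```python
import hashlib

def trigrams(text):
    """Return the set of character trigrams of a normalized, padded string."""
    # Assumption: lower-case and collapse whitespace before comparing.
    padded = "  " + " ".join(text.lower().split()) + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def trigram_similarity(a, b):
    """Jaccard similarity of two trigram sets, between 0.0 and 1.0."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

def segment_hash(text):
    """Stable hash of a normalized segment, usable as an exact-match key."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

# Hypothetical in-memory translation memory keyed by segment hash.
memory = {segment_hash("This is a sentence"): "Dies ist ein Satz"}

# Exact lookup survives differences in case and spacing.
exact = memory.get(segment_hash("this is  a sentence"))

# Fuzzy score for a similar but not identical segment.
score = trigram_similarity("This is a sentence", "This is a short sentence")
print(exact, round(score, 2))
```

The hash gives a cheap exact-match test, while the trigram score ranks near misses; a real server would typically compute both inside the database rather than in client code.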
9 Discussion
Thanks for your attention
© 2009 Moravia IT a.s. and Angelika Zerfass
A Guide to Open Standards and Open Source
A Conceptual Case Study
Angelika Zerfass, zerfass@zaac.de
David Filip, PhD, davidf@moraviaworldwide.com
10 References
Open Standards (selective)
• XLIFF 1.2: http://docs.oasis-open.org/xliff/v1.2/os/xliff-core.html
• TMX 1.4b: http://www.lisa.org/fileadmin/standards/tmx1.4/tmx.htm
• SRX 2.0: http://www.lisa.org/fileadmin/standards/srx20.html
Open Tools (selective)
• OKAPI Framework: http://okapi.sourceforge.net
• OmegaT: http://sourceforge.net/projects/omegat
• OmegaT+: http://omegatplus.sourceforge.net
• Open ACS: http://openacs.org
• PostgreSQL: http://www.postgresql.org
• TinyTM: http://tinytm.sourceforge.net
10 References (continued)
Tools continued
• XML to XLIFF to XML: https://sourceforge.net/projects/xliffroundtrip
• TMX Compliance Kit (test for TMX certification): http://www.lisa.org/Translation-Memory-e.34.0.html
• TBX Checker: http://www.lisa.org/TBX-Resources.650.0.html
Standards continued
• Unicode Standard Annex #29: http://www.unicode.org/reports/tr29
• W3C ITS: http://www.w3.org/TR/its
What is TMX?
• It is an XML representation of translation memory data
  • Header
  • Body
<header
  creationtool="Déjà Vu"
  creationtoolversion="4"
  datatype="PlainText"
  segtype="sentence"
  adminlang="en-us"
  srclang="en-us"
  o-tmf="DVMDB"
>
Header attributes
• creationtool: Déjà Vu, Transit, Trados, MemoQ
• creationtoolversion: version / build number of the tool
• datatype: HTML, SGML, RTF, Interleaf, Java…
• segtype: basic segmentation
• adminlang: default language for elements like <note>
• srclang: source text language
• o-tmf: original translation memory format (DVMDB – Déjà Vu database…)
What is TMX?
• Body
<body>
  <tu creationdate="20030915T153704Z" creationid="USER">
    <tuv lang="EN-US"><seg>This is the first sentence</seg></tuv>
    <tuv lang="DE-DE"><seg>Dies ist der erste Satz</seg></tuv>
  </tu>
</body>
tu = translation unit; tuv lang = translation unit variant (language); seg = segment
What is TMX?
• Depending on the tool that created the TMX file, it can be bilingual or multilingual
• Importing a multilingual TMX file into a bilingual project will only import the relevant languages
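The import behaviour described above can be sketched in a few lines of Python using only the standard library. The sample file, the French variant and the function names are invented for illustration; a real importer would also have to handle inline markup, character encodings and header settings.

```python
import xml.etree.ElementTree as ET

# Hypothetical multilingual TMX content (three language variants per unit).
TMX_SAMPLE = """<tmx version="1.4">
  <header creationtool="ExampleTool" creationtoolversion="1"
          datatype="PlainText" segtype="sentence"
          adminlang="en-us" srclang="en-us" o-tmf="Example"/>
  <body>
    <tu>
      <tuv xml:lang="EN-US"><seg>This is the first sentence</seg></tuv>
      <tuv xml:lang="DE-DE"><seg>Dies ist der erste Satz</seg></tuv>
      <tuv xml:lang="FR-FR"><seg>Ceci est la première phrase</seg></tuv>
    </tu>
  </body>
</tmx>"""

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def tuv_language(tuv):
    """Read the language code; newer files use xml:lang, older ones lang."""
    return (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()

def extract_pairs(tmx_text, source_lang, target_lang):
    """Return (source, target) text pairs for one language pair only."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.findall("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                segs[tuv_language(tuv)] = "".join(seg.itertext())
        src = segs.get(source_lang.lower())
        tgt = segs.get(target_lang.lower())
        if src is not None and tgt is not None:
            pairs.append((src, tgt))
    return pairs

# Only the English-German pair is kept; the French variant is ignored.
pairs = extract_pairs(TMX_SAMPLE, "en-us", "de-de")
print(pairs)
```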
Levels of TMX
• Level 1
  • Plain text only (sufficient for data coming from software localization tools)
• Level 2
  • Text plus formatting (data coming from translation memory tools used for translation of documentation)
To move formatting and text from one tool to the other, both tools need to be Level 2 compliant
Level 1
• Formatting that is applied to the source and target text of a translation unit is not exported to the TMX file, only pure text
• Original
  • This sentence has some formatting (shown with formatting applied on the slide)
• In TMX
  • This sentence has some formatting
Level 2
• Formatting that is applied to the source and target text of a translation unit is exported to the TMX file
• Different tools use different ways of encoding that information (placeholders or actual formatting information)
Level 2
MemoQ – Word DOC with formatting
<seg>This is the <bpt i="1" type="bold"></bpt>first<ept i="1"></ept> sentence this is <bpt i="2" type="ulined"></bpt>another<ept i="2"></ept> sentence</seg>
Trados 2009 – Word DOC with formatting
<seg>This is the <bpt i="1" type="Bold" />first<ept i="1" /> sentence this is<bpt i="2" type="Underline" />another<ept i="2" /> sentence</seg>
Trados 2007 (8.2, 8.3) – Word DOC with formatting
<seg>This is the <bpt i="1">&lt;cf bold=&quot;on&quot;&gt;</bpt>first<ept i="1">&lt;/cf&gt;</ept> sentence this is <bpt i="2">&lt;cf underlinestyle=&quot;single&quot;&gt;</bpt>another<ept i="2">&lt;/cf&gt;</ept>sentence</seg>
Level 2
MemoQ – HTML file with link
Trados 2009 – HTML file with link
Trados 2007 (8.2, 8.3) – HTML file with link
<seg>Text with a link to <bpt i="1">&lt;a href=&quot;http://www.samplehtml.com/page1.htm&quot;&gt;</bpt>another page<ept i="1">&lt;/a&gt;</ept></seg>
<seg>Text with a link to <bpt i="1" type="19" x="1" />another page<ept i="1" /></seg>
<seg>Text with a link to <bpt i="1" type="link">&lt;a href = &quot;http://www.samplehtml.com/page1.htm&quot;&gt;</bpt>another page<ept i="1">&lt;/a&gt;</ept></seg>
OmegaT – HTML file with link
OmegaT internal format: <seg>Text with a link to &lt;a0&gt;another page&lt;/a0&gt;</seg>
TMX Level 2 format: <seg>Text with a link to <bpt i="0" x="0">&lt;a0&gt;</bpt>another page<ept i="0">&lt;/a0&gt;</ept></seg>
Level 2
MemoQ – InDesign
Trados 2009 – InDesign
Trados 2007 (8.2, 8.3) – InDesign
<seg>InDesign text with <bpt i="1">&lt;cf ptfs=&quot;c_Bold&quot;&gt;</bpt>formatting in bold<ept i="1">&lt;/cf&gt;</ept></seg>
<seg>InDesign text with <bpt i="1" type="pt16" x="1" />formatting in bold<ept i="1" /></seg>
<seg>InDesign Text with <bpt i="1" type="bold"></bpt>formatting in bold<ept i="1"></ept></seg>
Implications of different tags for formatting
• Tools that use placeholder tags do not include the actual formatting information in the TMX file
• Other tools might only be able to re-use the text, especially if the formatting is only applied in the target segment but not in the source
• The result of the exchange would then be the same as with TMX Level 1 (text only)
• TMX files which carry the actual formatting information will yield better matches in other tools that can read this information
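The fallback to Level 1 described above can be illustrated with a short Python sketch that reduces a Level 2 segment to plain text by dropping the inline elements. The function name and the choice of which elements to discard are assumptions made for this example; it shows that a placeholder-style segment and a segment carrying native formatting codes end up as the same text once the markup is ignored.

```python
import xml.etree.ElementTree as ET

# Inline TMX elements whose content is formatting code, not translatable text.
CODE_ELEMENTS = {"bpt", "ept", "ph", "it", "ut"}

def plain_text(seg_xml):
    """Return the text of a <seg> with inline formatting codes removed."""
    seg = ET.fromstring(seg_xml)
    parts = [seg.text or ""]
    for child in seg:
        if child.tag not in CODE_ELEMENTS:
            # Elements such as <hi> wrap real text, so keep their content.
            parts.append("".join(child.itertext()))
        # Text following the inline element always belongs to the segment.
        parts.append(child.tail or "")
    return "".join(parts)

# Placeholder style: the tags carry no native formatting code.
placeholder_seg = (
    '<seg>This is the <bpt i="1" type="Bold" />first'
    '<ept i="1" /> sentence</seg>'
)

# Native-code style: the tags wrap the escaped original markup.
native_seg = (
    '<seg>This is the <bpt i="1">&lt;cf bold=&quot;on&quot;&gt;</bpt>first'
    '<ept i="1">&lt;/cf&gt;</ept> sentence</seg>'
)

print(plain_text(placeholder_seg))
print(plain_text(native_seg))
```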
Where do you use TMX?
• Transferring data between different translation memory tools
• Checking tools / QA tools
• TM maintenance tools
• Basis for bilingual term extraction
Reusing TMX data
• Although translation memory tools share the same basic idea (storing source-target language pairs and recycling translations), it has been realized in different ways
• Exchange with TMX works, but there is an issue that can nonetheless lower the match rates… the segmentation rules
SRX – Segmentation Rules eXchange
• From the SRX specification
  • "…The purpose of the SRX format is to provide a standard method to describe segmentation rules that are being exchanged among tools and/or translation vendors"
  • "…is intended to enhance the TMX standard…"
Why SRX?
• Tool A
  • Semicolon is end of segment
  • This is a sentence; this is another sentence.
  • TM system sees two separate segments
• Tool B
  • Semicolon is NOT end of segment
  • This is a sentence; this is another sentence.
  • TM system sees one segment
  • No match from the TMX data
  • Match rate around 50%, while the usual minimum match setting is around 70%
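The effect can be reproduced with a small Python sketch. The segmenter below is a deliberately naive stand-in for a tool's rule set, and the similarity measure (difflib's ratio) is an assumption; commercial TM tools use their own scoring, so the numbers printed here will not equal the figures quoted on the slide.

```python
import re
from difflib import SequenceMatcher

def segment(text, break_on_semicolon):
    """Split text after sentence-final marks, optionally also after ';'."""
    marks = ".!?;" if break_on_semicolon else ".!?"
    # Break at whitespace that directly follows one of the marks.
    pattern = "(?<=[" + re.escape(marks) + "])\\s+"
    return [s for s in re.split(pattern, text.strip()) if s]

text = "This is a sentence; this is another sentence."

# Tool A treats the semicolon as a segment end, Tool B does not.
tool_a_segments = segment(text, break_on_semicolon=True)
tool_b_segments = segment(text, break_on_semicolon=False)

# Tool B looks up its single long segment in a memory filled by Tool A.
scores = [
    SequenceMatcher(None, tool_b_segments[0], stored).ratio()
    for stored in tool_a_segments
]

print(tool_a_segments)
print(tool_b_segments)
print([round(s, 2) for s in scores])
```

None of the stored segments is an exact match for the longer segment, which is why exchanging the segmentation rules along with the TMX data matters.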
Segmentation rules
• Rules that the tool applies to the text to be translated, to split it up into segments
  • paragraph
  • sentence
  • phrase
  • incomplete sentences in bulleted lists
  • single words (headings, "Note", "Attention")
Segmentation rules
• End of segment rules (common to the default settings of all tools)
  • Dot at the end of a sentence (not after known abbreviations)
  • Question mark, exclamation mark
  • Paragraph mark
  • Colon
• End of segment rules (different for different tools)
  • Semicolon
  • Tab character
  • Sub segments (index entries, footnotes, graphics)
Comparison of default rules
             Workbench  Transit  DV                          SDLX                        Across
Colon        end        end      end                         no end                      no end
Semicolon    no end     end      end                         no end                      no end
Tab          end        no end   no end                      no end                      no end
Soft return  no end     no end   end in Word, no end in PPT  end in Word, no end in PPT  no end
What can SRX do, and what not?
• It can only show the segmentation rule settings at the time of export
• It cannot show any changes that have been applied to the segmentation rules during the use of the TM
• Sometimes the rules from system 1 cannot be re-created in system 2; the rule will then be ignored
TBX – TermBase eXchange
• From the working draft of the TBX specification
  • TBX is an open, XML-based standard format for terminological data
  • TBX is designed to support the analysis, representation, dissemination and exchange of information from human-oriented terminological databases (termbases)
  • TBX is built on the basis of ISO 12620 (data categories) and ISO 12200 (MARTIF core structure)
TMX TBX
[Figure: annotated sample term entry. Callouts: global information in entry head; language ID (one per language section); administrative data of this language; information on term level; term in English; term in French]
Where could you use TBX?
• Exchange of terminological data
  • From term base to source language checking tools
  • Between term bases of translation memory tools
  • From term base to terminology checking tools
  • From term base to terminology extraction tools (as stop word lists)
  • From term base to dictionary of a machine translation tool
• For indexing keywords in document management systems, content management systems, knowledge management systems
• Publishing terminological data on the intranet / Internet
• Optimization of search engines / text mining by searching for synonyms automatically
XLIFF – XML Localization Interchange File Format
• The XLIFF format aims to
  • separate localizable text from formatting (deal with one file format in translation instead of different processes to extract, filter and convert text from different file formats)
  • enable multiple tools to work on the source text
  • store information that is helpful in supporting a localization process (like metadata on versions of source and target segments)
• An XLIFF file is bilingual and can be the container for a number of individual files
• Each file element in an XLIFF file contains a header (project data such as contact information, project phases, pointers to reference material and information on the skeleton file) and a body section with the actual text segments
XLIFF
• XLIFF can carry several translation matches
• Additional fields can contain context, author, creation tool, history…
<trans-unit id="n1">
  <source>This is a sentence</source>
  <target xml:lang="de">Dies ist ein Satz</target>
  <alt-trans match-quality="100" tool="TM_System">
    <source>This is a sentence</source>
    <target xml:lang="de">Das ist ein Satz</target>
  </alt-trans>
  <alt-trans match-quality="70" tool="TM_System">
    <source>This is a short sentence</source>
    <target xml:lang="de">Dies ist ein kurzer Satz</target>
  </alt-trans>
</trans-unit>
<trans-unit id="1" resname="IDCANCEL" restype="button" coord="885014">
  <source xml:lang="en">Cancel</source>
</trans-unit>
• Information on the name of a button and its coordinates (for visual representation of the button in a localization tool)
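As an illustration of how a tool might use the alternative matches carried in a trans-unit, the following Python sketch picks the alt-trans entry with the highest match-quality. The function name and the threshold parameter are hypothetical, and the sample omits the XLIFF namespace that a complete XLIFF 1.2 document would declare on its root element.

```python
import xml.etree.ElementTree as ET

# Sample trans-unit modelled on the slide; no XLIFF namespace is declared.
TRANS_UNIT = """<trans-unit id="n1">
  <source>This is a sentence</source>
  <target xml:lang="de">Dies ist ein Satz</target>
  <alt-trans match-quality="100" tool="TM_System">
    <source>This is a sentence</source>
    <target xml:lang="de">Das ist ein Satz</target>
  </alt-trans>
  <alt-trans match-quality="70" tool="TM_System">
    <source>This is a short sentence</source>
    <target xml:lang="de">Dies ist ein kurzer Satz</target>
  </alt-trans>
</trans-unit>"""

def best_alt_trans(trans_unit_xml, minimum_quality=0.0):
    """Return (quality, target text) of the best alt-trans, or None."""
    unit = ET.fromstring(trans_unit_xml)
    candidates = []
    for alt in unit.findall("alt-trans"):
        # match-quality is a string; tolerate a trailing percent sign.
        quality = float(alt.get("match-quality", "0").rstrip("%"))
        target = alt.find("target")
        if target is not None and quality >= minimum_quality:
            candidates.append((quality, "".join(target.itertext())))
    return max(candidates, default=None)

best = best_alt_trans(TRANS_UNIT)
print(best)
```

Passing a higher `minimum_quality` mimics the fuzzy-match threshold of a translation tool: matches below it are simply not offered.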
What is TMX
bull Depending on the tool that created the TMX file it can be bilingual or multilingual
bull Importing multilingual TMX file into a bilingual project will only import the relevant languages
Levels of TMX
bull Level 1bull Plain text only (sufficient for data coming from software localization tools)
bull Level 2bull Text plus formatting (data coming from translation memory tools used for translation of documentation)
To move formatting and text from one tool to the other both tools need to be level2 compliant
Level 1
bull Formatting that is applied to the source and target text of a translation unit is not exported to the TMX file only pure text
bull Original
bull This sentence has some formatting
bull In TMX
bull This sentence has some formatting
Level 2
bull Formatting that is applied to the source and target text of a translation unit is exported to the TMX file
bull Different tools use different ways of encoding that information (placeholders or actual formatting information)
Level 2
seggtThis is the ltbpt i=1 type=boldgtltbptgtfirstltept i=1gtlteptgt sentence this is ltbpt i=2 type=ulinedgtltbptgtanotherltept i=2gtlteptgt sentenceltseggt
MemoQ ndash Word DOC with formatting
ltseggtThis is the ltbpt i=1 type=Bold gtfirstltept i=1 gt sentence this isltbpt i=2 type=Underline gtanotherltept i=2 gt sentenceltseggt
Trados 2009 ndash Word DOC with formatting
ltseggtThis is the ltbpt i=1gtampltcf bold=ampquotonampquotampgtltbptgtfirstltept i=1gtampltcfampgtlteptgt sentence this is ltbpt i=2gtampltcf underlinestyle=ampquotsingleampquotampgtltbptgtanotherltept i=2gtampltcfampgtlteptgtsentenceltseggt
Trados 2007 82 83 ndash Word DOC with formatting
Level 2
MemoQ – HTML file with link
Trados 2009 – HTML file with link
Trados 2007 (8.2, 8.3) – HTML file with link
<seg>Text with a link to <bpt i="1">&lt;a href=&quot;http://www.samplehtml.com/page1.htm&quot;&gt;</bpt>another page<ept i="1">&lt;/a&gt;</ept></seg>
<seg>Text with a link to <bpt i="1" type="19" x="1" />another page<ept i="1" /></seg>
<seg>Text with a link to <bpt i="1" type="link">&lt;a href = &quot;http://www.samplehtml.com/page1.htm&quot;&gt;</bpt>another page<ept i="1">&lt;/a&gt;</ept></seg>
OmegaT internal format: <seg>Text with a link to &lt;a0&gt;another page&lt;/a0&gt;</seg>
TMX Level 2 format: <seg>Text with a link to <bpt i="0" x="0">&lt;a0&gt;</bpt>another page<ept i="0">&lt;/a0&gt;</ept></seg>
OmegaT – HTML file with link
Level 2
MemoQ – InDesign
Trados 2009 – InDesign
Trados 2007 (8.2, 8.3) – InDesign
<seg>InDesign text with <bpt i="1">&lt;cf ptfs=&quot;c_Bold&quot;&gt;</bpt>formatting in bold<ept i="1">&lt;/cf&gt;</ept></seg>
<seg>InDesign text with <bpt i="1" type="pt16" x="1" />formatting in bold<ept i="1" /></seg>
<seg>InDesign Text with <bpt i="1" type="bold"></bpt>formatting in bold<ept i="1"></ept></seg>
Implications of different tags for formatting
• Tools that use placeholder tags do not include the actual formatting information in the TMX file
• Other tools might only be able to re-use the text, especially if the formatting is only applied in the target segment but not in the source
• The result of the exchange would then be the same as with TMX Level 1 (text only)
• TMX files which carry the actual formatting information will yield better matches in other tools that can read this information
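To make the Level 1 fallback concrete, here is a minimal Python sketch, assuming the segment markup is well-formed XML and that inline codes appear only as direct children of the segment (nested elements such as hi or sub are not handled). The function name seg_to_plain_text is hypothetical; the two sample segments follow the placeholder style and the native-code style shown on the slides above.

```python
import xml.etree.ElementTree as ET


def seg_to_plain_text(seg_xml: str) -> str:
    """Return the translatable text of a TMX <seg>, dropping inline tags."""
    seg = ET.fromstring(seg_xml)
    parts = [seg.text or ""]
    for inline in seg:
        # Whatever sits inside <bpt>/<ept> is native formatting code, not
        # translatable text, so only the text following each tag is kept.
        parts.append(inline.tail or "")
    return "".join(parts)


# Placeholder style: the tag only says "bold starts here".
placeholder_style = (
    '<seg>This is the <bpt i="1" type="bold"></bpt>first'
    '<ept i="1"></ept> sentence</seg>'
)
# Native-code style: the tag carries the source tool's own formatting code.
native_code_style = (
    '<seg>This is the <bpt i="1">&lt;cf bold=&quot;on&quot;&gt;</bpt>first'
    '<ept i="1">&lt;/cf&gt;</ept> sentence</seg>'
)

print(seg_to_plain_text(placeholder_style))
print(seg_to_plain_text(native_code_style))
```

Both segments reduce to the same plain text, which is all a receiving tool can reuse when it cannot interpret the other tool's inline codes.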
Where do you use TMX?
• Transferring data between different translation memory tools
• Checking tools, QA tools
• TM maintenance tools
• Basis for bilingual term extraction
Reusing TMX data
• Although translation memory tools share the same basic idea (storing source-target language pairs and recycling translations), they have implemented it in different ways
• Exchange with TMX works, but one issue can still lower the match rates: the segmentation rules
SRX – Segmentation Rules Exchange
• From the SRX specification
  • “…The purpose of the SRX format is to provide a standard method to describe segmentation rules that are being exchanged among tools and/or translation vendors”
  • “…is intended to enhance the TMX standard…”
Why SRX?
• Tool A
  • Semicolon is end of segment
  • This is a sentence; this is another sentence.
  • TM system sees two separate segments
• Tool B
  • Semicolon is NOT end of segment
  • This is a sentence; this is another sentence.
  • TM system sees one segment
    • No match from the TMX data
    • Match rate around 50%, usual setting around 70%
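The Tool A / Tool B mismatch can be reproduced with a few lines of Python. This is only a sketch: the segment function is a hypothetical stand-in for a tool's segmenter, and difflib's similarity ratio is used as a generic string comparison, not as any TM tool's fuzzy-match algorithm, so the number it prints should not be read as a real match rate.

```python
import re
from difflib import SequenceMatcher

TEXT = "This is a sentence; this is another sentence."


def segment(text, break_chars):
    """Split text after any character in break_chars that is followed by whitespace."""
    pattern = "(?<=[" + re.escape(break_chars) + r"])\s+"
    return [s for s in re.split(pattern, text) if s]


tool_a = segment(TEXT, ".!?;")  # Tool A: semicolon ends a segment
tool_b = segment(TEXT, ".!?")   # Tool B: semicolon does not end a segment

# Tool A's TM holds two short units; Tool B looks up one long segment.
best = max(SequenceMatcher(None, tool_b[0], unit).ratio() for unit in tool_a)

print(tool_a)
print(tool_b)
print(round(best, 2))
```

Neither stored unit is an exact match for Tool B's segment, so the translation is at best offered as a fuzzy match, or not at all if it falls below the tool's threshold.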
Segmentation rules
• Rules that the tool applies to the text to translate to split it up into segments
  • paragraph
  • sentence
  • phrase
  • incomplete sentences in bulleted lists
  • single words (headings, “Note”, “Attention”)
Segmentation rules
• End of segment rules (common to the default settings of all tools)
  • Dot at the end of a sentence (not after known abbreviations)
  • Question mark, exclamation mark
  • Paragraph mark
  • Colon
• End of segment rules (different for different tools)
  • Semicolon
  • Tab character
  • Sub segments (index entries, footnotes, graphics)
Comparison of default rules
              Workbench   Transit   DV                           SDLX                         Across
Colon         end         end       end                          no end                       no end
Semicolon     no end      end       end                          no end                       no end
Tab           end         no end    no end                       no end                       no end
Soft return   no end      no end    end in Word, no end in PPT   end in Word, no end in PPT   no end
What can SRX do, and what not?
• It can only show the segmentation rule settings at the time of export
• It cannot show any changes that were made to the segmentation rules while the TM was in use
• Sometimes a rule from system 1 cannot be re-created in system 2; that rule will then be ignored
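SRX describes each rule as a break or no-break decision with a pattern for the text before the candidate position and one for the text after it. The following Python sketch is a simplified reading of that idea, not an SRX processor: it ignores language maps and cascading, uses Python's regex dialect rather than the one the specification prescribes, and the RULES list and split_segments name are illustrative assumptions.

```python
import re

# (break?, pattern before the position, pattern after it), in priority order.
RULES = [
    (False, r"\b(?:e\.g|i\.e|etc)\.", r"\s"),  # no break after known abbreviations
    (True, r"[.?!]", r"\s"),                   # break after sentence-final punctuation
    (True, r";", r"\s"),                       # tool-specific: semicolon ends a segment
]


def split_segments(text, rules):
    """Split text at every position where the first matching rule says 'break'."""
    segments, start = [], 0
    for pos in range(1, len(text)):
        for is_break, before, after in rules:
            before_ok = re.search("(?:" + before + ")$", text[:pos])
            after_ok = re.match(after, text[pos:])
            if before_ok and after_ok:
                if is_break:
                    segments.append(text[start:pos])
                    start = pos
                break  # the first matching rule decides this position
    segments.append(text[start:])
    return [s.strip() for s in segments if s.strip()]


sample = "Use a CAT tool, e.g. OmegaT. It is free; try it."
print(split_segments(sample, RULES))      # with the semicolon rule
print(split_segments(sample, RULES[:2]))  # without it
```

Dropping one rule, as a receiving system does when it cannot re-create it, changes the segment boundaries and therefore what the TM can match.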
TBX – TermBase Exchange
• From the working draft of the TBX specification
  • TBX is an open, XML-based standard format for terminological data
  • TBX is designed to support the analysis, representation, dissemination and exchange of information from human-oriented terminological databases (termbases)
  • TBX is built on the basis of ISO 12620 (data categories) and ISO 12200 (MARTIF core structure)
TMX TBX
[Figure: annotated TBX term entry. Callouts: global information in entry head; language ID and administrative data for each language; term in English; term in French; information on term level]
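The structure of such an entry can be sketched in Python as follows. The sample entry is invented for illustration (it is not the one from the slide), it uses only a handful of TBX element names (termEntry, descrip, langSet, tig, term), and the helper terms_by_language is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predefined xml: prefix to this namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Illustrative entry: global information in the entry head, then one
# language section per language, each holding its terms.
SAMPLE_ENTRY = """
<termEntry id="c1">
  <descrip type="subjectField">localization</descrip>
  <langSet xml:lang="en">
    <tig><term>translation memory</term></tig>
  </langSet>
  <langSet xml:lang="fr">
    <tig><term>mémoire de traduction</term></tig>
  </langSet>
</termEntry>
"""


def terms_by_language(entry_xml):
    """Map each language ID in a term entry to the list of its terms."""
    entry = ET.fromstring(entry_xml)
    return {
        lang_set.get(XML_LANG): [term.text for term in lang_set.iter("term")]
        for lang_set in entry.iter("langSet")
    }


print(terms_by_language(SAMPLE_ENTRY))
```

A checking or extraction tool would typically start from a mapping like this and then read term-level notes as needed.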
Where could you use TBX?
• Exchange terminological data
  • From term base to source language checking tools
  • Between term bases of translation memory tools
  • From term base to terminology checking tools
  • From term base to terminology extraction tools (as stop word lists)
  • From term base to dictionary of a machine translation tool
• For indexing keywords in document management systems, content management systems, knowledge management systems
• Publishing terminological data on the intranet or Internet
• Optimization of search engines and text mining by searching for synonyms automatically
XLIFF – XML Localization Interchange File Format
• The XLIFF format aims to
  • separate localizable text from formatting (deal with one file format in translation instead of different processes to extract, filter, convert text from different file formats)
  • enable multiple tools to work on the source text
  • store information that is helpful in supporting a localization process (like metadata on versions of source and target segments)
• An XLIFF file is bilingual and can be the container for a number of individual files
• Each file element in an XLIFF file contains a header (project data such as contact information, project phases, pointers to reference material and information on the skeleton file) and a body section with the actual text segments
XLIFF
• XLIFF can carry several translation matches
• Additional fields can contain context, author, creation tool, history…
<trans-unit id="n1">
  <source>This is a sentence</source>
  <target xml:lang="de">Dies ist ein Satz</target>
  <alt-trans match-quality="100" tool="TM_System">
    <source>This is a sentence</source>
    <target xml:lang="de">Das ist ein Satz</target>
  </alt-trans>
  <alt-trans match-quality="70" tool="TM_System">
    <source>This is a short sentence</source>
    <target xml:lang="de">Dies ist ein kurzer Satz</target>
  </alt-trans>
</trans-unit>
<trans-unit id="1" resname="IDCANCEL" restype="button" coord="885014">
  <source xml:lang="en">Cancel</source>
</trans-unit>
• Information on name of a button and its coordinates (for visual representation of the button in a localization tool)
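A tool consuming such a unit might list the alternative translations and filter them by match quality. The Python sketch below assumes a bare trans-unit without the XLIFF namespace declaration that a complete file would carry, and assumes match-quality holds a plain integer; alt_translations is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

UNIT = """
<trans-unit id="n1">
  <source>This is a sentence</source>
  <target xml:lang="de">Dies ist ein Satz</target>
  <alt-trans match-quality="100" tool="TM_System">
    <source>This is a sentence</source>
    <target xml:lang="de">Das ist ein Satz</target>
  </alt-trans>
  <alt-trans match-quality="70" tool="TM_System">
    <source>This is a short sentence</source>
    <target xml:lang="de">Dies ist ein kurzer Satz</target>
  </alt-trans>
</trans-unit>
"""


def alt_translations(unit_xml, min_quality=0):
    """Return (quality, target text) pairs for alt-trans entries, best first."""
    unit = ET.fromstring(unit_xml)
    matches = []
    for alt in unit.findall("alt-trans"):
        # Assumption: the attribute is a plain integer such as "70".
        quality = int(alt.get("match-quality", "0"))
        if quality >= min_quality:
            matches.append((quality, alt.findtext("target")))
    return sorted(matches, reverse=True)


print(alt_translations(UNIT))
print(alt_translations(UNIT, min_quality=75))
```

Raising min_quality mimics the match threshold a translation tool applies before offering a stored translation.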
Where is XLIFF useful?
• Where experience with XML exists
• Projects contain many different file formats
  • All formats are converted to XLIFF for translation
• Different tools need to be used during localization
• Different translations (alt-trans) or languages needed as reference
Any idea why XLIFF should NOT be the cure for everything?
• Instead of developing parsers for different file formats (to read the file into a translation tool), developers now need to create parsers to convert those file formats to XLIFF
• Some file formats can already be dealt with (Office, HTML, XML…) – why should a new parser be created for those?
• XLIFF has its variants, like all XML-based files – how can you make sure that each tool can process any of the possible XLIFF flavours?
The magic quadrant again, just to remember the distinction
Q1 Open-Closed: Good
Q2 Open-Open: Good
Q3 Proprietary-Closed: Bad
Q4 Proprietary-Open: Wild
Axes: Open Standards vs. Proprietary ways; Open Source vs. Closed Source
6 Talking Legalese in OSS
OSS (Open Source Software)
• Copyleft and Permissive Licensing
  • GPL (General Public License by FSF)
  • BSD (Berkeley Software Distribution)
  • Apache License 2.0
• MS open source licensing
  • Ms-PL (Public License)
  • Ms-RL (Reciprocal License)
Derived/Derivative works vs. dynamic linkage
• Paradigmatic case is app running on OS
Contributions
Distribution
6 Talking Legalese in standards
RF (Royalty Free) vs. RAND (Reasonable and Non-Discriminatory)
Standards can be licensed under different terms
• RF
• RAND – there is no general consensus to date on whether RAND is Open or Proprietary
• Proprietary
By open standard we mean just RF standards
The opposite of Open Standard is any proprietary way of doing things, be it standardized
7 Open Tools
• Infrastructure baseline
  • PostgreSQL, MySQL, Open ACS, Petri Nets, Alfresco, jBoss
• TM Server
  • TinyTM
• Workflow capabilities through any CALPMS
  • Open: ]po[, .LRN; closed source: AIT Projetex, LTC Worx, Plunet Business Manager, etc.
• CAT Clients
  • OmegaT, Metatexis, TinyTM Word macro, Multilizer and whoever wants to implement the protocol
• Filters
  • OKAPI framework, mainstream XLIFF generators such as Sun or Adobe
8 Use Cases - odt + OmegaT
• Level 2 TMX support
• Has TinyTM connector in alpha
• Has plaintext glossary support
OpenOffice.org .odt
8 Use Cases - TinyTM
• TinyTM protocol is licensed under the “LGPL V2.1 or higher”
• Rest of the TinyTM code is licensed under the “GPL V2.0 or higher”
• Translation clients can be licensed under any license
• TinyTM documentation is licensed under the Creative Commons Attribution License
8 Use Cases - TinyTM - Status
• Alpha functionality
• Java TMX importer
• Mark up parsing logic
• Protocol freeze in public discussion on Sourceforge
• Protocol features
  • Conservative extension with many important enhancements
8 Use Cases - TinyTM – new protocol features
• Industry and domain taxonomy
  • open to accommodate future TDA taxonomy research
• Inline markup handling
  1. stripped plain version
  2. normalized version with formatting placeholders
  3. TMX version with full TMX markup
• Advanced leverage functions
  • hash based context storage
  • hash and trigram enhanced fuzzy search
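The slides do not say how TinyTM implements these leverage functions, so the following Python sketch only illustrates the two general techniques named above: a hash over the neighbouring segments as a context key, and trigram overlap as a fuzzy similarity score. All function names, the padding scheme and the choice of SHA-1 are assumptions made for the example.

```python
import hashlib


def trigrams(text):
    """Return the set of 3-character windows of the padded, lower-cased text."""
    padded = "  " + text.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a, b):
    """Jaccard overlap of the two trigram sets, between 0.0 and 1.0."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def context_hash(previous_segment, next_segment):
    """Fixed-length key for the surrounding segments of a translation unit."""
    data = (previous_segment + "\x00" + next_segment).encode("utf-8")
    return hashlib.sha1(data).hexdigest()


print(trigram_similarity("This is a sentence", "This is a short sentence"))
print(context_hash("Previous segment.", "Next segment."))
```

Storing the context key next to each unit lets a lookup prefer a match that also appeared between the same neighbours, while the trigram score ranks near matches when no exact one exists.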
9 Discussion
Thanks for your attention
10 References
Open Standards (selective)
• XLIFF 1.2: http://docs.oasis-open.org/xliff/v1.2/os/xliff-core.html
• TMX 1.4b: http://www.lisa.org/fileadmin/standards/tmx1.4/tmx.htm
• SRX 2.0: http://www.lisa.org/fileadmin/standards/srx20.html
Open Tools (selective)
• OKAPI Framework: http://okapi.sourceforge.net
• OmegaT: http://sourceforge.net/projects/omegat
• OmegaT+: http://omegatplus.sourceforge.net
• Open ACS: http://openacs.org
• PostgreSQL: http://www.postgresql.org
• TinyTM: http://tinytm.sourceforge.net
10 References
Tools continued
• XML to XLIFF to XML: https://sourceforge.net/projects/xliffroundtrip
• TMX Compliance Kit (test for TMX certification): httpwwwlisaorgTranslation-Memory-e340html
• TBX Checker: httpwwwlisaorgTBX-Resources6500html
Standards continued
• Unicode Standard Annex #29: http://www.unicode.org/reports/tr29
• W3C ITS: http://www.w3.org/TR/its
Level 1
bull Formatting that is applied to the source and target text of a translation unit is not exported to the TMX file only pure text
bull Original
bull This sentence has some formatting
bull In TMX
bull This sentence has some formatting
Level 2
bull Formatting that is applied to the source and target text of a translation unit is exported to the TMX file
bull Different tools use different ways of encoding that information (placeholders or actual formatting information)
Level 2
seggtThis is the ltbpt i=1 type=boldgtltbptgtfirstltept i=1gtlteptgt sentence this is ltbpt i=2 type=ulinedgtltbptgtanotherltept i=2gtlteptgt sentenceltseggt
MemoQ ndash Word DOC with formatting
ltseggtThis is the ltbpt i=1 type=Bold gtfirstltept i=1 gt sentence this isltbpt i=2 type=Underline gtanotherltept i=2 gt sentenceltseggt
Trados 2009 ndash Word DOC with formatting
ltseggtThis is the ltbpt i=1gtampltcf bold=ampquotonampquotampgtltbptgtfirstltept i=1gtampltcfampgtlteptgt sentence this is ltbpt i=2gtampltcf underlinestyle=ampquotsingleampquotampgtltbptgtanotherltept i=2gtampltcfampgtlteptgtsentenceltseggt
Trados 2007 82 83 ndash Word DOC with formatting
Level 2
MemoQ ndash HTML file with link
Trados 2009 ndash HTML file with link
Trados 2007 82 83 ndash HTML file with link
ltseggtText with a link to ltbpt i=1gtamplta href=ampquothttpwwwsamplehtmlcompage1htmampquotampgtltbptgtanother pageltept i=1gtampltaampgtlteptgtltseggt
ltseggtText with a link to ltbpt i=1 type=19 x=1 gtanother pageltept i=1 gtltseggt
ltseggtText with a link to ltbpt i=1 type=linkgtamplta href = ampquothttpwwwsamplehtmlcompage1htmampquotampgtltbptgtanother pageltept i=1gtampltaampgtlteptgtltseggt
OmegaT internal format ltseggtText with a link to amplta0ampgtanother pageamplta0ampgtltseggt
TMX Level 2 format ltseggtText with a link to ltbpt i=0 x=0gtamplta0ampgtltbptgtanother pageltept i=0gtamplta0ampgtlteptgtltseggt
OmegaT - HTML file with link
Level 2
MemoQ ndash InDesign
Trados 2009 ndash InDesign
Trados 2007 82 83 ndash InDesign
ltseggtInDesign text with ltbpt i=1gtampltcf ptfs=ampquotc_Boldampquotampgtltbptgtformatting in boldltept i=1gtampltcfampgtlteptgtltseggt
ltseggtInDesign text with ltbpt i=1 type=pt16 x=1 gtformatting in boldltept i=1 gtltseggt
ltseggtInDesign Text with ltbpt i=1 type=boldgtltbptgtformtatting in boldltept i=1gtlteptgtltseggt
Implications of different tags for formatting
bull Tools that use placeholder tags do not include the actual formatting information in the TMX file
bull Other tools might only be able to re-use the text especially if the formatting is only applied in the target segment but not in the source
bull The result of the exchange would then be the same as with TMX level 1 (text only)
bull TMX files which carry the actual formatting information will yield better matches in other tools that can read this information
Where do you use TMX
bull Transfering data between different translation memory tools
bull Checking tools QA tools
bull TM maintenance tools
bull Basis for bilingual term extraxtion
Reusing TMX data
bull Although Translation Memory Tools have the same basic idea (storing source-target language pairs and recycling translations) this has been realized in different ways
bull Exchange with TMX works but there is an issue that can lower the match rates nonethelesshellip the segmentation rules
SRX ndash Segmentation Rules Exchange
bull From the SRX specification
bull hellipThe purpose of the SRX format is to provide a standard method to describe segmentation rules that are being exchanged among tools andor translation vendors
bull hellipis intended to enhance the TMX standardhellip
Why SRX
bull Tool Abull Semicolon is end of segment
bull This is a sentence this is another sentence
bull TM system sees two separate segments
bull Tool Bbull Semicolon is NOT end of segment
bull This is a sentence this is another sentence
bull TM system sees one segmentbull No match from the TMX data
bull Match rate around 50 usual setting around 70
Segmentation rules
bull Rules that the tool applies to the text to translate to split it up into segments
bull paragraph
bull sentence
bull phrase
bull incomplete sentences in bulleted lists
bull single words (headings ldquoNoterdquo ldquoAttentionrdquo)
Segmentation rules
bull End of segment rules (common to the default settings of all tools)bull Dot at the end of a sentence (not after known
abbreviations)bull Question mark exclamation markbull Paragraph markbull Colon
bull End of segment rules (different for different tools)bull Semicolonbull Tab characterbull Sub segments (index entries footnotes
graphics)
Comparison of default rules
Workbench Transit DV SDLX Across
Colon end end end no end no end
Semi-
colon
no end end end no end no end
Tab end no end no end no end no end
Soft
return
no end no end end in
Word no
end in
PPT
end in
Word no
end in
PPT
no end
What can SRX do and what not
bull It can only show the segmentation rule settings at the time of export
bull It cannot show any changes that have been applied in the segmentation rules during the use of the TM
bull Sometimes the rules from system 1 cannot be re-created in system 2 then the rule will be ignored
TBX ndash TermBase Exchange
bull From the working draft of the TBX specificationbull TBX is an open XML-based standard format for terminological data
bull TBX is designed to support the analysis representation dissemination and exchange of information from human-oriented terminological databases (termbases)
bull TBX is built on the basis of ISO 12620 (data categories) and ISO 12200 (MARTIF core structure)
TMX TBX
Zerfasszaacde 35
Term in English
Term in French
Global information in entry head
Information on term level
Administrative data of this language
Language ID
Language ID
Where could you use TBX
bull Exchange terminological databull From term base to source language checking toolsbull Between term bases of translation memory toolsbull From term base to terminology checking toolsbull From term base to terminology extraction tools (as stop word lists)bull From term base to dictionary of a machine translation tool
bull For indexing keywords in document management systems content management systems knowledge management systems
bull Publishing terminological data on the Intranet Internet
bull Optimization of search enginges text mining by searching for synonyms automatically
XLIFF ndash XML Localization Interchange File Format
bull The XLIFF format aims to bull separate localizable text from formatting (deal with one file
format in translation instead of different processes to extract filter convert text from different file formats)
bull enable multiple tools to work on the source textbull store information that is helpful in supporting a localization
process (like meta data on versions of source and target segemtns)
bull An XLIFF file is bilingual and can be the container for a number of individual files
bull Each file element in an XLIFF file contains a header (project data such as contact information project phases pointers to reference material and information on the skeleton file) and a body section with the actual text segments
XLIFF
bull XLIFF can carry several translation matches
bull Additional fields can contain context author creation tool historyhellip
lttrans-unit id=n1gtltsourcegtThis is a sentenceltsourcegtlttarget xmllang=degtDies ist ein
Satzlttargetgtltalt-trans match-
quality=100 tool=TM_SystemgtltsourcegtThis is a sentenceltsourcegtlttarget xmllang=degtDas ist ein
Satzlttargetgtltalt-transgtltalt-trans match-
quality=70 tool=TM_SystemgtltsourcegtThis is a short sentenceltsourcegtlttarget xmllang=degtDies ist ein kurzer
Satzlttargetgtltalt-transgtlttrans-unitgt
lttrans-unit id=1 resname=IDCANCEL restype=button coord=885014gt
ltsource xmllang=engtCancelltsourcegtlttrans-unitgt
bull Information on name of a button and its coordinates (for visual representation of the button in a localization tool)
Where is XLIFF useful
bull Where experience with XML exists
bull Projects contain many different file formats
bull All formats are converted to XLIFF for translation
bull Different tools need to be used during localization
bull Different translations (alt-trans) or languages needed as reference
Any idea why XLIFF should NOT be the cure for everything
bull Instead of developing parsers for different file
formats (to read in the file into a translation tool)
developers now need to create parsers to convert
those file formats to XLIFF
bull Some file formats already can be dealt with
(Office HTML XMLhellip) ndash why should a new parser
be created for those
bull XLIFF has its variants like all XML-based files ndashhow can you make sure that each tool can process any of the possible XLIFF flavours
The magic quadrant again, just to remember the distinction

                   Closed Source                   Open Source
Open Standards     Q1: Open-Closed (Good)          Q2: Open-Open (Good)
Proprietary ways   Q3: Proprietary-Closed (Bad)    Q4: Proprietary-Open (Wild)
6 Talking Legalese in OSS
OSS (Open Source Software)
• Copyleft and Permissive Licensing
  • GPL (General Public License by FSF)
  • BSD (Berkeley Software Distribution)
  • Apache License 2.0
• MS open source licensing
  • Ms-PL (Public License)
  • Ms-RL (Reciprocal License)
Derived/Derivative works vs. dynamic linkage
• Paradigmatic case is an app running on an OS
Contributions
Distribution
6 Talking Legalese in standards
RF (Royalty Free) vs. RAND (Reasonable and Non-Discriminatory)
Standards can be licensed under different terms
• RF
• RAND – there is no general consensus to date on whether RAND is Open or Proprietary
• Proprietary
By Open Standard we mean RF standards only
The opposite of Open Standard is any proprietary way of doing things, even a standardized one
7 Open Tools
• Infrastructure baseline
  • PostgreSQL, MySQL, Open ACS, Petri Nets, Alfresco, jBoss
• TM Server
  • TinyTM
• Workflow capabilities through any CALPMS
  • Open: ]po[, .LRN. Closed source: AIT Projetex, LTC Worx, Plunet Business Manager, etc.
• CAT Clients
  • OmegaT, Metatexis, TinyTM Word macro, Multilizer, and whoever wants to implement the protocol
• Filters
  • OKAPI framework, mainstream XLIFF generators such as Sun or Adobe
8 Use Cases - .odt + OmegaT
• Level 2 TMX support
• Has TinyTM connector in alpha
• Has plaintext glossary support
OpenOffice.org .odt
8 Use Cases - TinyTM
• TinyTM protocol is licensed under the "LGPL V2.1 or higher"
• Rest of the TinyTM code is licensed under the "GPL V2.0 or higher"
• Translation clients can be licensed under any license
• TinyTM documentation is licensed under the Creative Commons Attribution License
8 Use Cases - TinyTM - Status
• Alpha functionality
• Java TMX importer
• Mark-up parsing logic
-------------------------
• Protocol freeze in public discussion on SourceForge
• Protocol features
  • Conservative extension with many important enhancements
8 Use Cases - TinyTM – new protocol features
• Industry and domain taxonomy
  • open to accommodate future TDA taxonomy research
• Inline markup handling
  1. stripped plain version
  2. normalized version with formatting placeholders
  3. TMX version with full TMX markup
• Advanced leverage functions
  • hash-based context storage
  • hash and trigram enhanced fuzzy search
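The two leverage functions named above can be illustrated with a short Python sketch. It is not TinyTM's implementation: the names trigrams, trigram_similarity and context_key are hypothetical, and both the Jaccard measure over character trigrams and the SHA-1 hash of the neighbouring segments are assumptions chosen only to show the general technique.

```python
import hashlib


def trigrams(text):
    """Set of character trigrams of the lower-cased, space-padded text."""
    padded = f"  {text.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a, b):
    """Jaccard similarity of the two trigram sets, between 0.0 and 1.0."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def context_key(previous_segment, next_segment):
    """Hash of the neighbouring segments (assumed definition of 'context')."""
    joined = previous_segment + "\x1f" + next_segment
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    print(trigram_similarity("This is a sentence", "This is a short sentence"))
    print(context_key("Previous segment", "Next segment"))
```

Storing a short hash instead of the full neighbouring text keeps the context comparison cheap, while the trigram score can tolerate small differences that an exact lookup would miss.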
9 Discussion
Thanks for your attention
10 References
Open Standards (selective)
• XLIFF 1.2: http://docs.oasis-open.org/xliff/v1.2/os/xliff-core.html
• TMX 1.4b: http://www.lisa.org/fileadmin/standards/tmx1.4/tmx.htm
• SRX 2.0: http://www.lisa.org/fileadmin/standards/srx20.html
Open Tools (selective)
• OKAPI Framework: http://okapi.sourceforge.net
• OmegaT: http://sourceforge.net/projects/omegat
• OmegaT+: http://omegatplus.sourceforge.net
• Open ACS: http://openacs.org
• PostgreSQL: http://www.postgresql.org
• TinyTM: http://tinytm.sourceforge.net
10 References
Tools continued
• XML to XLIFF to XML: https://sourceforge.net/projects/xliffroundtrip
• TMX Compliance Kit (test for TMX certification): http://www.lisa.org/Translation-Memory-e.34.0.html
• TBX Checker: http://www.lisa.org/TBX-Resources6500html
Standards continued
• Unicode Standard Annex #29: http://www.unicode.org/reports/tr29
• W3C ITS: http://www.w3.org/TR/its
Level 2
• Formatting that is applied to the source and target text of a translation unit is exported to the TMX file
• Different tools use different ways of encoding that information (placeholders or actual formatting information)
Level 2
MemoQ – Word DOC with formatting
<seg>This is the <bpt i="1" type="bold"></bpt>first<ept i="1"></ept> sentence this is <bpt i="2" type="ulined"></bpt>another<ept i="2"></ept> sentence</seg>
Trados 2009 – Word DOC with formatting
<seg>This is the <bpt i="1" type="Bold" />first<ept i="1" /> sentence this is<bpt i="2" type="Underline" />another<ept i="2" /> sentence</seg>
Trados 2007 (8.2, 8.3) – Word DOC with formatting
<seg>This is the <bpt i="1">&lt;cf bold=&quot;on&quot;&gt;</bpt>first<ept i="1">&lt;/cf&gt;</ept> sentence this is <bpt i="2">&lt;cf underlinestyle=&quot;single&quot;&gt;</bpt>another<ept i="2">&lt;/cf&gt;</ept>sentence</seg>
Level 2
MemoQ – HTML file with link
<seg>Text with a link to <bpt i="1" type="link">&lt;a href = &quot;http://www.samplehtml.com/page1.htm&quot;&gt;</bpt>another page<ept i="1">&lt;/a&gt;</ept></seg>
Trados 2009 – HTML file with link
<seg>Text with a link to <bpt i="1" type="19" x="1" />another page<ept i="1" /></seg>
Trados 2007 (8.2, 8.3) – HTML file with link
<seg>Text with a link to <bpt i="1">&lt;a href=&quot;http://www.samplehtml.com/page1.htm&quot;&gt;</bpt>another page<ept i="1">&lt;/a&gt;</ept></seg>
OmegaT – HTML file with link
OmegaT internal format: <seg>Text with a link to &lt;a0&gt;another page&lt;/a0&gt;</seg>
TMX Level 2 format: <seg>Text with a link to <bpt i="0" x="0">&lt;a0&gt;</bpt>another page<ept i="0">&lt;/a0&gt;</ept></seg>
Level 2
MemoQ – InDesign
<seg>InDesign Text with <bpt i="1" type="bold"></bpt>formatting in bold<ept i="1"></ept></seg>
Trados 2009 – InDesign
<seg>InDesign text with <bpt i="1" type="pt16" x="1" />formatting in bold<ept i="1" /></seg>
Trados 2007 (8.2, 8.3) – InDesign
<seg>InDesign text with <bpt i="1">&lt;cf ptfs=&quot;c_Bold&quot;&gt;</bpt>formatting in bold<ept i="1">&lt;/cf&gt;</ept></seg>
Implications of different tags for formatting
• Tools that use placeholder tags do not include the actual formatting information in the TMX file
• Other tools might only be able to re-use the text, especially if the formatting is only applied in the target segment but not in the source
• The result of the exchange would then be the same as with TMX Level 1 (text only)
• TMX files which carry the actual formatting information will yield better matches in other tools that can read this information
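The fallback to "text only" can be sketched in a few lines of Python: whatever a tool put inside bpt/ept (real formatting codes or empty placeholders), a receiving tool that cannot interpret it can still recover the same plain segment. The function name plain_text and the two sample strings are illustrative assumptions; the samples are modelled on the slide examples with a hypothetical URL.

```python
import xml.etree.ElementTree as ET

# TMX inline elements whose content is native formatting code, not text.
CODE_ELEMENTS = {"bpt", "ept", "it", "ph", "ut"}

# Modelled on the slide examples; the URL is a placeholder.
WITH_CODES = (
    '<seg>Text with a link to <bpt i="1" type="link">'
    "&lt;a href=&quot;http://example.com/page1.htm&quot;&gt;</bpt>"
    'another page<ept i="1">&lt;/a&gt;</ept></seg>'
)
WITH_PLACEHOLDERS = (
    '<seg>Text with a link to <bpt i="1" type="19" x="1" />'
    'another page<ept i="1" /></seg>'
)


def plain_text(seg_xml):
    """Return the translatable text of a <seg>, dropping inline codes."""
    def walk(node):
        parts = [node.text or ""]
        for child in node:
            if child.tag not in CODE_ELEMENTS:
                # e.g. <hi>, which wraps translatable text
                parts.append(walk(child))
            parts.append(child.tail or "")
        return "".join(parts)

    return walk(ET.fromstring(seg_xml))


if __name__ == "__main__":
    print(plain_text(WITH_CODES))
    print(plain_text(WITH_PLACEHOLDERS))
```

Both variants reduce to the same Level 1 text, which is why the exchange still works but loses the formatting-based part of the match.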
Where do you use TMX?
• Transferring data between different translation memory tools
• Checking tools, QA tools
• TM maintenance tools
• Basis for bilingual term extraction
Reusing TMX data
• Although translation memory tools share the same basic idea (storing source-target language pairs and recycling translations), this has been realized in different ways
• Exchange with TMX works, but there is an issue that can lower the match rates nonetheless… the segmentation rules
SRX – Segmentation Rules Exchange
• From the SRX specification
  • "…The purpose of the SRX format is to provide a standard method to describe segmentation rules that are being exchanged among tools and/or translation vendors…"
  • "…is intended to enhance the TMX standard…"
Why SRX?
• Tool A
  • Semicolon is end of segment
  • "This is a sentence; this is another sentence."
  • TM system sees two separate segments
• Tool B
  • Semicolon is NOT end of segment
  • "This is a sentence; this is another sentence."
  • TM system sees one segment
  • No match from the TMX data: the match rate is around 50%, while the usual threshold setting is around 70%
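The Tool A / Tool B effect can be reproduced with a small Python sketch. The regex-based splitter and the word-level difflib ratio are assumptions standing in for a real tool's segmentation engine and fuzzy-match score; the names segment and best_match are hypothetical.

```python
import re
from difflib import SequenceMatcher

TEXT = "This is a sentence; this is another sentence."


def segment(text, break_on_semicolon):
    """Split after sentence-final punctuation, optionally after ';' too."""
    enders = ".?!;" if break_on_semicolon else ".?!"
    pattern = rf"(?<=[{re.escape(enders)}])\s+"
    return [part for part in re.split(pattern, text) if part]


def best_match(new_segment, memory):
    """Highest word-level similarity between new_segment and any TM entry."""
    words = new_segment.split()
    return max(SequenceMatcher(None, words, entry.split()).ratio()
               for entry in memory)


if __name__ == "__main__":
    tool_a = segment(TEXT, break_on_semicolon=True)   # TM filled by Tool A
    tool_b = segment(TEXT, break_on_semicolon=False)  # lookup by Tool B
    print(tool_a)
    print(tool_b)
    # The single Tool B segment finds no exact entry in Tool A's memory,
    # and its best fuzzy score stays below a typical 70% threshold.
    print(tool_b[0] in tool_a, best_match(tool_b[0], tool_a))
```

Exchanging the segmentation rules along with the TMX data lets the receiving tool split the text the same way, so the stored segments can be found again.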
Segmentation rules
• Rules that the tool applies to the text to translate to split it up into segments
  • paragraph
  • sentence
  • phrase
  • incomplete sentences in bulleted lists
  • single words (headings, "Note", "Attention")
Segmentation rules
• End of segment rules (common to the default settings of all tools)
  • Dot at the end of a sentence (not after known abbreviations)
  • Question mark, exclamation mark
  • Paragraph mark
  • Colon
• End of segment rules (different for different tools)
  • Semicolon
  • Tab character
  • Sub segments (index entries, footnotes, graphics)
Comparison of default rules

              Workbench   Transit   DV                            SDLX                          Across
Colon         end         end       end                           no end                        no end
Semicolon     no end      end       end                           no end                        no end
Tab           end         no end    no end                        no end                        no end
Soft return   no end      no end    end in Word, no end in PPT    end in Word, no end in PPT    no end
What can SRX do and what not?
• It can only show the segmentation rule settings at the time of export
• It cannot show any changes that have been applied to the segmentation rules during the use of the TM
• Sometimes the rules from system 1 cannot be re-created in system 2; the rule will then be ignored
TBX – TermBase Exchange
• From the working draft of the TBX specification
  • TBX is an open, XML-based standard format for terminological data
  • TBX is designed to support the analysis, representation, dissemination and exchange of information from human-oriented terminological databases (termbases)
  • TBX is built on the basis of ISO 12620 (data categories) and ISO 12200 (MARTIF core structure)
TMX TBX
[Figure: structure of a sample entry. Labels: global information in entry head; language ID (one per language); administrative data of this language; information on term level; term in English; term in French]
Where could you use TBX?
• Exchange terminological data
  • From term base to source language checking tools
  • Between term bases of translation memory tools
  • From term base to terminology checking tools
  • From term base to terminology extraction tools (as stop word lists)
  • From term base to dictionary of a machine translation tool
• For indexing keywords in document management systems, content management systems, knowledge management systems
• Publishing terminological data on the intranet / Internet
• Optimization of search engines / text mining by searching for synonyms automatically
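As an illustration of how a checking tool or search index might pull terms out of exchanged terminology, the following Python sketch reads one concept entry and groups its terms by language. The sample entry is a simplified, hypothetical fragment using the termEntry / langSet / tig / term nesting of the MARTIF-based core structure; a complete TBX file has further wrapper elements and data categories that are left out here, and the function name terms_by_language is an assumption.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Simplified, hypothetical concept entry with one term per language.
TBX_ENTRY = """
<termEntry id="c1">
  <descrip type="subjectField">user interface</descrip>
  <langSet xml:lang="en">
    <tig><term>button</term></tig>
  </langSet>
  <langSet xml:lang="fr">
    <tig><term>bouton</term></tig>
  </langSet>
</termEntry>
"""


def terms_by_language(entry_xml):
    """Map each language ID to the list of terms in its language section."""
    entry = ET.fromstring(entry_xml.strip())
    result = {}
    for lang_set in entry.findall("langSet"):
        language = lang_set.get(XML_LANG)
        result[language] = [term.text for term in lang_set.iter("term")]
    return result


if __name__ == "__main__":
    print(terms_by_language(TBX_ENTRY))
```

The same dictionary could feed a terminology checker (expected target term per source term) or a keyword index.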
XLIFF – XML Localization Interchange File Format
• The XLIFF format aims to
  • separate localizable text from formatting (deal with one file format in translation instead of different processes to extract, filter, and convert text from different file formats)
  • enable multiple tools to work on the source text
  • store information that is helpful in supporting a localization process (like metadata on versions of source and target segments)
Level 2
MemoQ ndash HTML file with link
Trados 2009 ndash HTML file with link
Trados 2007 82 83 ndash HTML file with link
ltseggtText with a link to ltbpt i=1gtamplta href=ampquothttpwwwsamplehtmlcompage1htmampquotampgtltbptgtanother pageltept i=1gtampltaampgtlteptgtltseggt
ltseggtText with a link to ltbpt i=1 type=19 x=1 gtanother pageltept i=1 gtltseggt
ltseggtText with a link to ltbpt i=1 type=linkgtamplta href = ampquothttpwwwsamplehtmlcompage1htmampquotampgtltbptgtanother pageltept i=1gtampltaampgtlteptgtltseggt
OmegaT internal format ltseggtText with a link to amplta0ampgtanother pageamplta0ampgtltseggt
TMX Level 2 format ltseggtText with a link to ltbpt i=0 x=0gtamplta0ampgtltbptgtanother pageltept i=0gtamplta0ampgtlteptgtltseggt
OmegaT - HTML file with link
Level 2
MemoQ ndash InDesign
Trados 2009 ndash InDesign
Trados 2007 82 83 ndash InDesign
ltseggtInDesign text with ltbpt i=1gtampltcf ptfs=ampquotc_Boldampquotampgtltbptgtformatting in boldltept i=1gtampltcfampgtlteptgtltseggt
ltseggtInDesign text with ltbpt i=1 type=pt16 x=1 gtformatting in boldltept i=1 gtltseggt
ltseggtInDesign Text with ltbpt i=1 type=boldgtltbptgtformtatting in boldltept i=1gtlteptgtltseggt
Implications of different tags for formatting
bull Tools that use placeholder tags do not include the actual formatting information in the TMX file
bull Other tools might only be able to re-use the text especially if the formatting is only applied in the target segment but not in the source
bull The result of the exchange would then be the same as with TMX level 1 (text only)
bull TMX files which carry the actual formatting information will yield better matches in other tools that can read this information
Where do you use TMX
bull Transfering data between different translation memory tools
bull Checking tools QA tools
bull TM maintenance tools
bull Basis for bilingual term extraxtion
Reusing TMX data
bull Although Translation Memory Tools have the same basic idea (storing source-target language pairs and recycling translations) this has been realized in different ways
bull Exchange with TMX works but there is an issue that can lower the match rates nonethelesshellip the segmentation rules
SRX ndash Segmentation Rules Exchange
bull From the SRX specification
bull hellipThe purpose of the SRX format is to provide a standard method to describe segmentation rules that are being exchanged among tools andor translation vendors
bull hellipis intended to enhance the TMX standardhellip
Why SRX
bull Tool Abull Semicolon is end of segment
bull This is a sentence this is another sentence
bull TM system sees two separate segments
bull Tool Bbull Semicolon is NOT end of segment
bull This is a sentence this is another sentence
bull TM system sees one segmentbull No match from the TMX data
bull Match rate around 50 usual setting around 70
Segmentation rules
bull Rules that the tool applies to the text to translate to split it up into segments
bull paragraph
bull sentence
bull phrase
bull incomplete sentences in bulleted lists
bull single words (headings ldquoNoterdquo ldquoAttentionrdquo)
Segmentation rules
bull End of segment rules (common to the default settings of all tools)bull Dot at the end of a sentence (not after known
abbreviations)bull Question mark exclamation markbull Paragraph markbull Colon
bull End of segment rules (different for different tools)bull Semicolonbull Tab characterbull Sub segments (index entries footnotes
graphics)
Comparison of default rules
Workbench Transit DV SDLX Across
Colon end end end no end no end
Semi-
colon
no end end end no end no end
Tab end no end no end no end no end
Soft
return
no end no end end in
Word no
end in
PPT
end in
Word no
end in
PPT
no end
What can SRX do and what not
bull It can only show the segmentation rule settings at the time of export
bull It cannot show any changes that have been applied in the segmentation rules during the use of the TM
bull Sometimes the rules from system 1 cannot be re-created in system 2 then the rule will be ignored
TBX ndash TermBase Exchange
bull From the working draft of the TBX specificationbull TBX is an open XML-based standard format for terminological data
bull TBX is designed to support the analysis representation dissemination and exchange of information from human-oriented terminological databases (termbases)
bull TBX is built on the basis of ISO 12620 (data categories) and ISO 12200 (MARTIF core structure)
TMX TBX
[Figure: annotated term entry. Labels: global information in entry head; language ID (one per language); administrative data of this language; information on term level; term in English; term in French]
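The entry / language / term levels of a TBX entry can be read with a few lines of Python. The fragment below is a hand-written, minimal entry for illustration only (the terms and data category values are made up), and `terms_by_language` is a hypothetical helper, not part of any TBX tool.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Illustrative entry only: one concept, two language sections.
TBX_FRAGMENT = """
<termEntry id="c1">
  <descrip type="subjectField">localization</descrip>
  <langSet xml:lang="en">
    <tig>
      <term>translation memory</term>
      <termNote type="partOfSpeech">noun</termNote>
    </tig>
  </langSet>
  <langSet xml:lang="fr">
    <tig>
      <term>mémoire de traduction</term>
    </tig>
  </langSet>
</termEntry>
"""


def terms_by_language(entry_xml):
    """Map each language ID to the list of terms in that language section."""
    entry = ET.fromstring(entry_xml)
    result = {}
    for lang_set in entry.findall("langSet"):
        lang = lang_set.get(XML_LANG)
        result[lang] = [term.text for term in lang_set.iter("term")]
    return result


print(terms_by_language(TBX_FRAGMENT))
```

Information that applies to the whole concept sits directly under the entry, while term-specific information sits next to the term.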
Where could you use TBX?
• Exchange terminological data
  • From term base to source language checking tools
  • Between term bases of translation memory tools
  • From term base to terminology checking tools
  • From term base to terminology extraction tools (as stop word lists)
  • From term base to dictionary of a machine translation tool
• For indexing keywords in document management systems, content management systems, knowledge management systems
• Publishing terminological data on the Intranet / Internet
• Optimization of search engines / text mining by searching for synonyms automatically
XLIFF - XML Localization Interchange File Format
• The XLIFF format aims to
  • separate localizable text from formatting (deal with one file format in translation instead of different processes to extract, filter and convert text from different file formats)
  • enable multiple tools to work on the source text
  • store information that is helpful in supporting a localization process (like metadata on versions of source and target segments)
• An XLIFF file is bilingual and can be the container for a number of individual files
• Each file element in an XLIFF file contains a header (project data such as contact information, project phases, pointers to reference material and information on the skeleton file) and a body section with the actual text segments
XLIFF
• XLIFF can carry several translation matches
• Additional fields can contain context, author, creation tool, history...
<trans-unit id="n1">
  <source>This is a sentence</source>
  <target xml:lang="de">Dies ist ein Satz</target>
  <alt-trans match-quality="100" tool="TM_System">
    <source>This is a sentence</source>
    <target xml:lang="de">Das ist ein Satz</target>
  </alt-trans>
  <alt-trans match-quality="70" tool="TM_System">
    <source>This is a short sentence</source>
    <target xml:lang="de">Dies ist ein kurzer Satz</target>
  </alt-trans>
</trans-unit>
<trans-unit id="1" resname="IDCANCEL" restype="button" coord="8;8;50;14">
  <source xml:lang="en">Cancel</source>
</trans-unit>
• Information on the name of a button and its coordinates (for visual representation of the button in a localization tool)
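A tool reading such a trans-unit might pick the best of the alternative matches like this. This is a sketch under assumptions: the fragment carries no XLIFF namespace (a real XLIFF 1.2 file declares one), and `best_alt_trans` and its `threshold` parameter are invented for this example.

```python
import xml.etree.ElementTree as ET

# Namespace-free copy of the trans-unit for illustration.
TRANS_UNIT = """
<trans-unit id="n1">
  <source>This is a sentence</source>
  <target xml:lang="de">Dies ist ein Satz</target>
  <alt-trans match-quality="100" tool="TM_System">
    <source>This is a sentence</source>
    <target xml:lang="de">Das ist ein Satz</target>
  </alt-trans>
  <alt-trans match-quality="70" tool="TM_System">
    <source>This is a short sentence</source>
    <target xml:lang="de">Dies ist ein kurzer Satz</target>
  </alt-trans>
</trans-unit>
"""


def best_alt_trans(unit_xml, threshold=70):
    """Return (quality, target text) of the best alt-trans, or None."""
    unit = ET.fromstring(unit_xml)
    candidates = []
    for alt in unit.findall("alt-trans"):
        # match-quality is free text in XLIFF; tolerate a trailing percent sign
        quality = int(alt.get("match-quality", "0").rstrip("%"))
        if quality >= threshold:
            candidates.append((quality, alt.findtext("target")))
    return max(candidates, key=lambda c: c[0], default=None)


print(best_alt_trans(TRANS_UNIT))
```

Raising `threshold` above every stored quality value makes the function return `None`, which a tool could treat as "no usable match".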
Where is XLIFF useful?
• Where experience with XML exists
• Projects contain many different file formats
• All formats are converted to XLIFF for translation
• Different tools need to be used during localization
• Different translations (alt-trans) or languages needed as reference
Any idea why XLIFF should NOT be the cure for everything?
• Instead of developing parsers for different file formats (to read the file into a translation tool), developers now need to create parsers to convert those file formats to XLIFF
• Some file formats can already be dealt with (Office, HTML, XML...) - why should a new parser be created for those?
• XLIFF has its variants, like all XML-based files - how can you make sure that each tool can process any of the possible XLIFF flavours?
The magic quadrant again, just to remember the distinction
                  Closed Source               Open Source
Open Standards    Q1 Open-Closed: Good        Q2 Open-Open: Good
Proprietary ways  Q3 Proprietary-Closed: Bad  Q4 Proprietary-Open: Wild
6 Talking Legalese in OSS
OSS (Open Source Software)
• Copyleft and Permissive Licensing
  • GPL (General Public License by FSF)
  • BSD (Berkeley Software Distribution)
  • Apache License 2.0
• MS open source licensing
  • Ms-PL (Public License)
  • Ms-RL (Reciprocal License)
Derived/Derivative works vs. dynamic linkage
• Paradigmatic case is an app running on an OS
Contributions
Distribution
6 Talking Legalese in standards
RF (Royalty Free) vs. RAND (Reasonable and Non-Discriminatory)
Standards can be licensed under different terms
• RF
• RAND - there is no general consensus to date on whether RAND is Open or Proprietary
• Proprietary
By open standard we mean just RF standards
The opposite of Open Standard is any proprietary way of doing things, be it standardized
7 Open Tools
• Infrastructure baseline
  • PostgreSQL, MySQL, Open ACS, Petri Nets, Alfresco, jBoss
• TM Server
  • TinyTM
• Workflow capabilities through any CALPMS
  • Open: ]po[, .LRN; Closed source: AIT Projetex, LTC Worx, Plunet Business Manager etc.
• CAT Clients
  • OmegaT, Metatexis, TinyTM Word macro, Multilizer and whoever wants to implement the protocol
• Filters
  • OKAPI framework, mainstream XLIFF generators such as Sun or Adobe
8 Use Cases - odt + OmegaT
• Level 2 TMX support
• Has TinyTM connector in alpha
• Has plaintext glossary support
OpenOffice.org (.odt)
8 Use Cases - TinyTM
• TinyTM protocol is licensed under the "LGPL V2.1 or higher"
• Rest of the TinyTM code is licensed under the "GPL V2.0 or higher"
• Translation clients can be licensed under any license
• TinyTM documentation is licensed under the Creative Commons Attribution License
8 Use Cases - TinyTM - Status
• Alpha functionality
  • Java TMX importer
  • Mark up parsing logic
-------------------------
• Protocol freeze in public discussion on Sourceforge
• Protocol features
  • Conservative extension with many important enhancements
8 Use Cases - TinyTM - new protocol features
• Industry and domain taxonomy
  • open to accommodate future TDA taxonomy research
• Inline markup handling
  1. stripped plain version
  2. normalized version with formatting placeholders
  3. TMX version with full TMX markup
• Advanced leverage functions
  • hash based context storage
  • hash and trigram enhanced fuzzy search
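The idea of combining a hash lookup with trigram-based fuzzy search can be sketched as follows. This is not TinyTM's implementation or protocol: `SketchMemory`, the SHA-1 choice, the padding scheme and the 0.5 threshold are all assumptions made for the illustration.

```python
import hashlib


def normalize(segment):
    """Lower-case and collapse whitespace so trivial variants hash alike."""
    return " ".join(segment.lower().split())


def segment_hash(segment):
    return hashlib.sha1(normalize(segment).encode("utf-8")).hexdigest()


def trigrams(segment):
    """Set of character trigrams, padded so short segments still yield some."""
    padded = "  " + normalize(segment) + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def similarity(a, b):
    """Jaccard overlap of the two trigram sets (0..1)."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


class SketchMemory:
    """Toy in-memory TM: exact hits by hash, fuzzy hits by trigram overlap."""

    def __init__(self):
        self.by_hash = {}
        self.units = []

    def add(self, source, target):
        self.by_hash[segment_hash(source)] = target
        self.units.append((source, target))

    def lookup(self, query, threshold=0.5):
        exact = self.by_hash.get(segment_hash(query))
        if exact is not None:
            return [(1.0, exact)]  # cheap exact match, no scoring needed
        scored = [(similarity(query, s), t) for s, t in self.units]
        return sorted((hit for hit in scored if hit[0] >= threshold), reverse=True)


tm = SketchMemory()
tm.add("This is a sentence.", "Dies ist ein Satz.")
tm.add("This is a short sentence.", "Dies ist ein kurzer Satz.")
print(tm.lookup("this is a  sentence."))       # exact hit after normalization
print(tm.lookup("This is a brief sentence."))  # fuzzy candidates, best first
```

The hash gives a constant-time answer for repeated segments, so the more expensive trigram comparison only runs when no exact match exists.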
9 Discussion
Thanks for your attention
10 References
Open Standards (selective)
• XLIFF 1.2: http://docs.oasis-open.org/xliff/v1.2/os/xliff-core.html
• TMX 1.4b: http://www.lisa.org/fileadmin/standards/tmx1.4/tmx.htm
• SRX 2.0: http://www.lisa.org/fileadmin/standards/srx20.html
Open Tools (selective)
• OKAPI Framework: http://okapi.sourceforge.net
• OmegaT: http://sourceforge.net/projects/omegat
• OmegaT+: http://omegatplus.sourceforge.net
• Open ACS: http://openacs.org
• PostgreSQL: http://www.postgresql.org
• TinyTM: http://tinytm.sourceforge.net
10 References
Tools continued
• XML to XLIFF to XML: https://sourceforge.net/projects/xliffroundtrip
• TMX Compliance Kit (test for TMX certification): http://www.lisa.org/Translation-Memory-e.34.0.html
• TBX Checker: http://www.lisa.org/TBX-Resources6500html
Standards continued
• Unicode Standard Annex #29: http://www.unicode.org/reports/tr29/
• W3C ITS: http://www.w3.org/TR/its/
Level 2
MemoQ - InDesign
Trados 2009 - InDesign
Trados 2007 (8.2, 8.3) - InDesign
<seg>InDesign text with <bpt i="1">&lt;cf ptfs=&quot;c_Bold&quot;&gt;</bpt>formatting in bold<ept i="1">&lt;/cf&gt;</ept></seg>
<seg>InDesign text with <bpt i="1" type="pt16" x="1" />formatting in bold<ept i="1" /></seg>
<seg>InDesign Text with <bpt i="1" type="bold"></bpt>formatting in bold<ept i="1"></ept></seg>
Implications of different tags for formatting
• Tools that use placeholder tags do not include the actual formatting information in the TMX file
• Other tools might only be able to re-use the text, especially if the formatting is only applied in the target segment but not in the source
• The result of the exchange would then be the same as with TMX level 1 (text only)
• TMX files which carry the actual formatting information will yield better matches in other tools that can read this information
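What "the same as with TMX level 1" means can be shown with a small Python sketch that drops the inline elements from a segment and keeps only the text. The `plain_text` helper is hypothetical, and for simplicity it also discards any sub-flow text nested inside the inline elements.

```python
import xml.etree.ElementTree as ET

# TMX inline elements whose content is native formatting code, not text.
NATIVE_CODE_TAGS = {"bpt", "ept", "ph", "it", "ut"}


def plain_text(elem):
    """Text of a <seg> with native-code elements dropped (level 1 view)."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag not in NATIVE_CODE_TAGS:
            parts.append(plain_text(child))  # e.g. <hi>: keep its text
        parts.append(child.tail or "")       # text following the child
    return "".join(parts)


with_codes = ET.fromstring(
    '<seg>InDesign text with <bpt i="1">&lt;cf ptfs=&quot;c_Bold&quot;&gt;</bpt>'
    'formatting in bold<ept i="1">&lt;/cf&gt;</ept></seg>'
)
placeholders = ET.fromstring(
    '<seg>InDesign text with <bpt i="1" type="bold"></bpt>'
    'formatting in bold<ept i="1"></ept></seg>'
)

print(plain_text(with_codes))
print(plain_text(placeholders))
```

Both segments reduce to the same plain text, which is all a tool can reuse when it cannot interpret the other tool's inline codes.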
Where do you use TMX?
• Transferring data between different translation memory tools
• Checking tools, QA tools
• TM maintenance tools
• Basis for bilingual term extraction
Reusing TMX data
• Although translation memory tools share the same basic idea (storing source-target language pairs and recycling translations), this has been realized in different ways
• Exchange with TMX works, but there is an issue that can lower the match rates nonetheless... the segmentation rules
SRX - Segmentation Rules Exchange
• From the SRX specification
  • ...The purpose of the SRX format is to provide a standard method to describe segmentation rules that are being exchanged among tools and/or translation vendors
  • ...is intended to enhance the TMX standard...
Why SRX?
• Tool A
  • Semicolon is end of segment
  • This is a sentence; this is another sentence.
  • TM system sees two separate segments
• Tool B
  • Semicolon is NOT end of segment
  • This is a sentence; this is another sentence.
  • TM system sees one segment
• No match from the TMX data
  • Match rate is around 50%, while the usual threshold setting is around 70%
Where do you use TMX
bull Transfering data between different translation memory tools
bull Checking tools QA tools
bull TM maintenance tools
bull Basis for bilingual term extraxtion
Reusing TMX data
bull Although Translation Memory Tools have the same basic idea (storing source-target language pairs and recycling translations) this has been realized in different ways
bull Exchange with TMX works but there is an issue that can lower the match rates nonethelesshellip the segmentation rules
SRX ndash Segmentation Rules Exchange
bull From the SRX specification
bull hellipThe purpose of the SRX format is to provide a standard method to describe segmentation rules that are being exchanged among tools andor translation vendors
bull hellipis intended to enhance the TMX standardhellip
Why SRX
bull Tool Abull Semicolon is end of segment
bull This is a sentence this is another sentence
bull TM system sees two separate segments
bull Tool Bbull Semicolon is NOT end of segment
bull This is a sentence this is another sentence
bull TM system sees one segmentbull No match from the TMX data
bull Match rate around 50 usual setting around 70
Segmentation rules
bull Rules that the tool applies to the text to translate to split it up into segments
bull paragraph
bull sentence
bull phrase
bull incomplete sentences in bulleted lists
bull single words (headings ldquoNoterdquo ldquoAttentionrdquo)
Segmentation rules
bull End of segment rules (common to the default settings of all tools)bull Dot at the end of a sentence (not after known
abbreviations)bull Question mark exclamation markbull Paragraph markbull Colon
bull End of segment rules (different for different tools)bull Semicolonbull Tab characterbull Sub segments (index entries footnotes
graphics)
Comparison of default rules
Workbench Transit DV SDLX Across
Colon end end end no end no end
Semi-
colon
no end end end no end no end
Tab end no end no end no end no end
Soft
return
no end no end end in
Word no
end in
PPT
end in
Word no
end in
PPT
no end
What can SRX do and what not
bull It can only show the segmentation rule settings at the time of export
bull It cannot show any changes that have been applied in the segmentation rules during the use of the TM
bull Sometimes the rules from system 1 cannot be re-created in system 2 then the rule will be ignored
TBX ndash TermBase Exchange
bull From the working draft of the TBX specificationbull TBX is an open XML-based standard format for terminological data
bull TBX is designed to support the analysis representation dissemination and exchange of information from human-oriented terminological databases (termbases)
bull TBX is built on the basis of ISO 12620 (data categories) and ISO 12200 (MARTIF core structure)
TMX TBX
Zerfasszaacde 35
Term in English
Term in French
Global information in entry head
Information on term level
Administrative data of this language
Language ID
Language ID
Where could you use TBX
bull Exchange terminological databull From term base to source language checking toolsbull Between term bases of translation memory toolsbull From term base to terminology checking toolsbull From term base to terminology extraction tools (as stop word lists)bull From term base to dictionary of a machine translation tool
bull For indexing keywords in document management systems content management systems knowledge management systems
bull Publishing terminological data on the Intranet Internet
bull Optimization of search enginges text mining by searching for synonyms automatically
XLIFF ndash XML Localization Interchange File Format
bull The XLIFF format aims to bull separate localizable text from formatting (deal with one file
format in translation instead of different processes to extract filter convert text from different file formats)
bull enable multiple tools to work on the source textbull store information that is helpful in supporting a localization
process (like meta data on versions of source and target segemtns)
bull An XLIFF file is bilingual and can be the container for a number of individual files
bull Each file element in an XLIFF file contains a header (project data such as contact information project phases pointers to reference material and information on the skeleton file) and a body section with the actual text segments
XLIFF
bull XLIFF can carry several translation matches
bull Additional fields can contain context author creation tool historyhellip
lttrans-unit id=n1gtltsourcegtThis is a sentenceltsourcegtlttarget xmllang=degtDies ist ein
Satzlttargetgtltalt-trans match-
quality=100 tool=TM_SystemgtltsourcegtThis is a sentenceltsourcegtlttarget xmllang=degtDas ist ein
Satzlttargetgtltalt-transgtltalt-trans match-
quality=70 tool=TM_SystemgtltsourcegtThis is a short sentenceltsourcegtlttarget xmllang=degtDies ist ein kurzer
Satzlttargetgtltalt-transgtlttrans-unitgt
lttrans-unit id=1 resname=IDCANCEL restype=button coord=885014gt
ltsource xmllang=engtCancelltsourcegtlttrans-unitgt
bull Information on name of a button and its coordinates (for visual representation of the button in a localization tool)
Where is XLIFF useful?
• Where experience with XML exists
• Projects contain many different file formats
  • All formats are converted to XLIFF for translation
• Different tools need to be used during localization
• Different translations (alt-trans) or languages needed as reference
Any idea why XLIFF should NOT be the cure for everything?
• Instead of developing parsers for different file formats (to read the file into a translation tool), developers now need to create parsers to convert those file formats to XLIFF
• Some file formats can already be dealt with (Office, HTML, XML…) – why should a new parser be created for those?
• XLIFF has its variants, like all XML-based files – how can you make sure that each tool can process any of the possible XLIFF flavours?
The magic quadrant again, just to remember the distinction
(standards axis: Open Standards vs. Proprietary ways; source axis: Open Source vs. Closed Source)
• Q1 Open-Closed: Good
• Q2 Open-Open: Good
• Q3 Proprietary-Closed: Bad
• Q4 Proprietary-Open: Wild
6 Talking Legalese in OSS
OSS (Open Source Software)
• Copyleft and Permissive Licensing
  • GPL (General Public License by FSF)
  • BSD (Berkeley Software Distribution)
  • Apache License 2.0
• MS open source licensing
  • Ms-PL (Public License)
  • Ms-RL (Reciprocal License)
Derived/Derivative works vs. dynamic linkage
• Paradigmatic case is app running on OS
Contributions
Distribution
6 Talking Legalese in standards
RF (Royalty Free) vs. RAND (Reasonable and Non-Discriminatory)
Standards can be licensed under different terms
• RF
• RAND – there is no general consensus to date on whether RAND is Open or Proprietary
• Proprietary
By open standard we mean just RF standards
The opposite of Open Standard is any proprietary way of doing things, even when it is standardized
7 Open Tools
• Infrastructure baseline
  • PostgreSQL, MySQL, Open ACS, Petri Nets, Alfresco, jBoss
• TM Server
  • TinyTM
• Workflow capabilities through any CALPMS
  • Open: ]po[, .LRN; Closed source: AIT Projetex, LTC Worx, Plunet Business Manager, etc.
• CAT Clients
  • OmegaT, Metatexis, TinyTM Word macro, Multilizer, and whoever wants to implement the protocol
• Filters
  • OKAPI framework, mainstream XLIFF generators such as Sun or Adobe
8 Use Cases - .odt + OmegaT
• Level 2 TMX support
• Has TinyTM connector in alpha
• Has plaintext glossary support
OpenOffice.org .odt
8 Use Cases - TinyTM
• TinyTM protocol is licensed under the "LGPL V2.1 or higher"
• Rest of the TinyTM code is licensed under the "GPL V2.0 or higher"
• Translation clients can be licensed under any license
• TinyTM documentation is licensed under the Creative Commons Attribution License
8 Use Cases - TinyTM - Status
• Alpha functionality
  • Java TMX importer
  • Mark up parsing logic
• Protocol freeze in public discussion on SourceForge
• Protocol features
  • Conservative extension with many important enhancements
8 Use Cases - TinyTM – new protocol features
• Industry and domain taxonomy
  • open to accommodate future TDA taxonomy research
• Inline markup handling
  1. stripped plain version
  2. normalized version with formatting placeholders
  3. TMX version with full TMX markup
• Advanced leverage functions
  • hash based context storage
  • hash and trigram enhanced fuzzy search
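To make the idea of hash and trigram enhanced fuzzy search more concrete, here is a minimal Python sketch. It is not the TinyTM implementation or protocol: the class TinyMemory, the helper names, the use of SHA-1, and the Jaccard similarity over character trigrams are all assumptions chosen for illustration.

```python
import hashlib


def trigrams(text):
    """Set of character trigrams of the lower-cased, space-padded text."""
    padded = "  " + text.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def similarity(a, b):
    """Jaccard similarity of the two trigram sets (1.0 means identical sets)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def segment_hash(text):
    """Hash used as the key for exact-match lookups (SHA-1 is an arbitrary choice)."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


class TinyMemory:
    """Hypothetical in-memory translation memory: hash for exact hits, trigrams for fuzzy hits."""

    def __init__(self):
        self.by_hash = {}
        self.entries = []

    def add(self, source, target):
        self.by_hash[segment_hash(source)] = target
        self.entries.append((source, target))

    def lookup(self, source, threshold=0.7):
        """Return (score, target); (0.0, None) when nothing reaches the threshold."""
        exact = self.by_hash.get(segment_hash(source))
        if exact is not None:
            return (1.0, exact)
        best_score, best_target = 0.0, None
        for stored_source, stored_target in self.entries:
            score = similarity(source, stored_source)
            if score > best_score:
                best_score, best_target = score, stored_target
        if best_score >= threshold:
            return (best_score, best_target)
        return (0.0, None)


if __name__ == "__main__":
    tm = TinyMemory()
    tm.add("This is a sentence", "Dies ist ein Satz")
    print(tm.lookup("This is a sentence"))
    print(tm.lookup("This is a short sentence"))
```

The hash lookup answers exact matches in constant time, so the slower trigram comparison only runs when no exact match exists. A production server would normally index the trigrams rather than scan every entry.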
9 Discussion
Thanks for your attention
© 2009 Moravia IT a.s. and Angelika Zerfass
A Guide to Open Standards and Open Source
A Conceptual Case Study
Angelika Zerfass, zerfass@zaac.de
David Filip, PhD, davidf@moraviaworldwide.com
10 References
Open Standards (selective)
• XLIFF 1.2: http://docs.oasis-open.org/xliff/v1.2/os/xliff-core.html
• TMX 1.4b: http://www.lisa.org/fileadmin/standards/tmx1.4/tmx.htm
• SRX 2.0: http://www.lisa.org/fileadmin/standards/srx20.html
Open Tools (selective)
• OKAPI Framework: http://okapi.sourceforge.net
• OmegaT: http://sourceforge.net/projects/omegat
• OmegaT+: http://omegatplus.sourceforge.net
• Open ACS: http://openacs.org
• PostgreSQL: http://www.postgresql.org
• TinyTM: http://tinytm.sourceforge.net
10 References
Tools continued
• XML to XLIFF to XML: https://sourceforge.net/projects/xliffroundtrip
• TMX Compliance Kit (test for TMX certification): http://www.lisa.org/Translation-Memory-e.34.0.html
• TBX Checker: http://www.lisa.org/TBX-Resources.650.0.html
Standards continued
• Unicode Standard Annex 29: http://www.unicode.org/reports/tr29
• W3C ITS: http://www.w3.org/TR/its
Reusing TMX data
• Although Translation Memory Tools have the same basic idea (storing source-target language pairs and recycling translations), this has been realized in different ways
• Exchange with TMX works, but there is an issue that can lower the match rates nonetheless: the segmentation rules
SRX – Segmentation Rules Exchange
• From the SRX specification
  • "…The purpose of the SRX format is to provide a standard method to describe segmentation rules that are being exchanged among tools and/or translation vendors"
  • "…is intended to enhance the TMX standard…"
Why SRX?
• Tool A
  • Semicolon is end of segment
  • "This is a sentence; this is another sentence."
  • TM system sees two separate segments
• Tool B
  • Semicolon is NOT end of segment
  • "This is a sentence; this is another sentence."
  • TM system sees one segment
  • No match from the TMX data
  • Match rate around 50%, usual setting around 70%
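The Tool A / Tool B mismatch can be reproduced with a small Python sketch. The functions segment and best_match are hypothetical, and difflib's ratio is only a stand-in for a TM tool's match rate, so the absolute numbers will differ from what a real tool reports.

```python
import re
from difflib import SequenceMatcher

TEXT = "This is a sentence; this is another sentence."


def segment(text, break_on_semicolon):
    """Split after a full stop, and optionally after a semicolon as well."""
    pattern = r"(?<=[.;])\s+" if break_on_semicolon else r"(?<=\.)\s+"
    return [part for part in re.split(pattern, text) if part]


def best_match(segment_text, memory):
    """Highest similarity between one segment and the stored segments."""
    return max(SequenceMatcher(None, segment_text, stored).ratio() for stored in memory)


if __name__ == "__main__":
    tool_a_memory = segment(TEXT, break_on_semicolon=True)    # two stored segments
    tool_b_segments = segment(TEXT, break_on_semicolon=False)  # one long segment
    print(tool_a_memory)
    print(tool_b_segments)
    # The long segment from Tool B only partially matches either stored segment.
    print(round(best_match(tool_b_segments[0], tool_a_memory), 2))
```

Because the single long segment only partially overlaps each of the two stored segments, the best score stays well below an exact match, which is how a segmentation difference alone can push a match under the usual threshold.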
Segmentation rules
• Rules that the tool applies to the text to translate, to split it up into segments
  • paragraph
  • sentence
  • phrase
  • incomplete sentences in bulleted lists
  • single words (headings, "Note", "Attention")
Segmentation rules
• End of segment rules (common to the default settings of all tools)
  • Dot at the end of a sentence (not after known abbreviations)
  • Question mark, exclamation mark
  • Paragraph mark
  • Colon
• End of segment rules (different for different tools)
  • Semicolon
  • Tab character
  • Sub segments (index entries, footnotes, graphics)
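The rules above map naturally onto ordered break / no-break rules, which is also how SRX expresses them (each rule has a pattern for the text before the candidate break and one for the text after it). The following Python sketch imitates that idea; the rule list, the abbreviation set, and the function split_segments are illustrative assumptions, not rules taken from any particular tool or from an SRX file.

```python
import re

# Ordered rules: (is_break, pattern before the position, pattern after the position).
# The first rule that matches at a position decides, as in SRX.
RULES = [
    (False, r"\b(?:e\.g|i\.e|etc|Dr|Mr)\.", r"\s"),  # no break after known abbreviations
    (True, r"[.?!:]", r"\s"),                         # break after . ? ! and :
]


def split_segments(text, rules=RULES):
    """Split text into segments by testing every position against the ordered rules."""
    segments, start = [], 0
    for pos in range(1, len(text)):
        before, after = text[:pos], text[pos:]
        for is_break, before_pattern, after_pattern in rules:
            if re.search("(?:" + before_pattern + r")\Z", before) and re.match(after_pattern, after):
                if is_break:
                    segments.append(text[start:pos].strip())
                    start = pos
                break  # first matching rule wins, whether it breaks or not
    tail = text[start:].strip()
    if tail:
        segments.append(tail)
    return segments


if __name__ == "__main__":
    for seg in split_segments("Dr. Smith arrived. He was late: very late! Why?"):
        print(seg)
```

Adding a semicolon or tab to the break rule, or removing the colon, is enough to model the tool-specific differences listed above.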
Comparison of default rules
            | Workbench | Transit | DV                         | SDLX                       | Across
Colon       | end       | end     | end                        | no end                     | no end
Semicolon   | no end    | end     | end                        | no end                     | no end
Tab         | end       | no end  | no end                     | no end                     | no end
Soft return | no end    | no end  | end in Word, no end in PPT | end in Word, no end in PPT | no end
What can SRX do and what not?
• It can only show the segmentation rule settings at the time of export
• It cannot show any changes that have been applied to the segmentation rules during the use of the TM
• Sometimes a rule from system 1 cannot be re-created in system 2; in that case the rule is ignored
TBX – TermBase Exchange
• From the working draft of the TBX specification
  • TBX is an open, XML-based standard format for terminological data
  • TBX is designed to support the analysis, representation, dissemination, and exchange of information from human-oriented terminological databases (termbases)
  • TBX is built on the basis of ISO 12620 (data categories) and ISO 12200 (MARTIF core structure)
TBX
[Figure: structure of a TBX entry, with callouts for the global information in the entry head, the language ID and administrative data of each language section, the information on term level, and the term in English and in French]
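The entry layout described for TBX (global information in the entry head, one section per language with its language ID, and information on term level) can be sketched in Python as follows. The sample entry uses element names from the MARTIF-based TBX core structure (termEntry, langSet, tig, term); its content and the function terms_by_language are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Attribute name that ElementTree uses for xml:lang.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical single entry; a full TBX file wraps entries in martif/text/body.
TBX_SNIPPET = """
<termEntry id="c1">
  <descrip type="subjectField">localization</descrip>
  <langSet xml:lang="en">
    <tig>
      <term>translation memory</term>
      <termNote type="partOfSpeech">noun</termNote>
    </tig>
  </langSet>
  <langSet xml:lang="fr">
    <tig>
      <term>mémoire de traduction</term>
    </tig>
  </langSet>
</termEntry>
"""


def terms_by_language(entry_xml):
    """Map each language ID of the entry to the list of its terms."""
    entry = ET.fromstring(entry_xml)
    result = {}
    for lang_set in entry.findall("langSet"):
        language = lang_set.get(XML_LANG)
        result[language] = [term.text for term in lang_set.iter("term")]
    return result


if __name__ == "__main__":
    print(terms_by_language(TBX_SNIPPET))
```

A flat mapping like this is the kind of data a terminology checker or a machine translation dictionary import would start from.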
Where could you use TBX
bull Exchange terminological databull From term base to source language checking toolsbull Between term bases of translation memory toolsbull From term base to terminology checking toolsbull From term base to terminology extraction tools (as stop word lists)bull From term base to dictionary of a machine translation tool
bull For indexing keywords in document management systems content management systems knowledge management systems
bull Publishing terminological data on the Intranet Internet
bull Optimization of search enginges text mining by searching for synonyms automatically
XLIFF ndash XML Localization Interchange File Format
bull The XLIFF format aims to bull separate localizable text from formatting (deal with one file
format in translation instead of different processes to extract filter convert text from different file formats)
bull enable multiple tools to work on the source textbull store information that is helpful in supporting a localization
process (like meta data on versions of source and target segemtns)
bull An XLIFF file is bilingual and can be the container for a number of individual files
bull Each file element in an XLIFF file contains a header (project data such as contact information project phases pointers to reference material and information on the skeleton file) and a body section with the actual text segments
XLIFF
bull XLIFF can carry several translation matches
bull Additional fields can contain context author creation tool historyhellip
lttrans-unit id=n1gtltsourcegtThis is a sentenceltsourcegtlttarget xmllang=degtDies ist ein
Satzlttargetgtltalt-trans match-
quality=100 tool=TM_SystemgtltsourcegtThis is a sentenceltsourcegtlttarget xmllang=degtDas ist ein
Satzlttargetgtltalt-transgtltalt-trans match-
quality=70 tool=TM_SystemgtltsourcegtThis is a short sentenceltsourcegtlttarget xmllang=degtDies ist ein kurzer
Satzlttargetgtltalt-transgtlttrans-unitgt
lttrans-unit id=1 resname=IDCANCEL restype=button coord=885014gt
ltsource xmllang=engtCancelltsourcegtlttrans-unitgt
bull Information on name of a button and its coordinates (for visual representation of the button in a localization tool)
Where is XLIFF useful
bull Where experience with XML exists
bull Projects contain many different file formats
bull All formats are converted to XLIFF for translation
bull Different tools need to be used during localization
bull Different translations (alt-trans) or languages needed as reference
Any idea why XLIFF should NOT be the cure for everything
bull Instead of developing parsers for different file
formats (to read in the file into a translation tool)
developers now need to create parsers to convert
those file formats to XLIFF
bull Some file formats already can be dealt with
(Office HTML XMLhellip) ndash why should a new parser
be created for those
bull XLIFF has its variants like all XML-based files ndashhow can you make sure that each tool can process any of the possible XLIFF flavours
Q1
Open-Closed
Good
Q2
Open-Open
Good
Q3
Proprietary-Closed
Bad
Q4
Proprietary-Open
Wild
The magic quadrant again justto remember the distinction
Open Standards
Open Source
Closed Source
Proprietary ways
6 Talking Legalesein OSS
OSS (Open Source Software)
bull Copyleft and Permissive Licensing
bull GPL (General Public License by FSF)
bull BSD (Berkeley Software Distribution)
bull Apache License 20
bull MS open source licensing
bull Ms-PL (Public License)
bull Ms-RL (Reciprocal License)
DerivedDerivative works vs dynamic linkage
bull Paradigmatic case is app running on OS
Contributions
Distribution
6 Talking Legalesein standards
RF (Royalty Free) vs RAND (Reasonable and Non-Discriminatory)
Standards can be licensed under different terms
bull RF
bull RAND ndash there is no general consensus up to date on whether RAND is Open or Proprietary
bull Proprietary
By open standard we mean just RF Standards
The opposite of Open Standard is any proprietary way of doing things be it standardized
7 Open Tools
bull Infrastructure baseline
bull PostgreSQL MySQL Open ACS Petri Nets Alfresco jBoss
bull TM Server
bull TinyTM
bull Workflow capabilities through any CALPMS
bull Open ]po[ LRN Closed source AIT Projetex LTC Worx Plunet Business Manager etc
bull CAT Clients
bull OmegaT Metatexis TinyTM Word macro Multilizer andwhoever wants to implement the protocol
bull Filters
bull OKAPI framework mainstream XLIFF generators such as Sun or Adobe
8 Use Cases - odt + OmegaT
bull Level 2 TMX supportbull Has TinyTM connector in alphabull Has plaintext glossary support
OpenOfficeorgodt
8 Use cases - TinyTM
bull TinyTM protocol is licensed under the rdquoLGPL V21or higherldquo
bull Rest of the TinyTM code is licensed under therdquoGPL V20 or higherldquo
bull Translation clients can be licensed underany license
bull TinyTM documentation is licensed under the Creative Commons Attribution License
8 Use Cases - TinyTM - Status
bull Alpha functionality
bull Java TMX importer
bull Mark up parsing logic
-------------------------bull Protocol freeze
in public discussionon Sourceforge
bull Protocol featuresbull Conservative extension with many important enhancements
8 Use CasesTinyTM ndash new protocol features
bull Industry and domain taxonomybull open to accommodate future TDA taxonomy
research
bull Inline markup handling1 stripped plain version2 normalized version with formatting
placeholders i3 TMX version with full TMX markup
bull Advanced leverage functionsbull hash based context storagebull hash and trigram enhanced fuzzy search
9 Discussion
Thanks for your attention
copy 2009 Moravia IT as and Angelika Zerfass
A Guide to Open Standards and Open Source
A Conceptual Case Study
Angelika ZerfasszerfasszaacdeDavid Filip PhD
davidfmoraviaworldwidecom
10 References
Open Standards (selective)
bull XLIFF 12 httpdocsoasis-openorgxliffv12osxliff-corehtml
bull TMX 14b httpwwwlisaorgfileadminstandardstmx14tmxhtm
bull SRX 20 httpwwwlisaorgfileadminstandardssrx20html
Open Tools (selective)
bull OKAPI Framework httpokapisourceforgenet
bull OmegaT httpsourceforgenetprojectsomegat
bull OmegaT+ httpomegatplussourceforgenet
bull Open ACS httpopenacsorg
bull PostgreSQL httpwwwpostgresqlorg
bull TinyTM httptinytmsourceforgenet
10 References
Tools continued
bull XML to XLIFF to XML httpssourceforgenetprojectsxliffroundtrip
bull TMX Complicance Kit (test for TMX certification)httpwwwlisaorgTranslation-Memory-e340html
bull TBX CheckerhttpwwwlisaorgTBX-Resources6500html
Standards continued
bull Unicode Standard Annex 29 httpwwwunicodeorgreportstr29
bull W3C ITS httpwwww3orgTRits
SRX ndash Segmentation Rules Exchange
bull From the SRX specification
bull hellipThe purpose of the SRX format is to provide a standard method to describe segmentation rules that are being exchanged among tools andor translation vendors
bull hellipis intended to enhance the TMX standardhellip
Why SRX
bull Tool Abull Semicolon is end of segment
bull This is a sentence this is another sentence
bull TM system sees two separate segments
bull Tool Bbull Semicolon is NOT end of segment
bull This is a sentence this is another sentence
bull TM system sees one segmentbull No match from the TMX data
bull Match rate around 50 usual setting around 70
Segmentation rules
bull Rules that the tool applies to the text to translate to split it up into segments
bull paragraph
bull sentence
bull phrase
bull incomplete sentences in bulleted lists
bull single words (headings ldquoNoterdquo ldquoAttentionrdquo)
Segmentation rules
bull End of segment rules (common to the default settings of all tools)bull Dot at the end of a sentence (not after known
abbreviations)bull Question mark exclamation markbull Paragraph markbull Colon
bull End of segment rules (different for different tools)bull Semicolonbull Tab characterbull Sub segments (index entries footnotes
graphics)
Comparison of default rules
Workbench Transit DV SDLX Across
Colon end end end no end no end
Semi-
colon
no end end end no end no end
Tab end no end no end no end no end
Soft
return
no end no end end in
Word no
end in
PPT
end in
Word no
end in
PPT
no end
What can SRX do and what not
bull It can only show the segmentation rule settings at the time of export
bull It cannot show any changes that have been applied in the segmentation rules during the use of the TM
bull Sometimes the rules from system 1 cannot be re-created in system 2 then the rule will be ignored
TBX ndash TermBase Exchange
bull From the working draft of the TBX specificationbull TBX is an open XML-based standard format for terminological data
bull TBX is designed to support the analysis representation dissemination and exchange of information from human-oriented terminological databases (termbases)
bull TBX is built on the basis of ISO 12620 (data categories) and ISO 12200 (MARTIF core structure)
TMX TBX
Zerfasszaacde 35
Term in English
Term in French
Global information in entry head
Information on term level
Administrative data of this language
Language ID
Language ID
Where could you use TBX
bull Exchange terminological databull From term base to source language checking toolsbull Between term bases of translation memory toolsbull From term base to terminology checking toolsbull From term base to terminology extraction tools (as stop word lists)bull From term base to dictionary of a machine translation tool
bull For indexing keywords in document management systems content management systems knowledge management systems
bull Publishing terminological data on the Intranet Internet
bull Optimization of search enginges text mining by searching for synonyms automatically
XLIFF ndash XML Localization Interchange File Format
bull The XLIFF format aims to bull separate localizable text from formatting (deal with one file
format in translation instead of different processes to extract filter convert text from different file formats)
bull enable multiple tools to work on the source textbull store information that is helpful in supporting a localization
process (like meta data on versions of source and target segemtns)
bull An XLIFF file is bilingual and can be the container for a number of individual files
bull Each file element in an XLIFF file contains a header (project data such as contact information project phases pointers to reference material and information on the skeleton file) and a body section with the actual text segments
XLIFF
bull XLIFF can carry several translation matches
bull Additional fields can contain context author creation tool historyhellip
lttrans-unit id=n1gtltsourcegtThis is a sentenceltsourcegtlttarget xmllang=degtDies ist ein
Satzlttargetgtltalt-trans match-
quality=100 tool=TM_SystemgtltsourcegtThis is a sentenceltsourcegtlttarget xmllang=degtDas ist ein
Satzlttargetgtltalt-transgtltalt-trans match-
quality=70 tool=TM_SystemgtltsourcegtThis is a short sentenceltsourcegtlttarget xmllang=degtDies ist ein kurzer
Satzlttargetgtltalt-transgtlttrans-unitgt
lttrans-unit id=1 resname=IDCANCEL restype=button coord=885014gt
ltsource xmllang=engtCancelltsourcegtlttrans-unitgt
bull Information on name of a button and its coordinates (for visual representation of the button in a localization tool)
Where is XLIFF useful
bull Where experience with XML exists
bull Projects contain many different file formats
bull All formats are converted to XLIFF for translation
bull Different tools need to be used during localization
bull Different translations (alt-trans) or languages needed as reference
Any idea why XLIFF should NOT be the cure for everything
bull Instead of developing parsers for different file
formats (to read in the file into a translation tool)
developers now need to create parsers to convert
those file formats to XLIFF
bull Some file formats already can be dealt with
(Office HTML XMLhellip) ndash why should a new parser
be created for those
bull XLIFF has its variants like all XML-based files ndashhow can you make sure that each tool can process any of the possible XLIFF flavours
Q1
Open-Closed
Good
Q2
Open-Open
Good
Q3
Proprietary-Closed
Bad
Q4
Proprietary-Open
Wild
The magic quadrant again justto remember the distinction
Open Standards
Open Source
Closed Source
Proprietary ways
6 Talking Legalesein OSS
OSS (Open Source Software)
bull Copyleft and Permissive Licensing
bull GPL (General Public License by FSF)
bull BSD (Berkeley Software Distribution)
bull Apache License 20
bull MS open source licensing
bull Ms-PL (Public License)
bull Ms-RL (Reciprocal License)
DerivedDerivative works vs dynamic linkage
bull Paradigmatic case is app running on OS
Contributions
Distribution
6 Talking Legalesein standards
RF (Royalty Free) vs RAND (Reasonable and Non-Discriminatory)
Standards can be licensed under different terms
bull RF
bull RAND ndash there is no general consensus up to date on whether RAND is Open or Proprietary
bull Proprietary
By open standard we mean just RF Standards
The opposite of Open Standard is any proprietary way of doing things be it standardized
7 Open Tools
bull Infrastructure baseline
bull PostgreSQL MySQL Open ACS Petri Nets Alfresco jBoss
bull TM Server
bull TinyTM
bull Workflow capabilities through any CALPMS
bull Open ]po[ LRN Closed source AIT Projetex LTC Worx Plunet Business Manager etc
bull CAT Clients
bull OmegaT Metatexis TinyTM Word macro Multilizer andwhoever wants to implement the protocol
bull Filters
bull OKAPI framework mainstream XLIFF generators such as Sun or Adobe
8 Use Cases - odt + OmegaT
bull Level 2 TMX supportbull Has TinyTM connector in alphabull Has plaintext glossary support
OpenOfficeorgodt
8 Use cases - TinyTM
bull TinyTM protocol is licensed under the rdquoLGPL V21or higherldquo
bull Rest of the TinyTM code is licensed under therdquoGPL V20 or higherldquo
bull Translation clients can be licensed underany license
bull TinyTM documentation is licensed under the Creative Commons Attribution License
8 Use Cases - TinyTM - Status
bull Alpha functionality
bull Java TMX importer
bull Mark up parsing logic
-------------------------bull Protocol freeze
in public discussionon Sourceforge
bull Protocol featuresbull Conservative extension with many important enhancements
8 Use CasesTinyTM ndash new protocol features
bull Industry and domain taxonomybull open to accommodate future TDA taxonomy
research
bull Inline markup handling1 stripped plain version2 normalized version with formatting
placeholders i3 TMX version with full TMX markup
bull Advanced leverage functionsbull hash based context storagebull hash and trigram enhanced fuzzy search
9 Discussion
Thanks for your attention
copy 2009 Moravia IT as and Angelika Zerfass
A Guide to Open Standards and Open Source
A Conceptual Case Study
Angelika ZerfasszerfasszaacdeDavid Filip PhD
davidfmoraviaworldwidecom
10 References
Open Standards (selective)
bull XLIFF 12 httpdocsoasis-openorgxliffv12osxliff-corehtml
bull TMX 14b httpwwwlisaorgfileadminstandardstmx14tmxhtm
bull SRX 20 httpwwwlisaorgfileadminstandardssrx20html
Open Tools (selective)
bull OKAPI Framework httpokapisourceforgenet
bull OmegaT httpsourceforgenetprojectsomegat
bull OmegaT+ httpomegatplussourceforgenet
bull Open ACS httpopenacsorg
bull PostgreSQL httpwwwpostgresqlorg
bull TinyTM httptinytmsourceforgenet
10 References
Tools continued
bull XML to XLIFF to XML httpssourceforgenetprojectsxliffroundtrip
bull TMX Complicance Kit (test for TMX certification)httpwwwlisaorgTranslation-Memory-e340html
bull TBX CheckerhttpwwwlisaorgTBX-Resources6500html
Standards continued
bull Unicode Standard Annex 29 httpwwwunicodeorgreportstr29
bull W3C ITS httpwwww3orgTRits
Why SRX
bull Tool Abull Semicolon is end of segment
bull This is a sentence this is another sentence
bull TM system sees two separate segments
bull Tool Bbull Semicolon is NOT end of segment
bull This is a sentence this is another sentence
bull TM system sees one segmentbull No match from the TMX data
bull Match rate around 50 usual setting around 70
Segmentation rules
bull Rules that the tool applies to the text to translate to split it up into segments
bull paragraph
bull sentence
bull phrase
bull incomplete sentences in bulleted lists
bull single words (headings ldquoNoterdquo ldquoAttentionrdquo)
Segmentation rules
bull End of segment rules (common to the default settings of all tools)bull Dot at the end of a sentence (not after known
abbreviations)bull Question mark exclamation markbull Paragraph markbull Colon
bull End of segment rules (different for different tools)bull Semicolonbull Tab characterbull Sub segments (index entries footnotes
graphics)
Comparison of default rules
Workbench Transit DV SDLX Across
Colon end end end no end no end
Semi-
colon
no end end end no end no end
Tab end no end no end no end no end
Soft
return
no end no end end in
Word no
end in
PPT
end in
Word no
end in
PPT
no end
What can SRX do and what not
bull It can only show the segmentation rule settings at the time of export
bull It cannot show any changes that have been applied in the segmentation rules during the use of the TM
bull Sometimes the rules from system 1 cannot be re-created in system 2 then the rule will be ignored
TBX ndash TermBase Exchange
bull From the working draft of the TBX specificationbull TBX is an open XML-based standard format for terminological data
bull TBX is designed to support the analysis representation dissemination and exchange of information from human-oriented terminological databases (termbases)
bull TBX is built on the basis of ISO 12620 (data categories) and ISO 12200 (MARTIF core structure)
TMX TBX
Zerfasszaacde 35
Term in English
Term in French
Global information in entry head
Information on term level
Administrative data of this language
Language ID
Language ID
Where could you use TBX
bull Exchange terminological databull From term base to source language checking toolsbull Between term bases of translation memory toolsbull From term base to terminology checking toolsbull From term base to terminology extraction tools (as stop word lists)bull From term base to dictionary of a machine translation tool
bull For indexing keywords in document management systems content management systems knowledge management systems
bull Publishing terminological data on the Intranet Internet
bull Optimization of search enginges text mining by searching for synonyms automatically
XLIFF ndash XML Localization Interchange File Format
bull The XLIFF format aims to bull separate localizable text from formatting (deal with one file
format in translation instead of different processes to extract filter convert text from different file formats)
bull enable multiple tools to work on the source textbull store information that is helpful in supporting a localization
process (like meta data on versions of source and target segemtns)
bull An XLIFF file is bilingual and can be the container for a number of individual files
bull Each file element in an XLIFF file contains a header (project data such as contact information project phases pointers to reference material and information on the skeleton file) and a body section with the actual text segments
XLIFF
bull XLIFF can carry several translation matches
bull Additional fields can contain context author creation tool historyhellip
lttrans-unit id=n1gtltsourcegtThis is a sentenceltsourcegtlttarget xmllang=degtDies ist ein
Satzlttargetgtltalt-trans match-
quality=100 tool=TM_SystemgtltsourcegtThis is a sentenceltsourcegtlttarget xmllang=degtDas ist ein
Satzlttargetgtltalt-transgtltalt-trans match-
quality=70 tool=TM_SystemgtltsourcegtThis is a short sentenceltsourcegtlttarget xmllang=degtDies ist ein kurzer
Satzlttargetgtltalt-transgtlttrans-unitgt
lttrans-unit id=1 resname=IDCANCEL restype=button coord=885014gt
ltsource xmllang=engtCancelltsourcegtlttrans-unitgt
bull Information on name of a button and its coordinates (for visual representation of the button in a localization tool)
Where is XLIFF useful
bull Where experience with XML exists
bull Projects contain many different file formats
bull All formats are converted to XLIFF for translation
bull Different tools need to be used during localization
bull Different translations (alt-trans) or languages needed as reference
Any idea why XLIFF should NOT be the cure for everything
bull Instead of developing parsers for different file
formats (to read in the file into a translation tool)
developers now need to create parsers to convert
those file formats to XLIFF
bull Some file formats already can be dealt with
(Office HTML XMLhellip) ndash why should a new parser
be created for those
bull XLIFF has its variants like all XML-based files ndashhow can you make sure that each tool can process any of the possible XLIFF flavours
Q1
Open-Closed
Good
Q2
Open-Open
Good
Q3
Proprietary-Closed
Bad
Q4
Proprietary-Open
Wild
The magic quadrant again justto remember the distinction
Open Standards
Open Source
Closed Source
Proprietary ways
6 Talking Legalesein OSS
OSS (Open Source Software)
bull Copyleft and Permissive Licensing
bull GPL (General Public License by FSF)
bull BSD (Berkeley Software Distribution)
bull Apache License 20
bull MS open source licensing
bull Ms-PL (Public License)
bull Ms-RL (Reciprocal License)
DerivedDerivative works vs dynamic linkage
bull Paradigmatic case is app running on OS
Contributions
Distribution
6 Talking Legalesein standards
RF (Royalty Free) vs RAND (Reasonable and Non-Discriminatory)
Standards can be licensed under different terms
bull RF
bull RAND ndash there is no general consensus up to date on whether RAND is Open or Proprietary
bull Proprietary
By open standard we mean just RF Standards
The opposite of Open Standard is any proprietary way of doing things be it standardized
7 Open Tools
bull Infrastructure baseline
bull PostgreSQL MySQL Open ACS Petri Nets Alfresco jBoss
bull TM Server
bull TinyTM
bull Workflow capabilities through any CALPMS
bull Open ]po[ LRN Closed source AIT Projetex LTC Worx Plunet Business Manager etc
bull CAT Clients
bull OmegaT Metatexis TinyTM Word macro Multilizer andwhoever wants to implement the protocol
bull Filters
bull OKAPI framework mainstream XLIFF generators such as Sun or Adobe
8 Use Cases - odt + OmegaT
bull Level 2 TMX supportbull Has TinyTM connector in alphabull Has plaintext glossary support
OpenOfficeorgodt
8 Use cases - TinyTM
bull TinyTM protocol is licensed under the rdquoLGPL V21or higherldquo
bull Rest of the TinyTM code is licensed under therdquoGPL V20 or higherldquo
bull Translation clients can be licensed underany license
bull TinyTM documentation is licensed under the Creative Commons Attribution License
8 Use Cases - TinyTM - Status
bull Alpha functionality
bull Java TMX importer
bull Mark up parsing logic
-------------------------bull Protocol freeze
in public discussionon Sourceforge
bull Protocol featuresbull Conservative extension with many important enhancements
8 Use CasesTinyTM ndash new protocol features
bull Industry and domain taxonomybull open to accommodate future TDA taxonomy
research
bull Inline markup handling1 stripped plain version2 normalized version with formatting
placeholders i3 TMX version with full TMX markup
bull Advanced leverage functionsbull hash based context storagebull hash and trigram enhanced fuzzy search
9 Discussion
Thanks for your attention
copy 2009 Moravia IT as and Angelika Zerfass
A Guide to Open Standards and Open Source
A Conceptual Case Study
Angelika ZerfasszerfasszaacdeDavid Filip PhD
davidfmoraviaworldwidecom
10 References
Open Standards (selective)
bull XLIFF 12 httpdocsoasis-openorgxliffv12osxliff-corehtml
bull TMX 14b httpwwwlisaorgfileadminstandardstmx14tmxhtm
bull SRX 20 httpwwwlisaorgfileadminstandardssrx20html
Open Tools (selective)
bull OKAPI Framework httpokapisourceforgenet
bull OmegaT httpsourceforgenetprojectsomegat
bull OmegaT+ httpomegatplussourceforgenet
bull Open ACS httpopenacsorg
bull PostgreSQL httpwwwpostgresqlorg
bull TinyTM httptinytmsourceforgenet
10 References
Tools continued
bull XML to XLIFF to XML httpssourceforgenetprojectsxliffroundtrip
bull TMX Complicance Kit (test for TMX certification)httpwwwlisaorgTranslation-Memory-e340html
bull TBX CheckerhttpwwwlisaorgTBX-Resources6500html
Standards continued
bull Unicode Standard Annex 29 httpwwwunicodeorgreportstr29
bull W3C ITS httpwwww3orgTRits
Segmentation rules
bull Rules that the tool applies to the text to translate to split it up into segments
bull paragraph
bull sentence
bull phrase
bull incomplete sentences in bulleted lists
bull single words (headings ldquoNoterdquo ldquoAttentionrdquo)
Segmentation rules
bull End of segment rules (common to the default settings of all tools)bull Dot at the end of a sentence (not after known
abbreviations)bull Question mark exclamation markbull Paragraph markbull Colon
bull End of segment rules (different for different tools)bull Semicolonbull Tab characterbull Sub segments (index entries footnotes
graphics)
Comparison of default rules
            | Workbench | Transit | DV                         | SDLX                       | Across
Colon       | end       | end     | end                        | no end                     | no end
Semicolon   | no end    | end     | end                        | no end                     | no end
Tab         | end       | no end  | no end                     | no end                     | no end
Soft return | no end    | no end  | end in Word, no end in PPT | end in Word, no end in PPT | no end
What can SRX do, and what can it not do?
• It can only show the segmentation rule settings at the time of export
• It cannot show any changes that were made to the segmentation rules while the TM was in use
• Sometimes a rule from system 1 cannot be re-created in system 2; the rule is then ignored
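The last point can be illustrated with a small sketch that reads break and no-break rules from an SRX 2.0 fragment and applies them with Python's `re` module. The sample rules are invented for this example. The third rule uses `\p{Lu}`, which Python's `re` does not support, so the importing side here skips it, much as a receiving tool might.

```python
import re
import xml.etree.ElementTree as ET

SRX_NS = {"srx": "http://www.lisa.org/srx20"}

# Invented sample rules, not taken from any real tool export
SRX_SAMPLE = r"""<srx xmlns="http://www.lisa.org/srx20" version="2.0">
  <header segmentsubflows="yes" cascade="no"/>
  <body>
    <languagerules>
      <languagerule languagerulename="Default">
        <rule break="no">
          <beforebreak>\b(Mr|Dr)\.</beforebreak>
          <afterbreak>\s</afterbreak>
        </rule>
        <rule break="yes">
          <beforebreak>[.?!]</beforebreak>
          <afterbreak>\s</afterbreak>
        </rule>
        <rule break="yes">
          <beforebreak>\p{Lu}:</beforebreak>
          <afterbreak>\s</afterbreak>
        </rule>
      </languagerule>
    </languagerules>
    <maprules>
      <languagemap languagepattern=".*" languagerulename="Default"/>
    </maprules>
  </body>
</srx>"""


def load_rules(srx_text):
    """Return (rules, ignored): compiled rules plus the number of rules that were skipped."""
    root = ET.fromstring(srx_text)
    rules, ignored = [], 0
    for rule in root.iterfind(".//srx:rule", SRX_NS):
        before = rule.findtext("srx:beforebreak", default="", namespaces=SRX_NS)
        after = rule.findtext("srx:afterbreak", default="", namespaces=SRX_NS)
        try:
            compiled = (rule.get("break", "yes") == "yes",
                        re.compile(f"(?:{before})\\Z"),
                        re.compile(after))
        except re.error:
            ignored += 1  # the rule cannot be re-created in this regex engine
            continue
        rules.append(compiled)
    return rules, ignored


def segment(text, rules):
    """Test every position; the first rule that matches there decides break or no break."""
    segments, start = [], 0
    for pos in range(1, len(text)):
        for is_break, before, after in rules:
            if before.search(text[:pos]) and after.match(text[pos:]):
                if is_break:
                    segments.append(text[start:pos].strip())
                    start = pos
                break
    segments.append(text[start:].strip())
    return segments


rules, ignored = load_rules(SRX_SAMPLE)
result = segment("Dr. Smith arrived. He sat down.", rules)
```

Two of the three rules survive the import and one is dropped silently, so the segmentation on the receiving side can differ from the exporting tool without any visible error.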
TBX – TermBase eXchange
• From the working draft of the TBX specification:
  • TBX is an open, XML-based standard format for terminological data
• TBX is designed to support the analysis, representation, dissemination and exchange of information from human-oriented terminological databases (termbases)
• TBX is built on the basis of ISO 12620 (data categories) and ISO 12200 (MARTIF core structure)
TMX TBX
[Figure: annotated TBX entry. Labels: global information in entry head; language ID (one per language); term in English; term in French; information on term level; administrative data of this language]
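The structure described by the labels above (entry-level information, one block per language, term-level information) can be sketched by reading a small TBX-style entry. The sample entry and the `read_entry` helper are made up for illustration, and only a few elements (`termEntry`, `descrip`, `langSet`, `tig`, `term`, `termNote`) are used.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the XML namespace
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Made-up sample entry for illustration
TBX_ENTRY = """<termEntry id="c1">
  <descrip type="subjectField">user interface</descrip>
  <langSet xml:lang="en">
    <tig>
      <term>button</term>
      <termNote type="partOfSpeech">noun</termNote>
    </tig>
  </langSet>
  <langSet xml:lang="fr">
    <tig>
      <term>bouton</term>
      <termNote type="partOfSpeech">noun</termNote>
    </tig>
  </langSet>
</termEntry>"""


def read_entry(xml_text):
    """Collect entry-level data and, per language ID, the term with its term-level notes."""
    entry = ET.fromstring(xml_text)
    result = {
        "id": entry.get("id"),
        "global": {d.get("type"): d.text for d in entry.findall("descrip")},
        "terms": {},
    }
    for lang_set in entry.findall("langSet"):
        tig = lang_set.find("tig")
        result["terms"][lang_set.get(XML_LANG)] = {
            "term": tig.findtext("term"),
            "notes": {n.get("type"): n.text for n in tig.findall("termNote")},
        }
    return result


entry = read_entry(TBX_ENTRY)
```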
Where could you use TBX
bull Exchange terminological databull From term base to source language checking toolsbull Between term bases of translation memory toolsbull From term base to terminology checking toolsbull From term base to terminology extraction tools (as stop word lists)bull From term base to dictionary of a machine translation tool
bull For indexing keywords in document management systems content management systems knowledge management systems
bull Publishing terminological data on the Intranet Internet
bull Optimization of search enginges text mining by searching for synonyms automatically
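As one example of these uses, a terminology checking tool could take term pairs exported from a term base and flag translations that do not use the expected target term. The sketch below assumes plain case-insensitive substring matching and a made-up two-entry term list. A real checker would also need to handle inflected forms.

```python
def check_terminology(source, target, termbase):
    """Return (source term, expected target term) pairs missing from the translation."""
    missing = []
    for src_term, tgt_term in termbase.items():
        in_source = src_term.lower() in source.lower()
        in_target = tgt_term.lower() in target.lower()
        if in_source and not in_target:
            missing.append((src_term, tgt_term))
    return missing


# Hypothetical English-French term pairs, as they might come out of a TBX file
termbase = {"button": "bouton", "cancel": "Annuler"}
issues = check_terminology(
    "Click the Cancel button",
    "Cliquez sur la touche Annuler",
    termbase,
)
```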
XLIFF – XML Localization Interchange File Format
• The XLIFF format aims to
  • separate localizable text from formatting (deal with one file format in translation instead of different processes to extract, filter and convert text from different file formats)
  • enable multiple tools to work on the source text
  • store information that is helpful in supporting a localization process (such as metadata on versions of source and target segments)
• An XLIFF file is bilingual and can be the container for a number of individual files
• Each file element in an XLIFF file contains a header (project data such as contact information, project phases, pointers to reference material and information on the skeleton file) and a body section with the actual text segments
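The container structure described above can be sketched by building a minimal XLIFF 1.2 skeleton with one `file` element per original document, each with a `header` and a `body`. The `build_xliff` helper, the file names and the fixed language pair are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2"
ET.register_namespace("", XLIFF_NS)  # serialize XLIFF as the default namespace


def tag(name):
    """Qualify an element name with the XLIFF 1.2 namespace."""
    return f"{{{XLIFF_NS}}}{name}"


def build_xliff(files):
    """files maps an original file name to a list of (id, source text) pairs."""
    root = ET.Element(tag("xliff"), {"version": "1.2"})
    for original, units in files.items():
        file_el = ET.SubElement(root, tag("file"), {
            "original": original,
            "source-language": "en",   # assumed language pair
            "target-language": "de",
            "datatype": "plaintext",
        })
        ET.SubElement(file_el, tag("header"))  # project data would go here
        body = ET.SubElement(file_el, tag("body"))
        for unit_id, text in units:
            unit = ET.SubElement(body, tag("trans-unit"), {"id": unit_id})
            ET.SubElement(unit, tag("source")).text = text
    return root


root = build_xliff({
    "readme.txt": [("1", "This is a sentence")],
    "dialog.rc": [("1", "Cancel")],
})
xliff_text = ET.tostring(root, encoding="unicode")
```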
XLIFF
• XLIFF can carry several translation matches
• Additional fields can contain context, author, creation tool, history…
    <trans-unit id="n1">
      <source>This is a sentence</source>
      <target xml:lang="de">Dies ist ein Satz</target>
      <alt-trans match-quality="100" tool="TM_System">
        <source>This is a sentence</source>
        <target xml:lang="de">Das ist ein Satz</target>
      </alt-trans>
      <alt-trans match-quality="70" tool="TM_System">
        <source>This is a short sentence</source>
        <target xml:lang="de">Dies ist ein kurzer Satz</target>
      </alt-trans>
    </trans-unit>
    <trans-unit id="1" resname="IDCANCEL" restype="button" coord="885014">
      <source xml:lang="en">Cancel</source>
    </trans-unit>
• Information on the name of a button and its coordinates (for visual representation of the button in a localization tool)
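A tool that receives a trans-unit like the first one above might rank the alternative matches by their match-quality attribute. The following sketch assumes the XLIFF 1.2 namespace is declared on the unit. The `ranked_matches` helper is hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {"x": "urn:oasis:names:tc:xliff:document:1.2"}

UNIT = """<trans-unit xmlns="urn:oasis:names:tc:xliff:document:1.2" id="n1">
  <source>This is a sentence</source>
  <target xml:lang="de">Dies ist ein Satz</target>
  <alt-trans match-quality="100" tool="TM_System">
    <source>This is a sentence</source>
    <target xml:lang="de">Das ist ein Satz</target>
  </alt-trans>
  <alt-trans match-quality="70" tool="TM_System">
    <source>This is a short sentence</source>
    <target xml:lang="de">Dies ist ein kurzer Satz</target>
  </alt-trans>
</trans-unit>"""


def ranked_matches(unit_xml):
    """Return (quality, source, target) for each alt-trans, best match first."""
    unit = ET.fromstring(unit_xml)
    matches = []
    for alt in unit.findall("x:alt-trans", NS):
        # match-quality may be written with a trailing percent sign
        quality = float(alt.get("match-quality", "0").rstrip("%"))
        matches.append((
            quality,
            alt.findtext("x:source", namespaces=NS),
            alt.findtext("x:target", namespaces=NS),
        ))
    return sorted(matches, key=lambda m: m[0], reverse=True)


ranked = ranked_matches(UNIT)
```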
Where is XLIFF useful?
• Where experience with XML exists
• Projects contain many different file formats
  • All formats are converted to XLIFF for translation
• Different tools need to be used during localization
• Different translations (alt-trans) or languages are needed as reference
Any idea why XLIFF should NOT be the cure for everything?
• Instead of developing parsers for different file formats (to read the file into a translation tool), developers now need to create parsers to convert those file formats to XLIFF
• Some file formats can already be dealt with (Office, HTML, XML…) – why should a new parser be created for those?
• XLIFF has its variants, like all XML-based files – how can you make sure that each tool can process any of the possible XLIFF flavours?
The magic quadrant again, just to remember the distinction
                 | Closed Source              | Open Source
Open Standards   | Q1 Open-Closed: Good       | Q2 Open-Open: Good
Proprietary ways | Q3 Proprietary-Closed: Bad | Q4 Proprietary-Open: Wild
6 Talking Legalese in OSS
OSS (Open Source Software)
• Copyleft and Permissive Licensing
  • GPL (General Public License by FSF)
  • BSD (Berkeley Software Distribution)
  • Apache License 2.0
• MS open source licensing
  • Ms-PL (Public License)
  • Ms-RL (Reciprocal License)
Derived/Derivative works vs. dynamic linkage
• Paradigmatic case is an app running on an OS
Contributions
Distribution
6 Talking Legalese in standards
RF (Royalty Free) vs. RAND (Reasonable and Non-Discriminatory)
Standards can be licensed under different terms
• RF
• RAND – there is no general consensus to date on whether RAND is Open or Proprietary
• Proprietary
By open standard we mean just RF standards
The opposite of Open Standard is any proprietary way of doing things, even if it is standardized
7 Open Tools
• Infrastructure baseline
  • PostgreSQL, MySQL, Open ACS, Petri Nets, Alfresco, JBoss
• TM Server
  • TinyTM
• Workflow capabilities through any CALPMS
  • Open: ]po[, .LRN; Closed source: AIT Projetex, LTC Worx, Plunet Business Manager, etc.
• CAT Clients
  • OmegaT, MetaTexis, TinyTM Word macro, Multilizer and whoever wants to implement the protocol
• Filters
  • OKAPI framework, mainstream XLIFF generators such as Sun or Adobe
8 Use Cases - odt + OmegaT
• Level 2 TMX support
• Has TinyTM connector in alpha
• Has plaintext glossary support
OpenOffice.org .odt
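A plaintext glossary is easy to produce and consume with a few lines of code. The sketch below assumes a tab-separated layout with source term, target term and an optional comment per line. That column layout and the sample entries are assumptions for illustration.

```python
import csv
import io

# Assumed layout: source term <TAB> target term <TAB> optional comment
GLOSSARY_TEXT = "button\tSchaltfläche\tUI element\ncancel\tAbbrechen\n"


def read_glossary(stream):
    """Return (source, target, comment) tuples; rows with fewer than two columns are skipped."""
    entries = []
    for row in csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) < 2:
            continue
        comment = row[2] if len(row) > 2 else ""
        entries.append((row[0], row[1], comment))
    return entries


glossary = read_glossary(io.StringIO(GLOSSARY_TEXT))
```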
8 Use Cases - TinyTM
• TinyTM protocol is licensed under the “LGPL V2.1 or higher”
• Rest of the TinyTM code is licensed under the “GPL V2.0 or higher”
• Translation clients can be licensed under any license
• TinyTM documentation is licensed under the Creative Commons Attribution License
8 Use Cases - TinyTM - Status
• Alpha functionality
  • Java TMX importer
  • Mark-up parsing logic
-------------------------
• Protocol freeze in public discussion on SourceForge
• Protocol features
  • Conservative extension with many important enhancements
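What a TMX importer with mark-up parsing has to do can be sketched as follows. This is not the TinyTM Java importer. It is a Python illustration that reads translation units from a TMX 1.4 sample and derives a plain version and a placeholder version of each segment. The sample content and the `{n}` placeholder syntax are assumptions.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the XML namespace
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Made-up sample with Level 2 inline markup (bpt/ept)
TMX_SAMPLE = """<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1.0" segtype="sentence"
          o-tmf="example" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Click <bpt i="1">&lt;b&gt;</bpt>Cancel<ept i="1">&lt;/b&gt;</ept></seg></tuv>
      <tuv xml:lang="de"><seg>Klicken Sie auf <bpt i="1">&lt;b&gt;</bpt>Abbrechen<ept i="1">&lt;/b&gt;</ept></seg></tuv>
    </tu>
  </body>
</tmx>"""


def plain_text(seg):
    """Text only: the native codes inside inline elements are dropped."""
    return (seg.text or "") + "".join(child.tail or "" for child in seg)


def with_placeholders(seg):
    """Each inline element is replaced by a numbered placeholder such as {1}."""
    parts = [seg.text or ""]
    for number, child in enumerate(seg, start=1):
        parts.append(f"{{{number}}}")
        parts.append(child.tail or "")
    return "".join(parts)


def import_tmx(tmx_text):
    """Return one dict per translation unit, keyed by language code."""
    units = []
    for tu in ET.fromstring(tmx_text).iter("tu"):
        unit = {}
        for tuv in tu.findall("tuv"):
            seg = tuv.find("seg")
            unit[tuv.get(XML_LANG)] = {
                "plain": plain_text(seg),
                "normalized": with_placeholders(seg),
            }
        units.append(unit)
    return units


units = import_tmx(TMX_SAMPLE)
```

Storing both derived versions next to the original segment lets a server look up matches on the plain text while still being able to return the formatting to clients that can handle it.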
TBX ndash TermBase Exchange
bull From the working draft of the TBX specificationbull TBX is an open XML-based standard format for terminological data
bull TBX is designed to support the analysis representation dissemination and exchange of information from human-oriented terminological databases (termbases)
bull TBX is built on the basis of ISO 12620 (data categories) and ISO 12200 (MARTIF core structure)
TMX TBX
Zerfasszaacde 35
Term in English
Term in French
Global information in entry head
Information on term level
Administrative data of this language
Language ID
Language ID
Where could you use TBX?
• Exchange terminological data
• From term base to source language checking tools
• Between term bases of translation memory tools
• From term base to terminology checking tools
• From term base to terminology extraction tools (as stop word lists)
• From term base to dictionary of a machine translation tool
• For indexing keywords in document management systems, content management systems, knowledge management systems
• Publishing terminological data on the intranet / Internet
• Optimization of search engines / text mining by searching for synonyms automatically
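As an illustration of the "term base to terminology checking tool" case, the following Python sketch flags segments in which a source term occurs but its approved translation does not. The termbase contents and the function name are hypothetical, and the matching is deliberately naive (no inflection handling).

```python
import re

# Hypothetical termbase extract: source term -> approved target term.
TERMBASE = {"printer": "Drucker", "paper tray": "Papierfach"}


def check_terminology(source, target, termbase):
    """Return source terms whose approved translation is missing from target."""
    missing = []
    for src_term, tgt_term in termbase.items():
        # Whole-word, case-insensitive match on the source side.
        pattern = r"\b" + re.escape(src_term) + r"\b"
        if re.search(pattern, source, re.IGNORECASE):
            # Plain substring test on the target side (assumption: good enough
            # for a sketch; real checkers handle inflected forms).
            if tgt_term.lower() not in target.lower():
                missing.append(src_term)
    return missing


print(check_terminology("Switch off the printer",
                        "Schalten Sie das Gerät aus", TERMBASE))
```

In practice the dictionary would be filled from the term entries of a TBX export rather than written by hand.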
XLIFF – XML Localization Interchange File Format
• The XLIFF format aims to
• separate localizable text from formatting (deal with one file format in translation instead of different processes to extract / filter / convert text from different file formats)
• enable multiple tools to work on the source text
• store information that is helpful in supporting a localization process (like metadata on versions of source and target segments)
• An XLIFF file is bilingual and can be the container for a number of individual files
• Each file element in an XLIFF file contains a header (project data such as contact information, project phases, pointers to reference material, and information on the skeleton file) and a body section with the actual text segments
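The container structure described above can be made concrete with a small Python sketch. It assumes an XLIFF 1.2 document with two file elements; the file names, skeleton reference, and segments are invented for the example, and summarize is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

NS = {"x": "urn:oasis:names:tc:xliff:document:1.2"}

# Illustrative XLIFF 1.2 document: one container, two original files.
XLIFF_DOC = """
<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <file original="menu.txt" source-language="en" target-language="de" datatype="plaintext">
    <header>
      <skl><external-file href="menu.txt.skl"/></skl>
    </header>
    <body>
      <trans-unit id="1"><source>Cancel</source><target>Abbrechen</target></trans-unit>
    </body>
  </file>
  <file original="help.html" source-language="en" target-language="de" datatype="html">
    <header/>
    <body>
      <trans-unit id="1"><source>Help</source><target>Hilfe</target></trans-unit>
      <trans-unit id="2"><source>Index</source><target>Index</target></trans-unit>
    </body>
  </file>
</xliff>
"""


def summarize(xliff_text):
    """List each file element with its language pair, skeleton, and segment count."""
    root = ET.fromstring(xliff_text)
    summary = []
    for f in root.findall("x:file", NS):
        skl = f.find("x:header/x:skl/x:external-file", NS)
        units = f.findall("x:body/x:trans-unit", NS)
        summary.append({
            "original": f.get("original"),
            "languages": (f.get("source-language"), f.get("target-language")),
            "skeleton": skl.get("href") if skl is not None else None,
            "segments": len(units),
        })
    return summary


for entry in summarize(XLIFF_DOC):
    print(entry)
```

Each file element carries exactly one language pair, which is what "bilingual" means here.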
XLIFF
• XLIFF can carry several translation matches
• Additional fields can contain context, author, creation tool, history…
<trans-unit id="n1">
  <source>This is a sentence</source>
  <target xml:lang="de">Dies ist ein Satz</target>
  <alt-trans match-quality="100" tool="TM_System">
    <source>This is a sentence</source>
    <target xml:lang="de">Das ist ein Satz</target>
  </alt-trans>
  <alt-trans match-quality="70" tool="TM_System">
    <source>This is a short sentence</source>
    <target xml:lang="de">Dies ist ein kurzer Satz</target>
  </alt-trans>
</trans-unit>
<trans-unit id="1" resname="IDCANCEL" restype="button" coord="8;8;50;14">
  <source xml:lang="en">Cancel</source>
</trans-unit>
• Information on the name of a button and its coordinates (for visual representation of the button in a localization tool)
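How a tool might use the alt-trans matches can be sketched as follows. This is an assumption-laden example: it treats match-quality as a number (XLIFF 1.2 leaves the value tool-dependent), the threshold is arbitrary, and best_match is a hypothetical helper. The trans-unit is shown without the XLIFF namespace, as on the slide.

```python
import xml.etree.ElementTree as ET

TRANS_UNIT = """
<trans-unit id="n1">
  <source>This is a sentence</source>
  <target xml:lang="de">Dies ist ein Satz</target>
  <alt-trans match-quality="100" tool="TM_System">
    <source>This is a sentence</source>
    <target xml:lang="de">Das ist ein Satz</target>
  </alt-trans>
  <alt-trans match-quality="70" tool="TM_System">
    <source>This is a short sentence</source>
    <target xml:lang="de">Dies ist ein kurzer Satz</target>
  </alt-trans>
</trans-unit>
"""


def best_match(unit_xml, threshold=75.0):
    """Return (quality, target text) of the best alt-trans at or above threshold."""
    unit = ET.fromstring(unit_xml)
    candidates = []
    for alt in unit.findall("alt-trans"):
        # Accept both "70" and "70%" style values.
        quality = float(alt.get("match-quality", "0").rstrip("%"))
        if quality >= threshold:
            candidates.append((quality, alt.findtext("target")))
    return max(candidates, key=lambda c: c[0], default=None)


print(best_match(TRANS_UNIT))
```

Lowering the threshold to 70 would keep both proposals as candidates, which is how a translator could be offered the fuzzy match as a reference.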
Where is XLIFF useful?
• Where experience with XML exists
• Projects contain many different file formats
• All formats are converted to XLIFF for translation
• Different tools need to be used during localization
• Different translations (alt-trans) or languages needed as reference
Any idea why XLIFF should NOT be the cure for everything?
• Instead of developing parsers for different file formats (to read the file into a translation tool), developers now need to create parsers to convert those file formats to XLIFF
• Some file formats can already be dealt with (Office, HTML, XML…) – why should a new parser be created for those?
• XLIFF has its variants, like all XML-based files – how can you make sure that each tool can process any of the possible XLIFF flavours?
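One partial answer to the flavour question is to inspect a file before processing it. The sketch below, in Python, reports the declared version and any non-XLIFF namespaces used for extensions. The vendor namespace and the locked attribute are invented for the example; a real pipeline would also validate against the XLIFF schema.

```python
import xml.etree.ElementTree as ET

XLIFF_12_NS = "urn:oasis:names:tc:xliff:document:1.2"
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Illustrative file with a hypothetical vendor extension attribute.
SAMPLE = """
<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2"
       xmlns:vnd="http://example.com/vendor-extension">
  <file original="a.txt" source-language="en" datatype="plaintext">
    <body>
      <trans-unit id="1" vnd:locked="yes"><source>Hello</source></trans-unit>
    </body>
  </file>
</xliff>
"""


def namespace_of(qname):
    """Return the namespace part of an ElementTree '{ns}local' name, or ''."""
    return qname[1:].split("}", 1)[0] if qname.startswith("{") else ""


def inspect_flavour(xliff_text):
    """Report version, core namespace, and extension namespaces in use."""
    root = ET.fromstring(xliff_text)
    foreign = set()
    for elem in root.iter():
        # Check the element name and all of its attribute names.
        for name in [elem.tag] + list(elem.attrib):
            ns = namespace_of(name)
            if ns and ns not in (XLIFF_12_NS, XML_NS):
                foreign.add(ns)
    return {
        "version": root.get("version"),
        "core_namespace_ok": namespace_of(root.tag) == XLIFF_12_NS,
        "extension_namespaces": sorted(foreign),
    }


report = inspect_flavour(SAMPLE)
print(report)
```

A tool that finds unknown extension namespaces can at least warn the user, or pass the foreign attributes through untouched, instead of failing silently.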
The magic quadrant again, just to remember the distinction
• Q1: Open Standards + Closed Source (Open-Closed): Good
• Q2: Open Standards + Open Source (Open-Open): Good
• Q3: Proprietary ways + Closed Source (Proprietary-Closed): Bad
• Q4: Proprietary ways + Open Source (Proprietary-Open): Wild
6 Talking Legalese in OSS
OSS (Open Source Software)
• Copyleft and Permissive Licensing
• GPL (General Public License by FSF)
• BSD (Berkeley Software Distribution)
• Apache License 2.0
• MS open source licensing
• Ms-PL (Public License)
• Ms-RL (Reciprocal License)
Derived/Derivative works vs. dynamic linkage
• Paradigmatic case is an app running on an OS
Contributions
Distribution
6 Talking Legalese in standards
RF (Royalty Free) vs. RAND (Reasonable and Non-Discriminatory)
Standards can be licensed under different terms
• RF
• RAND – there is no general consensus to date on whether RAND is Open or Proprietary
• Proprietary
By open standard we mean just RF standards
The opposite of an Open Standard is any proprietary way of doing things, be it standardized or not
7 Open Tools
• Infrastructure baseline
• PostgreSQL, MySQL, OpenACS, Petri Nets, Alfresco, JBoss
• TM Server
• TinyTM
• Workflow capabilities through any CALPMS
• Open: ]po[, .LRN; Closed source: AIT Projetex, LTC Worx, Plunet Business Manager, etc.
• CAT Clients
• OmegaT, MetaTexis, TinyTM Word macro, Multilizer, and whoever wants to implement the protocol
• Filters
• Okapi Framework, mainstream XLIFF generators such as Sun or Adobe
8 Use Cases - odt + OmegaT
• Level 2 TMX support
• Has TinyTM connector in alpha
• Has plaintext glossary support
OpenOffice.org .odt
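What "Level 2 TMX support" adds over Level 1 can be shown with a short Python sketch: Level 1 exchanges plain text only, while Level 2 also preserves inline markup in elements such as bpt and ept. The segment below is invented for the example and the helper names are hypothetical; this is not OmegaT code.

```python
import xml.etree.ElementTree as ET

# Illustrative TMX translation unit variant with bold formatting around one word.
TMX_TUV = """
<tuv xml:lang="en">
  <seg>Press <bpt i="1">&lt;b&gt;</bpt>Cancel<ept i="1">&lt;/b&gt;</ept> to stop.</seg>
</tuv>
"""


def level1_text(seg):
    """Level 1 view: translatable text only, inline codes dropped."""
    parts = [seg.text or ""]
    for child in seg:
        if child.tag == "hi":
            # hi wraps translatable text, so keep its content.
            parts.append(level1_text(child))
        # The content of bpt/ept/ph/it is native formatting code and is skipped.
        parts.append(child.tail or "")
    return "".join(parts)


def level2_codes(seg):
    """Level 2 view: the inline codes a Level 2 tool must preserve."""
    return [(c.tag, c.text) for c in seg if c.tag in ("bpt", "ept", "ph", "it")]


seg = ET.fromstring(TMX_TUV).find("seg")
print(level1_text(seg))
print(level2_codes(seg))
```

A Level 1 tool exchanging this unit would lose the bold formatting; a Level 2 tool can restore it in the target.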
8 Use Cases - TinyTM
• The TinyTM protocol is licensed under the "LGPL V2.1 or higher"
• The rest of the TinyTM code is licensed under the "GPL V2.0 or higher"
• Translation clients can be licensed under any license
• TinyTM documentation is licensed under the Creative Commons Attribution License
8 Use Cases - TinyTM - Status
• Alpha functionality
• Java TMX importer
• Markup parsing logic
-------------------------
• Protocol freeze in public discussion on SourceForge
• Protocol features
• Conservative extension with many important enhancements
8 Use Cases - TinyTM – new protocol features
• Industry and domain taxonomy
• open to accommodate future TDA taxonomy research
• Inline markup handling
1. stripped plain version
2. normalized version with formatting placeholders
3. TMX version with full TMX markup
• Advanced leverage functions
• hash-based context storage
• hash and trigram enhanced fuzzy search
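The ideas on this slide can be illustrated with a rough Python sketch. It is not TinyTM's implementation or protocol: the placeholder syntax, the use of SHA-1, and the Jaccard measure over character trigrams are all assumptions chosen for the example, and every function name is hypothetical.

```python
import hashlib
import itertools
import re

TAG = re.compile(r"<[^>]+>")  # naive pattern for inline tags


def strip_markup(segment):
    """Variant 1: plain version with inline tags removed."""
    return TAG.sub("", segment)


def normalize_markup(segment):
    """Variant 2: each inline tag replaced by a numbered placeholder."""
    counter = itertools.count(1)
    return TAG.sub(lambda match: "{%d}" % next(counter), segment)


def context_hash(previous_segment, next_segment):
    """Hash of the neighbouring segments, usable as a key for context matches."""
    joined = previous_segment + "\n" + next_segment
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


def trigrams(text):
    """Set of character trigrams of the padded, lower-cased text."""
    padded = "  " + text.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a, b):
    """Jaccard similarity of the two trigram sets, between 0.0 and 1.0."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


segment = "Click <b>Cancel</b> now"
print(strip_markup(segment))
print(normalize_markup(segment))
print(round(trigram_similarity("This is a sentence",
                               "This is a short sentence"), 2))
```

Searching on the stripped version and storing the normalized one lets a fuzzy match succeed even when two segments differ only in formatting, while the context hash distinguishes identical segments that occur in different surroundings.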
9 Discussion
Thanks for your attention
copy 2009 Moravia IT as and Angelika Zerfass
A Guide to Open Standards and Open Source
A Conceptual Case Study
Angelika ZerfasszerfasszaacdeDavid Filip PhD
davidfmoraviaworldwidecom
10 References
Open Standards (selective)
bull XLIFF 12 httpdocsoasis-openorgxliffv12osxliff-corehtml
bull TMX 14b httpwwwlisaorgfileadminstandardstmx14tmxhtm
bull SRX 20 httpwwwlisaorgfileadminstandardssrx20html
Open Tools (selective)
bull OKAPI Framework httpokapisourceforgenet
bull OmegaT httpsourceforgenetprojectsomegat
bull OmegaT+ httpomegatplussourceforgenet
bull Open ACS httpopenacsorg
bull PostgreSQL httpwwwpostgresqlorg
bull TinyTM httptinytmsourceforgenet
10 References
Tools continued
• XML to XLIFF to XML: https://sourceforge.net/projects/xliffroundtrip
• TMX Compliance Kit (test for TMX certification): http://www.lisa.org/Translation-Memory-e.34.0.html
• TBX Checker: httpwwwlisaorgTBX-Resources6500html
Standards continued
• Unicode Standard Annex #29: http://www.unicode.org/reports/tr29
• W3C ITS: http://www.w3.org/TR/its
XLIFF
• XLIFF can carry several translation matches
• Additional fields can contain context, author, creation tool, history...

<trans-unit id="n1">
  <source>This is a sentence</source>
  <target xml:lang="de">Dies ist ein Satz</target>
  <alt-trans match-quality="100" tool="TM_System">
    <source>This is a sentence</source>
    <target xml:lang="de">Das ist ein Satz</target>
  </alt-trans>
  <alt-trans match-quality="70" tool="TM_System">
    <source>This is a short sentence</source>
    <target xml:lang="de">Dies ist ein kurzer Satz</target>
  </alt-trans>
</trans-unit>

<trans-unit id="1" resname="IDCANCEL" restype="button" coord="8;8;50;14">
  <source xml:lang="en">Cancel</source>
</trans-unit>

• Carries the name of a button and its coordinates (for visual representation of the button in a localization tool)
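As a rough illustration of how a tool might consume the translation matches shown above, the following Python sketch reads one trans-unit and lists its alt-trans candidates, best first. It uses only the standard library. The function name list_matches is made up for this example, and treating match-quality as a plain number is an assumption (XLIFF 1.2 allows other strings there). The sample omits the XLIFF namespace, as the slide does; a complete XLIFF file would need namespace-aware lookups.

```python
import xml.etree.ElementTree as ET

# The trans-unit from the slide, with the markup restored.
SAMPLE = """<trans-unit id="n1">
  <source>This is a sentence</source>
  <target xml:lang="de">Dies ist ein Satz</target>
  <alt-trans match-quality="100" tool="TM_System">
    <source>This is a sentence</source>
    <target xml:lang="de">Das ist ein Satz</target>
  </alt-trans>
  <alt-trans match-quality="70" tool="TM_System">
    <source>This is a short sentence</source>
    <target xml:lang="de">Dies ist ein kurzer Satz</target>
  </alt-trans>
</trans-unit>"""


def list_matches(trans_unit_xml):
    """Return (quality, source, target) tuples for every alt-trans, best first.

    Assumes match-quality is a number, optionally followed by '%'.
    """
    unit = ET.fromstring(trans_unit_xml)
    matches = []
    for alt in unit.findall("alt-trans"):
        quality = int(alt.get("match-quality", "0").rstrip("%"))
        matches.append((quality, alt.findtext("source"), alt.findtext("target")))
    matches.sort(key=lambda m: m[0], reverse=True)
    return matches


for quality, source, target in list_matches(SAMPLE):
    print(quality, "|", source, "|", target)
```

A translation editor could show these candidates next to the current target and let the translator pick one, which is the kind of reference use the alt-trans element is meant for.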
Where is XLIFF useful?
• Where experience with XML exists
• Projects contain many different file formats
  • All formats are converted to XLIFF for translation
• Different tools need to be used during localization
• Different translations (alt-trans) or languages are needed as reference
Any idea why XLIFF should NOT be the cure for everything?
• Instead of developing parsers for different file formats (to read the file into a translation tool), developers now need to create parsers to convert those file formats to XLIFF
• Some file formats can already be dealt with (Office, HTML, XML...) - why should a new parser be created for those?
• XLIFF has its variants, like all XML-based files - how can you make sure that each tool can process any of the possible XLIFF flavours?
The magic quadrant again, just to remember the distinction
• Q1: Open Standards, Closed Source - Good
• Q2: Open Standards, Open Source - Good
• Q3: Proprietary ways, Closed Source - Bad
• Q4: Proprietary ways, Open Source - Wild
6 Talking Legalese in OSS
OSS (Open Source Software)
• Copyleft and Permissive Licensing
  • GPL (General Public License by FSF)
  • BSD (Berkeley Software Distribution)
  • Apache License 2.0
• MS open source licensing
  • Ms-PL (Public License)
  • Ms-RL (Reciprocal License)
Derived/Derivative works vs. dynamic linkage
• Paradigmatic case is an app running on an OS
Contributions
Distribution
6 Talking Legalese in standards
RF (Royalty Free) vs. RAND (Reasonable and Non-Discriminatory)
Standards can be licensed under different terms
• RF
• RAND - there is no general consensus to date on whether RAND is Open or Proprietary
• Proprietary
By open standard we mean just RF standards
The opposite of an Open Standard is any proprietary way of doing things, be it standardized or not
7 Open Tools
• Infrastructure baseline
  • PostgreSQL, MySQL, Open ACS, Petri Nets, Alfresco, jBoss
• TM Server
  • TinyTM
• Workflow capabilities through any CALPMS
  • Open: ]po[, .LRN; closed source: AIT Projetex, LTC Worx, Plunet Business Manager, etc.
• CAT Clients
  • OmegaT, MetaTexis, TinyTM Word macro, Multilizer, and whoever wants to implement the protocol
• Filters
  • OKAPI framework, mainstream XLIFF generators such as Sun or Adobe