xml:tm
XML Based Text Memory
Using XML technology to reduce the cost of translating XML documents
27 June 2005
Automating Translation
• Machine Translation
• Translation Memory
• Hybrid Linguistic Inference Engines
• Terminology
Automating Translation
• Machine translation
– 40 year history
– Rigorous control of grammar and terminology can produce good results
– Lots of interesting new developments with hybrid statistical/transfer based systems
– Translation of free format text is theoretically impossible with current technology
Translation Memory
• Align source and target text
• Look up new text against memory
• Relatively primitive technology
• Not much innovation over the past 30 years
• Need for proofing
• Proprietary translation memory formats
Translating XML Documents
• XML inherently easier to translate
• Separation of form and content
• Support for Unicode and other international encoding formats
• Allows multiple output formats - PDF, XHTML, WAP
XML Translation Standards
• LISA - Localization Industry Standards Association: http://www.lisa.org
• OASIS - Organization for the Advancement of Structured Information Standards: http://www.oasis-open.org
• W3C - World Wide Web Consortium: http://www.w3c.org
• OLIF Consortium: http://www.olif.net
LISA Standards
• TMX - Translation Memory Exchange format: http://www.lisa.org/tmx
• TBX - Termbase Exchange format: http://www.lisa.org/tbx
• SRX - Segmentation Rules Exchange format: http://www.lisa.org/srx
• GMX - GILT Metrics Exchange format: http://www.lisa.org/gmx
OASIS L10N Standards
• XLIFF - XML Localization Interchange File Format: http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=xliff
• TransWS - Translation Web Services: http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=trans-ws
• DITA - Darwin Information Typing Architecture: http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=dita
W3C and OLIF
• W3C ITS - Internationalization Tag Set: http://www.w3.org/International/ and http://www.w3.org/International/its
• OLIF - Open Lexicon Interchange Format: http://www.olif.net
XML namespace
• Major feature of XML
• Allows the mapping of different ontological entities onto the same representation
• Allows different ways to look at the same data
• Namespaces can be made transparent
xml:tm
• XML based text memory
• Revolutionary approach to translating XML documents
• First significant advance in translation memory technology
• Uses XML namespace to transparently embed contextual information
xml:tm namespace
• Text Memory namespace
• Can be mapped onto any XML document
• Vertical view of document in terms of ‘text segments’
• Can be totally transparent
xml:tm namespace
Example of the use of the tm namespace in an XML document:
<document xmlns:tm="urn:xml-Intl-tm">
  <tm:tm>
    <section>
      <para>
        <tm:te>
          <tm:tu>Namespace is very flexible.</tm:tu>
          <tm:tu>It is very easy to use.</tm:tu>
        </tm:te>
      </para>
    </section>
  </tm:tm>
</document>
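Reading the text units back out of such a document can be sketched in a few lines of Python. This is a minimal illustration, not part of the xml:tm specification: the namespace URI is taken from the example, the closing tags of the sample document are assumed, and the helper name text_units is hypothetical.

```python
import xml.etree.ElementTree as ET

TM_NS = "urn:xml-Intl-tm"  # namespace URI as used in the slide example

# The slide example, with closing tags assumed so that it is well-formed.
DOC = """<document xmlns:tm="urn:xml-Intl-tm">
  <tm:tm>
    <section>
      <para>
        <tm:te>
          <tm:tu>Namespace is very flexible.</tm:tu>
          <tm:tu>It is very easy to use.</tm:tu>
        </tm:te>
      </para>
    </section>
  </tm:tm>
</document>"""


def text_units(xml_text):
    """Return the text of every tm:tu element in document order."""
    root = ET.fromstring(xml_text)
    return ["".join(tu.itertext()).strip() for tu in root.iter(f"{{{TM_NS}}}tu")]


print(text_units(DOC))
```

Because the lookup is by namespace-qualified name, the same function works whatever vocabulary (section, para, and so on) the host document uses.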
xml:tm namespace
[Figure: two views of the same document. The original document view shows doc, title, section, para and text. The tm namespace view shows tm, with each block of text held in a te element whose sentences are tu (text unit) elements.]
xml:tm namespace
Original document view:
<para>Namespace is very simple. It is easy to use.</para>
tm namespace view:
<para>
  <tm:te id="e1">
    <tm:tu id="u1.1">Namespace is very simple.</tm:tu>
    <tm:tu id="u1.2">It is easy to use.</tm:tu>
  </tm:te>
</para>
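The step from the original document view to the tm namespace view can be sketched as follows. This is a simplified illustration under stated assumptions: the paragraph contains text only (no inline elements), sentences are split with a naive regular expression rather than real SRX rules, and the function names and id scheme (e1, u1.1, ...) simply mirror the example rather than any normative rule.

```python
import re
import xml.etree.ElementTree as ET

TM_NS = "urn:xml-Intl-tm"
ET.register_namespace("tm", TM_NS)


def add_text_memory(para, te_index):
    """Wrap the text of a text-only <para> in one tm:te with a tm:tu per sentence."""
    text = (para.text or "").strip()
    # Naive split after '.', '!' or '?' followed by whitespace (stand-in for SRX rules).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    para.text = None
    te = ET.SubElement(para, f"{{{TM_NS}}}te", {"id": f"e{te_index}"})
    for n, sentence in enumerate(sentences, start=1):
        tu = ET.SubElement(te, f"{{{TM_NS}}}tu", {"id": f"u{te_index}.{n}"})
        tu.text = sentence
    return para


def original_text(para):
    """Recover the original text by ignoring the tm markup."""
    return " ".join(t.strip() for t in para.itertext() if t.strip())


para = ET.fromstring("<para>Namespace is very simple. It is easy to use.</para>")
add_text_memory(para, 1)
print(ET.tostring(para, encoding="unicode"))
print(original_text(para))
```

The second function is what "transparent" means in practice: a consumer that ignores the tm namespace still sees the original paragraph text.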
xml:tm Text Memory
• Author memory
– Maintain memory of source text
– Authoring statistics
– Authoring tool input
• Translation memory
– Automatic alignment
– Maintain perfect link of source and target text
– Reduce translation costs
xml:tm DOM differencing
Source Document: tu id="1", tu id="2", tu id="3", tu id="4", tu id="5", tu id="6"
Updated Source Document: tu id="1", tu id="3", tu id="4", tu id="7" (modified, origid="5"), tu id="6", tu id="8" (new)
Deleted: tu id="2"
xml:tm Author Memory
• Namespace aware DOM differencing
• Identify changes from the previous version
• Unique text unit identifiers are maintained
• Modification history
• Text units can be loaded into a database
• Authoring environment integration
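The differencing idea can be approximated with a flat comparison of text unit lists. This is only a sketch under stated assumptions: real xml:tm differencing works on the DOM tree, whereas this version compares sentence sequences with Python's difflib, treats every replaced sentence as "modified", and uses a hypothetical function name and record layout.

```python
from difflib import SequenceMatcher


def diff_text_units(old, new_texts, next_id):
    """Compare versions. old: list of (id, text); new_texts: list of str.

    Returns (updated, deleted): updated is a list of dicts describing each
    text unit of the new version, deleted is a list of retired ids.
    """
    old_texts = [text for _, text in old]
    matcher = SequenceMatcher(a=old_texts, b=new_texts, autojunk=False)
    updated, deleted = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        old_ids = [uid for uid, _ in old[i1:i2]]
        if tag == "equal":
            # Unchanged text keeps its identifier.
            for uid, text in old[i1:i2]:
                updated.append({"id": uid, "text": text, "status": "unchanged"})
        elif tag == "delete":
            deleted.extend(old_ids)
        else:  # "replace" or "insert": new ids, with origid where text was replaced
            fresh = new_texts[j1:j2]
            for k, text in enumerate(fresh):
                unit = {"id": next_id, "text": text}
                if k < len(old_ids):
                    unit.update(status="modified", origid=old_ids[k])
                else:
                    unit["status"] = "new"
                updated.append(unit)
                next_id += 1
            deleted.extend(old_ids[len(fresh):])
    return updated, deleted


old = [(1, "Open the cover."), (2, "Check the seal."), (3, "Insert the cartridge."),
       (4, "Close the cover."), (5, "Press the button."), (6, "Wait for the light.")]
new = ["Open the cover.", "Insert the cartridge.", "Close the cover.",
       "Press the green button.", "Wait for the light.", "Remove the tray."]
updated, deleted = diff_text_units(old, new, next_id=7)
for unit in updated:
    print(unit)
print("deleted:", deleted)
```

The sample data is invented to mirror the diagram: unit 2 disappears, unit 5 is rewritten and gets a new id that remembers its origin, and one sentence is added at the end.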
xml:tm Translation Memory
• The tm namespace can be used to create XLIFF files
• Automatic alignment of source and target languages
• Allows for more focused translation matching
– Exact matching
– Leveraged matching from document - identical text
– Leveraged matching from database
– Modified text unit matching
– Non translatable text unit identification
DITA Strengths
• Topic-centric level of granularity
• Very well thought out and flexible architecture for content creation and publishing
• Substantial reuse of existing assets
• Specialization at the topic and domain levels
• Automated processing based on metadata properties
• Translate topic only once, reuse many times
DITA and xml:tm
• Both complement each other
• xml:tm encourages text reuse at the sentence level
• Automates translation matching and extraction
• Automatic alignment of source and target documents at the text unit (sentence) level
• Introduces the concept of exact matching for translation as well as focused matching
• Fully integrated with existing standards such as SRX, GMX, TMX and XLIFF
xml:tm translation via XLIFF
[Figure: Source Document (tu id="1" to tu id="6"), XLIFF Document (trans-unit id="1" to trans-unit id="6") and Translated Document (tu id="1" to tu id="6"); each trans-unit carries the id of its text unit.]
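The mapping from text units to XLIFF trans-units, and back again after translation, might look like the sketch below. It assumes XLIFF 1.2 (namespace urn:oasis:names:tc:xliff:document:1.2) and a deliberately minimal file structure; the function names, the file name manual.xml and the sample Polish target are illustrative only.

```python
import xml.etree.ElementTree as ET

XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2"  # assumed XLIFF version


def build_xliff(units, original, source_lang, target_lang):
    """Create an <xliff> element with one trans-unit per (tu_id, source_text)."""
    xliff = ET.Element(f"{{{XLIFF_NS}}}xliff", {"version": "1.2"})
    file_el = ET.SubElement(xliff, f"{{{XLIFF_NS}}}file", {
        "original": original,
        "source-language": source_lang,
        "target-language": target_lang,
        "datatype": "xml",
    })
    body = ET.SubElement(file_el, f"{{{XLIFF_NS}}}body")
    for tu_id, text in units:
        # The trans-unit id repeats the text unit id, so no alignment step is needed later.
        unit = ET.SubElement(body, f"{{{XLIFF_NS}}}trans-unit", {"id": str(tu_id)})
        ET.SubElement(unit, f"{{{XLIFF_NS}}}source").text = text
    return xliff


def targets_by_id(xliff):
    """Return {trans-unit id: target text} for every unit that has a <target>."""
    ns = {"x": XLIFF_NS}
    result = {}
    for unit in xliff.iterfind(".//x:trans-unit", ns):
        target = unit.find("x:target", ns)
        if target is not None:
            result[unit.get("id")] = target.text
    return result


xliff = build_xliff(
    [(1, "Namespace is very simple."), (2, "It is easy to use.")],
    original="manual.xml", source_lang="en", target_lang="pl",
)
# Simulate the translator filling in the first unit.
first = xliff.find(f".//{{{XLIFF_NS}}}trans-unit")
ET.SubElement(first, f"{{{XLIFF_NS}}}target").text = "Przestrzeń nazw jest bardzo prosta."
print(targets_by_id(xliff))
```

Merging then reduces to writing each returned target into the tu with the same id in a copy of the source document.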
xml:tm translated document
[Figure: the same two views after translation. The translated document view shows doc, title, section and para with translated text (tekst). The translated tm namespace view shows tm, with each te holding the translated sentences (zdanie) as tu elements.]
xml:tm perfect alignment
[Figure: Source Document (tu id="1" to tu id="6") and Translated Document (tu id="1" to tu id="6") in exact alignment, unit by unit, through their shared ids.]
xml:tm perfect matching
Updated Source Document: tu id="1", tu id="2" (deleted), tu id="3", tu id="4", tu id="7" (modified), tu id="6", tu id="8" (new)
Matched Target Document:
• tu id="1", tu id="3", tu id="4", tu id="6": perfect matching
• tu id="7": requires translation
• tu id="8": requires translation
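Perfect matching amounts to a lookup by identifier. The sketch below assumes each unit of the updated source carries a status such as "unchanged", "modified" or "new" (a hypothetical record layout, not something defined by xml:tm) and that the previous translation is available as a dictionary keyed by tu id.

```python
def perfect_matches(updated_units, previous_translations):
    """Reuse the previous translation for every unchanged text unit.

    updated_units: list of dicts with "id" and "status".
    previous_translations: {tu id: translated text} from the last version.
    """
    matched = []
    for unit in updated_units:
        unchanged = unit["status"] == "unchanged"
        translation = previous_translations.get(unit["id"]) if unchanged else None
        matched.append({
            "id": unit["id"],
            "translation": translation,
            "action": "perfect match" if translation is not None else "requires translation",
        })
    return matched


updated_units = [
    {"id": 1, "status": "unchanged"}, {"id": 3, "status": "unchanged"},
    {"id": 4, "status": "unchanged"}, {"id": 7, "status": "modified"},
    {"id": 6, "status": "unchanged"}, {"id": 8, "status": "new"},
]
previous_translations = {n: f"translation of unit {n}" for n in range(1, 7)}
for row in perfect_matches(updated_units, previous_translations):
    print(row["id"], row["action"])
```

No text comparison is involved: the match is safe because the identifier guarantees that the unit is the same sentence in the same place in the same document.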
xml:tm leveraged DB memory
[Figure: Source Document (tu id="1" to tu id="6") and Translated Document (tu id="1" to tu id="6") in perfect alignment; the aligned text units feed a database (DB) and TMX.]
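Since source and target units share ids, producing a TMX memory from an aligned document pair is mostly bookkeeping. The following is a rough sketch assuming TMX 1.4 element and attribute names; the header values (tool name, version, o-tmf) are placeholders invented for the example.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"  # the xml:lang attribute


def build_tmx(source, target, src_lang, tgt_lang):
    """Pair text units by shared id. source/target: {tu id: text}."""
    tmx = ET.Element("tmx", {"version": "1.4"})
    ET.SubElement(tmx, "header", {
        "creationtool": "xmltm-sketch",   # hypothetical tool name
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "xml:tm",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "xml",
    })
    body = ET.SubElement(tmx, "body")
    for uid, src_text in source.items():
        if uid not in target:
            continue  # untranslated units do not enter the memory
        tu = ET.SubElement(body, "tu", {"tuid": str(uid)})
        for lang, text in ((src_lang, src_text), (tgt_lang, target[uid])):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return tmx


source = {1: "Namespace is very simple.", 2: "It is easy to use."}
target = {1: "Przestrzeń nazw jest bardzo prosta."}
tmx = build_tmx(source, target, "en", "pl")
print(ET.tostring(tmx, encoding="unicode"))
```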
xml:tm in-document leveraged matching
Updated Source Document: tu id="1", tu id="2" (deleted), tu id="3", tu id="4", tu id="7" (modified), tu id="6", tu id="8" (new: same text as id="3")
Matched Target Document:
• tu id="1", tu id="3", tu id="4", tu id="6": perfect matching
• tu id="7": requires translation
• tu id="8": leveraged match, requires proofing
xml:tm in-document fuzzy matching
Updated Source Document: tu id="1", tu id="2" (deleted), tu id="3", tu id="4", tu id="7" (modified, origid="5"), tu id="6", tu id="8" (new: same text as an existing unit)
Matched Target Document:
• tu id="1", tu id="3", tu id="4", tu id="6": perfect matching
• tu id="7": fuzzy match, requires translation
• tu id="8": leveraged match, requires proofing
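The two in-document cases can be sketched together. The similarity measure here is Python's difflib ratio and the 0.7 threshold is arbitrary; both are assumptions made for the example, not the scoring used by xml:tm implementations. The function name and the sample sentences are likewise illustrative.

```python
from difflib import SequenceMatcher


def in_document_match(unit, previous_source, previous_target, threshold=0.7):
    """Look for a leveraged or fuzzy match inside the previous document version.

    unit: dict with "text" and, for modified units, "origid".
    previous_source / previous_target: {tu id: text} for the previous version.
    """
    # Leveraged: identical source text was already translated elsewhere in the document.
    for uid, text in previous_source.items():
        if text == unit["text"] and uid in previous_target:
            return {"match": "leveraged", "translation": previous_target[uid],
                    "action": "requires proofing"}
    # Fuzzy: a modified unit is compared with the unit it was derived from.
    origid = unit.get("origid")
    if origid in previous_source and origid in previous_target:
        score = SequenceMatcher(None, previous_source[origid], unit["text"]).ratio()
        if score >= threshold:
            return {"match": "fuzzy", "score": score,
                    "translation": previous_target[origid],
                    "action": "requires translation"}
    return {"match": None, "translation": None, "action": "requires translation"}


previous_source = {3: "Press the green button.", 5: "Close the cover carefully."}
previous_target = {3: "Naciśnij zielony przycisk.", 5: "Ostrożnie zamknij pokrywę."}

unit_7 = {"id": 7, "text": "Close the cover very carefully.", "origid": 5}
unit_8 = {"id": 8, "text": "Press the green button."}
print(in_document_match(unit_7, previous_source, previous_target))
print(in_document_match(unit_8, previous_source, previous_target))
```

The fuzzy candidate is found through origid rather than by searching the whole memory, which is why the comparison stays focused on the one sentence the author actually edited.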
xml:tm db leveraged matching
Updated Source Document: tu id="1", tu id="2" (deleted), tu id="3", tu id="4", tu id="7" (modified, origid="5"), tu id="6", tu id="8" (new: same text as an existing unit), tu id="9"
Matched Target Document:
• tu id="1", tu id="3", tu id="4", tu id="6": perfect matching
• tu id="7": fuzzy match, requires translation
• tu id="8": doc leveraged match, requires proofing
• tu id="9": DB leveraged match, requires proofing
xml:tm non-translatable text
Updated Source Document: tu id="1", tu id="2" (non translatable), tu id="3", tu id="4", tu id="7", tu id="6", tu id="8" (new: same text as an existing unit), tu id="9"
Matched Target Document:
• tu id="1", tu id="3", tu id="4", tu id="6": exact matching
• tu id="2": non translatable, requires no translation
• tu id="7": fuzzy match, requires translation
• tu id="8": doc leveraged match, requires proofing
• tu id="9": DB leveraged match, requires proofing
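One simple way to flag non-translatable text units is to check whether they contain any letters at all. This heuristic is an assumption made for illustration (the slides do not say how xml:tm decides), and the function name is hypothetical.

```python
def is_non_translatable(text):
    """True when the unit holds no alphabetic characters (digits, punctuation, spaces only)."""
    return not any(ch.isalpha() for ch in text)


for sample in ["42.5", "2005-06-27", "It is easy to use.", "+44 1753 480 467"]:
    print(repr(sample), is_non_translatable(sample))
```

Units flagged this way can be copied straight into the target document and excluded from the word count.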
Traditional Translation Scenario
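[Figure: traditional workflow spanning Publishing and Translation. The source text is extracted, the extracted text goes through a tm process to give prepared text, the prepared text is translated, and the translated text is merged into the target text, which then goes through QA.]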
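xml:tm Translation Scenario
[Figure: xml:tm workflow. On the publishing side an automatic process extracts text from the xml:tm source text and runs the tm process (perfect matching, leveraged matching) to produce an XLIFF file. The translator translates it through a web service/interface. A second automatic process merges the result into the xml:tm target text, followed by QA.]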
xml:tm benefits
• Open Standard donated by XML INTL to LISA
• Complements DITA
• Enterprise level scalability
• Totally integrated within the XML framework
• Source text is automatically extracted and matched
• Word counts are controlled by the customer
• Text can be presented for translation via the web
• Data is merged automatically at end of translation cycle
• All memory operations are totally automated
• Can be used transparently for relay translations
• More accurate – better matching
xml:tm
• Full specification:
– http://www.xml-intl.com/docs/specification/xml-tm.html
• Maintained by xml-intl.com
– http://www.xml-intl.com/dtd/tm.dtd
– http://www.xml-intl.com/dtd/tm.xsd
• Detailed article on xml:tm on www.xml.com
• Donated by XML INTL to LISA OSCAR
Any questions?
XML INTL Contact Details
• Postal address: PO Box 2167, Gerrards Cross, Bucks SL9 8XF, United Kingdom
• Phone: +44 1753 480 467
• Fax: +44 1753 480 465
• Bob Willans - [email protected]
• Andrzej Zydroń - [email protected]
• Bartek Bogacki - [email protected]