Download - Xml:tm XML Based Text Memory Using XML technology to reduce the cost of translating XML documents 27 June 2005

xml:tm

XML Based Text Memory

Using XML technology to reduce the cost of translating XML documents

27 June 2005

Automating Translation

• Machine Translation

• Translation Memory

• Hybrid Linguistic Inference Engines

• Terminology

Automating Translation

• Machine translation

• 40 year history

• Rigorous control of grammar and terminology can produce good results

• Lots of interesting new developments with hybrid statistical/transfer based systems

• Translation of free format text is theoretically impossible with current technology.

Translation Memory

• Align source and target text

• Look up new text against memory

• Relatively primitive technology

• Not much innovation over the past 30 years

• Need for proofing

• Proprietary translation memory formats

Translating XML Documents

• XML inherently easier to translate

• Separation of form and content

• Support for Unicode and other international encoding formats.

• Allows multiple output formats - PDF, XHTML, WAP

XML Translation Standards

• LISA - Localization Industry Standards Association: http://www.lisa.org

• OASIS - Organization for the Advancement of Structured Information Standards: http://www.oasis-open.org

• W3C - World Wide Web Consortium: http://www.w3c.org

• OLIF Consortium: http://www.olif.net

LISA Standards

• TMX - Translation Memory Exchange format: http://www.lisa.org/tmx

• TBX - Termbase Exchange format: http://www.lisa.org/tbx

• SRX - Segmentation Rules Exchange format: http://www.lisa.org/srx

• GMX - GILT Metrics Exchange format: http://www.lisa.org/gmx


OASIS L10N Standards

• XLIFF - XML Localization Interchange File Format: http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=xliff

• TransWS - Translation Web Services: http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=trans-ws

• DITA - Darwin Information Typing Architecture: http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=dita


W3C and OLIF

• W3C ITS - Internationalization Tag Set: http://www.w3.org/International/its

• OLIF - Open Lexicon Interchange Format: http://www.olif.net


XML namespace

• Major feature of XML

• Allows the mapping of different ontological entities onto the same representation

• Allows different ways to look at the same data

• Namespaces can be made transparent

xml:tm

• XML based text memory

• Revolutionary approach to translating XML documents

• First significant advance in translation memory technology

• Uses XML namespace to transparently embed contextual information

xml:tm namespace

• Text Memory namespace

• Can be mapped onto any XML document

• Vertical view of document in terms of 'text segments'

• Can be totally transparent

xml:tm namespace

Example of the use of tm namespace in an XML document:

<document xmlns:tm="urn:xml-Intl-tm">
  <tm:tm>
    <section>
      <para>
        <tm:te>
          <tm:tu>Namespace is very flexible.</tm:tu>
          <tm:tu>It is very easy to use.</tm:tu>
        </tm:te>
      </para>
    </section>
  </tm:tm>
</document>
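To make the idea of a "view in terms of text segments" concrete, here is a minimal Python sketch that reads a document like the one above and lists its text units. It assumes the namespace URI shown in the example and uses only the standard library; the function name `text_units` is made up for this illustration and is not part of xml:tm.

```python
import xml.etree.ElementTree as ET

# Namespace URI taken from the slide example (assumed to be the one in use).
TM_NS = "urn:xml-Intl-tm"

SAMPLE = """<document xmlns:tm="urn:xml-Intl-tm">
  <tm:tm>
    <section>
      <para>
        <tm:te>
          <tm:tu>Namespace is very flexible.</tm:tu>
          <tm:tu>It is very easy to use.</tm:tu>
        </tm:te>
      </para>
    </section>
  </tm:tm>
</document>"""


def text_units(xml_text):
    """Return the text of every tm:tu element, in document order."""
    root = ET.fromstring(xml_text)
    # itertext() also picks up text inside any inline markup within a unit.
    return ["".join(tu.itertext()).strip() for tu in root.iter(f"{{{TM_NS}}}tu")]


for sentence in text_units(SAMPLE):
    print(sentence)
```

The rest of the document structure (section, para) is ignored here, which is the point of the namespace view: the text units can be addressed without knowing the host vocabulary.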

xml:tm namespace

[Figure: the original document view (doc, title, section, para, text) shown alongside the tm namespace view (tm, te, tu), in which the text of each para is held in a te element and each sentence in a tu element.]

xml:tm namespace

Original document view:

<para>Namespace is very simple. It is easy to use.</para>

tm namespace view:

<para>
  <tm:te id="e1">
    <tm:tu id="u1.1">Namespace is very simple.</tm:tu>
    <tm:tu id="u1.2">It is easy to use.</tm:tu>
  </tm:te>
</para>
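The claim that the namespace "can be totally transparent" can be illustrated by going from the tm namespace view back to the original document view. The following Python sketch unwraps every element in the tm namespace and keeps its content. It assumes the namespace URI from the earlier example and that the root element itself is not a tm element; `strip_tm` is a hypothetical helper, not part of any xml:tm tooling.

```python
import xml.etree.ElementTree as ET

TM_NS = "urn:xml-Intl-tm"  # namespace URI as used in the slide example

TM_VIEW = (
    '<doc xmlns:tm="urn:xml-Intl-tm"><para>'
    '<tm:te id="e1">'
    '<tm:tu id="u1.1">Namespace is very simple.</tm:tu> '
    '<tm:tu id="u1.2">It is easy to use.</tm:tu>'
    '</tm:te></para></doc>'
)


def _add_text(parent, index, text):
    """Append text just before child position `index` of `parent`."""
    if not text:
        return
    if index == 0:
        parent.text = (parent.text or "") + text
    else:
        prev = parent[index - 1]
        prev.tail = (prev.tail or "") + text


def strip_tm(elem):
    """Unwrap, in place, every descendant of `elem` in the tm namespace."""
    i = 0
    while i < len(elem):
        child = elem[i]
        if child.tag.startswith("{" + TM_NS + "}"):
            grandchildren = list(child)
            elem.remove(child)
            # Hoist the wrapper's text, children and tail into its place.
            _add_text(elem, i, child.text)
            for k, g in enumerate(grandchildren):
                elem.insert(i + k, g)
            _add_text(elem, i + len(grandchildren), child.tail)
            # Do not advance: the hoisted children are examined next.
        else:
            strip_tm(child)
            i += 1


root = ET.fromstring(TM_VIEW)
strip_tm(root)
print(ET.tostring(root, encoding="unicode"))
```

After stripping, the para holds plain text again and no tm elements remain in the tree.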

xml:tm Text Memory

• Author memory
– Maintain memory of source text
– Authoring statistics
– Authoring tool input

• Translation memory
– Automatic alignment
– Maintain perfect link of source and target text
– Reduce translation costs

xml:tm DOM differencing

[Figure: the source document (tu id="1" to tu id="6") compared with the updated source document (tu id="1", "3", "4", "7", "6", "8"). Text unit 2 is deleted, unit 7 is a modified version of unit 5 (origid="5"), and unit 8 is new; unchanged units keep their ids.]

xml:tm Author Memory

• Namespace aware DOM differencing

• Identify changes from the previous version

• Unique text unit identifiers are maintained

• Modification history

• Text units can be loaded into a database

• Authoring environment integration
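A rough sketch of how identifiers might be maintained across versions is given below. It is not DOM differencing: for brevity it compares flat lists of text units with Python's `difflib.SequenceMatcher`, and the id numbering scheme, the `status` labels, and the function name `diff_text_units` are all assumptions made for this illustration.

```python
from difflib import SequenceMatcher


def diff_text_units(old_units, new_texts):
    """Compare a previous version with a new one.

    old_units: list of (id, text) from the previous version.
    new_texts: list of text unit strings in the updated version.
    Returns (new_units, deleted_ids). Unchanged units keep their id,
    modified units get a fresh id plus an 'origid', new units get a fresh id.
    """
    old_texts = [text for _, text in old_units]
    next_id = max((uid for uid, _ in old_units), default=0) + 1
    new_units, deleted = [], []
    matcher = SequenceMatcher(a=old_texts, b=new_texts, autojunk=False)
    for op, a0, a1, b0, b1 in matcher.get_opcodes():
        if op == "equal":
            for uid, text in old_units[a0:a1]:
                new_units.append({"id": uid, "text": text, "status": "unchanged"})
            continue
        olds, news = old_units[a0:a1], new_texts[b0:b1]
        for k, text in enumerate(news):
            unit = {"id": next_id, "text": text, "status": "new"}
            if k < len(olds):
                # Paired with an old unit in the same position: a modification.
                unit["status"] = "modified"
                unit["origid"] = olds[k][0]
            new_units.append(unit)
            next_id += 1
        # Old units with no counterpart in the new version were deleted.
        deleted.extend(uid for uid, _ in olds[len(news):])
    return new_units, deleted


old = [(i, f"Sentence {i}.") for i in range(1, 7)]
new = ["Sentence 1.", "Sentence 3.", "Sentence 4.",
       "Sentence 5, revised.", "Sentence 6.", "A brand new sentence."]
units, gone = diff_text_units(old, new)
for u in units:
    print(u)
print("deleted:", gone)
```

With this input the result mirrors the differencing diagram: unit 2 is deleted, unit 7 carries origid 5, and unit 8 is new.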

xml:tm Translation Memory

• The tm namespace can be used to create XLIFF files

• Automatic alignment of source and target languages

• Allows for more focused translation matching
– Exact matching
– Leveraged matching from document - identical text
– Leveraged matching from database
– Modified text unit matching
– Non translatable text unit identification

DITA Strengths

• Topic-centric level of granularity

• Very well thought out and flexible architecture for content creation and publishing

• Substantial reuse of existing assets

• Specialization at the topic and domain levels

• Automated processing based on metadata properties

• Translate topic only once, reuse many times

DITA and xml:tm

• Both complement each other

• xml:tm encourages text reuse at the sentence level

• Automates translation matching and extraction

• Automatic alignment of source and target documents at the text unit (sentence) level

• Introduces the concept of exact matching for translation as well as focused matching

• Fully integrated with existing standards such as SRX, GMX, TMX and XLIFF

xml:tm translation via XLIFF

[Figure: the text units of the source document (tu id="1" to tu id="6") pass through an XLIFF document (trans-unit id="1", and so on) to give a translated document with the same text unit ids (tu id="1" to tu id="6").]

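xml:tm translated document

[Figure: the translated document view (doc, title, section, para, "tekst") shown alongside the translated tm namespace view (tm, te, tu, "zdanie"). The structure is the same as in the source document; the text ("tekst") and sentences ("zdanie") are now in the target language, Polish in this example.]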

xml:tm perfect alignment

[Figure: source document (tu id="1" to tu id="6") and translated document (tu id="1" to tu id="6") side by side. Because the ids are shared, each source text unit is in exact alignment with its translation.]

xml:tm perfect matching

[Figure: the updated source document (tu id="1", "3", "4", "7", "6", "8"; unit 2 deleted, unit 7 modified, unit 8 new) is matched against the target document (tu id="1", "3", "4", "7", "6", "8"). Unchanged units are perfect matches; the remaining units require translation.]


xml:tm leveraged DB memory

[Figure: source document (tu id="1" to tu id="6") and translated document (tu id="1" to tu id="6") in perfect alignment; the aligned text units feed a database (DB) and TMX.]
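Because source and target share text unit ids, building the leveraged memory reduces to a lookup by id. The sketch below pairs the units of two documents that way. The namespace URI comes from the earlier example, the Polish sentences are illustrative only, and `units_by_id` and `align` are hypothetical names; writing the pairs out as TMX or loading them into a database is left out.

```python
import xml.etree.ElementTree as ET

TM_NS = "urn:xml-Intl-tm"  # namespace URI as used in the slide example

SOURCE = """<doc xmlns:tm="urn:xml-Intl-tm"><para><tm:te id="e1">
<tm:tu id="u1.1">Namespace is very simple.</tm:tu>
<tm:tu id="u1.2">It is easy to use.</tm:tu>
</tm:te></para></doc>"""

# Illustrative translation only; not taken from the presentation.
TARGET = """<doc xmlns:tm="urn:xml-Intl-tm"><para><tm:te id="e1">
<tm:tu id="u1.1">Przestrzeń nazw jest bardzo prosta.</tm:tu>
<tm:tu id="u1.2">Jest łatwa w użyciu.</tm:tu>
</tm:te></para></doc>"""


def units_by_id(xml_text):
    """Map each tm:tu id to its text."""
    root = ET.fromstring(xml_text)
    return {tu.get("id"): "".join(tu.itertext()).strip()
            for tu in root.iter(f"{{{TM_NS}}}tu")}


def align(source_xml, target_xml):
    """Pair source and target text units that share an id."""
    src, tgt = units_by_id(source_xml), units_by_id(target_xml)
    return [(src[uid], tgt[uid]) for uid in src if uid in tgt]


for source_text, target_text in align(SOURCE, TARGET):
    print(source_text, "=>", target_text)
```

No sentence alignment heuristics are needed in this sketch, since the pairing is carried by the ids rather than inferred from the text.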

xml:tm in-document leveraged matching

[Figure: the updated source document (tu id="1", "3", "4", "7", "6", "8"; unit 2 deleted, unit 7 modified) matched against the target document. Unchanged units are perfect matches. The new unit 8 has the same text as unit 3 (new: same id="3"), so it is filled as a leveraged match from the document itself and requires proofing.]

xml:tm in-document fuzzy matching

[Figure: the updated source document (tu id="1", "3", "4", "7", "6", "8"; unit 2 deleted) matched against the target document. Unchanged units are perfect matches. The modified unit 7 (mod: origid="5") is a fuzzy match and the new unit 8 (new: same) is a leveraged match; both require proofing.]

xml:tm db leveraged matching

[Figure: the updated source document (tu id="1", "3", "4", "7", "6", "8"; unit 2 deleted) matched against the target document. Unchanged units are perfect matches; the modified unit 7 (mod: origid="5") is a fuzzy match and the new unit 8 (new: same) is a doc leveraged match, both requiring proofing. A further unit, tu id="9", is found in the database (DB leveraged match) and also requires proofing.]


xml:tm non-translatable text

[Figure: the source document (tu id="1", "2", "3", "4", "7", "6", "8") matched against the target document. Unchanged units are exact matches; the fuzzy match, the doc leveraged match (unit 8, new: same) and the DB leveraged match (tu id="9") require proofing. Unit 2 is identified as non translatable and requires no translation.]
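The matching categories shown in the preceding diagrams can be summarized as a decision procedure. The Python sketch below is one possible ordering of the checks, not the algorithm used by any xml:tm implementation. In particular, treating a unit with no alphabetic characters as non translatable is an assumption, and the function name, category labels and data shapes are hypothetical.

```python
def classify(unit, prev_source, prev_target, db_memory):
    """Return (category, suggested translation or None) for one text unit.

    unit: dict with 'id', 'text' and, for modified units, 'origid'.
    prev_source / prev_target: {id: text} for the previous source version
        and its translation.
    db_memory: {source text: target text} from earlier documents.
    """
    text = unit["text"]
    # Assumption: a unit without letters (numbers, punctuation) needs no translation.
    if not any(ch.isalpha() for ch in text):
        return "non-translatable", text
    uid = unit["id"]
    # Same id, same text: the existing translation is reused as is.
    if prev_source.get(uid) == text and uid in prev_target:
        return "exact", prev_target[uid]
    # Identical text elsewhere in the document: reuse, but proof in context.
    for pid, ptext in prev_source.items():
        if ptext == text and pid in prev_target:
            return "doc-leveraged", prev_target[pid]
    # Identical text in the database memory: reuse, but proof.
    if text in db_memory:
        return "db-leveraged", db_memory[text]
    # Modified unit: offer the translation of the unit it was derived from.
    if unit.get("origid") in prev_target:
        return "fuzzy", prev_target[unit["origid"]]
    return "translate", None


prev_source = {i: f"Sentence {i}." for i in range(1, 7)}
prev_target = {i: f"[pl] Sentence {i}." for i in range(1, 7)}  # placeholder text
db_memory = {"Sentence 9.": "[pl] Sentence 9."}

updated = [
    {"id": 1, "text": "Sentence 1."},
    {"id": 7, "text": "Sentence 5, revised.", "origid": 5},
    {"id": 8, "text": "Sentence 3."},
    {"id": 9, "text": "Sentence 9."},
    {"id": 10, "text": "2005-06-27"},
    {"id": 11, "text": "Something entirely new."},
]
for u in updated:
    print(u["id"], classify(u, prev_source, prev_target, db_memory))
```

Only the "translate" category, plus proofing for the leveraged and fuzzy ones, would need to reach a translator; counting the words in each category would give the customer-side word counts mentioned in the benefits list.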

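Traditional Translation Scenario

[Figure: workflow split between Publishing and Translation. The source text is extracted; the extracted text goes through the tm process to give prepared text; the prepared text is translated; the translated text is merged and passed through QA to give the target text.]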

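xml:tm Translation Scenario

[Figure: workflow split between Publishing and the Translator. Text is extracted from the xml:tm source text and run through the tm process (perfect matching, leveraged matching) as an automatic process, producing an XLIFF file. The translator works on it through a web service/interface, and the result is merged as an automatic process and passed through QA to give the xml:tm target text.]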

xml:tm benefits

• Open Standard donated by XML INTL to LISA

• Complements DITA

• Enterprise level scalability

• Totally integrated within the XML framework

• Source text is automatically extracted and matched

• Word counts are controlled by the customer

• Text can be presented for translation via the web

• Data is merged automatically at end of translation cycle

• All memory operations are totally automated

• Can be used transparently for relay translations

• More accurate - better matching

xml:tm

• Full specification:
– http://www.xml-intl.com/docs/specification/xml-tm.html

• Maintained by xml-intl.com
– http://www.xml-intl.com/dtd/tm.dtd
– http://www.xml-intl.com/dtd/tm.xsd

• Detailed article on xml:tm in www.xml.com

• Donated by XML INTL to LISA OSCAR

Any questions?

XML INTL Contact Details

• Postal address: PO Box 2167, Gerrards Cross, Bucks SL9 8XF, United Kingdom

• Phone: +44 1753 480 467

• Fax: +44 1753 480 465

• Bob Willans - [email protected]

• Andrzej Zydroń - [email protected]

• Bartek Bogacki - [email protected]
