
The MultilingualWeb-LT Working Group receives funding by the European Commission (project name LT-Web) through the Seventh Framework Programme (FP7) in the area of Language Technologies. Grant Agreement No. 287815.

W3C ITS 2.0 http://www.w3.org/TR/its20/

Facilitating Automated Creation and Processing of Multilingual Web Content

Felix Sasaki (W3C, DFKI), Christian Lieske (SAP AG)


Authors

Prof. Dr. Felix Sasaki, DFKI / FH Potsdam / W3C
• Appointed professor in 2009; since 2010 senior researcher at DFKI (LT-Lab)
• Working in the German-Austrian W3C Office
• Previously on the staff of the World Wide Web Consortium (W3C) in Japan
• Main field of interest: combined application of W3C technologies for representation and processing of multilingual information
• Studied Japanese, Linguistics and Web technologies at various universities in Germany and Japan

Christian Lieske, Globalization Services, SAP AG
• Knowledge Architect
• Content engineering and process automation (including evaluation, prototyping and piloting)
• Main field of interest: internationalization, translation approaches and natural language processing
• Contributor to standardization at the World Wide Web Consortium (W3C), OASIS, the Unicode Consortium and elsewhere
• Degree in Computer Science with a focus on Natural Language Processing and Artificial Intelligence


Overview
• Motivation for ITS (1.0 and 2.0)
• Basic principles
• Why ITS 2.0?
• Selected data categories
• Implementations and usage scenarios
• Outlook and pointers for more information


Multilingual content production

Seen from the moon
• Internationalize
• Localize
• Translate

Seen from an airplane
• Create
• Internationalize
• Translate/Localize
• Publish
• Harvest
• Analyze

Seen from a desktop
• Specify directionality
• Mark up terminology
• Add links about entities
• Extract / filter content
• Segment
• Run through MT
• Generate translation kit
• Assess (linguistic) quality
• Run post-production


Multilingual content production needs help

“Which data elements need to be translated?”

<rsrc id="123">
  ...
  <data type="text">images/cancel.gif</data>
  <data type="position">12,20</data>
  <data type="text">Cancel</data>
  <data type="position">60,40</data>
  <data type="text">Number of files: </data>
</rsrc>


ITS 2.0 – The help

Comprehensive
• Supports internationalization, translation, localization and other aspects of the multilingual content production cycle

Standardized
• Building on W3C ITS 1.0 (W3C Recommendation)

Meta data
• Data categories, values etc.


Pitch: Why is this important?
• Large quantities of multilingual data to be produced under time pressure
• Ambiguous content needing accuracy, esp. with quicker turnarounds
• An automated solution has been lacking and is getting more urgent
• ITS 2.0 represents a solution that has been developed with a wide range of actors from the internationalization/localization/language technology space


ITS 2.0 Basic principles

Say important things
• “Do not translate”

About specific content
• “All or selected data elements”

In a standard way
• With agreed-upon syntax and values


1. Say important things: ITS 2.0 “data categories”
• Translate
• Localization Note
• Terminology
• Directionality
• Language Information
• Elements Within Text
• Domain
• Text Analysis
• Locale Filter
• Provenance
• External Resource
• Target Pointer
• Id Value
• Preserve Space
• Localization Quality Issue
• Localization Quality Rating
• MT Confidence
• Allowed Characters
• Storage Size


2. About specific content: Content selection approaches

<rsrc ...>
  <its:rules xmlns:its="http://www.w3.org/2005/11/its" version="2.0">
    <its:translateRule selector="//data" translate="no"/>
  </its:rules>
  <data type="text" its:translate="yes">Cancel</data>
  <data type="position">60,40</data>
  ...
</rsrc>

Global selection
• XPath (or CSS) to select markup nodes

Local selection
• ITS local attributes

ITS selection can be compared to CSS
• global = “style” element
• local = “style” attribute
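
The interplay of global and local selection can be sketched in Python. This is a minimal illustration under stated assumptions, not a conforming ITS 2.0 processor: it uses the standard library's xml.etree.ElementTree, whose path support is limited, so the selector //data is rewritten to .//data, and it ignores inheritance and rule ordering beyond "local overrides global". The function name translatable_data_texts and the sample document (which declares the its prefix on the root element so the local attribute parses) are hypothetical.

```python
import xml.etree.ElementTree as ET

ITS_NS = "http://www.w3.org/2005/11/its"

# Sample document modelled on the slide; the its prefix is declared on the
# root element here so that the local its:translate attribute is well-formed.
DOC = """<rsrc id="123" xmlns:its="http://www.w3.org/2005/11/its">
  <its:rules version="2.0">
    <its:translateRule selector="//data" translate="no"/>
  </its:rules>
  <data type="text">images/cancel.gif</data>
  <data type="position">12,20</data>
  <data type="text" its:translate="yes">Cancel</data>
  <data type="position">60,40</data>
</rsrc>"""


def translatable_data_texts(xml_text):
    """Return the text of every <data> element that ends up translatable."""
    root = ET.fromstring(xml_text)
    # ITS default for elements: translatable.
    decision = {el: "yes" for el in root.iter()}
    # Global selection: apply each translateRule to the nodes it selects.
    for rule in root.iter(f"{{{ITS_NS}}}translateRule"):
        selector = rule.get("selector")
        # Assumption: ElementTree cannot evaluate absolute paths, so a
        # leading "//" is turned into the relative form ".//".
        path = "." + selector if selector.startswith("//") else selector
        for el in root.findall(path):
            decision[el] = rule.get("translate")
    # Local selection: an its:translate attribute overrides global rules.
    for el in root.iter():
        local = el.get(f"{{{ITS_NS}}}translate")
        if local is not None:
            decision[el] = local
    return [el.text for el in root.iter("data") if decision[el] == "yes"]


print(translatable_data_texts(DOC))
```

Only the element carrying the local its:translate="yes" survives the global "no" rule, which mirrors the CSS analogy: the rule behaves like a "style" element, the local attribute like a "style" attribute.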


3. In a standard way (1/2)

Pre-defined (if appl.) meta data values
• “Translate”: “yes” or “no”

Specific defaults (if appl.)
• Elements: translate “yes”, attributes: translate “no”

Specific HTML5 behaviour
• E.g. “alt” attribute default “yes”


3. In a standard way (2/2)

Independent / orthogonal
• Powerful (e.g. easy combination)
• Dublin Core, xml
• Example: locQualityIssueComment in addition to storageSize

Strict conformance clauses
• Supported ITS 2.0 data categories
• Supported selection mechanism (local / global) and type of content (HTML / XML)
• Test suite to guide implementers and users: https://github.com/w3c/its-2.0-testsuite


Why ITS 2.0 (1/2)

ITS 1.0 = simplified view of multilingual content production

Too limited for comprehensive automated content processing/usage scenarios (see http://www.w3.org/TR/mlw-metadata-us-impl/ for various ITS 2.0 usage scenario descriptions)

Example limitation: too few data categories


Why ITS 2.0 (2/2)

Coverage for additional types of content: HTML5
• Easy bridge to main Web formats
• Accommodate relevant HTML5 markup (e.g. HTML5 “translate” attribute behaviour)

Easy mapping/conversion to other formats
• XML Localization Interchange File Format (XLIFF) = bridge to localization workflows; status: informal mapping, under discussion, for XLIFF 1.2 mostly stable
• Natural Language Processing Interchange Format (NIF) = bridge to the Semantic Web and Natural Language Processing; status: informal mapping

Introduced traceability
• Which tool produced what?

ITS RDF Ontology
• To make ITS a first-class citizen of the Semantic Web (see http://www.w3.org/2005/11/its/rdf-content/its-rdf.rdf)

Some parts of ITS 1.0 needed to go (at least temporarily)
• Ruby, dir


ITS 2.0 in HTML5 (1/3)

Difference in syntax for local markup

<myXMLVocabulary ...>
  <span its:term="yes"
    its:termInfoRef="http://example.com/terms/t1"> ...
</myXMLVocabulary>

<!DOCTYPE html> ...
  <span its-term="yes"
    its-term-info-ref="http://example.com/terms/t1"> ...
</html>
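
The naming pattern visible in the two snippets (prefix its-, each uppercase letter replaced by a hyphen plus its lowercase form) can be sketched as a small Python helper. The function name its_xml_to_html5 is hypothetical, and the sketch covers only the name mapping shown here; cases where HTML5 uses its own native attributes (for example translate) are assumed to be handled elsewhere.

```python
import re


def its_xml_to_html5(attr_name):
    """Map an XML its:* attribute name to the HTML5 its-* spelling."""
    prefix, _, local = attr_name.partition(":")
    if prefix != "its" or not local:
        raise ValueError(f"not an its:* attribute: {attr_name!r}")
    # Each uppercase letter becomes "-" followed by its lowercase form.
    dashed = re.sub(r"[A-Z]", lambda m: "-" + m.group(0).lower(), local)
    return "its-" + dashed


for name in ("its:term", "its:termInfoRef", "its:locNote"):
    print(name, "->", its_xml_to_html5(name))
```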


ITS 2.0 in HTML5 (2/3)

Link to global rules via HTML “link” element

<!DOCTYPE html> ...
  <link href=EX-translateRule-html5-1.xml rel=its-rules> ...
</html>


ITS 2.0 in HTML5 (3/3)

Accommodation of existing HTML5 markup

<!DOCTYPE html>
<html lang="en" ...
  <p id="p1" translate="no">This is a <em>motherboard</em> and image: </p>
  <img src="http://example.com/myimg.png" alt="My image"/> ...
</html>

ITS 2.0 processors “understand” without ITS markup:
• “p” is not translatable
• “alt” attribute at “img” is translatable
• Language is “en”
• “id” attribute at “p” is an “ID Value” data category value
• “em” is “within text” (part of another text flow)


ITS 2.0 in XHTML

Consumption on the Web: use HTML5 its-* syntax

<html xmlns="http://www.w3.org/1999/xhtml">...
  <p>Don't use
    <span its-loc-note="Internationalization Tag Set">ITS</span>
    prefixed attributes inside the content, like its:locNote.</p>
  </body>
</html>

Consumption in XML workflows: use XML its:* syntax and process as XML


ITS Mime Type
• its+xml – registered at http://www.iana.org/assignments/media-types/application/its+xml
• Applicable for ITS 1.0 and ITS 2.0 content
• One important means to foster ITS adoption on the web


What went away?
• Where did “Ruby” go?
  – Data category dropped from ITS2
  – Current definition in HTML5 not yet stable
  – An update of ITS2 might add Ruby again once its definition is stable
• “Directionality” defined in terms of HTML 4.01
  – Again awaiting stability in HTML5


Text analysis

Annotate named entities or other “conceptual items”
- identify items that need special translation rules
- assist in disambiguation of homonyms (e.g. the string “Armstrong” – dozens of meanings in Wikipedia)

<!DOCTYPE html> ...
<span
  its-ta-confidence="0.7"
  its-ta-class-ref="http://nerd.eurecom.fr/ontology#Movie"
  its-ta-ident-ref="http://dbpedia.org/page/My_Neighbor_Totoro">となりのトトロ</span>...
</html>


Domain

Identify the topic or subject field of content

Example usage: choose the MT engine that fits the domain

...<its:domainRule selector="/h:html/h:body"
  domainPointer="/h:html/h:head/h:meta[@name='dcterms.subject']/@content"
  domainMapping="automotive auto, medical medicine, 'criminal law' law, 'property law' law"/>...
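
How a consumer might read the domainMapping value can be sketched in Python. This is an illustration only: the helper names parse_domain_mapping and map_domain are hypothetical, the sketch splits pairs on commas (so it would break on a quoted value that itself contains a comma), and it assumes the left-hand value of each pair is compared case-insensitively.

```python
import shlex


def parse_domain_mapping(mapping):
    """Turn "source target, 'two words' target" into a lookup table."""
    table = {}
    for pair in mapping.split(","):
        # shlex honours the single quotes around multi-word values.
        parts = shlex.split(pair)
        if len(parts) != 2:
            raise ValueError(f"expected 'source target', got {pair!r}")
        source, target = parts
        # Assumption: source values are matched case-insensitively.
        table[source.lower()] = target
    return table


def map_domain(value, table):
    """Return the mapped domain, or the original value if none is mapped."""
    value = value.strip()
    return table.get(value.lower(), value)


MAPPING = "automotive auto, medical medicine, 'criminal law' law, 'property law' law"
table = parse_domain_mapping(MAPPING)
print(map_domain("Criminal Law", table))
```

A workflow could then use the mapped value (here "law") to pick the matching MT engine.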


MT Confidence

Score from machine translation engine

Example for ITS2 capability: Tool traceability

<!DOCTYPE html> ...
<body its-annotators-ref="mt-confidence|file://tools.xml#T1">
  <p>
    <span its-mt-confidence=0.8982>Dublin is the capital of Ireland.</span></p>
</body></html>
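
The traceability value above pairs a data category with a reference to the tool that produced the annotation. A minimal Python sketch of reading it follows; it assumes the value is a space-separated list of category|IRI entries, and the function name parse_annotators_ref is hypothetical.

```python
def parse_annotators_ref(value):
    """Map each data category identifier to the IRI of its annotating tool."""
    refs = {}
    for entry in value.split():
        category, sep, iri = entry.partition("|")
        if not sep or not category or not iri:
            raise ValueError(f"expected 'category|IRI', got {entry!r}")
        refs[category] = iri
    return refs


print(parse_annotators_ref("mt-confidence|file://tools.xml#T1"))
```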


Locale Filter

Content relevant only for a specific locale

<!DOCTYPE html> ...
<div its-locale-filter-list="*-ca">
  <p>Text for Canadian locales.</p>
</div>
<div its-locale-filter-list="*-ca" its-locale-filter-type="exclude">
  <p>Text for non-Canadian locales.</p>
</div> ...
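
How a processor might decide whether such a block applies to a given locale can be sketched in Python. The sketch assumes that the entries in the filter list are language ranges matched with the extended filtering algorithm of RFC 4647, and that a missing filter type means "include"; the function names extended_filter_match and applies_to_locale are hypothetical.

```python
def extended_filter_match(language_range, tag):
    """Extended filtering in the style of RFC 4647, section 3.3.2."""
    r = language_range.lower().split("-")
    t = tag.lower().split("-")
    # The first subtags must match unless the range starts with a wildcard.
    if r[0] != "*" and r[0] != t[0]:
        return False
    ri, ti = 1, 1
    while ri < len(r):
        if r[ri] == "*":
            ri += 1                # wildcard: skip it
        elif ti >= len(t):
            return False           # tag exhausted before the range
        elif r[ri] == t[ti]:
            ri += 1
            ti += 1
        elif len(t[ti]) == 1:
            return False           # singleton subtags may not be skipped
        else:
            ti += 1                # skip a non-matching tag subtag
    return True


def applies_to_locale(filter_list, locale, filter_type="include"):
    """True if content with this filter should be kept for the locale."""
    ranges = [r.strip() for r in filter_list.split(",") if r.strip()]
    matched = any(extended_filter_match(r, locale) for r in ranges)
    return matched if filter_type == "include" else not matched


print(applies_to_locale("*-ca", "en-CA"))
print(applies_to_locale("*-ca", "en-CA", "exclude"))
```

With the markup above, the first div would be kept for en-CA or fr-CA and dropped for fr-FR, and the second div the other way round.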


Localization Quality Issue

For quality assessment

<!DOCTYPE html> ...
<span
  its-loc-quality-issue-comment="should be 'quality'"
  its-loc-quality-issue-profile-ref=http://example.org/qaMovel/v1
  its-loc-quality-issue-severity=50
  its-loc-quality-issue-type=misspelling>qulaity</span> ...


Tooling for:
• Content creation
• Content enrichment
• Workflows transporting ITS 2.0 between formats
  – Source formats (e.g. DocBook > HTML)
  – XLIFF roundtripping
• A detailed example: ITS 2.0 processed via the OKAPI framework


Helping creators: validation of HTML5


... and XML

HTML5 ITS Tools: https://github.com/kosek/html5-its-tools
• ITS 2.0 validation of file sets
• Syntax conversion: HTML5 <> XML

• Tool: validator.nu
• Basis for HTML5 and XML validation


Helping creators: (plugins for) editing support

BlueGriffon web editor

General JavaScript ITS2 parser: http://plugins.jquery.com/its-parser/


Adding more value to content: Named Entity Recognition and Disambiguation

See http://enrycher.ijs.si/mlw/


Adding more value to content: Generation of terminology markup

See http://taws.tilde.com/


Format conversion and more: DocBook -> HTML -> online MT

See http://xmlguru.cz/2013/05/docbook-and-its2


Service Oriented Localisation Architecture Solution (SOLAS)
• See http://mlwlt.moravia.com/mlwlt-web-test/Presentation.aspx
• XLIFF in, (MT-translated) XLIFF out
• ITS 2.0 mapped into XLIFF
• Consumes data categories: Translate, Domain and Text Analysis
• Generates metadata for data categories: Provenance and MT Confidence


A detailed example: ITS2 processing with OKAPI framework
• See http://okapi.opentag.com/
• Components and applications for localization and translation
• ITS1 and ITS2 (ongoing) implemented in many usage scenarios
• Scenarios and examples provided by Yves Savourel (ENLASO); run with Rainbow & CheckMate tools


ITS2-aware XLIFF generation

<its:translateRule selector="//h:*[@class='totrans']"
  translate="yes"/>
<its:storageSizeRule selector="//h:td[@class='totrans']"
  storageSize="30"/>

<td class="totrans"> The Lost Temples of the Khmer</td>

<trans-unit ...
  <source xml:lang="en-us" its:storageSize="30"> The Lost Temples of the Khmer</source>


ITS2 “domain” mapping: choosing the ‘travel’ MT engine

<its:domainRule ...
  domainPointer="/h:html/h:head/h:meta[@name='dcterms.subject']/@content"
  domainMapping="'vacation packages' travel"/>

<meta content="vacation packages" ...
<td ...> The Lost Temples of the Khmer</td>

<trans-unit itsxlf:domains="travel"....
  <target xml:lang="fr-fr">Les temples perdus des Khmers</target>


Segmentation, MT and quality checks

<its:domainRule .../>
<its:translateRule .../>
<its:storageSizeRule ... storageSize="30"/>

<td class="totrans">Canyon X and the Land of the Navajo</td>

<target ... its:storageSize="30" its:locQualityIssueComment="Number of bytes in the target (using UTF-8) is: 32. Number allowed: 30." ...
<mrk...>Canyon X et la terre des Navajos</mrk>...
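The storage size check behind the quality issue comment comes down to counting encoded bytes. Here is a minimal Python sketch; the function name `check_storage_size` is hypothetical, and the message format simply mirrors the comment shown on the slide rather than CheckMate's actual implementation.

```python
def check_storage_size(text, limit, encoding="utf-8"):
    """Return a quality issue comment if the encoded text exceeds the
    storage size limit, otherwise None."""
    size = len(text.encode(encoding))
    if size > limit:
        return (
            f"Number of bytes in the target (using {encoding.upper()}) "
            f"is: {size}. Number allowed: {limit}."
        )
    return None


issue = check_storage_size("Canyon X et la terre des Navajos", 30)
no_issue = check_storage_size("Les temples perdus des Khmers", 30)
```

The French target "Canyon X et la terre des Navajos" takes 32 bytes in UTF-8, so `issue` holds the comment from the slide. "Les temples perdus des Khmers" takes 29 bytes and passes the check.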

Quality check details

[Screenshot: Rainbow HTML output]

[Screenshot: CheckMate tool report]

Breaking news: Okapi Ocelot Editor
• See http://open.vistatec.com/ocelot/
• Open-source, Java-based XLIFF + ITS 2.0 editor
• Supports Localization Quality Issue, Provenance and MT Confidence
• Also a general XLIFF 1.2 editor

Showcases with “real clients” ...
• ITS2-aware online MT
  – Using “Translate”, “Domain”, “Language information” to drive a rule-based MT system
• Localization chain integration
  – Coupling the Drupal Content Management System with a Localization Service Provider/Translation Agency workflow
  – Demonstrating workflow benefits achieved via ITS2 data categories

... and more
• ITS2 data categories for the human review process
  – Harvest metadata during the review
  – Facilitate audit during the review, e.g. via the Ocelot tool
• Conversion of ITS2 documents (XML, HTML) into RDF
  – NIF format
  – Informative feature
  – Prototypes to generate e.g. “text analysis” information in RDF out of Wikipedia pages

Overview
• Motivation for ITS (1.0 and 2.0)
• Basic principles of ITS
• Why ITS 2.0?
• Selected data categories
• Implementations and usage scenarios
• Outlook and pointers for more information

What is missing?
• XLIFF mapping to be finalized
  – Representation of ITS2 markup in XLIFF not finished
  – XLIFF 1.2 to be stabilized first; XLIFF 2.0 later
• ITS and RDF: to be continued
  – NIF conversion based on ITS RDF ontology
  – Not stabilized and not yet in “real life” deployment

What will come next?
• For some time no new ITS version, but more:
  – Usage scenarios: http://www.w3.org/International/its/wiki/Use_cases_-_high_level_summary
  – Implementations: http://www.w3.org/International/its/wiki/ITS_Implementations
  – User & implementers feedback at public-i18n-its-[email protected]
• Join us in the ITS Interest Group!
• For Multilingual Linked Open Data: join the BPMLOD group http://www.w3.org/community/bpmlod/

W3C ITS 2.0 http://www.w3.org/TR/its20/

Facilitating Automated Creation and Processing of Multilingual Web Content

Felix Sasaki (W3C, DFKI), Christian Lieske (SAP AG)