The MultilingualWeb-LT Working Group receives funding by the European Commission (project name LT-Web) through the Seventh Framework Programme (FP7) in the area of Language Technologies. Grant Agreement No. 287815.
Text Analytics in ITS 2.0: Annotation of Named Entities
Tadej Štajner, Jožef Stefan Institute
Motivation
• Translating proper names can be problematic for statistical MT systems
Motivation (2)
• Translation depends on the source and target language:
– There are specific rules to translate (or transliterate) particular proper names or concepts
– Sometimes, they should not be translated at all
• Solution: figure out what is actually being mentioned and see whether a translated expression already exists for that entity
Motivation (3)
• Localization of proper names:
– personal names, product names, geographic names, chemical compounds, protein names
• Names can appear without sufficient context:
– we can use ITS 2.0 Text Analysis annotations to provide context for ambiguous content.
ITS 2.0 Text Analysis
• Supports text analysis agents that enhance content by suggesting or identifying concepts and entities, each identified by an IRI.
• The data category provides three pieces of information:
– confidence
– entity type/concept class
– entity/concept identifier
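The three pieces of information above map onto the HTML attributes its-ta-confidence, its-ta-class-ref and its-ta-ident-ref. A minimal sketch of that mapping follows; the class name TextAnalysisAnnotation and the method to_html are hypothetical names chosen for illustration, not part of ITS 2.0 or of any tool mentioned in the slides. ITS 2.0 expects a confidence value to be accompanied by annotator information (its-annotators-ref), which this sketch leaves to the enclosing element.

```python
from dataclasses import dataclass
from html import escape
from typing import Optional


@dataclass
class TextAnalysisAnnotation:
    """One Text Analysis annotation for a span of text (illustrative only)."""
    text: str
    ident_ref: Optional[str] = None     # entity/concept identifier (IRI)
    class_ref: Optional[str] = None     # entity type/concept class (IRI)
    confidence: Optional[float] = None  # annotator confidence, 0 to 1

    def to_html(self) -> str:
        """Render the annotation as an HTML span with its-ta-* attributes."""
        attrs = []
        if self.confidence is not None:
            if not 0.0 <= self.confidence <= 1.0:
                raise ValueError("confidence must be between 0 and 1")
            attrs.append(f'its-ta-confidence="{self.confidence}"')
        if self.class_ref:
            attrs.append(f'its-ta-class-ref="{escape(self.class_ref)}"')
        if self.ident_ref:
            attrs.append(f'its-ta-ident-ref="{escape(self.ident_ref)}"')
        attr_text = "".join(" " + a for a in attrs)
        return f"<span{attr_text}>{escape(self.text)}</span>"


if __name__ == "__main__":
    dublin = TextAnalysisAnnotation(
        "Dublin",
        ident_ref="http://dbpedia.org/resource/Dublin",
        class_ref="http://schema.org/Place",
        confidence=0.7,
    )
    print(dublin.to_html())
```

All three fields are optional in the sketch because an annotation may carry only an identifier or only a class.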
ITS 2.0 Text Analysis
<!DOCTYPE html>
<div its-annotators-ref="text-analysis|http://enrycher.ijs.si/mlw/toolinfo.xml#enrycher">
  <span its-ta-ident-ref="http://dbpedia.org/resource/Dublin" its-ta-class-ref="http://schema.org/Place">Dublin</span>
  is the
  <span its-ta-ident-ref="http://purl.org/vocabularies/princeton/wn30/synset-capital-noun-3.rdf">capital</span>
  of
  <span its-ta-ident-ref="http://dbpedia.org/resource/Ireland" its-ta-class-ref="http://schema.org/Place">Ireland</span>.
</div>
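A consumer of markup like this needs to read the annotations back out. The sketch below does so with Python's standard html.parser module. The names TextAnalysisExtractor and extract_annotations are hypothetical, and the code assumes well-formed markup in which every non-void element has an explicit end tag; it is not how any particular ITS 2.0 processor works.

```python
from html.parser import HTMLParser

# Elements that never have an end tag, so they are not tracked on the stack.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
                 "link", "meta", "source", "track", "wbr"}


class TextAnalysisExtractor(HTMLParser):
    """Collect text, identifier and class for every annotated element."""

    def __init__(self):
        super().__init__()
        self.annotations = []
        self._open = []  # one entry per open element: a dict or None

    def handle_starttag(self, tag, attrs):
        if tag in VOID_ELEMENTS:
            return
        attr_map = dict(attrs)
        ident = attr_map.get("its-ta-ident-ref")
        cls = attr_map.get("its-ta-class-ref")
        if ident or cls:
            self._open.append({"text": "", "ident_ref": ident, "class_ref": cls})
        else:
            self._open.append(None)

    def handle_data(self, data):
        # Text belongs to every annotated element that is currently open.
        for entry in self._open:
            if entry is not None:
                entry["text"] += data

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS or not self._open:
            return
        entry = self._open.pop()
        if entry is not None:
            self.annotations.append(entry)


def extract_annotations(markup):
    """Return the annotated spans found in an HTML string."""
    parser = TextAnalysisExtractor()
    parser.feed(markup)
    parser.close()
    return parser.annotations


if __name__ == "__main__":
    sample = ('<div><span its-ta-ident-ref="http://dbpedia.org/resource/Dublin" '
              'its-ta-class-ref="http://schema.org/Place">Dublin</span> '
              'is a city.</div>')
    for annotation in extract_annotations(sample):
        print(annotation)
```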
Producing these annotations
• NLP techniques
– Named entity extraction & disambiguation
– Word sense disambiguation
• Manual annotation
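To make the automatic route concrete, here is a deliberately simple stand-in: an exact-match dictionary lookup that wraps known names in annotated spans. This is not the extraction or disambiguation method used by the tools in this talk. The GAZETTEER table and the annotate function are hypothetical, and a real system would also have to resolve ambiguous names from context.

```python
import re
from html import escape

# Hypothetical gazetteer: surface form -> (entity IRI, class IRI).
GAZETTEER = {
    "Dublin": ("http://dbpedia.org/resource/Dublin", "http://schema.org/Place"),
    "Ireland": ("http://dbpedia.org/resource/Ireland", "http://schema.org/Place"),
}


def annotate(text, gazetteer=GAZETTEER):
    """Wrap every whole-word gazetteer match in an annotated span."""
    if not gazetteer:
        return escape(text)
    # Longest names first, so a longer name wins over a shorter prefix.
    names = sorted(gazetteer, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")
    pieces = []
    last = 0
    for match in pattern.finditer(text):
        pieces.append(escape(text[last:match.start()]))
        ident, cls = gazetteer[match.group(1)]
        pieces.append(
            f'<span its-ta-ident-ref="{escape(ident)}" '
            f'its-ta-class-ref="{escape(cls)}">{escape(match.group(1))}</span>'
        )
        last = match.end()
    pieces.append(escape(text[last:]))
    return "".join(pieces)


if __name__ == "__main__":
    print(annotate("Dublin is the capital of Ireland."))
```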
Named entity disambiguation
[Diagram; labels: Document, Label, Entity, Mention]
Use cases
• Informing a human agent (e.g. a translator) that a certain fragment of text is subject to specific translation rules [this is taken up in OKAPI and the XLIFF generation]:
– proper names
– officially regulated translations
• Informing a software agent (e.g. a CMS) about the conceptual type of a textual entity in order to enable special processing or indexing:
– geographic names
– personal names
– product names
– chemical compounds
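For the software-agent case, the indexing step could look roughly like the sketch below, which groups annotated text by its class IRI. The function index_by_class and the dictionary keys "text" and "class_ref" are assumptions made for this example, not an interface defined by ITS 2.0 or by any CMS.

```python
from collections import defaultdict


def index_by_class(annotations):
    """Group annotated surface forms by their class IRI.

    Each annotation is assumed to be a dict with a "text" key and an
    optional "class_ref" key; entries without a class are skipped.
    """
    index = defaultdict(set)
    for annotation in annotations:
        class_ref = annotation.get("class_ref")
        if class_ref:
            index[class_ref].add(annotation["text"])
    # Sorted lists give a stable, duplicate-free result.
    return {cls: sorted(texts) for cls, texts in index.items()}


if __name__ == "__main__":
    found = [
        {"text": "Dublin", "class_ref": "http://schema.org/Place"},
        {"text": "Ireland", "class_ref": "http://schema.org/Place"},
        {"text": "capital", "class_ref": None},
    ]
    print(index_by_class(found))
```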
Open issues
• The Text Analysis data category can’t represent stand-off annotations, so only one layer of annotation can be expressed
• Support for the domain data category via text analysis tools
Business case
• By itself, it’s infrastructure that indirectly supports business cases by supporting other technical scenarios
• [Clemens:] Having metadata saves time
– producing it automatically can compound the savings
• [XLIFF roundtrip:] Human as well as machine consumption of this metadata
Demo
• http://enrycher.ijs.si/mlw/