Current Topics in Applied Computer Science: Semantic...

Annotation Interoperability

Christian Chiarcos

chiarcos@informatik.uni-frankfurt.de

EUROLAN-2015, July 21, 2015, Sibiu, Romania

Annotation Interoperability

Tue, July 21st, 09:00-10:30

Annotation Interoperability I

Ontologies of Linguistic Annotation – Motivations and Principles

Wed, July 22nd, 11:00-12:30

Annotation Interoperability II

Applications and Use Cases

Wed, July 22nd, 14:00-15:30

Annotation Interoperability III

Hands-on session

Annotation Interoperability

Ontologies of Linguistic Annotation – Motivations and Principles

1. Conceptual Interoperability

2. Towards a modular set of linked ontologies

3. Structure and history of OLiA ontologies

4. Use case I: Documentation and formalization

5. A closer look at an example: MULTEXT-East

6. Use case II: Cross-tagset search via query rewriting

Before we proceed …

… please download and install Protégé 5.0 (Desktop version) during the day

http://protege.stanford.edu/

for the hands-on session tomorrow

• we'll be building annotation models and linking them

Conceptual Interoperability

Problem and earlier approaches

Interoperability

• Language resources (tools, corpora, dictionaries) are interoperable if we can (re-)use them as components of a coherent system/resource

• e.g., tools

– using tagger A with parser B

• with a domain-adapted tagger A,* and a general-purpose parser B

* think of POS taggers for the biomedical domain (e.g., Genia) which use different tokenization strategies than out-of-the-box parsers

Interoperability

• e.g., corpora

– run the same query on corpus A and corpus B

=> more data, more likely significant results, comparable results

Interoperability

• e.g., dictionary + tool/corpus

– use dictionary A as a component for tagger B

• if grammatical categories correspond to tags in the tagset that the tagger is trained on

Dimensions of Interoperability

• Structural Interoperability

– use the same format / mode of access

– more on this tomorrow

• Conceptual Interoperability

– use the same vocabularies, e.g., for linguistic annotations

– for the moment, we focus on the most elementary level: morphosyntax

(parts-of-speech, agreement features)

Interoperability Issues: Monolingual

• When language resources for a low-resource language are developed, different people have different ideas, e.g., for English (by the mid-1990s)

Token     Susanne   Penn
The       AT        DT
Fulton    NP1s      NNP
County    NNL1cb    NNP
Grand     JJ        NNP
Jury      NN1c      NNP
said      VVDv      VBD
Friday    NPD1      NNP

[Slide callouts on tag structure: 395 tags; word classes; morphological features; syntactic features; lexical classes; number and degree]

• Integrating both resources allows us to

– apply more wide-scale statistical analyses

– increase training data for supervised POS tagging

– increase test data for unsupervised POS tagging
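To make the integration problem concrete, here is a minimal Python sketch that maps the Susanne and Penn tags from the sample sentence to shared coarse categories and reports the tokens on which the two schemes disagree. The coarse category names and both mapping tables are illustrative assumptions made for this sketch, not an official mapping of either tagset.

```python
# Token-level tags copied from the Susanne/Penn comparison:
# (token, Susanne tag, Penn tag).
SAMPLE = [
    ("The", "AT", "DT"),
    ("Fulton", "NP1s", "NNP"),
    ("County", "NNL1cb", "NNP"),
    ("Grand", "JJ", "NNP"),
    ("Jury", "NN1c", "NNP"),
    ("said", "VVDv", "VBD"),
    ("Friday", "NPD1", "NNP"),
]

# ASSUMPTION: hand-written, illustrative coarse categories.
SUSANNE_TO_COARSE = {
    "AT": "Determiner",
    "NP1s": "ProperNoun",
    "NNL1cb": "CommonNoun",
    "JJ": "Adjective",
    "NN1c": "CommonNoun",
    "VVDv": "Verb",
    "NPD1": "ProperNoun",
}

# ASSUMPTION: same caveat as above.
PENN_TO_COARSE = {
    "DT": "Determiner",
    "NNP": "ProperNoun",
    "VBD": "Verb",
}


def disagreements(sample):
    """Return the tokens whose coarse category differs between the two schemes."""
    return [
        token
        for token, susanne_tag, penn_tag in sample
        if SUSANNE_TO_COARSE[susanne_tag] != PENN_TO_COARSE[penn_tag]
    ]


if __name__ == "__main__":
    print(disagreements(SAMPLE))
```

Under these assumed mappings, the schemes agree on the determiner and the verb but diverge inside the multi-word name, which is the kind of mismatch that has to be resolved before the two corpora can be pooled.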

Interoperability Issues: Multilingual

• with interoperable POS tags used across different languages, …

– we can apply the same unlexicalized NLP tools (e.g., parsers, cf. McDonald et al. 2013)

– we can perform comparative corpus studies

– we simplify multilingual annotation projection

Violations of Interoperability

• ROSANA Anaphor Resolution (Stuckardt 2001)

– required Connexor parser

• a commercial product

• UIMA annotation type systems

– NLP modules using the same annotation types are interoperable, but different groups develop their own, even for the same tools for the same language

Classical solution: Standardization

• Expert Advisory Group on Language Engineering Standards (EAGLES)*

– European standardization project (1993 – 1996)

– further elaborated by MULTEXT-East and ISLE/Parole

• Recommendations for POS tag sets

– derived in a bottom-up manner

– no theoretical specification of tag sets, only identification of commonly used terms

* http://www.ilc.cnr.it/EAGLES96/home.html

Issues with EAGLES

… although linguists agree on the general "common-sense" definitions of categories like proper noun, common noun etc, our analysis of competing tagsets for English corpora shows that these categories are in fact 'fuzzy', and different corpus tagging projects have adopted subtly but significantly different definitions, probably unaware that their analyses are incompatible with those of other linguists …

(Hughes et al. 1995)

* EAGLES is a classic case; our generation is just about to reinvent this wheel with "Universal Dependencies" (http://universaldependencies.github.io/)

Issues with EAGLES, cont'd

• Certain phenomena are hard to group with "major" categories

– quantifier (all, some, many)

• (pre-)determiner? (all the books)

• pronoun? (all of them)

• number? (all books ~ 25 books)

• adjective? (all books ~ green books), suggested for inflecting languages

– attributive pronouns (his book)

• pronoun?

• determiner?

– adjectival participles (enduring freedom)

• verb?

• adjective?

• Certain phenomena are hard to group with "major" categories

• because "morphosyntax/parts-of-speech" conflates different levels of description, e.g.,

– syntax vs. semantics

• attributive pronoun => determiner vs. pronoun

– syntax vs. morphology

• adjectival participles => adjective or verb

– morphology vs. semantics

• ordinal numbers => adjectives vs. numerals

– homophony vs. linguistically defined categories

• VH for auxiliary have but also have as a main verb

• BUT

– standardization towards a meta-tagset implicitly enforces unambiguous classification

• taxonomical/tree-like structure

=> independent decisions by tagset designers lead to incompatibilities, e.g., AUX "auxiliary verb" vs. "potential auxiliary verb"

• EAGLES is prescriptive

– a standard-conformant tagset needs to provide certain categories

• even if not relevant to a language

• e.g.

– Determiner (lacking for most Slavic languages)

– Adjective (lacking for Chinese)

– Noun-Verb distinction (debated for Fijian and Inuktitut)

• EAGLES is specific to Western European languages

• EAGLES is built in a bottom-up fashion

– if unknown phenomena for novel languages are encountered, they are added as optional (language-specific) features

– existing features may or may not be re-used

• later: some problematic cases from MULTEXT-East

• EAGLES requires a 1:1 mapping* from standard-conformant tagsets

– every language-specific tag is mapped to exactly one EAGLES tag, so that they are equivalent

– given the definitional problems mentioned before, we'd like to express whether a mapping is perfect or imprecise

• or indicate partial overlaps with standard categories

* in fact, tags can be underspecified, so it is a 1:m mapping
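One way to express "perfect vs. imprecise" mappings is to attach a relation type to every link between a tag and a standard category instead of assuming equivalence. The following Python sketch shows such a data structure; the relation names, the tag "QUANT", and the category names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class MatchKind(Enum):
    EXACT = "exact"        # tag and category are treated as equivalent
    NARROWER = "narrower"  # tag covers only part of the category
    BROADER = "broader"    # tag covers more than the category
    OVERLAP = "overlap"    # tag and category overlap only partially


@dataclass(frozen=True)
class TagMapping:
    tag: str
    category: str
    kind: MatchKind


# ASSUMPTION: "QUANT" is a hypothetical tag for quantifiers such as
# "all", which were shown to straddle several major categories.
MAPPINGS = [
    TagMapping("QUANT", "Determiner", MatchKind.OVERLAP),
    TagMapping("QUANT", "Pronoun", MatchKind.OVERLAP),
    TagMapping("QUANT", "Numeral", MatchKind.OVERLAP),
    TagMapping("NNP", "ProperNoun", MatchKind.EXACT),
]


def categories_for(tag, mappings, exact_only=False):
    """List the categories linked to a tag, optionally only exact matches."""
    return [
        m.category
        for m in mappings
        if m.tag == tag and (not exact_only or m.kind is MatchKind.EXACT)
    ]


if __name__ == "__main__":
    print(categories_for("QUANT", MAPPINGS))
    print(categories_for("QUANT", MAPPINGS, exact_only=True))
```

A consumer that needs strict equivalence can filter on exact matches, while a search application may accept partial overlaps to improve recall.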

• EAGLES provides a fixed level of granularity

– more fine-grained categories are abandoned, e.g., semantic classes in Susanne

• for reasons of practicality, this level of granularity is not the finest possible

=> reductionism

Many shortcomings of the standardization approach can be addressed by modelling linguistic reference terminology by means of ontologies

Towards an ontology of linguistic terminology

• Goal

– develop and apply an ontology as a terminological backbone of different kinds of linguistic annotation

• Use cases

– overcome differences in task-, domain- or language-specific annotations

– provide unified access to terminologically heterogeneously analysed data

• Ontology – conceptualization of a certain domain

• e.g. a taxonomy of linguistic terms

– hierarchically and relationally structured

• OWL2/DL (Web Ontology Language)*

– formal description language for ontologies

– based on description logics

• conceptual subsumption (rdfs:subClassOf)

• logical operators (incl. disjunction and negation)

* Web Ontology Language, http://www.w3.org/TR/owl2-overview/
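As a rough intuition for subsumption and the logical operators, the following Python sketch evaluates class expressions over a toy taxonomy. The class names and the taxonomy are assumptions made for this example. Note that the sketch is closed-world: anything not derivable is treated as false, whereas OWL reasoners work under the open-world assumption, so this is not a substitute for a DL reasoner.

```python
# ASSUMPTION: toy taxonomy with illustrative class names
# (child class -> set of direct superclasses).
SUBCLASS_OF = {
    "DemonstrativeDeterminer": {"Determiner"},
    "Determiner": {"PronounOrDeterminer"},
    "Pronoun": {"PronounOrDeterminer"},
    "PronounOrDeterminer": {"MorphosyntacticCategory"},
    "Verb": {"MorphosyntacticCategory"},
}


def superclasses(cls):
    """Return cls plus every class that subsumes it (reflexive-transitive closure)."""
    seen = {cls}
    stack = [cls]
    while stack:
        for parent in SUBCLASS_OF.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def is_a(cls, ancestor):
    """Subsumption test: is cls the same as, or a subclass of, ancestor?"""
    return ancestor in superclasses(cls)


# Class expressions as predicates over named classes.
def named(name):
    return lambda cls: is_a(cls, name)


def union(*exprs):
    """Disjunction of class expressions."""
    return lambda cls: any(expr(cls) for expr in exprs)


def complement(expr):
    """Negation of a class expression (closed-world simplification)."""
    return lambda cls: not expr(cls)


if __name__ == "__main__":
    pronoun_or_verb = union(named("Pronoun"), named("Verb"))
    print(is_a("DemonstrativeDeterminer", "PronounOrDeterminer"))
    print(pronoun_or_verb("Verb"))
    print(complement(named("Determiner"))("Pronoun"))
```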

• Against multiple tag sets

– unified representation of heterogeneous data

• linked to multiple different tag sets

– transparent

• abstraction from tag set specifics

– formal definitions

• based on description logics

• Against standardisation

– different conceptualizations

• language-specific traditions

• domain-specific conceptualizations

– different granularity

– implicit interpretation

• when mapping annotations to standard terms

Towards a modular set of linked ontologies

Just using a central ontology isn't enough

A joint, extensible terminology repository?

• Differences ... among different language resources and individual system objectives ... lead to variations in data category definitions and data category names.

• The use of uniform data category names and definitions ... contributes to system coherence and enhances the re-usability of data.

(Ide & Romary 2004)

The solution I

General Ontology of Linguistic Description (GOLD)

– ... large amounts of linguistic data on the Web ... from different languages can be automatically searched and compared ...

– ... the data and the various encoding schemes in which they are represented need an explicit semantics.

– ... a data model ... which is consistent with .... the Semantic Web ...

(Farrar & Langendoen 2003)

http://linguistics-ontology.org/gold

The solution II

ISO TC37/SC4 Data Category Registry (ISOcat)

– ... a family of data category standards designed to meet the needs of terminologists and other language experts developing a variety of electronic linguistic resources. ...

– ... to ensure interoperability among these domains ...

– ... with an eye to facilitating ... wide-scale information handling environments such as the Semantic Web ...

(Wright 2004)

http://isocat.org/

• The RELISH project aimed to harmonize GOLD and ISOcat, and brought GOLD-2010 into ISOcat

– unfortunately, this only increased redundancy: 5 types of CommonNoun now exist side by side

– RelCat has not materialized yet

https://tla.mpi.nl/relish/

The solution III-VIII

Documentation standards in typology

– EUROTYP (Bakker et al. 1993)

– AUTOTYP (Bickel & Nichols 2002)

– Typological Database System (TDS) ontology (Dimitriadis et al. 2009)

Standardization initiatives and multi-language tagsets

– EAGLES (Leech & Wilson 1996)

– MULTEXT/East (Erjavec 2010)

– Common POS tagset for Indian languages (Baskaran et al. 2008)

– Universal POS tags / Universal Dependencies (Petrov et al. 2012)

Another Problem

Imagine you plan to develop a tool that makes use of a terminology repository.

Which one would you choose?

Maybe it's not even your choice ...

... your clients may have their own preferences ... and different clients may have different preferences

Another Solution

Modular architecture

– Instead of limiting ourselves to one, we may make use of an intermediate representation that links to all of them

– If we want to avoid losing information by replacing annotations with reference categories, the original annotation scheme should be formalized as well

Ontologies of Linguistic Annotation (OLiA) http://purl.org/olia

Structure of OLiA ontologies

and their relation to other terminology repositories

Ontologies of Linguistic Annotation

Modular OWL/DL ontologies

– Annotation Models

• annotation scheme

– OLiA Reference Model

• common terminology

– External Reference Models

• existing terminology repositories

OLiA Reference Model

– interface between annotations and (multiple) terminology repositories

[Figure: Annotation Models, OLiA Reference Model, Terminology Repositories]

OLiA Reference Model

• harmonization of repositories of annotation terminology

• morphosyntax & morphology

– 39 schemes

– ~70 languages*

• syntax, discourse structure, anaphora, information structure

* including multilingual annotation schemes: Tapanainen & Järvinen (1997), Dipper et al. (2007), and Erjavec (2010)

[Figure: excerpt of the OLiA Reference Model. Concepts: Morphosyntactic Category, PronounOrDeterminer, Determiner, Demonstrative Determiner; Morphological Feature, Case, Accusative; linked by is-a. Property: hasCase, relating a MorphosyntacticCategory to a Case.]

OLiA Annotation Models

• OWL/DL formalizations of annotation schemes

– structure similar to the Reference Model

• individuals represent annotation values

– hasTag property

• string value of annotation

[Figure: STTS Annotation Model (STTS: German part-of-speech tags). The individuals ADJD and ADJA are linked by instance-of to concepts under Adjective (is-a); the individual ADJA has hasTag "ADJA".]
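A minimal Python sketch of what an Annotation Model holds, using the STTS adjective tags ADJA and ADJD as the example: concepts arranged by is-a, and individuals that carry the annotation string via hasTag. The concept names and the dictionary layout are assumptions for illustration; the actual OLiA models are OWL/DL files, not Python objects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Individual:
    name: str      # identifier of the individual in the Annotation Model
    concept: str   # concept the individual is an instance of
    has_tag: str   # hasTag: string value of the annotation


# ASSUMPTION: illustrative concept names (concept -> direct superconcept).
STTS_CONCEPTS = {
    "Adjective": None,
    "AttributiveAdjective": "Adjective",
    "PredicativeOrAdverbialAdjective": "Adjective",
}

STTS_INDIVIDUALS = [
    Individual("ADJA", "AttributiveAdjective", "ADJA"),
    Individual("ADJD", "PredicativeOrAdverbialAdjective", "ADJD"),
]


def concepts_for_tag(tag):
    """Return the concept chain (most specific first) for an annotation string."""
    for individual in STTS_INDIVIDUALS:
        if individual.has_tag == tag:
            chain = []
            concept = individual.concept
            while concept is not None:
                chain.append(concept)
                concept = STTS_CONCEPTS[concept]
            return chain
    return []


if __name__ == "__main__":
    print(concepts_for_tag("ADJA"))
```

Keeping the tag string on the individual, separate from the concept hierarchy, means the same concepts can be reused if a corpus spells its tags differently.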

OLiA Linking Model

Annotation model concepts are defined as subclasses of Reference Model concepts

– properties as sub-properties

– individuals as instances

The linking is physically separated from the models

– one possible interpretation of Annotation Model concepts in terms of the Reference Model
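The separation of models and linking can be sketched in Python as three independent tables: the Annotation Models, the Reference Model hierarchy, and a linking table that can be swapped out without touching either. All prefixed names below (stts:, penn:, olia:) are assumptions for illustration and are not copied from the published ontologies.

```python
# Annotation Models: tag -> Annotation Model concept (one table per scheme).
ANNOTATION_MODELS = {
    "stts": {
        "ADJA": "stts:AttributiveAdjective",
        "ADJD": "stts:PredicativeAdjective",
    },
    "penn": {
        "JJ": "penn:Adjective",
        "NNP": "penn:ProperNoun",
    },
}

# Reference Model: concept -> direct superconcept.
REFERENCE_SUBCLASS_OF = {
    "olia:AttributiveAdjective": "olia:Adjective",
    "olia:PredicativeAdjective": "olia:Adjective",
    "olia:Adjective": "olia:MorphosyntacticCategory",
    "olia:ProperNoun": "olia:Noun",
    "olia:Noun": "olia:MorphosyntacticCategory",
}

# Linking Model, kept separate: one possible interpretation of
# Annotation Model concepts in terms of the Reference Model.
LINKING = {
    "stts:AttributiveAdjective": "olia:AttributiveAdjective",
    "stts:PredicativeAdjective": "olia:PredicativeAdjective",
    "penn:Adjective": "olia:Adjective",
    "penn:ProperNoun": "olia:ProperNoun",
}


def reference_ancestors(concept):
    """Return the concept and all its superconcepts in the Reference Model."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = REFERENCE_SUBCLASS_OF.get(concept)
    return chain


def tags_for(reference_concept):
    """Rewrite a Reference Model concept into the matching tags of each scheme."""
    result = {}
    for scheme, tags in ANNOTATION_MODELS.items():
        hits = sorted(
            tag
            for tag, concept in tags.items()
            if reference_concept in reference_ancestors(LINKING[concept])
        )
        if hits:
            result[scheme] = hits
    return result


if __name__ == "__main__":
    print(tags_for("olia:Adjective"))
```

Because the linking lives in its own table, a different interpretation of a scheme only requires replacing LINKING, which mirrors the physical separation described above.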

[Figure: OLiA Linking Model. Labels: Adjective, Attributive Adjective, Morphosyntactic; relations: is-a, instance-of]
