Current Topics in Applied Computer Science: Semantic...

Annotation Interoperability

Christian Chiarcos

chiarcos@informatik.uni-frankfurt.de

EUROLAN-2015, 2015, July 21, Sibiu, Romania

Annotation Interoperability

Tue, July 21st, 09:00-10:30

Annotation Interoperability I

Ontologies of Linguistic Annotation – Motivations and Principles

Wed, July 22nd, 11:00-12:30

Annotation Interoperability II

Applications and Use Cases

Wed, July 22nd, 14:00-15:30

Annotation Interoperability III

Hands-on session

Annotation Interoperability I

Ontologies of Linguistic Annotation: Motivations and Principles

1. Conceptual Interoperability

2. Towards a modular set of linked ontologies

3. Structure and history of OLiA ontologies

4. Use case I: Documentation and formalization

5. A closer look at an example: MULTEXT-East

6. Use case II: Cross-tagset search via query rewriting

Before we proceed …

… please download and install Protégé 5.0 (Desktop version) over the day

http://protege.stanford.edu/

for the hands-on session tomorrow

• we'll be building annotation models and linking them

Conceptual Interoperability

Problem and earlier approaches

Interoperability

• Language resources (tools, corpora, dictionaries) are interoperable if we can (re-)use them as components of a coherent system/resource

• e.g., tools

– using tagger A with parser B

• with a domain-adapted tagger A,* and a general-purpose parser B

* think of POS taggers for the biomedical domain (e.g., Genia) which use different tokenization strategies than out-of-the-box parsers

• e.g., corpora

– run the same query on corpus A and corpus B

more data, more likely significant results, comparable results

• e.g., dictionary + tool/corpus

– use dictionary A as a component for tagger B

• if grammatical categories correspond to tags in the tagset that the tagger is trained on

Dimensions of Interoperability

• Structural Interoperability

– use the same format / mode of access

– more on this tomorrow

• Conceptual Interoperability

– use the same vocabularies, e.g., for linguistic annotations

– for the moment, we focus on the most elementary level: morphosyntax

(parts-of-speech, agreement features)

Interoperability Issues: Monolingual

• When language resources for a low-resource language are developed, different people have different ideas, e.g., for English (by the mid-1990s)

Token    Susanne   Penn
The      AT        DT
Fulton   NP1s      NNP
County   NNL1cb    NNP
Grand    JJ        NNP
Jury     NN1c      NNP
said     VVDv      VBD
Friday   NPD1      NNP

Susanne: 395 tags (word classes, morphological features, syntactic features, lexical classes)

Penn: 57 tags (word classes, number and degree)

• Integrating both resources allows us to

– apply more wide-scale statistical analyses

– increase training data for supervised POS tagging

– increase test data for unsupervised POS tagging

Interoperability Issues: Multilingual

• with interoperable POS tags used across different languages, …

– we can apply the same unlexicalized NLP tools (e.g., parsers, cf. McDonald et al. 2013)

– we can perform comparative corpus studies

– we simplify multilingual annotation projection

Violations of Interoperability

• ROSANA Anaphor Resolution (Stuckardt 2001)

– required Connexor parser

• a commercial product

• UIMA annotation type systems

– NLP modules using the same annotation types are interoperable, but different groups develop their own, even for the same tools for the same language

Classical solution: Standardization

• Expert Advisory Group on Language Engineering Standards (EAGLES)*

– European standardization project (1993–1996)

– further elaborated by MULTEXT-East and ISLE/Parole

• Recommendations for POS tag sets

– derived in a bottom-up manner

– no theoretical specification of tag sets, only identification of commonly used terms

* http://www.ilc.cnr.it/EAGLES96/home.html

Issues with EAGLES

… although linguists agree on the general "common-sense" definitions of categories like proper noun, common noun etc, our analysis of competing tagsets for English corpora shows that these categories are in fact 'fuzzy', and different corpus tagging projects have adopted subtly but significantly different definitions, probably unaware that their analyses are incompatible with those of other linguists …

(Hughes et al. 1995)

* EAGLES is a classical case, our generation is just about to re-invent this wheel with "Universal Dependencies" (http://universaldependencies.github.io/)

Issues with EAGLES, cont'd

• Certain phenomena are hard to group with "major" categories

– quantifiers (all, some, many)

• (pre-)determiner? (all the books)

• pronoun? (all of them)

• number? (all books ~ 25 books)

• adjective? (all books ~ green books), suggested for inflecting languages

– attributive pronouns (his book)

• pronoun?

• determiner?

– adjectival participles (enduring freedom)

• verb?

• adjective?

Issues with EAGLES, cont'd

• Certain phenomena are hard to group with "major" categories

• because "morphosyntax/parts-of-speech" conflates different levels of description, e.g.,

– syntax vs. semantics

• attributive pronoun => determiner vs. pronoun

– syntax vs. morphology

• adjectival participles => adjective or verb

– morphology vs. semantics

• ordinal numbers => adjectives vs. numerals

– homophony vs. linguistically defined categories

• VH for auxiliary have but also have as a main verb

Issues with EAGLES, cont'd

• Certain phenomena are hard to group with "major" categories

• BUT

– standardization towards a meta-tagset implicitly enforces unambiguous classification

• taxonomical/tree-like structure

=> independent decisions by tagset designers

=> incompatibilities, e.g., AUX "auxiliary verb" vs. "potential auxiliary verb"

Issues with EAGLES, cont‘ed

• EAGLES is prescriptive

– a standard-conformant tagset needs to provide certain categories

• even if not relevant to a language

• e.g.

– Determiner (lacking for most Slavic languages)

– Adjective (lacking for Chinese)

– Noun-Verb distinction (debated for Fijian and Inuktitut)

=> EAGLES is specific to Western European languages

Issues with EAGLES, cont‘ed

• EAGLES is built in a bottom-up fashion

– if unknown phenomena for novel languages are encountered, they are added as optional (language-specific) features

– existing features may or may not be re-used

• later: some problematic cases from MULTEXT-East

Issues with EAGLES, cont‘ed

• EAGLES requires a 1:1-mapping* from standard-conformant tagsets

– every language-specific tag is mapped to exactly one EAGLES tag, so that they are equivalent

– given the definitional problems mentioned before, we'd like to express whether a mapping is perfect or imprecise

• or indicate partial overlaps with standard categories

* in fact, tags can be underspecified, so it is a 1:m mapping

Issues with EAGLES, cont‘ed

• EAGLES provides a fixed level of granularity

– more fine-grained categories are abandoned, e.g., semantic classes in Susanne

• for reasons of practicality, this level of granularity isn‘t at maximum scale

=> reductionism

=> many shortcomings of the standardization approach can be addressed by modelling linguistic reference terminology by means of ontologies

Towards an ontology of linguistic terminology

• Goal

– develop and apply an ontology as a terminological backbone of different kinds of linguistic annotation

• Use cases

– overcome differences in task-, domain- or language-specific annotations

– provide unified access to data analysed with heterogeneous terminology

Towards an ontology of linguistic terminology

• Ontology

– conceptualization of a certain domain

• e.g. a taxonomy of linguistic terms

– hierarchically and relationally structured

• OWL2/DL (Web Ontology Language)*

– formal description language for ontologies

– formalizes description logics

• conceptual subsumption (rdfs:subClassOf)

• logical operators (incl. disjunction and negation)

* Web Ontology Language, http://www.w3.org/TR/owl2-overview/
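
Conceptual subsumption can be illustrated without any OWL tooling. The following is only a minimal sketch in plain Python, not how an OWL reasoner works internally: the dictionary stands in for rdfs:subClassOf statements, the concept names are borrowed from the OLiA examples in these slides, and the entries for Pronoun and AttributivePronoun are assumptions added to show that a concept may have more than one parent.

```python
# Toy taxonomy: concept -> direct superconcepts (stand-in for rdfs:subClassOf).
# "Pronoun" and "AttributivePronoun" are illustrative assumptions.
SUBCLASS_OF = {
    "DemonstrativeDeterminer": {"Determiner"},
    "Determiner": {"PronounOrDeterminer"},
    "Pronoun": {"PronounOrDeterminer"},
    "PronounOrDeterminer": {"MorphosyntacticCategory"},
    # Unlike a tree-shaped tagset, an ontology allows several parents.
    "AttributivePronoun": {"Pronoun", "Determiner"},
}


def ancestors(concept):
    """Return every concept that subsumes `concept` (transitive closure)."""
    seen = set()
    stack = [concept]
    while stack:
        for parent in SUBCLASS_OF.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def is_a(concept, candidate):
    """True if `candidate` is `concept` itself or one of its superconcepts."""
    return concept == candidate or candidate in ancestors(concept)
```

With this toy data, `is_a("DemonstrativeDeterminer", "MorphosyntacticCategory")` holds by transitivity, while `is_a("Determiner", "Pronoun")` does not.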

Towards an ontology of linguistic terminology

• Against multiple tag sets

– unified representation of heterogeneous data

• linked to multiple different tag sets

– transparent

• abstraction from tag set specifics

– formal definitions

• based on description logics

Towards an ontology of linguistic terminology

• Against standardisation

– different conceptualizations

• language-specific traditions

• domain-specific conceptualizations

– different granularity

– implicit interpretation

• when mapping annotations to standard terms

Towards a modular set of linked ontologies

Just using a central ontology isn‘t enough

A joint, extensible terminology repository?

• Differences ... among different language resources and individual system objectives ... lead to variations in data category definitions and data category names.

• The use of uniform data category names and definitions ... contributes to system coherence and enhances the re-usability of data.

(Ide & Romary 2004)

The solution I

General Ontology of Linguistic Description (GOLD)

– ... large amounts of linguistic data on the Web ... from different languages can be automatically searched and compared ...

– ... the data and the various encoding schemes in which they are represented need an explicit semantics.

– ... a data model ... which is consistent with .... the Semantic Web ...

(Farrar & Langendoen 2003)

http://linguistics-ontology.org/gold

The solution II

ISO TC37/SC4 Data Category Registry (ISOcat)

– ... a family of data category standards designed to meet the needs of terminologists and other language experts developing a variety of electronic linguistic resources. ...

– ... to ensure interoperability among these domains ...

– ... with an eye to facilitating ... wide-scale information handling environments such as the Semantic Web ...

(Wright 2004)

http://isocat.org/

The RELISH project aimed to harmonize GOLD and ISOcat, and brought GOLD-2010 to ISOcat

– unfortunately, this only increased redundancy: 5 types of CommonNoun side by side

– RelCat: not materialized yet

https://tla.mpi.nl/relish/

The solution III-VIII

Documentation standards in typology

– EUROTYP (Bakker et al. 1993)

– AUTOTYP (Bickel & Nichols 2002)

– Typological Database System (TDS) ontology (Dimitriadis et al. 2009)

Standardization initiatives and multi-language tagsets

– EAGLES (Leech & Wilson 1996)

– MULTEXT/East (Erjavec 2010)

– Common POS tagset for Indian languages (Baskaran et al. 2008)

– Universal POS tags / Universal Dependencies (Petrov et al. 2012)

Another Problem

Imagine you plan to develop a tool that makes use of a terminology repository.

Which one would you choose?

Maybe it's not even your choice ...

... your clients may have their own preferences ... and different clients may have different preferences

Another Solution

Modular architecture

– Instead of limiting ourselves to one, we may make use of an intermediate representation that links to all of them

– If we want to avoid losing information by replacing annotations with reference categories, the original annotation scheme should be formalized as well

Ontologies of Linguistic Annotation (OLiA) http://purl.org/olia

Structure of OLiA ontologies

and their relation to other terminology repositories

Ontologies of Linguistic Annotation

modular OWL/DL ontologies

– Annotation Models

• annotation scheme

– OLiA Reference Model

• common terminology

– External Reference Models

• existing terminology repositories

OLiA Reference Model

– interface between annotations and (multiple) terminology repositories

[Figure: Annotation Models linked to the OLiA Reference Model, which is in turn linked to several Terminology Repositories]

OLiA Reference Model

• harmonization of repositories of annotation terminology

• morphosyntax & morphology

– 39 schemes

– ~70 languages*

• syntax, discourse structure, anaphora, information structure

* including multilingual annotation schemes: Tapanainen & Järvinen (1997), Dipper et al. (2007), and Erjavec (2010)

OLiA Reference Model

[Figure: excerpt of the OLiA Reference Model]

concepts: DemonstrativeDeterminer is-a Determiner is-a PronounOrDeterminer is-a MorphosyntacticCategory; AccusativeCase is-a Case is-a MorphologicalFeature

properties: hasCase (x : MorphosyntacticCategory, y : Case)

OLiA Annotation Models

• OWL/DL formalizations of annotation schemes

– structure similar to the Reference Model

• individuals represent annotation values

– hasTag property

• string value of annotation


OLiA Annotation Model

STTS Annotation Model (STTS: German part-of-speech tags)

[Figure: the individuals ADJD and ADJA are instances of Adjective, which is-a POS; the individual ADJA carries hasTag "ADJA"]

OLiA Linking Model

Annotation model concepts are defined as subclasses of Reference Model concepts

– properties as sub-properties

– individuals as instances

The linking is physically separated from the models

– one possible interpretation of Annotation Model concepts in terms of the Reference Model


OLiA Linking Model

STTS Linking

[Figure: in the STTS Annotation Model, the individuals ADJD and ADJA (hasTag "ADJA") are instances of Adjective, which is-a POS. In the OLiA Reference Model, AttributiveAdjective is-a Adjective is-a MorphosyntacticCategory. The linking connects both models: the STTS Adjective is-a the OLiA Adjective, and ADJA is an instance of AttributiveAdjective.]
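
How an annotation model, the reference model and the linking combine can be sketched with three small triple-like tables. This is only an illustration in plain Python under stated assumptions: the prefixes `stts:` and `olia:`, the dictionary layout and the function names are invented for the sketch, and the individual links follow one reading of the STTS figure (ADJA linked to AttributiveAdjective, the STTS Adjective concept linked to the OLiA Adjective).

```python
# Annotation Model (assumed excerpt of STTS): tags, individuals, concepts.
HAS_TAG = {"ADJA": "stts:ADJA", "ADJD": "stts:ADJD"}
STTS_INSTANCE_OF = {"stts:ADJA": {"stts:Adjective"}, "stts:ADJD": {"stts:Adjective"}}
STTS_SUBCLASS_OF = {"stts:Adjective": {"stts:POS"}}

# Reference Model (assumed excerpt of OLiA).
OLIA_SUBCLASS_OF = {
    "olia:AttributiveAdjective": {"olia:Adjective"},
    "olia:Adjective": {"olia:MorphosyntacticCategory"},
}

# Linking Model, kept separate from both models as on the slide.
LINK_SUBCLASS_OF = {"stts:Adjective": {"olia:Adjective"}}
LINK_INSTANCE_OF = {"stts:ADJA": {"olia:AttributiveAdjective"}}


def merge(*tables):
    """Union several key -> set tables, as importing several files would."""
    merged = {}
    for table in tables:
        for key, values in table.items():
            merged.setdefault(key, set()).update(values)
    return merged


def describe(tag):
    """Return the Reference Model concepts inferable for an annotation tag."""
    instance_of = merge(STTS_INSTANCE_OF, LINK_INSTANCE_OF)
    subclass_of = merge(STTS_SUBCLASS_OF, OLIA_SUBCLASS_OF, LINK_SUBCLASS_OF)
    found = set()
    stack = list(instance_of.get(HAS_TAG[tag], ()))
    while stack:
        concept = stack.pop()
        if concept not in found:
            found.add(concept)
            stack.extend(subclass_of.get(concept, ()))
    # Keep only the Reference Model view of the description.
    return {concept for concept in found if concept.startswith("olia:")}
```

Because the linking lives in its own tables, a different interpretation of STTS in terms of the Reference Model only requires swapping `LINK_SUBCLASS_OF` and `LINK_INSTANCE_OF`, leaving both models untouched.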

OLiA: Terminology Repositories

OLiA Reference Model further linked to terminological repositories

– if they are modelled in OWL/DL

• GOLD (Chiarcos 2008)

• ISOcat (Chiarcos 2010)

• OntoTag (Buyko et al. 2008)

• TDS (Dimitriadis et al. 2009)

Extensibility

• OLiA Reference Model provides only a possible view on linguistic terminology, adaptations for other communities are encouraged

External Reference Model / Terminology Repository

– its concepts are superclasses of the OLiA Reference Model concepts

• OLiA can be seen as a GOLD Community of Practice Extension

OLiA serving as interface to different tagsets

=> only one mapping needs to be defined

How to access OLiA ontologies

• modular structure

– every model is an independent ontology in a separate file

=> different name spaces

• declarative linking

– linking model in a separate file

• stts-link.rdf

• to use OLiA directly, import the linking model (and its imports)

[Figure: stts-link.rdf (Linking Model) connects stts.owl (STTS Annotation Model) with olia.owl (OLiA Reference Model)]

File Structure

olia.owl: OLiA Reference Model

stts.owl: Annotation Model STTS, linked via stts-link.rdf

susa.owl: Annotation Model Susanne, linked via susa-link.rdf

penn.owl: Annotation Model Penn, linked via penn-link.rdf

...

For every Annotation Model, there is at least one Linking Model linking it with the OLiA Reference Model

How to access multiple OLiA ontologies

[Figure: the file structure as before (olia.owl; stts.owl with stts-link.rdf; susa.owl with susa-link.rdf; penn.owl with penn-link.rdf; ...), plus the master file all.rdf on top]

Create a master file (all.rdf) which imports the Linking Models with their imports
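
A master file of this kind is little more than a list of owl:imports statements. The sketch below generates one with plain string handling; it is a hedged illustration, not the actual OLiA file. The base URL `http://example.org/olia/` is a placeholder, the function name is made up, and Turtle is used for readability even though the slides name the file all.rdf.

```python
OWL_NS = "http://www.w3.org/2002/07/owl#"


def master_file(base, linking_models):
    """Build a Turtle document whose ontology imports every linking model.

    `base` is a placeholder base URL (assumption), `linking_models` a
    non-empty list of linking model file names such as "stts-link.rdf".
    """
    imports = [f"    owl:imports <{base}{name}>" for name in linking_models]
    lines = [
        f"@prefix owl: <{OWL_NS}> .",
        "",
        f"<{base}all.rdf> a owl:Ontology ;",
        " ;\n".join(imports) + " .",
    ]
    return "\n".join(lines)


# Example with the three linking models named on the slides.
example = master_file(
    "http://example.org/olia/",
    ["stts-link.rdf", "susa-link.rdf", "penn-link.rdf"],
)
```

Since each linking model itself imports its Annotation Model and the Reference Model, loading the generated file in an OWL tool that follows imports should pull in all of them.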

How to access external terminology repositories

[Figure: the master file all.rdf imports gold-link.rdf (Linking Model), which connects olia.owl (OLiA Reference Model) with gold.owl (Terminology Repository, e.g., GOLD)]

Analogously, external reference models (terminology repositories) can be included

For querying (etc.), one can access external conceptual models

=> simplifications with SPARQL Update

Inferring Conceptual Descriptions

[Figure: conceptual descriptions are inferred along the links from Annotation Models via the OLiA Reference Model to the Terminology Repositories]

Further analogous inference of GOLD or ISOcat concepts

=> interoperable with both repositories

Terminology

• Translation from tags to ontological descriptions (triple sets)

– comparable representations for annotations of different origin

=> mapping between tagsets

=> concept-based ensemble combination architecture

=> concept-based corpus querying

more in a minute

A brief history of OLiA

Original research context

„Sustainability of Linguistic Data“ (2005-2008)

co-operation project between three German collaborative research centers

CRC441 „Linguistic data structures“ (Tübingen)

CRC538 „Multilingualism“ (Hamburg)

CRC632 „Information Structure“ (Potsdam/Berlin)

data collections of research projects should be kept available for later research activities

Original use case

• Motivation

– structural differences between different annotations/analyses

• hindering interoperability between concurrent taggers/tag sets

reference to a common terminological backbone

Developing an ontology: Procedure

• derive a taxonomy of word classes from EAGLES

=> "EAGLES ontology"

• augment with categories from other tag sets

=> "E(xtended)-EAGLES ontology"

• harmonize E-EAGLES ontology with GOLD

– enrichment of structures

– possible revisions of GOLD

=> "E-GOLD ontology"

Developing an ontology: The EAGLES ontology (2005)

• hierarchical interpretation of EAGLES meta tags

– word classes

• noun, verb, adjective, ...

=> top level categories

– recommended features

• common noun vs. proper noun

=> subclasses

– purely inflectional features ignored

• case, definiteness of nouns, mood, etc.

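Developing an ontology: The EAGLES ontology (2005)

[Figure: class diagram with subclass and disjointness relations: Verb has the subclasses FiniteVerb and NonFiniteVerb; NonFiniteVerb has the subclasses Infinitive and Participle]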

Developing an ontology: The extended EAGLES ontology (2005)

[Figure: the same class diagram, with AdverbialParticiple ("transgressive", from the CRC441/B1 tagset) added next to Infinitive and Participle under NonFiniteVerb]

Developing an ontology: E-GOLD (2006)

• use GOLD as a reference ontology

– Number as a sub-class of Quantifier

• suggested additions to GOLD

– CommonNoun vs. ProperNoun

• suggest revisions of GOLD

– She's the one.

• Number ⊑ Quantifier ⊑ Determiner ???

=> Quantifier as top-level category

Developing an ontology: OLiA Reference Model

• extended in accordance with further annotation schemes

• extended for syntax (2007) and discourse (2014, experimental)

• linked to OntoTag (2008), ISOcat (2010), MULTEXT/East (2011), TDS (2012), lexinfo (2015)

• to be linked to Universal Dependencies => hands-on session tomorrow

Conceptual Interoperability

Penn

The     DT    Determiner ⊓ PronounOrDeterminer
Fulton  NNP   ProperNoun ⊓ Noun ⊓ ∃hasNumber.Singular
County  NNP   ProperNoun ⊓ Noun ⊓ ∃hasNumber.Singular
Grand   NNP   ProperNoun ⊓ Noun ⊓ ∃hasNumber.Singular
Jury    NNP   ProperNoun ⊓ Noun ⊓ ∃hasNumber.Singular
said    VBD   (MainVerb ⊔ StrictAuxiliaryVerb) ⊓ Verb ⊓ ∃hasTense.Past [sic!]
Friday  NNP   ProperNoun ⊓ Noun ⊓ ∃hasNumber.Singular

Susanne

The     AT      DefiniteArticle ⊓ Article ⊓ Determiner ⊓ PronounOrDeterminer
Fulton  NP1s    Surname ⊓ ProperNoun ⊓ Noun ⊓ ∃hasNumber.Singular
County  NNL1cb  TopographicalNoun ⊓ ProperNoun ⊓ Noun ⊓ ∃hasNumber.Singular
Grand   JJ      Adjective ⊓ ∃hasDegree.Positive
Jury    NN1c    CommonNoun ⊓ Noun ⊓ ∃hasNumber.Singular
said    VVDv    MainVerb ⊓ Verb ⊓ ∃hasTense.Past
Friday  NPD1    TemporalNoun ⊓ ProperNoun ⊓ Noun ⊓ ∃hasNumber.Singular

=> mostly identical triples, just a few more from Susanne
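
Once tags are translated into concept descriptions, annotations of different origin can be compared with ordinary set operations. The following is a rough sketch under a simplifying assumption: each conjunctive description is flattened into a set of conjuncts (the disjunction in the Penn description of VBD is left out), and the dictionary and function names are made up for the example.

```python
# Conjuncts of the descriptions shown above, one set per tag (excerpt).
PENN = {
    "DT": {"Determiner", "PronounOrDeterminer"},
    "NNP": {"ProperNoun", "Noun", "hasNumber.Singular"},
}
SUSANNE = {
    "AT": {"DefiniteArticle", "Article", "Determiner", "PronounOrDeterminer"},
    "NP1s": {"Surname", "ProperNoun", "Noun", "hasNumber.Singular"},
    "JJ": {"Adjective", "hasDegree.Positive"},
}


def compare(penn_tag, susanne_tag):
    """Split two descriptions into shared and tagset-specific conjuncts."""
    penn, susanne = PENN[penn_tag], SUSANNE[susanne_tag]
    return {
        "shared": penn & susanne,
        "penn_only": penn - susanne,
        "susanne_only": susanne - penn,
    }
```

For "Fulton" (NNP vs. NP1s) everything Penn says is also said by Susanne, which only adds `Surname`. For "Grand" (NNP vs. JJ) the shared set is empty, so the two corpora genuinely disagree on this token.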

Limitations

• The OLiA Reference Model is not fully axiomatized

• This is not possible in a language-independent way

– "only Nouns, Pronouns, Determiners and Adjectives have Gender agreement"?

• But what about Slavic verbs in past tense?

– "only FiniteVerbs have Tense"?

• past and present participles

• tensed infinitives in Old Norse and Ancient Greek

– "Adverbs don't agree"?

• German meinetwegen, deinetwegen, seinetwegen

– "Nouns are not finite Verbs"?

• Inuktitut

qimiutuq „(he) has a dog“ (v.3s.vp.) = „dog-owner“ (n.abs)

qimiutup „he has a dog“ (vpart.) „dog-owner“ (n.erg)

• no disjointness and cardinality axioms

– need to be defined in a language-specific way

=> can only be heuristically extrapolated from annotations

Linking the ontology

• applies LOD principles to the relation between tagsets

• used as a vocabulary

– NLP Interchange Format, Apache Stanbol (for linguistic annotations)

– lemon (machine-readable dictionaries)

• linked with bibliographical data

– Virtuelle Fachbibliothek Allgemeine Sprachwissenschaft (2015-2016)

Original Use Case

• Tagset formalization

– formal definitions

– uniformly laid-out HTML, automatically generated from an ontology

• Advanced use cases

– ontology-based corpus querying

– ontology-based NLP applications

A closer look at an example: MULTEXT-East

Chiarcos & Erjavec (2011)

MULTEXT-East

• Corpus and dictionary project (Veronis & Ide 2004, Erjavec 2010)

• Idea: Extend EAGLES to Eastern Europe

• Parallel "1984" corpus plus morphosyntactically annotated dictionaries

– English

– Slavic (Bulgarian, Croatian, Czech, Macedonian, Polish, Resian, Russian, Serbian, Slovak, Slovene, Ukrainian)

– Finno-Ugric (Estonian, Hungarian)

– Romanian

– Persian

Building the MULTEXT-East Ontology

• Annotation guidelines in TEI/XML

• Automatically converted to OWL2/DL using XSLT

– common specifications as TBox

– language-specific as ABoxes importing the TBox

• discussed with MULTEXT-East users and maintainers

– manually revised

• common specifications semiautomatically linked with OLiA Reference Model

– OLiA Reference Model manually extended

http://nl.ijs.si/ME/owl/

MULTEXT-East Morphosyntactic Descriptions

Multiple documents

• common specifications

– provide all values used in MULTEXT-East corpora/dictionaries

• language-specific

– similar, for the individual language

MULTEXT-East Common Specifications

• categories become top-level concepts

• Fine-grained parts of speech are encoded as features (~ EAGLES)

– e.g., Noun, Type=common (Nc)

– converted into sub-concepts

• choice followed OLiA Reference Model

• other features are encoded as object properties plus associated feature concept

MULTEXT-East Language Specifics

• Stored in separate document

– No hierarchical structure inferred, import common specifications

• Add tags as individuals

– instances of concepts and features

– with tag value

– and object properties to itself

Observations

Like EAGLES, MTE uses a positional tagset

bias against adding new attributes

systematic overload (of attributes and values)

A manual revision was thus unavoidable
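
What "positional" means can be made concrete with a small decoder. This is a sketch under assumptions, not the MULTEXT-East specification: the attribute table below is a shortened, illustrative excerpt for nouns (the real tables are longer and partly language-specific), and the names `NOUN_ATTRIBUTES`, `CATEGORIES` and `decode` are invented for the example.

```python
# Illustrative excerpt: position i of a noun tag encodes the i-th attribute.
NOUN_ATTRIBUTES = [
    ("Type", {"c": "common", "p": "proper"}),
    ("Gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("Number", {"s": "singular", "d": "dual", "p": "plural"}),
    ("Case", {"n": "nominative", "g": "genitive", "d": "dative", "a": "accusative"}),
]
CATEGORIES = {"N": ("Noun", NOUN_ATTRIBUTES)}


def decode(msd):
    """Decode a positional tag such as "Ncmsn" into category and features."""
    category, attributes = CATEGORIES[msd[0]]
    features = {}
    for (name, values), code in zip(attributes, msd[1:]):
        if code != "-":  # "-" marks an attribute that does not apply
            features[name] = values[code]
    return category, features
```

`decode("Nc")` yields `("Noun", {"Type": "common"})`, the example used for the common specifications. Because every new attribute needs a new position in every tag of that category, designers tend to reuse an existing position instead, which is one way to read the overload noted above.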

Manual Revision

• Adjust automatically generated names

CorrelatCoordConjunction < Coord, Type=correlat expanded to CorrelativeCoordinatingConjunction

YesDefiniteness < Definiteness=yes simplified to Definite

Manual Revision

• Manual hierarchical reanalysis of (some) feature values

CliticProximalDeterminer ⊑ CliticDefiniteDeterminer

(could be presented as a flat list in MTE only)

Resolving Attribute Overload

• one attribute groups together unrelated phenomena from different languages

• Definiteness =>

– CliticDeterminerType (presence of a postfixed article of Romanian, Bulgarian and Persian nouns and adjectives)

– ReductionFeature (full and reduced adjectives in many Slavic languages)

– PersonOfObject (the so-called 'definite conjugation' of Hungarian verbs)

Documenting Value Overload

• one attribute value groups together unrelated phenomena from different languages

• Definiteness=yes (=> Definite), i.e.,

– clitic definite determiner (CliticDeterminerType in Rom. and Bulg.)

– clitic specific determiner (CliticDeterminerType in Persian)

– verb with a definite 3rd-person direct object (PersonOfObject in Hungarian)

Definite ⊑ CliticDefiniteDeterminer ⊔ CliticSpecificDeterminer ⊔ PersonOfObject

• In addition, add concept as "anchor" for such ambiguous features

Definite ⊑ AmbiguousDefinitenessFeature

Redundancy

• MTE tagsets were created bottom-up from existing resources, often unaware of earlier treatment of the same phenomenon

• e.g., reduced (vs. full) adjectives in Slavic

– Czech MTE Formation=nominal,

– Polish MTE Definiteness=short-art

=> marked by owl:equivalentClass

Definiteness in the MULTEXT-East Ontologies

Definiteness=1s2s (2)

Definiteness=distal (d)

Definiteness=full-art (f)

Definiteness=no (n)

Definiteness=proximal (p)

Definiteness=short-art (s)

Definiteness=yes (y)

Linking to OLiA

• After discussion with the MTE community, the TBox was semiautomatically linked with the OLiA Reference Model

• semiautomatically

– automatically link concepts with the same local name

– suggest linking candidates for concepts with overlapping local names => selection or comment

– comment linking status

• manually revise, check every concept with a comment
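
The semiautomatic steps can be sketched as a small matching routine. This is one possible reading, not the procedure actually used for MULTEXT-East: "overlapping local names" is interpreted here as sharing at least one CamelCase word, and the function names and the three-way result are assumptions of the sketch.

```python
import re


def words(local_name):
    """Split a CamelCase local name into its component words."""
    return set(re.findall(r"[A-Z][a-z]*|[a-z]+|\d+", local_name))


def propose_links(annotation_concepts, reference_concepts):
    """Return (links, candidates, unlinked) for a list of local names.

    links:      concept -> reference concept with the same local name
    candidates: concept -> reference concepts sharing a word (need review)
    unlinked:   concepts without any suggestion (possible extensions)
    """
    reference = set(reference_concepts)
    links, candidates, unlinked = {}, {}, []
    for concept in annotation_concepts:
        if concept in reference:
            links[concept] = concept
            continue
        overlapping = sorted(r for r in reference if words(concept) & words(r))
        if overlapping:
            candidates[concept] = overlapping
        else:
            unlinked.append(concept)
    return links, candidates, unlinked
```

Everything that ends up in `candidates` or `unlinked` corresponds to a concept "with a comment" that still has to be checked manually.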

Extending OLiA

• During the semiautomatic linking, several cases came up where no OLiA concept could be found

• NumeralAgreementClass

– SingularQuantifier (agreement pattern like numeral 1)

– DualQuantifier (agreement pattern like numeral 2)

– PaucalQuantifier (agreement pattern for quantities between singular [dual] and plural quantifiers)

– PluralQuantifier (agreement pattern like high numerals)

94

Practical Results

• developing the ontology helped to identify inconsistencies

• it facilitated dialog between

– an NLP person with limited knowledge about the languages under consideration

– language specialists with different degrees of awareness of the structure of other MTE language models

• the resource can be used for documentation

– using browsable OWL or the generated HTML

• Advanced uses possible

95

Use case II: Cross-tagset search via query rewriting

Ontology-based query rewriting

• “Sustainability of linguistic resources”

– common terminological interface for querying heterogeneously annotated data

• OntoClient (Rehm et al. 2008)

– preprocessor for ontology-sensitive corpus queries

• OntoClient@ANNIS 1.0

– ANNIS (http://annis-tools.org/): web application for corpus querying

– OntoANNIS (Chiarcos & Goetze 2007): ANNIS meets OntoClient

97

Ontology-based query rewriting

[Figure: query rewriting via ontology lookup]

corpus query: ... pos in { Noun \ Nominal} & cat = ...

ontology lookup:

1. retrieve instances and tags

2. apply set operators

Reference Model concepts: Noun, ProperNoun, CommonNoun, MassNoun, CountableNoun, Nominal, VerbalNoun, Substantive

Annotation Model concepts (linked to the Reference Model): tibet:ProperNoun, tibet:CommonNoun, tibet:InanimateNoun, tibet:AnimateNoun, tibet:Person

Annotation Model tags: NAME, NOM_anim, NOM_anim_lq, NOM_inan, NOM_inan_lq, NOM_pers, NOM_pers_anim

return modified corpus query: ... pos = NN | pos = NCOM | pos = substantiv_masc_pl_dat_bel | pos = substantiv_masc_pl_akk_unb | pos = substantiv_fem_sg_ins_unb & cat = ...

query segmentation: UnparsedString, OntoKey, OntoLeftPar, OntoConcept, OntoOp, OntoConcept, OntoRightPar, UnparsedString

98
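The two lookup steps (retrieve tags, apply set operators) can be illustrated with a small sketch. The concept hierarchy and the concept-to-tag assignments below are toy data: the names are borrowed from the slide, but which tag belongs to which concept is an assumption made only for illustration, and tags_for, evaluate and to_disjunction are hypothetical helper names, not OntoClient functions.

```python
# Toy Reference Model: concept -> direct subconcepts (assumed for illustration).
SUBCONCEPTS = {
    "Noun": ["CommonNoun", "ProperNoun"],
}

# Toy linking: concept -> tags of the Annotation Model (assumed for illustration).
TAGS = {
    "ProperNoun": {"NAME"},
    "CommonNoun": {"NOM_inan", "NOM_anim", "NOM_pers"},
}

# Set operators of the query language mapped to Python set operations.
OPERATORS = {
    "and": set.intersection,
    "or": set.union,
    "without": set.difference,
}


def tags_for(concept: str) -> set:
    """Step 1: collect the tags of a concept and of all its subconcepts."""
    tags = set(TAGS.get(concept, ()))
    for sub in SUBCONCEPTS.get(concept, ()):
        tags |= tags_for(sub)
    return tags


def evaluate(left: str, op: str, right: str) -> set:
    """Step 2: combine the tag sets of two concepts with a set operator."""
    return OPERATORS[op](tags_for(left), tags_for(right))


def to_disjunction(attribute: str, tags: set) -> str:
    """Serialize a tag set as a disjunction for the rewritten corpus query."""
    return " | ".join(f"{attribute} = {tag}" for tag in sorted(tags))


rewritten = to_disjunction("pos", evaluate("Noun", "without", "ProperNoun"))
```

With this toy data, the query fragment pos in { Noun \ ProperNoun } would be rewritten into a disjunction over the three common-noun tags.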

OntoClient Query Language

Query := (UnparsedString* OntoQuery*)*

OntoQuery := OntoKey OntoLPar OntoExp OntoRPar

OntoKey := "in"

OntoLPar := "{"

OntoRPar := "}"

OntoExp := OntoConcept | (OntoExp OntoOp OntoExp)

OntoOp := "and" | "or" | "without" | "&" | "|" | "\"

OntoConcept := Upper model concept | Upper model relation "(" Upper model concept ")"
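A rough sketch of how a query could be segmented according to this grammar is given below. It rests on several assumptions: the parentheses in the OntoExp rule are read as grouping in the grammar notation (not as literal characters), operators are applied left to right, and the relation form of OntoConcept is not handled. The function names are hypothetical and do not reflect the OntoClient implementation.

```python
import re

# OntoQuery := "in" "{" OntoExp "}"; everything else stays an UnparsedString.
ONTO_QUERY = re.compile(r"\bin\s*\{([^{}]*)\}")
# Tokens inside the braces: concept names and symbolic operators.
TOKEN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*|[&|\\]")
OP_ALIASES = {"&": "and", "|": "or", "\\": "without"}
OPERATORS = {"and", "or", "without"}


def parse_onto_exp(text: str):
    """Parse 'Concept (Op Concept)*' into a left-nested (op, left, right) tree."""
    tokens = [OP_ALIASES.get(t, t) for t in TOKEN.findall(text)]
    if not tokens or tokens[0] in OPERATORS:
        raise ValueError(f"expected a concept in {text!r}")
    tree, rest = tokens[0], tokens[1:]
    if len(rest) % 2:
        raise ValueError(f"dangling operator or concept in {text!r}")
    for op, concept in zip(rest[0::2], rest[1::2]):
        if op not in OPERATORS or concept in OPERATORS:
            raise ValueError(f"malformed expression {text!r}")
        tree = (op, tree, concept)
    return tree


def split_query(query: str):
    """Split a query into ('text', str) and ('onto', tree) segments."""
    parts, pos = [], 0
    for match in ONTO_QUERY.finditer(query):
        if match.start() > pos:
            parts.append(("text", query[pos:match.start()]))
        parts.append(("onto", parse_onto_exp(match.group(1))))
        pos = match.end()
    if pos < len(query):
        parts.append(("text", query[pos:]))
    return parts


segments = split_query("pos in { Noun \\ Nominal} & cat = NP")
```

Note that the "&" outside the braces is left untouched as part of an UnparsedString; only operators inside in { ... } are interpreted against the ontology.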

99

OntoClient Interface to Query Language

Input            Output             e.g. TIGER
UnparsedString   UnparsedString
OntoKey          Key                "="
OntoLPar         LeftPar            "/"
OntoRPar         RightPar           "/"
OntoExp          tag (Disj tag)*    "NP1m|NP1c"
                 Disj               "|"
UnparsedString   UnparsedString
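The output side can be pictured as a small per-target configuration consisting of Key, LeftPar, RightPar and Disj. The sketch below is an assumption about how such a configuration might be applied; the class TargetSyntax and the function render are hypothetical names, and only the TIGER values are taken from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TargetSyntax:
    """Output symbols for one target query language."""
    key: str        # replaces OntoKey ("in")
    left_par: str   # replaces OntoLPar ("{")
    right_par: str  # replaces OntoRPar ("}")
    disj: str       # joins the retrieved tags


# Values for TIGER as given on the slide.
TIGER = TargetSyntax(key="=", left_par="/", right_par="/", disj="|")


def render(prefix: str, tags, suffix: str, syntax: TargetSyntax) -> str:
    """Emit: UnparsedString Key LeftPar tag (Disj tag)* RightPar UnparsedString."""
    body = syntax.disj.join(tags)
    return f"{prefix}{syntax.key}{syntax.left_par}{body}{syntax.right_par}{suffix}"


tiger_query = render("[pos", ["NP1m", "NP1c"], "]", TIGER)
```

Keeping the symbols in a configuration object means that a different query language would only need another TargetSyntax instance, which matches the idea of a configurable OntoClient.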

100

A sample application OntoANNIS


• OntoANNIS made it possible to query across different annotations, but it was only a prototype.

With the configurable OntoClient, a similar prototype for CWB was set up, connecting to an annotated edition of the Uppsala Corpus (Russian, 1 million tokens) hosted at Tübingen

• The mixture of technologies can create a bottleneck

=> motivation to explore a native SPARQL implementation

102

Tomorrow

• Second session on annotation interoperability

– SPARQL-native corpus querying

– Ontology-based ensemble combination

• Hands-on session on annotation interoperability

– building annotation models for different language editions of Universal Dependencies

– linking them

103

For tomorrow …

… please don't forget to download and install Protégé 5.0 (Desktop version)

http://protege.stanford.edu/

for the hands-on session tomorrow

• we'll be building annotation models and linking them

104

Selected References

Christian Chiarcos (2008). An ontology of linguistic annotations. LDV Forum. 23(1).

Christian Chiarcos (2010). Grounding an ontology of linguistic annotations in the Data Category Registry. LREC 2010 Workshop on Language Resource and Language Technology Standards (LT&LTS), Valletta, Malta.

Christian Chiarcos, Tomaž Erjavec (2011). OWL/DL formalization of the MULTEXT-East morphosyntactic specifications. In Proceedings of the 5th Linguistic Annotation Workshop. Association for Computational Linguistics.

Scott Farrar, D. Terence Langendoen (2003). A linguistic ontology for the semantic web. Glot International 7.3: 97-100.

John Hughes, Clive Souter, Eric Atwell (1995). Automatic Extraction of Tagset Mappings from Parallel-annotated Corpora. In: From Texts to Tags: Issues in Multilingual Language Analysis. Proceedings of SIGDAT Workshop in Conjunction with the 7th Conference of the European Chapter of the Association for Computational Linguistics. University College Dublin, Ireland.

Ryan McDonald, Joakim Nivre, et al. (2013). Universal Dependency Annotation for Multilingual Parsing. In Proc. ACL-2013, pp. 92-97.

Slav Petrov, Dipanjan Das, Ryan McDonald (2012). A universal part-of-speech tagset. Proc. LREC-2012.

Roland Stuckardt (2001). Design and Enhanced Evaluation of a Robust Anaphor Resolution Algorithm. Computational Linguistics 27(4):479-506

Sue Ellen Wright (2004). A Global Data Category Registry for Interoperable Language Resources. In Proc. LREC-2004.
