Post on 26-Mar-2015

Towards portability and interoperability for linguistic annotation and language-specific ontologies

Robert Munro & David Nathan

Endangered Languages Archive, School of Oriental and African Studies

Outline

1. Introduction and motivation

2. Linguistic ontologies and markups

3. Representing knowledge

4. Supporting fieldworkers

5. Supporting speakers

6. Conclusions

1. Introduction and motivation

Introduction

The main goal of this paper: how does GOLD meet the requirements of portability for language documentation and description (Bird & Simons, 2003)?

Road-testing: ability to meet the needs of archive users and contributors

Motivation

The Endangered Languages Archive (ELAR) is part of the Hans Rausing Endangered Languages Project (HRELP)

HRELP supports:
- the archive
- grants for documentation projects
- postgraduate programs focussing on language documentation

Motivation

We (ELAR) support a digital archive (preserve data and provide access to it)

We also train students and grantees in:
- markup strategies
- data management strategies
- multimedia development
- choice of recording equipment

Motivation

There is concern that:
- cataloguing metadata (IMDI / OLAC) has not yet been sufficiently extended (Nathan and Austin, 2004)
- rich linguistic and contextual information is not being recorded in well-formed portable formats/structures

Common ontologies present a solution to this

How does GOLD meet our needs

We find GOLD to be the most suitable ontology for supporting data portability

GOLD’s focus has been on ‘data analysis sets’

Summary

We suggest extending the focus to:
- data acquisition
- data access

Key extensions:
- formalising the definitions of concepts by representing them as a set of formal properties
- explicitly capturing the conventions and constraints for presentation (rendering)
- modelling features that are inherently indeterminate and/or complex structures

2. Linguistic ontologies and markups

Linguistic ontologies and markups

Ontology: strictly, what we agree exists

Markup: strictly, what we are certain about

Ontology and markup converge:
- only with consensus and complete confidence
- but there is rarely full confidence in the classification of new hard-to-classify phenomena in little-studied endangered languages

Indeterminacy

Builders of ontologies outside of linguistics have been reluctant to accept inherent indeterminacy:

In some cases, the incompatibilities [between ontologies] can be smoothed over by tweaking definitions of concepts or formalizations of axioms; in other cases, wholesale theoretical revision may be required. (Niles & Pease, 2001)

If we can identify the incompatibilities, we can model them

Supporting linguistics

A theory-neutral model of linguistics is not possible:
- Theories are poly-centric
- They will change

We need a pan-theory model of linguistics

Formalising definitions

Each concept in GOLD should be represented by a set of properties that describe that concept

Three possible values for a given property: ‘Yes’, ‘No’, or ‘Undefined’ (default)

To accurately represent variance: include enough properties to distinguish terms

For portability: include as many properties as possible

Formalising definitions

‘Yes’ can potentially be expanded:
- whether the property is mandatory or optional for the concept
- dependencies between properties for a concept
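The three-valued properties and the possible expansion of ‘Yes’ can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names (`Value`, `PropertyValue`, `Concept`), the `mandatory` and `requires` fields, and the property names `prop_a` to `prop_d` are hypothetical and are not part of GOLD.

```python
from dataclasses import dataclass
from enum import Enum


class Value(Enum):
    YES = "Yes"
    NO = "No"
    UNDEFINED = "Undefined"


@dataclass(frozen=True)
class PropertyValue:
    value: Value = Value.UNDEFINED   # 'Undefined' is the default
    mandatory: bool = False          # only meaningful when value is YES
    requires: tuple = ()             # names of properties this one depends on


class Concept:
    def __init__(self, name, **properties):
        self.name = name
        self._properties = properties

    def get(self, prop):
        # Any property never mentioned for this concept is 'Undefined'.
        return self._properties.get(prop, PropertyValue())

    def unmet_dependencies(self):
        """List (property, required) pairs where a 'Yes' property
        depends on a property that is not itself 'Yes'."""
        unmet = []
        for name, pv in self._properties.items():
            if pv.value is Value.YES:
                for required in pv.requires:
                    if self.get(required).value is not Value.YES:
                        unmet.append((name, required))
        return unmet


# Placeholder concept and property names, for illustration only.
concept = Concept(
    "ExampleConcept",
    prop_a=PropertyValue(Value.YES, mandatory=True),
    prop_b=PropertyValue(Value.YES, requires=("prop_c",)),
    prop_d=PropertyValue(Value.NO),
)
```

Keeping ‘Undefined’ as the default means that adding a new property to the ontology never silently changes the meaning of concepts that were described before that property existed.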

Example

‘Noun’ in GOLD:
Noun Definition: A noun is a broad classification of parts of speech which include substantives and nominals (Crystal 1997:371; Mish et al. 1990:1176). (http://emeld.org/gold-ns/description.html#Noun, last checked 23/05/2003)

How do I know if my definition is the same as that of Crystal or of Mish et al.?

Is it both definitions, or the common ground?

Example

Will future users of GOLD have the same definition?
- the core of ‘noun’ may have longevity
- the boundaries with other concepts will not

COPEs can define extensions in terms of sets of properties, and add those properties to GOLD

Example

GOLD: NOUN

COPEs: GerundNOUN, NomVerbNOUN

Can’t formally identify the similarities

Example

GOLD: NOUN

COPEs: GerundNOUN (+ property: verb suffix), NomVerbNOUN (+ property: verb suffix)

Can formally identify the similarities

Definition of NOUN can grow
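The contrast between the two example slides can be sketched with plain sets. This is a minimal illustration, not the GOLD or COPE data model: the helper names `extend` and `shared_beyond_parent` and the placeholder property `noun_core` are assumptions; only the "verb suffix" property comes from the slide.

```python
def extend(parent, extra):
    """Return a COPE concept: the parent's properties plus its own."""
    return frozenset(parent) | frozenset(extra)


def shared_beyond_parent(a, b, parent):
    """Properties two extensions have in common, beyond the parent's."""
    return (a & b) - frozenset(parent)


# 'noun_core' is a placeholder for whatever properties define NOUN in GOLD.
NOUN = frozenset({"noun_core"})

# Two COPEs independently extend NOUN and record the same property.
gerund_noun = extend(NOUN, {"verb_suffix"})
nomverb_noun = extend(NOUN, {"verb_suffix"})

similarities = shared_beyond_parent(gerund_noun, nomverb_noun, NOUN)

# The pool of properties known to the ontology grows with each extension.
known_properties = NOUN | gerund_noun | nomverb_noun
```

Here `similarities` contains `verb_suffix`, which is exactly the information that is lost when the two extensions are recorded only as differently named subtypes of NOUN.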

3. Representing knowledge

Rendering

Separating form from content:
- ideal for flexibility
- not possible for some materials (esp. video)

Rendering conventions / constraints

Some are well known:
- italicize part-of-speech in dictionaries
- align interlinear transcriptions

Some are not:
- representation of language-specific kinship systems, ethnobotanical ontologies etc.

Solution 1

Include a (written) description and/or example of the rendering conventions and constraints:
- hard-code the interface

Solution 2

Include formal representations of the conventions within the data:
- interface takes instructions from the data
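A minimal sketch of Solution 2, in which the record itself carries its rendering conventions and a generic interface only interprets them. The record layout (`fields`, `rendering`, `order`, `styles`), the style names, and the HTML-like output are all assumptions made for illustration; the sample entry is an invented English one.

```python
# Generic styles the interface knows how to apply.
STYLES = {
    "plain": lambda text: text,
    "italic": lambda text: f"<i>{text}</i>",
    "bold": lambda text: f"<b>{text}</b>",
}


def render(record):
    """Render a record using only the conventions stored inside it."""
    conventions = record["rendering"]
    parts = []
    for field_name in conventions["order"]:
        style = conventions["styles"].get(field_name, "plain")
        parts.append(STYLES[style](record["fields"][field_name]))
    return " ".join(parts)


# Hypothetical dictionary entry; the convention 'italicize part-of-speech'
# travels with the data instead of being hard-coded in the interface.
record = {
    "fields": {
        "headword": "example",
        "pos": "n.",
        "gloss": "a representative instance",
    },
    "rendering": {
        "order": ["headword", "pos", "gloss"],
        "styles": {"headword": "bold", "pos": "italic"},
    },
}

print(render(record))  # <b>example</b> <i>n.</i> a representative instance
```

The interface stays language independent because a different archive deposit can ship a different `rendering` block without any change to `render`.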

Solutions

These are two extremes:
- hard-coded and language specific
- data driven and language independent

Database architectures and linguistic ontologies:
- not designed for navigation
- ‘transparent’ access to such structures – who does it support?

4. Supporting fieldworkers

Supporting indeterminacy

There are two kinds of indeterminacy in linguistics:
- confidence in assigning a category (uncertainty)
- phenomena that are inherently variable, probabilistic, gradient or continuous

The most valuable information

The most valuable information that a field linguist learns may be the least likely to be annotated

Example: 7uhch in Lakandon Maya:
- A temporal-modal deictic expressing participant frames and speaker's footings (Bergqvist 2005)
- This term has been given the most thought by the researcher, but it is still not completely understood
- The uncertainty (or the extent of certainty) should be recorded: all the properties we do know

5 reasons for modelling uncertainty

1. To record the extent of our knowledge

For example, we want everything known about 7uhch in Lakandon Maya to be recorded, even if we don’t yet have a category for it

2. For searchability

If an archive implementing an ontology with uncertain categories exists, then we can more easily find existing solutions to a problem

If a problem is truly new, then we can allow future researchers to find it

3. To reach certainty

Even an indeterminate markup can allow a corpus analysis that can inform a decision about assigning the appropriate category

4. To highlight problems with descriptive frameworks

A feature may only appear to belong to multiple (or no) categories because the descriptive framework does not yet account for it

5. Because the concept is inherently indeterminate

The concept may be inherently fuzzy but not previously encountered as a continuous / contiguous phenomenon
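Reasons 1 and 2 above (recording what is known, and making unresolved items findable) can be sketched as follows. The `Annotation` class, its field names, the 0.5 threshold, and the forms `word_b` and `word_c` are hypothetical; the three property names for 7uhch simply paraphrase the description "temporal-modal deictic" given earlier and are not an analysis.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Annotation:
    form: str
    category: Optional[str] = None      # None: no category assigned yet
    confidence: Optional[float] = None  # annotator's confidence, 0.0 to 1.0
    properties: dict = field(default_factory=dict)  # everything we do know
    note: str = ""


def unresolved(annotations, threshold=0.5):
    """Annotations with no category, or a category held with low confidence."""
    return [
        a for a in annotations
        if a.category is None
        or (a.confidence is not None and a.confidence < threshold)
    ]


corpus = [
    # Known properties are recorded even though no category fits yet.
    Annotation(
        form="7uhch",
        properties={"temporal": True, "modal": True, "deictic": True},
        note="not completely understood",
    ),
    Annotation(form="word_b", category="Noun", confidence=0.9),
    Annotation(form="word_c", category="Verb", confidence=0.3),
]

for a in unresolved(corpus):
    print(a.form, a.category, a.properties)
```

A later researcher querying the archive for unresolved items would then find both the uncategorised form and the low-confidence one, together with whatever properties were recorded for them.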

Inherently indeterminate features

Eg: cline, gradience, squish, continuities, contiguities, vague, fuzzy, probabilistic

Many prosodic, semantic and discourse features are inherently continuous

There are growing arguments for probabilities to be part of our formal linguistic models for morphological and syntactic structures (Aarts, 2004; Baayen, 2003; Manning, 2003)

Inherently indeterminate features

Representing categories by formal properties meets the current requirements of modelling gradience (Aarts, 2004)

Perhaps the “ContinuousObject” concept of SUMO (Niles & Pease, 2001) could also be used?

The problem is, currently, largely unresolved
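One simple way to see how property sets can express gradience is to score an item by the share of a category's properties that it exhibits. This is a hedged sketch only: the `membership` function and the property names `p1` to `p5` are hypothetical, and the scoring rule is an assumption made for illustration rather than the model proposed in the cited work.

```python
def membership(item_properties, category_properties):
    """Fraction of the category's properties that the item exhibits."""
    category = set(category_properties)
    if not category:
        raise ValueError("category must define at least one property")
    return len(set(item_properties) & category) / len(category)


# Placeholder property names, for illustration only.
category = {"p1", "p2", "p3", "p4"}
central_item = {"p1", "p2", "p3", "p4"}
peripheral_item = {"p1", "p2", "p5"}

print(membership(central_item, category))     # 1.0
print(membership(peripheral_item, category))  # 0.5
```

The discrete Yes/No category assignment is the special case where only scores of 1.0 are accepted; keeping the score lets an archive record how far an item sits from the core of a category.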

Incorporating new categories

How do we know that a given category is not the same as another one identified elsewhere?

Formal properties for concepts give us another means for comparison
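As a sketch of that comparison, two categories described by Yes/No properties can be checked for agreement, conflict, and gaps. The function names `compare` and `may_be_same`, the dict representation (a missing key standing for ‘Undefined’), and the property names are assumptions for illustration.

```python
def compare(a, b):
    """Compare two concepts given as {property: True/False} dicts.

    A property missing from a dict is treated as 'Undefined'.
    Returns three sets: agreeing, conflicting, and unknown properties.
    """
    agree, conflict, unknown = set(), set(), set()
    for prop in set(a) | set(b):
        if prop in a and prop in b:
            (agree if a[prop] == b[prop] else conflict).add(prop)
        else:
            unknown.add(prop)
    return agree, conflict, unknown


def may_be_same(a, b):
    """True if nothing recorded so far rules out a and b being one category."""
    _, conflict, _ = compare(a, b)
    return not conflict


# Placeholder categories from three hypothetical descriptions.
cat_x = {"p1": True, "p2": False}
cat_y = {"p1": True, "p3": True}
cat_z = {"p1": True, "p2": True}

print(may_be_same(cat_x, cat_y))  # True: no conflict, but p2 and p3 are unknown
print(may_be_same(cat_x, cat_z))  # False: the two disagree on p2
```

The `unknown` set is as useful as the verdict, since it lists the properties a fieldworker would need to check before two categories could be merged.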

Incorporating structures

As well as inherently discrete phenomena and inherently indeterminate ones, there is a third kind: concepts that are complex structures
- common in syntax and discourse semantics

How do we model a structure in an ontology?

5. Supporting speakers

Users of EL archives

The largest (and growing) user group for endangered languages materials is the speakers of endangered languages

They are rarely interested in linguistic categories, or in navigating a corpus or archive via them

Supporting language-specific ontologies means supporting information-rich structures for both navigation and analysis

Case Study: Yolngu kinship

The Yolngu languages have an extensive kinship terminology called Gurrutu:
- 27 terms that identify individuals and sets of individuals in terms of moiety, generation, gender, and patriline or matriline

The terms extend infinitely through cyclicity

Case Study: Yolngu kinship

Speakers draw from the same sets of kinship relations to describe their relationship to the Yolngu lands

We cannot always annotate well-known linguistic concepts independently of language-specific ontologies

6. Conclusions

Conclusions

Ontology building for endangered languages can be very different to other ontology projects:
- The uncertain is often more valuable than the certain
- The local is often more interesting than the universal
- … but will still need interoperability

We suggest extending the focus of GOLD to:
- data acquisition
- data access

Conclusions

Current GOLD does not need to be altered to incorporate our suggestions
- except to remove assumptions of invariability

Key extensions:
- formalising the definitions of concepts by representing them as a set of formal properties
- explicitly capturing the conventions and constraints for presentation (rendering)
- modelling features that are inherently indeterminate and/or complex structures

References

Aarts, B. 2004. Modelling linguistic gradience. Studies in Language 28(1): 1–49.
Bateman, J. 1992. The theoretical status of ontologies in natural language processing. In Text Representation and Domain Modelling – ideas from linguistics and AI. Technische Universität Berlin.
Baayen, H. 2003. Probabilistic approaches to morphology. In R. Bod, J. Hay & S. Jannedy (eds), Probabilistic Linguistics. MIT Press.
Bergqvist, H. 2005. Semantics of temporal deictics in Lakandon Maya. Presentation given at the ELAP-ELAR seminar series, SOAS, London.
Bird, S. & G. Simons. 2003. Seven dimensions of portability for language documentation and description. Language 79(3): 557–582.
Christie, M. & W. Gaykamangu. 2003. Kinship, moiety, land & language in Arnhem Land. In Literacy Link, vol. 23, no. 5, Oct 2003. Australian Council for Adult Literacy.
Christie, M., W. Gaykamangu & D. Nathan. 2001. Yolngu Languages and Culture: Gupapuyngu. Faculty of Aboriginal and Torres Strait Islander Studies, NTU [Multimedia CD-ROM].
Crystal, D. 1997. A dictionary of linguistics and phonetics. 4th edition. Cambridge, MA: Blackwell.
Cysouw, M., J. Good, M. Albu & H. J. Bibiko. 2005. Can GOLD “cope” with WALS? Retrofitting an ontology onto the World Atlas of Language Structures. Proceedings of E-MELD 2005.
Farrar, S. & D. T. Langendoen. 2003. A linguistic ontology for the Semantic Web. GLOT International 7(3): 97–100.
Farrar, S. 2003a. Markup and the GOLD ontology. Proceedings of EMELD 2003.
Farrar, S. 2003b. An ontological account of linguistics: extending SUMO with GOLD. Proceedings of the 2003 IEEE International Conference on Natural Language Processing and Knowledge Engineering. Beijing.
Foley, W. A. 2003. Genre, register and language documentation in literate and preliterate communities. In Peter K. Austin (ed.), Language Documentation and Description, vol. 1.
Grinevald, C. 2003. Speakers and documentation of endangered languages. In Peter K. Austin (ed.), Language Documentation and Description, vol. 1.
Gruber, T. R. 1993. A translation approach to portable ontologies. Knowledge Acquisition 5(2): 199–220.
Himmelmann, N. P. 1998. Documentary and descriptive linguistics. Linguistics 36: 161–195. Berlin: de Gruyter.
Holton, G. 2003. Approaches to digitization and annotation: A survey of language documentation materials in the Alaska Native Language Center Archive. Proceedings of EMELD 2003.
Manning, C. 2003. Probabilistic syntax. In R. Bod, J. Hay & S. Jannedy (eds), Probabilistic Linguistics. MIT Press.
Nathan, D. (ed.) 1996. Australia’s Indigenous Languages. Adelaide: SSABSA.
Nathan, D. & P. K. Austin. 2004. Reconceiving metadata: language documentation through thick and thin. In Peter K. Austin (ed.), Language Documentation and Description, vol. 2.
Niles, I. & A. Pease. 2001. Towards a standard upper ontology. Proceedings of the 2nd International Conference on Formal Ontology in Information Systems (FOIS-2001).
Penton, D., C. Bow, S. Bird & B. Hughes. 2004. Towards a General Model for Linguistic Paradigms. Proceedings of EMELD 2004.
