AIMS

Is ISO 639 enough for a multilingual thesaurus? The AGROVOC case

Caterina Caracciolo, Gudrun Johannsen, Lavanya Kiran, Johannes Keizer

Food and Agriculture Organization of the UN

AOS 2012, Sept 4, 2012 - Kuching (MY)




Background

• AGROVOC is published in 21 languages, with others under development

• Multilinguality has always been an issue

• Since the beginning, multilinguality was interpreted as “translation”:

– One hierarchy of terms (one structure), translations in various languages

• This organization remained with the move from a term-centered to a concept-centered resource

9/19/2012 2


AGROVOC as an object-centered resource…

• Being mainly a resource for document indexing in the area of agriculture, it contains a large number of words referring to plants, animals, and food in general


# of concepts below top concepts

[Bar chart: number of concepts (axis from 0 to 25000) below each top concept. Top concepts shown: strategies, site, events, time, factors, processes, technology, stages, state, measures, groups, locations, systems, subjects, resources, objects, features, properties, methods, products, activities, phenomena, entities, substances, organism.]


Differentiating languages

• Salmon (en)

• Salmón (es)

• лососи (ru)


But distribution of languages may be wide…


… and names of food tend to vary…

Palta, Aguacate (two Spanish names for the avocado)


… and names of food tend to vary…

Coime, coimi, cuimi, millmi

Achis, Coyos (Cajamarca), Achita (Ayacucho), Kiwicha (Cusco)

Ataco morado, sangorache, sergorache, hawarcha


Not only food names vary


Requirements for rendering multilinguality in AGROVOC

1. Unambiguously express the geographic area where a given word is used

– Specification of the area of use of a given word should be optional.

2. No limitations on the type of area allowed

– Countries, groups of countries, geographical or administrative regions should be equally available for specification.

9/19/2012 KISAF, Rome 10


AGROVOC as a SKOS resource

• skos:Concept is used to indicate a group of words in various languages that are considered translations of one another

• URIs are kept “abstract” to emphasize independence of the concept from language

– E.g. http://aims.fao.org/aos/agrovoc/c_12332

• The words grouped are then labels of the given concept
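
The grouping described above can be sketched in a few lines of Python. This is only an illustration, not AGROVOC's actual data model: the `Concept` class and the example URI are assumptions, and the three labels are the ones from the "Differentiating languages" slide.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Concept:
    """A language-independent concept identified by an opaque URI."""

    uri: str
    # Maps a language tag to the label used for the concept in that language.
    labels: Dict[str, str] = field(default_factory=dict)

    def label(self, lang: str) -> Optional[str]:
        """Return the label for the given language tag, or None if absent."""
        return self.labels.get(lang)


# Hypothetical concept URI (not a real AGROVOC identifier).
salmon = Concept(
    uri="http://example.org/concept/c_0001",
    labels={"en": "Salmon", "es": "Salmón", "ru": "лососи"},
)

print(salmon.label("es"))
```

Because the URI carries no word from any language, adding or correcting a label never changes the identity of the concept.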


SKOS properties to express terms

• skos:prefLabel, skos:altLabel

– Take plain literals as values

– And an optional language tag, expressed by the XML attribute xml:lang

• skosxl:prefLabel, skosxl:altLabel

– Take entities with URIs, so extra information can be attached to labels
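
The difference between the two families of properties may be easier to see side by side. The sketch below prints Turtle for the same label in both styles. The helper functions and the example.org URIs are assumptions made for illustration, and the prefix declarations for `skos:` and `skosxl:` are omitted for brevity.

```python
def plain_label(concept_uri: str, text: str, lang: str) -> str:
    """Turtle for a plain-literal label: only text and a language tag."""
    return f'<{concept_uri}> skos:prefLabel "{text}"@{lang} .'


def xl_label(concept_uri: str, label_uri: str, text: str, lang: str) -> str:
    """Turtle for a SKOS-XL label: the label is a resource with its own URI."""
    return "\n".join([
        f"<{concept_uri}> skosxl:prefLabel <{label_uri}> .",
        f"<{label_uri}> a skosxl:Label ;",
        f'    skosxl:literalForm "{text}"@{lang} .',
    ])


# Hypothetical URIs, used only for this example.
CONCEPT = "http://example.org/concept/c_0001"
LABEL = "http://example.org/label/xl_es_0001"

print(plain_label(CONCEPT, "Salmón", "es"))
print(xl_label(CONCEPT, LABEL, "Salmón", "es"))
```

In the second form the label has a URI of its own, so further statements about the label itself can use that URI as their subject.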


AGROVOC uses ISO 639 two-letter codes to tag languages in xml:lang

• ISO 639 provides codes for languages independently of

– The country where they are spoken:

• Spanish, Basque (same country, both official languages)

• Dutch, Flemish (different countries, similar enough languages…)

– Their status: French and Breton (same country, Breton has no official status)

• Only one code for English, Spanish…

• Limitations shown by the previous examples


Multilinguality

ISO 639 language codes


Is ISO 639 with three-letter codes an option?

• More languages are included

– More contemporary languages

• Bemba language

– “Old” languages (no longer spoken)

• Old French (ca. 842-1400)

– Groups of languages

• Caucasian languages

– Artificial languages

• Same approach as the two-letter version


Is IETF an option?

• Internet Engineering Task Force (IETF)

• RFC 5646, Tags for Identifying Languages

– Basis is ISO for languages (639)

– Subtags from ISO for countries (3166), ISO for scripts (15924)

• Examples:

– tr-CY = Turkish as used in Cyprus

– zh-Hant-HK = Chinese in traditional Chinese script, as used in Hong Kong
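
To make the subtag structure concrete, here is a minimal Python sketch that splits a tag into language, script and region. It is deliberately simplified: it handles only the `language[-Script][-REGION]` shape of the examples above, whereas the full RFC 5646 grammar also allows variants, extensions and private-use subtags. The names `LanguageTag` and `parse_tag` are made up for this example.

```python
import re
from typing import NamedTuple, Optional


class LanguageTag(NamedTuple):
    language: str
    script: Optional[str]
    region: Optional[str]


# Simplified pattern: a 2-3 letter language subtag, an optional 4-letter
# script subtag, and an optional region subtag (2 letters or 3 digits).
_TAG = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?$"
)


def parse_tag(tag: str) -> LanguageTag:
    """Split a simple language tag into its subtags, in conventional case."""
    match = _TAG.match(tag)
    if match is None:
        raise ValueError(f"unsupported language tag: {tag!r}")
    return LanguageTag(
        language=match["language"].lower(),
        script=match["script"].title() if match["script"] else None,
        region=match["region"].upper() if match["region"] else None,
    )


print(parse_tag("tr-CY"))
print(parse_tag("zh-Hant-HK"))
```

Note that the region subtag comes from a fixed list of country and area codes, which bears on the requirement that regions such as Cusco or Ayacucho be expressible.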


Is a relational approach an option?

• Keep tagging approach to mark the language

– Use ISO 639 or IETF

• And introduce a relational notion of “where a given word is used”

• Link a concept representing a geographic area to the name of the object

– E.g., Kiwicha isNameUsedInRegion Cusco

• Aim at “standard” relations…
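
As a rough illustration of the relational idea, the sketch below stores "name, relation, region" statements as triples and queries them. The predicate name comes from the example above and the names and regions from the earlier slide. Everything else (plain strings standing in for URIs, the helper functions) is an assumption, not a proposed AGROVOC design.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]

IS_NAME_USED_IN_REGION = "isNameUsedInRegion"

# Plain strings stand in for the URIs of labels and of geographic concepts.
TRIPLES: List[Triple] = [
    ("Kiwicha", IS_NAME_USED_IN_REGION, "Cusco"),
    ("Achita", IS_NAME_USED_IN_REGION, "Ayacucho"),
    ("Coyos", IS_NAME_USED_IN_REGION, "Cajamarca"),
]


def names_used_in(region: str, triples: List[Triple]) -> List[str]:
    """Return every name stated to be used in the given region."""
    return [
        name
        for name, predicate, area in triples
        if predicate == IS_NAME_USED_IN_REGION and area == region
    ]


def regions_of(name: str, triples: List[Triple]) -> List[str]:
    """Return every region in which the given name is stated to be used."""
    return [
        area
        for subject, predicate, area in triples
        if predicate == IS_NAME_USED_IN_REGION and subject == name
    ]


print(names_used_in("Cusco", TRIPLES))
print(regions_of("Achita", TRIPLES))
```

A name with no such statement simply has no region attached, which keeps the area of use optional, and the region can be any kind of geographic concept rather than only a country code.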


Conclusions?

• This is work in progress

• We continue working out use cases, especially from Spanish and Portuguese

• Assess alternatives
