Expressing Lexical Complexity in SKOS(XL)
Thomas Bandholtz
5th ECOTERM MEETING at FAO, Rome, Italy
05-06 October 2009
innoQ Deutschland GmbH
D-40880 Ratingen
www.innoq.com [email protected]
Content
Expressing Lexical Complexity in SKOS(XL)
Motivation
Thesaurus Models with regard to lexical complexity
UMTHES extensions of SKOSXL
Examples using RDF Turtle syntax
5/6 October 2009 Ecoterm 2009: Lexical Complexity SKOS(XL)
Motivation
What is “lexical complexity”?
Why should we care?
The case: UMTHES in SKOS
Umweltbundesamt (DE) & innoQ develop iQvoc
What is “lexical complexity”?
Each Concept may be represented by multiple terms
  Preferred / non-preferred term, multilingualism, etc.
Each term may have many lexical representations
  inflection
  abbreviation
  “legal” variants in orthography
  historical versions of “legal” orthography (in German: 1880 - 2006)
  common misspellings
  regional variants in the same language
Each term may be a compound term
  a compound term may contain term delimiters (spaces or hyphens)
  the components may appear dispersed within a sentence
  the components may designate different concepts by themselves
(a side note about orthography)
“Before compulsory education was established, it was something to be able to write.”
tb: just like Cervantes, Dante, Goethe, Shakespeare, Whitman, etc.
“Since then, you have to be a proper speller.”
(Peter Bichsel, Der Leser. Das Erzählen. Frankfurter Poetik-Vorlesungen. 1982)
Why should we care?
Traditional (nice-to-have):
  Alphabetic lists of subject indices show some lexical variants.
Contemporary (prerequisite):
  automatic (machine-made) detection of Concepts covered by a natural language document (“Named Entity Recognition”)
  must capture a covered Concept as concisely as possible
  considering all possible lexical appearances, including term composition
Language-dependent:
  English is comparatively simple in this regard.
  German is awful!
  (add your language here)
The case: UMTHES in SKOS
The German Environmental Thesaurus UMTHES
~ 12,000 preferred + 25,000 non-preferred terms + 11,000 'multiple-composition' (spelling) forms
needs to be serialized in SKOS for migration into the iQvoc vocabulary management tool
includes sophisticated knowledge about lexical complexity
we don’t want to lose this when moving to SKOS(XL)
UBA(de) & innoQ develop …
iQvoc - Open Source Vocabulary Management Tool
Totally Web-based, supports distributed editorial teams
Safe and comfortable, schema driven editing features
Simple but powerful workflow implementation
Conformance
W3C “Cool URI” design and deployment
W3C SKOS Recommendation
Availability
GNU General Public License (GPL)
iQvoc version 1 demo (GEMET) at: http://apps.innoq.com/iqvoc/about.html
iQvoc 2 availability planned for Q1 2010
Thesaurus models with regard to lexical complexity
– Traditional - ISO 2788:1986
– ISO Model revised (Draft 2008-11-18)
– SKOS W3C Recommendation 2009-08-18
Traditional - ISO 2788:1986
“Guidelines for the establishment and development of monolingual thesauri”
indexing language: “A controlled set of terms selected from natural language and used to represent, in summary form, the subjects of documents.”
thesaurus: “The vocabulary of a controlled indexing language, formally organized …”
preferred term: “A term used consistently when indexing to represent a given concept … sometimes known as descriptor.”
non-preferred term: “The synonym or quasi-synonym of a preferred term. A non-preferred term is not assigned to documents but is provided as an entry point … sometimes known as a non-descriptor.”
ISO 2788:1986 Model (1)
(hierarchical and associative relations between preferred terms are not in focus here)
term equivalence
see next slide
ISO 2788:1986 Model (2)
compound term: “An indexing term which can be factored morphologically into separate components, each of which could be expressed, or re-expressed, as a noun that is capable of serving independently as an indexing term.
a) the focus or head, i.e. the noun component which identifies the general class of concepts to which the term as a whole refers. Examples: ‘printed indexes’, ‘hospitals for children’.
b) The difference or modifier, i.e. one or more further components which serve to narrow the extension of the focus and so specify one of its subclasses. Examples: ‘printed indexes’, ‘hospitals for children’.
The focus and its difference(s) may be written as separate words, as in ‘dining rooms’ and ‘soup spoons’, or they may be concatenated into single words, as in ‘bedrooms’ and ‘teaspoons’”.
ISO Model revised (Draft 2008-11-18)
Leonard Will 2009-02-13 in the public SKOS mailing list:
“I write as Chair of the ‘Data Modeling, Exchange Formats and Protocols’ subgroup of the ISO working group SC9WG8/Project 25964, currently revising the ISO standard for thesauri for information retrieval, but as these standards are still in draft form anything I say here is my own interpretation of the way we are going, and is not authoritative”. …
“The ISO model is firmly based on relationships between concepts, not terms. Terms are used as labels for concepts, as in SKOS”.
http://lists.w3.org/Archives/Public/public-esw-thes/2009Feb/0033.html
(see diagram on next slide)
ISO Model revised (Draft 2008-11-18)
W3C SKOS Recommendation
Simple Knowledge Organization System
“SKOS is an area of work developing specifications and standards to support the use of knowledge organization systems (KOS) such as thesauri, classification schemes, subject heading systems and taxonomies within the framework of the Semantic Web …”
Started in 2004: http://www.w3.org/2004/02/skos/
2009-08-18: W3C Recommendation status
SKOS Reference: http://www.w3.org/TR/2009/REC-skos-reference-20090818/
SKOS Primer: http://www.w3.org/TR/2009/NOTE-skos-primer-20090818/
SKOS Use Cases and Requirements: http://www.w3.org/TR/2009/NOTE-skos-ucr-20090818/
SKOS Model
(diagram annotations)
about Concepts, not terms
“anything” can have these: labels (~ terms) and notes
includes the relations known from ISO “preferred term”: hierarchical and associative, but not equivalence
Collection ~ ISO node label
ISO 2788:1986 mapped to SKOS
ISO 2788:1986 ~ SKOS (without XL)
document ~ out of scope
indexing language ~ n/a (may be described as the set of all values assigned to prefLabel or altLabel properties of Concept instances in a ConceptScheme)
thesaurus ~ ConceptScheme (any kind of “controlled structured vocabulary”)
concept (mentioned but not defined) ~ Concept: “An idea or notion; a unit of thought.”
indexing term ~ n/a; indexing should use Concept references
• preferred term ~ value of prefLabel assigned to a Concept instance
• non-preferred term ~ value of altLabel assigned to a Concept instance
• compound term ~ n/a
node label ~ Collection
term hierarchy ~ broader/narrower, not between terms but between Concept instances
term association ~ related, not between terms but between Concept instances
term equivalence ~ n/a (may be seen between values assigned to prefLabel / altLabel of the same Concept instance)
scope note, definition ~ note (changeNote, definition, editorialNote, example, scopeNote, …)
What is added by SKOSXL?
skosxl:Label is a class, not a literal
skosxl:Label has (exactly one) literalForm
skosxl:Label can have a labelRelation to another Label
What you don’t see in the diagram:
skos:prefLabel etc. are extended by a “property chain” (seen from an rdfs:Resource): the skosxl:literalForm of a skosxl:Label assigned via skosxl:prefLabel is also a value of skos:prefLabel for that resource.
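The property chain can be made concrete with a small Turtle sketch. The axioms below use OWL 2's owl:propertyChainAxiom as one possible way to write the chain down; this is an illustration only, since a chain that ends in a literal-valued property falls outside OWL 2 DL and is better read as a rule. The local namespace URI is hypothetical.

```turtle
@prefix owl: <http://www.w3.org/2002/07/owl#>.
@prefix skos: <http://www.w3.org/2004/02/skos/core#>.
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#>.
@prefix : <http://example.org/umthes/>.  # hypothetical local namespace

# chain: resource --skosxl:prefLabel--> Label --skosxl:literalForm--> literal
# implies: resource --skos:prefLabel--> literal
skos:prefLabel   owl:propertyChainAxiom (skosxl:prefLabel   skosxl:literalForm).
skos:altLabel    owl:propertyChainAxiom (skosxl:altLabel    skosxl:literalForm).
skos:hiddenLabel owl:propertyChainAxiom (skosxl:hiddenLabel skosxl:literalForm).

# given these two statements ...
:4711 skosxl:prefLabel :waste.
:waste skosxl:literalForm "waste".

# ... a reasoner applying the chain would add:
# :4711 skos:prefLabel "waste".
```

The chain only works in one direction: a plain skos:prefLabel literal does not by itself create a skosxl:Label resource.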
Extensions of SKOSXL by UMTHES
properties of skosxl:Label complementing skosxl:literalForm
  baseForm - inflectional “root” of the term (add suffixes to this)
  inflectionalCode - encoding of a regular inflectional pattern
  lexicalVariant - any lexical variant that may appear in a written document
    inflectional - derived by inflection
    acronym - any kind of abbreviation
    cultural - any (sub)cultural variation
    misspelled - common spelling errors
subProperties of skosxl:labelRelation
  homograph - homograph part of a qualified name
  hasQualifier - qualifier part of a qualified name
  lexicalExtension - may point to historical orthography, or verb form, etc.
  compoundFrom - composition (value is an rdf:List)
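The slide lists the extension properties but not their schema. A rough sketch of how they might be declared, using the ext: namespace from the examples: the property types, the domains, and the modelling of inflectional, acronym, cultural and misspelled as sub-properties of lexicalVariant are assumptions made for illustration, not taken from the UMTHES schema.

```turtle
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix owl: <http://www.w3.org/2002/07/owl#>.
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#>.
@prefix ext: <http://www.uba.de/2009/08/UmThesScheme#>.

# literal-valued properties of a skosxl:Label (types and domains are assumptions)
ext:baseForm         rdf:type owl:DatatypeProperty; rdfs:domain skosxl:Label.
ext:inflectionalCode rdf:type owl:DatatypeProperty; rdfs:domain skosxl:Label.
ext:lexicalVariant   rdf:type owl:DatatypeProperty; rdfs:domain skosxl:Label.

# kinds of lexical variant, modelled here as sub-properties (assumption)
ext:inflectional rdfs:subPropertyOf ext:lexicalVariant.
ext:acronym      rdfs:subPropertyOf ext:lexicalVariant.
ext:cultural     rdfs:subPropertyOf ext:lexicalVariant.
ext:misspelled   rdfs:subPropertyOf ext:lexicalVariant.

# Label-to-Label relations, as listed on the slide
ext:homograph        rdfs:subPropertyOf skosxl:labelRelation.
ext:hasQualifier     rdfs:subPropertyOf skosxl:labelRelation.
ext:lexicalExtension rdfs:subPropertyOf skosxl:labelRelation.

# compoundFrom points to an rdf:List of Labels rather than to a single Label,
# so this sketch only fixes its domain and range
ext:compoundFrom rdf:type owl:ObjectProperty;
    rdfs:domain skosxl:Label;
    rdfs:range rdf:List.
```

With sub-properties, a query for ext:lexicalVariant under RDFS inference would also return inflected forms, acronyms and misspellings.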
Examples using SKOS(XL) (mostly stripped down to a topic)
Switching to Turtle Syntax
Terse RDF Triple Language
W3C Team Submission 14 January 2008
http://www.w3.org/TeamSubmission/turtle/ by TBL
Used in W3C SKOS Recommendation as well as in OWL 2 Draft
Everything can be expressed in XML as well.
Turtle syntax is easier for humans to read.
see for yourself …
UMTHES in SKOS(XL) examples
Namespace prefixes used in the following:
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix owl: <http://www.w3.org/2002/07/owl#>.
@prefix skos: <http://www.w3.org/2004/02/skos/core#>.
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#>.
@prefix ext: <http://www.uba.de/2009/08/UmThesScheme#>.
# no prefix means: defined in the local namespace
waste & garbage
# SKOS only
:4711 rdf:type skos:Concept;
    skos:prefLabel "waste";
    skos:altLabel "garbage".
# exactly the same in SKOSXL
:4711 rdf:type skos:Concept;
    skosxl:prefLabel :waste;
    skosxl:altLabel :garbage.
:waste rdf:type skosxl:Label;
    skosxl:literalForm "waste".
:garbage rdf:type skosxl:Label;
    skosxl:literalForm "garbage".
# this looks like saying the same stuff in a more complicated way
# but wait ...
NOTE: Local instance identifiers (:4711, :waste, :garbage, etc.) in these examples follow a local naming convention that is meant for human readers only.
“4711” used to be the brand name of a Cologne-based perfume manufacturer (“Eau de Cologne”). It emerged as a generic ID symbol in informatics in the 1980s/90s. So, :4711 stands for “any kind of unique, but by itself meaningless ID”.
The only functional requirements for IDs in this place are:
• being unique within the assigned namespace;
• being part of a working http URI.
“waste water” composition
:4711 rdf:type skos:Concept;
    skosxl:prefLabel :wasteWater.
:wasteWater rdf:type skosxl:Label;
    skosxl:literalForm "waste water";
    ext:lexicalVariant "wastewater";
    ext:compoundFrom (:waste :water).
# already defined in the previous slide, could skip it here:
:waste rdf:type skosxl:Label;
    skosxl:literalForm "waste".
# only the noun, "wasted water" is NOT "waste water"!
:water rdf:type skosxl:Label;
    skosxl:literalForm "water";
    ext:inflectional "waters".
Multiple Composition in German
# @en: technique of facilities for the recycling of waste water
:4711 rdf:type skos:Concept;
    skosxl:prefLabel :abwasserAufbereitungsAnlagenTechnik.
:abwasserAufbereitungsAnlagenTechnik rdf:type skosxl:Label;
    skosxl:literalForm "Abwasseraufbereitungsanlagentechnik";
    ext:compoundFrom (:abwasser :aufbereitung :anlage :technik);
    ext:compoundFrom (:abwasserAufbereitung :anlage :technik);
    ext:compoundFrom (:abwasserAufbereitungsAnlage :technik);
    ext:compoundFrom (:abwasser :aufbereitungsAnlage :technik);
    ext:compoundFrom (:abwasserAufbereitung :anlagenTechnik);
    ext:compoundFrom (:abwasser :aufbereitung :anlagenTechnik);
    ext:compoundFrom (:abwasser :aufbereitungsAnlagenTechnik).
# maybe I missed some composition variant?
Not joking!
Lexical extension example in German
# in English: "cleaning"
:reinigung rdf:type skosxl:Label;
    skosxl:literalForm "Reinigung"@de;
    ext:lexicalExtension :reinigen .
# extended by the verb form, English "to clean". Caution: see "wasted water"
:reinigen rdf:type skosxl:Label;
    skosxl:literalForm "reinigen"@de;
    ext:baseForm "reinig";
    ext:inflectionalCode "007";
    ext:inflectional "reinige";
    ext:inflectional "reinigen";
    ext:inflectional "reinigte";
    ext:inflectional "gereinigt";
    ext:inflectional "gereinigte";
    ext:inflectional "gereinigter";
    ext:inflectional "gereinigtes";
    ext:inflectional "reinigend";
    ext:inflectional "reinigende";
    ext:inflectional "reinigender";
    ext:inflectional "reinigendes". # to be continued …
Homograph & qualifier
:4711 rdf:type skos:Concept;
    skosxl:prefLabel :bass--fish. # [ˈbas]
:4712 rdf:type skos:Concept;
    skosxl:prefLabel :bass--music . # [ˈbās]
:bass rdf:type skosxl:Label;
    skosxl:literalForm "bass".
:fish rdf:type skosxl:Label;
    skosxl:literalForm "fish".
:bass--fish rdf:type skosxl:Label;
    skosxl:literalForm "bass (fish)";
    ext:homograph :bass;
    ext:hasQualifier :fish.
# add Labels :music and :bass--music using the same pattern
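Following the same pattern, the two remaining Labels could look like this (a sketch; the local namespace URI is hypothetical, the rest mirrors the :bass--fish example):

```turtle
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#>.
@prefix ext: <http://www.uba.de/2009/08/UmThesScheme#>.
@prefix : <http://example.org/umthes/>.  # hypothetical local namespace

# the qualifier is a Label of its own
:music rdf:type skosxl:Label;
    skosxl:literalForm "music".

# the qualified name points back to the shared homograph :bass
:bass--music rdf:type skosxl:Label;
    skosxl:literalForm "bass (music)";
    ext:homograph :bass;
    ext:hasQualifier :music.
```

Both qualified Labels share the single :bass Label, so a text match on "bass" can be traced to both candidate Concepts.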
Multilingualism (symmetric)
# symmetric (in SKOS, can be expressed in SKOSXL likewise)
:4711 rdf:type skos:Concept;
    skos:prefLabel "organisation"@en;
    skos:prefLabel "organization"@en-US;
    # add your language here ... (GEMET has more than 20)
    skos:prefLabel "Organisation"@de.
SKOS integrity condition S14: “A resource has no more than one value of skos:prefLabel per language tag.”
NOTE: this does not mean it must have prefLabel values in multiple languages
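A small sketch of what S14 allows and what it rules out. The concept :4713 and the local namespace URI are hypothetical and serve only as a counter-example:

```turtle
@prefix skos: <http://www.w3.org/2004/02/skos/core#>.
@prefix : <http://example.org/umthes/>.  # hypothetical local namespace

# consistent with S14: "en" and "en-US" are different language tags
:4711 skos:prefLabel "organisation"@en;
    skos:prefLabel "organization"@en-US.

# NOT consistent with S14: two prefLabel values carry the same tag "en"
:4713 skos:prefLabel "organisation"@en;
    skos:prefLabel "organization"@en.
```

In the second case one of the two spellings would have to become a skos:altLabel, or receive a more specific language tag.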
Multilingualism (language-centric)
# UMTHES is German-centric with altLabel values also in English
:4711 rdf:type skos:Concept;
    skos:prefLabel "Organisation"@de;
    skos:altLabel "organisation"@en;
    skos:altLabel "organization"@en-US.
# or use skosxl: in the above to refer to:
:Organisation rdf:type skosxl:Label;
    skosxl:literalForm "Organisation"@de;
    ext:inflectional "Organisationen";
    ext:inflectional "Organisations-".
:organisation rdf:type skosxl:Label;
    skosxl:literalForm "organisation"@en;
    ext:inflectional "organisations".
:organization rdf:type skosxl:Label;
    skosxl:literalForm "organization"@en-US;
    ext:inflectional "organizations".
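Written out with skosxl: properties, the Concept from the top of this slide would refer to the three Labels as follows (a sketch; the local namespace URI is hypothetical):

```turtle
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix skos: <http://www.w3.org/2004/02/skos/core#>.
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#>.
@prefix : <http://example.org/umthes/>.  # hypothetical local namespace

# German Label is preferred, both English spellings are alternatives
:4711 rdf:type skos:Concept;
    skosxl:prefLabel :Organisation;
    skosxl:altLabel :organisation;
    skosxl:altLabel :organization.
```

The language tags now live on the skosxl:literalForm of each Label, together with its inflectional variants.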
Multilingualism (asymmetric)
# full asymmetric pattern (currently not used by UMTHES)
:4711 rdf:type skos:Concept;
    skosxl:prefLabel :Organisation;
    ext:hasTranslation :4712.
:4712 rdf:type skos:Concept;
    skosxl:prefLabel :organisation;
    ext:hasTranslation :4711.
# :Organisation & :organisation already known from previous slide
About Federation
UMTHES has been one of the 8 sources of GEMET
UMTHES extends GEMET with more detailed German Concepts and their lexical complexity.
@prefix gemet: <http://www.eionet.europa.eu/gemet/concept/>.
# GEMET URIs do resolve in SKOS since 2009-09 !!!
:14452 rdf:type skos:Concept;
    skosxl:prefLabel :klimaAenderung;
    skosxl:altLabel :klimaWandel;
    skosxl:altLabel :climateChange;
    # referencing GEMET "climatic change" from here
    skos:closeMatch gemet:1471.
:klimaAenderung rdf:type skosxl:Label;
    ext:compoundFrom (:klima :aenderung).
# ... etc, as exemplified before
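The elided Labels might be spelled out as below. This is a sketch only: the literal forms are inferred from the identifiers and are assumptions, not UMTHES data, and the local namespace URI is hypothetical.

```turtle
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#>.
@prefix ext: <http://www.uba.de/2009/08/UmThesScheme#>.
@prefix : <http://example.org/umthes/>.  # hypothetical local namespace

# literal forms below are assumed from the identifiers
:klimaAenderung skosxl:literalForm "Klimaänderung"@de.

:klimaWandel rdf:type skosxl:Label;
    skosxl:literalForm "Klimawandel"@de;
    ext:compoundFrom (:klima :wandel).

:climateChange rdf:type skosxl:Label;
    skosxl:literalForm "climate change"@en;
    ext:compoundFrom (:climate :change).

# a shared component Label, reused by both German compounds
:klima rdf:type skosxl:Label;
    skosxl:literalForm "Klima"@de.
```

The GEMET concept itself is not copied; it stays referenced through skos:closeMatch only.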
preferred, non-preferred term again
# you may define such classes in SKOS (OWL) at any time
# but they will never be exactly equivalent to ISO 2788 (why?)
:isPrefLabelOf owl:inverseOf skosxl:prefLabel.
:isAltLabelOf owl:inverseOf skosxl:altLabel.
# a Label that is the prefLabel of at least one Concept
:PreferredTerm owl:equivalentClass [
    rdf:type owl:Restriction ;
    owl:onProperty :isPrefLabelOf ;
    owl:someValuesFrom skos:Concept ].
# a Label that is an altLabel of some Concept and not a PreferredTerm
:NonPreferredTerm owl:equivalentClass [
    rdf:type owl:Class ;
    owl:intersectionOf (
        [ rdf:type owl:Class ;
          owl:complementOf :PreferredTerm ]
        [ rdf:type owl:Restriction ;
          owl:onProperty :isAltLabelOf ;
          owl:someValuesFrom skos:Concept ]
    ) ].
Finally …
# you may express anything in RDF / Turtle …
@prefix foaf: <http://xmlns.com/foaf/0.1/>.
:ecoTerm2009 rdf:type :meeting;
    :hasOnAgenda :theseSlides.
:theseSlides rdf:type :presentation;
    skos:prefLabel "Expressing Lexical Complexity in SKOS(XL)";
    :hasPresenter :tb.
:tb rdf:type foaf:Person;
    foaf:mbox <mailto:[email protected]>;
    foaf:isPrimaryTopicOf <http://www.bandholtz.eu/foaf.rdf>;
    foaf:workplaceHomepage <http://www.innoq.com>;
    foaf:currentProject <http://apps.innoq.com/iqvoc/about.html>;
    # add your assertions here ...
    :says "Good Buy!".