Standards for language resources: the ISO/TC 37(/SC 4) perspective
Post on 05-Jan-2016
Laurent Romary
Research Director, INRIA
ISO/TC 37/SC 4 chair
Context
ISO TC37 - Terminology and other language resources
SC3 - Computer applications in terminology
  ISO 12200 - MARTIF
  ISO 12620 - Data categories (under revision)
  ISO 16642 - TMF (Terminological Markup Framework)
SC4 - Language Resource Management
www.tc37sc4.org
An example scenario: information extraction
Figure labels: Primary Data, Part-of-speech (POS) tagging, Chunk parsing, Syntactic structures, Content analysis, Semantic content
Horizontal view (W3C perspective)
XML
SOAP
RDF
OWL
Vertical view (ISO/TC 37/SC 4 perspective)
Evaluation
Lexica
Linguistic models and descriptors (Data Categories)
Linguistic information sources … and initiatives
Primary resources (text, dialogues)
  Structural mark-up
  Basic annotations
  [TEI, MPEG7, TMX, XLIFF, XHTML, etc.]
NLP structures (annotations)
  POS tagging
  Chunks (cf. Named Entities)
  Deep syntactic structures
  Co-references etc.
  [Eagles/ISLE, CES, MATE, …]
Knowledge structures
  Hierarchies of types
  Relations between concepts (subjects/topics etc.)
  Links to primary resources
  [Topic Maps, OIL, RDF]
Lexical structures (language models)
  Terminologies
  Transfer lexica
  LTAG/HPSG/LFG lexica
  [TBX, OLIF, Eagles/ISLE (Genelex)]
Links
Meta-data [Dublin Core, OLAC, ISLE, MPEG7, RDF]
Access protocols [Corba, SOAP]
SC4 Approach
Efforts geared towards defining abstract models and general frameworks for the creation and representation of language resources
In principle, abstract enough to accommodate diverse linguistic, theoretical or practical approaches
No provision of new formats
Situate development squarely in the framework of XML and related standards
Ensure compatibility with established and widely accepted web-based technologies
Ensure feasibility of transduction from legacy formats into newly defined formats
SC4 and other standardizing bodies
W3C - basic protocols and formats: XML (Schemas), XPath, XPointer + RDF, SVG, SMIL, SOAP
MPEG - multimedia, XML based, e.g. MPEG7-4, word and phone lattices
ISO TC37/SC4 - language resources, NLP perspective, e.g. linguistic annotations, lexical formats
TEI - text representation, reference for primary sources, e.g. text archives
Text
Audio/Speech
Technical background
Oscar
Contributing organizations
ISO/TC 37/SC 4 structure
WG1 - Basic descriptors and mechanisms for language resources
WG2 - Representation schemes
WG3 - Multilingual text representation
WG4 - Lexical databases
WG5 - Workflow of language resource management
Data categories
On-going activities
Feature structure representation (in collaboration with the TEI - Text Encoding Initiative)
  ISO DIS 24610
Morpho-syntactic annotation
  ISO NP 24611
Lexical markup framework
  ISO NP 24612 (+ ISO NP 12620-3)
Task force on meta-data for language resources (OLAC + IMDI)
ACL/Sigsem working group on multimodal content representation
Data category registry for ISO/TC 37
  ISO CD 12620-1 on ballot (deadline Jan. 2004)
Modeling linguistic annotation structures
General framework - 1
Model for linguistic annotation that can be instantiated in a standard representational format
GMT: Generic Mapping Tool
  Serves as a pivot format into and out of which proprietary formats may be transduced
  Enables comparison, merging and manipulation via common tools
Reference: ISO 16642 - Terminological Markup Framework
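The pivot idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two formats ("kv" and "tsv") and the dict used as pivot are hypothetical stand-ins, not GMT itself. The point is that each format needs one reader and one writer against the pivot, rather than a converter for every pair of formats.

```python
from typing import Callable, Dict

# Pivot representation: a plain dict (an assumption for illustration, not GMT).
Pivot = Dict[str, str]

def read_kv(text: str) -> Pivot:
    """Hypothetical format 'kv': one key=value pair per line."""
    return dict(line.split("=", 1) for line in text.splitlines() if line)

def write_kv(pivot: Pivot) -> str:
    return "\n".join(f"{key}={value}" for key, value in pivot.items())

def read_tsv(text: str) -> Pivot:
    """Hypothetical format 'tsv': key<TAB>value on each line."""
    return dict(line.split("\t", 1) for line in text.splitlines() if line)

def write_tsv(pivot: Pivot) -> str:
    return "\n".join(f"{key}\t{value}" for key, value in pivot.items())

# One reader and one writer per format, all targeting the pivot.
READERS: Dict[str, Callable[[str], Pivot]] = {"kv": read_kv, "tsv": read_tsv}
WRITERS: Dict[str, Callable[[Pivot], str]] = {"kv": write_kv, "tsv": write_tsv}

def convert(text: str, source: str, target: str) -> str:
    """Transduce source format -> pivot -> target format."""
    return WRITERS[target](READERS[source](text))
```

Adding a further format then means registering one more reader/writer pair; existing converters are untouched.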
General framework - 2
A meta-model
  A general, underlying model that informs current practice
A set of data categories
  Provides the precise semantics of the format
  Obtained:
    By sub-setting a Data Category Registry
    By providing application-specific categories
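The two ways of obtaining a set of data categories can be sketched as follows. The registry content, the function name select_categories and the extra category "projectCode" are hypothetical, chosen only to make the example runnable.

```python
from typing import Dict, Iterable, Optional

# Toy stand-in for a Data Category Registry (identifier -> display name).
REGISTRY: Dict[str, str] = {
    "partOfSpeech": "Part of speech",
    "grammaticalGender": "Grammatical gender",
    "grammaticalNumber": "Grammatical number",
    "definition": "Definition",
}

def select_categories(
    registry: Dict[str, str],
    wanted: Iterable[str],
    extra: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Build a selection by sub-setting the registry, then add
    application-specific categories that the registry does not hold."""
    wanted = list(wanted)
    missing = [key for key in wanted if key not in registry]
    if missing:
        raise KeyError(f"not in registry: {missing}")
    selection = {key: registry[key] for key in wanted}
    selection.update(extra or {})
    return selection

# Usage: two registry categories plus one application-specific category.
selection = select_categories(
    REGISTRY,
    ["partOfSpeech", "grammaticalGender"],
    extra={"projectCode": "Project code (application specific)"},
)
```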
ISO 16642: A family of formats
TMF
TML1 TML2 TML3 TMLi…
(TBX) (Geneter)
GMT
Meta-model
Terminological Data Collection (TDC)
  Global Information (GI)
  Complementary Information (CI)
  Terminological Entry (TE) *
    Language Section (LS) *
      Term Section (TS) *
        Term Component Section (TCS) *
(* = may occur more than once)
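As a rough sketch, the nesting of the meta-model levels can be written as Python dataclasses. The class and field names are illustrative assumptions, and each level's data categories are simplified to a flat name-to-value dict; the sample values come from the TMF example in this deck.

```python
from dataclasses import dataclass, field
from typing import Dict, List

Info = Dict[str, str]  # data category name -> value (simplification)

@dataclass
class TermComponentSection:  # TCS
    info: Info = field(default_factory=dict)

@dataclass
class TermSection:  # TS, holds any number of TCS
    info: Info = field(default_factory=dict)
    components: List[TermComponentSection] = field(default_factory=list)

@dataclass
class LanguageSection:  # LS, holds any number of TS
    lang: str
    terms: List[TermSection] = field(default_factory=list)

@dataclass
class TerminologicalEntry:  # TE, holds any number of LS
    info: Info = field(default_factory=dict)
    languages: List[LanguageSection] = field(default_factory=list)

@dataclass
class TerminologicalDataCollection:  # TDC: GI, CI and any number of TE
    global_info: Info = field(default_factory=dict)
    complementary_info: Info = field(default_factory=dict)
    entries: List[TerminologicalEntry] = field(default_factory=list)

# Usage: the English part of entry ID67.
english = LanguageSection(
    "en",
    [TermSection({"term": "alpha smoothing factor", "termType": "fullForm"})],
)
entry = TerminologicalEntry(
    info={"id": "ID67", "subjectField": "manufacturing"},
    languages=[english],
)
tdc = TerminologicalDataCollection(entries=[entry])
```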
TMF: example
TE: id='ID67', subjectField='manufacturing', definition='A value…'
  LS: lang='en'
    TS: term='alpha smoothing factor', termType='fullForm'
  LS: lang='hu'
    TS: term='…'
Implementation in TBX (cf. www.lisa.org)
<termEntry id='ID67'>
  <descrip type='subjectField'>manufacturing</descrip>
  <descrip type='definition'>A value between 0 and 1 used in ...</descrip>
  <langSet lang='en'>
    <tig>
      <term>alpha smoothing factor</term>
      <termNote type='termType'>fullForm</termNote>
    </tig>
  </langSet>
  <langSet lang='hu'>
    <tig>
      <term>Alfa ...</term>
    </tig>
  </langSet>
</termEntry>
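To make the correspondence between the TBX elements and the meta-model levels concrete, here is a minimal sketch that reads the entry above with Python's standard xml.etree.ElementTree module. The function name entry_to_levels and the shape of the returned dict are assumptions for illustration; the element and attribute names are those of the slide, and the elided texts ("...") are kept as they appear there.

```python
import xml.etree.ElementTree as ET

TBX_ENTRY = """\
<termEntry id='ID67'>
  <descrip type='subjectField'>manufacturing</descrip>
  <descrip type='definition'>A value between 0 and 1 used in ...</descrip>
  <langSet lang='en'>
    <tig>
      <term>alpha smoothing factor</term>
      <termNote type='termType'>fullForm</termNote>
    </tig>
  </langSet>
  <langSet lang='hu'>
    <tig>
      <term>Alfa ...</term>
    </tig>
  </langSet>
</termEntry>
"""

def entry_to_levels(xml_text: str) -> dict:
    """Map termEntry -> TE, langSet -> LS, tig -> TS."""
    entry = ET.fromstring(xml_text)
    te = {"id": entry.get("id")}
    for descrip in entry.findall("descrip"):
        te[descrip.get("type")] = descrip.text  # entry-level data categories
    sections = []
    for lang_set in entry.findall("langSet"):
        terms = []
        for tig in lang_set.findall("tig"):
            ts = {"term": tig.findtext("term")}
            for note in tig.findall("termNote"):
                ts[note.get("type")] = note.text  # term-level data categories
            terms.append(ts)
        sections.append({"lang": lang_set.get("lang"), "terms": terms})
    return {"TE": te, "LS": sections}

levels = entry_to_levels(TBX_ENTRY)
```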
Implementing a Data Category Registry for ISO TC37
Data Category
Definition: elementary descriptor used in a linguistic description or annotation scheme
Example: /Part of speech/, /Grammatical gender/, /Grammatical number/, /Feminine/, /Plural/, /Ablative/
Background:
  Experience gained from ISO 16642 in linguistic format specification
  Wider notion of data categories as meta-data for tagged language resources
Multiple uses of data categories
Data category selection
Meta model
Documentation
Meta-data
XML schemas
XSL filters
Application domains
Terminological data collection (TC 37/SC 3)
  Cf. "old" ISO 12620 set of data categories for terminology
Language codes (TC 37/SC 2)
  Cf. evolution from ISO 639-1 and ISO 639-2 to ISO 639-4
On-going and future SC4 activities (TC 37/SC 4)
  Meta-data for language resources
  Morpho-syntax/syntax, discourse-level annotation
  NLP lexica, MT lexica
  Multilingual data representations (e.g. translation memories) and access (query languages)
Technical background
ISO 11179 (ISO JTC 1/SC 32): meta-data registry view
  Provides mechanisms for the management of data categories
ISO 16642 (ISO TC 37/SC 3): terminology view
  Provides ways of dealing with multilingual issues
OWL (W3C Sem. Web activity): ontology view
  Provides a framework for dealing with hierarchies and expressing constraints on data categories
  E.g. a /noun/ can be described by means of /gender/ and /number/ in French
Relation to ISO 11179
Data element concept: complex datcat, e.g. /gender/
Conceptual domain: set of simple datcats, e.g. /masculine/, /feminine/, /neuter/
Data element: XML object, e.g. implemented as an XML attribute named 'gen' (XML schema declaration)
Value domain: list of values, e.g. m, f, n
<w lemme="vert" gen="f">verte</w>
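The link between the conceptual side (/gender/ and its simple data categories) and the representation side (the 'gen' attribute and its value list) can be sketched as a small lookup. The GENDER dict layout and the function gender_of are hypothetical; only the attribute name, the values m, f, n and the sample element come from the slide.

```python
import xml.etree.ElementTree as ET

# Complex data category /gender/ with its XML realisation (illustrative layout).
GENDER = {
    "datcat": "/gender/",
    "xml_attribute": "gen",  # data element: the XML object
    "values": {              # value domain -> conceptual domain
        "m": "/masculine/",
        "f": "/feminine/",
        "n": "/neuter/",
    },
}

def gender_of(xml_text: str, spec: dict = GENDER) -> str:
    """Return the simple data category encoded by the element's attribute."""
    element = ET.fromstring(xml_text)
    code = element.get(spec["xml_attribute"])
    if code not in spec["values"]:
        raise ValueError(f"{code!r} is not in the value domain of {spec['datcat']}")
    return spec["values"][code]

result = gender_of('<w lemme="vert" gen="f">verte</w>')
```

A missing or unknown attribute value raises ValueError, which is one simple way of enforcing the value domain without an XML schema.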
The ISO 12620-1 proposal
Entry
  Identifier: gender
  Profile: morpho-syntax
  Definition (fr): Catégorie grammaticale reposant, selon les langues et les systèmes, sur la distinction naturelle entre les sexes ou sur des critères formels (Source: TLFi)
  Definition (en): Grammatical category… (Source: TLFi (Trad.))
  Conceptual Domain: {/feminine/, /masculine/, /neuter/}
Object Language: fr
  Name: genre
  Conceptual Domain: {/feminine/, /masculine/}
Object Language: en
  Name: gender
Object Language: de
  Name: Geschlecht
  Conceptual Domain: {/feminine/, /masculine/, /neuter/}
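One way to picture such an entry is as a record whose language sections may narrow the general conceptual domain. The sketch below is an assumption-laden illustration: the dict layout and the helper names are hypothetical, and falling back to the general domain when a language section lists none (as for 'en' here) is a choice made for this example, not something the proposal states.

```python
GENDER_ENTRY = {
    "identifier": "gender",
    "profile": "morpho-syntax",
    "conceptual_domain": {"/feminine/", "/masculine/", "/neuter/"},
    "object_languages": {
        "fr": {"name": "genre",
               "conceptual_domain": {"/feminine/", "/masculine/"}},
        "en": {"name": "gender"},
        "de": {"name": "Geschlecht",
               "conceptual_domain": {"/feminine/", "/masculine/", "/neuter/"}},
    },
}

def domain_for(entry: dict, lang: str) -> set:
    """Conceptual domain for one object language.

    Assumption: a language without its own domain inherits the general one.
    """
    section = entry["object_languages"].get(lang, {})
    return section.get("conceptual_domain", entry["conceptual_domain"])

def is_consistent(entry: dict) -> bool:
    """Every language-specific domain must be a subset of the general one."""
    return all(
        domain_for(entry, lang) <= entry["conceptual_domain"]
        for lang in entry["object_languages"]
    )
```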
Perspectives
ISO/TC 37/SC 4 in a wider picture
Basic building blocks to bring coherence in the representation of linguistic information in a variety of application domains
  E.g. e-documentation, e-learning, e-business (e-catalogues), multimedia, localisation…
Provide vertical solutions to linguistically based applications
  E.g. information extraction, indexing