2000. Yu.Demchenko. TERENA

Multilingual Issues in Information Retrieval and Resource Description

Slide_1

Multilingual Issues in Information Retrieval and Resource Description: Overview

Yuri Demchenko, TERENA

Slide_2

In this presentation

• Multilingual Issues in TERENA Technical Programme

• Multilinguality: trends and developments

• Technical Issues/Background
  - Data presentation and resource description format
  - Standards Overview

• Metadata and Cataloging

• Recent Development in Subject Gateways and SE

• Cross-language Information Retrieval

• REIS/TAP Initiatives Multilinguality Framework


Slide_3

TERENA Multilingual Community and TERENA Technical Programme

• TERENA has 43 members from 34 countries speaking 30 languages

• Multilingual issues have always been within the scope of the TERENA Technical Programme
  - WG-i18n: WG on Internationalisation issues
  - C3 Project on messaging transliteration tools
  - MAITS, initiated by WG-i18n
  - Multilingual E-Mail Agent Testing
  - Multilingual issues in Subject Gateways, Section 2.13 in SG Handbook
  - Multilingual Support in Internet/IT Applications. Information page: http://www.terena.nl/projects/multiling/

• Liaison with STD bodies
  - CEN/TC304 Character set technology: http://www.stri.is/TC304/default.html
  - IETF


Slide_4

Multilinguality: trends and developments

• Storing, processing, presentation and exchange of information in many languages

• Interactive (protocol-based/negotiated) applications and non-interactive ones (resource description and information presentation)

• Multilingual Search and Retrieval
  - Multilingual Subject Gateways and Search Engines
  - CLIR testing at TREC

• Data Resource Model and Multilinguality
  - One or multiple languages
  - Data format
  - Metadata (not part of Data but part of Resource)
  - References, links
  - Professional Thesauri (Resource Context): base for multiple languages and language unification

Slide_5

Internet Applications

• Non-interactive Application: Electronic Mail
  - Correct Message Composition and Rendering

• Interactive applications
  - WWW: HTTP/HTML
    <META http-equiv="Content-Type" Content="text/html; charset=euc-jp">
  - Content Negotiation Protocol
    - Media features, attributes
    - Direct and hop-by-hop communication

• Operational Applications
  - (Internationalised) DNS
  - LDAP and X.500 (Language Support ?)


Slide_6

I18n and ML issues at IETF and other STD bodies

• IETF Architectural Model of Multilingual support in Internet Applications: RFC 2130

• Language and Charset/Encoding tagging

• Content negotiation framework (IETF/W3C)
  - Point-to-point vs hop-by-hop
  - Message based vs Interactive vs Streaming

• Internationalised DNS (IDN): Internationalised Domain Names
  - vs E-Mail (SMTP, IMAP)
  - vs Routing (Routing Policy Specification Language (RPSL))
  - vs Network Management (SNMP textual presentation)
  - vs Network Security (TLS and IPSec)

• Content Encoding normalisation (IETF/Unicode)

• LSD-2: Large Scale Services Deployment

• IMAP language extension


Slide_7

IETF Architectural Model of Multilingual support in Internet Applications

• User Interface
  - Presentation
  - Culture
  - Locale
  - Language

• On-the-wire
  - Coded Character Set: repertoire of ISO 10646
  - Character Encoding Scheme: UTF-8 (ml-text), US-ASCII (e-mail), ISO 8859-1
  - Transfer Encoding Scheme (Base64, QP)

• Resolution Service/Directory (content MD)

• Communication/Network

[Figure: two stacks, each labelled Language, Presentation/Culture/Locale and Content Transfer Agent, joined by a Communication Protocol; a Resource label also appears.]
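The on-the-wire layers of this model (coded character set, character encoding scheme, transfer encoding) can be traced for a single string. The following is a minimal Python sketch, not part of the original slides; the sample word and the variable names are illustrative assumptions.

```python
import base64
import quopri

text = "Bemüh'n"  # illustrative sample; any string would do

# Coded character set: each character maps to an ISO 10646 / Unicode code point.
code_points = [f"U+{ord(ch):04X}" for ch in text]

# Character encoding scheme: code points are serialised to octets (UTF-8 here).
utf8_octets = text.encode("utf-8")

# Transfer encoding: octets are wrapped so they survive 7-bit transports.
as_base64 = base64.b64encode(utf8_octets).decode("ascii")
as_qp = quopri.encodestring(utf8_octets).decode("ascii")

print(code_points)
print(utf8_octets)
print(as_base64)
print(as_qp)
```

Each layer is reversible on its own, which is why a receiver needs both the charset label and the transfer-encoding label to recover the text.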


Slide_8

Content Negotiation Framework (IETF/W3C)

• Content Negotiation covers three elements
  - Expressing the capabilities of the sender and the data resource to be transmitted
  - Expressing the capabilities of a receiver
  - A protocol by which capabilities are exchanged

• Abstract framework for content negotiation

            (Content)        (Transmit.data)     (Data document)
[Author]----->-----[Sender]----->-----[Receiver]----->-----[User]

• Transparent Content Negotiation in HTTP: RFC 2295

• Protocol-independent Content Negotiation Framework: RFC 2703
  - Non-message resource transfer
  - End-to-end vs hop-by-hop negotiation
  - Use of directory and resolution services

• CC/PP exchange protocol based on HTTP Extension Framework (W3C)
  - Composite Capability/Preference Profile: a user-side framework for content negotiation
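As a rough illustration of matching receiver capabilities against what a sender can offer, the sketch below picks a language variant from an HTTP-style Accept-Language value. It is a simplified, server-driven selection written for this overview, not the transparent negotiation of RFC 2295; the function names and the fallback rule are assumptions.

```python
def parse_accept_language(header):
    """Return (language tag, quality) pairs, highest quality first."""
    prefs = []
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        tag, _, params = item.partition(";")
        quality = 1.0  # a missing q parameter means full preference
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0
        prefs.append((tag.strip().lower(), quality))
    # sorted() is stable, so equal qualities keep their header order.
    return sorted(prefs, key=lambda pair: pair[1], reverse=True)


def choose_variant(header, available, default):
    """Pick the first acceptable variant, falling back to the primary subtag."""
    by_tag = {variant.lower(): variant for variant in available}
    for tag, quality in parse_accept_language(header):
        if quality <= 0:
            continue
        if tag in by_tag:
            return by_tag[tag]
        primary = tag.split("-")[0]
        if primary in by_tag:
            return by_tag[primary]
    return default


print(choose_variant("de;q=0.8, en-GB, fr;q=0.5", ["en", "de", "fr"], "en"))
```

Here the receiver's capabilities are the header value, the sender's capabilities are the list of available variants, and the exchange protocol is reduced to a single request header.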


Slide_9

Charset and Language tagging

• MIME types (RFC 2045-2049)
  - text, image, audio, video
  - Charset = Character Set + Character Encoding Scheme
  - Transfer Encoding Scheme: base64, quoted-printable
  - Other media attributes and features (e.g., resolution, color, language, etc.)

• Language
  - RFC 1766
  - ISO 639-2
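To show how these labels appear on an actual message, here is a small sketch using Python's standard `email` package. The message text and the choice of base64 are illustrative assumptions; the point is only that media type, charset, transfer encoding and language are carried as separate labels.

```python
from email.message import EmailMessage

body = "durchaus studiert mit heißem Bemüh'n."

msg = EmailMessage()
# Media type text/plain, charset UTF-8, transfer encoding base64.
msg.set_content(body, charset="utf-8", cte="base64")
# Language tag (RFC 1766 style); set after set_content(), which resets Content-* headers.
msg["Content-Language"] = "de"

for name in ("Content-Type", "Content-Transfer-Encoding", "Content-Language"):
    print(f"{name}: {msg[name]}")

# The receiver reverses both encodings using the labels alone.
print(msg.get_content().strip())
```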

Slide_10

WWW: HTTP/HTML

HTTP header includes information about the type of the transferred information and the character encoding for text-based information:

Content-Type: text/html; charset=euc-jp

The Content-Language entity header field describes the natural language(s) of the intended audience for the enclosed document:

Content-Language: se

Character encoding information in the META information of the HTML document:

<META http-equiv="Content-Type" Content="text/html; charset=euc-jp">
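A receiving application has to read the charset label from either place. The sketch below, using only the Python standard library, extracts it from a Content-Type header value and from a META element; the helper names are hypothetical, and real browsers apply more elaborate precedence rules than shown here.

```python
from email.message import Message
from html.parser import HTMLParser


def charset_from_header(value):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset()


class MetaCharsetParser(HTMLParser):
    """Remember the charset of the first META http-equiv Content-Type element."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        # HTMLParser already lowercases tag and attribute names.
        attributes = {name: (value or "") for name, value in attrs}
        content = attributes.get("content", "")
        if attributes.get("http-equiv", "").lower() == "content-type" and content:
            self.charset = charset_from_header(content)


header_charset = charset_from_header("text/html; charset=euc-jp")

parser = MetaCharsetParser()
parser.feed('<META http-equiv="Content-Type" Content="text/html; charset=euc-jp">')
meta_charset = parser.charset

# Either label is enough to decode the document octets.
octets = "日本語".encode("euc-jp")
print(octets.decode(header_charset), meta_charset)
```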

Slide_11

XML: Character Set tagging

• Character is the atomic unit of text
  - All ISO 10646 characters + TAB, CR, LF

• The mechanism for encoding can vary from entity to entity
  - All XML processors must accept UTF-8 and UTF-16

• Character Encoding declaration in XML documents or entities (section 4.3.3)

EncodingDecl ::= S 'encoding' Eq ('"' EncName '"' | "'" EncName "'")

<?xml encoding='UTF-8'?>
<?xml encoding='EUC-JP'?>

• Default Character Set Encoding: UTF-8 and UTF-16

• Autodetection of Character Encoding
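Autodetection can be sketched as: look for a byte order mark, otherwise read the encoding declaration (which is ASCII-compatible in most encodings), otherwise assume UTF-8. The function below is a simplified illustration with a hypothetical name; it does not cover every case a conforming XML processor handles, such as UTF-16 without a byte order mark or EBCDIC.

```python
import re

# Matches an XML or text declaration that carries an encoding pseudo-attribute.
_DECL = re.compile(rb"<\?xml[^>]*?encoding\s*=\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']")


def sniff_xml_encoding(data):
    """Guess the encoding of an XML entity from its first octets."""
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    if data.startswith((b"\xff\xfe", b"\xfe\xff")):
        return "utf-16"
    match = _DECL.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    return "utf-8"  # default when nothing is declared


entity = "<?xml encoding='EUC-JP'?><t>日本語</t>".encode("euc-jp")
encoding = sniff_xml_encoding(entity)
print(encoding, entity.decode(encoding))
```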

Slide_12

XML: Language tagging

• Language identification (section 2.12)
  - Labelling the language of the whole document, entity or item
  - Tag for identification of languages

LanguageID ::= Langcode ('-' Subcode)*
Langcode ::= ISO639Code | IanaCode | UserCode

Examples:

<p xml:lang="en">The quick brown fox jumps over the lazy dog.</p>

<p xml:lang="en-GB">What colour is it?</p>

<p xml:lang="en-US">What color is it?</p>

<sp who="Faust" desc='leise' xml:lang="de">

<l>Habe nun, ach! Philosophie,</l>

<l>Juristerei, und Medizin</l>

<l>und leider auch Theologie</l>

<l>durchaus studiert mit heißem Bemüh'n.</l>

</sp>
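Because a language label applies to an element and everything inside it unless overridden, a consumer has to resolve the effective language per element. The following sketch does this with Python's `xml.etree.ElementTree`; the helper names are hypothetical and the input is a shortened copy of the Faust example.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def split_language_id(language_id):
    """Split a LanguageID into its Langcode and list of Subcodes."""
    langcode, *subcodes = language_id.split("-")
    return langcode, subcodes


def effective_langs(element, inherited=None):
    """Yield (tag, language) pairs, letting children inherit from parents."""
    lang = element.get(XML_LANG, inherited)
    yield element.tag, lang
    for child in element:
        yield from effective_langs(child, lang)


doc = """<sp who="Faust" desc='leise' xml:lang="de">
<l>Habe nun, ach! Philosophie,</l>
<l>Juristerei, und Medizin</l>
</sp>"""

langs = list(effective_langs(ET.fromstring(doc)))
print(langs)
print(split_language_id("en-GB"))
```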


Slide_13

Unicode Technical Reports

• The Unicode Standard, Version 3.0 (just published): http://www.unicode.org/unicode/uni2book/u2.html

• Unicode 2.0 test page: http://www.terena.nl/projects/multiling/euroml/tests/test-ucspages1ucs.html

• Multilingual European Subsets of ISO/IEC 10646-1: http://www.stri.is/TC304/p10_1998_05_30.pdf

• Unicode Technical Reports
  - UTR #15: Unicode Normalization Forms, Version 18.0
    - I-D by Martin Duerst
  - UTR #17: Character Encoding Model
  - UTR #16: UTF-EBCDIC
  - UTR #10: Unicode Collation Algorithm
  - UTR #7: Plane 14 Characters for Language Tags
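The normalization forms of UTR #15 matter for retrieval because the same visible text can be stored as different code point sequences. A short sketch with Python's `unicodedata` module; the sample word is an illustrative assumption.

```python
import unicodedata

composed = "Bem\u00fch'n"     # ü as one precomposed code point
decomposed = "Bemu\u0308h'n"  # u followed by COMBINING DIAERESIS

# The two strings render identically but do not compare equal.
print(composed == decomposed)

# Normalising both sides to the same form makes them comparable.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)
print(nfc == composed, nfd == decomposed)
```

An index and the queries run against it need to agree on one form, otherwise visually identical terms fail to match.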

Slide_14

Language Definition in DC Metadata set - DC.Language Format

<meta name = "DC.Language" content = "en">

<meta name = "DC.Language" scheme = "rfc1766" content = "en">

<meta name = "DC.Language" scheme = "ISO639-2" content = "eng">

<meta name = "DC.Language" scheme = "rfc1766" content = "en-US">

<meta name = "DC.Language" content = "zh">

<meta name = "DC.Language" content = "ja">

<meta name = "DC.Language" content = "es">

<meta name = "DC.Language" content = "german">

<meta name = "DC.Language" lang = "fr" content = "allemand">

Slide_15

Language Definition in DC Metadata set - Field content language labelling/attributing

A work in Spanish may be assigned the following metadata:

<meta name = "DC.Language" scheme = "rfc1766" content = "es">

<meta name = "DC.Title"

lang = "es"

content = "La Mesa Verde y la Silla Roja">

<meta name = "DC.Title"

lang = "en"

content = "The Green Table and the Red Chair">
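A harvester can read such records by collecting the DC meta elements together with their `lang` and `scheme` attributes. The sketch below uses Python's `html.parser`; the class name and the record layout are assumptions made for illustration.

```python
from html.parser import HTMLParser


class DCMetaParser(HTMLParser):
    """Collect Dublin Core meta elements with their language and scheme."""

    def __init__(self):
        super().__init__()
        self.records = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name") or ""
        if name.startswith("DC."):
            self.records.append({
                "element": name[3:],
                "lang": attributes.get("lang"),
                "scheme": attributes.get("scheme"),
                "content": attributes.get("content"),
            })


html = """
<meta name="DC.Language" scheme="rfc1766" content="es">
<meta name="DC.Title" lang="es" content="La Mesa Verde y la Silla Roja">
<meta name="DC.Title" lang="en" content="The Green Table and the Red Chair">
"""

parser = DCMetaParser()
parser.feed(html)

# Index the titles by the language of the field content.
titles = {r["lang"]: r["content"] for r in parser.records if r["element"] == "Title"}
print(titles)
```

Note the two distinct roles: DC.Language describes the language of the work, while the `lang` attribute describes the language of each metadata value.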


Slide_16

DC in Multiple Languages

• The reference language of the international DC community is English; however, the semantics of DC elements can in principle be expressed equally well in any modern language

• The versions of DC elements in various languages should share a single namespace, using tokens that look like English words but stand for universal elements: http://purl.org/dc/elements/1.1/

• DC in Multiple Languages Registry project: http://purl.org/dc/groups/languages.htm
  - Uses RDF schemas to share machine-readable tokens for translation of DC terms in multiple languages (26 languages to date)
  - Linkage to and from central DC namespace server
  - Registry as Dictionary/Thesauri: use Interlinguas to link different translations

• Formal recognition and standardization procedure

Slide_17

Document Description with Unqualified DC and RDF syntax

<?xml:namespace ns="http://purl.org/metadata/dublin_core_elements" prefix="DC"?>

<RDF:RDF>

<RDF:Description RDF:HREF="http://www.biblio.de/buecher/kleist.html">

<DC:Title xml:lang="de">Das Erdbeben in Chili</DC:Title>

<DC:Creator>Heinrich von Kleist</DC:Creator>

</RDF:Description>

</RDF:RDF>

XML Encoding (Character set) declaration
  - UTF-8/UTF-16 as default encoding
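A comparable description can be generated programmatically. The sketch below uses Python's `xml.etree.ElementTree` (Python 3.8 or later for `xml_declaration`). As an assumption, it uses the RDF namespace of the 1999 RDF syntax and the DC 1.1 element namespace rather than the draft syntax shown on the slide, so element and attribute names differ slightly.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"
XML = "http://www.w3.org/XML/1998/namespace"

ET.register_namespace("rdf", RDF)
ET.register_namespace("dc", DC)

root = ET.Element(f"{{{RDF}}}RDF")
description = ET.SubElement(
    root,
    f"{{{RDF}}}Description",
    {f"{{{RDF}}}about": "http://www.biblio.de/buecher/kleist.html"},
)

# The title carries its own language label through xml:lang.
title = ET.SubElement(description, f"{{{DC}}}title", {f"{{{XML}}}lang": "de"})
title.text = "Das Erdbeben in Chili"

creator = ET.SubElement(description, f"{{{DC}}}creator")
creator.text = "Heinrich von Kleist"

# Serialise with an explicit encoding declaration.
xml_bytes = ET.tostring(root, encoding="utf-8", xml_declaration=True)
print(xml_bytes.decode("utf-8"))
```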


Slide_18

Recent Developments in Subject Gateways, Indexing, Searching

NRENs projects

Subject gateways

Commercial Search Engines

Multilingual Text Retrieval and Processing
  - TUSTEP system: using “fuzzy” multilingual searching

Cross-Language Information Retrieval (CLIR) testing at TREC-7/TREC-8 Conferences by NIST


Slide_19

Multilingual Subject Gateway (DESIRE)

• Developing multilingual subject gateways (SOSIG as example)
  - SOSIG accepts resources in any language, evaluated for quality
  - Translations should be coherent and checked
  - Different language versions should be equally well maintained

• SOSIG Cataloguing rules
  - TITLE will be displayed in the first language
  - ALTERNATIVE TITLE in other languages
  - DESCRIPTION will mention the different languages in which the resource is available
  - URI of all language versions
  - Labeling URI language

• Library standards for multilingual provision
  - NISO Z39.53 Language codes
  - USMARC Language codes


Slide_20

Multilingual provision in popular Internet Search Engines

• Multilingual SE
  - AltaVista: http://www.altavista.com/ (28 languages)
    - Documents indexed as is
    - Automatic translation: very simple and naive
  - Euroseek: http://www.euroseek.com/ (30 languages)
  - FAST Advanced Search: http://www.alltheweb.com (31 languages)
  - Google: http://www.google.com/ (11 languages)

• Other sites that have dedicated national sites
  - interface language
  - language resources
  - no special language policy
  - Excite: 11 countries
  - Lycos: 23 countries


Slide_21

TUSTEP: TUebingen System of Text Processing Programs

1. File structure

2. Multilingual capabilities

3. Internal data presentation

4. Database publishing/output data presentation

5. CGI

6. Sample implementation http://lddv.zdv.uni-tuebingen.de/cgi-bin/opac/zdvlit

Try entries like Smith or Meier or...

http://lddv.zdv.uni-tuebingen.de/cgi-bin/km/npquery


Slide_22

Cross-Language Information Retrieval (CLIR) testing at TREC-7/TREC-8

TREC - Text REtrieval Conference - http://trec.nist.gov/

• Cross-Language Information Retrieval (CLIR) technologies
  - Using Intermediary or Interlingual representation
    - Latent Semantic Indexing
    - Generalised Vector Space Model, etc.
  - Computer translation
  - Machine-readable bilingual dictionaries
  - Multilingual Thesauri

• Participants: ETH/Eurospider, IBM, Xerox, Cornell, New Mexico Univ, TNO, others
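Of the techniques listed, dictionary-based query translation is the simplest to sketch: each query term is replaced by its translations before matching. The toy dictionary, documents and scoring below are invented for illustration and are not taken from any TREC system.

```python
# Hypothetical English-to-Spanish dictionary; one term may have several translations.
BILINGUAL = {
    "table": ["mesa"],
    "chair": ["silla"],
    "green": ["verde"],
    "red": ["roja", "rojo"],
}

# Hypothetical document collection.
DOCS = {
    "doc-es": "La Mesa Verde y la Silla Roja",
    "doc-en": "The Green Table and the Red Chair",
}


def translate_query(query, dictionary):
    """Replace each term by its translations; unknown terms pass through."""
    terms = []
    for word in query.lower().split():
        terms.extend(dictionary.get(word, [word]))
    return terms


def search(terms, docs):
    """Rank documents by the number of query terms they contain."""
    scores = {}
    for doc_id, text in docs.items():
        words = set(text.lower().split())
        hits = sum(1 for term in terms if term in words)
        if hits:
            scores[doc_id] = hits
    return sorted(scores, key=scores.get, reverse=True)


spanish_terms = translate_query("red chair", BILINGUAL)
print(spanish_terms)
print(search(spanish_terms, DOCS))
```

Keeping every translation of an ambiguous term, as done here for "red", trades precision for recall; interlingual and thesaurus-based approaches aim to reduce that ambiguity.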


Slide_23

REIS Project/Initiative Multilinguality framework - First attempt

• Multiple language indexing
  - multiple language documents/indexes

• Cross-language Searching
  - Automatic Query forwarding based on thesauri or ML dictionary
  - Using “fuzzy” multilingual searching/matching

• Multilingual information retrieval
  - Automatic translation (if requested)
  - Translation Request Protocol

• Internal Data/Indexes presentation
  - Language and Character Encoding tagging
  - XML as internal presentation of data, with XML language and charset tagging
  - Text/Charset normalisation (Unicode or TUSTEP-like)

• Metadata and Resource Description
  - DC.Language definition and XML/RDF/DC Language tagging
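One possible reading of "fuzzy" matching combined with text normalisation is sketched below: terms are case-folded and stripped of diacritics, then compared by string similarity using Python's `difflib`. This is an assumption about how such matching could work, not a description of TUSTEP or of the REIS design; the names, index terms and cutoff are hypothetical.

```python
import difflib
import unicodedata


def fold(text):
    """Case-fold and strip combining marks so that 'Müller' becomes 'muller'."""
    decomposed = unicodedata.normalize("NFKD", text.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def fuzzy_lookup(query, index_terms, cutoff=0.75):
    """Return index terms whose folded form is close to the folded query."""
    folded = {fold(term): term for term in index_terms}
    matches = difflib.get_close_matches(fold(query), list(folded), n=5, cutoff=cutoff)
    return [folded[match] for match in matches]


index_terms = ["Meyer", "Maier", "Müller", "Smith"]
print(fuzzy_lookup("Meier", index_terms))
```

Folding is lossy, so an index built this way would normally keep the original form alongside the folded key, as the dictionary in `fuzzy_lookup` does.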


Slide_24

Multilinguality Framework for Multilingual Indexing/Search Services

Yet to be developed
