
2000. Yu.Demchenko. TERENA

Multilingual Issues in Information Retrieval and Resource Description

Slide 2_1

Multilingual Issues in Information Retrieval and Resource Description: Overview

Yuri Demchenko, TERENA

[email protected]


©2000. Yu.Demchenko. TERENA Multilingual Issues in Information Retrieval and Resource Description

Slide_2

In this presentation

• Multilingual Issues in TERENA Technical Programme

• Multilinguality: trends and developments

• Technical Issues/Background
  - Data presentation and resource description format
  - Standards Overview

• Metadata and Cataloging

• Recent Developments in Subject Gateways and Search Engines (SE)

• Cross-language Information Retrieval
  - REIS/TAP Initiatives: Multilinguality Framework


Slide_3

TERENA Multilingual Community and TERENA Technical Programme

• TERENA has 43 members from 34 countries speaking 30 languages

• Multilingual issues have always been within the scope of the TERENA Technical Programme
  - WG-i18n - WG on Internationalisation issues
  - C3 Project on messaging transliteration tools
  - MAITS - initiated by WG-i18n
  - Multilingual E-Mail Agent Testing
  - Multilingual issues in Subject Gateways, Section 2.13 in SG Handbook
  - Multilingual Support in Internet/IT Applications. Information page - http://www.terena.nl/projects/multiling/

• Liaison with STD bodies
  - CEN/TC304 Character set technology - http://www.stri.is/TC304/default.html
  - IETF


Slide_4

Multilinguality: trends and developments

Storing, processing, presentation and exchange of information in many languages

Interactive (protocol-based/negotiated) applications and non-interactive ones (resource description and information presentation)

Multilingual Search and Retrieval
  - Multilingual Subject Gateways and Search Engines
  - CLIR testing at TREC

Data Resource Model and Multilinguality
  - One or multiple languages
  - Data format
  - Metadata (not part of the Data but part of the Resource)
  - References, links
  - Professional Thesauri (Resource Context) - base for multiple languages and language unification


Slide_5

Internet Applications

Non-interactive application: Electronic Mail
  - Correct message composition and rendering

Interactive applications
  - WWW: HTTP/HTML
    <META http-equiv="Content-Type" Content="text/html; charset=euc-jp">
  - Content Negotiation Protocol
    - Media features, attributes
    - Direct and hop-by-hop communication

Operational applications
  - (Internationalised) DNS
  - LDAP and X.500 (Language Support ?)


Slide_6

I18n and ML issues at IETF and other STD bodies

IETF Architectural Model of Multilingual support in Internet Applications - RFC 2130

Language and Charset/Encoding tagging

Content negotiation framework (IETF/W3C)
  - Point-to-point vs hop-by-hop
  - Message-based vs interactive vs streaming

Internationalised DNS (IDN) - Internationalised Domain Names
  - vs E-Mail (SMTP, IMAP)
  - vs Routing (Routing Policy Specification Language (RPSL))
  - vs Network Management (SNMP textual presentation)
  - vs Network Security (TLS and IPSec)

Content Encoding normalisation (IETF/Unicode)

LSD-2 - Large Scale Services Deployment

IMAP language extension


Slide_7

IETF Architectural Model of Multilingual support in Internet Applications

User Interface
  - Presentation
  - Culture
  - Locale
  - Language

On-the-wire
  - Coded Character Set - Repertoire of ISO-10646
  - Character Encoding Scheme - UTF-8 (ml-text), US-ASCII (e-mail), ISO8859-1
  - Transfer Encoding Scheme (Base64, QP)

Resolution Service / Directory (content MD)

[Figure: two endpoints, each labelled Language / Presentation, Culture, Locale / Content Transfer Agent, connected by a Communication Protocol across the Communication/Network layer, with a Resource attached]
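The on-the-wire layers listed for this slide can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the sample string is taken from the Faust example later in the deck, and the choice of UTF-8 plus Base64/quoted-printable is one possible combination, not the only one the model allows.

```python
import base64
import quopri

# Coded Character Set: abstract characters from the ISO 10646 repertoire.
text = "Bem\u00fch'n"  # sample word, "ü" is U+00FC

# Character Encoding Scheme: map the characters to octets.
utf8_octets = text.encode("utf-8")         # "ü" becomes two octets (C3 BC)
latin1_octets = text.encode("iso-8859-1")  # "ü" becomes one octet (FC)

# Transfer Encoding Scheme: make the octets safe for a 7-bit channel.
b64 = base64.b64encode(utf8_octets).decode("ascii")
qp = quopri.encodestring(utf8_octets).decode("ascii")

# The receiver reverses the layers in the opposite order.
decoded = base64.b64decode(b64).decode("utf-8")

print(len(utf8_octets), len(latin1_octets))
print(b64)
print(qp)
print(decoded == text)
```

The same characters yield different octet sequences under different encoding schemes, which is why the charset label has to travel with the data.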


Slide_8

Content Negotiation Framework (IETF/W3C)

Content Negotiation covers three elements
  - Expressing the capabilities of the sender and the data resource to be transmitted
  - Expressing the capabilities of a receiver
  - A protocol by which capabilities are exchanged

Abstract framework for content negotiation

         (Content)       (Transmit.data)      (Data document)
[Author]----->-----[Sender]----->-----[Receiver]----->-----[User]

Transparent Content Negotiation in HTTP - RFC 2295

Protocol-independent Content Negotiation Framework - RFC 2703
  - Non-message resource transfer
  - End-to-end vs hop-by-hop negotiation
  - Use of directory and resolution services

CC/PP exchange protocol based on HTTP Extension Framework (W3C)
  - Composite Capability/Preference Profile: a user-side framework for content negotiation
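As a rough illustration of the three elements above, the following Python sketch matches a receiver's stated language preferences against the variants a sender holds. It assumes the HTTP `Accept-Language` syntax with q-values as the capability expression; the function names are hypothetical, and the wildcard `*` and the full matching rules of the RFCs are deliberately left out.

```python
def parse_accept_language(header):
    """Parse an Accept-Language value into (tag, q) pairs, best first."""
    prefs = []
    for item in header.split(","):
        parts = item.strip().split(";")
        tag = parts[0].strip().lower()
        if not tag:
            continue
        q = 1.0  # default quality when no q parameter is given
        for param in parts[1:]:
            name, _, value = param.strip().partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        prefs.append((tag, q))
    # sorted() is stable, so equal q-values keep their header order.
    return sorted(prefs, key=lambda p: p[1], reverse=True)


def choose_variant(available, header, default):
    """Pick the first available language variant the receiver accepts."""
    by_lower = {lang.lower(): lang for lang in available}
    for tag, q in parse_accept_language(header):
        if q <= 0:
            continue
        if tag in by_lower:
            return by_lower[tag]
        primary = tag.split("-")[0]  # fall back from "de-ch" to "de"
        if primary in by_lower:
            return by_lower[primary]
    return default


print(choose_variant(["en", "de", "fr"], "sv, de-CH;q=0.8, en;q=0.5", "en"))
```

Here the sender has no Swedish variant, so the German one is selected through the primary-tag fallback.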


Slide_9

Charset and Language tagging

MIME types (RFC 2045-2049)
  - text, image, audio, video
  - Charset = Character Set + Character Encoding Scheme
  - Transfer Encoding Scheme
    - base64
    - quoted-printable
  - Other media attributes and features (e.g., resolution, color, language, etc.)

Language
  - RFC 1766
  - ISO 639-2


Slide_10

WWW: HTTP/HTML

The HTTP header carries the type of the transferred information and, for text-based information, the character encoding:

Content-Type: text/html; charset=euc-jp

The Content-Language entity header field describes the natural language(s) of the intended audience for the enclosed document:

Content-Language: se

Character encoding information can also be given in the META information of the HTML document:

<META http-equiv="Content-Type" Content="text/html; charset=euc-jp">
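A minimal Python sketch of how a client might combine the two sources of charset information on this slide. It assumes the HTTP header takes precedence over the META element and that the start of the document is already available as ASCII-compatible text; the helper names are hypothetical.

```python
from email.message import Message
from html.parser import HTMLParser


def charset_from_content_type(value):
    """Return the charset parameter of a Content-Type value, or None."""
    if not value:
        return None
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset()


class MetaCharsetParser(HTMLParser):
    """Remember the charset of the first META http-equiv Content-Type."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attrs = dict(attrs)  # the parser lower-cases attribute names
        if (attrs.get("http-equiv") or "").lower() == "content-type":
            self.charset = charset_from_content_type(attrs.get("content"))


def detect_charset(http_content_type, html_prefix):
    """Prefer the HTTP header, then fall back to the META declaration."""
    charset = charset_from_content_type(http_content_type)
    if charset:
        return charset
    parser = MetaCharsetParser()
    parser.feed(html_prefix)
    return parser.charset


meta = '<META http-equiv="Content-Type" Content="text/html; charset=euc-jp">'
print(detect_charset("text/html", meta))
```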


Slide_11

XML: Character Set tagging

A character is an atomic unit of text
  - All ISO 10646 characters + TAB, CR, LF

The mechanism for encoding can vary for different characters
  - All XML processors must accept UTF-8 and UTF-16

Character Encoding declaration in XML documents or entities (section 4.3.3)

EncodingDecl ::= S 'encoding' Eq ('"' EncName '"' | "'" EncName "'")

<?xml encoding='UTF-8'?>
<?xml encoding='EUC-JP'?>

Default Character Set Encoding - UTF-8 and UTF-16

Autodetection of Character Encoding
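The autodetection idea can be sketched in Python as follows. This is a simplified illustration, not the full procedure from the XML specification: it only looks at a UTF-8 or UTF-16 byte order mark and at an ASCII-compatible encoding declaration, and the function name is hypothetical.

```python
import codecs
import re

# Matches an encoding pseudo-attribute at the start of an ASCII-compatible entity.
ENCODING_DECL = re.compile(
    rb'^<\?xml[^>]*?\sencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def sniff_xml_encoding(data):
    """Guess the encoding of an XML entity from its first bytes."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        return "utf-16"
    match = ENCODING_DECL.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    # No BOM and no declaration: fall back to the UTF-8 default.
    return "utf-8"


print(sniff_xml_encoding(b"<?xml encoding='EUC-JP'?><doc/>"))
```

UTF-16 entities without a byte order mark and other multi-octet encodings are not handled by this sketch.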


Slide_12

XML: Language tagging

Language identification (section 2.12)
  - Labelling the language of the whole document, entity or item
  - Tag for identification of languages

LanguageID ::= Langcode ('-' Subcode)*
Langcode ::= ISO639Code | IanaCode | UserCode

Examples:

<p xml:lang="en">The quick brown fox jumps over the lazy dog.</p>

<p xml:lang="en-GB">What colour is it?</p>

<p xml:lang="en-US">What color is it?</p>

<sp who="Faust" desc='leise' xml:lang="de">

<l>Habe nun, ach! Philosophie,</l>

<l>Juristerei, und Medizin</l>

<l>und leider auch Theologie</l>

<l>durchaus studiert mit heißem Bemüh'n.</l>

</sp>
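Because `xml:lang` applies to an element and everything inside it unless overridden, a consumer has to track the inherited value. The sketch below shows one way to do that with Python's standard ElementTree; the `<doc>` wrapper element and the function name are assumptions added for the example.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def languages(element, inherited=None):
    """Yield (tag, effective language) for an element and its descendants."""
    lang = element.get(XML_LANG, inherited)
    yield element.tag, lang
    for child in element:
        yield from languages(child, lang)


doc = ET.fromstring(
    '<doc xml:lang="en">'
    "<p>What colour is it?</p>"
    '<sp who="Faust" xml:lang="de"><l>Habe nun, ach! Philosophie,</l></sp>'
    "</doc>"
)
result = list(languages(doc))
print(result)
```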


Slide_13

Unicode Technical Reports

The Unicode Standard, Version 3.0 - Just published! - http://www.unicode.org/unicode/uni2book/u2.html
  - Unicode 2.0 test page - http://www.terena.nl/projects/multiling/euroml/tests/test-ucspages1ucs.html
  - Multilingual European Subsets of ISO/IEC 10646-1 - http://www.stri.is/TC304/p10_1998_05_30.pdf

Unicode Technical Reports
  - UTR #15: Unicode Normalization Forms, Version 18.0
    - I-D by Martin Duerst
  - UTR #17: Character Encoding Model
  - UTR #16: UTF-EBCDIC
  - UTR #10: Unicode Collation Algorithm
  - UTR #7: Plane 14 Characters for Language Tags
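To show why the normalization forms of UTR #15 matter for indexing and search, here is a short Python sketch using the standard `unicodedata` module. The sample word and the helper name `same_text` are assumptions chosen for the example.

```python
import unicodedata

composed = "Bem\u00fch'n"     # "ü" as a single code point (U+00FC)
decomposed = "Bemu\u0308h'n"  # "u" followed by a combining diaeresis (U+0308)


def same_text(a, b):
    """Compare two strings after bringing both to Normalization Form C."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(composed == decomposed)           # raw code points differ
print(same_text(composed, decomposed))  # equal once normalized
```

An index that stores one form and receives queries in the other would miss matches unless both sides are normalized the same way.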


Slide_14

Language Definition in DC Metadata set - DC.Language Format

<meta name = "DC.Language" content = "en">

<meta name = "DC.Language" scheme = "rfc1766" content = "en">

<meta name = "DC.Language" scheme = "ISO639-2" content = "eng">

<meta name = "DC.Language" scheme = "rfc1766" content = "en-US">

<meta name = "DC.Language" content = "zh">

<meta name = "DC.Language" content = "ja">

<meta name = "DC.Language" content = "es">

<meta name = "DC.Language" content = "german">

<meta name = "DC.Language" lang = "fr" content = "allemand">


Slide_15

Language Definition in DC Metadata set - Field content language labelling/attributing

A work in Spanish may be assigned the following metadata:

<meta name = "DC.Language" scheme = "rfc1766" content = "es">

<meta name = "DC.Title"

lang = "es"

content = "La Mesa Verde y la Silla Roja">

<meta name = "DC.Title"

lang = "en"

content = "The Green Table and the Red Chair">
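A harvester could read such records with a few lines of Python. The sketch below uses the standard `html.parser` module; the class name and the dictionary layout of the extracted records are assumptions made for illustration, not part of any DC specification.

```python
from html.parser import HTMLParser


class DublinCoreParser(HTMLParser):
    """Collect DC.* META elements with their scheme and lang qualifiers."""

    def __init__(self):
        super().__init__()
        self.records = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name = attrs.get("name") or ""
        if name.upper().startswith("DC."):
            self.records.append({
                "element": name[3:],
                "content": attrs.get("content"),
                "scheme": attrs.get("scheme"),
                "lang": attrs.get("lang"),
            })


html = """
<meta name = "DC.Language" scheme = "rfc1766" content = "es">
<meta name = "DC.Title" lang = "es" content = "La Mesa Verde y la Silla Roja">
<meta name = "DC.Title" lang = "en" content = "The Green Table and the Red Chair">
"""
parser = DublinCoreParser()
parser.feed(html)

# Titles keyed by the language of the field content.
titles = {r["lang"]: r["content"] for r in parser.records if r["element"] == "Title"}
print(titles)
```

Note the two distinct roles: `DC.Language` describes the language of the work, while the `lang` attribute labels the language of each metadata value.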


Slide_16

DC in Multiple Languages

The reference language of the international DC community is English; however, the semantics of DC elements can in principle be expressed equally well in any modern language

The versions of DC elements in various languages should share a single name space using tokens that look like English words but stand for universal elements - http://purl.org/dc/elements/1.1/

DC in Multiple Languages Registry project - http://purl.org/dc/groups/languages.htm
  - Uses RDF schemas to share machine-readable tokens for translation of DC terms in multiple languages (26 languages to date)
  - Linkage to and from the central DC namespace server
  - Registry as Dictionary/Thesaurus - uses interlinguas to link different translations

Formal recognition and standardization procedure


Slide_17

Document Description with Unqualified DC and RDF syntax

<?xml:namespace ns="http://purl.org/metadata/dublin_core_elements" prefix="DC"?>

<RDF:RDF>

<RDF:Description RDF:HREF="http://www.biblio.de/buecher/kleist.html">

<DC:Title xml:lang="de">Das Erdbeben in Chili</DC:Title>

<DC:Creator>Heinrich von Kleist</DC:Creator>

</RDF:Description>

</RDF:RDF>

XML Encoding (Character set) declaration

UTF-8/UTF-16 as default encoding


Slide_18

Recent Developments in Subject Gateways, Indexing, Searching

NRENs projects

Subject gateways

Commercial Search Engines

Multilingual Text Retrieval and Processing
  - TUSTEP system - using “fuzzy” multilingual searching

Cross-Language Information Retrieval (CLIR) testing at TREC-7/TREC-8 Conferences by NIST


Slide_19

Multilingual Subject Gateway (DESIRE)

Developing multilingual subject gateways (SOSIG as example)
  - SOSIG accepts resources in any language, evaluated for quality
  - Translations should be coherent and checked
  - Different language versions should be equally well maintained

SOSIG cataloguing rules
  - TITLE will be displayed in the first language
  - ALTERNATIVE TITLE in other languages
  - DESCRIPTION will mention the different languages in which the resource is available
  - URI of all language versions
  - Labelling URI language

Library standards for multilingual provision
  - NISO Z39.53 Language codes
  - USMARC Language codes


Slide_20

Multilingual provision in popular Internet Search Engines

Multilingual SE
  - AltaVista - http://www.altavista.com/ - 28 languages
    - Documents indexed as is
    - Automatic translation - very simple and naive
  - Euroseek - http://www.euroseek.com/ - 30 languages
  - FAST Advanced Search - http://www.alltheweb.com - 31 languages
  - Google - http://www.google.com/ - 11 languages

Other sites that have dedicated national sites
  - interface language
  - language resources
  - no special language policy
  - Excite - 11 countries
  - Lycos - 23 countries


Slide_21

TUSTEP: TUebingen System of TExt Processing Programs

1. File structure

2. Multilingual capabilities

3. Internal data presentation

4. Database publishing/output data presentation

5. CGI

6. Sample implementation http://lddv.zdv.uni-tuebingen.de/cgi-bin/opac/zdvlit

Try entries like Smith or Meier or...

http://lddv.zdv.uni-tuebingen.de/cgi-bin/km/npquery


Slide_22

Cross-Language Information Retrieval (CLIR) testing at TREC-7/TREC-8

TREC - Text REtrieval Conference - http://trec.nist.gov/

Cross-Language Information Retrieval (CLIR) technologies
  - Using an intermediary or interlingual representation
    - Latent Semantic Indexing
    - Generalised Vector Space Model, etc.
  - Computer translation
  - Machine-readable bilingual dictionaries
  - Multilingual thesauri

Participants: ETH/Eurospider, IBM, Xerox, Cornell, New Mexico Univ, TNO, others
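Of the approaches listed, the bilingual-dictionary one is the easiest to sketch. The Python example below translates an English query term by term and runs it against German documents. The dictionary, the documents and the overlap-count scoring are all toy assumptions for illustration; real CLIR systems add stemming, disambiguation and proper ranking.

```python
# Hypothetical toy English-to-German dictionary; a term may have several senses.
BILINGUAL = {
    "green": ["grün"],
    "table": ["Tisch", "Tabelle"],
    "chair": ["Stuhl"],
}


def translate_query(query, dictionary):
    """Replace each query word by all its translations (unknown words pass through)."""
    terms = []
    for word in query.lower().split():
        terms.extend(dictionary.get(word, [word]))
    return terms


def search(documents, terms):
    """Rank documents by how many distinct query terms they contain."""
    wanted = {t.lower() for t in terms}
    scored = []
    for doc_id, text in documents.items():
        score = len(wanted & set(text.lower().split()))
        if score:
            scored.append((score, doc_id))
    return [doc_id for score, doc_id in sorted(scored, key=lambda s: (-s[0], s[1]))]


docs = {
    "d1": "Der grüne Tisch und der rote Stuhl",
    "d2": "Eine Tabelle mit Zahlen",
    "d3": "Das Erdbeben in Chili",
}
terms = translate_query("green table chair", BILINGUAL)
hits = search(docs, terms)
print(terms, hits)
```

Without stemming, the inflected form "grüne" does not match "grün", which hints at why the thesaurus and interlingual approaches are also being tested.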


Slide_23

REIS Project/Initiative Multilinguality framework - First attempt

Multiple language indexing
  - multiple language documents/indexes

Cross-language searching
  - Automatic query forwarding based on thesauri or ML dictionary
  - Using “fuzzy” multilingual searching/matching

Multilingual information retrieval
  - Automatic translation (if requested)
  - Translation Request Protocol

Internal Data/Indexes presentation
  - Language and Character Encoding tagging
  - XML as internal presentation of data, with XML language and charset tagging
  - Text/Charset normalisation (Unicode or TUSTEP-like)

Metadata and Resource Description
  - DC.Language definition and XML/RDF/DC Language tagging
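One plausible reading of “fuzzy” multilingual matching is sketched below in Python: index terms are folded to an accent-free lower-case key and then compared by string similarity. This is an assumption about what such matching could look like, not a description of how TUSTEP or REIS implement it; the function names, the name list and the 0.75 cutoff are hypothetical.

```python
import difflib
import unicodedata


def fold(text):
    """Lower-case the text and strip combining marks (ü -> u, é -> e)."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def fuzzy_lookup(query, index_terms, cutoff=0.75):
    """Return index terms whose folded form is similar to the folded query."""
    folded = {}
    for term in index_terms:
        folded.setdefault(fold(term), []).append(term)
    keys = difflib.get_close_matches(fold(query), list(folded), n=5, cutoff=cutoff)
    return [term for key in keys for term in folded[key]]


names = ["Meier", "Meyer", "Müller", "Smith"]
print(fuzzy_lookup("Mueller", names))
print(fuzzy_lookup("Meier", names))
```

Folding handles the diacritic variants, while the similarity cutoff catches spelling variants such as Meier/Meyer; a lower cutoff trades precision for recall.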


Slide_24

Multilinguality Framework for Multilingual Indexing/Search Services

Yet to be developed