2000. Yu.Demchenko. TERENA
Multilingual Issues in Information Retrieval and Resource Description
Slide 2_1
Multilingual Issues in Information Retrieval and Resource Description: Overview
Yuri Demchenko, TERENA
©2000. Yu.Demchenko. TERENA Multilingual Issues in Information Retrieval and Resource Description
Slide_2
In this presentation
• Multilingual Issues in TERENA Technical Programme
• Multilinguality: trends and developments
• Technical Issues/Background
  • Data presentation and resource description format
  • Standards Overview
• Metadata and Cataloging
• Recent Developments in Subject Gateways and Search Engines (SE)
• Cross-language Information Retrieval
• REIS/TAP Initiatives Multilinguality Framework
Slide_3
TERENA Multilingual Community and TERENA Technical Programme
• TERENA has 43 members from 34 countries speaking 30 languages
• Multilingual issues have always been in the scope of the TERENA Technical Programme
  • WG-i18n - WG on Internationalisation issues
  • C3 Project on messaging transliteration tools MAITS - initiated by WG-i18n
  • Multilingual E-Mail Agent Testing
  • Multilingual issues in Subject Gateways, Section 2.13 in SG Handbook
  • Multilingual Support in Internet/IT Applications. Information page - http://www.terena.nl/projects/multiling/
• Liaison with STD bodies
  • CEN/TC304 Character set technology - http://www.stri.is/TC304/default.html
  • IETF
Slide_4
Multilinguality: trends and developments
• Storing, processing, presentation and exchange of information in many languages
• Interactive (protocol-based/negotiated) applications and non-interactive (resource description and information presentation)
• Multilingual Search and Retrieval
  • Multilingual Subject Gateways and Search Engines
  • CLIR testing at TREC
• Data Resource Model and Multilinguality
  • One or multiple languages
  • Data format
  • Metadata (not part of Data but part of Resource)
  • References, links
  • Professional Thesauri (Resource Context) - base for multiple languages and language unification
Slide_5
Internet Applications
• Non-interactive application: Electronic Mail
  • Correct message composition and rendering
• Interactive applications
  • WWW: HTTP/HTML
    • HTTP header: Content-Type: text/html; charset=euc-jp
    • HTML: <META http-equiv="Content-Type" Content="text/html; charset=euc-jp">
• Content Negotiation Protocol
  • Media features, attributes
  • Direct and hop-by-hop communication
• Operational applications
  • (Internationalised) DNS
  • LDAP and X.500 (language support?)
Slide_6
I18n and ML issues at IETF and other STD bodies
• IETF Architectural Model of Multilingual support in Internet Applications - RFC 2130
• Language and Charset/Encoding tagging
• Content negotiation framework (IETF/W3C)
  • Point-to-point vs hop-by-hop
  • Message based vs Interactive vs Streaming
• Internationalised DNS (IDN) - Internationalised Domain Names
  • vs E-Mail (SMTP, IMAP)
  • vs Routing (Routing Policy Specification Language (RPSL))
  • vs Network Management (SNMP textual presentation)
  • vs Network Security (TLS and IPSec)
• Content Encoding normalisation (IETF/Unicode)
• LSD-2 - Large Scale Services Deployment
• IMAP language extension
Slide_7
IETF Architectural Model of Multilingual support in Internet Applications
• User Interface
  • Presentation, Culture, Locale, Language
• On-the-wire
  • Coded Character Set - Repertoire of ISO-10646
  • Character Encoding Scheme - UTF-8 (ml-text), US-ASCII (e-mail), ISO8859-1
  • Transfer Encoding Scheme (Base64, QP)
• Resolution Service/Directory (content MD)
• Communication/Network
[Figure: two endpoints, each a stack of Language, Presentation/Culture/Locale and Content Transfer Agent, connected by a Communication Protocol; Resource]
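The on-the-wire layers can be illustrated with a minimal Python sketch. It is not taken from RFC 2130; the function name `encode_layers` and the sample text are assumptions made for illustration. It applies a character encoding scheme (UTF-8) to text and then the two transfer encodings named on the slide.

```python
import base64
import quopri


def encode_layers(text: str, charset: str = "utf-8") -> dict:
    """Apply a character encoding scheme, then two transfer encodings."""
    # Character Encoding Scheme: abstract characters -> octets
    octets = text.encode(charset)
    return {
        "octets": octets,
        # Transfer Encoding Schemes: octets -> ASCII-only text
        "base64": base64.b64encode(octets).decode("ascii"),
        "quoted_printable": quopri.encodestring(octets).decode("ascii"),
    }


layers = encode_layers("Bemüh'n")
for name, value in layers.items():
    print(name, value)
```

Both transfer encodings are reversible, so a receiver that knows the charset label can recover the original characters.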
Slide_8
Content Negotiation Framework (IETF/W3C)
• Content Negotiation covers three elements
  • Expressing the capabilities of the sender and the data resource to be transmitted
  • Expressing the capabilities of a receiver
  • A protocol by which capabilities are exchanged
• Abstract framework for content negotiation
          (Content)       (Transmit. data)     (Data document)
  [Author]----->-----[Sender]----->-----[Receiver]----->-----[User]
• Transparent Content Negotiation in HTTP - RFC 2295
• Protocol-independent Content Negotiation Framework - RFC 2703
  • Non-message resource transfer
  • End-to-end vs hop-by-hop negotiation
  • Use of directory and resolution services
• CC/PP exchange protocol based on HTTP Extension Framework (W3C)
  • Composite Capability/Preference Profile: A user side framework for content negotiation
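The three elements of content negotiation can be sketched for the language dimension only. The Python below is a simplified, server-driven illustration and not an implementation of RFC 2295 or CC/PP: the receiver states its capabilities in an HTTP `Accept-Language` value, the sender knows which variants it holds, and the function picks one. The names `parse_accept_language` and `choose_variant` are hypothetical.

```python
def parse_accept_language(header: str) -> list:
    """Return (language-tag, q) pairs, highest preference first."""
    prefs = []
    for item in header.split(","):
        parts = [p.strip() for p in item.split(";")]
        if not parts[0]:
            continue
        q = 1.0  # default quality when no q parameter is given
        for param in parts[1:]:
            if param.startswith("q="):
                try:
                    q = float(param[2:])
                except ValueError:
                    q = 0.0
        prefs.append((parts[0].lower(), q))
    # sorted() is stable, so equal q-values keep the order of the header
    return sorted(prefs, key=lambda pair: -pair[1])


def choose_variant(header: str, available: list):
    """Pick the sender's variant that best fits the receiver's preferences."""
    by_lower = {tag.lower(): tag for tag in available}
    for tag, q in parse_accept_language(header):
        if q <= 0:
            continue
        if tag in by_lower:  # exact match, e.g. en-gb
            return by_lower[tag]
        primary = tag.split("-")[0]
        for low, original in by_lower.items():  # fall back to primary tag
            if low.split("-")[0] == primary:
                return original
    return None


print(choose_variant("sv, en-GB;q=0.8, en;q=0.7", ["en", "de"]))
```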
Slide_9
Charset and Language tagging
• MIME types (RFC 2045-2049)
  • text, image, audio, video
• Charset = Character Set + Character Encoding Scheme
• Transfer Encoding Scheme
  • base64
  • quoted-printable
• Other media attributes and features (e.g., resolution, color, language)
• Language
  • RFC 1766
  • ISO639-2
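As a hedged illustration of these tags on a real message, the following Python sketch uses the standard `email` package to label a text part with a charset, a transfer encoding and a `Content-Language` header. The helper name `build_tagged_message` and the sample sentence are assumptions for the example.

```python
from email.message import EmailMessage


def build_tagged_message(text: str, charset: str, language: str, cte: str) -> EmailMessage:
    """Create a text/plain message carrying charset and language labels."""
    msg = EmailMessage()
    # set_content writes Content-Type (with its charset parameter)
    # and Content-Transfer-Encoding
    msg.set_content(text, charset=charset, cte=cte)
    # Language tag in the RFC 1766 style
    msg["Content-Language"] = language
    return msg


msg = build_tagged_message("durchaus studiert mit heißem Bemüh'n.", "utf-8", "de", "base64")
print(msg.as_string())
```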
Slide_10
WWW: HTTP/HTML
HTTP header includes information about the type of the transferred information and the character encoding for text-based information:
Content-Type: text/html; charset=euc-jp
The Content-Language entity header field describes the natural language(s) of the intended audience for the enclosed document:
Content-Language: se
Character encoding information in the META information of the HTML document:
<META http-equiv="Content-Type" Content="text/html; charset=euc-jp">
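A receiver has to reconcile the charset label in the HTTP header with the one in the document's META information. The Python sketch below shows one way to do that, assuming the HTTP header takes precedence when both are present; the names `charset_from_content_type`, `MetaCharsetParser` and `effective_charset` are hypothetical.

```python
from email.message import Message
from html.parser import HTMLParser


def charset_from_content_type(value):
    """Extract the charset parameter from a Content-Type value, if any."""
    if not value:
        return None
    holder = Message()
    holder["Content-Type"] = value
    return holder.get_content_charset()


class MetaCharsetParser(HTMLParser):
    """Remember the charset of the first META http-equiv=Content-Type tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names
        if tag != "meta" or self.charset is not None:
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("http-equiv", "").lower() == "content-type":
            self.charset = charset_from_content_type(attributes.get("content"))


def effective_charset(http_content_type, html_text):
    """Prefer the HTTP header label, fall back to the META declaration."""
    from_header = charset_from_content_type(http_content_type)
    if from_header:
        return from_header
    parser = MetaCharsetParser()
    parser.feed(html_text)
    return parser.charset


page = '<html><head><META http-equiv="Content-Type" Content="text/html; charset=euc-jp"></head></html>'
print(effective_charset(None, page))
```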
Slide_11
XML: Character Set tagging
• Character is the atomic unit of text
  • All ISO 10646 characters + TAB, CR, LF
• The mechanism for encoding can vary from entity to entity
  • All XML processors must accept UTF-8 and UTF-16
• Character Encoding declaration in XML documents or entities (section 4.3.3)
  EncodingDecl ::= S 'encoding' Eq ('"' EncName '"' | "'" EncName "'")
  <?xml encoding='UTF-8'?>
  <?xml encoding='EUC-JP'?>
• Default Character Set Encoding - UTF-8 and UTF-16
• Autodetection of Character Encoding
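Autodetection can be sketched in a few lines of Python. This is a deliberately simplified version of the idea, not the full procedure from the XML specification: it checks only for a byte order mark and an ASCII-compatible encoding declaration, and otherwise assumes the UTF-8 default. The function name `sniff_xml_encoding` is hypothetical.

```python
import codecs
import re

# Matches the encoding pseudo-attribute of a declaration at the start of the data
_DECLARATION = re.compile(
    rb"""^<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def sniff_xml_encoding(data: bytes) -> str:
    """Guess the encoding of an XML entity from its first bytes."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    match = _DECLARATION.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    return "utf-8"  # default when nothing is declared


print(sniff_xml_encoding(b"<?xml version='1.0' encoding='EUC-JP'?><a/>"))
```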
Slide_12
XML: Language tagging
• Language identification (section 2.12)
  • Labelling language of the whole document, entity or item
• Tag for identification of languages
  LanguageID ::= Langcode ('-' Subcode)*
  Langcode ::= ISO639Code | IanaCode | UserCode
• Examples:
<p xml:lang="en">The quick brown fox jumps over the lazy dog.</p>
<p xml:lang="en-GB">What colour is it?</p>
<p xml:lang="en-US">What color is it?</p>
<sp who="Faust" desc='leise' xml:lang="de">
<l>Habe nun, ach! Philosophie,</l>
<l>Juristerei, und Medizin</l>
<l>und leider auch Theologie</l>
<l>durchaus studiert mit heißem Bemüh'n.</l>
</sp>
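An `xml:lang` label applies to the element that carries it and to its content unless a descendant overrides it. The Python sketch below resolves the effective language of each element with the standard `xml.etree.ElementTree` module; the wrapper element `doc` and the function name `effective_languages` are assumptions for the example.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

SAMPLE = """<doc xml:lang="en">
  <p>The quick brown fox jumps over the lazy dog.</p>
  <p xml:lang="en-GB">What colour is it?</p>
  <sp who="Faust" desc="leise" xml:lang="de">
    <l>Habe nun, ach! Philosophie,</l>
  </sp>
</doc>"""


def effective_languages(element, inherited=None):
    """Yield (tag, language) for an element and all of its descendants."""
    language = element.get(XML_LANG, inherited)
    yield element.tag, language
    for child in element:
        yield from effective_languages(child, language)


resolved = list(effective_languages(ET.fromstring(SAMPLE)))
for tag, language in resolved:
    print(tag, language)
```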
Slide_13
Unicode Technical Reports
• The Unicode Standard, Version 3.0 - Just published! - http://www.unicode.org/unicode/uni2book/u2.html
• Unicode 2.0 test page - http://www.terena.nl/projects/multiling/euroml/tests/test-ucspages1ucs.html
• Multilingual European Subsets of ISO/IEC 10646-1 - http://www.stri.is/TC304/p10_1998_05_30.pdf
• Unicode Technical Reports
  • UTR #15: Unicode Normalization Forms, Version 18.0
    • I-D by Martin Duerst
  • UTR #17: Character Encoding Model
  • UTR #16: UTF-EBCDIC
  • UTR #10: Unicode Collation Algorithm
  • UTR #7: Plane 14 Characters for Language Tags
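Normalization matters for retrieval because the same visible text can be stored as different code point sequences. A minimal Python sketch using the standard `unicodedata` module shows the effect; the helper name `same_text` is hypothetical.

```python
import unicodedata


def same_text(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after bringing both to one normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


precomposed = "Bem\u00fch'n"   # u-umlaut as a single code point
decomposed = "Bemu\u0308h'n"   # u followed by a combining diaeresis

print(precomposed == decomposed)
print(same_text(precomposed, decomposed))
```

An index that stores and queries text in one agreed form avoids missing matches of this kind.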
Slide_14
Language Definition in DC Metadata set - DC.Language Format
<meta name = "DC.Language" content = "en">
<meta name = "DC.Language" scheme = "rfc1766" content = "en">
<meta name = "DC.Language" scheme = "ISO639-2" content = "eng">
<meta name = "DC.Language" scheme = "rfc1766" content = "en-US">
<meta name = "DC.Language" content = "zh">
<meta name = "DC.Language" content = "ja">
<meta name = "DC.Language" content = "es">
<meta name = "DC.Language" content = "german">
<meta name = "DC.Language" lang = "fr" content = "allemand">
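Because DC.Language values may follow different schemes or be free text, a harvester usually has to map them to one form before indexing. The Python sketch below is one possible approach, not part of the DC specification: the lookup tables are a small illustrative subset, and the function name `normalise_dc_language` is hypothetical.

```python
# Illustrative subset only: ISO 639-2 three-letter codes -> two-letter codes
ISO639_2_TO_1 = {
    "eng": "en", "zho": "zh", "chi": "zh", "jpn": "ja",
    "spa": "es", "deu": "de", "ger": "de", "fra": "fr", "fre": "fr",
}

# Free-text language names, keyed by (language of the name, name)
LANGUAGE_NAMES = {("en", "german"): "de", ("fr", "allemand"): "de"}


def normalise_dc_language(content, scheme=None, lang="en"):
    """Return a two-letter primary language tag, or None if unknown."""
    value = content.strip().lower()
    if scheme and scheme.lower() == "iso639-2":
        return ISO639_2_TO_1.get(value)
    primary = value.split("-")[0]
    if len(primary) == 2 and primary.isalpha():
        return primary  # already an RFC 1766 style tag such as en-US
    return LANGUAGE_NAMES.get((lang.lower(), value))


print(normalise_dc_language("eng", scheme="ISO639-2"))
print(normalise_dc_language("allemand", lang="fr"))
```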
Slide_15
Language Definition in DC Metadata set - Field content language labelling/attributing
A work in Spanish may be assigned the following metadata:
<meta name = "DC.Language" scheme = "rfc1766" content = "es">
<meta name = "DC.Title"
lang = "es"
content = "La Mesa Verde y la Silla Roja">
<meta name = "DC.Title"
lang = "en"
content = "The Green Table and the Red Chair">
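Metadata of this shape can be read back with an ordinary HTML parser. The following Python sketch collects DC meta elements and groups the titles by their `lang` attribute; the class name `DCMetaParser` and the function `titles_by_language` are assumptions for the example.

```python
from html.parser import HTMLParser

SAMPLE = """
<meta name = "DC.Language" scheme = "rfc1766" content = "es">
<meta name = "DC.Title"
      lang = "es"
      content = "La Mesa Verde y la Silla Roja">
<meta name = "DC.Title"
      lang = "en"
      content = "The Green Table and the Red Chair">
"""


class DCMetaParser(HTMLParser):
    """Collect every <meta> element whose name starts with 'DC.'."""

    def __init__(self):
        super().__init__()
        self.records = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name") or ""
        if name.lower().startswith("dc."):
            self.records.append({
                "element": name[3:].lower(),
                "lang": attributes.get("lang"),
                "scheme": attributes.get("scheme"),
                "content": attributes.get("content"),
            })


def titles_by_language(html_text: str) -> dict:
    """Map each title's language label to the title text."""
    parser = DCMetaParser()
    parser.feed(html_text)
    return {r["lang"]: r["content"] for r in parser.records if r["element"] == "title"}


titles = titles_by_language(SAMPLE)
print(titles)
```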
Slide_16
DC in Multiple Languages
• The reference language of the international DC community is English; however, the semantics of DC elements can in principle be expressed equally well in any modern language
• The versions of DC elements in various languages should share a single namespace using tokens that look like English words but stand for universal elements - http://purl.org/dc/elements/1.1/
• DC in Multiple Languages Registry project - http://purl.org/dc/groups/languages.htm
  • Uses RDF schemas to share machine-readable tokens for translation of DC terms in multiple languages (26 languages to date)
  • Linkage to and from central DC namespace server
  • Registry as Dictionary/Thesauri - use Interlinguas to link different translations
• Formal recognition and standardization procedure
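The idea of one shared token with labels in many languages can be sketched as a small lookup structure. This Python example is an illustration only: the labels are ordinary translations chosen for the example, not entries copied from the registry, and the function names are hypothetical.

```python
DC_NAMESPACE = "http://purl.org/dc/elements/1.1/"

# token -> {language tag: human-readable label}; illustrative values
LABELS = {
    "title": {"en": "Title", "de": "Titel", "es": "Título", "fr": "Titre"},
    "language": {"en": "Language", "de": "Sprache", "es": "Idioma", "fr": "Langue"},
}


def token_uri(token: str) -> str:
    """The language-independent identifier shared by all translations."""
    return DC_NAMESPACE + token


def label(token: str, lang: str) -> str:
    """Label of a token in one language, falling back to English."""
    return LABELS[token].get(lang, LABELS[token]["en"])


def token_for(text: str, lang: str):
    """Reverse lookup: which token does a translated label stand for?"""
    for token, labels in LABELS.items():
        if labels.get(lang, "").casefold() == text.casefold():
            return token
    return None


print(token_uri(token_for("Título", "es")))
```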
Slide_17
Document Description with Unqualified DC and RDF syntax
<?xml:namespace ns="http://purl.org/metadata/dublin_core_elements" prefix="DC"?>
<RDF:RDF>
<RDF:Description RDF:HREF="http://www.biblio.de/buecher/kleist.html">
<DC:Title xml:lang="de">Das Erdbeben in Chili</DC:Title>
<DC:Creator>Heinrich von Kleist</DC:Creator>
</RDF:Description>
</RDF:RDF>
XML Encoding (Character set) declaration
UTF-8/UTF-16 as default encoding
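A comparable description can be generated programmatically. The Python sketch below is an assumption-laden illustration: it uses the later RDF/XML conventions (the `rdf:about` attribute and the `http://purl.org/dc/elements/1.1/` namespace) rather than the draft syntax shown on the slide, and the function name `describe` is hypothetical.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def describe(resource_uri: str, title: str, title_lang: str, creator: str) -> bytes:
    """Serialise a two-element DC description with a language-tagged title."""
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("dc", DC_NS)
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    description = ET.SubElement(
        root, f"{{{RDF_NS}}}Description", {f"{{{RDF_NS}}}about": resource_uri}
    )
    title_el = ET.SubElement(
        description, f"{{{DC_NS}}}title", {f"{{{XML_NS}}}lang": title_lang}
    )
    title_el.text = title
    ET.SubElement(description, f"{{{DC_NS}}}creator").text = creator
    # The XML declaration states the character encoding explicitly
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


document = describe(
    "http://www.biblio.de/buecher/kleist.html",
    "Das Erdbeben in Chili", "de", "Heinrich von Kleist",
)
print(document.decode("utf-8"))
```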
Slide_18
Recent Developments in Subject Gateways, Indexing, Searching
NRENs projects
Subject gateways
Commercial Search Engines
Multilingual Text Retrieval and Processing
TUSTEP system - using “fuzzy” multilingual searching
Cross-Language Information Retrieval (CLIR) testing at TREC-7/TREC-8 Conferences by NIST
Slide_19
Multilingual Subject Gateway (DESIRE)
• Developing multilingual subject gateways (SOSIG as example)
  • SOSIG accepts resources in any language, evaluated for quality
  • Translations should be coherent and checked
  • Different language versions should be equally well maintained
• SOSIG Cataloguing rules
  • TITLE will be displayed in the first language
  • ALTERNATIVE TITLE in other languages
  • DESCRIPTION will mention the different languages in which the resource is available
  • URI of all language versions
  • Labelling URI language
• Library standards for multilingual provision
  • NISO Z39.53 Language codes
  • USMARC Language codes
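The cataloguing rules suggest a record shape along the following lines. This Python sketch is not SOSIG's actual schema: the class `CatalogueRecord`, its field names and the example.org URIs are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogueRecord:
    title: str                      # TITLE, in the first language
    title_lang: str
    alternative_titles: dict = field(default_factory=dict)  # lang -> title
    uris: dict = field(default_factory=dict)                # lang -> URI

    def description_note(self) -> str:
        """Sentence for DESCRIPTION naming the available languages."""
        return "Available in: " + ", ".join(sorted(self.uris))


record = CatalogueRecord(
    title="La Mesa Verde y la Silla Roja",
    title_lang="es",
    alternative_titles={"en": "The Green Table and the Red Chair"},
    uris={"es": "http://example.org/es/", "en": "http://example.org/en/"},
)
print(record.description_note())
```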
Slide_20
Multilingual provision in popular Internet Search Engines
• Multilingual SE
  • AltaVista - http://www.altavista.com/ - 28 languages
    • Documents indexed as is
    • Automatic translation - very simple and naive
  • Euroseek - http://www.euroseek.com/ - 30 languages
  • FAST Advanced Search - http://www.alltheweb.com - 31 languages
  • Google - http://www.google.com/ - 11 languages
• Other sites that have dedicated national sites
  • interface language
  • language resources
  • no special language policy
  • Excite - 11 countries
  • Lycos - 23 countries
Slide_21
TUSTEP - TUebingen System of TExt processing Programs
1. File structure
2. Multilingual capabilities
3. Internal data presentation
4. Database publishing/output data presentation
5. CGI
6. Sample implementation http://lddv.zdv.uni-tuebingen.de/cgi-bin/opac/zdvlit
Try entries like Smith or Meier or...
http://lddv.zdv.uni-tuebingen.de/cgi-bin/km/npquery
Slide_22
Cross-Language Information Retrieval (CLIR) testing at TREC-7/TREC-8
• TREC - Text REtrieval Conference - http://trec.nist.gov/
• Cross-Language Information Retrieval (CLIR) technologies
  • Using Intermediary or Interlingual representation
    • Latent Semantic Indexing
    • Generalised Vector Space Model, etc.
  • Computer translation
  • Machine-readable bilingual dictionaries
  • Multilingual thesauri
• Participants: ETH/Eurospider, IBM, Xerox, Cornell, New Mexico Univ, TNO, others
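Of the technologies listed, the dictionary-based one is the easiest to show in miniature. The Python sketch below translates English query terms through a tiny hand-made bilingual dictionary and matches them against German titles. It is a toy under stated assumptions, not a system evaluated at TREC: the dictionary, the documents and the prefix matching (a crude stand-in for stemming) are all invented for the example.

```python
# Toy English -> German dictionary; one source term may have several targets
BILINGUAL = {
    "earthquake": ["erdbeben"],
    "green": ["grün"],
    "table": ["tisch", "tabelle"],
    "chair": ["stuhl"],
}

DOCUMENTS = {
    1: "Das Erdbeben in Chili",
    2: "Der grüne Tisch und der rote Stuhl",
}


def translate_query(query: str) -> list:
    """Replace each known query term by all of its dictionary translations."""
    terms = []
    for word in query.casefold().split():
        terms.extend(BILINGUAL.get(word, []))
    return terms


def search(query: str) -> list:
    """Rank documents by the number of translated terms they contain."""
    terms = translate_query(query)
    scores = {}
    for doc_id, text in DOCUMENTS.items():
        words = text.casefold().split()
        score = sum(1 for t in terms if any(w.startswith(t) for w in words))
        if score:
            scores[doc_id] = score
    return sorted(scores, key=lambda d: (-scores[d], d))


print(search("green table"))
```

Keeping every translation of an ambiguous term, as done for "table", trades precision for recall; real systems weight or disambiguate the alternatives.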
Slide_23
REIS Project/Initiative Multilinguality framework - First attempt
• Multiple language indexing
  • Multiple language documents/indexes
• Cross-language Searching
  • Automatic Query forwarding based on thesauri or ML dictionary
  • Using “fuzzy” multilingual searching/matching
• Multilingual information retrieval
  • Automatic translation (if requested)
  • Translation Request Protocol
• Internal Data/Indexes presentation
  • Language and Character Encoding tagging
  • XML as internal presentation of data and XML language and charset tagging
  • Text/Charset normalisation (Unicode or TUSTEP-like)
• Metadata and Resource Description
  • DC.Language definition and XML/RDF/DC Language tagging
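One way to approximate "fuzzy" multilingual matching is to fold case and diacritics before comparing strings by similarity. The Python sketch below does this with the standard `unicodedata` and `difflib` modules. It is an assumption about what such matching could look like, not a description of how TUSTEP works; the names `fold` and `fuzzy_match` and the cutoff value are hypothetical.

```python
import difflib
import unicodedata


def fold(text: str) -> str:
    """Case-fold and strip combining marks, e.g. 'Título' -> 'titulo'."""
    decomposed = unicodedata.normalize("NFKD", text.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def fuzzy_match(query: str, candidates: list, cutoff: float = 0.75) -> list:
    """Return the candidates whose folded form is close to the folded query."""
    folded = {}
    for candidate in candidates:
        folded.setdefault(fold(candidate), candidate)
    hits = difflib.get_close_matches(fold(query), list(folded), n=5, cutoff=cutoff)
    return [folded[hit] for hit in hits]


print(fuzzy_match("Meier", ["Meyer", "Maier", "Smith"]))
print(fuzzy_match("titulo", ["Título", "Titel"]))
```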
Slide_24
Multilinguality Framework for Multilingual Indexing/Search Services
Yet to be developed