1999. Yu.Demchenko. TERENA

Multilinguality in Indexing, Searching and Metadata

Slide 2_1

Multilinguality and cross-language searching

Multilingual aspects in Indexing, Searching and Metadata (Resource Description)

©1999. Yu.Demchenko. TERENA Multilinguality in Indexing, Searching and Metadata

Slide2_2

Multilingual aspects in Indexing, Searching and Metadata

- IETF Model of Multilingual support in Internet Applications
  - Electronic Mail
  - Interactive applications
- Charset and Language tagging
  - MIME types
  - XML Language and Charset tagging
  - DC language definition
- Metadata and RDF
  - DC.Language
- Existing solutions
  - TUSTEP
  - Search Engines and Subject Gateways
- Multilingual framework for the REIS Project

Slide2_3

IETF Model of Multilingual support in Internet Applications

- Electronic Mail
  - Language
  - Character Encoding Scheme
  - Transfer Encoding Scheme
- Interactive applications
  - WWW: HTTP/HTML
    <META http-equiv="Content-Type" Content="text/html; charset=euc-jp">
  - XML/DOM
  - LDAP and X.500 (?)
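As an illustration of the HTML charset tagging above, here is a minimal Python sketch of how an indexer might read the charset from a META http-equiv tag before decoding a page. The names `MetaCharsetParser` and `sniff_charset`, and the ISO-8859-1 fallback, are assumptions made for this example and are not part of the original slides.

```python
from email.message import Message
from html.parser import HTMLParser


class MetaCharsetParser(HTMLParser):
    """Remembers the charset of the first <META http-equiv="Content-Type"> tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names for us.
        if tag != "meta" or self.charset is not None:
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("http-equiv", "").lower() == "content-type":
            # Reuse the standard MIME header parser for "text/html; charset=...".
            header = Message()
            header["Content-Type"] = attributes.get("content", "")
            self.charset = header.get_content_charset()


def sniff_charset(raw, default="iso-8859-1"):
    """Return the declared charset of an HTML page given as bytes.

    The default is an assumption of this sketch, used when no tag is found.
    """
    parser = MetaCharsetParser()
    # The declaration itself is ASCII, so a lossy ASCII decode is enough to find it.
    parser.feed(raw.decode("ascii", errors="replace"))
    return parser.charset or default
```

Once the charset is known, the page bytes can be decoded with `raw.decode(sniff_charset(raw))` before indexing.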

Slide2_4

XML: Language and Charset tagging

- A character is an atomic unit of text
  - All ISO 10646 characters + TAB, CR, LF
- The mechanism for encoding characters can vary from entity to entity
  - All XML processors must accept UTF-8 and UTF-16
- Character Encoding in Entities (XML 4.3.3)
    EncodingDecl ::= S 'encoding' Eq ('"' EncName '"' | "'" EncName "'")
    <?xml encoding='UTF-8'?>
    <?xml encoding='EUC-JP'?>
- Autodetection of Character Encoding
- Language identification (XML 2.12)
  - Tag for identification of languages
    LanguageID ::= Langcode ('-' Subcode)*
    Langcode ::= ISO639Code | IanaCode | UserCode
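The following Python sketch shows how the two XML tags on this slide might be consumed in practice: the parser honours the encoding declaration, and the code then resolves the xml:lang value in effect for each element, since xml:lang is inherited by child elements. The function name `effective_languages` and the sample document are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def effective_languages(xml_bytes):
    """Return (tag, language, text) for every element, applying xml:lang inheritance."""
    root = ET.fromstring(xml_bytes)  # the parser reads the encoding declaration itself
    result = []

    def walk(element, inherited):
        language = element.get(XML_LANG, inherited)
        result.append((element.tag, language, (element.text or "").strip()))
        for child in element:
            walk(child, language)

    walk(root, None)
    return result


# Sample document (an assumption of this sketch), stored as ISO-8859-1 bytes.
SAMPLE = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>\n'
    '<doc xml:lang="en"><title>The red table</title>'
    '<title xml:lang="es">El a\u00f1o</title></doc>'
).encode("iso-8859-1")
```

An indexer could use the resolved language to route each text fragment to the index for that language.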

Slide2_5

Charset and Language tagging

- MIME types
  - text, image, audio, video
- Charset = Character Set + Character Encoding Scheme
- Transfer Encoding Scheme
  - base64
  - quoted-printable
- Language
  - RFC 1766
  - ISO 639-2
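The three layers above (charset, transfer encoding, language) can be seen together in a single MIME message. This is a minimal sketch using Python's standard email package; the helper name `build_tagged_message` and the sample text are assumptions made for this example.

```python
from email.message import EmailMessage


def build_tagged_message(text, charset, language, transfer_encoding):
    """Build a text/plain MIME part tagged with charset, transfer encoding and language."""
    message = EmailMessage()
    # charset -> Content-Type parameter; cte -> Content-Transfer-Encoding header.
    message.set_content(text, subtype="plain", charset=charset, cte=transfer_encoding)
    # Language tag in RFC 1766 form, e.g. "es".
    message["Content-Language"] = language
    return message


# Sample message (an assumption of this sketch): Spanish text in ISO-8859-1.
SAMPLE_MESSAGE = build_tagged_message(
    "La mesa y la silla roja, a\u00f1o 1999",
    charset="iso-8859-1",
    language="es",
    transfer_encoding="quoted-printable",
)
```

Printing `SAMPLE_MESSAGE.as_string()` shows the Content-Type, Content-Transfer-Encoding and Content-Language headers followed by the encoded body.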

Slide2_6

Language Definition in DC Metadata set

<meta name="DC.language" scheme="rfc1766" content="es">
  (scheme is "rfc1766" or "ISO639-2")

<meta name="DC.title" lang="es" content="La Mesa y Silla Roja">
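A harvester could pick up these Dublin Core tags roughly as follows. This is a sketch only: the class `DublinCoreParser`, the function `extract_dc` and the record layout are assumptions made for this example, not a defined DC API.

```python
from html.parser import HTMLParser


class DublinCoreParser(HTMLParser):
    """Collects <meta name="DC.*"> tags together with their scheme and lang."""

    def __init__(self):
        super().__init__()
        self.records = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name") or ""
        if name.lower().startswith("dc."):
            self.records.append({
                "element": name[3:].lower(),          # e.g. "language", "title"
                "scheme": attributes.get("scheme"),   # e.g. "rfc1766"
                "lang": attributes.get("lang"),       # language of the content value
                "content": attributes.get("content"),
            })


def extract_dc(html):
    """Return the Dublin Core records found in an HTML string."""
    parser = DublinCoreParser()
    parser.feed(html)
    return parser.records
```

The `lang` value describes the language of the metadata text itself, while `DC.language` describes the language of the resource, so the two are kept as separate fields.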

Slide2_7

Multilingual Subject Gateway

Developing multilingual subject gateways (SOSIG as example)
- SOSIG accepts resources in any language, evaluated for quality
- Translations should be coherent and checked
- Different language versions should be equally well maintained

SOSIG cataloguing rules
- TITLE is displayed in the first language
- ALTERNATIVE TITLE gives the title in other languages
- DESCRIPTION mentions the different languages in which the resource is available
- URIs of all language versions
- Labelling each URI with its language

Library standards for multilingual provision
- NISO Z39.53 Language codes
- USMARC Language codes

Slide2_8

Multilingual provision in popular Internet Search Engines

AltaVista
- Search in 25 languages
- Documents indexed as is
- Automatic translation: very simple and naive

Other sites that have dedicated national sites
- interface language
- language resources
- no special language policy
- Euroseek, Excite, Lycos, Infoseek

Slide2_9

New Developments in Subject Gateways, Indexing, Searching

- NREN projects
- Subject gateways
- Commercial Search Engines
- Multilingual Text Retrieval and Processing: TUSTEP system

Slide2_10

NREN projects

Social Science Information Gateway - http://sosig.esrc.bris.ac.uk/

ROADS Project Software/Documentation Server - http://www.roads.lut.ac.uk/

CHIP-Pilot (Clearing House for Internet Projects) - http://www.terena.nl/chip/

IMesh - International Collaboration on Internet Subject Gateways - http://www.desire.org/html/subjectgateways/community/imesh/

DFN Indexing and Searching projects - http://www.dfn.de/links/suchen.html

X.500 Directory E-mail Addresses Search (AMBIX-D) - http://ambix.uni-tuebingen.de:8889

TUSTEP Multilingual Textdata Processing and Fuzzy Searching - http://www.uni-tuebingen.de/zdv/tustep/tdv_eng.html

IKEM Toolkit - http://bikit.rug.ac.be:80/ikem/

DRUID Classification Tools, University of Twente - http://twentyone.tpd.tno.nl/druid/

Slide2_11

Search Engines news

CLEVER project at IBM Almaden Research Center - http://www.almaden.ibm.com/cs/k53/clever.html

Cora Search Engine - http://www.cora.justresearch.com/about.html

Google Search Engine - http://www.google.com/why_use.html

Free AltaVista Search Intranet v2.3A Entry Level Software - http://www.altavista.software.digital.com/search/intranet/free_3k/index.asp

Ultraseek Server for Linux Platforms - http://software.infoseek.com/products/ultraseek/linux/ultrareq.htm

Slide2_12

TUSTEP: TUebingen System of Text Processing Programs

1. File structure

2. Multilingual capabilities

3. Internal data presentation

4. Database publishing/output data presentation

5. CGI

6. Sample implementation http://lddv.zdv.uni-tuebingen.de/cgi-bin/opac/zdvlit

Try entries like Smith or Meier or...

http://lddv.zdv.uni-tuebingen.de/cgi-bin/km/npquery

Slide2_13

TUSTEP: File structure

- TUSTEP can handle basically all kinds of (explicitly or implicitly) structured text files
- Special support for XML
- "Databases" (i.e. files with a repeated and regular structure) are only a special case of this
- Fuzzy search and other retrieval actions can then be used to access the data

Slide2_14

TUSTEP: Multilingual capabilities

TUSTEP supports the following scripts:
- Latin
- Cyrillic
- Greek (classical and modern)
- Hebrew (with support for Yiddish)
- Arabic
- Estrangelo
- Coptic
- Old Church Slavonic

More: phonetics and Egyptian hieroglyphs; allows use of combining diacritics

Experimental: Indic scripts and Armenian

Slide2_15

TUSTEP: Internal data presentation and transformation

Internally, TUSTEP uses a script tagging system with transliteration into ASCII, which allows all data to be encoded in a human-readable and easily transmittable form

TUSTEP has a module for importing from and exporting to UCS (UTF-8 and UTF-16)

Example: #r+Novij rafiqnij clovnik ykra^ins^bko%:^i movi#r-

Transformation module allows use of other tagging systems and other transliteration schemes
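To make the idea of script tagging concrete, here is a small Python sketch that splits a tagged ASCII string into script runs. It is not TUSTEP code: the only grounded detail is the `#r+ ... #r-` pair in the example above. The general tag pattern `#x+` / `#x-` (one lower-case letter per script), the `latin` default and the function name `split_script_runs` are assumptions made for this example, and no transliteration table is attempted.

```python
import re

# Assumed tag syntax: "#" + one lower-case letter + "+" (open) or "-" (close).
TAG = re.compile(r"#([a-z])([+-])")


def split_script_runs(text, default="latin"):
    """Split tagged text into (script, segment) runs, dropping the tags themselves."""
    runs = []
    current = default
    position = 0
    for match in TAG.finditer(text):
        if match.start() > position:
            runs.append((current, text[position:match.start()]))
        # An opening tag switches script; a closing tag returns to the default.
        current = match.group(1) if match.group(2) == "+" else default
        position = match.end()
    if position < len(text):
        runs.append((current, text[position:]))
    return runs
```

Each run could then be handed to a script-specific transliteration step before export to UTF-8 or UTF-16.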

Slide2_16

TUSTEP: Database publishing

TUSTEP's typesetting module offers a high-quality, fast and easy way of publishing all or part of the database in paper (or PDF) form

Slide2_17

TUSTEP: CGI

- Complete control over input and output forms
- Possibility to configure exactly the kind of search(es), e.g.
  - exact matches only
  - SoundEX
  - "intelligent" fuzzy search
  - "brute" fuzzy search that allows a number of different letters
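The search modes listed above can be illustrated with a short Python sketch. It is not TUSTEP's implementation: it uses the standard American Soundex code for the SoundEX mode and plain edit distance for the "brute" mode, and it leaves out the "intelligent" fuzzy search because the slides do not describe it. The function names and the `max_diff` parameter are assumptions made for this example.

```python
# American Soundex digit groups.
CODES = {
    letter: digit
    for digit, letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                           "4": "l", "5": "mn", "6": "r"}.items()
    for letter in letters
}


def soundex(name):
    """Return the four-character Soundex code of a name ('' for empty input)."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    digits = []
    previous = CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(c, "")  # vowels and unknown letters have no code
        if code and code != previous:
            digits.append(code)
        previous = code
    return (letters[0].upper() + "".join(digits) + "000")[:4]


def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


def search(names, query, mode="exact", max_diff=1):
    """Filter names by 'exact', 'soundex' or 'brute' (edit distance <= max_diff)."""
    if mode == "exact":
        return [n for n in names if n == query]
    if mode == "soundex":
        return [n for n in names if soundex(n) == soundex(query)]
    if mode == "brute":
        return [n for n in names
                if edit_distance(n.lower(), query.lower()) <= max_diff]
    raise ValueError("unknown search mode: " + mode)
```

With names such as Smith and Meier, the Soundex mode groups spelling variants that sound alike, while the brute mode only tolerates a fixed number of differing letters.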

Slide2_18

Multilinguality framework of the project

- Multiple language indexing
  - multiple language documents/indexes
- Cross-language Searching
  - Multiple language indexes/documents
  - Automatic Query forwarding based on thesauri
- Automatic translation
  - Multilingual information retrieval
  - Translation Request Protocol
- Language and Character Encoding tagging
  - XML as internal presentation of data
  - Using XML language and charset tagging
- Metadata
  - DC.Language definition
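Thesaurus-based query forwarding across language-specific indexes could look roughly like the Python sketch below. Everything in it is hypothetical: the toy thesaurus, the per-language indexes, the document identifiers and the function names are assumptions made for this example and do not describe the project's actual design.

```python
# Hypothetical thesaurus: concept -> preferred term per language (RFC 1766 tags).
THESAURUS = {
    "table": {"en": "table", "es": "mesa", "de": "tisch"},
    "chair": {"en": "chair", "es": "silla", "de": "stuhl"},
}

# Hypothetical per-language indexes: language -> term -> document identifiers.
INDEXES = {
    "en": {"table": ["doc-en-1"], "chair": ["doc-en-2"]},
    "es": {"mesa": ["doc-es-1"], "silla": ["doc-es-1"]},
    "de": {"tisch": ["doc-de-1"]},
}


def translate_term(term, source_language):
    """Map a query term to its equivalents in all thesaurus languages."""
    term = term.lower()
    for variants in THESAURUS.values():
        if variants.get(source_language) == term:
            return variants
    # Unknown terms are only searched in their own language.
    return {source_language: term}


def cross_language_search(term, source_language):
    """Forward the query to every language index and collect hits per language."""
    hits = {}
    for language, local_term in translate_term(term, source_language).items():
        documents = INDEXES.get(language, {}).get(local_term, [])
        if documents:
            hits[language] = list(documents)
    return hits
```

Keeping one index per language means each index can use language-specific processing, and the thesaurus is the only component that has to know about more than one language.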
