Multilingual Information Retrieval
TRANSCRIPT
-
7/31/2019 Mutilingual Iformation Retrieval
1/15
Multilingual Information
Retrieval
by
T.Mehbub Basha
-
Overview
Introduction
Document Preprocessing
Monolingual Information Retrieval
-
Introduction
Information retrieval is concerned with satisfying the information needs of
users, for example by retrieving documents.
The websites of the World Wide Web (WWW) require efficient approaches to retrieve relevant subsets
for specific information needs.
With a constantly increasing number of information
items, the retrieval techniques applied to Web search
must be adapted to these new scenarios.
-
Why do we need multilingual retrieval?
-
Websites, social networks and personal emails
are written in different languages (English speakers
account for 27.3% of Internet users; statistics last accessed November
16, 2010).
People from different nations and languages are connected in social networks.
Internet usage statistics as presented in Figure
1.1 show that only about one fourth of the Internet users
are native English speakers.
-
Figure 1.1: Statistics of the number of Internet users by language
-
Cont.
Many information retrieval approaches are based
on Machine Translation (MT) systems. However,
these systems still have high error rates (e.g.
errors in grammar and meaning).
This motivates the development of multilingual
retrieval methods that do not depend on MT or
are at least able to compensate for errors introduced
by the translation systems.
-
DEFINITION OF INFORMATION RETRIEVAL:
Given a collection D containing information items
d_i and a keyword query q representing an
information need, IR is defined as the task of
retrieving a ranked list of information items d_1,
d_2, . . . sorted by their relevance with respect to the
specified information need.
The overall search process is visualized in Figure
II.1. This process consists of two parts:
1. Indexing part
2. Search part
-
The indexing part processes the entire document
collection to build index structures. Each document is thereby preprocessed and mapped
to a vector representation.
The search part is based on the same
preprocessing step, which is also applied to the
query. Using the vector representation of the
query, the matching algorithm determines
relevant documents, which are then returned as
ranked results.
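The two parts can be illustrated with a minimal sketch. It assumes term-frequency vectors and cosine similarity as the matching algorithm; neither choice is stated on the slides, and the function names and sample documents are hypothetical.

```python
import math
from collections import Counter


def preprocess(text):
    """Lowercase and split on whitespace (a stand-in for real preprocessing)."""
    return text.lower().split()


def to_vector(text):
    """Map a text to a sparse term-frequency vector (assumed representation)."""
    return Counter(preprocess(text))


def cosine(u, v):
    """Cosine similarity between two sparse vectors (assumed matching function)."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def build_index(documents):
    """Indexing part: preprocess every document once and store its vector."""
    return {doc_id: to_vector(text) for doc_id, text in documents.items()}


def search(index, query):
    """Search part: apply the same preprocessing to the query, then match and rank."""
    q = to_vector(query)
    scored = [(doc_id, cosine(q, vec)) for doc_id, vec in index.items()]
    hits = [(doc_id, score) for doc_id, score in scored if score > 0]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy collection D
documents = {
    "d1": "information retrieval on the web",
    "d2": "machine translation systems",
    "d3": "multilingual information retrieval",
}
index = build_index(documents)
results = search(index, "multilingual retrieval")
for doc_id, score in results:
    print(doc_id, round(score, 3))
```

Here d3 ranks above d1 because it shares both query terms, while d2 shares none and is not returned.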
-
In the monolingual case, the content of the information
items d_i and the keyword query q are
written in the same language.
In cross-lingual and multilingual IR, the information
need and the corresponding query of the user may be formulated in a language other than the
one in which the documents are written.
-
Introduction
Document Preprocessing
Monolingual Information Retrieval
-
Preprocessing takes a set of raw documents as
input and produces a set of tokens as output.
Depending on language, script and other factors,
the process for identifying terms can differ
substantially.
For Western European languages, terms used in
IR systems are often defined by the words of
these languages. But for Chinese, words are not
separated by whitespace. Using character
sequences as terms avoids the problem of detecting word
borders.
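The difference between the two term definitions can be sketched as follows. The function names are hypothetical, and using overlapping sequences of two characters (bigrams) is an assumption, since the slides do not fix a sequence length.

```python
def word_tokens(text):
    """Whitespace tokenization, suitable when words are separated by spaces."""
    return text.split()


def char_ngrams(text, n=2):
    """Overlapping character sequences of length n, ignoring whitespace.

    No word border detection is needed, which is why this is an option
    for scripts such as Chinese.
    """
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


print(word_tokens("information retrieval systems"))
print(char_ngrams("信息检索"))  # "information retrieval" in Chinese
```

The character-sequence approach trades precision for robustness: it produces terms that are not real words, but it never depends on a segmenter finding the correct word borders.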
-
Common techniques used for document preprocessing:
handling document syntax and encoding, tokenization, and normalization of
tokens
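A minimal sketch of how decoding, tokenization and normalization might be chained is shown here. It assumes UTF-8 input, whitespace tokenization, and case folding plus accent stripping as the normalization; these are illustrative choices rather than steps prescribed by the slides, and document syntax handling (for example, removing markup) is left out.

```python
import unicodedata


def decode(raw, encoding="utf-8"):
    """Encoding step: turn raw bytes into text (UTF-8 is an assumption)."""
    return raw.decode(encoding, errors="replace")


def normalize_token(token):
    """Normalization step: case-fold, decompose, and drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", token.casefold())
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def preprocess_document(raw, encoding="utf-8"):
    """Decode, tokenize on whitespace, and normalize each token."""
    text = decode(raw, encoding)
    return [normalize_token(token) for token in text.split()]


print(preprocess_document("Café RÉSUMÉ".encode("utf-8")))
```

Accent stripping is not appropriate for every language, since it can merge words that differ only in their diacritics, so it should be treated as a per-language decision.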