Information Retrieval for Development Hussein Suleman Digital Libraries Laboratory @ Centre for ICT4D Department of Computer Science University of Cape Town January 2019


Page 1: Information Retrieval for Developmentsigir.org/afirm2019/slides/05. Tuesday - IR for...Information Retrieval for Development Hussein Suleman Digital Libraries Laboratory @ Centre for

Information Retrieval for Development

Hussein Suleman

Digital Libraries Laboratory @ Centre for ICT4D
Department of Computer Science

University of Cape Town

January 2019


Digital Libraries Lab @ Centre for ICT4D

Key Research Question

How do we use Information Retrieval / Data Mining /...

to support Development in Africa?


Outline of Talk

What is Development
What is ICT for Development
Collection Development
African Language IR
Challenges in IR 4 Development
Low Resource Environments
Where to next?
Development Interventions


What is (Human/Socio-economic) Development?


Development Agendas

UN Millennium Development Goals
UN Millennium Declaration
UN Sustainable Development Goals

South Africa
  National Development Plan (2012)
  Growth, Employment and Redistribution (1996)
  Reconstruction and Development Programme (1994)

Africa-wide
  New Partnership for Africa's Development (NEPAD)

...


UN Millennium Development Goals


SA National Development Plan 2012-2030

The creation of jobs and the development of the economy
Development of the economic infrastructure: coal and gas, water, electricity and telecommunications
Environmental sustainability and management of environmental resources
Development of an inclusive rural economy
Regional and international trade
Housing and urban/rural planning
Education and training
Medical care
Safety and security
Building capacity for a developmental state
Fighting corruption
Nation building for a unified society


Programme of the Austrian Federal Govt 2008-2013


Nigeria Vision 20:2020


Zambia 7th National Dev Plan


The Decolonisation Debates

How do we decolonise African society?
Different knowledge systems?
ICT?
  Do we do ICT differently?
  Do we need a programming language with keywords in isiZulu?
  Do we teach programming in isiZulu?
Public intellectuals or universal scholars?
Excellence vs. Local Relevance

Why is AFIRM mostly run by people from the Northern Hemisphere?

What do they say: Ngũgĩ wa Thiong'o, Mahmood Mamdani,...


What is ICT for Development


What is ICT4D: Example 1/4


What is ICT4D: Example 2/4


What is ICT4D: Example 3/4


What is ICT4D: Example 4/4


The Big Question

Can we use ICT to aid human development?

Can we use IR/DM to aid human development?


Challenges: IR for Development


Goal: IR for Human Development

Human Dignity
  Promote the status of local languages.
  Create tools that support local languages.
  Increase presence of local languages.

IR4D
  IR for employment, governance, health, etc.


Challenge 1: IR algorithms

Little algorithmic support in IR/NLP.

Are there language-specific tools/algorithms for African languages?
How well do they work?
How many languages are supported?


Challenge 2: Data

Very little and noisy data.

<1000 Wikipedia documents for some African languages.

How much electronic content do we produce?


Challenge 3: Fuzziness

Unclear language boundaries.

How many languages are there?
How many have been clearly defined?
How many are managed?

What is a language and what is a dialect/accent?


Challenge 4: Digital Divide

Access / Knowledge

How many people understand how to search?

How many people use search?
Do people even have Internet access?


Challenge 5: Many Languages

Multilingualism is the norm.

How many languages do people use?

Are documents/queries in one language or are they mixed?


Challenge 6: Resource Limits

We do not have the resources.

Limited skills among researchers.
Limited bandwidth to access data.
Limited skills among users.
Limited funding for anything.


Collection Development


Corpora

Corpora for African Language IR are rare.
  There are limited corpora for speech recognition, speech synthesis, machine translation (MT), etc.

Very few documents online.
  Wikipedia has <1000 (poor quality) pages in many Bantu languages!
  Lots of out-of-vocabulary (OOV) words, loan words, mixed texts, etc.


Corpora: Language Detection

Meluleki Dube, U/G

Can we successfully determine the language, from among a group of 9 related African languages, of a piece of text? Web page? Tweet?

Trigram modelling and model alignment distance give up to 92% accuracy.
Incorrect predictions scatter by language similarity.
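
The trigram approach can be illustrated with a minimal sketch. It assumes that "model alignment distance" behaves like the rank-order ("out-of-place") distance between character trigram profiles; the measure actually used in the project may differ. The function names, the `TOP_K` cut-off, and the two tiny sample texts are illustrative assumptions only.

```python
from collections import Counter

TOP_K = 300  # assumed profile size; also the cost of a trigram missing from a model


def trigram_profile(text, top_k=TOP_K):
    """Map each of the top_k most frequent character trigrams to its rank."""
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
    ranked = sorted(counts, key=lambda gram: (-counts[gram], gram))[:top_k]
    return {gram: rank for rank, gram in enumerate(ranked)}


def out_of_place(doc_profile, lang_profile, penalty=TOP_K):
    """Sum of rank differences; trigrams unseen in the model cost `penalty`."""
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else penalty
        for gram, rank in doc_profile.items()
    )


def identify(text, models):
    """Return the label of the language model closest to the text."""
    doc_profile = trigram_profile(text)
    return min(models, key=lambda lang: out_of_place(doc_profile, models[lang]))


# Toy training text (illustrative only, far too small for real use).
SAMPLES = {
    "isiZulu": "sawubona unjani ngiyaphila ngiyabonga kakhulu "
               "siyabonga sala kahle hamba kahle",
    "English": "hello how are you i am fine thank you very much "
               "stay well go well",
}
MODELS = {lang: trigram_profile(text) for lang, text in SAMPLES.items()}

print(identify("ngiyabonga kakhulu", MODELS))
print(identify("thank you very much", MODELS))
```

With closely related languages many trigrams are shared, so the distances to neighbouring models are close, which is consistent with errors following language similarity.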


Corpora: Crowdsourcing

Sean Packham, MSc

Parallel corpus in isiXhosa-English.
Will people contribute if money paid is varied or there is no money but only gamification?
Payment is the only criterion!


Corpora: SALANG

Andreas von Holy, Osher Shuman, Alon Bresler, BSc(Hons)

Create a central portal for documents in any SA Bantu language, with gamification, multilingual search, etc.


Corpora: Long-term effects

Jackson Moji, MSc (current)

Does gamification for corpus creation work in the long term?
Will people lose interest?
Will they continue to contribute?
How is intrinsic motivation affected by time?

Extension of SALang project.


African Language IR


Mixed Language IR

Mohammed Mustafa Ali, PhD

Noted that Google is language unaware.
Poor results for mixed queries (queries in multiple languages).
Dominant languages are dominant in results.
Mixed language use is very popular in Africa.

Solution: Examine queries and rerank based on language-based collection weights.
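
One possible reading of this reranking idea is sketched below. It is an assumption-laden illustration, not the method from the thesis: each result's score is scaled by the share of the query written in that document's language, divided by that language's share of the collection, so that a dominant language no longer crowds out the others. The function name, the language codes, and all numbers are hypothetical.

```python
def rerank_by_language(results, query_share, collection_share):
    """Rerank (doc_id, score, lang) tuples using language-based weights.

    query_share:      fraction of query terms per language (assumed given,
                      e.g. from a language identifier run over the query)
    collection_share: fraction of the collection per language
    """
    def adjusted(item):
        _, score, lang = item
        weight = query_share.get(lang, 0.0) / collection_share[lang]
        return score * weight

    return sorted(results, key=adjusted, reverse=True)


# Hypothetical result list for a query that is half English, half isiZulu.
results = [("en1", 0.9, "en"), ("zu1", 0.8, "zu"), ("en2", 0.7, "en")]
query_share = {"en": 0.5, "zu": 0.5}
collection_share = {"en": 0.9, "zu": 0.1}  # English dominates the collection

for doc_id, score, lang in rerank_by_language(results, query_share, collection_share):
    print(doc_id, lang, score)
```

In this toy run the isiZulu document moves to the top because its language is under-represented in the collection but makes up half of the query.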


Bantu Language IR

Search engines in Bantu languages, especially South African languages (isiZulu, isiXhosa, etc.).

Many core IR algorithms are unchanged but some language-specific algorithms are needed:
  Language identification
  Text pre-processing and normalization
  Ranking and reranking


Bantu Language IR: AfriWeb

Nkosana Malumba, Katlego Moukangwe, BSc(Hons)

Zulu Search Engine.
High accuracy in identifying isiZulu vs. English+Italian.
Simple morphological parser outperformed simple stemmer in IR results.
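
To see why morphology-aware normalization can matter, recall that isiZulu marks noun class (and with it singular/plural) using prefixes, so a suffix-only stemmer leaves related forms unconflated. The sketch below is a crude prefix stripper written for illustration; it is not the AfriWeb parser, and the prefix list and `min_stem` threshold are assumptions that will over-strip some words.

```python
# A few isiZulu noun class prefixes (deliberately incomplete, illustrative only).
NOUN_PREFIXES = sorted(
    ["umu", "aba", "imi", "ama", "isi", "izi", "ubu", "uku"],
    key=len,
    reverse=True,
)


def strip_prefix(word, min_stem=2):
    """Remove one leading noun class prefix if enough of the word remains."""
    word = word.lower()
    for prefix in NOUN_PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_stem:
            return word[len(prefix):]
    return word


# Singular/plural pairs map to the same index term after stripping.
for word in ["umuntu", "abantu", "isikole", "izikole"]:
    print(word, "->", strip_prefix(word))
```

A real parser would also handle verb morphology and sound changes at morpheme boundaries, which this sketch ignores.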


Bantu Language IR: Transfer?

Nyasha Katemauswa, U/G
Shona Search Engine.
Can we adapt the isiZulu framework to get better results in chiShona?

Michael Kyeyune, U/G
Xhosa Search Engine.
Can we adapt the isiZulu framework to get better results in isiXhosa?


Bantu Language IR: Similar Language IR

Catherine Chavula, PhD (current); Sinead Urisohn, Andre Lopes, BSc(Hons)

Exploit language similarity for those who can read multiple languages.
Reranking to emphasize language similarity in addition to relevance.
Universal language group text pre-processing, such as stemming.
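
Reranking by language similarity in addition to relevance could, for example, be a linear interpolation of the two signals. The sketch below assumes that form; the `alpha` weight, the similarity table, and the scores are made-up values for illustration, not results from the project.

```python
def rerank_by_similarity(results, user_lang, similarity, alpha=0.7):
    """Combine relevance with similarity between the user's and document's language.

    results:    list of (doc_id, relevance, doc_lang)
    similarity: {(user_lang, doc_lang): value in [0, 1]} (assumed precomputed)
    alpha:      weight on relevance; 1 - alpha goes to language similarity
    """
    def combined(item):
        _, relevance, doc_lang = item
        sim = similarity.get((user_lang, doc_lang), 0.0)
        return alpha * relevance + (1 - alpha) * sim

    return sorted(results, key=combined, reverse=True)


# Hypothetical similarity values for a reader of isiZulu.
similarity = {("zu", "zu"): 1.0, ("zu", "xh"): 0.8, ("zu", "en"): 0.0}
results = [("d1", 0.80, "en"), ("d2", 0.75, "xh"), ("d3", 0.60, "zu")]

for doc_id, relevance, lang in rerank_by_similarity(results, "zu", similarity):
    print(doc_id, lang, relevance)
```

Setting `alpha=1.0` recovers the original relevance ranking, which makes the effect of the similarity term easy to compare.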


Bantu Language IR: kiSwahili

Joseph Telemala, PhD (current)

How do we support Swahili speakers?
Professionals want English for work.
Everyone wants kiSwahili for play.

Who you are and what you are doing dictates query/result expectations.


IR in Low Resource Environments


Bantu Language IR: Speech UI

Morebodi Modise, MSc

Speech-driven mobile search interface in isiXhosa.
Works well, but educated people want English!


|Xam IR

Extinct Khoisan language.

Language used in documenting early South African history/culture (25000 pages of stories).

No Unicode representation.


Digital Bleek and Lloyd Collection


Bleek and Lloyd: Low Resource IR

IR engine within the browser – no network needed.

Only simple transcriptions supported.
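
A network-free search engine of this kind typically ships a prebuilt index as a static file next to the pages. The sketch below shows that idea in Python for consistency with the other examples here (the in-browser part would be JavaScript); the story texts, identifiers, and function names are hypothetical, and the real collection's index format is not described in the slides.

```python
import json
from collections import defaultdict


def build_index(docs):
    """Build an inverted index {term: sorted list of doc ids} ahead of time."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search(index, query):
    """Return ids of documents containing every query term (boolean AND)."""
    postings = [set(index.get(term, ())) for term in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []


# Hypothetical simple transcriptions keyed by story id.
STORIES = {
    "story1": "the moon and the hare",
    "story2": "the lion and the jackal",
    "story3": "the hare and the lion",
}

# The JSON string stands in for a static file distributed with the collection.
index_json = json.dumps(build_index(STORIES))
loaded = json.loads(index_json)
print(search(loaded, "hare lion"))
```

Because the index is plain JSON, it can be loaded from local storage or removable media without any server.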


Bleek and Lloyd: Dictionary

Lebogang Molwantoa, Sanvir Manilal, Kyle Williams, BSc(Hons)

Visual dictionary – pictures of words.
Find meanings of words in stories by image search.


Bleek and Lloyd: Transcription

Kyle Williams, MSc; Ngoni Munyaradzi, MSc

Using machine learning to transcribe |Xam.
  Training data manually generated.
  45% accuracy at best.

Crowdsourcing had 10% better performance.
  Answer determined by agreement among 3 amateur transcribers.
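
Agreement among three transcribers can be sketched as a simple vote. The slides do not say whether agreement means a majority or unanimity, so the `min_agree` parameter below is an assumption, and the transcription strings are placeholders rather than real |Xam text.

```python
from collections import Counter


def consensus(transcriptions, min_agree=2):
    """Return the transcription given by at least min_agree workers, else None."""
    if not transcriptions:
        return None
    text, votes = Counter(transcriptions).most_common(1)[0]
    return text if votes >= min_agree else None


# Placeholder transcriptions of one line by three volunteers.
print(consensus(["kau", "kau", "kou"]))
print(consensus(["kau", "kou", "kao"]))
print(consensus(["kau", "kau", "kou"], min_agree=3))
```

Lines that return `None` would need another round of transcription or expert review.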


Bleek and Lloyd: Text Input

Sunkanmi Olaleye, MSc

Inputting |Xam is non-trivial.
  Diacritics above, below and both; single and multiple characters.
Custom Android keyboards for predictive and directed text entry in |Xam.


IR/DM for Development


IR for Development

Gina Paihama, PhD (current)

How can we give users directed results to address unemployment?
Relevance is more specific here:


DM for Development

Selvas Mwanza, PhD (current)

Can we use Twitter data to evaluate developmental measures in society (e.g., level of free speech)?
We have found an association between what people discuss (politics vs. entertainment) and how.


What next?


Where we are

Some early successes but:
  Too many languages, with
  Too few documents,
  Too few resources (money/users), and
  Too much mixing of languages in queries and documents.

Lots of work still needed
Lots of opportunities for research


questions, comments, ...

http://dl.cs.uct.ac.za/

enkosi
hamba kakuhle
thank you and go well