
A PILOT PROJECT FOR

ICE-MAURITIUS Dolly Koo Tee Fong

BSc Computing and Management 2004-2005

Koo Tee Fong, Dolly

A PILOT PROJECT FOR ICE-MAURITIUS

SUMMARY

The overall objective of this project was to develop a prototype of the Mauritius component of

the International Corpus of English (ICE) to demonstrate feasibility and potential problems

for a larger-scale follow-up project.

In doing so, a proposal was also drafted in accordance with the EPSRC requirements, so that it could potentially be submitted for funding.

The following was achieved in the project:

• Tools and techniques available for corpus development and processing were

investigated and discussed, along with the main ones used by ICE.

• The Mauritius component of ICE, named ICE-Mauritius, was collected and compiled up to 5% of the standard size of an ICE corpus.

• A full work plan was written for a follow-up project to develop a full-scale ICE-lite corpus, consisting not only of English from Mauritius but also from 39 other English-speaking countries.

• Finally, the prototype and the work plan were evaluated by three people who are

experienced and involved in corpus collection and funding application.

ACKNOWLEDGEMENT

I would like to thank my project supervisor and personal tutor, Eric Atwell, for his help and support throughout this project and throughout my third year at Leeds University.

I would also like to thank Gerald Nelson and Serge Sharoff for kindly agreeing to take part in

evaluating the project and for their advice.

Finally, I would like to thank my boyfriend, family and flatmates for their input, support and

encouragement throughout the course of the project.

CONTENTS

1. Introduction
1.1 Aim
1.2 Objectives
1.3 Minimum Requirements
1.4 Deliverables
1.5 Initial Project Schedule

2. Survey of computer technologies for corpus development
2.1 Background to the problem
2.1.1 Introduction
2.1.2 What is a Corpus?
2.1.3 Overview of The International Corpus of English (ICE)
2.1.4 Other Corpora
2.1.5 Reasons for Encoding a Corpus
2.1.6 ICE Corpus Design
2.2 Corpus Collection and Encoding
2.2.1 Collecting Data
2.2.2 Computerising Data
2.2.3 ICE Markup System
2.2.4 Corpus Tagging
2.2.5 Syntactic Parsing
2.3 Annotation Tools
2.3.1 The ICE Markup Assistant
2.3.2 The Different Taggers Available
2.3.3 The ICE Tag Selection System
2.3.4 The ICE Syntactic Marking System
2.3.5 Different Varieties of Syntactic Annotation
2.3.6 The ICE Syntactic Tree Annotator

3. Methodology
3.1 Corpus Design
3.1.1 Methods to be Used
3.1.2 Copyright Issues
3.1.3 Corpus Layout
3.2 Capturing Text in Electronic Format
3.2.1 Computerising Speech
3.2.2 Computerising Written Texts
3.3 Corpus Annotation
3.3.1 Structural Mark-up
3.3.2 Procedure for Annotating the Corpus

4. The Pilot Project
4.1 Collection of Texts
4.1.1 Search Methods
4.1.2 Text Collection
4.1.3 Written Text Classification
4.1.4 Permission Letters
4.1.5 Layout of the Pilot Project
4.2 Corpus Annotation
4.2.1 TEI-Header
4.2.2 Texts Encoding

5. The Proposal
5.1 Funding opportunities
5.1.1 Research at University of Leeds, School of Computing
5.1.2 Introduction to the EPSRC
5.1.3 Eligibility of Investigators
5.1.4 Research Opportunities
5.1.5 How to Apply
5.2 Writing up the proposal
5.2.1 Original Idea
5.2.2 Expansion of Corpus Design
5.2.3 Writing Up Proposal

6. Evaluation
6.1 Product
6.2 Minimum Requirements
6.3 Project Stages
6.4 Planning and Schedule

7. Conclusion

References
APPENDIX A: Personal Experience
APPENDIX B: Markup Symbols
APPENDIX C: Corpus Design Layout
APPENDIX D: List of Texts Collected
APPENDIX E: Sample of the Letters of Copyright
APPENDIX F: Template for the Header
APPENDIX G: Example of Raw Text
APPENDIX H: Examples of Encoded Text
APPENDIX I: First Draft of the Case for Support for ICE-lite
APPENDIX J: EPSRC Application Form
APPENDIX K: Revised Case for Support for the ICE-lite Proposal

A PILOT PROJECT FOR ICE-MAURITIUS

1. Introduction

1.1 Aim

To develop a prototype of the Mauritius component of the International Corpus of English, in order to demonstrate the feasibility of, and identify potential problems for, a larger-scale follow-up project.

1.2 Objectives

The objectives of the project are to:

• Compare and evaluate the different computer technologies available to extend the International Corpus of English to Mauritius English.

• Investigate data sources and instigate data collection for a Mauritius ICE sub-corpus.

• Research infrastructure and data collection methods.

• Investigate the requirements and feasibility of a larger-scale follow-on project to develop a full-scale ICE-Mauritius Corpus.

1.3 Minimum Requirements

The minimum requirements are to:

• Develop a small-scale prototype of the Mauritian Corpus of English.

• Survey computer technologies for corpus development and processing.

The possible extension is:

• A work plan for a follow-up project to develop a full-scale ICE-Mauritius corpus.

1.4 Deliverables

The project deliverables are:

• The project report

• The prototype of the Mauritian Corpus of English

1.5 Initial Project Schedule

The initial project schedule, Schedule 1 below, does not reflect the actual work done: after the assessor’s feedback was received in January, it became clear that the project had to take a new direction, with some changes to the aims and requirements. The new aims and requirements are those given above. This also resulted in a new work plan for the second semester, shown in Schedule 2.

Schedule 1: Schedule before Christmas Break

Dates | Milestones | Tasks
11/10/04 - 22/10/04 | | Identify and specify aim and minimum requirements
22/10/04 - 08/11/04 | Section on Background Research | Research requirements of EPSRC
08/11/04 - 22/11/04 | Section on Background Research | Research the International Corpus of English
08/11/04 - 22/11/04 | Section on Background Research | Research on Mauritius English
15/11/04 - 24/01/05 | Appendix I, J & K | Draft Proposal
29/11/04 - 10/12/04 | Mid-project report | Collate mid-project report
10/12/04 - 22/12/04 | n/a | Research and evaluate other EPSRC training courses available
10/12/04 - 22/12/04 | n/a | Research and evaluate different training/education techniques
10/12/04 - 24/01/05 | n/a | Christmas break, revision and end of semester 1 exams
24/01/05 - 31/01/05 | n/a | Decide on delivery mechanism
31/01/05 - 07/02/05 | n/a | Research and decide on what to include in training course
07/02/05 - 21/02/05 | n/a | Write up tutorial
21/02/05 - 07/03/05 | n/a | Work on improving aspects of tutorial and the draft proposal
08/03/05 | n/a | Give training course to research staff and students
08/02/05 - 18/03/05 | n/a | Collect feedback on training course
18/03/05 - 01/04/05 | n/a | Analyse feedback and evaluate training course
01/04/05 - 18/04/05 | Final Report | Complete final report. Most chapters should be already partially written up, but may need reworking.

Schedule 2: New schedule for second semester

Dates | Milestones | Tasks
24/01/05 - 31/01/05 | Section 1 | Decide on new aims & objectives and design new plan
01/02/05 - 07/02/05 | Section on Background Research (sections 1 & 2) | Research on methods available to extend the ICE corpus to Mauritius
07/02/05 - 09/02/05 | Appendix C and Section 3 | Design layout and text categories of ICE Mauritius
10/02/05 - 17/02/05 | Section 4 and Appendices D & E | Collect sample texts from the Internet & send request for copyright permission
18/02/05 - 25/02/05 | Section 4 and Appendices F, G & H | Annotate corpus
26/02/05 - 28/02/05 | Section 4 & 5 and Appendices I, J & K | Investigate feasibility of ICE-Mauritius
01/03/05 - 18/03/05 | Appendices I, J & K | Draft a proposal for ICE-Mauritius
18/03/05 - 01/04/05 | Section 6 | Evaluate corpus & proposal
01/04/05 - 18/04/05 | Final Report | Complete final report. Most chapters should be already partially written up, but may need reworking.

2. Survey of computer technologies for corpus development

2.1 Background to the problem

2.1.1 Introduction

‘Leeds has a track record for research on computer analysis of English language texts, also known as English Corpus Linguistics. For example, the University has developed a Part-of-Speech analysis system, used on other research projects such as the International Corpus of English (ICE), which includes research teams in fifteen countries where English is the first language or second official language. In many of these English-speaking countries, the national ICE sub-corpus is a recognised resource used in research and teaching’ (Atwell, 2004).

Mauritius is one of the many English-speaking African countries, but there is no Mauritian sub-corpus in ICE. However, the government has started a Cyber City project, the first of a new generation of IT parks in this part of the world. ‘The construction of Ebène CyberCity is a historical milestone towards achieving the Government’s objective of transforming Mauritius into a diversified, high-tech, high income services and knowledge economy’ (BPML, 2004). Given this investment in IT infrastructure, it may be feasible to collect at least some samples of Mauritian English remotely, via the World Wide Web.

Mauritius has been chosen because of the special characteristics of the English used there. English has been used in Mauritius for around 195 years, since the British settlers arrived in 1810, and is the official language of the country (Republic of Mauritius, 2004). At that time, however, slaves were imported from Africa and Madagascar, a large number of labourers were brought from India to work in the sugar cane fields, a small number of Chinese came to trade, and the influence of the French, who had ruled before the British, was still very strong. The languages brought by the different settlers have therefore significantly influenced the official English language of the country, and the mixture of those languages has also resulted in a new language, Creole, which is nowadays spoken by everyone. Although Creole is the most widely spoken language in Mauritius, all official communications and teaching in schools are conducted in English. However, the English used has particular characteristics. For instance, in many official communications and press reports, some French and Creole words are used to emphasise a specific theme. In school textbooks, words from other languages are bound to appear, such as names of individuals or companies written in Chinese or Hindi. Another characteristic worth noting is that people in Mauritius tend to think in their native language and then translate what they want to say into English, keeping the structure and grammar of their native language. This results in great variation in English among the cultures within the country, and in young people learning English differently. Developing a corpus for Mauritius will therefore allow an interesting and useful analysis of the different types of English used, and will hopefully help the government, teaching organisations and the people in general to understand and build up a common standard of English.

The pilot project has therefore investigated what can be done to start data collection for a Mauritian ICE sub-corpus. A proposal for the infrastructure and data collection methods has also been written in line with the EPSRC requirements.

2.1.2 What is a Corpus?

The term ‘corpus’ was traditionally used to designate a body of naturally-occurring or authentic language data, consisting of written texts, spoken texts, or samples of both, in a particular language or language variety. Such a corpus could be used as a basis for linguistic research. In the last thirty-five years, the term has come to describe the electronic form of such language material, which may be processed by computer for purposes such as language research and engineering. This includes the study of all aspects of language, such as syntax, semantics, pragmatics and speech, and more recently lexicography.

Because only small amounts of data were available in the past, many earlier theories and interpretations of linguistic phenomena, although accurate, were too narrow to be applied to the whole set of languages. In addition, the focus was more on language structure than on language use (Leech, 1997a; Al-Sulaiti, 2004).

Due to the recent explosion in technology, corpora have increased dramatically in size, variety and ease of access. The combined use of corpora and computers to study languages has changed the way linguistic phenomena are analysed. Nowadays, corpus linguists analyse the use of structures and investigate the factors that affect our choice of a particular structure. For instance, the factors may relate to the nature of the writing or speech, such as science or literature, or the aim may be to discover typical linguistic patterns in some defined contexts. Because computers can store huge amounts of data, this new view of language analysis becomes more accessible, and hence the computer corpus is fast becoming a universal resource for language research.

2.1.3 Overview of The International Corpus of English (ICE)

ICE was initiated in 1988 by the late Sydney Greenbaum, the then Director of the Survey of

English Usage, UCL. From 1996 to 2001, it was coordinated by Charles Meyer, University of

Massachusetts-Boston and it is now coordinated by Gerald Nelson, who recently returned to UCL

from the University of Hong Kong (Nelson et al., 2002). The ICE’s primary aim is to collect

material for comparative studies of English worldwide and its long-term aim is to produce up to

twenty-one million-word corpora. Around the world there are fifteen research teams, shown in

Table 1 below, who are preparing electronic corpora of their own national or regional variety of

English (UCL, 2002). Five other ICE projects for Cameroon, Fiji, Ghana, Nigeria and Sierra

Leone have been considered but no text has been collected yet.

Table 1: Components of the ICE project

Australia | Great Britain | Ireland | New Zealand | South Africa
Canada | Hong Kong | Jamaica | Philippines | Sri Lanka
East Africa | India | Malaysia | Singapore | USA

Each ICE corpus consists of one million written and spoken words of English. Each team closely follows the same corpus design, as well as the same scheme for grammatical annotation, so as to ensure compatibility between the individual corpora in ICE (UCL, 2002). The texts in the corpus date from 1990 or later. The authors and speakers of the texts are aged 18 or over, were either born in the country in whose corpus they are included or moved there at an early age, and were educated through the medium of English in that country (UCL, 2002). The corpora in ICE are being annotated at various levels to enhance their value in linguistic research. These levels are Textual Mark-up, Word-class Tagging and Syntactic Parsing.

Despite attempts to achieve conformity, complete identity between corpora is not possible. There are inevitable differences in the samples taken for speech and writing, and in some countries certain categories are difficult to obtain. Information about each author or speaker may also be unavailable, and the projects have different start dates, which results in discrepancies in the dates of the texts. However, for global comparisons, the corpora are similar enough to justify any analysis carried out (Greenbaum, 1996).

2.1.4 Other corpora

The Brown Corpus of Standard American English (Kučera and Francis, 1967) was the first modern electronically readable corpus to be developed. The corpus consists of one million words of American English texts printed in 1961, sampled in different proportions from 15 text categories, including press, skills and hobbies, religion and fiction. Compared with the various corpora available today, the Brown Corpus is considered small. However, it is still used in teaching and as a model for the development of other corpora.

The British National Corpus (BNC), completed in 1994, is a much larger corpus (Leech, 1997a). It consists of 100 million words and contains both written and spoken material. In addition to the British and American English corpora, there are corpora of other varieties of English, such as the Australian Corpus of English (ACE), the Finland Corpus of Early English Correspondence Sampler and others (Breyer, 2005). Many other corpora have also been developed for different languages, such as the Czech National Corpus, CORIS (an Italian corpus) and the French Corpus. These corpora are for general use in linguistics research. There are also more specialised corpora, such as the Air Traffic Control (ATC) corpus and the Trains Spoken Dialogue Corpus (Al-Sulaiti, 2004).

Modern-day corpora are of various types and the difference in their composition depends on their

use. Balanced corpora like the Brown Corpus, which includes different types of written English,

are more valued by individuals who are interested in linguistic description and analysis. In other

corpora, size may be more important than balance. One such example is the Penn Treebank. In

this case, linguists are more interested in the computational aspects of the corpus, involving

research in natural language processing. For instance, these types of corpora have been used in

the development of taggers and parsers (Meyer, 2002).

It is useful to note that part of the British component of the ICE and the BNC have been funded by

the Engineering and Physical Sciences Research Council (EPSRC) in the UK and the Brown

Corpus has been funded by the equivalent of the EPSRC in America.

2.1.5 Reasons for encoding a Corpus

Whether written or spoken, a corpus can be one of two types: a raw corpus or an annotated (marked-up) corpus. The former is the natural text itself with no additional information; in the latter, the text is “enriched with a variety of information” (Al-Sulaiti, 2004). Although raw corpora can be analysed linguistically with the help of tools, annotated corpora support better analysis.

Leech (1997a) has identified the following advantages of annotated corpora:

• Extracting information: a piece of language can have various meanings and uses in its orthographic form; for instance, the word ‘left’ can be a noun, an adjective or a verb. Extracting information therefore becomes easier and more efficient if the corpus is grammatically tagged, since each occurrence of ‘left’ will be accompanied by a label indicating its type.

• Re-usability: once the corpus has been annotated, it can be handed on to other users. This is a valuable advantage, since corpus annotation is usually an expensive and time-consuming process.

• Multi-functionality: annotation adds overt linguistic information to a corpus, which makes it useful for a multitude of purposes.
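
The ‘left’ example can be illustrated with a minimal Python sketch. The tiny tagged sample, the tag labels (N, ADJ, V and so on) and the function name occurrences are all invented for illustration; they are not the ICE tagset or any ICE tool.

```python
# A tiny hand-tagged sample: each token is a (word, tag) pair.
# The tags are illustrative labels only, not the ICE tagset.
TAGGED_SAMPLE = [
    ("she", "PRON"), ("left", "V"), ("the", "DET"), ("room", "N"),
    ("turn", "V"), ("to", "PREP"), ("the", "DET"), ("left", "N"),
    ("his", "PRON"), ("left", "ADJ"), ("hand", "N"),
]


def occurrences(tagged_tokens, word, tag=None):
    """Return the positions of `word`, optionally restricted to one tag."""
    return [
        position
        for position, (token, token_tag) in enumerate(tagged_tokens)
        if token.lower() == word.lower() and (tag is None or token_tag == tag)
    ]


# Search on the word form alone: every 'left', whatever its word class.
all_left = occurrences(TAGGED_SAMPLE, "left")

# Search using the annotation: only the occurrences where 'left' is a verb.
verb_left = occurrences(TAGGED_SAMPLE, "left", tag="V")
```

Without the tags, a search on the word form alone cannot separate the verb from the noun and the adjective; with them, the distinction is a one-argument filter.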

2.1.6 ICE Corpus Design

Length of Corpus

Meyer (2002) has stated that the first questions to ask when designing a corpus are:

1. “What will be the overall length of the corpus?” The longer the corpus, the better, but it has to be feasible.

2. “How long the corpus needs to be to permit the kinds of studies one envisions for it?”

The standard requirement of ICE is that the core corpus should contain a total of one million written and spoken words of English. However, some regions might want to collect more material in certain text categories or to include additional categories, depending on their needs (Greenbaum, 1991b).

Type of genres

Again, Meyer (2002) has raised an important question: “Why these genres and not others?” To

answer this question, we have to consider the different types of corpora that have been created and

the purpose of each one. As mentioned above, some corpora are multi-purpose, namely the BNC

and the ICE Corpus, which means that they are intended to be used for a variety of different

purposes and therefore these corpora need to contain a broad range of genres. However, the

multi-purpose corpora do not always cover a full representation of all genres. Therefore, special

genres need to be collected for special-purpose corpora such as the Michigan Corpus of Academic

Spoken English (MICASE), which is used to study the type of speech used by individuals

conversing in an academic setting (Meyer, 2002).

The ICE corpus is usually divided in the ratio 60:40 between spoken and written English. Within each part a distinction is made between private (conversation or letter) and public (news report or lecture). Both the private and public sections can be further divided into monologue and dialogue for speech, and into scripted, non-printed and printed for written texts (Greenbaum, 1991b). Below are the typical ICE Text Categories, taken from the ICE website.

Table 2: ICE Text Categories

Numbers in brackets indicate the number of 2,000-word texts in each category.

Spoken (300)
  Dialogues (180)
    Private (100): Conversations (90), Phone calls (10)
    Public (80): Class Lessons (20), Broadcast Discussions (20), Broadcast Interviews (10), Parliamentary Debates (10), Cross-examinations (10), Business Transactions (10)
  Monologues (120)
    Unscripted (70): Commentaries (20), Unscripted Speeches (30), Demonstrations (10), Legal Presentations (10)
    Scripted (50): Broadcast News (20), Broadcast Talks (20), Non-broadcast Talks (10)

Written (200)
  Non-printed (50)
    Student Writing (20): Student Essays (10), Exam Scripts (10)
    Letters (30): Social Letters (15), Business Letters (15)
  Printed (150)
    Academic (40): Humanities (10), Social Sciences (10), Natural Sciences (10), Technology (10)
    Popular (40): Humanities (10), Social Sciences (10), Natural Sciences (10), Technology (10)
    Reportage (20): Press reports (20)
    Instructional (20): Administrative Writing (10), Skills/hobbies (10)
    Persuasive (10): Editorials (10)
    Creative (20): Novels (20)
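
As a cross-check on Table 2, the category counts can be written as a nested Python dictionary and summed. This is only a sketch: the names ICE_DESIGN, WORDS_PER_TEXT and count_texts are hypothetical, and the counts are simply those listed in the table.

```python
# Number of 2,000-word texts per category, as listed in Table 2.
ICE_DESIGN = {
    "Spoken": {
        "Dialogues": {
            "Private": {"Conversations": 90, "Phone calls": 10},
            "Public": {
                "Class Lessons": 20, "Broadcast Discussions": 20,
                "Broadcast Interviews": 10, "Parliamentary Debates": 10,
                "Cross-examinations": 10, "Business Transactions": 10,
            },
        },
        "Monologues": {
            "Unscripted": {
                "Commentaries": 20, "Unscripted Speeches": 30,
                "Demonstrations": 10, "Legal Presentations": 10,
            },
            "Scripted": {
                "Broadcast News": 20, "Broadcast Talks": 20,
                "Non-broadcast Talks": 10,
            },
        },
    },
    "Written": {
        "Non-printed": {
            "Student Writing": {"Student Essays": 10, "Exam Scripts": 10},
            "Letters": {"Social Letters": 15, "Business Letters": 15},
        },
        "Printed": {
            "Academic": {"Humanities": 10, "Social Sciences": 10,
                         "Natural Sciences": 10, "Technology": 10},
            "Popular": {"Humanities": 10, "Social Sciences": 10,
                        "Natural Sciences": 10, "Technology": 10},
            "Reportage": {"Press reports": 20},
            "Instructional": {"Administrative Writing": 10,
                              "Skills/hobbies": 10},
            "Persuasive": {"Editorials": 10},
            "Creative": {"Novels": 20},
        },
    },
}

WORDS_PER_TEXT = 2000  # approximate length of each ICE text


def count_texts(node):
    """Sum the text counts of a category and all of its sub-categories."""
    if isinstance(node, int):
        return node
    return sum(count_texts(child) for child in node.values())


spoken_texts = count_texts(ICE_DESIGN["Spoken"])
written_texts = count_texts(ICE_DESIGN["Written"])
total_words = (spoken_texts + written_texts) * WORDS_PER_TEXT
spoken_share = spoken_texts / (spoken_texts + written_texts)
```

Summing the leaves reproduces the bracketed subtotals in the table, the 60:40 split between speech and writing, and the one-million-word total.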

Length of individual text samples

Each text in the corpus contains about 2,000 words (UCL, 2002), following the sample-size norms of the pioneering Brown and LOB corpora. There are therefore 500 texts in each regional corpus, with 10 texts (20,000 words) as the minimum for each text category. Since most corpora contain relatively short samples of text, text fragments rather than complete texts tend to be stored. Ideally, it would be better to include complete texts in the corpora, but the length of the texts is one of the main reasons why this is not possible. For instance, a book is too long and would take up the whole corpus if it were used in full. If only part of a text is used, the 2,000-word sample can be chosen from any part of the text. In existing ICE corpora, many samples also consist of composite texts, that is, a series of complete short texts that total 2,000 words in length (UCL, 2002; Meyer, 2002). These often include personal letters, which are usually less than 2,000 words long.
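
The two ways of reaching a 2,000-word sample described above, an extract from a longer text and a composite of complete short texts, can be sketched as follows. The function names are hypothetical, and splitting on whitespace is an assumed simplification; the word-counting conventions actually used by ICE teams may differ.

```python
SAMPLE_SIZE = 2000  # approximate number of words per ICE text


def extract_sample(text, start_word=0, size=SAMPLE_SIZE):
    """Return up to `size` consecutive words of `text`, from `start_word` on."""
    words = text.split()
    return " ".join(words[start_word:start_word + size])


def build_composite(short_texts, size=SAMPLE_SIZE):
    """Collect complete short texts until their total reaches `size` words.

    Returns the chosen texts and their combined word count. Texts are kept
    whole, so the total may overshoot `size`.
    """
    chosen, total = [], 0
    for text in short_texts:
        if total >= size:
            break
        chosen.append(text)
        total += len(text.split())
    return chosen, total
```

For example, five letters of 600 words each would give a composite of four letters (2,400 words), since the third letter only brings the running total to 1,800.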

Range of speakers and writers

Meyer (2002) has pointed out that it is “not simply whether one obtains texts from native or non-native speakers but rather that the texts selected for inclusion are obtained from individuals who accurately reflect actual users of the particular language variety that will make up the corpus”. Greenbaum (1991b) has also emphasised that it is the people, not the language, that have to be selected, and that their language should not be excluded on subjective criteria of correctness, adequacy or appropriateness. Therefore, since the ICE project is restricted to educated English, the only criterion for selecting the population should be “adults of eighteen or over who have received formal education through the medium of English to at least the completion of secondary school” (UCL, 2002).

The selection of texts should not be random; population differences and textual differences should be taken into account. Some relevant variables to consider are age, gender, level of education, dialect variation (e.g. urban or rural locations), ethnic group, region, occupation and status in occupation, social contexts and social relationships.
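
One way to keep track of these variables is a small record per speaker or writer, with the ICE selection criterion expressed as a check. The sketch below is illustrative only: the class name, field names and is_eligible function are hypothetical and do not come from any ICE tool or header format.

```python
from dataclasses import dataclass


@dataclass
class Contributor:
    """Background information about one speaker or writer (hypothetical fields)."""
    age: int
    gender: str
    education_level: str
    english_medium_secondary: bool  # completed secondary school taught in English
    region: str = ""
    ethnic_group: str = ""
    occupation: str = ""


def is_eligible(person):
    """Apply the criterion quoted above: aged eighteen or over, and formally
    educated through the medium of English to at least the completion of
    secondary school."""
    return person.age >= 18 and person.english_medium_secondary
```

The remaining fields play no part in eligibility; they are recorded so that the balance of the collected texts across the population can be inspected later.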

2.2 Corpus Collection and Encoding

2.2.1 Collecting Data

Spoken Texts

Collecting spoken texts, especially spontaneous speech, is the most difficult and frustrating task in the development of the corpus (Sharoff, 2005; Nelson, 1996a). The cooperation of the speakers is required, but there is often the problem of the “observer’s paradox”: people tend to behave differently when being observed (or recorded), and therefore the way they speak may change. According to Meyer (2002), one way around this problem is to record a longer stretch of speech and then choose the most natural part. However, speech collection is already time-consuming, and recording a longer stretch just to obtain a natural part will be too costly. To record the speech, either an analogue or a digital recorder can be used, since both yield satisfactory results. Meyer (2002) has nevertheless recommended using a digital recorder, since the recording is easier to transfer to the computer for manipulation and longer speech can be recorded. To improve the quality of recordings, the type of microphone used is also an important consideration. For the other spoken categories, such as broadcast speech, it is best to record directly from radio or television.

Written Texts

According to Nelson (1996a), written texts are easier to obtain than spoken texts,

but in Sharoff’s (2005) experience, extra effort is needed to obtain private

texts, such as personal letters. With the Internet, a wide range of texts are easily and freely

available today. Although using electronic texts saves us the time and effort of computerising

printed texts, some important questions raised by Meyer (2002) also need to be considered: “Are

electronic texts essentially the same as traditionally published written texts?”, “Is an article from a

personal webpage any different from one that has gone through the editorial process?”

Copyright Issues

Collecting the texts is a complex task in itself, but without copyright permission the texts cannot be used

in the corpus, especially if the corpus is going to be made accessible to anyone and is going to be

used internationally for research purposes. Based on the experience of the other ICE teams,

Nelson (1996a) has found that owners of texts are usually willing to help. The only frustration is

getting the permission within a short time period. Having other priorities, owners usually take a

long time to reply or some may not even bother replying. Nelson (1996a) has also discovered that

due to major confidentiality issues, it is more difficult to obtain permission for texts in the

commercial sector.

2.2.2 Computerising Data

Computerising Written Data

Nowadays, most texts are readily available in electronic form. Those texts downloaded from the

Internet however contain a significant amount of HTML code. Meyer (2002) has suggested using

software such as “HTMASC” (http://www.bitenbyte.com) to automatically strip the HTML

coding from the texts to produce an ASCII text file with no coding. If it is not possible to obtain

the texts in electronic form, a printed copy of the text can be converted with an optical scanner.

These exist in two types: form-feed and flatbed scanners. Meyer (2002) has encouraged the use

of the flatbed scanner, since experience with ICE-USA has shown that flatbed scanners are slightly more

accurate.
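
The stripping of HTML coding described above can be illustrated with a short sketch. The Python code below is not HTMASC and makes no claim about how that program works; it is a minimal stand-in that assumes only the standard library, and the names TextExtractor and strip_html, as well as the sample page, are invented for this illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text content of an HTML page, skipping scripts and styles."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []        # pieces of visible text, in document order
        self.skip_depth = 0    # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def strip_html(html):
    """Return the visible text of an HTML string with whitespace collapsed."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    return " ".join(" ".join(extractor.parts).split())


# Hypothetical sample page, used only to show the behaviour.
sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>News</h1><p>Port Louis &amp; Curepipe</p></body></html>"
)
print(strip_html(sample))
```

The result is plain text with entities such as &amp; decoded; any further normalisation (for example, restricting the output to ASCII) would be a separate step.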

Transcription of Spoken Texts

After collecting the spoken texts in digital form and having obtained copyright permissions, the

texts need to be written down. This process is known as transcription and more precisely, as

defined by Edwards (1995) “it involves capturing who said what, in what manner, to whom and

under what circumstances”. Software programs such as “Voice Walker 2.0” used within the Santa

Barbara Corpus of Spoken American English and “Sound-Scriber” used in the Michigan Corpus

of Academic Spoken English (MICASE) have been designed specifically to help in the

transcription of digitised speech. These programs (Meyer, 2002) can be downloaded freely from

http://www.linguistics.ucsb.edu/resources/computing/download/download.htm and

http://www.lsa.umich.edu/eli/micase/soundscriber.html respectively.

2.2.3 ICE Mark-up System

Mark-up is the first stage in the annotation process of ICE corpora. Nelson (1996b) has described

mark-up as two distinct types: textual mark-up, which is added to the texts themselves and

bibliographical and biographical mark-up, which is stored externally in the form of a file header

for each text. There are two manuals, one for spoken and one for written texts (Nelson, 1991a,

1991b) which describe the textual mark-up system and a third one is available for encoding

bibliographical and biographical information (Nelson, 1991c). In written texts, mark-up symbols

are used to encode typographic features such as boldface, italics and underlining, and structural

features such as sentence boundaries, paragraph boundaries and headings. In spoken texts,

mark-up is needed to indicate sentence boundaries, speaker turns, overlapping strings and pauses

(Nelson et al., 2002).

More recently, with the increasing use of electronic documents, a standard for the markup of these

types of documents has been developed. This standard, known as Standard Generalized Markup

Language (SGML), offers the advantage of computer independence, that is, the corpus can be

transferred from computer to computer while keeping its original description. However, although

it is a flexible language, problems such as the lack of general style sheets do arise when transferring

the text over the Web. Due to these problems, interest has shifted to a newly emerging

mark-up system, the Extensible Markup Language (XML), which has been designed mainly for

use in web documents. (Meyer, 2002)

In the ICE components, all mark-up symbols are characterised by angled brackets, appearing with

an opening symbol <symbol> and a closing symbol </symbol>. Some examples from Nelson’s

(1991a, 1991b) manuals are given below.

Written Text:

Boldface <bold> </bold>

Example: Readers must return all books to the library

Markup: Readers <bold>must</bold> return all books to the library

Italics <it> </it>

Example: You must attend every day during term

Markup: You must attend <it>every</it> day during term

Typeface <typeface> </typeface>

Example: Warhol is alive and well

Markup: Warhol is <typeface: courier>alive</typeface: courier> and well

Spoken Text:

Overlapping speech <[> </[> and <{> </{>

Example: $A's utterance "Nothing stands out" overlaps completely

with $B's "Yeah I suppose".

Markup: <$A> <#><{><[>Nothing stands out</[>

<$B> <#><[>Yeah I suppose</[></{>

Anthropophonics (non-verbal sounds)

Examples: <O>cough</O> <O>sneeze</O> <O>laugh</O>

Mark-up can be done manually but to speed up the process, Nelson (1996b) has proposed to

partially automate it with the use of the Mark-up Assistant program, a set of WordPerfect macros

that assigns whole mark-up symbols to single keys. The minimum set of ICE mark-up symbols

which has been used is given in Appendix B.

Bibliographical and biographical data

The description of each text is represented as bibliographical and biographical information and the

mark-up is stored separately in a header file. The description includes ‘category’, ‘date’,

‘publisher’ among others and the data is enclosed within opening and closing symbols like in the

textual mark-up, for example, <date> 1996 </date>. A common standard used in many corpora is

the Text Encoding Initiative (TEI) (Al-Sulaiti, 2004), which has been working to incorporate

XML within its standard and whose header comprises four main components:

• File Description <fileDesc>: includes bibliographic information about the text. Below is an

example from the TEI website:

<fileDesc>
  <titleStmt>
    <title> Thomas Paine: Common sense, a machine-readable transcript </title>
    <respStmt>
      <resp> compiled by </resp>
      <name> Jon K Adams </name>
    </respStmt>
  </titleStmt>
  <publicationStmt>
    <distributor> Oxford Text Archive </distributor>
  </publicationStmt>
  <sourceDesc>
    <bibl> The complete writings of Thomas Paine, collected and edited by Phillip S. Foner (New York, Citadel Press, 1945)</bibl>
  </sourceDesc>
</fileDesc>

• Encoding Description <encodingDesc>: states the relationship between the text and its

source. The simplest example from Baker et al. (2003) is shown below:

<encodingDesc>
  <projectDesc>Text collected for use in EMILLE project</projectDesc>
  <sampleDesc>simple written text only has been transcribed. Diagrams, pictures and tables have been omitted and their place marked with a gap element</sampleDesc>
</encodingDesc>

• Profile Description <profileDesc>: supplies non-bibliographic information about the text

and the participants. The profile description can be divided into two parts: the text description

and the person description. Again an example from the TEI website is given below:

<profileDesc>
  <textDesc n='novel'>
    <channel mode=w>print; part issues</channel>
    <constitution type=single>
    <derivation type=original>
    <domain type=art>
    <factuality type=fiction>
    <interaction type=none>
    <preparedness type=prepared>
    <purpose type=entertain degree=high>
    <purpose type=inform degree=medium>
  </textDesc>
  <person id=P1 sex=F age='mid'>
    <birth date='1950-01-12'>
      <date>12 Jan 1950</date>
      <name type=place>Shropshire, UK</name>
    </birth>
    <firstLang>English</firstLang>
    <langKnown>French</langKnown>
    <residence>Long term resident of Hull</residence>
    <education>University postgraduate</education>
    <occupation>Unknown</occupation>
    <socecstatus source=PEP code=B2>
  </person>
</profileDesc>

• Revision Description <revisionDesc>: gives a summary of the history of the text and

provides a detailed change log in which each change made to a text may be recorded. Some

examples of changes made to a text, adapted from the TEI website, are shown below:

<revisionDesc>
  <change><date>1996-01-22 <name>CM SMcQ <what>finished proofreading</change>
  <change><date>1995-10-30 <name>L.B. <what>finished proofreading</change>
  <change><date>1995-07-20 <name>R.G. <what>finished proofreading</change>
  <change><date>1995-07-04 <name>R.G. <what>finished data entry</change>
  <change><date>1995-01-15 <name>R.G. <what>began data entry</change>
</revisionDesc>

For the ICE corpora, Nelson (1996b) has re-classified the above information in the header file

into four different levels, but with very similar attributes:

• Text Description: specifies the text category and subcategories so that it can be located in the

hierarchy of the corpus

• Text Source: records bibliographical data about the sources of texts in the corpus, such as

source title, publisher, date and place of publication. Copyright statements are also included

in this level.

• Text Internals: contains information about the specific extract used in the corpus, for example,

title of article, page numbers, relationship between speakers.

• Biographical Information: includes details, such as sex, age, nationality, of each author and

speaker in the corpus.
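
As an illustration of the four header levels, the following Python sketch renders a header in the <name> value </name> style of the <date> 1996 </date> example given earlier. It is only a sketch under stated assumptions: the field names 'category', 'date' and 'publisher' come from the description above, whereas 'title', 'sex' and 'age', all of the values, and the function name render_header are hypothetical and are not taken from the ICE manuals.

```python
# The four header levels described by Nelson (1996b), in the order listed above.
HEADER_LEVELS = (
    "Text Description",
    "Text Source",
    "Text Internals",
    "Biographical Information",
)


def render_header(header):
    """Render each (name, value) pair as a '<name> value </name>' line."""
    lines = []
    for level in HEADER_LEVELS:
        for name, value in header.get(level, []):
            lines.append(f"<{name}> {value} </{name}>")
    return "\n".join(lines)


# Hypothetical header for a single press report; every value is invented.
example_header = {
    "Text Description": [("category", "Press reports")],
    "Text Source": [("publisher", "Example Press"), ("date", "1996")],
    "Text Internals": [("title", "Example article")],
    "Biographical Information": [("sex", "F"), ("age", "26-45")],
}

print(render_header(example_header))
```

Keeping the levels in a fixed tuple means that every header file lists its fields in the same order, which makes headers easier to compare across texts.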

2.2.4 Corpus Tagging

During this stage, each lexical item is usually assigned a part-of-speech label or tag, for example

‘N’ for noun. In addition, most tags contain additional information, which appears in brackets.

Together they form the tagset of each item (Nelson et al., 2002). Leech’s (1997b) principles for

creating tagsets are adopted for the ICE components, that is, the tagsets should satisfy the three

criteria mentioned below:

• Conciseness: labels should be brief

• Perspicuity: labels should be user-friendly and easy to read and remember

• Analysability: labels should be decomposable into their logical parts, for instance, ‘noun’

can occur above more specific tags such as ‘singular’ or ‘present tense’

Over the years, a number of tagging programs have been developed to insert a variety of

different tagsets, and most taggers are highly accurate, with success rates of more than 95%. The

different tagging programs available are discussed later in the report. In the ICE corpora, the texts

are automatically tagged using the TOSCA Tagger, developed by the TOSCA Research Group at

the University of Nijmegen (UCL, 2004). An example of a grammatically tagged sentence from

the ICE website is shown below:

Each PRON(univ,sing) of PREP(ge) these PRON(dem,plu) is V(cop,pres) the ART(def) responsibility N(com,sing) of PREP(ge) one NUM(card,sing) person N(com,sing)
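
The word, tag and bracketed features in the example above can be separated mechanically. The Python sketch below assumes that the tagged text is laid out exactly as in that example, with each word followed by a tag in capitals and its features in brackets; the function name parse_tagged is invented here, and this is not the format handling used by the TOSCA tagger itself.

```python
import re

# One token: a word, whitespace, a tag in capitals, then features in brackets.
TOKEN = re.compile(r"(\S+)\s+([A-Z]+)\(([^)]*)\)")

SENTENCE = (
    "Each PRON(univ,sing) of PREP(ge) these PRON(dem,plu) is V(cop,pres) "
    "the ART(def) responsibility N(com,sing) of PREP(ge) "
    "one NUM(card,sing) person N(com,sing)"
)


def parse_tagged(line):
    """Split a tagged line into (word, tag, [features]) triples."""
    tokens = []
    for word, tag, features in TOKEN.findall(line):
        tokens.append((word, tag, [f for f in features.split(",") if f]))
    return tokens


tokens = parse_tagged(SENTENCE)
nouns = [word for word, tag, _ in tokens if tag == "N"]
print(nouns)
```

Once the text is in this form, simple queries such as listing all nouns or all singular items reduce to filtering the list of triples.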

2.2.5 Syntactic Parsing

For the ICE components, the tagged corpus from the previous stage forms the input to the parsing

stage. However, before the tagged corpus is automatically parsed, it first needs to be pre-edited.

The pre-editing stage, also known as syntactic marking, involves manually marking several

high-frequency constructions in order to reduce the ambiguity of the input, thereby reducing the

number of decisions that the automatic parser will have to make (Nelson et al., 2002). Following

syntactic marking, the corpus is submitted to the automatic parser, developed by the TOSCA

Research Group at the University of Nijmegen, for syntactic analysis. Every sentence in the

corpus is analysed at phrase, clause and sentence level and the analysis is shown in the form of a

parse tree as shown in Figure 1 (UCL, 2004).

Figure 1: example of ICE Syntactic Parsing

The parse tree is then analysed with ICETree 2, which is a “dedicated syntactic tree editor”

especially designed for the ICE corpora by the Survey of English Usage. ICETree can also

be used with other corpora, but some modifications to the data files will be required first.

Unlike grammatical tagging, “syntactic annotation tends to lack a sense of standard practice” and

parsing software has much lower accuracy rates (70-80 percent at best) and requires human

intervention at varying levels (Leech and Eyes, 1997). Syntactic parsing is seen as the most

difficult and time-consuming stage in the development of a corpus (Nelson et al., 2002).

2.3 Annotation Tools

2.3.1 The ICE Markup Assistant

The ICE Markup Assistant reduces the time taken for the insertion of markup symbols by

automating and simplifying key presses. Generally, it can save up to tens of minutes per text.

The program is implemented as a set of WordPerfect macros, which allow the text unit

markup to be inserted automatically at probable sentence boundaries, for example, wherever a full stop

is followed by a space. Most markup types require an open and close symbol and the ICE

Markup Assistant also helps to ensure that all markup symbols are closed. For instance, if the

user tries to open the same symbol again before closing it, the program will remind the user to do

so. (Quinn and Porter, 1996)
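
The two checks described above, inserting text unit markers at probable sentence boundaries and warning about symbols that are reopened or left unclosed, can be sketched as follows. This Python code is an illustration only and does not reproduce the WordPerfect macros of the ICE Markup Assistant; the function names are invented, the text unit marker <#> is taken from the spoken example in section 2.2.3, and the set of paired symbols is limited to those shown in that section.

```python
import re

# Matches symbols such as <bold>, </bold>, <[>, </[>, <#> and <$A>.
SYMBOL = re.compile(r"<(/?)([^<>/\s]+)>")


def mark_text_units(text):
    """Put <#> at the start of the text and after each full stop plus space."""
    return "<#>" + re.sub(r"\. ", ". <#>", text)


def find_markup_problems(text, paired=("bold", "it", "O", "[", "{")):
    """Report paired symbols that are reopened, never opened or never closed."""
    open_symbols = []
    problems = []
    for closing, name in SYMBOL.findall(text):
        if name not in paired:
            continue  # unpaired symbols such as <#> or <$A> need no closing
        if closing:
            if name in open_symbols:
                open_symbols.remove(name)
            else:
                problems.append(f"</{name}> closed but never opened")
        elif name in open_symbols:
            problems.append(f"<{name}> reopened before being closed")
        else:
            open_symbols.append(name)
    problems.extend(f"<{name}> never closed" for name in open_symbols)
    return problems


print(mark_text_units("He came. He saw."))
print(find_markup_problems("You <it>must <it>attend</it> every day"))
```

A full stop followed by a space is only a probable boundary (abbreviations such as "e.g. " would also match), which is why the inserted markers still need to be checked by hand.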

Using a reduced system of annotation is another way of minimizing the amount of time taken to

annotate texts. For those ICE teams which lack resources to insert all the ICE markup that has

been developed, the ICE project reduces the amount of structural markup that is required to the

most “essential” markup. (Meyer, 2002)

2.3.2 The Different Taggers Available

Automatic text tagging is an important first step in discovering the linguistic structure of text

corpora. For a tagger to function as a practical component in a language processing system,

Cutting et al. (2005) believe that a tagger must be:

• Robust: A tagger should be able to deal with ungrammatical constructions, isolated phrases,

such as titles, and non-linguistic data, such as tables and special words (which might be

unknown by the tagger).

• Efficient: Due to the large number of words that need to be analysed and tagged in every

corpus, a tagger must be time efficient and any training required should also be fast to allow

rapid turnaround with new corpora and new text genres.

• Accurate: A tagger should assign the correct part-of-speech tag to every word it encounters to

reduce human intervention.

• Tunable: A tagger should be able to take different hints to correct systematic errors and to be

adapted to different corpora.

• Reusable: A tagger should require a minimal amount of effort to be retargeted to new corpora,

new tagsets and new languages.

There are two different types of taggers: rule-based and probabilistic (Garside and Smith, 1997;

Meyer, 2002). In a rule-based tagger, grammar rules are written into the tagger and tags are

inserted on the basis of these rules. The TAGGIT program was among the first rule-based taggers to

be developed, followed by the Brill tagger. However, rule-based taggers are being superseded by

probabilistic ones. The latter work by assigning to each word the tag that is most likely

in the context of the word and its immediate neighbours. Garside and Smith

(1997) have given an example in the sentence beginning the run: the word run has a high

probability of being a noun rather than a verb because it is preceded by the.
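
The idea behind the "the run" example can be made concrete with a toy calculation. The Python sketch below is a deliberately simplified illustration and not the algorithm of any tagger discussed in this report: it scores each candidate tag using only the previous tag and the word itself, the training sentences are invented, and the tag names ART, N, V, AUX, ADJ and PRON are chosen for this example.

```python
from collections import Counter

# Tiny hand-made training sample (invented for this illustration).
TRAINING = [
    [("the", "ART"), ("run", "N"), ("was", "V"), ("long", "ADJ")],
    [("the", "ART"), ("dog", "N"), ("can", "AUX"), ("run", "V")],
    [("they", "PRON"), ("run", "V"), ("the", "ART"), ("shop", "N")],
    [("a", "ART"), ("run", "N"), ("is", "V"), ("planned", "V")],
]


def train(sentences):
    """Count tag-to-tag transitions, word-given-tag emissions and tag totals."""
    transitions = Counter()   # (previous tag, tag) counts
    emissions = Counter()     # (tag, word) counts
    tag_totals = Counter()    # how often each tag occurs
    for sentence in sentences:
        previous = "<s>"      # marks the start of a sentence
        for word, tag in sentence:
            transitions[(previous, tag)] += 1
            emissions[(tag, word)] += 1
            tag_totals[tag] += 1
            previous = tag
    return transitions, emissions, tag_totals


def best_tag(word, previous_tag, model):
    """Pick the tag with the highest emission-times-transition score."""
    transitions, emissions, tag_totals = model
    scores = {}
    for tag, total in tag_totals.items():
        emission = emissions[(tag, word)] / total
        # Add-one smoothing; the normalising constant is the same for every
        # candidate tag after a given previous tag, so it is left out.
        transition = transitions[(previous_tag, tag)] + 1
        scores[tag] = emission * transition
    return max(scores, key=scores.get)


model = train(TRAINING)
print(best_tag("run", "ART", model))   # after an article
print(best_tag("run", "PRON", model))  # after a pronoun
```

In this sample, "run" is scored as a noun after an article and as a verb after a pronoun, because articles are followed by nouns far more often than by verbs. A real probabilistic tagger would estimate such counts from a large tagged corpus and consider whole sequences of tags rather than one word at a time.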

The most common taggers with which corpus linguists typically work are:

• The TAGGIT program by Greene and Rubin was one of the earliest taggers, developed

around 1971 as an aid in the tagging of the Brown Corpus. It tagged 77% of the corpus

automatically, and the rest was done manually over a period of several years. The tags

assigned were from a set of some 77 tags (the Brown tags). (Garside and Smith, 1997)

• CLAWS (the Constituent Likelihood Automatic Word-tagging System), another one of the

first tagging programs, was designed in the early 1980s at the University of Lancaster (Atwell,

1983). CLAWS has consistently achieved 96-97% accuracy and since then, various versions

of the CLAWS program have been developed and have been used to tag the LOB Corpus (the

British counterpart of the Brown Corpus) and the British National Corpus. (Leech, 1997a)

• The TOSCA (Tools for Syntactic Corpus Analysis) tagger has been designed by the TOSCA

team at the University of Nijmegen to insert two types of tagsets, namely the TOSCA tagset,

which is used to tag the Nijmegen Corpus, and the ICE tagset, composed of 262 tags. (UCL,

2004; Meyer, 2002)

• Another tagger that can be used to insert the ICE tagset is the AUTASYS tagger (Fang, 1996).

The tagger has been developed by Fang and Xiaoli at the Guangzhou Institute of Foreign

Languages, China and it assigns not only ICE tags, but also LOB tags and SKELETON tags.

AUTASYS has an accuracy rate of 96% and processes words quickly.

• The Brill tagger, a multi-purpose tagger, can be trained to insert any tagset the user is working

with. It can also be applied to any language (Garside and Smith, 1997, Atwell et al., 2000).

• EngCG-2 (the Helsinki English Constraint Grammar) is a tagger that has been designed to

overcome the problems in the TAGGIT program and other rule-based taggers. It has a wider

application and is able to “refer up to sentence boundaries rather than the local context alone”

(Meyer, 2002). One main advantage of EngCG-2 is its 99.5% accuracy rate.

2.3.3 The ICE Tag Selection System

TAGSELECT helps users to select between the alternative word-class tags generated by the

TOSCA tagger or AUTASYS. The most likely alternative tags for each word are displayed first,

so human intervention is only needed if the first tag is not the correct one. Where no correct

alternative is provided, a new tag can be chosen from the list of possible tags. TAGSELECT is

user-friendly since it runs under Microsoft Windows and therefore all functions are available

using menus, buttons and scroll bars. (Quinn and Porter, 1996)

2.3.4 The ICE Syntactic Marking System

For the ICE projects, syntactic markers are added to the tagged texts prior to parsing by the

TOSCA parser or any other parser that requires such pre-editing. This is done with the

ICEMARK system. Syntactic markers make the input to the parser simpler and therefore restrict

the number of alternative syntax trees generated. Like the ICE Tag Selection System, ICEMARK

also runs under Microsoft Windows, making it user-friendly. (Quinn and Porter, 1996)

2.3.5 Different Varieties of Syntactic Annotation

Tagging and parsing are closely related, and so many parsers have taggers built into them. For

instance, the EngFDG (Functional Dependency Grammar of English) parser and the TOSCA

parser can assign both syntactic functions and part-of-speech tags to words (Meyer, 2002). Like

taggers, parsers can be either probabilistic or rule-based. Both kinds have been widely used but

major emphasis is being put on the development of probabilistic ones in recent years since they

are thought to be more robust in the sense that they are “able to parse rare or aberrant kinds of

language, as well as more regular, run-of-the-mill types of sentence structures” (Leech and Eyes,

1997).

The Lancaster/IBM Treebank

A skeleton parsing scheme based on a shallow PS (phrase-structure) model has been used to parse

about 3 million words of text. The PS model simply involves analysing every sentence in the

corpus and adding labelled brackets to it. A sample of skeleton parsing from the Lancaster/IBM

Spoken English Corpus is shown in Figure 2. It can be noted that the tree is incomplete and that

the number of bracket labels used is quite small. This is done intentionally to speed up the

process and to limit the complexity of the parsing (Leech and Eyes, 1997).

SJ06 298v
[S But_CCB ,_, [[N the_AT thing_NN1 N][V was_VBDZ V]] ,_, [N you_PPY N] often_RR
  [V found_VVD [Fn that_CST [Fa although_CS [N you_PPY N][V had_VHD [N a_AT1 reserved_JJ seat_NN1 N]V]Fa] ,_,
  that_CST there_EX just_RR [V would_VM n’t_XX be_VB0 [N room_NN1 N][P on_II [N the_AT train_NN1 N]P]V]Fn]V] ._. S]

Figure 2: Sample from the Lancaster/IBM Spoken English Corpus (Leech and Eyes, 1997)

The Penn Treebank: Phase 1

The Penn Treebank is the largest and best-known treebanking operation available today. It has

been developed at the University of Pennsylvania by Mitchell Marcus and his team and it is

closely modelled on the Lancaster/IBM Treebank. A PS model of parsing is used and

incomplete parse trees are accepted into the Treebank. The differences are that the Penn tree is

displayed vertically, as shown in Figure 3 below, and that the Treebank is generally available throughout the world.

(Marcus et al., 1993)

Another, more ambitious version (Phase 2) of the Penn Treebank is being developed. In the Phase

2 Treebank, a wider range of additional information, such as functional labels or types of

adverbial, will be added.
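
The labelled bracketing used in the Penn Treebank lends itself to simple processing. The following Python sketch reads a bracketed tree into nested lists and recovers the words of the sentence; it uses the Phase 1 sentence of Figure 3 as input. The function names parse_brackets and words are invented for this illustration, and the code assumes that every bracket except the outermost one starts with a label, as in that sample.

```python
PENN_SAMPLE = (
    "( (S (NP (NP Pierre Vinken) , (ADJP (NP 61 years) old ,)) will "
    "(VP join (NP the board) (PP as (NP a nonexecutive director)) "
    "(NP Nov. 29))) .)"
)


def parse_brackets(text):
    """Parse a bracketed tree into nested lists: [label, child, child, ...]."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    position = 0

    def parse_node():
        nonlocal position
        position += 1                      # skip the opening bracket
        node = []
        while tokens[position] != ")":
            if tokens[position] == "(":
                node.append(parse_node())  # a nested constituent
            else:
                node.append(tokens[position])
                position += 1
        position += 1                      # skip the closing bracket
        return node

    return parse_node()


def words(node):
    """Return the words of a tree in order, leaving out the bracket labels."""
    # A labelled node starts with its label; the outermost node has none.
    children = node[1:] if isinstance(node[0], str) else node
    result = []
    for child in children:
        if isinstance(child, list):
            result.extend(words(child))
        else:
            result.append(child)
    return result


tree = parse_brackets(PENN_SAMPLE)
print(tree[0][0])             # label of the top constituent
print(" ".join(words(tree)))  # the sentence without its bracketing
```

The same nested structure could be used to count constituents of a given type, for example all NP nodes, by walking the lists recursively.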

( (S (NP (NP Pierre Vinken)
         ,
         (ADJP (NP 61 years)
               old
               ,))
     will
     (VP join
         (NP the board)
         (PP as
             (NP a nonexecutive director))
         (NP Nov. 29)))
  .)

Figure 3: A sentence from the Penn Treebank (Phase 1) (Leech and Eyes, 1997:42)

Nijmegen Treebanks

Developed before the Penn Treebank, the TOSCA parsing system was set up in the early 1980s at

the Catholic University of Nijmegen, Holland. It uses a grammatical model known as Affix

Grammar, and the TOSCA Treebank is integrated with the Linguistic DataBase (LDB), which

allows the Treebank to be searched for varied features. One of its main features is that it allows

users to correct or change the parse where necessary. Figure 4 gives an example of a sentence

from the TOSCA Treebank. (Leech and Eyes, 1997)

-:TXTU()
  UTT:S(act,indic,inter,montr,pres,unm)
    INTOP:AUX(do,indic,pres){Does} Does
    SU:NP()
      NPHD:PN(pers,sing){he} he
    V:VP(act,do,indic,montr)
      MVB:LV(indic,infin,montr){realize} realize
    OD:CL(act,indic,intens,pres,unm,zsub)
      SU:NP()
        NPHD:PN(pers,sing){he} he
      V:VP(act,indic,intens,pres)
        MVB:LV(indic,intens,pres){is} is
      CS:AJP(prd)
        AJHD:ADJ(prd){wrong} wrong
  PUNC:PM(qm){?} ?

Figure 4: Sentence from the TOSCA Treebank (Leech and Eyes, 1997:44)

The SUSANNE Corpus

Geoffrey Sampson’s SUSANNE Corpus is a Treebank which provides a lot of parsing

information for each sentence. It is a result of manual analysis and “contains much detail within a

small compass.” An example from the SUSANNE Corpus is given in Figure 5 below. Moreover,

it is available freely to any research community. The only downside is that the texts are old

(1961) compared to what people would usually analyse. (Sampson, 1995)

N03:0460f  -  YB      <minbrk>  -        [Oh.Oh]
N03:0460g  -  PPHS1m  He        he       [O[S[Nas:s.Nas:s]
N03:0460h  -  VVDt    handed    hand     [Vd.Vd]
N03:0460i  -  AT      the       the      [Ns:o.
N03:0460j  -  NN1c    bayonet   bayonet  .Ns:o]
N03:0460k  -  II      to        to       [P:u.
N03:0460m  -  NP1m    Dean      Dean     [Nns.Nns]p:u]
N03:0460n  -  CC      and       and      [S+
N03:0460p  -  VVDv    kept      keep     [Vd:Vd]
N03:0460q  -  AT      the       the      [Ns:o.
N03:0460a  -  NN1c    pistol    pistol   [.Ns:o]S+]S]
N03:0460b  -  YF      +.        -        .

Figure 5: A sample from the SUSANNE Corpus (Sampson, 1995:32)

The Helsinki Constraint Grammar

The Helsinki Constraint Grammar parser adopts a dependency grammar model instead of the PS

grammar model as in the other parsers mentioned above. The parser also provides a breakdown

of the attributes of individual words such as sub-categorisation information for verbs and in

addition, functional labels such as ‘subject’ or ‘object’ are added. A sample of Helsinki parser

output is shown in Figure 6 below. (Leech and Eyes, 1997)

("<*royal>" ("royal" A ABS (@AN>)))
("<*dutch>" ("dutch" <Nominal> A ABS (@AN> @<Nom)))
("<*shell>" ("shell" N NOM SG (@SUBJ)))
("<$,>")
("<*worth>" ("worth" PREP (@ADVL)))
("<*just>" ("just" ADV (@AD-A>)))
("<*$500m>" ("$500m" NUM CARD (@<P)))
("<*less=than>" ("less=than" <CompPP> PREP (@ADVL))
                ("less=than" <**CLB> CS (@CS))
                ("less=than" ADV (@ADVL)))
("<*exxon>" ("exxon" <Proper> N NOM SG (@<P)))
("<$,>")
("<is>" ("be" <SV><SVC/N><SVC/A> V PRES SG3 VFIN (@+FMAINV)))
("<*third>" ("third" NUM ORD (@PCOMPL-S)))
("<$.>")

Figure 6: Output from the Helsinki ENGCG parser (Leech and Eyes, 1997:48)

2.3.6 The ICE Syntactic Tree Annotator

Within the ICE community, two sets of programs are used for the annotation process. First, a

parser is applied to produce a partial or a complete analysis and then an editor is used to

correct or complete the analyses. The parse analysis is represented in ‘tree’ form and

ICETREE is the editor that allows such ‘tree’ form analysis to be manipulated. ICETREE

also allows parse trees to be built from scratch and it can be used as a viewer for complete

analyses. (Quinn and Porter, 1996)

3. Methodology

There are many approaches to software development and one of the main approaches is “The

Waterfall Model”. It defines a project as a set of stages: from problem definition to requirements

analysis, design, implementation, testing and finally maintenance. In this model, each individual

stage in the project must be completed before moving on to the next (Laudon and Laudon, 2002).

The “Feedback Model” uses the same development stages as the “Waterfall Model” but it allows

for re-evaluation of earlier stages if problems arise in the later stages. Therefore, the “Feedback

Model” is the methodology that has been adopted for this project.

The first stages, namely, problem definition and requirements analysis were already carried out

during the background research and were described in sections 1 and 2 above. This section will

therefore describe the design of the project.

3.1 Corpus Design

3.1.1 Methods to be used

Firstly, it was decided that the pilot project would be fully Internet-based, that is, all of the texts

would be taken from the Internet only. The texts would be collected from the numerous

Mauritian websites already available.

On the basis of this approach, it was decided that the ICE-Mauritius would be composed of the

following genres, adapted from the ICE standard text categories.

Table 3: Text Categories to be adapted for ICE-Mauritius

Spoken
    Dialogues
        Public: Broadcast Discussions, Broadcast Interviews, Parliamentary
    Monologues
        Unscripted: Commentaries, Unscripted Speeches, Legal Presentations
        Scripted: Broadcast News, Broadcast Talks

Written
    Non-printed
        Student Writing: Student Essays, Exam Scripts
        Letters: Social Letters, Business Letters
    Printed
        Academic: Humanities, Social Sciences, Natural Sciences, Technology
        Popular: Humanities, Social Sciences, Natural Sciences, Technology
        Reportage: Press reports
        Instructional: Administrative Writing, Skills/hobbies
        Persuasive: Editorials
        Creative: Novels

Since the texts would be collected from the Internet only, some categories were removed

because they would not be available on the Internet. The number of texts to collect for each category

was not stated, since it was difficult to know beforehand how many of those texts would be available

online.

The main method would be to collect as many texts as possible for any category, even for

those categories not listed above, and then to classify the texts accordingly, creating or removing

categories where necessary. For the pilot project, one to two percent of the corpus would be

collected. However, the samples collected would have to follow the standard of 2,000 words per

text to total the one million words that the corpus was required to reach at the end.
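
The sampling figures above imply a simple calculation, sketched here in Python as a worked example. Only the three figures stated in the text are used (one million words in total, 2,000 words per text, and one to two percent for the pilot); the names in the code are chosen for this illustration.

```python
CORPUS_WORDS = 1_000_000   # target size of a full ICE component
WORDS_PER_TEXT = 2_000     # standard length of each text sample

TOTAL_TEXTS = CORPUS_WORDS // WORDS_PER_TEXT


def pilot_size(percent):
    """Return (words, texts) needed to cover a given percentage of the corpus."""
    pilot_words = CORPUS_WORDS * percent // 100
    return pilot_words, pilot_words // WORDS_PER_TEXT


print(TOTAL_TEXTS)
for percent in (1, 2):
    print(percent, pilot_size(percent))
```

A full component therefore consists of 500 texts, and a pilot of one to two percent corresponds to 10,000 to 20,000 words, that is, five to ten texts of 2,000 words each.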

While collecting the material, the three main problems associated with the use of the Internet, as

identified by Sharoff (2005), would have to be kept in mind:

1. It cannot be claimed that the material is representative and that there is a balance of text types

2. Search engines address the needs of information retrieval, rather than linguistic search

3. Search engines present search results in a way that also does not correspond to the needs of a

linguist

3.1.2 Copyright Issues

One of the important issues to consider for the collection of texts was copyright. The

corpus could not be used publicly unless permission was granted from owners of sites to use the

material. “Experience with the creation of the other ICE corpus has shown that, in general, it is

quite difficult to obtain permission to use copyrighted material” (Meyer, 2002). This is because

some people will take months before replying while others will not even bother sending a reply.

Therefore, extra time would have to be allocated for the request of permission to use the

copyrighted materials and also, extra texts would have to be collected in case permission was not

granted for some of them.

The first stage in compiling the ICE-Mauritius would be to identify some suitable websites and to

obtain email addresses as well as postal addresses, telephone numbers and fax numbers. Two

letters would be prepared: one, for the owners and authors to keep, would explain the purpose of the

corpus, and the other would have a return slip for them to sign if they agreed to their websites

being used.

3.1.3 Corpus Layout

“Organising corpus into a series of directories and subdirectories makes working with the corpus

much easier and allows the corpus compiler to keep track of the progress being made on corpus as

it is being created” (Meyer, 2002). Therefore, the corpus would be organised into directories and

subdirectories according to the different text categories. For the proposed diagrammatic layout of

the corpus, see Appendix C.

Each text would be assigned a number that designated a specific category in the corpus in which

the sample might be included. For instance, a text number LETT01 would be the first sample

collected for inclusion in the category “Letters” while B-N01 would be the first sample collected

for inclusion in the category “Broadcast News”. This numbering system would allow the corpus

compiler to keep easy records of where a text belonged in the corpus and how many samples had

been collected for that part.
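
The numbering scheme described above can be sketched as a small counter per category. This is only an illustration: the prefixes LETT and B-N are taken from the examples in the text, whereas the class name and the idea of generating the numbers in code are assumptions (in the project the numbers were assigned by hand).

```python
from collections import defaultdict

# Prefixes taken from the examples in the text; further categories would
# need their own prefixes, which are not specified here.
PREFIXES = {
    "Letters": "LETT",
    "Broadcast News": "B-N",
}


class TextNumbering:
    """Hand out sequential text IDs such as LETT01, LETT02, B-N01."""

    def __init__(self, prefixes):
        self.prefixes = dict(prefixes)
        self.counts = defaultdict(int)

    def next_id(self, category):
        prefix = self.prefixes[category]   # KeyError for an unknown category
        self.counts[category] += 1
        return f"{prefix}{self.counts[category]:02d}"


numbering = TextNumbering(PREFIXES)
first_letter = numbering.next_id("Letters")
second_letter = numbering.next_id("Letters")
first_news = numbering.next_id("Broadcast News")
print(first_letter, second_letter, first_news)
```

The per-category counter also gives the number of samples collected so far for each part of the corpus, which is the record-keeping benefit mentioned above.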

3.2 Capturing Text in Electronic Format

3.2.1 Computerising Speech

It was assumed that the spoken texts collected would be in digitised form since they would be

taken from the Internet. The software “Voice Walker 2.0” or “Sound-Scriber” mentioned above

would be downloaded free of charge from the Internet to play the samples of digitised speech. Since no

other alternatives were available, speech would be manually transcribed. This process would take

the longest time in the compilation of the corpus and therefore extra time should be allowed.

3.2.2 Computerising written texts

Texts downloaded from the Internet were expected to contain as much HTML coding as text.

Since deleting this coding manually would take a considerable amount of time and effort, the

software “HTMASC” mentioned above would be used to automatically strip the HTML coding

from text. An ASCII text file with no coding was expected to be produced.
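
To illustrate what stripping the HTML coding involves, the following is a minimal sketch using Python's standard html.parser module. It is not HTMASC and makes no claim about how that program works; the class name, the list of block-level tags and the sample page are all assumptions chosen for the example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, dropping tags and script/style content."""

    SKIP = {"script", "style"}
    BLOCK = {"p", "br", "div", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append("\n")      # block tags become line breaks

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def strip_html(html):
    """Return the plain text of an HTML page, one paragraph per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    lines = (" ".join(line.split()) for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)


sample = (
    "<html><head><style>p{color:red}</style></head><body>"
    "<h1>News</h1><p>Port Louis &amp; Curepipe</p>"
    "<script>x = 1</script></body></html>"
)
print(strip_html(sample))
```

The result is plain text with one paragraph per line, which is a convenient starting point for the paragraph mark-up described in the next section.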

3.3 Corpus Annotation

3.3.1 Structural mark-up

The mark-up of the texts would be carried out by writing minimal encoding and pasting a header

using a word processor. The following components, adapted from TEI-Header from the

Humanities Text Initiative (HTI) website to the ICE standards, would be added to each text:

File Description <fileDesc>

<fileDesc>
  <titleStmt>
    <title> </title>
    <author> </author>
    <respStmt><resp>compiled by</resp> <name>Dolly Koo</name></respStmt>
  </titleStmt>
  <publicationStmt>
    <publisher> </publisher>
    <pubPlace> </pubPlace>
    <date></date>
  </publicationStmt>
  <sourceDesc>
    <p>created in machine-readable form in http://mauritiustimes.com/040205mr.htm</p>
  </sourceDesc>
</fileDesc>

Encoding Description <encodingDesc>

<encodingDesc>
  <projectDesc>
    <p>Texts collected for use in the pilot project for ICE-Mauritius, February, 2005</p>
  </projectDesc>
  <samplingDecl>
    <p>Whole text of 862 words copied from the site</p>
  </samplingDecl>
</encodingDesc>

A Profile Description would also be added and it would be similar to the one described in section

2.2.3 above.

3.3.2 Procedure for annotating the corpus

As mentioned earlier, the encoding would be done manually since no program was developed to

encode the corpus automatically. For each text, the following steps would be performed:

1. Text would be copied from the Internet and pasted into Microsoft Word. It would be saved as

encoded text choosing Unicode UTF-8 as recommended by Al-Sulaiti (2004) since some of

the texts might contain some French quotations with some special characters.

2. The text would then be encoded with paragraph marker using the option FIND/REPLACE in

edit: Find ^p Replace </p>^p<p> in the case of a normal.

3. After the paragraphing was marked, the adapted TEI-header would be added and the missing

information would be filled in.

4. When the text was complete, it would be saved with its ID number as its name. For instance,

the text with the ID number LETT01 from the “Letters” category would be saved as

LETT01.txt in the “Letters” directory.

5. The text would then be renamed by changing the file extension from .txt to .xml.

6. Finally, to verify the XML file, the text could be opened in Internet Explorer.
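
For comparison, the manual steps above could be approximated in code roughly as follows. This is a sketch under stated assumptions, not the procedure used in the project: the wrapper elements <text> and <body> and the function names are hypothetical, the escaping of special characters is an addition that the manual steps do not include, and the well-formedness check with ElementTree stands in for opening the file in Internet Explorer.

```python
import xml.etree.ElementTree as ET
from pathlib import Path
from xml.sax.saxutils import escape


def mark_paragraphs(raw_text):
    """Wrap each non-empty line in <p>...</p>, escaping &, < and >."""
    paragraphs = [line.strip() for line in raw_text.split("\n") if line.strip()]
    return "\n".join(f"<p>{escape(p)}</p>" for p in paragraphs)


def build_document(raw_text, header_xml, text_id):
    """Combine header and marked-up body; raise ParseError if not well-formed."""
    body = mark_paragraphs(raw_text)
    document = (
        f'<text id="{text_id}">\n{header_xml}\n'
        f"<body>\n{body}\n</body>\n</text>\n"
    )
    ET.fromstring(document)   # step 6: verify the XML
    return document


def save_document(document, text_id, directory):
    """Save as <ID>.xml in the category directory, encoded as UTF-8."""
    path = Path(directory) / f"{text_id}.xml"
    path.write_text(document, encoding="utf-8")
    return path


header = "<fileDesc><titleStmt><title>Sample</title></titleStmt></fileDesc>"
raw = "First paragraph.\n\nSecond paragraph, with a café & a quote.\n"
document = build_document(raw, header, "LETT01")
print(document)
```

Writing the file directly with the .xml extension merges steps 4 and 5 of the manual procedure.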

4. Corpus Encoding

With the design laid out in section 3 in place, implementation of the ICE-Mauritius was started.

This section covers the encoding of the pilot project.

4.1 Collection of Texts

4.1.1 Search methods

Keeping in mind the three main problems of collecting texts from the Internet mentioned in

section 3.1.1, the following search engines and key words were used:

     Search Engine   Key words
1    Google          Mauritius, Mauritius articles/ books/ novels/ business letters/ press/
                     websites/ educational/ reports/ newspapers/ schools/ stories/ texts,
                     Higher School Certificate/Mauritius exam papers
2    Yahoo           Mauritius, Mauritius articles/ news/ books/ novels
3    MSN             Mauritius, Mauritius articles/ books/ novels/ business letters/ press/
                     educational reports/ newspapers/ schools/ stories/ texts,
                     Mauritius Higher School Certificate /exam papers

Table 4: Search engines & key words used to collect texts

Some of these searches proved to be very useful. For instance, a search for “Mauritius” in

Google brought up some of the main Mauritian websites, such as the government pages, and other

interesting websites containing the texts needed were found. However, with over 20 million

results for “Mauritius”, it was difficult to look through all of them. The search had to be refined

and new key words such as “Mauritius newspaper” or “Mauritius schools” were typed in. The

‘Advanced Search’ option and the ‘Preference’ option in Google were also used, but they did not

prove very useful. Key words like “Mauritius business letters” matched over 200,000 sites but

none were related to the corpus or were written by Mauritian people.

The same process was carried out with the search engines Yahoo and MSN. After a few searches

with Yahoo, it was found that it did not yield many results and that all the sites it referred to

had already been visited through Google. With MSN, more results were obtained when searching the

Internet and some new materials were collected but, as with Yahoo, many of the sites had

already been displayed in Google.

Two of the most useful Mauritian websites are described in Table 5 below:

Table 5: Most popular Mauritian websites

Website:      http://www.servihoo.com/
Description:  Website owned by Telecom Plus, the only telephone provider in Mauritius. It has
              links to other websites such as local newspapers, radio and television, and it
              contains articles ranging from culture to business to sports.

Website:      http://www.mauritiustopsites.com/topsiteshtml/index157.shtml
Description:  Website owned by Internet Communication Services Mauritius. It lists the 946 most
              popular websites from the country.

4.1.2 Text Collection

Numerous websites related to Mauritius had been searched but careful attention had to be paid to

the author and the publisher. Many of the articles found were not from Mauritian people. The

first few texts took a considerable amount of time to obtain but once the useful sites were known,

the texts were collected more quickly. Owing to the lack of time, collection stopped after four days of

thorough searching with the abovementioned engines. Written texts from fifty websites were collected and

the number of words totalled 51,960, about 5 percent of the full size of the corpus

(exceeding the target of 1-2% for the pilot project). The author, publisher, place of publication, date

and contact details of the author where available were also noted for each text. Some of the texts

such as press reports were easily obtained from the various newspaper websites. However, letters

and student writing proved very difficult to find: no student essays or exam scripts

were available online. It was also important to note that shorter texts were easier to find than

longer texts of 2,000 words each. Appendix D shows the full list of texts and the details that were

collected.

Not surprisingly, spoken text was impossible to obtain from the Internet. Only two websites had

spoken texts, namely the Mauritius Broadcasting Corporation (http://mbc.intnet.mu/) and TopFM

(http://www.topfmradio.com/index.php). The Mauritius Broadcasting Corporation provided live

TV news transmission, but only the French version was available online. Both sites

provided live radio transmission, but most of the talks were in French too, and saving the spoken

texts proved infeasible given the short amount of time available to compile the pilot project.

Hence, no spoken texts were collected. The proposed solution by Sharoff (2005), that is, to

increase the amount of ephemera (leaflets, junk mail and typed material) and correspondence

could be attempted in the follow-up project to compensate for the lack of spoken texts and to

make the project more balanced.

Alongside collecting the texts, a database file was created in Microsoft Excel (Appendix D) which

stored the type, ID number, source, title, author, publisher, place and year of publication and the

number of words of each text. This database file was important to have for the organisation of the

texts in the corpus and for counting the words automatically.
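
The kind of record kept in the database file, and the automatic word counting it allowed, can be sketched as below. The project used Microsoft Excel; this example uses CSV text and Python's csv module instead. The column names follow the fields listed above, and the three rows are invented placeholders, not entries from Appendix D.

```python
import csv
import io
from collections import Counter

# Placeholder rows for illustration only; they are not taken from Appendix D.
SAMPLE_DATABASE = """type,id,source,title,author,publisher,place,year,words
Letters,LETT01,http://example.org/a,Placeholder A,unknown,unknown,unknown,2005,900
Letters,LETT02,http://example.org/b,Placeholder B,unknown,unknown,unknown,2005,1100
Reportage,REP01,http://example.org/c,Placeholder C,unknown,unknown,unknown,2005,2000
"""


def words_per_category(csv_text):
    """Sum the 'words' column for each text type."""
    totals = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row["type"]] += int(row["words"])
    return totals


totals = words_per_category(SAMPLE_DATABASE)
for category, words in totals.items():
    print(f"{category}: {words} words")
print(f"Total: {sum(totals.values())} words")
```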

4.1.3 Written Text Classification

It proved difficult to decide to which category each text belonged. For example, it was

unclear whether some of the popular printed texts came from press reports or from other

magazines and therefore, those texts were classified as popular printed texts based only on best

judgement. Sinclair (1996) had examined in detail the problems of text classification and had

reported that corpus design made use of some internal and external factors to decide on the text

category. He pointed out that many text classifications were based on topic as it was represented

in newspapers and magazines.

The classification of the written texts of the ICE-Mauritius was based on the ICE standard

classifications, but with some amendments due to the lack of texts to cover all the categories. The

texts collected from the 50 websites were grouped and classified differently. Those exceeding

1,500 words were considered as whole texts while those below 1,500 words were grouped

together as one text, up to a total of around 2,000 words. However, the texts that were grouped

together had to be part of the same initial category. This resulted in 30 final texts ready to be

included in the corpus. The spoken text category was removed completely for the pilot project,

even though this resulted in an unbalanced corpus. Much more time to collect and encode the

spoken texts would have to be allocated for the actual ICE-Mauritius. Table 6 below shows the

text categories which were derived from the sources, the number of texts and the total number

of words in each category.
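
The grouping rule described in this section can be sketched as a simple greedy procedure. The thresholds of 1,500 and 2,000 words come from the text; the exact cut-off (a group is closed when the next text would take it above 2,000 words) is an assumption, since the text only says "around 2,000 words", and the sample word counts are invented.

```python
from collections import defaultdict


def group_texts(texts, whole_threshold=1500, target=2000):
    """Group short texts of the same category into composite texts.

    texts: list of (category, word_count) in collection order.
    Returns a list of (category, [word_counts]) final texts.
    """
    final = []
    pending = defaultdict(list)
    for category, words in texts:
        if words > whole_threshold:
            final.append((category, [words]))      # kept as a whole text
            continue
        bucket = pending[category]
        if bucket and sum(bucket) + words > target:
            final.append((category, list(bucket)))  # close the current group
            bucket.clear()
        bucket.append(words)
    for category, bucket in pending.items():
        if bucket:
            final.append((category, list(bucket)))  # leftover partial groups
    return final


collected = [
    ("Letters", 600), ("Reportage", 1800), ("Letters", 900),
    ("Letters", 700), ("Reportage", 400),
]
grouped = group_texts(collected)
for category, parts in grouped:
    print(category, parts, sum(parts))
```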

Table 6: Number of texts and number of words in each category

Text Categories                                      No. of Texts   No. of Words

Written
  Non-printed
    Student Writing   Summary of project                  1              762
    Letters           School/Business/Social              2             3352
    Speeches          Formal                              3             5421
  Printed
    Academic          Various Topics                      2             3354
    Popular           Various Topics                      3             5528
    Reportage         Press reports                       9            16885
    Instructional     Administrative/hobbies              5             7897
    Persuasive        Editorials                          2             3226
    Creative          Novels                              3             5735
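
The column totals of Table 6 can be checked with a short calculation. The figures below are copied from the table as it appears in this copy; the one-million-word target comes from section 3.1.1, and the variable names are illustrative.

```python
# (category, number of texts, number of words) as printed in Table 6
TABLE_6 = [
    ("Student Writing", 1, 762),
    ("Letters", 2, 3352),
    ("Speeches", 3, 5421),
    ("Academic", 2, 3354),
    ("Popular", 3, 5528),
    ("Reportage", 9, 16885),
    ("Instructional", 5, 7897),
    ("Persuasive", 2, 3226),
    ("Creative", 3, 5735),
]
CORPUS_TARGET_WORDS = 1_000_000


def totals(rows):
    """Return (total texts, total words) over all categories."""
    n_texts = sum(texts for _, texts, _ in rows)
    n_words = sum(words for _, _, words in rows)
    return n_texts, n_words


n_texts, n_words = totals(TABLE_6)
share = 100 * n_words / CORPUS_TARGET_WORDS
print(f"{n_texts} texts, {n_words} words, {share:.1f}% of the target")
```

The text column sums to the 30 final texts reported above. The word column, as printed here, sums to 52,160, which is 200 more than the 51,960 reported in section 4.1.2; the text does not indicate which figure is correct.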

4.1.4 Permission Letters

As mentioned above, copyright issues were one of the most important aspects to consider for the

collection of texts. The authors’ details were recorded alongside the texts when the Internet was

searched. However, it was noticed that many texts did not contain any details of their owners. When

compiling the actual ICE-Mauritius, those texts should be rejected from the beginning since they

could not be used without permission, but for the pilot project all of the texts collected were used

even if no permission was obtained since they would be kept only temporarily and would not be

made available to the public.

Two letters prepared by Al-Sulaiti (2004) to request permission for the use of texts available

online were used and sent to the authors of the texts that had been collected. One, for the owners and

authors to keep, explained the purpose of the corpus, and the other had a return slip for

them to sign if they agreed to their websites being used. Samples of the two letters can be found

in Appendix E.

It took one full day to send twenty-three of these letters out by email. Due to the lack of time,

they were sent only by email and replies were expected mostly by email, since it was estimated

to take two weeks for a letter to reach Mauritius and another two weeks to get a reply if the author

sent it back straight away by post. Out of the twenty-three letters, four were not delivered because

the addresses available on the Internet were wrong. To date, three authors had given their permission

and were happy to help, and one of them even asked for comments on his novel. However, one

did not agree and asked for formal support from the University of Leeds and a

complete CV. The outcomes showed that much more time and effort would be needed to obtain

permissions for the follow-up project.
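
The response figures above can be summarised with a small calculation. The counts are taken from the paragraph; treating the request for formal support as a refusal, and treating every remaining delivered letter as unanswered, are assumptions made for this sketch.

```python
def permission_summary(sent, undelivered, granted, declined):
    """Summarise the outcome of a round of permission requests."""
    delivered = sent - undelivered
    replied = granted + declined
    return {
        "delivered": delivered,
        "replied": replied,
        "no_reply": delivered - replied,       # inferred, not stated in the text
        "grant_rate": granted / delivered,     # share of delivered requests granted
    }


summary = permission_summary(sent=23, undelivered=4, granted=3, declined=1)
print(summary)
print(f"Granted: {summary['grant_rate']:.0%} of delivered requests")
```

On these assumptions, about one delivered request in six had been granted at the time of writing, which supports the conclusion that the follow-up project would need to allow considerably more time for permissions.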

Table 7 lists the sources for which copyright permission had been received.

Source                                                      Contacts
http://mauritiustimes.com/040205mr.htm                      Madhukar Ramlallah [email protected]
http://pages.intnet.mu/rajbalkeehomepage/hd-complete.htm    Raj Balkee [email protected]
http://ile-maurice.tripod.com/rougpoisal.htm                Madeleine Philippe [email protected]

Table 7: Sources with copyright permission

4.1.5 Layout of the Pilot Corpus

The design of the corpus had evolved slightly from the original plan since it was found that it

would be useful to have a folder for the marked-up corpus and one for a raw corpus. The latter

contained Word documents of the actual texts together with a table which included details such as

title, author, publisher, place of publication, date, source, email of author and the number of words

which had been collected and had been used in the header. The raw corpus folder also contained

the HTML files of the texts taken from the source but with the extra coding slightly stripped off

manually. Although building up this separate raw corpus took some time, it made the

annotation process much quicker and easier and hence did not affect the overall length of the

project. Each small text was marked-up individually before being grouped together and was

stored within a sub-folder in the corresponding category in the main marked-up corpus folder.

Both the raw corpus and the marked-up corpus folders were divided into the following sub-folders

for the different categories: Academic, Editorial, Instructional, Letter, Novel, Popular, Reportage,

Speech and Student Writing. The texts in the different folders were cross-referenced by their

names.
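
The two-folder layout described above can be sketched with pathlib. The category names are those listed in the text; the top-level folder names, the .doc extension for raw files and the function names are assumptions introduced for the example.

```python
from pathlib import Path

# Category sub-folders as listed in the text.
CATEGORIES = [
    "Academic", "Editorial", "Instructional", "Letter", "Novel",
    "Popular", "Reportage", "Speech", "Student Writing",
]
# Hypothetical names for the two top-level folders.
RAW, MARKED_UP = "raw_corpus", "marked_up_corpus"


def planned_dirs(root):
    """List every category folder under both top-level folders."""
    root = Path(root)
    return [root / top / cat for top in (RAW, MARKED_UP) for cat in CATEGORIES]


def create_layout(root):
    """Create the folder tree on disk (safe to call repeatedly)."""
    for directory in planned_dirs(root):
        directory.mkdir(parents=True, exist_ok=True)


def paths_for(root, category, text_id):
    """Cross-reference a text by name: (raw document, marked-up XML file)."""
    root = Path(root)
    raw = root / RAW / category / f"{text_id}.doc"
    marked_up = root / MARKED_UP / category / f"{text_id}.xml"
    return raw, marked_up


raw_path, xml_path = paths_for("corpus", "Letter", "LETT01")
print(raw_path)
print(xml_path)
```

Because both trees share the same sub-folder names and file stems, a text can be located in either folder from its ID alone.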

4.2 Corpus Annotation

4.2.1 TEI-Header

Some amendments had to be made to the adapted TEI-header since some of the information required was

not available from the Internet and it was impossible to obtain it in such a short time.

The information in the Profile Description concerning the author’s

characteristics, such as age, education, occupation and first language, had to be removed. The new

Profile Description that was used for the pilot project is shown below and for a full template of the

header, please refer to Appendix F.

Profile Description <profileDesc>

<profileDesc>
  <creation>
    <date value="2005-02">Feb 2005 </date>
    <rs type="city">Pointe Aux Sables, Mauritius </rs>
  </creation>
  <langUsage>English</langUsage>
  <textClass>
    <textDesc n=" ">
      <channel mode="w">print; written</channel>
    </textDesc>
    <particDesc>
      <person id="P1" sex=" "> </person>
    </particDesc>
  </textClass>
</profileDesc>

However, even after careful consideration of which fields to include in the header, some of the

information was still missing when the texts were encoded. For many texts, the author, the

publisher or the publication date was not available online. Therefore, those fields were filled in

with “unknown”. In fact, among the twenty-nine texts collected, only six of them were complete.

Obtaining this information would be another task that would require extra effort and time in the

follow-up project.

4.2.2 Texts Encoding

During the text encoding stage the time taken for processing was calculated. Using the procedures

described previously, the time taken to go through the six steps was approximately 20 minutes,

depending on the information available to fill in the header but regardless of the length of the texts

since the paragraphing was done automatically. If further files from the same site were collected,

the header could be reused with some minor adjustments to fit the new text. This took

less time than the first file, between 5 and 10 minutes. A sample of a raw text can be found

in Appendix G while a sample of the encoded text can be found in Appendix H.

When encoding the texts, some problems did surface with the viewing of the XML files.

Some of the common error messages that were displayed when the XML files were opened

with Internet Explorer are shown in Table 8.

Table 8: Errors during encoding of texts

XML Files                 Error Message
LETT02.xml - Figure 7     ‘whitespace not allowed’
REP05.xml - Figure 8      ‘A semi colon character was expected’
NOV01.xml - Figure 9      ‘End tag “P” does not match the start tag “h”’

Figure 7: Error as shown when “LETT02.xml” was opened in Internet Explorer

Figure 8: Error as shown when “REP05.xml” was opened in Internet Explorer

Figure 9: Error as shown when “NOV01.xml” was opened in Internet Explorer

In the event of those problems, the file was opened in a program called UniRed, which is

freeware available at http://sourceforge.net/projects/unired. UniRed is a Unicode plain text editor

for Windows and it supports many character sets including UTF-8 and mark-up languages such as

XML and HTML. If an error does exist in the XML file, the program identifies the error by

highlighting it in red. A green highlight means that the code is correct: it has an

opening and a corresponding closing tag.

Figure 10: Screenshot of “LETT02.xml” in UniRed editor

A screenshot of the error from “LETT02.xml” in the UniRed editor is shown in Figure 10 above.

The XML tag that appeared red in the middle (the shaded ‘&’ character) of the screenshot meant

that the code was invalid. The error was related to some unusual characters or signs which needed

to be modified to be accepted by XML. Here, in this example, the ‘&’ sign needed to be written

as ‘and’ or as ‘&amp;’.

Figure 11: Screenshot of “REP05.xml” in UniRed editor

A different error message was displayed for “REP05.xml”. However, after the file was opened

in UniRed (Figure 11), it was noticed that the error was related to the same unusual sign, the

‘&’ sign. The only difference was that the error occurred in the URL address of the source

(http://www.businessmag.mu/displayNewsContent.asp?NID=5747&CID=30), where it could only

be changed to ‘&amp;’, since writing ‘and’ would alter the address.
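
Both ampersand errors can be reproduced and fixed with Python's standard library. This is a sketch, not the method used in the project (where the files were corrected by hand in UniRed): the URL is the one quoted above, the surrounding <p> element mirrors the header's source description, and the "Food & Drink" text is an invented example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


url = "http://www.businessmag.mu/displayNewsContent.asp?NID=5747&CID=30"

# A bare '&' in running text or in a URL makes the file ill-formed.
broken = f"<p>created in machine-readable form in {url}</p>"
# escape() rewrites '&' as '&amp;' (and also '<' and '>').
fixed = f"<p>created in machine-readable form in {escape(url)}</p>"

print(is_well_formed(broken))
print(is_well_formed(fixed))
print(ET.fromstring(fixed).text)   # the parser restores the original '&'
```

Escaping keeps the address intact for anyone reading the parsed text, which is why ‘&amp;’ is the only safe replacement inside a URL.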

However, in “NOV01.xml”, a completely different error was spotted. The file could not be parsed

because of a missing closing tag. In UniRed (Figure 12), it was found that the opening tag <h> in line

5 did not have a matching closing tag </h>. This error was shown by highlighting in red the next

opening tag (the shaded “<” character).

Figure 12: Screenshot of “NOV01.xml” in UniRed editor
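
A missing closing tag of this kind can also be located programmatically. The sketch below uses Python's ElementTree parser rather than UniRed or Internet Explorer, so the wording of the error message differs from Table 8; the sample document is invented and only imitates the unclosed <h> element described above.

```python
import xml.etree.ElementTree as ET


def find_wellformedness_error(xml_text):
    """Return None if xml_text parses, else (line, column, message)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        line, column = err.position
        return line, column, str(err)
    return None


# Invented sample: the <h> element is never closed.
broken = (
    "<text>\n"
    "<body>\n"
    "<h>Chapter One\n"
    "<p>First paragraph.</p>\n"
    "</body>\n"
    "</text>"
)
fixed = broken.replace("<h>Chapter One", "<h>Chapter One</h>")

print(find_wellformedness_error(broken))
print(find_wellformedness_error(fixed))
```

As in UniRed, the error is reported at the tag where the mismatch is detected, not at the unclosed <h> itself, so the line before the reported position is usually the place to look.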

Since UniRed was not only faster to use but also guaranteed that the files were correctly

saved and could be viewed in the browser with no problems, this method was tested and

compared with the former method, namely, creating the text in Microsoft Word and then

converting it to XML. Processing with the UniRed method proved to be more efficient,

taking only around 5 minutes.

After the errors had been corrected, the three files mentioned above, “LETT02.xml”, “REP05.xml”

and “NOV01.xml”, should look as shown in Figures 13, 14 and 15 respectively when they are

opened again in Internet Explorer. For a full display of how a file should look, refer to Appendix

H for another example.

Figure 13: Expected output for “LETT02.xml”

Figure 14: Expected output for “REP05.xml”

Figure 15: Expected output for “NOV01.xml”

5. The Proposal

This section is also related to the implementation stage but instead of a software implementation,

it describes how the possible extension laid out in section 1.3 was achieved. After the compilation

of the pilot ICE-Mauritius project, it was possible to write a work-plan and a proposal for a

follow-up project to develop a full-scale ICE-Mauritius corpus and extend the methodology to a

much more ambitious multinational ICE Corpus.

5.1 Funding Opportunities

5.1.1 Research at University of Leeds, School of Computing

The School of Computing web site (University of Leeds, 2004) states that the School has been

‘awarded a Grade 5 in the 2001 Research Assessment Exercise (RAE), confirming the School's

status as a leading research institute for computing’. The research activity within the School is

grouped into five categories, namely, Computer Vision and Language, Knowledge Representation

and Reasoning, Scientific Computing and Visualization, Theoretical Computer Science and

Informatics.

The School may offer scholarships but most research staff and students who need grants for their

research will have to apply to Research Councils, namely, to the Engineering and Physical

Sciences Research Council (EPSRC). Therefore, to develop the full-scale ICE-Mauritius, an

application to the EPSRC will be made. In order to fill in the application form, further research

on the requirements of EPSRC has been carried out and is briefly described below.

5.1.2 Introduction to EPSRC - The Engineering and Physical Sciences Research Council

The Engineering and Physical Sciences Research Council (EPSRC, 2004) is ‘the UK

Government's leading funding agency for research and training in engineering and the physical

sciences’. The EPSRC operates, mostly, by funding research projects in universities and other

research organisations. The funds are intended to meet the direct costs of the research project,

together with a contribution towards the indirect costs involved (EPSRC Funding Guide, 2004).

The majority of funding from the EPSRC is provided through the Responsive Mode, but other

funding routes, such as Fellowships, are available. ‘Calls for Proposal’ are also

available, where strategic opportunities are announced and researchers can choose from the list

provided. There is no minimum or maximum funding, and no minimum or maximum period

(EPSRC Funding Guide, 2004).

The EPSRC funds ‘a dynamic and evolving research portfolio, extending from fundamental

research in mathematics, chemistry, computer science and physics to more applied topics in

engineering and technology’ (EPSRC, 2004). Many of the EPSRC research activities are co-

funded between programmes to encourage multidisciplinary collaborations since major

breakthroughs often arise when researchers from other related disciplines work together.

5.1.3 Eligibility of Investigators

Principal investigators should be permanent employees of an eligible research organisation (all UK

universities and similar research organisations are eligible organisations). Fixed term employees

may be eligible provided that the organisation will give all the support normal for a permanent

member of staff and that there is no conflict of interest between the investigator’s obligations to the EPSRC

and the other organisation (EPSRC Funding Guide, 2004).

‘Research Assistant can be identified as Co-Investigators if they have made a substantial

contribution to the development of the application and will be closely involved with the project, if

funded. Then the application can seek funds for the assistant’s salary for the duration of the

project’ (EPSRC Funding Guide, 2004). A research assistant cannot be the principal investigator.

Moreover, research proposals will not be considered from an applicant who was the principal

investigator of another grant and who has not yet finished producing the Final Report.

5.1.4 Research Opportunities

The majority of funding from the EPSRC is provided through the Responsive Mode, where the

research idea is determined by the applicant and where the proposals can be submitted at any

time. The main criterion against which the proposal is assessed is the ‘intrinsic engineering or

scientific excellence’ (EPSRC Funding Guide, 2004) as determined by peer review. EPSRC

especially encourages research proposals that are adventurous with new concepts and techniques.

First Grant Scheme

The First Grant Scheme is used to assist individuals at the beginning of their academic careers by

offering them a research grant. To be eligible for the First Grant Scheme, candidates must

have been appointed to their first academic lecturing appointment in a UK university within

the previous 24 months and should be within ten years of completing their PhD. Candidates

wholly employed as research fellows are not eligible to apply (EPSRC Funding Guide, 2004).

The scheme provides up to £120,000 for support. Proposals that have received two or more

strong references will be considered by a peer review panel along with the other First Grant

applications. First Grant proposals will not be considered against other types of proposals at the

same time.

5.1.5 How to Apply

Since 31 March 2005, applications for research grants can only be made via an electronic form

through the Je-S (Joint Electronic Submission) system and each application should be

accompanied by a self-contained ‘case for support’.

The ‘Case for Support’ comprises the following (EPSRC Beginners’ Guide, 2004):

• Previous track records (2 sides A4)

• Description of the proposed research and context (purpose, background, project,

resources, applications, collaboration) (6 sides A4)

• Diagrammatic work plan (1 side A4)

• Annexes (CVs, references, letters of support, equipment quotes, illustrations and named

research assistants)

Good applications contain a ‘Case for Support’ which is clear, concise and uncluttered with

technical jargon. The main criterion to determine the grade assigned to any grant proposal will be

its scientific quality, but ‘viability and planning, cost-effectiveness and dissemination plans can be

taken into account’ (EPSRC Mock Panel Guidance Notes, 2004). In addition, for a First Grant

proposal, the applicant’s own plans for developing their research career and the commitment of the

university to career development may be considered.

5.2 Writing up the Proposal

5.2.1 Original Idea

The development of the pilot project, up to 5 percent of the size of a full corpus, showed

that a full-scale ICE-Mauritius was feasible using only the Internet to collect the texts.

Therefore, a proposal was drafted in accordance with the EPSRC requirements, with a view to compiling

a high-standard application which could actually be sent for funding.

First, the Je-SRP1 (EPSRC) application form was downloaded from the EPSRC website and filled

in. However, many sections on the form could not be filled until a full detailed plan of how the

pilot project had been developed was written. For instance, sections N (Travel and Subsistence)

and O (Consumables) were difficult to fill in without knowing the actual tools and stages needed

for the project. Moreover, sections such as J (Objectives) and K (Summary) were not of the best

quality when written before more consideration was given to the outline and plan of the project.

Therefore, it was decided to write the Case for Support first. Writing the Case for

Support was not an easy task since it had to be clear, concise and attractive. It had to state how

the corpus would be collected and annotated, the standards it would follow, the tools and staff

needed, and the length of the project. To provide these details, and to extrapolate how much

time and effort would be needed to collect the full corpus, further research and calculations

were made on the process and development of the pilot project. Word counts and time taken

were worked out for the texts collected and for the texts edited and marked up, and these were

used to estimate lower and upper bounds on the time and person-months needed for the full

corpus. From these estimates, the initial research work plan to collect a one-million-word

corpus of Mauritian English was organised into seven activity streams, which would take up to

three years to be completed by one post-doctoral research fellow.

As mentioned previously, evidence from the pilot project showed that with this Internet collection

technique, the corpus would contain fewer than one million words because only a limited set of text

categories is available on the World Wide Web, and this would also result in an unbalanced corpus.

One way around this problem was to collect more texts from the available categories to compensate for the

missing ones. Another solution was to expand the corpus in a different dimension, as

explained in the next section.

5.2.2 Expansion of Corpus Design

To compensate for the small number of texts and for the unbalanced text categories, it was

decided instead to expand the corpus to include other varieties of English from other

English-speaking countries. This would also result in a more ambitious and adventurous project, which are

the characteristics that the EPSRC is looking for. With this new objective for the proposal, more

research was needed to find other countries where either English is the official language or where

English is one of the main spoken languages. Twenty countries were chosen to form part of the

corpus and they are as follows: Bahamas, Bangladesh, Barbados, Bermuda, Botswana, Cayman

Islands, Cyprus, Dominica, Gambia, Gibraltar, Grenada, Liberia, Malta, Mauritius, Namibia,

Pakistan, Seychelles, Uganda, Zambia and Zimbabwe. As in Mauritius, English in most of these

countries had in one way or another been highly influenced by other languages, either brought by

ancestors or derived from their culture. The corpus would hence allow an interesting and useful

analysis of the variation in English across the nations.

Since collecting a fully balanced one million words for each country would be impossible, it

was decided that the proposal would target only half a million words for each country, resulting in

an “ICE-lite”. The term “lite” was borrowed from other simplified projects such as “TEI-lite”

which meant a simpler version of TEI, a standard XML-markup convention for text corpora (TEI,

2005). To provide compatibility and an enhanced comparison with the other existing projects, the

“lite” version of the 20 teams already in ICE (mentioned in section 2.1.3) would also be included

in the corpus. For each country, numerous websites were easily accessible via the World Wide

Web, and different text categories were available. Therefore, the corpus would aim to contain

approximately 20 million words taken only from the Internet. This meant that more staff would

be required and a new work plan was needed. New ambitious estimates were then calculated.

This was done by taking the time needed to collect (20 minutes on average per text) and

annotate (20 to 30 minutes per text) each of the thirty texts obtained (figure given in section 4.1.3 above),

multiplying by 250 texts to obtain the estimates for one country, and then

multiplying the result by 40 for the whole ICE-lite corpus. The overall expected completion time of

the project was kept to three years but, instead of only one research fellow, two more would be

needed. The new research work plan for the ICE-lite was then organised into eight activities as

listed below:

WP1: Collection of Spoken and Written Text of English

WP2: Transcription

WP3: Textual Mark-up

WP4: Word-class tagging

WP5: Syntactic parsing

WP6: Evaluation

WP7: Comparison across dialects

WP8: Dissemination for Exploitation
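
The extrapolation described before the list of work packages can be sketched as a short calculation. This is only an illustration in Python. The per-text timings (20 minutes to collect, 20 to 30 minutes to annotate), the 250 texts per country and the 40 countries are taken from the text; the function name and the figure of 140 working hours per person-month are assumptions made for this example and do not come from the proposal.

```python
def extrapolate_effort(texts_per_country=250, countries=40,
                       collect_min=20, annotate_min=(20, 30),
                       hours_per_person_month=140):
    """Scale the pilot timings up to lower/upper effort bounds.

    hours_per_person_month is an assumed conversion factor, not a figure
    from the proposal; change it to match the local definition.
    """
    low_per_text = collect_min + annotate_min[0]    # minutes, lower bound
    high_per_text = collect_min + annotate_min[1]   # minutes, upper bound

    # Minutes needed for one country, then for the whole ICE-lite corpus.
    per_country = (texts_per_country * low_per_text,
                   texts_per_country * high_per_text)
    total = (per_country[0] * countries, per_country[1] * countries)

    # Convert total minutes to person-months using the assumed factor.
    person_months = tuple(m / 60 / hours_per_person_month for m in total)
    return {"per_country_min": per_country,
            "total_min": total,
            "person_months": person_months}


if __name__ == "__main__":
    result = extrapolate_effort()
    print(result["per_country_min"])  # (10000, 12500)
    print(result["total_min"])        # (400000, 500000)
    print([round(pm, 1) for pm in result["person_months"]])
```

These figures cover only the collection and annotation timings measured in the pilot; any other activity in the work plan would have to be estimated separately.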

The expanded project would thus benefit the government and the educational system

in each of the twenty countries mentioned above, as well as the existing ICE teams. A

comprehensive description of the different types of English could be obtained from the corpus,

and each country would therefore be able to develop its own reference guides to usage,

dictionaries and other teaching materials. This could help both schools and universities to

adapt their teaching methods, and in particular to raise the standard to which English is taught

and spoken. Comparing the dialects of English to find striking similarities or differences

would be useful for further research and teaching methods in each country. It would also

benefit people who wanted to travel to or trade with other English-speaking countries, since

the comparison would provide a useful insight into how they would have to adapt their

language. Once released, the corpus would also be beneficial to other research or academic

institutions across the world. It could be used for comparison or further research by the

compilers of existing or future corpora.

Longer-term impacts of the work to be done included:

• Promoting cooperation between English-speaking countries for the purpose of

developing basic components for the linguistic society.

• Easing the entry of English-speaking countries into different markets.

• Promoting the different cultures of the 40 countries across the world.

5.2.3 Writing Up Proposal

Once the estimates were calculated and the work plan designed, the Case for Support was written

more easily, and the application form also became much easier to fill in since the figures were

readily available. The only difficulty was dividing the work among the three research fellows to

make completion possible within three years. This was done by using only the

lower limits of the estimates, and therefore resulted in quite a tight schedule. Other estimates were

calculated for the costs of travelling, consumables, etc. The cost of staff should

have been calculated through the COSTA system of the University at

http://www.leeds.ac.uk/rsu/COSTA.htm, but because students have restricted access to it, the estimated

costs were taken from another proposal by Atwell and Al-Sulaiti (2005). It is important to note

that one paper application (in Word format) allows details of only two researchers to be filled in.

Therefore, to make the proposal complete, a second application was needed to add the details of

the third researcher. However, due to the space limit of this report, the second application form

could not be added in the appendices and sections 2 and 3, which requested personal information

about the referees and the investigators, were also omitted. Copies of the first draft of the Case

for Support (which was sent for evaluation) and the first application form are shown in

Appendices I and J respectively.

6. Evaluation

To measure the success of a project, it needs to be evaluated against a number of relevant

criteria. For this particular project, the criteria set were:

• Product: Evaluates the design and compilation of the final product.

• Minimum Requirements: Evaluates what minimum requirements are met.

• Project Stages: Evaluates the methodology used to produce the final product.

• Planning and Schedule: Evaluates the planning of the project from start to finish.

6.1 Product

The product was evaluated by three subject experts, namely Eric Atwell, Gerald Nelson and

Serge Sharoff. Eric Atwell was the supervisor of the project. His evaluation is not

discussed in this report since he provided feedback throughout the course of the whole project.

Gerald Nelson, from UCL, is the coordinator of the International Corpus of English and has been

directly involved in the development of ICE-GB, the British component of ICE. Serge Sharoff,

from the Centre for Translation Studies of Leeds University, has been involved in several corpus

developments, such as a Russian corpus and a Chinese corpus, which he has collected only

through the Internet.

Evaluation from Gerald Nelson:

Both the proposal and part of the pilot project were sent to Gerald Nelson and his first explicit

comment was “May I say, first of all, that I am very impressed by this proposal. It shows an

amazing knowledge of corpus linguistics, and of issues in world Englishes.” It can therefore be

said that both the proposal and the pilot project met the requirements and were of a good

standard. In his feedback, Gerald Nelson also implicitly suggested some improvements that

could be made before the proposal is sent to the EPSRC, and some issues that should be addressed

concerning the ICE-lite if the funding is obtained.

The issues raised about the ICE-lite were:

• The text files should be named according to the ICE coding scheme, not as LETT01, etc. as

described above.

• The TEI headers should be stored externally as separate files.

• The details in the headers should follow the ICE scheme.

• Permission letters could cause problems for other ICE teams since they are strictly

non-commercial, whereas the permission letters sent stated “We may also want to use the text(s)

for developing electronic products such as translators and dictionaries.”

• The distribution method of the ICE-lite.

Gerald Nelson agreed to send full details of the ICE filename and header conventions by

email but, owing to his busy schedule, he was not able to do so before the report was due. As a result,

no improvement could be made to the pilot project. Also, for the purpose of this project, it was

decided to set aside the issues of non-commercial use and distribution until funding

was obtained.

Improvements to the proposal include:

• Gerald Nelson suggested that the parsing should be dropped altogether since the syntactic

parsing of the whole corpus is quite unrealistic, given the timescale involved. For ICE-GB, it

took about 3 years to parse one-million words, and there were six or seven part-timers

working on it. He also suggested that the aim should be to produce a fully-checked POS-

tagged corpus and to consider the parsing as another follow-up project.

• Changes to the wording of the proposal, such as:

o Page 1, paragraph 1: "where English is the main language" to be changed to "where

English is the first language or second official language".

o Page 2, line 1: Delete "Australia" as it is not yet available.

o Page 2, line 5: "and other freely available sources": more details should be given.

o Page 4, line 3: "a software" should be changed to "a program".

o Page 5, Staff: It is unlikely to get post-doctoral researchers working on this project.

Therefore “post-doctoral” should be changed to "post-graduate".

Despite the small changes needed and based on Gerald Nelson’s comment which he added at the

end of the feedback: “As I said, this is a very impressive proposal, and you can count on my full

support (and the Survey's) if it gets funded”, the pilot project and the proposal can be considered

successful and of great potential for a follow-up project.

Evaluation from Serge Sharoff:

After reading the proposal, Serge Sharoff implicitly expressed his approval by saying “I read the

proposal with interest”. By commenting on possible extensions and on how he could contribute,

he also showed that he wanted to participate and that he considered the proposal feasible and

worth following up.

He proposed to contribute in two ways, as described below:

• In WP6 (Evaluation), he proposed to add a lexical comparison of the new ICE-lite against the

British National Corpus (similar to what he had done in one of his papers on Internet corpora).

• In WP8 (Dissemination), he proposed to disseminate data through his web interface, which he

referred to as the Leeds CQP interface. There is no publication on it yet, but he is

willing to write a paper on it if the project goes ahead.
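
A lexical comparison of the kind proposed for WP6 can be illustrated with a small sketch. The report does not say which statistic would be used, so the log-likelihood score below is an assumption: it is one common way of finding words whose frequencies differ most between two corpora. The function names and the toy token lists are hypothetical.

```python
import math
from collections import Counter


def log_likelihood(freq_a, size_a, freq_b, size_b):
    """Log-likelihood score for one word seen freq_a times in corpus A
    (size_a tokens) and freq_b times in corpus B (size_b tokens)."""
    combined = freq_a + freq_b
    expected_a = size_a * combined / (size_a + size_b)
    expected_b = size_b * combined / (size_a + size_b)
    score = 0.0
    # A zero observed frequency contributes nothing to the sum.
    if freq_a > 0:
        score += freq_a * math.log(freq_a / expected_a)
    if freq_b > 0:
        score += freq_b * math.log(freq_b / expected_b)
    return 2 * score


def compare_corpora(tokens_a, tokens_b, top=10):
    """Return the `top` words whose frequencies differ most, highest score first."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    size_a, size_b = len(tokens_a), len(tokens_b)
    scores = {
        word: log_likelihood(counts_a[word], size_a, counts_b[word], size_b)
        for word in set(counts_a) | set(counts_b)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top]


if __name__ == "__main__":
    # Toy data only; real input would be tokenised corpus texts.
    sample_a = "the bus stopped near the market near the port".split()
    sample_b = "the train stopped at the station at the border".split()
    for word, score in compare_corpora(sample_a, sample_b, top=5):
        print(word, round(score, 2))
```

A word with the same relative frequency in both corpora scores zero, so the top of the list shows where the two varieties diverge most.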

Another useful suggestion from Serge Sharoff was to use Google to estimate the

amount of source text available for each country. He had tried finding English texts from Mauritius

by entering the query “allintext: that OR in OR for site:.mu” in Google, and this returned

125,000 English pages, corresponding to about 250 million words (if an average Internet page

is about 2000 words). This method could therefore be used to estimate the amount of text available

online for each country in the ICE-lite project.
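
The size estimate above is simple arithmetic and could be repeated for each country. The following Python sketch rebuilds the query string quoted in the text for a given country-code domain and converts a page count into a word estimate. The function names are hypothetical, the page count has to be read off the search results by hand (no search API is called), and the 2000 words per page is the rough average assumed in the text.

```python
def build_query(tld, function_words=("that", "in", "for")):
    """Build a query in the form quoted above for one country-code domain."""
    return "allintext: " + " OR ".join(function_words) + " site:." + tld


def estimate_words(page_count, words_per_page=2000):
    """Estimate total words from a page count and an assumed average page length."""
    return page_count * words_per_page


if __name__ == "__main__":
    print(build_query("mu"))       # allintext: that OR in OR for site:.mu
    print(estimate_words(125000))  # 250000000
    # The same query for another country, e.g. the .gm domain of Gambia.
    print(build_query("gm"))
```

The result is only an order-of-magnitude guide, since a country-code domain says nothing about who wrote a page.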

He also raised an important issue concerning the collection of the texts. According to him, it

would be difficult to know whether a text was written by someone from a specific country. That

is, one could not be sure that a text obtained from a Gambian website, for instance, was actually

written by someone born in Gambia. This problem was not encountered in the pilot project:

coming from Mauritius, I could easily tell a text written by a Mauritian citizen from

one which was not, either by looking at the name of the author or by looking at the

structure and the words used, since Mauritian English has its own particularities and often includes

words from other dialects.

However, this could be a problem for a full-scale project, and the issue would need

further investigation if the proposal were to be funded. Due to lack of time, it could not

be resolved in this project.

6.2 Minimum Requirements

The minimum requirements were:

• Develop a small-scale prototype of the Mauritian Corpus of English.

Section 3 described how the prototype had been designed and planned, with details of the

different tools and techniques that are available for use. The development of the prototype itself

was detailed in sections 4.1 and 4.2. As the prototype was being developed, some amendments to

the original plan were needed. Overall, the prototype makes up 5 percent of the full-scale

ICE-Mauritius Corpus.

• Survey of computer technologies for corpus development and processing.

The different technologies available for corpus development and processing had been mentioned

throughout the whole of the report, but more particularly, the different taggers and parsing

systems available worldwide were outlined in section 2.3 while the techniques used specifically

for ICE were described in section 2.2.

The possible extension was:

• Work plan for a follow-up project to develop a full-scale ICE-Mauritius corpus.

To be able to build a work plan for a follow-up project, the pilot project had to be well understood

and documented (which formed part of sections 4.1 and 4.2 above). Research into the relevant

Research Council, namely the Engineering and Physical Sciences Research Council (EPSRC),

also had to be carried out in order to know the requirements for grant applications. These

requirements were described in section 5.1 while the steps taken in writing the application form

and the proposal were described in section 5.2.

6.3 Project Stages

The overall quality of the project was also assessed by applying the following criteria to each of

the different stages of the project, to see whether they were appropriate for solving the initial problem and

relevant to the development of the solution. The criteria were:

• Was the background research of a suitable standard, did it help to understand the problem and

did it help to gather the learning requirements?

• Was the chosen methodology suitable for the project and was it adhered to?

• Were the requirements gathered effectively and did the final product successfully meet these

requirements?

• Were the appropriate technologies used in the creation of the pilot corpus?

• Did the project solve the initial problem, was the final prototype of sufficient standard to

prove the feasibility of a full-scale project, and was the proposal of sufficient standard to send

to the EPSRC for funding?

Background research

The background research helped to fully understand the problem and therefore what the project

should actually achieve. It gave an insight into the emergence of corpora and their increasing uses

in teaching and research. Research on ICE showed that only a small number of

corpora exist and that many English-speaking countries could benefit from a corpus of their own

variety of English. Findings from the ICE website and other books on corpora were then used to

design and set the standards for ICE-Mauritius. In addition, the different techniques available

were researched to allow and facilitate the collection and annotation of the pilot corpus.

Methodology

The most significant problem encountered in the course of this project was the need to

modify the aims and requirements of the project at the beginning of the second semester. This

also meant changing the work plan and methodology.

The “Feedback Model” used for this project, as described in section 3, proved to be a good choice

throughout the project. Many changes were made to the initial design after flaws became apparent

in the encoding phase of the project. The following steps were taken during the development:

• First the problem was analysed, that is, the need for a Mauritian Corpus was identified

(section 2).

• Then a system study was carried out and the findings showed that collecting a corpus is costly

and time-consuming and that using the Internet would be a solution to the problem (sections 1 and 2).

• The pilot project was designed next and this was explained in section 3.

• In the following stage, the corpus was collected and annotated (section 4) and the proposal

written (section 5).

• As the collection and annotation were carried out, it was found that many changes in the design

were needed (sections 4 and 5).

• Finally, the pilot project was evaluated as described in section 6.1.

Corpus requirements

The initial requirements for the ICE-Mauritius were gathered through the ICE website and other

ICE-related books, which contain almost all that needs to be included in an ICE sub-corpus.

However, some more detailed requirements, such as the naming scheme for each text or the minimum

amount of information to be included in the header, were not specified. As mentioned above,

Gerald Nelson from UCL agreed to send those details during the Easter break but was not able to

do so in time. Therefore, only the basic requirements from the official ICE website were

applied to the pilot ICE-Mauritius.

In addition to this, annotation requirements were also gained through the background research

into the different technologies available. These were general requirements that any corpus

should have and were not related to ICE.

The prototype was evaluated by the people mentioned in the section above to see if the initial

design was adequate, as well as to provide additional feedback. They agreed that

the pilot project did meet the basic requirements of ICE.

In relation to the proposal, the requirements were taken directly from the EPSRC application

guide. According to the feedback obtained, the proposal did meet the requirements of the EPSRC

and hence constituted a potential application for a follow-up project.

Technologies

Other than using Microsoft Word to collect and annotate the corpus manually, other technologies

and tools were discussed throughout the report. It was seen that programs such as HTMASC

could facilitate the stripping of HTML coding from the texts to produce ASCII text files, while

UniRed was used to provide a faster and more error-free compilation and saving of the marked-up

texts. However, other specific corpus tools such as ICECUP or ICETREE could not be used and

tested since they are not freely and easily available to anyone.
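
The HTML-stripping step mentioned above can be illustrated with a minimal sketch. It does not reproduce HTMASC, whose behaviour is not documented here; it is a stand-in written with the `html.parser` module from the Python standard library, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text content of a page, ignoring tags, scripts and styles."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode entities such as &amp;
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when not inside a <script> or <style> element.
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_ascii(html):
    """Strip markup from an HTML string and return plain ASCII text.

    Text chunks are joined with spaces and whitespace is collapsed, so a word
    split by an inline tag would gain a space; characters outside ASCII are
    replaced with '?'.
    """
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(" ".join(parser.parts).split())
    return text.encode("ascii", errors="replace").decode("ascii")


if __name__ == "__main__":
    page = "<html><body><h1>News</h1><p>Port Louis &amp; <b>Curepipe</b></p></body></html>"
    print(html_to_ascii(page))
```

The output of such a step would still need manual checking before mark-up, since navigation menus and other page furniture also survive as text.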

Initial problem

The initial problem was the need for an ICE-Mauritius. Developing a full-scale ICE project

would have been impossible within this project. Therefore this project concentrated on developing a

prototype of the ICE-Mauritius, investigating data sources, instigating data collection and

looking at the different technologies available, in order to investigate the requirements and feasibility of a

larger-scale follow-on project. The pilot project, together with the feedback obtained, proved that

a full-scale ICE-Mauritius was feasible. However, a more ambitious follow-on project was

described in the proposal, which according to the feedback obtained, was a plausible application

to the EPSRC for funding.

6.4 Planning and Schedule

Up until the mid-project report, the schedule was followed very closely and everything was going

according to plan. However, after feedback was obtained from the assessor in January, it became

clear that a new direction for the project had to be devised with some changes to the aims and

requirements. This resulted in a new schedule for the second semester. Both the old and new

schedules were shown in section 1.5.

To meet the new aims and requirements, more work was required in a much more restricted amount of

time. More background research was needed to understand the problem, and the one

week allocated was not enough. Moreover, as the corpus was being developed, it was found that

it was difficult to design the corpus in advance since the categories to be included would vary depending on

the texts collected. Therefore, the texts had to be collected first and then classified accordingly.

Also, while the schedule stated that the feasibility investigation of ICE-Mauritius and the writing

up of the proposal would be done after the pilot project was compiled, drafting the proposal

alongside compiling the corpus was easier, since the different steps taken were noted as they were

carried out and new ideas kept surfacing for the final proposal. And while drafting the proposal,

the feasibility of ICE-Mauritius was itself being addressed.

The initial schedule had been created without taking into account that, just before the end of the

second term, other projects and essays would have to be submitted, and therefore not much time

would be available to work on the project. The schedule hence had to be revised again,

accounting for this flaw. With a clearer view of the amount of work the project would entail, the

development of the corpus and writing up the proposal were both scheduled to be completed

before the Easter break, to leave enough time to evaluate and write up the rest of the project

during the holidays. This goal was achieved and with only slight revisions of the corpus and of

the proposal needing to be done during the Easter break, there was enough time to evaluate the

project. However, the time needed to get feedback from the different people to whom the corpus and the

proposal were sent was underestimated. Feedback was obtained only in the last week of the

Easter break, leaving not much time to write the evaluation. Nevertheless, the write-up was

completed with a week to spare before submission and the time was used to revise the final report.

Schedule 3 below shows the revised project schedule.

Schedule 3: Revised Project Schedule for second semester

Dates                Milestones                      Tasks

24/01/05 - 31/01/05  Section 1                       Decide on new aims & objectives and design new plan

01/02/05 - 12/02/05  Section on Background Research  Research on methods available to extend the ICE corpus to Mauritius

13/02/05 - 20/02/05  Appendix C,D                    Collect sample texts from the Internet & send request for copyright permission

20/02/05 - 22/02/05  Appendix B and Section 4        Design layout and text categories of ICE Mauritius

23/02/05 - 18/03/05  Section 4 and Appendix E        Annotate corpus

23/02/05 - 18/03/05  Appendix F                      Draft a proposal for ICE-Mauritius

01/03/05 - 18/03/05  Section 4 and proposal          Investigate feasibility of ICE-Mauritius

18/03/05 - 18/04/05  Evaluation                      Evaluate corpus & proposal

01/04/05 - 26/04/05  Final Report                    Complete final report. Most chapters should be already partially written up, but may need reworking.

7. Conclusion

As stated in the first section, the aim of this project was “to develop a prototype of the Mauritius

component of the International Corpus of English, to demonstrate feasibility and potential

problems for a larger-scale follow-up project”. Throughout this project, both the benefits and

difficulties of developing a corpus, together with the techniques and tools available for its

development, were discovered. The outcome was a prototype of the ICE-Mauritius up to 5 percent

of the size of a full ICE corpus, together with a work plan for the follow-up project,

thereby showing the feasibility of an ICE-Mauritius collected only through the Internet. To summarise,

the project fulfilled its minimum requirements, as well as its suggested extended

requirements, and it went even further by providing a full proposal to the EPSRC for funding a much

wider and more ambitious ICE-lite project.

Despite some issues which would need further consideration for the ICE-lite, much interest and

approval were obtained from the two evaluators and field experts mentioned above, indicating its

success. Therefore, as future work, it is hoped that the proposal will be sent to

the EPSRC and that the prototype will be developed into a larger-scale project.

References:

Al-Sulaiti, L. (2004) Designing and Developing a Corpus of Contemporary Arabic. Unpublished MSc thesis. University of Leeds.

Atwell, E. and Al-Sulaiti, L. (2005) Development of the International Corpus of Arabic. EPSRC Application Form (not yet submitted). University of Leeds.

Atwell, E. (1983) Constituent Likelihood Grammar. ICAME Journal (7) pp. 34-67.

Atwell, E., Demetriou, G., Hughes, J., Schiffrin, A., Souter, C., and Wilcock, S. (2000) A comparative evaluation of modern English corpus grammatical annotation schemes. ICAME Journal (24) pp. 7-23.

Atwell, E. (2004) Gambian English ICE Corpus. University of Leeds, School of Computing. [News Group].

Baker, P. et al. (2003) Constructing corpora of South Asian languages. In Proceedings of the Corpus Linguistics 2003 conference, 16(1), 71-80.

BPML (2004) Cybercity Mauritius - The Ebène CyberCity web site [online]. [Accessed 20th November 2004]. Available from World Wide Web: http://e-cybercity.mu/cybercity

Breyer, Y. (2005) Gateway to Corpus Linguistics on the Internet [online]. [Accessed 15th February 2005]. Available from World Wide Web: http://www.corpus-linguistics.de/corpora/corp_engl_a_e.html

Cutting, D., Kupiec, J., Pedersen, J., and Sibun, P. (2005) A Practical Part-of-Speech Tagger. Palo Alto: Xerox Palo Alto Research Centre.

Department of English Language & Literature, University College London (2002) The International Corpus of English (ICE) web site [online]. [Accessed 9th November 2004]. Available from World Wide Web: http://www.ucl.ac.uk/english-usage/ice/#

Edwards, J. (1995) Principles and alternative systems in the transcription, coding and mark-up of spoken discourse. In Leech, G., Myers, G. and Thomas, J. (eds.) (1995) Spoken English on Computer: Transcription, mark-up and application. Harlow: Longman.

EPSRC, The Engineering and Physical Sciences Research Council (2004) The EPSRC web site [online]. [Accessed 23rd October 2004]. Available from World Wide Web: http://www.epsrc.co.uk/

EPSRC (2004) EPSRC Funding Guide web site [online]. [Accessed 7th November 2004]. Available from World Wide Web: http://www.epsrc.co.uk/

EPSRC (2004) EPSRC Research Grants Beginners’ Guide [online]. [Accessed 26th October 2004]. Available from World Wide Web: http://www.epsrc.co.uk/

EPSRC (2004) EPSRC Mock Panel Guidance Notes [online]. [Accessed 6th November 2004]. Available from World Wide Web: http://www.epsrc.co.uk/

EPSRC (2004) Guidance Notes for completing the Je-SRP1 (EPSRC) form [online]. [Accessed 9th November 2004]. Available from World Wide Web: http://www.epsrc.co.uk/
Fang, A. (1996) AUTASYS: Grammatical Tagging and Cross-Tagset Mapping. In Greenbaum, S. (ed.) (1996) Comparing English Worldwide: The International Corpus of English. Oxford: Clarendon Press.
Garside, R. and Smith, N. (1997) A Hybrid Grammatical Tagger: CLAWS4. In Garside, R., Leech, G. and McEnery, T. (ed.) (1997) Corpus Annotation: Linguistic Information from Computer Text Corpora. London; New York: Longman.
Greenbaum, S. (1991b) The development of the International Corpus of English. In Aijmer, K. and Altenberg, B. (eds.) English Corpus Linguistics. Studies in Honour of Jan Svartvik. London: Longman. pp. 83-91.
Greenbaum, S. (1996) Introducing ICE. In Greenbaum, S. (ed.) (1996) Comparing English Worldwide: The International Corpus of English. Oxford: Clarendon Press.
Humanities Text Initiative (2005) The TEI Header [online]. [Accessed 16th February 2005]. Available from World Wide Web: http://www.hti.umich.edu/cgi/t/tei/tei-idx?type=pointer&value=HD
Kučera, H. and Francis, W.N. (1967) Computational analysis of present-day American English. Brown University Press, Providence, Rhode Island.
Laudon, K. and Laudon, J. (2002) Management Information Systems – Managing the Digital Firm. 7th edition. New Jersey: Prentice Hall.
Leech, G. (1997a) Introducing corpus annotation. In Garside, R., Leech, G. and McEnery, T. (ed.) (1997) Corpus Annotation: Linguistic Information from Computer Text Corpora. London; New York: Longman.
Leech, G. (1997b) Grammatical Tagging. In Garside, R., Leech, G. and McEnery, T. (ed.) (1997) Corpus Annotation: Linguistic Information from Computer Text Corpora. London; New York: Longman.
Leech, G. and Eyes, E. (1997) Syntactic Annotation: Treebanks. In Garside, R., Leech, G. and McEnery, T. (ed.) (1997) Corpus Annotation: Linguistic Information from Computer Text Corpora. London; New York: Longman.
Marcus, M., Santorini, B., and Marcinkiewicz, M. (1993) Building a large annotated corpus of English: the Penn Treebank, Computational Linguistics, 19(2), 313-30.
Meyer, C. (2002) English Corpus Linguistics, an Introduction. Cambridge: Cambridge University Press.
Nelson, G. (1991a) Manual for Spoken Texts. London: Survey of English Usage, University College London.

Nelson, G. (1991b) Manual for Written Texts. London: Survey of English Usage, University College London.
Nelson, G. (1996a) The Design of the Corpus. In Greenbaum, S. (ed.) (1996) Comparing English Worldwide: The International Corpus of English. Oxford: Clarendon Press.
Nelson, G., Wallis, S. and Aarts, B. (2002) Exploring Natural Language: working with the British component of the International Corpus of English. Philadelphia: John Benjamins Publishing Company.
Novacek, W. (2000) Bite’n’Byte: Software Development [online]. [Accessed 17th February 2005]. Available from World Wide Web: http://www.bitenbyte.com/
Quinn, A. and Porter, N. (1996) ICE Annotation Tools. In Greenbaum, S. (ed.) (1996) Comparing English Worldwide: The International Corpus of English. Oxford: Clarendon Press.
Republic of Mauritius (2004) History [online]. [Accessed 20th November 2004]. Available from World Wide Web: http://www.gov.mu/abtmtius/history.htm
Sampson, G. (1995) English for the computer: The SUSANNE Corpus and analytic scheme. Oxford: Clarendon Press.
Sharoff, S. (2004) Methods and tools for development of the Russian Reference Corpus. In Archer, D., Wilson, A. and Rayson P. (eds.) Corpus Linguistics Around the World. Amsterdam: Rodopi.
SourceForge (2005) Project: UniRed: Summary [online]. [Accessed 16th February 2005]. Available from World Wide Web: http://sourceforge.net/projects/unired
The School of Computing, University of Leeds (1998-2004) The University of Leeds web site [online]. [Accessed 21st October 2004]. Available from World Wide Web: http://www.comp.leeds.ac.uk/research/index.shtml

APPENDIX A: Personal Experience

Since I am doing a Joint Honours degree in Computing and Management, I was looking for a project that would allow me to combine knowledge gained from both subjects. My first choice, developing a training course for research students on how to apply for funding, therefore seemed ideal at the time. When the project coordinator advised that I should relate my project more closely to computing, I thought I could manage this by building an online training course. However, when I received the assessor's feedback in January, I was shocked and disappointed. At that moment, I realised that I should have listened to the advice I was given. It was clear that I would have to consider a new outline for my project. This meant that I needed to stop feeling sorry for myself and start working even harder right away.

Also, being a Joint Honours student and taking on a 40-credit Computing project meant that I could only take another 20 credits of Computing modules in the final year. In addition, having taken only a subset of the level 1 and level 2 modules, I found it difficult to take other modules that would have been relevant to the project, such as Knowledge Management or Natural Language Processing.

Therefore, from this project, a number of lessons were learnt and the following advice can be given to future students:

• Choose a project that meets the requirements of the School. It is important to know what the School expects, what your supervisor and assessor expect and, most of all, what constitutes a good level 3 project. One recommendation is to read the final year project website carefully, along with at least one past project, before deciding on and starting yours.

• Choose a project with a purpose or one that interests you. The project runs over two semesters, and it is all but guaranteed that your initial enthusiasm will not last for its full course. It is therefore important to choose a topic in which you have at least some interest, or one that you care about and want to get involved in.

• Choose a project that is relevant to your course. Especially if you are a Joint Honours student, choose a project that allows you to make some use of the other half of your course, such as project planning and management for Computing and Management students.

• Always listen to advice from your supervisor, the project coordinator and anyone else involved in the project. These people are more experienced and are here to guide you, so do not hesitate to contact them when you are confused. Don’t think you know best and can solve the problems by yourself. Also, the weekly meetings with the supervisor are very useful and should be attended.

• Do a considerable amount of background reading. Background reading may seem a waste of time, but it is very important for delivering a project of high quality. Firstly, the more you learn about something, the more interested and involved you get; secondly, background reading helps you understand the problem at an early stage and makes it easier to work on the project.

• Don’t plan on being able to work consistently. You will still have a lot of coursework and other assignments to complete, and allowances must be made for these if the rest of your studies are to be unaffected by the extra work the project requires. Similarly, leave time for the exam periods, and time for yourself and a break. You will need it!

• Don’t leave the write-up for the end. Always keep track of what you are doing and write the report as you go along. You are then less likely to forget to include something crucial to your project, and you will be less stressed as the deadline approaches.

• Allow extra time and effort for evaluation. Any good evaluation relies on other people’s opinions or experience. However, third parties are very busy people and getting them involved may take longer than you expect. Therefore, ensure that your schedule is flexible. One recommendation is to start by requesting their help, then begin your own evaluation of the project, and drop everything when they are ready to help you.

• Never give up. There will be times during the project when everything seems to go wrong and you feel desperate, but remember that there is always a solution and that you are not the only one going through this nightmare.

My overall experience of this project has had both its good and its chaotic times. It was difficult to restart the project in the second semester, but I have enjoyed developing the pilot corpus and writing the proposal. The chance to work on a project of this size has given me the opportunity to develop project management, time management and report writing skills, which have already proven vital in my work outside University.

APPENDIX B: Markup Symbols

Written Text Markup Symbols

| Symbol | Meaning |
| --- | --- |
| <#> | Text unit marker |
| <I>...</I> | Subtext marker |
| <l> | Linebreak marker |
| <p>...</p> | Paragraph marker |
| <h>...</h> | Heading |
| <w>...</w> | Orthographic word |
| <X>...</X> | Extra-corpus text |
| <?>...</?> | Uncertain transcription |
| <O>...</O> | Untranscribed text |
| <.>...</.> | Incomplete word |
| <->...</-> | Normative deletion |
| <+>...</+> | Normative insertion |
| <=>...</=> | Original normalization |
| <}>...</}> | Normative replacement |
| <&>...</&> | Editorial comment |
| <(>...</(> | Discontinuous word |
| <)>...</)> | Normalized discontinuous word |
| <@>...</@> | Changed name or word |
| <sb>...</sb> | Subscript |
| <sp>...</sp> | Superscript |
| <ul>...</ul> | Underline |
| <it>...</it> | Italics |
| <bold>...</bold> | Boldface |
| <typeface>...</typeface> | Change of typeface |
| <roman>...</roman> | Roman type |
| <smallcaps>...</smallcaps> | Small capitals |
| <footnote>...</footnote> | Footnote |
| <fnr>...</fnr> | Reference to footnote |
| <space> | Orthographic space |
| <quote>...</quote> | Quotation |
| <del>...</del> | Deleted text |
| <marginalia>...</marginalia> | Marginalia |
| <mention>...</mention> | Mention |
| <indig>...</indig> | Indigenous word(s) |
| <foreign>...</foreign> | Foreign word(s) |

Spoken Text Markup Symbols

| Symbol | Meaning |
| --- | --- |
| <$A>, <$B>, etc | Speaker identification |
| <I>...</I> | Subtext marker |
| <#> | Text unit marker |
| <O>...</O> | Untranscribed text |
| <?>...</?> | Uncertain transcription |
| <->...</-> | Normative deletion |
| <+>...</+> | Normative insertion |
| <=>...</=> | Original normalization |
| <.>...</.> | Incomplete word |
| <}>...</}> | Normative replacement |
| <[>...</[> | Overlapping string |
| <{>...</{> | Overlapping string set |
| <,> | Short pause |
| <,,> | Long pause |
| <(>...</(> | Discontinuous word |
| <)>...</)> | Normalized discontinuous word |
| <X>...</X> | Extra-corpus text |
| <&>...</&> | Editorial comment |
| <@>...</@> | Changed name or word |
| <w>...</w> | Orthographic word |
| <quote>...</quote> | Quotation |
| <mention>...</mention> | Mention |
| <foreign>...</foreign> | Foreign word(s) |
| <indig>...</indig> | Indigenous word(s) |
| <unclear>...</unclear> | Unclear word(s) |
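Because every symbol above is written between angle brackets, marked-up text can be reduced to plain running text with simple pattern matching, for example to check the word count of a sample. The following Python lines are only a minimal sketch under stated assumptions: the function names strip_ice_markup and count_words are hypothetical, and the decision to leave the content of extra-corpus text, untranscribed text and editorial comments out of the count is an assumption made for illustration, not a rule taken from this appendix.

```python
import re

# Assumption: the content of these elements is not counted as corpus text
# (extra-corpus text, untranscribed text, editorial comment).
EXCLUDED_SYMBOLS = ["X", "O", "&"]


def strip_ice_markup(text):
    """Return the text with markup symbols (and excluded content) removed."""
    for symbol in EXCLUDED_SYMBOLS:
        s = re.escape(symbol)
        # Drop the opening symbol, its content and the closing symbol.
        text = re.sub(rf"<{s}>.*?</{s}>", " ", text, flags=re.DOTALL)
    # Drop every remaining symbol such as <#>, <$A>, <,,> or </it>,
    # keeping the words that were enclosed.
    text = re.sub(r"<[^<>\s]+>", " ", text)
    # Collapse the whitespace left behind.
    return " ".join(text.split())


def count_words(text):
    """Count whitespace-separated words after the markup is stripped."""
    return len(strip_ice_markup(text).split())


# Made-up sample, not taken from the collected texts.
sample = (
    "<I> <#> <h> Mauritius <bold> Times </bold> </h> "
    "<#> The <it> sega </it> is popular "
    "<&> photo caption </&> <X> Advertisement </X> </I>"
)
print(strip_ice_markup(sample))
print(count_words(sample))
```

A regular expression is enough here because the symbols are flat markers; a sketch that needed to respect nesting or overlapping strings would have to track opening and closing symbols explicitly.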

APPENDIX C: Corpus Design layout

[Figure: tree diagram of the corpus design. Node labels: ICE-Mauritius; Spoken; Written; Dialogue; Monologue; Public; Unscripted; Scripted; Printed; Student Writing; Unprinted; Letters; Academic; Popular]

APPENDIX D: List of Texts collected

| Title | Author | Publisher | Publisher Place | Date | Word | Email | Web Address |
| --- | --- | --- | --- | --- | --- | --- | --- |
| The 1998 Illovo Award project competition: Summary of proposals made by the winning team from Dr Maurice Curé State Secondary School | - | Prosi Magazine | Mauritius | Apr-99 | 762 | prosi@bow.intnet.mu | http://www.prosi.net.mu/mag99/363 |
| Message from the President of the Association | R. Ramchurn | Mauritius Veterinary Association | Reduit, Mauritius | 2001 | 250 | ramchurnco@intnet.mu | http://mva.intnet.mu/messages_files/anee.htm |
| Letter to Shareholders (New Group Structure) | Hector Espitalier-Noël | Rogers | Port-Louis, Mauritius | Oct-04 | 323 | info@rogers.mu | http://www.rogers.mu/ |
| School Fees 2004 | Jean-Paul de Chazal | Le Bocage International School Mauritius | Moka, Mauritius | Nov-03 | 507 | LBIS@intnet.mu | http://www.lebocage.net/circular/Circular%20fees2004.htm |
| Assessment and Reports - February 2003 | Rhevatee Gobin | Le Bocage International School Mauritius | Moka, Mauritius | Feb-03 | 266 | rhevateeg@lbis.intnet.mu | http://www.lebocage.net/circular/Assessment%20RG.htm |
| various | various | Sunday Vani | - | - | 2006 | - | http://sundayvani.intnet.mu/Links/Your%20voice.htm |

Other columns, values in their original order (row positions not recoverable): Type: Student Writing, Letters. ID: STU01, LETT02, LETT01. Sent?: Yes, Yes, Yes, Yes, Yes. Accept?: not delivered, No.

| Title | Author | Publisher | Publisher Place | Date | Word | Email | Web Address |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Speech by Chairman of the National Computer Board | Mr Kemraz Mohee | National Computer Board | Port-Louis, Mauritius | Apr-04 | 818 | ncb01@ncb.intnet.mu | http://ncb.intnet.mu/ |
| Address by the Hon. Sushil Khushiram, Minister of Development, Financial Services and Corporate Affairs on E-Business | Hon. Sushil Khushiram | National Computer Board | Port-Louis, Mauritius | Nov-03 | 1130 | - | http://ncb.intnet.mu/medrc.htm |
| Address by the President-Year 1996 | - | Government of Mauritius | Port-Louis, Mauritius | Jan-96 | 2072 | - | http://mauritiusassembly.gov.mu/assem96.htm |
| Speech by Hon. A.K. Gayan, Minister of Tourism and Leisure on the occasion of the handing over of certificates to skippers | - | Government of Mauritius | Port-Louis, Mauritius | Dec-04 | 1407 | - | http://tourism.gov.mu/speech1.htm |
| Problems facing the bar student in Mauritius – Law students threatened with a lethal blow? | - | - | - | - | 993 | barthestudents@yahoo.co.uk | http://sundayvani.intnet.mu/Links/views.htm |
| Diseases of Rabbits in Mauritius | R. Ramchurn | University of Mauritius | Reduit, Mauritius | - | 1499 | - | http://mva.intnet.mu/articles.htm |
| Media and Democracy | Roukaya Kasenally | University of Mauritius | Reduit, Mauritius | Jun-04 | 313 | roukaya@uom.ac.mu | http://www.uom.ac.mu/AboutUs/Newsletter/june_04.pdf |

Other columns, values in their original order (row positions not recoverable): Type: Speech, Academic. ID: SPEE01, ACAD01, ACAD02. Sent?: Yes, Yes. Accept?: not delivered, not delivered.

| Title | Author | Publisher | Publisher Place | Date | Word | Email | Web Address |
| --- | --- | --- | --- | --- | --- | --- | --- |
| The transit of Venus | Ricaud Auckbur | - | - | - | 549 | tplus@intnet.mu | http://www.servihoo.com/channels/kinews/v3dossier_details.php?id=43863 |
| Mauritian Sega | Miss Hema Malini Paupiah | - | - | - | 405 | hema@themauritianconnection.com | http://www.themauritianconnection.com/culture/sega/index.html |
| The Mauritius Kestrel, once the world's rarest bird | The Mauritian Wildlife Foundation | COMPNet | Port Louis, Mauritius | - | 700 | - | http://www.maurinet.com/wildlife.html |
| Teleservices Ltd, the efficient response... | - | - | - | - | 210 | tplus@intnet.mu | http://www.servihoo.com/channels/kinews/v3dossier_details.php?id=61438 |
| Biological Diversity and approaches to its Conservation | Samad Rojoa | - | - | - | 569 | - | http://pages.intnet.mu/nathraj/article3.html |
| Environment Protection Legislation should be more business friendly | Raj Makoond | Prosi Magazine | Mauritius | Oct-96 | 643 | jec@intnet.mu | http://www.jec-mauritius.org/ |
| The Music Scene | Armand F. Pampusa | - | - | - | 384 | dahkiam@intnet.mu | http://www.infomauritius.com/mauritius/latest/the_music_scene/?sid=35 |

Other columns, values in their original order (row positions not recoverable): Type: Academic, Popular. ID: POP01, POP02. Sent?: Yes, Yes, Yes, Yes, Yes. Accept?: no entries.

| Title | Author | Publisher | Publisher Place | Date | Word | Email | Web Address |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Société Du Port | - | Freeport Operations (Mauritius) Ltd | Port Louis, Mauritius | - | 278 | contact@fom.co.mu | http://www.freeport-mauritius.com/holding/ |
| Mauritius Drug Profiles | M.O. Bakarkhan | Rotary Club | - | - | 1739 | - | http://rotary.intnet.mu/ |
| Competition in the banking sector: further evidence: “Actions speak louder than words” | Dr Chandan Jankee | Business Magazine | Port Louis, Mauritius | Feb-05 | 1012 | k.jankeee@uom.ac.mu | http://www.businessmag.mu/default.asp?CID=10 |
| National literacy and numeracy strategy (NL & NS) | - | - | - | Feb-03 | 928 | psaddul@mail.gov.mu | http://ministry-education.gov.mu/majproj/natlit.htm |
| Circle Cycle Tour wheels in new sponsor... | - | - | - | - | 467 | tplus@intnet.mu | http://www.servihoo.com/channels/kinews/v3dossier_details.php?id=44481 |
| Mr. Philippe Boullé: “Mauritius is looked at as an important economic entity” | Jacques Dinan | Business Magazine | Port Louis, Mauritius | Feb-05 | 2150 | busmag@intnet.mu | http://www.businessmag.mu/displayNewsContent.asp?NID=6232&CID=26 |
| Defuse this time bomb | - | Business Magazine | Port Louis, Mauritius | Feb-05 | 1693 | busmag@intnet.mu | http://www.businessmag.mu/displayNewsContent.asp?NID=5747&CID=30 |
| Bridges of Hope: A post-violence social project | - | Prosi Magazine | Mauritius | Jun-99 | 451 | - | http://www.prosi.net.mu/mag99/365june/pram365.htm |

Other columns, values in their original order (row positions not recoverable): Type: Popular, Reportage. ID: REP02, REP01. Sent?: Yes, Yes, Yes, Yes, Yes, Yes. Accept?: not delivered.

| Title | Publisher | Publisher Place | Date | Word | Email | Web Address |
| --- | --- | --- | --- | --- | --- | --- |
| MCCI stresses the need to develop a more business-friendly environment | Business Magazine | Port Louis, Mauritius | Feb-05 | 1182 | busmag@intnet.mu | http://www.businessmag.mu/displayNewsContent.asp?NID=6378&CID=8 |
| Proposals from the Printers & Stationery Manufacturers Association (PSMA) | Business Magazine | Port Louis, Mauritius | Mar-05 | 467 | busmag@intnet.mu | http://www.businessmag.mu/displayNewsContent.asp?NID=6381&CID=8 |
| MCB estimates growth rate at 4.2% last year and at 5.2% in 2005 | Business Magazine | Port Louis, Mauritius | Mar-05 | 1468 | busmag@intnet.mu | http://www.businessmag.mu/displayNewsContent.asp?NID=6304&CID=8 |
| MEF believes Budget should provide more for skill development | Business Magazine | Port Louis, Mauritius | Mar-05 | 409 | busmag@intnet.mu | http://www.businessmag.mu/displayNewsContent.asp?NID=6380&CID=8 |
| Mr. Assad Bhuglah, Director, Trade Policy Unit: “The lobbying for the recognition of SIDS by the WTO should start right from now | Business Magazine | Port Louis, Mauritius | Mar-05 | 1541 | busmag@intnet.mu | http://www.businessmag.mu/displayNewsContent.asp?NID=6349&CID=26 |
| CAC: Without Fear and Favour? | Mauritius Times | Pointe-aux-Sables, Mauritius | Sep-02 | 1012 | mtimes@intnet.mu | http://mauritiustimes.com/060902ssb.htm |
| Whatever happened to the Sachs Commission? | Mauritius Times | Pointe-aux-Sables, Mauritius | Oct-02 | 1020 | mtimes@intnet.mu | http://mauritiustimes.com/041002ssb.htm |

Other columns, values in their original order (row positions not recoverable): Type: Reportage. Author: Sir Satcam Boolell, Sir Satcam Boolell. ID: no entries. Sent?: no entries. Accept?: no entries.

| Title | Author | Publisher | Publisher Place | Date | Word | Email | Web Address |
| --- | --- | --- | --- | --- | --- | --- | --- |
| The Choice Cannot Be Clearer | Sir Satcam Boolell | Mauritius Times | Pointe-aux-Sables, Mauritius | Sep-02 | 1476 | mtimes@intnet.mu | http://mauritiustimes.com/200902/200902ssb.htm |
| Robert Lesage should not allow himself to be intimidated by anybody and least of all by ICAC and his arrest | S. Modeliar | Mauritius Times | Pointe-aux-Sables, Mauritius | Feb-03 | 1635 | mtimes@intnet.mu | http://mauritiustimes.com/210203mod.htm |
| Admission Regulations | - | University of Technology, Mauritius | Pointe-aux-Sables, Mauritius | Feb-04 | 2010 | director@utm.intnet.mu | http://www.utm.ac.mu/ |
| 1999 Illovo Award Inter-College Project Competition | - | Prosi Magazine | Mauritius | Jul-99 | 666 | prosi@bow.intnet.mu | http://www.prosi.net.mu/mag99/366july/ilovo366.htm |
| Saltfish in tomato sauce | Madeleine Philippe | - | - | - | 563 | madeleine@cjp.net | http://ile-maurice.tripod.com/rougpoisal.htm |
| Strategic Plan | - | University of Mauritius | Reduit, Mauritius | - | 1501 | centraladmin@uom.ac.mu | http://www.uom.ac.mu/AboutUs/StrategicPlan/overview.htm |
| Cabinet Decisions taken on 04 March 2005 | - | Government of Mauritius | Port-Louis, Mauritius | Mar-05 | 1434 | webmaster-portal@mail.gov.mu | http://pmo.gov.mu/decision.htm |

Other columns, values in their original order (row positions not recoverable): Type: Reportage, Instructional. ID: INS01. Sent?: Yes, Yes, Yes. Accept?: not delivered, Yes.

| Title | Author | Publisher | Publisher Place | Date | Word | Email | Web Address |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Our Constitution | - | Government of Mauritius | Port-Louis, Mauritius | - | 993 | webmaster-portal@mail.gov.mu | http://www.gov.mu/govt/g_const.htm |
| Functions of the National Assembly | - | Government of Mauritius | Port-Louis, Mauritius | - | 730 | webmaster-portal@mail.gov.mu | http://mauritiusassembly.gov.mu/role/function.htm |
| Subservience MSM style | Madhukar Ramlallah | Mauritius Times | Pointe-aux-Sables, Mauritius | Feb-05 | 862 | mtimes@intnet.mu | http://mauritiustimes.com/040205mr.htm |
| For a few at the cost of the many | Madhukar Ramlallah | Mauritius Times | Pointe-aux-Sables, Mauritius | Aug-02 | 835 | mtimes@intnet.mu | http://mauritiustimes.com/300802edito.htm |
| The Tide Is Turning | Madhukar Ramlallah | Mauritius Times | Pointe-aux-Sables, Mauritius | Sep-02 | 709 | mtimes@intnet.mu | http://mauritiustimes.com/200902/200902edit.htm |
| Democratisation | Madhukar Ramlallah | Mauritius Times | Pointe-aux-Sables, Mauritius | Mar-05 | 820 | mtimes@intnet.mu | http://mauritiustimes.com/040305mr.htm |
| Harman Dahl's Legacy | Raj Balkee | Oceanic Publishing | Mauritius | 2001 | 1890 | rajbalkee@intnet.mu | http://pages.intnet.mu/rajbalkeehomepage/hd-complete.htm |
| Not your day to die | Raj Balkee | Oceanic Publishing | Mauritius | 1995 | 1988 | rajbalkee@intnet.mu | http://pages.intnet.mu/rajbalkeehomepage/n-one.htm |
| Mauritius and Sugar | Jacques Dinan | Prosi Magazine | Mauritius | May-97 | 1857 | - | http://www.prosi.net.mu/simau97/preface.htm |

Other columns, values in their original order (row positions not recoverable): Type: Instructional, Editorial, Creative. ID: EDIT01, NOV01. Sent?: Yes, Yes. Accept?: Yes, Yes.

APPENDIX E: Sample of the Letters of Copyright

Sample 1: First letter to explain the purpose of the corpus and for the owners and authors to keep

10 February 2005

Dear General Director of

Request for permission to use texts for linguistic research

Creation of a Mauritian Corpus of English

I am working on a student project at the University of Leeds that involves collecting English texts from Mauritian people in electronic form and storing them on a computer to create a corpus that may be freely available to all via the Web.

I believe that you are the owner of the text(s) of on the website:

I would like to use the text(s) as part of the corpus. People would be able to access your text(s) and the text(s) of others for further research and teaching. We may also want to use the text(s) for developing electronic products such as translators and dictionaries.

I would be very grateful if you would grant to myself and the University of Leeds a free and perpetual non-exclusive licence for the above purposes only. In consideration for your consent mentioned above, I will gladly acknowledge your contribution in any relevant material.

If you agree to above and can confirm that there are no other third parties that have any further rights in the text(s) that I need to contact, please acknowledge your acceptance to this by returning signed and dated the attached copy of this letter.

Yours faithfully

Dolly Koo

Phone: [0044 - 7818855441]
Email: [[email protected]]
Address: [c/o Mr Eric Atwell, Senior Lecturer University of Leeds Leeds LS2 9JT United Kingdom]

Sample 2: Second letter for authors and owners to sign if they agree for their websites to be used.

10 February 2005

Dear General Director of

Request for permission to use texts for linguistic research

Creation of a Mauritian Corpus of English

I am working on a student project at the University of Leeds that involves collecting English texts from Mauritian people in electronic form and storing them on a computer to create a corpus that may be freely available to all via the Web.

I believe that you are the owner of the text(s) of on the website:

I would like to use the text(s) as part of the corpus. People would be able to access your text(s) and the text(s) of others for further research and teaching. We may also want to use the text(s) for developing electronic products such as translators and dictionaries.

I would be very grateful if you would grant to myself and the University of Leeds a free and perpetual non-exclusive licence for the above purposes only. In consideration for your consent mentioned above, I will gladly acknowledge your contribution in any relevant material.

If you agree to above and can confirm that there are no other third parties that have any further rights in the text(s) that I need to contact, please acknowledge your acceptance to this by returning signed and dated the attached copy of this letter.

This is to confirm to the School of Computing at Leeds University that I agree to give permission for all the texts on my website to be used as explained to me by the researcher. I also agree to make the Corpus available for public use by researchers, students and language engineers.

Name (in block capitals)_____________________________________

Signature: ________________________________________________

Date: ____________________________________________________

APPENDIX F: Template for the header

<tei.2>
  <teiHeader id=" ">
    <fileDesc>
      <titleStmt>
        <title> </title>
        <author> </author>
        <respStmt>
          <resp>compiled by</resp>
          <name>Dolly Koo</name>
        </respStmt>
      </titleStmt>
      <publicationStmt>
        <publisher> </publisher>
        <pubPlace> </pubPlace>
        <date></date>
      </publicationStmt>
      <sourceDesc>
        <p>created in machine-readable form in “ “</p>
      </sourceDesc>
    </fileDesc>
    <encodingDesc>
      <projectDesc>
        <p>Texts collected for use in the pilot project for ICE-Mauritius, February, 2005</p>
      </projectDesc>
      <samplingDecl>
        <p>Whole text of “ “ words copied from the site</p>
      </samplingDecl>
    </encodingDesc>
    <profileDesc>
      <creation>
        <date value=" "> </date>
        <rs type="city"> </rs>
      </creation>
      <langUsage>English</langUsage>
      <textClass>
        <textDesc n=" ">
          <channel mode="w">print; written</channel>
        </textDesc>
        <particDesc>
          <person id=" " sex=" " />
        </particDesc>
      </textClass>
    </profileDesc>
  </teiHeader>
  <text>
    <body> </body>
  </text>
</tei.2>
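Filling in the template by hand for every text is repetitive, so the step lends itself to a small script. The following is only a minimal sketch of how the header above might be generated with Python's standard `xml.etree.ElementTree` module; it is not the procedure used in the pilot project. The function names `add` and `build_document` and the keys of the `EXAMPLE` dictionary are hypothetical, and the sample values are taken from the REP14 text shown in Appendices G and H.

```python
import xml.etree.ElementTree as ET


def add(parent, tag, text=None, **attrs):
    """Append a child element to `parent` and return it (helper for this sketch)."""
    child = ET.SubElement(parent, tag, attrs)
    child.text = text
    return child


def build_document(meta, paragraphs):
    """Return a <tei.2> element whose header follows the Appendix F template."""
    root = ET.Element("tei.2")
    header = add(root, "teiHeader", id=meta["id"])

    # fileDesc: bibliographic description of the text
    file_desc = add(header, "fileDesc")
    title_stmt = add(file_desc, "titleStmt")
    add(title_stmt, "title", meta["title"])
    add(title_stmt, "author", meta["author"])
    resp_stmt = add(title_stmt, "respStmt")
    add(resp_stmt, "resp", "compiled by")
    add(resp_stmt, "name", meta["compiler"])
    pub_stmt = add(file_desc, "publicationStmt")
    add(pub_stmt, "publisher", meta["publisher"])
    add(pub_stmt, "pubPlace", meta["pub_place"])
    add(pub_stmt, "date", meta["year"])
    source_desc = add(file_desc, "sourceDesc")
    add(source_desc, "p", "created in machine-readable form in " + meta["source"])

    # encodingDesc: project note and sampling declaration
    encoding_desc = add(header, "encodingDesc")
    add(add(encoding_desc, "projectDesc"), "p", meta["project_note"])
    add(add(encoding_desc, "samplingDecl"), "p",
        f"Whole text of {meta['words']} words copied from the site")

    # profileDesc: creation details, language, text class and participant
    profile_desc = add(header, "profileDesc")
    creation = add(profile_desc, "creation")
    add(creation, "date", meta["date_label"], value=meta["date_value"])
    add(creation, "rs", meta["city"], type="city")
    add(profile_desc, "langUsage", "English")
    text_class = add(profile_desc, "textClass")
    text_desc = add(text_class, "textDesc", n=meta["category_n"])
    add(text_desc, "channel", "print; written", mode="w")
    add(add(text_class, "particDesc"), "person", id="P1", sex=meta["sex"])

    # text/body: one <p> element per paragraph
    body = add(add(root, "text"), "body")
    for paragraph in paragraphs:
        add(body, "p", paragraph)
    return root


# Values copied from the REP14 example; the dictionary keys are hypothetical.
EXAMPLE = {
    "id": "REP14",
    "title": "Robert Lesage should not allow himself to be intimidated by "
             "anybody and least of all by ICAC and his arrest",
    "author": "S. Modeliar",
    "compiler": "Dolly Koo",
    "publisher": "Mauritius Times",
    "pub_place": "Pointe-aux-Sables, Mauritius",
    "year": "2003",
    "source": "http://mauritiustimes.com/210203mod.htm",
    "project_note": "Texts collected for use in the pilot project for "
                    "ICE-Mauritius, February, 2005",
    "words": 1637,
    "date_label": "Feb 2003",
    "date_value": "2003-02",
    "city": "Pointe-aux-Sables",
    "category_n": "01",
    "sex": "male",
}

document = build_document(EXAMPLE, ["First paragraph.", "Second paragraph."])
print(ET.tostring(document, encoding="unicode"))
```

Building the tree with a library rather than by string concatenation means that characters such as `&` and `<` in titles or body text are escaped automatically when the document is serialised.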

APPENDIX G: Example of Encoded Text

An example of a raw text which belongs to the “Reportage” category, with id “REP14.xml”.

Title: Robert Lesage should not allow himself to be intimidated by anybody and least of all by ICAC and his arrest
Author: S. Modeliar
Publisher: Mauritius Times
Publisher Place: Pointe-aux-Sables, Mauritius
Date: February 2003
Source: http://mauritiustimes.com/210203mod.htm
Email: [email protected]
Amount of Words: 1637

Robert Lesage should not allow himself to be intimidated by anybody and least of all by ICAC and his arrest

“Robert Lesage should himself write out his statement and send it to the police, to ICAC, to the DPP, to Transparency International and to the President of the Republic. Only then will he acquire some legitimacy in his allegations and only then that the citadel of fraud that some unfortunately do not want to demolish will come crumbling down…”

After the Cuttaree affair and his shielding in circumstances which have been repeated ad nauseam, after a minister of the present government has been arrested on suspicion of corruption and after a complete inaction in the case of another minister, another scandal has emerged. This time it does not concern any Swiss bank involving Eric Stauffer, Vasant Bunwaree and Navin Ramgoolam though Paul Bérenger has already ruled that these latter two are guilty. This time one of the most respected banks of the country, the Mauritius Commercial Bank, better known as the MCB and being considered the best, is involved.

Until the ramifications of what is described as a fraud with regard to the National Pensions Fund (NPF) are known no blame should be attached to anybody. One section of the press has talked about this and has referred to Minister Choonee. It is to be hoped that such an attitude becomes a general feature of the press and that the civilised press as opposed to the gutter press and to the partisan press will be prevailed upon when it comes to the innocence and reputation of people. This philosophy should also be the hallmark of certain politicians who can blow hot and cold at the same time.

While pontificating about presumption of innocence in the case of those close to the regime, because only those who espouse the cause of the supreme leader of Mauritius can aspire to be appointed to posts in the services including ICAC, Paul Bérenger has already found Vasant Bunwaree and Navin Ramgoolam guilty of offences in relation to the Swiss bank affair. Now that the MCB scandal has emerged he is trying to make a connection between the case at the MCB with the Swiss bank affair. On what basis he is doing that is not clear and yet he is saying that the whole matter will be fully investigated. Who will investigate the matter? Is it going to be investigated by ICAC? Whether we like it or not, ICAC is yet to be perceived as a totally independent institution and totally free from political influences. Even if it is, the perception is otherwise. At times perception of independence is as important if not more important than independence itself.

In addition to trying to lay the blame for the MCB scandal on the Labour Party through a Swiss connection, Paul Bérenger is all praise for the MCB. By so doing Paul Bérenger is already brainwashing public opinion against any malpractice or offence that may have been committed by the MCB or any member of the MSM or MMM because it should not be forgotten that it appears that the scandal dates as far back as 1992, a time at which Paul Bérenger was in a coalition with Sir Anerood Jugnauth before being booted out in 1993. So let not Paul Bérenger shout victory too soon as the investigators would have to find out who were the ministers responsible for the NPF from 1992 up to today. The dates at which the funds have been misused will have to be determined as well as the companies that benefited from those transfers of funds.

But how will all this be determined? One would expect an impartial approach to the investigation. This is simply not possible and the perception is that this is not the case right now. Paul Bérenger has already placed a political coloration on the whole matter. The MCB has already talked of a total absence of any conspiracy at the level of the bank and is suggesting that the accusation of conspiracy at the bank finds its source in a personal vendetta of Robert Lesage. Most disturbing is the attitude of ICAC which has indicated yet once more that it is not functioning as a completely independent body. If proof is needed it is be found in the very revealing statement of both the Commissioner of ICAC, Navin Beekharry and that of Robert Lesage. Let us hope that the office of the DPP does not join the bandwagon.

According to reports Robert Lesage is alleged to have stated that “…being given the new approach taken, I have decided to withdraw my cooperation with the inquiry altogether and not to make any statement. However, I confirm that I am still willing to continue my cooperation with the inquiry so long as the line taken since the beginning. But if such cooperation is resumed, I shall tell the truth, the whole truth and nothing but the truth.” In fact it would appear that what the ICAC investigators have been trying to do is to accept Robert Lesage’s statement on part of the scandal or investigation. Mr Beekharry, the independent commissioner of ICAC confirms this view in a statement to the press. What he says is that the statement of Robert Lesage will be taken according to procedures and according to revelations made. This is a very disturbing and vague statement and defies all logic.

Surely when an investigation is underway the person who is willing to make a statement should be allowed to say all that he knows without any form of censorship and, once everything is taken down, then the investigator can retain whatever is relevant. The procedure that ICAC is propounding may lead to the conclusion that he does not want Robert Lesage to say all that he knows in order to shield some people. If this is the case or the perception, then let ICAC be closed down. Perhaps the novel investigative procedure that is put forward by the independent commission is unprecedented in the history of investigations. Now that Mr Beekharry has himself admitted that there has been an attempt to censure the statement of Robert Lesage he should explain to the public, in the name of transparency, and in the interest of ICAC, what he means by censorship. He should also explain in detail the procedures of any investigations and especially the taking of statements so that in future well-meaning citizens who want to expose those who have been making money illegally, will know what stand to take vis-à-vis so-called independent institutions.

The arrest of Robert Lesage is also very revealing. This man has been praised by many of his former friends and colleagues as somebody who is clean. He went to the ICAC following the discovery of the misuse of the NPF funds and was not unduly worried as he told the ICAC investigators what he knew. However when he decided to make a written statement and is confronted by what seemed to be an arbitrary censorship on what he was going to say, and when he refused to play that kind of game it is only then that he is arrested. One wonders what Mrs Indira Manrakhan would have been made to endure if she had adopted such a procedure. Why is that Robert Lesage was not arrested following his oral statement? What additional information has come to light between the first appearance of Robert Lesage at ICAC and his arrest? On what basis has he been arrested? In the absence of a clear and unequivocal communiqué from ICAC, the impression would be that he was arrested in order to exert pressure on him in order to compel him to say only what the ICAC, for reasons best known to it, wants to hear.

Rumour has it that politicians of all parties have been named by Robert Lesage. The MCB itself has said that no proper control of the NPF funds could have been made as high profile people were involved in the management of those funds. A former financial secretary who is very close to the MSM has an objection to departure against him. The names of officials at the bank have been named. The siphoning of funds to private companies has been taking place since the late eighties. Paul Bérenger was in the 1991 government. Questions also relate to those responsible for the audit of the MCB, the audit of the NPF funds and the overall responsibility of different politicians who had charge of such funds. It is not going to be a simple inquiry and censorship, the Beekharry style will certainly not help. Nobody should be spared. No stone should be left unturned to get to the truth because important government funds and an important bank are involved.

Robert Lesage should not allow himself to be intimidated by anybody and least of all by ICAC and his arrest. He is being legally advised and as a responsible citizen he should go all the way by making public all that he knows. He should himself write out his statement and send it to the police, to ICAC, to the DPP, to Transparency International and to the President of the Republic. Only then will he acquire some legitimacy in his allegations and only then that the citadel of fraud that some unfortunately do not want to demolish will come crumbling down. For too long with the MSM or the MMM, there have been selective investigations with regard to fraud on a political line. It is high time for things to change. If the world population can get the United States to change its mind on war with Iraq, why can’t the people of Mauritius organise rallies against fraudsters and their occult institutional allies?

S. MODELIAR

APPENDIX H: Example of Encoded Text

Example of the encoded version of the text “REP14.xml” above.

<?xml version="1.0" encoding="utf-8" ?>
<tei.2>
  <teiHeader id="REP14">
    <fileDesc>
      <titleStmt>
        <title>Robert Lesage should not allow himself to be intimidated by anybody and least of all by ICAC and his arrest</title>
        <author>S. Modeliar</author>
        <respStmt>
          <resp>compiled by</resp>
          <name>Dolly Koo</name>
        </respStmt>
      </titleStmt>
      <publicationStmt>
        <publisher>Mauritius Times</publisher>
        <pubPlace>Pointe-aux-Sables, Mauritius</pubPlace>
        <date>2003</date>
      </publicationStmt>
      <sourceDesc>
        <p>created in machine-readable form in http://mauritiustimes.com/210203mod.htm</p>
      </sourceDesc>
    </fileDesc>
    <encodingDesc>
      <projectDesc>
        <p>Texts collected for use in the pilot project for ICE-Mauritius, February, 2005</p>
      </projectDesc>
      <samplingDecl>
        <p>Whole text of 1637 words copied from the site</p>
      </samplingDecl>
    </encodingDesc>
    <profileDesc>
      <creation>
        <date value="2003-02">Feb 2003</date>
        <rs type="city">Pointe-aux-Sables</rs>
      </creation>
      <langUsage>English</langUsage>
      <textClass>
        <textDesc n="01">
          <channel mode="w">print; written</channel>
        </textDesc>
        <particDesc>
          <person id="P1" sex="male" />
        </particDesc>
      </textClass>
    </profileDesc>
  </teiHeader>

<text>
<body>
<bold>
<p>Robert Lesage should not allow himself to be intimidated by anybody and least of all by ICAC and his arrest</p>
<p><it>“Robert Lesage should himself write out his statement and send it to the police, to ICAC, to the DPP, to Transparency International and to the President of the Republic. Only then will he acquire some legitimacy in his allegations and only then that the citadel of fraud that some unfortunately do not want to demolish will come crumbling down…”</it></p>
</bold>

<p>After the Cuttaree affair and his shielding in circumstances which have been repeated ad nauseam, after a minister of the present government has been arrested on suspicion of corruption and after a complete inaction in the case of another minister, another scandal has emerged. This time it does not concern any Swiss bank involving Eric Stauffer, Vasant Bunwaree and Navin Ramgoolam though Paul Bérenger has already ruled that these latter two are guilty. This time one of the most respected banks of the country, the Mauritius Commercial Bank, better known as the MCB and being considered the best, is involved.</p>

<p>Until the ramifications of what is described as a fraud with regard to the National Pensions Fund (NPF) are known no blame should be attached to anybody. One section of the press has talked about this and has referred to Minister Choonee. It is to be hoped that such an attitude becomes a general feature of the press and that the civilised press as opposed to the gutter press and to the partisan press will be prevailed upon when it comes to the innocence and reputation of people. This philosophy should also be the hallmark of certain politicians who can blow hot and cold at the same time.</p>

<p> <marginalia> While pontificating about presumption of innocence in the case of those close to the regime, because only those who espouse the cause of the supreme leader of Mauritius can aspire to be appointed to posts in the services including ICAC, Paul Bérenger has already found Vasant Bunwaree and Navin Ramgoolam guilty of offences in relation to the Swiss bank affair. Now that the MCB scandal has emerged he is trying to make a connection between the case at the MCB with the Swiss bank affair. On what basis he is doing that is not clear and yet he is saying that the whole matter will be fully investigated. Who will investigate the matter? Is it going to be investigated by ICAC? Whether we like it or not, ICAC is yet to be perceived as a totally independent institution and totally free from political influences. Even if it is, the perception is otherwise. At times perception of independence is as important if not more important than independence itself. </marginalia> </p>

<p>In addition to trying to lay the blame for the MCB scandal on the Labour Party through a Swiss connection, Paul Bérenger is all praise for the MCB. By so doing Paul Bérenger is already brainwashing public opinion against any malpractice or offence that may have been committed by the MCB or any member of the MSM or MMM because it should not be forgotten that it appears that the scandal dates as far back as 1992, a time at which Paul Bérenger was in a coalition with Sir Anerood Jugnauth before being booted out in 1993. So let not Paul Bérenger shout victory too soon as the investigators would have to find out who were the ministers responsible for the NPF from 1992 up to today. The dates at which the funds have been misused will have to be determined as well as the companies that benefited from those transfers of funds.</p>

<p>But how will all this be determined? One would expect an impartial approach to the investigation. This is simply not possible and the perception is that this is not the case right now. Paul Bérenger has already placed a political coloration on the whole matter. The MCB has already talked of a total absence of any conspiracy at the level of the bank and is suggesting that the accusation of conspiracy at the bank finds its source in a personal vendetta of Robert Lesage. Most disturbing is the attitude of ICAC which has indicated yet once more that it is not functioning as a completely independent body. If proof is needed it is be found in the very revealing statement of both the Commissioner of ICAC, Navin Beekharry and that of Robert Lesage. Let us hope that the office of the DPP does not join the bandwagon.</p>

<p>According to reports Robert Lesage is alleged to have stated that “…being given the new approach taken, I have decided to withdraw my cooperation with the inquiry altogether and not to make any statement. However, I confirm that I am still willing to continue my cooperation with the inquiry so long as the line taken since the beginning. But if such cooperation is resumed, I shall tell the truth, the whole truth and nothing but the truth.” In fact it would appear that what the ICAC investigators have been trying to do is to accept Robert Lesage’s statement on part of the scandal or investigation. Mr Beekharry, the independent commissioner of ICAC confirms this view in a statement to the press. What he says is that the statement of Robert Lesage will be taken according to procedures and according to revelations made. This is a very disturbing and vague statement and defies all logic.</p>

<p>Surely when an investigation is underway the person who is willing to make a statement should be allowed to say all that he knows without any form of censorship and, once everything is taken down, then the investigator can retain whatever is relevant. The procedure that ICAC is propounding may lead to the conclusion that he does not want Robert Lesage to say all that he knows in order to shield some people. If this is the case or the perception, then let ICAC be closed down. Perhaps the novel investigative procedure that is put forward by the independent commission is unprecedented in the history of investigations. Now that Mr Beekharry has himself admitted that there has been an attempt to censure the statement of Robert Lesage he should explain to the public, in the name of transparency, and in the interest of ICAC, what he means by censorship. He should also explain in detail the procedures of any investigations and especially the taking of statements so that in future well-meaning citizens who want to expose those who have been making money illegally, will know what stand to take vis-à-vis so-called independent institutions.</p>

<p>The arrest of Robert Lesage is also very revealing. This man has been praised by many of his former friends and colleagues as somebody who is clean. He went to the ICAC following the discovery of the misuse of the NPF funds and was not unduly worried as he told the ICAC investigators what he knew. However when he decided to make a written statement and is confronted by what seemed to be an arbitrary censorship on what he was going to say, and when he refused to play that kind of game it is only then that he is arrested. One wonders what Mrs Indira Manrakhan would have been made to endure if she had adopted such a procedure. Why is that Robert Lesage was not arrested following his oral statement? What additional information has come to light between the first appearance of Robert Lesage at ICAC and his arrest? On what basis has he been arrested? In the absence of a clear and unequivocal communiqué from ICAC, the impression would be that he was arrested in order to exert pressure on him in order to compel him to say only what the ICAC, for reasons best known to it, wants to hear.</p>

<p><marginalia> Rumour has it that politicians of all parties have been named by Robert Lesage. The MCB itself has said that no proper control of the NPF funds could have been made as high profile people were involved in the management of those funds. A former financial secretary who is very close to the MSM has an objection to departure against him. The names of officials at the bank have been named. The siphoning of funds to private companies has been taking place since the late eighties. Paul Bérenger was in the 1991 government. Questions also relate to those responsible for the audit of the MCB, the audit of the NPF funds and the overall responsibility of different politicians who had charge of such funds. It is not going to be a simple inquiry and censorship, the Beekharry style will certainly not help. Nobody should be spared. No stone should be left unturned to get to the truth because important government funds and an important bank are involved. </marginalia> </p>

<p>Robert Lesage should not allow himself to be intimidated by anybody and least of all by ICAC and his arrest. He is being legally advised and as a responsible citizen he should go all the way by making public all that he knows. He should himself write out his statement and send it to the police, to ICAC, to the DPP, to Transparency International and to the President of the Republic. Only then will he acquire some legitimacy in his allegations and only then that the citadel of fraud that some unfortunately do not want to demolish will come crumbling down. For too long with the MSM or the MMM, there have been selective investigations with regard to fraud on a political line. It is high time for things to change. If the world population can get the United States to change its mind on war with Iraq, why can’t the people of Mauritius organise rallies against fraudsters and their occult institutional allies?</p>

<h><bold>S. MODELIAR</bold></h>
</body>
</text>
</tei.2>
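Once a text is encoded in this way, the header fields and the body can be read back mechanically, for instance to check the word count recorded in the sampling declaration. The sketch below is only an illustration using Python's standard `xml.etree.ElementTree` module and is not part of the pilot project's tooling. `SAMPLE` is a shortened, made-up stand-in for an encoded file, and the function name `summarise` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened, made-up stand-in for an encoded corpus file.
SAMPLE = """<tei.2>
  <teiHeader id="DEMO01">
    <fileDesc>
      <titleStmt>
        <title>A demonstration text</title>
        <author>A. Writer</author>
      </titleStmt>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <bold><p>A demonstration text</p></bold>
      <p>Texts were copied <it>from</it> the site.</p>
    </body>
  </text>
</tei.2>"""


def summarise(xml_text):
    """Return the id, title, author and body word count of an encoded text."""
    root = ET.fromstring(xml_text)
    header = root.find("teiHeader")
    body = root.find("text/body")
    # itertext() yields the character data inside <body> with the mark-up
    # removed; joining with a space keeps adjacent paragraphs from running
    # together before the text is split on whitespace.
    words = " ".join(body.itertext()).split()
    return {
        "id": header.get("id"),
        "title": header.findtext("fileDesc/titleStmt/title"),
        "author": header.findtext("fileDesc/titleStmt/author"),
        "words": len(words),
    }


print(summarise(SAMPLE))
```

A count obtained by splitting on whitespace may differ slightly from the figure given by a word processor, so small discrepancies with the recorded word count are to be expected.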

Appendix I: First Draft of the Case for Support for the ICE-lite Proposal

EPSRC Research Proposal: Development of the ICE-lite

Part 1: DESCRIPTION OF THE PROPOSED RESEARCH AND ITS CONTEXT

1. Background

The University of Leeds and University College London have done previous research on computer analysis of English language texts, also known as English Corpus Linguistics. One example is the development of a Part-of-Speech analysis system, which is being used on other research projects such as the International Corpus of English (ICE), which includes research teams in fifteen countries where English is the main language. In many of these English-speaking countries, the national ICE sub-corpus is a recognised resource used in research and teaching (ICE, 2002).

“The International Corpus of English (ICE) began in 1990 with the primary aim of collecting material for comparative studies of English worldwide. Fifteen research teams around the world are preparing electronic corpora of their own national or regional variety of English. Each ICE corpus consists of one million words of spoken and written English produced after 1989. For most participating countries, the ICE project is stimulating the first systematic investigation of the national variety. To ensure compatibility among the component corpora, each team is following a common corpus design, as well as a common scheme for grammatical annotation.” (ICE, 2002)

Mauritius is one of the many English-speaking African countries, but there is no Mauritian sub-corpus in ICE yet. English has been used in Mauritius for around 195 years, since the British settlers arrived in 1810, and is the official language of the country (Republic of Mauritius, 2004).

However, at that time, slaves were imported from Africa and Madagascar, a large number of labourers from India were brought to work in the sugar cane fields and a small number of Chinese came to trade, while the influence of the French, who were the rulers before the British, was still very strong. The languages brought by the Hindu workers and merchants from India include Bhojpuri, Hindi, Tamil, Telegu, Marathi and Gujerati. The Chinese who came to Mauritius generally speak Hakka or Cantonese and the Muslim workers from India speak Arabic or Urdu. The slaves brought Malagasy (the language spoken in Madagascar) and Afrikaans to the country as well. All these languages have had a considerable impact on the official language, English, and their mixture also resulted in a new language, Creole.

Creole is the most widely spoken language on the island and it is used by more than half the population, including many people who are not of Creole descent. However, even though Creole is the most common language in Mauritius, all official communications and teaching in schools are done in English. Under the influence of the other languages, the traditional English brought by the British settlers has undergone drastic changes. In many official communications or press reports, for instance, we come across some French or Creole words. These may be names of individuals or companies, or they may be used only to put some emphasis on a theme. In school textbooks, there will often be words in Hindi or Chinese, depending on whether they are used in private or public schools.

It is also important to note that dialectal variation is reflected much more in spoken than in written Mauritian English. This is because people tend to think in their native language and then translate what they want to say into English. Therefore the structure and grammar of the sentences will differ among the different cultures in Mauritius.

There are already numerous Mauritian websites available on the World Wide Web, most of them written in English, and the government has just started a Cyber City project, which is the first of its kind of a new generation of IT parks in this part of the world (BPML, 2004). Therefore, it will be feasible to collect at least some written samples of Mauritian English remotely, via the World Wide Web.

Given the success of the other ICE corpora, namely the sub-corpora from Australia, Great Britain, India, Hong Kong and East Africa, among others, researching Mauritian English and developing a sub-corpus for this country will help its development, whether in IT, research or teaching. However, the Mauritius sub-corpus will be compiled by collecting texts mostly from the Internet, and some amendments will have to be made to the standards, since the types of text available on the Internet will not match the text categories of ICE, and other information, such as details of authors and publication, will not be easily available on the Internet. With this technique, the sub-corpus will not contain all the text categories required in the standard ICE scheme; instead, we will develop an “ICE-lite” scheme, to simplify compilation of an ICE-Mauritius corpus.

Furthermore, a more ambitious extension will be to include other types of English from other English-speaking countries in the corpus. It will be a quick and simple way of compiling a corpus of ten million words, with around five hundred thousand words for each country. The twenty English-speaking countries not currently covered by ICE to be included in the corpus are: Bahamas, Bangladesh, Barbados, Bermuda, Botswana, Cayman Islands, Cyprus, Dominica, Gambia, Gibraltar, Grenada, Liberia, Malta, Mauritius, Namibia, Pakistan, Seychelles, Uganda, Zambia and Zimbabwe.

2. Objectives

To achieve the goal of developing a Multi-national Corpus of English, the following specific research objectives have been identified:

• To set up infrastructure and a prototype sampler corpus for the Multi-national Corpus of English.

• To collect, mark up and lexico-grammatically annotate different samples of spoken and written texts in English from the twenty countries: 250 texts of approximately 2,000 words each for each country, a total of approximately ten million words.

• To use corpus exploration tools to analyse lexical and grammatical variation across the contributing dialects of English in this sampler corpus.
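The size targets in the second objective follow from simple arithmetic, which can be checked as below. This is only a worked calculation; the country names are those listed in the Background section, while the constant and function names are hypothetical.

```python
# The twenty countries listed in the Background section.
COUNTRIES = [
    "Bahamas", "Bangladesh", "Barbados", "Bermuda", "Botswana",
    "Cayman Islands", "Cyprus", "Dominica", "Gambia", "Gibraltar",
    "Grenada", "Liberia", "Malta", "Mauritius", "Namibia",
    "Pakistan", "Seychelles", "Uganda", "Zambia", "Zimbabwe",
]
TEXTS_PER_COUNTRY = 250   # texts to collect for each country
WORDS_PER_TEXT = 2_000    # approximate length of each text


def corpus_size(countries, texts_per_country, words_per_text):
    """Return (words per country, words in the whole corpus)."""
    per_country = texts_per_country * words_per_text
    return per_country, per_country * len(countries)


per_country, total = corpus_size(COUNTRIES, TEXTS_PER_COUNTRY, WORDS_PER_TEXT)
print(f"{len(COUNTRIES)} countries, {per_country:,} words each, {total:,} words in total")
# prints "20 countries, 500,000 words each, 10,000,000 words in total"
```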

3. Research Work-Plan

The work has been organised into eight activity streams:

WP1: Collection of Spoken and Written Text of English (24 months RF1, 12 months RF2, 9 months RF3)

The corpus will contain texts from 1990 or later. The total amount of texts needed will be 250 texts of approximately 2,000 words each for each country, a total of approximately ten million words. The authors and speakers of the texts need to have been brought up and taught through the English medium. They must be aged 18 or over and have either been born in the country or immigrated to it at an early age.

1.1 Written Text

All of the written texts will be collected from the Internet and other freely available sources. However, it will be difficult to obtain non-printed texts such as social letters or student essays.

1.2 Spoken Text

Recording spoken texts is labour-intensive, time-consuming and costly (Meyer, 2002) and will only be possible if done at the site, i.e. in each of the countries. Therefore, we will only seek to use sources such as radio and TV broadcasts, which are available on the Internet. The solution proposed by Sharoff (2005) is to “increase the amount of ephemera (leaflets, junk mail and typed material), correspondence & spoken language samples”.

1.3 Copyright issues

Letters of copyright will have to be obtained. This will involve, in the first place, identifying the owners of sources and finding the right contact details.

1.4 Classification of texts

Selecting and organising the texts will be a complex task and careful consideration is required. As far as possible, the text classification will follow the ICE standard categories, but it is expected that some text categories will not be available on the Internet. It is also important to classify each text according to the country it comes from.

Deliverable D1: A detailed list of the texts collected together with information about the authors, publisher, publisher place and date.

WP2: Transcription (19 months RF1, 12 months RF2, 9 months RF3)

Most of the written texts will be in electronic format already. After the spoken texts have been collected and permission is received, the spoken texts will be transcribed, that is, written on paper or typed on screen. It is expected that most of the speech recorded will be in digital format already since it will be collected from the Internet.

Deliverable D2: A sampler of the raw Multi-national Corpus of English.

WP3: Textual Mark-up (6 months RF1, 2 months RF2, 2 months RF3)

3.1 Encoding of Text

The texts will be encoded with XML mark-up, i.e. the features of the original texts that are lost when they are converted into plain text files on a computer will be encoded. In written texts this includes features such as boldface, italics and underlining as well as sentence boundaries, paragraph boundaries and headings. In spoken texts the encoding features will be sentence boundaries, speaker turns, and pauses (Nelson, 1996a).
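The paragraph-boundary part of this encoding can be illustrated with a short script. The following is a minimal sketch only, assuming that paragraphs in the plain text file are separated by blank lines and using the `<text>`, `<body>` and `<p>` elements seen in Appendix H; the function name `encode_paragraphs` is hypothetical, and features such as boldface, italics or headings are not handled.

```python
import re
from xml.sax.saxutils import escape


def encode_paragraphs(raw_text):
    """Wrap blank-line-separated paragraphs of plain text in <p> elements."""
    blocks = re.split(r"\n\s*\n", raw_text)
    # Collapse line breaks and runs of spaces inside each paragraph.
    paragraphs = [" ".join(block.split()) for block in blocks if block.strip()]
    lines = ["<text>", "<body>"]
    # escape() replaces &, < and > so that the output stays well-formed XML.
    lines += [f"<p>{escape(paragraph)}</p>" for paragraph in paragraphs]
    lines += ["</body>", "</text>"]
    return "\n".join(lines)


raw = "First line\ncontinues here.\n\nFish & chips <cheap>.\n"
print(encode_paragraphs(raw))
```

Escaping the text before wrapping it matters for web-derived material, where ampersands and angle brackets are common and would otherwise break the XML.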
Paragraphing and header information (adapted from the ICE standards) regarding author, publisher, etc. will be added. Texts with different formats (Doc, PDF, HTML) will be converted into a unified framework (XML format) (Al-Sulaiti, 2004).

3.2 Proofread Text
Both spoken and written texts will be proofread on screen. This task includes deleting extra and unnecessary material from the texts and checking and adjusting paragraphing markers.

Deliverable D3: Multi-national Sampler Corpus ready for distribution.

WP4: Word-class tagging (18 months RF3)
As in the other ICE projects, the texts will be “automatically tagged for wordclass by the TOSCA Tagger, developed by the TOSCA Research Group at the University of Nijmegen. This assigns wordclass tags to each lexical item in the corpus. The tagset has been developed especially for ICE, and is largely based on Quirk et al (1985) A Comprehensive Grammar of the English Language” (ICE, 2002). During this stage, each item will be assigned a label or tag, for example, ‘N’ for noun and ‘ADV’ for adverb. Other information, such as singular or plural, or the verb tense, will be added in brackets next to the label or tag.

Deliverable D4: Proofread and manually-corrected tagged Multi-national Sampler Corpus.

WP5: Syntactic parsing
The tagged corpus from the previous stage will form the input to the next major stage, the syntactic parsing.

5.1 Syntactic marking (18 months RF2)
The corpus is pre-edited (also known as syntactic marking) before the rest of the parsing stage. This involves manually marking several high-frequency constructions in order to reduce the ambiguity of the input, and thereby reduce the number of decisions that the automatic parser would have to make.

5.2 TOSCA parser
The TOSCA parser (Nelson et al., 2002), software developed by the TOSCA group, will be used to automate this stage. The output from the TOSCA parser will be a series of labelled syntactic trees, in which the nodes will be labelled for function, category, and features.

5.3 Manual Analysis
The TOSCA parser should yield a complete analysis for around 70% of the parsing units in the corpus; for the remainder, the analysis will have to be done manually.

Deliverable D5: An analysis of the corpus at phrase, clause, and sentence level, shown in the form of parse trees.

WP6: Evaluation (7 months RF1, 2 months RF2, 17 months RF3)
6.1 Cross-sectional checking
The syntactic trees will be checked on a cross-sectional, construction-by-construction basis. This will allow the check to concentrate on just one grammatical construction at a time, and corrections can be made to each instance of the construction throughout the whole corpus, if necessary. ICECUP (ICE, 2002) can be used for the cross-sectional checking.

6.2 Spot-checking
Finally, the corpus will be ‘spot-checked’ before being released.

Deliverable D6: The final Multi-national Corpus of English.

WP7: Comparison across dialects (10 months RF1, 16 months RF2, 6 months RF3)
We will use English concordance and corpus exploration tools to analyse lexical and grammatical variation across the contributing dialects of English in the sampler corpus.

Deliverable D7: Research paper on lexical and grammatical variation in the Multi-National Sampler Corpus, to be submitted to the International Journal of Corpus Linguistics.

WP8: Dissemination for Exploitation (2 months each)
The normal dissemination route for academic research is journal and conference papers. D5 and D6 are directly publishable; other papers will need to be written from the other deliverables.

Deliverable D8: Plan for continuing expansion of the Multi-national Corpus of English, extending to new countries.

4. Benefits
The research will first be beneficial to the government and the educational system in each of the twenty countries mentioned above. A comprehensive description of the different types of English can be obtained from the corpus and therefore each country will be able to develop its own reference guides to usage, dictionaries and other teaching materials. This can help both schools and universities to adapt their methods of teaching, and especially the structure in which English is taught and spoken, to a better standard. The comparison across the dialects of English to find any striking similarities or differences will be useful for further research and teaching methods in each country. It will also benefit people who want to travel to or trade with other English-speaking countries, since the comparison will provide a useful insight into how they will have to adapt their language. When the corpus is released, it will also be beneficial to other research or academic institutions across the world. It can be used as a comparison or for further research by the existing corpuses or other potential corpuses.


Longer-term impacts of the work to be done include:
• Promoting cooperation between English-speaking countries for the purpose of developing basic components for the linguistic society.
• Easing the entrance requirements of English-speaking countries into the different markets.
• Promoting the different cultures as a whole.

5. Resources
Staff: the development of the project will require the employment of three English corpus linguists as post-doctoral Research Fellows and project managers for three years.
Consumables: A powerful laptop PC for each researcher, costing £2,000 each. Consultancy fees of £40,000 for transcription and mark-up of source materials.
Travel and Subsistence: Results will be reported and published in conference proceedings including Corpus Linguistics (CL’07 Lisbon) and ICAME (ICAME’06, ’07, ’08, locations not yet known), at an estimated total cost of £4,000. The costs for the International Steering Panel meetings at the start, middle and end of the project are estimated at a total of £21,000.

References:
Al-Sulaiti, L. (2004) Designing and Developing a Corpus of Contemporary Arabic. Unpublished MSc thesis. University of Leeds.
BPML (2004) Cybercity Mauritius - The Ebène CyberCity web site [online]. [Accessed 20th November 2004]. Available from World Wide Web: http://e-cybercity.mu/cybercity
Department of English Language & Literature, University College London (2002) The International Corpus of English (ICE) web site [online]. [Accessed 9th November 2004]. Available from World Wide Web: http://www.ucl.ac.uk/english-usage/ice/#
Meyer, C. (2002) English Corpus Linguistics, an Introduction. Cambridge: Cambridge University Press.
Nelson, G. (1996a) The Design of the Corpus. In Greenbaum, S. (ed.) (1996) Comparing English Worldwide: The International Corpus of English. Oxford: Clarendon Press.
Nelson, G., Wallis, S. and Aarts, B. (2002) Exploring Natural Language: working with the British component of the International Corpus of English. Philadelphia: John Benjamins Publishing Company.
Republic of Mauritius (2004) History [online]. [Accessed 20th November 2004]. Available from World Wide Web: http://www.gov.mu/abtmtius/history.htm
Sharoff, S. (2004) Methods and tools for development of the Russian Reference Corpus. In Archer, D., Wilson, A. and Rayson P. (eds.) Corpus Linguistics Around the World. Amsterdam: Rodopi.
The School of Computing, University of Leeds (1998-2004) The University of Leeds web site [online]. [Accessed 21st October 2004]. Available from World Wide Web: http://www.comp.leeds.ac.uk/research/index.shtml


Part 2: DIAGRAMMATIC WORK PLAN

The work has been organised into eight activity streams. The three research fellows will be working in parallel during the 36-month project starting August 2006.

WP1: Collection of Spoken and Written Text of English (24 months RF1, 12 months RF2, 9 months RF3)
WP2: Transcription (19 months RF1, 12 months RF2, 9 months RF3)
WP3: Textual Mark-up (6 months RF1, 2 months RF2, 2 months RF3)
WP4: Word-class tagging (18 months RF3)
WP5: Syntactic parsing (18 months RF2)
WP6: Evaluation (7 months RF1, 2 months RF2, 17 months RF3)
WP7: Comparison across dialects (10 months RF1, 16 months RF2, 6 months RF3)
WP8: Dissemination for Exploitation (2 months each)

[Gantt charts for Research Fellow 1, Research Fellow 2 and Research Fellow 3: months (August to July, over three years) against work packages WP1 to WP8]


Appendix J: EPSRC Application Form for ICE-lite

Engineering & Physical Sciences Research Council Polaris House, North Star Avenue, Swindon, Wiltshire, United Kingdom, SN2 1ET Telephone +44 (0) 1793 444000 Web http://www.epsrc.ac.uk/

Je-SRP1 (EPSRC) v1.1

COMPLIANCE WITH THE DATA PROTECTION ACT 1998
In accordance with the Data Protection Act 1998, the personal data provided on this form will be processed by EPSRC, and may be held on computerised database and/or manual files. Further details may be found in the guidance notes.

EPSRC Reference:

RESEARCH PROPOSAL 1. DETAILS OF PROPOSAL

You should read the separate notes for guidance, the 'EPSRC Funding Guide’ and any specific call documentation on the EPSRC Web site before completing any research proposal. Form Je-SRP1 (EPSRC) must be accompanied by a Case for Support. EPSRC will reject incomplete research proposals.

A. Organisation Where Grant Would Be Held Organisation University of Leeds

Division or Department School of Computing Research Organisation Reference:

Address Line 1 Computer Vision and Language group Address Line 2 School of Computing

Address Line 3 University of Leeds

Town/City Leeds

Admin Area/County West Yorkshire Postal Code LS2 9JT

B. Investigators Please give details of each investigator below. Please provide the details of any additional investigators on a separate sheet using the same format as below.

Details Principal Investigator (PI) Co-Investigator 1

Title Mr

Forename(s) Eric

Surname Atwell

Organisation University of Leeds

Division or Department School of Computing

Post will outlast project (Y/N) Y

% time committed to project 20

Other commitments (description and average hours per week)

8

Total number of co-investigators (ie. excluding the PI) 0


C. Recognised Researchers Please give details of each Recognised Researcher below. Please provide the details of any additional Recognised Researchers on a separate sheet using the same format as below.

Details | Recognised Researcher 1 | Recognised Researcher 2
Title | Dr | Ms
Forename(s) | Serge | Bayan
Surname | Sharoff | Abu Shawar
Organisation | University of Leeds | University of Leeds
Division or Department | Centre for Translation Studies | School of Computing
% time committed to project | 100 | 100

Total number of Recognised Researchers 3

D. Title of Research Project [up to 150 chars]

Development of an ICE-lite

E. Start Date and Duration a. Proposed start date August 1st 2005 b. Duration of the grant (months) 36

F. Type of Proposal

Scheme: Call: n/a

G. Summary of EPSRC Resources Required for Project

a. Financial resources required (Total £)
Staff 330,072
Travel and Subsistence 25,000
Consumables 46,000
Exceptional Items
Equipment
Large Capital
PCTF 500
Sub-total 401,572
Indirect Costs 101,222
Total 502,794

b. Summary of staff effort requested (Months)
Research 108
Technician
Other
Project Students
Visiting Researchers
Total 108

c. Services (£)
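As a quick arithmetic cross-check of the figures in Section G (a sketch for the reader, not part of the EPSRC form), the sub-total, total and staff months can be recomputed from the individual entries. The per-post cost of £110,024 and the three 36-month posts are taken from Section M; the variable names are illustrative only.

```python
# Figures as entered in Sections G and M of the form (pounds sterling).
cost_per_research_post = 110_024
number_of_posts = 3
months_per_post = 36

direct_costs = {
    "Staff": cost_per_research_post * number_of_posts,
    "Travel and Subsistence": 25_000,
    "Consumables": 46_000,
    "PCTF": 500,
}
indirect_costs = 101_222

sub_total = sum(direct_costs.values())
total = sub_total + indirect_costs
research_months = number_of_posts * months_per_post

print(f"Staff:           {direct_costs['Staff']:>8,}")
print(f"Sub-total:       {sub_total:>8,}")
print(f"Total:           {total:>8,}")
print(f"Research months: {research_months}")
```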

H. Related Proposals EPSRC Reference Number How related? (one of Continuation,

Follow-up to outline proposal, Invited resubmission, Uninvited resubmission)

a. If this proposal is related to a previous proposal to EPSRC, please give the previous EPSRC research grant proposal reference number(s) and indicate the type of relationship.

Total Number of Proposals being submitted

Name of Lead Research Organisation

Common Reference

b. If there is more than one organisation submitting a Je-SRP1 (EPSRC) proposal form for this project, please give the number of proposals involved, the lead Research Organisation and the project common reference.


I. Research Councils / MoD Joint Research Grants Scheme (JGS) If you have received a commitment of support from the Defence Science Technology Laboratory (DSTL), please give the following details:

Percentage funding indicated by DSTL

DSTL contact (name and address) Title/Forename(s) Surname Address Line 1 Address Line 2 Address Line 3 Town/City Administrative Area/County Postal Code Telephone Fax E-mail

DSTL Reference (please ensure that the letter providing this reference is attached with the Case for Support)

J. Objectives List main objectives of the proposed research in order of priority [up to 4000 chars]

We will set up infrastructure and a prototype sample corpus for the ICE-lite, an International Corpus of English component which contains a 'lite' version of Englishes from 40 different English-speaking countries.

An international steering panel will establish agreed standards for text types and categories and the other annotation standards such as encoding, XML mark-up, tagging and distribution. We will collect, mark up and annotate different samples of spoken and written texts in English from 40 English-speaking countries: 250 texts of approximately 2,000 words each per country, a total of approximately 20 million words. We will use corpus exploration tools to analyse lexical and grammatical variation across the contributing dialects of English in this sample corpus.


K. Summary Describe the proposed research in a style that would be accessible to an interested 14 year old [up to 4000 chars]

Texts stored on a computer, known as corpora, together with software tools, provide a powerful method for learning more about language usage. Corpora are useful for studying all aspects of language such as grammar, meaning and speech sounds, and for helping dictionary makers to spot new words. Corpus linguists nowadays analyse the use of structures and investigate the factors that affect our choice of a particular structure. For instance, the factors may be related to the nature of the writing or speaking, such as science rather than literature. Other factors that may influence our choice include age, gender, period of time, text type, and medium (spoken or written), and these are being fully examined to get the best results in our study of language. The main aim of a corpus linguist is to discover common linguistic patterns in specific contexts rather than to state whether a pattern is correct or incorrect. With the computer storing a huge amount of data, this view of language analysis becomes more accessible and gives a good resource to start with. The corpus can be searched and handled at high speed using a special software tool. In addition, information such as grammar, meaning, and speech sound can be added to it to make it more useful to examine. Many large corpora have been developed during the past few years. Some are for general use in linguistics research and represent different languages such as English, Spanish, French, and Russian, while others are more specialised, such as the Air Traffic Control corpus. English is widely spoken in different parts of the world and one main corpus that handles its variation is known as the International Corpus of English (ICE). The main purpose of collecting this corpus is to compare English as spoken worldwide. Around the world, fifteen research teams are preparing electronic corpora of their own variety of English, each consisting of one million words: 60% spoken and 40% written. Each team is following the same corpus design to ensure compatibility.

Many English-speaking countries do not have a component in ICE yet, and developing a sub-corpus for each one would be very costly and time-consuming. A better extension to the ICE project will be to collect a small version of the corpus for each English-speaking country and group them together to form the ICE-lite. The term “lite” is borrowed from other simplified projects such as “TEI-lite”, which means a simpler version of TEI, a standard XML mark-up convention for text corpora. Therefore, the aim of this project is to build a corpus for the ICE-lite which will follow similar conventions to the full ICE version. To ensure compatibility with the other ICE projects, a “lite” version for the teams already in ICE will also be included in the corpus together with 25 other countries. The aim is to collect a corpus of 20 million words: 250 texts of 2,000 words for each country. An international steering panel will be appointed to agree on a general design structure for the corpus. The different types of English will then be analysed and compared to find similarities or differences across countries. This will allow an understanding of the different cultures and will therefore be useful for other research and academic institutions across the world.

L. Beneficiaries Describe who will benefit from the research [up to 4000 chars]

The research will first be beneficial to the government and the educational system in each of the twenty countries mentioned above and to the existing ICE teams. A comprehensive description of the different types of English can be obtained from the corpus and therefore each country will be able to develop its own reference guides to usage, dictionaries and other teaching materials. This can help both schools and universities to adapt their methods of teaching, and especially the structure in which English is taught and spoken, to a better standard. The comparison across the dialects of English to find any striking similarities or differences will be useful for further research and teaching methods in each country. It will also benefit people who want to travel to or trade with other English-speaking countries, since the comparison will provide a useful insight into how they will have to adapt their language. When the corpus is released, it will also be beneficial to other research or academic institutions across the world. It can be used as a comparison or for further research by the existing corpuses or other potential corpuses.

Longer-term impacts of the work to be done include:

• Promoting cooperation between English-speaking countries for the purpose of developing basic components for the linguistic society.

• Easing the entrance requirements of English-speaking countries into the different markets.

• Promoting the different cultures of the 40 countries across the world.


M. Staff

Joint Negotiating Committee For Higher Education Staff (JNCHES – formerly UCEA) Posts: EFFORT ON PROJECT

Name / Post Identifier | Grade | Starting Spine Point | Effective Date of Salary Scale | Increment Date | Start Date | Period on Project (months) | % of Full Time | London Allowance (Y/N) | Total cost on grant (£)

i) Research Staff
Serge Sharoff | RAII | 11 | 01/08/2005 | 01/08/2006 | 01/08/2005 | 36 | 100 % | N | 110,024
Bayan Abu Shawar | RAII | 11 | 01/08/2005 | 01/08/2006 | 01/08/2005 | 36 | 100 % | N | 110,024
Sean Wallis at UCL | RAII | 11 | 01/08/2005 | 01/08/2006 | 01/08/2005 | 36 | 100 % | N | 110,024

ii) Technical Staff

iii) Visiting Researchers

Total 330,072


Non-JNCHES Posts EFFORT ON PROJECT

Name / Post Identifier Basic Starting Salary

Scale Effective Date of Salary Scale

Increment Date

Start Date

Period on Project

(months)

% of Full Time

London Allowance

(£)

Superannuation and NI (£)

Total cost on grant (£)

i) Research Staff

ii) Technical Staff

iii) Other Staff

iv) Visiting Researchers

Total


Ma. Project Studentships Name/Post Identifier Start Date London (Y/N) Stipend (£)

Total

Mb. Visiting Researchers Please provide the details of any additional visiting researchers on a separate sheet in the same format as below.

Details Visiting Researcher 1 Visiting Researcher 2 Visiting Researcher 3

Title

Forename(s)

Surname

Home Organisation

Division or Department

Address Line 1

Address Line 2

Address Line 3

Town/City

Administrative Area/County

Postal Code

Country

Telephone

Fax

E-mail

Post held

a) If you have requested an amount from EPSRC for the Visiting Researcher's salary in Section M, will the Visiting Researcher receive any other contribution on top of this? (Y/N)

b) If the Visiting Researcher will receive another contribution, how much will this be? (£)

c) What annual salary would the host organisation expect to pay staff of the Visiting Researcher's status? (£)

Total number of visiting researchers 0

Mc. Public Communication Training Funds (PCTF) Do you wish to apply for £500 towards Public Communication Training Funds? YES NO


N. Travel and Subsistence Destination and purpose Total £

(i) Within UK International Steering Panel review meetings at start, mid and end of project 21,000

(ii) Outside UK

Corpus Linguistics conferences (CL'2007, TALC'06,07,08) to disseminate results 4,000

Total £ 25,000

O. Consumables Description Total £

Consultancy fee funds for transcription and markup of source materials 40,000

3 laptops for data collection and analysis 6,000

Total £ 46,000

P. Exceptional Items Description Total £

Total £


Q. Equipment (single items between £3,000 and £99,999, including VAT)) Description Country of

Manufacture Delivery Date

Basic price £

Import duty £

VAT £ Total £

Total £

R. Large Capital (single items £100,000 and over, including VAT) Description Country of

Manufacture Delivery Date

Basic price £

Import duty £

VAT £ Total £

Total £

S. Services

Service Instrument(s) Units Cost £

Total

T. Other Support Give details of any support sought or received from any source for this or related research in the past three years (minimum £10,000)

Awarding Organisation

Awarding Organisation’s

Reference

Title of project Decision Made (Y/N)

Award Made (Y/N)

Start Date

End Date

Amount Sought/

Awarded (£)


Appendix K: Revised Case for Support for the ICE-lite Proposal

EPSRC Research Proposal: Development of the Multi-National Corpus of English

Part 1: DESCRIPTION OF THE PROPOSED RESEARCH AND ITS CONTEXT

1. Background
The University of Leeds (2004) has done previous research on computer analysis of English language texts, also known as English Corpus Linguistics. For example, the University has developed a Part-of-Speech analysis system which is being used on other research projects such as the International Corpus of English (ICE), which includes research teams in fifteen countries where English is the first language or second official language. In many of these English-speaking countries, the national ICE sub-corpus is a recognised resource used in research and teaching.

“ICE began in 1990 with the primary aim of collecting material for comparative studies of English worldwide. Fifteen research teams around the world are preparing electronic corpora of their own national or regional variety of English. Each ICE corpus consists of one million words of spoken and written English produced after 1989. For most participating countries, the ICE project is stimulating the first systematic investigation of the national variety. To ensure compatibility among the component corpora, each team is following a common corpus design, as well as a common scheme for grammatical annotation.” (ICE, 2002)

These corpora are collected mostly from standard and more traditional materials such as books, newspapers and articles. This method allows a wide variety of texts to be obtained; however, it is very time-consuming and costly, since researchers have to be sent to the site and the texts need to be transcribed and converted into electronic format. This is one of the reasons why five other ICE projects (Cameroon, Fiji, Ghana, Nigeria and Sierra Leone) have not been able to start collecting any texts to date.
Therefore, an alternative way of quickly and simply compiling a big corpus of much more than one million words would be to use the World Wide Web. Other attempts at using this method have proved successful (e.g. Serge Sharoff at Leeds University has developed tools to extract 100-million-word corpora of Russian, German and Chinese).

A pilot project to investigate the possibility of collecting a corpus for Mauritius, one of the many English-speaking African countries, was undertaken. English has been used in Mauritius for around 195 years, since the British settlers arrived in 1810, and is the official language of the country (Republic of Mauritius, 2004). However, at that time, slaves were imported from Africa and Madagascar, a large number of labourers from India were brought to work in the sugar cane fields, a small number of Chinese came to trade, and the influence of the French, who were the rulers before the British, was still very strong. The different languages brought by the different settlers have therefore significantly influenced the official English language of the country. Still, all official communications and teaching in schools are done in English, even though in many official communications, press reports or school textbooks, for instance, you might come across some dialect words, such as names of individuals or companies written in French or Hindi.

There are already numerous Mauritian websites available on the World Wide Web, most of them written in English, and the government has just started a Cyber City project, which is the first of its kind of a new generation of IT parks in this part of the world (BPML, 2004). Therefore, it will be feasible to collect at least some written samples of Mauritian English remotely, via the World Wide Web. For the pilot project, a sample of 30 texts of between 1,000 and 2,000 words each was collected (a total of 51,960 words).
Each text, including its details such as author, publisher and date, took between 15 and 20 minutes to find. From the texts obtained, it was noted that some amendments will have to be made to the standards of ICE, since the types of texts available on the Internet will not match the text categories or the text size, and other information about the texts is not easily available on the Internet.

Eighteen requests for permission to use the texts were sent out by email, and each took 6 minutes on average to send. The issue of potential commercial use will have to be addressed in more detail, since the other ICE projects are strictly non-commercial and, according to Gerald Nelson at UCL, this statement might cause difficulties in obtaining permissions from owners and might cause problems for the other ICE teams.

The 30 texts were also marked up with a reduced ICE header, and it took between 20 and 30 minutes to “clean up” the source webpage and to add the mark-up to each text. This stage can be done partly by a program, where human interaction is only needed to proofread and post-edit or correct the draft of the marked-up texts which are produced. However, for the pilot project, no such program was available and therefore the estimates are derived from the manual process. The tagging and parsing will be done automatically, and hence it was estimated that one million words will take one and a half weeks to be tagged and one and a half weeks to be parsed.

Evidence from the pilot project showed that, with this Internet collection technique, the sub-corpus will contain less than one million words due to the limited set of text categories available on the World Wide Web. Therefore, a better extension will be to include other types of English from other English-speaking countries in the corpus. This will result in an “ICE-lite” with around five hundred thousand words for each country. The term “lite” is borrowed from other simplified projects such as “TEI-lite”, which means a simpler version of TEI, a standard XML mark-up convention for text corpora (TEI, 2005).

The twenty countries which have been chosen to form part of the Multi-national Corpus of English are: Bahamas, Bangladesh, Barbados, Bermuda, Botswana, Cayman Islands, Cyprus, Dominica, Gambia, Gibraltar, Grenada, Liberia, Malta, Mauritius, Namibia, Pakistan, Seychelles, Uganda, Zambia and Zimbabwe. In each of these countries, English is either the national language or one of the main spoken languages.
For instance, in Bermuda and Zimbabwe, English is the official language, while in Liberia English is used mostly for trading purposes. Therefore, the form of English in Liberia differs significantly in its word structure, and it can take some time and practice to master. As in Mauritius, English in most of these countries has in one way or another been highly influenced by other languages, either brought by ancestors or derived from their culture. The ICE-lite corpus will hence allow an interesting and useful analysis of the variation in English across the nations. To ensure compatibility and provide an enhanced comparison with the other existing projects, a “lite” version from each of the 20 teams already in ICE will also be included in the corpus.

For each country, numerous websites are easily accessible via the World Wide Web, and different text categories are available. Google provides evidence that for the Mauritius pilot project there are 250 million words of text to select from. Therefore, enough text should be available to collect 250 texts of 2,000 words each for each country, and thus the corpus will aim to contain approximately 20 million words in total.

An important issue which will need further consideration is the distribution method. The existing ICE corpora are distributed on CD, and this results in reduced accessibility. One possible solution will be to make the ICE-lite corpus available on the Internet via a public licence. For instance, it can be distributed under the GNU public licence, analogous to open-source software freely downloadable from Sourceforge.net.

Given the success of the other ICE corpora, namely the sub-corpora from Great Britain, India, Hong Kong and East Africa, among others, researching English across the different nations and developing an ICE-lite Corpus will help in the development of each of the participating countries, whether in IT, research or teaching.

2. Objectives

To achieve the goal of developing an ICE-lite Corpus of English, the following specific research objectives have been identified:

• To set up infrastructure and a prototype sampler corpus for the ICE-lite Corpus of English.

• An international steering panel will establish agreed standards for text types and categories and the other annotation standards such as encoding, XML mark-up, tagging and distribution.

• To collect, mark up and lexico-grammatically annotate different samples of spoken and written texts in English from the 40 countries: 250 texts of approximately 2,000 words each for each country, a total of approximately 20 million words.

• To use corpus exploration tools to analyse lexical and grammatical variation across the contributing dialects of English in this sampler corpus.
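The size targets quoted in the objectives follow from simple multiplication. The short Python sketch below reproduces that arithmetic from the figures given in the text (40 countries, 250 texts per country, about 2,000 words per text); the function and variable names are illustrative only and are not part of the proposal.

```python
# Figures taken from the proposal text; names are illustrative assumptions.
COUNTRIES = 40            # 20 newly chosen countries + 20 existing ICE teams
TEXTS_PER_COUNTRY = 250
WORDS_PER_TEXT = 2_000    # approximate length of each text


def corpus_size(countries, texts_per_country, words_per_text):
    """Return (total_texts, words_per_country, total_words)."""
    total_texts = countries * texts_per_country
    words_per_country = texts_per_country * words_per_text
    total_words = countries * words_per_country
    return total_texts, words_per_country, total_words


total_texts, words_per_country, total_words = corpus_size(
    COUNTRIES, TEXTS_PER_COUNTRY, WORDS_PER_TEXT
)
print(f"texts to collect:  {total_texts:,}")
print(f"words per country: {words_per_country:,}")
print(f"words in total:    {total_words:,}")
```

Under these figures each national sub-corpus holds about half a million words, which agrees with the “ICE-lite” size stated earlier, and the whole corpus requires 10,000 individual texts.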

3. Research Work-Plan

The work has been organised into eight activity streams (assuming that a month consists of 20 days of 8 hours' working time):

WP0: Project Management via International Steering Panel (1 month each over 3 years)

The Panel will establish agreed standards for text types and categories; encoding and XML mark-up; morphological analysis and Part-of-Speech tagging; and distribution methods. Standards proposals will be drawn up by the project investigators but will be subject to approval and improvement by the Panel. Members will be chosen from the 40 countries that will form part of the ICE-lite.

Deliverable D0: ICE-lite International Steering Panel to meet annually to oversee project progress.

WP1: Collection of Spoken and Written Text of English (10 months RF1, 8 months RF2, 8 months RF3)

The corpus will contain texts from 1990 or later. The total amount of text needed will be 250 texts of approximately 2,000 words each for each country, a total of approximately 20 million words. The authors and speakers of the texts need to have been brought up and taught through the medium of English. They must be aged 18 or over and must either have been born in the country or have immigrated to it at an early age.

1.1 Written Text

All of the written texts will be collected from the Internet only. However, it will be difficult to obtain non-printed texts such as social letters or student essays.

1.2 Spoken Text

Recording spoken texts is labour-intensive, time-consuming and costly (Meyer, 2002) and will only be possible if done on site, i.e. in each of the countries. Therefore, we will only seek to use sources such as radio and TV broadcasts which are available on the Internet. The solution proposed by Sharoff (2005) is to “increase the amount of ephemera (leaflets, junk mail and typed material), correspondence & spoken language samples”.

1.3 Copyright issues

Letters of copyright will have to be obtained.
This will involve, in the first place, identifying the owners of sources and finding the right contact details.

1.4 Classification of texts

Selecting and organising the texts will be a complex task and careful consideration is required. As far as possible, the text classification will follow the ICE standard categories, but it is expected that some text categories will not be available on the Internet. It is also important to classify each text according to the country it comes from.

Deliverable D1: A detailed list of the texts collected, together with information about the authors, publisher, place of publication and date.

WP2: Transcription (5 months RF1, 3 months RF2, 2 months RF3)

The written texts will be in electronic format already. After the spoken texts have been collected and permission has been received, they will be transcribed, that is, written on paper or typed on screen. It is expected that most of the recorded speech will be in digital format already, since it will be collected from the Internet. No sample of spoken texts from Mauritius was available, so it is expected that only a limited number of spoken texts will be obtained; the time required for transcription is based only on personal judgement, relative to the time allowed for WP1.

Deliverable D2: A sampler of the raw Multi-national Corpus of English.
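One way to picture Deliverable D1 is as a list of records, one per collected text, carrying the bibliographic details, the country, the text category and the copyright status described under WP1. The Python sketch below is a minimal illustration under that assumption: the record layout, the field names, the category labels and the sample values are all hypothetical and are not taken from the ICE standards or from the pilot data. Only the 1990 cut-off and the need for permission come from the text.

```python
from collections import Counter
from dataclasses import dataclass

EARLIEST_YEAR = 1990  # the proposal requires texts from 1990 or later


@dataclass(frozen=True)
class TextRecord:
    """One row of the D1 list; every field name here is hypothetical."""
    text_id: str
    country: str
    category: str      # illustrative label, not an official ICE category code
    medium: str        # "written" or "spoken"
    authors: tuple
    publisher: str
    place: str
    year: int
    permission_received: bool = False


def is_usable(record):
    """A text can be used once it is recent enough and copyright is cleared."""
    return record.year >= EARLIEST_YEAR and record.permission_received


def count_by_country_and_category(records):
    """Tally usable texts so that gaps in category coverage are easy to spot."""
    return Counter((r.country, r.category) for r in records if is_usable(r))


# Made-up sample rows, for illustration only.
records = [
    TextRecord("MU-W-001", "Mauritius", "press report", "written",
               ("A. Author",), "Example Press", "Port Louis", 2004, True),
    TextRecord("MU-W-002", "Mauritius", "press report", "written",
               ("B. Author",), "Example Press", "Port Louis", 1988, True),
    TextRecord("MT-S-001", "Malta", "broadcast news", "spoken",
               ("C. Speaker",), "Example Radio", "Valletta", 2003, False),
]

for (country, category), n in sorted(count_by_country_and_category(records).items()):
    print(f"{country:10} {category:15} {n}")
```

Keeping the country and category on every record makes it straightforward to see, at any point during collection, which categories are still missing for which country.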

WP3: Textual Mark-up (10 months RF1, 8 months RF2, 8 months RF3)

3.1 Encoding of Text

The texts will be encoded with XML mark-up, i.e. the features of the original texts that are lost when they are converted into plain text files on a computer will be encoded. In written texts this includes features such as boldface, italics and underlining, as well as sentence boundaries, paragraph boundaries and headings. In spoken texts the encoded features will be sentence boundaries, speaker turns and pauses (Nelson, 1996a). Paragraphing and header information (adapted from the ICE standards) regarding author, publisher, etc. will be added. Texts in different formats (Doc, PDF, HTML) will be converted into a unified framework (XML format) (Al-Sulaiti, 2004).

3.2 Proofread Text

Both spoken and written texts will be proofread on screen. This task includes deleting extra and unnecessary material from texts and checking and adjusting paragraphing markers.

Deliverable D3: Multi-national Sampler Corpus ready for distribution.

WP4: Word-class tagging (6 months RF2, 8 months RF3)

As in the other ICE projects, the texts will be “automatically tagged for wordclass by the TOSCA Tagger, developed by the TOSCA Research Group at the University of Nijmegen. This assigns wordclass tags to each lexical item in the corpus. The tagset has been developed especially for ICE, and is largely based on Quirk et al (1985) A Comprehensive Grammar of the English Language” (ICE, 2002). During this stage, each item will be assigned a label or tag, for example, ‘N’ for noun and ‘ADV’ for adverb. Other information, such as singular or plural, or the verb tense, will be added in brackets next to the label or tag.

Deliverable D4: Proofread and manually-corrected tagged Multi-national Sampler Corpus.

WP5: Evaluation (6 months RF1, 4 months RF2, 2 months RF3)

5.1 Cross-sectional checking

The syntactic wordclass tags will be checked on a cross-sectional, construction-by-construction basis.
This will allow the check to be concentrated on just one grammatical construction at a time, and corrections can be made to each instance of the construction throughout the whole corpus, if necessary. ICECUP (ICE, 2002) can be used for the cross-sectional checking.

5.2 Spot-checking

Finally, the corpus will be ‘spot-checked’ before being released.

Deliverable D5: The final Multi-national Corpus of English.

WP6: Comparison across dialects (3 months RF1, 3 months RF2, 5 months RF3)

We will use English concordance and corpus exploration tools to analyse lexical and grammatical variation across the contributing dialects of English in the sampler corpus.

Deliverable D6: Research paper on lexical and grammatical variation in the Multi-National Sampler Corpus, to be submitted to the International Journal of Corpus Linguistics.

WP7: Dissemination for Exploitation (2 months each)

The normal dissemination route for academic research is journal and conference papers. D6 is directly publishable; other papers will need to be written from the other deliverables.

Deliverable D7: Plan for continuing expansion of the Multi-national Corpus of English, extending to new countries.

4. Benefits

The research will first be beneficial to the government and the educational system in each of the twenty countries mentioned above and to the existing ICE teams. A comprehensive description of the different types of English can be obtained from the corpus, and therefore each country will be able to develop its own reference guides to usage, dictionaries and other teaching materials. This can help both schools and universities to adapt their methods of teaching, and especially the structure in which English is taught, so that it is spoken to a better standard. The comparison across the dialects of English to find any striking similarities or differences will be useful for further research and teaching methods

in each country and will also benefit those people who want to travel to or trade with other English-speaking countries, since the comparison will provide a useful insight into how they will have to adapt their language. When the corpus is released, it will also be beneficial to other research or academic institutions across the world. It can be used for comparison or for further research by the existing corpora or other potential corpora.

Longer-term impacts of the work to be done include:

• Promoting cooperation between other English-speaking countries for the purpose of developing basic components for the linguistic society.

• Easing the entrance requirements of English-speaking countries into the different markets.

• Promoting the different cultures of the 40 countries across the world.

5. Resources

Staff: the development of the project will require the employment of 3 English corpus linguists as post-graduate Research Fellows and project managers for 3 years.

Consumables: A powerful laptop PC for each researcher, costing £2,000 each. Consultancy fees of £40,000 for transcription and mark-up of source materials.

Travel and Subsistence: Results will be reported and published in conference proceedings including Corpus Linguistics (CL’07 Lisbon) and ICAME (ICAME’06, ’07, ’08, locations not yet known), at an estimated total cost of £4,000. The costs for the International Steering Panel meetings at the start, middle and end of the project are estimated at a total of £21,000.

References:

Al-Sulaiti, L. (2004) Designing and Developing a Corpus of Contemporary Arabic. Unpublished MSc thesis. University of Leeds.

BPML (2004) Cybercity Mauritius - The Ebène CyberCity web site [online]. [Accessed 20th November 2004]. Available from World Wide Web: http://e-cybercity.mu/cybercity

Department of English Language & Literature, University College London (2002) The International Corpus of English (ICE) web site [online]. [Accessed 9th November 2004]. Available from World Wide Web: http://www.ucl.ac.uk/english-usage/ice/#

Humanities Text Initiative (2005) The TEI Header [online]. [Accessed 16th February 2005]. Available from World Wide Web: http://www.hti.umich.edu/cgi/t/tei/tei-idx?type=pointer&value=HD

Meyer, C. (2002) English Corpus Linguistics, an Introduction. Cambridge: Cambridge University Press.

Nelson, G. (1996a) The Design of the Corpus. In Greenbaum, S. (ed.) (1996) Comparing English Worldwide: The International Corpus of English. Oxford: Clarendon Press.

Nelson, G., Wallis, S. and Aarts, B. (2002) Exploring Natural Language: working with the British component of the International Corpus of English. Philadelphia: John Benjamins Publishing Company.

Republic of Mauritius (2004) History [online]. [Accessed 20th November 2004]. Available from World Wide Web: http://www.gov.mu/abtmtius/history.htm

Sharoff, S. (2004) Methods and tools for development of the Russian Reference Corpus. In Archer, D., Wilson, A. and Rayson, P. (eds.) Corpus Linguistics Around the World. Amsterdam: Rodopi.

SourceForge.net (2005) Project: MinGW - Minimalist GNU for Windows [online]. [Accessed 11th March 2005]. Available from World Wide Web: http://sourceforge.net

The School of Computing, University of Leeds (1998-2004) The University of Leeds web site [online]. [Accessed 21st October 2004]. Available from World Wide Web: http://www.comp.leeds.ac.uk/research/index.shtml

Part 2: DIAGRAMMATIC WORK PLAN

The work has been organised into eight activity streams. The three research fellows will be working in parallel during the 36-month project starting August 2005.

WP0: Project Management via International Steering Panel (1 month each over 3 years)

WP1: Collection of Spoken and Written Text of English (10 months RF1, 8 months RF2, 8 months RF3)

WP2: Transcription (5 months RF1, 3 months RF2, 2 months RF3)

WP3: Textual Mark-up (10 months RF1, 8 months RF2, 8 months RF3)

WP4: Word-class tagging (6 months RF2, 8 months RF3)

WP5: Evaluation (6 months RF1, 4 months RF2, 2 months RF3)

WP6: Comparison across dialects (3 months RF1, 3 months RF2, 5 months RF3)

WP7: Dissemination for Exploitation (2 months each)

Research Fellow 1:

[Gantt chart: rows WP0 to WP7 against 36 monthly columns, August (A) to July (J) over three years; the bars were lost in extraction.]

Research Fellow 2:

[Gantt chart: rows WP0 to WP7 against 36 monthly columns, August (A) to July (J) over three years; the bars were lost in extraction.]

Research Fellow 3:

[Gantt chart: rows WP0 to WP7 against 36 monthly columns, August (A) to July (J) over three years; the bars were lost in extraction.]

Legend: Actual Estimate, Possible overflow
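The allocations listed in the work plan can be tallied per research fellow, and the WP3 allocation can be compared with the manual mark-up time measured in the pilot. The Python sketch below is a rough consistency check only, not a calculation from the proposal. It assumes that “1 month each” (WP0) and “2 months each” (WP7) mean one and two months per fellow, that every one of the 10,000 texts is cleaned and marked up by hand at the pilot rate of 20 to 30 minutes, and it uses the work plan's definition of a month as 20 days of 8 hours. All names in the code are illustrative.

```python
HOURS_PER_MONTH = 20 * 8  # work-plan assumption: 20 days of 8 hours

# Months per work package for (RF1, RF2, RF3), as listed in the work plan.
# Assumption: WP0 "1 month each" and WP7 "2 months each" are per-fellow figures.
WORK_PLAN = {
    "WP0": (1, 1, 1),
    "WP1": (10, 8, 8),
    "WP2": (5, 3, 2),
    "WP3": (10, 8, 8),
    "WP4": (0, 6, 8),
    "WP5": (6, 4, 2),
    "WP6": (3, 3, 5),
    "WP7": (2, 2, 2),
}


def months_per_fellow(plan):
    """Sum the allocations column by column: one total per research fellow."""
    return tuple(sum(column) for column in zip(*plan.values()))


def manual_markup_months(texts, minutes_low, minutes_high):
    """Convert a per-text range in minutes into a range in working months."""
    def to_months(minutes_per_text):
        return texts * minutes_per_text / 60 / HOURS_PER_MONTH
    return to_months(minutes_low), to_months(minutes_high)


totals = months_per_fellow(WORK_PLAN)
low, high = manual_markup_months(texts=40 * 250, minutes_low=20, minutes_high=30)

print("months per fellow (RF1, RF2, RF3):", totals)
print("person-months in total:", sum(totals))
print("WP3 allocation in months:", sum(WORK_PLAN["WP3"]))
print(f"fully manual mark-up estimate: {low:.1f} to {high:.1f} months")
```

On this reading the three fellows are allocated 37, 35 and 36 months, which sums to the 108 person-months available to three fellows over 36 months. The 26 months allocated to WP3 fall inside the 20.8 to 31.25 months that fully manual mark-up would need at the pilot rate.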