A Malay-English Terminology Retrieval System



The process of indexing can be divided into two steps: first, clustering the terms into root words using a stemming algorithm; second, clustering the root words into two-character subwords (bigrams). Malay terms are clustered using Fatimah's stemming algorithm (1996), and English terms are clustered using Porter's suffix-stripping stemming algorithm (Porter 1980).
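The two indexing steps can be illustrated with a minimal sketch. It assumes single-word terms, and the function `toy_stem` is a hypothetical stand-in used only to keep the example self-contained; it is neither Fatimah's nor Porter's algorithm. The names `build_index`, `root_to_terms` and `bigram_to_roots` are likewise illustrative.

```python
from collections import defaultdict


def bigrams(word):
    """Return the set of two-character subwords of a word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def toy_stem(term):
    """Hypothetical placeholder stemmer (NOT Porter's or Fatimah's algorithm)."""
    for suffix in ("ing", "es", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[:-len(suffix)]
    return term


def build_index(terms, stem=toy_stem):
    """Step 1: cluster terms by root word. Step 2: cluster roots by bigram."""
    root_to_terms = defaultdict(set)
    bigram_to_roots = defaultdict(set)
    for term in terms:
        root = stem(term.lower())
        root_to_terms[root].add(term)
        for bg in bigrams(root):
            bigram_to_roots[bg].add(root)
    return root_to_terms, bigram_to_roots


roots, grams = build_index(["heating", "heats", "heat"])
```

In this sketch a real stemmer would be passed in through the `stem` parameter, one per language.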

The function of the retrieving module is to retrieve words that match the query word. The partial string matching techniques used are stemming and bigram matching, which determine whether two words match each other. For bigram matching, a Dice coefficient threshold of 0.6 is used for partial matching. Figure 1 shows an example of the output for the query "solar energy".
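As an illustration of bigram matching, the sketch below computes the Dice coefficient between two words, together with the Overlap coefficient that is compared against it in the Results. It assumes that each word is represented by its set of distinct bigrams and that a word pair matches when the Dice value is greater than or equal to the 0.6 threshold; the paper does not state either detail, and the function names are illustrative.

```python
def bigrams(word):
    """Set of distinct two-character subwords (an assumption: sets, not multisets)."""
    word = word.lower()
    return {word[i:i + 2] for i in range(len(word) - 1)}


def dice(a, b):
    """Dice coefficient: 2 * |A & B| / (|A| + |B|)."""
    x, y = bigrams(a), bigrams(b)
    if not x or not y:
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))


def overlap(a, b):
    """Overlap coefficient: |A & B| / min(|A|, |B|)."""
    x, y = bigrams(a), bigrams(b)
    if not x or not y:
        return 0.0
    return len(x & y) / min(len(x), len(y))


def is_partial_match(a, b, threshold=0.6):
    """Assumed rule: a match when the Dice value reaches the threshold."""
    return dice(a, b) >= threshold


print(dice("solar", "polar"), is_partial_match("solar", "energy"))
```

Here "solar" and "polar" share three of their four bigrams, so both coefficients equal 0.75 and the pair is treated as a partial match.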

Figure 1: Output Interface

Evaluation Procedure

The experiments evaluate the retrieval effectiveness of various techniques employed within the framework of n-gram matching, in the context of the automatic query expansion approach set out by Lennon et al. [1981]. This involves calculating a string similarity measure between each unique term in the dictionary and a specified query term, and ranking the terms accordingly. The effectiveness of retrieval is evaluated using recall and precision measures based on the number of words retrieved and relevant (R&R). Recall, R, is defined as the proportion of relevant terms actually retrieved from the dictionary with respect to a specified query term. Precision, P, is defined as the proportion of retrieved terms that are actually relevant. The relevance of each term in the dictionary to a specified query term is determined manually by exhaustively scanning the dictionary with the help of the Find string-matching utility in Microsoft Excel. Two terms are regarded as relevant to each other if they share a common root.

The measure of van Rijsbergen, E, which is a weighted combination of recall and precision, is also used to calculate the retrieval effectiveness:

$$E = 100 \left[ 1 - \frac{(1 + \beta^2) P R}{\beta^2 P + R} \right]$$

where β, in the range of 0 to infinity, is used to reflect the relative importance of recall and precision. β = 1.0 reflects attaching equal importance to precision and recall, while β = 2.0 reflects attaching twice as much importance to recall as to precision. In our experiments we used the value β = 2.0. There is an inverse relationship between the E value and the effectiveness: the lower the E value, the more effective the retrieval.
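The E measure can be computed directly from precision and recall, as in the sketch below. It assumes P and R are given as fractions between 0 and 1, and it assigns the worst value, E = 100, when either is zero (a convention chosen here to avoid division by zero, not stated in the paper). The function name `e_measure` is illustrative.

```python
def e_measure(precision, recall, beta=2.0):
    """van Rijsbergen's E, scaled to 0..100; lower values mean better retrieval."""
    if precision <= 0 or recall <= 0:
        # Assumed convention: nothing relevant retrieved gives the worst score.
        return 100.0
    b2 = beta * beta
    return 100.0 * (1.0 - ((1.0 + b2) * precision * recall) / (b2 * precision + recall))


print(e_measure(0.5, 0.5))             # beta = 2.0, as in the experiments
print(e_measure(0.5, 1.0, beta=1.0))   # equal weight on precision and recall
```

With P = R = 0.5 the weighting has no effect and E is 50; perfect precision and recall give E = 0.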

For each query term, the terms in the dictionary are ranked by the similarity measure specified in each experiment. The E values are calculated at cutoff points of the top 10 to 100 ranked dictionary terms, at intervals of 10. This is done for all 84 query terms, and the mean value of E at each cutoff point is calculated. The average of these mean values over the ten cutoff points is then used to evaluate the effectiveness of each technique. Since this average serves as a good general indicator of performance, we use the symbol aE to denote it.
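The calculation of aE can be sketched as follows. The sketch assumes each query is supplied as a ranked list of dictionary terms plus the manually determined set of relevant terms, and that precision at a cutoff is taken over the terms actually available in the ranking; the names `e_at_cutoff`, `average_e` and `runs` are illustrative.

```python
from statistics import mean

CUTOFFS = range(10, 101, 10)  # cutoff points 10, 20, ..., 100


def e_value(precision, recall, beta=2.0):
    """van Rijsbergen's E scaled to 0..100 (100 when nothing relevant is retrieved)."""
    if precision <= 0 or recall <= 0:
        return 100.0
    b2 = beta * beta
    return 100.0 * (1.0 - ((1.0 + b2) * precision * recall) / (b2 * precision + recall))


def e_at_cutoff(ranked, relevant, cutoff, beta=2.0):
    """E for one query, using only the top `cutoff` ranked terms."""
    top = ranked[:cutoff]
    rr = sum(1 for term in top if term in relevant)  # retrieved and relevant
    precision = rr / len(top) if top else 0.0
    recall = rr / len(relevant) if relevant else 0.0
    return e_value(precision, recall, beta)


def average_e(runs, beta=2.0):
    """aE: mean E over queries at each cutoff, then averaged over the cutoffs."""
    per_cutoff = [mean(e_at_cutoff(r, rel, c, beta) for r, rel in runs) for c in CUTOFFS]
    return mean(per_cutoff)


ranked = [f"t{i}" for i in range(100)]  # toy ranking for a single query
a_e = average_e([(ranked, {"t0", "t1"})])
```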

A statistical sign test is used to measure the significance of the difference between two methods under consideration, with the following null hypothesis:

P[Xi > Yi] = P[Xi < Yi] = 1/2

where Xi and Yi are the two scores for a matched pair obtained from the two methods. In applying the sign test, we consider the direction of the difference between every Xi and Yi, noting whether the sign of the difference is positive or negative. The binomial distribution with p = q = 1/2 is used to calculate the probability, p, associated with the occurrence of a particular number of positives and negatives. If the number of matched pairs, N, is larger than 35, the normal approximation to the binomial distribution is used, expressed in terms of the z value [Siegel and Castellan, 1988].
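The sign test calculation can be sketched as below. The sketch assumes a one-tailed test on the number of negatives and, for N larger than 35, a normal approximation with a continuity correction of 0.5; both choices are assumptions made for illustration. Under these assumptions, 6 negatives out of N = 18 gives p of about .119, which agrees with the corresponding entry in Table_2.

```python
from math import comb, erf, sqrt


def binomial_p(n_pairs, n_negatives):
    """One-tailed probability of at most n_negatives negatives when p = q = 1/2."""
    return sum(comb(n_pairs, k) for k in range(n_negatives + 1)) / 2 ** n_pairs


def z_value(n_pairs, n_negatives):
    """z for the normal approximation (continuity correction is an assumption)."""
    return (n_negatives + 0.5 - n_pairs / 2) / (0.5 * sqrt(n_pairs))


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def sign_test(n_pairs, n_negatives, alpha=0.05):
    """Return (p, significant), switching to the normal approximation when N > 35."""
    if n_pairs > 35:
        p = normal_cdf(z_value(n_pairs, n_negatives))
    else:
        p = binomial_p(n_pairs, n_negatives)
    return p, p < alpha


p_small, sig_small = sign_test(18, 6)
p_large, sig_large = sign_test(45, 5)
```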

The Results

The commonly used similarity coefficients are Dice and Overlap coefficients. We run



The results obtained using this approach are given in Table_7. Digram matching performs better than trigram matching, with aE = 61.04 compared with 61.35, and clearly better than the previous approach, in which only the query terms are stemmed. Table_5 shows the results of the sign test, which indicate that digram matching using both a stemmed query and a stemmed dictionary performs significantly better than digram matching with a stemmed query alone.

Comparison with Stemmed-Boolean Matching

We also ran experiments using conventional stemmed-boolean matching between the stemmed query and the stemmed dictionary, to compare its effectiveness with the approach of incorporating stemming in the n-gram framework. Comparing these two approaches poses some problems: stemmed-boolean matching is based on a boolean match, whereas the latter is based on a ranking approach with degrees of similarity. E values are used to compare performance, calculated using the following methods:

1) using variable cutoff points based on the number of words retrieved by the stemmed-boolean match: for each query term, if the stemmed-boolean match retrieved x terms, then a cutoff point of x is taken for that query; the mean E values obtained for this method are shown in Table_8.

2) using cutoff with weighted transformation, at cutoff points 5 and 10, on the number of words retrieved by the stemmed-boolean match: if c is the cutoff point and x is the number of words retrieved by the stemmed-boolean match, then apply the following transformation:

if x < c then add c - x non-relevant words to the number of words retrieved; else multiply the number of words retrieved and relevant by c/x;

the mean E values obtained for this method are given in Table_8.
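The weighted transformation in method 2 can be sketched as follows. The comparison x < c follows the rule as described; treating the number of words retrieved as c in the second branch is an assumption, since only the scaling of the retrieved-and-relevant count is stated. The function and parameter names are illustrative.

```python
def weighted_transform(x, rr, c):
    """Rescale a boolean result of x words (rr of them relevant) to cutoff c.

    Returns (retrieved, retrieved_and_relevant) after the transformation.
    """
    if c <= 0:
        raise ValueError("cutoff c must be positive")
    if x < c:
        # Pad with c - x non-relevant words: retrieved rises to c, R&R is unchanged.
        return c, float(rr)
    # x >= c: scale R&R by c/x. Setting retrieved to c here is an assumption.
    return c, rr * c / x


print(weighted_transform(3, 2, 5))    # fewer words retrieved than the cutoff
print(weighted_transform(20, 8, 10))  # more words retrieved than the cutoff
```

The transformed counts can then be converted to precision and recall, and to E, in the same way as for the ranked approach at the same cutoff point.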

Table_8 shows that, under the variable cutoff evaluation, the stemmed-boolean approach performs better than the incorporating-stemming approach. Table_9 shows that the difference in performance is significant. However, this method of evaluation is biased towards the stemmed-boolean approach.

On the other hand, under the cutoff with weighted transformation evaluation, the incorporating-stemming approach performs better, but not significantly so, as Table_10 shows. Thus, we cannot say unequivocally which performs better. The two methods are based on different approaches, one on boolean matching and the other on matching with an uncertainty measure, and both have their functional purposes.

Conclusions

From the experiments performed we can conclude that the Overlap coefficient performs better than the Dice coefficient, but not significantly, and that digram matching performs significantly better than trigram matching. Applying stemming prior to digram and trigram matching significantly enhances performance. Applying stemming to both the query terms and the dictionary performs significantly better than applying stemming to the query terms alone, which in turn performs significantly better than applying no stemming.

However, we cannot conclude unequivocally which is better: the conventional stemmed-boolean approach or the approach of incorporating stemming in n-gram matching. Furthermore, the two approaches are based on different paradigms, one on boolean retrieval and the other on best-match ranking retrieval, and each has its own advantages and disadvantages.

References

Belal Mustafa, A.A., Tengku Mohd. T.S., & Mohd. Yusoff. 1995. SISDOM: A Multilingual Document Retrieval System. Asian Libraries, Vol. 4, No. 3, MCB University Press Limited, England.

Fatimah Ahmad. 1995. Satu Sistem Capaian Dokumen Bahasa Melayu: Satu Pendekatan Eksperimen Dan Analisis. Ph.D. Thesis, Universiti Kebangsaan Malaysia.

Fatimah Ahmad, Mohammed Yusoff, & Tengku Mohd. T. Sembok. 1996. Experiments with a Malay Stemming Algorithm. Journal of the American Society for Information Science.

Frakes, W.B. 1992. Stemming Algorithms. In Frakes, W.B. & Baeza-Yates, R. (eds.). Information Retrieval: Data Structures & Algorithms: 131-160. Englewood Cliffs: Prentice Hall.

Marcus, R.S. 1983. An Experimental Comparison of the Effectiveness of Computers and Humans as Search Intermediaries. Journal of the American Society for Information Science 34(6): 381-404.

Porter, M.F. 1980. An Algorithm for Suffix Stripping. Program 14: 130-137.

Salton, G. & McGill, M.J. 1983. Introduction to Modern Information Retrieval. New York: McGraw-Hill.



Table_1: Mean values at cutoff points to assess the performance of the similarity coefficients

                                  Cutoffs                                                               Cutoff
Measure        n-gram    Coeff    10     20     30     40     50     60     70     80     90     100    Mean
Number of R&R  Digrams   Dice     2.85   3.40   3.67   4.04   4.23   4.42   4.53   4.63   4.71   4.85   4.13
                         Overlap  2.89   3.57   3.90   4.17   4.39   4.50   4.65   4.81   4.91   5.06   4.28
               Trigrams  Dice     2.78   3.22   3.53   3.65   3.75   3.90   3.94   4.06   4.08   4.11   3.70
                         Overlap  2.84   3.36   3.54   3.72   3.82   3.94   4.00   4.08   4.17   4.26   3.77
E              Digrams   Dice     57.66  64.15  68.47  71.48  73.87  76.29  78.08  79.69  81.05  82.01  73.27
                         Overlap  56.85  62.25  67.35  70.75  73.44  75.95  77.78  79.25  80.60  81.58  72.58
                         Fusion   55.85  62.82  67.56  71.21  73.82  76.13  78.11  79.15  80.24  81.52  72.64
               Trigrams  Dice     58.72  65.85  69.96  73.81  76.70  78.61  80.62  81.96  83.37  84.54  75.41
                         Overlap  58.03  64.33  70.03  73.62  76.52  78.58  80.48  81.93  83.11  84.14  75.07

Table_2: Sign test to determine whether the Overlap coefficient performs significantly better than the Dice coefficient.

Cutoff =                          10     20     30     40     50     60     70     80     90     100
Digrams   N                       27     27     23     29     23     22     22     26     26     23
          negatives               13     8      8      12     9      9      10     13     12     10
          p                       .500   .061   .105   .229   .202   .262   .416   .577   .423   .339
          significant (α = 0.05)  no     yes    no     no     no     no     no     no     no     no
Trigrams  N                       18     15     11     13     16     15     15     16     13     12
          negatives               6      3      5      6      7      7      6      8      5      3
          p                       .119   .018   .500   .500   .400   .500   .304   .598   .291   .073
          significant (α = 0.05)  no     yes    no     no     no     no     no     no     no     no

Table_3: Sign test to determine whether digrams perform significantly better than trigrams (overlap coefficient is used).

Cutoff =                10     20     30     40     50     60     70     80     90     100
N                       26     25     29     32     25     26     30     30     30     25
negatives               12     6      5      6      0      2      3      5      6      3
p                       .423   .007   .001   .000   .000   .000   .000   .000   .001   .000
significant (α = 0.05)  no     yes    yes    yes    yes    yes    yes    yes    yes    yes

Table_4: Sign test to determine whether digrams with stemmed query perform significantly better than digrams with non-stemmed query (overlap coefficient is used).

Cutoff =                10     20     30     40     50     60     70     80     90     100
N                       34     30     32     30     30     32     31     31     31     29
negatives               2      2      3      2      2      2      2      2      2      1
p                       .000   .000   .000   .000   .000   .000   .000   .000   .000   .000
significant (α = 0.05)  yes    yes    yes    yes    yes    yes    yes    yes    yes    yes

Table_5: Sign test to determine whether digrams with stemmed query and stemmed dictionary perform significantly better than digrams with stemmed query only (overlap coefficient is used).

Cutoff =                10     20     30     40     50     60     70     80     90     100
N                       45     29     22     19     15     14     14     13     12     12
negatives               5      3      2      3      2      1      1      2      1      1
p or z                  5.06   .000   .000   .002   .004   .001   .001   .011   .003   .003
significant (α = 0.05)  yes    yes    yes    yes    yes    yes    yes    yes    yes    yes

Table_6: Mean values at cutoff points to assess the fusion between digrams and trigrams against digrams only (overlap coefficient is used).

                         Cutoffs                                                               Cutoff
Measure        n-gram    10     20     30     40     50     60     70     80     90     100    Mean
Number of R&R  Digrams   2.89   3.57   3.90   4.17   4.39   4.50   4.65   4.81   4.91   5.06   4.28
               Fusion    3.06   3.57   3.78   4.03   4.22   4.38   4.53   4.72   4.91   4.97   4.21
E              Digrams   56.85  62.25  67.35  70.75  73.44  75.95  77.78  79.25  80.60  81.58  72.58
               Fusion    54.97  62.59  68.20  71.62  74.25  76.50  78.32  79.54  80.53  81.79  72.83

Table_7: Mean values at cutoff points when stemming is incorporated.

                         Stemmed       Cutoffs                                                               Cutoff
Measure        n-gram    Query  Dictn  10     20     30     40     50     60     70     80     90     100    Mean
Number of R&R  Digrams   yes    no     4.19   5.11   5.45   5.70   5.82   5.94   6.00   6.09   6.10   6.15   15.1
               Trigrams  yes    no     4.00   4.64   4.92   5.28   5.48   5.57   5.66   5.70   5.83   5.86   14.1
               Digrams   yes    yes    5.26   6.07   6.21   6.25   6.27   6.32   6.35   6.35   6.36   6.38   6.18
               Trigrams  yes    yes    5.26   6.02   6.13   6.14   6.17   6.26   6.28   6.28   6.28   6.28   6.11
               Fusion    yes    yes    5.27   6.07   6.20   6.23   6.29   6.32   6.34   6.35   6.35   6.36   6.18
E              Digrams   yes    no     42.70  49.80  56.80  61.81  65.99  69.11  71.92  74.10  76.15  77.73  64.6
               Trigrams  yes    no     45.05  53.14  60.05  63.91  67.49  70.69  73.19  75.44  76.97  78.59  66.4
               Digrams   yes    yes    31.22  42.35  51.78  58.57  63.61  67.35  70.42  73.02  75.14  76.95  61.04
               Trigrams  yes    yes    31.34  42.80  52.22  59.07  63.97  67.57  70.65  73.24  75.40  77.23  61.35
               Fusion    yes    yes    31.01  42.33  51.83  58.62  63.48  67.35  70.46  73.02  75.19  76.99  61.08



Table_8: The mean values to compare the stemmed-boolean and the incorporating-stemming approaches.

Measure                     Approach            Stemmed       Variable  Cutoff with Weighted Transformation c/x
                                                Query  Dictn  Cutoff    5       10
No. Retrieved and Relevant  Digrams (overlap)   yes    yes    5.44      3.46    5.26
                            Stemmed-boolean     yes    yes    5.60      3.48    5.15
E                           Digrams (overlap)   yes    yes    13.43     36.94   31.22
                            Stemmed-boolean     yes    yes    8.19      37.13   33.24

Table_9: Sign test using variable cutoff to determine whether the stemmed-boolean approach performs significantly better than the incorporating-stemming approach (overlap coefficient is used).

Cutoff =                Variable
N                       12
negatives               0
p                       .000
significant (α = 0.05)  yes

Table_10: Sign test using cutoff with weighted transformation to determine whether the incorporating-stemming approach performs significantly better than the stemmed-boolean approach (overlap coefficient is used).

Cutoff =                5      10
N                       11     16
negatives               3      6
p                       .113   .227
significant (α = 0.05)  no     no
