7/28/2019 _6_A Malay-English Terminology Retrieval System
The process of indexing can be divided into two steps: first, clustering the terms into
root words using a stemming algorithm; second, clustering the root words into
two-character subwords (bigrams). The stemming algorithm used to cluster Malay terms
is Fatimah's algorithm (1996). Porter's suffix-stripping stemming algorithm (Porter 1980)
is used to cluster the English terms.
The function of the retrieving module is to retrieve words that match the query word.
The partial string matching techniques used are stemming and bigram matching, which
determine whether two words match each other. For bigram matching, a Dice coefficient
threshold of 0.6 is used for partial matching. Figure 1 shows an example of the
output for the query "solar energy".
Figure 1: Output Interface
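The bigram matching step can be illustrated with a minimal Python sketch. It assumes that each word is reduced to its set of distinct two-character substrings and that two words match when their Dice coefficient reaches the 0.6 threshold. The function names, the use of sets rather than multisets, and the "greater than or equal" comparison are illustrative assumptions, not details of the system itself. The Overlap coefficient, which is compared with Dice in the experiments, is included in its usual form.

```python
def bigrams(word):
    """Return the set of distinct two-character substrings of a word."""
    w = word.lower()
    return {w[i:i + 2] for i in range(len(w) - 1)}


def dice(a, b):
    """Dice coefficient: 2 * |A & B| / (|A| + |B|) over bigram sets."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))


def overlap(a, b):
    """Overlap coefficient: |A & B| / min(|A|, |B|) over bigram sets."""
    x, y = bigrams(a), bigrams(b)
    if not x or not y:
        return 0.0
    return len(x & y) / min(len(x), len(y))


def matches(query, term, threshold=0.6):
    """Assumed matching rule: Dice value at or above the threshold."""
    return dice(query, term) >= threshold


if __name__ == "__main__":
    print(dice("makan", "makanan"), overlap("makan", "makanan"))
    print(matches("makan", "makanan"), matches("solar", "energy"))
```

Under these assumptions, "makan" and "makanan" share four distinct bigrams, giving a Dice value of 8/9 and an Overlap value of 1.0, so the pair is treated as a match, while "solar" and "energy" share no bigram and are not.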
Evaluation Procedure
The experiments are performed to evaluate the retrieval effectiveness of various
techniques employed within the framework of n-gram matching in the context of the
automatic query expansion approach as set out by Lennon et al. [1981]. This involves the
ranking and the calculation of string similarity measures of each unique term in the
dictionary to a specified query term. The effectiveness of retrieval is evaluated using
recall and precision measures based on the number of words retrieved and relevant
(R&R). The recall, R, is defined as the proportion of relevant terms actually retrieved from
the dictionary with respect to a specified query term. The precision, P, is defined as
the proportion of retrieved terms that are actually relevant. The relevance of each term in
the dictionary to a specified query term is determined manually by exhaustively scanning
the dictionary with the help of the Find string-matching utility in Microsoft Excel.
Two terms are regarded as relevant to each other if they share a common
root.
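As a small illustration of these definitions, the following Python sketch computes the number of words retrieved and relevant (R&R), the recall and the precision for one query. The function name and the example word lists are hypothetical; the relevance set stands in for the manual judgements described above.

```python
def recall_precision(retrieved, relevant):
    """Return (R&R count, recall, precision) for one query.

    retrieved: list of dictionary terms returned for the query
    relevant:  set of dictionary terms judged relevant to the query
    """
    rr = sum(1 for term in retrieved if term in relevant)
    recall = rr / len(relevant) if relevant else 0.0
    precision = rr / len(retrieved) if retrieved else 0.0
    return rr, recall, precision


if __name__ == "__main__":
    # Hypothetical data: terms sharing the root "makan" are relevant.
    retrieved = ["makan", "makanan", "makna", "pemakan"]
    relevant = {"makan", "makanan", "pemakan", "dimakan", "memakan"}
    print(recall_precision(retrieved, relevant))
```

In this hypothetical case three of the four retrieved terms are relevant and three of the five relevant terms are retrieved, so the precision is 0.75 and the recall is 0.6.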
The measure of van Rijsbergen, E, which is a weighted combination of recall and
precision, is also used to calculate the retrieval effectiveness:
\[ E = 100 \left[ 1 - \frac{(1 + \beta^2) P R}{\beta^2 P + R} \right] \]
where β, in the range of 0 to infinity, is used to reflect the relative importance of recall
and precision. β = 1.0 reflects attaching equal importance to precision and recall, while
β = 2.0 reflects attaching twice as much importance to recall as to precision. In our
experiments we used the value β = 2.0. There is an inverse relationship between the E
value and the effectiveness: the lower the E value, the more effective the retrieval.
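A short Python sketch of the E computation follows. It assumes the standard form of van Rijsbergen's measure scaled to the range 0 to 100, that is E = 100 [1 - (1 + β²)PR / (β²P + R)], with P and R expressed as fractions between 0 and 1. The function name and the convention of returning 100 when nothing relevant is retrieved are assumptions made for illustration.

```python
def e_measure(precision, recall, beta=2.0):
    """van Rijsbergen's E on a 0..100 scale; lower values are better."""
    b2 = beta * beta
    denom = b2 * precision + recall
    if denom == 0:
        # Assumed convention: no relevant term retrieved gives the worst score.
        return 100.0
    return 100.0 * (1.0 - (1.0 + b2) * precision * recall / denom)


if __name__ == "__main__":
    print(e_measure(0.75, 0.6, beta=2.0))
    print(e_measure(0.75, 0.6, beta=1.0))
```

For P = 0.75 and R = 0.6, β = 2.0 gives E = 100 (1 - 2.25 / 3.6) = 37.5, while β = 1.0 gives a lower value of about 33.3, since the weaker of the two components here is recall and β = 2.0 weights it more heavily.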
For each query term, the terms in the dictionary are ranked based on a similarity
matching measure as specified in each experiment. The E values are calculated at cutoff
points of the top 10 to 100 ranked dictionary terms, at intervals of 10. This is done for all
84 query terms, and the mean value of E at each cutoff point is calculated. The average of
the mean values of E over the ten cutoff points is then calculated and used to evaluate the
effectiveness of each technique. Generally, the average of the mean values of E over all the
cutoff points serves as a good indicator of performance; we use the symbol aE to denote
this value.
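The averaging procedure can be sketched in Python as follows. The sketch assumes that each query comes with a ranked list of dictionary terms and a set of relevant terms, and that E takes the standard 0 to 100 form with β = 2.0; all names (`rankings`, `relevance`, `average_e`) are hypothetical.

```python
from statistics import mean

CUTOFFS = tuple(range(10, 101, 10))  # 10, 20, ..., 100


def e_value(precision, recall, beta=2.0):
    """van Rijsbergen's E on a 0..100 scale (assumed standard form)."""
    denom = beta * beta * precision + recall
    if denom == 0:
        return 100.0
    return 100.0 * (1.0 - (1.0 + beta * beta) * precision * recall / denom)


def e_at_cutoff(ranked, relevant, cutoff, beta=2.0):
    """E for one query when only the top `cutoff` ranked terms are retrieved."""
    top = ranked[:cutoff]
    rr = sum(1 for term in top if term in relevant)
    precision = rr / len(top) if top else 0.0
    recall = rr / len(relevant) if relevant else 0.0
    return e_value(precision, recall, beta)


def average_e(rankings, relevance, cutoffs=CUTOFFS, beta=2.0):
    """Return (mean E per cutoff over all queries, aE over the cutoffs)."""
    mean_e = {
        c: mean([e_at_cutoff(rankings[q], relevance[q], c, beta) for q in rankings])
        for c in cutoffs
    }
    return mean_e, mean(list(mean_e.values()))
```

Calling `average_e` with the ranked lists and relevance sets of all query terms yields the mean E at each cutoff point and the single summary value aE.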
A statistical sign test is used to measure the significance of the difference between two
methods under consideration, with the null hypothesis as follows:
P[Xi > Yi] = P[Xi < Yi] = 1/2
where Xi and Yi are the two scores for a matched pair obtained from the two methods. In
applying the sign test, we consider the direction of the difference between every Xi and Yi,
noting whether the sign of the difference is positive or negative. The binomial distribution
with p = q = 1/2 is used to calculate the probability, p, associated with the occurrence of a
particular number of positives and negatives. If the number of matched pairs, N, is larger
than 35, the normal approximation to the binomial distribution is used in terms of the z value
[Siegel and Castellan, 1988].
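The sign test described above can be sketched in Python as follows. The sketch assumes a one-tailed test on the number of negative differences, that tied pairs are dropped, and that the normal approximation uses a continuity correction of 0.5; these choices and the function names are assumptions for illustration.

```python
from math import comb, erfc, sqrt


def sign_counts(xs, ys):
    """Return (N, positives, negatives) for matched pairs, dropping ties."""
    pos = sum(1 for x, y in zip(xs, ys) if x > y)
    neg = sum(1 for x, y in zip(xs, ys) if x < y)
    return pos + neg, pos, neg


def sign_test(n, k):
    """One-tailed probability of k or fewer negatives among n pairs.

    Returns (p, z); z is None when the exact binomial is used (n <= 35).
    """
    if n > 35:
        # Normal approximation with an assumed continuity correction.
        z = (k + 0.5 - n / 2) / (0.5 * sqrt(n))
        p = 0.5 * erfc(-z / sqrt(2))  # lower-tail probability P(Z <= z)
        return p, z
    p = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return p, None


if __name__ == "__main__":
    print(sign_test(25, 6))
    print(sign_test(45, 5))
```

Under these assumptions, N = 25 with 6 negatives gives p of about .007, and N = 45 with 5 negatives gives a z value of magnitude about 5.07, which appears consistent with the corresponding entries in Table_3 and Table_5.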
The Results
The commonly used similarity coefficients are Dice and Overlap coefficients. We run
[...]
results obtained using this approach are given in Table_7. Digram matching performs
better than trigram matching, with aE = 61.04 as compared to 61.35, and clearly better than
the previous approach in which only the query terms are stemmed. Table_5 shows the
results of the sign test, which indicate that digram matching using a stemmed query
and a stemmed dictionary performs significantly better than with a stemmed query
alone.
Comparison with Stemmed-Boolean Matching
We also ran experiments using conventional stemmed-boolean matching between the
stemmed query and the stemmed dictionary to compare its effectiveness with the approach
of incorporating stemming in the n-gram framework. Comparing these two approaches
poses some problems: stemmed-boolean matching is based on a boolean match, whereas the
latter is based on a ranking approach with a degree of similarity. E values are used to compare
the performance, calculated using the following methods:
1) using variable cutoff points based on the number of words retrieved by the
stemmed-boolean match: for each query term, if the stemmed-boolean match
retrieved x terms, then a cutoff point of x is taken for that query; the mean E
values obtained for this method are shown in Table_8.
2) using a cutoff with weighted transformation, at cutoff points 5 and 10, on the
number of words retrieved by the stemmed-boolean match: if c is the cutoff
point and x is the number of words retrieved by the stemmed-boolean match,
then do the following transformation:
if x < c then add c-x non-relevant words to the number of words retrieved;
else multiply the number of words retrieved and relevant by c/x;
the mean E values obtained for this method are given in Table_8.
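The weighted transformation in method 2 can be sketched in Python as follows. The sketch reads the condition as x < c and assumes that, after the transformation, the number of words retrieved is taken to be c in both branches; the function name and this reading are assumptions, not a statement of the exact procedure used.

```python
def weighted_transform(x, rr, c):
    """Map a stemmed-boolean result onto a fixed cutoff c (c > 0).

    x:  number of words retrieved by the stemmed-boolean match
    rr: number of those words that are relevant
    Returns (retrieved, retrieved_and_relevant) after the transformation.
    """
    if x < c:
        # Pad with c - x non-relevant words: R&R is unchanged.
        return c, float(rr)
    # Scale R&R down in proportion to the cutoff.
    return c, rr * c / x


if __name__ == "__main__":
    print(weighted_transform(3, 2, 5))
    print(weighted_transform(20, 8, 10))
```

For example, a boolean match that retrieves 3 words, 2 of them relevant, is padded to 5 retrieved words with 2 relevant at cutoff 5, while a match retrieving 20 words with 8 relevant is scaled to 10 retrieved words with 4 relevant at cutoff 10.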
Table_8 shows that, under the variable cutoff evaluation, the stemmed-boolean
approach performs better than the incorporating-stemming approach. Table_9 shows that
the difference in performance is significant. However, this method of evaluation is biased
towards the stemmed-boolean approach.
On the other hand, under the cutoff with weighted transformation evaluation, the
incorporating-stemming approach performs better, but not significantly, as Table_10 shows.
Thus, we cannot say unequivocally which performs better. The two methods are based
on two different approaches, one on boolean matching and the other on matching with
an uncertainty measure, and both have their functional purposes.
Conclusions
From the experiments performed we can conclude that the Overlap coefficient performs
better than the Dice coefficient, but not significantly, and that digram matching performs
significantly better than trigram matching. The use of stemming prior to digram and trigram
matching has significantly enhanced the performance. Applying stemming to both
the query terms and the dictionary performs significantly better than applying
stemming to the query terms alone, which in turn performs significantly better than
applying no stemming.
However, we cannot unequivocally conclude which is better between the conventional
stemmed-boolean approach and the approach of incorporating stemming in the n-gram
framework. Furthermore, the two approaches are based on different paradigms, one on boolean
retrieval and the other on best-match ranking retrieval, and each has advantages and
disadvantages of its own.
References
Belal Mustafa, A.A., Tengku Mohd. T.S., & Mohd. Yusoff. 1995. SISDOM: A
Multilingual Document Retrieval System. Asian Libraries, Vol. 4, No. 3, MCB
University Press Limited, England.
Fatimah Ahmad. 1995. Satu Sistem Capaian Dokumen Bahasa Melayu: Satu Pendekatan
Eksperimen Dan Analisis. Ph.D. Thesis, Universiti Kebangsaan Malaysia.
Fatimah Ahmad, Mohammed Yusoff, & Tengku Mohd. T. Sembok. 1996. Experiments
with A Malay Stemming Algorithm. Journal of the American Society for Information
Science.
Frakes, W.B. 1992. Stemming Algorithms. In Frakes, W.B. & Baeza-Yates, R. (eds.).
Information Retrieval: Data Structures & Algorithms: 131-160. Englewood Cliffs:
Prentice Hall.
Marcus, R.S. 1983. An Experimental Comparison of the Effectiveness of Computers
and Humans as Search Intermediaries. Journal of the American Society for Information
Science 34(6): 381-404.
Porter, M.F. 1980. An Algorithm for Suffix Stripping. Program 14: 130-137.
Salton, G. & McGill, M.J. 1983. Introduction to Modern Information Retrieval. New
York: McGraw-Hill.
Table_1: Mean values at cutoff points to assess similarity coefficients performance
Measure        n-gram    Coeff    Cutoffs                                                               Cutoff
                                  10     20     30     40     50     60     70     80     90     100    Mean
Number of R&R  Digrams   Dice     2.85   3.40   3.67   4.04   4.23   4.42   4.53   4.63   4.71   4.85   4.13
                         Overlap  2.89   3.57   3.90   4.17   4.39   4.50   4.65   4.81   4.91   5.06   4.28
               Trigrams  Dice     2.78   3.22   3.53   3.65   3.75   3.90   3.94   4.06   4.08   4.11   3.70
                         Overlap  2.84   3.36   3.54   3.72   3.82   3.94   4.00   4.08   4.17   4.26   3.77
E              Digrams   Dice     57.66  64.15  68.47  71.48  73.87  76.29  78.08  79.69  81.05  82.01  73.27
                         Overlap  56.85  62.25  67.35  70.75  73.44  75.95  77.78  79.25  80.60  81.58  72.58
                         Fusion   55.85  62.82  67.56  71.21  73.82  76.13  78.11  79.15  80.24  81.52  72.64
               Trigrams  Dice     58.72  65.85  69.96  73.81  76.70  78.61  80.62  81.96  83.37  84.54  75.41
                         Overlap  58.03  64.33  70.03  73.62  76.52  78.58  80.48  81.93  83.11  84.14  75.07
Table_2: Sign test to determine whether Overlap coefficient performs significantly better than Dice
coefficient.
n-gram    Cutoff =              10    20    30    40    50    60    70    80    90    100
Digrams   N                     27    27    23    29    23    22    22    26    26    23
          negatives             13    8     8     12    9     9     10    13    12    10
          p                     .500  .061  .105  .229  .202  .262  .416  .577  .423  .339
          significant (α=0.05)  no    yes   no    no    no    no    no    no    no    no
Trigrams  N                     18    15    11    13    16    15    15    16    13    12
          negatives             6     3     5     6     7     7     6     8     5     3
          p                     .119  .018  .500  .500  .400  .500  .304  .598  .291  .073
          significant (α=0.05)  no    yes   no    no    no    no    no    no    no    no
Table_3: Sign test to determine whether digrams perform significantly better than trigrams (overlap
coefficient is used).
Cutoff =              10    20    30    40    50    60    70    80    90    100
N                     26    25    29    32    25    26    30    30    30    25
negatives             12    6     5     6     0     2     3     5     6     3
p                     .423  .007  .001  .000  .000  .000  .000  .000  .001  .000
significant (α=0.05)  no    yes   yes   yes   yes   yes   yes   yes   yes   yes
Table_4: Sign test to determine whether digrams with stemmed-query perform significantly better than
digrams with non-stemmed query (overlap coefficient is used).
Cutoff =              10    20    30    40    50    60    70    80    90    100
N                     34    30    32    30    30    32    31    31    31    29
negatives             2     2     3     2     2     2     2     2     2     1
p                     .000  .000  .000  .000  .000  .000  .000  .000  .000  .000
significant (α=0.05)  yes   yes   yes   yes   yes   yes   yes   yes   yes   yes
Table_5: Sign test to determine whether digrams with stemmed-query and stemmed-dictionary perform
significantly better than digrams with stemmed-query only (overlap coefficient is used).
Cutoff =              10    20    30    40    50    60    70    80    90    100
N                     45    29    22    19    15    14    14    13    12    12
negatives             5     3     2     3     2     1     1     2     1     1
p or z                5.06  .000  .000  .002  .004  .001  .001  .011  .003  .003
significant (α=0.05)  yes   yes   yes   yes   yes   yes   yes   yes   yes   yes
Table_6: Mean values at cutoff points to assess the fusion between digrams and trigrams against Digrams
only (overlap coefficient is used).
Measure        n-gram    Cutoffs                                                               Cutoff
                         10     20     30     40     50     60     70     80     90     100    Mean
Number of R&R  Digrams   2.89   3.57   3.90   4.17   4.39   4.50   4.65   4.81   4.91   5.06   4.28
               Fusion    3.06   3.57   3.78   4.03   4.22   4.38   4.53   4.72   4.91   4.97   4.21
E              Digrams   56.85  62.25  67.35  70.75  73.44  75.95  77.78  79.25  80.60  81.58  72.58
               Fusion    54.97  62.59  68.20  71.62  74.25  76.50  78.32  79.54  80.53  81.79  72.83
Table_7: Mean values at cutoff points when stemming is incorporated.
Measure        n-gram    Stemmed       Cutoffs                                                               Cutoff
                         Query  Dictn  10     20     30     40     50     60     70     80     90     100    Mean
Number of R&R  Digrams   yes    no     4.19   5.11   5.45   5.70   5.82   5.94   6.00   6.09   6.10   6.15   15.1
               Trigrams  yes    no     4.00   4.64   4.92   5.28   5.48   5.57   5.66   5.70   5.83   5.86   14.1
               Digrams   yes    yes    5.26   6.07   6.21   6.25   6.27   6.32   6.35   6.35   6.36   6.38   6.18
               Trigrams  yes    yes    5.26   6.02   6.13   6.14   6.17   6.26   6.28   6.28   6.28   6.28   6.11
               Fusion    yes    yes    5.27   6.07   6.20   6.23   6.29   6.32   6.34   6.35   6.35   6.36   6.18
E              Digrams   yes    no     42.70  49.80  56.80  61.81  65.99  69.11  71.92  74.10  76.15  77.73  64.6
               Trigrams  yes    no     45.05  53.14  60.05  63.91  67.49  70.69  73.19  75.44  76.97  78.59  66.4
               Digrams   yes    yes    31.22  42.35  51.78  58.57  63.61  67.35  70.42  73.02  75.14  76.95  61.0
               Trigrams  yes    yes    31.34  42.80  52.22  59.07  63.97  67.57  70.65  73.24  75.40  77.23  61.3
               Fusion    yes    yes    31.01  42.33  51.83  58.62  63.48  67.35  70.46  73.02  75.19  76.99  61.0
Table_8: The mean values to compare the stemmed-boolean and the incorporating-stemming approaches.
Measure                     Approach           Stemmed       Variable  Cutoff with Weighted
                                                             Cutoff    Transformation c/x
                                               Query  Dictn            5      10
No. Retrieved and Relevant  Digrams (overlap)  yes    yes    5.44      3.46   5.26
                            Stemmed-boolean    yes    yes    5.60      3.48   5.15
E                           Digrams (overlap)  yes    yes    13.43     36.94  31.22
                            Stemmed-boolean    yes    yes    8.19      37.13  33.24
Table_9: Sign test using variable cutoff to determine whether stemmed-boolean approach performs
significantly better than incorporating-stemming approach (overlap coefficient is used).
Cutoff =              Variable
N                     12
negatives             0
p                     .000
significant (α=0.05)  yes
Table_10: Sign test using cutoff with weighted transformation to determine whether
incorporating-stemming approach performs significantly better than stemmed-boolean approach
(overlap coefficient is used).
Cutoff =              5     10
N                     11    16
negatives             3     6
p                     .113  .227
significant (α=0.05)  no    no