A Malay-English Terminology Retrieval System

Upload: alexey-yagudin-wong

Post on 03-Apr-2018


  • 7/28/2019 _6_A Malay-English Terminology Retrieval System


    The process of indexing can be divided into two steps: first, clustering of the terms into root words using a stemming algorithm; second, clustering of the root words into two-character subwords (bigrams). The stemming algorithm used for clustering Malay terms is Fatimah's algorithm (1996). Porter's suffix-stripping stemming algorithm (Porter 1980) is used to cluster the English terms.

    The function of the retrieving module is to retrieve words that match the query word. The partial string matching techniques used are stemming and bigram matching, which determine whether two words match each other. For bigram matching, a Dice coefficient value with a threshold of 0.6 is used for partial matching. Figure 1 shows an example of the output for the query "solar energy".
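
    The bigram matching step described above can be sketched as follows. This is a minimal illustration, not the system's actual code: it assumes bigrams are collected as a set of unique character pairs without padding, and that a term counts as a match when its Dice value is at least the 0.6 threshold. The function names are hypothetical.

```python
def bigrams(word):
    """Return the set of adjacent two-character substrings of a word."""
    w = word.lower()
    return {w[i:i + 2] for i in range(len(w) - 1)}


def dice(a, b):
    """Dice coefficient: 2 * |shared bigrams| / (|bigrams of a| + |bigrams of b|)."""
    x, y = bigrams(a), bigrams(b)
    if not x or not y:
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))


def is_partial_match(query, term, threshold=0.6):
    """Assumption: a term matches when its Dice value reaches the threshold."""
    return dice(query, term) >= threshold


# "energy" and "energi" share 4 of their 5 bigrams each.
print(dice("energy", "energi"))
# "solar" and "sonar" share only "so" and "ar".
print(is_partial_match("solar", "sonar"))
```

    Using sets rather than multisets ignores repeated bigrams within a word; a multiset variant would differ only for words that repeat a character pair.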

    Figure 1: Output Interface

    Evaluation Procedure

    The experiments are performed to evaluate the retrieval effectiveness of various techniques employed within the framework of n-gram matching in the context of the automatic query expansion approach as set out by Lennon et al. [1981]. This involves calculating string similarity measures between each unique term in the dictionary and a specified query term, and ranking the terms accordingly. The effectiveness of retrieval is evaluated using recall and precision measures based on the number of words retrieved and relevant (R&R). The recall, R, is defined as the proportion of relevant terms actually retrieved from the dictionary with respect to a specified query term. The precision, P, is defined as the proportion of retrieved terms that are actually relevant. The relevance of each term in the dictionary to a specified query term is determined manually by exhaustively scanning the dictionary with the help of the string matching utility Find in the Microsoft Excel software. Two terms are regarded as relevant to each other if they share a common root.
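
    As a small illustration of these definitions, the sketch below computes the number of words retrieved and relevant (R&R), recall and precision for one query. The example terms are invented for illustration (Malay words sharing the root "makan") and are not taken from the experimental dictionary.

```python
def recall_precision(retrieved, relevant):
    """Return (R&R count, recall, precision) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    rr = len(retrieved & relevant)          # words retrieved and relevant
    recall = rr / len(relevant) if relevant else 0.0
    precision = rr / len(retrieved) if retrieved else 0.0
    return rr, recall, precision


# Illustrative data only: four terms share the root "makan", and the
# hypothetical system retrieved three terms, two of them relevant.
relevant = {"makan", "makanan", "pemakan", "dimakan"}
retrieved = ["makan", "makanan", "makna"]
print(recall_precision(retrieved, relevant))
```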

    The measure of van Rijsbergen, E, which is a weighted combination of recall and precision, is also used to calculate the retrieval effectiveness:

    \[ E = 100 \left( 1 - \frac{(1 + \beta^2) P R}{\beta^2 P + R} \right) \]

    where β, in the range of 0 to infinity, is used to reflect the relative importance of recall and precision. β = 1.0 reflects attaching equal importance to precision and recall, while β = 2.0 reflects attaching twice as much importance to recall as to precision. In our experiments we used the value β = 2.0. There is an inverse relationship between the E value and the effectiveness: the lower the E value, the more effective the retrieval.
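
    The behaviour of the E measure can be checked with a short sketch. It assumes the standard van Rijsbergen form scaled to the range 0 to 100, E = 100 * (1 - (1 + beta^2) * P * R / (beta^2 * P + R)), with P and R given as fractions, and it assumes that E takes its worst value of 100 when nothing relevant is retrieved. The name `beta` stands for the weighting parameter.

```python
def e_measure(precision, recall, beta=2.0):
    """van Rijsbergen's E on a 0..100 scale; lower values mean better retrieval."""
    if precision == 0.0 or recall == 0.0:
        return 100.0                      # assumption: worst case when R&R is zero
    b2 = beta * beta
    return 100.0 * (1.0 - (1.0 + b2) * precision * recall / (b2 * precision + recall))


# With beta = 2.0, losing recall is penalised more than losing precision.
print(e_measure(precision=0.5, recall=1.0))   # high recall, low precision
print(e_measure(precision=1.0, recall=0.5))   # high precision, low recall
```

    With beta = 1.0 the two calls would return the same value, which is the equal-importance case.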

    For each query term, the terms in the dictionary are ranked based on a similarity matching measure as specified in each experiment. The E values are calculated for cutoff points at the top 10 to 100 ranked dictionary terms, at intervals of 10. This is done for all 84 query terms, and the mean value of E at each cutoff point is calculated. The average of the mean values of E at the ten cutoff points is then calculated and used to evaluate the effectiveness of each technique. Generally, the average of the mean values of E at all the cutoff points serves as a good indicator of performance; we use the symbol aE to denote this value.
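
    The averaging procedure can be sketched as below. This is an illustrative reading of the text, not the original evaluation code: `runs` is a hypothetical list holding, for each query, the ranked dictionary terms and the set of relevant terms, and the E formula is assumed to be the standard van Rijsbergen form on a 0 to 100 scale with a weighting parameter of 2.0.

```python
CUTOFFS = range(10, 101, 10)   # top 10, 20, ..., 100 ranked terms


def e_at_cutoff(ranked, relevant, cutoff, beta=2.0):
    """E value (0..100) for the top `cutoff` ranked terms of one query."""
    top = ranked[:cutoff]
    rr = sum(1 for term in top if term in relevant)
    if rr == 0:
        return 100.0                       # assumption: worst case when R&R is zero
    p, r, b2 = rr / len(top), rr / len(relevant), beta * beta
    return 100.0 * (1.0 - (1.0 + b2) * p * r / (b2 * p + r))


def average_e(runs, cutoffs=CUTOFFS):
    """Return (aE, mean E per cutoff) over all (ranked, relevant) query runs."""
    means = [sum(e_at_cutoff(rk, rel, c) for rk, rel in runs) / len(runs)
             for c in cutoffs]
    return sum(means) / len(means), means


# The second averaging step applied to the ten mean E values reported in
# Table_7 for digrams with stemmed query and stemmed dictionary.
table7_means = [31.22, 42.35, 51.78, 58.57, 63.61, 67.35, 70.42, 73.02, 75.14, 76.95]
print(round(sum(table7_means) / len(table7_means), 2))   # 61.04
```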

    A statistical sign test is used to measure the significance of the difference between two methods under consideration, with the null hypothesis as follows:

    P[Xi > Yi] = P[Xi < Yi] = 1/2

    where Xi and Yi are the two scores for a matched pair obtained from the two methods. In applying the sign test, the direction of the difference between every Xi and Yi is determined by noting whether the sign of the difference is positive or negative. The binomial distribution with p = q = 1/2 is used to calculate the probability, p, associated with the occurrence of a particular number of positives and negatives. If the number of matched pairs, N, is larger than 35, the normal approximation to the binomial distribution is used in terms of the z value [Siegel and Castellan, 1988].
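
    A sketch of the sign test computation follows. It assumes a one-tailed test, that tied pairs are dropped before counting, and that the normal approximation uses a continuity correction; these details are not stated in the text, and the function names are hypothetical.

```python
from math import comb, erf, sqrt


def count_signs(xs, ys):
    """Count positive and negative differences over matched pairs; ties are dropped."""
    pos = sum(1 for x, y in zip(xs, ys) if x > y)
    neg = sum(1 for x, y in zip(xs, ys) if x < y)
    return pos, neg


def sign_test_p(n, k):
    """One-tailed probability of k or fewer occurrences of the rarer sign in n pairs."""
    if n > 35:
        # Normal approximation with continuity correction (assumption).
        z = (2 * k + 1 - n) / sqrt(n)
        return 0.5 * (1.0 + erf(z / sqrt(2.0)))
    # Exact binomial tail with p = q = 1/2.
    return sum(comb(n, i) for i in range(k + 1)) / 2 ** n


# N = 23 pairs with 8 negatives gives about 0.105, the value listed in
# Table_2 for digrams at cutoff 30.
print(round(sign_test_p(23, 8), 3))
```

    The exact branch relies on `math.comb`, which requires Python 3.8 or later.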

    The Results

    The commonly used similarity coefficients are Dice and Overlap coefficients. We run

    [...]

    results obtained using this approach are given in Table_7. The one using digrams performs better than the one using trigrams, with aE = 61.04 as compared to 61.35, and clearly better than the previous approach in which only the query terms are stemmed. Table_5 shows the results of the sign test, which indicate that digram matching using a stemmed query and a stemmed dictionary performs significantly better than with a stemmed query alone.

    Comparison with Stemmed-Boolean Matching

    We also ran experiments using conventional stemmed-boolean matching between the stemmed query and the stemmed dictionary to compare its effectiveness with the approach of incorporating stemming in the n-gram framework. Comparing these two approaches poses some problems: stemmed-boolean matching is based on a boolean match, whereas the latter is based on a ranking approach with a degree of similarity. E values are used to compare the performance, calculated using the following methods:

    1) using variable cutoff points based on the number of words retrieved by the stemmed-boolean match: for each query term, if the stemmed-boolean match retrieved x terms, then a cutoff point of x is taken for that query; the mean E values obtained for this method are shown in Table_8.

    2) using cutoff with weighted transformation, at cutoff points 5 and 10, on the number of words retrieved by the stemmed-boolean match: if c is the cutoff point and x is the number of words retrieved by the stemmed-boolean match, then apply the following transformation:

       if x < c then add c-x non-relevant words to the number of words retrieved; else multiply the number of words retrieved and relevant by c/x;

       the mean E values obtained for this method are given in Table_8.
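
    The weighted transformation in method 2 can be read as the following sketch. It is one interpretation of the rule, under the assumption that the comparison is "x less than c" and that in both branches the number of words retrieved becomes c; the function and parameter names are hypothetical.

```python
def weighted_transformation(x, rr, c):
    """Map a stemmed-boolean result onto a fixed cutoff c.

    x  : number of words retrieved by the stemmed-boolean match
    rr : number of those words that are relevant
    c  : cutoff point (5 or 10 in the experiments)
    Returns (words retrieved, words retrieved and relevant) after transformation.
    """
    if x < c:
        # Pad with c - x non-relevant words: retrieved grows to c, rr is unchanged.
        return c, float(rr)
    # Scale the retrieved-and-relevant count down by c / x.
    return c, rr * c / x


print(weighted_transformation(x=3, rr=2, c=5))     # fewer words than the cutoff
print(weighted_transformation(x=20, rr=8, c=10))   # more words than the cutoff
```

    Precision and recall at cutoff c can then be computed from the transformed counts in the same way as for the ranked n-gram output.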

    Table_8 shows that, using the variable cutoff points evaluation, the stemmed-boolean approach performs better than the incorporating-stemming approach. Table_9 shows that the difference in performance is significant. However, this method of evaluation is biased towards the stemmed-boolean approach.

    On the other hand, using the cutoff with weighted transformation evaluation, the incorporating-stemming approach performs better, but not significantly so, as Table_10 shows. Thus, we cannot say unequivocally which performs better. The two methods are based on two different approaches, one on boolean matching and the other on matching with an uncertainty measure, and both have their functional purposes.

    Conclusions

    From the experiments performed we can conclude that the Overlap coefficient performs better than the Dice coefficient, but not significantly, and that digram matching performs significantly better than trigram matching. The use of stemming prior to digram and trigram matching significantly enhances the performance. The application of stemming to the query terms and the dictionary performs significantly better than the application of stemming to the query terms alone, which in turn performs significantly better than matching without stemming.

    However, we cannot unequivocally conclude which is better between the conventional stemmed-boolean approach and the approach of incorporating stemming in the n-gram framework. Furthermore, the two approaches are based on different paradigms, one on boolean retrieval and the other on best-match ranking retrieval, and each has advantages and disadvantages of its own.

    References

    Belal Mustafa, A.A., Tengku Mohd. T.S., & Mohd. Yusoff. 1995. SISDOM: A Multilingual Document Retrieval System. Asian Libraries, Vol. 4, No. 3. MCB University Press Limited, England.

    Fatimah Ahmad. 1995. Satu Sistem Capaian Dokumen Bahasa Melayu: Satu Pendekatan Eksperimen Dan Analisis. Ph.D. Thesis, Universiti Kebangsaan Malaysia.

    Fatimah Ahmad, Mohammed Yusoff, & Tengku Mohd. T. Sembok. 1996. Experiments with a Malay Stemming Algorithm. Journal of the American Society for Information Science.

    Frakes, W.B. 1992. Stemming Algorithms. In Frakes, W.B. & Baeza-Yates, R. (eds.). Information Retrieval: Data Structures & Algorithms: 131-160. Englewood Cliffs: Prentice Hall.

    Marcus, R.S. 1983. An Experimental Comparison of the Effectiveness of Computers and Humans as Search Intermediaries. Journal of the American Society for Information Science 34(6): 381-404.

    Porter, M.F. 1980. An Algorithm for Suffix Stripping. Program 14: 130-137.

    Salton, G. & McGill, M.J. 1983. Introduction to Modern Information Retrieval. New York: McGraw-Hill.

    Table_1: Mean values at cutoff points to assess similarity coefficients performance

                                      Cutoff                                                                Cutoff
    Measure        n-gram    Coeff.   10     20     30     40     50     60     70     80     90     100    mean
    Number of R&R  Digrams   Dice     2.85   3.40   3.67   4.04   4.23   4.42   4.53   4.63   4.71   4.85   4.13
                             Overlap  2.89   3.57   3.90   4.17   4.39   4.50   4.65   4.81   4.91   5.06   4.28
                   Trigrams  Dice     2.78   3.22   3.53   3.65   3.75   3.90   3.94   4.06   4.08   4.11   3.70
                             Overlap  2.84   3.36   3.54   3.72   3.82   3.94   4.00   4.08   4.17   4.26   3.77
    E              Digrams   Dice     57.66  64.15  68.47  71.48  73.87  76.29  78.08  79.69  81.05  82.01  73.27
                             Overlap  56.85  62.25  67.35  70.75  73.44  75.95  77.78  79.25  80.60  81.58  72.58
                             Fusion   55.85  62.82  67.56  71.21  73.82  76.13  78.11  79.15  80.24  81.52  72.64
                   Trigrams  Dice     58.72  65.85  69.96  73.81  76.70  78.61  80.62  81.96  83.37  84.54  75.41
                             Overlap  58.03  64.33  70.03  73.62  76.52  78.58  80.48  81.93  83.11  84.14  75.07

    Table_2: Sign test to determine whether Overlap coefficient performs significantly better than Dice coefficient.

    Cutoff =                            10    20    30    40    50    60    70    80    90    100
    Digrams   N                         27    27    23    29    23    22    22    26    26    23
              negatives                 13    8     8     12    9     9     10    13    12    10
              p                         .500  .061  .105  .229  .202  .262  .416  .577  .423  .339
              significant (α = 0.05)    no    yes   no    no    no    no    no    no    no    no
    Trigrams  N                         18    15    11    13    16    15    15    16    13    12
              negatives                 6     3     5     6     7     7     6     8     5     3
              p                         .119  .018  .500  .500  .400  .500  .304  .598  .291  .073
              significant (α = 0.05)    no    yes   no    no    no    no    no    no    no    no

    Table_3: Sign test to determine whether digrams performs significantly better than trigrams (overlap coefficient is used).

    Cutoff =                  10    20    30    40    50    60    70    80    90    100
    N                         26    25    29    32    25    26    30    30    30    25
    negatives                 12    6     5     6     0     2     3     5     6     3
    p                         .423  .007  .001  .000  .000  .000  .000  .000  .001  .000
    significant (α = 0.05)    no    yes   yes   yes   yes   yes   yes   yes   yes   yes

    Table_4: Sign test to determine whether digrams with stemmed-query performs significantly better than digrams with non-stemmed query (overlap coefficient is used).

    Cutoff =                  10    20    30    40    50    60    70    80    90    100
    N                         34    30    32    30    30    32    31    31    31    29
    negatives                 2     2     3     2     2     2     2     2     2     1
    p                         .000  .000  .000  .000  .000  .000  .000  .000  .000  .000
    significant (α = 0.05)    yes   yes   yes   yes   yes   yes   yes   yes   yes   yes

    Table_5: Sign test to determine whether digrams with stemmed-query and stemmed-dictionary performs significantly better than digrams with stemmed-query only (overlap coefficient is used).

    Cutoff =                  10    20    30    40    50    60    70    80    90    100
    N                         45    29    22    19    15    14    14    13    12    12
    negatives                 5     3     2     3     2     1     1     2     1     1
    p or z                    5.06  .000  .000  .002  .004  .001  .001  .011  .003  .003
    significant (α = 0.05)    yes   yes   yes   yes   yes   yes   yes   yes   yes   yes

    Table_6: Mean values at cutoff points to assess the fusion between digrams and trigrams against Digrams only (overlap coefficient is used).

                             Cutoff                                                                Cutoff
    Measure        n-gram    10     20     30     40     50     60     70     80     90     100    mean
    Number of R&R  Digrams   2.89   3.57   3.90   4.17   4.39   4.50   4.65   4.81   4.91   5.06   4.28
                   Fusion    3.06   3.57   3.78   4.03   4.22   4.38   4.53   4.72   4.91   4.97   4.21
    E              Digrams   56.85  62.25  67.35  70.75  73.44  75.95  77.78  79.25  80.60  81.58  72.58
                   Fusion    54.97  62.59  68.20  71.62  74.25  76.50  78.32  79.54  80.53  81.79  72.83

    Table_7: Mean values at cutoff points when stemming is incorporated.

                             Stemmed  Stemmed  Cutoff                                                                Cutoff
    Measure        n-gram    query    dictn    10     20     30     40     50     60     70     80     90     100    mean
    Number of R&R  Digrams   yes      no       4.19   5.11   5.45   5.70   5.82   5.94   6.00   6.09   6.10   6.15   15.1
                   Trigrams  yes      no       4.00   4.64   4.92   5.28   5.48   5.57   5.66   5.70   5.83   5.86   14.1
                   Digrams   yes      yes      5.26   6.07   6.21   6.25   6.27   6.32   6.35   6.35   6.36   6.38   6.18
                   Trigrams  yes      yes      5.26   6.02   6.13   6.14   6.17   6.26   6.28   6.28   6.28   6.28   6.11
                   Fusion    yes      yes      5.27   6.07   6.20   6.23   6.29   6.32   6.34   6.35   6.35   6.36   6.18
    E              Digrams   yes      no       42.70  49.80  56.80  61.81  65.99  69.11  71.92  74.10  76.15  77.73  64.6
                   Trigrams  yes      no       45.05  53.14  60.05  63.91  67.49  70.69  73.19  75.44  76.97  78.59  66.4
                   Digrams   yes      yes      31.22  42.35  51.78  58.57  63.61  67.35  70.42  73.02  75.14  76.95  61.0
                   Trigrams  yes      yes      31.34  42.80  52.22  59.07  63.97  67.57  70.65  73.24  75.40  77.23  61.3
                   Fusion    yes      yes      31.01  42.33  51.83  58.62  63.48  67.35  70.46  73.02  75.19  76.99  61.0

    Table_8: The mean values to compare the stemmed-boolean and the incorporating-stemming approaches.

                                                 Stemmed  Stemmed  Variable  Cutoff with weighted transformation c/x
    Measure                   Approach           query    dictn    cutoff    5        10
    No. retrieved and         Digrams (overlap)  yes      yes      5.44      3.46     5.26
    relevant                  Stemmed-boolean    yes      yes      5.60      3.48     5.15
    E                         Digrams (overlap)  yes      yes      13.43     36.94    31.22
                              Stemmed-boolean    yes      yes      8.19      37.13    33.24

    Table_9: Sign test using variable cutoff to determine whether stemmed-boolean approach performs significantly better than incorporating-stemming approach (overlap coefficient is used).

    Cutoff =                  Variable
    N                         12
    negatives                 0
    p                         .000
    significant (α = 0.05)    yes

    Table_10: Sign test using cutoff with weighted transformation to determine whether incorporating-stemming approach performs significantly better than stemmed-boolean approach (overlap coefficient is used).

    Cutoff =                  5     10
    N                         11    16
    negatives                 3     6
    p                         .113  .227
    significant (α = 0.05)    no    no