[ieee 2010 international conference on computer and information application (iccia) - tianjin, china...


Page 1: [IEEE 2010 International Conference on Computer and Information Application (ICCIA) - Tianjin, China (2010.12.3-2010.12.5)] 2010 International Conference on Computer and Information

The Effects of Dictionary and Lexicon to Morphological Analysis

Mohd Yunus Sharum

Fac. of Computer Science and Info. Tech., Universiti Putra Malaysia

43400 UPM Serdang, Selangor, Malaysia

[email protected]

Abstract—A dictionary or lexicon is one of the key components of morphological analyzers, except for purely rule-based tools. However, only a few studies have evaluated the effects of using a dictionary or lexicon for Malay morphological analysis. In this paper, we report the results of such an evaluation on our morphological analyzer, MALIM, based on an auto-expansion simulation. We found that the lexicon improves the performance of the analyzer in terms of correctness as well as the quality of the output.

Keywords-Malay language, MALIM, Morphological analyzer, Dictionary, Lexicon

I. INTRODUCTION

Morphological analysis is one of the key processes in natural language processing (NLP). It is regularly performed as part of stemming, lemmatization and tagging. Malay is an Austronesian language and the superclass of the Malaysian, Indonesian and Brunei languages. Several Malay morphological analyzers have been developed or improved for Malaysian and Indonesian, including [1], [2], [3], [4], [5], [6], [7], [8], [9], [10] and others. However, none has been reported for Brunei.

We developed a morphological analyzer named MALIM, which implements a new algorithm for Malay morphological analysis called SAPI. MALIM (Morphological Analyzer for Linguistic Indecision of Malay) [11] was developed for Malay word glossing and tagging. It is implemented in Perl and can so far handle most, but not all, word types in Malay, particularly affixation and reduplication. One of its components is a lexicon, which is used as the reference for verifying valid root words. Using a lexicon is one of the established approaches in Malay stemming, and it is also required by the SAPI algorithm used by MALIM [12]. In general, SAPI exhaustively produces all possible results. To ensure that all of the results are valid, they are checked against the lexicon. The algorithm, which is described in detail in our earlier paper [12], proceeds as follows:

• Step 1 - For every combination of suffix and prefix, check whether a circumfix pair exists in w. If so, remove the circumfix (without morphonological alternation) to obtain w' and go to Step 2. If no circumfix matches after all circumfixes have been tested, go to Step 4.

• Step 2 - Look up w' in the lexicon. If it exists, store all related information about w (e.g. stem, affixes, etc.) and go to Step 3. If not, repeat Step 1 with w' as w.

• Step 3 - For every default prefix, check for a circumfix with the default prefix, and then with morphonological alternation, producing w''. Look up w'' in the lexicon. If it exists, store all related information about w'. Repeat Step 1 with w'' as w.

• Step 4 - For every suffix, check whether the suffix occurs in w. If so, remove the suffix and go to Step 5. If no suffix matches after all suffixes have been tested, go to Step 6.

• Step 5 - If the suffix is ‘-an’, apply the morphonological alternation rule [b:p] to w if it is either ‘jawap’ or ‘wajip’. Then look up w' in the lexicon. If it exists, store all related information about w. Repeat Step 4 with w' as w.

• Step 6 - For every prefix, check whether the prefix occurs in w. If so, remove the prefix and go to Step 7. If no prefix matches after all prefixes have been tested, end.

• Step 7 - Look up w' in the lexicon. If it exists, store all related information about w and go to Step 8. If not, repeat Step 6 with w' as w.

• Step 8 - For every default prefix, check for the default prefix, and then with morphonological alternation, producing w''. Look up w'' in the lexicon. If it exists, store all related information about w'. Repeat Step 6 with w'' as w.
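The central idea of the steps above, stripping affixes exhaustively and using the lexicon only to validate candidate stems, can be illustrated with a much-reduced sketch. This is not the SAPI algorithm itself: it omits circumfix handling, default prefixes and morphonological alternation, and the affix lists, the function name `analyze` and the example words are assumptions made purely for illustration.

```python
from typing import List, Set, Tuple

# Hypothetical, heavily reduced affix inventories (not MALIM's actual lists).
PREFIXES = ("mem", "me", "ber", "di", "ter", "pe")
SUFFIXES = ("kan", "an", "i")

# One analysis = (stripped prefixes, stem, stripped suffixes)
Analysis = Tuple[Tuple[str, ...], str, Tuple[str, ...]]


def analyze(word: str, lexicon: Set[str]) -> List[Analysis]:
    """Return every affix split of `word` whose stem is found in `lexicon`."""
    results = set()
    seen = set()

    def visit(w, pre, suf):
        state = (w, pre, suf)
        if state in seen:          # avoid re-exploring the same split
            return
        seen.add(state)
        if w in lexicon:
            results.add(state[1:2] + (w,) + state[2:3])  # (pre, stem, suf)
            # Deliberately no early return: keep searching for other readings.
        for p in PREFIXES:
            if w.startswith(p) and len(w) > len(p) + 1:
                visit(w[len(p):], pre + (p,), suf)
        for s in SUFFIXES:
            if w.endswith(s) and len(w) > len(s) + 1:
                visit(w[:-len(s)], pre, (s,) + suf)

    visit(word, (), ())
    return sorted(results)


# Illustrative ambiguity: "berikan" read as beri + -kan or as ber- + ikan.
print(analyze("berikan", {"beri", "ikan"}))
```

With both `beri` and `ikan` in the toy lexicon, the call returns two analyses, which corresponds to what the paper calls an ambiguous output; with neither present it returns an empty list, which corresponds to a rejected word.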

In general, a dictionary or lexicon is one of the key components of most existing morphological analyzers, and of MALIM in particular. This does not apply, however, to purely rule-based tools. A dictionary differs from a lexicon in that the former is more general and contains more lexicographical information than the latter. They nevertheless share a common role: providing a complete list of lexemes that serves as the reference and source of information for validating a ‘lemma’ produced by the analysis. In this paper, we report a study conducted to evaluate the effect of a dictionary and lexicon on the performance of a morphological analyzer. The study provides empirical evidence of their role in morphological analysis. Although a lexicon is obviously useful in such an operation, we want to investigate the actual aspects of its influence, which might not be anticipated in normal operation.

The next section describes related work, Section III presents the evaluation process and the results, and Section IV gives the discussion and conclusion.

International Conference on Computer and Information Application (ICCIA 2010)

978-1-4244-8598-7/10/$26.00 ©2010 IEEE

II. RELATED WORK

Regarding the use of a dictionary, Ahmad suggests that stemming can only perform well with a dictionary [10]. The issue with using a dictionary is that it must always be updated to include new root words. Otherwise, a new root word will not be recognized by the analyzer. In an attempt to avoid or reduce this additional task, a purely rule-based analyzer was developed [6]. However, a comparison of Porter's approach (rule-based) with Nazief's (which uses a lexicon) still shows that the lexicon-based analyzer performs much better for Malay.

Asian et al. [5] studied the effects of using a dictionary on Nazief's stemming algorithm for information retrieval (IR). The test used a corpus consisting of a collection of daily electronic newspapers. Of the 3986 words used, only 1807 are unique. Several versions of the dictionary were used, and the results failed to show a significant improvement in the analysis. The ‘failure’ was attributed to three factors: (i) the dictionary contains ‘basic’ words, which force the stemming to stop; (ii) not all stemmed words require a dictionary (i.e. root words and proper nouns); and (iii) the dictionary includes rare words, which decrease performance (i.e. cause wrong stemming).

Of these, factor (ii) is perhaps the most acceptable reason to reconsider the importance of a dictionary for stemming. However, it is equally a reason for not using a stemmer at all: if the analyzed words do not require stemming, what is the purpose of a stemmer? An evaluation of a stemmer should therefore use test data that requires extensive stemming. Otherwise, the test data is insufficient for evaluating the importance of a dictionary or lexicon, because the stemming process is not required in the first place.

Using a dictionary is problematic for algorithms that implement an approach we call lexicon-based priority analysis. In this approach, the analysis stops as soon as it produces a lexeme that exists in the dictionary. Traditionally, most Malay stemmers use this approach, and this is what causes factor (i). The problem appears when a word has multiple possible results. Since a dictionary or lexicon organizes lexemes in alphabetical order (e.g. memerang/beaver begins with ‘m’ and therefore comes before perang/war, which begins with ‘p’), the lexeme that comes earlier in alphabetical order is usually encountered first. Even when the expected lexeme is the one that comes later, the earlier one still wins, and in such cases the analysis never produces the correct result.
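The stopping behaviour described here can be shown with a small sketch. The function `first_hit_stem` is hypothetical and does not reproduce any particular published stemmer; it only illustrates that, once the analysis stops at the first lexicon hit, whichever candidate happens to be checked first hides all the others.

```python
def first_hit_stem(word, lexicon, prefixes=("ber",), suffixes=("kan",)):
    """Lexicon-priority sketch: return the first candidate found in the lexicon."""
    if word in lexicon:
        return word                      # stops here, affixed readings are never tried
    for p in prefixes:
        if word.startswith(p) and word[len(p):] in lexicon:
            return word[len(p):]         # first prefix hit wins
    for s in suffixes:
        if word.endswith(s) and word[:-len(s)] in lexicon:
            return word[:-len(s)]        # reached only if no prefix reading matched
    return None                          # word is rejected


# "memerang" is itself listed (beaver), so a derivation from "perang" is never explored.
print(first_hit_stem("memerang", {"memerang", "perang"}))
# Both "beri" and "ikan" are listed, but only the reading checked first is returned.
print(first_hit_stem("berikan", {"beri", "ikan"}))
```

In this toy setup the second call returns only `ikan`, because the prefix check comes first; a caller has no way to learn that `beri` was also a valid reading.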

Factor (iii) is to be expected, since the data set is small and based on common usage. We assume that most of the words in the collection are common words. For information retrieval the test data may be relevant, as it reflects the real situation. It is not sufficient, however, to establish the usefulness of a dictionary for a stemmer. A dictionary generally aims to describe the lexicographical aspects of a language exhaustively. It should therefore contain frequently used words as well as rare words, and the latter may outnumber the former. The rare words in the dictionary most probably never occurred in the test data, and this influenced the result. In the end, the use of a dictionary in that experiment did not improve stemming for IR and even degraded the result.

Moreover, the study does not show the effect of using a lexicon, which is more specific in purpose than a dictionary.

III. EVALUATION ON MORPHOLOGICAL ANALYZER

The study by Asian et al. showed that a dictionary might not be very relevant to stemming for IR. However, we found that we could not avoid using at least a lexicon, if not a dictionary, in our stemming process. MALIM therefore uses the lexicon extensively during analysis. Under SAPI, the lexicon is used to verify stemmed words, but it is not relied upon exclusively. Since the lexicon appears to be important to the algorithm, we need to verify how important it is.

We performed the evaluation by simulating the performance of our analyzer with lexicons of varying size, obtained by incrementally expanding the lexicon. The simulation shows how the quality of the analysis results changes as the lexicon changes in size. The sample dataset contains 14319 word-root pairs derived from [13]. From these pairs, 5705 unique roots were collected. The ‘word’ part of each pair is used as the input dataset, while the full pair is used to evaluate the result.

The simulation starts with a small lexicon of 500 roots, picked at random from the set of 5705 root words. MALIM then analyzes the input dataset using this lexicon, and the results are evaluated. Each subsequent test run expands the lexicon by another 500 randomly picked roots and repeats the same analysis on the complete dataset. The simulation finishes when all roots have been inserted into the lexicon. Fig. 1 shows the performance of MALIM under lexicon expansion. The evaluation procedure from [11] is used to evaluate the results.
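The expansion procedure can be sketched as follows. The names `expansion_schedule`, `tally` and `lookup_only` are illustrative assumptions, and `lookup_only` is only a placeholder standing in for the analyzer; MALIM itself and the evaluation procedure from [11] are not reproduced here.

```python
import random


def expansion_schedule(roots, step=500, seed=0):
    """Yield growing lexicons, adding `step` randomly chosen roots per run."""
    pool = sorted(roots)                 # fixed order so a given seed is reproducible
    random.Random(seed).shuffle(pool)    # random insertion order
    for end in range(step, len(pool) + step, step):
        yield set(pool[:end])            # the final run contains every root


def tally(words, analyze, lexicon):
    """Count words by number of analyses: 0 = reject, 1 = single, more = ambiguous."""
    counts = {"single": 0, "ambiguous": 0, "reject": 0}
    for word in words:
        n = len(analyze(word, lexicon))
        counts["reject" if n == 0 else "single" if n == 1 else "ambiguous"] += 1
    return counts


def lookup_only(word, lexicon):
    """Placeholder analyzer (an assumption): accept a word only if it is listed."""
    return [word] if word in lexicon else []


# Lexicon sizes for 5705 roots expanded in steps of 500.
sizes = [len(lex) for lex in expansion_schedule(range(5705))]
print(sizes)
```

Under these assumptions, 5705 roots and a step of 500 give twelve test runs: eleven lexicons of 500 to 5500 roots, and a final one holding all 5705. Each lexicon is a superset of the previous one, so a run can be compared directly with the run before it.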

[Figure 1: chart. x-axis: Lexicon size (words), 0 to 5705; y-axis: Percentage of words (%), 0 to 120; series: Single, Ambiguous, Reject.]

Figure 1. The performance of MALIM in effect of lexicon expansion.

One of the key objectives of the SAPI algorithm is to handle ambiguity in the morphological analysis of Malay. Ambiguity cannot be avoided in NLP, and the problem exists in most languages. Some ambiguities can be resolved at the morphological level, while others cannot. The unresolvable ambiguities yield multiple analyses, which MALIM reports as ambiguous output. A single output is a resolved (but not necessarily correct) analysis.

Our results show that, for our analyzer, lexicon expansion does influence the analysis. When the lexicon is too small the rejection rate is high, and as the lexicon grows the rejection rate falls. At the same time, the number of results (single and ambiguous) gradually increases as more words are included in the lexicon. This is to be expected, as MALIM only recognizes a lexeme if it exists in the lexicon. Otherwise, the analyzed word is rejected and left unchanged. We also found that the number of ambiguous results increases only slowly from one test run to the next. This is because the proportion of ambiguous words in the complete sample is already low, so the probability of an ambiguity being generated in any test run is also low.

The error counts, however, appear to vary irregularly with the expansion. Fig. 2 shows the errors produced in each result category, single result and ambiguous result. At first sight the errors might seem unaffected by the expansion, since they do not follow the same declining pattern as the rejections. After a careful analysis of the data, however, we found that the errors are in fact influenced by the expansion. For example, an expansion can turn a previously rejected word into a correct or an incorrect analysis, while a previously incorrect or correct analysis may become an ambiguous analysis. However, neither a correct analysis nor a correct ambiguous analysis becomes incorrect with the expansion.
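One way to inspect such changes is to label each word's outcome in two consecutive test runs and count the transitions. The sketch below is illustrative only: the outcome labels, the function names and the rule that an ambiguous result counts as correct when the expected root is among its candidates are all assumptions, since the evaluation procedure of [11] is not restated in this paper.

```python
from collections import Counter


def outcome(candidates, gold):
    """Label one word's result given its candidate roots and the expected root."""
    if not candidates:
        return "reject"
    kind = "single" if len(candidates) == 1 else "ambiguous"
    # Assumption: an ambiguous result is correct if it contains the expected root.
    return f"{kind}-{'correct' if gold in candidates else 'incorrect'}"


def transitions(before, after):
    """Count (old outcome, new outcome) pairs for words whose outcome changed."""
    return Counter((before[w], after[w]) for w in before if before[w] != after[w])


# Toy data: candidate roots per word before and after "beri" and "makan" are added.
gold = {"berikan": "beri", "makanan": "makan"}
run1 = {"berikan": {"ikan"}, "makanan": set()}
run2 = {"berikan": {"beri", "ikan"}, "makanan": {"makan"}}

before = {w: outcome(run1[w], gold[w]) for w in gold}
after = {w: outcome(run2[w], gold[w]) for w in gold}
print(transitions(before, after))
```

In this toy example one word moves from reject to single-correct and the other from single-incorrect to ambiguous-correct, which are two of the kinds of change described above. A transition from a correct label to an incorrect one would contradict the reported observation.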

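[Figure 2: chart. x-axis: Lexicon size (words), 500 to 5705; y-axis: Number of errors (words), 0 to 150; series: Single result, Ambiguous result.]

Figure 2. The number of errors for each result type in effect of lexicon expansion.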

IV. DISCUSSION AND FUTURE WORK

From the results, we found that expanding the lexicon affects not only the acceptance and rejection of a word during analysis, but also the correctness and accuracy of the analysis. Although the result is affected by the lexicon size, our analyzer is not constrained by its use of the lexicon. As explained earlier, most analyzers that use a dictionary or lexicon apply the lexicon-based priority approach. We developed a new algorithm that departs from this approach. The algorithm, named SAPI, implements an affix-based priority approach, which avoids relying solely on the dictionary or lexicon for analysis. The analysis does not stop when a stem is found in the dictionary or lexicon. It stops only when all existing affixes have been analyzed. Thus, when there are multiple possible results, the analysis is not restricted by the alphabetical ordering used by the dictionary or lexicon.

As the lexicon grows through the addition of new root words, the result eventually improves. An important aspect is that the analysis improves and does not degrade as the lexicon expands. This property is crucial, since it ensures that each time we update the lexicon by adding new words, the previous performance does not degrade. Updating is necessary because, as new words are formed or found, they must be included in the lexicon so that the analyzer can recognize them.

In conclusion, we claim that a dictionary or lexicon is important in morphological analysis and does not necessarily degrade the performance of the analyzer. The effectiveness of an analyzer may be low when a small lexicon is used. But as the lexicon expands, the analysis improves, as more words are recognized and analyzed correctly. The importance of this experiment is that it shows that MALIM's performance is not degraded by the expansion, but in fact improves as we update the lexicon.

In the future, we expect to determine the best way to grow and expand the lexicon. A plausible approach is to insert high-frequency words and leave out rare words. This would avoid ambiguous output between common and rare words. It could also avoid the incorrect analyses that arise when common words are missing from the lexicon.

ACKNOWLEDGMENT

This work would not have been accomplished without funding from Universiti Putra Malaysia and the Ministry of Higher Education of Malaysia. The author would also like to thank Muhamad Taufik Abdullah, Md Nasir Sulaiman, Masrah Azrifah Azmi Murad and Zaitul Azma Zainon Hamzah for their invaluable comments and feedback. Thanks also go to the anonymous reviewers for their comments and feedback on the earlier version of this paper.

REFERENCES

[1] F. Pisceldo, R. Mahendra, R. Manurung, and I.W. Arka, "A Two-Level Morphological Analyser for the Indonesian Language," Proc. 2008 Australasian Language Technology Association Workshop (ALTA 2008), Hobart, Australia, pp. 142-150, 2008.

[2] M. Adriani, J. Asian, B. Nazief, S.M. Tahaghoghi, and H.E. Williams, "Stemming Indonesian: A Confix-stripping Approach," ACM Transactions on Asian Language Information Processing (TALIP), vol. 4, pp. 1-33, 2007.

[3] M.T. Abdullah, “Monolingual and Cross-language Information Retrieval Approaches for Malay and English Language Documents,” (Universiti Putra Malaysia, Serdang, Malaysia), unpublished.

[4] T. Baldwin, and S. Awab, "Open Source Corpus Analysis Tools for Malay," Proc. 5th International Conference on Language Resources and Evaluation, Genoa, Italy, pp. 2212-2215, 2006.

[5] J. Asian, H.E. Williams, and S.M. Tahaghoghi, "Stemming Indonesian," Proc. Twenty-Eighth Australasian Conference on Computer Science Volume 38 (Newcastle, Australia). V. Estivill-Castro, Ed. ACSC, vol. 102, pp. 307-314. Australian Computer Society, Darlinghurst, Australia, 2005.

[6] F. Tala, “A Study of Stemming Effects on Information Retrieval in Bahasa Indonesia,” Dissertation (University of Amsterdam), unpublished.

[7] R.M. Bali, "Computational Analysis of Affixed Words in Malay Language," 8th International Symposium on Malay/Indonesian Linguistics (ISMIL8), Penang Malaysia, 2004.

[8] N. Idris, “Automated Essay Grading System Using Nearest Neighbour Technique In Information Retrieval,” Dissertation (University of Malaya), unpublished.

[9] K.R. Beesley, and L. Karttunen, "Finite-state non-concatenative morphotactics," Proc. 38th Annual Meeting on Association For Computational Linguistics (Hong Kong, October 03 - 06, 2000), Annual Meeting of the ACL - Association for Computational Linguistics. Morristown, NJ, pp. 191-198, October 2000.

[10] F. Ahmad, “A Malay Language Document Retrieval System: An Experimental Approach and Analysis,” Dissertation (Universiti Kebangsaan Malaysia, Bangi, Malaysia), unpublished.

[11] M.Y. Sharum, M.T. Abdullah, M.N. Sulaiman, M.A. Azmi Murad, and Z.A. Zainon Hamzah, "MALIM - A New Computational Approach of Malay Morphology," Proc. 4th International Symposium on Information Technology (ITSim), KL, Volume 2, pp. 837-843, June 2010. Available: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5561561&tag=1

[12] M.Y. Sharum, M.T. Abdullah, M.N. Sulaiman, M.A. Azmi Murad, and Z.A. Zainon Hamzah, "SAPI - New Algorithm for Malay Morphological Analyzer and Stemmer," Proc. Artificial Intelligence Workshops 2010 (AIW2010) - Extraction of Structured Information from Texts in the Biomedical Domain (ESIT-BioMed 2010), Sarawak, Malaysia, pp. 67-82, July 2010.

[13] M. Taharin, R. Ja'afar, and N. Abd. Shukor, (eds.) Tesaurus Bahasa Melayu Dewan - Edisi Baharu, KL, Malaysia: Dewan Bahasa dan Pustaka, 2005.
