Post on 15-Dec-2015

  • Slide 1

A Comparable Corpus Driven, Multivariate Approach to Light Verb Variations in World Chineses
Jingxia LIN 2, Menghan JIANG 1, and Chu-Ren HUANG 1
1 The Hong Kong Polytechnic University, 2 Nanyang Technological University

  • Slide 2

Light verbs in Chinese
Similar to English light verbs: take a rest, give advice, give a description
Semantically bleached: containing no eventive information
The predicative content mainly comes from the complement it takes: jin4xing2 tao3lun4 'have a discussion'
Being semantically bleached, they do not strongly select their objects
They can take a wide range of objects, including deverbal nouns, eventive nouns, and sometimes concrete numbers with eventive meaning
They are sometimes interchangeable with the same nominal object

  • Slide 3

Underspecified Selectional Restriction of Chinese Light Verbs
cong2shi4, gao3, jia1yi3, jin4xing2, zuo4 are among the most frequently used (and most typical) light verbs in Modern Chinese
The use of these five light verbs is sometimes interchangeable
cong2shi4/gao3/jia1yi3/jin4xing2/zuo4 yan2jiu1 'to do research'

  • Slide 4

Underspecified Selectional Restriction of Chinese Light Verbs II
Collocation constraints are sometimes found with these light verbs, e.g.,
jin4xing2/*jia1yi3/*cong2shi4/gao3/*zuo4 bi3sai4 'play a game'
*jin4xing2/jia1yi3/*cong2shi4/*gao3/*zuo4 kao3lv4 'give consideration'

  • Slide 5

Variations of Light Verb Usages in Mainland and Taiwan Mandarin Variants
Even with the very limited collocation constraints, variations still exist:
Taiwan light verbs tend to take more types of NPs and even VPs as their complements
jin4xing2 gan3en1zhi1lv3/jun1zi3zhi1zheng1 'to proceed with a thanksgiving trip/gentlemen's dispute'
jin4xing2 mo3hei1/kai1piao4 'to proceed with mud-slinging/ballot counting' (Huang et al.
2013)

  • Slide 6

Theoretical Challenges for Corpus-based Studies of Chinese Light Verbs
Can distribution-based statistical analysis identify the differences among different Chinese light verbs?
The contrasts among the light verbs are often tendencies rather than grammaticality dichotomies; hence the distributional patterns are less prominent and harder to characterize
Can the subtle light verb variations between different variants of Chinese be identified through statistical analysis based on comparable corpora (cf. Huang et al. 2013)?

  • Slide 7

Main Research Questions
Facing the above challenges, we try to resolve the following four research questions:
Can light verbs be differentiated from each other by statistical methods?
Can the grammatical differences between variants of the same language be empirically verified by distributional features?
Are these differences statistically significant?
If the answers to both questions are yes, how do they differ statistically from each other? That is, which is more prominent: the distributional difference between two different light verbs, or that between two variants of the same light verb?

  • Slide 8

Methodology
A comparable-corpus-driven statistical approach
jia1yi3, jin4xing2, cong2shi4, gao3, zuo4 in Mainland Mandarin and Taiwan Mandarin
Statistical methods and tools
Univariate analysis + multivariate analysis
Polytomous package in R (Arppe 2008)

  • Slide 9

Data
Chinese Gigaword corpus (over 1.1 billion Chinese words)
Central News Agency (Taiwan, about 700 million characters)
Xinhua News Agency (Mainland China, about 400 million characters)
Random sample: 200 sentences for each of the five light verbs in the Mainland and Taiwan corpora
1,000 in total for Mainland Chinese
1,000 in total for Taiwan Chinese

  • Slide 10

12 factors (e.g. Zhu 1985, Zhou 1987, Cai 1982, Huang et al.
1995, among others)
Factor (code): example 'gloss'; value levels
Co-occurs with other light verbs (OTHERLV): kai1shi3/jin4xing2/bi3sai4 'start the game'; yes, no
Takes aspectual marker (ASP): zuo2tian1/jin4xing2/le0/bi3sai4 'played the game yesterday'; no, le, zhe, guo
Event complement is at subject position (EVECOMP): bi3sai4/zai4/xue2xiao4/jin4xing2 'play the game at school'; yes, no

  • Slide 11

Part of speech of the complement (POS): N jin4xing2/bi3sai4 'play the game', V jin4xing2/zhan4dou4 'fight the battle'; N, V
Argument structure (ARGSTR): two, jin4xing2/diao4cha2 'carry on investigation'; one, two, zero
VO compound as argument (VOCOMP): jin4xing2/tou2piao4 'carry on voting'; yes, no

  • Slide 12

Spontaneous/controllable event (SPONTEVT): jin4xing2/tou2piao4 'carry on voting'; yes, no
Durative event (DUREVT): jin4xing2/bi3sai4 'play a game'; yes, no
Formal event (FOREVT): jin4xing2/fang3wen4 'pay an official visit'; yes, no
Psychological activity (PSYEVT): jia1yi3/kao3lv4 'give consideration'; yes, no
Event involving interaction of agent and patient (INTEREVT): jia1yi3/gou1tong1 (lit. 'inflict/communicate') 'do communication'; yes, no
Accomplishment complement (ACCOMPEVT): jin4xing2/xiu1zheng4 (lit. 'proceed/correct') 'make corrections/amendments'; yes, no

  • Slide 13

Mainland Chinese: An overall look of the factors
> str(MLLV3)
'data.frame': 1000 obs. of 13 variables:
 $ LV : Factor w/ 5 levels "congshi","gao",..: 1 1 1 1 1 1 1 1 1 1...
 $ POS : Factor w/ 2 levels "N","V": 2 2 2 2 1 1 2 2 2 2...
 $ ARGSTR : Factor w/ 3 levels "one","two","zero": 1 1 2 1 3 3 2 1 1 1...
 $ VOCOMP : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1...
 $ EVECOMP : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1...
 $ OTHERLV : Factor w/ 1 level "no": 1 1 1 1 1 1 1 1 1 1...
 $ ASP : Factor w/ 4 levels "guo","le","no",..: 3 3 3 3 3 3 3 3 3 3...
 $ SPONTEVT : Factor w/ 1 level "yes": 1 1 1 1 1 1 1 1 1 1...
 $ DUREVT : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2...
 $ FOREVT : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2...
 $ PSYEVT : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1...
 $ INTEREVT : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1...
 $ ACCOMPEVT: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1...
Among the 12 independent variables, two have only one level
OTHERLV: occurrence of the dependent variable (light verbs) with another light verb
None of the five light verbs (1000 sentences) co-occurs with another light verb
SPONTEVT: with spontaneous events as the complement to light verbs
All five light verbs (1000 sentences) take spontaneous events as their complements
The two factors are not effective in distinguishing the five light verbs, and are thus excluded from further statistical analysis

  • Slide 14

Univariate analysis of Chinese light verbs
Chi-squared tests for the significance of the co-occurrence of the factor with individual light verbs
The chisq.posthoc() function in the Polytomous package automatically transforms the results (standardized Pearson residuals e_ij (Agresti 2002)) into signs
+: e_ij > 2, statistically significant overuse of the light verb with the factor
-: e_ij < -2, statistically significant underuse of the light verb with the factor
0: e_ij in [-2, 2], lack of statistical significance

  • Slide 15

Mainland Chinese: a univariate analysis
Four features show no significance (p-value […]

Main Results of Polytomous for Mainland Chinese
odds>1: the chance of the occurrence of a light verb is significantly increased by the feature (marked in orange)
odds […] 0.05) are given in parentheses

  • Slide 19

Distributional Contrasts Can Differentiate Light Verb Pairs
Most pairs of light verbs can be effectively differentiated by one or more factors (i.e.
those where they have contrasting positive/negative tendencies to appear)
congshi/gao: ARGSTRtwo
congshi/jiayi: ARGSTRtwo
congshi/jinxing: INTEREVTyes
gao/jiayi: ACCOMPEVTyes
gao/zuo: ARGSTRtwo/ARGSTRzero
jiayi/jinxing: ACCOMPEVTyes
jiayi/zuo: ARGSTRtwo
jinxing/zuo: INTEREVTyes
Only two pairs are without contrasting significant features: congshi/zuo and gao/jinxing

  • Slide 20

A probability model is adopted to predict the identity of the light verb at its position of occurrence.
The overall performance of the model is good: the most frequently predicted light verb of each column corresponds to the light verb that actually occurs in the data (see the red figures)
PROBABILITY OF OCCURRENCE OF LIGHT VERBS

  • Slide 21

F-score of Automatic Identification of Five Light Verbs Based on Mainland Mandarin Data
          recall   precision   F-score
congshi   0.655    0.4645      0.5436
gao       0.08     0.5         0.1379
jiayi     0.96     0.4455      0.6086
jinxing   0.31     0.6966      0.4291
zuo       0.485    0.5843      0.5300

  • Slide 22

Each light verb can be successfully identified with a better F-score than chance (0.2), with the exception of gao3, while the performance varies from light verb to light verb
jia1yi3 > cong2shi4/zuo4 > jin4xing2 > gao3
jia1yi3 is the only light verb with effective differentiating factors against all other light verbs. All four significant factors are positive (i.e. direct evidence for its occurrence).
cong2shi4/zuo4: Both have only one type of significant factor, and these are negative ones (i.e. indirect evidence).
gao3 and jin4xing2 have both positive and negative factors, which may have cancelled each other out. The significance of their factors is also relatively weak.
Note that the low F-score of gao3 is consistent with the linguistic observation that it is rarely used as an LV in ML.
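
The F-scores reported for the Mainland data can be checked against the recall and precision columns. The sketch below assumes the balanced F-score (F1, the harmonic mean of precision and recall); the slides do not name the variant, but the reported figures are consistent with it up to rounding. The names `f_score`, `MAINLAND`, and `CHANCE` are illustrative and do not come from the original analysis.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (recall, precision, reported F-score), copied from the Mainland Mandarin table.
MAINLAND = {
    "congshi": (0.655, 0.4645, 0.5436),
    "gao": (0.08, 0.5, 0.1379),
    "jiayi": (0.96, 0.4455, 0.6086),
    "jinxing": (0.31, 0.6966, 0.4291),
    "zuo": (0.485, 0.5843, 0.5300),
}

# Chance level quoted on the slides: five light verbs, so one in five.
CHANCE = 0.2

for verb, (recall, precision, reported) in MAINLAND.items():
    computed = f_score(precision, recall)
    print(
        f"{verb:8s} computed={computed:.4f} reported={reported:.4f} "
        f"above_chance={computed > CHANCE}"
    )
```

Under this assumption only gao falls below the 0.2 chance level, which matches the exception noted for gao3.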
Analysis of Outcome (ML)

  • Slide 23

F-score of Automatic Identification of Five Light Verbs Based on Taiwan Mandarin Data
          recall   precision   F-score
congshi   0.32     0.5614      0.4076
gao       0.695    0.5036      0.5840
jiayi     0.95     0.4139      0.5766
jinxing   0.335    0.5929      0.4281
zuo       0.16     0.8421      0.2689

  • Slide 24

Each light verb can be successfully identified with a better F-score than chance (0.2), but the performance varies from light verb to light verb
gao3/jia1yi3 > jin4xing2/cong2shi4 > zuo4
gao3 and jia1yi3 each have positive significant factors only (i.e. direct evidence for their occurrence).
cong2shi4 has negative significant factors only (i.e. indirect evidence).
jin4xing2 has more positive than negative significant factors.
zuo4 has both types of significant factors, but negative ones outnumber positive ones.
Linguistically,
Analysis of Outcome (TW)

  • Slide 25

Key results:
ML and TW zuo4 show opposite usage tendencies for the feature ARGSTR.two
ML and TW jin4xing2 show opposite usage tendencies for the features ASP.le and ASP.no
But the difference is between a significant and a non-significant feature, rather than between a significant positive and a significant negative feature
Comparison of Mainland and Taiwan light verbs - univariate analysis

  • Slide 26

Probability estimates of Mainland and Taiwan light verbs by Polytomous
In both ML and TW, the model is good overall: the most frequently predicted light verb of each column corresponds to the light verb that actually occurs in the data (see the red figures)
The results also show that while one light verb has the highest probability given a particular context (a set of factors), other light verbs might also have a chance to occur. This is the reason why, empirically, more than one light verb can occur in the same context.

  • Slide 27

Comparison of Mainland and Taiwan light verbs in multivariate polytomous regression

  • Slide 28

Both have similar, non-contradictory distributional patterns. They differ only in that TW is less likely to take formal events as arguments (FOREVTyes).
This is consistent with the intuition that jin4xing2 will be preferred in this context in TW.

  • Slide 29

Comparison of Mainland and Taiwan light verbs in multivariate polytomous regression
Both have similar, non-contradictory distributional patterns.
Both ML and TW gao3 are significantly favored by […]
ML gao3 is less likely to occur with accomplishment objects. This, and the fact that it is unlikely to occur with the aggregate of default variable values, suggests that it is unlikely to be used as a light verb in ML.

  • Slide 30

Comparison of Mainland and Taiwan light verbs in multivariate polytomous regression
Both have similar, non-contradictory distributional patterns.
ML jia1yi3 is more likely to occur with two arguments (ARGSTRtwo), as well as to take VO compounds or psychological events as objects (VOCOMPyes and PSYEVTyes), which confirms the intuition that it is more frequently used in ML.

  • Slide 31

Comparison of Mainland and Taiwan light verbs in multivariate polytomous regression
Both have similar, non-contradictory distributional patterns.
ML jin4xing2 is not likely to take accomplishment objects (ACCOMPEVTyes), while TW jin4xing2 is very likely to take VO compound objects (VOCOMPyes), consistent with Huang et al. (2013)

  • Slide 32

Comparison of Mainland and Taiwan light verbs in multivariate polytomous regression
Both have similar, non-contradictory distributional patterns.
Their distributional patterns are consistent with the analysis of zuo4 as the most bleached of Mandarin light verbs. (The attachment of perfect aspect le is known to be a shared grammatical potential of all light verbs.)
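
The distinction drawn in the key results, between features that flip sign and features that are significant in only one variety, can be made concrete with the +/-/0 convention from the univariate analysis. The following is a minimal sketch, not the authors' procedure: the function names are hypothetical, and the residual values are invented toy numbers rather than figures from the study.

```python
def sign_from_residual(residual: float, cutoff: float = 2.0) -> str:
    """Map a standardized Pearson residual to '+', '-' or '0' (cutoff of 2)."""
    if residual > cutoff:
        return "+"
    if residual < -cutoff:
        return "-"
    return "0"


def compare_signs(first: str, second: str) -> str:
    """Classify how two signs for the same feature relate to each other."""
    if first == second:
        return "same"
    if "0" in (first, second):
        # Significant in one set, non-significant in the other.
        return "existence"
    # One significantly positive, the other significantly negative.
    return "contrasting"


# HYPOTHETICAL residual pairs, invented for illustration only (not from the study).
TOY_RESIDUALS = {
    "FEATURE_A": (3.1, -2.6),
    "FEATURE_B": (2.4, 0.7),
    "FEATURE_C": (-0.3, 1.1),
}

RESULTS = {
    feature: compare_signs(sign_from_residual(a), sign_from_residual(b))
    for feature, (a, b) in TOY_RESIDUALS.items()
}
print(RESULTS)
```

In these terms, the slides describe pairs of different light verbs as showing "contrasting" features, while the Mainland and Taiwan uses of the same light verb differ at most in the "existence" sense.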
  • Slide 33

Conclusion
This study compares the usage tendencies of Chinese light verbs
(1) Among five different light verbs
(2) Between Mainland and Taiwan Mandarin usage of the same light verb
The comparable-corpus-driven statistical analysis is able to generalize about the similarities and differences among light verbs with different factors
The contrast between different light verb pairs can be anchored by statistically significant positive vs. statistically significant negative pairs.
The difference between two Chinese varieties for the same light verbs, however, is between statistically significant vs. non-significant pairs.
The above result allows us to hypothesize that
Different light verbs, even with their weak selectional features, can be identified and differentiated by contrasting distributional tendencies
Variants of the same language, however, do not show contrasting tendencies but can be differentiated by the existence (i.e. significant vs. non-significant) of some distributional tendencies

  • Slide 34

References
Arppe, A. (2008). Univariate, bivariate and multivariate methods in corpus-based lexicography: a study of synonymy. Publications of the Department of General Linguistics, University of Helsinki, No. 44. URN: http://urn.fi/URN:ISBN:978-952-10-5175-3.
Arppe, A. (2009). Linguistic choices vs. probabilities: how much and what can linguistic theory explain? In: Featherston, S. & S. Winkler (eds.) The Fruits of Empirical Linguistics. Volume 1: Process. Berlin: de Gruyter, pp. 124.
Arppe, A. (in prep.). Solutions for fixed and mixed effects modeling of polytomous outcome settings.
Han, Weifeng, Arppe, Antti & Newman, John (2013). Topic marking in a Shanghainese corpus: from observation to prediction. Corpus Linguistics and Linguistic Theory (preprint).
Butt, M., & Geuder, W. (2001). On the (semi)lexical status of light verbs. Semi-lexical Categories, 323-370.
Cattell, R. (1984). Composite Predicates in English. Syntax and Semantics Volume 17.
Sydney: Academic Press Australia.
Cai, Wenlan. (1982). Issues on the Complement of jinxing. Chinese Language Learning, (3), 7-11.

  • Slide 35

References
Huang, Chu-Ren and Jingxia Lin. (2013). The ordering of Mandarin Chinese light verbs. Proceedings of the 13th Chinese Lexical Semantics Workshop. D. Ji and G. Xiao (Eds.): CLSW 2012, LNAI 7717, pp. 728-735. Heidelberg: Springer.
Huang, Chu-Ren, Jingxia Lin, and Huarui Zhang. (2013). World Chineses based on comparable corpus: The case of grammatical variations of jinxing. […], 397-414.
Jespersen, O. (1965). A Modern English Grammar on Historical Principles. Part VI, Morphology. London: George Allen and Unwin Ltd.
Zhou, Gang. (1987a). Subdivision of Dummy Verbs. Chinese Language Learning, 1, 11-14.
Zhou, Xiaobing. (1987b). Sentence Pattern Comparison of jinxing and jiayi. Chinese Language Learning, 6, 1-5.
Zhu, Dexi. (1985). Dummy Verbs and NV in Modern Chinese. Journal of Peking University (Humanities and Social Sciences), 5, 1-6.

  • Slide 36

Thank you