![Page 1: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks](https://reader033.vdocument.in/reader033/viewer/2022051611/549efb5bac79594c768b48e0/html5/thumbnails/1.jpg)
A Universal Phrase Tagset for Multilingual Treebanks
Aaron Li-Feng Han1,2, Derek F. Wong1, Lidia S. Chao1, Yi Lu1, Liangye He1, and Liang Tian1
1 NLP2CT lab, University of Macau 2 ILLC, University of Amsterdam
Speaker: Mr. Zhiyang Teng / Haibo Zhang, Chinese Academy of Sciences (CAS)
CCL and NLP-NABD 2014 Wuhan, P.R.China
![Page 2: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks](https://reader033.vdocument.in/reader033/viewer/2022051611/549efb5bac79594c768b48e0/html5/thumbnails/2.jpg)
CONTENT
Motivation
Proposed tagset
Experiment
Discussion
![Page 3: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks](https://reader033.vdocument.in/reader033/viewer/2022051611/549efb5bac79594c768b48e0/html5/thumbnails/3.jpg)
• Many syntactic treebanks and parser toolkits have been developed in the past twenty years
• Including dependency structure parsers and phrase structure parsers
• Phrase structure parsers usually use different phrase tagsets for different languages
• To capture the characteristics of specific languages
• Phrase categories range from ten to twenty or even more
• This is inconvenient when conducting multilingual research
Motivations
![Page 4: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks](https://reader033.vdocument.in/reader033/viewer/2022051611/549efb5bac79594c768b48e0/html5/thumbnails/4.jpg)
• To facilitate the research of multilingual tasks
• Could we make some bridges between these treebanks?
• McDonald et al. (2013) designed a universal annotation approach for dependency treebanks
• Petrov et al. (2012) developed a universal part-of-speech (PoS) tagset
• Han et al. (2013) discussed the universal phrase tagset between French-English (bilingual)
• => Then, a multilingual universal phrase tagset?
• And mappings between existing phrase tags and the universal ones?
Motivations
![Page 5: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks](https://reader033.vdocument.in/reader033/viewer/2022051611/549efb5bac79594c768b48e0/html5/thumbnails/5.jpg)
![Page 6: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks](https://reader033.vdocument.in/reader033/viewer/2022051611/549efb5bac79594c768b48e0/html5/thumbnails/6.jpg)
• After looking inside several syntactic treebanks
• We design a refined universal phrase tagset
• Use 9 common phrase categories
• all with high appearance rates
• Conduct mappings between the phrase tagsets of existing phrase structure treebanks and the universal ones
• Cover 25 treebanks and 21 languages
Proposed tagset
![Page 7: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks](https://reader033.vdocument.in/reader033/viewer/2022051611/549efb5bac79594c768b48e0/html5/thumbnails/7.jpg)
• Refined universal phrase tagset
• Noun phrase (NP),
• Verbal phrase (VP),
• Adjectival phrase (AJP),
• Adverbial phrase (AVP),
• Prepositional phrase (PP),
• Sentence or sub-sentence (S),
• Conjunction phrase (CONJP),
• Coordinated phrase (COP),
• Others (X) covering the list marker, interjection, URL, etc.
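The nine categories above can be written down as a small enumeration. This is only a sketch: the class name `UniversalPhraseTag` and the constant `UNIVERSAL_LABELS` are hypothetical names chosen for illustration, and the descriptions are copied from the list above.

```python
from enum import Enum


class UniversalPhraseTag(Enum):
    """The nine universal phrase categories listed on the slide."""
    NP = "noun phrase"
    VP = "verbal phrase"
    AJP = "adjectival phrase"
    AVP = "adverbial phrase"
    PP = "prepositional phrase"
    S = "sentence or sub-sentence"
    CONJP = "conjunction phrase"
    COP = "coordinated phrase"
    X = "others (list marker, interjection, URL, etc.)"


# Convenience set of the bare labels, e.g. for validating a mapping table.
UNIVERSAL_LABELS = frozenset(tag.name for tag in UniversalPhraseTag)
```

A mapping table for a given treebank could then be checked by verifying that every target label is in `UNIVERSAL_LABELS`.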
Proposed tagset
![Page 8: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks](https://reader033.vdocument.in/reader033/viewer/2022051611/549efb5bac79594c768b48e0/html5/thumbnails/8.jpg)
• Cover 21 languages and 25 treebanks:
• Arabic, Catalan, Chinese, Danish, English, Estonian, French, German, Hebrew, Hindi, Hungarian, Icelandic, Italian, Japanese, Korean, Portuguese, Spanish, Swedish, Thai, Urdu, and Vietnamese
• Detailed mappings between existing phrase tags and the universal ones are given in the following table
Proposed tagset
![Page 9: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks](https://reader033.vdocument.in/reader033/viewer/2022051611/549efb5bac79594c768b48e0/html5/thumbnails/9.jpg)
• Mappings of 25 phrase structure treebanks
Proposed tagset
![Page 10: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks](https://reader033.vdocument.in/reader033/viewer/2022051611/549efb5bac79594c768b48e0/html5/thumbnails/10.jpg)
• Mappings of 25 phrase structure treebanks
Proposed tagset
![Page 11: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks](https://reader033.vdocument.in/reader033/viewer/2022051611/549efb5bac79594c768b48e0/html5/thumbnails/11.jpg)
• Mappings of 25 phrase structure treebanks
Proposed tagset
![Page 12: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks](https://reader033.vdocument.in/reader033/viewer/2022051611/549efb5bac79594c768b48e0/html5/thumbnails/12.jpg)
• Mappings of 25 phrase structure treebanks
Proposed tagset
![Page 13: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks](https://reader033.vdocument.in/reader033/viewer/2022051611/549efb5bac79594c768b48e0/html5/thumbnails/13.jpg)
![Page 14: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks](https://reader033.vdocument.in/reader033/viewer/2022051611/549efb5bac79594c768b48e0/html5/thumbnails/14.jpg)
• How to evaluate the effectiveness of the proposed work?
• - the universal phrase tagset
• - the mapping of existing tagsets
• Experimental design:
• - parsing accuracy testing
Experiment
![Page 15: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks](https://reader033.vdocument.in/reader033/viewer/2022051611/549efb5bac79594c768b48e0/html5/thumbnails/15.jpg)
• Steps:
• 1. Run training and testing on original corpus, record the testing accuracy
• 2. Replace original phrase tags with universal ones resulting in ‘new corpus’
• 3. Re-run the training and testing on new corpus, record the new accuracy
• 4. Compare the changes of accuracy
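Step 2 can be sketched as a relabelling pass over bracketed trees. The sketch assumes Penn-Treebank-style bracketing; `EXAMPLE_MAPPING` is a small illustrative subset and an assumption, not a copy of the mapping table in the slides, and the function name `to_universal` is hypothetical. Labels missing from the mapping are left unchanged.

```python
import re

# Illustrative subset of a Penn-Treebank-style mapping. This is an assumption
# for the sketch, not the mapping table presented in the slides.
EXAMPLE_MAPPING = {
    "NP": "NP", "VP": "VP", "PP": "PP", "S": "S",
    "ADJP": "AJP", "ADVP": "AVP", "SBAR": "S",
}

# A phrase label is a label whose opening bracket is followed by another
# bracket. Preterminals such as (DT the) are followed by a word, so the
# part-of-speech tags are left untouched.
PHRASE_LABEL = re.compile(r"\(([^\s()]+)(?=\s+\()")


def to_universal(tree, mapping=EXAMPLE_MAPPING):
    """Replace phrase labels in one bracketed tree with universal tags."""
    def relabel(match):
        label = match.group(1)
        # Drop function tags such as -SBJ or =2 before the lookup.
        base = re.split(r"[-=]", label)[0] or label
        # Unknown labels are kept as they are.
        return "(" + mapping.get(base, label)
    return PHRASE_LABEL.sub(relabel, tree)


original = "(S (NP-SBJ (DT the) (NN cat)) (VP (VBD sat) (ADVP (RB quietly))))"
print(to_universal(original))
```

Running the same pass over the training, development and testing files would give the "new corpus" used in step 3.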
Experiment
![Page 16: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks](https://reader033.vdocument.in/reader033/viewer/2022051611/549efb5bac79594c768b48e0/html5/thumbnails/16.jpg)
• Evaluation criteria
• cost of training time (hours),
• size of the generated grammar (MB),
• parsing accuracy scores:
• Labeled Precision (LPre), Labeled Recall (LRec), the harmonic mean of precision and recall (F1), and exact match (Ex, whole sentence/tree)
Experiment
![Page 17: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks](https://reader033.vdocument.in/reader033/viewer/2022051611/549efb5bac79594c768b48e0/html5/thumbnails/17.jpg)
• Calculation formula:
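A minimal sketch of how these scores could be computed, assuming the usual bracket-based definitions: LPre is matched labeled brackets over predicted brackets, LRec is matched brackets over gold brackets, F1 is their harmonic mean, and Ex is the share of sentences whose bracket sets match completely. Representing a tree as a list of `(label, start, end)` spans and the function name `parseval` are assumptions made for illustration. Conventions that evaluation tools often apply, such as ignoring punctuation, are not modelled.

```python
from collections import Counter


def parseval(gold_trees, pred_trees):
    """Corpus-level LPre, LRec, F1 and Ex.

    Each tree is a list of (label, start, end) spans; the two lists of
    trees are aligned sentence by sentence.
    """
    matched = gold_total = pred_total = exact = 0
    for gold, pred in zip(gold_trees, pred_trees):
        g, p = Counter(gold), Counter(pred)
        matched += sum((g & p).values())   # multiset intersection
        gold_total += sum(g.values())
        pred_total += sum(p.values())
        exact += g == p                    # whole tree identical
    lpre = matched / pred_total if pred_total else 0.0
    lrec = matched / gold_total if gold_total else 0.0
    f1 = 2 * lpre * lrec / (lpre + lrec) if lpre + lrec else 0.0
    ex = exact / len(gold_trees) if gold_trees else 0.0
    return {"LPre": lpre, "LRec": lrec, "F1": f1, "Ex": ex}


gold = [[("S", 0, 3), ("NP", 0, 2), ("VP", 2, 3)], [("S", 0, 2), ("NP", 0, 1)]]
pred = [[("S", 0, 3), ("NP", 0, 1), ("VP", 2, 3)], [("S", 0, 2), ("NP", 0, 1)]]
print(parseval(gold, pred))
```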
Experiment
![Page 18: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks](https://reader033.vdocument.in/reader033/viewer/2022051611/549efb5bac79594c768b48e0/html5/thumbnails/18.jpg)
• Tested Languages:
• Five representative languages: Chinese (CN), English (EN), Portuguese (PT), French (FR), and German (DE)
Experiment
![Page 19: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks](https://reader033.vdocument.in/reader033/viewer/2022051611/549efb5bac79594c768b48e0/html5/thumbnails/19.jpg)
• Based on the Berkeley parser (Petrov and Klein, 2007):
• learns probabilistic context-free grammars (PCFGs) to assign a sequence of words the most likely parse tree
• introduces hierarchical latent-variable grammars to automatically learn a broad-coverage grammar starting from a simple initial one
• The generated grammar is refined by hierarchical splitting, merging and smoothing
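To illustrate what "assign a sequence of words the most likely parse tree" means for a PCFG, here is a toy Viterbi CKY sketch. It is not the Berkeley parser's inference procedure, which works with latent subcategories. The grammar, the lexicon and the function name `viterbi_parse` are hypothetical, and the grammar is assumed to be in Chomsky normal form.

```python
from collections import defaultdict

# Toy PCFG in Chomsky normal form (hypothetical, for illustration only).
BINARY = {  # (left child, right child) -> [(parent, rule probability)]
    ("NP", "VP"): [("S", 1.0)],
    ("DT", "NN"): [("NP", 1.0)],
    ("VBD", "NP"): [("VP", 1.0)],
}
LEXICON = {  # word -> [(preterminal, emission probability)]
    "the": [("DT", 1.0)],
    "dog": [("NN", 0.5)],
    "cat": [("NN", 0.5)],
    "saw": [("VBD", 1.0)],
}


def viterbi_parse(words, root="S"):
    """Return (probability, bracketed tree) of the best parse, or None."""
    n = len(words)
    # best[(i, j)][label] = (probability, subtree) for the span words[i:j]
    best = defaultdict(dict)
    for i, word in enumerate(words):
        for tag, prob in LEXICON.get(word, []):
            best[(i, i + 1)][tag] = (prob, f"({tag} {word})")
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            cell = best[(i, j)]
            for k in range(i + 1, j):  # split point between the two children
                for left, (pl, tl) in best[(i, k)].items():
                    for right, (pr, tr) in best[(k, j)].items():
                        for parent, prob in BINARY.get((left, right), []):
                            score = prob * pl * pr
                            if score > cell.get(parent, (0.0, ""))[0]:
                                cell[parent] = (score, f"({parent} {tl} {tr})")
    return best[(0, n)].get(root)


print(viterbi_parse("the dog saw the cat".split()))
```

With fewer phrase labels the chart cells hold fewer candidates, which is one intuition for why a smaller tagset can reduce grammar size and training cost.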
Experiment
Petrov, S., Klein, D.: Improved Inference for Unlexicalized Parsing. NAACL (2007)
![Page 20: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks](https://reader033.vdocument.in/reader033/viewer/2022051611/549efb5bac79594c768b48e0/html5/thumbnails/20.jpg)
• Setting:
• The Berkeley parser generally achieves its best testing result with the 6th smoothed grammar
• For a broader analysis, we set the parameters to learn the refined grammar through 7 rounds of splitting, merging and smoothing
• except for the French treebank, where we use 8 rounds
Experiment
![Page 21: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks](https://reader033.vdocument.in/reader033/viewer/2022051611/549efb5bac79594c768b48e0/html5/thumbnails/21.jpg)
• Hardware:
• The experiments are conducted on a server with the configuration stated in Table
Experiment
![Page 22: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks](https://reader033.vdocument.in/reader033/viewer/2022051611/549efb5bac79594c768b48e0/html5/thumbnails/22.jpg)
• Chinese corpus:
• Penn Chinese Treebank (CTB-7) (Xue et al., 2005)
• standard splitting criteria for the training and testing data
• training documents contain CTB-7 files 0-to-2082
• development documents contain files 2083-to-2242
• testing documents are files 2243-to-2447
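The file ranges above can be expressed as a small lookup. This is a sketch only: `CTB7_SPLITS` and `split_for` are hypothetical names, and how file numbers are read from the treebank's file names is left out.

```python
# File-number ranges taken from the slide; range() excludes its upper bound.
CTB7_SPLITS = {
    "train": range(0, 2083),    # files 0 to 2082
    "dev": range(2083, 2243),   # files 2083 to 2242
    "test": range(2243, 2448),  # files 2243 to 2447
}


def split_for(file_id):
    """Return the name of the split that a CTB-7 file number belongs to."""
    for name, ids in CTB7_SPLITS.items():
        if file_id in ids:
            return name
    raise ValueError(f"file number {file_id} is outside the listed ranges")
```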
Experiment
Nianwen Xue, Fei Xia, Fu-Dong Chiou, and Martha Palmer. 2005. The Penn Chinese TreeBank: Phrase Structure Annotation of a Large Corpus. Natural Language Engineering, 11(2), 207-238.
![Page 23: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks](https://reader033.vdocument.in/reader033/viewer/2022051611/549efb5bac79594c768b48e0/html5/thumbnails/23.jpg)
• Experiment results:
• The highest precision, recall, F1 and exact match scores
• 85.58 (85.06), 83.24 (83.01), 84.4 (83.99), and 25.33 (24.73) respectively using the universal phrase tagset (original tags)
• Grammar size and training time
• For the 7th refined grammar, the original tagset needs 65.55 MB and 94.79 hours, versus 34.53 MB and 56.25 hours for the universal one (about 1.9 times the grammar size and 1.7 times the training time)
• Detailed learning scores are shown in the figure on the next page
Experiment
![Page 24: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks](https://reader033.vdocument.in/reader033/viewer/2022051611/549efb5bac79594c768b48e0/html5/thumbnails/24.jpg)
Experiment
![Page 25: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks](https://reader033.vdocument.in/reader033/viewer/2022051611/549efb5bac79594c768b48e0/html5/thumbnails/25.jpg)
• English corpus:
• Wall Street Journal treebank (Bies et al., 1995)
• standard setting:
• WSJ sections 2 to 21 for training
• section 22 for development
• section 23 for testing (Petrov et al., 2006)
Experiment
Ann Bies, Mark Ferguson, Karen Katz and Robert MacIntyre. 1995. Bracketing Guidelines for Treebank II Style Penn Treebank Project. Technical paper.
![Page 26: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks](https://reader033.vdocument.in/reader033/viewer/2022051611/549efb5bac79594c768b48e0/html5/thumbnails/26.jpg)
• Experiment results (similar to CN):
• The highest precision, recall and F1 scores
• 91.45 (91.25), 91.19 (91.11) and 91.32 (91.18) respectively using the universal phrase tagset (original tags)
• Training time and grammar size
• 38.67 (51.64) hours and 30.72 (47.00) MB of memory during the training process for the 7th refined grammar, on the universal (original) tagset
• Detailed learning scores are shown in the figure on the next page
Experiment
![Page 27: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks](https://reader033.vdocument.in/reader033/viewer/2022051611/549efb5bac79594c768b48e0/html5/thumbnails/27.jpg)
Experiment
![Page 28: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks](https://reader033.vdocument.in/reader033/viewer/2022051611/549efb5bac79594c768b48e0/html5/thumbnails/28.jpg)
• Portuguese corpus:
• Bosque treebank subset of Floresta Virgem corpora (Afonso et al., 2002; Freitas et al., 2008)
• A size of 162,484 lexical units
• 80 percent of the sentences for training
• 10 percent for development
• another 10 percent for testing
• i.e. 7393, 939, and 957 sentences respectively
Experiment
Susana Afonso, Eckhard Bick, Renato Haber, and Diana Santos. 2002. Floresta sintá(c)tica: a treebank for Portuguese. In Proceedings of LREC 2002, pp. 1698-1703.
Cláudia Freitas, Paulo Rocha and Eckhard Bick. 2008. Floresta Sintá(c)tica: Bigger, Thicker and Easier. In António Teixeira, Vera Lúcia Strube de Lima, Luís Caldas de Oliveira & Paulo Quaresma (eds.), Computational Processing of the Portuguese Language, 8th International Conference, pp. 216-219.
![Page 29: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks](https://reader033.vdocument.in/reader033/viewer/2022051611/549efb5bac79594c768b48e0/html5/thumbnails/29.jpg)
• Experiment results (similar to CN):
• The highest precision, recall and F1 scores
• 81.84 (81.44), 80.81 (80.27) and 81.32 (80.85) respectively using the universal phrase tagset (original tags)
• Training time and grammar size
• 3.69 (4.16) hours and 9.17 (10.02) MB of memory during the training process for the 7th refined grammar, on the universal (original) tagset
• Detailed learning scores are shown in the figure on the next page
Experiment
![Page 30: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks](https://reader033.vdocument.in/reader033/viewer/2022051611/549efb5bac79594c768b48e0/html5/thumbnails/30.jpg)
Experiment
![Page 31: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks](https://reader033.vdocument.in/reader033/viewer/2022051611/549efb5bac79594c768b48e0/html5/thumbnails/31.jpg)
• German corpus:
• German Negra treebank (Skut et al., 1997)
• 355,096 tokens and 20,602 sentences of German newspaper text with completely annotated syntactic structures
• 80 percent (16,482 sentences) of the corpus for training
• 10 percent (2,060 sentences) for development
• 10 percent (2,060 sentences) for testing.
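The sentence counts above are consistent with a simple 80/10/10 split. The sketch below assumes contiguous slices and this particular rounding; the slides do not say how sentences were assigned to each part, and the function name `split_80_10_10` is hypothetical.

```python
def split_80_10_10(sentences):
    """Split a list into training, development and testing parts.

    Assumption: contiguous slices, with the training size rounded to the
    nearest sentence and the remainder divided between the other two parts.
    """
    n = len(sentences)
    n_train = round(n * 0.8)
    n_dev = (n - n_train) // 2
    train = sentences[:n_train]
    dev = sentences[n_train:n_train + n_dev]
    test = sentences[n_train + n_dev:]
    return train, dev, test


# Stand-in for the 20,602 Negra sentences.
train, dev, test = split_80_10_10(list(range(20602)))
print(len(train), len(dev), len(test))
```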
Experiment
W. Skut, B. Krenn, T. Brants, and H. Uszkoreit. 1997. An annotation scheme for free word order languages. In Conference on ANLP.
![Page 32: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks](https://reader033.vdocument.in/reader033/viewer/2022051611/549efb5bac79594c768b48e0/html5/thumbnails/32.jpg)
• Experiment results:
• The highest precision, recall and F1 scores
• 81.35 (81.23), 81.03 (81.02), and 81.19 (81.12) respectively using the universal phrase tagset (original tags)
• Grammar size and training time
• similar on the universal and original tagsets
• Detailed learning scores are shown in the figure on the next page
Experiment
![Page 33: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks](https://reader033.vdocument.in/reader033/viewer/2022051611/549efb5bac79594c768b48e0/html5/thumbnails/33.jpg)
Experiment
![Page 34: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks](https://reader033.vdocument.in/reader033/viewer/2022051611/549efb5bac79594c768b48e0/html5/thumbnails/34.jpg)
• French corpus:
• Unlike the standard treebanks used for the other languages, we build the French corpus ourselves
• Extract 20,000 French sentences from the Europarl corpora
• Parse the extracted French text using the Berkeley French grammar “fra_sm5.gr” (Petrov, 2009)
• with a parsing accuracy of around 0.80, the parsed Euro-Fr corpus is used for training
• Development and testing corpora
• WMT12 and WMT13 French plain text, 3,003 and 3,000 sentences respectively, parsed by the same parser
Experiment
Petrov, S.: Coarse-to-Fine Natural Language Processing. PhD thesis (2009)
![Page 35: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks](https://reader033.vdocument.in/reader033/viewer/2022051611/549efb5bac79594c768b48e0/html5/thumbnails/35.jpg)
• Experiment results:
• The highest precision, recall and F1 scores
• 80.49 (80.34), 80.93 (80.96), and 80.71 (80.64) respectively using the universal phrase tagset (original tags)
• Training time and grammar size
• 13.66 (28.91) hours and 12.07 (16.66) MB of memory during the training process for the 8th refined grammar, on the universal (original) tagset
• Detailed learning scores are shown in the figure on the next page
Experiment
![Page 36: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks](https://reader033.vdocument.in/reader033/viewer/2022051611/549efb5bac79594c768b48e0/html5/thumbnails/36.jpg)
Experiment
![Page 37: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks](https://reader033.vdocument.in/reader033/viewer/2022051611/549efb5bac79594c768b48e0/html5/thumbnails/37.jpg)
![Page 38: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks](https://reader033.vdocument.in/reader033/viewer/2022051611/549efb5bac79594c768b48e0/html5/thumbnails/38.jpg)
• 1. Differences with related work
• McDonald et al. (2013) designed a universal annotation approach for dependency treebanks
• Han et al. (2013) first discussed the universal phrase tagset for syntactic treebanks. Differences between our work and Han’s:
• Han’s covers the mapping of French and English <=> We extend the mapping to 25 treebanks and 21 languages
• Han’s applies the universal tagset to MT evaluation (indirect examination) <=> We examine the effectiveness of the mapping directly on parsing tasks
• Han’s experiments use a French-English MTE corpus <=> Our experiments cover five representative languages: Chinese/English/French/German/Portuguese
Discussion
McDonald, R.; Nivre, J., Quirmbach-Brundage, Y., et al.: Universal Dependency Annotation for Multilingual Parsing. In: Proceedings of ACL (2013)
Han, A.L.-F; Wong, D.F., Chao, L.S., He, L., Li, S., Zhu, L.: Phrase Tagset Mapping for French and English Treebanks and Its Application in Machine Translation Evaluation. GSCL-2013. LNCS, vol. 8105, pp. 119–131. Springer, Heidelberg (2013)
![Page 39: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks](https://reader033.vdocument.in/reader033/viewer/2022051611/549efb5bac79594c768b48e0/html5/thumbnails/39.jpg)
• 2. Analysis of the performances
• The higher parsing accuracy is not only due to the smaller number of phrase tags we employ, since:
• the early-stage parsing accuracies (e.g. the 1st and 2nd refined grammars) using the universal tags are even lower than the ones using the original tags.
• e.g. the F-scores of the 1st refined grammar using universal tags vs original tags are 67.06 (lower) vs 70.84; however, the winner changes after the 5th refinement
• we think the parsing accuracy is also related to the mapping quality
• The exact match scores are sometimes (French/German) lower than the ones using the original tags, which is an issue for future work
• Currently, universal tags usually achieve higher performance according to the best F-score
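The last point can be checked against the best F1 scores reported on the result slides. The sketch below only tabulates those reported numbers; the names `BEST_F1` and `f1_gains` are hypothetical.

```python
# Best F1 per language as reported on the result slides:
# language -> (universal tagset, original tagset)
BEST_F1 = {
    "CN": (84.4, 83.99),
    "EN": (91.32, 91.18),
    "PT": (81.32, 80.85),
    "DE": (81.19, 81.12),
    "FR": (80.71, 80.64),
}


def f1_gains(scores=BEST_F1):
    """F1 difference (universal minus original), rounded to two decimals."""
    return {lang: round(uni - orig, 2) for lang, (uni, orig) in scores.items()}


for lang, gain in f1_gains().items():
    print(f"{lang}: {gain:+.2f}")
```

On these figures the universal tagset is ahead in all five languages, by margins of under half an F1 point.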
Discussion
![Page 40: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks](https://reader033.vdocument.in/reader033/viewer/2022051611/549efb5bac79594c768b48e0/html5/thumbnails/40.jpg)
• 3. Future work
• We plan to evaluate the parsing experiments on more language treebanks
• Improve the exact match score at the whole-tree level
• Apply the universal phrase tagset to other multilingual applications, e.g. Petrov et al. (2012)’s work.
Discussion
Petrov, S., Das, D., McDonald, R.: A Universal Part-of-Speech Tagset. In: Proceedings of the Eighth LREC (2012)
![Page 41: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks](https://reader033.vdocument.in/reader033/viewer/2022051611/549efb5bac79594c768b48e0/html5/thumbnails/41.jpg)
Cite this work:
Han, A.L.-F; Wong, D.F.; Chao, L.S.; Lu, Y.; He, L.; and Tian, L.: A Universal Phrase Tagset for Multilingual Treebanks. In M. Sun et al. (Eds.): CCL and NLP-NABD 2014, LNAI 8801, pp. 247–258, 2014. © Springer International Publishing Switzerland 2014