revisiting a brazilian wordnet

Revisiting a Brazilian WordNet

Valeria de PaivaRearden Commerce

Foster City, CA, [email protected]

Alexandre RademakerApplied Mathematics School, FGV

Rio de Janeiro, [email protected]

Abstract

Brazilian Portuguese needs a WordNetthat is open access, downloadable andchangeable, so that it can be improved byresearchers, such as the community inter-ested in automated deduction. This wouldbe very valuable to linguists and com-puter scientist interested in representingknowledge obtained from texts. We dis-cuss briefly why we want a Brazilian Por-tuguese WordNet and how we are goingabout getting one. These are only firststeps, though, as our project is just start-ing.

1 Introduction

WordNet (Fellbaum, 1998) is an extremely valu-able resource for research in Computational Lin-guistics and Natural Language Processing in gen-eral. WordNet has been used for a number of dif-ferent purposes in information systems, includingword sense disambiguation, information retrieval,automatic text classification, automatic text sum-marization, and dozens of other knowledge inten-sive projects.

But if there is still a lack of lexical resourcesfor English, the problem is ten-fold more acute forother languages, which lack even easily accessiblecorpora and basic tools such as tokenizers, taggersand splitters. This lack of resources slows downconsiderably, almost stops completely any workon reasoning about knowledge obtained from lan-guage, our main goal.

We are starting a project at Foundacao GetulioVargas (FGV) in Brazil, where we want, in thelong run, to use formal logical tools to rea-son about knowledge obtained from text in Por-tuguese. We are logicians, not linguists, so wewant to minimize the amount of ComputationalLinguistics that we have to develop. Hence it

would have been very sensible to use a BrazilianWordNet, if we could have one. While we orig-inally expected to be able to use some existingBrazilian WordNet, out of the box, it turns out thatthese are not available. There are some attempts.

There is the project WordNet.PT (PortugueseWordNet) from the “Centro de Linguitica da Uni-versidade de Lisboa” headed by Palmira Mar-rafa. But this is available online only, no down-load available and, as far as we can see on theirwebpages, little development has happened re-cently to this project. The WordNet.PT versionavailable online 1 has about 19000 lexical expres-sions, from different semantic fields. The frag-ment made available online includes expressionsfrom subdomains such as art, clothing, geogra-phy, health, institutions, living entities and trans-portation, but no description of other domainsand/or future releases of the database are dis-cussed. The group has also a newer version ofWordNet.PT called WordNet.PTglobal (Mar-rafa et al., 2011) which pays attention to differentvarieties of Portuguese, like African variations inthe language. But while this is very interesting forlinguistic comparative research and useful for on-line queries 2, this smaller version of WordNet.PTis still not available for download and/or modifica-tions and improvements.

Then there is also the MultiWordNet projectand its Portuguese version MWN.PT, developedby Antonio Branco and colleagues at the NLX-Natural Language and Speech Group, of the Uni-versity of Lisbon, Department of Informatics.According to their description 3 MWN.PT theMultiWordnet of Portuguese (version 1) spansover 17,200 manually validated concepts/synsets,linked under the semantic relations of hyponymyand hypernymy. These concepts are made of over

1http://tinyurl.com/6p2vsy3.2http://www.clul.ul.pt/wnglobal/.3http://tinyurl.com/bum4mmh.

21,000 word senses/word forms and 16,000 lem-mas from both European and American variantsof Portuguese. It includes the subontologies un-der the concepts of Person, Organization, Event,Location, and Art works, which are covered bythe top ontology made of the Portuguese equiva-lents to all concepts in the 4 top layers of the En-glish Princeton WordNet and to the 98 Base Con-cepts suggested by the Global Wordnet Associa-tion, and the 164 Core Base Concepts indicated bythe EuroWordNet project. But again this wordnetis available online only and/or with a restrictive li-cense that requires payment.

Thirdly there is a first version of a Brazilian Por-tuguese version of Wordnet developed by BentoDias da Silva and collaborators (Dias-Da-Silva etal., 2000; Scarton and Aluisio, 2009). But this alsocannot be downloaded, is not available online andis not being maintained in an open access basis,which is one of the strongest points of PrincetonWordNet. Open access availability is one of themain reasons we would like to have our own ver-sion of WordNet-BR, which we are calling WN-BR. This is because we believe that resources likeWikipedia and WordNet need to be open and mod-ifiable by others in order to improve over time.

Finally there is a whole batch of workon merging WordNet with Wikipedia cate-gories and infoboxes, that tries to leverage thework already done by the Wikipedia volunteers.Amongst these we are particularly excited aboutYAGO/MENTA (de Melo and Weikum, 2010)(and YAGO2 for further work on temporal/spatialinformation), which we describe below. This kindof work fits in well with our goals of ultimatelydoing reasoning, in large scale, with knowledgeobtained from text.

2 Global WordNet Grid

In this forum it is perhaps not necessary to recallthat the Global WordNet Association aims at thedevelopment of wordnets for all languages of theworld and to extend the existing wordnets to fullcoverage and all parts-of-speech. In 2006 the asso-ciation launched a project to start building a com-pletely free worldwide wordnet “grid”. This gridwould be built around a shared set of concepts, andwould be expressed in terms of the original Word-net synsets and SUMO (Niles and Pease, 2001)terms. The idea of the grid is very appealing andthe suggested procedure to create wordnets looks

sensible and feasible.

To recap the suggestion was to build the firstversion of the Grid around the set of 4689 “Com-mon Base Concepts” and to make the Grid free,following the example of the Princeton WordNet.Now the Base Concepts are supposed to be themost important concepts in the various wordnetsof different languages. The importance of the con-cepts was measured in terms of two main crite-ria: (1) A high position in the semantic hierar-chy; (2) Having many relationships to other con-cepts. The procedure described as the “expandapproach” seems to us viable: First translate thesynsets in the Princeton WordNet to Portuguese,then take over the relations from Princeton and re-vise, adding the Portuguese terms that satisfy dif-ferent relations. Then revise and revise and reviseuntil we can guarantee the consistency of the tax-onomy.

In somewhat more detail but still following thesuggestions of the global wordnet grid, we thinkwe can develop a core WordNet for Brazilian Por-tuguese by: first representing the 1024 core ba-sic concepts by one or more synsets in Portuguesethat are either equivalent or very closely associ-ated to the original core concepts in the PrincetonWordNet. Then adding Base Concepts that are im-portant to Brazilian Portuguese, but not in the setof core basic concepts. (For that one could uselists of Portuguese words listed by frequency, andcomparisons with other Wordnets for romance lan-guages.) Next we need to check that this forms aclosed and consistent hierarchy. Finally we shouldadd further relations necessary to specify the se-mantics of the basic concepts.

While this procedure seems sensible anddoable, despite being hard work, the existence ofseveral versions of Wordnet, and the fact that theconcepts uncovered as basic ones are not relatedto their synsets in WordNet 3.0 makes things moredifficult. WordNet 3.0 has the huge advantage ofdisambiguated glosses, a real plus if the goal is se-mantics. But the mechanics of following the pro-cedure sketched above turn out more complicatedthan expected.

Our solution is to use the mapping from Word-Net 2.0 to WordNet 3.0 provided by (Daude etal., 2000). The idea consists in, using the map,identify the WordNet 3.0 synsets equivalent to the4689 WordNet 2.0 basic concepts listed in the Eu-roWordNet project (Stamou et al., 2002).

We plan to use the RDF version of WordNet3.0 made freely available online 4 by Mark vanAssem and Jacco van Ossenbruggen. The RDFversion has the benefit of better supporting ourcollaborative work, facilitating the maintenance inthe same data structure of both English and Por-tuguese versions of the glosses and lexical forms,and, once loaded in a Triple Store, providing uswith a working environment to add or removerelations, comments and so on. Unfortunately,there are at least two different versions of Word-Net available in RDF. The RDF representation ofWordNet 2.0 is described in http://www.w3.org/TR/wordnet-rdf/ and seems to be usedas reference for the newly minted RDF represen-tation of WordNet 3.0. Nevertheless, we still haveto investigate the differences between the Word-Net 2.0/3.0 mapping used in the RDF representa-tion of WordNet 3.0 and the mapping provided by(Daude et al., 2000).

3 Sumo and WordNet-BR

The Global WordNet Grid approach has three as-pects that seem to us worth considering. Firstthere is the work on determining which synsets(corresponding to concepts) are most popular inseveral languages. This work was done in theEuroWordNet projects and it would be a shamenot to use it. The emphasis on many languagesshould help filter out personal and cultural biases.Then there is the proposed stepping up schedule,which seems to us very attractive, as a way of help-ing to get the “grunt” work done. Finally and insome ways more importantly there is the connec-tion with SUMO that we discuss now.

According to Wikipedia, the Suggested UpperMerged Ontology or SUMO is an upper ontol-ogy intended as a foundation ontology for a va-riety of computer information processing systems.It can be downloaded and used freely and it hasbeen available and in development since 2000. Amapping from WordNet synsets to SUMO has alsobeen defined and maintained for several versionsof WordNet. Most importantly for us, SUMO isorganized for interoperability of automated rea-soning engines. In particular SUMO’s associatedopen source knowledge engineering environment,Sigma 5 runs already in Vampire (Ganzinger etal., 1999) and Leo-II (Benzmueller and Paulson,

4http://semanticweb.cs.vu.nl/lod/wn30/.5http://sigmakee.sourceforge.net/.

2010), for example. Projections of SUMO intodescription logics, automatically available, can berun in the OWL reasoners.

In the beginning of 2010 we started an informalproject of discussing how logic and automated rea-soning could have a bigger impact, if coupled withnatural language processing and how it would bea great thing to translate some of the advances al-ready made for English text understanding to Por-tuguese text understanding and reasoning.

Since one of us (Valeria de Paiva) had workedfor almost nine years in Xerox PARC, in the sys-tems developed by the Natural Language Theoryand Technology (NLTT) group, particularly on thesystem Bridge, we requested an academic licenseto the XLE (Xerox Language Engine) to try toadapt the systems to Brazilian Portuguese. How-ever, we are both logicians, our expertise lies atthe end of the long pipeline of the system Bridgeand we tried to recruit, still informally, people withmore expertise on the language side of the project.But at that stage we did not have any formal back-ing, so despite some interesting offers, nothingmuch happened. Recently we have been grantedformal backing, although still in small scale, andone of the opportunities that presented itself wasto forestall the need for the creation of a BrazilianWordNet (or perhaps to help improve the creationof such) , via the use of SUMO.

Wordnet is an important component of the XLEUnified Lexicon (UL (Crouch and King, 2005)),as the logical formulae created by the AbstractKnowledge Representation (AKR) component ofthe system are given meaning, in terms of Word-net sysets. A previous version of the system used,instead of the Unified Lexicon, Cyc (Lenat, 1995)concepts as semantics. As discussed in (De Paivaet al., 2007) the sparseness of Cyc concepts wasthe main reason to move away from Cyc onto aversion of the Bridge system based on the UL andWordNet. Since a WordNet-BR is not available,a workaround might be gotten via SUMO, if thiswere to be available in Portuguese. As a warmingexercise we translated the basic concepts used forthe basic concept descriptions in SUMO to Brazil-ian Portuguese 6 and this is already available onthe SourceForge repository for Sigma, SUMO’sknowledge engineering platform. This is not asubstitute for a Brazilian Portuguese WordNet, butmerely a stopping stone towards it.

6http://tinyurl.com/bu874aq

4 Scaling Up?

One of the ways we are considering of scalingup our proposal, from the five thousand conceptssuggested in the grid page to the level that wethink is necessary for our application goes via thework on YAGO (Suchanek et al., 2007) and, per-haps, YAGO2. The YAGO approach to informa-tion extraction for building a searchable, large-scale, highly accurate knowledge base of commonfacts goes via harvesting infoboxes and categorynames from Wikipedia for facts about individualentities. It reconciles these with the taxonomicbackbone of WordNet in order to ensure that allentities have proper classes and the class systemis consistent. The work in YAGO at the Max-PlanckInstitute has led de Melo and Weikum towork in MENTA (Multilingual Taxonomies fromWikipedia), which can be considered a multilin-gual version of WordNet. From this multilingualversion (with 254 languages) we want to ‘project’the component consisting of Portuguese synsetsonly. (The plan is to use the work in the Por-tuguese version of MENTA to complement, auto-matically, the five thousand concepts for WN-BR,obtained through manual translation. We believethat the MENTA (de Melo and Weikum, 2010)projection into Portuguese could give us a rea-sonable basis in terms of synsets in Portugueseto which we would like to compare the existingversions of Portuguese wordnets.) Since YAGOis already integrated with SUMO (De Melo et al.,2008), we hope to be able to maintain consistencyof the database.

5 Conclusions

It is early days for our project and time will tellwhether our decision to follow the global Word-Net grid guidelines for seeding new wordnets willpay off or not, and if so, how well. We have nowa master student interested in the project and moreinterested students are expected. One thing is clearto us, whatever kinds of resource we end up with,we hope to make them freely available in one ofthe numerous sites (SourceForge, GitHub, etc.) atour disposal nowadays. We are not aware of anysuch specialized lexicons available for BrazilianPortuguese and it is about time that we had themopenly and freely accessible.

ReferencesC. Benzmueller and L.C. Paulson. 2010. Multimodal

and intuitionistic logics in simple type theory. LogicJournal of IGPL, 18(6):881.

D. Crouch and T.H. King. 2005. Unifying lexicalresources. In Proceedings of the InterdisciplinaryWorkshop on the Identification and Representationof Verb Features and Verb Classes, Saarbruecken,Germany.

J. Daude, L. Padro, and G. Rigau. 2000. Map-ping wordnets using structural information. InProceedings of the 38th Annual Meeting on As-sociation for Computational Linguistics, pages504–511. Association for Computational Lin-guistics. http://nlp.lsi.upc.edu/web/index.php?option=com_content&task=view&id=21&Itemid=57.

G. de Melo and G. Weikum. 2010. Menta: inducingmultilingual taxonomies from wikipedia. In Pro-ceedings of the 19th ACM international conferenceon Information and knowledge management, pages1099–1108. ACM.

G. De Melo, F. Suchanek, and A. Pease. 2008. In-tegrating yago into the suggested upper merged on-tology. In Tools with Artificial Intelligence, 2008.ICTAI’08. 20th IEEE International Conference on,volume 1, pages 190–193. IEEE.

V. De Paiva, DG Bobrow, C. Condoravdi, R. Crouch,L. Karttunen, TH King, R. Nairn, and A. Zaenen.2007. Textual inference logic: Take two. Proceed-ings of the Workshop on Contexts and Ontologies,Representation and Reasoning, page 27.

B. C. Dias-Da-Silva, H. R. Moraes, M. F. Oliveira,R. Hasegawa, D. A. Amorim, C. Paschoalino, andA. C. Nascimento. 2000. Construcao de um the-saurus eletronico para o portugues do brasil. In Pro-cessamento Computacional do Portugues escrito efalado (Propor), pages 1–10.

C. Fellbaum. 1998. WordNet: An electronic lexicaldatabase. The MIT press.

Harald Ganzinger, Alexandre Riazanov, and AndreiVoronkov. 1999. Vampire. In Automated Deduc-tion CADE-16, volume 1632 of Lecture Notes inComputer Science, pages 674–674. Springer Berlin/ Heidelberg.

D.B. Lenat. 1995. Cyc: A large-scale investment inknowledge infrastructure. Communications of theACM, 38(11):33–38.

Palmira Marrafa, Raquel Amaro, and Sara Mendes.2011. Wordnet.pt global – extending wordnet.pt toportuguese varieties. In Proceedings of the FirstWorkshop on Algorithms and Resources for Mod-elling of Dialects and Language Varieties, pages 70–74, Edinburgh, Scotland, July. Association for Com-putational Linguistics.

I. Niles and A. Pease. 2001. Towards a standard upperontology. In Proceedings of the international con-ference on Formal Ontology in Information Systems-Volume 2001, pages 2–9. ACM.

Carolina Scarton and Sandra Aluisio. 2009. Herancaautomatica das relacoes de hiperonimia para a word-net.br. Technical Report NILC-TR-09-10, USP, SaoCarlos, SP, Brazil.

S. Stamou, K. Oflazer, K. Pala, D. Christoudoulakis,D. Cristea, D. Tufis, S. Koeva, G. Totkov, D. Dutoit,and M. Grigoriadou. 2002. Balkanet a multilingualsemantic network for the balkan languages. In Pro-ceedings of the International Wordnet Conference,Mysore, India, pages 21–25.

Fabian M. Suchanek, Gjergji Kasneci, and GerhardWeikum. 2007. Yago: A Core of Semantic Knowl-edge. In 16th international World Wide Web con-ference (WWW 2007), New York, NY, USA. ACMPress.