CAP: A Hierarchical Lexical Function
Amalia Todirascu
Linguistique, Langues, Paroles (LILPA)
University of Strasbourg
The Project
Goals
• to study a specific lexical function, CAP, in several languages (French, English, German) and in two domains (economy, politics)
• to provide a complete linguistic description of this function
• to extend a multilingual ontology, Prolexbase (Tran and Maurel, 2006)
The Project (II)
collaboration with the CLARIN European project (http://www.clarin.eu)
• WP3: Humanities overview
• WP3.3: Call for collaboration with Humanities projects
• Collaboration
– access to existing corpora and tools
– consultancy
CAP – a Lexical Function
CAP lexical function (Mel'čuk 1984, 1988, 1992, 1999): hierarchical relations
• Two persons
– François Fillon est premier ministre de Nicolas Sarkozy
– Sebek em war ein Oberpriester ca. 1780 v.Chr
• Two organisations
– Swiss Private Aviation AG, a fully-owned subsidiary of Swiss International Air Lines AG
– Peugeot est une firme sochalienne
• A person and an organisation or a country
– SWISS Finanzchef Marcel Klaus
– Traian Băsescu is the Romanian president
Context
linguistics: noun classifications (Kleiber 1990, Kleiber 1999, Jonasson 1994)
lexical databases: WordNet (Miller, 1995), EuroWordNet (Vossen, 1998), BalkaNet (Tufis, 2004), FrameNet (Baker et al., 1998)
ontologies: Prolexbase (Tran and Maurel, 2006; Grass et al., 2004), SUMO (Niles and Pease, 2001)
several applications: information extraction, QA systems
The Methodology
we identify existing monolingual and parallel corpora
• DE, EN, FR
• CLARIN language resource registry
• tagged and raw corpora
• annotation tools (both from the repository and on-line web services)
we create our own multilingual corpora
The Methodology (II)
we apply several data extraction strategies
• searching synonyms of "chef/head of/Vorsitzender";
• searching Named Entities related by the CAP relation (Martine Aubry – Parti Socialiste);
• searching annotated persons and organizations through aligned corpora
we analyse the contexts to classify the expressions and their arguments
we extend the Prolexbase ontology
Corpora (I)
• Available public data
• Web interfaces (CQP)
• Various domains and genres
• monolingual:
– Wortschatz (http://corpora.informatik.uni-leipzig.de), IULA (http://bwananet.iula.upf.edu), COSMAS (http://www.ids-mannheim.de/cosmas2), BNC (http://www.natcorp.ox.ac.uk/)
• multilingual:
– Oslo (http://www.hf.uio.no), CLUVI (http://sli.uvigo.es/CLUVI), DGT-TM (http://langtech.jrc.it/DGT-TM.html)
Corpora (II)
Corpora built for the project
monolingual : party chiefs (DE, EN), French president (FR) (200,000 tokens/language)
multilingual (parallel and comparable):
airplane companies (51,000-54,000 tokens)
European parliament (127,000-134,000 tokens)
European commission (175,000-195,000 tokens)
Domains : politics, economy
Preprocessing the Corpora
Unitex tool (Paumier, 2000)
• Resources available for the three languages
• Tools: tokenizer, lemmatizer and tagger
• CasSys (Friburger and Maurel, 2004) to annotate French Named Entities
Weblicht Platform
• NE annotations for German and English
sentence aligner: Alinea (Kraif, 2001)
Data extraction
three strategies for data extraction
A. we identify synonyms/hyponyms for English (WordNet, FrameNet) and their equivalents in French and German
• chef, président, PDG, directeur général
• chief executive officer, president, head of
• Vorsitzender, Direktor
B. we search pairs of entities which are related by a CAP relation
• Barack Obama – United States of America
• José Manuel Barroso – la Commission européenne
• Marcel Klaus – SWISS
C. we use aligned corpora and the French NER tool CasSys (Friburger and Maurel, 2004) to obtain relevant contexts of Person or Organization
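Strategies A and B can be illustrated with a minimal Python sketch. It is not the project's implementation (the slides describe querying corpora through CQP and web interfaces); it only shows the idea on a plain list of sentences. The seed expressions come from the slide, while the `SEEDS` layout and the function names `seed_pattern`, `contexts_with_seed` and `contexts_with_pair` are hypothetical.

```python
import re

# Seed expressions from the slide (strategy A); the dictionary layout is an assumption.
SEEDS = {
    "fr": ["chef", "président", "PDG", "directeur général"],
    "en": ["chief executive officer", "president", "head of"],
    "de": ["Vorsitzender", "Direktor"],
}

def seed_pattern(lang):
    """Compile one case-insensitive regex matching any seed of a language as a whole word."""
    # Longest seeds first, so multiword seeds are tried before shorter ones.
    alternatives = sorted(SEEDS[lang], key=len, reverse=True)
    body = "|".join(re.escape(seed) for seed in alternatives)
    return re.compile(r"\b(?:" + body + r")\b", re.IGNORECASE)

def contexts_with_seed(sentences, lang):
    """Strategy A: keep (sentence, matched seed) for every sentence containing a seed."""
    pattern = seed_pattern(lang)
    hits = []
    for sentence in sentences:
        match = pattern.search(sentence)
        if match:
            hits.append((sentence, match.group(0)))
    return hits

def contexts_with_pair(sentences, person, organisation):
    """Strategy B: keep sentences that mention both members of a known CAP pair."""
    return [s for s in sentences if person in s and organisation in s]
```

Because the seeds are matched as whole words, a compound such as "Finanzchef" is not found by the seed "chef"; catching compounds would need lemmatized or morphologically analysed input, which this sketch does not assume.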
Data Extraction (II)
Problems
• few contexts from existing corpora (30 to 50)
• Various queries
– CQP/web interface
– raw texts
• Various annotations
– few tagged corpora
– almost no NE annotated corpora
• heterogeneous tools to preprocess corpora
'Cap' lexical units
various lexical categories
• nouns: positions (e.g. Finanzdirektor), professions (infirmière en chef), titles (Dr.), army ranks (General)
• verbs: to lead, to organize, to command
A trilingual ontology
• 95 lexical units (FR), 93 lexical units (EN), 67 lexical units (DE)
• From existing lexical databases
• From corpora
Linguistic Analysis
argument types
• persons, organizations, places
• common nouns: anaphoric references to organisations or persons in charge
• nationality adjectives
various linguistic expressions
• Nouns: morpho-syntactic variations
• Verbs
• complex verbo-nominal predicates (sous la gouverne de, unter der Leitung von, under the direction of, become president, être élu …)
Morpho-Syntactic Properties
Nouns
• affixation: général, généralissime (FR)
• composition: vice-roi (FR), viceroy (EN), Vizekönig (DE)
• modification
– adjective (directeur général FR, Generaldirektor DE)
– prepositional phrase (infirmière en chef FR, head nurse EN, Oberschwester DE)
• noun being the possessor of another noun: du Conseil de Sécurité des Nations Unies, United Nations Security Council, des UN-Sicherheitsrates
Conclusion and Further Work
study from the lexical semantics field: a hierarchy relation in a multilingual perspective (CAP)
• various expressions and various argument types
• data from monolingual and multilingual corpora
• trilingual ontology (FR, DE, EN): extension of Prolexbase
Overall experience
• querying various interfaces
• heterogeneous annotation information
• heterogeneous tools
• combining linguists' and computational linguists' competences
The Lexico-Syntactic Patterns
French patterns
• <Organization> de <Organization>
– Conseil d'Administration de SWISS
English patterns
• <CAP function> of <Organisation>, <Person>
– Chief executive officer of the company TAROM, M. Gheorghe Birla
German patterns
• <Person> <sein> <tokens>* <CAP function> <Organisation>
– Peter Siegenthaler ist seit Juli 2000 Direktor der Eidgenössischen Finanzverwaltung
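As an illustration of how such a pattern could be applied, the German pattern can be sketched in Python over tokens that already carry entity labels. This is only a sketch under stated assumptions, not the project's tooling (the slides mention Unitex and CasSys): the labels `PER`, `ORG` and `CAP`, the one-letter encoding, the short list of forms of "sein" and the function names are all hypothetical, and the article "der" is assumed to be part of the organisation span.

```python
import re

# Hypothetical one-letter codes for token labels (not from the slides).
CODES = {"PER": "P", "ORG": "O", "CAP": "C"}
# A few inflected forms of "sein"; a real system would use a lemmatizer.
SEIN_FORMS = {"ist", "war", "sind", "waren"}

def encode(tokens):
    """Map (word, label) pairs to a string with exactly one letter per token."""
    letters = []
    for word, label in tokens:
        if label in CODES:
            letters.append(CODES[label])
        elif word.lower() in SEIN_FORMS:
            letters.append("S")
        else:
            letters.append("x")  # any other token
    return "".join(letters)

# <Person> <sein> <tokens>* <CAP function> <Organisation>
GERMAN_PATTERN = re.compile(r"(?P<person>P+)Sx*(?P<cap>C+)(?P<org>O+)")

def match_german(tokens):
    """Return the person, CAP function and organisation spans, or None if no match."""
    found = GERMAN_PATTERN.search(encode(tokens))
    if found is None:
        return None
    # One letter per token, so match offsets are also token offsets.
    return {
        name: " ".join(word for word, _ in tokens[found.start(name):found.end(name)])
        for name in ("person", "cap", "org")
    }

example = [
    ("Peter", "PER"), ("Siegenthaler", "PER"), ("ist", None), ("seit", None),
    ("Juli", None), ("2000", None), ("Direktor", "CAP"),
    ("der", "ORG"), ("Eidgenössischen", "ORG"), ("Finanzverwaltung", "ORG"),
]
result = match_german(example)
```

Encoding each token as a single letter lets an ordinary regular expression stand in for the token-level pattern; the French and English patterns could be written the same way with their own letter sequences.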