senseval2 scott cotton and martha palmer isle meeting dec 11, 2000 university of pennsylvania

SENSEVAL2

Scott Cotton and Martha Palmer

ISLE Meeting

Dec 11, 2000

University of Pennsylvania

SENSEVAL

• SENSEVAL/SIGLEX98 (Brighton, Sep 98)
– Workshop on Word Sense Disambiguation
– Hector, corpus-based sense inventory
– 34 words: nouns, verbs, adjectives, mixed
– Inter-annotator agreement over 90%
– English (18 participating systems)
– Also Italian (2) and French (5)

Siglex99: All words Experiment

• WSJ 5K word corpus
– running text
– WordNet 1.6

• 2100 words sense tagged twice (10 days)
– 89% inter-annotator agreement
– 700 verb tokens
– 81% agreement
  (disagreement in 90/350 verb tokens)
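
The slides do not say how inter-annotator agreement was computed. A minimal sketch, assuming plain exact-match agreement between two taggers over the same tokens; the function name and the sense tags below are hypothetical and are not data from the experiment:

```python
from typing import Sequence


def observed_agreement(tags_a: Sequence[str], tags_b: Sequence[str]) -> float:
    """Fraction of tokens on which two annotators chose the same sense tag."""
    if len(tags_a) != len(tags_b):
        raise ValueError("both annotators must tag the same tokens")
    if not tags_a:
        raise ValueError("no tokens to compare")
    matches = sum(a == b for a, b in zip(tags_a, tags_b))
    return matches / len(tags_a)


# Hypothetical sense tags for five tokens (illustration only).
annotator_1 = ["bank%1", "run%2", "take%1", "set%3", "go%1"]
annotator_2 = ["bank%1", "run%2", "take%4", "set%3", "go%1"]

agreement = observed_agreement(annotator_1, annotator_2)
print(f"{agreement:.0%}")
```

Exact match is the simplest choice; it does not correct for chance agreement, which a measure such as Cohen's kappa would.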

SENSEVAL2

• Toulouse, France, July 5-6 (ACL’01)
– Samples, mid-December
– Training data, April
– Testing data, May

• 13 Languages
• Lexical sample and all words
• Standardized data and formats, central server
• Closer tie to applications

13 Languages

• Swedish - lexical sample
– Dimitrios Kokkinakis <[email protected]>

• Chinese - lexical sample
– Chu-Ren Huang <[email protected]>
– Keh-jiann Chen <[email protected]>

• Danish - lexical sample
– Bolette Pedersen <[email protected]>

• Estonian - all words (in principle)
– Haldur Oim <[email protected]>

13 Languages, cont.

• Japanese - lexical sample
– Sadao Kurohashi <[email protected]>

• Bangla - lexical sample
– Niladri Sekhar Dash <[email protected]>

• Italian - lexical sample
– Nicoletta Calzolari <[email protected]>

• English - lexical sample and all words
– Adam Kilgarriff <[email protected]>
– Martha Palmer <[email protected]>

13 Languages, cont.

• Basque - lexical sample
– Eneko Agirre <[email protected]>

• Spanish - lexical sample
– Mariona Taulé <[email protected]>
– German Rigau <[email protected]>

• Korean -
– Key-Sun Choi <[email protected]>

• Czech -
– Ondrej Cikhart <[email protected]>

• Dutch -
– Antal van den Bosch <[email protected]>

Lexical Sample DTD

<!ELEMENT corpus (lexset+)>
<!ATTLIST corpus lang CDATA #REQUIRED>
<!ELEMENT lexset (instance+)>
<!ATTLIST lexset item CDATA #REQUIRED>
<!ELEMENT instance (answer*,context)>
<!ELEMENT answer EMPTY>
<!ATTLIST answer senseid CDATA #REQUIRED weight CDATA #REQUIRED>
<!ELEMENT context (#PCDATA | itemloc)+>
<!ELEMENT itemloc (#PCDATA)>

<!DOCTYPE corpus SYSTEM "lexical-sample.dtd">

<corpus lang="en">
  <lexset item="banana">
    <instance>
      <answer senseid="0" weight="0.3"/>
      <context>The monkeys ravenously devoured the <itemloc>bananas</itemloc> after the famine. </context>
    </instance>
  </lexset>
</corpus>
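
A lexical-sample file in this format could be read with a standard XML parser. The following is a minimal sketch using Python's standard library; the function name `read_lexical_sample` and the output dictionary layout are assumptions made for illustration, and the DOCTYPE line is left out of the sample so that no external DTD file is needed:

```python
import xml.etree.ElementTree as ET

# The sample instance from the slide, without the DOCTYPE declaration.
SAMPLE = """<corpus lang="en">
  <lexset item="banana">
    <instance>
      <answer senseid="0" weight="0.3"/>
      <context>The monkeys ravenously devoured the <itemloc>bananas</itemloc> after the famine. </context>
    </instance>
  </lexset>
</corpus>"""


def read_lexical_sample(xml_text):
    """Return one dict per <instance> (hypothetical layout, for illustration)."""
    root = ET.fromstring(xml_text)
    rows = []
    for lexset in root.iter("lexset"):
        for instance in lexset.iter("instance"):
            context = instance.find("context")
            target = context.find("itemloc")
            # Each <answer> carries a sense id and a weight.
            answers = [
                (a.get("senseid"), float(a.get("weight")))
                for a in instance.findall("answer")
            ]
            rows.append({
                "lang": root.get("lang"),
                "item": lexset.get("item"),
                "target": target.text if target is not None else None,
                # Flatten mixed content and normalize whitespace.
                "context": " ".join("".join(context.itertext()).split()),
                "answers": answers,
            })
    return rows


rows = read_lexical_sample(SAMPLE)
print(rows[0]["item"], rows[0]["target"], rows[0]["answers"])
```

The sketch only reads the file; it does not validate it against the DTD.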

XML version?

<!ELEMENT corpus (descr?,rtext+)>
<!ATTLIST corpus lang CDATA #REQUIRED>
<!ELEMENT descr (#PCDATA)>
<!ELEMENT rtext (descr?, (tloc | #PCDATA)+, answer*)>
<!ELEMENT tloc (#PCDATA)>
<!ATTLIST tloc id ID #REQUIRED>
<!ELEMENT answer (lexentry,loc+,sense+)>
<!ELEMENT lexentry (#PCDATA)>
<!ELEMENT loc EMPTY>
<!ATTLIST loc ids IDREFS #REQUIRED>
<!ELEMENT sense EMPTY>
<!ATTLIST sense senseid CDATA #REQUIRED weight CDATA #IMPLIED>

<!DOCTYPE corpus SYSTEM "all-words.dtd">

<corpus lang="en">

<rtext>
<descr> taken from the man page for intro in section 3 of a FreeBSD 4.0 system. </descr>
<text>

Words in text are tagged:

This <tloc id="w0">section</tloc> <tloc id="w1">provides</tloc> an <tloc id="w2">overview</tloc> of the C <tloc id="w3">library</tloc> <tloc id="w4">functions</tloc>, their <tloc id="w5">error</tloc> <tloc id="w6">returns</tloc> and other <tloc id="w7">common</tloc> <tloc id="w8">definitions</tloc> and <tloc id="w9">concepts</tloc>.

Most of these <tloc id="w10">functions</tloc> <tloc id="w11">are</tloc> <tloc id="w12">available</tloc> from the C <tloc id="w13">library</tloc>, libc. Other <tloc id="w14">libraries</tloc>,

Then, for each tag:

</text>

<answer>

<lexentry>section</lexentry>

<loc ids="w0"/>

<sense senseid="1"/>

</answer>

</rtext>

</corpus>
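
In the all-words format, each answer points back at the tagged tokens through the `ids` attribute of `loc`. A minimal sketch of resolving those references in Python follows. It assumes the sense attribute is named `senseid`, as in the DTD; the function name `read_all_words`, the output layout, and the shortened sample text are hypothetical:

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical sample in the shape of the slide's example.
SAMPLE = """<corpus lang="en">
<rtext>
<descr>shortened sample for illustration</descr>
<text>This <tloc id="w0">section</tloc> <tloc id="w1">provides</tloc> an <tloc id="w2">overview</tloc>.</text>
<answer>
<lexentry>section</lexentry>
<loc ids="w0"/>
<sense senseid="1"/>
</answer>
</rtext>
</corpus>"""


def read_all_words(xml_text):
    """Resolve each <answer> to its tagged tokens and sense ids."""
    root = ET.fromstring(xml_text)
    results = []
    for rtext in root.iter("rtext"):
        # Map token id -> surface word; iter() finds <tloc> at any depth,
        # so this works with or without a wrapping <text> element.
        words = {t.get("id"): t.text for t in rtext.iter("tloc")}
        for answer in rtext.iter("answer"):
            # IDREFS is a whitespace-separated list of token ids.
            token_ids = [
                token_id
                for loc in answer.findall("loc")
                for token_id in loc.get("ids").split()
            ]
            results.append({
                "lexentry": answer.findtext("lexentry"),
                "tokens": [words[token_id] for token_id in token_ids],
                "senses": [s.get("senseid") for s in answer.findall("sense")],
            })
    return results


answers = read_all_words(SAMPLE)
print(answers)
```

Because `ids` is declared as IDREFS, one `loc` can name several tokens, which is how a multi-word expression would be tied to a single lexical entry.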
