SENSEVAL2
Scott Cotton and Martha Palmer
ISLE Meeting
Dec 11, 2000
University of Pennsylvania
SENSEVAL
• SENSEVAL/SIGLEX98 (Brighton, Sep 98)
– Workshop on Word Sense Disambiguation
– Hector, corpus-based sense inventory
– 34 words: nouns, verbs, adjectives, mixed
– Inter-annotator agreement over 90%
– English (18 participating systems)
– Also Italian (2) and French (5)
Siglex99: All words Experiment
• WSJ 5K word corpus
– running text
– WordNet 1.6
• 2100 words sense tagged twice (10 days)
– 89% inter-annotator agreement
– 700 verb tokens
– 81% agreement
(disagreement in 90/350 verb tokens)
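Agreement figures like these are usually the fraction of doubly tagged tokens on which both annotators picked the same sense. The sketch below assumes plain pairwise observed agreement; the slides do not say which measure was actually used, and the sense labels and the function name `observed_agreement` are hypothetical.

```python
def observed_agreement(tags_a, tags_b):
    """Fraction of tokens on which two annotators chose the same sense tag."""
    if len(tags_a) != len(tags_b):
        raise ValueError("both annotators must tag the same tokens")
    if not tags_a:
        raise ValueError("no tokens to compare")
    matches = sum(1 for a, b in zip(tags_a, tags_b) if a == b)
    return matches / len(tags_a)


# Hypothetical sense tags for five tokens (not taken from the WSJ data).
annotator_1 = ["run%1", "run%2", "bank%1", "take%3", "take%3"]
annotator_2 = ["run%1", "run%1", "bank%1", "take%3", "take%4"]

rate = observed_agreement(annotator_1, annotator_2)
print(f"{rate:.0%}")  # 3 of the 5 tokens match
```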
SENSEVAL2
• Toulouse, France, July 5-6 (ACL’01)
– Samples, mid-Dec
– Training data, April
– Testing data, May
• 13 Languages
• Lexical sample and all words
• Standardized data and formats, central server
• Closer tie to applications
13 Languages
• Swedish - lexical sample
– Dimitrios Kokkinakis <[email protected]>
• Chinese - lexical sample
– Chu-Ren Huang [email protected]
– Keh-jiann Chen <[email protected]>
• Danish - lexical sample
– Bolette Pedersen <[email protected]>
• Estonian - all words (in principle)
– Haldur Oim <[email protected]>
13 Languages, cont.
• Japanese - lexical sample
– Sadao Kurohashi [email protected]
• Bangla - lexical sample
– Niladri Sekhar Dash [email protected]
• Italian - lexical sample
– Nicoletta Calzolari <[email protected]>
• English - lexical sample and all words
– Adam Kilgarriff [email protected]
– Martha Palmer [email protected]
13 Languages, cont.
• Basque - lexical sample
– Eneko Agirre <[email protected]>
• Spanish - lexical sample
– Mariona Taulé <[email protected]>
– German Rigau <[email protected]>
• Korean -
– Key-Sun Choi <[email protected]>
• Czech -
– Ondrej Cikhart <[email protected]>
• Dutch -
– Antal van den Bosch <[email protected]>
Lexical Sample DTD
<!ELEMENT corpus (lexset+)>
<!ATTLIST corpus lang CDATA #REQUIRED>
<!ELEMENT lexset (instance+)>
<!ATTLIST lexset item CDATA #REQUIRED>
<!ELEMENT instance (answer*,context)>
<!ELEMENT answer EMPTY>
<!ATTLIST answer senseid CDATA #REQUIRED weight CDATA #REQUIRED>
<!ELEMENT context (#PCDATA | itemloc)*>
<!ELEMENT itemloc (#PCDATA)>
<!DOCTYPE corpus SYSTEM "lexical-sample.dtd">
<corpus lang="en">
  <lexset item="banana">
    <instance>
      <answer senseid="0" weight="0.3"/>
      <context>The monkeys ravenously devoured the <itemloc>bananas</itemloc> after the famine.</context>
    </instance>
  </lexset>
</corpus>
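A system consuming this format needs, for each instance, the lexical item, the marked target word, the surrounding context, and any weighted sense answers. The sketch below shows one way to read them with Python's standard `xml.etree.ElementTree`; the function name `read_lexical_sample` and the returned dictionary layout are assumptions made for illustration, not part of the SENSEVAL2 tooling. The DOCTYPE line is left out of the embedded sample so the snippet does not depend on a DTD file.

```python
import xml.etree.ElementTree as ET

# The example instance from the slide, embedded so the sketch is self-contained.
SAMPLE = """<corpus lang="en">
  <lexset item="banana">
    <instance>
      <answer senseid="0" weight="0.3"/>
      <context>The monkeys ravenously devoured the <itemloc>bananas</itemloc> after the famine.</context>
    </instance>
  </lexset>
</corpus>"""


def read_lexical_sample(xml_text):
    """Return one dict per <instance> in a lexical-sample corpus."""
    root = ET.fromstring(xml_text)
    rows = []
    for lexset in root.iter("lexset"):
        for instance in lexset.iter("instance"):
            context = instance.find("context")
            target = context.find("itemloc")
            # Each <answer> carries a sense id and a weight; there may be none.
            answers = [
                (a.get("senseid"), float(a.get("weight")))
                for a in instance.findall("answer")
            ]
            rows.append({
                "item": lexset.get("item"),
                "target": target.text if target is not None else None,
                # itertext() joins the text around and inside <itemloc>.
                "context": "".join(context.itertext()).strip(),
                "answers": answers,
            })
    return rows


rows = read_lexical_sample(SAMPLE)
print(rows[0]["item"], rows[0]["target"], rows[0]["answers"])
```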
XML version?
<!ELEMENT corpus (descr?,rtext+)>
<!ATTLIST corpus lang CDATA #REQUIRED>
<!ELEMENT descr (#PCDATA)>
<!ELEMENT rtext (descr?, text, answer*)>
<!ELEMENT text (#PCDATA | tloc)*>
<!ELEMENT tloc (#PCDATA)>
<!ATTLIST tloc id ID #REQUIRED>
<!ELEMENT answer (lexentry,loc+,sense+)>
<!ELEMENT lexentry (#PCDATA)>
<!ELEMENT loc EMPTY>
<!ATTLIST loc ids IDREFS #REQUIRED>
<!ELEMENT sense EMPTY>
<!ATTLIST sense senseid CDATA #REQUIRED weight CDATA #IMPLIED>
<!DOCTYPE corpus SYSTEM "all-words.dtd">
<corpus lang="en">
<rtext> <descr> taken from the man page for intro of section 3
of a FreeBSD 4.0 system. </descr><text>
Words in text are tagged:
This <tloc id="w0">section</tloc> <tloc id="w1">provides</tloc> an <tloc id="w2">overview</tloc> of the C <tloc id="w3">library</tloc> <tloc id="w4">functions</tloc>, their <tloc id="w5">error</tloc> <tloc id="w6">returns</tloc> and other <tloc id="w7">common</tloc> <tloc id="w8">definitions</tloc> and <tloc id="w9">concepts</tloc>.
Most of these <tloc id="w10">functions</tloc> <tloc id="w11">are</tloc> <tloc id="w12">available</tloc> from the C <tloc id="w13">library</tloc>, libc. Other <tloc id="w14">libraries</tloc>,
Then, for each tag:
</text>
<answer>
<lexentry>section</lexentry>
<loc ids="w0"/>
<sense senseid="1"/>
</answer>
</rtext>
</corpus>
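In the all-words layout the answers are stand-off: each `<answer>` points back at tagged tokens through the `ids` attribute of `<loc>`, so a reader has to resolve those references against the `<tloc>` elements. The sketch below assumes the running text is wrapped in a `<text>` element, as in the example, and that the sense is carried in the `senseid` attribute, as the DTD declares. The shortened sample and the function name `read_all_words` are hypothetical.

```python
import xml.etree.ElementTree as ET

# A shortened, hypothetical sample in the all-words layout.
SAMPLE = """<corpus lang="en">
  <rtext>
    <descr>hypothetical two-token sample</descr>
    <text>This <tloc id="w0">section</tloc> <tloc id="w1">provides</tloc> an overview.</text>
    <answer>
      <lexentry>section</lexentry>
      <loc ids="w0"/>
      <sense senseid="1"/>
    </answer>
  </rtext>
</corpus>"""


def read_all_words(xml_text):
    """Resolve each <answer> to the token text its <loc> elements refer to."""
    root = ET.fromstring(xml_text)
    results = []
    for rtext in root.iter("rtext"):
        # Index every tagged token by its id so answers can look it up.
        tokens = {t.get("id"): t.text for t in rtext.iter("tloc")}
        for answer in rtext.findall("answer"):
            # ids is an IDREFS value: one or more ids separated by whitespace.
            ids = [i for loc in answer.findall("loc") for i in loc.get("ids").split()]
            results.append({
                "lexentry": answer.findtext("lexentry"),
                "tokens": [tokens[i] for i in ids],
                "senses": [s.get("senseid") for s in answer.findall("sense")],
            })
    return results


results = read_all_words(SAMPLE)
print(results)
```

Because `ids` may list several tokens, the same lookup would also cover a multi-word entry whose `<loc>` names more than one id.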