SENSEVAL2 Scott Cotton and Martha Palmer ISLE Meeting Dec 11, 2000 University of Pennsylvania


SENSEVAL2

Scott Cotton and Martha Palmer

ISLE Meeting

Dec 11, 2000

University of Pennsylvania


SENSEVAL

• SENSEVAL/SIGLEX98 (Brighton, Sep. 1998)
– Workshop on Word Sense Disambiguation
– Hector, corpus-based sense inventory
– 34 words: nouns, verbs, adjectives, mixed
– Inter-annotator agreement over 90%
– English (18 participating systems)
– Also Italian (2) and French (5)


Siglex99: All words Experiment

• WSJ 5K-word corpus
– running text
– WordNet 1.6
• 2100 words sense-tagged twice (10 days)
– 89% inter-annotator agreement
– 700 verb tokens
– 81% agreement
(disagreement in 90/350 verb tokens)
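
The slides do not say how the agreement percentages were computed. One simple reading is observed agreement: the fraction of tokens on which two annotators chose the same sense. The sketch below illustrates that reading only; the function name `observed_agreement` and the toy sense labels are hypothetical and do not come from the WSJ data.

```python
def observed_agreement(tags_a, tags_b):
    """Fraction of tokens on which two annotators chose the same sense tag."""
    if len(tags_a) != len(tags_b):
        raise ValueError("both annotators must tag the same tokens")
    if not tags_a:
        raise ValueError("no tokens to compare")
    matches = sum(a == b for a, b in zip(tags_a, tags_b))
    return matches / len(tags_a)


# Hypothetical sense tags for five tokens (not taken from the slides).
annotator_1 = ["bank%1", "run%2", "take%4", "see%1", "line%3"]
annotator_2 = ["bank%1", "run%2", "take%1", "see%1", "line%3"]

# The annotators agree on 4 of 5 tokens.
print(f"{observed_agreement(annotator_1, annotator_2):.0%}")  # 80%
```

Observed agreement does not correct for chance; a chance-corrected measure would need the sense distribution as well.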


SENSEVAL2

• Toulouse, France, July 5-6 (ACL'01)

– Samples, mid-Dec

– Training data, April

– Testing data, May

• 13 languages
• Lexical sample and all words
• Standardized data and formats, central server
• Closer tie to applications


13 Languages

• Swedish - lexical sample
– Dimitrios Kokkinakis <[email protected]>
• Chinese - lexical sample
– Chu-Ren Huang <[email protected]>
– Keh-jiann Chen <[email protected]>
• Danish - lexical sample
– Bolette Pedersen <[email protected]>
• Estonian - all words (in principle)
– Haldur Oim <[email protected]>


13 Languages, cont.

• Japanese - lexical sample
– Sadao Kurohashi <[email protected]>
• Bangla - lexical sample
– Niladri Sekhar Dash <[email protected]>
• Italian - lexical sample
– Nicoletta Calzolari <[email protected]>
• English - lexical sample and all words
– Adam Kilgarriff <[email protected]>
– Martha Palmer <[email protected]>


13 Languages, cont.

• Basque - lexical sample
– Eneko Agirre <[email protected]>
• Spanish - lexical sample
– Mariona Taulé <[email protected]>
– German Rigau <[email protected]>
• Korean -
– Key-Sun Choi <[email protected]>
• Czech -
– Ondrej Cikhart <[email protected]>
• Dutch -
– Antal van den Bosch <[email protected]>


Lexical Sample DTD

<!ELEMENT corpus (lexset+)>
<!ATTLIST corpus lang CDATA #REQUIRED>
<!ELEMENT lexset (instance+)>
<!ATTLIST lexset item CDATA #REQUIRED>
<!ELEMENT instance (answer*,context)>
<!ELEMENT answer EMPTY>
<!ATTLIST answer senseid CDATA #REQUIRED weight CDATA #REQUIRED>
<!ELEMENT context (#PCDATA | itemloc)+>
<!ELEMENT itemloc (#PCDATA)>


<!DOCTYPE corpus SYSTEM "lexical-sample.dtd">

<corpus lang="en">
  <lexset item="banana">
    <instance>
      <answer senseid="0" weight="0.3"/>
      <context>The monkeys ravenously devoured the <itemloc>bananas</itemloc> after the famine. </context>
    </instance>
  </lexset>
</corpus>
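
As an illustration of how a file in this lexical-sample format might be consumed, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The function name `read_lexical_sample` and the layout of the returned dictionaries are assumptions made for this example, not part of any SENSEVAL2 tooling. The sample string repeats the banana instance, with the DOCTYPE line left out and the `corpus` element closed so that it parses on its own.

```python
import xml.etree.ElementTree as ET

# The banana instance from the slide, closed so it is well-formed.
SAMPLE = """
<corpus lang="en">
  <lexset item="banana">
    <instance>
      <answer senseid="0" weight="0.3"/>
      <context>The monkeys ravenously devoured the <itemloc>bananas</itemloc> after the famine. </context>
    </instance>
  </lexset>
</corpus>
"""


def read_lexical_sample(xml_text):
    """Return one dict per <instance> (hypothetical layout, for illustration)."""
    corpus = ET.fromstring(xml_text.strip())
    rows = []
    for lexset in corpus.iter("lexset"):
        for instance in lexset.iter("instance"):
            context = instance.find("context")
            itemloc = context.find("itemloc")
            rows.append({
                "lang": corpus.get("lang"),
                "item": lexset.get("item"),
                # Surface form of the target word as marked by <itemloc>.
                "target": itemloc.text if itemloc is not None else None,
                # Full context with markup removed and whitespace normalized.
                "context": " ".join("".join(context.itertext()).split()),
                # (senseid, weight) pairs; the DTD makes both attributes required.
                "answers": [(a.get("senseid"), float(a.get("weight")))
                            for a in instance.findall("answer")],
            })
    return rows


for row in read_lexical_sample(SAMPLE):
    print(row["item"], row["target"], row["answers"])
```

Because `answer*` is optional in the DTD, an unannotated test instance would simply yield an empty `answers` list.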


XML version?

<!ELEMENT corpus (descr?,rtext+)>
<!ATTLIST corpus lang CDATA #REQUIRED>
<!ELEMENT descr (#PCDATA)>
<!ELEMENT rtext (descr?, (tloc | #PCDATA)+, answer*)>
<!ELEMENT tloc (#PCDATA)>
<!ATTLIST tloc id ID #REQUIRED>
<!ELEMENT answer (lexentry,loc+,sense+)>
<!ELEMENT lexentry (#PCDATA)>
<!ELEMENT loc EMPTY>
<!ATTLIST loc ids IDREFS #REQUIRED>
<!ELEMENT sense EMPTY>
<!ATTLIST sense senseid CDATA #REQUIRED weight CDATA #IMPLIED>


<!DOCTYPE corpus SYSTEM "all-words.dtd">

<corpus lang="en">

<rtext>
<descr> taken from the intro man page in section 3 of a FreeBSD 4.0 system. </descr>
<text>


Words in text are tagged:

This <tloc id="w0">section</tloc> <tloc id="w1">provides</tloc> an <tloc id="w2">overview</tloc> of the C <tloc id="w3">library</tloc> <tloc id="w4">functions</tloc>, their <tloc id="w5">error</tloc> <tloc id="w6">returns</tloc> and other <tloc id="w7">common</tloc> <tloc id="w8">definitions</tloc> and <tloc id="w9">concepts</tloc>.

Most of these <tloc id="w10">functions</tloc> <tloc id="w11">are</tloc> <tloc id="w12">available</tloc> from the C <tloc id="w13">library</tloc>, libc. Other <tloc id="w14">libraries</tloc>,


Then, for each tag:

</text>

<answer>

<lexentry>section</lexentry>

<loc ids="w0"/>

<sense senseid="1"/>

</answer>

</rtext>

</corpus>
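
The all-words format keeps the answers apart from the running text and links them through the `ids` attribute of `loc`, which the DTD declares as IDREFS (a whitespace-separated list of `tloc` ids). A minimal sketch of resolving those references with Python's standard `xml.etree.ElementTree` follows. The function name `read_all_words`, the returned layout, and the shortened sample text are assumptions made for this example. The sketch reads the sense from `senseid`, as the DTD declares it, and falls back to a plain `id` attribute if that is what a file uses.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical fragment in the all-words format.
SAMPLE = """
<corpus lang="en">
  <rtext>
    <descr>toy fragment for illustration</descr>
    <text>This <tloc id="w0">section</tloc> <tloc id="w1">provides</tloc> an <tloc id="w2">overview</tloc>.</text>
    <answer>
      <lexentry>section</lexentry>
      <loc ids="w0"/>
      <sense senseid="1"/>
    </answer>
  </rtext>
</corpus>
"""


def read_all_words(xml_text):
    """Resolve each <answer> to the tagged tokens it points at."""
    corpus = ET.fromstring(xml_text.strip())
    results = []
    for rtext in corpus.iter("rtext"):
        # Map every tloc id to its surface word.
        words = {t.get("id"): t.text for t in rtext.iter("tloc")}
        for answer in rtext.findall("answer"):
            # IDREFS: one loc may name several tokens (e.g. a multiword unit).
            ids = [i for loc in answer.findall("loc")
                   for i in loc.get("ids").split()]
            senses = [s.get("senseid", s.get("id"))
                      for s in answer.findall("sense")]
            results.append({
                "lexentry": answer.findtext("lexentry"),
                "tokens": [words[i] for i in ids],
                "senses": senses,
            })
    return results


print(read_all_words(SAMPLE))
```

Keeping answers separate from the text in this way means the same tagged text can be distributed with or without the answer elements.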