Handling of missing values in lexical acquisition LREC 2010, La Valletta, Malta, May 2010 1 GRUP DE TECNOLOGIES DELS RECURSOS LINGÜÍSTICS (TRL) Handling of missing values in lexical acquisition Núria Bel Universitat Pompeu Fabra


Page 1

Handling of missing values

in lexical acquisition

Núria Bel

Universitat Pompeu Fabra

Page 2

By Automatic Lexical Information Acquisition we try to find out how to build repositories of language-dependent lexical information automatically. Many technologies behind applications (MT, IE, Automatic Summarization, Sentiment Analysis, Opinion Mining, Question Answering, etc.) need this information to work.

("paralelo" AST ALO "paralel" ATR POST CL (PF-AS PM-OS SF-A SM-O) FC (NPP) LY AMENTE MC ("a") PLC (NG) PRED (ESTAR SER) TA (OBJ-P REL) AUTHOR "juan" DATE "31-Aug-99" SITE "FB52")

("fiesta" NST ALO "fiest" CL (PF-AS SF-A) GD (F) KN MS PLC (NF) TYN (ABS) AUTHOR "juan" DATE "28-Aug-99" SITE "FB52")

Entries borrowed from MT system Incyta (Metal family)

Page 3

Cue Based Lexical Acquisition

• Differences in the distribution of certain contexts separate words of different classes (Harris, 1951).

• For example: some / *many mud

• Words (types) can be represented in terms of a collection of contexts, where their occurrence or non-occurrence in these contexts is taken as a hint or cue for classifying a word as belonging to a particular class.

Page 4

Word occurrences are represented as vectors and used to train a classifier.

@data

15,2,8,4,0,8,1,0,1,0,0,0,0,0

Number of times the word has been observed in each of the defined contexts.

Non-occurrence in particular contexts is as informative as occurrence.

We use supervised classifiers (Support Vector Machines, Decision Trees) to predict the class (Abstract, Mass, etc.) of new words.
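The vector representation described above could be produced roughly as follows. This is only a minimal sketch: the cue patterns, the `cue_vector` helper and the toy corpus are hypothetical illustrations, not the cue set or tooling used in the talk.

```python
import re

# Hypothetical cue patterns for English mass/count nouns; "{w}" is the target word.
# These are illustrative only and are not the cues used in the experiments.
CUE_PATTERNS = [
    r"\bsome {w}\b",   # e.g. "some mud"
    r"\bmuch {w}\b",   # e.g. "much mud"
    r"\bmany {w}s\b",  # e.g. "many dogs"
    r"\ba {w}\b",      # e.g. "a dog"
]


def cue_vector(word, sentences):
    """Count how often `word` is observed in each cue context."""
    counts = []
    for pattern in CUE_PATTERNS:
        regex = re.compile(pattern.format(w=re.escape(word)), re.IGNORECASE)
        counts.append(sum(len(regex.findall(s)) for s in sentences))
    return counts


corpus = [
    "There was some mud on the floor.",
    "Too much mud got into the car, and some mud stayed there.",
    "A dog chased many dogs.",
]
print(cue_vector("mud", corpus))  # one count per cue, in CUE_PATTERNS order
print(cue_vector("dog", corpus))
```

Each position in the resulting list corresponds to one context, so a zero simply records that the word was not seen in that context in this corpus.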

Page 5

Cues, classification and state-of-the-art results

• Merlo and Stevenson (2001) selected very specific cues for classifying verbs into a number of Levin (1993) based verbal classes: animacy of the subject, passives, ...

• Baldwin (2005) used general features, such as the POS tags of neighboring words, for type classification.

• Joanis et al. (2007) used the frequency of filled syntactic positions or slots, tense and voice of occurring verbs, etc., to describe the whole system of English verbal classes.

• The results are difficult to compare, but accuracy is about 70%.

Page 6

The problem: missing values

Page 7

The Sparse data problem

• Joanis and Stevenson (2003), Joanis et al. (2007) and Korhonen et al. (2008) mention that they have to face the problem of sparse data: many of the types/words are low in frequency and provide very little information.

• Most words occur very rarely (i.e. they follow a Zipfian distribution) and therefore show few cues.

• Yallop et al. (2005) calculated that in the 100M-word British National Corpus, from a total of 124,120 distinct adjectives, 70,246 occur only once.

• The cues we can use as information are mutually exclusive within a single occurrence: an adjective can be both prenominal and postnominal, but if it occurs only once it will show only one cue, and all the others will have a zero value.

• Even for types that occur more than once, the optional nature and variety of the contexts of occurrence give rise to missing values.
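As a rough illustration of the sparsity described above, the sketch below computes the share of types that occur exactly once, first for a made-up token list and then for the BNC adjective figures quoted above. The `hapax_share` helper and the toy tokens are assumptions introduced only for this example.

```python
from collections import Counter


def hapax_share(tokens):
    """Return the fraction of distinct types that occur exactly once."""
    counts = Counter(tokens)
    if not counts:
        return 0.0
    hapaxes = sum(1 for c in counts.values() if c == 1)
    return hapaxes / len(counts)


# Toy data: 5 distinct types, 3 of which occur only once.
toy_tokens = ["red", "red", "red", "big", "big", "parallel", "muddy", "zipfian"]
print(hapax_share(toy_tokens))

# BNC adjective figures quoted above (Yallop et al., 2005):
# 70,246 of 124,120 distinct adjectives occur only once.
bnc_share = 70246 / 124120
print(round(bnc_share, 3))
```

On the quoted figures, more than half of the adjective types can show at most one cue each.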

Page 8

Zero values and learning

• Zero values not only leave too little information to decide on; they also introduce further uncertainty when learning from the data.

• A zero value may be a genuine negative value, i.e. the cue is precisely that the context has not been observed, but it may also be that the cue simply was not observed in the examined corpus, for various reasons.

• When there are many zero values, the cue loses its predictive power because of this uncertainty.

• Katz (1987) and Baayen and Sproat (1996), among others, acknowledged the importance of preprocessing low-frequency events, and Joanis et al. (2007) also decided to smooth the data, even though they worked with more than 1000 occurrences per verb in the BNC.

Page 9

Our smoothing experiment: Harmonization based on linguistic information

Page 10

Intuitively: how likely is it that a 0 is just an unobserved feature and not a true 0, given the values of the other observations?

To classify Abstract/Concrete nouns in English:

Cue 1 is suffixes such as “-ness”, “-ism”, … for Abstract (Light 1996)

Cue 2 is determiners such as “such”, “little”, “much”, … for Abstract

Cue 3 is adjectives like “big”, “small”, … for Concrete

P(cue_1=1|[0,1,0]) = P(abstract=yes|[0,1,0]) * P(cue_1=1|abstract=yes) + P(abstract=no|[0,1,0]) * P(cue_1=1|abstract=no)
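For reference, the two-class formula above can be written for any number of classes, with the class posterior obtained from Bayes' rule. The symbols are assumptions introduced here for illustration: $v$ stands for the observed cue vector, $k$ ranges over the classes, and $P(k)$ is a class prior that the slides do not specify.

```latex
\[
P(\mathrm{cue}_i = 1 \mid v) \;=\; \sum_{k} P(k \mid v)\, P(\mathrm{cue}_i = 1 \mid k),
\qquad
P(k \mid v) \;=\; \frac{P(v \mid k)\, P(k)}{\sum_{k'} P(v \mid k')\, P(k')}
\]
```

With the two classes abstract=yes and abstract=no, the left-hand equation reduces to the sum of two products given above.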

Page 11

• We use the information of observed features to assess the likelihood of a particular unobserved cue.

• Harmonization substitutes 0 values with the likelihood of their being 1, given the other cues observed.

• BUT …

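In order to get P(cue_1=1|[0,1,0]) we need to have P(cue_n|class) for all cues in the vector, because the probability of a vector $v$ given a class $k$ is the product of the probabilities of its cues $j_i$:

$P(v \mid k) = \prod_i P(j_i \mid k)$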

Page 12

The challenge: how to get P(cue_n|class) with so many 0’s in the data… ?

By estimating the P(cue_n|class) with linguistic information

Cue value     Abstract   Concrete
Suffix=no     0.5        1.0
Suffix=yes    0.5        0.0
SC_Adj=no     1.0        0.5
SC_Adj=yes    0.0        0.5

“The probability of being Concrete and having suffix “ness” is 0”
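The harmonization idea can be sketched in a few lines using the illustrative probabilities from the table above. This is not the author's implementation. It rests on three assumptions that the slides do not state: counts are first reduced to presence or absence, zeros are treated as unobserved and contribute no evidence to the class posterior, and the class prior is uniform. All names (`P_CUE_GIVEN_CLASS`, `class_posterior`, `harmonize`) are hypothetical.

```python
# P(cue = 1 | class), read off the illustrative table above.
P_CUE_GIVEN_CLASS = {
    "abstract": {"suffix": 0.5, "sc_adj": 0.0},
    "concrete": {"suffix": 0.0, "sc_adj": 0.5},
}
CUES = ["suffix", "sc_adj"]
# Assumption: a uniform prior over classes (not given in the slides).
PRIOR = {"abstract": 0.5, "concrete": 0.5}


def class_posterior(observed):
    """P(class | observed cues), using only cues seen at least once.

    Assumption: zeros are treated as unobserved and contribute no evidence.
    """
    scores = {}
    for label, prior in PRIOR.items():
        score = prior
        for cue, seen in zip(CUES, observed):
            if seen:
                score *= P_CUE_GIVEN_CLASS[label][cue]
        scores[label] = score
    total = sum(scores.values())
    if total == 0:  # contradictory evidence: fall back to the prior
        return dict(PRIOR)
    return {label: s / total for label, s in scores.items()}


def harmonize(counts):
    """Binarize counts and replace zeros by P(cue = 1 | observed cues)."""
    observed = [1 if c > 0 else 0 for c in counts]
    posterior = class_posterior(observed)
    result = []
    for cue, seen in zip(CUES, observed):
        if seen:
            result.append(1.0)
        else:
            result.append(
                sum(posterior[k] * P_CUE_GIVEN_CLASS[k][cue] for k in PRIOR)
            )
    return result


print(harmonize([0, 3]))  # SC_Adj cue observed three times, suffix not observed
print(harmonize([0, 0]))  # nothing observed
```

Under these assumptions, a noun seen with a Concrete-only cue keeps a 0 for the suffix cue, while a noun with no observations at all receives small non-zero values for every cue.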

Page 13

Harmonization effects in Spanish Mass experiment

Type: agua (‘water’)
Frequency: 0,3,0,1,0,1,1,0,1,0,0,1,1,0
Harmonized: 0,1,0,1,0,1,1,0,1,0,0,1,1,0

Type: acero (‘steel’)
Frequency: 1,2,0,0,0,2,1,1,2,0,0,0,0,0
Harmonized: 1,1,0.5,0.5,0.5,1,1,1,1,0,0,0,0,0

Type: desabastecimiento (‘shortage’)
Frequency: 0,0,0,0,0,0,1,0,0,0,0,0,0,0
Harmonized: 0.5,0.5,0.5,0.5,0.5,0.5,1,0.5,0.5,0,0,0,0,0

Type: aceptabilidad (‘acceptability’)
Frequency: 0,0,0,0,0,0,0,0,0,0,0,0,0,0
Harmonized: 0.02,0.02,0.02,0.02,0.02,0.02,0.02,0.02,0.02,0.47,0.47,0.47,0.47,0.47

Page 14

Results of the experiments

                Spanish Mass      English Abstract
Experiment      DT      SVM       DT      SVM
Mean            74.2    63.8      57.8    61.0
Trimmed mean    77.5    67.4      55.6    61.0
Frequency       79.9    79.1      61.4    64.1
Harmonized      82.8    80.7      76.1    70.1
Baseline        74.8              61.5

Page 15

Error Analysis & Future work

• The frequency information that could be used to filter noise has been neutralized.

• Future work is about how to handle missing values and noise together.

Page 16

Thanks for your attention!