
Investigating BERT's Knowledge of Grammar: Five Analysis Methods with NPIs

Alex Warstadt*, Yu Cao*, Ioana Grosu*, Wei Peng*, Hagen Blix*, Yining Nie*, Anna Alsop*, Shikha Bordia*, Haokun Liu*, Alicia Parrish*, Sheng-Fu Wang*, Jason Phang*, Anhad Mohananey*, Phu Mon Htut*, Paloma Jeretic*, and Samuel R. Bowman

New York University (* = equal contribution)

EMNLP 2019, November 6, 2019


Recent research asks:

BERT (and other self-supervised models) clearly knows a lot about language, but what does it know about linguistic phenomenon X?

(Linzen et al., 2016; Wilcox et al., 2018; Warstadt and Bowman, 2019)

But first…

What analysis methods should you use to test for phenomenon-specific knowledge?

Our approach:

One phenomenon: Negative Polarity Items (NPIs)

Five analysis methods:
- Probing tasks
- Masked language modeling
- Acceptability judgments (3 ways)

Conclusions

No one analysis method can show whether a model "knows" a phenomenon. A variety of experimental methods is necessary to tell the full story.

Negative Polarity Items (NPIs)

Expressions (e.g. any, ever) that only appear in a “negative” environment.

(1) Mary has ever been to France. ✗
(2) Mary hasn't ever been to France. ✓

Why NPIs?

They're just really complicated!

...and extensively studied in theoretical linguistics: Fauconnier (1975), Ladusaw (1979), Linebarger (1980), Kadmon & Landman (1993), Giannakidou (1998), Chierchia (2013), and many others.

And in NLP: Marvin & Linzen (2018), Wilcox et al. (2019).

1. The class of licensing environments is heterogeneous.

2. The syntactic scope of the licensor matters.

Our Data

>136k sentences generated by hand-crafted grammars and automatically labeled with Boolean acceptability (grammaticality) judgments.

Vocab of >1000 items encoding fine-grained selectional information.

Scales up similar methods by Ettinger et al. (2016) and Marvin & Linzen (2018).

9 sub-datasets, 1 per licensing environment.

A crowd-worker validation gives >82% agreement with our Boolean labels.
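To make the generation procedure concrete, here is a minimal sketch of template-based generation with automatic Boolean labels. The templates and vocabulary are illustrative only, not the paper's actual grammars (those live in the alexwarstadt/data_generation repo linked under Resources):

```python
# A toy sketch of grammar-based generation with automatic Boolean
# acceptability labels. Illustrative only; not the paper's grammars.
import itertools

LICENSORS = ["hasn't"]      # auxiliaries that license NPIs
NON_LICENSORS = ["has"]     # no NPI licensing
NPIS = ["ever"]
NON_NPIS = ["often"]        # ordinary adverbs, acceptable anywhere
SUBJECTS = ["Mary", "John"]

def generate():
    """Yield (sentence, is_acceptable) pairs from a toy template:
    SUBJ AUX ADV been to France."""
    for subj, adv in itertools.product(SUBJECTS, NPIS + NON_NPIS):
        for aux in LICENSORS + NON_LICENSORS:
            # An NPI is acceptable only under a licensor; a non-NPI
            # adverb is acceptable in either environment.
            acceptable = (adv not in NPIS) or (aux in LICENSORS)
            yield f"{subj} {aux} {adv} been to France.", acceptable

for sentence, label in generate():
    print(label, sentence)
```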

Zero-Shot Learning

Training and evaluating on data generated from the same grammar is too easy.

We train on one sub-dataset (licensing environment) and evaluate on the others, to test how easily BERT generalizes.

Five Analysis Methods

Analysis Method 1: Probing Tasks

Train a classifier on top of BERT without fine-tuning.

Common method in NLP analysis (Adi et al., 2016; Ettinger et al., 2016; Conneau & Kiela, 2018; Tenney et al., 2019)
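A minimal sketch of such a probe, with BERT frozen and only a linear classifier trained on its [CLS] representations (illustrative; the paper's experiments run through the jiant toolkit linked under Resources):

```python
# A minimal probing-classifier sketch on frozen BERT features.
# Illustrative only; not the paper's jiant configuration.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
for p in bert.parameters():
    p.requires_grad = False           # BERT stays frozen: no fine-tuning

probe = nn.Linear(bert.config.hidden_size, 2)   # binary probe
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train_step(sentences, labels):
    inputs = tokenizer(sentences, return_tensors="pt", padding=True)
    with torch.no_grad():
        # Use the [CLS] vector as a fixed sentence representation.
        cls = bert(**inputs).last_hidden_state[:, 0]
    logits = probe(cls)
    loss = loss_fn(logits, torch.tensor(labels))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

print(train_step(["Mary hasn't ever been to France."], [1]))
```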

Probing tasks

Question: Does BERT know the syntactic scope of NPI licensors?

Results: Probing Tasks

Analysis Method 2: Masked Language Modeling

We can use BERT’s pre-training task to test it in a totally unsupervised setting.

Similar methods are used by Linzen et al. (2016) and Wilcox et al. (2018).

Masked Language Modeling

Question: Does BERT assign higher probability to grammatical completions?

John knows <MASK> Betsy has ever been to France.
whether ✓
that ✗

Two Kinds of Minimal Pairs

Licensor-presence:
John knows {whether ✓ / that ✗} Betsy has ever been to France.

NPI-presence:
John knows that Betsy has been to France {often ✓ / ever ✗}.
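A minimal sketch of this comparison using the Hugging Face transformers API (illustrative; not the paper's actual evaluation code):

```python
# Compare BERT's probabilities for two completions of a masked slot.
# A sketch, not the paper's evaluation code.
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def completion_prob(sentence: str, candidate: str) -> float:
    """Probability BERT assigns to `candidate` at the [MASK] position."""
    inputs = tokenizer(sentence, return_tensors="pt")
    mask_idx = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits[0, mask_idx], dim=-1)
    return probs[tokenizer.convert_tokens_to_ids(candidate)].item()

sent = "John knows [MASK] Betsy has ever been to France."
p_whether = completion_prob(sent, "whether")  # grammatical: licenses the NPI
p_that = completion_prob(sent, "that")        # ungrammatical: no licensor
print("prefers grammatical completion:", p_whether > p_that)
```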

Results: Masked Language Modeling

Analysis Method 3: Boolean Acceptability Classifier

The same task is used by Linzen et al. (2016), the CoLA dataset (Warstadt et al., 2019), and Kann et al. (2019).

Classifier:

Mary hasn't eaten any cookies. → Acceptable (grammatical)

Mary has eaten any cookies. → Unacceptable (ungrammatical)
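A minimal sketch of such a classifier, assuming a BERT model already fine-tuned for binary acceptability (illustrative; the paper's classifiers are trained through jiant):

```python
# A sketch of Boolean acceptability classification with fine-tuned BERT.
# Assumes fine-tuning on NPI data has already happened; not the paper's setup.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # 0 = unacceptable, 1 = acceptable
model.eval()

def acceptability(sentence: str) -> float:
    """Probability that `sentence` is acceptable."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()

print(acceptability("Mary hasn't eaten any cookies."))  # should be high
print(acceptability("Mary has eaten any cookies."))     # should be low
```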

Results: Boolean acceptability classification

Analysis Methods 4 & 5: Boolean/Gradient Minimal Pairs

A synthesis of the acceptability classification method and the masked language model method.

Question: Is a supervised acceptability classifier sensitive to minimal differences between sentences?

Boolean minimal pair: the classifier must label both sentences correctly.

Correct:
Only Mary has ever eaten the cake. → Acceptable
Mary has ever eaten only the cake. → Unacceptable

Incorrect:
Only Mary has ever eaten the cake. → Acceptable
Mary has ever eaten only the cake. → Acceptable

Gradient minimal pair: the classifier must prefer (assign a higher acceptability score to) the grammatical sentence.

Correct:
Only Mary has ever eaten the cake. → 0.4
Mary has ever eaten only the cake. → 0.1

Incorrect:
Only Mary has ever eaten the cake. → 0.1
Mary has ever eaten only the cake. → 0.4
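Reusing the hypothetical acceptability() scorer sketched under Analysis Method 3, the two metrics differ only in how a pair is counted as correct (0.5 is an assumed decision threshold for the Boolean case):

```python
# Boolean vs. gradient evaluation on one minimal pair, reusing the
# hypothetical acceptability() scorer from the earlier sketch.
good = "Only Mary has ever eaten the cake."   # grammatical
bad = "Mary has ever eaten only the cake."    # ungrammatical

p_good, p_bad = acceptability(good), acceptability(bad)

# Boolean minimal pair: both hard labels must be right (threshold 0.5).
boolean_correct = (p_good > 0.5) and (p_bad <= 0.5)

# Gradient minimal pair: the grammatical sentence just has to score higher.
gradient_correct = p_good > p_bad

print(f"boolean: {boolean_correct}, gradient: {gradient_correct}")
```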

Results: Scope detection with minimal pairs

Results: Scope detection by environment

Recap: If we had only used...

- Probing, masked language modeling, or Boolean acceptability: BERT encodes properties of NPIs well, but imperfectly.

- Boolean minimal pairs: BERT has pretty weak knowledge of NPIs.

- Gradient minimal pairs: BERT has totally systematic knowledge of NPIs.

Conclusions: What does BERT know about NPIs?

● BERT does have systematic knowledge about NPIs.
● But this knowledge is gradient, not discrete.
● And it can only be observed after giving BERT some amount of supervision.

Summary: What we learn about methodology

● Relying on one method can give misleading results when trying to probe for knowledge of a phenomenon.

● Researchers should compare Boolean and gradient measures.
● And compare supervised and unsupervised evaluation methods.

Resources

Generation Code: https://github.com/alexwarstadt/data_generation

Experiments: https://github.com/nyu-mll/jiant/tree/blimp-and-npi/scripts/bert_npi

NPI Data: https://alexwarstadt.files.wordpress.com/2019/08/npi_lincensing_data.zip

Thank you!

This material is based upon work supported by the National Science Foundation under Grant No. 1850208. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. This project has also benefited from financial support to SB by Samsung Research under the project Improving Deep Learning using Latent Structure and from the donation of a Titan V GPU by NVIDIA Corporation.
