
NLP 14: Interpretability

Logistics
Project Presentations: Nov 30th
• Each team will get about 10 minutes (6 minutes for presentation + 4 minutes for QA)
• "Forced QA": each team will be matched with another team to ask questions on their project
• Schedule: https://cs.gmu.edu/~antonis/course/cs695-fall20/week15/
Final Projects Due: Dec 6th

Antonis Anastasopoulos
https://cs.gmu.edu/~antonis/course/cs695-fall20/
All Datasets Have Their Biases
• No matter the task, data bias matters
• Domain bias
A Case Study: bAbI (Weston et al. 2014)
• Automatically generate synthetic text aimed at evaluating whether a model can learn certain characteristics of language
• Problem: papers evaluate only on this extremely simplified dataset, then make claims about models' ability to learn language
• Extra Credit: Write a Prolog solver for bAbI!
An Examination of CNN/Daily Mail (Chen et al. 2016)
• Even automatically created datasets drawn from real text have problems!
• An analysis of CNN/Daily Mail revealed that very few examples required multi-sentence reasoning, and many were too difficult due to anonymization or preprocessing errors
Adversarial Examples in Machine Reading (Jia and Liang 2017)
• Add a sentence or word string specifically designed to distract the model
• Drops the accuracy of state-of-the-art models from 81 to 46
Adversarial Creation of New Datasets? (Zellers et al. 2018)
• Idea: create datasets that current models do poorly on, but humans do well
• Process:
• Automatically generate a large pool of candidate examples
• Find ones that the QA model does poorly on
• Have humans filter for naturalness
Natural Questions (Kwiatkowski et al. 2019)
• Opposite approach: collect naturally occurring questions (real search queries) and annotate answers, instead of constructing examples adversarially
Interpretability
Why interpretability?
• Task: predict the probability of death for patients with pneumonia
• Why: so that high-risk patients can be admitted, and low-risk patients can be treated as outpatients
• AUC of neural networks > AUC of logistic regression
• But a rule-based classifier learned HasAsthma(x) → LowerRisk(x): asthmatic patients looked lower-risk only because they historically received more intensive care
• Example from Caruana et al.
Why interpretability?
• Legal reasons: uninterpretable models may effectively be banned; the GDPR in the EU necessitates a "right to explanation"
• Distribution shift: a deployed model might perform poorly in the wild
• User adoption: users are happier with explanations
• Better human-AI interaction and control
• Debugging machine learning models
If only we could understand model.ckpt
• Can we explain the outcome in "understandable terms"?
• Global interpretation: what has the model M learned?
• Output: the extent to which M captures a property P
• Techniques: classification/regression probes
• Local interpretation: why did the model make this particular prediction?
• Output: an explanation E
What is the model learning?
Source Syntax in NMT
• Does String-Based Neural MT Learn Source Syntax? (Shi et al. EMNLP 2016)
• Probe the encoder's representations for 5 syntactic properties of the source sentence
• Note: LSTMs can learn to count, whereas GRUs cannot do unbounded counting (Weiss et al. ACL 2018)
Fine-grained analysis of sentence embeddings
• Sentence representations: word vector averaging, hidden states of an LSTM
• Auxiliary tasks: predicting sentence length, word order, content
• Findings:
- the hidden states of the LSTM capture length, word order, and content to a great degree
- the word vector averaging (CBOW) model captures content, but surprisingly also length (!) and word order (!!)
Adi et al. ICLR 2017
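As a concrete illustration, here is a minimal probing sketch in the spirit of these papers: train a simple classifier to predict an auxiliary property (binned sentence length) from sentence embeddings. The data here is hypothetical random stand-ins; in practice the embeddings would come from a real encoder.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical data: 1000 sentence embeddings (dim 300) and the binned
# length of each sentence (0 = short, 1 = medium, 2 = long).
embeddings = rng.normal(size=(1000, 300))
length_bins = rng.integers(0, 3, size=1000)

X_train, X_test, y_train, y_test = train_test_split(
    embeddings, length_bins, test_size=0.2, random_state=0)

# If the probe beats a majority-class baseline, the embeddings encode
# (at least some) information about sentence length.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))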
What you can cram into a single vector: Probing sentence embeddings for linguistic properties
• "you cannot cram the meaning of a whole %&!$# sentence into a single $&!#* vector" (Ray Mooney)
• Design 10 probing tasks: length, word content, bigram shift, tree depth, top constituency, tense, subject number, object number, semantically odd man out, coordination inversion
• Test BiLSTM-last, BiLSTM-max, and Gated ConvNet encoders
Conneau et al. ACL 2018
Issues with probing
• Minimum description length (MDL) probing (Voita and Titov, 2020) characterizes both probe quality and the amount of effort needed to achieve it
• This makes it more informative and stable than plain probe accuracy
[Diagram: training phase and test phase, each taking an input x and predicting f(x)]
Explanation Technique: Influence Functions
• What would happen if a given training point didn’t exist?
• Retraining the network for each removed point is prohibitively slow, so approximate the effect using influence functions
Most influential training images
Koh & Liang, ICML 2017
Explanation Technique: Attention
• Examples: document classification (Yang et al. 2016), image captioning (Xu et al. 2015)
Findings from "Attention is not Explanation" (Jain and Wallace, 2019):
1. Attention is only mildly correlated with other importance-score techniques
2. Counterfactual attention weights should yield different predictions, but they do not

"Attention might be an explanation" (Wiegreffe and Pinter, 2019):
• Attention scores can provide a (plausible) explanation, not the explanation
• Attention is not explanation if you don't need it
• Agree that attention is indeed manipulable: "this should provide pause to researchers who are looking to attention distributions for one true, faithful interpretation of the link their model has established between inputs and outputs."
• Manipulated models perform better than no-attention models
• Elucidate some workarounds (what happens behind the scenes)
Multi-Task Multi-lingual Models
Remember, Neural Nets are Feature Extractors!
• Create a vector representation of sentences or words for use in downstream tasks
[Diagram: the sentence "this is an example" is encoded into a vector representation]
• In many cases, the same representation can be used in multiple tasks (e.g. word embeddings)
Reminder: Types of Learning
• Multi-task learning is a general term for training on multiple tasks
• Transfer learning is a type of multi-task learning where we only really care about one of the tasks
• Domain adaptation is a type of transfer learning, where the output is the same, but we want to handle different topics or genres, etc.
Methods for Multi-task Learning
Standard Multi-task Learning
• Train representations to do well on multiple tasks at once
• In general, as simple as randomly choosing a minibatch from one of the multiple tasks, as in the sketch below
• Many, many examples, starting with Collobert and Weston (2011)
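A minimal sketch of that training loop; `encoder`, the task `heads`, and the `batches` loaders are hypothetical stand-ins for real models and data:

import random
import torch

def train_multitask(encoder, heads, batches, steps=1000, lr=1e-3):
    """heads: dict task_name -> task-specific output layer
       batches: dict task_name -> function returning (inputs, labels)"""
    params = list(encoder.parameters())
    for head in heads.values():
        params += list(head.parameters())
    opt = torch.optim.Adam(params, lr=lr)

    for _ in range(steps):
        task = random.choice(list(heads))     # pick a task at random
        inputs, labels = batches[task]()      # draw a minibatch from it
        logits = heads[task](encoder(inputs)) # shared encoder, task-specific head
        loss = torch.nn.functional.cross_entropy(logits, labels)
        opt.zero_grad()
        loss.backward()
        opt.step()

The shared encoder sees gradients from every task, while each head only sees its own.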
Pre-training (Already Covered)
[Diagram: a translation encoder pre-trained on "this is an example" is used to initialize a tagging encoder]
• Also pre-training sentence representations (Dai et al. 2015)
Selective Parameter Adaptation
• Sometimes it is better to adapt only some of the parameters
• e.g. in cross-lingual transfer for neural MT, Zoph et al. (2016) examine which parameters are best to adapt, as sketched below
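A sketch of what selective adaptation can look like in PyTorch; which submodules to leave trainable (here, embeddings) is an illustrative assumption, not Zoph et al.'s exact recipe:

import torch

def freeze_all_but(model, trainable_prefixes=("embed",)):
    # Freeze every parameter whose name does not match a trainable prefix.
    for name, param in model.named_parameters():
        param.requires_grad = any(name.startswith(p) for p in trainable_prefixes)
    # Only the un-frozen parameters go to the optimizer.
    return torch.optim.Adam(
        [p for p in model.parameters() if p.requires_grad], lr=1e-4)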
Soft Parameter Tying
• It is also possible to share parameters loosely between various tasks
• Parameters are regularized to be closer, but not tied in a hard fashion (e.g. Duong et al. 2015)
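One way to implement this, sketched as an L2 penalty pulling corresponding parameters of two task-specific models toward each other (names and weight are illustrative):

import torch

def tying_penalty(model_a, model_b, weight=0.01):
    penalty = 0.0
    for (na, pa), (nb, pb) in zip(model_a.named_parameters(),
                                  model_b.named_parameters()):
        assert na == nb  # assumes identical architectures
        penalty = penalty + (pa - pb).pow(2).sum()
    return weight * penalty

# Usage: loss = task_a_loss + task_b_loss + tying_penalty(model_a, model_b)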
Multiple Annotation Standards
• For analysis tasks, it is possible to have different annotation standards
• Solution: train models that adjust to annotation standards for tasks such as semantic parsing (Peng et al. 2017).
• We can even adapt to individual annotators! (Guan et al. 2017)
Domain Adaptation
• Basically one task, but incoming data could be from very different distributions (e.g. different topics, genres, or language varieties)
• Often we have a big grab-bag of all domains, and want to tailor to a specific domain
• Two settings: supervised and unsupervised
Supervised/Unsupervised Adaptation
• Supervised adaptation: we have labeled data in the target domain
• Simple pre-training on all data, then tailoring to domain-specific data (Luong et al. 2015)
• Learning domain-specific networks/features
• Matching distributions over features
• e.g. train general-domain and domain-specific feature extractors, then sum their results (Kim et al. 2016)
• Append a domain tag to the input (Chu et al. 2016):
<news> news text
<med> medical text
Multi-lingual Models
Multilingual Learning
• We would like to learn models that process multiple languages
• Why?
• Transfer Learning: Improve accuracy on lower- resource languages by transferring knowledge from higher-resource languages
• Memory Savings: Use one model for all languages, instead of one for each
High-level Multilingual Learning Flowchart
[Flowchart: branches on "memory constraints?" (yes/no)]
Multi-lingual Sequence-to-sequence Models
• It is possible to translate into several languages by adding a tag indicating the target language (Johnson et al. 2016, Ha et al. 2016):
<fr> this is an example → ceci est un exemple
<ja> this is an example → これは例です
• Potential to allow for “zero-shot” learning: train on fr↔en and ja↔en, and use on fr↔ja
• Works, but not as effective as translating fr→en→ja
Multi-lingual Pre-training
• Language model pre-training has been shown to be effective for many NLP tasks, e.g. BERT
• BERT uses masked language modeling (MLM) and next sentence prediction (NSP) objectives
• Models such as mBERT, XLM, and XLM-R extend BERT to multi-lingual pre-training
Multi-lingual Pre-training
• mBERT: MLM with word-piece tokenization, trained on many languages
• XLM [Lample and Conneau, 2019]: MLM with byte-pair encoding shared across languages
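For reference, a common way to use such a pre-trained multilingual encoder is via the HuggingFace transformers library; the checkpoint name here is the public XLM-R base model:

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base")

# One encoder, many languages:
for sent in ["this is an example", "ceci est un exemple"]:
    inputs = tokenizer(sent, return_tensors="pt")
    hidden = model(**inputs).last_hidden_state  # contextual embeddings
    print(sent, hidden.shape)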
Difficulties in Fully Multi-lingual Learning
• For a fixed-size model, the per-language capacity decreases as we increase the number of languages [Siddhant et al. 2020]
• Increasing the number of low-resource languages → a decrease in the quality of high-resource language translations [Aharoni et al. 2019]
(Source: Conneau et al. 2019)
Data Balancing
• A temperature-based strategy is used to control the ratio of samples from different languages.
• For each language l, sample a sentence with probability
  \( p_l \propto q_l^{1/T}, \quad q_l = \frac{D_l}{\sum_k D_k} \)
  where \(D_l\) is the corpus size of language l and T is the temperature.
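A small sketch of how these probabilities behave (corpus sizes are illustrative): T = 1 reproduces the raw data distribution, while larger T up-samples low-resource languages.

def sampling_probs(corpus_sizes: dict, T: float) -> dict:
    total = sum(corpus_sizes.values())
    q = {l: n / total for l, n in corpus_sizes.items()}      # data distribution
    weights = {l: p ** (1.0 / T) for l, p in q.items()}      # temperature-scaled
    z = sum(weights.values())
    return {l: w / z for l, w in weights.items()}            # renormalize

print(sampling_probs({"en": 1_000_000, "sw": 10_000}, T=1))  # ~{en: 0.99, sw: 0.01}
print(sampling_probs({"en": 1_000_000, "sw": 10_000}, T=5))  # much flatter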
Cross-lingual Transfer Learning
• CLTL leverages data from one or more high-resource source languages.
• Popular techniques of CLTL include data augmentation, annotation projection, etc.
Data Augmentation
• Train a model on the combined data [Fadaee et al. 2017, Bergmanis et al. 2017]
• [Lin et al. 2019] provide a method to select which language to transfer from for a given target language
• [Cotterell and Heigold, 2017] find multi-source transfer >> single-source for morphological tagging
What if languages don’t share the same script?
• Use phonological representations to make the similarity between languages apparent
• e.g. [Rijhwani et al. 2019] use a pivot-based entity-linking system for low-resource languages
Annotation Projection
• Induce annotations in the target language using parallel data or a bilingual dictionary [Yarowsky et al. 2001]
Zero-shot Transfer to New Languages
• [Xie et al. 2018] project annotations from high-resource NER data into the target language
• Does not require any training data in the target language

Data Creation, Active Learning
• In order to get in-language training data, Active Learning (AL) can be used
• AL aims to select "useful" data for human annotation that maximizes end-model performance
• [Chaudhary et al, 2019] propose a recipe combining transfer learning with active learning for low-resource NER.
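A minimal sketch of the selection step using entropy-based uncertainty sampling (one common AL criterion, not necessarily Chaudhary et al.'s exact recipe); `predict_proba` is a hypothetical stand-in for the current model:

import math

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_for_annotation(unlabeled, predict_proba, k=100):
    # Score each unlabeled example by prediction entropy and return the
    # k most uncertain ones for human annotation.
    scored = [(entropy(predict_proba(x)), x) for x in unlabeled]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [x for _, x in scored[:k]]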
Working with non-English data
The state of the art in German-English MT on news translation is around 42 BLEU.
What is it for English-German? (~45)
What is it for Chinese-English?
What is it for French-German?
What is it for Gujarati-English?
What is it for Greek-Swahili?
What do the different languages of the world look like?
Mitä tämä lause sanoo? ("What does this sentence say?" in Finnish)
[Examples in other scripts were not preserved in this transcript]

Case Study: Kazakh-English
"what does this sentence mean?" [Kazakh original not preserved]
• Only 97k parallel sentences
• +back-translation
• +distillation, ensembling
Results from: The NiuTrans Machine Translation Systems for WMT19, Li et al. 2019
Case Study: translation between similar languages
Catalan: Què diu aquesta frase?
Spanish: ¿Qué dice esta oración?
Galician: Que di esta frase?
Portuguese: O que esta frase diz?
(all: "What does this sentence say?")
Many similarities to utilize
Let's look at the "similar languages" shared task results
Case Study: Indian subcontinent
&' ? "% ),./? $ ?
?
?
n n? ?
? "$ ' ()*+? & ** +,?
• Phonetic and orthographic similarity
• Transliteration and cognate mining
• Character-level translation
• Issues: text normalization, tokenization
Case Study: Chinese
• Very high resource, but: logographic writing system → huge vocabulary; how to tokenize?
• Character-based decoding can help when translating into Chinese (Bawden et al. 2019)



Case Study: Arabic
• Issue: root-and-pattern morphology
• Preprocessing (tokenization + segmentation)
• Handling dialectal data
• What about linguistically-informed segmentation?
Case Study: African languages
The most important issue is the lack of data and standardized evaluation sets.
This is starting to change, but data can be very noisy
1. Intuition (maaaayyybe ok)
2. Geography (could be misleading)
3. Typological features