Analyzing Neural Language Models: Introduction
Shane Steinert-Threlkeld
Jan 9, 2020
Today’s Plan
● Motivation / background
● NLP’s “ImageNet moment”
● NLP’s “Clever Hans moment”
● 15-minute break
● Course information / logistics
Motivation
What is ImageNet?
● Large dataset; v1 in 2009
● Object classification (among other tasks):
  ● Input: image
  ● Label: synsets from WordNet
● ~14M images currently
● http://www.image-net.org
Why is ImageNet Important?
1. Deep learning
2. Transfer learning
ILSVRC
● ImageNet Large Scale Visual Recognition Challenge
● Annual competition on a standard benchmark (2010–2017)
● ~1.2M training images, 1000 categories
● http://www.image-net.org/challenges/LSVRC/
ILSVRC results
What happened in 2012?
ILSVRC 2012: winner
● “AlexNet” (NeurIPS 2012 paper)
Deep Learning Tidal Wave
● VGG16
● Inception
● ResNet (34 layers shown; up to 152 in the paper)
Transfer Learning
“We use features extracted from the OverFeat network as a generic image representation to tackle the diverse range of recognition tasks of object image classification, scene recognition, fine grained recognition, attribute detection and image retrieval applied to a diverse set of datasets. We selected these tasks and datasets as they gradually move further away from the original task and data the OverFeat network was trained to solve [cf. ImageNet]. Astonishingly, we report consistent superior results compared to the highly tuned state-of-the-art systems in all the visual classification tasks on various datasets.”
Standard Learning
● One independent model per task: Task 1 inputs → Task 1 outputs; Task 2 inputs → Task 2 outputs; Task 3 inputs → Task 3 outputs; Task 4 inputs → Task 4 outputs
Standard Learning
● New task = new model
● Expensive!
  ● Training time
  ● Storage space
  ● Data availability
● Can be impossible in low-data regimes
Transfer Learning
● First, train a single model on a “pre-training” task (pre-training task inputs → pre-training task outputs)
● Then reuse that model for Task 1, Task 2, Task 3, …: each task’s inputs go through the shared model to that task’s outputs
● Pre-trained model, either:
  ● General feature extractor
  ● Fine-tuned on tasks
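The two reuse regimes (frozen feature extractor vs. fine-tuning) can be sketched in a few lines of numpy. Everything here is hypothetical and illustrative: the linear “encoder,” the per-task heads, and the dimensions stand in for a real pre-trained network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "pre-trained" encoder: a fixed map learned on the
# pre-training task. In the feature-extractor regime its weights are
# frozen; each downstream task only trains a small head on top.
W_pretrained = rng.normal(size=(8, 4))  # maps 8-dim inputs to 4-dim features

def encode(x, W=W_pretrained):
    """Shared representation used by every downstream task."""
    return np.tanh(x @ W)

# One tiny linear head per task -- this is all that is trained per task.
head_task1 = rng.normal(size=(4, 3))   # e.g. 3-way classification
head_task2 = rng.normal(size=(4, 1))   # e.g. regression

x = rng.normal(size=(2, 8))            # a batch of 2 inputs
features = encode(x)                   # computed once, reused by both heads
out1 = features @ head_task1           # Task 1 predictions, shape (2, 3)
out2 = features @ head_task2           # Task 2 predictions, shape (2, 1)

# Fine-tuning regime: copy the encoder weights and let the task update
# them too, rather than keeping them frozen.
W_finetuned = W_pretrained + 0.01 * rng.normal(size=W_pretrained.shape)
out1_ft = encode(x, W_finetuned) @ head_task1
```

The design point is that the expensive part (the encoder) is shared, while each new task adds only a cheap head.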
Example: Scene Parsing
Transfer Learning in NLP
Where to transfer from?
● Goal: find a linguistic task that will build general-purpose / transferable representations
● Possibilities:
  ● Constituency or dependency parsing
  ● Semantic parsing
  ● Machine translation
  ● QA
  ● …
● Scalability issue: all of these require expensive annotation
Language Modeling
● Recent innovation: use language modeling (a.k.a. next-word prediction*)
  ● [*: we will talk about variations later in the seminar]
● Linguistic knowledge:
  ● The students were happy because ____ …
  ● The student was happy because ____ …
● World knowledge:
  ● The POTUS gave a speech after missiles were fired by _____
  ● The Seattle Sounders are so-named because Seattle lies on the Puget _____
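Why would next-word prediction force a model to learn something like agreement? Even a toy count-based model shows the mechanism. This minimal bigram sketch over a made-up four-sentence corpus is illustrative only; real LMs train neural networks on billions of tokens.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus (hypothetical).
corpus = [
    "the student was happy because he passed",
    "the student was tired because she studied",
    "the students were happy because they passed",
    "the students were loud because they celebrated",
]

# Bigram counts: how often does `nxt` follow `prev`?
bigrams = defaultdict(Counter)
for sentence in corpus:
    toks = sentence.split()
    for prev, nxt in zip(toks, toks[1:]):
        bigrams[prev][nxt] += 1

def p_next(prev, nxt):
    """Conditional probability P(nxt | prev) under the bigram counts."""
    total = sum(bigrams[prev].values())
    return bigrams[prev][nxt] / total if total else 0.0

# Agreement falls out of next-word prediction: after "students" the model
# prefers "were"; after "student" it prefers "was".
print(p_next("students", "were"), p_next("student", "was"))
```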
Language Modeling is “Unsupervised”
● An example of “unsupervised” or “semi-supervised” learning
● NB: I think that “un-annotated” is a better term. Formally, the learning is supervised, but the labels come directly from the data, not from an annotator.
● E.g.: “Today is the first day of 575.”
  ● (<s>, Today)
  ● (<s> Today, is)
  ● (<s> Today is, the)
  ● (<s> Today is the, first)
  ● …
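Generating those (prefix, next-word) pairs is purely mechanical, which is the point: the labels come for free from the text itself. A minimal sketch (whitespace tokenization assumed for simplicity):

```python
def lm_pairs(sentence):
    """Turn one raw sentence into (prefix, next-word) training pairs.

    The "labels" are just the next tokens, so no annotator is needed:
    supervision comes directly from the data.
    """
    tokens = ["<s>"] + sentence.split()
    return [(" ".join(tokens[:i]), tokens[i]) for i in range(1, len(tokens))]

pairs = lm_pairs("Today is the first day of 575.")
# First few pairs: ("<s>", "Today"), ("<s> Today", "is"), ("<s> Today is", "the"), ...
```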
Data for LM is cheap
Text is abundant
● News sites (e.g. Google 1B)
● Wikipedia (e.g. WikiText-103)
● …
● General web crawling: https://commoncrawl.org/
The Revolution will not be [Annotated]
● Yann LeCun: https://twitter.com/rgblong/status/916062474545319938?lang=en
ULMFiT
● Universal Language Model Fine-tuning for Text Classification (ACL ’18)
Deep Contextualized Word Representations (Peters et al. 2018)
● NAACL 2018 Best Paper Award
● Embeddings from Language Models (ELMo)
  ● [a.k.a. the OG NLP Muppet]
Deep Contextualized Word Representations (Peters et al. 2018)
● Comparison to GloVe (nearest neighbors of ‘play’):
  ● GloVe, ‘play’: playing, game, games, played, players, plays, player, Play, football, multiplayer
  ● biLM, “Chico Ruiz made a spectacular play on Alusik’s grounder…”: “Kieffer, the only junior in the group, was commended for his ability to hit in the clutch, as well as his all-round excellent play.”
  ● biLM, “Olivia De Havilland signed to do a Broadway play for Garson…”: “…they were actors who had been handed fat roles in a successful play, and had talent enough to fill the roles competently, with nice understatement.”
Deep Contextualized Word Representations (Peters et al. 2018)
● Used in place of other embeddings on multiple tasks:
  ● SQuAD = Stanford Question Answering Dataset
  ● SNLI = Stanford Natural Language Inference Corpus
  ● SST-5 = Stanford Sentiment Treebank
(figure: Matthew Peters)
BERT: Bidirectional Encoder Representations from Transformers (Devlin et al. 2019)
Initial Results
Major Application
● https://www.blog.google/products/search/search-language-understanding-bert/
Major Application
36
![Page 73: Analyzing Neural Language Models Introduction · Deep Learning Tidal Wave 13 VGG16 Inception ResNet (34 layers above; up to 152 in paper) Transfer Learning “We use features extracted](https://reader034.vdocument.in/reader034/viewer/2022050717/5e8c9c1d45536c7de833beb2/html5/thumbnails/73.jpg)
Pre-trained Neural Models Everywhere
● General Language Understanding Evaluation (GLUE) / SuperGLUE
Sidebar: Word Embeddings
● Aren’t word embeddings like word2vec and GloVe examples of transfer learning?
  ● Yes: they get linguistic representations from raw text to use in downstream tasks
  ● No: they are not meant to be used as general-purpose representations
Sidebar: Word Embeddings
● One distinction:
  ● Global representations (word2vec, GloVe): one vector for each word type (e.g. ‘play’)
  ● Contextual representations (from LMs): representation of a word in context, not independently
● Another distinction:
  ● Shallow (global) vs. deep (contextual) pre-training
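The global-vs-contextual distinction can be made concrete with a toy sketch: one fixed vector per word type versus a vector that also depends on the neighbors. The neighbor-averaging “contextual encoder” below is purely illustrative; real contextual models like ELMo and BERT use deep networks, but the key property is the same.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["she", "saw", "the", "play", "actors", "ball", "kids"]
dim = 4

# Global (word2vec/GloVe-style): one fixed vector per word *type*.
global_emb = {w: rng.normal(size=dim) for w in vocab}

def contextual_emb(tokens, i, window=1):
    """Toy contextual representation: the word's global vector averaged
    with its neighbors'. The resulting vector depends on the sentence."""
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    return np.mean([global_emb[t] for t in tokens[lo:hi]], axis=0)

s1 = ["she", "saw", "the", "play"]   # 'play' as in theater
s2 = ["kids", "play", "ball"]        # 'play' as in activity

g1 = global_emb["play"]              # identical in both sentences
g2 = global_emb["play"]
c1 = contextual_emb(s1, s1.index("play"))
c2 = contextual_emb(s2, s2.index("play"))
# g1 and g2 are the same vector; c1 and c2 differ because the contexts differ.
```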
Global Embeddings: Models
● Mikolov et al. 2013a (the OG word2vec paper)
Shallow vs. Deep Pre-training
● Shallow: raw tokens → global embedding → model for task
● Deep: raw tokens → pre-trained contextual embedding → model for task
NLP’s “Clever Hans Moment”
Clever Hans
● Early 1900s: a horse trained by his owner to do:
  ● Addition
  ● Division
  ● Multiplication
  ● Tell time
  ● Read German
  ● …
● Wow! Hans is really smart!
Clever Hans Effect
44
![Page 85: Analyzing Neural Language Models Introduction · Deep Learning Tidal Wave 13 VGG16 Inception ResNet (34 layers above; up to 152 in paper) Transfer Learning “We use features extracted](https://reader034.vdocument.in/reader034/viewer/2022050717/5e8c9c1d45536c7de833beb2/html5/thumbnails/85.jpg)
Clever Hans Effect● Upon closer examination / experimentation…
44
![Page 86: Analyzing Neural Language Models Introduction · Deep Learning Tidal Wave 13 VGG16 Inception ResNet (34 layers above; up to 152 in paper) Transfer Learning “We use features extracted](https://reader034.vdocument.in/reader034/viewer/2022050717/5e8c9c1d45536c7de833beb2/html5/thumbnails/86.jpg)
Clever Hans Effect● Upon closer examination / experimentation…
● Hans’ success:
44
![Page 87: Analyzing Neural Language Models Introduction · Deep Learning Tidal Wave 13 VGG16 Inception ResNet (34 layers above; up to 152 in paper) Transfer Learning “We use features extracted](https://reader034.vdocument.in/reader034/viewer/2022050717/5e8c9c1d45536c7de833beb2/html5/thumbnails/87.jpg)
Clever Hans Effect● Upon closer examination / experimentation…
● Hans’ success:● 89% when questioner knows answer
44
![Page 88: Analyzing Neural Language Models Introduction · Deep Learning Tidal Wave 13 VGG16 Inception ResNet (34 layers above; up to 152 in paper) Transfer Learning “We use features extracted](https://reader034.vdocument.in/reader034/viewer/2022050717/5e8c9c1d45536c7de833beb2/html5/thumbnails/88.jpg)
Clever Hans Effect● Upon closer examination / experimentation…
● Hans’ success:● 89% when questioner knows answer● 6% when questioner doesn’t know answer
44
![Page 89: Analyzing Neural Language Models Introduction · Deep Learning Tidal Wave 13 VGG16 Inception ResNet (34 layers above; up to 152 in paper) Transfer Learning “We use features extracted](https://reader034.vdocument.in/reader034/viewer/2022050717/5e8c9c1d45536c7de833beb2/html5/thumbnails/89.jpg)
Clever Hans Effect● Upon closer examination / experimentation…
● Hans’ success:● 89% when questioner knows answer● 6% when questioner doesn’t know answer
● Further experiments: as Hans’ taps got closer to correct answer, facial tension in questioner increased
44
![Page 90: Analyzing Neural Language Models Introduction · Deep Learning Tidal Wave 13 VGG16 Inception ResNet (34 layers above; up to 152 in paper) Transfer Learning “We use features extracted](https://reader034.vdocument.in/reader034/viewer/2022050717/5e8c9c1d45536c7de833beb2/html5/thumbnails/90.jpg)
Clever Hans Effect
● Upon closer examination / experimentation…
● Hans’ success rate:
  ● 89% when the questioner knows the answer
  ● 6% when the questioner doesn’t know the answer
● Further experiments: as Hans’ taps got closer to the correct answer, facial tension in the questioner increased
● Hans didn’t solve the task; he exploited a spuriously correlated cue
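The same effect shows up in NLP in miniature. In this hedged sketch (the toy "NLI" examples and the cue classifier are invented for illustration, echoing hypothesis-only artifacts reported in the literature), a model that latches onto a spurious surface cue aces its training distribution but collapses on a challenge set where the cue no longer tracks the label.

```python
# Hans-style shortcut learning: in this toy training set the label
# correlates perfectly with whether the hypothesis contains "not" --
# a spurious cue, not the actual entailment relation.
train = [
    ("a dog runs", "a dog is not running", "contradiction"),
    ("a cat sleeps", "a cat is not awake", "contradiction"),
    ("a man eats", "a man is eating", "entailment"),
    ("a bird flies", "a bird is flying", "entailment"),
]

def cue_classifier(hypothesis):
    # "Learns" only the shortcut, never the task.
    return "contradiction" if "not" in hypothesis.split() else "entailment"

train_acc = sum(cue_classifier(h) == y for _, h, y in train) / len(train)

# Challenge set: the cue is decoupled from the label.
challenge = [
    ("a dog runs", "a dog is running", "entailment"),
    ("a dog runs", "a cat is running", "contradiction"),
]
challenge_acc = sum(cue_classifier(h) == y
                    for _, h, y in challenge) / len(challenge)
print(train_acc, challenge_acc)  # 1.0 0.5
```

Like the 89%-vs-6% split for Hans, the gap between in-distribution and challenge-set accuracy is the diagnostic signature of a spurious cue.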
Central question
● Do BERT et al.’s major successes at solving NLP tasks show that we have achieved robust natural language understanding in machines?
● Or: are we seeing a “Clever BERT” phenomenon?
Results
(performance improves if fine-tuned on this challenge set)
Recent Analysis Explosion
● E.g. the BlackboxNLP workshop [2018, 2019]
● New “Interpretability and Analysis” track at ACL
Why care?
● Benefits of learning what neural language models understand:
  ● Engineering: can help build better language technologies via improved models, data, training protocols, … (trust; critical applications)
  ● Theoretical: can help us understand biases in different architectures (e.g. LSTMs vs Transformers) and similarities to human learning biases
  ● Ethical: e.g. do some models reflect problematic social biases more than others?
Stretch Break!
Course Overview / Logistics
Large Scale
● Motivating question: what do neural language models understand about natural language?
  ● Focus on meaning, where much of the literature has focused on syntax
● A research seminar: in groups, you will design and execute a novel analysis project.
  ● Think of it as a proto-conference-paper, or the seed of a conference paper.
Course structure
● First half: learning about the tools and techniques required
  ● Wk 2: language models [architectures, tasks, data, …]
  ● Wk 3: analysis methods [visualization, probing classifiers, artificial data, …]
  ● Wk 4: resources / datasets [guest lecture by Rachel Rudinger]
  ● Wk 5: technical resources / writing tips
● Be active! Reading, participating, planning ahead
Course structure
● Second half: presentations
  ● Each group will give one “special topic” presentation and lead a discussion, e.g.:
    ● reading a paper or two on a topic related to your final project
    ● explaining a method you are using in your project, issues encountered, etc.
● Final week: project presentation festival!
  ● “Mini conference”, incl. reception
Evaluation● Proposal: 10%
● Special topic presentation: 20%
● Final project presentation: 10%
● Final paper: 50%
● Participation: 10%
Reading List
● Semi- (but not fully) comprehensive list of recent papers on the course website
● Browse it for ideas / inspiration
● Deep-dive on a few later
● NB: I’ll add keywords tomorrow for sorting
Group Formation (HW1)
Three Tasks
● Form groups (more on this next)
● Set up a repository
  ● GitHub, GitLab, patas Git server, …
  ● Make it private for now!
  ● Don’t put private or sensitive data in the repo! (incl. LDC corpora)
● Add the ACL paper template to the repository
  ● https://acl2020.org/calls/papers/#paper-submission-and-templates
  ● This is the format for the final paper
Groups
● There will be eight groups, sized 4–6 people
● Unified grade
● Each group decides how to divide the work, but reports who did what at the end
● Aim to diversify talents / interests within the group:
  ● Experimental design
  ● Data work
  ● Implementation
  ● Experiment running / analysis
  ● Writing
  ● Speaking (presentations)
Communication
● CLMS Student Slack
  ● Useful, since a majority of students in this seminar are already on it
  ● Self-organize (a 575 channel?) based on interests, background competences, etc.
  ● For students not on it yet: there is a Canvas thread for requesting access; CLMS students, please add them ASAP
● For general / non-group discussions, still use Canvas discussions
● NB: I am not on that Slack (nor are other faculty)
Registering Groups
● List your groups here: https://docs.google.com/spreadsheets/d/1ziyww5J49iQ7iE8ElgzMX6rl2cR_cbcKxK89pvZDPF4/edit?usp=sharing
● On Canvas, upload “readme.pdf” with your group # and a screenshot of the repository
Thanks! Looking forward to a great quarter!