Language Models and Transfer Learning
Yifeng Tao
School of Computer Science, Carnegie Mellon University
Slides adapted from various sources (see reference page)
Introduction to Machine Learning
What is a Language Model?
oA statistical language model is a probability distribution over sequences of words
oGiven such a sequence, say of length m, it assigns a probability P(w_1, ..., w_m) to the whole sequence
oMain problem: data sparsity
[Slide from https://en.wikipedia.org/wiki/Language_model.]
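To make the definition concrete, here is a toy sketch (not from the slides) of how a language model scores a whole sequence via the chain rule of probability, P(w_1, ..., w_m) = P(w_1) P(w_2 | w_1) ... P(w_m | w_1, ..., w_{m-1}); the conditional probabilities below are made-up numbers chosen purely for illustration.

```python
# Hypothetical conditional probabilities for a tiny three-word sequence.
cond_probs = {
    ("the",): 0.2,               # P("the")
    ("the", "cat"): 0.05,        # P("cat" | "the")
    ("the", "cat", "sat"): 0.1,  # P("sat" | "the", "cat")
}

def sequence_probability(words):
    """Multiply the conditional probabilities along the sequence (chain rule)."""
    p = 1.0
    for i in range(len(words)):
        p *= cond_probs[tuple(words[: i + 1])]
    return p

p = sequence_probability(["the", "cat", "sat"])
print(p)  # 0.2 * 0.05 * 0.1, i.e. approximately 0.001
```

The data-sparsity problem on this slide is visible even here: every distinct prefix needs its own estimated conditional probability, and most long prefixes never occur in any finite corpus.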
Unigram model: Bag of words
oGeneral probability distribution: P(w_1, ..., w_m) = P(w_1) P(w_2 | w_1) ... P(w_m | w_1, ..., w_{m-1})
oUnigram model assumption: P(w_1, ..., w_m) ≈ P(w_1) P(w_2) ... P(w_m)
oEssentially, a bag-of-words model
oEstimation of unigram params: count word frequency in the document
[Slide from https://en.wikipedia.org/wiki/Language_model.]
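A minimal sketch of the estimation step above, assuming a toy corpus: unigram probabilities are just relative word frequencies, and a sentence's probability is the product of its words' probabilities under the bag-of-words assumption.

```python
from collections import Counter

# Toy corpus (an assumption for illustration), tokenized by whitespace.
corpus = "the cat sat on the mat the cat slept".split()
counts = Counter(corpus)
total = len(corpus)  # 9 tokens

def unigram_prob(word):
    # Maximum-likelihood estimate: count(word) / total number of tokens.
    return counts[word] / total

def sentence_prob(sentence):
    # Unigram assumption: words are independent, so probabilities multiply.
    p = 1.0
    for w in sentence.split():
        p *= unigram_prob(w)
    return p

print(unigram_prob("the"))       # count("the") / total = 3/9
print(sentence_prob("the cat"))  # (3/9) * (2/9)
```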
n-gram model
on-gram assumption: P(w_i | w_1, ..., w_{i-1}) ≈ P(w_i | w_{i-n+1}, ..., w_{i-1})
oEstimation of n-gram params: P(w_i | w_{i-n+1}, ..., w_{i-1}) = count(w_{i-n+1}, ..., w_i) / count(w_{i-n+1}, ..., w_{i-1})
[Slide from https://en.wikipedia.org/wiki/Language_model.]
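The counting estimate above can be sketched for the bigram case (n = 2); the two-sentence corpus and the `<s>`/`</s>` boundary markers are assumptions made for the example.

```python
from collections import Counter

# Toy corpus of two sentences with start/end markers (an assumption).
tokens = "<s> the cat sat </s> <s> the dog sat </s>".split()

# Count bigrams and unigrams from the token stream.
bigrams = Counter(zip(tokens, tokens[1:]))
unigrams = Counter(tokens)

def bigram_prob(prev, word):
    # MLE: count(prev, word) / count(prev).
    return bigrams[(prev, word)] / unigrams[prev]

print(bigram_prob("the", "cat"))  # count("the cat")=1, count("the")=2 -> 0.5
print(bigram_prob("the", "dog"))  # also 0.5
```

Data sparsity shows up immediately: any bigram never seen in the corpus gets probability zero, which is why real n-gram models add smoothing.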
Word2Vec
oWord2Vec: learns distributed representations of words
oContinuous bag-of-words (CBOW)
oPredicts the current word from a window of surrounding context words
oContinuous skip-gram
oUses the current word to predict the surrounding window of context words
oSlower, but does a better job for infrequent words
[Slide from https://www.tensorflow.org/tutorials/representation/word2vec.]
Skip-gram Word2Vec
oAll words in the vocabulary: w ∈ {1, ..., V}
oParameters of the skip-gram word2vec model:
oWord embedding for each word: v_w
oContext embedding for each word: u_w
oAssumption: P(o | c) = exp(u_o · v_c) / Σ_w exp(u_w · v_c)
[Slide from https://www.tensorflow.org/tutorials/representation/word2vec.]
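The skip-gram parameterization above can be sketched in a few lines: each word gets a word (center) embedding v_w and a separate context embedding u_w, and the probability of a context word given a center word is a softmax over dot products. The random embeddings here are stand-ins; a trained model would learn them by gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "mat"]  # toy vocabulary (an assumption)
dim = 4

V = rng.normal(size=(len(vocab), dim))  # word (center) embeddings v_w
U = rng.normal(size=(len(vocab), dim))  # context embeddings u_w

def context_probs(center_idx):
    """P(o | c) for every candidate context word o: softmax of u_o . v_c."""
    scores = U @ V[center_idx]
    exp = np.exp(scores - scores.max())  # subtract max for numerical stability
    return exp / exp.sum()

p = context_probs(vocab.index("cat"))
print(p.sum())  # a proper distribution: sums to 1
```

Note the softmax normalizes over the entire vocabulary, which is why practical implementations use approximations such as negative sampling.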
Distributed Representations of Words
oThe trained parameters of words in the skip-gram word2vec model
oSemantics and the embedding space
[Slide from https://www.tensorflow.org/tutorials/representation/word2vec.]
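As an illustration of the semantics-in-embedding-space idea, here is a toy analogy check of the kind trained embeddings famously support (king - man + woman ≈ queen). The tiny hand-made vectors are an assumption, chosen so the relationship is visible; they are not the output of a trained model.

```python
import numpy as np

# Hand-crafted 3-d vectors (hypothetical, for illustration only).
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
}

def cosine(a, b):
    """Cosine similarity: directions matter, magnitudes do not."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Vector arithmetic for the analogy, then nearest-neighbor lookup.
target = emb["king"] - emb["man"] + emb["woman"]
best = max(emb, key=lambda w: cosine(emb[w], target))
print(best)  # "queen"
```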
Word Embeddings in Transfer Learning
oTransfer learning:
oLabeled data are limited
oUnlabeled text corpora are enormous
oPretrained word embeddings can be transferred to other supervised tasks, e.g., POS tagging, NER, QA, MT, sentiment classification
[Slide from Matt Gormley.]
SOTA Language Models: ELMo
oEmbeddings from Language Models: ELMo
oFits the full conditional probability in the forward direction: p(t_1, ..., t_N) = ∏_k p(t_k | t_1, ..., t_{k-1})
oFits the full conditional probability in both directions using LSTMs; the backward LM models p(t_k | t_{k+1}, ..., t_N)
[Slide from https://arxiv.org/pdf/1802.05365.pdf and https://arxiv.org/abs/1810.04805.]
SOTA Language Models: OpenAI GPT & BERT
oUses the Transformer rather than LSTM to model language
oOpenAI GPT: single direction (left-to-right)
oBERT: bi-directional
[Slide from https://arxiv.org/abs/1810.04805.]
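The directionality difference above comes down to the attention mask, sketched here with numpy: a GPT-style causal mask lets position i attend only to positions j ≤ i, while a BERT-style encoder attends in both directions. The sequence length is arbitrary.

```python
import numpy as np

seq_len = 4

# GPT-style causal mask: lower-triangular, so each position sees
# only itself and earlier positions.
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

# BERT-style bidirectional mask: every position attends to every position.
bidir_mask = np.ones((seq_len, seq_len), dtype=bool)

print(causal_mask.astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
print(int(bidir_mask.sum()))  # 16: all position pairs allowed
```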
SOTA Language Models: BERT
oAdditional language modeling task: predict whether two sentences come from the same paragraph (next-sentence prediction).
[Slide from https://arxiv.org/abs/1810.04805.]
SOTA Language Models: BERT
oInstead of extracting embeddings and hidden-layer outputs, BERT can be fine-tuned for specific supervised learning tasks.
[Slide from https://arxiv.org/abs/1810.04805.]
The Transformer and Attention Mechanism
oAn encoder-decoder structure
oOur focus: the encoder and the attention mechanism
[Slide from https://jalammar.github.io/illustrated-transformer/.]
The Transformer and Attention Mechanism
oSelf-attention
oIgnores positions of words; assigns weights globally
oCan be parallelized, in contrast to LSTM
oE.g., the attention weights related to the word "it_":
[Slide from https://jalammar.github.io/illustrated-transformer/.]
Self-attention Mechanism
[Slide from https://jalammar.github.io/illustrated-transformer/.]
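The mechanism this slide illustrates can be written as a minimal numpy sketch: project the input embeddings to queries Q, keys K, and values V, then output softmax(Q Kᵀ / √d_k) V (scaled dot-product attention). The random projection matrices are stand-ins for learned weights.

```python
import numpy as np

rng = np.random.default_rng(1)
seq_len, d_model, d_k = 3, 8, 4  # arbitrary toy sizes

X = rng.normal(size=(seq_len, d_model))  # one token embedding per row
W_q = rng.normal(size=(d_model, d_k))    # learned in a real model
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

def self_attention(X):
    """Scaled dot-product self-attention over one sequence."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(d_k)               # all-pairs similarity scores
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: rows sum to 1
    return weights @ V                            # weighted sum of values

out = self_attention(X)
print(out.shape)  # (3, 4): one d_k-dimensional output per token
```

Because every output row is computed from the same matrix products, all positions are processed in parallel, which is exactly the advantage over an LSTM's sequential recurrence noted on the previous slide.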
Self-attention Mechanism
oMore: https://jalammar.github.io/illustrated-transformer/
[Slide from https://jalammar.github.io/illustrated-transformer/.]
Take home message
oLanguage models suffer from data sparsity
oWord2vec portrays language probability using distributed word embedding parameters
oELMo, OpenAI GPT, and BERT model language using deep neural networks
oPre-trained language models or their parameters can be transferred to supervised learning problems in NLP
oSelf-attention has the advantage over LSTM that it can be parallelized and considers interactions across the whole sentence
References
oWikipedia. Language model: https://en.wikipedia.org/wiki/Language_model
oTensorFlow. Vector Representations of Words: https://www.tensorflow.org/tutorials/representation/word2vec
oMatt Gormley. 10-601 Introduction to Machine Learning: http://www.cs.cmu.edu/~mgormley/courses/10601/index.html
oMatthew E. Peters et al. Deep Contextualized Word Representations: https://arxiv.org/pdf/1802.05365.pdf
oJacob Devlin et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding: https://arxiv.org/abs/1810.04805
oJay Alammar. The Illustrated Transformer: https://jalammar.github.io/illustrated-transformer/