Language Models and Transfer Learning
Yifeng Tao
School of Computer Science, Carnegie Mellon University
Slides adapted from various sources (see reference page)
Introduction to Machine Learning
What is a Language Model?
oA statistical language model is a probability distribution over sequences of words
oGiven such a sequence, say of length m, it assigns a probability P(w_1, ..., w_m) to the whole sequence
oMain problem: data sparsity
[Slide from https://en.wikipedia.org/wiki/Language_model.]
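To make the definition concrete, here is a toy sketch (not from the slides) of how a language model scores a whole sequence via the chain rule of probability, P(w_1, ..., w_m) = P(w_1) P(w_2 | w_1) ... P(w_m | w_1, ..., w_{m-1}); the conditional probabilities below are made-up numbers chosen purely for illustration.

```python
# Hypothetical conditional probabilities for a tiny three-word sequence.
cond_probs = {
    ("the",): 0.2,               # P("the")
    ("the", "cat"): 0.05,        # P("cat" | "the")
    ("the", "cat", "sat"): 0.1,  # P("sat" | "the", "cat")
}

def sequence_probability(words):
    """Multiply the conditional probabilities along the sequence (chain rule)."""
    p = 1.0
    for i in range(len(words)):
        p *= cond_probs[tuple(words[: i + 1])]
    return p

p = sequence_probability(["the", "cat", "sat"])
print(p)  # 0.2 * 0.05 * 0.1, i.e. approximately 0.001
```

The data-sparsity problem on this slide is visible even here: every distinct prefix needs its own estimated conditional probability, and most long prefixes never occur in any finite corpus.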
Unigram model: Bag of words
oGeneral probability distribution: P(w_1, ..., w_m) = P(w_1) P(w_2 | w_1) ... P(w_m | w_1, ..., w_{m-1})
oUnigram model assumption: P(w_1, ..., w_m) ≈ P(w_1) P(w_2) ... P(w_m)
oEssentially, a bag-of-words model
oEstimation of unigram params: count word frequency in the document
[Slide from https://en.wikipedia.org/wiki/Language_model.]
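A minimal sketch of the estimation step above, assuming a toy corpus: unigram probabilities are just relative word frequencies, and a sentence's probability is the product of its words' probabilities under the bag-of-words assumption.

```python
from collections import Counter

# Toy corpus (an assumption for illustration), tokenized by whitespace.
corpus = "the cat sat on the mat the cat slept".split()
counts = Counter(corpus)
total = len(corpus)  # 9 tokens

def unigram_prob(word):
    # Maximum-likelihood estimate: count(word) / total number of tokens.
    return counts[word] / total

def sentence_prob(sentence):
    # Unigram assumption: words are independent, so probabilities multiply.
    p = 1.0
    for w in sentence.split():
        p *= unigram_prob(w)
    return p

print(unigram_prob("the"))       # count("the") / total = 3/9
print(sentence_prob("the cat"))  # (3/9) * (2/9)
```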
n-gram model
on-gram assumption: P(w_i | w_1, ..., w_{i-1}) ≈ P(w_i | w_{i-n+1}, ..., w_{i-1})
oEstimation of n-gram params: P(w_i | w_{i-n+1}, ..., w_{i-1}) = count(w_{i-n+1}, ..., w_i) / count(w_{i-n+1}, ..., w_{i-1})
[Slide from https://en.wikipedia.org/wiki/Language_model.]
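The counting estimate above can be sketched for the bigram case (n = 2); the two-sentence corpus and the `<s>`/`</s>` boundary markers are assumptions made for the example.

```python
from collections import Counter

# Toy corpus of two sentences with start/end markers (an assumption).
tokens = "<s> the cat sat </s> <s> the dog sat </s>".split()

# Count bigrams and unigrams from the token stream.
bigrams = Counter(zip(tokens, tokens[1:]))
unigrams = Counter(tokens)

def bigram_prob(prev, word):
    # MLE: count(prev, word) / count(prev).
    return bigrams[(prev, word)] / unigrams[prev]

print(bigram_prob("the", "cat"))  # count("the cat")=1, count("the")=2 -> 0.5
print(bigram_prob("the", "dog"))  # also 0.5
```

Data sparsity shows up immediately: any bigram never seen in the corpus gets probability zero, which is why real n-gram models add smoothing.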
Word2Vec
oWord2Vec: learns distributed representations of words
oContinuous bag-of-words (CBOW)
oPredicts the current word from a window of surrounding context words
oContinuous skip-gram
oUses the current word to predict the surrounding window of context words
oSlower, but does a better job for infrequent words
[Slide from https://www.tensorflow.org/tutorials/representation/word2vec.]
Skip-gram Word2Vec
oAll words in the vocabulary: w ∈ {1, ..., V}
oParameters of the skip-gram word2vec model:
oWord embedding for each word: v_w
oContext embedding for each word: u_w
oAssumption: P(o | c) = exp(u_o · v_c) / Σ_w exp(u_w · v_c)
[Slide from https://www.tensorflow.org/tutorials/representation/word2vec.]
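The skip-gram parameterization above can be sketched in a few lines: each word gets a word (center) embedding v_w and a separate context embedding u_w, and the probability of a context word given a center word is a softmax over dot products. The random embeddings here are stand-ins; a trained model would learn them by gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "mat"]  # toy vocabulary (an assumption)
dim = 4

V = rng.normal(size=(len(vocab), dim))  # word (center) embeddings v_w
U = rng.normal(size=(len(vocab), dim))  # context embeddings u_w

def context_probs(center_idx):
    """P(o | c) for every candidate context word o: softmax of u_o . v_c."""
    scores = U @ V[center_idx]
    exp = np.exp(scores - scores.max())  # subtract max for numerical stability
    return exp / exp.sum()

p = context_probs(vocab.index("cat"))
print(p.sum())  # a proper distribution: sums to 1
```

Note the softmax normalizes over the entire vocabulary, which is why practical implementations use approximations such as negative sampling.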
Distributed Representations of Words
oThe trained parameters of words in the skip-gram word2vec model
oSemantics and the embedding space
[Slide from https://www.tensorflow.org/tutorials/representation/word2vec.]
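As an illustration of the semantics-in-embedding-space idea, here is a toy analogy check of the kind trained embeddings famously support (king - man + woman ≈ queen). The tiny hand-made vectors are an assumption, chosen so the relationship is visible; they are not the output of a trained model.

```python
import numpy as np

# Hand-crafted 3-d vectors (hypothetical, for illustration only).
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
}

def cosine(a, b):
    """Cosine similarity: directions matter, magnitudes do not."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Vector arithmetic for the analogy, then nearest-neighbor lookup.
target = emb["king"] - emb["man"] + emb["woman"]
best = max(emb, key=lambda w: cosine(emb[w], target))
print(best)  # "queen"
```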
Word Embeddings in Transfer Learning
oTransfer learning:
oLabeled data are limited
oUnlabeled text corpora are enormous
oPretrained word embeddings can be transferred to other supervised tasks, e.g., POS tagging, NER, QA, MT, sentiment classification
[Slide from Matt Gormley.]
SOTA Language Models: ELMo
oEmbeddings from Language Models: ELMo
oFits the full conditional probability in the forward direction: p(t_1, ..., t_N) = ∏_k p(t_k | t_1, ..., t_{k-1})
oFits the full conditional probability in both directions using LSTMs; the backward LM models p(t_k | t_{k+1}, ..., t_N)
[Slide from https://arxiv.org/pdf/1802.05365.pdf and https://arxiv.org/abs/1810.04805.]
SOTA Language Models: OpenAI GPT & BERT
oUses the Transformer rather than LSTM to model language
oOpenAI GPT: single direction (left-to-right)
oBERT: bi-directional
[Slide from https://arxiv.org/abs/1810.04805.]
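The directionality difference above comes down to the attention mask, sketched here with numpy: a GPT-style causal mask lets position i attend only to positions j ≤ i, while a BERT-style encoder attends in both directions. The sequence length is arbitrary.

```python
import numpy as np

seq_len = 4

# GPT-style causal mask: lower-triangular, so each position sees
# only itself and earlier positions.
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

# BERT-style bidirectional mask: every position attends to every position.
bidir_mask = np.ones((seq_len, seq_len), dtype=bool)

print(causal_mask.astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
print(int(bidir_mask.sum()))  # 16: all position pairs allowed
```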
SOTA Language Models: BERT
oAdditional language modeling task: predict whether two sentences come from the same paragraph (next-sentence prediction).
[Slide from https://arxiv.org/abs/1810.04805.]
SOTA Language Models: BERT
oInstead of extracting embeddings and hidden-layer outputs, BERT can be fine-tuned for specific supervised learning tasks.
[Slide from https://arxiv.org/abs/1810.04805.]
The Transformer and Attention Mechanism
oAn encoder-decoder structure
oOur focus: the encoder and the attention mechanism
[Slide from https://jalammar.github.io/illustrated-transformer/.]
The Transformer and Attention Mechanism
oSelf-attention
oIgnores positions of words; assigns weights globally
oCan be parallelized, in contrast to LSTM
oE.g., the attention weights related to the word "it_":
[Slide from https://jalammar.github.io/illustrated-transformer/.]
Self-attention Mechanism
[Slide from https://jalammar.github.io/illustrated-transformer/.]
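The mechanism this slide illustrates can be written as a minimal numpy sketch: project the input embeddings to queries Q, keys K, and values V, then output softmax(Q Kᵀ / √d_k) V (scaled dot-product attention). The random projection matrices are stand-ins for learned weights.

```python
import numpy as np

rng = np.random.default_rng(1)
seq_len, d_model, d_k = 3, 8, 4  # arbitrary toy sizes

X = rng.normal(size=(seq_len, d_model))  # one token embedding per row
W_q = rng.normal(size=(d_model, d_k))    # learned in a real model
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

def self_attention(X):
    """Scaled dot-product self-attention over one sequence."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(d_k)               # all-pairs similarity scores
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: rows sum to 1
    return weights @ V                            # weighted sum of values

out = self_attention(X)
print(out.shape)  # (3, 4): one d_k-dimensional output per token
```

Because every output row is computed from the same matrix products, all positions are processed in parallel, which is exactly the advantage over an LSTM's sequential recurrence noted on the previous slide.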
Self-attention Mechanism
oMore: https://jalammar.github.io/illustrated-transformer/
[Slide from https://jalammar.github.io/illustrated-transformer/.]
Take home message
oLanguage models suffer from data sparsity
oWord2vec portrays language probability using distributed word embedding parameters
oELMo, OpenAI GPT, and BERT model language using deep neural networks
oPre-trained language models or their parameters can be transferred to supervised learning problems in NLP
oSelf-attention has the advantage over LSTM that it can be parallelized and considers interactions across the whole sentence
References
oWikipedia. Language model: https://en.wikipedia.org/wiki/Language_model
oTensorFlow. Vector Representations of Words: https://www.tensorflow.org/tutorials/representation/word2vec
oMatt Gormley. 10-601 Introduction to Machine Learning: http://www.cs.cmu.edu/~mgormley/courses/10601/index.html
oMatthew E. Peters et al. Deep Contextualized Word Representations: https://arxiv.org/pdf/1802.05365.pdf
oJacob Devlin et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding: https://arxiv.org/abs/1810.04805
oJay Alammar. The Illustrated Transformer: https://jalammar.github.io/illustrated-transformer/