TRANSCRIPT
MPNet: Masked and Permuted Pre-training for Natural Language Understanding
Kaitao Song*¹, Xu Tan*², Tao Qin², Jianfeng Lu¹, Tie-Yan Liu²
¹Nanjing University of Science and Technology, ²Microsoft Research
Motivation
• BERT (Masked Language Model, MLM) allows the model to see the full input sentence (input consistency), but ignores the dependency between the masked/predicted tokens (output dependency).
• XLNet (Permuted Language Model, PLM) leverages the dependency between the masked/predicted tokens (output dependency), but each predicted token can only see its preceding tokens rather than the full sentence (losing input consistency).
[Figure: (a) MLM, (b) PLM]
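To make the contrast concrete, here is a minimal Python sketch (illustrative only, not the authors' code; the permutation and the predicted-token split are hypothetical) of what each predicted token may condition on under MLM versus PLM:

```python
# Minimal sketch (illustrative, not the authors' code) contrasting what each
# predicted token conditions on under MLM (BERT) vs. PLM (XLNet).
# Take a 6-token sentence and a permutation z = (1, 3, 5, 6, 2, 4); the last
# two tokens in z (positions 2 and 4) are the ones to be predicted.

perm = [1, 3, 5, 6, 2, 4]   # 1-based permutation order z (hypothetical)
n_pred = 2                  # number of predicted tokens
non_pred = perm[:-n_pred]   # observed (non-predicted) positions: 1, 3, 5, 6
pred = perm[-n_pred:]       # predicted positions: 2, 4

def mlm_context(t):
    # MLM: every prediction sees all non-predicted tokens, with [MASK]
    # placeholders occupying the remaining positions (full position info),
    # but the predictions are made independently of one another.
    return sorted(non_pred)

def plm_context(t):
    # PLM: the t-th predicted token additionally sees the previously
    # predicted tokens (output dependency), but has no information about
    # the positions still to come (no full-sentence view).
    return sorted(non_pred + pred[:t])

for t, pos in enumerate(pred):
    print(f"MLM predicts x{pos} from positions {mlm_context(t)}")
    print(f"PLM predicts x{pos} from positions {plm_context(t)}")
```

Note how PLM's context grows step by step (capturing output dependency), while MLM's context is fixed for every prediction but covers position information for the whole sentence.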
Method
Fig 1. Architecture of MPNet. Code: https://github.com/Microsoft/MPNet
MPNet – inherits the advantages of MLM and PLM:
• Autoregressive prediction (avoids the limitation of BERT)
  • Each predicted token conditions on the previously predicted tokens to ensure output dependency.
• Position compensation (avoids the limitation of XLNet)
  • Each predicted token can see the full position information of the sentence to ensure input consistency.
Experiments
• Setting: Pre-train on English Wikipedia, BooksCorpus, OpenWebText, and Stories (160 GB) with a batch size of 8192 for 500K steps. The mask ratio is set to 15%.
• GLUE
• SQuAD
• Ablation Study
Analysis
• Case Study (Factorization)
• Comparisons between MPNet and MLM/PLM (Tokens and Positions)
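MPNet's two key ideas — autoregressive prediction over the predicted tokens plus position compensation — can be sketched as follows. This is a hypothetical illustration of how the visible context is assembled at each prediction step, not the released MPNet implementation:

```python
# Minimal sketch (hypothetical, not the released MPNet code) of the context
# for each autoregressive prediction step in MPNet. The t-th predicted token
# conditions on:
#   (a) all non-predicted tokens,
#   (b) the previously predicted tokens (output dependency, as in PLM), and
#   (c) [MASK] placeholders carrying the positions of the not-yet-predicted
#       tokens (position compensation -> full position info, as in MLM).

def mpnet_step_inputs(perm, n_pred):
    """For each step, return (target_position, content_positions, mask_positions)."""
    non_pred = perm[:-n_pred]   # observed positions
    pred = perm[-n_pred:]       # positions to be predicted, in permuted order
    steps = []
    for t in range(n_pred):
        content = sorted(non_pred + pred[:t])  # tokens visible as content
        masks = sorted(pred[t:])               # positions seen only as [MASK]
        steps.append((pred[t], content, masks))
    return steps

# Example with a 6-token sentence and permutation z = (1, 3, 5, 6, 2, 4):
for target, content, masks in mpnet_step_inputs([1, 3, 5, 6, 2, 4], 2):
    # content positions + mask positions always cover all 6 positions,
    # so the model keeps full position information (input consistency).
    print(f"predict x{target}: content={content}, masked positions={masks}")
```

At every step the union of content positions and masked positions spans the whole sentence, which is exactly the input consistency that plain PLM lacks, while the step-by-step conditioning on earlier predictions supplies the output dependency that plain MLM lacks.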