
ADAPTING AUXILIARY LOSSES USING GRADIENT SIMILARITY

Yunshu Du*, Wojciech M. Czarnecki*, Siddhant M. Jayakumar, Razvan Pascanu, Balaji Lakshminarayanan
[email protected], {lejlot, sidmj, razp, balajiln}@google.com

1. PROBLEM SETUP

Given a main task of interest and an auxiliary task that is not of direct interest, how should we weight the auxiliary loss? The typical multi-task approach minimizes a fixed-weight combination of the two losses (written out below).

However, this can be sub-optimal, since we only care about performance on the main task; for example, the auxiliary loss might help initially but hurt later. We want to solve the following problem:

Question: how can we automatically adapt the auxiliary loss so that it does not hurt the main loss?

Motivating example: a main function and an auxiliary function (shown on the poster).
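In symbols, the standard objective above can be written as follows; the fixed weight λ and the loss notation are our shorthand rather than text taken from the poster.

```latex
% Standard multi-task training optimizes a fixed-weight combination of the losses,
% even though only the main-task loss is what we ultimately care about.
\[
  \theta^{*} \;=\; \arg\min_{\theta}\; \mathcal{L}_{\mathrm{main}}(\theta)
  \;+\; \lambda\, \mathcal{L}_{\mathrm{aux}}(\theta)
\]
```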


2. USING GRADIENT COSINE SIMILARITY TO ADAPT AUXILIARY LOSS

At each training step, compute the cosine similarity between the gradient of the main loss and the gradient of the auxiliary loss, cos(∇θ L_main, ∇θ L_aux).

Weighted version: weight the auxiliary loss by this gradient cosine similarity.

Unweighted version: use the auxiliary loss when the cosine similarity is above a threshold and ignore it otherwise.
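To make the two variants concrete, here is a minimal NumPy sketch of a single update step; the function names, the toy gradients, the max(0, cos) clipping in the weighted variant, and the choice of 0 as the threshold are our own illustration, not code from the paper.

```python
import numpy as np

def cosine_similarity(g_main, g_aux, eps=1e-12):
    """Cosine similarity between two flattened gradient vectors."""
    return float(g_main @ g_aux / (np.linalg.norm(g_main) * np.linalg.norm(g_aux) + eps))

def combined_gradient(g_main, g_aux, weighted=True):
    """Gradient used to update the shared parameters.

    weighted=True : scale the auxiliary gradient by max(0, cos), so a conflicting
                    auxiliary gradient is dropped rather than reversed.
    weighted=False: add the full auxiliary gradient only when cos is above the
                    threshold (taken to be 0 here), and ignore it otherwise.
    """
    cos = cosine_similarity(g_main, g_aux)
    if weighted:
        return g_main + max(0.0, cos) * g_aux
    return g_main + g_aux if cos > 0.0 else g_main

# Toy usage with made-up gradients.
theta = np.array([1.0, -2.0])
g_main = 2.0 * theta                # e.g., gradient of ||theta||^2
g_aux = np.array([1.0, 0.5])        # gradient of some auxiliary loss
theta = theta - 0.1 * combined_gradient(g_main, g_aux, weighted=False)
```

In a deep-learning framework, g_main and g_aux would come from two separate backward passes through the shared parameters before the update is applied.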


3. SUPERVISED LEARNING USING PAIRS OF IMAGENET CLASSES

Ground truth for task similarity: the Least Common Ancestor (LCA) in the class hierarchy and the Fréchet Inception Distance (FID) between ImageNet classes.

Near pair: the most similar classes, such as Trimaran and Catamaran.
Far pair: the least similar classes, such as Rock python and Traffic light.
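As a rough illustration of the LCA ground truth above (our own sketch; the poster gives no code, and actual ImageNet classes are identified by WordNet synset IDs rather than the plain nouns used here):

```python
# Requires the NLTK WordNet corpus: pip install nltk, then nltk.download("wordnet").
from nltk.corpus import wordnet as wn

def lca_depth(noun_a, noun_b):
    """Depth of the lowest common ancestor of two nouns in the WordNet hierarchy.

    A deeper common ancestor indicates more similar classes (a 'near' pair)."""
    syn_a = wn.synsets(noun_a, pos=wn.NOUN)[0]
    syn_b = wn.synsets(noun_b, pos=wn.NOUN)[0]
    return syn_a.lowest_common_hypernyms(syn_b)[0].max_depth()

print(lca_depth("trimaran", "catamaran"))         # near pair: deep common ancestor
print(lca_depth("rock_python", "traffic_light"))  # far pair: shallow common ancestor
```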

Given a pair of classes (A, B), we define the main task as (A vs. rest) and the auxiliary task as (B vs. rest).

Figure (a): we validate that near pairs have high gradient cosine similarity and far pairs have low cosine similarity.
Figure (b): for a near pair, our method uses the auxiliary task to learn faster and recovers the performance of multi-task training.
Figure (c): for a far pair, our method successfully ignores the auxiliary task and recovers the performance of single-task training.

Our method automatically uses (ignores) the auxiliary task when it helps (hurts), achieving the best of both worlds.

4. REINFORCEMENT LEARNING ON IMPERFECT-TEACHER DISTILLATION

Single task on Breakout: the main task is Breakout; the auxiliary task is distillation from a sub-optimal pre-trained Breakout teacher.
Only KL: solely following the teacher leads to sub-optimal solutions.
RL (Baseline): single-task learning without the teacher.
RL + KL (Baseline): the teacher only helps initially.
Our Method: uses the teacher's knowledge when it helps initially and ignores it when it hurts later on.

Multi-task on Breakout and Ms. PacMan: the main task is multi-task Breakout + Ms. PacMan; the auxiliary task is distillation from a sub-optimal pre-trained Breakout teacher.
Multi-task: learns Ms. PacMan at the expense of Breakout.
Multi-task RL + Distillation: the teacher helps Breakout but hurts Ms. PacMan.
Our Method: Ms. PacMan ignores the teacher when it hurts; both Breakout and Ms. PacMan learn well.
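A sketch of how the per-task gating described above could look, where the teacher's distillation (KL) gradient is added to a task's update only when it is aligned with that task's own gradient; the function names and the toy numbers are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def cosine(u, v, eps=1e-12):
    """Cosine similarity between two flattened gradient vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + eps))

def gate_teacher_gradient(task_grads, teacher_grad):
    """Per task, keep the task's own gradient and add the teacher's distillation
    gradient only when the two point in a similar direction."""
    return {
        task: g + teacher_grad if cosine(g, teacher_grad) > 0.0 else g
        for task, g in task_grads.items()
    }

# Illustrative flattened gradients: the Breakout teacher aligns with Breakout
# but conflicts with Ms. PacMan, so only Breakout's update uses the teacher.
task_grads = {"breakout": np.array([0.8, 0.1]), "ms_pacman": np.array([-0.6, 0.9])}
teacher_grad = np.array([0.7, 0.0])
updates = gate_teacher_gradient(task_grads, teacher_grad)
```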

5. SUMMARY


● Proposed gradient cosine similarity as a simple yet effective way to automatically adapt the auxiliary task so that it helps (and does not hurt) the main task.

● Experiments on ImageNet and Atari show empirical success; the paper contains additional experiments on cross-domain distillation tasks.

● The paper gives theoretical guarantees of convergence to a local optimum of the main task.

Potential issues and future directions

● The method guarantees convergence to a local optimum of the main task, but not faster convergence.

● Extend the theory to optimizers that rely on gradient statistics or second-order information (e.g., Adam or RMSprop).

● Apply the method to settings where the auxiliary task hurts initially but helps later.
