TRANSCRIPT
Geometric losses for distributional learning
Arthur Mensch(1), Mathieu Blondel(2), Gabriel Peyré(1)
(1) École Normale Supérieure, DMA; Centre National de la Recherche Scientifique, Paris, France
(2) NTT Communication Science Laboratories, Kyoto, Japan
June 17, 2019
1 Introduction
2 Fenchel-Young losses for distribution spaces
3 Geometric softmax from Sinkhorn negentropy
4 Statistical properties
5 Applications
Introduction: Predicting distributions
[Figure: Link and Loss]

Arthur Mensch, Mathieu Blondel, Gabriel Peyré Geometric losses for distributional learning 1 / 17
Contribution: losses and links for continuous metrized outputs

Handling output geometry: a link and a loss with a cost between classes, C : Y × Y → R, for output distributions over a continuous space Y.

New geometric losses and associated link functions:
1 Constructed from the duality between distributions and scores
2 Need: a convex functional on distribution space, provided by regularized optimal transport

Arthur Mensch, Mathieu Blondel, Gabriel Peyré Geometric losses for distributional learning 2 / 17
Background: learning with a cost over outputs Y

Cost augmentation of losses1,2: convex cost-aware loss L_C : [1, d] × R^d → R
⚠ Undefined link function R^d → △^d: what to predict at test time?

Use a Wasserstein distance between output distributions3: the ground metric C defines a distance W_C between distributions; predict with a softmax link:
ℓ(α, f) ≜ W_C(softmax(f), α)
⚠ Non-convex loss, costly to compute

1 Ioannis Tsochantaridis et al. "Large margin methods for structured and interdependent output variables". In: JMLR (2005).
2 Kevin Gimpel and Noah A Smith. "Softmax-margin CRFs: Training log-linear models with cost functions". In: NAACL. 2010.
3 Charlie Frogner et al. "Learning with a Wasserstein loss". In: NIPS. 2015.

Arthur Mensch, Mathieu Blondel, Gabriel Peyré Geometric losses for distributional learning 3 / 17
Predicting distributions from topological duality

Arthur Mensch, Mathieu Blondel, Gabriel Peyré Geometric losses for distributional learning 4 / 17
All you need is a convex functional

Fenchel-Young losses4,5: a convex function Ω : △^d → R and its conjugate
Ω*(f) = max_{α ∈ △^d} 〈α, f〉 − Ω(α)

ℓ_Ω(α, f) = Ω(α) + Ω*(f) − 〈α, f〉 ≥ 0

Fenchel duality defines link functions between dual and primal:
∇Ω(α) = argmin_{f ∈ R^d} ℓ_Ω(α, f)    ∇Ω*(f) = argmin_{α ∈ △^d} ℓ_Ω(α, f)
ℓ_Ω(α, f) = 0 ⇔ α = ∇Ω*(f)

4 John C. Duchi et al. "Multiclass Classification, Information, Divergence, and Surrogate Risk". In: Annals of Statistics (2018).
5 Mathieu Blondel et al. "Learning Classifiers with Fenchel-Young Losses: Generalized Entropies, Margins, and Algorithms". In: AISTATS. 2019.

Arthur Mensch, Mathieu Blondel, Gabriel Peyré Geometric losses for distributional learning 5 / 17
Discrete canonical example: Shannon entropy

Ω(α) = −H(α) = Σ_{i=1}^d α_i log α_i    Ω*(f) = logsumexp(f)

Not defined on continuous distributions, and cost-agnostic.
We need an entropy defined on all distributions.

Arthur Mensch, Mathieu Blondel, Gabriel Peyré Geometric losses for distributional learning 6 / 17
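To make the Fenchel-Young construction concrete, here is a minimal NumPy sketch of the Shannon case (function names are ours, not the paper's): the loss Ω(α) + Ω*(f) − 〈α, f〉 equals the KL divergence from α to softmax(f), non-negative and zero exactly at the link.

```python
import numpy as np

def softmax(f):
    # the link ∇Ω*(f) associated with the Shannon negentropy
    e = np.exp(f - f.max())
    return e / e.sum()

def logsumexp(f):
    # the conjugate Ω*(f), computed stably
    return float(f.max() + np.log(np.exp(f - f.max()).sum()))

def shannon_negentropy(alpha):
    # Ω(α) = Σ_i α_i log α_i, with the convention 0 log 0 = 0
    a = alpha[alpha > 0]
    return float((a * np.log(a)).sum())

def fy_loss(alpha, f):
    # Fenchel-Young loss: Ω(α) + Ω*(f) − ⟨α, f⟩ ≥ 0
    return shannon_negentropy(alpha) + logsumexp(f) - float(alpha @ f)

f = np.array([1.0, -0.5, 2.0])
alpha = np.array([0.2, 0.3, 0.5])
assert fy_loss(alpha, f) > 0           # gap equals KL(α ‖ softmax(f))
assert fy_loss(softmax(f), f) < 1e-8   # vanishes at the link α = softmax(f)
```

In this case the Fenchel-Young loss recovers the usual logistic loss, which motivates calling its geometric generalization a "geometric logistic loss" later in the talk.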
Detour: regularized optimal transport6

Distance between distributions on Y:
OT_{C,ε}(α, β) = min_{π ∈ U(α,β)} 〈C, π〉 + ε KL(π | α⊗β)

Displacement cost for the optimal coupling, with U(α, β) = {π ∈ M+_1(Y²) : π1 = α, π⊤1 = β},
plus entropic regularization, which spreads the coupling.

6 Marco Cuturi. "Sinkhorn distances: Lightspeed computation of optimal transport". In: NIPS. 2013.

Arthur Mensch, Mathieu Blondel, Gabriel Peyré Geometric losses for distributional learning 7 / 17
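For a discrete Y, the regularized transport problem above can be prototyped with plain Sinkhorn iterations. The sketch below is illustrative only (function names, the iteration count, and the sanity checks are our choices, not the paper's): it runs log-domain updates on the dual potentials, where at convergence the value reduces to 〈α, f〉 + 〈β, g〉.

```python
import numpy as np

def _lse(M):
    # stable row-wise log-sum-exp
    m = M.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(M - m).sum(axis=1, keepdims=True))).ravel()

def entropic_ot(alpha, beta, C, eps=1.0, n_iter=200):
    """OT_{C,eps}(alpha, beta) = min over U(alpha, beta) of <C, pi> + eps KL(pi | alpha x beta),
    solved in the dual by log-domain Sinkhorn iterations (illustrative sketch)."""
    f, g = np.zeros_like(alpha), np.zeros_like(beta)
    log_a, log_b = np.log(alpha), np.log(beta)
    for _ in range(n_iter):
        # alternating soft c-transforms of the two dual potentials
        f = -eps * _lse((g[None, :] - C) / eps + log_b[None, :])
        g = -eps * _lse((f[None, :] - C.T) / eps + log_a[None, :])
    # after each update the KL correction term vanishes, so the value is <alpha,f> + <beta,g>
    return float(alpha @ f + beta @ g)

alpha, beta = np.array([0.5, 0.5]), np.array([0.25, 0.75])
C = np.array([[0.0, 1.0], [1.0, 0.0]])
assert entropic_ot(alpha, beta, C) > 0               # distinct marginals force transport
assert abs(entropic_ot(alpha, beta, 0 * C)) < 1e-8   # zero cost gives zero value
```

Doubling the cost matrix can only increase the value, since 〈2C, π〉 + ε KL ≥ 〈C, π〉 + ε KL for every feasible coupling π.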
Sinkhorn negentropy from regularized optimal transport

Self-regularized optimal transport distance:
Ω_C(α) = −(1/2) OT_{C,ε=2}(α, α) = − max_{f ∈ C(Y)} 〈α, f〉 − log〈α⊗α, exp((f ⊕ f − C)/2)〉

Continuous and convex.

Special cases:
ε Ω_{C/ε}(α) → (1/2)〈α⊗α, −C〉 as ε → +∞: MMD
C with 0 on the diagonal and +∞ off-diagonal: Shannon entropy, Gini index

Arthur Mensch, Mathieu Blondel, Gabriel Peyré Geometric losses for distributional learning 8 / 17
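A hedged numerical sketch of the Sinkhorn negentropy on a discrete space (the function names and the half-step damping are our choices, not the paper's implementation): a symmetric Sinkhorn fixed point yields the optimal potential f, at which point the log term vanishes and Ω_C(α) = −〈α, f〉 for ε = 2. The 0/∞ cost then recovers the Shannon negentropy, as the special case above states.

```python
import numpy as np

def _lse(M):
    # stable row-wise log-sum-exp; rows are assumed to have a finite maximum
    m = M.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(M - m).sum(axis=1, keepdims=True))).ravel()

def symmetric_sinkhorn(alpha, C, eps=2.0, n_iter=300):
    """Symmetric Sinkhorn potential f for the self-transport OT_{C,eps}(alpha, alpha).
    Sketch only: damping and iteration count are illustrative choices."""
    f = np.zeros_like(alpha)
    log_a = np.log(alpha)
    for _ in range(n_iter):
        # soft c-transform: T(f)_i = -eps log sum_j alpha_j exp((f_j - C_ij)/eps)
        T = -eps * _lse((f[None, :] - C) / eps + log_a[None, :])
        f = 0.5 * (f + T)  # half-step averaging stabilizes the symmetric iteration
    return f

def sinkhorn_negentropy(alpha, C):
    # Omega_C(alpha) = -1/2 OT_{C,2}(alpha, alpha) = -<alpha, f> at the fixed point
    return -float(alpha @ symmetric_sinkhorn(alpha, C, eps=2.0))

# the 0-diagonal / +inf off-diagonal cost recovers the Shannon negentropy
alpha = np.array([0.3, 0.7])
C = np.array([[0.0, np.inf], [np.inf, 0.0]])
assert abs(sinkhorn_negentropy(alpha, C) - float((alpha * np.log(alpha)).sum())) < 1e-6
```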
Dual mapping from Sinkhorn negentropy

Ω(α) = − max_{f ∈ C(Y)} 〈α, f〉 − log〈α⊗α, exp((f ⊕ f − C)/2)〉
∇Ω(α) = − argmax_{f ∈ C(Y)} 〈α, f〉 − log〈α⊗α, exp((f ⊕ f − C)/2)〉

Arthur Mensch, Mathieu Blondel, Gabriel Peyré Geometric losses for distributional learning 9 / 17
Returning to the primal: geometric softmax

Ω* = g-logsumexp : f ↦ − log min_{α ∈ M+_1(Y)} 〈α⊗α, exp((−f ⊕ f + C)/2)〉
∇Ω* = g-softmax : f ↦ argmin_{α ∈ M+_1(Y)} 〈α⊗α, exp((−f ⊕ f + C)/2)〉

Minimizes a simple quadratic.

Arthur Mensch, Mathieu Blondel, Gabriel Peyré Geometric losses for distributional learning 10 / 17
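On a discrete space this "simple quadratic" is α⊤Kα with K_ij = exp((C_ij − f_i − f_j)/2), minimized over the simplex. Below is a sketch via entropic mirror descent, one of the solvers mentioned in the talk; the step size, iteration count, and the zero-cost check are our illustrative choices, not the paper's implementation.

```python
import numpy as np

def g_softmax(f, C, n_iter=2000, lr=0.5):
    """Discrete geometric softmax: argmin over the simplex of alpha^T K alpha,
    with K_ij = exp((C_ij - f_i - f_j)/2), by entropic mirror descent (sketch)."""
    K = np.exp((C - f[:, None] - f[None, :]) / 2.0)
    alpha = np.full(len(f), 1.0 / len(f))
    for _ in range(n_iter):
        grad = 2.0 * K @ alpha               # gradient of the quadratic alpha^T K alpha
        alpha = alpha * np.exp(-lr * grad)   # multiplicative (mirror) update
        alpha /= alpha.sum()                 # renormalize onto the simplex
    return alpha

# with a zero cost the quadratic is (<u, alpha>)^2 with u_i = exp(-f_i/2),
# so all mass goes to the highest score: the minimizer can be sparse
f = np.array([1.0, 3.0, 0.0])
alpha = g_softmax(f, np.zeros((3, 3)))
assert alpha.argmax() == 1 and alpha[1] > 0.9
```

The multiplicative update keeps the iterate strictly inside the simplex while still letting coordinates decay to numerical zero, which matches the sparsity of ∇Ω* discussed on the properties slide.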
Geometric loss construction and computation

Training with the geometric logistic loss:
ℓ_C(α, f) = g-logsumexp_C(f) + sinkhorn-negentropy_C(α) − 〈α, f〉

Tractable for discrete Y: use mirror descent or L-BFGS, about 10× as costly as a softmax; backpropagation through ∇Ω* = g-softmax.
Continuous case: Frank-Wolfe scheme, adding one Dirac at each iteration.

Arthur Mensch, Mathieu Blondel, Gabriel Peyré Geometric losses for distributional learning 11 / 17
Properties of the geometric-softmax

[Figure: left, the coordinates of ∇Ω*(f) against f1 − f2, for C = (0 γ; γ 0); right, −Ω([α, 1−α]) against α, for γ = 0.1, γ = 2, γ = ∞]

∇Ω* inverts Sinkhorn potentials: ∇Ω* ∘ ∇Ω = Id
∇Ω* is non-injective (as for the softmax); ∇Ω ∘ ∇Ω* projects f onto the symmetric Sinkhorn potentials
∇Ω*(f) can be sparse, as it comes from a minimization over the simplex

Arthur Mensch, Mathieu Blondel, Gabriel Peyré Geometric losses for distributional learning 12 / 17

Properties of the geometric-softmax

ε ∇Ω*(f/ε): mode finding as ε → 0; positive deconvolution as ε → ∞

Arthur Mensch, Mathieu Blondel, Gabriel Peyré Geometric losses for distributional learning 13 / 17
Consistent learning with the geometric logistic loss

Bregman divergence from the Sinkhorn negentropy:
D_Ω(α | β) = Ω(α) − Ω(β) − 〈∇Ω(β), α − β〉
= an asymmetric transport distance

Sample distribution (x_i, α_i)_i ∈ X × M+_1(Y).

Fisher consistency:
min_{β : X → M+_1(Y)} E[D_Ω(α, β(x))] = min_{g : X → C(Y)} E[ℓ_C(α, g-softmax(g(x)))]

Arthur Mensch, Mathieu Blondel, Gabriel Peyré Geometric losses for distributional learning 14 / 17
Applications: variational auto-encoder

Goal: generate nearly-1D distributions in 2D images
Dataset: Google Quickdraw
Traditional sigmoid activation layer replaced by the geometric softmax
Deconvolutional effect; cost-informed non-linearity

Arthur Mensch, Mathieu Blondel, Gabriel Peyré Geometric losses for distributional learning 15 / 17
Applications: variational auto-encoders

[Figure: reconstruction and generation samples, softmax vs. g-softmax]

Better-defined generated images.

Arthur Mensch, Mathieu Blondel, Gabriel Peyré Geometric losses for distributional learning 16 / 17
Conclusion

Geometric softmax:
New loss and projection onto output probabilities
Discrete/continuous, aware of a cost between outputs
Fenchel duality in Banach spaces + regularized optimal transport
Applications in VAEs and ordinal regression
github.com/arthurmensch/g-softmax

Future directions:
Improving computational methods (continuous Frank-Wolfe)
Geometric logistic loss in super-resolution7

7 Nicholas Boyd et al. "DeepLoco: Fast 3D localization microscopy using neural networks". In: BioRxiv (2018), p. 267096.

Arthur Mensch, Mathieu Blondel, Gabriel Peyré Geometric losses for distributional learning 17 / 17
Mathieu Blondel Gabriel Peyre
Poster # 179
Arthur Mensch, Mathieu Blondel, Gabriel Peyre Geometric losses for distributional learning 17 / 17