INNOVATION + ASSISTANCE
Structure is all you need - learning compressed RL policies via gradient sensing
Google Brain Robotics: Krzysztof Choromanski, Vikas Sindhwani
University of Cambridge and Alan Turing Institute: Mark Rowland, Richard E. Turner, Adrian Weller
Gradient sensing - blackbox approach to RL
Gradient sensing - blackbox approach to RL
● gradient sensing via independent Gaussian vectors (Salimans et al. 2017)
● finite difference gradient sensing
● gradient sensing via randomly rotated basis
Gradient sensing - blackbox approach to RL
Towards smooth relaxations
family of probability distributions on the parameter space
Gaussian smoothings
● infinitely differentiable objective (Nesterov & Spokoiny, 2017)
● optimal value lower-bounds that of the original problem
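The smoothing itself has a standard explicit form (following Nesterov & Spokoiny, 2017); the notation below is assumed, with F the blackbox objective, θ ∈ ℝ^d its parameters and σ > 0 the smoothing parameter:

```latex
F_{\sigma}(\theta) \;=\; \mathbb{E}_{\varepsilon \sim \mathcal{N}(0, I_d)}\bigl[\,F(\theta + \sigma\varepsilon)\,\bigr]
```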
Gradient sensing as a Monte-Carlo estimation
● ES-style gradient estimator (Salimans et al. 2017): used in many evolutionary strategies papers, no control variate
● Gradient estimator with antithetic pairs (Salimans et al. 2017): used for variance reduction
● FD-style gradient estimator: randomized version of a standard finite difference method
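In standard notation (Salimans et al. 2017), with ε_1, …, ε_N iid N(0, I_d), N the number of samples and σ > 0 the smoothing parameter, the three estimators take the forms:

```latex
\widehat{\nabla}^{\mathrm{ES}}_{N} F_{\sigma}(\theta)
  = \frac{1}{N\sigma}\sum_{i=1}^{N} F(\theta + \sigma\varepsilon_{i})\,\varepsilon_{i}
\qquad
\widehat{\nabla}^{\mathrm{AT}}_{N} F_{\sigma}(\theta)
  = \frac{1}{2N\sigma}\sum_{i=1}^{N}\bigl(F(\theta + \sigma\varepsilon_{i}) - F(\theta - \sigma\varepsilon_{i})\bigr)\,\varepsilon_{i}
\qquad
\widehat{\nabla}^{\mathrm{FD}}_{N} F_{\sigma}(\theta)
  = \frac{1}{N\sigma}\sum_{i=1}^{N}\bigl(F(\theta + \sigma\varepsilon_{i}) - F(\theta)\bigr)\,\varepsilon_{i}
```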
Gradient sensing as a Monte-Carlo estimation
● all three estimators are unbiased
● empirically, none of them dominates the others
Gradient sensing as a Monte-Carlo estimation
● how to learn optimal control variates?
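A minimal numpy sketch of the three estimators on a toy objective (an illustration only, not the paper's distributed implementation; function names and the test objective are ours):

```python
import numpy as np

def es_grad(F, theta, sigma=0.1, N=1000, seed=0):
    """ES-style estimator: (1/(N*sigma)) * sum_i F(theta + sigma*eps_i) * eps_i."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal((N, theta.size))
    vals = np.array([F(theta + sigma * e) for e in eps])
    return (vals[:, None] * eps).mean(axis=0) / sigma

def fd_grad(F, theta, sigma=0.1, N=1000, seed=0):
    """FD-style estimator: subtracts the baseline F(theta), a simple control variate."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal((N, theta.size))
    base = F(theta)
    vals = np.array([F(theta + sigma * e) - base for e in eps])
    return (vals[:, None] * eps).mean(axis=0) / sigma

def antithetic_grad(F, theta, sigma=0.1, N=1000, seed=0):
    """Antithetic pairs: evaluates at theta + sigma*eps_i and theta - sigma*eps_i."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal((N, theta.size))
    vals = np.array([F(theta + sigma * e) - F(theta - sigma * e) for e in eps])
    return (vals[:, None] * eps).mean(axis=0) / (2 * sigma)

# Sanity check on F(x) = -||x||^2, whose exact gradient at theta is -2*theta.
F = lambda x: -float(x @ x)
theta = np.array([1.0, -2.0, 0.5])
for grad in (es_grad, fd_grad, antithetic_grad):
    print(grad.__name__, grad(F, theta, N=2000))
```

All three estimates concentrate around the true gradient, but with very different variances: the ES-style estimator carries the full value of F as noise, while the FD baseline and antithetic pairing cancel it.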
Gradient sensing as a Monte-Carlo estimation
Some recent success stories of structured random estimators (a very biased selection!):
● “Initialization matters: Orthogonal Predictive State Recurrent Neural Networks”, Choromanski, Downey, Boots (to appear at ICLR’18)
● “Optimizing Simulations with Noise-Tolerant Structured Exploration”, Choromanski, Iscen, Sindhwani, Tan, Coumans (to appear at ICRA’18)
● “The Geometry of Random Features”, Choromanski, Rowland, Sarlos, Sindhwani, Turner, Weller (to appear at AISTATS’18)
● “The Unreasonable Effectiveness of Structured Random Orthogonal Embeddings”, Choromanski, Rowland, Weller (NIPS’17)
● “Structured adaptive and random spinners for fast machine learning computations”, Bojarski, Choromanska, Choromanski, Fagan, Gouy-Pailler, Morvan, Sakr, Sarlos, Atif (AISTATS’17)
● “Orthogonal Random Features”, Yu, Suresh, Choromanski, Holtmann-Rice, Kumar (NIPS’16)
● “Recycling Randomness with Structure for Sublinear time Kernel Expansion”, Choromanski, Sindhwani (ICML’16)
● “Binary embeddings with structured hashed projections”, Choromanska, Choromanski, Bojarski, Jebara, Kumar, LeCun (ICML’16)
Gradient sensing as a Monte-Carlo estimation
Theorem (Choromanski, Rowland, Sindhwani, Turner, Weller’18)
The orthogonal gradient estimator is unbiased and yields lower mean squared error (MSE) than the unstructured estimator.
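A toy sketch of the theorem's claim, comparing the MSE of the antithetic estimator under iid Gaussian directions versus exactly orthogonal directions whose norms are resampled to be chi-distributed (so each direction is still marginally Gaussian); all names here are illustrative:

```python
import numpy as np

def orthogonal_gaussian_directions(N, d, rng):
    """N <= d exploration directions that are exactly orthogonal, with rows
    rescaled so each is marginally distributed like a N(0, I_d) vector."""
    assert N <= d
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))             # orthogonal matrix
    norms = np.linalg.norm(rng.standard_normal((N, d)), axis=1)  # chi_d-distributed
    return Q[:N] * norms[:, None]

def antithetic_estimate(F, theta, eps, sigma):
    vals = np.array([F(theta + sigma * e) - F(theta - sigma * e) for e in eps])
    return (vals[:, None] * eps).mean(axis=0) / (2 * sigma)

rng = np.random.default_rng(0)
d, N, sigma = 32, 32, 0.1
theta = rng.standard_normal(d)
F = lambda x: -float(x @ x)      # exact gradient at theta is -2*theta
true_grad = -2 * theta

def mse(sampler, trials=200):
    errs = [np.sum((antithetic_estimate(F, theta, sampler(), sigma) - true_grad) ** 2)
            for _ in range(trials)]
    return float(np.mean(errs))

mse_iid = mse(lambda: rng.standard_normal((N, d)))
mse_ort = mse(lambda: orthogonal_gaussian_directions(N, d, rng))
print(mse_iid, mse_ort)  # orthogonal directions give much lower MSE here
```

On this quadratic objective the orthogonal directions span the whole space when N = d, so nearly all the remaining error comes from the random norms; the iid estimator additionally suffers from correlated, clustered directions.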
Efficiency of the orthogonal space exploration for gradient sensing
Efficiency of the orthogonal space exploration for gradient sensing - discrete space sensing
Efficiency of the orthogonal space exploration for gradient sensing - structured neural networks
Experimental results: orthogonal blackbox gradient sensing for low-dimensional problems
[Figure: results under stochastic noise and deterministic noise]
53 different optimization problems, four settings:
● stochastic noise
● deterministic noise
● smooth problems
● non-differentiable problems
● we compute the average ranking of the methods against each other in terms of the quality of the final objective value in each optimisation task
● we then compare these average rankings using multiple hypothesis testing as described in Demšar (2006)
DFO benchmarking suite, Moré & Wild (2009)
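The ranking step can be sketched as follows; the scores array is hypothetical, and only the Friedman statistic (the first stage of the Demšar (2006) procedure) is shown, not the full post-hoc tests:

```python
import numpy as np

def average_ranks(scores):
    """scores: (n_tasks, n_methods), higher = better final objective value.
    Returns each method's average rank across tasks (rank 1 = best, ties averaged)."""
    n, k = scores.shape
    ranks = np.empty((n, k))
    for t in range(n):
        order = (-scores[t]).argsort()
        r = np.empty(k)
        r[order] = np.arange(1, k + 1)
        for v in np.unique(scores[t]):        # average tied ranks
            mask = scores[t] == v
            r[mask] = r[mask].mean()
        ranks[t] = r
    return ranks.mean(axis=0)

def friedman_statistic(scores):
    """Friedman chi-square over tasks, used to test whether average ranks differ."""
    n, k = scores.shape
    R = average_ranks(scores)
    return 12 * n / (k * (k + 1)) * (np.sum(R ** 2) - k * (k + 1) ** 2 / 4)

# Hypothetical final objective values for 4 methods on 5 tasks (higher is better).
scores = np.array([
    [0.9, 0.8, 0.7, 0.1],
    [0.5, 0.6, 0.4, 0.2],
    [1.0, 0.9, 0.8, 0.3],
    [0.7, 0.9, 0.6, 0.5],
    [0.8, 0.7, 0.9, 0.4],
])
print(average_ranks(scores))      # last method is consistently worst: rank 4.0
print(friedman_statistic(scores))
```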
[Figure: results on smooth problems and non-differentiable problems]
Experimental results: learning compressed policies for RL tasks
Tested variants:
● structured architectures + structured exploration (STST)
● structured architectures + unstructured exploration (STUN)
● unstructured architectures + unstructured exploration (UNUN)
Tested environments:
● Swimmer
● Ant
● HalfCheetah
● Hopper
● Humanoid
● Walker2d
● Pusher
● Reacher
● Striker
● Thrower
● ContMountainCar
● Pendulum
● Minitaur walking
(solved by STST)
Structured space exploration strategies:
● Gaussian orthogonal matrices
● matrices HD (k=1); 256- or 512-dimensional parameter vectors
Optimization setting:
● AdamOptimizer
● no heuristics used (e.g. no fitness shaping)
● Base compressed setting: two hidden layers of size h=41
Distributed TF training on at most 400 machines
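A sketch of the HD construction, assumed here to mean a normalized Sylvester Hadamard matrix times a random Rademacher diagonal, with k counting the number of HD blocks; function names are ours:

```python
import numpy as np

def hadamard(d):
    """Sylvester construction of the d x d Hadamard matrix; d must be a power of two."""
    assert d & (d - 1) == 0 and d > 0
    H = np.array([[1.0]])
    while H.shape[0] < d:
        H = np.block([[H, H], [H, -H]])
    return H

def hd_directions(d, k=1, seed=0):
    """Product of k blocks H*D, with H the normalized Hadamard matrix and D a
    random Rademacher (+/-1) diagonal. Rows give structured, exactly orthogonal
    exploration directions that can be sampled with O(d) randomness per block."""
    rng = np.random.default_rng(seed)
    M = np.eye(d)
    for _ in range(k):
        D = np.diag(rng.choice([-1.0, 1.0], size=d))
        M = (hadamard(d) / np.sqrt(d)) @ D @ M
    return M

M = hd_directions(256, k=1)
print(np.allclose(M @ M.T, np.eye(256)))  # rows are orthonormal
```

In practice the H and D multiplications are applied via the fast Walsh-Hadamard transform rather than materializing the matrix, which is what makes this exploration strategy cheap at the 256/512-dimensional scales quoted above.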
Experimental results: learning compressed policies for RL tasks
Same setup as the previous slide, with Toeplitz weight matrices in the base compressed setting (two hidden layers of size h=41).
279-dimensional neural network learns reward 4.83 for rollouts of length 500
Learns reward 1842
Distributed TF training on at most 400 machines
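A sketch of a compressed Toeplitz policy: each m x n weight matrix is constant along diagonals, so it has only m + n - 1 free parameters instead of m * n. The layer sizes below (observation dim 17, action dim 6) are made up for illustration; only the hidden size h=41 comes from the slide:

```python
import numpy as np

def toeplitz_matrix(params, m, n):
    """Build an m x n Toeplitz matrix from its m + n - 1 free parameters;
    entry (i, j) depends only on the diagonal index i - j."""
    assert params.size == m + n - 1
    i = np.arange(m)[:, None]
    j = np.arange(n)[None, :]
    return params[i - j + n - 1]

def compressed_policy(obs, params, sizes=(17, 41, 41, 6)):
    """Feedforward policy with two hidden layers of size 41, every weight
    matrix Toeplitz: parameter count grows linearly in the layer widths."""
    x = obs
    offset = 0
    for m, n in zip(sizes[1:], sizes[:-1]):
        k = m + n - 1
        W = toeplitz_matrix(params[offset:offset + k], m, n)
        offset += k
        x = np.tanh(W @ x)
    return x

sizes = (17, 41, 41, 6)
n_params = sum(m + n - 1 for m, n in zip(sizes[1:], sizes[:-1]))
print(n_params)  # 184 parameters, vs 17*41 + 41*41 + 41*6 = 2624 unstructured weights
```

The compressed parameter vector is exactly what the blackbox gradient sensing above perturbs, which is why the structured-architecture variants can get away with a few hundred dimensions.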
Experimental results: learning compressed policies for RL tasks
[Figure: tested environments with compressed policy dimensionalities d=253, d=256, d=273, d=247, d=273, d=266, d=246]
Experimental results: learning curves - Ant
● STST(GaussOrt): 509.2
● STST(Hadamard): 1144.57
● STUN: 566.42
● UNUN: -12.78
Experimental results: learning curves - Cont. Mountain Car
● STST(GaussOrt): 92.05
● STST(Hadamard): 93.06
● STUN: 90.69
● UNUN: -0.09
Experimental results: learning curves - HalfCheetah
● STST(GaussOrt): 3619.92
● STST(Hadamard): 3273.94
● STUN: 1948.26
● UNUN: 2660.76
Experimental results: learning curves - Hopper
● STST(GaussOrt): 99883.29
● STST(Hadamard): 99969.47
● STUN: 99464.74
● UNUN: 1540.10
Experimental results: learning curves - Humanoid
● STST(GaussOrt): 1842.70
● STST(Hadamard): 84.13
● STUN: 1425.85
● UNUN: 509.56
Experimental results: learning curves - Pendulum
● STST(GaussOrt): -128.48
● STST(Hadamard): -127.06
● STUN: -3278.60
● UNUN: -5026.27
Experimental results: learning curves - Pusher
● STST(GaussOrt): -48.07
● STST(Hadamard): -43.04
● STUN: -36.71
● UNUN: -46.91
Experimental results: learning curves - Reacher
● STST(GaussOrt): -4.29
● STST(Hadamard): -10.31
● STUN: -73.04
● UNUN: -145.39
Experimental results: learning curves - Striker
● STST(GaussOrt): -112.78
● STST(Hadamard): -87.27
● STUN: -52.56
● UNUN: -63.35
Experimental results: learning curves - Swimmer
● STST(GaussOrt): 370.36
● STST(Hadamard): 370.53
● STUN: 368.38
● UNUN: 150.90
Experimental results: learning curves - Thrower
● STST(GaussOrt): -349.72
● STST(Hadamard): -243.88
● STUN: -265.60
● UNUN: -192.76
Experimental results: learning curves - Walker2d
● STST(GaussOrt): 9998.12
● STST(Hadamard): 9974.60
● STUN: 9756.91
● UNUN: 456.51
Success stories - teaching “Smoky” to walk (ICRA’18); joint work with Atil Iscen, Vikas Sindhwani, Jie Tan and Erwin Coumans