Adaptive Stochastic Natural Gradient Method for One-Shot Neural Architecture Search
Youhei Akimoto (University of Tsukuba / RIKEN AIP), Shinichi Shirakawa (Yokohama National University), Nozomu Yoshinari (Yokohama National University), Kento Uchida (Yokohama National University), Shota Saito (Yokohama National University), Kouhei Nishida (Shinshu University)
Neural Architecture
Neural Network Architectures: VGGNet, ResNet, Inception, ... Trial and error for a given task (dataset)!

Sometimes a known architecture (often pre-trained on some dataset) works well on our task. Happy!
Other times we have to find a good one, or design a brand-new architecture and train it.
One-Shot Neural Architecture Search: Joint Optimization of Architecture c and Weights w
\max_{w,c} f(w, c)
[Figure: one layer of the search space. Input x_t is mapped to x_{t+1} by one of the candidate operations Conv 3x3 (weights W1), Conv 5x5 (weights W2), max pooling, avg pooling, ...; the one-hot vector c = (0, 0, 1, 0) selects the active operation, so w = (W1, W2) and c = (0, 0, 1, 0).]
\max_{c} f(w^*(c), c) \quad \text{subject to} \quad w^*(c) = \arg\max_{w} f(w, c)
NAS as hyper-parameter search: 1 evaluation of c = 1 full training.
One-shot NAS: optimization of w and c within 1 training.
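To make the encoding concrete, here is a minimal NumPy sketch of how a one-hot vector c selects among candidate operations in one layer; the operation names and stand-in implementations are illustrative, not the paper's actual code.

```python
import numpy as np

# Illustrative stand-ins for the candidate operations; a real one-shot
# NAS implementation would use actual convolution / pooling layers.
def conv3x3(x, W):  return W @ x                      # stand-in for Conv 3x3 (weights W1)
def conv5x5(x, W):  return W @ x                      # stand-in for Conv 5x5 (weights W2)
def max_pool(x, W): return np.maximum(x, 0.0)         # stand-in for max pooling (no weights)
def avg_pool(x, W): return np.full_like(x, x.mean())  # stand-in for avg pooling (no weights)

OPS = [conv3x3, conv5x5, max_pool, avg_pool]

def layer_forward(x_t, weights, c):
    """Compute x_{t+1}: the one-hot vector c, e.g. (0, 0, 1, 0),
    picks exactly one candidate operation for this layer."""
    k = int(np.argmax(c))
    return OPS[k](x_t, weights[k])

# Example: c = (0, 0, 1, 0) routes x_t through max pooling.
x_t = np.ones(4)
weights = [np.eye(4), np.eye(4), None, None]
x_next = layer_forward(x_t, weights, np.array([0, 0, 1, 0]))
```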
Difficulties for Practitioners: How to choose / tune the search strategy?
w \leftarrow w + \epsilon_w \nabla_w f(w, c(\theta))
\theta \leftarrow \theta + \epsilon_\theta \nabla_\theta f(w, c(\theta))
Search Space / Search Strategy

Gradient-Based Method (hyper-parameter: step-size)
Other choices: Evolutionary-Computation-based, Reinforcement-Learning-based

- How to treat integer variables such as #filters?
- How to tune the hyper-parameters in such situations?
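For concreteness, a minimal sketch of the alternating gradient-based update above, with placeholder gradient callables; eps_w and eps_theta are exactly the step-size hyper-parameters that must be tuned by hand.

```python
def gradient_based_step(w, theta, grad_w_f, grad_theta_f, eps_w, eps_theta):
    """One alternating ascent step on f(w, c(theta)).
    grad_w_f / grad_theta_f are placeholder callables returning the
    gradients of f w.r.t. w and theta; eps_w and eps_theta are the
    step-size hyper-parameters the practitioner must choose."""
    w = w + eps_w * grad_w_f(w, theta)                  # weight update
    theta = theta + eps_theta * grad_theta_f(w, theta)  # architecture-parameter update
    return w, theta
```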
Contributions: Novel Search Strategy for One-shot NAS
1. Arbitrary search space (categorical + ordinal)
2. Robust against its inputs (hyper-parameters and search space)
Our approach
\max_{w,c} f(w, c) \;\Rightarrow\; \max_{w,\theta} J(w, \theta) := \int f(w, c)\, p(c \mid \theta)\, \mathrm{d}c
1. Stochastic Relaxation: replace the discrete c with a distribution p(c | θ) from an exponential family; the relaxed objective J is then differentiable w.r.t. both w and θ.
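A minimal NumPy sketch of the relaxation for a single categorical variable, assuming θ holds the category probabilities (a member of the exponential family); the function names and the Monte-Carlo sample size are illustrative.

```python
import numpy as np

def sample_c(theta, rng):
    """Draw a one-hot architecture vector c ~ p(c | theta), where theta
    holds the probabilities of a categorical distribution."""
    k = rng.choice(len(theta), p=theta)
    c = np.zeros(len(theta))
    c[k] = 1.0
    return c

def estimate_J(f, w, theta, n_samples=8, rng=None):
    """Monte-Carlo estimate of J(w, theta) = E_{c ~ p(.|theta)}[f(w, c)]."""
    rng = rng or np.random.default_rng(0)
    return float(np.mean([f(w, sample_c(theta, rng)) for _ in range(n_samples)]))
```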
2. Stochastic Natural Gradient + Adaptive Step-Size (see the sketch after the improvement guarantee below)
w^{t+1} = w^t + \epsilon_w^t \, \widehat{\nabla_w J}(w^t, \theta^t)
\theta^{t+1} = \theta^t + \epsilon_\theta^t \, F(\theta^t)^{-1} \, \widehat{\nabla_\theta J}(w^{t+1}, \theta^t) \quad (natural gradient)

(F(θ) is the Fisher information matrix of p(· | θ); hats denote stochastic gradient estimates.)
Monotone improvement under an appropriate step-size:
J(w^t, \theta^t) < J(w^{t+1}, \theta^t) < J(w^{t+1}, \theta^{t+1})
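Here is a minimal sketch of the θ-side natural gradient step for the categorical case, reusing sample_c from the sketch above. It assumes the standard exponential-family fact that, in expectation parameterization, the natural gradient of log p(c | θ) is (c − θ); the paper's adaptive step-size control for eps_theta is omitted.

```python
import numpy as np

def natural_gradient_step(f, w_next, theta, eps_theta, n_samples=2, rng=None):
    """One update theta <- theta + eps_theta * F(theta)^{-1} grad J (sketch).
    For a categorical p(c | theta) in expectation parameterization, the
    per-sample natural gradient of log p(c | theta) is (c - theta), so no
    explicit Fisher-matrix inversion is needed."""
    rng = rng or np.random.default_rng(0)
    cs = [sample_c(theta, rng) for _ in range(n_samples)]  # c_i ~ p(. | theta)
    fs = np.array([f(w_next, c) for c in cs])
    utils = fs - fs.mean()                                 # baseline-subtracted utilities
    nat_grad = sum(u * (c - theta) for u, c in zip(utils, cs)) / n_samples
    theta = theta + eps_theta * nat_grad
    # (c - theta) sums to zero, so theta stays on the simplex up to clipping.
    theta = np.clip(theta, 1e-8, 1.0)
    return theta / theta.sum()
```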
Results and Details: faster than, and competitive in accuracy with, other one-shot NAS methods.
Details will be presented at Poster #53.