Neural Networks: How do we get around only linear solutions?
TRANSCRIPT
![Page 1: Neural Networks How do we get around only linear solutions?](https://reader035.vdocument.in/reader035/viewer/2022062717/56649e205503460f94b0ae7c/html5/thumbnails/1.jpg)
Neural Networks
• How do we get around only linear solutions?
![Page 2: Neural Networks How do we get around only linear solutions?](https://reader035.vdocument.in/reader035/viewer/2022062717/56649e205503460f94b0ae7c/html5/thumbnails/2.jpg)
Neural Networks
• A multi-layer network of linear perceptrons is still linear.
• Non-linear (differentiable) units
  – Logistic or tanh function
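The claim that stacked linear layers stay linear can be checked numerically. The sketch below is an illustration only: the 2x2 sizes and the names `apply`, `multiply` and `checkStackCollapses` are assumptions, not taken from the slides. It shows that applying W1 and then W2 gives the same result as applying the single matrix W2*W1.

```cpp
#include <array>
#include <cassert>
#include <cmath>

// 2-vectors and 2x2 matrices are enough to illustrate the point.
using Vec2 = std::array<double, 2>;
using Mat2 = std::array<Vec2, 2>;

// y = W x: a linear layer with no bias and no activation function.
Vec2 apply(const Mat2& W, const Vec2& x) {
    return { W[0][0] * x[0] + W[0][1] * x[1],
             W[1][0] * x[0] + W[1][1] * x[1] };
}

// C = A B
Mat2 multiply(const Mat2& A, const Mat2& B) {
    Mat2 C{};  // value-initialized to zeros
    for (int r = 0; r < 2; ++r)
        for (int c = 0; c < 2; ++c)
            for (int k = 0; k < 2; ++k)
                C[r][c] += A[r][k] * B[k][c];
    return C;
}

// Self-check: stacking W1 then W2 must match the single matrix W2*W1.
void checkStackCollapses(const Mat2& W1, const Mat2& W2, const Vec2& x) {
    const Vec2 stacked = apply(W2, apply(W1, x));
    const Vec2 single = apply(multiply(W2, W1), x);
    for (int r = 0; r < 2; ++r)
        assert(std::fabs(stacked[r] - single[r]) < 1e-12);
}
```

Putting a logistic or tanh function between the two `apply` calls breaks this equivalence, which is what lets a multi-layer network represent non-linear solutions.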
![Page 3: Neural Networks How do we get around only linear solutions?](https://reader035.vdocument.in/reader035/viewer/2022062717/56649e205503460f94b0ae7c/html5/thumbnails/3.jpg)
• From Galit Haim: http://u.cs.biu.ac.il/~haimga/Teaching/AI/
![Page 4: Neural Networks How do we get around only linear solutions?](https://reader035.vdocument.in/reader035/viewer/2022062717/56649e205503460f94b0ae7c/html5/thumbnails/4.jpg)
Backpropagation Learning Details
• Method for learning weights in feed-forward (FF) nets
• Can’t use Perceptron Learning Rule
  – no teacher values are possible for hidden units
• Use gradient descent to minimize the error
  – propagate deltas to adjust for errors backward, from outputs to hidden layers to inputs
[Figure: forward pass (inputs to outputs) and backward pass (outputs to inputs)]
![Page 5: Neural Networks How do we get around only linear solutions?](https://reader035.vdocument.in/reader035/viewer/2022062717/56649e205503460f94b0ae7c/html5/thumbnails/5.jpg)
Backpropagation Algorithm – Main Idea – error in hidden layers
The ideas of the algorithm can be summarized as follows:
1. Compute the error term for the output units using the observed error.
2. Starting from the output layer, repeat until the earliest hidden layer is reached:
   - propagate the error term back to the previous layer
   - update the weights between the two layers
![Page 6: Neural Networks How do we get around only linear solutions?](https://reader035.vdocument.in/reader035/viewer/2022062717/56649e205503460f94b0ae7c/html5/thumbnails/6.jpg)
Backpropagation Algorithm
• Initialize weights (typically random!)
• Keep doing epochs
  – For each example e in training set do
    • forward pass to compute
      – O = neural-net-output(network, e)
      – miss = (T-O) at each output unit
    • backward pass to calculate deltas to weights
    • update all weights
  – end
• until tuning set error stops improving
Forward pass explained earlier. Backward pass explained in next slide.
![Page 7: Neural Networks How do we get around only linear solutions?](https://reader035.vdocument.in/reader035/viewer/2022062717/56649e205503460f94b0ae7c/html5/thumbnails/7.jpg)
Gradient
• Trying to make error decrease the fastest
• Compute (derivatives of error for weights):
  – GradE = [dE/dw1, dE/dw2, ..., dE/dwn]
• Change i-th weight by (compute deltas):
  – deltawi = -eta * dE/dwi
• We need a derivative!
• Activation function must be continuous, differentiable, non-decreasing, and easy to compute
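As a minimal sketch of the update deltawi = -eta * dE/dwi, the code below applies one gradient step to a weight vector. The quadratic error surface (`toyError`, `toyGradient`) is a hypothetical stand-in chosen only because its gradient is easy to write down; it is not the network error from the slides.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One gradient-descent step: w_i <- w_i - eta * dE/dw_i
void gradientStep(std::vector<double>& w, const std::vector<double>& gradE, double eta) {
    assert(w.size() == gradE.size());
    for (std::size_t i = 0; i < w.size(); ++i)
        w[i] -= eta * gradE[i];
}

// Hypothetical error surface for illustration: E(w) = sum_i (w_i - c_i)^2
double toyError(const std::vector<double>& w, const std::vector<double>& c) {
    assert(w.size() == c.size());
    double e = 0.0;
    for (std::size_t i = 0; i < w.size(); ++i)
        e += (w[i] - c[i]) * (w[i] - c[i]);
    return e;
}

// Its gradient: dE/dw_i = 2 * (w_i - c_i)
std::vector<double> toyGradient(const std::vector<double>& w, const std::vector<double>& c) {
    assert(w.size() == c.size());
    std::vector<double> g(w.size());
    for (std::size_t i = 0; i < w.size(); ++i)
        g[i] = 2.0 * (w[i] - c[i]);
    return g;
}
```

Calling `gradientStep(w, toyGradient(w, c), eta)` repeatedly with a small `eta` moves `w` toward `c`, and the error shrinks at every step.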
![Page 8: Neural Networks How do we get around only linear solutions?](https://reader035.vdocument.in/reader035/viewer/2022062717/56649e205503460f94b0ae7c/html5/thumbnails/8.jpg)
• Sigmoid function: y(x) = 1 / (1 + e^(-x))
• Derivative: y’(x) = y(x)(1-y(x))
[Plot: sig(x) and sig'(x) for x from -6 to 6]
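The identity y’(x) = y(x)(1-y(x)) is what makes the sigmoid cheap to differentiate: the derivative reuses the value already computed in the forward pass. A small sketch (the function names are assumptions for illustration) that compares the closed form with a finite-difference estimate:

```cpp
#include <cassert>
#include <cmath>

// Logistic sigmoid.
double sig(double x) { return 1.0 / (1.0 + std::exp(-x)); }

// Closed-form derivative: sig'(x) = sig(x) * (1 - sig(x)).
double sigPrime(double x) {
    const double y = sig(x);
    return y * (1.0 - y);
}

// Central finite-difference estimate of f'(x).
double numericalDerivative(double (*f)(double), double x, double h = 1e-5) {
    return (f(x + h) - f(x - h)) / (2.0 * h);
}

// Self-check over the range shown in the plot.
void checkSigmoidDerivative() {
    for (double x = -6.0; x <= 6.0; x += 1.0)
        assert(std::fabs(sigPrime(x) - numericalDerivative(sig, x)) < 1e-8);
}
```

The derivative peaks at x = 0, where sig(0) = 0.5 and sig'(0) = 0.25, and falls toward zero at both ends of the range.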
![Page 9: Neural Networks How do we get around only linear solutions?](https://reader035.vdocument.in/reader035/viewer/2022062717/56649e205503460f94b0ae7c/html5/thumbnails/9.jpg)
Updating hidden-to-output
• We have teacher supplied desired values
![Page 10: Neural Networks How do we get around only linear solutions?](https://reader035.vdocument.in/reader035/viewer/2022062717/56649e205503460f94b0ae7c/html5/thumbnails/10.jpg)
Updating interior weights
• Layer k units provide values to all layer k+1 units
• “miss” is sum of misses from all units on k+1
• missj = [ something about weights from k->k+1 and k+1 outputs ]
For layer k+1
![Page 11: Neural Networks How do we get around only linear solutions?](https://reader035.vdocument.in/reader035/viewer/2022062717/56649e205503460f94b0ae7c/html5/thumbnails/11.jpg)
Gradient Descent
• Define objective to minimize error:
$$E(W) = \sum_{d \in D} \sum_{k \in K} (t_{kd} - o_{kd})^2$$
where D is the set of training examples, K is the set of output units, t_kd and o_kd are, respectively, the target and current output for unit k for example d.
• The derivative of a sigmoid unit with respect to net input is:
$$\frac{\partial o_j}{\partial net_j} = o_j (1 - o_j)$$
• Learning rule to change weights to minimize error is:
$$\Delta w_{ji} = -\eta \frac{\partial E}{\partial w_{ji}}$$
![Page 12: Neural Networks How do we get around only linear solutions?](https://reader035.vdocument.in/reader035/viewer/2022062717/56649e205503460f94b0ae7c/html5/thumbnails/12.jpg)
Backpropagation Training Algorithm
Create the 3-layer network with H hidden units with full connectivity between layers. Set weights to small random real values.
Until all training examples produce the correct value (within ε), or mean squared error ceases to decrease, or other termination criteria:
    Begin epoch
    For each training example, d, do:
        Calculate network output for d’s input values
        Compute error between current output and correct output for d
        Update weights by backpropagating error and using learning rule
    End epoch
![Page 13: Neural Networks How do we get around only linear solutions?](https://reader035.vdocument.in/reader035/viewer/2022062717/56649e205503460f94b0ae7c/html5/thumbnails/13.jpg)
[Figure: network layers labeled input (i), hidden (j), output (k)]
• hidden node inputs: netj using wij
• hidden node outputs: xj=sig(netj)
• output node inputs: netk using wjk
• output node outputs: ok=sig(netk)
• error: tk-ok
• Δwjk = η*error*sig’(netk)*xj
       = η*(tk-ok)*sig’(netk)*xj
       = η*(tk-ok)*sig(netk)(1-sig(netk))*xj
• Δwij = η*errorOverOutputs*sig’(netj)*xi
       = η*Σk[(tk-ok)sig’(netk)wjk]*sig’(netj)*xi
       = η*Σk[(tk-ok)sig(netk)(1-sig(netk))wjk]*sig(netj)(1-sig(netj))*xi
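The two update rules above can be put together into one training step. The following is a sketch under stated assumptions rather than the slides' implementation: the `Net` struct, the function names and the absence of bias weights are all choices made here for brevity. It performs a forward pass, computes the output and hidden deltas, and applies Δwjk and Δwij for a single example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;

double sig(double x) { return 1.0 / (1.0 + std::exp(-x)); }

// Hypothetical container: input (i) -> hidden (j) -> output (k), no bias terms.
struct Net {
    Mat wij;  // wij[i][j]: weight from input i to hidden j
    Mat wjk;  // wjk[j][k]: weight from hidden j to output k
};

// Forward pass: fills xj (hidden outputs) and ok (network outputs).
void forward(const Net& n, const Vec& xi, Vec& xj, Vec& ok) {
    assert(!n.wjk.empty() && xi.size() == n.wij.size());
    xj.assign(n.wjk.size(), 0.0);
    ok.assign(n.wjk[0].size(), 0.0);
    for (std::size_t j = 0; j < xj.size(); ++j) {
        double net = 0.0;
        for (std::size_t i = 0; i < xi.size(); ++i) net += n.wij[i][j] * xi[i];
        xj[j] = sig(net);
    }
    for (std::size_t k = 0; k < ok.size(); ++k) {
        double net = 0.0;
        for (std::size_t j = 0; j < xj.size(); ++j) net += n.wjk[j][k] * xj[j];
        ok[k] = sig(net);
    }
}

// One backpropagation update; returns the squared error measured before the update.
double backpropStep(Net& n, const Vec& xi, const Vec& t, double eta) {
    Vec xj, ok;
    forward(n, xi, xj, ok);
    assert(t.size() == ok.size());

    // Output deltas: (tk - ok) * sig'(netk), with sig'(netk) = ok * (1 - ok).
    Vec deltaK(ok.size());
    double err = 0.0;
    for (std::size_t k = 0; k < ok.size(); ++k) {
        deltaK[k] = (t[k] - ok[k]) * ok[k] * (1.0 - ok[k]);
        err += (t[k] - ok[k]) * (t[k] - ok[k]);
    }
    // Hidden deltas: sum_k(deltaK * wjk) * sig'(netj), using the weights before they change.
    Vec deltaJ(xj.size());
    for (std::size_t j = 0; j < xj.size(); ++j) {
        double sum = 0.0;
        for (std::size_t k = 0; k < ok.size(); ++k) sum += deltaK[k] * n.wjk[j][k];
        deltaJ[j] = sum * xj[j] * (1.0 - xj[j]);
    }
    // Apply the updates: wjk += eta * deltaK * xj, then wij += eta * deltaJ * xi.
    for (std::size_t j = 0; j < xj.size(); ++j)
        for (std::size_t k = 0; k < ok.size(); ++k)
            n.wjk[j][k] += eta * deltaK[k] * xj[j];
    for (std::size_t i = 0; i < xi.size(); ++i)
        for (std::size_t j = 0; j < xj.size(); ++j)
            n.wij[i][j] += eta * deltaJ[j] * xi[i];
    return err;
}
```

Computing the hidden deltas before touching `wjk` matters: the error has to be propagated back through the same weights that produced the output.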
![Page 23: Neural Networks How do we get around only linear solutions?](https://reader035.vdocument.in/reader035/viewer/2022062717/56649e205503460f94b0ae7c/html5/thumbnails/23.jpg)
Comments on Training Algorithm
• Not guaranteed to converge to zero training error; may converge to local optima or oscillate indefinitely.
• However, in practice, does converge to low error for many large networks on real data.
• Many epochs (thousands) may be required, hours or days of training for large networks.
• To avoid local-minima problems, run several trials starting with different random weights (random restarts).
  – Take results of trial with lowest training set error.
  – Build a committee of results from multiple trials (possibly weighting votes by training set accuracy).
![Page 24: Neural Networks How do we get around only linear solutions?](https://reader035.vdocument.in/reader035/viewer/2022062717/56649e205503460f94b0ae7c/html5/thumbnails/24.jpg)
• Coarse coding
• Generalization based on features activating
![Page 25: Neural Networks How do we get around only linear solutions?](https://reader035.vdocument.in/reader035/viewer/2022062717/56649e205503460f94b0ae7c/html5/thumbnails/25.jpg)
Size Matters
![Page 26: Neural Networks How do we get around only linear solutions?](https://reader035.vdocument.in/reader035/viewer/2022062717/56649e205503460f94b0ae7c/html5/thumbnails/26.jpg)
Tile coding
![Page 27: Neural Networks How do we get around only linear solutions?](https://reader035.vdocument.in/reader035/viewer/2022062717/56649e205503460f94b0ae7c/html5/thumbnails/27.jpg)
Tile coding, view #2
• Consider a game of soccer
![Page 28: Neural Networks How do we get around only linear solutions?](https://reader035.vdocument.in/reader035/viewer/2022062717/56649e205503460f94b0ae7c/html5/thumbnails/28.jpg)
CMAC
• http://incompleteideas.net/rlai.cs.ualberta.ca/RLAI/RLtoolkit/tilecoding.html
• http://incompleteideas.net/rlai.cs.ualberta.ca/RLAI/RLtoolkit/tiles.html
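To make the idea of tile coding concrete, here is a minimal sketch of uniformly offset grid tilings over a 2-D input. It is not the hashed `GetTiles` routine from the RLtoolkit pages linked above; the function name, the [0, 1) input range and the offset scheme are assumptions made for illustration. Each tiling contributes exactly one active tile, so nearby points share many tiles and distant points share none.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Returns one active tile index per tiling for a point (x, y) in [0, 1) x [0, 1).
// Each tiling is a grid of square tiles shifted by a different fraction of a tile width.
std::vector<int> activeTiles(double x, double y, int numTilings, int tilesPerDim) {
    assert(numTilings > 0 && tilesPerDim > 0);
    assert(x >= 0.0 && x < 1.0 && y >= 0.0 && y < 1.0);
    const double tileWidth = 1.0 / tilesPerDim;
    const int gridSide = tilesPerDim + 1;  // one extra row/column absorbs the offset
    std::vector<int> tiles;
    tiles.reserve(numTilings);
    for (int t = 0; t < numTilings; ++t) {
        const double offset = tileWidth * t / numTilings;
        const int col = static_cast<int>(std::floor((x + offset) / tileWidth));
        const int row = static_cast<int>(std::floor((y + offset) / tileWidth));
        // Give every tiling its own block of indices so the tiles never collide.
        tiles.push_back(t * gridSide * gridSide + row * gridSide + col);
    }
    return tiles;
}
```

With a weight per tile index, the value estimate is the sum of the weights of the active tiles, which is the form `CMAC::computeQ` takes in the Keepaway excerpts that follow.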
![Page 29: Neural Networks How do we get around only linear solutions?](https://reader035.vdocument.in/reader035/viewer/2022062717/56649e205503460f94b0ae7c/html5/thumbnails/29.jpg)
(Code excerpts from Peter Stone: Keepaway)
void CMAC::loadTiles()
{
    int tilingsPerGroup = TILINGS_PER_GROUP; /* num tilings per tiling group */
    numTilings = 0;

    /* These are the 'tiling groups' -- play here with representations */
    /* One tiling for each state variable */
    for ( int v = 0; v < getNumFeatures(); v++ ) {
        for ( int a = 0; a < getNumActions(); a++ ) {
            GetTiles1( &(tiles[ a ][ numTilings ]), tilingsPerGroup, colTab,
                       state[ v ] / getResolution( v ), a, v );
        }
        numTilings += tilingsPerGroup;
    }
    if ( numTilings > RL_MAX_NUM_TILINGS )
        cerr << "TOO MANY TILINGS! " << numTilings << endl;
}
![Page 30: Neural Networks How do we get around only linear solutions?](https://reader035.vdocument.in/reader035/viewer/2022062717/56649e205503460f94b0ae7c/html5/thumbnails/30.jpg)
double CMAC::computeQ( int action )
{
    double q = 0;
    for ( int j = 0; j < numTilings; j++ ) {
        q += weights[ tiles[ action ][ j ] ];
    }
    return q;
}

int SarsaAgent::selectAction()
{
    int action;

    // Epsilon-greedy
    if ( drand48() < epsilon ) {
        action = range_rand( getNumActions() ); // explore
    }
    else {
        action = argmaxQ(); // exploit
    }
    return action;
}
![Page 31: Neural Networks How do we get around only linear solutions?](https://reader035.vdocument.in/reader035/viewer/2022062717/56649e205503460f94b0ae7c/html5/thumbnails/31.jpg)
int SarsaAgent::startEpisode( double state[] )
{
    FA->decayTraces( 0 );
    FA->setState( state );

    for ( int a = 0; a < getNumActions(); a++ ) {
        Q[ a ] = FA->computeQ( a );
    }

    lastAction = selectAction();

    FA->updateTraces( lastAction );
    return lastAction;
}

void SarsaAgent::endEpisode( double reward )
{
    double delta = reward - Q[ lastAction ];
    FA->updateWeights( delta, alpha );
    lastAction = -1;
}
![Page 32: Neural Networks How do we get around only linear solutions?](https://reader035.vdocument.in/reader035/viewer/2022062717/56649e205503460f94b0ae7c/html5/thumbnails/32.jpg)
int SarsaAgent::step( double reward, double state[] )
{
    double delta = reward - Q[ lastAction ];

    FA->setState( state );

    for ( int a = 0; a < getNumActions(); a++ ) {
        Q[ a ] = FA->computeQ( a );
    }

    lastAction = selectAction();

    delta += Q[ lastAction ];
    FA->updateWeights( delta, alpha );

    // need to redo because weights changed
    Q[ lastAction ] = FA->computeQ( lastAction );
    FA->decayTraces( gamma * lambda );

    for ( int a = 0; a < getNumActions(); a++ ) { // clear other than F[a]
        if ( a != lastAction ) {
            FA->clearTraces( a );
        }
    }
    FA->updateTraces( lastAction ); // replace/set traces F[a]

    return lastAction;
}
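Reading `step()` and `endEpisode()` above, the two assignments to `delta` appear to add up to an undiscounted Sarsa temporal-difference error, with `gamma` showing up only in the trace decay. The symbols below are notation assumed here, not taken from the excerpt: s, a are the previous state and action, s', a' the new ones, and e the eligibility-trace vector. The weight update itself is hidden inside `updateWeights`, so only the error and the decay are written out:

```latex
\delta_{\text{step}} = r + Q(s', a') - Q(s, a),
\qquad
\delta_{\text{end}} = r - Q(s, a),
\qquad
e \leftarrow \gamma \lambda \, e
```

At the end of an episode there is no next action, which is why `endEpisode` drops the Q(s', a') term.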
![Page 33: Neural Networks How do we get around only linear solutions?](https://reader035.vdocument.in/reader035/viewer/2022062717/56649e205503460f94b0ae7c/html5/thumbnails/33.jpg)
But, how do you pick the “coarseness”?
Adaptive tile coding
IFSA
![Page 34: Neural Networks How do we get around only linear solutions?](https://reader035.vdocument.in/reader035/viewer/2022062717/56649e205503460f94b0ae7c/html5/thumbnails/34.jpg)
Irregular tilings
![Page 35: Neural Networks How do we get around only linear solutions?](https://reader035.vdocument.in/reader035/viewer/2022062717/56649e205503460f94b0ae7c/html5/thumbnails/35.jpg)
Radial Basis Functions
• Instead of binary, have degrees of activation
• Can combine with tile coding!
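The contrast with binary tiles can be sketched with Gaussian radial basis features, where each feature's activation falls off smoothly with distance from its center. The Gaussian shape, the shared `width` parameter and the function names are assumptions chosen for illustration; the slide does not fix a particular basis function.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Gaussian radial basis activation: 1 at the center, decaying smoothly with distance.
double rbfActivation(double x, double center, double width) {
    assert(width > 0.0);
    const double d = (x - center) / width;
    return std::exp(-0.5 * d * d);
}

// Feature vector with one graded activation per center (instead of a 0/1 tile).
std::vector<double> rbfFeatures(double x, const std::vector<double>& centers, double width) {
    std::vector<double> phi;
    phi.reserve(centers.size());
    for (double c : centers)
        phi.push_back(rbfActivation(x, c, width));
    return phi;
}
```

For example, `rbfFeatures(0.3, {0.0, 0.5, 1.0}, 0.25)` gives the largest activation to the center at 0.5, a smaller one to the center at 0.0, and a value near zero to the center at 1.0.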
![Page 36: Neural Networks How do we get around only linear solutions?](https://reader035.vdocument.in/reader035/viewer/2022062717/56649e205503460f94b0ae7c/html5/thumbnails/36.jpg)
• Kanerva Coding: choose “prototype states” and consider distance from prototype states
• Now, updates depend on number of features, not number of dimensions
• Instance-based methods
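A rough sketch of the Kanerva coding idea from the first bullet: each prototype state defines one binary feature that is active when the current state lies close enough to it. Using Euclidean distance and a single `radius` threshold is an assumption made here for continuous states; the names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Point = std::vector<double>;

// Euclidean distance between two states of the same dimension.
double euclideanDistance(const Point& a, const Point& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double d = a[i] - b[i];
        sum += d * d;
    }
    return std::sqrt(sum);
}

// Binary features: feature p is 1 when the state is within `radius` of prototype p.
std::vector<int> kanervaFeatures(const Point& state,
                                 const std::vector<Point>& prototypes,
                                 double radius) {
    std::vector<int> features;
    features.reserve(prototypes.size());
    for (const Point& p : prototypes)
        features.push_back(euclideanDistance(state, p) <= radius ? 1 : 0);
    return features;
}
```

The length of the feature vector equals the number of prototypes regardless of how many dimensions each state has, which is the point of the second bullet.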
![Page 37: Neural Networks How do we get around only linear solutions?](https://reader035.vdocument.in/reader035/viewer/2022062717/56649e205503460f94b0ae7c/html5/thumbnails/37.jpg)
Fitted R-Max [Jong and Stone, 2007]
• Instance-based RL method [Ormoneit & Sen, 2002]
• Handles continuous state spaces
• Weights recorded transitions by distances
• Plans over discrete, abstract MDP
Example: 2 state variables, 1 action
[Figure: state space with axes x and y]
![Page 41: Neural Networks How do we get around only linear solutions?](https://reader035.vdocument.in/reader035/viewer/2022062717/56649e205503460f94b0ae7c/html5/thumbnails/41.jpg)
Mountain-Car Task
![Page 42: Neural Networks How do we get around only linear solutions?](https://reader035.vdocument.in/reader035/viewer/2022062717/56649e205503460f94b0ae7c/html5/thumbnails/42.jpg)
3D Mountain Car
• X: position and acceleration
• Y: position and acceleration
http://eecs.wsu.edu/~taylorm/traj.gif
![Page 43: Neural Networks How do we get around only linear solutions?](https://reader035.vdocument.in/reader035/viewer/2022062717/56649e205503460f94b0ae7c/html5/thumbnails/43.jpg)
• Control with FA
• Bootstrapping
![Page 44: Neural Networks How do we get around only linear solutions?](https://reader035.vdocument.in/reader035/viewer/2022062717/56649e205503460f94b0ae7c/html5/thumbnails/44.jpg)
Efficiency in ML / AI
1. Data efficiency (rate of learning)
2. Computational efficiency (memory, computation, communication)
3. Researcher efficiency (autonomy, ease of setup, parameter tuning, priors, labels, expertise)
![Page 45: Neural Networks How do we get around only linear solutions?](https://reader035.vdocument.in/reader035/viewer/2022062717/56649e205503460f94b0ae7c/html5/thumbnails/45.jpg)
Sutton, R. S., Maei, H. R., Precup, D., Bhatnagar, S., Silver, D., Szepesvari, Cs., Wiewiora, E. Fast gradient-descent methods for temporal-difference learning with linear function approximation.
• ICML-09.
• Sutton, Szepesvari and Maei (2009) recently introduced the first temporal-difference learning algorithm compatible with both linear function approximation and off-policy training, and whose complexity scales only linearly in the size of the function approximator.
• We introduce two new related algorithms with better convergence rates. The first algorithm, GTD2, is derived and proved convergent just as GTD was, but uses a different objective function and converges significantly faster (but still not as fast as conventional TD).
• The second new algorithm, linear TD with gradient correction, or TDC, uses the same update rule as conventional TD except for an additional term which is initially zero.
• In our experiments on small test problems and in a Computer Go application with a million features, the learning rate of this algorithm was comparable to that of conventional TD.
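The TDC bullet above can be illustrated with a short sketch of one update for a linear value estimate V(s) = theta . phi(s). The variable names (`theta`, `w`, `phi`, `phiNext`, `alpha`, `beta`) and the function signature are assumptions made here. The first part of the `theta` update is the conventional TD step; the second part is the correction term, which stays at zero as long as the secondary weights `w` are zero.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

double dot(const Vec& a, const Vec& b) {
    assert(a.size() == b.size());
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// One TDC update. theta: primary weights; w: secondary weights (start at zero);
// phi / phiNext: feature vectors of the current and next state.
// Returns the TD error delta.
double tdcUpdate(Vec& theta, Vec& w, const Vec& phi, const Vec& phiNext,
                 double reward, double gamma, double alpha, double beta) {
    assert(theta.size() == phi.size() && w.size() == phi.size());
    assert(phiNext.size() == phi.size());
    const double delta = reward + gamma * dot(theta, phiNext) - dot(theta, phi);
    const double phiW = dot(phi, w);
    for (std::size_t i = 0; i < theta.size(); ++i) {
        // Conventional TD term minus the gradient-correction term.
        theta[i] += alpha * (delta * phi[i] - gamma * phiNext[i] * phiW);
    }
    for (std::size_t i = 0; i < w.size(); ++i) {
        // Secondary weights track the expected TD error given the features.
        w[i] += beta * (delta - phiW) * phi[i];
    }
    return delta;
}
```

On the very first call with `w` all zeros, the update to `theta` is identical to conventional linear TD(0); the correction only starts to act once `w` has moved away from zero.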
![Page 46: Neural Networks How do we get around only linear solutions?](https://reader035.vdocument.in/reader035/viewer/2022062717/56649e205503460f94b0ae7c/html5/thumbnails/46.jpg)
van Seijen, H., Sutton, R. S. True online TD(λ)
• ICML-14
• The appeal of TD(λ) comes from its equivalence to a clear and conceptually simple forward view, and the fact that it can be implemented online in an inexpensive manner.
• Equivalence between TD(λ) and the forward view is exact only for the off-line version of the algorithm (in which updates are made only at the end of each episode). In the online version of TD(λ) (in which updates are made at each step, which generally performs better and is always used in applications) the match to the forward view is only approximate.
• In this paper we introduce a new forward view that takes into account the possibility of changing estimates and a new variant of TD(λ) that exactly achieves it.
• In our empirical comparisons, our algorithm outperformed TD(λ) in all of its variations. It seems, by adhering more truly to the original goal of TD(λ)—matching an intuitively clear forward view even in the online case—that we have found a new algorithm that simply improves on classical TD(λ).