Connectionist Networks

[Based on class handout of 9/10/98.]

Functions and Examples of Functions

(1) 2 Inputs, 1 Output

4, 4 8
2, 3 5
1, 9 10
6, 7 13
341, 257 598
Function is plus

(2) Input Output

rock rock
sing sing
alqz alqz
dark dark
lamb lamb
Function is identity

(3) 2 Inputs, 1 Output

0 0 0
1 0 0
0 1 0
1 1 1
Function is AND: output is "on" only if input 1 and input 2 are "on"

(4) Input Output

look looked
rake raked
sing sang
go went
want wanted
Function is past tense: output is past tense of input stem

(5) Input Output

John left 1
Wallace fed Gromit 1
Fed Wallace Gromit 0
Who do you like Mary and? 0
Function is grammaticality in English: output is 1 if the input is a grammatical English sentence, 0 otherwise

Questions about Neural Networks that should not be confused with one another...

How can a Simple Network Represent a Simple Function?

AND network

Input Output
0 0 0
1 0 0
0 1 0
1 1 1
Error Curve: Training for 20000 sweeps

Sigmoid Activation Function

Network Weights After 20,000 sweeps of training

Output activations
using and.20000.wts and and.data (Training Set)
0.001
0.081
0.083
0.903

Description of Solution

  • The goal of the network is for the output to be "on" only when both of the inputs are "on". With the sigmoid activation function, the output node is "on" (activation above 0.5) when its net input is above zero, and "off" when its net input is below zero. The bias node has a strong negative weight, so positive input from either one of the inputs alone is not enough to override the effect of the bias and bring the net input above zero. Only when both inputs are sending activation to the output node is the effect of the bias overcome.
  • OR Network

    NETWORK CONFIGURED BY TLEARN
    # weights after 10000 sweeps
    # WEIGHTS
    # TO NODE 1
    -1.9083807468 ## bias to 1
    4.3717832565 ## i1 to 1
    4.3582129478 ## i2 to 1
    0.0000000000
  • The OR problem can be solved by making a simple change to the network that solved AND. In the AND network, the negative bias was larger in magnitude than either of the input weights individually. For the output to be "on" whenever either of the input nodes is on, all that is needed is for the negative bias to be smaller in magnitude than each of the positive weights on the inputs. In the weights listed above, the bias weight is about -1.91 and each input weight is about 4.4. In this way, positive activation from either input is enough to override the bias and turn the output node "on".
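The OR description above can be checked by hand. The following is a minimal sketch that plugs the three OR weights from the tlearn listing into a single sigmoid unit. It assumes the standard logistic sigmoid and a bias node whose activation is always 1; the function and variable names are made up for illustration and are not part of tlearn.

```python
import math

# Weights copied from the tlearn listing above (OR network, 10000 sweeps).
BIAS = -1.9083807468   # bias to node 1
W_I1 = 4.3717832565    # i1 to node 1
W_I2 = 4.3582129478    # i2 to node 1


def sigmoid(net):
    """Logistic sigmoid (assumed form): 0.5 at net input zero."""
    return 1.0 / (1.0 + math.exp(-net))


def or_output(i1, i2):
    """Activation of the output node for one input pattern."""
    net = BIAS + W_I1 * i1 + W_I2 * i2   # bias node is assumed to output 1
    return sigmoid(net)


PATTERNS = [(0, 0), (1, 0), (0, 1), (1, 1)]
or_activations = [or_output(i1, i2) for i1, i2 in PATTERNS]

for (i1, i2), act in zip(PATTERNS, or_activations):
    print(i1, i2, round(act, 3), "on" if act > 0.5 else "off")
```

Only the pattern 0 0 leaves the net input below zero, because either input weight on its own outweighs the bias.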

    Why the XOR network fails in a 2-layer Network

  • In order for the network to model the XOR function, we need activation of either of the inputs to turn the output node "on", just as in the OR network. This was achieved easily by making the negative weight on the bias smaller in magnitude than the positive weight on either of the inputs. However, in the XOR network we also want turning both inputs on to turn the output node "off". Both input weights must be positive for either input alone to switch the output "on", so turning both inputs on can only increase the total input to the output node. Since the output is switched "off" only when it receives less input, this effect cannot be achieved.
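The argument above can be illustrated with a brute-force search. This is a sketch only: it treats the output node as "on" when its net input is above zero (as described for the sigmoid earlier), and it tries every combination of bias and input weights on a coarse grid. The grid range and step size are arbitrary choices made for this example.

```python
import itertools

PATTERNS = [(0, 0), (1, 0), (0, 1), (1, 1)]
XOR_TARGETS = [0, 1, 1, 0]
OR_TARGETS = [0, 1, 1, 1]


def solves(bias, w1, w2, targets):
    """True if a single output node gets every pattern right."""
    for (i1, i2), target in zip(PATTERNS, targets):
        net = bias + w1 * i1 + w2 * i2
        if (net > 0) != bool(target):
            return False
    return True


# Hypothetical search grid: -10.0 to 10.0 in steps of 0.5.
GRID = [x / 2.0 for x in range(-20, 21)]


def count_solutions(targets):
    """Number of (bias, w1, w2) grid points that solve the problem."""
    return sum(
        1
        for bias, w1, w2 in itertools.product(GRID, repeat=3)
        if solves(bias, w1, w2, targets)
    )


or_solutions = count_solutions(OR_TARGETS)
xor_solutions = count_solutions(XOR_TARGETS)
print("OR solutions on grid: ", or_solutions)
print("XOR solutions on grid:", xor_solutions)
```

A grid search is not a proof, but the algebra behind it is short. XOR needs bias ≤ 0, bias + w1 > 0, bias + w2 > 0 and bias + w1 + w2 ≤ 0. Adding the two middle conditions gives 2·bias + w1 + w2 > 0, so bias + w1 + w2 > -bias ≥ 0, which contradicts the last condition.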
  • Multiple-layer Networks can learn more complex functions?

    XOR Network

     

    NETWORK CONFIGURED BY TLEARN
    # weights after 5000 sweeps
    # WEIGHTS
    # TO NODE 1
    -3.0456776619 ## bias to 1
    5.5165352821 ## i1 to 1
    -5.7562727928 ## i2 to 1
    0.0000000000 ## only ‘1 0’ turns node 1 "on"
    0.0000000000
    0.0000000000
    # TO NODE 2
    -3.6789164543 ## bias to 2
    -6.4448370934 ## i1 to 2
    6.4957633018 ## i2 to 2
    0.0000000000 ## only ‘0 1’ turns node 2 "on"
    0.0000000000
    0.0000000000
    # TO NODE 3
    -4.4429202080 ## bias to output
    0.0000000000
    0.0000000000
    9.0652370453 ## 1 to output
    8.9045801163 ## 2 to output
    0.0000000000 ## these connections like OR network
    ## except 1 & 2 are never both "on"
  • Output activations
  • using xor.5000.wts and xor.data (Training Set)
    0.022
    0.980
    0.981
    0.020
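The output activations above can be reproduced from the weight listing. The sketch below assumes the standard logistic sigmoid, a bias node fixed at 1, and the pattern order 0 0, 1 0, 0 1, 1 1 (the order used in the tables earlier in this handout); the function names are hypothetical and not part of tlearn.

```python
import math

# Weights copied from the tlearn listing above (XOR network, 5000 sweeps).
# Each tuple is (bias, weight from first source, weight from second source).
NODE_1 = (-3.0456776619, 5.5165352821, -5.7562727928)   # from i1, i2
NODE_2 = (-3.6789164543, -6.4448370934, 6.4957633018)   # from i1, i2
NODE_3 = (-4.4429202080, 9.0652370453, 8.9045801163)    # from nodes 1, 2


def sigmoid(net):
    """Logistic sigmoid (assumed form)."""
    return 1.0 / (1.0 + math.exp(-net))


def unit(weights, a, b):
    """Activation of one node given its two incoming activations."""
    bias, w_a, w_b = weights
    return sigmoid(bias + w_a * a + w_b * b)


def xor_output(i1, i2):
    """Forward pass: inputs -> hidden nodes 1 and 2 -> output node 3."""
    h1 = unit(NODE_1, i1, i2)   # "on" only for input 1 0
    h2 = unit(NODE_2, i1, i2)   # "on" only for input 0 1
    return unit(NODE_3, h1, h2)


PATTERNS = [(0, 0), (1, 0), (0, 1), (1, 1)]
xor_activations = [xor_output(i1, i2) for i1, i2 in PATTERNS]

for (i1, i2), act in zip(PATTERNS, xor_activations):
    print(i1, i2, round(act, 3))
```

The output node behaves like the OR network applied to the two hidden nodes, which are never both "on" at the same time.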

    Solution

    Learning

    The fact that there is a network configuration which deals quite well with XOR does not guarantee that a network starting with random weights and using a simple learning algorithm will be able to find this configuration.

     

    Learning Parameters: Ways of Searching for Solutions

    Random Seed

    The random seed is of little importance, except as a tool in helping you use the simulator.

    Except where explicitly specified, the networks always begin training with all weights between nodes set randomly. Ideally the network will solve the learning problem from any random starting point, but different starting points are liable to lead to slightly different solutions and slightly different courses of development. The seed is a number that is used to initialize the random number generator; when the same random seed is used, the same set of random initial weights is produced. By allowing you to control the random initial settings in this way, the tlearn program makes it easier to replicate earlier demonstrations by yourself and others.
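The role of the seed can be sketched with Python's own random number generator. This is not how tlearn generates its weights; the function name, the number of weights (9, as in the XOR network with its biases) and the range of -1 to 1 are assumptions made for illustration.

```python
import random


def initial_weights(seed, n_weights=9, low=-1.0, high=1.0):
    """Return a list of random starting weights determined by the seed."""
    rng = random.Random(seed)   # private generator, seeded explicitly
    return [rng.uniform(low, high) for _ in range(n_weights)]


run_a = initial_weights(seed=1)
run_b = initial_weights(seed=1)   # same seed: identical starting point
run_c = initial_weights(seed=2)   # different seed: different starting point

print(run_a == run_b)
print(run_a == run_c)
```

Because the starting weights are fully determined by the seed, reporting the seed alongside the other training parameters is enough for someone else to rerun the same simulation.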

    Learning Rate

    The learning rate, which is explained in chapter 1 (pp. 12-13), is a training parameter that determines how strongly the network responds to an error signal at each training cycle. The higher the learning rate, the bigger the weight change the network will make in response to a given error. A high learning rate is sometimes beneficial, but at other times it can be quite disastrous for the network. You can see an example of sensitivity to learning rate in the case of the XOR network discussed in chapter 4.

    Why should it be a bad thing to make big corrections in response to big errors? The reason is that the network is looking for the best general solution to mapping all of the input-output pairs, but it normally adjusts weights in response to an individual input-output pair. Since the network has no knowledge of how representative any individual input-output pair is of the general trend in the training set, it would be rash for the network to respond too strongly to any individual error signal. By making many small responses to the error signals, the network learns a bit more slowly, but it is protected against being thrown off by outliers in the data.
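How the learning rate scales a single correction can be sketched as follows. The handout does not give tlearn's exact update rule, so this example assumes the standard delta rule for one sigmoid output node with squared error; all names are hypothetical.

```python
import math


def sigmoid(net):
    """Logistic sigmoid (assumed form)."""
    return 1.0 / (1.0 + math.exp(-net))


def weight_changes(weights, inputs, target, learning_rate):
    """Weight changes for ONE input-output pair (assumed delta rule).

    weights[0] is the bias weight; the bias node always outputs 1.
    """
    x = [1.0] + list(inputs)
    out = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
    delta = (target - out) * out * (1.0 - out)   # error times sigmoid slope
    return [learning_rate * delta * xi for xi in x]


weights = [0.0, 0.0, 0.0]           # bias, i1, i2
pattern, target = (1, 1), 1.0       # one training pair from AND

small_step = weight_changes(weights, pattern, target, learning_rate=0.1)
large_step = weight_changes(weights, pattern, target, learning_rate=1.0)

print(small_step)
print(large_step)
```

The same error signal produces a correction ten times larger at the higher learning rate, which is why a single unrepresentative pattern can pull the weights a long way.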

    Momentum

    Just as with learning rate, sometimes the learning algorithm can only find a good solution to a problem if the momentum training parameter is set to a specific value. What does this mean, and why should it make a difference?

    If momentum is set to a high value, then the weight changes made by the network are very similar from one cycle to the next. If momentum is set to a low value, then the weight changes made by the network can be very different on adjacent cycles. So what?

    The problem that the network faces is that it is searching for the best available configuration to model the training data, but it has no ‘knowledge’ of what the best solution is, or even whether there is a particularly good solution at all. It therefore needs some efficient and reliable way of searching the range of possible weight-configurations for the best available solution.

    One thing that can be done is for the network to test whether any small changes to its current weight-configuration lead to improved performance. If so, then it can make that change. Then it can ask the same question in its new weight-configuration, and again modify the weights if there is a small change that leads to improvement. This is a fairly effective way for a blind search to proceed, but it has an inherent danger: the network might come across a weight-configuration which is better than all very similar configurations, but is not the best configuration of all. In this situation, the network finds that no small change improves performance, and will therefore not modify its weights. It therefore ‘thinks’ that it has reached an optimal solution, but this is an incorrect conclusion. This problem is known as a local maximum (of performance) or, equivalently, a local minimum (of error).
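The search procedure just described can be sketched for a single weight. The error curve below is a made-up function with two dips, chosen only so that one dip is shallower than the other; it is not taken from any network in this handout.

```python
def error(w):
    """Hypothetical error curve: shallow dip near w = 1.13, deep dip near w = -1.30."""
    return w ** 4 - 3 * w ** 2 + w


def small_step_search(start, step=0.01, max_steps=10000):
    """Keep making the small change that lowers the error, until none does."""
    w = start
    for _ in range(max_steps):
        best = min((w - step, w, w + step), key=error)
        if best == w:        # no small change improves performance
            break
        w = best
    return w


stuck = small_step_search(start=2.0)     # ends in the shallow dip
lucky = small_step_search(start=-2.0)    # ends in the deep dip

print(round(stuck, 2), round(error(stuck), 2))
print(round(lucky, 2), round(error(lucky), 2))
```

Starting from 2.0, the search stops in the shallow dip even though a much lower error exists elsewhere; from where it stands, every small change makes things worse.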

    Momentum can serve to help the network avoid local maxima, by controlling the ‘scale’ at which the search for a solution proceeds. If momentum is set high, then changes in the weight-configuration are very similar from one cycle to the next. A consequence of this is that early in training, when error levels are typically high, weight changes will be consistently large. Because weight changes are forced to be large, this can help the network avoid getting trapped in a local maximum.
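The effect of momentum on successive weight changes can be sketched with the classic formulation, in which each weight change is the learning-rate step plus a fraction of the previous change. The handout does not state tlearn's exact formula, so this form, the gradient values, and the names are assumptions for illustration.

```python
def weight_steps(gradients, learning_rate, momentum):
    """Successive changes to one weight under the assumed momentum rule."""
    steps = []
    previous = 0.0
    for gradient in gradients:
        # New change = step down the error gradient + carried-over change.
        step = -learning_rate * gradient + momentum * previous
        steps.append(step)
        previous = step
    return steps


# Hypothetical error gradients: three cycles agree, the fourth disagrees.
gradients = [1.0, 1.0, 1.0, -1.0]

low_momentum = weight_steps(gradients, learning_rate=0.1, momentum=0.0)
high_momentum = weight_steps(gradients, learning_rate=0.1, momentum=0.9)

print([round(s, 4) for s in low_momentum])
print([round(s, 4) for s in high_momentum])
```

With momentum at zero, the fourth weight change reverses direction as soon as the gradient does. With momentum at 0.9, the changes grow while the gradients agree and the fourth change still points the same way as the earlier ones, so the search carries on past the point where a purely local test would have turned back.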