AN05: Simple Example of the Feedforward and Backpropagation (Gradient Descent) Algorithm for an Artificial Neural Network
Artificial Neural Network (ANN)
The most common architecture for an ANN is the feedforward ANN. In a feedforward ANN, the neurons are arranged in layers, and these layers are stacked one after another. The nodes in a particular layer are connected only to nodes in the previous and next layers. The first layer receives input from an external source and is therefore called the input layer. The number of nodes in the input layer is equal to the total number of inputs. Input nodes do not have an activation function.
The last layer delivers output to an external source and is therefore called the output layer. The number of nodes in the output layer is equal to the number of outputs. The activation function for the output layer is chosen based on the nature of the output (e.g., regression or classification). The layers between the input and output layers are not accessible to the external source and are therefore called hidden layers. The input received by the input layer is passed on to the next layer, and so on, until the output layer. Because the input flows through this network in the forward direction, this architecture is called a feedforward ANN. Each link between nodes has its own weight. Thus, feedforward ANNs form directed weighted graphs.
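To make the flow concrete, below is a minimal sketch of one forward pass in Python, assuming a 2-3-1 network with sigmoid activations; the layer sizes, activation choice, and input values are illustrative assumptions, not prescribed by this note.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative sizes: 2 inputs -> 3 hidden nodes -> 1 output
rng = np.random.default_rng(0)
W1 = rng.standard_normal((3, 2))   # weights: input layer -> hidden layer
b1 = rng.standard_normal(3)        # biases of the hidden nodes
W2 = rng.standard_normal((1, 3))   # weights: hidden layer -> output layer
b2 = rng.standard_normal(1)        # bias of the output node

x = np.array([0.5, -1.2])          # input from the external source

# Forward pass: each layer's output becomes the next layer's input
h = sigmoid(W1 @ x + b1)           # hidden layer activations
y = sigmoid(W2 @ h + b2)           # output layer activation
print(y)
```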
Two main building blocks of an Artificial Neural Network
Connections (Links)
Connections are the transfer elements of a neural network. They transfer activations (signals) from one neuron to another. Each link between neurons has a weight associated with it: w_ij is the weight of the connection between neuron i and neuron j. Depending on the weight of the link, the output activation (output) of the preceding neuron is amplified or attenuated and transferred to the succeeding neurons as input activation (input).
Neurons (Nodes or Units)
Neurons are the main processing units of an Artificial Neural Network. They
(i) combine the inputs received from different neurons using a Summation Function, and
(ii) transform the result using an Activation Function.
At the end of the axon are highly complex and specialized structures called synapses.
Synapse - The point of contact between adjacent neurons, where nerve impulses are transmitted from one neuron to another.
Summation Function
A function that combines the various input activations into a single activation for a neuron.
Activation Function
The summation function for a particular node combines the activations received from all the nodes in the previous layer (as a weighted sum) and adds the bias corresponding to the node. The output of the summation function acts as the input to the activation function, which transforms it, usually nonlinearly, into the node's output activation.
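As a sketch of a single neuron's computation (the sigmoid is an illustrative choice of activation function; the input values, weights, and bias below are made up for the example):

```python
import math

def summation(inputs, weights, bias):
    # Weighted sum of the incoming activations plus the node's bias
    return sum(w * x for w, x in zip(weights, inputs)) + bias

def activation(z):
    # Sigmoid, squashing the summed input into the range (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

inputs  = [0.8, 0.2, -0.5]   # activations from the previous layer
weights = [0.4, -0.7, 0.1]   # weights of the links into this neuron
bias    = 0.05               # bias of this neuron

z = summation(inputs, weights, bias)
a = activation(z)            # this neuron's output activation
```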
How does learning happen in an ANN?
When an ANN is generated, all the weights and biases are initialized randomly. The predicted output of such an ANN for any input will, in general, have a large error with respect to the actual or target output. So the next question is: "How does an ANN learn to predict the correct output?"
Learning in an ANN is also inspired by the way humans learn. One way in which humans learn is under the guidance or supervision of a teacher. This type of learning, under the supervision of the actual or expected answer, is called supervised learning. In the same way, an ANN learns under the supervision of the actual or expected output, i.e., in a supervised manner.
Unlike solving well-defined mathematical problems, where our answer is exactly the same as the actual or expected answer, an ANN in general may not be able to predict exactly the same output as the actual or expected output for every input. In general, there will always be some error, and the objective of learning is to come up with weights for the links and biases for the nodes that minimize the cumulative error over all input-output pairs in the training dataset.
The function that measures this cumulative error is called the Loss Function or Cost Function, and our objective is to minimize it. Because minimizing it is the objective of training, the loss function is also called the objective function. There are several loss functions we can use for training ANNs, depending on the type of problem being solved.
Mean Squared Error (MSE or L2 Loss)
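For a training set of N examples, where y_i is the actual (target) output and ŷ_i is the predicted output for the i-th example, MSE has the standard definition

MSE = (1/N) · Σ_{i=1}^{N} (y_i - ŷ_i)²

Squaring makes every error non-negative and penalizes large deviations more heavily than small ones.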
Gradient Descent Algorithm
Gradient is another term for slope; for higher-dimensional surfaces, the gradient is the direction of maximum slope. Gradient descent means moving downward in the direction of steepest slope, and the gradient descent algorithm is an iterative algorithm for finding the minimum of the loss function.
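In each iteration, every weight and bias is adjusted by a small step against the gradient of the loss. For a weight w (and likewise a bias b), the standard update rule is

w ← w - η · ∂L/∂w

where L is the loss function and η is the step size, called the learning rate, discussed next.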
Step Size or Learning Rate
The learning rate is a hyperparameter of the optimization algorithm that determines the step size of the iterative procedure used to find the minimum of the loss function.
A higher learning rate allows the model to learn faster, i.e., update the weights and biases faster, at the risk of settling on a sub-optimal solution. A smaller learning rate can yield a more optimal solution, but it may take significantly longer to reach that solution.
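As a concrete sketch, the following Python snippet runs gradient descent on a simple one-dimensional loss L(w) = (w - 3)², whose minimum is at w = 3; the loss function, starting point, learning rate, and step count are illustrative assumptions.

```python
def loss_gradient(w):
    return 2 * (w - 3)      # dL/dw for L(w) = (w - 3)^2

w = 0.0                     # initial (random-ish) weight
learning_rate = 0.1         # try 0.9 (fast, oscillates) or 0.01 (slow)

for step in range(50):
    w = w - learning_rate * loss_gradient(w)

print(w)                    # close to the minimum at w = 3
```

With learning_rate = 0.1, each step shrinks the distance to the minimum by a factor of 0.8, so 50 steps land very close to w = 3.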
Composition of a Function
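A feedforward ANN computes a composition of functions: each layer applies its summation and activation functions to the outputs of the previous layer, so the network's output has the form f_L(... f_2(f_1(x)) ...). The derivative of a composite function is given by the chain rule,

d/dx f(g(x)) = f'(g(x)) · g'(x)

and applying this rule repeatedly, layer by layer, is what backpropagation does.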
Backpropagation
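Putting the pieces together, below is a minimal end-to-end sketch in Python of backpropagation with gradient descent, assuming a small 2-4-1 network with sigmoid activations and MSE loss; the XOR dataset, layer sizes, learning rate, and epoch count are illustrative assumptions, not taken from this note. Backpropagation applies the chain rule layer by layer, from the output back toward the input, to obtain the gradient of the loss with respect to every weight and bias; gradient descent then updates the parameters using those gradients.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative dataset: XOR (inputs X, target outputs Y)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(1)
W1 = rng.standard_normal((2, 4))   # weights: input -> hidden
b1 = np.zeros(4)                   # hidden biases
W2 = rng.standard_normal((4, 1))   # weights: hidden -> output
b2 = np.zeros(1)                   # output bias
lr = 1.0                           # learning rate (hyperparameter)

for epoch in range(10000):
    # ---- Forward pass ----
    H = sigmoid(X @ W1 + b1)       # hidden activations
    P = sigmoid(H @ W2 + b2)       # predicted outputs
    # Loss: MSE, L = mean((P - Y)^2)

    # ---- Backward pass: chain rule, output layer first ----
    dP  = 2 * (P - Y) / len(X)     # dL/dP
    dZ2 = dP * P * (1 - P)         # through the output sigmoid
    dW2 = H.T @ dZ2                # dL/dW2
    db2 = dZ2.sum(axis=0)          # dL/db2
    dH  = dZ2 @ W2.T               # error propagated back to hidden layer
    dZ1 = dH * H * (1 - H)         # through the hidden sigmoid
    dW1 = X.T @ dZ1                # dL/dW1
    db1 = dZ1.sum(axis=0)          # dL/db1

    # ---- Gradient descent update ----
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

# Predictions should be close to the targets [0, 1, 1, 0] once trained
print(np.round(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2), 2))
```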