Artificial Neural Networks are a simplified, virtual representation of the brain. They consist of an input layer of neurons where information is fed in, one or more hidden layers where that information is gradually transformed, and finally an output layer where we get the desired output. Neural networks are the backbone of classification and regression problems in Deep Learning.

A typical visualisation shows a 3D representation of a neural network whose hidden layers are being used to classify cat and dog images. The number of hidden layers is not fixed; it is chosen according to the needs of each problem. Generally, adding hidden layers can improve the accuracy of the output, but only up to a point.

Ok, so what is an activation function? How is it linked to the neurons?


Activation functions are mathematical equations that determine the output of a neural network's neurons. An activation function decides whether a neuron should be "fired" or not. Typically it squashes its input into a range such as 0 to 1 or -1 to 1.

Suppose there are two neurons in the input layer, x1 and x2, each with its respective strength, called a weight, along with a bias b. The activation function is applied to the sum of the products of the weights and the input values, plus the bias, and this gives the output.

Y = Activation( Σ (wi * xi) + b ),  i = 1, 2, 3, …
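
As a minimal sketch of that formula, here is a single neuron with two inputs; the sigmoid activation and the specific weights, inputs, and bias are made-up placeholders:

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2])   # input neurons x1, x2
w = np.array([0.8, 0.3])    # their respective weights
b = 0.1                     # bias

# Weighted sum of inputs plus bias, passed through the activation
z = np.dot(w, x) + b
y = sigmoid(z)
print(y)                    # output of the neuron
```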

There can be two types of activation functions: Linear and Non-Linear.

But the main purpose of the activation functions in the neural networks is to bring non-linearity into the network.

Why non-linearity?

To understand this, we have to look at the backpropagation step of the learning process. During the training of a Deep Learning model, the forward pass produces a loss, which should be minimised by any means; the weights of the neurons in the hidden layers are then gradually updated through gradient descent to get a better output. If the activation functions are linear, their derivatives are constants, so it is not possible to go back and work out which weights should be assigned to the input neurons for a better prediction.

Non-linear activation functions also make backpropagation useful, because their gradients depend on the inputs, so the weight updates carry information about the data.
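
A small sketch of that idea: the derivative of a non-linear activation such as the sigmoid changes with its input, while the derivative of a linear (identity) activation is the same constant everywhere. The sample inputs below are arbitrary:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # Derivative of the sigmoid: depends on the input z
    s = sigmoid(z)
    return s * (1.0 - s)

def linear_grad(z):
    # Derivative of the identity activation: constant, regardless of z
    return np.ones_like(z)

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid_grad(z))  # a different value for each input -> informative gradients
print(linear_grad(z))   # the same value everywhere -> no input-dependent signal
```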


Some of the most commonly used activation functions are:-

1. Sigmoid or Logistic Activation Function

The Sigmoid Function curve looks like an S-shape.


The main reason we use the sigmoid function is that its output lies between 0 and 1. Therefore, it is especially used in models where we have to predict a probability as the output. Since probabilities exist only in the range 0 to 1, sigmoid is a natural choice.

The function is differentiable. That means we can find the slope of the sigmoid curve at any point.

The function is monotonic but the function's derivative is not.

The logistic sigmoid function can cause a neural network to get stuck during training, because its gradient becomes very small for large positive or negative inputs (the vanishing-gradient problem).
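
A small numerical sketch of those properties, using arbitrary sample inputs: the outputs stay in (0, 1), and the derivative shrinks towards zero for large positive or negative inputs, which is what makes training stall:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    # Peaks at z = 0 and decays towards zero in both tails
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
print(sigmoid(z))             # values squashed into (0, 1)
print(sigmoid_derivative(z))  # near zero for large |z| -> vanishing gradients
```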

2. Tanh or hyperbolic tangent Activation Function

Tanh can be seen as an improved version of the sigmoid. The range of the tanh function is (-1, 1), and tanh is also sigmoidal (S-shaped).


The advantage is that negative inputs are mapped strongly negative and zero inputs are mapped near zero on the tanh graph.

The function is differentiable.

The function is monotonic while its derivative is not monotonic.

The tanh function is mainly used for classification between two classes.
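
A quick sketch, with arbitrary inputs, showing the (-1, 1) range and how tanh relates to the sigmoid (tanh(z) = 2 * sigmoid(2z) - 1):

```python
import numpy as np

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(np.tanh(z))                 # negative inputs map strongly negative, zero maps to ~0

# tanh expressed in terms of the sigmoid
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
print(2 * sigmoid(2 * z) - 1)     # matches np.tanh(z)
```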

3. ReLU (Rectified Linear Unit) Activation Function

The ReLU is the most widely used activation function right now, since it appears in almost all convolutional neural networks and deep learning models.

ReLU vs Sigmoid

As you can see, the ReLU is half-rectified (from the bottom): f(z) is zero when z is less than zero, and f(z) is equal to z when z is greater than or equal to zero.

Range: [ 0 to infinity)

The function and its derivative are both monotonic.

But the issue is that all negative values become zero immediately, which decreases the model's ability to fit or train on the data properly. Any negative input given to the ReLU activation function is turned into zero immediately, so negative values are not mapped appropriately and the corresponding neurons can stop learning.
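
A minimal sketch of ReLU and its gradient, with arbitrary sample inputs; note that both the output and the gradient are zero for negative inputs:

```python
import numpy as np

def relu(z):
    # Zero for negative inputs, identity for non-negative inputs
    return np.maximum(0.0, z)

def relu_derivative(z):
    # Gradient is 0 for z < 0 and 1 for z > 0
    return (z > 0).astype(float)

z = np.array([-5.0, -0.1, 0.0, 0.1, 5.0])
print(relu(z))             # all negative values become exactly 0
print(relu_derivative(z))  # no gradient flows back for negative inputs
```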

So, let us now move on to Hidden Layers:-

Hidden layers allow the function of a neural network to be broken down into specific transformations of the data. Each hidden layer is specialised to produce a defined output. For example, hidden layer functions that detect human eyes and ears may be used in conjunction with subsequent layers to identify faces in images. While the functions that identify eyes alone are not enough to independently recognise objects, they can work jointly within a neural network.

In Deep Learning, a hyperparameter is a setting that guides the learning process and is fixed before training begins. The number of hidden layers is one such hyperparameter: it is decided in advance, not learned.

In order to add hidden layers, we need to answer the following two questions:

  1. What is the required number of hidden layers?
  2. What is the number of hidden neurons across each hidden layer?

First of all, hidden layers are of no use if we use linear activation functions, because the combination of two or more linear functions is itself a linear function.
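
A small demonstration of that point, using two randomly initialised layers with identity (linear) activations; the shapes and values are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4,))          # an arbitrary input vector

# Two "hidden layers" with linear (identity) activations
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=(3,))
W2, b2 = rng.normal(size=(2, 3)), rng.normal(size=(2,))

two_layers = W2 @ (W1 @ x + b1) + b2

# The same mapping collapses into a single linear layer
W = W2 @ W1
b = W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layers, one_layer))   # True: the hidden layer added nothing
```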

To minimise the loss function, we backpropagate and update the weights of the input and hidden-layer neurons, so the useful number of hidden layers and hidden neurons is closely tied to how gradient descent behaves during training.


It can also be said that how closely the training and test accuracy of the model track each other after a certain number of epochs can indicate how many hidden layers should be used for better predictions.
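
As a closing sketch, here is how the number of hidden layers and the neurons per layer might be set as hyperparameters, assuming TensorFlow/Keras; the toy dataset, layer sizes, and epoch count are made-up placeholders, not recommendations:

```python
import numpy as np
import tensorflow as tf

# Toy binary-classification data (a stand-in for a real dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Two hidden layers with ReLU; both the layer count and the neuron counts
# are hyperparameters chosen before training
model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),   # probability output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Watching training vs. validation accuracy over epochs helps judge
# whether the chosen depth and width are adequate
history = model.fit(X, y, validation_split=0.2, epochs=20, verbose=0)
print(history.history["accuracy"][-1], history.history["val_accuracy"][-1])
```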