Leaky ReLU is less sensitive to initialization than the ReLU activation function because it introduces a small negative slope for negative input values, whereas ReLU sets all negative inputs to zero.
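
As a minimal NumPy sketch (the slope alpha = 0.01 is just a common default, chosen here for illustration), the two functions differ only in how they treat negative inputs:

```python
import numpy as np

def relu(x):
    # ReLU: negative inputs are clamped to exactly zero
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # Leaky ReLU: negative inputs are scaled by a small slope alpha
    return np.where(x > 0, x, alpha * x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x))        # [0.  0.  0.  1.5]
print(leaky_relu(x))  # [-0.02  -0.005  0.  1.5]
```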

This means that with ReLU, if the weights are initialized in a way that causes many neurons to receive only negative input values, those neurons will always output zero. Since ReLU's gradient is also zero for negative inputs, no gradient flows back through them, their weights stop updating, and the neurons are effectively "killed" (the so-called dying ReLU problem).
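
A small sketch of this effect, using a deliberately bad, hypothetical initialization in which a large negative bias pushes every pre-activation below zero:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "bad" initialization: small weights plus a strongly
# negative bias make every pre-activation negative for typical inputs.
W = rng.normal(0.0, 0.1, size=(100, 32))
b = np.full(32, -5.0)

X = rng.normal(0.0, 1.0, size=(64, 100))  # a batch of random inputs
pre_activations = X @ W + b

relu_out = np.maximum(0.0, pre_activations)
print((relu_out == 0).mean())  # ~1.0: essentially every unit outputs zero
```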

With Leaky ReLU, on the other hand, the small negative slope lets negative input values produce small non-zero outputs and, just as importantly, lets a small gradient flow back through those neurons. Even if some neurons receive negative inputs under a poor initialization, they keep learning and are not "killed". As a result, Leaky ReLU is less sensitive to weight initialization than ReLU.
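
The difference is easiest to see in the gradients. In a short PyTorch sketch (an assumed framework choice, using the common 0.01 slope), a negative pre-activation gets zero gradient under ReLU, so the weights feeding it can never be pushed out of the dead region, while Leaky ReLU still passes back a small gradient:

```python
import torch
import torch.nn.functional as F

# A single negative pre-activation, as an unlucky initialization might produce.
x = torch.tensor([-2.0], requires_grad=True)

F.relu(x).backward()
print(x.grad)  # tensor([0.]) -> no gradient, the unit cannot recover

x.grad = None  # reset before the second backward pass
F.leaky_relu(x, negative_slope=0.01).backward()
print(x.grad)  # tensor([0.0100]) -> a small gradient still flows
```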

It is worth noting that other activation functions, such as ELU and SELU, also produce non-zero outputs (and non-zero gradients) for negative input values, and they have been shown to outperform ReLU and Leaky ReLU in some cases. However, they come with their own caveats and hyperparameters that need to be tuned carefully.
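
For reference, a rough NumPy sketch of ELU and SELU (the SELU constants are shown only approximately; they are fixed values derived in the SELU paper rather than free hyperparameters):

```python
import numpy as np

def elu(x, alpha=1.0):
    # ELU: smooth exponential curve below zero, saturating at -alpha
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def selu(x, alpha=1.6733, scale=1.0507):
    # SELU: a scaled ELU with fixed constants (approximate values shown),
    # chosen so activations tend to self-normalize under suitable initialization
    return scale * np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(elu(x))   # approx [-0.8647 -0.3935  0.      1.5   ]
print(selu(x))  # approx [-1.5203 -0.6918  0.      1.5761]
```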

This story has been published in Aaweg Interview, a space for questions asked in AI/ML/DS job interviews and their answers. To check out other interview question-and-answer stories, do check:

Feel free to contribute.