Welcome back to a new chapter of "Courage to Learn ML." For those new to this series, it aims to make complex machine learning topics accessible and engaging, much like a casual conversation between a mentor and a learner, inspired by the writing style of "The Courage to Be Disliked."

This time we will continue our exploration into how to overcome the challenges of vanishing and exploding gradients. In our opening segment, we talked about why it's critical to maintain stable gradients to ensure effective learning within our networks. We uncovered how unstable gradients can be barriers to deepening our networks, essentially putting a cap on the potential of deep "learning". To bring these concepts to life, we used the analogy of running a miniature ice cream factory named DNN (short for Delicious Nutritious Nibbles), drawing parallels to illuminate potent strategies for DNN training akin to orchestrating a seamless factory production line.

Now, in this second installment, we're diving deeper into each proposed solution, examining them with the same clarity and creativity that brought our ice cream factory to life. Here is the list of topics we'll cover in this part:

  1. Activation Functions
  2. Weight Initialization
  3. Batch Normalization
  4. In Practice (Personal Experience)

Activation Functions

Activation functions are the backbone of our "factory" setup. They're responsible for passing along information in both forward and backward propagation within our DNN assembly line, so picking the right ones is crucial for its smooth operation and, by extension, for our DNN training process. This part isn't just a simple rundown of activation functions along with their advantages and disadvantages. Here, I will use the Q&A format to uncover the deeper reasoning behind the creation of different activation functions and to answer some important questions that are often overlooked.

Think of these functions as the blenders in our ice cream production analogy. Rather than offering a catalog of available blenders, I'm here to provide an in-depth review, examining the innovations of each one and the reasons behind its specific enhancements.

What exactly are activation functions, and how do I choose the right one?

Image created by the author using ChatGPT.

Activation functions are the key elements that grant a neural network model the flexibility and power to capture both linear and nonlinear relationships. The key distinction between logistic regression and DNNs lies in these activation functions combined with multiple layers; together they allow NNs to approximate a wide range of functions. However, this power comes with its challenges. The choice of activation function needs careful consideration, because the wrong selection can stop the model from learning effectively, especially during backpropagation.

Picture yourself as the manager of our DNN ice cream factory. You'd want to meticulously select the right activation function (think of them as ice cream blenders) for your production line. This means doing your homework and sourcing the best fit for your needs.

So, the first step in choosing an effective activation function involves addressing two key questions:

How does the choice of activation function affect issues like vanishing and exploding gradients? What criteria define a good activation function?

Note: to deal with unstable gradients, our discussion focuses on the activations in the hidden layers. For the output activation function, the choice depends on the task: whether it's a regression or classification problem, and whether it's multiclass.

When choosing an activation function for the hidden layers, the problem is more related to vanishing gradients. This can be traced back to the traditional sigmoid activation function (our very basic model). The sigmoid function was widely used due to its ability to map inputs to a probability range (0, 1), which is particularly useful in binary classification tasks. This capability allowed researchers to adjust the probability threshold for categorizing predictions, enhancing model flexibility and performance.

However, its application in hidden layers has led to significant challenges, most notably the vanishing gradient problem. This can be attributed to two main factors:

  • During the forward pass, the sigmoid function compresses inputs to a very narrow range between 0 and 1. If a network uses only sigmoid as the activation function in its hidden layers, repeated application through multiple layers narrows this range further. This compression effect not only reduces the variability of outputs but also introduces a bias towards positive values, since outputs remain between 0 and 1 regardless of the input sign.
  • During backpropagation, the derivative of the sigmoid function (which has a bell-shaped curve) yields values between 0 and 0.25. This small range can cause gradients to diminish rapidly, regardless of the input, as they propagate through multiple layers, resulting in vanishing gradients. Since earlier layer gradients are products of successive layer derivatives, this compounded product of small derivatives results in exponentially smaller gradients, preventing effective learning in earlier layers (see the sketch after this list).
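
To see how quickly this compounding shrinks the signal, here's a minimal numeric sketch in plain Python: a toy 10-layer chain that, for simplicity, ignores the weights and multiplies only the per-layer sigmoid derivatives.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # peaks at 0.25 when x = 0

# Even at the most favorable input (x = 0), each layer contributes at most 0.25
print(sigmoid_derivative(0.0))  # 0.25

# Toy 10-layer chain: multiply the per-layer derivatives together
grad = 1.0
for _ in range(10):
    grad *= sigmoid_derivative(0.0)
print(grad)  # 0.25 ** 10 ~= 9.5e-07, the gradient reaching the first layer
```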

To overcome these limitations, an ideal activation function should exhibit the following properties:

  • Non-linearity. Allowing the network to capture complex patterns.
  • Non-saturation. The function and its derivative should not compress the input range excessively, preventing vanishing gradients.
  • Zero-centered Output. The function should allow for both positive and negative outputs, ensuring that the mean output across the nodes does not introduce bias towards any direction.
  • Computational Efficiency. Both the function and its derivative should be computationally simple to facilitate efficient learning.

Given these essential properties, how do popular activation functions build upon our basic model, the Sigmoid, and what makes them stand out?

This section aims to provide a general overview of the most commonly used activation functions.

Tanh, A Simple Adjustment to Sigmoid. The hyperbolic tangent (tanh) function can be seen as a modified version of the sigmoid, offering a straightforward enhancement in terms of output range. By scaling and shifting the sigmoid, tanh achieves an output range of [-1, 1] with zero mean. This zero-centered output is advantageous as it aligns with our criteria for an effective activation function, ensuring that the input data and gradients are less biased toward any specific direction, whether positive or negative.

Despite these benefits, tanh retains sigmoid's saturating, S-shaped form, which means it still compresses outputs into a narrow range. This compression leads to similar issues as observed with sigmoid, causing gradients to saturate and therefore affecting the network's ability to learn effectively during backpropagation.
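
The "scaling and shifting" relationship above can be written as tanh(x) = 2·sigmoid(2x) − 1, which is easy to verify; here is a quick sketch in PyTorch, the framework this post leans on:

```python
import torch

x = torch.linspace(-3, 3, steps=7)

# tanh is a rescaled, recentered sigmoid: tanh(x) = 2 * sigmoid(2x) - 1
manual_tanh = 2 * torch.sigmoid(2 * x) - 1

print(torch.allclose(manual_tanh, torch.tanh(x), atol=1e-6))  # True
print(torch.tanh(x))  # values span (-1, 1) and are centered around 0
```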

ReLU, a popular choice in NNs. ReLU (Rectified Linear Unit) stands out for its simplicity, operating as a piecewise linear function where f(x) = max(0, x). This means it outputs zero for any negative input and mirrors the input otherwise. What makes ReLU particularly appealing is its straightforward design, satisfying three of those four key properties (we discussed above) with ease. Its linear nature on the positive side avoids compressing outputs into a tight range, unlike sigmoid or tanh, and its derivative is simple, being either 0 or 1.

One intriguing aspect of ReLU is its ability to turn off neurons for negative inputs, which introduces sparsity into the model, similar to the effect of dropout regularization deactivating certain neurons. This can lead to more generalized models. However, it also leads to the "dying ReLU" issue, where neurons become inactive and stop learning due to zero output and zero gradient. While some neurons may come back to life, those in early layers are particularly at risk of being permanently deactivated. This is similar to halting feedback in an ice cream production line, where the early stages fail to adapt based on customer feedback or contribute useful intermediate products for subsequent stages.

Another point of consideration is ReLU's non-differentiability at x=0, due to the sharp transition between its linear segments. In practice, frameworks like PyTorch manage this using the concept of subgradients, picking a fixed derivative value at x=0 from the subgradient range [0, 1]. This typically doesn't pose an issue due to the rarity of exact zero inputs and the variability of data.

So, is ReLU the right choice for you? Many researchers say yes, thanks to its simplicity, efficiency, and support from major DNN frameworks. Moreover, recent studies, like one at https://arxiv.org/abs/2310.04564, highlight ReLU's ongoing relevance, marking a kind of renaissance in the ML world.

In certain applications, a variant known as ReLU6, which caps the output at 6, is used to prevent overly large activations. This modification, inspired by practical considerations, further illustrates the adaptability of ReLU in various neural network architectures. Why cap at 6? You can find the answer in this post.

Leaky ReLUs, a slight twist on the classic ReLU. When we take a closer look at ReLU, a couple of issues emerge: its zero output for negative inputs leads to the "dying ReLU" problem, where neurons cease to update during training. Additionally, ReLU's preference for positive values can introduce a directional bias in the model. To counter these drawbacks while retaining ReLU's advantages, researchers developed several variations, including the concept of 'leaky' ReLUs.

Leaky ReLU modifies the negative part of ReLU, giving it a small, non-zero slope. This adjustment allows negative inputs to produce small negative outputs, effectively 'leaking' through the otherwise zero-output region. The slope of this leak is controlled by a hyperparameter α, which is typically set close to 0 to maintain a balance between sparsity and keeping neurons active. By allowing a slight negative output, Leaky ReLU aims to centralize the activation function's output around zero and prevent neurons from becoming inactive, thus addressing the "dying ReLU" issue.
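
As a rough sketch of this definition (α = 0.01 is just a common default here, not a universal recommendation):

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])
alpha = 0.01  # the small negative slope (hyperparameter)

# Leaky ReLU: f(x) = x for x >= 0, alpha * x for x < 0
manual_leaky = torch.where(x >= 0, x, alpha * x)

print(torch.allclose(manual_leaky, F.leaky_relu(x, negative_slope=alpha)))  # True
print(manual_leaky)  # negative inputs "leak" through as small negative outputs
```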

However, introducing α as a hyperparameter adds a layer of complexity to model tuning. To manage this, variations of the original Leaky ReLU have been developed:

  • Randomized Leaky ReLU (RReLU): This version randomizes α within a specified range during training, fixing it during evaluation. The randomness can help in regularizing the model and preventing overfitting.
  • Parametric Leaky ReLU (PReLU): PReLU allows α to be learned during training, adapting the activation function to the specific needs of the dataset. Even though this can enhance model performance by tailoring α to the training data, it also risks overfitting.

Exponential Linear Unit (ELU), an Improvement on Leaky ReLU by Enhancing Control Over Leakage. Both Leaky ReLUs and ELUs allow negative values, which help in pushing mean unit activations closer to zero and maintaining the vitality of the activation functions. The challenge with Leaky ReLUs is their inability to regulate the extent of these negative values; theoretically, these values could extend to negative infinity, despite intentions to keep them small. ELU addresses this by incorporating a nonlinear exponential curve for non-positive inputs, effectively narrowing the negative output range so that it never drops below −α (where α is a new hyperparameter, typically set to 1). Additionally, ELU is a smooth function. Its exponential component enables a seamless transition between negative and positive values, which is advantageous for gradient-based optimization because it ensures a well-defined gradient across all input values. This feature also resolves the non-differentiability issues seen with ReLU and Leaky ReLUs.
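
Here's a minimal sketch of ELU's piecewise definition, f(x) = x for x > 0 and α(eˣ − 1) otherwise, checked against PyTorch's built-in version (with α left at the common default of 1):

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-5.0, -1.0, 0.0, 1.0, 5.0])
alpha = 1.0

# ELU: identity for positive inputs; a smooth exponential curve for
# non-positive inputs that saturates at -alpha as x -> -infinity.
manual_elu = torch.where(x > 0, x, alpha * (torch.exp(x) - 1))

print(torch.allclose(manual_elu, F.elu(x, alpha=alpha)))  # True
print(manual_elu)  # the most negative output stays above -1.0 (= -alpha)
```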

Scaled Exponential Linear Unit (SELU), an Enhanced ELU with Self-Normalizing Properties. SELU is essentially a scaled version of ELU designed to maintain zero mean and unit variance within neural networks, a concept we'll explore further in our discussion on Batch Normalization. By integrating a fixed scale factor, λ (which is greater than 1), SELU ensures that the slope for positive net inputs exceeds one. This characteristic is particularly useful as it amplifies the gradient in scenarios where the gradients of the lower layers are diminished, helping to prevent the vanishing gradient problem often encountered in deep neural networks.

Note that the scale factor λ is applied to both negative and positive inputs to uniformly scale the gradient during backpropagation. This uniform scaling helps maintain variance within the network, which is crucial for the SELU activation function's self-normalizing properties.

For SELU, the parameters (α and λ) have fixed values and are not learnable, which simplifies the tuning process since there are fewer parameters to adjust. You can find these specific values in the SELU implementation in PyTorch.
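
To make those fixed constants concrete, here's a small check of torch.selu against a manual version; the α and λ values below are the commonly cited SELU constants and are copied in as assumptions rather than read from PyTorch:

```python
import torch

# Commonly cited SELU constants (assumed here; PyTorch hard-codes its own copies)
alpha = 1.6732632423543772
scale = 1.0507009873554805  # the lambda factor

x = torch.tensor([-3.0, -1.0, 0.0, 1.0, 3.0])

# SELU(x) = lambda * x                  for x > 0
#         = lambda * alpha * (e^x - 1)  for x <= 0
manual_selu = scale * torch.where(x > 0, x, alpha * (torch.exp(x) - 1))

print(torch.allclose(manual_selu, torch.selu(x)))  # True
```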

SELU was introduced by Günter Klambauer et al. in their paper. This comprehensive paper includes an impressive 92-page appendix, which provides detailed insights for those curious about the derivation of the specific values of α and λ. You can find the calculations and rationale behind these parameters in the paper itself.

SELU is indeed a sophisticated "blender" in the world of activation functions, but it comes with specific requirements. It's most effective in feedforward or sequential networks and may not perform as well in architectures like RNNs, LSTMs, or those with skip connections due to its design.

The self-normalizing feature of SELU requires that input features be standardized: having a mean of 0 and a unit standard deviation is crucial. Additionally, every hidden layer's weights must be initialized using the LeCun normal initialization, where weights are sampled from a normal distribution with a mean of 0 and a variance of 1/fan_in. If you're not familiar with the term "fan_in," I'll explain it in a dedicated section on weight initialization.

In summary, for SELU's self-normalization to function effectively, you need to ensure that the input features are normalized and that the network structure remains consistent without any interruptions. This consistency helps maintain the self-normalizing effect throughout the network without any leakage.

GELU (Gaussian Error Linear Unit) is an innovative activation function that incorporates the idea of regularization from Dropout. Unlike traditional ReLU, which outputs zeros for negative inputs, Leaky ReLU, ELU, and SELU allow negative outputs. This helps shift the mean of the activations closer to zero, reducing bias without zeroing out negative inputs entirely. However, this leakage means they lose some of the benefits of the "dying ReLU" behavior, where inactivity in some neurons can lead to a sparser, more generalized model.

Considering the benefits of sparsity seen in dying ReLU and Dropout's ability to randomly deactivate and reactivate neurons, GELU takes this a step further. It combines the dying ReLU's feature of zero outputs with an element of randomness, allowing neurons to potentially 'come back to life'. This approach not only maintains beneficial sparsity but also reintroduces neuron activity, making GELU a robust solution. To fully appreciate its mechanics, let's take a closer look at GELU's definition:

GELU(x) = x · Φ(x), where Φ(x) is the standard Gaussian cumulative distribution function. (Image created by author using Mathcha.com)

In the GELU activation function, the CDF, Φ(x), or the standard Gaussian cumulative distribution function, plays a key role. This function represents the probability that a standard normal random variable will have a value less than or equal to x. Φ(x) transitions smoothly from 0 (for negative inputs) to 1 (for positive inputs), effectively controlling the scaling of the input when modeled with a normal distribution N(0,1). According to a paper by Dan Hendrycks et al. (source), the use of the normal distribution is justified because neuron inputs tend to follow a normal distribution, particularly when using batch normalization.

The function's design allows inputs to be "dropped" more frequently as x decreases, making the transformation both stochastic and dependent on the input value. This mechanism helps keep the shape similar to the ReLU function by making the usual straight-line function, f(x) = x, smoother, and avoiding sudden changes that you get with a piecewise linear function. The most significant feature of GELU is that it can completely inactivate neurons, potentially allowing them to reactivate with changes in input. This stochastic nature acts like a selective dropout that isn't entirely random but instead relies on the input, giving neurons the chance to become active again.

Cumulative distribution function of the standard normal distribution, from Wikipedia. Source: https://upload.wikimedia.org/wikipedia/commons/thumb/c/ca/Normal_Distribution_CDF.svg/300px-Normal_Distribution_CDF.svg.png

To summarize, GELU's main advantage over ReLU is that it considers the entire range of input values, not just whether they're positive or negative. As Φ(x) decreases, it increases the chances that the GELU function output will be closer to 0, subtly "dropping" the neuron in a probabilistic way. This method is more sophisticated than the typical dropout approach because it depends on the data to determine neuron deactivation, rather than doing it randomly. I find this approach fascinating; it's like adding a soft cream to an artisan dessert, enhancing it subtly but significantly.

GELU has become a popular activation function in models like GPT-3, BERT, and other Transformers due to its efficiency and strong performance in language processing tasks. Although it's computationally intensive because of its probabilistic nature, the curve of the standard Gaussian cumulative distribution, Φ(x), is similar to the sigmoid and tanh functions. Interestingly, GELU can be approximated using tanh or with the formula x·sigmoid(1.702x). Despite these possibilities for simplification, PyTorch's implementation of GELU is fast enough that such approximations are often unnecessary.
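
As a sketch, here's the exact GELU, x·Φ(x), written with the Gaussian error function and compared with the sigmoid approximation mentioned above (the tanh approximation is also exposed in recent PyTorch versions through an approximate argument to F.gelu, depending on your version):

```python
import math
import torch
import torch.nn.functional as F

x = torch.tensor([-3.0, -1.0, 0.0, 1.0, 3.0])

# Exact GELU: x * Phi(x), with Phi the standard normal CDF written via erf
phi = 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))
exact_gelu = x * phi
print(torch.allclose(exact_gelu, F.gelu(x), atol=1e-6))  # True

# Sigmoid approximation: x * sigmoid(1.702 * x)
sigmoid_approx = x * torch.sigmoid(1.702 * x)
print((exact_gelu - sigmoid_approx).abs().max())  # small approximation error
```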

Before we dive deeper into more whys, let's try to summarize.

What exactly makes a good activation function, judging from ReLU and the other activation functions inspired by it?

Günter Klambauer et al.'s paper, where SELU was introduced, highlights essential characteristics of an effective activation function:

  1. Range: It should output both negative and positive values to help manage the mean activation level across the network.
  2. Saturation regions: These are areas where the derivative approaches zero, helping to stabilize overly high variances from lower layers.
  3. Amplifying slopes: A slope greater than one is crucial to boost variance when it's too low in the lower layers.
  4. Continuity: A continuous curve ensures a fixed point where the effects of variance damping and increasing are balanced.

Additionally, I would suggest two more criteria for an "ideal" activation function:

  1. Non-linearity: This is obvious and necessary because linear functions can't model complex patterns effectively.
  2. Dynamic output: The ability to output zero and change outputs based on input data allows for dynamic neuron activation and deactivation, which lets the network adjust to varying data conditions efficiently.

Can you give me a more intuitive explanation on why we want activation functions to output negatives?

Think of activation functions as blenders that transform the original input data. Just like blenders that might favor certain ingredients, activation functions can introduce biases based on their inherent characteristics. For example, sigmoid and ReLU functions typically yield only non-negative outputs, regardless of the input. This is akin to a blender that always produces the same flavor, no matter what ingredients you put in.

Image created by the author using ChatGPT.

To minimize this bias, it's beneficial to have activation functions that can output both negative and positive values. Essentially, we aim for zero-centered outputs. Imagine a seesaw representing the output of an activation function: with functions like Sigmoid and ReLU, the seesaw is heavily tilted towards the positive side, as these functions either ignore or zero out negative inputs. Leaky ReLU attempts to balance this seesaw by allowing negative inputs to produce slightly negative outputs, although the adjustment is minor due to the linear and constant nature of its negative slope. Exponential Linear Unit (ELU), on the other hand, provides a more dynamic push on the negative side with its exponential component, helping the seesaw approach a more balanced state at zero. This balance is crucial for maintaining healthy gradient flow and efficient learning in neural networks, as it ensures that both positive and negative updates contribute to training, avoiding the limitations of unidirectional updates.

Could we create an activation function like ReLU that zeros out positive inputs instead, similar to using min(0, x)? Why do we prefer functions that approach zero from the negative side rather than zeroing out the positive inputs?

Here, saturation on the negative side means that once the input value drops below a certain threshold, further decreases have less and less effect on the output. This limits the impact of large negative inputs.

Certainly, you could design a version of ReLU that zeroes out positive values and lets negative values pass unchanged, like f(x) = min(x, 0). This is technically feasible because the important aspect here isn't the sign of the values but rather introducing non-linearity into the network. It's important to remember that these activation functions are typically used in hidden layers, not output layers, so they don't directly affect the sign of the final output. In other words, the presence of these activation functions within the network means the final output can still be both positive and negative, unaffected by the specific characteristics of these layers.

No matter the sign of the output, the network's weights and biases can adjust to match the required sign of the output. For example, with traditional ReLU, if the output is 1 and the subsequent layer's weight is 1, the output remains 1. Similarly, if a proposed ReLU variant outputs -1, and the weight is -1, the result is still 1. Essentially, we are more concerned with the magnitude of the output rather than its sign.

Therefore, ReLU saturating on the negative side is not fundamentally different from it saturating on the positive side. However, the reason we value zero-centered activation functions is because they prevent any inherent preference for positive or negative values, avoiding unnecessary bias in the model. This balance helps maintain neutrality and effectiveness in learning across the network.

I get that for functions like Leaky ReLU, we want to output negative values to keep the output centered around zero. But why are ELU, SELU, and GELU specifically designed to saturate with negative inputs?

To understand this, we can look at the biological inspiration behind ReLU. ReLU mimics biological neurons which have a threshold; inputs above this threshold activate the neuron, while inputs below it do not. This ability to switch between active and inactive states is crucial in neural function. When considering variations like ELU, SELU, and GELU, you'll notice that their design addresses two distinct needs:

  • Positive region: Allows signals that exceed the threshold to pass through unchanged during the forward pass, essentially transmitting the desired signals.
  • Negative region: Serves to minimize or filter out unwanted signals and mitigate the impact of large negative values, acting like a leaky gate.

These functions essentially act as gates for inputs, managing what should and should not influence the neuron's output. For instance, SELU utilizes these two aspects distinctively:

  • Positive region: The scaling factor λ (greater than 1) not only passes but slightly amplifies the signal. During backpropagation, the derivative in this region remains constant (about 1.0507), enhancing small but useful gradients to counteract vanishing gradients.
  • Negative region: The derivative ranges between 0 and λα (with typical values λ ≈ 1.0507 and α ≈ 1.6733), leading to a maximum derivative of about 1.7583. Here, the function nearly approaches zero, effectively reducing overly large gradients to help with the exploding gradient problem (checked numerically in the sketch below).
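
These derivative values can be verified numerically with autograd; a minimal sketch (expect small rounding differences, since the exact constants live inside PyTorch's SELU implementation):

```python
import torch

def selu_grad(value):
    # d SELU(x) / dx at a single point, via autograd
    x = torch.tensor(value, requires_grad=True)
    torch.selu(x).backward()
    return x.grad.item()

print(selu_grad(2.0))    # ~1.0507: the constant slope lambda in the positive region
print(selu_grad(-1e-4))  # ~1.758: the largest slope on the negative side (lambda * alpha)
print(selu_grad(-10.0))  # ~0.0: large negative inputs are strongly damped
```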

Here is a really good plot to illustrate the first derivatives of those activation functions.

This design allows these activation functions to balance enhancing useful signals while dampening potentially harmful extremes, providing a more stable learning environment for the network.

The concept of activation functions serving as gates is not a new idea. It has a strong precedent in structures like LSTMs where sigmoid functions decide what to remember, update, or forget. This gating concept helps us understand why variations of ReLU are designed in specific ways. For instance, GELU acts as a dynamic gate that uses a scale factor derived from the standard normal distribution's cumulative distribution function (CDF). This scaling allows a small fraction of the input to pass through when it's close to zero, and lets larger positive values pass through largely unaltered. By controlling how much of the input influences subsequent layers, GELU facilitates effective information flow management, particularly useful in architectures like transformers.

All three mentioned activation functions, ELU, SELU, and GELU, make the negative side smoother. This smooth saturation of negative inputs doesn't just mitigate the effects of large negative values. It also makes the network less sensitive to fluctuations in input data, leading to more stable feature representations.

In summary, the specific area of saturation, whether positive or negative, doesn't fundamentally matter since these activation functions operate within the middle layers of a network, where weights and biases can adapt accordingly. However, the design of these functions, which allows one side to pass signals unchanged or even amplified while the other side saturates, is important. This arrangement helps organize the signal and facilitate effective backpropagation, enhancing the overall performance and learning stability of the network.

When should we choose each activation function? Why is ReLU still the most popular activation function in practice?

Choosing the right activation function depends on several factors, including computational resources, the specific needs of the network architecture, and empirical evidence from prior models.

  1. Computational Resources: If you have enough computational resources, experimenting with different activation functions using cross-validation can be insightful. This allows you to tailor the activation function to your specific model and dataset. Note that when using SELU, you generally don't need batch normalization, which can simplify the architecture, unlike other functions where batch normalization might be necessary.
  2. Empirical Evidence: Certain functions have become standard for specific applications. For example, GELU is often the preferred choice for training transformer models due to its effectiveness in these architectures. SELU, with its self-normalizing properties and lack of hyperparameters to tune, is particularly useful for deeper networks where training stability is crucial.
  3. Computation Efficiency and Simplicity: For scenarios where computational efficiency and simplicity are priorities, ReLU and its variants like PReLU and ELU are excellent choices. They avoid the need for parameter tuning and support the model's sparsity and generalization, helping to reduce overfitting.

Despite the advent of more sophisticated functions, ReLU remains extremely popular due to its simplicity and efficiency. It's straightforward to implement, easy to understand, and provides a clear method to introduce non-linearity without complicating the computation. The function's ability to zero out negative parts simplifies calculations and enhances computational speed, which is advantageous especially in large networks.

ReLU's design inherently increases the model's sparsity by zeroing out negative activations, which can improve generalization, a critical factor given that overfitting is a significant challenge in training deep neural networks. Moreover, ReLU does not require any extra hyperparameters, which contrasts with functions like PReLU or ELU that introduce additional complexity into model training. Furthermore, because ReLU has been widely adopted, many machine learning frameworks and libraries offer optimizations specifically for it, making it a practical choice for many developers.

In summary, while newer activation functions offer certain benefits for specific scenarios, ReLU's balance of simplicity, efficiency, and effectiveness makes it a go-to choice for many applications. When moving forward with any activation function, understanding its characteristics thoroughly is crucial to ensure it aligns with your model's needs and to facilitate troubleshooting during model training.

PyTorch offers a variety of activation functions, each with specific applications and benefits detailed in its documentation. While I won't cover every available activation function here (such as softplus) due to length constraints, it's important to think of these functions as blenders that modify inputs in different ways, building upon the functionality of their predecessors. Understanding how each function evolves from the last helps in quickly grasping new ones and evaluating their advantages and disadvantages. We will dive into how these activation functions interact with different weight initialization strategies later, further enhancing the effective use of these tools in neural network design.

For a more detailed exploration of PyTorch's activation functions, you can always refer to the official documentation from PyTorch.

Weight Initialization

Alright, let's stop searching for the perfect activation function to stabilize gradients and focus on another crucial aspect of setting up our neural network properly: initializing weights effectively.

Before diving into the most popular methods for weight initialization, let's address a fundamental question:

Note that weight initialization is actually more complex than it might seem at first glance, and this post only scratches the surface of the subject. As I mentioned, choosing the right starting point for weight initialization is crucial for the effective optimization of your network. If you're looking for a deeper and more comprehensive understanding, I recommend checking out the detailed review available here. This could really enhance your grasp of the techniques involved.

Why is weight initialization important, and how can it help mitigate unstable gradients?

Proper weight initialization ensures that gradients flow correctly throughout the model, similar to how semi-products are passed around in an ice cream factory. It's important that not only the initial machine settings are correct but also that every department works efficiently.

Weight initialization aims to maintain a stable flow of information both forward and backward through the network. Weights that are too large or too small can cause problems. Excessively large weights might inflate the output during the forward pass, leading to oversized predictions. On the other hand, very small weights might diminish the output too much. During backpropagation, the magnitude of these weights becomes critical: if a weight is too large, it can cause the gradient to explode; if too small, the gradient might vanish. Understanding this, we avoid initializing weights at extremes, such as zero (which nullifies outputs and gradients) or excessively high values. This balanced approach helps maintain the network's efficacy and prevents the issues associated with unstable gradients.

What is a good way to initialize weights?

First and foremost, the best weight initialization often comes from using weights that have been pre-trained. If you can obtain a set of weights that have already undergone some learning and are trending towards minimizing loss, continuing from this point is ideal.

However, if you're starting from scratch, you'll need to carefully consider how to initially set your weights, especially to prevent unstable gradients. Here's what you should aim for in a good weight initialization:

  • Avoid extreme values. As we discussed previously, weights should neither be too large, too small, nor zero. Properly scaled weights help maintain stability during both the forward and backward passes of network training.
  • Break symmetry. It's quite important that weights are diverse to prevent neurons from mirroring each other's behavior, which would lead them to learn the same features and ignore others. This lack of differentiation can severely limit the network's ability to model complex patterns. Different initial weights help each neuron start learning different aspects of the data, just as having various types of production lines in an ice cream factory enhances the range of flavors that can be produced.
  • Position favorably on the loss surface. Initial weights should place the model in a decent starting position on the loss surface to make the journey toward the global minimum more feasible. Since we don't have a clear picture of what the loss landscape looks like, introducing some randomness in weight initialization can be beneficial.

This is why setting all weights to zero is problematic. It causes symmetry issues, where all neurons behave the same and learn at the same rate, preventing the network from effectively capturing diverse patterns. Zero weights also lead to zero outputs, especially with ReLU and its variations, resulting in zero gradients. This lack of gradient flow stops learning altogether, rendering all neurons inactive.

Why not initialize all weights with a small random number?

While using small random numbers to initialize weights can be helpful, it often lacks sufficient control. Randomly assigned weights might be too small, leading to a vanishing gradient problem, where updates during training become insignificantly small, stalling the learning process. Furthermore, completely random initialization doesn't guarantee the breaking of symmetry. For example, if the initialized values are too similar or all have the same sign, the neurons might still behave too similarly, failing to learn diverse aspects of the data.

In practice, more structured approaches to initialization are used. Famous methods include Glorot (or Xavier) initialization, He (or Kaiming) initialization, and LeCun initialization. These techniques typically rely on either normal or uniform distributions but are calibrated to consider the size of the previous and next layers, providing a balance that promotes effective learning without the risk of vanishing or exploding gradients.

If so, why not just use a standard normal distribution (N(0,1)) for weight initialization?

Using a standard normal distribution (N(0,1)) provides some control over the randomization process, but it isn't sufficient for optimal weight initialization due to the lack of control over variance. The mean of zero is a solid choice as it helps ensure weights do not all share the same sign, effectively helping to break symmetry. However, a variance of 1 can be problematic.

Consider a scenario where the activation function inputs, Z, depend on the weights. Suppose Z is calculated by summing the outputs of N neurons from the previous layer, each with weights initialized from a standard normal distribution. Here, Z would also be normally distributed with a mean of zero, but its variance would be N. If N = 100, for example, the variance of Z becomes 100, which is too large and leads to uncontrolled inputs into the activation function, potentially causing unstable gradients during backpropagation. Using the ice cream factory as an analogy, this would be like setting a high tolerance for errors in each machine's settings, resulting in a final product that deviates significantly from the desired outcome due to lack of quality control.
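
Here's a quick numerical sketch of that effect, using a toy pre-activation built from N = 100 standard-normal weights with the inputs fixed at 1, so the weights are the only source of variance:

```python
import torch

torch.manual_seed(0)

n = 100           # number of neurons feeding into Z (the fan-in)
samples = 10_000  # how many times we re-draw the weights

# Inputs fixed at 1, so Z is simply the sum of the weights;
# each weight is drawn from the standard normal N(0, 1).
weights = torch.randn(samples, n)
z = weights.sum(dim=1)

print(z.mean())  # close to 0
print(z.var())   # close to 100: the variance grows with the fan-in

# Scaling the weight variance down to 1/n (LeCun-style) keeps Var(Z) near 1
z_scaled = (weights / n ** 0.5).sum(dim=1)
print(z_scaled.var())  # close to 1
```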

So why do we care about the variance of Z? The variance controls the spread of Z values. If the variance is too small, the output of Z may not vary enough to effectively break symmetry. However, a variance that is too large can produce values that are either too high or too low. For activation functions like sigmoid, extremely high or low inputs push the outputs towards the function's saturating tails, which can cause the vanishing gradient problem.

Therefore, when initializing weights with a random draw from a distribution, both the mean and the variance are crucial. The goal is to set the mean to zero to break symmetry effectively, while also minimizing the variance to ensure that the semi-product (i.e., the neuron outputs) is neither too large nor too small. Proper initialization ensures a stable flow of information through the network, both forward and backward, maintaining an efficient learning process without introducing instability in gradients. A thoughtful approach to initialization can, therefore, result in a network that learns effectively and robustly.

So, to control the output values in the middle layers of a neural network, which also serve as inputs for subsequent layers, we use distributions with carefully chosen mean and variance for weight initialization. But how do the most popular methods achieve control over this variance?

Before diving into the most common ways to initialize weights, it's important to note that the variance of Z is influenced not only by the variance of the weight initialization but also by the number of neurons involved in calculating Z. If only 16 neurons are used, the variance of Z is 16, whereas with 100 neurons, it rises to 100. Essentially, this variance isn't only influenced by the distribution from which the weights are drawn but also by the number of neurons contributing to the calculation, known as the "fan-in." Fan-in refers to the number of input connections coming into a neuron. Similarly, "fan-out" refers to the number of output connections a neuron has.

Let me illustrate with an example: Suppose there is a middle layer in a neural network with 200 neurons, connected to a previous layer of 100 neurons and a subsequent layer of 300 neurons. In this case, the fan-in for this layer is 100, and the fan-out is 300.

Using fan-in and fan-out provides a mechanism to control the variance during weight initialization.

  • Fan-in helps control the variance of the output Z of the current layer during the forward pass.
  • Fan-out adjusts the influence that the weights have during backpropagation, when gradients flow back from the subsequent layer.

Based on the number of neurons feeding into the current layer in the forward direction and out of it in the backward direction, researchers came up with a family of initialization methods built on top of these ideas: LeCun, Xavier/Glorot, and He/Kaiming initialization. The idea behind these methods is quite similar: they draw the weights randomly from either a uniform or a normal distribution, and use the fan-in or fan-out to control the variance. The mean of those distributions is always 0 to achieve a zero-mean output value.

In this post, I only provide a quick overview of weight initialization methods. To see the detailed explanations, one can refer to the book Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow and the Weights & Biases post here for practical insights and historical context.

Different types of initializations

| Initialization | Activation functions          | σ² (Normal) |
| -------------- | ----------------------------- | ----------- |
| Xavier/Glorot  | None, tanh, logistic, softmax | 1 / fan_avg |
| He/Kaiming     | ReLU and variants             | 2 / fan_in  |
| LeCun          | SELU                          | 1 / fan_in  |

LeCun Initialization is based on scaling down the variance of Z by using a smaller variance for the weight distribution. If the variance of Z is the product of fan-in and the variance of each weight, then to ensure Z has a variance of 1, the variance of each weight should be 1/fan_in. Thus, LeCun initialization draws weights randomly from N(0, 1/fan_in).

Xavier/Glorot Initialization considers not just the impact of the previous layer's weights (fan-in) but also the effect these weights have during backpropagation on the subsequent layer (fan-out). It balances the variance during both the forward and backward pass by using 2/(fan_in + fan_out) for the variance, from which weights can be drawn from either a normal distribution, N(0, 2/(fan_in + fan_out)), or a uniform distribution, U(−sqrt(6/(fan_in + fan_out)), sqrt(6/(fan_in + fan_out))).

U(−sqrt(6/(fan_in + fan_out)), sqrt(6/(fan_in + fan_out))) has the same variance as the normal distribution, 2/(fan_in + fan_out), since a uniform distribution is defined by its lower and upper bounds and its variance is just (upper bound − lower bound)² / 12. (source)

He/Kaiming Initialization is tailored for ReLU and its variants due to their unique properties. Since ReLU zeroes out negative inputs, only about half of the neuron activations are expected to be non-zero, which could lead to reduced variance and vanishing gradients. To compensate, He initialization doubles the variance used in LeCun's method, effectively using 2/fan_in, thus maintaining the necessary balance for layers using ReLU. For Leaky ReLUs and ELUs, while the adjustments are minor (e.g., using a factor of 1.55 for ELU instead of 2, source), the principle remains the same: we adjust the variance to stabilize gradients during backpropagation. In contrast, SELU requires using LeCun initialization across all hidden layers to leverage its self-normalizing properties.
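
As a rough sketch of how these schemes map onto PyTorch's init helpers (PyTorch ships no dedicated LeCun function, so it's expressed here through kaiming_normal_ with nonlinearity='linear', which we'll come back to below):

```python
import torch
import torch.nn as nn

layer = nn.Linear(in_features=100, out_features=200)  # its weights have fan_in = 100

# Xavier/Glorot: variance 2 / (fan_in + fan_out), suited to tanh/logistic/softmax
nn.init.xavier_normal_(layer.weight)

# He/Kaiming: variance 2 / fan_in, suited to ReLU and its variants
nn.init.kaiming_normal_(layer.weight, mode='fan_in', nonlinearity='relu')

# LeCun: variance 1 / fan_in, required for SELU; obtained from Kaiming with a
# gain of 1 (nonlinearity='linear')
nn.init.kaiming_normal_(layer.weight, mode='fan_in', nonlinearity='linear')

# Biases are commonly just zeroed out (see the bias discussion later)
nn.init.zeros_(layer.bias)
```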

This discussion opens up an interesting aspect of how weight initialization is implemented in frameworks like PyTorch, which can be framed as a question โ€”

How is weight initialization implemented in PyTorch, and what makes it special?

In PyTorch, the default approach for initializing weights in linear layers is based on the Lecun initialization method. On the other hand, the default initialization technique used in Keras is the Xavier/Glorot initialization.

However, PyTorch offers a particularly flexible approach when it comes to weight initialization, allowing users to fine-tune the process to match the specific requirements of different activation functions used in their models. This fine-tuning is achieved by considering two key components:

  1. Mode: This component determines whether the variance of the initialized weights is adjusted based on the number of input connections (fan-in) or the number of output connections (fan-out) in the layer.
  2. Gain: This is a scaling factor that adjusts the scale of the initialized weights depending on the activation function employed in the model. PyTorch provides a torch.nn.init.calculate_gain function that calculates a tailored gain value, optimizing the initialization process to enhance the overall functioning of the neural network.

This flexibility in customizing weight initialization parameters allows you to set up an initialization approach that is comparable to and compatible with the specific activation functions used in your model. Interestingly, PyTorch's implementation of weight initialization can help reveal some underlying relationships between different initialization methods.

For instance, while reviewing the PyTorch documentation on the SELU activation function, I discovered an intriguing aspect of weight initialization. The documentation notes that when using kaiming_normal or kaiming_normal_ for initialization with the SELU activation, one should opt for nonlinearity='linear' instead of nonlinearity='selu' to achieve self-normalization. This detail is fascinating because it highlights how the Kaiming method, when set to a gain of 1 via nonlinearity='linear', effectively replicates the LeCun initialization method. This demonstrates that LeCun initialization is a specific application of the more general Kaiming initialization approach. Similarly, the Xavier initialization method can be seen as another variant that adjusts for both the number of input connections (fan-in) and the number of output connections (fan-out).
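
This gain relationship is easy to verify directly; a minimal sketch (the 200×100 weight tensor stands in for an nn.Linear(100, 200) layer, so fan_in is 100):

```python
import torch
import torch.nn as nn

# Gain factors PyTorch applies on top of the 1/sqrt(fan) scaling
print(nn.init.calculate_gain('linear'))  # 1.0
print(nn.init.calculate_gain('relu'))    # sqrt(2) ~= 1.414

# With gain = 1, Kaiming's std of gain / sqrt(fan_in) collapses to 1 / sqrt(fan_in),
# which is exactly the LeCun scheme SELU needs for self-normalization.
w = torch.empty(200, 100)  # weight of an nn.Linear(100, 200), so fan_in = 100
nn.init.kaiming_normal_(w, mode='fan_in', nonlinearity='linear')
print(w.std())  # empirically close to 1 / sqrt(100) = 0.1
```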

I get that we need to be careful in choosing the mean and variance when initializing the weights from a distribution. But what I'm still not clear on is why we would want to draw the initial weights from a normal distribution versus a uniform distribution. Can you explain the reasoning behind using one over the other?

You make a fair point regarding the importance of carefully choosing the mean and variance of the distribution when initializing weights. When initializing weights in neural networks, an important consideration is whether to draw from a normal or uniform distribution. While there is no definitive research-backed answer, there are some plausible reasons behind these choices:

The uniform distribution has the highest entropy, meaning all values within the range are equally likely. This unbiased approach can be useful when you lack prior knowledge about which values might work better for initialization. It treats each potential weight value fairly, assigning an equal probability. This is akin to betting evenly across all teams in a game with limited information โ€” it maximizes the likelihood of a favorable outcome. Since you don't know which specific values make good initial weights, using a uniform distribution ensures an unbiased starting point.

On the other hand, a normal distribution is more likely to initialize weights with smaller values closer to zero. Smaller initial weights are generally preferred because they reduce the variance of the output and help maintain stable gradients during training. This is similar to why we prefer smaller variance in weight initialization methods over unit variance. Additionally, certain activation functions like sigmoid and tanh tend to perform better with smaller initial weight values, even if these activations are only used in the final output layer rather than hidden layers.

Regarding the concept of likelihood, you can refer to my old post where I used an example involving my cat, Bubble, to explain likelihood, maximum likelihood estimation (MLE), and maximum a posteriori (MAP) estimation.

Ultimately, the uniform distribution provides an unbiased start when lacking prior knowledge, treating all potential weight values as equally likely. In contrast, the normal distribution favors smaller initial weights close to zero, which can aid gradient stability and suit certain activation functions like sigmoid and tanh. The choice between these distributions is often guided by empirical findings across different neural architectures and tasks. While no universally optimal approach exists, understanding the properties of uniform and normal distributions allows for more informed, problem-specific initialization decisions.

Do we also use those weight initialization methods for the bias terms? How do we initialize the biases?

Good question. We don't necessarily use the same initialization techniques for the bias terms as we do for the weights. In fact, it's a common practice to simply initialize all biases to 0. The reason is that while the weights determine the shape of the function each neuron is learning to approximate the underlying data, the biases just act as an offset value to shift those functions up or down. So the biases don't directly impact the overall shape learned by the weights.

Since the main goal of initialization is to break symmetry and provide a good starting point for the weight learning, we're less concerned with how the biases are initialized. Setting them all to 0 is generally good enough. You can find more detailed discussion on this in the CS231n course notes.

Batch Normalization

With the activation functions chosen and weights properly initialized, we can start training our neural network (firing up our mini ice cream factory production line). But quality control is needed, both initially and during training iterations. Two key techniques are feature normalization and batch normalization.

As discussed earlier in my post about gradient descent, these techniques reshape the loss landscape for faster convergence. Feature normalization applies this to the initial data inputs, while batch normalization normalizes inputs to hidden layers between epochs. Both techniques are akin to implementing quality assurance checks at different stages of the 'production line'.

Why does batch normalization work? Why is making the input to each layer have zero mean and unit variance helpful for solving gradient issues?

Batch normalization helps mitigate issues like vanishing/exploding gradients by reducing the internal covariate shift between layers during training. Let's think about why this internal shift occurs in the first place, treating each layer of the network as a different department in the factory. As we update the parameters of each layer based on the gradients, every update to the parameters (or settings) in one department changes the input for the next department. This can create a bit of chaos, as each following department has to adjust to the new changes. This is what we call internal covariate shift in deep learning. Now, what happens when these shifts occur frequently? The network struggles to stabilize because each layer's input keeps changing. It's similar to how constant changes in one part of the factory can lead to inconsistencies in the product quality, confusing the workers and messing up the workflow.

Batch normalization aims to fix this by normalizing the inputs to each layer to have zero mean and unit variance across the mini-batches during training. It enforces a consistent, controlled input distribution that layers can expect. Going back to the factory analogy, it's like setting a strict quality standard for each department's output before it gets passed to the next department. For example, setting rules that the baking department must produce ice cream cones of consistent size and shape. This way, the next decoration department doesn't have to account for cone variance โ€” they can simply add the same amount of ice cream to each standardized cone.

By reducing this internal covariate shift through normalization, batch norm prevents the gradients from going haywire during the training process. The layers don't have to constantly readjust to wildly shifting input distributions, so the gradients remain more stable.

Additionally, the normalization acts as a regularizer, smoothing out the objective landscape. This allows using higher learning rates for faster convergence. Generally, batch normalization reduces internal variance shifts, stabilizes gradients, regularizes the objective, and enables training acceleration.

As we touched on earlier in the activation section, SELU uses the principles of batch normalization to achieve self-normalization. For a more in-depth exploration of batch normalization, I highly recommend Johann Huber's detailed post on Medium.

How to Apply Batch Normalization? Should It Be Before or After Activation? How to Handle It During Training and Testing?

Batch normalization has really changed how we train DNNs by adding an extra layer to stabilize gradients. There's a debate in the deep learning community about whether to apply it before or after activation functions. Honestly, it depends on your model, and you might need to experiment a bit. Just make sure to keep your method consistent, as switching it up can cause unexpected issues.

During training, the batch normalization layer computes the mean and standard deviation for each dimension across the mini-batches. These statistics are then used to normalize the output, ensuring it has zero mean and unit variance. This process can be thought of as transforming the input's distribution into a standard normal distribution. Unlike feature normalization, which normalizes features using the entire training dataset, batch normalization adjusts based on each mini-batch, making it dynamic and responsive to the data being processed.

Now, testing is a different story. It's important not to use the mean and variance from the testing data for normalization. Instead, these parameters, viewed as learned features, should be carried over from the training process. Although each mini-batch during training has its own mean and variance, a common practice is to use a moving average of these values throughout the training phase. This provides a stable estimate that can be applied during testing. Another less common method involves conducting one more epoch using the entire training dataset to compute a comprehensive mean and variance.

When training with PyTorch as your DNN framework, it offers extra flexibility with two learnable parameters, γ (scale) and β (shift). These allow the batch normalization layer to rescale and shift the normalized output, and the default settings are generally quite effective. However, it's important to note that during the training's forward pass, PyTorch uses a biased estimator for calculating variance, but it switches to an unbiased estimator for the moving average used during testing. This adjustment is beneficial for more accurately approximating the population standard deviation, enhancing the model's reliability in unseen conditions.
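
Here's a small sketch of that train/test behavior with a single BatchNorm1d layer (the feature size of 10 and batch size of 32 are arbitrary choices):

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(num_features=10)  # gamma (bn.weight) and beta (bn.bias) are learnable
x = torch.randn(32, 10) * 5 + 3       # a mini-batch with non-zero mean and large variance

# Training mode: normalize with the mini-batch statistics and update the running estimates
bn.train()
out = bn(x)
print(out.mean().item(), out.var().item())      # roughly 0 and 1
print(bn.running_mean[:3], bn.running_var[:3])  # the moving averages being accumulated

# Evaluation mode: normalize new data with the stored running statistics instead
bn.eval()
with torch.no_grad():
    test_out = bn(torch.randn(4, 10))
```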

Applying batch normalization correctly is important for effective learning in your network. It ensures that the network not only learns well but also maintains its performance across different datasets and testing scenarios. Think of it as precisely calibrating each segment of a production line, ensuring seamless and consistent operation throughout.

Why is batch normalization applied during the forward pass rather than directly to the gradients during backpropagation?

There are several reasons why batch normalization is typically applied to inputs or activations during the forward pass rather than directly to the gradients during backpropagation.

Firstly, there's a lack of empirical evidence or practice showing the benefits of applying batch normalization directly to gradients. The concept of internal covariate shift primarily occurs during the forward pass as the distribution of layer inputs changes due to updates in the parameters. Therefore, it makes sense to apply batch normalization during this phase to stabilize these inputs before they are processed by subsequent layers. Also, applying batch normalization directly to the gradients could potentially distort the valuable information carried by the gradients' magnitude and direction. This is similar to altering customer feedback in a way that changes its inherent meaning, which could mislead future adjustments in a production process of our mini ice cream factory.

However, making minor adjustments to gradients, such as gradient clipping, is generally acceptable and beneficial. This technique caps the gradients to prevent them from becoming excessively large, effectively keeping them within a safe range. This is similar to filtering out extreme outliers in feedback, which helps maintain the integrity of the overall feedback while preventing any drastic reactions that could derail the process. In PyTorch, monitoring gradient norms is a common practice, and if gradients begin to explode, techniques like gradient clipping can be employed. PyTorch offers functions such as torch.nn.utils.clip_grad_norm_ and torch.nn.utils.clip_grad_value_ to help manage this.
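
As a minimal sketch of how norm clipping typically slots into a training step (the tiny model, loss, and max_norm of 1.0 below are placeholders, not recommendations):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 50), nn.ReLU(), nn.Linear(50, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.MSELoss()

x, y = torch.randn(32, 10), torch.randn(32, 1)

optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()

# Rescale gradients so their total norm never exceeds max_norm; the function
# also returns the pre-clipping norm, which is handy for spotting explosions.
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
print(total_norm)

optimizer.step()
```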

You mentioned the option of clipping gradients instead of directly normalizing them. Why exactly do we choose to clip gradients rather than flooring them?

Clipping gradients is a simple yet efficient technique that helps prevent the issue of exploding gradients. We often manually cap the maximum value of gradients. For instance, the ReLU activation function can be modified to have an upper limit of 6, known as ReLU6 in PyTorch (learn more about ReLU6 here). By setting this cap, we ensure that during the backpropagation process, when gradients are multiplied at each layer according to the chain rule, their values do not become excessively large. This clipping directly prevents the gradients from escalating to a point where they could derail the learning process by ensuring they remain within manageable limits.

Flooring gradients, on the other hand, would set a lower limit to prevent them from getting too small. However, it doesn't address the fundamental issue of vanishing gradients. Vanishing gradients often occur because certain activation functions, like sigmoid or tanh, saturate and shrink the gradient values severely as inputs move away from zero. This leads to very small gradient values that make learning extremely slow or stagnant. Flooring the gradients doesn't solve this because the root of the problem lies in the nature of the activation function compressing the gradient values, not just in the values being too small. Instead, to effectively combat vanishing gradients, it's more beneficial to adjust the network architecture or the choice of activation functions. Techniques such as using activation functions that do not saturate (like ReLU), adding skip connections (as seen in ResNet architectures), or employing gated mechanisms in RNNs (like LSTM or GRU) can inherently prevent gradients from vanishing by ensuring a healthier flow of gradients throughout the network during backpropagation.

To summarize, while gradient clipping effectively manages overly large gradients, flooring, which sets a lower limit, does not effectively address the issue of overly small gradients. Instead, resolving problems associated with vanishing gradients typically requires making architectural adjustments.

It's important to note that when using gradient clipping, the Gradient Clipping Threshold becomes an additional hyperparameter that may need to be tuned or set based on findings from other research and the choice of activation function. As always, introducing this extra hyperparameter adds another layer of complexity to the model training process.

In Practice (Personal Experience)

Before wrapping up, it's clear that all the methods discussed are valuable for addressing vanishing and exploding gradient issues. These are all practical approaches that could enhance your model's training process. To conclude this post, I'd like to end it with one last question -

What's the reality? What's the common process in practice?

In practice, the good news is that you don't need to experiment with every possible solution. When it comes to choosing an activation function, ReLU is often the go-to choice and is very cost-effective. It passes the magnitude of positive inputs unchanged (unlike sigmoid and tanh, which compress large values towards 1 regardless of their size) and is straightforward in terms of calculation and its derivatives. It's also well-supported across major frameworks. If you're concerned about the issue of dead ReLUs, you might consider alternatives like Leaky ReLU, ELU, SELU, or GELU, but generally, it's advisable to steer clear of sigmoid and tanh to avoid vanishing gradients.

With ReLU being the preferred activation function, there's less worry about weight initialization being overly sensitive, which is more of a concern with functions like sigmoid, tanh, and SELU. Instead, focusing on the recommended weight initialization methods for your chosen activation function should suffice (for example, using He/Kaiming initialization with ReLU due to its considerations for the non-linearities of ReLU).

Always incorporate batch normalization in your networks. Decide (or experiment) whether to apply it before or after the activation function, and stick with that choice consistently throughout your model. Batch normalization offers multiple benefits, including regularization effects and enabling the use of higher learning rates, which can speed up training and convergence.

So, what's worth experimenting with? Optimizers are worth some exploration. In a previous post, I discussed various optimizers, including gradient descent and its popular variations (read more here). While Adam is fast, it can lead to overfitting and might decrease the learning rate too quickly. SGD is reliable and can be very effective, especially in parallel computing environments. Though it tends to be slower, it's a solid choice if you're aiming to squeeze every bit of performance from your model. Sometimes, RMSprop might be a better alternative. I personally find starting with Adam for its speed and then switching to SGD in later epochs to find a better minimum and prevent overfitting is a good strategy.

If you're enjoying this series, remember that your interactions โ€” claps, comments, and follows โ€” do more than just support; they're the driving force that keeps this series going and inspires my continued sharing.

Other posts in this series:

Reference

Activation function

Weight initialization

Gradient Clipping