Many times while training deep neural networks, we wish we had a bigger GPU to accommodate a BIG batch size, but we don't! Training with a batch that is too small can lead to:

  • slower convergence, and
  • lower accuracy

Let's revisit the typical training loop in PyTorch. We will be using this to understand Gradient Accumulation.

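A minimal sketch of such a loop, where model, criterion, optimizer, and train_loader are placeholder names assumed to be defined elsewhere (a nn.Module, a loss function, an optimizer, and a DataLoader):

```python
for inputs, targets in train_loader:
    optimizer.zero_grad()               # reset gradients from the previous batch
    outputs = model(inputs)             # forward pass
    loss = criterion(outputs, targets)  # compute the loss
    loss.backward()                     # backward pass: compute gradients
    optimizer.step()                    # update the parameters
```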

Let's understand the solution, Gradient Accumulation, in 3 steps:

Step 1: Divide the BIG batch into smaller batches

Dividing here just means keeping the batch size small enough that each mini-batch fits in GPU memory, as sketched below.

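A quick sketch of this idea, assuming a train_dataset that is already defined and using purely illustrative numbers:

```python
from torch.utils.data import DataLoader

# Illustrative numbers: suppose we wanted an effective batch size of 96,
# but only 32 samples fit in GPU memory at a time.
micro_batch_size = 32
accumulation_steps = 3  # 32 * 3 = 96, the BIG batch size we actually wanted

train_loader = DataLoader(train_dataset, batch_size=micro_batch_size, shuffle=True)
```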

Step 2: Accumulate Gradients

This is the core step.

We can't fit a large batch in memory, but what if we collect the gradients computed for each mini-batch and act on them together? The idea is to keep accumulating gradients across mini-batches until we reach a chosen number of accumulation steps.

In the example below, we accumulate gradients for 3 steps.

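Here is a tiny, self-contained sketch showing that "loss.backward()" really does add gradients up across calls:

```python
import torch

w = torch.tensor([1.0], requires_grad=True)

for step in range(3):          # 3 accumulation steps
    loss = (2 * w).sum()       # d(loss)/dw = 2 on every step
    loss.backward()            # adds 2 to w.grad each time

print(w.grad)                  # tensor([6.]) -> 2 + 2 + 2 accumulated
```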

How do we implement this in code? Well, just don't call "optimizer.step()" (or "optimizer.zero_grad()") after every backward pass. Because "loss.backward()" adds to each parameter's ".grad" attribute, the gradients for every trainable parameter keep accumulating automatically.

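A sketch of what that looks like, reusing the placeholder names from the training loop above:

```python
for inputs, targets in train_loader:
    outputs = model(inputs)             # forward pass
    loss = criterion(outputs, targets)
    loss.backward()                     # gradients are ADDED to each parameter's .grad
    # Note: no optimizer.step() and no optimizer.zero_grad() here, so the
    # gradients from successive mini-batches keep piling up.
```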

Step 3: Update parameter values using Accumulated gradients

Once we have accumulated gradients for "accumulation_steps" mini-batches, we go ahead and do the parameter update, then reset the gradients for the next cycle.

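Continuing the same sketch, we count mini-batches and only step the optimizer once every accumulation_steps batches (3 here, as in the example above):

```python
accumulation_steps = 3

for batch_idx, (inputs, targets) in enumerate(train_loader):
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    loss.backward()                           # keep accumulating gradients

    if (batch_idx + 1) % accumulation_steps == 0:
        optimizer.step()                      # update using the accumulated gradients
        optimizer.zero_grad()                 # reset .grad for the next cycle
```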

But there is a catch

Won't the gradients be huge, since we are accumulating them?

They will be, and that's why we need to normalize them by the number of accumulation steps. The standard trick is to divide the loss by the number of accumulation steps before calling "loss.backward()", so the accumulated gradient ends up being an average rather than a sum.

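In code, that is just one extra line before the backward pass (same placeholder names as before):

```python
loss = criterion(outputs, targets)
loss = loss / accumulation_steps   # normalize so the update matches a big-batch average
loss.backward()
```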

Summary:

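Putting the three steps (plus the normalization) together, a minimal sketch of the full gradient-accumulation loop looks like this, again assuming model, criterion, optimizer, and train_loader are defined as usual:

```python
accumulation_steps = 3                        # illustrative value

model.train()
optimizer.zero_grad()
for batch_idx, (inputs, targets) in enumerate(train_loader):
    outputs = model(inputs)                   # forward pass on a small mini-batch
    loss = criterion(outputs, targets)
    loss = loss / accumulation_steps          # normalize by the accumulation steps
    loss.backward()                           # accumulate gradients

    if (batch_idx + 1) % accumulation_steps == 0:
        optimizer.step()                      # update with the accumulated gradients
        optimizer.zero_grad()                 # reset for the next accumulation cycle
```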

Hope you enjoyed this quick read!

Originally Published at Intuitive Shorts:

Follow Intuitive Shorts (a free Substack newsletter) to read quick and intuitive summaries of ML/NLP/DS concepts.

