Hello everyone! Today we are going to go through one of the most fundamental concepts in Deep Learning: Matrix Calculus. To fully grasp a concept and build intuition for it, it often helps to try building it from scratch. To do that in Deep Learning, you need a solid understanding of Matrix Calculus in order to implement the backpropagation algorithm for complex model architectures or operations.
The aim of this post is to show you how to compute the derivative of a matrix of functions with respect to a matrix of inputs. Normally this results in a 4-dimensional tensor, but there is a trick to get around it. We will use the recursive portion of Backpropagation Through Time (BPTT) in Recurrent Neural Networks (RNNs) as an example to learn about this trick. For those who are curious, we will also include a proof of the trick at the end. Please note that I will mostly be using screenshots of LaTeX and hand-written notes for improved readability.
Here is an illustration of the forward propagation and backward propagation of the recursive cell in an RNN.

We will focus on the boxed area (recursive portion) for the purpose of this post. For those who prefer equations, here is the equation version of the above illustration.

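If the image above does not render, here is a minimal typeset sketch of the recursive cell, assuming the common a / W_aa / W_ax notation and a tanh activation (my reconstruction; the screenshot may differ in details such as the choice of activation):

```latex
a^{\langle t \rangle} = \tanh\left(W_{aa}\, a^{\langle t-1 \rangle} + W_{ax}\, x^{\langle t \rangle} + b_{a}\right)
```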
Let's now consider the derivative of the cost function of the Recurrent Neural Network with respect to the <t-1>th activation layer. The equation for this derivative is as follows:

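As a typeset fallback, and assuming the tanh cell sketched above, this derivative can be written with the chain rule as follows (again my reconstruction, not necessarily the exact form in the screenshot):

```latex
\frac{\partial J}{\partial a^{\langle t-1 \rangle}}
  = \left(\frac{\partial a^{\langle t \rangle}}{\partial a^{\langle t-1 \rangle}}\right)^{\top}
    \frac{\partial J}{\partial a^{\langle t \rangle}}
  = W_{aa}^{\top}\left[\frac{\partial J}{\partial a^{\langle t \rangle}}
    \odot \left(1 - \left(a^{\langle t \rangle}\right)^{2}\right)\right]
```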
The partial derivatives of the parameters we are interested in are:

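The list itself is an image; my assumption, based on the exercise at the end of the post, is that it covers the gradients of the cost with respect to each piece of the recursive cell, for example:

```latex
\frac{\partial J}{\partial a^{\langle t-1 \rangle}}, \qquad
\frac{\partial J}{\partial W_{aa}}, \qquad
\frac{\partial J}{\partial W_{ax}}, \qquad
\frac{\partial J}{\partial b_{a}}
```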
Here are some useful Matrix Calculus rules:


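The rules appear as a screenshot; the reconstruction below is my assumption of what they say, chosen to be consistent with the discussion that follows (J is a scalar cost and the circled dot is the element-wise product):

```latex
% Rule 1 (linear map, vector input): z = Wx + b
\frac{\partial J}{\partial x} = W^{\top}\frac{\partial J}{\partial z}, \qquad
\frac{\partial J}{\partial b} = \frac{\partial J}{\partial z}

% Rule 2 (element-wise function): a = g(z)
\frac{\partial J}{\partial z} = \frac{\partial J}{\partial a} \odot g'(z)

% Rule 3 (matrix-matrix product): Z = WX
\frac{\partial J}{\partial X} = W^{\top}\frac{\partial J}{\partial Z}, \qquad
\frac{\partial J}{\partial W} = \frac{\partial J}{\partial Z}\,X^{\top}
```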
Derivative rules 1 and 2 are pretty straightforward. If not, you can use the brute-force method: compute the functions explicitly and then take the derivatives element-wise to prove rules 1 and 2. A small sketch of this check follows.
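To make the brute-force check concrete, here is a small NumPy sketch that compares the analytic gradient from rules 1 and 2 (as reconstructed above) against centered finite differences; the cost J and all shapes here are made up purely for illustration:

```python
import numpy as np

np.random.seed(2)
n_out, n_in = 3, 4
W = np.random.randn(n_out, n_in)
b = np.random.randn(n_out)
x = np.random.randn(n_in)
g = np.random.randn(n_out)          # stand-in upstream gradient dJ/da

def J(x):
    # Scalar "cost" J = g . tanh(Wx + b), chosen so that dJ/da is exactly g
    return g @ np.tanh(W @ x + b)

# Analytic gradient using rules 1 and 2:
#   dJ/dz = g * (1 - tanh(z)^2)   (rule 2, element-wise)
#   dJ/dx = W.T @ dJ/dz           (rule 1, linear map)
z = W @ x + b
analytic = W.T @ (g * (1.0 - np.tanh(z) ** 2))

# Brute force: perturb each component of x and take a centered difference
eps = 1e-6
numeric = np.array([(J(x + eps * e) - J(x - eps * e)) / (2 * eps)
                    for e in np.eye(n_in)])

print(np.allclose(analytic, numeric, atol=1e-5))  # expected: True
```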
However, rule 3 is quite odd. It results in a 4D tensor, yet we can flatten it into a 2D matrix. We'll explain the consequences of using this simplification soon. Again, if you would like to see the proof of rule 3, wait until the end.
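Assuming rule 3 is the matrix-product rule Z = WX, the following NumPy sketch builds the full 4D Jacobian explicitly and shows that contracting it with an upstream gradient gives exactly the flattened 2D result W.T @ dJ/dZ:

```python
import numpy as np

np.random.seed(1)
m, n, p = 2, 3, 4               # W is (m, n), X is (n, p), Z = W @ X is (m, p)
W = np.random.randn(m, n)
X = np.random.randn(n, p)
dJ_dZ = np.random.randn(m, p)   # stand-in upstream gradient

# Full Jacobian dZ/dX is a 4D tensor: dZ[i, j] / dX[k, l] = W[i, k] if j == l else 0
J4 = np.zeros((m, p, n, p))
for i in range(m):
    for j in range(p):
        for k in range(n):
            J4[i, j, k, j] = W[i, k]

# Chain rule with the full tensor: dJ/dX[k, l] = sum_{i, j} dJ/dZ[i, j] * dZ[i, j]/dX[k, l]
dJ_dX_tensor = np.einsum('ij,ijkl->kl', dJ_dZ, J4)

# The "flattened" 2D shortcut: dJ/dX = W.T @ dJ/dZ
dJ_dX_flat = W.T @ dJ_dZ

print(np.allclose(dJ_dX_tensor, dJ_dX_flat))  # expected: True
```

In other words, the 4D tensor never has to be materialized; the chain-rule contraction collapses it into an ordinary matrix product.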











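Putting the rules together, here is a minimal NumPy sketch of the boxed recursive step and its backward pass, assuming the tanh cell from earlier; the function names (rnn_cell_forward, rnn_cell_backward) are mine, not from any particular library:

```python
import numpy as np

def rnn_cell_forward(a_prev, x_t, W_aa, W_ax, b_a):
    """Forward pass of the boxed (recursive) portion.

    Shapes: a_prev (n_a, m), x_t (n_x, m), W_aa (n_a, n_a), W_ax (n_a, n_x), b_a (n_a, 1),
    where m is the number of examples stored as columns.
    """
    z = W_aa @ a_prev + W_ax @ x_t + b_a
    a_t = np.tanh(z)
    cache = (a_prev, x_t, a_t)
    return a_t, cache

def rnn_cell_backward(dJ_da_t, cache, W_aa):
    """Backward pass given the upstream gradient dJ/da<t> of shape (n_a, m)."""
    a_prev, x_t, a_t = cache
    # Rule 2: tanh is element-wise, so dJ/dz = dJ/da<t> * (1 - a<t>^2)
    dJ_dz = dJ_da_t * (1.0 - a_t ** 2)
    # Rule 3 (the flattening trick): the 4D Jacobians collapse to matrix products
    dJ_da_prev = W_aa.T @ dJ_dz          # gradient flowing back through time
    dJ_dW_aa = dJ_dz @ a_prev.T
    dJ_dW_ax = dJ_dz @ x_t.T
    # Bias gradient: sum dJ/dz over the batch dimension (columns)
    dJ_db_a = dJ_dz.sum(axis=1, keepdims=True)
    return dJ_da_prev, dJ_dW_aa, dJ_dW_ax, dJ_db_a

# Tiny usage example with made-up sizes
n_a, n_x, m = 4, 3, 5
W_aa, W_ax = np.random.randn(n_a, n_a), np.random.randn(n_a, n_x)
b_a = np.random.randn(n_a, 1)
a_prev, x_t = np.random.randn(n_a, m), np.random.randn(n_x, m)

a_t, cache = rnn_cell_forward(a_prev, x_t, W_aa, W_ax, b_a)
dJ_da_t = np.random.randn(n_a, m)        # pretend this came from later time steps
dJ_da_prev, dJ_dW_aa, dJ_dW_ax, dJ_db_a = rnn_cell_backward(dJ_da_t, cache, W_aa)
```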
That is all for how to get around a 4D tensor in matrix calculus. However, for those who are curious and want to go deeper into why rule 3 works, let's see how we can derive it.







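For readers who prefer typeset math to hand-written notes, here is a compact version of the derivation, again assuming rule 3 refers to the matrix product Z = WX:

```latex
% Z = WX, so Z_{ij} = \sum_k W_{ik} X_{kj}. The full Jacobian is a 4D tensor:
\frac{\partial Z_{ij}}{\partial X_{kl}} = W_{ik}\,\delta_{jl}

% Contracting it with the upstream gradient \partial J / \partial Z:
\frac{\partial J}{\partial X_{kl}}
  = \sum_{i,j} \frac{\partial J}{\partial Z_{ij}} \frac{\partial Z_{ij}}{\partial X_{kl}}
  = \sum_{i,j} \frac{\partial J}{\partial Z_{ij}} W_{ik}\,\delta_{jl}
  = \sum_{i} W_{ik} \frac{\partial J}{\partial Z_{il}}
  = \left(W^{\top}\frac{\partial J}{\partial Z}\right)_{kl}

% So the 4D tensor never needs to be stored: its contraction with
% \partial J / \partial Z is exactly the 2D matrix product W^{\top} (\partial J / \partial Z).
```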
Practice makes perfect :) Try to compute the other partial derivatives of interest. Here are the answers for you to check:

Inspired and encouraged to write by:
Hsi-Ming Chang and Raymond Kwok