Hello everyone! Today, we are going to go through one of the most important fundamental concepts in Deep Learning: Matrix Calculus. To fully grasp a concept and build intuition for it, it often helps to try building it from scratch. To do so in Deep Learning, you need a solid understanding of Matrix Calculus in order to implement the backpropagation algorithm for complex model architectures and operations.

The aim of this post is to show you how to compute the derivatives of a matrix of functions with respect to a matrix of inputs. Normally, this results in a 4-dimensional tensor, but there is a trick to get around it. We will use the recursive portion of Backpropagation Through Time (BPTT) in Recurrent Neural Networks (RNNs) as an example to learn this trick. For those who are curious, a proof of the trick is included at the end. Please note that I will mostly be using screenshots of LaTeX and hand-written notes for improved readability.

Here is an illustration of the forward propagation and backward propagation of the recursive cell in an RNN.

[Image: forward propagation and backward propagation of the recursive cell]
Image from DeepLearning.AI[2]: Sequence Models

We will focus on the boxed area (recursive portion) for the purpose of this post. For those who prefer equations, here is the equation version of the above illustration.

[Image: equations for the forward and backward pass of the recursive cell]
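For reference, in the usual DeepLearning.AI notation the recursive-cell equations can be written as follows (my own transcription, not a copy of the screenshot; z^<t> is the pre-activation that we will keep coming back to):

```latex
z^{<t>} = W_{aa}\, a^{<t-1>} + W_{ax}\, x^{<t>} + b_a, \qquad
a^{<t>} = \tanh\!\left(z^{<t>}\right), \qquad
\hat{y}^{<t>} = \mathrm{softmax}\!\left(W_{ya}\, a^{<t>} + b_y\right)
```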

Let's now consider the derivative of the cost function of the Recurrent Neural Network with respect to the <t-1>th activation layer. The equation for this derivative is as follows:

[Image: equation for the derivative of the cost with respect to a^<t-1>]
The addition of the direct loss (in a many-to-many RNN) and the derivatives of other subsequent layers, such as the Softmax and Dense layers, are hidden inside this equation.
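As a sketch in this post's notation (not necessarily the exact form shown above), the starting point is the chain-rule decomposition:

```latex
\frac{\partial J}{\partial a^{<t-1>}}
  = \left(\frac{\partial a^{<t>}}{\partial a^{<t-1>}}\right)^{\!\top}
    \frac{\partial J}{\partial a^{<t>}}
```

where ∂J/∂a^<t> already bundles the direct loss at step t and the gradients flowing back through the Softmax and Dense layers and the later time steps.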

The partial derivatives of the parameters we are interested in are:

[Image: the partial derivatives of interest]
We will derive the derivatives with respect to W_ax and W_aa together. You can try the rest.
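To make the targets explicit, the gradients in question are presumably the following (my own listing, based on the parameters and inputs of the cell defined above):

```latex
\frac{\partial J}{\partial W_{ax}}, \qquad
\frac{\partial J}{\partial W_{aa}}, \qquad
\frac{\partial J}{\partial b_a}, \qquad
\frac{\partial J}{\partial a^{<t-1>}}
```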

Here are some useful Matrix Calculus rules:

[Image: useful matrix calculus rules]
We actually won't need rules 1 and 2, but they are good to know :)
[Image: the same rules with the order of operations switched]
The order of operations is switched here, yet the outputs of the derivatives are the same. Pretty neat.

Derivative rules 1 and 2 are pretty straightforward. If they are not, you can use the brute-force method: compute the functions explicitly, then take the derivatives element-wise to prove rules 1 and 2.
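To make the brute-force idea concrete, here is a small NumPy sketch of an element-wise check. The exact statements of rules 1 and 2 are in the screenshot above, so the rule checked here, dz/dx = W for z = Wx, is just an assumed stand-in of the same flavor:

```python
import numpy as np

# Brute-force, element-wise check of a vector-calculus rule.
# The rule checked here -- z = W x  =>  dz/dx = W -- is an assumed example;
# rules 1 and 2 from the screenshot can be verified the same way.

rng = np.random.default_rng(0)
n_a, n_x = 4, 3
W = rng.normal(size=(n_a, n_x))
x = rng.normal(size=(n_x,))
eps = 1e-6

# Numerical Jacobian: nudge each x_j and record how every z_i responds.
jac = np.zeros((n_a, n_x))
for j in range(n_x):
    x_plus, x_minus = x.copy(), x.copy()
    x_plus[j] += eps
    x_minus[j] -= eps
    jac[:, j] = (W @ x_plus - W @ x_minus) / (2 * eps)

print(np.allclose(jac, W))  # True: the element-wise derivative is exactly W
```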

However, rule 3 is quite odd. It results in a 4D tensor, yet we can flatten it into a 2D matrix. We'll explain the consequence of using this simplification soon. Again, if you would like to see the proof of rule 3, wait until the end.
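Before the worked example, here is what the flattening amounts to in its simplest form (my paraphrase of rule 3, not a quote of the screenshot): for a vector z = Wx, the object ∂z/∂W is formally a 3-dimensional tensor (4-dimensional once z is a matrix of stacked outputs), but it only ever appears contracted against ∂J/∂z in the chain rule, and that contraction collapses to an outer product:

```latex
z = W x \;\;\Longrightarrow\;\; \frac{\partial J}{\partial W} = \frac{\partial J}{\partial z}\, x^{\top}
```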

[Images: setting up the chain rule and computing its components]
Set up the equation using the chain rule. Then, compute the components inside the equation independently.
[Images: applying rule 3]
We do not have to worry about the order of operations nested inside before using rule 3. We can just directly replace the components in the derivative of z^<t> with the components computed earlier.
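Written out in this post's notation, and assuming the forward equations above (with ⊙ denoting element-wise multiplication), the substitution looks like this:

```latex
\frac{\partial J}{\partial z^{<t>}}
  = \frac{\partial J}{\partial a^{<t>}} \odot \left(1 - \left(a^{<t>}\right)^{2}\right),
\qquad
\frac{\partial J}{\partial W_{ax}} = \frac{\partial J}{\partial z^{<t>}}\left(x^{<t>}\right)^{\!\top},
\qquad
\frac{\partial J}{\partial W_{aa}} = \frac{\partial J}{\partial z^{<t>}}\left(a^{<t-1>}\right)^{\!\top}
```

using d tanh(z)/dz = 1 − tanh²(z) = 1 − (a^<t>)².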
[Images: substituting the earlier results into the chain rule]
Reusing earlier computations.
[Image: keeping track of dimensions]
Distinguishing two identical-looking values by labeling their dimensions with the closest and most direct output of each variable.
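Pulling the pieces together, here is a minimal NumPy sketch of the backward pass through one recursive cell using the flattened gradients. The function name and argument layout are my own and do not come from the course; shapes follow the single-example (no batch) case:

```python
import numpy as np

def rnn_cell_backward(da_t, a_t, a_prev, x_t, W_aa):
    """Backward pass through one recursive cell, a_t = tanh(W_aa a_prev + W_ax x_t + b_a).

    da_t   : dJ/da^<t>, shape (n_a, 1) -- gradient flowing into this cell
    a_t    : a^<t>,     shape (n_a, 1)
    a_prev : a^<t-1>,   shape (n_a, 1)
    x_t    : x^<t>,     shape (n_x, 1)
    W_aa   : shape (n_a, n_a)
    """
    # dJ/dz^<t>: backprop through tanh, since d tanh(z)/dz = 1 - tanh(z)^2 = 1 - a_t^2
    dz = da_t * (1.0 - a_t ** 2)

    # The "flattening" trick: the higher-order tensor dz/dW collapses to an outer product.
    dW_ax = dz @ x_t.T        # shape (n_a, n_x)
    dW_aa = dz @ a_prev.T     # shape (n_a, n_a)
    db_a = dz                 # shape (n_a, 1); sum over the batch axis if one exists
    da_prev = W_aa.T @ dz     # dJ/da^<t-1>, passed to the previous time step

    return dW_ax, dW_aa, db_a, da_prev
```

Note that da_prev is exactly the ∂J/∂a^<t-1> from the beginning of the post, which is what lets the recursion carry the gradient back through time.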

That is all for how to get around a 4D tensor in matrix calculus. However, for those who are curious and want to go deeper into why rule 3 works, let's see how we can derive it.

[Image]
Homage to Inception and InceptionNet
[Images: writing out z^<t> explicitly]
Setting up the brute-force method by illustrating z^<t>
[Image: a 4-dimensional tensor]
4-Dimensional Tensor. Matrices inside a matrix.
[Images: deriving rule 3 with the brute-force method]
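In index notation, the bookkeeping behind the brute-force derivation can be summarized as follows (my own sketch of the argument, written in this post's notation):

```latex
z^{<t>}_i = \sum_{k} (W_{ax})_{ik}\, x^{<t>}_k + \sum_{k} (W_{aa})_{ik}\, a^{<t-1>}_k + (b_a)_i
\;\;\Longrightarrow\;\;
\frac{\partial z^{<t>}_i}{\partial (W_{ax})_{jk}} = \delta_{ij}\, x^{<t>}_k

\frac{\partial J}{\partial (W_{ax})_{jk}}
  = \sum_{i} \frac{\partial J}{\partial z^{<t>}_i}\,
             \frac{\partial z^{<t>}_i}{\partial (W_{ax})_{jk}}
  = \frac{\partial J}{\partial z^{<t>}_j}\, x^{<t>}_k
```

This is exactly the (j, k) entry of the outer product (∂J/∂z^<t>)(x^<t>)^⊤. The higher-order tensor only ever appears contracted against ∂J/∂z^<t>, so it never has to be materialized, and the "flattened" 2D result is all we need.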

Practice makes perfect :) Try to compute the other partial derivatives of interest. Here are the answers for you to check:

[Image: answers for the other partial derivatives]
Edited from DeepLearning.AI[2]

Inspired and encouraged to write by:

Hsi-Ming Chang and Raymond Kwok

Sources:

1. vecDerivs.pdf (stanford.edu)
2. DeepLearning.AI