Hello everyone! Today we are going to go through one of the most fundamental concepts in Deep Learning: Matrix Calculus. To fully grasp a concept and build intuition for it, it often helps to try building it from scratch. To do that in Deep Learning, you need a solid understanding of Matrix Calculus in order to implement the backpropagation algorithm for complex model architectures or operations.
The aim of this post is to show you how to compute the derivative of a matrix of functions with respect to a matrix of inputs. Normally this results in a 4-dimensional tensor, but there is a trick to get around it. We will use the recursive portion of Backpropagation Through Time (BPTT) in Recurrent Neural Networks (RNNs) as an example to learn about this trick. For those who are curious, we will also include a proof of the trick at the end. Please note that I will mostly be using screenshots of LaTeX and hand-written notes for improved readability.
Here is an illustration of the forward propagation and backward propagation of the recursive cell in an RNN.

We will focus on the boxed area (recursive portion) for the purpose of this post. For those who prefer equations, here is the equation version of the above illustration.

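If the image above does not render, here is a minimal typeset sketch of the recursive cell, assuming the common a / W_aa / W_ax notation and a tanh activation (my reconstruction; the screenshot may differ in details such as the choice of activation):

```latex
a^{\langle t \rangle} = \tanh\left(W_{aa}\, a^{\langle t-1 \rangle} + W_{ax}\, x^{\langle t \rangle} + b_{a}\right)
```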
Let's now consider the derivative of the cost function of the Recurrent Neural Network with respect to the <t-1>th activation layer. The equation for this derivative is as follows:

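As a typeset fallback, and assuming the tanh cell sketched above, this derivative can be written with the chain rule as follows (again my reconstruction, not necessarily the exact form in the screenshot):

```latex
\frac{\partial J}{\partial a^{\langle t-1 \rangle}}
  = \left(\frac{\partial a^{\langle t \rangle}}{\partial a^{\langle t-1 \rangle}}\right)^{\top}
    \frac{\partial J}{\partial a^{\langle t \rangle}}
  = W_{aa}^{\top}\left[\frac{\partial J}{\partial a^{\langle t \rangle}}
    \odot \left(1 - \left(a^{\langle t \rangle}\right)^{2}\right)\right]
```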
The partial derivatives of the parameters we are interested in are:

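The list itself is an image; my assumption, based on the exercise at the end of the post, is that it covers the gradients of the cost with respect to each piece of the recursive cell, for example:

```latex
\frac{\partial J}{\partial a^{\langle t-1 \rangle}}, \qquad
\frac{\partial J}{\partial W_{aa}}, \qquad
\frac{\partial J}{\partial W_{ax}}, \qquad
\frac{\partial J}{\partial b_{a}}
```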
Here are some useful Matrix Calculus rules:


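The rules appear as a screenshot; the reconstruction below is my assumption of what they say, chosen to be consistent with the discussion that follows (J is a scalar cost and the circled dot is the element-wise product):

```latex
% Rule 1 (linear map, vector input): z = Wx + b
\frac{\partial J}{\partial x} = W^{\top}\frac{\partial J}{\partial z}, \qquad
\frac{\partial J}{\partial b} = \frac{\partial J}{\partial z}

% Rule 2 (element-wise function): a = g(z)
\frac{\partial J}{\partial z} = \frac{\partial J}{\partial a} \odot g'(z)

% Rule 3 (matrix-matrix product): Z = WX
\frac{\partial J}{\partial X} = W^{\top}\frac{\partial J}{\partial Z}, \qquad
\frac{\partial J}{\partial W} = \frac{\partial J}{\partial Z}\,X^{\top}
```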
Derivative rules 1 and 2 are pretty straightforward. If not, you can use the brute-force method: compute the functions explicitly and then take the derivatives element-wise to prove rules 1 and 2. A small sketch of this check follows.
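To make the brute-force check concrete, here is a small NumPy sketch that compares the analytic gradient from rules 1 and 2 (as reconstructed above) against centered finite differences; the cost J and all shapes here are made up purely for illustration:

```python
import numpy as np

np.random.seed(2)
n_out, n_in = 3, 4
W = np.random.randn(n_out, n_in)
b = np.random.randn(n_out)
x = np.random.randn(n_in)
g = np.random.randn(n_out)          # stand-in upstream gradient dJ/da

def J(x):
    # Scalar "cost" J = g . tanh(Wx + b), chosen so that dJ/da is exactly g
    return g @ np.tanh(W @ x + b)

# Analytic gradient using rules 1 and 2:
#   dJ/dz = g * (1 - tanh(z)^2)   (rule 2, element-wise)
#   dJ/dx = W.T @ dJ/dz           (rule 1, linear map)
z = W @ x + b
analytic = W.T @ (g * (1.0 - np.tanh(z) ** 2))

# Brute force: perturb each component of x and take a centered difference
eps = 1e-6
numeric = np.array([(J(x + eps * e) - J(x - eps * e)) / (2 * eps)
                    for e in np.eye(n_in)])

print(np.allclose(analytic, numeric, atol=1e-5))  # expected: True
```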
However, rule 3 is quite odd. It results in a 4D tensor, yet we can flatten it into a 2D matrix. We'll explain the consequences of using this simplification soon. Again, if you would like to see the proof of rule 3, wait until the end.
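Assuming rule 3 is the matrix-product rule Z = WX, the following NumPy sketch builds the full 4D Jacobian explicitly and shows that contracting it with an upstream gradient gives exactly the flattened 2D result W.T @ dJ/dZ:

```python
import numpy as np

np.random.seed(1)
m, n, p = 2, 3, 4               # W is (m, n), X is (n, p), Z = W @ X is (m, p)
W = np.random.randn(m, n)
X = np.random.randn(n, p)
dJ_dZ = np.random.randn(m, p)   # stand-in upstream gradient

# Full Jacobian dZ/dX is a 4D tensor: dZ[i, j] / dX[k, l] = W[i, k] if j == l else 0
J4 = np.zeros((m, p, n, p))
for i in range(m):
    for j in range(p):
        for k in range(n):
            J4[i, j, k, j] = W[i, k]

# Chain rule with the full tensor: dJ/dX[k, l] = sum_{i, j} dJ/dZ[i, j] * dZ[i, j]/dX[k, l]
dJ_dX_tensor = np.einsum('ij,ijkl->kl', dJ_dZ, J4)

# The "flattened" 2D shortcut: dJ/dX = W.T @ dJ/dZ
dJ_dX_flat = W.T @ dJ_dZ

print(np.allclose(dJ_dX_tensor, dJ_dX_flat))  # expected: True
```

In other words, the 4D tensor never has to be materialized; the chain-rule contraction collapses it into an ordinary matrix product.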











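Putting the rules together, here is a minimal NumPy sketch of the boxed recursive step and its backward pass, assuming the tanh cell from earlier; the function names (rnn_cell_forward, rnn_cell_backward) are mine, not from any particular library:

```python
import numpy as np

def rnn_cell_forward(a_prev, x_t, W_aa, W_ax, b_a):
    """Forward pass of the boxed (recursive) portion.

    Shapes: a_prev (n_a, m), x_t (n_x, m), W_aa (n_a, n_a), W_ax (n_a, n_x), b_a (n_a, 1),
    where m is the number of examples stored as columns.
    """
    z = W_aa @ a_prev + W_ax @ x_t + b_a
    a_t = np.tanh(z)
    cache = (a_prev, x_t, a_t)
    return a_t, cache

def rnn_cell_backward(dJ_da_t, cache, W_aa):
    """Backward pass given the upstream gradient dJ/da<t> of shape (n_a, m)."""
    a_prev, x_t, a_t = cache
    # Rule 2: tanh is element-wise, so dJ/dz = dJ/da<t> * (1 - a<t>^2)
    dJ_dz = dJ_da_t * (1.0 - a_t ** 2)
    # Rule 3 (the flattening trick): the 4D Jacobians collapse to matrix products
    dJ_da_prev = W_aa.T @ dJ_dz          # gradient flowing back through time
    dJ_dW_aa = dJ_dz @ a_prev.T
    dJ_dW_ax = dJ_dz @ x_t.T
    # Bias gradient: sum dJ/dz over the batch dimension (columns)
    dJ_db_a = dJ_dz.sum(axis=1, keepdims=True)
    return dJ_da_prev, dJ_dW_aa, dJ_dW_ax, dJ_db_a

# Tiny usage example with made-up sizes
n_a, n_x, m = 4, 3, 5
W_aa, W_ax = np.random.randn(n_a, n_a), np.random.randn(n_a, n_x)
b_a = np.random.randn(n_a, 1)
a_prev, x_t = np.random.randn(n_a, m), np.random.randn(n_x, m)

a_t, cache = rnn_cell_forward(a_prev, x_t, W_aa, W_ax, b_a)
dJ_da_t = np.random.randn(n_a, m)        # pretend this came from later time steps
dJ_da_prev, dJ_dW_aa, dJ_dW_ax, dJ_db_a = rnn_cell_backward(dJ_da_t, cache, W_aa)
```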
That is all for how to get around a 4D tensor in matrix calculus. However, for those who are curious and want to go deeper into why rule 3 works, let's see how we can derive it.







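For readers who prefer typeset math to hand-written notes, here is a compact version of the derivation, again assuming rule 3 refers to the matrix product Z = WX:

```latex
% Z = WX, so Z_{ij} = \sum_k W_{ik} X_{kj}. The full Jacobian is a 4D tensor:
\frac{\partial Z_{ij}}{\partial X_{kl}} = W_{ik}\,\delta_{jl}

% Contracting it with the upstream gradient \partial J / \partial Z:
\frac{\partial J}{\partial X_{kl}}
  = \sum_{i,j} \frac{\partial J}{\partial Z_{ij}} \frac{\partial Z_{ij}}{\partial X_{kl}}
  = \sum_{i,j} \frac{\partial J}{\partial Z_{ij}} W_{ik}\,\delta_{jl}
  = \sum_{i} W_{ik} \frac{\partial J}{\partial Z_{il}}
  = \left(W^{\top}\frac{\partial J}{\partial Z}\right)_{kl}

% So the 4D tensor never needs to be stored: its contraction with
% \partial J / \partial Z is exactly the 2D matrix product W^{\top} (\partial J / \partial Z).
```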
Practice makes perfect :) Try to compute the other partial derivatives of interest. Here are the answers for you to check:

Inspired and encouraged to write by:
Hsi-Ming Chang and Raymond Kwok