In 2020, N-BEATS was the first deep-learning model to outperform statistical and hybrid models in time series forecasting.

Two years later, in 2022, a new model threw N-BEATS off its throne: Challu, Olivares, et al. published the deep-learning model N-HiTS. They addressed two shortcomings of N-BEATS on longer forecast horizons:

  • decreasing accuracy and
  • increasing computation.

N-HiTS stands for Neural Hierarchical Interpolation for Time Series Forecasting.

The model builds on N-BEATS and its idea of neural basis expansion. The neural basis expansion takes place in several blocks across layered stacks.

In this article, I will go through the architecture behind N-HiTS, particularly the differences to N-BEATS. But do not worry, the deep dive will be easy to understand. However, it is not enough to only understand how N-HiTS works. Thus, I will also show you how we can easily implement an N-HiTS model in Python and tune its hyperparameters.

If the core idea is the same, what is the difference between N-BEATS and N-HiTS?

The difference lies in how each model treats the input and output of each stack. The main idea of N-HiTS is to combine forecasts of different time scales.

For this, N-HiTS applies

  • a multi-rate data sampling of the input and
  • a hierarchical interpolation of the output.

With this, N-HiTS achieves better accuracy on longer horizons at a lower computational cost.

The multi-rate data sampling forces stacks to specialize in short-term or long-term effects. Hence, it becomes easier for these stacks to learn the respective components. The focus on long-term behavior results in improved long-horizon forecasting compared to N-BEATS.

The hierarchical interpolation allows each block to forecast on a different time scale. The model then interpolates each block's forecast to match the time scale of the final prediction. The resampling and interpolation reduce the number of learnable parameters. This results in a lighter model with shorter training times.

Now that we know what N-HiTS does differently, let's see how the architecture includes these changes.

How does N-HiTS work in detail?

The N-HiTS model has the following architecture:

Architecture of N-HiTS (Image taken from Challu, Olivares, et al.).

As we can see, there are many similarities to N-BEATS.

First, N-HiTS splits the time series into a lookback and a forecast period. Second, the model consists of multi-layered stacks and blocks, generating a backcast and a forecast. In each block, a multi-layer perceptron produces basis expansion coefficients for the backcast and forecast. The backcast shows which part of the time series the block captured. Before we pass the time series into a block, we remove the backcast of the previous block. With this, each block learns a different pattern, as we only pass residuals from block to block. The model generates the final prediction as the sum of all blocks' forecasts.
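To make this residual flow concrete, here is a minimal PyTorch sketch. It is a toy stand-in, not the actual N-HiTS block (which uses a multi-layer perceptron plus pooling and interpolation); all names and sizes here are made up for illustration:

```python
import torch

def make_block(lookback: int, horizon: int):
    # Toy stand-in for a block: one linear layer that outputs
    # both a backcast (lookback reconstruction) and a forecast.
    net = torch.nn.Linear(lookback, lookback + horizon)
    def block(x):
        out = net(x)
        return out[..., :lookback], out[..., lookback:]
    return block

lookback, horizon = 168, 24
blocks = [make_block(lookback, horizon) for _ in range(3)]

residual = torch.randn(1, lookback)   # the lookback window
forecast = torch.zeros(1, horizon)
for block in blocks:
    backcast, block_forecast = block(residual)
    residual = residual - backcast        # each block only sees what is left over
    forecast = forecast + block_forecast  # final prediction: sum of all block forecasts
```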

As for the similarities, I will keep it at this level of detail. For more information, I point you to my N-BEATS article.

But let's dive deeper into the differences: multi-rate data sampling and hierarchical interpolation.

Multi-rate signal sampling of the input

N-HiTS performs the multi-rate sampling at the block level through a MaxPool layer.

The MaxPool layer smooths the input by taking the largest value within a chosen kernel size. Hence, the kernel size determines the rate of sampling. The larger the kernel size, the more aggressive the smoothing will be.

The larger the kernel size of the MaxPool layer, the stronger the smoothing of the input signal. Hence, a large kernel size emphasizes long-term effects. (Image by the author)

We define the kernel size of the MaxPool layer at the stack level. Hence, each block within the same stack has the same kernel size.

For the resampling, N-HiTS uses a top-down approach. The first stacks focus on long-term effects through a larger kernel size, while the later stacks focus on short-term effects through a smaller kernel size.
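Here is a minimal PyTorch sketch of the effect (illustrative only, not the library's internal code; I assume the stride equals the kernel size, so the pooling also downsamples the input):

```python
import torch

# One week of hourly data: (batch, channels, time) = (1, 1, 168).
x = torch.randn(1, 1, 168)

# A larger kernel smooths more aggressively and leaves only long-term structure.
pool_long = torch.nn.MaxPool1d(kernel_size=8, stride=8)   # earlier stacks: long-term
pool_short = torch.nn.MaxPool1d(kernel_size=2, stride=2)  # later stacks: short-term

print(pool_long(x).shape)   # torch.Size([1, 1, 21])
print(pool_short(x).shape)  # torch.Size([1, 1, 84])
```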

Hierarchical Interpolation of the output

N-HiTS uses hierarchical interpolation to reduce the number of predictions of each stack, i.e., the cardinality. The smaller cardinality results in reduced computing requirements for long-horizon forecasts.

What does this mean?

Assume we want to predict the next 24 hours of a time series. We expect our model to output 24 predictions (one for each hour). If we want to predict the next two weeks of hourly data, we need 336 predictions (14 * 24). That makes sense, right?

But this is where it becomes problematic. Let's take the N-BEATS model. The final forecast is a combination of the partial forecasts of each stack. Hence, each stack must predict 336 values, which is computationally expensive. N-BEATS is not the only model that struggles with longer forecast horizons; other deep learning approaches, such as Transformers or Recurrent Neural Networks, face the same problem.

N-HiTS overcomes this challenge by letting each stack make predictions at different time scales. N-HiTS then matches the time scales of each stack to the final output using interpolation.

For this, N-HiTS uses the concept of an expressiveness ratio. The ratio determines the number of predictions in the forecast horizon. A small expressiveness ratio means that the stack makes fewer predictions; hence, the stack has a small cardinality. For example, with an expressiveness ratio of 1/2, the stack predicts every second value of the final forecast.
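As a minimal sketch of the idea, using PyTorch's F.interpolate (the numbers are made up):

```python
import torch
import torch.nn.functional as F

horizon = 10
# A stack with an expressiveness ratio of 1/2 predicts only 5 of the
# 10 values in the forecast horizon (shape: batch, channels, time).
coarse = torch.tensor([[[10.0, 12.0, 11.0, 14.0, 13.0]]])

# Linear interpolation stretches the coarse forecast to the full horizon.
fine = F.interpolate(coarse, size=horizon, mode="linear", align_corners=True)
print(fine.squeeze())  # 10 values, with intermediate points filled in linearly
```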

The expressiveness ratio thus complements the multi-rate sampling of the input: combined, they let each stack work at a different frequency. Hence, each stack can specialize in treating the time series at a different rate.

The authors of N-HiTS suggest that stacks close to the input should focus on the long-term effects. Hence, these stacks should have smaller expressiveness ratios. For example, we could have three stacks. The first stack specializes in weekly behavior, the second in daily behavior, and the third in hourly behavior.

Each stack makes predictions at a different time scale. Thus, each stack has a different cardinality. Here, stack 1 has a cardinality of 3, stack 2 a cardinality of 5, and stack 3 a cardinality of 10. The final forecast is then the sum of each stack's forecast. In comparison, in the N-BEATS model, each stack would have the same cardinality, i.e., a cardinality of 10. (Image by the author)

But what is a reasonable choice for the expressiveness ratio?

It depends on the time series. The authors recommend two options:

  • use exponentially increasing expressiveness ratios between stacks to reduce the number of parameters while handling a wide range of frequencies
  • use known cycles of the time series, such as daily, weekly, etc.

Forecasting example using N-HiTS

Now that we know how N-HiTS works, let's apply the model to a forecasting task.

As in my N-BEATS article, we will predict the next two weeks of wholesale electricity prices in Germany. We take the "European Wholesale Electricity Price" data, which Ember provides under a CC-BY-4.0 license. We will use the N-HiTS implementation from Nixtla's neuralforecast library.

Wholesale electricity prices in Germany (Image by the author, data by Ember).

Without doing a detailed data exploration, we can see two seasonal components:

  • daily: Prices are higher in the morning and evening hours, as electricity consumption is usually higher during these hours.
  • weekly: Prices are higher on weekdays than on weekends, as electricity consumption is usually higher during weekdays.

As I used the same dataset in my N-BEATS article, we can reuse all the code for the data preparation, the train-test split, the plotting of results, and the baseline model. Hence, I am not going to show those code snippets here.

Before we jump into the code, please note that I am not trying to get the most accurate forecast possible but rather to show how we can apply N-HiTS.

Baseline Model

Let's start with a simple model as our baseline.

I will use the same seasonal naïve model that I used in my N-BEATS article. Hence, I will not go into much detail and only show the results.

Using the last week of data in the training set as our forecast results in an MAE of 17.84, which is quite good already.

Forecast of the Seasonal Naive baseline model (Image by the author).

Training the N-HiTS model

Let's train our first N-HiTS model. Because we use Nixtla's neuralforecast library, the implementation is straightforward. We initialize our N-HiTS model, defining our forecast and lookback periods. In this case, I use a lookback period of one week.

Then, we have some customization options. We can customize

  • the model by choosing the number of stacks and blocks, size of the MLP layers, activation function, kernel size for the MaxPooling, pooling type, etc.
  • the training by choosing the loss function, learning rate, batch size, etc.
  • the scaling of our input data.

See Nixtla's documentation for a full description.

In contrast to the N-BEATS model, we have more parameters to customize our model. We can customize the multi-rate data sampling by choosing the kernel size and pooling type, and the hierarchical interpolation through the interpolation type and expressiveness ratio.
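A sketch of such a customized initialization, assuming Nixtla's NHITS interface (the hyperparameter values are illustrative, not the exact configuration behind the results below):

```python
from neuralforecast.models import NHITS
from neuralforecast.losses.pytorch import MAE

horizon = 336   # two weeks of hourly data
lookback = 168  # one week of hourly data

model = NHITS(
    h=horizon,
    input_size=lookback,
    loss=MAE(),
    # multi-rate sampling: one MaxPool kernel size per stack, long-term to short-term
    n_pool_kernel_size=[16, 8, 1],
    # hierarchical interpolation: per-stack downsampling of the output
    # (roughly weekly, daily, and hourly specialization for hourly data)
    n_freq_downsample=[168, 24, 1],
    pooling_mode="MaxPool1d",
    interpolation_mode="linear",
    scaler_type="standard",
    max_steps=1000,
)
```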

After we have initialized our model, we wrap it with the NeuralForecast class and fit the model. If you have read my N-BEATS article, you should be familiar with these steps.
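A minimal sketch of these steps, assuming a training DataFrame Y_train_df in Nixtla's long format (columns unique_id, ds, and y):

```python
from neuralforecast import NeuralForecast

# Y_train_df is assumed to hold the hourly training data in Nixtla's
# long format: one row per (unique_id, ds) with the target column y.
nf = NeuralForecast(models=[model], freq="H")
nf.fit(df=Y_train_df)

# 336 hourly predictions per series, i.e., the next two weeks.
Y_hat_df = nf.predict()
```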

The results look better than our baseline: the MAE goes down from 17.84 to 17.01.

Forecast results with the N-HiTS model (Image by the author).

Tuning the hyperparameters of the N-HiTS model

Instead of playing around to find good hyperparameters, we can run an optimization.

It is not complicated, and we do not need to add many lines of code. We only need to replace the NHITS model with Nixtla's AutoNHITS model, which does the hyperparameter tuning for us. Then we choose the backend (ray or optuna) and the search space of our hyperparameters.

These two choices are the only difference compared to running the NHITS model. All other steps stay the same.

Let's start by choosing Optuna and using a custom config.
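A sketch of what this can look like, assuming neuralforecast's AutoNHITS with the Optuna backend (the search space below is an illustrative example, not my exact config):

```python
from neuralforecast import NeuralForecast
from neuralforecast.auto import AutoNHITS

# With the optuna backend, the config is a function that receives an
# Optuna trial and returns the hyperparameters to evaluate.
def config_nhits(trial):
    return {
        "input_size": trial.suggest_categorical("input_size", [168, 336]),
        "learning_rate": trial.suggest_float("learning_rate", 1e-4, 1e-1, log=True),
        "scaler_type": trial.suggest_categorical("scaler_type", ["standard", "robust"]),
        "max_steps": 1000,
    }

model = AutoNHITS(h=336, config=config_nhits, backend="optuna", num_samples=20)
nf = NeuralForecast(models=[model], freq="H")
nf.fit(df=Y_train_df)  # Y_train_df as before (unique_id, ds, y)
```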

We see that the "optimized" N-HiTS has worse accuracy (MAE of 22.63) than both the baseline and our hand-tuned N-HiTS model.

Forecast results of the AutoNHITS model. The results show the best model of the hyperparameter tuning experiment. (Image by the author)

Perhaps my choice of the search space was not good. We could try running more trials or a different search space to get better results. Or we could use the default config of AutoNHITS: we can either use it directly by not passing a config to the model, or make small changes to the default config.
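A sketch of both options (get_default_config is an assumption on my side; it exists in recent neuralforecast versions, but check the documentation for yours):

```python
from neuralforecast.auto import AutoNHITS

# Option 1: pass no config and let AutoNHITS use its default search space.
model = AutoNHITS(h=336, num_samples=20)

# Option 2: start from the default search space and make small changes
# (assumes a neuralforecast version that provides get_default_config).
config = AutoNHITS.get_default_config(h=336, backend="ray")
config["max_steps"] = 500  # e.g., shorten training during the search
model = AutoNHITS(h=336, config=config, backend="ray", num_samples=20)
```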

N-HiTS with exogenous variables

Before I finish this article, I want to show you one last thing: we can also use exogenous variables in the N-HiTS model. For this, we only need to pass their column names to the NHITS model as futr_exog_list during initialization. For example, we could pass the day of the week to the model, as there is a weekly seasonality.
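A sketch of how this can look, continuing with the Y_train_df from above (make_future_dataframe is an assumption; it exists in recent neuralforecast versions, otherwise build the future DataFrame manually):

```python
from neuralforecast import NeuralForecast
from neuralforecast.models import NHITS

# Derive the day of the week from the (datetime) ds column.
Y_train_df["day_of_week"] = Y_train_df["ds"].dt.dayofweek

model = NHITS(h=336, input_size=168, futr_exog_list=["day_of_week"])
nf = NeuralForecast(models=[model], freq="H")
nf.fit(df=Y_train_df)

# At prediction time, we must supply the future values of the exogenous variable.
futr_df = nf.make_future_dataframe()
futr_df["day_of_week"] = futr_df["ds"].dt.dayofweek
Y_hat_df = nf.predict(futr_df=futr_df)
```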

Adding the day of the week as an exogenous variable resulted in an MAE of 21.62. Trying different exogenous variables or different hyperparameters could improve the accuracy.

Forecast using the N-HiTS model with exogenous variables (Image by the author).

A final note on N-HiTS

On paper, N-HiTS has shown very good performance on a wide range of data sets. However, that does not mean N-HiTS will work best for all problems and data sets, particularly yours.

We saw in the examples that N-HiTS could barely beat the simple seasonal naïve baseline model, and it took me time to get there. First, I spent more time setting up the model and finding a good set of hyperparameters. Second, training took over 30 times as long as for the baseline model.

So, if this had been a company project, I would have chosen the baseline model. Although N-HiTS adds a small accuracy gain, the added complexity is not worth the trouble.

Hence, although N-HiTS is easy to use and seems like a promising model, do not start a project with it. Start with a simple baseline model. Based on your baseline, you can decide whether N-HiTS is a good choice for your problem, i.e., whether it adds enough value to justify the added complexity.

Conclusion

This article has been very long, but there was a lot to cover. If you stayed until here, you should now

  • have a very good understanding of how the N-HiTS model works,
  • know what N-HiTS does differently compared to N-BEATS,
  • be able to use the N-HiTS model in practice, and
  • be able to change the model's inner workings during your hyperparameter tuning.

If you want to dive deeper into the N-HiTS model, check out the N-HiTS paper. Otherwise, leave a comment and/or see you in my next article.