Foundation models have ignited the application of LLMs to time series.

In the past few months, we saw the release of novel forecasting models like TimesFM, TimeGPT, and of course Salesforce's MOIRAI.

Foundation time-series models will have a great impact on practical applications. Time series are ubiquitous and used in many domains like retail, energy demand, economics, healthcare, and more. A foundation TS model can be readily applied to any time-series use case with great accuracy, much as GPT-4 is for text.

This article explores MOMENT[1], the latest foundation time-series model.

What differentiates MOMENT from the aforementioned models is its general-purpose nature: it can handle forecasting, classification, anomaly detection, and imputation tasks. Also, it's open-source!

This article will describe how MOMENT works, its architecture, and how it performs compared to other SOTA time-series models.

Let's get started.

I've launched AI Horizon Forecast, a newsletter focusing on time-series and innovative AI research. Subscribe here to broaden your horizons!

What is MOMENT

MOMENT is a family of T5-based foundation models (up to 385M parameters), pretrained and repurposed for forecasting, classification, anomaly detection, and imputation tasks.

Key characteristics of MOMENT:

  • Open-source: The model, including its pretraining dataset (named the Time-Series Pile by the authors), will be open-sourced.
  • LLM-based: The authors use T5 as the base model, repurposing it for 5 time-series analysis tasks.
  • Lightweight execution: MOMENT was designed to run with limited resources and time, making it suitable for fast execution.
  • Zero-shot forecasting: MOMENT specializes in zero-shot scenarios, but it can optionally be fine-tuned for enhanced performance.

The following sections delve into MOMENT's architecture and analyze its benchmark results. First, we'll discuss pretraining, and then fine-tuning.

Note 1: The model code has been released on Anonymous GitHub; the authors will also release the model weights and the training dataset.

Note 2: T5 (Text-to-Text Transfer Transformer) is Google's follow-up model on BERT that explores how different sequence-to-sequence tasks can be learned by a single model.

MOMENT Architecture

MOMENT faces a challenge: the model needs to handle 5 different time-series tasks exceptionally well.

To do this, the authors adopted a modular architecture. They also leveraged patching, a technique embraced by other time-series models, notably TimesFM and MOIRAI.

With patching, a sub-sequence of timepoints (a patch) is treated as a token, instead of each individual timepoint. This technique significantly improves inference speed and avoids constraining the model to a specific prediction length.
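To make this concrete, here is a minimal NumPy sketch of patching. The function name and sizes are illustrative, not MOMENT's actual code:

```python
import numpy as np

# Cut a univariate series of length T into N = T // P disjoint patches of length P.
# Each patch, rather than each timepoint, then becomes one input token.
def patchify(series: np.ndarray, patch_len: int) -> np.ndarray:
    n_patches = len(series) // patch_len
    return series[: n_patches * patch_len].reshape(n_patches, patch_len)

series = np.random.randn(512)  # T = 512 timepoints
patches = patchify(series, 8)  # shape (64, 8): 64 tokens instead of 512
```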

Figure 1 shows MOMENT's architecture:

Figure 1: Top-level architecture of MOMENT during pretraining (Source, Annotated)

In short, the pretraining process resembles how BERT is trained (masked language modeling). Parts of the input time series are randomly masked and the model is trained to optimally reconstruct them. The steps are as follows:

  • The model receives a univariate time series of length T and a masking vector with the same length.
  • The timepoints are normalized and then patchified into N disjoint patches of length P.
  • The masking vector is randomly initialized with 0s and 1s: 0s mark the patches treated as masked/unobserved, and 1s mark those considered observed.
  • The observed patches are mapped via a linear projection layer to a patch embedding of size D.
  • The unobserved patches are replaced with a special [MASK] embedding. Hence, we have a total of N patch embeddings of size D.
  • Each patch embedding is fed to the Transformer encoder and produces a transformed patch embedding of the same size.
  • Each transformed patch embedding is then fed to a reconstruction head, which reconstructs the original patch of timepoints.
  • Pretraining aims to minimize the masked reconstruction error: the MSE between the predicted and ground-truth masked patches.

A few extra notes:

  • The Transformer layer consists of T5-encoder blocks, receiving as input a patch embedding of size D and producing a transformed patch embedding of size D (Figure 1).
  • The reconstruction head is not a full decoder, just a lightweight linear layer: it receives a transformed patch embedding of size D as input and produces a patch of P timepoints.
  • The Transformer layer ingests both masked and unmasked embeddings, but the MSE is calculated on the masked ones only.
  • Interestingly, the [MASK] embedding is learnable.
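
Putting the steps and notes together, here is a minimal PyTorch sketch of one pretraining step. All sizes and module choices are illustrative assumptions (for instance, a generic Transformer encoder stands in for the T5 blocks), not the authors' implementation:

```python
import torch
import torch.nn as nn

P, D = 8, 512                                  # patch length and embedding size
patch_proj = nn.Linear(P, D)                   # linear projection for observed patches
mask_embedding = nn.Parameter(torch.randn(D))  # learnable [MASK] embedding
encoder = nn.TransformerEncoder(               # stand-in for the T5 encoder blocks
    nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True), num_layers=2
)
recon_head = nn.Linear(D, P)                   # lightweight linear reconstruction head

def pretrain_step(patches: torch.Tensor, observed: torch.Tensor) -> torch.Tensor:
    # patches: (batch, N, P); observed: (batch, N) with 1 = observed, 0 = masked
    emb = patch_proj(patches)                              # (batch, N, D)
    emb = torch.where(observed.bool().unsqueeze(-1),       # masked patches are replaced
                      emb, mask_embedding.expand_as(emb))  # by the [MASK] embedding
    recon = recon_head(encoder(emb))                       # (batch, N, P)
    masked = ~observed.bool()
    return ((recon - patches)[masked] ** 2).mean()         # MSE on masked patches only

loss = pretrain_step(torch.randn(4, 64, P), (torch.rand(4, 64) > 0.3).float())
loss.backward()
```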

That's it! The authors created 3 pretrained MOMENT models based on T5-Small, T5-Base, and T5-Large, with 40M, 125M, and 385M parameters respectively.

Note: Feel free to check the AI Projects folder for new tutorials on time-series forecasting!

Pretraining and Fine-Tuning Strategy

We can use MOMENT as a zero-shot predictor or further fine-tune it to achieve better performance.

The authors compiled a public dataset (the Time-Series Pile), which consists of a large collection of existing time-series datasets.

First, the authors split the time series of each dataset into train, validation, and test sets. To prevent data leakage, MOMENT was carefully pretrained only on the training splits of each dataset.

Similarly, only the training and validation parts are used later for fine-tuning. Evaluation is conducted on the test splits, which MOMENT has not seen during either pretraining or fine-tuning.

You can see the full details of the Time-Series Pile at the end of this article (Appendix A).

Fine-Tuning on Downstream Tasks

Depending on the task, we can apply the following modifications to MOMENT:

  • Forecasting tasks: In this case, we swap the reconstruction head for a prediction head that flattens the N D-dimensional patch embeddings into an NxD-dimensional vector, which is then projected to an H-dimensional time series via a linear projection layer (where H is the prediction length). A minimal sketch of this head follows the list below.
  • All the other tasks: Here, the model retains its reconstruction head.
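
To make the forecasting modification concrete, here is a minimal PyTorch sketch of such a head. All sizes are illustrative assumptions, not MOMENT's actual configuration:

```python
import torch
import torch.nn as nn

# Hypothetical sizes: N patches, embeddings of size D, prediction horizon H.
N, D, H = 64, 512, 96

forecast_head = nn.Sequential(
    nn.Flatten(start_dim=1),  # (batch, N, D) -> (batch, N*D)
    nn.Linear(N * D, H),      # (batch, N*D)  -> (batch, H)
)

embeddings = torch.randn(4, N, D)     # transformed patch embeddings from the encoder
forecast = forecast_head(embeddings)  # torch.Size([4, 96]): one H-step forecast per series
```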

MOMENT has 2 inference modes:

  1. Zero-shot (MOMENT-0): For every task, we use the model as-is and start making predictions.
  2. Linear Probing (MOMENT-LP):
  • Forecasting tasks: We freeze all the layers except the forecasting head (which is trained for a few epochs).
  • All the other tasks: Again, we freeze every layer except the reconstruction head (which is trained for a few epochs); see the sketch after this list.
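
A minimal sketch of linear probing, using stand-in modules (the module names and sizes are illustrative, not the authors' code): the backbone is frozen, and only the small task head receives gradient updates.

```python
import torch
import torch.nn as nn

# Stand-ins for MOMENT's encoder and task head (illustrative sizes).
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True), num_layers=2
)
head = nn.Linear(512, 8)  # forecasting or reconstruction head

# Freeze every encoder parameter; only the head is trained for a few epochs.
for param in encoder.parameters():
    param.requires_grad = False

optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
```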

Note: We can also fine-tune MOMENT end-to-end on a particular dataset; however, this was beyond the scope of the paper.

Table 1 below presents the fine-tuning method, dataset, and metric used for each task when fine-tuning MOMENT:

Table 1: The evaluation methods, datasets, and metrics for each task MOMENT was benchmarked on (Source)

For the classification task, the authors made an adjustment to ensure compatibility with previous works: the model produces a latent vector for each time series, which is then used to train an ML classifier, with the latent vector as the input feature and the class label as the target variable.
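
A minimal sketch of this setup is shown below. The embed() stand-in and the choice of an SVM classifier are illustrative assumptions, not necessarily the authors' exact pipeline:

```python
import numpy as np
from sklearn.svm import SVC

def embed(series_batch: np.ndarray) -> np.ndarray:
    # Stand-in for MOMENT's encoder: one fixed-size latent vector per series.
    return np.random.randn(len(series_batch), 512)

X_train = embed(np.random.randn(100, 128))   # 100 series of length 128
y_train = np.random.randint(0, 3, size=100)  # 3 illustrative class labels
clf = SVC().fit(X_train, y_train)            # latent vectors in, labels out
preds = clf.predict(embed(np.random.randn(10, 128)))
```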

Also, notice the term 'subset' next to some datasets: as mentioned earlier, the authors evaluated the models only on the test splits of certain datasets to prevent data leakage.

In the next section, we will analyze MOMENT's performance in the benchmarks compared to other models.

Evaluation Benchmarks

The authors evaluated MOMENT on 5 time-series tasks:

  • Short-term forecasting
  • Long-term forecasting
  • Classification
  • Imputation
  • Anomaly detection

Let's start with the forecasting evaluation.

Long-Term Forecasting Benchmark

Here, the authors benchmarked MOMENT-LP against popular foundation models (Time-LLM, GPT4TS), DL models (TimesNet, N-BEATS, DLinear), and Transformer models (PatchTST, FEDformer). MAE and MSE were used as metrics:

Table 2: MSE and MAE of various time-series models in the long-term forecasting benchmark (Source)

We notice the following:

  • The last layer of MOMENT was fine-tuned here (MOMENT-LP), while the other models were fully fine-tuned.
  • MOMENT showed promising results and outperformed Time-LLM and TimesNet. However, PatchTST achieved the top spot.
  • Interestingly, GPT4TS outperformed MOMENT-LP on some datasets.
  • Time-LLM was not benchmarked on some datasets due to hardware constraints (it could not fit on a single GPU).
  • It would be interesting to include the zero-shot version of MOMENT and a few additional statistical models in the benchmark.

Short-Term Forecasting Benchmark

Here, the authors fine-tune the models on a source dataset and then evaluate them on a target dataset. Specifically, they perform the following evaluations:

  • Fine-tune on M4 → evaluate on M3
  • Fine-tune on FRED → evaluate on M3
  • Fine-tune on FRED → evaluate on M4

The statistical models were individually fit on each dataset. The results are displayed in Table 3:

Table 3: Comparing different models using sMAPE. The statistical and baseline models were fit on each dataset, while the other models were evaluated in a 'transfer-learning' scenario (Source)

Notes regarding the results:

  • MOMENT-LP and GPT4TS achieved the best scores, followed by N-BEATS.
  • Interestingly, statistical models surpassed the DL models in some cases. This was because:
    • The authors reported their results on only 40% of the M3 and M4 datasets, meaning there wasn't enough data for the DL models to leverage scale and showcase their performance. The authors had to make this choice because the rest of the datasets were seen by MOMENT during pretraining.
    • Besides, as shown in [Makridakis et al.], DL models work better in short-term forecasting scenarios when they are ensembled or there is enough data.
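
For reference, Table 3's sMAPE metric can be sketched as follows; this is the common M-competition definition, though implementations vary in scaling and zero-handling:

```python
import numpy as np

def smape(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    # Symmetric MAPE in percent: 2|F - A| / (|A| + |F|), averaged over the horizon.
    return float(np.mean(2.0 * np.abs(y_pred - y_true) /
                         (np.abs(y_true) + np.abs(y_pred))) * 100)
```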

Classification Benchmark

Next, the authors evaluate MOMENT as a time-series classifier, in a zero-shot scenario.

They report the mean and median accuracies over 91 datasets of the UCR classification archive. The results are displayed in Table 4:

Table 4: The classification benchmark, reporting the mean and median accuracies over multiple datasets (Source)

  • MOMENT-0 surpassed the other foundation models.
  • However, some specialized models performed better.
  • MOMENT is a viable choice as a zero-shot model that can be readily used without training.

Imputation and Anomaly Detection Benchmarks

Finally, the authors evaluate how MOMENT performs in imputation and anomaly detection tasks.

  • For both tasks, the authors evaluate both MOMENT-0 and MOMENT-LP.
  • In imputation tasks, they assess reconstruction performance (a minimal sketch of masking-based imputation follows this list).
  • In anomaly detection tasks, they measure performance using the adjusted-F1 and VUS-ROC averaged over 44 time series from the UCR Anomaly Archive.
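
Here is a minimal sketch of how masking-based imputation works in this setting; the reconstruct() callable is a hypothetical stand-in for the model, not the authors' API:

```python
import numpy as np

def impute(series: np.ndarray, reconstruct) -> np.ndarray:
    missing = np.isnan(series)
    filled = np.where(missing, 0.0, series)  # placeholder values for the model input
    recon = reconstruct(filled, ~missing)    # the model knows which points are observed
    return np.where(missing, recon, series)  # keep observed points, fill only the gaps

series = np.array([1.0, np.nan, 3.0, 4.0, np.nan])
dummy = lambda x, obs: np.full_like(x, x.mean())  # stand-in reconstruction model
print(impute(series, dummy))                      # [1.  1.6 3.  4.  1.6]
```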

The results are shown in Table 5:

Table 5: Imputation and anomaly detection benchmarks. Here, the authors evaluate both zero-shot MOMENT and linear-probed MOMENT (Source)

Notes regarding the results:

  • In both tasks, MOMENT seems promising: MOMENT-0 is competitive, while MOMENT-LP achieves the top spot on some datasets.
  • Interestingly, in the anomaly detection benchmark, K-NN beats every model!
  • Sometimes, simpler methods are better!

Some Insights from MOMENT

Additionally, the authors wanted to explore the capabilities of language models as forecasters and how they scale with more data.

After all, the core architecture of MOMENT is T5, a language model.

Let's analyze the insights extracted from MOMENT and how these can be applied to future models.

A) MOMENT can solve sequential tasks across other modalities

The novel LLMTime work has shown that pretrained LLMs can be readily used as zero-shot forecasters.

[Lu et al.] showed that pretrained Transformers act as universal computation engines: language and vision Transformers can solve general tasks across modalities beyond the text and vision domains they were initially trained on.

Here, the authors explore whether a Transformer pretrained on time series can solve sequence classification tasks on image, text, and binary data. To evaluate this hypothesis, they benchmark MOMENT against Flan-T5 and GPT-2 on image and text datasets.

The results are summarized in Table 6:

Table 6: Testing the transfer-learning capabilities of MOMENT vs other models on different modalities. The evaluation metric is accuracy (Source)

It is evident that MOMENT, a Transformer pretrained on time series, demonstrates strong generalization capabilities across different tasks and modalities, comparable to other pretrained models.

B) MOMENT scales with more data

A key strength of Transformer models is their scaling ability: the capacity to leverage larger datasets and achieve improved performance.

The authors explore whether MOMENT exhibits similar scaling capabilities. Specifically, they measured the training losses of different model sizes. The results are illustrated in Figure 2:

Figure 2: Testing how MOMENT scales during pretraining with more parameters (Source)

  • The largest model (385M) achieved the best results.
  • All models kept improving with more data; it's quite likely that MOMENT could have achieved better results with additional data or training time.

C) The effect of initialization weights on performance

Let's say we are building a pretrained time-series model using a well-known language model (as MOMENT does with T5).

There are 2 approaches:

  1. Start from T5's pretrained weights and continue pretraining on time-series data.
  2. Initialize the model weights randomly and train T5 from scratch. (A minimal sketch of both initializations follows.)
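
In code, the two strategies differ only in how the weights are initialized. Here is a minimal sketch using Hugging Face transformers (the model name is illustrative; the paper's Figure 3 uses Flan-T5):

```python
from transformers import T5Config, T5EncoderModel

config = T5Config.from_pretrained("google/flan-t5-base")

# 1) Start from the language-modeling weights and continue pretraining on time series.
model_pretrained = T5EncoderModel.from_pretrained("google/flan-t5-base")

# 2) Same architecture, randomly initialized and trained from scratch.
model_random = T5EncoderModel(config)
```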

The authors explored this dilemma and found that the 2nd approach is better:

Figure 3: Training a Flan-T5 model initialized with its original weights vs. a Flan-T5 trained from random weights (Source)

To put it differently:

Pretraining a language model from scratch (with time-series data) results in a lower training loss than pretraining a model of similar size initialized with its original language-modeling weights.

Interestingly, Amazon's latest foundation model, Chronos, reached the same conclusion, probably without either team being aware of the other's work, given that the two models were released almost simultaneously.

Closing Remarks

MOMENT is an intriguing generalist foundation time-series model that builds upon the work of similar successful models like TimesNet and GPT4TS.

Compared to other foundation models like TimesFM, MOIRAI, and Chronos, however, MOMENT appears less impressive.

For example, MOMENT was trained on much less data than the aforementioned models. It didn't use synthetic data or data augmentations either, which are cheap and effective sources of training data. It is evident from Figure 2 that the model is undertrained and could have performed better with more data.

Also, the benchmarks would be more informative if they included all MOMENT model sizes, to provide a comprehensive understanding of how scale impacts performance.

Regardless, while forecasting isn't MOMENT's strongest point, it remains a viable approach for other tasks β€” especially in scenarios with limited time and computational resources. One ideal use case for MOMENT is in a multi-objective scenario where you are doing both forecasting and anomaly detection or imputation.

Other than that, MOMENT's lightweight nature makes it an excellent candidate for ensemble methods with other models to enhance overall performance!

Thank you for reading!

Appendix

Table 7 displays the datasets used to construct the Time-Series Pile. You can find more details in the original paper.

Table 7: The datasets of the Time-Series Pile (Source)

References

[1] Goswami et al., MOMENT: A Family of Open Time-Series Foundation Models (February 2024)