This year, I was fortunate enough to attend the International Conference on Learning Representations (ICLR) in Kigali, Rwanda. I had the opportunity to explore the latest research, connect with fellow professionals, and soak in the collective wisdom of the AI community.
Via this blog series, I'd like to share four things every Machine Learning Engineer should know. This first post discusses the general concept of representation learning, while the following three posts each tackle a paper I discovered at ICLR this year:
- What is representation learning?: A refresher or introduction, depending on your familiarity, to set the stage for the upcoming papers. https://medium.com/radix-ai-blog/representation-learning-breakthroughs-what-is-representation-learning-5dda2e2fed2e
- Token Merging: Your ViT but faster: We'll explore how this advancement makes more efficient use of the hidden representations found in Vision Transformers (ViT), making them substantially faster. https://medium.com/radix-ai-blog/representation-learning-breakthroughs-token-merging-your-vit-but-faster-e3f88f25d6d1
- Mind the pool: CNNs can overfit input size: A highlight of an under-recognized pitfall in Convolutional Neural Networks (CNNs), where they become biased by the input size of the images, and an approach to avoid it. https://medium.com/radix-ai-blog/representation-learning-breakthroughs-convolutional-neural-networks-can-overfit-input-size-2aba1cb94c01
- No reason for no supervision, improved generalization in supervised models: A showcase for exploiting representation learning to make more robust and general models that are trained on a supervised task. https://medium.com/radix-ai-blog/representation-learning-breakthroughs-improved-generalization-in-supervised-models-d60a43a7f354
Each post aims to bring the ICLR 2023 experience to you, providing both practical applications and food for thought as we navigate AI's exciting, ever-evolving landscape together. I hope you learn something new in each section of this blog post and that you find the research as interesting as I did. Let's dive in!
What is Representation Learning?
Representation learning is a method of training a machine learning model to automatically discover and learn the most useful representations of its input data. These representations, often known as "features", are internal states of the model that effectively summarize the input data and help the model better understand its underlying patterns.
Representation learning marks a significant shift from traditional, manual feature engineering, instead entrusting the model to automatically distill complex and abundant input data into simpler, meaningful forms. This approach particularly shines with intricate data types like images or text, where manually identifying relevant features becomes overwhelming. By autonomously identifying and encoding these patterns, the model simplifies the data while ensuring that the essential information is kept. In short, representation learning offers a way for machines to autonomously grasp and condense the information stored in large datasets, making the machine learning steps that follow more informed and efficient.
From Hand-Crafted to Automated: The Shift in Feature Engineering
In the early days of machine learning, feature engineering was akin to sculpting by hand. It required engineers to manually identify, extract, and craft features from raw data, a process heavily reliant on domain expertise and intuition. Imagine trying to predict car prices. Beyond the obvious features like make, model, and mileage, one would have to speculate: Does the color matter? Or the month of sale? The process was tedious and often restrictive, bounded by the limits of human insights.
Then came representation learning, a revolutionary approach that lets the model learn which features are the most informative. This can be done without a concrete task in mind ("what is a car?") or tailored to a specific task ("the price probably matters"). So, while traditional feature engineering laid the foundation, representation learning streamlines and deepens our exploration of data, ushering in a new era of efficiency and adaptability.
Self-Supervised Learning: Autoencoders and the Embedding Space
Self-supervised learning, a subset of unsupervised learning, is a powerful way to learn representations of data. Among the popular approaches in this category is the autoencoder. An autoencoder is a type of neural network that learns to encode input data into a lower-dimensional, and thus more compact, form. The network then uses this encoded form to reconstruct the original input. The encoding process discovers and extracts essential features in the data, while the decoding process ensures that the extracted features are representative of the original data.
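To make this concrete, here is a minimal sketch of an autoencoder in PyTorch. The layer sizes, the 32-dimensional embedding, and the flattened 28x28 inputs are arbitrary choices for illustration, not a prescription: the point is that the "label" is the input itself, and the reconstruction error is the training signal.

```python
import torch
from torch import nn

# A minimal autoencoder: the encoder compresses the input into a small
# embedding, and the decoder reconstructs the original input from it.
class AutoEncoder(nn.Module):
    def __init__(self, input_dim: int = 784, embedding_dim: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, embedding_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(embedding_dim, 128), nn.ReLU(),
            nn.Linear(128, input_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))

model = AutoEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# One training step: the target is the input itself (self-supervision).
x = torch.rand(64, 784)  # e.g. a batch of flattened 28x28 images
optimizer.zero_grad()
reconstruction = model(x)
loss = loss_fn(reconstruction, x)
loss.backward()
optimizer.step()
```

After training, the decoder is often discarded and only the encoder is kept, since the compact embeddings it produces are what downstream tasks actually use.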
An essential concept to grasp in self-supervised learning is the idea of an embedding space. This space represents the features or characteristics the autoencoder (or any other self-supervised model) has learned. In an effectively trained model, similar data instances will be close to each other in this space, forming clusters. For example, a model trained on a dataset of images might form distinct clusters for different categories of images, like birds, clothing, or food. The distance and direction between these clusters can often provide valuable insights into the relationships between different categories of data.
This embedding space can be leveraged in a variety of ways. For instance, it can be used for data exploration, anomaly detection, or as a pre-processing step for other machine learning tasks. The notion of creating an expressive, useful embedding space lies at the heart of self-supervised learning, and it's one of the reasons why this approach has become so prevalent in modern machine learning research and practice.
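As a small illustration, the sketch below embeds a dataset with a (stand-in, untrained) encoder and queries the embedding space with scikit-learn's NearestNeighbors. The same pattern underlies data exploration and simple anomaly detection: a query whose nearest neighbours are all far away is a candidate outlier. The encoder architecture and data here are placeholders; in practice you would reuse the trained encoder from the previous sketch.

```python
import torch
from torch import nn
from sklearn.neighbors import NearestNeighbors

# Stand-in encoder; in practice this is the trained encoder from above.
encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 32))

dataset = torch.rand(1000, 784)  # stand-in for the full (flattened) dataset
query = torch.rand(1, 784)       # a new sample we want to relate to the data

with torch.no_grad():
    embeddings = encoder(dataset).numpy()     # 1000 points in embedding space
    query_embedding = encoder(query).numpy()  # 1 query point

# Index the embedding space: the nearest neighbours of the query are the
# instances the model considers most similar. Large distances to all
# neighbours can flag potential anomalies.
index = NearestNeighbors(n_neighbors=5).fit(embeddings)
distances, neighbour_ids = index.kneighbors(query_embedding)
```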
Hidden Representations in Supervised Learning
While we've focused on self-supervised learning, it's important to note that the concept of representation learning extends to supervised models as well. A classic example is a deep Convolutional Neural Network (CNN) trained for image classification. As the network deepens, each successive layer learns increasingly abstract representations of the input data.
Consider this: in the early layers of a CNN, the network might represent simple features like edges and colors. As we move deeper, these elements merge into representations of more complex shapes and patterns. By the time we reach the final layers, the representations have abstracted so much that they can differentiate complex categories like different breeds of dogs or different types of vehicles. Essentially, the deeper the model, the more abstract (and often more useful) the representation.
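You can inspect these hidden representations yourself. The sketch below uses torchvision's feature-extraction utility on an ImageNet-pretrained ResNet-50 (a stand-in for any deep CNN) to pull out the feature maps after each residual stage; the node names and shapes printed are specific to this architecture.

```python
import torch
from torchvision.models import resnet50, ResNet50_Weights
from torchvision.models.feature_extraction import create_feature_extractor

# Grab the hidden representations after each residual stage of a ResNet-50.
model = resnet50(weights=ResNet50_Weights.DEFAULT).eval()
extractor = create_feature_extractor(
    model, return_nodes=["layer1", "layer2", "layer3", "layer4"]
)

with torch.no_grad():
    features = extractor(torch.rand(1, 3, 224, 224))  # dummy RGB image

for name, feature_map in features.items():
    # Spatial resolution shrinks and channel depth grows with depth:
    # early layers keep edge- and colour-like detail, later layers encode
    # increasingly abstract, class-relevant patterns.
    print(name, tuple(feature_map.shape))
```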
The Power of Self-Supervised Representations: A Look at DINO
The DINO (self-DIstillation with NO labels) model is a relatively recent advancement in self-supervised learning. It represents a key development in our understanding of the unique benefits that self-supervised learning can provide, especially when applied to Vision Transformers (ViTs).
In the DINO model, a Vision Transformer is trained in a self-supervised manner. What sets DINO apart is its ability to create highly expressive features that carry explicit information about the semantic segmentation of an image. This property does not emerge as clearly with supervised ViTs, nor with convolutional networks (ConvNets). Interestingly, these learned features also serve as excellent k-NN classifiers, reaching 78.3% top-1 accuracy on ImageNet even with a small Vision Transformer. This finding suggests that the representations learned via self-supervised methods like DINO can be more expressive and, potentially, more useful for classification tasks than those learned in a purely supervised fashion.
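To give a feel for how frozen DINO features can act as a k-NN classifier, here is a rough sketch using the publicly released DINO ViT-S/16 backbone from torch.hub. The support set, its labels, and the choice of cosine similarity with k=5 are illustrative assumptions, not the paper's exact evaluation protocol.

```python
import torch

# Load the small self-supervised DINO ViT backbone published by the authors
# (weights and hub code download on first use).
dino = torch.hub.load("facebookresearch/dino:main", "dino_vits16").eval()

# Hypothetical labelled support set and an unlabelled query image.
support_images = torch.rand(32, 3, 224, 224)
support_labels = torch.randint(0, 10, (32,))
query_image = torch.rand(1, 3, 224, 224)

with torch.no_grad():
    support_features = dino(support_images)  # (32, 384) frozen embeddings
    query_features = dino(query_image)       # (1, 384)

# A simple k-NN "classifier" directly on the frozen features: cosine
# similarity to the support set, then a majority vote over the k closest.
similarity = torch.nn.functional.cosine_similarity(
    query_features, support_features
)
top_k = similarity.topk(k=5).indices
prediction = support_labels[top_k].mode().values
```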
To illustrate this, consider the side-by-side comparison in the image below, which shows the learned representations of a self-supervised model like DINO next to those of a supervised classification ViT. The representations learned by DINO are more likely to differentiate complex categories effectively due to their expressive nature, thereby enhancing the classification results. This kind of synergy between self-supervised learning and classification tasks shows the growing importance and potential of representation learning in AI and Machine Learning.
Understanding the intricacies of representation learning, its diverse forms, and its applications can empower Machine Learning Engineers to design more robust and powerful models. With the advent of self-supervised learning models such as DINO, it's clear that we are uncovering more about the true potential of these learned representations and how they contribute to making the final results better.
Transfer Learning: Building on the Pillars of Self-Supervision
In the current AI landscape, many AI models are not built from the ground up. Instead, they kickstart their performance by relying on a model (e.g., a foundation model) that has been pre-trained in a self-supervised manner, and are then fine-tuned on a task-specific objective. There is a good reason for this approach: as the previous section shows, these self-supervised models learn powerful representations that are relevant to a wide variety of tasks.
At the heart of self-supervised models is their uncanny ability to discover meaningful representations from raw data, without the need for explicit task-oriented labels. Shaped by the numerous patterns and connections within the data, these representations form a broad, general description of it, making them applicable to a wide array of tasks. Such multi-purpose representations allow for a meaningful transfer to other tasks; this, in essence, is transfer learning. The new task-specific model builds on previously trained representations, which can come from a self-supervised or another supervised training process.
There are two main benefits to transfer learning. Firstly, there is a significant reduction in the volume of data required: you no longer depend on large labeled datasets to achieve robust performance. Secondly, it leads to a noticeable reduction in computational cost, since the foundation model has already done the bulk of the heavy lifting and your task-specific model simply refines and specializes from there.
A strategy many adopt involves using a "frozen backbone". In this approach, the representations formed during self-supervised learning are kept intact, and only the task-specific head of the model is trained. This method is efficient and computationally less demanding. However, it's not a one-size-fits-all solution: depending on the task, some learned representations might not perfectly align with what's needed. In such cases, allowing a portion of the backbone to be retrained can be beneficial. It ensures the model captures the nuances of the task-specific data and allows the model to forget representations that are not relevant. This adaptation, however, requires a careful balance, factoring in computational costs and the quality and quantity of available data.
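Below is a minimal sketch of the frozen-backbone recipe in PyTorch, using an ImageNet-pretrained ResNet-50 as a stand-in backbone (a self-supervised backbone such as DINO would slot in the same way). The 5-class head and the data are hypothetical.

```python
import torch
from torch import nn
from torchvision.models import resnet50, ResNet50_Weights

# Frozen-backbone transfer learning: keep the pre-trained representations
# intact and only train a small task-specific head on top of them.
backbone = resnet50(weights=ResNet50_Weights.DEFAULT)
backbone.fc = nn.Identity()          # drop the original classification head
for param in backbone.parameters():
    param.requires_grad = False      # freeze the backbone's representations
backbone.eval()

head = nn.Linear(2048, 5)            # hypothetical 5-class downstream task

# Only the head's parameters are handed to the optimizer.
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

images = torch.rand(8, 3, 224, 224)  # stand-in batch of task-specific data
labels = torch.randint(0, 5, (8,))

optimizer.zero_grad()
with torch.no_grad():                # no gradients flow through the backbone
    features = backbone(images)      # (8, 2048) frozen representations
loss = loss_fn(head(features), labels)
loss.backward()
optimizer.step()
```

If the task calls for partial retraining of the backbone, you would simply leave `requires_grad = True` on the last few stages and include those parameters in the optimizer, at the cost of more compute and a greater need for task-specific data.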
In essence, the interplay between self-supervised learning and transfer learning is reshaping the blueprint of modern AI models. By leveraging the foundational knowledge from self-supervised models, we're streamlining the process, conserving resources, and ensuring models are as tuned and task-ready as they can be.
Challenges in Representation Learning: A Balanced View
As we marvel at the innovations that representation learning brings, it's crucial to keep a balanced perspective. Like any methodology used in machine learning, representation learning has its challenges and limitations.
- Risk of overfitting: Representation learning, especially when it involves deep neural networks, can sometimes capture noise in the data as features. In trying to identify intricate patterns, models might end up 'memorizing' the training data. This over-specification leads to poor generalization on unseen data. While models like DINO and other self-supervised approaches offer ways to combat this, the problem remains a concern.
- Interpreting Learned Representations: One significant benefit of manual feature engineering was its transparency: engineers had a grasp of the features used and their influence on the final model. In stark contrast, learned representations make the model an even darker "black box". These representations, while powerful, can be complex, abstract, and not immediately interpretable, which poses challenges in applications where understanding and trust are paramount.
- Computational Intensity: Despite the efficiency gains in using representations through concepts like transfer learning, the initial process of training models to learn these representations can be resource-intensive. Deep neural networks demand vast computational power, which might be a constraint for small organizations or individual researchers.
- Potential Biases: Models learn from data; if the data has biases, the learned representations can inadvertently capture and perpetuate them. This is especially true for self-supervised models that rely on vast amounts of unlabeled data. The ethical ramifications are significant, especially when models are deployed in sensitive areas like medical imaging or criminal justice.
- One Size Doesn't Fit All: Even though learned representations are broadly applicable, they might not always be the best for every task. Some domain-specific problems benefit more from customized, hand-crafted features compared to generic representations.
In conclusion, while representation learning is reshaping how we approach machine learning problems, a measured approach is required. Embracing its strengths while acknowledging its challenges ensures we apply the technique properly.
Conclusion
In our exploration of representation learning, we've gone from the painstaking process of manual feature engineering to an automated way of learning robust features. As we steer towards the future, it's essential to keep the benefits of representation learning in mind, since they can elevate your future Machine Learning models. Following this blog post, I'll dive into three concrete papers addressing (1) how to exploit learned representations to make your Vision Transformer faster, (2) untangling CNNs to discover that they might overfit to their input size, and (3) a way of utilizing self-supervision to boost the performance of your supervised model. See you there!