with Hangfei Lin and Li Miao
Few-shot image classification
Few-shot image classification is the task of classifying unseen images into one of N
mutually exclusive classes, using only a small number of training examples per class. The limited number of examples per class (denoted K
) poses a significant challenge to classification accuracy.
To address this, we developed a method that augments the set of K
support images with an additional set of A
retrieved images. We call this system Retrieval-Augmented Few-shot Image Classification (RAFIC).
RAFIC
Overview of our proposed system, Retrieval-Augmented Few-shot Image Classification (RAFIC). The top diagram shows the high-level view; the bottom one shows the details of the retriever. The top diagram depicts 2-way (N = 2
) classification of aircraft variants. For each class, we provide 1 support image (K = 1
) and ask the model to classify 1 query image. The retriever uses the support images and class labels to retrieve 3 (A = 3
) additional images. We then concatenate the CLIP image embeddings of both the support and retrieved images and use them for model training. The model is evaluated on correctly predicting the labels of the query images. Retrieval is done per class label: we embed the class name with the CLIP text encoder, extract CLIP image embeddings for each support image and take their mean, and retrieve the top-A images from LAION-5B using a faiss index.
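The retrieval step can be sketched in a few lines of numpy. This is a minimal sketch, not the actual implementation: brute-force cosine similarity stands in for the faiss index over LAION-5B, and the exact rule for combining the class-name text embedding with the mean support-image embedding is an assumption for illustration.

```python
import numpy as np

def build_query(text_emb, support_embs):
    """Combine the class-name text embedding with the mean support-image
    embedding, then L2-normalize so that dot product = cosine similarity.
    (The sum here is an assumed combination rule, not taken from the paper.)"""
    q = text_emb + support_embs.mean(axis=0)
    return q / np.linalg.norm(q)

def retrieve_top_a(query, index_embs, A=3):
    """Brute-force stand-in for a faiss inner-product search: score every
    indexed embedding against the query and return the indices of the top A."""
    sims = index_embs @ query
    return np.argsort(-sims)[:A]
```

In the real system, `retrieve_top_a` would be replaced by an approximate nearest-neighbor search over precomputed CLIP embeddings of the 5B+ LAION images, which is what makes real-time retrieval feasible.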
Key findings
- Using CLIP embeddings leads to vastly superior performance vs. raw pixels. Using CLIP embeddings as input features far surpasses raw-pixel inputs: for instance, accuracy rises from 0.26 to 0.88 on the 10-way rare-bird classification task.
- Zero-shot retrieval using class names is highly effective. We also found that zero-shot retrieval using class-name text embeddings is highly effective, achieving over 96% accuracy in 10-way bird classification, surpassing other methods and showcasing CLIP's familiarity with certain concepts over others.
- Efficient retrieval augmentation leads to a boost in accuracy. The addition of retrieved images significantly improved model accuracy in two challenging few-shot tasks compared to a baseline of logistic regression without meta-training or augmentation. Using Approximate Nearest Neighbors (ANN) search via faiss, we enabled real-time retrieval and representation extraction across LAION-5B, a repository of 5B+ images.
- Meta-learning the retrieval strategy further boosts accuracy. We demonstrate that accuracy can be further improved by meta-learning the retrieval strategy, using both a coarse strategy (up- or down-weighting the retrieved set as a whole) and a fine-grained one (up- or down-weighting individual retrieved images).
- MAML is more adept at adaptation than ProtoNet. Training on one task and evaluating on another (cross-evaluation) showed that MAML adapts well relative to ProtoNet. Moreover, including more retrieved images further closed the performance gap.
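As a concrete illustration of the baseline and the coarse weighting strategy above, here is a hedged sketch using scikit-learn's logistic regression with per-example sample weights. The function name and the scalar `alpha` (a single weight applied to all retrieved images) are assumptions for illustration, not the system's actual interface; the fine-grained strategy would instead learn one weight per retrieved image.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_weighted(support_X, support_y, retr_X, retr_y, alpha=0.5):
    """Fit logistic regression on CLIP embeddings of support + retrieved images.
    Coarse weighting: support images get weight 1.0, retrieved images get a
    single shared weight alpha (hypothetical; in RAFIC this is meta-learned)."""
    X = np.vstack([support_X, retr_X])
    y = np.concatenate([support_y, retr_y])
    w = np.concatenate([np.ones(len(support_y)),
                        np.full(len(retr_y), alpha)])
    return LogisticRegression(max_iter=1000).fit(X, y, sample_weight=w)
```

Setting `alpha=0` recovers the unaugmented baseline, so sweeping (or meta-learning) `alpha` directly trades off trust in the retrieved images against the K support images.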
Want more details?
See this.