Image classification is a pattern recognition problem within the computer vision domain. It aims to reduce the semantic gap: the difference between how a human perceives the contents of an image and how that image is represented in a form a computer can process.
In other words, image classification takes an image as input and outputs a class (e.g. a cat, a dog, …) or a probability distribution over the classes that best describes the content of the image. An example of this task and its output is given below.
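Concretely, a classifier usually ends by producing one raw score per class. A softmax is one common way to turn those scores into probabilities, and the predicted class is then the one with the highest probability. This article does not say which output layer is used, so the softmax, the `predict` helper, and the scores below are assumptions for illustration only.

```python
import math


def softmax(scores):
    """Convert raw per-class scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max score for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(scores, class_names):
    """Hypothetical helper: return the most probable class and all probabilities."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best], probs


# Hypothetical raw scores a network might output for one image.
label, probs = predict([2.0, 0.5, -1.0], ["cat", "dog", "bird"])
print(label)  # cat
```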
This task is harder than it sounds because computer vision systems must cope with several challenges, e.g. viewpoint variation, scale variation, deformation, occlusion, illumination, background clutter, intra-class variation …
In deep learning, image classification is done in four steps: first, you gather and annotate the data; then you split it into training and validation sets; then you train the model; and finally you evaluate it.
More details about each step:
Step 1 — Gather dataset:
- It is usually preferable to ensure that you have the same number of images per category (class); otherwise, the trained model may be biased toward the over-represented classes.
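A quick way to check class balance before training is simply to count the images per label. The sketch below assumes the annotations are available as a plain list of class labels; the `is_roughly_balanced` helper and its 10% tolerance are hypothetical choices, not a standard threshold.

```python
from collections import Counter


def class_counts(labels):
    """Number of images per class label."""
    return Counter(labels)


def is_roughly_balanced(labels, tolerance=0.1):
    """True if the smallest class is within `tolerance` (a fraction)
    of the largest class. The 10% default is an arbitrary choice."""
    counts = Counter(labels)
    largest, smallest = max(counts.values()), min(counts.values())
    return (largest - smallest) / largest <= tolerance


labels = ["cat"] * 500 + ["dog"] * 480 + ["bird"] * 120  # hypothetical annotations
print(class_counts(labels))         # Counter({'cat': 500, 'dog': 480, 'bird': 120})
print(is_roughly_balanced(labels))  # False
```

If a class is clearly under-represented, collecting more images for it is the most direct fix.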
Step 2 — Split dataset
- Split the data into two sets: a training set and a validation set. The two sets must not overlap. Common splits are 66%/33%, 75%/25%, or 90%/10%.
- The split ratio (train/validation) is itself a choice you tune, much like a hyperparameter; the validation set is then what you use to tune the model's other hyperparameters.
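One way to produce non-overlapping training and validation sets that also keep the class proportions from Step 1 is a stratified split: shuffle and cut each class separately. The sketch below is a plain-Python illustration; the function name, the 75% default, and the file names are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(samples, labels, train_fraction=0.75, seed=0):
    """Split samples into non-overlapping train/validation lists of
    (sample, label) pairs, keeping class proportions roughly equal."""
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, val = [], []
    for label, items in by_class.items():
        rng.shuffle(items)
        cut = round(len(items) * train_fraction)
        train += [(s, label) for s in items[:cut]]
        val += [(s, label) for s in items[cut:]]
    return train, val


samples = [f"img_{i}.jpg" for i in range(100)]  # hypothetical file names
labels = ["cat"] * 60 + ["dog"] * 40
train, val = stratified_split(samples, labels)
print(len(train), len(val))  # 75 25
```

Because every sample lands in exactly one of the two slices of its class list, the sets cannot overlap.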
Step 3 — Train your network
- In general, deep learning models are trained with some form of gradient descent. More details about this will be given in the next articles. In short, gradient descent is an optimization algorithm that minimizes an error function by iteratively moving in the direction of steepest descent, as defined by the negative of the gradient.
"Training CNN is a non-trivial process — many experiments are needed to determine what does work and what does not"
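To make the gradient descent description concrete, here is a minimal sketch of the update rule w ← w − η·∇E(w), where η is the learning rate. It is applied to a hypothetical toy problem (fitting a line y = w·x + b with mean squared error), not to a CNN; the learning rate of 0.05 and the 2000 steps are assumptions chosen so this small example converges.

```python
def gradient_descent(grad, start, learning_rate=0.05, steps=2000):
    """Minimize a function by repeatedly stepping against its gradient."""
    params = list(start)
    for _ in range(steps):
        g = grad(params)
        # Move each parameter a small step in the direction of steepest descent.
        params = [p - learning_rate * gi for p, gi in zip(params, g)]
    return params


# Toy data (hypothetical) generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]


def mse_grad(params):
    """Gradient of the mean squared error E(w, b) = mean((w*x + b - y)^2)."""
    w, b = params
    n = len(xs)
    residuals = [w * x + b - y for x, y in zip(xs, ys)]
    dw = 2.0 / n * sum(r * x for r, x in zip(residuals, xs))
    db = 2.0 / n * sum(residuals)
    return [dw, db]


w, b = gradient_descent(mse_grad, start=[0.0, 0.0])
print(round(w, 3), round(b, 3))  # 2.0 1.0
```

Training a real network follows the same loop, except the gradient of the loss with respect to millions of weights is computed by backpropagation, usually on mini-batches of images rather than the whole dataset.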
Step 4 — Evaluation
- This step is done by measuring the following metrics: precision, recall, and F-measure.
- Precision is calculated as the number of true positives (TP) divided by the total number of true positives and false positives (FP).
- Recall quantifies the number of correct positive predictions made out of all positive predictions that could have been made, i.e. TP divided by the total of TP and false negatives (FN). Unlike precision, which only considers the correct positive predictions out of all positive predictions made, recall indicates how many positive predictions were missed, i.e. the false negatives.
- F-measure combines precision and recall into a single score that captures both properties. Alone, neither precision nor recall tells the whole story: we can have excellent precision with terrible recall, or terrible precision with excellent recall. The most common F-measure, the F1 score, is the harmonic mean of precision and recall.
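The three metrics above can be written compactly as follows, with a worked example using hypothetical counts TP = 8, FP = 2, FN = 4:

```latex
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}

\text{Precision} = \frac{8}{8 + 2} = 0.8, \qquad
\text{Recall} = \frac{8}{8 + 4} \approx 0.667, \qquad
F_1 = 2 \cdot \frac{0.8 \cdot \tfrac{2}{3}}{0.8 + \tfrac{2}{3}} = \frac{16}{22} \approx 0.727
```

Because F1 is a harmonic mean, it is pulled toward the lower of the two values, so a model cannot score well on F1 by excelling at only one of them.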
To understand the concepts of true/false positives and true/false negatives, here is an example:
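The same four outcomes can be tallied in code. This sketch treats one class (here, hypothetically, "cat") as positive and every other class as negative, counts TP, FP, FN, and TN, and then computes precision, recall, and F1. Returning 0.0 when a denominator is zero is a convention assumed here, not something the metrics define.

```python
def confusion_counts(y_true, y_pred, positive):
    """Count TP, FP, FN, TN for one class treated as 'positive' (one-vs-rest)."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1  # predicted positive, actually positive
        elif p == positive:
            fp += 1  # predicted positive, actually negative
        elif t == positive:
            fn += 1  # predicted negative, actually positive (a missed detection)
        else:
            tn += 1  # predicted negative, actually negative
    return tp, fp, fn, tn


def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1; 0.0 is returned when a denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical ground-truth labels and model predictions for six images.
y_true = ["cat", "cat", "dog", "cat", "dog", "dog"]
y_pred = ["cat", "dog", "dog", "cat", "cat", "dog"]
tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive="cat")
print(tp, fp, fn, tn)  # 2 1 1 2
print(precision_recall_f1(tp, fp, fn))
```

For a multi-class problem, the same computation can be repeated with each class as the positive one and the results averaged.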
Once your model is trained on the dataset you collected, the next step is to check whether the network is able to generalize, i.e. whether it correctly predicts the class label of an image that was not in the training or validation sets. This is known as the "generalization problem", and it should be considered when designing the network. More details about this will be in the next articles.
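A generalization check is only meaningful if the held-out images truly were never seen during training or validation. The sketch below, a hypothetical helper rather than part of any library, fingerprints raw image bytes with SHA-256 to flag exact duplicates, and computes plain accuracy on the held-out predictions. It only catches byte-identical copies; resized or re-encoded versions of the same photo will not be detected.

```python
import hashlib


def fingerprint(image_bytes: bytes) -> str:
    """SHA-256 digest of an image file's raw bytes."""
    return hashlib.sha256(image_bytes).hexdigest()


def find_leaked(test_images, train_and_val_images):
    """Indices of held-out images that also appear (byte-for-byte) in train/val."""
    seen = {fingerprint(b) for b in train_and_val_images}
    return [i for i, b in enumerate(test_images) if fingerprint(b) in seen]


def accuracy(y_true, y_pred):
    """Fraction of held-out images whose predicted label matches the true label."""
    if not y_true:
        raise ValueError("no held-out labels given")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical stand-ins for file contents, e.g. open(path, "rb").read().
print(find_leaked([b"img-a", b"img-b", b"img-c"], [b"img-b", b"img-x"]))  # [1]
print(accuracy(["cat", "dog", "cat", "dog"], ["cat", "dog", "dog", "dog"]))  # 0.75
```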
If you want to read more about image classification, check out this article.
As always, to end this article, I will leave you with the "Quote of the Article":
"What you do makes a difference, and you have to decide what kind of difference you want to make." - Jane Goodall
Cheers