Introduction

Image Classification is the task of assigning an input image one label from a fixed set of categories. This is one of the core problems in Computer Vision that, despite its simplicity, has a large variety of practical applications.

Example: In the image below, an image classification model takes a single image and assigns probabilities to two labels, {cat, dog}. Keep in mind that to a computer, an image is represented as one large 3-dimensional array of numbers. In this example, the cat image is 248 pixels wide, 400 pixels tall, and has three color channels: Red, Green, Blue (or RGB for short). Therefore, the image consists of 248 x 400 x 3 numbers, or a total of 297,600 numbers. Each number is an integer that ranges from 0 (black) to 255 (white). Our task is to turn these roughly 300,000 numbers into a single label, such as “cat”.
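
To make the "array of numbers" view concrete, here is a minimal sketch in plain Python. It assumes a flat, row-major layout (row, then column, then channel), which is one common convention but is not something the example above specifies; the names image and pixel_index are purely illustrative.

```python
# Dimensions of the cat image from the example above.
WIDTH, HEIGHT, CHANNELS = 248, 400, 3

# One byte per number: each value is an integer in the range 0..255.
# Assumption: a flat, row-major buffer (row, then column, then channel).
image = bytearray(WIDTH * HEIGHT * CHANNELS)


def pixel_index(row, col, channel):
    """Return the position of one color value inside the flat buffer."""
    return (row * WIDTH + col) * CHANNELS + channel


# Set the top-left pixel to white: all three channels at 255.
for channel in range(CHANNELS):
    image[pixel_index(0, 0, channel)] = 255

print(len(image))  # 297600
```

In practice you would usually hold the same data in an array library, but the count of numbers the classifier has to digest is the same.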

Similarly, you could train a model that looks at an image and tells you which Kardashian you’re looking at: Kim, Kylie, or Other.

Training a classification model

Training a model which classifies images as a cat image or a dog image is an example of binary classification.

The image classification pipeline. We’ve seen that the task in Image Classification is to take an array of pixels that represents a single image and assign a label to it. Our complete pipeline can be formalized as follows:

  • Input: Our input consists of a set of N images, each labeled with one of K different classes. We refer to this data as the training set.

  • Learning: Our task is to use the training set to learn what every one of the classes looks like. We refer to this step as training a classifier or learning a model.

Data Mislabeling Problem

But what if your training data contains incorrect labeling? What if a dog was labeled as a cat? What if Kylie is labeled as Kendall or Kim as Kanye? This kind of data mislabeling might happen if you source your data from the internet, a very common source for procuring data.

This will eventually cause your model to either learn the noise in the dataset or learn incorrect features. However, this can be avoided to a certain extent. If you’re training on a small dataset, you could go through all the labels and check them manually, or use your minions to do the dirty work. An alternative mathematical approach is shared in the next section.

So, the problem is that your model will be learning incorrect features (from a dog) and associate those features with the label “cat”. How can we solve that? To look into that, let’s look at the loss function used in image classification problems.

Before we get to the loss function, we should establish that the classification model outputs a probability for each class.
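
The article does not show how these probabilities are produced. A common choice, assumed here, is to pass the model's raw class scores through the softmax function. The sketch below is illustrative only; the scores are made up.

```python
import math


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() from overflowing
    # and does not change the result.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores for the classes [cat, dog].
probs = softmax([1.0, 3.0])
print(probs)  # roughly [0.12, 0.88]
```

The larger score gets the larger probability, and the two values always add up to 1.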

In image classification problems, we use the softmax (cross-entropy) loss. For two categories, it reduces to the binary cross-entropy:

L = −(y log(p) + (1 − y) log(1 − p))

Here, L is the loss, y is the true label (0 for cat, 1 for dog), p is the predicted probability that the image belongs to class 1, i.e. dog, and log is the natural logarithm. The objective of training is to reduce this loss.

The loss essentially drives your “gradients”, which in simple terms determine the “learning” of the model. We need to keep a very close eye on the loss for this reason.

Say you get a dog image (y = 1), and the model predicts dog with a probability of 0.99. Your loss will be:

L = −(1 · log(0.99) + (1 − 1) · log(1 − 0.99)) = −log(0.99) ≈ 0.01

which is good! The loss should be small when the prediction is accurate!

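The loss will be particularly high when a label is incorrect, though. If that same dog image were labeled as a cat (y = 0), the loss would be −log(1 − 0.99) ≈ 4.6. The model is then pushed hard toward the wrong answer, which consequently causes problems in learning.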
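
The effect of a wrong label on the loss can be checked numerically. The following is a minimal sketch of the two-category loss defined above, using the natural logarithm; the function name binary_cross_entropy is just an illustrative choice.

```python
import math


def binary_cross_entropy(y, p):
    """L = -(y*log(p) + (1 - y)*log(1 - p)), natural logarithm."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


p_dog = 0.99  # the model is 99% sure the image shows a dog

loss_correct_label = binary_cross_entropy(1, p_dog)  # labeled dog
loss_wrong_label = binary_cross_entropy(0, p_dog)    # mislabeled as cat

print(round(loss_correct_label, 4))  # 0.0101
print(round(loss_wrong_label, 4))    # 4.6052
```

The same confident prediction costs several hundred times more when the label is wrong, which is why mislabeled examples pull so strongly on the gradients.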

So how do we take care of that? In the next section, we see an approach which reduces the loss in the case of incorrect labels.

Label Smoothing — One Possible Solution

Say hello to Label Smoothing!

When we apply the cross-entropy loss to a classification task, we expect the true label to have a target value of 1 and the others 0. In other words, we have no doubts that the true labels are true, and the others are not. Is that always true? Maybe not. Many manual annotations are the results of multiple participants. They might have different criteria. They might make some mistakes. They are human, after all. As a result, the ground truth labels we have placed perfect trust in are possibly wrong.

One possible solution to this is to relax our confidence in the labels. For instance, we can slightly lower the target value of the true label from 1 to, say, 0.9. Naturally, we also slightly increase the target value of the others from 0. This idea is called label smoothing.

Here are the arguments of the cross-entropy loss as defined in TensorFlow:

This is what the TensorFlow documentation says about the label_smoothing argument:

If label_smoothing is nonzero, smooth the labels towards 1/num_classes: new_onehot_labels = onehot_labels * (1 – label_smoothing) + label_smoothing / num_classes

What does this mean?

Well, say you were training a model for binary classification. Your labels would be 0 for cat and 1 for not cat. As a one-hot vector, the label of a “not cat” image is [0 1].

Now, say you set label_smoothing = 0.2.

Using the equation above, we get:

new_onehot_labels = [0 1] * (1 − 0.2) + 0.2 / 2 = [0 1] * 0.8 + 0.1

new_onehot_labels = [0.1 0.9]
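
The smoothing formula quoted from the documentation can be checked with a few lines of plain Python. This is a sketch of the formula only, not TensorFlow's implementation; smooth_labels is a hypothetical helper name.

```python
def smooth_labels(onehot_labels, label_smoothing):
    """Apply: labels * (1 - label_smoothing) + label_smoothing / num_classes."""
    num_classes = len(onehot_labels)
    return [
        y * (1 - label_smoothing) + label_smoothing / num_classes
        for y in onehot_labels
    ]


soft = smooth_labels([0, 1], 0.2)
print([round(v, 2) for v in soft])  # [0.1, 0.9]
```

Note that the smoothed values still sum to 1, so they remain a valid target distribution.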

These are soft labels, as opposed to the hard labels 0 and 1. When a label is wrong and the model confidently disagrees with it, the soft label yields a lower loss than the hard label would. As a result, your model is penalized less and learns from the incorrect label to a slightly lesser degree.

In essence, label smoothing can help your model train around mislabeled data and consequently improve its robustness and performance.
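
To see the size of this effect, here is a small sketch comparing the cross-entropy loss under a hard and a smoothed version of the same wrong label. The prediction values and the class order are assumptions chosen for illustration, and the smoothed targets correspond to label_smoothing = 0.2.

```python
import math


def cross_entropy(targets, probs):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(targets, probs))


# The model is confident the image belongs to the second class.
probs = [0.01, 0.99]

# The annotation wrongly says the first class.
hard_wrong = [1.0, 0.0]
soft_wrong = [0.9, 0.1]  # the same label after smoothing with 0.2

loss_hard = cross_entropy(hard_wrong, probs)
loss_soft = cross_entropy(soft_wrong, probs)

print(round(loss_hard, 2))  # 4.61
print(round(loss_soft, 2))  # 4.15
```

The reduction is modest for a single example, and smoothing also raises the loss slightly on confidently correct predictions, so it softens the impact of label noise rather than removing it.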

Further Reading

When Does Label Smoothing Help?