5 min read

Active Learning in Machine Learning

Written by Julien Perichon

How can you be clever about your training by annotating the samples your model struggles with most? With Active Learning for Machine Learning, of course!

1. Why use Active Learning for Machine Learning?

Let us suppose you are in a context where data is abundant. In fact, you have so much data that you don’t have the budget (money, labeling capacity, or both) to annotate all of it. On the other hand, as you are serving a Machine Learning model, you would like to use some of this data to improve its performance. This is where Active Learning comes into play!

When your budget is low compared to the volume of your unlabelled data, the key problem is how to choose the most interesting data. For instance, suppose you have a model classifying images between cats, dogs, and horses. If your model already performs very well on the first two classes but poorly on horses, you would like to focus on labeling horse images so that your model has more of them to train on. This is exactly what Active Learning does: you define a criterion describing how your model currently performs on the unlabelled data, and you select the data you think will benefit your model most. The stopping criterion is how much annotation budget you have.

Do you want to learn this power? In just 3 minutes you will know about the most famous Active Learning methods, and in less than 10 minutes you will have implemented your own Active Learner in TensorFlow!


2. Active Learning methods for Classification

As shown before, the whole point of Active Learning is to choose the best data to label to improve your model.

How can we know which examples will be the most informative? A very good proxy, often used in the literature, is to select instances for which your model is very unsure about its prediction: the idea is that the farther you are from a perfect prediction (1 for one class, 0 for the others), the less knowledge the model has about this type of data, which makes it a very important sample to label.

For a classification setup, you can use these basic criteria based on the classification probabilities of unlabelled data:

  • Least confidence: select data for which the maximum probability in the classification scores is minimal;
  • Smallest margin: select data for which the difference between the top 1 and top 2 classification probabilities is minimal;
  • Entropy: select data for which the entropy is maximal.
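As a concrete sketch, the three criteria can be computed from a matrix of classification probabilities with NumPy (the function names here are mine, not from any library):

```python
import numpy as np

def least_confidence(probs):
    # Confidence of each prediction; select samples where this is MINIMAL.
    return probs.max(axis=1)

def smallest_margin(probs):
    # Gap between the top-1 and top-2 probabilities; select where MINIMAL.
    sorted_probs = np.sort(probs, axis=1)
    return sorted_probs[:, -1] - sorted_probs[:, -2]

def entropy(probs):
    # Shannon entropy of the predicted distribution; select where MAXIMAL.
    return -np.sum(probs * np.log(probs + 1e-12), axis=1)

# Two toy samples: the first is confident, the second is uncertain.
probs = np.array([
    [0.90, 0.05, 0.02, 0.02, 0.01],
    [0.30, 0.25, 0.20, 0.15, 0.10],
])
```

All three criteria agree that the second sample is the more uncertain one here; as the example below shows, they do not always agree.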

To put this into practice, let us use a simple example with this 5-class classification task:

Output classification probabilities for 3 unlabelled samples.

We get the following results for the criteria above:

Active learning basic criteria values for the 3 samples. For each row, the value in red corresponds to the greatest model uncertainty.

In this example with a budget of only one image to label, you can observe that the results differ depending on the chosen criterion. In particular, you would choose sample B when using Least confidence or Entropy, but sample A when using Smallest margin.

You can see that the easiest criterion to use and implement is the Least confidence criterion. Therefore, we will stick to this method for the rest of the article.

Now that you know why you should use Active Learning for your Machine Learning model, let me show you how to actually implement it for a Machine Learning model training!


3. How to implement your own Active Learner on a dataset

Now let us dive into the implementation of your own Active Learner. We will work on the tf_flowers dataset, an image dataset with 5 classes: daisy, dandelion, roses, sunflowers, and tulips. We will use an InceptionV3 model pre-trained on the iNaturalist dataset, so that we start with a good representation of what a plant looks like.

3.1 Loading the dataset

We will split the tf_flowers dataset into 4 parts:

  • the training dataset, taken from the first 1,000 samples;
  • the validation dataset, taken from the following 500 samples;
  • the test dataset, taken from the following 1,000 samples;
  • the unlabelled dataset, taken from the 1,170 remaining samples.

Then, as usual, we need to preprocess our datasets before doing any training. In particular, we need to resize all images to 299 × 299 pixel squares for the InceptionV3 model:
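A minimal preprocessing function might look like this (the [0, 1] pixel scaling is an assumption; adapt it to whatever your pre-trained model expects):

```python
import tensorflow as tf

IMG_SIZE = 299
BATCH_SIZE = 32

def preprocess(image, label):
    # InceptionV3 expects 299 x 299 inputs; scale pixels to [0, 1].
    image = tf.image.resize(image, (IMG_SIZE, IMG_SIZE))
    image = tf.cast(image, tf.float32) / 255.0
    return image, label

# Applied to each split, e.g.:
# train_dataset = train_ds.map(preprocess).batch(BATCH_SIZE)
```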

3.2 Defining our model

We will use an InceptionV3 model pre-trained on the iNaturalist dataset, which contains flower species. We expect that this model has learned a good representation of flower species to be able to classify them. On top of that, we just add a Dense layer for classification.

You may have noticed that we froze the inner InceptionV3 model with trainable=False. This is because we are doing transfer learning: first you train only the new Dense layer at the top, and then you unfreeze the whole model for fine-tuning.

To train the top Dense layer, we train the model for 3 epochs. Then we unfreeze and retrain the whole model for 2 epochs.
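The two-phase schedule can be sketched as follows, here with a tiny stand-in backbone so the snippet runs on its own (the real code would use the frozen InceptionV3 model from the previous step):

```python
import tensorflow as tf

# Tiny stand-in for the pre-trained backbone, just to show the schedule.
backbone = tf.keras.Sequential([tf.keras.layers.Dense(16, activation="relu")])
backbone.trainable = False  # phase 1: train only the head

model = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(8,)),
    backbone,
    tf.keras.layers.Dense(5, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_dataset, validation_data=val_dataset, epochs=3)

# Phase 2: unfreeze everything; recompile (required after changing
# `trainable`) with a low learning rate, then fine-tune end-to-end.
backbone.trainable = True
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_dataset, validation_data=val_dataset, epochs=2)
```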

Now that we have a real model trained on the dataset, we can evaluate it to know more about its performance and get our first results!

3.3 Selecting the data to label

Now that we have a functioning model, let us dive into the real Active Learning method. For the sake of simplicity, we will use the Least confidence criterion.

First, we need to compute the classification probabilities for the unlabelled dataset.

Remember we are using the Least confidence criterion. Thus, now that we have all predicted probabilities on the unlabelled dataset, we can compute the maximum probability for each sample. From there, we can select the samples with the smallest value until we spend all our budget.

Defining your budget makes more sense in a business setting, as you would be able to measure how much money it costs to label one image. For this article, we will take a budget of 500 images.
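The selection itself can be sketched as below; in the real pipeline you would call model.predict on the unlabelled dataset, but here I substitute random softmax outputs so the snippet runs standalone:

```python
import numpy as np

BUDGET = 500
rng = np.random.default_rng(0)

# Stand-in for the real pipeline, where you would run:
#   probabilities = model.predict(unlabelled_dataset)
probabilities = rng.dirichlet(np.ones(5), size=1170)  # fake softmax outputs

# The heart of the Active Learner: Least confidence selection.
max_probabilities = probabilities.max(axis=1)              # confidence per sample
selected_indices = np.argsort(max_probabilities)[:BUDGET]  # least confident first
```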

Take a moment to admire the fact that these 3 simple lines are basically all the code at the heart of the Active Learner! If you wanted to try another method, all you would need to do is to change these few lines!

The next step is to label the chosen instances. In a real case, you could use the method described in my previous article: Image annotation made faster: label 3k images in just 30 minutes!.

But in this article, we are lucky as we can directly take the true labels from the existing dataset, so we can skip the labeling part!

If we take a sneak peek at the 20 worst probabilities, we get the following result:

Given that we have only 5 classes, the worst probabilities are basically close to those of a random model. Therefore, it is very important that we improve our model on this specific data.

We can also evaluate the trained model on the least_confidence_dataset:

We observe that we lose roughly 10% in accuracy on this dataset compared to the test dataset. We therefore need to add those samples to the training dataset and retrain the model.

3.4 Retrain the model using the additional data

The final step is to give the newly labelled data as additional data for the model to train on. We can do that with the following command. Note that concatenation only works on compatible, unbatched datasets, so you need to rebatch the result at the end.
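A sketch of the concatenation, using small toy datasets in place of the real train and newly labelled datasets so it runs standalone:

```python
import tensorflow as tf

BATCH_SIZE = 32

# Toy stand-ins for the real batched datasets (contents are illustrative).
train_dataset = tf.data.Dataset.range(10).batch(BATCH_SIZE)
least_confidence_dataset = tf.data.Dataset.range(10, 15).batch(BATCH_SIZE)

# concatenate() needs matching element specs, so unbatch both
# datasets first and rebatch the combined result at the end.
train_dataset = (
    train_dataset.unbatch()
    .concatenate(least_confidence_dataset.unbatch())
    .batch(BATCH_SIZE)
)
```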

Finally, retrain your model and evaluate it on the test dataset:

We see that we only gained 1% in performance. Actually, the model performs too well and there are too few classes to observe a large shift. As an exercise, you can try a smaller model (such as ResNet50 or MobileNetV2) on a dataset with many classes (such as ImageNet). From there you should see a big difference!


4. Conclusion and next steps

As you can see, Active Learning methods are pretty easy to integrate into your Machine Learning training pipeline. The methods presented here are easy to understand and are natural to use. Furthermore, they help you focus on which problems to tackle in priority to make your model even better!

If you want to know more about Active Learning, you can clone my active-learning-methods repository and try to implement other criteria. Don’t hesitate to drop a star to support!

You can also use another approach by using self-supervised learning so that you don’t even have to label your data. For more details, have a look at the following article: Why Self-Supervised Learning is the future of Computer Vision?

Are you looking for Image Recognition Experts? Don't hesitate to contact us!
