March 20, 2023 • 4 min read

Few-Shot Learning Benchmarks are Flawed: How can we fix them?

Written by Etienne Bennequin

Few-Shot Learning is the subfield of Machine Learning in which we assume that we only have access to a few labeled examples. The field is a few years old and now has its own research community, with its own evaluation processes and its own benchmarks. Sadly, we find that these benchmarks are unrealistic, giving us a false idea of the performance of our models.

The Few-Shot Classification problem as modeled by the tieredImageNet benchmark.

In this article, we are going to assume that you are familiar with Few-Shot Learning. If you are not, don’t worry! Just follow this tutorial and come back here when you’re done.

We are going to consider the two most widely used Few-Shot Learning benchmarks: tieredImageNet and miniImageNet. Since 2018, these two benchmarks combined have been used more than a thousand times in peer-reviewed papers.

The standard evaluation process in Few-Shot Learning is to sample hundreds of small few-shot tasks from the test set, compute the accuracy of the model on each task, and report the mean and standard deviation of these accuracies. But we never look at the tasks individually: only at the aggregated results.
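The evaluation protocol described above can be sketched in a few lines. This is a minimal illustration, not the exact code used by any specific benchmark; `accuracy_fn` is a hypothetical placeholder standing in for running a real model on one sampled task.

```python
import random
import statistics

def evaluate(accuracy_fn, test_classes, n_way=5, n_tasks=600):
    """Standard few-shot evaluation loop: sample many N-way tasks
    uniformly at random from the test classes, score each task, and
    report only the mean and standard deviation of the accuracies."""
    accuracies = []
    for _ in range(n_tasks):
        task = random.sample(test_classes, n_way)  # uniform task sampling
        accuracies.append(accuracy_fn(task))
    return statistics.mean(accuracies), statistics.stdev(accuracies)
```

Note that the individual tasks are discarded: only the two aggregate numbers survive, which is exactly why nobody looks at what the tasks actually contain.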

So what is really in these tasks? Exactly what kind of problem did these hundreds of research papers try to solve? Does it reflect real-world problems?

Uniformly sampled tasks do not reflect real-world use cases for Few-Shot Learning

Few-Shot Learning benchmarks such as miniImageNet or tieredImageNet evaluate methods on hundreds of Few-Shot Classification tasks. These tasks are sampled uniformly at random from the set of all possible tasks.

This induces a huge bias towards tasks composed of classes that have nothing to do with one another. Classes that you would probably never have to distinguish in any real use case.

walking stick, pomegranate, teapot, geyser, damselfly
A task sampled from tieredImageNet. We would ask the model to classify query images among these five classes. Can you think of a real-life application for this?
trifle, electric guitar, mixing bowl, scoreboard, malamute
Same thing with miniImageNet. Do you remember that time you absolutely needed an AI that would distinguish between an electric guitar and a very specific kind of dog?

If you want to generate more examples of absurd tasks, check out the companion dashboard!

Build better benchmarks with Semantic Task Sampling

The classes of tieredImageNet are part of the WordNet semantic graph. We can use this graph to define a semantic distance between classes, which measures how far apart the concepts they represent are: a hotdog, for instance, is closer to a cheeseburger than it is to a house. We can then define the coarsity of a task as the mean square semantic distance between the classes that compose it.
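Here is a minimal sketch of the coarsity measure. The hotdog/cheeseburger and hotdog/goldfish distances come from the WordNet subgraph shown later in this article; the cheeseburger/goldfish value is made up for illustration, and in practice the distances would be computed from the WordNet graph rather than hand-filled.

```python
from itertools import combinations

# Hypothetical pairwise semantic distances (hand-filled here;
# the real values are derived from the WordNet graph).
semantic_distance = {
    frozenset({"hotdog", "cheeseburger"}): 1.39,
    frozenset({"hotdog", "goldfish"}): 10.13,
    frozenset({"cheeseburger", "goldfish"}): 10.0,  # made up
}

def coarsity(task_classes):
    """Coarsity of a task: mean squared semantic distance over all
    pairs of classes in the task. Low coarsity = fine-grained task."""
    pairs = list(combinations(task_classes, 2))
    return sum(semantic_distance[frozenset(p)] ** 2 for p in pairs) / len(pairs)
```

Squaring the distances penalizes tasks that contain even one semantic outlier, so a task only gets a low coarsity when all of its classes are close together.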

distribution of the coarsities for two few-shot learning benchmarks made with tieredImageNet: one with uniform sampling, the other with semantic sampling
Histogram showing the distribution of tasks in terms of coarsity.

Thanks to this measure, the figure above confirms that with uniform task sampling (as is usually done in the literature), we essentially never get a task composed of classes that are semantically close to each other.

But these tasks are not unreachable! We can actually force our task sampler to sample together classes with a low coarsity. That's the pink histogram. The pink histogram makes the impossible possible. It can reach coarsities that the blue histogram would never even dream of.
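One simple way to bias the sampler toward fine-grained tasks is rejection sampling under a coarsity threshold. This is a sketch of the idea, not necessarily the sampler used in the paper; `coarsity_fn` is assumed to be the coarsity measure defined earlier.

```python
import random

def sample_semantic_task(classes, coarsity_fn, n_way=5,
                         max_coarsity=25.0, max_tries=10_000):
    """Rejection sampling: draw uniform tasks and keep the first one
    whose coarsity falls below the threshold, forcing semantically
    close classes to be sampled together."""
    for _ in range(max_tries):
        task = random.sample(classes, n_way)
        if coarsity_fn(task) <= max_coarsity:
            return task
    raise RuntimeError("no task below the coarsity threshold was found")
```

Lowering `max_coarsity` shifts the whole histogram of sampled tasks toward the fine-grained end, which is exactly what the pink histogram shows.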

OK, but what does it really mean for a few-shot learning task to have a low coarsity? I used the slider in the companion dashboard to generate such tasks.

plate, consomme, trifle, cheeseburger, hotdog
ringlet, monarch, cabbage butterfly, sulphur butterfly, lycaenid
tench, rock beauty, anemone fish, lionfish, puffer
kuvasz, schipperke, Doberman, miniature pinscher, affenpinscher

It seems that when you choose a low coarsity, you get a task composed of classes that are semantically close to each other. For instance, with the lowest coarsity (8.65), you get the task of discriminating between 5 breeds of dogs.

On the other hand, when you increase the coarsity, the classes seem to get more distant from one another.

menu, dough, dung beetle, cardoon, banana
coho, briard, French bulldog, beaker, teapot

Another way to see this distance is directly on the WordNet graph. Below you can see the subgraph of WordNet spanned by the classes of the few-shot learning benchmark tieredImageNet. The pink dots are the classes. I highlighted some of them so you can see the distance between specific concepts. Again, if you want to play with the graph yourself, check out the dashboard!

the graph of the classes for the few-shot learning benchmark tieredImageNet; distance between hotdog and cheeseburger is 1.39, distance between hotdog and goldfish is 10.13

Realistic tasks are harder for Few-Shot Learning models

6 different Few-Shot Learning algorithms have their performance closely linked to the coarsity of the tasks
Average performance of various Few-Shot Learning models on tieredImageNet for the four quartiles of the semantic benchmark sorted by coarsity. The higher the coarsity, the better the performance. On the right, you can see the performance on the uniform benchmark, which is even higher than on the easiest quartile of the semantic benchmark.

As you might have guessed, the performance of Few-Shot Learning models depends heavily on the coarsity of the task. This means that if you take a model that has been tested on the standard tieredImageNet or miniImageNet benchmarks (with very coarse tasks) and apply it to a real-life use case (most likely with more fine-grained tasks), you will suffer a huge drop in performance.

Going deeper into Few-Shot Learning...

This little article is meant to highlight that common Few-Shot Learning benchmarks are strongly biased toward tasks composed of classes that are very distant from each other.

At Sicara, we have seen a wide variety of industrial applications of Few-Shot Learning, but we have never encountered a scenario that is well approximated by benchmarks with this type of bias. In fact, in our experience, most applications involve discriminating between classes that are semantically close to each other: plates from plates, tools from tools, carpets from carpets, parts of cars from parts of cars, etc.

There are other benchmarks for fine-grained classification. And it's OK that some benchmarks contain tasks that are very coarse-grained. But today, tieredImageNet and miniImageNet are widely used in the literature, and it's important to know what's in them, and how to restore the balance.

If you want to know more about the biases of classical Few-Shot Learning benchmarks and about semantic task sampling, check out our paper Few-Shot Image Classification Benchmarks are Too Far From Reality: Build Back Better with Semantic Task Sampling (presented at the Vision Datasets Understanding Workshop at CVPR 2022). Finally, if you’re interested in Few-Shot Learning in general and want to dive into the code, you can get started with the EasyFSL library.
