Presenting business cases where self-supervised learning in computer vision is a game-changer, and an introduction to Dino, a state-of-the-art self-supervised learning model based on transformers.
The complexity of computer vision projects is often related to data. Sometimes collecting data is a heavy task, but in many cases the data is available and simply not well exploited. In the project I am working on, I have more than 1M images. That’s pretty cool! But labeling that many images takes a lot of time and/or money. In this article, I will talk about Dino, a self-supervised learning (SSL) model based on Vision Transformers (ViT) that stands out compared to convolutional networks.
The data boom and self-supervised learning
The volume of data created, captured, copied, and consumed globally has increased exponentially over the past 20 years. According to a Statista study, this volume is forecast to grow from approximately 64 ZB in 2020 to 181 ZB in 2025.
Considering the volume of data we have today and how fast it is growing, annotating all unstructured data would be practically impossible. Besides, in several fields, labeling data requires sharp domain knowledge that only experts have. In radiology, for example, trained physicians must assess various medical images and report the results to detect, characterize, and monitor diseases.
In a domain with as much potential as AI, labeling data should not be a barrier to building intelligent models. In deep learning, we frequently compare neural networks to the way the human brain learns. But as humans, we don’t need to see several labeled images of the same object to learn what it is: a human can recognize an object after seeing it, even without putting a label on it. As the Meta AI article explains, “Common sense helps people learn new skills without requiring massive amounts of teaching for every single task.” Self-supervised learning may be a way to get closer to the common sense we humans have.
Self-supervised learning business use cases
Concretely, SSL may be convenient for several business cases where getting data is relatively easy. To name a few, I think self-supervised learning is particularly well suited to the industrial sector and retail.
In an industrial production system, we need to check the quality of the product at different steps of the chain. In this kind of system, the product flow is huge: cameras at each step of the process can yield millions of product images in a year. What is great about this use case is that installing cameras impacts no one: it does not slow employees down in their work, and users will not see a difference. Therefore, I can start collecting data before building my model. Obviously, in this case, we have to ensure that the dataset quality is acceptable for our model, mainly by keeping the lighting good and homogeneous. Once our dataset is ready, we are good to train our self-supervised model.
In retail, automated billing is booming. In some cases, barcode tags or RFID may be sufficient, but these methods have their limitations: the lifetime of the tag, compatibility with products (e.g. RFID is not compatible with metals), etc. This is why some retail businesses lean toward computer vision algorithms.
What is complicated in this use case is that the client has to be part of the process. For example, if a client buys fruits and vegetables, a photo of their basket has to be taken. In this case, the best approach is to have a model in production as soon as possible, collect data as we go, and iterate. Once the model is working in production comes scalability: running the model in several stores means more data volume and variability, and annotation becomes more and more difficult. In a similar project at Sicara, we found ourselves with 1M unannotated, unused images. And that’s one of the reasons why we turned to self-supervised learning.
Facebook and Inria researchers introduced Dino in the paper Emerging Properties in Self-Supervised Vision Transformers in April 2021. It is a self-supervised learning model that is based on Vision Transformers (ViT).
Dino uses self-distillation (“performing knowledge distillation against an individual model of the same architecture”, as the Microsoft Research blog puts it), where a student network is trained to match the output of a teacher network. Each network receives a different augmented version of the original image: the teacher sees only global crops, while the student sees both global and local crops. The goal is to teach the network that the representations should match, since they come from the same original image.
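To make the self-distillation idea concrete, here is a minimal NumPy sketch of one training step. This is not the paper’s implementation: the “networks” are stand-in linear projections instead of ViT backbones, only two global views are used, and the hyperparameter values (temperatures, momentum) are merely indicative.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits, temp):
    z = logits / temp
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Toy stand-ins for the student and teacher backbones (a real setup uses ViTs).
dim_in, dim_out = 16, 8
student_w = rng.normal(size=(dim_in, dim_out))
teacher_w = student_w.copy()        # the teacher starts as a copy of the student

center = np.zeros(dim_out)          # running center of the teacher outputs
momentum, center_momentum = 0.996, 0.9
temp_student, temp_teacher = 0.1, 0.04   # the teacher output is sharpened more

def dino_loss(student_out, teacher_out):
    # Teacher targets are centered and sharpened; they act as constants (no gradient).
    targets = softmax(teacher_out - center, temp_teacher)
    probs = softmax(student_out, temp_student)
    return -(targets * np.log(probs + 1e-9)).sum(axis=-1).mean()

# Two augmented global views of the same batch of images.
x = rng.normal(size=(4, dim_in))
view1 = x + 0.1 * rng.normal(size=x.shape)
view2 = x + 0.1 * rng.normal(size=x.shape)

# Cross-view matching: the student on one view must match the teacher on the other.
loss = 0.5 * (dino_loss(view1 @ student_w, view2 @ teacher_w)
              + dino_loss(view2 @ student_w, view1 @ teacher_w))

# After the optimizer updates the student, the teacher becomes an EMA of the
# student, and the center an EMA of the teacher outputs (this avoids collapse).
teacher_w = momentum * teacher_w + (1 - momentum) * student_w
teacher_out = np.concatenate([view1 @ teacher_w, view2 @ teacher_w])
center = center_momentum * center + (1 - center_momentum) * teacher_out.mean(axis=0)
```

Note the two tricks that prevent the student and teacher from collapsing to a constant output: centering (subtracting a running mean from the teacher logits) and sharpening (a lower teacher temperature).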
As I mentioned earlier, the student and teacher networks share the same architecture. In the paper, the authors focus on ResNet and ViT, but the latter gives better results in general, and especially in image retrieval, so I am going to focus on it, starting with an introduction to transformers.
Transformers appeared first in NLP in 2017 with Attention Is All You Need and are now popular in many areas of artificial intelligence.
It is the “first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution”, as explained in the paper. A model with a sequential nature is difficult to scale or parallelize efficiently, so we quickly run into performance issues. Since transformers are based on attention, the decoder can look back directly at any particular part of the past, which shortens the paths between dependent positions.
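To make “relying entirely on self-attention” concrete, here is a minimal single-head scaled dot-product self-attention in NumPy. It is only a sketch: real transformers use multiple heads, learned projections, positional encodings, and residual connections around this core operation.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a sequence x."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])       # every position attends to every other
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 5, 16, 8
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
out, attn = self_attention(x, w_q, w_k, w_v)
# out holds one d_head-dimensional vector per position; each row of attn sums to 1
```

Because every position attends to every other in a single matrix product, the path between any two positions has length one, regardless of how far apart they are in the sequence, and the whole computation parallelizes as plain matrix multiplications.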
If you want to deep dive into transformers, in addition to reading the paper I recommend taking a look at this article which explains the paper in detail starting with algebra reminders (which is always useful). And for those who prefer getting the essence without diving deep into details, you can take a look at Yannic Kilcher’s youtube video. I find his explanation very clear!
Vision Transformer is an application of transformers to images. It was introduced in the paper An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale at the end of 2020. The idea behind this model is that an image is cut into patches, and the list of flattened patches is passed to a Transformer encoder.
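The patching step is easy to sketch: a 224×224 RGB image cut into 16×16 patches yields 14×14 = 196 tokens of 16·16·3 = 768 values each. Here is a minimal NumPy illustration, without the learned linear embedding and position embeddings that follow in ViT.

```python
import numpy as np

def patchify(image, patch_size=16):
    """Cut an (H, W, C) image into flattened non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    grid = image.reshape(h // patch_size, patch_size, w // patch_size, patch_size, c)
    grid = grid.transpose(0, 2, 1, 3, 4)  # bring the two patch-grid axes together
    return grid.reshape(-1, patch_size * patch_size * c)

image = np.zeros((224, 224, 3))
tokens = patchify(image)
print(tokens.shape)  # (196, 768): a 14x14 grid of patches, each 16*16*3 values
```

Each of these 196 flattened patches plays the role a word embedding plays in NLP: the “16x16 words” of the paper’s title.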
As mentioned in the paper “Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train”.
If you want to learn more about Dino nothing better than the explanation of the researchers themselves that you can find here.
To wrap up, this was an introduction to Dino and to the utility of self-supervised learning models in new business projects. I hope you enjoyed this article, and stay tuned for new content on how I retrained Dino from scratch on a real-world dataset!
Are you looking for AI Experts? Don't hesitate to contact us!