After four years of working on applying data science to real projects, I realized that data science is often more about data than models. I am both impressed by and afraid of the number of data science teams that have trained huge models but have never looked at the data they are training or testing on. The cool thing is that this problem is easy to fix. So, let’s see why and how to explore your object detection dataset with Streamlit.
This post will explore three main topics:
- Why is exploring your dataset so important?
- Some examples, using the MS COCO Dataset and a Streamlit dashboard, of what you can expect to find in your dataset
- A quick tutorial on how to set up a Streamlit dashboard to explore your object detection dataset
Why is exploring your object detection dataset so important?
From my personal and professional experience, the dataset is usually the weakest link in most data science projects. Let’s be honest, building a good dataset can be surprisingly hard. The least we can do as data scientists is get to know the dataset we use, and have a way to explore and discuss it.
For the project team
Being able to explore the dataset is a way to:
- interpret metrics and errors and improve the quality of the dataset: the team can easily check what is wrong with a set of predictions. Maybe the data does not look as expected? Maybe the annotation is wrong? If that is the case, it should be easy to fix.
- know if the models can handle a situation and plan for new features: sometimes product owners or customers ask whether the model would be able to handle a new label or a new situation. The team should be able to give a first answer and propose actions quickly. For example:
Question: “Can the model detect scooters?”
Answer: “We have a motorbike class in the dataset that contains both scooters and motorbikes. We can detect them but not differentiate them from motorbikes.”
Action: “We should relabel the motorbike class to separate them.”
Question: “Will it work in the rain?”
Answer: “We have no example in the rain, so we don’t know.”
Action: “We should source new images in the rain to test this.”
For the annotation team
If the annotation team has a dashboard to explore the current dataset, it can use it to answer its own questions:
- “How to label this item?”: by looking for similar examples in the dashboard and finding the proper label
- “What to do in unexpected situations?”: by looking in the current dataset for similar situations, such as obstructed objects.
This will improve data quality! But it will also make them more autonomous and give more time to the project team to work on the rest of the project.
For the final users, integrators, sales
They won’t have to ask for the meaning of a label and will be able to display examples and troubleshoot issues on their own.
Exploring COCO object detection dataset with Streamlit
Let’s go through an example exploring an Object Detection Dataset using Streamlit.
We will use the toaster class of the MS COCO Dataset. Introduced in “Microsoft COCO: Common Objects in Context”, it is probably the most widely used benchmark and dataset for object detection. Moreover, the team working on COCO has gone to great lengths to:
- improve the quality of the dataset as described in the paper;
- provide data exploration and evaluation tools like the explorer and more recently the new FiftyOne integration.
❤️ Thank you to the COCO Dataset Team for this awesome dataset! ❤️
Why choose the toaster class? It is one of the least represented classes: 225 bounding boxes in the train set and 9 in the validation set. It seems simple enough, as a toaster is a widely used and relatively standard object. Plus, I like toast.
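Class sizes like these are quick to check on your own data. The sketch below counts bounding boxes per class from a dictionary in the COCO annotation layout (a categories list and an annotations list). The function name boxes_per_class and the toy data are illustrative assumptions, not part of the original dashboard.

```python
from collections import Counter


def boxes_per_class(coco: dict) -> Counter:
    """Count annotations (bounding boxes) per category name."""
    id_to_name = {cat["id"]: cat["name"] for cat in coco["categories"]}
    return Counter(id_to_name[ann["category_id"]] for ann in coco["annotations"])


# Toy data in the COCO layout; ids and boxes are made up for illustration.
toy_coco = {
    "categories": [{"id": 1, "name": "person"}, {"id": 2, "name": "toaster"}],
    "annotations": [
        {"image_id": 10, "category_id": 1, "bbox": [5, 5, 40, 90]},
        {"image_id": 10, "category_id": 1, "bbox": [60, 8, 35, 85]},
        {"image_id": 11, "category_id": 2, "bbox": [20, 30, 15, 10]},
    ],
}

counts = boxes_per_class(toy_coco)
# Rarest classes first: these are the ones worth inspecting by hand.
for name, count in sorted(counts.items(), key=lambda item: item[1]):
    print(f"{name}: {count}")
```

Sorting from rarest to most frequent puts the small classes, where a single odd example weighs heavily on the metrics, at the top of the list.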
Keep in mind that the toaster class is the first and only class I explored: you will probably be able to find similar things in all the classes of every dataset.
I set up the dashboard thanks to Streamlit and Streamlit Share. It is deployed and available here.
Note: All the example images have a link to a Streamlit dashboard in the caption to visualize them.
What is a toaster for COCO Dataset?
The first thing we can notice by exploring the dataset is that there are two kinds of toasters:
- The classic toaster
- The toaster oven
Note that the validation set contains only classic toasters and a third variety of “toasters”: contact grills.
Another thing you can find in the validation set is this sad toaster.
You may think it will be extremely hard to detect it and have a harsh impact on your toastery metrics: don’t forget it accounts for more than 10% of the toaster validation dataset! Well, don’t despair, it has two friends in the train set.
Now let’s look at some common issues with object detection labeling.
Taking something else for a toaster. You can find some examples here. By scanning through the toasters in the train set, I was able to find 14 probable errors. This represents around 6% of the train set. Let’s be honest, this is exceptionally good, even more so when we take into account the fact that toaster is definitely not one of the main focuses of the dataset and that most toasters are small or far away in the images. Datasets rarely have less than 5% mislabelling.
This kind of error is hard to measure and even to define. When is a box incorrect? What does too big or too small mean? It depends on your goal: are you looking for pixel-perfect, or are you just trying to count elements and vaguely locate them? Do you need a certain precision for filtering outliers or for matching?
In the example below, the IoU between a correct prediction and the annotation would probably be less than 40%, which means that if the model predicts properly, it will be counted as an error.
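To make the effect of a loose annotation concrete, here is a minimal sketch of the standard Intersection over Union computation. The box coordinates are made up for illustration and assume an (x_min, y_min, x_max, y_max) pixel format, which may differ from the format of your own annotations.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(box_a: Box, box_b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap, clipped at zero when boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union > 0 else 0.0


# A loose annotation twice as wide and twice as tall as a tight prediction.
annotation = (0, 0, 200, 200)
prediction = (50, 50, 150, 150)
print(iou(annotation, prediction))  # 0.25
```

With a commonly used matching threshold of 0.5, this tight and arguably correct prediction would be scored as a miss because of the oversized annotation.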
What do you do when multiple objects are close together? Do you draw one bounding box per object or a single grouped one? The common answer is one box per object. But sometimes it may be really hard, such as in the pile of oranges below:
When an object is hidden behind another, the common practice is to annotate only the visible surface as long as it is greater than a given proportion of the expected object (usually > 20%). However, it may impact what you do with it later (training and postprocessing). Another good practice is to tag the annotation as truncated or obstructed.
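The visibility rule above can be written down as a small decision helper. This is only a sketch: the function name annotation_decision and the returned tag names are hypothetical, the 20% default comes from the usual proportion mentioned above, and the expected area is whatever estimate of the full object your annotation guidelines define.

```python
def annotation_decision(
    visible_area: float,
    expected_area: float,
    min_visible_ratio: float = 0.2,
) -> str:
    """Decide how to handle a partially hidden object.

    Returns "skip" when too little of the object is visible,
    "truncated" when it is annotated but partly hidden,
    and "visible" when the whole object can be seen.
    """
    if expected_area <= 0:
        raise ValueError("expected_area must be positive")
    ratio = visible_area / expected_area
    if ratio <= min_visible_ratio:
        return "skip"
    return "truncated" if ratio < 1.0 else "visible"


print(annotation_decision(visible_area=10, expected_area=100))   # skip
print(annotation_decision(visible_area=50, expected_area=100))   # truncated
print(annotation_decision(visible_area=100, expected_area=100))  # visible
```

Storing the tag next to the box keeps the decision reversible: truncated objects can be filtered out or kept at training and evaluation time.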
Hard to identify objects
Some objects may be impossible to identify: too small, obstructed, or the quality of the picture may not be good enough. Most of the time the answer is not to annotate it at all, but you can also tag it as hard.
The real question is: do you want to teach your model to detect those? Do you want to impact your metrics with this kind of object? Is this the goal of your final product?
Picture in picture, drawings, and mirrors
Often overlooked, this occurs in most crowdsourced, crawled, or end-user-based datasets, as opposed to datasets built in controlled environments. The usual decision is to label them normally, but you can decide not to do it in specific use cases, or decide to tag them to be able to differentiate them later on. Here it represents 5% of the train set.
Here is a final bonus to hammer in the fact that if you explore your object detection dataset, the data may always surprise you:
If we can find all those difficulties in COCO, probably the most widely used Object Detection Dataset, imagine what you could find in yours.
Setting up a Streamlit dashboard to explore your object detection dataset
There are a few ways to set up a dashboard to explore your object detection dataset.
First, you could use premade tools like the amazing FiftyOne by the Voxel51 team and maybe one day Know Your Data by Google. 🤞
Secondly, you could build your own solution with dashboarding solutions like Streamlit or Dash.
Choosing the first solution gives you advanced features such as user authentication, dataset curation, and model evaluation more easily, and will be easier to maintain.
Setting up the second solution might be faster. It also gives you more flexibility to customize your solution, and lets you share code and practices between your project code and your different dashboards.
I will show you how to do it with a Streamlit dashboard, as it is a nice way to demonstrate the power of Streamlit and how easily it integrates into your project to explore your object detection dataset.
- Have a Python project with a way to get your annotations in a pandas.DataFrame and your images locally.
- Install streamlit: pip install streamlit (you can find more details here).
- (You can add stqdm to add a loading bar when loading images: pip install stqdm)
- Have a recent browser to display the app
- Create a Python file
- Run the app: streamlit run explore_dataset.py
Ensure you have everything you need
Load and display the annotations:
If you don’t have any dataset, you can download the example parquet file here.
For the rest of the tutorial, we expect to have a DataFrame looking like this, but you could use your own format and display utils to create the images (
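As a rough sketch of this loading step, the code below reads a parquet file into a DataFrame, checks that the columns it needs are present, and shows the table in the app. The column names in REQUIRED_COLUMNS and all function names are hypothetical: the example file may use different ones, so adapt them to your own format. Reading parquet with pandas also assumes a parquet engine such as pyarrow is installed.

```python
from typing import Iterable, List

# Hypothetical schema: one row per bounding box, pixel coordinates.
REQUIRED_COLUMNS = ("image_name", "label", "x_min", "y_min", "x_max", "y_max")


def missing_columns(
    columns: Iterable[str], required: Iterable[str] = REQUIRED_COLUMNS
) -> List[str]:
    """Return the required column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in required if name not in present]


def load_annotations(path: str):
    """Read the annotations and fail early if the schema is not the expected one."""
    import pandas as pd  # lazy import: the check above works without pandas

    annotations = pd.read_parquet(path)
    missing = missing_columns(annotations.columns)
    if missing:
        raise ValueError(f"Missing columns in {path}: {missing}")
    return annotations


def show_annotations(path: str):
    """Display the raw annotation table in the Streamlit app."""
    import streamlit as st  # lazy import: only needed inside the app

    annotations = load_annotations(path)
    st.title("Dataset exploration")
    st.dataframe(annotations)
    return annotations
```

Calling show_annotations with the path to your own file at the top level of explore_dataset.py would render the table when the app is started with streamlit run explore_dataset.py.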
Let’s try to display one bounding box. If you don’t have code to turn an annotation into an image, you can take the two files in the following folder and use them as in the example below:
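If you prefer not to depend on those helper files, a minimal alternative sketch with Pillow could look like the following. It does not reproduce the helpers mentioned above: the function names are hypothetical, and the box is assumed to be given as (x_min, y_min, x_max, y_max) in pixels.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


def clamp_box(box: Box, width: int, height: int) -> Tuple[int, int, int, int]:
    """Round a box to integer pixels and keep it inside the image."""
    x_min, y_min, x_max, y_max = box
    return (
        max(0, min(width - 1, round(x_min))),
        max(0, min(height - 1, round(y_min))),
        max(0, min(width - 1, round(x_max))),
        max(0, min(height - 1, round(y_max))),
    )


def draw_box(image_path: str, box: Box, label: str):
    """Return an in-memory copy of the image with one labeled box drawn on it."""
    from PIL import Image, ImageDraw  # lazy import: only needed when drawing

    image = Image.open(image_path).convert("RGB")
    corners = clamp_box(box, image.width, image.height)
    draw = ImageDraw.Draw(image)
    draw.rectangle(corners, outline="red", width=3)
    draw.text((corners[0] + 4, corners[1] + 4), label, fill="red")
    return image


def show_one_box(image_path: str, box: Box, label: str) -> None:
    """Display the annotated image in the Streamlit app."""
    import streamlit as st  # lazy import: only needed inside the app

    st.image(draw_box(image_path, box, label), caption=label)
```

Clamping first avoids drawing outside the picture when an annotation slightly overflows the image borders, which is itself a useful thing to notice while exploring.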
Making it possible to select labels in the dashboard’s sidebar
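One possible way to do this is a multi-select widget in the sidebar, sketched below. The function names are hypothetical, and the labels are assumed to come from a label column of your annotations (for example annotations["label"] with a pandas DataFrame).

```python
from typing import Iterable, List


def sorted_labels(labels: Iterable[str]) -> List[str]:
    """Unique labels in alphabetical order, used as the sidebar options."""
    return sorted(set(labels))


def select_labels(labels: Iterable[str]) -> List[str]:
    """Render a multi-select widget in the sidebar and return the chosen labels."""
    import streamlit as st  # lazy import: sorted_labels stays usable without Streamlit

    options = sorted_labels(labels)
    st.sidebar.title("Filters")
    # Preselect the first label so the page is not empty on first load.
    return st.sidebar.multiselect("Labels", options, default=options[:1])
```

Streamlit reruns the script every time the selection changes, so the value returned by select_labels always reflects what is currently ticked in the sidebar.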
You should be able to see a sidebar looking like this and select some labels to display.
Display selected annotations
Let’s add the code to display selected annotations.
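A minimal sketch of that step is shown below. It filters the annotations on the selected labels and displays a crop of each matching box, which makes it quick to scan a whole class. The record keys (image_name, label, x_min, y_min, x_max, y_max), the function names, and the limit of 20 items are assumptions for illustration; with a pandas DataFrame, the records could be obtained with annotations.to_dict("records").

```python
from pathlib import Path
from typing import Dict, Iterable, List


def select_annotations(
    records: Iterable[Dict], selected_labels: Iterable[str], limit: int = 20
) -> List[Dict]:
    """Keep the annotations whose label is selected, capped at `limit` items."""
    wanted = set(selected_labels)
    matches = [record for record in records if record["label"] in wanted]
    return matches[:limit]


def show_selected(
    records: Iterable[Dict],
    selected_labels: Iterable[str],
    image_dir: str,
    limit: int = 20,
) -> None:
    """Display one cropped bounding box per selected annotation."""
    import streamlit as st  # lazy imports: only needed inside the app
    from PIL import Image

    matches = select_annotations(records, selected_labels, limit)
    st.write(f"Showing {len(matches)} annotation(s)")
    for record in matches:
        image = Image.open(Path(image_dir) / record["image_name"])
        box = tuple(
            int(round(record[key])) for key in ("x_min", "y_min", "x_max", "y_max")
        )
        st.image(image.crop(box), caption=f'{record["label"]} in {record["image_name"]}')
```

Capping the number of displayed items keeps the page responsive on large classes; a slider in the sidebar would be a natural way to let the user change that cap.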
You should be able to see the dashboard looking like below and change the labels.
And here you go! You can now explore your object detection dataset with this Streamlit dashboard.
Being able to explore your dataset is essential to the success of your data science projects. Here we showcased a way to explore an Object Detection Dataset with Streamlit but you could do the same with another tool like FiftyOne or Dash. You could also do this for other tasks like Image Segmentation or even non Computer Vision tasks like Named Entity Recognition in NLP. The cool thing with using Streamlit is that you can capitalize on the tools you already use every day in your project with little new code.
If you want to know more about the capabilities of Streamlit and how to make it work together with DVC to analyze your experiments, you can check this article from our blog.
If you want to know more about annotation tools to improve your dataset, you can check this article from our blog.