
How to Explore Your Object Detection Dataset With Streamlit

Written by Arnault Chazareix
After four years of working on applying data science to real projects, I realized that data science is often more about data than models. I am both impressed by and afraid of the number of data science teams that have trained huge models but have never looked at the data they are training or testing on. The cool thing is that this problem is easy to fix. So, let’s see why and how to explore your object detection dataset with Streamlit.

This post will explore three main topics:

  • Why is exploring your dataset so important?
  • Some examples using MS COCO Dataset and a Streamlit dashboard on what you can expect to find in your dataset
  • A quick tutorial on how to set up a Streamlit dashboard to explore your object detection dataset

It also introduces a GitHub repository that hosts the code for the examples and the corresponding Streamlit dashboard deployed on Streamlit Share.

Why is exploring your object detection dataset so important?

From my personal and professional experience, the dataset is usually the weakest link in most data science projects. Let’s be honest, building a good dataset can be surprisingly hard. The least we can do as data scientists is get to know the dataset we use, and have a way to explore and discuss it.

For the project team

Being able to explore the dataset is a way to:

  • interpret metrics and errors and improve the quality of the dataset: the team can easily check what is wrong with a set of predictions. Maybe the data does not look as expected? Maybe the annotation is wrong? If that is the case, it should be easy to fix.
  • know if the models can handle a situation and plan for new features: sometimes product owners or customers ask whether the model would be able to handle a new label or a new situation. The team should be able to give a first answer and propose some actions quickly. For example:
    New label:
    Question: "Can the model detect scooters?"
    Answer: "We have a motorbike class in the dataset that contains both scooters and motorbikes. We can detect them but not differentiate them from motorbikes."
    Action: "We should relabel the motorbike class to separate them."

    New situation:
    Question: “Will it work under the rain?”
    Answer: “We have no example under the rain, so we don’t know.”
    Action: “We should source new images under the rain to test this”.

For the annotation team

If the annotation team has a dashboard to explore the current dataset, they can use it to answer their own questions:

  • "How should we label this item?": by looking for similar examples in the dashboard and finding the proper label.
  • "What should we do in unexpected situations?": by looking in the current dataset for comparable situations, such as obstructed labels.

This will improve data quality! But it will also make the annotators more autonomous and free up time for the project team to work on the rest of the project.

For the final users, integrators, sales

They won’t have to ask for the meaning of a label and will be able to display examples and troubleshoot issues on their own.

Exploring COCO object detection dataset with Streamlit

Let’s go through an example exploring an Object Detection Dataset using Streamlit.
We will use the toaster class of the MS COCO Dataset. Introduced in “Microsoft COCO: Common Objects in Context”, it is probably the most widely used benchmark and dataset for object detection. Moreover, the team working on COCO has gone to great lengths to build, document, and maintain it.

❤️ Thank you to the COCO Dataset Team for this awesome dataset! ❤️

Why choose the toaster class? It is one of the least represented classes: 225 bounding boxes in the train set and 9 in the validation set. It seems simple enough, as a toaster is a widely used and relatively standard object. Plus, I like toast.
Keep in mind that I chose the toaster class because it is also the first and only class I explored. You will probably be able to find similar issues in every class of every dataset.

I set up the dashboard thanks to Streamlit and Streamlit Share. It is deployed and available here.

Note: All the example images have a link to a Streamlit dashboard in the caption to visualize them.

What is a toaster for the COCO Dataset?

The first thing we can notice by exploring the dataset is that there are two kinds of toasters:

  • The classic toaster
The classic toaster
  • The toaster oven
The toaster oven

Note that the validation set only contains classic toasters, plus a third variety of “toasters”: contact grills.

The contact grill

Another thing you can find in the validation set is this sad toaster.

The sad toaster 😟

You may think it will be extremely hard to detect and will have a harsh impact on your toastery metrics: don’t forget it accounts for more than 10% of the toaster validation set! Well, don’t despair, it has two friends in the train set.

The determined toaster
The sad toaster 2 😟

Now let’s look at some common issues with object detection labeling.

Mislabeling

Mistaking something else for a toaster. You can find some examples here. By scanning through the toasters in the train set, I was able to find 14 probable errors, which represents around 6% of the train set. Let’s be honest, this is exceptionally good, even more so considering that toaster is definitely not one of the main focuses of the dataset and that most toasters are small or far away in the images. Datasets rarely have less than 5% mislabeling.

Not a toaster

Boxing errors

This kind of error is hard to measure and even to define. When is a box incorrect? What does too big or too small mean? It depends on your goal: are you looking for pixel-perfect boxes, or are you just trying to count elements and vaguely locate them? Do you need a certain precision for filtering outliers or for matching?
In the example below, the IoU between a correct prediction and the annotation would probably be less than 40%, which means that if the model predicts properly, it will be counted as an error.
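To make that 40% figure concrete, here is a minimal sketch of an IoU computation between two boxes in (x_min, y_min, x_max, y_max) format; the function is illustrative and not taken from the article’s repository:

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    x_min = max(box_a[0], box_b[0])
    y_min = max(box_a[1], box_b[1])
    x_max = min(box_a[2], box_b[2])
    y_max = min(box_a[3], box_b[3])
    intersection = max(0, x_max - x_min) * max(0, y_max - y_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union else 0.0

# A tight 100x100 prediction inside a loose 200x200 annotation:
# IoU = 10000 / 40000 = 0.25, below the usual 0.5 matching threshold.
print(iou((0, 0, 100, 100), (0, 0, 200, 200)))  # 0.25
```

Since evaluation protocols typically count a prediction as correct only above an IoU threshold (often 0.5), a box that is far too large or too small can turn a perfectly good prediction into a false positive.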

Juxtaposed objects

What do you do when multiple objects are close together? Do you draw one bounding box per object or one grouped box? The common answer is one box per object, but sometimes that is really hard, as in the pile of oranges below:

Juxtaposed objects (oranges and toasters)

Obstructed objects

When an object is hidden behind another, the common practice is to annotate only the visible surface as long as it is greater than a given proportion of the expected object (usually > 20%). However, it may impact what you do with it later (training and postprocessing). Another good practice is to tag the annotation as truncated or obstructed.

Truncated toaster behind a microwave

Hard to identify objects

Some objects may be impossible to identify: too small, obstructed, or the picture quality may not be good enough. Most of the time the answer is not to annotate them at all, but you can also tag them as hard.
The real question is: do you want to teach your model to detect those? Do you want to impact your metrics with this kind of object? Is this the goal of your final product?

Can you see this toaster?

Picture in picture, drawings, and mirrors

Often overlooked, this occurs in most crowdsourced, crawled, or end-user-based datasets, as opposed to datasets built in controlled environments. The usual decision is to label such objects normally, but you can decide not to in specific use cases, or tag them to be able to differentiate them later on. Here they represent 5% of the train set.

Drawings of toasters

Here is a final bonus to hammer home the fact that when you explore your object detection dataset, the data may always surprise you:

Toaster integrated into a microwave

If we can find all those difficulties in COCO, probably the most widely used Object Detection Dataset, imagine what you could find in yours.

Setting up a Streamlit dashboard to explore your object detection dataset

There are a few ways to set up a dashboard to explore your object detection dataset.
First, you could use premade tools like the amazing FiftyOne by the Voxel51 team, and maybe one day Know Your Data by Google. 🤞
Second, you could build your own solution with dashboarding frameworks like Streamlit or Dash.
The first option grants you advanced features more easily, like user authentication, dataset curation, and model evaluation, and will be easier to maintain.
The second option might be faster to set up; it also gives you more flexibility to customize your solution and lets you share code and practices between your project code and your different dashboards.

I will show you how to do it with a Streamlit dashboard, as it is a nice way to demonstrate the power of Streamlit and how easily it integrates with an existing project to explore your object detection dataset.

The full code for this small example is here.
If you want to check the full code for the deployed app, you can go here.

Requirements

  • Have a Python project with a way to get your annotations into a pandas.DataFrame and your images locally.
  • Install Streamlit: pip install streamlit; you can find more details here.
  • (Optionally, add stqdm to show a progress bar when loading images: pip install stqdm.)
  • Have a recent browser to display the app.
  • Create a Python file explore_dataset.py.
  • Run the app: streamlit run explore_dataset.py.

Ensure you have everything you need

Load and display the annotations:

Code for loading and displaying the data

If you don’t have any dataset, you can download the example parquet file here.
For the rest of the tutorial, we expect a DataFrame looking like this, but you could use your own format and display utilities to create the images (numpy.array, PIL.Image):

st.write(all_annotations)

Let’s try to display one bounding box. If you don’t have code to transform your annotations into an image, you can use the two files in the following folder, as in the example below:

Code for displaying one bounding box
Streamlit dashboard displaying a single image

Making it possible to select labels in the dashboard’s sidebar

Select labels and statistics in the sidebar

You should be able to see a sidebar looking like this and select some labels to display.

Dataset statistics and input to select labels

Display selected annotations

Let’s add the code to display selected annotations.

Display selected annotations

You should be able to see the dashboard looking like the one below and to change the displayed labels.

Displaying selected annotations

And here you go! You can now explore your object detection dataset with this Streamlit dashboard.

Wrapping up

Being able to explore your dataset is essential to the success of your data science projects. Here we showcased a way to explore an Object Detection Dataset with Streamlit, but you could do the same with another tool like FiftyOne or Dash. You could also do this for other tasks like Image Segmentation, or even non-Computer-Vision tasks like Named Entity Recognition in NLP. The cool thing about using Streamlit is that you can capitalize on the tools you already use every day in your project with little new code.

If you want to know more about the capabilities of Streamlit and how to make it work conjointly with DVC to analyze your experiments, you can check this article from our blog.

If you want to know more about annotations tools to improve your dataset, you can check this article from our blog.

Are you looking for Image Recognition Experts? Don't hesitate to contact us!
