
Stable Diffusion Inpainting: Generate a Custom Dataset for Object Detection

Written by Gabriel Guerin

You surely know that Deep Learning models need a tremendous amount of data to get good results. Object Detection models are no different.

To train a model like YOLOv5 to automatically detect the object of your choice, your favorite toy for example, you will need to take thousands of images of your toy in many different contexts. And for each image, you will need to create a text file containing the toy's position in the image.

This is obviously very time-consuming.

This article proposes using image segmentation and Stable Diffusion to automatically generate an Object Detection dataset for any class.

Custom dataset generation pipeline (source of dog image)

The pipeline to generate an object detection dataset is composed of four steps:

  • Find a dataset containing objects similar to our toy cat (dogs, for example).
  • Use image segmentation to generate a mask of the dog.
  • Fine-tune the Stable Diffusion Inpainting Pipeline from the 🧨Diffusers library.
  • Run the Stable Diffusion Inpainting Pipeline using our dataset and the generated masks.

Image Segmentation: Generate mask images

The Stable Diffusion Inpainting Pipeline takes as input a prompt, an image, and a mask image. The pipeline will generate content from the prompt only for the white pixels of the mask image.

PixelLib helps us do image segmentation in just a few lines of code. In this example we will use the PointRend model to detect our dog. This is the code for image segmentation.

Image segmentation using pixellib
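The original code gist is not reproduced here, but the segmentation step can be sketched as follows with PixelLib's PyTorch backend. The file names (`pointrend_resnet50.pkl` checkpoint, `dog.jpg`, `dog_segmented.jpg`) are assumptions; running it requires downloading the PointRend checkpoint first.

```python
# Sketch of the segmentation step using PixelLib's PyTorch backend.
# Assumes a PointRend checkpoint "pointrend_resnet50.pkl" is downloaded locally
# and "dog.jpg" is the input image (both names are placeholders).
from pixellib.torchbackend.instance import instanceSegmentation

ins = instanceSegmentation()
ins.load_model("pointrend_resnet50.pkl")

# Restrict detection to the COCO "dog" class only.
target_classes = ins.select_target_classes(dog=True)

# results: dict with 'boxes', 'class_ids', 'masks', 'scores', ...
# output: the original image blended with masks and bounding boxes.
results, output = ins.segmentImage(
    "dog.jpg",
    show_bboxes=True,
    segment_target_classes=target_classes,
    output_image_name="dog_segmented.jpg",
)
```

The boolean masks we need for the next step live in `results["masks"]`, one channel per detected instance.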

The segmentImage function returns a tuple:

  • results : A dict containing information about 'boxes', 'class_ids', 'class_names', 'object_counts', 'scores', 'masks', 'extracted_objects'.
  • output : The original image blended with the masks and the bounding boxes (if show_bboxes is set to True)

Create a mask image

We create a mask containing only black or white pixels. We make the mask bigger than the original dog to give Stable Diffusion room to inpaint our toy cat.
To do so, we translate the mask 10 pixels to the left, right, top, and bottom, and add these translated masks to the original mask.

Generate mask image from pixellib output
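The mask-widening step described above can be sketched in NumPy, assuming the mask is a 2-D boolean array extracted from PixelLib's `results["masks"]` (function names here are illustrative, not from the original gist):

```python
# Minimal sketch: shift the binary mask 10 px left/right/up/down and OR the
# shifted copies into the original, then scale it to black/white pixel values.
import numpy as np

def widen_mask(mask: np.ndarray, shift: int = 10) -> np.ndarray:
    """mask: 2-D boolean array (True where the object is)."""
    widened = mask.copy()
    widened[:, :-shift] |= mask[:, shift:]   # shifted left
    widened[:, shift:] |= mask[:, :-shift]   # shifted right
    widened[:-shift, :] |= mask[shift:, :]   # shifted up
    widened[shift:, :] |= mask[:-shift, :]   # shifted down
    return widened

def to_mask_image(mask: np.ndarray) -> np.ndarray:
    """Convert a boolean mask to a 0/255 grayscale image array."""
    return mask.astype(np.uint8) * 255
```

The resulting array can be saved with `PIL.Image.fromarray` and used directly as the pipeline's mask image.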

And voilà! We now have our dog's original image and its corresponding mask.

Generate a mask based on the dog image with pixellib

Fine-tune the Stable Diffusion Inpainting Pipeline

Dreambooth is a technique to fine-tune Stable Diffusion. With very few photos we can teach new concepts to the model. We are going to use this technique to fine-tune the Inpainting Pipeline. The train_dreambooth_inpaint.py script shows how to fine-tune the Stable Diffusion model on your own dataset. Just a few images (e.g. 5) are needed to train the model.

Hardware Requirements for Fine-tuning

Using gradient_checkpointing and mixed_precision, it should be possible to fine-tune the model on a single 24GB GPU. For a higher batch_size and faster training, it’s better to use GPUs with more than 30GB of GPU memory.

Installing the dependencies

Before running the scripts, make sure to install the library’s training dependencies:

pip install git+https://github.com/huggingface/diffusers.git
pip install -U -r requirements.txt

And initialize an 🤗Accelerate environment with:

accelerate config

You have to be a registered user in Hugging Face Hub, and you’ll also need to use an access token for the code to work. For more information on access tokens, please refer to this section of the documentation.

Run the following command to authenticate with your token:

huggingface-cli login

Fine-tuning Example

Hyperparameter tuning is key when running these computationally expensive training jobs. Try different parameters depending on the machine you’re running the training on, but I recommend starting with the ones below.

Run the train_dreambooth_inpaint.py script
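The original command is not reproduced here, but a plausible invocation looks like the following. The `runwayml/stable-diffusion-inpainting` base model is the one named later in the article; the directories, the `sks` placeholder token, and the exact hyperparameter values are assumptions to adapt to your setup.

```shell
# Hedged example invocation of the Dreambooth inpainting fine-tuning script.
# "./toy_cat" holds the ~5 instance photos; "sks" is a rare placeholder token.
export MODEL_NAME="runwayml/stable-diffusion-inpainting"
export INSTANCE_DIR="./toy_cat"
export OUTPUT_DIR="./toy-cat-inpainting-model"

accelerate launch train_dreambooth_inpaint.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --instance_data_dir=$INSTANCE_DIR \
  --output_dir=$OUTPUT_DIR \
  --instance_prompt="a photo of a sks toy cat" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --gradient_checkpointing \
  --learning_rate=5e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=500
```

`--gradient_checkpointing` is what makes the single-24GB-GPU setup mentioned above feasible, at the cost of slower steps.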

Run the Stable Diffusion Inpainting pipeline

Stable Diffusion Inpainting is a text-to-image diffusion model capable of generating photo-realistic images from any text input, by inpainting the parts of a picture covered by a mask.
The 🧨Diffusers library makes running the pipeline really easy.

Run the Stable Diffusion Inpainting Pipeline with our fine-tuned model
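The original inference gist is not shown here; a sketch of the inference step with 🧨Diffusers follows. The model directory, file names, and the `sks toy cat` prompt token are assumptions carried over from the fine-tuning sketch, and running it needs a GPU.

```python
# Sketch: run the fine-tuned inpainting model on the dog image and its mask.
# Paths and the "sks" token are placeholders matching the fine-tuning step.
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "./toy-cat-inpainting-model",   # our fine-tuned model directory
    torch_dtype=torch.float16,
).to("cuda")

image = Image.open("dog.jpg").resize((512, 512))
mask = Image.open("dog_mask.png").resize((512, 512))

# The pipeline only regenerates the white pixels of the mask.
result = pipe(
    prompt="a photo of a sks toy cat",
    image=image,
    mask_image=mask,
).images[0]
result.save("toy_cat.jpg")
```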


To summarize, we have:

  • Generated a mask image using image segmentation with pixellib, on a dog image.
  • Fine-tuned the runwayml/stable-diffusion-inpainting model to make it learn a new toy cat class.
  • Run the StableDiffusionInpaintPipeline with our fine-tuned model on our dog image with the generated mask.

Final results

After all these steps, we have generated a new image of a toy cat located at the same place as the dog, so the same bounding box can be used for both images.
Result of the dataset generation pipeline
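Since the bounding box is shared between the original and generated images, the YOLO label can be derived once from the mask. A sketch of that step, with an illustrative helper name (not from the original article):

```python
# Sketch: derive one YOLO-format label line ("class x_center y_center width
# height", all normalized to [0, 1]) from the binary object mask, so the same
# line annotates both the dog image and the generated toy-cat image.
import numpy as np

def mask_to_yolo_label(mask: np.ndarray, class_id: int = 0) -> str:
    """mask: 2-D boolean array with True on the object's pixels."""
    ys, xs = np.where(mask)
    h, w = mask.shape
    x_min, x_max = xs.min(), xs.max()
    y_min, y_max = ys.min(), ys.max()
    x_c = (x_min + x_max + 1) / 2 / w   # normalized box center x
    y_c = (y_min + y_max + 1) / 2 / h   # normalized box center y
    bw = (x_max - x_min + 1) / w        # normalized box width
    bh = (y_max - y_min + 1) / h        # normalized box height
    return f"{class_id} {x_c:.6f} {y_c:.6f} {bw:.6f} {bh:.6f}"
```

Writing this line to a `.txt` file next to each image yields the annotation format YOLOv5 expects.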

We can now generate new images for all the images of our dataset!


Stable Diffusion does not always output convincing results. Some cleaning will be necessary at the end of the dataset generation.

Note that this pipeline is computationally expensive. Fine-tuning Stable Diffusion needs a 24GB GPU machine, and at inference, even though a lot of improvements have been made, we still need a few GB of GPU memory to run the pipeline.

This way of creating datasets is interesting when the images needed for the dataset are hard (or impossible) to obtain. For example, if, like Pyronear, a French open-source project, you want to detect forest fire outbreaks, it is preferable to use this technique rather than burning trees, obviously. But keep in mind that the standard way of labeling datasets is less energy-consuming.

This article was written by Gabriel Guerin.