### Introduction to Volume Computation of Objects

Have you ever worked on **volume computation of objects** from videos using AI? Did it prove to be either not relevant or highly difficult to achieve? If you answered ‘yes’ to both questions, and are a fan of volume computing or AI, then this article is for you. Using two videos, I will show you how to compute the volume of objects.

Use cases mostly include inventory management. As more and more companies try to reduce their environmental impact, such a tool would improve the efficiency of supply chains and inventory management. However, this tool can also be used for furniture placement: interior design, event planning, construction, or renovation.

Volume computation of objects raises several **challenges**:

- locating the objects within the video
- defining their boundaries
- scaling the video to convert a pixel distance to meters

This article goes through the following steps to retrace how to perform volume computation of objects: the set-up and assumptions, the preprocessing, and the volume computation process itself.

### Underlying Set-Up and Assumptions

To perform volume computation of objects, I rely on a **few assumptions:**

- The area filmed is a closed room.
- This room is **filmed twice** from the same point of view: once **without** the objects and once **with** them.
- The videos are **identical**, except for the objects.
- The objects **lie** directly **against a wall**, without any space.

For simplicity, in the remainder of this article, the filmed room is empty, and the objects are rectangular boxes. Therefore, volume computation will be performed on these boxes. An example of a video sequence respecting these assumptions:

The most important tool for volume computation is the camera. A normal 2D camera only provides color values for the pixels filmed. In this article, the set-up relies on a special camera, the Intel RealSense D435. This camera includes **two infrared cameras** that offer a 3D representation of what is filmed. A Python API, *pyrealsense2*, makes it possible to retrieve a 3D coordinate for every pixel. Thus, one can get the **distance in meters** from the camera along the x-axis (right to left), the y-axis (up to down), and the z-axis (close to far) for all pixels. Any other camera can be used as long as it provides such coordinates.

A few words about pyrealsense2. Pyrealsense2 is the **Python wrapper** for **Intel RealSense SDK 2.0**, which is a library for Intel RealSense cameras. The SDK can normally be accessed by using **librealsense**, a C++ package. According to the GitHub page of librealsense, “*The SDK allows depth and color streaming, and provides intrinsic and extrinsic calibration information. The library also offers synthetic streams (pointcloud, depth aligned to color and vise-versa), and a built-in support for record and playback of streaming sessions.*” To provide these services, the SDK uses a **deterministic model** that relies on the input from the two infrared cameras.
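As a rough sketch of how such per-pixel coordinates might be retrieved, the snippet below uses the `rs.pointcloud` helper of pyrealsense2 to turn one depth frame into an array of (x, y, z) values in meters. The 640x480 resolution, the 30 fps rate, and the function names `capture_xyz` and `vertices_to_grid` are assumptions made for illustration, not necessarily the settings used for this article.

```python
import numpy as np


def vertices_to_grid(vertices, height, width):
    """Reshape a flat (height * width, 3) vertex array into an image-shaped (height, width, 3) grid."""
    return np.asarray(vertices, dtype=np.float32).reshape(height, width, 3)


def capture_xyz(width=640, height=480, fps=30):
    """Grab one depth frame from a connected RealSense camera and return per-pixel (x, y, z) in meters."""
    import pyrealsense2 as rs  # imported lazily so the helper above works without the SDK installed

    pipeline = rs.pipeline()
    config = rs.config()
    config.enable_stream(rs.stream.depth, width, height, rs.format.z16, fps)
    pipeline.start(config)
    try:
        frames = pipeline.wait_for_frames()
        depth_frame = frames.get_depth_frame()
        points = rs.pointcloud().calculate(depth_frame)
        # One (x, y, z) triple per depth pixel; copy so the data outlives the frame buffer.
        flat = np.asanyarray(points.get_vertices()).view(np.float32).reshape(-1, 3).copy()
        return vertices_to_grid(flat, height, width)
    finally:
        pipeline.stop()
```

With a camera plugged in, `xyz = capture_xyz()` would give an array where `xyz[row, col]` holds the three coordinates of that pixel.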

### Preprocessing

Let’s now dig into the process of volume computation. Here is the **depth** from the images of these two videos:

These images illustrate a classic statistical problem, **outliers:**

- Black areas on the images of the videos correspond to pixels without any coordinates (depth is set to 0).
- Colorful areas are flawed coordinates (depth is 10 meters, while the wall is 1.5 meters away from the camera).

#### Addressing Outliers: Best Practices

First, let’s define more precisely what an **outlier** is. Here, an outlier is a pixel whose z-coordinate (corresponding to the depth) is equal to 0 or higher than 1.6 meters. Since the wall is around 1.5 meters away from the camera, no valid pixel can have a z-value higher than 1.6 meters.

This rule also selects the pixels with **aberrant** x and y values. Indeed, a quick check shows that pixels with a z-coordinate value of 0 or higher than 1.6 also have aberrant x- and y-coordinate values. Conversely, no pixel was found with aberrant x- and y-coordinate values but a z-coordinate value between 0 and 1.6. Therefore, the z-coordinate value alone defines whether a pixel is an outlier.
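The rule above can be sketched in a few lines of NumPy. The array layout (an `(H, W, 3)` array holding x, y, z per pixel) and the name `outlier_mask` are assumptions made for this illustration.

```python
import numpy as np

MAX_DEPTH_M = 1.6  # threshold from the text: the wall is about 1.5 m from the camera


def outlier_mask(xyz, max_depth=MAX_DEPTH_M):
    """Return a boolean (H, W) mask that is True where z is 0 or above max_depth."""
    z = xyz[..., 2]
    return (z == 0) | (z > max_depth)


# Tiny example: one missing pixel (z = 0) and one flawed pixel (z = 10 m).
xyz = np.zeros((2, 2, 3))
xyz[..., 2] = [[1.5, 0.0], [10.0, 1.2]]
mask = outlier_mask(xyz)
```

Here `mask` flags the two pixels with depths 0 and 10, and leaves the pixels at 1.5 m and 1.2 m untouched.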

Let’s take one specific pixel of one image of one video and imagine that this pixel is an outlier.

The **best way** to replace the **z-coordinate** is to use the z-coordinate of the closest **non-outlier pixel**. Here, the “closest non-outlier pixel” is the pixel with the lowest Euclidean distance, measured in pixels, from our pixel.
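A minimal, brute-force sketch of this replacement is shown below. The function name `fill_z_with_nearest` is hypothetical, and ties between equally close pixels are broken arbitrarily (first in row-major order), which the article does not specify.

```python
import numpy as np


def fill_z_with_nearest(z, mask):
    """Replace z where mask is True with the z of the closest valid pixel.

    Distance is the Euclidean distance between pixel positions (rows, columns).
    """
    filled = z.astype(float).copy()
    valid_rc = np.argwhere(~mask)  # (N, 2) array of row/col positions of valid pixels
    if len(valid_rc) == 0:
        raise ValueError("no valid pixel to copy from")
    for r, c in np.argwhere(mask):
        d2 = (valid_rc[:, 0] - r) ** 2 + (valid_rc[:, 1] - c) ** 2
        nr, nc = valid_rc[np.argmin(d2)]
        filled[r, c] = z[nr, nc]
    return filled


z = np.array([[1.5, 0.0, 0.0, 1.2]])
filled = fill_z_with_nearest(z, z == 0)
```

This loop compares every outlier with every valid pixel, so it is only practical on small images. On full frames, a vectorized nearest-neighbor lookup such as `scipy.ndimage.distance_transform_edt` with `return_indices=True` would likely be a better fit.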

The method used to retrieve **x-** and **y-coordinates** is **more complicated**. Focusing first on x, let’s use the grid below to visualize the method. This grid gives the x-coordinates for pixels located on a 7x7 portion of the image. The **red** cells represent **outliers**, the **green** cells represent **non-outliers**. The x-coordinate value is written **inside** the non-outlier pixels.

Here, I want to compute the **x-coordinate** for the **circled outlier** “**?**” located in (3,2). To do so, I first select the **two closest pixels** to this outlier (circled in the grid) that are not in the same column. The **x-coordinate difference** between these two pixels is 0.82 - 0.80 = **0.02**. These two pixels are **one column away** from each other. Therefore, one can estimate that **moving by one column** changes the x-value by **0.02**. Since the outlier is two columns away from the circled pixel “0.82”, its **estimated value is 0.78** (= 0.82 - 2 × 0.02).

The **same method** is used to replace y-coordinate values for outliers.
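The extrapolation described above can be sketched as follows. This is one possible reading of the rule: “not in the same column” is taken to mean that the two reference pixels sit in different columns, and for y-coordinates the same logic is applied to rows. The function name `estimate_coordinate` and the small grid are made up for the example.

```python
import numpy as np


def estimate_coordinate(values, mask, row, col, axis=1):
    """Estimate values[row, col] from the two closest valid pixels that differ along `axis`.

    axis=1 compares columns (x-coordinates); axis=0 compares rows (y-coordinates).
    """
    valid_rc = np.argwhere(~mask)
    if len(valid_rc) < 2:
        raise ValueError("need at least two valid pixels")
    d2 = (valid_rc[:, 0] - row) ** 2 + (valid_rc[:, 1] - col) ** 2
    ordered = valid_rc[np.argsort(d2, kind="stable")]
    first = ordered[0]
    # Next closest pixel that is not on the same column (or row) as the first one.
    second = next((p for p in ordered[1:] if p[axis] != first[axis]), None)
    if second is None:
        raise ValueError("no second reference pixel on a different column/row")
    # Change in value per column (or row) between the two reference pixels.
    step = (values[tuple(second)] - values[tuple(first)]) / (second[axis] - first[axis])
    target = col if axis == 1 else row
    return values[tuple(first)] + step * (target - first[axis])


# Same numbers as in the text: 0.80 and 0.82 are one column apart,
# and the outlier sits two columns away from the 0.82 pixel.
x = np.zeros((5, 5))
mask = np.ones((5, 5), dtype=bool)
x[2, 3], x[2, 4] = 0.80, 0.82
mask[2, 3] = mask[2, 4] = False
estimate = estimate_coordinate(x, mask, row=2, col=2)
```

With these values, `estimate` comes out at approximately 0.78, matching the worked example.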

These preprocessing methods yield a much **better 3D representation** of what the camera filmed (here for the **depth**):

### Volume Computation Process

#### Technique to Perform Volume Computation Using Two Videos

With videos that have consistent 3D coordinates, I can perform **volume computation**. Reading this article, you might have wondered: “Why insist on recording a video without boxes when only the boxes’ volume is needed?” The **trick** to compute the volume of the boxes is to compute the **volume for one image** of the video **with boxes** and **subtract** this volume from the volume of the same **image without boxes**. This **difference** corresponds to the computed **volume of the boxes**.

#### Compute Local Volumes Using Kernels

How does one perform “volume computation” on one image? Looking at the image below, one can **split** it into smaller squares, namely **kernels**:

Let’s simplify this problem by first computing the **areas** of the kernels and then their corresponding **volumes**.

Having the 3D coordinates of the four pixels that make up one kernel, one can compute the **area** of the kernel in different ways. To be more precise, I approximated a kernel by a **parallelogram**. Then, I dropped the z-coordinates of the points and used the formula for the area of a parallelogram.

This formula ignores one point (here C), and the **four points** are considered to be in the **same 2D plane**, since the z-coordinates of each point are ignored.
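The original formula is not reproduced in this text, so the sketch below uses one standard formula that fits the description: the area of the parallelogram spanned by the edges AB and AD, computed from x and y only. The corner naming (C being the corner opposite A) and the sample coordinates are assumptions.

```python
def parallelogram_area(a, b, d):
    """Area of the parallelogram spanned by AB and AD, using only the x and y coordinates.

    Each point is an (x, y, z) triple in meters; z is deliberately ignored.
    """
    abx, aby = b[0] - a[0], b[1] - a[1]
    adx, ady = d[0] - a[0], d[1] - a[1]
    # Magnitude of the 2D cross product AB x AD.
    return abs(abx * ady - aby * adx)


# Hypothetical kernel corners: 2 cm wide, 3 cm tall.
A = (0.00, 0.00, 1.50)
B = (0.02, 0.00, 1.50)
D = (0.00, 0.03, 1.40)
area = parallelogram_area(A, B, D)
```

Here `area` is about 0.0006 m², regardless of the z-values of the three corners.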

Focusing on only one kernel, one can consider this kernel as the **base of a rectangular prism**, the **other base** being located at **z=0** and being identical. One way to picture it is to imagine that the camera is located on a plane that is perpendicular to the z-axis. The rectangular prism then has **one base on this plane**, and **the other is the kernel** on your image. The **height** of this rectangular prism is the **average of the z-coordinates** of all the points inside the kernel, such as A, B, C, D, E, F, and G in the picture below:

#### Subtract Kernel Volumes

Accordingly, computing the **area** of a kernel and getting the **height** of the corresponding rectangular prism makes it possible to perform **volume computation** for one kernel. By **covering** one image with **kernels** and **summing** their corresponding volumes, it is possible to get the **volume of the full image**. Here is a visual representation of the volume, where one pixel of each image represents the volume computed with a 2x2-pixel kernel:

In the images above, I added two squares: S1 and S2 on the image **without** boxes, and S1’ and S2’ on the image **with** boxes. Comparing S1 and S1’, the volumes should be **approximately the same**, because they enclose the same area and have the same depth. Thus, the difference between S1 and S1’ should be close to 0. However, comparing S2 and S2’, the volume of S2 should be greater than the one of S2’ because S2’ is on a box. Thus, even if S2 and S2’ have the same area, the mean depth of pixels in S2’ is lower than in S2. The difference between S2 and S2’ corresponds to the volume taken by the box there. Therefore, by covering the two images with squares and computing the difference in volume for each corresponding square, one can get the volume of the boxes.
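To make the procedure concrete, here is a minimal NumPy sketch of the whole computation with 2x2-pixel kernels. Several choices are assumptions rather than details from the article: neighboring kernels share their border pixels so that they tile the image, the kernel area is the parallelogram spanned by the top and left edges, and the toy data uses the same x/y grid in both frames (as in the S2 and S2’ comparison, where both squares have the same area). The names `image_volume` and `boxes_volume` are hypothetical.

```python
import numpy as np


def image_volume(xyz):
    """Sum of kernel volumes over an (H, W, 3) array of per-pixel (x, y, z) in meters."""
    a = xyz[:-1, :-1]  # top-left corner of every 2x2 kernel
    b = xyz[:-1, 1:]   # top-right corner
    c = xyz[1:, 1:]    # bottom-right corner (ignored by the area formula)
    d = xyz[1:, :-1]   # bottom-left corner
    ab, ad = b - a, d - a
    area = np.abs(ab[..., 0] * ad[..., 1] - ab[..., 1] * ad[..., 0])
    height = (a[..., 2] + b[..., 2] + c[..., 2] + d[..., 2]) / 4.0
    return float(np.sum(area * height))


def boxes_volume(xyz_without, xyz_with):
    """Volume of the boxes: volume of the empty image minus volume of the image with boxes."""
    return image_volume(xyz_without) - image_volume(xyz_with)


# Toy data: an 11x11 grid with 1 cm pixel pitch, wall at 1.5 m, one box face at 1.0 m.
rows, cols = np.mgrid[0:11, 0:11]
empty = np.dstack([cols * 0.01, rows * 0.01, np.full((11, 11), 1.5)])
boxed = empty.copy()
boxed[2:7, 3:8, 2] = 1.0
volume = boxes_volume(empty, boxed)
```

On this toy data, `volume` is about 0.00125 m³, which corresponds to a 5 cm x 5 cm face standing 0.5 m in front of the wall. Real frames would first go through the outlier preprocessing described earlier.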

### Conclusion

To conclude, volume computation using this technique gives pretty **good results** in estimating the volume of objects. For **two boxes**, the **absolute error** between the computed volume and the real volume is less than **1%**. For **one and three** boxes, this error is less than **3%**. Despite the many assumptions underlying the room and the objects, this algorithm proves to be efficient in computing objects’ volume. Furthermore, it provides a **basis** for more **exciting Computer Vision applications**, such as computing the volume of different **kinds of objects** using object detection algorithms.
