In this article, I will share the things I wish I had known before starting my Edge TPU instance segmentation project with a custom YOLACT model on TensorFlow.
What is Edge AI?
Edge AI is the combination of Edge Computing and Artificial Intelligence.
In simple terms, Edge AI will enable AI-based processing algorithms to be run inside the Edge Computing network.
Edge AI allows extremely short response times, on the order of a few milliseconds, because everything happens directly on the device: data collection, storage, and AI-based processing.
Even today, the majority of heavy computing processing is done in the cloud and requires large computing capacities.
Why compute directly on the device and not in the cloud?
- Speed of inference
- Low energy cost (device on batteries)
- Low hardware cost
- Security and confidentiality
Some companies have understood these issues and have developed their solutions, such as Jetson Xavier from Nvidia or Coral from Google.
Let's focus on Coral from Google:
Google has created TPU accelerators that connect directly via USB 3.0 and can boost any of your edge devices!
But first, you will need a model compiled with the
edgetpu_compiler made by Google Coral; unfortunately, it is available only on Linux.
For inference, just copy the TFLite model to your device along with the inference script, and you're ready to work. It works on Mac, Windows, and Linux.
YOLACT project initial configurations:
Before getting to the things I wish I had known, here is my initial configuration for the project.
- Model: YOLACT, a convolutional model capable of real-time instance segmentation, translated from PyTorch to TensorFlow 2.3
- Input size: 550x550
- Backbone: MobileNetV2, chosen for its small size
- Hardware: Edge TPU USB Accelerator firstly connected to a desktop with a 3.6 GHz CPU
Here is the workflow we set up to develop our project:
After converting the model from PyTorch to TF Lite, here is the result of edgetpu_compiler (you can get the compilation status via
edgetpu_compiler -s <model_name>):
We can see here that there are some problems detected during compilation:
- Some operations are not supported
- More than one subgraph is not supported
- Operation is otherwise supported, but not mapped due to some unspecified limitation
This results in 38 operations being mapped not to the TPU but to the CPU, which slows down our inference because the algorithm has to go back and forth between CPU and TPU.
1: Some TF 2 operations are not supported but you can use their TF 1 version
For our real-time route analysis project, we used a TensorFlow 2.3 translation of YOLACT, which is originally written in PyTorch.
In YOLACT, an instance segmentation model, there is an upsampling operation (tf.keras.layers.UpSampling2D()) in the protonet, on layers P3 and P4.
But this operation is not supported by the Edge TPU compiler (see the list of compatible operations).
At first, I didn't think of using the TF 1 version, because I thought that the bilinear interpolation mode of UpSampling2D would be enough to make the operation compatible. It took me some time to realize that this was not the case.
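As a sketch of the swap (the helper name upsample2x_tf1 is mine, not from the original code, and I assume a feature map of shape (batch, h, w, channels)), the TF 1 bilinear resize op can stand in for UpSampling2D:

```python
import tensorflow as tf

# Not mapped by the Edge TPU compiler, even with bilinear interpolation:
#   x = tf.keras.layers.UpSampling2D(interpolation="bilinear")(x)
# The TF 1 resize op below is a compiler-friendly replacement.
def upsample2x_tf1(x):
    # x: feature map of shape (batch, h, w, channels)
    h, w = x.shape[1], x.shape[2]
    return tf.compat.v1.image.resize_bilinear(x, (h * 2, w * 2))
```

The output is numerically equivalent to a 2x bilinear upsample; only the underlying op changes.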
After the compilation, here is the result:
We can see that the number of operations has increased! But there are also more of them running on the CPU...
2: The huge importance of the input size
Operation is otherwise supported, but not mapped due to some unspecified limitation
This message is due to the size of some matrices during the inference computation, which exceeds the Coral Edge TPU's limits.
Our input size of 550x550 is probably too large, especially since it goes through upsampling operations that double the size of the matrices.
Since our backbone is MobileNetV2, whose standard input size is 224x224, and our objects to detect are not particularly small, it was not a bad idea to go from 550x550 to 224x224.
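As a rough back-of-the-envelope check (assuming the P3 level sits at 1/8 of the input resolution and goes through a single 2x upsample; exact YOLACT shapes may differ), the input size directly drives the size of the upsampled matrices:

```python
# Rough sketch: spatial side of a protonet feature map after one 2x upsample,
# assuming P3 is at 1/8 of the input resolution (exact YOLACT shapes may differ).
def upsampled_side(input_size, stride=8, factor=2):
    return (input_size // stride) * factor

print(upsampled_side(550))  # 136
print(upsampled_side(224))  # 56
```

Shrinking the input shrinks every intermediate matrix proportionally, which is what lets them fit under the compiler's limits.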
After this step, here is the compilation result:
Here, everything is finally mapped to TPU! But it's not over yet...
3: It can be better not to keep the original INPUT/OUTPUT format depending on your case
According to Google Coral.ai, quantizing a model means converting all the 32-bit floating-point numbers (such as weights and activation outputs) to the nearest 8-bit fixed-point numbers. This makes the model smaller and faster. And although these 8-bit representations can be less precise, the inference accuracy of the neural network is not significantly affected.
By default, we kept the quantization of the input/output, so the converted model expected INT8 inputs and produced INT8 outputs. However, we had not quantized the YOLACT NMS (Non-Max Suppression) brick that follows the protonet: it expected FLOAT inputs (whereas with quantization we gave it INT8), and the results were very bad, to the point that almost no good prediction was made.
So we could have either:
- quantized the NMS,
- changed the expected input of the NMS, or
- changed the input/output of the TFLite model.
To speed up the development of the project prototype, we chose not to quantize the input/output of the model. It just required us to delete the line:
converter.inference_output_type = tf.int8
However, in the future, it will be better to integrate NMS in the quantization process.
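As a minimal sketch of what this looks like with the TF 2 TFLite converter (the tiny Conv2D model and the random representative dataset here are stand-ins, not the actual YOLACT code), keeping float input/output simply means not setting the inference_input_type/inference_output_type fields:

```python
import numpy as np
import tensorflow as tf

# Tiny stand-in model (hypothetical); in the project this was the YOLACT graph.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(8, 8, 3)),
    tf.keras.layers.Conv2D(4, 3, padding="same"),
])

def representative_dataset():
    # Stand-in calibration samples for full-integer quantization.
    for _ in range(10):
        yield [np.random.rand(1, 8, 8, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
# Deleted lines: leaving these out keeps the model's input/output in float32,
# while internal weights/activations are still quantized to INT8.
#   converter.inference_input_type = tf.int8
#   converter.inference_output_type = tf.int8
tflite_model = converter.convert()
```

The resulting model then accepts and returns float32 tensors, which is what the unquantized NMS brick expects.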
Here are the results once the input/output is left in FLOAT:
Thus, we can see that, unlike before, some operations corresponding to formatting are now run on the CPU. If we had chosen to change the input expected by the NMS, these operations would have been handled by the CPU in any case. The optimal choice is therefore to quantize the NMS so that it runs directly on the Edge TPU.
4: Use DVC and the Python time profiler!
I discovered DVC on this project and it simplified my development flow! Its role: versioning data on top of git and facilitating the reproduction of experiments. It saved our lives: we can easily go back to an experiment with the exact data that was used! Check out this post to learn more about the power of DVC and Streamlit. If you want to know how to speed up your pipeline reproduction in DVC, look here.
cProfile to profile Python
About the time profiler: in a project on the Coral Edge TPU with strong time constraints, it was essential to identify the phases of the algorithm that took too much time.
For this, nothing could be easier: cProfile is already included in Python!
- You can look at the results directly on the console via:
python -m cProfile main.py
- OR create a file usable by Snakeviz to get an interactive visualization (you can dive deep into functions to get more detailed results):
python -m cProfile -o time.profile demo_with_user_lane.py
pip install snakeviz
snakeviz time.profile
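Beyond the command line, cProfile can also be driven from Python code, which is handy for profiling a single pipeline stage. A minimal sketch (slow_inference is a dummy stand-in for a real stage of the algorithm):

```python
import cProfile
import io
import pstats

def slow_inference():
    # Dummy stand-in for a costly pipeline stage.
    return sum(i * i for i in range(200_000))

profiler = cProfile.Profile()
profiler.enable()
slow_inference()
profiler.disable()

# Print the 5 most expensive calls, sorted by cumulative time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```

The same time.profile-style data can be dumped with profiler.dump_stats("time.profile") and opened in Snakeviz.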
In this article, we saw that:
- Some TF 2 operations are not supported but you can use their TF 1 version
- Do not underestimate the impact of the input size on this kind of edge device
- Change the initial Coral Compiler configurations to suit your project (especially if you are using legacy code)
- Find the right tools to make your life easier on the project!
Thanks for reading, I am available if you have any questions!