
April 8, 2022 • 3 min read

TensorFlow 2.0 Tutorial: Optimizing Training Time Performance

Written by Raphaël Meudec

This tutorial explores how to reduce the training time of your TensorFlow 2.0 model along three axes:

tf.data
Mixed Precision Training
Multi-GPU Training Strategy

I applied all these tricks to a custom project on image deblurring, and the result is astonishing. You can get a 2–10x training time speed-up depending on your current pipeline.

Use case: Improving TensorFlow training time of an image deblurring CNN

Two years ago, I published a blog post on Image Deblurring with Generative Adversarial Networks in Keras. I thought porting the repository to TF 2.0 would be a nice way to understand what has changed and what the implications are for my code. In this article, I’ll train a simpler version of the model (the CNN part only).


The model is a convolutional net which takes the (256, 256, 3) blurred patch and predicts the (256, 256, 3) corresponding sharp patch. It is based on the ResNet architecture and is fully convolutional.

Step 1: Identify bottlenecks

To optimize training speed, you want your GPUs to be running at 100% utilization. nvidia-smi is fine for checking that your process is running on the GPU, but when it comes to GPU monitoring, there are smarter tools out there. Hence, the first step of this TensorFlow tutorial is to explore these better options.

nvtop

If you’re using an Nvidia card, the simplest way to monitor GPU utilization over time is probably nvtop. Its visualization is friendlier than that of nvidia-smi, and you can track metrics over time.


TensorBoard Profiler


By simply setting profile_batch={BATCH_INDEX_TO_MONITOR} inside the TensorBoard callback, TF adds a full report on operations performed by either the CPU or GPU for the given batch. This can help identify if your GPU is stalled at some point for lack of data.
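As a minimal sketch, the callback could be configured as follows. The log directory `logs/profiling` and the batch index `2` are illustrative assumptions, not values taken from the deblurring project.

```python
import tensorflow as tf

# Hypothetical values, chosen for illustration only.
LOG_DIR = "logs/profiling"
BATCH_INDEX_TO_MONITOR = 2

# Ask TensorBoard to profile a single training batch.
tensorboard_callback = tf.keras.callbacks.TensorBoard(
    log_dir=LOG_DIR,
    profile_batch=BATCH_INDEX_TO_MONITOR,
)

# The callback is then passed to Keras, for example:
# model.fit(dataset, epochs=5, callbacks=[tensorboard_callback])
```

Picking a batch index greater than 0 or 1 is usually a good idea, since the very first batches often include one-off setup costs. Depending on your TensorFlow version, viewing the trace may also require installing the separate profiler plugin for TensorBoard.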

RAPIDS NVDashboard

This is a JupyterLab extension which gives access to various metrics. Along with your GPU, you can also monitor other components of your machine (CPU, disks, ...). The advantage is that you don’t have to pick a specific batch to monitor; you can look at performance over the whole training run.

Here, we can easily spot that the GPU is at 40% utilization most of the time. I have activated only 1 of the 2 GPUs on the computer, so total utilization is around 20%.

Step 2: Optimize your tf.data pipeline

The first objective is to keep the GPU busy 100% of the time. To do so, we want to reduce the data loading bottleneck. If you are using a Python generator or a Keras Sequence, your data loading is probably sub-optimal. Even if you’re using tf.data, data loading can still be an issue. In my original article, I used Keras Sequences to load the images.


You can easily spot this phenomenon using the TensorBoard profiler. GPUs tend to sit idle while CPUs are busy with operations related to data loading.

Making the switch from the original Keras Sequences to tf.data was fairly easy. Most data loading operations are well supported. The only tricky part is taking the same patch from the blurred image and from the sharp one.

	from pathlib import Path

	import tensorflow as tf


	def select_patch(sharp, blur, patch_size_x, patch_size_y):
	    """
	    Select a patch on both sharp and blur images at the same location.

	    Args:
	        sharp (tf.Tensor): Tensor for the sharp image
	        blur (tf.Tensor): Tensor for the blur image
	        patch_size_x (int): Size of patch along x axis
	        patch_size_y (int): Size of patch along y axis

	    Returns:
	        Tuple[tf.Tensor, tf.Tensor]: Tuple of tensors with shape (patch_size_x, patch_size_y, 3)
	    """
	    # Stack both images so that a single random crop applies to both of them
	    stack = tf.stack([sharp, blur], axis=0)
	    patches = tf.image.random_crop(stack, size=[2, patch_size_x, patch_size_y, 3])
	    return (patches[0], patches[1])


	class TensorflowDatasetLoader:
	    def __init__(self, dataset_path, batch_size=4, patch_size=(256, 256), n_epochs=10, n_images=None, dtype=tf.float32):
	        # List all images paths
	        sharp_images_paths = [str(path) for path in Path(dataset_path).glob("*/sharp/*.png")]
	        if n_images is not None:
	            sharp_images_paths = sharp_images_paths[0:n_images]

	        # Generate corresponding blurred images paths
	        blur_images_paths = [path.replace("sharp", "blur") for path in sharp_images_paths]

	        # Load sharp and blurred images
	        sharp_dataset = tf.data.Dataset.from_tensor_slices(sharp_images_paths).map(
	            lambda path: self.load_image(path, dtype),
	        )
	        blur_dataset = tf.data.Dataset.from_tensor_slices(blur_images_paths).map(
	            lambda path: self.load_image(path, dtype),
	        )

	        dataset = tf.data.Dataset.zip((sharp_dataset, blur_dataset))

	        # Select the same patch on the sharp image and its corresponding blurred
	        dataset = dataset.map(
	            lambda sharp_image, blur_image: select_patch(
	                sharp_image, blur_image, patch_size[0], patch_size[1]
	            )
	        )

	        # Define dataset characteristics (batch_size, number_of_epochs, shuffling)
	        # Note: shuffling after batching shuffles whole batches, not single images
	        dataset = dataset.batch(batch_size)
	        dataset = dataset.shuffle(buffer_size=50)
	        dataset = dataset.repeat()

	        self.dataset = dataset

	    @staticmethod
	    def load_image(image_path, dtype):
	        # Read and decode the PNG file, then rescale pixel values to [-1, 1]
	        image = tf.io.read_file(image_path)
	        image = tf.image.decode_png(image, channels=3)
	        image = tf.image.convert_image_dtype(image, dtype)
	        image = (image - 0.5) * 2

	        return image
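
A quick way to convince yourself that both crops are aligned is to run the same stacking trick on synthetic tensors. This is only a sketch: the 8x8 image size, the 4x4 patch size and the constant offset standing in for the "blur" are assumptions made for this check, not part of the original pipeline.

```python
import tensorflow as tf


def select_patch(sharp, blur, patch_size_x, patch_size_y):
    """Crop the same random window from both images by stacking them first."""
    stack = tf.stack([sharp, blur], axis=0)
    patches = tf.image.random_crop(stack, size=[2, patch_size_x, patch_size_y, 3])
    return patches[0], patches[1]


# Synthetic "sharp" image in which every value is unique.
sharp = tf.reshape(tf.range(8 * 8 * 3, dtype=tf.float32), (8, 8, 3))
# Fake "blur" image: the sharp image plus a constant, so alignment is easy to check.
blur = sharp + 1000.0

sharp_patch, blur_patch = select_patch(sharp, blur, 4, 4)

# If both crops come from the same window, their difference is the constant everywhere.
difference = blur_patch - sharp_patch
is_aligned = bool(tf.reduce_all(difference == 1000.0))
print(sharp_patch.shape, is_aligned)
```

Cropping each image with its own call to `tf.image.random_crop` would instead draw two independent windows, which is why the stack is needed.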


Just switching from a Keras Sequence to tf.data can already improve training time. From there, we add a few small tricks that you can also find in TensorFlow's documentation:

parallelization: Run all the .map() calls in parallel by adding the num_parallel_calls=tf.data.experimental.AUTOTUNE argument
cache: Keep loaded images in memory by caching the dataset before the patch selection
prefetching: Start fetching the next elements while the current batch is still being processed

The dataset creation now looks like this:

	from pathlib import Path

	import tensorflow as tf


	def select_patch(sharp, blur, patch_size_x, patch_size_y):
	    """
	    Select a patch on both sharp and blur images at the same location.

	    Args:
	        sharp (tf.Tensor): Tensor for the sharp image
	        blur (tf.Tensor): Tensor for the blur image
	        patch_size_x (int): Size of patch along x axis
	        patch_size_y (int): Size of patch along y axis

	    Returns:
	        Tuple[tf.Tensor, tf.Tensor]: Tuple of tensors with shape (patch_size_x, patch_size_y, 3)
	    """
	    # Stack both images so that a single random crop applies to both of them
	    stack = tf.stack([sharp, blur], axis=0)
	    patches = tf.image.random_crop(stack, size=[2, patch_size_x, patch_size_y, 3])
	    return (patches[0], patches[1])


	class TensorflowDatasetLoader:
	    def __init__(self, dataset_path, batch_size=4, patch_size=(256, 256), n_epochs=10, n_images=None, dtype=tf.float32):
	        # List all images paths
	        sharp_images_paths = [str(path) for path in Path(dataset_path).glob("*/sharp/*.png")]
	        if n_images is not None:
	            sharp_images_paths = sharp_images_paths[0:n_images]

	        # Generate corresponding blurred images paths
	        blur_images_paths = [path.replace("sharp", "blur") for path in sharp_images_paths]

	        # Load sharp and blurred images, decoding several files in parallel
	        sharp_dataset = tf.data.Dataset.from_tensor_slices(sharp_images_paths).map(
	            lambda path: self.load_image(path, dtype),
	            num_parallel_calls=tf.data.experimental.AUTOTUNE,
	        )
	        blur_dataset = tf.data.Dataset.from_tensor_slices(blur_images_paths).map(
	            lambda path: self.load_image(path, dtype),
	            num_parallel_calls=tf.data.experimental.AUTOTUNE,
	        )

	        dataset = tf.data.Dataset.zip((sharp_dataset, blur_dataset))
	        # Cache decoded images in memory, before the random patch selection
	        dataset = dataset.cache()

	        # Select the same patch on the sharp image and its corresponding blurred
	        dataset = dataset.map(
	            lambda sharp_image, blur_image: select_patch(
	                sharp_image, blur_image, patch_size[0], patch_size[1]
	            ),
	            num_parallel_calls=tf.data.experimental.AUTOTUNE,
	        )

	        # Define dataset characteristics (batch_size, number_of_epochs, shuffling)
	        # Note: shuffling after batching shuffles whole batches, not single images
	        dataset = dataset.batch(batch_size)
	        dataset = dataset.shuffle(buffer_size=50)
	        dataset = dataset.repeat()
	        # Prepare the next batches while the current one is being consumed
	        dataset = dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)

	        self.dataset = dataset

	    @staticmethod
	    def load_image(image_path, dtype):
	        # Read and decode the PNG file, then rescale pixel values to [-1, 1]
	        image = tf.io.read_file(image_path)
	        image = tf.image.decode_png(image, channels=3)
	        image = tf.image.convert_image_dtype(image, dtype)
	        image = (image - 0.5) * 2

	        return image
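
To get an intuition for why prefetching helps, here is an idealized back-of-the-envelope model, not a measurement. Without prefetching, each step loads a batch and then trains on it. With prefetching, loading the next batch overlaps with training on the current one. The timings used below (30 ms to load a batch, 20 ms to train on it, 100 steps) are made-up numbers.

```python
def epoch_time_sequential(n_steps, load_ms, train_ms):
    """Each step loads a batch, then trains on it: the two costs add up."""
    return n_steps * (load_ms + train_ms)


def epoch_time_prefetched(n_steps, load_ms, train_ms):
    """Loading batch i+1 overlaps with training on batch i.

    The first batch must be loaded before training starts, and the last batch
    is trained on with nothing left to load. In between, each step costs as
    much as the slower of the two stages.
    """
    if n_steps == 0:
        return 0
    return load_ms + (n_steps - 1) * max(load_ms, train_ms) + train_ms


# Hypothetical timings, for illustration only.
sequential_ms = epoch_time_sequential(n_steps=100, load_ms=30, train_ms=20)
prefetched_ms = epoch_time_prefetched(n_steps=100, load_ms=30, train_ms=20)
print(sequential_ms, prefetched_ms)
```

In this toy setting, loading is still the slower stage, so the GPU keeps waiting even with prefetching. That is where the parallel .map() calls and the cache come in: they shrink the loading time itself.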


These small changes bring the training time for 5 epochs down from 1000 s to 616 s on an RTX 2080 (the full graph is at the end of the article).

Step 3: Mixed Precision Training

By default, all variables used in our neural network training are stored as float32. This means every element is encoded on 32 bits. The core idea of Mixed Precision Training is this: we don't need that much precision all the time, so let's use 16 bits where we can.

During Mixed Precision Training, you keep a float32 version of the weights, but perform the forward and backward passes on float16 versions of the weights. All the expensive operations needed to obtain the gradients are performed on float16 elements. In the end, you use the float16 gradients to update the float32 weights. Loss scaling is used in the process to keep training stable.
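
To see why loss scaling matters, here is a small sketch in plain Python, using the standard library's half-precision format rather than TensorFlow. The gradient value `1e-8` and the scale `1024` are hypothetical numbers picked to make the effect visible.

```python
import struct


def to_float16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", value))[0]


# Hypothetical values: a tiny gradient and a power-of-two loss scale.
true_gradient = 1e-8
loss_scale = 1024.0

# Stored directly in float16, the gradient underflows to zero.
unscaled_fp16 = to_float16(true_gradient)

# Scaling the loss scales the gradient by the same factor (chain rule),
# which brings it back into the range float16 can represent.
scaled_fp16 = to_float16(true_gradient * loss_scale)

# The gradient is unscaled in higher precision before the weight update.
recovered = scaled_fp16 / loss_scale

print(unscaled_fp16, recovered)
```

Without scaling, the update for this weight would simply be lost. With scaling, the recovered gradient is close to the true one, up to float16 rounding.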


By keeping float32 weights, this process does not lower the accuracy of your models. On the contrary, the authors of the technique even report performance improvements on various tasks.

TensorFlow makes this easy to set up from version 2.1.0 onwards, through its Policy API. Mixed Precision Training is activated with these two lines, placed before model instantiation.

	policy = tf.keras.mixed_precision.experimental.Policy('mixed_float16')
	tf.keras.mixed_precision.experimental.set_policy(policy)
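
The two lines above use the experimental API that shipped with TensorFlow 2.1. In more recent releases (2.4 and later, as far as I know), the same policy is set through the non-experimental `tf.keras.mixed_precision.set_global_policy`. A sketch, where the reset at the end is only there so that the snippet does not affect other code:

```python
import tensorflow as tf

# Non-experimental equivalent of the two lines above (assumes TensorFlow >= 2.4).
tf.keras.mixed_precision.set_global_policy("mixed_float16")

policy = tf.keras.mixed_precision.global_policy()
# Computations run in float16 while variables are kept in float32.
print(policy.compute_dtype, policy.variable_dtype)

# Restore the default policy so that later code is not affected.
tf.keras.mixed_precision.set_global_policy("float32")
```

On a machine without a compatible GPU, TensorFlow may print a warning when the policy is set. The policy itself is still applied, but you should not expect a speed-up there.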


With this method, we can reduce the training time for 5 epochs to 480 s.
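
To put the numbers reported so far in perspective, the speed-ups can be computed directly from the three timings quoted in this article. The labels are mine, and the timings are those of this specific project and GPU.

```python
# Training times for 5 epochs, as reported in this article (in seconds).
timings = {
    "before tf.data optimizations": 1000,
    "optimized tf.data": 616,
    "optimized tf.data + mixed precision": 480,
}

baseline = timings["before tf.data optimizations"]

# Speed-up of each stage relative to the starting point.
speedups = {name: baseline / seconds for name, seconds in timings.items()}

for name, seconds in timings.items():
    print(f"{name}: {seconds} s ({speedups[name]:.2f}x)")
```

That is roughly a 1.6x speed-up from the data pipeline alone, and about 2.1x once mixed precision is added, before touching the second GPU.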

Step 4: Multi-GPU Strategies

The last topic is multi-GPU training with TF 2.0. If you don't adapt your code for multi-GPU, you won't reduce your TensorFlow training time, because the extra GPUs won't be used efficiently.

The easiest way to perform multi-GPU training is to use the MirroredStrategy. It instantiates your model on each GPU. At each step, different batches are sent to the GPUs, each of which runs the forward and backward passes. The gradients are then aggregated to perform the weight update, and the updated values are propagated to each replica of the model.

Setting up the distribution strategy is again fairly easy with TensorFlow 2.0. The only thing to remember is to multiply your usual batch size by the number of available GPUs.

	# Define multi-gpu strategy
	mirrored_strategy = tf.distribute.MirroredStrategy()
	# Update batch size value
	batch_size *= mirrored_strategy.num_replicas_in_sync
	# Create strategy scope to perform training
	with mirrored_strategy.scope():
	    model = [...]
	    model.fit(...)
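
The aggregation step can be illustrated without any GPU. The following plain-Python sketch is not how TensorFlow implements MirroredStrategy internally. It only shows, on a made-up one-parameter linear model, that averaging the gradients computed on equally sized shards gives the same gradient as a single device processing the whole global batch.

```python
def mse_gradient(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


# Toy data for a one-parameter linear model (made-up numbers).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5

n_replicas = 2
per_replica_batch = len(xs) // n_replicas

# Each replica computes a gradient on its own shard of the global batch.
replica_gradients = [
    mse_gradient(
        w,
        xs[i * per_replica_batch:(i + 1) * per_replica_batch],
        ys[i * per_replica_batch:(i + 1) * per_replica_batch],
    )
    for i in range(n_replicas)
]

# Aggregation: average the replica gradients, then apply one shared update.
aggregated = sum(replica_gradients) / n_replicas
single_device = mse_gradient(w, xs, ys)

print(replica_gradients, aggregated, single_device)
```

This is also why the batch size is multiplied by the number of replicas: each GPU keeps processing its usual per-device batch, while one step now covers a global batch that is `n_replicas` times larger.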


If you use TPUs, consider taking a closer look at the official TensorFlow documentation on distributed training.

Wrap-up on tips to improve your TensorFlow training time

All these steps lead to a massive reduction in model training time. The graph below traces the training time for 5 epochs after each improvement of the training pipeline. I hope you enjoyed this TensorFlow tutorial on training time performance. You can ping me on Twitter (@raphaelmeudec) if you have any feedback!

Figure: Training time for 5 epochs on the DeblurGAN CNN. Impacts of tf.data, MPT and GPU Strategy on training time.

