April 8, 2022 • 4 min read

Sign Language Recognition - using MediaPipe & DTW

Written by Gabriel Guerin

Discover how you can implement a Sign Language Recognition model using MediaPipe. You will be able to translate videos of signs into natural language.


In France, over 100,000 people use French Sign Language (LSF) on a daily basis. However, little is done to facilitate the learning of Sign Language and the inclusion of deaf or hard-of-hearing people. To tackle this issue, Signes de Sens has spent the past 10 years building the biggest French-to-LSF dictionary: LeDicoElix. Today it contains over 20,000 LSF videos of signs, with their corresponding definitions. The next step of the Elix project is to be able to translate LSF back to French. Through the M33 Foundation, I've provided my Data Science expertise to implement a real-time Sign Language Recognition model.

The objective of this article is to present the current implementation of my model and give you a good understanding of the main ideas behind the chosen method.
As of now, on a small dataset (5 signs with 5 videos per sign), the model already shows convincing results!

Real-time Sign Language Recognition with a small French dictionary, 2021-12-13

Table of contents

  1. Overview
  2. MediaPipe Detection
  3. Models
  4. Sign Prediction
  5. What’s next?

1. Overview

The objective is to output the corresponding word based on a sign recorded in a video.
Our Sign Language Recognition model uses hand landmarks - points of interest of the hand that we track - as input.
The prediction is done in three steps:

  • Extract landmarks
  • Compute the DTW distance between the recorded sign and the reference signs
  • Predict the sign by analysing the most similar reference signs

Source code can be found at the end of the article.


2. MediaPipe Detection

Holistic Model

MediaPipe is an open-source framework for computer vision solutions released by Google a couple of years ago. Among these solutions, the Holistic Model can track in real time the position of the Hands, the Pose and the Face landmarks. For now, the code only uses hand positions to make the prediction.
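
As a quick illustration, a Holistic model can be instantiated in Python as follows (the confidence thresholds below are illustrative values, not necessarily those used in the project):

```python
import mediapipe as mp

# Instantiate the Holistic model once and reuse it for every frame.
# The confidence thresholds here are illustrative values.
holistic = mp.solutions.holistic.Holistic(
    min_detection_confidence=0.5,
    min_tracking_confidence=0.5,
)
```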

Extract landmarks

First, let's extract landmarks from the video feed using the Holistic model's process method. A color conversion is needed because OpenCV uses BGR colors while MediaPipe expects RGB.

landmark detection function, from the repo Sign-Language-Recognition--MediaPipe-DTW
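
The repository's function is shown as an image in the original post; the minimal sketch below gives the general idea, assuming the holistic object created earlier (the function name mediapipe_detection is illustrative):

```python
import cv2

def mediapipe_detection(image, holistic):
    """Run the Holistic model on a single video frame (illustrative sketch)."""
    # OpenCV delivers BGR frames while MediaPipe expects RGB
    image_rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    image_rgb.flags.writeable = False  # marginal speed-up: process read-only data
    results = holistic.process(image_rgb)  # landmark detection
    return results
```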

Draw landmarks

The drawing_utils sub-package of MediaPipe contains all the tools we need to draw the landmarks on an image!

draw_landmarks function, from the repo Sign-Language-Recognition--MediaPipe-DTW
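
Here is a possible sketch of such a function, drawing only the hand landmarks used by the model (the exact styling in the repository may differ):

```python
import mediapipe as mp

mp_drawing = mp.solutions.drawing_utils
mp_holistic = mp.solutions.holistic

def draw_landmarks(image, results):
    """Draw the detected hand landmarks on a BGR frame, in place."""
    # draw_landmarks silently skips hands that were not detected (None)
    mp_drawing.draw_landmarks(
        image, results.left_hand_landmarks, mp_holistic.HAND_CONNECTIONS)
    mp_drawing.draw_landmarks(
        image, results.right_hand_landmarks, mp_holistic.HAND_CONNECTIONS)
```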

And voilà!
We are now able to draw the landmarks on any frame of the video.

Sign for “Ball”, video from LeDicoElix

3. Models

In the current version of the project two models have been implemented:

  • The HandModel class contains the hand gesture information of an image.
  • The SignModel class contains the HandModel information for all frames of a video.

Hand Model

The main problem when using landmark positions as input data is that the prediction is sensitive to the size and the absolute position of the hands.
A good way to extract information about the hand gesture is to use the angles between all the parts of the hand, called connections. We will use all 21 connections of MediaPipe’s Hand Model in this project.

The HandModel class implemented in this project is defined by its feature_vector that gives a representation of the hand gesture.

HandModel representation

The HandModel class has two attributes:

  • connections: List of tuples containing the ids of the two landmarks representing a connection
  • feature_vector: List of the 21*21=441 angles between all the connections of the hand

HandModel class, from the repo Sign-Language-Recognition--MediaPipe-DTW
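
The actual class is shown as an image in the original post; the sketch below illustrates the idea of the 441-angle feature vector (names and details are indicative, not the repository's exact code):

```python
import itertools

import mediapipe as mp
import numpy as np

class HandModel:
    """Gesture representation of one hand (illustrative sketch)."""

    def __init__(self, landmarks):
        # landmarks: the 21 (x, y, z) coordinates of one hand
        points = np.array(landmarks).reshape(21, 3)

        # The 21 connections (pairs of landmark ids) of MediaPipe's hand model,
        # sorted to get a deterministic order
        self.connections = sorted(mp.solutions.holistic.HAND_CONNECTIONS)

        # One angle per pair of connections -> 21 * 21 = 441 values
        self.feature_vector = self._get_angles(points)

    def _get_angles(self, points):
        vectors = [points[end] - points[start] for start, end in self.connections]
        angles = []
        for u, v in itertools.product(vectors, repeat=2):
            cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8)
            angles.append(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))
        return angles
```

Because only angles between connections are kept, this representation is insensitive to the size and absolute position of the hand.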

Sign Model

Now that we can extract the information of the hand gesture, we have to build an object containing both the spatial and temporal information of a sign.
To do so, we just have to store the feature_vector of both hands for each frame of the video.

SignModel representation

The SignModel class implemented in this project has four attributes:

  • has_left_hand: True if the left hand is present in the video
  • has_right_hand: True if the right hand is present in the video
  • lh_embedding: List of the feature vectors of the left hand for each frame
  • rh_embedding: List of the feature vectors of the right hand for each frame

SignModel class, from the repo Sign-Language-Recognition--MediaPipe-DTW
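
A sketch of such a class, reusing the HandModel sketch above (the handling of missing hands is an assumption on my part):

```python
import numpy as np

class SignModel:
    """Spatio-temporal representation of a sign (illustrative sketch)."""

    def __init__(self, left_hand_list, right_hand_list):
        # *_hand_list: one entry per frame with the 21 hand landmarks,
        # assumed to be all zeros when the hand is not detected
        self.has_left_hand = np.sum(left_hand_list) != 0
        self.has_right_hand = np.sum(right_hand_list) != 0

        self.lh_embedding = self._get_embedding(left_hand_list)
        self.rh_embedding = self._get_embedding(right_hand_list)

    @staticmethod
    def _get_embedding(hand_list):
        # Keep only the frames where the hand is visible
        return [
            HandModel(landmarks).feature_vector
            for landmarks in hand_list
            if np.sum(landmarks) != 0
        ]
```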

4. Sign Prediction

We have now created sign embeddings that contain the temporal and spatial information of a sign. The next step is to classify them. Several methods could be applied here. We could use Deep Learning to classify the sequences, but this would demand a huge number of examples for each sign. Another method uses far less training data and is able to compute a similarity between two signs.
With Dynamic Time Warping, we can compute the distance between embeddings, as they are time series of feature_vectors.

Dynamic Time Warping (DTW)

Dynamic Time Warping is an algorithm widely used for time series comparison. It finds the best alignments between two time series by warping them. This allows us to compare patterns instead of sequences. In our case, DTW will find similarities between embeddings of the same signs even if they are done at different speeds.

Comparison between Euclidean distance & DTW distance between two time series

To compute the similarity between two signs, we compare their embeddings. We compute the DTW distance between the embeddings of the recorded sign and the embeddings of all the reference videos - each sign has multiple reference videos.
The following method returns all the signs in the catalog, sorted by their distance to the recorded sign.

DTW computation, from the repo Sign-Language-Recognition--MediaPipe-DTW
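
Here is a sketch of this computation using the fastdtw package (the repository's actual implementation and data structures may differ; reference_signs is assumed here to be a list of (name, SignModel) pairs):

```python
import numpy as np
from fastdtw import fastdtw
from scipy.spatial.distance import euclidean

def dtw_distances(recorded_sign, reference_signs):
    """Sort reference signs by their DTW distance to the recorded sign (sketch)."""
    distances = []
    for name, ref_sign in reference_signs:
        distance = 0.0
        # Compare each hand separately and sum the DTW distances
        if recorded_sign.has_left_hand and ref_sign.has_left_hand:
            d, _ = fastdtw(recorded_sign.lh_embedding,
                           ref_sign.lh_embedding, dist=euclidean)
            distance += d
        if recorded_sign.has_right_hand and ref_sign.has_right_hand:
            d, _ = fastdtw(recorded_sign.rh_embedding,
                           ref_sign.rh_embedding, dist=euclidean)
            distance += d
        # If the hands present in the two signs do not match,
        # treat the signs as completely dissimilar
        if distance == 0.0:
            distance = np.inf
        distances.append((name, distance))

    return sorted(distances, key=lambda pair: pair[1])
```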

Sign prediction

We can now make a prediction! We have computed the distances between the recorded sign and all the reference ones. By sorting them, we can take a batch of the signs most similar to our recording. With this batch, we can check whether a sign appears enough times to be confident about our prediction.
In the code below we chose batch_size=5 and threshold=0.5, meaning that if the same sign appears at least 3 times in the batch we output it. Otherwise, we output “Unknown sign”.
The batch_size and the threshold values depend on the number of videos per sign present in the dataset.

Sign prediction, from the repo Sign-Language-Recognition--MediaPipe-DTW
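
A sketch of this rule, based on the sorted distances returned above (the function and parameter names are illustrative):

```python
from collections import Counter

def predict_sign(sorted_distances, batch_size=5, threshold=0.5):
    """Return the predicted sign name, or "Unknown sign" (illustrative sketch)."""
    # Keep the batch_size most similar reference signs
    batch = [name for name, _ in sorted_distances[:batch_size]]

    # Find the sign that appears most often in the batch
    sign, count = Counter(batch).most_common(1)[0]

    # With batch_size=5 and threshold=0.5, the sign must appear at least 3 times
    if count > threshold * batch_size:
        return sign
    return "Unknown sign"
```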

5. What’s next?

We now know how to implement a Sign Language Recognition model. Our method doesn’t use deep learning, so there is no need to build a dataset with thousands of examples per class. However, we still need a few dozen videos per sign to obtain good results.

The results are convincing but we are not done yet! The two main objectives for the next iterations are:

  • Take Face and Pose landmarks into account for the classification. Indeed, facial expressions and the hands’ position relative to the body are very important to define a sign.
  • Remove outliers and smooth the hands’ movements, since MediaPipe detection sometimes outputs abnormal data. This will also improve the performance of the model.

If you want to know more about sign language recognition, contact us!


Source Code

The full implementation is available in the Sign-Language-Recognition--MediaPipe-DTW repository.


This article was written by Gabriel Guerin.