Discover how to implement a Sign Language Recognition model using MediaPipe and translate videos of signs into natural language.
In France, over 100,000 people use French Sign Language (LSF) on a daily basis. However, little is done to facilitate the learning of Sign Language and the inclusion of deaf and hard of hearing people. To tackle this issue, Signes de Sens has spent the past 10 years building the biggest French-to-LSF dictionary: LeDicoElix. Today it contains over 20,000 LSF videos of signs, with their corresponding definitions. The next step of the Elix project is to be able to translate LSF back into French. Through the M33 Foundation, I've provided my Data Science expertise to implement a real-time Sign Language Recognition model.
The objective of this article is to show the current implementation of my model and give you a good understanding of the main ideas of the method chosen here.
As of now, on a small dataset (5 signs with 5 videos per sign), the model already gives convincing results!
Table of contents
- MediaPipe Detection
- Hand and Sign Models
- Sign Prediction
- What’s next?
The objective is to output the corresponding word based on a sign recorded in a video.
Our Sign Language Recognition model uses hand landmarks - points of interest of the hand that we track - as input.
The prediction is done in three steps:
- Extract landmarks
- Compute the DTW distance between the recorded sign and the reference signs
- Predict the sign by analysing the most similar reference signs
Source code can be found at the end of the article.
2. MediaPipe Detection
MediaPipe is an open-source computer vision framework released by Google a couple of years ago. Among its solutions, the Holistic Model can track the Hands, Pose and Face landmarks in real time. For now, the code only uses hand positions to make the prediction.
First, let's extract landmarks from the video feed using the Holistic model's process method. A color conversion has to be done because OpenCV uses BGR colors while MediaPipe expects RGB.
The drawing_utils sub-package of MediaPipe contains all the tools we need to draw the landmarks on an image!
We are now able to draw the landmarks on any frame of the video.
3. Hand and Sign Models
In the current version of the project, two models have been implemented:
- The HandModel class contains the hand gesture information of an image.
- The SignModel class contains the HandModel information for all frames of a video.
The main problem with using landmark positions as input data is that the prediction is sensitive to the size and the absolute position of the hands.
A good way to extract hand gesture information is to use the angles between all the parts of the hand, called connections. We will use all 21 connections of MediaPipe’s Hand Model in this project.
The HandModel class implemented in this project is defined by its feature_vector, which gives a representation of the hand gesture. The HandModel class has two attributes:
- connections: list of tuples containing the ids of the two landmarks representing a connection
- feature_vector: list of the 21*21 = 441 angles between all the connections of the hand
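A sketch of such a class is shown below. The connection list reproduces MediaPipe's 21 hand connections; the exact constructor signature and angle convention (here: degrees, via the arccosine of normalized dot products) are my assumptions, not necessarily the project's:

```python
import numpy as np

# The 21 connections of MediaPipe's hand model, as (start, end) landmark ids
HAND_CONNECTIONS = [
    (0, 1), (1, 2), (2, 3), (3, 4),         # thumb
    (0, 5), (5, 6), (6, 7), (7, 8),         # index finger
    (9, 10), (10, 11), (11, 12),            # middle finger
    (13, 14), (14, 15), (15, 16),           # ring finger
    (0, 17), (17, 18), (18, 19), (19, 20),  # pinky
    (5, 9), (9, 13), (13, 17),              # palm
]

class HandModel:
    """Represents a hand gesture by the 21*21 = 441 angles between
    every ordered pair of hand connections."""

    def __init__(self, landmarks):
        # landmarks: array-like of shape (21, 3) with the (x, y, z) positions
        landmarks = np.asarray(landmarks, dtype=float)
        self.connections = HAND_CONNECTIONS
        # One direction vector per connection
        vectors = np.array([landmarks[end] - landmarks[start]
                            for start, end in self.connections])
        self.feature_vector = self._compute_angles(vectors)

    @staticmethod
    def _compute_angles(vectors):
        """Angle (in degrees) between every ordered pair of connection vectors."""
        angles = []
        for u in vectors:
            for v in vectors:
                cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
                angles.append(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))
        return angles
```

Because only angles are kept, the representation is invariant to the size and absolute position of the hand, which is exactly what we wanted.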
Now that we can extract the hand gesture information, we have to build an object containing both the spatial and temporal information of a sign. To do so, we just have to store the feature_vector of both hands for each frame of the video.
The SignModel class implemented in this project has four attributes:
- has_left_hand: True if the left hand is present in the video
- has_right_hand: True if the right hand is present in the video
- lh_embedding: list of the feature vectors of the left hand for each frame
- rh_embedding: list of the feature vectors of the right hand for each frame
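A minimal sketch of this class, assuming the per-frame feature vectors come in as lists (with an empty list for frames where the hand was not detected); the handling of missing frames is my assumption:

```python
class SignModel:
    """Temporal + spatial representation of a sign: per-frame hand feature vectors."""

    def __init__(self, left_hand_list, right_hand_list):
        # *_hand_list: one feature vector per frame, or an empty list
        # for frames where the hand is not detected
        self.has_left_hand = any(len(fv) > 0 for fv in left_hand_list)
        self.has_right_hand = any(len(fv) > 0 for fv in right_hand_list)
        # Keep only the frames where the hand is actually present
        self.lh_embedding = [fv for fv in left_hand_list if len(fv) > 0]
        self.rh_embedding = [fv for fv in right_hand_list if len(fv) > 0]
```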
4. Sign Prediction
We have now created sign embeddings that contain the temporal and spatial information of a sign. The next step is to classify them. Several methods could be applied here. We could use Deep Learning to classify the sequences, but this would demand huge amounts of examples for each sign. Another method uses far less training data and is able to compute a similarity between two signs.
With Dynamic Time Warping, we can compute the distance between embeddings, as they are time series of feature vectors.
Dynamic Time Warping (DTW)
Dynamic Time Warping is an algorithm widely used for time series comparison. It finds the best alignments between two time series by warping them. This allows us to compare patterns instead of sequences. In our case, DTW will find similarities between embeddings of the same signs even if they are done at different speeds.
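To make this concrete, here is a minimal textbook implementation of DTW with a Euclidean local cost (the project may well use an optimized library instead; this sketch only illustrates the algorithm):

```python
import numpy as np

def dtw_distance(seq_a, seq_b):
    """DTW distance between two sequences of feature vectors."""
    n, m = len(seq_a), len(seq_b)
    # cost[i, j] = minimal cumulative cost to align seq_a[:i] with seq_b[:j]
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(np.asarray(seq_a[i - 1]) - np.asarray(seq_b[j - 1]))
            cost[i, j] = d + min(cost[i - 1, j],      # stretch seq_b
                                 cost[i, j - 1],      # stretch seq_a
                                 cost[i - 1, j - 1])  # advance both
    return cost[n, m]
```

Note that a sequence and a slowed-down copy of itself (each element repeated) have a DTW distance of zero, which is exactly the speed-invariance we need for signs.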
To compute the similarity between two signs, we compare their embeddings: we compute the DTW distance between the embedding of the recorded sign and the embeddings of all the reference videos - each sign has multiple reference videos.
The following method returns all the signs in the catalog, sorted by their distance to the recorded sign.
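It might look something like this sketch. The function and variable names are my assumptions, as is representing the catalog as (sign_name, SignModel) pairs; it reuses a minimal DTW helper so the snippet is self-contained:

```python
import numpy as np

def dtw_distance(seq_a, seq_b):
    """Minimal DTW with Euclidean local cost between feature-vector sequences."""
    n, m = len(seq_a), len(seq_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(np.asarray(seq_a[i - 1]) - np.asarray(seq_b[j - 1]))
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

def dtw_distances(recorded_sign, reference_signs):
    """Return (distance, sign_name) pairs sorted by similarity to the recorded sign."""
    distances = []
    for name, ref in reference_signs:
        # Only compare signs performed with the same hands
        if (recorded_sign.has_left_hand != ref.has_left_hand
                or recorded_sign.has_right_hand != ref.has_right_hand):
            distances.append((np.inf, name))
            continue
        d = 0.0
        if recorded_sign.has_left_hand:
            d += dtw_distance(recorded_sign.lh_embedding, ref.lh_embedding)
        if recorded_sign.has_right_hand:
            d += dtw_distance(recorded_sign.rh_embedding, ref.rh_embedding)
        distances.append((d, name))
    return sorted(distances)
```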
We can now make a prediction! We have computed the distances between the recorded sign and all the reference ones. So by sorting them we can take a batch of the most similar signs to our record. With this batch, we can check if a sign appears enough times to be confident about our prediction.
In the code below we chose threshold=0.5, meaning that if the same sign appears at least 3 times in the batch, we output it. Otherwise, we output “Unknown sign”.
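A sketch of this voting step (the function and parameter names are mine, and the batch_size=5 default is an assumption inferred from "at least 3 times" with threshold=0.5):

```python
from collections import Counter

def predict_sign(sorted_distances, batch_size=5, threshold=0.5):
    """Vote among the most similar reference signs.

    sorted_distances: (distance, sign_name) pairs, most similar first.
    Returns a sign name if it accounts for more than `threshold` of the
    `batch_size` nearest references, otherwise "Unknown sign".
    """
    batch = [name for _, name in sorted_distances[:batch_size]]
    best_sign, count = Counter(batch).most_common(1)[0]
    # With batch_size=5 and threshold=0.5, a sign must appear at least 3 times
    if count > threshold * batch_size:
        return best_sign
    return "Unknown sign"
```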
The batch_size and the threshold values depend on the number of videos per sign present in the dataset.
5. What’s next?
We now know how to implement a Sign Language Recognition model. Our method doesn’t use deep learning, so there is no need to build a dataset with thousands of examples per class. However, we still need a few dozen videos per sign to obtain good results.
The results are convincing but we are not done yet! The two main objectives for the next iterations are:
- Take into account Face and Pose landmarks in the classification. Indeed, facial expressions and the hands’ relative position to the body are very important to define a sign.
- Remove outliers and smooth hand movements, since MediaPipe detection sometimes outputs abnormal data. This will also improve the performance of the model.