  • Open access
  • Published: 09 October 2023

Sign language recognition using the fusion of image and hand landmarks through multi-headed convolutional neural network

  • Refat Khan Pathan 1 ,
  • Munmun Biswas 2 ,
  • Suraiya Yasmin 3 ,
  • Mayeen Uddin Khandaker   ORCID: orcid.org/0000-0003-3772-294X 4 , 5 ,
  • Mohammad Salman 6 &
  • Ahmed A. F. Youssef 6  

Scientific Reports volume 13, Article number: 16975 (2023)

  • Computational science
  • Image processing

Sign language recognition is a breakthrough for communication within the deaf-mute community and has been a critical research topic for years. Although some previous studies have successfully recognized sign language, they require costly instruments such as sensors, dedicated devices, and high-end processing power. Such drawbacks can be overcome by employing artificial intelligence-based techniques. Since capturing video or images with a camera is easy in this era of advanced mobile technology, this study demonstrates a cost-effective technique to detect American Sign Language (ASL) using an image dataset. Here, the “Finger Spelling, A” dataset has been used, with 24 letters (excluding j and z, as they involve motion). The main reason for using this dataset is that its images have complex backgrounds with different environments and scene colors. Two layers of image processing have been used: in the first layer, images are processed as a whole for training, and in the second layer, the hand landmarks are extracted. A multi-headed convolutional neural network (CNN) model has been proposed to train on these two layers and has been tested with 30% of the dataset. To avoid overfitting, data augmentation and dynamic learning rate reduction have been used. With the proposed model, 98.981% test accuracy has been achieved. It is expected that this study may help to develop an efficient human–machine communication system for the deaf-mute community.

Introduction

Spoken language is the medium of communication for the majority of the population, and it allows a large share of people to exchange ideas. Nonetheless, a section of the population cannot communicate with most others through spoken language. People who are mute cannot convey meaning properly through speech. Deafness is a disability that impairs hearing, while muteness is a disability that impairs speaking. Both affect only hearing or speech, so those affected can still do many other things. Communication is the only thing that separates them from other people 1 . As there are so many languages in the world, a dedicated language is needed for them to express their thoughts and opinions in a way that others can understand, and such a language is called sign language. Understanding sign language is an arduous task, an ability that must be acquired through training.

Many methods are available that use different inputs and tools, such as images (2D, 3D), sensor data (hand glove 2 , Kinect sensor 3 , neuromorphic sensor 4 ), and videos. Captured images are often excessively noisy, so an elevated level of pre-processing is required. The datasets available online are already processed or captured in a lab environment, where recent advanced AI models can train and evaluate easily, which makes those models prone to errors in real-life applications with different kinds of noise. Accordingly, there is a basic need for a model that can deal with noisy images and still deliver good results. Different sorts of methods can be used to classify and recognize images using machine learning. Apart from recognizing static images, work has been done on depth-camera sensing and video processing 5 , 6 , 7 . Various processes embedded in these systems were developed in different programming languages to implement the procedural steps for the final system's maximum effectiveness. The problem can be addressed through three broad approaches: first, static image recognition techniques with pre-processing procedures; second, deep learning models; and third, Hidden Markov Models.

Sign language serves this part of the community and enables smooth communication among people who have difficulty speaking and hearing. They use hand signs along with facial expressions and body movements to interact. Yet, despite its global use, not many people are familiar with communication via sign language gestures 8 . Hand gestures make up a significant part of the sign language vocabulary, while facial expressions and body movements serve to emphasize the words and phrases expressed by hand gestures. Hand gestures can be static or dynamic 9 , 10 . There are approaches to gesture detection that use the dynamic vision sensor (DVS), a similar technique used in the framework introduced in this composition. For example, Amir et al. 11 presented an event-based gesture recognition system that processes the event stream using TrueNorth, a natively event-based processor from International Business Machines (IBM). They use a temporal filter cascade to create spatio-temporal frames that a CNN running on the event-based processor classifies, and they reported an accuracy of 96.46%. In a real-life scenario, however, backgrounds are not static, so the stated power-saving process might not work properly. Lee et al. 12 proposed a gesture classification method with two DVSs forming a stereo-vision system. They used spiking neurons to handle the incoming events, and the method has the same real-life issue. Static hand gestures, also called hand poses, are formed by different shapes and orientations of the hands without any movement information. Dynamic hand gestures consist of a sequence of hand poses with associated movement information 13 . Using facial expressions, static hand images, and hand gestures, sign language provides the means to communicate just as spoken languages do; there are different kinds of sign languages as well 14 .

In this work, we have applied a fusion of traditional image processing with extracted hand landmarks and trained a multi-headed CNN so that the two inputs could complement each other’s weights at the concatenation layer. The main objective is to achieve a better detection rate without relying on a traditional single-channel CNN. This method has been shown to work well with less computational power and fewer epochs on medical image datasets 15 . The rest of the paper is organized as follows: the literature review is given in the "Literature review" section; materials and methods are described in the "Materials and methods" section, with three subsections covering the dataset ("Dataset description"), image pre-processing ("Pre-processing of image dataset"), and the working procedure ("Working procedure"); the results are analyzed in the "Result analysis" section; and the conclusion is given in the "Conclusion" section.

Literature review

State-of-the-art techniques center on deep learning models to improve accuracy and reduce execution time. CNNs have shown huge improvements in visual object recognition 16 , natural language processing 17 , scene labeling 18 , medical image processing 15 , and so on. Despite these accomplishments, there is little work on applying CNNs to video classification. This is partly because of the difficulty of adapting CNNs to combine both spatial and temporal information. A model using special hardware such as a depth camera has been used to capture the depth variation in the image as an additional feature for comparison, followed by a CNN to produce the results 19 , but it still has low accuracy. An innovative technique that does not need a pre-trained model was created using a capsule network and adaptive pooling 11 .

Furthermore, it was shown that reducing the layers of a CNN in a greedy manner and developing a deep belief network produced superior outcomes compared to other fundamental methodologies 20 . Feature extraction using the scale-invariant feature transform (SIFT) combined with classification using neural networks was developed to obtain good results 21 . In one method, the images were converted into the RGB scheme, the data was built using the motion and depth channels, and finally a 3D recurrent convolutional neural network (3DRCNN) was used to build a working system 5 , 22 , where Canny edge detection and Oriented FAST and Rotated BRIEF (ORB) have been used. The ORB feature detection technique and the K-means clustering algorithm are used to create a bag-of-features model for all descriptors. However, this approach relies on a plain background with easily detectable edges and is totally dependent on those edges; if the edges give wrong information, the model's accuracy may fall, which becomes the main problem to solve.

In recent years, deep learning approaches have become standard for improving the recognition accuracy of sign language models. Using the Faster Region-based Convolutional Neural Network (Faster-RCNN) 23 , a CNN model is applied for hand recognition in the input image. Rastgoo et al. 24 proposed a method in which they cropped the image properly, fused RGB and depth images using a restricted Boltzmann machine (RBM), added two noise types (Gaussian noise and salt-and-pepper noise), and prepared the data for training. As a biologically inspired deep learning model, a CNN performs all three stages within a single framework that is trained from raw pixel values to classifier outputs, but extreme computation power was needed. The authors of ref. 25 proposed 3D CNNs in which the third dimension combines both spatial and temporal information. The network accepts several neighboring frames as input and performs 3D convolution in the convolutional layers. The study reported in 26 followed similar ideas and proposed regularizing the outputs with high-level features and combining the predictions of a variety of models. They applied the developed models to recognize human actions and achieved better performance than benchmark methods. However, it is not certain that this works with hand gestures, as they detected the face first and then body movement 27 .

On the other hand, Microsoft and Leap Motion have developed distinct approaches to identify and track a user’s hand and body movement by introducing Kinect and the Leap Motion Controller (LMC), respectively. Kinect recognizes the body skeleton and tracks the hands, whereas the LMC detects and tracks hands with its built-in cameras and infrared sensors 3 , 28 . Using the provided framework, Sykora et al. 7 used the Kinect system to capture the depth data of 10 hand gestures and classified them using the speeded-up robust features (SURF) technique, reaching 82.8% accuracy. However, the approach was not tested on a more extensive database or with modified feature extraction methods (SIFT, SURF), so it may not be invariant to the orientation of gestures. Likewise, Huang et al. 29 proposed a 10-word ASL recognition system using Kinect; with tenfold cross-validation and an SVM, it achieved an accuracy rate of 97% using a set of frame-independent features, but the most significant problem in this method is segmentation.

The literature review shows that most of the models used in this application either depend on a single variable or require high computational power. Also, the datasets chosen for training and validating these models have plain backgrounds, which makes detection easier. Our main aim is to show how to reduce the computational power needed for training and the dependency of model training on one layer.

Materials and methods

Dataset description

Using a generalized single-color background to classify sign language is very common. We intended to avoid a single-color background and instead use complex backgrounds with hand images from many users to increase the detection complexity. For this reason, we have used the “ASL Finger Spelling” dataset 30 , which has images of different sizes, orientations, and complex backgrounds, with over 500 images per sign (24 signs in total) from 4 users (non-native to sign language). This dataset contains separate RGB and depth images; we have worked with the RGB images in this research. The photos were taken in 5 sessions with the same background and lighting. The dataset details are shown in Table 1 , and some sample images are shown in Fig. 1 .

Figure 1. Sample images from a dataset containing 24 signs from the same user.

Pre-processing of image dataset

Images were pre-processed for two operations: preparing the original image training set and extracting the hand landmarks. A traditional CNN has one input data channel and one output channel. We are using two input data channels and one output channel, so the data needs to be prepared for both inputs individually.

Raw image processing

In raw image processing, we converted the images from RGB to grayscale to reduce color complexity. Then we used a 2D kernel matrix to sharpen the images, as shown in Fig. 2 . After that, we resized the images to 50 × 50 pixels for evaluation through the CNN. Finally, we normalized the grayscale values (0–255) by dividing the pixel values by 255, so the new pixel array contains values in the range (0–1). The primary advantage of this normalization is that a CNN works faster with inputs in the (0–1) range than with other ranges.
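The four steps above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the sharpening kernel values, the luminance weights, the edge handling, and the nearest-neighbour resize are all assumed here, since the text does not specify them (the actual kernel is the one shown in Fig. 2), and the function names are hypothetical.

```python
# Illustrative sketch of: grayscale -> sharpen -> resize to 50x50 -> scale to 0-1.
# Kernel values, luminance weights, and resize method are assumptions.

SHARPEN_KERNEL = [[0, -1, 0],
                  [-1, 5, -1],
                  [0, -1, 0]]  # a commonly used sharpening kernel (assumed)


def to_grayscale(rgb):
    """Convert rows of (r, g, b) tuples to rows of luminance values (0-255)."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in rgb]


def sharpen(gray, kernel=SHARPEN_KERNEL):
    """Apply a 3x3 kernel with replicated edges, clipping results to 0-255."""
    h, w = len(gray), len(gray[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for ky in range(3):
                for kx in range(3):
                    yy = min(max(y + ky - 1, 0), h - 1)
                    xx = min(max(x + kx - 1, 0), w - 1)
                    acc += kernel[ky][kx] * gray[yy][xx]
            out[y][x] = min(max(acc, 0.0), 255.0)
    return out


def resize_nearest(img, size=50):
    """Resize to size x size by nearest-neighbour sampling."""
    h, w = len(img), len(img[0])
    return [[img[y * h // size][x * w // size] for x in range(size)]
            for y in range(size)]


def normalize(img):
    """Scale 0-255 values into the 0-1 range."""
    return [[v / 255.0 for v in row] for row in img]


def preprocess(rgb, size=50):
    return normalize(resize_nearest(sharpen(to_grayscale(rgb)), size))


# Usage: a uniform mid-gray 100 x 80 image stays uniform after sharpening.
demo = preprocess([[(128, 128, 128)] * 80 for _ in range(100)])
```

In practice an image library would replace these loops; the sketch only makes the order of operations and the final value range explicit.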

Figure 2. Raw image pre-processing with (a) sharpening kernel.

Hand landmark detection

Google’s hand landmark model has an RGB input channel and an image size of (224 × 224 × 3). So, we have taken the RGB images, converted the pixel values into float32, and resized all the images into (256 × 256 × 3). After the model is applied, it returns 21 three-dimensional coordinate points. The landmark detection process is shown in Fig. 3 .

Figure 3. Hand landmarks detection and extraction of 21 coordinates.

Working procedure

The whole work is divided into two main parts: raw image processing and hand landmark extraction. After both processing steps had been completed, a custom, lightweight, simple multi-headed CNN model was built to train on both kinds of data. Before passing the features through a fully connected layer for classification, we merged both channels’ features so that the model could choose between the best weights. This working procedure is illustrated in Fig. 4 .

Figure 4. Flow diagram of working procedure.

Model building

In this research, we have used a multi-headed CNN, meaning our model has two input data channels. Before this, we trained on the processed images and the hand landmarks with two separate models for comparison. Google’s model is not the best for “in the wild” situations, so we needed the original images to compensate for the faults in Google’s model. The first head of the model takes the processed images as input, and the second head takes the hand landmark data. On the hand landmarks side, we used two-dimensional convolutional layers with 50 and 25 filters, a (3, 3) kernel, ReLU activation, and stride 1; 2D max pooling with pool size (2, 2); batch normalization; and a dropout layer. On the image side, we used 2D convolutional layers with 32, 64, 128, and 512 filters, a (3, 3) kernel, and ReLU activation; 2D max pooling with pool size (2, 2); batch normalization; and a dropout layer. After the flatten layer of each head, the two heads are concatenated and passed through a dense layer and a dropout layer. Finally, the output dense layer has 24 units with Softmax activation. The model has been compiled with the Adam optimizer and MSE loss and trained for 50 epochs. Figure 5 illustrates the proposed CNN architecture, and Table 2 shows the model details.
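The fusion step (flatten each head, concatenate, then classify with a 24-unit Softmax layer) can be illustrated with a toy forward pass. This is a minimal sketch, not the authors' network: the convolutional heads are replaced by placeholder feature vectors, the intermediate dense and dropout layers are omitted, the weights are random, and the feature sizes and all names are hypothetical.

```python
import math
import random


def dense(x, weights, bias):
    """Fully connected layer: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_and_classify(image_features, landmark_features, weights, bias):
    """Concatenate the flattened features of both heads, then classify."""
    fused = image_features + landmark_features  # list concatenation
    return softmax(dense(fused, weights, bias))


rng = random.Random(0)
NUM_CLASSES = 24  # 24 static ASL letters, as in the paper

# Placeholder outputs of the two heads after flattening (sizes are arbitrary).
image_features = [rng.random() for _ in range(8)]
landmark_features = [rng.random() for _ in range(4)]

n_in = len(image_features) + len(landmark_features)
weights = [[rng.uniform(-1, 1) for _ in range(n_in)]
           for _ in range(NUM_CLASSES)]
bias = [0.0] * NUM_CLASSES

probs = fuse_and_classify(image_features, landmark_features, weights, bias)
predicted_class = max(range(NUM_CLASSES), key=probs.__getitem__)
```

Because the classifier's weight rows span both feature vectors, training can shift weight toward whichever head is more informative for a given class, which is the intuition behind the fusion described above.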

Figure 5. Proposed multi-headed CNN architecture. Bottom values are the number of filters and top values are output shapes.

Training and testing

The input images were augmented to make training more difficult so that the model would not overfit. Image augmentation was done with Image Data Generator using 10° rotation, a 0.1 zoom range, 0.1 width and height shift ranges, and horizontal flip. To further guard against overfitting, we used a dynamic learning rate that monitors the validation accuracy, with patience 5, factor 0.5, and a minimum learning rate of 0.00001. We used 46,023 images for training and 19,725 images for testing. The training versus testing accuracy and loss over 50 epochs are shown in Fig. 6 .
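The dynamic learning rate rule can be sketched as a small plateau-based scheduler. This is an assumption-laden illustration: the paper gives only the patience (5), factor (0.5), and minimum learning rate (0.00001), so the starting rate of 0.001, the class name, and the exact reset behavior are assumed here and may differ from the callback the authors used.

```python
class ReduceOnPlateau:
    """Halve the learning rate when validation accuracy stops improving.

    patience, factor, and min_lr follow the values stated in the text;
    everything else (names, initial rate, reset rule) is an assumption.
    """

    def __init__(self, lr, factor=0.5, patience=5, min_lr=1e-5):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.min_lr = min_lr
        self.best = float("-inf")
        self.wait = 0  # epochs since the last improvement

    def step(self, val_accuracy):
        """Record one epoch's validation accuracy and return the new rate."""
        if val_accuracy > self.best:
            self.best = val_accuracy
            self.wait = 0
        else:
            self.wait += 1
            if self.wait >= self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.wait = 0
        return self.lr


# Usage: accuracy improves twice, then stalls for five epochs.
scheduler = ReduceOnPlateau(lr=1e-3)  # initial rate is assumed
history = [0.90, 0.95] + [0.94] * 5
rates = [scheduler.step(acc) for acc in history]
```

In this run the rate stays at its initial value until the fifth consecutive epoch without improvement, at which point it is halved; repeated plateaus keep halving it until the minimum is reached.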

Figure 6. Training versus testing accuracy and loss for 50 epochs.

For further evaluation, we calculated the precision, recall, and F1 score of the proposed multi-headed CNN model, which shows excellent performance. To compute these values, we first calculated the confusion matrix (shown in Fig. 7 ). When a sample belongs to the positive class and is classified as positive, it is a true positive (TP). When a sample is negative and classified as negative, it is a true negative (TN). If a sample is negative but classified as positive, it is a false positive (FP). When a sample is positive but classified as negative, it is a false negative (FN). From these counts, we can compute precision, recall, and F1 score as follows:

Figure 7. Confusion matrix of the testing dataset. Numerical values on the X and Y axes denote the sequential letters from A = 0 to Y = 24; numbers 9 and 25 are missing because the dataset does not contain the letters J and Z.

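Precision: Precision is the ratio of TP to the total number of predicted positive observations: Precision = TP / (TP + FP).

Recall: Recall is the ratio of TP to the total number of positive observations in the actual class: Recall = TP / (TP + FN).

F1 score: The F1 score is the harmonic mean of precision and recall: F1 = 2 × (Precision × Recall) / (Precision + Recall).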

The Precision, Recall, and F1 score for 24 classes are shown in Table 3 .
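As a worked illustration of these definitions, the per-class values can be derived directly from a confusion matrix. The 3-class matrix below is made-up toy data, not the paper's results, and the function name is hypothetical; rows are assumed to be true classes and columns predicted classes.

```python
def per_class_metrics(confusion):
    """Return (precision, recall, f1) for each class.

    confusion[i][j] is the number of samples whose true class is i
    and whose predicted class is j.
    """
    n = len(confusion)
    results = []
    for k in range(n):
        tp = confusion[k][k]
        fp = sum(confusion[i][k] for i in range(n)) - tp  # column minus TP
        fn = sum(confusion[k]) - tp                       # row minus TP
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        results.append((precision, recall, f1))
    return results


# Toy confusion matrix (illustrative values only).
toy = [[8, 1, 1],
       [2, 7, 1],
       [0, 0, 10]]
metrics = per_class_metrics(toy)
```

For class 0 in this toy matrix, TP = 8, FP = 2, and FN = 2, so precision, recall, and F1 are all 0.8.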

Result analysis

In human action recognition tasks, sign language has an extra advantage as it can be used to communicate efficiently. Many techniques have been developed using image processing, sensor data processing, and motion detection by applying different algorithms and methods such as machine learning and deep learning. Depending on the methodology, researchers have proposed their own ways of classifying sign languages. As technologies develop, we can explore the limitations of previous works and improve accuracy. Ref. 13 proposes a technique for recognizing hand gestures, which are a major part of the sign language vocabulary, using an efficient deep convolutional neural network (CNN) architecture. The proposed CNN design eliminates the need for detection and segmentation of hands from the captured images, decreasing the computational burden faced during hand pose recognition with classical approaches. In our method, we used two input channels, for the images and the hand landmarks, to get more robust data, making the process more efficient with a dynamic learning rate adjustment. In ref. 14 , the presented results were obtained by retraining and testing on the sign language gestures dataset with a convolutional neural network model using Inception v3. The model comprises multiple convolution filter inputs that are processed on the same input. A capsule-based deep neural network sign posture translator for American Sign Language (ASL) fingerspelling (postures) 20 has been introduced, in which the concepts of capsules and pooling are used simultaneously in the network. That study confirms that using pooling and capsule routing in the same network can improve the network's accuracy and convergence speed. In our method, we have used Google's pre-trained model to extract the hand landmarks, almost like transfer learning. We have shown that utilizing two input channels could also improve accuracy.

Moreover, ref. 5 proposed a 3DRCNN model integrating a 3D convolutional neural network (3DCNN) and an enhanced fully connected recurrent neural network (FC-RNN), where the 3DCNN learns multi-modality features from RGB, motion, and depth channels, and the FC-RNN captures the temporal information among short video clips divided from the original video. Consecutive clips with the same semantic meaning are singled out by applying a sliding window approach to segment the clips over the whole video sequence. Another approach combines a CNN with traditional feature extractors and is capable of accurate, real-time hand posture recognition 26 ; the architecture is evaluated on three distinct benchmark datasets and compared with state-of-the-art convolutional neural networks. Extensive experimentation is conducted using binary, grayscale, and depth data and two different validation techniques. The proposed feature fusion-based CNN 31 is shown to perform better across combinations of validation techniques and image representations. Similarly, a fusion-based CNN is demonstrated to improve the recognition rate in our study.

After global motion analysis, the hand gesture image sequence was analyzed for keyframe selection. The video sequences of a given gesture were segmented in the RGB color space before feature extraction. This step took advantage of the colored gloves worn by the signers. Samples of pixel vectors representative of the glove’s color were used to estimate the mean and covariance matrix of the color to be segmented. The segmentation process was thus automated with no user intervention. In the color object tracking method, the video frames were converted into the HSV (Hue-Saturation-Value) color space. Then the pixels matching the tracked color were identified and marked, and the resultant images were converted to binary (grayscale) images. The system identifies image regions corresponding to human skin by binarizing the input image with a proper threshold value. Small regions were then removed from the binarized image by applying a morphological operator, and the remaining regions were selected to obtain a candidate hand image.

In the proposed method, we have used a two-headed CNN to train on the processed input images and the hand landmarks. Though a single image input stream is widely used, two input streams have an advantage. In the classification layer of the CNN, if one input stream gives a false result, it can be complemented by the other stream’s weights, and combining both results can provide a correct outcome. We used this idea and successfully improved the final validation and test results. Before combining the image and hand landmark inputs, we tested both individually and obtained a test accuracy of 96.29% for the image and 98.42% for the hand landmarks. We did not use binarization, as it would affect image backgrounds whose color matches the skin color of the hand. This method is also suitable for in-the-wild situations, as it is not entirely dependent on hand position in the image frame. A comparison of the literature and our work is shown in Table 4 , which shows that our method exceeds most current approaches in accuracy.

Table 5 illustrates that the Combined Model, while having a larger number of parameters and consuming more memory, achieves the highest accuracy of 98.98%. This suggests that the combined approach, which incorporates both image and hand landmark information, is effective for the task when accuracy is the priority. On the other hand, the Hand Landmarks Model, despite having fewer parameters and lower memory consumption, also performs impressively with an accuracy of 98.42%. However, the landmark extraction model trained by Google carries its own error and memory consumption. The Image Model, while consuming less memory, has a slightly lower accuracy of 96.29%. The choice between these models would depend on the specific application requirements, trade-offs between accuracy and resource utilization, and the importance of execution time.

Conclusion

This work proposes a methodology for sign language recognition and classification. Sign language is the core medium of communication between deaf-mute people and others. It is highly applicable in real-world scenarios such as communication, human–computer interaction, security, advanced AI, and much more. For a long time, researchers have been working in this field to build a reliable, low-cost, and publicly available SLR system using different sensors, images, videos, and many other techniques. Many datasets have been used, including numeric sensory, motion, and image datasets. Most datasets are prepared under good lab conditions for experiments, which may not reflect real-world use. For this reason, the Fingerspelling dataset has been used, which contains real-world conditions such as complex backgrounds and uneven image shapes. First, the raw images are processed and resized to 50 × 50 pixels. Then, the hand landmark points are detected and extracted from these hand images. Because the images go through two processing techniques, there are now two data channels. A multi-headed CNN architecture has been proposed for these two data channels. The data has been augmented to avoid overfitting, and dynamic learning rate adjustment has been applied. The prepared data has been divided with a 70–30% train-test split. With the 30% portion of the dataset, a validation accuracy of 98.98% has been achieved. On a dataset of this size, this accuracy is more reliable.

The proposed method has some limitations compared with the literature. Some methods might work with a small number of images, but as we use a simple CNN model, this method requires a good number of images for training. Also, the proposed method depends on the hand landmark extraction model; other hand landmark models may produce different results. In raw image processing, it is possible to detect the hand region to reduce the image size, which may increase the recognition rate and reduce the model training time. We may try this approach in future work. Currently, raw image processing takes a considerable amount of training time because we considered the whole image for training.

Data availability

The dataset used in this paper (ASL Fingerspelling Images (RGB & Depth)) is publicly available at Kaggle on this URL: https://www.kaggle.com/datasets/mrgeislinger/asl-rgb-depth-fingerspelling-spelling-it-out .

References

Anderson, R., Wiryana, F., Ariesta, M. C. & Kusuma, G. P. Sign language recognition application systems for deaf-mute people: A review based on input-process-output. Proced. Comput. Sci. 116 , 441–448. https://doi.org/10.1016/j.procs.2017.10.028 (2017).

Mummadi, C. et al. Real-time and embedded detection of hand gestures with an IMU-based glove. Informatics 5 (2), 28. https://doi.org/10.3390/informatics5020028 (2018).

Hickeys Kinect for Windows - Windows apps. (2022). Accessed 01 January 2023. https://learn.microsoft.com/en-us/windows/apps/design/devices/kinect-for-windows

Rivera-Acosta, M., Ortega-Cisneros, S., Rivera, J. & Sandoval-Ibarra, F. American sign language alphabet recognition using a neuromorphic sensor and an artificial neural network. Sensors 17 (10), 2176. https://doi.org/10.3390/s17102176 (2017).

Ye, Y., Tian, Y., Huenerfauth, M., & Liu, J. Recognizing American Sign Language Gestures from Within Continuous Videos. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) , 2145–214509 (IEEE, 2018). https://doi.org/10.1109/CVPRW.2018.00280 .

Ameen, S. & Vadera, S. A convolutional neural network to classify American Sign Language fingerspelling from depth and colour images. Expert Syst. 34 (3), e12197. https://doi.org/10.1111/exsy.12197 (2017).

Sykora, P., Kamencay, P. & Hudec, R. Comparison of SIFT and SURF methods for use on hand gesture recognition based on depth map. AASRI Proc. 9 , 19–24. https://doi.org/10.1016/j.aasri.2014.09.005 (2014).

Sahoo, A. K., Mishra, G. S. & Ravulakollu, K. K. Sign language recognition: State of the art. ARPN J. Eng. Appl. Sci. 9 (2), 116–134 (2014).

Mitra, S. & Acharya, T. Gesture recognition: A survey. IEEE Trans. Syst. Man Cybern. Part C 37 (3), 311–324. https://doi.org/10.1109/TSMCC.2007.893280 (2007).

Rautaray, S. S. & Agrawal, A. Vision based hand gesture recognition for human computer interaction: A survey. Artif. Intell. Rev. 43 (1), 1–54. https://doi.org/10.1007/s10462-012-9356-9 (2015).

Amir, A. et al. A low power, fully event-based gesture recognition system. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , 7388–7397 (IEEE, 2017). https://doi.org/10.1109/CVPR.2017.781 .

Lee, J. H. et al. Real-time gesture interface based on event-driven processing from stereo silicon retinas. IEEE Trans. Neural Netw. Learn Syst. 25 (12), 2250–2263. https://doi.org/10.1109/TNNLS.2014.2308551 (2014).

Adithya, V. & Rajesh, R. A deep convolutional neural network approach for static hand gesture recognition. Proc. Comput. Sci. 171 , 2353–2361. https://doi.org/10.1016/j.procs.2020.04.255 (2020).

Das, A., Gawde, S., Suratwala, K., & Kalbande, D. Sign language recognition using deep learning on custom processed static gesture images. In 2018 International Conference on Smart City and Emerging Technology (ICSCET) , 1–6 (IEEE, 2018). https://doi.org/10.1109/ICSCET.2018.8537248 .

Pathan, R. K. et al. Breast cancer classification by using multi-headed convolutional neural network modeling. Healthcare 10 (12), 2367. https://doi.org/10.3390/healthcare10122367 (2022).

Lecun, Y., Bottou, L., Bengio, Y. & Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 86 (11), 2278–2324. https://doi.org/10.1109/5.726791 (1998).

Collobert, R., & Weston, J. A unified architecture for natural language processing. In Proceedings of the 25th international conference on Machine learning—ICML ’08 , 160–167 (ACM Press, 2008). https://doi.org/10.1145/1390156.1390177 .

Farabet, C., Couprie, C., Najman, L. & LeCun, Y. Learning hierarchical features for scene labeling. IEEE Trans. Pattern Anal. Mach. Intell. 35 (8), 1915–1929. https://doi.org/10.1109/TPAMI.2012.231 (2013).

Xie, B., He, X. & Li, Y. RGB-D static gesture recognition based on convolutional neural network. J. Eng. 2018 (16), 1515–1520. https://doi.org/10.1049/joe.2018.8327 (2018).

Jalal, M. A., Chen, R., Moore, R. K., & Mihaylova, L. American sign language posture understanding with deep neural networks. In 2018 21st International Conference on Information Fusion (FUSION) , 573–579 (IEEE, 2018).

Shanta, S. S., Anwar, S. T., & Kabir, M. R. Bangla Sign Language Detection Using SIFT and CNN. In 2018 9th International Conference on Computing, Communication and Networking Technologies (ICCCNT) , 1–6 (IEEE, 2018). https://doi.org/10.1109/ICCCNT.2018.8493915 .

Sharma, A., Mittal, A., Singh, S. & Awatramani, V. Hand gesture recognition using image processing and feature extraction techniques. Proc. Comput. Sci. 173 , 181–190. https://doi.org/10.1016/j.procs.2020.06.022 (2020).

Ren, S., He, K., Girshick, R., & Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process Syst. , 28 (2015).

Rastgoo, R., Kiani, K. & Escalera, S. Multi-modal deep hand sign language recognition in still images using restricted Boltzmann machine. Entropy 20 (11), 809. https://doi.org/10.3390/e20110809 (2018).

Jhuang, H., Serre, T., Wolf, L., & Poggio, T. A biologically inspired system for action recognition. In 2007 IEEE 11th International Conference on Computer Vision , 1–8. (IEEE, 2007) https://doi.org/10.1109/ICCV.2007.4408988 .

Ji, S., Xu, W., Yang, M. & Yu, K. 3D convolutional neural networks for human action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 35 (1), 221–231. https://doi.org/10.1109/TPAMI.2012.59 (2013).

Huang, J., Zhou, W., Li, H., & Li, W. Sign language recognition using 3D convolutional neural networks. In 2015 IEEE International Conference on Multimedia and Expo (ICME) , 1–6 (IEEE, 2015). https://doi.org/10.1109/ICME.2015.7177428 .

Digital worlds that feel human Ultraleap. Accessed 01 January 2023. Available: https://www.leapmotion.com/

Huang, F., & Huang, S. Interpreting American sign language with Kinect. Journal of Deaf Studies and Deaf Education (Oxford University Press, 2011).

Pugeault, N., & Bowden, R. Spelling it out: Real-time ASL fingerspelling recognition. In 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops) , 1114–1119 (IEEE, 2011). https://doi.org/10.1109/ICCVW.2011.6130290 .

Rahim, M. A., Islam, M. R. & Shin, J. Non-touch sign word recognition based on dynamic hand gesture using hybrid segmentation and CNN feature fusion. Appl. Sci. 9 (18), 3790. https://doi.org/10.3390/app9183790 (2019).

“ASL Alphabet.” Accessed 01 Jan, 2023. https://www.kaggle.com/grassknoted/asl-alphabet


Funding was provided by the American University of the Middle East, Egaila, Kuwait.

Author information

Authors and affiliations.

Department of Computing and Information Systems, School of Engineering and Technology, Sunway University, 47500, Bandar Sunway, Selangor, Malaysia

Refat Khan Pathan

Department of Computer Science and Engineering, BGC Trust University Bangladesh, Chittagong, 4381, Bangladesh

Munmun Biswas

Department of Computer and Information Science, Graduate School of Engineering, Tokyo University of Agriculture and Technology, Koganei, Tokyo, 184-0012, Japan

Suraiya Yasmin

Centre for Applied Physics and Radiation Technologies, School of Engineering and Technology, Sunway University, 47500, Bandar Sunway, Selangor, Malaysia

Mayeen Uddin Khandaker

Faculty of Graduate Studies, Daffodil International University, Daffodil Smart City, Birulia, Savar, Dhaka, 1216, Bangladesh

College of Engineering and Technology, American University of the Middle East, Egaila, Kuwait

Mohammad Salman & Ahmed A. F. Youssef


Contributions

R.K.P. and M.B. conceptualization; R.K.P. methodology; R.K.P. software and coding; M.B. and R.K.P. validation; R.K.P. and M.B. formal analysis; R.K.P., S.Y., and M.B. investigation; S.Y. and R.K.P. resources; R.K.P. and M.B. data curation; S.Y., R.K.P., and M.B. writing—original draft preparation; S.Y., R.K.P., M.B., M.U.K., M.S., and A.A.F.Y. writing—review and editing; R.K.P. and M.U.K. visualization; M.U.K. and M.B. supervision; M.B., M.S. and A.A.F.Y. project administration; M.S. and A.A.F.Y. funding acquisition.

Corresponding author

Correspondence to Mayeen Uddin Khandaker.

Ethics declarations

Competing interests.

The authors declare no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article.

Pathan, R.K., Biswas, M., Yasmin, S. et al. Sign language recognition using the fusion of image and hand landmarks through multi-headed convolutional neural network. Sci Rep 13 , 16975 (2023). https://doi.org/10.1038/s41598-023-43852-x


Received : 04 March 2023

Accepted : 29 September 2023

Published : 09 October 2023

DOI : https://doi.org/10.1038/s41598-023-43852-x


  • Sensors (Basel)


Artificial Intelligence Technologies for Sign Language

AI technologies can play an important role in breaking down the communication barriers of deaf or hearing-impaired people with other communities, contributing significantly to their social inclusion. Recent advances in both sensing technologies and AI algorithms have paved the way for the development of various applications aiming at fulfilling the needs of deaf and hearing-impaired communities. To this end, this survey aims to provide a comprehensive review of state-of-the-art methods in sign language capturing, recognition, translation and representation, pinpointing their advantages and limitations. In addition, the survey presents a number of applications and discusses the main challenges in the field of sign language technologies. Future research directions are also proposed in order to assist prospective researchers in further advancing the field.

1. Introduction

Sign language (SL) is the main means of communication between hearing-impaired people and other communities and it is expressed through manual (i.e., body and hand motions) and non-manual (i.e., facial expressions) features. These features are combined together to form utterances that convey the meaning of words or sentences [ 1 ]. Being able to capture and understand the relation between utterances and words is crucial for the Deaf community in order to guide us to an era where the translation between utterances and words can be achieved automatically [ 2 ]. The research community has long identified the need for developing sign language technologies to facilitate the communication and social inclusion of hearing-impaired people. Although the development of such technologies can be really challenging due to the existence of numerous sign languages and the lack of large annotated datasets, the recent advances in AI and machine learning have played a significant role towards automating and enhancing such technologies.

Sign language technologies cover a wide spectrum, ranging from the capturing of signs to their realistic representation in order to facilitate the communication between hearing-impaired people, as well as the communication between hearing-impaired and speaking people. More specifically, sign language capturing involves the accurate extraction of body, hand and mouth expressions using appropriate sensing devices in marker-less or marker-based setups. The accuracy of sign language capturing technologies is currently limited by the resolution and discrimination ability of sensors and the fact that occlusions and fast hand movements pose significant challenges to the accurate capturing of signs. Sign language recognition (SLR) involves the development of powerful machine learning algorithms to robustly classify human articulations to isolated signs or continuous sentences. Current limitations in SLR lie in the lack of large annotated datasets that greatly affect the accuracy and generalization ability of SLR methods, as well as the difficulty in identifying sign boundaries in continuous SLR scenarios.

On the other hand, sign language translation (SLT) involves the translation between different sign languages, as well as the translation between sign and spoken languages. SLT methods employ sequence-based machine learning algorithms and aim to bridge the communication gap between people signing or speaking different languages. The difficulties in SLT lie in the lack of multilingual sign language datasets, as well as in the inaccuracies of SLR methods, considering that gloss recognition (performed by SLR methods) is the initial step of SLT methods. Finally, sign language representation involves the accurate representation and reproduction of signs using realistic avatars or signed video approaches. Currently, avatar movements are deemed unnatural and hard to understand by the Deaf community due to inaccuracies in skeletal pose capturing and the lack of life-like features in the appearance of avatars.

Sign language technologies are interconnected and affect one another, as seen in Figure 1. The accurate extraction of hand and body motions, as well as facial expressions, plays a crucial role in the success of the machine learning algorithms that are responsible for the robust recognition of signs. Moreover, accurate sign language recognition significantly affects the performance of sign language translation and representation methods. The breakthroughs in sensing devices and AI have paved the way for the development of sign language applications that can greatly assist hearing-impaired people in their everyday life.

Figure 1. Sign language technologies.

Previous literature reviews mainly concentrate on specific sign language technologies, such as video-based and sensor-based sign language recognition [ 3 , 4 , 5 , 6 , 7 ] and sign language translation [ 8 , 9 ]. Lately, with the development of sign language applications, there are also reviews that present sign language systems that support hearing-impaired people in teaching and learning, as well as voice and text interpretation systems [ 10 , 11 ]. However, there is no systematic review that presents all sign language technologies and their relations with each other. This review aims to fill this gap by presenting the advances of AI in all sign language technologies, ranging from capturing and recognition to translation and representation, and concludes by describing recent sign language applications that can considerably facilitate the communication between hearing-impaired and speaking people. The main purpose of this review is to demonstrate the importance of using AI technologies in sign language to support deaf and hearing-impaired people in their communication with other communities. In addition, this review aims at familiarizing researchers with the state-of-the-art in all sign language technologies and at proposing future research directions that can facilitate the development of even more accurate approaches that can lead to mainstream products for the Deaf community. More specifically, the objectives of this review can be summarized as follows:

  • A comprehensive overview of the use of AI technologies in various sign language tasks (i.e., capturing, recognition, translation and representation), along with their importance to their field, is provided.
  • The advantages and limitations of modern sign language technologies and the relations between them are discussed and explored.
  • Possible future directions in the development of AI technologies for sign language are suggested to facilitate prospective researchers in the field.

The rest of this survey is organized as follows. In Section 2, the literature search guideline is presented. Sign language capturing sensors are described in Section 3. In Section 4, sign language recognition methods are categorized and discussed. Sign language representation approaches and applications are presented in Section 5 and Section 6, respectively. Finally, conclusions and potential future research directions are highlighted in Section 7.

2. Literature Search

A systematic literature search was performed by adopting the PRISMA guidelines [ 12 ]. The articles were extracted in June 2021 from three academic databases, namely Scopus (https://www.scopus.com/home.uri, accessed on 28 May 2021), ProQuest (https://www.proquest.com/, accessed on 28 May 2021) and IEEE Xplore (https://ieeexplore.ieee.org/Xplore/home.jsp, accessed on 28 May 2021). Articles that were not peer-reviewed or not written in English were discarded. Since this review deals with AI technologies for sign language, the search was based on the following condition:

TITLE-ABSTRACT-KEYWORDS ( sign AND language AND ( recognition OR application(*) OR avatar(*) OR representation(*) OR translation OR captur(*) OR generation OR production ) ) AND PUBLISH YEAR > 2018 AND ( LIMIT-TO ( DOCTYPE , "ar" ) OR LIMIT-TO ( DOCTYPE , "cp" ) OR LIMIT-TO ( DOCTYPE , "ch" ) ) AND ( LIMIT-TO ( LANGUAGE , "English" ) ) AND ( LIMIT-TO ( PUBSTAGE , "final" ) ) AND ( LIMIT-TO ( SUBJAREA , "COMP" ) OR LIMIT-TO ( SUBJAREA , "ENGI" ) )

The aforementioned search condition requires the presence of the above words (i.e., recognition, translation, etc.) in the title, abstract or keywords of the literature works. In this context, (*) allows for variations in the search terms (e.g., captur(*) matches words such as capture, capturing, etc.). In addition, the search is restricted to papers published after 2018, since the field is evolving at a fast pace and older methods quickly become obsolete. To this end, this review aims to present only the latest and best works related to sign language technologies. Finally, the papers included in this review have been published as journal articles, conference proceedings and book chapters (i.e., DOCTYPE) in the fields of computing and engineering (i.e., SUBJAREA).
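To illustrate how the wildcard terms behave, the keyword and year parts of the condition can be approximated with regular expressions. This is only a minimal sketch and not the query engine of Scopus or any other database: the function name `matches_query`, the pattern lists and the sample strings are hypothetical, and "(*)" is modelled simply as "any further word characters may follow".

```python
import re

# Both words must appear somewhere in the title, abstract or keywords.
REQUIRED = [r"\bsign\b", r"\blanguage\b"]

# At least one topic term must appear; "\w*" approximates the "(*)" wildcard.
ANY_OF = [
    r"\brecognition\b",
    r"\bapplication\w*",
    r"\bavatar\w*",
    r"\brepresentation\w*",
    r"\btranslation\b",
    r"\bcaptur\w*",
    r"\bgeneration\b",
    r"\bproduction\b",
]


def matches_query(text: str, year: int) -> bool:
    """Check a record's text (title + abstract + keywords) and publication year."""
    lowered = text.lower()
    has_required = all(re.search(pattern, lowered) for pattern in REQUIRED)
    has_topic = any(re.search(pattern, lowered) for pattern in ANY_OF)
    return has_required and has_topic and year > 2018


print(matches_query("Capturing sign language with depth sensors", 2020))
print(matches_query("Sign language recognition with CNNs", 2018))
```

The document type, language, publication stage and subject area filters of the original condition are database-specific and are left out of this sketch.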

The number of the records retrieved from the three databases is 2368. From this number, 331 duplicate records are removed, leading to 2037 unique records. After screening title, abstract and finally the full text with various criteria to discard irrelevant records, 106 records remain and are included in this review. The selection procedure is depicted in Figure 2 .
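The record counts above can be checked with a few lines of arithmetic. The variable names below are illustrative only; the number of records excluded during screening is not stated in the text and is derived here from the reported figures.

```python
# Counts reported for the systematic search.
retrieved = 2368   # records returned by the three databases
duplicates = 331   # duplicate records removed
included = 106     # records kept after title, abstract and full-text screening

# Derived quantities.
unique = retrieved - duplicates
excluded_in_screening = unique - included

print(unique)                 # unique records after duplicate removal
print(excluded_in_screening)  # records discarded during screening (derived)
```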

Figure 2. Flowchart of the systematic literature search process.

3. Sign Language Capturing

Sign language capturing involves the recording of sign gestures using appropriate sensor setups. The purpose is to capture discriminative information from the signs that will allow the study, recognition and 3D representation of signs at later stages. Moreover, sign language capturing enables the construction of large datasets that can be used to accurately train and evaluate machine learning sign language recognition and translation algorithms.

3.1. Capturing Sensors

The most common means of recording sign gestures is through visual sensors that are able to capture fine-grained information, such as facial expressions and body postures, that is crucial for understanding sign language. Cerna et al. in [ 13 ] employed a Kinect sensor [ 14 ] to simultaneously capture red-green-blue (RGB) image, depth and skeletal information towards the recording of a multimodal dataset with Brazilian sign language. Similarly, Kosmopoulos et al. in [ 15 ] captured realistic real-life scenarios with sign language using the Kinect sensor. The dataset contains isolated and continuous sign language recordings with RGB, depth and skeletal information, along with annotated hand and facial features. Contrary to the previous methods that use a single Kinect sensor, this work additionally employs a machine vision camera, along with a television screen, for sign demonstration. Sincan et al. in [ 16 ] captured isolated Turkish sign language glosses using Kinect sensors with a large variety of indoor and outdoor backgrounds, revealing the importance of capturing videos with various backgrounds. Adaloglou et al. in [ 17 ] created a large sign language dataset with a RealSense D435 sensor that records both RGB and depth information. The dataset contains continuous and isolated sign videos and is appropriate for both isolated and continuous sign language recognition tasks.

Another sensor that has been employed for sign language capturing is Leap Motion, which has the ability to capture 3D positions of hand and fingers at the expense of having to operate close to the subject. Mittal et al. in [ 18 ] employed this type of sensor to record sign language gestures. Other setups with antennas and readers of radio-frequency identification (RFID) signals have also been adopted for sign language recognition. Meng et al. in [ 19 ] extracted phase characteristics of RFID signals to detect and recognize sign gestures. The training setup consists of an RFID reader, an RFID tag and a directional antenna. The recorded person should stand between the reader and the tag for proper capturing. Moreover, the recognition system is signer-dependent.

On the other hand, wearable sensors have been adopted for capturing sign language gestures. Galea et al. in [ 20 ] used electromyography (EMG) to capture the electrical activity produced during arm movement. The Thalmic MYO armband device was used for the recording of the Irish sign language alphabet. Similarly, Zhang et al. [ 21 ] used a wearable device to capture EMG and inertial measurement unit (IMU) signals, and they used a convolutional neural network (CNN) [ 22 ] followed by a long short-term memory (LSTM) [ 23 ] architecture to recognize American sign language at both word and sentence levels. One disadvantage of the method is that its performance has not been evaluated under walking conditions. Hou et al. in [ 24 ] proposed Sign-Speaker, which was deployed on a smartwatch to collect sign signals. Then, these signals were sent to a smartphone and were translated into spoken language in real-time. This method requires a very simple capturing setup, consisting of a smartwatch and a smartphone. However, the system recognizes a limited number of signs and cannot generalize well to new users. Wang et al. in [ 25 ] employed a system with two armbands using both IMU and EMG sensors in order to capture fine-grained finger and hand positions and movements. How et al. in [ 26 ] used a low-cost dataglove with IMU sensors to capture sign gestures that were transmitted through Bluetooth to a smartphone device. Nevertheless, the use of a single right-hand dataglove limited the number of signs that could be performed with this setup.

Each of the aforementioned sensor setups for sign language capturing has different characteristics, which make it suitable for different applications. Kinect sensors provide high resolution RGB and depth information, but their accuracy is restricted by the distance from the sensors. Leap Motion also requires a small distance between the sensor and the subject, but its low computational requirements enable its usage in real-time applications. Multi-camera setups are capable of providing highly accurate results at the expense of increased complexity and computational requirements. A Myo armband that can detect EMG and inertial signals is also used in a few works, but the inertial signals may be distorted by body motions when people are walking. Smartwatches are very popular nowadays and can also be used for sign language capturing, but their output can be quite noisy due to unexpected body movements. Finally, datagloves can provide highly accurate sign language capturing results in real-time. However, the tuning of their components (i.e., flex sensor, accelerometer, gyroscope) may require a trial and error process that is impractical and time-consuming. In addition, signers tend not to prefer datagloves for sign language capturing, as they are considered invasive.

3.2. Datasets

Datasets are crucial for the performance of methodologies regarding sign language recognition, translation and synthesis and as a result a lot of attention has been drawn towards the accurate capturing of signs and their meticulous annotation. The majority of the existing publicly available datasets are captured with visual sensors and are presented below.

3.2.1. Continuous Sign Language Recognition Datasets

Continuous sign language recognition (CSLR) datasets contain videos of sequences of signs instead of individual signs and are more suitable for developing real-life applications. Phoenix-2014 [ 27 ] is one of the most popular CSLR datasets, with recordings of weather forecasts in German sign language. All videos were recorded with 9 signers at a frame rate of 25 frames per second. The dictionary has 1081 unique glosses and the dataset contains 5672 videos for training, 540 videos for validation and 629 videos for testing. The same authors created an updated version of Phoenix-2014, called Phoenix-2014-T [ 28 ], with spoken language translations, which makes it appropriate for both CSLR and sign language translation experiments. It contains 8257 videos from 9 different signers performing 1088 unique signs and 2887 unique words. Although all recordings are performed in a controlled environment, Phoenix-2014 and Phoenix-2014-T are both challenging datasets with large vocabularies and a varying number of samples per sign, with a few signs having a single sample. Similarly, BSL-1K [ 29 ] contains video recordings from British news broadcasts, along with automatically extracted annotations from provided subtitles. It is a large database with 273,000 samples from 40 signers that is also used for sign language segmentation. Another notable dataset is CSL [ 30 , 31 ], which contains Chinese words widely used in daily communication. The dataset has 100 sentences with signs that were performed by 50 signers. The recordings are performed in a lab with predefined conditions (i.e., background, lighting). The vocabulary size is 178 words that are performed multiple times, resulting in high recognition results achieved by SLR methods. GRSL [ 15 ] is another CSLR dataset of Greek sign language that is used in home care services, which contains multiple modalities, such as RGB, depth and skeletal joints. 
On the other hand, GSL [ 17 ] is a large Greek sign language dataset created to assist communication of Deaf people with public service employees. The dataset was created with a RealSense D435 sensor that records both RGB and depth information. Furthermore, it contains both continuous and isolated sign videos from 15 predefined scenarios. It is recorded in a laboratory environment, where each scenario is repeated five consecutive times.

3.2.2. Isolated Sign Language Recognition Datasets

Isolated sign language recognition (ISLR) datasets are important for identifying and learning discriminative features for sign language recognition. CSL-500 [ 31 , 32 ] is the isolated version of CSL, but it contains 500 unique glosses performed by the same 50 signers. CSLR methods usually adopt this dataset for feature learning prior to finetuning on the CSL dataset. MS-ASL [ 33 ] is another widely employed ISLR dataset with 1000 unique American sign language glosses. It contains recordings of 222 signers collected from the YouTube platform with a large variance in background settings, which makes this dataset suitable for training complex methods with strong representation capabilities. Similarly, WASL [ 34 ] is an ISLR dataset with 2000 unique American sign glosses performed by 119 signers. The videos have different background and illumination conditions, which makes it a challenging ISLR benchmark dataset. On the other hand, AUTSL is a Turkish sign language dataset captured under various indoor and outdoor backgrounds, while LSA64 [ 35 ] is an Argentinian sign language dataset that includes 3200 videos, in which 10 non-expert subjects execute 5 repetitions of 64 different types of signs. LSA64 is a small and relatively easy dataset, where SLR methods achieve outstanding recognition performance. Finally, IsoGD [ 36 ] is a gesture recognition dataset that consists of 47,933 RGB-D videos performed by 21 different individuals and contains 249 gesture labels. Although IsoGD is a gesture recognition dataset, its large size and challenging illumination and background conditions allow the training of highly accurate ISLR methods.

3.2.3. Discussion

A discussion about the aforementioned datasets can be made at this stage, while a detailed overview of the dataset characteristics is provided in Table 1. It can be seen that over time datasets become larger in size (i.e., number of samples) with more signers involved in them, and contain high resolution videos captured under various and challenging illumination and background conditions. Moreover, new datasets usually include different modalities (i.e., RGB, depth and skeleton). Recording sign language videos using many signers is very important, since each person performs signs with different speed, body posture and facial expression. Moreover, high resolution videos capture small but important details more clearly, such as finger movements and facial expressions, which are crucial cues for sign language understanding. Datasets with videos captured under different conditions enable deep networks to extract highly discriminative features for sign language classification. As a result, methodologies trained on such datasets can obtain greatly enhanced representation and generalization capabilities and achieve high recognition performance. Furthermore, although RGB information is the predominant modality used for sign language recognition, additional modalities, such as skeleton and depth information, can provide complementary information to the RGB modality and significantly improve the performance of SLR methods.

Table 1. Large-scale publicly available SLR datasets.

4. Sign Language Recognition

Sign language recognition (SLR) is the task of recognizing sign language glosses from video streams. It is a very important research area since it can bridge the communication gap between hearing and Deaf people, facilitating the social inclusion of hearing-impaired people. Moreover, sign language recognition can be classified into isolated and continuous based on whether the video streams contain an isolated gloss or a gloss sequence that corresponds to a sentence.

4.1. Continuous Sign Language Recognition

Continuous Sign Language Recognition aims at classifying signed videos to entire sentences (i.e., ordered sequences of glosses). CSLR is a very challenging task as it requires the recognition of glosses from video sequences without any knowledge of the sign boundaries (i.e., lack of ground truth annotations regarding the start and end of glosses). Most works adopt 2D or 3D-CNNs for feature extraction followed by temporal convolutional networks or recurrent neural networks (RNNs) for sequential information modelling. To measure CSLR performance, word error rate (WER) [ 38 ] is commonly adopted. WER measures the minimum number of operations (i.e., substitutions, deletions and insertions) required to transform the predicted sequence into the target sequence, normalized by the length of the target sequence.
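The WER computation can be sketched with a standard edit-distance table. This is a minimal illustration rather than the evaluation script of any particular benchmark; the function name `word_error_rate` and the example gloss sequences are made up for the sketch.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / len(reference)."""
    if not reference:
        raise ValueError("reference sequence must not be empty")
    n, m = len(reference), len(hypothesis)
    # dist[i][j]: minimum number of edits between reference[:i] and hypothesis[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i  # i reference glosses missing from an empty hypothesis
    for j in range(m + 1):
        dist[0][j] = j  # j extra glosses in the hypothesis
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitution = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,                 # deletion
                dist[i][j - 1] + 1,                 # insertion
                dist[i - 1][j - 1] + substitution,  # match or substitution
            )
    return dist[n][m] / n


reference = ["TOMORROW", "RAIN", "NORTH", "WIND"]
hypothesis = ["TOMORROW", "SUN", "NORTH"]
print(word_error_rate(reference, hypothesis))  # one substitution + one deletion over 4 glosses
```

Because the edit count is divided by the reference length, WER can exceed 100% when the prediction contains many insertions.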

Cui et al. [ 39 ] adopted a 2D-CNN followed by temporal 1D convolutional layers for feature extraction. The extracted spatio-temporal features were fed to a bidirectional long short-term memory (BLSTM) network for modelling the context of the entire sequence. The feature extractor was extended with a classifier and trained in a fully-supervised setting on isolated glosses for video to gloss alignment, while the BLSTM was used for CSLR. This two-step optimization process was conducted iteratively with Connectionist Temporal Classification (CTC) [ 40 ] and Cross-Entropy losses, until the network converged. In addition, the recognition model fused RGB with optical flow modalities and achieved a WER of 22.8% on the Phoenix-2014 dataset. Similarly, Koishybay et al. in [ 41 ] adopted a residual 2D-CNN with cascaded 1D convolutional layers for feature extraction, while a BLSTM was utilized for the CSLR experiments. Their method generated gloss-level alignments using the Levenshtein distance in order to fine-tune the feature extractor. However, the authors stated that during the early iterations the model predicted poor alignment proposals, which hindered the training process and required several iterations to converge. Cheng et al. in [ 42 ] proposed a 2D fully convolutional network with a feature enhancement module that did not require iterative training. Instead, it provided extra supervision and assisted the CSLR network to learn better gloss alignments. Niu et al. in [ 43 ] proposed a 2D-CNN followed by a Transformer network for CSLR. They used three stochastic methods to drop frames of the input video, to randomly stop gradients of back-propagation and to model glosses using hidden states, respectively, which led to better CSLR performance. Nevertheless, the randomness ratio of these stochastic processes must be tuned carefully to achieve good recognition rates. Generally, CSLR methods based on 2D-CNNs achieve great recognition performance. 
More specifically, 2D-CNNs extract descriptive features from the frame sequences, while the sequence modelling mechanisms efficiently align the input video and the output predictions. However, they usually require complex training strategies, such as iterative optimization techniques, to achieve strong feature extraction capabilities.
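Several of the methods above rely on CTC to map per-frame predictions to a shorter gloss sequence without known sign boundaries. The sketch below shows only the simplest inference step, greedy (best-path) CTC decoding, and is not taken from any of the cited works. It assumes that class index 0 is the CTC blank and that `frame_scores` holds one score vector per frame; both choices are conventions adopted for this example.

```python
BLANK = 0  # assumed index of the CTC blank symbol


def ctc_greedy_decode(frame_scores, blank=BLANK):
    """Pick the best class per frame, merge consecutive repeats, drop blanks."""
    best_path = [
        max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores
    ]
    decoded = []
    previous = None
    for label in best_path:
        # A label is emitted only when it differs from the previous frame's label
        # and is not the blank; a blank between two equal labels keeps both.
        if label != previous and label != blank:
            decoded.append(label)
        previous = label
    return decoded


# Seven frames, three classes (0 = blank, 1 and 2 = gloss ids).
frame_scores = [
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.9, 0.05, 0.05],
    [0.1, 0.6, 0.3],
    [0.1, 0.2, 0.7],
    [0.2, 0.1, 0.7],
    [0.8, 0.1, 0.1],
]
print(ctc_greedy_decode(frame_scores))  # best path 1,1,0,1,2,2,0 collapses to [1, 1, 2]
```

The frame-to-gloss correspondence produced by such a path is the kind of alignment that the iterative training schemes described above reuse as supervision for the feature extractor.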

On the other hand, some works chose to incorporate attention mechanisms for CSLR. Pan et al. in [ 44 ], used a key-frame sampling technique to extract the most descriptive frames of the video. Then, a vector representation was constructed from the skeletal data of the key-frames, which was fed to an attention-based BLSTM to model the temporal information. Huang et al. [ 45 ] proposed an adaptive encoder-decoder architecture to learn the temporal boundaries of the video. Furthermore, a hierarchical BLSTM with attention over sliding windows was used on the decoder to weigh the importance of the input frames. Li et al. in [ 46 ], used a pyramid structure of BLSTMs in order to find key actions of the video representations, which were produced from the 2D-CNN. Moreover, an attention-based LSTM was used to align the input and output sequences and the whole network was trained jointly with Cross-Entropy and CTC losses.

Recently, the self-attention mechanism has been introduced in a variety of models, such as the Transformer, and has also been adopted by CSLR methods. Slimane et al. in [ 47 ] proposed two data streams with cropped hand images and full images. The two modalities were passed through two 2D-CNNs to extract the spatial features. Then, the modalities were synchronized by a self-attention module to obtain better contextual information and generate efficient video representations for CSLR. Zhou et al. [ 48 ] adopted a fully-inception architecture with 2D and 1D convolutional layers, along with a self-attention mechanism, to further improve the feature extraction capabilities of the inception layers.
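For readers unfamiliar with self-attention, the core operation can be sketched in a few lines. This is a simplified scaled dot-product self-attention over a sequence of frame features, written in plain Python for clarity. It is an assumption-laden illustration: the learned query, key and value projections used in Transformer models are omitted (identity projections), and it does not reproduce the architecture of any cited method.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(features):
    """features: list of T frame vectors of equal length d.

    Returns (outputs, weights), where weights[t][s] is how strongly
    frame t attends to frame s and outputs[t] is the weighted sum of frames.
    """
    dim = len(features[0])
    scale = math.sqrt(dim)
    outputs, weights = [], []
    for query in features:
        scores = [
            sum(q * k for q, k in zip(query, key)) / scale for key in features
        ]
        row = softmax(scores)
        weights.append(row)
        outputs.append(
            [sum(w * value[i] for w, value in zip(row, features)) for i in range(dim)]
        )
    return outputs, weights


frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
outputs, weights = self_attention(frames)
print(weights[0])  # frame 0 attends equally to frames 0 and 2, less to frame 1
```

Each output vector mixes information from all frames, which is how such modules provide the contextual information mentioned above.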

Reinforcement techniques have also been applied for CSLR, along with Transformer networks. Zhang et al. in [ 49 ], adopted a 3D-CNN followed by a Transformer network that was responsible for recognizing gloss sequences from input videos. Instead of training the model with cross-entropy loss, they used the REINFORCE algorithm [ 50 ] to directly optimize the model by using WER as the reward function of the agent (i.e., the feature extractor). Wei et al. in [ 51 ], used a semantic boundary detection algorithm with reinforcement learning to improve CSLR performance. A spatio-temporal feature extractor learned the video representations. Then, the detection algorithm used reinforcement learning to detect gloss timestamps from video sequences and refine the final video representations. The evaluation metric was used again as the reward function. The major limitation of this method is the need for a careful selection of the pooling size, which defines the action search space for the reinforcement learning agent.

Papastratis et al. [ 52 ] constructed a cross-modal approach in order to effectively model intra-gloss dependencies by leveraging information from text. This method extracted video features using a video encoder that consisted of a 2D-CNN followed by temporal convolutions and a BLSTM, while text representations were obtained from an LSTM. Finally, these embeddings were aligned in a joint latent space. The improved representations led to great CSLR performance, achieving WERs of 24.0% and 3.52% on Phoenix-2014 and GSL SI, respectively. Papastratis et al. in their latest work [ 53 ], employed a generative adversarial network to evaluate the predictions of the video encoder. In addition, contextual information was incorporated to improve recognition performance on sign language conversations.

Due to their efficient feature extraction capabilities, 3D-CNNs have also been adopted by many researchers for CSLR. Wei et al. in [ 54 ], used a 3D residual CNN along with a BLSTM, while they applied the grammatical rules of sign language. The text was split into isolated words and n -grams, which were modelled using two classifiers. The two classifiers aimed to recognize each word independently and based on the context, in contrast to CTC, which models the whole sequence. Pu et al. in [ 55 ], employed a 3D-CNN with an LSTM decoder and a CTC decoder that were jointly aligned with a soft dynamic time warping (soft-DTW) [ 56 ] alignment constraint. The network was trained recursively with the proposed alignments from soft-DTW. The method achieved WERs of 6.1% and 32.7% on CSL Split 1 and CSL Split 2, respectively. Guo et al. in [ 57 ], developed a fully convolutional approach with a 3D-CNN followed by 1D temporal convolutional layers. The 1D CNN block had a hierarchical structure with small and large receptive fields to capture short- and long-term correlations in the video, while the entire architecture was trained with CTC loss. 3D-CNNs are computationally expensive methods that require pre-training on large-scale datasets and cannot be tuned directly for CSLR. To this end, sliding window techniques are adopted to create informative features. To tackle this problem, some works incorporated pseudo-labelling, which is an optimization process that adds predicted labels on the training set. Pei et al. in [ 58 ], trained a deep 3D-CNN with CTC and generated clip-level pseudo-labels from the alignment of CTC to obtain better feature representations. To improve the quality of pseudo-labels, Zhou et al. in [ 59 ], proposed a dynamic decoding method instead of greedy decoding to find better alignment paths and filter out the wrong pseudo-labels. 
Their method applied the I3D [ 60 ] network from the action recognition field along with temporal convolutions and bidirectional gated recurrent units (BGRU) [ 61 ]. Moreover, the proposed method achieved a WER of 34.5% on the Phoenix-2014 dataset. However, pseudo-labelling required many iterations, while initial labels affected the convergence of the optimization process.

In Table 2 , several methods are compared on the test set of the most commonly adopted datasets for continuous sign language recognition. From the experimental results it is shown that multi-modal methods achieve the lowest WERs. More specifically, STMC [ 62 ] has the best recognition rates on Phoenix-2014, CSL Split 1 and CSL Split 2 datasets using RGB, hands and skeleton modalities, while SLRGAN [ 53 ], employing the RGB and text modality, achieves superior performance on the GSL SI and GSL SD datasets.
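Since the CSLR methods in this section are compared by WER, a short sketch of how the metric is commonly computed may help: the word-level Levenshtein distance between the reference and predicted gloss sequences, divided by the reference length. The function names and the example gloss strings below are illustrative assumptions, and the cited works may use their own evaluation scripts.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate in percent for whitespace-separated gloss strings."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one gloss")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


# Hypothetical gloss sequences: one deletion and one insertion over 4 glosses.
example_wer = wer("MORGEN REGEN NORD WIND", "MORGEN NORD STARK WIND")
print(example_wer)  # 50.0
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.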

Table 2. Performance comparison of CSLR approaches, categorized by dataset and measured in WER (%). The best performance for each dataset appears in bold.

4.2. Isolated Sign Language Recognition

Isolated sign language recognition refers to the task of accurately detecting single sign gestures from videos and thus it is usually tackled similar to action and gesture recognition, as well as other types of video processing and classification tasks with the extraction and learning of highly discriminative features [ 63 , 64 , 65 ]. In the literature, a common approach to the task of isolated sign language recognition is the extraction of hand and mouth regions from the video sequences in an attempt to remove noisy backgrounds that can inhibit classification performance. Liao et al. in [ 66 ], proposed a video-based SLR method that was based on hand region extraction and classification using 3D ResNet networks and BLSTM layers. Similarly, Aly et al. in [ 67 ], developed an ISLR method that segmented hand regions from images using DeepLabv3+ algorithm [ 68 ], extracted features from these regions using a Convolutional Self-Organizing Map and classified the features using a deep recurrent neural network consisting of 3 BLSTM layers. Gökçe et al. in [ 69 ], proposed 3D-CNN networks for the processing of hand, upper body and face image regions and the fusion of these streams in the score level to accurately classify isolated signs. The authors stated that their method performs comparatively worse on mono-morphemic signs performed with a single hand, rather than on temporally more complex signs with two-handed gestures. On the other hand, Zhang et al. in [ 70 ], proposed the Multiple extraction and Multiple prediction (MEMP) network that consists of alternating 3D-CNN networks and Convolutional LSTM layers that extracted spatio-temporal features from video sequences multiple times, enabling the network to achieve 99.06% and 78.85% accuracy in the LSA64 and IsoGD datasets, respectively. Li et al. 
in [ 71 ], proposed an SLR method that transferred cross-domain knowledge from news signs to a base model and improved its performance using domain-invariant features.

To further improve the accuracy and robustness of SLR methods, several researchers proposed the extraction of other types of features, such as optical flow and skeletal joints from visual cues. These multi-stream networks are more computationally expensive than their single stream counterparts, but they have the advantage of overcoming confusing cases regularly met when a single type of features is employed. Sarhan et al. in [ 72 ], proposed a two-stream network architecture that received as input RGB and optical flow data, extracted features using I3D networks and performed late fusion at the score level for accurate sign language recognition. Rastgoo et al. in [ 73 ], proposed a multi-stream SLR method that utilized as input hand image regions, hand heatmaps and 2D projections of hand skeletal joints to images. These input data were processed using 3D-CNN networks, concatenated and fed to LSTM layers for sign recognition. Konstantinidis et al. in [ 74 ], proposed a SLR methodology that was based on the processing and late fusion of body and hand skeletal features using LSTM layers. Apart from the raw joint coordinates, the authors also utilized joint-line distances, which led to a significant improvement in the performance of the method, reaching 98.09% accuracy in the LSA64 dataset. In a later work [ 75 ], the same authors introduced additional streams that processed RGB video sequences and optical flow data, enhancing even more the performance of their method, ultimately achieving 99.84% accuracy in the LSA64 dataset. Similarly, Papadimitriou et al. in [ 76 ], proposed a multi-stream SLR method that processes hand and mouth regions, as well as optical flow and skeletal features for the accurate classification of signs. These features were concatenated and fed to a temporal deformable convolutional attention-based encoder-decoder that predicts the sign class. Gündüz et al. 
in [ 77 ], employed a multi-stream SLR approach that received as input RGB video sequences, optical flow sequences and body and hand skeletal features and performed a late fusion to accurately classify Turkish signs. Bilge et al. in [ 78 ], proposed a SLR method that can generalize well on unseen signs. To achieve this, the authors employed two 3D-CNN networks followed by BLSTM layers for the extraction of short-term and long-term feature representations from body and hand video sequences. In addition, the authors employed a BERT model [ 79 ] for the extraction of textual sign representations from text descriptions of how the signs were performed. Finally, they used a bi-linear compatibility function to associate video and text representations.
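Several of the multi-stream methods above combine their streams through late fusion at the score level. The sketch below illustrates the general idea under simple assumptions: each stream outputs class logits, the logits are turned into probabilities with a softmax, and the probabilities are averaged (optionally with weights). The stream names and logit values are made up, and the cited works may fuse scores differently.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def late_fusion(stream_logits, weights=None):
    """Weighted average of per-stream class probabilities."""
    n_streams = len(stream_logits)
    if weights is None:
        weights = [1.0 / n_streams] * n_streams  # assumed: equal stream weights
    probs = [softmax(logits) for logits in stream_logits]
    n_classes = len(probs[0])
    return [
        sum(w * p[c] for w, p in zip(weights, probs)) for c in range(n_classes)
    ]


# Hypothetical logits for 3 sign classes from three streams.
rgb = [2.0, 1.0, 0.1]
flow = [0.5, 2.5, 0.3]
skeleton = [0.2, 2.0, 0.1]

fused = late_fusion([rgb, flow, skeleton])
predicted = max(range(len(fused)), key=fused.__getitem__)
print(predicted)  # 1
```

Here the RGB stream alone would vote for class 0, but the other two streams outweigh it after fusion.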

In an effort to derive more discriminative features, Rastgoo et al. in [ 63 ], proposed a multi-stream SLR method that gets as input hand regions, 3D hand pose features and Extra Spatial Hand Relation features (i.e., orientation and slope of hands). These features were concatenated and fed to an LSTM layer to derive the sign class. In this way, the authors managed to achieve a really high accuracy of 86.32% in the challenging IsoGD dataset. Kumar et al. in [ 64 ], proposed Spatial 3D Relational Features for sign language recognition. These features were computed from the area and perimeter of polygons formed by quadruples of skeletal joints. Then, the class of a test sign was predicted by comparing the sign with the training set using global alignment kernels. In another work [ 80 ], Kumar et al. introduced two novel features for accurate sign language recognition that were named colour-coded topographical descriptors. These descriptors were formed as images from the computation of joint distances and angles. Finally, these descriptors were processed by 2D CNNs and merged to derive the class of the sign.
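To make the idea of features built from the area and perimeter of polygons over quadruples of skeletal joints more concrete, the following sketch computes both quantities for every 4-joint combination of a 3D skeleton. It is only an illustration under stated assumptions: the joints of each quadruple are connected in index order, and the area is approximated by splitting the quadrilateral into two triangles along one diagonal. The exact construction in [ 64 ] may differ.

```python
import math
from itertools import combinations


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _norm(v):
    return math.sqrt(sum(x * x for x in v))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def triangle_area(p, q, r):
    """Area of a 3D triangle from the cross product of two edges."""
    return 0.5 * _norm(_cross(_sub(q, p), _sub(r, p)))


def quad_features(a, b, c, d):
    """(area, perimeter) of the quadrilateral a-b-c-d, split along diagonal a-c."""
    edges = ((a, b), (b, c), (c, d), (d, a))
    perimeter = sum(_norm(_sub(v, u)) for u, v in edges)
    area = triangle_area(a, b, c) + triangle_area(a, c, d)
    return area, perimeter


def relational_features(joints):
    """One (area, perimeter) pair per quadruple of joints, in index order."""
    return [
        quad_features(*(joints[i] for i in idx))
        for idx in combinations(range(len(joints)), 4)
    ]


# Toy skeleton: four joints on a unit square plus one joint above it.
joints = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0), (0.5, 0.5, 1.0)]
features = relational_features(joints)
print(len(features), features[0])  # 5 (1.0, 4.0)
```

The number of quadruples grows quickly with the number of joints, so a practical system would likely restrict the combinations to a selected subset.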

Recently, the advances in deep learning led several isolated SLR methods to leverage attention mechanisms, transformer networks and graph convolutional networks. Attention mechanisms in particular enable a deep network to pay more attention on features that are important for a classification task and are widely employed by most state-of-the-art SLR methods. Parelli et al. in [ 81 ], proposed a multi-stream SLR method that processes hand and mouth image regions as well as 3D hand skeletal data. All streams were concatenated and fed to an attention CNN network that accurately predicts the class of the sign. Attention LSTM, attention GRU and Transformer networks were also tested but they led to inferior performance. De Amorim et al. in [ 82 ], proposed an American SLR method that extracts skeletal data from video sequences and then processes them using a Spatio-Temporal Graph Convolutional Network (GCN) [ 83 ]. Tunga et al. in [ 84 ], proposed a SLR method that extracts skeletal features from video sequences and then employs a GCN network to model spatial dependencies among the skeletal data, as well as a BERT model to model temporal dependencies among the skeletal data. The two representations were finally merged to derive the class of the sign. A limitation of this approach is that the model cannot differentiate in-plane and out-of-plane movements due to the use of only 2D spatial information. In a similar fashion, Meng et al. in [ 85 ], proposed a GCN with multi-scale attention modules to process the extracted skeletal data and model their long-term spatial and temporal dependencies. In this way, the authors achieved a really high accuracy of 97.36% in the CSL-500 dataset. GCNs are computationally lighter than the image processing networks, but they often cannot extract highly enriched features, thus leading to inferior performance, as noted in [ 82 ].

Finally, the wide adoption of RGB-D sensors for action and gesture recognition has led several researchers to adopt them for multi-modal sign language recognition as well. However, the performance of such multi-modal methodologies is currently limited by the small number of large publicly available RGB-D datasets and the mediocre accuracy of depth information. Tur et al. in [ 86 ], proposed a Siamese deep network for the concurrent processing of RGB and depth sequences. The extracted features were then concatenated and passed to an LSTM layer for isolated sign language recognition. Ravi et al. in [ 87 ], proposed a multi-modal SLR method that was based on the processing of RGB, depth and optical flow sequences. Each stream employed CNN layers to process the sequences and then, all features were fused together and fed to a CNN model for classification. Rastgoo et al. in [ 88 ], proposed a multi-modal SLR method that leverages RGB and depth video sequences to achieve an accuracy of 86.1% in the IsoGD dataset. More specifically, the authors extracted pixel-level, optical flow, deep hand and hand pose features for each modality, concatenated these features across both modalities and classified them to sign classes using an LSTM layer. The authors stated that there were signs with similar appearance and motion features that led to misclassification errors and thus they proposed the use of augmentation strategies, high capacity networks and more data samples.

Huang et al. in [ 89 ], proposed the use of RGB, depth and skeletal data as input to attention-based 3D-CNNs and attention-based BLSTMs in order for the proposed SLR method to pay attention to spatio-temporal dependencies in the input data and fuse the input streams in an optimal way. Huang et al. in [ 90 ], proposed a sequence-to-sequence approach that detects key frames to remove noisy information from video sequences. Then, they extracted CNN features from these key frames, histogram-of-gradients (HOG) features from depth motion maps and trajectory features from skeletal data. These features were finally concatenated and fed to an encoder-decoder LSTM network that predicted sub-words that form the signed word. Zhang et al. in [ 91 ], proposed a highly accurate SLR method that initially selected pairs of aligned RGB-D images to reduce redundancy. Then, the proposed method computed discriminative features from hand regions using a spatial stream and extracted depth motion features using a temporal stream. Both streams were finally fused by a convolutional fusion layer and the output feature vector was used for classification. The authors reported that occlusions and surface materials can significantly affect the quality of depth images, degrading the performance of their model. Common failure cases among most ISLR methodologies are the difficulty in differentiating signs when they are performed differently by different users and the inability to accurately classify signs with similar hand shapes and positions. An overview of the performance of ISLR methods on well-known datasets is presented in Table 3 .

Table 3. Performance of ISLR methods on well-known datasets. The best performance for each dataset appears in bold.

4.3. Sign Language Translation

Sign Language Translation is the task of translating videos with sign language into spoken language by modeling not only the glosses but also the language structure and grammar. It is an important research area that facilitates the communication between the Deaf and other communities. Moreover, the SLT task is more challenging compared to CSLR due to the additional linguistic rules and the representation of spoken languages. SLT methods are usually evaluated using the bilingual evaluation understudy (BLEU) metric [ 92 ]. BLEU is a translation quality score that evaluates the correspondence between the predicted translation and the ground truth text. More specifically, BLEU- n measures the n -gram overlap between the output and the reference sentences. BLEU-1,2,3,4 are reported to provide a clear view of the actual translation performance of a method. Camgoz et al. in [ 28 ], adopted an attention-based neural machine translation architecture for SLT. The encoder consisted of a 2D-CNN and an LSTM network, while the decoder consists of word embeddings with an attention LSTM. The authors stated that the method is prone to errors when spoken words are not explicitly signed in the video but inferred from the context. Their method set the baseline performance on Phoenix-2014-T with a BLEU-4 score of 18.4. Orbay et al. in [ 93 ], compared different gloss tokenization methods using either 2D-CNN, 3D-CNN, LSTM or Transformer networks. In addition, they investigated the importance of using full frames compared to hand images as the first provide useful information regarding the face and arms of the signer for SLT. On the other hand, Ko et al. in [ 94 ], utilized human keypoints extracted from the video, which were then fed to a recurrent encoder-decoder network for sign language translation. Furthermore, the skeletal features were extracted with OpenPose and then normalized to improve the overall performance. 
Then, they were fed to the encoder, while the translation was generated from the attention decoder. Differently, Zheng et al. in [ 95 ], used a preprocessing algorithm to remove similar and redundant frames of the input video and increase the processing speed of the neural network without losing information. Then, they employed an SLT architecture that consisted of a 2D-CNN, temporal convolutional layers and bidirectional GRUs. Their method was able to deal with long videos that have long-term dependencies, improving the translation quality. Zhou et al. in [ 62 ], proposed a multi-modal framework for CSLR and SLT tasks. The proposed method used 2D-CNN, 1D convolutional layers and several BLSTMs and learned both spatial and temporal dependencies between different modalities. The proposed method achieved a BLEU-4 score of 23.65 on the test set of Phoenix-2014-T. However, due to the multi-modal cues, this method is very computationally heavy and requires several hours of training.
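The BLEU- n score described above can be sketched as follows. This is a simplified sentence-level version with a single reference and no smoothing: clipped n-gram precisions for n = 1..N are combined through a geometric mean and multiplied by a brevity penalty. Published SLT results are normally computed at corpus level with standard tooling, so the numbers from this sketch are not directly comparable to those in the literature. The example sentences are made up.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(reference, hypothesis, n):
    """Clipped n-gram precision: each n-gram counts at most as often as in the reference."""
    hyp_counts = ngrams(hypothesis, n)
    ref_counts = ngrams(reference, n)
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return clipped / total


def bleu(reference, hypothesis, max_n=4):
    """Sentence-level BLEU-max_n in [0, 100], single reference, no smoothing."""
    ref, hyp = reference.split(), hypothesis.split()
    precisions = [modified_precision(ref, hyp, n) for n in range(1, max_n + 1)]
    if not hyp or min(precisions) == 0.0:
        return 0.0  # without smoothing, one empty n-gram order zeroes the score
    log_mean = sum(math.log(p) for p in precisions) / max_n
    brevity = 1.0 if len(hyp) > len(ref) else math.exp(1.0 - len(ref) / len(hyp))
    return 100.0 * brevity * math.exp(log_mean)


reference = "the cat sat on the mat"
bleu1 = bleu(reference, "the cat sat on mat", max_n=1)
bleu4_same = bleu(reference, reference, max_n=4)
print(round(bleu1, 2), bleu4_same)
```

In the first call every unigram of the hypothesis appears in the reference, so the score is reduced only by the brevity penalty for the missing word.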

Recently, Transformer networks have also been employed for sign language translation due to their success in natural language processing tasks. Camgoz et al. in [ 96 ], introduced a joint architecture for CSLR and SLT with a Transformer encoder-decoder network. The network was trained with CTC and Cross-Entropy losses, while the gloss-level supervision improved the SLT performance. The authors evaluated various configurations of their method and stated that directly translating from video representations can improve the translation quality. A limitation of this approach was in translating numbers as there was no such context available during training. In their latest work, Camgoz et al. in [ 97 ], adopted additional modalities and a cross-modal attention to synchronize the different streams and model both inter- and intra-contextual information. Kim et al. in [ 98 ], used a deep neural network for human keypoint extraction that were fed to a transformer encoder-decoder network, while the keypoints were normalized based on the neck location. A comparison of existing methods for SLT that are evaluated on the Phoenix-2014-T dataset, is shown in Table 4 . Overall, Transformer-based SLT methods achieve slightly better performance than RNN-based methods, which indicates the importance of attention mechanism for SLT. In addition, using multiple modalities can also improve the translation quality.
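Keypoint normalization, such as the neck-based normalization mentioned above for [ 98 ], can be sketched as a simple change of coordinates. In the sketch below, every 2D keypoint is translated so that the neck becomes the origin. Dividing by the shoulder width to remove scale differences is an additional assumption made here for illustration, as are the keypoint names and pixel values.

```python
import math


def normalize_keypoints(keypoints, neck="neck",
                        left="left_shoulder", right="right_shoulder"):
    """Centre 2D keypoints on the neck and scale by shoulder width (assumed scale)."""
    neck_x, neck_y = keypoints[neck]
    left_x, left_y = keypoints[left]
    right_x, right_y = keypoints[right]
    # Fall back to 1.0 if both shoulders coincide, to avoid division by zero.
    scale = math.hypot(left_x - right_x, left_y - right_y) or 1.0
    return {
        name: ((x - neck_x) / scale, (y - neck_y) / scale)
        for name, (x, y) in keypoints.items()
    }


# Hypothetical pixel coordinates for one video frame.
frame = {
    "neck": (320.0, 200.0),
    "left_shoulder": (380.0, 210.0),
    "right_shoulder": (260.0, 210.0),
    "right_wrist": (250.0, 330.0),
}
normalized = normalize_keypoints(frame)
print(normalized["neck"], normalized["right_wrist"])
```

After this step, the same sign performed closer to the camera or at a different image position yields approximately the same coordinates.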

Table 4. Reported results on sign language translation on Phoenix-2014-T. The best performance appears in bold.

5. Sign Language Representation

The automatic and realistic sign language representation is vital for each sign language system. The representation of a sentence in sign language instead of a plain text can make the system friendlier and more accessible to the members of the deaf community. Signs are commonly represented using avatars or synthesized videos of a real human. The challenges of this task include the difficulty in creating realistic representations due to complex hand shapes and rapid arm movements.

5.1. Realistic Avatars

A common approach to sign language representation is the use of 3D avatars that can reproduce facial expressions and body/hand movements with a high degree of accuracy and realism, representing signs in a way that is understandable by deaf or hearing-impaired people. Balayn et al. in [ 99 ], developed a virtual communication agent for sign language to recognize Japanese sign language sentences from video recordings and synthesize sign language animations. Their system adopted a deep LSTM encoder-decoder network to translate sign language videos to spoken text, while a separate encoder-decoder network used as input the sign language glosses and extracted specific encodings, which were then used to synthesize the avatar motion. However, the network employed for the generation task does not have enough parameters to learn complete sentence expressions and lacks an attention module that could assist in learning longer-term dependencies. Shaikh et al. in [ 100 ], employed a system to generate sign animations from audio announcements in railway stations. At first, language rules and grammar were applied to the input text to transform it into a specific format. Then, inverse kinematics were applied to calculate the avatar target positions for each word and render the final video representation. Melchor et al. in [ 101 ], used a speech recognition system that translates Mexican spoken text into sign language. Then, the signs were represented through an avatar that was digitally animated on a mobile device. Uchida et al. in [ 102 ], developed an application that automatically produced sign language animations for sports games and was able to operate on live game broadcasts. A disadvantage of the application is that the delay time between the video occurrence and the video display is large.

Das et al. in [ 103 ], developed a 3D avatar to convert Indian text or speech into sign language. The input was translated to English and then to the corresponding Indian sign language using Natural Language Processing (NLP) rules and techniques. The final avatar movements were generated using a predefined sign vocabulary and Blender. A limitation of the system is that it was developed for a limited corpus and that the avatar had no facial expressions. Mehta et al. in [ 104 ], introduced a system in order to translate online videos into Indian Sign Language (ISL) and produce sign animations with a 3D cartoon-like avatar. The audio from the videos was captioned using NLP algorithms and mapped to signs that were finally rendered with the avatar. Nevertheless, due to the limited resources available for ISL, the performance of the system may degrade when dealing with complex grammatical structures and interactions. Patel et al. in [ 105 ], developed an application for animation generation. The input speech was recognised and translated with Google Cloud Speech Recognizer. Then, the translated text was converted to Hamburg notation system (HamNoSys) [ 106 ] and sign gesture markup language (SigML) [ 107 ] notations to effectively generate animations. Kumar et al. in [ 108 , 109 ] developed a mobile application to translate English text into ISL. HamNoSys was used for sign representation, SigML for its conversion to an XML file, and an avatar was employed to generate signs. A weakness of the developed system is that it struggles to represent complex animation and facial expressions of ISL signs. Moreover, the proposed system does not index the signs based on its context and this can cause confusion on directional signs that require different handling based on the context. Brock et al. in [ 110 ], adopted deep recurrent neural networks to generate 3D skeleton data from sign language videos. 
Subsequently, inverse kinematics were applied to calculate joint angles and positions that were mapped to a sign language avatar for animation synthesis.

5.2. Sign Language Production

Sign language production (SLP) has gained a lot of attention lately due to the huge advances in deep learning that allows the production of realistic signed videos. Sign language production techniques aim to replace the rigid body and facial features of an avatar with the natural features of a real human. To this end, these techniques usually receive as input sign language glosses and a reference image of a human and synthesize a signed video with the human performing signs in a more realistic way than the one that could have been achieved by an avatar.

Stoll et al. in [ 111 ], proposed an SLP method using a machine translation encoder-decoder network to translate spoken language into gloss sequences. Then, each gloss was assigned a unique 2D skeleton pose; the poses were extracted from sign videos, normalized and aligned. Finally, a pose-guided generative adversarial network handled the skeleton pose sequence and a reference image to generate the gloss video. However, this method fails to generate precise videos when the hand keypoints are not detected by the pose estimation method or the timing of the glosses is not predicted correctly. In their latest work, Stoll et al. in [ 112 ], used an improved architecture with additional components. The NMT network directly transformed spoken text to pose sequences, while a motion graph was adopted to generate smooth 2D skeletal poses. An improved generative adversarial network (GAN) was used in order to produce videos with higher resolution. The motion graph and the GAN modules significantly improved the quality of the generated videos. Stoll et al. in [ 113 ], adopted an auto-regressive gloss-to-pose network that can generate skeleton poses and velocities for each sign language gloss. In addition, a pose-to-video network generated the output video using a 2D-CNN along with a GAN. This approach resulted in smooth transitions between glosses and refined details on hand and finger shapes. Saunders et al. in [ 114 ], employed Transformers to automatically generate 3D human poses from spoken text using a multiple-level configuration. A text-to-gloss-to-pose (T2G2P) network with Transformer layers translated text sentences to sign language glosses and finally to 3D poses, while a text-to-pose (T2P) network directly transformed text into human poses. Furthermore, a progressive Transformer decoder was used to generate continuous and smooth human poses one frame at a time. The method achieved superior performance compared to NMT-based and GAN-based methods. 
Xiao et al. in [ 115 ] developed a bidirectional system for SLR and SLP. A deep RNN was used to jointly recognize sign language from input skeleton poses and generate skeleton sequences that were used to drive an avatar or generate a signed video. The generated sequences were also used for SLR and improved the robustness of the system.

Cui et al. in [ 116 ], used a pose predictor network, which contains an LSTM and an autoencoder to generate the future human poses given a reference pose and the gloss label. Moreover, an image synthesis module accepted as input the current frame and the next pose to predict the next frame of the video using a U-Net based architecture with a CNN and an LSTM. Furthermore, it extracted regions of interest to improve details, such as the hands, which were crucial for generating high-quality sign language videos. This approach was able to synthesize realistic signs with naturally evolving hand shapes.

6. Applications

The advances in sign language capturing, recognition and representation have led to the development of several related applications. Each application can be compatible either with desktop computers or with android and iOS smartphones, as it is illustrated in Table 5 . The majority of the methods use one or two CNN models integrated to their applications. The use of lightweight CNN models ensures the real-time performance of the applications.

Table 5. Characteristics of sign language applications.

Liang et al. in [ 117 ], introduced an automatic toolkit to recognize early stages of dementia among British Sign Language (BSL) users. Hand trajectory data, facial data and elbow distribution data were employed for feature extraction. The data were extracted using OpenPose and the dlib libraries. The final decision, whether the user was healthy or not, was taken by a CNN model. Zhou et al. in [ 118 ], created a Hong Kong sign language recognition platform, consisting of a mobile application and a Jetson Nano [ 130 ]. The mobile application was the front-end of the platform that preprocesses the sign language video. After the preprocessing, the video was transferred to the Jetson Nano that translates the video into spoken language, using a pre-trained deep learning model. Moreover, the authors created a Hong Kong sign language dataset for the purposes of the study. However, the method provides only word-level translation and predicts a relatively small vocabulary size. Furthermore, Ku et al. in [ 124 ], employed the 2d camera of the smartphone to record the signer. Hand skeleton information was extracted by OpenPose and a CNN model identified the meaning of the sign. The user could also choose to translate a pre-recorded video. However, very few gestures are recognised (three) and only finger positions are employed for feature extraction and not the entire hand. Moreover, the application does not run in real-time. On the other hand, Ozarkar et al. in [ 119 ], implemented a smartphone application consisting of three modules. The sound classification module detected and classified input sounds and alerted the user through vibrations. The gesture recognition module recognized the input Indian sign language video and converted it to natural language. In addition, the Multilingual Translation Module could either convert text to speech in different Indian regional languages or convert speech to text. 
Some limitations of the method are the performance degradation when more than one person appears in front of the camera, as well as the sensitivity of the sound classification module to noisy environments. Finally, Lee et al. in [ 126 ], described multiple technologies that could be integrated into a smartphone to ease the communication between speaking and hearing-impaired people. These technologies were: Text-To-Speech (TTS), Speech-To-Text (STT), Augmentative and Alternative Communication (AAC) and motion recognition.

Numerous education-oriented applications employing SLR have also been developed. These applications aim to help users learn or practice SL. Potamianos et al. in [ 125 ] presented a summary of the SL-ReDu project, whose goal was to teach Greek Sign Language (GSL) as a second language through recognition. The educational process was supported by self-monitoring and objective learning of the learners. Furthermore, a deep learning-based approach for isolated sign recognition of GSL was introduced. Joy et al. in [ 120 ] proposed a mobile application that could be used as a visual dictionary for children. It consisted of two modules: an object detection module and a word recognition module. The former let the user select an object and displayed the corresponding sign. The latter took a picture of a text as input and demonstrated the corresponding signs. However, the word recognition module can translate at most 950 characters of a text. In addition, there are delays in loading sign animation videos because only a limited number of videos can be stored on the mobile device. Paudyal et al. in [ 121 ] designed a smartphone application that gives feedback to a sign language learner based on the location, movement, orientation, and handshape of the learner’s signs. A dataset of 25 American Sign Language (ASL) signs was also created from 100 learners. However, the system does not perform continuous SLR. Schioppo et al. in [ 127 ] created a virtual environment for learning sign language using a virtual reality headset with an attached Leap Motion sensor. The system was evaluated on the 26 letters of the ASL alphabet. Luccio et al. in [ 122 ] employed an Elf Sandbot robot [ 131 ] to help people with hearing impairments learn sign language. Two smartphone and tablet applications were also developed: the first controlled the movement of the robot, and the second took a word or sentence as verbal or textual input, translated it to sign language, and displayed the corresponding video. Furthermore, Chaikaew et al. in [ 123 ] introduced an application for hearing-impaired people who want to learn Thai sign language. The learners could choose their preferred vocabulary and practice with animations. Bansal et al. in [ 128 ] designed a game to help Deaf children who lack continuous access to sign language, using only a high-resolution camera and pose estimation software. The learner was asked to describe a scene and, if the description was correct, advanced to the next scene. Moreover, a dataset with RGB and depth features was created from adults with little ASL experience. Nevertheless, the dataset contains too little data to train a deep learning model effectively. Finally, Quandt et al. in [ 129 ] designed an avatar that served as the teacher in a virtual environment for teaching introductory ASL to novice signers. The users could also see a digital representation of their hands thanks to a Leap Motion sensor. However, the system could not capture signs that involved touching a specific part of the body or signs in which body parts were occluded.

7. Conclusions and Future Directions

In this paper, the broad spectrum of AI technologies in the field of sign language is covered. Starting from sign language capturing methods for the collection of sign language data and moving on to sign language recognition and representation techniques for the identification and translation of sign language, this review highlights all important technologies for the construction of a complete AI-based sign language system. Additionally, it explores the in-between relations among the AI technologies and presents their advantages and challenges. Finally, it presents groundbreaking sign language applications that facilitate the communication between hearing-impaired and speaking people, as well as enable the social inclusion of hearing-impaired people in their everyday life. The aim of this review is to familiarize researchers with sign language technologies and assist them towards developing better approaches.

In the field of sign language capturing, it is essential to select an optimal sensor for capturing signs for a task that highly depends on various constraints (e.g., cost, speed, accuracy, etc.). For instance, wearable sensors (i.e., gloves) are expensive and capture only hand joints and arm movements, while in recognition applications, the user is required to use gloves. On the other hand, camera sensors, such as web or smartphone cameras, are inexpensive and capture the most substantial information, like the face and the body posture, which are crucial for sign language.

Concerning CSLR approaches, most of the existing works adopt 2D CNNs with temporal convolutional networks or recurrent neural networks that use video as input. In general, 2D methods have lower training complexity compared to 3D architectures and produce better CSLR performance. Moreover, it is experimentally shown that multi-modal architectures that utilize optical flow or human pose information, achieve slightly higher recognition rates than unimodal methods. In addition, CSLR performance on datasets with large vocabularies of more than 1000 words, such as Phoenix-2014, or datasets with unseen words on the test sets, such as CSL Split 2 and GSL SD, is far from perfect. Furthermore, ISLR methods have been extensively explored and have achieved high recognition rates on large-scale datasets. However, they are not suitable for real-life applications since they are trained to detect and classify isolated signs on pre-segmented videos.
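As a rough illustration of the "2D features plus temporal convolution" design mentioned above, the sketch below applies a 1D temporal filter to a sequence of per-frame feature vectors. It is a minimal, dependency-free sketch under stated assumptions: the feature vectors stand in for the output of a 2D CNN (which is not implemented here), the function name `temporal_conv1d` is hypothetical, and a single kernel is shared across all feature dimensions for brevity, whereas real temporal convolutional networks learn separate weights per channel.

```python
def temporal_conv1d(frames, kernel):
    """Slide a 1D kernel along the time axis of a feature sequence.

    frames: list of per-frame feature vectors (assumed to come from a 2D CNN).
    kernel: list of scalar weights, one per time step in the window.
    Returns one output vector per valid window position (no padding).
    """
    width = len(kernel)
    dims = len(frames[0])
    outputs = []
    for start in range(len(frames) - width + 1):
        window = frames[start:start + width]
        # Weighted sum over time, computed independently for each feature dimension.
        outputs.append([
            sum(weight * frame[d] for weight, frame in zip(kernel, window))
            for d in range(dims)
        ])
    return outputs


# Hypothetical 2-dimensional features for a 4-frame clip.
clip_features = [[1.0, 0.0], [2.0, 0.0], [3.0, 1.0], [4.0, 1.0]]
smoothed = temporal_conv1d(clip_features, kernel=[0.5, 0.5])
```

With a window of two frames, the four input frames yield three output vectors, each mixing information from neighbouring frames.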

Sign language translation methods have shown promising results although they are not exhaustively explored. The majority of the SLT methods adopt architectures from the field of neural machine translation and video captioning. These approaches are of great importance, since they translate sign language into spoken counterparts and can be used to facilitate the communication between the Deaf community and other groups. To this end, this research field requires additional attention from the research community.

Sign language representation approaches adopt either 3D avatars or video generation architectures. 3D animations require manual design of the movement and position of each joint of the avatar, which is very time-consuming. In addition, it is extremely difficult to generate smooth and realistic animations of the fine-grained movements that compose a sign without sophisticated motion capture systems that employ multiple cameras and specialised wearable sensors. On the other hand, recent deep learning methods for sign language production have shown promising results in synthesizing sign language videos automatically. Moreover, they can generate realistic videos from a reference image or video of a human, which the Deaf community prefers over avatars.

Sign language applications are mostly developed for integration into a smartphone operating system, where they perform SL translation or recognition. A distinct category is education-oriented applications, which are very useful for anyone with little or no knowledge of sign language. To create better and more easily accessible applications, research should focus on developing more robust and computationally less expensive AI models, along with further improving the existing software for integrating AI models into smart devices.

Figure 3 is designed to provide objective and subjective comparisons of AI technologies and DNN architectures for sign language as seen from the perspective and the experience of the authors in the field. More specifically, Figure 3 a presents and compares the characteristics of the different AI technologies for sign language. Volume of works is used to measure the number of published papers for each sign language technology and it is calculated based on the results of the query search in the databases. Challenges is used to subjectively measure the difficulty in accurately dealing with each sign language technology and it is based on the performance of the methods on the specific area. Finally, future potential is used to express the view of the authors on which sign language technology has the most potential to deliver future research works.

Figure 3. Radar charts showcasing the findings of this survey regarding ( a ) the literature methods for CSLR, ISLR and SLP and ( b ) the characteristics of each AI sign language technology.

From the chart in Figure 3 a, it can be seen that most existing works deal with sign language recognition, while sign language capturing and translation methods are still not thoroughly explored. It is strongly believed that these research areas should be explored more in future works. Furthermore, it is assumed that there is still great room for improvement for applications, especially mobile ones, that can assist the Deaf community. Regarding future directions, improvements can still be achieved in the accuracy of sign language recognition and production systems. In addition, advances should be made in the extraction of robust skeletal features, especially in the presence of occlusions, as well as in the realism of avatars. Finally, it is crucial to develop fast and robust sign language applications that can be integrated in the everyday life of hearing-impaired people and facilitate their communication with other people and services.

On the other hand, Figure 3 b compares various DNN architectures in terms of the performance of the proposed networks (Accuracy), the hardware requirements for inference and training (Hardware requirements), the scope for improvement based on performance gains and the volume of works (Future potential), the computational complexity during training (Training complexity), and the number of recorded datasets that are currently available (Existing datasets). Except for the existing datasets, whose values are based on a search for publicly available datasets, all other metrics presented in the chart of Figure 3 b are based on the study of the reviewed papers and on the opinions and experience of the authors. As can be observed, ISLR methods achieve high accuracy with small hardware requirements, but they have been extensively explored, which limits their future potential. CSLR and SLP methods, in contrast, have high hardware and training requirements but show significant future potential, as there is still great room for improvement in future research.

Author Contributions

Conceptualization, I.P., C.C., D.K., K.D. and P.D.; Formal analysis, I.P., C.C., D.K., K.D. and P.D.; Funding acquisition, P.D.; Project administration, P.D.; Supervision, K.D.; Writing—original draft, I.P., C.C., D.K.; Writing—review and editing, K.D. and P.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Greek General Secretariat of Research and Technology under contract T1EΔK-02469 EPIKOINONO.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Sign language identification and recognition: A comparative study

Sign Language (SL) is the main language of deaf and hard-of-hearing people. Each country has its own SL, which differs from those of other countries. Each sign in a language is represented by various hand gestures, body movements, and facial expressions. Researchers in this field aim to remove the obstacles that hinder communication with deaf people by replacing device-based techniques with vision-based techniques that use Artificial Intelligence (AI) and Deep Learning. This article highlights two main SL processing tasks: Sign Language Recognition (SLR) and Sign Language Identification (SLID). The latter task aims to identify the signer’s language, while the former aims to translate the signer’s conversation into tokens (signs). The article addresses the most common datasets used in the literature for the two tasks (static and dynamic datasets collected from different corpora), whose contents include numerals, alphabets, words, and sentences from different SLs. It also discusses the devices required to build these datasets, as well as the preprocessing steps applied before training and testing. The article compares the approaches and techniques applied to these datasets. It discusses both vision-based and data-glove-based approaches, with a focus on the main methods used in vision-based approaches, such as hybrid methods and deep learning algorithms. Furthermore, the article presents a graphical depiction and a tabular representation of various SLR approaches.

1 Introduction

According to World Health Organization (WHO) statistics, over 360 million people have disabling hearing loss (WHO 2015 [ 1 , 2 ]). This number had risen to 466 million by 2020, and it is estimated that over 900 million people will have disabling hearing loss by 2050. According to the World Federation of the Deaf, about 300 sign languages (SLs) are used around the world. SL is the bridge for communication between deaf and hearing people. It is defined as a mode of interaction for hard-of-hearing people through a collection of hand gestures, postures, movements, and facial expressions that correspond to letters and words. To communicate with deaf people, an interpreter is needed to translate spoken words and sentences into signs, and vice versa. Unfortunately, SLs do not have a written form, and electronic resources for them are severely lacking. The most common SLs are American Sign Language (ASL) [ 3 ], Spanish Sign Language (SSL) [ 4 ], Australian Sign Language (AUSLAN) [ 5 ], and Arabic Sign Language (ArSL) [ 6 ]. Some communities sign with only one hand, as in the USA, France, and Russia, while others use two hands, as in the UK, Turkey, and the Czech Republic.

The need for an organized and unified SL was first discussed at the World Sign Congress in 1951. The British Deaf Association (BDA) published a book named Gestuno [ 7 ]. Gestuno is an international SL for the Deaf with a vocabulary list of about 1,500 signs. The name “Gestuno” was chosen to reference gesture and oneness. The language draws on Western and Middle Eastern languages. Gestuno is considered a pidgin of SLs with a limited lexicon. It was established in several countries, such as the US, Denmark, Italy, Russia, and Great Britain, to cover international meetings of deaf people. However, Gestuno cannot be considered a language, for several reasons. First, no children or ordinary people grow up using it. Second, it has no unified grammar (the book contains only a collection of signs without any grammar). Third, few specialists are fluent in it or practice it professionally. Last, it is not used daily in any single country, and people are unlikely to replace their national SL with this international one [ 8 ].

ASL has many linguistic properties that are difficult for technology-oriented researchers to understand, so SL experts are needed to bridge this gap. SL has many building blocks, known as phonological features. These features are realized as hand gestures, facial expressions, and body movements. Each of these three phonological features has its own shape, which varies from one sign to another. A word or expression may have similar phonological features in different SLs. For example, the word “drink” can be represented similarly in ASL, ArSL, and SSL [ 16 ]. On the other hand, a word or expression may have different phonological features in different SLs. For example, the word “Stand” in American and the word “يقف” (stand) in Arabic are represented differently in the two SLs. The process of understanding an SL by a machine is called Sign Language Processing (SLP) [ 9 ]. Many research problems have been proposed in this domain, such as Sign Language Recognition (SLR), Sign Language Identification (SLID), Sign Language Synthesis, and Sign Language Translation [ 10 ]. This article covers the first two tasks: SLR and SLID.

SLR is the task of translating the hand gestures and postures of an SL, covering every step from the sign gesture to the generated text that allows hearing people to understand deaf people. Feature extraction is a crucial phase of any recognition system and plays the most important role in sign recognition: the extracted features must be unique, normalized, and preprocessed. Many algorithms have been suggested for sign recognition, ranging from traditional machine learning (ML) algorithms to deep learning algorithms, as discussed in the upcoming sections. On the other hand, few researchers have focused on SLID [ 11 ]. SLID is the task of assigning a language to a given collection of hand gestures, postures, movements, and facial expressions. The term “SLID” emerged in the last decade as a result of many attempts to globalize and identify a global SL. The identification process is treated as a multiclass classification problem. There are many contributions in SLR, including prior surveys. The latest survey was a workshop [ 12 , 13 ]. To the best of our knowledge, no prior work has surveyed SLID. This gap is due to the need for experts who can explain many different SLs to researchers. It is also due to the distinction between any SL and its spoken language [ 8 , 14 ] (i.e., ASL is not a manual form of English and does not have a unified written form).
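The framing of SLID as a multiclass classification problem can be illustrated with a deliberately simple classifier. The sketch below is an assumption-laden toy, not a method from the surveyed literature: it uses a nearest-centroid rule, the two-dimensional feature vectors are invented for illustration, and the names `fit_centroids` and `identify_language` are hypothetical. In practice the features would come from the feature extraction step described above.

```python
import math


def fit_centroids(samples_by_language):
    """Average the feature vectors of each language into one centroid."""
    centroids = {}
    for language, vectors in samples_by_language.items():
        dims = len(vectors[0])
        centroids[language] = [
            sum(vector[d] for vector in vectors) / len(vectors)
            for d in range(dims)
        ]
    return centroids


def identify_language(centroids, features):
    """Return the language whose centroid is closest in Euclidean distance."""
    return min(
        centroids,
        key=lambda language: math.dist(centroids[language], features),
    )


# Hypothetical 2-D features per sign language (not taken from any real dataset).
TRAINING_SAMPLES = {
    "ASL": [[0.9, 0.1], [1.1, 0.0]],
    "ArSL": [[0.0, 1.0], [0.2, 0.9]],
    "SSL": [[-1.0, -1.0], [-0.8, -1.2]],
}
CENTROIDS = fit_centroids(TRAINING_SAMPLES)
predicted = identify_language(CENTROIDS, [1.0, 0.2])
```

Each language is one class, so adding another SL only requires another entry in the training dictionary; the decision rule itself does not change.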

Although many SLR models have been developed, to the best of our knowledge, none of them can recognize multiple SLs. At the same time, the need for a reliable system that can interact and communicate with people from different nations who use different SLs has become pressing in recent decades [ 15 ]. The COVID-19 pandemic forced a large share of employees to work and communicate remotely, and deaf people need to attend online meetings on platforms such as Zoom, Microsoft Teams, and Google Meet. We therefore need to identify and globalize a unified SL, because excluding deaf people from such meetings affects overall work progress and harms their well-being, which underlines the principle of “nothing about us without us.” SL also occupies a large part of daily life, for example through sign interpreters on TV, at local conferences, and at international events. Interpreting at international events is particularly difficult, since every deaf participant requires an interpreter for their own SL. In Deaflympics 2010, many deaf athletes were invited to these international games and needed to interact and communicate with each other and with others at their residence [ 16 ]. Building an interactive unified recognizer is a challenge [ 11 ]: some words/expressions have the same sign in different languages, others have different signs in different languages, and others are expressed with the hands together with movements of the eyebrows, mouth, head, shoulders, and eye gaze. For example, in ASL, raised eyebrows indicate a yes/no question, and furrowed eyebrows indicate an open-ended question. Signs can also be modified by mouth movements. For example, producing the sign CUP with different mouth positions may indicate the size of the cup. Body movements that accompany signing can likewise change the meaning. SLID will help in breaking down these barriers for SL across the world.

Traditional machine learning and deep learning algorithms have been applied to different SLs to recognize and detect signs. Most proposed systems achieved promising results and showed significant improvements in SL recognition accuracy. Given these strong SLR results on different SLs, the new task of SLID has emerged to make communication between deaf and hearing people more stable and easier. SLID has many subtasks, including image preprocessing, segmentation, feature extraction, and image classification. Most proposed recognition models were applied to a single dataset, whereas the proposed SLID was applied to more than one SL dataset [ 11 ]. SLID inherits all SLR challenges, such as background and illumination variance [ 17 ], as well as skin detection and hand segmentation for both static and dynamic gestures. These challenges are amplified in SLID because many characters and words in different SLs share the same hand gestures, body movements, and so on, and may differ only in facial expressions. For example, in ASL, raised eyebrows indicate a yes/no question, and furrowed eyebrows indicate an open-ended question. Signs can also be modified by mouth movements.

Despite the need for interdisciplinary learning and knowledge of sign linguistics, most existing research does not go into depth and tackles only the most important topics and separate portions. In this survey, we address three important questions: (1) Why is SLID important? (2) What are the challenges in solving SLID? (3) What is the most used sign language for identification, and why? A contact inequality among SLs arises from communication between signers of different languages, whether in an informal personal context or in a formal international context. Deaf people have therefore used a kind of auxiliary gestural system for international communication at sporting or cultural events since the early 19th century [ 18 ]. Among spoken languages, English is the most widely used across countries, and many people regard it as a global language, although it is in fact a local language like any other. Similarly, many deaf people believe that ASL is the universal SL.

Furthermore, this article compares the machine learning and deep learning models applied to different datasets, identifies the best deep learning settings (such as the neural network architecture, activation function, number of epochs, and optimization function), and highlights the main state-of-the-art contributions in SLID. It also covers the preprocessing steps required for sign recognition, the devices used for this task, and the techniques applied. The article tries to answer the following questions: Which algorithms and datasets have achieved high accuracy? What are the main subtasks that every paper seeks to achieve? Are they successful in achieving the main goal or not?

This survey will also support our future research on SLID, which requires a deep understanding of the current techniques and procedures used in SLR and SLID. The survey also compares the strengths and weaknesses of different algorithms and preprocessing steps for recognizing signs in different SLs. Furthermore, it will help other researchers become more familiar with SL techniques.

The upcoming sections are arranged as follows: Datasets of different SLs are described in Section 2 . The preprocessing steps for these datasets, that are prerequisite for all SL aspects, and the required devices will be discussed in Section 3 . Section 4 includes the applied techniques for SL. Section 5 comprehensively compares the results and the main contributions of these addressed models. Finally, the conclusion and the future work will be discussed in Section 6 .

In this section, we discuss many datasets that have been used for different SL tasks, such as skin and body detection, image segmentation, feature extraction, gesture recognition, and, for more advanced approaches, sign identification. For each dataset, we explore its structure, the attributes with significant effects on training and testing, its advantages and disadvantages, and its content (images, videos, or glove data). We also compare the accuracies obtained on each dataset when different techniques are applied to it [ 19 ]. Table 1 summarizes the results of these comparisons.

A comparison between different datasets

ASL: American Sign Language, DGS: German Sign Language, NSL: Netherlands Sign Language, TSL: Turkish Sign Language, LSF: French Sign Language, ISL: Irish Sign Language, SGSL: Swiss German Sign Language, ISL: Indian Sign Language, PSL: Persian Sign Language.

CopyCat Game [ 20 ]: A dataset collected from deaf children for an educational adventure game that facilitates interaction with computers through gesture recognition technology. The children wear two colored gloves (red and purple), one on each hand. About 5,829 phrases were collected over 4 phases, with a total of 9 deployments. Each phrase has 3, 4, or 5 signs taken from a vocabulary of about 22 signs, which is a list of adjectives, objects, prepositions, and subjects. The phrases have the following format:

[adjective1] subject preposition [adjective2] object.
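The phrase format can be read as a small grammar in which the two bracketed adjectives are optional, which is why each phrase contains 3, 4, or 5 signs. The sketch below enumerates all phrases that such a grammar allows. It is only an illustration: the vocabulary entries are hypothetical placeholders (the actual list of about 22 signs is not given in the text), and `generate_phrases` is a name introduced here.

```python
import itertools

# Hypothetical mini-vocabulary; the real vocabulary is not listed in the text.
VOCABULARY = {
    "adjective": ["blue", "green"],
    "subject": ["cat", "snake"],
    "preposition": ["on", "under"],
    "object": ["box", "chair"],
}


def generate_phrases(vocabulary):
    """Enumerate phrases of the form [adjective1] subject preposition [adjective2] object."""
    # None marks an omitted optional adjective.
    adjective_options = [None] + vocabulary["adjective"]
    phrases = []
    for adj1, subject, preposition, adj2, obj in itertools.product(
        adjective_options,
        vocabulary["subject"],
        vocabulary["preposition"],
        adjective_options,
        vocabulary["object"],
    ):
        # Drop the omitted adjectives so the phrase holds 3, 4, or 5 signs.
        phrases.append(
            [sign for sign in (adj1, subject, preposition, adj2, obj) if sign is not None]
        )
    return phrases


PHRASES = generate_phrases(VOCABULARY)
PHRASE_LENGTHS = sorted({len(phrase) for phrase in PHRASES})
```

With two choices per word class, the grammar yields 3 × 2 × 2 × 3 × 2 = 72 phrases, of which 8 have no adjective at all.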

Some disadvantages of this dataset are library continuity, sensor changes, varied environments, data integrity, and sign variation. Another disadvantage is the reliance on gloves, since users must wear them to interact with the system. On the other hand, the dataset has the advantage that new data gathered from other deployments can be integrated into its libraries.

Multiple Dataset [ 25 ]: Two ArSL datasets were collected, consisting of 40 phrases with an 80-word lexicon, with each phrase repeated 10 times. The first dataset was collected using a DG5-VHand data glove with five sensors on each finger and an embedded accelerometer; two Polhemus G4 motion trackers providing six different measurements were also used. The second dataset was collected using a digital camera, without gloves.

ArSL Dataset [ 26 ]: Digital cameras are used to capture the signers’ gestures, and the videos are stored in AVI format for later analysis. Data were captured from deaf volunteers to generate samples for training and testing the model. The dataset consists of a 20-word lexicon with 45 repetitions for every word, 20 for training and 18 for testing. All signers’ hands are bare, and no gloves are required. Videos are captured at 25 frames per second with a resolution of 640 × 480.

Weather Dataset [ 40 ]: A continuous SL composed of three state-of-the-art datasets for the SL recognition purpose: RWTH-PHOENIX-Weather 2012, RWTH-PHOENIX-Weather 2014, and SIGNUM.

SIGNUM [ 27 ]: It is used for pattern recognition and SL recognition tasks. It is a video database of daily-life activities and sentences, such as going to the cinema or riding a bus. The signers’ gestures were captured by digital cameras.

CORPUS-NGT [ 24 ]: Considerable effort went into collecting and recording videos of the SL of the Netherlands (Nederlandse Gebarentaal: NGT) and making this corpus globally accessible to researchers in sign language studies. About 100 native signers of different ages participated in recording about 72 h of signing. Annotations or translations are provided for some of these signs.

RWTH German fingerspelling [ 21 ]: A German SL dataset is collected from 20 participants, producing about 1,400 image sequences. Each participant was asked to record every sign twice on different days by different cameras (one webcam and one camcorder) without any background limitations or restrictions on wearing clothes while gesturing. Dataset contains about 35 gestures with video sequences of alphabets and first 5 numbers (1–5).

RWTH-BOSTON-104: A dataset published by the National Center for Sign Language and Gesture Resources at Boston University. Four cameras were used to capture the signs: three black-and-white cameras and one color camera. Two black-and-white cameras were placed in front of the signer to form a stereo pair, a third was placed at the signer’s side, and the color camera was positioned between the stereo cameras. The dataset consists of 201 annotated videos, of which about 161 are used for training and about 40 for testing. The sentences were recorded at 30 fps with a resolution of 312 × 242 pixels, of which only the upper center region of 195 × 165 pixels is used.

Oliveira et al. [ 28 ]: An Irish SL dataset of human subjects performing handshapes and movements, comprising 468 videos. It is provided as two datasets, which were employed for static (each sign is conveyed by a single frame) and dynamic (each sign is expressed by different frames and dimensions) Irish Sign Language recognition.

ISL-HS consists of 468 videos of 6 persons performing Irish SL, rotating the hand while signing each letter. Only the arms and hands are visible in the frames. Videos whose background was removed by thresholding are also provided. Further, 23 labels are considered, excluding the letters j, x, and z, because they require hand motion, which is outside the scope of the framework [ 41 ].

Camgoz et al. [ 29 ]: It presented the Turkish sign language. It was recorded using the state-of-the-art Microsoft Kinect v2 sensor. This dataset contains about 855 signs from everyday life domains from different fields such as finance and health. It has about 496 samples in health domain, about 171 samples in finance domain, and the remaining signs (about 181) are commonly used signs in everyday life. Each sign was captured by 10 users and was repeated 6 times, each user was asked to perform about 30–70 signs.

SMILE [ 30 ]: An assessment system for lexical signs of Swiss German Sign Language (DSGS) that relies on SLR. The system aims to give adult L2 learners of DSGS feedback on the correctness of manual parameters such as hand position, shape, movement, and location. As an initial step, the system provides feedback on the production of a subset of 100 lexical items from the DSGS vocabulary. To support the SLR component of the assessment system, a large dataset of these 100 items was recorded with the aid of 11 adult L1 signers and 19 adult L2 learners of DSGS.

Most SLR techniques begin by extracting upper body pose information, which is a challenging task because of the difference between signer and background color. Another challenge is motion blur. To overcome these challenges, the authors used a diverse set of visual sensors, including high-speed and high-resolution GoPro video cameras and a Microsoft Kinect V2 depth sensor.

Gebre et al. [ 11 ]: The dataset includes two languages, British and Greek sign language, both available in the Dicta-Sign corpus. The corpus has recordings for 4 sign languages, with at least 14 signers per language and approximately 2 h of recordings, using the same material across languages. From this selection, British and Greek sign language were chosen. Signers were prioritized based on how much their skin color differed from the background color. An F1 score of about 95% was achieved.
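For readers unfamiliar with the metric quoted above, the F1 score is the harmonic mean of precision and recall. The sketch below shows the standard computation from raw counts. The counts in the example are invented purely to produce a value of 0.95 and are not taken from Gebre et al.; `f1_score` is a name introduced here.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Compute F1 as the harmonic mean of precision and recall."""
    predicted_positives = true_positives + false_positives
    actual_positives = true_positives + false_negatives
    precision = true_positives / predicted_positives if predicted_positives else 0.0
    recall = true_positives / actual_positives if actual_positives else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative counts only: 95 correct identifications, 5 false alarms, 5 misses.
example_f1 = f1_score(true_positives=95, false_positives=5, false_negatives=5)
```

Because F1 combines both error types, a classifier cannot reach a high score by favouring one language and ignoring the other.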

Sahoo [ 32 ]: About 5,000 images of digits (0–9) were collected from 100 users (31 female and 69 male). Each signer was asked to repeat each digit 5 times. A Sony digital camera with a resolution of up to 16.1 MP was used. The captured images are in JPEG format with a resolution of 4,608 × 3,456 and were resized to 200 × 300. Finally, the dataset was divided into two groups for training and testing.

RKS-PERSIANSIGN [ 33 ] is a large dataset of 10,000 videos of Persian sign language (PSL) produced by 10 contributors with different backgrounds. It contains 100 videos for each PSL word and covers the words most commonly used in daily communication.

Joze and Koller [ 34 ] proposed a large-scale dataset for understanding ASL including about 1,000 signs registered over 200 signers, comprising over 25,000 videos.

WASL [ 35 ] constructed a wide scale ASL dataset from authorized websites such as ASLU and ASL_LEX. Also, data were collected from YouTube based on clear titles that describe signs. About 21,083 videos were accessed from 20 different websites. Dataset was performed by 119 signers, producing only one video for every sign.

AUTSL [ 36 ] presented a new large-scale multi-modal dataset for Turkish Sign Language dataset (AUTSL). 226 signs were captured by 43 different signers, producing 38,336 isolated sign videos. Some samples of videos containing a wide variety of background recorded in indoor and outdoor environments.

KArSL [ 37 ], a comprehensive benchmark for ArSL containing 502 signs recorded by 3 different signers. Each sign was repeated 50 times by each signer, using Microsoft Kinect V2 for sign recording.

Daniel [ 38 ] used Raspberry pi with a thermal camera to produce 3,200 images with low resolution of 32 × 32 pixel. Each sign has 320 thermal images, so we conclude capturing images of about 10 signs.

Mittal et al. [ 39 ] created an ISL dataset recorded by six participants. The dataset contains 35 sign words, each word was repeated at least 15 times by each participant, so the size of the dataset is 3,150 (35 × 15 × 6).

3 Preprocessing steps

Two main preprocessing steps are required for different sign language processing tasks: segmentation and filtration. These steps include the following subtasks: skin detection, handshape detection, feature extraction, image/video segmentation, gesture recognition, and so on. This section briefly discusses these subtasks. Figure 1 shows the sequence of preprocessing steps required by most SL models. Each model usually starts with an image of the signer and applies color space conversion. Non-skin images are rejected, while the remaining images are processed further by applying image morphology (erosion and dilation) for noise reduction. Each image is then validated by checking whether it contains a hand. If it does, the Region-of-Interest (ROI) is detected using hand mask images, and the fingers are segmented using defined algorithms. Image enhancement, such as image filtering and data augmentation, and other algorithms can then be used to detect edges.
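The morphology step of this pipeline (erosion and dilation for noise reduction) can be illustrated with a minimal sketch. It assumes a 3 × 3 square structuring element and a binary mask stored as nested lists of 0/1 values; the function names and the toy mask are illustrative and are not taken from any of the surveyed systems.

```python
def _window(mask, i, j):
    """Values in the 3 x 3 window around (i, j); pixels outside the image count as 0."""
    rows, cols = len(mask), len(mask[0])
    return [
        mask[r][c] if 0 <= r < rows and 0 <= c < cols else 0
        for r in range(i - 1, i + 2)
        for c in range(j - 1, j + 2)
    ]


def erode(mask):
    """Keep a pixel only if its whole 3 x 3 window is set (shrinks regions)."""
    return [[int(all(_window(mask, i, j))) for j in range(len(mask[0]))]
            for i in range(len(mask))]


def dilate(mask):
    """Set a pixel if any pixel in its 3 x 3 window is set (grows regions)."""
    return [[int(any(_window(mask, i, j))) for j in range(len(mask[0]))]
            for i in range(len(mask))]


def opening(mask):
    """Erosion followed by dilation: removes specks smaller than the window."""
    return dilate(erode(mask))


# Toy skin mask: a 3 x 3 "hand" blob plus one isolated noise pixel at the top right.
mask = [
    [0, 0, 0, 0, 0, 1],
    [0, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
cleaned = opening(mask)
```

In this toy case the opening removes the isolated pixel and restores the blob, which is the behavior the noise-reduction step relies on.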

Figure 1: A flowchart that demonstrates the different image preprocessing steps.

Skin detection: It is the process of separating skin-colored pixels from non-skin pixels. Ref. [ 42 ] showed that it is not possible to provide a uniform method for detecting and segmenting human skin, as skin color varies from one person to another. RGB is a widely used color model, but it is not preferred for skin detection because it mixes chrominance and luminance and has non-uniform characteristics. Skin detection is instead applied to HSV (hue, saturation, value) and YCbCr images.

ROI [ 42 ]: It focuses on detecting hand gestures [ 43 ] and extracting the most interesting points. The hand region is detected from the original image by skin detection, using some defined masks and filters, as shown in Figure 2 .

Figure 2: A proposed hand gesture recognition system using ROI [ 64 ].

Image resize: It is the process of resizing images by either enlarging or reducing them. Ref. [ 44 ] applied an interpolation algorithm that changes the image resolution from one size to another. In bicubic interpolation, a new pixel B(r′, c′) is formed by interpolating the nearest 4 × 4 neighborhood of pixels.
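To make the 4 × 4 neighborhood concrete, the following is a minimal sketch of bicubic resampling. It assumes the Keys cubic convolution kernel with a = −0.5 (a common choice, not necessarily the one used in Ref. [ 44 ]), border clamping, and pixel-center alignment; all function names are illustrative.

```python
import math


def cubic_weight(x, a=-0.5):
    """Keys cubic convolution kernel; a = -0.5 is an assumed, commonly used value."""
    x = abs(x)
    if x <= 1:
        return (a + 2) * x**3 - (a + 3) * x**2 + 1
    if x < 2:
        return a * x**3 - 5 * a * x**2 + 8 * a * x - 4 * a
    return 0.0


def bicubic_sample(img, r, c):
    """Interpolate img (a list of rows) at fractional position (r, c) from the
    nearest 4 x 4 neighborhood, clamping indices at the image border."""
    rows, cols = len(img), len(img[0])
    r0, c0 = math.floor(r), math.floor(c)
    value = 0.0
    for i in range(r0 - 1, r0 + 3):
        wi = cubic_weight(r - i)
        ri = min(max(i, 0), rows - 1)
        for j in range(c0 - 1, c0 + 3):
            wj = cubic_weight(c - j)
            cj = min(max(j, 0), cols - 1)
            value += wi * wj * img[ri][cj]
    return value


def resize_bicubic(img, new_rows, new_cols):
    """Resize by sampling the source image at the centers of the new pixels."""
    rows, cols = len(img), len(img[0])
    return [
        [
            bicubic_sample(img,
                           (i + 0.5) * rows / new_rows - 0.5,
                           (j + 0.5) * cols / new_cols - 0.5)
            for j in range(new_cols)
        ]
        for i in range(new_rows)
    ]


# A horizontal ramp: every row is 0, 1, 2, 3.
ramp = [[float(c) for c in range(4)] for _ in range(4)]
resized = resize_bicubic(ramp, 8, 8)
```

The four weights along each axis sum to one, so flat regions keep their value, and sampling exactly at a pixel center returns that pixel unchanged.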

Ref. [ 45 ] proposed a promising skin-color detection algorithm that gives the best results even with complex backgrounds. The algorithm acquires an image from the video input stream, adjusts the image size, converts the image from RGB color space to YCbCr space (noting that YCbCr is the most suitable space for skin color detection), and finally identifies skin color based on different threshold values [ 46 , 47 ], marking detected skin pixels in white and all other pixels in black. Figure 1 includes a sub-flowchart that illustrates this algorithm.
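A minimal sketch of this kind of YCbCr thresholding is given below. The conversion uses the standard full-range ITU-R BT.601 coefficients; the Cb and Cr ranges are a commonly quoted choice and are an assumption here, since the thresholds used in Ref. [ 45 ] are not stated in this survey. The sample pixels are made up.

```python
def rgb_to_ycbcr(r, g, b):
    """Full-range ITU-R BT.601 conversion for 8-bit RGB values."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr


# Assumed chrominance ranges for skin (illustrative, not taken from Ref. [45]).
CB_RANGE = (77, 127)
CR_RANGE = (133, 173)


def skin_mask(image):
    """Return a mask with 255 (white) for skin-like pixels and 0 (black) otherwise."""
    mask = []
    for row in image:
        mask_row = []
        for r, g, b in row:
            _, cb, cr = rgb_to_ycbcr(r, g, b)
            is_skin = (CB_RANGE[0] <= cb <= CB_RANGE[1]
                       and CR_RANGE[0] <= cr <= CR_RANGE[1])
            mask_row.append(255 if is_skin else 0)
        mask.append(mask_row)
    return mask


# 2 x 2 toy image: two skin-like tones in the left column, blue and white on the right.
image = [
    [(224, 172, 138), (30, 90, 200)],
    [(198, 134, 66), (255, 255, 255)],
]
mask = skin_mask(image)
```

Only the chrominance channels are thresholded, which is why this color space is less sensitive to brightness changes than thresholding RGB directly.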

In Ref. [ 48 ], images were binarized from RGB color mode to black and white using Otsu’s global thresholding algorithm; the captured images were first resized to 260 × 260 pixels (width and height) and then converted with Otsu’s method. Ref. [ 49 ] applied a feature extraction technique known as the 7Hu invariant moments, a set of seven algebraic functions used as a feature vector. Their values are invariant to changes in size, rotation, and translation. The 7Hu moments were developed by Ming-Kuei Hu in 1961. Structural shape descriptors [ 23 ] are proposed in five terms: aspect ratio, solidity, elongation, speediness, and orientation.
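For illustration, global thresholding by Otsu's method can be sketched as follows. The sketch assumes 8-bit grayscale input given as nested lists and picks the gray level that maximizes the between-class variance; the function names and the toy image are hypothetical.

```python
def otsu_threshold(pixels):
    """Return the gray level t (0-255) that maximizes the between-class variance,
    where pixels <= t form the background class."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg, sum_bg = 0, 0
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue          # no background pixels yet
        w_fg = total - w_bg
        if w_fg == 0:
            break             # every pixel is already in the background class
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        between_var = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


def binarize(gray_image):
    """Map a grayscale image (list of rows) to black (0) and white (255)."""
    t = otsu_threshold([p for row in gray_image for p in row])
    return [[255 if p > t else 0 for p in row] for row in gray_image]


# Toy image: dark background (about 10) with a bright region (about 200).
gray = [
    [10, 12, 200, 205],
    [11, 10, 198, 202],
]
binary = binarize(gray)
```

Because the threshold is derived from the histogram of each image, no fixed cut-off has to be chosen by hand.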

Ref. [ 50 ] used a face-region skin detector. The face region includes the eyes and mouth, which are non-skin, non-smooth regions that reduce the accuracy.

A 10 × 10 window around the center pixel of the signer’s face is used to detect skin, but it is not accurate because in most cases it detects the nose, which suffers from high illumination conditions [ 51 ].

Image segmentation: It refers to the extraction of hands from video frames or images. Either a background technique or a skin detector algorithm is applied first to detect the signer’s skin, and then a segmentation algorithm is applied [ 52 ]. Ref. [ 53 ] applied skin color segmentation using an Artificial Neural Network (ANN); features extracted from the left and right hands were used for the neural network model, with an average recognition rate of 92.85%.

Feature extraction: It is the process of obtaining the most important data items or the most interesting points of a segmented image or gesture. Ref. [ 25 ] applied two techniques for feature extraction: window-based statistical features and the 2D discrete cosine transform (DCT). Ref. [ 48 ] applied five types of feature extraction: fingertip finder, elongatedness, eccentricity, pixel segmentation, and rotation. Figures 3 and 4 show the accuracy, in percent, of different feature extraction algorithms. Figure 3 illustrates the benefit of combining three feature extraction algorithms (pixel segmentation, eccentricity and elongatedness, and fingertip finder) compared with applying each one individually. The combined algorithms give the highest accuracy, 99.5%.
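As an example of one of these techniques, a 2D DCT feature extractor can be sketched as below. The sketch assumes the orthonormal DCT-II applied first to rows and then to columns, and keeps only the low-frequency top-left coefficients as the feature vector; the block size, the number of coefficients kept, and the function names are illustrative choices rather than the settings of Ref. [ 25 ].

```python
import math


def dct_1d(x):
    """Orthonormal DCT-II of a 1D sequence."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
        out.append(scale * sum(
            x[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n)) for i in range(n)
        ))
    return out


def dct_2d(block):
    """Separable 2D DCT: transform each row, then each column."""
    rows = [dct_1d(row) for row in block]
    cols = [dct_1d(col) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


def dct_features(block, keep=2):
    """Feature vector made of the keep x keep lowest-frequency coefficients."""
    coeffs = dct_2d(block)
    return [coeffs[u][v] for u in range(keep) for v in range(keep)]


# A flat 4 x 4 block: all energy ends up in the DC coefficient.
flat = [[8.0] * 4 for _ in range(4)]
coeffs = dct_2d(flat)

# A non-flat block and its 4-element low-frequency feature vector.
block = [[float(4 * r + c + 1) for c in range(4)] for r in range(4)]
features = dct_features(block)
```

Most of the energy of smooth image regions concentrates in a few low-frequency coefficients, which is why a short vector of them can serve as a compact descriptor.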

Figure 3: Average recognition rate (%) for each feature extraction algorithm and percentage of combining them.

Figure 4: Comparison of different feature algorithms.

Tracking: Tracking body parts facilitates the SLR process. Several questions arise: How important is accurate tracking of body parts and their movements? How much does it contribute to SLR? And how does it compare with simply using the tracked image for feature extraction? Hand tracking: The hands of the signer convey most of the information needed to recognize signs in most SLs. Ref. [ 27 ] employed a free tracking system based on dynamic programming tracking (DPT). Tracking facial landmarks: Active Appearance Models (AAMs) were introduced and later reformulated by Matthews et al. [ 54 ].

4 Required devices

In the last decade, researchers have depended on electronic devices to detect and recognize hand positions and gestures, for many reasons [ 55 ]. One consideration is whether the SLR system is signer dependent or signer independent. In a signer-dependent system, which is the core of any SLR system, the same signer takes part in both the training and testing phases, so this type affects the recognition rate positively. A signer-independent system, on the other hand, is more challenging: the signers who take part in the training phase are not admitted to the testing phase, and this exclusion makes it hard to adapt the system to accept another signer. The target of SL systems can be achieved by (I) an image-based approach [ 56 ], (II) a glove-based approach based on sensors, as shown in Figure 5 , or (III) a newer method for gesture recognition called virtual button [ 57 ].

Figure 5: Cyber-glove.

Data-gloves and electronic devices have several disadvantages: data-gloves give accurate information, but only a limited amount of it; the more advanced the sensor technology used, the higher the cost; and the data-gloves must be put on and taken off each time hand gestures are to be recognized, which adds more obstacles for people who are not familiar with this technology, especially in public places. Below is a short description of the most used devices for SLR.

Tilt sensor: It is a device that produces an electrical signal that varies with an angular movement, used to measure slope and tilt with a limited range of motion.

Accelerometer sensor: It measures 3-axis acceleration caused by gravity and motion; in other words, it is used to measure the rate of change of velocity.

Flex sensor: It is a very thin and lightweight electric device used to measure bending or deflection. It is usually attached to the surface of the fingers, and the resistance of the sensor varies as the surface bends.

Motion (proximity) sensor: It is an electrical device which utilizes a sensor to capture motion, or it is used to detect the presence of objects without any physical contact.

Figure 6: Different sensor types attached to hand gloves.

As previously illustrated, all these kinds of sensors are used to measure the bend angles of fingers, the orientation or direction of the rest, and the abduction and adduction between fingers. These sensors give an advantage over vision-based systems: sensors can directly report the required data (bending degree, orientation, etc.) to the system in terms of voltage values, without any preprocessing steps for feature extraction, whereas vision-based systems need to apply tracking and feature extraction algorithms. However, ref. [ 58 ] mentioned that using data-gloves and sensors does not provide the naturalness expected of HCI systems.

Among electronic devices that have built-in sensors, two are widely used in many fields: the infrared-sensor-based Microsoft Kinect and Leap-Motion devices, as shown in Figure 7 ( Table 2 ).

Figure 7: Digital devices with built-in sensors used to capture dynamic gestures of human expressions.

Table 2: Most widely used electronic devices for hand gesture recognition

ANFIS: Adaptive Neuro-Fuzzy Inference System, MSL: Malaysian Sign Language, IBL: instance-based learning, DTL: decision-tree learning, ISL: Indian Sign Language.

Bold indicates the highest accuracy of using electronic devices for hand gesture recognition.

Vision-based approach : The great development in computer techniques and ML algorithms has motivated many researchers to depend on vision-based methodology. A camera is used to capture images, which are then processed to detect the most important features for recognition purposes. Most researchers prefer the vision-based method because of its framework’s adaptability and its ability to involve facial expressions, body movements, and lip reading. This approach requires only a camera to capture a person’s movements against a clear background, without any gadgets. Earlier glove-based systems required an accompanying camera to register the gesture, and they do not work well under poor lighting conditions.

Virtual Button approach [ 57 ]: It depends on a virtual button generated by the system, which receives the hand’s motion and gesture through holding and releasing individually. This approach is not effective for recognizing SL because every sign language requires the use of all the fingers of the hand, and it is also not practical for real-life communication.

4.1 Methodology and applied techniques

Many datasets have been used in SL recognition. Some are handled with vision-based approaches, some with soft computing approaches such as ANN, Fuzzy Logic, and Genetic Algorithms, and others with techniques such as Principal Component Analysis (PCA) or deep learning models such as the Convolutional Neural Network (CNN).

Also, many algorithms and techniques have been applied to recognize SLs and identify different languages with varying accuracy. Some of these techniques are classical algorithms, and others are deep learning, which has become the leading technique for most AI problems, overshadowing classical ML. A clear reason for depending on deep learning is that it has repeatedly demonstrated high-quality results on a wide variety of tasks, especially those with big datasets.

Traditional ML algorithms are a set of algorithms that use training data to learn and then apply what they have learned to make informed decisions. Among traditional algorithms, classification and clustering algorithms are used for SLR and SLID. Deep learning is considered an evolution of ML; it uses programmable neural networks that make decisions without human intervention.

5 Methodology and applied techniques

K-Nearest Neighbors (KNN): It is one of the traditional ML algorithms used for classification and regression problems. Many researchers applied KNN to different SL datasets, but the accuracy was lower than expected. Ref. [ 65 ] achieved an accuracy of 28.6% when applying KNN with PCA for dimensionality reduction. Other researchers added preprocessing steps to obtain better accuracy. Although KNN shows lower accuracy for image classification problems, some researchers recommend it because it is easy to use and implement and involves few steps. Table 3 discusses some of the KNN algorithms applied to different datasets.

Table 3: KNN classification comparison on different datasets

Bold indicates the highest accuracy using the KNN algorithm with different K-values.

Dewinta and Heryadi [ 65 ] classified an ASL dataset using a KNN classifier, varying the value of K over 3, 5, 7, 9, and 11. The highest accuracy was 99.8% using K = 3, while the worst accuracy was obtained with K = 5 when using PCA for dimensionality reduction.

Fitri [ 66 ] proposed a framework using Simple Multi Attribute Rating Technique (SMART) weighting and a KNN classifier. SMART was used to optimize and enhance the accuracy of the KNN classifier. The accuracy varied from 94 to 96% depending on the lighting conditions: it decreases when lighting decreases and vice versa ( Figure 8 ).
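The effect of the K-value can be reproduced on a toy problem with the minimal KNN sketch below. It assumes Euclidean distance and majority voting; the 2D feature vectors, the labels "A" and "B", and the deliberately mislabeled training point are made up for illustration and do not come from any of the surveyed datasets.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Label the query by majority vote among its k nearest training points."""
    neighbors = sorted(
        (math.dist(x, query), label) for x, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]


def knn_accuracy(train_x, train_y, test_x, test_y, k):
    """Fraction of test points whose predicted label matches the true label."""
    hits = sum(knn_predict(train_x, train_y, q, k) == y
               for q, y in zip(test_x, test_y))
    return hits / len(test_y)


# Two hypothetical sign classes; the last training point is mislabeled noise.
train_x = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
           (5.0, 5.0), (5.0, 6.0), (6.0, 5.0),
           (4.4, 4.4)]
train_y = ["A", "A", "A", "B", "B", "B", "A"]
test_x = [(0.5, 0.5), (4.5, 4.5)]
test_y = ["A", "B"]

results = {k: knn_accuracy(train_x, train_y, test_x, test_y, k) for k in (1, 3, 5)}
```

With K = 1 the noisy neighbor decides the second test point, whereas K = 3 and K = 5 outvote it, which is one reason a single nearest neighbor is rarely the most robust setting.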

Figure 8: Sign languages-based approaches.

According to Figure 9 , the K -value should not be left fixed at its default value of 1; varying K over 1, 3, 5, or other odd numbers gives good results. With K = 3, most researchers obtain the best accuracy.

Figure 9: Different KNN results based on change in K -values.

Jadhav et al. [ 67 ] proposed a framework based on KNN for recognizing sign languages. The distinctive feature of this framework is that it allows users to define their own sign language. Users must first store their signs in a database; after that, they can use these signs while communicating with others. While a user communicates using the stored signs, the other person can see the signs and their meanings. The framework is intended for real-time sign recognition and is based on three main steps:

Skin detection: The authors created a “skin detector” method that converts images from BGR format to HSV format. An interpolation algorithm detects shadows in the images, which are filled with continuous dots using the “FillHoles” method. Another method, “ DetectAndRecognize ”, takes the hole-filled image as input for detection and recognition and calculates the contours that detect the edges of the signs. Blob detection: It comprises the two methods “FillHoles” and “DetectAndRecognize”.

Umang [ 68 ] applied KNN and PNN as classification techniques to recognize ISL alphabets. 7Hu moments were used for feature extraction. An accuracy of approximately 82% was achieved using the built-in KNN function in MATLAB with the default K = 1.

Hidden Markov Model (HMM): Based on our review, HMM is one of the most strongly recommended approaches for SL problems. Hybridizing HMM with CNN provides high accuracy on huge datasets. HMM is the most widely used technique for speech recognition and for SL problems, in both vision-based and data-glove-based approaches. Table 4 discusses some of the HMM models applied to different datasets.

Table 4: HMM comparison on different datasets

Parcheta and Martínez-Hinarejos [ 71 ] used an optimized sensor called “leap motion”, presented previously, to capture 3D information of hand gestures. Of the two available types of HMM, discrete and continuous, they used continuous HMMs for gesture recognition. The Hidden Markov Model Toolkit (HTK) was used to interact with and interpret the HMMs. They tried to recognize about 91 gestures collected with this device by partitioning the data into four parts and training HMM topologies through some defined models, producing an accuracy of 87.4% for gesture recognition.

Starner [ 73 ] proposed a real-time HMM-based system to recognize ASL sentences built from a 40-word lexicon. In the first setup, the user’s gestures were captured using a camera mounted on a desk, producing 92% accuracy. In the second, the camera was mounted on a cap worn by the user, producing 98% accuracy. This paper argued that the vision-based approach is more useful than glove-based approaches.

Ref. [ 34 ] provides an HMM-based system to recognize ArSL in real time. The system is signer independent, removing barriers to communication with deaf people. The authors built their own dataset to recognize 20 isolated Arabic words. They used 6 HMM models with different numbers of states and different numbers of Gaussian mixtures per state. The best accuracy was 82.2%.
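HMM-based recognizers of this kind typically train one model per sign and assign an observation sequence to the model with the highest likelihood. The sketch below shows that scoring step with the forward algorithm. For brevity it assumes discrete HMMs with two states and two observation symbols, whereas several of the systems above use continuous (Gaussian mixture) emissions; the model names and all probabilities are made up.

```python
def forward_likelihood(obs, start_p, trans_p, emit_p):
    """P(obs | model) for a discrete HMM, computed with the forward algorithm."""
    n_states = len(start_p)
    alpha = [start_p[s] * emit_p[s][obs[0]] for s in range(n_states)]
    for symbol in obs[1:]:
        alpha = [
            sum(alpha[prev] * trans_p[prev][s] for prev in range(n_states))
            * emit_p[s][symbol]
            for s in range(n_states)
        ]
    return sum(alpha)


def classify(obs, models):
    """Return the name of the model that gives obs the highest likelihood."""
    return max(models, key=lambda name: forward_likelihood(obs, *models[name]))


# Left-to-right topology shared by both hypothetical signs.
start = [1.0, 0.0]
trans = [[0.7, 0.3],
         [0.0, 1.0]]

# Each model is (start probabilities, transition matrix, emission matrix).
models = {
    # Emits mostly symbol 0 in state 0, then mostly symbol 1 in state 1.
    "sign_up": (start, trans, [[0.9, 0.1], [0.1, 0.9]]),
    # The reverse pattern: mostly symbol 1 first, then mostly symbol 0.
    "sign_down": (start, trans, [[0.1, 0.9], [0.9, 0.1]]),
}

predicted = classify([0, 0, 1, 1], models)
```

In practice the observation symbols would be quantized hand features per frame, and long sequences are usually scored in the log domain to avoid numerical underflow.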

Oliveira et al. [ 28 ] built a framework for static and dynamic sign language.

Hand Segmentation: The OpenPose [ 73 ] detector, trained on the dataset, gave the best hand segmentation results among all evaluated detectors (HandSegNet [ 54 ] and hand detector [ 74 ]). The right hand is detected by applying a feedforward neural network based on VGG-19, and the left hand is detected by flipping the image and applying the previous steps again.

Static Signs: The model consists of 2D convolutions that extract the features. The first layer tends to learn basic pixel-level features such as lines and corners. Each input frame is convolved with more than 32 filters to cover the network’s scope, which is narrower at the beginning, where the model is interested in fewer features.

Dynamic Sign Language Model: It is concerned with two key points. First, its layers use three dimensions so that the convolution extends over the temporal dimension; this is useful in sign language recognition because it helps to model the local variations that describe the trajectory of a gesture during its movement. Briefly, a dynamic model is implemented on a single frame followed by the gesture of each sequence.

For static ISL recognition, this paper achieves an accuracy of 0.9998 for cropped frames and 0.9979 for original frames. For dynamic recognition, categorical accuracy is considered for each class. The classifier model was trained and tested on 8 classes; its accuracy was not high, ranging between 0.66 and 0.76 for different streams.
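The 2D convolution used for static signs can be illustrated with the short sketch below, which shows how a first-layer filter responds to a basic feature such as a vertical edge. It assumes "valid" convolution without padding, computed as cross-correlation in the way CNN layers do; the two hand-written Prewitt-style filters stand in for learned filters, and the frame is a toy example.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image without padding and return the feature map
    (cross-correlation, which is what CNN convolution layers compute)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# Two hand-written 3 x 3 filters standing in for learned first-layer filters.
VERTICAL_EDGE = [[-1, 0, 1],
                 [-1, 0, 1],
                 [-1, 0, 1]]
HORIZONTAL_EDGE = [[-1, -1, -1],
                   [0, 0, 0],
                   [1, 1, 1]]

# Toy 4 x 4 frame: dark left half, bright right half (one vertical edge).
frame = [[0, 0, 1, 1] for _ in range(4)]

# One feature map per filter, as a convolution layer with two filters would produce.
feature_maps = [conv2d_valid(frame, k) for k in (VERTICAL_EDGE, HORIZONTAL_EDGE)]
```

The vertical-edge filter responds strongly everywhere the window straddles the edge, while the horizontal-edge filter stays at zero; a layer with 32 filters simply produces 32 such maps.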

Binyam Gebrekidan Gebre [ 11 ] proposed a method that combines Stokoe’s model and the H-M model. Those models assume that the features extracted from frames are independent of each other, whereas Gebre assumes that a sign’s features are extracted from two frames, the next and the previous one, to capture a hand or any movement. He proposed an SLID system whose subcomponents are (1) skin detection, (2) feature extraction, (3) modeling, and (4) identification. For the modeling step, a random forest algorithm was used, which generates many decision tree classifiers and aggregates their results. Its reported advantages include high performance, flexibility, and stability. He achieved an F1 score of about 95%.
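Because this result is reported as an F1 score rather than plain accuracy, a minimal sketch of how the metric is computed for a two-language identification task may be useful. The labels and predictions below are made up for illustration and are unrelated to the Dicta-Sign results.

```python
def f1_score(y_true, y_pred, positive):
    """Harmonic mean of precision and recall for the class `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted average of the per-class F1 scores."""
    classes = sorted(set(y_true))
    return sum(f1_score(y_true, y_pred, c) for c in classes) / len(classes)


# Hypothetical identification results for ten clips, five per language.
y_true = ["British"] * 5 + ["Greek"] * 5
y_pred = ["British", "British", "British", "British", "Greek",
          "Greek", "Greek", "Greek", "Greek", "British"]

british_f1 = f1_score(y_true, y_pred, "British")
overall_f1 = macro_f1(y_true, y_pred)
```

Here one clip of each language is misidentified, so precision and recall are both 0.8 for each class and the F1 score is 0.8 as well.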

Invariant features [ 75 ]: The method consists of three stages, namely, a training phase, a testing phase, and a recognition phase. The 7Hu invariant moments and the structural shape descriptors are combined to form a new feature vector for recognizing the sign, and MSVM is then applied to train on the ISL signs. The effectiveness of the proposed method is validated on a dataset of 720 images with a recognition rate of 96%.

CNN [ 76 ]: Kang et al. used a CNN, specifically the Caffe implementation network (CaffeNet), consisting of 5 convolution layers, 3 max-pooling layers, and 3 fully connected layers.

FFANN [ 17 , 77 ] was used in ref. [ 48 ], achieving an average accuracy of 94.32% using convex hull eccentricity, elongatedness, pixel segmentation, and rotation for the recognition of about 37 American number and alphabet signs, whereas ref. [ 49 ] applied FFANN to facial and hand gestures of 11 signs, with an average accuracy of 93.6% depending on automatic gesture area segmentation and orientation normalization. Ref. [ 78 ] also used FFANN for the Bengali alphabet with 46 signs, achieving a testing accuracy of 88.69% with a fingertip finder algorithm and multilayered feedforward, back-propagation training.
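To show why moment invariants are attractive as features, the sketch below computes the first two of the seven Hu invariants from normalized central moments and checks that they do not change when a shape is translated or rotated by 90°. Only two invariants are shown to keep the example short; the binary L-shaped test image and the function names are illustrative.

```python
def raw_moment(img, p, q):
    """Raw image moment m_pq, with x as the column index and y as the row index."""
    return sum((x ** p) * (y ** q) * v
               for y, row in enumerate(img) for x, v in enumerate(row))


def hu_first_two(img):
    """First two Hu moment invariants of a grayscale or binary image."""
    m00 = raw_moment(img, 0, 0)
    cx = raw_moment(img, 1, 0) / m00
    cy = raw_moment(img, 0, 1) / m00

    def mu(p, q):
        # Central moment: measured from the centroid, so translation invariant.
        return sum(((x - cx) ** p) * ((y - cy) ** q) * v
                   for y, row in enumerate(img) for x, v in enumerate(row))

    def eta(p, q):
        # Normalized central moment: additionally compensates for scale.
        return mu(p, q) / m00 ** (1 + (p + q) / 2)

    phi1 = eta(2, 0) + eta(0, 2)
    phi2 = (eta(2, 0) - eta(0, 2)) ** 2 + 4 * eta(1, 1) ** 2
    return phi1, phi2


# A small L-shaped binary blob standing in for a segmented hand.
shape = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
# The same blob moved two pixels right and one pixel down.
shifted = [[0] * 6] + [[0, 0] + row[:4] for row in shape[:5]]
# The same blob rotated by 90 degrees clockwise.
rotated = [list(r) for r in zip(*shape[::-1])]

reference = hu_first_two(shape)
```

On a pixel grid, invariance to scaling and to arbitrary rotation angles holds only approximately, which is one reason such features are usually combined with other descriptors.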

Effective ML algorithms have been used to achieve high accuracy, but deep learning algorithms give more accurate results. Deep learning types include unsupervised pre-trained networks, CNNs, recurrent neural networks, and recursive neural networks, which encourages more people to do research and to share and compare their results. We compare different deep learning algorithms and the parameters used, to determine which activation function performs best and how the models are trained and tested.

Ref. [ 44 ] applied two CNN models to 24 letters of ASL with 10 images per letter; the image size is 227 × 227, obtained by resizing with the bicubic interpolation method. The images were trained using 4 CNNs with 20 layers in each CNN. Each model had a different activation function and a different optimization algorithm. PReLU and ReLU were used in model 1 and model 2, respectively. The accuracy of model 1 was 99.3%, as it was able to recognize all 24 letters, while the accuracy of model 2 was 83.33%, as it recognized only 20 of the 24 letters.
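The two activation functions compared here differ only in how they treat negative inputs, as the following sketch shows. In PReLU the negative slope is a learned parameter; the fixed value 0.25 used below is just a typical initial value and is an assumption for illustration. The last lines reproduce the 83.33% figure from the 20 of 24 recognized letters.

```python
def relu(x):
    """Rectified linear unit: negative inputs are clipped to zero."""
    return max(0.0, x)


def prelu(x, alpha=0.25):
    """Parametric ReLU: negative inputs are scaled by alpha instead of clipped.
    alpha is learned during training; 0.25 is only an assumed starting value."""
    return x if x > 0 else alpha * x


inputs = [-2.0, -0.5, 0.0, 1.5]
relu_out = [relu(x) for x in inputs]
prelu_out = [prelu(x) for x in inputs]

# Letter-level accuracy when 20 of the 24 letters are recognized.
recognized, total = 20, 24
accuracy_percent = round(100 * recognized / total, 2)
```

Because PReLU keeps a small gradient for negative inputs, units cannot become permanently inactive, which is one commonly given explanation for differences like the one reported above.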

Ref. [ 81 ] used deep learning algorithms to identify SL using three publicly available datasets. The authors also introduced a new public large-scale RGB + D dataset for Greek sign language and provided two CTC variations, EnCTC and StimCTC, that were mostly used in other application fields. Each frame was resized from 256 × 256 to 224 × 224. The models were trained using the Adam optimizer, and the initial learning rate of 0.0001 was reduced to 0.00001.

Ref. [ 82 ] proposed a deep learning model consisting of a CNN (Inception model) and an RNN to recognize images of ASL. The dataset consists of 2,400 images, divided into 1,800 images for training and the remaining 600 for testing. The CNN extracts features from the frames, using two major approaches for classification: the SoftMax layer and the pool layer. After retraining the model using the Inception model, the extracted features were passed to an RNN using an LSTM model.
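The survey does not state how the learning rate was lowered from 0.0001 to 0.00001. One common scheme, shown below purely as an assumed illustration, is to reduce the rate by a fixed factor when the validation loss stops improving. The class name, the patience value, and the loss values are all hypothetical.

```python
class ReduceOnPlateau:
    """Multiply the learning rate by `factor` once the validation loss has not
    improved for more than `patience` consecutive epochs, never going below `min_lr`."""

    def __init__(self, lr=1e-4, factor=0.1, patience=2, min_lr=1e-5):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.min_lr = min_lr
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss and return the learning rate to use next."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.bad_epochs = 0
        return self.lr


# Made-up validation losses: improvement stalls after the third epoch.
val_losses = [1.00, 0.80, 0.70, 0.70, 0.71, 0.72, 0.69, 0.70]
scheduler = ReduceOnPlateau()
history = [scheduler.step(loss) for loss in val_losses]
```

With these values the rate stays at 0.0001 for the first five epochs and drops to the floor of 0.00001 after the third epoch without improvement.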

Ref. [ 87 ] studied the effect of data augmentation on deep learning algorithms, achieving an accuracy of 97.12%, which is about 4% higher than that of the model before data augmentation. The dataset consists of 10 static gestures; each class has 800 images for training and 160 for testing, resulting in 8,000 images for training and 1,600 for testing. This algorithm outperforms both SVM and KNN when applied to the same dataset, as shown in Figure 10 ( Tables 5 and 6 ).
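Data augmentation of the kind studied here creates extra training samples by applying label-preserving transformations to existing images. The sketch below assumes two simple transformations, a horizontal flip and a one-pixel shift with zero fill, on images stored as nested lists; the actual transformations used in Ref. [ 87 ] are not listed in this survey, and flipping may be inappropriate for signs whose meaning depends on handedness.

```python
def flip_horizontal(img):
    """Mirror the image left to right."""
    return [row[::-1] for row in img]


def shift_right(img, pixels=1, fill=0):
    """Shift the image content to the right, padding the left edge with `fill`."""
    return [[fill] * pixels + row[:len(row) - pixels] for row in img]


def augment(img):
    """Return the original image together with its augmented variants."""
    return [img, flip_horizontal(img), shift_right(img)]


# A tiny 2 x 3 "image" and an augmented training set built from two such images.
sample = [[1, 2, 3],
          [4, 5, 6]]
dataset = [sample, [[0, 0, 1], [0, 1, 1]]]
augmented_dataset = [variant for img in dataset for variant in augment(img)]
```

Each original image yields three training samples in this sketch, so the training set grows threefold without any new recordings.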

Figure 10: Traditional and deep learning algorithm results applied on the same dataset.

Table 5: Comparison of different machine learning algorithms based on different datasets

Table 6: Comparison of deep learning of different sign language datasets focusing on technical parameters such as activation and optimization function, learning rate, and so on

ISL: Indian Sign Language.

Bold indicates highest results of applying different CNN models on various SL datasets.

Ref. [ 43 ] uses a CNN for hand gesture classification. First, the author used connected components analysis to select and segment hands from the image dataset using masks and filters, finger cropping, and segmentation. The author also used Adaptive Histogram Equalization (AHE) to improve image contrast. The CNN’s accuracy was 96.2%, which is higher than the 93.5% achieved by the SVM classifier applied by the same author. Recognition time is also lower with the CNN (0.356 s) than with the SVM (0.647 s). As shown in Table 7 , the CNN exceeds the SVM in measurements such as sensitivity, specificity, and accuracy.

Table 7: Distinction between CNN and SVM on different measurements
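For reference, the three measurements named here are all derived from the same confusion-matrix counts, as the sketch below shows. The counts are made up for illustration and are not the values reported in Ref. [ 43 ].

```python
def binary_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, and accuracy from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),            # share of true gestures detected
        "specificity": tn / (tn + fp),            # share of non-gestures rejected
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }


# Hypothetical counts for one gesture class: 100 positive and 100 negative samples.
metrics = binary_metrics(tp=90, fn=10, tn=95, fp=5)
```

Reporting sensitivity and specificity alongside accuracy shows whether a classifier's errors are mainly missed gestures or false detections.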

Ref. [ 88 ] implemented training and testing of a CNN with Keras and TensorFlow, using SGD as the optimizer with a learning rate of 0.01. The number of epochs is 50, with a batch size of 500. The dataset has a set of static signs of letters, digits, and some words, with images resized to 50 × 50. Each class contains 1,200 images. The overall average accuracy of the system was 93.67%, with 90.04, 93.44, and 97.52% for ASL alphabets, number recognition, and static word recognition, respectively. Tests were carried out with 6 persons who were sign language interpreters and 24 students without any knowledge of sign language ( Figure 11 ).

Figure 11: Performance measure before and after applying data augmentation.

Ref. [ 89 ] applied a CNN to Bhutanese Sign Language digit recognition, collecting a dataset of 20,000 images of digits [0–9] from 21 students; each student was asked to capture 10 images per class. Images and videos were captured from different angles and directions, with different backgrounds and lighting conditions. Images were scaled to 64 × 64. TensorFlow was used as the deep learning library. A comparison with traditional ML confirmed the superiority of the CNN over SVM and KNN, with average accuracies of 97.62% for CNN, 78.95% for KNN, and 70.25% for SVM, and with lower testing time for CNN ( Figure 11 ).

Ref. [ 90 ] applied a CNN to an ISL dataset that consists of 100 distinct images, generating 35,000 images of both colored and grayscale types. The dataset includes digits [0–10], 23 alphabets, and about 67 of the most common words. The original image size of 126 × 126 × 16 was reduced to 63 × 63 × 16 using a kernel filter of size 2. Several optimizers were applied, including ADAM, SGD, Adagrad, AdaDelta, and RMSprop. Using the ADAM optimizer, the author achieved the best result of 99.17% and 98.8% for training and validation, respectively. The proposed model’s accuracy also exceeds that of other classifiers such as KNN (95.95%), SVM (97.9%), and ANN (98%).
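The reduction from 126 × 126 to 63 × 63 with a filter of size 2 is what a 2 × 2 pooling layer with stride 2 produces on each feature map. The sketch below assumes max pooling (the survey does not say whether maximum or average pooling was used) and operates on a single channel stored as nested lists; the function name and test values are illustrative.

```python
def max_pool_2x2(fmap):
    """2 x 2 max pooling with stride 2 on one feature map (list of rows).
    An odd trailing row or column, if present, is dropped."""
    return [
        [
            max(fmap[i][j], fmap[i][j + 1], fmap[i + 1][j], fmap[i + 1][j + 1])
            for j in range(0, len(fmap[0]) - 1, 2)
        ]
        for i in range(0, len(fmap) - 1, 2)
    ]


# Small worked example: each 2 x 2 block is replaced by its maximum.
small = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
pooled_small = max_pool_2x2(small)

# One 126 x 126 channel (arbitrary values) shrinks to 63 x 63.
channel = [[(i + j) % 5 for j in range(126)] for i in range(126)]
pooled = max_pool_2x2(channel)
```

Applied independently to each of the 16 channels, this halves the spatial size while leaving the channel count unchanged.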

6 Conclusion and future work

The variety of sign language datasets, which include different gestures, leads to different accuracies, as discussed in our review of the previous literature. This survey showed that different datasets have been used in the training and testing of SLR systems. It compared the vision-based approach with the glove-based approach, showed the advantages and disadvantages of both, illustrated the difference between signer-dependent and signer-independent systems, and addressed the basic preprocessing steps such as skin detection, image segmentation, hand tracking, feature extraction, and hand gesture classification.

The survey also compares some ML techniques with the most used deep learning algorithm (CNN), showing that deep learning results exceed those of traditional ML. Some glove-based systems outperform deep learning algorithms because of the accurate signals that researchers obtain during feature extraction, whereas deep learning features are learned during model training and are not as accurate as those of glove-based systems. Nevertheless, we need to get rid of any obstacles (gloves, sensors, and leap devices) or any other electronic device that may restrict user interaction with the system. Many trials have been made, but with less accuracy.

Few researchers are working on SLID, although it is important for a comprehensive SLR system. Including ArSL in our future work will be a challenging task. Also, doing away with gloves and other electronic systems will give users more comfort while communicating with others.

Conflict of interest: Authors state no conflict of interest.

Data availability statement: Data sharing is not applicable to this article as no new data were created or analyzed in this study.

[1] R. Kushalnagar, “Deafness and Hearing Loss,” Web Accessibility. Human–Computer Interaction Series, Y. Yesilada, S. Harper, eds, London, Springer, 2019. 10.1007/978-1-4471-7440-0_3

[2] World Federation of the Deaf. Our Work, 2018. http://wfdeaf.org/our-work/ Accessed 2019–03–26.

[3] S. Wilcox and J. Peyton, “American Sign Language as a foreign language,” CAL. Dig., pp. 159–160, 1999.

[4] M. del Carmen Cabeza-Pereiro, J. M. Garcia-Miguel, C. G. Mateo, and J. L. A. Castro, “CORILSE: a Spanish sign language repository for linguistic analysis,” Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), 2016, May, pp. 1402–1407.

[5] T. Johnston and A. Schembri, Australian Sign Language (Auslan): An Introduction to Sign Language Linguistics, Cambridge, UK, Cambridge University Press, 2007. ISBN 9780521540568. 10.1017/CBO9780511607479.

[6] M. Abdel-Fattah, “Arabic Sign Language: A Perspective,” J. Deaf. Stud. Deaf. Educ., vol. 10, no. 2, 2005, pp. 212–221. 10.1093/deafed/eni007.

[7] J. V. Van Cleve, Gallaudet Encyclopedia of Deaf People and Deafness, Vol 3, New York, New York, McGraw-Hill Company, Inc., 1987, pp. 344–346.

[8] D. Cokely, Charlotte Baker-Shenk, American Sign Language, Washington, Gallaudet University Press, 1981.

[9] U. Shrawankar and S. Dixit, Framing Sentences from Sign Language Symbols using NLP, In IEEE conference, 2016, pp. 5260–5262.

[10] N. El-Bendary, H. M. Zawbaa, M. S. Daoud, A. E. Hassanien, K. Nakamatsu, “ArSLAT: Arabic Sign Language Alphabets Translator,” 2010 International Conference on Computer Information Systems and Industrial Management Applications (CISIM), Krackow, 2010, pp. 590–595. 10.1109/CISIM.2010.5643519

[11] B. G. Gebre, P. Wittenburg, and T. Heskes, “Automatic sign language identification,” 2013 IEEE International Conference on Image Processing, Melbourne, VIC, 2013, pp. 2626–2630. 10.1109/ICIP.2013.6738541

[12] D. Bragg, O. Koller, M. Bellard, L. Berke, P. Boudreault, A. Braffort, et al., “Sign language recognition, generation, and translation: an interdisciplinary perspective,” The 21st International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS ’19), New York, NY, USA, Association for Computing Machinery, 2019, pp. 16–31. 10.1145/3308561.3353774

[13] R. Rastgoo, K. Kiani, and S. Escalera, “Sign language recognition: A deep survey,” Expert. Syst. Appl., vol. 164, 113794, 2020. 10.1016/j.eswa.2020.113794

[14] A. Sahoo, G. Mishra, and K. Ravulakollu, “Sign language recognition: State of the art,” ARPN J. Eng. Appl. Sci., vol. 9, pp. 116–134, 2014.

[15] A. Karpov, I. Kipyatkova, and M. Železný, “Automatic technologies for processing spoken sign languages,” Proc. Computer Sci., vol. 81, pp. 201–207, 2016. 10.1016/j.procs.2016.04.050.

[16] F. Chou and Y. Su, “An encoding and identification approach for the static sign language recognition,” 2012 IEEE/ASME International Conference on Advanced Intelligent Mechatronics (AIM), Kachsiung, 2012, pp. 885–889. 10.1109/AIM.2012.6266025

[17] https://en.wikipedia.org/wiki/Feedforward_neural_network.

[18] https://www.deafwebsites.com/sign-language/sign-language-other-cultures.html.

[19] D. Santiago, I. Benderitter, and C. García-Mateo, Experimental Framework Design for Sign Language Automatic Recognition, 2018, pp. 72–76. 10.21437/IberSPEECH.2018-16.

[20] Z. Zafrulla, H. Brashear, P. Yin, P. Presti, T. Starner, and H. Hamilton, “American sign language phrase verification in an educational game for deaf children,” IEEE, pp. 3846–3849, 2010, 10.1109/ICPR.2010.937.

[21] K. B. Shaik, P. Ganesan, V. Kalist, B. S. Sathish, and J. M. M. Jenitha, “Comparative study of skin color detection and segmentation in HSV and YCbCr color space,” Proc. Computer Sci., vol. 57, pp. 41–48, 2015. 10.1016/j.procs.2015.07.362.

[22] P. Dreuw, D. Rybach, T. Deselaers, M. Zahedi, and H. Ney, “Speech Recognition Techniques for a Sign Language Recognition System,” ICSLP, Antwerp, Belgium, August. Best Paper Award, 2007a. 10.21437/Interspeech.2007-668

[23] K. Dixit and A. S. Jalal, “Automatic Indian Sign Language recognition system,” 2013 3rd IEEE International Advance Computing Conference (IACC), Ghaziabad, 2013, pp. 883–887. 10.1109/IAdCC.2013.6514343 . Search in Google Scholar

[24] I. Z. Onno Crasborn and J. Ros, “Corpus-NGT. An open access digital corpus of movies with annotations of Sign Language of the Netherlands,” Technical Report, Centre for Language Studies, Radboud University Nijmegen, 2008. http://www.corpusngt.nl. Search in Google Scholar

[25] M. Hassan, K. Assaleh, and T. Shanableh, “Multiple proposals for continuous arabic sign language recognition,” Sensing Imaging, vol. 20, no. 1. pp. 1–23, 2019. 10.1007/s11220-019-0225-3 Search in Google Scholar

[26] A. Youssif, A. Aboutabl, and H. Ali, “Arabic sign language (ArSL) recognition system using HMM,” Int. J. Adv. Computer Sci. Appl., vol. 2, 2011. 10.14569/IJACSA.2011.021108 . Search in Google Scholar

[27] O. Koller, J. Forster, and H. Ney, “Continuous sign language recognition: Towards large vocabulary statistical recognition systems handling multiple signers,” Computer Vis. Image Underst., vol. 141, pp. 108–125, 2015. 10.1016/j.cviu.2015.09.013 . Search in Google Scholar

[28] M. Oliveira, H. Chatbri, Y. Ferstl, M. Farouk, S. Little, N. OConnor, et al., “A dataset for Irish sign language recognition,” Proceedings of the Irish Machine Vision and Image Processing Conference (IMVIP), vol. 8, 2017. Search in Google Scholar

[29] N. C. Camgoz, A. A. Kindiroğlu, S. Karabüklü, M. Kelepir, A. S. Ozsoy, and L. Akarun, BosphorusSign: a Turkish sign language recognition corpus in health and finance domains. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), 2016, pp. 1383–1388. Search in Google Scholar

[30] S. Ebling, N. C. Camgöz, P. B. Braem, K. Tissi, S. Sidler-Miserez, S. Stoll, and M. Magimai-Doss, “SMILE Swiss German sign language dataset,” Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC) 2018, University of Surrey, 2018. Search in Google Scholar

[31] N. M. Adaloglou, T. Chatzis, I. Papastratis, A. Stergioulas, G. T. Papadopoulos, V. Zacharopoulou, and P. Daras none, “A comprehensive study on deep learning-based methods for sign language recognition,” IEEE Trans. Multimedia, pp. 1, 2021. 10.1109/tmm.2021.3070438 . Search in Google Scholar

[32] A. Sahoo, “Indian sign language recognition using neural networks and kNN classifiers,” J. Eng. Appl. Sci., vol. 9, pp. 1255–1259, 2014. Search in Google Scholar

[33] R. Rastgoo, K. Kiani, and S. Escalera, “Hand sign language recognition using multi-view hand skeleton,” Expert. Syst. Appl., vol. 150, p. 113336, 2020a. 10.1016/j.eswa.2020.113336 Search in Google Scholar

[34] H. R. V. Joze and O. Koller, “MS-ASL: A large-scale dataset and benchmark for understanding American sign language. arXiv preprint arXiv:1812.01053,” arXiv 2018, arXiv:1812.01053. Search in Google Scholar

[35] D. Li, C. Rodriguez, X. Yu, and H. Li, “Word-level deep sign language recognition from video: A new large-scale dataset and methods comparison,” Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass Village, CO, USA, 1–5 March 2020, pp. 1459–1469. 10.1109/WACV45572.2020.9093512 Search in Google Scholar

[36] O. M. Sincan and H. Y. Keles, “AUTSL: A large-scale multi-modal Turkish sign language dataset and baseline methods,” IEEE Access, vol. 8, pp. 181340–181355, 2020. 10.1109/ACCESS.2020.3028072 Search in Google Scholar

[37] A. A. I. Sidig, H. Luqman, S. Mahmoud, and M. Mohandes, “KArSL: Arabic sign language database,” ACM Trans. Asian Low-Resour. Lang. Inf. Process, vol. 20, pp. 1–19, 2021. 10.1145/3423420 Search in Google Scholar

[38] D. S. Breland, S. B. Skriubakken, A. Dayal, A. Jha, P. K. Yalavarthy, and L. R. Cenkeramaddi, “Deep learning-based sign language digits recognition from thermal images with edge computing system,” IEEE Sens. J., vol. 21, no. 9. pp. 10445–10453, 2021‏. 10.1109/JSEN.2021.3061608 Search in Google Scholar

[39] A. Mittal, P. Kumar, P. P. Roy, R. Balasubramanian, and B. B. Chaudhuri, “A modified LSTM model for continuous sign language recognition using leap motion,” IEEE Sens. J., vol. 19, no. 16. pp. 7056–7063, 2019. 10.1109/jsen.2019.2909837 . Search in Google Scholar

[40] O. Koller, S. Zargaran, H. Ney, and R. Bowden, “Deep sign: enabling robust statistical continuous sign language recognition via hybrid CNN-HMMs,” Int. J. Comput. Vis., vol. 126, pp. 1311–1325, 2018. 10.1007/s11263-018-1121-3 Search in Google Scholar

[41] I. Hernández, Automatic Irish sign language recognition, Trinity College, Diss. Thesis of Master of Science in Computer Science (Augmented and Virtual Reality), University of Dublin, 2018. Search in Google Scholar

[42] P. S. Neethu, R. Suguna, and D. Sathish, “An efficient method for human hand gesture detection and recognition using deep learning convolutional neural networks,” Soft Comput., vol. 24, pp. 15239–15248, 2020. 10.1007/s00500-020-04860-5 . Search in Google Scholar

[43] C. D. D. Monteiro, C. M. Mathew, R. Gutierrez-Osuna, F. Shipman, Detecting and identifying sign languages through visual features, 2016 IEEE International Symposium on Multimedia (ISM), 2016. 10.1109/ism.2016.0063 . Search in Google Scholar

[44] F. Raheem and A. A. Abdulwahhab, “Deep learning convolution neural networks analysis and comparative study for static alphabet ASL hand gesture recognition,” Xi'an Dianzi Keji Daxue Xuebao/J. Xidian Univ., vol. 14, pp. 1871–1881, 2020. 10.37896/jxu14.4/212 . Search in Google Scholar

[45] A. Kumar and S. Malhotra, Real-Time Human Skin Color Detection Algorithm Using Skin Color Map, 2015. Search in Google Scholar

[46] Y. R. Wang, W. H. Li and L. Yang, “A Novel real time hand detection based on skin color,” 17th IEEE International Symposium on Consumer Electronics (ISCE), 2013, pp. 141–142. 10.1109/ISCE.2013.6570151 Search in Google Scholar

[47] K. Sheth, N. Gadgil, and P. R. Futane, “A Hybrid hand detection algorithm for human computer interaction using skin color and motion cues,” Inter. J. Computer Appl., vol. 84, no. 2. pp. 14–18, December 2013. 10.5120/14548-2636 Search in Google Scholar

[48] M. M. Islam, S. Siddiqua, and J. Afnan, “Real time hand gesture recognition using different algorithms based on American sign language,” 2017 IEEE International Conference on Imaging, Vision & Pattern Recognition (icIVPR), 2017. 10.1109/icivpr.2017.7890854 . Search in Google Scholar

[49] Y.-J. Tu, C.-C. Kao, and H.-Y. Lin, “Human computer interaction using face and gesture recognition,” 2013 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, 2013. 10.1109/apsipa.2013.6694276 . Search in Google Scholar

[50] M. Kawulok, “Dynamic skin detection in color images for sign language recognition,” Image Signal. Process, vol. 5099, pp. 112–119, 2008. 10.1007/978-3-540-69905-7_13 Search in Google Scholar

[51] S. Bilal, R. Akmeliawati, M. J. E. Salami, and A. A. Shafie, “Dynamic approach for real-time skin detection,” J. Real-Time Image Proc., vol. 10, no. 2. pp. 371–385, 2015. 10.1007/s11554-012-0305-2 Search in Google Scholar

[52] N. Ibrahim, H. Zayed, and M. Selim, “An automatic arabic sign language recognition system (ArSLRS),” J. King Saud. Univ. – Computer Inf. Sci., Vol. 30, no. 4, October 2018, Pages 470–477. 10.1016/j.jksuci.2017.09.007 . Search in Google Scholar

[53] M. P. Paulraj, S. Yaacob, Z. Azalan, M. Shuhanaz, and R. Palaniappan, A Phoneme-based Sign Language Recognition System Using Skin Color Segmentation, 2010, pp. 1–5. 10.1109/CSPA.2010.5545253 . Search in Google Scholar

[54] T. Simon, H. Joo, I. Matthews, and Y. Sheikh, “Hand Keypoint Detection in Single Images Using Multiview Bootstrapping” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 4645–4653. doi: 10.1109/CVPR.2017.494. 10.1109/CVPR.2017.494 Search in Google Scholar

[55] R. Akmeliawati, “Real-time Malaysian sign language translation using colour segmentation and neural network”, Proc. of the IEEE International Conference on Instrumentation and Measurement Technology 2007, Warsaw, 2007, pp. 1–6. 10.1109/IMTC.2007.379311 Search in Google Scholar

[56] J. Lim, D. Lee, and B. Kim, “Recognizing hand gesture using wrist shapes,” 2010 Digest of Technical Papers of the International Conference on Consumer Electronics (ICCE), Las Vegas, 2010, pp. 197–198. Search in Google Scholar

[57] O.Al-Jarrah and A. Halawani, “Recognition of gestures in Arabic sign language using neuro-fuzzy systems,” Artif. Intell., vol. 133, pp. 117–138, 2001. 10.1016/S0004-3702(01)00141-2 . Search in Google Scholar

[58] M. A. Hussain, Automatic recognition of sign language gestures, Master’s Thesis. Jordan University of Science and Technology, Irbid, 1999. Search in Google Scholar

[59] C. Oz and M. C. Leu, “American sign language word recognition with a sensory glove using artifcial neural networks,” Eng. Appl. Artifcial Intell., vol. 24, no. 7. pp. 1204–1213, Oct. 2011. Search in Google Scholar

[60] M. W. Kadous, “Machine recognition of Auslan signs using PowerGloves: Towards large-lexicon recognition of sign language,” Proceedings of the Workshop on the Integration of Gesture in Language and Speech, Wilmington, DE, USA, 1996, pp. 165–174. Search in Google Scholar

[61] N. Tubaiz, T. Shanableh, and K. Assaleh, “Glove-based continuous Arabic sign language recognition in user-dependent mode,” IEEE Trans. Human-Mach. Syst., vol. 45, no. 4. pp. 526–533, 2015. 10.1109/THMS.2015.2406692 Search in Google Scholar

[62] P. D. Rosero-Montalvo, P. Godoy-Trujillo, E. Flores-Bosmediano, J. Carrascal-Garcia, S. Otero-Potosi, H. Benitez-Pereira, et al., “Sign language recognition based on intelligent glove using machine learning techniques,” 2018 IEEE Third Ecuador Technical Chapters Meeting (ETCM), 2018. 10.1109/etcm.2018.8580268 . Search in Google Scholar

[63] L. Chen, J. Fu, Y. Wu, H. Li, and B. Zheng, “Hand gesture recognition using compact CNN via surface electromyography signals,” Sensors, vol. 20, no. 3. p. 672, 2020. 10.3390/s20030672 . Search in Google Scholar PubMed PubMed Central

[64] D. Aryanie and Y. Heryadi, “American sign language-based finger-spelling recognition using k-Nearest Neighbors classifier.” 2015 3rd International Conference on Information and Communication Technology (ICoICT), 2015, pp. 533–536. 10.1109/ICoICT.2015.7231481 Search in Google Scholar

[65] F. Utaminingrum, I. Komang Somawirata, and G. D. Naviri, “Alphabet sign language recognition using K-nearest neighbor optimization,” JCP, vol. 14, no. 1. pp. 63–70, 2019. 10.17706/jcp.14.1.63-70 Search in Google Scholar

[66] A. Jadhav, G. Tatkar, G. Hanwate, and R. Patwardhan, “Sign language recognition,” Int. J. Adv. Res. Comput. Sci. Softw. Eng., vol. 7, pp. 109–115, no. 3, 2017. 10.23956/ijarcsse/V7I3/0127 Search in Google Scholar

[67] U. Patel and A. G. Ambekar, "Moment Based Sign Language Recognition for Indian Languages," 2017 International Conference on Computing, Communication, Control and Automation (ICCUBEA), 2017, pp. 1–6. 10.1109/ICCUBEA.2017.8463901 . Search in Google Scholar

[68] G. Saggio, P. Cavallo, M. Ricci, V. Errico, J. Zea, and M. E. Benalcázar, “Sign language recognition using wearable electronics: implementing k-Nearest Neighbors with dynamic time warping and convolutional neural network algorithms,” Sensors, vol. 20, no. 14. p. 3879, 2020. 10.3390/s20143879 . Search in Google Scholar PubMed PubMed Central

[69] A. K. Sahoo, “Indian sign language recognition using machine learning techniques,” Macromol. Symp., vol. 397, no. 1. p. 2000241, 2021. 10.1002/masy.202000241 . Search in Google Scholar

[70] Z. Parcheta and C.-D. Martínez-Hinarejos, “Sign language gesture recognition using HMM,” in Pattern Recognition and Image Analysis. Lecture Notes in Computer Science 2017. L. Alexandre, J. Salvador Sánchez, J. Rodrigues, (Eds), IbPRIA, vol. 10255, Cham: Springer, pp. 419–426, 2017. 10.1007/978-3-319-58838-4_46 . Search in Google Scholar

[71] T. Starner, J. Weaver, and A. Pentland, “Real-time American sign language recognition using desk and wearable computer-based video,” IEEE Trans. Pattern Anal. Mach. Intellig., vol. 20, no. 12. pp. 1371–1375, 1998. 10.1109/34.735811 Search in Google Scholar

[72] C. Zimmermann and T. Brox, “Learning to estimate 3D hand pose from single RGB images,” 2017 IEEE International Conference on Computer Vision (ICCV), 2017, pp. 4913–4921. 10.1109/ICCV.2017.525 Search in Google Scholar

[73] D. Victor, Real-Time Hand Tracking Using SSD on TensorFlow, GitHub Repository, 2017. Search in Google Scholar

[74] K. Dixit and A. S. Jalal, “Automatic Indian sign language recognition system,” 2013 3rd IEEE International Advance Computing Conference (IACC), 2013. 10.1109/iadcc.2013.6514343 . Search in Google Scholar

[75] B. Kang, S. Tripathi, and T. Nguyen, “Real-time sign language fingerspelling recognition using convolutional neural networks from depth map,” 3rd IAPR Asian Conference on Pattern Recognition, Kuala Lumpur, Malaysia, 2015. 10.1109/acpr.2015.7486481 . Search in Google Scholar

[76] https://en.wikipedia.org/wiki/Backpropagation. Search in Google Scholar

[77] A. M. Jarman, S. Arshad, N. Alam, and M. J. Islam, “An automated bengali sign language recognition system based on fingertip finder algorithm,” Int. J. Electron. Inform., vol. 4, no. 1. pp. 1–10, 2015‏. Search in Google Scholar

[78] P. P. Roy, P. Kumar, and B. -G. Kim, “An efficient sign language recognition (SLR) system using camshift tracker and hidden markov model (HMM),” SN Computer Sci., vol. 2, 79, no. 2, 2021. 10.1007/s42979-021-00485-z . Search in Google Scholar

[79] S. Ghanbari Azar and H. Seyedarabi, “Trajectory-based recognition of dynamic persian sign language using hidden Markov Model,” arXiv e-prints, p. arXiv-1912, 2019. Search in Google Scholar

[80] N. M. Adaloglou, T. Chatzis, I. Papastratis, A. Stergioulas, G. T. Papadopoulos, V. Zacharopoulou, and P. Daras, “A Comprehensive Study on Deep Learning-based Methods for Sign Language Recognition,” IEEE Transactions on Multimedia, p. 1, 2021. 10.1109/tmm.2021.3070438 . Search in Google Scholar

[81] K. Bantupalli and Y. Xie, “American sign language recognition using deep learning and computer vision,” 2018 IEEE International Conference on Big Data (Big Data), Seattle, WA, USA, 2018, pp. 4896–4899. 10.1109/BigData.2018.8622141 . Search in Google Scholar

[82] F. Utaminingrum, I. Komang Somawirata, and G. D. Naviri, “Alphabet sign language recognition using K-nearest neighbor optimization,” J. Comput., vol. 14, no. 1. pp. 63–70, 2019. 10.17706/jcp.14.1.63-70 Search in Google Scholar

[83] M. M. Kamruzzaman, “Arabic sign language recognition and generating Arabic speech using convolutional neural network,” Wirel. Commun. Mob. Comput., vol. 2020, pp. 1–9, 2020. 10.1155/2020/3685614 . Search in Google Scholar

[84] M. Varsha and C. S. Nair, “Indian sign language gesture recognition using deep convolutional neural network,” 2021 8th International Conference on Smart Computing and Communications (ICSCC), IEEE, 2021. 10.1109/ICSCC51209.2021.9528246 Search in Google Scholar

[85] M. Z. Islam, M. S. Hossain, R. ul Islam, and K. Andersson, “Static hand gesture recognition using convolutional neural network with data augmentation,” 2019 Joint 8th International Conference on Informatics, Electronics & Vision (ICIEV) and 2019 3rd International Conference on Imaging, Vision & Pattern Recognition (icIVPR), Spokane, WA, USA, 2019, pp. 324–329. 10.1109/ICIEV.2019.8858563 . Search in Google Scholar

[86] L. K. S. Tolentino, R. O. Serfa Juan, A. C. Thio-ac, M. A. B. Pamahoy, J. R. R. Forteza, and X. J. O. Garcia, “Static sign language recognition using deep learning,” Int. J. Mach. Learn. Comput., vol. 9, no. 6. pp. 821–827, 2019. 10.18178/ijmlc.2019.9.6.879 Search in Google Scholar

[87] K. Wangchuk, P. Riyamongkol, and R. Waranusast, “Real-time Bhutanese sign language digits recognition system using convolutional neural network,” ICT Exp., vol. 7, no. 2, pp. 215–220, 2020. 10.1016/j.icte.2020.08.002 . Search in Google Scholar

[88] L. K. Tolentino, R. Serfa Juan, A. Thio-ac, M. Pamahoy, J. Forteza, and X. Garcia, “Static sign language recognition using deep learning,” Int. J. Mach. Learn. Comput., vol. 9, pp. 821–827, 2019. 10.18178/ijmlc.2019.9.6.879 . Search in Google Scholar

[89] P. M. Ferreira, J. S. Cardoso, and A. Rebelo, “Multimodal Learning for Sign Language Recognition,” Pattern Recognition and Image Analysis. IbPRIA 2017. Lecture Notes in Computer Science(), L. Alexandre, J. Salvador Sánchez, and J. Rodrigues, (eds), vol. 10255, Cham, Springer, 2017. 10.1007/978-3-319-58838-4_35 . Search in Google Scholar

[90] A. Elboushaki, R. Hannane, A. Karim, and L. Koutti, “MultiD-CNN: a multi-dimensional feature learning approach based on deep convolutional networks for gesture recognition in RGB-D image sequences,” Expert. Syst. Appl., vol. 139, p. 112829, 2019. 10.1016/j.eswa.2019.112829 . Search in Google Scholar

[91] O. Kopuklu, A. Gunduz, N. Kose, and G. Rigoll, “Real-time hand gesture detection and classification using convolutional neural networks,” 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019), 2019. 10.1109/fg.2019.8756576 . Search in Google Scholar

[92] Ch. Yuxiao, L. Zhao, X. Peng, J. Yuan, and D. Metaxas, Construct Dynamic Graphs for Hand Gesture Recognition Via Spatial-temporal Attention, UK, 2019, pp. 1–13. https://bmvc2019.org/wp-content/uploads/papers/0281-paper.pdf. Search in Google Scholar

[93] A. Z. Shukor, M. F. Miskon, M. H. Jamaluddin, F. Bin Ali, M. F. Asyraf, and M. B. Bin Bahar., “A new data glove approach for malaysian sign language detection,” Procedia Computer Science, vol. 76, pp. 60–67, 2015, 10.1016/j.procs.2015.12.276 . Search in Google Scholar

© 2022 Ahmed Sultan et al., published by De Gruyter

This work is licensed under the Creative Commons Attribution 4.0 International License.

Open Computer Science

Sign language recognition using artificial intelligence

  • Published: 01 November 2022
  • Volume 28, pages 5259–5278 (2023)

  • R. Sreemathy 1 ,
  • Mousami Turuk 1 ,
  • Isha Kulkarni 1 &
  • Soumya Khurana 1  

694 Accesses

2 Citations

Sign language is the natural means of communication for people with speech and hearing impairments. With an Indian Sign Language (ISL) interpretation system, hearing-impaired people can interact with hearing people through Human Computer Interaction (HCI). This paper presents a method for the automatic recognition of two-handed signs of Indian Sign Language (ISL). The work has three phases: preprocessing, feature extraction and classification. We trained a back propagation network (BPN) on Histogram of Oriented Gradients (HOG) features, and the trained model was then tested on real-time gestures. The overall accuracy achieved was 89.5% with 5184 input features and 50 hidden neurons. A deep learning approach was also implemented using AlexNet, GoogLeNet, VGG-16 and VGG-19, which gave accuracies of 99.11%, 95.84%, 98.42% and 99.11%, respectively. MATLAB was used as the simulation platform. The proposed technology is used as a teaching assistant for specially abled persons and has demonstrated an increase in cognitive ability of 60–70% in children. The system applies image processing and machine learning to recognize alphabets from Indian Sign Language, and can be used as an ICT (information and communication technology) tool to enhance their cognitive capability.
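To make the reported sizes concrete, the short Python sketch below shows how a HOG descriptor length and the size of a one-hidden-layer network could be worked out. It is only an illustration: the abstract does not state the image size, cell size, block size or number of output classes, so the 104×104 image, 8×8-pixel cells, 2×2-cell blocks, 9 orientation bins and 26 classes used here are assumptions chosen because they happen to reproduce the 5184 input features mentioned above. The function names are hypothetical.

```python
def hog_feature_length(image_size, cell_size=8, block_cells=2,
                       block_stride_cells=1, bins=9):
    """Length of a HOG descriptor for a square image.

    image_size and cell_size are in pixels; block_cells and
    block_stride_cells are in cells. All defaults are assumptions,
    not values taken from the paper.
    """
    cells_per_side = image_size // cell_size
    blocks_per_side = (cells_per_side - block_cells) // block_stride_cells + 1
    values_per_block = block_cells * block_cells * bins
    return blocks_per_side * blocks_per_side * values_per_block


def bpn_parameter_count(n_inputs, n_hidden, n_outputs):
    """Weights plus biases of a fully connected network with one hidden layer."""
    hidden_layer = n_inputs * n_hidden + n_hidden
    output_layer = n_hidden * n_outputs + n_outputs
    return hidden_layer + output_layer


# Assumed 104x104 input: 13x13 cells -> 12x12 blocks -> 144 * 36 values.
n_features = hog_feature_length(104)

# 50 hidden neurons come from the abstract; 26 output classes is an assumption.
n_params = bpn_parameter_count(n_features, 50, 26)
print(n_features, n_params)
```

Under these assumed settings the descriptor has 5184 values, which matches the input size quoted in the abstract, although other configurations could give the same count.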

Acknowledgements

We thank the Rajiv Gandhi Science & Technology Commission, Government of Maharashtra, India, for funding this work. We also acknowledge the support provided by Pune Institute of Computer Technology, India.

Author information

Authors and affiliations

Pune Institute of Computer Technology, Pune, India

R. Sreemathy, Mousami Turuk, Isha Kulkarni & Soumya Khurana

Corresponding author

Correspondence to Mousami Turuk.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Sreemathy, R., Turuk, M., Kulkarni, I. et al. Sign language recognition using artificial intelligence. Educ Inf Technol 28, 5259–5278 (2023). https://doi.org/10.1007/s10639-022-11391-z

Received: 19 April 2022

Accepted: 04 October 2022

Published: 01 November 2022

Issue Date: May 2023

DOI: https://doi.org/10.1007/s10639-022-11391-z

  • Sign language
  • Morphological operations
  • Histogram of Gradients (HOG)
  • Artificial Neural Networks (ANN)
  • Back Propagation Networks (BPN)

Computer Science > Computer Vision and Pattern Recognition

Title: Real-time Indian Sign Language (ISL) Recognition

Abstract: This paper presents a system that can recognise hand poses and gestures from Indian Sign Language (ISL) in real time using grid-based features. The system attempts to bridge the communication gap between people with hearing and speech impairments and the rest of society. Existing solutions either provide relatively low accuracy or do not work in real time; this system performs well on both counts. It can identify 33 hand poses and some gestures from ISL. Sign language is captured with a smartphone camera, and the frames are transmitted to a remote server for processing. No external hardware (such as gloves or the Microsoft Kinect sensor) is needed, which makes the system user-friendly. Face detection, object stabilisation and skin colour segmentation are used for hand detection and tracking. The image is then subjected to a grid-based feature extraction technique that represents the hand's pose as a feature vector. Hand poses are classified using the k-Nearest Neighbours algorithm. For gesture classification, the observation sequences of motion and intermediate hand poses are fed to Hidden Markov Model chains corresponding to the 12 pre-selected gestures defined in ISL. With this methodology, the system achieves an accuracy of 99.7% for static hand poses and 97.23% for gesture recognition.
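The pose-classification step described in the abstract can be pictured with a small Python sketch. The abstract does not define the grid-based feature, so the version here is an assumption: the segmented hand mask is split into a grid, and each cell contributes the fraction of hand pixels it contains. The function names, the grid size and the toy masks are all hypothetical, and the k-Nearest Neighbours vote uses plain Euclidean distance, which is one common choice rather than the authors' confirmed setting.

```python
import math
from collections import Counter


def grid_features(mask, grid=2):
    """Fraction of foreground (hand) pixels in each cell of a grid x grid split.

    `mask` is a list of equal-length rows of 0/1 values, for example the
    output of a skin-colour segmentation step. It is assumed to be at
    least grid x grid pixels.
    """
    height, width = len(mask), len(mask[0])
    features = []
    for gy in range(grid):
        y0, y1 = gy * height // grid, (gy + 1) * height // grid
        for gx in range(grid):
            x0, x1 = gx * width // grid, (gx + 1) * width // grid
            cell = [mask[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            features.append(sum(cell) / len(cell))
    return features


def knn_predict(train, query, k=3):
    """Majority label among the k training vectors closest to `query`.

    `train` is a list of (feature_vector, label) pairs.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy 4x4 masks: hand pixels on the left half versus the right half.
left = [[1, 1, 0, 0]] * 4
right = [[0, 0, 1, 1]] * 4
train = [(grid_features(left), "pose_left"),
         (grid_features(right), "pose_right")]

# A slightly noisy left-hand mask should still land on "pose_left".
query = [[1, 1, 0, 0], [1, 1, 0, 0], [1, 0, 0, 0], [1, 1, 0, 0]]
label = knn_predict(train, grid_features(query), k=1)
print(label)
```

A real system would use a much finer grid and many labelled examples per pose; the toy data only shows how the feature vector and the neighbour vote fit together.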

Sign language recognition

IMAGES

  1. (PDF) Deep Learning for Sign Language Recognition: Current Techniques

  2. (PDF) Sign Language Detection with Voice Extraction in Matlab using

  3. (PDF) Sign Language Recognition Using Principal Component Analysis

  4. Sign Language Recognition

  5. (PDF) SIGN LANGUAGE RECOGNITION

  6. (PDF) Real Time Sign Language Detection

VIDEO

  1. improve sign language ISL

  2. ASL I

  3. Sign Language 300 Years Ago !!!

COMMENTS

  1. (PDF) Real Time Sign Language Detection

    Abstract. A real time sign language detector is a significant step forward in improving communication between the deaf and the general population. We are pleased to showcase the creation and ...

  2. (PDF) Sign Language Recognition

    Sign Language Recognition (SLR) deals with recognizing the hand gestures acquisition and continues till text or speech is generated for corresponding hand gestures. Here hand gestures for sign ...

  3. Machine learning methods for sign language recognition: A critical

    The three countries play a key role in advancing sign language recognition research, with India leading worldwide. India led with 123 publications over the past two decades, covering 15.4% of the total global publications. ... and joining the edges. Edge detection techniques reviewed in this paper are Robert edge detector, Sobel edge detector ...

  4. (PDF) Real-Time Sign Language Detection Using CNN

    In this research paper, we proposed a new deep learning-based approach to detect sign language, which can remove the barrier of communication between normal and deaf people. To detect real-time ...

  5. Sign language recognition using the fusion of image and hand ...

    Sign Language Recognition is a breakthrough for communication among deaf-mute society and has been a critical research topic for years. ... paper is divided ... Sign Language Detection Using SIFT ...

  6. [2204.03328] A Comprehensive Review of Sign Language Recognition

    A machine can understand human activities, and the meaning of signs can help overcome the communication barriers between the inaudible and ordinary people. Sign Language Recognition (SLR) is a fascinating research area and a crucial task concerning computer vision and pattern recognition. Recently, SLR usage has increased in many applications, but the environment, background image resolution ...

  7. Recent progress in sign language recognition: a review

    Sign language is a predominant form of communication among a large group of society. The nature of sign languages is visual, making them distinct from spoken languages. Unfortunately, very few able people can understand sign language making communication with the hearing-impaired infeasible. Research in the field of sign language recognition (SLR) can help reduce the barrier between deaf and ...

  8. Sign Language Detection using Action Recognition

    Abstract: Sign Language Detection has become crucial and effective for humans and research in this area is in progress and is one of the applications of Computer Vision. Earlier works included detection using static signs with the help of a simple deep learning-based Convolutional Neural Network. This proposal is based on continuous detection of image frames in real-time using action detection ...

  9. A Comprehensive Review of Sign Language Recognition

    convey the meaning; We recognize this language as a sign language. Sign language recognition (SLR) is challenging and complex, and many research opportunities are available with Dr. M. MADHIARASAN was with the Department of Computer Science and Engineering, Indian Institute of Technology Roorkee, Roorkee, Uttarakhand, India - 247667 ...

  10. Sign Language Detection Using Tensorflow Object Detection

    For many years, research on different sign languages in the world and most common sign languages are British Sign language, American Sign Language, Indian sign language, Russian sign language, and many more. In this paper, we focus on the American Sign Language to detect various gestures using computer vision and Tensorflow object detection.

  11. Electronics

    Section 3 describes the methodology for implementing an isolated SLR system for real-time sign language detection and recognition, which involves pre-processing, feature extraction, training, and testing steps. This research paper proposes a feedback-based learning methodology using these options, based on a combination of LSTM and GRU: (1) a ...

  12. Artificial Intelligence Technologies for Sign Language

    Previous literature reviews mainly concentrate on specific sign language technologies, such as video-based and sensor-based sign language recognition [3,4,5,6,7] and sign language translation [8,9]. Lately, with the development of sign language applications, there are also reviews that presented sign language systems to facilitate hearing-impaired people in teaching and learning, as well as in ...

  13. Research of a Sign Language Translation System Based on Deep Learning

    Therefore, this paper studies hand locating and sign language recognition of common sign language based on neural network, and the main research contents include: 1. A hand locating network based on the Faster R-CNN is established to recognize the sign language video or the part of the hand in the picture, and the result of recognition is ...

  14. Sign Language Recognition Systems: A Decade Systematic ...

    From previous published papers, it has been observed that very limited amount of work has been done on the survey of the sign language recognition as shown in Table 3. Khan et al. [] reviewed sign language components and the challenges and research issues have been discussed. Kausar and Javed [] presented a survey of current research trends and the challenges faced by the researchers.

  15. Sign language identification and recognition: A comparative study

    Sign Language (SL) is the main language for handicapped and disabled people. ... Many research problems are suggested in this domain such as Sign Language Recognition ... and M. B. Bin Bahar., "A new data glove approach for malaysian sign language detection," Procedia Computer Science, vol. 76, pp. 60-67, 2015, 10.1016/j.procs.2015.12.276 ...

  16. Real-Time Sign Language Detection Using CNN

    Sign language is a system of communication using visual gestures and signs. Hearing impaired people and the deaf and dumb community use sign language as their only means of communication. Understanding sign language is so much difficult for a normal person. Therefore, the minority group has always faced many difficulties in communicating with the general population. In this research paper, we ...

  17. Sign Language Recognition System using TensorFlow Object Detection API

    double-handed gestures but they are not real-time. In this paper, we propose a method to create an Indian Sign Language dataset using a webcam and then using transfer learning, train a TensorFlow model to create a real-time Sign Language Recognition system. The system achieves a good level of accuracy even with a limited size dataset. Keywords:

  18. (PDF) Sign Language Recognition Systems: A Decade ...

    This is the first identifiable academic literature review of sign language recognition systems. It provides an academic database of literature between the duration of 2007-2017 and proposes a ...

  19. British Sign Language Detection Using Ultra-Wideband Radar Sensing and

    This study represents a significant advancement in Sign Language Detection (SLD), a crucial tool for enhancing communication and fostering inclusivity among the hearing-impaired community. It innovatively combines radar technology with deep learning techniques to develop a sophisticated, non-invasive SLD system. Traditional SLD methods often rely on cumbersome wearable devices or struggle with ...

  20. Recognition of Indian Sign Language (ISL) Using Deep Learning Model

    An efficient sign language recognition system (SLRS) can recognize the gestures of sign language to ease the communication between the signer and non-signer community. In this paper, a computer-vision based SLRS using a deep learning technique has been proposed. This study has primary three contributions: first, a large dataset of Indian sign language (ISL) has been created using 65 different ...

  21. Sign language recognition using image based hand gesture recognition

    Hence in this paper introduced software which presents a system prototype that is able to automatically recognize sign language to help deaf and dumb people to communicate more effectively with each other or normal people. Pattern recognition and Gesture recognition are the developing fields of research.

  22. Sign language recognition using artificial intelligence

    Sign language is the natural way of communication of speech and hearing-impaired people. Using Indian Sign Language (ISL) interpretation system, hearing impaired people may interact with normal people with the help of Human Computer Interaction (HCI). This paper presents a method for automatic recognition of two-handed signs of Indian Sign language (ISL). The three phases of this work include ...

  23. [2108.10970v1] Real-time Indian Sign Language (ISL) Recognition

    This paper presents a system which can recognise hand poses & gestures from the Indian Sign Language (ISL) in real-time using grid-based features. This system attempts to bridge the communication gap between the hearing and speech impaired and the rest of the society. The existing solutions either provide relatively low accuracy or do not work in real-time. This system provides good results on ...

  24. Sign language recognition

    This paper presents a novel system to aid in communicating with those having vocal and hearing disabilities. It discusses an improved method for sign language recognition and conversion of speech to signs. The algorithm devised is capable of extracting signs from video sequences under minimally cluttered and dynamic background using skin color segmentation. It distinguishes between static and ...