In this blog we will build a sign language detection model based on convolutional neural networks (CNNs). If you want to read more about CNNs, read this blog.
To build an SLR (Sign Language Recognition) system we need three things:
Training a deep neural network usually requires a powerful GPU. This project does not need one, but it is still convenient to use a free online platform such as Google Colab.
1) Dataset
We will use the MNIST sign language dataset. You can download it here.
The dataset has 24 ASL alphabets (J and Z are excluded because they require motion). Each image is 28×28 pixels, so a flattened image is a row of 784 pixel values.
More precisely, each image is a tensor of shape $(28, 28, 1)$. The last dimension is the channel count. Since the images are grayscale, there is only one channel. For RGB it would be $(28, 28, 3)$.
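A quick NumPy check of these shapes (using blank arrays just for illustration):

```python
import numpy as np

gray = np.zeros((28, 28, 1))  # one grayscale image: height, width, channels
rgb = np.zeros((28, 28, 3))   # the same image if it were RGB

print(gray.shape)  # (28, 28, 1)
print(gray.size)   # 784 pixel values in total
```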
Loading the dataset
import pandas as pd
import numpy as np
X_train = pd.read_csv('sign_mnist_train.csv')
X_test = pd.read_csv('sign_mnist_test.csv')
y_train = X_train['label']
y_test = X_test['label']
X_train = X_train.drop('label', axis=1)
X_test = X_test.drop('label', axis=1)

Preprocessing
The CSV stores each image as a flat row of 784 values. We reshape it back to $(28, 28)$ and stack all images into a 4D tensor: (num_samples, height, width, channels).
X_train = np.array(X_train.iloc[:,:])
X_train = np.array([np.reshape(i, (28,28)) for i in X_train])
X_test = np.array(X_test.iloc[:,:])
X_test = np.array([np.reshape(i, (28,28)) for i in X_test])
num_classes = 26
y_train = np.array(y_train).reshape(-1)
y_test = np.array(y_test).reshape(-1)
y_train = np.eye(num_classes)[y_train]
y_test = np.eye(num_classes)[y_test]
X_train = X_train.reshape((-1, 28, 28, 1))  # -1 infers the sample count (27455 train, 7172 test)
X_test = X_test.reshape((-1, 28, 28, 1))
The np.eye(num_classes)[y_train] line converts integer labels into one-hot vectors: label 0 becomes $[1, 0, 0, \ldots, 0]$, label 1 becomes $[0, 1, 0, \ldots, 0]$, and so on.
We need this because the final layer outputs a probability distribution over the classes.
We keep num_classes = 26 so that the label indices line up with the alphabet (0 = A, 1 = B, …), even though the signs for J and Z never occur in the data.
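A small demonstration (with made-up labels) of how indexing the identity matrix produces one-hot rows:

```python
import numpy as np

num_classes = 26
labels = np.array([0, 2, 5])  # e.g. A, C, F

# Indexing np.eye with the labels picks out one row per label,
# i.e. a one-hot vector per sample.
one_hot = np.eye(num_classes)[labels]

print(one_hot.shape)     # (3, 26)
print(one_hot[0][:6])    # [1. 0. 0. 0. 0. 0.]
```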
2) Build and Train the Model
We will use a CNN. If you are not familiar, I recommend Andrew Ng's CNN course on Coursera or my own blog here.
The convolution operation
A convolutional layer slides a small filter (kernel) across the input and computes a dot product at each position. For a 2D input $I$ and kernel $K$ of size $k \times k$, the output feature map $F$ at position $(i, j)$ is:

$$F(i, j) = \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} I(i+m, \, j+n) \, K(m, n)$$
One filter produces one feature map. With 8 filters you get 8 feature maps stacked as output.
The spatial size of the output after a convolution depends on input size $W$, kernel size $k$, padding $p$, and stride $s$:

$$W_{\text{out}} = \left\lfloor \frac{W - k + 2p}{s} \right\rfloor + 1$$
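As a sketch of the convolution operation, here is a naive loop-based version in NumPy (the helper name conv2d_valid is ours, not from any library; real frameworks use much faster implementations):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Naive 2D convolution with no padding and stride 1, for illustration."""
    k = kernel.shape[0]
    out_size = image.shape[0] - k + 1  # (W - k + 2*0)/1 + 1
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            # Dot product of the kernel with the patch under it.
            out[i, j] = np.sum(image[i:i + k, j:j + k] * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.ones((3, 3))  # a simple summing filter
feature_map = conv2d_valid(image, kernel)
print(feature_map.shape)  # (2, 2): (4 - 3)/1 + 1 = 2
```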
ReLU activation
After each convolution we apply ReLU, which zeroes out all negative values:

$$\text{ReLU}(x) = \max(0, x)$$

Without a non-linearity, stacking layers collapses to a single linear transformation. ReLU is also cheap to compute and mitigates the vanishing-gradient problem that sigmoid suffers from.
MaxPooling
MaxPooling reduces spatial dimensions by taking the max from each non-overlapping window. With pool size $(2, 2)$, a $28 \times 28$ feature map becomes $14 \times 14$, and each output value is the maximum of a $2 \times 2$ block.
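A minimal NumPy sketch of 2×2 max pooling (the helper max_pool_2x2 is ours for illustration):

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2, for a 2D input with even sides."""
    h, w = x.shape
    # Split into non-overlapping 2x2 blocks, then take the max of each block.
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.array([[1, 3, 2, 0],
              [4, 2, 1, 5],
              [7, 8, 0, 1],
              [2, 6, 3, 4]], dtype=float)

print(max_pool_2x2(x))
# [[4. 5.]
#  [8. 4.]]
```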
The full model
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Dense, Dropout, Flatten
classifier = Sequential()
classifier.add(Conv2D(filters=8, kernel_size=(3,3), strides=(1,1), padding='same',
input_shape=(28,28,1), activation='relu', data_format='channels_last'))
classifier.add(MaxPooling2D(pool_size=(2,2)))
classifier.add(Conv2D(filters=16, kernel_size=(3,3), strides=(1,1), padding='same', activation='relu'))
classifier.add(Dropout(0.5))
classifier.add(MaxPooling2D(pool_size=(4,4)))
classifier.add(Flatten())
classifier.add(Dense(128, activation='relu'))
classifier.add(Dense(26, activation='softmax'))

The first Conv2D takes input shape $(28, 28, 1)$. The last Dense layer outputs probabilities for the 26 letter classes. Dropout with $p=0.5$ randomly zeroes 50% of activations during training to prevent overfitting.
Softmax output
The final layer uses softmax. Given raw scores $z \in \mathbb{R}^{26}$, it converts them to probabilities:

$$\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{26} e^{z_j}}$$
All 26 outputs sum to 1. The predicted class is $\hat{y} = \arg\max_i \, \text{softmax}(z_i)$.
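The softmax and argmax steps can be sketched in NumPy (using a 3-class toy vector rather than the full 26):

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; this does not change the result.
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])
p = softmax(z)
print(p.sum())       # 1.0
print(np.argmax(p))  # 0 -> the predicted class
```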
Loss function
We use categorical cross-entropy. For one sample with true label $y$ (one-hot) and predicted probabilities $\hat{p}$:

$$L(y, \hat{p}) = -\sum_{i=1}^{26} y_i \log \hat{p}_i$$
Since $y$ is one-hot, only the term for the correct class survives. It penalises the model when it assigns low probability to the right class.
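A 3-class toy example of the loss (the helper cross_entropy is ours for illustration), showing that a confident correct prediction is penalised far less than an unsure one:

```python
import numpy as np

def cross_entropy(y_onehot, p_hat, eps=1e-12):
    # eps guards against log(0); only the true class's term is non-zero.
    return -np.sum(y_onehot * np.log(p_hat + eps))

y = np.array([0.0, 1.0, 0.0])           # true class is index 1
confident = np.array([0.05, 0.90, 0.05])
unsure = np.array([0.40, 0.20, 0.40])

print(cross_entropy(y, confident))      # -log(0.9), about 0.105
print(cross_entropy(y, unsure))         # -log(0.2), about 1.609
```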
Training
classifier.compile(optimizer='SGD', loss='categorical_crossentropy', metrics=['accuracy'])
classifier.fit(X_train, y_train, epochs=50, batch_size=100)

We use SGD (Stochastic Gradient Descent). At each step it computes the gradient of the loss on a mini-batch and moves the weights a small step in the opposite direction:

$$w \leftarrow w - \eta \, \nabla_w L$$

where $\eta$ is the learning rate.
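As a toy illustration of gradient descent, here is the update $w \leftarrow w - \eta \, \nabla_w L$ applied to a one-parameter loss $L(w) = (w - 3)^2$ rather than the network's loss:

```python
# Minimise L(w) = (w - 3)^2 with plain gradient descent.
# The gradient is dL/dw = 2 * (w - 3).
w = 0.0
eta = 0.1  # learning rate

for _ in range(100):
    grad = 2 * (w - 3)
    w = w - eta * grad  # step against the gradient

print(round(w, 4))  # converges towards the minimum at w = 3
```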
Evaluate and save
accuracy = classifier.evaluate(x=X_test, y=y_test, batch_size=32)
print("Accuracy: ", accuracy[1])
classifier.save('CNNmodel.h5')
# Assumes a PyDrive GoogleDrive client named `drive`, authenticated in Colab.
weights_file = drive.CreateFile({'title': 'CNNmodel.h5'})
weights_file.SetContentFile('CNNmodel.h5')
weights_file.Upload()

This uploads the trained model to Google Drive so it can be downloaded and used locally with OpenCV.
3) OpenCV
Capturing webcam input
We capture a frame from the webcam, crop the region of interest, convert to grayscale, blur to reduce noise, and resize to 28×28 pixels. This matches the format of the training images.
import cv2
import numpy as np
def crop_image(image, x, y, width, height):
    # Minimal crop helper (the original post did not show its definition).
    return image[y:y + height, x:x + width]

def main():
    cam_capture = cv2.VideoCapture(0)
    while True:
        _, image_frame = cam_capture.read()
        im2 = crop_image(image_frame, 300, 300, 300, 300)
        image_grayscale = cv2.cvtColor(im2, cv2.COLOR_BGR2GRAY)
        image_grayscale_blurred = cv2.GaussianBlur(image_grayscale, (15, 15), 0)
        im3 = cv2.resize(image_grayscale_blurred, (28, 28), interpolation=cv2.INTER_AREA)
        im4 = im3.reshape(28, 28, 1)
        im5 = np.expand_dims(im4, axis=0)  # shape (1, 28, 28, 1) for the model
        cv2.imshow("ROI", image_grayscale_blurred)
        if cv2.waitKey(1) & 0xFF == ord('q'):  # press q to quit
            break
    cam_capture.release()
    cv2.destroyAllWindows()

Prediction
The model outputs a softmax vector of shape $(1, 26)$. We take the index of the maximum value. Labels are integers: 0 for A, 1 for B, 2 for C, and so on.
def keras_predict(model, image):
    data = np.asarray(image, dtype="int32")
    pred_probab = model.predict(data)[0]
    pred_class = int(np.argmax(pred_probab))  # index of the highest probability
    return max(pred_probab), pred_class

The confidence score is $\max_i \, \hat{p}_i$. Below 0.6 usually means bad lighting or a cluttered background. The model reaches 94% accuracy on the test set and works well in real time with a plain background.
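To display a result, the predicted index must be mapped back to a letter. A small helper (ours, not part of the original code) does this; label indices follow the alphabet, and J and Z are never predicted in practice because they are absent from the dataset:

```python
import string

def index_to_letter(pred_class):
    # 0 -> 'A', 1 -> 'B', ..., 25 -> 'Z'
    return string.ascii_uppercase[pred_class]

print(index_to_letter(0))  # A
print(index_to_letter(2))  # C
```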
Full project on GitHub: github.com/Arshad221b/Sign-Language-Recognition-
Questions or suggestions? me@arshad-kazi.com