In this blog we will build a sign language detection model based on convolutional neural networks (CNNs). If you want to read more about CNNs, read this blog.
To build an SLR (Sign Language Recognition) system we need three things:
Training a deep neural network usually requires a powerful GPU. This project does not need one, but it is still convenient to use a free online platform such as Google Colab.
1) Dataset
We will use the MNIST sign language dataset. You can download it here.
The dataset has 24 ASL alphabets (J and Z are excluded because they require motion). Each image is 28×28 pixels, so a flattened image is a row of 784 pixel values.
More precisely, each image is a tensor of shape $(28, 28, 1)$. The last dimension is the channel count. Since the images are grayscale, there is only one channel. For RGB it would be $(28, 28, 3)$.
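A quick NumPy check of these shapes (using blank arrays just for illustration):

```python
import numpy as np

gray = np.zeros((28, 28, 1))  # one grayscale image: height, width, channels
rgb = np.zeros((28, 28, 3))   # the same image if it were RGB

print(gray.shape)  # (28, 28, 1)
print(gray.size)   # 784 pixel values in total
```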
Loading the dataset
import pandas as pd
import numpy as np
X_train = pd.read_csv('sign_mnist_train.csv')
X_test = pd.read_csv('sign_mnist_test.csv')
y_train = X_train['label']
y_test = X_test['label']
X_train = X_train.drop('label', axis=1)
X_test = X_test.drop('label', axis=1)

Preprocessing
The CSV stores each image as a flat row of 784 values. We reshape it back to $(28, 28)$ and stack all images into a 4D tensor: (num_samples, height, width, channels).
X_train = np.array(X_train.iloc[:,:])
X_train = np.array([np.reshape(i, (28,28)) for i in X_train])
X_test = np.array(X_test.iloc[:,:])
X_test = np.array([np.reshape(i, (28,28)) for i in X_test])
num_classes = 26
y_train = np.array(y_train).reshape(-1)
y_test = np.array(y_test).reshape(-1)
y_train = np.eye(num_classes)[y_train]
y_test = np.eye(num_classes)[y_test]
X_train = X_train.reshape((-1, 28, 28, 1))  # -1 infers the sample count (27455 train, 7172 test)
X_test = X_test.reshape((-1, 28, 28, 1))
The np.eye(num_classes)[y_train] line converts integer labels into one-hot vectors: label 0 becomes $[1, 0, 0, \ldots, 0]$, label 1 becomes $[0, 1, 0, \ldots, 0]$, and so on.
We need this because the final layer outputs a probability distribution over the classes.
We keep num_classes = 26 so that the label indices line up with the alphabet (0 = A, 1 = B, …), even though the signs for J and Z never occur in the data.
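A small demonstration (with made-up labels) of how indexing the identity matrix produces one-hot rows:

```python
import numpy as np

num_classes = 26
labels = np.array([0, 2, 5])  # e.g. A, C, F

# Indexing np.eye with the labels picks out one row per label,
# i.e. a one-hot vector per sample.
one_hot = np.eye(num_classes)[labels]

print(one_hot.shape)     # (3, 26)
print(one_hot[0][:6])    # [1. 0. 0. 0. 0. 0.]
```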
2) Build and Train the Model
We will use a CNN. If you are not familiar, I recommend Andrew Ng's CNN course on Coursera or my own blog here.
The convolution operation
A convolutional layer slides a small filter (kernel) across the input and computes a dot product at each position. For a 2D input $I$ and kernel $K$ of size $k \times k$, the output feature map $F$ at position $(i, j)$ is:

$$F(i, j) = \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} I(i+m, \, j+n) \, K(m, n)$$
One filter produces one feature map. With 8 filters you get 8 feature maps stacked as output.
The spatial size of the output after a convolution depends on input size $W$, kernel size $k$, padding $p$, and stride $s$:

$$W_{\text{out}} = \left\lfloor \frac{W - k + 2p}{s} \right\rfloor + 1$$
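As a sketch of the convolution operation, here is a naive loop-based version in NumPy (the helper name conv2d_valid is ours, not from any library; real frameworks use much faster implementations):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Naive 2D convolution with no padding and stride 1, for illustration."""
    k = kernel.shape[0]
    out_size = image.shape[0] - k + 1  # (W - k + 2*0)/1 + 1
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            # Dot product of the kernel with the patch under it.
            out[i, j] = np.sum(image[i:i + k, j:j + k] * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.ones((3, 3))  # a simple summing filter
feature_map = conv2d_valid(image, kernel)
print(feature_map.shape)  # (2, 2): (4 - 3)/1 + 1 = 2
```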
ReLU activation
After each convolution we apply ReLU, which zeroes out all negative values:

$$\text{ReLU}(x) = \max(0, x)$$

Without a non-linearity, stacking layers collapses to a single linear transformation. ReLU is also cheap to compute and mitigates the vanishing-gradient problem that sigmoid suffers from.
MaxPooling
MaxPooling reduces spatial dimensions by taking the max from each non-overlapping window. With pool size $(2, 2)$, a $28 \times 28$ feature map becomes $14 \times 14$, and each output value is the maximum of a $2 \times 2$ block.
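A minimal NumPy sketch of 2×2 max pooling (the helper max_pool_2x2 is ours for illustration):

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2, for a 2D input with even sides."""
    h, w = x.shape
    # Split into non-overlapping 2x2 blocks, then take the max of each block.
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.array([[1, 3, 2, 0],
              [4, 2, 1, 5],
              [7, 8, 0, 1],
              [2, 6, 3, 4]], dtype=float)

print(max_pool_2x2(x))
# [[4. 5.]
#  [8. 4.]]
```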
The full model
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Dense, Dropout, Flatten
classifier = Sequential()
classifier.add(Conv2D(filters=8, kernel_size=(3,3), strides=(1,1), padding='same',
input_shape=(28,28,1), activation='relu', data_format='channels_last'))
classifier.add(MaxPooling2D(pool_size=(2,2)))
classifier.add(Conv2D(filters=16, kernel_size=(3,3), strides=(1,1), padding='same', activation='relu'))
classifier.add(Dropout(0.5))
classifier.add(MaxPooling2D(pool_size=(4,4)))
classifier.add(Flatten())
classifier.add(Dense(128, activation='relu'))
classifier.add(Dense(26, activation='softmax'))

The first Conv2D takes input shape $(28, 28, 1)$. The last Dense layer outputs probabilities for the 26 letter classes. Dropout with $p=0.5$ randomly zeroes 50% of activations during training to prevent overfitting.
Softmax output
The final layer uses softmax. Given raw scores $z \in \mathbb{R}^{26}$, it converts them to probabilities:

$$\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{26} e^{z_j}}$$
All 26 outputs sum to 1. The predicted class is $\hat{y} = \arg\max_i \, \text{softmax}(z_i)$.
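The softmax and argmax steps can be sketched in NumPy (using a 3-class toy vector rather than the full 26):

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; this does not change the result.
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])
p = softmax(z)
print(p.sum())       # 1.0
print(np.argmax(p))  # 0 -> the predicted class
```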
Loss function
We use categorical cross-entropy. For one sample with true label $y$ (one-hot) and predicted probabilities $\hat{p}$:

$$L(y, \hat{p}) = -\sum_{i=1}^{26} y_i \log \hat{p}_i$$
Since $y$ is one-hot, only the term for the correct class survives. It penalises the model when it assigns low probability to the right class.
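A 3-class toy example of the loss (the helper cross_entropy is ours for illustration), showing that a confident correct prediction is penalised far less than an unsure one:

```python
import numpy as np

def cross_entropy(y_onehot, p_hat, eps=1e-12):
    # eps guards against log(0); only the true class's term is non-zero.
    return -np.sum(y_onehot * np.log(p_hat + eps))

y = np.array([0.0, 1.0, 0.0])           # true class is index 1
confident = np.array([0.05, 0.90, 0.05])
unsure = np.array([0.40, 0.20, 0.40])

print(cross_entropy(y, confident))      # -log(0.9), about 0.105
print(cross_entropy(y, unsure))         # -log(0.2), about 1.609
```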
Training
classifier.compile(optimizer='SGD', loss='categorical_crossentropy', metrics=['accuracy'])
classifier.fit(X_train, y_train, epochs=50, batch_size=100)

We use SGD (Stochastic Gradient Descent). At each step it computes the gradient of the loss on a mini-batch and moves the weights a small step in the opposite direction:

$$w \leftarrow w - \eta \, \nabla_w L$$

where $\eta$ is the learning rate.
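As a toy illustration of gradient descent, here is the update $w \leftarrow w - \eta \, \nabla_w L$ applied to a one-parameter loss $L(w) = (w - 3)^2$ rather than the network's loss:

```python
# Minimise L(w) = (w - 3)^2 with plain gradient descent.
# The gradient is dL/dw = 2 * (w - 3).
w = 0.0
eta = 0.1  # learning rate

for _ in range(100):
    grad = 2 * (w - 3)
    w = w - eta * grad  # step against the gradient

print(round(w, 4))  # converges towards the minimum at w = 3
```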
Evaluate and save
accuracy = classifier.evaluate(x=X_test, y=y_test, batch_size=32)
print("Accuracy: ", accuracy[1])
classifier.save('CNNmodel.h5')
# Assumes a PyDrive GoogleDrive client named `drive`, authenticated in Colab.
weights_file = drive.CreateFile({'title': 'CNNmodel.h5'})
weights_file.SetContentFile('CNNmodel.h5')
weights_file.Upload()

This uploads the trained model to Google Drive so it can be downloaded and used locally with OpenCV.
3) OpenCV
Capturing webcam input
We capture a frame from the webcam, crop the region of interest, convert to grayscale, blur to reduce noise, and resize to 28×28 pixels. This matches the format of the training images.
import cv2
import numpy as np
def crop_image(image, x, y, width, height):
    # Minimal crop helper (the original post did not show its definition).
    return image[y:y + height, x:x + width]

def main():
    cam_capture = cv2.VideoCapture(0)
    while True:
        _, image_frame = cam_capture.read()
        im2 = crop_image(image_frame, 300, 300, 300, 300)
        image_grayscale = cv2.cvtColor(im2, cv2.COLOR_BGR2GRAY)
        image_grayscale_blurred = cv2.GaussianBlur(image_grayscale, (15, 15), 0)
        im3 = cv2.resize(image_grayscale_blurred, (28, 28), interpolation=cv2.INTER_AREA)
        im4 = im3.reshape(28, 28, 1)
        im5 = np.expand_dims(im4, axis=0)  # shape (1, 28, 28, 1) for the model
        cv2.imshow("ROI", image_grayscale_blurred)
        if cv2.waitKey(1) & 0xFF == ord('q'):  # press q to quit
            break
    cam_capture.release()
    cv2.destroyAllWindows()

Prediction
The model outputs a softmax vector of shape $(1, 26)$. We take the index of the maximum value. Labels are integers: 0 for A, 1 for B, 2 for C, and so on.
def keras_predict(model, image):
    data = np.asarray(image, dtype="int32")
    pred_probab = model.predict(data)[0]
    pred_class = int(np.argmax(pred_probab))  # index of the highest probability
    return max(pred_probab), pred_class

The confidence score is $\max_i \, \hat{p}_i$. Below 0.6 usually means bad lighting or a cluttered background. The model reaches 94% accuracy on the test set and works well in real time with a plain background.
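To display a result, the predicted index must be mapped back to a letter. A small helper (ours, not part of the original code) does this; label indices follow the alphabet, and J and Z are never predicted in practice because they are absent from the dataset:

```python
import string

def index_to_letter(pred_class):
    # 0 -> 'A', 1 -> 'B', ..., 25 -> 'Z'
    return string.ascii_uppercase[pred_class]

print(index_to_letter(0))  # A
print(index_to_letter(2))  # C
```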
Full project on GitHub: github.com/Arshad221b/Sign-Language-Recognition-
Questions or suggestions? me@arshad-kazi.com