Imagine playing a captivating Japanese RPG but facing a language barrier that hinders your immersion. What if a device could translate the text on your screen in real time? Introducing the HDMI Babelfish: a translation device that sits between an HDMI input and an HDMI output and translates what passes through. This blog post walks you through the initial steps of this process: detecting Japanese text on the screen with Optical Character Recognition (OCR) using Tesseract, translating it with Meta's NLLB, and rendering the result back onto the screen.

A Japanese Role-Playing Game (RPG) successfully captured from a PS3. Image released under CC BY 4.0

Introduction to the Concept

The idea is to create a device that intercepts the HDMI signal from your gaming console, processes the video frames to detect and translate the text, and outputs the translated text back to your screen. This is achieved using computer vision, OCR, and language translation technologies.

Capturing the HDMI signal for processing brings its own problems. The PlayStation 3 (PS3) HDMI output, for example, is encrypted using HDCP (High-bandwidth Digital Content Protection), a technology designed to prevent the copying of digital video and audio content as it travels across connections. To process the video frames for text detection and translation, it is necessary to bypass this encryption. I was successful in achieving this using the VEDINDUST HDMI Extender 200ft HDMI Ethernet RJ45 to HDMI over Cat5e/Cat6 Cable Transmission HDMI Repeater Adapter supports EDID 1080p 3D HDCP POC.

This HDMI extender works by converting the HDMI signal into a format that can be transmitted over Ethernet cables (Cat5e/Cat6) and then back to HDMI. During this conversion process, the HDCP encryption is inadvertently stripped, allowing the signal to be transmitted without encryption. This is crucial for processing the video frames on a computer. Additionally, the extender allows the HDMI signal to be transmitted over long distances, up to 200ft, providing flexibility in setting up the device and connecting it to a computer for processing. It also supports EDID (Extended Display Identification Data), ensuring compatibility with various displays and preserving the video quality at 1080p resolution and 3D content. By using this HDMI extender, we can effectively bypass the HDCP encryption on the PS3 HDMI output, enabling us to capture and process the video frames for real-time translation.

Using Tesseract to Detect Text on an Example Video

Let's dive into Python and learn how to load a .mov file, extract a frame every 0.25 seconds, detect Japanese text using Tesseract, and print the bounding box positions and the recognised text to the console.

Part 1: Importing Necessary Libraries

We begin by importing the libraries required for this task. These include OpenCV for video processing, Tesseract for OCR, and PIL for image manipulation.

import cv2
import torch
import pytesseract
from pytesseract import Output
from PIL import Image, ImageDraw, ImageFont
import numpy as np
from collections import deque
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
import os

The cv2 library, also known as OpenCV, is a powerful tool for computer vision tasks. It provides functions to capture video, process images, and perform operations such as object detection and tracking. In this project, OpenCV is used to read video frames from the .mov file and convert them into a format suitable for further processing.

pytesseract is an interface for the Tesseract OCR engine, which is an open-source optical character recognition tool. Tesseract can read text from images and convert it into a digital string format. Here, pytesseract is used to detect and extract Japanese text from each video frame, providing bounding box information and recognized text.

The PIL library, or Python Imaging Library, allows for extensive image processing capabilities. PIL's Image, ImageDraw, and ImageFont modules are used to manipulate images, draw bounding boxes around detected text, and render the recognized text onto the image. This makes it possible to visually verify the detected text and its position within the video frame.

numpy is a fundamental package for numerical operations in Python. It provides support for arrays, matrices, and many mathematical functions to operate on these data structures. In this script, numpy is used to perform numerical operations such as calculating percentiles for bounding box coordinates, which helps in accurately determining the area covered by detected text.

deque from the collections module is a specialized list optimized for fast append and pop operations from both ends. Here, deque is used to store the history of bounding boxes detected in consecutive frames. This helps in smoothing out the bounding box calculations by averaging the results over multiple frames, thus providing a more stable and accurate detection area for text translation.

transformers and torch will be used to load the NLLB model and to run the translation engine. You will have to install both packages using pip. In our demo, we use the NLLB-200 distilled model with 600 million parameters, which you can download from Hugging Face.

Part 2: Defining Helper Functions

Next, we define two helper functions. The first function calculates the bounding box coordinates using the 10th and 90th percentiles of detected text boxes. The second function calculates the average bounding box from the history of previous detections.

def calculate_total_bounding_box_percentile(d, draw, image_width, image_height, debug = False):
    x_coords = []
    y_coords = []
    w_coords = []
    h_coords = []
    limx = 800
    limy = 80

    n_boxes = len(d['level'])
    for i in range(n_boxes):
        (x, y, w, h) = (d['left'][i], d['top'][i], d['width'][i], d['height'][i])
        text = d['text'][i]

        # Filter out empty text
        if text.strip() and (h < limy) and (w < limx):
            x_coords.append(x)
            y_coords.append(y)
            w_coords.append(w)
            h_coords.append(h)
            if debug:
                draw.rectangle([x, y, x + w, y + h], outline=(0, 255, 0), width=2)
        if text.strip() and ((h > limy) or (w > limx)):
            if debug:
                draw.rectangle([x, y, x + w, y + h], outline=(255, 0, 0), width=2)

    if not x_coords:
        return None, None, None, None

    # Sort the coordinates
    x_coords.sort()
    y_coords.sort()
    w_coords.sort()
    h_coords.sort()

    # Calculate percentiles
    total_x1 = np.percentile(x_coords, 10)
    total_y1 = np.percentile(y_coords, 10)
    total_x2 = np.percentile([x + w for x, w in zip(x_coords, w_coords)], 90)
    total_y2 = np.percentile([y + h for y, h in zip(y_coords, h_coords)], 90)

    # Add margins
    margin_x = 0.05 * image_width
    margin_y = 0.05 * image_height

    total_x1 = max(0, total_x1)
    total_y1 = max(0, total_y1)
    total_x2 = min(image_width, total_x2 + margin_x)
    total_y2 = min(image_height, total_y2 + margin_y)

    return total_x1, total_y1, total_x2, total_y2

In calculate_total_bounding_box_percentile, we collect the coordinates of detected text boxes, filter out unwanted ones based on size, and then calculate the 10th and 90th percentiles to find the bounding box that covers the main area of interest. This method helps in ignoring outliers and focusing on the central text area. We also add a margin to ensure that all relevant text is captured.

The average_bounding_boxes function takes a history of bounding boxes and calculates the average. This helps in smoothing the detection results over multiple frames, making the text detection more stable and reliable.
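
The listing for average_bounding_boxes is not shown in this post; a minimal sketch that matches how it is called in the main loop, assuming we simply skip frames where no box was found and average the remaining coordinates, could look like this:

def average_bounding_boxes(history):
    # Keep only entries where a bounding box was actually found
    valid_boxes = [box for box in history if box[0] is not None]
    if not valid_boxes:
        return None, None, None, None

    # Average each coordinate over the stored history to smooth out jitter
    avg_x1 = sum(box[0] for box in valid_boxes) / len(valid_boxes)
    avg_y1 = sum(box[1] for box in valid_boxes) / len(valid_boxes)
    avg_x2 = sum(box[2] for box in valid_boxes) / len(valid_boxes)
    avg_y2 = sum(box[3] for box in valid_boxes) / len(valid_boxes)
    return avg_x1, avg_y1, avg_x2, avg_y2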

Part 3: Initializing Video Capture

Next, we initialize video capture from a .mov file and set parameters like frame rate and extraction interval. The video_path is the path to the video file we want to process. We use the cap object, an instance of cv2.VideoCapture, to read the video. The fps variable stores the frames per second of the video, which we retrieve using OpenCV's CAP_PROP_FPS property. To ensure we process one frame every 0.25 seconds, we calculate the frame_interval by dividing the frames per second by 4. This interval determines how often we extract frames from the video for text detection and processing.

video_path = 'JapanRPG_TestSequence.mov'
cap = cv2.VideoCapture(video_path)
if not cap.isOpened():
    print("Error opening video file")
    exit()
fps = cap.get(cv2.CAP_PROP_FPS)
frame_interval = int(fps / 4) # Interval in frames to extract one image every 0.25 seconds

Part 4: Configuring Tesseract and Loading Font

We then set up Tesseract for OCR with specific configurations and load a font that supports Japanese characters. The tess_config_initial variable sets the initial Tesseract configuration to fully automatic page segmentation with the Japanese language model. For the refined configuration, tess_config_refined assumes a single uniform block of text, also using the Japanese language model. The font_path is the path to the font file that supports Japanese characters, and we use the ImageFont.truetype method to load this font.

# Frame rate of the video
fps = cap.get(cv2.CAP_PROP_FPS)
frame_interval = int(fps/4)  # Interval in frames to extract one image every 0.25 seconds

# Tesseract configuration for Japanese language
tess_config_initial = '--psm 3 -l jpn'  # Initial configuration
tess_config_refined = '--psm 6 -l jpn'  # Refined configuration for a single block of text

# Font with Japanese glyph support for rendering text
# (path and size are placeholders, adjust them to your system)
font_path = 'NotoSansCJK-Regular.ttc'
font = ImageFont.truetype(font_path, 32)

Part 5: Processing Each Frame

We proceed to loop through each frame of the video, extracting an image at the specified interval, and using Tesseract to detect text. Initially, we set min_box_width to 0 and frame_count to 0. The image_height and image_width variables are initialized with placeholder values, which will be updated based on the actual video frame dimensions. We use a deque called history to store the history of bounding boxes, with a maximum length of 10 to keep track of the last 10 bounding boxes detected.

min_box_width = 0
frame_count = 0
image_height, image_width = 640, 480
history = deque(maxlen=10) # To store the history of bounding boxes

In the main loop, we read each frame using cap.read(). If the frame is read successfully and the frame_count is a multiple of frame_interval, we process the frame. The frame is converted to a format suitable for Tesseract, and the text detection results are obtained. If it's the first frame being processed, we update the image_height, image_width, and min_box_width based on the actual frame dimensions.

while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break

    if frame_count % frame_interval == 0:
        # Extract an image every 0.25 seconds
        image = frame
        if frame_count == 0:
            image_height, image_width = image.shape[:2]
            min_box_width = 0.2 * image_width

        # Use pytesseract to get bounding boxes of text
        d = pytesseract.image_to_data(image, config=tess_config_initial, output_type=Output.DICT)

We then convert the OpenCV image to a PIL image to draw bounding boxes and text. The calculate_total_bounding_box_percentile function computes the bounding box coordinates using percentiles, and these coordinates are appended to the history deque. The average_bounding_boxes function calculates the average bounding box from the history to ensure stable detection results.

# Convert OpenCV image to PIL image
image_pil = Image.fromarray(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
draw = ImageDraw.Draw(image_pil)

# Compute the percentile bounding box for this frame
total_x1, total_y1, total_x2, total_y2 = calculate_total_bounding_box_percentile(d, draw, image_width, image_height)
# Append the current bounding box to the history and average over it
history.append((total_x1, total_y1, total_x2, total_y2))
avg_x1, avg_y1, avg_x2, avg_y2 = average_bounding_boxes(history)

If the computed bounding box is valid, we extract the region of interest (ROI) and perform a second pass of refined OCR on this region. The detected text is then drawn onto the image, and the refined text is printed to the console.

if total_x1 is not None and total_x2 is not None and total_y1 is not None and total_y2 is not None:
    if total_x1 < total_x2 and total_y1 < total_y2 and (total_x2 - total_x1 > min_box_width):
        roi = image[int(total_y1):int(total_y2), int(total_x1):int(total_x2)]
        refined_text = pytesseract.image_to_string(roi, config=tess_config_refined)

Finally, we convert the PIL image back to OpenCV format and display the processed frame with bounding boxes and detected text. The loop continues until all frames are processed or a maximum frame count is reached.

draw.rectangle([total_x1, total_y1, total_x2, total_y2], outline=(0, 0, 255), width=2)
draw.text((total_x1, total_y1 - 10), refined_text, font=font, fill=(0, 255, 0))
print(f"Refined Text: {refined_text}")
image = cv2.cvtColor(np.array(image_pil), cv2.COLOR_RGB2BGR)
cv2.imshow('Frame', image)
cv2.waitKey(int(frame_interval / fps * 10))

frame_count += 1

if frame_count > 2000:
    break
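
Once the loop has finished, the video capture and the display window should be released. This cleanup is not shown in the snippets above, but follows standard OpenCV usage:

# Release the video file and close the preview window
cap.release()
cv2.destroyAllWindows()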

By following these steps, we can detect and extract Japanese text from video frames, laying the groundwork for a real-time HDMI translation device that bridges the language gap in Japanese RPGs. The results of this on an example video are shown here:

Online on-screen text detection using Tesseract on an example video sequence. The detection result is shown in blue. Image released under CC BY 4.0

Part 6: NLLB Integration

To create a translation, we need to load the NLLB model before going into the processing loop. We therefore add the following lines before the loop to load the tokenizer and the model pipeline.

# Load the NLLB-200 distilled model (Hugging Face model id, or a local path to a downloaded copy)
tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")
translator = pipeline('translation', model=model, tokenizer=tokenizer,
                      src_lang='jpn_Jpan', tgt_lang='eng_Latn', max_length=200)

This translator is then called in our processing loop, right after the text has been recognised in the frame:

# Translate and print
translation = translator(refined_text)[0]
trans_text = add_newline_after_center_word(translation['translation_text'])
print(f"Translated Text: {trans_text}")

Here, we use an additional auxiliary function that detects the center word and adds a new line to be able to fit the text on the screen.
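
The helper itself is not listed in this post; a minimal sketch that splits the translated string after its middle word could look like this:

def add_newline_after_center_word(text):
    # Insert a line break after the middle word so the text fits the box
    words = text.split()
    if len(words) < 2:
        return text
    center = len(words) // 2
    return ' '.join(words[:center]) + '\n' + ' '.join(words[center:])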

Part 7: Rendering the translated string

First, we compute the average colour of the detected ROI and overwrite every pixel in this area with that uniform value using OpenCV:

# Compute the average color in the bounding box
avg_color = cv2.mean(image[int(total_y1):int(total_y2), int(total_x1):int(total_x2)])[:3]
avg_color = tuple(map(int, avg_color))

# Set the entire bounding box to the average color
image[int(total_y1):int(total_y2), int(total_x1):int(total_x2)] = avg_color

Now, we use PIL Image to render the text into the image:

# Convert to PIL Image
image_pil = Image.fromarray(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
draw = ImageDraw.Draw(image_pil)

# Draw the translated text
draw.text((total_x1, total_y1), trans_text, font=font, fill=(255, 255, 255))

# Convert back to OpenCV format
image = cv2.cvtColor(np.array(image_pil), cv2.COLOR_RGB2BGR)

This then allows us to render the translation in white on the screen. Finally, the result of our tech demo looks like this:

Online on-screen translation using NLLB on an example video sequence. The detection result is shown in blue. Image released under CC BY 4.0

The complete source code for this tutorial is also published on GitHub.

Things to do to continue this project

Obviously, the detection is not very stable. We already use two passes and temporal averaging to prevent jumping of the bounding boxes. Yet, outliers still cause considerable variation. Furthermore, we feed the OCR output directly to the NLLB model. Here, too, temporal averaging would help tremendously. Techniques for forced alignment of strings and voting would further improve the translation stability; the good old ROVER might be a good choice for this. Finally, the rendering of the text could be optimised to fully fit the bounding box and create a nicer visualisation.
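
As a crude first step in that direction (not a ROVER implementation, just an illustration of the idea), one could keep a short history of recognised strings and only pass the most frequent one on to the translator:

from collections import Counter, deque

ocr_history = deque(maxlen=5)  # last few recognised strings

def vote_on_text(new_text):
    # Majority vote over recent OCR results to suppress single-frame glitches
    ocr_history.append(new_text.strip())
    most_common_text, _ = Counter(ocr_history).most_common(1)[0]
    return most_common_text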

Looking further into the future, we aim to explore the use of embedded devices such as the NVIDIA Jetson Nano. These powerful, compact devices are capable of handling complex computational tasks, including computer vision and machine learning, in real-time. By leveraging the Jetson Nano, we could potentially process and translate text from a live HDMI capture stream, rather than from a pre-recorded video. This would enable real-time translation and rendering, providing a truly immersive and interactive gaming experience without the language barrier. The goal is to have a seamless, all-in-one solution that can perform text detection, translation, and rendering in real-time, enhancing accessibility and enjoyment for gamers worldwide.

If you liked this blog post, I recommend having a look at our free deep learning resources or my YouTube Channel.

Text and images of this article are licensed under Creative Commons License 4.0 Attribution. Feel free to reuse and share any part of this work.