OCR is a valuable tool when you want to extract text from images. Sometimes, however, the OCR engine you are using does not perform as well as you need it to for your specific use case. If you are facing such an issue, fine-tuning your OCR engine is the way to go. In this tutorial, I will show you how to fine-tune EasyOCR, a free, open-source OCR engine that you can use with Python.
Overview
- Prerequisites
- Installing required packages
- Cloning required Git repository
- Generating dataset
- Convert the dataset to lmdb format
- Retrieve a pre-trained OCR model
- Run the fine-tuning
- Running inference with your fine-tuned model
- A qualitative test of performance
- Quantitative test of performance
- Conclusion
Prerequisites
- Basic Python knowledge
- Basic knowledge of how to use the terminal
Installing required packages
First off, let's install the required pip packages. I recommend making a virtual environment for this, though it is not required; for example, you can create and activate one like this:
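python -m venv venv
venv\Scripts\activate
(These are Windows commands, matching the Windows-style paths used later in this tutorial; on Linux/macOS, activate with source venv/bin/activate.) Then run the pip commands below one line at a time: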
pip install fire
pip install lmdb
pip install opencv-python
pip install natsort
pip install nltk
You also need to install PyTorch from this website (choose your specifications and copy the pip install command; the command I used for my specifications is shown below). Preferably choose the GPU version, but the CPU version will work fine as well; the difference is that the fine-tuning will run more slowly on the CPU.
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
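After installing, you can quickly verify that PyTorch works and whether a GPU is visible (if the last line prints False, training will simply run on the CPU):
import torch

# Prints the installed PyTorch version and whether a CUDA GPU is available.
print(torch.__version__)
print(torch.cuda.is_available())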
Cloning required Git repository
Next, you need a Git repository that will help you run the fine-tuning. Clone it with the command below:
git clone https://github.com/clovaai/deep-text-recognition-benchmark
The deep-text-recognition-benchmark GitHub repo gives us some useful files for fine-tuning the EasyOCR model. Note that many of the terminal commands used in this article are taken from that repository and adapted to my needs, so the repository is worth a read.
I would also like to note that clovaai on GitHub has a lot of good repositories that have been of immense help to me, so feel free to check out their other work. Another very interesting repo of theirs is the Donut model repo, and I have written an article on fine-tuning the Donut model that is worth checking out as well.
Generating dataset
Before you can fine-tune your OCR, you need a dataset to fine-tune on. You can either download a dataset or make one yourself. Since I want my OCR to be particularly good at scanning supermarket receipts, I will make a dataset of items you can find in a supermarket, but feel free to build a dataset from whatever data you need your OCR to be good at. For this section, I used this GitHub page to help me.
The simplest approach: use my dummy dataset
If you want to have this step as simple as possible (recommended if you are just testing), you can download a dummy dataset I have made and uploaded to Google Drive here (download the whole folder).
Download a dataset
If you want a larger dataset, you can download one from this Dropbox page by downloading the data_lmdb_release.zip file (note that it is a bit over 18 GB in size).
Make your own dataset
If you want the cooler approach of creating your own dataset, you can follow this tutorial on generating a dataset for OCR fine-tuning.
Convert the dataset to lmdb format
Lmdb stands for Lightning Memory-Mapped Database Manager and is essentially a fast key-value storage format you can use to store your dataset for training AI models. You can read more about it in the lmdb docs here. After you have made your dataset, you should have a folder containing your images, plus the labels for all the images (the text in each image) in a labels.txt file. Your folder should look like the image below, and it should be inside the deep-text-recognition-benchmark folder:

NOTE: Make sure to have at least 10 images in your folder; if you have fewer, you may get an error when running the training script later in the tutorial.
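Before converting, you can sanity-check your folder with a small script. This is a minimal sketch, assuming your data folder is called output and that each line in labels.txt starts with the image file name ending in .png, followed by the label (the same parsing convention the modified create_lmdb_dataset.py below relies on):
import os

data_dir = 'output'  # assumption: change this to your data folder name
labels_path = os.path.join(data_dir, 'labels.txt')

with open(labels_path, 'r') as f:
    for line in f:
        line = line.strip('\n')
        if not line:
            continue
        if '.png' not in line:
            print('Bad line:', line)
            continue
        # Everything up to and including '.png' is the image name,
        # the rest of the line is the label.
        name, label = line.split('.png', 1)
        name += '.png'
        if not os.path.exists(os.path.join(data_dir, name)):
            print('Missing image:', name)
        if not label.strip():
            print('Empty label for:', name)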
You then have to make some changes in the create_lmdb_dataset.py file in the deep-text-recognition-benchmark folder:
- I had to lower the map_size variable since I was getting a disk memory error. map_size is the maximum size the database is allowed to grow to, and the default of 1099511627776 bytes (1 TB) was more than my disk could handle, so I set map_size to 1073741824 (1 GB). You can see the line I changed below:
# OLD LINE
# ...
env = lmdb.open(outputPath, map_size=1099511627776)
# ...
# NEW LINE
# ...
env = lmdb.open(outputPath, map_size=1073741824)
# ...
- I also got an error with the UTF-8 encoding, so I removed the encoding argument when opening the gtFile. The new line then looks like this:
# OLD LINE
# ...
with open(gtFile, 'r', encoding='utf-8') as data:
# ...
# NEW LINE
# ...
with open(gtFile, 'r') as data:
# ...
- Lastly, I also had to change the way imagePath was read:
# OLD LINE
# ...
imagePath, label = datalist[i].strip('\n').split('\t')
# ...
# NEW LINES
# ...
imagePath, label = datalist[i].strip('\n').split('.png')
imagePath += '.png'
# ...
My full create_lmdb_dataset.py file then looks like this (code from this Git repo, with the changes above applied):
import fire
import os
import lmdb
import cv2
import numpy as np


def checkImageIsValid(imageBin):
    if imageBin is None:
        return False
    imageBuf = np.frombuffer(imageBin, dtype=np.uint8)
    img = cv2.imdecode(imageBuf, cv2.IMREAD_GRAYSCALE)
    imgH, imgW = img.shape[0], img.shape[1]
    if imgH * imgW == 0:
        return False
    return True


def writeCache(env, cache):
    with env.begin(write=True) as txn:
        for k, v in cache.items():
            txn.put(k, v)


def createDataset(inputPath, gtFile, outputPath, checkValid=True):
    """
    Create LMDB dataset for training and evaluation.
    ARGS:
        inputPath  : input folder path where starts imagePath
        outputPath : LMDB output path
        gtFile     : list of image path and label
        checkValid : if true, check the validity of every image
    """
    os.makedirs(outputPath, exist_ok=True)
    env = lmdb.open(outputPath, map_size=1073741824)  # TODO Changed map size
    cache = {}
    cnt = 1

    with open(gtFile, 'r') as data:  # TODO removed utf-8 encoding here since I have norwegian letters
        datalist = data.readlines()

    nSamples = len(datalist)
    print(nSamples)
    for i in range(nSamples):
        # TODO changed the way imagePath is found as well to match my usecase
        imagePath, label = datalist[i].strip('\n').split('.png')
        imagePath += '.png'
        # imagePath, label = datalist[i].strip('\n').split('\t')
        imagePath = os.path.join(inputPath, imagePath)

        # # only use alphanumeric data
        # if re.search('[^a-zA-Z0-9]', label):
        #     continue

        if not os.path.exists(imagePath):
            print('%s does not exist' % imagePath)
            continue
        with open(imagePath, 'rb') as f:
            imageBin = f.read()
        if checkValid:
            try:
                if not checkImageIsValid(imageBin):
                    print('%s is not a valid image' % imagePath)
                    continue
            except:
                print('error occured', i)
                with open(outputPath + '/error_image_log.txt', 'a') as log:
                    log.write('%s-th image data occured error\n' % str(i))
                continue

        imageKey = 'image-%09d'.encode() % cnt
        labelKey = 'label-%09d'.encode() % cnt
        cache[imageKey] = imageBin
        cache[labelKey] = label.encode()

        if cnt % 1000 == 0:
            writeCache(env, cache)
            cache = {}
            print('Written %d / %d' % (cnt, nSamples))
        cnt += 1
    nSamples = cnt - 1
    cache['num-samples'.encode()] = str(nSamples).encode()
    writeCache(env, cache)
    print('Created dataset with %d samples' % nSamples)


if __name__ == '__main__':
    fire.Fire(createDataset)
After you have the correct data and the updated create_lmdb_dataset.py file, make sure your data folder is inside the deep-text-recognition-benchmark folder (the Git repo you cloned). Then run the following command:
python .\create_lmdb_dataset.py <data folder name> <path to labels.txt in data folder> <output folder for your lmdb dataset>
where:
- <data folder name> is the name of your folder with images and labels.txt (output in my case)
- <path to labels.txt in data folder> is <data folder name> plus labels.txt (so .\output\labels.txt in my case)
- <output folder for your lmdb dataset> is the name of a folder that will be created for your dataset converted to lmdb format (I called it .\lmdb_output)
For me, the command was as follows (make sure to run it inside the deep-text-recognition-benchmark folder):
python .\create_lmdb_dataset.py .\output .\output\labels.txt .\lmdb_output
Now, you should have a new folder, like the image below, in your deep-text-recognition-benchmark folder.

NOTE: running the command with an existing output folder does NOT overwrite that folder. Make sure you either delete the old folder or give the new output a different name (this was something I struggled with for a while, so hopefully this warning helps you avoid the same mistake).
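To verify that the conversion worked, you can open the new folder with lmdb and read back the sample count. A minimal sketch (the folder name is whatever you passed as the output folder):
import lmdb

# Open the converted dataset read-only and print what was written.
env = lmdb.open('lmdb_output', readonly=True, lock=False)
with env.begin() as txn:
    num_samples = int(txn.get('num-samples'.encode()))
    print('num-samples:', num_samples)
    # Keys follow the pattern written by create_lmdb_dataset.py.
    print('first label:', txn.get('label-%09d'.encode() % 1).decode())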
Retrieve a pre-trained OCR model
Now, you need a pre-trained OCR model that you can fine-tune on your dataset. Go to this Dropbox website, download the TPS-ResNet-BiLSTM-Attn.pth model, and place it in your deep-text-recognition-benchmark folder. (I know this looks a bit shady, but it is the way the deep-text-recognition-benchmark repository tells you to do it. The Dropbox is not mine; I am linking it here because it is linked in the deep-text-recognition-benchmark repo.)
Run the fine-tuning
First, a note if you are running on the CPU (you can ignore this if you are using a GPU). On the CPU, you will likely get an error saying RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. This can be fixed by changing lines 85 and 87 in the train.py file:
# OLD LINES
# ...
if opt.FT:
    model.load_state_dict(torch.load(opt.saved_model), strict=False)
else:
    model.load_state_dict(torch.load(opt.saved_model))
# ...
# NEW LINES (change to this if you are using CPU)
# ...
if opt.FT:
    model.load_state_dict(torch.load(opt.saved_model, map_location='cpu'), strict=False)
else:
    model.load_state_dict(torch.load(opt.saved_model, map_location='cpu'))
# ...
Finally, you can run the fine-tuning with the command below:
python train.py --train_data lmdb_output --valid_data lmdb_output --select_data "/" --batch_ratio 1.0 --Transformation TPS --FeatureExtraction ResNet --SequenceModeling BiLSTM --Prediction Attn --batch_size 2 --data_filtering_off --workers 0 --batch_max_length 80 --num_iter 10 --valInterval 5 --saved_model TPS-ResNet-BiLSTM-Attn.pth
Some notes on the command:
- data_filtering_off is a flag, so you just pass it without a value. I had to turn data filtering off because, with filtering enabled, I got no samples to train on.
- workers had to be set to 0 to avoid errors. I believe this has to do with multi-GPU settings, and it is also mentioned in the train.py file in the deep-text-recognition-benchmark folder.
- batch_max_length is the maximum length of any text in the training dataset. If you are using a different dataset, feel free to change this variable, but make sure it is at least as large as the longest string in your dataset, or you will get an error.
- For this tutorial, I use train_data and valid_data to refer to the same folder. In practice, I would create one folder with a training dataset and one with a validation dataset, and refer to those instead.
- I set num_iter to 10 so you can make sure it works. Naturally, this variable must be set much higher when running actual fine-tuning of the model.
- saved_model is an optional parameter, but if you don't set it, you will train a model from scratch. You probably do not want to do that (as this will require a lot of training), so set the saved_model flag to the existing model you downloaded from Dropbox.
Running inference with your fine-tuned model
After you have fine-tuned your model, you want to run inference with it. To do this, you can use the command below:
python demo.py --Transformation TPS --FeatureExtraction ResNet --SequenceModeling BiLSTM --Prediction Attn --image_folder <path to images to test on> --saved_model <path to model to use>
where:
- <path to images to test on> is a folder consisting of PNG images you want to test on. For me, this was: output
- <path to model to use> is the path to the saved model from your fine-tuning. For me, this was: .\saved_models\TPS-ResNet-BiLSTM-Attn-Seed1111\best_accuracy.pth (the fine-tuning saves the fine-tuned model in a saved_models folder)
The command I used was:
python demo.py --Transformation TPS --FeatureExtraction ResNet --SequenceModeling BiLSTM --Prediction Attn --image_folder output --saved_model .\saved_models\TPS-ResNet-BiLSTM-Attn-Seed1111\best_accuracy.pth
The command simply outputs the model's prediction and confidence score for each image in the <path to images to test on> folder. You can then judge the model's performance by looking at the images yourself and checking whether the predictions are correct. This is a qualitative test of the model's performance.
A qualitative test of performance
To see if the fine-tuning worked, I will do a qualitative test of performance by running the original model and my fine-tuned model on 10 specific words and numbers. The words I tested are shown below (merged vertically into one image). I made the task a bit harder for the model by adding skew and blur (a sketch of how to do this follows the list of image texts below).

Since I want my OCR to read Norwegian supermarket receipts, some of the words are Norwegian (taken from http://openfoodfacts.com/; you can read more about it in this article). My fine-tuned model should hopefully perform better on these words, since the original OCR model is not used to seeing Norwegian words, while my fine-tuned model has been trained on some.
The text in each image is:
- image0 -> vanskeligheter
- image1 -> uvanligheter
- image2 -> skrekkeksempel
- image3 -> rosenborg
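The skew and blur mentioned above can be added with a couple of basic OpenCV operations. Here is a minimal sketch, assuming one of your generated images is at output/image0.png (the path, kernel size, and shear factor are all illustrative):
import cv2
import numpy as np

img = cv2.imread('output/image0.png')  # assumption: path to one of your images
h, w = img.shape[:2]

# Mild Gaussian blur.
blurred = cv2.GaussianBlur(img, (5, 5), 0)

# Horizontal shear (skew) via an affine warp, padding with white.
shear = 0.2
M = np.float32([[1, shear, 0], [0, 1, 0]])
skewed = cv2.warpAffine(blurred, M, (int(w + shear * h), h),
                        borderValue=(255, 255, 255))

cv2.imwrite('image0_hard.png', skewed)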
Results for the original model (not fine-tuned):

Results for the fine-tuned model:

As you can see, the fine-tuning has worked, and the fine-tuned model achieves perfect results in this qualitative example.
Quantitative test of performance
If you want a more quantitative test, you can either look at the validation results that show up during fine-tuning, or you can use the command below:
python test.py --eval_data <path to test data set in lmdb format> --Transformation TPS --FeatureExtraction ResNet --SequenceModeling BiLSTM --Prediction Attn --saved_model <path to model to test> --batch_max_length 70 --workers 0 --batch_size 2 --data_filtering_off
where:
- <path to test data set in lmdb format> is the path to the folder containing the test data in lmdb format. For me, this was: lmdb_norwegian_data_test
- <path to model to test> is the path to the model you want to test the performance of. For me, this was: saved_models/TPS-ResNet-BiLSTM-Attn-Seed1111/best_accuracy.pth
The command I used was therefore:
python test.py --eval_data lmdb_norwegian_data_test --Transformation TPS --FeatureExtraction ResNet --SequenceModeling BiLSTM --Prediction Attn --saved_model saved_models/TPS-ResNet-BiLSTM-Attn-Seed1111/best_accuracy.pth --batch_max_length 70 --workers 0 --batch_size 2 --data_filtering_off
This will output the accuracy as a percentage (a number between 0 and 100), which is the accuracy the OCR model achieves on your test dataset.
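For intuition, the reported accuracy is essentially exact-match word accuracy: the share of test images whose predicted string matches the ground truth exactly (test.py applies some extra normalization, so treat this as a rough sketch):
def exact_match_accuracy(predictions, ground_truths):
    # Share of predictions that match the ground-truth string exactly.
    correct = sum(p == gt for p, gt in zip(predictions, ground_truths))
    return 100.0 * correct / len(ground_truths)

print(exact_match_accuracy(['vanskeligheter', 'uvanligheter'],
                           ['vanskeligheter', 'uvanligheter']))  # 100.0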
In my experience, the model you download from Dropbox needs a bit of training. In the beginning, the model will make complete nonsense predictions, but if you let it train for 30 minutes or so you should start to see some improvements.
I then ran test.py on the 4 images I showed above and got the following results, with the old (not fine-tuned) model on the left and the fine-tuned model on the right. You can see that the fine-tuned model performs better.


Conclusion
Congrats, you can now fine-tune your OCR model! To make a significant impact on the model and help it generalize, you will probably have to build a larger dataset, which you can learn about in this tutorial, and then let the model train for a while. In the end, the OCR model will hopefully perform better for your specific use case.