Shotor ("camel" in Persian) is a free synthetic dataset for word-level OCR.
The current version contains 120,000 grayscale 50×100 images and their corresponding words. The words contain only alphabetic characters.
Note: To train a robust model, apply augmentations such as scaling, translation, and additive noise to the images.
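The augmentations named in the note could be sketched roughly as follows, using only NumPy. This is an illustrative sketch, not the pipeline used by the dataset author: the function name `augment`, its parameters and their default values are hypothetical, and it assumes each image is loaded as a 2-D `uint8` array with 50 rows and 100 columns and a light background (`fill=255`). Adjust these to match the actual files.

```python
import numpy as np

def augment(image, rng, max_shift=4, scale_range=(0.9, 1.1), noise_std=8.0, fill=255):
    """Return a randomly scaled, translated and noised copy of a 2-D uint8 image."""
    h, w = image.shape
    # Random parameters: isotropic scale about the image center plus a pixel shift.
    scale = rng.uniform(*scale_range)
    dx = int(rng.integers(-max_shift, max_shift + 1))
    dy = int(rng.integers(-max_shift, max_shift + 1))

    # Inverse mapping: for every output pixel, look up its source pixel
    # (nearest-neighbor sampling).
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    src_y = np.rint((ys - dy - cy) / scale + cy).astype(int)
    src_x = np.rint((xs - dx - cx) / scale + cx).astype(int)

    # Output pixels that map outside the source image get the fill value
    # (assumed background color).
    inside = (src_y >= 0) & (src_y < h) & (src_x >= 0) & (src_x < w)
    out = np.full((h, w), float(fill))
    out[inside] = image[src_y[inside], src_x[inside]]

    # Additive Gaussian noise, then clip back to the valid uint8 range.
    out += rng.normal(0.0, noise_std, size=out.shape)
    return np.clip(out, 0, 255).astype(np.uint8)

# Usage on a stand-in image (a dark bar on a light background).
rng = np.random.default_rng(0)
sample = np.full((50, 100), 255, dtype=np.uint8)
sample[20:30, 30:70] = 0
augmented = augment(sample, rng)
```

Applying a fresh random augmentation each time an image is drawn during training, rather than once up front, gives the model a different variant of every word in each epoch.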
For an example of using the Shotor dataset, see this notebook:
A simple word level OCR for Persian Language using Pytorch and OpenCV
I used these resources to create the word lists:
- Persian Wikipedia
- Ganjoor Website
- Persian Spell Checking Data by reza1615
The images have been generated using multiple fonts:
- a few fonts from https://rastikerdar.github.io/
- and some fonts from https://www.fontirani.ir/
Created by: Amirabbas Asadi