A text-line level historical handwritten Ethiopic OCR Dataset
This repository contains HHD-Ethiopic, a text-line-level historical handwritten Ethiopic OCR dataset, together with baseline models and human-level performance figures for benchmarking historical handwritten Ethiopic text-image recognition. The full paper is here.
The HHD-Ethiopic OCR dataset consists of ~80k text-line images extracted from
Sample text-line images and their corresponding ground-truth text are shown below. For a more detailed description of the dataset, see formats of the dataset.
No. | Text-line Image | Ground-Truth Text |
---|---|---|
1 | [Image 1] | ወጽራኅየኒ፡ቦአ፡ቅድሜሁ፡ውስተ፡ዕዘኒሁ |
2 | [Image 2] | ፍራስ፡እሳት፡ወጽሩዓን |
3 | [Image 3] | ወአንሰ፡በብዝኃ፡አሀውዕ፡ቢተኩ |
4 | [Image 4] | ወአድኅነከ፡ይትፌሥሑ። |
In the current implementation, the NumPy format of the HHD-Ethiopic dataset is used for training and testing the baseline models. Download the dataset.
The demonstration here uses only the train and test data stored in NumPy format; to train and test all baseline models, use the complete source code (link). After downloading HHD-Ethiopic, install the requirements:
pip install -r requirements.txt
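Once the requirements are installed, the NumPy files can be inspected before training. The following is a minimal sketch only: the helper `load_split`, the file names, and the array shapes are hypothetical and are not taken from this repository, so substitute the paths of the files you actually downloaded. Synthetic arrays stand in for the real data so that the snippet runs on its own.

```python
import os
import tempfile

import numpy as np


def load_split(image_path, label_path):
    """Load one split (images and labels) from two .npy files.

    Hypothetical helper: the real dataset files may be organised differently.
    """
    images = np.load(image_path)
    labels = np.load(label_path)
    # Every text-line image should have exactly one label row.
    if len(images) != len(labels):
        raise ValueError("image and label arrays differ in length")
    return images, labels


# Synthetic stand-in data so the sketch runs without the real download.
# File names and shapes below are arbitrary illustrative assumptions.
tmp_dir = tempfile.mkdtemp()
demo_image_path = os.path.join(tmp_dir, "demo_images.npy")
demo_label_path = os.path.join(tmp_dir, "demo_labels.npy")
np.save(demo_image_path, np.zeros((4, 32, 128), dtype=np.uint8))
np.save(demo_label_path, np.zeros((4, 20), dtype=np.int64))

demo_images, demo_labels = load_split(demo_image_path, demo_label_path)
print(demo_images.shape, demo_labels.shape)
```

Checking shapes and dtypes this way before launching a training run is a cheap way to catch a wrong path or a mismatched image/label pair.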
To train the model from scratch:
$ python3 train_model_plain_CTC.py
Alternatively, you can run the training demonstration directly in Google Colab.
To run prediction/testing:
$ python3 test_model_plain_CTC.py
Alternatively, you can run the testing demonstration directly in Google Colab.
*Please note that the two Colab demos provided here use the HPopt-Attn-CTC implementation as a sample.
Sample results and Character Error Rate (CER) per line are shown below:
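For readers unfamiliar with the metric, CER is commonly defined as the Levenshtein edit distance between the predicted and ground-truth strings, divided by the length of the ground-truth string. The sketch below illustrates that common definition; it is not taken from this repository's evaluation code, which may normalise differently. The function names are illustrative, and the prediction string is a made-up example with one substituted character.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two strings, computed row by row."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """CER = edit distance / number of characters in the reference."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Ground truth taken from the sample table; the prediction is hypothetical.
ground_truth = "ፍራስ፡እሳት፡ወጽሩዓን"
prediction = "ፍራስ፡እሰት፡ወጽሩዓን"
print(f"CER: {character_error_rate(ground_truth, prediction):.4f}")
```

Because each Ethiopic syllable and the word separator `፡` are single Unicode code points, comparing Python strings character by character matches the per-character intent of the metric here.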
We welcome contributions and feedback from the research community to further enhance the HHD-Ethiopic dataset and code. If you have any suggestions, please feel free to send them via email: [email protected] or [email protected]
We would like to express our gratitude to the Ethiopian National Archive and Library Agency (ENALA) for providing access to the historical handwritten documents used in creating the HHD-Ethiopic dataset. We are also grateful to the ICT4D research center, Bahir Dar Institute of Technology, and ChaLearn for their funding. Finally, we acknowledge the support and contributions of the annotators who made this dataset possible.
This work is licensed under a Creative Commons Attribution 4.0 International License.