HHD-Ethiopic

A text-line level historical handwritten Ethiopic OCR Dataset

Overview

This repository contains HHD-Ethiopic, a text-line level historical handwritten Ethiopic OCR dataset, together with baseline models and human-level performance results for benchmarking historical handwritten Ethiopic text-image recognition. The full paper is here.

Dataset Details

The HHD-Ethiopic OCR dataset consists of ~80k text-line images extracted from historical handwritten Ethiopic manuscripts dating from the 18th to the 20th century. Each text-line image is accompanied by its ground-truth transcription. The dataset can be downloaded directly from Hugging Face (HHD-Ethiopic Dataset) or Zenodo (HHD-Ethiopic Dataset). Additional synthetically generated Ethiopic text-line images and their corresponding ground-truth texts are available from this link.

Sample text-line images and their corresponding ground-truth text are shown below. For a more thorough tutorial on the dataset, see formats of the dataset.

No.	Text-line Image	Ground-Truth Text
[Image 1]		ወጽራኅየኒ፡ቦአ፡ቅድሜሁ፡ውስተ፡ዕዘኒሁ
[Image 2]		ፍራስ፡እሳት፡ወጽሩዓን
[Image 3]		ወአንሰ፡በብዝኃ፡አሀውዕ፡ቢተኩ
[Image 4]		ወአድኅነከ፡ይትፌሥሑ።

Getting Started

In the current implementation, the NumPy format of the HHD-Ethiopic dataset is used for training and testing the baseline models. Download the dataset.

After downloading HHD-Ethiopic, install the requirements. For demonstration purposes, we use only the train and test data stored in NumPy format. To train and test all of the baseline models, please see all source codes.

pip install -r requirements.txt
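Once the requirements are installed, it can help to check what the downloaded NumPy files contain before training. The sketch below is only an illustration: the helper name `summarize_npy_files` is hypothetical, and it makes no assumption about the actual file names or array layout used in this repository. It simply lists the shape and dtype of every `.npy` file found in a folder you point it at.

```python
from pathlib import Path

import numpy as np


def summarize_npy_files(data_dir):
    """Return {file name: (shape, dtype)} for every .npy file in data_dir.

    Hypothetical helper for illustration; not part of the repository code.
    """
    summary = {}
    for path in sorted(Path(data_dir).glob("*.npy")):
        # mmap_mode="r" maps the file instead of reading it fully,
        # so large image arrays are not copied into memory.
        array = np.load(path, mmap_mode="r")
        summary[path.name] = (array.shape, str(array.dtype))
    return summary
```

For example, calling `summarize_npy_files("path/to/downloaded/data")` (a placeholder path) returns a dictionary you can print to confirm that the number of images matches the number of labels. Arrays saved with object dtype cannot be memory-mapped and would need a different loading call.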

To train the model from scratch

$ python3 train_model_plain_CTC.py

Alternatively, you can run the training code demonstration directly in Google Colab.

To predict/test

$ python3 test_model_plain_CTC.py

Alternatively, you can run the testing code demonstration directly in Google Colab.

*Please note that the two Colab demos provided here use the HPopt-Attn-CTC implementation as a sample.

Sample testing results

Sample results and Character Error Rate (CER) per line are shown below:

Ground-truth	Prediction	Edit Distance	CER/Line (%)
ሰፉሐከ፡የማነከ፡ወውሕጠቶሙ፡ምድር።	ሰፉሕከ፡የማነከ፡ወውሕጠቶሙ፡ምድ።	2	9
ምድር፡ይኔጽር፡ዘሀሎ፡በየብስ፡	ምድር፡ይኔጽር፡ዘሀሎ፡በየብስ፡	1	5
ለብሔረ፡ኢትዮጵያ	አብሒረ፡ኢትየጵያ	4	40
ዓገሠ።በዝሕማም፡መሥጋ፡	ዓገሠ።በዝሕማም፡በሥጋ፡	2	20
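For readers who want to compute a per-line CER on their own predictions, here is a minimal sketch. It assumes the common definition: the Levenshtein edit distance between ground truth and prediction, divided by the ground-truth length, expressed as a percentage. The function names `edit_distance` and `cer_percent` are illustrative and are not taken from the repository, whose evaluation code may normalize or tokenize differently.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two strings, counted in Unicode code points."""
    # previous[j] holds the distance between the reference prefix processed
    # so far and the first j characters of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


def cer_percent(reference, hypothesis):
    """Character error rate of one line, as a percentage of the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)
```

Because Python strings are sequences of code points, each Ethiopic syllable and each punctuation mark such as ፡ or ። counts as one character here. For instance, `cer_percent("abcd", "abxd")` yields 25.0, since one of four reference characters was substituted.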

Feedback

We welcome contributions and feedback from the research community to further enhance the HHD-Ethiopic dataset and code. If you have any suggestions, please feel free to send them via email: [email protected] or [email protected]

Acknowledgments

We would like to express our gratitude to the Ethiopian National Archive and Library Agency (ENALA) for providing access to the historical handwritten documents used in creating the HHD-Ethiopic dataset. We are also grateful to the ICT4D research center, Bahir Dar Institute of Technology, and ChaLearn for their funding. Furthermore, we would like to acknowledge the support and contributions of the annotators who made this dataset possible.

License

This work is licensed under a Creative Commons Attribution 4.0 International License.
