Releases: aimagelab/HWD

Generated Images

06 Mar 10:03

The GeneratedDataset class provides an easy way to download and use the images generated during the publication of the Emuru paper. To use this dataset, pass a string formatted as {dataset}__{model} to the GeneratedDataset class. The available datasets can be found at: HWD Releases - Generated Dataset.

Example usage:

from hwd.datasets import GeneratedDataset

fakes = GeneratedDataset('iam_words__emuru')
reals = GeneratedDataset('iam_words__reference')
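
The `{dataset}__{model}` identifier can be split on the double underscore to recover its two parts. A minimal sketch (the helper name is hypothetical, not part of the hwd API):

```python
def parse_generated_id(identifier: str) -> tuple[str, str]:
    """Split a '{dataset}__{model}' identifier into its two parts."""
    dataset, model = identifier.split('__')
    return dataset, model

# The two identifiers from the example above:
print(parse_generated_id('iam_words__emuru'))      # ('iam_words', 'emuru')
print(parse_generated_id('iam_words__reference'))  # ('iam_words', 'reference')
```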

Leopardi Dataset

11 Feb 12:44

To foster research on HTR systems able to work on historical documents even in the absence of large training datasets, we devised a new dataset consisting of a small collection of early 19th-century letters written in Italian by Giacomo Leopardi.

The letters are preserved at the Estense Library in Modena, and their high-resolution scans are also available at its Digital Library. In particular, there are 168 pages containing text in Giacomo Leopardi’s handwriting, both letter bodies and envelope fronts.

@inproceedings{cascianelli2021learning,
  title={Learning to read L’Infinito: handwritten text recognition with synthetic training data},
  author={Cascianelli, Silvia and Cornia, Marcella and Baraldi, Lorenzo and Piazzi, Maria Ludovica and Schiuma, Rosiana and Cucchiara, Rita},
  booktitle={Computer Analysis of Images and Patterns: 19th International Conference, CAIP 2021, Virtual Event, September 28--30, 2021, Proceedings, Part II 19},
  pages={340--350},
  year={2021},
  organization={Springer}
}

SaintGall Dataset

10 Feb 16:39

Data Set

The Saint Gall database presented in [1] contains a handwritten historical manuscript with the following characteristics:

  • 9th century
  • Latin language
  • single writer
  • Carolingian script
  • ink on parchment

The original manuscript is housed at the Abbey Library of Saint Gall, Switzerland. The manuscript images [3] were made available online by the e-codices project and a text edition [4] was attached at page-level by the Monumenta project. We have additionally added our binarized and normalized text line images to the manuscript data. Altogether, the manuscript data is given by:

  • page images (JPEG, 300dpi)
  • binarized and normalized text line images
  • text edition at page-level (word spelling, capitalization, and punctuation deviate from the image)

Using the semi-automatic procedure proposed in [2], the following ground truth was created:

  • text line locations
  • word locations
  • transcription at line-level (corresponds exactly with the image)
    Note that only the main text region was covered during text line extraction, i.e., ornamented initial characters and some of the capitalized headings were left out.

Statistics

The Saint Gall database includes:

  • 60 pages
  • 1,410 text lines
  • 11,597 words
  • 4,890 word labels
  • 5,436 word spellings
  • 49 letters

Download

If not already done, we ask you to register before downloading the database. Once registered, you can download the Saint Gall database here:

The archive contains a README file with detailed information about the data formats used. We also provide the training, validation, and test set IDs that were used in the original publication [1] to perform an automatic transcription alignment. Note that with respect to this task, word locations were only needed for the validation and test set. Hence, word locations are not available for the training set so far.

Terms of Use

The Saint Gall database may be used for non-commercial research and teaching purposes only. If you are publishing scientific work based on the Saint Gall database, we request you to include a reference to our paper [1] A. Fischer, V. Frinken, A. Fornés, and H. Bunke: "Transcription Alignment of Latin Manuscripts using Hidden Markov Models," in Proc. 1st Int. Workshop on Historical Document Imaging and Processing, pages 29-36, 2011.

With kind permission of Prof. Ernst Tremp from the Abbey Library of Saint Gall, the original manuscript images [3] provided by e-codices can be used for non-commercial research and teaching purposes explicitly as follows:

  • Show and print sample manuscript images in scientific publications
  • Show sample manuscript images during talks
  • Show sample manuscript images online

For any purposes other than non-commercial research and teaching, the Abbey Library of Saint Gall has to be contacted first.

With kind permission of Max Bänziger from the Monumenta project, the aligned text edition [4] is also included in the Saint Gall database.

References

Printed versions of the papers are linked by DOI. Additionally, we provide accepted preprint versions as PDFs. The preprints are intended for convenient online browsing only.

[1] A. Fischer, V. Frinken, A. Fornés, and H. Bunke: "Transcription Alignment of Latin Manuscripts using Hidden Markov Models," in Proc. 1st Int. Workshop on Historical Document Imaging and Processing, pages 29-36, 2011. [doi] [pdf]

[2] A. Fischer, E. Indermühle, H. Bunke, G. Viehhauser, and M. Stolz: "Ground Truth Creation for Handwriting Recognition in Historical Documents," in Proc. 9th Int. Workshop on Document Analysis Systems, pages 3-10, 2010. [doi] [pdf]

[3] Manuscript images of the Codex Sangallensis 562, © 2006 St. Gallen, Stiftsbibliothek

[4] J.-P. Migne PL114, 1852

Washington Dataset

10 Feb 10:36

Data Set

The Washington database was created from the George Washington Papers at the Library of Congress and has the following characteristics:

  • 18th century
  • English language
  • two writers
  • longhand script
  • ink on paper

The original manuscript images [4] have already been used, for example, by Rath and Manmatha in [3]. The Washington database contains our own text line and word images, along with their transcriptions. Altogether, the manuscript data is given by:

  • binarized and normalized text line images
  • binarized and normalized word images

The ground truth contains:

  • transcription at line-level
  • transcription at word-level

Statistics

The Washington database includes:

  • 20 pages
  • 656 text lines
  • 4,894 word instances
  • 1,471 word classes
  • 82 letters
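
The distinction between word instances and word classes above is occurrence count versus distinct spellings. A toy illustration (the sample words are made up, not taken from the dataset):

```python
# Toy transcription: every occurrence is an instance,
# every distinct spelling is a class.
words = ["orders", "and", "instructions", "and", "letters"]

num_instances = len(words)     # counts every occurrence  -> 5
num_classes = len(set(words))  # counts distinct spellings -> 4

print(num_instances, num_classes)
```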

Download

If not already done, we ask you to register before downloading the database. Once registered, you can download the Washington database here:

The archive contains a README file with detailed information about the data formats used. We also provide the training, validation, and test set IDs that were used, for example, in [1] and [2].

Terms of Use

The Washington database may be used for non-commercial research and teaching purposes only. If you are publishing scientific work based on the Washington database, we request you to include a reference to our paper [1] A. Fischer, A. Keller, V. Frinken, and H. Bunke: "Lexicon-Free Handwritten Word Spotting Using Character HMMs," in Pattern Recognition Letters, Volume 33(7), pages 934-942, 2012.

References

Printed versions of the papers are linked by DOI. Additionally, we provide accepted preprint versions as PDFs. The preprints are intended for convenient online browsing only.

[1] A. Fischer, A. Keller, V. Frinken, and H. Bunke: "Lexicon-Free Handwritten Word Spotting Using Character HMMs," in Pattern Recognition Letters, Volume 33(7), pages 934-942, 2012. [doi] [pdf]

[2] V. Frinken, A. Fischer, R. Manmatha, and H. Bunke: "A Novel Word Spotting Method Based on Recurrent Neural Networks," in IEEE Trans. PAMI, Volume 34(2), pages 211-224, 2012. [doi] [pdf]

[3] T. M. Rath and R. Manmatha: "Word Spotting for Historical Documents," in Int. Journal on Document Analysis and Recognition, Volume 9, pages 139-152, 2007.

[4] George Washington Papers at the Library of Congress from 1741-1799, Series 2, Letterbook 1, pages 270-279 and 300-309

Karaoke Dataset

07 Nov 21:16

Rimes Dataset

04 Nov 11:12

The RIMES database (Reconnaissance et Indexation de données Manuscrites et de fac similÉS / Recognition and Indexation of handwritten documents and faxes) has been created to evaluate automatic recognition and indexing systems for handwritten letters. Of particular interest are cases such as those sent by mail or fax from individuals to companies or administrations.

The database was collected by asking volunteers to write handwritten letters in exchange for gift vouchers. Each volunteer was given a fictitious identity (of the same sex as their real one) and up to 5 scenarios. Each scenario combined one of 9 realistic topics (change of personal data such as address or bank account; request for information; opening and closing of a customer account; change of contract or order; complaint, e.g., about poor quality of service; payment difficulties such as a request for delay or tax exemption; reminder; claim with other circumstances) with a target (administrations or service providers such as telephone, electricity, bank, or insurance companies). The volunteers wrote a letter with this information in their own words. The layout was free; the only requirements were to use white paper and to write legibly in black ink.

The campaign was a success, with more than 1,300 people contributing to the RIMES database by writing up to 5 letters each. The resulting database contains 12,723 pages, corresponding to 5,605 letters of two to three pages each.

RIMES in evaluations:

The database has been used for several competitions with different tasks and with official train/dev/test splits:

See Papers with code for a list of publications using the RIMES database.

RIMES data:

You can download the data used in the evaluations:

ICDAR2011 Line level : RIMES-2011-Lines.zip

@inproceedings{grosicki2011icdar,
  title={Icdar 2011-french handwriting recognition competition},
  author={Grosicki, Emmanuele and El-Abed, Haikal},
  booktitle={2011 International Conference on Document Analysis and Recognition},
  pages={1459--1463},
  year={2011},
  organization={IEEE}
}

CVL Dataset

04 Nov 10:39

The CVL Database is a public database for writer retrieval, writer identification, and word spotting. It consists of 7 different handwritten texts (1 German and 6 English). In total, 310 writers participated: 27 of them wrote all 7 texts, while the remaining 283 wrote 5 texts. For each text, an RGB color image (300 dpi) comprising both the handwritten text and the printed text sample is available, as well as a cropped version (handwritten text only). A unique ID identifies the writer, and the bounding boxes for each single word are stored in an XML file.

The CVL database consists of images with cursively handwritten German and English texts chosen from literary works. Each page carries a unique writer ID and the text number (separated by a dash) in the upper right corner, followed by the printed sample text. The printed text is placed between two horizontal separators. Beneath it, participants were asked to write the text using a ruled undersheet to prevent curled text lines. The layout follows the style of the IAM database. The database was updated on 12/09/2013 because one writer ID (265/266) was wrong; the version number was changed to 1.1.
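
The per-word bounding boxes mentioned above can be read with the standard library. A minimal sketch assuming a hypothetical element layout (tag and attribute names are illustrative, not the actual CVL schema; consult the database's documentation for the real format):

```python
import xml.etree.ElementTree as ET

# Illustrative XML in the spirit of per-word bounding boxes;
# the real CVL schema differs -- check the dataset documentation.
page_xml = """
<page writer-id="0001" text-id="1">
  <word text="Flatland" x="120" y="340" width="210" height="58"/>
  <word text="is" x="345" y="342" width="60" height="50"/>
</page>
"""

root = ET.fromstring(page_xml)
boxes = [
    (w.get("text"), int(w.get("x")), int(w.get("y")),
     int(w.get("width")), int(w.get("height")))
    for w in root.iter("word")
]
print(boxes[0])  # ('Flatland', 120, 340, 210, 58)
```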

Samples of the following texts have been used:

  • Edwin A. Abbott – Flatland: A Romance of Many Dimensions (92 words).
  • William Shakespeare – Macbeth (49 words).
  • Wikipedia – Mailüfterl (73 words, under CC Attribution-ShareAlike License).
  • Charles Darwin – Origin of Species (52 words).
  • Johann Wolfgang von Goethe – Faust. Eine Tragödie (50 words).
  • Oscar Wilde – The Picture of Dorian Gray (66 words).
  • Edgar Allan Poe – The Fall of the House of Usher (78 words).

License

This database may be used for non-commercial research purposes only (it is licensed under the Creative Commons Attribution-NonCommercial 3.0 Unported License). If you publish material based on this database, we request you to include a reference to:

@inproceedings{kleber2013cvl,
  title={Cvl-database: An off-line database for writer retrieval, writer identification and word spotting},
  author={Kleber, Florian and Fiel, Stefan and Diem, Markus and Sablatnig, Robert},
  booktitle={2013 12th international conference on document analysis and recognition},
  pages={560--564},
  year={2013},
  organization={IEEE}
}

IAM Dataset

02 Nov 14:39

The IAM database contains 13,353 images of handwritten text lines produced by 657 writers. The texts those writers transcribed are from the Lancaster-Oslo/Bergen Corpus of British English. In total, the database comprises 1,539 handwritten pages and 115,320 words, and it is categorized as part of the modern collection. The database is labeled at the sentence, line, and word levels.

Terms of usage

The IAM Handwriting Database is publicly accessible and freely available for non-commercial research purposes. If you are using data from the IAM Handwriting Database, we request you to register, so we are aware of who is using our data. If you are publishing scientific work based on the IAM Handwriting Database, we request you to include a reference to the paper.

@article{marti2002iam,
  title={The IAM-database: an English sentence database for offline handwriting recognition},
  author={Marti, U-V and Bunke, Horst},
  journal={International journal on document analysis and recognition},
  volume={5},
  pages={39--46},
  year={2002},
  publisher={Springer}
}

Evaluation settings

This release contains the data necessary to evaluate models following the standard procedure described in VATr++: Choose Your Words Wisely for Handwritten Text Generation.