Data Set
The Saint Gall database presented in [1] contains a handwritten historical manuscript with the following characteristics:
- 9th century
- Latin language
- single writer
- Carolingian script
- ink on parchment
The original manuscript is housed at the Abbey Library of Saint Gall, Switzerland. The manuscript images [3] were made available online by the e-codices project, and a text edition [4] was attached at page level by the Monumenta project. We have added our binarized and normalized text line images to the manuscript data. Altogether, the manuscript data consists of:
- page images (JPEG, 300dpi)
- binarized and normalized text line images
- text edition at page-level (word spelling, capitalization, and punctuation deviate from the image)
Using the semi-automatic procedure proposed in [2], the following ground truth was created:
- text line locations
- word locations
- transcription at line-level (corresponds exactly with the image)
Note that only the main text region was covered during text line extraction, i.e., ornamented initial characters and some of the capitalized headings were left out.
The Saint Gall database includes:
- 60 pages
- 1,410 text lines
- 11,597 words
- 4,890 word labels
- 5,436 word spellings
- 49 letters
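As a quick orientation, the counts above can be turned into per-page and per-line averages. The following is a minimal sketch that uses only the figures listed here; the constant and function names are illustrative and not part of the database.

```python
# Corpus counts copied from the list above.
PAGES = 60
TEXT_LINES = 1410
WORDS = 11597


def corpus_ratios():
    """Return simple averages derived from the stated corpus counts."""
    return {
        "lines_per_page": TEXT_LINES / PAGES,
        "words_per_line": WORDS / TEXT_LINES,
        "words_per_page": WORDS / PAGES,
    }


if __name__ == "__main__":
    for name, value in corpus_ratios().items():
        print(f"{name}: {value:.2f}")
```

These averages describe the main text region only, since ornamented initials and some headings were left out during text line extraction.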
If you have not already done so, please register before downloading the database. Once registered, you can download the Saint Gall database here:
The archive contains a README file with detailed information about the data formats used. We also provide the training, validation, and test set IDs that were used in the original publication [1] to perform automatic transcription alignment. Note that for this task, word locations were needed only for the validation and test sets. Hence, word locations are not yet available for the training set.
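When working with the training, validation, and test set IDs, it can be useful to confirm that the three lists do not overlap. The sketch below assumes that each split is stored as a plain-text file with one ID per line; this layout and the file names in the usage comment are assumptions, and the README in the archive is the authoritative description of the actual formats.

```python
from itertools import combinations
from pathlib import Path


def parse_ids(lines):
    """Return the non-empty, stripped IDs from an iterable of text lines."""
    return [line.strip() for line in lines if line.strip()]


def read_ids(path):
    """Read IDs from a text file, assuming one ID per line (hypothetical format)."""
    with Path(path).open(encoding="utf-8") as handle:
        return parse_ids(handle)


def split_overlaps(splits):
    """Map each pair of split names to the set of IDs they share, if any."""
    overlaps = {}
    for (name_a, ids_a), (name_b, ids_b) in combinations(splits.items(), 2):
        shared = set(ids_a) & set(ids_b)
        if shared:
            overlaps[(name_a, name_b)] = shared
    return overlaps


# Usage with hypothetical file names (check the README for the real ones):
# splits = {name: read_ids(f"{name}.txt") for name in ("train", "valid", "test")}
# assert not split_overlaps(splits)
```

Because word locations are provided only for the validation and test sets, any code that looks up word locations should restrict itself to IDs from those two lists.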
Terms of Use
The Saint Gall database may be used for non-commercial research and teaching purposes only. If you publish scientific work based on the Saint Gall database, we ask you to include a reference to our paper [1]: A. Fischer, V. Frinken, A. Fornés, and H. Bunke: "Transcription Alignment of Latin Manuscripts using Hidden Markov Models," in Proc. 1st Int. Workshop on Historical Document Imaging and Processing, pages 29-36, 2011.
With kind permission of Prof. Ernst Tremp from the Abbey Library of Saint Gall, the original manuscript images [3] provided by e-codices can be used for non-commercial research and teaching purposes explicitly as follows:
- Show and print sample manuscript images in scientific publications
- Show sample manuscript images during talks
- Show sample manuscript images online
For any purpose other than non-commercial research and teaching, the Abbey Library of Saint Gall must be contacted first.
With kind permission of Max Bänziger from the Monumenta project, the aligned text edition [4] is also included in the Saint Gall database.
Printed versions of the papers are linked by DOI. Additionally, we provide accepted preprint versions as PDFs. The preprints are intended for convenient online browsing only.
[1] A. Fischer, V. Frinken, A. Fornés, and H. Bunke: "Transcription Alignment of Latin Manuscripts using Hidden Markov Models," in Proc. 1st Int. Workshop on Historical Document Imaging and Processing, pages 29-36, 2011. [doi] [pdf]
[2] A. Fischer, E. Indermühle, H. Bunke, G. Viehhauser, and M. Stolz: "Ground Truth Creation for Handwriting Recognition in Historical Documents," in Proc. 9th Int. Workshop on Document Analysis Systems, pages 3-10, 2010. [doi] [pdf]
[3] Manuscript images of the Codex Sangallensis 562, © 2006 St. Gallen, Stiftsbibliothek
[4] J.-P. Migne: Patrologia Latina, vol. 114, 1852.