This program was developed by Jennifer Kim and Richard Xiao to convert PDFs of legacy finding aids into an Excel listing format for further parsing against ontological standards. Both Jennifer and Richard have experimented with different approaches to account for the variables and data anomalies in finding aids produced over the past 50 years at the Howard Gotlieb Archival Research Center. Scripts in Python, Java, C++, and Go use Tesseract, OpenCV, and EAST (Efficient and Accurate Scene Text Detector) to harvest the data, all with the goal of retaining its integrity.
Out-of-the-box design thinking has proved the best way to tackle this project. Most traditional programs disregard formatting and indentation (i.e., whitespace), yet this is precisely how humans organize readable information. To make these documents machine readable, the students aim to train the program to adjust for the variability of each document, using machine learning to improve the accuracy of whitespace preservation.
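As a minimal sketch of the whitespace-preservation idea, the snippet below shows one way OCR'd finding-aid lines could be grouped into hierarchy levels by their indentation depth. The function names, the sample lines, and the tab width are illustrative assumptions, not the project's actual code; it presumes the OCR step has already kept leading whitespace intact (Tesseract, for instance, offers a `preserve_interword_spaces` config option for this).

```python
# Hypothetical illustration: infer hierarchy from preserved indentation
# in OCR output. Assumes whitespace survived the OCR step.

def indent_depth(line: str, tab_size: int = 4) -> int:
    """Count leading spaces (tabs expanded) as a proxy for nesting depth."""
    expanded = line.expandtabs(tab_size)
    return len(expanded) - len(expanded.lstrip(" "))

def group_by_indent(lines):
    """Map each distinct indent depth to a hierarchy level (0 = outermost)."""
    depths = sorted({indent_depth(l) for l in lines if l.strip()})
    level = {d: i for i, d in enumerate(depths)}
    return [(level[indent_depth(l)], l.strip()) for l in lines if l.strip()]

# Sample lines mimicking an indented finding aid (invented for illustration)
ocr_lines = [
    "Series I: Correspondence",
    "    Box 1, Folder 2: Letters, 1964-1970",
    "        Item: Letter to the editor",
    "Series II: Manuscripts",
]

for lvl, text in group_by_indent(ocr_lines):
    print(lvl, text)
```

Mapping observed depths to ordinal levels, rather than hard-coding expected indent widths, is one way to absorb the variability between documents produced over five decades.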
Previous versions can be found on our Wiki page.
Jennifer (Jaehei) Kim, Richard Xiao
Claudia Friedel