jennkimerson/OCR_ArchivalDataOrganization_HGARC

Archival Data Organization for Howard Gotlieb Archival Research Center (HGARC) 2019

The Project:

This program was developed by Jennifer Kim and Richard Xiao to convert PDFs of legacy data into an Excel listing format that can be parsed further against ontological standards. Both have been experimenting with different approaches to account for the variables and data anomalies in Finding Aids produced over the past 50 years at the Howard Gotlieb Archival Research Center. Scripts in Python, Java, C++, and Go use Tesseract, OpenCV, and EAST (Efficient and Accurate Scene Text Detector) to harvest the data. The goal throughout is to retain the integrity of the data.

Out-of-the-box design thinking has proved to be the best way to tackle this project. Most traditional programs disregard formatting and indentation, that is, whitespace. Yet whitespace is how people organize human-readable information. To make that information machine-readable, the students aim to train the program to adjust for the variability of each document. This is accomplished through machine learning and more accurate whitespace preservation.
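To illustrate what whitespace preservation can mean in practice, here is a minimal Python sketch. It is not the project's actual code. It assumes OCR has already produced word-level boxes, similar to the per-word positions Tesseract can report. The function name `rebuild_lines`, the dictionary keys `text`, `left`, and `line`, and the `char_width` parameter are all hypothetical names chosen for this example.

```python
from collections import defaultdict


def rebuild_lines(words, char_width=10):
    """Rebuild text lines from word boxes, keeping indentation.

    words: iterable of dicts with hypothetical keys
        'text' (str), 'left' (x position in pixels), 'line' (line index).
    char_width: assumed average width of one character in pixels.
    """
    by_line = defaultdict(list)
    for word in words:
        if word["text"].strip():  # skip empty OCR tokens
            by_line[word["line"]].append(word)

    lines = []
    for key in sorted(by_line):
        out = ""
        for word in sorted(by_line[key], key=lambda w: w["left"]):
            # Map the pixel offset to a character column.
            col = round(word["left"] / char_width)
            # Pad up to that column; keep at least one space between words.
            pad = max(col - len(out), 1) if out else col
            out += " " * pad + word["text"]
        lines.append(out)
    return lines


if __name__ == "__main__":
    # Made-up sample resembling two lines of a finding aid.
    sample = [
        {"text": "Box", "left": 0, "line": 1},
        {"text": "1", "left": 40, "line": 1},
        {"text": "Folder", "left": 40, "line": 2},
        {"text": "3", "left": 110, "line": 2},
    ]
    for line in rebuild_lines(sample):
        print(repr(line))
```

In this sketch the second line keeps its four-column indent, so a later parsing step could still tell that the folder belongs under the box. A fixed `char_width` is a simplification; real scans would likely need it estimated per document.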

Current Version In Development: Mk. 4

Previous versions can be found on our Wiki page.

Team Members:

Jennifer (Jaehei) Kim, Richard Xiao

Supervisor:

Claudia Friedel
