This program was developed by Jennifer Kim and Richard Xiao to convert PDFs of legacy finding aids into an Excel listing format for further parsing against ontological standards. Both Jennifer and Richard have experimented with different approaches to account for the variables and data anomalies in finding aids produced over the past 50 years at the Howard Gotlieb Archival Research Center. Scripts in Python, Java, C++, and Go use Tesseract, OpenCV, and EAST (Efficient and Accurate Scene Text Detector) to harvest the data, all with the goal of retaining its integrity.
Out-of-the-box design thinking has proved the best way to tackle this project. Most traditional programs disregard formatting and indentation (i.e., whitespace), yet this is precisely how humans organize readable information. To make these documents machine readable, the students aim to train the program to adjust for the variability of each document, using machine learning to improve the accuracy of whitespace preservation.
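As a minimal sketch of the whitespace-preservation idea, the snippet below shows one way OCR'd finding-aid lines could be grouped into hierarchy levels by their indentation depth. The function names, the sample lines, and the tab width are illustrative assumptions, not the project's actual code; it presumes the OCR step has already kept leading whitespace intact (Tesseract, for instance, offers a `preserve_interword_spaces` config option for this).

```python
# Hypothetical illustration: infer hierarchy from preserved indentation
# in OCR output. Assumes whitespace survived the OCR step.

def indent_depth(line: str, tab_size: int = 4) -> int:
    """Count leading spaces (tabs expanded) as a proxy for nesting depth."""
    expanded = line.expandtabs(tab_size)
    return len(expanded) - len(expanded.lstrip(" "))

def group_by_indent(lines):
    """Map each distinct indent depth to a hierarchy level (0 = outermost)."""
    depths = sorted({indent_depth(l) for l in lines if l.strip()})
    level = {d: i for i, d in enumerate(depths)}
    return [(level[indent_depth(l)], l.strip()) for l in lines if l.strip()]

# Sample lines mimicking an indented finding aid (invented for illustration)
ocr_lines = [
    "Series I: Correspondence",
    "    Box 1, Folder 2: Letters, 1964-1970",
    "        Item: Letter to the editor",
    "Series II: Manuscripts",
]

for lvl, text in group_by_indent(ocr_lines):
    print(lvl, text)
```

Mapping observed depths to ordinal levels, rather than hard-coding expected indent widths, is one way to absorb the variability between documents produced over five decades.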
Previous versions can be found on our Wiki page.
Jennifer (Jaehei) Kim, Richard Xiao
Claudia Friedel