Support image files in notebook 1 data prep #9

Merged: 2 commits merged into main on Feb 17, 2022


@athewsey (Contributor) commented Feb 7, 2022

Update the build_data_manifest utility to support single-page image
format documents, for which the pdf2image_regex cannot be used to
look up cleaned images in S3 from the input file.

Issue #, if available: #5

Description of changes:

Update/fix the build_data_manifest utility function to support single-page image format documents.

Previously, this function assumed the pdf2image_regex was applicable to all input documents. However, the preprocessing job outputs don't match this format when the raw input documents are single images (e.g. PNG, JPEG) or multi-page images (e.g. TIFF). As a result, building collated textract-ref + source-ref manifests failed with errors for custom corpora that include non-PDF documents.
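
To illustrate the kind of lookup described above, here is a minimal sketch and not the actual build_data_manifest code. The naming convention (`<stem>-<4-digit page>.png` for paged outputs, unchanged base name for single images), the `PAGE_SUFFIX_REGEX` pattern, and the `find_cleaned_images` helper are all assumptions made up for this example; the real pdf2image_regex and S3 layout may differ.

```python
import re
from pathlib import PurePosixPath

# Hypothetical naming convention (an assumption, not taken from the repository):
# a paged source "doc.pdf" yields cleaned pages "doc-0001.png", "doc-0002.png", ...
PAGE_SUFFIX_REGEX = re.compile(r"^(?P<stem>.+)-(?P<page>\d{4})\.(?:png|jpe?g)$")


def find_cleaned_images(raw_key, cleaned_keys):
    """Return the cleaned image keys for one raw document, ordered by page number."""
    # Raw key without its extension, e.g. "scans/form.pdf" -> "scans/form"
    raw_base = str(PurePosixPath(raw_key).with_suffix(""))

    # First try the page-suffixed pattern used for paged documents.
    paged = []
    for key in cleaned_keys:
        match = PAGE_SUFFIX_REGEX.match(key)
        if match and match.group("stem") == raw_base:
            paged.append((int(match.group("page")), key))
    if paged:
        return [key for _, key in sorted(paged)]

    # Fallback for single images (assumed here to keep their base name with no
    # page suffix): match on the extension-less key instead of the regex.
    return [
        key
        for key in cleaned_keys
        if str(PurePosixPath(key).with_suffix("")) == raw_base
    ]
```

Under these assumptions, a raw key such as `scans/photo.jpg` resolves to `scans/photo.png` through the fallback branch, while `scans/form.pdf` still resolves to its page-suffixed images in page order.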

Testing done:

Initially tested with a synthetic corpus; exploring additional tests to verify the fix.


By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Update the build_data_manifest utility to support single-page image
format documents, for which the pdf2image_regex cannot be used to
look up cleaned images in S3 from the input file.
Use IFrame() instead of HTML() to suppress an IPython warning.
Specifying content_type seems to be unnecessary, although both the
old and new solutions worked fine with images as well as PDFs.

(Tested in SMStudio on Firefox only)
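
As a rough illustration of the IFrame() approach, here is a sketch rather than the notebook's actual code. It assumes IPython is installed, and the `doc_url` value is a hypothetical placeholder for whatever document path or URL the notebook previews.

```python
from IPython.display import IFrame

# Hypothetical document location (an assumption for this example); in the
# notebook this would be the path or URL of the image or PDF to preview.
doc_url = "sample-document.pdf"

# IFrame builds the <iframe> element itself, so no hand-written iframe markup
# needs to be passed through HTML().
viewer = IFrame(src=doc_url, width="100%", height=600)

# In a notebook cell, leaving `viewer` as the last expression renders it.
# The generated markup can also be inspected directly:
html = viewer._repr_html_()
```

No content type is passed here; the browser decides how to render the target of `src`.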
@athewsey athewsey marked this pull request as ready for review February 17, 2022 12:10
@athewsey athewsey merged commit cd3f3d5 into main Feb 17, 2022