Support image files in notebook 1 data prep #9

athewsey · 2022-02-07T10:07:33Z

Update the build_data_manifest utility to support single-page image
format documents. For these inputs, the pdf2image_regex can't be used
to look up the cleaned images in S3 from the input file.

Issue #, if available: #5

Description of changes:

Update/fix the build_data_manifest utility function to support single-page image format documents.

Previously, this function assumed the pdf2image_regex was applicable to all input documents. However, the preprocessing job outputs don't match this format when the raw input documents are single images (e.g. PNG, JPEG) or multi-page images (e.g. TIFF). As a result, building the collated textract-ref + source-ref manifests failed with errors for custom corpora that include non-PDF documents.
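For illustration only, the lookup with a fallback for image inputs might look roughly like the sketch below. This is not the actual build_data_manifest code: the function name `find_cleaned_images`, the `<base>-<page>.<ext>` page-image naming, and the rule that a cleaned single image keeps the raw file's base name are all assumptions made for the example. Multi-page TIFF outputs are not covered.

```python
import re
from pathlib import PurePosixPath

# Assumption: cleaned images use one of these extensions.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def find_cleaned_images(raw_key, cleaned_keys):
    """Return the cleaned image keys for one raw document, in page order.

    Hypothetical helper: first tries a PDF-style page pattern, then falls
    back to a same-name match for single-image inputs.
    """
    # Raw key without its extension, e.g. "docs/report.pdf" -> "docs/report"
    base = str(PurePosixPath(raw_key).with_suffix(""))

    # Assumed page-image naming: "<base>-<page>.png" (or .jpg / .jpeg)
    page_pattern = re.compile(
        rf"^{re.escape(base)}-(\d+)\.(?:png|jpe?g)$", re.IGNORECASE
    )
    pages = []
    for key in cleaned_keys:
        match = page_pattern.match(key)
        if match:
            pages.append((int(match.group(1)), key))
    if pages:
        # Sort numerically so that page 10 comes after page 2
        return [key for _, key in sorted(pages)]

    # Fallback for single-image inputs: the page pattern found nothing,
    # so look for a cleaned image with the same base name as the raw file.
    return [
        key
        for key in cleaned_keys
        if PurePosixPath(key).suffix.lower() in IMAGE_SUFFIXES
        and str(PurePosixPath(key).with_suffix("")) == base
    ]
```

With this sketch, a PDF resolves to its sorted page images, a PNG resolves to the single cleaned image of the same name, and a document with no cleaned output yields an empty list that the caller can report instead of raising.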

Testing done:

Initially tested with a synthetic corpus; exploring additional tests to verify the fix.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Use IFrame() instead of HTML() to suppress the IPython warning. Specifying content_type appears to be unnecessary: both the old and new solutions worked fine with images as well as PDFs. (Tested in SMStudio on Firefox only.)

athewsey added 2 commits February 7, 2022 09:54

feat(nbs): build_data_manifest supports img files

1da8d61

fix(nbs): iframe warning on corpus PDF display

a706d0b

athewsey marked this pull request as ready for review February 17, 2022 12:10

athewsey merged commit cd3f3d5 into main Feb 17, 2022
