selective ocr to extract key/value data #15

Open
raveslave opened this issue Nov 23, 2019 · 11 comments
Labels: enhancement (New feature or request), pinned

Comments

@raveslave

Wouldn't it be cool to offer this feature?
Basically, let the user draw an overlay that helps find key/value pairs, which can then be mapped to the relevant document type -> field in ERPNext!

image

@raveslave raveslave added the enhancement New feature or request label Nov 23, 2019
@madmath03
Member

madmath03 commented Nov 25, 2019

Hi @raveslave ,

Thanks for sharing this idea. It's really interesting and looks really cool.

I have a few doubts though about the usability as it seems rather complicated to develop or even to use.

  1. First, relying on the position of text seems like it would break easily if the format or invoice "source" changes (shops do not always use the same invoice layout, and it may change over time).
  2. You would also have to manually map every text section to a DocType field each time you read a document. This would make it impossible to do bulk imports.
    Maybe the previous mappings could be stored to "guide" users on recurring invoices (roughly the same position as on an older one), as you mentioned, but that could mean storing and reading a lot of data to provide this.
  3. This is hardly applicable to multi-page PDF documents.

Though it seems hard to provide this, we will still take a look at it as this definitely goes in the direction we are aiming: importing / generating DocTypes from OCR.

For reference, we're currently more invested in a text-based import using simple regular expressions or text processing libraries: https://appliedmachinelearning.blog/2018/06/30/performing-ocr-by-running-parallel-instances-of-tesseract-4-0-python/
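
As a rough illustration of the text-based approach mentioned above, here is a minimal sketch that pulls an invoice number and a date out of already-OCRed text with regular expressions. The patterns, the field names and the sample text are assumptions made up for this example; real invoices differ a lot in wording and date format, so this is not the project's actual import code.

```python
import re
from datetime import datetime

# Hypothetical patterns: real invoices vary widely in wording and date format.
INVOICE_NO_RE = re.compile(
    r"invoice\s*(?:no|number|#)\s*[.:]?\s*([A-Z0-9][A-Z0-9\-/]*)",
    re.IGNORECASE,
)
DATE_RE = re.compile(r"\b(\d{4}-\d{2}-\d{2})\b")  # ISO dates only


def extract_fields(text):
    """Return a dict with whichever fields could be found in the OCR text."""
    fields = {}
    match = INVOICE_NO_RE.search(text)
    if match:
        fields["invoice_no"] = match.group(1)
    match = DATE_RE.search(text)
    if match:
        fields["date"] = datetime.strptime(match.group(1), "%Y-%m-%d").date()
    return fields


# Made-up sample text standing in for OCR output.
sample = "ACME Ltd\nInvoice No: INV-2019-0042\nDate: 2019-11-23\nTotal: 120.00 EUR"
print(extract_fields(sample))
```

Because this only looks at the text and not at where it sits on the page, it does not depend on layout, but it does depend on the labels and date format being predictable.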

We can keep this open to discuss further if you want to.

@raveslave
Author

pls see comments:

  1. true, but the idea is not to rely on the rectangle position, rather have a template to teach the OCR tool to look for that same string. If it fails on a mandatory one, script should notify that manual attention is needed.

  2. My idea is that you only do this mapping once per supplier.
    disregarding OCR, most invoices will be PDF, so in that case the same principle would apply, but be easier to implement.

  3. true, but most of the time, the thing you're after is the date & invoice-no to allow populating the bare minimum (mandatory fields) and later matching it to a PO
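
The per-supplier template idea from the points above could be sketched roughly like this: each supplier gets a one-time mapping from a label string to a target field, and the script reports any mandatory field it cannot find so that someone can look at the document manually. The template contents, the field names (`invoice_no`, `posting_date`, `po_reference`) and the function name are all hypothetical, not actual ERPNext DocType fields or existing code.

```python
import re

# Hypothetical template for one supplier: target field -> (label text, mandatory?).
# Field names are illustrative only, not real ERPNext DocType fields.
ACME_TEMPLATE = {
    "invoice_no": ("Invoice No", True),
    "posting_date": ("Date", True),
    "po_reference": ("Your order", False),
}


def apply_template(text, template):
    """Look up each label in the text; return (values, missing_mandatory)."""
    values, missing = {}, []
    for field, (label, mandatory) in template.items():
        # Capture whatever follows the label on the same line,
        # after an optional colon.
        pattern = re.compile(
            r"^[ \t]*" + re.escape(label) + r"[ \t]*:?[ \t]*(\S.*?)[ \t]*$",
            re.IGNORECASE | re.MULTILINE,
        )
        match = pattern.search(text)
        if match:
            values[field] = match.group(1)
        elif mandatory:
            missing.append(field)
    return values, missing


# Made-up text standing in for OCR or PDF text extraction output.
text = "ACME Ltd\nInvoice No: INV-2019-0042\nDate: 2019-11-23\n"
values, missing = apply_template(text, ACME_TEMPLATE)
if missing:
    print("Manual attention needed, missing:", missing)
else:
    print(values)
```

The matching here is deliberately naive (a label such as "Date" would also match a line starting with "Date of delivery"), so a real template would likely need stricter patterns per supplier.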

@raveslave
Author

re: tesseract
cool tech, have you tried it on a pile of random invoices?
curious how it works and if there are ways to get parameterized data back from it.
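
On getting structured data back: as far as I know, Tesseract can write TSV output (for example `tesseract invoice.png out tsv`) with one row per detected element, including a bounding box and a confidence value. Below is a minimal sketch, assuming that column layout, which parses such TSV into word boxes and picks the words inside a rectangle, which is roughly what the overlay idea would need. The sample rows are made up for illustration, and the helper names are hypothetical.

```python
import csv
import io


def parse_tesseract_tsv(tsv_text, min_conf=0.0):
    """Parse TSV text into a list of word dicts with bounding boxes."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    words = []
    for row in reader:
        text = (row.get("text") or "").strip()
        conf = float(row["conf"])
        # Structural rows (page, block, line) have no text and conf -1.
        if not text or conf < min_conf:
            continue
        words.append({
            "text": text,
            "left": int(row["left"]),
            "top": int(row["top"]),
            "width": int(row["width"]),
            "height": int(row["height"]),
            "conf": conf,
        })
    return words


def words_in_region(words, left, top, right, bottom):
    """Join the words whose box lies fully inside the given rectangle."""
    inside = [
        w["text"] for w in words
        if w["left"] >= left and w["top"] >= top
        and w["left"] + w["width"] <= right
        and w["top"] + w["height"] <= bottom
    ]
    return " ".join(inside)


# Made-up rows in the assumed TSV column layout.
SAMPLE_TSV = "\n".join([
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num"
    "\tleft\ttop\twidth\theight\tconf\ttext",
    "1\t1\t0\t0\t0\t0\t0\t0\t600\t800\t-1\t",
    "5\t1\t1\t1\t1\t1\t40\t50\t80\t20\t96\tInvoice",
    "5\t1\t1\t1\t1\t2\t130\t50\t30\t20\t95\tNo:",
    "5\t1\t1\t1\t1\t3\t170\t50\t120\t20\t91\tINV-2019-0042",
])

words = parse_tesseract_tsv(SAMPLE_TSV)
print(words_in_region(words, 160, 40, 300, 80))
```

The rectangle coordinates would come from whatever the user drew on the page; how well this works on a pile of random invoices is exactly the open question here.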

@stale

stale bot commented Jan 24, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix This will not be worked on label Jan 24, 2020
@stale stale bot closed this as completed Feb 23, 2020
@madmath03 madmath03 reopened this Apr 19, 2020
@stale stale bot removed the wontfix This will not be worked on label Apr 19, 2020
@raveslave
Author

anyone been looking into this lately?

@gio3166

gio3166 commented Apr 12, 2021

Hello,

I am currently looking for something like this to use with ERPNext: converting scanned or email-received PDF purchase invoices to text (or even JSON) and, with the extracted data, automatically creating a purchase invoice in ERPNext, plus added functionality for uploading the PDF files from the email and attaching them (or a link) to the relevant purchase invoice.
I'm not a programmer... I'm on the financial side..
There are commercial solutions available for this functionality which means it is possible to create.

@madmath03
Member

> anyone been looking into this lately?

Hi @raveslave,
unfortunately, we did not find the time to look any further into this.

@raveslave
Author

just checking in, anyone willing to co-sponsor?

@bharath-kumarn

I need to extract key value pairs from PDF tables

@bharath-kumarn

bharath-kumarn commented Oct 12, 2021

@raveslave I need to extract key value pairs from PDF tables

@imbraintl

Any progress with this on ERPNext?
