selective ocr to extract key/value data #15

raveslave · 2019-11-23T12:59:14Z

Wouldn't it be cool to offer this feature?
Basically, let the user draw an overlay that helps locate each key and value, which can then be mapped to the relevant document type -> field in ERPNext!

madmath03 · 2019-11-25T17:51:14Z

Hi @raveslave ,

Thanks for sharing this idea. It's really interesting and looks really cool.

I have a few doubts though about the usability as it seems rather complicated to develop or even to use.

First, relying on the position of text seems like it would break easily if the format or the invoice "source" changes (shops do not always have the same invoice layout, and it may change over time).
You would also have to manually map every text section to a DocType field each time you read a document. This would make it impossible to do bulk imports.
Maybe the previous mappings could be stored to "guide" users on recurring invoices (roughly the same position as in an older one), as you mentioned, but that could mean storing and reading a lot of data to provide this.
This is hardly applicable to multi-page PDF documents.

Though it seems hard to provide this, we will still take a look at it as this definitely goes in the direction we are aiming: importing / generating DocTypes from OCR.

For reference, we're currently more invested in a text-based import using simple regular expressions or text processing libraries: https://appliedmachinelearning.blog/2018/06/30/performing-ocr-by-running-parallel-instances-of-tesseract-4-0-python/
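To give an idea of what such a text-based import could look like, here is a minimal sketch in Python. It assumes the OCR step has already produced plain text; the patterns, the `extract_fields` name, and the ISO date format are illustrative assumptions, not what the app actually does.

```python
import re
from datetime import datetime

# Hypothetical patterns: real invoices will need their own variants.
INVOICE_NO_RE = re.compile(
    r"invoice\s*(?:no\.?|number|#)\s*[:\-]?\s*([A-Z0-9][A-Z0-9\-/]*)",
    re.IGNORECASE,
)
# Assumes ISO dates (YYYY-MM-DD); other formats would need more patterns.
DATE_RE = re.compile(r"\b(\d{4}-\d{2}-\d{2})\b")


def extract_fields(text):
    """Return whichever fields the patterns find in OCR'd text."""
    fields = {}
    match = INVOICE_NO_RE.search(text)
    if match:
        fields["invoice_no"] = match.group(1)
    match = DATE_RE.search(text)
    if match:
        fields["date"] = datetime.strptime(match.group(1), "%Y-%m-%d").date()
    return fields


sample = "ACME Ltd\nInvoice No: INV-2019-0042\nDate: 2019-11-23\nTotal: 120.00"
print(extract_fields(sample))
```

Because nothing here depends on where the text sits on the page, a layout change only breaks the import if the labels themselves change.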

We can keep this open to discuss further if you want to.

raveslave · 2019-11-25T22:29:43Z

pls see comments:

True, but the idea is not to rely on the rectangle position; rather, have a template that teaches the OCR tool to look for that same string. If it fails on a mandatory one, the script should notify that manual attention is needed.
My idea is that you only do this mapping once per supplier.
Disregarding OCR, most invoices will be PDFs, so the same principle would apply there, but it would be easier to implement.
True, but most of the time the things you're after are the date and invoice number, to populate the bare minimum (mandatory fields) and later match it to a PO.
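A rough sketch of the once-per-supplier template idea, in Python. Everything here is hypothetical (the `TEMPLATES` structure, the anchor strings, the `apply_template` name); it only illustrates looking for a known label string instead of a position, and reporting mandatory fields that were not found so they can get manual attention.

```python
import re

# Hypothetical per-supplier templates: each field has an anchor string to
# look for in the text and a flag saying whether the field is mandatory.
TEMPLATES = {
    "ACME Ltd": {
        "invoice_no": {"anchor": "Invoice No", "mandatory": True},
        "date": {"anchor": "Date", "mandatory": True},
        "po_ref": {"anchor": "PO", "mandatory": False},
    },
}


def apply_template(supplier, text):
    """Return (fields, missing_mandatory) for one document's text."""
    fields, missing = {}, []
    for name, rule in TEMPLATES[supplier].items():
        # Match a line starting with the anchor label and take the rest
        # of that line as the value.
        pattern = (
            r"^[ \t]*" + re.escape(rule["anchor"]) + r"[ \t]*[:#]?[ \t]*(\S.*)$"
        )
        match = re.search(pattern, text, re.MULTILINE | re.IGNORECASE)
        if match:
            fields[name] = match.group(1).strip()
        elif rule["mandatory"]:
            missing.append(name)
    return fields, missing


fields, missing = apply_template("ACME Ltd", "Invoice No: INV-42\nTotal: 10")
if missing:
    print("Manual attention needed, missing:", missing)
```

The same lookup works on text pulled straight from a digital PDF, which is why the OCR step is optional for this approach.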

raveslave · 2019-11-25T22:31:17Z

re: tesseract
cool tech, have you tried it on a pile of random invoices?
curious how it works and if there are ways to get parameterized data back from it.

stale · 2020-01-24T23:01:05Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

raveslave · 2021-03-11T21:55:09Z

anyone been looking into this lately?

gio3166 · 2021-04-12T12:22:51Z

Hello,

I am currently looking for something like this to use with ERPNext: converting scanned or email-received PDF purchase invoices to text (or even JSON) and, with the needed data, automatically creating a purchase invoice in ERPNext, plus added functionality for uploading the PDF files from the email and attaching (or linking) them to the relevant purchase invoice.
I'm not a programmer... I'm on the financial side..
There are commercial solutions available for this functionality which means it is possible to create.

madmath03 · 2021-04-12T17:16:39Z

> anyone been looking into this lately?

Hi @raveslave,
unfortunately, we did not find the time to look any further into this.

raveslave · 2021-09-18T14:30:25Z

just checking in, anyone willing to co-sponsor?

bharath-kumarn · 2021-10-12T09:51:00Z

I need to extract key value pairs from PDF tables

bharath-kumarn · 2021-10-12T09:51:25Z

@raveslave I need to extract key value pairs from PDF tables

imbraintl · 2023-11-29T02:44:09Z

Any progress with this on ERPNext?

raveslave added the enhancement label Nov 23, 2019

raveslave assigned madmath03 Nov 23, 2019

madmath03 assigned AminovE99 Nov 23, 2019

stale bot added the wontfix label Jan 24, 2020

stale bot closed this as completed Feb 23, 2020

madmath03 reopened this Apr 19, 2020

stale bot removed the wontfix label Apr 19, 2020

madmath03 added the pinned label Apr 19, 2020
