I would like to better understand how to guide the extractor and/or parser to assign values to the correct attribute. Links to articles or documents in the project would be great.

In the example model below, I have used the label names on the PDF as the field names in the model, with some concessions made for clarity of purpose. This model is working for the most part. There are two cases where values don't end up in the proper place: one is class Transaction and the other is class StatementTotal.

Note the values of owner_deductions and owner_taxes in the expected result and the actual result. Class Transaction has a similar problem, where values end up in the wrong column and some values are duplicated into columns that should be empty or None.

The source PDF presents only white space where a value would be when there is no value to report, rather than placing a zero in that spot.

Expected output, as json, for StatementTotal, drawn from the source PDF:
Actual result, as json
example pydantic model:
Replies: 1 comment 5 replies
-
Sorry for the delay @mophilly. I just published the new version; please update to 0.0.28.

On this point, it is really simple: just remove the Optionals and add a description for good measure:
This will most likely fix your problem.
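As a rough illustration of that advice (not the snippet from the original reply), here is a minimal sketch assuming Pydantic v2. Only the names `StatementTotal`, `owner_deductions`, and `owner_taxes` come from the question; the `float` types and the description wording, including the "use 0 when blank" hint, are assumptions.

```python
from pydantic import BaseModel, Field


class StatementTotal(BaseModel):
    # Hypothetical types and descriptions; only the field names come from the thread.
    # No Optional and no default, so each field is required in the schema.
    owner_deductions: float = Field(
        description="Amount in the 'Owner Deductions' column; use 0 when the column is blank."
    )
    owner_taxes: float = Field(
        description="Amount in the 'Owner Taxes' column; use 0 when the column is blank."
    )


# The JSON schema is what a schema-driven extractor would typically see.
schema = StatementTotal.model_json_schema()
print(schema["required"])
print(schema["properties"]["owner_taxes"]["description"])
```

Because the fields are required and each carries its own description, the model has less room to leave a column empty or to copy a neighbouring value into it.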
The primary parsing is done by instructor, the most popular parsing library, which basically uses reflection and other patterns to enforce structure. I go into detail in the extraction section here: https://levelup.gitconnected.com/claude-3-5-the-king-of-document-intelligence-f57bea1d209d

instructor: https://github.com/instructor-ai/instructor

There is also pydanticai, which does the same thing but is built by the Pydantic team: https://ai.pydantic.dev/

Pydantic will be added to the code shortly, so you will be able to pick an "llm agent" to make the extraction.
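To make "enforce structure" a little more concrete, the sketch below shows only the Pydantic validation step that such libraries build on; it does not call instructor or any LLM. The helper `parse_statement_total` and the sample payloads are hypothetical, and the field types are assumptions.

```python
from typing import Optional

from pydantic import BaseModel, Field, ValidationError


class StatementTotal(BaseModel):
    # Field names come from the thread; types and descriptions are assumed.
    owner_deductions: float = Field(description="Total owner deductions.")
    owner_taxes: float = Field(description="Total owner taxes.")


def parse_statement_total(raw_json: str) -> Optional[StatementTotal]:
    """Hypothetical helper: validate raw JSON text against the model."""
    try:
        return StatementTotal.model_validate_json(raw_json)
    except ValidationError:
        # A parsing library could react here, for example by asking the model again.
        return None


ok = parse_statement_total('{"owner_deductions": 12.5, "owner_taxes": 3.0}')
bad = parse_statement_total('{"owner_deductions": 12.5}')  # owner_taxes is missing
print(ok)
print(bad)
```

With required fields, a response that omits a value fails validation instead of silently producing a half-filled object.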