-
Warning: technically astute newbie (read: ¡novato peligroso!)

The input is a PDF of a complex financial report that presents vendor, customer, and payment information in the report "header". There are no images to extract. The output will be a spreadsheet, so I can verify it against the input PDF, and a JSON file for importing into a DBMS. It works and is accurate until I hit the report detail section. There, some of the values do not end up in the right places in the output spreadsheet.

The financial report contains one or more "transaction report" sections, similar to invoice line items. However, each of these is like a mini report. Each transaction report section includes a "transaction header" row with four data elements, a "transaction detail" row that has ten columns, and a "transaction totals" section with two rows of one column each. A transaction report section can have one or more transaction detail rows.

The data extraction for the transaction header works using only the pydantic contract class definitions. The data extraction for the transaction detail section grabs the values but places them in the wrong columns in the spreadsheet.

I think I need to employ RAG of some kind. I have been studying Label Studio to map the transaction section, but I know very little about it. Further, I don't know where to introduce that into the pipeline. Perhaps the "advanced mapping" is related? I would be grateful for any links to articles and/or advice.
Replies: 2 comments
-
Hello @mophilly! First, given the complexity of the work, what model are you using? The more complex the use case, the more capable the model should be. Also, very important: please use vision for this use case. What vision does is pass the image of the PDF page directly to the model, which removes 95% of the flakiness. Also, what documentLoader are you using? I advise you to use something like Pypdf if it is a pure PDF.
A spreadsheet? ExtractThinker only outputs pydantic objects, and from those JSON. You can convert that to a spreadsheet afterwards. The report with the financial lines will be no problem. Look at the test extractor and how it extracts invoice lines and other things such as charts. Can you try this solution:
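What follows is a minimal sketch of a nested contract for the report shape described in the question, not the code from the original reply. It uses plain pydantic `BaseModel` classes; every class and field name is hypothetical, and only four of the ten detail columns are spelled out. The idea is to give each column its own named, described field so the model has less room to put a value in the wrong place.

```python
from typing import List

from pydantic import BaseModel, Field


class TransactionHeader(BaseModel):
    # Four data elements per transaction header row (hypothetical names).
    transaction_id: str
    transaction_date: str
    account: str
    description: str


class TransactionDetail(BaseModel):
    # One field per report column, declared in left-to-right order.
    item_code: str = Field(description="Column 1: item or service code")
    quantity: float = Field(description="Column 2: quantity")
    unit_price: float = Field(description="Column 3: price per unit")
    amount: float = Field(description="Column 4: line amount")
    # Add one field for each remaining column of the real report.


class TransactionTotals(BaseModel):
    # Two rows with one value each (hypothetical names).
    subtotal: float
    total: float


class TransactionReport(BaseModel):
    header: TransactionHeader
    details: List[TransactionDetail]  # one or more detail rows
    totals: TransactionTotals


class FinancialReport(BaseModel):
    vendor: str
    customer: str
    transactions: List[TransactionReport]


# Nested dicts are validated into the nested models on construction.
sample = FinancialReport(
    vendor="Acme Supply",
    customer="Example Co",
    transactions=[
        {
            "header": {
                "transaction_id": "T-001",
                "transaction_date": "2025-01-15",
                "account": "4000",
                "description": "Monthly service",
            },
            "details": [
                {"item_code": "A-1", "quantity": 2, "unit_price": 10.0, "amount": 20.0}
            ],
            "totals": {"subtotal": 20.0, "total": 20.0},
        }
    ],
)
```

In ExtractThinker these classes would presumably derive from its contract base class rather than `BaseModel` directly, and the top-level class would be the one handed to the extractor with vision enabled; check the library's own examples for the exact calls.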
The model above is made up, so please adapt it to your needs. This simply passes the page image to the model, and from what you describe I expect that will be enough. If it is not enough, you may need to use Splitting, but that is unlikely if you use frontier models (e.g. gpt4o, sonnet 3.5, llama 3.3, Gemini 2.0).
No! RAG is a different paradigm and it is not needed here, believe me :). The whole document will always fit inside the context, so RAG will never be needed for extraction. "Advanced mapping" I will publish in 2-3 hours in a .28v, and it is not directly related to this. It is for long responses, where you have so many pages that the results do not all come back. LLMs only return about 4000 tokens at a time, and I don't think that is your use case. I will ping you once it is deployed so you can give it a test; maybe it fixes your problems. Tell me if this is enough. Thanks!
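For the spreadsheet the question asks for, one option is to flatten the extracted JSON after extraction, using only the Python standard library. This is a sketch under assumptions: the JSON keys (`transactions`, `header`, `details`) and the column names are hypothetical and need to match your own contract. A fixed column list is used so that each value is written under its named column regardless of the key order in the JSON.

```python
import csv
import io
import json
from typing import List

# Hypothetical column names; the order here is the order in the spreadsheet.
HEADER_COLUMNS = ["transaction_id", "transaction_date"]
DETAIL_COLUMNS = ["item_code", "quantity", "unit_price", "amount"]


def flatten(report: dict) -> List[dict]:
    """Return one flat row per detail line, repeating the header values."""
    rows = []
    for txn in report["transactions"]:
        for detail in txn["details"]:
            row = {key: txn["header"].get(key, "") for key in HEADER_COLUMNS}
            row.update({key: detail.get(key, "") for key in DETAIL_COLUMNS})
            rows.append(row)
    return rows


def to_csv(report: dict) -> str:
    """Write the flattened rows as CSV text with a fixed column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=HEADER_COLUMNS + DETAIL_COLUMNS)
    writer.writeheader()
    writer.writerows(flatten(report))
    return buffer.getvalue()


# Example input shaped like the JSON an extraction step might produce.
report = json.loads("""
{
  "transactions": [
    {
      "header": {"transaction_id": "T-001", "transaction_date": "2025-01-15"},
      "details": [
        {"amount": 20.0, "item_code": "A-1", "quantity": 2, "unit_price": 10.0},
        {"item_code": "B-2", "quantity": 1, "unit_price": 5.5, "amount": 5.5}
      ]
    }
  ]
}
""")

csv_text = to_csv(report)
```

The resulting text can be saved with a `.csv` extension and opened in any spreadsheet application for comparison against the source PDF.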
-
Thanks for the detailed answer. It is very kind of you to provide this guidance. I haven't used vision, but I will certainly study it. That requires an installation of tesseract, not just pytesseract, IIRC. The source PDF files range from two to 30 pages, typically two to four. I will add your solution to my work as soon as I return to the office, in about three hours.