directing order data extraction #125

mophilly · 2024-12-17T16:44:33Z

mophilly
Dec 17, 2024

Warning: technically astute newbie (read: ¡novato peligroso!)
I have crafted a simple document pipeline based on the basic extraction example.

The input is a PDF of a complex financial report that presents vendor, customer, and payment information in the report "header". No images to extract. The output will be a spreadsheet so I can verify with input PDF and a json file for importing into a dbms. It is working and is accurate until I hit the report detail section. There, some of the values do not end up in the right places in output spreadsheet.

The financial report contains one or more "transaction report" sections, similar to invoice line items. However, each of these are like a mini report. Each transaction report section includes a "transaction header" row with four data elements, a "transaction detail" row that has ten columns, and a "transaction totals" section having two rows with one columns each. The transaction report section can have one or more transaction detail rows.

The data extract for the transaction header is working using only the pedantic contract class definitions. The data extraction for the transaction detail section is grabbing the values but placing them in the wrong columns in spreadsheet.

I think I need to employ RAG of some kind. I have been studying Label Studio to map the transaction section, but I know very little about it. Further, I don't know where to introduce that into the pipeline. Perhaps the "advanced mapping" is related?

I would be grateful for any links to articles and/or advice.

Answered by enoch3712

Dec 17, 2024

Hello @mophilly!

Firstly, given the complexity of the work, what model are you using? Because more complex the use case, more the model should be. Also, very imporant, please use vision for this use cases. What vision do, is passing the image of the PDF page directly to the model, that removes 95% of the flackiness

Also, what documentLoader are you using? I advice you to use something like Pypdf if is a pure PDF

The output will be a spreadsheet so I can verify with input PDF and a json file for importing into a dbms.

A spreadsheet? ExtractThinker only allows pydantic and then JSON. you can later convert to JSON

The report with the financial lines will be no problem. Look at test extractor…

View full answer

enoch3712 · 2024-12-17T18:11:29Z

enoch3712
Dec 17, 2024
Maintainer

Hello @mophilly!

Firstly, given the complexity of the work, what model are you using? Because more complex the use case, more the model should be. Also, very imporant, please use vision for this use cases. What vision do, is passing the image of the PDF page directly to the model, that removes 95% of the flackiness

Also, what documentLoader are you using? I advice you to use something like Pypdf if is a pure PDF

The output will be a spreadsheet so I can verify with input PDF and a json file for importing into a dbms.

A spreadsheet? ExtractThinker only allows pydantic and then JSON. you can later convert to JSON

The report with the financial lines will be no problem. Look at test extractor and how it extracts lines of the invoice and other things like charts and so on.

Can you try this solution:

from typing import List, Optional
from pydantic import Field

from extract_thinker.extractor import Extractor
from extract_thinker.models.contract import Contract # the same as pydantic.BaseModel

class TransactionHeader(Contract):
    """
    Represents the 'transaction header' row with four data elements.
    For example:
      - Transaction ID
      - Vendor ID
      - Customer ID
      - Transaction Date
    """
    transaction_id: str = Field(description="Unique identifier for the transaction")
    vendor_id: str = Field(description="Identifier for the vendor")
    customer_id: str = Field(description="Identifier for the customer")
    transaction_date: str = Field(description="Transaction date in YYYY-MM-DD or similar format")

class TransactionDetail(Contract):
    """
    Represents one 'transaction detail' row with ten columns.
    Adjust fields to match the actual columns in your PDF.
    """
    line_number: int = Field(description="Line item number within the transaction")
    item_code: str = Field(description="Item or product code")
    item_description: str = Field(description="Brief description of the item")
    quantity: float = Field(description="Quantity sold or purchased")
    unit_price: float = Field(description="Price per unit of the item")
    extended_price: float = Field(description="Calculated extended price (quantity * unit_price)")
    tax_amount: float = Field(description="Tax applied to this line item")
    discount_amount: float = Field(description="Discount applied to this line item")
    net_amount: float = Field(description="Net amount after tax and discount")
    notes: Optional[str] = Field(None, description="Any additional notes or comments")

class TransactionTotals(Contract):
    """
    Represents the totals section that can have two rows
    (e.g., summary lines). For clarity, structure them as named fields.
    """
    total_tax: float = Field(description="Total tax for this transaction")
    total_amount: float = Field(description="Total amount for this transaction (sum of net amounts)")

class TransactionReport(Contract):
    """
    Represents one complete 'transaction report' section:
    Header, multiple Details, and Totals.
    """
    header: TransactionHeader
    details: List[TransactionDetail]
    totals: TransactionTotals

class FullReport(Contract):
    """
    Represents the entire PDF (if it contains multiple transaction sections),
    or you can keep a single TransactionReport if only one per PDF.
    """
    transactions: List[TransactionReport] = Field(description="List of all transaction sections in the PDF")


test_file_path = ""
extractor = Extractor()
# add document loader if needed
extractor.load_llm("gpt-4o") #use claude sonnet 3.5 or gpt-4o

result = extractor.extract(test_file_path, FullReport, vision=True)

The model above is made up, please adapt to yours needs. This will just use the model with image, and i know will be enough according to your question.

If is not enough, you maybe need to use Splitting, but its unlikely, if you use frontier models (e.g gpt4o, sonnet 3.5, llama 3.3, GEMINI 2.0)

I think I need to employ RAG of some kind.

No! RAG is a different paradigm and its not needed here, believe me :) . All the document wil always fit inside the context, RAG will never be needed for extraction.

"advanced mapping" i will publish in 2/3h on a .28v, and its not directly related to this. Its for Long responses, where you have so many pages that the results are not returning. LLMs only return 4000 tokens at the time, i dont think is your use case. I ping once is deployed and you can give it a test, maybe fixes your problems

Tell me if this is enough. Thanks !

0 replies

mophilly · 2024-12-17T19:02:52Z

mophilly
Dec 17, 2024
Author

Thanks for the detailed answer. It is very kind of you to provide this guidance.
model: gpt 4 mini (I don’t have exact spelling at hand)
Doc loader is using pyPDF (but I will double check)
I added a module that uses openpyxl to take the result as json and output a spreadsheet.

I haven’t used vision but I certainly study it. That requires an installation of tesseract, not just pytesseract, IIRC.

The source PDF files ranges from two to 30 page, typically two to four.

I will add your solution to my work as soon as I return to the office, in about three hours.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

directing order data extraction #125

{{title}}

Replies: 2 comments

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Select a reply

directing order data extraction #125

mophilly Dec 17, 2024

Replies: 2 comments

enoch3712 Dec 17, 2024 Maintainer

mophilly Dec 17, 2024 Author

mophilly
Dec 17, 2024

enoch3712
Dec 17, 2024
Maintainer

mophilly
Dec 17, 2024
Author