Extract metadata

Sends an AI request to supported Large Language Models (LLMs) and returns extracted data as a set of key-value pairs, using a Box metadata template.

Official documentation

Setup

Because we are using Box metadata, we first need to create a metadata template. Run the script src/gen_sample_metadata.py to create a sample metadata template.

python src/gen_sample_metadata.py

Now, if you navigate to the Box Admin Console and go to Metadata Templates, you should see a new template called "Leases Workshop". Selecting it should show these fields:

Leases metadata template

Workshop

To extract document data using the structure of a Box metadata template, you need to define the template key and the template scope. The template key is the unique identifier of the metadata template. The scope identifies the enterprise that owns the template; here it is the string enterprise_ followed by your enterprise ID.

# Metadata template config
template_key = "leases_workshop"
template_type = "metadata_template"
template_scope = f"enterprise_{conf.enterprise_id}"

# Documents
items = [AiItemBase(id=file.id, type="file") for file in hab_files]

# Extract metadata from single document
item = items[0]
metadata_template = CreateAiExtractStructuredMetadataTemplate(
    template_key=template_key,
    type=template_type,
    scope=template_scope,
)
ai_response = client.ai.create_ai_extract_structured(
    items=[item],
    metadata_template=metadata_template,
).to_dict()
print_ai_response("Extract metadata from single document", ai_response)

The above returns the following:

Description: Extract metadata from single document

Answer:
{'number_of_bedrooms': 1, 'lessee_email': '[email protected]', 'lessee_name': 'Marie Tharp', 'agreement_date': '2024-04-24T00:00:00Z', 'lease_start_date': '2024-05-01T00:00:00Z', 'lease_end_date': '2027-04-30T00:00:00Z', 'monthly_rent': 3125, 'property_type': 'HAB-1', 'agreement_term': 3, 'property_id': 'HAB-1-01'}

The next step is to take this dictionary and apply it to the document as metadata. This is done by creating a metadata object and updating the document with it.

# Apply metadata to document
metadata = apply_metadata_to_document(client, item.id, template_key, ai_response)
print_ai_response("Metadata applied to document:", metadata)

The above returns the following:

Description: Metadata applied to document:

Answer:
{'property_type': 'HAB-1', 'lease_end_date': '2027-04-30T00:00:00.000Z', 'lessee_email': '[email protected]', 'lease_start_date': '2024-05-01T00:00:00.000Z', 'property_id': 'HAB-1-01', 'agreement_term': 3, 'number_of_bedrooms': 1, 'monthly_rent': 3125, 'agreement_date': '2024-04-24T00:00:00.000Z', 'lessee_name': 'Marie Tharp'}
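
The two dictionaries above hold the same values; only the key order and the timestamp format differ (the applied metadata includes milliseconds). Below is a minimal sketch, using only the standard library, of how one might check that the applied metadata matches the extracted values. The helper names `normalize` and `same_metadata` are hypothetical and are not part of the workshop code.

```python
from datetime import datetime


def normalize(value):
    """Parse ISO 8601 timestamps so '...00Z' and '...00.000Z' compare equal."""
    if isinstance(value, str) and value.endswith("Z"):
        try:
            # Swap the trailing 'Z' for an explicit offset so that
            # datetime.fromisoformat accepts it on older Python versions too.
            return datetime.fromisoformat(value[:-1] + "+00:00")
        except ValueError:
            return value  # not a timestamp, keep the raw string
    return value


def same_metadata(extracted: dict, applied: dict) -> bool:
    """Compare two metadata dicts field by field, ignoring key order."""
    if extracted.keys() != applied.keys():
        return False
    return all(normalize(extracted[k]) == normalize(applied[k]) for k in extracted)


# Two fields taken from the outputs shown above.
extracted = {"monthly_rent": 3125, "lease_start_date": "2024-05-01T00:00:00Z"}
applied = {"lease_start_date": "2024-05-01T00:00:00.000Z", "monthly_rent": 3125}
print(same_metadata(extracted, applied))
```

A check like this can be useful after a batch run to confirm that what Box stored is what the model extracted.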

Go ahead and execute the script src/box_ai_extract_metadata.py to see the data extraction process.

Take a look at the workshop script to see how to extract data from multiple documents.
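
A multi-document run is essentially the single-document call in a loop. Below is a minimal sketch of that pattern, assuming a hypothetical `extract_one` callable that wraps the `create_ai_extract_structured` call shown earlier and returns a dictionary. The actual workshop script may be organized differently. Failures are collected per document so that one bad file does not stop the batch.

```python
from typing import Callable, Dict, List, Tuple


def extract_many(
    file_ids: List[str],
    extract_one: Callable[[str], dict],
) -> Tuple[Dict[str, dict], Dict[str, str]]:
    """Run extract_one on each file ID; return (results, errors) keyed by file ID."""
    results: Dict[str, dict] = {}
    errors: Dict[str, str] = {}
    for file_id in file_ids:
        try:
            results[file_id] = extract_one(file_id)
        except Exception as exc:  # keep going and record what failed
            errors[file_id] = str(exc)
    return results, errors


# Stand-in for the real Box AI call, used only to make this sketch runnable.
def fake_extract(file_id: str) -> dict:
    if file_id == "bad":
        raise ValueError("extraction failed")
    return {"property_id": f"HAB-{file_id}"}


results, errors = extract_many(["1", "2", "bad"], fake_extract)
print(len(results), "extracted,", len(errors), "failed")
```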

In my case it took about 3 minutes to extract data from 50 documents. If you navigate to your Box web app and select a document, you should see the metadata applied to it:

Lease document with metadata

There are a few details involved in working with metadata, from creating a metadata template to applying metadata to a document. The script src/utils/box_metadata.py covers all these steps.
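
One such detail is that Box updates an existing metadata instance through JSON Patch style operations (`op`, `path`, `value`) rather than a plain dictionary. The sketch below builds such operations from an extracted dictionary as plain Python dicts. The function name `build_patch_operations` is hypothetical, and this is an illustration under that assumption, not necessarily how src/utils/box_metadata.py does it.

```python
def build_patch_operations(current: dict, extracted: dict) -> list:
    """Build JSON Patch style operations that bring `current` in line with `extracted`."""
    ops = []
    for key, value in extracted.items():
        # JSON Pointer escaping (RFC 6901): '~' becomes '~0' and '/' becomes '~1'.
        path = "/" + key.replace("~", "~0").replace("/", "~1")
        if key not in current:
            ops.append({"op": "add", "path": path, "value": value})
        elif current[key] != value:
            ops.append({"op": "replace", "path": path, "value": value})
        # Unchanged fields need no operation.
    return ops


# Illustrative values only; `current` stands for metadata already on the file.
current = {"monthly_rent": 3000, "property_id": "HAB-1-01"}
extracted = {"monthly_rent": 3125, "property_id": "HAB-1-01", "agreement_term": 3}
ops = build_patch_operations(current, extracted)
for op in ops:
    print(op)
```

Sending only the changed fields keeps the update small and leaves untouched fields as they were.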