Taming 12,773 Contracts for $39.26!

Overview

This repository contains the scripts and methodology used to analyze 12,773 contracts extracted from SEC filings using AI models. The project demonstrates the potential of Large Language Models (LLMs) to extract metadata from legal documents at scale and at a fraction of the cost of traditional methods.

You can read more on my Medium blog at synthetic.lawyer.

Key Features

Processes 12,773 agreements from the SEC EDGAR database
Utilizes NuExtract, GPT and Google's Gemini for metadata extraction
Implements structured outputs for consistent data parsing
Achieves a cost of approximately $0.003 per agreement

Dataset

The dataset used in this project consists of 12,773 agreements extracted from SEC filings, each labeled as "Exhibit 10" in the EDGAR database. These documents represent a variety of material contracts that public companies are required to disclose.

The full dataset is available on Hugging Face: arthrod/taming-12773 (https://huggingface.co/datasets/arthrod/taming-12773)

Methodology

Data Collection: Agreements were sourced from the SEC's EDGAR database.
AI Models: The project primarily uses GPT-4o-mini and Google's Gemini Flash 1.5 for metadata extraction, alongside NuExtract-1.5.
Structured Outputs: A schema was defined to extract key metadata elements from each agreement.
Processing: Each document was processed through the AI models to extract structured data.
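
As a rough illustration of the structured-output step, the sketch below defines a small record type and parses a model's JSON reply into it. The class name, field names, types, and the sample reply are assumptions made for this example, based only on the metadata elements listed under Results; the actual schema lives in the scripts.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AgreementMetadata:
    """Hypothetical record; field names are illustrative, not the repo's schema."""
    agreement_type: str
    parties: List[str] = field(default_factory=list)
    effective_date: Optional[str] = None    # e.g. an ISO 8601 string, None if absent
    termination_date: Optional[str] = None
    governing_law: Optional[str] = None


def parse_model_output(raw: str) -> AgreementMetadata:
    """Parse a model's JSON reply, dropping any keys outside the schema."""
    data = json.loads(raw)
    allowed = AgreementMetadata.__dataclass_fields__.keys()
    return AgreementMetadata(**{k: v for k, v in data.items() if k in allowed})


# Sample reply invented for illustration only; "confidence" is not in the schema.
sample_reply = (
    '{"agreement_type": "Employment Agreement", '
    '"parties": ["Acme Corp", "Jane Doe"], '
    '"governing_law": "Delaware", "confidence": 0.9}'
)
record = parse_model_output(sample_reply)
print(record)
```

Fields the model omits stay as None, so downstream code can tell "not found" apart from an empty string.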

Scripts

This repository contains the following key scripts:

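Test script (testesecba.py).
Extraction scripts, one per model: NuExtract-1.5, hosted on Shadeform using vLLM (secbaprocessnu.py); GPT-4o-mini, called asynchronously through the OpenAI API (secbaprocessgptasync.py); and Gemini Flash 1.5 through the Vertex AI API (secbaprocessvertex.py).
Judge script, using GPT-4o-mini (secbaprocessgptjudge.py).
Normalization script (normalization.py).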
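
To give a feel for how an async extraction script can work through thousands of documents, here is a minimal sketch of bounded concurrency with asyncio. It is not the code from this repository: `extract_metadata` is a hypothetical stand-in that makes no API call, and `process_all`, `max_concurrency`, and the sample documents are assumptions for illustration.

```python
import asyncio
from typing import Dict, List


async def extract_metadata(doc_id: str, text: str) -> Dict[str, str]:
    """Hypothetical stand-in for a model call; a real script would call an API here."""
    await asyncio.sleep(0)  # yield control, as a network request would
    return {"doc_id": doc_id, "chars": str(len(text))}


async def process_all(docs: Dict[str, str], max_concurrency: int = 8) -> List[Dict[str, str]]:
    """Run extract_metadata over every document, capping in-flight requests."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(doc_id: str, text: str) -> Dict[str, str]:
        async with semaphore:  # at most max_concurrency calls run at once
            return await extract_metadata(doc_id, text)

    tasks = [bounded(doc_id, text) for doc_id, text in docs.items()]
    # gather returns results in the same order as the input tasks
    return await asyncio.gather(*tasks)


# Placeholder documents, invented for illustration.
docs = {"ex10-1": "This Employment Agreement ...", "ex10-2": "This Lease ..."}
results = asyncio.run(process_all(docs, max_concurrency=2))
print(results)
```

The semaphore is the design choice worth noting: it keeps throughput high while staying under a provider's rate limits.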

Results

Successfully processed 12,773 documents
Total cost: $39.26 (approximately $0.003 per agreement)
Extracted metadata includes:
- Agreement type
- Parties involved
- Effective date
- Termination date
- Governing law
- And more (see full schema in any of the scripts)
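
The per-agreement figure follows directly from the two totals reported above. A quick check, with variable names chosen just for this example:

```python
total_cost_usd = 39.26    # total cost reported in this README
num_agreements = 12_773   # documents reported as processed

cost_per_agreement = total_cost_usd / num_agreements
print(f"${cost_per_agreement:.4f} per agreement")  # prints "$0.0031 per agreement"
```

That is a little over $0.003 per agreement, which matches the rounded figure quoted in this README.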

Future Work

Integration with Claude and other AI models
Expansion of the metadata schema
Performance comparison between different AI models
Exploration of use cases in due diligence and contract management

Contributing

Contributions to this project are welcome! Please feel free to submit a Pull Request.

License

MIT

Contact

For questions or feedback, please email me at [email protected].

Remember to star this repo if you find it useful, and happy coding!
