alea-institute/kl3m-data

KL3M Training Data

Collection and Preprocessing of Training Data for KL3M

License: MIT

Description

This ALEA project contains the complete source code to collect and preprocess all training data related to the KL3M embedding and generative models.

Paper

Pending arXiv submission

Citation

Pending arXiv submission

Primary Sources

Summary

TODO: Table

US

  • us/dockets: PACER/RECAP docket sheets via archive.org
  • us/dotgov: filtered .gov domains via direct retrieval
  • us/ecfr: Electronic Code of Federal Regulations (eCFR) via NARA/GPO API
  • us/edgar: SEC EDGAR data via SEC feed
  • us/fdlp: US Federal Depository Library Program (FDLP) via GPO
  • us/fr: Federal Register data via NARA/GPO API
  • us/govinfo: US Government Publishing Office (GPO) data via GovInfo API
  • us/recap: RECAP raw documents via S3
  • us/recap_docs: RECAP attached docs (Word, WordPerfect, PDF, MP3) via S3
  • us/reg_docs: Documents associated with regulations.gov dockets via regulations.gov API
  • us/usc: US Code releases via Office of the Law Revision Counsel (OLRC)
  • us/uspto_patents: USPTO patent grants via USPTO bulk data

EU ("Federal")

  • eu/eurlex_oj: EU Official Journal via Cellar/Europa

UK

  • uk/legislation: All enacted UK legislation via legislation.gov.uk bulk download

Germany

  • de/bundesgesetzblatt: Bundesgesetzblatt (BGBl), 2023 onward, via recht.bund.de
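
The source identifiers listed so far appear to follow a `<jurisdiction>/<source>` naming pattern. As a minimal sketch, assuming that pattern holds, the identifiers can be grouped by jurisdiction as shown below. The `SOURCE_IDS` list is copied from the lists above; the `group_by_jurisdiction` helper is hypothetical and is not part of this repository's code.

```python
from collections import defaultdict

# Identifiers copied from the README lists; not read from the repository itself.
SOURCE_IDS = [
    "us/dockets", "us/dotgov", "us/ecfr", "us/edgar", "us/fdlp", "us/fr",
    "us/govinfo", "us/recap", "us/recap_docs", "us/reg_docs", "us/usc",
    "us/uspto_patents", "eu/eurlex_oj", "uk/legislation",
    "de/bundesgesetzblatt",
]


def group_by_jurisdiction(source_ids):
    """Hypothetical helper: map each jurisdiction prefix to its source names."""
    groups = defaultdict(list)
    for source_id in source_ids:
        # Split on the first "/" only: "us/recap_docs" -> ("us", "recap_docs").
        jurisdiction, _, name = source_id.partition("/")
        groups[jurisdiction].append(name)
    return dict(groups)


if __name__ == "__main__":
    for jurisdiction, names in group_by_jurisdiction(SOURCE_IDS).items():
        print(f"{jurisdiction}: {len(names)} source(s)")
```

Grouping on the prefix keeps the sketch independent of how the sources are actually organized in the repository, which this README does not describe.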

Australia

Canada

India

Tasks

Extraction

Summarization

Transform and Convert

Installation

TODO

Usage

TODO

License

The source code for this ALEA project is released under the MIT License. See the LICENSE file for details.

Top-level dependencies are all licensed under MIT, BSD-3, or Apache 2.0. Run poetry show --tree for details.

Support

If you encounter any issues or have questions about using this ALEA project, please open an issue on GitHub.

Learn More

To learn more about ALEA and our KL3M models and data, visit the ALEA website.
