v0.9.0

@percevalw percevalw released this 26 Feb 10:42
· 16 commits to main since this release

What's Changed?

Added

  • New unified edspdf.data API (PDF files, pandas, parquet) and LazyCollection object
    to efficiently read / write data from / to different formats & sources. This API
    has been heavily inspired by the edsnlp.data API.
  • New unified processing API to select the execution backend via data.set_processing(...)
    to replace the old accelerators API (which is now deprecated, but still available).
  • huggingface-embedding now supports quantization and other AutoModel.from_pretrained kwargs
  • It is now possible to convert a label to multiple labels in the simple-aggregator component:
# To build the "text" field, we will aggregate "title", "body" and "table" lines,
# and output "title" lines in a separate field as well.
label_map = {
    "text": ["title", "body", "table"],
    "title": "title",
}
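
To make the effect of such a mapping concrete, here is a minimal standalone sketch of the aggregation semantics. It does not use edspdf and is not the simple-aggregator implementation: the aggregate_lines helper, the (label, text) line representation and the sample lines are all hypothetical, chosen only to show how one source label can feed several output fields.

```python
from typing import Dict, List, Sequence, Tuple, Union

# Hypothetical stand-in for classified PDF lines: (label, text) pairs.
Line = Tuple[str, str]
LabelMap = Dict[str, Union[str, List[str]]]


def aggregate_lines(lines: Sequence[Line], label_map: LabelMap) -> Dict[str, str]:
    """Build one text field per key of label_map (illustrative helper only)."""
    fields = {}
    for field, sources in label_map.items():
        # A mapping value may be a single label or a list of labels.
        if isinstance(sources, str):
            sources = [sources]
        wanted = set(sources)
        # Keep the original line order and join the matching lines.
        fields[field] = "\n".join(text for label, text in lines if label in wanted)
    return fields


label_map = {
    "text": ["title", "body", "table"],
    "title": "title",
}

# Made-up sample lines, in reading order.
lines = [
    ("title", "Discharge summary"),
    ("body", "The patient was admitted on Monday."),
    ("footer", "Page 1"),
    ("table", "Hb 13.2 g/dL"),
]

result = aggregate_lines(lines, label_map)
```

In this sketch the "title" line ends up in both result["text"] and result["title"], while the "footer" line, which no field asks for, is dropped.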

Fixed

  • huggingface-embedding now resizes bbox features for large PDFs instead of crashing the model
  • huggingface-embedding and sub-box-cnn-pooler now handle empty PDFs correctly

Pull Requests

Full Changelog: v0.8.1...v0.9.0