v0.9.0
What's Changed
Added
- New unified edspdf.data API (PDF files, pandas, parquet) and LazyCollection object to efficiently read and write data from and to different formats and sources. This API has been heavily inspired by the edsnlp.data API.
- New unified processing API to select the execution backend via data.set_processing(...), replacing the old accelerators API (which is now deprecated, but still available).
- huggingface-embedding now supports quantization and other AutoModel.from_pretrained kwargs.
- It is now possible to convert a label to multiple labels in the simple-aggregator component:
# To build the "text" field, we will aggregate "title", "body" and "table" lines,
# and output "title" lines in a separate field as well.
label_map = {
    "text": ["title", "body", "table"],
    "title": "title",
}
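To illustrate what such a mapping means, here is a minimal pure-Python sketch of one label feeding several output fields. It is not the simple-aggregator implementation: the `aggregate_lines` helper and the `(label, text)` line representation are assumptions made for this example, and the real component works on PDF text boxes and may join or order lines differently.

```python
from collections import defaultdict

# Same mapping as above: output field -> source line label(s).
label_map = {
    "text": ["title", "body", "table"],
    "title": "title",
}


def aggregate_lines(lines, label_map):
    """Group (label, text) pairs into output fields.

    Hypothetical helper for illustration only, not part of edspdf.
    """
    fields = defaultdict(list)
    for label, text in lines:
        for field, sources in label_map.items():
            # A mapping value may be a single label or a list of labels.
            if isinstance(sources, str):
                sources = [sources]
            if label in sources:
                fields[field].append(text)
    # Join the collected lines of each field with newlines.
    return {field: "\n".join(parts) for field, parts in fields.items()}


lines = [
    ("title", "Report"),
    ("body", "First paragraph."),
    ("table", "a | b"),
    ("footer", "page 1"),  # not referenced by label_map, so it is dropped
]
result = aggregate_lines(lines, label_map)
```

In this sketch, "title" lines end up in both the "text" and "title" fields, while lines whose label is not listed in the mapping are left out of the output.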
Fixed
- huggingface-embedding now resizes bbox features for large PDFs, instead of making the model crash
- huggingface-embedding and sub-box-cnn-pooler now handle empty PDFs correctly
Pull Requests
- API update (data & processing) by @percevalw in #25
Full Changelog: v0.8.1...v0.9.0