Release history
===============

.. currentmodule:: skrub

Ongoing development
-------------------

Skrub is a young package under fast-moving development, and backward compatibility is not guaranteed between releases.

New features
~~~~~~~~~~~~

Changes
~~~~~~~

Bug fixes
~~~~~~~~~

Maintenance
~~~~~~~~~~~

Release 0.4.1
-------------

Changes
~~~~~~~

Bug fixes
~~~~~~~~~

Maintenance
~~~~~~~~~~~

Release 0.4.0
-------------

Highlights
~~~~~~~~~~

  • The :class:`TextEncoder` can extract embeddings from a string column with a deep-learning language model (possibly downloaded from the HuggingFace Hub).
  • Several improvements to the :class:`TableReport`, such as better support for scripts other than the Latin alphabet in bar-plot labels, smaller report sizes, and clipping outliers to better show the details of distributions in histograms. See the full changelog for details.
  • The :class:`TableVectorizer` can now drop columns whose fraction of null values exceeds a user-chosen threshold.
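The null-fraction dropping in the last highlight can be sketched in plain pandas. This is an illustrative stand-in, not skrub's actual API: the function name and the 0.5 threshold below are invented for the example.

```python
import pandas as pd

def drop_mostly_null(df, threshold=0.5):
    """Drop columns whose fraction of null values exceeds `threshold`."""
    null_fraction = df.isna().mean()  # per-column fraction of nulls
    kept = null_fraction[null_fraction <= threshold].index
    return df[kept]

df = pd.DataFrame({
    "full": [1, 2, 3, 4],
    "sparse": [None, None, None, 4.0],  # 75% null -> dropped at threshold 0.5
})
cleaned = drop_mostly_null(df)
print(list(cleaned.columns))  # ['full']
```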

New features
~~~~~~~~~~~~

Major changes
~~~~~~~~~~~~~

Minor changes
~~~~~~~~~~~~~

Bug fixes
~~~~~~~~~

Release 0.3.1
-------------

Minor changes
~~~~~~~~~~~~~

Release 0.3.0
-------------

Highlights
~~~~~~~~~~

  • Polars dataframes are now supported across all skrub estimators.
  • :class:`TableReport` generates an interactive report for a dataframe. A gallery of precomputed examples is available.

Major changes
~~~~~~~~~~~~~

Minor changes
~~~~~~~~~~~~~

Release 0.2.0
-------------

Major changes
~~~~~~~~~~~~~

Minor changes
~~~~~~~~~~~~~

skrub release 0.1.1
-------------------

This is a bugfix release to adapt to the most recent versions of pandas (2.2) and scikit-learn (1.5). There are no major changes to the functionality of skrub.

skrub release 0.1.0
-------------------

Major changes
~~~~~~~~~~~~~

Minor changes
~~~~~~~~~~~~~

Before skrub: dirty_cat
-----------------------

Skrub was born from the dirty_cat package.

Dirty-cat release 0.4.1
-----------------------

Major changes
~~~~~~~~~~~~~

Minor changes
~~~~~~~~~~~~~

Dirty-cat release 0.4.0
-----------------------

Major changes
~~~~~~~~~~~~~

Minor changes
~~~~~~~~~~~~~

Bug fixes
~~~~~~~~~

Dirty-cat release 0.3.0
-----------------------

Major changes
~~~~~~~~~~~~~

Notes
~~~~~

Dirty-cat release 0.2.2
-----------------------

Bug fixes
~~~~~~~~~

Dirty-cat release 0.2.1
-----------------------

Major changes
~~~~~~~~~~~~~

Bug fixes
~~~~~~~~~

Notes
~~~~~

Dirty-cat release 0.2.0
-----------------------

Also see pre-release 0.2.0a1 below for additional changes.

Major changes
~~~~~~~~~~~~~

Notes
~~~~~

Dirty-cat release 0.2.0a1
-------------------------

Version 0.2.0a1 is a pre-release. To try it, you have to install it manually using::

    pip install --pre dirty_cat==0.2.0a1

or from the GitHub repository::

    pip install git+https://github.com/dirty-cat/dirty_cat.git

Major changes
~~~~~~~~~~~~~

Bug fixes
~~~~~~~~~

Dirty-cat release 0.1.1
-----------------------

Major changes
~~~~~~~~~~~~~

Bug fixes
~~~~~~~~~

Dirty-cat release 0.1.0
-----------------------

Major changes
~~~~~~~~~~~~~

Bug fixes
~~~~~~~~~

Dirty-cat release 0.0.7
-----------------------

  • MinHashEncoder: Added minhash_encoder.py and fast_hash.py files that implement minhash encoding through the :class:`MinHashEncoder` class. This method allows for fast and scalable encoding of string categorical variables.
  • datasets.fetch_employee_salaries: changed the download origin for employee_salaries.
    • The function now returns a Bunch with a dataframe under the field "data", rather than the path to the CSV file.
    • The field "description" has been renamed to "DESCR".
  • SimilarityEncoder: Fixed a bug when using the Jaro-Winkler distance as a similarity metric. Our implementation now accurately reproduces the behaviour of the python-Levenshtein implementation.
  • SimilarityEncoder: Added a handle_missing attribute to allow encoding with missing values.
  • TargetEncoder: Added a handle_missing attribute to allow encoding with missing values.
  • MinHashEncoder: Added a handle_missing attribute to allow encoding with missing values.
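As a rough illustration of the minhash idea behind :class:`MinHashEncoder`: each string is reduced to a set of character n-grams, and each signature component is the minimum of a seeded hash over those n-grams, so similar strings share many identical components. The helper names and MD5-based hashing below are invented for this sketch and are not the library's implementation.

```python
import hashlib

def char_ngrams(s, n=3):
    """Character n-grams of a string (the units that get hashed)."""
    s = f" {s} "  # pad so short strings still produce n-grams
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def minhash_signature(s, n_components=8, n=3):
    """One minimum over seeded hashes of the n-grams per component."""
    grams = char_ngrams(s, n)
    signature = []
    for seed in range(n_components):
        signature.append(min(
            int.from_bytes(hashlib.md5(f"{seed}|{g}".encode()).digest()[:8], "big")
            for g in grams
        ))
    return signature

sig_a = minhash_signature("accountant")
sig_b = minhash_signature("acountant")  # typo: overlapping n-grams
matches = sum(x == y for x, y in zip(sig_a, sig_b))
print(f"{matches}/8 components match")
```

Because the signature has fixed length regardless of vocabulary size, encoding is stateless and scales to high-cardinality columns.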

Dirty-cat release 0.0.6
-----------------------

  • SimilarityEncoder: Accelerated SimilarityEncoder.transform by:
    • computing the vocabulary count vectors in fit instead of transform;
    • computing the similarities in parallel using joblib. This option can be turned on/off via the n_jobs attribute of the :class:`SimilarityEncoder`.
  • SimilarityEncoder: Fixed a bug that prevented a :class:`SimilarityEncoder` from being created when categories was a list.
  • SimilarityEncoder: Set the dtype passed to the ngram similarity to float32, which reduces memory consumption during encoding.

Dirty-cat release 0.0.5
-----------------------

  • SimilarityEncoder: Changed the default ngram range to (2, 4), which performs better empirically.
  • SimilarityEncoder: Added a most_frequent strategy to define prototype categories for large-scale learning.
  • SimilarityEncoder: Added a k-means strategy to define prototype categories for large-scale learning.
  • SimilarityEncoder: Added the possibility to use hashing ngrams for stateless fitting with the ngram similarity.
  • SimilarityEncoder: Performance improvements in the ngram similarity.
  • SimilarityEncoder: Exposed a get_feature_names method.
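To give a feel for an ngram similarity over the (2, 4) range mentioned above, here is a simplified set-based (Jaccard) variant. The real dirty_cat ngram similarity operates on count vectors rather than sets; the function names here are invented for the sketch.

```python
def ngram_set(s, ngram_range=(2, 4)):
    """All character n-grams of s for n within ngram_range (inclusive)."""
    lo, hi = ngram_range
    return {s[i:i + n]
            for n in range(lo, hi + 1)
            for i in range(len(s) - n + 1)}

def ngram_similarity(a, b, ngram_range=(2, 4)):
    """Jaccard similarity of n-gram sets: shared grams / all grams."""
    ga, gb = ngram_set(a, ngram_range), ngram_set(b, ngram_range)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)

print(ngram_similarity("accountant", "acountant"))  # close typo -> high
print(ngram_similarity("accountant", "engineer"))   # unrelated -> low
```

Similarities like this are what :class:`SimilarityEncoder` places in each output column: one continuous value per prototype category instead of a hard one-hot match.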