.. currentmodule:: skrub
Skrub is a very recent package. It is currently undergoing fast development and backward compatibility is not ensured.
- A new parameter
verbose
has been added to the :class:`TableReport` to toggle on or off the printing of progress information when a report is being generated. :pr:`1182` by :user:`Priscilla Baah<priscilla-b>`. - A parameter
verbose
has been added to the :func:`patch_display` to toggle on or off the printing of progress information when a table report is being generated. :pr:`1188` by :user:`Priscilla Baah<priscilla-b>`. - :func:`tabular_learner` accepts the alias
"regression"
for the option"regressor"
and"classification"
for"classifier"
. :pr:`1180` by :user:`Mojdeh Rastgoo <mrastgoo>`.
- Generating a
TableReport
could have an effect on the matplotib configuration which could cause plots not to display inline in jupyter notebooks any more. This has been fixed in skrub in :pr:`1172` by :user:`Jérôme Dockès <jeromedockes>` and the matplotlib issue can be tracked here. - The labels on bar plots in the
TableReport
for columns of object dtypes that have a repr spanning multiple lines could be unreadable. This has been fixed in :pr:`1196` by :user:`Jérôme Dockès <jeromedockes>`. - Improve the performance of :func:`deduplicate` by removing some unnecessary computations. :pr:`1193` by :user:`Jérôme Dockès <jeromedockes>`.
- Make
skrub
compatible with scikit-learn 1.6. :pr:`1169` by :user:`Guillaume Lemaitre <glemaitre>`.
- The :class:`TextEncoder` can extract embeddings from a string column with a deep learning language model (possibly downloaded from the HuggingFace Hub).
- Several improvements to the :class:`TableReport` such as better support for other scripts than the latin alphabet in the bar plot labels, smaller report sizes, clipping the outliers to better see the details of distributions in histograms. See the full changelog for details.
- The :class:`TableVectorizer` can now drop columns that contain a fraction of null values above a user-chosen threshold.
- The :class:`TextEncoder` is now available to encode string columns with diverse entries. It allows the representation of table entries as embeddings computed by a deep learning language model. The weights of this model can be fetched locally or from the HuggingFace Hub. :pr:`1077` by :user:`Vincent Maladiere <Vincent-Maladiere>`.
- The :func:`column_associations` function has been added. It computes a pairwise measure of statistical dependence between all columns in a dataframe (the same as shown in the :class:`TableReport`). :pr:`1109` by :user:`Jérôme Dockès <jeromedockes>`.
- The :func:`patch_display` function has been added. It changes the display of pandas and polars dataframes in jupyter notebooks to replace them with a :class:`TableReport`. This can be undone with :func:`unpatch_display`. :pr:`1108` by :user:`Jérôme Dockès <jeromedockes>`
- :class:`AggJoiner`, :class:`AggTarget` and :class:`MultiAggJoiner` now require the operations argument. They do not split columns by type anymore, but apply operations on all selected cols. "median" is now supported, "hist" and "value_counts" are no longer supported. :pr:`1116` by :user:`Théo Jolivet <TheooJ>`.
- The :class:`AggTarget` no longer supports y inputs of type list. :pr:`1116` by :user:`Théo Jolivet <TheooJ>`.
The column filter selection dropdown in the tablereport is smaller and its label has been removed to save space. :pr:`1107` by :user:`Jérôme Dockès <jeromedockes>`.
The TableReport now uses the font size of its parent element when inserted into another page. This makes it smaller in pages that use a smaller font size than the browser default such as VSCode in some configurations. It also makes it easier to control its size when inserting it in a web page by setting the font size of its parent element. A few other small adjustments have also been made to make it a bit more compact. :pr:`1098` by :user:`Jérôme Dockès <jeromedockes>`.
Display of labels in the plots of the TableReport, especially for other scripts than the latin alphabet, has improved.
- before, some characters could be missing and replaced by empty boxes.
- before, when the text is truncated, the ellipsis "..." could appear on the wrong side for right-to-left scripts.
Moreover, when the text contains line breaks it now appears all on one line. Note this only affects the labels in the plots; the rest of the report did not have these problems. :pr:`1097` by :user:`Jérôme Dockès <jeromedockes>` and :pr:`1138` by :user:`Jérôme Dockès <jeromedockes>`.
In the TableReport it is now possible, before clicking any of the cells, to reach the dataframe sample table and activate a cell with tab key navigation. :pr:`1101` by :user:`Jérôme Dockès <jeromedockes>`.
The "Column name" column of the "summary statistics" table in the TableReport is now always visible when scrolling the table. :pr:`1102` by :user:`Jérôme Dockès <jeromedockes>`.
Added parameter drop_null_fraction to TableVectorizer to drop columns based on whether they contain a fraction of nulls larger than the given threshold. :pr:`1115` and :pr:`1149` by :user:`Riccardo Cappuzzo <rcap107>`.
The :class:`TableReport` now provides more helpful output for columns of dtype TimeDelta / Duration. :pr:`1152` by :user:`Jérôme Dockès <jeromedockes>`.
The :class:`TableReport` now also reports the number of unique values for numeric columns. :pr:`1154` by :user:`Jérôme Dockès <jeromedockes>`.
The :class:`TableReport`, when plotting histograms, now detects outliers and clips the range of data shown in the histogram. This allows seeing more detail in the shown distribution. :pr:`1157` by :user:`Jérôme Dockès <jeromedockes>`.
- The :class:`TableReport` could raise an exception when one of the columns contained datetimes with time zones and missing values; this has been fixed in :pr:`1114` by :user:`Jérôme Dockès <jeromedockes>`.
- In scikit-learn versions older than 1.4 the :class:`TableVectorizer` could fail on polars dataframes when used with the default parameters. This has been fixed in :pr:`1122` by :user:`Jérôme Dockès <jeromedockes>`.
- The :class:`TableReport` would raise an exception when the input (pandas) dataframe contained several columns with the same name. This has been fixed in :pr:`1125` by :user:`Jérôme Dockès <jeromedockes>`.
- The :class:`TableReport` would raise an exception when a column contained infinite values. This has been fixed in :pr:`1150` by :user:`Jérôme Dockès <jeromedockes>` and :pr:`1151` by Jérôme Dockès.
- For tree-based models, :func:`tabular_learner` now adds handle_unknown='use_encoded_value' to the OrdinalEncoder, to avoid errors with new categories in the test set. This is consistent with the setting of OneHotEncoder used by default in the :class:`TableVectorizer`. :pr:`1078` by :user:`Gaël Varoquaux <gaelvaroquaux>`
- The reports created by :class:`TableReport`, when inserted in an html page (or displayed in a notebook), now use the same font as the surrounding page. :pr:`1038` by :user:`Jérôme Dockès <jeromedockes>`.
- The content of the dataframe corresponding to the currently selected table cell in the TableReport can be copied without actually selecting the text (as in a spreadsheet). :pr:`1048` by :user:`Jérôme Dockès <jeromedockes>`.
- The selection of content displayed in the TableReport's copy-paste boxes has been removed. Now they always display the value of the selected item. When copied, the repr of the selected item is copied to the clipboard. :pr:`1058` by :user:`Jérôme Dockès <jeromedockes>`.
- A "stats" panel has been added to the TableReport, showing summary statistics
for all columns (number of missing values, mean, etc. -- similar to
pandas.info()
) in a table. It can be sorted by each column. :pr:`1056` and :pr:`1068` by :user:`Jérôme Dockès <jeromedockes>`. - The credit fraud dataset is now available with the :func:`fetch_credit_fraud function`. :pr:`1053` by :user:`Vincent Maladiere <Vincent-Maladiere>`.
- Added zero padding for column names in :class:`MinHashEncoder` to improve column ordering consistency. :pr:`1069` by :user:`Shreekant Nandiyawar <Shree7676>`.
- The selection in the TableReport's sample table can now be manipulated with the keyboard. :pr:`1065` by :user:`Jérôme Dockès <jeromedockes>`.
- The
TableReport
now displays the pandas (multi-)index, and has a better display & interaction of pandas columns when the columns are a MultiIndex. :pr:`1083` by :user:`Jérôme Dockès <jeromedockes>`. - It is possible to control the number of rows displayed by the TableReport in
the "sample" tab panel by specifying
n_rows
. :pr:`1083` by :user:`Jérôme Dockès <jeromedockes>`. - the TableReport used to raise an exception when the dataframe contained unhashable types such as python lists. This has been fixed in :pr:`1087` by :user:`Jérôme Dockès <jeromedockes>`.
- Display's columns name with the HTML representation of the fitted TableVectorizer. This has been fixed in :pr:`1093` by :user:`Shreekant Nandiyawar <Shree7676>`.
- AggTarget will now work even when y is a Series and not raise any error. This has been fixed in :pr:`1094` by :user:`Shreekant Nandiyawar <Shree7676>`.
- Polars dataframes are now supported across all
skrub
estimators. - :class:`TableReport` generates an interactive report for a dataframe. This page regroups some precomputed examples.
- The :class:`InterpolationJoiner` now supports polars dataframes. :pr:`1016` by :user:`Théo Jolivet <TheooJ>`.
- The :class:`TableReport` provides an interactive report on a dataframe's contents: an overview, summary statistics and plots, statistical associations between columns. It can be displayed in a jupyter notebook, a browser tab or saved as a static HTML page. :pr:`984` by :user:`Jérôme Dockès <jeromedockes>`.
- :class:`Joiner` and :func:`fuzzy_join` used to raise an error when columns with the same name appeared in the main and auxiliary table (after adding the suffix). This is now allowed and a random string is inserted in the duplicate column to ensure all names are unique. :pr:`1014` by :user:`Jérôme Dockès <jeromedockes>`.
- :class:`AggJoiner` and :class:`AggTarget` could produce outputs whose column names varied across calls to transform in some cases in the presence of duplicate column names, now the output names are always the same. :pr:`1013` by :user:`Jérôme Dockès <jeromedockes>`.
- In some cases :class:`AggJoiner` and :class:`AggTarget` inserted a column in the output named "index" containing the pandas index of the auxiliary table. This has been corrected. :pr:`1020` by :user:`Jérôme Dockès <jeromedockes>`.
- The :class:`Joiner` has been adapted to support polars dataframes. :pr:`945` by :user:`Théo Jolivet <TheooJ>`.
- The :class:`TableVectorizer` now consistently applies the same transformation
across different calls to transform. There also have been some breaking
changes to its functionality: (i) all transformations are now applied
independently to each column, i.e. it does not perform multivariate
transformations (ii) in
specific_transformers
the same column may not be used twice (go through 2 different transformers). :pr:`902` by :user:`Jérôme Dockès <jeromedockes>`. - Some parameters of :class:`TableVectorizer` have been renamed: high_cardinality_transformer → high_cardinality, low_cardinality_transformer → low_cardinality, datetime_transformer → datetime, numeric_transformer → numeric. :pr:`947` by :user:`Jérôme Dockès <jeromedockes>`.
- The :class:`GapEncoder` and :class:`MinHashEncoder` are now a single-column
transformers: their
fit
,fit_transform
andtransform
methods accept a single column (a pandas or polars Series). Dataframes and numpy arrays are not accepted. :pr:`920` and :pr:`923` by :user:`Jérôme Dockès <jeromedockes>`. - Added the :class:`MultiAggJoiner` that allows to augment a main table with multiple auxiliary tables. :pr:`876` by :user:`Théo Jolivet <TheooJ>`.
- :class:`AggJoiner` now only accepts a single table as an input, and some of its
parameters were renamed to be consistent with the :class:`MultiAggJoiner`.
It now has a
key`
parameter that allows to join main and auxiliary tables that share the same column names. :pr:`876` by :user:`Théo Jolivet <TheooJ>`. - :func:`tabular_learner` has been added to easily create a supervised learner that works well on tabular data. :pr:`926` by :user:`Jérôme Dockès <jeromedockes>`.
- :class:`GapEncoder` and :class:`MinHashEncoder` used to modify their input in-place, replacing missing values with a string. They no longer do so. Their parameter handle_missing has been removed; now missing values are always treated as the empty string. :pr:`930` by :user:`Jérôme Dockès <jeromedockes>`.
- The minimum supported python version is now 3.9 :pr:`939` by :user:`Jérôme Dockès <jeromedockes>`.
- Skrub supports numpy 2. :pr:`946` by :user:`Jérôme Dockès <jeromedockes>`.
- :func:`~datasets.fetch_ken_embeddings` now add suffix even with the default value for the parameter pca_components. :pr:`956` by :user:`Guillaume Lemaitre <glemaitre>`.
- :class:`Joiner` now performs some preprocessing (the same as done by the :class:`TableVectorizer`, eg trying to parse dates, converting pandas object columns with mixed types to a single type) on the joining columns before vectorizing them. :pr:`972` by :user:`Jérôme Dockès <jeromedockes>`.
This is a bugfix release to adapt to the most recent versions of pandas (2.2) and scikit-learn (1.5). There are no major changes to the functionality of skrub.
- :class:`TargetEncoder` has been removed in favor of :class:`sklearn.preprocessing.TargetEncoder`, available since scikit-learn 1.3.
- :class:`Joiner` and :func:`fuzzy_join` support several ways of rescaling
distances;
match_score
has been replaced bymax_dist
; bugs which prevented the Joiner to consistently vectorize inputs and accept or reject matches across calls to transform have been fixed. :pr:`821` by :user:`Jérôme Dockès <jeromedockes>`. - :class:`InterpolationJoiner` was added to join two tables by using machine-learning to infer the matching rows from the second table. :pr:`742` by :user:`Jérôme Dockès <jeromedockes>`.
- Pipelines including :class:`TableVectorizer` can now be grid-searched, since we can now call set_params on the default transformers of :class:`TableVectorizer`. :pr:`814` by :user:`Vincent Maladiere <Vincent-Maladiere>`
- :func:`to_datetime` is now available to support pandas.to_datetime over dataframes and 2d arrays. :pr:`784` by :user:`Vincent Maladiere <Vincent-Maladiere>`
- Some parameters of :class:`Joiner` have changed. The goal is to harmonize parameters across all estimator that perform join(-like) operations, as discussed in #751. :pr:`757` by :user:`Jérôme Dockès <jeromedockes>`.
- :func:`dataframe.pd_join`, :func:`dataframe.pd_aggregate`, :func:`dataframe.pl_join` and :func:`dataframe.pl_aggregate` are now available in the dataframe submodule. :pr:`733` by :user:`Vincent Maladiere <Vincent-Maladiere>`
- :class:`FeatureAugmenter` is renamed to :class:`Joiner`. :pr:`674` by :user:`Jovan Stojanovic <jovan-stojanovic>`
- :func:`fuzzy_join` and :class:`FeatureAugmenter` can now join on datetime columns. :pr:`552` by :user:`Jovan Stojanovic <jovan-stojanovic>`
- :class:`Joiner` now supports joining on multiple column keys. :pr:`674` by :user:`Jovan Stojanovic <jovan-stojanovic>`
- The signatures of all encoders and functions have been revised to enforce cleaner calls. This means that some arguments that could previously be passed positionally now have to be passed as keywords. :pr:`514` by :user:`Lilian Boulard <LilianBoulard>`.
- Parallelized the :class:`GapEncoder` column-wise. Parameters n_jobs and verbose added to the signature. :pr:`582` by :user:`Lilian Boulard <LilianBoulard>`
- Introducing :class:`AggJoiner`, a transformer performing aggregation on auxiliary tables followed by left-joining on a base table. :pr:`600` by :user:`Vincent Maladiere <Vincent-Maladiere>`.
- Introducing :class:`AggTarget`, a transformer performing aggregation on the target y, followed by left-joining on a base table. :pr:`600` by :user:`Vincent Maladiere <Vincent-Maladiere>`.
- Added the :class:`SelectCols` and :class:`DropCols` transformers that allow selecting a subset of a dataframe's columns inside of a pipeline. :pr:`804` by :user:`Jérôme Dockès <jeromedockes>`.
- :class:`DatetimeEncoder` doesn't remove constant features anymore. It also supports an 'errors' argument to raise or coerce errors during transform, and a 'add_total_seconds' argument to include the number of seconds since Epoch. :pr:`784` by :user:`Vincent Maladiere <Vincent-Maladiere>`
- Scaling of
matching_score
in :func:`fuzzy_join` is now between 0 and 1; it used to be between 0.5 and 1. Moreover, the division by 0 error that occurred when all rows had a perfect match has been fixed. :pr:`802` by :user:`Jérôme Dockès <jeromedockes>`. - :class:`TableVectorizer` is now able to apply parallelism at the column level rather than the transformer level. This is the default for univariate transformers, like :class:`MinHashEncoder`, and :class:`GapEncoder`. :pr:`592` by :user:`Leo Grinsztajn <LeoGrin>`
inverse_transform
in :class:`SimilarityEncoder` now works as expected; it used to raise an exception. :pr:`801` by :user:`Jérôme Dockès <jeromedockes>`.- :class:`TableVectorizer` propagate the n_jobs parameter to the underlying transformers except if the underlying transformer already set explicitly n_jobs. :pr:`761` by :user:`Leo Grinsztajn <LeoGrin>`, :user:`Guillaume Lemaitre <glemaitre>`, and :user:`Jerome Dockes <jeromedockes>`.
- Parallelized the :func:`deduplicate` function. Parameter n_jobs added to the signature. :pr:`618` by :user:`Jovan Stojanovic <jovan-stojanovic>` and :user:`Lilian Boulard <LilianBoulard>`
- Functions :func:`datasets.fetch_ken_embeddings`, :func:`datasets.fetch_ken_table_aliases` and :func:`datasets.fetch_ken_types` have been renamed. :pr:`602` by :user:`Jovan Stojanovic <jovan-stojanovic>`
- Make pyarrow an optional dependencies to facilitate the integration with pyodide. :pr:`639` by :user:`Guillaume Lemaitre <glemaitre>`.
- Bumped minimal required Python version to 3.10. :pr:`606` by :user:`Gael Varoquaux <GaelVaroquaux>`
- Bumped minimal required versions for the dependencies: - numpy >= 1.23.5 - scipy >= 1.9.3 - scikit-learn >= 1.2.1 - pandas >= 1.5.3 :pr:`613` by :user:`Lilian Boulard <LilianBoulard>`
- You can now pass column-specific transformers to :class:`TableVectorizer` using the specific_transformers argument. :pr:`583` by :user:`Lilian Boulard <LilianBoulard>`.
- Do not support 1-D array (and pandas Series) in :class:`TableVectorizer`. Pass a 2-D array (or a pandas DataFrame) with a single column instead. This change is for compliance with the scikit-learn API. :pr:`647` by :user:`Guillaume Lemaitre <glemaitre>`
- Fixes a bug in :class:`TableVectorizer` with remainder: it is now cloned if it's a transformer so that the same instance is not shared between different transformers. :pr:`678` by :user:`Guillaume Lemaitre <glemaitre>`
- :class:`GapEncoder` speedup :pr:`680` by :user:`Leo Grinsztajn <LeoGrin>`
- Improved :class:`GapEncoder`'s early stopping logic. The parameters tol and min_iter have been removed. The parameter max_no_improvement can now be used to control the early stopping. :pr:`663` by :user:`Simona Maggio <simonamaggio>` :pr:`593` by :user:`Lilian Boulard <LilianBoulard>` :pr:`681` by :user:`Leo Grinsztajn <LeoGrin>`
- Implementation improvement leading to a ~x5 speedup for each iteration.
- Better default hyperparameters: batch_size now defaults to 1024, and max_iter_e_steps to 1.
- Removed the most_frequent and k-means strategies from the :class:`SimilarityEncoder`. These strategy were used for scalability reasons, but we recommend using the :class:`MinHashEncoder` or the :class:`GapEncoder` instead. :pr:`596` by :user:`Leo Grinsztajn <LeoGrin>`
- Removed the similarity argument from the :class:`SimilarityEncoder` constructor, as we only support the ngram similarity. :pr:`596` by :user:`Leo Grinsztajn <LeoGrin>`
- Added the analyzer parameter to the :class:`SimilarityEncoder` to allow word counts for similarity measures. :pr:`619` by :user:`Jovan Stojanovic <jovan-stojanovic>`
- skrub now uses modern type hints introduced in PEP 585. :pr:`609` by :user:`Lilian Boulard <LilianBoulard>`
- Some bug fixes for :class:`TableVectorizer` ( :pr:`579`):
- check_is_fitted now looks at "transformers_" rather than "columns_"
- the default of the remainder parameter in the docstring is now "passthrough" instead of "drop" to match the implementation.
- uint8 and int8 dtypes are now considered as numerical columns.
- Removed the leading "<" and trailing ">" symbols from KEN entities and types. :pr:`601` by :user:`Jovan Stojanovic <jovan-stojanovic>`
- Add get_feature_names_out method to :class:`MinHashEncoder`. :pr:`616` by :user:`Leo Grinsztajn <LeoGrin>`
- Removed requests from the requirements. :pr:`613` by :user:`Lilian Boulard <LilianBoulard>`
- :class:`TableVectorizer` now handles mixed types columns without failing by converting them to string before type inference. :pr:`623`by :user:`Leo Grinsztajn <LeoGrin>`
- Moved the default storage location of data to the user's home folder. :pr:`652` by :user:`Felix Lefebvre <flefebv>` and :user:`Gael Varoquaux <GaelVaroquaux>`
- Fixed bug when using :class:`TableVectorizer`'s transform method on categorical columns with missing values. :pr:`644` by :user:`Leo Grinsztajn <LeoGrin>`
- :class:`TableVectorizer` never output a sparse matrix by default. This can be changed by increasing the sparse_threshold parameter. :pr:`646` by :user:`Leo Grinsztajn <LeoGrin>`
- :class:`TableVectorizer` doesn't fail anymore if an inferred type doesn't work during transform. The new entries not matching the type are replaced by missing values. :pr:`666` by :user:`Leo Grinsztajn <LeoGrin>`
- Dataset fetcher :func:`datasets.fetch_employee_salaries` now has a parameter overload_job_titles to allow overloading the job titles (employee_position_title) with the column underfilled_job_title, which provides some more information about the job title. :pr:`581` by :user:`Lilian Boulard <LilianBoulard>`
- Fix bugs which was triggered when extract_until was "year", "month", "microseconds" or "nanoseconds", and add the option to set it to None to only extract total_time, the time from epoch. :class:`DatetimeEncoder`. :pr:`743` by :user:`Leo Grinsztajn <LeoGrin>`
Skrub was born from the dirty_cat package.
- :func:`fuzzy_join` and :class:`FeatureAugmenter` can now join on numerical columns based on the euclidean distance. :pr:`530` by :user:`Jovan Stojanovic <jovan-stojanovic>`
- :func:`fuzzy_join` and :class:`FeatureAugmenter` can perform many-to-many joins on lists of numerical or string key columns. :pr:`530` by :user:`Jovan Stojanovic <jovan-stojanovic>`
- :func:`GapEncoder.transform` will not continue fitting of the instance anymore. It makes functions that depend on it (:func:`~GapEncoder.get_feature_names_out`, :func:`~GapEncoder.score`, etc.) deterministic once fitted. :pr:`548` by :user:`Lilian Boulard <LilianBoulard>`
- :func:`fuzzy_join` and :class:`FeatureAugmenter` now perform joins on missing values as in pandas.merge but raises a warning. :pr:`522` and :pr:`529` by :user:`Jovan Stojanovic <jovan-stojanovic>`
- Added :func:`get_ken_table_aliases` and :func:`get_ken_types` for exploring KEN embeddings. :pr:`539` by :user:`Lilian Boulard <LilianBoulard>`.
- Improvement of date column detection and date format inference in :class:`TableVectorizer`. The format inference now tries to find a format which works for all non-missing values of the column, and only tries pandas default inference if it fails. :pr:`543` by :user:`Leo Grinsztajn <LeoGrin>` :pr:`587` by :user:`Leo Grinsztajn <LeoGrin>`
- SuperVectorizer is renamed as :class:`TableVectorizer`, a warning is raised when using the old name. :pr:`484` by :user:`Jovan Stojanovic <jovan-stojanovic>`
- New experimental feature: joining tables using :func:`fuzzy_join` by approximate key matching. Matches are based on string similarities and the nearest neighbors matches are found for each category. :pr:`291` by :user:`Jovan Stojanovic <jovan-stojanovic>` and :user:`Leo Grinsztajn <LeoGrin>`
- New experimental feature: :class:`FeatureAugmenter`, a transformer that augments with :func:`fuzzy_join` the number of features in a main table by using information from auxiliary tables. :pr:`409` by :user:`Jovan Stojanovic <jovan-stojanovic>`
- Unnecessary API has been made private: everything (files, functions, classes) starting with an underscore shouldn't be imported in your code. :pr:`331` by :user:`Lilian Boulard <LilianBoulard>`
- The :class:`MinHashEncoder` now supports a n_jobs parameter to parallelize the hashes computation. :pr:`267` by :user:`Leo Grinsztajn <LeoGrin>` and :user:`Lilian Boulard <LilianBoulard>`.
- New experimental feature: deduplicating misspelled categories using :func:`deduplicate` by clustering string distances. This function works best when there are significantly more duplicates than underlying categories. :pr:`339` by :user:`Moritz Boos <mjboos>`.
- Add example Wikipedia embeddings to enrich the data. :pr:`487` by :user:`Jovan Stojanovic <jovan-stojanovic>`
- datasets.fetching: contains a new function :func:`get_ken_embeddings` that can be used to download Wikipedia embeddings and filter them by type.
- datasets.fetching: contains a new function :func:`fetch_world_bank_indicator` that can be used to download indicators from the World Bank Open Data platform. :pr:`291` by :user:`Jovan Stojanovic <jovan-stojanovic>`
- Removed example Fitting scalable, non-linear models on data with dirty categories. :pr:`386` by :user:`Jovan Stojanovic <jovan-stojanovic>`
- :class:`MinHashEncoder`'s :func:`minhash` method is no longer public. :pr:`379` by :user:`Jovan Stojanovic <jovan-stojanovic>`
- Fetching functions now have an additional argument
directory
, which can be used to specify where to save and load from datasets. :pr:`432` by :user:`Lilian Boulard <LilianBoulard>` - Fetching functions now have an additional argument
directory
, which can be used to specify where to save and load from datasets. :pr:`432` and :pr:`453` by :user:`Lilian Boulard <LilianBoulard>` - The :class:`TableVectorizer`'s default OneHotEncoder for low cardinality categorical variables now defaults to handle_unknown="ignore" instead of handle_unknown="error" (for sklearn >= 1.0.0). This means that categories seen only at test time will be encoded by a vector of zeroes instead of raising an error. :pr:`473` by :user:`Leo Grinsztajn <LeoGrin>`
- The :class:`MinHashEncoder` now considers None and empty strings as missing values, rather than raising an error. :pr:`378` by :user:`Gael Varoquaux <GaelVaroquaux>`
New encoder: :class:`DatetimeEncoder` can transform a datetime column into several numerical columns (year, month, day, hour, minute, second, ...). It is now the default transformer used in the :class:`TableVectorizer` for datetime columns. :pr:`239` by :user:`Leo Grinsztajn <LeoGrin>`
The :class:`TableVectorizer` has seen some major improvements and bug fixes:
- Fixes the automatic casting logic in
transform
. - To avoid dimensionality explosion when a feature has two unique values, the default encoder (:class:`~sklearn.preprocessing.OneHotEncoder`) now drops one of the two vectors (see parameter drop="if_binary").
fit_transform
andtransform
can now return unencoded features, like the :class:`~sklearn.compose.ColumnTransformer`'s behavior. Previously, aRuntimeError
was raised.
- Fixes the automatic casting logic in
Backward-incompatible change in the TableVectorizer: To apply
remainder
to features (with the*_transformer
parameters), the value'remainder'
must be passed, instead ofNone
in previous versions.None
now indicates that we want to use the default transformer. :pr:`303` by :user:`Lilian Boulard <LilianBoulard>`Support for Python 3.6 and 3.7 has been dropped. Python >= 3.8 is now required. :pr:`289` by :user:`Lilian Boulard <LilianBoulard>`
Bumped minimum dependencies:
- scikit-learn>=0.23
- scipy>=1.4.0
- numpy>=1.17.3
- pandas>=1.2.0 :pr:`299` and :pr:`300` by :user:`Lilian Boulard <LilianBoulard>`
Dropped support for Jaro, Jaro-Winkler and Levenshtein distances.
- The :class:`SimilarityEncoder` now exclusively uses
ngram
for similarities, and the similarity parameter is deprecated. It will be removed in 0.5. :pr:`282` by :user:`Lilian Boulard <LilianBoulard>`
- The :class:`SimilarityEncoder` now exclusively uses
- The
transformers_
attribute of the :class:`TableVectorizer` now contains column names instead of column indices for the "remainder" columns. :pr:`266` by :user:`Leo Grinsztajn <LeoGrin>`
- Fixed a bug in the :class:`TableVectorizer` causing a :class:`FutureWarning` when using the :func:`get_feature_names_out` method. :pr:`262` by :user:`Lilian Boulard <LilianBoulard>`
Improvements to the :class:`TableVectorizer`
- Type detection works better: handles dates, numerics columns encoded as strings, or numeric columns containing strings for missing values.
:func:`get_feature_names` becomes :func:`get_feature_names_out`, following changes in the scikit-learn API. :func:`get_feature_names` is deprecated in scikit-learn > 1.0. :pr:`241` by :user:`Gael Varoquaux <GaelVaroquaux>`
- Improvements to the :class:`MinHashEncoder`
- It is now possible to fit multiple columns simultaneously with the :class:`MinHashEncoder`. Very useful when using for instance the :func:`~sklearn.compose.make_column_transformer` function, on multiple columns.
- Fixed a bug that resulted in the :class:`GapEncoder` ignoring the analyzer argument. :pr:`242` by :user:`Jovan Stojanovic <jovan-stojanovic>`
- :class:`GapEncoder`'s get_feature_names_out now accepts all iterators, not just lists. :pr:`255` by :user:`Lilian Boulard <LilianBoulard>`
- Fixed :class:`DeprecationWarning` raised by the usage of distutils.version.LooseVersion. :pr:`261` by :user:`Lilian Boulard <LilianBoulard>`
- Remove trailing imports in the :class:`MinHashEncoder`.
- Fix typos and update links for website.
- Documentation of the :class:`TableVectorizer` and the :class:`SimilarityEncoder` improved.
Also see pre-release 0.2.0a1 below for additional changes.
Bump minimum dependencies:
- scikit-learn (>=0.21.0) :pr:`202` by :user:`Lilian Boulard <LilianBoulard>`
- pandas (>=1.1.5) ! NEW REQUIREMENT ! :pr:`155` by :user:`Lilian Boulard <LilianBoulard>`
datasets.fetching - backward-incompatible changes to the example datasets fetchers:
- The backend has changed: we now exclusively fetch the datasets from OpenML. End users should not see any difference regarding this.
- The frontend, however, changed a little: the fetching functions stay the same but their return values were modified in favor of a more Pythonic interface. Refer to the docstrings of functions dirty_cat.datasets.fetch_* for more information.
- The example notebooks were updated to reflect these changes. :pr:`155` by :user:`Lilian Boulard <LilianBoulard>`
Backward incompatible change to :class:`MinHashEncoder`: The :class:`MinHashEncoder` now only supports two dimensional inputs of shape (N_samples, 1). :pr:`185` by :user:`Lilian Boulard <LilianBoulard>` and :user:`Alexis Cvetkov <alexis-cvetkov>`.
Update handle_missing parameters:
- :class:`GapEncoder`: the default value "zero_impute" becomes "empty_impute" (see doc).
- :class:`MinHashEncoder`: the default value "" becomes "zero_impute" (see doc).
Add a method "get_feature_names_out" for the :class:`GapEncoder` and the :class:`TableVectorizer`, since get_feature_names will be depreciated in scikit-learn 1.2. :pr:`216` by :user:`Alexis Cvetkov <alexis-cvetkov>`
Removed hard-coded CSV file dirty_cat/data/FiveThirtyEight_Midwest_Survey.csv.
Improvements to the :class:`TableVectorizer`
- Missing values are not systematically imputed anymore
- Type casting and per-column imputation are now learnt during fitting
- Several bugfixes
Version 0.2.0a1 is a pre-release. To try it, you have to install it manually using:
pip install --pre dirty_cat==0.2.0a1
or from the GitHub repository:
pip install git+https://github.com/dirty-cat/dirty_cat.git
- Bump minimum dependencies:
- Python (>= 3.6)
- NumPy (>= 1.16)
- SciPy (>= 1.2)
- scikit-learn (>= 0.20.0)
- :class:`TableVectorizer`: Added automatic transform through the :class:`TableVectorizer` class. It transforms columns automatically based on their type. It provides a replacement for scikit-learn's :class:`~sklearn.compose.ColumnTransformer` simpler to use on heterogeneous pandas DataFrame. :pr:`167` by :user:`Lilian Boulard <LilianBoulard>`
- Backward incompatible change to :class:`GapEncoder`: The :class:`GapEncoder` now only supports two-dimensional inputs of shape (n_samples, n_features). Internally, features are encoded by independent :class:`GapEncoder` models, and are then concatenated into a single matrix. :pr:`185` by :user:`Lilian Boulard <LilianBoulard>` and :user:`Alexis Cvetkov <alexis-cvetkov>`.
- Fix get_feature_names for scikit-learn > 0.21. :pr:`216` by :user:`Alexis Cvetkov <alexis-cvetkov>`
- RuntimeWarnings due to overflow in :class:`GapEncoder`. :pr:`161` by :user:`Alexis Cvetkov <alexis-cvetkov>`
- :class:`GapEncoder`: Added online Gamma-Poisson factorization through the :class:`GapEncoder` class. This method discovers latent categories formed via combinations of substrings, and encodes string data as combinations of these categories. To be used if interpretability is important. :pr:`153` by :user:`Alexis Cvetkov <alexis-cvetkov>`
- Multiprocessing exception in notebook. :pr:`154` by :user:`Lilian Boulard <LilianBoulard>`
- MinHashEncoder: Added
minhash_encoder.py
andfast_hast.py
files that implement minhash encoding through the :class:`MinHashEncoder` class. This method allows for fast and scalable encoding of string categorical variables. - datasets.fetch_employee_salaries: change the origin of download for employee_salaries.
- The function now return a bunch with a dataframe under the field "data", and not the path to the csv file.
- The field "description" has been renamed to "DESCR".
- SimilarityEncoder: Fixed a bug when using the Jaro-Winkler distance as a
similarity metric. Our implementation now accurately reproduces the behaviour
of the
python-Levenshtein
implementation. - SimilarityEncoder: Added a handle_missing attribute to allow encoding with missing values.
- TargetEncoder: Added a handle_missing attribute to allow encoding with missing values.
- MinHashEncoder: Added a handle_missing attribute to allow encoding with missing values.
- SimilarityEncoder: Accelerate
SimilarityEncoder.transform
, by:- computing the vocabulary count vectors in
fit
instead oftransform
- computing the similarities in parallel using
joblib
. This option can be turned on/off via then_jobs
attribute of the :class:`SimilarityEncoder`.
- computing the vocabulary count vectors in
- SimilarityEncoder: Fix a bug that was preventing a :class:`SimilarityEncoder`
to be created when
categories
was a list. - SimilarityEncoder: Set the dtype passed to the ngram similarity to float32, which reduces memory consumption during encoding.
- SimilarityEncoder: Change the default ngram range to (2, 4) which performs better empirically.
- SimilarityEncoder: Added a most_frequent strategy to define prototype categories for large-scale learning.
- SimilarityEncoder: Added a k-means strategy to define prototype categories for large-scale learning.
- SimilarityEncoder: Added the possibility to use hashing ngrams for stateless fitting with the ngram similarity.
- SimilarityEncoder: Performance improvements in the ngram similarity.
- SimilarityEncoder: Expose a get_feature_names method.