Adding the StringEncoder transformer #1159
base: main
Conversation
Tests fail on minimum requirements because I am using PCA rather than TruncatedSVD for the decomposition, and that raises issues with potentially sparse matrices. @jeromedockes suggests using TruncatedSVD directly from the start, rather than adding a check on the version. Also, I am using tf-idf as the vectorizer; should I use something else? Maybe HashingVectorizer? (Writing this down so I don't forget.)
I'm very happy to see this progressing. Can you benchmark it on the experiments from Leo's paper? This is important for modeling choices (e.g. the hyper-parameters).
Where can I find the benchmarks?
Actually, let's keep it simple and use the CARTE datasets, they are good enough: https://huggingface.co/datasets/inria-soda/carte-benchmark You probably want to instantiate a pipeline that uses TableVectorizer + HistGradientBoosting, but embeds one of the string columns with the StringEncoder (the one that has either the highest cardinality or the most "diverse entries" in the sense of https://arxiv.org/abs/2312.09634).
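For reference, a minimal sketch of the kind of benchmark pipeline described above. It assumes the StringEncoder API proposed in this PR (i.e. that it can be passed as the ``high_cardinality`` transformer of the TableVectorizer); the regressor choice and ``n_components`` value are placeholders, not decisions made in this PR.

# Sketch only: StringEncoder is the transformer added by this PR; the exact
# import path and defaults may differ from the final merged version.
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.pipeline import make_pipeline
from skrub import TableVectorizer
from skrub import StringEncoder  # assumed public once this PR is merged

pipeline = make_pipeline(
    # high-cardinality string columns are embedded with the StringEncoder,
    # the remaining columns go through the TableVectorizer defaults
    TableVectorizer(high_cardinality=StringEncoder(n_components=30)),
    HistGradientBoostingRegressor(),
)
# pipeline.fit(X, y) / cross_val_score(pipeline, X, y) on each CARTE dataset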
Should we also add this to the text encoder example, alongside the TextEncoder, MinHashEncoder and GapEncoder? It shows a tiny benchmark on the toxicity dataset.
It's already there, and it shows that StringEncoder has performance similar to that of GapEncoder and runtime similar to that of MinHashEncoder.
That's very interesting!
Good point. I was confusing it with the "add_word" strategy of the GapEncoder (Line 236 in 8a542bb).
One last thing (I always come up with more :D ): we'll have to be very careful to summarize the tradeoffs between the different encoders in a few lines (a few lines, something short and clear :D ) at the top of the corresponding section of the docs. It is very important that we convey to the user what we have learned |
We discussed this PR during this week's meeting, and some points came up:
I'll clean up the code I am using and try to run the experiments in the next few days.
This is something for a separate PR though.
mostly corner cases remaining 🎉 :)
@@ -132,7 +135,7 @@ def plot_gap_feature_importance(X_trans):
 # We set ``n_components`` to 30; however, to achieve the best performance, we would
 # need to find the optimal value for this hyperparameter using either |GridSearchCV|
 # or |RandomizedSearchCV|. We skip this part to keep the computation time for this
-# example small.
+# small example.
to keep the computation time for this ...
skrub/_string_encoder.py
Outdated
First, apply a tf-idf vectorization of the text, then reduce the dimensionality
with a truncated SVD decomposition with the given number of components.

New features will be named `{col_name}_{component}` if the series has a name,
I think you need double backticks
skrub/_string_encoder.py
Outdated
Parameters
----------
n_components : int, default=30
    Number of components to be used for the PCA decomposition. Must be a
To keep the number of acronyms under control, maybe we should stick to "SVD", not "PCA". Also, we could have the expanded acronyms in parentheses the first time we mention them, and links to their Wikipedia pages in a Notes section.
    Number of components to be used for the PCA decomposition. Must be a
    positive integer.
vectorizer : str, "tfidf" or "hashing"
    Vectorizer to apply to the strings, either `tfidf` or `hashing` for
Also here, I'm not sure what your desired formatting was -- single backticks will be italic, double for monospace.
skrub/_string_encoder.py
Outdated
    scikit-learn TfidfVectorizer or HashingVectorizer respectively.

ngram_range : tuple of (int, int) pairs, default=(3,4)
    Whether the feature should be made of word or character n-grams.
Looks like the docs for ngram_range and analyzer got swapped.
skrub/_string_encoder.py
Outdated
analyzer : str, "char", "word" or "char_wb", default="char_wb"
    The lower and upper boundary of the range of n-values for different
    n-grams to be extracted. All values of n such that min_n <= n <= max_n
    will be used. For example an `ngram_range` of `(1, 1)` means only unigrams,
same comment about rst vs markdown
skrub/_string_encoder.py
Outdated
                ngram_range=self.ngram_range, analyzer=self.analyzer
            ),
        ),
        ("tsvd", TruncatedSVD(n_components=self.n_components)),
As in the TextEncoder, I think we need to handle the case where the smaller dimension of the tfidf output ends up < self.n_components (this could happen, for example, when fitting on a column with few unique words while setting a large n_components and using the word analyzer). In that case we can do the same as the TextEncoder, i.e. keep tfidf[:, :self.n_components].
(Adding that logic might require you to move the SVD out of the pipeline.)
I tested TableVectorizer with It's also surprising to see that GapEncoder with
Nice! So what is the conclusion regarding the
I'm happy that the string encoder looks like a great baseline for short, messy columns and long, free-form text as well.
we'll have to be very careful to summarize the tradeoffs between the different encoders in a few lines (a few lines, something short and clear :D ) at the top of the corresponding section of the docs. It is very important that we convey to the user what we have learned
This is something for a separate PR though
I'd rather not. IMHO the docs need to be reorganized as we add complexity to the package. Also, the evidence for this recommendation comes from this PR.
I updated the doc page on the Encoders, but it was only to add the
My feeling is that OrdinalEncoder is just not that good if there is no order in the feature to begin with, while strings that are similar to each other usually are related no matter how they are sliced. I think an interesting experiment would be a dictionary replacement where all strings in the starting table are replaced by random alphanumeric strings, and then checking the performance of the encoders on that. In that case, I can imagine StringEncoder would not do so well compared to OrdinalEncoder.
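A rough sketch of what that scrambling experiment could look like (the helper and the column name are hypothetical, only meant to illustrate the idea):

import secrets
import pandas as pd

def scramble_column(col: pd.Series, n_chars: int = 8) -> pd.Series:
    # map each distinct value to a fixed random alphanumeric token,
    # so string overlap between related entries is destroyed
    mapping = {v: secrets.token_hex(n_chars // 2) for v in col.dropna().unique()}
    return col.map(mapping)

# df["position_title"] = scramble_column(df["position_title"])  # hypothetical column
# then refit the StringEncoder and OrdinalEncoder pipelines on the scrambled
# table and compare their cross-validated scores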
Hey @rcap107! Here is a bunch of questions and nitpicks :)
while being very efficient and quick to fit.

:class:`GapEncoder` provides better performance on dirty categories, while
:class:`TextEncoder` works better on free-flowing text. However, both encoders
Should we add this? How do we make it clear that TextEncoder will help bring the "mercedes" and "bmw" categories closer together than the "toyota" category, even for single words?
Suggested change:
- :class:`TextEncoder` works better on free-flowing text. However, both encoders
+ :class:`TextEncoder` works better on free-flowing text or when external context helps. However, both encoders
`tf-idf vectorization <https://en.wikipedia.org/wiki/Tf%E2%80%93idf>`_, then
follow it with a dimensionality reduction algorithm such as
`TruncatedSVD <https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html>`_
to limit the number of features: the :class:`StringEncoder` implements this
Maybe add that SVD also helps because we can't concatenate sparse vectors to dataframes when working with tabular learning tasks? Otherwise, models like logistic regression can handle sparse input without trouble
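A small illustration of that point, on toy data with default parameters (only meant to show the types and shapes involved, not part of the PR):

import pandas as pd
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["red car", "blue car", "red bike", "green bike"]
X_tfidf = TfidfVectorizer().fit_transform(texts)
print(type(X_tfidf), X_tfidf.shape)  # scipy sparse matrix, one column per term

# the SVD output is a small dense array, easy to store as regular dataframe columns
X_svd = TruncatedSVD(n_components=2).fit_transform(X_tfidf)
embeddings = pd.DataFrame(X_svd, columns=["text_00", "text_01"])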
string_encoder_pipe = clone(gap_pipe).set_params(
    **{"tablevectorizer__high_cardinality": string_encoder}
)
Some people complained that these lines were a bit complex, and didn't allow them to get the full picture by looking at this single cell. I'm in favor of replacing all clone(...).set_params calls with the full pipeline. This would also need to be done for the text encoder and minhash encoder.
Suggested change:
- string_encoder_pipe = clone(gap_pipe).set_params(
-     **{"tablevectorizer__high_cardinality": string_encoder}
- )
+ string_encoder_pipe = make_pipeline(
+     TableVectorizer(high_cardinality=string_encoder),
+     HistGradientBoostingClassifier(),
+ )
if (min_shape := min(X_out.shape)) >= self.n_components:
    self.tsvd_ = TruncatedSVD(n_components=self.n_components)
    result = self.tsvd_.fit_transform(X_out)
else:
    warnings.warn(
        f"The matrix shape is {(X_out.shape)}, and its minimum is "
        f"{min_shape}, which is too small to fit a truncated SVD with "
        f"n_components={self.n_components}. "
        "The embeddings will be truncated by keeping the first "
        f"{self.n_components} dimensions instead. "
    )
    # self.n_components can be greater than the number
    # of dimensions of result.
    # Therefore, self.n_components_ below stores the resulting
    # number of dimensions of result.
    result = X_out[:, : self.n_components].toarray()
Maybe L140 to L155 could be brought into a common utils with the text encoder, WDYT?
if self.analyzer not in ["char_wb", "char", "word"]:
    raise ValueError(f"Unknown analyzer {self.analyzer}")
TfidfVectorizer and HashingVectorizer already perform this check, I assume?
] | ||
) | ||
else: | ||
raise ValueError(f"Unknown vectorizer {self.vectorizer}.") |
Nitpick, to clarify the error
raise ValueError(f"Unknown vectorizer {self.vectorizer}.") | |
raise ValueError(f"Unknown vectorizer {self.vectorizer}. Options are 'tfidf' or 'hashing', got {self.vectorizer!r}") |
By the way, should the option be called "count" instead of "tfidf"? Since the difference is the CountVectorizer within the TfidfVectorizer
else:
    raise ValueError(f"Unknown vectorizer {self.vectorizer}.")

X_out = self.vectorizer_.fit_transform(sbd.to_numpy(X))
Nitpick: I think SingleColumnTransformer._check_single_column checks that X is either a polars or pandas series, so we don't need sbd.to_numpy. WDYT?
    # number of dimensions of result.
    result = X_out[:, : self.n_components].toarray()

self._is_fitted = True
Is this flag necessary, since we have all_outputs_ downstream? Could we use check_is_fitted(self, "all_outputs_") instead?
""" | ||
Check fitted status and return a Boolean value. | ||
""" | ||
return hasattr(self, "_is_fitted") and self._is_fitted |
Following my suggestion above
Suggested change:
- return hasattr(self, "_is_fitted") and self._is_fitted
+ return check_is_fitted(self, "all_outputs_")
This is a first draft of a PR to address #1121
I looked at GapEncoder to figure out what to do. This is a very early version just to have an idea of the kind of code that's needed.
Things left to do: