Adding the StringEncoder transformer #1159
base: main
Conversation
Tests fail on minimum requirements because I am using PCA rather than TruncatedSVD for the decomposition, and that raises issues with potentially sparse matrices. @jeromedockes suggests using TruncatedSVD directly from the start, rather than adding a check on the version. Also, I am using tf-idf as the vectorizer; should I use something else? Maybe HashingVectorizer? (Writing this down so I don't forget.)
I'm very happy to see this progressing. Can you benchmark it on the experiments from Leo's paper? This is important for modeling choices (e.g. the hyper-parameters).
Where can I find the benchmarks?
Actually, let's keep it simple and use the CARTE datasets, they are good enough: https://huggingface.co/datasets/inria-soda/carte-benchmark You probably want to instantiate a pipeline that uses TableVectorizer + HistGradientBoosting, but embeds one of the string columns with the StringEncoder (the one that has either the highest cardinality or the most "diverse entries" in the sense of https://arxiv.org/abs/2312.09634).
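For reference, a minimal sketch of the kind of benchmark pipeline described above. It assumes the StringEncoder API proposed in this PR (i.e. that it can be passed as the ``high_cardinality`` transformer of the TableVectorizer); the regressor choice and ``n_components`` value are placeholders, not decisions made in this PR.

# Sketch only: StringEncoder is the transformer added by this PR; the exact
# import path and defaults may differ from the final merged version.
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.pipeline import make_pipeline
from skrub import TableVectorizer
from skrub import StringEncoder  # assumed public once this PR is merged

pipeline = make_pipeline(
    # high-cardinality string columns are embedded with the StringEncoder,
    # the remaining columns go through the TableVectorizer defaults
    TableVectorizer(high_cardinality=StringEncoder(n_components=30)),
    HistGradientBoostingRegressor(),
)
# pipeline.fit(X, y) / cross_val_score(pipeline, X, y) on each CARTE dataset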
Should we also add this to the text encoder example, alongside the TextEncoder, MinHashEncoder and GapEncoder? It shows a tiny benchmark on the toxicity dataset.
It's already there, and it shows that StringEncoder has performance similar to that of GapEncoder and runtime similar to that of MinHashEncoder.
That's very interesting!
Good point. I was confusing it with the "add_word" strategy of the GapEncoder (Line 236 in 8a542bb).
One last thing (I always come up with more :D ): we'll have to be very careful to summarize the tradeoffs between the different encoders in a few lines (a few lines, something short and clear :D ) at the top of the corresponding section of the docs. It is very important that we convey to the user what we have learned |
We discussed this PR during this week's meeting, and some points came up:
I'll clean up the code I am using and try to run the experiments in the next few days.
This is something for a separate PR though.
mostly corner cases remaining 🎉 :)
@@ -132,7 +135,7 @@ def plot_gap_feature_importance(X_trans):
 # We set ``n_components`` to 30; however, to achieve the best performance, we would
 # need to find the optimal value for this hyperparameter using either |GridSearchCV|
 # or |RandomizedSearchCV|. We skip this part to keep the computation time for this
-# example small.
+# small example.
to keep the computation time for this ...
skrub/_string_encoder.py
Outdated
First, apply a tf-idf vectorization of the text, then reduce the dimensionality
with a truncated SVD decomposition with the given number of components.

New features will be named `{col_name}_{component}` if the series has a name,
I think you need double backticks
skrub/_string_encoder.py
Outdated
Parameters
----------
n_components : int, default=30
    Number of components to be used for the PCA decomposition. Must be a
To keep the number of acronyms under control, maybe we should stick to "SVD", not "PCA". Also, we could have the expanded acronyms in parentheses the first time we mention them, and links to their Wikipedia pages in a Notes section.
    Number of components to be used for the PCA decomposition. Must be a
    positive integer.
vectorizer : str, "tfidf" or "hashing"
    Vectorizer to apply to the strings, either `tfidf` or `hashing` for
Also here, I'm not sure what your desired formatting was -- single backticks will be italic, double for monospace.
skrub/_string_encoder.py
Outdated
    scikit-learn TfidfVectorizer or HashingVectorizer respectively.

ngram_range : tuple of (int, int) pairs, default=(3,4)
    Whether the feature should be made of word or character n-grams.
Looks like the docs for ngram_range and analyzer got swapped.
skrub/_string_encoder.py
Outdated
analyzer : str, "char", "word" or "char_wb", default="char_wb"
    The lower and upper boundary of the range of n-values for different
    n-grams to be extracted. All values of n such that min_n <= n <= max_n
    will be used. For example an `ngram_range` of `(1, 1)` means only unigrams,
same comment about rst vs markdown
skrub/_string_encoder.py
Outdated
                ngram_range=self.ngram_range, analyzer=self.analyzer
            ),
        ),
        ("tsvd", TruncatedSVD(n_components=self.n_components)),
As in the TextEncoder, I think we need to handle the case where the smaller dimension of the tfidf output ends up < self.n_components (this could happen, for example, when fitting on a column with few unique words while setting a large n_components and using the word analyzer). In that case we can do the same as the TextEncoder, i.e. keep tfidf[:, :self.n_components].
(Adding that logic might require you to move the SVD out of the pipeline.)
I tested TableVectorizer with It's also surprising to see that GapEncoder with
Nice! So what is the conclusion regarding the
I'm happy that the string encoder looks like a great baseline for short, messy columns and long, free-form text as well.
we'll have to be very careful to summarize the tradeoffs between the different encoders in a few lines (a few lines, something short and clear :D ) at the top of the corresponding section of the docs. It is very important that we convey to the user what we have learned
This is something for a separate PR though
I'd rather not. IMHO the docs need to be reorganized as we add complexity to the package. Also, the evidence for this recommendation comes from this PR.
I updated the doc page on the Encoders, but it was only to add the
My feeling is that OrdinalEncoder is just not that good if there is no order in the feature to begin with, while strings that are similar to each other usually are related no matter how they are sliced. I think an interesting experiment would be a dictionary replacement where all strings in the starting table are replaced by random alphanumeric strings, and then checking the performance of the encoders on that. In that case, I can imagine StringEncoder would not do so well compared to OrdinalEncoder.
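A rough sketch of what that scrambling experiment could look like (the helper and the column name are hypothetical, only meant to illustrate the idea):

import secrets
import pandas as pd

def scramble_column(col: pd.Series, n_chars: int = 8) -> pd.Series:
    # map each distinct value to a fixed random alphanumeric token,
    # so string overlap between related entries is destroyed
    mapping = {v: secrets.token_hex(n_chars // 2) for v in col.dropna().unique()}
    return col.map(mapping)

# df["position_title"] = scramble_column(df["position_title"])  # hypothetical column
# then refit the StringEncoder and OrdinalEncoder pipelines on the scrambled
# table and compare their cross-validated scores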
Hey @rcap107! Here is a bunch of questions and nitpicks :)
while being very efficient and quick to fit.

:class:`GapEncoder` provides better performance on dirty categories, while
:class:`TextEncoder` works better on free-flowing text. However, both encoders
Should we add this? How do we make it clear that TextEncoder will help bring the "mercedes" and "bmw" categories closer together than the "toyota" category, even for single words?
Suggested change:
- :class:`TextEncoder` works better on free-flowing text. However, both encoders
+ :class:`TextEncoder` works better on free-flowing text or when external context helps. However, both encoders
`tf-idf vectorization <https://en.wikipedia.org/wiki/Tf%E2%80%93idf>`_, then
follow it with a dimensionality reduction algorithm such as
`TruncatedSVD <https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html>`_
to limit the number of features: the :class:`StringEncoder` implements this
Maybe add that SVD also helps because we can't concatenate sparse vectors to dataframes when working with tabular learning tasks? Otherwise, models like logistic regression can handle sparse input without trouble
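A small illustration of that point, on toy data with default parameters (only meant to show the types and shapes involved, not part of the PR):

import pandas as pd
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["red car", "blue car", "red bike", "green bike"]
X_tfidf = TfidfVectorizer().fit_transform(texts)
print(type(X_tfidf), X_tfidf.shape)  # scipy sparse matrix, one column per term

# the SVD output is a small dense array, easy to store as regular dataframe columns
X_svd = TruncatedSVD(n_components=2).fit_transform(X_tfidf)
embeddings = pd.DataFrame(X_svd, columns=["text_00", "text_01"])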
string_encoder_pipe = clone(gap_pipe).set_params(
    **{"tablevectorizer__high_cardinality": string_encoder}
)
Some people complained that these lines were a bit complex, and didn't allow them to get the full picture by looking at this single cell. I'm in favor of replacing all clone(...).set_params calls with the full pipeline. This would also need to be done for the text encoder and minhash encoder.
Suggested change:
- string_encoder_pipe = clone(gap_pipe).set_params(
-     **{"tablevectorizer__high_cardinality": string_encoder}
- )
+ string_encoder_pipe = make_pipeline(
+     TableVectorizer(high_cardinality=string_encoder),
+     HistGradientBoostingClassifier(),
+ )
if (min_shape := min(X_out.shape)) >= self.n_components:
    self.tsvd_ = TruncatedSVD(n_components=self.n_components)
    result = self.tsvd_.fit_transform(X_out)
else:
    warnings.warn(
        f"The matrix shape is {(X_out.shape)}, and its minimum is "
        f"{min_shape}, which is too small to fit a truncated SVD with "
        f"n_components={self.n_components}. "
        "The embeddings will be truncated by keeping the first "
        f"{self.n_components} dimensions instead. "
    )
    # self.n_components can be greater than the number
    # of dimensions of result.
    # Therefore, self.n_components_ below stores the resulting
    # number of dimensions of result.
    result = X_out[:, : self.n_components].toarray()
Maybe L140 to L155 could be brought into a common utils with the text encoder, WDYT?
if self.analyzer not in ["char_wb", "char", "word"]:
    raise ValueError(f"Unknown analyzer {self.analyzer}")
TfidfVectorizer and HashingVectorizer already perform this check, I assume?
] | ||
) | ||
else: | ||
raise ValueError(f"Unknown vectorizer {self.vectorizer}.") |
Nitpick, to clarify the error
raise ValueError(f"Unknown vectorizer {self.vectorizer}.") | |
raise ValueError(f"Unknown vectorizer {self.vectorizer}. Options are 'tfidf' or 'hashing', got {self.vectorizer!r}") |
By the way, should the option be called "count" instead of "tfidf"? Since the difference is the CountVectorizer within the TfidfVectorizer
else:
    raise ValueError(f"Unknown vectorizer {self.vectorizer}.")

X_out = self.vectorizer_.fit_transform(sbd.to_numpy(X))
Nitpick: I think SingleColumnTransformer._check_single_column checks that X is either a polars or pandas series, so we don't need sbd.to_numpy. WDYT?
    # number of dimensions of result.
    result = X_out[:, : self.n_components].toarray()

self._is_fitted = True
Is this flag necessary, since we have all_outputs_ downstream? Could we use check_is_fitted(self, "all_outputs_") instead?
""" | ||
Check fitted status and return a Boolean value. | ||
""" | ||
return hasattr(self, "_is_fitted") and self._is_fitted |
Following my suggestion above
Suggested change:
- return hasattr(self, "_is_fitted") and self._is_fitted
+ return check_is_fitted(self, "all_outputs_")
This is a first draft of a PR to address #1121
I looked at GapEncoder to figure out what to do. This is a very early version just to have an idea of the kind of code that's needed.
Things left to do: