WIP add the Recipe #1064
base: main
Conversation
The example is example 10, "using the recipe".
Here are some first remarks on the example and high-level concepts of the Recipe. As the Recipe offers many new features, I think the example should be simplified and more focused on the Recipe itself.
```python
# %%
from skrub import Recipe, datasets

dataset = datasets.fetch_employee_salaries()
```
Is this a good time to change our "default" demo dataset?
I looked a bit but haven't found a good replacement yet. But as I suspect we will merge the fraud data example before this one, maybe employee salaries can be replaced with that one.
Ok, to do so we need to take care of the join operation with the recipe first, right?
Good point, so that would come later. If anyone has suggestions for another dataset for this example, I'd be happy to diversify a bit from employee salaries.
Also, in employee salaries, should we remove the "year first hired" column in the fetcher? Both here and in the TableVectorizer examples, the datetime encoder isn't useful because the feature it extracts has already been inserted in the dataset.
```python
from skrub import DatetimeEncoder
from skrub import selectors as s

recipe = recipe.add(DatetimeEncoder(), cols=s.any_date())
```
Unrelated to the Recipe: as a user, I'm quite upset that the DatetimeEncoder doesn't perform the parsing with ToDatetime() for me. Sure, uncoupling all elements makes sense from a pure computer-science perspective, but from the practitioner's (and the beginner's) point of view, this is a bit cumbersome.
We can (and I guess should) very easily add a ToDatetime inside the DatetimeEncoder.
They are two transformers because in the TableVectorizer they have to be separate: the user provides the datetime encoder but does not control the column assignments, so the datetime columns must already have been parsed in order to decide which columns get assigned to the datetime encoder.
Before, the main use case for the DatetimeEncoder was inside the TableVectorizer. But now that the Recipe makes it more practical to use on its own, adding datetime parsing so it does everything in one go makes sense (and the TableVectorizer just won't use this feature).
But the TableVectorizer also performs several other cleaning steps besides datetime parsing, so we might want either a transformer or an option on the Recipe that applies all the cleaning / preprocessing (i.e. everything in the TableVectorizer except the user-provided final transformers).
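To make the current two-step flow concrete, here is a minimal sketch reusing the `recipe.add` / selectors API from the diff above (the column name is taken from the employee salaries dataset; importing `ToDatetime` directly from skrub is an assumption for illustration):

```python
# Sketch of the current two-step flow, assuming ToDatetime is public API.
from skrub import ToDatetime, DatetimeEncoder
from skrub import selectors as s

# Step 1: parse the string column into a real datetime column.
recipe = recipe.add(ToDatetime(), cols=["date_first_hired"])
# Step 2: extract numeric features (year, month, day, ...) from it.
recipe = recipe.add(DatetimeEncoder(), cols=s.any_date())
```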
> We can (and I guess should) very easily add a ToDatetime inside the DatetimeEncoder.

That would be great IMO.

> They are two transformers because in the TableVectorizer they have to be separate: the user provides the datetime encoder but does not control the column assignments, so the datetime columns must already have been parsed in order to decide which columns get assigned to the datetime encoder.

Yes, I remember the choices that led to this design, and I agree with them.

> Before, the main use case for the DatetimeEncoder was inside the TableVectorizer. But now that the Recipe makes it more practical to use on its own, adding datetime parsing so it does everything in one go makes sense (and the TableVectorizer just won't use this feature).

Ok, if that doesn't introduce too much complexity on the TV part, I'm all for it.

> But the TableVectorizer also performs several other cleaning steps besides datetime parsing, so we might want either a transformer or an option on the Recipe that applies all the cleaning / preprocessing (i.e. everything in the TableVectorizer except the user-provided final transformers).

That's interesting; I need to refresh my memory regarding this part.

Side question: would using the TV with the Recipe make sense in general? I'm thinking about CVing the transformers and their hyper-parameters more easily.
Yes, using the TableVectorizer in the Recipe completely makes sense, and it will help tune the choice of the encoders and their hyperparameters (the choose_* functions can be arbitrarily nested). I didn't do it in this example because on this dataset the TableVectorizer does everything fine, so there would be only one step and it would be harder to showcase some features of the Recipe.
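A hedged sketch of what that nesting could look like; the `choose_from` / `choose_int` helper names and the `high_cardinality` parameter are assumptions for illustration, based on the choose_* functions mentioned above, not a confirmed API:

```python
# Hypothetical sketch: tune both the choice of encoder for high-cardinality
# columns and, nested inside one candidate, that encoder's hyperparameter.
from skrub import TableVectorizer, MinHashEncoder, GapEncoder
from skrub import choose_from, choose_int  # assumed helper names

recipe = recipe.add(
    TableVectorizer(
        high_cardinality=choose_from(
            [
                MinHashEncoder(),
                GapEncoder(n_components=choose_int(10, 100)),
            ],
            name="high_cardinality_encoder",
        )
    )
)
```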
Ok! What about showing the recipe with TV at the end? Or would that make the message less obvious?
```python
# choices.

# %%
recipe.get_cv_results_table(randomized_search)
```
This interaction between the HP tuner and the Recipe is interesting. I like that the Recipe ties different elements together and makes pragmatic assumptions about the user flow.
Would that work with another HP tuner, e.g. HalvingRandomSearchCV?
Yes. I haven't added the halving search yet because when I made the Recipe it was still experimental in scikit-learn (not sure if that's still the case), and its parameters are a bit hard for users to wrap their heads around, but at some point we should definitely add it.
At the moment it also has the grid search (although you can only use it if you don't have any continuous distributions in the hyperparameters, of course).
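For reference, the halving searches are indeed still behind an experimental flag in scikit-learn at the time of writing, which adds a small extra hurdle for users:

```python
# The halving searches must be enabled explicitly before importing them.
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingGridSearchCV, HalvingRandomSearchCV
```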
I'm also curious to see whether people using HP tuning libraries like optuna or hyperopt could use the Recipe easily, provided we know how to extract some sort of cv_results_ from their tuners.

People could try something along the lines of:

```python
model = recipe.get_pipeline()
tuner.fit(model, recipe.get_X(), recipe.get_y())
recipe.plot_parallel(tuner)
```

Of course, that would require us to know the methods used by other libraries, but it could be worth it in a subsequent iteration. WDYT?
Co-authored-by: Vincent M <[email protected]>
```python
):
    if self._has_predictor():
        pred_name = self._get_step_names()[-1]
        raise ValueError(
```
Not a high prio: should we allow more flexibility here and have estimators working as transformers? The hard part is making sure that that's what the user wants and that they are not stacking estimators by mistake.
sklego introduced this concept, which might make sense for us: https://github.com/koaning/scikit-lego/blob/main/sklego/meta/estimator_transformer.py
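For readers who don't know sklego, here is a minimal sketch of the concept (an illustrative reimplementation, not sklego's or skrub's actual code):

```python
# Wrap any estimator so its predictions become the output of transform(),
# letting it sit mid-pipeline as a feature generator.
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin, clone


class EstimatorTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y=None):
        # Fit a clone so the estimator passed by the user is left untouched.
        self.estimator_ = clone(self.estimator).fit(X, y)
        return self

    def transform(self, X):
        # Return the predictions as a single feature column.
        return np.asarray(self.estimator_.predict(X)).reshape(-1, 1)
```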
This reverts commit f351627.
Hey @jeromedockes, could you write a small TL;DR regarding the recent changes?

Yes:

Great, thanks!
```python
    Make a copy of a dataclass instance with different values for some of the attributes
    """
    return obj.__class__(
        **({f.name: getattr(obj, f.name) for f in dataclasses.fields(obj)} | fields)
    )
```
Why not:

```python
from dataclasses import asdict

obj.__class__(**(asdict(obj) | fields))
```

?
asdict recurses into attributes and makes a deep copy; here we want a shallow copy.
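A small self-contained demonstration of the difference (illustrative, not from the PR):

```python
import dataclasses
from dataclasses import asdict, dataclass


@dataclass
class Step:
    params: dict


step = Step(params={"n": 1})

# asdict recurses into containers and rebuilds them: a deep copy.
assert asdict(step)["params"] is not step.params

# The fields() comprehension keeps the original attribute objects: shallow.
shallow = {f.name: getattr(step, f.name) for f in dataclasses.fields(step)}
assert shallow["params"] is step.params
```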
Hello all 👋 Love the package, and what you are doing here!

If you don't know me, I'm one of the developers of tidymodels. I'm here to ask if you are set on the name recipe for this class. I see you have referred to this class by other names in other issues.

The reason why I ask is that we maintain an R package called recipes which produces `recipe` objects. If they do overlap, I like to think it would be in our best interest to have disjoint names, in part to improve search results online.

Best!
Dear Emil,
The name "recipe" is not cast in stone. We are experimenting with APIs and names to make the resulting code and documentation as easy as possible to read and understand.
Note that the terminology "recipe" is also used in other projects, for instance https://ibis-project.github.io/ibis-ml/
There are only a limited number of commonly understood words within a given scope, so some terminology intersection between packages is inevitable. For instance, the terminology "data frame" is used across many packages. This can arguably be a good thing, as it helps users understand links and concepts.
Anyhow, we are still experimenting a lot with the concepts here (unfortunately not everything is visible online, forgive us), so I cannot really tell where we are going to go at this point.
Best
This is still in draft mode, but I'll open the PR so we can discuss the example.
I still need to add more tests and reference documentation.