WIP add the Recipe #1064
base: main
Conversation
The example is example 10, "using the recipe".
Here are some first remarks on the example and high-level concepts of the Recipe. As the Recipe offers many new features, I think the example should be simplified and more focused on the Recipe itself.
```python
# %%
from skrub import Recipe, datasets

dataset = datasets.fetch_employee_salaries()
```
Is this a good time to change our "default" demo dataset?
I looked a bit but haven't found a good replacement yet. But as I suspect we will merge the fraud data example before this one, maybe employee salaries can be replaced with that one.
Ok, to do so we need to take care of the join operation with the recipe first, right?
Good point, so that would come later. If anyone has suggestions for another dataset for this example, I'd be happy to diversify a bit from employee salaries.
Also, in employee salaries, should we remove the "year first hired" column in the fetcher? Both here and in the TableVectorizer examples, the datetime encoder isn't useful because the feature it extracts has already been inserted in the dataset.
```python
from skrub import DatetimeEncoder
from skrub import selectors as s

recipe = recipe.add(DatetimeEncoder(), cols=s.any_date())
```
Unrelated to the Recipe: as a user, I'm quite upset that the DatetimeEncoder doesn't perform the parsing with ToDatetime() for me. Sure, uncoupling all elements makes sense from a pure computer-science perspective, but from the practitioner's (and the beginner's) point of view, this is a bit cumbersome.
We can (and I guess should) very easily add a ToDatetime inside the DatetimeEncoder.
They are two transformers because in the TableVectorizer they have to be separate: the user provides the datetime encoder but does not control the column assignments, so the datetime columns must already have been parsed in order to decide which columns get assigned to the datetime encoder.
Before, the main use case for the DatetimeEncoder was inside the TableVectorizer. But now that the Recipe makes it more practical to use on its own, adding datetime parsing so it does everything in one go makes sense (and the TableVectorizer just won't use this feature).
But the TableVectorizer also performs several other cleaning steps besides datetime parsing, so we might want either a transformer or an option on the Recipe that applies all the cleaning / preprocessing (i.e. everything in the TableVectorizer except the user-provided final transformers).
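To make the current two-step flow concrete, here is a minimal sketch reusing the `recipe.add` / selectors API from the diff above (the column name is taken from the employee salaries dataset; importing `ToDatetime` directly from skrub is an assumption for illustration):

```python
# Sketch of the current two-step flow, assuming ToDatetime is public API.
from skrub import ToDatetime, DatetimeEncoder
from skrub import selectors as s

# Step 1: parse the string column into a real datetime column.
recipe = recipe.add(ToDatetime(), cols=["date_first_hired"])
# Step 2: extract numeric features (year, month, day, ...) from it.
recipe = recipe.add(DatetimeEncoder(), cols=s.any_date())
```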
> We can (and I guess should) very easily add a ToDatetime inside the DatetimeEncoder.

That would be great IMO.

> They are two transformers because in the TableVectorizer they have to be separate: the user provides the datetime encoder but does not control the column assignments, so the datetime columns must already have been parsed in order to decide which columns get assigned to the datetime encoder.

Yes, I remember the choices that led to this design, and I agree with them.

> Before, the main use case for the DatetimeEncoder was inside the TableVectorizer. But now that the Recipe makes it more practical to use on its own, adding datetime parsing so it does everything in one go makes sense (and the TableVectorizer just won't use this feature).

Ok, if that doesn't introduce too much complexity on the TV part, I'm all for it.

> But the TableVectorizer also performs several other cleaning steps besides datetime parsing, so we might want either a transformer or an option on the Recipe that applies all the cleaning / preprocessing (i.e. everything in the TableVectorizer except the user-provided final transformers).

That's interesting; I need to refresh my memory regarding this part.

Side question: would using the TV with the Recipe make sense in general? I'm thinking about CVing the transformers and their hyper-parameters more easily.
Yes, using the TableVectorizer in the Recipe completely makes sense, and it will help tune the choice of the encoders and their hyperparameters (the choose_* functions can be arbitrarily nested). I didn't do it in this example because on this dataset the TableVectorizer does everything fine, so there would be only one step and it would be harder to showcase some features of the Recipe.
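A hedged sketch of what that nesting could look like; the `choose_from` / `choose_int` helper names and the `high_cardinality` parameter are assumptions for illustration, based on the choose_* functions mentioned above, not a confirmed API:

```python
# Hypothetical sketch: tune both the choice of encoder for high-cardinality
# columns and, nested inside one candidate, that encoder's hyperparameter.
from skrub import TableVectorizer, MinHashEncoder, GapEncoder
from skrub import choose_from, choose_int  # assumed helper names

recipe = recipe.add(
    TableVectorizer(
        high_cardinality=choose_from(
            [
                MinHashEncoder(),
                GapEncoder(n_components=choose_int(10, 100)),
            ],
            name="high_cardinality_encoder",
        )
    )
)
```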
Ok! What about showing the recipe with TV at the end? Or would that make the message less obvious?
```python
# choices.

# %%
recipe.get_cv_results_table(randomized_search)
```
This interaction between the HP tuner and the Recipe is interesting. I like that the Recipe ties different elements together and makes pragmatic assumptions about the user flow.
Would that work with another HP tuner, e.g. HalvingRandomSearchCV?
Yes. I haven't added the halving search yet because when I made the Recipe it was still experimental in scikit-learn (not sure if that's still the case), and its parameters are a bit hard for users to wrap their heads around, but at some point we should definitely add it.
At the moment it also has the grid search (although you can only use it if you don't have any continuous distributions in the hyperparameters, of course).
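For reference, the halving searches are indeed still behind an experimental flag in scikit-learn at the time of writing, which adds a small extra hurdle for users:

```python
# The halving searches must be enabled explicitly before importing them.
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingGridSearchCV, HalvingRandomSearchCV
```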
I'm also curious to see whether people using HP tuning libraries like optuna or hyperopt could use the Recipe easily, provided we know how to extract some sort of cv_results_ from their tuners.

People could try something along the lines of:

```python
model = recipe.get_pipeline()
tuner.fit(model, recipe.get_X(), recipe.get_y())
recipe.plot_parallel(tuner)
```

Of course, that would require us to know the methods used by other libraries, but it could be worth it in a subsequent iteration. WDYT?
Co-authored-by: Vincent M <[email protected]>
```python
):
    if self._has_predictor():
        pred_name = self._get_step_names()[-1]
        raise ValueError(
```
Not a high prio: should we allow more flexibility here and have estimators working as transformers? The hard part is making sure that that's what the user wants and that they are not stacking estimators by mistake.
sklego introduced this concept, which might make sense for us: https://github.com/koaning/scikit-lego/blob/main/sklego/meta/estimator_transformer.py
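For readers who don't know sklego, here is a minimal sketch of the concept (an illustrative reimplementation, not sklego's or skrub's actual code):

```python
# Wrap any estimator so its predictions become the output of transform(),
# letting it sit mid-pipeline as a feature generator.
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin, clone


class EstimatorTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y=None):
        # Fit a clone so the estimator passed by the user is left untouched.
        self.estimator_ = clone(self.estimator).fit(X, y)
        return self

    def transform(self, X):
        # Return the predictions as a single feature column.
        return np.asarray(self.estimator_.predict(X)).reshape(-1, 1)
```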
This reverts commit f351627.
Hey @jeromedockes, could you write a small TL;DR regarding the recent changes?

Yes:

Great, thanks!
```python
    Make a copy of a dataclass instance with different values for some of the attributes
    """
    return obj.__class__(
        **({f.name: getattr(obj, f.name) for f in dataclasses.fields(obj)} | fields)
    )
```
Why not:

```python
from dataclasses import asdict

obj.__class__(**(asdict(obj) | fields))
```

?
asdict recurses into attributes and makes a deep copy; here we want a shallow copy.
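A small self-contained demonstration of the difference (illustrative, not from the PR):

```python
import dataclasses
from dataclasses import asdict, dataclass


@dataclass
class Step:
    params: dict


step = Step(params={"n": 1})

# asdict recurses into containers and rebuilds them: a deep copy.
assert asdict(step)["params"] is not step.params

# The fields() comprehension keeps the original attribute objects: shallow.
shallow = {f.name: getattr(step, f.name) for f in dataclasses.fields(step)}
assert shallow["params"] is step.params
```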
Hello all 👋 Love the package, and what you are doing here!

If you don't know me, I'm one of the developers of tidymodels. I'm here to ask if you are set on the name recipe for this class. I see you have referred to this class by other names in other issues.

The reason why I ask is that we maintain an R package called recipes which produces `recipe` objects. If they do overlap, I like to think it would be in our best interest to have disjoint names, in part to improve search results online.

Best!
Dear Emil,
The name "recipe" is not cast in stone. We are experimenting with APIs and names to make the resulting code and documentation as easy as possible to read and understand.
Note that the terminology "recipe" is also used in other projects, for instance https://ibis-project.github.io/ibis-ml/
There are only a limited number of commonly understood words within a given scope, so some terminology intersection between packages is inevitable. For instance, the terminology "data frame" is used across many packages. This can arguably be a good thing, as it helps users understand links and concepts.
Anyhow, we are still experimenting a lot with the concepts here (unfortunately not everything is visible online, forgive us), so I cannot really tell where we are going to go at this point.
Best
This is still in draft mode, but I'll open the PR so we can discuss the example.
I still need to add more tests and reference documentation.