Merge ebm with different subset of variables #564
Hi @sadsquirrel369 -- This is supported, but is currently a bit more complicated than it should be. In the future we want to support scikit-learn's warm_start functionality, which will make this simpler. Today, you need to do the following:
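The concrete steps from this reply were not preserved in the thread, but a minimal sketch of the general approach discussed below (fit a mains-only model, fit a second model that carries the pairs, then combine them with `merge_ebms`) might look like the following; `X`, `y`, and the feature indices are placeholders, not the original instructions:

```python
from interpret.glassbox import ExplainableBoostingClassifier, merge_ebms

# X, y: placeholder training data. Both fits see the same columns so that
# feature indices and bins line up when the models are merged later.

# 1) Mains-only model: no interactions, excluding the features that should
#    only ever be used inside pairs (indices are illustrative).
ebm_mains = ExplainableBoostingClassifier(interactions=0, exclude=[2, 3])
ebm_mains.fit(X, y)

# 2) Interaction model: fit the pairs of interest on the same columns.
ebm_pairs = ExplainableBoostingClassifier(interactions=[(2, 3)])
ebm_pairs.fit(X, y)

# 3) Combine both models into a single EBM.
ebm_combined = merge_ebms([ebm_mains, ebm_pairs])
```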
@paulbkoch Thanks for the prompt reply. So by excluding variables (with the parameter) in the "mains" model fitting, will all of the feature names be in the model.feature_names_in_ variable, irrespective of whether they were in the original dataset?
Hi @sadsquirrel369 -- Features that are excluded will be recorded in the model.feature_names_in_ attribute, but they will not be used for prediction. Anything that is used for prediction is called a "term" in EBMs. If you print the model.term_names_ you'll see a list of everything that is used for prediction. For some datatypes like numpy arrays there are no column names and features are determined by their index, so it's important in these cases that both the features used in mains and the features used in pairs are all in the same dataset, even if they are not used in the model.
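For example, continuing from the sketch above (the attribute names are the ones mentioned in this reply):

```python
# Excluded features are still recorded, but only terms are used for prediction.
print(ebm_mains.feature_names_in_)  # every feature seen at fit time
print(ebm_mains.term_names_)        # only the mains/pairs the model predicts with
```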
Thanks for the help!
Hi @paulbkoch, when trying to merge the mains model with an interaction model I get this error from `merge_ebms(models)` in `interpret/glassbox/ebm/merge_ebms.py`: `Exception: Inconsistent bin types within a model.` It appears the issue stems from some variables used in the interaction not having bin values in the main models, because they were excluded. The interaction models work correctly when the variables are present in both the main and interaction models. However, some variables are only beneficial when used in an interaction and not on their own (for example, in vehicle classification the combination of weight and power can help identify different vehicle types).
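A minimal sketch of the setup that appears to trigger this, assuming the pair uses a feature that was excluded from the mains model (indices illustrative):

```python
# Feature 3 is excluded from the mains model, so it reportedly ends up without
# bin definitions there, while the interaction model bins it normally; merging
# the two then fails with "Inconsistent bin types within a model."
ebm_mains = ExplainableBoostingClassifier(interactions=0, exclude=[3])
ebm_mains.fit(X, y)

ebm_pairs = ExplainableBoostingClassifier(interactions=[(2, 3)])
ebm_pairs.fit(X, y)

merge_ebms([ebm_mains, ebm_pairs])  # raises the Exception in this scenario
```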
Hi @sadsquirrel369 -- This is really interesting. It appears you have a model where one of the feature mains is considered categorical or continuous, but a pair using the same feature is considered the opposite. Are you doing any re-merging, where you first merge a set of models and then merge that result again with some other models, or is it happening on the first merge when the main and interaction models are combined? You can probably avoid this error by explicitly setting the feature_types parameter on all calls to the ExplainableBoostingClassifier constructor, thereby ensuring the types are identical in all models being merged. This is something we could handle better within merge_ebms, though. We can convert a feature from categorical into continuous during merges, but perhaps this isn't completely robust to more complicated scenarios involving pairs.
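A sketch of that workaround, assuming the same placeholder data; the type list is illustrative and its order follows the columns of `X`:

```python
# Pin the feature types so every model being merged bins the features the same
# way, instead of letting each fit infer them independently.
shared_feature_types = ["continuous", "continuous", "continuous", "nominal"]

ebm_mains = ExplainableBoostingClassifier(
    feature_types=shared_feature_types, interactions=0
)
ebm_pairs = ExplainableBoostingClassifier(
    feature_types=shared_feature_types, interactions=[(2, 3)]
)
```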
I'm currently encountering this same error when trying to merge two EBMs. I have ~10 features and I'm wondering if there's a streamlined way to specify all of these feature types? I'm getting the inconsistent bins error on the second merge (basically I'm trying to batch-train an EBM model, since my data is larger than what can fit in memory). I specify the data types using the feature_types parameter with the below snippet:
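(The snippet itself was not captured in this extract; a hypothetical reconstruction, deriving the type list from the dataframe's dtypes, might look like this, where `ordinal_cols` is an illustrative stand-in rather than the original code:)

```python
import pandas as pd

# Hypothetical reconstruction: map each pandas dtype to an EBM feature type.
ordinal_cols = {"size"}  # illustrative set of ordered categorical columns
feature_types = []
for col, dtype in X.dtypes.items():
    if pd.api.types.is_numeric_dtype(dtype):
        feature_types.append("continuous")
    elif col in ordinal_cols:
        feature_types.append("ordinal")
    else:
        feature_types.append("nominal")

ebm = ExplainableBoostingClassifier(feature_types=feature_types)
```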
And when I try to implement the workaround suggested in issue #576, I still get the same error.
Hi @ANNIKADAHLMANN-8451 -- Can you verify that the dtypes in the pandas dataframes match? Pandas auto-infers dtypes too, and if one of the datasets has a single sample with a string that can't be represented as a float, it would mismatch the dtypes in pandas; alternatively, it could give one dataset an int64 dtype and another float64. Either of these conditions would then cause a mismatch in the dtypes variable above, and then in the models. If that doesn't present a solution, can you please output the value of ebm.feature_types_in_ for both models and post it here?

Side note 1: From a previous issue on Azure Synapse EBMs, I'm aware that 8451 uses Azure in some capacity. I should probably put on my Microsoft sales hat and mention that Azure does have VMs with 4TB of memory available. It might allow you to avoid batch processing.

Side note 2, which isn't related to this issue: I think you should replace the 'ordinal' string above with a list that contains the ordinal strings. For ordinals like ["low", "medium", "high"] the order cannot be inferred, so most of the time you need to specify it. The default for 'ordinal' is to sort the values alphabetically, but that's rarely what you want. I've recently removed 'ordinal' from the documentation as an option, and plan to deprecate it at some point.
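For example, the explicit-ordering form from side note 2 might look like this (column positions and category values are illustrative):

```python
# Pass the ordered category values instead of the bare 'ordinal' string, so the
# order does not fall back to alphabetical sorting.
feature_types = [
    "continuous",
    ["low", "medium", "high"],  # explicit ordinal ordering for this column
    "nominal",
]
ebm = ExplainableBoostingClassifier(feature_types=feature_types)
```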
Just to verify, am I thinking about the conversions correctly (pandas types -> EBM types)?
I don't think I would be able to run a 4TB VM for cost purposes, but I definitely should look into optimizing my cluster! And I just changed those ordinal values accordingly; thank you for the side note :)
Your mapping makes sense to me, although interpret is flexible enough to treat floats and ints as nominal/ordinal if you specify that. If you are asking what interpret uses when 'auto' is specified, the default behavior for EBMs is that float64 and int64 are continuous. For objects or strings, if all the feature values can be converted to floats, then it treats them as continuous too. Anything with non-float representable content is 'nominal'.
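A quick way to see what 'auto' resolved to, using the feature_types_in_ attribute mentioned earlier in the thread:

```python
# With feature_types left at the default 'auto', the inferred types can be
# inspected on the fitted model.
ebm = ExplainableBoostingClassifier()
ebm.fit(X, y)
print(ebm.feature_types_in_)  # e.g. ['continuous', 'nominal', ...]
```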
That makes sense! Thank you for the details and the timely response. We figured out my bug for now: I was calling merge_ebms() in a for loop, which was causing the error; the solution I found was to append the EBMs to a list and then call merge_ebms() once at the end.
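A sketch of that pattern, with `batches` standing in for however the data is chunked:

```python
# Fit one EBM per batch, collect the models, and merge once at the end instead
# of calling merge_ebms inside the loop.
ebms = []
for X_batch, y_batch in batches:  # placeholder iterable of (X, y) chunks
    ebm = ExplainableBoostingClassifier(
        feature_types=shared_feature_types  # same explicit types for every batch
    )
    ebm.fit(X_batch, y_batch)
    ebms.append(ebm)

merged = merge_ebms(ebms)
```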
I am fitting a model with a subset of variables with no interaction present. I now want to fit interactions with a larger subset of variables and merge it with the original model.
The merge_ebms method does not allow for this in its current form. Is there not a smart way to combine the components of the two underlying models into a new, clean model instance?