
Merge ebm with different subset of variables #564

Open
sadsquirrel369 opened this issue Jul 24, 2024 · 11 comments

@sadsquirrel369

I am fitting a model on a subset of variables with no interactions present. I now want to fit interactions with a larger subset of variables and merge it with the original model.

The merge_ebms method does not allow for this in its current form. Is there a smart way to build a new, clean model instance from the components of the two underlying models?

@paulbkoch
Collaborator

Hi @sadsquirrel369 -- This is supported, but it is currently a bit more complicated than it should be. In the future we want to support scikit-learn's warm_start functionality, which will make this simpler. Today, you need to do the following (a rough sketch follows the list):

  1. Make a dataframe or numpy array with a superset of all the features you'll need for both mains and interactions.
  2. Set interactions=0 and exclude any individual features that you don't want considered in the mains.
  3. Fit the mains model.
  4. Use exclude to exclude all mains, and also exclude any additional pairs you don't want to be considered. Set interactions to either a number for automatic detection, or a list of the specific interactions. Call fit with the init_score parameter set to the mains model so that it boosts the pairs on top of the mains.
  5. Call merge_ebms on the two EBMs. There are more details to this, which are covered in our docs here: https://interpret.ml/docs/python/examples/custom-interactions.html
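
A minimal, hedged sketch of those steps, assuming a feature matrix X and labels y that already contain every column needed. The column names and the specific interaction are illustrative, and the sketch assumes exclude accepts a special "mains" value to drop all main terms (per step 4 and the linked docs):

    from interpret.glassbox import ExplainableBoostingClassifier, merge_ebms

    pair_only_features = ["weight", "power"]   # hypothetical columns used only inside pairs

    # Steps 1-3: mains-only model; keep the pair-only columns out of the mains
    ebm_mains = ExplainableBoostingClassifier(interactions=0, exclude=pair_only_features)
    ebm_mains.fit(X, y)

    # Step 4: pairs-only model, boosted on top of the mains via init_score
    ebm_pairs = ExplainableBoostingClassifier(
        interactions=[("weight", "power")],    # or an integer for automatic detection
        exclude="mains",
    )
    ebm_pairs.fit(X, y, init_score=ebm_mains)

    # Step 5: combine the two models into a single EBM
    ebm_full = merge_ebms([ebm_mains, ebm_pairs])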

@sadsquirrel369
Author

@paulbkoch Thanks for the prompt reply. So, by excluding variables (with the exclude parameter) when fitting the "mains" model, will all of the feature names still appear in the model.feature_names_in_ attribute, irrespective of whether they were actually used by the model?

@paulbkoch
Collaborator

Hi @sadsquirrel369 -- Features that are excluded will be recorded in the model.feature_names_in_ attribute, but they will not be used for prediction. Anything that is used for prediction is called a "term" in EBMs; if you print model.term_names_ you'll see a list of everything that is used for prediction. For some data types like numpy arrays there are no column names and features are identified by their index, so in those cases it's important that the features used in mains and the features used in pairs are all present in the same dataset, even if they are not used in the model.
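
For instance, continuing the hypothetical mains-only model from the sketch above:

    print(ebm_mains.feature_names_in_)  # every feature seen at fit time, including excluded ones
    print(ebm_mains.term_names_)        # only the terms actually used for prediction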

@sadsquirrel369
Author

Thanks for the help!

@sadsquirrel369
Author

Hi @paulbkoch,

When trying to merge the mains model with an interaction model I get this error:

Inconsistent bin types within a model:

    ---------------------------------------------------------------------------
    Exception                                 Traceback (most recent call last)
    /var/folders/3b/lp8_hqx917138jd8rxttmzjc0000gn/T/ipykernel_35823/3985698511.py in <module>
    ----> 1 merge_ebms([loaded_model, loaded_int_model])

    /opt/homebrew/Caskroom/miniforge/base/lib/python3.9/site-packages/interpret/glassbox/ebm/merge_ebms.py in merge_ebms(models)
        392     for model in models:
        393         if any(len(set(map(type, bin_levels))) != 1 for bin_levels in model.bins_):
    --> 394             raise Exception("Inconsistent bin types within a model.")
        395
        396         feature_bounds = getattr(model, "feature_bounds_", None)

    Exception: Inconsistent bin types within a model.

It appears the issue stems from some variables used in the interactions not having bin values in the mains model because they were excluded there. The interaction models work correctly when the variables are present in both the mains and interaction models. However, some variables are only beneficial when used in an interaction and not on their own (for example, in vehicle classification, the combination of weight and power can help identify different vehicle types).

@paulbkoch
Collaborator

Hi @sadsquirrel369 -- This is really interesting. It appears you have a model where one of the feature mains is treated as categorical or continuous, but a pair using the same feature is treated as the opposite. Are you doing any re-merging, where you first merge a set of models and then merge that result again with some other models, or is it happening on the first merge when the mains and interaction models are combined?

You can probably avoid this error by explicitly setting the feature_types parameter on all calls to the ExplainableBoostingClassifier constructor, thereby ensuring the types are identical in all models being merged. This is something we could handle better within merge_ebms, though. We can convert a feature from categorical into continuous during merges, but perhaps this isn't completely robust to more complicated scenarios involving pairs.
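
As a hedged sketch (the type list is illustrative and must match the column order of your data, and exclude="mains" is the same assumption as in the earlier sketch):

    from interpret.glassbox import ExplainableBoostingClassifier

    feature_types = ["continuous", "continuous", "nominal"]  # one entry per column

    ebm_mains = ExplainableBoostingClassifier(feature_types=feature_types, interactions=0)
    ebm_pairs = ExplainableBoostingClassifier(feature_types=feature_types, exclude="mains")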

@ANNIKADAHLMANN-8451

ANNIKADAHLMANN-8451 commented Oct 10, 2024

I'm currently encountering this same error when trying to merge two EBMs. I have ~10 features and I'm wondering if there's a streamlined way to specify all of these feature types. I'm getting the inconsistent bins error on the second merge (basically, I'm trying to batch-train an EBM model since my data is larger than what can fit in memory). I specify the data types using the feature_types parameter with the snippet below:

    dtypes = [
        'continuous' if d == 'float64'
        else None if d == 'int64'
        else 'ordinal' if col in ordinal_types
        else 'nominal'
        for d, col in zip(X_trn.dtypes, X_trn.columns)
    ]

And when I try to implement the workaround suggested in issue #576, I still get the same error:

for attr, val in clf1.get_params(deep=False).items():
    if not hasattr(clf, attr):
        setattr(clf, attr, val)

@paulbkoch
Collaborator

Hi @ANNIKADAHLMANN-8451 -- Can you verify that the dtypes in the pandas dataframes match? Pandas auto-infers dtypes too, and if one of the datasets has a single sample with a string that can't be represented as a float, its dtype will differ in pandas; alternatively, pandas could give one dataset an int64 dtype and the other float64. Either of these conditions would cause a mismatch in the dtypes variable above, and then in the models.

If that doesn't present a solution, can you please output the value of ebm.feature_types_in_ for both models and post that here.
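
A quick way to run both checks, with illustrative names for the two training frames and the two fitted models:

    # Compare pandas dtypes between the two training sets; empty output means they agree.
    print(X_trn_1.dtypes.compare(X_trn_2.dtypes))

    # Compare the binned feature types the two fitted EBMs ended up with.
    print(ebm_1.feature_types_in_)
    print(ebm_2.feature_types_in_)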

Side note 1: From a previous issue on Azure Synapse EBMs, I'm aware that 8451 uses Azure in some capacity. I should probably put on my Microsoft sales hat and mention that Azure does have VMs with 4TB of memory available. It might allow you to avoid batch processing.

Side note 2, which isn't related to this issue: I think you should replace the 'ordinal' string above with a list that contains the ordinal strings. For ordinals like ["low", "medium", "high"] the order cannot be inferred, so most of the time you need to specify it. The default for 'ordinal' is to sort them alphabetically, but that's rarely what you want. I've recently removed 'ordinal' from the documentation as an option, and plan to deprecate it at some point.
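
A hedged illustration of that, for a hypothetical three-column dataset with made-up category strings:

    from interpret.glassbox import ExplainableBoostingClassifier

    feature_types = [
        "continuous",               # e.g. a float64 column
        ["low", "medium", "high"],  # explicit category order instead of the 'ordinal' string
        "nominal",                  # unordered categorical
    ]
    ebm = ExplainableBoostingClassifier(feature_types=feature_types)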

@ANNIKADAHLMANN-8451

Just to verify that I'm thinking about the conversions correctly (pandas types -> EBM types):

  • float64 -> continuous
  • int64 -> None
  • ordinal -> list of str (in rank order of the categories)
  • object (where order doesn't matter) -> nominal

I don't think I would be able to run a 4TB VM for cost reasons, but I definitely should look into optimizing my cluster! And I just changed those ordinal values accordingly; thank you for the side note :)

@paulbkoch
Collaborator

Your mapping makes sense to me, although interpret is flexible enough to treat floats and ints as nominal/ordinal if you specify that. If you are asking what interpret uses when 'auto' is specified, the default behavior for EBMs is that float64 and int64 are continuous. For objects or strings, if all the feature values can be converted to floats, then it treats them as continuous too. Anything with non-float representable content is 'nominal'.
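
A small, hedged way to see the 'auto' defaults described above; the toy data and the expected output are assumptions based on this description, not verified output:

    import pandas as pd
    from interpret.glassbox import ExplainableBoostingClassifier

    n = 100
    X = pd.DataFrame({
        "ints": list(range(n)),                       # int64
        "floats": [i / 10 for i in range(n)],         # float64
        "float_strings": [str(i) for i in range(n)],  # strings that all parse as floats
        "colors": ["red", "blue"] * (n // 2),         # strings that do not parse as floats
    })
    y = [0, 1] * (n // 2)

    ebm = ExplainableBoostingClassifier().fit(X, y)
    print(ebm.feature_types_in_)
    # expected per the comment above: ['continuous', 'continuous', 'continuous', 'nominal']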

@ANNIKADAHLMANN-8451

That makes sense! Thank you for the details and the timely response. We figured out my bug for now: I was calling merge_ebms() in a for loop, which was causing the error, and the solution I found was to append the EBMs to a list and then call merge_ebms() once at the end.
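
A hedged sketch of that pattern, assuming a hypothetical batches iterable of (X, y) chunks and the dtypes list built earlier in this thread:

    from interpret.glassbox import ExplainableBoostingClassifier, merge_ebms

    ebms = []
    for X_batch, y_batch in batches:   # hypothetical iterable of data chunks
        ebm = ExplainableBoostingClassifier(feature_types=dtypes)
        ebm.fit(X_batch, y_batch)
        ebms.append(ebm)

    merged = merge_ebms(ebms)          # one merge at the end, not one per loop iteration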
