
Input data type inference behavior change in 1.14.1 #1150

Open
hopper-signifyd opened this issue Dec 31, 2024 · 4 comments

Comments

@hopper-signifyd

Hello and thanks for this wonderful library! For the past 18 months, we've been using version 1.14.0 to convert an sklearn pipeline ending in an XGBClassifier to ONNX. Everything has worked great. Recently we needed to upgrade to newer versions of onnx and onnxruntime. Upgrading to the latest versions of those libraries, as well as the latest version of this library, resulted in score mismatches between the sklearn pipeline version of our model and the ONNX version. After much head scratching and searching, I narrowed the issue down to version 1.14.1 of skl2onnx. If I keep the old versions of onnxruntime and onnx that we've been using for the last 18 months and only switch from skl2onnx==1.14.0 to skl2onnx==1.14.1, I can reproduce the score mismatch. (I can also reproduce the issue with any newer version of onnx, onnxruntime, or skl2onnx>=1.14.1.)

After inspecting the models in Netron, it looks like the underlying structure has changed a bit. For context, our sklearn model pipeline takes in a combination of float64 and string inputs. The string inputs are all one-hot encoded by the model pipeline. The float64 inputs are run through the venerable sklearn passthrough transformer. These values are then fed to the XGBClassifier.

In version 1.14.0, the numeric float64 inputs feed into a Concat node and then into a Cast node which converts them all to float32. The string inputs go through one-hot encoding -> Concat -> Reshape and then meet up with the numeric inputs at a Concat node.

In version 1.14.1, however, the numeric inputs are not cast to float32; instead, the Reshape output of the string inputs is cast to float64.

I believe this indicates that the input to my TreeEnsembleClassifier node in 1.14.0 was an array of float32, but in 1.14.1 (and beyond) it's an array of float64. (I tried loading the model into memory and running shape inference to confirm the data types, but it wasn't working the way I expected.)

For a sample dataset of 1k rows, this seemingly minor change results in 33% of the scores differing by more than 1e-5 between the sklearn and ONNX versions of the model. The greatest difference I observed with my small test dataset was 0.04.
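For anyone wondering how such a small change produces visible score differences: tree split thresholds are stored as float32, so an input value that sits between a threshold and its exact float64 value can take a different branch depending on whether the comparison happens in float64 or after a downcast. A sketch with illustrative values (not taken from our model):

```python
import numpy as np

# A tree split threshold as stored in the ONNX model (float32).
threshold = np.float32(0.1)  # actually ~0.100000001490116...

# A float64 input that is strictly greater than the threshold, but whose
# nearest float32 neighbour is exactly the threshold value.
x64 = np.float64(0.100000003)

# Comparison in float64 (the 1.14.1 graph feeds float64 into the tree):
goes_right_f64 = bool(x64 > np.float64(threshold))

# Comparison after casting the input to float32 first (the 1.14.0 graph):
goes_right_f32 = bool(np.float32(x64) > threshold)

print(goes_right_f64, goes_right_f32)  # -> True False
```

One flipped branch decision per affected sample is enough to move a leaf score, which is consistent with differences up to 0.04 on individual rows.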

Workarounds:

  • Change the data type on the numeric inputs from float64 to float32 upfront. This resolves the score mismatch issue, but won't be accepted by our model serving environment (it insists on passing float64 for reasons outside of our control).
  • Instead of using passthrough in the sklearn model, I created a custom sklearn transformer that converts its input to float32. I registered a corresponding ONNX converter which translates this into a basic Cast node. This works.

It's been 18 months since the release of 1.14.1 and I couldn't find any similar issues. Is this a bug? Or were we simply getting lucky before, with scores matching perfectly between sklearn and ONNX on 1.14.0? FWIW we pass float64 to the sklearn model when generating scores, so it seems wrong that we'd get one score back from the sklearn model but a different score back from the ONNX model when passing in the same float64 inputs.

Can you provide any insight into what caused this change? I've pored over the code in this library and the changes in the 1.14.1 release, and it's not immediately obvious to me which one caused it. My guess is that it's this one.

Also, I think this commit was included in that release even though it wasn't in the release notes.

@xadupre
Collaborator

xadupre commented Jan 8, 2025

The release notes only include significant changes. I did not expect the changes you mention to have an impact, but I obviously made a mistake. Is it possible to get a short example I can use to reproduce your discrepancies and see how I can fix it? In ai.onnx.ml v3, doubles are not fully supported by TreeEnsembleClassifier. v5 fuses TreeEnsembleClassifier and TreeEnsembleRegressor into a single operator; switching to it would give us more freedom.

@hopper-signifyd
Author

Thanks for the reply!

When converting a model from sklearn to ONNX, does the conversion code know the type associated with the input? If so, could we add a Cast node to convert float64 to float32? The docs for TreeEnsembleClassifier seem to indicate that tensor(double) is a valid input type.

v5 fuses both TreeEnsembleClassifier and TreeEnsembleRegressor into a single operator. Switching to this would give us more freedom.

You're referring to TreeEnsemble, no? Updating the library to support that makes sense to me.

@xadupre
Collaborator

xadupre commented Jan 8, 2025

We know the input type at conversion time, and it is float32. The Cast may not be added, but then users must always use float32 when running the ONNX model. I'm referring to TreeEnsemble. This new operator also supports rules such as x in {set of values}. The fact that TreeEnsembleClassifier did not support float64 made the conversion impossible in some cases. That should be better, but I need some time to update the library.

@hopper-signifyd
Copy link
Author

Got it. Thanks for the consideration. No rush from me on switching to TreeEnsemble. For our current use case, we're just adding a Cast node to float32, so everything is working at the moment for us.
