Category encoding support beyond one-hot #124

Open
TijmenvanderKemp opened this issue Mar 31, 2021 · 1 comment
Labels
enhancement New feature or request

Comments

@TijmenvanderKemp

Is your feature request related to a problem? Please describe.
I have a dataset with a lot of categorical data. At the moment I do all my training and prediction in Python, but I'm eager to move to Java. I use James-Stein encoding on many of my columns, and I'm looking for a way to do this in Tribuo.

Describe the solution you'd like
Could you help me think of a good way to realise this? The logical place, in my mind, is a new FieldProcessor similar to DoubleProcessor, but one that applies an encoding to the value instead of parsing it. I think it would also be very handy to have this captured in the provenance system.

Describe alternatives you've considered
I've considered keeping my encoders in Python and writing an API to reach them, but it feels like the platform is very close to being able to do this itself, since I think one-hot encoding is already possible in the form of the IdentityProcessor.

@TijmenvanderKemp added the enhancement label Mar 31, 2021
@Craigacp
Member

Craigacp commented Mar 31, 2021

Unfortunately, I don't think the current Tribuo APIs have a good way of doing this. RowProcessor operates on a single row at a time, so it never has the view of the full dataset that it would need to compute the Output mean conditioned on a specific feature value. Additionally, the feature transformation infrastructure only considers a single feature at a time, so there is no way to make its behaviour conditional on the Output.
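
To make the "full dataset" requirement concrete, here is a minimal plain-Java sketch of a shrunken target-mean encoder in the spirit of James-Stein encoding. It is not Tribuo code: the class `ShrunkMeanEncoder` is hypothetical, and the shrinkage weight B = var(y_i) / (var(y_i) + var(y)) is one assumed choice among several. The point is only that the constructor must see every (category, target) pair before any single value can be encoded.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch, not part of Tribuo: a shrunken per-category target mean. */
final class ShrunkMeanEncoder {
    private final Map<String, Double> encoding = new HashMap<>();
    private final double globalMean;

    /** Fitting needs all (category, target) pairs at once, not one row at a time. */
    ShrunkMeanEncoder(List<String> categories, List<Double> targets) {
        if (categories.isEmpty() || categories.size() != targets.size()) {
            throw new IllegalArgumentException("Need equally sized, non-empty inputs");
        }
        // Accumulate {count, sum, sum of squares} globally and per category.
        Map<String, double[]> stats = new HashMap<>();
        double sum = 0.0;
        double sumSq = 0.0;
        for (int i = 0; i < targets.size(); i++) {
            double y = targets.get(i);
            sum += y;
            sumSq += y * y;
            double[] s = stats.computeIfAbsent(categories.get(i), k -> new double[3]);
            s[0] += 1.0;
            s[1] += y;
            s[2] += y * y;
        }
        int n = targets.size();
        globalMean = sum / n;
        double globalVar = Math.max(0.0, sumSq / n - globalMean * globalMean);

        for (Map.Entry<String, double[]> e : stats.entrySet()) {
            double[] s = e.getValue();
            double mean = s[1] / s[0];
            double var = Math.max(0.0, s[2] / s[0] - mean * mean);
            // Assumed weight: noisier categories are pulled harder to the global mean.
            double denom = var + globalVar;
            double b = denom == 0.0 ? 0.0 : var / denom;
            encoding.put(e.getKey(), (1.0 - b) * mean + b * globalMean);
        }
    }

    /** Categories not seen during fitting fall back to the global mean. */
    double encode(String category) {
        return encoding.getOrDefault(category, globalMean);
    }

    public static void main(String[] args) {
        ShrunkMeanEncoder enc = new ShrunkMeanEncoder(
                List.of("a", "a", "b", "b"), List.of(1.0, 1.0, 0.0, 4.0));
        System.out.println("a -> " + enc.encode("a"));
        System.out.println("b -> " + enc.encode("b"));
    }
}
```

A per-row processor has no place to accumulate the counts and sums above, and a per-feature transformation never sees the targets.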

This particular encoding would fit better into Tribuo's transformation system (org.tribuo.transform) than into an inbound ETL step (e.g. RowProcessor), because that makes correct usage easier to ensure. If the encoding lives inside TransformTrainer, it is fitted only on the data passed to training, so you can't leak information from test to train by applying it during ETL and then splitting the dataset with TrainTestSplitter or CrossValidation. But as I mentioned above, the transformation system doesn't currently have a way to do that kind of dependent transform. We're thinking about how to evolve its design to allow transformations like PCA, so I'll make sure we include this kind of encoding as another design point.
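
The leakage concern can be illustrated with a small plain-Java sketch. It uses no Tribuo classes; `LeakageSketch` and `fitMeans` are hypothetical names, and a plain per-category mean stands in for the real encoding. Fitting before the split lets the held-out row's target change the encoding that the training rows receive.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch, not part of Tribuo: encode-then-split versus split-then-encode. */
final class LeakageSketch {
    /** Per-category mean of the target over rows [from, to). */
    static Map<String, Double> fitMeans(List<String> cats, List<Double> ys, int from, int to) {
        Map<String, double[]> acc = new HashMap<>(); // category -> {count, sum}
        for (int i = from; i < to; i++) {
            double[] s = acc.computeIfAbsent(cats.get(i), k -> new double[2]);
            s[0] += 1.0;
            s[1] += ys.get(i);
        }
        Map<String, Double> means = new HashMap<>();
        acc.forEach((k, s) -> means.put(k, s[1] / s[0]));
        return means;
    }

    public static void main(String[] args) {
        List<String> cats = List.of("a", "a", "b", "b", "a");
        List<Double> ys = List.of(1.0, 3.0, 0.0, 2.0, 11.0);
        int trainEnd = 4; // rows 0..3 are training data, row 4 is held out

        // Fitted during ETL, before splitting: the held-out target is included.
        Map<String, Double> leaky = fitMeans(cats, ys, 0, cats.size());
        // Fitted after splitting, on the training rows only.
        Map<String, Double> clean = fitMeans(cats, ys, 0, trainEnd);

        System.out.println("fit on all rows:   a -> " + leaky.get("a"));
        System.out.println("fit on train only: a -> " + clean.get("a"));
    }
}
```

The two fits disagree on category `a` purely because of the held-out row, which is the information that would leak if the encoding were applied before the split.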

If you want to do this today and are willing to manage the train/test data leakage issues yourself, then I think the simplest way is to write a DataSource which accepts another DataSource and performs the transformation on construction.
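
The shape of that wrapper might look roughly like the sketch below. It is plain Java with stand-in types, not Tribuo's API: `Row`, `EncodedRow` and `EncodingSource` are hypothetical, a plain `Iterable` stands in for the wrapped source, and a simple per-category mean stands in for the James-Stein formula. A real Tribuo version would implement org.tribuo.DataSource and would have to supply its own provenance, which this sketch does not attempt.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for one example: a categorical field and a numeric target. */
final class Row {
    final String category;
    final double target;

    Row(String category, double target) {
        this.category = category;
        this.target = target;
    }
}

/** Hypothetical stand-in for an example whose categorical field has been encoded. */
final class EncodedRow {
    final double feature;
    final double target;

    EncodedRow(double feature, double target) {
        this.feature = feature;
        this.target = target;
    }
}

/** Wraps another source of rows and performs the encoding eagerly on construction. */
final class EncodingSource implements Iterable<EncodedRow> {
    private final List<EncodedRow> encoded = new ArrayList<>();

    EncodingSource(Iterable<Row> inner) {
        // First pass: buffer the rows and accumulate per-category {count, sum}.
        List<Row> rows = new ArrayList<>();
        Map<String, double[]> acc = new HashMap<>();
        for (Row r : inner) {
            rows.add(r);
            double[] s = acc.computeIfAbsent(r.category, k -> new double[2]);
            s[0] += 1.0;
            s[1] += r.target;
        }
        // Second pass: replace each category with the mean target of that category.
        for (Row r : rows) {
            double[] s = acc.get(r.category);
            encoded.add(new EncodedRow(s[1] / s[0], r.target));
        }
    }

    @Override
    public Iterator<EncodedRow> iterator() {
        return Collections.unmodifiableList(encoded).iterator();
    }

    public static void main(String[] args) {
        List<Row> rows = List.of(new Row("a", 1.0), new Row("a", 3.0), new Row("b", 5.0));
        for (EncodedRow r : new EncodingSource(rows)) {
            System.out.println(r.feature + " -> " + r.target);
        }
    }
}
```

Because the statistics are fixed at construction, wrapping the full dataset and splitting afterwards reintroduces the leakage described above; wrapping only the training source avoids it, but applying the same fitted statistics to the test source is then the caller's responsibility.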
