Category encoding support beyond one-hot #124

TijmenvanderKemp · 2021-03-31T11:05:35Z

Is your feature request related to a problem? Please describe.
I have a dataset with a lot of categorical data. At the moment, I'm doing all my training and predicting in Python, but I'm eager to make the step to Java. I use James-Stein encoding on many of my columns. I'm looking for a way to do this in Tribuo.

Describe the solution you'd like
Could you help me think of a good way to realise this? To my mind the logical place is a FieldProcessor like the DoubleProcessor, except that instead of parsing the value it applies an encoding to produce a numeric value. I think it would also be very handy to have this recorded in the provenance system.

Describe alternatives you've considered
I've considered keeping my encoders in Python and writing an API to reach them, but it feels like the platform is so close to being able to do this, since I think one-hot encoding is already possible in the form of the IdentityProcessor.

Craigacp · 2021-03-31T13:49:34Z

Unfortunately I think the current Tribuo APIs don't have a good way of doing this. RowProcessor operates on a single row at a time, so it never has the view of the full dataset that it would need to compute the Output mean conditioned on a specific feature value. Additionally, the feature transformation infrastructure only considers a single feature at a time, so there is no way to make its behaviour conditional on the Output.
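
To make that constraint concrete, here is a minimal sketch in plain Java of an output-mean encoder with shrinkage. It is not Tribuo API: the class name `ShrunkMeanEncoder`, the `weight` parameter, and the smoothing formula are all assumptions made for illustration, and the formula is a simplified relative of James-Stein encoding rather than the exact estimator. The point is that the constructor has to see every row of the column together with its output before a single value can be encoded, which a row-at-a-time processor cannot do.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, not part of Tribuo: encodes a category as a shrunken mean of the output. */
final class ShrunkMeanEncoder {
    private final Map<String, Double> encoding = new HashMap<>();
    private final double globalMean;

    /**
     * Needs the whole training column and all of its outputs up front.
     * The weight is an assumed smoothing strength: larger values pull
     * small categories harder towards the global mean.
     */
    ShrunkMeanEncoder(List<String> categories, List<Double> outputs, double weight) {
        if (categories.size() != outputs.size() || categories.isEmpty()) {
            throw new IllegalArgumentException("Need equal-length, non-empty inputs");
        }
        Map<String, double[]> stats = new HashMap<>(); // category -> {sum, count}
        double total = 0.0;
        for (int i = 0; i < categories.size(); i++) {
            double y = outputs.get(i);
            double[] s = stats.computeIfAbsent(categories.get(i), k -> new double[2]);
            s[0] += y;
            s[1] += 1.0;
            total += y;
        }
        this.globalMean = total / outputs.size();
        for (Map.Entry<String, double[]> e : stats.entrySet()) {
            double sum = e.getValue()[0];
            double count = e.getValue()[1];
            // Weighted average of the category mean and the global mean.
            encoding.put(e.getKey(), (sum + weight * globalMean) / (count + weight));
        }
    }

    /** Categories that were not seen during fitting fall back to the global mean. */
    double encode(String category) {
        return encoding.getOrDefault(category, globalMean);
    }
}
```

With `weight` set to 0 this reduces to the plain per-category output mean; any positive weight moves rare categories towards the global mean.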

This particular encoding would fit better into Tribuo's transformation system (org.tribuo.transform) than into an inbound ETL step (e.g. RowProcessor), because that makes correct usage easier to ensure. If it lives inside TransformTrainer, then you can't leak information from test to train by applying it during ETL and then splitting the dataset with TrainTestSplitter or CrossValidation. But as I mention above, the transformation system doesn't currently have a way to do that kind of dependent transform. We're thinking about how we can evolve its design to allow transformations like PCA, so I'll make sure we include this kind of encoding as another design point.

If you want to do this today, and are willing to manage the train/test data leakage issues yourself, then I think the simplest way is to write a DataSource which accepts another DataSource and performs the transformation on construction.
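
A rough sketch of that wrapping pattern, in plain Java. `Row`, `EncodedRow` and `EncodedSource` are hypothetical stand-ins, not Tribuo classes, and a plain per-category output mean stands in for the James-Stein estimate; a real version would implement Tribuo's DataSource and report its own provenance. The second constructor is one assumed way of handling the leakage yourself: the statistics are fitted on the training source only and then reused to encode held-out rows.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a raw example: one categorical field plus a numeric output. */
final class Row {
    final String category;
    final double output;

    Row(String category, double output) {
        this.category = category;
        this.output = output;
    }
}

/** Hypothetical stand-in for an encoded example. */
final class EncodedRow {
    final double feature;
    final double output;

    EncodedRow(double feature, double output) {
        this.feature = feature;
        this.output = output;
    }
}

/** Wraps another source of rows and encodes all of it eagerly, in the constructor. */
final class EncodedSource implements Iterable<EncodedRow> {
    private final Map<String, Double> means = new HashMap<>();
    private final double fallback; // global output mean, used for unseen categories
    private final List<EncodedRow> rows = new ArrayList<>();

    /** Fits per-category output means on the wrapped source, then encodes that same source. */
    EncodedSource(Iterable<Row> inner) {
        Map<String, double[]> stats = new HashMap<>(); // category -> {sum, count}
        double total = 0.0;
        long n = 0;
        for (Row r : inner) { // first pass: gather statistics
            double[] s = stats.computeIfAbsent(r.category, k -> new double[2]);
            s[0] += r.output;
            s[1] += 1.0;
            total += r.output;
            n++;
        }
        if (n == 0) {
            throw new IllegalArgumentException("The wrapped source is empty");
        }
        this.fallback = total / n;
        for (Map.Entry<String, double[]> e : stats.entrySet()) {
            means.put(e.getKey(), e.getValue()[0] / e.getValue()[1]);
        }
        encodeAll(inner); // second pass: the source must be re-iterable
    }

    /** Encodes held-out rows with statistics already fitted elsewhere, e.g. on the training source. */
    EncodedSource(Iterable<Row> inner, EncodedSource fitted) {
        this.means.putAll(fitted.means);
        this.fallback = fitted.fallback;
        encodeAll(inner);
    }

    private void encodeAll(Iterable<Row> inner) {
        for (Row r : inner) {
            rows.add(new EncodedRow(means.getOrDefault(r.category, fallback), r.output));
        }
    }

    @Override
    public Iterator<EncodedRow> iterator() {
        return Collections.unmodifiableList(rows).iterator();
    }
}
```

Usage would be along the lines of `new EncodedSource(trainRows)` for training and `new EncodedSource(testRows, encodedTrain)` for evaluation, so the outputs of the test rows never influence the encoding.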

TijmenvanderKemp added the enhancement label (New feature or request) on Mar 31, 2021
