Throw error when tables are presented with new column orders? #1144

Open
ablaom opened this issue Nov 3, 2024 · 1 comment

@ablaom
Member

ablaom commented Nov 3, 2024

Over at MLJFlux, @tiemvanderdeure has pointed out the following issue, which is actually generic to MLJ.

As the example below shows, a user who trains a model on a table cannot then present new data for prediction with the table columns in a different order:

using MLJ, MLJFlux
using Random: bitrand

N = 1000
X = (x1 = rand(Float32, N), x2 = randn(Float32, N), x3 = categorical(rand('a':'c', N)))
y = categorical(bitrand(N))

model = MLJFlux.NeuralNetworkBinaryClassifier(epochs = 10, builder=MLJFlux.MLP(; hidden=(5,4)), batch_size = 100)
mach = machine(model, X, y)
fit!(mach)

# this errors
predict(mach, (x3 = X.x3, x1 = X.x1, x2 = X.x2))

# this is false!
all(predict(mach, (x2 = X.x2, x1 = X.x1, x3 = X.x3)) .≈ predict(mach, X))

Here is my response from the original post:

Mmm. I think this kind of implicit assumption (that the columns of tables are ordered, and that they are presented in a consistent order) is everywhere in MLJ, and probably elsewhere. [Transferring this issue to MLJ].

One could either try to allow tables to be presented in any column order, or throw a warning when the original order is violated. Personally, I think the latter would be sufficient. If MLJ had a generic data front-end for dealing with tables, apart from Tables.matrix, which discards the feature names, this could be an easy fix either way. But a lot of interfaces just don't save the feature names.

I'd support some kind of resolution, but it's a big ask to adapt across the ecosystem.
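
The "warn when the original order is violated" option could look roughly like the sketch below. It assumes only Tables.jl; the helper name `check_column_order` and its `strict` keyword are hypothetical and not part of MLJ, and a model interface would have to store the training column names itself for this to work. The example also shows why order matters once `Tables.matrix` has dropped the names.

```julia
using Tables

# Hypothetical helper (not an MLJ API): compare the column names seen at
# training time with those of new data, and warn or throw if they differ.
function check_column_order(trained_names, Xnew; strict=false)
    new_names = Tuple(Tables.columnnames(Tables.columns(Xnew)))
    new_names == Tuple(trained_names) && return true
    msg = "Columns presented as $(new_names) but model was trained on $(Tuple(trained_names))."
    strict ? throw(ArgumentError(msg)) : @warn(msg)
    return false
end

X = (x1 = [1.0, 2.0], x2 = [3.0, 4.0])
trained_names = Tables.columnnames(Tables.columns(X))
Xswapped = (x2 = X.x2, x1 = X.x1)

# Tables.matrix keeps only the values, so the two orderings give different matrices.
same_matrix = Tables.matrix(X) == Tables.matrix(Xswapped)

# Warns (or throws an ArgumentError with strict=true) because the order changed.
order_ok = check_column_order(trained_names, Xswapped)
```

Whether a mismatch should warn or throw is exactly the design question raised in this issue; the `strict` switch simply leaves both options open.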

@tiemvanderdeure

This is a problem that other users have also opened issues about (e.g. #1023, though I think there are more).

As a user (and as a contributor as well), the fact that the input into an MLJ machine is a Tables.jl-compatible table made me assume that machines would treat it as tabular data, i.e. use column names. It personally caught me off guard that they don't, and I doubt that I'm the only one.

What makes this more confusing is that some MLJ models do use column names, e.g. those in MLJGLMInterface.jl.

"I'd support some kind of resolution, but it's a big ask to adapt across the ecosystem."

I see the point - there are a lot of models out there, and requiring them to use column keys is not going to work.

Maybe there could be an extra model trait in MLJModelInterface (MMI) indicating whether or not a model uses column keys, so that an example like the one above can be part of the test suite for those models.
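
As a rough illustration of what such a test could check, here is a sketch that assumes only Tables.jl. The function `is_column_order_invariant` and the two toy "models" are hypothetical stand-ins; no such trait or test currently exists in MLJModelInterface as far as this thread indicates.

```julia
using Tables

# Hypothetical check: does `predict_fn` return the same output when the
# columns of `X` are presented in reverse order?
function is_column_order_invariant(predict_fn, X)
    cols = Tables.columntable(X)                 # NamedTuple of column vectors
    permuted = NamedTuple{reverse(keys(cols))}(cols)
    return predict_fn(permuted) == predict_fn(cols)
end

# Two toy "models" standing in for real ones:
# one relies on column position, the other looks columns up by name.
positional(X) = Tables.matrix(X) * [1.0, 10.0]
function by_name(X)
    c = Tables.columntable(X)
    return c.x1 .+ 10.0 .* c.x2
end

X = (x1 = [1.0, 2.0], x2 = [3.0, 4.0])
positional_invariant = is_column_order_invariant(positional, X)
by_name_invariant = is_column_order_invariant(by_name, X)
```

A model declaring that it uses column keys would be expected to pass a check of this kind, while a purely positional model would not.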

Otherwise there is always FeatureSelector in MLJModels, which is great.
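
For users who want to restore the training order by hand before calling predict, a minimal sketch using only Tables.jl follows. The helper `reorder_columns` is hypothetical and is not a description of how FeatureSelector is implemented.

```julia
using Tables

# Hypothetical helper: return a column table whose columns follow `names`.
# Throws if any requested name is missing from `X`.
function reorder_columns(X, names)
    cols = Tables.columntable(X)            # NamedTuple of column vectors
    return NamedTuple{Tuple(names)}(cols)   # select fields by name, in the given order
end

Xtrain = (x1 = [1, 2], x2 = [3, 4], x3 = ['a', 'b'])
Xnew = (x3 = Xtrain.x3, x1 = Xtrain.x1, x2 = Xtrain.x2)

# Put the new data back into the order seen at training time.
Xfixed = reorder_columns(Xnew, keys(Xtrain))
```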
