awkward use case: features that don't naturally combine into a table #915
@ExpandingMan Thank you for spending some substantial time with MLJ's learning networks. Your feedback is very much appreciated. You raise a few interesting issues here, and I don't have any magic bullet to resolve them all. For now, let me focus on the problem of combining the output of different transformers. Actually, unless I misunderstand, this is not really a problem with the learning networks API per se. If, more generally, I know how to horizontally concatenate two objects (for which ordinary
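For illustration, here is a minimal sketch (not from the original comment) of the point being made: any ordinary function that knows how to concatenate the concrete outputs can be spliced into a learning network with MLJ's `node`. The data and the `hconcat` helper are made up for the example:

```julia
using MLJ, DataFrames

X = DataFrame(x = rand(5), y = rand(5))   # made-up data
Xs = source(X)

stand = machine(Standardizer(), Xs)
W1 = transform(stand, Xs)                 # output of one transformer

sel = machine(FeatureSelector(features=[:y]), Xs)
W2 = transform(sel, Xs)                   # output of another

# an ordinary function that knows how to concatenate the two outputs:
hconcat(a, b) = hcat(DataFrame(a), DataFrame(b); makeunique=true)

# splice it into the network as a node:
W = node(hconcat, W1, W2)
fit!(W)
W()   # the combined table
```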
Yes, I think I am basically agreeing with you here - this is a promising direction. However, I am not quite sure why machines are relevant here, as we are just asking about an ordinary function that has multiple inputs. If, however, you want this "combining function" to have parameters (eg, output type) then you can define a

I must concede that MLJ's decision to try and work through the tables interface has some performance drawbacks. As you say, you have to think a lot more to avoid unnecessary copying. But even within that framework there is probably room for improvement, and the built-in transformers provided by MLJModels could do with a review (Tables.jl was not very mature when this code was written). I note that in TableTransforms.jl, AutoMLPipeline, and elsewhere, transformers, such as
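The truncated sentence above presumably refers to MLJ's static transformers. A minimal sketch of a parameterized combining function written as a `Static` model, following the pattern in the MLJ docs; the `Combiner` type and its field are invented for illustration:

```julia
using MLJ, DataFrames

# a static transformer: it has no training step but carries parameters
mutable struct Combiner <: Static
    makeunique::Bool   # hypothetical parameter for column-name clashes
end

# for Static models, `transform` receives `nothing` in place of a
# fitresult and may accept multiple data arguments:
MLJ.transform(c::Combiner, _, t1, t2) =
    hcat(DataFrame(t1), DataFrame(t2); makeunique=c.makeunique)

# in a learning network the machine is created with no training data:
# mach = machine(Combiner(true))
# W = transform(mach, W1, W2)
```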
Oh, by the way, a PR to clarify the status quo in the documentation would be very welcome.
@ExpandingMan Although it's not part of the public API, TableTransforms has the `tablehcat` function:

```julia
julia> table1
3×2 DataFrame
 Row │ x        z
     │ Char     Float64
─────┼───────────────────
   1 │ 𘂯        0.673471
   2 │ \U3f846  0.360792
   3 │ \Ud50cb  0.68075

julia> table2
(x = [0.41754294943943493, 0.7713462387833814, 0.9189998773436003], y = ['\U84fa1', '\U5e144', '\U872a4'])

julia> TableTransforms.tablehcat([table1, table2])
3×4 DataFrame
 Row │ x        z         x_        y
     │ Char     Float64   Float64   Char
─────┼──────────────────────────────────────
   1 │ 𘂯        0.673471  0.417543  \U84fa1
   2 │ \U3f846  0.360792  0.771346  \U5e144
   3 │ \Ud50cb  0.68075   0.919     \U872a4
```
Thanks for your responses.
Right. The current API does indeed work correctly, as my initial example shows; it's more a matter of awkwardness. It took me a little while to work out exactly what to do here (and, for what it's worth, I have a ton of Julia experience). Again, return types are potentially a major part of this issue: as far as I can tell there is no standard for what type is returned by a particular machine component, and figuring it out requires some trial and error with truncated learning networks.
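A sketch of the trial-and-error workflow being described, assuming `W2` is some intermediate node of an already-built learning network:

```julia
# assuming W2 is an intermediate node of a learning network:
fit!(W2, rows=1:50)   # fit only the machines this node depends on
typeof(W2())          # call the node to discover its concrete return type
```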
It seems this is the fundamental issue at the core of the matter. It seems to me that for machine learning what is needed is an object with

I do agree that better ways of concatenating tables seems like the best medium-term solution, and it seems like that's already close. Thanks again for taking the time to think about this.
One possibility I've been thinking more about is JuliaML/MLUtils.jl#61
So generally, transformers in MLJ that train on a table will transform to a table of the same type (assuming that type is a valid sink type). I think TSVDTransformer is a special case: if you train on a table, then transform returns a matrix-table. (If you train on a matrix, which is allowed, then you transform to a matrix, which could be sparse if the training matrix is.) I think the reason for this choice had to do with sparsity: the function
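For reference, a minimal sketch of what a "matrix-table" is, using MLJ's `matrix`/`table` utilities (the data is made up):

```julia
using MLJ

X = (x1 = rand(3), x2 = rand(3))  # a column table (made-up data)
A = MLJ.matrix(X)                 # extract the underlying 3×2 Matrix{Float64}

# wrap the matrix back up as a table (a lightweight wrapper around the
# matrix); this is the kind of "matrix-table" TSVDTransformer returns
Xmat = MLJ.table(A)
```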
I know it's been a while since I've commented on this, but I think I have run into another case that exposes the need for some kind of new feature here. Currently
Is your feature request related to a problem? Please describe.
With the current interface it can be extremely awkward to combine features which do not naturally fit together in a table, particularly if they must be fed into separate models. For concreteness, take the following example:
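The original code block did not survive; what follows is a hypothetical reconstruction consistent with the surrounding discussion. The data, the matrix `A`, the use of TSVDTransformer (assumes MLJTSVDInterface is installed), and `FeatureSelector`'s `ignore` option are all assumptions:

```julia
using MLJ, DataFrames

# hypothetical reconstruction (not the original code): the features in
# matrix A don't naturally belong in a table, but must be crammed into
# one anyway to form a single training input
A = rand(100, 5)                       # made-up matrix of features
rest = DataFrame(c = rand(100))        # made-up tabular features
X = hcat(DataFrame(A, :auto), rest)    # the forced single training input

Xs = source(X)

# assumes the TSVDTransformer model (MLJTSVDInterface) is installed:
TSVD = @load TSVDTransformer pkg=TSVD

# first FeatureSelector: recover the columns that came from A
sel1 = machine(FeatureSelector(features=[:x1, :x2, :x3, :x4, :x5]), Xs)
XA = transform(sel1, Xs)
tsvd = machine(TSVD(), XA)
Ξ = transform(tsvd, XA)                # a matrix-table

# second FeatureSelector: everything else (assumes the `ignore` option)
sel2 = machine(FeatureSelector(features=[:x1, :x2, :x3, :x4, :x5], ignore=true), Xs)
Xrest = transform(sel2, Xs)
stand = machine(Standardizer(), Xrest)
W2 = transform(stand, Xrest)

# the awkward part: manually recombining Ξ with the other output
W = node((a, b) -> hcat(DataFrame(a), DataFrame(b); makeunique=true), Ξ, W2)
fit!(W)
```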
Note the presence of two different `FeatureSelector`'s. In many cases, the existence of a `features` keyword in a model makes this process smoother, not only because it eliminates the need for a separate `FeatureSelector`, but more importantly because its outputs are already combined (i.e. it doesn't eliminate the non-selected features).

I find several features of this example problematic:

- We are forced to initially cram `A` and the rest of the columns into a single dataframe (or other table), i.e. no matter what we have to pretend there is only a single training input. This might not be so bad, but again, it's a little worrying from a performance perspective since the input is necessarily so explicitly tabular.
- The outputs have to be manually recombined, as with `Ξ` above. `hcat` only works nicely on dataframes, and machines don't appear to be constrained in the exact form of their output. This means that users are required to take apart their would-be model in order to figure out the exact form of each output that must be combined in some way.

Describe the solution you'd like
It's of course possible I'm missing simpler options that already exist, though I did spend a significant portion of the day digging into this, so I don't think that's the case.
After some thought, I don't yet see a fantastic solution to this because most of the solutions I can think of would involve a significant re-work of `Machine`, which is certainly not ideal. Some ideas:

- It seems `machine` can have multiple inputs, but I could not get it to work consistently.
- Ensure models have a `features` keyword. The above example would be a lot simpler if `TSVD` had this (I deliberately chose `TSVD` because it does not). On the other hand, this seems like a fragile solution to me; for one, if my understanding of model implementations is correct, it would really suck to have to try to ensure that they always have certain keywords, but it also doesn't address what is perhaps a deeper issue of data not always being strictly tabular.
- Something more flexible than `hcat`, perhaps involving some wrapper around the output (see the sketch after this list)... I am having a bit of a hard time coming up with a good example without promoting `machine` to take multiple input arguments though so... maybe if `machine` had multiple inputs only for surrogate models? Which is confusing. Just thinking out loud here.
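As a rough illustration of the third idea, a sketch of a table-generic wrapper around `hcat`, in the spirit of the TableTransforms `tablehcat` shown earlier; the name and implementation are assumptions:

```julia
using DataFrames

# hypothetical wrapper: accept any Tables.jl-compatible outputs, funnel
# them through DataFrame, and hcat with automatic renaming of clashes
function table_hcat(tables...)
    dfs = [DataFrame(t) for t in tables]
    return hcat(dfs...; makeunique=true)
end
```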