
Supervised UMAP using already projected points as regression #712

Open
theahura opened this issue Jun 29, 2021 · 6 comments
@theahura

Hey Leland,

Thanks for this great library, and for being so responsive with issues.

Question: Is it possible to train a (semi-)supervised UMAP model where some of the projections are already known/provided, as a regression task? This is in contrast to using labels for supervision, which are categorical.

To provide an example, imagine I had a set of 100 embeddings. 50 of those embeddings have 2d coordinates associated with them; the other 50 do not. I want to be able to train a UMAP model on the 50 embeddings with 2d coordinates, and then run inference on the other 50 (or do a semi-supervised training and run on all of them at the same time).

If this isn't feasible with UMAP, do you know of any other algos/models that might be a good fit for what I'm suggesting (besides deep learning of course)?

Thanks! Amol

@theahura theahura changed the title Supervised UMAP using already projected points Supervised UMAP using already projected points as regression Jun 29, 2021
@lmcinnes (Owner) commented Jul 5, 2021

There are two ways you can view this. The 2d coordinates could be viewed merely as target values: like labels, but numeric instead of categorical. This can actually be handled just fine by setting target_metric="euclidean" and passing the coordinates in as the y value to fit. I don't think this is what you mean, however.

Instead it sounds more like you want to fix some of the points to given locations in the embedding and then fit the rest of the points around that. This is actually being worked on right now. See #606 and #620 for discussion and work on that. There are some catches in exactly how to do this so I think your input into this would be most welcome.

@theahura (Author)

Thanks for the response, and sorry for my delayed post.

You're right that fixing points is closer to what I need. But I can play around with the first suggestion, using a euclidean target_metric. Does that support semi-supervised cases? In the documentation (https://umap-learn.readthedocs.io/en/latest/supervised.html) you describe using -1 as a 'masked' value. Makes sense for categorical data, but how does that work for numerical data?

@lmcinnes (Owner)

The semi-supervised case is going to be an issue for other target_metrics, since they won't specially handle "masked" values. In principle you can write your own custom metric that has special handling for masked values. I suspect that in the long run this isn't really going to do what you want, however.

@theahura (Author)

If checks pass on #620 is that sufficient to merge into the library? Or do you feel there is more to be done there? I had a look over the code in that PR, though admittedly a lot of the math flew over my head.

@lmcinnes (Owner)

There are some slightly more philosophical and API issues that need to get worked out before it gets merged, but you can certainly just check out the branch from the PR and use it -- it works; it just requires a little care from the user or unexpected results may occur.

@kruus commented Aug 11, 2021

To expand on @lmcinnes's API/philosophy comment: I started using #620, but am now favoring a more flexible API style, inspired by the PyMDE constraints objects.

Passing UMAP constraint objects can also avoid some jit-related code duplication in #620. So one cleaner API might just add an optional constraint parameter, where constraint objects have a few standard project_foo functions. (The constraint parameter might end up being a dict, indicating how/when the project_foo functions of several constraints get called: do they operate on gradients, on individual embedding modifications, on the post-epoch point cloud, ...?) Like PyMDE, we'd supply a couple of handy constraints to get folks up and running.
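A purely hypothetical sketch of the constraint-object idea above. None of these names (PinConstraint, project_points, a constraint parameter) exist in umap-learn; this only illustrates the kind of hook being proposed:

```python
# Illustrative constraint object: it exposes a single hook that a layout
# loop could call on the embedding array after each optimization epoch.
import numpy as np

class PinConstraint:
    """Pin selected rows of the embedding to fixed coordinates."""

    def __init__(self, indices, coords):
        self.indices = np.asarray(indices)
        self.coords = np.asarray(coords, dtype=np.float64)

    def project_points(self, embedding):
        # Overwrite the pinned rows in place; all other points stay free.
        embedding[self.indices] = self.coords
        return embedding

# A layout loop would apply the constraint each epoch, e.g.:
emb = np.zeros((4, 2))
emb = PinConstraint([0, 3], [[-1.0, 0.0], [1.0, 0.0]]).project_points(emb)
```

Operating on the post-epoch point cloud (as here) is only one of the call sites mooted above; a gradient-masking variant would instead zero the gradient rows for pinned indices.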

In a first example, I've used a pin_mask (uggh, will change!) to anchor "my selected anchors" at left/right best/worst x limits, leaving y free. (This is not exactly the "spring force" that @theahura seems to want). A pre-alpha, simple working constraint ensures that all non-anchors remain between my best/worst x limits.

BTW, another issue (even in #620) is that the UMAP init phase applies dimension-wise scale factors even when init is an ndarray.

  1. If I init with a previous embedding ndarray, for example one with carefully set "euclidean" distances, I don't like UMAP rescaling it as if I suddenly wanted a weighted-euclidean embedding for the init!
  2. If I want to anchor some points in my init ndarray and use gradient masking in a constraint, they do end up anchored ... but "somewhere else". (Is this historical cruft? Didn't spectral init just do its own rescaling?)

Now, when constraints come into play, init becomes a little more complex. I'm thinking some types of constraint objects may add a function that can approximately satisfy a constraint. Some constraints that put more load on layouts.py iterations might reasonably provide some rotate/(scale?)/translate of init data to ease the load.
