
Possibility to fix points in the low embedding #606

Open
matthieuheitz opened this issue Mar 4, 2021 · 20 comments

@matthieuheitz
Contributor

Hi,
I'd like to know if it would be possible to add a prior on the low embedding, meaning that I want certain points to be at certain coordinates in the low embedding, and I would like UMAP to optimize all other points.
Is that something that is doable with the current code, or would it need some changes?
How difficult would it be to add this feature?
Thank you very much.

@lmcinnes
Owner

lmcinnes commented Mar 4, 2021

The code, as it stands, would need some changes to allow this to happen. This is actually already a requested feature (see #432) -- although that request may differ in specific details compared to what you need. It was on my development plan, but I have not gotten to it yet. It is not really that difficult to add, so if you wanted to have a look at doing it yourself I can provide guidance as to what changes would need to be made.

@matthieuheitz
Contributor Author

Yes, that would be fantastic!
I could definitely add this with some guidance.
My initial thought was also to have a mask that says which points are fixed.
Do you have a preferred way to communicate? Discussing all the details here might be too much.

@lmcinnes
Owner

lmcinnes commented Mar 5, 2021

You can email me at "leland DOT mcinnes AT gmail DOT com" and I can give you a general outline and we can discuss from there.

@jondo
Contributor

jondo commented Mar 22, 2021

@matthieuheitz, thank you very much for tackling this! How is it going so far?
Actually, I would be interested to have the detailed discussion here.

Update: Ah, I just found your related pull request #620 .

@matthieuheitz
Contributor Author

Yes, it's not complete yet, and it's still breaking some CI checks, so I'll work on that this week.
But yes, you can already look at my changes on the PR.

jondo added a commit to jondo/umap that referenced this issue Mar 23, 2021
Brute-force pinning as in "embedding constraints"
(lmcinnes#432 (comment)),
while "Possibility to fix points in the low embedding"
(lmcinnes#606) is not done.
@jondo
Contributor

jondo commented Mar 24, 2021

In the meantime I have tested @dribnet's brute force implementation from #432 (comment) (see my code).

I first computed an embedding without pinning, then saved some of the resulting data point coordinates. Of course, computing another embedding (with the same data and random seed) while pinning these data points to the saved coordinates leads to an overall different embedding (except for the pinned points).

I now wonder whether there could be a stable implementation of pinning, such that pinning data points to their current embedding locations would not change the embedding. Maybe one could internally first compute the unpinned embedding, then do an additional "pinning adjustment" phase (which would result in no further change in the above case)?

@jondo
Contributor

jondo commented Mar 29, 2021

In my tests, the mentioned brute force implementation also has the unwanted effect that sometimes the fixed points stay separated from the remaining data. This effect can also be seen in the last image of #432 (comment), I have marked it here:
[Image: embedding with the fixed points marked, staying separated from the remaining data]
Increasing n_epochs does not always help here.

@matthieuheitz, do you also see this effect in your PR? What could be done about this? Maybe the fixed points' positions should already influence how the low dimensional embedding is initialized?

@jondo
Contributor

jondo commented Apr 1, 2021

With jondo@7416cce, based on the current PR state, I could even get an implementation of pinning that is "stable" in the sense I described above:

I first compute the initial_embedding as a normal embedding without pinning, then derive the number of additional pinning epochs needed while setting the pinned coordinates:

import numpy as np

# pinning_points holds (index, target_coordinates) pairs;
# pin_mask[index] = 0 marks a point as fixed in this snippet's convention.
sumdist2 = 0.0
for (index, point) in pinning_points:
    dist2 = (point - initial_embedding[index]) ** 2
    sumdist2 += dist2.mean()
    initial_embedding[index] = point
    pin_mask[index] = 0
pinning_rmse = np.sqrt(sumdist2 / len(pinning_points))
n_epochs = int(pinning_rmse * 200)  # heuristic: more displacement, more epochs

Then I do the pinning epochs (which can be as low as zero due to e442bcd):

reducer = umap.UMAP(n_epochs=n_epochs, init=initial_embedding)
embedding = reducer.fit_transform(X, pin_mask=pin_mask)

For n_epochs == 0 I can even shortcut to embedding = initial_embedding.

However, in comparison to just using the spectral embedding as the initial embedding (via e442bcd), I noticed that in the general case (pinning_rmse > 0) this takes significantly more time to get the embedding close to the pinned points, so it's not worth it.

@jondo
Contributor

jondo commented May 9, 2021

The "fixed points stay less connected" artifacts could probably be avoided by implementing this whole "fixing points" feature outside UMAP instead: by using some graph layout library to pull the low-dimensional graph of the unconstrained embedding into the right shape/place.

@matthieuheitz
Contributor Author

matthieuheitz commented Jun 1, 2021

Sorry for the long silence!
So, I ran a couple of experiments, and actually @jondo, I have the opposite problem from yours, but I think they are related.
Since we rescale the initial embedding to [0,10]^2, when there are a lot of points, UMAP cannot fit all of them in that square, so it "overflows".
See the results in the bottom-right figures, for N=100, 500, and 2000 points. The ground truth I give UMAP is 3D points where the first two coordinates are random and the third is 0. Fixed points are in red. In the initialization, the middle points are distributed normally.
[Figures: comparison_init_fixedpoints_2DsquareHD_N{100,500,2000}_D10_nepoch5000_nneighb20_mindist1]

The above is an example where fixed points are not strictly necessary, because the reconstruction without them makes sense. But the same thing happens on datasets where fixed points would make more sense. For example, this is an orthogonal star in high dimension: points where only one coordinate takes values in [0, 0.1, 0.2, ..., 1.0]. I don't plot the ground truth because I can't in 2D. Results with fixed points are in the bottom-right figures, for N=100, 200, 500, 1000 points:
[Figures: comparison_init_fixedpoints_ortho_pos_star_N{100,200,500,1000}_D10]

In both, "no initialization" actually means spectral init.
Do you think tuning "spread" would help? I tried playing with it, but I'm not getting better results; maybe I'm using it wrong.

Anyway, I can identify 2 cases of wanting to fix points, and they are quite different:

  • we want to keep the pinned points' coordinates fixed in space, e.g. to fix the rotation degree of freedom of the output
  • we just want to preserve the relative positions of the fixed points to each other, no matter their coordinates (and so accept a result that is correct only up to a rotation).

Those two use cases should be coded very differently, and I'm not sure how to address that. Any thoughts?

Also, do you think there is a way to predict, from the number of points and other parameters, the global size (bounding box) of the low-dimensional embedding? If there were, we could initialize the embedding with the fixed points at their given positions, scaled to fit in the bounding box, and all other points initialized with the spectral init.

@lmcinnes
Owner

lmcinnes commented Jun 2, 2021

Thanks for running the experiments. It was these sorts of issues that had me scaling the initialization in the first place. I guess the real catch now is that the scaling (or the "natural size" of the final embedding) is not that fixed, and varies with the number of data points (not that surprising, but still...)

I can think of a couple of easy(ish) ways to handle this, and otherwise I think it would require some deeper thought to try to discern the "right" thing to do. The first approach is to try to find a heuristic formula for scaling with dataset size. It is probably possible to come up with something halfway reasonable. The second approach is harder, but probably more useful: if we zero-center the initialization then we can apply consistent uniform scaling of fixed points depending on some measure of stress on them in the embedding.

I'm not that happy with either of those approaches, and they only address the first use case: pinning relative position (hence our freedom in rescaling things). The second use case wants truly fixed positions. Supporting that seems a lot easier right now, so potentially the best interim solution is to leave it to the user to provide a suitable initialization for the fixed points.

One final option may be the following: we instead allow users to supply an initialization for the fixed points only and don't require them to initialize everything else; we can then perform a spectral initialization of the whole dataset, Procrustes-align that with respect to the fixed points, and then snap the fixed points to their user-specified positions. This takes a lot of the effort out of getting the initialization "right" in some ways. Thoughts?

@matthieuheitz
Contributor Author

matthieuheitz commented Jun 4, 2021

The heuristic formula idea is what I was thinking about too, but it might be too complicated; it seems like there would be so many particular cases where it wouldn't apply. It also depends on how "converged" we are, and n_epochs doesn't tell us that: from what I've tried, some datasets require 500 epochs to converge, others 5000... And if the natural scale of an embedding is [0,100] (when it started from [0,10]), starting directly at [0,100] might not converge to the same "natural" scale, or might converge to the same scale but yield different results (better or worse, I don't know...).

I like the idea of the zero-centering + scaling depending on stress. And it seems that the result is practically invariant to translation of the initialization, so that's nice. I thought it would be invariant to scaling if we also multiplied min_dist and spread by the same scaling factor, but judging from my tests, that's not the case.

For your last suggestion, I'm not sure I understand; can you give more details? I'm confused by "with respect to the fixed points". What do you want to align the spectral initialization of the whole dataset with?

@lmcinnes
Owner

Sorry for the slow response. The plan for the last idea would be that the user supplies only initial locations for the points that are to remain fixed (this obviously doesn't work with the soft mask ideas). The code then temporarily ignores the fixed points' positions and does the usual spectral initialization of the full dataset. This gives us one dataset of locations for the fixed points as supplied by the user, and another dataset of locations of the corresponding points as generated by the spectral initialization of the full dataset. The goal, then, is to find a suitable affine transformation that maps the spectral-initialized locations of the points to the user-supplied fixed-point locations. The easiest way to do this is by translating both datasets to be origin-centered, globally scaling them to have the same variance, and then solving the orthogonal Procrustes problem to find the rotation and reflection that minimises the error. You then have a linear map, given by the rotation + reflection + scaling + translation, from the spectral-initialized locations of the would-be fixed points to (as close an approximation as a suitable linear map can provide of) the user-supplied locations. You can apply that linear map to the spectral initialization of the full dataset. This doesn't solve all the problems, but perhaps it makes things easier for the user, since they don't have to specify everything, just where they want the fixed points.
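
For concreteness, a minimal sketch of that alignment using NumPy/SciPy; the names spectral_init, fixed_idx, and fixed_pos are hypothetical, not part of the UMAP API:

import numpy as np
from scipy.linalg import orthogonal_procrustes

def align_init_to_fixed(spectral_init, fixed_idx, fixed_pos):
    # Spectral locations of the points the user wants pinned.
    src = spectral_init[fixed_idx]
    # Translate both point sets to be origin-centered.
    src_mean, dst_mean = src.mean(axis=0), fixed_pos.mean(axis=0)
    src_c, dst_c = src - src_mean, fixed_pos - dst_mean
    # Global uniform scaling so both sets have the same variance.
    scale = np.sqrt((dst_c ** 2).sum() / (src_c ** 2).sum())
    # Rotation/reflection minimising ||scale * src_c @ R - dst_c||_F.
    R, _ = orthogonal_procrustes(src_c * scale, dst_c)
    # Apply the same affine map to the full spectral initialization...
    aligned = (spectral_init - src_mean) * scale @ R + dst_mean
    # ...then snap the fixed points exactly onto their user-supplied positions.
    aligned[fixed_idx] = fixed_pos
    return aligned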

To be fair, I still have concerns with this idea. The stress-based scaling feels like the best answer, but how best to calculate it is certainly a challenging problem. At this stage I suspect the best thing to do is simply to go with roughly what you have already, and provide good documentation of the concerns that can arise (that the user may need to be careful and/or scale their initialization to get the results they want). The plots you posted in this issue would work very well in tutorial documentation describing these sorts of issues.

@matthieuheitz
Contributor Author

Okay, but depending on the number of fixed points, it's likely you won't find a linear transformation that maps them from their spectral-embedding positions to their user-provided positions. Since we are most often embedding points in 2D, as soon as there are more than 3 fixed points, we'll struggle to align them.
As an example, I plotted the spectral embedding (top right) for the two previous datasets:
[Figures: comparison_init_fixedpoints_2DsquareHD_N500_D10 and comparison_init_fixedpoints_ortho_pos_star_N500_D10 (nepoch1000, nneighb20, mindist1, spread1), spectral embedding in the top right]
For the first one, you can imagine that the Procrustes alignment will go well, but for the second one, I don't know if the result will make much sense.

I really like the stress-based scaling though. Is there not already a stress measure on each point?
What would be the challenges here?
Right now, I feel like one main problem is that UMAP is not scale invariant. But maybe that's because I forgot to change some parameters, so I tried changing min_dist and spread. I also just found that the gradients are clipped to [-4,4] when updating the positions, so that should also depend on the scale. Are there other places in the code where things depend on the scale being in [0,10]?

I understand that we could just keep the mask method and explain its limitations well.
But I want to see how hard it is to do something cleaner and more powerful. Moreover, people who use UMAP probably use it on medium to large datasets, which is exactly where our current method doesn't work. So to me, it's a good demo, but I'm not sure it will serve many people.

@theahura

theahura commented Jul 13, 2021

hey @matthieuheitz curious what the status is of this PR/feature request. Is this still in active development? (Is there something I can do to help out?)

Reading the discussion above, I'm of the opinion that releasing a well documented approach using the heuristic is the way to go simply because it will get some aspects of this feature out for general use. As people use the tool we'll get a much better understanding of actual use cases and edge cases, compared to artificial datasets/experiments.

@kruus

kruus commented Sep 12, 2021

In my constraints branch, I did the Procrustes approach Leland suggests, without having read this discussion; it just seemed a natural way to do it. My branch translates pin_mask things into a more general framework, where you supply point or tangent-space projectors as numba functions. This allows a lot of flexibility in constraint types, such as hard pinning, spring-force "pinning", as well as generic constraints like dimension-wise bounds, etc.
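
To give a flavor of what such a projector could look like, here is an illustrative sketch only; the name and signature are hypothetical, not the actual ones from the constraints branch:

import numba

@numba.njit()
def project_pinned(idx, pos, targets, is_pinned):
    # Hard pinning as a point projector: after a gradient step, snap a
    # pinned point straight back onto its target coordinates.
    if is_pinned[idx]:
        for d in range(pos.shape[0]):
            pos[d] = targets[idx, d]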

@matthieuheitz
Contributor Author

@theahura Sorry for the very delayed response. Unfortunately, I've stopped working on this because I realized I did not need it as much, and I've been quite busy, so I haven't taken the time to code a clean interface and write up good documentation for it in #620.
@kruus, I would love to see how it works if you have small demos like the one I posted. Is there somewhere you showcase these features?

@kruus

kruus commented Sep 13, 2021

See the constraints branch of this fork. Although I began with a fork of the @matthieuheitz PR, my constraints are basically some numba jit functions that expect certain arguments, somewhat like the approach used for output_metric. To keep something like your original pin_mask proposal, I auto-translate 0/1 array data into an appropriate constraint function. I did not support point-specific gradient weights as your original did, but that could easily be done by supplying a gradient-modifying constraint.

Perhaps begin with the docstrings for output_constrain in the constructor (topologically-oriented constraints, like generic bounds or centering) and data_constrain for the 'fit*' routines, which allow for data-point-specific constraints (like pinning or spring forces). The added constraints.py file shows some example numba constraint functions and some quick'n'dirty self-test code.

A KISS example, examples/iris/iris4.ipynb, shows the Procrustes approach. I probably need an upstream merge because, iirc, the n_epochs=0 mod might be useful (?). Here is a sequence of figures showing two approaches to pinning 3 "good" points to (x=-10, y=free) and 3 others ("bad") to (x=+10, y=free).

Although folks are using my stuff in its current form (and actually really need it), there are numerous tricky issues, some perhaps not so obvious, that could use fresh eyes, discussion, suggestions... Thanks!

@kruus

kruus commented Sep 13, 2021

Oh... I ignore the finer points of data scaling. Presumably min_dist should be small enough to be OK for most datasets (?). So one of my first gotcha fixes was to turn off UMAP's well-intentioned auto-rescaling to +/-10 when using constraints. It is of course still possible that fixing some points causes others to embed outside my desired box. This I handled with a simple output_constrain=... to bound various/all dimensions to my desired range during the gradient descent (my constraints.py supplies some simple functions to do this). So bounding the embedding's x to +/-5 limits should just work.
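
As an illustration of such a bounding constraint (hypothetical name and signature, not the actual constraints.py code):

import numba

@numba.njit()
def bound_coords(pos, lo=-5.0, hi=5.0):
    # Clamp every coordinate of one embedding point into [lo, hi]
    # after each gradient-descent update.
    for d in range(pos.shape[0]):
        if pos[d] < lo:
            pos[d] = lo
        elif pos[d] > hi:
            pos[d] = hi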

The +/-4 gradient bound did not concern me --- close to convergence, all gradients should be small. I view it as an ignorable safety mechanism for the initial iterations, to avoid +/-infinity badness.

I also punted on the issue of constraints interacting badly with co-located points (point-indexed constraints do not support UMAP with unique=True). My thought here was to let UMAP do its thing, but switch to a soft spring constraint (my constraints.py has a simple example spring-force constraint). This should allow the UMAP min_dist to lightly separate the points, even if logically they are all being pulled to a common location. But this still has badness w.r.t. small n_neighbors, unless the spring forces are sufficiently light to keep the topology connected :(. Perhaps detect and warn?
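
A hedged sketch of what such a soft spring constraint could look like (again, hypothetical name and signature):

import numba

@numba.njit()
def spring_pull(pos, anchor, k=0.05):
    # Soft pinning: nudge the point a fraction k of the way toward its
    # anchor each epoch instead of snapping it there outright, leaving
    # min_dist room to lightly separate logically co-located points.
    for d in range(pos.shape[0]):
        pos[d] += k * (anchor[d] - pos[d])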

I haven't thought about constraints involving the embeddings of pairs of points, but I don't see any reason you could not masquerade them as single-point constraints with added internal pairwise info. After all, the UMAP embedding is implemented as a sequence of single-point updates, even with move_other.

@kruus

kruus commented Sep 13, 2021

Your second example is really interesting, and relevant to me. I also predict feature requests for such "radial" pinnings, where rigid alignment might not make much sense. My gut says not to worry, and to write a constraint such that pinned embeddings lie somewhere, anywhere, on some (2-/3-D?) ball, without constraining their relative locations much.

For 2-D, the most difficult case, I still want the potentially incompatible init embeddings to "move past one another" and sort themselves out topologically. My thought is to embed in 3-D first for an epoch or two, and use this to init the final 2-D embedding; see the sketch below. The iris4 notebook has a vaguely related PCA approach, where internally I work for a bit in 3-D. For my work, there are valid ML reasons for keeping 'pin data' (logically, even if not visually) as separate user dimensions, to distinguish them from UMAP dimensions stemming from the high-dimensional topology.
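
A minimal sketch of that 3-D warm start, assuming a dataset X and plain umap-learn (the epoch counts are placeholders):

import umap
from sklearn.decomposition import PCA

# Brief 3-D phase so incompatible initial positions can move past one another.
emb3 = umap.UMAP(n_components=3, n_epochs=50).fit_transform(X)
# Project the 3-D layout down to 2-D to seed the final embedding.
init2 = PCA(n_components=2).fit_transform(emb3)
emb2 = umap.UMAP(n_components=2, init=init2).fit_transform(X)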
