Maintaining constant factors with skip_modes #25
Sorry to repost, but I've encountered another small bug running the code below.

```python
Uext = U.copy()
Uext.factors[1] = np.random.randn(*Uext.factors[1].shape)

# Fit model to the full dataset, only fitting the final set of factors.
V = tt.ncp_bcd(X, rank=24, init=Uext, skip_modes=[0, 1], verbose=True)
```

Here's the traceback:

```
NCP_BCD: iteration 1, objective 1.1158774944307905, improvement inf.
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-92-31e1d436f07a> in <module>
      5
      6 # Fit model to the full dataset, only fitting the final set of factors.
----> 7 V = tt.ncp_bcd(X, rank=24, init=Uext, skip_modes=[0, 1], verbose=True)

~/anaconda2/envs/scanpy/lib/python3.6/site-packages/tensortools/optimize/ncp_bcd.py in ncp_bcd(X, rank, random_state, init, skip_modes, negative_modes, **options)
    139
    140                 # Compute Gradient.
--> 141                 grad = Um[n].dot(grams) - p
    142
    143                 # Enforce nonnegativity (project onto nonnegative orthant).

IndexError: too many indices for array
```

My guess is that this is due to https://github.com/ahwillia/tensortools/blob/304b467c0a404b638db7458d8363c82c152fa016/tensortools/optimize/ncp_bcd.py#L162, which leaves `Um` inconsistent with the full set of factor matrices when modes are skipped. Would just making a full copy of the factors there fix it?
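For intuition, here's a minimal standalone sketch of the general hazard (I haven't verified this is exactly what happens on that line of tensortools, and all names here are hypothetical): if a working list of factor matrices is built without the skipped modes but is later indexed by the global mode number, the indices go out of sync.

```python
import numpy as np

rng = np.random.default_rng(0)
rank = 3
dims = [5, 6, 7]          # sizes of a hypothetical 3-mode tensor
skip_modes = [0, 1]

# Full factor list, indexed by global mode number 0..2.
U = [rng.standard_normal((d, rank)) for d in dims]

# Buggy pattern: build a working list that drops the skipped modes...
Um_bad = [U[n] for n in range(len(dims)) if n not in skip_modes]

# ...but keep indexing it by the global mode number.
try:
    Um_bad[2]              # mode 2's factors are the only entry, at position 0
except IndexError as err:
    print("IndexError:", err)

# Safe pattern: keep the list full length and simply skip the updates.
Um_ok = [u.copy() for u in U]
for n in range(len(dims)):
    if n in skip_modes:
        continue           # mode n is held fixed; don't update it
```

This produces a plain `list index out of range` rather than the numpy error in the traceback above, but the underlying mismatch between a shortened container and global mode indices is the same kind of bug.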
This might also be obvious, but the same call with `skip_modes=[0, 2]` fails differently:

```
NCP_BCD: iteration 1, objective 0.4673757824789901, improvement inf.
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-135-f4bdde02fcd6> in <module>
      3
      4 # Fit model to the full dataset, only fitting the final set of factors.
----> 5 V = tt.ncp_bcd(X, rank=24, init=Uext, skip_modes=[0, 2], verbose=True)

~/anaconda2/envs/scanpy/lib/python3.6/site-packages/tensortools/optimize/ncp_bcd.py in ncp_bcd(X, rank, random_state, init, skip_modes, negative_modes, **options)
    154                 # Correction and extrapolation.
    155                 grams *= U[N - 1].T.dot(U[N - 1])
--> 156                 obj_bcd = 0.5 * (sci.sum(grams) - 2 * sci.sum(U[N-1] * p) + normX**2 )
    157
    158                 extraw = (1 + sci.sqrt(1 + 4 * extraw_old**2)) / 2.0

ValueError: operands could not be broadcast together with shapes (396,24) (890,24)
```

This is because in the earlier loop through the modes, mode `N - 1` is skipped, so `p` is left over from the last mode that was actually updated and has the wrong shape for `U[N - 1] * p`. It looks like one might also encounter a similar error on line 148 in `ncp_bcd.py`.
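To make the shape mismatch concrete, a standalone sketch (the mode sizes are taken from the shapes in the traceback; the variable names are mine, not tensortools'):

```python
import numpy as np

rng = np.random.default_rng(0)
rank = 24
n_other, n_last = 396, 890   # sizes of an earlier mode and the final mode

# Factors for the final mode, shape (890, 24).
U_last = rng.standard_normal((n_last, rank))

# `p` left over from an earlier mode because the final mode was skipped, shape (396, 24).
p_stale = rng.standard_normal((n_other, rank))

try:
    _ = U_last * p_stale     # elementwise product: shapes cannot broadcast
except ValueError as err:
    print("ValueError:", err)
```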
Excellent -- sounds like a really cool application! I've always wanted to try some of these analyses on gene expression data. All these questions and comments are very helpful. Let me see if I'm understanding correctly.
Can you explain your thinking here a bit more? My understanding is that you are looking to validate the model on held-out data. What we did in our paper was to hold out randomized subsets of the data, since you can't entirely leave out a full 2-D slice of the tensor and still be able to fit the model. I have a blog post that explains the thinking behind this in a bit more detail. I hope that clears up what I'm trying to describe above. Now, to do this with tensortools you'll want to use the `mask` argument:

```python
tensor = ...  # this is your 3d data

# True entries are fit; hold out ~10% of the data at random for testing.
mask = np.random.rand(*tensor.shape) > .1

# Fit model.
result = ncp_hals(tensor, rank, mask=mask, ...)

# Compute residuals and training/test mean squared error.
sq_resids = (result.factors.full() - tensor) ** 2
train_mse = np.mean(sq_resids[mask])
test_mse = np.mean(sq_resids[~mask])
```

I didn't test that ^^^ out, so check it over and use with care.
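As a sanity check of the masked train/test bookkeeping, here's a self-contained NumPy version with synthetic stand-ins (no tensortools; the shapes are made up and `reconstruction` plays the role of `result.factors.full()`):

```python
import numpy as np

rng = np.random.default_rng(0)
shape = (30, 20, 10)

# Synthetic "data" and a synthetic "model reconstruction" of it.
tensor = rng.normal(size=shape)
reconstruction = tensor + 0.1 * rng.normal(size=shape)

# True entries are used for fitting; ~10% are held out for testing.
mask = rng.random(shape) > 0.1

# Squared residuals, split into training and test sets by the mask.
sq_resids = (reconstruction - tensor) ** 2
train_mse = np.mean(sq_resids[mask])
test_mse = np.mean(sq_resids[~mask])

print(train_mse, test_mse)   # both near 0.01, the noise variance
```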
I think this is okay, because
Good catch on these. I'll try to fix now. I would generally recommend using `ncp_hals` over `ncp_bcd`.
I think this should fix the problems you identified. Let me know if you agree -- ahwillia@200c9c3. cc @klmcguir -- have you been using `skip_modes`?
Hi Alex,

Yes, I think that commit should fix those errors. Thank you!

As for my first question, I think I might have been unclear. Your suggestion of the mask makes sense for cross-validating on held-out data from a single dataset, but I was thinking of using the function in a manner analogous to what I've been doing for PCA and NMF. There, I call `fit` on one dataset and then `transform` on another, so the learned gene loadings stay fixed.

Extending that to the tensor, I'd fit a rank-k decomposition to a 3D tensor of conditions x cell-types x genes. I'm interested in holding the gene factors constant while refitting on either new conditions or cell-types or both. This is where I was thinking I'd use `skip_modes`.

Does that approach make sense?
That approach potentially makes sense, but you should be careful with how you interpret the outcome. Even in the NMF case you aren't really cross-validating the full model, because when you call `transform` you are still fitting part of the model -- the new sample loadings -- to the held-out data.

In short, I think it may depend on your goals. If you want to evaluate the performance of the model quantitatively on held-out data I would be careful with what you're doing. If you are just trying to create a visualization of some sort then what you're suggesting is probably fine (and I think similar to what Kelly has been doing).
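For concreteness, the "hold the gene factors fixed, refit the rest" step can be sketched as a plain least-squares solve (all names and shapes here are made up for illustration, and a real NMF `transform` would additionally enforce nonnegativity):

```python
import numpy as np

rng = np.random.default_rng(1)
n_genes, rank, n_new_samples = 50, 4, 8

# Gene factors learned on a previous dataset, now held fixed.
W = rng.random((n_genes, rank))

# New data generated from the same gene factors with fresh sample loadings.
H_true = rng.random((rank, n_new_samples))
X_new = W @ H_true

# "transform": with W fixed, solve for the loadings that best reconstruct X_new.
H_fit, *_ = np.linalg.lstsq(W, X_new, rcond=None)

print(np.allclose(H_fit, H_true, atol=1e-6))   # prints True: exact data is recovered
```

The point Alex is making applies directly here: `H_fit` is itself optimized against the new data, so the reconstruction error on that data is not a fully honest held-out score.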
Hi Alex,
Thanks for this nice package! I've been applying it to some single-cell RNA seq data in our lab (tensors with cell-types x genes x conditions), and it's yielding promising results.
Like Kelly, I've been interested in fitting the tensor decomposition on subsets of the data (cells or genes in this case) and then applying that decomposition to additional datasets.
It looks like this is implemented via the `init` and `skip_modes` arguments, as you described in #20. However, even in the output of the refitting example that you provided, the `skip_modes` factors have slightly different values than their `init` values. Although similar, they are not actually being held constant.

Presumably, this is because of the calls to `U.rebalance` in the ALS algorithms? Not sure it's that big of an issue, but I didn't know the best way to address it. Should I use the bcd algorithm instead, add an option to ignore the rebalancing when refitting, or do you have any other suggestions?

Best,
David
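For what it's worth, the scale ambiguity behind rebalancing is easy to see in a few lines: rescaling one CP factor matrix per component and inversely rescaling another leaves the reconstructed tensor unchanged, so a rebalance step can alter "constant" factors without changing the fit. A small illustration (hypothetical shapes; this is not tensortools' actual `rebalance` code):

```python
import numpy as np

rng = np.random.default_rng(2)
rank = 3
A = rng.random((4, rank))   # e.g. condition factors
B = rng.random((5, rank))   # e.g. cell-type factors
C = rng.random((6, rank))   # e.g. gene factors, meant to be held constant

def cp_full(A, B, C):
    # Reconstruct the 3-way tensor as a sum of rank-one outer products.
    return np.einsum('ir,jr,kr->ijk', A, B, C)

# Push a per-component scale from C into A.
scale = np.array([0.5, 2.0, 3.0])
A2, C2 = A * scale, C / scale

print(np.allclose(cp_full(A, B, C), cp_full(A2, B, C2)))   # True: same model
print(np.allclose(C, C2))                                  # False: "constant" factors moved
```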