Results interpretation - overestimation? #163
I think question 2 is answered by #134 (?) |
Hi @giorgiatosoni, you're right that this is related to #134. Unfortunately, CellBender seems to struggle to call cells on some experiments. I am actively working on this. Currently, my answers to your questions would be:
I would suggest trying
you could even try |
Hi @sjfleming, thank you very much for your answer! I think with the new parameters the cell call is much more realistic: 1752 called cells vs 7998 called cells with the previous parameters. However I see that with the new parameters the training plot looks a bit weird, what do you think? |
Also, update on "Yes, the output can still be considered cleaner in most cases I've seen. But it is probably still better to try to make a few tweaks to try to get cell calling to work appropriately." |
@giorgiatosoni, great, I think your second plot looks excellent. That's exactly what I would hope to see. The training plot also looks fine to me; I don't see any problem with it. As for the initial run, do you mean that CellBender removed zero counts the first time? |
@sjfleming nice thanks!! Yes exactly, it didn't remove any counts, probably because it failed to determine what was ambient? |
Huh, I did not expect that... well, thanks for letting me know! I should keep that in mind in future... |
Hi, Coming back to the overestimation problem, I have another sample for which I tried different parameters, but all the nuclei are kept after running CellBender. The parameters tried are: cellbender remove-background and cellbender remove-background These are the resulting summaries: output.pdf Any suggestions?? |
Hmm, I have to say I'm not sure. How about trying
and maybe, for good measure, add
Does the log file indicate that CellBender is accurately estimating the counts in cells and the counts in empty droplets? ( |
Thank you, I'm gonna try this! Not sure if the CellBender estimation is correct here.. cellbender:remove-background: Prior on counts in empty droplets is 17 |
Ah yeah, it's getting the empty droplet counts wrong :( Hopefully the |
Oh I see.. thanks for the explanation! |
Hi @sjfleming, getting back to this after some time. Just wanted to let you know that --low-count-threshold 100 did the trick; the output looks much better now: S8R5_output.pdf. Also, we are planning to use CellBender for quite a big analysis, so I was wondering if you have an estimated release date for v3 :) Thanks a lot for all your help, it's much appreciated!! Best, |
Also, have the same question! When will the v3 be released? :D |
Hi @giorgiatosoni, great! That kind of "overestimation" is expected in this case, and here's why.

For the UMI curve above, we can see that, while there does appear to be a sharp drop-off pretty close to the expected 4k cells, there is a long tail of droplets with counts well above the "empty droplet plateau level", which seems to be about 1000 UMI counts in this experiment. CellBender takes into account UMI counts as well as the gene expression profile itself, and the algorithm has apparently determined that those highlighted droplets are sufficiently different from the completely empty droplets that the posterior probability of being "non-empty" is high.

But it's worth keeping in mind: although the plot says "cell probability", it's really the probability that a droplet is not empty. The practical consequence is that you should still do cell QC after CellBender. While CellBender has identified about 12k non-empty droplets, not all of these are going to be high-quality droplets that you want to use in your analysis. The nice thing is that, a lot of the time, CellBender + cell QC will pick up on cell types that might have been missed by cell calling with CellRanger + cell QC. CellBender (without additional cell QC) will include more droplets with poor-quality or dying cells, though, since those droplets are in fact not empty. |
@giorgiatosoni and @ccruizm, great question about timing for the v3 release. I really hope to have it ready by March 10. |
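To make the distinction between "non-empty" and "good quality" concrete, here is a minimal sketch of a post-CellBender QC step. It is not part of CellBender: the function name `select_cells`, the input arrays, and all three cutoffs (0.5, 500 UMIs, 10% mitochondrial fraction) are hypothetical values chosen only for illustration, and sensible cutoffs will depend on the experiment.

```python
import numpy as np

def select_cells(cell_probability, total_umis, mt_fraction,
                 min_probability=0.5, min_umis=500, max_mt_fraction=0.1):
    """Boolean mask of droplets that are non-empty AND pass simple QC.

    All thresholds are illustrative assumptions, not CellBender defaults.
    """
    cell_probability = np.asarray(cell_probability, dtype=float)
    total_umis = np.asarray(total_umis, dtype=float)
    mt_fraction = np.asarray(mt_fraction, dtype=float)

    # Step 1: "cell probability" really means "probability the droplet is not empty".
    non_empty = cell_probability >= min_probability

    # Step 2: separate cell QC, since non-empty droplets can still hold poor-quality cells.
    passes_qc = (total_umis >= min_umis) & (mt_fraction <= max_mt_fraction)

    return non_empty & passes_qc

# Toy droplets: a good cell, a high-MT cell, a low-count non-empty droplet, an empty droplet.
prob = [0.99, 0.98, 0.97, 0.02]
umis = [8000, 1200, 300, 40]
mt = [0.03, 0.45, 0.05, 0.10]
mask = select_cells(prob, umis, mt)
print(mask)
```

In this toy input, three of the four droplets are called non-empty, but only the first one survives QC, which mirrors the gap between "about 12k non-empty droplets" and the number of cells one would actually analyze.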
Thank you so much @sjfleming and looking forward to using the v3! |
Hi @sjfleming, thanks for this helpful discussion. I was wondering if you could expand a bit more on how the Thanks! |
Update on the timing for v3 release: follow this PR |
Hi @wmacnair, unfortunately it's a little more subtle than that. There are a number of heuristics currently used to figure out (1), (2), and (3), and they all center around looking at the full UMI curve to guess the values. But in datasets that are very clean, and in particular if the real "empty droplet plateau" is around 10 or 15 UMI counts per empty droplet, then this default value becomes a problem, and it should be changed to something more like 5. The rule of thumb is that this value should be below the empty droplet plateau, and should just cut off the long, long tail of very, very low-count droplets. |
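The rule of thumb above (keep the value below the empty droplet plateau) can be sketched roughly as follows. This is not how CellBender estimates anything internally: the helper `suggest_low_count_threshold`, the hand-picked rank range, and the `fraction=0.5` factor are all assumptions made for illustration, and the data are synthetic.

```python
import numpy as np

def suggest_low_count_threshold(umi_counts, empty_rank_range, fraction=0.5):
    """Suggest a threshold that sits below the empty-droplet plateau.

    umi_counts: total UMI counts per barcode, in any order.
    empty_rank_range: (start, stop) barcode ranks, read off the UMI curve
        by eye, that lie on the empty-droplet plateau (hypothetical input).
    fraction: how far below the plateau to place the threshold (assumption).
    """
    # Sort in descending order, as on a rank-ordered UMI curve.
    counts = np.sort(np.asarray(umi_counts))[::-1]
    start, stop = empty_rank_range
    plateau = float(np.median(counts[start:stop]))
    threshold = max(1, int(plateau * fraction))
    return plateau, threshold

# Synthetic, very clean dataset: 100 cells, 1000 empties at 12 UMIs,
# and a long tail of barcodes with a single count.
umi_counts = np.concatenate([
    np.full(100, 5000),
    np.full(1000, 12),
    np.full(5000, 1),
])
plateau, threshold = suggest_low_count_threshold(umi_counts, (200, 1000))
print(plateau, threshold)
```

A value chosen this way could then be passed as `--low-count-threshold`, the flag mentioned earlier in this thread. It is still worth checking the result against the UMI curve by eye.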
Hi @sjfleming Thanks for this explanation, that makes sense. Yes developing reliable heuristics to parse out the UMI curve is hard! Especially in the context of deeply-sequenced and "dirty" clinical samples, where the empties plateau can have UMI counts in the hundreds or even thousands, the knee going from cells to empties is not sharp at all, and even the "errors" plateau might be at 100... I think something relatively simple to add that could be helpful would be to include an additional plot in your diagnostics, that showed the priors learned by Thanks again! |
On a slightly different topic, as I wasn't sure where to put it - do you know this paper? An interesting claim here is that they see two distinct types of ambient contamination, one cytoplasmic, and one nuclear. This fits with my own observations of our data, i.e. two distinct profiles of ambient in some samples. So perhaps v4 could include two distinct ambient profiles? ;) Will |
Thanks for the suggestion @wmacnair, I think a plot that shows the CellBender priors would be a nice addition! I have come across that paper, yeah, but I didn't see the part about two distinct types of ambient. That's an interesting observation. I'll have to read it more carefully to understand exactly what they mean by that! |
I think the idea is that you get two different biological processes leading to different ambient profiles. One is from the cytoplasm, and I think has a lot of MT- gene expression, while the other is from nuclei that have lysed. I think this could explain some cases where It feels like in principle this could be ok to implement within your current framework, although probably a decent amount of work. Something like a |
It's an interesting thought, but yeah, it would be a different noise model. My current thinking would still be that even if you have two different biological processes leading to different ambient profiles (your example of nuclear ambient and cytoplasmic ambient is a good one), I'd still expect those ambient RNAs to mix completely once they are cell-free, and to form one combined "ambient RNA" profile from which the empty droplets sample their counts. Do you agree?

But I do agree with you that there is one thing CellBender does not currently model, which can happen: things like mitochondria sticking to a nucleus in snRNA-seq. I have seen pictures of this kind of thing under the microscope. Bits of organelles, etc. "Debris". Because that's another level of complexity, where some nuclei might have "junk" attached and others do not, CellBender doesn't try to model it. That's a really hard problem at that point, because being able to tell whether something is "an added, unwanted mitochondrion or organelle" versus a true biological difference in gene expression in the nucleus is going to be very tough. Right now the thinking is that those nuclei which have the "extra cytoplasmic stuff" should get QCed out (by some downstream tool after CellBender) due to their high MT fraction or high exonic read fraction. |
Regarding the "sticky" things (e.g. mitochondria, organelles, ER): yes, I also strongly think that's the case. I actually did some analysis of some of our snRNAseq data, and for some samples the two assumptions "MT- gene expression is purely ambient" and "the ambient profile is completely captured by the empties" can't both be true. As in, if you remove the maximum amount of ambient consistent with true expression being >= 0, there is still some MT- left; and if you use MT- levels to determine how much ambient to remove, this implies true expression < 0...

I slightly disagree on how to address this, and I don't think we necessarily want to exclude these nuclei via QC. If they're only contaminated by mitochondria, what is the rationale for excluding them? I've been excluding the MT- genes and keeping the nuclei, partly because if I don't do that we lose even more of our data to QC! And I do think that QC is still not well understood; in particular, we have pretty arbitrary rationales for excluding cells, without data to link them to true biology. For example, what number of UMIs corresponds to an intact cell? See thread here.

Regarding whether "ambient RNA" is one nicely mixed profile, maybe...?? I think I've been surprised enough by biology to really not be sure what the truth is! It could also be the case that what contaminates a given droplet is "lumpy", so e.g. x * well-mixed ambient + n * organelle chunks. Or maybe better phrased: contamination by ambient RNA could be well-mixed, but there are actually multiple simultaneous contamination processes... 😅

What this all points to is the need for a gold standard dataset to allow proper benchmarking of methods like Cheers |
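The inconsistency between the two assumptions can be illustrated with a toy calculation. This is not the analysis referred to above: the three-gene "nucleus", the ambient profile, and the helper `max_ambient_scale` are made up for illustration, and the model assumes that contamination is simply a scalar multiple of one ambient profile.

```python
import numpy as np

def max_ambient_scale(observed, ambient_profile):
    """Largest alpha such that observed - alpha * ambient_profile >= 0 for every gene."""
    observed = np.asarray(observed, dtype=float)
    ambient_profile = np.asarray(ambient_profile, dtype=float)
    present = ambient_profile > 0  # genes absent from the ambient profile impose no limit
    return float(np.min(observed[present] / ambient_profile[present]))

# Hypothetical counts for one nucleus; genes are [MT-, gene A, gene B].
observed = np.array([40.0, 10.0, 50.0])
# Hypothetical ambient profile estimated from empty droplets (fractions sum to 1).
ambient = np.array([0.5, 0.25, 0.25])

# Assumption 1: the empties capture all of the ambient, so remove as much as possible
# while keeping every gene's true expression non-negative.
alpha_max = max_ambient_scale(observed, ambient)
residual_max = observed - alpha_max * ambient

# Assumption 2: MT- expression is purely ambient, so scale the removal to zero out MT-.
alpha_mt = observed[0] / ambient[0]
residual_mt = observed - alpha_mt * ambient

print(alpha_max, residual_max)
print(alpha_mt, residual_mt)
```

In this toy case, removing the maximum admissible ambient still leaves MT- counts behind, while removing enough to zero out MT- drives gene A negative. That is the same kind of contradiction described above, and it is what one would expect if some MT- signal comes from material stuck to the nucleus rather than from well-mixed ambient RNA.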
Hi,
I would like some help. I'm running CellBender on a "problematic" sample (a trial of a new nuclei isolation protocol, which resulted in lower quality). We know that many cells need to be discarded from this sample (currently achieved with stringent filtering on nUMI and %mt). I decided to try CellBender to see if I could get a better estimate of which cells to keep for downstream analysis, but CellBender seems to keep all the cells. From the Cell Ranger output, I'm not sure what is actually a cell and what is ambient or a low-quality cell. I used the following parameters:
and got the attached pdf output.
s2r2.pdf
Now my questions are:
What does this output mean? Is this sample not suitable for CellBender, or did I just use the wrong parameters?
Since the outputs of Cell Ranger and CellBender are comparable in the number of cells kept, can the CellBender output still be considered cleaner because it should also have removed some background from the new count matrix, or are the CellBender results not trustworthy?
I hope it is clear and thanks very much in advance!