Hi @jokergoo,
I came across this issue while trying to create clusters of pathways from MSigDB:
When clustering gene sets, some clusters contain members with very low similarity to one another, including pairwise similarities of zero. Details and a reproducible example are below.
It might be an expected consequence of the methods themselves, but it is odd. My question is whether you can explain this behavior. It might be related to this issue
I am also aware of this caveat ;)
But note, as we benchmarked in the manuscript, the clustering on the gene overlap similarity performs much worse than on the semantic similarity.
To reproduce the issue, I collected some random pathways from collections that are relevant to my real-life example, and added one particular pathway that made me notice the issue.
Following the package instructions, I calculated the similarity matrix using overlap, because one of my ideas was to reduce the sets that are contained in others. I have not tested other similarity metrics.
I followed with the clustering itself, here using dynamicTreeCut, but other methods (bin_cut and apcluster) showed the same issue. For data inspection, I created a new table with the pairwise similarities.
Now let's find the offending cluster and pick a biologically unrelated gene set:
As we can see, those two gene sets, which are in the same cluster, have no overlap.
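For anyone who wants to run the same kind of check outside R, here is a minimal Python sketch. It is not the code used in this report: it assumes the overlap similarity is the overlap coefficient, |A ∩ B| / min(|A|, |B|), and the gene sets, cluster labels, and function names (`overlap_coefficient`, `zero_overlap_pairs`) are invented for illustration.

```python
from itertools import combinations


def overlap_coefficient(a, b):
    """Shared genes divided by the size of the smaller set (0.0 if either set is empty)."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def zero_overlap_pairs(gene_sets, clusters):
    """Return (cluster, term_a, term_b) for co-clustered terms that share no genes."""
    members = {}
    for term, label in clusters.items():
        members.setdefault(label, []).append(term)
    hits = []
    for label, terms in sorted(members.items()):
        for t1, t2 in combinations(sorted(terms), 2):
            if overlap_coefficient(gene_sets[t1], gene_sets[t2]) == 0.0:
                hits.append((label, t1, t2))
    return hits


# Toy data, invented for illustration only (not the MSigDB sets from this report).
gene_sets = {
    "SET_A": {"G1", "G2", "G3", "G4"},
    "SET_B": {"G1", "G2"},
    "SET_C": {"G7", "G8"},
    "SET_D": {"G9", "G10", "G11"},
}
clusters = {"SET_A": 1, "SET_B": 1, "SET_C": 2, "SET_D": 2}

print(overlap_coefficient(gene_sets["SET_A"], gene_sets["SET_B"]))  # 1.0 (SET_B is contained in SET_A)
print(zero_overlap_pairs(gene_sets, clusters))  # [(2, 'SET_C', 'SET_D')]
```

Note that with this definition a small set fully contained in a larger one scores 1.0, which is why the metric suits the "reduce contained sets" idea, while two disjoint sets always score exactly 0.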
Since this could simply be because there are stronger overlaps with the other gene sets in the cluster, I checked all pairwise similarity values for WP_MIRNA_BIOGENESIS in that cluster vs. all clusters:
The mean similarity value is higher in the cluster, but still quite low. But how low?
So I compared the mean similarity per cluster, and it seems like this cluster has the lowest mean similarity score of all clusters by some margin:
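The per-cluster comparison can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the R code from this report: the similarity matrix, term names, and cluster labels are toy values, and `mean_similarity_per_cluster` is a hypothetical helper.

```python
from itertools import combinations
from statistics import mean


def mean_similarity_per_cluster(names, sim, clusters):
    """Mean pairwise similarity inside each cluster (None for singleton clusters)."""
    index = {name: i for i, name in enumerate(names)}
    members = {}
    for name in names:
        members.setdefault(clusters[name], []).append(name)
    result = {}
    for label, terms in members.items():
        # Only off-diagonal pairs are used, so self-similarity does not inflate the mean.
        pairs = [sim[index[a]][index[b]] for a, b in combinations(terms, 2)]
        result[label] = mean(pairs) if pairs else None
    return result


# Toy symmetric similarity matrix, invented for illustration only.
names = ["T1", "T2", "T3", "T4", "T5"]
sim = [
    [1.0, 0.8, 0.6, 0.0, 0.1],
    [0.8, 1.0, 0.7, 0.0, 0.0],
    [0.6, 0.7, 1.0, 0.1, 0.0],
    [0.0, 0.0, 0.1, 1.0, 0.0],
    [0.1, 0.0, 0.0, 0.0, 1.0],
]
clusters = {"T1": 1, "T2": 1, "T3": 1, "T4": 2, "T5": 2}

scores = mean_similarity_per_cluster(names, sim, clusters)
# Cluster with the lowest within-cluster mean, ignoring singletons.
lowest = min((k for k in scores if scores[k] is not None), key=lambda k: scores[k])
print(scores)
print("least cohesive cluster:", lowest)
```

In the toy data, cluster 2 groups two terms that share nothing, so its within-cluster mean is 0 and it is reported as the least cohesive cluster.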
So it seems that the clustering methods, at least those tested, end up putting a number of what should be singletons into a single cluster. Is there a way to alleviate this issue, or a clustering method (or similarity metric) that is less prone to it?
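One possible workaround, sketched here in Python purely as an illustration, is a post-hoc filter that breaks any cluster with a low mean within-cluster similarity back into singletons. This is not a feature of the package and does not explain why the methods group these terms. The helper name `split_loose_clusters`, the `min_mean=0.2` threshold, and the toy data are all assumptions.

```python
from itertools import combinations


def split_loose_clusters(names, sim, clusters, min_mean=0.2):
    """Keep cohesive clusters; relabel members of loose clusters as singletons.

    A cluster counts as cohesive when the mean of its pairwise similarities
    is at least min_mean (an arbitrary threshold chosen for this sketch).
    """
    index = {name: i for i, name in enumerate(names)}
    members = {}
    for name in names:
        members.setdefault(clusters[name], []).append(name)

    new_labels = {}
    next_label = 1
    for label in sorted(members):
        terms = members[label]
        pairs = [sim[index[a]][index[b]] for a, b in combinations(terms, 2)]
        cohesive = bool(pairs) and sum(pairs) / len(pairs) >= min_mean
        if cohesive:
            for term in terms:
                new_labels[term] = next_label
            next_label += 1
        else:
            # Loose cluster (or existing singleton): every term gets its own label.
            for term in terms:
                new_labels[term] = next_label
                next_label += 1
    return new_labels


# Toy symmetric similarity matrix, invented for illustration only.
names = ["T1", "T2", "T3", "T4", "T5"]
sim = [
    [1.0, 0.8, 0.6, 0.0, 0.1],
    [0.8, 1.0, 0.7, 0.0, 0.0],
    [0.6, 0.7, 1.0, 0.1, 0.0],
    [0.0, 0.0, 0.1, 1.0, 0.0],
    [0.1, 0.0, 0.0, 0.0, 1.0],
]
clusters = {"T1": 1, "T2": 1, "T3": 1, "T4": 2, "T5": 2}

print(split_loose_clusters(names, sim, clusters))
# {'T1': 1, 'T2': 1, 'T3': 1, 'T4': 2, 'T5': 3}
```

A mean-based threshold is only one option. Requiring every pair in a cluster to exceed a minimum similarity would be stricter, at the cost of fragmenting clusters that are held together by a few strong overlaps.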
Cheers!