Keeping track of sets of samples #2716
PaulYannJay
asked this question in
Q&A
Hi all,
First, thanks for the amazing tskit toolkit; it is such a pleasure to work with!
I'm trying to run analyses that require me to identify nodes that group the same set of samples, and I would like to find an efficient way to do so. Starting from a sequence of trees, I would like to label every node (e.g. assign it another ID or flag) according to the set of samples (i.e. leaf nodes) that coalesce to it, regardless of how these samples relate to each other (i.e. regardless of the tree topology) and regardless of the node's age. The idea is then to use these new node IDs to merge nodes that correspond to the same set of samples when they occur in adjacent trees. With that, we could identify genomic regions that show a particular genetic structure, for instance regions where a given set of samples is consistently grouped together.
So far, the solution I have found is to loop over the nodes of all trees, compute two statistics (the arithmetic and harmonic means) on the numeric IDs of the samples that coalesce to each node, and then output these values for each node in a data table. Two different sets of samples are very unlikely to have the same arithmetic and harmonic means, so this lets us easily identify nodes with the same set of samples (for instance, the following Python loop does so).
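A minimal sketch of such a loop, assuming `ts` is a `tskit.TreeSequence` and using only `ts.trees()`, `tree.nodes()`, `tree.samples(u)` and `tree.interval`. The function names and the row layout are hypothetical, chosen for illustration. The sample IDs are shifted by 1 before averaging, which is an assumption of this sketch: tskit node IDs start at 0, and a 0 would make the harmonic mean undefined.

```python
from statistics import fmean


def sample_set_stats(sample_ids):
    """Return (arithmetic mean, harmonic mean) of a collection of sample IDs.

    Assumption of this sketch: IDs are shifted by 1 so that sample 0
    does not cause a division by zero in the harmonic mean.
    """
    shifted = [s + 1 for s in sample_ids]
    arithmetic = fmean(shifted)
    harmonic = len(shifted) / sum(1.0 / s for s in shifted)
    return arithmetic, harmonic


def node_stats_rows(ts):
    """Yield one row per (tree, node) pair.

    `ts` is expected to be a tskit.TreeSequence. Each row is
    (left, right, node_id, n_samples, arithmetic_mean, harmonic_mean).
    """
    for tree in ts.trees():
        for u in tree.nodes():
            samples = list(tree.samples(u))
            if not samples:
                # Node with no samples below it: nothing to summarise.
                continue
            arithmetic, harmonic = sample_set_stats(samples)
            yield (tree.interval.left, tree.interval.right, u,
                   len(samples), arithmetic, harmonic)
```

The rows could then be collected with `list(node_stats_rows(ts))` and written to a table. Both means are floating-point values, so comparing them exactly between trees relies on the samples being summed in the same order each time.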
Then, we can easily merge nodes from adjacent trees that have the same arithmetic and harmonic means (using for instance the following R script).
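The merging step can be sketched in Python as well (rather than R, to stay in one language). This is only an illustration under stated assumptions: the hypothetical function `merge_adjacent` takes `(left, right, key)` tuples, where `key` is any hashable signature of the sample set (for example the pair of means), and it treats two intervals as adjacent when one ends exactly where the next begins.

```python
def merge_adjacent(rows):
    """Merge abutting genomic intervals that carry the same key.

    `rows` is an iterable of (left, right, key) tuples. Returns a dict
    mapping each key to a list of merged (left, right) spans.
    """
    merged = {}
    # Process intervals in genomic order so abutting ones meet in sequence.
    for left, right, key in sorted(rows, key=lambda r: r[0]):
        spans = merged.setdefault(key, [])
        if spans and spans[-1][1] == left:
            # Same key in the immediately adjacent tree: extend the span.
            spans[-1][1] = right
        else:
            # First occurrence, or a gap since the last one: start a new span.
            spans.append([left, right])
    return {k: [tuple(s) for s in v] for k, v in merged.items()}
```

For example, a key seen on `[0, 5)` and `[5, 9)` ends up as a single span `(0, 9)`, while a later occurrence on `[12, 20)` stays a separate span because the key was absent in between.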
However, I guess there are much more efficient solutions than looping over all the nodes and sets of samples.
Do you have any better ideas?
Thank you very much,
Cheers,
Paul