Keeping track of sets of samples #2716

PaulYannJay · 2023-02-24T15:45:16Z

Hi all,
First, thanks for the amazing tskit toolkit. It is such a pleasure to work with!

I'm trying to run analyses that require me to identify nodes that group the same set of samples, and I would like to find an efficient way to do so. Starting from a sequence of trees, I would like to label every node (e.g. assign it another ID or flag) according to the set of samples (i.e. leaf nodes) that coalesce to it, regardless of how these samples relate to each other (i.e. regardless of the tree topology) and regardless of the node's age. The idea is then to use these new node IDs to merge nodes that correspond to the same set of samples when they occur in adjacent trees. With that, we could identify genomic regions that show a particular genetic structure, for instance where a given set of samples is consistently grouped together.
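To make the intended output concrete, here is a minimal sketch of the merging step on made-up data. It assumes each tree has already been reduced to its interval plus the sample set under each node (with tskit this could be built from `tree.samples(node)` for every node in `tree.nodes()`). The function name `merge_adjacent_sample_sets` and the toy input are hypothetical, and the sketch uses the sample set itself (as a `frozenset`) as the node label.

```python
from collections import defaultdict


def merge_adjacent_sample_sets(trees):
    """Merge runs of adjacent trees in which the same sample set forms a clade.

    `trees` is a list of (left, right, sample_sets) tuples sorted by `left`,
    where `sample_sets` holds one iterable of sample IDs per node of that tree.
    Returns {frozenset_of_samples: [(first_left, last_right), ...]}.
    """
    segments = defaultdict(list)
    for left, right, sample_sets in trees:
        # De-duplicate within a tree: several nodes can share one sample set.
        for key in {frozenset(s) for s in sample_sets}:
            runs = segments[key]
            if runs and runs[-1][1] == left:
                # Same sample set in the adjacent tree: extend the current run.
                runs[-1] = (runs[-1][0], right)
            else:
                # First occurrence, or a gap since the last one: start a new run.
                runs.append((left, right))
    return dict(segments)


# Toy input (made-up data): three adjacent trees over samples 0..3.
toy_trees = [
    (0.0, 10.0, [{0, 1}, {2, 3}, {0, 1, 2, 3}]),
    (10.0, 25.0, [{0, 1}, {0, 1, 2}, {0, 1, 2, 3}]),
    (25.0, 40.0, [{2, 3}, {0, 1}, {0, 1, 2, 3}]),
]
merged = merge_adjacent_sample_sets(toy_trees)
for samples, runs in merged.items():
    print(sorted(samples), runs)
```

In this toy case the set {0, 1} is grouped across the whole 0 to 40 region, whereas {2, 3} appears in two separate runs because it is absent from the middle tree.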

So far, the solution I have found is to loop over the nodes of all trees, compute two statistics (the arithmetic and harmonic means) on the numeric IDs of the samples that coalesce to each node, and then output these values for each node in a data table. Two different sets of samples are very unlikely to have the same arithmetic and harmonic means, so this makes it easy to identify nodes with the same set of samples (see for instance the Python loop below).
Then, we can merge adjacent nodes that have the same arithmetic and harmonic means (using for instance the R script below).
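As a small illustration of the two-mean label, the sketch below computes it for a few made-up sample sets. The helper name `fingerprint` is hypothetical. As a precaution, the sketch shifts the IDs by one before taking the harmonic mean, because `statistics.harmonic_mean` returns 0 as soon as one value is 0, so all sets that contain sample 0 would otherwise share the same harmonic mean.

```python
import statistics


def fingerprint(sample_ids):
    """Return (arithmetic mean, harmonic mean) of a collection of sample IDs.

    The harmonic mean is taken on IDs shifted by one, so that sample 0 does
    not force the result to 0.
    """
    ids = list(sample_ids)
    arithmetic = statistics.fmean(ids)
    harmonic = statistics.harmonic_mean([i + 1 for i in ids])
    return (arithmetic, harmonic)


# The label depends only on which samples are present, not on their order.
same_a = fingerprint([3, 1, 2])
same_b = fingerprint([1, 2, 3])

# Two different sets with the same arithmetic mean (2.0), both containing sample 0.
set_x = [0, 1, 5]
set_y = [0, 2, 4]
raw_x = statistics.harmonic_mean(set_x)  # 0: a zero is present
raw_y = statistics.harmonic_mean(set_y)  # 0: a zero is present

print(same_a == same_b)
print(raw_x == raw_y, fingerprint(set_x) == fingerprint(set_y))
```

The label remains a heuristic: distinct sets can in principle share both means, so an exact key such as a `frozenset` of the sample IDs is safer when no collision can be tolerated.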

However, I guess there are much more efficient solutions than looping over all the nodes and sets of samples.
Do you have any better ideas?

Thank you very much,

Cheers,
Paul

Python:
import statistics

# `ts` is assumed to be an already-loaded tskit.TreeSequence.
with open("Node.NodeStat", "w") as textfileNode:  # open the output file
    for tree in ts.trees(sample_lists=True):
        TreeInterval = tree.interval
        Left = str(TreeInterval[0])
        Right = str(TreeInterval[1])
        for node in tree.nodes():
            SampleIds = list(tree.samples(node))  # samples that coalesce to this node
            NodeIdMean = statistics.fmean(SampleIds)  # mean of the samples' numeric IDs
            # Harmonic mean of the samples' numeric IDs, shifted by 1 because
            # harmonic_mean() returns 0 for any set that contains sample 0
            NodeIdHarmoMean = statistics.harmonic_mean([s + 1 for s in SampleIds])
            # Node age; the original looked it up in OldNodeNotRoot, which is not
            # defined in this snippet, so tree.time(node) is used as an assumption
            NodeAge = tree.time(node)
            Line = "\t".join([Left, Right, str(node), str(NodeAge), str(NodeIdMean), str(NodeIdHarmoMean), str(len(SampleIds))]) + "\n"  # output line
            textfileNode.write(Line)


R:
library(dplyr)

Node=read.table("Node.NodeStat", stringsAsFactors = F)
colnames(Node)=c("Left", "Right", "ID", "Age", "NodeMeanId", "NodeHarmoMeanId", "NSample")
Node$Size=Node$Right - Node$Left # Size of the genomic span covered by the node

# Sort the nodes by their new IDs (mean and harmonic mean) and by position.
# Follow is 0 when a row has the same new IDs as the previous row (i.e. the same
# set of samples coalesces to both nodes) and lies in the adjacent tree, and 1
# otherwise (switch to another sample set, or a non-adjacent position).
NodeSort=Node %>% arrange(NodeMeanId, NodeHarmoMeanId, Left) %>% mutate(Follow=ifelse((lag(Right)==Left & lag(NodeMeanId)==NodeMeanId & lag(NodeHarmoMeanId)==NodeHarmoMeanId), 0, 1) %>% coalesce(0))
# Create a group ID (IDID) based on sample set and position: a node with the same
# sample set as the previous one but at a non-consecutive position starts a new group.
NodeSort2 =NodeSort %>% group_by(NodeMeanId, NodeHarmoMeanId) %>% mutate(IDID=cumsum(Follow) + 1) %>%
  group_by(IDID, NodeMeanId, NodeHarmoMeanId) %>% # Group by this ID
  mutate(SumSpan=sum(Size), FirstPos=min(Left), LastPos=max(Right), Age=mean(Age))%>%  # Summarise each group
  distinct(IDID, NodeMeanId, NodeHarmoMeanId, .keep_all=TRUE) # Keep one row per group (mutate() followed by distinct() gives the same result as summarise() while keeping all columns)

write.table(NodeSort2, paste("Node.NodeStat", ".summarised.txt", sep=""), quote = F, col.names = T, row.names = F) #write output
