implement Viterbi on segregation prob #165
One way to implement the Viterbi algorithm on segregation probabilities is to treat the loci as time steps and the different segregation statuses as states, with a transition function that depends on the recombination rate (a sketch of such a transition matrix is given below). However, the problems with this implementation are:
The other way to implement the algorithm is to first select a few recombination hotspots using the recombination rates, and then apply the algorithm over those hotspots.
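As an illustrative sketch of the first formulation (my own sketch, not code from this thread), the transition matrix between the four segregation states pp, pm, mp, mm for a per-interval recombination rate r could be built as follows; the same matrix structure appears in the implementation posted further down:

import numpy as np

# Transition probabilities between segregation states pp, pm, mp, mm for one
# interval with recombination rate r: zero, one, or two recombinations.
def transition_matrix(r):
    no_rec = (1 - r) ** 2   # neither gamete recombines
    one_rec = r * (1 - r)   # exactly one gamete recombines
    two_rec = r ** 2        # both gametes recombine
    return np.array([[no_rec,  one_rec, one_rec, two_rec],
                     [one_rec, no_rec,  two_rec, one_rec],
                     [one_rec, two_rec, no_rec,  one_rec],
                     [two_rec, one_rec, one_rec, no_rec]])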
@XingerTang as you know, having good estimates of recombination rates, and hence hotspots, is hard, so we cannot start from something we don't generally have. I suggest that we first just do the naive call and see how it looks. I suspect that it should work well, but we will have some false positives (= switch errors). Do we need an alternative algorithm that would smooth out these short recombination blips (so that we get long stretches of haplotypes; we know there are only a few recombinations between a parent and a progeny!) and create clear cuts from one parent's haplotype to the other?
@gregorgorjanc Uni is trying to implement the naive-call function for the AlphaPeel format data. I experimented on my side with the segregation data in my custom data format, just to see the result quickly. To measure performance, I used the sum of the absolute values of the differences between the predicted segregation data and the true segregation data. The real inaccuracies would be obtained by normalising this absolute sum. On the whole data set, the absolute sum of differences is 1245326. Excluding the first generation, which is full of [0.25, 0.25, 0.25, 0.25], the absolute sum of differences is 645326.
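For reference, a minimal sketch of this measure (the function name is mine, not from the thread):

import numpy as np

# Sum of absolute differences between predicted and true segregation
# probabilities, summed over individuals, loci, and the four states.
def abs_sum_diff(pred_seg, true_seg):
    return np.sum(np.abs(pred_seg - true_seg))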
@gregorgorjanc For smoothing out short recombination blips, a common practice is to use windows, but then it is difficult to locate the exact position of the recombination event. I'm still working on the regression matrix that uses the segregation probabilities of the neighbouring loci to predict the recombination rate; if that works well, I expect it can smooth out at least some short recombination blips. A window-based smoother is sketched below.
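A minimal sketch of such window smoothing (one plausible variant, assumed by me rather than taken from this thread): replace each locus's called state with the majority state in a surrounding window.

import numpy as np

# Majority-vote smoothing of called segregation states (coded 0-3) within a
# window of +/- half_window loci around each position.
def window_smooth(states, half_window=25):
    n = len(states)
    smoothed = np.empty(n, dtype=states.dtype)
    for i in range(n):
        lo, hi = max(0, i - half_window), min(n, i + half_window + 1)
        smoothed[i] = np.bincount(states[lo:hi]).argmax()
    return smoothed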
As the generation increases, the performance of the naive calling algorithm improves, except for the last generation. The following data are already divided by 2.
This is probably a result of the improved accuracy of the segregation probabilities estimated by AlphaPeel for the middle generations.
@XingerTang this is my code for "Naive-Convert" and "Viterbi-Convert".

Naive-Convert:

import numpy as np

def naive_convert(seg_prob_array):
    # Call each locus's segregation status as the most probable state.
    seg_array = np.zeros(seg_prob_array.shape)
    for i in range(seg_prob_array.shape[0]):
        for j in range(seg_prob_array.shape[1]):
            seg_array[i, j, np.argmax(seg_prob_array[i, j])] = 1
    return seg_array

Viterbi-Convert:

# seg_prob_array holds the segregation probabilities, shape (nIndividuals, nLoci, 4).
# Use an equal recombination rate to generate the transmission vector for the loci.
# If we had distances, we could imitate the Li-Stephens model's transition function
# and add distance.
transmission = np.ones(seg_prob_array.shape[1]) * (1636 / (800 * 2000))

def Viterbi(pointSeg, transmission):
    # Segregation state ordering: pp, pm, mp, mm
    nLoci = pointSeg.shape[0]
    seg = np.full(pointSeg.shape, 0.25, dtype=np.float32)
    seg_arg = np.full(pointSeg.shape, 0, dtype=np.int16)
    for i in range(nLoci):
        for j in range(4):
            seg[i, j] = pointSeg[i, j]
    tmp = np.full(4, 0, dtype=np.float32)
    new = np.full(4, 0, dtype=np.float32)
    new_arg = np.full(4, 0, dtype=np.int16)
    prev = np.full(4, 0.25, dtype=np.float32)
    for i in range(nLoci):
        # Create the transition matrix from the recombination rate. At i == 0 this
        # reads transmission[-1], which is harmless here because the rate is constant.
        e = transmission[i - 1]
        e2 = e ** 2
        e1e = e * (1 - e)
        e2i = (1.0 - e) ** 2
        transition = np.array([[e2i, e1e, e1e, e2],
                               [e1e, e2i, e2, e1e],
                               [e1e, e2, e2i, e1e],
                               [e2, e1e, e1e, e2i]])
        # At the first locus, regard [0.25, 0.25, 0.25, 0.25] as the initial state.
        if i != 0:
            for j in range(4):
                tmp[j] = prev[j] * pointSeg[i - 1, j]
        else:
            for j in range(4):
                tmp[j] = prev[j] * 0.25
        # Normalise
        sum_j = 0
        for j in range(4):
            sum_j += tmp[j]
        if sum_j == 0:
            sum_j = 1
        for j in range(4):
            tmp[j] = tmp[j] / sum_j
        # Forward step: best predecessor score and its index for each state
        # (segregation order: pp, pm, mp, mm).
        for j in range(4):
            new[j] = np.max(tmp * transition[j] * seg[i, j])
            new_arg[j] = np.argmax(tmp * transition[j] * seg[i, j])
        for j in range(4):
            seg_arg[i, j] = new_arg[j]
        prev = new.copy()  # copy so prev does not alias new
    # There are four paths we could track back; use the most likely state at the
    # final locus (taken from the segregation probabilities).
    prev_index = np.argmax(pointSeg[-1])
    # Backtrack
    track_rec = np.zeros(seg_arg.shape[0] - 1, dtype=np.int16)
    a = np.zeros(seg_arg.shape[0], dtype=np.int16)
    for i in range(seg_arg.shape[0] - 1, -1, -1):
        prev_index = seg_arg[i][prev_index]
        a[i] = prev_index
    # Find the breakpoints: a state change between adjacent loci marks a recombination.
    for i in range(seg_arg.shape[0] - 1):
        if a[i] != a[i + 1]:
            track_rec[i] = 1
    return track_rec

# Run for every individual
seg_a = np.zeros((1000, 1999), dtype=np.int16)
for individual in range(seg_a.shape[0]):
    track_rec = Viterbi(seg_prob_array[individual], transmission)
    seg_a[individual] += track_rec
@gregorgorjanc and @XingerTang, there is good news and bad news.
The details for every generation:
@AprilYUZhang I'm wondering how you measure the accuracy of the recombination probability.
Just 1 - difference, maybe not very accurate.

def cor2(a, b):
    # 1 minus the mean absolute difference between a and b
    # (despite the name, this is not a cosine similarity or correlation).
    cos_theta = 0
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            cos_theta += np.abs(a[i, j] - b[i, j])
    return 1 - cos_theta / (a.shape[0] * a.shape[1])
@AprilYUZhang Could you measure the accuracy of the prediction of the segregation statuses for both methods?
The Viterbi method just uses the Original's segregation probabilities, so I skipped it. I also notice that the segregation probabilities and the recombination rate have opposite results. I think this should be attributed to too much noise in the segregation probabilities.
@AprilYUZhang By calling the segregation status, it definitely loses some information, which could contribute to the decrease in accuracy. Your Viterbi implementation can be easily modified to record the predicted segregation status (just return the list a from the backtracking step).
In fact, Viterbi could not output a new segregation probability, because it is just max-product; I guess it shouldn't be regarded as a new segregation probability.
Isn't the argmax state the most likely state (segregation status) given by the Viterbi?
I don't think we should focus on this, because the most likely state for every locus is just based on the former one. If we look at this number alone, it is meaningless.
It is also based on all the previous loci, because the probabilities are passed along via each multiplication. The reason why I still insist on checking the segregation status is that the accuracy of recombination predictions can sometimes lie: it may detect a recombination, but on the wrong haplotype.
It seems that the Viterbi implementation preserves much more information than the naively converted states; that's good news!
@gregorgorjanc and @XingerTang To fit the precision and recall, I converted the original recombination probabilities to discrete values by logistic regression. (I didn't split train and test datasets, so there is a risk of overfitting.) Accuracy is all the correct results divided by the total. The recall rate can be regarded as how many cherries can be identified correctly, and the precision is how many of the results that are considered cherries are correct; a sketch of these measures follows below.
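For concreteness, a minimal sketch of these measures for binary recombination calls (names are illustrative, not from this thread):

import numpy as np

# Precision and recall for binary recombination calls, where pred and true are
# 0/1 arrays marking a recombination event between adjacent loci.
def precision_recall(pred, true):
    tp = np.sum((pred == 1) & (true == 1))  # true positives
    fp = np.sum((pred == 1) & (true == 0))  # false positives
    fn = np.sum((pred == 0) & (true == 1))  # false negatives
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall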
So we can see that Original+Logistic judges recombination events cautiously, leading to high accuracy and precision, but it ignores a large part of the real recombination events, causing a low recall rate. Naive-Convert has better performance on recall. According to the precision and recall curves, Naive-Convert has the best performance. Which is a good method to estimate the recombination events?
To explore the gap between predicted and true locations, I intersected the results of the various methods with the true values. Because the limitations of these methods may lead to inaccurate judgements of the locations where recombination events occurred, I also checked whether the predicted recombination events lie near the true values. Although the performance of exact prediction is poor, it improves when we extend the region. For example, Naive-Convert predicts only 223 recombination events precisely, while 1056 recombination events fall on true loci or surrounding true loci. Faced with such a long Markov chain, Naive-Convert and Viterbi-Convert have limited ability to accurately predict each point. The comparison of Naive-Convert and Viterbi-Convert also shows this: their exact intersection includes only 15 events, but the rough intersection is larger than each method's own predicted number (because nearby sites are counted repeatedly). Apparently, Naive-Convert and Viterbi-Convert rely on different algorithms but give similar results.
Table 2: The number of recombination events
*"Original (>0)" means there is a recombination event if the recombination probability is greater than 0. "A and B" means the number of recombination events of A and B occurring at the same position. "A and B neighbouring" means the number of recombination events of A and B occurring within a 20-variant window.*
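As an illustrative sketch of the "neighbouring" comparison (assuming event positions are given as locus indices; names are mine, not from the thread):

import numpy as np

# Count predicted recombination events that fall within a 20-variant window
# (+/- 10 loci) of any true event.
def neighbouring_matches(pred_loci, true_loci, half_window=10):
    true_loci = np.asarray(true_loci)
    count = 0
    for p in pred_loci:
        if np.any(np.abs(true_loci - p) <= half_window):
            count += 1
    return count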
@AprilYUZhang Could you clarify some terms in your Table 2? It is unclear to me what True and Original+Logistic mean.
"original+logistic: To fit the precision and recall, I convert the original recombination probability to discrete values by logistic regression. (I didn't split train and test dataset, so there is risk of overfitting.)" |
@AprilYUZhang Logistic regression assumes the recombination probabilities to be independent, so it is not applicable in this way.