word occurrence is suggested in the wrong order when aligned to dissimilar ngrams. #50
This is different from the issue fixed here: #49. In this case the source tokens are not similar.

EDIT: see better screenshot here.
Comments
Out of order scenario

The token occurrences are out of order when suggested for two unrelated source tokens. Because the source tokens are different, this cannot be solved by Alignment Relative Occurrence. This is tricky because we can't know for certain whether the suggestion is completely invalid or just needs to be used elsewhere. Therefore we need to increase/decrease the confidence by some arbitrary value, but if that value is too strong it could have negative propagating effects.

Distance scenario

A large distance exists between the target tokens within the target sentence. Because wordMAP by definition operates under the assumption of contiguous n-grams, we can accurately calculate the n-gram relative token distance.
n-gram Relative Token Distance

Sample data: say the two suggested target tokens sit at positions p1 and p2 in a sentence of n tokens.

The distance between the tokens is d = |p2 - p1|.

To normalize the above value we need the maximum distance within the sentence. This is easily calculated by performing the above calculation on the first and last positions: dmax = |(n - 1) - 0| = n - 1.

Finally, we are able to calculate the distance ratio: r = d / dmax.

Interpretation: a ratio near 0 means the tokens sit close together (consistent with a contiguous n-gram), while a ratio near 1 means they span the entire sentence, which makes the suggested n-gram less plausible.
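A minimal sketch of that calculation (assuming zero-based positions; the function and parameter names are illustrative, not wordMAP's actual API):

```typescript
// Distance between two target-token positions within the sentence.
function tokenDistance(pos1: number, pos2: number): number {
  return Math.abs(pos2 - pos1);
}

// Normalize against the maximum possible distance in the sentence:
// the distance between the first (0) and last (length - 1) positions.
function distanceRatio(pos1: number, pos2: number, sentenceLength: number): number {
  const maxDistance = tokenDistance(0, sentenceLength - 1);
  if (maxDistance === 0) {
    return 0; // a one-token sentence has no distance to measure
  }
  return tokenDistance(pos1, pos2) / maxDistance;
}

// e.g. in a 10-token sentence, tokens at positions 2 and 8:
// distanceRatio(2, 8, 10) === 6 / 9 ≈ 0.67 (far apart, less plausible)
```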
We could enforce the order of occurrence when we do the final sorting of predictions. My one concern with this approach: do we want to enforce the order of occurrence in the predictions, or should we find some way to give it a weighted score, so that we are simply influencing the results instead of hitting them with a hammer?
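For instance, the order check could feed a soft penalty into the score at sort time rather than acting as a hard filter (a sketch; the penalty factor is arbitrary and the names are mine):

```typescript
// Soft penalty: out-of-order predictions sink in the final ranking
// but remain available if nothing better exists.
function weightedScore(
  confidence: number,
  isOutOfOrder: boolean,
  penalty: number = 0.5
): number {
  return isOutOfOrder ? confidence * penalty : confidence;
}

// predictions.sort((a, b) =>
//   weightedScore(b.confidence, b.outOfOrder) -
//   weightedScore(a.confidence, a.outOfOrder));
```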
Perhaps we could add a switch that allows turning on order-of-occurrence enforcement instead of hardcoding it.
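Something like this (a hypothetical option; wordMAP doesn't currently expose such a flag):

```typescript
// Hypothetical engine configuration sketch.
interface EngineOptions {
  /**
   * When true, suggestions with out-of-order word occurrences are
   * discarded outright; when false, order only influences scoring.
   */
  enforceOccurrenceOrder?: boolean;
}

const defaultOptions: EngineOptions = { enforceOccurrenceOrder: false };
```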
After some tinkering, I've determined WordMAP is actually working as expected. Alignment memory has a compounding effect, so if we have a lot of alignment memory the overall weight gets bent towards it and the machine predictions can't compete.

Conclusion

This isn't a bug at all, but the nature of WordMAP; the results are influenced by the inputted alignment memory. The only way to fix this would be to take away the trump card given to alignment memory. Perhaps a user-configurable weight could be introduced to dampen the power of alignment memory and allow the machine predictions to have an effect.
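Such a weight might blend the two confidence sources instead of letting the memory trump outright (a sketch; `memoryWeight` and both confidence inputs are illustrative, not existing wordMAP parameters):

```typescript
// Blend alignment-memory confidence with the machine (statistical)
// confidence. memoryWeight is in [0, 1]: 1 keeps the current
// trump-card behavior; lower values let machine predictions push back.
function blendConfidence(
  memoryConfidence: number,
  machineConfidence: number,
  memoryWeight: number
): number {
  return memoryWeight * memoryConfidence + (1 - memoryWeight) * machineConfidence;
}

// e.g. blendConfidence(1.0, 0.4, 0.75) === 0.85
```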
@PhotoNomad0 ☝️
Maybe there is a case where: [screenshot]
OK, maybe this is the issue - I found three cases where ὁ and Θεὸς are combined (all the alignments made for Θεὸς). So shouldn't wordMap suggest they be combined? [screenshot]
So the alignment memory should be: [screenshot]
There is also a case where: [screenshot]
Summary: @neutrinog found an instance where: [screenshot]
I think we could deal with the out of order occurrences after the predictions are generated, when the engine moves into building the suggestion. At that point we could insert certain rules, like keeping the word occurrences in order. To illustrate, here's some handy ASCII art (the stage names are approximate):
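```
tokenize source & target
          |
          v
  generate predictions
          |
          v
   score predictions
          |
          v
   build suggestion   <-- enforce order of occurrence here
```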
The last step above is where we'd enforce the order of occurrence. Previously I had been trying to do so in the algorithms, which wasn't working.
I ended up solving this with what may not be the most elegant solution, but it works for now and the performance hit isn't noticeable at the moment. After scoring all of the predictions, the engine selectively builds out a suggestion. During this process it monitors word occurrences and discards any suggestions that produce anything out of order. In most situations this should complete within a reasonable amount of time; however, it's theoretically possible this could add a lot of time to prediction.
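In sketch form, the order check during suggestion building might look like this (simplified types and names; not the actual wordMAP source):

```typescript
// Simplified stand-in for a prediction: a source word, which occurrence
// of that word it is, and where it landed in the target sentence.
interface Prediction {
  sourceWord: string;
  sourceOccurrence: number; // 1 for the first occurrence, 2 for the second...
  targetPosition: number;
}

// True if, for every repeated source word, earlier occurrences land at
// earlier target positions than later occurrences.
function occurrencesInOrder(suggestion: Prediction[]): boolean {
  const lastSeen = new Map<string, number>();
  const byPosition = [...suggestion].sort(
    (a, b) => a.targetPosition - b.targetPosition
  );
  for (const p of byPosition) {
    const prevOccurrence = lastSeen.get(p.sourceWord);
    if (prevOccurrence !== undefined && p.sourceOccurrence < prevOccurrence) {
      return false; // a later occurrence was placed before an earlier one
    }
    lastSeen.set(p.sourceWord, p.sourceOccurrence);
  }
  return true;
}
```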
Word occurrence order has received a lot of attention in unfoldingWord/translationCore#6237. This issue is redundant/irrelevant now.