word occurrence is suggested in the wrong order when aligned to dissimilar ngrams. #50
This is different from the issue fixed here: #49. In this case the source tokens are not similar.

EDIT: see better screenshot here.
Comments
Out of order scenario

The token occurrences are out of order when suggested for two unrelated source tokens. Because the source tokens are different, this cannot be solved by Alignment Relative Occurrence. This is tricky because we can't know for certain whether the suggestion is completely invalid or just needs to be used elsewhere. Therefore we need to increase/decrease the confidence by some arbitrary value, but if that value is too strong it could have negative propagating effects.

Distance scenario

A large distance exists between the target tokens within the target sentence. Because wordMAP by definition operates under the assumption of contiguous n-grams, we can accurately calculate the n-gram relative token distance.
n-gram Relative Token Distance

Sample data: say the two suggested target tokens sit at positions p1 and p2 in a sentence of n tokens.

The distance between the tokens is d = |p2 - p1|.

To normalize the above value we need the maximum distance within the sentence. This is easily calculated by performing the above calculation on the first and last positions: dmax = |(n - 1) - 0| = n - 1.

Finally, we are able to calculate the distance ratio: r = d / dmax.

Interpretation: a ratio near 0 means the tokens sit close together (consistent with a contiguous n-gram), while a ratio near 1 means they span the entire sentence, which makes the suggested n-gram less plausible.
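A minimal sketch of that calculation (assuming zero-based positions; the function and parameter names are illustrative, not wordMAP's actual API):

```typescript
// Distance between two target-token positions within the sentence.
function tokenDistance(pos1: number, pos2: number): number {
  return Math.abs(pos2 - pos1);
}

// Normalize against the maximum possible distance in the sentence:
// the distance between the first (0) and last (length - 1) positions.
function distanceRatio(pos1: number, pos2: number, sentenceLength: number): number {
  const maxDistance = tokenDistance(0, sentenceLength - 1);
  if (maxDistance === 0) {
    return 0; // a one-token sentence has no distance to measure
  }
  return tokenDistance(pos1, pos2) / maxDistance;
}

// e.g. in a 10-token sentence, tokens at positions 2 and 8:
// distanceRatio(2, 8, 10) === 6 / 9 ≈ 0.67 (far apart, less plausible)
```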
We could enforce the order of occurrence when we do the final sorting of predictions. My one concern with this approach: do we want to enforce the order of occurrence in the predictions, or should we find some way to give it a weighted score, so that we are simply influencing the results instead of hitting them with a hammer?
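For instance, the order check could feed a soft penalty into the score at sort time rather than acting as a hard filter (a sketch; the penalty factor is arbitrary and the names are mine):

```typescript
// Soft penalty: out-of-order predictions sink in the final ranking
// but remain available if nothing better exists.
function weightedScore(
  confidence: number,
  isOutOfOrder: boolean,
  penalty: number = 0.5
): number {
  return isOutOfOrder ? confidence * penalty : confidence;
}

// predictions.sort((a, b) =>
//   weightedScore(b.confidence, b.outOfOrder) -
//   weightedScore(a.confidence, a.outOfOrder));
```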
Perhaps we could add a switch that allows turning on order-of-occurrence enforcement instead of hardcoding it.
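Something like this (a hypothetical option; wordMAP doesn't currently expose such a flag):

```typescript
// Hypothetical engine configuration sketch.
interface EngineOptions {
  /**
   * When true, suggestions with out-of-order word occurrences are
   * discarded outright; when false, order only influences scoring.
   */
  enforceOccurrenceOrder?: boolean;
}

const defaultOptions: EngineOptions = { enforceOccurrenceOrder: false };
```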
After some tinkering, I've determined WordMAP is actually working as expected. Alignment memory has a compounding effect, so if we have a lot of alignment memory the overall weight gets bent towards it and the machine predictions can't compete.

Conclusion

This isn't a bug at all, but the nature of WordMAP; the results are influenced by the inputted alignment memory. The only way to fix this would be to take away the trump card given to alignment memory. Perhaps a user-configurable weight could be introduced to dampen the power of alignment memory and allow the machine predictions to have an effect.
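Such a weight might blend the two confidence sources instead of letting the memory trump outright (a sketch; `memoryWeight` and both confidence inputs are illustrative, not existing wordMAP parameters):

```typescript
// Blend alignment-memory confidence with the machine (statistical)
// confidence. memoryWeight is in [0, 1]: 1 keeps the current
// trump-card behavior; lower values let machine predictions push back.
function blendConfidence(
  memoryConfidence: number,
  machineConfidence: number,
  memoryWeight: number
): number {
  return memoryWeight * memoryConfidence + (1 - memoryWeight) * machineConfidence;
}

// e.g. blendConfidence(1.0, 0.4, 0.75) === 0.85
```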
@PhotoNomad0 ☝️
Maybe there is a case where: [screenshot]
OK, maybe this is the issue - I found three cases where ὁ and Θεὸς are combined (all the alignments made for Θεὸς). So shouldn't wordMap suggest they be combined? [screenshot]
So the alignment memory should be: [screenshot]
There is also a case where: [screenshot]
Summary: @neutrinog found an instance where: [screenshot]
I think we could deal with the out of order occurrences after the predictions are generated, when the engine moves into building the suggestion. At that point we could insert certain rules, like keeping the word occurrences in order. To illustrate, here's some handy ASCII art (the stage names are approximate):
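```
tokenize source & target
          |
          v
  generate predictions
          |
          v
   score predictions
          |
          v
   build suggestion   <-- enforce order of occurrence here
```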
The last step above is where we'd enforce the order of occurrence. Previously I had been trying to do so in the algorithms, which wasn't working.
I ended up solving this with what may not be the most elegant solution, but it works for now and the performance hit isn't noticeable at the moment. After scoring all of the predictions, the engine selectively builds out a suggestion. During this process it monitors word occurrences and discards any suggestions that produce anything out of order. In most situations this should complete within a reasonable amount of time; however, it's theoretically possible this could add a lot of time to prediction.
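In sketch form, the order check during suggestion building might look like this (simplified types and names; not the actual wordMAP source):

```typescript
// Simplified stand-in for a prediction: a source word, which occurrence
// of that word it is, and where it landed in the target sentence.
interface Prediction {
  sourceWord: string;
  sourceOccurrence: number; // 1 for the first occurrence, 2 for the second...
  targetPosition: number;
}

// True if, for every repeated source word, earlier occurrences land at
// earlier target positions than later occurrences.
function occurrencesInOrder(suggestion: Prediction[]): boolean {
  const lastSeen = new Map<string, number>();
  const byPosition = [...suggestion].sort(
    (a, b) => a.targetPosition - b.targetPosition
  );
  for (const p of byPosition) {
    const prevOccurrence = lastSeen.get(p.sourceWord);
    if (prevOccurrence !== undefined && p.sourceOccurrence < prevOccurrence) {
      return false; // a later occurrence was placed before an earlier one
    }
    lastSeen.set(p.sourceWord, p.sourceOccurrence);
  }
  return true;
}
```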
Word occurrence order has received a lot of attention in unfoldingWord/translationCore#6237. This issue is redundant/irrelevant now.