Word occurrence is suggested in the wrong order when aligned to dissimilar n-grams #50

Open
da1nerd opened this issue Oct 23, 2019 · 17 comments

da1nerd commented Oct 23, 2019

This is different from the issue fixed in #49.

In this case the source tokens are not similar.

EDIT: see the better screenshot in a comment below.

[Screenshot: Screen Shot 2019-10-21 at 8:49 AM]

da1nerd commented Oct 23, 2019

Out of order scenario

The token occurrences are out of order when suggested for two unrelated source tokens. Because the source tokens are different, this cannot be solved by Alignment Relative Occurrence.

This is tricky because we can't know for certain whether the suggestion is completely invalid or just needs to be used elsewhere. Therefore, we would need to increase/decrease the confidence by some arbitrary value, but if that adjustment is too strong it could have negative propagating effects.

Distance scenario

A large distance exists between the target tokens within the target sentence. Because wordMAP by definition operates under the assumption of contiguous n-grams, we can accurately calculate the n-gram relative token distance.

da1nerd commented Oct 23, 2019

n-gram Relative Token Distance

  • Given a target sentence of token length T that is greater than 0.
  • And given two tokens x and y.
  • And given token positions start at 0.
  • And given tokens cannot occupy the same position.
  • And given we want to determine the relative distance between the tokens x and y.

Sample data:

T = 7
x = 2
y = 4

The distance between the tokens is `abs(x - y) - 1`. We subtract one because two tokens next to each other have a distance of 0.

`d = abs(2 - 4) - 1 = 1`

To normalize the above value we need the maximum distance within the sentence. This is easily calculated by performing the above calculation on the first and last positions, `abs(0 - (T - 1)) - 1`, or simply `T - 2`.

`D = 7 - 2 = 5`

Finally, we are able to calculate the distance ratio `d/D`.

`r = 1 / 5 = 0.2`

Interpretation

  • A score of 0 indicates the tokens are right next to each other.
  • A score of 1 indicates the tokens are on opposite sides of the sentence.
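
To make the calculation concrete, here's a minimal sketch in TypeScript; the function name and signature are illustrative, not part of wordMAP's actual API:

```typescript
/**
 * Normalized distance between two token positions in a sentence.
 * Illustrative only; not wordMAP's actual API.
 *
 * @param x first token position (0-indexed)
 * @param y second token position (0-indexed)
 * @param T sentence length in tokens (must be > 2 for the ratio to be defined)
 * @returns a ratio in [0, 1]: 0 = adjacent tokens, 1 = opposite ends
 */
function relativeTokenDistance(x: number, y: number, T: number): number {
  const d = Math.abs(x - y) - 1; // adjacent tokens have a distance of 0
  const D = T - 2;               // max distance: abs(0 - (T - 1)) - 1
  return d / D;
}

// Sample data from above: T = 7, x = 2, y = 4
console.log(relativeTokenDistance(2, 4, 7)); // 0.2
```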

da1nerd commented Oct 24, 2019

The above algorithm was implemented; however, it didn't solve the problem. Because wordMAP only supports contiguous tokens, the measured distance never varies, so the algorithm is redundant (for now).

We still need to address the out-of-order word occurrences.
Here's a better representation of the problem.

See how the word "God" is not suggested in order of occurrence.

[screenshot]

da1nerd reopened this Oct 24, 2019
da1nerd commented Oct 24, 2019

We could enforce the order of occurrence when we do the final sorting of predictions.
This would basically give order of occurrence a trump card.
It would not, however, affect the overall score of the suggestion (a suggestion is composed of individual alignment predictions), so this shouldn't cause valid suggestions to be lost.

My one concern with this approach: do we want to enforce the order of occurrence in the predictions, rather than finding some way to give it a weighted score, so that we are simply influencing the results instead of hitting them with a hammer?

da1nerd commented Oct 24, 2019

Perhaps we could add a switch that allows turning enforcement of order of occurrence on and off, instead of hardcoding it.

da1nerd commented Oct 28, 2019

After some tinkering, I've determined wordMAP is actually working as expected.
The example problem above occurs with the alignment memory Θεὸς=the God. Because alignment memory automatically gets the highest prediction score, we are forcing the out-of-order use of "God". But if, for example, the memory were simply Θεὸς=God, we would see everything in order.

Alignment memory has a compounding effect, so if we had a lot of alignment memory but the overall weight leaned towards Θεὸς=God, we would get the expected results. If, however, the overall weight leaned towards Θεὸς=the God, we get the "bug" above.

Conclusion

This isn't a bug at all, but the nature of wordMAP: the results are influenced by the inputted alignment memory. The only way to fix this would be to take away the trump card given to alignment memory. Perhaps a user-configurable weight could be introduced to dampen the power of alignment memory and allow the machine predictions to have an effect, as sketched below.
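
For illustration, a minimal sketch of what such a dampening weight might look like; every name here is hypothetical and none of it exists in wordMAP today:

```typescript
// Hypothetical sketch of a user-configurable alignment-memory weight.
// The types and names are invented for illustration; wordMAP's internals differ.

interface ScoredPrediction {
  score: number;       // confidence produced by the prediction algorithms
  fromMemory: boolean; // whether this prediction is backed by alignment memory
}

/**
 * Scale memory-backed scores by a user-configurable weight in [0, 1].
 * memoryWeight = 1 keeps the current "trump card" behavior;
 * lower values let the machine predictions compete.
 */
function dampenMemory(p: ScoredPrediction, memoryWeight: number): number {
  return p.fromMemory ? p.score * memoryWeight : p.score;
}
```

For example, with `memoryWeight = 0.8`, a machine prediction scoring 0.9 would outrank a memory-backed prediction scoring 1.0 (dampened to 0.8).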

da1nerd commented Oct 28, 2019

@PhotoNomad0 ☝️

@PhotoNomad0

@neutrinog - maybe if I posted the old algorithm's suggestions for comparison, it would be more obvious that there is a problem. I don't think there is alignment memory where Θεὸς=the God. The old algorithm is doing a much better job on this verse:

[Screenshot: Screen Shot 2019-10-28 at 9:42 AM]


PhotoNomad0 commented Oct 28, 2019

Maybe there is a case where Θεὸς=the God; I will check the CSV export of alignments. It still seems that it should map to the most common usage.

@PhotoNomad0

OK, maybe this is the issue: I found three cases where ὁ and Θεὸς are combined (among all the alignments made for Θεὸς). So shouldn't wordMAP suggest they be combined? For example:

```json
{
  "topWords": [
    {"word": "ὁ", "strong": "G35880", "lemma": "ὁ", "morph": "Gr,EA,,,,NMS,", "occurrence": 1, "occurrences": 1},
    {"word": "Θεὸς", "strong": "G23160", "lemma": "θεός", "morph": "Gr,N,,,,,NMS,", "occurrence": 1, "occurrences": 1}
  ],
  "bottomWords": [
    {"word": "God", "occurrence": 1, "occurrences": 1, "type": "bottomWord"}
  ]
}
```

@PhotoNomad0

So the alignment memory should be ὁ Θεὸς=the God

@PhotoNomad0

There is also a case where ὁ Θεὸς=God in 12:26 (the current verse).


PhotoNomad0 commented Oct 28, 2019

Summary: @neutrinog found an instance where Θεὸς is aligned to "the God", so it is a valid suggestion, but the old algorithm did better.

da1nerd commented Oct 28, 2019

I think we could deal with the out-of-order occurrences after the predictions are generated, when the engine moves into building the suggestion. At that point we could insert certain rules, like keeping the word occurrences in order.

To illustrate here's some handy ASCII art:

[input]->[generate index]->[run prediction algorithms]->[generate suggestion]

The last step above is where we'd enforce the order of occurrence. Previously I had been trying to do so in the algorithms, which wasn't working.

da1nerd commented Nov 6, 2019

This is the issue I'm running into with the current AlignmentPosition algorithm. This isn't really meant to make sense to anyone but me, but basically, because of how the numbers are distributed, the closest pair of numbers is out of order.

[screenshot]

da1nerd commented Nov 6, 2019

I ended up solving this with what may not be the most elegant solution, but it works for now, and the performance hit isn't noticeable at the moment. After scoring all of the predictions, the engine selectively builds out a suggestion. During this process it monitors word occurrences and discards any suggestion that produces anything out of order.

In most situations this should complete within a reasonable amount of time. However, it's theoretically possible this could add a lot of time to prediction.
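
A simplified sketch of that occurrence-order check; the types and names are illustrative rather than wordMAP's actual internals:

```typescript
// Illustrative sketch of discarding out-of-order occurrences while
// building a suggestion. Names are invented; wordMAP's internals differ.

interface AlignmentPrediction {
  targetWord: string;       // the target-language word
  targetOccurrence: number; // 1-based occurrence of that word in the sentence
  targetPosition: number;   // token position of the word in the sentence
}

/**
 * True if, reading the sentence left to right, each repeated target
 * word's occurrences appear in increasing order (1st, then 2nd, ...).
 */
function occurrencesInOrder(predictions: AlignmentPrediction[]): boolean {
  const byPosition = [...predictions].sort(
    (a, b) => a.targetPosition - b.targetPosition
  );
  const lastSeen = new Map<string, number>();
  for (const p of byPosition) {
    const prev = lastSeen.get(p.targetWord) ?? 0;
    if (p.targetOccurrence <= prev) {
      return false; // out of order: this suggestion gets discarded
    }
    lastSeen.set(p.targetWord, p.targetOccurrence);
  }
  return true;
}
```

During suggestion building, any candidate for which this check fails would be discarded, so the extra cost is bounded by how many candidate suggestions the engine tries.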

da1nerd commented Mar 12, 2020

Word occurrence order has received a lot of attention in unfoldingWord/translationCore#6237. This issue is redundant/irrelevant now.

da1nerd added the duplicate label Mar 12, 2020