-
-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Word Alignment engine: #506
base: main
Are you sure you want to change the base?
Conversation
b547740
to
f72b568
Compare
82c30fe
to
da20548
Compare
41d48d1
to
9b268f2
Compare
@ddaspit - this is finally ready for review. |
c760b43
to
70810a7
Compare
70810a7
to
a6cc484
Compare
8e0d367
to
5630625
Compare
c918168
to
0b9c983
Compare
cc5a66e
to
07f6ca4
Compare
I'll need to update the branch and otherwise might have some small TODOs floating around, but I think this PR is generally ready for review, @johnml1135 @ddaspit 🥳 |
|
One general question - I called the word alignment engine "Statistical" instead of "Nmt". I couldn't come up with a better name. |
Rebasing will be needed - there are a number of changes that have happened that will need to be merged in. |
This should be "8.0" |
I though this was already working. Was something wrong with it? |
Odd - I don't remember needing to redo the Echo translation engine much. Was this to handle the ParallelCorpus changes? |
Do we know why the translation platform service was not enabled before? Was it split out? |
Just trying to remember - the "engine commit service" is all about training pairs and committing the new engine (something done in SMT but not WA). Thot is needed for both, so it is specified separately. |
It's a configuration limitation and duplication (having SMT and WA share the same drive), but good enough for right now. I see no compelling reason to change it at this time. |
So, 3 options: Currently, I am for A. |
I'll ping you when the rebase changes look good. Still plugging away. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 178 of 224 files reviewed, 23 unresolved discussions (waiting on @ddaspit and @johnml1135)
src/Serval/src/Serval.Translation/Services/EngineService.cs
line 982 at r4 (raw file):
Previously, johnml1135 (John Lambert) wrote…
#610 - new issue created.
Is this case covered in TranslationEnginesController
, line 1484:
private static ParallelCorpusFilter Map(ParallelCorpusFilterConfigDto source)
{
if (source.TextIds != null && source.ScriptureRange != null)
{
throw new InvalidOperationException(
$"The parallel corpus filter for corpus {source.CorpusId} is not valid: At most, one of TextIds and ScriptureRange can be set."
);
}
...
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 161 of 225 files reviewed, 24 unresolved discussions (waiting on @ddaspit and @johnml1135)
Previously, johnml1135 (John Lambert) wrote…
Is it ready now?
Yes, still working on responding to comments here and on the Machine PR, but some things have already been addressed and the rebase has been done. @ddaspit @johnml1135
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 158 of 227 files reviewed, 14 unresolved discussions (waiting on @ddaspit and @johnml1135)
Previously, johnml1135 (John Lambert) wrote…
When you get to it, can you clean up the commit message and rebase?
👍
docker-compose.yml
line 26 at r4 (raw file):
Previously, johnml1135 (John Lambert) wrote…
More words... My instinct is to let @ddaspit pass the deciding vote.
OK, I'm happy either way.
src/Machine/src/Serval.Machine.Shared/Services/IWordAlignmentModelFactory.cs
line 1 at r4 (raw file):
Previously, johnml1135 (John Lambert) wrote…
Why not have this (and ISmtModelFactory) both inherit from IModelFactory?
Done. I don't know - not sure if I did that or you 😆
src/Machine/src/Serval.Machine.Shared/Services/ServalWordAlignmentPlatformOutboxMessageHandler.cs
line 111 at r4 (raw file):
Previously, johnml1135 (John Lambert) wrote…
Check this PR out - #592. In it it talks about using the protobuf serialization add-on. That may resolve issues here and remove the need for this internal class.
I think it's the AlignedWordPair string parsing that's key, and I don't think the protobuf serialization add-on would cover that / could be adapted to cover that - can it?
src/Machine/src/Serval.Machine.Shared/Services/WordAlignmentEngineStateService.cs
line 46 at r4 (raw file):
Previously, johnml1135 (John Lambert) wrote…
Wait - "commit" will never write to disk... Is this the right way to handle this?
I think this has been fixed. Lmk.
src/Serval/src/Serval.WordAlignment/Configuration/IMediatorRegistrationConfiguratorExtensions.cs
line 9 at r4 (raw file):
Previously, johnml1135 (John Lambert) wrote…
There have been some data file and corpus consumer updates to main. We need to make sure that they get properly propagated to work alignment when merging in the changes.
I believe these have been carried over. I went back through the PRs since December and tried to carry over any changes. Let me know if it looks like I missed something.
src/Serval/src/Serval.WordAlignment/Configuration/IMongoDataAccessConfiguratorExtensions.cs
line 32 at r4 (raw file):
Previously, johnml1135 (John Lambert) wrote…
Note that when merging in main, executionData will need to be added to word alignment as well.
Done.
src/Serval/src/Serval.WordAlignment/Contracts/WordAlignmentBuildDto.cs
line 30 at r4 (raw file):
Previously, johnml1135 (John Lambert) wrote…
This will need executionData.
Done.
src/Serval/src/Serval.WordAlignment/Models/Build.cs
line 20 at r4 (raw file):
Previously, johnml1135 (John Lambert) wrote…
Needs ExecutionData.
Done.
src/Serval/src/Serval.WordAlignment/Services/BuildService.cs
line 16 at r4 (raw file):
Previously, johnml1135 (John Lambert) wrote…
The Translation "BuildService" has GetAsync - checking whether it is initialized or not.
Done.
src/Serval/src/Serval.WordAlignment/Services/WordAlignmentPlatformServiceV1.cs
line 6 at r4 (raw file):
Previously, johnml1135 (John Lambert) wrote…
ExecutionData will need to be passed up in this function.
Done. I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ddaspit, @johnml1135, I believe I've responded to all comments and that this PR is ready for re-review.
Reviewable status: 154 of 227 files reviewed, 14 unresolved discussions (waiting on @ddaspit and @johnml1135)
Previously, Enkidu93 (Eli C. Lowry) wrote…
Can you squash most of these commits? They all appear doubled - and a simple 2 commit "John gave me this and then I made these changes" is often helpful. |
Previously, Enkidu93 (Eli C. Lowry) wrote…
Without a vote from Damien, I am assuming to stay with Statistical. |
Previously, Enkidu93 (Eli C. Lowry) wrote…
This has not been updated. |
Previously, Enkidu93 (Eli C. Lowry) wrote…
This has still not been resolved, if I am correct. |
Hanging TODO... |
7d492b7
to
b04ee9b
Compare
Previously, johnml1135 (John Lambert) wrote…
Never mind. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 135 of 227 files reviewed, 15 unresolved discussions (waiting on @ddaspit and @johnml1135)
Previously, johnml1135 (John Lambert) wrote…
Can you squash most of these commits? They all appear doubled - and a simple 2 commit "John gave me this and then I made these changes" is often helpful.
Sorry the commit messages are still messy I'm realizing - I can fix that if you'd like. It is squashed though.
.github/workflows/ci.yml
line 22 at r4 (raw file):
Previously, johnml1135 (John Lambert) wrote…
This has not been updated.
Oops, sorry - done!
src/Echo/src/EchoEngine/TranslationEngineServiceV1.cs
line 81 at r4 (raw file):
Previously, johnml1135 (John Lambert) wrote…
Never mind.
No, I think you're right. This should be using the ParallelCorpusPreprocessingService
. I just lost track of this thread since I had responded - sorry!
Add cleanup service tests for word alignment Fix flaky test Fix naming Fix more names Passing statistical engine service tests - first pass Add hangfire implementation and tests Refactor alignment data structure; revert to pretranslate/word-align where appropriate Use parallel data when inferencing for word alignment Fix JSON serialization Rebase-related changes: extend executionData, commit-related issues to WA, small rebase mistakes Get rid of WordAlignmentResult
b04ee9b
to
6da569a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 135 of 227 files reviewed, 15 unresolved discussions (waiting on @ddaspit and @johnml1135)
Previously, Enkidu93 (Eli C. Lowry) wrote…
Sorry the commit messages are still messy I'm realizing - I can fix that if you'd like. It is squashed though.
Cleaned it up a bit.
Previously, Enkidu93 (Eli C. Lowry) wrote…
Options: My best guess is to go with option B - convert in machine-engine before it goes over GRPC (probably as the message is enqueued to be sent in the outbox). That might be cleanest - having the"knowledge" of how to parse that type of word alignment string be in the lower statistical level. |
If I am correct, the training of the model (either in ClearML or dotnet) is separate from this. Also, we do not "update" this model. but only the SMT model. Therefore. we have no reason to actually "save" the model at any time, as it is never updated unless it is retrained with a new build. |
Previously, Enkidu93 (Eli C. Lowry) wrote…
It looks like everything is good. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 8 of 27 files at r6, 1 of 48 files at r7, 1 of 23 files at r9, 59 of 59 files at r14, 24 of 24 files at r15, all commit messages.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @Enkidu93)
Previously, johnml1135 (John Lambert) wrote…
Thank you. I'll remove it and create an issue. |
Previously, Enkidu93 (Eli C. Lowry) wrote…
Eli - I believe that you have completed this - is that correct? |
Previously, johnml1135 (John Lambert) wrote…
What have you decided with this? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 225 of 227 files reviewed, 3 unresolved discussions (waiting on @johnml1135)
src/Machine/src/Serval.Machine.Shared/Services/PreprocessBuildJob.cs
line 1 at r4 (raw file):
Previously, johnml1135 (John Lambert) wrote…
Eli - I believe that you have completed this - is that correct?
I believe so, yes.
src/Machine/src/Serval.Machine.Shared/Services/ServalWordAlignmentPlatformOutboxMessageHandler.cs
line 111 at r4 (raw file):
Previously, johnml1135 (John Lambert) wrote…
What have you decided with this?
Sorry - I had drafted a message and apparently it hadn't sent. I originally agreed that B would be cleaner, but now I'm thinking that it'd be better if possible just to store the word pairs in a more JSONy way coming out of the machine.py job itself. How does that strike you?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've fixed the out-of-date parallel corpus preprocessing in the EchoEngine. I think the last outstanding issue is the parsing of the alignments. Options are: 1. Change the output of the machine.py job to be a proper JSONification of the aligned word pairs (rather than the string representation). 2. Leave as is (convert them at serialization in Serval). 3. Parse the strings at the MachinEngine-level while passing on the inference stream.
Reviewable status: 195 of 227 files reviewed, 2 unresolved discussions (waiting on @johnml1135)
src/Machine/src/Serval.Machine.Shared/Services/WordAlignmentEngineState.cs
line 68 at r15 (raw file):
Previously, johnml1135 (John Lambert) wrote…
If I am correct, the training of the model (either in ClearML or dotnet) is separate from this. Also, we do not "update" this model. but only the SMT model. Therefore. we have no reason to actually "save" the model at any time, as it is never updated unless it is retrained with a new build.
Done.
Previously, Enkidu93 (Eli C. Lowry) wrote…
Hmm - as long as the old format is still able to be created - non breaking API. The "alignment format" that is used I believe is a "standard" of sorts and I would expect people to be able to get that format out If a "to_json" routine were added (or similar) I would be ok with it. I generally like things to be super-compact and Json would be less compact than the existing format. My biggest concern would be the readability of the Json format - where something like
It would expand the total amount of data 5-10 fold and take 50-100 times as many lines. If there is a better way around this issue, I am all ears. Potentially it could be something like:
That would allow the "extra text" to be less, and keep the extra lines under control while maintaining visual flow and needing no custom parsing. The scores could be floats (as above) or some int16 scaled so that 65,535 = 1.0, or some such thing. |
See comment below. |
This change is