Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Word Alignment engine: #506

Open
wants to merge 4 commits into
base: main
Choose a base branch
from
Open

Add Word Alignment engine: #506

wants to merge 4 commits into from

Conversation

johnml1135
Copy link
Collaborator

@johnml1135 johnml1135 commented Oct 9, 2024

  • Add HTTP API
  • Add Grpc
  • Add Echo
  • Add Serval services
  • Add machine level implementation
  • Add tests

This change is Reviewable

@johnml1135 johnml1135 requested a review from ddaspit October 9, 2024 16:38
@codecov-commenter
Copy link

codecov-commenter commented Oct 9, 2024

Codecov Report

Attention: Patch coverage is 65.20942% with 1138 lines in your changes missing coverage. Please review.

Project coverage is 64.53%. Comparing base (fd86898) to head (a7c0e7c).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
....Shared/Configuration/IMachineBuilderExtensions.cs 0.00% 143 Missing ⚠️
...ServalWordAlignmentPlatformOutboxMessageHandler.cs 0.00% 138 Missing ⚠️
...red/Services/ServalWordAlignmentEngineServiceV1.cs 0.00% 133 Missing ⚠️
...red/Services/ServalWordAlignmentPlatformService.cs 0.00% 102 Missing ⚠️
...src/Serval.WordAlignment/Services/EngineService.cs 88.06% 51 Missing and 7 partials ⚠️
...Serval.Machine.Shared/Services/ModelFactoryBase.cs 0.00% 43 Missing ⚠️
...hine.Shared/Services/TranslationBuildJobService.cs 0.00% 42 Missing ⚠️
...l.Machine.Shared/Services/ClearMLMonitorService.cs 10.86% 41 Missing ⚠️
...achine.Shared/Services/StatisticalTrainBuildJob.cs 72.86% 31 Missing and 4 partials ⚠️
...ignment/Services/WordAlignmentPlatformServiceV1.cs 88.88% 21 Missing and 13 partials ⚠️
... and 58 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #506      +/-   ##
==========================================
+ Coverage   64.50%   64.53%   +0.03%     
==========================================
  Files         280      342      +62     
  Lines       14113    18513    +4400     
  Branches     1824     2397     +573     
==========================================
+ Hits         9103    11947    +2844     
- Misses       4349     5696    +1347     
- Partials      661      870     +209     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@johnml1135 johnml1135 force-pushed the word_alignment_try_2 branch 3 times, most recently from b547740 to f72b568 Compare October 14, 2024 16:58
@johnml1135 johnml1135 force-pushed the word_alignment_try_2 branch from 82c30fe to da20548 Compare October 24, 2024 19:38
@johnml1135 johnml1135 force-pushed the word_alignment_try_2 branch 4 times, most recently from 41d48d1 to 9b268f2 Compare November 6, 2024 16:00
@johnml1135
Copy link
Collaborator Author

@ddaspit - this is finally ready for review.

@johnml1135 johnml1135 force-pushed the word_alignment_try_2 branch 2 times, most recently from c760b43 to 70810a7 Compare November 12, 2024 20:03
@johnml1135 johnml1135 force-pushed the word_alignment_try_2 branch 2 times, most recently from c918168 to 0b9c983 Compare December 9, 2024 17:37
@johnml1135 johnml1135 requested a review from Enkidu93 December 16, 2024 12:30
@Enkidu93
Copy link
Collaborator

I'll need to update the branch and otherwise might have some small TODOs floating around, but I think this PR is generally ready for review, @johnml1135 @ddaspit 🥳

@johnml1135
Copy link
Collaborator Author

-- commits line 47 at r4:
When you get to it, can you clean up the commit message and rebase?

@johnml1135
Copy link
Collaborator Author

docker-compose.yml line 26 at r4 (raw file):

      - ASPNETCORE_WordAlignment__Engines__0__Type=EchoWordAlignment
      - ASPNETCORE_WordAlignment__Engines__0__Address=http://echo
      - ASPNETCORE_WordAlignment__Engines__1__Type=Statistical

One general question - I called the word alignment engine "Statistical" instead of "Nmt". I couldn't come up with a better name.

@johnml1135
Copy link
Collaborator Author

Rebasing will be needed - there are a number of changes that have happened that will need to be merged in.

@johnml1135
Copy link
Collaborator Author

.github/workflows/ci.yml line 22 at r4 (raw file):

        uses: supercharge/[email protected]
        with:
          mongodb-version: "6.0"

This should be "8.0"

@johnml1135
Copy link
Collaborator Author

.github/workflows/ci.yml line 30 at r4 (raw file):

        continue-on-error: true
        if: ${{ github.ref_name }} != "main"
        run: cd .. && git clone https://github.com/sillsdev/machine.git --branch ${{ github.ref_name }} && dotnet build machine

I though this was already working. Was something wrong with it?

@johnml1135
Copy link
Collaborator Author

src/Echo/src/EchoEngine/TranslationEngineServiceV1.cs line 81 at r4 (raw file):

                {
                    using (
                        AsyncClientStreamingCall<InsertPretranslationsRequest, Empty> call =

Odd - I don't remember needing to redo the Echo translation engine much. Was this to handle the ParallelCorpus changes?

@johnml1135
Copy link
Collaborator Author

src/Machine/src/Serval.Machine.EngineServer/Program.cs line 14 at r4 (raw file):

    .AddServalTranslationEngineService()
    .AddServalWordAlignmentEngineService()
    .AddServalTranslationPlatformService()

Do we know why the translation platform service was not enabled before? Was it split out?

@johnml1135
Copy link
Collaborator Author

src/Machine/src/Serval.Machine.Shared/Configuration/IMachineBuilderExtensions.cs line 203 at r4 (raw file):

                case EngineType.SmtTransfer:
                    builder.Services.AddSingleton<SmtTransferEngineStateService>();
                    builder.Services.AddHostedService<SmtTransferEngineCommitService>();

Just trying to remember - the "engine commit service" is all about training pairs and committing the new engine (something done in SMT but not WA). Thot is needed for both, so it is specified separately.

@johnml1135
Copy link
Collaborator Author

src/Machine/src/Serval.Machine.Shared/Configuration/IMachineBuilderExtensions.cs line 534 at r4 (raw file):

            throw new InvalidOperationException("SMT Engine and Statistical directory is required");
        if (smtDriveLetter != statisticsDriveLetter)
            throw new InvalidOperationException("SMT Engine and Statistical directory must be on the same drive");

It's a configuration limitation and duplication (having SMT and WA share the same drive), but good enough for right now. I see no compelling reason to change it at this time.

@johnml1135
Copy link
Collaborator Author

src/Machine/src/Serval.Machine.Shared/Models/WordAlignment.cs line 10 at r4 (raw file):

    public required IReadOnlyList<string> SourceTokens { get; set; }
    public required IReadOnlyList<string> TargetTokens { get; set; }
    public required IReadOnlyList<double> Confidences { get; set; } //TODO It seems to me that it'd more natural to have the confidence as part of the word pair object - but I understand that this is currently not the case with the translation result; would it be breaking to change it there too?

So, 3 options:
A. Leave as is
B. Switch both to put confidences in "AlignedWordPair"
C. Make a new structure that just has the pairs, and not the confidences?

Currently, I am for A.

@Enkidu93
Copy link
Collaborator

I'll ping you when the rebase changes look good. Still plugging away.

Copy link
Collaborator

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 178 of 224 files reviewed, 23 unresolved discussions (waiting on @ddaspit and @johnml1135)


src/Serval/src/Serval.Translation/Services/EngineService.cs line 982 at r4 (raw file):

Previously, johnml1135 (John Lambert) wrote…

#610 - new issue created.

Is this case covered in TranslationEnginesController, line 1484:

    private static ParallelCorpusFilter Map(ParallelCorpusFilterConfigDto source)
    {
        if (source.TextIds != null && source.ScriptureRange != null)
        {
            throw new InvalidOperationException(
                $"The parallel corpus filter for corpus {source.CorpusId} is not valid: At most, one of TextIds and ScriptureRange can be set."
            );
        }
...

@johnml1135
Copy link
Collaborator Author

-- commits line 76 at r8:
Is it ready now?

Copy link
Collaborator

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 161 of 225 files reviewed, 24 unresolved discussions (waiting on @ddaspit and @johnml1135)


-- commits line 76 at r8:

Previously, johnml1135 (John Lambert) wrote…

Is it ready now?

Yes, still working on responding to comments here and on the Machine PR, but some things have already been addressed and the rebase has been done. @ddaspit @johnml1135

Copy link
Collaborator

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 158 of 227 files reviewed, 14 unresolved discussions (waiting on @ddaspit and @johnml1135)


-- commits line 47 at r4:

Previously, johnml1135 (John Lambert) wrote…

When you get to it, can you clean up the commit message and rebase?

👍


docker-compose.yml line 26 at r4 (raw file):

Previously, johnml1135 (John Lambert) wrote…

More words... My instinct is to let @ddaspit pass the deciding vote.

OK, I'm happy either way.


src/Machine/src/Serval.Machine.Shared/Services/IWordAlignmentModelFactory.cs line 1 at r4 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Why not have this (and ISmtModelFactory) both inherit from IModelFactory?

Done. I don't know - not sure if I did that or you 😆


src/Machine/src/Serval.Machine.Shared/Services/ServalWordAlignmentPlatformOutboxMessageHandler.cs line 111 at r4 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Check this PR out - #592. In it it talks about using the protobuf serialization add-on. That may resolve issues here and remove the need for this internal class.

I think it's the AlignedWordPair string parsing that's key, and I don't think the protobuf serialization add-on would cover that / could be adapted to cover that - can it?


src/Machine/src/Serval.Machine.Shared/Services/WordAlignmentEngineStateService.cs line 46 at r4 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Wait - "commit" will never write to disk... Is this the right way to handle this?

I think this has been fixed. Lmk.


src/Serval/src/Serval.WordAlignment/Configuration/IMediatorRegistrationConfiguratorExtensions.cs line 9 at r4 (raw file):

Previously, johnml1135 (John Lambert) wrote…

There have been some data file and corpus consumer updates to main. We need to make sure that they get properly propagated to work alignment when merging in the changes.

I believe these have been carried over. I went back through the PRs since December and tried to carry over any changes. Let me know if it looks like I missed something.


src/Serval/src/Serval.WordAlignment/Configuration/IMongoDataAccessConfiguratorExtensions.cs line 32 at r4 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Note that when merging in main, executionData will need to be added to word alignment as well.

Done.


src/Serval/src/Serval.WordAlignment/Contracts/WordAlignmentBuildDto.cs line 30 at r4 (raw file):

Previously, johnml1135 (John Lambert) wrote…

This will need executionData.

Done.


src/Serval/src/Serval.WordAlignment/Models/Build.cs line 20 at r4 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Needs ExecutionData.

Done.


src/Serval/src/Serval.WordAlignment/Services/BuildService.cs line 16 at r4 (raw file):

Previously, johnml1135 (John Lambert) wrote…

The Translation "BuildService" has GetAsync - checking whether it is initialized or not.

Done.


src/Serval/src/Serval.WordAlignment/Services/WordAlignmentPlatformServiceV1.cs line 6 at r4 (raw file):

Previously, johnml1135 (John Lambert) wrote…

ExecutionData will need to be passed up in this function.

Done. I think.

Copy link
Collaborator

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ddaspit, @johnml1135, I believe I've responded to all comments and that this PR is ready for re-review.

Reviewable status: 154 of 227 files reviewed, 14 unresolved discussions (waiting on @ddaspit and @johnml1135)

@johnml1135
Copy link
Collaborator Author

-- commits line 76 at r8:

Previously, Enkidu93 (Eli C. Lowry) wrote…

Yes, still working on responding to comments here and on the Machine PR, but some things have already been addressed and the rebase has been done. @ddaspit @johnml1135

Can you squash most of these commits? They all appear doubled - and a simple 2 commit "John gave me this and then I made these changes" is often helpful.

@johnml1135
Copy link
Collaborator Author

docker-compose.yml line 26 at r4 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

OK, I'm happy either way.

Without a vote from Damien, I am assuming to stay with Statistical.

@johnml1135
Copy link
Collaborator Author

.github/workflows/ci.yml line 22 at r4 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

OK, will do

This has not been updated.

@johnml1135
Copy link
Collaborator Author

src/Echo/src/EchoEngine/TranslationEngineServiceV1.cs line 81 at r4 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

This looks like a place where a rebase is needed. This is what it looked like before I added the ParallelCorpus changes.

This has still not been resolved, if I am correct.

@johnml1135
Copy link
Collaborator Author

src/Machine/src/Serval.Machine.Shared/Services/PreprocessBuildJob.cs line 108 at r11 (raw file):

    }

    //TODO: Move this method to translation-specific PreprocessBuildJob

Hanging TODO...

@Enkidu93 Enkidu93 force-pushed the word_alignment_try_2 branch from 7d492b7 to b04ee9b Compare February 3, 2025 20:15
@johnml1135
Copy link
Collaborator Author

src/Echo/src/EchoEngine/TranslationEngineServiceV1.cs line 81 at r4 (raw file):

Previously, johnml1135 (John Lambert) wrote…

This has still not been resolved, if I am correct.

Never mind.

Copy link
Collaborator

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 135 of 227 files reviewed, 15 unresolved discussions (waiting on @ddaspit and @johnml1135)


-- commits line 76 at r8:

Previously, johnml1135 (John Lambert) wrote…

Can you squash most of these commits? They all appear doubled - and a simple 2 commit "John gave me this and then I made these changes" is often helpful.

Sorry the commit messages are still messy I'm realizing - I can fix that if you'd like. It is squashed though.


.github/workflows/ci.yml line 22 at r4 (raw file):

Previously, johnml1135 (John Lambert) wrote…

This has not been updated.

Oops, sorry - done!


src/Echo/src/EchoEngine/TranslationEngineServiceV1.cs line 81 at r4 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Never mind.

No, I think you're right. This should be using the ParallelCorpusPreprocessingService. I just lost track of this thread since I had responded - sorry!

Add cleanup service tests for word alignment

Fix flaky test

Fix naming

Fix more names

Passing statistical engine service tests - first pass

Add hangfire implementation and tests

Refactor alignment data structure; revert to pretranslate/word-align where appropriate

Use parallel data when inferencing for word alignment

Fix JSON serialization

Rebase-related changes: extend executionData, commit-related issues to WA, small rebase mistakes

Get rid of WordAlignmentResult
@Enkidu93 Enkidu93 force-pushed the word_alignment_try_2 branch from b04ee9b to 6da569a Compare February 3, 2025 20:25
Copy link
Collaborator

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 135 of 227 files reviewed, 15 unresolved discussions (waiting on @ddaspit and @johnml1135)


-- commits line 76 at r8:

Previously, Enkidu93 (Eli C. Lowry) wrote…

Sorry the commit messages are still messy I'm realizing - I can fix that if you'd like. It is squashed though.

Cleaned it up a bit.

@johnml1135
Copy link
Collaborator Author

src/Machine/src/Serval.Machine.Shared/Services/ServalWordAlignmentPlatformOutboxMessageHandler.cs line 111 at r4 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

I think it's the AlignedWordPair string parsing that's key, and I don't think the protobuf serialization add-on would cover that / could be adapted to cover that - can it?

Options:
A. Keep as is.
B. Move the parsing of AlignedWordPair to the MachineEngine and pass the data back to serval in the non-string format
C. Preserve the data in Serval in the "string" format

My best guess is to go with option B - convert in machine-engine before it goes over GRPC (probably as the message is enqueued to be sent in the outbox). That might be cleanest - having the"knowledge" of how to parse that type of word alignment string be in the lower statistical level.

@johnml1135
Copy link
Collaborator Author

src/Machine/src/Serval.Machine.Shared/Services/WordAlignmentEngineState.cs line 68 at r15 (raw file):

            CurrentBuildRevision = buildRevision;

        SaveModel();

If I am correct, the training of the model (either in ClearML or dotnet) is separate from this. Also, we do not "update" this model. but only the SMT model. Therefore. we have no reason to actually "save" the model at any time, as it is never updated unless it is retrained with a new build.

@johnml1135
Copy link
Collaborator Author

src/Serval/src/Serval.WordAlignment/Configuration/IMediatorRegistrationConfiguratorExtensions.cs line 9 at r4 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

I believe these have been carried over. I went back through the PRs since December and tried to carry over any changes. Let me know if it looks like I missed something.

It looks like everything is good.

Copy link
Collaborator Author

@johnml1135 johnml1135 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 8 of 27 files at r6, 1 of 48 files at r7, 1 of 23 files at r9, 59 of 59 files at r14, 24 of 24 files at r15, all commit messages.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @Enkidu93)

@Enkidu93
Copy link
Collaborator

Enkidu93 commented Feb 6, 2025

src/Machine/src/Serval.Machine.Shared/Services/PreprocessBuildJob.cs line 108 at r11 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Hanging TODO...

Thank you. I'll remove it and create an issue.

@johnml1135
Copy link
Collaborator Author

src/Machine/src/Serval.Machine.Shared/Services/PreprocessBuildJob.cs line 1 at r4 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Will do. It's an undertaking 😆. Starting on it now.

Eli - I believe that you have completed this - is that correct?

@johnml1135
Copy link
Collaborator Author

src/Machine/src/Serval.Machine.Shared/Services/ServalWordAlignmentPlatformOutboxMessageHandler.cs line 111 at r4 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Options:
A. Keep as is.
B. Move the parsing of AlignedWordPair to the MachineEngine and pass the data back to serval in the non-string format
C. Preserve the data in Serval in the "string" format

My best guess is to go with option B - convert in machine-engine before it goes over GRPC (probably as the message is enqueued to be sent in the outbox). That might be cleanest - having the"knowledge" of how to parse that type of word alignment string be in the lower statistical level.

What have you decided with this?

Copy link
Collaborator

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 225 of 227 files reviewed, 3 unresolved discussions (waiting on @johnml1135)


src/Machine/src/Serval.Machine.Shared/Services/PreprocessBuildJob.cs line 1 at r4 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Eli - I believe that you have completed this - is that correct?

I believe so, yes.


src/Machine/src/Serval.Machine.Shared/Services/ServalWordAlignmentPlatformOutboxMessageHandler.cs line 111 at r4 (raw file):

Previously, johnml1135 (John Lambert) wrote…

What have you decided with this?

Sorry - I had drafted a message and apparently it hadn't sent. I originally agreed that B would be cleaner, but now I'm thinking that it'd be better if possible just to store the word pairs in a more JSONy way coming out of the machine.py job itself. How does that strike you?

Copy link
Collaborator

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've fixed the out-of-date parallel corpus preprocessing in the EchoEngine. I think the last outstanding issue is the parsing of the alignments. Options are: 1. Change the output of the machine.py job to be a proper JSONification of the aligned word pairs (rather than the string representation). 2. Leave as is (convert them at serialization in Serval). 3. Parse the strings at the MachinEngine-level while passing on the inference stream.

Reviewable status: 195 of 227 files reviewed, 2 unresolved discussions (waiting on @johnml1135)


src/Machine/src/Serval.Machine.Shared/Services/WordAlignmentEngineState.cs line 68 at r15 (raw file):

Previously, johnml1135 (John Lambert) wrote…

If I am correct, the training of the model (either in ClearML or dotnet) is separate from this. Also, we do not "update" this model. but only the SMT model. Therefore. we have no reason to actually "save" the model at any time, as it is never updated unless it is retrained with a new build.

Done.

@johnml1135
Copy link
Collaborator Author

src/Machine/src/Serval.Machine.Shared/Services/ServalWordAlignmentPlatformOutboxMessageHandler.cs line 111 at r4 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Sorry - I had drafted a message and apparently it hadn't sent. I originally agreed that B would be cleaner, but now I'm thinking that it'd be better if possible just to store the word pairs in a more JSONy way coming out of the machine.py job itself. How does that strike you?

Hmm - as long as the old format is still able to be created - non breaking API. The "alignment format" that is used I believe is a "standard" of sorts and I would expect people to be able to get that format out If a "to_json" routine were added (or similar) I would be ok with it.

I generally like things to be super-compact and Json would be less compact than the existing format. My biggest concern would be the readability of the Json format - where something like 1:1:0.5 2:2:0.25 3:3:0.75 becomes:

[
  [
    "sourceIndex": 1,
    "targetIndex": 1,
    "alignmentScore": 0.5,
  ],
  [
    "sourceIndex: 2, ...

It would expand the total amount of data 5-10 fold and take 50-100 times as many lines.

If there is a better way around this issue, I am all ears. Potentially it could be something like:

[
  [
    "SourceTokens": ["hello", "world"],
    "TargetTokens": ["hola", "mundo"],
    "SourceTargetScore": [ (1, 1, 0.5), (2, 2, 0.75) ],
  ], ...

That would allow the "extra text" to be less, and keep the extra lines under control while maintaining visual flow and needing no custom parsing. The scores could be floats (as above) or some int16 scaled so that 65,535 = 1.0, or some such thing.

@johnml1135
Copy link
Collaborator Author

See comment below.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants