
Possible feature and bugfix contributions from Microsoft research team's fork of Metaseq #726

Open
mattmazzola opened this issue Jun 1, 2023 · 4 comments

@mattmazzola
Contributor

We are a team at @microsoft Research that has a fork of the Metaseq repo with these additional features:

  1. New pipeline task to perform Knowledge Distillation via Log Probabilities using a modified Cross Entropy implementation (a rough sketch of the idea follows this list).
  2. Improved inference script with added functionality, such as the ability to output logprobs/logits.
  3. Improvements to Training Stop Conditions
  4. Scripts to support Teacher data generation using Open AI Service
  5. Documentation system using Sphinx
    1. Documentation of Co-Teaching training process (https://arxiv.org/pdf/2305.02031.pdf)
  6. Improved evaluation configuration to evaluate with different metrics depending on dataset
  7. Miscellaneous Bug Fixes
    1. jsonl_dataset.py#_build_index properly accounts for multi-byte characters when computing line offsets (see the second sketch after this list).
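
For readers unfamiliar with item 1, here is a minimal sketch of what a log-probability ("soft") distillation loss of this kind typically looks like; the function name, signature, and temperature handling are illustrative assumptions rather than the exact criterion in our fork:

```python
# Illustrative sketch only -- not the actual criterion in our fork.
# The student is trained against teacher log-probabilities instead of
# one-hot targets, i.e. a "soft" cross entropy.
import torch
import torch.nn.functional as F


def soft_cross_entropy(student_logits, teacher_logprobs, temperature=1.0):
    """-sum_v p_teacher(v) * log p_student(v), averaged over tokens."""
    student_logprobs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = teacher_logprobs.exp()
    loss = -(teacher_probs * student_logprobs).sum(dim=-1).mean()
    # The temperature**2 factor keeps gradient magnitudes comparable
    # across temperatures (standard in distillation setups).
    return loss * temperature ** 2


# Example: a batch of 2 tokens over a vocabulary of 5.
student_logits = torch.randn(2, 5)
teacher_logprobs = F.log_softmax(torch.randn(2, 5), dim=-1)
print(soft_cross_entropy(student_logits, teacher_logprobs).item())
```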
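
And for item 7.1, a simplified illustration of the class of bug the fix addresses (this is not the real _build_index code): offsets into a .jsonl file must be counted in bytes, because file seeks are byte-based, so counting decoded characters goes wrong as soon as a line contains a multi-byte UTF-8 character.

```python
# Simplified illustration, not the actual jsonl_dataset.py code.
# len("héllo") == 5 characters, but len("héllo".encode("utf-8")) == 6 bytes,
# so character-based offsets drift after any multi-byte character.
def build_line_offsets(path):
    offsets = []
    pos = 0
    with open(path, "rb") as f:      # binary mode gives byte-accurate lengths
        for line in f:
            offsets.append(pos)      # byte offset where this line starts
            pos += len(line)         # length in bytes, not decoded characters
    return offsets
```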

Questions

  • Which of the features above would you be interested in us contributing back to Metaseq?
  • Would you be able to offer assistance with the merge process?
    • For example, testing and verification of functionality for a feature PR.

We would be happy to answer any questions you have about the above components.

@tupini07

@suchenzang
Contributor

@mattmazzola Sorry for the delay - I've been on PTO; would be interested in all of the above contributions as they come online (deferring to you on what the best ordering here would be)!

@mattmazzola
Contributor Author

interested in all of the above contributions

Ok! I will talk with the rest of the team and see what we want to do.

We are trying to roll off our current work and transition to another project, so it is not clear how much time we will be able to spend on these contributions. This creates a trade-off between wanting the larger items for their impact and the smaller items for their lower commitment.

deferring to you on what the best ordering here would be

These fixes and features from our fork have some non-trivial divergence from metaseq main, so it's hard to judge how much work is involved until we see how many merge conflicts there are. It also makes testing difficult or impossible, since our infrastructure used a different dependency set running in an Azure Machine Learning environment.

The list above was ordered by an estimate of how impactful the PR contributions would be to Metaseq; however, given these difficulties, I was planning to create the PRs in inverse order to increase the likelihood that they merge, beginning with the smallest / easiest items since they are least likely to break something and would not rely on as much help.

I think I may be able to at least submit PRs to share the ideas, but they may not be directly mergeable.
To be safest, the PR or branch could be taken over by a core maintainer and verified.

@suchenzang
Contributor

These fixes and features from our fork have some non-trivial divergence from metaseq main... I think I may be able to at least submit PRs to share the ideas, but they may not be directly mergeable.

That makes a lot of sense - feel free to open up PRs in whatever state you have them; they will be a useful starting point for figuring out how to merge / test them and pull into main over time.

@mattmazzola
Contributor Author

I have created PRs for all of the items on the initial issue list (except for item 4) and referenced this issue.
Hopefully these can help improve Metaseq. Perhaps someone will continue exploration of the "soft" distillation technique in the future.
