Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Provenance #101

Merged
merged 5 commits into from
Oct 1, 2024
Merged

Provenance #101

merged 5 commits into from
Oct 1, 2024

Conversation

greenw0lf
Copy link
Collaborator

Closes #81

@greenw0lf greenw0lf changed the base branch from main to 52-setup-service-decouple-from-dane September 26, 2024 12:38
@greenw0lf greenw0lf marked this pull request as ready for review September 26, 2024 12:57
@greenw0lf
Copy link
Collaborator Author

Maybe it'd be nice for you @mwigham to review this PR since you also worked with provenance, but for the ASR exporter. I made some choices that I'm not so sure are the best in terms of what to report and in what field of the provenance.

And I know the target branch is service-decouple and not main, but I built on top of that so I didn't want you to have to bother with reviewing those changes as well.

@greenw0lf greenw0lf requested a review from mwigham September 26, 2024 12:59
@greenw0lf greenw0lf self-assigned this Sep 26, 2024
@greenw0lf greenw0lf added the enhancement New feature or request label Sep 26, 2024
Copy link

@mwigham mwigham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great!
I like it very much that each step returns the provenance in its output
I also really like the inclusion of config elements as parameters

Some of your choices in modelling the provenance are interesting (and I mean that in a positive way). Something for later discussion in the team.

As far as I'm concerned this is very good for the PoC. You might want to check with @jblom as to whether all the elements that he is interested in about the worker are included.

else:
logger.info(f"Whisper transcript already present in {output_path}")
provenance = {
"activity_name": "Whisper transcript already exists",
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

interesting point for later discussion with the team as to whether we model this as a different activity, or as a different output of the Whisper speech processing activity. There's something to be said for both

transcode.py Outdated Show resolved Hide resolved
end_time = (time.time() - start_time) * 1000
provenance["processing_time_ms"] = end_time
provenance["output_data"] = input_path
provenance["steps"].append("No transcode required, input is audio")
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here again we need to think (in the future) a bit more clearly as a team about how we want to communicate about steps that were skipped (but not failed).

asr.py Outdated
@@ -11,50 +25,111 @@

logger = logging.getLogger(__name__)
os.environ["HF_HOME"] = model_base_dir # change dir where model is downloaded
my_version = pkg_resources.get_distribution(
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really interesting. How does this work to get the worker version? We have struggled previously with trying to put in a Github version but we had trouble getting that into the Dockerfile.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This approach doesn't put in the Github version. It simply obtains the version from the pyproject.toml file. And also, this approach wasn't actually working, so I'm replacing it with a different one that actually reads pyproject.toml and outputs what is written in the version field.

@jblom
Copy link
Contributor

jblom commented Oct 1, 2024

@greenw0lf @mwigham ok I've globally checked the code and comments here and it seems fine to merge it to the "dane as a service branch". From there I'll also test the code a bit more thoroughly and then merge with main, so we can try it out later this week.

@jblom jblom merged commit a4e44a2 into 52-setup-service-decouple-from-dane Oct 1, 2024
1 check passed
@jblom jblom deleted the 81-prov branch October 1, 2024 13:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Provenance for Whisper
3 participants