-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Provenance #101
Provenance #101
Conversation
phew
remove whitespace on empty line
Maybe it'd be nice for you @mwigham to review this PR since you also worked with provenance, but for the ASR exporter. I made some choices that I'm not so sure are the best in terms of what to report and in what field of the provenance. And I know the target branch is service-decouple and not main, but I built on top of that so I didn't want you to have to bother with reviewing those changes as well. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great!
I like it very much that each step returns the provenance in its output
I also really like the inclusion of config elements as parameters
Some of your choices in modelling the provenance are interesting (and I mean that in a positive way). Something for later discussion in the team.
As far as I'm concerned this is very good for the PoC. You might want to check with @jblom as to whether all the elements that he is interested in about the worker are included.
else: | ||
logger.info(f"Whisper transcript already present in {output_path}") | ||
provenance = { | ||
"activity_name": "Whisper transcript already exists", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
interesting point for later discussion with the team as to whether we model this as a different activity, or as a different output of the Whisper speech processing activity. There's something to be said for both
end_time = (time.time() - start_time) * 1000 | ||
provenance["processing_time_ms"] = end_time | ||
provenance["output_data"] = input_path | ||
provenance["steps"].append("No transcode required, input is audio") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here again we need to think (in the future) a bit more clearly as a team about how we want to communicate about steps that were skipped (but not failed).
asr.py
Outdated
@@ -11,50 +25,111 @@ | |||
|
|||
logger = logging.getLogger(__name__) | |||
os.environ["HF_HOME"] = model_base_dir # change dir where model is downloaded | |||
my_version = pkg_resources.get_distribution( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really interesting. How does this work to get the worker version? We have struggled previously with trying to put in a Github version but we had trouble getting that into the Dockerfile.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This approach doesn't put in the Github version. It simply obtains the version from the pyproject.toml
file. And also, this approach wasn't actually working, so I'm replacing it with a different one that actually reads pyproject.toml
and outputs what is written in the version field.
Previous attempt did not work
@greenw0lf @mwigham ok I've globally checked the code and comments here and it seems fine to merge it to the "dane as a service branch". From there I'll also test the code a bit more thoroughly and then merge with main, so we can try it out later this week. |
Closes #81