Refactor for Dataflow runner #3

Merged
merged 31 commits into from
Oct 20, 2024


pmhalvor
Owner

@pmhalvor pmhalvor commented Sep 28, 2024

The changes in this PR start prepping the pipeline to be runnable on a Dataflow Runner, though the work is currently incomplete (as of 07eac77).

Work on this is currently paused, in order to complete the full pipeline (PR #2 for classifier w/ model-aaS, a post-processing stage #4, and standardized writing to local and cloud #20). I'll pick this Dataflow runner PR up again once those are complete.

Things to implement:

  • Runnable on Google Dataflow Runner
  • Trigger from local
  • Update README w/ run instructions
  • Add docs
    • LADR on how to serve model endpoint
    • Setting up Cloud Run
    • Dataflow Runner w/ correct IAM accesses
  • Deploy latest images to registry
    • Publicly available
    • Push latest via GHA worker (guide, commit: 8cd65f6)
    • Unit tests to build images. Keep dependencies to a minimum in GHA (avoids sudo apt-get install libhdf5-dev libsndfile1)
  • Terraform IaC elements? add in a separate PR later

@pmhalvor pmhalvor mentioned this pull request Sep 28, 2024
5 tasks
@pmhalvor pmhalvor changed the base branch from add-classifier to master September 30, 2024 20:34
@pmhalvor pmhalvor added blocked waiting on another PR to be merged and removed blocked waiting on another PR to be merged labels Oct 7, 2024
@pmhalvor pmhalvor force-pushed the refactor-for-dataflow-runner branch 3 times, most recently from 4886b5d to 0d3967f Compare October 20, 2024 11:03
@pmhalvor pmhalvor force-pushed the refactor-for-dataflow-runner branch 2 times, most recently from 06ca412 to 76f6335 Compare October 20, 2024 12:29
@pmhalvor
Owner Author

Images for model-server and pipeline-worker are currently pushed to Artifact Registry (AR) on every push to a PR whose base is main. These are tagged with semantic versions along with git hashes. If costs for this get too high, we can change this so images are only pushed when the PR is merged to main. During early development though, I want to keep pushing to AR on every push to a PR.
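
As a rough illustration of the tagging scheme described above, the sketch below builds one image reference per tag. The helper name, the registry path in the usage note, and the choice of a 7-character short hash are all assumptions for illustration; the exact tag format used by the workflow isn't shown in this thread.

```python
import re

# Assumption: versions look like MAJOR.MINOR.PATCH with no pre-release suffix.
SEMVER = re.compile(r"^\d+\.\d+\.\d+$")


def image_refs(registry: str, image: str, version: str, git_hash: str) -> list[str]:
    """Return image references tagged with a semantic version and a short git hash.

    Hypothetical helper: the real workflow may combine or format tags differently.
    """
    if not SEMVER.match(version):
        raise ValueError(f"not a semantic version: {version!r}")
    short_hash = git_hash[:7]  # assumption: 7-character abbreviated hash
    return [
        f"{registry}/{image}:{version}",
        f"{registry}/{image}:{short_hash}",
    ]
```

For example, `image_refs("example-registry/whale-speech", "model-server", "0.1.0", "c42cc6d")` would yield one reference ending in `:0.1.0` and one ending in `:c42cc6d`, where `example-registry/whale-speech` is a placeholder path.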

When PRs get merged to main, we then also release a public version of the images to Docker Hub. These are found under:

  • docker.io/permortenhalvorsen024/whale-speech-model-server:latest
  • docker.io/permortenhalvorsen024/whale-speech-pipeline-worker:latest

These public releases make running the pipeline easier for developers who just want to use this tool without having to rebuild things. A Cloud Run service can easily be spun up with the model server image. The pipeline can then be run on a Direct Runner using the pipeline worker image. If gcloud is configured locally, the same pipeline can easily be run on a Dataflow Runner. Refer to the README for more information on these different run methods.
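
To make the difference between the two run methods concrete, here is a minimal sketch of how the Apache Beam pipeline arguments might differ. The function name and its parameters are hypothetical, and passing the public worker image through `--sdk_container_image` is an assumption about how the pipeline is configured; the README remains the authoritative source for the actual commands.

```python
from typing import Optional

# Public worker image named in this PR.
WORKER_IMAGE = "docker.io/permortenhalvorsen024/whale-speech-pipeline-worker:latest"


def beam_args(
    runner: str,
    project: Optional[str] = None,
    region: Optional[str] = None,
    temp_location: Optional[str] = None,
) -> list[str]:
    """Build Beam pipeline flags for a local or a Dataflow run (hypothetical helper)."""
    if runner == "DirectRunner":
        # A local run needs no cloud settings.
        return ["--runner=DirectRunner"]
    if runner == "DataflowRunner":
        required = {"project": project, "region": region, "temp_location": temp_location}
        missing = [name for name, value in required.items() if not value]
        if missing:
            raise ValueError(f"DataflowRunner requires: {', '.join(missing)}")
        return [
            "--runner=DataflowRunner",
            f"--project={project}",
            f"--region={region}",
            f"--temp_location={temp_location}",
            # Assumption: workers run the public pipeline-worker image.
            f"--sdk_container_image={WORKER_IMAGE}",
        ]
    raise ValueError(f"unsupported runner: {runner!r}")
```

The resulting list could be handed to Beam's `PipelineOptions`, for instance `PipelineOptions(beam_args("DirectRunner"))`. The Dataflow variant additionally depends on locally configured gcloud credentials, as noted above.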

@pmhalvor pmhalvor merged commit c42cc6d into main Oct 20, 2024
2 checks passed
1 participant