forked from gleanerio/scheduler
Colton Loftus edited this page Sep 3, 2024 · 15 revisions
# Scheduler IoW Notes
Problems with the current setup:

- Build files (`implnet_{jobs,ops}_*.py`) are tracked in the repo, making for a verbose git history and pull requests that are more work to review.
- Multiple organizations have stored their configurations in the repo, placing a higher burden on maintainers.
- The build is driven by environment variables and multiple coupled components instead of one build script, making it more challenging to debug, test, and refactor.
The build process:

- Build the `gleanerconfig.yml`.
  - This config builds upon a `gleanerconfigPREFIX.yaml` file that is the base template.
  - Each organization has a `nabuconfig.yaml` which specifies the configuration and context for how to retrieve triplet data and how to store it in minio.
- Generate the `jobs/`, `ops/`, `sch/`, and `repositories/` directories, which contain the Python files that describe when to run each job.
- Generate the `workspace.yaml` file, which describes the relative path to the Python file containing references to all the jobs.
  - This can likely be eliminated when refactoring.
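The jobs/ops naming suggests this is a Dagster deployment, in which case `workspace.yaml` would follow Dagster's workspace format. A minimal sketch, assuming that format (the path is illustrative, not necessarily the repo's actual layout):

```yaml
# Hypothetical workspace.yaml pointing Dagster at one generated
# Python file that references all the jobs:
load_from:
  - python_file: repositories/repository.py
```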
Proposed refactors:

- Condense the code into one central Python build program.
- Use https://github.com/docker/docker-py to control the containers instead of shell scripts. (Having the whole data pipeline in one language makes it easier to test and debug.)
- By using a CLI library like https://typer.tiangolo.com/ we can validate argument correctness and fail early, making debugging easier, instead of reading in the arguments and failing only after containers are spun up.
- Move all build files (i.e. makefiles, the `build/` directory, etc.) to the root of the repo to make things clearer for end users.
- Refactor such that individual organizations store their configuration outside the repo.
  - The Python build program should be able to read the configuration files at an arbitrary path that the user specifies.
- Add types and docstrings for easier long-term maintenance.
- Use Jinja templating instead of writing raw text to the output files.
  - Currently jobs are run by hard-coding them in a template in a Python file.
  - It is unclear whether this is scalable to huge datasets. It is probably best to use a generator so we do not need to load everything into the AST.
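The docker-py idea above could be sketched as follows. The image name, volume paths, and helper function are illustrative assumptions, not the repo's actual setup; the container-run arguments are built in a pure helper so that logic stays unit-testable without a Docker daemon:

```python
def run_kwargs(image: str, config_dir: str) -> dict:
    """Build the keyword arguments for client.containers.run().

    Pure function: testable without Docker running. The mount target
    /config is a placeholder, not the real container layout.
    """
    return {
        "image": image,
        "volumes": {config_dir: {"bind": "/config", "mode": "ro"}},
        "detach": True,
    }


if __name__ == "__main__":
    import docker  # requires the docker-py package and a running daemon

    client = docker.from_env()
    container = client.containers.run(**run_kwargs("gleaner:latest", "/tmp/configs"))
    print(container.logs())
```

Keeping the argument-building separate from the daemon call is what makes the "one language, easier to test" claim concrete: the shell-script equivalent has no seam to test at.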
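The fail-early CLI validation and the arbitrary-config-path points go together. The notes propose Typer; this stdlib `argparse` sketch shows the same idea (flag names are hypothetical): reject a bad config path before any containers are started, rather than failing mid-pipeline.

```python
import argparse
import pathlib


def build_parser() -> argparse.ArgumentParser:
    """CLI for the hypothetical central build program."""
    parser = argparse.ArgumentParser(description="Build scheduler files")
    parser.add_argument(
        "--config-dir",
        type=pathlib.Path,
        required=True,
        help="Directory holding the org's gleaner/nabu configs (any user-chosen path)",
    )
    return parser


def validate_args(args: argparse.Namespace) -> pathlib.Path:
    # Fail early, before any containers are spun up.
    if not args.config_dir.is_dir():
        raise SystemExit(f"error: {args.config_dir} is not a directory")
    return args.config_dir
```

With Typer the same checks would hang off a decorated command function, with the added benefit of type-driven parsing of the arguments.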
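A minimal sketch of the Jinja idea, assuming the `jinja2` package; the template body and field names are illustrative, not the repo's actual job schema:

```python
from jinja2 import Template

# Hypothetical template for one generated job file; in practice this
# would live in a .j2 file rather than inline.
JOB_TEMPLATE = Template(
    """\
@job
def implnet_job_{{ source }}():
    harvest_{{ source }}()
"""
)


def render_job(source: str) -> str:
    """Render the job file for one data source."""
    return JOB_TEMPLATE.render(source=source)


print(render_job("geoconnex"))
```

Compared with concatenating raw strings, templates keep the generated code's shape visible in one place and make escaping and loops explicit.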
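The generator point can be sketched like this (names are illustrative): sources are yielded one at a time, so the build never holds the full dataset, or one giant generated AST, in memory.

```python
from typing import Iterable, Iterator


def iter_job_sources(lines: Iterable[str]) -> Iterator[str]:
    """Yield one source name per non-empty line, lazily."""
    for line in lines:
        name = line.strip()
        if name:
            yield name


# Works identically on a small in-memory list or a huge open file
# handle, since both are just iterables of lines:
print(list(iter_job_sources(["geoconnex\n", "\n", "cuahsi\n"])))  # → ['geoconnex', 'cuahsi']
```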