
Scheduler IoW Notes

Current Challenges

  • build files (implnet_{jobs,ops}_*.py) are tracked in the repo, which bloats the git history and makes PRs harder to review
  • multiple organizations have stored their configurations in the repo, placing a higher burden on maintainers
  • the build is driven by environment variables and multiple coupled components instead of one build script, making it harder to debug, test, and refactor

Current understanding for alignment

Steps

  1. Build the gleanerconfig.yml
    • This config builds upon a gleanerconfigPREFIX.yaml file that serves as the base template (see the sketch after these steps)
    • Each organization has a nabuconfig.yaml which specifies configuration and context for how to retrieve triple data and how to store it in minio
  2. Generate the jobs/, ops/, sch/, and repositories/ directories, which contain the Python files that describe when to run each job
  3. Generate the workspace.yaml file, which describes the relative path to the Python file that references all the jobs
    • This can likely be eliminated during the refactor
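To make the steps concrete, here is a minimal sketch of what steps 1 and 3 could look like inside a single Python build program. It assumes PyYAML; the file paths, the `sources` key, the merge logic, and the exact workspace.yaml layout are illustrative assumptions rather than the current schema.

```python
# A sketch only: file names below mirror the steps above, but the merge logic
# and the "sources" key are assumptions about the config shape.
from pathlib import Path

import yaml


def build_gleaner_config(base_template: Path, org_sources: Path, output: Path) -> None:
    """Step 1: merge an organization's source list into the base gleaner template."""
    base = yaml.safe_load(base_template.read_text())
    org = yaml.safe_load(org_sources.read_text())
    # Hypothetical merge: append the organization's sources to the template's list
    base.setdefault("sources", []).extend(org.get("sources", []))
    output.write_text(yaml.safe_dump(base, sort_keys=False))


def build_workspace(jobs_file: str, output: Path) -> None:
    """Step 3: write the workspace.yaml that points Dagster at the generated jobs file."""
    workspace = {"load_from": [{"python_file": {"relative_path": jobs_file}}]}
    output.write_text(yaml.safe_dump(workspace, sort_keys=False))


if __name__ == "__main__":
    build_gleaner_config(
        Path("gleanerconfigPREFIX.yaml"),
        Path("orgs/example/sources.yaml"),  # hypothetical per-organization path
        Path("gleanerconfig.yml"),
    )
    build_workspace("repositories/repository.py", Path("workspace.yaml"))
```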

Ideas for improvement

  • Condense code into one central Python build program
    • Use https://github.com/docker/docker-py to control the containers instead of shell scripts; having the whole data pipeline in one language makes it easier to test and debug (see the first sketch after this list)
    • By using a CLI library like https://typer.tiangolo.com/ we can validate argument correctness and fail early, instead of reading in the arguments and only failing after containers are spun up
  • Move all build files to the root of the repo to make them easier for end users to find
    • (i.e. makefiles, the build/ directory, etc.)
  • Refactor such that individual organizations store their configuration outside the repo.
    • The Python build program should be able to read the configuration files from an arbitrary path that the user specifies
  • Add type annotations and docstrings for easier long-term maintenance
  • Use Jinja templating instead of writing raw text to the output files (see the templating sketch after this list)
  • Currently jobs are run by hard coding them into a Python file via a template
    • It is unclear whether this scales to huge datasets; it is probably best to use a generator so we do not need to load everything into the AST at once (also shown in the templating sketch below)
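Below is a minimal sketch of the central build program idea, assuming typer and docker-py as suggested above. The container image name, the arguments passed to the container, and the mount layout are placeholders, not the actual Gleaner invocation; the point is that argument validation happens before any container is started.

```python
# A sketch of the "one central build program" idea, assuming typer and docker-py
# are installed. Image names, container arguments, and paths are placeholders.
from pathlib import Path

import docker
import typer

app = typer.Typer()


@app.command()
def harvest(
    gleaner_config: Path = typer.Option(..., exists=True, readable=True,
                                        help="Path to the generated gleanerconfig.yml"),
    image: str = typer.Option("example/gleaner:latest",  # placeholder image name
                              help="Container image to run"),
) -> None:
    """Validate arguments up front, then run the harvest container."""
    client = docker.from_env()
    # typer has already verified that gleaner_config exists and is readable,
    # so a bad path fails here, before any container is spun up.
    logs = client.containers.run(
        image,
        ["--cfg", "/config/gleanerconfig.yml"],  # placeholder container arguments
        volumes={str(gleaner_config.parent.resolve()): {"bind": "/config", "mode": "ro"}},
        remove=True,
    )
    typer.echo(logs.decode())


if __name__ == "__main__":
    app()
```

Because the CLI and the container orchestration live in one Python program, the same code path can be exercised in tests without a shell wrapper.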
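And a sketch of the Jinja templating plus generator idea: each source is read from the config and rendered to a job file one at a time, so nothing needs to be accumulated in a single large in-memory structure. The template text and the `sources`/`name` fields are illustrative assumptions about the config and job file shape, not the scheduler's actual format.

```python
# A sketch of rendering job files with Jinja via a generator; the template text
# and config keys are assumptions, not the scheduler's real job format.
from pathlib import Path
from typing import Iterator

import yaml
from jinja2 import Environment

JOB_TEMPLATE = """\
@job
def implnet_job_{{ name }}():
    harvest_{{ name }}()
"""


def render_jobs(gleaner_config_path: str) -> Iterator[tuple[str, str]]:
    """Yield (source name, rendered job code) pairs, one source at a time."""
    env = Environment()
    template = env.from_string(JOB_TEMPLATE)
    with open(gleaner_config_path) as f:
        config = yaml.safe_load(f)
    for source in config.get("sources", []):  # assumed key name
        name = source["name"]
        yield name, template.render(name=name)


if __name__ == "__main__":
    Path("jobs").mkdir(exist_ok=True)
    for name, code in render_jobs("gleanerconfig.yml"):
        Path(f"jobs/implnet_jobs_{name}.py").write_text(code)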
```