At the moment, every job within Spack CI is retried up to 2 times regardless of the failure reason. Error modes range from OOM kills to typos in package configs to genuine build failures to network interruptions.
GitLab allows custom rules for when retries are automatically triggered. The CI generate script currently lists every error mode except timeout failures as a valid reason for retrying.
Given that several of these failure modes are deterministic, there is no sense in retrying those jobs -- it only wastes cycles. GitLab now supports retry rules based on a job's exit code, which we could use to avoid retrying jobs that fail for deterministic reasons.
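As a sketch, a generated job could use the newer `retry:exit_codes` keyword so that only exit codes we map to transient failures trigger a retry. The exit codes below are illustrative assumptions, not what Spack's build scripts currently emit:

```yaml
build-job:
  script: spack ci rebuild
  retry:
    max: 2
    # Only retry when the job exits with a code we treat as transient.
    # The build script would have to exit with distinct codes for OOM
    # kills, network errors, etc. for this mapping to work.
    exit_codes:
      - 137   # SIGKILL, typically an OOM kill
      - 111   # hypothetical "network interruption" code
```

Deterministic failures (typos in package configs, true build errors) would then fail once and stay failed.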
Additionally, with the implementation of dynamic resource allocation in CI, we'll need to integrate safety measures such as the ability to automatically retry jobs with more resources if they were OOM killed.
Example workflow
A job fails and Gantry receives the webhook. It checks if it was OOM killed and updates the memory limit if necessary.
Challenges
There is no support in GitLab to retry a single job while updating its variables.
Issue 37268 added support for specifying variables when retrying manual jobs via the web UI. A separate issue was filed to push for adding the same capability to the API. Additionally, manual jobs can already be started with custom variables via the API.
Unfortunately, jobs in Spack are not manual; they run automatically after being triggered or scheduled. What we want is to retry an individual job within a pipeline -- whether scheduled, triggered, or manual -- while passing custom variables. Without this functionality, we won't be able to modify the resource request/limit of a job that was killed due to resource contention.
Issue 387798 would resolve this, but there has been no significant movement to implement the feature.
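To illustrate the gap, here is a hedged sketch of the two documented GitLab REST calls (only the request payloads are built; the instance URL and IDs are placeholders):

```python
# Contrast of the two GitLab job endpoints. Only request payloads are
# built here; sending them requires an authenticated HTTP client.

GITLAB = "https://gitlab.example.com/api/v4"  # placeholder instance URL

def play_job_request(project_id: int, job_id: int,
                     variables: dict) -> tuple[str, dict]:
    """POST /projects/:id/jobs/:job_id/play -- manual jobs only.
    Accepts custom variables via job_variables_attributes."""
    url = f"{GITLAB}/projects/{project_id}/jobs/{job_id}/play"
    body = {"job_variables_attributes":
            [{"key": k, "value": v} for k, v in variables.items()]}
    return url, body

def retry_job_request(project_id: int, job_id: int) -> tuple[str, dict]:
    """POST /projects/:id/jobs/:job_id/retry -- works for any failed job,
    but takes no variables: the retry reruns the original configuration."""
    return f"{GITLAB}/projects/{project_id}/jobs/{job_id}/retry", {}
```

The `retry` endpoint's lack of a variables parameter is exactly the limitation described above.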
Questions:
Are we able to determine whether a GitLab job was retried, and what the original job's ID was? It seems the answer is no.
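One possible approximation, assuming job names are unique within a pipeline: list the pipeline's jobs with `include_retried=true` and group by name, treating the oldest ID per name as the original attempt. There is no first-class "original job id" field, so this is a heuristic:

```python
# Approximate retry lineage from
# GET /projects/:id/pipelines/:pipeline_id/jobs?include_retried=true
# This assumes job names are unique within the pipeline.
from collections import defaultdict

def retry_lineage(jobs: list[dict]) -> dict[str, list[int]]:
    """Map job name -> job IDs ordered oldest-first; index 0 is the
    presumed original attempt, later entries are retries."""
    by_name: dict[str, list[int]] = defaultdict(list)
    for job in sorted(jobs, key=lambda j: j["id"]):
        by_name[job["name"]].append(job["id"])
    return dict(by_name)
```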
Alternative solutions
Restarting the entire pipeline
If a pipeline failed due to OOM-killed builds, we follow spackbot's model and recreate the pipeline with this endpoint.
Restarting the pipeline would trigger ci generate and request new, updated allocations from Gantry.
Cons: every single job in the pipeline will be re-run, which leads to more wasted cycles (though successful builds are cached?).
Other options:
- a pre_build script in spack-infrastructure to check if a job has been retried and update the job variables
- k8s middleware
- downstream/child pipelines and manual jobs
TODO: