Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add STOP_ON_FAIL option for wf jobs #6101

Merged
merged 3 commits into from
Oct 31, 2023

Conversation

yngve-sk
Copy link
Contributor

@yngve-sk yngve-sk commented Sep 18, 2023

Issue
Resolves #2386

Approach
Add kw to workflow job parser and halt execution if the stop_on_fail (defaults to False) attribute of a workflow job is True.

STOP_ON_FAIL=True can be added to ert scripts like this:

from ert import ErtScript
class AScript(ErtScript):
    stop_on_fail = True
    ...

Or to .sh scripts like this:

#!/bin/bash
ekho helo wordl
STOP_ON_FAIL=TRUE

Or the workflow job declaration file:

STOP_ON_FAIL True
INTERNAL False
EXECUTABLE failing_script.sh

In all cases, this defaults to False. The declaration in the workflow job will have the highest priority, if it is declared in multiple places.

(Screenshot of new behavior in GUI if applicable)

Pre review checklist

  • Read through the code changes carefully after finishing work
  • Make sure tests pass locally (after every commit!)
  • Prepare changes in small commits for more convenient review (optional)
  • PR title captures the intent of the changes, and is fitting for release notes.
  • Updated documentation
  • Ensured that unit tests are added for all new behavior (See
    Ground Rules),
    and changes to existing code have good test coverage.

Pre merge checklist

  • Added appropriate release note label
  • Commit history is consistent and clean, in line with the contribution guidelines.

@yngve-sk yngve-sk changed the title Make it possible to specify STOP_ON_FAIL for wf jobs Add STOP_ON_FAIL option for wf jobs Sep 18, 2023
@yngve-sk yngve-sk force-pushed the strict-workflowjobs branch 2 times, most recently from 3f8c486 to 6407e20 Compare September 18, 2023 13:05
@eivindjahren
Copy link
Contributor

I think we should flip the boolean value and call it CONTINUE_ON_ERROR. That is what github actions uses.

@yngve-sk
Copy link
Contributor Author

I think we should flip the boolean value and call it CONTINUE_ON_ERROR. That is what github actions uses.

Not sure about this flip, from discussions we agreed that the default would be to continue on error, so imo making it explicit when to stop makes more sense than making it explicit when to continue

@yngve-sk yngve-sk force-pushed the strict-workflowjobs branch 2 times, most recently from 925ceff to 498ae77 Compare September 19, 2023 07:27
@codecov-commenter
Copy link

codecov-commenter commented Sep 19, 2023

Codecov Report

Merging #6101 (047d0c3) into main (f338c70) will increase coverage by 0.14%.
Report is 24 commits behind head on main.
The diff coverage is 92.30%.

@@            Coverage Diff             @@
##             main    #6101      +/-   ##
==========================================
+ Coverage   83.39%   83.53%   +0.14%     
==========================================
  Files         343      343              
  Lines       20827    20777      -50     
  Branches      944      938       -6     
==========================================
- Hits        17368    17357      -11     
+ Misses       3170     3129      -41     
- Partials      289      291       +2     
Files Coverage Δ
src/ert/config/ert_script.py 85.58% <100.00%> (+0.40%) ⬆️
src/ert/config/parsing/workflow_job_keywords.py 89.28% <100.00%> (+0.39%) ⬆️
src/ert/config/parsing/workflow_job_schema.py 95.12% <100.00%> (+0.25%) ⬆️
src/ert/config/workflow_job.py 98.43% <100.00%> (+0.02%) ⬆️
src/ert/job_queue/workflow_runner.py 92.19% <88.88%> (+0.32%) ⬆️

... and 39 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

Copy link
Collaborator

@oyvindeide oyvindeide left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a good change! Tried running it in the gui though, and it just hangs for about a minute before I get a popup saying that the workflow failed (I changed to PRE_SIMULATION). The simulation window then changes to Done, but still appears to waiting.

The running time stopped counting after 2 seconds, but did not get the popup for about a minute.

"""
LOAD_WORKFLOW_JOB failing_job failjob
LOAD_WORKFLOW dump_failing_workflow wffail
HOOK_WORKFLOW wffail POST_SIMULATION
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could change this to PRE_SIMULATION to avoid spinning up the ensemble evaluator, or perhaps you want it to be POST to check that we exit the tracking loop? If so, that might be worth a comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will look a bit more into this, running tests it also hangs on PRE_SIMULATION because it waits 60 seconds for the websocket client to connect

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should maybe be another issue, I tried some stuff w/ threading, like setting the monitor thread to be a daemon, but there is still something else that keeps it hanging.

When the thread named ert_cli_simulation_thread is spawned, it spawns some other threads too:
Screenshot 2023-09-20 at 09 37 23

Should maybe be a separate issue/thing to look into/thing to postpone until experiment server milestone which will refactor this away anyway?

tests/unit_tests/cli/test_integration_cli.py Outdated Show resolved Hide resolved
tests/unit_tests/job_queue/test_workflow_job.py Outdated Show resolved Hide resolved
)
)

parsed = ert_parser(None, args=[ENSEMBLE_EXPERIMENT_MODE, "poly.ert"])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could use TEST_RUN_MODE to avoid running all realizations.

@oyvindeide
Copy link
Collaborator

A lot of workflows are using the plugin system to register, but as far as I can tell, this is not really exposed through that api? https://ert.readthedocs.io/en/latest/experimental/plugin_system.html#workflow-jobs

@yngve-sk yngve-sk force-pushed the strict-workflowjobs branch 9 times, most recently from 67cc24b to e356855 Compare September 27, 2023 10:21
@oyvindeide
Copy link
Collaborator

Thanks for good work, when I ran it I noticed that I got a stack trace in my terminal when running from the gui though:

The script 'MisfitPreprocessorJob' caused an error while running:
I will not run
Exception in thread ert_gui_simulation_thread:
Traceback (most recent call last):
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/threading.py", line 973, in _bootstrap_inner
    self.run()
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/threading.py", line 910, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/OYVEID/projects/ert/ert/src/ert/gui/simulation/run_dialog.py", line 262, in run
    self._run_model.startSimulations(
  File "/Users/OYVEID/projects/ert/ert/src/ert/run_models/base_run_model.py", line 210, in startSimulations
    run_context = self.run_experiment(
  File "/Users/OYVEID/projects/ert/ert/src/ert/run_models/ensemble_smoother.py", line 76, in run_experiment
    self.ert().runWorkflows(
  File "/Users/OYVEID/projects/ert/ert/src/ert/enkf_main.py", line 395, in runWorkflows
    WorkflowRunner(workflow, self, storage, ensemble).run_blocking()
  File "/Users/OYVEID/projects/ert/ert/src/ert/job_queue/workflow_runner.py", line 200, in run_blocking
    raise RuntimeError(
RuntimeError: Workflow job MISFIT_PREPROCESSOR failed with error:

which I would not expect. However the terminal gave me the error message, but the box that popped up in the gui did not.

@yngve-sk
Copy link
Contributor Author

Thanks for good work, when I ran it I noticed that I got a stack trace in my terminal when running from the gui though:

The script 'MisfitPreprocessorJob' caused an error while running:
I will not run
Exception in thread ert_gui_simulation_thread:
Traceback (most recent call last):
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/threading.py", line 973, in _bootstrap_inner
    self.run()
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/threading.py", line 910, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/OYVEID/projects/ert/ert/src/ert/gui/simulation/run_dialog.py", line 262, in run
    self._run_model.startSimulations(
  File "/Users/OYVEID/projects/ert/ert/src/ert/run_models/base_run_model.py", line 210, in startSimulations
    run_context = self.run_experiment(
  File "/Users/OYVEID/projects/ert/ert/src/ert/run_models/ensemble_smoother.py", line 76, in run_experiment
    self.ert().runWorkflows(
  File "/Users/OYVEID/projects/ert/ert/src/ert/enkf_main.py", line 395, in runWorkflows
    WorkflowRunner(workflow, self, storage, ensemble).run_blocking()
  File "/Users/OYVEID/projects/ert/ert/src/ert/job_queue/workflow_runner.py", line 200, in run_blocking
    raise RuntimeError(
RuntimeError: Workflow job MISFIT_PREPROCESSOR failed with error:

which I would not expect. However the terminal gave me the error message, but the box that popped up in the gui did not.

Was this with PRE_SIMULATION or POST_SIMULATION for the job? I try running poly example with this appended to poly.ert:

LOAD_WORKFLOW_JOB failing_job failjob
LOAD_WORKFLOW dump_failing_workflow wffail
HOOK_WORKFLOW wffail POST_SIMULATION

where failing_job contains:

STOP_ON_FAIL True
INTERNAL False
EXECUTABLE failing_script.sh

failing_script.sh contains

#!/bin/bash
ekho helo wordl

If I open this with ert gui test-data/poly_example/poly.ert, and run Ensemble experiment w/ POST_SIMULATION I get the error message and ert closes properly:
Screenshot 2023-10-16 at 07 36 38

But with PRE_SIMULATION it hangs due to a 60 second timeout for connecting to a websocket, and fixing this would require some more refactoring and might be better for a separate PR?

@oyvindeide
Copy link
Collaborator

I used PRE_FIRST_UPDATE, which is post simulation. It did not hang. I ran an internal workflow, modified the run function of MISFIT_PREPROCESSOR to just throw an error. Can leave it is outside the scope of this PR if the behavior has always been that way.

@yngve-sk yngve-sk force-pushed the strict-workflowjobs branch 3 times, most recently from 7cebd3f to 78c336b Compare October 26, 2023 09:18
Copy link
Collaborator

@oyvindeide oyvindeide left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Nice improvement! Remember to clean up commits before merging.

@oysteoh
Copy link
Contributor

oysteoh commented Oct 30, 2023

Should this also be published in a bigger "note" to our users that this is now possible? There have been multiple requests regarding this right?

@yngve-sk yngve-sk self-assigned this Oct 30, 2023
@yngve-sk yngve-sk added release-notes:new-feature Automatically categorise as new feature in release notes release-notes:user-impact Automatically categorise as breaking for people using CLI/GUI and removed release-notes:user-impact Automatically categorise as breaking for people using CLI/GUI labels Oct 30, 2023
@sondreso
Copy link
Collaborator

Should this also be published in a bigger "note" to our users that this is now possible? There have been multiple requests regarding this right?

Yes, this should be part of the release notes yes! But, can we update the documentation for workflows in this PR to explain the new behavior?

Copy link
Collaborator

@sondreso sondreso left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! 🚢

@yngve-sk yngve-sk merged commit 5ce7918 into equinor:main Oct 31, 2023
42 of 43 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-notes:new-feature Automatically categorise as new feature in release notes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ERT ignores failure of workflows. Should be able to enable/disable stop of ERT if an ERT workflow fails
6 participants