Add STOP_ON_FAIL option for wf jobs #6101

yngve-sk · 2023-09-18T12:00:48Z

Issue
Resolves #2386

Approach
Add kw to workflow job parser and halt execution if the stop_on_fail (defaults to False) attribute of a workflow job is True.

STOP_ON_FAIL=True can be added to ert scripts like this:

from ert import ErtScript
class AScript(ErtScript):
    stop_on_fail = True
    ...

Or to .sh scripts like this:

#!/bin/bash
ekho helo wordl
STOP_ON_FAIL=TRUE

Or the workflow job declaration file:

STOP_ON_FAIL True
INTERNAL False
EXECUTABLE failing_script.sh

In all cases, this defaults to False. The declaration in the workflow job will have the highest priority, if it is declared in multiple places.

(Screenshot of new behavior in GUI if applicable)

Pre review checklist

Read through the code changes carefully after finishing work
Make sure tests pass locally (after every commit!)
Prepare changes in small commits for more convenient review (optional)
PR title captures the intent of the changes, and is fitting for release notes.
Updated documentation
Ensured that unit tests are added for all new behavior (See
Ground Rules),
and changes to existing code have good test coverage.

Pre merge checklist

Added appropriate release note label
Commit history is consistent and clean, in line with the contribution guidelines.

eivindjahren · 2023-09-19T06:44:11Z

I think we should flip the boolean value and call it CONTINUE_ON_ERROR. That is what github actions uses.

yngve-sk · 2023-09-19T07:26:24Z

I think we should flip the boolean value and call it CONTINUE_ON_ERROR. That is what github actions uses.

Not sure about this flip, from discussions we agreed that the default would be to continue on error, so imo making it explicit when to stop makes more sense than making it explicit when to continue

codecov-commenter · 2023-09-19T07:40:26Z

Codecov Report

Merging #6101 (047d0c3) into main (f338c70) will increase coverage by 0.14%.
Report is 24 commits behind head on main.
The diff coverage is 92.30%.

@@            Coverage Diff             @@
##             main    #6101      +/-   ##
==========================================
+ Coverage   83.39%   83.53%   +0.14%     
==========================================
  Files         343      343              
  Lines       20827    20777      -50     
  Branches      944      938       -6     
==========================================
- Hits        17368    17357      -11     
+ Misses       3170     3129      -41     
- Partials      289      291       +2

Files	Coverage Δ
src/ert/config/ert_script.py	`85.58% <100.00%> (+0.40%)`	⬆️
src/ert/config/parsing/workflow_job_keywords.py	`89.28% <100.00%> (+0.39%)`	⬆️
src/ert/config/parsing/workflow_job_schema.py	`95.12% <100.00%> (+0.25%)`	⬆️
src/ert/config/workflow_job.py	`98.43% <100.00%> (+0.02%)`	⬆️
src/ert/job_queue/workflow_runner.py	`92.19% <88.88%> (+0.32%)`	⬆️

... and 39 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

oyvindeide

Looks like a good change! Tried running it in the gui though, and it just hangs for about a minute before I get a popup saying that the workflow failed (I changed to PRE_SIMULATION). The simulation window then changes to Done, but still appears to waiting.

The running time stopped counting after 2 seconds, but did not get the popup for about a minute.

oyvindeide · 2023-09-19T07:48:57Z

tests/unit_tests/cli/test_integration_cli.py

+                """
+                   LOAD_WORKFLOW_JOB failing_job failjob
+                   LOAD_WORKFLOW dump_failing_workflow wffail
+                   HOOK_WORKFLOW wffail POST_SIMULATION


Could change this to PRE_SIMULATION to avoid spinning up the ensemble evaluator, or perhaps you want it to be POST to check that we exit the tracking loop? If so, that might be worth a comment.

Will look a bit more into this, running tests it also hangs on PRE_SIMULATION because it waits 60 seconds for the websocket client to connect

I think it should maybe be another issue, I tried some stuff w/ threading, like setting the monitor thread to be a daemon, but there is still something else that keeps it hanging.

When the thread named ert_cli_simulation_thread is spawned, it spawns some other threads too:

Should maybe be a separate issue/thing to look into/thing to postpone until experiment server milestone which will refactor this away anyway?

tests/unit_tests/cli/test_integration_cli.py

tests/unit_tests/job_queue/test_workflow_job.py

oyvindeide · 2023-09-19T07:56:27Z

tests/unit_tests/cli/test_integration_cli.py

+            )
+        )
+
+    parsed = ert_parser(None, args=[ENSEMBLE_EXPERIMENT_MODE, "poly.ert"])


Could use TEST_RUN_MODE to avoid running all realizations.

oyvindeide · 2023-09-20T10:46:23Z

A lot of workflows are using the plugin system to register, but as far as I can tell, this is not really exposed through that api? https://ert.readthedocs.io/en/latest/experimental/plugin_system.html#workflow-jobs

oyvindeide · 2023-10-05T14:01:52Z

Thanks for good work, when I ran it I noticed that I got a stack trace in my terminal when running from the gui though:

The script 'MisfitPreprocessorJob' caused an error while running:
I will not run
Exception in thread ert_gui_simulation_thread:
Traceback (most recent call last):
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/threading.py", line 973, in _bootstrap_inner
    self.run()
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/threading.py", line 910, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/OYVEID/projects/ert/ert/src/ert/gui/simulation/run_dialog.py", line 262, in run
    self._run_model.startSimulations(
  File "/Users/OYVEID/projects/ert/ert/src/ert/run_models/base_run_model.py", line 210, in startSimulations
    run_context = self.run_experiment(
  File "/Users/OYVEID/projects/ert/ert/src/ert/run_models/ensemble_smoother.py", line 76, in run_experiment
    self.ert().runWorkflows(
  File "/Users/OYVEID/projects/ert/ert/src/ert/enkf_main.py", line 395, in runWorkflows
    WorkflowRunner(workflow, self, storage, ensemble).run_blocking()
  File "/Users/OYVEID/projects/ert/ert/src/ert/job_queue/workflow_runner.py", line 200, in run_blocking
    raise RuntimeError(
RuntimeError: Workflow job MISFIT_PREPROCESSOR failed with error:

which I would not expect. However the terminal gave me the error message, but the box that popped up in the gui did not.

yngve-sk · 2023-10-16T05:37:45Z

Thanks for good work, when I ran it I noticed that I got a stack trace in my terminal when running from the gui though:

The script 'MisfitPreprocessorJob' caused an error while running:
I will not run
Exception in thread ert_gui_simulation_thread:
Traceback (most recent call last):
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/threading.py", line 973, in _bootstrap_inner
    self.run()
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/threading.py", line 910, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/OYVEID/projects/ert/ert/src/ert/gui/simulation/run_dialog.py", line 262, in run
    self._run_model.startSimulations(
  File "/Users/OYVEID/projects/ert/ert/src/ert/run_models/base_run_model.py", line 210, in startSimulations
    run_context = self.run_experiment(
  File "/Users/OYVEID/projects/ert/ert/src/ert/run_models/ensemble_smoother.py", line 76, in run_experiment
    self.ert().runWorkflows(
  File "/Users/OYVEID/projects/ert/ert/src/ert/enkf_main.py", line 395, in runWorkflows
    WorkflowRunner(workflow, self, storage, ensemble).run_blocking()
  File "/Users/OYVEID/projects/ert/ert/src/ert/job_queue/workflow_runner.py", line 200, in run_blocking
    raise RuntimeError(
RuntimeError: Workflow job MISFIT_PREPROCESSOR failed with error:

which I would not expect. However the terminal gave me the error message, but the box that popped up in the gui did not.

Was this with PRE_SIMULATION or POST_SIMULATION for the job? I try running poly example with this appended to poly.ert:

LOAD_WORKFLOW_JOB failing_job failjob
LOAD_WORKFLOW dump_failing_workflow wffail
HOOK_WORKFLOW wffail POST_SIMULATION

where failing_job contains:

STOP_ON_FAIL True
INTERNAL False
EXECUTABLE failing_script.sh

failing_script.sh contains

#!/bin/bash
ekho helo wordl

If I open this with ert gui test-data/poly_example/poly.ert, and run Ensemble experiment w/ POST_SIMULATION I get the error message and ert closes properly:

But with PRE_SIMULATION it hangs due to a 60 second timeout for connecting to a websocket, and fixing this would require some more refactoring and might be better for a separate PR?

oyvindeide · 2023-10-23T06:32:23Z

I used PRE_FIRST_UPDATE, which is post simulation. It did not hang. I ran an internal workflow, modified the run function of MISFIT_PREPROCESSOR to just throw an error. Can leave it is outside the scope of this PR if the behavior has always been that way.

oyvindeide

LGTM! Nice improvement! Remember to clean up commits before merging.

oysteoh · 2023-10-30T11:13:15Z

Should this also be published in a bigger "note" to our users that this is now possible? There have been multiple requests regarding this right?

sondreso · 2023-10-30T15:58:51Z

Should this also be published in a bigger "note" to our users that this is now possible? There have been multiple requests regarding this right?

Yes, this should be part of the release notes yes! But, can we update the documentation for workflows in this PR to explain the new behavior?

sondreso

LGTM! 🚢

yngve-sk changed the title ~~Make it possible to specify STOP_ON_FAIL for wf jobs~~ Add STOP_ON_FAIL option for wf jobs Sep 18, 2023

yngve-sk force-pushed the strict-workflowjobs branch 2 times, most recently from 3f8c486 to 6407e20 Compare September 18, 2023 13:05

yngve-sk force-pushed the strict-workflowjobs branch from 6407e20 to e40abf0 Compare September 19, 2023 07:06

yngve-sk requested review from sondreso and oyvindeide September 19, 2023 07:06

yngve-sk force-pushed the strict-workflowjobs branch from e40abf0 to 549d19d Compare September 19, 2023 07:10

yngve-sk force-pushed the strict-workflowjobs branch 2 times, most recently from 925ceff to 498ae77 Compare September 19, 2023 07:27

yngve-sk force-pushed the strict-workflowjobs branch from 498ae77 to 9c49c2d Compare September 19, 2023 07:49

oyvindeide reviewed Sep 19, 2023

View reviewed changes

yngve-sk force-pushed the strict-workflowjobs branch from e3226ed to 54bd310 Compare September 19, 2023 08:33

yngve-sk requested a review from oyvindeide September 20, 2023 07:41

yngve-sk force-pushed the strict-workflowjobs branch 9 times, most recently from 67cc24b to e356855 Compare September 27, 2023 10:21

yngve-sk force-pushed the strict-workflowjobs branch from e356855 to 910f26f Compare October 4, 2023 05:26

yngve-sk force-pushed the strict-workflowjobs branch from 910f26f to e28903c Compare October 6, 2023 05:49

yngve-sk force-pushed the strict-workflowjobs branch 3 times, most recently from 7cebd3f to 78c336b Compare October 26, 2023 09:18

oyvindeide approved these changes Oct 30, 2023

View reviewed changes

Yngve S. Kristiansen added 2 commits October 30, 2023 13:32

Store _stderrdata when ert_script fails

b9f3a41

Make it possible to specify STOP_ON_FAIL for wf jobs

af93a47

yngve-sk force-pushed the strict-workflowjobs branch from 78c336b to af93a47 Compare October 30, 2023 12:33

yngve-sk self-assigned this Oct 30, 2023

sondreso approved these changes Oct 31, 2023

View reviewed changes

yngve-sk force-pushed the strict-workflowjobs branch from 0c842eb to b97c406 Compare October 31, 2023 12:21

Add documentation for STOP_ON_FAILURE flag for jobs

047d0c3

yngve-sk force-pushed the strict-workflowjobs branch from b97c406 to 047d0c3 Compare October 31, 2023 13:09

yngve-sk merged commit 5ce7918 into equinor:main Oct 31, 2023
42 of 43 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add STOP_ON_FAIL option for wf jobs #6101

Add STOP_ON_FAIL option for wf jobs #6101

yngve-sk commented Sep 18, 2023 •

edited

Loading

eivindjahren commented Sep 19, 2023

yngve-sk commented Sep 19, 2023

codecov-commenter commented Sep 19, 2023 •

edited

Loading

oyvindeide left a comment •

edited

Loading

oyvindeide Sep 19, 2023

yngve-sk Sep 19, 2023

yngve-sk Sep 20, 2023

oyvindeide Sep 19, 2023

oyvindeide commented Sep 20, 2023

oyvindeide commented Oct 5, 2023

yngve-sk commented Oct 16, 2023

oyvindeide commented Oct 23, 2023

oyvindeide left a comment

oysteoh commented Oct 30, 2023

sondreso commented Oct 30, 2023

sondreso left a comment

Add STOP_ON_FAIL option for wf jobs #6101

Add STOP_ON_FAIL option for wf jobs #6101

Conversation

yngve-sk commented Sep 18, 2023 • edited Loading

Pre review checklist

Pre merge checklist

eivindjahren commented Sep 19, 2023

yngve-sk commented Sep 19, 2023

codecov-commenter commented Sep 19, 2023 • edited Loading

Codecov Report

oyvindeide left a comment • edited Loading

Choose a reason for hiding this comment

oyvindeide Sep 19, 2023

Choose a reason for hiding this comment

yngve-sk Sep 19, 2023

Choose a reason for hiding this comment

yngve-sk Sep 20, 2023

Choose a reason for hiding this comment

oyvindeide Sep 19, 2023

Choose a reason for hiding this comment

oyvindeide commented Sep 20, 2023

oyvindeide commented Oct 5, 2023

yngve-sk commented Oct 16, 2023

oyvindeide commented Oct 23, 2023

oyvindeide left a comment

Choose a reason for hiding this comment

oysteoh commented Oct 30, 2023

sondreso commented Oct 30, 2023

sondreso left a comment

Choose a reason for hiding this comment

yngve-sk commented Sep 18, 2023 •

edited

Loading

codecov-commenter commented Sep 19, 2023 •

edited

Loading

oyvindeide left a comment •

edited

Loading