Azure batch run in Tower gets stuck in Submitted state #385
Comments
Looking in the logs I can see this. Granted, the error should be handled better on the Tower side, but I don't understand what you mean by "Pipeline has failed on the node".
|
Thanks, Paolo! If the run is already completed, that is good! But it is not shown as such. I meant that the pipeline has definitely finished (it failed in this case) and I saw the pool shrinking; however, the run status in the Tower interface was not updated. |
Um, not nice. What if you try to cancel it? |
Please copy and paste that error ID. Also include the workflow ID (you can find it on the details page). |
|
I have another run with the same problem:
|
I see exactly the same problem. |
Workspace ID: 236422758311365 |
Always with Azure? |
Yep, using an Azure Batch compute env. |
Sorry was confused by the |
It should be OK for you both now. |
I still have 4 runs in the same state? |
Mine were both cancelled successfully. |
Please provide the IDs. |
Workflow IDs: |
should be ok now |
Thanks Vlad, this is useful. Let us look a bit more into this |
@cbr7 Please have a look at the comment from @wikiselev (too many vlads in this thread! 😆) The use case he is reporting causes this exception (to be tracked):
|
Bug report
Expected behavior and actual behavior
Expected: A completed run should end up in either the Failed or the Succeeded state.
Actual: The completed run is stuck in the Submitted state even though the compute pool corresponding to the run has been auto-scaled to 0.
Steps to reproduce the problem
I created a compute pool using Tower Forge and started a pipeline (private). This created a new run on Tower and auto-scaled the newly created pool from 0 to 1 node. The pipeline failed on the node.
Program output
The Tower run never received the output of the pipeline and got stuck in the Submitted state even though the compute pool has been scaled down to 0 nodes.
Environment
Tower
Additional context
I now have two runs like this (in the Submitted state). When I try to delete them myself via the Tower interface or via the Tower CLI (with the --force option) I get an error like:
If I get three more runs like this I will hit the limit of concurrent tasks and won't be able to use Tower anymore... I would appreciate your help in deleting them. Thanks!
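To make the expected behavior concrete, here is a minimal sketch of the state handling described in this report. It is not Tower's implementation: `RunState`, `reconcile`, and the `pool_nodes` and `exit_code` parameters are hypothetical names introduced only for illustration, and the rule "no exit code and an empty pool means Failed" is an assumption about what a sensible fallback could look like.

```python
from enum import Enum
from typing import Optional


class RunState(Enum):
    # State names taken from the report; the enum itself is hypothetical.
    SUBMITTED = "Submitted"
    SUCCEEDED = "Succeeded"
    FAILED = "Failed"


def reconcile(state: RunState, pool_nodes: int, exit_code: Optional[int]) -> RunState:
    """Return the state a run should be shown in (illustrative logic only)."""
    # Runs that already reached a terminal state are left alone.
    if state is not RunState.SUBMITTED:
        return state
    # If the pipeline reported an exit code, it decides the outcome.
    if exit_code is not None:
        return RunState.SUCCEEDED if exit_code == 0 else RunState.FAILED
    # Assumption: no exit code and no nodes left means the run can no
    # longer make progress, so it should not stay in Submitted.
    if pool_nodes == 0:
        return RunState.FAILED
    return state


if __name__ == "__main__":
    # The situation from this report: pool scaled to 0, no output received.
    print(reconcile(RunState.SUBMITTED, pool_nodes=0, exit_code=None).value)
```

A real check would also have to distinguish a pool that has not scaled up yet from one that has already scaled back down, since both have 0 nodes; this sketch does not attempt that.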