Commit 707b3cd

Add ALWAYS_CONTINUE (#169)
* Update run.py
* Update config.py
* Update cp-worker.py
* add JOB_RETRIES, docs

---------

Co-authored-by: ErinWeisbart <[email protected]>
1 parent f45fd58 commit 707b3cd

7 files changed: +35 -5 lines changed

config.py

Lines changed: 4 additions & 0 deletions
@@ -35,6 +35,7 @@
 SQS_QUEUE_NAME = APP_NAME + 'Queue'
 SQS_MESSAGE_VISIBILITY = 1*60 # Timeout (secs) for messages in flight (average time to be processed)
 SQS_DEAD_LETTER_QUEUE = 'user_DeadMessages'
+JOB_RETRIES = 3 # Number of times to retry a job before sending it to DEAD_LETTER_QUEUE
 
 # MONITORING
 AUTO_MONITOR = 'True'
@@ -49,6 +50,9 @@
 MIN_FILE_SIZE_BYTES = 1 #What is the minimal number of bytes an object should be to "count"?
 NECESSARY_STRING = '' #Is there any string that should be in the file name to "count"?
 
+# CELLPROFILER SETTINGS
+ALWAYS_CONTINUE = 'False' # Whether or not to run CellProfiler with the --always-continue flag, which will keep CellProfiler from crashing if it errors
+
 # PLUGINS
 USE_PLUGINS = 'False' # True to use any plugin from CellProfiler-plugins repo
 UPDATE_PLUGINS = 'False' # True to download updates from CellProfiler-plugins repo

documentation/DCP-documentation/SQS_QUEUE_information.md

Lines changed: 7 additions & 0 deletions
@@ -68,6 +68,13 @@ To confirm that multiple Dockers are never processing the same job, you can keep
 Once you have run a pipeline once, you can check the execution time (either by noticing how long after you started your jobs that your first jobs begin to finish, or by checking the logs of individual jobs and noting the start and end time), you will then have an accurate idea of roughly how long that pipeline needs to execute, and can set your message visibility accordingly.
 You can even do this on the fly while jobs are currently processing; the updated visibility time won’t affect the jobs already out for processing (i.e. if the time was set to 3 hours and you change it to 1 hour, the jobs already processing will remain hidden for 3 hours or until finished), but any job that begins processing AFTER the change will use the new visibility timeout setting.
 
+## JOB_RETRIES
+
+**JOB_RETRIES** is the number of times that a job will be retried before it is sent to the Dead Letter Queue.
+The count goes up every time a message is "In Flight". After the SQS_MESSAGE_VISIBILITY times out, if the count is too high the message will not be made "Available" but will instead go to your SQS_DEAD_LETTER_QUEUE.
+We recommend setting this larger than 1 because stochastic job failures are possible (e.g. the EC2 machine running the job becomes unavailable mid-run).
+Allowing large numbers of retries tends to waste compute, as most failure modes are not stochastic.
+
 ## Example SQS Queue
 
 [[images/Sample_SQS_Queue.png|alt="Sample_SQS_Queue"]]
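
The retry counting described in the JOB_RETRIES section can be sketched with a small simulation. This is a simplified model, not code from this commit: the function name `simulate_message` and the state labels `"deleted"` and `"dead_letter"` are hypothetical, and it assumes a worker deletes the message when a job succeeds and that a failed job simply lets the visibility timeout expire.

```python
def simulate_message(job_retries, succeeds_on_attempt=None):
    """Follow one message until it is deleted or dead-lettered.

    job_retries         -- stand-in for the JOB_RETRIES config value
    succeeds_on_attempt -- attempt number on which the job succeeds,
                           or None if it never succeeds (hypothetical knob)
    Returns (receive_count, final_state).
    """
    receive_count = 0
    while True:
        # A worker picks the message up: it is now "In Flight".
        receive_count += 1
        if succeeds_on_attempt is not None and receive_count == succeeds_on_attempt:
            # Assumption: a successful worker deletes the message.
            return receive_count, "deleted"
        # The job failed, so the visibility timeout expires.
        if receive_count >= job_retries:
            # Count is too high: the message goes to the dead letter queue.
            return receive_count, "dead_letter"
        # Otherwise the message becomes "Available" again and is retried.


if __name__ == "__main__":
    print(simulate_message(3))                          # a job that always fails
    print(simulate_message(3, succeeds_on_attempt=2))   # one stochastic failure
```

Under this model a job that fails once for a stochastic reason still completes when JOB_RETRIES is greater than 1, while a job that always fails costs exactly JOB_RETRIES attempts of compute.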

documentation/DCP-documentation/advanced_configuration.md

Lines changed: 2 additions & 2 deletions
@@ -11,8 +11,8 @@ Alternate locations can be designated in the run script.
 * **Log configuration and location of exported logs:** Distributed-CellProfiler creates log groups with a default retention of 60 days (to avoid hitting the AWS limit of 250) and after finishing the run exports them into your bucket with a prefix of 'exportedlogs/LOG_GROUP_NAME/'.
 These may be modified in the run script.
 * **Advanced EC2 configuration:** Any additional configuration of your EC2 spot fleet (such as installing additional packages or running scripts on startup) can be done by modifying the userData parameter in the run script.
-* **SQS queue detailed configuration:** Distributed-CellProfiler creates a queue where messages will be tried 10 times before being consigned to a DeadLetterQueue, and unprocessed messages will expire after 14 days (the AWS maximum).
-These values can be modified in run.py .
+* **SQS queue detailed configuration:** Distributed-CellProfiler creates a queue where unprocessed messages will expire after 14 days (the AWS maximum).
+This value can be modified in run.py .
 
 ***
 

documentation/DCP-documentation/config_examples.md

Lines changed: 2 additions & 0 deletions
@@ -53,13 +53,15 @@ Our internal configurations for each pipeline are as follows:
 | SQS_QUEUE_NAME | APP_NAME + 'Queue' | APP_NAME + 'Queue' | APP_NAME + 'Queue' | APP_NAME + 'Queue' | APP_NAME + 'Queue' | We never change this. |
 | SQS_MESSAGE_VISIBILITY | 3*60 | 240*60 | 15*60 | 10*60 | 120*60 | About how long you expect a job to take * 1.5 in seconds |
 | SQS_DEAD_LETTER_QUEUE | 'YOURNAME_DEADMESSAGES' | 'YOURNAME_DEADMESSAGES' | 'YOURNAME_DEADMESSAGES' | 'YOURNAME_DEADMESSAGES' |'YOURNAME_DEADMESSAGES' | |
+| JOB_RETRIES | 3 | 3 | 3 | 3 | 3 | |
 | AUTO_MONITOR | 'True' | 'True' | 'True' | 'True' | 'True' | Can be turned off if manually running Monitor. |
 | CREATE_DASHBOARD | 'True' | 'True' | 'True' | 'True' | 'True' | |
 | CLEAN_DASHBOARD | 'True' | 'True' | 'True' | 'True' | 'True' | |
 | CHECK_IF_DONE_BOOL | 'True' | 'True' | 'True' | 'True' | 'True' | Can be turned off if wanting to overwrite old data. |
 | EXPECTED_NUMBER_FILES | 1 (an image) | number channels + 1 (an .npy for each channel and isdone) | 3 (Experiment.csv, Image.csv, and isdone) | 1 (an image) | 5 (Experiment, Image, Cells, Nuclei, and Cytoplasm .csvs) | Better to underestimate than overestimate. |
 | MIN_FILE_SIZE_BYTES | 1 | 1 | 1 | 1 | 1 | Count files of any size. |
 | NECESSARY_STRING | '' | '' | '' | '' | '' | Not necessary for standard workflows. |
+| ALWAYS_CONTINUE | 'False' | 'False' | 'False' | 'False' | 'False' | Use with caution. |
 | USE_PLUGINS | 'False' | 'False' | 'False' | 'False' | 'False' | Not necessary for standard workflows. |
 | UPDATE_PLUGINS | 'False' | 'False' | 'False' | 'False' | 'False' | Not necessary for standard workflows. |
 | PLUGINS_COMMIT | '' | '' | '' | '' | '' | Not necessary for standard workflows. |

documentation/DCP-documentation/step_1_configuration.md

Lines changed: 10 additions & 0 deletions
@@ -73,6 +73,7 @@ We recommend setting this to slightly longer than the average amount of time it
 * **SQS_DEAD_LETTER_QUEUE:** The name of the queue to send jobs to if they fail to process correctly multiple times; this keeps a single bad job (such as one where a single file has been corrupted) from keeping your cluster active indefinitely.
 This queue will be automatically made if it doesn't exist already.
 See [Step 0: Prep](step_0_prep.med) for more information.
+* **JOB_RETRIES:** This is the number of times that a job will be retried before it is sent to the Dead Letter Queue.
 
 ***
 
@@ -109,6 +110,15 @@ Useful when trying to detect jobs that may have exported smaller corrupted files
 
 ***
 
+### CELLPROFILER SETTINGS
+* **ALWAYS_CONTINUE:** Whether or not to run CellProfiler with the --always-continue flag, which will keep CellProfiler from crashing if it errors.
+Use with caution.
+This can be particularly helpful in jobs where a large number of files are loaded in a single run (such as during illumination correction) so that a corrupted or missing file doesn't prevent the whole job from completing.
+However, this can make it harder to notice jobs that do not complete successfully, so it should be used with caution.
+We suggest using this setting in conjunction with a small number of JOB_RETRIES.
+
+***
+
 ### PLUGINS
 * **USE_PLUGINS:** Whether or not you will be using external plugins from the CellProfiler-plugins repository.
 * **UPDATE_PLUGINS:** Whether or not to update the plugins repository before use.

run.py

Lines changed: 4 additions & 3 deletions
@@ -18,6 +18,8 @@
 CREATE_DASHBOARD = 'False'
 CLEAN_DASHBOARD = 'False'
 AUTO_MONITOR = 'False'
+ALWAYS_CONTINUE = 'False'
+JOB_RETRIES = 10
 
 from config import *
 
@@ -125,6 +127,7 @@ def generate_task_definition(AWS_PROFILE):
             {"name": "USE_PLUGINS", "value": str(USE_PLUGINS)},
             {"name": "NECESSARY_STRING", "value": NECESSARY_STRING},
             {"name": "DOWNLOAD_FILES", "value": DOWNLOAD_FILES},
+            {"name": "ALWAYS_CONTINUE", "value": ALWAYS_CONTINUE},
         ]
     if SOURCE_BUCKET.lower()!='false':
         task_definition['containerDefinitions'][0]['environment'] += [
@@ -219,9 +222,7 @@ def get_or_create_queue(sqs):
             "MaximumMessageSize": "262144",
             "MessageRetentionPeriod": "1209600",
             "ReceiveMessageWaitTimeSeconds": "0",
-            "RedrivePolicy": '{"deadLetterTargetArn":"'
-            + dead_arn
-            + '","maxReceiveCount":"10"}',
+            "RedrivePolicy": f'{{"deadLetterTargetArn":"{dead_arn}","maxReceiveCount":"{str(JOB_RETRIES)}"}}',
             "VisibilityTimeout": str(SQS_MESSAGE_VISIBILITY),
         }
         sqs.create_queue(QueueName=SQS_QUEUE_NAME, Attributes=SQS_DEFINITION)
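
The RedrivePolicy attribute is a JSON document passed as a string, so an alternative to escaping braces in an f-string is to build it with `json.dumps`. The following is a minimal sketch, not part of this commit: the helper name `build_redrive_policy` and the example ARN are hypothetical, and `dead_arn` and `JOB_RETRIES` stand in for the values used in run.py.

```python
import json


def build_redrive_policy(dead_arn, job_retries):
    """Return the RedrivePolicy attribute as a JSON string.

    dead_arn    -- ARN of the dead letter queue
    job_retries -- value of JOB_RETRIES; sent as a string, as in run.py
    """
    return json.dumps(
        {
            "deadLetterTargetArn": dead_arn,
            "maxReceiveCount": str(job_retries),
        }
    )


if __name__ == "__main__":
    # Hypothetical ARN, used only for illustration.
    example_arn = "arn:aws:sqs:us-east-1:123456789012:user_DeadMessages"
    print(build_redrive_policy(example_arn, 3))
```

Both forms parse to the same JSON object; `json.dumps` additionally escapes any quotes or backslashes that might appear in the inputs.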

worker/cp-worker.py

Lines changed: 6 additions & 0 deletions
@@ -50,6 +50,10 @@
     DOWNLOAD_FILES = 'False'
 else:
     DOWNLOAD_FILES = os.environ['DOWNLOAD_FILES']
+if 'ALWAYS_CONTINUE' not in os.environ:
+    ALWAYS_CONTINUE = 'False'  # keep as a string: .lower() is called on it below
+else:
+    ALWAYS_CONTINUE = os.environ['ALWAYS_CONTINUE']
 
 localIn = '/home/ubuntu/local_input'
 
@@ -276,6 +280,8 @@ def runCellProfiler(message):
         printandlog("Didn't recognize input file",logger)
     if USE_PLUGINS.lower() == 'true':
         cmd += f' --plugins-directory={PLUGIN_DIR}'
+    if ALWAYS_CONTINUE.lower() == 'true':
+        cmd += ' --always-continue'
     print(f'Running {cmd}')
     logger.info(cmd)
 
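
Because the worker tests the setting with `.lower() == 'true'`, the value must be a string whether or not the environment variable is set. A minimal sketch of one way to guarantee that follows. The helpers `env_flag` and `add_optional_flags` and the bare `cellprofiler` base command are hypothetical and are not part of cp-worker.py; only the `ALWAYS_CONTINUE` variable and the `--always-continue` flag come from this commit.

```python
import os


def env_flag(name, default='False', environ=os.environ):
    """Read a 'True'/'False' style setting and return a real bool.

    environ.get always yields a string here (the default is a string too),
    so calling .lower() on the result is safe.
    """
    return environ.get(name, default).lower() == 'true'


def add_optional_flags(cmd, environ=os.environ):
    """Append optional CellProfiler flags to a command string."""
    if env_flag('ALWAYS_CONTINUE', environ=environ):
        cmd += ' --always-continue'
    return cmd


if __name__ == "__main__":
    # Placeholder base command; the real worker builds a longer one.
    print(add_optional_flags('cellprofiler', {'ALWAYS_CONTINUE': 'True'}))
    print(add_optional_flags('cellprofiler', {}))
```

Passing the environment in as a mapping keeps the helper easy to exercise without touching the real process environment.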
