Open
Description
Over the years, we have received a number of sporadic reports of @batch
jobs failing without anything on the Metaflow console but CloudWatch containing messages like:
| 2023-05-03T13:58:18.077-07:00 | Setting up task environment.
-- | -- | --
| 2023-05-03T13:58:24.321-07:00 | bash: line 1: [: -le: unary operator expected
| 2023-05-03T13:58:24.321-07:00 | bash: line 1: [: -gt: unary operator expected
| 2023-05-03T13:58:24.322-07:00 | tar: job.tar: Cannot open: No such file or directory
| 2023-05-03T13:58:24.323-07:00 | tar: Error is not recoverable: exiting now
| 2023-05-03T13:58:24.336-07:00 | /usr/local/bin/python: Error while finding module specification for 'metaflow.mflog.save_logs' (ModuleNotFoundError
This seems to happen if Metaflow fails to install its dependencies in the entrypoint (awscli
/ pip
etc), e.g. due to upstream package repos not being responsive. The issue typically fixes itself after a while.
We could provide a better error message at least