- Runs tasks (which are defined as shell scripts)
- Ensures only one instance of a task runs at a time (globally in the whole system), using file-based locks
- Tasks can run in parallel
- Output from parallel tasks is correctly annotated with the original task name
- Output lines are timestamped
- Parallelism can be nested
  - But task name annotations are flattened, i.e. are not added repeatedly at each level
- Output from each task is also collected in a separate file, to make debugging that particular task easier
- Tasks can have specified inputs
  - An input can be a file, an environment variable, or the result of an arbitrary command
- If a task is invoked again for the same input (on a given machine), it is not re-run
- Tasks can have specified outputs (files)
- Task outputs can be cached in an object store
- If a task is invoked with inputs for which it was already computed on CI, the result is fetched from the remote cache and the task is not re-run
- If a task is called multiple times in the same execution, it is only executed once
- When running on CI:
  - Task output (stdout and stderr) is uploaded to an object store after the task has finished
  - Task status is reported as a GitHub check for a commit
    - pending while it's running
    - success or failed when finished
  - The GitHub check details link points to the uploaded task output
- Unstable inputs (i.e. inputs that have changed during the job execution) are detected
  - TODO: do something about input-output files like package-lock.json
- Tasks can have command-line arguments
  - But some options are meant for the task runner, such as `-f`
- Can force re-execution of a specific task (excluding dependencies) with `-f`
  - Note: in the previous implementation, `-f` forced re-execution of all tasks. This seems less useful and will be moved under another option.
- A task can be cancelled using SIGINT or SIGTERM, and state is maintained appropriately
- Fast: if there's nothing to do, returns quickly (<1s, ideally <300ms)
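
The file-based locking from the feature list above could look roughly like the following. This is a minimal sketch in Python, not the task runner's actual implementation: the helper name `job_lock` and the choice of `fcntl.flock` are assumptions, and only the `locks/${jobName}.lock` naming is taken from this document.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def job_lock(state_dir, job_name):
    """Hold an exclusive lock on <state_dir>/locks/<job_name>.lock.

    Hypothetical helper: the real task runner may lock differently.
    """
    locks_dir = os.path.join(state_dir, "locks")
    os.makedirs(locks_dir, exist_ok=True)
    path = os.path.join(locks_dir, f"{job_name}.lock")
    with open(path, "w") as lock_file:
        # Blocks until no other open file description holds the lock.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            yield path
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```

A job body would then run inside `with job_lock("/tmp/taskrunner", "build"): ...`, so a second invocation of the same job waits until the first one releases the lock.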
When migrating from another system behind a flag, it is sometimes desirable to build on the old system but still fill the remote cache on the new one. For that occasion, there is a special "prime cache mode".

It modifies the behavior in the following way:
- `snapshot` never downloads the remote cache (incl. fuzzy), to avoid overwriting outputs (which we assume are already built via another mechanism)
- `snapshot` always skips the job
- the remote cache is uploaded, despite the job being skipped

To use it, first build using another system, and then run `taskrunner` with `TASKRUNNER_PRIME_CACHE_MODE=1`.
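
As an illustration of how prime cache mode changes the decisions described above, here is a small Python sketch. The names `SnapshotPlan` and `plan_snapshot` are hypothetical, and the behavior outside prime cache mode (skip on local hash match, download on remote hit, upload after a run) is a simplified assumption rather than a description of the real implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SnapshotPlan:
    download_remote_cache: bool
    run_job: bool
    upload_remote_cache: bool


def plan_snapshot(env, local_hash_matches, remote_cache_hit):
    """Decide what a snapshot step should do (hypothetical sketch)."""
    # Assumption: the mode counts as enabled only when the variable is "1".
    if env.get("TASKRUNNER_PRIME_CACHE_MODE") == "1":
        # Prime cache mode: never download, always skip the job, still upload.
        return SnapshotPlan(False, False, True)
    if local_hash_matches:
        # Already done on this machine: nothing to do.
        return SnapshotPlan(False, False, False)
    if remote_cache_hit:
        # Restore from the remote cache instead of running.
        return SnapshotPlan(True, False, False)
    # Cache miss: run the job, then upload (assumed normal-mode behavior).
    return SnapshotPlan(False, True, True)
```

Passing `os.environ` as `env` would make the sketch follow the real environment; a plain dict keeps it easy to test.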
- `$TASKRUNNER_STATE_DIRECTORY` (default: `/tmp/taskrunner`)
  - `locks` - global locks per job
    - `${jobName}.lock` - job lock file, job takes the lock when running
  - `hash` - hashes of inputs of already-done jobs
    - `${jobName}.hash` - first line is hash, rest is hash input (for debugging)
  - `builds` - build state directory for each build. Each toplevel invocation creates a subdirectory here.
    - `${buildId}` - state dir of a specific build. `buildId` is derived from the invocation time.
      - `logs` - logs produced by jobs in that build.
        - `${jobName}.log` - log, without ANSI sequences stripped
      - `results` - per-build cache of job results (we don't re-run jobs twice inside a build, even without `snapshot`)
        - `${jobName}` - file with status code of the job
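
The `hash/${jobName}.hash` format described above (first line is the hash, the rest is the hash input) can be sketched as follows. This is an illustrative Python version only: the function names are hypothetical, and SHA-256 is an assumption, since this document does not say which hash function is used.

```python
import hashlib
import os


def hash_file_path(state_dir, job_name):
    # Layout taken from the state directory description: hash/<jobName>.hash
    return os.path.join(state_dir, "hash", f"{job_name}.hash")


def compute_hash(hash_input):
    # Assumption: SHA-256 over the UTF-8 encoded hash input.
    return hashlib.sha256(hash_input.encode("utf-8")).hexdigest()


def is_up_to_date(state_dir, job_name, hash_input):
    """Return True if the saved hash (first line) matches the current input."""
    try:
        with open(hash_file_path(state_dir, job_name), encoding="utf-8") as f:
            saved = f.readline().rstrip("\n")
    except FileNotFoundError:
        return False
    return saved == compute_hash(hash_input)


def save_hash(state_dir, job_name, hash_input):
    """Write the hash on the first line, followed by the raw hash input."""
    path = hash_file_path(state_dir, job_name)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        f.write(compute_hash(hash_input) + "\n")
        f.write(hash_input)
```

Keeping the raw hash input after the first line makes it possible to inspect by hand why a job was considered changed.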
- `TASKRUNNER_DEBUG` - whether to output debug messages to toplevel output. Note that debug messages are always written to per-task logs, regardless of this setting.
- `TASKRUNNER_LOG_INFO` - whether to output "info" messages to toplevel output. They are minimal messages, produced only when there's actually something to be done (including fetching from cache).
- more...
- Support stdin? For now redirected from `/dev/null`
- Quiet output like `gradle`, only report what is running and progress, not full output, and no output if nothing to do
- Marker files - additional hash file in `.stack-work`, `node_modules` etc., so that if that dir is cleared, we redo the action
  - Or: remember hash of some of the output files and check they're still there
    - Only some because there can be benign changes
- Can dump enough info to reproduce failures
  - For example: hashes of inputs, caches etc.
- Generate a trace (OTLP for analysis, or render to a Gantt chart)
- Less confusing output for cache miss (no "error")
- ??? Something, can't recall now
- `--cmd` replaced with `--raw`, since we can't really execute in the context of the original script
- Task leaks stdin/out/err handle - have a timeout on draining output
- Parallel task failed and we're killed - report status correctly
- `snapshot` - how to communicate with controller process?
  - pipe and pass fd to child process?
  - named pipe and pass name to child process via env?
- Nested tasks - each should write to original stdout
- Unmerged files when hashing
- bad usage of `snapshot`
  - e.g. called twice
- why `ls-tree -r` is needed
- git option of quoting
- save cache tar error
- Better output of error messages (to normal streams)
- String/Text unification
- Debugging - show hash input
- More specialized tests for input handling
- In parent task's log, add reference to nested log file
- Debugging aid: when replacing saved hash, show diff between old and new hash input (or save old hash input to compare)
- Bug: pendolino sometimes rebuilds randomly with scripts/UPDATE
  - Probable cause: helper generation races with its input hashing
    - nope, it generates in another directory
- More tests for interaction between remote cache and local hash, especially:
  - restoring remote cache should also store local hash, but not store remote cache again
- test for root dir != cwd
- test for commit status
- Somehow test content-type in log upload?
- "quiet" operation - no output except when an error happens
- standard operation ("info" mode):
  - when job does nothing (already done locally), no output
  - when resuming from cache, output one line for start (so that we know something's happening), one for done
  - when running, output one line for start, one for done - only for snapshottable jobs
- debug: log everything (maybe later categorize)
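
The three output modes listed above can be sketched as a simple level check. This Python sketch is illustrative only: the names `Level`, `toplevel_level`, `should_print`, and `info_messages` are hypothetical, the message texts are made up, and treating the value `"1"` as "enabled" with quiet as the default is an assumption about how the environment variables are read.

```python
from enum import IntEnum


class Level(IntEnum):
    ERROR = 0  # "quiet": only errors reach toplevel output
    INFO = 1   # minimal start/done lines when something is actually done
    DEBUG = 2  # everything


def toplevel_level(env):
    """Pick the most verbose level enabled in the environment (assumed parsing)."""
    if env.get("TASKRUNNER_DEBUG") == "1":
        return Level.DEBUG
    if env.get("TASKRUNNER_LOG_INFO") == "1":
        return Level.INFO
    return Level.ERROR


def should_print(configured, message_level):
    # A message is shown when it is at or below the configured verbosity.
    return message_level <= configured


def info_messages(job_name, outcome):
    """Info-level lines for one job; outcome is 'already-done', 'cache' or 'run'."""
    if outcome == "already-done":
        return []  # nothing to do locally: stay silent
    verb = "restoring from cache" if outcome == "cache" else "running"
    return [f"{job_name}: {verb}", f"{job_name}: done"]
```

Per-task logs would bypass `should_print` entirely, since debug messages are always written there regardless of the toplevel setting.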
- Previous impl no-op tests/scripts/UPDATE: ~1.6s
- Current impl no-op tests/scripts/UPDATE: ~2.3s