Why is workload_requires needed? #182

solo-driven · 2024-07-15T16:30:14Z

Question

ALL EXAMPLES ARE RUN LOCALLY

In the htcondor example (https://github.com/riga/law/blob/master/examples/sequential_htcondor_at_cern/analysis/tasks.py), workflow_requires is used, and it results in the following graph:

Scheduled 45 tasks of which:

45 ran successfully:
- 32 CreateChars(...)
- 1 CreateFullAlphabet(...)
- 12 CreatePartialAlphabet(...)

But when I simply comment it out, I get the following, clearer graph:

Scheduled 39 tasks of which:

39 ran successfully:
- 26 CreateChars(...)
- 1 CreateFullAlphabet(...)
- 12 CreatePartialAlphabet(...)

Does this change anything, other than the number of tasks decreasing from 45 to 39 when workflow_requires is not provided?

In addition, it is also possible, by changing requires and run to:

def requires(self):
    # require CreateChars for each index referred to by the branch_data of _this_ instance

    return CreateChars.req(self, branches=self.branch_data, branch=-1)


def run(self):
    # gather characters and save them
    alphabet = ""
    for inp in self.input()['collection'].targets.values():
        alphabet += inp.load()["char"]
  ....

to obtain the following graph:

And finally the result which I was expecting to see:

can be achieved by changing CreateFullAlphabet:

def requires(self):
    return CreatePartialAlphabet.req(self)


def run(self):
    # loop over all targets holding partial alphabet fractions and concat them
    inputs = self.input()["collection"].targets
    parts = [
        inp.load().strip()
        for inp in inputs.values()
    ]
    alphabet = "-".join(parts)

I would really appreciate it if you could help me with this. I struggled a lot trying to find the reason for workflow_requires. Thank you for reading.

solo-driven · 2024-07-16T08:13:53Z

*Updated the URL to the example

riga · 2024-07-19T07:14:42Z

Hi @solo-driven ,

in general, workflow_requires() is meant to define the requirements of a workflow itself. These requirements are resolved before any of the actual (branch) tasks run.

To understand this concept, one should distinguish between local and remote workflows (those that can submit jobs to (e.g.) batch systems), that work slightly differently in the way they initiate their branch tasks. For this, it is imperative to differentiate between the run() method you define on task level (belonging to the branch task), and the run() method of the workflow (encapsulated by the so-called workflow_proxy in the background).

Remote workflows have a run() implementation that sends jobs to batch systems. Each job then executes one or more law tasks with the exact command you used to start the workflow, with the addition of the corresponding --branch N parameter(s).

Usually, before jobs can be submitted, one needs to make sure that certain conditions are met, e.g., that certain software is pre-bundled and provided to the batch system (for those that need that). This is exactly where the workflow_requires() method is important. These conditions can be modeled with tasks (in the example above, it could be a task UploadSoftware), which one would typically want to declare as dependencies. However, they are dependencies of the workflow, not of each individual branch task.

Local workflows often don't need these extra dependencies that ensure that branch tasks can be run, since you're already in the correct environment. However, you are free to declare them regardless if it fits your use case. There is even a parameter predefined on all workflows, --pilot, whose value you can use in your implementation of workflow_requires() to dynamically add or remove certain workflow requirements. But again, it's fully up to you if you make use of that.

Side note: have a look at how local workflows trigger their branch tasks. There are two options: declare as dependency, or yield as dynamic dependencies (which is a luigi pattern).

That being said, all your example cases are valid and the actual decision of what you declare as a workflow requirement is a design choice you are free to make.
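The distinction described above, between requirements of the workflow and requirements of each branch task, can be illustrated with a small toy model. This is a plain-Python sketch that does not use law or luigi and does not reflect how law's scheduler is implemented. The class ToyWorkflow and its fields are hypothetical names made up for this illustration; only the task names come from this thread.

```python
# Toy model only: plain Python, NOT law's or luigi's actual machinery.
# ToyWorkflow, workflow_reqs and branch_reqs are hypothetical names.
from dataclasses import dataclass, field


@dataclass
class ToyWorkflow:
    name: str
    branches: list                                      # branch indices, e.g. [0, 1]
    workflow_reqs: list = field(default_factory=list)   # resolved once per workflow
    branch_reqs: list = field(default_factory=list)     # resolved once per branch

    def run(self, log):
        # 1. workflow-level requirements are handled first, exactly once
        for req in self.workflow_reqs:
            log.append(f"{self.name}: workflow requirement {req}")
        # 2. only then are the branch tasks triggered, each with its own requirements
        for b in self.branches:
            for req in self.branch_reqs:
                log.append(f"{self.name}[branch={b}]: requirement {req}")
            log.append(f"{self.name}[branch={b}]: run")
        return log


log = ToyWorkflow(
    name="CreatePartialAlphabet",
    branches=[0, 1],
    workflow_reqs=["UploadSoftware"],  # the hypothetical task mentioned above
    branch_reqs=["CreateChars"],
).run([])
print("\n".join(log))
```

In this sketch the workflow requirement shows up once, before any branch entry, while the branch requirement is repeated for every branch.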

solo-driven · 2024-07-22T15:16:10Z

But why did you use workflow_requires for branch manipulation in that example? As you said, it is for controlling the dependencies of the whole workflow, like setting up an environment. (I read your last comment, so it is probably not the best example for it?)

Also, I noticed that controlling the branches parameter of a dependent workflow is only possible in workflow_requires and not in requires. Can you explain why?

riga · 2024-07-22T17:56:34Z

(I read your last comment, so it is probably not the best example for it?)

Yeah, it probably is not a good example. The linked task is the proxy that lives underneath the workflow and that implements the actual run(), requires() and output() methods that take effect in case a task is a workflow (branch == -1).

Also, I noticed that controlling the branches parameter of a dependent workflow is only possible in workflow_requires and not in requires. Can you explain why?

The branches (plural) parameter is only a feature of the workflow itself. For specific branch tasks, setting this value has no meaning (since a branch does not have branches of its own).

solo-driven · 2024-07-23T09:19:01Z

One last question.
Are there any performance differences depending on how I "build the dependency tree", as in the examples above? In the end we get ~30 tasks, which will be distributed across workers, right?

And when I specify branches, for instance 1:5, will that workflow count as a single task, or will all the branches be distributed among workers? If the former is true, then there should really be no difference at all in how we build the tree.

riga · 2024-07-23T11:00:56Z

The workflow itself will count as a single yet separate task in the tree whose only "payload" is to trigger its branch tasks (either via static or dynamic requirements). All branch tasks will be distributed across --workers in any case, so there shouldn't be any performance difference (except for a very small one during tree building at the very beginning).

solo-driven added the question label Jul 15, 2024

solo-driven assigned riga Jul 15, 2024

riga pinned this issue Jul 19, 2024

riga closed this as completed Jul 30, 2024
