Why is workload_requires needed? #182

Closed
solo-driven opened this issue Jul 15, 2024 · 6 comments

@solo-driven

solo-driven commented Jul 15, 2024

Question

ALL EXAMPLES ARE RUN LOCALLY

In the HTCondor example (https://github.com/riga/law/blob/master/examples/sequential_htcondor_at_cern/analysis/tasks.py), workflow_requires() is used, which results in the following graph:
image
Scheduled 45 tasks of which:

  • 45 ran successfully:
    • 32 CreateChars(...)
    • 1 CreateFullAlphabet(...)
    • 12 CreatePartialAlphabet(...)

But when I just comment it out, I get the following clearer graph:
image

Scheduled 39 tasks of which:

  • 39 ran successfully:
    • 26 CreateChars(...)
    • 1 CreateFullAlphabet(...)
    • 12 CreatePartialAlphabet(...)

Does this change anything, other than the number of tasks decreasing from 45 to 39 when workflow_requires() is not provided?

In addition, it is also possible to change requires() and run() to:

def requires(self):
    # require CreateChars for each index referred to by the branch_data of _this_ instance

    return CreateChars.req(self, branches=self.branch_data, branch=-1)


def run(self):
    # gather characters and save them
    alphabet = ""
    for inp in self.input()['collection'].targets.values():
        alphabet += inp.load()["char"]
  ....

to obtain the following graph:
image

And finally, the result I was expecting to see:
image

can be achieved by changing CreateFullAlphabet:

def requires(self):
    return CreatePartialAlphabet.req(self)


def run(self):
    # loop over all targets holding partial alphabet fractions and concat them
    inputs = self.input()["collection"].targets
    parts = [
        inp.load().strip()
        for inp in inputs.values()
    ]
    alphabet = "-".join(parts)

I would really appreciate it if you could help me with this. I struggled a lot trying to find the reason for workflow_requires(). Thank you for reading.

@solo-driven
Author

Updated the URL to the example.

@riga
Owner

riga commented Jul 19, 2024

Hi @solo-driven,

In general, workflow_requires() is meant to define the requirements of a workflow itself. These requirements are resolved before any of the actual (branch) tasks run.

To understand this concept, one should distinguish between local and remote workflows (the latter can submit jobs to, e.g., batch systems), which differ slightly in how they initiate their branch tasks. For this, it is essential to differentiate between the run() method you define at task level (belonging to the branch task) and the run() method of the workflow (encapsulated by the so-called workflow_proxy in the background).

Remote workflows have a run() implementation that sends jobs to batch systems. Each job then executes one or more law tasks with the exact command you used to start the workflow - with the addition of the corresponding --branch N parameter(s).
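
As a rough illustration of the per-branch commands described above, the following plain-Python sketch appends a `--branch N` argument to the original command once per branch. The helper `build_branch_commands` and the example command line (including `--version v1`) are hypothetical and not part of law; the job scripts law actually generates look different and may bundle several branches into one job.

```python
import shlex


def build_branch_commands(base_command, branches):
    """Return one shell command per branch by appending ``--branch N``.

    Hypothetical helper for illustration only, not part of law.
    """
    base_args = shlex.split(base_command)
    return [shlex.join(base_args + ["--branch", str(b)]) for b in branches]


# assumed command line that started the workflow
commands = build_branch_commands("law run CreateChars --version v1", [0, 1, 2])
for cmd in commands:
    print(cmd)
```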

Usually, before jobs can be submitted, one needs to make sure that certain conditions are met, e.g., that certain software is pre-bundled and provided to the batch system (for those that need that). This is exactly where the workflow_requires() method is important. These conditions can be modeled with tasks (in the example above, it could be a task UploadSoftware), and one would typically want to declare them as a dependency. However, this is a dependency of the workflow, not of each individual branch task.
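
A toy model of that ordering, in plain Python rather than law's actual API: workflow-level requirements are resolved once on the submission side, and only then is one job submitted per branch. The function `run_remote_workflow` and the log strings are assumptions made for this sketch.

```python
def run_remote_workflow(workflow_requirements, branches):
    """Toy model: resolve workflow requirements once, then submit branch jobs."""
    log = []
    # submission side: every workflow-level requirement is resolved exactly once
    for req in workflow_requirements:
        log.append(f"resolve {req}")
    # only afterwards is a job submitted for each branch
    for b in branches:
        log.append(f"submit job for branch {b}")
    return log


log = run_remote_workflow(["UploadSoftware"], [0, 1, 2])
for entry in log:
    print(entry)
```

In this model UploadSoftware shows up once, no matter how many branches the workflow has.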

Local workflows often don't need these extra dependencies that ensure that branch tasks can be run, since you're already in the correct environment. However, you are free to declare them regardless if it fits your use case. There is even a parameter predefined on all workflows, --pilot, whose value you can use in your implementation of workflow_requires() to dynamically add or remove certain workflow requirements. But again, it's fully up to you if you make use of that.
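
The pilot switch could be sketched as follows. This is a stand-in written in plain Python: in a real law task the flag would be read as `self.pilot` inside `workflow_requires()`, whereas here it is a plain function argument, and the requirement names are placeholder strings chosen for this example.

```python
def workflow_requires(pilot):
    """Toy model of a workflow_requires() that reacts to the pilot flag."""
    reqs = {}
    # assumed requirement that is always needed before branches can run
    reqs["software"] = "UploadSoftware"
    # assumed requirement that is skipped when running in pilot mode
    if not pilot:
        reqs["chars"] = "CreateChars"
    return reqs


print(workflow_requires(pilot=False))
print(workflow_requires(pilot=True))
```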

Side note: Have a look at how local workflows trigger their branch tasks. There are two options: declare them as dependencies, or yield them as dynamic dependencies (which is a luigi pattern).


That being said, all your example cases are valid and the actual decision of what you declare as a workflow requirement is a design choice you are free to make.

riga pinned this issue Jul 19, 2024
@solo-driven
Author

But why did you use workflow_requires() for manipulating branches in that example? As you said, it is meant for controlling the dependencies of the whole workflow, like setting up an environment. (I read your last comment, so it is probably not the best example for it?)

Also, I noticed that controlling the branches parameter of a dependent workflow is only possible in workflow_requires() and not in requires(). Can you explain why?

@riga
Owner

riga commented Jul 22, 2024

> (I read your last comment, so it is probably not the best example for it?)

Yeah, it probably is not a good example. The linked task is the proxy that lives underneath the workflow and that implements the actual run(), requires() and output() methods that take effect in case a task is a workflow (branch == -1).

> Also, I noticed that controlling the branches parameter of a dependent workflow is only possible in workflow_requires() and not in requires(). Can you explain why?

The branches (plural) parameter is only a feature of the workflow itself. For specific branch tasks, setting this value has no meaning (since a branch does not have branches of its own).

@solo-driven
Author

One last question.
Are there any performance differences depending on how I "build the dependency tree", as in the examples above? In the end we get ~30 tasks, which will be distributed among the workers, right?

And when I specify branches, for instance 1:5, will that workflow count as a single task, or will all its branches be distributed among the workers? If the former is true, then there should really be no difference in how we build the tree.

@riga
Owner

riga commented Jul 23, 2024

The workflow itself will count as a single yet separate task in the tree whose only "payload" is to trigger its branch tasks (either via static or dynamic requirements). All branch tasks will be distributed across --workers in any case, so there shouldn't be any performance difference (except for a very small one during tree building at the very beginning).
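
A simplified picture of this, using a thread pool as a stand-in for the workers: the workflow does no work of its own and only hands its branches to the pool. This is a sketch under stated assumptions and not how luigi or law schedule tasks internally; `run_branch`, `run_workflow`, and the five branch numbers are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def run_branch(branch):
    # stand-in for the payload of a single branch task
    return f"branch {branch} done"


def run_workflow(branches, workers):
    # the workflow has no payload of its own; it only triggers its branch
    # tasks, which are then spread across the available workers
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_branch, branches))


# five hypothetical branches processed by two workers
results = run_workflow(range(1, 6), workers=2)
print(results)
```

How the branches were attached to the tree does not matter in this model; each branch is still an independent unit of work for the pool.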

riga closed this as completed Jul 30, 2024