Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Install subworkflows with modules from different remotes #3083

Open
wants to merge 71 commits into
base: dev
Choose a base branch
from

Conversation

jvfe
Copy link
Member

@jvfe jvfe commented Jul 29, 2024

PR checklist

  • This comment contains a description of changes (with reason)
  • CHANGELOG.md is updated
  • If you've fixed a bug or added code that should be tested, add tests!
  • Documentation in docs is updated

Addresses #1927 and #2497

Issue information

Currently, get_components_to_install is the function used to download a subworkflow's modules. It relies on simple text parsing to define where to get a subworkflow/module from and install it.

As outlined in the above issues (and in slack conversations), it is currently not possible to install subworkflows which depend on modules that originate from different remotes. The solution is to copy over required modules to a same remote in order to use them.

Implementation

At this moment the implementation is mostly a work-around/hack. I aimed to make the least possible modifications to the source code in order to make the intended behavior functional. Here's how it works:

  • It still uses the same text parsing for all modules and subworkflows that don't have a meta.yml file defined.

  • If a subworkflow defines a meta.yml, they can optionally define the git_remote and org_path keys for a component. This is based on a suggestion given on the original issue. E.g.:

components:
  - name: gatk4/mutect2
  - name: gatk4/genomicsdbimport
    org_path: custom_remote
    git_remote: https://github.com/custom_remote/modules.git

This info is used to grab the respective modules from the specified remote.
I've setup a dummy repository with a subworkflow that showcases this. If no remote is specified, it assumes the module is from the same remote as the subworkflow (as it is currently).

Trying to run nf-core subworkflows --git-remote https://github.com/jvfe/test-subworkflow-remote install get_genome_annotation with this branch of nf-core/tools should work. There's a basic test I've setup to test for this.

Known issues

All in all, I'm eager to hear what everyone thinks. I'm tagging @muffato who has been supervising and helping me in this implementation.

@jvfe jvfe changed the base branch from master to dev July 29, 2024 11:33
fix: Avoid duplicate downloads in nested subworkflows
* dev: (299 commits)
  Update nf_core/pipelines/create/utils.py
  [automated] Update CHANGELOG.md
  allow numbers in custom pipeline name
  [automated] Update CHANGELOG.md
  Update pre-commit hook pre-commit/mirrors-mypy to v1.11.1
  Fixed linting
  Updated changelog
  Added process_high_memory to create/lint list
  update chagelog
  use pathlib instead of os.path
  Apply suggestions from code review
  add option to skip code linters
  Update gitpod/workspace-base Docker digest to f189a41
  Update pre-commit hook pre-commit/mirrors-mypy to v1.11.0
  [automated] Update CHANGELOG.md
  Update python:3.12-slim Docker digest to 740d94a
  Update .pre-commit-config.yaml
  Update nf_core/__main__.py
  Add default version
  Automatically add a default version in pipelines_create
  ...
@nf-core nf-core deleted a comment from github-actions bot Aug 5, 2024
@awgymer
Copy link
Contributor

awgymer commented Aug 7, 2024

I'm very wary of increasing the idiosyncrasies in the code and I feel that making it such that the code returns variably either strings or dicts and thus necessitates isinstance checks is liable to be difficult to follow/trace through the code in future.

I understand the desire to simply get something working but I think that if we are going to implement this and support it we need to do it right.

I think this also indicates support only for cross-organisational modules and not cross-organisational subworkflows?

@muffato
Copy link
Member

muffato commented Aug 8, 2024

I think this also indicates support only for cross-organisational modules and not cross-organisational subworkflows?

Currently, yes

I'm very wary of increasing the idiosyncrasies in the code and I feel that making it such that the code returns variably either strings or dicts and thus necessitates isinstance checks is liable to be difficult to follow/trace through the code in future.

Once get_components_to_install is updated to return dictionaries for sub-workflows too, all the isinstance(..., dict) will evaluate to true, and all their if statements will be simplified / collapse.

I understand the desire to simply get something working but I think that if we are going to implement this and support it we need to do it right.

It's a question of time management (@jvfe is working part time on this project). We're not sure about about the handling of the default org vs the module org in nf_core/components/install.py (cf the creation of ModulesRepo instances). We didn't want to invest deeper without clearing that first. Making the change in get_components_to_install is easy, but finding the other files that need an update and making sure that all the tests still pass takes a bit longer.

Copy link
Member

@mirpedrol mirpedrol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your contribution @jvfe!
I think this is a good start 😄 I agree with @awgymer that allowing component to be both a string or a dict is confusing and it will cause problems. But I like the idea of using the information in the meta.yml to obtain the remote form which the component should be installed.
My suggestion would be to convert component to a dictionary in all cases since the beginning, with the default being the current remote.

nf_core/components/components_utils.py Outdated Show resolved Hide resolved
nf_core/components/components_utils.py Outdated Show resolved Hide resolved
nf_core/components/install.py Outdated Show resolved Hide resolved
nf_core/components/install.py Outdated Show resolved Hide resolved
@@ -16,6 +16,7 @@
OLD_TRIMGALORE_SHA = "9b7a3bdefeaad5d42324aa7dd50f87bea1b04386"
OLD_TRIMGALORE_BRANCH = "mimic-old-trimgalore"
GITLAB_URL = "https://gitlab.com/nf-core/modules-test.git"
CROSS_ORGANIZATION_URL = "https://github.com/jvfe/test-subworkflow-remote.git"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use the nf-core gitlab account for this test to avoid using external repos. I can give you access if you think it's a good idea.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure on this. I don't recommend using two remotes with the same org path as this can create major foot-guns (tools will just overwrite the ones in place and there can be weirdness in the modules json).

Might be OK for testing if you are careful but in general it is not something people should try to do.

Copy link
Member Author

@jvfe jvfe Aug 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do agree that using a different org_path is better for testing purposes. Although we don't have to keep it in my personal remote either, I was planning to make that repo read-only as soon as this feature is done, to ensure future stability for the test.

I don't know if it's simple to setup on your side, but maybe something like "nf_core_testing" could be a good alternative org_path? It's different enough but still under nf-core, instead of a personal user.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, I will have a loot at the options we have to create another nf-core owned organisation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/nf-core-test/modules I have added you both to the new organisation

* dev: (35 commits)
  don't remove creating a pipeline to test it 😶
  update textual snapshots
  update textual snapshots
  fix indentation
  Update nf_core/pipelines/create/create.py
  [automated] Update CHANGELOG.md
  [automated] Update CHANGELOG.md
  add option to exclude changelog from custom pipeline template
  add option to exclude codespaces from pipeline template
  remove multiqc images from the template docs
  and the same for subworkflows
  run tests also after pytest-migrate
  update textual snapshots
  add multiqc potion to test data
  update textual snapshots
  fix tests
  [automated] Update CHANGELOG.md
  add option to exclude multiqc from pipeline template
  rename regenerae-snapshots to update-textual-snapshots
  Apply suggestions from code review
  ...
jvfe and others added 10 commits August 21, 2024 13:00
…ge_get

Use the power of get to skip if tests
Co-authored-by: Matthieu Muffato <[email protected]>
refact: Change function structure to use dicts not lists
* upstream/dev: (45 commits)
  Apply suggestions from code review
  update textual snapshot
  add option to exclude adaptivecard and slackreport from pipeline template
  update pytest test_make_pipeline_schema
  [automated] Update CHANGELOG.md
  add option to exclude email from pipeline template
  update textual snapshots
  Update nf_core/pipelines/create/templatefeatures.yml
  [automated] Update CHANGELOG.md
  add option to exclude license from pipeline template
  [automated] Update CHANGELOG.md
  Update python:3.12-slim Docker digest to 59c7332
  [automated] Update CHANGELOG.md
  Update pre-commit hook astral-sh/ruff-pre-commit to v0.6.0
  Ran pre commit
  add matthias' suggestion woops
  Overwrite the verbose in ctx in case the user uses it
  Overwrite the verbose in ctx in case the user uses it
  Add a verbose click option
  run prettier on the pipeline output directory
  ...
* upstream/dev: (120 commits)
  markdown header
  add .gitignore to base required files
  update textual snapshots
  don't remove .gitignore when unselecting github
  update textual snapshots
  some more minimal template fixes
  udpate changelog
  handle cases where the directory path contains the name of the component
  switch to os.walk to support <3.10 versions of Python
  fix pipeline linting when skipping nf_schema
  update test to use template_features.yml and add missing files to a feature group
  rename templatefeatures.yml to template_features.yml
  update changelog
  fix linting
  add tests to ensure all files are part of a template customisation group
  add process cpus, memory and time to nextflow.config if we remove base.config
  update textual snapshots
  Revert "don't remove base.config when removing nf-core components"
  don't remove base.config when removing nf-core components
  apply review comments from @mashehu
  ...
@jvfe jvfe marked this pull request as ready for review September 2, 2024 18:28
Copy link
Contributor

@awgymer awgymer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In terms of the code implemented here this looks good.

I am unclear about the impact on the following commands:

  • subworkflows remove
  • subworkflows update
  • subworkflows lint

I also still have concerns about this particular style of cross-repo (or indeed single-repo) subworkflow (although I am not intending to reignite a debate over the alternative model I support in this MR!) and at the present stage I wouldn't recommend this as a true solution to anyone wanting cross-repo subworkflows because I'm unclear on what level of commitment to maintain and work around issues this introduces there is from an nf-core standpoint. (This isn't so much something for the authors of this PR to address but rather the wider nf-core team)

Comment on lines 52 to 53
def install(self, component: str, silent: bool = False) -> bool:
if isinstance(component, dict):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the isinstance check is needed (and indeed even if it isn't - because this is always a dict now) then the type hint in the method signature is incorrect.

Copy link
Member Author

@jvfe jvfe Sep 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I could see, this is the single isinstance check that was actually needed. There are some places in the codebase where this function gets called with a string input, when removing the isinstance check I noticed quite a few test failures. So, to aim for minimal impact and changes to the codebase and current implementations, I've opted to change the argument type hint, as you suggested.

nf_core/components/components_utils.py Outdated Show resolved Hide resolved
* upstream/dev:
  Move if statements to top of YAML blocks
  Don't look for dev, look for not master
  Remove tests involving environment.yml with 'name'
  Remove name from conda enviroment.yml in module template
  Don't test conda `environment.yml` `name` attribute (which should no longer be there)
  Update CHANGELOG.md
  Add --update-all flag to remove defaults channels dependents
  add a tabix tabix test
  pre commit
  fix issue where tool and subtool have the same name
  udpate changelog
  run nf-core lint --release on PRs to master
refact: Add suggestions from second review
@jvfe
Copy link
Member Author

jvfe commented Sep 10, 2024

In terms of the code implemented here this looks good.

I am unclear about the impact on the following commands:

* `subworkflows remove`

* `subworkflows update`

* `subworkflows lint`

I also still have concerns about this particular style of cross-repo (or indeed single-repo) subworkflow (although I am not intending to reignite a debate over the alternative model I support in this MR!) and at the present stage I wouldn't recommend this as a true solution to anyone wanting cross-repo subworkflows because I'm unclear on what level of commitment to maintain and work around issues this introduces there is from an nf-core standpoint. (This isn't so much something for the authors of this PR to address but rather the wider nf-core team)

From what I could test (and the unit tests themselves), there's no discernible impact to those commands. Also, I do understand your concerns and agree that there could be a more robust reworking of how cross-repo subworkflows could work (and how single-repo subworkflows are implemented themselves!). In this PR, though, I aimed for the simplest solution, one with minimal impact to the codebase and that could serve as a starting point for people looking to build these components across orgs (as is the case in Sanger Tree of Life), in a way that no changes are required for the way subworkflows are currently implemented. Thank you for your input and your review!

Copy link
Member

@mirpedrol mirpedrol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see a linting test failing, we should update the test here to add a valid fake version.

This is looking good! But it would be good to add a couple more test cases:

  • Installing a subworkflow which installs a module from another repo which is also installed from nf-core, for example fastqc which is already in the pipeline template.
  • Removing nf-core fastqc to make sure the fastqc module from the other repo is kept.
  • Installing the subworkflow on an old version (sha) and updating it, make sure that all modules from different repos are updated.

And if you can think of other situations that I forgot, let's add them too 🙂

In addition to this PR we will need to add documentation on how to use this. Specifying the format of meta.yml.
We should also modify the JSON schema that we use to lint subworkflows, as the elements of components are now required to be strings, but I think this is not something we should add to the nf-core/modules repo, it should work if the non-nf-core-repo add a JSON schema file too. This should also be documented

nf_core/components/install.py Show resolved Hide resolved
@@ -16,6 +16,7 @@
OLD_TRIMGALORE_SHA = "9b7a3bdefeaad5d42324aa7dd50f87bea1b04386"
OLD_TRIMGALORE_BRANCH = "mimic-old-trimgalore"
GITLAB_URL = "https://gitlab.com/nf-core/modules-test.git"
CROSS_ORGANIZATION_URL = "https://github.com/jvfe/test-subworkflow-remote.git"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, I will have a loot at the options we have to create another nf-core owned organisation.

@jvfe
Copy link
Member Author

jvfe commented Oct 23, 2024

I see a linting test failing, we should update the test here to add a valid fake version.

This is looking good! But it would be good to add a couple more test cases:

* Installing a subworkflow which installs a module from another repo which is also installed from nf-core, for example fastqc which is already in the pipeline template.

* Removing nf-core fastqc to make sure the fastqc module from the other repo is kept.

* Installing the subworkflow on an old version (sha) and updating it, make sure that all modules from different repos are updated.

And if you can think of other situations that I forgot, let's add them too 🙂

In addition to this PR we will need to add documentation on how to use this. Specifying the format of meta.yml. We should also modify the JSON schema that we use to lint subworkflows, as the elements of components are now required to be strings, but I think this is not something we should add to the nf-core/modules repo, it should work if the non-nf-core-repo add a JSON schema file too. This should also be documented

Hi Júlia, just wanted to let you know that we're still working on this feature! Just hit a few roadblocks that, due to the short time I can allocate to this every week, are taking a bit too long to solve. To sum up:

  • The first two test cases you mentioned have been already implemented and are working as intended, without any changes to the code being necessary.
  • Although I initially thought the feature wouldn't impact the update cycle, after writing down the test case I could see the feature requires some changes to the update module. And this is what we've been working on the past weeks. We've already gotten to the point where updating works, but there are still a few things to iron out.

Other than that:

Thank you so much for your reviews and your patience while we work on the feature.

@mirpedrol
Copy link
Member

Hi @jvfe, that's great! Thanks for implementing the tests.

The documentation can be added to the website. Here you have the docs for the subworkflow install command. And for a more detailed guide like the README you shared, I would add a page in the tutorials, under "external usage" like we do for the configs (see here)

Lastly, could you clarify what you meant by adding a valid fake version to the linting test?

This was fixed by another PR, so if you update your branch with the dev branch the test should pass.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants