
Remove git submodule for testing data #2848

Merged
merged 4 commits from remove_data_submodule into dev on Sep 12, 2024

Conversation

daljit46
Member

@daljit46 daljit46 commented Mar 6, 2024

This PR aims to address #2837. The idea is to remove the Git submodule for testing data from the codebase. Instead, testing data will be cloned into the build directory at build time. This is made possible by using CMake's ExternalProject.
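The approach described above might be sketched roughly as follows; the repository URL, commit hash, and target name are illustrative placeholders, not the PR's actual values:

```cmake
# Data-only external project: clone the testing data into the build tree
# at build time. URL and commit hash below are hypothetical.
include(ExternalProject)

ExternalProject_Add(testing_data
  GIT_REPOSITORY "https://github.com/MRtrix3/test_data.git" # assumed URL
  GIT_TAG        abc1234                                    # pinned data commit
  SOURCE_DIR     ${CMAKE_BINARY_DIR}/testing_data
  # There is nothing to configure, build, or install: this is data only.
  CONFIGURE_COMMAND ""
  BUILD_COMMAND     ""
  INSTALL_COMMAND   ""
)
```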

@daljit46 daljit46 requested a review from Lestropie March 6, 2024 14:44
@daljit46 daljit46 self-assigned this Mar 6, 2024
@daljit46 daljit46 changed the title Remove git submodules for testing data Remove git submodule for testing data Mar 6, 2024

github-actions bot commented Mar 6, 2024

clang-tidy review says "All clean, LGTM! 👍"

@Lestropie
Member

Lestropie commented Mar 6, 2024

Makes sense to me. Hopefully this would also, by design, remove some of the weird annoyances around submodules: not changing commit hash when the main repo's branch changes, or getting mistakenly included in git commit -a. I'm also thinking that any divergences in test data between code branches should be less likely.

The outstanding relevant question is whether there should remain two test data repositories, or whether they should be merged into one.
(And the "unit test" data added in #2678 would naturally go there also)
But that can perhaps be addressed as part of #2824.

Also: Would this be a preferable alternative to submodules for other external dependencies (#2584)? I suppose the disadvantage there is that IDEs wouldn't be able to find the code for the relevant functions unless you put the code somewhere else and pointed the IDE to it yourself.

@Lestropie
Member

There might be a slight disadvantage to developer workflow, but it is tolerable.

Whenever an update to the test data is required, typically one would:

  1. Navigate into the sub-directory occupied by the test data
  2. Make changes to the test data
  3. Navigate back to root
  4. Ensure that run_tests succeeds
  5. Navigate back into the sub-directory occupied by the test data
  6. Generate commit and push
  7. Navigate back to root
  8. Stage submodule commit hash
  9. Generate commit that may also include corresponding changes to the tests

Here, I presume the workflow will be:

  1. Navigate into the sub-directory occupied by the test data
  2. Make changes to the test data
  3. Navigate back to build directory
  4. Ensure that ctest succeeds
  5. Externally clone the test data repository
  6. Explicitly match the commit hash of that clone with what the source repository was pointing to
  7. Duplicate the changes that were made in the build directory's clone of the test data
  8. Generate commit and push
  9. Navigate back to root of source repository
  10. Stage external project commit hash
  11. Generate commit that may also include corresponding changes to the tests

?

@daljit46
Member Author

daljit46 commented Mar 7, 2024

Also: Would this be a preferable alternative to submodules for other external dependencies (#2584)? I suppose the disadvantage there is that IDEs wouldn't be able to find the code for the relevant functions unless you put the code somewhere else and pointed the IDE to it yourself.

I think for this purpose FetchContent is more suitable than ExternalProject, as it makes content available immediately, allowing it to be used at the configuration stage.
Something like this would work:

include(FetchContent)

FetchContent_Declare(json URL https://github.com/nlohmann/json/releases/download/v3.11.3/json.tar.xz)
FetchContent_MakeAvailable(json)

target_link_libraries(mrtrix-core PRIVATE nlohmann_json::nlohmann_json)

and IDEs shouldn't have trouble picking this up.

@daljit46 daljit46 mentioned this pull request Mar 8, 2024
@daljit46
Member Author

Would like to get this merged too unless there are any objections.

@jdtournier
Member

Before we do, I'm not sure I get the rationale.

My understanding of the original issue was that the testing ought to run from within the build directory. I kind of get that bit. But then there's a suggestion to download the test data into the build directory, and that's where I get a bit lost. Does that mean that if we delete the build folder, the system will have to download all the test data again? That's a fair bit of data... I'm hoping that the ExternalProject feature somehow gets around that issue? Otherwise I think this is going to prove very impractical...

Hopefully this is just my ignorance about how that feature works, and a quick bit of education will sort that out...

@daljit46
Member Author

daljit46 commented Jul 10, 2024

Yes, the data will be cloned into the build directory. However, the original issue #2837 was about the problem that our source directory can get polluted with temporary files produced by the execution of tests. This also cleanly allows a different data set to be cloned when a developer changes their local branch.
I agree that it can be quite annoying to have to re-download the data if you delete the build directory, although the cloning time is probably a very small fraction of the total build time (at least on a fast enough network).

In #2859, we have the same problem and I implemented an alternative solution: downloaded dependencies are cloned in ${PROJECT_SOURCE_DIR}/_deps (which is also added to .gitignore), so that even if you delete the build directory, no download is necessary. If wanted, we can implement the same solution here.
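The alternative described above can be sketched with CMake's FETCHCONTENT_BASE_DIR variable, which controls where FetchContent stores downloaded content; whether this matches the exact implementation in #2859 is an assumption:

```cmake
# Keep downloaded dependencies outside the build tree, so that deleting
# the build directory does not force a re-download. Must be set before
# any FetchContent_MakeAvailable() call; _deps should be git-ignored.
set(FETCHCONTENT_BASE_DIR ${PROJECT_SOURCE_DIR}/_deps)
```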

@jdtournier
Member

OK, so this was a valid concern...

On my home network, we're talking minutes to download all the test data for the scripts. Much longer than the actual build time - especially now that we're using ccache. If you have an alternative solution that gets around that, I think it would be better to go with that.

@Lestropie
Member

Regarding re-downloading of test data: I too have developed the habit of deleting a build directory and then re-constructing it. However, #2859 has these suggestions:

cmake -B build --fresh # It reconfigures the project deleting any CMakeCache.txt
cmake --build build --target clean # cleans any build artifacts
cmake --build build --clean-first # Same as previous line but triggers a build

So it might be possible to avoid repetitive long downloads just by changing developer practice.

We must however consider the case where one has multiple build directories:

  • Hypothetically, if one wanted to run tests for both, the test data would be redundantly stored in both build directories.
  • Conversely, one build directory may have been built from one version of source code, and another build directory from a different version of the source code that references a different commit in the testing data.

Given the latter, I'm uncomfortable having non-repository files requisite for building living outside of a build directory; both here and potentially for #2859 also.

I'd previously discussed the prospect of having environment variables, interpreted at configure time, for loading external dependencies from the local filesystem into the build directory rather than pulling them from the internet. Hypothetically, that argument could extend to the test data. If the developer has a test data repository cloned on their own system, and that clone contains the commit to which the source code repository currently refers, then the test data repository could be copied into the build directory and the requisite commit checked out (indeed, if that local clone is already pointing to the right commit, then just the data content could be pulled, skipping unnecessary duplication of git content). Perhaps not worth the effort if simply avoiding build-directory deletion resolves the issue, but I thought it worth mentioning.

@daljit46
Member Author

Given the latter, I'm uncomfortable having non-repository files requisite for building living outside of a build directory; both here and potentially for #2859 also.

I tend to agree with this. For #2859, cloning in the build directory is even less of an issue, as downloading the tarballs for Eigen, Json for Modern C++ and Half only takes a few seconds.

@daljit46
Member Author

@Lestropie @jdtournier I can't exactly recall the verdict on this issue, but did we decide to allow for the possibility of specifying a local URL for the data directory (so that we can clone the data into the build)? If yes, what should be the default behaviour?

@jdtournier
Member

did we decide to allow for the possibility of specifying a local URL for the data directory

Yes, that was more or less the conclusion we came to. Basically, the behaviour would be to always git clone the test data repo into the build directory and check out the relevant commit. This would allow any internet-connected user to run the tests without any further customisation. The URL of the main test data repo would be hard-coded and used by default, but the user would be able to override that URL through an environment variable (or whichever other mechanism is most CMake-compatible), so that they could specify a local clone of the main repo instead, which would then reside on their own system (and they'd be responsible for updating it if necessary). This would also allow them to make whatever changes are required to the test data on their own system, and check that it all works before pushing the lot up to GitHub.

Does that sound reasonable?
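A minimal sketch of that default-plus-override behaviour (the environment variable name, URL, and commit hash are illustrative only):

```cmake
include(ExternalProject)

# Hard-coded default URL of the main test data repo (assumed).
set(data_url "https://github.com/MRtrix3/test_data.git")

# Let the user point at a local clone instead via an environment variable
# (the variable name here is hypothetical).
if(DEFINED ENV{MRTRIX_TEST_DATA_URL})
  set(data_url "$ENV{MRTRIX_TEST_DATA_URL}")
endif()

ExternalProject_Add(testing_data
  GIT_REPOSITORY ${data_url}
  GIT_TAG        abc1234       # the commit the source tree expects
  CONFIGURE_COMMAND "" BUILD_COMMAND "" INSTALL_COMMAND ""
)
```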

@Lestropie
Member

Sounds right to me.

The only addition would be that if the requested commit is not available, the resulting error could differ depending on whether the default GitHub repo or a manually specified repo was used. In the former case, the issue could be either an erroneously specified commit or a failure to push committed changes to GitHub; in the latter case, in addition to an erroneously specified commit, it could be that the manually specified local repo is not fully up to date.

clang-tidy review says "All clean, LGTM! 👍"

@daljit46
Member Author

daljit46 commented Aug 28, 2024

I have now rebased the changes; the user can now specify the MRTRIX_BINARIES_DATA_DIR and MRTRIX_SCRIPTS_DATA_DIR environment variables to point to a local git directory containing the testing data. Do we want to use CMake cache variables instead (e.g. via specifying -DMRTRIX_BINARIES_DATA=...)?

Only addition would be that if the requested commit is not available, the resulting error could be different depending on whether the default GitHub repo or a manually-specified repo was used.

To clone the data we are using ExternalProject_Add, and from CMake's documentation:

If the local clone already has the commit corresponding to the hash, no git fetch needs to be performed to check for changes each time CMake is re-run. This can result in a significant speed up if many external projects are being used.
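As the quoted documentation suggests, this behaviour relies on GIT_TAG being a full commit hash rather than a branch or tag name; a sketch, using the environment variable named above with an illustrative hash:

```cmake
include(ExternalProject)

ExternalProject_Add(binaries_data
  # Points at a local clone supplied via the environment variable.
  GIT_REPOSITORY "$ENV{MRTRIX_BINARIES_DATA_DIR}"
  # A full 40-character hash lets CMake skip the "git fetch" check on
  # re-runs once the commit is present locally; a branch name would not.
  GIT_TAG 0123456789abcdef0123456789abcdef01234567
  CONFIGURE_COMMAND "" BUILD_COMMAND "" INSTALL_COMMAND ""
)
```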

@daljit46 daljit46 requested a review from a team August 28, 2024 08:18
@Lestropie
Member

Do we want to use CMake cache variables instead?

I think the projected workflow for developers here is that one would have a single clone of the test data repository, make all changes there, and just set an environment variable to point to that location so that all of their builds can make use of it. So while a CMake variable would be more conceptually faithful in terms of configuring a build system, if it's only there to expedite developer workflow, I think an environment variable would be maximally convenient for me.

If the local clone already has the commit corresponding to the hash, no git fetch needs to be performed to check for changes each time CMake is re-run. This can result in a significant speed up if many external projects are being used.

Doesn't quite address the question, which was specifically whether there is explicit feedback to the developer that an attempt was made to use a specified local clone but that the requested commit wasn't available there so it is falling back to doing a complete clone from GitHub. I'm guessing if it's an existing function it will report something sensible in that scenario.

Also, this quote references what happens on re-runs of cmake, not the initial configuration.

I'll try to test.

@daljit46
Member Author

Doesn't quite address the question, which was specifically whether there is explicit feedback to the developer that an attempt was made to use a specified local clone but that the requested commit wasn't available there so it is falling back to doing a complete clone from GitHub. I'm guessing if it's an existing function it will report something sensible in that scenario.

My understanding here is that ExternalProject_Add will internally use git clone and then git checkout, so one should get the same error messages as when using git directly (although CMake will report a download error and provide a path to a log file containing the actual full error message).

We now download testing data at build time instead of cloning the data as a git submodule.
The user can also provide local git repos for the data by specifying the MRTRIX_BINARIES_DATA_DIR and MRTRIX_SCRIPTS_DATA_DIR environment variables during the configuration stage.
This prevents polluting the source directory and is more in line with CMake's philosophy.
@daljit46
Member Author

@MRtrix3/mrtrix3-devs I believe this is now ready to be merged.


clang-tidy review says "All clean, LGTM! 👍"

@daljit46 daljit46 merged commit 45e3d96 into dev Sep 12, 2024
6 checks passed
@daljit46 daljit46 deleted the remove_data_submodule branch September 12, 2024 09:11