
Remove git submodule for testing data #2848

Merged
merged 4 commits from remove_data_submodule into dev on Sep 12, 2024

Conversation

daljit46
Member

@daljit46 daljit46 commented Mar 6, 2024

This PR aims to address #2837. The idea is to remove the Git submodule for testing data from the codebase. Instead, testing data will be cloned into the build directory at build time. This is made possible by using CMake's ExternalProject.
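The approach described above might be sketched roughly as follows; the repository URL, commit hash, and target name are illustrative placeholders, not the PR's actual values:

```cmake
# Data-only external project: clone the testing data into the build tree
# at build time. URL and commit hash below are hypothetical.
include(ExternalProject)

ExternalProject_Add(testing_data
  GIT_REPOSITORY "https://github.com/MRtrix3/test_data.git" # assumed URL
  GIT_TAG        abc1234                                    # pinned data commit
  SOURCE_DIR     ${CMAKE_BINARY_DIR}/testing_data
  # There is nothing to configure, build, or install: this is data only.
  CONFIGURE_COMMAND ""
  BUILD_COMMAND     ""
  INSTALL_COMMAND   ""
)
```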

@daljit46 daljit46 requested a review from Lestropie March 6, 2024 14:44
@daljit46 daljit46 self-assigned this Mar 6, 2024
@daljit46 daljit46 changed the title Remove git submodules for testing data Remove git submodule for testing data Mar 6, 2024

github-actions bot commented Mar 6, 2024

clang-tidy review says "All clean, LGTM! 👍"

@Lestropie
Member

Lestropie commented Mar 6, 2024

Makes sense to me. Hopefully this would also, by design, remove some of the weird annoyances around submodules: not changing commit hash when the main repo's branch changes, or getting mistakenly included in git commit -a. I'm also thinking that any divergences in test data between code branches should be less likely.

The outstanding relevant question is whether there should remain two test data repositories, or whether they should be merged into one.
(And the "unit test" data added in #2678 would naturally go there also)
But that can perhaps be addressed as part of #2824.

Also: Would this be a preferable alternative to submodules for other external dependencies (#2584)? I suppose the disadvantage there is that IDEs wouldn't be able to find the code for the relevant functions unless you put the code somewhere else and pointed the IDE to it yourself.

@Lestropie
Member

There might be a slight disadvantage to developer workflow, but it is tolerable.

Whenever an update to the test data is required, typically one would:

  1. Navigate into the sub-directory occupied by the test data
  2. Make changes to the test data
  3. Navigate back to root
  4. Ensure that run_tests succeeds
  5. Navigate back into the sub-directory occupied by the test data
  6. Generate commit and push
  7. Navigate back to root
  8. Stage submodule commit hash
  9. Generate commit that may also include corresponding changes to the tests

Here, I presume the workflow will be:

  1. Navigate into the sub-directory occupied by the test data
  2. Make changes to the test data
  3. Navigate back to build directory
  4. Ensure that ctest succeeds
  5. Externally clone the test data repository
  6. Explicitly match the commit hash of that clone with what the source repository was pointing to
  7. Duplicate the changes that were made in the build directory's clone of the test data
  8. Generate commit and push
  9. Navigate back to root of source repository
  10. Stage external project commit hash
  11. Generate commit that may also include corresponding changes to the tests

?

@daljit46
Member Author

daljit46 commented Mar 7, 2024

Also: Would this be a preferable alternative to submodules for other external dependencies (#2584)? I suppose the disadvantage there is that IDEs wouldn't be able to find the code for the relevant functions unless you put the code somewhere else and pointed the IDE to it yourself.

I think for this purpose FetchContent is more suitable than ExternalProject, as it makes content available immediately, allowing it to be used at the configuration stage.
Something like this would work:

include(FetchContent)

FetchContent_Declare(json URL https://github.com/nlohmann/json/releases/download/v3.11.3/json.tar.xz)
FetchContent_MakeAvailable(json)

target_link_libraries(mrtrix-core PRIVATE nlohmann_json::nlohmann_json)

and IDEs shouldn't have trouble picking this up.

@daljit46 daljit46 mentioned this pull request Mar 8, 2024
@daljit46
Member Author

Would like to get this merged too unless there are any objections.

@jdtournier
Member

Before we do, I'm not sure I get the rationale.

My understanding of the original issue was that the testing ought to run from within the build directory. I kind of get that bit. But then there's a suggestion to download the test data into the build directory, and that's where I get a bit lost. Does that mean that if we delete the build folder, the system will have to download all the test data again? That's a fair bit of data... I'm hoping that the ExternalProject feature somehow gets around that issue? Otherwise I think this is going to prove very impractical...

Hopefully this is just my ignorance about how that feature works, and a quick bit of education will sort that out...

@daljit46
Member Author

daljit46 commented Jul 10, 2024

Yes, the data will be cloned into the build directory. However, the original issue #2837 was about the problem that our source directory can get polluted with temporary files produced by the execution of tests. This also cleanly allows a different data set to be cloned when a developer changes their local branch.
I agree that it can be quite annoying to have to re-download the data if you delete the build directory, although the cloning time is probably a very small fraction of the total build time (at least on a fast enough network).

In #2859, we have the same problem and I implemented an alternative solution: downloaded dependencies are cloned in ${PROJECT_SOURCE_DIR}/_deps (which is also added to .gitignore), so that even if you delete the build directory, no download is necessary. If wanted, we can implement the same solution here.
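The alternative described above can be sketched with CMake's FETCHCONTENT_BASE_DIR variable, which controls where FetchContent stores downloaded content; whether this matches the exact implementation in #2859 is an assumption:

```cmake
# Keep downloaded dependencies outside the build tree, so that deleting
# the build directory does not force a re-download. Must be set before
# any FetchContent_MakeAvailable() call; _deps should be git-ignored.
set(FETCHCONTENT_BASE_DIR ${PROJECT_SOURCE_DIR}/_deps)
```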

@jdtournier
Member

OK, so this was a valid concern...

On my home network, we're talking minutes to download all the test data for the scripts. Much longer than the actual build time - especially now that we're using ccache. If you have an alternative solution that gets around that, I think it would be better to go with that.

@Lestropie
Member

Regarding re-downloading of test data: I too have developed the habit of deleting a build directory and then re-constructing it. However, #2859 has these suggestions:

cmake -B build --fresh # It reconfigures the project deleting any CMakeCache.txt
cmake --build build --target clean # cleans any build artifacts
cmake --build build --clean-first # Same as previous line but triggers a build

So it might be possible to avoid repetitive long downloads just by changing developer practice.

We must however consider the case where one has multiple build directories:

  • Hypothetically, if one wanted to run tests for both, the test data would be redundantly stored in both build directories.
  • Conversely, one build directory may have been built from one version of source code, and another build directory from a different version of the source code that references a different commit in the testing data.

Given the latter, I'm uncomfortable having non-repository files requisite for building living outside of a build directory; both here and potentially for #2859 also.

I'd previously discussed the prospect of having environment variables, interpreted at configure time, for loading external dependencies from the local filesystem into the build directory rather than pulling them from the internet. Hypothetically, that argument could extend to the test data. If the developer has a test data repository cloned on their own system, and that clone contains the commit to which the source code repository currently refers, then the test data repository could be copied into the build directory and the requisite commit checked out (indeed, if that local clone is already pointing to the right commit, then just the data content could be pulled, skipping unnecessary duplication of git content). Perhaps not worth the effort if simply avoiding build-directory deletion resolves the issue, but I thought it worth mentioning.

@daljit46
Member Author

Given the latter, I'm uncomfortable having non-repository files requisite for building living outside of a build directory; both here and potentially for #2859 also.

I tend to agree with this. For #2859, cloning in the build directory is even less of an issue, as downloading the tarballs for Eigen, Json for Modern C++ and Half only takes a few seconds.

@daljit46
Member Author

@Lestropie @jdtournier I can't exactly recall the verdict on this issue, but did we decide to allow for the possibility of specifying a local URL for the data directory (so that we can clone the data into the build)? If yes, what should be the default behaviour?

@jdtournier
Member

did we decide to allow for the possibility of specifying a local URL for the data directory

Yes, that was more or less the conclusion we came to. Basically, the behaviour would be to always git clone the test data repo into the build directory and check out the relevant commit. This would allow any internet-connected user to run the tests without any further customisation. The URL of the main test data repo would be hard-coded and used by default, but the user would be able to override that URL through an environment variable (or whichever other mechanism is most CMake-compatible), so that they could specify a local clone of the main repo instead, which would then reside on their own system (and they'd be responsible for updating it if necessary). This would also allow them to make whatever changes are required to the test data on their own system, and check that it all works before pushing the lot up to GitHub.

Does that sound reasonable?
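A minimal sketch of that default-plus-override behaviour (the environment variable name, URL, and commit hash are illustrative only):

```cmake
include(ExternalProject)

# Hard-coded default URL of the main test data repo (assumed).
set(data_url "https://github.com/MRtrix3/test_data.git")

# Let the user point at a local clone instead via an environment variable
# (the variable name here is hypothetical).
if(DEFINED ENV{MRTRIX_TEST_DATA_URL})
  set(data_url "$ENV{MRTRIX_TEST_DATA_URL}")
endif()

ExternalProject_Add(testing_data
  GIT_REPOSITORY ${data_url}
  GIT_TAG        abc1234       # the commit the source tree expects
  CONFIGURE_COMMAND "" BUILD_COMMAND "" INSTALL_COMMAND ""
)
```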

@Lestropie
Member

Sounds right to me.

The only addition would be that if the requested commit is not available, the resulting error could differ depending on whether the default GitHub repo or a manually specified repo was used. In the former case, the issue could be either an erroneously specified commit or a failure to push committed changes to GitHub; in the latter case, in addition to an erroneously specified commit, it could be that the manually specified local repo is not fully up to date.

clang-tidy review says "All clean, LGTM! 👍"

@daljit46
Member Author

daljit46 commented Aug 28, 2024

I have now rebased the changes; the user can now specify the MRTRIX_BINARIES_DATA_DIR and MRTRIX_SCRIPTS_DATA_DIR environment variables to point to a local git directory containing the testing data. Do we want to use CMake cache variables instead (e.g. via specifying -DMRTRIX_BINARIES_DATA=...)?

Only addition would be that if the requested commit is not available, the resulting error could be different depending on whether the default GitHub repo or a manually-specified repo was used.

To clone the data we are using ExternalProject_Add, and from CMake's documentation:

If the local clone already has the commit corresponding to the hash, no git fetch needs to be performed to check for changes each time CMake is re-run. This can result in a significant speed up if many external projects are being used.
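As the quoted documentation suggests, this behaviour relies on GIT_TAG being a full commit hash rather than a branch or tag name; a sketch, using the environment variable named above with an illustrative hash:

```cmake
include(ExternalProject)

ExternalProject_Add(binaries_data
  # Points at a local clone supplied via the environment variable.
  GIT_REPOSITORY "$ENV{MRTRIX_BINARIES_DATA_DIR}"
  # A full 40-character hash lets CMake skip the "git fetch" check on
  # re-runs once the commit is present locally; a branch name would not.
  GIT_TAG 0123456789abcdef0123456789abcdef01234567
  CONFIGURE_COMMAND "" BUILD_COMMAND "" INSTALL_COMMAND ""
)
```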

@daljit46 daljit46 requested a review from a team August 28, 2024 08:18
@Lestropie
Member

Do we want to use CMake cache variables instead?

I think the projected workflow for developers here is that one would have a single clone of the test data repository, make all changes there, and just set an environment variable to point to that location so that all of their builds can make use of it. So while a CMake variable would be more conceptually faithful in terms of configuring a build system, if it's only there to expedite developer workflow, I think an environment variable would be maximally convenient for me.

If the local clone already has the commit corresponding to the hash, no git fetch needs to be performed to check for changes each time CMake is re-run. This can result in a significant speed up if many external projects are being used.

Doesn't quite address the question, which was specifically whether there is explicit feedback to the developer that an attempt was made to use a specified local clone but that the requested commit wasn't available there so it is falling back to doing a complete clone from GitHub. I'm guessing if it's an existing function it will report something sensible in that scenario.

Also, this quote references what happens on re-runs of cmake, not the initial configuration.

I'll try to test.

@daljit46
Member Author

Doesn't quite address the question, which was specifically whether there is explicit feedback to the developer that an attempt was made to use a specified local clone but that the requested commit wasn't available there so it is falling back to doing a complete clone from GitHub. I'm guessing if it's an existing function it will report something sensible in that scenario.

My understanding here is that ExternalProject_Add will internally use git clone and then git checkout, so one should get the same error messages as when using git directly (although CMake will report a download error and provide a path to a log file containing the actual full error message).

We now download testing data at build time instead of cloning the data as a git submodule.
The user can also provide local git repos for the data by specifying the MRTRIX_BINARIES_DATA_DIR and MRTRIX_SCRIPTS_DATA_DIR environment variables during the configuration stage.
This prevents polluting the source directory and is more in line with CMake's philosophy.
@daljit46
Member Author

@MRtrix3/mrtrix3-devs I believe this is now ready to be merged.


clang-tidy review says "All clean, LGTM! 👍"

@daljit46 daljit46 merged commit 45e3d96 into dev Sep 12, 2024
6 checks passed
@daljit46 daljit46 deleted the remove_data_submodule branch September 12, 2024 09:11