Weird RecursionError involving dask and cloudpickle triggered in tests #104

Open
mattjbr123 opened this issue Jul 11, 2024 · 0 comments
Labels
bug Something isn't working

Comments

@mattjbr123
Copy link
Member

mattjbr123 commented Jul 11, 2024

Unfortunately the logs containing this error message have been deleted so I can't be specific about things, but I still think it's worth documenting in case of future errors of this variety.

Some of the advanced unit tests loop over various configurations of an example UniFHy Model, defined by different combinations of 'actual' Components (c), 'DataComponents' (d) and 'NullComponents' (n) — in particular test_setup_simulate and test_setup_spinup_simulate_resume_run of AdvancedTestModel. After a certain number of iterations through this loop, these tests would fail with a 'RecursionError' in the 'check spatial compatibility' routine that is called when a Model is instantiated. The error stack showed calls from unifhy to cf-python passed on to dask, which then looped, with the same 4 error messages recurring in the stack again and again until Python's default recursion limit was hit. Somewhere in the mix was the 'cloudpickle' package, which can pickle/unpickle objects that Python's standard 'pickle' package can't, leading me to believe it is some sort of caching error. After the first failure, all further tests would fail with the same error. The error occurred with all tested versions of Python (3.8, 3.9, 3.10 and 3.11) running on the ubuntu-latest image provided by GitHub Actions.
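For reference, the general failure mode is easy to reproduce in isolation: pickling (which dask and cloudpickle do when moving objects between tasks) recurses once per level of object nesting, so a sufficiently deep or self-referential structure blows through Python's default recursion limit. This is only a minimal sketch of the mechanism with plain pickle, not the actual unifhy/cf-python object that triggered it:

```python
import pickle
import sys

# Build a structure nested deeper than the recursion limit;
# pickling it recurses once per level of nesting.
depth = sys.getrecursionlimit() * 2
obj = None
for _ in range(depth):
    obj = [obj]

try:
    pickle.dumps(obj)
except RecursionError as exc:
    print("RecursionError:", exc)
```

In the real failure the recursion presumably came from a cyclic or deeply chained object graph inside the model/field objects rather than a deliberately nested list.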

In the end, the only way I could avoid the error was to split the tests and loops up into smaller bundles.
Originally, all the c, d, n configurations of the Model ran for all 4 test variants (SameTimeSameSpace, DiffTimeSameSpace, SameTimeDiffSpace, DiffTimeDiffSpace) in one test suite. I split the test variants into 4 completely separate actions (python scripts like this) to be run during the advanced test, and then further split each of these 4 actions by making each c, d, n configuration its own unit-testing suite.
This seemed to clear whatever caching error was happening, but I'm none the wiser as to what was actually going on.

It seems similar to this dask issue, but there has been no action on it for 2 years or so.

@mattjbr123 mattjbr123 added the bug Something isn't working label Aug 7, 2024