Is your feature request related to a problem? Please describe.
The problem is that when I have particularly large models with substantial parameter spaces, paramscan requires me to pre-divide the parameter data before passing it in, to avoid running out of memory. It would be great if this were handled automatically behind the scenes when running experiments, if possible.
Describe the solution you'd like
I've already implemented a variant of the solution I would like: the user declares whether they expect the model to exceed the memory cap; if so, the parameter dictionary is split into partitions of a user-defined size, otherwise it is left alone and run as usual. Having a way to determine this up front, rather than relying on the user, would be a great feature, but as of yet I'm unsure how to estimate memory consumption in general. With such an estimate, both the boolean check and the partition size of the paramdict could be set automatically, minimizing both the chance of the code crashing and the number of writes to disk.
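paramscan itself is Julia (Agents.jl), but the partitioning idea described above is language-agnostic. Below is a minimal Python sketch of it, assuming a threshold on the number of parameter combinations per chunk; the function name `partition_paramdict` and the `max_combos` parameter are hypothetical, not part of any real API.

```python
from itertools import islice, product

def partition_paramdict(params, max_combos):
    """Split a parameter dictionary into chunks, each containing at most
    max_combos parameter combinations. Yields one list of parameter
    dictionaries per chunk, so each chunk can be scanned (and its
    results written to disk) independently."""
    keys = list(params)
    # Lazy Cartesian product over the value ranges: nothing is
    # materialized until a chunk is sliced off below.
    combos = product(*(params[k] for k in keys))
    while True:
        chunk = [dict(zip(keys, values)) for values in islice(combos, max_combos)]
        if not chunk:
            break
        yield chunk

# Example: 3 * 2 * 2 = 12 combinations, chunked into groups of at most 5.
params = {"a": [1, 2, 3], "b": [0.1, 0.2], "c": ["x", "y"]}
chunks = list(partition_paramdict(params, 5))
print([len(c) for c in chunks])  # → [5, 5, 2]
```

The same shape would apply in Julia with `Iterators.product` and `Iterators.partition`; only the chunk boundary logic differs.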
Describe alternatives you've considered
The original method I used was pre-chunking the data, but since this wasn't always necessary, it typically produced more CSVs than I wanted. With this solution, chunking only happens when needed. I don't know of a suitable alternative for sufficiently large models, or for models with sufficiently large search spaces.
I have some code I can provide in a PR if this seems like it would be of value to the project; if not, feel free to close the issue and I'll keep the changes for my own use cases.
Best,
John
Hi! If I'm understanding correctly, the problem in your case is that the list into which the dictionary expands is too big (this is what happens behind the scenes in paramscan with the dictionary). If so, I think there is a simpler solution than what you propose: add the possibility of using a lazy iterator over the ranges in the dict instead of a list (or make that the default). I think this should be enough to solve the problem; let me know if I'm misunderstanding something. It seems like a good idea to do something about this in any case!
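The lazy-iterator suggestion can be illustrated in a few lines. This is a Python sketch (paramscan itself is Julia, where `Iterators.product` plays the same role): a materialized Cartesian product holds every combination in memory at once, while a lazy product yields one combination at a time at constant memory cost.

```python
import sys
from itertools import product

ranges = {"a": range(100), "b": range(100)}

# Materializing the full Cartesian product holds all 10,000 tuples at once.
materialized = list(product(*ranges.values()))

# A lazy iterator occupies constant memory regardless of how many
# combinations the ranges expand to.
lazy = product(*ranges.values())
print(sys.getsizeof(lazy) < sys.getsizeof(materialized))  # → True

# Consumers pull combinations one by one, never holding the whole list:
first = next(lazy)
print(first)  # → (0, 0)
```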
Hm, I am also not sure I have understood the problem: is it that the number of generated parameter dictionaries is too large, or that the final DataFrames occupy too much memory because they have too many columns with different parameters? Since you mention you already have a code solution, @johnabs, perhaps you can paste it here and that will elucidate things.