
Alternative paramscan function for dealing with exceeding the memory cap. #825

Open
johnabs opened this issue Jul 5, 2023 · 2 comments
Labels: data related with datacollection · enhancement (New feature or request)

Comments

@johnabs

johnabs commented Jul 5, 2023

Is your feature request related to a problem? Please describe.
The problem is that when I have particularly large models with substantial parameter spaces, paramscan requires me to pre-divide the data before passing it in so that I don't run out of memory. It would be great if this were all handled automatically behind the scenes when running experiments, if possible.

Describe the solution you'd like
I've already implemented a variant of the solution I'd like: the user specifies whether they expect the run to exceed the memory cap; if so, the parameter dictionary is split into a partition of some user-defined size, and otherwise it is left alone and run as usual (see the sketch below). It would be even better to determine this up front rather than relying on the user, but as of yet I'm unsure how to estimate memory consumption in general. With such an estimate, both the boolean check and the partition size of the parameter dict could be set automatically, minimizing the chance of the code crashing while also minimizing the number of writes to disk.
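A rough sketch of the idea (names like `chunked_paramscan`, `split_if_needed`, and `chunk_size` are placeholders, and the exact `paramscan` keyword/return conventions are assumed from memory, so they may need adjusting):

```julia
using Agents, CSV, DataFrames
using Base.Iterators: partition, product

# Assumes every value in `parameters` is a vector/range of values to scan.
function chunked_paramscan(parameters, initialize;
                           split_if_needed = true, chunk_size = 100, kwargs...)
    split_if_needed || return paramscan(parameters, initialize; kwargs...)
    # One scalar-valued dict per parameter combination; the dicts themselves
    # are cheap, it is the per-chunk results that stay bounded in memory.
    singles = vec([Dict(zip(keys(parameters), combo))
                   for combo in product(values(parameters)...)])
    for (i, chunk) in enumerate(partition(singles, chunk_size))
        # A dict of scalars should make paramscan run exactly one combination,
        # so each chunk holds at most `chunk_size` results before flushing.
        results = [paramscan(p, initialize; kwargs...) for p in chunk]
        CSV.write("adata_chunk_$i.csv", reduce(vcat, first.(results)))
        CSV.write("mdata_chunk_$i.csv", reduce(vcat, last.(results)))
    end
    return nothing
end
```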

Describe alternatives you've considered
The original method I used was pre-chunking the data, but since this wasn't necessary for every experiment, it typically resulted in more CSVs than I wanted. With this solution, the data is only chunked when needed. I don't know of a suitable alternative for sufficiently large models, or models with sufficiently large search spaces.

I have some code I can provide in a PR if this seems like it would be of value to the project; if not, feel free to close the issue and I'll keep the changes for my own use cases.

Best,
John

@Tortar
Member

Tortar commented Jul 5, 2023

Hi! If I'm understanding correctly, the problem in your case is that the list the dictionary expands into is too big (this is what happens behind the scenes in paramscan with the dictionary). If so, I think there is a simpler solution than what you propose: add the option to use a lazy iterator over the ranges in the dict instead of a list (or make that the default). I think this should be enough to solve the problem; let me know if I'm misunderstanding something. Either way, it seems like a good idea to do something about this!
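Something along these lines (a sketch only; `lazy_dict_list` is just an illustrative name, not the current internals):

```julia
using Base.Iterators: product

# Yield one Dict per parameter combination on demand, instead of collecting
# every combination into a Vector of Dicts up front.
lazy_dict_list(parameters) =
    (Dict(zip(keys(parameters), combo)) for combo in product(values(parameters)...))

# Usage: combinations are produced one at a time, so memory stays constant
# in the number of combinations.
params = Dict(:noise => 0.0:0.1:1.0, :speed => [1.0, 2.0, 5.0])
for p in lazy_dict_list(params)
    # run the model for this single combination, then let `p` be discarded
end
```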

@Tortar Tortar added the enhancement (New feature or request) and data related with datacollection labels on Jul 6, 2023
@Datseris
Member

Datseris commented Jul 7, 2023

Hm, I am also not sure whether I have understood the problem: is it that the number of generated dictionaries is too large, or that the final DataFrames occupy too much memory because they have too many columns with different parameters? Since you mentioned you already have a code solution, @johnabs, perhaps you can paste it here; that will elucidate things.
