
Memory leak #13

Open
shrubb opened this issue Jun 7, 2019 · 10 comments

Comments


shrubb commented Jun 7, 2019

Hi,

when the example below is run, RAM usage grows without bound:

import torch, torch.utils.data
import nonechucks

class DummyDataset(torch.utils.data.Dataset):
    def __len__(self):
        return 1_000_000

    def __getitem__(self, idx):
        return 666  # constant payload, so memory use should stay flat while iterating

dataset = nonechucks.SafeDataset(DummyDataset())

for _ in torch.utils.data.DataLoader(dataset):
    pass

Notes:

  • Here the increase is quite slow; for a RAPID bug demonstration, replace 666 with torch.empty(10_000) (be careful to kill the process in time, before you're OOM!).
  • No problems without SafeDataset.
  • Without torch.utils.data.DataLoader the leak is still there, although at a smaller scale: around 1 MB of RAM is lost per 30,000-40,000 __getitem__ calls (a measurement sketch follows this list).
  • PyTorch 1.0.1, nonechucks 0.3.1.
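
To reproduce the DataLoader-free note above, one can index the SafeDataset directly and poll the process's peak RSS. This is only a rough measurement sketch: the resource module is POSIX-only, and on Linux ru_maxrss is reported in kilobytes.

import resource  # POSIX-only; on Linux ru_maxrss is reported in KiB

import torch.utils.data
import nonechucks


class DummyDataset(torch.utils.data.Dataset):
    def __len__(self):
        return 1_000_000

    def __getitem__(self, idx):
        return 666


dataset = nonechucks.SafeDataset(DummyDataset())

# Index the dataset directly (no DataLoader) and watch peak memory creep upward.
for i in range(1_000_000):
    _ = dataset[i]
    if i % 100_000 == 0:
        peak_kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        print(f"after {i:,} items: peak RSS ~ {peak_kib / 1024:.0f} MiB")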

aronhoff commented Jul 8, 2019

SafeDataset.__getitem__ memoizes the dataset items, with no parameter to change this.

This assumes that the dataset will always fit in memory.
Remove the @memoize line, and the leak is gone.
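
For illustration only, a simplified sketch of the pattern (not nonechucks' actual memoize implementation): a decorator like the one below caches every value __getitem__ ever returns and never evicts, so iterating a 1,000,000-item dataset keeps 1,000,000 entries alive.

import functools

def memoize(method):
    cache = {}  # never evicted: every item ever fetched stays referenced here

    @functools.wraps(method)
    def wrapper(self, *args):
        if args not in cache:
            cache[args] = method(self, *args)
        return cache[args]

    return wrapper

Removing the decorator trades repeated recomputation for bounded memory, which is exactly the trade-off discussed below.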

msamogh (Owner) commented Jul 19, 2019

Thanks for raising the issue, @shrubb!

Yeah, @aronhoff is spot on. I hadn't really thought about this use case. Parameterising the memoization seems like a good idea! Would you want to raise a PR for that, @aronhoff?

@taehyunoh

@aronhoff's solution works like a charm! I had been struggling with this.
If possible, please update the pip package as well. I'd appreciate it!
Thanks

msamogh (Owner) commented Jul 20, 2019

Well, I put it there for a reason: it still ensures that you don't iterate through the entire dataset from scratch every time, which holds for most cases where the dataset isn't larger than memory.

I think parameterising is the best solution for now.
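
One possible shape for that parameterisation (a hypothetical sketch, not nonechucks' actual API; the class name and the memoize flag are invented here) is to decide at construction time whether results are cached at all.

import torch.utils.data


class SafeDatasetSketch(torch.utils.data.Dataset):
    """Hypothetical wrapper: pass memoize=False to disable the unbounded cache."""

    def __init__(self, dataset, memoize=True):
        self.dataset = dataset
        self.memoize = memoize
        self._cache = {}

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        if self.memoize and idx in self._cache:
            return self._cache[idx]
        sample = self.dataset[idx]  # skipping of "unsafe" samples omitted for brevity
        if self.memoize:
            self._cache[idx] = sample
        return sample

Because the flag and the cache are plain instance attributes here, they also survive pickling into DataLoader worker processes, which becomes relevant in the discussion below.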


aronhoff commented Jul 20, 2019

I tried adding a bool attribute to the memoize class, that would make its __call__ skip the lookup. The idea is that you could use it as a property of the method, e.g. self.__getitem__.memoize = False. (c57e70c)

Unfortunately this does not work with multiprocessing. __getitem__ is not in the __dict__ of an object, so pickling does not save or restore it and its attributes. Putting it into __dict__ does not solve this.

I do not currently have more time to find a way. A solution could be a custom __setstate__ in SafeDataset, or perhaps a custom metaclass for it, or doing it in the __getitem__ function directly. Either way, it seems to need more entanglement between memoize and the owner object.
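
One way to get that entanglement without fighting the pickling machinery (a hedged sketch, not nonechucks code; _memoize_enabled and _memoize_cache are invented names) is to keep both the flag and the cache in the owner's __dict__, where they round-trip through pickle into DataLoader workers:

import functools


def memoize_on_owner(method):
    @functools.wraps(method)
    def wrapper(self, *args):
        if not getattr(self, "_memoize_enabled", True):
            return method(self, *args)  # caching switched off for this instance
        cache = self.__dict__.setdefault("_memoize_cache", {})
        if args not in cache:
            cache[args] = method(self, *args)
        return cache[args]

    return wrapper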

Perhaps the concerns should be separated completely. You could have a MemoizedDataset that passes through values from another dataset (potentially a SafeDataset) while memoizing them.

Keep in mind that the cache dicts are going to be different instances across processes, so you would be replicating your dataset for each subprocess. This may not be what you want, but for small datasets it may not be worth the effort of dealing with read-write shared memory.

And with regard to dataset size, ImageNet would optimistically take around 200 GB :)
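
The MemoizedDataset separation suggested above could look roughly like this (a sketch of the idea only; such a class is not part of nonechucks):

import torch.utils.data


class MemoizedDataset(torch.utils.data.Dataset):
    """Caches whatever the wrapped dataset returns, e.g. MemoizedDataset(SafeDataset(ds))."""

    def __init__(self, dataset):
        self.dataset = dataset
        self._cache = {}  # note: each DataLoader worker holds its own copy of this cache

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        if idx not in self._cache:
            self._cache[idx] = self.dataset[idx]
        return self._cache[idx]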

msamogh (Owner) commented Jul 24, 2019

Thanks for the PR, @shrubb! I have been quite busy and haven't had time to look at it. I'll take a look soon and get back to you.

@timonbimon

Has this been resolved? :)
Nonechucks looks pretty useful, but it's very normal for us to have datasets that are much larger than RAM, so with the memory leak this would be a no-go.


aksg87 commented Sep 20, 2021

@timonbimon @aronhoff

Any solution? I just faced a memory leak with this as well!

@Erotemic

Another issue with memoizing __getitem__ is that it assumes asking for the same index will always return the same value.

For complicated training cases, executing __getitem__ twice with the same index might return wildly different batch items; it is not safe to assume that an index corresponds 1:1 with a specific item. For instance, in one of my more involved datasets, even indexes correspond to positive examples and odd indexes to negative examples, but the examples themselves are somewhat random because each index actually chooses from a pool of examples associated with it. This allows me to balance my positive/negative training cases, but it very much breaks the assumptions made in this library.
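
A minimal sketch of such a dataset (hypothetical, just to illustrate the point): the same index can legitimately return a different sample on every call, so caching the first result silently freezes the sampling.

import random

import torch.utils.data


class PooledPairDataset(torch.utils.data.Dataset):
    """Even indices draw from a positive pool, odd indices from a negative pool."""

    def __init__(self, positives, negatives):
        self.positives = positives
        self.negatives = negatives

    def __len__(self):
        return 2 * min(len(self.positives), len(self.negatives))

    def __getitem__(self, idx):
        pool = self.positives if idx % 2 == 0 else self.negatives
        return random.choice(pool)  # same idx may yield a different sample each call

Memoizing __getitem__ here would pin each index to whichever sample was drawn first, so later epochs never see the rest of the pool.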

@enric1994

Facing the same issue. Any workarounds?
