
Memory leak #13

Open
shrubb opened this issue Jun 7, 2019 · 10 comments

Comments


shrubb commented Jun 7, 2019

Hi,

when the example below is run, RAM usage grows without bound:

import torch, torch.utils.data
import nonechucks

class DummyDataset(torch.utils.data.Dataset):
    def __len__(self):
        return 1_000_000

    def __getitem__(self, idx):
        return 666  # constant payload, so memory use should stay flat while iterating

dataset = nonechucks.SafeDataset(DummyDataset())

for _ in torch.utils.data.DataLoader(dataset):
    pass

Notes:

  • Here the increase is quite slow; for a RAPID bug demonstration, replace 666 with torch.empty(10_000) (be careful to kill the process in time, before you're OOM!).
  • No problems without SafeDataset.
  • Without torch.utils.data.DataLoader the leak is still there, although at a smaller scale: around 1 MB of RAM is lost per 30,000-40,000 __getitem__ calls (a measurement sketch follows this list).
  • PyTorch 1.0.1, nonechucks 0.3.1.
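
To reproduce the DataLoader-free note above, one can index the SafeDataset directly and poll the process's peak RSS. This is only a rough measurement sketch: the resource module is POSIX-only, and on Linux ru_maxrss is reported in kilobytes.

import resource  # POSIX-only; on Linux ru_maxrss is reported in KiB

import torch.utils.data
import nonechucks


class DummyDataset(torch.utils.data.Dataset):
    def __len__(self):
        return 1_000_000

    def __getitem__(self, idx):
        return 666


dataset = nonechucks.SafeDataset(DummyDataset())

# Index the dataset directly (no DataLoader) and watch peak memory creep upward.
for i in range(1_000_000):
    _ = dataset[i]
    if i % 100_000 == 0:
        peak_kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        print(f"after {i:,} items: peak RSS ~ {peak_kib / 1024:.0f} MiB")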

aronhoff commented Jul 8, 2019

SafeDataset.__getitem__ memoizes the dataset items, with no parameter to change this.

This assumes that the dataset will always fit in memory.
Remove the @memoize line, and the leak is gone.
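
For illustration only, a simplified sketch of the pattern (not nonechucks' actual memoize implementation): a decorator like the one below caches every value __getitem__ ever returns and never evicts, so iterating a 1,000,000-item dataset keeps 1,000,000 entries alive.

import functools

def memoize(method):
    cache = {}  # never evicted: every item ever fetched stays referenced here

    @functools.wraps(method)
    def wrapper(self, *args):
        if args not in cache:
            cache[args] = method(self, *args)
        return cache[args]

    return wrapper

Removing the decorator trades repeated recomputation for bounded memory, which is exactly the trade-off discussed below.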

msamogh (Owner) commented Jul 19, 2019

Thanks for raising the issue, @shrubb!

Yeah, @aronhoff is spot on. I hadn't really thought about this use case. Parameterising the memoization seems like a good idea! Would you want to raise a PR for that, @aronhoff?

@taehyunoh

@aronhoff's solution works like a charm! I had been struggling with this.
If possible, please update the pip package as well. I'd appreciate it!
Thanks

msamogh (Owner) commented Jul 20, 2019

Well, I put it there for a reason: it still ensures that you don't iterate through the entire dataset from scratch every time, which holds for most cases where the dataset isn't larger than memory.

I think parameterising is the best solution for now.
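
One possible shape for that parameterisation (a hypothetical sketch, not nonechucks' actual API; the class name and the memoize flag are invented here) is to decide at construction time whether results are cached at all.

import torch.utils.data


class SafeDatasetSketch(torch.utils.data.Dataset):
    """Hypothetical wrapper: pass memoize=False to disable the unbounded cache."""

    def __init__(self, dataset, memoize=True):
        self.dataset = dataset
        self.memoize = memoize
        self._cache = {}

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        if self.memoize and idx in self._cache:
            return self._cache[idx]
        sample = self.dataset[idx]  # skipping of "unsafe" samples omitted for brevity
        if self.memoize:
            self._cache[idx] = sample
        return sample

Because the flag and the cache are plain instance attributes here, they also survive pickling into DataLoader worker processes, which becomes relevant in the discussion below.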


aronhoff commented Jul 20, 2019

I tried adding a bool attribute to the memoize class, that would make its __call__ skip the lookup. The idea is that you could use it as a property of the method, e.g. self.__getitem__.memoize = False. (c57e70c)

Unfortunately this does not work with multiprocessing. __getitem__ is not in the __dict__ of an object, so pickling does not save or restore it and its attributes. Putting it into __dict__ does not solve this.

I do not currently have more time to find a way. A solution could be a custom __setstate__ in SafeDataset, or perhaps a custom metaclass for it, or doing it in the __getitem__ function directly. Either way, it seems to need more entanglement between memoize and the owner object.
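
One way to get that entanglement without fighting the pickling machinery (a hedged sketch, not nonechucks code; _memoize_enabled and _memoize_cache are invented names) is to keep both the flag and the cache in the owner's __dict__, where they round-trip through pickle into DataLoader workers:

import functools


def memoize_on_owner(method):
    @functools.wraps(method)
    def wrapper(self, *args):
        if not getattr(self, "_memoize_enabled", True):
            return method(self, *args)  # caching switched off for this instance
        cache = self.__dict__.setdefault("_memoize_cache", {})
        if args not in cache:
            cache[args] = method(self, *args)
        return cache[args]

    return wrapper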

Perhaps the concerns should be separated completely. You could have a MemoizedDataset that passes through values from another dataset (potentially a SafeDataset) while memoizing them.

Keep in mind that the cache dicts are going to be different instances across processes, so you would be replicating your dataset for each subprocess. This may not be what you want, but for small datasets it may not be worth the effort of dealing with read-write shared memory.

And with regard to dataset size, ImageNet would optimistically take around 200 GB :)
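
The MemoizedDataset separation suggested above could look roughly like this (a sketch of the idea only; such a class is not part of nonechucks):

import torch.utils.data


class MemoizedDataset(torch.utils.data.Dataset):
    """Caches whatever the wrapped dataset returns, e.g. MemoizedDataset(SafeDataset(ds))."""

    def __init__(self, dataset):
        self.dataset = dataset
        self._cache = {}  # note: each DataLoader worker holds its own copy of this cache

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        if idx not in self._cache:
            self._cache[idx] = self.dataset[idx]
        return self._cache[idx]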

msamogh (Owner) commented Jul 24, 2019

Thanks for the PR, @shrubb! I have been quite busy and haven't had time to look at it. I'll take a look soon and get back to you.

@timonbimon

Has this been resolved? :)
Nonechucks looks pretty useful, but it's very normal for us to have datasets that are much larger than RAM, so with the memory leak this would be a no-go.


aksg87 commented Sep 20, 2021

@timonbimon @aronhoff

Any solution? I just faced a memory leak with this as well!

@Erotemic

Another issue with memoizing __getitem__ is that it assumes asking for the same index will always return the same value.

For complicated training cases, executing __getitem__ twice with the same index might return wildly different batch items; it is not safe to assume that an index corresponds 1:1 with a specific item. For instance, in one of my more involved datasets, even indexes correspond to positive examples and odd indexes to negative examples, but the examples themselves are somewhat random because each index actually chooses from a pool of examples associated with it. This allows me to balance my positive/negative training cases, but it very much breaks the assumptions made in this library.
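
A minimal sketch of such a dataset (hypothetical, just to illustrate the point): the same index can legitimately return a different sample on every call, so caching the first result silently freezes the sampling.

import random

import torch.utils.data


class PooledPairDataset(torch.utils.data.Dataset):
    """Even indices draw from a positive pool, odd indices from a negative pool."""

    def __init__(self, positives, negatives):
        self.positives = positives
        self.negatives = negatives

    def __len__(self):
        return 2 * min(len(self.positives), len(self.negatives))

    def __getitem__(self, idx):
        pool = self.positives if idx % 2 == 0 else self.negatives
        return random.choice(pool)  # same idx may yield a different sample each call

Memoizing __getitem__ here would pin each index to whichever sample was drawn first, so later epochs never see the rest of the pool.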

@enric1994

Facing the same issue. Any workarounds?
