
DDP and Faiss TemporaryMemoryBuffer error #694

Open
vemchance opened this issue Apr 26, 2024 · 2 comments

@vemchance

Hi, thanks for the incredible library! We've been using pytorch-metric-learning for a task involving around 300,000 images spread across a large number of classes. We're quite new to metric learning and DDP, though.

We've been using the DDP example but we keep running into an error on the 2nd or 3rd epoch:

RuntimeError: Error in virtual void* faiss::gpu::StandardGpuResourcesImpl::allocMemory(const faiss::gpu::AllocRequest&) at /home/circleci/miniconda/conda-bld/faiss-pkg_1709244517602/work/faiss/gpu/StandardGpuResources.cpp:530: Error: 'err == cudaSuccess' failed: StandardGpuResources: alloc fail type TemporaryMemoryBuffer dev 0 space Device stream 0x295de190 size 1610612736 bytes (cudaMalloc error out of memory [2])

The code runs across six 3090 Tis and uses 60 CPUs, but for some reason, after a few epochs, Faiss can no longer allocate 1.5 GB for the test_model phase. We used to get a CUDA OOM error, but fixed that by setting the max split size to 516 MB. With this new error, the code just hangs at that point indefinitely until the job is cancelled.
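For context, the max split size we mention is the PyTorch caching-allocator option; in Python it looks something like this (516 is just the value from our run, not a recommendation):

import os

# must be set before the first CUDA allocation
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:516"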

Firstly, I wanted to ask: in the DDP example listed, is the validation (or test) dataset partitioned like the training data? Since it only happens when testing the model, I wasn't sure if this could be the issue.

Otherwise, would anyone know how to avoid this error? We've had a look through the Faiss repo as well and tried to apply most of their suggestions (e.g., reducing the batch size; updating the faiss library to the most recent version), but haven't been able to resolve it. Since it happens after a certain number of epochs, we're assuming something is accumulating on the GPU (maybe the index?), but we can't quite figure out what it might be, so any help is much appreciated.

@KevinMusgrave
Owner

KevinMusgrave commented Apr 29, 2024

Firstly, I wanted to ask: in the DDP example listed, is the validation (or test) dataset partitioned like the training data? Since it only happens when testing the model, I wasn't sure if this could be the issue.

I don't think there is a difference.

we're assuming something is accumulating on the GPU (maybe the index?)

By default the index should be set to None after each call:

if self.reset_after:
    self.reset()

I assumed the garbage collector would let go of that memory. Maybe I should be explicitly deleting something though? 🤔
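If you want to rule that out on your side, one thing to try is forcing a cleanup after each test_model call. This is only a rough sketch, and the knn_func.index attribute name is an assumption about how the faiss index is being held:

import gc

import torch

def force_release(knn_func):
    # Drop the reference to the faiss GPU index (assumed to live on knn_func.index),
    # then nudge Python and the CUDA caching allocator to actually give the memory back.
    knn_func.index = None
    gc.collect()
    torch.cuda.empty_cache()

If the error stops after doing that, then it's a reference being kept alive somewhere, and I can look at deleting it explicitly in the library.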

Otherwise, would anyone know how to avoid this error?

Unfortunately I don't know too much about faiss, and I've always found distributed training to be tricky, though I can't tell if this error has anything to do with distributed training.

Here are a couple of suggestions:

  1. Try using faiss directly, i.e. just compute k-nearest-neighbors without AccuracyCalculator or the testing functions, and see if the error goes away. That will at least narrow down the source of the problem (see the sketch after this list).
  2. Try CustomKNN with a batch size, instead of faiss:
from pytorch_metric_learning.distances import CosineSimilarity
from pytorch_metric_learning.utils.accuracy_calculator import AccuracyCalculator
from pytorch_metric_learning.utils.inference import CustomKNN

knn_func = CustomKNN(CosineSimilarity(), batch_size=32)
ac = AccuracyCalculator(include=("precision_at_1",), k=1, knn_func=knn_func)

The above code will compute k-nearest-neighbors in batches of 32 at a time, which could help with memory consumption.
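And to make suggestion 1 concrete, here's a rough sketch of computing k-nearest-neighbors with faiss on its own, outside of AccuracyCalculator. The shapes are placeholders, and the setTempMemory call is just to show that faiss lets you cap the temporary buffer that fails to allocate in your traceback:

import faiss
import numpy as np

d = 128  # embedding dimension (placeholder)
reference = np.random.rand(100000, d).astype("float32")  # stand-in for your reference embeddings
query = np.random.rand(1000, d).astype("float32")        # stand-in for your query embeddings

res = faiss.StandardGpuResources()
res.setTempMemory(256 * 1024 * 1024)  # cap the temporary buffer at 256 MB instead of the default

index = faiss.GpuIndexFlatL2(res, d)
index.add(reference)
distances, indices = index.search(query, 5)  # k = 5 nearest neighbors

If running something like this every epoch also fails after a few epochs, then the problem is on the faiss/CUDA side rather than in this library.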

@vemchance
Author

Thanks so much for the detailed response!

I reduced the batch size again very marginally and this seemed to work. However, when I added a 7th GPU, the faiss error occurred again with the lower batch size, this time in the 4th epoch. I switched back to 6 GPUs in that instance, though ideally we want to make the most of all 7 available GPUs to increase training speed.

I'm going to go ahead and try the two suggestions, and I may update this with the results if it's helpful at all.
