
DDP and Faiss TemporaryMemoryBuffer error #694

Open
vemchance opened this issue Apr 26, 2024 · 2 comments

@vemchance

Hi, thanks for the incredible library! We've been using pytorch-metric-learning for a task involving around 300,000 images spread across a large number of classes. We're quite new to metric learning and DDP, though.

We've been using the DDP example but we keep running into an error on the 2nd or 3rd epoch:

RuntimeError: Error in virtual void* faiss::gpu::StandardGpuResourcesImpl::allocMemory(const faiss::gpu::AllocRequest&) at /home/circleci/miniconda/conda-bld/faiss-pkg_1709244517602/work/faiss/gpu/StandardGpuResources.cpp:530: Error: 'err == cudaSuccess' failed: StandardGpuResources: alloc fail type TemporaryMemoryBuffer dev 0 space Device stream 0x295de190 size 1610612736 bytes (cudaMalloc error out of memory [2])

The code runs across six 3090 Tis and uses 60 CPUs, but for some reason, after a few epochs, Faiss can no longer allocate 1.5 GB for the test_model phase. We used to get a CUDA OOM error, but fixed that by setting the max split size to 516 MB. With this new error, the code just hangs at that point indefinitely until the job is cancelled.
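For context, the max split size we mention is the PyTorch caching-allocator option; in Python it looks something like this (516 is just the value from our run, not a recommendation):

import os

# must be set before the first CUDA allocation
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:516"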

Firstly, I wanted to ask: in the DDP example listed, is the validation (or test) dataset partitioned like the training data? Since it only happens when testing the model, I wasn't sure if this could be the issue.

Otherwise, would anyone know how to avoid this error? We've had a look through the Faiss repo as well and tried to apply most of their suggestions (e.g., reducing the batch size; updating the faiss library to the most recent version), but haven't been able to resolve it. Since it happens after a certain number of epochs, we're assuming something is accumulating on the GPU (maybe the index?), but we can't quite figure out what it might be, so any help is much appreciated.

@KevinMusgrave
Owner

KevinMusgrave commented Apr 29, 2024

Firstly, I wanted to ask: in the DDP example listed, is the validation (or test) dataset partitioned like the training data? Since it only happens when testing the model, I wasn't sure if this could be the issue.

I don't think there is a difference.

we're assuming something is accumulating on the GPU (maybe the index?)

By default the index should be set to None after each call:

if self.reset_after:
    self.reset()

I assumed the garbage collector would let go of that memory. Maybe I should be explicitly deleting something though? 🤔
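If you want to rule that out on your side, one thing to try is forcing a cleanup after each test_model call. This is only a rough sketch, and the knn_func.index attribute name is an assumption about how the faiss index is being held:

import gc

import torch

def force_release(knn_func):
    # Drop the reference to the faiss GPU index (assumed to live on knn_func.index),
    # then nudge Python and the CUDA caching allocator to actually give the memory back.
    knn_func.index = None
    gc.collect()
    torch.cuda.empty_cache()

If the error stops after doing that, then it's a reference being kept alive somewhere, and I can look at deleting it explicitly in the library.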

Otherwise, would anyone know how to avoid this error?

Unfortunately I don't know too much about faiss, and I've always found distributed training to be tricky, though I can't tell if this error has anything to do with distributed training.

Here are a couple of suggestions:

  1. Try using faiss directly, i.e. just compute k-nearest-neighbors without AccuracyCalculator or the testing functions, and see if the error goes away. That will at least narrow down the source of the problem (see the sketch after this list).
  2. Try CustomKNN with a batch size, instead of faiss:
from pytorch_metric_learning.distances import CosineSimilarity
from pytorch_metric_learning.utils.accuracy_calculator import AccuracyCalculator
from pytorch_metric_learning.utils.inference import CustomKNN

knn_func = CustomKNN(CosineSimilarity(), batch_size=32)
ac = AccuracyCalculator(include=("precision_at_1",), k=1, knn_func=knn_func)

The above code will compute k-nearest-neighbors in batches of 32 at a time, which could help with memory consumption.
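And to make suggestion 1 concrete, here's a rough sketch of computing k-nearest-neighbors with faiss on its own, outside of AccuracyCalculator. The shapes are placeholders, and the setTempMemory call is just to show that faiss lets you cap the temporary buffer that fails to allocate in your traceback:

import faiss
import numpy as np

d = 128  # embedding dimension (placeholder)
reference = np.random.rand(100000, d).astype("float32")  # stand-in for your reference embeddings
query = np.random.rand(1000, d).astype("float32")        # stand-in for your query embeddings

res = faiss.StandardGpuResources()
res.setTempMemory(256 * 1024 * 1024)  # cap the temporary buffer at 256 MB instead of the default

index = faiss.GpuIndexFlatL2(res, d)
index.add(reference)
distances, indices = index.search(query, 5)  # k = 5 nearest neighbors

If running something like this every epoch also fails after a few epochs, then the problem is on the faiss/CUDA side rather than in this library.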

@vemchance
Author

Thanks so much for the detailed response!

I reduced the batch size again very marginally and this seemed to work. However, when I added a 7th GPU, the faiss error occurred again with the lower batch size, this time in the 4th epoch. I switched back to 6 GPUs in that instance, though ideally we want to make the most of all 7 available GPUs to increase training speed.

I'm going to go ahead and try the two suggestions, and I may update this with the results if it's helpful at all.
