DDP and Faiss TemporaryMemoryBuffer error #694
I don't think there is a difference.
By default the index should be set to `None` after each call (see `src/pytorch_metric_learning/utils/inference.py`, lines 199–200 at commit `adfb78c`).
I assumed the garbage collector would let go of that memory. Maybe I should be explicitly deleting something though? 🤔
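For what it's worth, here is a small pure-Python sketch (using a hypothetical stand-in class rather than a real faiss index) showing that rebinding the reference to `None` does let CPython reclaim the object, which is why an explicit `del` usually isn't needed:

```python
import gc
import weakref

class FakeIndex:
    """Hypothetical stand-in for a faiss GPU index."""
    pass

index = FakeIndex()
probe = weakref.ref(index)   # lets us observe when the object dies

index = None                 # what inference.py does after each call
gc.collect()                 # defensive explicit collection

print(probe() is None)       # True: the object has been reclaimed
```

Caveat: a real faiss GPU index holds CUDA allocations outside Python's heap, and the `StandardGpuResources` temporary buffer is cached separately, so freeing the Python object alone may not return that device memory.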
Unfortunately I don't know too much about faiss, and I've always found distributed training to be tricky, though I can't tell if this error has anything to do with distributed training. Here are a couple of suggestions:
```python
from pytorch_metric_learning.distances import CosineSimilarity
from pytorch_metric_learning.utils.inference import CustomKNN

knn_func = CustomKNN(CosineSimilarity(), batch_size=32)
ac = AccuracyCalculator(include=("precision_at_1",), k=1, knn_func=knn_func)
```

The above code will compute k-nearest-neighbors in batches of 32 at a time, which could help with memory consumption.
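To illustrate what the batching buys you, here is a hedged, dependency-free sketch of batched cosine k-NN (function names are illustrative, not part of the library): only one `batch_size × n_references` block of similarities exists at a time, instead of the full `n_queries × n_references` matrix.

```python
import math

def cosine_sim(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def batched_knn(queries, references, k=1, batch_size=32):
    """Return the indices of the k most similar references for each query."""
    results = []
    for start in range(0, len(queries), batch_size):
        for q in queries[start:start + batch_size]:
            # Only this batch's similarity rows are alive at once.
            sims = [cosine_sim(q, r) for r in references]
            order = sorted(range(len(references)), key=lambda i: -sims[i])
            results.append(order[:k])
    return results
```

For example, `batched_knn([[1.0, 0.0]], [[0.0, 1.0], [0.9, 0.1]], k=1)` picks reference index 1, the vector closest in direction to the query.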
Thanks so much for the detailed response! I reduced the batch size again very slightly and this seemed to work. However, when I added a 7th GPU, the faiss error occurred again with the lower batch size in the 4th epoch. I switched back to 6 GPUs in that instance, though ideally we want to make the most of all 7 available GPUs to increase training speed. I'm going to go ahead and try the last two suggestions, and I may update this with the results if that's helpful.
Hi, thanks for the incredible library! We've been using pytorch-metric-learning for a task involving around 300,000 images spread across a large number of classes. We're quite new to metric learning and DDP, though.
We've been using the DDP example, but we keep running into an error on the 2nd or 3rd epoch:

```
RuntimeError: Error in virtual void* faiss::gpu::StandardGpuResourcesImpl::allocMemory(const faiss::gpu::AllocRequest&) at /home/circleci/miniconda/conda-bld/faiss-pkg_1709244517602/work/faiss/gpu/StandardGpuResources.cpp:530: Error: 'err == cudaSuccess' failed: StandardGpuResources: alloc fail type TemporaryMemoryBuffer dev 0 space Device stream 0x295de190 size 1610612736 bytes (cudaMalloc error out of memory [2])
```
The code runs across six 3090 Tis and uses 60 CPUs, but for some reason, after a few epochs, faiss can no longer allocate 1.5 GB for the test_model phase. We used to get a CUDA OOM error but fixed that by setting the allocator's max split size to 516 MB. With this new error, the code just hangs indefinitely until the job is cancelled.
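For readers hitting the same thing: the max-split-size tweak mentioned above is normally applied through PyTorch's caching-allocator environment variable (the 516 value here is just the one quoted in this comment, not a recommendation):

```shell
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:516
```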
Firstly, I wanted to ask: in the DDP example, is the validation (or test) dataset partitioned across GPUs like the training data? Since the error only happens when testing the model, I wasn't sure if this could be the issue.
Otherwise, would anyone know how to avoid this error? We've looked through the faiss repo as well and tried to apply most of their suggestions (e.g., reducing the batch size and updating faiss to the most recent release) but haven't been able to resolve it. Since it happens only after a certain number of epochs, we're assuming something is accumulating on the GPU (maybe the index?), but we can't quite figure out what it might be, so any help is much appreciated.