Error: command buffer exited with error status. #125954
Labels
module: intel
Specific to x86 architecture
module: macos
Mac OS related issues
module: mps
Related to Apple Metal Performance Shaders framework
needs reproduction
Someone else needs to try reproducing the issue given the instructions. No action needed from user
triaged
This issue has been looked at a team member, and triaged and prioritized into an appropriate module
馃悰 Describe the bug
I am training llama2.c on an iMac 27" with an AMD Radeon Pro 5700 XT GPU.
There are no recent nightly builds for MacOS + x86_64, so I built Pytorch from source.
I got this exception at epoch 11,580. I was able to resume training and haven't gotten the error again.
Each epoch typically take ~2500 ms, however, when I got the exception, the epoch's were taking much longer (E.g. - 64903.14ms)
Could GPU time-out errors be caused during garbage collection? Something else?
Versions
cc @malfet @albanD @frank-wei @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @kulinseth @DenisVieriu97 @jhavukainen
The text was updated successfully, but these errors were encountered: