
Error: command buffer exited with error status. #125954

Open
dbl001 opened this issue May 10, 2024 · 2 comments
Labels
module: intel: Specific to x86 architecture
module: macos: Mac OS related issues
module: mps: Related to Apple Metal Performance Shaders framework
needs reproduction: Someone else needs to try reproducing the issue given the instructions. No action needed from user.
triaged: This issue has been looked at by a team member, and triaged and prioritized into an appropriate module.

Comments


dbl001 commented May 10, 2024

🐛 Describe the bug

I am training llama2.c on an iMac 27" with an AMD Radeon Pro 5700 XT GPU.
There are no recent nightly builds for macOS + x86_64, so I built PyTorch from source.
I got this exception at iteration 11,580. I was able to resume training and haven't hit the error again.
Each iteration typically takes ~2500 ms; however, around the time of the exception, some iterations were taking much longer (e.g. 64903.14 ms).

step 11500: train loss 3.2412, val loss 5.7422
saving checkpoint to out
wrote out/model.bin
11500 | loss 7.6908 | lr 2.899000e-05 | 3647545.51ms | mfu 0.45%
11510 | loss 7.5400 | lr 2.895835e-05 | 65127.72ms | mfu 0.40%
11520 | loss 7.5121 | lr 2.892669e-05 | 2504.32ms | mfu 0.42%
11530 | loss 7.1798 | lr 2.889503e-05 | 2536.12ms | mfu 0.43%
11540 | loss 7.5530 | lr 2.886336e-05 | 64845.53ms | mfu 0.39%
11550 | loss 7.3821 | lr 2.883169e-05 | 64852.63ms | mfu 0.35%
11560 | loss 7.3344 | lr 2.880000e-05 | 2569.23ms | mfu 0.37%
11570 | loss 7.3546 | lr 2.876832e-05 | 64916.63ms | mfu 0.34%
11580 | loss 7.1987 | lr 2.873662e-05 | 64903.14ms | mfu 0.31%
Error: command buffer exited with error status.
	The Metal Performance Shaders operations encoded on it may not have completed.
	Error: 
	(null)
	Caused GPU Timeout Error (00000002:kIOAccelCommandBufferCallbackErrorTimeout)
	<GFX10_MtlCmdBuffer: 0x7f7bed7a9800>
    label = <none> 
    device = <GFX10_MtlDevice: 0x7f7d30118000>
        name = AMD Radeon Pro 5700 XT 
    commandQueue = <GFXAAMD_MtlCmdQueue: 0x7f7d398a8cb0>
        label = <none> 
        device = <GFX10_MtlDevice: 0x7f7d30118000>
            name = AMD Radeon Pro 5700 XT 
    retainedReferences = 1
Error: command buffer exited with error status.
	The Metal Performance Shaders operations encoded on it may not have completed.
	Error: 
	(null)
	Ignored (for causing prior/excessive GPU errors) (00000004:kIOAccelCommandBufferCallbackErrorSubmissionsIgnored)
	<GFX10_MtlCmdBuffer: 0x7f7bd219b800>
    label = <none> 
    device = <GFX10_MtlDevice: 0x7f7d30118000>
        name = AMD Radeon Pro 5700 XT 
    commandQueue = <GFXAAMD_MtlCmdQueue: 0x7f7d398a8cb0>
        label = <none> 
        device = <GFX10_MtlDevice: 0x7f7d30118000>
            name = AMD Radeon Pro 5700 XT 
    retainedReferences = 1
Error: command buffer exited with error status.
	The Metal Performance Shaders operations encoded on it may not have completed.
	Error: 
	(null)
	Ignored (for causing prior/excessive GPU errors) (00000004:kIOAccelCommandBufferCallbackErrorSubmissionsIgnored)
	<GFX10_MtlCmdBuffer: 0x7f7bd219b800>
    label = <none> 
    device = <GFX10_MtlDevice: 0x7f7d30118000>
        name = AMD Radeon Pro 5700 XT 
    commandQueue = <GFXAAMD_MtlCmdQueue: 0x7f7d398a8cb0>
        label = <none> 
        device = <GFX10_MtlDevice: 0x7f7d30118000>
            name = AMD Radeon Pro 5700 XT 
    retainedReferences = 1

...

Could the GPU time-out errors be caused by garbage collection? Something else?
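One way to test the garbage-collection hypothesis (a sketch only; `train_step` here is a placeholder for the real loop body, not llama2.c code) is to disable Python's cyclic GC inside the hot loop and trigger it explicitly at the logging interval, then check whether the 64-second spikes move or disappear:

```python
import gc
import time

def train_step():
    # Placeholder for the real forward/backward/optimizer step.
    time.sleep(0.001)

gc.disable()  # rule out cyclic-GC pauses inside the hot loop
try:
    for step in range(100):
        t0 = time.perf_counter()
        train_step()
        dt_ms = (time.perf_counter() - t0) * 1000.0
        if step % 10 == 0:
            gc.collect()  # pay the GC cost at a known, logged point instead
            print(f"{step} | {dt_ms:.2f}ms")
finally:
    gc.enable()
```

If the spikes persist with cyclic GC disabled, the delays are coming from somewhere else (e.g. the driver or the allocator) rather than from Python-side collection.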

Versions

% python collect_env.py
Collecting environment information...
PyTorch version: N/A
Is debug build: N/A
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: N/A

OS: macOS 14.4.1 (x86_64)
GCC version: Could not collect
Clang version: 14.0.6
CMake version: version 3.22.1
Libc version: N/A

Python version: 3.10.13 (main, Sep 11 2023, 08:21:04) [Clang 14.0.6 ] (64-bit runtime)
Python platform: macOS-10.16-x86_64-i386-64bit
Is CUDA available: N/A
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: N/A

CPU:
Intel(R) Core(TM) i7-10700K CPU @ 3.80GHz

Versions of relevant libraries:
[pip3] audiolm-pytorch==0.0.1
[pip3] configmypy==0.1.0
[pip3] mypy==1.4.1
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.26.4
[pip3] onnxruntime==1.17.1
[pip3] optree==0.11.0
[pip3] pytorch-transformers==1.1.0
[pip3] tensorly-torch==0.4.0
[pip3] torch==2.2.2
[pip3] torch-cluster==1.6.1
[pip3] torch-harmonics==0.6.5
[pip3] torch-scatter==2.1.1
[pip3] torch-sparse==0.6.17
[pip3] torch-spline-conv==1.2.2
[pip3] torch-struct==0.5
[pip3] torch-summary==1.4.5
[pip3] torch-utils==0.1.2
[pip3] torchaudio==2.2.2
[pip3] torchdata==0.7.1
[pip3] torchtext==0.17.2
[pip3] torchtraining-nightly==1604016577
[pip3] torchvision==0.17.2
[pip3] triton==2.1.0
[pip3] vector-quantize-pytorch==0.9.2
[conda] mkl                       2023.2.1                 pypi_0    pypi
[conda] nomkl                     3.0                           0  
[conda] numpy                     1.26.4          py310hf6dca73_0  
[conda] numpy-base                1.26.4          py310hd8f4981_0  
[conda] optree                    0.11.0                   pypi_0    pypi
[conda] pytorch-transformers      1.1.0                    pypi_0    pypi
[conda] tensorly-torch            0.4.0                    pypi_0    pypi
[conda] torch                     2.4.0a0+git409b1a6          pypi_0    pypi
[conda] torch-cluster             1.6.1                    pypi_0    pypi
[conda] torch-harmonics           0.6.5                    pypi_0    pypi
[conda] torch-scatter             2.1.1                    pypi_0    pypi
[conda] torch-sparse              0.6.17                   pypi_0    pypi
[conda] torch-spline-conv         1.2.2                    pypi_0    pypi
[conda] torch-struct              0.5                      pypi_0    pypi
[conda] torch-summary             1.4.5                    pypi_0    pypi
[conda] torch-utils               0.1.2                    pypi_0    pypi
[conda] torchaudio                2.2.2                    pypi_0    pypi
[conda] torchdata                 0.7.1                    pypi_0    pypi
[conda] torchtext                 0.17.2                   pypi_0    pypi
[conda] torchtraining-nightly     1604016577               pypi_0    pypi
[conda] torchvision               0.17.2                   pypi_0    pypi
[conda] triton                    2.1.0                    pypi_0    pypi
[conda] vector-quantize-pytorch   0.9.2                    pypi_0    pypi

cc @malfet @albanD @frank-wei @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @kulinseth @DenisVieriu97 @jhavukainen

@malfet added the needs reproduction, module: macos, module: intel, and triaged labels May 10, 2024
malfet (Contributor) commented May 10, 2024

Can you provide some sort of minimal reproducer? To the best of my knowledge, llama2.c does not use PyTorch in any way (nor does it use GPU acceleration).
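A minimal reproducer skeleton might look something like this (a sketch under assumptions: a recent PyTorch build with MPS support; the tiny model and dimensions are arbitrary stand-ins, not the reporter's 12-layer/12-head configuration):

```python
import torch
import torch.nn as nn

# Fall back to CPU on machines without an MPS device.
device = "mps" if torch.backends.mps.is_available() else "cpu"

# Tiny stand-in for the llama2.c training loop: a few dense layers,
# an optimizer, and repeated forward/backward steps on the device.
model = nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64)).to(device)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

for step in range(20):
    x = torch.randn(32, 64, device=device)
    loss = (model(x) - x).pow(2).mean()  # reconstruct the input as a dummy task
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"final loss on {device}: {loss.item():.4f}")
```

Scaling the layer sizes and step count up until the GPU timeout reappears would narrow down whether the failure is workload-dependent.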

@malfet added the module: mps label May 10, 2024
dbl001 (Author) commented May 11, 2024

llama2.c uses PyTorch when training models. The inference part (e.g. 'run.c') does NOT use PyTorch.
https://github.com/karpathy/llama2.c

Here's an example of the training process using the tinystories dataset.

$ python tinystories.py download
$ python tinystories.py train_vocab --vocab_size=4096
$ python tinystories.py pretokenize --vocab_size=4096
$ python train.py --vocab_source=custom --vocab_size=4096

I used a dataset generated from COVID-19 research papers.
https://www.kaggle.com/datasets/allen-institute-for-ai/CORD-19-research-challenge/data

The exception was generated while training a Llama2 model with 12 layers and 12 heads, with device='mps', on 801,915 research papers. The exception only happened once during 25,000 training iterations.

(Screenshots attached: 2024-05-10 7:01:28 PM and 2024-05-11 7:56:21 AM)

output.txt

Do you know what could cause this exception? (e.g., garbage collection taking too long?)
Why the long times on steps 11540, 11550, 11570, and 11580?

11520 | loss 7.5121 | lr 2.892669e-05 | 2504.32ms | mfu 0.42%
11530 | loss 7.1798 | lr 2.889503e-05 | 2536.12ms | mfu 0.43%
11540 | loss 7.5530 | lr 2.886336e-05 | 64845.53ms | mfu 0.39%
11550 | loss 7.3821 | lr 2.883169e-05 | 64852.63ms | mfu 0.35%
11560 | loss 7.3344 | lr 2.880000e-05 | 2569.23ms | mfu 0.37%
11570 | loss 7.3546 | lr 2.876832e-05 | 64916.63ms | mfu 0.34%
11580 | loss 7.1987 | lr 2.873662e-05 | 64903.14ms | mfu 0.31%
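One thing worth checking about the uneven step times: MPS kernels are queued asynchronously, so wall-clock timings that don't synchronize can lump several steps' GPU work into whichever step happens to force a flush. A minimal timing sketch (assumes PyTorch ≥ 2.0 for `torch.mps.synchronize()`; falls back to CPU where MPS is unavailable):

```python
import time
import torch

device = "mps" if torch.backends.mps.is_available() else "cpu"
x = torch.randn(256, 256, device=device)

def timed_step(step_fn):
    # On MPS, kernels are queued asynchronously; without an explicit sync,
    # wall-clock time can attribute many steps' GPU work to whichever step
    # happens to force a flush (e.g. loss.item() for logging).
    start = time.perf_counter()
    step_fn()
    if device == "mps":
        torch.mps.synchronize()  # wait for all queued Metal work to finish
    return (time.perf_counter() - start) * 1000.0

ms = timed_step(lambda: x @ x)
print(f"step time on {device}: {ms:.2f} ms")
```

If synchronized per-step timings come out roughly uniform, the 64-second readings are an artifact of where the queue flushes rather than of individual slow steps.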

I built PyTorch with USE_MINALLOC set to TRUE. Could this explain the delays?
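To see whether the slow steps correlate with allocator growth, the MPS allocator counters could be logged around the loop (a sketch; `torch.mps.current_allocated_memory()` and `torch.mps.driver_allocated_memory()` exist in PyTorch 2.x, and the numbers are only meaningful when an MPS device is present):

```python
import torch

def mps_memory_mb():
    # Returns (tensor-allocated, driver-allocated) memory in MB,
    # or zeros on machines without an MPS device.
    if not torch.backends.mps.is_available():
        return 0.0, 0.0
    return (torch.mps.current_allocated_memory() / 2**20,
            torch.mps.driver_allocated_memory() / 2**20)

alloc_mb, driver_mb = mps_memory_mb()
print(f"mps allocated: {alloc_mb:.1f} MB, driver: {driver_mb:.1f} MB")
```

If driver-allocated memory keeps climbing while tensor-allocated memory stays flat, the delays may come from the driver paging or reallocating rather than from Python-side garbage collection or the custom allocator build flag.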
