
Error: command buffer exited with error status. #125954

Open
dbl001 opened this issue May 10, 2024 · 2 comments
Labels
module: intel: Specific to x86 architecture
module: macos: Mac OS related issues
module: mps: Related to Apple Metal Performance Shaders framework
needs reproduction: Someone else needs to try reproducing the issue given the instructions. No action needed from user.
triaged: This issue has been looked at by a team member, and triaged and prioritized into an appropriate module.

Comments


dbl001 commented May 10, 2024

🐛 Describe the bug

I am training llama2.c on an iMac 27" with an AMD Radeon Pro 5700 XT GPU.
There are no recent nightly builds for macOS + x86_64, so I built PyTorch from source.
I got this exception at iteration 11,580. I was able to resume training and haven't hit the error again.
Each iteration typically takes ~2500 ms; however, around the time of the exception, some iterations were taking much longer (e.g. 64903.14 ms).

step 11500: train loss 3.2412, val loss 5.7422
saving checkpoint to out
wrote out/model.bin
11500 | loss 7.6908 | lr 2.899000e-05 | 3647545.51ms | mfu 0.45%
11510 | loss 7.5400 | lr 2.895835e-05 | 65127.72ms | mfu 0.40%
11520 | loss 7.5121 | lr 2.892669e-05 | 2504.32ms | mfu 0.42%
11530 | loss 7.1798 | lr 2.889503e-05 | 2536.12ms | mfu 0.43%
11540 | loss 7.5530 | lr 2.886336e-05 | 64845.53ms | mfu 0.39%
11550 | loss 7.3821 | lr 2.883169e-05 | 64852.63ms | mfu 0.35%
11560 | loss 7.3344 | lr 2.880000e-05 | 2569.23ms | mfu 0.37%
11570 | loss 7.3546 | lr 2.876832e-05 | 64916.63ms | mfu 0.34%
11580 | loss 7.1987 | lr 2.873662e-05 | 64903.14ms | mfu 0.31%
Error: command buffer exited with error status.
	The Metal Performance Shaders operations encoded on it may not have completed.
	Error: 
	(null)
	Caused GPU Timeout Error (00000002:kIOAccelCommandBufferCallbackErrorTimeout)
	<GFX10_MtlCmdBuffer: 0x7f7bed7a9800>
    label = <none> 
    device = <GFX10_MtlDevice: 0x7f7d30118000>
        name = AMD Radeon Pro 5700 XT 
    commandQueue = <GFXAAMD_MtlCmdQueue: 0x7f7d398a8cb0>
        label = <none> 
        device = <GFX10_MtlDevice: 0x7f7d30118000>
            name = AMD Radeon Pro 5700 XT 
    retainedReferences = 1
Error: command buffer exited with error status.
	The Metal Performance Shaders operations encoded on it may not have completed.
	Error: 
	(null)
	Ignored (for causing prior/excessive GPU errors) (00000004:kIOAccelCommandBufferCallbackErrorSubmissionsIgnored)
	<GFX10_MtlCmdBuffer: 0x7f7bd219b800>
    label = <none> 
    device = <GFX10_MtlDevice: 0x7f7d30118000>
        name = AMD Radeon Pro 5700 XT 
    commandQueue = <GFXAAMD_MtlCmdQueue: 0x7f7d398a8cb0>
        label = <none> 
        device = <GFX10_MtlDevice: 0x7f7d30118000>
            name = AMD Radeon Pro 5700 XT 
    retainedReferences = 1
Error: command buffer exited with error status.
	The Metal Performance Shaders operations encoded on it may not have completed.
	Error: 
	(null)
	Ignored (for causing prior/excessive GPU errors) (00000004:kIOAccelCommandBufferCallbackErrorSubmissionsIgnored)
	<GFX10_MtlCmdBuffer: 0x7f7bd219b800>
    label = <none> 
    device = <GFX10_MtlDevice: 0x7f7d30118000>
        name = AMD Radeon Pro 5700 XT 
    commandQueue = <GFXAAMD_MtlCmdQueue: 0x7f7d398a8cb0>
        label = <none> 
        device = <GFX10_MtlDevice: 0x7f7d30118000>
            name = AMD Radeon Pro 5700 XT 
    retainedReferences = 1

...

Could the GPU time-out errors be caused by garbage collection? Something else?
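One way to test the garbage-collection hypothesis (a sketch only; `train_step` here is a placeholder for the real loop body, not llama2.c code) is to disable Python's cyclic GC inside the hot loop and trigger it explicitly at the logging interval, then check whether the 64-second spikes move or disappear:

```python
import gc
import time

def train_step():
    # Placeholder for the real forward/backward/optimizer step.
    time.sleep(0.001)

gc.disable()  # rule out cyclic-GC pauses inside the hot loop
try:
    for step in range(100):
        t0 = time.perf_counter()
        train_step()
        dt_ms = (time.perf_counter() - t0) * 1000.0
        if step % 10 == 0:
            gc.collect()  # pay the GC cost at a known, logged point instead
            print(f"{step} | {dt_ms:.2f}ms")
finally:
    gc.enable()
```

If the spikes persist with cyclic GC disabled, the delays are coming from somewhere else (e.g. the driver or the allocator) rather than from Python-side collection.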

Versions

% python collect_env.py
Collecting environment information...
PyTorch version: N/A
Is debug build: N/A
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: N/A

OS: macOS 14.4.1 (x86_64)
GCC version: Could not collect
Clang version: 14.0.6
CMake version: version 3.22.1
Libc version: N/A

Python version: 3.10.13 (main, Sep 11 2023, 08:21:04) [Clang 14.0.6 ] (64-bit runtime)
Python platform: macOS-10.16-x86_64-i386-64bit
Is CUDA available: N/A
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: N/A

CPU:
Intel(R) Core(TM) i7-10700K CPU @ 3.80GHz

Versions of relevant libraries:
[pip3] audiolm-pytorch==0.0.1
[pip3] configmypy==0.1.0
[pip3] mypy==1.4.1
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.26.4
[pip3] onnxruntime==1.17.1
[pip3] optree==0.11.0
[pip3] pytorch-transformers==1.1.0
[pip3] tensorly-torch==0.4.0
[pip3] torch==2.2.2
[pip3] torch-cluster==1.6.1
[pip3] torch-harmonics==0.6.5
[pip3] torch-scatter==2.1.1
[pip3] torch-sparse==0.6.17
[pip3] torch-spline-conv==1.2.2
[pip3] torch-struct==0.5
[pip3] torch-summary==1.4.5
[pip3] torch-utils==0.1.2
[pip3] torchaudio==2.2.2
[pip3] torchdata==0.7.1
[pip3] torchtext==0.17.2
[pip3] torchtraining-nightly==1604016577
[pip3] torchvision==0.17.2
[pip3] triton==2.1.0
[pip3] vector-quantize-pytorch==0.9.2
[conda] mkl                       2023.2.1                 pypi_0    pypi
[conda] nomkl                     3.0                           0  
[conda] numpy                     1.26.4          py310hf6dca73_0  
[conda] numpy-base                1.26.4          py310hd8f4981_0  
[conda] optree                    0.11.0                   pypi_0    pypi
[conda] pytorch-transformers      1.1.0                    pypi_0    pypi
[conda] tensorly-torch            0.4.0                    pypi_0    pypi
[conda] torch                     2.4.0a0+git409b1a6          pypi_0    pypi
[conda] torch-cluster             1.6.1                    pypi_0    pypi
[conda] torch-harmonics           0.6.5                    pypi_0    pypi
[conda] torch-scatter             2.1.1                    pypi_0    pypi
[conda] torch-sparse              0.6.17                   pypi_0    pypi
[conda] torch-spline-conv         1.2.2                    pypi_0    pypi
[conda] torch-struct              0.5                      pypi_0    pypi
[conda] torch-summary             1.4.5                    pypi_0    pypi
[conda] torch-utils               0.1.2                    pypi_0    pypi
[conda] torchaudio                2.2.2                    pypi_0    pypi
[conda] torchdata                 0.7.1                    pypi_0    pypi
[conda] torchtext                 0.17.2                   pypi_0    pypi
[conda] torchtraining-nightly     1604016577               pypi_0    pypi
[conda] torchvision               0.17.2                   pypi_0    pypi
[conda] triton                    2.1.0                    pypi_0    pypi
[conda] vector-quantize-pytorch   0.9.2                    pypi_0    pypi

cc @malfet @albanD @frank-wei @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @kulinseth @DenisVieriu97 @jhavukainen

@malfet added the needs reproduction, module: macos, module: intel, and triaged labels May 10, 2024
malfet (Contributor) commented May 10, 2024

Can you provide some sort of minimal reproducer? To the best of my knowledge, llama2.c does not use PyTorch in any way (nor does it use GPU acceleration).
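A minimal reproducer skeleton might look something like this (a sketch under assumptions: a recent PyTorch build with MPS support; the tiny model and dimensions are arbitrary stand-ins, not the reporter's 12-layer/12-head configuration):

```python
import torch
import torch.nn as nn

# Fall back to CPU on machines without an MPS device.
device = "mps" if torch.backends.mps.is_available() else "cpu"

# Tiny stand-in for the llama2.c training loop: a few dense layers,
# an optimizer, and repeated forward/backward steps on the device.
model = nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64)).to(device)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

for step in range(20):
    x = torch.randn(32, 64, device=device)
    loss = (model(x) - x).pow(2).mean()  # reconstruct the input as a dummy task
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"final loss on {device}: {loss.item():.4f}")
```

Scaling the layer sizes and step count up until the GPU timeout reappears would narrow down whether the failure is workload-dependent.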

@malfet added the module: mps label May 10, 2024
dbl001 (Author) commented May 11, 2024

llama2.c uses PyTorch when training models. The inference part (e.g. 'run.c') does NOT use PyTorch.
https://github.com/karpathy/llama2.c

Here's an example of the training process using the tinystories dataset.

$ python tinystories.py download
$ python tinystories.py train_vocab --vocab_size=4096
$ python tinystories.py pretokenize --vocab_size=4096
$ python train.py --vocab_source=custom --vocab_size=4096

I used a dataset generated from COVID-19 research papers.
https://www.kaggle.com/datasets/allen-institute-for-ai/CORD-19-research-challenge/data

The exception was generated while training a Llama2 model with 12 layers and 12 heads, with device='mps', on 801,915 research papers. The exception only happened once during 25,000 training iterations.

(Screenshots attached: 2024-05-10 7:01:28 PM and 2024-05-11 7:56:21 AM)

output.txt

Do you know what could cause this exception? (e.g., garbage collection taking too long?)
Why the long times on steps 11540, 11550, 11570, and 11580?

11520 | loss 7.5121 | lr 2.892669e-05 | 2504.32ms | mfu 0.42%
11530 | loss 7.1798 | lr 2.889503e-05 | 2536.12ms | mfu 0.43%
11540 | loss 7.5530 | lr 2.886336e-05 | 64845.53ms | mfu 0.39%
11550 | loss 7.3821 | lr 2.883169e-05 | 64852.63ms | mfu 0.35%
11560 | loss 7.3344 | lr 2.880000e-05 | 2569.23ms | mfu 0.37%
11570 | loss 7.3546 | lr 2.876832e-05 | 64916.63ms | mfu 0.34%
11580 | loss 7.1987 | lr 2.873662e-05 | 64903.14ms | mfu 0.31%
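One thing worth checking about the uneven step times: MPS kernels are queued asynchronously, so wall-clock timings that don't synchronize can lump several steps' GPU work into whichever step happens to force a flush. A minimal timing sketch (assumes PyTorch ≥ 2.0 for `torch.mps.synchronize()`; falls back to CPU where MPS is unavailable):

```python
import time
import torch

device = "mps" if torch.backends.mps.is_available() else "cpu"
x = torch.randn(256, 256, device=device)

def timed_step(step_fn):
    # On MPS, kernels are queued asynchronously; without an explicit sync,
    # wall-clock time can attribute many steps' GPU work to whichever step
    # happens to force a flush (e.g. loss.item() for logging).
    start = time.perf_counter()
    step_fn()
    if device == "mps":
        torch.mps.synchronize()  # wait for all queued Metal work to finish
    return (time.perf_counter() - start) * 1000.0

ms = timed_step(lambda: x @ x)
print(f"step time on {device}: {ms:.2f} ms")
```

If synchronized per-step timings come out roughly uniform, the 64-second readings are an artifact of where the queue flushes rather than of individual slow steps.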

I built PyTorch with USE_MINALLOC set to TRUE. Could this explain the delays?
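To see whether the slow steps correlate with allocator growth, the MPS allocator counters could be logged around the loop (a sketch; `torch.mps.current_allocated_memory()` and `torch.mps.driver_allocated_memory()` exist in PyTorch 2.x, and the numbers are only meaningful when an MPS device is present):

```python
import torch

def mps_memory_mb():
    # Returns (tensor-allocated, driver-allocated) memory in MB,
    # or zeros on machines without an MPS device.
    if not torch.backends.mps.is_available():
        return 0.0, 0.0
    return (torch.mps.current_allocated_memory() / 2**20,
            torch.mps.driver_allocated_memory() / 2**20)

alloc_mb, driver_mb = mps_memory_mb()
print(f"mps allocated: {alloc_mb:.1f} MB, driver: {driver_mb:.1f} MB")
```

If driver-allocated memory keeps climbing while tensor-allocated memory stays flat, the delays may come from the driver paging or reallocating rather than from Python-side garbage collection or the custom allocator build flag.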
