NCCL Slowdown caused by aws-ofi-nccl conflict #284

Open
sean-smith opened this issue Apr 25, 2024 · 1 comment
Labels
Troubleshooting Tips These are informational to make it easier to troubleshoot common issues.

Comments

@sean-smith
Contributor

sean-smith commented Apr 25, 2024

If you experience an NCCL slowdown, the first step is to enable debug logging:

export NCCL_DEBUG=INFO

This lets you catch any misconfigurations in the logs. For example, you might see:

version `EFA_1.2' not found (required by /opt/amazon/efa/lib/libfabric.so.1) No plugin found (libnccl-net.so), using internal implementation

This likely means NCCL is pulling in a build of the aws-ofi-nccl plugin that was not compiled against the system libfabric, so the plugin fails to load and NCCL falls back to its internal implementation. You can check this (assuming you're using conda) by running:

conda list | grep -E "nvidia|nccl|cud|torch"

If this shows something like:

nvidia-nccl-cu12           2.19.3                    pypi_0    pypi

This version is likely being pulled in as a dependency and isn't working properly. You can override it and install aws-ofi-nccl from Amazon PyTorch like so:

conda install -y \
    aws-ofi-nccl \
    --override-channels \
    -c https://aws-ml-conda.s3.us-west-2.amazonaws.com/ \
    -c nvidia -c conda-forge

The "version `EFA_1.2' not found" error should no longer appear in the logs.
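
To confirm the result, one option is to save the NCCL_DEBUG=INFO output to a file and scan it for the two messages quoted earlier. The following is only a minimal sketch: `check_nccl_log` is a hypothetical helper (not part of NCCL or aws-ofi-nccl), and the log file name in the usage comment is an assumption.

```shell
# check_nccl_log: hypothetical helper, not shipped with NCCL or aws-ofi-nccl.
# Scans a saved NCCL_DEBUG=INFO log for the two messages quoted in this issue.
# Returns 0 if neither is present, 1 if either is found, 2 if the log is unreadable.
check_nccl_log() {
    log_file="$1"
    rc=0

    if [ ! -r "$log_file" ]; then
        echo "cannot read log file: $log_file" >&2
        return 2
    fi

    # The "version `EFA_1.2' not found" error involving libfabric.
    if grep -qF "EFA_1.2' not found" "$log_file"; then
        echo "found: EFA_1.2 version error"
        rc=1
    fi

    # NCCL reporting that it is using its internal implementation instead of the plugin.
    if grep -qF "No plugin found (libnccl-net.so)" "$log_file"; then
        echo "found: plugin not loaded"
        rc=1
    fi

    if [ "$rc" -eq 0 ]; then
        echo "ok: neither message present"
    fi
    return "$rc"
}

# Example usage (assumes the job output was captured to nccl.log, e.g. with tee):
#   check_nccl_log nccl.log
```

A non-zero exit status makes the helper easy to use as a guard in a launch script, so a job can stop early rather than run on the slower fallback path.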

@sean-smith sean-smith added the Troubleshooting Tips These are informational to make it easier to troubleshoot common issues. label Apr 25, 2024
@verdimrc
Contributor

In practice, does this happen only with certain PyTorch builds?

Has it ever happened with nccl-tests?
