
"Cannot dlopen some GPU libraries." does not List what Libraries Failed to Load #66987

Closed
stellarpower opened this issue May 5, 2024 · 8 comments
Assignees
Labels
comp:gpu GPU related issues stale This label marks the issue/pr stale - to be closed automatically if no activity stat:awaiting response Status - Awaiting response from author type:build/install Build and install issues type:feature Feature requests

Comments

@stellarpower

Issue type

Feature Request

Have you reproduced the bug with TensorFlow Nightly?

Yes

Source

binary

TensorFlow version

tf-nightly 2.17.0.dev20240504

Custom code

No

OS platform and distribution

Ubuntu Jammy

Mobile device

No response

Python version

3.12

Bazel version

No response

GCC/compiler version

No response

CUDA/cuDNN version

No response

GPU model and memory

No response

Current behavior?

I have installed tf-nightly from the official PYPI package, like so:

pip install tf-nightly[and-cuda]   

When I load TensorFlow, it isn't seeing my GPU, and I get the message

Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.

I normally use the conda-forge packages, in part precisely because they should handle some of these things for me so I don't have to worry. But I saw pip installing a large number of CUDA libraries during the process, so I'd expect most of what I need to be there.

The function MaybeTryDlopenGPULibraries() is responsible for attempting to load the required libraries at runtime; however, it doesn't tell me which libraries it tried to find, what search path it was using, etc. As I've followed the steps in the guide at that URL, it's not the most helpful diagnostic message without further information.

Whilst short, and therefore not cluttering the screen (which may be desirable in many situations), the message isn't very helpful for working out what the problem is. On modern, complex systems, library search paths can be pretty finicky to work out, so if not the default behaviour, I'd at least like a flag or environment variable I can set to see which library loads were attempted, which succeeded (and from what path), and which were missing, in addition to other debugging output. If the short form of the message is kept as the default behaviour, then it would be good for it to print how to set this option, so that I can go round again and get more verbose output.

Thanks

Standalone code to reproduce the issue

See above.
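A minimal check along these lines should reproduce it (tf.config.list_physical_devices is the standard API for checking device visibility):

pip install tf-nightly[and-cuda]
# An empty list here, together with the "Cannot dlopen some GPU libraries." warning, reproduces the issue.
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"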

Relevant log output

No response

@google-ml-butler google-ml-butler bot added the type:feature Feature requests label May 5, 2024
@SuryanarayanaY SuryanarayanaY added type:build/install Build and install issues comp:gpu GPU related issues labels May 6, 2024
@SuryanarayanaY
Collaborator

Hi @stellarpower ,

You need to install the GPU driver manually. After that, you need to set LD_LIBRARY_PATH to the path where the NVIDIA libraries are installed. You may refer to this comment. Please refer to #63362 for more details. Thanks.
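For example, something along these lines for the pip-installed cuDNN wheel (only a sketch; the exact location depends on your environment):

# Illustrative sketch: locate the nvidia.cudnn package installed by pip and expose its libraries.
CUDNN_PATH=$(dirname $(python -c "import nvidia.cudnn; print(nvidia.cudnn.__file__)"))
export LD_LIBRARY_PATH=${CUDNN_PATH}/lib:$LD_LIBRARY_PATH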

@SuryanarayanaY SuryanarayanaY added the stat:awaiting response Status - Awaiting response from author label May 6, 2024
@stellarpower
Author

Thanks; I had done all this previously.

But I have opened this as an issue irrespective of my own setup, because I believe it should be possible to get more information from the error message. Without knowing which libraries failed to be opened, just re-installing and following the instructions again isn't a particularly efficient way to debug what happened.

@google-ml-butler google-ml-butler bot removed the stat:awaiting response Status - Awaiting response from author label May 6, 2024
@wjn0

wjn0 commented May 8, 2024

Yes, I agree. I am running into the same issue now. This is particularly frustrating because of the arcane versioning of CUDA-related toolsets (i.e. the Python packages vs. CUDA vs. the dependency matrix in the documentation). For example:

  • The TensorFlow documentation lists the correct CUDA version as 11.8, so I installed that and updated my $PATH, $LD_LIBRARY_PATH, etc. accordingly (along with cuDNN 8.6 as listed).
  • When I use a fresh Python 3.10 installation to install tensorflow[and-cuda] via pip, it seems to be defaulting to CUDA runtime 12?
Collecting nvidia-cublas-cu12==12.3.4.1
  Using cached nvidia_cublas_cu12-12.3.4.1-py3-none-manylinux1_x86_64.whl (412.6 MB)
Collecting nvidia-cuda-nvrtc-cu12==12.3.107
  Using cached nvidia_cuda_nvrtc_cu12-12.3.107-py3-none-manylinux1_x86_64.whl (24.9 MB)
Collecting nvidia-curand-cu12==10.3.4.107
  Using cached nvidia_curand_cu12-10.3.4.107-py3-none-manylinux1_x86_64.whl (56.3 MB)
Collecting nvidia-cusparse-cu12==12.2.0.103
  Using cached nvidia_cusparse_cu12-12.2.0.103-py3-none-manylinux1_x86_64.whl (197.5 MB)
Collecting nvidia-nvjitlink-cu12==12.3.101
  Using cached nvidia_nvjitlink_cu12-12.3.101-py3-none-manylinux1_x86_64.whl (20.5 MB)
Collecting nvidia-nccl-cu12==2.19.3
  Using cached nvidia_nccl_cu12-2.19.3-py3-none-manylinux1_x86_64.whl (166.0 MB)
Collecting nvidia-cuda-nvcc-cu12==12.3.107
  Using cached nvidia_cuda_nvcc_cu12-12.3.107-py3-none-manylinux1_x86_64.whl (22.0 MB)
Collecting nvidia-cusolver-cu12==11.5.4.101
  Using cached nvidia_cusolver_cu12-11.5.4.101-py3-none-manylinux1_x86_64.whl (125.2 MB)
Collecting nvidia-cudnn-cu12==8.9.7.29
  Using cached nvidia_cudnn_cu12-8.9.7.29-py3-none-manylinux1_x86_64.whl (704.7 MB)
Collecting nvidia-cufft-cu12==11.0.12.1
  Using cached nvidia_cufft_cu12-11.0.12.1-py3-none-manylinux1_x86_64.whl (98.8 MB)
Collecting nvidia-cuda-cupti-cu12==12.3.101
  Using cached nvidia_cuda_cupti_cu12-12.3.101-py3-none-manylinux1_x86_64.whl (14.0 MB)
Collecting nvidia-cuda-runtime-cu12==12.3.101
  Using cached nvidia_cuda_runtime_cu12-12.3.101-py3-none-manylinux1_x86_64.whl (867 kB)

Also, relevant links in the docs only seem to link out to Docker-related material, like https://www.tensorflow.org/install/source, so the vast majority of information on the internet is out of date.

Is there any clearer guidance for how to get TensorFlow working on GPUs assuming your CUDA install is non-standard, i.e., not installed out of the Ubuntu package repo (which is infeasible in many academic settings)?
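One way to sanity-check which CUDA and cuDNN versions a given wheel expects (assuming the package at least imports on CPU) is the build-info API:

# Prints the CUDA/cuDNN versions the installed TensorFlow wheel was built against.
python -c "import tensorflow as tf; print(tf.sysconfig.get_build_info())"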

Thanks very much in advance.

EDIT: I was able to resolve this by using TF_CPP_MAX_VLOG_LEVEL=3 (something buried in the above-linked issue) to debug. It turned out that our new module system was nuking my LD_LIBRARY_PATH after cuDNN was imported, so CUDA could be found but cuDNN could not. Adding a note about this option to the error message around GPUs could potentially save a lot of grief (even in scenarios like mine where the issue lies not with TensorFlow, but something upstream). Just a thought. May help you as well @stellarpower (seems to be what you were looking for when you opened the issue).
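For reference, the debugging invocation was along these lines (level 3 is simply the value that worked here; the individual dlopen attempts and the paths searched show up in the verbose output):

# Enable verbose C++ logging while importing TensorFlow and listing GPUs.
TF_CPP_MAX_VLOG_LEVEL=3 python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"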

@stellarpower
Author

@wjn0 thanks - I resolved the underlying problem in the end, and from memory I thought I had increased the log verbosity as high as it would go, but maybe I had not. If I encounter some library problems again I'll give it a go. Cheers!

@SuryanarayanaY
Collaborator

EDIT: I was able to resolve this by using TF_CPP_MAX_VLOG_LEVEL=3 (something buried in the above-linked issue) to debug. It turned out that our new module system was nuking my LD_LIBRARY_PATH after cuDNN was imported, so CUDA could be found but cuDNN could not. Adding a note about this option to the error message around GPUs could potentially save a lot of grief (even in scenarios like mine where the issue lies not with TensorFlow, but something upstream). Just a thought. May help you as well @stellarpower (seems to be what you were looking for when you opened the issue).

Hi @wjn0, AFAIK the setting TF_CPP_MAX_VLOG_LEVEL=3 only affects the debugging logs printed to the console. I would like to know whether the cuDNN libraries only became detectable after you changed this setting? Setting the right path in LD_LIBRARY_PATH should resolve the issue irrespective of the debugging logs. Correct me if I am wrong.

Thanks for the info.

@SuryanarayanaY SuryanarayanaY added the stat:awaiting response Status - Awaiting response from author label May 10, 2024

This issue is stale because it has been open for 7 days with no activity. It will be closed if no further activity occurs. Thank you.

@github-actions github-actions bot added the stale This label marks the issue/pr stale - to be closed automatically if no activity label May 18, 2024

This issue was closed because it has been inactive for 7 days since being marked as stale. Please reopen if you'd like to work on this further.

