
"Cannot dlopen some GPU libraries." does not List what Libraries Failed to Load #66987

Closed
stellarpower opened this issue May 5, 2024 · 8 comments
Assignees
Labels
comp:gpu GPU related issues stale This label marks the issue/pr stale - to be closed automatically if no activity stat:awaiting response Status - Awaiting response from author type:build/install Build and install issues type:feature Feature requests

Comments

@stellarpower

Issue type

Feature Request

Have you reproduced the bug with TensorFlow Nightly?

Yes

Source

binary

TensorFlow version

tf-nightly 2.17.0.dev20240504

Custom code

No

OS platform and distribution

Ubuntu Jammy

Mobile device

No response

Python version

3.12

Bazel version

No response

GCC/compiler version

No response

CUDA/cuDNN version

No response

GPU model and memory

No response

Current behavior?

I have installed tf-nightly from the official PYPI package, like so:

pip install tf-nightly[and-cuda]   

When I load TensorFlow, it isn't seeing my GPU, and I get the message

Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.

I normally use the conda-forge packages, in part precisely because they should handle some of these things for me so I don't have to worry. But I saw pip installing a large number of CUDA libraries during the process, so I'd expect most of what I need to be there.

The function MaybeTryDlopenGPULibraries() is responsible for attempting to load the required libraries at runtime; however, it doesn't tell me which libraries it tried to find, what search path it was using, etc. As I've followed the steps in the guide at that URL, it's not the most helpful diagnostic message without further information.

Whilst short, and therefore not cluttering the screen (which may be desirable in many situations), the message isn't very helpful for working out what the problem is. On modern, complex systems, library search paths can be pretty finicky to work out, so if not the default behaviour, I'd at least like a flag or environment variable I can set to see which library loads were attempted, which succeeded (and from what path), and which were missing, in addition to other debugging output. If the short form of the message is kept as the default behaviour, then it would be good for it to print how to set this option, so that I can go round again and get more verbose output.

Thanks

Standalone code to reproduce the issue

See above.
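A minimal check along these lines should reproduce it (tf.config.list_physical_devices is the standard API for checking device visibility):

pip install tf-nightly[and-cuda]
# An empty list here, together with the "Cannot dlopen some GPU libraries." warning, reproduces the issue.
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"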

Relevant log output

No response

@google-ml-butler google-ml-butler bot added the type:feature Feature requests label May 5, 2024
@SuryanarayanaY SuryanarayanaY added type:build/install Build and install issues comp:gpu GPU related issues labels May 6, 2024
@SuryanarayanaY
Collaborator

Hi @stellarpower ,

You need to install the GPU driver manually. After that, you need to set LD_LIBRARY_PATH to the path where the NVIDIA libraries are installed. You may refer to this comment. Please refer to #63362 for more details. Thanks.
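For example, something along these lines for the pip-installed cuDNN wheel (only a sketch; the exact location depends on your environment):

# Illustrative sketch: locate the nvidia.cudnn package installed by pip and expose its libraries.
CUDNN_PATH=$(dirname $(python -c "import nvidia.cudnn; print(nvidia.cudnn.__file__)"))
export LD_LIBRARY_PATH=${CUDNN_PATH}/lib:$LD_LIBRARY_PATH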

@SuryanarayanaY SuryanarayanaY added the stat:awaiting response Status - Awaiting response from author label May 6, 2024
@stellarpower
Author

Thanks; I had done all this previously.

But I have opened this as an issue irrespective of my own setup, because I believe it should be possible to get more information from the error message. Without knowing which libraries failed to be opened, just re-installing and following the instructions again isn't a particularly efficient way to debug what happened.

@google-ml-butler google-ml-butler bot removed the stat:awaiting response Status - Awaiting response from author label May 6, 2024
@wjn0

wjn0 commented May 8, 2024

Yes, I agree. I am running into the same issue now. This is particularly frustrating because of the arcane versioning of CUDA-related toolsets (i.e. the Python packages vs. CUDA vs. the dependency matrix in the documentation). For example:

  • The TensorFlow documentation lists the correct CUDA version as 11.8, so I installed that and updated my $PATH, $LD_LIBRARY_PATH, etc. accordingly (along with cuDNN 8.6 as listed).
  • When I use a fresh Python 3.10 installation to install tensorflow[and-cuda] via pip, it seems to be defaulting to CUDA runtime 12?
Collecting nvidia-cublas-cu12==12.3.4.1
  Using cached nvidia_cublas_cu12-12.3.4.1-py3-none-manylinux1_x86_64.whl (412.6 MB)
Collecting nvidia-cuda-nvrtc-cu12==12.3.107
  Using cached nvidia_cuda_nvrtc_cu12-12.3.107-py3-none-manylinux1_x86_64.whl (24.9 MB)
Collecting nvidia-curand-cu12==10.3.4.107
  Using cached nvidia_curand_cu12-10.3.4.107-py3-none-manylinux1_x86_64.whl (56.3 MB)
Collecting nvidia-cusparse-cu12==12.2.0.103
  Using cached nvidia_cusparse_cu12-12.2.0.103-py3-none-manylinux1_x86_64.whl (197.5 MB)
Collecting nvidia-nvjitlink-cu12==12.3.101
  Using cached nvidia_nvjitlink_cu12-12.3.101-py3-none-manylinux1_x86_64.whl (20.5 MB)
Collecting nvidia-nccl-cu12==2.19.3
  Using cached nvidia_nccl_cu12-2.19.3-py3-none-manylinux1_x86_64.whl (166.0 MB)
Collecting nvidia-cuda-nvcc-cu12==12.3.107
  Using cached nvidia_cuda_nvcc_cu12-12.3.107-py3-none-manylinux1_x86_64.whl (22.0 MB)
Collecting nvidia-cusolver-cu12==11.5.4.101
  Using cached nvidia_cusolver_cu12-11.5.4.101-py3-none-manylinux1_x86_64.whl (125.2 MB)
Collecting nvidia-cudnn-cu12==8.9.7.29
  Using cached nvidia_cudnn_cu12-8.9.7.29-py3-none-manylinux1_x86_64.whl (704.7 MB)
Collecting nvidia-cufft-cu12==11.0.12.1
  Using cached nvidia_cufft_cu12-11.0.12.1-py3-none-manylinux1_x86_64.whl (98.8 MB)
Collecting nvidia-cuda-cupti-cu12==12.3.101
  Using cached nvidia_cuda_cupti_cu12-12.3.101-py3-none-manylinux1_x86_64.whl (14.0 MB)
Collecting nvidia-cuda-runtime-cu12==12.3.101
  Using cached nvidia_cuda_runtime_cu12-12.3.101-py3-none-manylinux1_x86_64.whl (867 kB)

Also, relevant links in the docs only seem to link out to Docker-related material, like https://www.tensorflow.org/install/source, so the vast majority of information on the internet is out of date.

Is there any clearer guidance for how to get TensorFlow working on GPUs assuming your CUDA install is non-standard, i.e., not installed out of the Ubuntu package repo (which is infeasible in many academic settings)?
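One way to sanity-check which CUDA and cuDNN versions a given wheel expects (assuming the package at least imports on CPU) is the build-info API:

# Prints the CUDA/cuDNN versions the installed TensorFlow wheel was built against.
python -c "import tensorflow as tf; print(tf.sysconfig.get_build_info())"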

Thanks very much in advance.

EDIT: I was able to resolve this by using TF_CPP_MAX_VLOG_LEVEL=3 (something buried in the above-linked issue) to debug. It turned out that our new module system was nuking my LD_LIBRARY_PATH after cuDNN was imported, so CUDA could be found but cuDNN could not. Adding a note about this option to the error message around GPUs could potentially save a lot of grief (even in scenarios like mine where the issue lies not with TensorFlow, but something upstream). Just a thought. May help you as well @stellarpower (seems to be what you were looking for when you opened the issue).
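For reference, the debugging invocation was along these lines (level 3 is simply the value that worked here; the individual dlopen attempts and the paths searched show up in the verbose output):

# Enable verbose C++ logging while importing TensorFlow and listing GPUs.
TF_CPP_MAX_VLOG_LEVEL=3 python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"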

@stellarpower
Author

@wjn0 thanks - I resolved the underlying problem in the end, and from memory I thought I had increased the log verbosity as high as it would go, but maybe I had not. If I encounter some library problems again I'll give it a go. Cheers!

@SuryanarayanaY
Collaborator

EDIT: I was able to resolve this by using TF_CPP_MAX_VLOG_LEVEL=3 (something buried in the above-linked issue) to debug. It turned out that our new module system was nuking my LD_LIBRARY_PATH after cuDNN was imported, so CUDA could be found but cuDNN could not. Adding a note about this option to the error message around GPUs could potentially save a lot of grief (even in scenarios like mine where the issue lies not with TensorFlow, but something upstream). Just a thought. May help you as well @stellarpower (seems to be what you were looking for when you opened the issue).

Hi @wjn0, AFAIK the setting TF_CPP_MAX_VLOG_LEVEL=3 only affects the debugging logs printed to the console. I would like to know whether the cuDNN libraries only became detectable after you changed this setting? Setting the right path in LD_LIBRARY_PATH should resolve the issue irrespective of the debugging logs. Correct me if I am wrong.

Thanks for the info.

@SuryanarayanaY SuryanarayanaY added the stat:awaiting response Status - Awaiting response from author label May 10, 2024

This issue is stale because it has been open for 7 days with no activity. It will be closed if no further activity occurs. Thank you.

@github-actions github-actions bot added the stale This label marks the issue/pr stale - to be closed automatically if no activity label May 18, 2024

This issue was closed because it has been inactive for 7 days since being marked as stale. Please reopen if you'd like to work on this further.

