HyperPod Lifecycle Script install_dcgm_exporter.sh is failing on g5.48xlarges #324

Closed
nghtm opened this issue May 10, 2024 · 1 comment

nghtm commented May 10, 2024

When the install_dcgm_exporter.sh HyperPod Lifecycle script is run on ml.g5.48xlarge instances, the dcgm-exporter container fails:

docker logs 92c05c0f81ba
time="2024-04-30T22:16:32Z" level=info msg="Starting dcgm-exporter"
time="2024-04-30T22:16:32Z" level=info msg="DCGM successfully initialized!"
time="2024-04-30T22:16:33Z" level=info msg="Collecting DCP Metrics"
time="2024-04-30T22:16:33Z" level=info msg="Falling back to metric file '/etc/dcgm-exporter/dcgm-golden-metrics.csv'"
time="2024-04-30T22:16:33Z" level=info msg="Initializing system entities of type: GPU"
time="2024-04-30T22:16:33Z" level=info msg="Not collecting NvSwitch metrics; no fields to watch for device type: 3"
time="2024-04-30T22:16:33Z" level=info msg="Not collecting NvLink metrics; no fields to watch for device type: 6"
time="2024-04-30T22:16:33Z" level=info msg="Not collecting CPU metrics; no fields to watch for device type: 7"
time="2024-04-30T22:16:33Z" level=info msg="Not collecting CPU Core metrics; no fields to watch for device type: 8"
time="2024-04-30T22:17:15Z" level=fatal msg="Failed to watch metrics: Error watching fields: The third-party Profiling module returned an unrecoverable error"

This results in nvidia-smi errors that are only recoverable by uninstalling and reinstalling the OS driver, using the debugging steps later in this issue.

Root Cause:

The root cause is an incompatibility between the OS-packaged NVIDIA driver 535.161.08 on g5.48xlarge (8x A10G) instances and NVIDIA DCGM Exporter version 3.3.5-3.4.0-ubuntu22.04.

We were able to run DCGM-Exporter by installing the proprietary driver 535.161.08 or by using the 2.1.4-2.3.1-ubuntu20.04 image, but 3.3.5-3.4.0-ubuntu22.04 failed consistently with the OS driver on g5.48xlarge, as indicated by GSP errors in dmesg.

Similar to the issue reported here: awslabs/amazon-eks-ami#1523
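
As a temporary workaround, the exporter can be pinned to the image tag that worked in our testing. A minimal sketch, assuming the public NGC registry path and the exporter's default port 9400 (the lifecycle script may pull from a different location or use a different port):

# Run the older exporter image that worked with the OS driver on g5.48xlarge.
# Registry path and port 9400 are the public dcgm-exporter defaults (assumption).
sudo docker run -d --rm --runtime=nvidia --gpus all -p 9400:9400 \
  nvcr.io/nvidia/k8s/dcgm-exporter:2.1.4-2.3.1-ubuntu20.04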

Explanation of DCGM-Exporter Versions:

The first set of numbers in the DCGM-Exporter version corresponds to the DCGM library version used in the container and in testing (3.3.5 in this case). The second set of numbers (3.4.0) corresponds to the DCGM-Exporter version itself. DCGM follows semver compatibility guidelines, so any 3.x version should be compatible.
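
Put differently, the image tag decomposes as follows (just restating the convention above):

# <DCGM library version>-<dcgm-exporter version>-<base OS image>
#   3.3.5       -> DCGM library bundled in the container
#   3.4.0       -> dcgm-exporter application version
#   ubuntu22.04 -> base image of the container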

Debugging Tips from Maxhaws@

# Run a Docker container with the NVIDIA runtime and GPU support, executing nvidia-smi inside an Ubuntu image.
sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi

# Trigger a GPU reset.
sudo nvidia-smi -r

# Configure NVIDIA runtime for Docker.
sudo nvidia-ctk runtime configure --runtime=docker

# Display contents of Docker daemon configuration file.
cat /etc/docker/daemon.json 

# Restart Docker service.
sudo systemctl restart docker

# Run the same container again to confirm nvidia-smi now works with the NVIDIA runtime.
sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi


# Display kernel ring buffer messages.
sudo dmesg

# List the contents of the modprobe configuration directory.
ls /etc/modprobe.d

# Disable GSP firmware by appending a module option for the nvidia driver.
echo "options nvidia NVreg_EnableGpuFirmware=0" | sudo tee -a /etc/modprobe.d/nvidia-gsp.conf

# View contents of the newly created configuration file.
sudo cat /etc/modprobe.d/nvidia-gsp.conf
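
# After the modules are reloaded (or the node is rebooted), verify the GSP setting took effect.
# These fields exist on recent 535-series drivers; exact names may vary by version (assumption).
nvidia-smi -q | grep -i "GSP Firmware"
grep -i EnableGpuFirmware /proc/driver/nvidia/params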

# Unload the NVIDIA kernel modules, one by one.
sudo rmmod nvidia_modeset

# nvidia_uvm ("NVIDIA Unified Virtual Memory") backs the Unified Memory feature that lets
# CUDA applications access both CPU and GPU memory seamlessly.
sudo rmmod nvidia_uvm

# Unloading gdrdrv disables GPUDirect RDMA, so GPU-to-GPU communication across nodes is no
# longer available; this is typically done for troubleshooting or when the feature is not
# needed by the current workload.
sudo rmmod gdrdrv

# The nvidia module is the core of the NVIDIA driver stack; unloading it disables the GPU
# driver entirely, so the GPUs are unavailable for graphics, compute, or any other
# GPU-accelerated work until it is reloaded.
sudo rmmod nvidia



# Iterate through the NVIDIA device nodes and list any processes still using them.
for i in $(seq 0 7); do sudo lsof /dev/nvidia$i; done
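
# A convenience variant of the loop above: check every NVIDIA device node in one pass,
# including /dev/nvidiactl and /dev/nvidia-uvm.
sudo lsof /dev/nvidia*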

# Unload NVIDIA kernel modules forcefully.
sudo rmmod --force nvidia_uvm
sudo rmmod nvidia

# View system log file.
sudo less /var/log/syslog

# The modprobe command adds and removes kernel modules; the files under /etc/modprobe.d
# specify options and parameters applied when modules are loaded, at boot or manually.
sudo cat /etc/modprobe.d/*.conf

# Reboot the node using Slurm's scontrol utility.
sudo scontrol reboot ip-10-1-5-148
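
# Optionally confirm Slurm picked up the reboot request (node name reused from the example above).
sinfo -N -l -n ip-10-1-5-148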

# Uninstall NVIDIA driver.
sudo nvidia-uninstall

# Download the NVIDIA driver installer.
curl -fsSL -O https://us.download.nvidia.com/tesla/535.161.08/NVIDIA-Linux-x86_64-535.161.08.run

# Run NVIDIA driver installer script.
sudo bash NVIDIA-Linux-x86_64-535.161.08.run
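
# For unattended runs (e.g. from a lifecycle script), the .run installer also accepts
# non-interactive flags; confirm them against --help for this installer version (assumption).
sudo bash NVIDIA-Linux-x86_64-535.161.08.run --silent --dkms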

# Unload the nvidia_uvm kernel module.
sudo rmmod nvidia_uvm

# Reset ECC error counts (1 selects the aggregate counters).
sudo nvidia-smi -p 1

# Start NVIDIA Persistence Daemon.
sudo nvidia-persistenced
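
# Once the driver is reinstalled, one way to confirm the exporter is healthy end to end.
# Image path and port 9400 are the public dcgm-exporter defaults, not necessarily what
# install_dcgm_exporter.sh uses (assumption).
sudo docker run -d --rm --runtime=nvidia --gpus all -p 9400:9400 \
  nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.0-ubuntu22.04
sleep 30
curl -s localhost:9400/metrics | head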




nghtm commented May 19, 2024

Resolving issue with PR #326

nghtm closed this as completed May 19, 2024