HyperPod Lifecycle Script install_dcgm_exporter.sh is failing on g5.48xlarges #324

Closed
nghtm opened this issue May 10, 2024 · 1 comment

nghtm commented May 10, 2024

When the install_dcgm_exporter.sh HyperPod Lifecycle script is run on ml.g5.48xlarge instances, the dcgm-exporter container fails:

docker logs 92c05c0f81ba
time="2024-04-30T22:16:32Z" level=info msg="Starting dcgm-exporter"
time="2024-04-30T22:16:32Z" level=info msg="DCGM successfully initialized!"
time="2024-04-30T22:16:33Z" level=info msg="Collecting DCP Metrics"
time="2024-04-30T22:16:33Z" level=info msg="Falling back to metric file '/etc/dcgm-exporter/dcgm-golden-metrics.csv'"
time="2024-04-30T22:16:33Z" level=info msg="Initializing system entities of type: GPU"
time="2024-04-30T22:16:33Z" level=info msg="Not collecting NvSwitch metrics; no fields to watch for device type: 3"
time="2024-04-30T22:16:33Z" level=info msg="Not collecting NvLink metrics; no fields to watch for device type: 6"
time="2024-04-30T22:16:33Z" level=info msg="Not collecting CPU metrics; no fields to watch for device type: 7"
time="2024-04-30T22:16:33Z" level=info msg="Not collecting CPU Core metrics; no fields to watch for device type: 8"
time="2024-04-30T22:17:15Z" level=fatal msg="Failed to watch metrics: Error watching fields: The third-party Profiling module returned an unrecoverable error"

This results in nvidia-smi errors that are only recoverable by uninstalling and reinstalling the OS driver, using the debugging steps later in this issue.

Root Cause:

The root cause is an incompatibility between the OS-packaged NVIDIA driver 535.161.08 on g5.48xlarge (8x A10G) instances and NVIDIA DCGM Exporter version 3.3.5-3.4.0-ubuntu22.04.

We were able to run DCGM-Exporter by installing the proprietary driver 535.161.08 or by using the 2.1.4-2.3.1-ubuntu20.04 image, but 3.3.5-3.4.0-ubuntu22.04 failed consistently with the OS driver on g5.48xlarge, as indicated by GSP errors in dmesg.

Similar to the issue reported here: awslabs/amazon-eks-ami#1523
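
As a temporary workaround, the exporter can be pinned to the image tag that worked in our testing. A minimal sketch, assuming the public NGC registry path and the exporter's default port 9400 (the lifecycle script may pull from a different location or use a different port):

# Run the older exporter image that worked with the OS driver on g5.48xlarge.
# Registry path and port 9400 are the public dcgm-exporter defaults (assumption).
sudo docker run -d --rm --runtime=nvidia --gpus all -p 9400:9400 \
  nvcr.io/nvidia/k8s/dcgm-exporter:2.1.4-2.3.1-ubuntu20.04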

Explanation of DCGM-Exporter Versions:

The first set of numbers in the DCGM-Exporter version corresponds to the DCGM library version used in the container and in testing (3.3.5 in this case). The second set of numbers (3.4.0) corresponds to the DCGM-Exporter version itself. DCGM follows semver compatibility guidelines, so any 3.x version should be compatible.
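
Put differently, the image tag decomposes as follows (just restating the convention above):

# <DCGM library version>-<dcgm-exporter version>-<base OS image>
#   3.3.5       -> DCGM library bundled in the container
#   3.4.0       -> dcgm-exporter application version
#   ubuntu22.04 -> base image of the container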

Debugging Tips from Maxhaws@

# Run a Docker container with the NVIDIA runtime and GPU support, executing nvidia-smi inside an Ubuntu image.
sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi

# Trigger a GPU reset.
sudo nvidia-smi -r

# Configure NVIDIA runtime for Docker.
sudo nvidia-ctk runtime configure --runtime=docker

# Display contents of Docker daemon configuration file.
cat /etc/docker/daemon.json 

# Restart Docker service.
sudo systemctl restart docker

# Run the same container again to confirm nvidia-smi now works with the NVIDIA runtime.
sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi


# Display kernel ring buffer messages.
sudo dmesg

# List the contents of the modprobe configuration directory.
ls /etc/modprobe.d

# Disable GSP firmware by appending a module option for the nvidia driver.
echo "options nvidia NVreg_EnableGpuFirmware=0" | sudo tee -a /etc/modprobe.d/nvidia-gsp.conf

# View contents of the newly created configuration file.
sudo cat /etc/modprobe.d/nvidia-gsp.conf
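
# After the modules are reloaded (or the node is rebooted), verify the GSP setting took effect.
# These fields exist on recent 535-series drivers; exact names may vary by version (assumption).
nvidia-smi -q | grep -i "GSP Firmware"
grep -i EnableGpuFirmware /proc/driver/nvidia/params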

# Unload the NVIDIA kernel modules, one by one.
sudo rmmod nvidia_modeset

# nvidia_uvm ("NVIDIA Unified Virtual Memory") backs the Unified Memory feature that lets
# CUDA applications access both CPU and GPU memory seamlessly.
sudo rmmod nvidia_uvm

# Unloading gdrdrv disables GPUDirect RDMA, so GPU-to-GPU communication across nodes is no
# longer available; this is typically done for troubleshooting or when the feature is not
# needed by the current workload.
sudo rmmod gdrdrv

# The nvidia module is the core of the NVIDIA driver stack; unloading it disables the GPU
# driver entirely, so the GPUs are unavailable for graphics, compute, or any other
# GPU-accelerated work until it is reloaded.
sudo rmmod nvidia



# Iterate through the NVIDIA device nodes and list any processes still using them.
for i in $(seq 0 7); do sudo lsof /dev/nvidia$i; done
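
# A convenience variant of the loop above: check every NVIDIA device node in one pass,
# including /dev/nvidiactl and /dev/nvidia-uvm.
sudo lsof /dev/nvidia*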

# Unload NVIDIA kernel modules forcefully.
sudo rmmod --force nvidia_uvm
sudo rmmod nvidia

# View system log file.
sudo less /var/log/syslog

# The modprobe command adds and removes kernel modules; the files under /etc/modprobe.d
# specify options and parameters applied when modules are loaded, at boot or manually.
sudo cat /etc/modprobe.d/*.conf

# Reboot the node using Slurm's scontrol utility.
sudo scontrol reboot ip-10-1-5-148
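
# Optionally confirm Slurm picked up the reboot request (node name reused from the example above).
sinfo -N -l -n ip-10-1-5-148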

# Uninstall NVIDIA driver.
sudo nvidia-uninstall

# Download the NVIDIA driver installer.
curl -fsSL -O https://us.download.nvidia.com/tesla/535.161.08/NVIDIA-Linux-x86_64-535.161.08.run

# Run NVIDIA driver installer script.
sudo bash NVIDIA-Linux-x86_64-535.161.08.run
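
# For unattended runs (e.g. from a lifecycle script), the .run installer also accepts
# non-interactive flags; confirm them against --help for this installer version (assumption).
sudo bash NVIDIA-Linux-x86_64-535.161.08.run --silent --dkms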

# Unload the nvidia_uvm kernel module.
sudo rmmod nvidia_uvm

# Reset ECC error counts (1 selects the aggregate counters).
sudo nvidia-smi -p 1

# Start NVIDIA Persistence Daemon.
sudo nvidia-persistenced
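
# Once the driver is reinstalled, one way to confirm the exporter is healthy end to end.
# Image path and port 9400 are the public dcgm-exporter defaults, not necessarily what
# install_dcgm_exporter.sh uses (assumption).
sudo docker run -d --rm --runtime=nvidia --gpus all -p 9400:9400 \
  nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.0-ubuntu22.04
sleep 30
curl -s localhost:9400/metrics | head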




nghtm commented May 19, 2024

Resolving issue with PR #326

nghtm closed this as completed May 19, 2024