Add draft gpu troubles #290

mhuguesaws · 2024-04-30T17:11:43Z

Issue #, if available:

Description of changes:

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

perifaws · 2024-04-30T17:56:08Z

troubleshooting/GPU-Troubleshooting.md

+   scancel [JOB_ID]
+   ```
+
+1. Reset the GPUs


Add a link to the reset option for nvidia-smi

perifaws · 2024-04-30T17:56:18Z

troubleshooting/GPU-Troubleshooting.md

+   sudo /opt/slurm/bin/scontrol create res starttime=now duration=infinite flags=ignore_jobs user=root nodes=[NODE_TO_TERMINATE]
+   ```
+
+1. Cancel


Cancel the job in Slurm

perifaws · 2024-04-30T17:56:43Z

troubleshooting/GPU-Troubleshooting.md

+   The node will have a **DRAIN** status. Then the instance will be terminated and replaced.
+
+
+1. Delete the reservation


What is RES_NUMBER?

perifaws · 2024-04-30T17:56:57Z

troubleshooting/GPU-Troubleshooting.md

+   scancel [JOB_ID]
+   ```
+
+1. Place the node in **DRAIN**.


Is NODE_TO_TERMINATE an IP address or a node name?

perifaws · 2024-04-30T17:57:27Z

troubleshooting/GPU-Troubleshooting.md

+
+1. Create a reservation to isolate the node from being used by any jobs.
+   ```bash
+   sudo /opt/slurm/bin/scontrol create res starttime=now duration=infinite flags=ignore_jobs user=root nodes=[NODE_TO_TERMINATE]


What should NODE_TO_TERMINATE be?

perifaws · 2024-04-30T17:57:42Z

troubleshooting/GPU-Troubleshooting.md

+| 95  | Uncontained ECC error | Reset GPUs          | [AWS ParallelCluster](#reset-gpus)                      |
+
+# AWS ParallelCluster
+


Add a reference to the ParallelCluster doc?

perifaws · 2024-04-30T17:58:02Z

troubleshooting/GPU-Troubleshooting.md

+While running High-Performance Computing or Machine Learning workloads, GPUs may fail for various reasons captured by Xid messages.
+Those messages are placed in `/var/log/messages` for Amazon Linux or for Ubuntu in `/var/log/syslog` and `/var/log/kern.log`
+
+| Xid | Failure               | Resolution          | Orchestrator                                            |


Scheduler/Orchestrator
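
A minimal sketch of how the Xid codes mentioned in the hunk above could be pulled out of those logs. The helper name `xid_codes` is hypothetical, and the pattern assumes the usual NVIDIA driver line format (`NVRM: Xid (PCI:...): <code>, ...`); check it against real log lines before relying on it.

```bash
# Extract the numeric Xid code from NVRM kernel log lines read on stdin.
# Assumed line format: "... NVRM: Xid (PCI:0000:10:1c): 95, pid=1234, ..."
xid_codes() {
  sed -n 's/.*NVRM: Xid ([^)]*): \([0-9][0-9]*\),.*/\1/p'
}

# Example usage on a node (log path depends on the distribution, as in the draft):
#   sudo cat /var/log/messages | xid_codes | sort -n | uniq -c
```

Counting the codes with `sort -n | uniq -c` gives a quick view of which rows of the Xid table apply to a node.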

perifaws · 2024-04-30T17:59:33Z

troubleshooting/GPU-Troubleshooting.md

+   ```
+
+## Reset GPUs
+


Say what resetting does and what NODE_TO_TERMINATE represents (or how to get it)
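
One possible way to answer the "how to get it" part: `scontrol`'s `nodes=` option takes Slurm node names, so `NODE_TO_TERMINATE` is presumably the name shown by `sinfo`. The sketch below filters `sinfo` output by state. The helper name `nodes_in_state` and the node names in the comments are hypothetical.

```bash
# Print the names of nodes whose state starts with the given prefix.
# Reads "NAME STATE" pairs on stdin, as produced by: sinfo -N -h -o '%N %T'
nodes_in_state() {
  awk -v want="$1" 'index($2, want) == 1 { print $1 }' | sort -u
}

# Example usage on the head node:
#   sinfo -N -h -o '%N %T' | nodes_in_state drain
```

Matching on a prefix means `drain` covers both the `draining` and `drained` states.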

perifaws · 2024-04-30T17:59:45Z

troubleshooting/GPU-Troubleshooting.md

+
+1. Delete the reservation
+   ```bash
+   sudo /opt/slurm/bin/scontrol delete res root_[RES_NUMBER]


What is RES_NUMBER?
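
A hedged sketch that may help answer this: when a reservation is created without an explicit name, Slurm generates one from the user or account plus a numeric suffix, so `root_[RES_NUMBER]` should match a name listed by `scontrol show res`. The helper name `reservation_names` is hypothetical, and the parsing assumes each reservation record starts with `ReservationName=`.

```bash
# Print reservation names from `scontrol show res` output read on stdin.
# Assumes each record begins with a "ReservationName=<name> ..." line.
reservation_names() {
  sed -n 's/^ReservationName=\([^ ]*\).*/\1/p'
}

# Example usage on the head node:
#   /opt/slurm/bin/scontrol show res | reservation_names
```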

perifaws · 2024-04-30T17:59:55Z

troubleshooting/GPU-Troubleshooting.md

+
+# Amazon SageMaker HyperPod
+
+TBD


perifaws · 2024-04-30T18:03:03Z

You could link to the AWS doc:

nghtm · 2024-05-03T02:42:07Z

Looks good. I plan to create a new PR for HyperPod instructions after the p-cluster PR is merged.

Add draft gpu troubles

ff4dd18

perifaws requested changes Apr 30, 2024

perifaws requested review from sean-smith, nghtm, awsankur and iankouls-aws April 30, 2024 18:03
