Failed K8s nodes leave jobs hanging indefinitely #2072

Open
kellyaa opened this issue Apr 18, 2024 · 3 comments

@kellyaa (Contributor) commented Apr 18, 2024

Observed Problem

I tested this with PyTorchJobs, but presumably it applies to other job types as well.
If you fully shut down a node that a job is running on, and that job's RestartPolicy is set to OnFailure, the PyTorchJob does not recover gracefully. The controller does recognize that the pod failed, but it tries to terminate the pod before creating a new one. Since the node is down, the termination never completes: the pod stays in a Terminating state forever, and the PyTorchJob stays in a Restarting state forever.
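
For context, the manual workaround today is to force-delete the stuck pod (the equivalent of `kubectl delete pod --grace-period=0 --force`), which removes the API object even though the kubelet on the dead node can never confirm termination. A minimal client-go sketch, assuming an already-configured clientset and the stuck pod's namespace and name:

```go
import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// forceDeletePod removes a pod that is stuck in Terminating on an
// unreachable node by deleting it with a zero grace period.
func forceDeletePod(ctx context.Context, clientset kubernetes.Interface, namespace, name string) error {
	zero := int64(0)
	return clientset.CoreV1().Pods(namespace).Delete(ctx, name, metav1.DeleteOptions{
		GracePeriodSeconds: &zero,
	})
}
```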

Proposed Solution

  1. Add a new, optional CRD parameter that configures a timeout after which a pod stuck in Terminating status is force-deleted.
  2. Update the DeletePod function in pod_control.go so that, when it checks the deletion timestamp, it force-deletes the pod if the configured timeout has elapsed since that timestamp (see the sketch below).
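
A rough sketch of the check in (2). The helper name and the forceDeleteTimeout parameter are hypothetical stand-ins for the new CRD field, not the actual pod_control.go code:

```go
import (
	"time"

	corev1 "k8s.io/api/core/v1"
)

// shouldForceDelete is a hypothetical helper: it reports whether a pod that
// is already terminating has been stuck for longer than the configured
// timeout, in which case DeletePod would re-issue the deletion with a zero
// grace period (as in the force-delete sketch above).
func shouldForceDelete(pod *corev1.Pod, forceDeleteTimeout time.Duration, now time.Time) bool {
	if pod.DeletionTimestamp == nil {
		// Not terminating yet; a normal delete is sufficient.
		return false
	}
	return now.Sub(pod.DeletionTimestamp.Time) > forceDeleteTimeout
}
```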

I plan to submit a PR for this unless there is a better way to handle it.

@tenzen-y (Member)

Kubernetes batch/v1 Job has a similar feature: the Pod Failure Policy together with Pod Disruption Conditions.
If we want to support this feature, we need to implement the same mechanism based on Pod Disruption Conditions and extend the Kubeflow Job API.
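
For comparison, this is roughly how batch/v1 Job expresses it with the upstream Go types; the rule below ignores failures that carry the DisruptionTarget pod condition (set by Kubernetes when a pod is being evicted, e.g. from an unreachable node), so the Job controller replaces the pod without counting the failure against backoffLimit. How a corresponding Kubeflow Job API extension would look is an open design question.

```go
import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
)

// nodeFailurePolicy sketches the batch/v1 Job mechanism: a pod failure
// policy rule that ignores failures caused by node disruption, so the pod
// is recreated instead of the Job getting stuck or failing.
var nodeFailurePolicy = batchv1.PodFailurePolicy{
	Rules: []batchv1.PodFailurePolicyRule{
		{
			Action: batchv1.PodFailurePolicyActionIgnore,
			OnPodConditions: []batchv1.PodFailurePolicyOnPodConditionsPattern{
				{
					Type:   corev1.DisruptionTarget,
					Status: corev1.ConditionTrue,
				},
			},
		},
	},
}
```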

/kind feature

@tedhtchang

@tenzen-y Would this be a v2 feature as well? Similar situation to #2045?

@tenzen-y (Member)

> @tenzen-y Would this be a v2 feature as well? Similar situation to #2045?

Correct. This could be supported by the Pod Failure Policy and Pod Disruption Conditions.
