Failed K8s nodes leave jobs hanging indefinitely #2072

Open
kellyaa opened this issue Apr 18, 2024 · 3 comments

@kellyaa (Contributor) commented Apr 18, 2024

Observed Problem

I tested this with PyTorchJobs, but presumably it applies to other job types as well.
If you fully shut down a node that a job is running on, and that job's RestartPolicy is set to OnFailure, the PyTorchJob does not recover gracefully. The controller does recognize that the pod failed, but it tries to terminate the pod before creating a new one. Since the node is down, the termination never completes: the pod stays in a Terminating state forever, and the PyTorchJob stays in a Restarting state forever.
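
For context, the manual workaround today is to force-delete the stuck pod (the equivalent of `kubectl delete pod --grace-period=0 --force`), which removes the API object even though the kubelet on the dead node can never confirm termination. A minimal client-go sketch, assuming an already-configured clientset and the stuck pod's namespace and name:

```go
import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// forceDeletePod removes a pod that is stuck in Terminating on an
// unreachable node by deleting it with a zero grace period.
func forceDeletePod(ctx context.Context, clientset kubernetes.Interface, namespace, name string) error {
	zero := int64(0)
	return clientset.CoreV1().Pods(namespace).Delete(ctx, name, metav1.DeleteOptions{
		GracePeriodSeconds: &zero,
	})
}
```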

Proposed Solution

  1. Add a new, optional CRD parameter that configures a timeout after which a pod stuck in Terminating status is force-deleted.
  2. Update the DeletePod function in pod_control.go so that, when it checks the deletion timestamp, it force-deletes the pod if the configured timeout has elapsed since that timestamp (see the sketch below).
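
A rough sketch of the check in (2). The helper name and the forceDeleteTimeout parameter are hypothetical stand-ins for the new CRD field, not the actual pod_control.go code:

```go
import (
	"time"

	corev1 "k8s.io/api/core/v1"
)

// shouldForceDelete is a hypothetical helper: it reports whether a pod that
// is already terminating has been stuck for longer than the configured
// timeout, in which case DeletePod would re-issue the deletion with a zero
// grace period (as in the force-delete sketch above).
func shouldForceDelete(pod *corev1.Pod, forceDeleteTimeout time.Duration, now time.Time) bool {
	if pod.DeletionTimestamp == nil {
		// Not terminating yet; a normal delete is sufficient.
		return false
	}
	return now.Sub(pod.DeletionTimestamp.Time) > forceDeleteTimeout
}
```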

I plan to submit a PR for this unless there is a better way to handle it.

@tenzen-y (Member)

Kubernetes batch/v1 Job has a similar feature: the Pod Failure Policy together with Pod Disruption Conditions.
If we want to support this feature, we need to implement the same mechanism based on Pod Disruption Conditions and extend the Kubeflow Job API.
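
For comparison, this is roughly how batch/v1 Job expresses it with the upstream Go types; the rule below ignores failures that carry the DisruptionTarget pod condition (set by Kubernetes when a pod is being evicted, e.g. from an unreachable node), so the Job controller replaces the pod without counting the failure against backoffLimit. How a corresponding Kubeflow Job API extension would look is an open design question.

```go
import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
)

// nodeFailurePolicy sketches the batch/v1 Job mechanism: a pod failure
// policy rule that ignores failures caused by node disruption, so the pod
// is recreated instead of the Job getting stuck or failing.
var nodeFailurePolicy = batchv1.PodFailurePolicy{
	Rules: []batchv1.PodFailurePolicyRule{
		{
			Action: batchv1.PodFailurePolicyActionIgnore,
			OnPodConditions: []batchv1.PodFailurePolicyOnPodConditionsPattern{
				{
					Type:   corev1.DisruptionTarget,
					Status: corev1.ConditionTrue,
				},
			},
		},
	},
}
```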

/kind feature

@tedhtchang

@tenzen-y Would this be a v2 feature as well? Similar situation to #2045?

@tenzen-y (Member)

> @tenzen-y Would this be a v2 feature as well? Similar situation to #2045?

Correct. This could be supported by the Pod Failure Policy and Pod Disruption Conditions.
