Observed Problem
I tested this with PyTorchJobs, but presumably this applies to other job types as well.
If you fully shut down a node that a job is running on, and that job's RestartPolicy is set to OnFailure, the PyTorchJob does not recover gracefully. The controller does recognize that the pod failed, but it tries to terminate the pod before it creates a new one. Since the node is down, the pod cannot be terminated, so it stays in a "Terminating" state forever, and the PyTorchJob stays in a "Restarting" state forever.
Proposed Solution
- Add a new CRD parameter: an optional, configurable timeout that forces deletion of a pod after it has been in the Terminating state for longer than the given interval.
- Update the DeletePod function in pod_control.go so that, when it checks the deletion timestamp, it forces a deletion if the configured interval has elapsed since that timestamp.
I plan to submit a PR for this unless someone has better insight into how to handle it.
Kubernetes batch/v1 Job has a similar feature: Pod Failure Policy, built on Pod Disruption Conditions.
If we want to support this feature, we need to implement the same mechanism based on Pod Disruption Conditions, and extend the Kubeflow Job API accordingly.
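For reference, the batch/v1 mechanism referred to above looks roughly like this. The sketch below uses the real `podFailurePolicy` field of batch/v1 Job (available as beta since Kubernetes 1.26); here a `DisruptionTarget` pod condition, which the kubelet/controllers add when a pod is terminated due to a disruption such as node shutdown, is ignored so it does not count against `backoffLimit`:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: example-job
spec:
  backoffLimit: 6
  podFailurePolicy:
    rules:
    # Do not count node-disruption failures against backoffLimit;
    # the pod is simply recreated.
    - action: Ignore
      onPodConditions:
      - type: DisruptionTarget
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: busybox
        command: ["sleep", "3600"]
```

A Kubeflow equivalent would need an analogous field in the Job API plus controller logic that reacts to the `DisruptionTarget` condition rather than waiting for termination to complete.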