
PytorchJob restartPolicy: ExitCode does not honor backoffLimit for retryable errors #2045

Open
kellyaa opened this issue Apr 5, 2024 · 9 comments

@kellyaa
Contributor

kellyaa commented Apr 5, 2024

Steps to reproduce:

  1. Set the PyTorchJob restartPolicy: ExitCode
  2. Set backoffLimit > 1
  3. Have a container exit with a non-zero exit code greater than 128
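
A minimal manifest along these lines reproduces it (the name, image, and command below are placeholders):

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: exitcode-repro
spec:
  runPolicy:
    backoffLimit: 3
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: ExitCode
      template:
        spec:
          containers:
            - name: pytorch
              image: python:3.11   # placeholder image
              # Exit with a code > 128 so ExitCode treats the failure as retryable
              command: ["python", "-c", "import sys; sys.exit(137)"]
```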

Observed Behavior:

When the pod errors out, the controller deletes the pod and recreates it. This repeats indefinitely until the pod completes successfully (if it ever does!), ignoring the backoffLimit.

Analysis:

The restart behavior differs between "OnFailure" and "ExitCode". For "OnFailure", the Pod's own restart policy is set to "OnFailure", leaving the kubelet to restart the container. For "ExitCode", the Pod's restart policy is set to "Never", so restarts are instead performed by deleting and recreating the pod. As far as I can tell, this is at least partly because the code that checks whether the backoff limit has been exceeded keys off container restart counts; since pods are deleted and recreated here, there are no container restarts to count.
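
Concretely, based on the above, the pods the operator creates end up with specs roughly like the following, which is why the kubelet never restarts containers in the ExitCode case:

```yaml
# PyTorchJob restartPolicy: OnFailure -> kubelet restarts the container in place,
# and the container restart count is what the backoff check inspects
spec:
  restartPolicy: OnFailure
---
# PyTorchJob restartPolicy: ExitCode -> the operator must delete and recreate the
# pod itself, so the container restart count stays at 0
spec:
  restartPolicy: Never
```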

@kellyaa
Contributor Author

kellyaa commented Apr 12, 2024

On reviewing this again, there are some possible solutions I can think of:

  1. In the code that checks whether the backoff limit is exceeded, look for job restart events instead of container restarts. Looking at the code, though, I don't see that information about the job's events is currently available in this context.
  2. Shelve this effort until the incorporation of batch/v1 Jobs is complete. That way, we can rely on the Job's restart policy (similar to how OnFailure relies on the pod's own restart policy) instead of inventing a mechanism proprietary to the training operator.

My understanding is that the timeline for (2) is unknown and is pending job success policy becoming beta in Kubernetes (no date). Selfishly, I'd like to use this feature sooner rather than later, so any other short-term fix would be great. What do you think about (1)?

@andreyvelich @tenzen-y

@tenzen-y
Member

> On reviewing this again, there are some possible solutions I can think of: […] What do you think about (1)?

This is actually working as specified, not a bug. You can see similar behavior in the batch/v1 Job with PodFailurePolicy action=Ignore.
So I think this is a feature request, similar to the batch/v1 Job with PodFailurePolicy action=Count.
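
For reference, a batch/v1 Job pod failure policy looks roughly like this (the image and exit codes are placeholders). Ignore retries the pod without counting the failure against backoffLimit, while Count counts it:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: pod-failure-policy-example
spec:
  backoffLimit: 3
  podFailurePolicy:
    rules:
      # Retry without consuming backoffLimit when the pod was disrupted
      # (evicted, preempted, etc.)
      - action: Ignore
        onPodConditions:
          - type: DisruptionTarget
            status: "True"
      # Count application failures with exit code 137 against backoffLimit
      - action: Count
        onExitCodes:
          containerName: main
          operator: In
          values: [137]
  template:
    spec:
      restartPolicy: Never   # required when podFailurePolicy is used
      containers:
        - name: main
          image: python:3.11   # placeholder image
          command: ["python", "-c", "import sys; sys.exit(137)"]
```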

@tenzen-y
Member

Since selecting (1) would be a breaking change, we cannot take that approach.
So adding a new field, like the batch/v1 Job action field, might be better.
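
As a rough illustration only (this field does not exist in the current API, and the name is made up), such a knob could hypothetically look like:

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
spec:
  runPolicy:
    backoffLimit: 3
  pytorchReplicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: ExitCode
      # Hypothetical field, not part of the actual API: whether ExitCode-driven
      # restarts count against runPolicy.backoffLimit (Count) or not (Ignore).
      restartCountAction: Count
      # ... pod template omitted ...
```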

@tenzen-y
Member

/kind feature

@tedhtchang

tedhtchang commented May 2, 2024

Hi @tenzen-y, are you going to work on this API change? Or how far along are you so far?

@tenzen-y
Member

tenzen-y commented May 7, 2024

> Hi @tenzen-y, are you going to work on this API change? Or how far along are you so far?

TBH, I would prefer not to add this feature to the v1 API, because we would need to re-implement the same feature in the training-operator.
But we have started the v2 API design (using JobSet).

@tedhtchang

tedhtchang commented May 10, 2024

@tenzen-y
Do you think we should make this logic work? It does not require an API change. The reason the logic is failing is that this previousRetry variable always returns 0, as a result of this fakeWorkQueue.

@tenzen-y
Member

> Do you think we should make this logic work? It does not require an API change. The reason the logic is failing is that this previousRetry variable always returns 0, as a result of this fakeWorkQueue.

As I mentioned here #2045 (comment), changing the current behavior without another knob would be a breaking change. So I don't think we can do that.

@kellyaa
Contributor Author

kellyaa commented May 10, 2024

If I set restartPolicy to OnFailure in the PyTorchJob, it restarts until the backoffLimit is met.
If I set restartPolicy to ExitCode in the PyTorchJob, it ignores the backoffLimit and restarts indefinitely.

To me, the current behavior seems like a bug rather than a feature as far as user experience goes. Changing how restarts are counted for ExitCode would be a fix for unexpected behavior, not the introduction of another knob.
