
PytorchJob restartPolicy: ExitCode does not honor backoffLimit for retryable errors #2045

Open
kellyaa opened this issue Apr 5, 2024 · 9 comments

@kellyaa
Contributor

kellyaa commented Apr 5, 2024

Steps to reproduce:

  1. Set the PyTorchJob restartPolicy: ExitCode
  2. Set backoffLimit > 1
  3. Have a container exit with a non-zero exit code greater than 128
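
A minimal manifest along these lines reproduces it (the name, image, and command below are placeholders):

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: exitcode-repro
spec:
  runPolicy:
    backoffLimit: 3
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: ExitCode
      template:
        spec:
          containers:
            - name: pytorch
              image: python:3.11   # placeholder image
              # Exit with a code > 128 so ExitCode treats the failure as retryable
              command: ["python", "-c", "import sys; sys.exit(137)"]
```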

Observed Behavior:

When the pod errors out, the controller deletes the pod and recreates it. This repeats indefinitely until the pod completes successfully (if it ever does!), ignoring the backoffLimit.

Analysis:

The restart behavior differs between "OnFailure" and "ExitCode". For "OnFailure", the Pod's own restart policy is set to "OnFailure", leaving the kubelet to restart the container. For "ExitCode", the Pod's restart policy is set to "Never", so restarts are instead performed by deleting and recreating the pod. As far as I can tell, this is at least partly because the code that checks whether the backoff limit has been exceeded keys off container restart counts; since pods are deleted and recreated here, there are no container restarts to count.
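
Concretely, based on the above, the pods the operator creates end up with specs roughly like the following, which is why the kubelet never restarts containers in the ExitCode case:

```yaml
# PyTorchJob restartPolicy: OnFailure -> kubelet restarts the container in place,
# and the container restart count is what the backoff check inspects
spec:
  restartPolicy: OnFailure
---
# PyTorchJob restartPolicy: ExitCode -> the operator must delete and recreate the
# pod itself, so the container restart count stays at 0
spec:
  restartPolicy: Never
```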

@kellyaa
Contributor Author

kellyaa commented Apr 12, 2024

On reviewing this again, there are some possible solutions I can think of:

  1. In the code that checks whether the backoff limit is exceeded, look for job restart events instead of container restarts. Looking at the code, though, I don't see that information about the job's events is currently available in this context.
  2. Shelve this effort until the incorporation of batch/v1 Jobs is complete. That way, we can rely on the Job's restart policy (similar to how OnFailure relies on the pod's own restart policy) instead of inventing a mechanism proprietary to the training operator.

My understanding is that the timeline for (2) is unknown and is pending job success policy becoming beta in Kubernetes (no date). Selfishly, I'd like to use this feature sooner rather than later, so any other short-term fix would be great. What do you think about (1)?

@andreyvelich @tenzen-y

@tenzen-y
Member

> On reviewing this again, there are some possible solutions I can think of: […] What do you think about (1)?

This is actually working as specified, not a bug. You can see similar behavior in the batch/v1 Job with PodFailurePolicy action=Ignore.
So I think this is a feature request, similar to the batch/v1 Job with PodFailurePolicy action=Count.
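
For reference, a batch/v1 Job pod failure policy looks roughly like this (the image and exit codes are placeholders). Ignore retries the pod without counting the failure against backoffLimit, while Count counts it:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: pod-failure-policy-example
spec:
  backoffLimit: 3
  podFailurePolicy:
    rules:
      # Retry without consuming backoffLimit when the pod was disrupted
      # (evicted, preempted, etc.)
      - action: Ignore
        onPodConditions:
          - type: DisruptionTarget
            status: "True"
      # Count application failures with exit code 137 against backoffLimit
      - action: Count
        onExitCodes:
          containerName: main
          operator: In
          values: [137]
  template:
    spec:
      restartPolicy: Never   # required when podFailurePolicy is used
      containers:
        - name: main
          image: python:3.11   # placeholder image
          command: ["python", "-c", "import sys; sys.exit(137)"]
```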

@tenzen-y
Member

Since selecting (1) would be a breaking change, we cannot take that approach.
So adding a new field, like the batch/v1 Job action field, might be better.
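
As a rough illustration only (this field does not exist in the current API, and the name is made up), such a knob could hypothetically look like:

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
spec:
  runPolicy:
    backoffLimit: 3
  pytorchReplicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: ExitCode
      # Hypothetical field, not part of the actual API: whether ExitCode-driven
      # restarts count against runPolicy.backoffLimit (Count) or not (Ignore).
      restartCountAction: Count
      # ... pod template omitted ...
```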

@tenzen-y
Member

/kind feature

@tedhtchang

tedhtchang commented May 2, 2024

Hi @tenzen-y, are you going to work on this API change? Or how far along are you so far?

@tenzen-y
Member

tenzen-y commented May 7, 2024

> Hi @tenzen-y, are you going to work on this API change? Or how far along are you so far?

TBH, I would prefer not to add this feature to the v1 API, because we would need to re-implement the same feature in the training-operator.
But we have started the v2 API design (using JobSet).

@tedhtchang

tedhtchang commented May 10, 2024

@tenzen-y
Do you think we should make this logic work? It does not require an API change. The reason the logic is failing is that this previousRetry variable always returns 0, as a result of this fakeWorkQueue.

@tenzen-y
Member

> Do you think we should make this logic work? It does not require an API change. The reason the logic is failing is that this previousRetry variable always returns 0, as a result of this fakeWorkQueue.

As I mentioned here #2045 (comment), changing the current behavior without another knob would be a breaking change. So I don't think we can do that.

@kellyaa
Contributor Author

kellyaa commented May 10, 2024

If I set restartPolicy to OnFailure in the PyTorchJob, it restarts until the backoffLimit is met.
If I set restartPolicy to ExitCode in the PyTorchJob, it ignores the backoffLimit and restarts indefinitely.

To me, the current behavior seems like a bug rather than a feature as far as user experience goes. Changing how restarts are counted for ExitCode would be a fix for unexpected behavior, not the introduction of another knob.
