The tf-operator implementation supports scaling the number of replicas. For instance, if I scale the parameter server (PS) replicas down from 4 to 2 with kubectl delete pod, the pods receive SIGTERM and the containers exit with code 137. However, the operator still checks whether each pod's exit code is normal, so the job is marked as failed. Is this behavior intentional, or could it be considered a bug?
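For context, exit code 137 means the container was killed by a signal (128 + 9, SIGKILL, typically sent after the SIGTERM grace period expires). A hypothetical sketch of the kind of exit-code policy being asked about, where signal-induced exits are distinguished from real application failures (this is an illustration, not the actual tf-operator logic; the function name is made up):

```go
package main

import "fmt"

// isExternallyTerminated is a hypothetical helper: exit codes above 128
// mean the container was killed by signal (code - 128), e.g. 137 = SIGKILL,
// 143 = SIGTERM. An operator could treat these as external termination
// (such as a scale-in) rather than an application failure.
func isExternallyTerminated(exitCode int) bool {
	return exitCode > 128
}

func main() {
	for _, code := range []int{0, 1, 137, 143} {
		fmt.Printf("exit code %d externally terminated: %v\n", code, isExternallyTerminated(code))
	}
}
```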
Yes, we do not currently handle this case. I am curious: were you doing this intentionally to test the behaviour?
Actually, we want our TFJobs to scale elastically for better resource utilization. Am I right that supporting scale-in within the operator should not pose any other issues?