The job should stop restarting workers and exit if the traceback indicates a code bug. #1068

Open
workingloong opened this issue Apr 8, 2024 · 2 comments
Labels: enhancement, question

Comments

@workingloong
Collaborator

If training fails because of a code bug, the restarted worker will fail again in the same way. The job should exit as soon as possible to release its resources on the cluster.

BalaBalaYi self-assigned this on Apr 22, 2024
BalaBalaYi added the bug and question labels on Apr 22, 2024
@BalaBalaYi
Collaborator

A blacklist mechanism can be introduced for this case: when a failure is caused by a user code error, raise an explicit error, stop retrying, and free up the resources.
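
A minimal sketch of how such a blacklist check might look, assuming the worker's traceback text is available to whatever decides on restarts. The names `CODE_BUG_EXCEPTIONS`, `final_exception_name`, and `should_restart`, the choice of exception types, and the `max_restarts` default are all hypothetical illustrations, not part of the project's existing API.

```python
# Hypothetical blacklist: exception types that usually point to a bug in
# user code rather than a transient infrastructure fault. The exact set
# is an assumption and would need tuning for real workloads.
CODE_BUG_EXCEPTIONS = frozenset({
    "SyntaxError",
    "NameError",
    "TypeError",
    "AttributeError",
    "ImportError",
    "ModuleNotFoundError",
    "IndexError",
    "KeyError",
    "ZeroDivisionError",
})


def final_exception_name(traceback_text: str) -> str:
    """Return the unqualified exception class named on the last traceback line."""
    lines = traceback_text.strip().splitlines()
    if not lines:
        return ""
    # The last line looks like "pkg.SomeError: message" or just "SomeError".
    qualified = lines[-1].split(":", 1)[0].strip()
    return qualified.rsplit(".", 1)[-1]


def should_restart(traceback_text: str, restarts_so_far: int,
                   max_restarts: int = 3) -> bool:
    """Decide whether a failed worker is worth restarting."""
    if final_exception_name(traceback_text) in CODE_BUG_EXCEPTIONS:
        # A code bug will reproduce on every restart, so give up at once.
        return False
    return restarts_so_far < max_restarts
```

Matching on the final exception name alone is a deliberately coarse heuristic: an exception such as `RuntimeError` can come from either user code or the environment, so it is left off the list here and still consumes the normal retry budget.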

@BalaBalaYi
Collaborator

BalaBalaYi commented Apr 24, 2024

BalaBalaYi added the enhancement label and removed the bug label on Apr 24, 2024
BalaBalaYi added this to the Backlog milestone on Jun 6, 2024