[Ray scheduling] The memory already used on the Worker Node needs to be taken into account when scheduling Ray tasks #45196
Labels
core
Issues that should be addressed in Ray Core
core-scheduler
enhancement
Request for new feature and/or capability
P2
Important issue, but not time-critical
Description
Currently, when Ray schedules a task, it only takes into account the memory resources the user requested via the task's options.
This can lead to multiple memory-hungry tasks being scheduled onto a single worker node when the tasks don't specify a memory parameter, which can trigger an OOM. Even with the retry mechanism, there is no guarantee about which worker node will be chosen on the next scheduling attempt.
So the scheduler should also take into account the memory actually in use on each worker node.
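The difference between request-only and usage-aware placement can be illustrated with a small, self-contained sketch (this is not Ray's actual scheduler; `Node`, `pick_node`, and the memory figures are hypothetical):

```python
# Hypothetical sketch: choose a worker node by comparing a task's
# requested memory against each node's *actually used* memory, rather
# than only the sum of previously requested amounts.
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    total_mem: int   # memory capacity of the node (arbitrary units)
    used_mem: int    # memory actually in use, e.g. from resource reports

def pick_node(nodes, requested_mem):
    """Return the feasible node with the most free *actual* memory,
    or None if no node can fit the request."""
    feasible = [n for n in nodes if n.total_mem - n.used_mem >= requested_mem]
    if not feasible:
        return None
    return max(feasible, key=lambda n: n.total_mem - n.used_mem)

nodes = [
    Node("worker-1", total_mem=16_000, used_mem=14_000),  # nearly full
    Node("worker-2", total_mem=16_000, used_mem=2_000),   # mostly idle
]

# Under request-only accounting, a task that declared no memory could
# land on either node; usage-aware placement prefers worker-2.
print(pick_node(nodes, requested_mem=1_000).name)  # → worker-2
```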
Use case
Users may not set the memory parameter when submitting a Ray task, or they may not know how much memory the task will consume at runtime. If scheduling relies only on the memory requested by the user, it is very likely that multiple tasks with high memory consumption will be scheduled onto a single worker node. After an OOM retry is triggered, the task may be scheduled back onto the original worker node, because its requested memory parameters have not changed.
Possible solution: consider the memory already in use on the worker node when scheduling the task, and on an OOM retry expand the task's memory request based on the amount of memory it was observed using before being killed.
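The retry part of the proposal could look roughly like the following sketch (the function name and the growth factor are assumptions for illustration, not an existing Ray API):

```python
# Hypothetical sketch of the proposed OOM-retry behavior: when a task is
# retried after an OOM kill, grow its memory request based on the peak
# memory it was actually observed using, so the retry is placed on a
# node that can accommodate it instead of the node that just OOMed.
def expanded_request(requested_mem, observed_peak_mem, growth=1.5):
    """Memory to request on retry: at least the original request, and
    at least `growth` times the peak usage observed before the OOM."""
    return max(requested_mem, int(observed_peak_mem * growth))

# A task that requested 1,000 units but was killed while using ~3,000
# units would retry with an expanded request:
print(expanded_request(1_000, 3_000))  # → 4500
```

Expanding the request this way changes where the retry can be placed: a node that could satisfy the original (too small) request may no longer be feasible, which avoids the repeated-OOM loop described above.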