
[torch/distributed] Bugfix: wait for all child procs to exit before closing torch.distributed.elastic.multiprocessing.api.ProcessContext #125969

Closed
kiukchung wants to merge 1 commit

Conversation

kiukchung (Collaborator) commented May 10, 2024

Observed Problem

When torchrun has finished running the main trainer function (aka the entrypoint/user function) successfully, I noticed that it sometimes SIGTERMs the child processes. Then torchrun exits successfully.

This results in misleading warning log messages towards the end of the job like the one below:

```
W0510 14:52:48.185934  672413 api.py:513] Closing process 675171 via signal SIGTERM
W0510 14:52:48.185984  672413 api.py:513] Closing process 675172 via signal SIGTERM
W0510 14:52:48.186013  672413 api.py:513] Closing process 675174 via signal SIGTERM
# <---- ^^^ ??? everything runs successfully but child still SIGTERM'ed? ^^^ --->

I0510 14:52:48.229119  672413 api.py:877] [main] worker group successfully finished. Waiting 300 seconds for other agents to finish.
I0510 14:52:48.229161  672413 api.py:922] Local worker group finished (WorkerState.SUCCEEDED). Waiting 300 seconds for other agents to finish
I0510 14:52:48.229395  672413 api.py:936] Done waiting for other agents. Elapsed: 0.0001709461212158203 seconds
I0510 14:52:48.257544  672413 dynamic_rendezvous.py:1131] The node 'localhost_672413_0' has closed the rendezvous 'torchrun_qpfd'.
I0510 14:52:48.568198  672413 distributed.py:200] Deleting temp log directory: /tmp/torchrun_udgp8zoq
I0510 14:52:48.568989  672413 distributed.py:202] Finished running `main`
```

Root Cause

I noticed that this was due to the incorrect usage of torch.multiprocessing.ProcessContext.join() in torch.distributed.elastic.multiprocessing.api.MultiprocessingContext.

torch.multiprocessing.ProcessContext.join() does not actually wait for ALL child procs to exit; rather, it waits for at-least-one child proc to exit. If only a subset of the child procs has exited, it returns False; if all child procs have exited, it returns True.

torch.distributed.elastic.multiprocessing.api.MultiprocessingContext was assuming that torch.multiprocessing.ProcessContext.join() blocks indefinitely until all child procs have exited.
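
To make the behavior concrete, here is a minimal standalone sketch (it drives torch.multiprocessing.start_processes directly rather than going through torchelastic; the worker function and sleep durations are made up for illustration):

```
import time

import torch.multiprocessing as mp


def worker(rank):
    # rank 0 exits almost immediately, the other ranks linger for a while
    time.sleep(0 if rank == 0 else 2)


if __name__ == "__main__":
    ctx = mp.start_processes(worker, nprocs=3, join=False)

    # A single join() call returns as soon as at least one child has exited,
    # not once all of them have.
    all_exited = ctx.join()

    print(all_exited)                             # typically False at this point
    print([p.is_alive() for p in ctx.processes])  # some children are usually still alive
```

If the caller treats that first return as "all done" and moves on to cleanup, the children that are still running get SIGTERMed, which is exactly what the warning lines above show.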

Fix

The fix is simple: just keep calling pc.join() in a loop until it returns True.

NOTE: the indefinite blocking is NOT an issue, since by the time torch.distributed.elastic.multiprocessing.api.MultiprocessingContext calls pc.join() it has already done all the checking to validate that the entrypoint functions either returned successfully or that one of them failed. So we are really just waiting for the unix process to exit after running the entrypoint function.

NOTE: since pc.join() already blocks until at-least-one child proc exits, there is no need to add a polling interval in the body of the loop, and the debug log line will appear at most nproc_per_node times, so no log spamming is observed.
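
As a rough standalone sketch of the fixed pattern (again using torch.multiprocessing.start_processes as a stand-in for the elastic agent's process context; the trainer function is illustrative only):

```
import torch.multiprocessing as mp


def trainer(rank):
    # stand-in for the user-provided entrypoint function
    print(f"rank {rank}: user function finished")


if __name__ == "__main__":
    pc = mp.start_processes(trainer, nprocs=4, join=False)

    # join() blocks until at least one child exits and returns True only once
    # all of them have been joined, so keep calling it until it says so.
    while not pc.join():
        pass

    print("all child processes have exited")
```

The actual change in torch.distributed.elastic.multiprocessing.api simply wraps the existing self._pc.join() call in such a loop, as shown in the review snippet below.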

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k


pytorch-bot bot commented May 10, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/125969

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 1 Pending

As of commit 0bd79bd with merge base 7f1d5ab:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot bot added the oncall: distributed label (Add this issue/PR to distributed oncall triage queue) on May 10, 2024
@kiukchung requested a review from @d4l3k on May 10, 2024 22:19
@soulitzer added the triaged label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) on May 11, 2024
@kiukchung force-pushed the main branch 4 times, most recently from f406897 to 72b7500 on May 11, 2024 08:48
# At this point workers finished running the user function
# But the child process might still have not exited. Wait for them.
# pc.join() blocks [forever] until "a" proc exits. Loop until all of them exits.
while not self._pc.join():
Collaborator (review comment):

should we have a timeout on this? wondering what happens if we have a dead/hung worker process?

kiukchung (Collaborator, Author):

When you reach this line, we've already validated that either:

  1. the entrypoint function actually ran and returned a result
  2. -- or -- at least one of the child procs has failed (and a SIGTERM was sent to the rest)

We're waiting for the spawned child proc to exit after the user-provided function has already returned.
This potentially could hang but we were waiting for _pc.join() indefinitely before this change as well.

Collaborator:

got it -- sgtm
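
For reference, a bounded wait along the lines the review asks about could look roughly like the sketch below. This is hypothetical and not what the PR does; the hung_worker function and the 30-second grace period are made up for illustration:

```
import time

import torch.multiprocessing as mp


def hung_worker(rank):
    # simulate a worker process that never exits on its own
    time.sleep(3600)


if __name__ == "__main__":
    pc = mp.start_processes(hung_worker, nprocs=2, join=False)

    deadline = time.monotonic() + 30  # arbitrary grace period for illustration

    # join(timeout=...) returns False while some children are still running,
    # so this loop polls roughly once per second up to the deadline.
    while not pc.join(timeout=1):
        if time.monotonic() > deadline:
            for p in pc.processes:    # terminate whatever is still alive
                if p.is_alive():
                    p.terminate()
            break
```

The PR itself keeps the unbounded loop, for the reasons given in the reply above.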

Resolved review comment (outdated) on test/distributed/elastic/multiprocessing/api_test.py

Commit: [torch/distributed] Bugfix: wait for all child procs to exit before closing torch.distributed.elastic.multiprocessing.api.ProcessContext
kiukchung (Collaborator, Author):

@pytorchbot merge

@pytorch-bot bot added the ciflow/trunk label (Trigger trunk jobs on your pull request) on May 13, 2024
pytorchmergebot (Collaborator):

Merge failed

Reason: This PR needs a release notes: label
If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.


kiukchung (Collaborator, Author):

@pytorchbot merge

pytorchmergebot (Collaborator):

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team


ZelboK pushed a commit to ZelboK/pytorch that referenced this pull request May 19, 2024
[torch/distributed] Bugfix: wait for all child procs to exit before closing torch.distributed.elastic.multiprocessing.api.ProcessContext (pytorch#125969)

Pull Request resolved: pytorch#125969
Approved by: https://github.com/d4l3k
Labels: ciflow/trunk, Merged, oncall: distributed, open source, release notes: distributed (torchelastic), triaged
Projects: None yet
Development: Successfully merging this pull request may close these issues: None yet

5 participants