Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix more instances of the child_process.on('exit') race condition #4721

Merged
merged 9 commits into from
May 23, 2024

Conversation

octogonz
Copy link
Collaborator

@octogonz octogonz commented May 17, 2024

Summary

Following up from PR #4711, I audited the Rush Stack code base to look for any other invocations that might be affected by the same issue.

That PR did not trigger publishing of Rush, so I've also included a rush change file to publish a Rush release.

Details

In this lengthy Zulip chat we investigated a Rush build cache failure on Windows Subsystem for Linux (WSL) that turned out to be a race condition in the way that child_process.on('exit') was handled.

Background info

The sequence of events is like this:

  • child_process.on('data') is being used to collect STDERR and STDOUT
  • child_process.on('exit') fires indicating that the process has terminated
  • child_process.on('data') continues to fire a few more times as STDERR/STDOUT finish flushing 👈👈👈
  • child_process.on('close') finally fires indicating that the flush is complete, and includes the same exit code information that was available in 'exit'

Note that child_process.on('error') can also fire, either up front to indicate spawning failed entirely (in which case there is no exit event at all), or after some data has been received to indicate other kinds of errors.

Therefore, child_process.on('exit') is redundant and does NOT normally need to be handled at all; its main value is if the parent wants to be notified as early as possible.

The bug

The race condition bug occurs when functions are handling data to collect output, but handle exit by rejecting immediately. In this case, their STDERR/STDOUT can get truncated if the flushing is not complete.

This race condition is somewhat rare. In the case of #4711, the affected code had existed for years and years without any issues being reported. But on Windows Subsystem for Linux, @akres encountered a situation where it caused a noticeable malfunction and was reproducible.

How it was tested

I did not try to reproduce the race condition in the modified code. However I did confirm that removing the 500ms delay does not break Rundown.

Copy link
Contributor

@dmichon-msft dmichon-msft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When signal is truthy, exitCode === null, so some of these checks need to change to account for that.

apps/rundown/src/Rundown.ts Outdated Show resolved Hide resolved
libraries/node-core-library/src/Executable.ts Outdated Show resolved Hide resolved
@octogonz octogonz enabled auto-merge May 23, 2024 01:49
@octogonz octogonz merged commit 1acc142 into main May 23, 2024
5 checks passed
@octogonz octogonz deleted the octogonz/node-exit branch May 23, 2024 02:06
@octogonz octogonz mentioned this pull request May 24, 2024
7 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Closed
Development

Successfully merging this pull request may close these issues.

None yet

3 participants