Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize thread usage in monitor #1400

Merged
merged 1 commit into from May 3, 2024
Merged

Conversation

byucesoy
Copy link
Member

@byucesoy byucesoy commented Mar 27, 2024

In monitor, we create 2 threads per resource; one for SSH event loop processing
and one for actual pulse check. In previous version, each resource would keep
their threads even after the pulse check is completed. This means the number of
resources we can monitor at the same time is limited by the number of threads
we can create.
This commit changes the behavior so that after the pulse check is completed,
the threads are released. This way, we can monitor significantly more resources
at the same time.
One drawback of the new approach is that we need to re-create the threads for
each check. In my system creating 1000 threads takes about 0.025 seconds, so
overhead is seems negligible.
I also added a new helper method, needs_event_loop_for_pulse_check? to models.
We actually don't need event loop for pulse check for most of the resources,
only PostgresServer and MinioServer need it. Other resources rely on exec! to
perform their pulse check which doesn't need event loop. In fact, I observed
that extra event loop processing actually slows down the exec! calls. By taking
this into consideration, we reduce the number of threads we create and also
improve the speed of some pulse checks.
Another change we are making is that removing the monitoring_interval from the
model and hardcoding it in the monitor as 5 seconds. This removes capability of
setting different monitoring intervals for different resources. Supporting this
would require some work and since it is not used in the current implementation
I decided to remove all together. If we need to support this in the future, we
can add it back with some effort.

@byucesoy byucesoy requested review from fdr and a team March 27, 2024 04:28
bin/monitor Show resolved Hide resolved
@byucesoy byucesoy force-pushed the limit-monitor-thread-count branch 2 times, most recently from cf0bee5 to ab2df92 Compare March 27, 2024 04:48
@byucesoy byucesoy self-assigned this Mar 27, 2024
@fdr
Copy link
Collaborator

fdr commented Mar 28, 2024

Note: will read this more closely, because it is somewhat close to a rewrite, even in whitespace insensitive diff rendering mode. There's logical similarity, but I think I have to swap in the two designs from first principles, which is going to slow me down a bit.

bin/monitor Outdated Show resolved Hide resolved
@byucesoy
Copy link
Member Author

byucesoy commented Apr 1, 2024

Note: will read this more closely, because it is somewhat close to a rewrite, even in whitespace insensitive diff rendering mode. There's logical similarity, but I think I have to swap in the two designs from first principles, which is going to slow me down a bit.

Yep, there are some parts that reuses the previous logic but yep, overall this is in the rewrite territory.

@byucesoy byucesoy requested a review from fdr April 1, 2024 00:38
Copy link
Collaborator

@fdr fdr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not done yet, just some nits as I swap in the whole thing.

bin/monitor Outdated Show resolved Hide resolved
bin/monitor Outdated Show resolved Hide resolved
bin/monitor Outdated Show resolved Hide resolved
@byucesoy byucesoy requested a review from fdr April 2, 2024 16:11
@byucesoy
Copy link
Member Author

byucesoy commented Apr 2, 2024

Replied to the questions, I'll add some comments.

@byucesoy byucesoy changed the base branch from main to view-coverage-testing April 9, 2024 15:58
@byucesoy byucesoy changed the base branch from view-coverage-testing to main April 9, 2024 15:58
bin/monitor Outdated Show resolved Hide resolved
Copy link
Member

@enescakir enescakir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@byucesoy byucesoy force-pushed the limit-monitor-thread-count branch 4 times, most recently from 6605791 to e92104b Compare April 23, 2024 19:49
Copy link
Collaborator

@fdr fdr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alright put it in and watch it carefully :)

It's probably complex enough that we shouldn't get past the 100% branch coverage regime by stuffing it all in bin...but that can come, perhaps, after you make any initial adjustments.

bin/monitor Outdated Show resolved Hide resolved
In monitor, we create 2 threads per resource; one for SSH event loop processing
and one for actual pulse check. In previous version, each resource would keep
their threads even after the pulse check is completed. This means the number of
resources we can monitor at the same time is limited by the number of threads
we can create.
This commit changes the behavior so that after the pulse check is completed,
the threads are released. This way, we can monitor significantly more resources
at the same time.
One drawback of the new approach is that we need to re-create the threads for
each check. In my system creating 1000 threads takes about 0.025 seconds, so
overhead is seems negligible.
I also added a new helper method, needs_event_loop_for_pulse_check? to models.
We actually don't need event loop for pulse check for most of the resources,
only PostgresServer and MinioServer need it. Other resources rely on exec! to
perform their pulse check which doesn't need event loop. In fact, I observed
that extra event loop processing actually slows down the exec! calls. By taking
this into consideration, we reduce the number of threads we create and also
improve the speed of some pulse checks.
Another change we are making is that removing the monitoring_interval from the
model and hardcoding it in the monitor as 5 seconds. This removes capability of
setting different monitoring intervals for different resources. Supporting this
would require some work and since it is not used in the current implementation
I decided to remove all together. If we need to support this in the future, we
can add it back with some effort.
@byucesoy byucesoy force-pushed the limit-monitor-thread-count branch from b6f8215 to 1fb7765 Compare May 2, 2024 10:31
Copy link
Member

@enescakir enescakir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks good to me in general

@byucesoy byucesoy merged commit 19ec0b3 into main May 3, 2024
6 checks passed
@byucesoy byucesoy deleted the limit-monitor-thread-count branch May 3, 2024 15:13
@github-actions github-actions bot locked and limited conversation to collaborators May 3, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants