Commit
In the monitor, we create two threads per resource: one for SSH event loop processing and one for the actual pulse check. In the previous version, each resource kept its threads even after the pulse check completed, so the number of resources we could monitor at the same time was limited by the number of threads we could create.

This commit changes that behavior so the threads are released once the pulse check completes. This way, we can monitor significantly more resources at the same time. One drawback of the new approach is that we need to re-create the threads for each check. On my system, creating 1000 threads takes about 0.025 seconds, so the overhead seems negligible.

I also added a new helper method, needs_event_loop_for_pulse_check?, to the models. Most resources don't need the event loop for their pulse check; only PostgresServer and MinioServer do. Other resources rely on exec! to perform their pulse check, which doesn't need the event loop. In fact, I observed that the extra event loop processing actually slows down the exec! calls. Taking this into account reduces the number of threads we create and also speeds up some pulse checks.

Another change is removing monitoring_interval from the model and hardcoding it in the monitor as 5 seconds. This removes the ability to set different monitoring intervals for different resources. Supporting this would require some work, and since it is not used in the current implementation, I decided to remove it altogether. If we need to support this in the future, we can add it back with some effort.
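The thread-creation figure quoted in the commit message is machine-dependent. A minimal sketch of how such a measurement could be taken, assuming empty thread bodies and counting the time to join the threads as well; `time_thread_creation` is a hypothetical helper and not part of this commit:

```ruby
# Hypothetical helper: measures how long it takes to create (and join)
# `count` threads that do no work. Not part of the commit.
def time_thread_creation(count)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  threads = Array.new(count) { Thread.new {} }
  threads.each(&:join)
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
end

elapsed = time_thread_creation(1000)
puts format("created and joined 1000 threads in %.4f seconds", elapsed)
```

The result varies with hardware, Ruby version, and system load, so it is best read as an order-of-magnitude check rather than a benchmark.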
Showing 7 changed files with 294 additions and 50 deletions.
    @@ -0,0 +1,88 @@
    # frozen_string_literal: true

    class MonitorableResource
      attr_reader :deleted, :run_event_loop

      PULSE_TIMEOUT = 120

      def initialize(resource)
        @resource = resource
        @session = nil
        @mutex = Mutex.new
        @pulse = {}
        @pulse_check_started_at = Time.now
        @pulse_thread = nil
        @run_event_loop = false
        @deleted = false
      end

      def open_resource_session
        return if @session && @pulse[:reading] == "up"

        @session = @resource.reload.init_health_monitor_session
      rescue => ex
        if ex.is_a?(Sequel::NoExistingObject)
          Clog.emit("Resource is deleted.") { {resource_deleted: {ubid: @resource.ubid}} }
          @session = nil
          @deleted = true
        end
      end

      def process_event_loop
        return if @session.nil? || !@resource.needs_event_loop_for_pulse_check?

        @pulse_thread = Thread.new do
          sleep 0.01 until @run_event_loop
          @session[:ssh_session].loop(0.01) { @run_event_loop }
        rescue => ex
          Clog.emit("SSH event loop has failed.") { {event_loop_failure: {ubid: @resource.ubid, exception: Util.exception_to_hash(ex)}} }
          close_resource_session
        end
      end

      def check_pulse
        @run_event_loop = true if @resource.needs_event_loop_for_pulse_check?

        @pulse_check_started_at = Time.now
        begin
          @pulse = @resource.check_pulse(session: @session, previous_pulse: @pulse)
          Clog.emit("Got new pulse.") { {got_pulse: {ubid: @resource.ubid, pulse: @pulse}} }
        rescue => ex
          Clog.emit("Pulse checking has failed.") { {pulse_check_failure: {ubid: @resource.ubid, exception: Util.exception_to_hash(ex)}} }
        end

        @run_event_loop = false if @resource.needs_event_loop_for_pulse_check?
        @pulse_thread&.join
      end

      def close_resource_session
        return if @session.nil?

        @session[:ssh_session].shutdown!
        begin
          @session[:ssh_session].close
        rescue
        end
        @session = nil
      end

      def force_stop_if_stuck
        if @mutex.locked?
          if @pulse_check_started_at + PULSE_TIMEOUT < Time.now
            Clog.emit("Pulse check has stuck.") { {pulse_check_stuck: {ubid: @resource.ubid}} }
            ThreadPrinter.run
            Kernel.exit!
          end
        end
      end

      def lock_no_wait
        return unless @mutex.try_lock

        begin
          yield
        ensure
          @mutex.unlock
        end
      end
    end
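The monitor loop that drives this class is not shown here. As a rough sketch of the per-check threading described in the commit message, assuming one short-lived thread per resource per round: `SketchResource` and `run_round` are hypothetical stand-ins, and only the `lock_no_wait` shape is taken from the class above.

```ruby
# Hypothetical stand-in for MonitorableResource; only the locking
# behavior is reproduced, the pulse check just counts invocations.
class SketchResource
  attr_reader :checks

  def initialize
    @mutex = Mutex.new
    @checks = 0
  end

  # Same shape as lock_no_wait above: skip the work when a check is
  # already in progress instead of blocking on the mutex.
  def lock_no_wait
    return unless @mutex.try_lock

    begin
      yield
    ensure
      @mutex.unlock
    end
  end

  def check_pulse
    @checks += 1
  end
end

# One monitoring round (assumed structure): spawn a thread per resource
# and join them all, so no thread outlives its pulse check.
def run_round(resources)
  resources.map { |r|
    Thread.new { r.lock_no_wait { r.check_pulse } }
  }.each(&:join)
end

resources = Array.new(3) { SketchResource.new }
2.times { run_round(resources) }
puts resources.map(&:checks).inspect # prints [2, 2, 2]
```

Because `try_lock` returns immediately, a round that starts while a previous check is still holding the mutex skips that resource rather than queueing behind it. The real monitor may schedule rounds differently, for example on the 5-second interval mentioned in the commit message.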
    @@ -0,0 +1,152 @@
    # frozen_string_literal: true

    require_relative "../model/spec_helper"

    RSpec.describe MonitorableResource do
      let(:postgres_server) { PostgresServer.new { _1.id = "c068cac7-ed45-82db-bf38-a003582b36ee" } }
      let(:r_w_event_loop) { described_class.new(postgres_server) }
      let(:vm_host) { VmHost.new { _1.id = "46683a25-acb1-4371-afe9-d39f303e44b4" } }
      let(:r_without_event_loop) { described_class.new(vm_host) }

      describe "#open_resource_session" do
        it "returns if session is not nil and pulse reading is up" do
          r_w_event_loop.instance_variable_set(:@session, "not nil")
          r_w_event_loop.instance_variable_set(:@pulse, {reading: "up"})

          expect(postgres_server).not_to receive(:reload)
          r_w_event_loop.open_resource_session
        end

        it "sets session to resource's init_health_monitor_session" do
          expect(postgres_server).to receive(:reload).and_return(postgres_server)
          expect(postgres_server).to receive(:init_health_monitor_session).and_return("session")
          expect { r_w_event_loop.open_resource_session }.to change { r_w_event_loop.instance_variable_get(:@session) }.from(nil).to("session")
        end

        it "sets deleted to true if resource is deleted" do
          expect(postgres_server).to receive(:reload).and_raise(Sequel::NoExistingObject)
          expect { r_w_event_loop.open_resource_session }.to change(r_w_event_loop, :deleted).from(false).to(true)
        end

        it "ignores exception if it is not Sequel::NoExistingObject" do
          expect(postgres_server).to receive(:reload).and_raise(StandardError)
          expect { r_w_event_loop.open_resource_session }.not_to raise_error
        end
      end

      describe "#process_event_loop" do
        before do
          # We are monkeypatching the sleep method here to avoid the actual sleep.
          # We also use it to flip the @run_event_loop flag to true, so that the
          # loop in the process_event_loop method can exit.
          def r_w_event_loop.sleep(duration)
            puts "sleeping for #{duration}"
            @run_event_loop = true
          end
        end

        it "returns if session is nil or resource does not need event loop" do
          expect(Thread).not_to receive(:new)

          # session is nil
          r_w_event_loop.process_event_loop

          # resource does not need event loop
          r_without_event_loop.instance_variable_set(:@session, "not nil")
          r_without_event_loop.process_event_loop
        end

        it "creates a new thread and runs the event loop" do
          session = {ssh_session: instance_double(Net::SSH::Connection::Session)}
          r_w_event_loop.instance_variable_set(:@session, session)
          expect(Thread).to receive(:new).and_yield
          expect(session[:ssh_session]).to receive(:loop)
          r_w_event_loop.process_event_loop
        end

        it "swallows exception and logs it if event loop fails" do
          session = {ssh_session: instance_double(Net::SSH::Connection::Session)}
          r_w_event_loop.instance_variable_set(:@session, session)
          expect(Thread).to receive(:new).and_yield
          expect(session[:ssh_session]).to receive(:loop).and_raise(StandardError)
          expect(Clog).to receive(:emit)
          expect(r_w_event_loop).to receive(:close_resource_session)
          r_w_event_loop.process_event_loop
        end
      end

      describe "#check_pulse" do
        it "calls check_pulse on resource and sets pulse" do
          expect(postgres_server).to receive(:check_pulse).and_return({reading: "up"})
          expect { r_w_event_loop.check_pulse }.to change { r_w_event_loop.instance_variable_get(:@pulse) }.from({}).to({reading: "up"})
        end

        it "swallows exception and logs it if check_pulse fails" do
          expect(vm_host).to receive(:check_pulse).and_raise(StandardError)
          expect(Clog).to receive(:emit)
          expect { r_without_event_loop.check_pulse }.not_to raise_error
        end

        it "waits for the pulse thread to finish" do
          pulse_thread = Thread.new {}
          expect(pulse_thread).to receive(:join)
          r_w_event_loop.instance_variable_set(:@pulse_thread, pulse_thread)
          r_w_event_loop.check_pulse
        end
      end

      describe "#close_resource_session" do
        it "returns if session is nil" do
          session = {ssh_session: instance_double(Net::SSH::Connection::Session)}
          expect(session[:ssh_session]).not_to receive(:shutdown!)
          expect(session).to receive(:nil?).and_return(true)
          r_w_event_loop.instance_variable_set(:@session, session)
          r_w_event_loop.close_resource_session
        end

        it "shuts down and closes the session" do
          session = {ssh_session: instance_double(Net::SSH::Connection::Session)}
          expect(session[:ssh_session]).to receive(:shutdown!)
          expect(session[:ssh_session]).to receive(:close)
          r_w_event_loop.instance_variable_set(:@session, session)
          r_w_event_loop.close_resource_session
        end
      end

      describe "#force_stop_if_stuck" do
        it "does nothing if pulse check is not stuck" do
          expect(Kernel).not_to receive(:exit!)

          # not locked
          r_w_event_loop.force_stop_if_stuck

          # not timed out
          r_w_event_loop.instance_variable_get(:@mutex).lock
          r_w_event_loop.instance_variable_set(:@pulse_check_started_at, Time.now)
          r_w_event_loop.force_stop_if_stuck
          r_w_event_loop.instance_variable_get(:@mutex).unlock
        end

        it "triggers Kernel.exit if pulse check is stuck" do
          expect(ThreadPrinter).to receive(:run)
          expect(Kernel).to receive(:exit!)

          r_w_event_loop.instance_variable_get(:@mutex).lock
          r_w_event_loop.instance_variable_set(:@pulse_check_started_at, Time.now - 200)
          r_w_event_loop.force_stop_if_stuck
          r_w_event_loop.instance_variable_get(:@mutex).unlock
        end
      end

      describe "#lock_no_wait" do
        it "does not yield if mutex is locked" do
          r_w_event_loop.instance_variable_get(:@mutex).lock
          expect { |b| r_w_event_loop.lock_no_wait(&b) }.not_to yield_control
          r_w_event_loop.instance_variable_get(:@mutex).unlock
        end

        it "yields if mutex is not locked" do
          expect { |b| r_w_event_loop.lock_no_wait(&b) }.to yield_control
        end
      end
    end
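The `before` block in `#process_event_loop` relies on a singleton method to replace `sleep` on one object only. A standalone sketch of that technique, outside RSpec: `Waiter` and `patched_waiter` are hypothetical names used for illustration and do not appear in the commit.

```ruby
# Hypothetical class that polls a flag, similar in shape to the
# `sleep 0.01 until @run_event_loop` line in process_event_loop.
class Waiter
  attr_reader :ready

  def initialize
    @ready = false
  end

  def wait_until_ready
    sleep 0.01 until @ready
    :done
  end
end

# Builds a Waiter whose sleep is overridden for that one instance.
def patched_waiter
  waiter = Waiter.new

  # Singleton method: shadows Kernel#sleep for this object only and
  # flips the flag, so the polling loop exits without really sleeping.
  def waiter.sleep(_duration)
    @ready = true
  end

  waiter
end

puts patched_waiter.wait_until_ready.inspect # prints :done
```

Other `Waiter` instances, and every other object in the process, keep the regular `Kernel#sleep`, which is what makes this override safe to use inside a single example.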