Detect and down suspended process, #30323 #30335
base: main
Conversation
if (suspendDetector.wasSuspended(suspendTimeout)) {
  // note that suspend detection is running on all nodes
  suspendDetected()
} else {
The changes here are mostly indentation; the suspended case was added above this `if`.
log.warning(
  ClusterLogMarker.sbrInstability,
  "SBR detected that the process was suspended too long and will down itself.")
actOnDecision(DowningStrategy.DownSelfSuspended)
This is what the logs from a `kill -STOP` followed by a `kill -CONT` look like:
[2021-06-21 12:17:18,570] [WARN] [akka.dispatch.SuspendDetector] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-32] - Process was suspended for [21 seconds]
[2021-06-21 12:17:18,575] [WARN] [akka.cluster.sbr.SplitBrainResolver] [akka://ShoppingCartService@127.0.0.1:2551] [akkaSbrInstability] [ShoppingCartService-akka.actor.default-dispatcher-32] - SBR detected that the process was suspended too long and will down itself.
[2021-06-21 12:17:18,575] [WARN] [akka.cluster.ClusterHeartbeat] [akka://ShoppingCartService@127.0.0.1:2551] [akkaHeartbeatStarvation] [ShoppingCartService-akka.actor.default-dispatcher-32] - Cluster Node [akka://ShoppingCartService@127.0.0.1:2551] - Scheduled sending of heartbeat was delayed. Previous heartbeat was sent [21634] ms ago, expected interval is [1000] ms. This may cause failure detection to mark members as unreachable. The reason can be thread starvation, CPU overload, or GC.
[2021-06-21 12:17:18,584] [WARN] [akka.cluster.sbr.SplitBrainResolver] [akka://ShoppingCartService@127.0.0.1:2551] [akkaSbrDowning] [ShoppingCartService-akka.actor.default-dispatcher-32] - SBR took decision DownSelfSuspended and is downing [akka://ShoppingCartService@127.0.0.1:2551] including myself,, [0] unreachable of [2] members, all members in DC [Member(akka://ShoppingCartService@127.0.0.1:2551, Up), Member(akka://ShoppingCartService@127.0.0.1:2552, Up)], full reachability status: []
[2021-06-21 12:17:18,585] [INFO] [akka.cluster.sbr.SplitBrainResolver] [akka://ShoppingCartService@127.0.0.1:2551] [akkaSbrDowningNode] [ShoppingCartService-akka.actor.default-dispatcher-32] - SBR is downing [UniqueAddress(akka://ShoppingCartService@127.0.0.1:2551,-2763929823204594617)]
[2021-06-21 12:17:18,585] [WARN] [akka.cluster.sbr.SplitBrainResolver] [akka://ShoppingCartService@127.0.0.1:2551] [akkaSbrInstability] [ShoppingCartService-akka.actor.default-dispatcher-32] - SBR detected that the process was suspended too long and will down itself.
[2021-06-21 12:17:18,585] [WARN] [akka.cluster.sbr.SplitBrainResolver] [akka://ShoppingCartService@127.0.0.1:2551] [akkaSbrDowning] [ShoppingCartService-akka.actor.default-dispatcher-32] - SBR took decision DownSelfSuspended and is downing [akka://ShoppingCartService@127.0.0.1:2551] including myself,, [0] unreachable of [2] members, all members in DC [Member(akka://ShoppingCartService@127.0.0.1:2551, Up), Member(akka://ShoppingCartService@127.0.0.1:2552, Up)], full reachability status: []
[2021-06-21 12:17:18,585] [INFO] [akka.cluster.Cluster] [akka://ShoppingCartService@127.0.0.1:2551] [akkaMemberChanged] [ShoppingCartService-akka.actor.default-dispatcher-32] - Cluster Node [akka://ShoppingCartService@127.0.0.1:2551] - Marking node [akka://ShoppingCartService@127.0.0.1:2551] as [Down]
[2021-06-21 12:17:18,586] [INFO] [akka.cluster.sbr.SplitBrainResolver] [akka://ShoppingCartService@127.0.0.1:2551] [akkaSbrDowningNode] [ShoppingCartService-akka.actor.default-dispatcher-32] - SBR is downing [UniqueAddress(akka://ShoppingCartService@127.0.0.1:2551,-2763929823204594617)]
[2021-06-21 12:17:18,586] [INFO] [akka.cluster.sharding.ShardRegion] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-33] - ShoppingCart: Self downed, stopping ShardRegion [akka://ShoppingCartService/system/sharding/ShoppingCart]
[2021-06-21 12:17:18,586] [INFO] [akka.cluster.singleton.ClusterSingletonManager] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-33] - Self downed, stopping
[2021-06-21 12:17:18,586] [INFO] [akka.cluster.Cluster] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-33] - Cluster Node [akka://ShoppingCartService@127.0.0.1:2551] - event MemberDowned(Member(akka://ShoppingCartService@127.0.0.1:2551, Down))
[2021-06-21 12:17:18,586] [INFO] [akka.cluster.singleton.ClusterSingletonManager] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-33] - Singleton manager stopping singleton actor [akka://ShoppingCartService/system/sharding/ShoppingCartCoordinator/singleton]
[2021-06-21 12:17:18,587] [INFO] [akka.cluster.Cluster] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-31] - Cluster Node [akka://ShoppingCartService@127.0.0.1:2551] - event LeaderChanged(Some(akka://ShoppingCartService@127.0.0.1:2552))
[2021-06-21 12:17:18,587] [INFO] [akka.cluster.sbr.SplitBrainResolver] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-31] - This node is not the leader any more and not responsible for taking SBR decisions.
[2021-06-21 12:17:18,587] [INFO] [akka.cluster.Cluster] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-31] - Cluster Node [akka://ShoppingCartService@127.0.0.1:2551] - event RoleLeaderChanged(dc-default,Some(akka://ShoppingCartService@127.0.0.1:2552))
[2021-06-21 12:17:18,587] [INFO] [akka.cluster.singleton.ClusterSingletonManager] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-33] - ClusterSingletonManager state change [Oldest -> Stopping]
[2021-06-21 12:17:18,588] [DEBUG] [shopping.cart.ShoppingCart$] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-31] - PostStop cart [cart2]
[2021-06-21 12:17:18,589] [DEBUG] [akka.cluster.sharding.Shard] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-3] - ShoppingCart: Shard [506] shutting down
[2021-06-21 12:17:18,591] [DEBUG] [akka.cluster.sharding.DDataShardCoordinator] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-3] - ShoppingCart: Received termination message.
[2021-06-21 12:17:18,592] [DEBUG] [akka.cluster.Cluster] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-3] - Unwatching: [Actor[akka://ShoppingCartService/system/sharding/ShoppingCartCoordinator/singleton/coordinator#1190468820] -> Actor[akka://ShoppingCartService@127.0.0.1:2552/system/sharding/ShoppingCart#279229579]]
[2021-06-21 12:17:18,592] [DEBUG] [akka.cluster.Cluster] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-3] - Cleanup self watch of [akka://ShoppingCartService@127.0.0.1:2552/system/sharding/ShoppingCart]
[2021-06-21 12:17:18,593] [INFO] [akka.cluster.singleton.ClusterSingletonManager] [akka://ShoppingCartService@127.0.0.1:2551] [akkaClusterSingletonTerminated] [ShoppingCartService-akka.actor.default-dispatcher-32] - Singleton actor [akka://ShoppingCartService/system/sharding/ShoppingCartCoordinator/singleton] was terminated
[2021-06-21 12:17:19,575] [WARN] [akka.cluster.sbr.SplitBrainResolver] [akka://ShoppingCartService@127.0.0.1:2551] [akkaSbrInstability] [ShoppingCartService-akka.actor.default-dispatcher-35] - SBR detected that the process was suspended too long and will down itself.
[2021-06-21 12:17:19,575] [INFO] [akka.cluster.Cluster] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Cluster Node [akka://ShoppingCartService@127.0.0.1:2551] - is no longer leader
[2021-06-21 12:17:19,575] [WARN] [akka.cluster.sbr.SplitBrainResolver] [akka://ShoppingCartService@127.0.0.1:2551] [akkaSbrDowning] [ShoppingCartService-akka.actor.default-dispatcher-35] - SBR took decision DownSelfSuspended and is downing [akka://ShoppingCartService@127.0.0.1:2551] including myself,, [0] unreachable of [2] members, all members in DC [Member(akka://ShoppingCartService@127.0.0.1:2551, Up), Member(akka://ShoppingCartService@127.0.0.1:2552, Up)], full reachability status: []
[2021-06-21 12:17:19,575] [INFO] [akka.cluster.sbr.SplitBrainResolver] [akka://ShoppingCartService@127.0.0.1:2551] [akkaSbrDowningNode] [ShoppingCartService-akka.actor.default-dispatcher-35] - SBR is downing [UniqueAddress(akka://ShoppingCartService@127.0.0.1:2551,-2763929823204594617)]
[2021-06-21 12:17:19,576] [INFO] [akka.cluster.Cluster] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Cluster Node [akka://ShoppingCartService@127.0.0.1:2551] - Node has been marked as DOWN. Shutting down myself
[2021-06-21 12:17:19,577] [INFO] [akka.cluster.Cluster] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Cluster Node [akka://ShoppingCartService@127.0.0.1:2551] - Shutting down...
[2021-06-21 12:17:19,577] [DEBUG] [akka.cluster.typed.internal.receptionist.ClusterReceptionist] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.internal-dispatcher-14] - ClusterReceptionist [akka://ShoppingCartService@127.0.0.1:2551] - terminated/removed
[2021-06-21 12:17:19,578] [INFO] [akka.cluster.Cluster] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Cluster Node [akka://ShoppingCartService@127.0.0.1:2551] - Successfully shut down
[2021-06-21 12:17:19,579] [DEBUG] [akka.cluster.Cluster] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-21] - Publish AddressTerminated [akka://ShoppingCartService@127.0.0.1:2552]
[2021-06-21 12:17:19,580] [DEBUG] [akka.cluster.typed.internal.receptionist.ClusterReceptionist] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.internal-dispatcher-22] - Cancel all timers
[2021-06-21 12:17:19,583] [INFO] [akka.actor.LocalActorRef] [akka://ShoppingCartService@127.0.0.1:2551] [akkaDeadLetter] [ShoppingCartService-akka.actor.default-dispatcher-35] - Message [akka.cluster.ClusterUserAction$Down] to Actor[akka://ShoppingCartService/system/cluster/core/daemon#572391440] was not delivered. [1] dead letters encountered. If this is not an expected behavior then Actor[akka://ShoppingCartService/system/cluster/core/daemon#572391440] may have terminated unexpectedly. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
[2021-06-21 12:17:19,583] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Running CoordinatedShutdown with reason [ClusterDowningReason]
[2021-06-21 12:17:19,583] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing phase [before-service-unbind] with [0] tasks
[2021-06-21 12:17:19,584] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing phase [service-unbind] with [1] tasks.
[2021-06-21 12:17:19,586] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing task [http-unbind-/127.0.0.1:8101] in CoordinatedShutdown phase [service-unbind]
[2021-06-21 12:17:19,588] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-3] - Performing phase [service-requests-done] with [1] tasks.
[2021-06-21 12:17:19,588] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-3] - Performing task [http-terminate-/127.0.0.1:8101] in CoordinatedShutdown phase [service-requests-done]
[2021-06-21 12:17:19,589] [INFO] [akka.actor.LocalActorRef] [akka://ShoppingCartService@127.0.0.1:2551] [akkaDeadLetter] [ShoppingCartService-akka.actor.default-dispatcher-35] - Message [akka.cluster.GossipEnvelope] from Actor[akka://ShoppingCartService@127.0.0.1:2552/system/cluster/core/daemon#1146845275] to Actor[akka://ShoppingCartService/system/cluster/core/daemon#572391440] was not delivered. [2] dead letters encountered. If this is not an expected behavior then Actor[akka://ShoppingCartService/system/cluster/core/daemon#572391440] may have terminated unexpectedly. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
[2021-06-21 12:17:19,590] [INFO] [akka.actor.LocalActorRef] [akka://ShoppingCartService@127.0.0.1:2551] [akkaDeadLetter] [ShoppingCartService-akka.actor.default-dispatcher-35] - Message [akka.cluster.GossipEnvelope] from Actor[akka://ShoppingCartService@127.0.0.1:2552/system/cluster/core/daemon#1146845275] to Actor[akka://ShoppingCartService/system/cluster/core/daemon#572391440] was not delivered. [3] dead letters encountered. If this is not an expected behavior then Actor[akka://ShoppingCartService/system/cluster/core/daemon#572391440] may have terminated unexpectedly. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
[2021-06-21 12:17:19,591] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing phase [service-stop] with [0] tasks
[2021-06-21 12:17:19,591] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing phase [before-cluster-shutdown] with [0] tasks
[2021-06-21 12:17:19,591] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing phase [cluster-sharding-shutdown-region] with [1] tasks.
[2021-06-21 12:17:19,591] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing task [region-shutdown] in CoordinatedShutdown phase [cluster-sharding-shutdown-region]
[2021-06-21 12:17:19,591] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing phase [cluster-leave] with [1] tasks.
[2021-06-21 12:17:19,591] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing task [leave] in CoordinatedShutdown phase [cluster-leave]
[2021-06-21 12:17:19,591] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing phase [cluster-exiting] with [4] tasks.
[2021-06-21 12:17:19,591] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing task [wait-exiting] in CoordinatedShutdown phase [cluster-exiting]
[2021-06-21 12:17:19,591] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing task [wait-singleton-exiting] in CoordinatedShutdown phase [cluster-exiting]
[2021-06-21 12:17:19,591] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing task [singleton-exiting-2] in CoordinatedShutdown phase [cluster-exiting]
[2021-06-21 12:17:19,591] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing task [singleton-exiting-1] in CoordinatedShutdown phase [cluster-exiting]
[2021-06-21 12:17:19,591] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing phase [cluster-exiting-done] with [1] tasks.
[2021-06-21 12:17:19,592] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing task [exiting-completed] in CoordinatedShutdown phase [cluster-exiting-done]
[2021-06-21 12:17:19,592] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing phase [cluster-shutdown] with [1] tasks.
[2021-06-21 12:17:19,592] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing task [wait-shutdown] in CoordinatedShutdown phase [cluster-shutdown]
[2021-06-21 12:17:19,592] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing phase [before-actor-system-terminate] with [0] tasks
[2021-06-21 12:17:19,592] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing phase [actor-system-terminate] with [1] tasks.
[2021-06-21 12:17:19,592] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing task [terminate-system] in CoordinatedShutdown phase [actor-system-terminate]
[2021-06-21 12:17:19,594] [INFO] [akka.remote.RemoteActorRefProvider$RemotingTerminator] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-3] - Shutting down remote daemon.
[2021-06-21 12:17:19,595] [INFO] [akka.remote.RemoteActorRefProvider$RemotingTerminator] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-3] - Remote daemon shut down; proceeding with flushing remote transports.
[2021-06-21 12:17:19,631] [WARN] [akka.stream.Materializer] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-3] - [outbound connection to [akka://ShoppingCartService@127.0.0.1:2552], control stream] Upstream failed, cause: StreamTcpException: The connection has been aborted
[2021-06-21 12:17:19,631] [WARN] [akka.stream.Materializer] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-3] - [outbound connection to [akka://ShoppingCartService@127.0.0.1:2552], message stream] Upstream failed, cause: StreamTcpException: The connection has been aborted
[2021-06-21 12:17:19,639] [INFO] [akka.remote.RemoteActorRefProvider$RemotingTerminator] [akka://ShoppingCartService@127.0.0.1:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-3] - Remoting shut down.
[2021-06-21 12:17:19,646] [INFO] [com.zaxxer.hikari.HikariDataSource] [] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - db - Shutdown initiated...
[2021-06-21 12:17:19,650] [INFO] [com.zaxxer.hikari.HikariDataSource] [] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - db - Shutdown completed.
One thing to note is that sharding already acts quickly on the downing, see:
ShoppingCart: Self downed, stopping ShardRegion
PostStop cart [cart2]
import SuspendDetector.SuspendDetected

// FIXME config
private val tickInterval = 100.millis
This might not have to run this frequently, because I think the scheduled task will trigger immediately when the process is woken up.
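The tick-based mechanism being discussed can be illustrated outside of Akka with plain JDK scheduling. This is a hypothetical standalone sketch, not the PR's actual code (names like `SuspendCheck`, `tickMillis`, and `toleranceMillis` are mine): a fixed-delay task records a "last alive" timestamp, and a gap much larger than the tick interval suggests the process made no progress, e.g. after a `kill -STOP`.

```scala
import java.util.concurrent.{Executors, TimeUnit}
import java.util.concurrent.atomic.AtomicLong

// Hypothetical sketch of tick-based suspend detection (not the PR's code).
final class SuspendCheck(tickMillis: Long, toleranceMillis: Long) {
  // Updated by the scheduled task on every tick; a stale value means the
  // scheduler (and likely the whole process) made no progress for a while.
  private val lastAlive = new AtomicLong(System.nanoTime())
  private val scheduler = Executors.newSingleThreadScheduledExecutor()

  scheduler.scheduleWithFixedDelay(
    () => lastAlive.set(System.nanoTime()),
    tickMillis, tickMillis, TimeUnit.MILLISECONDS)

  def wasSuspended(): Boolean = {
    val gapMillis = (System.nanoTime() - lastAlive.get()) / 1000000L
    gapMillis > tickMillis + toleranceMillis
  }

  def shutdown(): Unit = { scheduler.shutdownNow(); () }
}
```

As the comment above notes, a delayed timer tends to fire immediately once the process is resumed, so the tick interval mostly bounds how stale `lastAlive` can get during normal operation rather than the reaction time after a resume.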
Test FAILed.
Looks good, I guess it's mostly just missing config for the various timeouts?
Yeah, I wanted to throw out the idea before doing too much. Also thinking about whether there are more places where it should check. In persistent actors, or should that be in the plugins? Thinking it could abort any updates in flight (that haven't been sent to the db yet).
Hmm, yeah, for persistence it would be to avoid delayed writes when the actor was already started on another node? I can't think of any other place where it is immediately obvious to me that it would be good. I think I'll have to "marinate" the idea for a while and see what comes up.
@volatile private var aliveTime = System.nanoTime()
@volatile private var suspendDetectedTime = aliveTime - 1.day.toNanos

system.scheduler.scheduleWithFixedDelay(tickInterval, tickInterval) { () =>
Wouldn't this approach (scheduling `checkTime` on the scheduler and dispatcher) yield a false positive in cases of thread starvation within the application? Thread starvation like that is definitely https://www.youtube.com/watch?v=MpUWrl3-mc8, but the response of pre-emptively tearing things down might be out of proportion; so we have a very loose deadline.
It seems like the intent is to capture scenarios where nothing in the JVM process gets scheduled (suspended process/VM, but also sufficiently-high levels of host load or priority inversions etc.), so maybe having the check run in a thread that is nearly always sleeping would be more effective and could have a faster reaction time? My understanding is that the kernel schedulers tend to be eager to give CPU to processes that haven't been runnable in a long time (e.g. because they're sleeping).
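The dedicated-thread variant suggested here could look like the following standalone sketch (hypothetical; the names `SleepingDetector` and `onSuspend` are mine, not from the PR): a daemon thread sleeps for a fixed interval and measures how long the sleep actually took. An oversleep far beyond the requested interval indicates the process was suspended or severely starved, and because the thread is almost always sleeping it competes very little for CPU, which fits the observation that schedulers are eager to wake long-idle threads.

```scala
// Hypothetical sketch of a dedicated mostly-sleeping detector thread.
final class SleepingDetector(intervalMillis: Long, thresholdMillis: Long)(
    onSuspend: Long => Unit) {
  @volatile private var running = true

  private val thread = new Thread(() => {
    try {
      while (running) {
        val before = System.nanoTime()
        Thread.sleep(intervalMillis)
        val sleptMillis = (System.nanoTime() - before) / 1000000L
        // Slept far longer than requested: likely suspended/starved.
        if (sleptMillis - intervalMillis > thresholdMillis)
          onSuspend(sleptMillis)
      }
    } catch {
      case _: InterruptedException => () // stop() interrupts the sleep
    }
  })
  thread.setDaemon(true)
  thread.start()

  def stop(): Unit = { running = false; thread.interrupt() }
}
```

Under normal load the measured sleep stays close to the requested interval, so the threshold can be generous (seconds) without false positives, matching the "very loose deadline" point above.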
Early draft for discussion...
References #30323