Worker reconnect fails due to exception while attaching #4678
I think I have looked into similar issues on the latent worker side, but didn't realize that regular workers would be affected by network issues too. The problem here is that the master thinks the worker is still attached, so when a new connection is attempted it concludes that a duplicate worker has connected. At the very least we should drop the old worker after a timeout; a sketch of that idea follows below.
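A minimal standalone sketch of that timeout idea, not Buildbot's actual code (WorkerRegistry and the ping()/close() methods are hypothetical stand-ins for the master's bookkeeping): when a second connection arrives for a worker the master believes is still attached, ping the old connection and drop it if it does not answer within a timeout, rather than rejecting the reconnect outright.

```python
import asyncio

class WorkerRegistry:
    """Toy model of the master's view of worker connections."""

    def __init__(self, ping_timeout=10.0):
        self.ping_timeout = ping_timeout
        self.connections = {}  # worker name -> connection object

    async def new_connection(self, name, new_conn):
        old_conn = self.connections.get(name)
        if old_conn is not None:
            try:
                # If the old connection really is alive, this is a true duplicate.
                await asyncio.wait_for(old_conn.ping(), self.ping_timeout)
                raise RuntimeError(f"duplicate connection for worker {name!r}")
            except (asyncio.TimeoutError, ConnectionError):
                # The old connection is dead: drop it after the timeout
                # instead of rejecting the reconnect forever.
                await old_conn.close()
                self.connections.pop(name, None)
        self.connections[name] = new_conn
```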
Possibly related: we have seen a number of cases where jobs on some of our Windows testers break because the master detects an unexpected loss of connection to the worker. The problem appears to be that the worker still thinks it has a connection, and therefore neither reconnects nor reboots the machine as it is supposed to (these workers are configured to run one test, then reboot on a graceful shutdown). We did see similar disconnects in the old 0.8.4-based Buildbot, but in those cases the reboot did happen. Occasionally, when they occur during a Git update step, these disconnects break the git checkout because an index.lock file remains afterwards (and this may cause worse problems later).
Just had it happen again. What I see is that the worker rebooted for some reason and reported in its log that it had reconnected to the master. About 15 minutes later the master reported losing the connection in "a non-clean fashion" and quit the job. Replies to keepalives are "Everything is OK". So it seems to be a similar type of incident.
This issue continues to happen for us, and in some cases fixing it requires pausing all workers, waiting until they are idle, and then restarting the manager process. The most recent cases were this week: several separate incidents, typically caused by a short network disconnect between worker and manager.
I have been having another look at this (it keeps happening fairly frequently) while working on the 2.10 upgrade, and my best guess is that the assert happens in attached(). This suggests to me that the Worker object is reused without properly resetting self.conn. I see that detached() does reset that variable, but disconnect() doesn't. Should it also have reset self.conn? A sketch of what that would look like follows below.
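A minimal sketch of the suspected problem and fix, assuming the rough shape of AbstractWorker in master/buildbot/worker/base.py (the class below is a simplified stand-in, not the real implementation):

```python
class Worker:
    """Simplified stand-in for Buildbot's AbstractWorker."""

    def __init__(self):
        self.conn = None

    def attached(self, conn):
        # The assert discussed above: a stale self.conn makes this fail
        # on reconnect even though the old transport is actually gone.
        assert self.conn is None, "already attached"
        self.conn = conn

    def detached(self):
        # detached() does reset the reference ...
        self.conn = None

    def disconnect(self):
        # ... and the question above is whether disconnect() should too,
        # so a reused Worker object never carries a stale self.conn.
        self.conn = None
```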
For reference, we are still occasionally seeing this issue when the network connection to a worker gets broken, with the same exception reported in the log. The most recent case was this week. We are now using Buildbot v3.1.1.
We just had another case like this on our (now) 3.3.0 cluster. This case is odd. According to the log, one of the workers started reconnecting (probably due to a network hiccup causing the connection to be lost). The manager determined that it had lost the connection to the other end, allowed the new connection, and attached the worker to the associated builders.

However: the connection icon for the worker is red/disconnected, and no email has been sent about the worker connection being lost. Even odder: despite being marked as "offline" in the GUI, the worker is accepting tasks and completing them successfully.

This seems to indicate that the internal connected state for the worker and the GUI state for the worker are out of sync for some reason. (Reloading the GUI pages does not fix the state.)
I just started looking at an update to Buildbot 3.6, and had another look at the code suspected of triggering the assert that is causing this problem. I see that the attached() assert in master/buildbot/worker/base.py has been changed to provide more information in commit e37a01c, and afterwards I realized there are a couple of questions that should be asked:
I have no idea which of these best describes what happens, or if there is a better explanation. One possible theory, if the conns are different and self.conn is the older one, is that there is a race condition between removing the old attachments and adding the new ones. This might be handled with some kind of locking, or by using a task that then fires off the reattach task. The other part of the problem, that the connection stays alive despite not being attached to any builder, might be due to missing cleanup during the exception handling. One possible way to handle/work around this would be to regularly verify that each worker connection is attached to all builders it is assigned to, e.g. during the ping; see the sketch below.
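A minimal sketch of that verification idea, using plain dicts as hypothetical stand-ins for the manager's worker and builder bookkeeping:

```python
def verify_attachments(workers, builders):
    """Close worker connections that are attached to no builder.

    workers:  dict mapping worker name -> connection (or None)
    builders: dict mapping builder name -> set of attached worker names
    """
    for name, conn in list(workers.items()):
        if conn is None:
            continue
        attached_anywhere = any(name in attached for attached in builders.values())
        if not attached_anywhere:
            # Connected but attached to nothing: the "online but offline"
            # state described above, so force a clean reconnect.
            conn.close()
            workers[name] = None
```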
We have just had yet another incident which left a worker in the "quantum superposition" of being connected, yet disconnected, after an ISP network failure between the manager and some of our workers. To resolve the situation I will have to stop the manager and start it again, which means I will have to pause all workers and wait several hours until the current tasks have completed. What I see from the logs below is that the manager
That suggests to me that the ping test action fails to disconnect the old connection and detach it from the builders it was attached to, or at least fails to mark it as no longer active. The log from the manager:
The log from the worker ("arbeider"); as can be seen, it took several minutes before the worker was able to reconnect to the manager:
Investigating a bit, I am guessing that there are further references to old_conn elsewhere that are not removed until the network error for the lost socket arrives. Further, I guess that the attach failure does not result in the WorkerManager closing and removing the new connection.
I did a test with forcing the scenario. My guess is, as above, that references to the old connection remain. I also noticed a problem with the remove callback. BTW, it is now over 4 years since I first reported this problem.
I am having this issue with 3.9.2 as well.
I have been trying to create a test case that reproduces this issue, but the standard null-protocol workers did not reproduce it, and I was unable to create a working PB-based test that "severs" the connection (SeverWorkerConnectionMixin) and then immediately reconnects; the null version always passed.

I finally managed to reproduce it by asserting in AbstractWorker.detached, which removes the connection, and tracing back what triggered that call: it is notifyDisconnected(). notifyDisconnected() is called by loseConnection in null connections, but not in PB connections, so null connections are fully removed by WorkerManager.newConnection() while PB connections are not. That leaves the old connection still associated with the worker when the new connection is being attached, which triggers the "but we are already connected to" assert. My guess is that pb.Connection.loseConnection will need to add a call to notifyDisconnected(). I am not sure exactly where such a call should be added; my guess is somewhere after the transport's loseConnection call. A sketch of the idea follows below.

I am presently planning to deploy the hotpatch I created earlier this year, which adds notifyDisconnected() in newConnection, but I was never quite sure whether it was safe to deploy. Now, seeing that the null protocol implements that solution, I think it is safe, although I don't think that function is the one that should carry the fix; it will just be a workaround that implements what appears to be necessary.

It might also be an idea to add a backup check that verifies that connected workers are actually attached to at least one builder, and closes connections that aren't attached, e.g. during the keepalive test. A separate backup step might be an action in the web GUI, for each worker, that disconnects all registered connections for it. This could be one of the actions added by my suggestion in #5833.

BTW, I recently observed this issue with a worker located in the same rack as the manager. The worker had an unstable connection, apparently caused by a bad network cable, and in one specific case it was connected but "offline".
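A minimal standalone sketch of the proposed change; the class below is a simplified stand-in for a worker protocol connection, not the real pb.Connection:

```python
class Connection:
    """Simplified stand-in for a worker protocol connection."""

    def __init__(self):
        self._disconnect_subs = []

    def notifyOnDisconnect(self, callback):
        self._disconnect_subs.append(callback)

    def notifyDisconnected(self):
        for callback in self._disconnect_subs:
            callback()

    def loseConnection(self):
        self._close_transport()    # stand-in for the PB transport teardown
        self.notifyDisconnected()  # the call suggested above, matching
                                   # what the null protocol already does

    def _close_transport(self):
        pass  # the real code would close the underlying Twisted transport
```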
Also, I suspect that exceptions like the "but we are already connected to" exception should be accompanied by properly cleaning up and closing the connection. Part of the problem in this report is that when the newer connection was detected as conflicting with an existing connection, causing the exception to be raised, the newer connection was not closed. That left it hanging, kept alive by the connection manager, preventing the worker from trying again. A sketch of that cleanup follows below.
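A minimal sketch of that cleanup; attach_new_connection is a hypothetical helper, not an existing Buildbot function:

```python
def attach_new_connection(worker, new_conn):
    """Attach new_conn to worker, closing it if attachment fails."""
    try:
        worker.attached(new_conn)
    except Exception:
        # Don't leave the conflicting connection half-registered and kept
        # alive by keepalives; close it so the worker notices and retries.
        new_conn.loseConnection()
        raise
```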
…from a worker when there is already a known (but actually disconnected) connection to the same worker. Additionally, match pb.loseConnection with null.loseConnection() by calling self.notifyDisconnected() to make sure the connection is shut down properly. This function is called in both places to make sure the old connection is actually shut down. Otherwise, the worker would "know" it was "connected" while the manager equally "knew" that the worker was "offline" because it was not attached to any builders, all the while the low-level connection maintenance functionality kept sending and receiving keep-alives, keeping the unattached connection open. (buildbot#4678)
… with a worker (buildbot#4678) This is just-in-case logic for when the connection is somehow detached from all Builders but still left active.
I have been doing more thinking about possible causes after the hotpatch didn't actually work as intended, and recently reviewed the log entries from one of the events. I think the original case (without the hotpatch) probably went like this:
My HotPatch (and the PR) solves the detachment problem by calling notifyDisconnect() for the old connection. This causes the connection to no longer be attached to the builders. Unfortunately, this did not work properly. What happened after the Builders were attached to the new connection is as follows:
What I now suspect happened is:
The reason this happens is that notifyDisconnect isn't a fire-once callback subscription, but a multiple-callback subscription. Add in that the callbacks do not verify that the calling connection is still the active connection (some of the callbacks do not even have that information), and they can ride roughshod over the newer connection's attachment and other state. The minimal way to solve this is to delete the subscriptions in the connection after they have been delivered, replacing the old subscription point with a new one; see the sketch after this comment. My more recent updates to the PR do this, and also add a test case for the fire-twice case (which fails if the notifyDisconnect update isn't there). I have not yet tested this in my hotpatch; I am waiting for the next incident to prove that the disconnect function was called twice. There might be further changes that could strengthen the protection against this issue:
A couple of things I am not sure about are how sensitive Twisted is to connections being closed due to network errors, and whether there are issues with forcibly closing a socket maintained by Twisted.
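A minimal sketch of the fire-once behavior described above, again with a simplified stand-in class rather than the real connection code:

```python
class Connection:
    """Simplified stand-in showing fire-once disconnect notification."""

    def __init__(self):
        self._disconnect_subs = []

    def notifyOnDisconnect(self, callback):
        self._disconnect_subs.append(callback)

    def notifyDisconnected(self):
        # Swap in a fresh subscription list before delivering, so these
        # callbacks run at most once; a later (spurious) second call finds
        # an empty list and cannot trample the newer connection's state.
        subs, self._disconnect_subs = self._disconnect_subs, []
        for callback in subs:
            callback()
```

Callbacks registered after the swap (e.g. by a newer connection) land on the fresh list, so they are unaffected by the old connection's delivery.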
Hmmm, a related possibility I just started wondering about: could the apparent problem with Twisted not detecting (or handling/acting on) connection errors in a timely fashion be related to what I observed in my comment #5832 (comment)?
…once, by replacing the current subscription object with a new one after delivering the notifications. A second call could mess up the newer connection's state. (buildbot#4678)
We just had to do some network maintenance that triggered the online-but-offline scenario, and the log message for a duplicate call was triggered for three workers, so I think the double-call scenario is proven. The next step will be to hotpatch the conn object.
We have had a few instances of workers losing the network connection and then failing to attach properly when they try to reconnect.
As far as I can tell, the connection is re-established and the master accepts it, but then some kind of exception occurs while it is handling the registration of the worker.
Afterwards, the connection seems to be OK from the worker's side (the master does not disconnect, and apparently responds to keep-alives), but on the master's side the connection is not registered as active and is not associated with any builders.
Fixing this situation requires either a manual stop/restart of the worker instance on each worker, or a stop/restart of the master instance.
Environment: Buildbot 2.0.1, TLS connection between worker and master.
The exception reported is as follows; there is no information about what the assert is about.