Implement local healthcheck endpoint #706
Comments
@Zoullx Thanks for the suggestion. Would it be easier to check whether the Docker container is still running? For example, something like `docker container inspect -f '{{.State.Running}}' cloudflare-ddns`.
So the issue with that is that the healthcheck command for the Docker service is run from the context of the container. The container would need access to the Docker CLI and the Docker socket of the host it was running on for that command to work. That's why I said "local healthcheck endpoint", to differentiate it from something like the healthchecks.io approach.
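For context, Docker's `HEALTHCHECK` instruction runs its command inside the container, which is why the `docker container inspect` approach above would not work from there. A minimal hypothetical sketch (the `/ddns` path and the `--healthcheck` flag are made up for illustration):

```dockerfile
# The CMD below runs *inside* the container, so it cannot rely on the
# host's docker CLI or /var/run/docker.sock being available.
# "/ddns --healthcheck" is a hypothetical flag, not an existing one.
HEALTHCHECK --interval=1m --timeout=10s --retries=3 \
  CMD ["/ddns", "--healthcheck"]
```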
@Zoullx Ah, I wasn't aware that Docker has a dedicated healthcheck protocol and didn't see the benefits of your suggestion. To clarify, what I suggested should definitely be run outside the Docker container. What confuses me is the definition of "healthiness": do failures of IP detection or IP updating (external errors) count as "unhealthy" in your usage? What are you planning to do when you discover the Docker container is "unhealthy"?
So "healthiness" for a container depends entirely on the service. If there's an error detecting the IP or setting a changed IP, but the service is still running and will try again later, then technically the container is not "unhealthy". An "unhealthy" container essentially means the container needs to be restarted; when the Docker service detects that a container is "unhealthy", it will stop the old container and create a new one automatically in an attempt to end up with a "healthy" one. Essentially, it orchestrates the lifetime of the container for you to ensure the service is up and running to the best of its ability, and the healthchecks built into Docker help towards this end. So, tl;dr: healthy means the service that the container provides is running, whether or not it has had some errors. Unhealthy means the service in the container is stopped or unable to continue completing its task for whatever reason.
@Zoullx I see. Then, I'm confident that this tool will never be "unhealthy". At least I have been trying extremely hard to make sure that is the case; any deviation will be considered a bug. Given this, I am closing this issue. 😅 Please report any bug that might have made you believe otherwise. Still, thank you for the suggestion! ❤️ I did not know that Docker has its own healthcheck protocol.
@Zoullx Actually, let me reopen this now. Is it still helpful if it always reports "healthy"?
Actually, it is still helpful, because timeouts can be set so that if the service doesn't respond within an allotted time, it is considered "unhealthy". I'm thinking of something like a resource-constrained environment where, for whatever reason, the service runs slower than normal and can't respond within a given time; those timeouts can be adjusted through Docker to specify when the container should be considered unhealthy.
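The timeout and retry knobs mentioned above map directly onto Docker Compose's healthcheck settings. A hedged sketch (the service name, image, and probe command are placeholders, since the thread never settles on a concrete command):

```yaml
services:
  cloudflare-ddns:
    image: example/cloudflare-ddns     # placeholder image name
    healthcheck:
      test: ["CMD", "/ddns", "--healthcheck"]  # hypothetical local probe
      interval: 1m    # how often Docker runs the check
      timeout: 10s    # no reply within 10s counts as one failed check
      retries: 3      # 3 consecutive failures mark the container "unhealthy"
```

Even a probe that always reports "healthy" when it responds can thus flag a wedged process, because a probe that never responds within `timeout` still fails.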
@Zoullx Thanks. I think I need more details about how the feature you suggested applies to this tool. My understanding is that the DDNS client is not a service (until we artificially create one just for health checking), because it works by actively detecting IP address changes and then talking to the Cloudflare server if necessary, not by passively waiting for someone else to tell it that the IP address has changed. Could you help me understand the concrete scenario where you would like Docker to show that it's unhealthy? In particular, did you ever encounter a case where the tool is somehow not working and you would like to manually restart it? (If so, I really want to fix that!)
@Zoullx Just to clarify: I still want to continue the conversation to pin down the design. I am not against it; I am just confused about how it should be done to be useful.
I think I have come across a perfect case for when the service could identify as unhealthy so that the Docker service could restart it. I've run into a couple of different situations. In one, something happened with the CF_API_TOKEN environment variable that caused the process to be unable to authorize with Cloudflare, and I had to manually restart the container to get the token passed in through the environment variable properly. In another, something must have happened with the network and the process couldn't probe 1.1.1.1 or 1.0.0.1, causing it to essentially fail anyway, and I had to manually restart the container to get things going again. That said, the majority of the time I see the process recover. So I would say: make it so that if the process fails to do one of these things a certain number of times, it reports as unhealthy.
@Zoullx Thank you for your comment. If I understand you correctly, you were talking about these scenarios:
In general, the design philosophy is that restarting should not help in any case, and it's a bug if it does.
So, I think the biggest thing you might be confused about is that when the container is reported as unhealthy to the Docker service, the Docker service will automatically "recycle" that container, meaning it will stop the unhealthy container and create a new one. This would cause the environment variables to be passed into the container again, which should fix issues where the environment variables don't resolve properly. As for the failing probes, I might have to wait and see if the service does it again and capture the log output, so I can show you what I'm talking about; I'm not sure I'll be able to explain it well enough otherwise.
Hmm, I'm indeed a bit confused by the line you are drawing between "unhealthy" and "stopping". The recommended setting in
So, what happened in the case of invalid
The other possible configuration is

One thing that deeply confused me is why you could restart the container to fix an invalid
What will immediately help is to check whether you are using 1.1.1.1 or 1.0.0.1 now. If you are using 1.0.0.1, I can see how things could have gone wrong.
@Zoullx I'd like to zoom out a little bit. Is it okay to phrase your main feature request as "restarting the container when restarting might help"? If so, such situations in my opinion are bugs and should be fixed. Ideally, the tool should recover from any "non-fatal" errors; the moment it knows it's "unhealthy", it should just do what restarting could have achieved. Moreover, reporting the specific "unhealthy" state is not easier than just restarting itself, programming-wise. I also checked the current Docker implementation, and I believe that until moby/moby#28400 is implemented, unhealthy containers will not trigger restarting. So, even if there were a scenario where reporting the "unhealthy" status is superior, the current Docker service would not restart the container for you.
In terms of this: a restart only happens if the container errors out when trying to start, or stops for some unknown reason after it has started running. If the container is running but the process (or service, or whatever you want to call it) inside the container is just erroring out, then without a healthcheck the container will keep running in that state, because the Docker service in charge of that container has no way to know anything is happening inside it. The healthcheck is the magic sauce that enables that ability.
For this, my assumption is that something happened to the value in that environment variable, probably because of environmental factors (a restart of the node, a power outage, what have you), but something external to Docker. Restarting the container caused it to re-evaluate the environment variable. Also, I do have a restart policy set in my docker-compose files; it doesn't help in these cases because there's no way for the Docker service to know what is going on inside the container.
Yeah, I mean, since I restarted the container, everything has been working fine.
This is the log output, other than the 5-minute checks (detecting the IP, seeing that it's already set, and saying it will check again in 5 minutes). Also, just to clarify, I changed my domain from what it actually is to example.com here.
Here, we basically use any status that is not considered healthy to fail the container and cause it to restart, as stated in this comment: moby/moby#28400 (comment). So yes, you are correct that the unhealthy status alone will not cause the container to restart, because there's no specific policy for that; that's what that issue is for. But that doesn't change the fact that we can do something like
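The workaround alluded to here is commonly implemented with a sidecar such as `willfarrell/autoheal`, which watches container health status via the Docker socket and restarts containers that opt in via a label. A hedged Compose sketch (assuming that image's default label convention; whether this matches the exact approach in the linked comment is an assumption):

```yaml
services:
  autoheal:
    image: willfarrell/autoheal   # community sidecar; restarts unhealthy containers
    restart: always
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock  # needed to inspect/restart peers
  cloudflare-ddns:
    # ... existing service definition with a healthcheck ...
    labels:
      autoheal: "true"            # opt this container in to auto-restart
```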
@Zoullx Thanks. I feel we might be talking past each other, as you seem to have repeated what you said earlier, and so did I. Please allow me to summarize and organize what we both have said, along with my comments, to check which part might have given you the feeling that I failed to understand your point.

Usefulness of reporting "unhealthy"

Your point is that if the tool is running but "unhealthy" under some definition, Docker will not know it and thus cannot stop it (or restart it). I understand this point. My questioning was about when this will happen, and how the current design is insufficient. To clarify, the comparison was between the following designs:
In my opinion, the proposed design has the drawback of using additional resources for both (1) inter-process communication for another process to report "unhealthy" and (2) keeping the container running. Moreover, it doesn't seem to be more useful than the current design.

Difficulty of reporting "unhealthy"

Implementing a way to "detect" unhealthiness is actually tricky, and the code I can think of to signal unhealthiness will likely be more complicated than retrying (self-recovery) or simply exiting. Inter-process communication, whether via system calls or the file system, is not easy in a minimal Docker image (like the released ones). This is why I'm leaning towards fixing any deviations that are not aligned with the current design. I'm happy to elaborate on what's needed to implement the health checking, but so far this point has not been challenged.

Can restarting trigger re-evaluation of environment variables?
This really should not happen due to how Docker works, unless you went through a lot of trouble to configure your Docker that way. One main point of using Docker is to guarantee that the environment (including the values of environment variables) will be very predictable. In particular, the value of

My working theory is that there were some errors and the process could actually recover from them. You might have chosen to restart the container, which is of course fine; however, restarting it should not be needed. It is also possible that there's some hidden bug I'm not aware of, and a detailed log would be helpful if you can recreate it. The only case I could think of is that the network was down when the process attempted to verify the token, and because the process exited too quickly (within 10 seconds), Docker refused to restart it. This is filed as #722, but again, it doesn't seem to apply in your case; you seem to be claiming that the value of some environment variable changed, but that should never happen with Docker.

Can failure to probe 1.1.1.1
Your most recent logging shows that 1.1.1.1 is working for you, so maybe in one of your runs the probing failed because the network was temporarily down. In the case of such a failure, the process will stick to 1.1.1.1. Therefore, as in the previous case, my working theory is that you restarted the container when it could have self-recovered. The only case I could think of is that maybe the probing fails and the process sticks to 1.1.1.1, and yet your network needs the 1.0.0.1 workaround. This is filed as #721, but it does not apply in your case: your network does not require the 1.0.0.1 workaround (which is good).

General points about the reporting

I coded the process so that it should self-recover from any temporary error, including power outages, the network being down, Cloudflare servers being down, DNS zones being reorganized, etc. It is designed to be resilient enough that none of these errors can stop it; it will just try again next time (maybe after 5 minutes). I take this resilience very seriously and treat any deviation as a bug. I am very interested in checking the two cases where you said manual restarting was needed, which to me means the process was buggy, but the analysis you provided does not seem to match how Docker works or what the current code is. If you still have the detailed logging, that would be tremendously useful. Thanks.
This would be to implement a local healthcheck endpoint that the Docker service could use to check the "healthiness" of the container.
This is entirely hypothetical, but based on what I've noticed about the use of scratch and just running an executable in the container (so no `bash` or `sh`), maybe something like this:

Basically, just make it so that we could pass a parameter to the ddns executable to have it check whether the service is running in the container; something along those lines.
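The pattern described in this issue (the running process exposes a tiny local endpoint, and the same executable, invoked with a flag, probes that endpoint so Docker's healthcheck can use the exit code) could be sketched roughly as follows. This is an illustrative sketch only, written in Python rather than the tool's actual language, and the port, the `/healthz` path, and the `--healthcheck` flag are all invented for the example:

```python
# Hypothetical sketch of a "local healthcheck endpoint": the long-running
# process serves /healthz on localhost, and the same binary invoked with
# --healthcheck probes it and exits 0 (healthy) or 1 (unhealthy).
import http.server
import sys
import threading
import urllib.request


def serve_health(port: int) -> http.server.HTTPServer:
    """Start a minimal health endpoint on 127.0.0.1 in a background thread."""
    class Handler(http.server.BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path == "/healthz":
                self.send_response(200)
                self.end_headers()
                self.wfile.write(b"ok")
            else:
                self.send_response(404)
                self.end_headers()

        def log_message(self, *args):
            pass  # keep request logging out of the service's own logs

    server = http.server.HTTPServer(("127.0.0.1", port), Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def healthcheck(port: int) -> bool:
    """Probe the local endpoint; False if it is down or unresponsive."""
    try:
        url = f"http://127.0.0.1:{port}/healthz"
        with urllib.request.urlopen(url, timeout=3) as response:
            return response.status == 200
    except OSError:  # connection refused, timeout, etc.
        return False


if __name__ == "__main__":
    if "--healthcheck" in sys.argv:
        # Invoked by Docker's HEALTHCHECK command: probe the main process.
        sys.exit(0 if healthcheck(8080) else 1)
    # The real long-running service would call serve_health(8080) at startup
    # and then enter its main update loop.
```

Docker's healthcheck would then simply run the executable with the flag and interpret the exit code, which keeps the container free of `bash`/`sh` as noted above.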