Implement local healthcheck endpoint #706

Open
Zoullx opened this issue Mar 8, 2024 · 18 comments
Labels
enhancement (New feature or request), logging (Related to what the tool outputs to the end user), next: design brainstorming (The next step is to reflect upon the information and come up with a good design)

Comments

@Zoullx

Zoullx commented Mar 8, 2024

This would be to implement a local healthcheck endpoint that the Docker service could use to check the "healthiness" of the container.

This is entirely hypothetical, but since the image is built from scratch and only runs a single executable (so no bash or sh in the container), maybe something like this:

healthcheck:
  test: ["CMD", "ddns", "healthcheck"]

Basically, the idea is just to pass a parameter to the ddns executable so it can check whether the service is running inside the container; something along those lines.
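
For illustration, here's a fuller sketch of how that could look in a compose file. This is still entirely hypothetical: the ddns healthcheck subcommand doesn't exist, and the CMD form is used because a scratch-based image ships no shell.

services:
  cloudflare-ddns:
    image: favonia/cloudflare-ddns:latest
    restart: always
    healthcheck:
      # hypothetical subcommand; CMD (not CMD-SHELL) because there is no /bin/sh in the image
      test: ["CMD", "ddns", "healthcheck"]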

@favonia
Owner

favonia commented Mar 8, 2024

@Zoullx Thanks for the suggestion. Would it be easier to check whether the Docker container is still running? For example, something like

docker container inspect -f '{{.State.Running}}' cloudflare-ddns

@favonia favonia added the enhancement (New feature or request) and next: learn technical info (The next step is to learn more technical details) labels Mar 8, 2024
@Zoullx
Author

Zoullx commented Mar 8, 2024

So the issue with that is that the healthcheck command for a Docker service is run in the context of the container. The container would need access to the Docker CLI and the Docker socket of the host it's running on for that command to work. That's why I said "local healthcheck endpoint", to differentiate it from something like the Healthchecks.io implementation.

@favonia
Owner

favonia commented Mar 9, 2024

@Zoullx Ah, I wasn't aware that Docker has a dedicated healthcheck protocol, so I didn't see the benefit of your suggestion. To clarify, what I suggested should definitely be run outside the Docker container.

What confuses me is the definition of "healthiness": do failures of IP detection or IP updating (external errors) count as "unhealthy" in your usage? And what are you planning to do when you discover the Docker container is "unhealthy"?

@Zoullx
Author

Zoullx commented Mar 9, 2024

So "healthyness" for a container all depends on the service. Like if there's an error detecting the IP or in setting a changed IP, if the service is still running and will try again at a later time then technically that doesn't mean that the container is "unhealthy". An "unhealthy" container essentially means that the container needs to be restarted and when the Docker Service detects that a container is "unhealthy" it will stop the old container and create a new one automatically in an attempt to end up with a "healthy" container. Essentially it orchestrates the lifetime of the container for you to ensure that the service is up and running to the best of its ability and the healthchecks built in to Docker help towards this end.

So the tl;dr is that "healthy" means the service the container provides is running; it doesn't matter whether it has had some errors, just that it's running. "Unhealthy" means the service in the container is stopped or unable to continue completing its task for whatever reason.

@favonia
Owner

favonia commented Mar 9, 2024

@Zoullx I see. Then, I'm confident that this tool will never be "unhealthy". At least I have been trying extremely hard to make sure that is the case; any deviation will be considered a bug. Given this, I am closing this issue. 😅 Please report any bug that might have made you believe otherwise. Still, thank you for the suggestion! ❤️ I did not know that Docker has its own healthcheck protocol.

@favonia favonia closed this as completed Mar 9, 2024
@favonia favonia added the logging (Related to what the tool outputs to the end user) label and removed the next: learn technical info (The next step is to learn more technical details) label Mar 9, 2024
@favonia
Owner

favonia commented Mar 9, 2024

@Zoullx Actually, let me reopen this now. Is it still helpful if it always reports "healthy"?

@favonia favonia reopened this Mar 9, 2024
@Zoullx
Author

Zoullx commented Mar 9, 2024

Actually, it is still helpful, because timeouts can be set so that if the service doesn't respond within the allotted time, the container is considered "unhealthy". I'm thinking of something like an environment with resource constraints that, for whatever reason, make the service run slower than normal so it can't respond within a given time; the timeouts can then be adjusted through Docker to specify when the container should be considered unhealthy.
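
As a sketch of what I mean, and assuming the hypothetical ddns healthcheck subcommand existed, these are the timing knobs Docker already exposes:

healthcheck:
  test: ["CMD", "ddns", "healthcheck"]  # hypothetical subcommand
  interval: 1m       # how often Docker runs the check
  timeout: 10s       # a check that takes longer than this counts as a failure
  retries: 3         # this many consecutive failures mark the container unhealthy
  start_period: 30s  # failures during startup don't count toward the retries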

@favonia
Owner

favonia commented Mar 9, 2024

@Zoullx Thanks. I think I need more details about how the feature you suggested applies to this tool. My understanding is that the DDNS client is not a service (until we artificially create one just for health checking), because the way it works is to actively detect IP address changes and then talk to the Cloudflare server if necessary, not to passively wait for someone else to tell it that the IP address has changed. Could you possibly help me understand a concrete scenario where you would like Docker to show that it's unhealthy? In particular, did you ever encounter a case where the tool was somehow not working and you wanted to manually restart it? (If so, I really want to fix that!)

@favonia favonia added the next: design brainstorming (The next step is to reflect upon the information and come up with a good design) label Mar 9, 2024
@favonia
Owner

favonia commented Mar 17, 2024

@Zoullx Just to clarify: I still want to continue the conversation to pin down the design. I am not against it; I am just confused about how it should be done to be useful.

@Zoullx
Author

Zoullx commented Mar 21, 2024

I think I have come across a perfect case for when the service could identify itself as unhealthy so that the Docker service could restart it. I've run into a couple of different situations. In one, something happened with the CF_API_TOKEN environment variable that kept the process from authorizing with Cloudflare, and I had to manually restart the container to get the token passed into the container via the environment variable properly. In another, something must have happened with the network, and the process couldn't probe 1.1.1.1 or 1.0.0.1, causing it to essentially fail anyway, and I had to manually restart the container to get things going again. With that said, though, the majority of the time I see the process recover. So I would say: if the process fails to do one of these things a certain number of times, make it report as unhealthy.

@favonia
Owner

favonia commented Mar 23, 2024

@Zoullx Thank you for your comment. If I understand you correctly, you were talking about these scenarios:

  1. CF_API_TOKEN is invalid: the container should end immediately, and Docker should also give up quickly due to its restart policy. In this case, the container will be stopped completely, which I believe is better than keeping it running but marking it unhealthy. I believe the "unhealthy" state is a workaround to detect failing containers that don't stop, and, if possible, it is always better to just stop when something's wrong. (Or is there a reason to keep the tool running even when it knows the updating will definitely fail?)
  2. Failing to probe 1.1.1.1 and 1.0.0.1, and then the updating failed because of that: could you confirm that your network needs 1.0.0.1, and that due to the failure to detect 1.0.0.1, the tool used 1.1.1.1 and failed? If so, that's a good catch... I guess I can redesign the detection so that it will resume the testing.
  3. In other cases, where the process fails to do something a certain number of times, I do not immediately see how restarting will help. Unless you also change the configuration, the tool is designed to do exactly the same thing (modulo caching) whether you restart it or not. That is, if something would fail, it will still fail even after restarting.

In general, the design philosophy is that restarting should never help, and it's a bug if it does.

@Zoullx
Author

Zoullx commented Mar 23, 2024

So, I think the biggest thing you might be getting confused about is that when the container is reported as unhealthy to the Docker service, the Docker service will automatically "recycle" that container: it will stop the unhealthy container and create a new one. This would cause the environment variables to be passed into the container again, which should fix issues with the environment variables not resolving properly.

As for the failing probes, I might have to see if the service does it again and capture the log output, so I can show you what I'm talking about because I'm not sure I'll be able to explain it well enough.

@favonia
Owner

favonia commented Mar 23, 2024

> So, I think the biggest thing you might be getting confused about is that when the container is reported as unhealthy to the Docker service, the Docker service will automatically "recycle" that container: it will stop the unhealthy container and create a new one. This would cause the environment variables to be passed into the container again, which should fix issues with the environment variables not resolving properly.

Hmm, I'm indeed a bit confused by the line you are drawing between "unhealthy" and "stopping". The recommended setting in the README has restart: always, which means that Docker should "recycle" the container as you described. Do you have --restart always or restart: always in your Docker arguments or Docker Compose configuration?

@favonia
Owner

favonia commented Mar 23, 2024

So, what happened in the case of invalid CF_API_TOKEN is the following:

  1. The tool attempts to authorize it with Cloudflare and it fails.
  2. The tool exits, and due to restart: always, Docker considers restarting it.
  3. Docker might choose not to restart it if the tool exits too quickly.

The other possible configuration is on-failure:N, which will stop the restarting after N failures.

One thing that deeply confused me is why you could restart the container to fix an invalid CF_API_TOKEN. You would have to change its value to make it work. Did I miss something? Are you claiming that you can fix it without changing the token?
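
For reference, a minimal sketch of the two restart policies in a compose file (the retry cap in the comment is just an example value):

services:
  cloudflare-ddns:
    restart: always        # recommended in the README: restart whenever the container exits
    # restart: on-failure  # alternative: restart only after a nonzero exit; with docker run,
                           # a retry cap can be added, e.g. --restart on-failure:3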

@favonia
Owner

favonia commented Mar 23, 2024

> As for the failing probes, I might have to see if the service does it again and capture the log output, so I can show you what I'm talking about because I'm not sure I'll be able to explain it well enough.

What will immediately help is to check whether you are using 1.1.1.1 or 1.0.0.1 now. If you are using 1.0.0.1 now, I can see how things could have gone wrong.

@favonia
Owner

favonia commented Mar 23, 2024

@Zoullx I'd like to zoom out a little bit. Is it okay to phrase your main feature request as "restarting the container when restarting might help"? If so, such situations in my opinion are bugs and should be fixed. Ideally, the tool should recover from any "non-fatal" errors; the moment it knows it's "unhealthy", it should just do what restarting could have achieved. Moreover, reporting the specific "unhealthy" state is not easier than just restarting itself, programming-wise.

I also checked the current Docker implementation, and I believe until moby/moby#28400 is implemented, unhealthy containers will not trigger the restarting. So, even if there was a scenario where reporting the "unhealthy" status is superior, the current Docker service would not restart the container for you.

@Zoullx
Author

Zoullx commented Mar 23, 2024

> Hmm, I'm indeed a bit confused by the line you are drawing between "unhealthy" and "stopping". The recommended setting in the README has restart: always, which means that Docker should "recycle" the container as you described. Do you have --restart always or restart: always in your Docker arguments or Docker Compose configuration?

In terms of this, restart: always only kicks in if the container errors out when trying to start, or stops for some unknown reason after it has started running. If the container is running but the process/service/whatever you want to call it inside the container is just erroring out, then without a healthcheck the container will just keep running in that state, because there's no way for the Docker service in charge of that container to know that anything is happening. In that case, the healthcheck is the magic sauce that enables that ability.

> So, what happened in the case of invalid CF_API_TOKEN is the following:
>
>   1. The tool attempts to authorize it with Cloudflare and it fails.
>   2. The tool exits, and due to restart: always, Docker considers restarting it.
>   3. Docker might choose not to restart it if the tool exits too quickly.
>
> The other possible configuration is on-failure:N, which will stop the restarting after N failures.
>
> One thing that deeply confused me is why you could restart the container to fix an invalid CF_API_TOKEN. You would have to change its value to make it work. Did I miss something? Are you claiming that you can fix it without changing the token?

For this, my assumption is that something happened to the value in that environment variable, probably because of environmental factors (a restart of the node, a power outage, what have you), but something external to Docker. Restarting the container caused it to re-evaluate the environment variable. Also, I do have a restart policy set in my docker-compose files; it doesn't help in these cases because there's no way for the Docker service to know that anything is going on inside the container.

> What will immediately help is to check whether you are using 1.1.1.1 or 1.0.0.1 now. If you are using 1.0.0.1 now, I can see how things could have gone wrong.

Yeah, I mean, since I've restarted the container everything has been working fine:

cloudflare-ddns_cloudflare-ddns.1.mq7x6sqqjj32@sdr630-58f78e80    | 🌟 Cloudflare DDNS
cloudflare-ddns_cloudflare-ddns.1.mq7x6sqqjj32@sdr630-58f78e80    | 🥷 Dropping privileges . . .
cloudflare-ddns_cloudflare-ddns.1.mq7x6sqqjj32@sdr630-58f78e80    |    🔸 Use default PUID=1000
cloudflare-ddns_cloudflare-ddns.1.mq7x6sqqjj32@sdr630-58f78e80    |    🔸 Use default PGID=1000
cloudflare-ddns_cloudflare-ddns.1.mq7x6sqqjj32@sdr630-58f78e80    | 📖 Reading settings . . .
cloudflare-ddns_cloudflare-ddns.1.mq7x6sqqjj32@sdr630-58f78e80    |    🔸 Use default IP4_PROVIDER=cloudflare.trace
cloudflare-ddns_cloudflare-ddns.1.mq7x6sqqjj32@sdr630-58f78e80    |    🔸 Use default UPDATE_CRON=@every 5m
cloudflare-ddns_cloudflare-ddns.1.mq7x6sqqjj32@sdr630-58f78e80    |    🔸 Use default UPDATE_ON_START=true
cloudflare-ddns_cloudflare-ddns.1.mq7x6sqqjj32@sdr630-58f78e80    |    🔸 Use default CACHE_EXPIRATION=6h0m0s
cloudflare-ddns_cloudflare-ddns.1.mq7x6sqqjj32@sdr630-58f78e80    |    🔸 Use default TTL=1
cloudflare-ddns_cloudflare-ddns.1.mq7x6sqqjj32@sdr630-58f78e80    |    🔸 Use default DETECTION_TIMEOUT=5s
cloudflare-ddns_cloudflare-ddns.1.mq7x6sqqjj32@sdr630-58f78e80    |    🔸 Use default UPDATE_TIMEOUT=30s
cloudflare-ddns_cloudflare-ddns.1.mq7x6sqqjj32@sdr630-58f78e80    | 📖 Checking settings . . .
cloudflare-ddns_cloudflare-ddns.1.mq7x6sqqjj32@sdr630-58f78e80    | 📖 Probing 1.1.1.1 and 1.0.0.1 . . .
cloudflare-ddns_cloudflare-ddns.1.mq7x6sqqjj32@sdr630-58f78e80    |    😊 1.1.1.1 is working. Great!
cloudflare-ddns_cloudflare-ddns.1.mq7x6sqqjj32@sdr630-58f78e80    | 📖 Current settings:
cloudflare-ddns_cloudflare-ddns.1.mq7x6sqqjj32@sdr630-58f78e80    |    🔧 Domains and IP providers:
cloudflare-ddns_cloudflare-ddns.1.mq7x6sqqjj32@sdr630-58f78e80    |       🔸 IPv4 domains:            example.com
cloudflare-ddns_cloudflare-ddns.1.mq7x6sqqjj32@sdr630-58f78e80    |       🔸 IPv4 provider:           cloudflare.trace
cloudflare-ddns_cloudflare-ddns.1.mq7x6sqqjj32@sdr630-58f78e80    |    🔧 Scheduling:
cloudflare-ddns_cloudflare-ddns.1.mq7x6sqqjj32@sdr630-58f78e80    |       🔸 Timezone:                UTC (UTC+00 now)
cloudflare-ddns_cloudflare-ddns.1.mq7x6sqqjj32@sdr630-58f78e80    |       🔸 Update frequency:        @every 5m
cloudflare-ddns_cloudflare-ddns.1.mq7x6sqqjj32@sdr630-58f78e80    |       🔸 Update on start?         true
cloudflare-ddns_cloudflare-ddns.1.mq7x6sqqjj32@sdr630-58f78e80    |       🔸 Delete on stop?          false
cloudflare-ddns_cloudflare-ddns.1.mq7x6sqqjj32@sdr630-58f78e80    |       🔸 Cache expiration:        6h0m0s
cloudflare-ddns_cloudflare-ddns.1.mq7x6sqqjj32@sdr630-58f78e80    |    🔧 New DNS records:
cloudflare-ddns_cloudflare-ddns.1.mq7x6sqqjj32@sdr630-58f78e80    |       🔸 TTL:                     1 (auto)
cloudflare-ddns_cloudflare-ddns.1.mq7x6sqqjj32@sdr630-58f78e80    |       🔸 Proxied domains:         example.com
cloudflare-ddns_cloudflare-ddns.1.mq7x6sqqjj32@sdr630-58f78e80    |       🔸 Unproxied domains:       (none)
cloudflare-ddns_cloudflare-ddns.1.mq7x6sqqjj32@sdr630-58f78e80    |    🔧 Timeouts:
cloudflare-ddns_cloudflare-ddns.1.mq7x6sqqjj32@sdr630-58f78e80    |       🔸 IP detection:            5s
cloudflare-ddns_cloudflare-ddns.1.mq7x6sqqjj32@sdr630-58f78e80    |       🔸 Record updating:         30s

This is the log output, apart from the 5-minute checks that detect the IP, see that it's already set, and say that it will check again in 5 minutes. Also, just to clarify, I did change my domain from what it actually is to example.com here.

> @Zoullx I'd like to zoom out a little bit. Is it okay to phrase your main feature request as "restarting the container when restarting might help"? If so, such situations in my opinion are bugs and should be fixed. Ideally, the tool should recover from any "non-fatal" errors; the moment it knows it's "unhealthy", it should just do what restarting could have achieved. Moreover, reporting the specific "unhealthy" state is not easier than just restarting itself, programming-wise.

> I also checked the current Docker implementation, and I believe until moby/moby#28400 is implemented, unhealthy containers will not trigger the restarting. So, even if there was a scenario where reporting the "unhealthy" status is superior, the current Docker service would not restart the container for you.

Here, we basically use any status that is not considered healthy to fail the container and cause it to restart, as stated in this comment: moby/moby#28400 (comment). So yes, you are correct that the unhealthy status alone will not cause the container to restart, because there's no specific policy for that; that's what that issue is about. But it doesn't change the fact that we can do something like || false or || exit 1 to make the Docker service believe that the container actually failed, which will in fact cause the container to be restarted, well, technically recycled.
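
To sketch the pattern that comment describes (the URL below is just a placeholder, and the ddns healthcheck subcommand is still hypothetical): images that ship a shell usually force the failure with CMD-SHELL and || exit 1, while a scratch image has no shell, so the subcommand itself would have to exit nonzero.

healthcheck:
  # common pattern in shell-capable images (placeholder URL):
  # test: ["CMD-SHELL", "wget -q --spider http://localhost:8080/healthy || exit 1"]
  # a scratch image has no shell, so the (hypothetical) subcommand's exit code does the work:
  test: ["CMD", "ddns", "healthcheck"]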

@favonia
Owner

favonia commented Mar 24, 2024

@Zoullx Thanks. I feel we might be talking past each other, as you seem to have repeated what you have said, and so have I. Please allow me to summarize and organize what we both have said, along with my comments, to check which part might have given you the feeling that I failed to understand your point.

Usefulness of reporting "unhealthy"

Your point is that if the tool is running but "unhealthy" under some definition, Docker will not know it and thus cannot stop it (or restart it). I understand this point. My question was about when this would happen, and how the current design is insufficient. To clarify, the comparison was between the following designs:

  1. Current design: exit or self-recover when an error occurs, and let Docker handle restarting if it exits.
  2. Proposed design: signal that the container is unhealthy, and let Docker handle restarting.

In my opinion, the proposed design has the drawback of using additional resources for both (1) inter-process communication for another process to report "unhealthy" and (2) keeping the container running. Moreover, it doesn't seem to be more useful than the current design.

Difficulty of reporting "unhealthy"

Implementing a way to "detect" unhealthiness is actually tricky, and the code I can think of to signal unhealthiness will likely be more complicated than retrying things (self-recovery) or just exiting. Inter-process communication, either via system calls or the file system, is not easy in a minimal Docker image (like the released ones). This is why I'm leaning towards fixing the deviations (if any) that are not aligned with the current design. I'm happy to elaborate on what's needed to implement the health checking, but so far this point has not been challenged.

Can restarting trigger re-evaluation of environment variables?

> For this, my assumption is that something happened to the value in that environment variable, probably because of environmental factors

This really should not happen due to how Docker works, unless you went through a lot of trouble to configure your Docker that way. One main point of using Docker is to guarantee that the environment (including the values of environment variables) is very predictable. In particular, the value of CF_API_TOKEN will not change! The value will always be the same. Thus, I have trouble seeing the usefulness of manually restarting the container in this case.

My working theory is that there were some errors and the process could actually recover from them. You might have chosen to restart the container, which is of course fine; however, restarting it should not be needed. It is also possible that there's some hidden bug that I'm not aware of, and a detailed log would be helpful if you could recreate it.

The only case I can think of now is that the network was down when the process attempted to verify the token, and because the process exited too quickly (within 10 seconds), Docker refused to restart it. This is filed as #722, but again, this doesn't seem to apply in your case; you seem to be claiming that the value of some environment variable might have changed, but that should never happen with Docker.

Can failure to probe 1.1.1.1 or 1.0.0.1 be fixed by restarting?

> the process couldn't probe 1.1.1.1 or 1.0.0.1, causing it to essentially fail

Your most recent logging shows that 1.1.1.1 is working for you, so maybe in one of your runs the probing failed due to the networking being temporarily down. In the case of such a failure, the process will stick to 1.1.1.1. Therefore, as in the previous case, my working theory is that you restarted the container when it could just self-recover.

The only case I could think of was that maybe the probing fails and the process sticks to 1.1.1.1, and yet your network needs the 1.0.0.1 workaround to work. This is filed as #721, but it does not apply in your case: your network does not require the 1.0.0.1 workaround (which is good).

General points about the reporting

I coded the process in a way that it should self-recover from any temporary error, including power outages, the network being down, Cloudflare servers being down, DNS zones being reorganized, etc. It is designed to be resilient enough that none of these errors can stop it; it will just try again next time (maybe after 5 minutes). I take this resilience very seriously and treat any deviations as bugs. I am very interested in checking the two cases where you claimed manual restarting was needed, which to me means the process was buggy, but the analysis you provided does not seem to match how Docker works or what the current code does. If you still have the detailed logging, it will be tremendously useful. Thanks.
