Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

make timeout for check-db configurable, and ensure logs are written when timeout happens #615

Open
4 tasks done
asosnovsky-sumologic opened this issue Jun 21, 2022 · 8 comments
Labels
kind/enhancement kind - new features or changes

Comments

@asosnovsky-sumologic
Copy link

asosnovsky-sumologic commented Jun 21, 2022

Checks

Motivation

In our company we are currently setting up a cluster for airflow, and while debugging some networking issues between our cluster and rds we seem to constantly hit the timeout settings in here https://github.com/airflow-helm/charts/blob/main/charts/airflow/templates/_helpers/pods.tpl#L72-L75 , which once hit hides all of the important logs that we would have received if the process failed on it's on.

The way we handle it right now is by manually editing the container's command and setting the timeout to something higher.

Implementation

Here are some different ideas:

Are you willing & able to help?

  • I am able to submit a PR!
  • I can help test the feature!
@asosnovsky-sumologic asosnovsky-sumologic added the kind/enhancement kind - new features or changes label Jun 21, 2022
@asosnovsky-sumologic asosnovsky-sumologic changed the title Increase or parametrize check-db timeout Remove or parametrize check-db timeout Jun 21, 2022
@thesuperzapper thesuperzapper added this to Triage | Needs PR in Issue Triage and PR Tracking Jun 22, 2022
@asosnovsky-sumologic
Copy link
Author

@thesuperzapper I can make a PR for this, but which approach would you prefer I take? (if you don't care I can just remove the timeout, which is my preference)

@anuragphadnis
Copy link

anuragphadnis commented Aug 18, 2022

We can pick the timeout value from values.yaml by using something like .Values.checkDb.timeout and add if else condition to use default timeout value?
I am also facing similar issue, can you tell me how do you manually edit the container's command?

@thesuperzapper thesuperzapper added this to the airflow-8.7.0 milestone Aug 18, 2022
@thesuperzapper
Copy link
Member

@asosnovsky-sumologic @anuragphadnis

Thanks for raising this, I am not sure why I did not introduce a value to change the timeout length when I first introduced the timeout.

However, I do remember that there are situations when airflow checkdb and airflow db check will hang forever without actually failing (like in your described situation of network issues, #153), so you probably would not see network logs if you remove the timeout anyway.

While you are correct that the airflow Dockerfile has an in-built check-db, this is not suitable for the chart for 2 primary reasons:

  1. We want a separate init-container to perform the check-db (so that the main container does not even try and start if the db is down)
  2. Older versions of the Dockerfile (like those for 1.10) do not have the check-db feature

My thought is that we can do two things:

  1. Introduce a value like airflow.dbCheckTimeout (probably setting this to 0 disables the timeout).
  2. Check if we get more logs when we lower the main signal to SIGINT, and introduce a KILL signal at a slightly later timeout to ensure the process ends:
    • The command might look something like: timeout --signal=INT --kill-after=5s 60s airflow checkdb
    • NOTE: I am not sure if SIGINT will actually terminate airflow db check, or if it will even give better logging (in network failure situations), we need to verify this, otherwise this change is not neeeded

Other thoughts:

  • 60s is an incredibly long time for a timeout that literally just connects to a DB, if a connection takes more time than that, its very likely your DB is not functional
  • 60s is the timeout for a single connection attempt, Kubernetes will retry the container (with exponential backoff) so its not like its 60s and then a permanent failure

@anuragphadnis
Copy link

In my case the check-db pod gets restarted after every 60 seconds and connection is not being able to establish in this time. This happens every time after check-db is restarted, and check-db does not log anything. It would be very useful if we can customize the value so that we can see logs or establish the connection.

@asosnovsky
Copy link

@anuragphadnis

I agree with this line of thinking.
In my case I had a networking issue that restricted the db from being reachable by my cluster. The issue was hard to capture because when you do the math with CONNECTION_CHECK_SLEEP_TIME and CONNECTION_CHECK_MAX_COUNT no errors actually got thrown as it just silently kept on retrying. I actually think that the timeout should possibly be set as a dynamic value based on these two environment params?

Like do something along the lines of

{{- if .Values.airflow.legacyCommands }}
- "exec timeout {{mul .Values.config.CONNECTION_CHECK_SLEEP_TIME .Values.config.CONNECTION_CHECK_MAX_COUNT}}s airflow checkdb"
{{- else }}
- "exec timeout {{mul .Values.config.CONNECTION_CHECK_SLEEP_TIME .Values.config.CONNECTION_CHECK_MAX_COUNT}}s airflow db check"
{{- end }}

Or even provide these values under

# value.yaml
airflow:
 db_connection:
  check_max_count: 10
  check_sleep_time: 30

@anuragphadnis
Copy link

anuragphadnis commented Sep 13, 2022

Yes @asosnovsky

Right now I have modified it in a similar way to use value from config and using that code to deploy the chart. I can create a PR if someone else is not working on it.

@asosnovsky
Copy link

@anuragphadnis i don't mind taking this to completion. If your change is already merged, I can do the rest of the work and update the checkdb command :)

@thesuperzapper
Copy link
Member

thesuperzapper commented Sep 14, 2022

@asosnovsky @anuragphadnis @asosnovsky-sumologic

If anyone wants to pick up the tasks I suggested in #615 (comment), I am happy to review/merge the PR, the tasks are:

  1. Introduce a value like airflow.dbCheckTimeout (probably setting this to 0 disables the timeout).
  2. Find a way to actually get logs when a timeout is hit (so people know what is happening):
    • We might be able to do this by lowering the main signal to SIGINT, and introducing a KILL signal at a slightly later timeout to ensure the process ends:
      • The command might look something like: timeout --signal=INT --kill-after=5s 60s airflow checkdb
      • NOTE: I am not sure if SIGINT will actually terminate airflow db check, we need to verify this
    • Alternatively, we can add our own bash trap wrapper that will log a "connection timed out message" if a SIGTERM is passed

@thesuperzapper thesuperzapper changed the title Remove or parametrize check-db timeout make timeout for check-db configurable, and ensure logs are written when timeout happens Sep 14, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/enhancement kind - new features or changes
Projects
Development

No branches or pull requests

4 participants