make timeout for check-db configurable, and ensure logs are written when timeout happens #615

asosnovsky-sumologic · 2022-06-21T15:20:00Z

Checks

I have checked for existing issues.
This report is about the User-Community Airflow Helm Chart.

Motivation

In our company we are currently setting up a cluster for Airflow. While debugging some networking issues between our cluster and RDS, we constantly hit the timeout set here: https://github.com/airflow-helm/charts/blob/main/charts/airflow/templates/_helpers/pods.tpl#L72-L75 . Once that timeout is hit, it hides all of the important logs that we would have received if the process had failed on its own.

The way we handle it right now is by manually editing the container's command and setting the timeout to something higher.

Implementation

Here are some different ideas:

Include a variable under airflow.check-db that would allow us to set a custom timeout
Remove the timeout completely, since this should be handled internally by airflow with the CONNECTION_CHECK_SLEEP_TIME and CONNECTION_CHECK_MAX_COUNT variables (see https://airflow.apache.org/docs/docker-stack/entrypoint.html#waits-for-airflow-db-connection ) -- I prefer this solution

Are you willing & able to help?

I am able to submit a PR!
I can help test the feature!


asosnovsky-sumologic · 2022-06-23T17:48:43Z

@thesuperzapper I can make a PR for this, but which approach would you prefer I take? (if you don't care I can just remove the timeout, which is my preference)

anuragphadnis · 2022-08-18T12:55:09Z

We could pick the timeout value from values.yaml using something like .Values.checkDb.timeout, with an if/else condition that falls back to a default timeout value?
I am also facing a similar issue. Can you tell me how you manually edit the container's command?

thesuperzapper · 2022-08-18T22:47:25Z

@asosnovsky-sumologic @anuragphadnis

Thanks for raising this, I am not sure why I did not introduce a value to change the timeout length when I first introduced the timeout.

However, I do remember that there are situations when airflow checkdb and airflow db check will hang forever without actually failing (like in your described situation of network issues, #153), so you probably would not see network logs if you remove the timeout anyway.

While you are correct that the airflow Dockerfile has an in-built check-db, this is not suitable for the chart for 2 primary reasons:

We want a separate init-container to perform the check-db (so that the main container does not even try and start if the db is down)
Older versions of the Dockerfile (like those for 1.10) do not have the check-db feature

My thought is that we can do two things:

Introduce a value like airflow.dbCheckTimeout (probably setting this to 0 disables the timeout).
Check if we get more logs when we lower the main signal to SIGINT, and introduce a KILL signal at a slightly later timeout to ensure the process ends:
- The command might look something like: timeout --signal=INT --kill-after=5s 60s airflow checkdb
- NOTE: I am not sure if SIGINT will actually terminate airflow db check, or if it will even give better logging (in network failure situations). We need to verify this; otherwise this change is not needed
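
To make the two-stage behaviour concrete, here is a minimal shell sketch of the worst case from the note above, where the check ignores SIGINT and has to be stopped by the later KILL. It assumes GNU coreutils `timeout`; `sleep 30` is only a stand-in for a hung `airflow db check`, and the 1s durations are shortened for illustration.

```shell
#!/usr/bin/env bash
# Stand-in for a check that does NOT stop on SIGINT (the unverified worst case).
# `sleep 30` plays the role of a hung `airflow db check`.
status=0
timeout --signal=INT --kill-after=1s 1s \
  bash -c 'trap "" INT; sleep 30' || status=$?

# GNU timeout documents 124 for a plain timeout and 137 (128+9)
# when the command had to be stopped with SIGKILL.
echo "exit status: $status"
```

If the real command does exit on SIGINT, GNU timeout would report 124 instead, so the exit status alone shows which path was taken. GNU timeout also treats a duration of 0 as "no timeout", which would fit the proposed convention that a value of 0 disables the timeout.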

Other thoughts:

60s is an incredibly long time for a timeout that literally just connects to a DB. If a connection takes more time than that, it's very likely your DB is not functional
60s is the timeout for a single connection attempt. Kubernetes will retry the container (with exponential backoff), so it's not like it's 60s and then a permanent failure

anuragphadnis · 2022-08-19T05:29:39Z

In my case the check-db pod gets restarted every 60 seconds, and the connection cannot be established in that time. This happens every time check-db is restarted, and check-db does not log anything. It would be very useful if we could customize the value so that we can see logs or establish the connection.

asosnovsky · 2022-09-12T17:36:30Z

@anuragphadnis

I agree with this line of thinking.
In my case I had a networking issue that prevented the db from being reachable by my cluster. The issue was hard to capture because, when you do the math with CONNECTION_CHECK_SLEEP_TIME and CONNECTION_CHECK_MAX_COUNT, no errors actually got thrown; it just silently kept on retrying. I actually think that the timeout should possibly be set as a dynamic value based on these two environment params?

Like do something along the lines of

{{- if .Values.airflow.legacyCommands }}
- "exec timeout {{mul .Values.config.CONNECTION_CHECK_SLEEP_TIME .Values.config.CONNECTION_CHECK_MAX_COUNT}}s airflow checkdb"
{{- else }}
- "exec timeout {{mul .Values.config.CONNECTION_CHECK_SLEEP_TIME .Values.config.CONNECTION_CHECK_MAX_COUNT}}s airflow db check"
{{- end }}

Or even provide these values under

# values.yaml
airflow:
  db_connection:
    check_max_count: 10
    check_sleep_time: 30
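
As a quick sanity check of the arithmetic behind this proposal, the sketch below multiplies the two example values from the snippet above and prints the resulting command. The shell variable names are illustrative only and are not existing chart values.

```shell
#!/usr/bin/env bash
# Example values copied from the proposed values.yaml snippet (hypothetical keys).
check_max_count=10
check_sleep_time=30

# Total time budget: number of attempts multiplied by the sleep between attempts.
timeout_seconds=$(( check_max_count * check_sleep_time ))

cmd="exec timeout ${timeout_seconds}s airflow db check"
echo "$cmd"
```

With these numbers the init container would wait up to 300s per attempt, five times the current 60s.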

anuragphadnis · 2022-09-13T11:45:52Z

Yes @asosnovsky

Right now I have modified it in a similar way to read the value from config, and I am using that code to deploy the chart. I can create a PR if no one else is working on it.

asosnovsky · 2022-09-13T14:46:11Z

@anuragphadnis I don't mind taking this to completion. If your change is already merged, I can do the rest of the work and update the checkdb command :)

thesuperzapper · 2022-09-14T01:43:09Z

@asosnovsky @anuragphadnis @asosnovsky-sumologic

If anyone wants to pick up the tasks I suggested in #615 (comment), I am happy to review/merge the PR, the tasks are:

Introduce a value like airflow.dbCheckTimeout (probably setting this to 0 disables the timeout).
Find a way to actually get logs when a timeout is hit (so people know what is happening):
- We might be able to do this by lowering the main signal to SIGINT, and introducing a KILL signal at a slightly later timeout to ensure the process ends:
  - The command might look something like: timeout --signal=INT --kill-after=5s 60s airflow checkdb
  - NOTE: I am not sure if SIGINT will actually terminate airflow db check, we need to verify this
- Alternatively, we can add our own bash trap wrapper that will log a "connection timed out" message if a SIGTERM is passed
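
A minimal sketch of what the bash trap wrapper might look like, assuming GNU coreutils `timeout` (which sends SIGTERM by default) and bash. The wrapper script, the log message, and the `sleep 30` stand-in for a hung `airflow db check` are all hypothetical and not part of the chart.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper script; `sleep 30` stands in for a hung `airflow db check`.
wrapper=$(mktemp)
cat > "$wrapper" <<'EOF'
# Log a message and stop the check when SIGTERM arrives.
trap 'echo "check-db: connection timed out" >&2; kill "$child" 2>/dev/null; exit 1' TERM
sleep 30 &
child=$!
# `wait` returns as soon as a trapped signal arrives, so the trap runs promptly.
wait "$child"
EOF

status=0
# Plain `timeout` sends SIGTERM when the duration expires; 1s is shortened for illustration.
output=$(timeout 1s bash "$wrapper" 2>&1) || status=$?
rm -f "$wrapper"

echo "status=$status output=$output"
```

The check runs in the background and the shell blocks in `wait` because bash only runs a trap after the current foreground command finishes; waiting on a background job lets the message be written as soon as the signal arrives.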

asosnovsky-sumologic added the kind/enhancement kind - new features or changes label Jun 21, 2022

asosnovsky-sumologic changed the title ~~Increase or parametrize check-db timeout~~ Remove or parametrize check-db timeout Jun 21, 2022

thesuperzapper added this to Triage | Needs PR in Issue Triage and PR Tracking Jun 22, 2022

thesuperzapper added this to the airflow-8.7.0 milestone Aug 18, 2022

thesuperzapper changed the title ~~Remove or parametrize check-db timeout~~ make timeout for check-db configurable, and ensure logs are written when timeout happens Sep 14, 2022
