Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Increasing timeout for wait_for_loggers to 30s #15915

Open
wants to merge 3 commits into
base: trunk
Choose a base branch
from

Conversation

vamossagar12
Copy link
Collaborator

@vamossagar12 vamossagar12 commented May 10, 2024

Noticed that the system test test_dynamic_logging seems to be failing when the entire test suite is run. When I analysed the logs (attached)
436.tgz
, I didn't find anything untoward so I am guessing this could be happening when all tests are run. Moreover, when I run just that one test locally, it passes.

This PR was an attempt to increase the timeout to 30s and post that, all the system tests passed (logs attached)
138.tgz
.

@vamossagar12 vamossagar12 marked this pull request as ready for review May 22, 2024 16:20
@vamossagar12
Copy link
Collaborator Author

@C0urante , would you have some time to look at this small PR?

@C0urante
Copy link
Contributor

C0urante commented Jun 4, 2024

Even under heavy load, ten seconds is an extremely long time for cluster-wide logging changes to take effect. I think something else may be wrong.

It looks like the startup check for distributed workers could be insufficient. By default, we wait to see if a worker's REST API is initialized, which is done by querying the /connectors endpoint (see here). However, as was noted in #15249, that check barely does anything aside from ensure that a worker has a valid config and has initialized its REST server. Is it possible that the failures you've seen were caused because workers in the cluster were still starting by the time we issued the logging level adjustment request and waited for it to take effect?

If so, we can try first to change the startup mode for this test from STARTUP_MODE_LISTEN (the default) to STARTUP_MODE_JOIN, which should give stronger guarantees about worker readiness. And as a follow-up, there's KIP-1017, which can be used in situations like this to avoid having to use hacks like parsing log files or checking for nonexistent connectors to determine a worker's health and readiness.

@C0urante C0urante added the tests Test fixes (including flaky tests) label Jun 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
connect tests Test fixes (including flaky tests)
Projects
None yet
2 participants