Increasing timeout for wait_for_loggers to 30s #15915

vamossagar12 · 2024-05-10T08:56:51Z

Noticed that the system test test_dynamic_logging seems to fail when the entire test suite is run. When I analysed the logs (attached: 436.tgz), I didn't find anything untoward, so I am guessing the failure only shows up when all tests are run together. Moreover, when I run just that one test locally, it passes.

This PR increases the timeout to 30s. With that change, all the system tests passed (logs attached: 138.tgz).

vamossagar12 · 2024-05-22T16:31:44Z

@C0urante, would you have some time to look at this small PR?

C0urante · 2024-06-04T17:52:00Z

Even under heavy load, ten seconds is an extremely long time for cluster-wide logging changes to take effect. I think something else may be wrong.

It looks like the startup check for distributed workers could be insufficient. By default, we wait to see if a worker's REST API is initialized, which is done by querying the /connectors endpoint (see here). However, as was noted in #15249, that check barely does anything aside from ensuring that a worker has a valid config and has initialized its REST server. Is it possible that the failures you've seen occurred because workers in the cluster were still starting up when we issued the logging level adjustment request and waited for it to take effect?

If so, we can try first to change the startup mode for this test from STARTUP_MODE_LISTEN (the default) to STARTUP_MODE_JOIN, which should give stronger guarantees about worker readiness. And as a follow-up, there's KIP-1017, which can be used in situations like this to avoid having to use hacks like parsing log files or checking for nonexistent connectors to determine a worker's health and readiness.
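To illustrate the suspected race, here is a minimal sketch with a simulated clock. It does not use kafkatest or ducktape: `FakeWorker`, `first_ready_tick`, and the tick values are all made up for illustration, and the two checks only stand in for what a listen-style and a join-style startup check might observe.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class FakeWorker:
    """Hypothetical stand-in for a Connect worker; not a kafkatest class."""
    rest_ready_tick: int  # tick at which the REST server starts answering
    joined_tick: int      # tick at which the worker has joined the cluster

    def rest_listening(self, tick: int) -> bool:
        # Roughly what a listen-style check sees: the REST server responds.
        return tick >= self.rest_ready_tick

    def joined_cluster(self, tick: int) -> bool:
        # Roughly what a join-style check sees: the worker is in the cluster.
        return tick >= self.joined_tick


def first_ready_tick(
    workers: List[FakeWorker],
    check: Callable[[FakeWorker, int], bool],
    max_ticks: int = 100,
) -> Optional[int]:
    """Return the first tick at which `check` passes for every worker, else None."""
    for tick in range(max_ticks):
        if all(check(worker, tick) for worker in workers):
            return tick
    return None


# Assumed timings: each REST server comes up well before the worker joins.
workers = [
    FakeWorker(rest_ready_tick=2, joined_tick=7),
    FakeWorker(rest_ready_tick=3, joined_tick=9),
]

listen_ready = first_ready_tick(workers, FakeWorker.rest_listening)
join_ready = first_ready_tick(workers, FakeWorker.joined_cluster)

# Any request issued in this window reaches workers that are still starting.
unsafe_window = join_ready - listen_ready
```

Under these assumed timings, a test that proceeds as soon as the listen-style check passes would send its logging level request several ticks before every worker has joined, so part of the test's timeout would be spent waiting on startup rather than on the logging change itself.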

Increasing timeout for wait_for_loggers to 30s

2bdf1f7

vamossagar12 marked this pull request as ready for review May 22, 2024 16:20

vamossagar12 added 2 commits May 22, 2024 21:59

Updating comment

3fb535c

Correcting grammatical mistake in the comment.

0e0ca16

vamossagar12 added the connect label May 22, 2024

C0urante added the tests Test fixes (including flaky tests) label Jun 4, 2024
